Tuesday, December 26, 2023

Parallel Traversal of a Skeletal Animation Tree

Intro

In this post, I want to show a way to iterate a skeletal animation tree in a faster way but before jumping to the actual way of showing this, I will review a few topics like skeletal animation and tree node indexing  as a prelude to the topic.


Skeletal Animation Representation


Skeletal animation is represented by a tree data structure usually called skeleton though the name can be different based on the animation platform you are using. For instance in OGRE3D or Unreal Engine is called skeleton. For the sake of simplicity, I also call it skeleton in this post.

Skeleton tree is an abstract representation of a body of a human or an animal. This abstract representation can describe the relationship between the bones exactly like the way a human or an animal body is. For instance the elbow joint is connected to upper arm. When upper arm moves the elbow is taking the upper arms motion so we can say the elbow is the follower of upper arm. Such relationship can be defined by a parent-child schema and a tree data structure is a very good way to show a set of parent-child relationship for all the bones of the body. That's why a skeleton in animation is literally a tree structure!

For the skeletal animation, each node of the tree is called bone or sometimes joint. I will call them bones in this post to unify the vocabulary. Bones hold a 3D transformation which are passed to the skinning modifiers to move the mesh's vertices and animate the mesh of the characters.


Indexing the tree nodes

Now that we have a tree representing the skeleton, we should be able to traverse the tree and process the nodes upon any need. So the question here is that which method of tree indexing you need to take?That's a very important question for an animation system to answer because it has an influence on the performance of an animation system. The most common ways of indexing tree nodes are pre-order, post-order, in-order and level-order. I haven't seen post-order and in-order anywhere being used in any animation system I worked on and I don't see it being beneficial to be used for traversing a skeleton but level-order and pre-order are common to be used and they work well for animation. To select a good tree indexing method you need to have these points in mind:

1- How to specify each bone's parent?

2- How to specify each bone's children?

3- For world/modeling space calculation of the bones where you need to access the parent or children of each bone, there should be small amount of jumps in memory.

4- Does the tree indexing method allow you to process the tree in parallel?


Now let's just focus on pre-order and level-order tree traversal and decide which one to pick for our tree indexing method. Before going forward, take a look at these two photos showing tree indexing for pre-order and level-order tree traversal:





The tree indices shows the order of processing of the tree nodes (bones). In case of animation, we are most likely processing their 3D transformation so a fairly simple linear algebra operations!

Before comparing pre-order and level-order, I would like to point out that I'm forming the whole tree nodes in a flat array of transforms, where each array index represents the bone index. For instance, array element at index 10 represents the bone with index 10 and the element is just holding position, rotation or scale. When processing the animation data, most of the times we don't even care for the hierarchy. For instance when blending between two poses in local(parent) space, we just need to read the transforms of the same bone indices between two transform array and blend them together so no hierarchy checks are involved. This was said to explain why we just keep a flat array of transforms to represent a skeleton tree. This way of representing a tree is different than what is usually shown for tree structures in academic books. In academic materials, it's very often to make trees using linked lists, connecting the nodes with newly allocated nodes through pointers. This way of showing a skeleton tree won't work well for a real-time animation systems due to lower degree of locality of references. Real-time animation systems should run efficient and fast, not wasting unnecessary CPU cycles. So in order to have the data representing the hierarchy, we define other flat arrays showing the children and parents of each tree node. We can refer to these arrays whenever we need to have access to the hierarchy data. This array's elements keep the data of parent and children of the bone at the current bone index. For instance element 10 of these arrays keep the parent index and children indices of the bone with index 10.


Now let's get back to the 4 points above to compare pre-order and level-order tree traversal methods.

1- How to specify each bones's parent? For both the level-order and the pre-order, the parent index can be just defined by one index in our skeleton tree hierarchy data. So both methods are equal in this case.


2- How to specify each bone's children? Level-order absolutely wins this as specifying children of each bone can be defined by only two indices. One showing first child index and one last child. Any other children has an index between first and last child. If you look at the image of level-order indexing at the beginning of the post, the children of each bone are arranged sequentially. This is not true for the pre-order. To specify the children of each bone in pre-order tree indexing, we need to define an array of indices showing the children of each node. This brings some difficulty and extra data to handle for each node. Also this means accessing the children nodes ends up with possible memory gather and scatters.


3- For world/modeling space calculation of the bones where you need to access the parent or children of each bone, there should be small amount of jumps in memory: 

When using pre-order, there is a notable chance that the parent bone of each bone is the bone with  the previous index especially for the bone structures like spine, neck and finger joints so the locality of references for some bone structures like spine, neck and finger joints is high. This is not necessarily true for the level order as the bones are indexed based on the level so memory location differences are more often between parent and child bones. We can say accessing the parents in pre-order in average has a higher degree of locality of references compared to level order.

On the reverse, in the level-order traversal, the children of each bone are always next to each other in memory so accessing children of each bone should be done without memory jumps. This is not true in pre-order as the children are scattered around. For processing the whole skeleton tree in cases like world/modeling space calculations or branch filtering, you need to have access to both parent and children and level-order works better in such cases in average.

4- Does the tree indexing method allow you to process the skeleton in parallel? Level-Order does! I will explain this in the next sections.

Level-Order Traversal Representation

So with the reasoning done in the past section, I pick level-order for the skeleton tree traversal method. I define the skeleton tree with three flat array with element numbers equal to the skeleton bone count (BonePositions, BoneRotations and BoneScales).

I also define the skeleton tree hierarchy data as structure of array called BoneIndexDataSOA. Each array element show the parent, first child or last child of the bone. If the bone has no child, value of -1 is used.



struct BoneIndexDataSOA
{
std::vector<int> Parents;
std::vector<int> ChildrenStart;
std::vector<int> ChildrenEnd;
};
//
__declspec(align(16))
struct BoneData
{
float X = 0.f;
    float Y = 0.f;
    float Z = 0.f;
    float W = 0.f;

inline BoneData operator+(const BoneData Other)
{
BoneData Ret;
Ret.X = X + Other.X;
Ret.Y = Y + Other.Y;
Ret.Z = Z + Other.Z;
Ret.W = W + Other.W;
return Ret;
}
};

//
std::vector<BoneData> BonePositions;
std::vector<BoneData> BoneRotations;
std::vector<BoneData> BoneScales;
BoneIndexDataSOA BonesMapSOA;


Parallel Traversal of a Skeleton Tree 

We already picked level-order skeleton tree indexing and defined the structures to represent it. Now what if we boost the tree traversal method and process the bones in parallel? When we traverse the tree, we do some operation on each tree node. For instance for calculating the modeling space transformation of a skeleton, the process on each bone is the multiplication of the bone transformation and its parent. So now we want to see if we can process these operations in parallel. By parallel processing it means we process more than one bone at a time.

Such a thing is possible with SPMD programming and level-order traversal is a good suit for it. To handle the SPMD programming I used Intel ISPC compiler. It is an implicit SPMD programming compiler but a very powerful one!

We're going to parallel process the children of each bone. Since the children of each bone are located sequentially in the memory, we can utilize the usage of ISPC as ISPC eventually map the program indices into SIMD registers (as it's an implicit SPMD). So sequential memory means easier vector load/store in which ISPC is great at.

Below I defined three functions to compare the performance with. Two are traversing the skeleton using MSVC with serial code execution and one using ISPC with parallel code execution. Our operation on each bone is to add the position of each bone with its parent. A fairly simple operation!



//Calculating the sum of each bone position with its parent using level-order traversal
void LevelOrderSum()
{
for (int I = 0; I < BonePositions.size(); I++)
{
if (BonesMapSOA.ChildrenStart[I] != -1)
{
for (int J = BonesMapSOA.ChildrenStart[I]; J <= BonesMapSOA.ChildrenEnd[I]; J++)
{
BonePositions[J] = BonePositions[J] + BonePositions[I];
}
}
}
}

// Traversing the tree in a faster way only by querying the parent.
// This runs faster than LevelOrderSum due to less condition checks.
// It's a tree traversal-agnostic method.
// It just needs the access to the parent index however it's not possible to use this method-
// for implicit SPMD programming.
void RegularTraverseSum()
{
for (int I = 1; I < BonePositions.size(); I++)
{
BonePositions[I] = BonePositions[I] + BonePositions[BonesMapSOA.Parents[I]];
}
}

// ISPC function. Parallel processing of the children of each bone using-
// implicit SPMD programming by utilizing the SIMD commands and registers.
export inline void ParallelLevelOrder(const uniform int InChildrenStart[], const uniform int InChildrenEnd[], uniform float OutData[], const uniform int BonesCount)
{
for (uniform int I = 0; I < BonesCount; I++)
{
if (InChildrenStart[I] != -1)
{
foreach(J = InChildrenStart[I]...InChildrenEnd[I] + 1)
{
OutData[J] += OutData[I];
}
}
}
}

Now I compare the time each function takes to run. Note that all the functions provide the same output and only the execution time is different.

For comparison I defined a tree of 40000 bones and processed it 100 times. Here are the results:


LevelOrderSum: 6.3 milliseconds
RegularTraverseSum: 4.374 milliseconds
ParallelLevelOrder: 1.795 milliseconds

As seen above, the parallel processing using ISPC compiler runs much faster than the serial C++ codes however the comparison is not yet fair!

When I defined the + operator on BoneData struct, I didn't vectorize the + operation. In a game/graphics engine,  all these vector operations are using SIMD commands so it's not fair not to use the SIMD commands for our comparison. To make the results closer to reality, I change the BoneData struct like this:



struct BoneData
{
__m128 Data = _mm_setzero_ps();
inline BoneData operator+(const BoneData Other)
{
BoneData Ret;
Ret.Data = _mm_add_ps(Data, Other.Data);
return Ret;
}
};


So now we can check the results in a more fair comparison:

LevelOrderSum: 3.93 milliseconds
RegularTraverseSum: 2.65 milliseconds
parallel_level_order: 1.795 milliseconds

The first two functions improved a lot almost twice better but still not as good as the result with parallel processing. In a  nutshell, ISPC utilized the usage of SIMD registers and since my local PC is supporting AVX2, it serialized each bone's children positions in ymm registers and at least processes two vectors at the same time while the + operator above only processes one vector. This degree of parallelism can grow even more if your CPU supports AVX 512!

Why Pre-Order Is Not Suitable for Parallel Skeleton Processing?

There were two reasons that we could parallel process the children of each bone in the skeleton efficiently, one is that the children are sequentially allocated in memory and second is that when we are writing to the children transforms, the parent transform is already written and we just read it. So we can say there is a sequence point between reads and writes here but this is not true if we want to parallel process the bones using pre-order indexing! In pre-order, when we serialize the sequential bone indices to be loaded into the SIMD registers, there is a big chance we are processing the parent and child transforms at the same time which means we are reading the parent transform to calculate the child transform while the parent is being written itself so there can be race over the data.

However we can traverse the pre-order indexed skeleton in a different way by introducing memory gather and scatters but that might not end up efficient so I skip talking about that here.


Is Parallel Processing a Skeleton Tree Always a Win?

Such win in parallel processing of a skeleton usually shows up if you have characters with many bones. Especially characters with loads of muscle correctives and facial joints. It might not be a win for the characters with simple hierarchies including only big muscles!