Friday, July 4, 2025

Improving Unreal Engine's CPU Skinning Performance

I've found some time to update the blog again. This time I want to show you how you can improve Unreal Engine's CPU skinning with just a few changes. With these few changes, CPU skinning can run up to 2.6 times faster, at least on my local machine. The changes target better utilization of the CPU, including the SIMD unit and the different CPU cores, as well as removal of unnecessary operations. The first and second sections are more of an intro to skinning. If you just want to know about the performance, jump straight to the Unreal CPU Skinning section.

I uploaded the changes to a GitHub repo so you can compare them with the base version. You can find the code here.


I am testing Unreal's default character, the UE Mannequin, which has 68 bones, 23309 vertices for LOD0 and 6733 vertices for LOD1. The changes are done on Unreal 5.6, which was released in late spring 2025.

When to use CPU Skinning?


With the presence of GPUs and their SPMD-friendly architecture, using CPU skinning looks like overkill. The CPU is an all-round processor: it can do everything with its complicated architecture, but that doesn't mean it can be as good as a GPU at vector and matrix processing. Although the CPU is nowhere near as good as the GPU there, it still has components for vector processing, and if they are utilized, some good performance can be squeezed out of CPU skinning.

Considering the fact that the CPU is no match for a GPU at character mesh skinning, when is it good to use CPU skinning then?

There are still cases where CPU skinning makes sense. For instance, if you have a GPU-bound game and want to offload a bit of work from the GPU to the CPU. Or if you have a mobile game with low-res characters on an ARM-based CPU, which probably consumes less energy than a high-voltage GPU, so balancing work between CPU and GPU could be a good option. Another case could be an LOD-based system where distant, lower-resolution characters are skinned on the CPU instead of the GPU, giving a good balance between the two.

So there are still cases where you can use CPU skinning, and hence a good, optimized CPU skinning path can be an asset. In this post I will go through Unreal's CPU skinning and optimize it with only a few changes. The CPU skinning in Unreal is almost complete; it's sometimes unstable, though that didn't prevent me from testing and optimizing the code! Here I only want to show you how to make it run faster with a few simple changes, but before doing that, let me briefly explain what character mesh skinning is.

Mesh Skinning

The process of binding the vertex positions and normals of a mesh to the bones of a skeletal mesh is called skinning. During the skinning process, artists give each vertex a weight of influence for different bones. For instance, a vertex located on the neck of a character can get 50% of its transform influence from the last neck joint, 30% from the first neck joint and 20% from the clavicle joint. So when the character moves its neck bones, the vertex reacts to the transforms and imitates skin being stretched.

Each bone transform comes as a 4x4 3D transformation matrix combining scale, rotation and position, in that order: first the scale of the bone is applied, then the rotation and then the position. This is the standard in all 3D packages; with the position added as the last operation, the matrix moves the object by exactly the position vector stored inside it, which is the most intuitive way of representing a combined 3D transform.
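To see the "position applied last" point concretely, here is a minimal sketch in plain C++ (the toy Mat4 type and Mul helper are mine, not Unreal's): composing scale, then rotation, then translation in a row-vector convention leaves the translation vector sitting untouched in the last row of the combined matrix.

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major, row-vector convention

// C = A * B; with row vectors, v * (A * B) applies A first, then B.
static Mat4 Mul(const Mat4& A, const Mat4& B)
{
    Mat4 C{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            for (int k = 0; k < 4; ++k)
                C[r][c] += A[r][k] * B[k][c];
    return C;
}
```

Because scale and rotation matrices have (0, 0, 0, 1) in their last row, Mul(Mul(S, R), T) copies T's translation row through unchanged, which is why a combined bone matrix directly stores the bone position.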

Now our challenge is to go through each vertex, read which bones influence the vertex and with what weight, build a weighted average of the influencing bones' transformation matrices, and then multiply each vertex position and normal by the resulting matrix.

This looks like good food for SPMD programming, which is why GPU skinning is a win here. With the CPU we can still get decent results if we target characters with low detail, but never expect complicated characters to run well: characters with morph targets and many bones will run very slowly with CPU skinning, so it's best reserved for low-detail characters.

Unreal CPU Skinning

What I explained in the previous section is already implemented in Unreal Engine, but with some small changes we can make it run up to 2.6 times faster. There are 4 key areas we can focus on:

1- Improve vector operations to utilize SIMD instructions.
2- Use multiple cores to spread the vertex position and normal calculations across different CPU cores.
3- Generally simplify the loop.
4- Simplify the matrix average calculation code.

To make a character run with CPU skinning, you can add a test snippet like this in the BeginPlay of your character actor:

void ALyraCpuSkinningTest::BeginPlay()
{
	Super::BeginPlay();
	if (GetMesh() != nullptr)
	{
		GetMesh()->SetCPUSkinningEnabled(true);
	}
}


After turning CPU skinning on, with some debugging we land in the function that does the biggest part of the math to calculate vertex positions and normals:

template <typename VertexType, int32 NumberOfUVs>
static void SkinVertexSection(
	FFinalSkinVertex*& DestVertex,
	TArray<FMorphTargetInfo>& MorphEvalInfos,
	const TArray<float>& MorphWeights,
	const FSkelMeshRenderSection& Section,
	const FSkeletalMeshLODRenderData& LOD,
	FSkinWeightVertexBuffer& WeightBuffer,
	int32 VertexBufferBaseIndex,
	uint32 NumValidMorphs,
	int32& CurBaseVertIdx,
	int32 LODIndex,
	const FMatrix44f* RESTRICT ReferenceToLocal,
	const FClothSimData* ClothSimData,
	float ClothBlendWeight,
	const FMatrix& WorldToLocal,
	const FVector& WorldScaleAbs )

So our focus will be on optimizing this function. Let's go to the four main cases we want to improve.


1- SIMD Utilization


Looking at the code, there is heavy usage of the VectorRegister class, which in Unreal is used for SIMD operations! For instance, code like this loads and accumulates the matrix of an influencing bone:

const FMatrix44f BoneMatrix1 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_1]]];
VectorRegister Weight1 = VectorReplicate( Weights, INFLUENCE_1 );
M00	= VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[0][0] ), Weight1, M00 );
M10	= VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[1][0] ), Weight1, M10 );
M20	= VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[2][0] ), Weight1, M20 );
M30	= VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[3][0] ), Weight1, M30 );

Or if you scroll down in the same function, you can find code like this:

VectorRegister N_xxxx = VectorReplicate( SrcNormals[0], 0 );
VectorRegister N_yyyy = VectorReplicate( SrcNormals[0], 1 );
VectorRegister N_zzzz = VectorReplicate( SrcNormals[0], 2 );
DstNormals[0] = VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiplyAdd( N_zzzz, M20, M30 ) ) );

DstNormals[1] = VectorZero();
N_xxxx = VectorReplicate( SrcNormals[1], 0 );
N_yyyy = VectorReplicate( SrcNormals[1], 1 );
N_zzzz = VectorReplicate( SrcNormals[1], 2 );
DstNormals[1] = VectorNormalize(VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiply( N_zzzz, M20 ) ) ));

N_xxxx = VectorReplicate( SrcNormals[2], 0 );
N_yyyy = VectorReplicate( SrcNormals[2], 1 );
N_zzzz = VectorReplicate( SrcNormals[2], 2 );
DstNormals[2] = VectorZero();
DstNormals[2] = VectorNormalize(VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiply( N_zzzz, M20 ) ) ));

// carry over the W component (sign of basis determinant) 
DstNormals[2] = VectorMultiplyAdd( VECTOR_0001, SrcNormals[2], DstNormals[2] );


The code above is the multiplication of the averaged matrix with the vertex position and normals.

The good thing about this code is that the bone transform matrices are stored transposed. Having a transposed matrix is good for two reasons. First, if you want to read the position vector, you can do it with only one load, because the position is placed in Matrix[3][0], [3][1], [3][2]; with a non-transposed matrix you would need three loads, as the position sits in Matrix[0][3], [1][3], [2][3]. Second, a transposed matrix lets us do the vector/matrix multiplication with fewer SIMD operations, and with operations that have higher throughput. This is mainly because we won't need to reduce the values in one xmm/ymm/zmm register. By reduction I mean an operation over all the SIMD lanes of a register that collapses them into a scalar. For instance, without the transposed matrix you would need a reduce_add for each element of the vertex position: adding xmm[0] + xmm[1] + xmm[2] + xmm[3] to get one element of the final position, which takes several shuffle and add instructions. These extra steps are avoided by keeping the bone transform matrices transposed, and Unreal does exactly that, so no change is needed to the matrix data layout.
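The replicate-and-multiply-add pattern can be illustrated with a toy 4-lane "register" standing in for VectorRegister4f (a sketch in plain C++, my own names; real code would use __m128 intrinsics): with the matrix rows kept whole and translation in the last row, the transform needs no horizontal reduction at all.

```cpp
#include <array>

// A toy 4-lane "register" standing in for VectorRegister4f / __m128.
using Vec4 = std::array<float, 4>;

static Vec4 Replicate(const Vec4& V, int Lane)  // shufps-style lane splat
{
    return { V[Lane], V[Lane], V[Lane], V[Lane] };
}

static Vec4 MultiplyAdd(const Vec4& A, const Vec4& B, const Vec4& C)  // fmadd-style
{
    return { A[0]*B[0] + C[0], A[1]*B[1] + C[1], A[2]*B[2] + C[2], A[3]*B[3] + C[3] };
}

// With the matrix rows M0..M3 kept whole in registers (translation in M3),
// a point transform is three splats and three multiply-adds -- no shuffle+add
// chain is ever needed to collapse a register down to a scalar.
static Vec4 TransformPoint(const Vec4& P, const Vec4& M0, const Vec4& M1,
                           const Vec4& M2, const Vec4& M3)
{
    Vec4 X = Replicate(P, 0), Y = Replicate(P, 1), Z = Replicate(P, 2);
    return MultiplyAdd(X, M0, MultiplyAdd(Y, M1, MultiplyAdd(Z, M2, M3)));
}
```

This is exactly the shape of the engine's N_xxxx/N_yyyy/N_zzzz code above: one VectorReplicate per component and a chain of VectorMultiplyAdd calls.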

Now the bad thing about the code above! The data involved, vertex positions, bone matrices and morph target deltas, is all in modeling space (component space in Unreal vocabulary), and it is all single-precision floating point. So far so good, but if you look at all the calls to VectorRegister, which represents SIMD registers like __m128, you will find that the data type used in VectorRegister is double-precision floating point!

So every line above pays the penalty of converting a register of 4 floats into two registers of 2 doubles each. This conversion is expensive and totally unnecessary, since all the data is 32-bit single-precision float.

It's good to know that in Unreal PC builds all the SIMD VectorRegister classes are compiled for SSE4 by default, for backward compatibility, while on consoles like XSX or PS5 they are compiled for AVX. So the penalty of this conversion is higher on PC than on consoles, but it is still costly there. Generally, converting between SIMD vectors of doubles and floats is costly compared to scalar data. Just changing VectorRegister to VectorRegister4f brought around 0.5 ms of improvement for my test case with LOD0 of the character.

2- Use multiple cores to run the vertex position and normal calculations on different CPU cores


I am trying CPU skinning on UE5's default character, the Mannequin. The character has over 23K vertices on LOD0, over 6K on LOD1, and 68 bones. To do the CPU skinning we go through all the vertices, compute the average matrix of the influencing bones, and then multiply to calculate each vertex normal and position. That's a lot of instructions! In such a case we can get much better performance if we let batches of vertices be calculated on separate CPU cores.

So here we can use a ParallelFor. ParallelFor is a great way to break down long loops with heavy bodies so they run faster. However, to use a ParallelFor we need to be sure the code in the loop body is thread-safe: the loop body at one index must have no data dependency on the other indices of the loop. That is how we judge whether turning the loop into a ParallelFor is safe.

Now, looking at the function SkinVertexSection which we are optimizing, the main for loop starts just a few lines after the function's definition. This is very good, because any non-reference/non-pointer variable defined inside the loop body is automatically thread-safe: it's local and only accessed by that iteration.

To check whether the other variables are also thread-safe, we can look at how many of the variables in the surrounding scope of the for loop are non-const. When data is const, it's read-only, and since the data being passed to the ParallelFor is not changed from outside during its execution, any read-only data is naturally thread-safe: multiple threads can read from read-only data without worrying about thread safety.

By turning all the data in the surrounding scope of the loop into const, the compiler tells me, via errors about writing to const data, exactly which variables are changed in the loop body. It turns out there are three, so they are not thread-safe. All three come from the input parameters of SkinVertexSection:

FFinalSkinVertex*& DestVertex,
TArray<FMorphTargetInfo>& MorphEvalInfos,
int32& CurBaseVertIdx

DestVertex is the destination vertex that receives the updated position and normals. This data is naturally thread-safe in a ParallelFor, because each iteration updates only its own vertex and each vertex is independent of the others: no iteration reads or writes another's data.

CurBaseVertIdx is an output value, but the good thing about it is that it's not read inside the loop body. We just need to be sure it holds a valid value as output, so we can take it out of the loop body and compute it outside the loop with one operation.

TArray<FMorphTargetInfo>& MorphEvalInfos has a data member called NextDeltaIndex. This index is used as an optimization while blending morph targets: it is updated during the loop, creating a dependency between consecutive vertices. So applying morph targets is not thread-safe; they must be calculated serially due to the data dependency introduced by the NextDeltaIndex optimization. To fix this, I extract the morph target calculations out of the ParallelFor loop and apply them in serial code.

The rest of the code is thread-safe except one static local variable; turning it into a non-static local guarantees thread safety.

To minimize the overhead of the threaded work, I also unrolled the loop by an experimentally chosen factor. This contributed to the efficiency of the ParallelFor and brought me around 0.04 ms on LOD0.
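The batching idea behind ParallelFor can be sketched portably with std::thread (a minimal stand-in, my own names, not Unreal's scheduler): split the vertex range into contiguous batches, run each batch on its own thread, and rely on each output slot being written by exactly one batch while all shared inputs stay read-only.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Portable stand-in for the ParallelFor pattern: split the index range into
// contiguous batches and run each batch on its own thread. This is safe only
// because each index writes exactly one output slot and all shared inputs are
// read-only for the duration of the loop.
template <typename Fn>
static void ParallelForRange(int Num, int NumThreads, Fn&& Body)
{
    std::vector<std::thread> Workers;
    const int Batch = (Num + NumThreads - 1) / NumThreads;
    for (int t = 0; t < NumThreads; ++t)
    {
        const int Begin = t * Batch;
        const int End = std::min(Num, Begin + Batch);
        if (Begin >= End) break;
        Workers.emplace_back([=, &Body] {
            for (int i = Begin; i < End; ++i)
                Body(i);  // each index touches only its own output slot
        });
    }
    for (std::thread& W : Workers)
        W.join();
}
```

With this shape, the parallel result must match the serial loop exactly, which is a handy sanity check when porting a loop like SkinVertexSection's.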

The multi-core optimization made the code run up to 2.2 times faster and was the biggest contributor to the overall speedup.


3- Simplifying the Loop

So far there have been two big improvements: SIMD utilization and multi-core utilization. However, there are a few other small things we can improve.

1- Removing unnecessary ifs: There are two ifs in the for loop whose conditions never change during the loop but are checked on every iteration. One checks NumValidMorphs to see if there are any morph targets, and one checks whether there is any cloth simulation. To avoid evaluating these two ifs inside the loop body, we can hoist the checks before the loop and select between specialized versions of the loop. Such a change brought me almost 0.02 ms of improvement on LOD0. Generally, if the iteration count were small I wouldn't change the setup, as CPU branch prediction handles such loops pretty well, but with this number of vertices, removing any unnecessary CPU operation is a win.
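The hoisting idea in miniature (a sketch with hypothetical stand-in lambdas, not the engine's actual skinning code): test the loop-invariant flags once and dispatch to a specialized loop, instead of re-testing them per vertex.

```cpp
#include <vector>

// Instead of re-testing loop-invariant flags for every vertex, hoist the
// checks and select a specialized loop body once.
static void SkinAll(std::vector<float>& Verts, bool bHasMorphs, bool bHasCloth)
{
    auto SkinOne  = [](float& V) { V += 1.0f;  };  // stand-in for the skinning math
    auto AddMorph = [](float& V) { V += 0.25f; };  // stand-in for morph blending
    auto AddCloth = [](float& V) { V += 0.5f;  };  // stand-in for cloth blending

    if (bHasMorphs && bHasCloth)
        for (float& V : Verts) { SkinOne(V); AddMorph(V); AddCloth(V); }
    else if (bHasMorphs)
        for (float& V : Verts) { SkinOne(V); AddMorph(V); }
    else if (bHasCloth)
        for (float& V : Verts) { SkinOne(V); AddCloth(V); }
    else
        for (float& V : Verts) { SkinOne(V); }
}
```

The cost is some code duplication per variant, which is why it only pays off for hot loops over many vertices.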

2- Removing some unnecessary prefetches: In the function "SkinVertices()", which is the caller of SkinVertexSection(), there is a loop like this:

// Prefetch all matrices
for ( uint32 MatrixIndex=0; MatrixIndex < MaxGPUSkinBones; MatrixIndex+=2 )
{
	FPlatformMisc::Prefetch( ReferenceToLocal + MatrixIndex );
}

This code prefetches around 65536 matrices! It just spams the cache with memory locations that are not needed, while running tens of thousands of iterations. It's a loop that only does harm; by removing it, I gained 0.05 ms.

It's worth saying that modern CPUs have very good hardware prefetching, and software prefetching can do more harm than good!

3- Removing the UV calculations: There are some UV calculations whose results are not used or referenced anywhere. They are computed in this loop and then overwritten by other code:

for (uint32 j = 0; j < VertexType::NumTexCoords; j++)
{
    SrcSoftVertex.UVs[j] = LOD.StaticVertexBuffers.StaticMeshVertexBuffer.GetVertexUV_Typed<VertexType::StaticMeshVertexUVType>(VertexBufferIndex, j);
}

Removing the loop above helps performance, as its result is not used anywhere: the UVs are simply copied at the end from the static vertex buffers, not from any skinned vertex, so this work is redundant.


4- Simplifying the matrix average calculation code: The matrix average calculation looks like this:

const FMatrix44f BoneMatrix0 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_0]]];
VectorRegister4f Weight0 = VectorReplicate(Weights, INFLUENCE_0);
VectorRegister4f M00 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[0][0]), Weight0);
VectorRegister4f M10 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[1][0]), Weight0);
VectorRegister4f M20 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[2][0]), Weight0);
VectorRegister4f M30 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[3][0]), Weight0);

if (MaxSectionBoneInfluences > 1)
{
	const FMatrix44f BoneMatrix1 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_1]]];
	VectorRegister4f Weight1 = VectorReplicate(Weights, INFLUENCE_1);
	M00 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[0][0]), Weight1, M00);
	M10 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[1][0]), Weight1, M10);
	M20 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[2][0]), Weight1, M20);
	M30 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[3][0]), Weight1, M30);

	if (MaxSectionBoneInfluences > 2)
	{
		const FMatrix44f BoneMatrix2 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_2]]];
		VectorRegister4f Weight2 = VectorReplicate(Weights, INFLUENCE_2);
		M00 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[0][0]), Weight2, M00);
		M10 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[1][0]), Weight2, M10);
		M20 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[2][0]), Weight2, M20);
		M30 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[3][0]), Weight2, M30);

		if (MaxSectionBoneInfluences > 3)
		{
			...
		}
	}
}

This code continues nesting until MaxSectionBoneInfluences goes over 11. It's not readable! It's also not scalable. FMatrix44f contains 16 floats, so the matrix average could be calculated much faster on a CPU with AVX512, where all 16 single-precision floats fit in one zmm register; with AVX, 8 floats can be loaded at once. The code above only targets SSE4, which is 4 floats per SIMD register. So in theory the code above could run 4 times slower than an AVX512 version and 2 times slower than an AVX version. Of course this is just theory; in practice it should be measured, as AVX and AVX512 sometimes perform worse than SSE for simple cases. But either way, the code is neither readable nor scalable. I think we can agree on that!

To fix this, I replaced this part of the code with a simple ISPC kernel which does the same in only a few lines. The kernel also scales with the local CPU's support for AVX or AVX512:

inline export void CpuSkinningGetAveragedMatrix(const uniform uint16 NumBoneWeights, const uniform uint16 BoneWeights[], const uniform uint16 BoneMap[], const uniform uint16 BoneIndices[],
	const uniform FMatrix44f ReferenceToLocal[],
	uniform float OutBoneWeights[], 
	uniform float OutMatrix[])
{
	foreach(I = 0 ... NumBoneWeights)
	{
		OutBoneWeights[I] = BoneWeights[I] * Inv_65535;
	}

	const uniform FMatrix44f* uniform CurrentBoneMatrix = &ReferenceToLocal[BoneMap[BoneIndices[0]]];
	uniform float CurrentBoneWeight = OutBoneWeights[0];
	foreach(MatrixElement = 0 ... 16)
	{
		OutMatrix[MatrixElement] = CurrentBoneMatrix->M[MatrixElement] * CurrentBoneWeight;
	}

	for (uniform int I = 1; I < NumBoneWeights; I++)
	{
		CurrentBoneWeight = OutBoneWeights[I];
		if (CurrentBoneWeight > FLOAT_SMALL_NUMBER)
		{
			CurrentBoneMatrix = &ReferenceToLocal[BoneMap[BoneIndices[I]]];
			foreach(MatrixElement = 0 ... 16)
			{
				OutMatrix[MatrixElement] += CurrentBoneMatrix->M[MatrixElement] * CurrentBoneWeight;
			}
		}
		else
		{
			break;
		}
	}
}

Just note that calling ISPC code inside a large loop can carry a small performance penalty, because functions exported from ISPC can't be inlined. It's usually better for ISPC functions to contain the main loop themselves rather than be called from a large loop, but here the main goals were scalability and readability. Performance didn't change much by using ISPC here.

Results


So here are the results. To recap the test character: the UE Mannequin has 68 bones in total, 23309 vertices for LOD0 and 6733 vertices for LOD1. Unfortunately I couldn't test LOD2 and LOD3 due to a memory crash Unreal's CPU skinning hits on the lower LODs!

Tested on an Intel Core i5-12400F.

The results before optimization:

LOD0: 2.8 ms
LOD1: 0.95 ms

After optimization:

LOD0: 1.07 ms
LOD1: 0.42 ms

Overall, that's around 2.6 times faster on LOD0 and more than 2 times faster on LOD1.
