I've found some time to update the blog again. This time I want to show you how to improve Unreal Engine's CPU skinning with just a few changes. With these few changes, CPU skinning can run up to 2.6 times faster, at least on my local machine. The changes target better utilization of the CPU, including the SIMD unit and the different CPU cores, and remove unnecessary operations. The first two sections are more of an intro to skinning; if you only care about the performance work, jump straight to the Unreal CPU Skinning section.
I uploaded the changes to a GitHub repo so you can compare them against the base version. You can find the code here.
I am testing Unreal's default character, the UE Mannequin, which has 68 bones, 23309 vertices for LOD0 and 6733 vertices for LOD1. The changes were made on Unreal 5.6, which was released in late spring 2025.
When to use CPU Skinning?
Considering that a CPU is no match for a GPU at character mesh skinning, when is it actually a good idea to use CPU skinning?
There are still cases where CPU skinning is the right choice, so a well-optimized CPU skinning path can be a real asset. In this post I go through Unreal's CPU skinning and try to optimize it with only a few changes. Unreal's CPU skinning is almost complete; it is sometimes unstable, but that didn't stop me from testing and optimizing the code. Here I only want to show how a few simple changes make it run faster, but before doing that, let me briefly explain what character mesh skinning is.
Mesh Skinning
Our challenge is to go through each vertex, read which bones influence that vertex and with what weights, build a weighted average of the influencing bones' transformation matrices, and then multiply the resulting matrix with the vertex position and normal.
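To make that concrete, here is a minimal sketch of linear blend skinning for a single vertex, written with my own simplified types rather than the engine's (illustration only):

// A minimal sketch of linear blend skinning for one vertex (simplified types,
// not the engine code): blend the influencing bone matrices by their weights,
// then transform the reference-pose position with the blended matrix.
#include <cstdint>

struct Float3   { float X, Y, Z; };
struct Matrix44 { float M[4][4]; };           // row-major, row-vector convention

struct VertexInfluences
{
	uint16_t BoneIndices[4];                  // up to 4 influences in this sketch
	float    BoneWeights[4];                  // weights sum to 1
	int      NumInfluences;
};

static Float3 TransformPoint(const Matrix44& M, const Float3& P)
{
	return {
		P.X * M.M[0][0] + P.Y * M.M[1][0] + P.Z * M.M[2][0] + M.M[3][0],
		P.X * M.M[0][1] + P.Y * M.M[1][1] + P.Z * M.M[2][1] + M.M[3][1],
		P.X * M.M[0][2] + P.Y * M.M[1][2] + P.Z * M.M[2][2] + M.M[3][2] };
}

Float3 SkinPosition(const Float3& RefPosition,
                    const VertexInfluences& Influences,
                    const Matrix44* BoneMatrices)     // reference-pose-to-current-pose matrices
{
	// Weighted average of the influencing bone matrices.
	Matrix44 Blended = {};
	for (int i = 0; i < Influences.NumInfluences; ++i)
	{
		const Matrix44& Bone = BoneMatrices[Influences.BoneIndices[i]];
		const float W = Influences.BoneWeights[i];
		for (int r = 0; r < 4; ++r)
			for (int c = 0; c < 4; ++c)
				Blended.M[r][c] += Bone.M[r][c] * W;
	}
	// One matrix transform per vertex; normals use the same blended matrix but
	// ignore the translation row.
	return TransformPoint(Blended, RefPosition);
}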
This looks like good food for SPMD programming, which is why GPU skinning wins here. Still, the CPU can deliver decent results if we target low-detail characters. Just never expect complicated characters to run well with CPU skinning: characters with morph targets and many bones will run very slowly. CPU skinning is best reserved for low-detail characters.
Unreal CPU Skinning
To make a character run with CPU skinning, you can add test code like this in the BeginPlay of your character actor:
void ALyraCpuSkinningTest::BeginPlay()
{
	Super::BeginPlay();

	if (GetMesh() != nullptr)
	{
		GetMesh()->SetCPUSkinningEnabled(true);
	}
}
After turning CPU skinning on, a bit of debugging leads us to the function below, which does the bulk of the math that calculates the vertex positions and normals:
template <typename VertexType, int32 NumberOfUVs>
static void SkinVertexSection(
	FFinalSkinVertex*& DestVertex,
	TArray<FMorphTargetInfo>& MorphEvalInfos,
	const TArray<float>& MorphWeights,
	const FSkelMeshRenderSection& Section,
	const FSkeletalMeshLODRenderData& LOD,
	FSkinWeightVertexBuffer& WeightBuffer,
	int32 VertexBufferBaseIndex,
	uint32 NumValidMorphs,
	int32& CurBaseVertIdx,
	int32 LODIndex,
	const FMatrix44f* RESTRICT ReferenceToLocal,
	const FClothSimulData* ClothSimData,
	float ClothBlendWeight,
	const FMatrix& WorldToLocal,
	const FVector& WorldScaleAbs )
So our focus will be on optimizing this function. Let's go through the main cases we want to improve.
1- SIMD Utilization: improve the vector operations to make better use of the SIMD instructions. Inside the function you find blocks like this:
const FMatrix44f BoneMatrix1 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_1]]];
VectorRegister Weight1 = VectorReplicate( Weights, INFLUENCE_1 );
M00 = VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[0][0] ), Weight1, M00 );
M10 = VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[1][0] ), Weight1, M10 );
M20 = VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[2][0] ), Weight1, M20 );
M30 = VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[3][0] ), Weight1, M30 );
Or, if you scroll further down in the same function, you find code like this:
VectorRegister N_xxxx = VectorReplicate( SrcNormals[0], 0 );
VectorRegister N_yyyy = VectorReplicate( SrcNormals[0], 1 );
VectorRegister N_zzzz = VectorReplicate( SrcNormals[0], 2 );
DstNormals[0] = VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiplyAdd( N_zzzz, M20, M30 ) ) );
DstNormals[1] = VectorZero();
N_xxxx = VectorReplicate( SrcNormals[1], 0 );
N_yyyy = VectorReplicate( SrcNormals[1], 1 );
N_zzzz = VectorReplicate( SrcNormals[1], 2 );
DstNormals[1] = VectorNormalize(VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiply( N_zzzz, M20 ) ) ));
N_xxxx = VectorReplicate( SrcNormals[2], 0 );
N_yyyy = VectorReplicate( SrcNormals[2], 1 );
N_zzzz = VectorReplicate( SrcNormals[2], 2 );
DstNormals[2] = VectorZero();
DstNormals[2] = VectorNormalize(VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiply( N_zzzz, M20 ) ) ));
// carry over the W component (sign of basis determinant)
DstNormals[2] = VectorMultiplyAdd( VECTOR_0001, SrcNormals[2], DstNormals[2] );
The code above is the multiplication of the averaged bone matrix with the vertex position and normals.
Now for the bad part about the code above. The data involved, such as vertex positions, bone matrices, and morph target deltas, is all in modeling space (component space in Unreal vocabulary) and is all single-precision floating point. That all makes sense, but if you look at the calls involving VectorRegister, which represents a SIMD register like __m128, you will find that the data type behind VectorRegister is double-precision floating point!
It's good to know that in Unreal's PC builds all the SIMD VectorRegister types are compiled for SSE4 by default, to keep backward compatibility, while consoles like the XSX and PS5 compile with AVX. So the penalty of this float/double conversion is higher on PC than on consoles, although it is still costly on consoles. In general, converting between double and float SIMD vectors is expensive compared to doing the same with scalar data. Just changing VectorRegister to VectorRegister4f brought around 0.5 ms of improvement in my test case with LOD0 of the character.
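For instance, the normal transform shown earlier can stay entirely in single precision. A minimal sketch of the change, assuming SrcNormals, DstNormals, and the blended matrix rows M00..M30 are declared with the float register types (my illustration, not the full diff):

// Same math as before, but every register is explicitly single precision, so
// no float<->double conversion happens per vertex.
VectorRegister4f N_xxxx = VectorReplicate( SrcNormals[0], 0 );
VectorRegister4f N_yyyy = VectorReplicate( SrcNormals[0], 1 );
VectorRegister4f N_zzzz = VectorReplicate( SrcNormals[0], 2 );
DstNormals[0] = VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiplyAdd( N_zzzz, M20, M30 ) ) );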
2- Use multiple cores to run the vertex position and normal calculations on different CPU cores
Here we can use a ParallelFor. ParallelFor is a great way to break a long loop with a heavy body into chunks that run concurrently. However, to use a ParallelFor we need to be sure the code in the loop body is thread-safe: the body executed for one index must have no data dependency on any other index of the loop. If that holds, converting the loop to a ParallelFor is thread-safe.
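If you haven't used it before, here is a tiny, generic ParallelFor example (not the skinning code), just to show the shape of the API:

// Each index writes only its own element, so there is no data race between
// the worker threads that ParallelFor may use.
#include "CoreMinimal.h"
#include "Async/ParallelFor.h"

void FillSquares()
{
	TArray<float> Squares;
	Squares.SetNumUninitialized(1024);

	ParallelFor(Squares.Num(), [&Squares](int32 Index)
	{
		Squares[Index] = (float)Index * (float)Index;
	});
}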
To check whether the remaining variables are also thread-safe, we can look at how many of the variables in the scope surrounding the for loop are non-const. Const data is read-only, and since the data passed into the ParallelFor is not modified anywhere else while the ParallelFor is running, any read-only data is naturally thread-safe: multiple threads can read from it without any concern.
After turning all the data in the surrounding scope const, the compiler errors about writing to const data show that three variables are modified in the loop body, so they are not thread-safe. These variables are changed in the loop and come from the input parameters of SkinVertexSection. They are as follows:
CurBaseVertIdx is an output value, but the good news is that it is never read inside the loop body; we only have to make sure it ends up with a valid value. So we can take it out of the loop body and compute it outside the loop with a single operation.
TArray<FMorphTargetInfo>& MorphEvalInfos has a data member called NextDeltaIndex. This index is used as an optimization while blending morph targets: it is advanced during the loop, so each iteration depends on the previous ones. That means applying morph targets is not thread-safe; because of the data dependency created by NextDeltaIndex, they have to be calculated serially, not in parallel. To fix this, I extract the morph target calculations out of the ParallelFor and apply them in serial code.
The rest of the code is thread-safe except for one static local variable; turning it into a non-static local guarantees thread safety.
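Putting these pieces together, here is a minimal sketch of how the per-vertex work can be dispatched with ParallelFor once the morph pass and CurBaseVertIdx are handled outside of it. The buffers and the BlendBoneMatrix callable are simplified stand-ins, not the engine's actual code:

#include "CoreMinimal.h"
#include "Async/ParallelFor.h"

void SkinPositionsParallel(
	const TArray<FVector3f>& RefPositions,                           // reference-pose positions (stand-in)
	TFunctionRef<FMatrix44f(int32 /*VertexIndex*/)> BlendBoneMatrix, // stands in for the weighted matrix blend
	TArray<FVector3f>& OutPositions)                                 // skinned positions (stand-in)
{
	OutPositions.SetNumUninitialized(RefPositions.Num());

	// Every iteration only reads shared, immutable data and writes its own
	// output slot, so there is no cross-index dependency and the loop is
	// safe to parallelize.
	ParallelFor(RefPositions.Num(), [&](int32 VertexIndex)
	{
		const FMatrix44f Blended = BlendBoneMatrix(VertexIndex);
		OutPositions[VertexIndex] = Blended.TransformPosition(RefPositions[VertexIndex]);
	});
}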
3- Simplifying the Loop
1- Removing unnecessary ifs: There are two ifs inside the for loop whose conditions never change during the loop but are checked on every iteration. One checks NumValidMorphs to see whether there are any morph targets, and the other checks whether there is any cloth simulation. To avoid testing these two conditions in the loop body, we can split the body into three parts, check the conditions once before the loop, and assemble the specialized loop accordingly (a minimal sketch follows this paragraph). This change brought me roughly 0.02 ms of improvement on LOD0. If the iteration count were small I wouldn't touch this setup, since CPU branch prediction handles such loops pretty well, but with this many vertices, removing any unnecessary CPU operation is a win.
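Here is the minimal sketch of that idea. The callables are hypothetical stand-ins for the real per-vertex work; the point is just that the loop-invariant conditions are evaluated once:

#include "CoreMinimal.h"

void ProcessSectionVertices(
	int32 NumVertices,
	uint32 NumValidMorphs,
	const void* ClothSimData,                 // stand-in for the FClothSimulData pointer
	TFunctionRef<void(int32)> SkinVertex,     // hypothetical per-vertex skinning step
	TFunctionRef<void(int32)> ApplyMorph,     // hypothetical per-vertex morph step
	TFunctionRef<void(int32)> ApplyCloth)     // hypothetical per-vertex cloth step
{
	// Checked once, outside the loop, instead of once per vertex.
	const bool bHasMorphs = (NumValidMorphs > 0);
	const bool bHasCloth = (ClothSimData != nullptr);

	if (bHasMorphs && bHasCloth)
	{
		for (int32 V = 0; V < NumVertices; ++V) { SkinVertex(V); ApplyMorph(V); ApplyCloth(V); }
	}
	else if (bHasMorphs)
	{
		for (int32 V = 0; V < NumVertices; ++V) { SkinVertex(V); ApplyMorph(V); }
	}
	else if (bHasCloth)
	{
		for (int32 V = 0; V < NumVertices; ++V) { SkinVertex(V); ApplyCloth(V); }
	}
	else
	{
		for (int32 V = 0; V < NumVertices; ++V) { SkinVertex(V); }
	}
}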
The function also contains a loop that prefetches around 65536 matrices! It just spams the cache by prefetching memory locations that are never needed, and it iterates over 65536 indices to do it. It is a loop that only does harm; by removing it, I gained 0.05 ms.
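For illustration only, the removed code is roughly shaped like this (a paraphrase, not a verbatim engine snippet; MaxSupportedBones stands in for the engine's bone-count limit):

// Issues a software prefetch for every possible bone matrix slot, even though
// this mesh only ever uses 68 of them.
for (uint32 MatrixIndex = 0; MatrixIndex < MaxSupportedBones; ++MatrixIndex) // ~65536 iterations
{
	FPlatformMisc::Prefetch(ReferenceToLocal + MatrixIndex);
}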
It's worth saying that modern CPUs have very good hardware prefetchers, and software prefetching can easily do more harm than good. Another loop that can simply be removed is this one:
for (uint32 j = 0; j < VertexType::NumTexCoords; j++)
{
	SrcSoftVertex.UVs[j] = LOD.StaticVertexBuffers.StaticMeshVertexBuffer.GetVertexUV_Typed<VertexType::StaticMeshVertexUVType>(VertexBufferIndex, j);
}
Removing the loop above helps because its result is never used anywhere: the final UVs are copied at the end of the loop straight from the static vertex buffers, not from the skinned vertex, so this copy is redundant. Finally, look at how the blended bone matrix is built:
const FMatrix44f BoneMatrix0 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_0]]];
VectorRegister4f Weight0 = VectorReplicate(Weights, INFLUENCE_0);
VectorRegister4f M00 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[0][0]), Weight0);
VectorRegister4f M10 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[1][0]), Weight0);
VectorRegister4f M20 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[2][0]), Weight0);
VectorRegister4f M30 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[3][0]), Weight0);

if (MaxSectionBoneInfluences > 1)
{
	const FMatrix44f BoneMatrix1 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_1]]];
	VectorRegister4f Weight1 = VectorReplicate(Weights, INFLUENCE_1);
	M00 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[0][0]), Weight1, M00);
	M10 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[1][0]), Weight1, M10);
	M20 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[2][0]), Weight1, M20);
	M30 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[3][0]), Weight1, M30);

	if (MaxSectionBoneInfluences > 2)
	{
		const FMatrix44f BoneMatrix2 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_2]]];
		VectorRegister4f Weight2 = VectorReplicate(Weights, INFLUENCE_2);
		M00 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[0][0]), Weight2, M00);
		M10 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[1][0]), Weight2, M10);
		M20 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[2][0]), Weight2, M20);
		M30 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[3][0]), Weight2, M30);

		if (MaxSectionBoneInfluences > 3)
		{
			.
			.
			.
		}
This pattern of nested ifs continues all the way up to MaxSectionBoneInfluences > 11. It's not readable, and it's not scalable either. For instance, FMatrix44f contains 16 floats. The matrix blend could be computed much faster on a CPU that supports AVX-512, since with AVX-512 all 16 single-precision floats fit into one zmm register. With AVX, 8 single-precision floats fit per register, but the code above only targets SSE4, which is 4 floats per SIMD register. So in theory the code above could run up to 4 times slower than an AVX-512 version and 2 times slower than an AVX version. Of course, that is only in theory; in practice it has to be measured, since AVX and AVX-512 sometimes perform worse than SSE for simple cases. Either way, the code is neither readable nor scalable. I think we can agree on that!
To fix this, I replace this part of the code with a short ISPC kernel that does the same thing in only a few lines. The ISPC code also scales with the target CPU's support for AVX or AVX-512:
inline export void CpuSkinningGetAveragedMatrix(
	const uniform uint16 NumBoneWeights,
	const uniform uint16 BoneWeights[],
	const uniform uint16 BoneMap[],
	const uniform uint16 BoneIndices[],
	const uniform FMatrix44f ReferenceToLocal[],
	uniform float OutBoneWeights[],
	uniform float OutMatrix[])
{
	// Convert the quantized uint16 weights to normalized floats.
	// (Inv_65535 and FLOAT_SMALL_NUMBER are constants defined elsewhere in the ISPC source.)
	foreach (I = 0 ... NumBoneWeights)
	{
		OutBoneWeights[I] = BoneWeights[I] * Inv_65535;
	}

	// Initialize the output matrix with the first influence.
	const uniform FMatrix44f* uniform CurrentBoneMatrix = &ReferenceToLocal[BoneMap[BoneIndices[0]]];
	uniform float CurrentBoneWeight = OutBoneWeights[0];
	foreach (MatrixElement = 0 ... 16)
	{
		OutMatrix[MatrixElement] = CurrentBoneMatrix->M[MatrixElement] * CurrentBoneWeight;
	}

	// Accumulate the remaining influences, stopping at the first negligible weight.
	for (uniform int I = 1; I < NumBoneWeights; I++)
	{
		CurrentBoneWeight = OutBoneWeights[I];
		if (CurrentBoneWeight > FLOAT_SMALL_NUMBER)
		{
			CurrentBoneMatrix = &ReferenceToLocal[BoneMap[BoneIndices[I]]];
			foreach (MatrixElement = 0 ... 16)
			{
				OutMatrix[MatrixElement] += CurrentBoneMatrix->M[MatrixElement] * CurrentBoneWeight;
			}
		}
		else
		{
			break;
		}
	}
}
Just note that calling into ISPC inside a large loop can carry a small performance penalty, because exported ISPC functions cannot be inlined. It's usually better for the ISPC function to contain the main loop itself rather than to be called from inside a large loop, but here the main goals were scalability and readability. The performance didn't change much by using ISPC here.
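For completeness, here is a hedged sketch of what the C++ call site can look like. The generated header name, the local buffer names, and the casts are my assumptions for illustration; UE compiles each .ispc file in a module into a <Name>.ispc.generated.h header and exposes the exports in the ispc namespace, guarded by INTEL_ISPC:

#if INTEL_ISPC
#include "SkeletalRenderCPUSkin.ispc.generated.h"   // assumed header name
#endif

	// ...inside the per-vertex loop, replacing the MaxSectionBoneInfluences ladder:
	float BlendedWeights[MAX_TOTAL_INFLUENCES];     // normalized weights written by the kernel
	FMatrix44f BlendedMatrix;                       // the averaged bone matrix
	ispc::CpuSkinningGetAveragedMatrix(
		(uint16)MaxSectionBoneInfluences,
		VertexBoneWeights,                          // hypothetical: this vertex's quantized uint16 weights
		Section.BoneMap.GetData(),                  // section bone map (uint16 bone indices)
		VertexBoneIndices,                          // hypothetical: this vertex's bone indices
		reinterpret_cast<const ispc::FMatrix44f*>(ReferenceToLocal), // assumes a matching 16-float layout
		BlendedWeights,
		&BlendedMatrix.M[0][0]);

In a real integration, the original C++ path stays in place as the fallback when INTEL_ISPC is 0.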
Results
Tested on an Intel Core i5-12400F.
The results before optimization:
LOD0: 2.8 ms
LOD1: 0.95 ms
After optimization:
LOD0: 1.07 ms
LOD1: 0.42 ms
Overall, that is around 2.6 times faster on LOD0 and more than 2 times faster on LOD1.