Friday, July 4, 2025

Improving Unreal Engine's CPU Skinning Performance

I've found some time to update the blog again. This time I want to show how you can improve Unreal Engine's CPU skinning with just a few changes. With these changes, CPU skinning runs up to 2.6 times faster, at least on my local machine. The changes target better utilization of the CPU, including its SIMD unit and multiple cores, and remove unnecessary operations. The first and second sections are more of an intro to skinning; if you just want to know about the performance work, jump to the Unreal CPU Skinning section.

I uploaded the changes to a GitHub repo so you can compare them with the base code. You can find the code here.


I am testing Unreal's default character, the UE Mannequin, which has 68 bones, 23309 vertices for LOD0 and 6733 vertices for LOD1. The changes are done on Unreal 5.6, which was released in late spring 2025.

When to use CPU Skinning?


With the presence of GPUs and their SPMD-friendly architecture, CPU skinning looks like overkill. The CPU is an all-round processor: it can do everything with its complicated architecture, but that doesn't mean it is as good as a GPU at vector and matrix processing. Still, even though the CPU is nowhere near the GPU for this kind of work, it has components for vector processing, and if they are utilized well, you can get good performance out of CPU skinning.

Considering that the CPU is no match for a GPU at character mesh skinning, when is it good to use CPU skinning?

There are still cases where CPU skinning makes sense. For instance, if you have a GPU-bound game and want to offload a bit of work from the GPU to the CPU. Or if you have a mobile game with low-res characters on an ARM-based CPU, which probably consumes less energy than a high-voltage GPU, balancing work between CPU and GPU could be a good option. Another case could be an LOD-based system where distant, lower-resolution characters are skinned on the CPU rather than the GPU to balance the load between the two.

So there are still cases where you can use CPU skinning, and a well-optimized CPU skinning path can be an asset. In this post I will go through Unreal's CPU skinning and optimize it with only a few changes. The CPU skinning in Unreal is almost complete (it's sometimes unstable, but that didn't prevent me from testing and optimizing the code). Here I only want to show how you can make it run faster with a few simple changes, but before doing that, let me briefly explain what character mesh skinning is.

Mesh Skinning

The process of binding the vertex positions and normals of a mesh to the bones of a skeletal mesh is called skinning. During the skinning process, artists give each vertex a weight of influence for different bones. For instance, a vertex located on the neck of a character can get 50% of its transform influence from the last neck joint, 30% from the first neck joint and 20% from the clavicle joint. So when the character moves its neck bones, the vertex reacts to the transforms and imitates skin being stretched.

Each bone transform comes as a 4x4 3D transformation matrix combining scale, rotation and position, in that order: first the scale of the bone is applied, then the rotation and then the position (with row vectors, the combined matrix is M = S * R * T). This is the standard in 3D packages because, with the position applied last, the matrix moves the object by exactly the position vector stored in it and scales the mesh exactly along the scale stored in it, which is the most intuitive way of representing a combined 3D transform.

Now our challenge is to go through each vertex, read how many bones influence it and with what weights, build a weighted average of the influencing bone matrices, and then multiply each vertex position and normal by the resulting matrix.
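As a minimal, self-contained sketch of that per-vertex math (plain structs and plain loops, not the engine's types or code), linear blend skinning of one vertex position looks roughly like this:

struct Mat4 { float M[4][4]; };   // row-major, translation stored in row 3
struct Vec3 { float X, Y, Z; };

Vec3 SkinPosition(const Vec3& RefPos,
                  const Mat4* BoneMatrices,   // bone space -> component space
                  const int* BoneIndices,     // bones influencing this vertex
                  const float* Weights,       // influence weights, summing to 1
                  int NumInfluences)
{
	// Weighted sum of the influencing bone matrices.
	float Blended[4][4] = {};
	for (int i = 0; i < NumInfluences; ++i)
		for (int r = 0; r < 4; ++r)
			for (int c = 0; c < 4; ++c)
				Blended[r][c] += Weights[i] * BoneMatrices[BoneIndices[i]].M[r][c];

	// Transform the reference-pose position by the blended matrix (row-vector convention).
	Vec3 Out;
	Out.X = RefPos.X * Blended[0][0] + RefPos.Y * Blended[1][0] + RefPos.Z * Blended[2][0] + Blended[3][0];
	Out.Y = RefPos.X * Blended[0][1] + RefPos.Y * Blended[1][1] + RefPos.Z * Blended[2][1] + Blended[3][1];
	Out.Z = RefPos.X * Blended[0][2] + RefPos.Y * Blended[1][2] + RefPos.Z * Blended[2][2] + Blended[3][2];
	return Out;
}

The same blended matrix (minus the translation row) is applied to the vertex normal.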

This looks like good food for SPMD programming, which is why GPU skinning is the clear win here. However, with the CPU we can still get decent results if we target characters with low detail. Just never expect complicated characters to run well with CPU skinning: characters with morph targets and many bones will run very slowly. CPU skinning is best reserved for low-detail characters.

Unreal CPU Skinning

What I explained in the previous section is already implemented in Unreal Engine. However, with some small changes we can make it run up to 2.6 times faster. There are three key areas we can focus on:

1- Improve vector operations to better utilize SIMD instructions.
2- Use multiple cores to run batches of vertex position and normal calculations on different CPU cores.
3- Generally simplify the loop.

To make a character run with CPU skinning, you can add test code like this to your character actor's BeginPlay:

void ALyraCpuSkinningTest::BeginPlay()
{
	Super::BeginPlay();
	if (GetMesh() != nullptr)
	{
		GetMesh()->SetCPUSkinningEnabled(true);
	}
}


After turning CPU skinning on, a bit of debugging leads us to the function that does the bulk of the math for calculating vertex positions and normals:

template <typename VertexType, int32 NumberOfUVs>
static void SkinVertexSection(
	FFinalSkinVertex*& DestVertex,
	TArray<FMorphTargetInfo>& MorphEvalInfos,
	const TArray<float>& MorphWeights,
	const FSkelMeshRenderSection& Section,
	const FSkeletalMeshLODRenderData& LOD,
	FSkinWeightVertexBuffer& WeightBuffer,
	int32 VertexBufferBaseIndex,
	uint32 NumValidMorphs,
	int32& CurBaseVertIdx,
	int32 LODIndex,
	const FMatrix44f* RESTRICT ReferenceToLocal,
	const FClothSimulData* ClothSimData,
	float ClothBlendWeight,
	const FMatrix& WorldToLocal,
	const FVector& WorldScaleAbs )

So our focus will be on optimizing this function. Let's go through the three main areas we want to improve.


1- SIMD Utilization


Looking at the code, there is heavy usage of the VectorRegister type, which in Unreal is used for SIMD operations. For instance, code like this loads the matrix of an influencing bone and accumulates it into the weighted sum:

const FMatrix44f BoneMatrix1 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_1]]];
VectorRegister Weight1 = VectorReplicate( Weights, INFLUENCE_1 );
M00	= VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[0][0] ), Weight1, M00 );
M10	= VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[1][0] ), Weight1, M10 );
M20	= VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[2][0] ), Weight1, M20 );
M30	= VectorMultiplyAdd( VectorLoadAligned( &BoneMatrix1.M[3][0] ), Weight1, M30 );

Or, a bit further down in the same function, you can find code like this:

VectorRegister N_xxxx = VectorReplicate( SrcNormals[0], 0 );
VectorRegister N_yyyy = VectorReplicate( SrcNormals[0], 1 );
VectorRegister N_zzzz = VectorReplicate( SrcNormals[0], 2 );
DstNormals[0] = VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiplyAdd( N_zzzz, M20, M30 ) ) );

DstNormals[1] = VectorZero();
N_xxxx = VectorReplicate( SrcNormals[1], 0 );
N_yyyy = VectorReplicate( SrcNormals[1], 1 );
N_zzzz = VectorReplicate( SrcNormals[1], 2 );
DstNormals[1] = VectorNormalize(VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiply( N_zzzz, M20 ) ) ));

N_xxxx = VectorReplicate( SrcNormals[2], 0 );
N_yyyy = VectorReplicate( SrcNormals[2], 1 );
N_zzzz = VectorReplicate( SrcNormals[2], 2 );
DstNormals[2] = VectorZero();
DstNormals[2] = VectorNormalize(VectorMultiplyAdd( N_xxxx, M00, VectorMultiplyAdd( N_yyyy, M10, VectorMultiply( N_zzzz, M20 ) ) ));

// carry over the W component (sign of basis determinant) 
DstNormals[2] = VectorMultiplyAdd( VECTOR_0001, SrcNormals[2], DstNormals[2] );


The code above is the multiplication of the averaged matrix with the vertex position and normals.

The good thing about this code is that the bone transform matrices are stored transposed. Having a transposed matrix is good for two reasons. First, if you want to read the position vector you can do it with only one load, because the position is placed at Matrix[3][0], [3][1], [3][2]; with a non-transposed matrix you would need three loads, as the position sits at Matrix[0][3], [1][3], [2][3]. Second, with a transposed matrix we can do the vector/matrix multiplication with fewer SIMD operations and use operations with higher throughput, mainly because we won't need to reduce the values within one xmm/ymm/zmm register. By reduction I mean doing one operation across all the SIMD lanes of a register to produce a scalar value. For instance, if you don't use a transposed matrix here, you need a reduce_add for each element of the vertex position, i.e. adding xmm[0] + xmm[1] + xmm[2] + xmm[3] to get one element of the final position, which takes several shuffle and add instructions. These extra steps are avoided by keeping the bone transform matrices transposed, and Unreal already does that, so no change is needed to the matrix data layout.
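To make that concrete, here is an illustrative raw SSE sketch (plain intrinsics, not the engine's VectorRegister code) of transforming a point with this transposed/row-major layout. Note that there is no horizontal reduction anywhere, only broadcasts and multiply/adds:

#include <xmmintrin.h>

// V holds (x, y, z, 1); M points to 16 floats stored row-major with the
// translation in row 3, the same layout Unreal keeps for the bone matrices.
static __m128 TransformPointRowMajor(__m128 V, const float* M)
{
	const __m128 Xxxx = _mm_shuffle_ps(V, V, _MM_SHUFFLE(0, 0, 0, 0));
	const __m128 Yyyy = _mm_shuffle_ps(V, V, _MM_SHUFFLE(1, 1, 1, 1));
	const __m128 Zzzz = _mm_shuffle_ps(V, V, _MM_SHUFFLE(2, 2, 2, 2));
	__m128 R = _mm_mul_ps(Xxxx, _mm_loadu_ps(M + 0));
	R = _mm_add_ps(R, _mm_mul_ps(Yyyy, _mm_loadu_ps(M + 4)));
	R = _mm_add_ps(R, _mm_mul_ps(Zzzz, _mm_loadu_ps(M + 8)));
	return _mm_add_ps(R, _mm_loadu_ps(M + 12));   // add the translation row
}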

Now for the bad part of the engine code shown earlier. The data (vertex positions, bone matrices and morph target deltas) is all in modeling space (component space in Unreal's vocabulary) and is all single-precision floating point. That all makes sense, but if you look at the calls to VectorRegister, which represents SIMD registers like __m128, you will find that the data type used in VectorRegister is double-precision floating point!

So in all the lines above there is a penalty for converting a register of four floats into two registers of two doubles each. This conversion is expensive and totally unnecessary, since all the data is 32-bit single-precision float.
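For illustration, this is roughly what that widening looks like at the intrinsic level (a sketch with raw AVX intrinsics; on an SSE-only build the same conversion produces two 2 x double halves instead of one ymm register):

#include <immintrin.h>

// Four floats get converted to four doubles on load and converted back when
// the skinned result is stored as floats, which is pure overhead here.
static void WidenAndNarrow(const float* SrcRow /* 4 floats */, float* DstRow /* 4 floats */)
{
	const __m128  RowF = _mm_loadu_ps(SrcRow);      // 4 x float, one xmm load
	const __m256d RowD = _mm256_cvtps_pd(RowF);     // float -> double (extra work)
	// ...the skinning math would happen here in double precision...
	_mm_storeu_ps(DstRow, _mm256_cvtpd_ps(RowD));   // double -> float again
}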

It's good to know that in Unreal's PC builds all the SIMD VectorRegister types are compiled with SSE4 by default, for backward compatibility, while on consoles like XSX or PS5 they are compiled with AVX. So the penalty of this conversion is higher on PC than on consoles, but it is still costly on consoles. Generally, converting between double and float SIMD vectors is costly compared to scalar data. Just changing VectorRegister to VectorRegister4f brought around 0.5 ms of improvement for my test case with LOD0 of the character.

2- Use multiple cores to run batches of vertex position and normal calculations on different CPU cores


I am trying CPU skinning on UE5's default character, the Mannequin. The character has over 23K vertices at LOD0, over 6K vertices at LOD1, and 68 bones. To do CPU skinning we go through all the vertices, compute the weighted average matrix of the influencing bones and then use it to calculate each vertex position and normal. That is a lot of instructions! In such a case we can get much better performance if we let batches of vertices be calculated on separate CPU cores.

Here we can use a ParallelFor. ParallelFor is a great way to break long loops with heavy bodies into chunks so the loop runs faster. However, to use a ParallelFor we need to be sure the code in the loop body is thread-safe: each iteration must have no data dependency on the other iterations of the loop. That is how we can tell whether turning the code into a ParallelFor is thread-safe or not.
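As a rough sketch of the structure (not the exact change from the repo; the batch size and the SkinOneVertex helper are illustrative assumptions), the per-vertex work can be split into batches and handed to ParallelFor:

// Batch the per-vertex skinning work across worker threads.
const int32 NumVertices = (int32)Section.NumVertices;
const int32 BatchSize   = 128; // experimental batch/unroll count
const int32 NumBatches  = FMath::DivideAndRoundUp(NumVertices, BatchSize);

ParallelFor(NumBatches, [&](int32 BatchIndex)
{
	const int32 Start = BatchIndex * BatchSize;
	const int32 End   = FMath::Min(Start + BatchSize, NumVertices);
	for (int32 VertexIndex = Start; VertexIndex < End; ++VertexIndex)
	{
		// Read-only inputs (bone matrices, weights, source vertices) are shared;
		// each iteration writes only to DestVertex[VertexIndex], so there is no
		// cross-iteration dependency.
		SkinOneVertex(DestVertex, VertexIndex /* , ...read-only inputs... */);
	}
});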

Now, looking at the SkinVertexSection function we are optimizing, the main for loop starts just a few lines after the function's definition. This is very good, because we can be sure that any non-reference/non-pointer variable defined inside the loop body is thread-safe: it is local and only accessed by that iteration's body.

To check whether the other variables are also thread-safe, we can look at how many variables in the surrounding scope of the for loop are non-const. When data is const, it is read-only. Since the data passed to the ParallelFor is not changed from outside during its execution, any read-only data is naturally thread-safe: multiple threads can read from read-only data without worrying about thread safety.

By turning all the data in the surrounding scope of the loop into const, I can let the compiler tell me which of it is written inside the loop body, since it generates errors for writes to const data. Doing that reveals three variables that are changed in the loop, so they are not thread-safe as-is. All three come from the input parameters of SkinVertexSection.

They are as follows:

FFinalSkinVertex*& DestVertex,
TArray<FMorphTargetInfo>& MorphEvalInfos,
int32& CurBaseVertIdx

DestVertex is the destination vertex buffer that receives the updated positions and normals. It is naturally thread-safe inside the ParallelFor, because each iteration updates only its own vertex and the vertices are independent of each other: they never read from or write to each other's data.

CurBaseVertIdx is an output value, but the good thing about it is that it is never read inside the loop body. We only need to make sure it holds a valid value as output, so we can take it out of the loop and compute it outside with one operation.
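A hedged one-liner of what that looks like (the section vertex count name is an assumption, not the engine's variable):

// CurBaseVertIdx is write-only inside the loop, so advance it once after the
// (now parallel) loop instead of once per vertex.
CurBaseVertIdx += NumSectionVertices;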

TArray<FMorphTargetInfo>& MorphEvalInfos has a data member called NextDeltaIndex. This index is used as an optimization while blending morph targets: it is advanced during the loop and depends on the previously processed vertices. That means applying morph targets is not thread-safe; they have to be calculated serially, not in parallel, due to the data dependency created by NextDeltaIndex. To fix this, I extract the morph target calculations out of the ParallelFor and apply them in serial code.

The rest of the code is thread-safe, except for one static local variable; turning it into a non-static local guarantees thread safety.

To minimize the overhead of the thread work, I also unrolled the loop by an experimentally chosen factor. This contributed to the efficiency of the ParallelFor and brought around 0.04 ms on LOD0.

The multi-core optimization made the code run up to about 2.2 times faster and was the biggest contributor to the overall speedup.


3- Simplifying the Loop


So far there have been two big improvements: one for SIMD utilization and one for multi-core utilization. However, there are a few other small things we can improve.

1- Removing unnecessary ifs: there are two ifs in the for loop whose conditions never change during the loop but are checked on every iteration. One checks NumValidMorphs to see if there are any morph targets, and one checks whether there is any cloth simulation. To avoid evaluating these two ifs in the loop body, we can split the loop into variants, check the conditions before the loop, and assemble them conditionally (see the sketch below). This change brought me almost 0.02 ms of improvement for LOD0. Generally, if the iteration count were small I wouldn't touch this, as the CPU's branch predictor handles such loops pretty well, but with this many vertices removing any unnecessary CPU operation is a win.
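One way to express the same idea (a sketch, not necessarily how the repo implements it) is to template the loop body on the loop-invariant conditions and dispatch once, so each instantiation contains no dead per-vertex branches:

// Names here are illustrative, not the engine's.
template <bool bHasMorphs, bool bHasCloth>
static void SkinLoopVariant(int32 NumVertices)
{
	for (int32 VertexIndex = 0; VertexIndex < NumVertices; ++VertexIndex)
	{
		// ...base skinning work for VertexIndex...
		if constexpr (bHasMorphs)
		{
			// ...apply morph deltas...
		}
		if constexpr (bHasCloth)
		{
			// ...blend cloth simulation...
		}
	}
}

static void SkinSectionDispatch(int32 NumVertices, bool bHasMorphs, bool bHasCloth)
{
	// The conditions are evaluated once here instead of once per vertex.
	if (bHasMorphs && bHasCloth)  SkinLoopVariant<true, true>(NumVertices);
	else if (bHasMorphs)          SkinLoopVariant<true, false>(NumVertices);
	else if (bHasCloth)           SkinLoopVariant<false, true>(NumVertices);
	else                          SkinLoopVariant<false, false>(NumVertices);
}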

2- Removing some unnecessary prefetches: in SkinVertices(), the caller of SkinVertexSection(), there is a loop like this:

// Prefetch all matrices
for ( uint32 MatrixIndex=0; MatrixIndex < MaxGPUSkinBones; MatrixIndex+=2 )
{
	FPlatformMisc::Prefetch( ReferenceToLocal + MatrixIndex );
}

This code prefetches around 65536 matrices! It just spams the cache by prefetching memory locations that are not needed, and it also iterates over 65536 / 2 indices, so it's a loop that only does harm. By removing it, I gained 0.05 ms.

It's worth saying that modern CPUs have very good hardware prefetchers, and software prefetching can easily do more harm than good.

3- Removing the UV calculations: there are some UV calculations whose results are not used or referenced anywhere. They are computed in this loop and then overwritten by other code:

for (uint32 j = 0; j < VertexType::NumTexCoords; j++)
{
    SrcSoftVertex.UVs[j] = LOD.StaticVertexBuffers.StaticMeshVertexBuffer.GetVertexUV_Typed<VertexType::StaticMeshVertexUVType>(VertexBufferIndex, j);
}

Removing the loop above helps performance because its result is not used anywhere. The UVs are simply copied at the end of the loop from the static vertex buffers, not from any skinned vertex, so this write is redundant.


4- Simplifying the matrix average calculation: the matrix averaging code looks like this:

const FMatrix44f BoneMatrix0 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_0]]];
VectorRegister4f Weight0 = VectorReplicate(Weights, INFLUENCE_0);
VectorRegister4f M00 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[0][0]), Weight0);
VectorRegister4f M10 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[1][0]), Weight0);
VectorRegister4f M20 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[2][0]), Weight0);
VectorRegister4f M30 = VectorMultiply(VectorLoadAligned(&BoneMatrix0.M[3][0]), Weight0);

if (MaxSectionBoneInfluences > 1)
{
	const FMatrix44f BoneMatrix1 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_1]]];
	VectorRegister4f Weight1 = VectorReplicate(Weights, INFLUENCE_1);
	M00 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[0][0]), Weight1, M00);
	M10 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[1][0]), Weight1, M10);
	M20 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[2][0]), Weight1, M20);
	M30 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix1.M[3][0]), Weight1, M30);

	if (MaxSectionBoneInfluences > 2)
	{
		const FMatrix44f BoneMatrix2 = ReferenceToLocal[BoneMap[BoneIndices[INFLUENCE_2]]];
		VectorRegister4f Weight2 = VectorReplicate(Weights, INFLUENCE_2);
		M00 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[0][0]), Weight2, M00);
		M10 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[1][0]), Weight2, M10);
		M20 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[2][0]), Weight2, M20);
		M30 = VectorMultiplyAdd(VectorLoadAligned(&BoneMatrix2.M[3][0]), Weight2, M30);

		if (MaxSectionBoneInfluences > 3)
		{
			.
			.
			.
}

This pattern continues until the check for MaxSectionBoneInfluences goes over 11. It's not readable, and it's not scalable either. For instance, FMatrix44f contains 16 floats; the matrix average could be calculated much faster on a CPU that supports AVX-512, since all 16 single-precision floats fit in one zmm register. With AVX, 8 single-precision floats fit in one ymm register. But the code above only supports SSE4, which is 4 floats per SIMD register (xmm).

So in theory the code above could run up to 4 times slower than an AVX-512 version and 2 times slower than an AVX version. Of course this is just in theory; in practice it has to be measured, as AVX and AVX-512 sometimes perform worse than SSE for simple cases. But at the very least we can agree the code is neither readable nor scalable.
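Just to illustrate the scaling argument, here is a hedged sketch with raw AVX intrinsics (not the engine's or the repo's code; it assumes a CPU with AVX2/FMA and 32-byte-aligned inputs) where the 16-float weighted accumulation collapses into two 8-wide fused multiply-adds:

#include <immintrin.h>

// Accum += BoneMatrix * Weight, treating a 4x4 float matrix as 16 contiguous floats.
static void AccumulateWeightedMatrixAVX(const float* BoneMatrix, float Weight, float* Accum)
{
	const __m256 W = _mm256_set1_ps(Weight);
	__m256 Lo = _mm256_load_ps(Accum);
	__m256 Hi = _mm256_load_ps(Accum + 8);
	Lo = _mm256_fmadd_ps(_mm256_load_ps(BoneMatrix), W, Lo);
	Hi = _mm256_fmadd_ps(_mm256_load_ps(BoneMatrix + 8), W, Hi);
	_mm256_store_ps(Accum, Lo);
	_mm256_store_ps(Accum + 8, Hi);
}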

To fix the readability and scalability problem, I replaced this part of the code with a short ISPC function that does the same thing in only a few lines. The ISPC code also scales with the local CPU's support for AVX or AVX-512:

inline export void CpuSkinningGetAveragedMatrix(const uniform uint16 NumBoneWeights, 
        const uniform uint16 BoneWeights[], const uniform uint16 BoneMap[],
        const uniform uint16 BoneIndices[],
	const uniform FMatrix44f ReferenceToLocal[],
	uniform float OutBoneWeights[], 
	uniform float OutMatrix[])
{
	foreach(I = 0 ... NumBoneWeights)
	{
		OutBoneWeights[I] = BoneWeights[I] * Inv_65535;
	}

	const uniform FMatrix44f* uniform CurrentBoneMatrix = &ReferenceToLocal[BoneMap[BoneIndices[0]]];
	uniform float CurrentBoneWeight = OutBoneWeights[0];
	foreach(MatrixElement = 0 ... 16)
	{
		OutMatrix[MatrixElement] = CurrentBoneMatrix->M[MatrixElement] * CurrentBoneWeight;
	}

	for (uniform int I = 1; I < NumBoneWeights; I++)
	{
		CurrentBoneWeight = OutBoneWeights[I];
		if (CurrentBoneWeight > FLOAT_SMALL_NUMBER)
		{
			CurrentBoneMatrix = &ReferenceToLocal[BoneMap[BoneIndices[I]]];
			foreach(MatrixElement = 0 ... 16)
			{
				OutMatrix[MatrixElement] += CurrentBoneMatrix->M[MatrixElement] * CurrentBoneWeight;
			}
		}
		else
		{
			break;
		}
	}
}

Just note that calling ISPC code inside a large loop can carry a small performance penalty, because functions exported from ISPC can't be inlined into C++ code. It's usually better for the ISPC function to contain the main loop itself rather than be called from a large C++ loop, but here the main goals were scalability and readability. Performance didn't change much by using ISPC here; if anything this function runs a bit slower because it can't be inlined.


Results


So here are the results. To recap, I tested CPU skinning on the UE Mannequin, which has 68 bones in total, 23309 vertices for LOD0 and 6733 vertices for LOD1. Unfortunately I couldn't test LOD2 and LOD3 due to a memory-stomp crash that Unreal's CPU skinning introduces on lower LODs.

Tested on an Intel Core i5-12400F.

So the results before optimization are like this:

LOD0: 2.8 ms
LOD1: 0.95 ms

After optimization:

LOD0: 1.07 ms
LOD1: 0.42 ms

Overall, that's around 2.6 times faster on LOD0 and 2 times faster on LOD1.

Monday, June 16, 2025

Copying Animation Poses: std::memcpy vs SIMD Load/Store?

Introduction


Animation poses are used many times during the animation processing loop in a game engine. They are written and read many times, and hundreds of thousands of linear math operations are done on them every frame, so they need to be quickly readable and writable with high cache efficiency. The most common way to define an animation pose is a flat array of transforms (an array of structures) where each index holds the transform of the corresponding skeletal mesh bone. Another way is to keep 3 arrays per pose: one for positions, one for quaternions and one for 3D scales. There are also other ways to make the pose data layout completely vertical.
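For illustration (plain C++ sketches with assumed field names, not the engine's pose types), those layouts look roughly like this:

#include <vector>

// Array-of-structures: one transform per bone index.
struct BoneTransform
{
	float Rotation[4];     // quaternion
	float Translation[3];
	float Scale3D[3];
};
using PoseAoS = std::vector<BoneTransform>;

// Structure-of-arrays: one flat float array per component, indexed by bone.
struct PoseSoA
{
	std::vector<float> Rotations;     // 4 floats per bone
	std::vector<float> Translations;  // 3 floats per bone
	std::vector<float> Scales3D;      // 3 floats per bone
};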

No matter which of the layouts above is used to define a pose, a pose can be treated as a flat array of floating-point values. In an animation loop these arrays need to be copied several times for different reasons. For instance, you copy one pose to another, apply some transform blending on the copy and then blend it back with the source pose. You will see such actions many times in an animation processing loop. Since copying is such a common operation, we should make sure copying data between poses is efficient.

We know the C++ standard library's memcpy is implemented very efficiently and is the usual best choice for copying between data types. So std::memcpy is always something to consider when you need to copy data. However, our case is special: we are copying between two arrays of floating-point values that are memory aligned. So what we want to check here is whether we can get better performance by using AVX SIMD loads and stores instead of the standard library memcpy. The rest of the post compares C++ std::memcpy against a custom SIMD-based memcpy for animation poses (flat arrays of SIMD-aligned floats), measuring the time of both methods.


std::memcpy vs Custom SIMD Load/Store

We define two scenarios to compare the results: one with a cold cache and one with a warm cache. Cold cache means both the source and destination data are unlikely to be in the cache yet, so we will see more cache misses. Warm cache means both the source and destination data are likely already in cache, so fewer cache misses are expected.

What I am trying to find out is whether I can beat std::memcpy by using AVX and fully using my CPU's SIMD unit. I need to stress that this is a specific case: I have a CPU supporting AVX2 and I already have an array of 32-byte-aligned floats. For any other generic case I suggest using std::memcpy, as it is already very fast.

So here are the two pieces of code, one for std::memcpy and one for a SIMD copy using AVX instructions. I am building two arrays of 130000 floats and copying one to the other:

std::memcpy
// Cold cache. No initialization of allocated data.
// To have a warm cache you can just initialize the data after allocating them using std::memset.
float* A1 = (float*)std::malloc(130000 * sizeof(float));
float* B1 = (float*)std::malloc(130000 * sizeof(float));
if (A1 != nullptr && B1 != nullptr)
{
	const auto TimeBefore = std::chrono::high_resolution_clock::now();

	std::memcpy((void*)B1, (void*)A1, sizeof(float) * 130000);

	const auto DeltaTime = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - TimeBefore);
	std::cout << "DeltaTimeMemCpy = " << DeltaTime.count() << "\n";

	free(A1);
	free(B1);
}




AVX load and store with loop unrolling of 128 count.
// Cold cache. No initialization of allocated data.
// To have a warm cache you can just initialize the data after allocating them using std::memset.
float* A2 = (float*)_aligned_malloc(130000 * sizeof(float), sizeof(__m256));
float* B2 = (float*)_aligned_malloc(130000 * sizeof(float), sizeof(__m256));
if (A2 != nullptr && B2 != nullptr)
{
	__m256 Buff1;
	__m256 Buff2;
	__m256 Buff3;
	__m256 Buff4;

	const auto TimeBefore = std::chrono::high_resolution_clock::now();

	constexpr int NumElementsToProcessUnrolling = 130000 - 130000 % 128;
	for (int I = 0; I < NumElementsToProcessUnrolling; I += 128)
	{
		Buff1 = _mm256_load_ps((const float*)&A2[I]);
		Buff2 = _mm256_load_ps((const float*)&A2[I + 8]);
		Buff3 = _mm256_load_ps((const float*)&A2[I + 16]);
		Buff4 = _mm256_load_ps((const float*)&A2[I + 24]);
		_mm256_store_ps((float*)&B2[I], Buff1);
		_mm256_store_ps((float*)&B2[I + 8], Buff2);
		_mm256_store_ps((float*)&B2[I + 16], Buff3);
		_mm256_store_ps((float*)&B2[I + 24], Buff4);

		Buff1 = _mm256_load_ps((const float*)&A2[I + 32]);
		Buff2 = _mm256_load_ps((const float*)&A2[I + 40]);
		Buff3 = _mm256_load_ps((const float*)&A2[I + 48]);
		Buff4 = _mm256_load_ps((const float*)&A2[I + 56]);
		_mm256_store_ps((float*)&B2[I + 32], Buff1);
		_mm256_store_ps((float*)&B2[I + 40], Buff2);
		_mm256_store_ps((float*)&B2[I + 48], Buff3);
		_mm256_store_ps((float*)&B2[I + 56], Buff4);

		Buff1 = _mm256_load_ps((const float*)&A2[I + 64]);
		Buff2 = _mm256_load_ps((const float*)&A2[I + 72]);
		Buff3 = _mm256_load_ps((const float*)&A2[I + 80]);
		Buff4 = _mm256_load_ps((const float*)&A2[I + 88]);
		_mm256_store_ps((float*)&B2[I + 64], Buff1);
		_mm256_store_ps((float*)&B2[I + 72], Buff2);
		_mm256_store_ps((float*)&B2[I + 80], Buff3);
		_mm256_store_ps((float*)&B2[I + 88], Buff4);

		Buff1 = _mm256_load_ps((const float*)&A2[I + 96]);
		Buff2 = _mm256_load_ps((const float*)&A2[I + 104]);
		Buff3 = _mm256_load_ps((const float*)&A2[I + 112]);
		Buff4 = _mm256_load_ps((const float*)&A2[I + 120]);
		_mm256_store_ps((float*)&B2[I + 96], Buff1);
		_mm256_store_ps((float*)&B2[I + 104], Buff2);
		_mm256_store_ps((float*)&B2[I + 112], Buff3);
		_mm256_store_ps((float*)&B2[I + 120], Buff4);
	}

	// Copy the remaining elements that don't fill a full unrolled iteration.
	constexpr int RemainderToProcess = (130000 % 128) * sizeof(float);
	std::memcpy((void*)&B2[NumElementsToProcessUnrolling], (void*)&A2[NumElementsToProcessUnrolling], RemainderToProcess);

	const auto DeltaTime = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - TimeBefore);
	std::cout << "DeltaTimeSIMDCpy = " << DeltaTime.count() << "\n";

	_aligned_free(A2);
	_aligned_free(B2);
}



In the code snippet above, I unrolled the loop by 128 to have far fewer branch comparisons. I used AVX registers since I know my CPU supports AVX2, so every load and store moves 8 floats. I'm using 4 ymm registers for the loads and stores; the reason I picked 4 is to increase instruction-level parallelism. For instance, many modern CPUs can issue two _mm256_load_ps per cycle, so they can fetch and execute two load instructions per cycle. Even if a CPU doesn't support issuing more than one of the same instruction per cycle, it still supports instruction-level parallelism: if a load instruction's latency is 3 cycles, the CPU may be able to issue the next instruction after 1 cycle. That 1 cycle is the (reciprocal) throughput of the instruction. So the whole load might take 3 cycles to finish (a latency of 3 cycles), but after 1 cycle the CPU can issue the next instruction thanks to pipelining. To get the exact latency and throughput of each instruction, check the documentation for your CPU model.

Apart from that, keeping the store of a given piece of data several instructions away from its load, by using more than one ymm register, removes the data dependency between adjacent load and store instructions and increases the efficiency of instruction-level parallelism on a modern CPU. Overall, the code is structured this way to maximize instruction-level parallelism and make sure it can compete with memcpy in terms of performance.


Results


The results are as follows, measured on a PC with an Intel Core i5-12400F (supporting AVX2) in a release build:


Cold Cache

std::memcpy: 203000 Nanoseconds

SIMD Load/Store: 157000 Nanoseconds

Warm Cache

std::memcpy: 20800 Nanoseconds

SIMD Load/Store: 15000 Nanoseconds


So looking at the results, it is possible to gain better performance than std::memcpy if you fully utilize the CPU's SIMD unit: use AVX or AVX-512, allow for proper instruction-level parallelism and loop unrolling, and exploit what we know about the data, which here is just a flat array of aligned floats.

I assume such code on a CPU supporting AVX-512 could perform even better, but at the moment I don't have an AVX-512 CPU, so I can't be sure about the exact results.