Introduction
Animation poses are read and written many times during the animation processing loop in a game engine. Hundreds of thousands of linear math operations run on poses every frame, so poses need to be readable and writable quickly and with good cache efficiency. The most common way to define an animation pose is a flat array of transforms (an array of structures), where each index holds the transform of the corresponding skeletal mesh bone. Another option is a structure-of-arrays layout with three arrays per pose: one for positions, one for rotation quaternions, and one for 3D scales. There are also other ways to make the pose data layout completely vertical.
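To make the two layouts concrete, here is a minimal sketch; the type names (FTransformAoS, FPoseSoA) are hypothetical and only for illustration:

#include <vector>

// Array-of-structures: one transform per bone index.
struct FTransformAoS
{
    float Position[4];   // x, y, z, padding
    float Rotation[4];   // quaternion x, y, z, w
    float Scale[4];      // x, y, z, padding
};
using FPoseAoS = std::vector<FTransformAoS>;

// Structure-of-arrays: one array per component, indexed by bone.
struct FPoseSoA
{
    std::vector<float> Positions;  // 4 floats per bone
    std::vector<float> Rotations;  // 4 floats per bone
    std::vector<float> Scales;     // 4 floats per bone
};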
No matter which layout you choose, a pose can be treated as a flat array of floating-point values. In an animation loop, these arrays need to be copied many times for different reasons. For instance, you may copy one pose to another, apply some transform blending to the copy, and then blend it back with the source pose; you will see this pattern many times in an animation processing loop. Since copying is such a common operation, we should make sure we have an efficient way of copying data between poses.
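As a rough illustration of that pattern, treating a pose as a flat float array, a copy-then-blend step could look like the following sketch (the function name and the plain per-float lerp are placeholders for illustration, not how a real engine blends transforms):

#include <cstring>

void CopyAndBlend(const float* SourcePose, float* ScratchPose, int NumFloats, float Alpha)
{
    // 1) Copy the source pose into a scratch pose.
    std::memcpy(ScratchPose, SourcePose, NumFloats * sizeof(float));

    // 2) ... operate on ScratchPose here (e.g. apply another animation) ...

    // 3) Blend the scratch pose back toward the source pose.
    for (int I = 0; I < NumFloats; ++I)
    {
        ScratchPose[I] = SourcePose[I] * (1.0f - Alpha) + ScratchPose[I] * Alpha;
    }
}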
We know the C++ standard library's memcpy is implemented very efficiently and is generally the best way to copy between data types, so std::memcpy should always be considered when you need to copy data. Our case is special, however: we are copying between two arrays of floating-point values that are memory aligned. So the question is whether we can get better performance by using AVX SIMD loads and stores instead of the standard library memcpy. The rest of the post compares C++ std::memcpy against a custom SIMD-based memcpy for animation poses (flat arrays of SIMD-aligned floats), measuring the time taken by both methods.
std::memcpy vs Custom SIMD Load/Store
We define two scenarios to compare the results: one with a cold cache and one with a warm cache. Cold cache means the source and destination data we are reading from and writing to are unlikely to be in the cache yet, so we will see more cache misses. Warm cache means both the source and destination data are likely already in cache, so fewer cache misses are expected.
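A minimal sketch of how the two scenarios differ in setup: touching both buffers once before timing (for example with std::memset) gives the warm-cache case, while skipping that step leaves the cache cold.

#include <cstdlib>
#include <cstring>

float* Src = (float*)std::malloc(130000 * sizeof(float));
float* Dst = (float*)std::malloc(130000 * sizeof(float));
if (Src != nullptr && Dst != nullptr)
{
    // Warm cache: write both buffers once so the timed copy likely hits in cache.
    // Cold cache: skip these two calls and time the copy right after allocation.
    std::memset(Src, 0, 130000 * sizeof(float));
    std::memset(Dst, 0, 130000 * sizeof(float));

    // ... start the timer and run the copy here ...

    std::free(Src);
    std::free(Dst);
}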
What I am trying to find out is whether I can beat std::memcpy by using AVX and fully utilizing the CPU's SIMD unit. Keep in mind this is a specific case: my CPU supports AVX2 and I already have an array of 32-byte-aligned floats. For any other generic case I suggest sticking with std::memcpy, as it is already very fast.
So here are the two pieces of code: one for std::memcpy and one for a SIMD copy using AVX intrinsics. I allocate two arrays of 130,000 floats and copy one into the other:
std::memcpy
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>

// Cold cache: no initialization of the allocated data.
// To get a warm cache, initialize both buffers after allocating them, e.g. with std::memset.
float* A1 = (float*)std::malloc(130000 * sizeof(float));
float* B1 = (float*)std::malloc(130000 * sizeof(float));

if (A1 != nullptr && B1 != nullptr)
{
    const auto TimeBefore = std::chrono::high_resolution_clock::now();

    std::memcpy((void*)B1, (void*)A1, sizeof(float) * 130000);

    const auto DeltaTime = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::high_resolution_clock::now() - TimeBefore);
    std::cout << "DeltaTimeMemCpy = " << DeltaTime.count() << "\n";

    std::free(A1);
    std::free(B1);
}
AVX load and store with loop unrolling by 128 elements

#include <chrono>
#include <cstring>
#include <immintrin.h>
#include <iostream>
#include <malloc.h> // _aligned_malloc / _aligned_free (MSVC-specific)

// Cold cache: no initialization of the allocated data.
// To get a warm cache, initialize both buffers after allocating them, e.g. with std::memset.
float* A2 = (float*)_aligned_malloc(130000 * sizeof(float), sizeof(__m256));
float* B2 = (float*)_aligned_malloc(130000 * sizeof(float), sizeof(__m256));

if (A2 != nullptr && B2 != nullptr)
{
    __m256 Buff1;
    __m256 Buff2;
    __m256 Buff3;
    __m256 Buff4;

    const auto TimeBefore = std::chrono::high_resolution_clock::now();

    // Process 128 floats per iteration: 16 loads and 16 stores of 8 floats each.
    constexpr int NumElementsToProcessUnrolling = 130000 - 130000 % 128;
    for (int I = 0; I < NumElementsToProcessUnrolling; I += 128)
    {
        Buff1 = _mm256_load_ps(&A2[I]);
        Buff2 = _mm256_load_ps(&A2[I + 8]);
        Buff3 = _mm256_load_ps(&A2[I + 16]);
        Buff4 = _mm256_load_ps(&A2[I + 24]);
        _mm256_store_ps(&B2[I], Buff1);
        _mm256_store_ps(&B2[I + 8], Buff2);
        _mm256_store_ps(&B2[I + 16], Buff3);
        _mm256_store_ps(&B2[I + 24], Buff4);

        Buff1 = _mm256_load_ps(&A2[I + 32]);
        Buff2 = _mm256_load_ps(&A2[I + 40]);
        Buff3 = _mm256_load_ps(&A2[I + 48]);
        Buff4 = _mm256_load_ps(&A2[I + 56]);
        _mm256_store_ps(&B2[I + 32], Buff1);
        _mm256_store_ps(&B2[I + 40], Buff2);
        _mm256_store_ps(&B2[I + 48], Buff3);
        _mm256_store_ps(&B2[I + 56], Buff4);

        Buff1 = _mm256_load_ps(&A2[I + 64]);
        Buff2 = _mm256_load_ps(&A2[I + 72]);
        Buff3 = _mm256_load_ps(&A2[I + 80]);
        Buff4 = _mm256_load_ps(&A2[I + 88]);
        _mm256_store_ps(&B2[I + 64], Buff1);
        _mm256_store_ps(&B2[I + 72], Buff2);
        _mm256_store_ps(&B2[I + 80], Buff3);
        _mm256_store_ps(&B2[I + 88], Buff4);

        Buff1 = _mm256_load_ps(&A2[I + 96]);
        Buff2 = _mm256_load_ps(&A2[I + 104]);
        Buff3 = _mm256_load_ps(&A2[I + 112]);
        Buff4 = _mm256_load_ps(&A2[I + 120]);
        _mm256_store_ps(&B2[I + 96], Buff1);
        _mm256_store_ps(&B2[I + 104], Buff2);
        _mm256_store_ps(&B2[I + 112], Buff3);
        _mm256_store_ps(&B2[I + 120], Buff4);
    }

    // Copy the remaining elements that don't fill a whole unrolled iteration.
    constexpr int RemainderToProcess = (130000 % 128) * sizeof(float);
    std::memcpy((void*)&B2[NumElementsToProcessUnrolling], (void*)&A2[NumElementsToProcessUnrolling], RemainderToProcess);

    const auto DeltaTime = std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::high_resolution_clock::now() - TimeBefore);
    std::cout << "DeltaTimeSIMDCopy = " << DeltaTime.count() << "\n";

    _aligned_free(A2);
    _aligned_free(B2);
}
In the code snippet above, I unrolled the loop so that each iteration handles 128 floats, which greatly reduces the number of branch comparisons. I used AVX registers since I know my CPU supports AVX2, so every load and store moves 8 floats. I use 4 ymm registers for the loads and stores; I picked 4 to increase instruction-level parallelism. Many modern CPUs can, for instance, issue two _mm256_load_ps per cycle. Even if a CPU cannot issue more than one of the same instruction per cycle, it still supports instruction-level parallelism: if a load has a latency of 3 cycles, the CPU may be able to issue the next load after just 1 cycle. That 1 cycle is the (reciprocal) throughput of the instruction. So the whole load might take 3 cycles to finish (an instruction latency of 3 cycles), but the CPU can start the next one after 1 cycle thanks to pipelining. For the exact latency and throughput of each instruction, check the documentation for your CPU model.
In addition, separating the load and the store of the same data by a few other instructions, and spreading the work across more than one ymm register, removes the data dependency between consecutive load and store instructions and improves instruction-level parallelism on a modern CPU. Overall, the code is structured this way to maximize instruction-level parallelism and make sure it can compete with memcpy in terms of performance.
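If you want to reuse this outside a benchmark, the same idea can be wrapped in a small helper. Below is a sketch under the same assumptions (AVX2 available, both pointers 32-byte aligned); CopyPoseAVX is a hypothetical name, and the unroll factor here is 32 floats per iteration rather than 128 just to keep the example short:

#include <cstring>
#include <immintrin.h>

void CopyPoseAVX(float* Dst, const float* Src, int NumFloats)
{
    // Process 32 floats per iteration: 4 independent ymm load/store pairs,
    // which gives the CPU room for instruction-level parallelism.
    const int NumUnrolled = NumFloats - NumFloats % 32;
    for (int I = 0; I < NumUnrolled; I += 32)
    {
        const __m256 R0 = _mm256_load_ps(&Src[I]);
        const __m256 R1 = _mm256_load_ps(&Src[I + 8]);
        const __m256 R2 = _mm256_load_ps(&Src[I + 16]);
        const __m256 R3 = _mm256_load_ps(&Src[I + 24]);
        _mm256_store_ps(&Dst[I],      R0);
        _mm256_store_ps(&Dst[I + 8],  R1);
        _mm256_store_ps(&Dst[I + 16], R2);
        _mm256_store_ps(&Dst[I + 24], R3);
    }
    // Copy whatever is left over with plain memcpy.
    std::memcpy(&Dst[NumUnrolled], &Src[NumUnrolled], (NumFloats - NumUnrolled) * sizeof(float));
}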
Results
The results are as follows, measured on a PC with an Intel Core i5-12400F (which supports AVX2) in a Release build:
Cold Cache
std::memcpy: 203000 Nanoseconds
SIMD Load/Store: 157000 Nanoseconds
Warm Cache
std::memcpy: 20800 Nanoseconds
SIMD Load/Store: 15000 Nanoseconds
Looking at the results, it is possible to gain some performance over std::memcpy if you make full use of the CPU's SIMD unit: use AVX or AVX-512, allow for proper instruction-level parallelism and loop unrolling, and exploit the specifics of the data at hand, which here is just a flat array of aligned floats.
I assume the same code on a CPU supporting AVX-512 could perform even better, but at the moment I don't have a CPU supporting AVX-512, so I can't be sure about the exact results.
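Purely as an illustration of what that might look like (untested, since I can't run it), an AVX-512 variant would move 16 floats per zmm load/store and would need the buffers to be 64-byte aligned; CopyPoseAVX512 is a hypothetical name:

#include <cstring>
#include <immintrin.h>

void CopyPoseAVX512(float* Dst, const float* Src, int NumFloats)
{
    // Process 64 floats per iteration: 4 independent zmm load/store pairs.
    const int NumUnrolled = NumFloats - NumFloats % 64;
    for (int I = 0; I < NumUnrolled; I += 64)
    {
        const __m512 R0 = _mm512_load_ps(&Src[I]);
        const __m512 R1 = _mm512_load_ps(&Src[I + 16]);
        const __m512 R2 = _mm512_load_ps(&Src[I + 32]);
        const __m512 R3 = _mm512_load_ps(&Src[I + 48]);
        _mm512_store_ps(&Dst[I],      R0);
        _mm512_store_ps(&Dst[I + 16], R1);
        _mm512_store_ps(&Dst[I + 32], R2);
        _mm512_store_ps(&Dst[I + 48], R3);
    }
    // Copy whatever is left over with plain memcpy.
    std::memcpy(&Dst[NumUnrolled], &Src[NumUnrolled], (NumFloats - NumUnrolled) * sizeof(float));
}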