CPU performance optimization guide – part 2
Introduction
After reading the Preface to the CPU performance optimization guide and part 1, you should already have a general understanding of the basic concepts of CPU performance optimization. This part of the series walks through a problem encountered in a real project to deepen that understanding, and I hope the series will be of some help to your own projects.
Chapter 1 – Problem description
A very real issue we often encounter in project development is how to find and fix cache misses. Memory operations are spread throughout a program, and the cache is tens or hundreds of times faster to read and write than main memory, so ensuring that the program fetches data from the cache rather than from memory as much as possible can greatly improve performance.
1.1. Data structure
First of all, different data structures lead to completely different caching behavior, so we need to fix the data structure first. Here, we choose two layouts: a contiguous array of structs and an array of pointers to structs.
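To make the two layouts concrete, here is a minimal sketch (the x/y/z/w fields match the test functions used later):

struct Point { float x, y, z, w; };

std::vector<Point> aos;                  // contiguous array of structs
std::vector<std::unique_ptr<Point>> aop; // array of pointers to structs

In the first layout, consecutive elements sit next to each other in memory; in the second, each element lives wherever the heap allocator happened to place it.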
1.2. Compiler
Then comes the compiler. Due to the popularity and ease of use of Windows 10, Visual Studio 2022 (version 17.8.6) is selected as the development environment; the compiler version is 19.38.33135. The build configuration is x64 RelWithDebInfo. The operating system version is 10.0.19045.3930.
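RelWithDebInfo is a CMake configuration name, so if you reproduce this setup with CMake, the build can be generated along these lines (the source and build directory names are placeholders):

cmake -G "Visual Studio 17 2022" -A x64 -S . -B build
cmake --build build --config RelWithDebInfo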
1.3. CPU and performance analyzer
Finally, the CPU needs to be identified, because microarchitectures are constantly being updated, and cache capacities and policies differ completely between CPUs. This strong hardware dependence also means the CPU determines which performance analyzer can be used. The CPU used in this example is an AMD Ryzen 9 7900X3D, so the performance analyzer I used is AMD μProf, version 4.2.845.
Chapter 2 – Benchmarking
Now that we have set the stage, it’s time to build the benchmark. The benchmark needs to meet the following conditions:
- Micro-benchmark – streamline the test code as much as possible.
- Make the data set large enough to ensure accurate performance numbers.
- Keep the data random, but as consistent as possible across runs, to simulate the real situation.
2.1. Data preparation
This time, we are using a struct array. The only difference between the two arrays is what they store: one stores the struct instances themselves, and the other stores pointers to structs. 50 million data points is large enough to ensure accurate performance data. Randomness is provided by the heap allocator behind std::make_unique: each pointer holds a unique and effectively random memory address. The addresses are not exactly consistent between runs, but the CPU’s read latency after a cache miss is roughly the same for any address, which gives us sufficient consistency.
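As a rough sanity check (assuming the 16-byte Point sketched in section 1.1):

constexpr unsigned long long kCount = 50'000'000;                    // 50 million elements
constexpr unsigned long long kStructBytes = kCount * sizeof(Point);  // ~800 MB of Point data
constexpr unsigned long long kPointerBytes = kCount * sizeof(void*); // plus ~400 MB of pointers

Roughly 800 MB of data is far larger than the 7900X3D’s 128 MB of L3 cache, so the working set cannot simply stay resident in cache.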
2.2. Test function
The test function is relatively simple. We traverse the two std::vector instances, operate on their internal data, and finally record the elapsed time. Consider adding a noinline declaration to prevent the compiler from inlining the function, which makes debugging and profiling easier.
#include <chrono>
#include <iostream>
#include <iterator>
#include <memory>
#include <vector>
#include <xmmintrin.h> // _mm_prefetch, used in test_pointer below

// Point and the two container layouts are defined in section 1.1.

__declspec(noinline) float test_array(unsigned long long size) {
    float result = 0;
    std::vector<Point> points(size);
    const auto start = std::chrono::steady_clock::now();
    for (const auto& i : points) {
        result += i.x + i.y + i.z + i.w; // accumulate so the reads cannot be optimized away
    }
    const auto end = std::chrono::steady_clock::now();
    std::cout
        << "Slow calculations took "
        << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
        << " microseconds" << std::endl;
    return result;
}
template <int prefetch>
__declspec(noinline) float test_pointer(unsigned long long size) {
    float result = 0;
    std::vector<std::unique_ptr<Point>> points;
    points.reserve(size); // avoid vector reallocations while filling
    for (decltype(size) i = 0; i < size; ++i) {
        points.emplace_back(std::make_unique<Point>());
    }
    const auto start = std::chrono::steady_clock::now();
    for (auto i = points.cbegin(); i < points.cend(); ++i) {
        if (prefetch) {
            // Only prefetch while i + 64 is still inside the vector; otherwise
            // dereferencing std::next(i, 64) would read past the end.
            if (std::distance(i, points.cend()) > 64) {
                _mm_prefetch((char*)((*std::next(i, 32)).get()), _MM_HINT_NTA);
                _mm_prefetch((char*)((*std::next(i, 48)).get()), _MM_HINT_NTA);
                _mm_prefetch((char*)((*std::next(i, 64)).get()), _MM_HINT_NTA);
            }
        }
        result += (*i)->x + (*i)->y + (*i)->z + (*i)->w; // accumulate, as in test_array
    }
    const auto end = std::chrono::steady_clock::now();
    std::cout
        << "Slow calculations took "
        << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
        << " microseconds" << std::endl;
    return result;
}
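For completeness, a minimal driver along these lines runs all three cases; kRuns is a placeholder for however many repetitions you average over, as discussed in the next section:

int main() {
    constexpr unsigned long long kSize = 50'000'000; // 50 million elements, as in section 2.1
    constexpr int kRuns = 5;                         // placeholder repetition count
    float sink = 0; // consume the results so the calls cannot be optimized away
    for (int run = 0; run < kRuns; ++run) {
        sink += test_array(kSize);
        sink += test_pointer<0>(kSize); // pointer array, no software prefetch
        sink += test_pointer<1>(kSize); // pointer array, with software prefetch
    }
    std::cout << sink << std::endl;
    return 0;
}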
2.3. Performance data
Next come the performance tests. You need to run each test several times and average the results to smooth out run-to-run jitter. As you can clearly see, the array of structs is fastest, the array of pointers to structs is slowest, and the array of pointers to structs with software cache prefetching sits somewhere between the two.
Chapter 3 – Performance analysis
The next phase is to analyze the problem, focusing on the performance monitoring counters (PMCs) in the performance analyzer.
3.1. Analyze PMC
Use the performance analyzer to collect PMCs, selecting “Assess Performance (Extended)” as the option set. After the run, switch to the “ANALYZE” page, and then to the “Data Cache” view to see the PMCs associated with the data cache. L1 cache misses are reported by “L1_DEMAND_DC_REFILLS_LOCAL_DRAM” and “L1_DEMAND_DC_REFILLS_LOCAL_CACHE”. The PTI suffix indicates that the data is normalized per thousand instructions. These names may, of course, vary between versions of the performance analyzer; they map to the corresponding hardware PMCs in the PPR. First, we analyze the test_pointer<0> function. It is obvious that, per thousand instructions, there are 12.56 memory reads and 0.1 lower-level cache reads caused by L1 cache misses. The number of memory reads should be as low as possible.
Figure 1: Pointer array cache misses
Double-click the test_pointer<0> function to open the “SOURCES” page. You can see that the cache misses are concentrated on the load instructions that dereference the pointers.
Figure 2: Pointer array cache misses (assembly)
3.2. Hardware cache prefetch
Next, let’s look at the test_array function. It’s clear that there are far fewer cache misses, which means lower latency for the memory access instructions. This is mainly because consecutive data reads trigger the CPU’s hardware cache prefetcher, ensuring a high cache hit rate. Exactly which access patterns trigger hardware cache prefetching depends on the specific CPU model; please refer to the CPU programming manuals for the details. In most cases, however, the CPU performs well when processing small strides over contiguous data. “AMD Ryzen™ Processor Software Optimization, GDC 2023”, pp. 23-26, can serve as a good reference.
Figure 3: Contiguous array cache hits
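The two benchmarks boil down to exactly these two access patterns. In simplified form (using the aos and aop containers sketched in section 1.1, with sum and n as placeholders):

// Prefetcher-friendly: sequential reads at a fixed, small stride.
for (unsigned long long j = 0; j < n; ++j)
    sum += aos[j].x;

// Prefetcher-hostile: each iteration jumps to an unpredictable heap address.
for (unsigned long long j = 0; j < n; ++j)
    sum += aop[j]->x;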
3.3. Software cache prefetch
Now let’s look at the test_pointer<1> function. We can see that using _mm_prefetch for software cache prefetching results in fewer cache misses, and thus lower latency for the memory access instructions. However, the latency is still not as good as with hardware cache prefetching, for a number of reasons:
- _mm_prefetch itself consumes CPU cycles and cannot be called too often.
- The addresses that require software prefetching are difficult to determine. Hard-coding too many prefetch distances is not appropriate.
- Invalid software prefetches can lead to cache pollution. The specific indicator is “INEFFECTIVE_SW_PF”. Since the minimum unit of a cache prefetch is a full cache line, wasted prefetches are expensive: in this example there are 128.67 ineffective software cache prefetches per thousand instructions, which needs to be avoided in real projects.
To sum up, software prefetching is a trade-off for when hardware prefetching is not possible. It can reduce some of the latency, but it is hard to make optimal. As with hardware prefetching, software prefetching strategies vary by CPU model. “AMD Ryzen™ Processor Software Optimization, GDC 2023”, p. 22, can be a useful reference.
Figure 4: Pointer array cache prefetch
Chapter 4 – Conclusion
This post uses cache misses as a simple example to show you how to analyze PMCs, and provides a workaround. It must be noted, however, that caching issues are extremely complex, and the example presented here is just one of the more common ones. We will probably cover issues like cache alignment and multi-core cache ping-ponging in the future. At this point, a complete cache-miss analysis has been carried out, and you can see the huge performance gaps behind the different memory operations. Ensuring that a program makes the most of the CPU cache to reduce latency is an important task in software optimization.
Finally, I wish you all success in your work.