CPU performance optimization guide – part 4
Introduction
Hello! We assume that you have already read the preface to the CPU Performance Optimization Guide and have a general understanding of the basic concepts of CPU performance optimization. This chapter continues the CPU Performance Optimization Guide III; it explains how to optimize code with hand-written assembly and will hopefully be of some help to your own projects.
Keywords: assembly.
Chapter 1 – Problem description
A question that may arise during project development is whether the high-level language compiler has generated optimal instructions. If the instructions it generates do not perform optimally, is there another way to generate better ones?
1.1. Instruction selection
First of all, most modern CPUs support several instruction sets, such as x86, x64, SSE, and AVX. Here we choose x64, the most common one. The goal is to write assembly code using only the x64 instruction set and see whether it can run faster than the instructions generated by the compiler.
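As a side note, whether a given instruction set is available can be checked at run time. The following is a minimal sketch using the MSVC __cpuid intrinsic; it only tests the CPUID feature bit for AVX and omits the OSXSAVE/XGETBV check that a production-quality detector would also need.

#include <intrin.h>
#include <cstdio>

// Minimal sketch: query CPUID leaf 1 and test the AVX feature bit (ECX bit 28).
// A complete check would also confirm OS support via OSXSAVE and XGETBV.
int main() {
    int info[4] = {};
    __cpuid(info, 1);
    bool avx = (info[2] & (1 << 28)) != 0;
    std::printf("AVX CPUID bit: %s\n", avx ? "set" : "not set");
    return 0;
}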
1.2. Compiler
Then comes the compiler. Due to the popularity and ease of use of Microsoft Windows® 10, Microsoft Visual Studio® 2022 is selected as the development environment; the IDE version used is 17.9.2 and the compiler version is 19.39.33521. The build configuration is x64 RelWithDebInfo, and the operating system version is 10.0.19045.4046. For assembly, we use the MASM assembler that ships with Visual Studio.
1.3. CPU
Finally, the CPU needs to be identified because the architecture is constantly being updated. The latency of each instruction varies from CPU to CPU. The CPU used here is an AMD Ryzen™ 9 7900X3D.
Chapter 2 – Analysis and Optimization
Since there have been performance tests before, this time we will go directly to the analysis phase, focusing on the assembly code generated by the compiler.
2.1. Displaced TB040
First, look at the compiler-generated assembly code for Displaced TB040. For such a simple function, the compiler performs well: almost all of the code runs on registers, which ensures the lowest instruction latency and highest throughput. However, the assembly code still has some redundancy, mainly the excessive register assignment operations generated in the while loop to calculate the variable g.
Figure 1: Displaced TB040 assembly code
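The compiler-generated assembly in Figure 1 is not reproduced in text form here. For reference, the logic of Displaced TB040 (an integer square root from part III) is roughly the following; this is a sketch reconstructed from the hand-written assembly shown in Section 2.3, not the original source.

#include <cstdint>
#include <intrin.h>

// Sketch of Displaced TB040 (reconstructed from the assembly in Section 2.3;
// the original source is in part III of this series). It starts from the
// highest candidate bit 1 << (bsr(x) / 2) and works downward.
uint32_t TB040_shl(uint32_t x) {
    unsigned long msb = 0;
    _BitScanReverse(&msb, x);            // undefined for x == 0, as noted later
    uint32_t add = 1u << (msb >> 1);     // highest candidate bit of the root
    uint32_t res = 0;
    while (add != 0) {
        uint32_t t = res | add;          // try setting this bit in the result
        if (x >= static_cast<uint64_t>(t) * t)
            res = t;                     // keep the bit if t*t does not exceed x
        add >>= 1;
    }
    return res;
}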
2.2. Compiler and Assembly
For such simple C++ code, the compiler already performs close to ideally, and getting it to produce better assembly is very hard. The reason is that the compiler itself must consider compatibility: the generated assembly needs to perform well across all the code logic it may be applied to. If a transformation does not behave correctly in some cases, the compiler falls back to an alternative with better average performance. In this situation, writing the assembly by hand is the best option if you are willing to give up some of that compatibility for optimal performance.
You need to meet the following requirements to write the assembly manually:
- First, you need to be familiar with the various registers and memory operations of the CPU.
- Second, you need to understand the specifications of function calls, parameter passing, and stack balancing between the compiler and assembly.
- Finally, you need to know how the assembly instructions perform on different CPUs.
Since Visual Studio and the MASM x64 assembler are used here, you need to read the following documentation to meet the above requirements:
- x64-architecture describes the registers and their meanings in the x64 CPU architecture. Simply put, the registers that a function can safely alter are rax, rcx, rdx, and r8–r11.
- x64-architecture also describes the specifications of function calls, parameter passing, and stack balancing for the x64 architecture. Simply put, the rcx, rdx, r8, and r9 registers, as well as the stack, are used to pass parameters into a function, and an integer return value is placed in rax (see the sketch after this list).
- Each CPU vendor documents the instruction sets supported by its CPU models and their instruction latencies. For AMD platforms, for example, you can go to https://www.amd.com/ to search for and download the Software Optimization Guide. Take the Software Optimization Guide for the AMD Zen 4 Microarchitecture as an example: it lists all instructions and the related performance parameters for AMD Zen 4 CPUs.
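To make the calling convention concrete, here is a minimal sketch of how the hand-written routine from Section 2.3 is declared and called from C++. The prototype name matches the assembly listing below; the main function and its printed output are only illustrative.

#include <cstdint>
#include <cstdio>

// Hand-written MASM routine (see the listing in Section 2.3).
// Per the Windows x64 calling convention, the first integer parameter
// arrives in rcx (only ecx is significant here) and the integer result
// is returned in eax/rax.
extern "C" uint32_t TB040_shl_asm(uint32_t x);

int main() {
    std::printf("isqrt(1000000) = %u\n", TB040_shl_asm(1000000u));  // expect 1000
    return 0;
}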
2.3. Assembly TB040
Now that all the conditions have been met, the next step is to write the assembly code. The overall goal is to streamline the assembly instructions inside the while loop, using fewer registers to do the same job. At the same time, don't forget to align the function entry and the jump targets to 16 bytes, so that the CPU can fetch them quickly.
_TEXT$21 segment para 'CODE'

align 16
public TB040_shl_asm
TB040_shl_asm proc frame

; r9d: add, r8d: res, r10d: x

        mov     r10d,ecx                ; keep the input x in r10d
        mov     r9d,1
        bsr     ecx,ecx                 ; index of the highest set bit of x
        xor     r8d,r8d                 ; res = 0
        shr     ecx,1
        shl     r9d,cl                  ; add = 1 << (bsr(x) / 2)
        test    r9d,r9d
        je      TB040_shl_end           ; never taken in practice; see the note at the end of Chapter 3

align 16
TB040_shl_loop:
        mov     ecx,r9d
        or      ecx,r8d                 ; candidate t = res | add
        mov     eax,ecx
        imul    rax,rcx                 ; t * t, no overflow in 64 bits
        cmp     r10,rax
        cmovae  r8d,ecx                 ; if x >= t*t, accept the candidate: res = t
        shr     r9d,1                   ; add >>= 1, sets ZF when add reaches 0
        mov     eax,r8d                 ; stage the return value; mov leaves the flags intact
        jne     TB040_shl_loop          ; loop while add != 0

align 16
TB040_shl_end:
        ret

.endprolog

TB040_shl_asm endp

_TEXT$21 ends

end
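As a usage note, in Visual Studio such a standalone .asm file is typically assembled by ml64 after enabling the masm build customization for the project (Build Dependencies > Build Customizations); the C++ side then only needs the extern "C" prototype shown earlier.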
Chapter 3 – Benchmarking
Once the code is complete, we can start building a benchmark that meets the following criteria:
- Micro-benchmark: streamline the test code as much as possible.
- Make sure the data is of sufficient magnitude to ensure accurate performance numbers.
- Proper validation to ensure the correctness of the algorithm.
3.1. Data Preparation
The goal this time is consistent with part III. Therefore, the data set is still the full range of unsigned integers that uint32_t can cover, and this data magnitude is enough to ensure the accuracy of the performance data.
At the same time, the algorithm needs to be verified. Each integer is compared with the result of std::sqrt to ensure the correctness of the algorithm itself.
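A minimal validation loop along these lines might look like the following. It is a sketch rather than the original harness from part III, and it skips 0 because bsr is undefined for a zero input.

#include <cmath>
#include <cstdint>
#include <cstdio>

extern "C" uint32_t TB040_shl_asm(uint32_t x);  // the routine from Section 2.3

int main() {
    // Walk the full uint32_t range (0 is skipped: bsr is undefined for 0).
    for (uint64_t i = 1; i <= 0xFFFFFFFFull; ++i) {
        uint32_t x = static_cast<uint32_t>(i);
        // double has a 53-bit mantissa, so sqrt of a 32-bit value truncates safely.
        uint32_t expected = static_cast<uint32_t>(std::sqrt(static_cast<double>(x)));
        if (TB040_shl_asm(x) != expected) {
            std::printf("mismatch at %u\n", x);
            return 1;
        }
    }
    std::printf("all values verified\n");
    return 0;
}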
3.2. Test Function
The reference sample is the Displaced TB040 described in part III. The test function is Displaced TB040 rewritten using only assembly instructions; this new function is called Assembly TB040.
3.3. Performance Data
Finally, the performance test needs to be run several times and the results averaged to smooth out fluctuations. As Table 1 shows, Assembly TB040 is slightly faster than Displaced TB040, by about 5%, which is quite a good result for such simple register-only code.
Table 1: Performance comparison of Displaced TB040 and Assembly TB040
Although the numbers are good, taking into account the difficulty and poor portability of hand-written assembly, you still need to balance performance gains against the actual resource cost in your own projects.
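For reference, a minimal timing loop in the spirit of the criteria above might look like this; the repetition count and output format are placeholders, not the exact harness used to produce Table 1.

#include <chrono>
#include <cstdint>
#include <cstdio>

extern "C" uint32_t TB040_shl_asm(uint32_t x);

int main() {
    const int runs = 5;                       // placeholder repetition count
    double total_ms = 0.0;
    for (int r = 0; r < runs; ++r) {
        uint32_t sink = 0;                    // accumulate results so the loop is not optimized away
        auto start = std::chrono::steady_clock::now();
        for (uint64_t i = 1; i <= 0xFFFFFFFFull; ++i)
            sink += TB040_shl_asm(static_cast<uint32_t>(i));
        auto stop = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
        std::printf("run %d, sink = %u\n", r, sink);
    }
    std::printf("average: %.2f ms\n", total_ms / runs);
    return 0;
}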
Careful readers may notice that the jump at the beginning of the while loop, which checks whether add is 0, is never triggered. The reason is that the implementation logic is slightly flawed: the more accurate structure is a do-while, which can be a little faster than a while loop. Such minor issues are easier to spot from an assembly perspective. However, the do-while version is still slower than the assembly that was optimized from the while version. You can also try to optimize the do-while logic with assembly instructions if you are interested.
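In C++ terms, the do-while form referred to above would look roughly like this; it is a sketch based on the reconstruction in Chapter 2, dropping the initial check because add is never 0 for a non-zero input.

#include <cstdint>
#include <intrin.h>

// Sketch of the do-while variant: the initial "add == 0" check is dropped
// because 1 << (bsr(x) / 2) is never 0 for a non-zero x.
uint32_t TB040_shl_do(uint32_t x) {
    unsigned long msb = 0;
    _BitScanReverse(&msb, x);          // still undefined for x == 0
    uint32_t add = 1u << (msb >> 1);
    uint32_t res = 0;
    do {
        uint32_t t = res | add;
        if (x >= static_cast<uint64_t>(t) * t)
            res = t;
        add >>= 1;
    } while (add != 0);
    return res;
}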
Chapter 4 – Afterword
This article uses Displaced TB040 as an example to show how hand-written assembly can be used to optimize a program, and why an assembler is still worth keeping around even when high-level language compilers are very mature. If you are interested, you can refer to the source code of glibc or Visual C++: low-level functions such as memcpy and memset are implemented separately for different CPU instruction sets using assembly, so as to ensure optimal performance.
Now it should be clear that optimization at the assembly level can be approached from two directions:
- Modify the high-level language logic so that the compiler generates better assembly code.
- Write programs directly in assembly.
You can choose your own solution based on actual needs. Note that there are still two options when using assembly directly:
- Write static assembly code using the assembler.
- Generate assembly code dynamically according to the runtime environment, also known as Just-in-Time (JIT) compilation.
The assembler is adequate most of the time, but JIT is also ubiquitous in today's foundational software, such as virtual machines and emulators; even some operators in deep learning frameworks are implemented via JIT. JIT has the potential to achieve the best performance on any hardware, but generating the code takes time, so a good balance is needed between code generation time and actual running time.
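To make the JIT idea concrete, the following is a minimal, self-contained sketch for Windows: it copies a few hand-encoded x64 bytes (a function that squares its ecx argument) into executable memory and calls them. A real JIT would select and encode instructions according to the runtime environment; this only illustrates the mechanism, and a production JIT would also avoid leaving a page writable and executable at the same time.

#include <windows.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // x64 machine code: mov eax, ecx ; imul eax, eax ; ret
    const unsigned char code[] = { 0x8B, 0xC1, 0x0F, 0xAF, 0xC0, 0xC3 };

    // Allocate a page of executable memory and copy the code into it.
    void* mem = VirtualAlloc(nullptr, sizeof(code),
                             MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    if (!mem) return 1;
    std::memcpy(mem, code, sizeof(code));

    // Call the freshly generated function through a function pointer.
    auto square = reinterpret_cast<uint32_t(*)(uint32_t)>(mem);
    std::printf("square(9) = %u\n", square(9));

    VirtualFree(mem, 0, MEM_RELEASE);
    return 0;
}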
After reading this, readers should understand the various efforts made by different frameworks and systems to achieve performance. I sincerely hope that your programs can reach their optimal performance. Finally, I wish you all success in your work.