CPU performance optimization guide – part 4

Introduction

Hello! We assume that you have already read the Preface to the CPU Performance Optimization Guide and have a general understanding of the basic concepts of CPU performance optimization. This chapter continues from the CPU Performance Optimization Guide III; it aims to help you better understand how to optimize code with assembly and, hopefully, prove useful in your own projects.

Keywords: assembly.

Chapter 1 – Problem description

A question that may arise in project development is whether the high-level-language compiler has generated optimal instructions. If the instructions it generates do not perform optimally, is there another way to obtain better ones?

1.1. Instruction selection

First of all, most modern CPUs support several instruction sets, such as the x86, x64, SSE, and AVX instruction sets. Here we choose the x64 instruction set, which is the most common. The goal is to write assembly code using only the x64 instruction set and see whether it can run faster than the instructions generated by the compiler.

1.2. Compiler

Then comes the compiler. Due to the popularity and ease of use of Microsoft Windows® 10, Microsoft Visual Studio® 2022 is selected as the development environment; the version used is 17.9.2, and the compiler version is 19.39.33521. The build configuration is x64 RelWithDebInfo, and the operating system version is 10.0.19045.4046. For the assembler, we use the MASM that ships with Visual Studio.

1.3. CPU

Finally, the CPU needs to be identified, because microarchitectures are constantly updated and the latency of each instruction varies from CPU to CPU. The CPU used here is an AMD Ryzen™ 9 7900X3D.

Chapter 2 – Analysis and Optimization

Since there have been performance tests before, this time we will go directly to the analysis phase, focusing on the assembly code generated by the compiler.

2.1. Displaced TB040

First, look at the compiler-generated assembly code for Displaced TB040. For such a simple function, the compiler performs well: almost all the code runs on registers, ensuring the lowest instruction latency and the highest throughput. However, the assembly code still has some redundancy, mainly the excess register-assignment operations generated in the while loop that calculates the variable g.


Figure 1: Displaced TB040 assembly code
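Based on the register comments in the assembly listing later in this chapter, the logic of Displaced TB040 can be sketched in C++ roughly as follows. This is a reconstruction for illustration, not the original source; the function and variable names are made up:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative reconstruction of Displaced TB040: a bit-by-bit integer
// square root that builds the result from the highest candidate bit down.
uint32_t tb040_shl(uint32_t x)
{
    if (x == 0) return 0;                    // bsr is undefined for 0
    int highest = 31;                        // index of the highest set bit
    while (!(x & (1u << highest))) --highest;
    uint32_t add = 1u << (highest >> 1);     // first candidate bit of the root
    uint32_t res = 0;                        // partial result (g in the text)
    while (add != 0) {
        uint32_t t = res | add;              // try setting this bit of the root
        if ((uint64_t)t * t <= x) res = t;   // keep the bit if t*t fits in x
        add >>= 1;                           // move to the next lower bit
    }
    return res;                              // floor(sqrt(x))
}
```

The loop body maps closely onto the hand-written assembly shown later: one or, one multiply, one compare, one conditional move, and one shift per iteration.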

2.2. Compiler and Assembly

For such simple C++ code, the compiler's output is already close to ideal, and getting the compiler to produce better assembly is very hard. The reason is that the compiler must remain general: the assembly it generates has to behave correctly and perform reasonably across all code paths. If an aggressive transformation is not correct for every possible input, the compiler falls back to a more conservative alternative with better average behavior. In this case, writing the assembly by hand is the best option if you are willing to trade some generality for optimal performance.

You need to meet the following requirements to write the assembly manually:

  1. First, you need to be familiar with the various registers and memory operations of the CPU.

  2. Second, you need to understand the specifications of function calls, parameter passing, and stack balancing between the compiler and assembly.

  3. Finally, you need to know how the assembly instructions perform on different CPUs.

Since Visual Studio and the MASM x64 assembler are used here, you can read the following documentation to meet the above requirements:

  • x64-architecture describes the registers and their meanings in the x64 CPU architecture. Simply put, the registers that a function can safely alter are rax, rcx, rdx, and r8–r11.

  • x64-architecture also describes the specifications of function calls, parameter passing, and stack balancing for the x64 architecture. Simply put, parameters are passed to a function in the rcx, rdx, r8, and r9 registers, with further parameters on the stack, and an integer return value is passed back in rax.

  • Each CPU vendor documents the instruction sets supported on its CPU models and their instruction latencies. For AMD platforms, for example, you can go to https://www.amd.com/ to search for and download the Software Optimization Guide. Take the Software Optimization Guide for the AMD Zen 4 Microarchitecture as an example: it contains all instructions and the related performance parameters for AMD Zen 4 CPUs.
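The calling convention described above can be illustrated with a minimal MASM fragment. This is illustrative only; the function name is made up:

```
_TEXT segment para 'CODE'

align   16
public  add_u32_asm
add_u32_asm proc            ; uint32_t add_u32_asm(uint32_t a, uint32_t b)
    lea     eax,[rcx+rdx]   ; a arrives in ecx, b in edx; result goes in eax
    ret                     ; leaf function: no prologue, no stack adjustment
add_u32_asm endp

_TEXT ends

end
```

Because the function only touches volatile registers (rax, rcx, rdx), it needs no register saving and no stack frame at all.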

2.3. Assembly TB040

Now that all the conditions have been met, the next step is to write the assembly code. The overall goal is to streamline the assembly instructions inside the while loop, using fewer registers to do the same job. At the same time, don't forget to align the function entry and the jump targets to 16 bytes, so that the CPU can fetch them quickly.


_TEXT$21 segment para 'CODE'

align   16
public  TB040_shl_asm
TB040_shl_asm    proc    frame
.endprolog                  ; leaf function: the prologue is empty

; r9d: add, r8d: res, r10d: x

    mov         r10d,ecx
    mov         r9d,1
    bsr         ecx,ecx
    xor         r8d,r8d
    shr         ecx,1
    shl         r9d,cl
    test        r9d,r9d
    je          TB040_shl_end

    align   16
    TB040_shl_loop:
    mov         ecx,r9d
    or          ecx,r8d
    mov         eax,ecx
    imul        rax,rcx
    cmp         r10,rax
    cmovae      r8d,ecx
    shr         r9d,1
    mov         eax,r8d
    jne         TB040_shl_loop

    align   16
    TB040_shl_end:
    ret

TB040_shl_asm endp

_TEXT$21 ends

end

Chapter 3 – Benchmarking

Once the code is complete, we can start building a benchmark that meets the following criteria:

  1. Micro-benchmark: streamline the test code as much as possible.

  2. Use data of sufficient magnitude to ensure accurate performance numbers.

  3. Validate the results to ensure the algorithm itself is correct.
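A minimal harness meeting these criteria might look like the following sketch. The function under test and all names are placeholders; the simple stand-in implementation below would be replaced by Displaced TB040 or Assembly TB040 in the real benchmark:

```cpp
#include <chrono>
#include <cstdint>

// Stand-in for the function under test (replace with the real routine).
static uint32_t isqrt_under_test(uint32_t x)
{
    uint32_t r = 0;
    while ((uint64_t)(r + 1) * (r + 1) <= x) ++r;
    return r;
}

// Time one full pass over [first, last).  Accumulating a checksum both
// validates the run and stops the compiler from removing the calls.
double benchmark_pass(uint32_t first, uint32_t last, uint64_t& checksum)
{
    auto t0 = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint32_t x = first; x != last; ++x)
        sum += isqrt_under_test(x);
    auto t1 = std::chrono::steady_clock::now();
    checksum = sum;
    return std::chrono::duration<double>(t1 - t0).count();  // seconds
}
```

In practice you would run `benchmark_pass` several times over the full uint32_t range and average the timings, as Section 3.3 describes.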

3.1. Data Preparation

The goal this time is consistent with part III. Therefore, the data set is still all unsigned integer ranges that uint32_t can cover, and this data magnitude is enough to ensure the accuracy of performance data.

At the same time, the algorithm needs to be verified. Each integer will be compared with the result of std::sqrt to ensure the correctness of the algorithm itself.
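The verification step described above might be sketched as follows. Only a sub-range is checked in this example, but the full uint32_t sweep works the same way; the helper name is made up:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Verify an integer-sqrt implementation against std::sqrt over [first, last).
// A uint32_t value is exactly representable in a double, so truncating
// std::sqrt yields the exact floor of the square root for this range.
template <typename F>
bool verify_isqrt(F isqrt, uint32_t first, uint32_t last)
{
    for (uint32_t x = first; x != last; ++x) {
        uint32_t expected = (uint32_t)std::sqrt((double)x);
        if (isqrt(x) != expected) return false;  // mismatch found
    }
    return true;
}
```

A failed check pinpoints the first input where the candidate algorithm diverges from the reference.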

3.2. Test Function

The reference sample is the Displaced TB040 referred to in part III. The test function is the Displaced TB040 written using only assembly instructions. The new function is called Assembly TB040.

3.3. Performance Data

Finally, the performance test needs to be run several times and the results averaged to smooth out run-to-run jitter. It can be seen that Assembly TB040 is slightly faster than Displaced TB040, by about 5%, which is quite a good result for such simple register-only code.

Table 1: Performance data

Although the numbers are good, taking into account the difficulty and poor portability of hand-written assembly, you still need to balance performance gains against actual resource costs when developing projects.

Careful readers may notice that the jump instruction at the beginning of the while loop, which tests whether add is 0, is never taken. The reason is that the implementation logic is flawed: the more accurate construct is do-while, which can be slightly faster than while. Such minor errors are easier to spot from an assembly perspective. However, do-while is still slower than the assembly optimized from the while version. You can also try to optimize the do-while logic with assembly instructions if you are interested.
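The do-while formulation suggested above might look like the following sketch (again a reconstruction with illustrative names, not the original source). Because add starts as a non-zero power of two whenever x > 0, the loop body is guaranteed to run at least once, so the initial zero test of the while form is unnecessary:

```cpp
#include <cassert>
#include <cstdint>

// Do-while variant: the initial "is add zero?" check is dropped because
// add is always non-zero on entry (x > 0 guarantees a valid highest bit).
uint32_t tb040_do_while(uint32_t x)
{
    if (x == 0) return 0;
    int highest = 31;                          // emulate bsr
    while (!(x & (1u << highest))) --highest;
    uint32_t add = 1u << (highest >> 1);
    uint32_t res = 0;
    do {
        uint32_t t = res | add;
        if ((uint64_t)t * t <= x) res = t;
        add >>= 1;
    } while (add != 0);                        // single test per iteration
    return res;
}
```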

Chapter 4 – Afterword

This article uses Displaced TB040 as an example to show how writing assembly code can optimize a program. Readers can now see why it is still worth keeping an assembler around, even when high-level-language compilers are very mature. If you are interested, you can refer to the source code of glibc or Visual C++: low-level functions such as memcpy and memset are implemented separately for different CPU instruction sets in assembly, so as to ensure optimal performance.

Now it is clear to readers that optimization of assembly instructions can be divided into two directions:

  1. Modify the high-level language logic to ensure that the compiler generates better assembly code.

  2. Write programs directly using assembly.

You can choose your own solution based on actual needs. Note that there are still two directions to choose from when using assembly directly:

  1. Write static assembly code using the assembly compiler.

  2. Generate assembly code dynamically according to the runtime environment, which is also known as Just-in-Time (JIT) technology.

The assembly compiler is adequate most of the time, but JIT is also ubiquitous in today’s basic software, such as virtual machines or simulators. Even some operators of deep learning frameworks are implemented by JIT. JIT has the potential to achieve the best performance on all hardware, but it takes time to generate JIT code, which requires a good balance between assembly code generation time and actual running time.
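As a toy illustration of the JIT idea, the sketch below emits a two-instruction x86-64 function into executable memory at run time and calls it. This is not a production technique and every detail is an assumption for illustration: it targets x86-64 POSIX systems only, with a plain fallback elsewhere, and ignores the cache-flushing and W^X concerns a real JIT must handle:

```cpp
#include <cstdint>
#include <cstring>
#if defined(__x86_64__) && (defined(__linux__) || defined(__APPLE__))
#include <sys/mman.h>
#endif

// Toy JIT: generate "mov eax, imm32; ret" and execute it.
using fn_t = int (*)();

int jit_constant(int value)
{
#if defined(__x86_64__) && (defined(__linux__) || defined(__APPLE__))
    unsigned char code[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };  // mov eax, imm32; ret
    std::memcpy(code + 1, &value, 4);                    // patch the immediate
    void* mem = mmap(nullptr, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return value;                 // hardened OS: no JIT
    std::memcpy(mem, code, sizeof code);                 // copy code into place
    int result = ((fn_t)mem)();                          // call the new code
    munmap(mem, sizeof code);
    return result;
#else
    return value;  // non-x86-64 or non-POSIX: compute directly, no codegen
#endif
}
```

Even this trivial case shows the trade-off mentioned above: allocating and writing the code costs far more than one call, so JIT only pays off when the generated code runs many times.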

After reading this, readers should understand the various efforts made by different frameworks and systems to achieve performance. I sincerely hope that your programs can reach their optimal performance. Finally, I wish you all success in your work.