This tutorial explains how to use Radeon GPU Analyzer (RGA) to produce a live VGPR analysis report for your shaders and kernels. Basic RGA usage knowledge is assumed.
Motivation
By performing a live register analysis on your shaders and kernels, you can identify code blocks with high VGPR pressure and spot opportunities to reduce register usage.
Background
The live register analysis determines “live” registers, that is, all registers which contain values that will be consumed by subsequent instructions. The maximum number of live registers is thus the lower bound on how many registers need to be allocated.
The analysis computes the live register set by building the control flow graph directly from the ISA disassembly and propagating the read/write information through it. Every read is propagated “up” through the control flow graph until a write is encountered. This produces the live range of each value, which starts at the instruction that writes it and ends at the last instruction that reads it.
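The upward propagation can be illustrated on straight-line code, where no control flow graph is needed. The following is a minimal sketch, not RGA's implementation: the `Instr` tuple layout, the `live_sets` function, and the hand-written read/write sets are all assumptions made for illustration, and the toy program is only loosely modelled on VGPR instructions. Counting a written register as occupied at the writing instruction is likewise a modelling choice.

```python
from typing import List, Set, Tuple

# Hypothetical model: (instruction text, VGPRs written, VGPRs read).
Instr = Tuple[str, Set[str], Set[str]]


def live_sets(program: List[Instr]) -> List[Set[str]]:
    """Return the set of VGPRs occupied at each instruction of a straight-line program."""
    live_after: Set[str] = set()  # nothing is live after the last instruction
    result: List[Set[str]] = []
    for _text, writes, reads in reversed(program):
        # A write stops the upward propagation of later reads; a read starts a new one.
        live_before = (live_after - writes) | reads
        # Modelling choice: a register written here also counts as occupied here.
        result.append(live_before | writes)
        live_after = live_before
    result.reverse()
    return result


# Toy program (illustrative only): v8, v9 and v10 are inputs that are never written.
program: List[Instr] = [
    ("v_mul_f32 v0, s4, v8", {"v0"}, {"v8"}),
    ("v_mac_f32 v0, s5, v9", {"v0"}, {"v0", "v9"}),
    ("v_mac_f32 v0, s6, v10", {"v0"}, {"v0", "v10"}),
    ("exp param0, v0, off, off, off", set(), {"v0"}),
]

for (text, _w, _r), live in zip(program, live_sets(program)):
    print(f"{len(live):2d} | {text}")
```

The live count drops by one after each input register is read for the last time, which is the same pattern a live register report shows for a sequence of multiply-accumulate instructions.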
Usage
To generate a live VGPR analysis report for any type of shader or kernel, add the --livereg switch to your command. Make sure that the command also contains the --isa switch: the live register report is generated by processing the GCN ISA disassembly, and without --isa there is no disassembly to analyze.
Example
The following command will generate a live VGPR analysis report for a Vulkan™ vertex shader:
rga -s vulkan --vert ~/Vertex1.vert --isa ~/isa_output.txt --livereg ~/livereg_report.txt
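If you drive RGA from a build script, the same command can be assembled programmatically. The sketch below only builds and prints the argument list. The helper name `build_livereg_command` and the relative file names are assumptions for illustration, and only the switches shown in the example command are used.

```python
import shlex
from typing import List


def build_livereg_command(shader: str, isa_out: str, livereg_out: str) -> List[str]:
    """Build the argument list for the Vulkan vertex shader example."""
    return [
        "rga", "-s", "vulkan",
        "--vert", shader,
        "--isa", isa_out,           # required: the report is derived from the ISA disassembly
        "--livereg", livereg_out,
    ]


cmd = build_livereg_command("Vertex1.vert", "isa_output.txt", "livereg_report.txt")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run it, assuming `rga` is on your PATH. A `~` in a path is not expanded without a shell, so expand it first with `os.path.expanduser`.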
Output Interpretation
Let’s take a look at the following live register analysis report, which was generated for a DirectX®11 vertex shader:
 1 |  9 |     ::::::: :: | label_basic_block_1: s_swappc_b64 s[2:3], s[2:3]
 2 |  9 |     ::::::: :: | s_andn2_b32 s0, s9, 0x3fff0000
 3 |  9 |     ::::::: :: | s_mov_b32 s1, s0
 4 |  9 |     ::::::: :: | s_mov_b32 s2, s10
 5 |  9 |     ::::::: :: | s_mov_b32 s3, s11
 6 |  9 |     ::::::: :: | s_mov_b32 s0, s8
 7 |  9 |     ::::::: :: | s_buffer_load_dwordx8 s[4:11], s[0:3], 0x00
 8 |  9 |     ::::::: :: | s_buffer_load_dwordx8 s[12:19], s[0:3], 0x20
 9 |  9 |     ::::::: :: | s_waitcnt lgkmcnt(0)
10 | 10 | ^   v:::::: :: | v_mul_f32 v0, s4, v4
11 | 11 | :^  v:::::: :: | v_mul_f32 v1, s8, v4
12 | 12 | ::^ v:::::: :: | v_mul_f32 v2, s12, v4
13 | 13 | :::^v:::::: :: | v_mul_f32 v3, s16, v4
14 | 12 | x::: v::::: :: | v_mac_f32 v0, s5, v5
15 | 12 | :x:: v::::: :: | v_mac_f32 v1, s9, v5
16 | 12 | ::x: v::::: :: | v_mac_f32 v2, s13, v5
17 | 12 | :::x v::::: :: | v_mac_f32 v3, s17, v5
18 | 11 | x:::  v:::: :: | v_mac_f32 v0, s6, v6
19 | 11 | :x::  v:::: :: | v_mac_f32 v1, s10, v6
20 | 11 | ::x:  v:::: :: | v_mac_f32 v2, s14, v6
21 | 11 | :::x  v:::: :: | v_mac_f32 v3, s18, v6
22 | 10 | x:::   v::: :: | v_mac_f32 v0, s7, v7
23 | 10 | :x::   v::: :: | v_mac_f32 v1, s11, v7
24 | 10 | ::x:   v::: :: | v_mac_f32 v2, s15, v7
25 | 10 | :::x   v::: :: | v_mac_f32 v3, s19, v7
26 |  9 | vvvv    ::: :: | exp pos0, v0, v1, v2, v3
27 |  5 |         ::: :: | s_buffer_load_dwordx4 s[4:7], s[0:3], 0x40
28 |  5 |         ::: :: | s_buffer_load_dwordx4 s[8:11], s[0:3], 0x50
29 |  5 |         ::: :: | s_buffer_load_dwordx4 s[0:3], s[0:3], 0x60
30 |  5 |         ::: :: | s_waitcnt expcnt(0)
31 |  6 | ^       v:: :: | v_mul_f32 v0, s4, v8
32 |  7 | :^      v:: :: | v_mul_f32 v1, s8, v8
33 |  8 | ::^     v:: :: | v_mul_f32 v2, s0, v8
34 |  7 | x::      v: :: | v_mac_f32 v0, s5, v9
35 |  7 | :x:      v: :: | v_mac_f32 v1, s9, v9
36 |  7 | ::x      v: :: | v_mac_f32 v2, s1, v9
37 |  6 | x::       v :: | v_mac_f32 v0, s6, v10
38 |  6 | :x:       v :: | v_mac_f32 v1, s10, v10
39 |  6 | ::x       v :: | v_mac_f32 v2, s2, v10
40 |  5 | vvv         :: | exp param0, v0, v1, v2, off
41 |  2 |             vv | exp param1, v12, v13, off, off
42 |  0 |                | s_endpgm
Maximum # VGPR used 13, # VGPR allocated: 14
Report structure:
- First (leftmost) column: a running number which represents the code line number
- Second column: the number of live VGPRs at that point of the program’s execution
- Third column: symbols that represent the status of each register. The i’th symbol refers to the i’th register:
  - ‘:’ means that the register is kept alive, while it is not actively being used by the current instruction
  - ‘^’ means that the current instruction writes to the register
  - ‘v’ means that the current instruction reads from the register
  - ‘x’ means that the current instruction both reads from the register and writes to it
- Fourth column: the disassembly of the current instruction
- The bottom line of the report presents a summary of the number of VGPRs which were actually used by the shader, and the number of VGPRs which were allocated for it.
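The column layout described above lends itself to simple post-processing, for example to locate the instruction with the highest VGPR pressure. The following is a rough sketch that assumes the columns are separated by `|` exactly as in the report shown earlier. The names `ReportLine`, `parse_report`, `symbol_count`, and `peak` are hypothetical helpers, not part of RGA.

```python
from typing import List, NamedTuple


class ReportLine(NamedTuple):
    line_no: int       # first column: running line number
    live_count: int    # second column: number of live VGPRs
    symbols: str       # third column: per-register status symbols
    instruction: str   # fourth column: disassembly


def parse_report(text: str) -> List[ReportLine]:
    """Parse the '|'-separated rows; lines without four columns are skipped."""
    rows: List[ReportLine] = []
    for raw in text.splitlines():
        parts = raw.split("|", 3)
        if len(parts) != 4 or not parts[0].strip().isdigit():
            continue  # e.g. the summary line at the bottom of the report
        number, count, symbols, instruction = parts
        rows.append(ReportLine(int(number), int(count), symbols, instruction.strip()))
    return rows


def symbol_count(row: ReportLine) -> int:
    """Count status symbols (':', '^', 'v', 'x'); this should equal the live count."""
    return sum(1 for ch in row.symbols if not ch.isspace())


def peak(rows: List[ReportLine]) -> ReportLine:
    """Return the first row with the highest live VGPR count."""
    return max(rows, key=lambda r: r.live_count)


sample = """\
12 | 12 | ::^ v:::::: :: | v_mul_f32 v2, s12, v4
13 | 13 | :::^v:::::: :: | v_mul_f32 v3, s16, v4
14 | 12 | x::: v::::: :: | v_mac_f32 v0, s5, v5
Maximum # VGPR used 13, # VGPR allocated: 14
"""

rows = parse_report(sample)
hot = peak(rows)
print(hot.line_no, hot.live_count, hot.instruction)
```

Counting the non-blank symbols in the third column is a quick consistency check against the second column, and it works even if the spacing inside that column has been altered.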
Remarks
- The analysis takes branches in the code into account and assumes that either path can be taken. As a result, live registers may appear “out of nowhere” at a label. This is by design.
- The analysis only looks at VGPRs, not SGPRs. Many instructions consume scalar registers, but these are ignored: there are generally more than enough scalar registers, and they are not the limiting factor for occupancy on GCN.
- Some registers will appear live when the program starts. These are generally pre-loaded: for instance, in a vertex shader, the fetch shader loads data into registers before the shader starts.
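The first and last remarks can be illustrated with a textbook backward liveness computation over a small control flow graph. This is a sketch under stated assumptions rather than RGA's actual algorithm: the `Block` layout, the `live_in_sets` function, and the three-block graph are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical block model: (VGPRs read before being written in the block,
#                            VGPRs written in the block, successor block names).
Block = Tuple[Set[str], Set[str], List[str]]


def live_in_sets(cfg: Dict[str, Block]) -> Dict[str, Set[str]]:
    """Iterate to a fixed point; a register is live if ANY successor path may read it."""
    live_in: Dict[str, Set[str]] = {name: set() for name in cfg}
    changed = True
    while changed:
        changed = False
        for name, (uses, defs, succs) in cfg.items():
            live_out: Set[str] = set()
            for succ in succs:
                live_out |= live_in[succ]  # either branch direction can be taken
            new_in = uses | (live_out - defs)
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True
    return live_in


cfg: Dict[str, Block] = {
    "entry": ({"v0"}, {"v1"}, ["then", "join"]),  # reads v0, writes v1, then branches
    "then": ({"v1"}, {"v2"}, ["join"]),           # reads v1, writes v2
    "join": ({"v1", "v3"}, set(), []),            # reads v1 and v3
}

for name, regs in live_in_sets(cfg).items():
    print(name, sorted(regs))
```

In this toy graph, v3 is live at the label "then" although that block never touches it, because a later block reads it. v0 and v3 are live on entry because nothing in the program writes them, which is how pre-loaded inputs show up in this model.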
Contributing to RGA
The source code for RGA’s live register analysis engine can be found on GitHub.