This tutorial explains how to use Radeon GPU Analyzer (RGA) to produce a live VGPR analysis report for your shaders and kernels. Basic RGA usage knowledge is assumed.

Motivation

By performing a live register analysis on your shaders and kernels, you can identify code blocks with higher VGPR pressure, and opportunities for register usage optimizations.

Background

The live register analysis determines “live” registers, that is, all registers which contain values that will be consumed by subsequent instructions. The maximum number of live registers is thus the lower bound on how many registers need to be allocated.
The analysis computes the live register set by building the control flow graph directly from the ISA disassembly, and propagating the read/write information through it. Every read is propagated “up” through the control flow graph until a write is encountered. This produces the live range, which starts with a write, and ends with a read instruction.
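The propagation described above matches the textbook backward liveness dataflow. The Python sketch below illustrates that idea on a tiny hand-written control flow graph. It is not RGA's implementation: the block names, the `(writes, reads)` instruction encoding and the function name `live_registers` are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: each instruction is (registers_written, registers_read).
Instr = Tuple[Set[str], Set[str]]


def live_registers(blocks: Dict[str, List[Instr]],
                   successors: Dict[str, List[str]]) -> Dict[str, List[Set[str]]]:
    """Return, per block, the set of registers live just before each instruction."""
    live_in: Dict[str, Set[str]] = {name: set() for name in blocks}

    def walk(name: str) -> List[Set[str]]:
        # Start from everything that is live at the entry of any successor block.
        live: Set[str] = set()
        for succ in successors.get(name, []):
            live |= live_in[succ]
        per_instr: List[Set[str]] = []
        # Walk "up" the block: a write ends the range, a read starts it.
        for writes, reads in reversed(blocks[name]):
            live = (live - writes) | reads
            per_instr.append(set(live))
        per_instr.reverse()
        return per_instr

    # Iterate until no block's live-in set changes (fixed point).
    changed = True
    while changed:
        changed = False
        for name in blocks:
            per_instr = walk(name)
            new_in = per_instr[0] if per_instr else set()
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True

    return {name: walk(name) for name in blocks}


# Tiny example: "entry" branches to either "then" or "exit".
blocks = {
    "entry": [({"v0"}, {"v4"})],        # v0 = f(v4)
    "then":  [({"v1"}, {"v0", "v5"})],  # v1 = g(v0, v5)
    "exit":  [(set(), {"v0", "v1"})],   # consumes v0 and v1
}
successors = {"entry": ["then", "exit"], "then": ["exit"], "exit": []}
result = live_registers(blocks, successors)
```

In this example `v1` is written only in the "then" block but read in "exit". Because either branch may be taken, `v1` is treated as live on the path that skips "then", and so it is already live at "entry".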

Usage

To generate a live VGPR analysis report for any type of shader or kernel, add the --livereg switch to your command. Make sure that the command also contains the --isa switch: the live register report is generated by processing the GCN ISA disassembly, and without --isa there is no disassembly to analyze.

Example

The following command will generate a live VGPR analysis report for a Vulkan™ vertex shader:
rga -s vulkan --vert ~/Vertex1.vert --isa ~/isa_output.txt --livereg ~/livereg_report.txt
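When the report is generated from a build script, the same command can be assembled programmatically. The following is a minimal Python sketch that uses only the switches shown above. The helper name `build_livereg_command` and its parameters are hypothetical, and it assumes the `rga` executable is on the PATH.

```python
from typing import List


def build_livereg_command(stage_switch: str, shader_path: str,
                          isa_path: str, livereg_path: str,
                          api: str = "vulkan") -> List[str]:
    """Build an rga argument list that always pairs --livereg with --isa."""
    return [
        "rga", "-s", api,
        stage_switch, shader_path,   # e.g. "--vert" and the vertex shader file
        "--isa", isa_path,           # needed: the report is derived from the ISA
        "--livereg", livereg_path,
    ]


cmd = build_livereg_command("--vert", "Vertex1.vert",
                            "isa_output.txt", "livereg_report.txt")
# To run it: import subprocess; subprocess.run(cmd, check=True)
```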

Output Interpretation

Let’s take a look at the following live register analysis report, which was generated for a DirectX®11 vertex shader:

    1 |   9 |     ::::::: :: | label_basic_block_1: s_swappc_b64 s[2:3], s[2:3]
    2 |   9 |     ::::::: :: | s_andn2_b32 s0, s9, 0x3fff0000
    3 |   9 |     ::::::: :: | s_mov_b32 s1, s0
    4 |   9 |     ::::::: :: | s_mov_b32 s2, s10
    5 |   9 |     ::::::: :: | s_mov_b32 s3, s11
    6 |   9 |     ::::::: :: | s_mov_b32 s0, s8
    7 |   9 |     ::::::: :: | s_buffer_load_dwordx8 s[4:11], s[0:3], 0x00
    8 |   9 |     ::::::: :: | s_buffer_load_dwordx8 s[12:19], s[0:3], 0x20
    9 |   9 |     ::::::: :: | s_waitcnt lgkmcnt(0)
   10 |  10 | ^   v:::::: :: | v_mul_f32 v0, s4, v4
   11 |  11 | :^  v:::::: :: | v_mul_f32 v1, s8, v4
   12 |  12 | ::^ v:::::: :: | v_mul_f32 v2, s12, v4
   13 |  13 | :::^v:::::: :: | v_mul_f32 v3, s16, v4
   14 |  12 | x::: v::::: :: | v_mac_f32 v0, s5, v5
   15 |  12 | :x:: v::::: :: | v_mac_f32 v1, s9, v5
   16 |  12 | ::x: v::::: :: | v_mac_f32 v2, s13, v5
   17 |  12 | :::x v::::: :: | v_mac_f32 v3, s17, v5
   18 |  11 | x:::  v:::: :: | v_mac_f32 v0, s6, v6
   19 |  11 | :x::  v:::: :: | v_mac_f32 v1, s10, v6
   20 |  11 | ::x:  v:::: :: | v_mac_f32 v2, s14, v6
   21 |  11 | :::x  v:::: :: | v_mac_f32 v3, s18, v6
   22 |  10 | x:::   v::: :: | v_mac_f32 v0, s7, v7
   23 |  10 | :x::   v::: :: | v_mac_f32 v1, s11, v7
   24 |  10 | ::x:   v::: :: | v_mac_f32 v2, s15, v7
   25 |  10 | :::x   v::: :: | v_mac_f32 v3, s19, v7
   26 |   9 | vvvv    ::: :: | exp pos0, v0, v1, v2, v3
   27 |   5 |         ::: :: | s_buffer_load_dwordx4 s[4:7], s[0:3], 0x40
   28 |   5 |         ::: :: | s_buffer_load_dwordx4 s[8:11], s[0:3], 0x50
   29 |   5 |         ::: :: | s_buffer_load_dwordx4 s[0:3], s[0:3], 0x60
   30 |   5 |         ::: :: | s_waitcnt expcnt(0)
   31 |   6 | ^       v:: :: | v_mul_f32 v0, s4, v8
   32 |   7 | :^      v:: :: | v_mul_f32 v1, s8, v8
   33 |   8 | ::^     v:: :: | v_mul_f32 v2, s0, v8
   34 |   7 | x::      v: :: | v_mac_f32 v0, s5, v9
   35 |   7 | :x:      v: :: | v_mac_f32 v1, s9, v9
   36 |   7 | ::x      v: :: | v_mac_f32 v2, s1, v9
   37 |   6 | x::       v :: | v_mac_f32 v0, s6, v10
   38 |   6 | :x:       v :: | v_mac_f32 v1, s10, v10
   39 |   6 | ::x       v :: | v_mac_f32 v2, s2, v10
   40 |   5 | vvv         :: | exp param0, v0, v1, v2, off
   41 |   2 |             vv | exp param1, v12, v13, off, off
   42 |   0 |                | s_endpgm 

Maximum # VGPR used  13, # VGPR allocated:  14

Report structure:

  • First (leftmost) column: a running number which represents the code line number
  • Second column: the number of live VGPRs at that point of the program’s execution
  • Third column: symbols that represent the status of each register. The i’th symbol refers to the i’th register:
    • ‘:’ means that the register is kept alive, while it is not actively being used by the current instruction
    • ‘^’ means that the current instruction writes to the register
    • ‘v’ means that the current instruction reads from the register
    • ‘x’ means that the current instruction both reads from the register and writes to it
  • Fourth column: the disassembly of the current instruction
  • The bottom line of the report presents a summary of the number of VGPRs which were actually used by the shader, and the number of VGPRs which were allocated for it.
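To locate the instructions with the highest VGPR pressure automatically, the report lines can be parsed by column. The Python sketch below assumes the layout shown above (four columns separated by `|`, with one padding space on each side of the symbol column) and that the first symbol corresponds to v0. The names `ReportLine`, `parse_livereg_line`, `registers_with` and `peak_lines` are hypothetical helpers, not part of RGA.

```python
from typing import List, NamedTuple


class ReportLine(NamedTuple):
    number: int        # first column: running line number
    live_count: int    # second column: number of live VGPRs
    symbols: str       # third column: one symbol per register
    instruction: str   # fourth column: ISA disassembly


def parse_livereg_line(line: str) -> ReportLine:
    # Split at most three times so a '|' inside the instruction text survives.
    number, count, symbols, instruction = line.split("|", 3)
    # Drop the single padding space on each side of the symbol column.
    return ReportLine(int(number), int(count), symbols[1:-1], instruction.strip())


def registers_with(symbols: str, marks: str) -> List[int]:
    """Indices of registers whose symbol is one of `marks` (e.g. '^x' = written)."""
    return [i for i, s in enumerate(symbols) if s in marks]


def peak_lines(lines: List[str]) -> List[ReportLine]:
    """Return the report lines where the live VGPR count is at its maximum."""
    parsed = [parse_livereg_line(l) for l in lines if l.count("|") >= 3]
    peak = max(p.live_count for p in parsed)
    return [p for p in parsed if p.live_count == peak]


# Three lines copied from the report above.
sample = [
    "   12 |  12 | ::^ v:::::: :: | v_mul_f32 v2, s12, v4",
    "   13 |  13 | :::^v:::::: :: | v_mul_f32 v3, s16, v4",
    "   14 |  12 | x::: v::::: :: | v_mac_f32 v0, s5, v5",
]
hot = peak_lines(sample)
```

For the sample, `hot` contains only line 13, where v3 is written while v4 is read for the last time.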

Remarks

  • The analysis takes branches in the code into account, and assumes that either way can be taken. As a result, live registers can appear “out of nowhere” at a label. This is by design.
  • The analysis only looks at VGPRs, not SGPRs. Many instructions consume scalar registers, but these are ignored because there are generally more than enough scalar registers, and scalar registers are not the limiting factor for occupancy on GCN.
  • Some registers appear live when the program starts. These are generally pre-loaded: for instance, in a vertex shader, the fetch shader loads data into registers before the shader starts.

Contributing to RGA

The source code for RGA’s live register analysis engine is available on GitHub.

Resources

Radeon™ GPU Analyzer

Radeon GPU Analyzer is an offline compiler and performance analysis tool for DirectX®, Vulkan®, SPIR-V™, OpenGL® and OpenCL™.