Up and Running with CodeXL Analyzer CLI (ARCHIVED CONTENT)

About CodeXL Analyzer CLI

CodeXL Analyzer CLI is an offline compiler and performance analysis tool for OpenCL kernels, DirectX® shaders and OpenGL® shaders. Using CodeXL Analyzer CLI, you can compile kernels and shaders for a variety of AMD GPUs and APUs, independent of your system hardware, and generate AMD ISA, intermediate language and performance statistics for each target platform.

CodeXL Analyzer CLI is used by graphics engineers and by developers of parallel-computing applications to identify performance bottlenecks and optimize their code. It also serves as a backend for shader compilation and performance statistics generation in other AMD tool products: CodeXL’s Analyze mode and GPU PerfStudio’s Shader Analyzer.

Key Features

  1. Compile OpenCL kernels and DirectX or OpenGL shaders, to generate AMD ISA code, intermediate language code, performance statistics and program binaries.
  2. Generate ISA, binaries and performance statistics for a variety of AMD GPUs and APUs, independent of the device that is physically installed on your system.
  3. Observe how different optimizations and compilation chains affect the performance of your kernels and shaders: 32-bit vs 64-bit, various compiler optimizations, kernel and shader code changes, and more.

CodeXL Analyzer CLI supports both Microsoft Windows® and Linux®.

Launching CodeXL Analyzer CLI

A CodeXL Analyzer CLI command consists of multiple command line switches. Some are relevant to all platforms (OpenCL, DirectX, OpenGL), while others are platform-specific. Below is a list of key command options that are applicable to all platforms:

Key basic options

| Command line switch | Description | Comments |
|---|---|---|
| -s <arg> | Specifies the platform: "cl" for OpenCL, "hlsl" for DirectX and "glsl" for OpenGL | Each invocation handles a single platform |
| -s <arg> -h | Displays the help menu for the selected platform | |
| -c <device name> | Target device for which output is generated | Can appear multiple times; if not present, all devices are targeted by default |
| -l | Lists the names of the supported devices | |
| --isa <full path> | Generates textual ISA code and saves the result to the given output full path | The Analyzer adds the device name to the file name to differentiate between the output of different devices |
| -a <full path> | Generates performance statistics and saves the result to the given output full path | The Analyzer adds the device name to the file name to differentiate between the output of different devices |

In the following sections, we will go through key command options for specific platforms. We will focus on the most commonly used commands, and not cover all available options. For the list of all available options, you can always use the -h command line switch.

Key OpenCL-specific options

For OpenCL kernels, CodeXL Analyzer CLI can compile high-level source code and extract AMD IL code and compiled binaries, in addition to textual ISA and performance statistics. Here are the options that are specific to OpenCL kernels:

| Command line switch | Description | Comments |
|---|---|---|
| --il <full path> | Generates textual AMD IL code and saves the result to the given output file (full file path) | The output file name is changed to differentiate between the output of different devices |
| -b <full path> | Saves the compiled binaries to the given output file (full file path) | |
| --kernel <kernel name> | Generates output for the given kernel | Use --kernel all to target all kernels |

 

Key DirectX-specific options

For DirectX shaders, CodeXL Analyzer CLI can extract DX ASM code, in addition to textual ISA and performance statistics. Here are the options that are specific to DirectX shaders:

| Command line switch | Description | Comments |
|---|---|---|
| -f <shader name> | The name of the target entry point | |
| -p <profile> | The shader profile (e.g. vs_5_0, ps_5_0, etc.) | |
| --DumpMSIntermediate <full path> | Saves the DX ASM code to the given output full path | |
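
As a rough illustration of how these switches combine with the basic options, the Python sketch below assembles an argument list for a single DirectX shader. It is not part of CodeXL: the helper name build_hlsl_command, the file names, and the choice to put the source file last (mirroring the OpenCL usage examples in this article) are assumptions made for this example.

```python
def build_hlsl_command(source, entry_point, profile, device=None,
                       isa=None, stats=None, dx_asm=None,
                       exe="CodeXLAnalyzer.exe"):
    """Assemble the argument list for one DirectX shader (hypothetical helper)."""
    cmd = [exe, "-s", "hlsl", "-f", entry_point, "-p", profile]
    if device:                      # omit -c to target all devices
        cmd += ["-c", device]
    if isa:                         # textual ISA output
        cmd += ["--isa", isa]
    if stats:                       # performance statistics output
        cmd += ["-a", stats]
    if dx_asm:                      # DX ASM output
        cmd += ["--DumpMSIntermediate", dx_asm]
    cmd.append(source)              # assumed position: source file last
    return cmd


# All file names below are placeholders chosen for this example.
print(" ".join(build_hlsl_command("shader.hlsl", "VSMain", "vs_5_0",
                                  device="Hawaii",
                                  isa="vsIsa.txt", stats="vsStats.csv",
                                  dx_asm="vsDxAsm.txt")))
```

The resulting list can be passed to subprocess.run once the executable location is known on your system.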

 

Key OpenGL-specific options

For OpenGL, CodeXL Analyzer CLI’s “glsl” mode accepts only a single shader source file per invocation.

| Command line switch | Description | Comments |
|---|---|---|
| -p <stage> | Specifies the shader type: Vertex, TessEval, Geometry, Fragment and compute | Tessellation control shaders are not supported by the Analyzer’s “glsl” mode |

Note: CodeXL Analyzer CLI’s “glsl” mode, which accepts only a single shader, is deprecated and will be replaced in future versions with a new OpenGL mode, which will allow compiling and linking of whole OpenGL programs, and generation of more accurate ISA, performance statistics and per-stage binaries.

Usage Examples

Let’s have a look at the following .cl file (BinarySearch_Kernels.cl, taken from the AMD APP SDK):

__kernel void
binarySearch( __global uint4 * outputArray,
__const __global uint2 * sortedArray,
const unsigned int findMe)
{
    unsigned int tid = get_global_id(0);
    uint2 element = sortedArray[tid];
    if((element.x > findMe) || (element.y < findMe))
    {
        return;
    }
    else
    {
        outputArray[0].x = tid;
        outputArray[0].w = 1;
    }
}

__kernel void
binarySearch_mulkeys(__global int *keys,
__global uint *_input,
const unsigned int numKeys,
__global int *_output)
{
    int gid = get_global_id(0);
    int lBound = gid * 256;
    int uBound = lBound + 255;

    for(int i = 0; i < numKeys; i++)
    {
        if(keys[i] >= _input[lBound] && keys[i] <=_input[uBound])
            _output[i]=lBound;
    }
}

__kernel void
binarySearch_mulkeysConcurrent(__global uint *keys,
__global uint *_input,
const unsigned int inputSize,
const unsigned int numSubdivisions,
__global int *_output)
{
    int lBound = (get_global_id(0) % numSubdivisions) * (inputSize / numSubdivisions);
    int uBound = lBound + inputSize / numSubdivisions;
    int myKey = keys[get_global_id(0) / numSubdivisions];
    int mid;

    while(uBound >= lBound)
    {
        mid = (lBound + uBound) / 2;
        if(_input[mid] == myKey)
        {
            _output[get_global_id(0) / numSubdivisions] = mid;
            return;
        }
        else if(_input[mid] > myKey)
            uBound = mid - 1;
        else
            lBound = mid + 1;
    }
}

As you can see, the file contains three kernels: binarySearch, binarySearch_mulkeys and binarySearch_mulkeysConcurrent.

To generate AMD ISA code and performance statistics for the binarySearch kernel and for the Hawaii ASIC (R9 290X), use the following command:

CodeXLAnalyzer.exe -c Hawaii --kernel binarySearch --isa exampleIsa.txt -a stats.csv BinarySearch_Kernels.cl

To also generate the AMD IL code and to save the compiled binary, include the --il and -b command line switches:

CodeXLAnalyzer.exe -c Hawaii --kernel binarySearch --isa exampleIsa.txt -a stats.csv --il exampleIl.txt -b exampleBin.bin BinarySearch_Kernels.cl

To target all three kernels, use the --kernel all option:

CodeXLAnalyzer.exe -c Hawaii --kernel all --isa exampleIsa.txt -a stats.csv --il exampleIl.txt -b exampleBin.bin BinarySearch_Kernels.cl
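
When the same compilation has to be repeated for several kernels or devices, commands like the ones above can be generated from a script. The following is a minimal Python sketch, not something shipped with CodeXL: the helper names build_opencl_command and run are hypothetical, only switches described in this article are used, and the sketch assumes CodeXLAnalyzer.exe can be found on the PATH.

```python
import subprocess


def build_opencl_command(source, device, kernel="all", isa=None, stats=None,
                         il=None, binary=None, exe="CodeXLAnalyzer.exe"):
    """Return the argument list for one OpenCL compilation (hypothetical helper)."""
    cmd = [exe, "-c", device, "--kernel", kernel]
    if isa:                 # textual ISA output
        cmd += ["--isa", isa]
    if stats:               # performance statistics output
        cmd += ["-a", stats]
    if il:                  # textual AMD IL output
        cmd += ["--il", il]
    if binary:              # compiled binary output
        cmd += ["-b", binary]
    cmd.append(source)
    return cmd


def run(cmd):
    """Launch the command; subprocess raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Rebuilds the first example command from this section and prints it.
    cmd = build_opencl_command("BinarySearch_Kernels.cl", "Hawaii",
                               kernel="binarySearch",
                               isa="exampleIsa.txt", stats="stats.csv")
    print(" ".join(cmd))
```

Calling run(cmd) would launch the tool. Device names other than Hawaii should be taken from the output of the -l switch rather than guessed.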

A Quick Look at CodeXL Analyzer CLI’s Output

ISA Disassembly

Let’s have a look at a typical ISA disassembly which was generated using CodeXL Analyzer CLI. This specific ISA disassembly was generated for a Direct3D Vertex shader:

shader VSMain
asic(SI)
type(VS)
s_swappc_b64  s[2:3], s[2:3]                                // 00000000: BE822102
s_buffer_load_dwordx8  s[0:7], s[8:11], 0x00                // 00000004: C2C00900
s_buffer_load_dwordx8  s[12:19], s[8:11], 0x08              // 00000008: C2C60908
s_waitcnt     lgkmcnt(0)                                    // 0000000C: BF8C007F
v_mul_f32     v0, s3, v7                                    // 00000010: 10000E03
v_mul_f32     v1, s7, v7                                    // 00000014: 10020E07
v_mul_f32     v2, s15, v7                                   // 00000018: 10040E0F
v_mul_f32     v3, s19, v7                                   // 0000001C: 10060E13
v_mac_f32     v0, s2, v6                                    // 00000020: 3E000C02
v_mac_f32     v1, s6, v6                                    // 00000024: 3E020C06
v_mac_f32     v2, s14, v6                                   // 00000028: 3E040C0E
v_mac_f32     v3, s18, v6                                   // 0000002C: 3E060C12
v_mac_f32     v0, s1, v5                                    // 00000030: 3E000A01
v_mac_f32     v1, s5, v5                                    // 00000034: 3E020A05
v_mac_f32     v2, s13, v5                                   // 00000038: 3E040A0D
v_mac_f32     v3, s17, v5                                   // 0000003C: 3E060A11
v_mac_f32     v0, s0, v4                                    // 00000040: 3E000800
v_mac_f32     v1, s4, v4                                    // 00000044: 3E020804
v_mac_f32     v2, s12, v4                                   // 00000048: 3E04080C
v_mac_f32     v3, s16, v4                                   // 0000004C: 3E060810
exp           pos0, v0, v1, v2, v3 done                     // 00000050: F80008CF 03020100
s_buffer_load_dwordx4  s[0:3], s[8:11], 0x10                // 00000058: C2800910
s_buffer_load_dwordx4  s[4:7], s[8:11], 0x14                // 0000005C: C2820914
s_buffer_load_dwordx4  s[8:11], s[8:11], 0x18               // 00000060: C2840918
s_waitcnt     expcnt(0) & lgkmcnt(0)                        // 00000064: BF8C000F
v_mul_f32     v0, s2, v10                                   // 00000068: 10001402
v_mul_f32     v1, s6, v10                                   // 0000006C: 10021406
v_mul_f32     v2, s10, v10                                  // 00000070: 1004140A
v_mac_f32     v0, s1, v9                                    // 00000074: 3E001201
v_mac_f32     v1, s5, v9                                    // 00000078: 3E021205
v_mac_f32     v2, s9, v9                                    // 0000007C: 3E041209
v_mac_f32     v0, s0, v8                                    // 00000080: 3E001000
v_mac_f32     v1, s4, v8                                    // 00000084: 3E021004
v_mac_f32     v2, s8, v8                                    // 00000088: 3E041008
v_mov_b32     v3, 1.0                                       // 0000008C: 7E0602F2
v_mov_b32     v4, 0                                         // 00000090: 7E080280
exp           param0, v0, v1, v2, v3                        // 00000094: F800020F 03020100
exp           param1, v12, v13, v4, v3                      // 0000009C: F800021F 03040D0C
s_endpgm                                                    // 000000A4: BF810000
end

The basic structure of a shader’s ISA disassembly is:

shader <entry point name>
asic(<hardware generation code name>)
type(<shader type>)
<ISA code>
end

Where:

  • <entry point name> is the name of the shader.
  • <hardware generation code name> is an acronym that specifies the hardware family: “CI” stands for Sea Islands, “SI” stands for Southern Islands, “VI” stands for Volcanic Islands.
  • <shader type> specifies the type of the shader:
    | Acronym | Meaning |
    |---|---|
    | VS | Vertex Shader |
    | HS | Hull Shader or Tessellation Control Shader |
    | ES | Domain Shader or Tessellation Evaluation Shader |
    | GS | Geometry Shader |
    | PS | Pixel Shader or Fragment Shader |
    | CS | Compute Shader or OpenCL kernel |
  • <ISA code> is the sequence of processor instructions. Every instruction has the following format:
    <opcode> <operands> // <PC>: <machine code>
    Where <opcode> and <operands> are the instruction’s opcode and operands, <PC> is the program counter (i.e. the offset of the instruction, in bytes, from the location of the shader’s first instruction) and <machine code> is a hexadecimal representation of the instruction and its operands. More details about AMD’s GCN ISA are available in AMD’s published ISA reference documentation.
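
The instruction format described above is regular enough to be parsed with a short script. The Python sketch below is an illustration only, not an official parser: the function name parse_isa_line and the regular expression are assumptions based on the listing shown in this article, where the machine code appears as one or more 8-digit hexadecimal words.

```python
import re

# <opcode> <operands> // <PC>: <machine code>
ISA_LINE = re.compile(
    r"^\s*(?P<opcode>\S+)\s*(?P<operands>.*?)\s*//\s*"
    r"(?P<pc>[0-9A-Fa-f]{8}):\s*"
    r"(?P<code>[0-9A-Fa-f]{8}(?:\s+[0-9A-Fa-f]{8})*)\s*$"
)


def parse_isa_line(line):
    """Split one disassembly line into its parts; return None for other lines."""
    m = ISA_LINE.match(line)
    if m is None:
        return None
    words = m.group("code").split()
    return {
        "opcode": m.group("opcode"),
        "operands": m.group("operands"),
        "pc": int(m.group("pc"), 16),   # byte offset from the first instruction
        "words": words,
        "size": 4 * len(words),         # each 8-digit hex word is 4 bytes
    }


# Three consecutive lines copied from the listing above.
SAMPLE = [
    "v_mac_f32     v3, s16, v4                                   // 0000004C: 3E060810",
    "exp           pos0, v0, v1, v2, v3 done                     // 00000050: F80008CF 03020100",
    "s_buffer_load_dwordx4  s[0:3], s[8:11], 0x10                // 00000058: C2800910",
]

parsed = [parse_isa_line(line) for line in SAMPLE]
for prev, cur in zip(parsed, parsed[1:]):
    # Each PC equals the previous PC plus the previous instruction's size.
    assert cur["pc"] == prev["pc"] + prev["size"]
```

The final loop checks that the offsets are consistent: the 8-byte exp instruction at offset 0x50 is followed by an instruction at 0x58.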

     

Performance Statistics

CodeXL Analyzer lets you observe how different optimizations and compilation flags affect the performance of your kernels and shaders. The performance statistics output file contains the following attributes in a CSV format:

| Attribute | Description |
|---|---|
| DEVICE | The name of the device for which the statistics were generated |
| SCRATCH_REGS | The number of bytes which were allocated in scratch memory. When the AMD driver’s shader compiler needs to allocate more than 256 VGPRs (Vector General-Purpose Registers), it spills into scratch memory. Main video memory is used for scratch, backed by the L1 and L2 caches. Using scratch memory degrades performance compared to only using VGPRs. When your kernel/shader’s use of vector registers forces the compiler to allocate registers in scratch memory, it is a good indication that you should review your code’s resource usage, to avoid register spilling |
| THREADS_PER_WORKGROUP | The number of threads (i.e. work-items) in a workgroup |
| WAVEFRONT_SIZE | The size of a Wavefront (AMD SIMDs execute their code in groups of work-items. Each such group is called a Wavefront). The Wavefront size in GCN devices is always 64. This field is kept primarily for legacy purposes |
| AVAILABLE_LDS_BYTES | The maximum number of bytes that can be allocated in the LDS (Local Data Store) |
| USED_LDS_BYTES | The number of bytes which were allocated by the AMD driver’s shader compiler in the LDS (Local Data Store) |
| AVAILABLE_SGPRs | The number of SGPRs (Scalar General Purpose Registers) which are available for a thread |
| USED_SGPRs | The number of SGPRs (Scalar General Purpose Registers) which were allocated for the thread by the AMD driver’s shader compiler |
| AVAILABLE_VGPRs | The number of VGPRs (Vector General Purpose Registers) which are available for a thread |
| USED_VGPRs | The number of VGPRs (Vector General Purpose Registers) which were allocated for the thread by the AMD driver’s shader compiler |
| CL_WORKGROUP_X_DIMENSION | The number of work items in a workgroup’s X dimension |
| CL_WORKGROUP_Y_DIMENSION | The number of work items in a workgroup’s Y dimension |
| CL_WORKGROUP_Z_DIMENSION | The number of work items in a workgroup’s Z dimension |
| ISA_SIZE | The size of the ISA code, in bytes. If your ISA size exceeds the size of the instruction cache, this will incur a performance penalty |
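
To show how these attributes might be checked automatically, here is a small Python sketch that flags register spilling and reports how much of each resource is in use. It rests on assumptions: the statistics file is taken to be comma-separated with a header row that uses the attribute names above, the sample values and device names are made up, and the helper names fraction, summarize and load_summaries are hypothetical.

```python
import csv
import io

# Made-up sample in the assumed layout: a header row, then one row per result.
SAMPLE = (
    "DEVICE,SCRATCH_REGS,USED_SGPRs,AVAILABLE_SGPRs,USED_VGPRs,"
    "AVAILABLE_VGPRs,USED_LDS_BYTES,AVAILABLE_LDS_BYTES,ISA_SIZE\n"
    "DeviceA,0,20,100,16,256,0,32768,168\n"
    "DeviceB,64,100,100,256,256,4096,32768,9000\n"
)


def fraction(used, available):
    """Return used/available, or 0.0 when nothing is available."""
    return used / available if available else 0.0


def summarize(row):
    """Reduce one statistics row to a few review-friendly indicators."""
    return {
        "device": row["DEVICE"],
        # Any scratch allocation means registers were spilled to memory.
        "spills": int(row["SCRATCH_REGS"]) > 0,
        "vgpr_use": fraction(int(row["USED_VGPRs"]), int(row["AVAILABLE_VGPRs"])),
        "sgpr_use": fraction(int(row["USED_SGPRs"]), int(row["AVAILABLE_SGPRs"])),
        "lds_use": fraction(int(row["USED_LDS_BYTES"]), int(row["AVAILABLE_LDS_BYTES"])),
    }


def load_summaries(text):
    """Parse CSV text and summarize every row."""
    return [summarize(row) for row in csv.DictReader(io.StringIO(text))]


for s in load_summaries(SAMPLE):
    flag = "SPILLS" if s["spills"] else "ok"
    print(f'{s["device"]}: VGPR {s["vgpr_use"]:.0%}, SGPR {s["sgpr_use"]:.0%}, '
          f'LDS {s["lds_use"]:.0%} [{flag}]')
```

For a real file, replace SAMPLE with the contents of the file written by the -a switch, and adjust the delimiter or column names if the actual layout differs.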

In Q1 2017, AMD CodeXL Analyzer was replaced by Radeon GPU Analyzer.
