Amit Ben-Moshe
About CodeXL Analyzer CLI
CodeXL Analyzer CLI is an offline compiler and performance analysis tool for OpenCL kernels, DirectX® shaders and OpenGL® shaders. Using CodeXL Analyzer CLI, you can compile kernels and shaders for a variety of AMD GPUs and APUs, independent of your system hardware, and generate AMD ISA, intermediate language and performance statistics for each target platform.
CodeXL Analyzer CLI is used by graphics engineers and by developers of parallel-computing applications to identify performance bottlenecks and optimize their code. It also serves as a backend for shader compilation and performance statistics generation in AMD tool products: CodeXL’s Analyze mode and GPU PerfStudio’s Shader Analyzer.
Key Features
- Compile OpenCL kernels and DirectX or OpenGL shaders, to generate AMD ISA code, intermediate language code, performance statistics and program binaries.
- Generate binaries and performance statistics for a variety of AMD GPUs and APUs, independent of the device that is physically installed on your system.
- Observe how different optimizations and compilation chains affect the performance of your kernels and shaders: 32-bit vs 64-bit, various compiler optimizations, kernel and shader code changes, and more.
CodeXL Analyzer CLI supports both Microsoft Windows® and Linux®.
Launching CodeXL Analyzer CLI
CodeXL Analyzer CLI commands consist of multiple command line switches. Some apply to all platforms (OpenCL, DirectX, OpenGL), while others are platform-specific. Below is a list of key command options that apply to all platforms:
Key basic options
Command line switch | Description | Comments |
-s <arg> | Specifies the platform: “cl” for OpenCL, “hlsl” for DirectX and “glsl” for OpenGL | Each invocation handles a single platform |
-s <arg> -h | Display the help menu for the selected platform | |
-c <device name> | Target device for which output would be generated | Can appear multiple times; If not present, all devices are targeted by default |
-l | List the names of the supported devices | |
--isa <full path> | Generate textual ISA code and save the result to the given output full path | The Analyzer concatenates the device name to the file name to differentiate between the output of different devices |
-a <full path> | Generate performance statistics and save the result to the given output full path | The Analyzer concatenates the device name to the file name to differentiate between the output of different devices |
In the following sections, we will go through the key command options for specific platforms. We focus on the most commonly used options rather than covering every one. For the full list of available options, use the -h command line switch.
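The basic switches above can be assembled programmatically, which helps when the same source file is compiled for several devices. The following is a minimal Python sketch, not part of CodeXL: the `build_command` helper and its parameter names are hypothetical, and it uses only the switches listed in the table (-s, -c, --isa, -a). Platform-specific switches, covered in the following sections, are passed through unchanged via `extra_args`.

```python
from typing import List, Optional, Sequence

# Values accepted by -s, as listed in the basic options table.
PLATFORMS = ("cl", "hlsl", "glsl")


def build_command(
    platform: str,
    source_file: str,
    devices: Sequence[str] = (),
    isa_path: Optional[str] = None,
    stats_path: Optional[str] = None,
    extra_args: Sequence[str] = (),
    executable: str = "CodeXLAnalyzer.exe",
) -> List[str]:
    """Assemble an argument list from the basic switches (hypothetical helper)."""
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform!r}")
    args = [executable, "-s", platform]
    for device in devices:  # -c can appear multiple times
        args += ["-c", device]
    if isa_path is not None:
        args += ["--isa", isa_path]
    if stats_path is not None:
        args += ["-a", stats_path]
    args += list(extra_args)  # platform-specific switches, passed as-is
    args.append(source_file)
    return args


command = build_command(
    "cl",
    "BinarySearch_Kernels.cl",
    devices=["Hawaii"],
    isa_path="exampleIsa.txt",
    stats_path="stats.csv",
    extra_args=["--kernel", "binarySearch"],
)
print(" ".join(command))
```

Returning a list rather than a single string means the result can be handed directly to `subprocess.run` without shell quoting concerns.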
Key OpenCL-specific options
For OpenCL kernels, CodeXL Analyzer CLI can compile high-level source code and extract AMD IL code and compiled binaries, in addition to textual ISA and performance statistics. Here are the options that are specific for OpenCL kernels:
Command line switch | Description | Comments |
--il <full path> | Generate textual AMD IL code and save the result to the given output file (full file path) | Output file name is changed to differentiate between the output of different devices |
-b <full path> | Save the compiled binaries to the given output file (full file path) | |
--kernel <kernel name> | Generate output for the given kernel | Use --kernel all to target all kernels |
Key DirectX-specific options
For DirectX shaders, CodeXL Analyzer CLI can extract DX ASM code, in addition to textual ISA and performance statistics. Here are the options that are specific for DirectX shaders:
Command line switch | Description | Comments |
-f <shader name> | The name of the target entry point | |
-p <profile> | The shader profile (e.g. vs_5_0, ps_5_0, etc.) | |
--DumpMSIntermediate <full path> | Save the DX ASM code to the given output full path | |
Key OpenGL-specific options
For OpenGL, CodeXL Analyzer CLI accepts only a single shader source file per invocation. Here is the option that is specific to OpenGL shaders:
Command line switch | Description | Comments |
-p <stage> | Specifies the shader type: Vertex, TessEval, Geometry, Fragment and compute | Tessellation control shaders are not supported by the Analyzer’s “glsl” mode |
Note: CodeXL Analyzer CLI’s “glsl” mode, which accepts only a single shader, is deprecated and will be replaced in future versions with a new OpenGL mode, which will allow compiling and linking of whole OpenGL programs, and generation of more accurate ISA, performance statistics and per-stage binaries.
Usage Examples
Let’s have a look at the following .cl file (BinarySearch_Kernels.cl, taken from the AMD APP SDK):
__kernel void
binarySearch(__global uint4 * outputArray,
             __const __global uint2 * sortedArray,
             const unsigned int findMe)
{
    unsigned int tid = get_global_id(0);
    uint2 element = sortedArray[tid];
    if((element.x > findMe) || (element.y < findMe))
    {
        return;
    }
    else
    {
        outputArray[0].x = tid;
        outputArray[0].w = 1;
    }
}

__kernel void
binarySearch_mulkeys(__global int *keys,
                     __global uint *_input,
                     const unsigned int numKeys,
                     __global int *_output)
{
    int gid = get_global_id(0);
    int lBound = gid * 256;
    int uBound = lBound + 255;
    for(int i = 0; i < numKeys; i++)
    {
        if(keys[i] >= _input[lBound] && keys[i] <= _input[uBound])
            _output[i] = lBound;
    }
}

__kernel void
binarySearch_mulkeysConcurrent(__global uint *keys,
                               __global uint *_input,
                               const unsigned int inputSize,
                               const unsigned int numSubdivisions,
                               __global int *_output)
{
    int lBound = (get_global_id(0) % numSubdivisions) * (inputSize / numSubdivisions);
    int uBound = lBound + inputSize / numSubdivisions;
    int myKey = keys[get_global_id(0) / numSubdivisions];
    int mid;
    while(uBound >= lBound)
    {
        mid = (lBound + uBound) / 2;
        if(_input[mid] == myKey)
        {
            _output[get_global_id(0) / numSubdivisions] = mid;
            return;
        }
        else if(_input[mid] > myKey)
            uBound = mid - 1;
        else
            lBound = mid + 1;
    }
}
As you can see, the file contains three kernels: binarySearch, binarySearch_mulkeys and binarySearch_mulkeysConcurrent.
To generate AMD ISA code and performance statistics for the binarySearch kernel and for the Hawaii ASIC (R9 290X), use the following command:
CodeXLAnalyzer.exe -c Hawaii --kernel binarySearch --isa exampleIsa.txt -a stats.csv BinarySearch_Kernels.cl
To also generate the AMD IL code and save the compiled binary, include the --il and -b command line switches:
CodeXLAnalyzer.exe -c Hawaii --kernel binarySearch --isa exampleIsa.txt -a stats.csv --il exampleIl.txt -b exampleBin.bin BinarySearch_Kernels.cl
To target all three kernels, use the --kernel all option:
CodeXLAnalyzer.exe -c Hawaii --kernel all --isa exampleIsa.txt -a stats.csv --il exampleIl.txt -b exampleBin.bin BinarySearch_Kernels.cl
A Quick Look at CodeXL Analyzer CLI’s Output
ISA Disassembly
Let’s have a look at a typical ISA disassembly which was generated using CodeXL Analyzer CLI. This specific ISA disassembly was generated for a Direct3D Vertex shader:
shader VSMain
asic(SI)
type(VS)
s_swappc_b64 s[2:3], s[2:3] // 00000000: BE822102
s_buffer_load_dwordx8 s[0:7], s[8:11], 0x00 // 00000004: C2C00900
s_buffer_load_dwordx8 s[12:19], s[8:11], 0x08 // 00000008: C2C60908
s_waitcnt lgkmcnt(0) // 0000000C: BF8C007F
v_mul_f32 v0, s3, v7 // 00000010: 10000E03
v_mul_f32 v1, s7, v7 // 00000014: 10020E07
v_mul_f32 v2, s15, v7 // 00000018: 10040E0F
v_mul_f32 v3, s19, v7 // 0000001C: 10060E13
v_mac_f32 v0, s2, v6 // 00000020: 3E000C02
v_mac_f32 v1, s6, v6 // 00000024: 3E020C06
v_mac_f32 v2, s14, v6 // 00000028: 3E040C0E
v_mac_f32 v3, s18, v6 // 0000002C: 3E060C12
v_mac_f32 v0, s1, v5 // 00000030: 3E000A01
v_mac_f32 v1, s5, v5 // 00000034: 3E020A05
v_mac_f32 v2, s13, v5 // 00000038: 3E040A0D
v_mac_f32 v3, s17, v5 // 0000003C: 3E060A11
v_mac_f32 v0, s0, v4 // 00000040: 3E000800
v_mac_f32 v1, s4, v4 // 00000044: 3E020804
v_mac_f32 v2, s12, v4 // 00000048: 3E04080C
v_mac_f32 v3, s16, v4 // 0000004C: 3E060810
exp pos0, v0, v1, v2, v3 done // 00000050: F80008CF 03020100
s_buffer_load_dwordx4 s[0:3], s[8:11], 0x10 // 00000058: C2800910
s_buffer_load_dwordx4 s[4:7], s[8:11], 0x14 // 0000005C: C2820914
s_buffer_load_dwordx4 s[8:11], s[8:11], 0x18 // 00000060: C2840918
s_waitcnt expcnt(0) & lgkmcnt(0) // 00000064: BF8C000F
v_mul_f32 v0, s2, v10 // 00000068: 10001402
v_mul_f32 v1, s6, v10 // 0000006C: 10021406
v_mul_f32 v2, s10, v10 // 00000070: 1004140A
v_mac_f32 v0, s1, v9 // 00000074: 3E001201
v_mac_f32 v1, s5, v9 // 00000078: 3E021205
v_mac_f32 v2, s9, v9 // 0000007C: 3E041209
v_mac_f32 v0, s0, v8 // 00000080: 3E001000
v_mac_f32 v1, s4, v8 // 00000084: 3E021004
v_mac_f32 v2, s8, v8 // 00000088: 3E041008
v_mov_b32 v3, 1.0 // 0000008C: 7E0602F2
v_mov_b32 v4, 0 // 00000090: 7E080280
exp param0, v0, v1, v2, v3 // 00000094: F800020F 03020100
exp param1, v12, v13, v4, v3 // 0000009C: F800021F 03040D0C
s_endpgm // 000000A4: BF810000
end
The basic structure of a shader’s ISA disassembly is:
shader <entry point name>
asic(<hardware generation code name>)
type(<shader type>)
<ISA code>
end
Where:
- <entry point name> is the name of the shader.
- <hardware generation code name> is an acronym that specifies the hardware family: “CI” stands for Sea Islands, “SI” stands for Southern Islands, “VI” stands for Volcanic Islands.
- <shader type> specifies the type of the shader:
Acronym | Meaning |
VS | Vertex Shader |
HS | Hull Shader or Tessellation Control Shader |
ES | Domain Shader or Tessellation Evaluation Shader |
GS | Geometry Shader |
PS | Pixel Shader or Fragment Shader |
CS | Compute Shader or OpenCL kernel |
- <ISA code> is the sequence of processor instructions. Every instruction has the following format:
<opcode> <operands> // <PC>: <machine code>
Where <opcode> and <operands> are the instruction’s opcode and operands, <PC> is the program counter (i.e. the offset of the instruction, in bytes, from the shader’s first instruction) and <machine code> is a hexadecimal representation of the instruction and its operands. More details about AMD’s GCN ISA are available in AMD’s published ISA reference guides.
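Because the disassembly follows this regular layout, it is straightforward to post-process with a script. The following Python sketch is an illustration only and is not part of CodeXL: the `parse_disassembly` function, the `Instruction` type and the regular expressions are assumptions based solely on the format described above, and the sample lines are copied from the vertex shader listing.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Instruction(NamedTuple):
    opcode: str
    operands: str             # operand text exactly as printed
    pc: int                   # byte offset from the shader's first instruction
    machine_code: List[str]   # one or more 32-bit words, as hex strings


# Header lines: "shader <name>", "asic(<code>)", "type(<code>)".
HEADER_RE = re.compile(
    r"^(?:shader\s+(?P<shader>\S+)|asic\((?P<asic>\w+)\)|type\((?P<type>\w+)\))$"
)
# Instruction lines: "<opcode> <operands> // <PC>: <machine code>".
INSTRUCTION_RE = re.compile(
    r"^(?P<opcode>\S+)\s*(?P<operands>.*?)\s*//\s*"
    r"(?P<pc>[0-9A-Fa-f]+):\s*(?P<code>[0-9A-Fa-f ]+?)\s*$"
)


def parse_disassembly(text: str) -> Tuple[Dict[str, str], List[Instruction]]:
    """Split a disassembly into its header fields and instruction list."""
    header: Dict[str, str] = {}
    instructions: List[Instruction] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line == "end":
            continue
        match = HEADER_RE.match(line)
        if match:
            header.update({k: v for k, v in match.groupdict().items() if v})
            continue
        match = INSTRUCTION_RE.match(line)
        if match:
            instructions.append(
                Instruction(
                    opcode=match["opcode"],
                    operands=match["operands"],
                    pc=int(match["pc"], 16),
                    machine_code=match["code"].split(),
                )
            )
    return header, instructions


SAMPLE_ISA = """\
shader VSMain
asic(SI)
type(VS)
s_buffer_load_dwordx8 s[0:7], s[8:11], 0x00 // 00000004: C2C00900
v_mul_f32 v0, s3, v7 // 00000010: 10000E03
exp pos0, v0, v1, v2, v3 done // 00000050: F80008CF 03020100
s_endpgm // 000000A4: BF810000
end
"""

header, instructions = parse_disassembly(SAMPLE_ISA)
for ins in instructions:
    print(f"{ins.pc:#06x} {ins.opcode:<24} {len(ins.machine_code) * 4} bytes")
```

Keeping the machine code as a list of words makes it easy to tell 4-byte instructions from 8-byte ones such as exp, which is useful when relating the listing to the ISA_SIZE statistic.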
Performance Statistics
CodeXL Analyzer lets you observe how different optimizations and compilation flags affect the performance of your kernels and shaders. The performance statistics output file contains the following attributes in a CSV format:
Attribute | Description |
DEVICE | The name of the device for which the statistics were generated |
SCRATCH_REGS | The number of bytes which were allocated in scratch memory. When the AMD driver’s shader compiler needs to allocate more than 256 VGPRs (Vector General-Purpose Registers), it spills into scratch memory. Main video memory is used for scratch, backed by the L1 and L2 caches. Using scratch memory degrades performance compared to using only VGPRs. If your kernel or shader’s use of vector registers forces the compiler to allocate registers in scratch memory, it is a good indication that you should review your code’s resource usage to avoid register spilling |
THREADS_PER_WORKGROUP | The number of threads (i.e. work-items) in a workgroup |
WAVEFRONT_SIZE | The size of a Wavefront (AMD SIMDs execute their code in groups of work-items. Each such group is called a Wavefront). The Wavefront size in GCN devices is always 64. This field is kept primarily for legacy purposes |
AVAILABLE_LDS_BYTES | The maximum bytes that can be allocated in the LDS (Local Data Store) |
USED_LDS_BYTES | The number of bytes which were allocated by the AMD driver’s shader compiler in the LDS (Local Data Store) |
AVAILABLE_SGPRs | The number of SGPRs (Scalar General Purpose Registers) which are available for a thread |
USED_SGPRs | The number of SGPRs (Scalar General Purpose Registers) which were allocated for the thread by the AMD driver’s shader compiler |
AVAILABLE_VGPRs | The number of VGPRs (Vector General Purpose Registers) which are available for a thread |
USED_VGPRs | The number of VGPRs (Vector General Purpose Registers) which were allocated for the thread by the AMD driver’s shader compiler |
CL_WORKGROUP_X_DIMENSION | The number of work items in a workgroup’s X dimension |
CL_WORKGROUP_Y_DIMENSION | The number of work items in a workgroup’s Y dimension |
CL_WORKGROUP_Z_DIMENSION | The number of work items in a workgroup’s Z dimension |
ISA_SIZE | The size of the ISA code, in bytes. If your ISA size exceeds the size of the instruction cache, this will incur a performance penalty |
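The statistics file lends itself to automated checks, for example flagging register spilling across devices. The following Python sketch is a hedged illustration: it assumes the file is comma-separated with a header row that uses the attribute names from the table above, and the sample rows (including the device name ExampleDevice and all numeric values) are invented for the example, not real tool output. The `summarize` helper is hypothetical.

```python
import csv
import io
import math
from typing import Dict, List

# Invented sample data; only the column names come from the attribute table.
SAMPLE_STATS = """\
DEVICE,SCRATCH_REGS,THREADS_PER_WORKGROUP,WAVEFRONT_SIZE,AVAILABLE_VGPRs,USED_VGPRs
Hawaii,0,256,64,256,24
ExampleDevice,16,256,64,256,256
"""


def summarize(csv_text: str) -> List[Dict[str, object]]:
    """Derive a few per-device indicators from the statistics CSV."""
    summary = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        threads = int(row["THREADS_PER_WORKGROUP"])
        wavefront = int(row["WAVEFRONT_SIZE"])
        summary.append(
            {
                "device": row["DEVICE"],
                # Any scratch allocation means registers spilled out of VGPRs.
                "spills": int(row["SCRATCH_REGS"]) > 0,
                # A workgroup is executed as whole wavefronts, so round up.
                "wavefronts_per_workgroup": math.ceil(threads / wavefront),
                "vgpr_usage": int(row["USED_VGPRs"]) / int(row["AVAILABLE_VGPRs"]),
            }
        )
    return summary


rows = summarize(SAMPLE_STATS)
for entry in rows:
    status = "spills to scratch" if entry["spills"] else "no spilling"
    print(f'{entry["device"]}: {status}, VGPR usage {entry["vgpr_usage"]:.0%}')
```

To run this against a real output file, replace `io.StringIO(csv_text)` with an open file handle, and adjust the delimiter if your file uses a different one.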
In Q1 2017, AMD CodeXL Analyzer was replaced by Radeon GPU Analyzer.