Sub DWord Addressing (SDWA)
Sub DWord Addressing is a feature of the AMD GCN architecture that allows the efficient extraction of 8-bit and 16-bit values from a 32-bit register. Multiple small values can be packed into a 32-bit register, maximizing its utilization. Executing sub-dword operations efficiently can be difficult, however. For example, adding two 16-bit floats (halfs) resident in the most significant bits ([31:16]) of two registers requires the programmer to shift the inputs, store the shifted values into new registers (which increases register pressure), perform a regular addition, shift the output back into bits [31:16], and finally do a bitwise OR into the final output. The following code shows how this is implemented in GCN ISA (throughout this blog, all comparisons are against the ISA capabilities of gfx8: Fiji and Polaris):
# Optimized for low register pressure
v_lshrrev_b32 v3, 16, v1   # can take 4 or 8 bytes
v_lshrrev_b32 v4, 16, v2   # can take 4 or 8 bytes
v_add_f16     v2, v3, v4   # can take 4 bytes
v_lshlrev_b32 v2, 16, v2   # can take 4 or 8 bytes
v_or_b32      v0, v1, v2   # can take 4 or 8 bytes
The compiler generates between 20 and 36 bytes of binary for this section of the kernel. The sequence runs at 1/5th the rate of a native 16-bit floating-point add, uses 5 to 9 times the instruction cache of a native instruction, and requires two extra registers per code block.
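The shift-and-or sequence above can be mimicked in scalar code. Below is a minimal Python sketch (helper names are ours, not any GCN API), using `struct`'s half-precision format to emulate fp16 values packed into a 32-bit "register":

```python
import struct

def f16_to_bits(x: float) -> int:
    # encode a Python float as IEEE half-precision bits
    return struct.unpack('<H', struct.pack('<e', x))[0]

def bits_to_f16(b: int) -> float:
    # decode 16 bits as an IEEE half-precision value
    return struct.unpack('<e', struct.pack('<H', b & 0xFFFF))[0]

def pack2(lo: float, hi: float) -> int:
    # two halfs packed into one 32-bit "register"
    return f16_to_bits(lo) | (f16_to_bits(hi) << 16)

def add_high_halves(v1: int, v2: int) -> int:
    a = bits_to_f16(v1 >> 16)            # v_lshrrev_b32 v3, 16, v1
    b = bits_to_f16(v2 >> 16)            # v_lshrrev_b32 v4, 16, v2
    s = f16_to_bits(a + b)               # v_add_f16    v2, v3, v4
    return (v1 & 0xFFFF) | (s << 16)     # v_lshlrev_b32 + v_or_b32
```

Here the final step keeps the low half of `v1` and places the sum in the high half, which is the intended `v0` of the assembly sequence.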
We can use similar code to implement v_addpk_f16, a packed addition on 16-bit floating-point data held in a 32-bit register. For example,

v_addpk_f16 v0, v1, v2
# v0.x = v1.x + v2.x
# v0.y = v1.y + v2.y
# Optimized for low register pressure
v_lshrrev_b32 v3, 16, v1   # can take 4 or 8 bytes
v_lshrrev_b32 v4, 16, v2   # can take 4 or 8 bytes
v_add_f16     v1, v1, v2   # can take 4 bytes
v_add_f16     v2, v3, v4   # can take 4 bytes
v_lshlrev_b32 v2, 16, v2   # can take 4 or 8 bytes
v_or_b32      v0, v1, v2   # can take 4 or 8 bytes
The code above shows how v_addpk_f16 can be implemented without SDWA. It takes up 24 to 40 bytes of kernel binary and uses two extra registers. These shortcomings can be addressed with Sub-DWord Addressing.
Sub-DWord Addressing (SDWA) is an instruction modifier supported from gfx8 onwards (just like DPP). Like a DPP modifier, the SDWA modifier takes up an extra 32 bits appended to the instruction, with the following format:
Bits      Field
[31:30]   RESERVED
[29]      SRC1_ABS
[28]      SRC1_NEG
[27]      SRC1_SEXT
[26:24]   SRC1_SEL
[23:22]   RESERVED
[21]      SRC0_ABS
[20]      SRC0_NEG
[19]      SRC0_SEXT
[18:16]   SRC0_SEL
[13]      CLAMP
[12:11]   DST_UNUSED
[10:8]    DST_SEL
[7:0]     SRC0
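For illustration, the bit layout above can be captured in a small Python helper that assembles the SDWA dword. The field positions follow the table (gfx8), and the WORD_0/WORD_1 select encodings (4 and 5) are assumptions to verify against the GCN3 ISA guide before relying on them:

```python
# (shift, mask) per field, following the bit layout in the table above
SDWA_FIELDS = {
    'src0':       (0, 0xFF),
    'dst_sel':    (8, 0x7),
    'dst_unused': (11, 0x3),
    'clamp':      (13, 0x1),
    'src0_sel':   (16, 0x7),
    'src0_sext':  (19, 0x1),
    'src0_neg':   (20, 0x1),
    'src0_abs':   (21, 0x1),
    'src1_sel':   (24, 0x7),
    'src1_sext':  (27, 0x1),
    'src1_neg':   (28, 0x1),
    'src1_abs':   (29, 0x1),
}

# Assumed SEL encodings for the 16-bit word selects
WORD_0, WORD_1 = 4, 5

def encode_sdwa(**fields) -> int:
    # OR each named field value into its position in the SDWA dword
    dword = 0
    for name, value in fields.items():
        shift, mask = SDWA_FIELDS[name]
        assert value <= mask, f"{name} out of range"
        dword |= (value & mask) << shift
    return dword
```

For example, `encode_sdwa(dst_sel=WORD_1, src0_sel=WORD_1, src1_sel=WORD_1)` builds the modifier dword used in the v_add_f16_sdwa example later in this post.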
The bit fields control the addressing mode of the registers. Different addressing modes are possible, but this blog focuses on the 16-bit (word) addressing modes.
The instruction types that can be used with SDWA are vop1, vop2 and vopc, which are 32-bit encoded instructions; the SDWA modifier dword follows the instruction and selects the desired addressing mode. SDWA only allows vector registers for source and destination, which means no scalar registers and no immediate literals. The assembler provides a good interface for using SDWA without requiring the user to fabricate the ISA (binary) by hand.
Using SDWA, a 16-bit float addition using the most significant part of a 32-bit register can be accomplished as follows:
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
v_add_f16_sdwa means that we are performing an add on two 16-bit floats in SDWA mode.
dst_sel:WORD_1 means that we are storing the output in bits [31:16] of v0 (WORD_1; WORD_0 selects bits [15:0]).
dst_unused:UNUSED_PRESERVE means that the remaining destination bits are not zeroed after the store, so bits [15:0] of v0 are left untouched by this instruction.
src0_sel:WORD_1 means that we are using bits [31:16] of v1.
src1_sel:WORD_1 means that we are using bits [31:16] of v2.
This instruction is only 64 bits (8 bytes) in size and executes at the full rate of a native v_add_f16. Using SDWA saves instruction cache misses and lets the ALU operate at the full op rate: no extra registers are used, there is no drop in rate, and only 4 extra bytes are added to the instruction cache.
In several machine learning algorithms, training on fp16 data types has been effective in increasing the execution rate. In some cases the amount of training data is halved and the effective bandwidth is doubled. Of greatest interest is optimizing the most used fp16 ops in a kernel: add, mul and mad. This example shows an implementation of v_addpk_f16 that does a component-wise add on two fp16 values loaded into a 32-bit register:
# v_addpk_f16 v1, v2, v3
v_add_f16      v1, v2, v3
v_add_f16_sdwa v1, v2, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
That’s all it takes, and the rate of the instructions does not change, which means you get the same TFLOPS as a native v_add_f16. This piece of code occupies 32 + 64 bits (12 bytes) of instruction cache. dst_unused:UNUSED_PRESERVE means that the data already present in the unselected word of v1 is not touched (hence the word PRESERVE).
The next example is an implementation of v_madpk_f16, and it shows the limitations of SDWA.
# v_madpk_f16 v1, v2, v3, v4
v_mad_f16      v1, v2, v3, v4
v_mul_f16_sdwa v1, v2, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
v_add_f16_sdwa v1, v1, v4 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
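To make the two-step high-half mad concrete, here is a hedged Python model of the sequence above. It emulates fp16 with `struct`'s half-precision format and rounds after every step, which is a simplification; the exact rounding behaviour of v_mad_f16 on hardware may differ.

```python
import struct

def _h(x: float) -> float:
    # round a Python float to fp16 precision
    return struct.unpack('<e', struct.pack('<e', x))[0]

def _lane(v: int, hi: bool) -> float:
    # read one 16-bit lane of a packed 32-bit value as fp16
    bits = (v >> 16) if hi else (v & 0xFFFF)
    return struct.unpack('<e', struct.pack('<H', bits & 0xFFFF))[0]

def _bits(x: float) -> int:
    return struct.unpack('<H', struct.pack('<e', x))[0]

def pack2(lo: float, hi: float) -> int:
    return _bits(lo) | (_bits(hi) << 16)

def madpk_f16(v2: int, v3: int, v4: int) -> int:
    # low lane: v_mad_f16 (modelled with rounding after mul and after add)
    lo = _h(_h(_lane(v2, False) * _lane(v3, False)) + _lane(v4, False))
    # high lane: v_mul_f16_sdwa followed by v_add_f16_sdwa
    hi = _h(_lane(v2, True) * _lane(v3, True))
    hi = _h(hi + _lane(v4, True))
    return _bits(lo) | (_bits(hi) << 16)
```

Note that the high lane rounds between the multiply and the add, exactly because it is built from two separate instructions.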
Wait! Why not use v_mad_f16 with SDWA? v_mad_f16 is already a 64-bit encoded instruction (vop3), and vop3 is not one of the instruction types that support Sub-DWord Addressing (vopc, vop1, vop2). In other words, SDWA doesn’t apply here. Even though v_mac_f16 is a 32-bit encoded instruction (vop2), it only allows DWORD addressing for the destination register.
Let’s try one more example: swapping the results between the most significant and least significant parts of a packed add before storing them in the destination register.
# v_addpk_swap_f16 v0, v1, v2
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
# v_addpk_swap_f16 v0, v1, v2
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_0 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_0
SDWA operations seem nice, but how do they perform and how do you use them? In HIP, several SDWA math operations are implemented using the techniques described in this blog; a full list can be found in the HIP repository.
HIP is a portable, higher-level API with CUDA-like syntax that works on both AMD and NVIDIA GPUs. HIP’s module APIs can be used to load and run AMD GPU HSA code objects compiled offline. Using HIP, several performance tests of the SDWA math operations were created, using the following techniques:
Generating the AMD GCN ISA is important, as it lets us check that we are generating the ISA we intend to benchmark. MI-6 and MI-8 cards, which are gfx8 cards, were used to generate the performance numbers, summarized below.
The numbers show that v_addpk_f16 performs the same as v_add_f16 on MI-6 and MI-8 when SDWA is used, while the non-SDWA implementation reaches only 0.25x to 0.3x of that performance. v_madpk_f16 cannot achieve the same performance as v_mad_f16, as SDWA cannot operate on the most significant 16 bits of the 32-bit register in a single v_mad_f16 instruction; hence only a fraction of the v_mad_f16 throughput is achieved.