Sub-DWord Addressing is a feature of the AMD GCN architecture that allows the efficient extraction of 8-bit and 16-bit values from a 32-bit register. Multiple small values can be packed into a single 32-bit register, maximizing its utilization. Efficiently executing sub-DWord operations can be difficult, however. For example, adding two 16-bit floats (halfs) resident in the most significant bits ([31:16]) of two registers requires the programmer to shift the inputs, store the shifted values into new registers (which can cause register pressure), do a regular addition, shift the output back to bits [31:16], and finally do a bitwise OR to merge the final output.
The following code shows how this is implemented using GCN ISA (going forward in this blog, all comparisons are against the ISA capabilities of gfx8: Fiji and Polaris):
```
# Optimized for low register pressure
v_lshrrev_b32 v3, 16, v1   # can take 4 or 8 bytes
v_lshrrev_b32 v4, 16, v2   # can take 4 or 8 bytes
v_add_f16     v2, v3, v4   # can take 4 bytes
v_lshlrev_b32 v2, 16, v2   # can take 4 or 8 bytes
v_or_b32      v0, v1, v2   # can take 4 or 8 bytes
```
The compiler can generate between 20 and 36 bytes of binary for this section of the kernel. The performance is 1/5th the rate of a native 16-bit floating point add, and it uses 5 to 9 times the instruction cache of a native instruction. It also requires two extra registers per code block.
We use the above approach to implement `v_addpk_f16`, which is a packed addition on 16-bit floating point data present in a 32-bit register. For example,
```
v_addpk_f16 v0, v1, v2
# v0.x = v1.x + v2.x
# v0.y = v1.y + v2.y
```
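These packed semantics can be sketched in Python, using the `struct` module's half-precision (`'e'`) format to stand in for hardware fp16. The helper names are illustrative, and the rounding of each sum happens when the result is packed rather than in a native fp16 ALU:

```python
import struct

def unpack_halves(reg: int):
    """Split a 32-bit register value into its (lo, hi) fp16 components."""
    lo, hi = struct.unpack("<2e", struct.pack("<I", reg & 0xFFFFFFFF))
    return lo, hi

def pack_halves(lo: float, hi: float) -> int:
    """Pack two floats as fp16 halves of a 32-bit register (lo = WORD_0)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def addpk_f16(v1: int, v2: int) -> int:
    """Model of v_addpk_f16: v0.x = v1.x + v2.x, v0.y = v1.y + v2.y."""
    x1, y1 = unpack_halves(v1)
    x2, y2 = unpack_halves(v2)
    return pack_halves(x1 + x2, y1 + y2)
```

For instance, with `v1` holding (1.0, 2.0) as `0x40003C00` and `v2` holding (2.0, 4.0) as `0x44004000`, `addpk_f16` returns `0x46004200`, i.e. the packed pair (3.0, 6.0).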
```
# Optimized for low register pressure
v_lshrrev_b32 v3, 16, v1   # can take 4 or 8 bytes
v_lshrrev_b32 v4, 16, v2   # can take 4 or 8 bytes
v_add_f16     v1, v1, v2   # can take 4 bytes
v_add_f16     v2, v3, v4   # can take 4 bytes
v_lshlrev_b32 v2, 16, v2   # can take 4 or 8 bytes
v_or_b32      v0, v1, v2   # can take 4 or 8 bytes
```
The code above shows how `v_addpk_f16` can be implemented without SDWA. The code takes up 24 to 40 bytes of kernel binary and uses two extra registers. These shortcomings can be solved by using Sub-DWord Addressing.
About Instruction Type
Sub-DWord Addressing (SDWA) is another instruction modifier supported from gfx8 (just like DPP). Similar to DPP, the SDWA modifier takes up 32 bits, with the following format.
| Field      | Bits    |
|------------|---------|
| RESERVED   | [31:30] |
| SRC1_ABS   | [29]    |
| SRC1_NEG   | [28]    |
| SRC1_SEXT  | [27]    |
| SRC1_SEL   | [26:24] |
| RESERVED   | [23:22] |
| SRC0_ABS   | [21]    |
| SRC0_NEG   | [20]    |
| SRC0_SEXT  | [19]    |
| SRC0_SEL   | [18:16] |
| RESERVED   | [15:14] |
| CLAMP      | [13]    |
| DST_UNUSED | [12:11] |
| DST_SEL    | [10:8]  |
| SRC0       | [7:0]   |
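As a sketch of how these fields pack together, the SDWA control word can be assembled in Python. The field positions follow the layout above; the selector values (WORD_0 = 4, WORD_1 = 5, DWORD = 6) are taken from the GCN3 ISA manual, and the helper name is our own:

```python
# Illustrative encoder for the 32-bit SDWA control word.
WORD_0, WORD_1, DWORD = 4, 5, 6  # "sel" encodings from the GCN3 ISA manual

def encode_sdwa(src0, dst_sel=DWORD, dst_unused=0, clamp=0,
                src0_sel=DWORD, src0_sext=0, src0_neg=0, src0_abs=0,
                src1_sel=DWORD, src1_sext=0, src1_neg=0, src1_abs=0) -> int:
    word = src0 & 0xFF                  # SRC0       [7:0]
    word |= (dst_sel & 0x7) << 8        # DST_SEL    [10:8]
    word |= (dst_unused & 0x3) << 11    # DST_UNUSED [12:11]
    word |= (clamp & 0x1) << 13         # CLAMP      [13]
    word |= (src0_sel & 0x7) << 16      # SRC0_SEL   [18:16]
    word |= (src0_sext & 0x1) << 19     # SRC0_SEXT  [19]
    word |= (src0_neg & 0x1) << 20      # SRC0_NEG   [20]
    word |= (src0_abs & 0x1) << 21      # SRC0_ABS   [21]
    word |= (src1_sel & 0x7) << 24      # SRC1_SEL   [26:24]
    word |= (src1_sext & 0x1) << 27     # SRC1_SEXT  [27]
    word |= (src1_neg & 0x1) << 28      # SRC1_NEG   [28]
    word |= (src1_abs & 0x1) << 29      # SRC1_ABS   [29]
    return word
```

In practice the assembler builds this word for you from the `dst_sel:` / `src0_sel:` / `src1_sel:` syntax shown later in this post.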
The bit fields control the addressing mode of the registers. This blog focuses on the 16-bit addressing mode, but different addressing modes are possible.
The instruction types that can be used with SDWA are vop1, vop2, and vopc, which are 32-bit encoded instructions; the 32-bit SDWA word follows the instruction word to enable the desired addressing mode. SDWA only allows vector registers for source and destination, which means no scalar registers and no immediate literals. The assembler provides a good interface for using SDWA without requiring the user to fabricate the ISA (binary) by hand.
Using SDWA, a 16-bit float addition using the most significant part of a 32-bit register can be accomplished as follows:
```
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
```
In the instruction above:

- `v_add_f16_sdwa` means that we are performing an add on two 16-bit floats in SDWA mode.
- `dst_sel:WORD_1` means that we are storing the output in bits [31:16] of `v0` (WORD_1; WORD_0 means bits [15:0]).
- `dst_unused:UNUSED_PRESERVE` means that the remaining destination bits are not zeroed after the store, so bits [15:0] of `v0` are left untouched by this instruction.
- `src0_sel:WORD_1` means that we are using bits [31:16] of `v1`.
- `src1_sel:WORD_1` means that we are using bits [31:16] of `v2`.
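Put together, the effect of this particular selector combination can be modeled in Python (a sketch with illustrative helper names; `struct`'s half-precision `'e'` format stands in for hardware fp16):

```python
import struct

def f16_val(bits: int) -> float:
    """Decode 16 raw bits as an fp16 value."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def f16_bits(x: float) -> int:
    """Round a float to fp16 and return its 16 raw bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def add_f16_sdwa_hi(v0: int, v1: int, v2: int) -> int:
    """Model of: v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1
    dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1"""
    a = f16_val(v1 >> 16)               # src0_sel:WORD_1 reads v1[31:16]
    b = f16_val(v2 >> 16)               # src1_sel:WORD_1 reads v2[31:16]
    hi = f16_bits(a + b)                # dst_sel:WORD_1 writes v0[31:16]
    return (hi << 16) | (v0 & 0xFFFF)   # UNUSED_PRESERVE keeps v0[15:0]
```

Note how the old low half of `v0` passes through unchanged, which is exactly what makes the packed-math sequences below possible.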
The size of this instruction is only 64 bits (8 bytes), and it executes at 1x the rate of a native `v_add_f16`. Using SDWA saves instruction cache misses and lets the ALU operate at the full op rate: no extra registers are used, there is no drop in rate, and only 4 extra bytes of instruction cache are consumed.
In several machine learning algorithms, training on fp16 data has been effective in increasing the execution rate. In some cases the amount of training data is reduced by half and the effective bandwidth is doubled. Of greatest interest is optimizing the most used fp16 ops in a kernel: add, mul, and mad. This example shows an implementation of `v_addpk_f16` that does a component-wise add on two fp16 values loaded into a 32-bit register.
```
# v_addpk_f16 v1, v2, v3
v_add_f16 v1, v2, v3
v_add_f16_sdwa v1, v2, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
```
That’s all it takes, and the rate of the instructions does not change, which means you get the same TFLOPs as a native `v_add_f16`. Also, this piece of code takes only 32 + 64 bits (12 bytes) of instruction cache. `dst_unused:UNUSED_PRESERVE` means that the data already present in the unselected part of the destination register is not touched (hence the word PRESERVE), so the low-half result written by the first add survives the second.
The next example is an implementation of `v_madpk_f16`, and shows the limitations of SDWA.
```
# v_madpk_f16 v1, v2, v3, v4
v_mad_f16 v1, v2, v3, v4
v_mul_f16_sdwa v1, v2, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
v_add_f16_sdwa v1, v1, v4 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
```
Wait! Why not use `v_mad_f16_sdwa`? `v_mad_f16` is already a 64-bit encoded instruction (vop3); it isn’t one of the instruction types that support Sub-DWord Addressing: vopc, vop1, vop2. In other words, SDWA doesn’t apply here. Even though `v_mac_f16` is a 32-bit encoded instruction (vop2), it only allows DWORD addressing for the destination register.
Let’s try one more example: swapping the results between the most significant part and the least significant part of a packed add before storing them in the destination register.
```
# v_addpk_swap_f16 v0, v1, v2
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:WORD_1 src1_sel:WORD_1
```
```
# v_addpk_swap_f16 v0, v1, v2
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_0 dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
v_add_f16_sdwa v0, v1, v2 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_0
```
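The intended result of either sequence can be modeled in Python (a sketch with illustrative names; `struct`'s `'e'` format stands in for hardware fp16 — note the sum of the low halves lands in WORD_1 of the destination, and vice versa):

```python
import struct

def f16_val(bits: int) -> float:
    """Decode 16 raw bits as an fp16 value."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def f16_bits(x: float) -> int:
    """Round a float to fp16 and return its 16 raw bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def addpk_swap_f16(v1: int, v2: int) -> int:
    """Packed fp16 add with halves swapped in the destination:
    v0[31:16] gets the WORD_0 sum, v0[15:0] gets the WORD_1 sum."""
    lo_sum = f16_bits(f16_val(v1) + f16_val(v2))                # WORD_0 inputs
    hi_sum = f16_bits(f16_val(v1 >> 16) + f16_val(v2 >> 16))    # WORD_1 inputs
    return (lo_sum << 16) | hi_sum      # swapped placement in v0
```

With `v1 = 0x40003C00` (packed 1.0, 2.0) and `v2 = 0x44004000` (packed 2.0, 4.0), the model returns `0x42004600`: the low-half sum 3.0 in WORD_1 and the high-half sum 6.0 in WORD_0.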
SDWA operations seem nice, but how do they perform, and how do you use them? In HIP, several SDWA math operations are implemented using the techniques described in this blog. A full list can be found here: Link.
HIP is a portable higher-level API with syntax similar to CUDA that works on both AMD and NVIDIA GPUs. HIP’s Module APIs can be used to load and run AMD GPU HSA code objects compiled offline. Using HIP, several performance tests of the SDWA math operations were created with the following techniques:
- The kernels for the performance tests were written in LLVM IR.
- LLVM (which is open source and fully supported by ROCm) was used to compile the IR to AMD GCN ISA.
- Clang was used to compile the ISA to an HSA code object.
- Finally, HIP’s Module API loaded the code object and launched the kernel.
Generating AMD GCN ISA is important, as it lets us check that we are generating the ISA we intend to benchmark. MI-6 and MI-8 cards, which are gfx8 cards, were used to generate the performance numbers. The following table presents the results:
The numbers in the table above show that `v_addpk_f16` implemented with SDWA performs the same as `v_add_f16` on MI-6 and MI-8; the non-SDWA implementation reaches only 0.25x to 0.3x of that performance. `v_madpk_f16` cannot achieve the same performance as `v_mad_f16`, since the multiply-add cannot operate on the most significant word of a 32-bit register using SDWA; hence only a fraction of the `v_mad_f16` throughput is achieved.