First steps when implementing FP16

Half-precision (FP16) computation is a performance-enhancing GPU technology long exploited in console and mobile development, but not previously available or widely used on mainstream PC hardware. With the advent of AMD’s Vega GPU architecture, this technology is now easily accessible for boosting graphics performance in mainstream PC development.

The latest iteration of the GCN architecture allows you to pack 2x FP16 values into each 32-bit VGPR register. This enables you to:

  • Halve ALU operations by using new data-parallel instructions.
  • Reduce the VGPR footprint of a shader, leading to a potential increase in occupancy and performance.

There are also some minor risks:

  • Poor use of FP16 can result in excessive conversion between FP16 and FP32. This can reduce the performance advantage.
  • FP16 slightly increases code complexity and maintenance cost.

Getting started

It is tempting to assume that implementing FP16 is as simple as merely substituting the ‘half’ type for ‘float’. Alas not: this simply doesn’t work on PC. The DirectX® FXC compiler only offers half for backwards compatibility; it maps it onto float. If you compare the generated bytecode, it is identical.

The correct types to use are the standard HLSL types prefixed with min16: min16float, min16int and min16uint. These can be used as scalar or vector types in the usual fashion.

Your development environment will need specific software to successfully generate FP16 code. First of all, you need Windows 8 or later. Older versions of Windows will simply fail to create shaders that use min16float. Whilst there is a Platform Update available for Windows 7 which enables FP16 shaders to compile, it simply compiles the code as FP32. In practice you are emulating the absent hardware support, and the resulting code may be less efficient. It may therefore be worthwhile providing an alternative code path or set of shaders for hardware and operating systems lacking FP16 support.

Secondly, you will need up-to-date versions of the FXC compiler and the driver compiler. The FXC compiler in the Windows 10 SDK will suffice, and Radeon Crimson driver version 17.9.1 or later is required.

Thirdly, it is worth clarifying that FP16 works with DirectX 11.1 and Shader Model 5 code; DirectX 12 is not required. Simply add a run-time test that queries D3D11_FEATURE_SHADER_MIN_PRECISION_SUPPORT via ID3D11Device::CheckFeatureSupport().

And most importantly, compatible hardware is required: an AMD RX Vega or Vega Frontier Edition GPU!

Recommendation: Pre-processor Support

Whilst min16float is perfectly legal HLSL syntax and therefore fine to use as-is, I would caution against using it directly. I find it better to implement pre-processor support to globally include or remove the use of min16float and therefore FP16. There are two reasons for this:

  1. You need to be able to strip it all out for compatibility with older operating systems or incompatible hardware.
  2. It provides a convenient way to perform A/B testing for performance or correctness.
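As a minimal sketch of such pre-processor support (the ENABLE_FP16 macro and the hfloat aliases are hypothetical names, not part of any SDK):

```hlsl
// Hypothetical convenience aliases: compile with ENABLE_FP16 defined to
// emit min16float types, or undefined to fall back to plain FP32 floats.
#if ENABLE_FP16
    #define hfloat  min16float
    #define hfloat2 min16float2
    #define hfloat3 min16float3
    #define hfloat4 min16float4
#else
    #define hfloat  float
    #define hfloat2 float2
    #define hfloat3 float3
    #define hfloat4 float4
#endif

// Shader code then uses the aliases throughout:
hfloat4 Tint( hfloat4 colour, hfloat brightness )
{
    return colour * brightness;
}
```

Compiling the same source with and without the macro defined gives both the compatibility path and a convenient A/B comparison.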

Example Output

With all these tools in place, what does compiled FP16 code look like? Let’s write a trivial test function:


cbuffer params
{
    min16float4 colour;
};

Texture2D<min16float4> tex;
SamplerState samp;

min16float4 test( in min16float2 uv : TEXCOORD0 ) : SV_Target
{
    return colour * tex.Sample( samp, uv );
}

The first step of verifying FP16 functionality is to look at the FXC .asm output. The driver cannot compile FP16 code unless it is given the correct bytecode from DirectX. Here we see the compiler has introduced a series of {min16f} suffixes:


ps_5_0
dcl_globalFlags refactoringAllowed | enableMinimumPrecision
dcl_constantbuffer CB0[1], immediateIndexed
dcl_sampler s0, mode_default
dcl_resource_texture2d (float,float,float,float) t0
dcl_input_ps linear v0.xy {min16f}
dcl_output o0.xyzw {min16f}
dcl_temps 1
sample_indexable(texture2d)(float,float,float,float) r0.xyzw {min16f}, v0.xyxx {min16f}, t0.xyzw, s0
mul o0.xyzw {min16f}, r0.xyzw {min16f}, cb0[0].xyzw {min16f}
ret

Now we turn to the ISA output. There are typically two major classes of instruction to look for:

  • Packed FP16 instructions such as v_pk_add/mul/sub/mad_f16
  • Mix instructions, such as v_mad_mix_f32/v_mad_mixlo_f16/v_mad_mixhi_f16

Instructions such as v_pk_add/mul/sub_f16 perform an ALU operation on two FP16 values at once, halving the instructions needed and your ALU time. This gives one of the primary performance advantages of FP16.

The mix modifiers allow you to freely mix FP16 and FP32 operands in one VOP3 instruction, without requiring an additional conversion instruction. The cost of using these instructions is the lost opportunity to issue a packed instruction. A mix instruction is therefore neither faster nor slower than the equivalent FP32 instruction you would otherwise have issued.

Note that the specific form of the mix instruction is a multiply-add instruction. The compiler can use this to implement most arithmetic operations with creative use of 0, 1 or -1 constants. However, commonly encountered shader ALU operations, such as min() or max() , cannot be performed using a mix instruction.

Here is the GCN ISA output for the above shader:


shader main
asic(GFX9)
type(PS)

s_mov_b32     m0, s20
s_mov_b64     s[22:23], exec
s_wqm_b64     exec, exec
s_setreg_imm32_b32  hwreg(HW_REG_MODE, 0, 8), 0x000001cc
v_interp_p1ll_f16  v2, v0, attr0.x
v_interp_p1ll_f16  v0, v0, attr0.y
v_interp_p2_f16  v2, v1, attr0.x, v2
v_interp_p2_f16  v2, v1, attr0.y, v0 op_sel:[0,0,0,1]
image_sample  v[0:3], v[2:4], s[4:11], s[12:15] dmask:0xf a16 d16
s_buffer_load_dwordx4  s[0:3], s[16:19], 0x00
s_waitcnt     lgkmcnt(0)
v_mov_b32     v2, s1
v_cvt_pkrtz_f16_f32  v2, s0, v2
v_mov_b32     v3, s3
v_cvt_pkrtz_f16_f32  v3, s2, v3
s_setreg_imm32_b32  hwreg(HW_REG_MODE, 0, 8), 0x000001c0
s_waitcnt     vmcnt(0)
v_pk_mul_f16  v0, v0, v2 op_sel_hi:[1,1]
v_pk_mul_f16  v1, v1, v3 op_sel_hi:[1,1]
v_mov_b32     v2, v0 src0_sel: WORD_0
v_mov_b32     v0, v0 src0_sel: WORD_1
v_mov_b32     v3, v1 src0_sel: WORD_0
v_mov_b32     v1, v1 src0_sel: WORD_1
s_mov_b64     exec, s[22:23]
v_lshl_or_b32  v0, v0, 16, v2
v_lshl_or_b32  v1, v1, 16, v3
exp           mrt0, v0, v0, v1, v1 done compr vm
s_endpgm
end

This output illustrates a couple of interesting points. Firstly, the compiler has successfully introduced some v_pk_mul_f16 instructions. Instead of the usual four v_mul_f32 ops required to multiply a float4 by a scalar, we’ve halved that to two v_pk_mul_f16 ops.

Secondly, consider the two v_cvt_pkrtz instructions. Each of these takes two FP32 source values and packs them into two FP16 values in a single 32-bit destination register. It does this to form the min16float4 in the cbuffer. It is surprising that despite using the correct type, the compiler has not generated the simple load we may have expected. We will return to this issue later.

Recommendation: Radeon GPU Analyzer

AMD offers an extremely powerful software tool known as Radeon GPU Analyzer (RGA). This tool is an interface to the driver compiler which allows you to directly see the resulting code. RGA accepts shader source or intermediates from all main graphics APIs. The user specifies which generation of GCN GPU to target, and the tool can output a number of analyses, including but not limited to ISA output and register usage analysis.

I consider RGA invaluable for FP16 work. We have integrated this tool into our tool chain so that we can obtain ISA output or register analysis immediately after compilation. I iterate on the ISA output until I have satisfactory code, and then test it for performance and correctness. Whilst some GPU capture tools now offer ISA disassembly, this is a far more productive method of working.

FP16 Target Selection

It is critical to choose your targets very carefully. Not all code is a suitable candidate for FP16 optimisation. The ideal target:

  • Is compatible with the precision limitations of FP16.
  • Offers good scope for data parallelism.
  • Is fully or partially bound by:
    • The quantity of ALU operations, or
    • Overall VGPR register footprint.

Data parallelism typically comes in two forms. Packed instructions can easily be used on code employing 2-, 3- or 4-component vectors. Alternatively, strictly scalar code can be made suitable for packed instructions by unrolling the loop manually and working on pairs of data.
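The latter technique can be sketched as follows (a hypothetical helper: the weights/samples arrays and the count of 8 are illustrative, not taken from the article):

```hlsl
// Working on pairs lets the compiler emit packed v_pk_* instructions
// instead of leaving half of each FP16 operation idle.
min16float WeightedSum( min16float weights[8], min16float samples[8] )
{
    min16float2 sum2 = min16float2( 0.0, 0.0 );
    [unroll]
    for ( int i = 0; i < 8; i += 2 )
    {
        min16float2 w = min16float2( weights[i], weights[i + 1] );
        min16float2 s = min16float2( samples[i], samples[i + 1] );
        sum2 += w * s;
    }
    return sum2.x + sum2.y;
}
```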

Common Targets

A reliable target for FP16 optimisation is the blending of colour and normal maps. These operations are typically heavy on data-parallel ALU operations. What’s more, such data frequently originates from a low-precision texture and therefore fits comfortably within FP16’s limitations. A typical game frame has a plentiful supply of these operations in gbuffer export and post-process shaders, all ripe for optimisation.
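For example, a simple detail-map modulate done entirely in FP16 might look like this (the texture names and scale factors are illustrative):

```hlsl
Texture2D<min16float4> baseTex;
Texture2D<min16float4> detailTex;
SamplerState           samp;

min16float4 BlendDetail( min16float2 uv )
{
    // Both sources are low-precision texture data, comfortably within
    // FP16's limits, so the whole blend can stay in packed FP16.
    min16float4 base   = baseTex.Sample( samp, uv );
    min16float4 detail = detailTex.Sample( samp, uv * min16float( 4.0 ) );
    return base * detail * min16float( 2.0 );
}
```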

BRDFs are an attractive but difficult candidate. The portion of a BRDF that computes specular response is typically very register- and ALU-intensive. This would seem a promising target. However, caution must be exercised. BRDFs typically contain exponent and division operations. There are currently no FP16 instructions for these operations. This means that at best there will be no parallelisation of those operations; at worst it will introduce conversion overhead between FP16 and FP32.

All is not lost. There is a suitable optimisation candidate in the typical BRDF equation: the large number of vectors and dot products usually present. Whilst an individual dot product is more a data reduction operation than a data parallel operation, many dot products can be performed in parallel using SIMD code. These dot products often feed back into FP32 BRDF code, so care must be taken not to introduce FP16 to FP32 conversion overhead that exceeds the gains made.
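As a sketch of this idea (the function name is hypothetical), two of the dot products common in BRDF code can be evaluated side by side in packed form:

```hlsl
// Evaluate dot(n, l) and dot(n, v) together by pairing components,
// so each multiply-add can map to one packed FP16 instruction.
min16float2 TwoDots( min16float3 n, min16float3 l, min16float3 v )
{
    min16float2 r = min16float2( n.x, n.x ) * min16float2( l.x, v.x );
    r += min16float2( n.y, n.y ) * min16float2( l.y, v.y );
    r += min16float2( n.z, n.z ) * min16float2( l.z, v.z );
    return r; // r.x = dot(n, l), r.y = dot(n, v)
}
```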

Finally, TAA or checker-boarding systems offer strong potential for optimisation alongside surprising risks. These systems perform a great deal of colour processing, and ALU can indeed be the primary bottleneck. UV calculations often consume much of this ALU work. It is tempting to assume these screen-space UVs are well within the limits of FP16. Surprisingly, the combination of small pixel velocities and high resolutions such as 4K can cause artefacts when using FP16. Exercise care when optimising similar code.

Constants

The most efficient way to write FP16 code is to supply it with FP16 constant data. Any use of FP32 constant data will invoke a conversion operation. Constant data typically occurs in two forms: cbuffer values and literals.

In an ideal world, there would be an FP16 version of every cbuffer value available for use. In practice, it is often possible to obtain a performance advantage just using FP32 cbuffer data. It depends on how frequently a constant is used. If a constant is used only once or twice it is no slower to simply use a mix instruction. If a constant is used more widely, or on vectors, it is usually more efficient to provide an FP16 cbuffer value. Clearly, larger types such as vectors or matrices should be supplied as native FP16 data as the conversion overhead would be prohibitive.

The second source of constant data is the use of literal values in the shader. It is tempting to assume that using the h suffix would be sufficient to introduce an FP16 constant. It isn’t. Again, the half type is for backwards compatibility and FXC converts it to an FP32 literal. Using either the h or f suffix will result in a conversion. It is better to use the unadorned literal, such as 0.0, 1.5 and so on. Generally, the compiler is able to automatically encode that literal as FP32 or FP16 as appropriate according to context.

One exception is expanding literals for use in an operation with a vector. Sometimes the compiler is unable to expand the literal to a min16float3 automatically. In this case, you must either manually construct a min16float3, or use syntax such as 1.5.xxx.
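For example (illustrative only):

```hlsl
min16float3 Scale( min16float3 v )
{
    // May fall back to FP32: the compiler cannot always expand a bare
    // scalar literal to a min16float3 automatically.
    // return v * 1.5;

    // Explicit construction keeps the operation in FP16:
    return v * min16float3( 1.5, 1.5, 1.5 );
    // ...or equivalently: return v * 1.5.xxx;
}
```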

Loading FP16 data

Recall the earlier example code snippet. Whilst the compiler emitted the expected v_pk_mul_f16 operations, it didn’t emit the code sequence you might expect to load a min16float4 from memory. It loaded FP32 values and packed them down to an FP16 vector manually. If you were to access a larger type, such as a min16float4x4 matrix, the code sequence would be very sub-optimal. There is an easy solution. If we change the source code to:


cbuffer params
{
    uint2 packedColour;
};

Texture2D<min16float4> tex;
SamplerState samp;

min16float2 UnpackFloat16( uint a )
{
    float2 tmp = f16tof32( uint2( a & 0xFFFF, a >> 16 ) );
    return min16float2( tmp );
}

min16float4 UnpackFloat16( uint2 v )
{
    return min16float4( UnpackFloat16( v.x ), UnpackFloat16( v.y ) );
}

min16float4 test( in min16float2 uv : TEXCOORD0 ) : SV_Target
{
    min16float4 colour = UnpackFloat16( packedColour );
    return colour * tex.Sample( samp, uv );
}

The driver recognises this code sequence, and issues a much more efficient sequence of instructions:


shader main
asic(GFX9)
type(PS)

s_mov_b32     m0, s20
s_mov_b64     s[2:3], exec
s_wqm_b64     exec, exec
s_setreg_imm32_b32  hwreg(HW_REG_MODE, 0, 8), 0x000001cc
v_interp_p1ll_f16  v2, v0, attr0.x
v_interp_p1ll_f16  v0, v0, attr0.y
v_interp_p2_f16  v2, v1, attr0.x, v2
v_interp_p2_f16  v2, v1, attr0.y, v0 op_sel:[0,0,0,1]
image_sample  v[0:3], v[2:4], s[4:11], s[12:15] dmask:0xf a16 d16
s_buffer_load_dwordx2  s[0:1], s[16:19], 0x00
s_setreg_imm32_b32  hwreg(HW_REG_MODE, 0, 8), 0x000001c0
s_waitcnt     vmcnt(0) & lgkmcnt(0)
v_pk_mul_f16  v0, v0, s0 op_sel_hi:[1,1]
v_pk_mul_f16  v1, v1, s1 op_sel_hi:[1,1]
v_mov_b32     v2, v0 src0_sel: WORD_0
v_mov_b32     v0, v0 src0_sel: WORD_1
v_mov_b32     v3, v1 src0_sel: WORD_0
v_mov_b32     v1, v1 src0_sel: WORD_1
s_mov_b64     exec, s[2:3]
v_lshl_or_b32  v0, v0, 16, v2
v_lshl_or_b32  v1, v1, 16, v3
exp           mrt0, v0, v0, v1, v1 done compr vm
s_endpgm
end

Finally, it is useful to embed FP16 constants at the end of the cbuffer rather than mix them alongside FP32 constants. This makes it much easier to strip away FP16 constants for the non-FP16 compatibility path, causing minimal effect on cbuffer size, layout and member alignment for both C++ and shader code.
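A sketch of that layout (the member names and the ENABLE_FP16 macro are hypothetical):

```hlsl
cbuffer params
{
    // FP32 constants first: this part of the layout is identical on
    // both code paths, for both C++ and shader code.
    float4x4 worldViewProj;
    float4   lightDirection;

#if ENABLE_FP16
    // Packed FP16 constants last: trivially stripped for the
    // compatibility path without disturbing the members above.
    uint2    packedTint;
#endif
};
```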

It’s worth noting that Shader Model 6.2 supports 16-bit scalar types for all memory operations, meaning the above issue will eventually go away!

Challenges

FP16 optimisation typically encounters two main problems:

  • Conversion overhead between FP16 and FP32.
  • Code complexity.

At present, FP16 is typically introduced to a shader retrospectively to improve its performance. The new FP16 code requires conversion instructions to integrate and coexist with FP32 code. The programmer must take care to ensure these instructions do not equal or exceed the time saved. It is important to keep large blocks of computation purely FP16 or purely FP32 in order to limit this overhead. Indeed, shaders such as post-process or gbuffer exports can run entirely in FP16.

This leads us to the final point. FP16 adds a little extra complexity to shader code. This article has outlined issues such as minimising conversion overhead, the special code to unpack FP16 data, and maintaining a non-FP16 code path. Whilst these issues are easily overcome, they make the code take a little more effort to write and maintain. It is important to remember that the reward is very worthwhile.

Conclusion

FP16 is a valuable additional tool in the programmer’s toolbox for obtaining peak shader performance. We have observed gains of around 10% on AMD RX Vega hardware. This is an attractive and lasting return for a moderate investment of engineering effort.

Resources

Radeon™ GPU Analyzer

Radeon GPU Analyzer is an offline compiler and performance analysis tool for DirectX®, Vulkan®, SPIR-V™, OpenGL® and OpenCL™.

AMD ISA Documentation

Instruction Set Architecture (ISA) documentation provides a guide for directly accessing the hardware.

Other guest posts by Tom Hammersley
