Getting the Most Out of Delta Color Compression

Originally posted: March 14, 2016

Chris Brennan

Update - RDNA/RDNA2

The information below still applies in general for RDNA/RDNA2 architectures, with the exception of when decompression happens.

Bandwidth is always a scarce resource on a GPU. On one hand, hardware has improved dramatically with the introduction of ever faster memory standards like HBM; on the other hand, games render at higher resolutions, with larger buffers and more data, which eats up a lot of bandwidth. Much of this bandwidth goes to reading and writing render targets. What had not been exploited is the fact that render targets tend to store slowly varying data. For example, a sky will be blue, with little variance, yet the GPU would treat every pixel independently, as if each contained a unique, unrelated value.

This has recently changed with the introduction of Delta Color Compression, or DCC for short. DCC is a domain-specific compression scheme that takes advantage of this data coherence to reduce the required bandwidth. It is lossless and in many ways similar to typical compressors, but adapted for 3D rendering. The key idea is to process whole blocks instead of individual pixels. Inside a block, only one value is stored with full precision, and the rest are stored as deltas, hence the name. If the colors are similar, the delta values need far fewer bits than the input. DCC is enabled on discrete GPUs and APUs based on GCN 1.2 or later. The actual hardware implementation is quite a bit fancier than what I just described, though. For instance, it adjusts its block size based on access patterns (and the data itself) to optimize for potentially random accesses.
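To make the "one full-precision value plus deltas" idea concrete, here is a minimal Python sketch. It is a toy model only and not the hardware algorithm: the function names, the single-channel 8-value block, and the shared delta width are all assumptions made for illustration.

```python
def bits_for_signed(value):
    """Smallest two's-complement width (in bits) that can represent value."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


def delta_encode_block(block, full_bits=8):
    """Encode one block as (base, deltas, encoded_bits).

    The first value is kept at full precision; every other value is stored
    as a signed difference from it, all using one shared bit width.
    """
    base = block[0]
    deltas = [v - base for v in block[1:]]
    width = max((bits_for_signed(d) for d in deltas), default=0)
    encoded_bits = full_bits + width * len(deltas)
    return base, deltas, encoded_bits


def delta_decode_block(base, deltas):
    """Losslessly rebuild the original block."""
    return [base] + [base + d for d in deltas]


sky = [200, 201, 200, 202, 201, 200, 199, 200]   # slowly varying values
noise = [12, 240, 3, 180, 77, 255, 0, 131]       # unrelated values

for name, block in (("sky", sky), ("noise", noise)):
    base, deltas, encoded_bits = delta_encode_block(block)
    assert delta_decode_block(base, deltas) == block  # lossless round trip
    raw_bits = 8 * len(block)
    print(name, "raw bits:", raw_bits, "delta-encoded bits:", encoded_bits)
```

In this toy model the coherent block shrinks to well under half its raw size, while the noisy block ends up larger than the raw data, which is why any practical lossless scheme also needs an uncompressed fallback.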

The new compressor is inside the Color Block which allows graphics to compress color render targets similarly to the way that the Depth Block has been compressing depth and stencil targets. This means that there’s no special setup required: if the target surface is compressed, the rendering just goes through the compressor. Otherwise, the pipeline is not impacted.

Compressing is only half of the story, as data is typically read much more often than written. To get the bandwidth-saving benefits there as well, the shader core has been given the ability to read the new compressed color as well as all the existing compressed surfaces. This allows decompress operations to be skipped entirely for render-to-texture scenarios: a barrier which transitions from render target to texture is effectively a no-op and does not trigger a costly decompression.

Dos and Don’ts

While DCC is a “transparent” feature in the sense that it doesn’t require a specific setup by developers, there are still a few caveats to be aware of to make the best use of it.

1) Clear to 0.0s and 1.0s

Clearing to common values such as 0.0s and 1.0s will be much faster and save more bandwidth than arbitrary values.

- Color targets handle opaque or transparent black or white most easily, so use {1.0, 0.0, 0.0, 0.0}, {0.0, 1.0, 1.0, 1.0}, or all 0.0s or 1.0s for the clear color of ARGB surfaces. Any combination of 1.0s and 0.0s works best for two-channel formats.
- Use 0.0 or 1.0 for Depth.
- Use 0x00 for Stencil.
- Clear only when you know you need to; if you will definitely write to all of the surface, there’s usually no need.
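These rules of thumb are easy to encode as a debug-time check in engine tooling. The Python sketch below is only an illustration of the guidance above: the helper names are hypothetical, and the exact set of clear values that a given driver and ASIC treat as fast depends on the hardware and surface format.

```python
def is_preferred_color_clear(channels):
    """True if every channel of the clear color is exactly 0.0 or 1.0."""
    return all(c in (0.0, 1.0) for c in channels)


def is_preferred_depth_clear(depth):
    """True if the depth clear value is exactly 0.0 or 1.0."""
    return depth in (0.0, 1.0)


def is_preferred_stencil_clear(stencil):
    """True if the stencil clear value is 0x00."""
    return stencil == 0x00


# Opaque black in ARGB order follows the guidance; an arbitrary tint does not.
print(is_preferred_color_clear((1.0, 0.0, 0.0, 0.0)))
print(is_preferred_color_clear((1.0, 0.2, 0.3, 0.4)))
```

A check like this could log a warning whenever a render pass requests a clear value outside the preferred set, so that accidental arbitrary clear colors are caught early.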

2) Don’t flag render targets as shader-readable unless you really need them to be

Shader-readable targets are not as well compressed as when it is known that the shader will not read them. As mentioned previously, the shader cores have been extended to give them access to compressed data directly, but there are additional compression options which are only available if read access from the shader cores is prohibited. MSAA depth targets suffer the most from being flagged as “shader resource” if they don’t need to be because they otherwise compress extremely well.

3) Try 32-bit floating point depth buffer formats (D32F) instead of 16-bit (D16) for better performance

D32F targets may actually compress to a smaller size than D16 targets when used as shader resources, and they compress exactly the same way when not shader-readable. The two formats differ only in allocation size and in bandwidth when decompressed, which typically isn’t too frequent (but may happen when a dense mesh with many micro-triangles is rendered into a small screen-space area). D32F also allows you to use reverse Z for added precision, so that can be leveraged for nearly free. Keep in mind that on GCN, there’s no such thing as a real 24-bit depth target. Under the hood, those are handled as 32-bit, just with 8 bits of precision thrown away, so there’s no cost in switching from D24 to D32 targets.
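The precision benefit of reverse Z with a floating point depth buffer can be illustrated with a small Python sketch. It assumes an infinite far plane, a near plane at 0.1, and two surfaces 0.01 units apart at a distance of 500; the function names and the chosen numbers are illustrative assumptions, and the depth is computed in double precision before being rounded to 32 bits, which a real GPU pipeline would not do.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def standard_depth(view_z, near):
    """Conventional mapping: near plane -> 0.0, infinity -> 1.0."""
    return to_float32(1.0 - near / view_z)


def reversed_depth(view_z, near):
    """Reverse Z mapping: near plane -> 1.0, infinity -> 0.0."""
    return to_float32(near / view_z)


near = 0.1
surface_a, surface_b = 500.0, 500.01  # two distant surfaces, 0.01 apart

# Conventional Z: distant depths crowd just below 1.0, where floats are coarse.
standard_collides = standard_depth(surface_a, near) == standard_depth(surface_b, near)

# Reverse Z: distant depths sit near 0.0, where floats are densely spaced.
reversed_collides = reversed_depth(surface_a, near) == reversed_depth(surface_b, near)

print("standard Z values collide:", standard_collides)
print("reverse Z values collide:", reversed_collides)
```

With these particular inputs the conventional mapping rounds both surfaces to the same 32-bit value, while the reversed mapping keeps them distinct.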

4) Try mipmapping or pre-filtering in general instead of sparsely reading images

Sparse reads are bad to begin with because they can thrash caches. Compression can make sparse reads worse because instead of thrashing one cache, they now thrash two. Ever find that fewer waves in flight or fewer valid threads per wave actually increases performance? A likely reason is that the cache is being thrashed. In particular, we’ve sometimes observed sparse reads from shadow map filtering go from very bad to worse with compression enabled. This especially happens on niche “ultra” graphics quality modes where settings tend to be cranked up to 11. If shadows look noisy and aliased, then caches are being thrashed! Pre-filtering or picking a lower resolution will look better and run much faster.

5) Write all color channels when you can, even if some are redundantly written

Partial writes require special care. In the case of uncompressed data, it’s possible to simply mask the writes. For compressed data, this doesn’t work, and hence the data must first be read, decompressed, updated, and then written back to preserve the untouched channels. To use compression efficiently, it is best to fully overwrite the underlying data if it is not needed for blending, so write all the channels at the same time if possible.
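The difference between a masked write and a full overwrite can be sketched with a toy model in Python. The class below is hypothetical and does no real compression; it only counts how often the old tile contents would have to be decoded, under the assumption that a compressed tile can only be read or written as a whole.

```python
class CompressedTile:
    """Toy tile that can only be read or written as a whole."""

    def __init__(self, pixels):
        self.decodes = 0  # how many times old contents had to be decoded
        self._pixels = [tuple(p) for p in pixels]  # stand-in for encoded data

    def write_all_channels(self, pixels):
        # Full overwrite: the old contents are irrelevant, nothing is decoded.
        self._pixels = [tuple(p) for p in pixels]

    def write_masked(self, pixels, mask):
        # Partial write: decode the old tile so unmasked channels survive.
        self.decodes += 1
        old = self._pixels
        self._pixels = [
            tuple(n if m else o for n, o, m in zip(new_px, old_px, mask))
            for new_px, old_px in zip(pixels, old)
        ]

    def read(self):
        return list(self._pixels)


tile = CompressedTile([(10, 20, 30, 255)] * 4)

# Writing only the first three channels forces a read-modify-write.
tile.write_masked([(1, 2, 3, 0)] * 4, mask=(True, True, True, False))
print(tile.read()[0], "decodes:", tile.decodes)

# Writing every channel, even redundantly, avoids touching the old data.
tile.write_all_channels([(1, 2, 3, 255)] * 4)
print(tile.read()[0], "decodes:", tile.decodes)
```

Both writes leave the tile with the same contents, but only the masked write pays for decoding the previous data.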

6) Organize G-Buffer data to maximize compression

If you are arbitrarily bit-packing fields into a G-Buffer, put highly correlated bits in the Most Significant Bits (MSBs) and noisy data in the Least Significant Bits (LSBs) of each channel. This compresses better because the packed value then behaves like typical image data, where neighboring pixels differ mostly in their low bits.
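The effect of bit placement on neighboring-pixel differences can be seen in a short Python sketch. The field layout is an assumption chosen for illustration: a 5-bit material ID that is constant across a run of pixels, and a 3-bit noisy value, packed together into one 8-bit channel.

```python
def pack(high, low, low_bits):
    """Pack two fields into one integer, with `high` above `low_bits` low bits."""
    return (high << low_bits) | low


def max_neighbor_delta(values):
    """Largest absolute difference between adjacent values."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


material_ids = [9, 9, 9, 9, 9, 9, 9, 9]   # 5-bit field, highly correlated
noise = [5, 0, 7, 2, 6, 1, 4, 3]          # 3-bit field, noisy

# Correlated field in the MSBs, noise in the LSBs.
id_in_msbs = [pack(m, n, 3) for m, n in zip(material_ids, noise)]

# Noise in the MSBs, correlated field in the LSBs.
noise_in_msbs = [pack(n, m, 5) for m, n in zip(material_ids, noise)]

print("ID in MSBs, max neighbor delta:", max_neighbor_delta(id_in_msbs))
print("noise in MSBs, max neighbor delta:", max_neighbor_delta(noise_in_msbs))
```

With the correlated field on top, neighboring values stay within a few units of each other; with the noise on top, the same data produces differences that span most of the channel's range, leaving little for a delta-based scheme to exploit.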

When do decompresses still happen?

Update for RDNA/RDNA 2: To see what’s actually happening on your driver and ASIC combination, make sure to check with our Radeon GPU Profiler (RGP) tool.
- Here’s where you can find the DCC information in RGP.

Even when you’ve followed all the rules above, DCC may sometimes get disabled as not all parts of the GPU can read and write compressed data yet. In these cases, the barrier will result in a decompression. It is important to understand when those cases may occur:

When simultaneously writing and reading compressed targets.
- This is not coherent between metadata and data, so corruption would occur.
- The driver has to play it safe: if there is any chance of simultaneous access, it will decompress. Make sure usage flags are properly set in explicit APIs (Direct3D® 12, Vulkan™), and that surfaces are unbound and/or write-masked off when they are not intended to be written.
When the shader can write to the resource as an Unordered Access View (UAV).
- Flagging a texture to be writable as a UAV through the shader currently disallows compression entirely.
- Some drivers may still allow fast clears, then decompress on shader UAV write bind until the next full clear.
Before using the copy engine.
- Generally once flagged as a copy source or target, we decompress first because we don’t know where it is going.
- Raw copies can be done between surfaces of the same type and size… if the driver knows that this is going to happen.

Chris Brennan

Chris Brennan is a GPU Hardware Architect at AMD. Having started with the first Radeon's Texture Pipeline, then moving to 3D application research and back to HW for XBox® 360, he now drives the fixed function graphics pipelines and compression.