
Update - RDNA/RDNA2
The information below still applies in general for RDNA/RDNA2 architectures, with the exception of when decompression happens.
Bandwidth is always a scarce resource on a GPU. On one hand, hardware has made dramatic improvements with the introduction of ever faster memory standards like HBM, but on the other hand, games are rendering at higher resolutions, with larger buffers and more data, eating up a lot of bandwidth. A lot of this bandwidth is used to read and write render targets. What hasn’t been exploited is the fact that render targets tend to store slowly varying data. For example, a sky will be blue, with little variance, yet the GPU would treat every pixel independently, as if each one held a unique, unrelated value.
This has recently changed with the introduction of Delta Color Compression, or DCC for short. This is a domain-specific compression that tries to take advantage of this data coherence to reduce the required bandwidth. It’s lossless, in many ways similar to typical compressors but adapted for 3D rendering. The key idea is to process whole blocks instead of individual pixels. Inside a block, only one value is stored with full precision, and the rest are stored as deltas, hence the name. If the colors are similar, the delta values can use far fewer bits than the input. DCC is enabled on discrete GPUs and APUs based on GCN 1.2 or later. The actual hardware implementation is quite a bit fancier than what I just described though. For instance, it adjusts its block size based on access patterns (and the data itself) to optimize for potentially random accesses.
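To make the "one full-precision value plus deltas" idea concrete, here is a minimal sketch in Python. It is a toy model only: the block size, the shared delta width, the raw-storage fallback, and the function names `delta_encode`, `delta_decode`, and `stored_bits` are all assumptions made for illustration, and the real hardware format is, as noted, considerably fancier.

```python
def delta_encode(block):
    """Split a block into one full-precision anchor and per-pixel deltas."""
    anchor = block[0]
    return anchor, [v - anchor for v in block[1:]]


def delta_decode(anchor, deltas):
    """Rebuild the original block exactly; the scheme is lossless."""
    return [anchor] + [anchor + d for d in deltas]


def stored_bits(block, bits_per_value=8):
    """Bits needed for the block in this toy format, falling back to raw storage."""
    raw = bits_per_value * len(block)
    _, deltas = delta_encode(block)
    max_mag = max((abs(d) for d in deltas), default=0)
    # A signed field wide enough for the largest delta, shared by the whole block.
    width = 0 if max_mag == 0 else max_mag.bit_length() + 1
    return min(raw, bits_per_value + width * len(deltas))


sky = [200, 201, 200, 202, 201, 200, 199, 201]    # slowly varying
noise = [12, 240, 77, 3, 190, 55, 128, 9]         # unrelated values

print(stored_bits(sky), stored_bits(noise))        # 29 vs 64 bits (raw is 64)
```

The sky-like block shrinks to less than half its raw size, while the noisy block gains nothing and stays at its raw size, which is the behavior the paragraph above describes.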
The new compressor is inside the Color Block which allows graphics to compress color render targets similarly to the way that the Depth Block has been compressing depth and stencil targets. This means that there’s no special setup required: if the target surface is compressed, the rendering just goes through the compressor. Otherwise, the pipeline is not impacted.
Compressing is only half of the story, as data is typically read much more often than written. To get the bandwidth-saving benefits there as well, the shader core has been given the ability to read the new compressed color as well as all the existing compressed surfaces. This allows decompress operations to be skipped entirely for render-to-texture scenarios – that is, a barrier which transitions from render-target to texture is effectively a no-op and does not trigger a costly decompression.
While DCC is a “transparent” feature in the sense that it doesn’t require a specific setup by developers, there are still a few caveats to be aware of to make the best use of it.
Clearing to common values such as 0.0s and 1.0s will be much faster and save more bandwidth than arbitrary values.
Shader-readable targets are not as well compressed as when it is known that the shader will not read them. As mentioned previously, the shader cores have been extended to give them access to compressed data directly, but there are additional compression options which are only available if read access from the shader cores is prohibited. MSAA depth targets suffer the most from being flagged as “shader resource” if they don’t need to be because they otherwise compress extremely well.
D32Fs actually may compress smaller than D16s when used as shader resources, and compress exactly the same way when not shader compatible. They are only different in allocation size and bandwidth when decompressed, which typically isn’t too frequent (but may happen when a dense mesh with many micro-triangles is rendered into a small screen-space area). D32F also allows you to use reverse Z for added precision, so that can be leveraged for nearly free. Keep in mind that on GCN, there’s no such thing as a real 24-bit depth target. Under the hood, those are handled as 32-bit, just with 8 bits of precision thrown away – so there’s no cost in switching from D24 to D32 targets.
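Why reverse Z helps with a floating-point depth format can be seen from the spacing of 32-bit floats. The sketch below is an illustration, not anything GPU-specific: the helper `float32_ulp` and the sample depth values are assumptions chosen for the example. It measures the gap between adjacent float32 values near 1.0, where a conventional projection places distant geometry, and near 0.0, where reverse Z places it.

```python
import struct


def float32_ulp(x):
    """Gap between float32(x) and the next representable float32 above it (x > 0)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    cur = struct.unpack("<f", struct.pack("<I", bits))[0]
    nxt = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return nxt - cur


conventional = 0.999              # a distant point with the usual 0 = near, 1 = far
reversed_z = 1.0 - conventional   # the same point with 1 = near, 0 = far

print(float32_ulp(conventional))  # 2**-24
print(float32_ulp(reversed_z))    # 2**-33, i.e. 512x finer steps
```

A fixed-point format has uniform spacing everywhere, so it cannot benefit from this remapping, which is why the trick is tied to D32F.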
Sparse reads are bad to begin with because they can thrash caches. Compression can make sparse reads worse because instead of thrashing one cache, now it’s thrashing two. Ever find that fewer waves in flight or fewer valid threads per wave actually increase performance? A likely reason is that the cache is being thrashed. In particular, we’ve sometimes observed sparse reads from shadow map filtering go from very bad to worse with compression enabled. This especially happens on niche “ultra” graphics quality modes where settings tend to be cranked up to 11. If shadows look noisy and aliased then caches are being thrashed! Pre-filtering or picking a lower resolution will look better and run much faster.
Partial writes require special care. In case of uncompressed data, it’s possible to simply mask the writes. For compressed data, this doesn’t work, and hence the data must be first read, decompressed, updated, and then written back to preserve the untouched channels. To use compression efficiently, it is best to fully overwrite the underlying data when it is not needed for blending, so write all the channels at the same time if possible.
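The difference between a full overwrite and a masked write can be sketched with a toy model. Everything here is an assumption made for illustration: the `CompressedBlock` class is hypothetical, and zlib merely stands in for the hardware compressor, which works quite differently. The point is only that a masked write to compressed data forces a read and decompress that a full write avoids.

```python
import zlib


class CompressedBlock:
    """Toy compressed block; zlib is only a stand-in for the hardware format."""

    def __init__(self, pixels):
        self.rmw_reads = 0          # reads forced by partial writes
        self._store(pixels)

    def _store(self, pixels):
        # pixels: list of (r, g, b, a) tuples with 8-bit channels
        self.payload = zlib.compress(bytes(c for p in pixels for c in p))

    def decode(self):
        raw = zlib.decompress(self.payload)
        return [tuple(raw[i:i + 4]) for i in range(0, len(raw), 4)]

    def write(self, new_pixels, mask=(True, True, True, True)):
        if all(mask):
            # Full overwrite: the old contents do not matter, so nothing is read.
            self._store(new_pixels)
            return
        # Partial write: read and decompress first to keep the masked-out channels.
        self.rmw_reads += 1
        old = self.decode()
        merged = [tuple(n if m else o for n, o, m in zip(new, prev, mask))
                  for new, prev in zip(new_pixels, old)]
        self._store(merged)


block = CompressedBlock([(10, 20, 30, 255)] * 4)
block.write([(1, 2, 3, 4)] * 4)                                    # all channels
full_reads = block.rmw_reads
block.write([(9, 9, 9, 9)] * 4, mask=(True, False, False, False))  # red only
print(full_reads, block.rmw_reads, block.decode()[0])
```

In this model the full write costs no extra read, while the red-only write pays for one read-modify-write cycle to preserve green, blue, and alpha.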
If arbitrarily bit-packing fields into a G-Buffer, put highly correlated bits in the Most Significant Bits (MSBs) and noisy data in the Least Significant Bits (LSBs) of each channel. This will compress better because the packed values then behave like typical data, where neighboring pixels differ mostly in their low bits.
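The effect of field order can be estimated with a rough proxy: the number of bits needed to hold the largest difference between pixels in a block. The sketch below assumes two hypothetical 4-bit fields packed into one 8-bit channel; `pack` and `delta_width` are illustrative helpers, and the metric is a stand-in for, not a description of, what the hardware does.

```python
def delta_width(values):
    """Signed bit width sufficient for the largest delta from the first value."""
    anchor = values[0]
    max_mag = max(abs(v - anchor) for v in values[1:])
    return 0 if max_mag == 0 else max_mag.bit_length() + 1


def pack(msb_field, lsb_field, lsb_bits=4):
    """Pack two per-pixel fields into one channel, msb_field in the high bits."""
    return [(hi << lsb_bits) | lo for hi, lo in zip(msb_field, lsb_field)]


smooth = [7, 7, 7, 7, 7, 7, 7, 7]        # correlated across the block
noisy = [3, 14, 0, 9, 15, 6, 11, 2]      # varies from pixel to pixel

good = pack(smooth, noisy)   # correlated bits in the MSBs
bad = pack(noisy, smooth)    # noisy bits in the MSBs

print(delta_width(good), delta_width(bad))   # 5 vs 9 bits per delta
```

With the noisy field in the high bits, the deltas need more bits than the 8-bit channel itself, so a delta-based scheme has nothing to gain; with the recommended order they stay small.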
Even when you’ve followed all the rules above, DCC may sometimes get disabled as not all parts of the GPU can read and write compressed data yet. In these cases, the barrier will result in a decompression. It is important to understand when those cases may occur: