- There are 4 SIMD units and 1 TEX unit per CU
- At peak, each SIMD unit processes a VALU instruction for a 64-wide wave in 4 clocks (each SIMD unit is 16 lanes wide)
- The aggregate for the 4 SIMD units is a VALU instruction for 64 invocations per clock
- At filtered peak, the TEX unit processes one 2×2 fragment quad per clock (a 64-wide wave in 16 clocks)
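The per-wave clock counts in the list above follow directly from the unit widths. A minimal sketch of that arithmetic (the constant and function names are illustrative and not taken from any AMD documentation):

```python
WAVE_SIZE = 64      # invocations per wave
SIMD_WIDTH = 16     # lanes per SIMD unit
SIMDS_PER_CU = 4    # SIMD units per CU
TEX_QUAD_SIZE = 4   # one 2x2 fragment quad per clock


def clocks_per_wave(invocations_per_clock: int) -> int:
    """Clocks needed to push one full wave through a unit of the given width."""
    return WAVE_SIZE // invocations_per_clock


# One SIMD unit: 64 invocations at 16 per clock.
valu_clocks = clocks_per_wave(SIMD_WIDTH)

# Whole CU: four SIMD units working in parallel.
cu_valu_invocations_per_clock = SIMDS_PER_CU * SIMD_WIDTH

# TEX unit at filtered peak: 64 invocations at 4 per clock.
tex_clocks = clocks_per_wave(TEX_QUAD_SIZE)

print(valu_clocks, cu_valu_invocations_per_clock, tex_clocks)
```

Put another way, in the 16 clocks the TEX unit spends on one wave at filtered peak, the four SIMD units together can issue VALU instructions for 16 × 64 invocations.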
The peak rate of one 2×2 fragment quad per clock requires address generation for image VMEM operations for 4 invocations in one clock. For point-sampled (non-filtered) image operations, GCN can return results for up to four 128-bit texels per clock. For filtered image operations, GCN processes 32 bits for each of four fragments per clock, so filtering the same four 128-bit texels takes 4 clocks.
- 32-bit (or smaller) single-channel buffer loads / wave = 4 clocks (under specific cases)
- multi-channel buffer loads / wave = 16 clocks
- 128-bit (or smaller) point sampled texels / wave = 16 clocks
- 32-bit (or smaller) filtered texels / wave = 16 clocks
- 64-bit filtered texels / wave = 32 clocks
- 128-bit filtered texels / wave = 64 clocks
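The clock counts above can be reproduced with a small cost model. This is a sketch, not a hardware specification: the function names are hypothetical, rounding texel sizes up to the next 32-bit step for filtered fetches is an assumption (the list only gives 32, 64 and 128 bits), and so is treating a single-channel buffer load that misses the "specific cases" as costing 16 clocks.

```python
QUAD_CLOCKS_PER_WAVE = 16  # 64 invocations at one 2x2 quad (4 invocations) per clock


def image_clocks_per_wave(texel_bits: int, filtered: bool) -> int:
    """Estimated peak clocks for one wave of image fetches."""
    if not 0 < texel_bits <= 128:
        raise ValueError("texel_bits must be in 1..128")
    if not filtered:
        # Point sampled: up to four 128-bit texels are returned per clock.
        return QUAD_CLOCKS_PER_WAVE
    # Filtered: 32 bits per fragment per clock, so wider texels take more passes.
    # ASSUMPTION: sizes between the listed ones round up to the next 32 bits.
    passes = (texel_bits + 31) // 32
    return QUAD_CLOCKS_PER_WAVE * passes


def buffer_clocks_per_wave(channels: int, fast_case: bool) -> int:
    """Estimated peak clocks for one wave of buffer loads (32-bit or smaller channels)."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    # ASSUMPTION: a single-channel load outside the fast cases costs 16 clocks.
    return 4 if channels == 1 and fast_case else 16


for bits in (32, 64, 128):
    print(bits, image_clocks_per_wave(bits, filtered=True))
```

Under this model a filtered 128-bit fetch costs four times as much as a point-sampled fetch of the same format.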
Coalescing For L2 Access Via Image or Buffer Operations
Coalescing For L1 Access Via Image Operations
Coalescing For L1 Access Via Buffer Operations
- When all invocations access the same address
- When all invocations in each aligned group of 4 invocations access the same address
- When all invocations in each aligned group of 4 invocations access in a block of 4 consecutive addresses (ordering in the 4 invocations does not matter)
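The three cases above can be checked offline for a given access pattern. The sketch below is a hypothetical helper, not an AMD tool. It assumes addresses are expressed in units of the loaded element, and it reads "a block of 4 consecutive addresses" as "all four addresses fall inside some window of 4", without requiring the window to be aligned or the addresses to be distinct. If the hardware is stricter than that, the last check would need tightening.

```python
from typing import Sequence

GROUP = 4  # invocations per aligned group


def classify_buffer_access(addresses: Sequence[int]) -> str:
    """Return which listed case a wave's per-invocation addresses match."""
    if not addresses or len(addresses) % GROUP != 0:
        raise ValueError("expected a non-empty multiple of 4 addresses")

    # Case 1: every invocation reads the same address.
    if all(a == addresses[0] for a in addresses):
        return "uniform"

    groups = [addresses[i:i + GROUP] for i in range(0, len(addresses), GROUP)]

    # Case 2: each aligned group of 4 reads one address (groups may differ).
    if all(len(set(g)) == 1 for g in groups):
        return "group-uniform"

    # Case 3: each aligned group of 4 stays inside 4 consecutive addresses,
    # in any order (ASSUMPTION: the window itself need not be aligned).
    if all(max(g) - min(g) < GROUP for g in groups):
        return "group-block"

    return "none"


print(classify_buffer_access(list(range(64))))           # linear indexing
print(classify_buffer_access([2 * i for i in range(64)]))  # stride of 2
```

A plain linear index lands in the third case, and a stride of 2 matches none of them. Reordering within each group of 4 (for example `i ^ 3`) still matches the third case, in line with the note that ordering within the 4 invocations does not matter.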