Understanding Memory Coalescing on GCN

This post describes how GCN hardware coalesces memory operations to minimize traffic throughout the memory hierarchy. The post uses the term “invocation” to describe one shader invocation. An invocation is also commonly called a “lane” (as in one lane of a SIMD vector, or wave) or a “thread” (as in one thread of execution of a shader).


Reviewing peak throughput limits for a GCN Compute Unit (CU):
  • There are 4 SIMD units and 1 TEX unit per CU
  • Each SIMD unit processes, at peak, one VALU instruction for a 64-wide wave in 4 clocks (the SIMD unit is 16-wide)
  • The aggregate for the 4 SIMD units is a VALU instruction for 64 invocations per clock
  • The TEX unit processes, at filtered peak, one 2×2 fragment quad per clock (a 64-wide wave in 16 clocks)
The rate of one 2×2 fragment quad per clock provides hints about peak throughput. It takes up to 4 reads per fragment to gather the texels required for one filtered output, for a total of 16 reads for the 2×2 fragment quad per clock. A peak rate of one filtered 2×2 fragment quad per clock also requires returning 16 32-bit values in a single cycle (4 fragments * 4 channels). GCN leverages these two data paths (the 16 reads and the 16 returned values per clock) to support single-channel buffer loads, in the cases described later in this post, at a rate of up to 16 32-bit values in a single cycle.

The peak rate of one 2×2 fragment quad per clock requires address generation for image VMEM operations for 4 invocations in one clock. For point sampled (non-filtered) image operations, GCN can return results for up to four 128-bit texels per clock. For filtered image operations, GCN processes 32 bits for each of four fragments per clock, so the same four 128-bit texels take 4 clocks with filtered VMEM operations.

Summary of peak throughput per wave supported by the data paths (actual rates will vary):
  • 32-bit (or smaller) single-channel buffer loads / wave = 4 clocks (under specific cases)
  • multi-channel buffer loads / wave = 16 clocks
  • 128-bit (or smaller) point sampled texels / wave = 16 clocks
  • 32-bit (or smaller) filtered texels / wave = 16 clocks
  • 64-bit filtered texels / wave = 32 clocks
  • 128-bit filtered texels / wave = 64 clocks
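
The clock counts in the summary follow from the per-clock rates given earlier. As a rough sketch of that arithmetic (the function names are hypothetical, and the model simply assumes a 64-wide wave served at a fixed number of invocations per clock):

```python
WAVE_SIZE = 64  # invocations per wave on GCN


def clocks_per_wave(invocations_per_clock, passes=1):
    """Peak clocks to service one wave at a fixed per-clock rate."""
    return WAVE_SIZE // invocations_per_clock * passes


def filtered_clocks_per_wave(texel_bits):
    """Filtered path: 4 fragments per clock, 32 bits per fragment per clock.

    Wider texels are assumed to need one extra pass per additional 32 bits.
    """
    passes = max(1, -(-texel_bits // 32))  # ceiling division
    return clocks_per_wave(4, passes)


# 16 single-channel 32-bit values per clock (fast buffer path)
single_channel_buffer = clocks_per_wave(16)
# 4 invocations per clock (multi-channel buffer loads, point sampled texels)
point_sampled = clocks_per_wave(4)
```

Under these assumptions the sketch gives 4 clocks for the fast single-channel buffer path, 16 clocks for point sampled texels, and 16, 32, and 64 clocks for 32-bit, 64-bit, and 128-bit filtered texels, matching the list above.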

Coalescing For L2 Access Via Image or Buffer Operations

For L2 cache accesses, the hardware groups the accesses for all 64 invocations of a wave into the fewest possible aligned 64-byte memory requests. For loads and stores, all accesses to any given aligned 64-byte cacheline get coalesced into a single memory request. Also note that for stores, when multiple invocations of a wave write to the same address, the writes are automatically collapsed into one write to that address, so there is no need to waste VALU cycles avoiding colliding writes. This is not true for atomics, as each access to a given address must remain a unique operation.
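
To get a feel for this rule, the number of L2 requests for a wave can be estimated by counting the distinct aligned 64-byte lines the wave touches. The following is a minimal sketch, not a hardware model: `count_l2_requests` is a hypothetical helper, and it assumes each access is naturally aligned so that it never straddles two cachelines.

```python
CACHELINE_BYTES = 64


def count_l2_requests(byte_addresses):
    """Count distinct aligned 64-byte lines touched by one wave.

    Assumes every access is naturally aligned and therefore falls
    entirely inside a single cacheline.
    """
    return len({addr // CACHELINE_BYTES for addr in byte_addresses})


# 64 invocations loading consecutive 32-bit values from an aligned base
consecutive = [i * 4 for i in range(64)]
# 64 invocations each loading from a different cacheline
strided = [i * 64 for i in range(64)]
# 64 invocations all accessing the same address
broadcast = [128] * 64
```

With these inputs, the consecutive pattern touches 4 lines, the strided pattern touches 64, and the broadcast pattern touches 1.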

Coalescing For L1 Access Via Image Operations

Image VMEM operations are serviced in aligned groups of 4 invocations in a wave. For example in compute, invocations {0,1,2,3} would be a group, and in fragment or pixel shaders each 2×2 fragment quad is a group. Note compute invocations are organized in linear order: invocationIndex=x+(width*y)+(width*height*z).
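
The linear ordering matters because it decides which compute invocations share a group. Here is a small sketch of the index formula and the resulting grouping, assuming `width` and `height` refer to the workgroup dimensions (the helper names are hypothetical):

```python
def invocation_index(x, y, z, width, height):
    """Linear invocation index, as given by the formula above."""
    return x + (width * y) + (width * height * z)


def image_group(index):
    """Aligned group of 4 invocations that an invocation belongs to."""
    return index // 4


# In an 8x8x1 workgroup, the first four invocations along x share a group...
row_groups = [image_group(invocation_index(x, 0, 0, 8, 8)) for x in range(4)]
# ...but vertically adjacent invocations do not.
below_group = image_group(invocation_index(0, 1, 0, 8, 8))
```

In this example invocations (0,0) through (3,0) all land in group 0, while (0,1) has index 8 and lands in group 2. In compute, a group is therefore a 4×1 run in linear order rather than a 2×2 quad.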

Coalescing For L1 Access Via Buffer Operations

Multi-channel buffer VMEM operations, and buffer VMEM operations with single-channel accesses wider than 32 bits, are handled like point sampled image operations (as described above). Under specific cases, 32-bit (and smaller) single-channel buffer VMEM operations are first coalesced in aligned groups of 16 invocations in a wave. Invocations {0-15}, {16-31}, {32-47}, and {48-63} are all separate groups. The cases include:
  • When all invocations access the same address
  • When all invocations in each aligned group of 4 invocations access the same address
  • When all invocations in each aligned group of 4 invocations access in a block of 4 consecutive addresses (ordering in the 4 invocations does not matter)
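
The three cases above can be checked against an access pattern offline. The sketch below is only one reading of the list: `fast_path_case` is a hypothetical helper, addresses are expressed as element indices, "all invocations" is taken to mean the whole wave, and a "block of 4 consecutive addresses" is assumed to be any window of 4 consecutive elements (the post does not say whether the block must be aligned).

```python
def fast_path_case(element_indices):
    """Return which listed case a wave's 64 element indices match, or None."""
    assert len(element_indices) == 64
    # Case 1: every invocation accesses the same address.
    if len(set(element_indices)) == 1:
        return "same address"
    groups = [element_indices[i:i + 4] for i in range(0, 64, 4)]
    # Case 2: each aligned group of 4 invocations accesses one address.
    if all(len(set(g)) == 1 for g in groups):
        return "same address per group of 4"
    # Case 3: each aligned group of 4 stays inside 4 consecutive elements
    # (assumed unaligned window; order within the group does not matter).
    if all(max(g) - min(g) < 4 for g in groups):
        return "block of 4 consecutive addresses per group of 4"
    return None


linear = list(range(64))                      # invocation i reads element i
reversed_quads = [(i // 4) * 4 + 3 - (i % 4) for i in range(64)]
stride_two = [i * 2 for i in range(64)]       # every other element
```

Under this reading, the linear pattern and the pattern that reverses each group of 4 both match the third case, while a stride of two elements matches none of the cases.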

Timothy Lottes
Timothy Lottes is a member of the Graphics Performance R&D team at AMD. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.