Understanding Memory Coalescing on GCN

This post describes how GCN hardware coalesces memory operations to minimize traffic throughout the memory hierarchy. The post uses the term “invocation” for one shader invocation. Other common names for the same thing are “lane” (as in one lane of a SIMD vector, or wave) and “thread” (as in one thread of execution of a shader).

Throughput

Reviewing peak throughput limits for a GCN Compute Unit (CU):
  • There are 4 SIMD units and 1 TEX unit per CU
  • At peak, each SIMD unit processes one VALU instruction for a 64-wide wave in 4 clocks (the SIMD unit is 16-wide)
  • In aggregate, the 4 SIMD units process a VALU instruction for 64 invocations per clock
  • At filtered peak, the TEX unit processes one 2×2 fragment quad per clock (a 64-wide wave in 16 clocks)
The rate of one 2×2 fragment quad per clock hints at the peak throughput of the data paths. It takes up to 4 reads per fragment to gather the texels required for one filtered output, for a total of up to 16 reads per clock for the 2×2 fragment quad. Returning one filtered 2×2 fragment quad per clock requires returning 16 32-bit values in a single cycle (4 fragments × 4 channels). GCN leverages those two data paths (16 reads in, 16 32-bit values out) to support single-channel buffer loads, in the cases described later in this post, at a rate of up to 16 32-bit values in a single cycle.
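The arithmetic behind these peak figures can be written out as a small sketch. The constant names below are made up for illustration; only the numbers come from the text above.

```python
# Peak-rate arithmetic for one GCN Compute Unit, using the figures quoted above.
# All names here are illustrative, not hardware or API identifiers.
WAVE_SIZE = 64          # invocations per wave
SIMD_WIDTH = 16         # invocations one SIMD unit processes per clock
SIMDS_PER_CU = 4
QUAD_SIZE = 4           # fragments in one 2x2 quad
READS_PER_FRAGMENT = 4  # upper bound on texel reads for one filtered output
CHANNELS = 4            # 32-bit channels returned per fragment

valu_clocks_per_wave = WAVE_SIZE // SIMD_WIDTH           # one SIMD unit, one wave
valu_invocations_per_clock = SIMD_WIDTH * SIMDS_PER_CU   # all 4 SIMD units together
tex_clocks_per_wave = WAVE_SIZE // QUAD_SIZE             # one quad per clock
reads_per_clock = QUAD_SIZE * READS_PER_FRAGMENT         # texel reads feeding the filter
values_returned_per_clock = QUAD_SIZE * CHANNELS         # 32-bit values returned

print(valu_clocks_per_wave, valu_invocations_per_clock, tex_clocks_per_wave)
print(reads_per_clock, values_returned_per_clock)
```

Both data-path widths come out at 16 per clock, which is the rate the single-channel buffer load path reuses.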

The peak rate of one 2×2 fragment quad per clock requires address generation for image VMEM operations for 4 invocations in one clock. For point-sampled (non-filtered) image operations, GCN can return results for up to four 128-bit texels per clock. For filtered image operations, GCN processes 32 bits for each of four fragments per clock, so filtering the same four 128-bit texels takes 4 clocks.

Summary of peak throughput per wave supported by the data paths (actual rates will vary):
  • 32-bit (or smaller) single-channel buffer loads / wave = 4 clocks (under specific cases)
  • multi-channel buffer loads / wave = 16 clocks
  • 128-bit (or smaller) point sampled texels / wave = 16 clocks
  • 32-bit (or smaller) filtered texels / wave = 16 clocks
  • 64-bit filtered texels / wave = 32 clocks
  • 128-bit filtered texels / wave = 64 clocks
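The summary can be reproduced with a simple model, sketched below. The function name and the `kind` strings are hypothetical. The filtered formula assumes the cost scales with the number of 32-bit words per texel, which is inferred from the rates above rather than taken from hardware documentation.

```python
WAVE_SIZE = 64
QUAD_SIZE = 4            # image-style operations service 4 invocations per clock
FAST_PATH_WIDTH = 16     # single-channel buffer fast path: 16 values per clock

def peak_clocks_per_wave(kind, bits):
    """Peak clocks to service one wave, following the summary list above."""
    quads = WAVE_SIZE // QUAD_SIZE                   # 16 groups of 4 invocations
    if kind == "buffer_single_channel_fast":
        if bits > 32:
            raise ValueError("fast path covers 32-bit or smaller accesses")
        return WAVE_SIZE // FAST_PATH_WIDTH          # 4 clocks
    if kind in ("buffer_multi_channel", "point_sampled"):
        if bits > 128:
            raise ValueError("at most 128 bits per invocation")
        return quads                                 # 16 clocks
    if kind == "filtered":
        words = max(1, (bits + 31) // 32)            # 32 bits per fragment per clock
        return quads * words                         # 16, 32, or 64 clocks
    raise ValueError("unknown kind: " + kind)

for bits in (32, 64, 128):
    print(bits, peak_clocks_per_wave("filtered", bits))
```

These are data-path peaks only; as the summary notes, actual rates will vary.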

Coalescing For L2 Access Via Image or Buffer Operations

For L2 cache accesses, the hardware groups the accesses for all 64 invocations of a wave into the fewest possible 64-byte-aligned, 64-byte memory requests. For loads and stores, all accesses to any given aligned 64-byte cache line are coalesced into a single memory request. Note also that for stores, when multiple invocations of a wave write to the same address, the writes are automatically collapsed into one write to that address, so there is no need to spend VALU cycles avoiding colliding writes. This is not true for atomics, as each access to a given address must remain a unique operation.
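To estimate how many L2 requests a given access pattern produces, one can count the distinct aligned 64-byte lines touched by the wave. The sketch below does that. The helper name `l2_requests` is hypothetical, and it assumes each access is naturally aligned so that it touches exactly one line.

```python
CACHE_LINE_BYTES = 64

def l2_requests(byte_addresses):
    """Return the sorted base addresses of the aligned 64-byte lines touched.

    Assumes each access is naturally aligned and so falls inside a single
    line. One entry in the result corresponds to one coalesced request.
    """
    lines = {addr // CACHE_LINE_BYTES for addr in byte_addresses}
    return sorted(line * CACHE_LINE_BYTES for line in lines)

# 64 invocations reading consecutive 32-bit values: 256 bytes, 4 lines.
consecutive = [4 * i for i in range(64)]
# 64 invocations each reading from a different line: no coalescing possible.
strided = [64 * i for i in range(64)]
# 64 invocations all touching the same address: a single request.
same = [128] * 64

print(len(l2_requests(consecutive)), len(l2_requests(strided)), len(l2_requests(same)))
```

The consecutive pattern needs 4 requests, the 64-byte stride needs 64, and the uniform address needs 1.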

Coalescing For L1 Access Via Image Operations

Image VMEM operations are serviced in aligned groups of 4 invocations in a wave. For example, in compute, invocations {0,1,2,3} would be a group, and in fragment or pixel shaders each 2×2 fragment quad is a group. Note that compute invocations are organized in linear order: invocationIndex = x + (width*y) + (width*height*z).
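The linear ordering matters when choosing a workgroup shape, because it decides which invocations share a group of 4. Here is a minimal sketch with hypothetical helper names. It takes `width` and `height` to be the workgroup dimensions and assumes groups are formed from consecutive linear indices.

```python
def invocation_index(x, y, z, width, height):
    """Linear invocation index, as given by the formula above."""
    return x + (width * y) + (width * height * z)

def image_group(x, y, z, width, height):
    """Index of the aligned group of 4 invocations (assumed: linear index // 4)."""
    return invocation_index(x, y, z, width, height) // 4

# In an 8x8x1 workgroup, a group of 4 is a horizontal run of 4 invocations,
# not a 2x2 block: (0,0)..(3,0) share a group, while (0,1) lands elsewhere.
width, height = 8, 8
row_run = [image_group(x, 0, 0, width, height) for x in range(4)]
block_2x2 = [image_group(x, y, 0, width, height) for y in range(2) for x in range(2)]
print(row_run, block_2x2)
```

Under these assumptions, a compute shader that wants its 4 grouped invocations to touch neighbouring texels has to map its coordinates with this row-major grouping in mind.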

Coalescing For L1 Access Via Buffer Operations

Multi-channel buffer VMEM operations, and buffer VMEM operations with single-channel accesses greater than 32 bits, are handled like point-sampled image operations (as described above). Under specific cases, 32-bit (and smaller) single-channel buffer VMEM operations are first coalesced in aligned groups of 16 invocations in a wave. Invocations {0-15}, {16-31}, {32-47}, and {48-63} are all separate groups. Cases include:
  • When all invocations access the same address
  • When all invocations in each aligned group of 4 invocations access the same address
  • When all invocations in each aligned group of 4 invocations access addresses within a block of 4 consecutive addresses (the ordering within the 4 invocations does not matter)
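The three cases can be checked against a wave's access pattern with a small classifier, sketched below. The function name and return strings are hypothetical. The sketch treats addresses as element indices and reads "a block of 4 consecutive addresses" as any span of 4 consecutive elements, with no alignment requirement. Both are assumptions, not statements about the hardware.

```python
def fast_path_case(element_indices):
    """Classify one wave's single-channel buffer access against the cases above.

    element_indices: 64 element indices, one per invocation.
    Returns the most specific matching case, or None if none applies.
    """
    if len(element_indices) != 64:
        raise ValueError("expected one index per invocation of a 64-wide wave")
    if len(set(element_indices)) == 1:
        return "all-same-address"
    groups = [element_indices[i:i + 4] for i in range(0, 64, 4)]
    if all(len(set(g)) == 1 for g in groups):
        return "same-address-per-group-of-4"
    # Assumed reading: the 4 indices fit inside 4 consecutive elements, any order.
    if all(max(g) - min(g) < 4 for g in groups):
        return "block-of-4-per-group-of-4"
    return None

print(fast_path_case([7] * 64))
print(fast_path_case([i // 4 for i in range(64)]))
print(fast_path_case(list(range(64))))
print(fast_path_case([2 * i for i in range(64)]))
```

With this reading, a plain linear load (invocation i reads element i) matches the third case, while a stride-2 pattern matches none of them.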

Timothy Lottes

Timothy Lottes is a member of the Graphics Performance R&D team at AMD. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
