Optimizing GPU occupancy and resource usage with large thread groups

When using a compute shader, it is important to consider the impact of thread group size on performance. Limited register space, memory latency and SIMD occupancy each affect shader performance in different ways. This article discusses potential performance issues, and techniques and optimizations that can dramatically increase performance if correctly applied.

Capsaicin and Cream developer talks at GDC 2017

Introduction Shortly after our Capsaicin and Cream event at GDC this year where we unveiled Radeon RX Vega, we hosted a developer-focused event designed to …

Leveraging asynchronous queues for concurrent execution

Understanding concurrency (and what breaks it) is extremely important when optimizing for modern GPUs. Modern APIs like DirectX® 12 or Vulkan™ provide the ability to …

AMD GCN Assembly: Cross-Lane Operations

Cross-lane operations are an efficient way to share data between wavefront lanes. This article covers in detail the cross-lane features that GCN3 offers.  I’d like …

Anatomy Of The Total War Engine: Part III

It’s Wednesday, so we’re continuing with our series on Total War: Warhammer. Here’s Tamas Rabel again with some juicy details about how Creative Assembly brought …

Anatomy Of The Total War Engine: Part II

We’re back again on this fine Warhammer Wednesday with more from Tamas Rabel, Lead Graphics Programmer on the Total War series. In last week’s post …

Vulkan Device Memory

EDIT: 2016/08/08 – Added section on Targeting Low-Memory GPUs This post serves as a guide on how to best use the various Memory Heaps and …

Performance Tweets Series: Root signature & descriptor sets

Before Direct3D® 12 and Vulkan™, resources were bound to shaders through a “slot” system. Some of you might remember when hardware did have only very …

The Art of AMDGCN Assembly: How to Bend the Machine to Your Will

The ability to write code in assembly is essential to achieving the best performance for a GPU program. In a previous blog we described how …

ROCm with Rapid Harmony : Optimizing HSA Dispatch

We previously looked at how to launch an OpenCL™ kernel using the HSA runtime. That example showed the basics of using the HSA Runtime. Here we’ll …