HIP Ray Tracing
HIP RT is a ray tracing library for HIP, making it easy to write ray tracing applications in HIP.
On this page
In-register matrix transpose is an important optimization technique in FFT and Neural Texture Compression. This article explores practical implementations using WMMA on AMD RDNA™ 4 architecture graphics cards.
Matrix transpose loading is critical in modern GPGPU computing. However, RDNA 4 architecture GPUs lack both shared-memory transpose loading and in-register matrix transpose capabilities, making it difficult for programmers to achieve peak performance. Consequently, an efficient in-register matrix transpose solution is essential for RDNA 4.
A straightforward approach uses warp shuffle instructions to exchange data between threads in the warp. However, as the RDNA 4 WMMA layout and shuffle instructions only support constant index of a value, requiring redundant data exchanges that limit performance. Here is a sample code:
#define WMMA_DATA_WIDTH 8typedef _Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));
__device__ __forceinline__ frag_type shfl_movmatrix(const frag_type& t) { frag_type ret; auto lane_id = __lane_id(); auto v_src = lane_id % 8;
auto TLayout = [] (auto lane_id) { constexpr unsigned shape[] = {8, 2, 2}; constexpr unsigned stride[] = {0, 16, 8};
unsigned result = 0; for(int i = 0; i < sizeof(shape) / sizeof(shape[0]); ++i) { result += lane_id % shape[i] * stride[i]; lane_id /= shape[i]; } return result; };
auto t_trans = TLayout(lane_id);
#pragma unroll for(int v = 0; v < 8; ++v) { auto t_src = t_trans + v;
frag_type reg; uint32_t* in_reg = (uint32_t*)(&t); uint32_t* out_reg = (uint32_t*)(®);
static_assert(sizeof(reg) % sizeof(*in_reg) == 0, "frag_type must be dividend by uint32_t evenly");
#pragma unroll for(int tv = 0; tv < sizeof(reg) / sizeof(*in_reg); ++tv) { out_reg[tv] = __shfl(in_reg[tv], t_src); }
ret[v] = reg[v_src]; } return ret;}To implement in-register matrix transpose effectively, one must first understand the Wide Matrix Multiply Accumulate (WMMA) layout in RDNA 4. Using the Matrix Cores of AMD RDNA 4 architecture GPUs provides a comprehensive introduction to this topic.
The following illustrates the WMMA layout in RDNA 4. All data types—including FP16, INT8, and INT4—utilize this unified layout:
| Matrix | Position | Dimensions |
|---|---|---|
| A | Lower left | M rows × K columns |
| B | Upper right | K rows × N columns |
| D | Lower right | M rows × N columns |
Both matrix A and B are K-major, with each thread holding 8 contiguous elements. This layout enables efficient 128-bit vectorized loads. Matrix D is M-major. So, matrix D is the transposed version of matrix A.
Leveraging the RDNA 4 architecture WMMA layout, we can construct an identity matrix in register B while loading the source matrix into register A. A single WMMA operation then performs the transpose entirely in-register—no additional memory operations required.
The following code demonstrates the in-register matrix transpose implementation on RDNA 4.
#include <hip/hip_runtime.h>#include <hip/hip_fp16.h>#include <thrust/host_vector.h>#include <thrust/device_vector.h>
using namespace std;
#define WMMA_DATA_WIDTH 8typedef _Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));
__device__ __forceinline__ void make_identity(frag_type& m) { const int lIdx = __lane_id(); const int row = lIdx / 16; const int col = lIdx % 16 / 8; const __half num = row == col; const int idx = lIdx % 16 % 8; m[idx] = num;}
__global__ void wmma_movmatrix(__half* a, __half* c) { const int gIdx = blockIdx.x * blockDim.x + threadIdx.x; const int lIdx = threadIdx.x;
const int lane = lIdx % 16; const int laneGroup = lIdx / 16;
frag_type a_frag = {0}; frag_type b_frag = {0}; frag_type c_frag = {0};
for(int ele = 0; ele < WMMA_DATA_WIDTH; ++ele) { a_frag[ele] = a[16 * lane + ele+laneGroup * WMMA_DATA_WIDTH]; }
make_identity(b_frag);
c_frag = __builtin_amdgcn_wmma_f16_16x16x16_f16_w32_gfx12(a_frag, b_frag, c_frag);
for( int ele = 0; ele < WMMA_DATA_WIDTH; ++ele ) { c[16 * lane + ele+laneGroup * WMMA_DATA_WIDTH] = c_frag[ele]; }}
int main(int argc, char* argv[]) { thrust::host_vector<__half> h_a(16*16); thrust::host_vector<__half> h_c(16*16); for(int i = 0; i < h_a.size(); ++i) { h_a[i] = (float)i; } thrust::device_vector<__half> d_a = h_a; thrust::device_vector<__half> d_c = h_c;
wmma_movmatrix<<<dim3(1), dim3(32, 1, 1), 0, 0>>>(d_a.data().get(), d_c.data().get());
h_c = d_c;
printf("original:\n"); for (int i = 0; i < 16; ++i) { for (int j = 0; j < 16; ++j) { printf("%3i ", (int)h_a[i*16 + j]); } printf("\n"); }
printf("transposed:\n"); for (int i = 0; i < 16; ++i) { for (int j = 0; j < 16; ++j) { printf("%3i ", (int)h_c[i*16 + j]); } printf("\n"); } return 0;}Using WMMA for in-register matrix transpose on AMD RDNA 4 architecture GPUs provides a lightweight alternative to CUDA’s ldmatrix.trans and movmatrix instructions, particularly when matrix core utilization is low.
This technique has been deployed in Llama.cpp to implement ldmatrix_trans in Flash Attention on RDNA 4, serving as a real-world validation of the approach.
Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-97.