AMD matrix cores
Matrix multiplication is a fundamental aspect of Linear Algebra and it is an ubiquitous computation within High Performance Computing (HPC) Applications. Since the introduction of AMD’s CDNA Architecture, Generalized Matrix Multiplication (GEMM) computations are now hardwareaccelerated through Matrix Core Processing Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries like rocBLAS but they can also be programmed directly by developers. Applications that are throughput bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.
AMD’s Matrix Core technology supports a full range of mixed precision operations bringing us the ability to work with large models and enhance memorybound operation performance for any combination of AI and machine learning workloads. The various numerical formats have uses in different applications. Examples include use of 8bit integers (INT8) for ML inference, 32bit floating point (FP32) data for ML Training and HPC applications, 16bit floating point (FP16) data for graphics workloads and 16bit brain float (BF16) data for ML training with fewer convergence issues.
To learn more about the theoretical speedups achievable by using matrix cores compared to SIMD Vector Units, please refer to the tables below. The tables list the performance of the Vector (i.e. Fused MultiplyAdd or FMA) and Matrix core units of the previous generation (MI100) and current generation (MI250X) of CDNA Accelerators.
Matrix Core Performance for MI100 and MI250X:
Data format 
MI100 Flops/Clock/CU 
MI250X Flops/Clock/CU 

FP64 
N/A 
256 
FP32 
256 
256 
FP16 
1024 
1024 
BF16 
512 
1024 
INT8 
1024 
1024 
Vector (FMA) Unit Performance for MI100 and MI250X:
Data format 
MI100 Flops/Clock/CU 
MI250X Flops/Clock/CU 

FP64 
64 
128 
FP32 
128 
128 
Matrix Core speedup compared to Vector Unit Performance for MI100 and MI250X. Note, MI250X also supports packed FP32 instructions which also double the FP32 throughput:
Data format 
MI100 Matrix/Vector Speedup 
MI250X Matrix/Vector Speedup 

FP64 
N/A 
2x 
FP32 
2x 
2x 
Using AMD Matrix Cores
The Matrix Fused Multiply Add (MFMA) instructions in AMD CDNA GPUs operate on a perwavefront basis, rather than on a perlane (thread) basis: entries of the input and output matrices are distributed over the lanes of the wavefront’s vector registers.
AMD Matrix Cores can be leveraged in several ways. At a high level, one can use libraries such as rocBLAS or rocWMMA to do matrix operations on the GPU. For instance, rocBLAS may choose to use MFMA instructions if this is advantageous for the computations at hand. For a closertothemetal approach, one can choose to

write GPU kernels entirely in assembly (which can be somewhat challenging and impractical)

sprinkle HIP kernels with inline assembly (not recommended, since the compiler does not look at the semantics of the inlined instructions, and may not take care of data hazards such as the mandatory number of cycles before using the results of the MFMA instructions)

use compiler intrinsics: these represent the assembly instructions in such a way that the compiler knows about the semantics and requirements.
The coding examples in this post use some of the available compiler intrinsics for the MFMA instructions and show how to map entries of the input and output matrices to lanes of vector registers of the wavefront. All the examples use a single wavefront to compute a small matrix multiplication. The examples are not intended to show how to achieve high performance out of the MFMA operations.
MFMA Compiler Intrinsic Syntax
Consider the following multiplication MFMA operation where all the operands A, B, C, and D are matrices:
D = A B + C
To perform the MFMA operation on AMD GPUs, LLVM has builtin intrinsic functions. Recall that these intrinsic functions are executed on a wavefrontwide basis and pieces of the input and output matrices are loaded into registers of each lane in the wavefront. The syntax of a MFMA compiler intrinsic is shown below.
d = __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt (a, b, c, cbsz, abid, blgp)
where,

CDfmt
is the data format of the C & D matrices 
ABfmt
is the data format of the A & B matrices 
M
,N
andK
are matrix dimensions:
mA[M][K]
Source A matrix 
mB[K][N]
Source B matrix 
mC[M][N]
Accumulation input matrix C 
mD[M][N]
Accumulation result matrix D


a
is the set of vector registers storing values from source matrix A 
b
is the set of vector registers storing values from source matrix B 
c
is the set of vector registers storing values from accumulation input matrix C 
d
is the set of vector registers storing values of accumulation result matrix D 
cbsz
, the Control Broadcast Size modifier, is used to change which input values are fed to the matrix cores and is supported only by instructions that have multiple input blocks for the A matrix. Settingcbsz
informs the instruction to broadcast values of one chosen input block to 2^{cbsz} other neighboring blocks in A. The input block picked for broadcasting is determined by theabid
parameter. The default value of 0 results in no broadcast of values. As an example, for a 16block A matrix, settingcbsz=1
would result in blocks 0 and 1 receiving the same input values, blocks 2 and 3 receiving the same input values, blocks 4 and 5 receiving the same input values, etc. 
abid
, the Amatrix Broadcast Identifier, is supported by instructions that have multiple input blocks for the A matrix. It is used withcbsz
and indicates which input block is selected for broadcast to other neighboring blocks in the A matrix. As an example, for a 16block A matrix, settingcbsz=2
andabid=1
will result in the values from block 1 to be broadcast to blocks 03, values from block 5 to be broadcast to blocks 47, values from block 9 to be broadcast to blocks 811, etc. 
blgp
, the Bmatrix Lane Group Pattern modifier, allows a constrained set of swizzling operations on B matrix data between lanes. For the instructions that support this modifier, the following values are supported:
blgp=0
normal matrix layout for B 
blgp=1
the B matrix data from lanes 031 is also broadcast into lanes 3263 
blgp=2
the B matrix data from lanes 3263 is broadcast into lanes 031 
blgp=3
the B matrix data from all lanes is rotated down by 16 (e.g., lane 0’s data is put into lane 48, lane 16’s data is put into lane 0) 
blgp=4
the B matrix data from lanes 015 is broadcast into lanes 1631, 3247, and 4863 
blgp=5
the B matrix data from lanes 1631 is broadcast into lanes 015, 3247, and 4863 
blgp=6
the B matrix data from lanes 3247 is broadcast into lanes 015, 1631, and 4863 
blgp=7
the B matrix data from lanes 4863 is broadcast into lanes 015, 1631, and 3247

The matrix dimensions and number of blocks supported on CDNA2 GPUs are listed in the table below.
A/B Data Format 
C/D Data Format 
M 
N 
K 
Blocks 
Cycles 
Flops/cycle/CU 

FP32 
FP32 

32 
32 
2 
1 
64 
256 

32 
32 
1 
2 
64 
256 

16 
16 
4 
1 
32 
256 

16 
16 
1 
4 
32 
256 

4 
4 
1 
16 
8 
256 

FP16 
FP32 

32 
32 
8 
1 
64 
1024 

32 
32 
4 
2 
64 
1024 

16 
16 
16 
1 
32 
1024 

16 
16 
4 
4 
32 
1024 

4 
4 
4 
16 
8 
1024 

INT8 
INT32 

32 
32 
8 
1 
64 
1024 

32 
32 
4 
2 
64 
1024 

16 
16 
16 
1 
32 
1024 

16 
16 
4 
4 
32 
1024 

4 
4 
4 
16 
8 
1024 

BF16 
FP32 

32 
32 
8 
1 
64 
1024 

32 
32 
4 
2 
64 
1024 

16 
16 
16 
1 
32 
1024 

16 
16 
4 
4 
32 
1024 

4 
4 
4 
16 
8 
1024 

32 
32 
4 
1 
64 
512 

32 
32 
2 
2 
64 
512 

16 
16 
8 
1 
32 
512 

16 
16 
2 
4 
32 
512 

4 
4 
2 
16 
8 
512 

FP64 
FP64 

16 
16 
4 
1 
32 
256 

4 
4 
4 
4 
16 
128 
A complete list of all instructions supported by the CDNA2 Architecture can be found in the AMD Instinct MI200 Instruction Set Architecture Reference Guide.
Example 1 – V_MFMA_F32_16x16x4F32
Consider the matrix multiplication operation D = A B where M=N=16 and K=4 and the elements are of type FP32. Assume that the input C matrix contains zeroes for simplicity sake. We will demonstrate the use of the intrinsic function __builtin_amdgcn_mfma_f32_16x16x4f32
that computes the sum of four outer products in one invocation. This function operates on a single block of matrices.
The input matrices, A and B, have dimensions 16\times4 and 4\times16 respectively and the matrices C and D have 16\times16 elements. It is convenient to map a 16\times4 thread block to the elements of both input matrices. Here, the thread block has one wavefront with 16 threads in the x dimension and 4 threads in the y dimension. We represent the matrices in row major format: A[i][j] = j + i*N
where i
is the row and j
is the column. Using this representation, a thread at position x, y
in the block would load entry A[x][y]
and B[y][x]
. The output matrix has 16\times16 elements, so each thread would have 4 elements to store as illustrated in the figure and code snippet below.
The following two figures show 1) the shape and size of the A and B inputs; and 2) how the elements of A and B map to lanes in the register owned by the wavefront.
The following two figures show 1) the shape and size of the output matrix D; and 2) how the elements of D map to lanes in the registers owned by the wavefront:
An example kernel performing this MFMA operation is given below.
#define M 16
#define N 16
#define K 4
using float4 = __attribute__( (__vector_size__(K * sizeof(float)) )) float;
__global__ void sgemm_16x16x4(const float *A, const float *B, float *D)
{
float4 dmn = {0};
int mk = threadIdx.y + K * threadIdx.x;
int kn = threadIdx.x + N * threadIdx.y;
float amk = A[mk];
float bkn = B[kn];
dmn = __builtin_amdgcn_mfma_f32_16x16x4f32(amk, bkn, dmn, 0, 0, 0);
for (int i = 0; i < 4; ++i) {
const int idx = threadIdx.x + i * N + threadIdx.y * 4 * N;
D[idx] = dmn[i];
}
}
This kernel is launched as follows.
As previously noted, the input C matrix is assumed to contain zeroes.
Example 2 – V_MFMA_F32_16x16x1F32
Consider the case of multiplying matrices of dimensions M=N=16 and K=1 using the compiler intrinsic __builtin_amdgcn_mfma_f32_16x16x1f32
. In this case, the input values could be held just by 16 lanes of the wavefront. In fact, this instruction could simultaneously multiply 4 such matrices thereby having each lane hold values from one of those 4 matrices.
We can reuse the figure from the previous example to illustrate the data layout for this operation too. The input A is not a 16 \times 4 matrix in this case but four 16 \times 1 matrices. But the way they are laid out, and the elements that are owned by each lane in the wavefront is the same. The “columns” of A are distinct 16 \times 1 matrices. The input B is similar.
The output of a given matrix multiplication has exactly the same data layout as in the previous example. The difference is that now there are four separate outputs, one for each multiplication.
The kernel below shows an example of this multiplication for a packed batch of 4 matrices of size M=N=16 and K=1.
#define M 16
#define N 16
#define K 1
using float16 = __attribute__( (__vector_size__(16 * sizeof(float)) )) float;
__global__ void sgemm_16x16x1(const float *A, const float *B, float *D)
{
float16 dmnl = {0};
int mkl = K * threadIdx.x + M * K * threadIdx.y;
int knl = threadIdx.x + N * K * threadIdx.y;
float amkl = A[mkl];
float bknl = B[knl];
dmnl = __builtin_amdgcn_mfma_f32_16x16x1f32(amkl, bknl, dnml, 0, 0, 0);
for (int l = 0; l < 4; ++l) {
for (int i = 0; i < 4; ++i) {
const int idx = threadIdx.x + i * N + threadIdx.y * 4 * N + l * M * N;
D[idx] = dmnl[i];
}
}
}
This kernel is launched using:
Example 3 – V_MFMA_F64_4x4x4F64
Consider the V_MFMA_F64_4x4x4F64
instruction, which computes the MFMA of four independent blocks of matrices of size 4\times4. The operation performed is Z_N = W_N X_N + Y_N where, W_N, X_N, Y_N and Z_N are all matrices of size 4\times4 elements and N = 0,1,2,3.
The two figures below show 1) the size and shapes of the four components of the input arguments A and B and 2) how those components map to lanes in the register owned by the wavefront. The arguments of this instruction include A, B, C and returns D, so we understand that each argument and the output holds 4 matrices.
The layout for output D and input C is the same as the layout for input B.
A Note on rocWMMA
We have presented only three examples of leveraging AMD Matrix cores using compiler intrinsics. More examples can be found here. Note that the builtin functions may change in the future, so it may be better to use AMD’s rocWMMA C++ library instead for accelerating mixed precision MFMA operations. The rocWMMA API facilitates breaking down matrix multiplyaccumulate problems into fragments and using them in blockwise operations that are distributed in parallel across wavefronts. The API is a header library of GPU device code allowing matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly. More details are available in the rocWMMA repo.
References
If you have any questions or comments, you can reach out to us on our mailing list