Finite difference method – Laplacian part 4

Originally posted: July 18, 2023

Last updated: July 1, 2024

Justin Chang

Corresponding Author

Thomas H. Gibson

Author

Sean Miller

Author

In the previous three posts, we presented a HIP implementation of the finite difference method for the Laplace operator and applied four different optimizations. Over the course of these code modifications, we observed incremental performance improvements due to the reduction in total fetches to global memory. We then applied further optimizations to hit the projected performance target on a 512 $\times$ 512 $\times$ 512 grid of points on a single GCD of a MI250X GPU. Shown below is the final HIP kernel which we refer to as Kernel 5:

// Tiling factor
#define m 8
// Launch bounds
#define LB 256
template <typename T>
__launch_bounds__(LB)
__global__ void laplacian_kernel(T * f, const T * u, int nx, int ny, int nz, T invhx2, T invhy2, T invhz2, T invhxyz2) {

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = m*(threadIdx.y + blockIdx.y * blockDim.y);
    int k = threadIdx.z + blockIdx.z * blockDim.z;

    // Exit if this thread is on the xz boundary
    if (i == 0 || i >= nx - 1 ||
        k == 0 || k >= nz - 1)
        return;

    const int slice = nx * ny;
    size_t pos = i + nx * j + slice * k;

    // Each thread accumulates m stencils in the y direction
    T Lu[m] = {0};

    // Scalar for reusable data
    T center;

    // z - 1, loop tiling
    for (int n = 0; n < m; n++)
        Lu[n] += u[pos - slice + n*nx] * invhz2;

    // y - 1
    Lu[0]   += j > 0 ? u[pos - 1*nx] * invhy2 : 0; // bound check

    // x direction, loop tiling
    for (int n = 0; n < m; n++) {
        // x - 1
        Lu[n] += u[pos - 1 + n*nx] * invhx2;

        // x
        center = u[pos + n*nx]; // store for reuse
        Lu[n] += center * invhxyz2;

        // x + 1
        Lu[n] += u[pos + 1 + n*nx] * invhx2;

        // reuse: y + 1 for prev n
        if (n > 0) Lu[n-1] += center * invhy2;

        // reuse: y - 1 for next n
        if (n < m - 1) Lu[n+1] += center * invhy2;
    }

    // y + 1
    Lu[m-1]  += j < ny - m ? u[pos + m*nx] * invhy2 : 0; // bound check

    // z + 1, loop tiling
    for (int n = 0; n < m; n++)
      Lu[n] += u[pos + slice + n*nx] * invhz2;

    // Store only if thread is inside y boundary
    for (int n = 0; n < m; n++)
      if (n + j > 0 && n + j < ny - 1)
        __builtin_nontemporal_store(Lu[n],&f[pos + n*nx]);
}

}

The objective of Part 4 is to explore the achieved performance of this code across various AMD GPUs and problem sizes. As part of this investigation, we will conclude the Laplacian series with lessons learned from our experience and provide of possible code optimizations users can apply to similar problems. The rest of this post is outlined as follows:

Run Kernel 5 across a broad spectrum of problem sizes across several AMD GPUs
Examine L2 cache hit performance and identify the root cause behind its degradation
Present techniques to circumvent L2 cache size limitations
Review all code optimizations undertaken in this Laplacian series.

Performance across different hardware and sizes

In addition to the MI250X GPU, the following AMD GPUs are considered:

We will perform a scaling study for Kernel 5 starting with a problem size nx,ny,nz = 256, 256, 256 across all of the above GPUs. Each dimension is increased in multiples of 256 until we reach a global problem size of nx,ny,nz = 1024, 1024, 1024. Recall that the Figure of Merit (FOM) is defined as the effective memory bandwidth:

theoretical_fetch_size = ((nx * ny * nz - 8 - 4 * (nx - 2) - 4 * (ny - 2) - 4 * (nz - 2) ) * sizeof(double);
theoretical_write_size = ((nx - 2) * (ny - 2) * (nz - 2)) * sizeof(double);
effective_memory_bandwidth = (theoretical_fetch_size + theoretical_write_size) / average_kernel_execution_time;

This is the average rate of data transfer considering the least amount of data that needs to move through the entire memory subsystem. The objective here is to obtain a FOM as close to the achievable device HBM bandwidth as possible. Through prior experimentation of rerunning Kernel 5 across various thread block sizes, launch bounds, and tiling factors, we found the following parameters provided the highest FOM for each individual GPU on a 512 $\times$ 512 $\times$ 512 grid of points:

GPU	Thread block	Launch bounds	Tiling factor
RX 6900 XT	256 x 1 x 1	256	16
RX 7900 XTX	256 x 1 x 1	256	8
MI50	256 x 1 x 1	256	4
MI100	128 x 1 x 1	128	4
MI210	256 x 1 x 1	256	8
MI250	256 x 1 x 1	256	8
MI250X	256 x 1 x 1	256	8

From here on out, experiments on both the MI250 and MI250X GPUs shall be conducted on a single GCD. The figure below depicts the FOM[1] across the various AMD GPUs and problem sizes.

Figure 1: Performance of Kernel 5 across various AMD GPUs and problem sizes

Note that the RX 6900 XT GPU does not have enough memory to execute the largest example nx,ny,nz = 1024, 1924, 1024 so it is not shown in the figure above. Performance across all sizes on the RX 6900 XT and RX 7900 XTX GPUs appear relatively consistent, whereas all Instinct™ GPUs exhibit degraded performance when the problem size crosses a certain threshold. In the next sections we will investigate the source of the performance degredation on Instinct™ GPUs.

Root cause of performance degradation

Let us refocus our attention back on the MI250X GPU. To understand the cause of the performance drop, we need to collect rocprof stats. We will first collect the FETCH_SIZE and compare it against the theoretical_fetch_size for each combination of nx,ny,nz. Ideally, the FETCH_SIZE should be equivalent to the theoretical amount as the elements loaded by each thread could be reused up to six times by neighboring threads. It should be noted that the MI100, MI210, MI250, and MI250X GPUs all have 8 MB of L2 cache shared among the compute units (CUs) per GCD, so the L2CacheHit we collect will be useful in determining how well cached data is reused. The figure below illustrates our findings across the same spectrum of problem sizes:

Figure 2: Fetch efficiency and L2 cache hit rates of Kernel 5 on a single MI250X GCD across various problem sizes

There is a clear correlation between the FOM, fetch efficiency, and L2 cache hit rate. Recall from the Laplacian Part 1 post that the baseline HIP kernel exhibited a 50% fetch efficiency while maintaining a relatively high L2 cache hit rate of 65%. This led us to believe that there was a problem with the kernel design. In this last experiment, both the fetch efficiency and L2 cache hit rate leveled around 33%, suggesting that the limiting factor for performance is the size of the L2 cache.

Another interesting observation from the above figure is that the dips in performance occur only when either nx or ny increases – increases in nz plays virtually no role in the performance. As seen from the thread block configuration and grid indexing in Kernel 5 above, all elements from a single xy-plane will be fetched from global memory before fetching elements from the next xy-plane. Stencil computations per thread have z-direction strides of -nx*ny, 0, and nx*ny so three xy-planes of data are needed. If the three xy-planes are too large to simultaneously fit int the L2 cache, then each thread will cycle the cache to fetch extra xy planes from global memory – for example, when only a single xy-plane fits (or partially fits) the L2 cache, the thread needs to cycle through two extra xy planes from global memory hence fetch efficiency dropping to as low as 33%.

Let us conduct a second experiment to further illustrate this point. We shall examine various values of nx and ny to identify the optimal z-direction stride before performance begins to degrade. From a starting z-stride of ~1 MB (equivalent to nx,ny = 512,256) with nx fixed to three different sizes, we incrementally increase ny until the xy-plane exceeds 8 MB (nxnysizeof(double)). See the figure below:

Figure 3: Performance impact from an increasing nx*ny stride for various fixed values of nx on a single MI250X GCD

Regardless of the size of nx, performance degrades when a single xy-plane exceeds ~2.5 MB in memory. It turns out 2.5 MB is small enough to fit three consecutive xy planes in an 8 MB L2 cache, which explains why the FOMs are higher in value. However, the larger the stride the more elements start dropping out of of cache thereby lowering the L2CacheHit rate and subsequently the FOM. Performance levels out after nx*ny exceeds 8 MB because at this point only one plane fits (or partially fits) the cache – this is where the FETCH_SIZE is three times larger than the theoretical estimate.

Note about Radeon™ GPU performance

One may wonder why neither the RX 6900 XT nor the RX 7900 XTX GPUs appear to be affected by the scaling study shown in Figure 1. Unlike the Instinct™ GPUs where the L2 cache is the last cache shared by all CUs, aka the “Last Level Cache” (LLC), the Radeon™ GPUs have an additional shared cache level: the L3 cache aka the “Infinity Cache”. The L3 cache sizes for the two Radeon™ GPUs are 128 MB and 96 MB, respectively, both of which are large enough to fit multiple xy-planes of size nx,ny = 1024,1024. From here on, we use LLC to refer to the L3 cache for Radeon™ GPUs and L2 for the Instinct™ GPUs.

Let us quickly perform a new scaling study to verify that the LLC of the two Radeon™ GPUs is indeed the limiting factor of performance for larger sizes. To ensure that we do not run out of memory, let us begin with nx,ny,nz = 1024,1024,32 and slowly increment the value of ny:

Figure 4: Performance impact from an increasing nx*ny stride on the RX 6900 XT and RX 7900 XTX GPUs

In the above figure, the effective memory bandwidth still deteriorates when the xy-planes crosses certain thresholds. The RX 6900 XT performance deteriorates when the xy-plane exceeds 32 MB whereas the RX 7900 XTX performance deteriorates when the xy-plane exceeds 25 MB. The LLC of the RX 7900 XTX is slightly smaller than that of the RX 6900 XT hence why performance begins to degrade at a smaller xy-plane size. Regardless of the AMD GPU utilized, performance will degrade if nx * ny is too large.

Addressing the cache size limitation

So how do we address the issue of the cache sizes for these larger problems? This is especially important on hardware like the MI50 where the LLC is only 4 MB of L2 cache. Even if nz is small, large values of nx and ny, the L2 cache could become a memory traffic bottleneck. Let us consider when nx,ny,nz = 1024,1024,1024 for the next several experiments on an MI250X GCD. Below are two possible strategies for addressing L2 cache bottleneck:

Option 1: grid block configurations

As said previously, the current thread block and grid indexing configurations fetch elements in the x-direction followed by the y-direction and z-direction. For a large problem like nx,ny,nz = 1024,1024,1024, this methodology fills the L2 cache entirely before any z-direction stencils can be fetched. Let us first modify the baseline thread block configuration 256 $\times$ 1 $\times$ 1 by changing to 128 $\times$ 1 $\times$ 8. This requires us to reset the launch bounds back to its default value of 1024, but let us see if this simple change has any performance impact:

	FOM (GB/s)	Fetch efficiency (%)	L2 Cache Hit
Kernel 5 – baseline	555.389	33.1	34.4
Kernel 5 – 128x1x8	934.111	79.8	60.6

There is a significant improvement – the FOM improved by 68%, the fetch efficiency increased to nearly 80%, and the L2 cache hit improved to over 60%. Modifying the thread block configuration to include elements in the z-direction enabled the HIP kernel to significant improve its performance on these larger problem sizes. However, the fetch efficiency still has about 20% left on the table.

A different modification to try is to still use the original 256 $\times$ 1 $\times$ 1 but configure both the grid launch of the HIP kernel and the indexing within the kernel itself. This requires code modifications on both the host side and device side. For the host side, we modify the grid launch configuration:

Kernel 5 (Before)	Kernel 6 (After)
`dim3 grid( (nx - 1) / block.x + 1, (ny - 1) / (block.y * m) + 1, (nz - 1) / block.z + 1);`	`dim3 grid( (ny - 1) / (block.y * m) + 1, (nz - 1) / block.z + 1, (nx - 1) / block.x + 1);`

The second change involves configuring the indexing inside the HIP kernel:

Kernel 5 (Before) Kernel 6 (After)

Kernel 5 (Before)	Kernel 6 (After)
`int i = threadIdx.x + blockIdx.x * blockDim.x; int j = m(threadIdx.y + blockIdx.y blockDim.y); int k = threadIdx.z + blockIdx.z * blockDim.z;`	`int i = threadIdx.x + blockIdx.z * blockDim.x; int j = m(threadIdx.y + blockIdx.x blockDim.y); int k = threadIdx.z + blockIdx.y * blockDim.z;`

int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = m*(threadIdx.y + blockIdx.y * blockDim.y);
int k = threadIdx.z + blockIdx.z * blockDim.z;

int i = threadIdx.x + blockIdx.z * blockDim.x;
int j = m*(threadIdx.y + blockIdx.x * blockDim.y);
int k = threadIdx.z + blockIdx.y * blockDim.z;

Previously, the blockIdx.x and blockIdx.y indices stride pos in the x and y directions, respectively, which loads xy-planes of size nx * ny into cache. Now, they stride pos in the y and z directions, respectively, which loads xy-planes of size blockDim.x * ny into cache. The shuffling of the index ordering enables smaller sections of multiple xy-planes to fill the L2 cache thereby increasing the likelihood that some stencil computations can effectively reuse all data already fetched by other thread blocks. Below are the performance numbers:

	FOM (GB/s)	Fetch efficiency (%)	L2 Cache Hit
Kernel 5 – baseline	555.389	33.1	34.4
Kernel 5 – 128x1x8	934.111	79.8	60.6
Kernel 6 – Grid indexing	1070.04	95.4	66.2

This modified grid indexing actually performs better than the previous modification of the thread block sizes. Fetch efficiency is very close to the theoretical limits and the L2 cache hit rate improved slightly. However, this workaround isn’t perfect either – this last example was run with blockDim.x = 256 so each sub xy-planes occupies roughly 2 MB of memory, thereby still allowing three different sub xy-planes to fit into the L2 cache. Larger nx * ny planes will inevitably face the same issues as before.

Option 2: splitting into subdomains

Another way to more robustly circumvent the issue is to split the overall problem size into smaller domains such that three xy-planes can fit into the L2 cache and serialize the kernel launches. The current xy-plane size nx,ny = 1024,1024 is 4 times greater than the previously considered xy-plane size nx,ny = 512,512 known to fit into the L2 cache three times, so let’s consider breaking this problem into 4 sub domains.

Let us start by going back to Kernel 5. First we modify the host side code:

Kernel 5 (Before) Kernel 7 (After)

Kernel 5 (Before)	Kernel 7 (After)
`dim3 grid( (nx - 1) / block.x + 1, (ny - 1) / (block.y * m) + 1 , (nz - 1) / block.z + 1); ... laplacian_kernel<<<grid, block>>>(d_f, d_u, nx, ny, nz, invhx2, invhy2, invhz2, invhxyz2);`	`int bny = (ny - 1) / 4 + 1; dim3 grid( (nx - 1) / block.x + 1, (bny - 1) / (block.y * m) + 1 , (nz - 1) / block.z + 1); ... for (int i = 0; i < 4; i++) laplacian_kernel<<<grid, block>>>(d_f, d_u, nx, ny, nz, invhx2, invhy2, invhz2, invhxyz2, bny*i);`

dim3 grid(
    (nx - 1) / block.x + 1,
    (ny - 1) / (block.y * m) + 1 ,
    (nz - 1) / block.z + 1);

...


laplacian_kernel<<<grid, block>>>(d_f, d_u, nx, ny, nz,
    invhx2, invhy2, invhz2, invhxyz2);

int bny = (ny - 1) / 4 + 1;
dim3 grid(
    (nx - 1) / block.x + 1,
    (bny - 1) / (block.y * m) + 1 ,
    (nz - 1) / block.z + 1);

...

for (int i = 0; i < 4; i++)
  laplacian_kernel<<<grid, block>>>(d_f, d_u, nx, ny, nz,
      invhx2, invhy2, invhz2, invhxyz2, bny*i);

Three changes were made to the code. First we modifyed the grid.y value by using bny = (ny - 1) / 4 + 1 instead of ny to indicate that we’re only iterating through one quarter of the domain in the y direction. Next we add a for loop to launch four HIP kernels to compute the stencils across each of the four sub domains. Finally we modify the kernel arguments by adding a y-direction offset. The device kernel needs to be modified as follows:

Kernel 5 (Before) Kernel 7 (After)

Kernel 5 (Before)	Kernel 7 (After)
`__global__ void laplacian_kernel(T * f, const T * u, int nx, int ny, int nz, T invhx2, T invhy2, T invhz2, T invhxyz2) { ... int j = m(threadIdx.y + blockIdx.y blockDim.y);`	`__global__ void laplacian_kernel(T * f, const T * u, int nx, int ny, int nz, T invhx2, T invhy2, T invhz2, T invhxyz2, int sy) { ... int j = sy + m(threadIdx.y + blockIdx.y blockDim.y);`

__global__ void laplacian_kernel(T * f,
    const T * u, int nx, int ny, int nz,
    T invhx2, T invhy2, T invhz2, T invhxyz2) {

...

int j = m*(threadIdx.y + blockIdx.y * blockDim.y);

__global__ void laplacian_kernel(T * f,
    const T * u, int nx, int ny, int nz,
    T invhx2, T invhy2, T invhz2, T invhxyz2, int sy) {

...

int j = sy + m*(threadIdx.y + blockIdx.y * blockDim.y);

Everything else within the kernel can remain the same. We examine the performance below:

	FOM (GB/s)	Fetch efficiency (%)	L2 Cache Hit
Kernel 5 – baseline	555.389	33.1	34.4
Kernel 5 – 128x1x8	934.111	79.8	60.6
Kernel 6 – Grid indexing	1070.04	95.4	66.2
Kernel 7 – Subdomain split	1175.54	99.6	67.7

The FOM is even higher than before. Fetch efficiency is also nearly 100%, indicating that we have optimized memory movement. Unlike the previous workaround, this methodology is robust to handle xy-planes of any size – the larger the xy-plane the more subdomains the user can split the problem into. This approach is particularly useful if your kernel will run on both Radeon™ and Instinct™ GPUs where the differing LLC sizes can drastically affect the “cutoff” point in performance. Although we have chosen to split the problem size in the y-direction, one could also easily apply this optimization to the x-direction.

Summary and Recap

In this part, we first examined the performance of Kernel 5 across various AMD GPUs and problem sizes and noted that there appeared to be sharp drop off in performance when the grid crosses a certain size threshold. It was determine through experimentation that the size of the LLC was the limiting factor of performance for problems with large xy-planes. Two different workarounds were presented to circumvent the issue regarding the cache size, both of which required modifications to only a few lines of code.

Throughout the course of the finite difference method Laplacian series, we started with a simple implementation of the HIP kernel and incrementally applied several different optimizations to dramatically improve the performance of the kernel. We shall now quickly recap all the different optimizations undertaken in this work:

Loop tiling (Part 2): If a kernel is known to load multiple reusable values, consider adding loop tiling aka loop unrolling. Several steps are required to refactor the code, but this ultimately reduces the number of thread blocks launched and increase the number of stencils computed per thread. The FETCH_SIZE could be greatly reduced with this approach.
Reordered access patterns (Part 2): Frequently accessing elements of a device by going both forward and backward in the memory address may prematurely evict reusable data from cache. The kernel should reorder all load instructions in the kernel to access data in a single direction i.e., accessing memory by ascending address. This optimization is needed in order to properly leverage the loop tiling optimization.
Launch bounds (Part 3): If the register usage of a kernel is extremely high, applying launch bounds may enable the compiler to make more appropriate decision in register allocation, thus improving the register pressure. The default is 1024 threads, but we strongly recommend setting this to the number of threads your kernel will be using.
Nontemporal loads (Part 3): Elements of the output array f are not shared by any other threads or kernels, so the builtin intrinsics allow the finite difference stencil computation to be written back to device memory with an increased priority for eviction from cache, thus freeing up more cache for the input array u.
Grid block configuration (Part 4): A quick way to circumvent problems that spill over the L2 cache can be to configure the thread block or grid indexing configuration. This lets smaller segments of the xy-plane fill the cache first allowing some threads to perform computations without having to fetch from global memory multiple times.
Splitting into subdomains (Part 4): If the entire problem is bottlenecked by the LLC size, consider splitting the problem into several subdomains and launching the HIP kernels to process each subdomain sequentially. This method is robust and can be tailored as needed for any LLC or problem size.

The finite difference stencil for the Laplacian on a structured grid is relatively straightforward to implement in HIP. In a future lab notes post, we will consider other partial differential equations like that of the acoustic wave equation which will require higher order stencils than the ones shown here. This most likely requires optimization strategies not outlined in these last four posts. Nonetheless, we believe all the optimization strategies outlined in this Laplacian series should be leveraged to get the best performance out of stencil kernels.

Accompanying code examples

If you have any questions or comments, please reach out to us on GitHub Discussions

[1]

Testing conducted using ROCm version 5.3.0-63. Benchmark results are not validated performance numbers, and are provided only to demonstrate relative performance improvements of code modifications. Actual performance results depend on multiple factors including system configuration and environment settings, reproducibility of the results is not guaranteed.

Justin Chang

Corresponding Author

Justin Chang is a Senior Member of Technical Staff (SMTS) Software System Design Engineer in the Data Center GPU Software Solutions group and manages the AMD lab notes blog series. He received his PhD degree in Civil Engineering from the University of Houston, where he published several journal papers on structure-preserving high performance computational methods for transport in porous media. As a postdoc, he worked for both Rice University and the National Renewable Energy Laboratory to accelerate finite element simulation time of subsurface flow through dual porosity porous medium and lithium-ion batteries used in electric vehicles. He also worked for the Oil and Gas industry and focused on GPU porting and optimization of key FWI, RTM, and other seismic imaging workloads.

Thomas H. Gibson

Author

Thomas Gibson is a Member of Technical Staff (MTS) Software System Design Engineer in the Data Center GPU Software Solutions group. He obtained his PhD in computational mathematics from Imperial College London, where he specialized in mixed finite element discretizations for numerical weather modeling codes. After completing his PhD in 2020, Thomas continued to work on structure-preserving ("compatible") finite element methods and multigrid preconditioners for weather applications. Additionally, he began shifting his research towards accelerating fluid dynamics codes using GPUs and developed high-fidelity/low-dissipation discontinuous Galerkin methods for turbulence/combustion models on GPUs. His current research interests include optimizing C/C++/Fortran GPU applications, iterative solvers and preconditioning, finite element discretizations, and numerical weather prediction applications.

Sean Miller

Author

Sean Miller is a Senior Member of Technical Staff (SMTS) Software System Design Engineer in AMD's Data Center GPU Software Solutions group. He received his PhD from the University of Washington focusing on computational plasma physics for fusion energy applications. Sean continued his research at Sandia National Laboratories where he developed high-energy density physics modeling tools, before shifting to AMD where he supports the porting and optimization of scientific software for GPU accelerated HPC environments.