Guidelines to help improve CPU and GPU performance by using Smart Access Memory (SAM)
In the past, CPU access to GPU memory was very limited: only the driver had direct access, and applications had to go through dedicated API functions. With explicit APIs, tighter control over memory allocation and use became possible.
With the introduction of Smart Access Memory (SAM), the CPU now has direct access to all video memory. In this article, we will provide guidelines to help improve CPU and GPU performance using this feature.
In a system with a discrete GPU, memory is split between two locations: system memory (RAM), which is close to the CPU, and video memory (also called GPU memory, VRAM, or framebuffer), which is located on the graphics card and optimized for GPU-typical access patterns.
Explicit APIs typically expose this memory as four memory types, shown in the table below:

| Type | Vulkan® memory property flags | Description |
| --- | --- | --- |
| DEFAULT | VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | Video memory that is not visible to the CPU. |
| UPLOAD | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | System memory, visible to both the CPU and GPU. This is not cached on the CPU, so reads are slow, but writes can be fast using write-combining. The GPU can read from this memory over PCIe®. |
| READBACK | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT \| VK_MEMORY_PROPERTY_HOST_CACHED_BIT | System memory that is cached by the CPU. This is fast to access on the CPU, but not from the GPU, as it must snoop the CPU cache. |
| BAR | VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT \| VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT \| VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | Video memory that the CPU can directly access over PCIe®. From the CPU perspective, this is like UPLOAD memory: it is uncached and write-combined. It is also called the BAR (Base Address Register) after the mechanism used for accessing it. DirectX® 12 currently does not expose this directly. |
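In Vulkan®, an application finds the BAR type by searching the reported memory types for one that is both device-local and host-visible. The sketch below shows that selection logic; the flag constants are redefined locally (with their real Vulkan values) so it compiles without the Vulkan headers, and in a real application the per-type flags come from `vkGetPhysicalDeviceMemoryProperties()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Vulkan memory property flag values, redefined here so this sketch
// compiles without the Vulkan headers.
constexpr std::uint32_t DEVICE_LOCAL_BIT  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t HOST_VISIBLE_BIT  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t HOST_COHERENT_BIT = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// Pick the first memory type whose property flags contain all requested
// bits. Returns -1 if no type matches, e.g. when asking for a
// device-local, host-visible type on a system without a resizable BAR.
int findMemoryType(const std::vector<std::uint32_t>& typeFlags,
                   std::uint32_t required)
{
    for (std::size_t i = 0; i < typeFlags.size(); ++i)
        if ((typeFlags[i] & required) == required)
            return static_cast<int>(i);
    return -1;
}
```

A typical use is to first request DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT for the BAR type and, if that returns -1, fall back to the UPLOAD-style HOST_VISIBLE_BIT | HOST_COHERENT_BIT combination.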
The BAR has existed for a long time and has been used by the driver for various purposes. Radeon™ drivers for Vulkan® have exposed it as well for application use. However, until recently, only a 256MiB aperture has been visible to the CPU at any time.
With SAM, recent hardware supports resizing this memory segment to cover the entire framebuffer, based on a standard PCIe® function commonly called Resizable BAR or ReBAR. See amd.com for a list of system requirements and steps to enable this feature on AMD hardware.
When SAM is enabled, Vulkan applications are no longer limited to the small aperture and may use significantly more local visible memory.
For DirectX® 12, the Radeon™ driver can apply an optimization that places some resources allocated in the UPLOAD heap in the BAR instead at allocation time.
We have put together a set of guidelines to help developers make the best use of Smart Access Memory. Even though SAM is a recent introduction in the PC space, many of these points also apply to the more traditional memory access types of the past, and when Smart Access Memory is disabled.
Memory allocation and organization
Processing the same amount of data in more and smaller chunks is usually slower, due to per-operation overhead. Instead of using many small buffers, group data in larger buffers and use an offset to identify parts. Keep at least 64KiB of meaningful data in each buffer. This can also help avoid increased memory use from alignment requirements. Our Vulkan® Memory Allocator (VMA) and Direct3D®12 Memory Allocator (D3D12MA) SDKs can be used for this purpose.
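As a minimal sketch of this suballocation strategy, the class below hands out aligned offsets within one large buffer instead of creating many small buffers. The class name and the 256-byte default alignment are illustrative; in production code, VMA or D3D12MA handle this for you.

```cpp
#include <cstddef>
#include <cstdint>

// Minimal linear suballocator: hands out offsets inside one large
// buffer. Parts of the buffer are then identified by (buffer, offset)
// rather than by many separate small buffers.
class LinearSubAllocator {
public:
    explicit LinearSubAllocator(std::size_t capacity) : m_capacity(capacity) {}

    // Returns an offset aligned to `alignment` (a power of two), or
    // SIZE_MAX if the request does not fit.
    std::size_t allocate(std::size_t size, std::size_t alignment = 256)
    {
        const std::size_t offset = (m_head + alignment - 1) & ~(alignment - 1);
        if (offset + size > m_capacity)
            return SIZE_MAX;
        m_head = offset + size;
        return offset;
    }

    // Rewind everything, e.g. once per frame for transient upload data.
    void reset() { m_head = 0; }

private:
    std::size_t m_capacity;
    std::size_t m_head = 0;
};
```

Because all requests share one allocation, the per-allocation alignment padding that would bloat many tiny buffers is paid only once, and each buffer can easily be kept above the suggested 64KiB of meaningful data.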
Since our driver optimization places UPLOAD resources in VRAM, do not allocate more memory in the UPLOAD heap than needed, as this could lead to memory oversubscription.
To improve the chances of the right resources being allocated in VRAM under this optimization – even with only a 256MiB BAR – allocate small heaps or committed resources and perform allocations that benefit most from VRAM placement first. These are typically ones that are read often by the GPU.
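One way to follow this ordering advice is to sort resources by how much they benefit from VRAM placement before allocating them. The sketch below is hypothetical: `gpuReadScore` stands in for whatever heuristic you use to estimate how often the GPU reads a resource.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical descriptor: `gpuReadScore` is an illustrative estimate
// of how often the GPU reads the resource.
struct ResourceDesc {
    std::size_t size;
    int gpuReadScore;
};

// Returns indices in the order the resources should be allocated:
// most-GPU-read first, so they are most likely to land in VRAM while
// the driver optimization (or a small 256MiB BAR) still has room.
std::vector<std::size_t> allocationOrder(const std::vector<ResourceDesc>& resources)
{
    std::vector<std::size_t> order(resources.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
        [&](std::size_t a, std::size_t b) {
            return resources[a].gpuReadScore > resources[b].gpuReadScore;
        });
    return order;
}
```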
Both UPLOAD and BAR memory segments are uncached and write-combined. That means that CPU reads from this memory will be slow. Due to the increased distance from the CPU, reading from the BAR introduces a large amount of additional latency.
When writing from the CPU, avoid large strides and random access. Write sequentially or at least with a high degree of locality. If possible, align the start of your data to 64 bytes to minimize the number of write combining buffers in use. Again, this becomes more important when writing over PCIe®. You can read more about write-combining in the Software Optimization Guides for AMD processors.
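The difference can be sketched as follows: `uploadConstants()` shows the recommended pattern of one contiguous, forward copy into the mapped pointer, while `uploadStrided()` shows the strided anti-pattern. Both function names are illustrative, and a plain pointer stands in for memory mapped from an UPLOAD or BAR allocation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Recommended pattern: one contiguous, forward copy. The CPU's
// write-combining buffers then fill and flush whole 64-byte lines.
// For the same reason, prefer a 64-byte-aligned destination offset.
void uploadConstants(void* mappedDst, const void* src, std::size_t bytes)
{
    std::memcpy(mappedDst, src, bytes);
}

// Anti-pattern, shown for contrast: large-stride writes touch many
// write-combining buffers at once and flush them only partially filled,
// which is especially costly when writing over PCIe.
void uploadStrided(std::uint32_t* mappedDst, const std::uint32_t* src,
                   std::size_t count, std::size_t stride)
{
    for (std::size_t i = 0; i < count; ++i)
        mappedDst[i * stride] = src[i];
}
```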
Functions for mapping and unmapping memory (vkMapMemory()/vkUnmapMemory() in Vulkan®, ID3D12Resource::Map()/Unmap() in DirectX® 12) have overhead that should be avoided. It is generally safe and recommended to leave buffers persistently mapped. Note that this can interfere with the performance of capture tools (such as PIX or RenderDoc), so it may be beneficial to unmap unneeded memory in debugging builds.
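The persistent-mapping pattern can be sketched as below: map once when the buffer is created, keep the pointer, and unmap only at destruction. `mapMemory()`/`unmapMemory()` are illustrative stand-ins for the real API calls, backed here by a plain vector so the sketch is self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Persistent mapping sketch: map once at creation, keep the pointer
// for the buffer's lifetime, unmap only at destruction. mapMemory()
// and unmapMemory() are illustrative stand-ins for vkMapMemory()/
// vkUnmapMemory() or ID3D12Resource::Map()/Unmap().
struct PersistentBuffer {
    std::vector<std::uint8_t> storage; // stands in for the driver allocation
    void* mapped = nullptr;
    int mapCalls = 0; // instrumentation for this sketch only

    explicit PersistentBuffer(std::size_t size) : storage(size)
    {
        mapped = mapMemory(); // map exactly once
    }
    ~PersistentBuffer() { unmapMemory(); }

    // Every write goes through the cached pointer -- no per-write
    // map/unmap overhead.
    void write(std::size_t offset, const void* src, std::size_t bytes)
    {
        std::memcpy(static_cast<std::uint8_t*>(mapped) + offset, src, bytes);
    }

private:
    void* mapMemory() { ++mapCalls; return storage.data(); }
    void unmapMemory() { mapped = nullptr; }
};
```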
When copying data from UPLOAD memory to DEFAULT memory, apply the following rules:
- If the data is needed immediately, copy using the queue where it is used to avoid synchronization overhead.
- If the graphics or compute queue is otherwise idle, use it to copy.
- If the result is not used soon after copying and both the graphics and compute queues have concurrent work, use the copy/transfer queue.
If the resource is placed in VRAM through our driver optimization, the graphics/compute queues can be 5x as fast as the copy/transfer queue at copying data. If not, the copy/transfer queue is fastest, although only by a small margin.
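The queue-selection rules above can be condensed into a small decision helper. All names are illustrative, and the final fallback (both queues busy but the result needed soon) is an assumption: we prioritize latency and copy on the graphics queue.

```cpp
// The three rules above as a tiny decision helper. Enum and parameter
// names are illustrative, not from any API.
enum class CopyQueue { Graphics, Compute, Transfer };

CopyQueue chooseCopyQueue(bool neededImmediately,
                          bool graphicsIdle,
                          bool computeIdle,
                          bool resultUsedSoon)
{
    // Rule 1: data needed immediately -- copy on the consuming queue
    // (assumed to be graphics here) to avoid synchronization overhead.
    if (neededImmediately)
        return CopyQueue::Graphics;
    // Rule 2: an otherwise idle graphics or compute queue is the
    // fastest copier, especially for VRAM-placed UPLOAD resources.
    if (graphicsIdle)
        return CopyQueue::Graphics;
    if (computeIdle)
        return CopyQueue::Compute;
    // Rule 3: both queues busy and the result not needed soon --
    // overlap the copy on the dedicated copy/transfer queue.
    if (!resultUsedSoon)
        return CopyQueue::Transfer;
    // Fallback (busy queues, result needed soon), an assumption not
    // covered by the rules: prioritize latency on the graphics queue.
    return CopyQueue::Graphics;
}
```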
Shaders in DirectX® 12 can access UPLOAD memory directly. If data is written and immediately used, is only accessed once on the GPU, and is read with a high degree of locality, it can be beneficial to skip the copy and use a buffer in UPLOAD memory directly. If the driver places the resource in BAR memory, this can be significantly faster than a separate copy. If not, the time saved by not copying is often still greater than the extra time the shader spends reading over PCIe®. However, this only applies if the copy latency cannot be hidden on another queue or with concurrent work.
We have updated the RDNA™ 2 Performance Guide with a summary of these recommendations as well.
The D3D12 Memory Allocator (D3D12MA) is a C++ library that provides a simple and easy-to-integrate API to help you allocate memory for DirectX®12 buffers and textures.
VMA is our single-header, MIT-licensed, C++ library for easily and efficiently managing memory allocation for your Vulkan® games and applications.