Guidelines to help improve CPU and GPU performance by using Smart Access Memory (SAM)

In the past, CPU access to GPU memory has been very limited. Only the driver had direct access; applications had to use specific functions. With explicit APIs, tighter control over memory allocation and use became possible.

With the introduction of Smart Access Memory (SAM), the CPU now has direct access to all video memory. In this article, we will provide guidelines to help improve CPU and GPU performance using this feature.

Background

In a system with a discrete GPU, memory is split between two locations: system memory (RAM), which sits close to the CPU, and video memory (also called GPU memory, VRAM, or framebuffer), which is located on the graphics card and optimized for GPU-typical access patterns.

Explicit APIs typically expose this memory as four memory types, described below:

D3D12_HEAP_TYPE_DEFAULT or VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT

Video memory that is not visible to the CPU.

D3D12_HEAP_TYPE_UPLOAD or VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

System memory, visible to both the CPU and GPU.

This is not cached on the CPU, so reads are slow, but writes can be fast using write-combining. The GPU can read from this memory over PCIe®.

D3D12_HEAP_TYPE_READBACK or VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT

System memory that is cached by the CPU.

This is fast to access on the CPU, but slow to access from the GPU, which must snoop the CPU cache.

VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

Video memory that the CPU can directly access over PCIe®.

From the CPU perspective, this is like UPLOAD memory: it is uncached and write-combined. It is also called the BAR (Base Address Register) after the mechanism used for accessing it. DirectX® 12 currently does not expose this directly.

The BAR has existed for a long time and has been used by the driver for various purposes. Radeon™ drivers for Vulkan® have exposed it as well for application use. However, until recently, only a 256MiB aperture has been visible to the CPU at any time.

With SAM, recent hardware supports resizing this memory segment to cover the entire framebuffer, based on a standard PCIe® function commonly called Resizable BAR or ReBAR. See amd.com for a list of system requirements and steps to enable this feature on AMD hardware.

When SAM is enabled, Vulkan applications are no longer limited to the small aperture and may use significantly more local visible memory.

For DirectX® 12, the Radeon driver can apply an optimization that places some resources allocated in the UPLOAD heap in the BAR instead, at allocation time.

Guidelines

We have put together a set of guidelines for developers to best make use of Smart Access Memory. Even though SAM is a recent introduction to the PC space, many of these points also apply to the more traditional memory access types of the past, and when Smart Access Memory is disabled.

Memory allocation and organization

Processing the same amount of data in more, smaller chunks is usually slower due to per-operation overhead. Instead of using many small buffers, group data in larger buffers and use offsets to identify sub-ranges. Keep at least 64KiB of meaningful data in each buffer. This also helps avoid the increased memory use caused by alignment requirements. Our Vulkan® Memory Allocator (VMA) and Direct3D®12 Memory Allocator (D3D12MA) SDKs can be used for this purpose.
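The grouping idea can be sketched as a simple linear sub-allocator that carves aligned ranges out of one large buffer. This is an illustrative sketch, not the VMA or D3D12MA API; the struct and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical linear sub-allocator: hand out aligned offsets into one
// large buffer instead of creating many small buffers.
struct LinearSubAllocator {
    uint64_t capacity;     // size of the backing buffer in bytes
    uint64_t offset = 0;   // next free byte

    // Round 'value' up to the next multiple of 'alignment'
    // (alignment must be a power of two).
    static uint64_t alignUp(uint64_t value, uint64_t alignment) {
        return (value + alignment - 1) & ~(alignment - 1);
    }

    // Returns the aligned offset of the sub-range,
    // or UINT64_MAX if the backing buffer is full.
    uint64_t allocate(uint64_t size, uint64_t alignment) {
        uint64_t start = alignUp(offset, alignment);
        if (start + size > capacity) return UINT64_MAX;
        offset = start + size;
        return start;
    }
};
```

Each returned offset identifies a sub-range within the single large buffer, e.g. for use as a buffer offset in a descriptor or copy region.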

Since our driver optimization places UPLOAD resources in VRAM, do not allocate more memory in the UPLOAD heap than needed, as doing so could lead to memory oversubscription.

To improve the chances of the right resources being allocated in VRAM under this optimization – even with only a 256MiB BAR – allocate small heaps or committed resources, and perform the allocations that benefit most from VRAM placement first. These are typically resources that are read often by the GPU.
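One way to apply the "most beneficial first" rule is to order pending resource creations by how often the GPU will read them before allocating. The struct and field names below are illustrative assumptions, not part of any API:

```cpp
#include <algorithm>
#include <vector>
#include <cstdint>

// Illustrative: describe each resource we intend to create, with a rough
// estimate of how often the GPU reads it. Resources read most often gain
// the most from landing in the (possibly small, 256 MiB) visible VRAM segment.
struct PendingResource {
    const char* name;
    uint64_t    sizeBytes;
    int         gpuReadsPerFrame; // estimated; higher = allocate earlier
};

// Sort so that the most frequently GPU-read resources are allocated first.
inline void sortByVramBenefit(std::vector<PendingResource>& resources) {
    std::sort(resources.begin(), resources.end(),
              [](const PendingResource& a, const PendingResource& b) {
                  return a.gpuReadsPerFrame > b.gpuReadsPerFrame;
              });
}
```

After sorting, creating the resources in order gives the driver the best candidates for VRAM placement while space remains.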

CPU Access

Both UPLOAD and BAR memory segments are uncached and write-combined. That means that CPU reads from this memory will be slow. Due to the increased distance from the CPU, reading from the BAR introduces a large amount of additional latency.

When writing from the CPU, avoid large strides and random access. Write sequentially or at least with a high degree of locality. If possible, align the start of your data to 64 bytes to minimize the number of write combining buffers in use. Again, this becomes more important when writing over PCIe®. You can read more about write-combining in the Software Optimization Guides for AMD processors.

Functions for mapping and unmapping memory (ID3D12Resource::Map/Unmap or vkMapMemory/vkUnmapMemory) have overhead that should be avoided. It is generally safe and recommended to leave buffers persistently mapped. Note that this can interfere with the performance of capture tools (such as PIX or RenderDoc), so it may be beneficial to unmap unneeded memory in debugging builds.

Copying

When copying data from UPLOAD memory to DEFAULT memory, apply the following rules:

  1. If the data is needed immediately, copy using the queue where it is used to avoid synchronization overhead.
  2. If the graphics or compute queue is otherwise idle, use it to copy.
  3. If the result is not used soon after copying and both the graphics and compute queues have concurrent work, use the copy/transfer queue.

If the resource is placed in VRAM through our driver optimization, the graphics/compute queues can be 5x as fast as the copy/transfer queue at copying data. If not, the copy/transfer queue is fastest, although only by a small margin.
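The three rules above can be encoded as a small decision function. The enum and function names are illustrative, not part of any API, and rule 2 is simplified to return the compute queue even though either idle queue would do:

```cpp
// Hypothetical encoding of the copy-queue selection rules.
enum class CopyQueue { Graphics, Compute, Transfer };

inline CopyQueue chooseCopyQueue(bool dataNeededImmediately,
                                 bool graphicsOrComputeIdle,
                                 bool resultUsedSoon) {
    // Rule 1: data needed immediately -> copy on the consuming queue
    //         to avoid cross-queue synchronization overhead.
    if (dataNeededImmediately) return CopyQueue::Graphics;
    // Rule 2: an otherwise idle graphics/compute queue is faster at
    //         copying than the transfer queue when the data is in VRAM.
    if (graphicsOrComputeIdle) return CopyQueue::Compute;
    // Rule 3: result not needed soon and both queues busy -> copy
    //         asynchronously on the copy/transfer queue.
    if (!resultUsedSoon)       return CopyQueue::Transfer;
    // Otherwise keep latency low by copying on the consuming queue.
    return CopyQueue::Graphics;
}
```

In a real engine these inputs would come from frame scheduling knowledge rather than booleans, but the priority order is the point.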

Shaders in DirectX® 12 can access UPLOAD memory directly. If data is written and immediately used, accessed only once on the GPU, and read with a high degree of locality, it can be beneficial to skip the copy and use a buffer in UPLOAD memory directly. If the driver places the resource in BAR memory, this can be significantly faster than a separate copy. If not, the time saved by not copying is often still greater than the extra time the shader spends reading over PCIe®. However, this only applies if the copy latency cannot be hidden on another queue or behind concurrent work.

We have updated the RDNA™ 2 Performance Guide with a summary of these recommendations as well.

Related content

D3D12 Memory Allocator (D3D12MA): a C++ library that provides a simple and easy-to-integrate API to help you allocate memory for DirectX®12 buffers and textures.

Vulkan® Memory Allocator (VMA): our single-header, MIT-licensed C++ library for easily and efficiently managing memory allocation for your Vulkan® games and applications.