AMD Ryzen CPU Performance Guide

Originally posted: April 20, 2021

Last updated: February 9, 2024

Design faster. Render faster. Iterate faster.

Our AMD Ryzen™ Performance Guide walks you through the optimization process with a collection of tips and tricks to support you in your performance quest.

Tools

PresentMon

PresentMon is a command-line interface (CLI) tool for logging frame times such as MsBetweenPresents.

Example:

PresentMon-1.6.0-x64.exe -process_name "MyGame.exe" ^
    -stop_existing_session ^
    -terminate_on_proc_exit ^
    -terminate_after_timed ^
    -timed 60 ^
    -output_file "%CD%\result\presentmon.csv"

See https://github.com/GameTechDev/PresentMon

Open Capture and Analysis Tool (OCAT)

OCAT is a graphical user interface (GUI) tool, based on PresentMon, with hot key support for logging frame times.

See https://github.com/GPUOpen-Tools/OCAT

Windows® Performance Toolkit

Windows Performance Analyzer (WPA)

WPA is a highly configurable tool for finding system performance bottlenecks and ideal for filtering and visualizing call stacks.

WPA is included in the Windows SDK Windows Performance Toolkit and also available in the Microsoft Store.
WPA opens logs created by wpr.exe or xperf.exe.
wpr.exe is included in all Windows 10 installations.
xperf.exe is included in the Windows SDK.
See https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer
See https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/

GPUView

GPUView is a tool for analyzing GPU performance with regard to direct memory access (DMA) buffer processing.

GPUView allows you to find times where the GPU Hardware Queue is empty or times where the Process Context CPU Queue is empty.
- Ideally, the GPU Hardware Queue should be near 100% busy.
GPUView is included in the Windows SDK Windows Performance Toolkit.
See https://docs.microsoft.com/en-us/windows-hardware/drivers/display/using-gpuview
See https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/

Visual Studio Concurrency Visualizer

You can use the Concurrency Visualizer for Visual Studio to locate performance bottlenecks, CPU underutilization, thread contention, cross-core thread migration, synchronization delays, DirectX activity, areas of overlapped I/O, and other information.

AMD µProf

Find performance bottlenecks using CPU hardware Performance Monitoring Counters (PMCs).
- Instruction Based Sampling (IBS) attributes samples to the exact disassembly instruction, but has limited counter coverage.
- Event Based Sampling (EBS) has more counters available but less accurate attribution. It is typically accurate to within a few instructions. AMD Dev Techs often use EBS counters in the Assess Performance (Extended) profile.
See https://developer.amd.com/amd-uprof/

Radeon GPU Profiler (RGP)

RGP is a low-level GPU performance analysis tool for examining graphics and async compute usage, event timing, pipeline stalls, barriers, and other bottlenecks.

The Overview > Frame summary can quickly indicate whether the application is CPU bound (GPU idle > 5%), based on the few frames captured.
See https://github.com/GPUOpen-Tools/radeon_gpu_profiler

Compiling

Use the latest compiler and Windows SDK

Get the latest build and link time improvements.
Ensure you are using the latest C runtime optimizations.
See https://devblogs.microsoft.com/cppblog/the-coalition-sees-27-9x-iteration-build-improvement-with-visual-studio-2019/

Add virus and threat protection exclusions

Add project folders to virus and threat protection settings exclusions for faster build times.
- We have seen some projects compiling 20% faster!

Prefer Shipping configuration builds for CPU profiling

Debug and development configuration builds may greatly reduce performance.
- Stats collection may cause cache pollution.
- Logging may create serialization points.
- Sometimes debug builds may disable multi-threading optimizations.
While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Be sure to disable debug features before you ship!
Some Unreal Engine settings to verify include:
- In Build.h, #define FORCE_USE_STATS and #define STATS should never be enabled in Shipping builds.
- It may be convenient to enable ALLOW_CONSOLE_IN_SHIPPING during game development.
- See master/Engine/Source/Runtime/Core/Public/Misc/Build.h

Disable Anti-Tamper for CPU profiling

Build a binary similar to Shipping configuration but without Anti-Tamper or Anti-Cheat tools which may prevent CPU profiling tools from properly loading symbols.

Testing

Audit Content

Run Unreal Engine UE4Editor MapCheck to find errors.
- See https://docs.unrealengine.com/en-US/BuildingWorlds/LevelEditor/MapErrors/index.html
Use Unity® AssetPostprocessor to enforce minimum standards.
- See https://docs.unity3d.com/Manual/BestPracticeUnderstandingPerformanceInUnity4.html

Ask artists and QA for scene recommendations

It is important to profile potential optimizations using representative content. Not all scenes are created equal, and there is not always one best scene.
- Indoor scenes may have heavy occlusion.
- Outdoor forests may have many masked materials.
- Large crowds may represent a good stress test for AI, navmesh, physics, animation, and rendering workloads.
A consistent in-game time of day is an important consideration when minimizing run-to-run variation.
- Time of day may trigger specific world events such as rush hour where there are larger crowds or different lighting composition between day and night.

Use the default Platform Clock setting

WARNING

DAMAGE CAUSED BY USE OF YOUR AMD PROCESSOR OUTSIDE OF SPECIFICATION OR IN EXCESS OF FACTORY SETTINGS ARE NOT COVERED UNDER YOUR AMD PRODUCT WARRANTY AND MAY NOT BE COVERED BY YOUR SYSTEM MANUFACTURER’S WARRANTY. Operating your AMD processor outside of specification or in excess of factory settings, including but not limited to overclocking, may damage or shorten the life of your processor or other system components, create system instabilities (e.g. data loss and corrupted images), and in extreme cases may result in total system failure. AMD does not provide support or service for issues or damages related to use of an AMD processor outside of processor specifications or in excess of factory settings.

Use the default platform clock setting for best performance with high precision and low latency.
Default:
- bcdedit.exe /deletevalue useplatformclock
The useplatformclock option should only be enabled for debugging. However, some overclocking tools may set it to yes:
- bcdedit.exe /set useplatformclock yes
See https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/bcdedit--set

Test the cold shader cache first time user experience

Be sure to clear the application shader cache if it has one.
- The end user will often not be running the same scene back to back as a developer might.
The example below clears the Microsoft®, AMD, and NVIDIA® shader caches:

rem Run as administrator
rem Disable Steam Shader Pre-Caching before running this script
rem Reboot after running this script to clear any shaders still in system memory

setlocal enableextensions
cd /d "%~dp0"
rmdir /s /q "%LOCALAPPDATA%\D3DSCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\DxCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\GLCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\VkCache"
rmdir /s /q "%ProgramData%\NVIDIA Corporation\NV_Cache"
rmdir /s /q "%ProgramFiles(x86)%\Steam\steamapps\shadercache"

Analyze frame times

When doing performance analysis, prefer averages and percentiles over min and max metrics.
- It only takes one bad frame for min and max to no longer be representative of the average experience.
Be sure to collect sufficient samples when comparing 3 sigma and higher.
- See https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule
Determine the coefficient of variation over many test iterations.
- Under 3% is good in our experience.
High variation is symptomatic of an inconsistent test scene.
- We recommend setting static seed values for dynamically generated content and fixing variables like time of day.
- If higher variation is unavoidable, the user should increase their number of benchmark runs proportionally.
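
The metrics above can be sketched in a few lines of C++. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the percentile uses the nearest-rank definition (other definitions interpolate), and the coefficient of variation uses the sample standard deviation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of the samples (for example MsBetweenPresents values in milliseconds).
double meanOf(const std::vector<double>& samples)
{
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. p must be in (0, 100].
double percentileOf(std::vector<double> samples, double p)
{
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const std::size_t rank =
        static_cast<std::size_t>(std::ceil(p / 100.0 * samples.size()));
    return samples[rank - 1];
}

// Coefficient of variation in percent: sample standard deviation / mean * 100.
// Intended for per-run results (for example the average frame time of each run).
double coefficientOfVariationPercent(const std::vector<double>& runs)
{
    assert(runs.size() > 1);
    const double m = meanOf(runs);
    double sumSquares = 0.0;
    for (double r : runs)
    {
        sumSquares += (r - m) * (r - m);
    }
    const double stddev = std::sqrt(sumSquares / (runs.size() - 1));
    return stddev / m * 100.0;
}
```

For example, passing the average frame time of each benchmark run to coefficientOfVariationPercent and comparing the result against 3% applies the rule of thumb above.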

Profiling

Disable Memory Integrity if needed

Hypervisor-Protected Code Integrity (HVCI) is labelled Memory Integrity in the Windows Security app.

HVCI can be accessed via Settings > Update and Security > Windows Security > Device security > Core isolation details > Memory Integrity.
You may need to disable Memory Integrity for some tools, such as AMD µProf, to function.
See https://support.microsoft.com/en-us/windows/device-protection-in-windows-security-afa11526-de57-b1c5-599f-3a4c6a61c5e2

Add symbols

The symstore and symbol path can be powerful tools for loading vendor symbols and providing hints to tools which do not check the local directory.

Edit the system environment variables for _NT_SYMBOL_PATH.
- Example:

_NT_SYMBOL_PATH=cache*c:\symbols;srv*https://download.amd.com/dir/bin;srv*https://driver-symbols.nvidia.com/;srv*http://msdl.microsoft.com/download/symbols

Install the Windows 10 SDK Debuggers including symchk.exe and symstore.exe. Adding “C:\Program Files (x86)\Windows Kits\10\Debuggers\x64” to the PATH is recommended.
Store symbols for your project.
- Example:

symstore.exe add /r /f *.pdb /s c:\symbols /t "MyProject"

Determine if CPU-bound

Typically, the application is CPU-bound if GPU Idle > 5%.

Look for idle bubbles on the GPU in tools such as RGP, GPUView, and the Visual Studio Concurrency Visualizer.
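
The 5% rule of thumb can be expressed as a small calculation. The sketch below assumes you have already exported the GPU busy intervals for a capture window from one of these tools; the BusyInterval type and both function names are hypothetical and not part of any tool's API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record of one interval during which the GPU hardware queue
// was busy, in milliseconds relative to the start of the capture.
struct BusyInterval
{
    double startMs;
    double endMs;
};

// Percentage of the capture window during which the GPU was idle.
// Assumes the intervals do not overlap and lie within [0, captureMs].
double gpuIdlePercent(const std::vector<BusyInterval>& busy, double captureMs)
{
    assert(captureMs > 0.0);
    double busyMs = 0.0;
    for (const BusyInterval& b : busy)
    {
        busyMs += b.endMs - b.startMs;
    }
    return (captureMs - busyMs) / captureMs * 100.0;
}

// Rule of thumb from this guide: GPU idle above 5% suggests a CPU-bound application.
constexpr bool isLikelyCpuBound(double idlePercent)
{
    return idlePercent > 5.0;
}
```

Trim the head and tail of the capture before computing the percentage, since missing events at either end would otherwise be counted as idle time.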
There are multiple tools and methods available for developers to detect boundedness:
Radeon GPU Profiler (RGP)

GPUView
- Warning: Adapter Hardware Queue 3D is a good measure of GPU %Busy but be sure to zoom to a selection which trims out the head and tail of the log which may be missing events.
- Warning: This capture is typically limited to a few seconds which may be too broad to see smaller idle periods. Consider using the zoom function to limit scope to a few frames at a time.
- Example:

rem run as administrator
rem add "C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\gpuview" to path
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
call log.cmd light
timeout.exe /t 5
call log.cmd
rem open Merged.etl

Windows Performance Recorder & Windows Performance Analyzer
- Warning: The Windows Performance Analyzer’s GPU Utilization (FM) GPU by Process excludes GPU Idle time in Percentage calculation. Fortunately, you can open the etl file in GPUView.
- Note this capture is typically limited to a few seconds.
- Example:

rem run as administrator
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
wpr.exe -start gpu -filemode
timeout.exe /t 5
wpr.exe -stop out.etl
rem open out.etl

Visual Studio Concurrency Visualizer
- The Threads View shows DirectX GPU Engine utilization, which may be used to zoom into regions where the GPU is idle for further analysis of blocked threads.

Verify UE4 Parallel Rendering

While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Some debug features may greatly reduce performance by disabling parallel rendering.
Check UE4 Parallel Rendering CVARs before shipping.
- See master/Engine/Source/Runtime/RHI/Private/RHICommandList.cpp
- See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command	Recommended Value
`r.rhicmdbypass`	`0`
`r.rhicmdusedeferredcontexts`	`1`
`r.rhicmduseparallelalgorithms`	`1`
`r.rhithread.enable`	`1`

Verify Parallel DX12 PipelineState Creation

Use a cold shader cache while verifying parallel DX12 pipeline state creation.

Install the Windows SDK Windows Performance Toolkit.
Add the GPUView folder to the PATH.

rem run as administrator
rem clear shader cache
call log.cmd
rem collect samples while game is starting and calling D3D12.dll!CDevice::CreatePipelineState
call log.cmd

Open the merged etl log file with the Windows Performance Analyzer.
Add CPU Usage (Precise) and CPU Usage (Sampled) Flame by Process, Stack graphs.
Find all D3D12.dll!CDevice::CreatePipelineState within the Flame by Process, Stack.

This find command highlights the samples of interest in the CPU Usage (Precise) graph.

Verify Parallel DX12 Command List Generation

Install the Windows SDK Windows Performance Toolkit.
Add the GPUView folder to the PATH.

rem run as administrator
rem add "C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\gpuview" to path
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
call log.cmd
rem collect samples while game is playing and rendering frames. 1 second should be more than enough data.
timeout.exe /t 1
call log.cmd

Add GPU Utilization, CPU Usage (Precise), and Generic Events graphs.
Zoom into a single frame between two Present markers.
In the Generic Events graph, move the CPU Column next to the Task name then filter and expand Command List.

Debugging

WinDbg

WinDbg may be used for setting breakpoints, logging, skipping functions, editing memory, or editing registers.

Under the Windows x64 calling convention, the first four integer or pointer arguments are passed in RCX, RDX, R8, and R9. Arguments five and higher are passed on the stack.
- See https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention
Note: Steam games often require a steam_appid.txt file or a SteamAppId system environment variable to launch an executable from WinDbg.
Verify DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE was used:
- DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE (2) is recommended for optimal performance on hybrid graphics systems.
- These WinDbg commands may help:

bp dxgi!CDXGIFactory::EnumAdapterByGpuPreference ".printf \"FOUND DXGIFactory::EnumAdapterByGpuPreference DXGI_GPU_PREFERENCE=%x\\n\",@r8"

Verify GetLogicalProcessorInformation(Ex) calls with non-zero input buffer lengths return success:
- Some applications incorrectly assume the buffer size and may crash, especially on systems with many logical processors.
- Test if the first call has input buffer length 0 to get the buffer length to malloc.
- Test that all calls with non-zero input buffer lengths return success (return value 1).
- These WinDbg commands may help:

bp kernelbase!GetLogicalProcessorInformation "bp /1 @$ra \".printf \\\"GetLogicalProcessorInformation returned %i\\\", @rax; .echo; g\"; .printf \"GetLogicalProcessorInformation input buffer length 0x%x\", dwo(@rdx); .echo; g"
bp kernelbase!GetLogicalProcessorInformationEx "bp /1 @$ra \".printf \\\"GetLogicalProcessorInformationEx returned %i\\\", @rax; .echo; g\"; .printf \"GetLogicalProcessorInformationEx input buffer length 0x%x\", dwo(@r8); .echo; g"

Integrated Graphics

Test for Integrated Graphics

The DirectX APIs refer to Accelerated Processing Units (APUs) or Integrated Graphics parts via the term Unified Memory Architecture (UMA).

DirectX 12

bool isUMA(ID3D12Device* pDevice)
{
    bool result = false;
    D3D12_FEATURE_DATA_ARCHITECTURE data = {};
    if (S_OK == pDevice->CheckFeatureSupport(
        D3D12_FEATURE_ARCHITECTURE,
        &data,
        sizeof(data)))
    {
        result = data.UMA;
    }
    return result;
}

Full sample code here: isUMA_d3d12.cpp

//
// Copyright (c) 2021 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//

#include <cstdio> // printf
#include <dxgi1_4.h>
#include <d3d12.h>

#pragma comment( lib, "dxgi" )
#pragma comment( lib, "d3d12" )

bool isUMA(ID3D12Device* pDevice)
{
  bool result = false;
  D3D12_FEATURE_DATA_ARCHITECTURE data = {};
  if (S_OK == pDevice->CheckFeatureSupport(
    D3D12_FEATURE_ARCHITECTURE,
    &data,
    sizeof(data)))
  {
    result = data.UMA;
  }
  return result;
}

int main()
{
  ID3D12Device* pDevice = nullptr;
  if (SUCCEEDED(D3D12CreateDevice(
    NULL,
    D3D_FEATURE_LEVEL_11_0,
    __uuidof(ID3D12Device),
    (void**)&pDevice)))
  {
    IDXGIFactory* pFactory;
    IDXGIFactory4* pFactory4;
    if (SUCCEEDED(CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)(&pFactory)))
      && SUCCEEDED(pFactory->QueryInterface(__uuidof(IDXGIFactory4), (void**)&pFactory4)))
    {
      LUID luid = pDevice->GetAdapterLuid();
      IDXGIAdapter* pAdapter;
      DXGI_ADAPTER_DESC desc;
      if (SUCCEEDED(pFactory4->EnumAdapterByLuid(luid, __uuidof(IDXGIAdapter), (void**)&pAdapter))
        && SUCCEEDED(pAdapter->GetDesc(&desc)))
      {
        printf("DedicatedVideoMemory %I64u\n", desc.DedicatedVideoMemory);
        printf("DedicatedSystemMemory %I64u\n", desc.DedicatedSystemMemory);
        printf("SharedSystemMemory %I64u\n", desc.SharedSystemMemory);
        printf("isUMA %i\n", isUMA(pDevice));
        SIZE_T budget = desc.DedicatedVideoMemory;
        if (isUMA(pDevice))
        {
          budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
        }
        IDXGIAdapter3* pAdapter3 = nullptr;
        DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
        if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
          && SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
        {
          budget = info.Budget;
        }
        printf("budget %I64u\n", budget);
      }
    }
  }
}

See https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_feature_data_architecture

DirectX 11.3

bool isUMA(ID3D11Device* pDevice)
{
    bool result = false;
    ID3D11Device3* pD3D11Device3 = nullptr;
    if (S_OK == pDevice->QueryInterface(IID_PPV_ARGS(&pD3D11Device3)) && pD3D11Device3)
    {
        D3D11_FEATURE_DATA_D3D11_OPTIONS2 data = {};
        if (S_OK == pD3D11Device3->CheckFeatureSupport(
            D3D11_FEATURE_D3D11_OPTIONS2,
            &data,
            sizeof(data)))
        {
            result = data.UnifiedMemoryArchitecture;
        }
        pD3D11Device3->Release();
    }
    return result;
}

Full sample code here: isUMA_d3d11_3.cpp

//
// Copyright (c) 2021 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//

#include <cstdio> // printf
#include <dxgi1_4.h>
#include <d3d11_3.h>

#pragma comment( lib, "dxgi" )
#pragma comment( lib, "d3d11" )

bool isUMA(ID3D11Device* pDevice)
{
  bool result = false;
  ID3D11Device3* pD3D11Device3 = nullptr;
  if (S_OK == pDevice->QueryInterface(IID_PPV_ARGS(&pD3D11Device3)) && pD3D11Device3)
  {
    D3D11_FEATURE_DATA_D3D11_OPTIONS2 data = {};
    if (S_OK == pD3D11Device3->CheckFeatureSupport(
      D3D11_FEATURE_D3D11_OPTIONS2,
      &data,
      sizeof(data)))
    {
      result = data.UnifiedMemoryArchitecture;
    }
    pD3D11Device3->Release();
  }
  return result;
}

int main()
{
  UINT flags = 0; // optionally D3D11_CREATE_DEVICE_SINGLETHREADED
  D3D_FEATURE_LEVEL    featureLevels[] = { D3D_FEATURE_LEVEL_11_0 };
  UINT                 numFeatureLevels = ARRAYSIZE(featureLevels);
  D3D_FEATURE_LEVEL    featureLevel;
  ID3D11Device* pDevice = nullptr;
  ID3D11DeviceContext* pImmediateContext = nullptr;
  if (SUCCEEDED(D3D11CreateDevice(
    NULL,
    D3D_DRIVER_TYPE_HARDWARE,
    NULL,
    flags,
    featureLevels,
    numFeatureLevels,
    D3D11_SDK_VERSION,
    &pDevice,
    &featureLevel,
    &pImmediateContext)))
  {
    IDXGIDevice* pDXGIDevice = nullptr;
    IDXGIAdapter* pAdapter = nullptr;
    DXGI_ADAPTER_DESC desc;
    if (SUCCEEDED(pDevice->QueryInterface(__uuidof(IDXGIDevice), (void**)&pDXGIDevice))
      && SUCCEEDED(pDXGIDevice->GetAdapter(&pAdapter))
      && SUCCEEDED(pAdapter->GetDesc(&desc)))
    {
      printf("DedicatedVideoMemory %I64u\n", desc.DedicatedVideoMemory);
      printf("DedicatedSystemMemory %I64u\n", desc.DedicatedSystemMemory);
      printf("SharedSystemMemory %I64u\n", desc.SharedSystemMemory);
      printf("isUMA %i\n", isUMA(pDevice));
      SIZE_T budget = desc.DedicatedVideoMemory;
      if (isUMA(pDevice))
      {
        budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
      }
      IDXGIAdapter3* pAdapter3 = nullptr;
      DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
      if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
        && SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
      {
        budget = info.Budget;
      }
      printf("budget %I64u\n", budget);
    }
  }
}

See https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ns-d3d11-d3d11_feature_data_d3d11_options2

Calculate VRAM Budget appropriately for Integrated Graphics

Integrated graphics parts which share their video memory with the CPU require special considerations when detecting VRAM budgets.

DirectX

Preferred method:

IDXGIAdapter3* pAdapter3 = nullptr;
DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
    && SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
{
    budget = info.Budget;
}

See https://docs.microsoft.com/en-us/windows/win32/api/dxgi1_4/nf-dxgi1_4-idxgiadapter3-queryvideomemoryinfo

Alternative method:

DXGI_ADAPTER_DESC desc;
if (SUCCEEDED(pAdapter->GetDesc(&desc)))
{
    SIZE_T budget = desc.DedicatedVideoMemory;
    if (isUMA(pDevice))
    {
        budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
    }
}

DedicatedVideoMemory: This represents the actual local memory on discrete GPUs and the dedicated carve-out of system memory on integrated GPUs.
DedicatedSystemMemory: This value is always zero on AMD GPUs.
SharedSystemMemory: This is determined by the GPU kernel-mode driver (KMD) and may return up to half of system memory.
UMA: Unified Memory Architecture used in integrated GPUs.
- DedicatedVideoMemorySize alone may be insufficient to run some gaming applications on systems with integrated graphics (UMA).
- For systems with integrated graphics (UMA), developers should query SharedSystemMemorySize, then rely on the GPU KMD and the video memory manager (VidMm) to assign system memory optimally.
- Use DX12 (or DX11.3) CheckFeatureSupport to query UMA.
See https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/d3dkmthk/ns-d3dkmthk-_d3dkmt_segmentsizeinfo
See https://docs.microsoft.com/en-us/windows-hardware/drivers/display/calculating-graphics-memory
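
The order of preference between the two methods can be summarized as a small pure function. This is a sketch only: AdapterMemory is a hypothetical plain-data mirror of the DXGI_ADAPTER_DESC memory fields, and treating a queried budget of 0 as "query unavailable" is an assumption made here so the logic can be exercised without a GPU.

```cpp
#include <cstdint>

// Hypothetical plain-data mirror of the DXGI_ADAPTER_DESC memory fields.
struct AdapterMemory
{
    std::uint64_t dedicatedVideoMemory;
    std::uint64_t dedicatedSystemMemory;
    std::uint64_t sharedSystemMemory;
};

// queriedBudget: the value of DXGI_QUERY_VIDEO_MEMORY_INFO::Budget, or 0 if
// QueryVideoMemoryInfo was unavailable or failed (assumption of this sketch).
constexpr std::uint64_t estimateVramBudget(const AdapterMemory& mem,
                                           bool isUma,
                                           std::uint64_t queriedBudget)
{
    // Preferred method: trust the budget reported by the OS.
    if (queriedBudget != 0)
    {
        return queriedBudget;
    }
    // Alternative method: start from dedicated video memory and, on UMA
    // parts, add the system memory the GPU is allowed to share.
    std::uint64_t budget = mem.dedicatedVideoMemory;
    if (isUma)
    {
        budget += mem.dedicatedSystemMemory + mem.sharedSystemMemory;
    }
    return budget;
}
```

With a 512 MB carve-out and 8192 MB of shared system memory, a UMA system without a queried budget would report 8704 MB, while the same numbers on a non-UMA adapter would report only 512 MB.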

Optimize Scalability for Integrated Graphics

Sometimes feature scaling may be required in order to achieve acceptable framerates on thermally limited platforms.

Straightforward changes to try for scaling include:
- Use DXGI_FORMAT_R11G11B10_FLOAT rather than DXGI_FORMAT_R16G16B16A16_FLOAT.
- Reduce shadow map quality.
- Reduce volumetric fog quality.
- Disable Ambient Occlusion.
The following related Unreal Engine CVars may be helpful:
- r.SceneColorFormat
- r.AmbientOcclusionLevels
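
As a rough illustration of why the scene color format matters, the sketch below compares the memory footprint of a single render target in the two formats. DXGI_FORMAT_R11G11B10_FLOAT stores 32 bits per pixel and DXGI_FORMAT_R16G16B16A16_FLOAT stores 64 bits per pixel. The helper name `RenderTargetBytes` and the 1920x1080 resolution are assumptions chosen for the example, and the calculation ignores driver padding and alignment.

```cpp
#include <cassert>
#include <cstdint>

// Bits per pixel of the two formats discussed above.
constexpr std::uint32_t kBitsR11G11B10Float    = 32;
constexpr std::uint32_t kBitsR16G16B16A16Float = 64;

// Hypothetical helper: unpadded size in bytes of one render target.
inline std::uint64_t RenderTargetBytes(std::uint32_t width,
                                       std::uint32_t height,
                                       std::uint32_t bitsPerPixel)
{
    assert(bitsPerPixel % 8 == 0);  // whole bytes per pixel only
    return static_cast<std::uint64_t>(width) * height * (bitsPerPixel / 8);
}
```

At 1920x1080 this gives 8,294,400 bytes for R11G11B10_FLOAT and 16,588,800 bytes for R16G16B16A16_FLOAT, so each full-resolution pass reads and writes half as much data. The trade-off is lower precision and no alpha channel, so visual quality should be checked when switching.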

Hybrid Graphics

Select the optimal GPU for Hybrid Graphics

Additional considerations may be necessary to ensure the expected GPU is used on hybrid graphics platforms.

Windows 10 v1803 added IDXGIFactory6::EnumAdapterByGpuPreference.
Use DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications.
WinDbg may be used to check that the application passes DXGI_GPU_PREFERENCE=2 (DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE).

bp dxgi!CDXGIFactory::EnumAdapterByGpuPreference ".printf \"FOUND DXGIFactory::EnumAdapterByGpuPreference DXGI_GPU_PREFERENCE=%x\\n\",@r8"
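
A minimal sketch of requesting the high performance adapter is shown below. The helper name `GetHighPerformanceAdapterName` is hypothetical, and the non-Windows branch exists only so the sketch compiles elsewhere. On systems older than Windows 10 v1803 the factory request is expected to fail, in which case an application would fall back to its existing adapter enumeration.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <dxgi1_6.h>
#include <wrl/client.h>
#pragma comment(lib, "dxgi.lib")
#endif

// Hypothetical helper: writes the description of the adapter that DXGI
// ranks first for DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE into 'name'.
// Returns false if the query is unavailable or fails.
bool GetHighPerformanceAdapterName(std::wstring& name)
{
    name.clear();
#ifdef _WIN32
    using Microsoft::WRL::ComPtr;

    // IDXGIFactory6 is only available on Windows 10 v1803 and later.
    ComPtr<IDXGIFactory6> factory;
    if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory))))
        return false;

    // Index 0 is the adapter ranked highest for the given preference.
    ComPtr<IDXGIAdapter1> adapter;
    if (FAILED(factory->EnumAdapterByGpuPreference(
            0, DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE, IID_PPV_ARGS(&adapter))))
        return false;
    assert(adapter != nullptr);

    DXGI_ADAPTER_DESC1 desc = {};
    if (FAILED(adapter->GetDesc1(&desc)))
        return false;

    name = desc.Description;
    return !name.empty();
#else
    return false;  // DXGI is not available on this platform.
#endif
}
```

The adapter returned this way can then be passed to device creation instead of relying on the default adapter.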

The user may change preferences per application in Graphics settings.
- Example from Dell G5 15 Special Edition (5505)

Windows Set Graphics Preference

Memory

Optimize memcpy/memset

Update the compiler for the latest memcpy, memset, and other C runtime optimizations.
Aligning memcpy source and destination to a 4096 byte page boundary may reduce Zen 2 store-to-load forwarding events (see STLIOther in AMD µProf).
Aligning data to a 4096 byte page boundary may benefit probe filtering on AMD Threadripper™ and EPYC™ processors.
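
One way to obtain page-aligned copy buffers is C++17 aligned new, as sketched below. The names `PageAlignedBuffer`, `AllocatePageAligned`, and `IsPageAligned` are hypothetical, and the sketch assumes a C++17 compiler. On Windows, `_aligned_malloc` and `_aligned_free` are an alternative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>

constexpr std::size_t kPageSize = 4096;

// Frees memory obtained from the aligned operator new below.
struct PageAlignedDeleter {
    void operator()(std::byte* p) const noexcept {
        ::operator delete(p, std::align_val_t(kPageSize));
    }
};
using PageAlignedBuffer = std::unique_ptr<std::byte, PageAlignedDeleter>;

// Hypothetical helper: allocates 'bytes' starting on a 4096 byte boundary.
inline PageAlignedBuffer AllocatePageAligned(std::size_t bytes)
{
    assert(bytes > 0);
    void* p = ::operator new(bytes, std::align_val_t(kPageSize));
    return PageAlignedBuffer(static_cast<std::byte*>(p));
}

// True if the address is a multiple of the page size.
inline bool IsPageAligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % kPageSize == 0;
}
```

With both the source and the destination allocated this way, a call such as `std::memcpy(dst.get(), src.get(), size)` starts on a page boundary at both ends. Whether this helps a particular workload should be confirmed by profiling.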

Using alignas with the native cache line size (64 bytes) may reduce false sharing.
Use aligned memory allocators such as _aligned_malloc or C++17 aligned new.
Prefer thread local storage and local variables over process shared data.
- Try using per thread range indices such that thread ranges avoid sharing the same 64 byte cache line or 4096 byte page.
- Try copying data rather than using process shared data.
Padding or reordering a struct may reduce false sharing in some cases where variables which share the same cache line are used by more than one thread.
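
The sketch below illustrates the padding idea with per-thread counters. Each counter is aligned to, and therefore padded to, a full 64 byte cache line, so no two threads write to the same line. The names `PaddedCounter` and `CountInParallel` are hypothetical, and C++17 is assumed so that std::vector honors the over-alignment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLineSize = 64;

// alignas rounds the struct size up to one cache line, so adjacent
// array elements never share a line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLineSize, "line aligned");

// Each thread increments only its own counter; the total is summed
// after all threads have joined.
std::uint64_t CountInParallel(unsigned threadCount,
                              std::uint64_t incrementsPerThread)
{
    assert(threadCount > 0);
    std::vector<PaddedCounter> counters(threadCount);
    std::vector<std::thread> workers;
    workers.reserve(threadCount);
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&counters, t, incrementsPerThread] {
            for (std::uint64_t i = 0; i < incrementsPerThread; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    std::uint64_t total = 0;
    for (const PaddedCounter& c : counters) {
        total += c.value.load(std::memory_order_relaxed);
    }
    return total;
}
```

Removing the `alignas` would pack eight counters into one cache line and produce the same total, but with the threads contending for that line. The difference is only visible in timing or in profiler cache metrics.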

Prefer data access patterns matching hardware prefetcher behaviors

Streaming
- Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.

Ryzen Streaming

Stride
- Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous one.

Ryzen Stride
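
As an illustration of the two patterns, the sketch below sums a row-major grid in two ways. The first loop touches consecutive addresses in ascending order, which corresponds to the streaming description above. The second walks a column at a time, so each access is a constant distance from the previous one, which corresponds to the stride description. The function names are hypothetical, and the source does not state which stride sizes a given processor handles well, so large strides should be profiled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major grid: element (row, col) is stored at index row * cols + col.

// Sequential, ascending access: consecutive iterations read adjacent elements.
std::uint64_t SumRowMajor(const std::vector<std::uint32_t>& grid,
                          std::size_t rows, std::size_t cols)
{
    assert(grid.size() == rows * cols);
    std::uint64_t sum = 0;
    for (std::size_t row = 0; row < rows; ++row) {
        for (std::size_t col = 0; col < cols; ++col) {
            sum += grid[row * cols + col];
        }
    }
    return sum;
}

// Constant-stride access: consecutive iterations are 'cols' elements apart.
std::uint64_t SumColumnMajor(const std::vector<std::uint32_t>& grid,
                             std::size_t rows, std::size_t cols)
{
    assert(grid.size() == rows * cols);
    std::uint64_t sum = 0;
    for (std::size_t col = 0; col < cols; ++col) {
        for (std::size_t row = 0; row < rows; ++row) {
            sum += grid[row * cols + col];
        }
    }
    return sum;
}
```

Both functions return the same value. The sequential version also uses every byte of each fetched cache line, so it is usually the better default when the algorithm allows either order.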

Use Software Prefetch instructions for linked data structures experiencing cache misses

Use Software Prefetch instructions on linked or pointer-based data structures, such as std::list or a std::vector of pointers, that are experiencing cache misses.
- Tune prefetch distance to account for memory latency. In our experience, four iterations into the future is a good place to start tuning.
Use the NTA (non-temporal) hint for data that is used only once.
While in dual-thread mode, beware that too many software prefetches from one thread may evict the working set of the other thread from their shared caches.
Remove Ineffective Software Prefetches found by PMCx052.
The AMD µProf Assess Performance (Extended) profile may help find Data Cache refills from DRAM.
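
A minimal sketch of software prefetching over a vector of pointers follows, using the `_mm_prefetch` intrinsic on x86. The `Particle` type, the `SumValues` function, and the `kPrefetchDistance` constant are hypothetical. The distance of four iterations is only the starting point suggested above and should be tuned by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA

// Hypothetical element type reached through a pointer.
struct Particle {
    std::uint64_t value;
};

// Starting point for tuning: prefetch four iterations ahead.
constexpr std::size_t kPrefetchDistance = 4;

std::uint64_t SumValues(const std::vector<const Particle*>& items)
{
    std::uint64_t sum = 0;
    const std::size_t count = items.size();
    for (std::size_t i = 0; i < count; ++i) {
        // Request the object that will be needed a few iterations from
        // now, so it may already be in cache when the loop reaches it.
        if (i + kPrefetchDistance < count) {
            _mm_prefetch(
                reinterpret_cast<const char*>(items[i + kPrefetchDistance]),
                _MM_HINT_T0);
        }
        assert(items[i] != nullptr);
        sum += items[i]->value;
    }
    return sum;
}
```

For data that is read only once, `_MM_HINT_NTA` can be substituted for `_MM_HINT_T0`. A prefetch does not change the result, so its benefit has to be verified with a profiler.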

Synchronization

Use Modern Sync APIs

Modern sync APIs include std::mutex, std::shared_mutex, SRWLock, and EnterCriticalSection.

These may be faster and consume less power than WaitForSingleObject or user spin locks.
Some modern sync APIs leverage AMD’s mwaitx instruction efficiently to wait on an address or timeout.
Legacy sync APIs may have unneeded Syscall overhead.
User spin locks may consume OS thread scheduling resources unnecessarily since the OS scheduler may be unable to determine if it should yield to another program thread rather than spin.
- It is generally recommended to issue sleep/wait instructions rather than spin locks.
- Even when waiting on the GPU, calls like SetEventOnCompletion() may be as efficient as the old fence polling model while avoiding starving other threads or unnecessarily consuming power.
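
The sketch below shows one way to wait without spinning, using std::mutex together with std::condition_variable. The condition variable is a choice made for this example and is not named in the list above. The `ResultSlot` class and `ComputeOnWorker` function are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot slot: one thread publishes a value, another
// blocks until it is available instead of polling a flag in a loop.
class ResultSlot {
public:
    void Publish(int value)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            value_ = value;
            ready_ = true;
        }
        ready_cv_.notify_one();
    }

    int Wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        // The thread sleeps here until notified; the predicate guards
        // against spurious wakeups.
        ready_cv_.wait(lock, [this] { return ready_; });
        assert(ready_);
        return value_;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_cv_;
    bool ready_ = false;
    int value_ = 0;
};

// Runs a trivial job on a worker thread and blocks until it completes.
int ComputeOnWorker()
{
    ResultSlot slot;
    std::thread worker([&slot] { slot.Publish(42); });
    const int result = slot.Wait();
    worker.join();
    return result;
}
```

While the waiting thread is blocked, the OS scheduler is free to run other threads on that core, which is the behavior a user spin lock prevents.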

Test application scalability from 1 to %NUMBER_OF_PROCESSORS%

This advice is specific to AMD processors and is not general guidance for all processor vendors.

Generally, applications show SMT benefits and use of all logical processors is recommended. However, games often suffer from SMT contention on the main or render threads during gameplay.

One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
Avoid setting thread pool size as a constant.
Profile your application/game to determine the ideal thread count.
- Game initialization, including decompressing assets and compiling/warming shaders, may benefit from logical processors using SMT dual-thread mode.
- Game play may prefer physical core count using SMT single-thread mode.
We recommend creating developer options to:
- Set Max Thread Pool Size.
- Force Thread Pool Size.
- Force SMT.
- Force Single NUMA Node (implicitly Group).
Profile against multiple CPUs. There is no hard and fast rule here.
- The best thread count heuristic may vary between low and high core count CPUs.
- While a 12 core CPU may benefit from an idle thread in your game to handle interrupts from the Operating System and 3rd party apps, a 6 core CPU may require the availability of every compute resource.
- Developers may tune the low cores threshold for optimal performance on different core count CPUs.
AMD µProf may be used to show the actual thread concurrency histogram for a process.
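
The developer options above could be wired into a small sizing function such as the sketch below. Everything here is hypothetical: the `ThreadPoolOptions` struct, its field names, and `ChooseThreadPoolSize` are illustrative only, and the caller is assumed to supply the logical processor and physical core counts (for example from GetLogicalProcessorInformationEx on Windows).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical developer options mirroring the list above.
struct ThreadPoolOptions {
    unsigned maxThreads    = 0;     // "Set Max Thread Pool Size"; 0 = no cap
    unsigned forcedThreads = 0;     // "Force Thread Pool Size"; 0 = not forced
    bool     forceSmt      = false; // "Force SMT": size from logical processors
};

// Picks a worker thread count from the detected topology and the options.
unsigned ChooseThreadPoolSize(unsigned logicalProcessors,
                              unsigned physicalCores,
                              const ThreadPoolOptions& options)
{
    assert(physicalCores <= logicalProcessors);

    // An explicit override wins over every heuristic.
    if (options.forcedThreads > 0) {
        return options.forcedThreads;
    }

    // Default to one thread per physical core; opt in to SMT explicitly.
    unsigned count = options.forceSmt ? logicalProcessors : physicalCores;

    if (options.maxThreads > 0) {
        count = std::min(count, options.maxThreads);
    }
    return std::max(count, 1u);  // always keep at least one worker
}
```

Exposing these options on the command line or in a developer console makes it straightforward to sweep thread counts from 1 to the number of logical processors and compare frame times for each setting.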
See: