Performance guide

Design faster. Render faster. Iterate faster.

Our AMD Ryzen™ Performance Guide walks you through the optimization process with a collection of tidbits, tips, and tricks to support you in your performance quest.

Tools

PresentMon

PresentMon is a Command Line Interface (CLI) tool for logging frame times such as MsBetweenPresents.

Example:

				
PresentMon-1.6.0-x64.exe -process_name "MyGame.exe" ^
        -stop_existing_session ^
        -terminate_on_proc_exit ^
        -terminate_after_timed ^
        -timed 60 ^
        -output_file "%CD%\result\presentmon.csv"
				
			

Open Capture and Analysis Tool (OCAT)

OCAT is a Graphical User Interface (GUI) tool with hot key support for logging frame times based on PresentMon.

If you want to know how well a game is performing on your machine in real-time with low overhead, OCAT has you covered.

Windows® Performance Toolkit

Windows® Performance Analyzer (WPA)

WPA is a highly configurable tool for finding system performance bottlenecks and ideal for filtering and visualizing call stacks. 

GPUView

GPUView is a tool for analyzing GPU performance with regard to direct memory access (DMA) buffer processing.

Visual Studio Concurrency Visualizer

You can use the Concurrency Visualizer for Visual Studio to locate performance bottlenecks, CPU underutilization, thread contention, cross-core thread migration, synchronization delays, DirectX® activity, areas of overlapped I/O, and other information.

AMD μProf

  • Find performance bottlenecks using CPU hardware Performance Monitoring Counters (PMCs)
    • Instruction Based Sampling (IBS) attributes samples to the exact instruction in the disassembly but has limited counter coverage.
    • Event Based Sampling (EBS) has more counters available but less accurate attribution. It is typically accurate to within a few instructions. AMD Dev Techs often use EBS counters in the “Assess Performance (Extended)” profile.
  • See https://developer.amd.com/amd-uprof/

Radeon™ GPU Analyzer (RGA)

RGA is an offline compiler and performance analysis tool for DirectX®, Vulkan®, SPIR-V™, OpenGL®, and OpenCL™.

Radeon™ GPU Profiler (RGP)

RGP gives you unprecedented, in-depth access to a GPU. Easily analyze graphics, async compute usage, event timing, pipeline stalls, barriers, bottlenecks, and other performance inefficiencies.

Compiling

Use the latest compiler and Windows® SDK

Add virus and threat protection exclusions

  • Add project folders to virus and threat protection settings exclusions for faster build times.
    • We have seen some projects compiling 20% faster!

Prefer Shipping configuration builds for CPU profiling

  • Debug and development configuration builds may greatly reduce performance.
    • Stats collection may cause cache pollution.
    • Logging may create serialization points.
    • Sometimes debug builds may disable multi-threading optimizations.
  • While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Be sure to disable debug features before you ship!
  • Some Unreal Engine settings to verify include:

Disable Anti-Tamper for CPU profiling

  • Build a binary similar to Shipping configuration but without Anti-Tamper or Anti-Cheat tools which may prevent CPU profiling tools from properly loading symbols.

Testing

Audit Content

Ask artists and QA for scene recommendations

  • It’s important to profile potential optimizations using representative content. Not all scenes are created equal, and there isn’t always one best scene.
    • Indoor scenes may have heavy occlusion.
    • Outdoor forests may have many masked materials.
    • Large crowds may represent a good stress test for AI, navmesh, physics, animation, and rendering workloads.
  • A consistent in-game time of day is an important consideration when minimizing run-to-run variation.
    • Time of day may trigger specific world events such as rush hour where there are larger crowds or different lighting composition between day and night.

Use the default Platform Clock setting

Test the cold shader cache first time user experience

  • Be sure to clear the application shader cache if it has one.
    • The end user will often not be running the same scene back to back as a developer might.
  • The example below clears the Microsoft®, AMD, and NVIDIA® shader caches:
				
					rem Run as administrator
rem Disable Steam Shader Pre-Caching before running this script
rem Reboot after running this script to clear any shaders still in system memory

setlocal enableextensions
cd /d "%~dp0"
rmdir /s /q "%LOCALAPPDATA%\D3DSCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\DxCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\GLCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\VkCache"
rmdir /s /q "%ProgramData%\NVIDIA Corporation\NV_Cache"
rmdir /s /q "%ProgramFiles(x86)%\Steam\steamapps\shadercache"
				
			

Analyze frame times

  • When doing performance analysis, prefer averages and percentiles over min and max metrics.
    • It only takes one bad frame for min and max to no longer be representative of the average experience.
  • Be sure to collect sufficient samples when comparing 3 sigma and higher.
  • Determine the coefficient of variation over many test iterations.
    • Under 3% is good in our experience.
  • High variation is indicative of an inconsistent test scene.
    • We recommend setting static seed values for dynamically generated content and fixing variables like time of day.
    • If higher variation is unavoidable, the user should increase their number of benchmark runs proportionally.
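
The metrics above can be sketched in a few small helpers. This is a minimal illustration, not part of any AMD tool: the function names are hypothetical, the percentile uses the nearest-rank definition (other definitions interpolate), and the input is assumed to be a list of frame times in milliseconds, for example the MsBetweenPresents column parsed from a PresentMon CSV.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of the samples (for example, frame times in milliseconds).
double mean(const std::vector<double>& samples)
{
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
}

// Nearest-rank percentile for p in (0, 100]. Takes a copy because it sorts.
double percentile(std::vector<double> samples, double p)
{
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const std::size_t rank =
        static_cast<std::size_t>(std::ceil(p / 100.0 * samples.size()));
    return samples[rank - 1];
}

// Coefficient of variation in percent: sample standard deviation / mean * 100.
// Intended for one value per benchmark run, such as each run's average frame time.
double coefficientOfVariationPercent(const std::vector<double>& runAverages)
{
    assert(runAverages.size() >= 2);
    const double m = mean(runAverages);
    double sumOfSquares = 0.0;
    for (double value : runAverages)
    {
        sumOfSquares += (value - m) * (value - m);
    }
    const double standardDeviation =
        std::sqrt(sumOfSquares / (runAverages.size() - 1));
    return standardDeviation / m * 100.0;
}
```

A typical use would be to report `mean` and `percentile(frameTimes, 99.0)` for each run, then feed the per-run averages to `coefficientOfVariationPercent` and compare the result against the 3% guideline.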

Profiling

Disable Memory Integrity if needed

Hypervisor-Protected Code Integrity (HVCI) is labelled “Memory Integrity” in the Windows® Security app.

Add Symbols 

The symstore and symbol path can be powerful tools for loading vendor symbols and providing hints to tools which do not check the local directory.

  • Edit the system environment variables for _NT_SYMBOL_PATH.
    •  Example:
				
					_NT_SYMBOL_PATH=cache*c:\symbols;srv*https://download.amd.com/dir/bin;srv*https://driver-symbols.nvidia.com/;srv*http://msdl.microsoft.com/download/symbols
				
			
  • Install the Windows® 10 SDK Debuggers including symchk.exe and symstore.exe. Adding "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64" to the PATH is recommended.
  • Store symbols for your project.
    •  Example:
				
					symstore.exe add /r /f *.pdb /s c:\symbols /t "MyProject"
				
			

Determine if CPU-bound

Typically, the application is CPU-bound if GPU Idle > 5%

  • Look for bubbles of idle work on the GPU in tools such as RGP, GPUView, and the Visual Studio Concurrency Visualizer.
  • There are multiple tools and methods available for developers to detect boundedness:
  • Radeon™ GPU Profiler (RGP)
  • GPUView
    • Warning: Adapter Hardware Queue 3D is a good measure of GPU %Busy but be sure to zoom to a selection which trims out the head and tail of the log which may be missing events.
    • Warning: This capture is typically limited to a few seconds which may be too broad to see smaller idle periods. Consider using the zoom function to limit scope to a few frames at a time.
    • Example:
				
					rem run as administrator
rem add "C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\gpuview" to path
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
call log.cmd light
timeout.exe /t 5
call log.cmd
rem open Merged.etl
				
			
  • Windows® Performance Recorder & Windows® Performance Analyzer
    • Warning: The Windows® Performance Analyzer’s GPU Utilization (FM) GPU by Process excludes GPU Idle time in Percentage calculation. Fortunately, you can open the etl file in GPUView.
    • Note this capture is typically limited to a few seconds.
    • Example:
				
					rem run as administrator
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
wpr.exe -start gpu -filemode
timeout.exe /t 5
wpr.exe -stop out.etl
rem open out.etl
				
			
  • Visual Studio Concurrency Visualizer
    • The Threads View shows DirectX® GPU Engine utilization, which may be used to zoom into regions where the GPU is idle for further analysis of blocked threads.
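
The 5% rule of thumb can also be checked numerically once GPU busy intervals have been read out of a capture. The sketch below is only an illustration: `BusyInterval`, `gpuIdlePercent`, and `isLikelyCpuBound` are hypothetical names, and the intervals are assumed to have been transcribed by hand or by a script from a tool such as GPUView. None of the tools above exposes this as an API.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// One GPU busy interval [start, end) in milliseconds within the capture.
using BusyInterval = std::pair<double, double>;

// Percentage of the window [windowStart, windowEnd) with no busy interval active.
// Overlapping intervals (for example, from several queues) are counted once.
double gpuIdlePercent(std::vector<BusyInterval> busy,
                      double windowStart,
                      double windowEnd)
{
    assert(windowEnd > windowStart);
    std::sort(busy.begin(), busy.end());
    double busyTime = 0.0;
    double cursor = windowStart; // time before cursor is already accounted for
    for (const BusyInterval& interval : busy)
    {
        const double start = std::max(interval.first, cursor);
        const double end = std::min(interval.second, windowEnd);
        if (end > start)
        {
            busyTime += end - start;
            cursor = end;
        }
    }
    return 100.0 * (1.0 - busyTime / (windowEnd - windowStart));
}

// Applies the "GPU Idle > 5%" guideline; the threshold is adjustable.
bool isLikelyCpuBound(double idlePercent, double thresholdPercent = 5.0)
{
    return idlePercent > thresholdPercent;
}
```

Choosing a window of a few frames, as recommended for GPUView above, keeps short idle bubbles from being averaged away.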

Verify UE4 Parallel Rendering

Command                          Recommended Value
r.rhicmdbypass                   0
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1

Verify Parallel DX12 PipelineState Creation

Use a cold shader cache while verifying parallel DX12 pipeline state creation.

  1. Install the Windows® SDK Windows® Performance Toolkit.
  2. Add the GPUView folder to the PATH.
				
					rem run as administrator
rem clear shader cache
call log.cmd
rem collect samples while game is starting and calling D3D12.dll!CDevice::CreatePipelineState
call log.cmd
				
			
  • Open the merged etl log file with the Windows® Performance Analyzer.
  • Add “CPU Usage (Precise)” and “CPU Usage (Sampled) Flame by Process, Stack” graphs.
  • Find all “D3D12.dll!CDevice::CreatePipelineState” within the “Flame by Process, Stack”.

This find command highlights the samples of interest in the “CPU Usage (Precise)” graph:

Verify Parallel DX12 Command List Generation

  • Install the Windows® SDK Windows® Performance Toolkit.
  • Add the GPUView folder to the PATH.
				
					rem run as administrator
rem add "C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\gpuview" to path
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
call log.cmd
rem collect samples while game is playing and rendering frames. 1 second should be more than enough data.
timeout.exe /t 1
call log.cmd
				
			
  • Add “GPU Utilization”, “CPU Usage (Precise)”, and “Generic Events” graphs.
  • Zoom into a single frame between two Present markers.
  • In the “Generic Events” graph, move the CPU Column next to the Task name then filter and expand “Command List”.

Debugging

WinDbg

WinDbg may be used for setting breakpoints, logging, skipping functions, editing memory, or editing registers.

  • Under the Windows® x64 calling convention, the first four integer or pointer arguments are passed in RCX, RDX, R8, and R9. Arguments five and higher are passed on the stack.
  • Note Steam games often require a steam_appid.txt file or SteamAppId system environment variable to launch an executable from WinDbg.
  • Verify DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE was used:
    • DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE (2) is recommended for optimal performance on hybrid graphics systems.
    • These WinDbg commands may help:
				
					bp dxgi!CDXGIFactory::EnumAdapterByGpuPreference ".printf \"FOUND DXGIFactory::EnumAdapterByGpuPreference DXGI_GPU_PREFERENCE=%x\\n\",@r8"
				
			
  • Verify GetLogicalProcessorInformation(Ex) calls with non-zero input buffer lengths return success:
    • Some applications incorrectly assume the buffer size and may crash, especially on systems with many logical processors.
    • Check that the first call passes an input buffer length of 0 to obtain the buffer length to allocate.
    • Check that all calls with non-zero input buffer lengths return success (return value 1).
    • These WinDbg commands may help:
				
					bp kernelbase!GetLogicalProcessorInformation "bp /1 @$ra \".printf \\\"GetLogicalProcessorInformation returned %i\\\", @rax; .echo; g\"; .printf \"GetLogicalProcessorInformation input buffer length 0x%x\", poi(@rdx); .echo; g"
bp kernelbase!GetLogicalProcessorInformationEx "bp /1 @$ra \".printf \\\"GetLogicalProcessorInformationEx returned %i\\\", @rax; .echo; g\"; .printf \"GetLogicalProcessorInformationEx input buffer length 0x%x\", poi(@r8); .echo; g"
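
On the application side, the size-then-fill pattern those breakpoints verify might look like the sketch below. The helper names `queryWithSizedBuffer`, `queryProcessorCores`, and `countPhysicalCores` are hypothetical. The generic helper only assumes the documented contract that a call with a too-small buffer fails and reports the required length; the Windows-specific part is compiled only when `_WIN32` is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

// Calls query(buffer, &lengthInBytes) using the two-step contract: a call with
// a too-small buffer fails and writes the required length. Returns the filled
// bytes, or an empty vector if the query never succeeds.
template <typename Query>
std::vector<unsigned char> queryWithSizedBuffer(Query query)
{
    unsigned long length = 0;
    std::vector<unsigned char> buffer;
    // Retry a few times in case the required length grows between calls.
    for (int attempt = 0; attempt < 4; ++attempt)
    {
        if (query(buffer.empty() ? nullptr : buffer.data(), &length))
        {
            buffer.resize(length);
            return buffer;
        }
        if (length <= buffer.size())
        {
            break; // failed for a reason other than buffer size
        }
        buffer.resize(length); // never assume a fixed size
    }
    return {};
}

#ifdef _WIN32
// Queries one variable-sized record per physical core.
std::vector<unsigned char> queryProcessorCores()
{
    return queryWithSizedBuffer([](unsigned char* buffer, unsigned long* length) {
        return GetLogicalProcessorInformationEx(
                   RelationProcessorCore,
                   reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer),
                   length) != FALSE;
    });
}

// Walks the records by their Size field and counts the core entries.
std::size_t countPhysicalCores()
{
    const std::vector<unsigned char> bytes = queryProcessorCores();
    std::size_t cores = 0;
    std::size_t offset = 0;
    while (offset < bytes.size())
    {
        const auto* info =
            reinterpret_cast<const SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(
                bytes.data() + offset);
        if (info->Size == 0)
        {
            break; // defensive: avoid an infinite loop on malformed data
        }
        if (info->Relationship == RelationProcessorCore)
        {
            ++cores;
        }
        offset += info->Size;
    }
    return cores;
}
#endif
```

With this pattern the first call shows an input buffer length of 0 in the WinDbg output, and the following call with a non-zero length is expected to return 1.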
				
			

Integrated Graphics

Test for Integrated Graphics

The DirectX® APIs refer to Accelerated Processing Units (APUs) or Integrated Graphics parts via the term Unified Memory Architecture (UMA).

DirectX® 12

				
					bool isUMA(ID3D12Device* pDevice)
{
    bool result = false;
    D3D12_FEATURE_DATA_ARCHITECTURE data = {};
    if (S_OK == pDevice->CheckFeatureSupport(
        D3D12_FEATURE_ARCHITECTURE,
        &data,
        sizeof(data)))
    {
        result = data.UMA;
    }
    return result;
}
				
			
				
					//
// Copyright (c) 2021 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//

#include <cstdio> // printf
#include <dxgi1_4.h>
#include <d3d12.h>

#pragma comment( lib, "dxgi" )
#pragma comment( lib, "d3d12" )

bool isUMA(ID3D12Device* pDevice)
{
	bool result = false;
	D3D12_FEATURE_DATA_ARCHITECTURE data = {};
	if (S_OK == pDevice->CheckFeatureSupport(
		D3D12_FEATURE_ARCHITECTURE,
		&data,
		sizeof(data)))
	{
		result = data.UMA;
	}
	return result;
}

int main()
{
	ID3D12Device* pDevice = nullptr;
	if (SUCCEEDED(D3D12CreateDevice(
		NULL,
		D3D_FEATURE_LEVEL_11_0,
		__uuidof(ID3D12Device),
		(void**)&pDevice)))
	{
		IDXGIFactory* pFactory;
		IDXGIFactory4* pFactory4;
		if (SUCCEEDED(CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)(&pFactory)))
			&& SUCCEEDED(pFactory->QueryInterface(__uuidof(IDXGIFactory4), (void**)&pFactory4)))
		{
			LUID luid = pDevice->GetAdapterLuid();
			IDXGIAdapter* pAdapter;
			DXGI_ADAPTER_DESC desc;
			if (SUCCEEDED(pFactory4->EnumAdapterByLuid(luid, __uuidof(IDXGIAdapter), (void**)&pAdapter))
				&& SUCCEEDED(pAdapter->GetDesc(&desc)))
			{
				printf("DedicatedVideoMemory %I64u\n", desc.DedicatedVideoMemory);
				printf("DedicatedSystemMemory %I64u\n", desc.DedicatedSystemMemory);
				printf("SharedSystemMemory %I64u\n", desc.SharedSystemMemory);
				printf("isUMA %i\n", isUMA(pDevice));
				SIZE_T budget = desc.DedicatedVideoMemory;
				if (isUMA(pDevice))
				{
					budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
				}
				IDXGIAdapter3* pAdapter3 = nullptr;
				DXGI_QUERY_VIDEO_MEMORY_INFO info = {};				
				if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
					&& SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
				{
					budget = info.Budget;
				}
				printf("budget %I64u\n", budget);
			}
		}
	}
}
				
			

DirectX® 11.3

				
					bool isUMA(ID3D11Device* pDevice)
{
    bool result = false;
    ID3D11Device3* pD3D11Device3 = nullptr;
    if (S_OK == pDevice->QueryInterface(IID_PPV_ARGS(&pD3D11Device3)) && pD3D11Device3)
    {
        D3D11_FEATURE_DATA_D3D11_OPTIONS2 data = {};
        if (S_OK == pD3D11Device3->CheckFeatureSupport(
            D3D11_FEATURE_D3D11_OPTIONS2,
            &data,
            sizeof(data)))
        {
            result = data.UnifiedMemoryArchitecture;
        }
        pD3D11Device3->Release();
    }
    return result;
}
				
			
				
					//
// Copyright (c) 2021 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//

#include <cstdio> // printf
#include <dxgi1_4.h>
#include <d3d11_3.h>

#pragma comment( lib, "dxgi" )
#pragma comment( lib, "d3d11" )

bool isUMA(ID3D11Device* pDevice)
{
	bool result = false;
	ID3D11Device3* pD3D11Device3 = nullptr;
	if (S_OK == pDevice->QueryInterface(IID_PPV_ARGS(&pD3D11Device3)) && pD3D11Device3)
	{
		D3D11_FEATURE_DATA_D3D11_OPTIONS2 data = {};
		if (S_OK == pD3D11Device3->CheckFeatureSupport(
			D3D11_FEATURE_D3D11_OPTIONS2,
			&data,
			sizeof(data)))
		{
			result = data.UnifiedMemoryArchitecture;
		}
		pD3D11Device3->Release();
	}
	return result;
}

int main()
{
	UINT flags = 0; // D3D11_CREATE_DEVICE_SINGLETHREADED;
	D3D_FEATURE_LEVEL    featureLevels[] = { D3D_FEATURE_LEVEL_11_0 };
	UINT                 numFeatureLevels = ARRAYSIZE(featureLevels);
	D3D_FEATURE_LEVEL    featureLevel;
	ID3D11Device* pDevice = nullptr;
	ID3D11DeviceContext* pImmediateContext = nullptr;
	if (SUCCEEDED(D3D11CreateDevice(
		NULL,
		D3D_DRIVER_TYPE_HARDWARE,
		NULL,
		flags,
		featureLevels,
		numFeatureLevels,
		D3D11_SDK_VERSION,
		&pDevice,
		&featureLevel,
		&pImmediateContext)))
	{
		IDXGIDevice* pDXGIDevice = nullptr;
		IDXGIAdapter* pAdapter = nullptr;
		DXGI_ADAPTER_DESC desc;
		if (SUCCEEDED(pDevice->QueryInterface(__uuidof(IDXGIDevice), (void**)&pDXGIDevice))
			&& SUCCEEDED(pDXGIDevice->GetAdapter(&pAdapter))
			&& SUCCEEDED(pAdapter->GetDesc(&desc)))
		{
			printf("DedicatedVideoMemory %I64u\n", desc.DedicatedVideoMemory);
			printf("DedicatedSystemMemory %I64u\n", desc.DedicatedSystemMemory);
			printf("SharedSystemMemory %I64u\n", desc.SharedSystemMemory);
			printf("isUMA %i\n", isUMA(pDevice));
			SIZE_T budget = desc.DedicatedVideoMemory;
			if (isUMA(pDevice))
			{
				budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
			}
			IDXGIAdapter3* pAdapter3 = nullptr;
			DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
			if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
				&& SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
			{
				budget = info.Budget;
			}
			printf("budget %I64u\n", budget);
		}
	}
}
				
			

Calculate VRAM Budget appropriately for Integrated Graphics

Integrated graphics parts which share their video memory with the CPU require special considerations when detecting VRAM budgets.

DirectX®

Preferred method:

				
					IDXGIAdapter3* pAdapter3 = nullptr;
DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
    && SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
{
    budget = info.Budget;
}
				
			

Alternative method:

				
					DXGI_ADAPTER_DESC desc;
if (SUCCEEDED(pAdapter->GetDesc(&desc)))
{
    SIZE_T budget = desc.DedicatedVideoMemory;
    if (isUMA(pDevice))
    {
        budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
    }
}
				
			
				
			

Optimize Scalability for Integrated Graphics

Sometimes feature scaling may be required in order to achieve acceptable framerates on thermally limited platforms.

  • Straightforward changes to try for scaling include:
    • Use DXGI_FORMAT_R11G11B10_FLOAT rather than DXGI_FORMAT_R16G16B16A16_FLOAT.
    • Reduce shadow map quality.
    • Reduce volumetric fog quality.
    • Disable Ambient Occlusion.
  • The following related Unreal Engine CVars may be helpful:
    • r.SceneColorFormat
    • r.AmbientOcclusionLevels

Hybrid Graphics

Select the optimal GPU for Hybrid Graphics

Additional considerations may be necessary to ensure the expected GPU is utilized in hybrid graphics platforms.

  • Windows® 10 v1803 added IDXGIFactory6::EnumAdapterByGpuPreference.
  • Use DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications.
  • WinDbg may be used to test if DXGI_GPU_PREFERENCE=2 (DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE).
				
					bp dxgi!CDXGIFactory::EnumAdapterByGpuPreference ".printf \"FOUND DXGIFactory::EnumAdapterByGpuPreference DXGI_GPU_PREFERENCE=%x\\n\",@r8"
				
			
  • The user may change preferences per application in Graphics settings.
    • Example from Dell G5 15 Special Edition (5505)

Memory

Optimize memcpy/memset

  • Update the compiler for the latest memcpy, memset, and other C runtime optimizations.
  • Aligning memcpy source and destination to a 4096 byte page boundary may reduce “Zen 2” store to load forwarding events (See STLIOther in AMD μProf).
  • Aligning data to a 4096 byte page boundary may benefit probe filtering on AMD Threadripper™ and EPYC™ processors.

Avoid false sharing

  • Using alignas with the native cache line size (64 bytes) may reduce false sharing.
  • Use aligned memory allocators such as _aligned_malloc or C++17 aligned new.
  • Prefer thread local storage and local variables over process shared data.
    • Try using per thread range indices such that thread ranges avoid sharing the same 64 byte cache line or 4096 byte page.
    • Try copying data rather than using process shared data.
  • Padding or reordering a struct may reduce false sharing in some cases where variables which share the same cache line are used by more than one thread.
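
As an illustration of the alignas and per-thread data advice, the sketch below gives each worker thread its own cache-line-sized slot. The names `PaddedCounter` and `countInParallel` are hypothetical. It assumes C++17, where `std::vector` honors the over-aligned element type through aligned `new`, and it assumes a 64 byte cache line.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLineSize = 64; // native cache line size cited above

// Each counter occupies a full cache line, so threads updating neighbouring
// slots do not repeatedly invalidate each other's lines.
struct alignas(kCacheLineSize) PaddedCounter
{
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "expected exactly one counter per cache line");

// Each worker increments only its own slot; results are combined after join().
std::uint64_t countInParallel(unsigned threadCount,
                              std::uint64_t incrementsPerThread)
{
    std::vector<PaddedCounter> counters(threadCount);
    std::vector<std::thread> workers;
    workers.reserve(threadCount);
    for (unsigned i = 0; i < threadCount; ++i)
    {
        workers.emplace_back([&counters, i, incrementsPerThread] {
            for (std::uint64_t n = 0; n < incrementsPerThread; ++n)
            {
                ++counters[i].value;
            }
        });
    }
    for (std::thread& worker : workers)
    {
        worker.join();
    }
    std::uint64_t total = 0;
    for (const PaddedCounter& counter : counters)
    {
        total += counter.value;
    }
    return total;
}
```

Removing the `alignas` packs eight counters into one cache line, which is the layout that false sharing penalizes. Profiling both variants is the way to confirm whether the padding pays for its extra memory.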

Prefer data access patterns matching hardware prefetcher behaviors

  • Streaming
    • Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
  • Stride
    • Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous one.

Use Software Prefetch instructions for linked data structures experiencing cache misses

  • Use Software Prefetch instructions on linked data structures – such as std::vector<T*> – experiencing cache misses.
    • Tune prefetch distance to account for memory latency. In our experience, four iterations into the future is a good place to start tuning.
  • Use NTA on use once data.
  • While in dual-thread mode, beware that too many software prefetches from one thread may evict the working set of the other thread from their shared caches.
  • Remove Ineffective Software Prefetches found by PMCx052.
  • The AMD μProf “Assess Performance (Extended)” profile may help find Data Cache refills from DRAM.
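
A minimal sketch of software prefetching over a `std::vector<T*>` follows. The `Particle` type and the function names are hypothetical. The prefetch distance of four iterations is only the starting point mentioned above, and `_mm_prefetch` with `_MM_HINT_T0` is used on x64 builds while other targets fall back to a no-op.

```cpp
#include <cstddef>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T0
#endif

// Hints that the cache line at address will be read soon. For use-once data,
// _MM_HINT_NTA could be substituted for _MM_HINT_T0.
inline void prefetchForRead(const void* address)
{
#if defined(__x86_64__) || defined(_M_X64)
    _mm_prefetch(static_cast<const char*>(address), _MM_HINT_T0);
#else
    (void)address; // no-op on other targets in this sketch
#endif
}

struct Particle
{
    float position[3];
    float mass;
};

// Sums the mass of every particle reached through a vector of pointers.
// The pointers themselves are sequential in memory, but the objects they
// point to may be scattered, which is where the prefetch helps.
float totalMass(const std::vector<Particle*>& particles)
{
    constexpr std::size_t kPrefetchDistance = 4; // tune per workload
    float sum = 0.0f;
    for (std::size_t i = 0; i < particles.size(); ++i)
    {
        if (i + kPrefetchDistance < particles.size())
        {
            prefetchForRead(particles[i + kPrefetchDistance]);
        }
        sum += particles[i]->mass;
    }
    return sum;
}
```

Whether the prefetch helps depends on how much work each iteration does. Measure with and without it, and remove it if the profiler reports it as ineffective.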

Synchronization

Use Modern Sync APIs

Modern sync APIs include std::mutex, std::shared_mutex, SRWLock, and EnterCriticalSection.

  • These may be faster than and consume less power than WaitForSingleObject or user spin locks.
  • Some modern sync APIs leverage AMD’s mwaitx instruction efficiently to wait on an address or timeout.
  • Legacy sync APIs may have unneeded Syscall overhead.
  • User spin locks may consume OS thread scheduling resources unnecessarily since the OS scheduler may be unable to determine if it should yield to another program thread rather than spin.
    • It is generally recommended to issue sleep/wait instructions rather than spin locks.
    • Even when waiting on the GPU, calls like SetEventOnCompletion() may be as efficient as the old fence polling model while avoiding starving other threads or unnecessarily consuming power.
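
To illustrate waiting instead of spinning, the sketch below builds a small one-shot event from `std::mutex` and `std::condition_variable`. The `Event` class and `computeOnWorker` function are hypothetical names, and the condition variable is this example's choice rather than something the list above prescribes.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// A one-shot event. Waiters block in the OS rather than polling a flag in a loop.
class Event
{
public:
    void set()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        condition_.notify_all();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wakeups.
        condition_.wait(lock, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable condition_;
    bool signaled_ = false;
};

// Runs a small job on a worker thread and blocks, without spinning, until done.
int computeOnWorker(int input)
{
    Event done;
    int result = 0;
    std::thread worker([&] {
        result = input * 2; // stand-in for real work
        done.set();
    });
    done.wait(); // the mutex orders the write to result before this read
    worker.join();
    return result;
}
```

The equivalent spin loop would keep a core busy for the whole wait, and in SMT dual-thread mode it would compete with the sibling thread for shared resources.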

Test application scalability from 1 to %NUMBER_OF_PROCESSORS%

This advice is specific to AMD processors and is not general guidance for all processor vendors.

Generally, applications show SMT benefits and use of all logical processors is recommended. However, games often suffer from SMT contention on the main or render threads during gameplay.

  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
  • Avoid setting thread pool size as a constant.
  • Profile your application/game to determine the ideal thread count.
    • Game initialization – including decompressing assets and compiling/warming shaders – may benefit from logical processors using SMT dual-thread mode.
    • Game play may prefer physical core count using SMT single-thread mode.
  • We recommend creating developer options to:
    • Set Max Thread Pool Size.
    • Force Thread Pool Size.
    • Force SMT.
    • Force Single NUMA Node (implicitly Group).
  • Profile against multiple CPUs. There’s no hard and fast rule here.
    • The best thread count heuristic may vary between low and high core count CPUs.
    • While a 12 core CPU may benefit from an idle thread in your game to handle interrupts from the Operating System and 3rd party apps, a 6 core CPU may require the availability of every compute resource.
    • Developers may tune the low cores threshold for optimal performance on different core count CPUs.
  • AMD μProf may be used to show the actual thread concurrency histogram for a process.
  • See:

CPU Core Counts

This sample code correctly detects the physical and logical cores of today’s modern processors, along with the processor vendor and family.
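
The thread pool sizing advice and the suggested developer options can be sketched as a small policy function. This is not the AMD sample referred to above. `ThreadPoolOptions` and `chooseThreadCount` are hypothetical names, and the physical and logical counts are assumed to come from a detection routine such as that sample or GetLogicalProcessorInformationEx.

```cpp
#include <algorithm>
#include <cassert>

// Developer overrides mirroring the options recommended above.
struct ThreadPoolOptions
{
    unsigned maxThreads = 0;    // 0 = no cap ("Set Max Thread Pool Size")
    unsigned forcedThreads = 0; // 0 = not forced ("Force Thread Pool Size")
    bool useSmt = false;        // true = size by logical processors ("Force SMT")
};

// Picks a worker count from the detected topology instead of a constant.
// By default it follows the physical core count, which the text suggests for
// gameplay; useSmt switches to the logical processor count, which may suit
// loading and shader compilation.
unsigned chooseThreadCount(unsigned physicalCores,
                           unsigned logicalProcessors,
                           const ThreadPoolOptions& options)
{
    assert(physicalCores >= 1 && logicalProcessors >= physicalCores);
    if (options.forcedThreads > 0)
    {
        return options.forcedThreads;
    }
    unsigned count = options.useSmt ? logicalProcessors : physicalCores;
    if (options.maxThreads > 0)
    {
        count = std::min(count, options.maxThreads);
    }
    return std::max(count, 1u);
}
```

Exposing these options on the command line or in a developer console makes it practical to sweep the thread count from 1 to %NUMBER_OF_PROCESSORS% and to profile each setting on low and high core count CPUs.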

Now watch the presentations!

AMD Ryzen™ Processor Software Optimization – YouTube link

Join AMD Game Engineering team members for an introduction to the AMD Ryzen™ family of processors followed by advanced optimization topics.

Learn about the high-performance AMD “Zen 2” microarchitecture and profiling tools. Gain insight into code optimization opportunities and lessons learned.

Ryzen processor software optimization

Microsoft® Game Stack Live: AMD Ryzen™ Processor Software Optimization

Join AMD on an adventure thru “Zen 2” and “Zen 3” processors which power today’s game consoles and PCs. Dive into instruction sets, cache hierarchies, resource sharing, and simultaneous multi-threading. Journey across the sands of silicon to master microarchitecture and uncover best practices!
