Vulkan™ is designed to have significantly smaller CPU overhead compared to other APIs like OpenGL®. This is achieved by various means: the API is structured to do more work up-front, such as creating the pipeline state once and binding it many times instead of having to continuously set various state bits, and many API calls do more work per call; for example, vkCmdBindVertexBuffers can bind all vertex buffer objects used by the vertex shader stage in one call. However, a complex application can still end up calling various Vulkan functions tens or hundreds of thousands of times per frame. This article will look at the costs associated with that, and ways to bring them down.
By default, applications on Windows link to vulkan-1.dll and API calls go through that DLL, which contains the Vulkan loader. While the SDK provides a statically linked loader (VKstatic.1.lib), using it can create a compatibility hazard: if the process of loading Vulkan layers or drivers changes, the old loader code might not work in the future. The same, of course, can be said of bundling vulkan-1.dll with your application; the most future-proof method seems to be to rely on the vulkan-1.dll that the graphics driver installs to the system path.
The loader (vulkan-1.dll) exports all Vulkan functions; let’s look at the source code for one of them, vkCmdDraw (located in trampoline.c):
static inline VkLayerDispatchTable *loader_get_dispatch(const void *obj) {
    return *((VkLayerDispatchTable **)obj);
}

LOADER_EXPORT VKAPI_ATTR void VKAPI_CALL vkCmdDraw(VkCommandBuffer commandBuffer,
    uint32_t vertexCount, uint32_t instanceCount, uint32_t firstVertex, uint32_t firstInstance) {
    const VkLayerDispatchTable *disp;
    disp = loader_get_dispatch(commandBuffer);
    disp->CmdDraw(commandBuffer, vertexCount, instanceCount, firstVertex, firstInstance);
}
Whenever you call a Vulkan function, it has to get the dispatch table that contains the pointer to the “real” function, which generally is located inside the graphics driver, or inside a validation layer if one is enabled. The pointer to the table is stored at the beginning of the memory pointed to by the dispatchable handle; in this case, VkCommandBuffer. This allows your code to work even in the presence of multiple drivers/devices loaded into the same process, and looks like a manual implementation of a virtual function call. This probably has a cost, but how bad can this cost be?
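To make the mechanism concrete, here is a minimal sketch of the same pattern in plain C. It does not use the real loader types: MockDispatchTable, MockCommandBuffer, the two "drivers" and mock_CmdDraw are all hypothetical stand-ins invented for illustration. The only thing carried over from the loader source above is the idea that the table pointer sits at offset 0 of the dispatchable object.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for VkLayerDispatchTable: one slot per API call. */
typedef struct MockDispatchTable {
    void (*CmdDraw)(void *commandBuffer, uint32_t vertexCount);
} MockDispatchTable;

/* Hypothetical dispatchable object. The table pointer must be the first
   member so that a pointer to the object is also a pointer to that pointer. */
typedef struct MockCommandBuffer {
    const MockDispatchTable *dispatch;
    uint32_t verticesDrawn; /* driver-private state */
    int driverId;           /* records which driver handled the last call */
} MockCommandBuffer;

/* Two pretend drivers, each with its own implementation of the call. */
static void driverA_CmdDraw(void *cb, uint32_t vertexCount) {
    MockCommandBuffer *self = cb;
    self->verticesDrawn += vertexCount;
    self->driverId = 1;
}

static void driverB_CmdDraw(void *cb, uint32_t vertexCount) {
    MockCommandBuffer *self = cb;
    self->verticesDrawn += vertexCount;
    self->driverId = 2;
}

static const MockDispatchTable driverA_table = { driverA_CmdDraw };
static const MockDispatchTable driverB_table = { driverB_CmdDraw };

/* Same trick as loader_get_dispatch: read the pointer stored at offset 0. */
static const MockDispatchTable *mock_get_dispatch(const void *obj) {
    return *(const MockDispatchTable *const *)obj;
}

/* The "trampoline": one exported entry point that works for any driver. */
void mock_CmdDraw(void *commandBuffer, uint32_t vertexCount) {
    mock_get_dispatch(commandBuffer)->CmdDraw(commandBuffer, vertexCount);
}
```

Calling mock_CmdDraw on a command buffer created with driverA_table reaches driverA_CmdDraw, and likewise for driver B, without the caller knowing which driver owns the object. The price is one extra pointer load plus an indirect call on every invocation.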
Let’s look at what actually happens when you link to vulkan-1.dll and call vkCmdDraw! We will examine the instructions executed in a Release build of the Vulkan cube demo, targeting Windows x86 (the overhead in a Windows x64 build is less significant, but it can still reduce performance by a few percent).
It starts with the application calling vkCmdDraw:
vkCmdDraw(cmd_buf, 12 * 3, 1, 0, 0);
009937B7 6A 00 push 0
009937B9 6A 00 push 0
009937BB 6A 01 push 1
009937BD 6A 24 push 24h
009937BF 57 push edi
009937C0 E8 AB 46 00 00 call _vkCmdDraw@20 (0997E70h)
_vkCmdDraw@20:
00997E70 FF 25 24 92 99 00 jmp dword ptr [__imp__vkCmdDraw@20 (0999224h)]
The call does not reach the loader directly: it lands on a thunk consisting of a single jmp instruction that jumps to an address loaded from the DLL import table…
_vkCmdDraw@20:
50112800 E9 FB C5 03 00 jmp vkCmdDraw (5014EE00h)
That address is inside vulkan-1.dll, and seems to point to yet another thunk, which finally jumps to the vkCmdDraw trampoline that we’ve seen the source code for. The assembly for this function, however, proves to be unexpected.
vkCmdDraw:
5014EE00 55 push ebp
5014EE01 8B EC mov ebp,esp
5014EE03 51 push ecx
5014EE04 A1 34 E0 1C 50 mov eax,dword ptr [__security_cookie (501CE034h)]
5014EE09 33 C5 xor eax,ebp
5014EE0B 89 45 FC mov dword ptr [ebp-4],eax
5014EE0E 8B 45 08 mov eax,dword ptr [commandBuffer]
5014EE11 56 push esi
5014EE12 FF 75 18 push dword ptr [firstInstance]
5014EE15 FF 75 14 push dword ptr [firstVertex]
5014EE18 8B 30 mov esi,dword ptr [eax]
5014EE1A FF 75 10 push dword ptr [instanceCount]
5014EE1D FF 75 0C push dword ptr [vertexCount]
5014EE20 8B B6 68 01 00 00 mov esi,dword ptr [esi+168h]
5014EE26 8B CE mov ecx,esi
5014EE28 50 push eax
5014EE29 FF 15 00 50 1D 50 call dword ptr [__guard_check_icall_fptr (501D5000h)]
5014EE2F FF D6 call esi
5014EE31 8B 4D FC mov ecx,dword ptr [ebp-4]
5014EE34 33 CD xor ecx,ebp
5014EE36 5E pop esi
5014EE37 E8 F4 A6 FC FF call @__security_check_cookie@4 (50119530h)
5014EE3C 8B E5 mov esp,ebp
5014EE3E 5D pop ebp
5014EE3F C2 14 00 ret 14h
The first call, to __guard_check_icall_fptr, is emitted by the MSVC compiler when the Control Flow Guard feature is enabled (via /guard:cf). This feature instruments indirect function calls and, for each call, can check that the target address is a valid call target, which can prevent exploits that overwrite function pointers with unrelated code addresses.
Fortunately, in our case the executable itself is compiled without CFG, which means that __guard_check_icall_fptr points to a thunk for _guard_check_icall_nop:
_guard_check_icall_nop@4:
501198A0 E9 4B 11 04 00 jmp _guard_check_icall_nop (5015A9F0h)
_guard_check_icall_nop:
5015A9F0 C3 ret
That is three extra instructions (call, jmp and ret), but at least we aren’t running the code that actually inspects CFG tables to validate the function call.
The second call instruction in the original vkCmdDraw trampoline is the only one we wanted in the first place: it calls into the vkCmdDraw implementation that the driver provides (since we don’t have any layers active).
Unfortunately, the driver seems to have yet another trampoline that looks like another dispatch layer, translating the __stdcall calling convention to __thiscall; this is driver-specific and can change with driver updates or not be present at all, but at the moment it looks like this happens in Windows drivers for all three vendors (NVIDIA, AMD, Intel).
Finally, the third call, to __security_check_cookie, is emitted by the MSVC compiler when Buffer Security Check is enabled (via /GS); this catches some stack buffer overruns before they can do real damage and alter the execution sequence. The function itself is relatively short and simple:
__security_check_cookie@4:
50119530 E9 5F 14 04 00 jmp __security_check_cookie (5015A994h)
__security_check_cookie:
5015A994 3B 0D 34 E0 1C 50 cmp ecx,dword ptr [__security_cookie (501CE034h)]
5015A99A F2 75 02 bnd jne failure (5015A99Fh)
5015A99D F2 C3 bnd ret
As you can see, we wanted to simply call the vkCmdDraw implementation in the driver, and instead had to go through several layers of thunks, trampolines and security infrastructure calls. While the cost of all of these isn’t catastrophic, it can add up to measurable overhead.
Fortunately, the cost of device dispatch was accounted for in the design of the Vulkan API; you can get the pointer to the function that does the actual work by calling vkGetDeviceProcAddr:
PFN_vkCmdDraw CmdDraw = (PFN_vkCmdDraw)vkGetDeviceProcAddr(demo->device, "vkCmdDraw");
000637B7 68 C8 9E 06 00 push offset string "vkCmdDraw" (069EC8h)
000637BC FF B6 A4 00 00 00 push dword ptr [esi+0A4h]
000637C2 E8 57 45 00 00 call _vkGetDeviceProcAddr@8 (067D1Eh)
CmdDraw(cmd_buf, 12 * 3, 1, 0, 0);
000637C7 6A 00 push 0
000637C9 6A 00 push 0
000637CB 6A 01 push 1
000637CD 6A 24 push 24h
000637CF 57 push edi
000637D0 FF D0 call eax
Of course, you would want to call vkGetDeviceProcAddr just once and cache the result; all calls through the resulting function pointer will go to the first enabled layer, if any, and to the driver otherwise, and bypass all overhead associated with DLL thunks etc.
If your application uses just one device or device group, you can simply use global function pointers to store the results of vkGetDeviceProcAddr; if you need to support multiple instances or devices, you need to store function pointers in a struct and have one instance of that struct per device that you have easy access to in your rendering code.
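The per-device struct can be sketched as follows. This is an illustration only and deliberately avoids the Vulkan headers so it compiles on its own: MockDevice, mock_GetDeviceProcAddr, the "mockCmd…" names and DeviceFunctions are all hypothetical, with mock_GetDeviceProcAddr playing the role of vkGetDeviceProcAddr. The lookup counter exists purely to show that the name lookups happen once at load time and never on the hot path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic function pointer, playing the role of PFN_vkVoidFunction. */
typedef void (*MockVoidFunction)(void);

/* Hypothetical device object with some observable state. */
typedef struct MockDevice {
    int id;
    int drawCalls;
    int dispatchCalls;
} MockDevice;

typedef void (*PFN_mockCmdDraw)(MockDevice *device, uint32_t vertexCount);
typedef void (*PFN_mockCmdDispatch)(MockDevice *device, uint32_t groupCount);

static int g_lookups; /* counts by-name lookups */

static void impl_CmdDraw(MockDevice *d, uint32_t vertexCount) {
    (void)vertexCount;
    d->drawCalls++;
}

static void impl_CmdDispatch(MockDevice *d, uint32_t groupCount) {
    (void)groupCount;
    d->dispatchCalls++;
}

/* Hypothetical by-name lookup: comparatively slow, so call it once per name. */
MockVoidFunction mock_GetDeviceProcAddr(MockDevice *device, const char *name) {
    (void)device;
    g_lookups++;
    if (strcmp(name, "mockCmdDraw") == 0) return (MockVoidFunction)impl_CmdDraw;
    if (strcmp(name, "mockCmdDispatch") == 0) return (MockVoidFunction)impl_CmdDispatch;
    return NULL;
}

/* One instance of this struct per device. */
typedef struct DeviceFunctions {
    PFN_mockCmdDraw CmdDraw;
    PFN_mockCmdDispatch CmdDispatch;
} DeviceFunctions;

/* Resolves every entry point once; returns 1 on success, 0 if any is missing. */
int load_device_functions(DeviceFunctions *out, MockDevice *device) {
    out->CmdDraw = (PFN_mockCmdDraw)mock_GetDeviceProcAddr(device, "mockCmdDraw");
    out->CmdDispatch = (PFN_mockCmdDispatch)mock_GetDeviceProcAddr(device, "mockCmdDispatch");
    return out->CmdDraw != NULL && out->CmdDispatch != NULL;
}
```

Rendering code then keeps a DeviceFunctions next to each device and calls fn.CmdDraw(&device, 36) directly. With a single device, the same pointers could simply live in globals.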
The performance benefit that you get out of using the device function pointers depends on the platform you’re targeting, the driver/application overhead and the number of Vulkan calls; it can range between 1% and 5% for typical Vulkan applications. It may seem minor, but every little bit helps; the trick to getting good performance is to make your code faster one percent at a time.
The Vulkan API contains many functions that can benefit from this optimization. While you could load the ones you need manually, it seems like a good idea to generate the loading code automatically from vk.xml (the XML file that vulkan.h is generated from).
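A full vk.xml generator is beyond a short example, but the underlying idea of keeping one list and deriving every declaration and load from it can be shown with C's X-macro idiom. Note that this is a different, hand-maintained technique rather than generation from vk.xml, and every name below (MOCK_DEVICE_FUNCTIONS, MockFunctionTable, mock_get_proc_addr and the "mockCmd…" entry points) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*MockVoidFunction)(void);

/* Hypothetical entry points standing in for driver functions. */
static int g_calls;
static void impl_mockCmdDraw(void)       { g_calls += 1; }
static void impl_mockCmdDispatch(void)   { g_calls += 10; }
static void impl_mockCmdCopyBuffer(void) { g_calls += 100; }

/* The single source of truth: add a function here and nowhere else. */
#define MOCK_DEVICE_FUNCTIONS(X) \
    X(mockCmdDraw)               \
    X(mockCmdDispatch)           \
    X(mockCmdCopyBuffer)

/* Hypothetical by-name lookup, playing the role of vkGetDeviceProcAddr. */
MockVoidFunction mock_get_proc_addr(const char *name) {
#define X(fn) if (strcmp(name, #fn) == 0) return impl_##fn;
    MOCK_DEVICE_FUNCTIONS(X)
#undef X
    return NULL;
}

/* Expansion 1: one struct member per listed function. */
typedef struct MockFunctionTable {
#define X(fn) MockVoidFunction fn;
    MOCK_DEVICE_FUNCTIONS(X)
#undef X
} MockFunctionTable;

/* Expansion 2: one load per listed function. Returns the number of
   entry points that could not be resolved (0 means everything loaded). */
int mock_load_table(MockFunctionTable *table) {
    int missing = 0;
#define X(fn) table->fn = mock_get_proc_addr(#fn); if (table->fn == NULL) ++missing;
    MOCK_DEVICE_FUNCTIONS(X)
#undef X
    return missing;
}
```

The struct definition and the loader can no longer drift apart, which is the same property a generator driven by vk.xml gives you for the complete API.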
In addition to generating code to load function pointers for device functions, you might want to load function pointers for other functions as well (using vkGetInstanceProcAddr). This lets you remove the static dependency on vulkan-1.dll, which makes it easier to handle the absence of a Vulkan loader by switching to a different rendering API or providing a nicer error message to the user.
For both of these, you can use volk, which is an MIT-licensed meta-loader for Vulkan (similar to GLEW for OpenGL). It is designed as a drop-in header/source for projects that are using Vulkan. The library dynamically finds the real Vulkan loader and loads all functions from it; it can also load device functions via vkGetDeviceProcAddr for faster dispatch.
To use it, add volk.c to your project, and replace all #include <vulkan/vulkan.h> lines with #include <volk.h> (assuming you’ve added the volk folder to your header search paths). Then, call the following function to initialize it before calling any Vulkan APIs (including instance creation):
VkResult result = volkInitialize();
If the returned result isn’t VK_SUCCESS, Vulkan is not available on your system. If the call succeeds, proceed by creating the Vulkan instance as usual, and then load all remaining functions:
volkLoadInstance(instance);
Finally, after creating the device, you have the option of replacing the global function pointers with functions retrieved with vkGetDeviceProcAddr like this:
volkLoadDevice(demo->device);
Or loading function pointers for direct calls into a function pointer table like this:
VolkDeviceTable table;
volkLoadDeviceTable(&table, device);
And then using the functions from the table instead:
table.vkCmdDraw(cmd_buf, 12 * 3, 1, 0, 0);
The first method allows you to get quick gains without changing your code, but isn’t suitable for applications that want to use explicit multi-GPU by creating multiple VkDevice objects.
Note that to avoid symbol conflicts, you have to make sure all translation units in your application include volk.h instead of vulkan.h, or that you define VK_NO_PROTOTYPES project-wide to make sure you aren’t accidentally picking up symbols from the real Vulkan loader.