Reducing Vulkan API call overhead

Vulkan™ is designed to have significantly smaller CPU overhead than other APIs such as OpenGL®. This is achieved by various means: the API is structured to do more work up-front, such as creating the pipeline state once and binding it many times instead of having to continuously set various state bits, and many API calls do more work per call; for example, vkCmdBindVertexBuffers can bind all vertex buffer objects used by the vertex shader stage in one call. However, a complex application can still end up calling various Vulkan functions tens or hundreds of thousands of times per frame. This article looks at the costs associated with that, and ways to bring them down.

Loader dispatch

By default, applications on Windows link to vulkan-1.dll and API calls go through that DLL, which contains the Vulkan loader. While the SDK provides a statically linked loader (VKstatic.1.lib), using it can create a compatibility hazard – if the process of loading Vulkan layers/driver changes, the old loader code might not work in the future. The same, of course, can be said of bundling vulkan-1.dll with your application; the most future-proof method seems to be to rely on vulkan-1.dll that the graphics driver installs to the system path.
The loader (vulkan-1.dll) exports all Vulkan functions; let’s look at the source code for one of them, vkCmdDraw (located in trampoline.c):

static inline VkLayerDispatchTable *loader_get_dispatch(const void *obj) {
    return *((VkLayerDispatchTable **)obj);
}

LOADER_EXPORT VKAPI_ATTR void VKAPI_CALL vkCmdDraw(VkCommandBuffer commandBuffer,
    uint32_t vertexCount, uint32_t instanceCount, uint32_t firstVertex, uint32_t firstInstance) {
    const VkLayerDispatchTable *disp;
    disp = loader_get_dispatch(commandBuffer);
    disp->CmdDraw(commandBuffer, vertexCount, instanceCount, firstVertex, firstInstance);
}

Whenever you call a Vulkan function, it has to get the dispatch table that contains the pointer to the “real” function, which is generally located inside the graphics driver, or inside a validation layer if one is enabled. The pointer to the table is stored at the beginning of the memory pointed to by the dispatchable handle (in this case, VkCommandBuffer). This allows your code to work even in the presence of multiple drivers/devices loaded into the same process, and looks like a manual implementation of a virtual function call. This probably has a cost, but how bad can this cost be?
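
To make the mechanism concrete, here is a minimal, self-contained sketch of this kind of dispatch in plain C. It does not use the real loader types: DispatchTable, FakeCommandBuffer, the two "driver" functions and trampoline_CmdDraw are all hypothetical names invented for illustration. The only thing it borrows from the loader is the idea that the first pointer-sized field of a dispatchable object holds the table pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the loader's dispatch table (one entry only). */
typedef struct DispatchTable {
    void (*CmdDraw)(void *commandBuffer, uint32_t vertexCount, uint32_t instanceCount,
                    uint32_t firstVertex, uint32_t firstInstance);
} DispatchTable;

/* Hypothetical dispatchable object: the table pointer must be the first field. */
typedef struct FakeCommandBuffer {
    const DispatchTable *dispatch;
    uint32_t counter; /* what each fake driver records, for demonstration */
} FakeCommandBuffer;

/* Fake driver A: accumulates the number of vertices submitted. */
void driver_a_CmdDraw(void *commandBuffer, uint32_t vertexCount, uint32_t instanceCount,
                      uint32_t firstVertex, uint32_t firstInstance) {
    (void)firstVertex;
    (void)firstInstance;
    ((FakeCommandBuffer *)commandBuffer)->counter += vertexCount * instanceCount;
}

/* Fake driver B: counts draw calls instead. */
void driver_b_CmdDraw(void *commandBuffer, uint32_t vertexCount, uint32_t instanceCount,
                      uint32_t firstVertex, uint32_t firstInstance) {
    (void)vertexCount;
    (void)instanceCount;
    (void)firstVertex;
    (void)firstInstance;
    ((FakeCommandBuffer *)commandBuffer)->counter += 1;
}

const DispatchTable driver_a_table = { driver_a_CmdDraw };
const DispatchTable driver_b_table = { driver_b_CmdDraw };

/* The "exported" entry point: read the table pointer from the start of the
   object, then make an indirect call through it. */
void trampoline_CmdDraw(void *commandBuffer, uint32_t vertexCount, uint32_t instanceCount,
                        uint32_t firstVertex, uint32_t firstInstance) {
    assert(commandBuffer != NULL);
    const DispatchTable *disp = *(const DispatchTable *const *)commandBuffer;
    disp->CmdDraw(commandBuffer, vertexCount, instanceCount, firstVertex, firstInstance);
}
```

Calling trampoline_CmdDraw on a FakeCommandBuffer set up with driver_a_table ends up in driver_a_CmdDraw, and the same call on one set up with driver_b_table ends up in driver_b_CmdDraw, which is how a single exported function can serve several drivers in the same process.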

Tracing the call

Let’s look at what actually happens when you link to vulkan-1.dll and call vkCmdDraw! We will examine the instructions executed in a Release build of the Vulkan cube demo, targeting Windows x86 (the overhead in a Windows x64 build is less significant, but it can still reduce performance by a few percent).

It starts with the application calling vkCmdDraw:

   vkCmdDraw(cmd_buf, 12 * 3, 1, 0, 0);
009937B7 6A 00 push 0
009937B9 6A 00 push 0
009937BB 6A 01 push 1
009937BD 6A 24 push 24h
009937BF 57 push edi
009937C0 E8 AB 46 00 00 call _vkCmdDraw@20 (0997E70h)

Straightforward – just push all parameters on the stack and call. The function we are calling is inside our executable, and is just a trampoline that exists to implement DLL import:

_vkCmdDraw@20:
00997E70 FF 25 24 92 99 00 jmp dword ptr [__imp__vkCmdDraw@20 (0999224h)]

The function is just one jmp instruction, that jumps to an address loaded from the DLL import table…

_vkCmdDraw@20:
50112800 E9 FB C5 03 00 jmp vkCmdDraw (5014EE00h)

Which is inside vulkan-1.dll, and seems to point to yet another thunk, which finally jumps to the vkCmdDraw trampoline that we’ve seen the source code for. The assembly for this function, however, proves to be unexpected.

vkCmdDraw:
5014EE00 55 push ebp
5014EE01 8B EC mov ebp,esp
5014EE03 51 push ecx
5014EE04 A1 34 E0 1C 50 mov eax,dword ptr [__security_cookie (501CE034h)]
5014EE09 33 C5 xor eax,ebp
5014EE0B 89 45 FC mov dword ptr [ebp-4],eax
5014EE0E 8B 45 08 mov eax,dword ptr [commandBuffer]
5014EE11 56 push esi
5014EE12 FF 75 18 push dword ptr [firstInstance]
5014EE15 FF 75 14 push dword ptr [firstVertex]
5014EE18 8B 30 mov esi,dword ptr [eax]
5014EE1A FF 75 10 push dword ptr [instanceCount]
5014EE1D FF 75 0C push dword ptr [vertexCount]
5014EE20 8B B6 68 01 00 00 mov esi,dword ptr [esi+168h]
5014EE26 8B CE mov ecx,esi
5014EE28 50 push eax
5014EE29 FF 15 00 50 1D 50 call dword ptr [__guard_check_icall_fptr (501D5000h)]
5014EE2F FF D6 call esi
5014EE31 8B 4D FC mov ecx,dword ptr [ebp-4]
5014EE34 33 CD xor ecx,ebp
5014EE36 5E pop esi
5014EE37 E8 F4 A6 FC FF call @__security_check_cookie@4 (50119530h)
5014EE3C 8B E5 mov esp,ebp
5014EE3E 5D pop ebp
5014EE3F C2 14 00 ret 14h

Note that in addition to rearranging the arguments on the stack, this assembly sequence contains three function calls. The first one, __guard_check_icall_fptr, is emitted by the MSVC compiler when the Control Flow Guard feature is enabled (via /guard:cf). This feature instruments indirect function calls so that each call can check that the target address is a valid indirect call target, which can prevent exploits that overwrite function pointers with unrelated code addresses.

Fortunately, in our case the executable itself is compiled without CFG, which means that __guard_check_icall_fptr points to a thunk for _guard_check_icall_nop:

_guard_check_icall_nop@4:
501198A0 E9 4B 11 04 00 jmp _guard_check_icall_nop (5015A9F0h)
_guard_check_icall_nop:
5015A9F0 C3 ret

So we pay the cost of an indirect call, a jmp and a ret, but at least we aren’t running the code that actually inspects CFG tables to validate the function call.

The second call instruction in the original vkCmdDraw trampoline is the only one we’ve wanted in the first place – it calls into the vkCmdDraw implementation that the driver provides (since we don’t have any layers active).

Unfortunately, the driver seems to have yet another trampoline, which looks like another dispatch layer that translates the __stdcall calling convention to __thiscall; this is driver-specific and can change with driver updates or not be present at all, but at the moment it looks like this happens in Windows drivers for all three vendors (NVIDIA, AMD, Intel).

Finally, the third call, to __security_check_cookie, is emitted by the MSVC compiler when Buffer Security Check is enabled (via /GS); this catches some stack buffer overruns before they can do real damage and alter the execution sequence. The function itself is relatively short and simple:

__security_check_cookie@4:
50119530 E9 5F 14 04 00 jmp __security_check_cookie (5015A994h)
__security_check_cookie:
5015A994 3B 0D 34 E0 1C 50 cmp ecx,dword ptr [__security_cookie (501CE034h)]
5015A99A F2 75 02 bnd jne failure (5015A99Fh)
5015A99D F2 C3 bnd ret

As you can see, we wanted to simply call vkCmdDraw implementation in the driver, and instead had to go through several layers of thunks, trampolines and security infrastructure calls. While the cost of all of these isn’t catastrophic, it can add up to measurable overhead.

Getting function pointers for direct calls

Fortunately, the cost of device dispatch was accounted for in the design of the Vulkan API; you can get the pointer to the function that does the actual work by calling vkGetDeviceProcAddr:

    PFN_vkCmdDraw CmdDraw = (PFN_vkCmdDraw)vkGetDeviceProcAddr(demo->device, "vkCmdDraw");
000637B7 68 C8 9E 06 00 push offset string "vkCmdDraw" (069EC8h)
000637BC FF B6 A4 00 00 00 push dword ptr [esi+0A4h]
000637C2 E8 57 45 00 00 call _vkGetDeviceProcAddr@8 (067D1Eh)
    CmdDraw(cmd_buf, 12 * 3, 1, 0, 0);
000637C7 6A 00 push 0
000637C9 6A 00 push 0
000637CB 6A 01 push 1
000637CD 6A 24 push 24h
000637CF 57 push edi
000637D0 FF D0 call eax

Of course, you would want to use vkGetDeviceProcAddr just once and cache the result; all calls to the resulting function pointer will go to the first enabled layer, if any, and to the driver otherwise, and bypass all overhead associated with DLL thunks etc.

If your application uses just one device or device group, you can simply use global function pointers to store the results of vkGetDeviceProcAddr; if you need to support multiple instances or devices, you need to store the function pointers in a struct and keep one instance of that struct per device, somewhere easily accessible from your rendering code.
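
The per-device variant can be sketched as follows. To keep the example self-contained it does not include the Vulkan headers: GetDeviceProcAddrFn stands in for vkGetDeviceProcAddr, and DeviceFunctions, load_device_functions and the mock_ functions are hypothetical names used only for illustration. In real code the struct members would have the PFN_vk... types from vulkan.h and the resolver would be vkGetDeviceProcAddr itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the Vulkan function pointer types. */
typedef void (*GenericProc)(void);
typedef GenericProc (*GetDeviceProcAddrFn)(void *device, const char *name);
typedef void (*CmdDrawFn)(void *commandBuffer, uint32_t vertexCount, uint32_t instanceCount,
                          uint32_t firstVertex, uint32_t firstInstance);
typedef void (*CmdDispatchFn)(void *commandBuffer, uint32_t x, uint32_t y, uint32_t z);

/* One instance of this struct is kept per device. */
typedef struct DeviceFunctions {
    CmdDrawFn CmdDraw;
    CmdDispatchFn CmdDispatch;
} DeviceFunctions;

/* Resolve every entry once, by name; returns 1 if all lookups succeeded. */
int load_device_functions(DeviceFunctions *out, GetDeviceProcAddrFn getProc, void *device) {
    assert(out != NULL && getProc != NULL);
    out->CmdDraw = (CmdDrawFn)getProc(device, "vkCmdDraw");
    out->CmdDispatch = (CmdDispatchFn)getProc(device, "vkCmdDispatch");
    return out->CmdDraw != NULL && out->CmdDispatch != NULL;
}

/* Mock "driver" used only to exercise the sketch without a real device. */
int g_lookups = 0;
uint32_t g_draws = 0;
uint32_t g_dispatches = 0;

void mock_CmdDraw(void *commandBuffer, uint32_t vertexCount, uint32_t instanceCount,
                  uint32_t firstVertex, uint32_t firstInstance) {
    (void)commandBuffer;
    (void)vertexCount;
    (void)instanceCount;
    (void)firstVertex;
    (void)firstInstance;
    g_draws++;
}

void mock_CmdDispatch(void *commandBuffer, uint32_t x, uint32_t y, uint32_t z) {
    (void)commandBuffer;
    (void)x;
    (void)y;
    (void)z;
    g_dispatches++;
}

/* Mock resolver: counts how many by-name lookups were performed. */
GenericProc mock_get_device_proc_addr(void *device, const char *name) {
    (void)device;
    g_lookups++;
    if (strcmp(name, "vkCmdDraw") == 0) return (GenericProc)mock_CmdDraw;
    if (strcmp(name, "vkCmdDispatch") == 0) return (GenericProc)mock_CmdDispatch;
    return NULL;
}
```

An application with several devices would keep one DeviceFunctions per device and call through it, for example fns.CmdDraw(cmd_buf, 12 * 3, 1, 0, 0), so the by-name lookup happens once at load time rather than on every call.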

The performance benefit that you get out of using the device function pointers depends on the platform you’re targeting, the driver/application overhead and the number of Vulkan calls; it can range from 1% to 5% for typical Vulkan applications. It may seem minor, but every little bit helps; the trick to getting good performance is to make your code faster one percent at a time.

Using volk to get the function pointers

The Vulkan API contains many functions that can benefit from this optimization. While you could load the ones you need manually, it is more practical to generate the loading code automatically from vk.xml (the XML file that vulkan.h is generated from).

In addition to generating code to load function pointers for device functions, you might want to load function pointers for other functions as well (using vkGetInstanceProcAddr). This lets you remove the static dependency on vulkan-1.dll, which makes it easier to handle the lack of a Vulkan loader by switching to a different rendering API or providing a nicer error message to the user.

For both of these, you can use volk, which is an MIT-licensed meta-loader for Vulkan (similar to GLEW for OpenGL). It is designed as a drop-in header/source for projects that are using Vulkan. The library dynamically finds the real Vulkan loader and loads all functions from it; it can also load device functions via vkGetDeviceProcAddr for faster dispatch.

To use it, add volk.c to your project, and replace all #include <vulkan/vulkan.h> lines with #include <volk.h> (assuming you’ve added the volk folder to your header search paths). Then, call the following function to initialize it before calling any Vulkan APIs (including instance creation):

VkResult result = volkInitialize();

If the returned result isn’t VK_SUCCESS, Vulkan is not available on your system. If the call succeeds, proceed by creating the Vulkan instance as usual, and then loading all remaining functions:

volkLoadInstance(instance);

Finally, after creating the device, you have the option of replacing the global function pointers with functions retrieved with vkGetDeviceProcAddr, like this:

volkLoadDevice(demo->device);

Or loading function pointers for direct calls into a function pointer table like this:

VolkDeviceTable table;
volkLoadDeviceTable(&table, device);

And then using the functions from the table instead:

table.vkCmdDraw(cmd_buf, 12 * 3, 1, 0, 0);

The first method allows you to get quick gains without changing your code, but isn’t suitable for applications that want to use explicit multi-GPU by creating multiple VkDevice objects.

Note that to avoid symbol conflicts, you have to make sure all translation units in your application include volk.h instead of vulkan.h, or that you define VK_NO_PROTOTYPES project-wide to make sure you aren’t accidentally picking up symbols from the real Vulkan loader.

Arseny Kapoulkine
Arseny Kapoulkine has worked on game technology for the past decade. Having worked on rendering, physics simulation, language runtimes, multithreading and many other areas, he is still discovering exciting problems in game development that require low-level thinking. After helping ship many titles on PS3 including several FIFA games, he joined Roblox in 2012 and has been working on the in-house engine ever since, helping young game developers achieve their dreams.
