Vulkan™ is a high-performance, low-overhead graphics API designed to allow advanced applications to drive modern GPUs to their fullest capacity. Where traditional APIs present an abstraction in which commands behave as if they are executed immediately, Vulkan uses a model that exposes what’s really going on: GPUs execute commands placed in buffers in memory, potentially out of order, and those command buffers may be built in parallel across many software threads. Furthermore, large pieces of interrelated state are presented to the graphics driver at the same time through state objects. This gives drivers the opportunity to fully optimize GPU state well ahead of render time, maximizing performance without risking the stuttering and other issues associated with just-in-time optimization. The end result is lower, more consistent frame times and lower CPU overhead, meaning more CPU cycles for your application.
Vulkan is derived from AMD’s trail-blazing Mantle API. AMD donated the Mantle specification, headers and other technology to the Khronos Group to use as the basis of its (at the time unnamed) next-generation API. With the help of other industry players over the course of more than a year, we eventually evolved Mantle into what became Vulkan. It was a long process, and some of the most significant changes came from our members in the mobile field, whose GPUs primarily use tiled architectures designed to minimize off-chip memory traffic in order to save power. Among the features proposed by our mobile members was the renderpass: an object designed to let an application communicate the high-level structure of a frame to the driver. Tiling GPU drivers can use this information to determine when to bring data on and off chip, whether to flush data out to memory or discard the contents of tile buffers, and even to size memory allocations used for binning and other internal operations. This is a feature that Mantle did not have, and it is not part of Direct3D® 12 either.
To Tile or Not To Tile
A tiled GPU batches up geometry, determines which regions of the framebuffer that geometry lands in, and then, for each region, renders the parts of the geometry that hit that tile. This makes framebuffer access very coherent and, in many cases, allows the GPU to complete rendering of one framebuffer tile entirely on-chip before moving to the next. AMD does not make tiling GPUs. Our GPUs are what are known as forward, or immediate, renderers: when a command comes in to draw some geometry, the GPU renders it wherever it lands and finishes processing it before moving on to the next command. Things are pipelined, and commands can overlap and even finish out of order, but special hardware built into the GPU puts everything back into the right order before any data is written to memory. Our drivers generally don’t need to worry about tiling at all. So, what do these renderpass objects have to do with us? Why do we care?
In Vulkan, a renderpass object contains the structure of the frame. In its simplest form, a renderpass encapsulates the set of framebuffer attachments, basic information about pipeline state and not much more. However, a renderpass can contain one or more subpasses and information about how those subpasses relate to one another. This is where things get interesting.
Each subpass can reference a subset of the framebuffer attachments for writing and also a subset of the framebuffer attachments for reading. These readable framebuffer attachments are known as input attachments and effectively contain the result of an earlier subpass at the same pixel. Unlike traditional render-to-texture techniques, where each pass may read any pixel produced by a previous pass, input attachments guarantee that each fragment shader only accesses data produced by shader invocations at the same pixel. Further, each subpass specifies what to do with each attachment when it begins (clear it, restore it from memory, or leave it uninitialized) and what to do with each attachment when it ends (store it back to memory or throw it away). The dependencies between the subpasses are explicitly spelled out by the application. This allows a tiled renderer to know exactly when it needs to flush its tile buffer, clear it, restore it from memory, and so on.
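To make this concrete, here is a minimal sketch in C of how such a renderpass might be built with the Vulkan 1.0 API. The structure (a first subpass that renders into an intermediate color attachment, and a second subpass that reads it back as an input attachment while writing the final image), the formats and the helper name build_render_pass are illustrative assumptions, not a prescription:

```c
#include <vulkan/vulkan.h>

VkRenderPass build_render_pass(VkDevice device)  /* hypothetical helper */
{
    /* Attachment 0: intermediate color buffer, written by subpass 0 and read
     * as an input attachment by subpass 1. Its content is not needed after
     * the renderpass, so storeOp is DONT_CARE.
     * Attachment 1: final color buffer, written by subpass 1 and kept. */
    const VkAttachmentDescription attachments[2] = {
        {
            .format         = VK_FORMAT_R8G8B8A8_UNORM,
            .samples        = VK_SAMPLE_COUNT_1_BIT,
            .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,      /* clear on begin */
            .storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE, /* discard on end */
            .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
            .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
            .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
            .finalLayout    = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
        },
        {
            .format         = VK_FORMAT_B8G8R8A8_UNORM,
            .samples        = VK_SAMPLE_COUNT_1_BIT,
            .loadOp         = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
            .storeOp        = VK_ATTACHMENT_STORE_OP_STORE,     /* keep the result */
            .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
            .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
            .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
            .finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,
        },
    };

    const VkAttachmentReference color0 = {
        .attachment = 0, .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
    const VkAttachmentReference input0 = {
        .attachment = 0, .layout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };
    const VkAttachmentReference color1 = {
        .attachment = 1, .layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };

    const VkSubpassDescription subpasses[2] = {
        { /* subpass 0: produce the intermediate image */
            .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
            .colorAttachmentCount = 1,
            .pColorAttachments    = &color0,
        },
        { /* subpass 1: consume attachment 0 at the same pixel only */
            .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
            .inputAttachmentCount = 1,
            .pInputAttachments    = &input0,
            .colorAttachmentCount = 1,
            .pColorAttachments    = &color1,
        },
    };

    /* The application spells out the dependency between the subpasses. */
    const VkSubpassDependency dependency = {
        .srcSubpass      = 0,
        .dstSubpass      = 1,
        .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
        .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
        .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
        .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
        .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT, /* same-pixel dependency */
    };

    const VkRenderPassCreateInfo info = {
        .sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO,
        .attachmentCount = 2,
        .pAttachments    = attachments,
        .subpassCount    = 2,
        .pSubpasses      = subpasses,
        .dependencyCount = 1,
        .pDependencies   = &dependency,
    };

    VkRenderPass render_pass = VK_NULL_HANDLE;
    vkCreateRenderPass(device, &info, NULL, &render_pass); /* error handling omitted */
    return render_pass;
}
```

Everything the driver needs for scheduling (load and store ops, layouts, and the inter-subpass dependency) is visible up front in this one object, long before any rendering commands are recorded.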
Go Forward Faster
As it turns out, a forward renderer such as ours can take advantage of this kind of information as well. Here are a few examples of the types of optimizations we can make.
Just as we can tell that one subpass depends on the result of an earlier one, we can tell when a subpass does not depend on an earlier one. Therefore, we can sometimes render those subpasses in parallel or even out of order without synchronization. If one subpass depends on the result of a previous subpass, then with a traditional graphics API, the driver would need to inject a bubble into the GPU pipeline in order to synchronize the render backends’ output caches with the texture units’ input caches. However, by rescheduling work, we can instruct the render backends to flush their caches, process unrelated work and then invalidate the texture caches before initiating the consuming subpass. This eliminates the bubble and saves GPU time.
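From the application’s point of view, this optimization shows up as an absence of work: both subpasses are recorded back to back with no pipeline barrier between them, because the dependency declared in the renderpass already carries everything the driver needs to place the flush and invalidate around unrelated work. A hypothetical recording sequence, reusing the renderpass sketched above (the command buffer, framebuffer and dimensions are assumed to exist):

```c
#include <vulkan/vulkan.h>

/* Hypothetical recording of the two-subpass renderpass sketched earlier.
 * Note that no vkCmdPipelineBarrier appears between the subpasses: the
 * VkSubpassDependency already described the hand-off, so the driver is
 * free to schedule the cache flush and invalidate around other work. */
void record_frame(VkCommandBuffer cmd, VkRenderPass render_pass,
                  VkFramebuffer framebuffer, uint32_t width, uint32_t height)
{
    const VkClearValue clears[2] = {
        { .color = { .float32 = { 0.0f, 0.0f, 0.0f, 1.0f } } },
        { .color = { .float32 = { 0.0f, 0.0f, 0.0f, 1.0f } } },
    };

    const VkRenderPassBeginInfo begin = {
        .sType           = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO,
        .renderPass      = render_pass,
        .framebuffer     = framebuffer,
        .renderArea      = { .offset = { 0, 0 }, .extent = { width, height } },
        .clearValueCount = 2,
        .pClearValues    = clears,
    };

    vkCmdBeginRenderPass(cmd, &begin, VK_SUBPASS_CONTENTS_INLINE);
    /* ... bind a pipeline and draw the producing work (subpass 0) ... */
    vkCmdNextSubpass(cmd, VK_SUBPASS_CONTENTS_INLINE);
    /* ... bind the consuming pipeline, which reads the input attachment,
     *     and draw (subpass 1); still no explicit synchronization ... */
    vkCmdEndRenderPass(cmd);
}
```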
Because each subpass includes information about what to do with its attachments, we can tell when an application is going to clear an attachment, or when it doesn’t care about the content of that attachment. This allows the driver to schedule clears far ahead of the real rendering work, or to intelligently decide which method to use to clear an attachment (such as a compute shader, fixed-function hardware or a DMA engine). If an application says that it doesn’t need an attachment to contain defined data, we can put the attachment into a partially compressed state: one where the data it contains is undefined but its state, as far as the hardware is concerned, is optimal for rendering.
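The hook for all of this is the attachment load op. The sketch below contrasts the three choices, with comments describing the freedom each one gives the driver; choose_load_op is a hypothetical helper, and the exact strategy a driver picks is, of course, implementation-specific:

```c
#include <vulkan/vulkan.h>

/* Hypothetical helper: fill in the load op for an attachment. A real
 * application picks exactly one; all three are shown here for contrast. */
void choose_load_op(VkAttachmentDescription *att, int choice)
{
    switch (choice) {
    case 0:
        /* The driver may schedule this clear far ahead of the real rendering
         * work and pick the cheapest method for it (a compute shader,
         * fixed-function hardware or a DMA engine). */
        att->loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
        break;
    case 1:
        /* The previous content matters and must be made visible (a tiler
         * pulls it into the tile buffer; an immediate renderer reads it
         * from memory). */
        att->loadOp = VK_ATTACHMENT_LOAD_OP_LOAD;
        break;
    default:
        /* The content is undefined: the driver can skip any load or clear
         * and, on our hardware, can start the attachment in a partially
         * compressed state that is optimal for rendering. */
        att->loadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
        break;
    }
}
```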
In some cases, the in-memory layout that is optimal for rendering differs from the one that is optimal for reading via the texture units. By analyzing the data dependencies that an application provides, our drivers can decide when it is best to perform layout changes, decompression, format conversion and so on. They can also split some of these operations into phases, interleaving them with application-supplied rendering work, which again eliminates pipeline bubbles and improves efficiency.
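In the API, the application never performs these transitions itself; it declares them. Each attachment carries an initialLayout and finalLayout, and each subpass names the layout it wants the attachment in while it runs. A hypothetical attachment that is rendered in one subpass and then read in the next might be declared like this, leaving the driver to decide where the actual work happens:

```c
#include <vulkan/vulkan.h>

/* The application only declares layouts; the driver decides when and how to
 * perform the actual transition (decompression, format conversion, relayout)
 * and may split it into phases interleaved with real rendering work. */
static const VkAttachmentDescription attachment = {
    .format         = VK_FORMAT_R8G8B8A8_UNORM,    /* illustrative */
    .samples        = VK_SAMPLE_COUNT_1_BIT,
    .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp        = VK_ATTACHMENT_STORE_OP_STORE,
    .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
    .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,   /* nothing to preserve */
    .finalLayout    = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL, /* sampled later */
};

/* One subpass writes the attachment in the layout that suits the render
 * backends... */
static const VkAttachmentReference as_color = {
    .attachment = 0,
    .layout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
};

/* ...and a later subpass reads it in the layout that suits the texture
 * units. The transition between the two happens wherever the driver likes. */
static const VkAttachmentReference as_input = {
    .attachment = 0,
    .layout     = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
};
```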
Finally, Vulkan includes the concept of transient attachments. These are framebuffer attachments that start a renderpass in an uninitialized or cleared state, are written by one or more subpasses, consumed by one or more subpasses, and are ultimately discarded at the end of the renderpass. In this scenario, the data in the attachments lives only within the renderpass and never needs to be written to main memory. Although we’ll still allocate memory for such an attachment, the data may never leave the GPU, instead living only in cache. This saves bandwidth, reduces latency and improves power efficiency.
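In practice, a transient attachment is expressed through a combination of image usage flags and memory properties. A sketch, with the format and extent purely illustrative:

```c
#include <vulkan/vulkan.h>

/* An image that only needs to exist for the duration of a renderpass.
 * TRANSIENT_ATTACHMENT promises that its data never needs to reach memory. */
static const VkImageCreateInfo transient_image = {
    .sType       = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .imageType   = VK_IMAGE_TYPE_2D,
    .format      = VK_FORMAT_D24_UNORM_S8_UINT,  /* e.g. a scratch depth buffer */
    .extent      = { 1920, 1080, 1 },
    .mipLevels   = 1,
    .arrayLayers = 1,
    .samples     = VK_SAMPLE_COUNT_1_BIT,
    .tiling      = VK_IMAGE_TILING_OPTIMAL,
    .usage       = VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT |
                   VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
};

/* When picking a memory type to back it, prefer one that is lazily
 * allocated: physical backing store may never be committed at all. */
static const VkMemoryPropertyFlags preferred_memory =
    VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT;

/* In the renderpass itself, pair this with loadOp = CLEAR or DONT_CARE and
 * storeOp = DONT_CARE, so the content is never loaded from or stored to
 * memory: it lives and dies on-chip. */
```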
A First-Class Feature
Renderpasses should not be seen as a “mobile-only” feature. They are a first-class feature of the Vulkan API and one which presents a lot of opportunities for optimization and efficiency on the GPU, even for forward, immediate renderers such as the GCN architecture. Our initial early-look drivers include a renderpass compiler which already performs some of the optimizations outlined above. We have a laundry list of experiments to perform, and we’ll be bringing more and more features online over the coming months. Simply combining a couple of passes into a single subpass probably won’t yield much improvement. However, getting as much of your frame as you can into as few renderpass objects as possible is potentially a huge win in both software and hardware performance.