It’s Wednesday, so we’re continuing with our series on Total War: Warhammer. Here’s Tamas Rabel again with some juicy details about how Creative Assembly brought Total War to DirectX® 12.

DirectX® 12

We knew it wouldn’t be a trivial task, so we started working on the DirectX 12 port quite early in development. Warhammer still uses the same architecture as Attila or Rome 2. Our device abstraction is just a very thin layer on top of the DirectX 11 interface, which meant that the DirectX 12 port basically sits on top of a DirectX 11 interface. We started out in this very naïve way, which then led to lots of valuable insight.

Barriers

Barriers are synchronization points. To understand them, we have to start from resources. Resources are buffers, textures, vertex and index buffers, constant buffers: basically blobs of memory which we can read and/or write. They can live on the GPU or on the CPU, or even in memory shared by both. They also come in read-only and write-only flavours.

So far this is quite intuitive. In DirectX 11 you specify the usage, which pretty much determines these factors. Then you forget about all this and assume that if you got the usage right, your job is done. In reality, resources have a much more dynamic life in the background. They constantly move through different logical states depending on how they are used by the application. Before DirectX 12 all of this was hidden by the drivers, but with DirectX 12 it’s the application’s responsibility to manage the state of resources. This means every change in the internal state of a resource is marked by a barrier. Barriers are the synchronization primitives that make sure a resource doesn’t change state before all the work related to its current state has finished.
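To make the idea of application-managed state a bit more concrete, here is a minimal sketch in plain C++ of a tracker that records a barrier only when a resource actually needs to change state. It is a model, not Direct3D code and not the Total War engine’s implementation: ResourceState, Barrier and StateTracker are hypothetical names, and the four states are simplified stand-ins for the real D3D12 resource states.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for D3D12 resource states.
enum class ResourceState { Common, CopyDest, ShaderResource, RenderTarget };

// One recorded transition, analogous in spirit to a transition barrier.
struct Barrier {
    std::uint32_t resourceId;
    ResourceState before;
    ResourceState after;
};

// Tracks the last known state of each resource and records a barrier
// only when the requested state differs from the current one.
class StateTracker {
public:
    void registerResource(std::uint32_t id, ResourceState initial) {
        states_[id] = initial;
    }

    // Returns true if a barrier had to be recorded.
    bool require(std::uint32_t id, ResourceState wanted) {
        auto it = states_.find(id);
        assert(it != states_.end() && "resource must be registered first");
        if (it->second == wanted) {
            return false;  // already in the right state, no barrier needed
        }
        barriers_.push_back({id, it->second, wanted});
        it->second = wanted;
        return true;
    }

    const std::vector<Barrier>& barriers() const { return barriers_; }

private:
    std::unordered_map<std::uint32_t, ResourceState> states_;
    std::vector<Barrier> barriers_;
};
```

Requesting the same state twice in a row records nothing the second time, which is the kind of redundancy a driver used to filter out for you on DirectX 11.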

If you start reading about moving to DirectX 12, one of the first things everyone mentions is barriers, and that’s not a coincidence. While exposing barriers to the application developer does have the potential to make your game run faster than on DirectX 11, they will make your game run much slower if not done right. The reason is that over the last few years, drivers have become pretty good at figuring out different use cases from the applications’ resource usage patterns.

Let’s take a look at a few examples of the most common causes for state changes and barriers in the Total War engine.

Total War games are well known for their scale of action. In a large battle there can be as many as thousands of units on the screen. This means thousands of unique instances, which means thousands of matrix stacks for skinning to upload. On DirectX 11, the code was simple: create some buffers with D3D11_USAGE_DYNAMIC and for each instance map/write data/unmap. Then the driver takes care of the rest.

Battle Scene Large Number Of Units

This is further complicated by the fact that we have no upper limit on the number of units in the game. If we run out of instance buffer space, we flush the current batch of primitives, reset the instance buffers and trackers, and keep accumulating instances.
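The flush-and-continue behaviour described above can be sketched as a small accumulator. This is only an illustration in plain C++: InstanceData, InstanceBatcher and the counters are hypothetical names, and the flush here just counts submissions where a real renderer would issue draw calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-instance payload; a real engine would store skinning matrices.
struct InstanceData {
    float transform[12];
};

// Accumulates instances into a fixed-capacity buffer and flushes the
// current batch whenever the buffer is full, so there is no upper limit
// on how many instances a frame can contain.
class InstanceBatcher {
public:
    explicit InstanceBatcher(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
        pending_.reserve(capacity);
    }

    // Adds one instance, flushing first if the buffer is already full.
    void add(const InstanceData& instance) {
        if (pending_.size() == capacity_) {
            flush();
        }
        pending_.push_back(instance);
    }

    // Submits whatever has accumulated (counted here) and resets the buffer.
    void flush() {
        if (pending_.empty()) {
            return;
        }
        ++flushCount_;
        submitted_ += pending_.size();
        pending_.clear();
    }

    std::size_t flushCount() const { return flushCount_; }
    std::size_t submitted() const { return submitted_; }
    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t capacity_;
    std::vector<InstanceData> pending_;
    std::size_t flushCount_ = 0;
    std::size_t submitted_ = 0;
};
```

With a capacity of 4, adding 10 instances flushes twice along the way and leaves 2 instances pending for the final flush. Those mid-frame flushes are what make it hard to schedule one big upload ahead of time.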

In our very first DirectX 12 implementation we tried to mimic this behavior. For each dynamic resource, we created a second buffer on the upload heap. On map/unmap we used the upload buffer, and then we triggered the upload before the draw call. The idea was to keep the data in fast GPU memory and only update it when needed. This approach had two problems:

  • If we start transferring the resources from CPU to GPU just before the draw call, it’s already too late. The GPU has nothing else to hide this latency behind, and we end up with bubbles in the pipeline. Our first attempt to remedy this was to start the transfer using the copy queue as soon as we finished preparing the instances. Unfortunately, as mentioned before, we sometimes need to split batches and flush mid-processing, which breaks this method. We also found that it’s not even a good solution when all the data fits into the buffers. This is because although the copy queue does run nicely in the background, it can take a really long time to spin up. For intra-frame purposes we always ended up copying on the queue which would use the data. And this leads to the second problem:
  • Too many barriers. To transfer data between the CPU and the GPU and then draw, we first transitioned the GPU resource to D3D12_RESOURCE_STATE_COPY_DEST, called CopyTextureRegion, then transitioned it back to D3D12_RESOURCE_STATE_GENERIC_READ, then called Draw. Apart from the fact that you should never use the GENERIC_READ state, this generated two barriers for every single draw call. That’s way too many. You should aim for roughly double the number of your render targets, plus maybe a handful of extras if you’re doing lots of compute.
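To put rough numbers on the second problem, the sketch below models the per-draw sequence as a list of commands and counts the barriers in it. It is plain C++ rather than Direct3D code; the Command enum and the function names are hypothetical, and the budget function simply encodes the rule of thumb from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model of the commands recorded for one draw on the naive path.
enum class Command { BarrierToCopyDest, Copy, BarrierToRead, Draw };

// Records the naive sequence: transition, copy, transition back, draw.
std::vector<Command> recordNaiveFrame(std::size_t drawCalls) {
    std::vector<Command> commands;
    commands.reserve(drawCalls * 4);
    for (std::size_t i = 0; i < drawCalls; ++i) {
        commands.push_back(Command::BarrierToCopyDest);
        commands.push_back(Command::Copy);
        commands.push_back(Command::BarrierToRead);
        commands.push_back(Command::Draw);
    }
    assert(commands.size() == drawCalls * 4);
    return commands;
}

// Counts how many of the recorded commands are barriers.
std::size_t countBarriers(const std::vector<Command>& commands) {
    std::size_t count = 0;
    for (Command c : commands) {
        if (c == Command::BarrierToCopyDest || c == Command::BarrierToRead) {
            ++count;
        }
    }
    return count;
}

// Rule of thumb from the text: about two barriers per render target,
// plus a handful of extras.
std::size_t roughBarrierBudget(std::size_t renderTargets, std::size_t extras) {
    return 2 * renderTargets + extras;
}
```

For an illustrative frame with 1000 draw calls the naive path records 2000 barriers, while the rule of thumb for, say, 8 render targets and 4 extras gives a budget of 20. Both of those input numbers are made up for the example.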

The solution to these problems was simply to read the upload heap directly from the GPU. We treat the whole upload heap as a NOOVERWRITE resource and create custom views on the upload heap for the draw calls. This way we managed to get rid of the majority of our barriers, and we traded the upload latency for slightly slower access speed, which is a much better fit for our use case. It’s worth noting that we only access each byte of the resources we place in the upload heap exactly once. If you access the same resource many times, however, a copy is worth considering.
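One way to picture the NOOVERWRITE treatment is a linear allocator over the mapped upload heap: every write gets a fresh, aligned region, nothing already handed out is touched again, and the whole heap is recycled only once the GPU is done with it. The following is a minimal sketch in plain C++ under those assumptions. UploadSlice and UploadHeapAllocator are hypothetical names, a std::vector stands in for the mapped heap memory, and the GPU base address is just a number passed in by the caller. The default alignment of 256 bytes matches D3D12’s constant buffer placement alignment, but it is only a default here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// A sub-range of the upload heap, roughly what a custom view would point at.
struct UploadSlice {
    std::uint64_t gpuAddress;  // hypothetical base address plus offset
    std::size_t size;
};

class UploadHeapAllocator {
public:
    UploadHeapAllocator(std::size_t capacity, std::uint64_t gpuBase)
        : storage_(capacity), gpuBase_(gpuBase) {}

    // Copies data into the next free, aligned region. Regions already
    // handed out are never overwritten until reset(). Returns false when
    // the heap is full, in which case nothing is written.
    bool write(const void* data, std::size_t size, UploadSlice& out,
               std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t offset = (head_ + alignment - 1) & ~(alignment - 1);
        if (offset > storage_.size() || size > storage_.size() - offset) {
            return false;
        }
        if (size > 0) {
            std::memcpy(storage_.data() + offset, data, size);
        }
        head_ = offset + size;
        out.gpuAddress = gpuBase_ + offset;
        out.size = size;
        return true;
    }

    // Call only once the GPU has finished reading this heap,
    // for example after waiting on a fence for the frame.
    void reset() { head_ = 0; }

private:
    std::vector<unsigned char> storage_;  // stand-in for mapped heap memory
    std::uint64_t gpuBase_;
    std::size_t head_ = 0;
};
```

Each draw call would then bind the address from its own slice, so no copy and no state transition is needed between writing the data and drawing with it. A failed write is the point at which a renderer in this style would have to flush or switch to another heap.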

If you have any questions, please feel free to comment or get in touch with Tamas on Twitter. Next time we’ll take a look at how we brought Asynchronous Compute to the Total War engine. 
