It’s Wednesday, so we’re continuing with our series on Total War: Warhammer. Here’s Tamas Rabel again with some juicy details about how Creative Assembly brought Total War to DirectX® 12.
We knew it wouldn’t be a trivial task, so we started working on the DirectX 12 port quite early in development. Warhammer still uses the same architecture as Attila or Rome 2. Our device abstraction is just a very thin layer on top of the DirectX 11 interface, which meant that the DirectX 12 port basically sat on top of a DirectX 11 interface. We started out in this very naïve way, which then led to lots of valuable insight.
Barriers are synchronization points. To understand them, we have to start from resources. Resources are buffers, textures, vertex and index buffers, constant buffers: basically blobs of memory which we can read and/or write. They can live on the GPU or on the CPU, or even in memory shared by both. They also come in read-only and write-only flavours.
So far this is quite intuitive. In DirectX 11 you specify the usage, which pretty much determines these factors. Then you forget about all this and assume that if you got the usage right, your job is done. In reality, resources have a much more dynamic life in the background. They constantly move through different logical states depending on how the application uses them. Before DirectX 12 all of this was hidden by the drivers, but with DirectX 12 it’s the application’s responsibility to manage the state of resources. This means every change in the internal state of a resource is marked by a barrier. Barriers are the synchronization primitives that make sure a resource doesn’t change state before all the work related to its current state has finished.
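To make the idea of application-managed state concrete, here is a minimal sketch in plain C++. It is a model, not D3D12 code: `ResourceState`, `Resource`, `Barrier` and `CommandListModel` are hypothetical stand-ins invented for this illustration (in the real API the transitions would be `D3D12_RESOURCE_BARRIER` entries passed to `ID3D12GraphicsCommandList::ResourceBarrier`). It only shows the bookkeeping: a barrier is recorded when a resource actually changes state, and redundant transitions are skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the D3D12 types.
enum class ResourceState {
    Common,
    CopyDest,
    VertexAndConstantBuffer,
    RenderTarget,
    PixelShaderResource
};

struct Resource {
    ResourceState state = ResourceState::Common;  // state tracked by the application
};

struct Barrier {
    Resource* resource;
    ResourceState before;
    ResourceState after;
};

// Records the barriers the application would have to issue itself.
class CommandListModel {
public:
    // Emits a barrier only when the state actually changes.
    void transition(Resource& r, ResourceState after) {
        if (r.state == after) {
            return;  // redundant transition: no barrier needed
        }
        barriers_.push_back({&r, r.state, after});
        r.state = after;
    }

    std::size_t barrier_count() const { return barriers_.size(); }
    const std::vector<Barrier>& barriers() const { return barriers_; }

private:
    std::vector<Barrier> barriers_;
};
```

Transitioning a render target to `RenderTarget` twice and then to `PixelShaderResource` records two barriers in this model, not three, which is the kind of filtering the driver used to do behind the scenes.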
If you start reading about moving to DirectX 12, one of the first things everyone mentions is barriers, and that’s not a coincidence. While exposing barriers to the application developer does have the potential to make your game run faster than on DirectX 11, they will make your game run much slower if not done right. The reason is that over the last few years drivers have become pretty good at figuring out different use cases from the applications’ resource usage patterns.
Let’s take a look at a few examples of the most common causes for state changes and barriers in the Total War engine.
Total War games are well known for their scale of action. In a large battle there can be thousands of units on the screen. This means thousands of unique instances, which means thousands of matrix stacks for skinning to upload. On DirectX 11, the code was simple: create some dynamic buffers and, for each instance, map/write data/unmap. Then the driver takes care of the rest.
This is further complicated by the fact that we have no upper limit on the number of units in the game. If we run out of instance buffer space, we flush the current batch of primitives, reset the instance buffers and trackers, and keep accumulating instances.
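The flush-and-continue behaviour described above can be sketched roughly as follows. This is a plain C++ model under stated assumptions, not the engine's code: `InstanceData`, `InstanceBatcher` and the `on_flush` callback are hypothetical names, the 12-float layout is made up for illustration, and the callback stands in for "submit the draw calls for the current batch".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical per-instance payload; the real layout is not given in the text.
struct InstanceData {
    float skinning_matrix[12];
};

// Accumulates instances into a fixed-capacity buffer. When the buffer is full,
// the current batch is flushed and accumulation continues from an empty buffer.
class InstanceBatcher {
public:
    using FlushFn = std::function<void(const std::vector<InstanceData>&)>;

    InstanceBatcher(std::size_t capacity, FlushFn on_flush)
        : capacity_(capacity), on_flush_(std::move(on_flush)) {
        assert(capacity_ > 0);
        pending_.reserve(capacity_);
    }

    void add(const InstanceData& instance) {
        if (pending_.size() == capacity_) {
            flush();  // out of instance buffer space: submit what we have
        }
        pending_.push_back(instance);
    }

    // Submits the pending batch (if any) and resets the buffer.
    void flush() {
        if (pending_.empty()) {
            return;
        }
        on_flush_(pending_);
        pending_.clear();
    }

private:
    std::size_t capacity_;
    FlushFn on_flush_;
    std::vector<InstanceData> pending_;
};
```

With a capacity of 4 and 10 instances added, followed by a final `flush()`, the callback sees batches of 4, 4 and 2 instances. The mid-frame flushes are what make it hard to know in advance when all instance data for a frame is ready.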
In our very first DirectX 12 implementation we tried to mimic this behavior. For each dynamic resource, we created a second buffer on the upload heap. On map/unmap we used the upload buffer, and then we triggered the upload before the draw call. The idea was to keep the data in fast GPU memory and only update it when needed. This approach had two problems:
- If we start transferring the resources from CPU to GPU just before the draw call, it’s already too late. The GPU has nothing else to hide this latency and we end up with bubbles in the pipeline. Our first attempt to remedy this was to start the transfer using the copy queue as soon as we finished preparing the instances. Unfortunately, as mentioned before, we sometimes need to split batches and flush mid-processing, which breaks this method. We also found that it’s not even a good solution when all the data fits into the buffers. This is because, although the copy queue does run nicely in the background, it can take a really long time to spin up. For intra-frame purposes we always ended up copying on the queue which would use the data. And this leads to the second problem:
- Too many barriers. To transfer data between the CPU and the GPU and then draw, we first transitioned the GPU resource to a copy-destination state, called CopyTextureRegion, then transitioned it back to D3D12_RESOURCE_STATE_GENERIC_READ, then called Draw. Apart from the fact that you should never use the GENERIC_READ state, this generated two barriers for every single draw call. That’s way too many. You should aim for roughly double the number of your render targets, plus maybe a handful of extras if you’re doing lots of compute.
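To put the second problem in numbers, the following small C++ sketch compares the barrier count of the copy-per-draw path with the rule of thumb above. The function names and the example figures (5000 draw calls, 10 render targets, 5 extras) are assumptions chosen purely for illustration; they are not measurements from the game.

```cpp
#include <cassert>
#include <cstddef>

// Barriers issued by the naive path: one transition into the copy-destination
// state and one transition back to a readable state around every draw call.
constexpr std::size_t naive_barrier_count(std::size_t draw_calls) {
    return 2 * draw_calls;
}

// Rule of thumb from the text: roughly two barriers per render target,
// plus a handful of extras when doing lots of compute.
constexpr std::size_t barrier_budget(std::size_t render_targets, std::size_t extras) {
    return 2 * render_targets + extras;
}

// Illustrative figures only (assumed, not measured).
static_assert(naive_barrier_count(5000) == 10000, "two barriers per draw call");
static_assert(barrier_budget(10, 5) == 25, "two per render target plus extras");
```

Under these assumed figures the naive path issues 10000 barriers per frame against a budget of about 25, so the per-draw pattern overshoots the target by several orders of magnitude.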
The solution to these problems was simply to read the upload heap directly from the GPU. We treat the whole upload heap just as a resource and create custom views on it for the draw calls. This way we managed to get rid of the majority of our barriers, and we traded the upload latency for slightly slower access speed, which is a much better trade-off for our use case. It’s worth noting that we only access each byte of the resources we place in the upload heap exactly once. If you access the same resource many times, however, it’s worth considering a copy.
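One way to picture "custom views on the upload heap" is a linear sub-allocator that hands out offsets into one large, CPU-writable block. The sketch below is a plain C++ model under stated assumptions: `UploadHeapModel` is a hypothetical class, and a `std::vector` stands in for the mapped upload heap memory. The 256-byte alignment mirrors D3D12's constant buffer placement alignment (`D3D12_CONSTANT_BUFFER_DATA_PLACEMENT_ALIGNMENT`); in real code each returned offset would be added to the heap resource's GPU virtual address to build the view for a draw call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Plain-memory stand-in for a persistently mapped upload heap (assumption).
class UploadHeapModel {
public:
    // 256 bytes matches D3D12's constant buffer placement alignment.
    static constexpr std::size_t kAlignment = 256;

    explicit UploadHeapModel(std::size_t size_bytes) : memory_(size_bytes) {}

    // Copies `size` bytes into the heap and returns the offset a view would
    // point at, or std::nullopt when the heap is full and the caller must
    // flush the current batch before continuing.
    std::optional<std::size_t> write(const void* data, std::size_t size) {
        const std::size_t offset = align_up(head_);
        if (offset > memory_.size() || size > memory_.size() - offset) {
            return std::nullopt;
        }
        std::memcpy(memory_.data() + offset, data, size);
        head_ = offset + size;
        return offset;
    }

    // To be called only once the GPU has finished reading the written data.
    void reset() { head_ = 0; }

    const std::uint8_t* data() const { return memory_.data(); }

private:
    static std::size_t align_up(std::size_t value) {
        return (value + kAlignment - 1) & ~(kAlignment - 1);
    }

    std::vector<std::uint8_t> memory_;
    std::size_t head_ = 0;
};
```

Because the data is written once by the CPU and read in place by the GPU, no copy and no state transition appears per draw call in this scheme; the cost moves to the slower reads mentioned above.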
If you have any questions, please feel free to comment or get in touch with Tamas on Twitter. Next time we’ll take a look at how we brought Asynchronous Compute to the Total War engine.