Anatomy Of The Total War Engine Part IV

Originally posted: August 16, 2016

Tamas Rabel

Once we got our barriers right and started to hit performance close to DirectX® 11, it was time to switch gears. Both APIs drive the same hardware, so if you only do rendering that could have been done in DirectX® 11, there will be little difference in GPU performance. In that scenario what counts is how you build your commands, and the performance win from DirectX® 12 comes from the CPU. Under the hood, your command buffers will be very similar in both APIs, as you are rendering the same content after all. DirectX® 12 does unlock one further option, however: the use of multiple queues. The DirectX® 11 driver doesn’t know the structure of the frame, so it cannot decide to use multiple queues automatically. On DirectX® 12 we have direct control over the different queues, so we can.

So we started looking for parts of the pipeline that could be moved to compute shaders and run in parallel with some geometry processing. The main thing we needed to consider was the dependencies between the different parts of the pipeline. For a quick overview of the pipeline, please refer to the diagram in the first part of this blog series. In a nutshell, we identified the following three candidates:

Particle sorting and simulation
The only inputs to the particle simulation are the list of new particles to be spawned, which comes from the CPU side, and the previous frame’s particle buffer. The new particle buffer is then used quite late in the frame, after the rigid pipeline has finished rendering and lighting the g-buffer. As there are no intra-frame input dependencies, we can kick off the particle simulation on the compute queue as the very first thing in the frame, and it will be finished by the time we get to the point where we use the results. This gave roughly a 0.6ms boost on the Radeon™ Fury X at 4K resolution.
SSAO
The only input to the ambient occlusion pass is the depth buffer of the main viewport with all the opaque objects. The result of the SSAO pass is then used by the deferred lighting shaders later in the frame. We realized that rendering shadows fits nicely between the two: the shadow pass has no intra-frame input dependencies, and its results are also used in the deferred lighting pass. Moreover, SSAO is a memory-bandwidth-heavy compute pass, while the shadow pass is mostly geometry work, which makes them a perfect combination. As a result we managed to hide almost the full cost of the SSAO pass. This was roughly a 1.4ms win on the Radeon Fury X at 4K resolution.
Screen-Space Reflections (SSR)
Screen-space reflections (SSR) was the third pass we identified as a candidate. It comes after the lighting pass and feeds into the VFX combine pass. Interestingly, moving SSR to asynchronous compute didn’t give any benefit at all. After further analysis we found that SSR is a very demanding shader: it uses a lot of VGPRs and a lot of memory bandwidth. You have to remember that although your compute shader runs in parallel with the geometry pass, both are using the same hardware and the same resources after all. If one queue saturates them, there is hardly any room left for the other queue, and the two will end up taking the same time as running sequentially, or even longer in certain cases.
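To make the dependency reasoning above concrete, here is a minimal sketch of a frame timeline in Python. It is a toy model, not engine code: the `Pass` class, `frame_time_us`, `build_frame`, the pass names and every duration are assumptions made up for illustration. The particle and SSAO costs are set to 600 and 1400 microseconds only so that, in this idealized model, the time saved lines up with the figures quoted above. Each queue runs its passes back to back, and a pass also waits for all of its dependencies, which is what a fence wait between queues would enforce. The model ignores contention for shared hardware, so it cannot reproduce the SSR result.

```python
from dataclasses import dataclass, field


@dataclass
class Pass:
    name: str
    duration_us: int   # hypothetical cost in microseconds, not a measurement
    queue: str         # "graphics" or "compute"
    deps: list = field(default_factory=list)  # names of passes that must finish first


def frame_time_us(passes):
    """Return (frame time, finish time per pass) for passes in submission order."""
    queue_free = {}  # queue name -> time at which that queue becomes idle
    finish = {}      # pass name -> time at which that pass completes
    for p in passes:
        start = queue_free.get(p.queue, 0)
        for dep in p.deps:
            start = max(start, finish[dep])  # cross-queue wait on a dependency
        finish[p.name] = start + p.duration_us
        queue_free[p.queue] = finish[p.name]
    return max(finish.values()), finish


def build_frame(async_compute):
    # With async_compute off, everything is submitted to the graphics queue.
    q = "compute" if async_compute else "graphics"
    return [
        Pass("particles", 600, q),                    # no intra-frame inputs
        Pass("gbuffer", 3000, "graphics"),
        Pass("ssao", 1400, q, deps=["gbuffer"]),      # needs the depth buffer
        Pass("shadows", 2000, "graphics"),            # no intra-frame inputs
        Pass("lighting", 2500, "graphics", deps=["gbuffer", "ssao", "shadows"]),
        Pass("particle_render", 1000, "graphics", deps=["lighting", "particles"]),
    ]


serial, _ = frame_time_us(build_frame(async_compute=False))
overlapped, _ = frame_time_us(build_frame(async_compute=True))
print(serial, overlapped, serial - overlapped)  # 10500 8500 2000
```

In this sketch the particle simulation finishes long before anything consumes it, and SSAO runs entirely underneath the shadow pass, so both costs disappear from the critical path.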

The main takeaway from these is that asynchronous compute is a really great tool. It is a relatively easy way to squeeze some additional performance out of DirectX 12: we gained an extra 2ms with our particle simulation and SSAO. However, it’s not magic and can lead to some surprising results in certain cases. For the best results you have to find the best possible match of blocks in your pipeline and be prepared to experiment a little.
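One rough way to think about which blocks make a good match is a contention model like the Python sketch below. Everything in it is an assumption for illustration: the function `overlapped_time_ms`, the resource names, the utilization fractions and the timings are invented, and real hardware does not share resources this neatly. The model says that when two overlapped passes together ask for more than 100% of any shared resource, both slow down by that oversubscription factor until the shorter one retires. It does not model register pressure or the "even longer" case.

```python
def overlapped_time_ms(a, b):
    """Estimate the time to run two passes concurrently on one GPU.

    Each pass is (standalone_ms, {resource: fraction of that resource the
    pass keeps busy when it runs alone}). All numbers are hypothetical.
    """
    (time_a, use_a), (time_b, use_b) = a, b
    # Worst oversubscription across shared resources; 1.0 means both fit.
    slowdown = 1.0
    for resource in set(use_a) | set(use_b):
        demand = use_a.get(resource, 0.0) + use_b.get(resource, 0.0)
        slowdown = max(slowdown, demand)
    shorter, longer = sorted((time_a, time_b))
    # Both passes progress at 1/slowdown until the shorter one retires,
    # then the remainder of the longer one runs alone at full speed.
    return shorter * slowdown + (longer - shorter)


# Complementary pair: bandwidth-heavy compute next to geometry-heavy rendering.
ssao = (1.4, {"bandwidth": 0.75})
shadows = (2.0, {"bandwidth": 0.25, "geometry": 0.9})
print(round(overlapped_time_ms(ssao, shadows), 2))  # 2.0, versus 3.4 back to back

# Competing pair: both passes saturate the same resource.
ssr = (2.0, {"bandwidth": 1.0})
neighbour = (1.5, {"bandwidth": 1.0})
print(round(overlapped_time_ms(ssr, neighbour), 2))  # 3.5, same as back to back
```

With the complementary pair the shorter pass is fully hidden, while the competing pair gains nothing from running on two queues, which is the pattern described for SSAO and SSR respectively.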

If you have any questions, please feel free to comment or get in touch with Tamas on Twitter. Next time we’ll take a look at explicit multi-GPU in DirectX 12.

Tamas Rabel is lead graphics programmer at Creative Assembly, working on the Total War games.