Anatomy Of The Total War Engine: Part IV

Once we got our barriers right and started to hit performance close to DirectX® 11, it was time to switch gears. Since both APIs drive the same hardware, if you only do rendering that could have been done in DirectX® 11, there will be little difference in GPU performance. In this scenario, what counts is how you build your commands, and the performance win you get from DirectX® 12 comes from the CPU. Under the hood, your command buffers will be very similar in both APIs, as you are rendering the same content after all. DirectX® 12 does unlock one option, however: the use of multiple queues. Because the DirectX® 11 driver doesn’t know the structure of the frame, it cannot decide to use multiple queues automatically. In DirectX® 12 we have direct control over the different queues, so we can.

So we started looking for parts of the pipeline which could be moved to compute shaders and run in parallel with some geometry processing. The main thing we needed to consider was the dependencies between the different parts of the pipeline. For a quick overview of the pipeline, please refer to the diagram in the first part of this blog post. In a nutshell, we identified the following three candidates:

  • Particle sorting and simulation
    The only inputs to the particle simulation are the list of new particles to be spawned, which comes from the CPU side, and the previous frame’s particle buffer. The new particle buffer is then used quite late in the frame, after the rigid pipeline has finished rendering and lighting the g-buffer. As we have no intra-frame input dependencies, we can kick off the particle simulation on the compute queue as the very first thing in the frame, and it will be finished by the time we get to the point where we use the results. This gave roughly a 0.6ms boost on the Radeon™ Fury X at 4K resolution.
  • SSAO
    The only input we give to the ambient occlusion pass is the depth buffer of the main viewport with all the opaque objects. The result of the SSAO pass is then used by the deferred lighting shaders later in the frame. We realized that rendering shadows fits nicely between the two. The shadow pass has no intra-frame input dependencies and its results are also used in the deferred lighting pass. Moreover, SSAO is a memory-bandwidth-heavy compute pass, while the shadow pass is mostly geometry work, which makes them a perfect combination. As a result, we managed to almost completely hide the full cost of the SSAO pass. This was roughly a 1.4ms win on the Radeon™ Fury X at 4K resolution.
  • Screen-Space Reflections (SSR)
    Screen-space reflections (SSR) was the third pass we identified as a candidate. It comes after the lighting pass and feeds into the VFX combine pass. Interestingly, moving SSR to asynchronous compute didn’t give any benefit at all. After further analysis we found that SSR is a very demanding shader: it uses a lot of VGPRs and a lot of memory bandwidth. You have to remember that although your compute shader is running in parallel with the geometry pass, they are using the same hardware and the same resources after all. If one queue is saturating them, there is hardly any room for the other queue, and the two will end up taking the same time as running sequentially, or even longer in certain cases.
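To make the dependency reasoning in the list above concrete, here is a minimal sketch of a two-queue timeline model in C++. It is not code from the Total War engine and it does not call DirectX 12. The `Pass` struct, the `frame_time_ms` and `example_frame` functions, and every duration are hypothetical and chosen only for illustration. The model assumes that passes on one queue run in submission order, that a pass starts once its queue is free and its dependencies have finished, and that the two queues never compete for hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Queue { Graphics, Compute };

// Hypothetical pass description; durations are made up, not measurements.
struct Pass {
    std::string name;
    Queue queue;
    double duration_ms;
    std::vector<int> depends_on; // indices of earlier passes that must finish first
};

// Returns the time at which the last pass finishes.
// Ignores any contention between the two queues for shared GPU resources.
double frame_time_ms(const std::vector<Pass>& passes) {
    double graphics_free = 0.0, compute_free = 0.0, frame_end = 0.0;
    std::vector<double> finish(passes.size(), 0.0);
    for (std::size_t i = 0; i < passes.size(); ++i) {
        const Pass& p = passes[i];
        double& queue_free =
            (p.queue == Queue::Graphics) ? graphics_free : compute_free;
        double start = queue_free;
        for (int d : p.depends_on) {
            // A pass may only depend on passes submitted before it.
            assert(d >= 0 && static_cast<std::size_t>(d) < i);
            start = std::max(start, finish[d]);
        }
        finish[i] = start + p.duration_ms;
        queue_free = finish[i];
        frame_end = std::max(frame_end, finish[i]);
    }
    return frame_end;
}

// Builds an illustrative frame. With use_async_compute == false everything is
// submitted to the graphics queue; otherwise the particle simulation and SSAO
// go to the compute queue.
std::vector<Pass> example_frame(bool use_async_compute) {
    const Queue side = use_async_compute ? Queue::Compute : Queue::Graphics;
    return {
        {"particle simulation", side,            1.0, {}},        // 0: no intra-frame inputs
        {"g-buffer",            Queue::Graphics, 4.0, {}},        // 1
        {"SSAO",                side,            1.5, {1}},       // 2: needs the depth buffer
        {"shadows",             Queue::Graphics, 2.0, {}},        // 3: no intra-frame inputs
        {"deferred lighting",   Queue::Graphics, 3.0, {1, 2, 3}}, // 4
        {"particle rendering",  Queue::Graphics, 1.0, {0, 4}},    // 5
    };
}
```

With these made-up durations, `frame_time_ms(example_frame(false))` gives 12.5 ms and `frame_time_ms(example_frame(true))` gives 10.0 ms: the simulation and SSAO disappear behind the g-buffer and shadow passes. A real GPU shares its execution units and bandwidth between the queues, so this model is an upper bound on what overlap can hide.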

Terror Geist

The main takeaway is that asynchronous compute is a really great tool. It is relatively easy to squeeze some additional performance out of DirectX 12: we gained roughly 2ms by moving our particle simulation and SSAO to the compute queue. However, it’s not magic and can lead to some surprising results in certain cases. For the best results you have to try to find the best possible match of blocks in your pipeline and be prepared to experiment a little.
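Why some pairings work and others don't can be illustrated with a toy contention model in C++. This is a sketch under stated assumptions, not a description of how the hardware schedules work. The `PassLoad` struct, its two resource fractions and all numbers are hypothetical. The model assumes each pass keeps a fixed share of memory bandwidth and shader resources busy while it runs, and that two overlapped passes can finish no sooner than the longer pass alone or the total work queued on the busiest shared resource.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical load description: time when run alone, plus the fraction of
// each shared resource the pass keeps busy during that time.
struct PassLoad {
    double alone_ms;
    double bandwidth_use; // 0..1 share of memory bandwidth
    double shader_use;    // 0..1 share of shader units and registers
};

bool valid(const PassLoad& p) {
    return p.alone_ms >= 0.0 &&
           p.bandwidth_use >= 0.0 && p.bandwidth_use <= 1.0 &&
           p.shader_use >= 0.0 && p.shader_use <= 1.0;
}

// Time when the two passes run one after the other on a single queue.
double serial_ms(const PassLoad& a, const PassLoad& b) {
    return a.alone_ms + b.alone_ms;
}

// Optimistic lower bound on the time when the two passes overlap.
// Each resource can deliver one "unit" of work per millisecond, so the
// overlapped time cannot be shorter than the work queued on any resource.
double overlapped_ms(const PassLoad& a, const PassLoad& b) {
    assert(valid(a) && valid(b));
    const double bandwidth_work =
        a.alone_ms * a.bandwidth_use + b.alone_ms * b.bandwidth_use;
    const double shader_work =
        a.alone_ms * a.shader_use + b.alone_ms * b.shader_use;
    return std::max({a.alone_ms, b.alone_ms, bandwidth_work, shader_work});
}
```

For example, an SSAO-like pass `{1.5, 0.9, 0.3}` overlapped with a shadow-like pass `{2.0, 0.2, 0.2}` is bounded by 2.0 ms against 3.5 ms in sequence, while an SSR-like pass `{2.0, 0.95, 0.95}` paired with another heavy pass `{2.0, 0.9, 0.5}` is bounded by about 3.7 ms against 4.0 ms. The bound is optimistic: it cannot reproduce the case where overlapping ends up slower than running sequentially.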

If you have any questions, please feel free to comment or get in touch with Tamas on Twitter. Next time we’ll take a look at explicit multi-GPU in DirectX 12.

Other posts in this series

Anatomy Of The Total War Engine: Part I

Tamas Rabel, Lead Graphics Programmer on the Total War series provides a detailed look at the Total War renderer as well as digging deep into some of the optimizations that the team at Creative Assembly did for the brilliant, Total War: Warhammer.

Tamas Rabel
Tamas Rabel is lead graphics programmer at Creative Assembly, working on the Total War games. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
