Once we got our barriers right and we started to hit performance close to DirectX® 11, it was time to switch gears. Since both APIs are driving the same hardware, if you only do rendering that could have been done in DirectX® 11 then there will be little difference in the GPU’s performance. In this scenario, it’s about how you build your commands that counts and the performance win you get from DirectX® 12 comes from the CPU. Under the hood, your command buffers will be very similar in both APIs as you are going to render the same content after all. There is an option unlocked by DirectX® 12 however, and that is the use of multiple queues. Since the DirectX® 11 driver doesn’t know about the structure of the frame it cannot make decisions about using multiple queues automatically, but since we have direct control over the different queues on DirectX® 12, we can.
So we started looking for parts of the pipeline which could be moved to compute shaders and could be run parallel to some geometry processing. The main things we needed to consider are the dependencies between the different parts of the pipeline. For a quick overview of the pipeline, please refer to the diagram in the first part of this blog post. In a nutshell, we identified the following three sites:
- Particle sorting and simulation
The only inputs to the particle simulation are the list of new particles to be spawned, which is coming from the CPU side and the previous frame’s particle buffer. Then the new particle buffer is used quite late in the frame, after the rigid pipeline finished rendering and lighting the g-buffer. As we have no intra-frame input dependencies, we can kick the particle simulation off the very first thing in the frame on the compute queue and it will be finished by the time we get to point where we use the results. This gave roughly a 0.6ms boost on the Radeon™ Fury X at 4K resolution.
The only input we give to the ambient occlusion pass is the depth buffer of the main viewport with all the opaque objects. The result of the SSAO pass will be then used by the deferred lighting shaders later in the frame. We realized that rendering shadows fits nicely between the two. The shadow pass has no intra-frame input dependencies and the results are also used in the deferred lighting pass. Moreover, the SSAO process is a memory bandwidth heavy compute pass, while the shadow pass is mostly geometry only, which is the perfect combination. As a result we managed to almost completely hide the full cost of the SSAO pass. This was roughly a 1.4ms win on the Radeon Fury X at 4K resolution.
- Screen-Space Reflections (SSR)
Screen-space reflections (SSR) was the third pass we identified as a candidate. It comes after the lighting pass and then feeds into the VFX combine pass. Interestingly, moving SSR to asynchronous compute didn’t give any benefit at all. After further analysis we found that SSR is a very demanding shader. It is using a lot of VGPRs and lots of memory bandwidth. You have to remember, that although your compute shader is running in parallel with the geometry pass, they are using the same hardware and the same resources after all. If one queue is saturating them, there is hardly space for the other queue and they will just end up taking the same time as running sequential or even longer in certain cases.
The main takeaway from these is that asynchronous compute is a really great tool. It is relatively easy to squeeze some additional performance out from DirectX 12, we got an extra 2ms of performance with our particle simulation and SSAO. However, it’s not magic and can lead to some surprising results in certain cases. For the best results you have to try to find the best possible match of blocks in your pipeline and be prepared to experiment a little.
If you have any questions, please feel free to comment or get in touch with Tamas on Twitter. Next time we’ll take a look at explicit Multiple GPU in DirectX 12.
Other posts in this series
Tamas Rabel, Lead Graphics Programmer on the Total War series provides a detailed look at the Total War renderer as well as digging deep into some of the optimizations that the team at Creative Assembly did for the brilliant, Total War: Warhammer.
Tamas Rabel from Creative Assembly discusses how performance was measured with the Total War Engine.
Here’s Tamas Rabel again with some juicy details about how Creative Assembly brought Total War to DirectX® 12.
The final instalment in Tamas Rabel’s insight into developing the Total War engine looks at Multi-GPU.