Optimized Reversible Tonemapper for Resolve

A typical problem with MSAA resolve combined with HDR is that a single sample with a large HDR value can overpower all other samples, resulting in a loss of perceived anti-aliasing on an edge. One way to fix this problem involves using an impractically large filter kernel combined with an impractically large number of samples per pixel.
A real-time workaround is to instead accept a bias in the resolve and reduce the weighting of samples as a function of how bright they are, causing bright edges to visually erode instead of expand (as would happen with an unbiased resolve). This process is equivalent to tonemapping prior to resolve and then reversing that tonemap after the resolve.
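To make the equivalence concrete, here is a small CPU-side sketch in Python (not shader code) using single-channel samples and the x / (1 + x) curve. The sample values and function names are illustrative assumptions. It compares a plain average, a tonemap/resolve/invert pass, and a brightness-weighted average with weights 1 / (1 + x).

```python
# Scalar (single-channel) sketch; sample values are made up for illustration.

def tonemap(x):
    # Compress [0, inf) into [0, 1).
    return x / (1.0 + x)

def tonemap_invert(y):
    # Exact inverse of tonemap() for y in [0, 1).
    return y / (1.0 - y)

# One very bright sample on an edge, three dim ones.
samples = [100.0, 0.1, 0.1, 0.1]

# Unbiased box resolve: the bright sample dominates.
plain = sum(samples) / len(samples)

# Tonemap each sample, box resolve, then invert the tonemap.
resolved = tonemap_invert(sum(tonemap(s) for s in samples) / len(samples))

# Same result expressed as a weighted average with weights 1 / (1 + x).
weights = [1.0 / (1.0 + s) for s in samples]
weighted = sum(w * s for w, s in zip(weights, samples)) / sum(weights)

print(plain, resolved, weighted)
```

In this single-channel case the last two values agree to rounding error, and both sit far below the plain average, which is the "erode instead of expand" behavior described above.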
This post presents an optimized, modified form of the technique Brian Karis [Epic] talks about on Graphics Rants: Tone mapping. The core change is to replace the luma computation with max3(red,green,blue). The luma-based tonemapper weights samples differently depending on color hue; the max3-based tonemapper does not, which removes the hue shift on mixed-color edges of similar value.
The max3() operation on all versions of GCN maps to a single instruction, v_max3_f32. The GCN3 instruction set used by Fiji is covered in AMD's published GCN3 instruction set documentation. The driver-side AMD DX shader compiler will automatically transform max(x, max(y, z)) into max3(x, y, z). This functionality, as well as min3() and mid3(), is also exposed explicitly in GLSL via the following extension: AMD_shader_trinary_minmax.
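The hue dependence can be illustrated with a short Python sketch. It assumes the luma-based weight has the form 1 / (1 + luma) and uses Rec. 709 luma coefficients; both are assumptions made for illustration, as are the sample colors and helper names. The resolve is written as a normalized weighted average of two samples.

```python
# Assumed Rec. 709 luma coefficients (illustrative choice).
REC709 = (0.2126, 0.7152, 0.0722)

def luma_weight(c):
    # Assumed form of the luma-based sample weight.
    luma = sum(k * v for k, v in zip(REC709, c))
    return 1.0 / (1.0 + luma)

def max3_weight(c):
    # Weight used by the max3-based tonemapper.
    return 1.0 / (1.0 + max(c))

def resolve(samples, weight):
    # Normalized weighted average of RGB samples.
    ws = [weight(c) for c in samples]
    total = sum(ws)
    return tuple(sum(w * c[i] for w, c in zip(ws, samples)) / total
                 for i in range(3))

# An edge between two colors of equal value but different hue.
green = (0.0, 4.0, 0.0)
blue = (0.0, 0.0, 4.0)

luma_result = resolve([green, blue], luma_weight)
max3_result = resolve([green, blue], max3_weight)

print(luma_result)  # blue dominates: green has much higher luma, so a lower weight
print(max3_result)  # green and blue contribute equally
```

With max3 both samples receive the same weight, so the resolved edge pixel is an even mix; with the luma weight the result leans strongly toward blue.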
Here is an HLSL implementation of the tonemapper and inverse.

float max3(float x, float y, float z) { return max(x, max(y, z)); }

// Apply this to tonemap linear HDR color "c" after a sample is fetched in the resolve.
// Note "c" 1.0 maps to the expected limit of low-dynamic-range monitor output.
float3 Tonemap(float3 c) { return c * rcp(max3(c.r, c.g, c.b) + 1.0); }

// When the filter kernel is a weighted sum of fetched colors,
// it is more optimal to fold the weighting into the tonemap operation.
float3 TonemapWithWeight(float3 c, float w) { return c * (w * rcp(max3(c.r, c.g, c.b) + 1.0)); }

// Apply this to restore the linear HDR color before writing out the result of the resolve.
float3 TonemapInvert(float3 c) { return c * rcp(1.0 - max3(c.r, c.g, c.b)); }
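
As a sanity check on the pair of functions above, here is a CPU-side Python mirror (a sketch, not the shader itself; the test colors and the round_trip_error helper are illustrative assumptions). For non-negative input with m = max3(c), the tonemapped maximum is m / (1 + m), so 1 - m / (1 + m) = 1 / (1 + m) and the inverse scale factor is exactly 1 + m.

```python
import math

def tonemap(c):
    # Mirrors Tonemap(): scale by 1 / (max3 + 1).
    s = 1.0 / (max(c) + 1.0)
    return tuple(v * s for v in c)

def tonemap_invert(c):
    # Mirrors TonemapInvert(): scale by 1 / (1 - max3).
    s = 1.0 / (1.0 - max(c))
    return tuple(v * s for v in c)

def round_trip_error(c):
    # Largest absolute per-channel difference after tonemap then invert.
    r = tonemap_invert(tonemap(c))
    return max(math.fabs(a - b) for a, b in zip(c, r))

# Non-negative linear HDR colors chosen for illustration.
colors = [(0.0, 0.0, 0.0), (0.25, 0.5, 0.75), (1.0, 1.0, 1.0), (3.0, 0.5, 120.0)]
for c in colors:
    print(c, round_trip_error(c))
```

The sketch assumes non-negative color values; a tonemapped maximum of 1.0 or more would make the inverse divide by zero or flip sign.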

And a GLSL Shadertoy example: https://www.shadertoy.com/view/Xdd3Rr
Here is an example of using the above functions in a low-quality 4xMSAA box filter resolve:

return TonemapInvert(
  TonemapWithWeight(sample0, 0.25) +
  TonemapWithWeight(sample1, 0.25) +
  TonemapWithWeight(sample2, 0.25) +
  TonemapWithWeight(sample3, 0.25));

Here is another example, this time a full HLSL shader without the TonemapInvert(), implementing an arbitrary 5-tap horizontal filter.

float max3(float x, float y, float z) { return max(x, max(y, z)); }

float3 TonemapWithWeight(float3 c, float w) { return c * (w * rcp(max3(c.r, c.g, c.b) + 1.0)); }

Texture2D tex0;
SamplerState smp0;

float3 main(float2 pos : TEXCOORD) : SV_Target {
    // Texture2D samples are float4; take .rgb to match the float3 parameter.
    return
        TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2(-2,0)).rgb, 0.1) +
        TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2(-1,0)).rgb, 0.2) +
        TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 0,0)).rgb, 0.4) +
        TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 1,0)).rgb, 0.2) +
        TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 2,0)).rgb, 0.1);
}

Inspecting the disassembly of this shader in Shader Analyzer (part of GPU Perf Studio) shows the following GCN instructions for each filter tap after the first tap.

v_max3_f32 <--- max3(c.r, c.g, c.b)
v_add_f32  <--- adds 1.0 to form the tonemap denominator
v_rcp_f32  <--- rcp takes 4x the runtime of other VALU (vector ALU) operations
v_mul_f32  <--- folds the scalar filter weight into the tonemap weight before the multiply by the color
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum

The added cost per filter tap for applying the reversible tonemapper is only the top 4 instructions, which take the time of 7 VALU operations. Each tap of the filter takes 10 VALU operations in total. To put this in a real context, in terms of VALU throughput alone, Fiji Nano for example can approach 400 million of these filter taps in a single millisecond (according to peak specs; measured throughput depends on the bandwidth needed to source the taps, among other factors).
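
The arithmetic behind that estimate can be sketched in a few lines of Python. The lane count (4096) and peak clock (1 GHz) are assumptions about the Fiji Nano's peak specification, not figures taken from this post, and the calculation ignores memory bandwidth, occupancy, and scheduling.

```python
# Per-tap cost in units of one full-rate VALU operation.
RCP_COST = 4      # rcp runs at quarter rate
SIMPLE_COST = 1   # every other VALU instruction in the tap

# Four tonemap instructions, one of which is the rcp.
tonemap_cost = 3 * SIMPLE_COST + RCP_COST
# Three multiply-accumulates, one per color channel.
accumulate_cost = 3 * SIMPLE_COST
ops_per_tap = tonemap_cost + accumulate_cost

# Assumed peak specification (hypothetical inputs for this estimate).
LANES = 4096        # VALU lanes across the GPU
CLOCK_HZ = 1.0e9    # peak engine clock

ops_per_ms = LANES * CLOCK_HZ / 1000.0
taps_per_ms = ops_per_ms / ops_per_tap

print(tonemap_cost, ops_per_tap, taps_per_ms)
```

Under these assumptions the result lands slightly above 400 million taps per millisecond, in line with the estimate above.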

Timothy Lottes

Timothy Lottes is a member of the Graphics Performance R&D team at AMD. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
