| Optimized Reversible Tonemapper for Resolve

A typical problem with MSAA Resolve mixed with HDR is that a single sample with a large HDR value can over-power all other samples, resulting in a loss of perceived anti-aliasing on an edge. One way to fix this problem involves using an impractically large filter kernel mixed with an impractically large number of samples per pixel.
 
A real-time workaround is to accept a bias in the resolve instead, reducing the weight of each sample as a function of how bright it is. Bright edges then visually erode instead of expanding (as they would with an unbiased resolve). This process is equivalent to tonemapping prior to the resolve and then reversing that tonemap after the resolve.
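The equivalence claimed above can be checked numerically. The following is a minimal Python sketch, not the shader itself, and it assumes single-channel (gray) samples and the simple curve x / (1 + x) used later in this post; the function names are made up for illustration. For gray samples, tonemapping, averaging, and inverting gives the same result as a weighted mean with per-sample weight 1 / (1 + x).

```python
def resolve_plain(samples):
    """Unbiased box resolve: the plain mean of the samples."""
    return sum(samples) / len(samples)

def resolve_tonemapped(samples):
    """Tonemap each sample with x / (1 + x), average, then invert with y / (1 - y)."""
    mean = sum(x / (1.0 + x) for x in samples) / len(samples)
    return mean / (1.0 - mean)

def resolve_weighted(samples):
    """Weighted mean in which each sample's weight is 1 / (1 + x)."""
    weights = [1.0 / (1.0 + x) for x in samples]
    return sum(w * x for w, x in zip(weights, samples)) / sum(weights)

# Hypothetical edge pixel: three dim samples and one very bright HDR sample.
edge = [0.1, 0.1, 0.1, 100.0]
print(resolve_plain(edge))       # dominated by the bright sample
print(resolve_tonemapped(edge))  # stays below 1.0
print(resolve_weighted(edge))    # matches the tonemapped resolve
```

With these inputs the plain mean is about 25, so the pixel saturates and the edge loses its gradient, while the biased resolve stays near 0.46.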
 
This post presents an optimized, modified form of the technique Brian Karis [Epic] describes in his Graphics Rants post "Tone mapping". The core change is to replace the luma computation with max3(red,green,blue). The luma-based tonemapper weights samples differently depending on their hue; the max3-based tonemapper does not, which removes the hue shift on mixed-color edges of similar value.
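To make the hue dependence concrete, here is a small Python sketch that compares the per-sample weight 1 / (1 + luma) with 1 / (1 + max3) for three saturated colors of the same value. The Rec. 709 luma coefficients are an assumption made for this illustration; the text above does not say which luma weights are used.

```python
def luma(c):
    # Rec. 709 coefficients: an assumption for this sketch only.
    r, g, b = c
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def weight_luma(c):
    """Resolve weight of a sample under a luma-based tonemapper."""
    return 1.0 / (1.0 + luma(c))

def weight_max3(c):
    """Resolve weight of a sample under the max3-based tonemapper."""
    return 1.0 / (1.0 + max(c))

colors = {"red": (4.0, 0.0, 0.0), "green": (0.0, 4.0, 0.0), "blue": (0.0, 0.0, 4.0)}
for name, c in colors.items():
    print(name, round(weight_luma(c), 4), round(weight_max3(c), 4))
```

The max3 weight is identical for all three colors, while the luma weight is largest for blue and smallest for green, so on an edge between two such colors the luma version pulls the resolved hue toward the color with the lower luma.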
 
The max3() operation maps to a single instruction, v_max3_f32, on all versions of GCN. The instruction set for GCN3, the generation used in Fiji, is covered in AMD's public GCN3 ISA documentation. The driver-side AMD DX shader compiler automatically transforms max(x, max(y, z)) into max3(x, y, z). This functionality, along with min3() and mid3(), is also exposed explicitly in GLSL via the AMD_shader_trinary_minmax extension.
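For readers unfamiliar with the three-operand forms, the sketch below expresses them with ordinary two-operand min and max in Python. This is only a reference for what the operations compute on ordinary finite values (mid3 is the median); it says nothing about how the hardware handles NaN or denormal inputs.

```python
def min3(x, y, z):
    """Smallest of three values."""
    return min(x, min(y, z))

def max3(x, y, z):
    """Largest of three values, the nested form the compiler is said to recognize."""
    return max(x, max(y, z))

def mid3(x, y, z):
    """Median of three values, built from two-operand min and max."""
    return max(min(x, y), min(max(x, y), z))

print(min3(3.0, 1.0, 2.0), mid3(3.0, 1.0, 2.0), max3(3.0, 1.0, 2.0))
```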
 
Here is an HLSL implementation of the tonemapper and its inverse.

float max3(float x, float y, float z) { return max(x, max(y, z)); }

// Apply this to tonemap linear HDR color "c" after a sample is fetched in the resolve.
// Note that a value of 1.0 in "c" maps to the expected limit of low-dynamic-range monitor output.
float3 Tonemap(float3 c) { return c * rcp(max3(c.r, c.g, c.b) + 1.0); }

// When the filter kernel is a weighted sum of fetched colors,
// it is cheaper to fold the filter weight into the tonemap operation.
float3 TonemapWithWeight(float3 c, float w) { return c * (w * rcp(max3(c.r, c.g, c.b) + 1.0)); }

// Apply this to restore the linear HDR color before writing out the result of the resolve.
float3 TonemapInvert(float3 c) { return c * rcp(1.0 - max3(c.r, c.g, c.b)); }

And a GLSL Shadertoy example: https://www.shadertoy.com/view/Xdd3Rr
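As a quick sanity check of the HLSL functions above, the following Python sketch mirrors Tonemap() and TonemapInvert() on plain tuples and confirms that the pair round-trips a color. The Python names are illustrative stand-ins for the HLSL ones.

```python
def max3(x, y, z):
    return max(x, max(y, z))

def tonemap(c):
    """Mirror of Tonemap(): scale the color by 1 / (max3 + 1)."""
    s = 1.0 / (max3(*c) + 1.0)
    return tuple(ch * s for ch in c)

def tonemap_invert(c):
    """Mirror of TonemapInvert(): scale the color by 1 / (1 - max3)."""
    s = 1.0 / (1.0 - max3(*c))
    return tuple(ch * s for ch in c)

hdr = (0.25, 4.0, 100.0)
ldr = tonemap(hdr)
print(ldr)                  # every channel is below 1.0
print(tonemap_invert(ldr))  # recovers hdr up to floating-point rounding
```

The inverse works because the largest tonemapped channel is m / (1 + m), so 1 - max3 of the tonemapped color equals 1 / (1 + m), which is exactly the factor the forward pass applied. In exact arithmetic a tonemapped value never reaches 1.0, so the reciprocal in the inverse stays finite; very large inputs at reduced precision may need extra care.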
 
Here is an example of using the above functions in a low-quality 4xMSAA box filter resolve:

return TonemapInvert(
  TonemapWithWeight(sample0, 0.25) +
  TonemapWithWeight(sample1, 0.25) +
  TonemapWithWeight(sample2, 0.25) +
  TonemapWithWeight(sample3, 0.25));
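The box resolve above can be exercised on the CPU with a short Python sketch. The sample colors are made up for illustration, and the helper names mirror the HLSL functions rather than any real API.

```python
def max3(x, y, z):
    return max(x, max(y, z))

def tonemap_with_weight(c, w):
    """Mirror of TonemapWithWeight(): one shared factor carries both weights."""
    s = w / (max3(*c) + 1.0)
    return tuple(ch * s for ch in c)

def tonemap_invert(c):
    s = 1.0 / (1.0 - max3(*c))
    return tuple(ch * s for ch in c)

def resolve_box(samples):
    """Equal-weight resolve through the reversible tonemapper."""
    w = 1.0 / len(samples)
    total = (0.0, 0.0, 0.0)
    for c in samples:
        t = tonemap_with_weight(c, w)
        total = tuple(a + b for a, b in zip(total, t))
    return tonemap_invert(total)

def resolve_plain(samples):
    """Unbiased box resolve for comparison."""
    n = len(samples)
    return tuple(sum(c[i] for c in samples) / n for i in range(3))

# Hypothetical edge pixel: three dim samples and one bright HDR sample.
samples = [(0.1, 0.1, 0.2)] * 3 + [(50.0, 40.0, 10.0)]
print(resolve_plain(samples))
print(resolve_box(samples))
```

A pixel whose four samples are identical comes back unchanged, so the bias only affects pixels that actually contain an edge.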

Here is another example, this time a full HLSL shader that applies an arbitrary 5-tap horizontal filter. It omits the TonemapInvert() step.

float max3(float x, float y, float z) { return max(x, max(y, z)); }

float3 TonemapWithWeight(float3 c, float w) { return c * (w * rcp(max3(c.r, c.g, c.b) + 1.0)); }

Texture2D tex0;
SamplerState smp0;

float3 main(float2 pos : TEXCOORD) : SV_Target {
  // Take .rgb explicitly: Texture2D returns float4, TonemapWithWeight expects float3.
  return
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2(-2,0)).rgb, 0.1) +
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2(-1,0)).rgb, 0.2) +
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 0,0)).rgb, 0.4) +
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 1,0)).rgb, 0.2) +
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 2,0)).rgb, 0.1); }

Inspecting the disassembly of this shader in Shader Analyzer (part of GPU Perf Studio) shows the following GCN instructions for each filter tap after the first tap.

v_max3_f32
v_add_f32
v_rcp_f32  <--- rcp takes 4x the runtime of other VALU (vector ALU) operations
v_mul_f32  <--- folds the scalar filter weight into the tonemap weight before multiplying by the color
---------
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum

The added cost per filter tap for the reversible tonemapper is only the top 4 instructions, which take the time of 7 VALU operations because v_rcp_f32 counts as 4. Including the 3 v_mac_f32 accumulates, each tap of the filter takes 10 VALU operations in total. To place this in a real context, in terms of VALU alone, Fiji Nano for example can approach 400 million of these filter taps in a single millisecond (according to peak specs; measured throughput depends on the bandwidth needed to source the taps, among other factors).
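The arithmetic behind these figures can be reproduced in a few lines of Python. The lane count and clock below are assumptions taken from commonly quoted peak specifications for a Fiji Nano class part (4096 VALU lanes at up to 1000 MHz); they are not stated in this post and are used here only as a back-of-the-envelope estimate.

```python
# Relative cost of each instruction in VALU-operation units, per the listing above.
COST = {"v_max3_f32": 1, "v_add_f32": 1, "v_rcp_f32": 4, "v_mul_f32": 1, "v_mac_f32": 1}

tonemap_ops = ["v_max3_f32", "v_add_f32", "v_rcp_f32", "v_mul_f32"]
accumulate_ops = ["v_mac_f32"] * 3

tonemap_cost = sum(COST[op] for op in tonemap_ops)                # cost added by the tonemapper
tap_cost = tonemap_cost + sum(COST[op] for op in accumulate_ops)  # total cost per tap

LANES = 4096      # assumed peak VALU lane count
CLOCK_HZ = 1.0e9  # assumed peak clock

valu_ops_per_ms = LANES * CLOCK_HZ / 1000.0
taps_per_ms = valu_ops_per_ms / tap_cost

print(tonemap_cost, tap_cost)
print(taps_per_ms / 1.0e6, "million taps per millisecond")
```

Under these assumed figures the estimate lands at roughly 410 million taps per millisecond, in line with the approximate figure quoted above.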


Timothy Lottes
Timothy Lottes is a member of the Graphics Performance R&D team at AMD. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
