A typical problem when combining an MSAA resolve with HDR is that a single sample with a large HDR value can overpower all other samples, resulting in a loss of perceived anti-aliasing on an edge. One way to fix this problem involves using an impractically large filter kernel mixed with an impractically large number of samples per pixel.
 
A real-time workaround is to instead accept a bias in the resolve and reduce the weighting of samples as a function of how bright they are, causing bright edges to visually erode instead of expand (as would happen if unbiased). This process is equivalent to tonemapping prior to resolve and then reversing that tonemap after the resolve.
 
This post presents an optimized, modified form of the technique Brian Karis [Epic] describes in the Graphics Rants post "Tone mapping". The core change is to replace the luma computation with max3(red,green,blue). The luma-based tonemapper weights samples differently depending on color hue, which the max3-based tonemapper does not. As a result, the max3-based tonemapper removes the hue shift on mixed-color edges of similar value.
 
The max3() operation maps to a single instruction, v_max3_f32, on all versions of GCN. The GCN3 instruction set used by Fiji is publicly documented by AMD. The driver-side AMD DX shader compiler will automatically transform max(x, max(y, z)) into max3(x, y, z). This functionality, as well as min3() and mid3(), is also exposed explicitly in GLSL via the AMD_shader_trinary_minmax extension.
 
Here is an HLSL implementation of the tonemapper and inverse.

float max3(float x, float y, float z) { return max(x, max(y, z)); }

// Apply this to tonemap linear HDR color "c" after a sample is fetched in the resolve.
// Note "c" 1.0 maps to the expected limit of low-dynamic-range monitor output.
float3 Tonemap(float3 c) { return c * rcp(max3(c.r, c.g, c.b) + 1.0); }

// When the filter kernel is a weighted sum of fetched colors,
// it is more optimal to fold the weighting into the tonemap operation.
float3 TonemapWithWeight(float3 c, float w) { return c * (w * rcp(max3(c.r, c.g, c.b) + 1.0)); }

// Apply this to restore the linear HDR color before writing out the result of the resolve.
float3 TonemapInvert(float3 c) { return c * rcp(1.0 - max3(c.r, c.g, c.b)); }
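
To see why TonemapInvert() undoes Tonemap(), it can help to write the pair out algebraically. The derivation below is a sketch that is not part of the original listing; it uses m(c) as shorthand for max3(c.r, c.g, c.b), writes T for Tonemap(), and assumes every color channel is non-negative.

```latex
\begin{aligned}
m(c) &= \max(c_r, c_g, c_b), \qquad c_r, c_g, c_b \ge 0 \\
t = T(c) &= \frac{c}{1 + m(c)} \\
m(t) &= \frac{m(c)}{1 + m(c)} \quad\Rightarrow\quad 1 - m(t) = \frac{1}{1 + m(c)} \\
T^{-1}(t) &= \frac{t}{1 - m(t)} = \frac{c}{1 + m(c)} \cdot \bigl(1 + m(c)\bigr) = c
\end{aligned}
```

Since m(t) is below 1 for any finite input, a weighted sum of tonemapped samples with non-negative weights that sum to 1 also stays below 1, so the denominator in TonemapInvert() remains positive. Filters with negative lobes would not have that guarantee.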

And a GLSL Shadertoy example: https://www.shadertoy.com/view/Xdd3Rr
 
Here is an example of using the above functions in a low-quality 4xMSAA box filter resolve:

return TonemapInvert(
  TonemapWithWeight(sample0, 0.25) +
  TonemapWithWeight(sample1, 0.25) +
  TonemapWithWeight(sample2, 0.25) +
  TonemapWithWeight(sample3, 0.25));
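
For context, the snippet above could sit inside a complete resolve pixel shader along the following lines. This is a minimal sketch rather than the author's shader: the resource name msaaColor, the SV_Position-based addressing, and the alpha of 1.0 are all assumptions made for illustration.

```hlsl
float max3(float x, float y, float z) { return max(x, max(y, z)); }

float3 TonemapWithWeight(float3 c, float w) { return c * (w * rcp(max3(c.r, c.g, c.b) + 1.0)); }

float3 TonemapInvert(float3 c) { return c * rcp(1.0 - max3(c.r, c.g, c.b)); }

// Hypothetical 4xMSAA linear HDR color target bound as a shader resource.
Texture2DMS<float4, 4> msaaColor;

float4 main(float4 pos : SV_Position) : SV_Target {
  int2 p = int2(pos.xy); // integer pixel coordinate of this fragment
  float3 sum = 0.0;
  // Box filter: tonemap each of the 4 samples with an equal weight of 0.25.
  [unroll] for (int i = 0; i < 4; i++)
    sum += TonemapWithWeight(msaaColor.Load(p, i).rgb, 0.25);
  // Restore linear HDR before writing the resolved color.
  return float4(TonemapInvert(sum), 1.0);
}
```

As a quick check of the behavior, take one gray sample of value b and three black samples. The weighted sum is b / (4 (1 + b)) and the inverted result is b / (4 + 3b). For b = 99 that gives about 0.33, whereas a plain average gives 24.75, which clips to full white on a low-dynamic-range display.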

Here is another example, this time a full HLSL shader without the TonemapInvert(), which implements an arbitrary 5-tap horizontal filter.

float max3(float x, float y, float z) { return max(x, max(y, z)); }

float3 TonemapWithWeight(float3 c, float w) { return c * (w * rcp(max3(c.r, c.g, c.b) + 1.0)); }

Texture2D tex0;
SamplerState smp0;

float3 main(float2 pos : TEXCOORD) : SV_Target { 
  return
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2(-2,0)).rgb, 0.1) +
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2(-1,0)).rgb, 0.2) +
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 0,0)).rgb, 0.4) +
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 1,0)).rgb, 0.2) +
    TonemapWithWeight(tex0.SampleLevel(smp0, pos, 0, int2( 2,0)).rgb, 0.1); }

Inspecting the disassembly of this shader in Shader Analyzer (part of GPU Perf Studio) shows the following GCN instructions for each filter tap after the first tap.

v_max3_f32
v_add_f32
v_rcp_f32  <--- rcp takes 4x the runtime of other VALU (vector ALU) operations
v_mul_f32  <--- folds the scalar filter weight into the tonemap weight before multiplying by the color
---------
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum
v_mac_f32  <--- multiply by weight and accumulate with the weighted sum

The added cost per filter tap for applying the reversible tonemapper is only the top 4 instructions, which take the time of 7 VALU operations because v_rcp_f32 counts as 4. Each tap of the filter takes 10 VALU operations in total. To place this in a real context, in terms of VALU alone, Fiji Nano for example can approach 400 million of these filter taps in a single millisecond. That figure comes from peak specs; measured throughput depends on the bandwidth needed to source the taps, among other factors.
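
The 400 million figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the Radeon R9 Nano peak specifications of 4096 stream processors at up to 1000 MHz, with each lane retiring one full-rate VALU operation per clock; those numbers come from the public product specs rather than from this post.

```latex
\begin{aligned}
\text{per-tap cost} &= \underbrace{1 + 1 + 4 + 1}_{\text{max3, add, rcp, mul}} + \underbrace{3}_{\text{mac}} = 10 \text{ VALU operations} \\
\text{peak VALU rate} &= 4096 \text{ lanes} \times 10^{9} \text{ clocks/s} = 4.096 \times 10^{12} \text{ ops/s} = 4.096 \times 10^{9} \text{ ops/ms} \\
\text{taps per ms} &\approx \frac{4.096 \times 10^{9}}{10} \approx 4.1 \times 10^{8}
\end{aligned}
```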


Other posts by Timothy Lottes