Integrating AMD FidelityFX™ into the Ego Engine

Originally posted: December 18, 2019

Last updated: March 11, 2024

Tom Hammersley

AMD FidelityFX Contrast Adaptive Sharpening (CAS) is a new, open source library from AMD that improves both image quality and performance for minimal integration effort.

CAS offers two main features:

Contrast-Adaptive Sharpening intelligently enhances image quality and clarity.
High-quality upscaling improves rendering performance whilst maintaining sharp visuals.

Library Structure

CAS is designed to integrate into your engine at the source code level. Accordingly, the code is available from GitHub under an open source MIT licence. The core CAS functionality is self-contained within a header-only library for C++ and HLSL/GLSL shaders.

The process of integrating CAS is no more complicated than implementing a new full-screen shader pass in your engine. Your engine interfaces with CAS at the function call level. CAS does not directly interface with any shader resource bind points, which can be a common friction point when integrating middleware.

Since there is no external library nor subsystem to initialise, there are no additional GPU resources to create or manage, nor CPU-side memory allocations. Nor does CAS require you to prepare resources on its behalf, such as motion vectors. Any conversion processing occurs within the shader.

Integration Guide

The precise point that CAS is best integrated will vary from engine to engine. I would suggest there is a common point in the rendering pipeline of modern engines: at the end of all rendering and post processing in linear space, prior to applying the SDR/HDR transfer function and UI rendering. This is a natural home for FidelityFX, since it operates in linear space and expects matching inputs.

Shader Structure

At present, CAS requires a compute shader in order to facilitate a few optimisations. We will therefore consume our linear input texture as an SRV and write to a UAV.

This shader will need a few simple integration points:

Include CAS headers
Define interface functions
Define cbuffers
Compute shader body

Headers

The CAS headers are designed to be used in either C++ or shader code. To use the headers in a HLSL shader, we must define A_GPU and A_HLSL . First we #include “ffx_a.h”, which provides a consistent set of types and functions across CPU and GPU. After defining a couple of interface functions, we can include the main header, ffx_cas.h

Interface Functions

Since CAS makes no assumptions nor demands about how our data is accessed or stored, we must provide functions for it to access our data. These functions are CasLoad() and CasInput() .

CasLoad() loads the texel at a given xy location. Since we provide the code, we guarantee it fits with our shader binding scheme. There are no conflicts with naming schemes, register bindpoints, or bindless vs non-bindless.

Texture2D srvInputTexture;
AF3 CasLoad( ASU2 p )
{
    return srvInputTexture.Load( int3( p, 0 ) ).rgb;
}

CasInput() provides an opportunity to perform any colour space conversion to linear space. Since I assume that our input is in the linear space part of the pipeline, CasInput() is simply an empty function:

void CasInput( inout AF1 r, inout AF1 g, inout AF1 b ) {}

Constant Buffers

CAS requires a little processed data stored in a cbuffer. Since CAS accepts this data via function parameters, we can place that data in a separate or combined cbuffer. The data is simply 3 uint4s:

cbuffer cb : register( b0 )
{
    uint4 const0;
    uint4 const1;
    uint4 const2;
};

Example Shader Body

CAS adopts a slightly different compute shader structure to the typical 8×8 or 8×4 thread-per-pixel arrangement. The template shader for CAS swizzles the compute shader thread index to a more efficient pixel-shader addressing pattern. This structure is unrolled 2x in width and height, further enhancing efficiency. Despite this loop unrolling, enough threads are created to keep a modern GPU fully occupied, even at 1080p.

The main work function of CAS is CasFilter() . The last two parameters to this function determine scaling and quality. If we pass true for the first parameter, then the image is only sharpened, not resized. If we pass true for the second parameter we run a slightly cheaper version of the algorithm. For maximum efficiency, I recommend compiling out different variations of the function:

CasFilter( …, true, false ) – sharpen only CasFilter( …, false, false ) – high quality resize CasFilter( …, false, true ) – lower quality resize

RWTexture2D uavOutputTexture;
[ numthreads( 64, 1, 1 ) ]
void cs_cas_sharpen_only( uint3 LocalThreadId : SV_GroupThreadID, uint3 WorkGroupId : SV_GroupID )
{
    AU2 gxy = ARmp8x8( LocalThreadId.x ) + AU2( WorkGroupId.x << 4u, WorkGroupId.y << 4u );

    AF3 c;
    CasFilter( c.r, c.g, c.b, gxy, const0, const1, true, false );
    uavOutputTexture[ ASU2( gxy ) ] = AF4( c, 1 );
    gxy.x += 8u;

    CasFilter( c.r, c.g, c.b, gxy, const0, const1, true, false );
    uavOutputTexture[ ASU2( gxy ) ] = AF4( c, 1 );
    gxy.y += 8u;

    CasFilter( c.r, c.g, c.b, gxy, const0, const1, true, false );
    uavOutputTexture[ ASU2( gxy ) ] = AF4( c, 1 );
    gxy.x -= 8u;

    CasFilter( c.r, c.g, c.b, gxy, const0, const1, true, false );
    uavOutputTexture[ ASU2( gxy ) ] = AF4( c, 1 );
}

Despite the unusual thread group dimensions, the shader is still dispatched with ((width + 15) / 16, (height + 15) / 16, 1).

CPU-Side

Clearly most of the details of implementing a full-screen shader pass will depend on the host engine. The only work that CAS requires CPU-side is the population of the cbuffer. This is as simple as calling the function that CAS provides in ffx_cas.h:

#define A_CPU 1
#include "../../../../shaders_2019/cas/ffx_a.h"
#include "../../../../shaders_2019/cas/ffx_cas.h"

CasSetup( const0,
          const1,
          sharpness,
          AF1( inputWidth ), AF1( inputHeight ),
          AF1(outputWidth ), AF1(outputHeight) );

Upscaling

Little additional work is required to implement upscaling. As noted earlier, the shader simply needs the correct function parameters. The host engine has the responsibility of providing appropriately sized input and output textures. CAS supports scaling of up to 4x area, though clearly limiting scaling results in higher image quality. CasSupportScaling() can be used to check if a configuration is supported.

Performance and Results

CAS is very efficient. It generally costs a few tenths of a millisecond, which is the typical cost of a full-screen operation on a modern GPU. Furthermore, the design of CAS naturally leads itself to integrating inside other passes, minimising cost.

CAS provided a clear boost to image quality for F1 2019 and we consider it a very useful addition to the title, especially considering the limited time required to integrate it. It took two full days of effort from download to having all the features working in a prototype.

We did not observe any notable downsides to image quality. We were unable to detect any of the artefacts commonly associated with sharpening filters.

Here are some examples. The slider can be used to observe the differences. The left side is without CAS applied, and the right side is with CAS on.

Note: You may need to resize your window to see these 100% crops correctly.

PNG (960x640)

Conclusion

AMD FidelityFX CAS offers very useful boost to image quality at minimal performance cost. The package is well designed, easy to integrate and requires very little time and effort to integrate. Finally, there are neither technical nor legal barriers to integration – AMD FidelityFX v1.1.4 runs across all types of modern GPUs and has an MIT open source licence.

AMD FidelityFX is available at https://github.com/GPUOpen-Effects/FidelityFX.