Using AMD Crossfire API

Originally posted: May 5, 2016

Anas Lasram

Gaming at optimal performance and quality at high screen resolutions can sometimes be a demanding task for a single GPU. 4K monitors are becoming mainstream, and gamers wishing to benefit from the quality provided by 8 million pixels at comfortable frame rates may need a second GPU in their system to maximize their playing experience. To ensure that performance scales with multiple GPUs, some work is usually required at the game implementation level.

Alternate Frame Rendering (AFR) is the method used to take advantage of multiple GPUs in DirectX® 11 and OpenGL® applications. The Crossfire guide describes AFR and how it is implemented in AMD drivers. The guide also provides recommendations on how to optimize a game engine to fully exploit AFR. As described in the guide, the goal is to avoid inter-frame dependencies and resource transfers between GPUs. Transfers are initiated by the driver whenever it determines that a GPU has a stale resource. Transfers go through the PCI Express® bus, which is usually an expensive operation.
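To put a rough number on "expensive", here is a back-of-the-envelope estimate of what it costs to move one full 4K render target across the bus. The 4 bytes per pixel and the 12 GB/s effective bandwidth are assumptions chosen for illustration, not measured figures; substitute the values for your own surface format and platform.

```cpp
#include <cstdint>

// Size in bytes of a 2D surface with a fixed number of bytes per pixel.
constexpr std::uint64_t surface_bytes(std::uint64_t width,
                                      std::uint64_t height,
                                      std::uint64_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

// Time in milliseconds needed to move `bytes` at `bytes_per_second`.
constexpr double transfer_ms(std::uint64_t bytes, double bytes_per_second)
{
    return 1000.0 * static_cast<double>(bytes) / bytes_per_second;
}

// Illustrative inputs only: a 3840x2160 surface with 4 bytes per pixel and an
// ASSUMED effective bus bandwidth of 12 GB/s.
constexpr std::uint64_t kBytes4K = surface_bytes(3840, 2160, 4);
constexpr double kAssumedBandwidth = 12.0e9; // bytes per second (assumption)
constexpr double kTransferMs = transfer_ms(kBytes4K, kAssumedBandwidth);
```

Under these assumptions the surface is about 33 MB and a single full transfer takes roughly 2.8 ms, a noticeable share of the 16.7 ms available to a 60 Hz frame.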

The Crossfire guide describes the limitations of the default AFR implementation. In summary:

The driver does not know what resources need tracking. It therefore tracks all of them by default.
The driver does not have knowledge of what region in a resource is updated. If a resource is stale, all of it will be transferred.
The driver has to start the transfer at the end of the frame even if the resource is only used at the beginning of the frame.
The driver uses heuristics to detect if a resource is stale. The heuristics can fail, resulting in rendering artifacts.

The solution to all of the above limitations is to give developers control over what gets transferred between GPUs, when to start a transfer, and how to wait for the transfer to finish.

Radeon® Software Crimson Edition introduces the Crossfire API as an extension to DirectX 11. The API has:

Functions to enable or disable a transfer for a resource.
Functions to select a transfer mode for a resource.
Functions to select when to start a transfer.
Synchronization functions to avoid data hazards.

Selecting a transfer mode is done at resource creation time. For this purpose, the Crossfire API provides resource creation functions that are very similar to DirectX 11 functions with the addition of a transfer mode flag. For instance, buffers can be created with the function:

AGSReturnCode agsDriverExtensions_CreateBuffer(
    AGSContext* context,
    const D3D11_BUFFER_DESC* desc,
    const D3D11_SUBRESOURCE_DATA* data,
    ID3D11Buffer** buffer,
    AGSAfrTransferType transfer_mode);

Similar functions exist to create textures. These functions are detailed in the Crossfire guide.

As you can see, the function is very similar to the DirectX 11 function CreateBuffer, except that it takes the extra parameter transfer_mode. This parameter is an enumeration that defines how to transfer the resource. The available modes are:

AGS_AFR_TRANSFER_DISABLE: turn off driver tracking and transfers.
AGS_AFR_TRANSFER_DEFAULT: use default driver tracking, as if the API were not used.
AGS_AFR_TRANSFER_1STEP_P2P: peer-to-peer, application-controlled transfer.
AGS_AFR_TRANSFER_2STEP_NO_BROADCAST: application-controlled transfer using intermediate system memory.
AGS_AFR_TRANSFER_2STEP_WITH_BROADCAST: application-controlled broadcast transfer using intermediate system memory.

For the modes where the application controls the transfer, three functions are provided to start a transfer and to wait for it to finish. The functions are:

AGSReturnCode agsDriverExtensions_NotifyResourceBeginAllAccess(
    AGSContext* context,
    ID3D11Resource* resource);

AGSReturnCode agsDriverExtensions_NotifyResourceEndWrites(
    AGSContext* context,
    ID3D11Resource* resource,
    const D3D11_RECT* transfer_regions,
    const unsigned int* subresource_array,
    unsigned int num_subresource);

AGSReturnCode agsDriverExtensions_NotifyResourceEndAllAccess(
    AGSContext* context,
    ID3D11Resource* resource);

NotifyResourceBeginAllAccess notifies the driver that the application begins accessing the resource. The driver waits if the resource is being used as the destination of a transfer.

NotifyResourceEndWrites is used to start transferring the resource to another GPU or GPUs. The driver will start the transfer as soon as the destination resource in the other GPU is not being accessed.

NotifyResourceEndAllAccess notifies the driver that the application is done using the resource in the frame. The driver can use the resource as the destination of a transfer.
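One way to make sure every NotifyResourceBeginAllAccess is paired with a NotifyResourceEndAllAccess is a small scope guard. The sketch below is a hypothetical helper and not part of the Crossfire API: AccessScope and its two callbacks are names invented for illustration. In a real application the callbacks would wrap the agsDriverExtensions_NotifyResourceBeginAllAccess and agsDriverExtensions_NotifyResourceEndAllAccess calls shown above.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII helper (not part of AGS). It runs `begin` when the scope
// is entered and `end` when the scope is left, so the end-of-access
// notification cannot be forgotten on an early return.
class AccessScope {
public:
    AccessScope(const std::function<void()>& begin, std::function<void()> end)
        : end_(std::move(end))
    {
        if (begin) begin(); // e.g. wraps ...NotifyResourceBeginAllAccess(context, resource)
    }

    ~AccessScope()
    {
        if (end_) end_();   // e.g. wraps ...NotifyResourceEndAllAccess(context, resource)
    }

    // The guard owns one critical section, so copying it makes no sense.
    AccessScope(const AccessScope&) = delete;
    AccessScope& operator=(const AccessScope&) = delete;

private:
    std::function<void()> end_;
};
```

The guard would be constructed just before the first access to the resource and go out of scope right after the last one, which also makes the extent of the critical section visible as a block in the source.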

A simple example

To demonstrate the API and how to use it efficiently, let us look at a simple example where a frame depends on results computed in previous frames. Imagine we want to approximate motion blur by averaging the last N rendered frames. The pseudo code for such a method is:

render_target_texture_array frame_list{N};
render_target average;
render_target back_buffer;

void render()
{
    static int frame_count = 0;
    int subresource_id = frame_count % N;
    ++frame_count;

    // render frame
    set_render_target(frame_list.sub_resource(subresource_id));
    render_scene();

    // compute average
    set_render_target(average);
    compute_average(frame_list);

    // remaining rendering
    set_render_target(back_buffer);
    finalize_rendering();
}

frame_list is a texture array that holds the N frames that will be averaged. Each frame is stored in one layer of the texture. average is the texture that stores the result of averaging all frame_list layers.
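The data flow of frame_list and average can be sketched on the CPU, leaving the GPU out entirely. In this minimal sketch a "frame" is just a flat array of float pixels, and the class name FrameAverager and its methods are illustrative assumptions rather than anything taken from the Crossfire guide.

```cpp
#include <cstddef>
#include <vector>

// CPU-side stand-in for the frame_list / average pair.
// Preconditions: layers >= 1, and every pushed frame holds `pixels` values.
class FrameAverager {
public:
    FrameAverager(std::size_t layers, std::size_t pixels)
        : frame_list_(layers, std::vector<float>(pixels, 0.0f)) {}

    // Store the new frame in layer frame_count % N, as render() does.
    void push(const std::vector<float>& frame)
    {
        frame_list_[frame_count_ % frame_list_.size()] = frame;
        ++frame_count_;
    }

    // Average all N layers pixel by pixel.
    std::vector<float> average() const
    {
        std::vector<float> result(frame_list_.front().size(), 0.0f);
        for (const auto& layer : frame_list_)
            for (std::size_t i = 0; i < result.size(); ++i)
                result[i] += layer[i];
        for (float& value : result)
            value /= static_cast<float>(frame_list_.size());
        return result;
    }

private:
    std::vector<std::vector<float>> frame_list_;
    std::size_t frame_count_ = 0;
};
```

Because every call to average() reads all N layers, each frame depends on the N - 1 frames rendered before it, which is exactly the inter-frame dependency that AFR has to deal with.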

When a layer in frame_list is updated, it has to be transferred to all other GPUs so that all of them have an up-to-date copy of frame_list. Inter-frame dependencies do not exist for average, so its transfers can be safely disabled.

The default AFR-Compatible driver behavior fails to detect that frame_list is stale. This is because one of the two heuristics that the driver relies on states: “A resource that is updated before it gets used within a frame is not stale”. One layer of frame_list is rendered first and then the resource is used in the average computation, which is why the driver considers the whole resource up to date. Details on the driver heuristics and failure cases are described in the Crossfire guide.
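The consequence of that missed detection can be illustrated with a toy model. The sketch below is not how the driver works internally; it only assumes that frame f is rendered on GPU f % gpu_count, that each GPU keeps a private copy of frame_list, and that a write either stays local or is broadcast to every GPU. The function name and parameters are hypothetical.

```cpp
#include <vector>

// Toy AFR model. Frame f runs on GPU (f % gpu_count) and writes layer
// (f % layers) of that GPU's private copy of frame_list. Returns how many
// layers are stale on the GPU that rendered the last frame.
// Preconditions: gpu_count >= 1, layers >= 1, frames >= 1.
int count_stale_layers(int gpu_count, int layers, int frames, bool broadcast)
{
    // copies[g][l] = index of the frame last written to layer l on GPU g.
    std::vector<std::vector<int>> copies(gpu_count, std::vector<int>(layers, -1));
    // latest[l] = what a fully up-to-date copy would hold in layer l.
    std::vector<int> latest(layers, -1);

    for (int f = 0; f < frames; ++f) {
        const int gpu = f % gpu_count;
        const int layer = f % layers;
        latest[layer] = f;
        copies[gpu][layer] = f;
        if (broadcast)
            for (auto& copy : copies) copy[layer] = f; // transfer to all GPUs
    }

    const int last_gpu = (frames - 1) % gpu_count;
    int stale = 0;
    for (int l = 0; l < layers; ++l)
        if (copies[last_gpu][l] != latest[l]) ++stale;
    return stale;
}
```

With two GPUs, four layers and eight frames, the model reports two stale layers on the last GPU when nothing is transferred, and none when every write is broadcast.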

We will now use the Crossfire API to transfer the layer in frame_list that gets updated. Using the API, the pseudo code becomes:

render_target_texture_array frame_list{N}; // created with AGS_AFR_TRANSFER_2STEP_WITH_BROADCAST
render_target average; // created with AGS_AFR_TRANSFER_DISABLE
render_target back_buffer;

void render()
{
    static int frame_count = 0;
    int subresource_id = frame_count % N;
    ++frame_count;

    // start of critical section
    NotifyResourceBeginAllAccess(frame_list);

    // render frame
    set_render_target(frame_list.sub_resource(subresource_id));
    render_scene();

    // signal transfer
    NotifyResourceEndWrites(frame_list, subresource_id);

    // compute average
    set_render_target(average);
    compute_average(frame_list);

    // end of critical section
    NotifyResourceEndAllAccess(frame_list);

    // remaining rendering
    set_render_target(back_buffer);
    finalize_rendering();
}

Note that frame_list is created with a broadcast flag because the update needs to be visible to all GPUs. In the case of two GPUs, it is faster to replace AGS_AFR_TRANSFER_2STEP_WITH_BROADCAST with AGS_AFR_TRANSFER_1STEP_P2P. Information about querying the number of GPUs is described in the Crossfire guide.
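The choice between the two modes could be wrapped in a small helper, sketched below. The TransferMode enum is a local stand-in that mirrors the mode names listed earlier and is not the AGS header; a real application would use AGSAfrTransferType. The helper pick_transfer_mode and its policy for a single GPU are assumptions made for illustration.

```cpp
// Local stand-in for the transfer modes named in the article. This is NOT the
// AGS SDK enum; use AGSAfrTransferType in real code.
enum class TransferMode {
    Disable,             // mirrors AGS_AFR_TRANSFER_DISABLE
    Default,             // mirrors AGS_AFR_TRANSFER_DEFAULT
    OneStepP2P,          // mirrors AGS_AFR_TRANSFER_1STEP_P2P
    TwoStepNoBroadcast,  // mirrors AGS_AFR_TRANSFER_2STEP_NO_BROADCAST
    TwoStepWithBroadcast // mirrors AGS_AFR_TRANSFER_2STEP_WITH_BROADCAST
};

// Hypothetical policy helper. `read_across_frames` is true for resources such
// as frame_list whose contents written in one frame are read in later frames.
constexpr TransferMode pick_transfer_mode(int gpu_count, bool read_across_frames)
{
    // No inter-frame dependency (or only one GPU): nothing to transfer.
    // Treating the single-GPU case this way is an assumption of this sketch.
    if (gpu_count < 2 || !read_across_frames)
        return TransferMode::Disable;

    // With exactly two GPUs there is a single peer, so peer to peer suffices.
    if (gpu_count == 2)
        return TransferMode::OneStepP2P;

    // With more GPUs the update has to reach all of them.
    return TransferMode::TwoStepWithBroadcast;
}
```

In the motion blur example, frame_list would be created with the result for read_across_frames set to true, while average would always end up with transfers disabled.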

Reducing the contention of the critical section

Let us now imagine that the function render_scene takes a long time to execute. NotifyResourceBeginAllAccess and NotifyResourceEndAllAccess are used to protect frame_list from updates initiated by another GPU. The long execution time of render_scene causes frame_list to be locked for a long time. In parallel programming, it is well known that locks should be avoided as much as possible. When a lock cannot be avoided, the resulting critical section has to be reduced to a minimum.

In our example, we can reduce the critical section by rendering the scene into a temporary render target and then copying the result into frame_list. This way, NotifyResourceBeginAllAccess is called after render_scene. The pseudo code becomes:

render_target_texture_array frame_list{N}; // created with AGS_AFR_TRANSFER_2STEP_WITH_BROADCAST
render_target average; // created with AGS_AFR_TRANSFER_DISABLE
render_target scene; // created with AGS_AFR_TRANSFER_DISABLE
render_target back_buffer;

void render()
{
    static int frame_count = 0;
    int subresource_id = frame_count % N;
    ++frame_count;

    // render frame
    set_render_target(scene);
    render_scene();

    // start of critical section
    NotifyResourceBeginAllAccess(frame_list);
    // local transfer
    copy_subresource(frame_list, subresource_id, scene);
    // signal transfer
    NotifyResourceEndWrites(frame_list, subresource_id);

    // compute average
    set_render_target(average);
    compute_average(frame_list);

    // end of critical section
    NotifyResourceEndAllAccess(frame_list);

    // remaining rendering
    set_render_target(back_buffer);
    finalize_rendering();
}

The function copy_subresource is used to copy the render target scene into frame_list at layer subresource_id.

Further optimization

Now let us consider that the function compute_average is as slow as the function render_scene. The goal is to reduce the critical section by calling NotifyResourceEndAllAccess before compute_average. This can be done by releasing frame_list earlier and using a copy of frame_list for the average computation. The pseudo code becomes:

render_target_texture_array frame_list{N}; // created with AGS_AFR_TRANSFER_2STEP_WITH_BROADCAST
render_target average; // created with AGS_AFR_TRANSFER_DISABLE
render_target scene; // created with AGS_AFR_TRANSFER_DISABLE
render_target_texture_array frame_list_local_cpy{N}; // created with AGS_AFR_TRANSFER_DISABLE
render_target back_buffer;

void render()
{
    static int frame_count = 0;
    int subresource_id = frame_count % N;
    ++frame_count;

    // render frame
    set_render_target(scene);
    render_scene();

    // start of critical section
    NotifyResourceBeginAllAccess(frame_list);
    // local transfer
    copy_subresource(frame_list, subresource_id, scene);
    // signal transfer
    NotifyResourceEndWrites(frame_list, subresource_id);

    // local transfer
    copy_all_resource(frame_list_local_cpy, frame_list);
    // end of critical section
    NotifyResourceEndAllAccess(frame_list);

    // compute average
    set_render_target(average);
    compute_average(frame_list_local_cpy);

    // remaining rendering
    set_render_target(back_buffer);
    finalize_rendering();
}

frame_list_local_cpy is a local copy of frame_list within the frame. The copy is done with the function copy_all_resource. Note that the entire content of frame_list is copied into frame_list_local_cpy. This is because any layer of frame_list could have been updated from a different GPU. After NotifyResourceEndAllAccess, another GPU can transfer data to frame_list while frame_list_local_cpy is used to compute the average.
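The effect of the three versions on the critical section can be compared with a simple cost model. The struct, the function names and any timings fed into them are hypothetical. The model only adds up the work that sits between NotifyResourceBeginAllAccess and NotifyResourceEndAllAccess in each listing and ignores any time the driver spends waiting.

```cpp
// Hypothetical per-frame GPU costs in milliseconds.
struct FrameCosts {
    double render_scene;    // render_scene()
    double compute_average; // compute_average()
    double copy_layer;      // copy_subresource(): one layer into frame_list
    double copy_all;        // copy_all_resource(): all N layers into the local copy
};

// First version: the scene is rendered and averaged inside the critical section.
constexpr double critical_section_simple(const FrameCosts& c)
{
    return c.render_scene + c.compute_average;
}

// Second version: render_scene moves out and a one-layer copy moves in.
constexpr double critical_section_reduced(const FrameCosts& c)
{
    return c.copy_layer + c.compute_average;
}

// Third version: only the two copies remain inside the critical section.
constexpr double critical_section_minimal(const FrameCosts& c)
{
    return c.copy_layer + c.copy_all;
}
```

The shorter critical section is not free: the second and third versions add copy work and extra render targets to every frame, so the trade-off is only worthwhile when render_scene and compute_average are slow compared with the copies.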

The Crossfire guide, along with a code sample, gives details about the API and demonstrates its use in a practical situation.