Animating geometry with AMD DGF

Originally posted:
Sander Reitter's avatar
Sander Reitter
Quirin Meyer's avatar
Quirin Meyer
Josh Barczak's avatar
Josh Barczak

With the Dense Geometry Format (DGF), AMD introduced a cache-friendly representation for compressed triangles. Since future GPU architectures will support DGF, there is an opportunity to revisit different parts of the graphics programming stack that may utilize and benefit from DGF.

Our AMD DGF SDK was recently updated with bug fixes, improvements, and new features. One of our most exciting new features is the addition of an animation-aware encoding pipeline. You can refer to the release notes for a full list of changes.

In this blog post, we take a look at animations and how they work with DGF. Animating a mesh is achieved by manipulating vertex positions across multiple frames, creating an illusion of movement. Our key idea is to overwrite the positions inside a DGF block with animated positions in each frame.

We explain how to create animations with DGF and how you can use those animations in your ray tracing applications.

But before we dig into the details of animations, let’s recap DGF and linear vertex blend animations.

DGF-Recap

DGF, introduced at High Performance Graphics 2024 [1] by AMD, is a hardware-friendly format for storing compressed triangle meshes. As part of a pre-process, the AMD DGF Baker decomposes a triangle mesh into small meshes, or short meshlets. The DGF Baker then compresses those meshlets individually. A DGF-meshlet consists of up to 64 positions and 64 triangles. The meshlet is stored in a 128-byte DGF-block along with meta information. A set of DGF-blocks then represents a mesh.

DGF Block

A DGF block consists of three parts: a header, a geometry part, and a topology part:

The layout of a DGF block

For this blog post, we need to understand the details of header and the geometry part. You only need a basic understanding of the topology part, so let’s start with that.

Topology Part

The topology part holds triangle connectivity: In a conventional representation, you would use an index buffer, where three indices define a triangle.

In terms of memory, storing the indices per triangle is hefty. Given 64 vertices per DGF block, we would need 6 bits to address a single vertex. A triangle with three vertices would then require 36=183\cdot 6=18 bits. Instead, DGF stores topology as a generalized triangle strip with backtracking. That requires only about 5 bits per triangle stored in a re-use buffer, by the first index-bits, and the control bits, as shown in the figure above.

Initially, we proposed a sequential Θ(N)\Theta(N)-algorithm [1] at High Performance Graphics 2024. But at Eurographics [2], we showed multiple data-parallel Θ(logN)\Theta(\log N)-algorithms. We were even able to deduce a random access Θ(1)\Theta(1) version.

Geometry Part

The figure above shows that the geometry part consists of vertex offsets and an optional GeomID part. For our goal of doing animations, you only need to understand what is meant by vertex offsets and neglect the GeomID part.

Each DGF block stores up to 6464 vertex offsets Oi\vec{O}_i, 0i0\leq i < 6464 from which we compute the vertex positions Pi\vec{P}_i:

  • First, we store vertex offsets with respect to a 3D anchor point A\vec{A}. We use A\vec{A} to offset the vertex offsets (hence the name). Each DGF block stores its own anchor A\vec{A} in its header using three signed 24-bit integers.

  • Second, each vertex offset Oi\vec{O}_i contains three unsigned integer values to represent the three vertex channels xx, yy, and zz of the vertex offsets. For each channel, we use a fixed number of bits per DGF block. For that, we can choose between one and 16 bits per channel. That number of bits per channel is also stored in the header.

  • Finally, we need a scale factor 2E1272^{E-127}. We store its biased exponent EE as an 8-bit unsigned integer in the DGF block header. We use the factor to scale the position to floating-point numbers which are required for ray tracing and rasterization.

To conclude, we compute a 3D vertex floating-point position Pi\vec{P}_i from an unsigned integer offset Oi\vec{O}_i, with

Pi=(A+Oi)2E127.\vec{P}_i = (\vec{A} + \vec{O}_i)\cdot 2^{E - 127}.

The AMD DGF Baker carefully quantizes the original positions to compute Oi\vec{O}_i, A\vec{A}, and P\vec{P} to avoid unwanted cracks. For more details, see Section 4 of our original paper [1].

Animation with Linear Vertex Blend

In this blog, we focus on animation techniques that keep the topology stable across the course of an animation sequence and only vary the vertex positions over time. This holds for popular animation methods found in practice: Examples are all techniques that fall in the category of blend shapes (also known as morph targets, shape keys) or skinning (also known as vertex blending, bone animation, rigging) [3], Chapter 4.

Here, we use a basic form of linear vertex blend for demonstration purposes. But we believe that the basic idea we demonstrate here can easily be extended to more sophisticated animation methods.

DGF Animation Pipeline Overview

Next, we explain the key idea of our animation pipeline. We use boldface to refer to entities shown in the following figure:

The DGF Animation Pipeline

During Startup, we load all animation key frames (Load Key Frames). As said, the animation type we use here transforms vertex data, only, and keeps the topology stable for all key frames. We exploit this fact and Bake DGF for a single key frame, using only Key Frame Topology 0 and Key Frame Positions 0.

During baking (DGF Bake), we reserve space for the vertex positions inside the DGF blocks. This space will be filled with animated positions later. The topology part of a DGF block contains the immutable triangle connectivity information. After baking, we upload the resulting DGF Blocks.

During Startup, we upload the Key Frame Vertex Data (e.g., positions plus normals in our example) of all our key frames to the GPU.

After startup, we perform the steps Animation, Update, and Rendering each frame:

During Animation, we Transform Vertices with a compute shader. The compute shader reads the Key Frame Vertex Data from two neighboring key frames and interpolates between them. The mathematics behind is what you saw in our section on linear vertex blend animations above. We store the output in the Animated Vertex Data.

During Update, we Quantize the Animated Vertex Data and overwrite the positions in the corresponding DGF Blocks. We get to this in the section on Quantization further down.

Note that, after DGF baking, vertex positions can end up in multiple DGF blocks. To map a vertex index inside a DGF block to a vertex in the Animated Vertex Data/Key Frame Vertex Data, the DGF Baker outputs a Vertex Table that we keep on the GPU. We use that table to scatter the quantized and animated positions to its associated DGF Blocks. You find those details in the next section on Adjusting the DGF Baker.

With this, it becomes possible to animate each vertex only once and pull the updated positions into new DGF blocks. This allows free manipulation of vertex data, using different key frames, interpolation of vertices, etc.

Next, we will go into the details of Adjusting the DGF Baker and how our Quantization approach works.

Adjusting the DGF Baker

AMD’s DGF Baker lets you create compressed DGF meshes from arbitrary triangle meshes. The DGF Baker is a C++ library. For more details on how to use the baker in your C++ program, see the DGF SDK Documentation here.

To bake a DGF mesh, you need to call BakeDGF method on a DGFBaker::Baker object. You create your DGFBaker::Baker object by passing a config struct DGFBaker::BakerConfig config = {} to the constructor.

The DGF vertex offsets Oi\vec{O}_i are limited to 16 bits in size. If a vertex drifts too far from its anchor point, the offset can overflow, resulting in severe artifacts. For static geometry, the DGFBaker prevents this overflow by selecting a quantization exponent EE based on the dimensions of each triangle cluster. This process is described in detail in our HPG paper.

For animated geometry, because we do not know the animated vertices at bake time, the value of EE must be chosen more conservatively.

Quantize For Animations

Make sure to set config.quantizeForAnimations to true, when using animations. By this, the quantization factor is chosen based on the diagonal length of the cluster AABB. When set to false the x, y, and z extents are used. Using the diagonal-length prevents offset overflow in the event that a cluster rotates during animation.

Cluster Deformation Padding

Another field for animation is the config.clusterDeformationPadding field. It is a floating-point multiplier that scales the cluster AABBs when selecting a quantization exponent. The purpose of this field is to add some padding to account for local stretching in the animation. Again, this is necessary to prevent offset overflow.

Fixed Offset Width

We use the option config.blockForcedOffsetWidth to assign a fixed, predefined precision for each of the X, Y, and Z coordinates.

Usually, the DGF Baker computes a flexible bit-width for each channel, allowing further compression potential. However, we create our DGF blocks using a single key frame and ignore the other key frames. If we only used the bit-widths from one key-frame, other key-frames positions (and interpolated positions) would not fit in the bit-budget. We need, however, a bit-width that fits all key frames and all interpolated frames in between.

The animation sample currently supports config.blockForcedOffsetWidth for 16 bits for each component, 12 bits for each component, 8 bits per component. Additionally, our sample supports a mixed precision bit width, where the x and y component use 11 bits and the z component uses 10 bits.

Generate Vertex Table

The DGF baker has the config.generateVertexTable option, which outputs a Vertex Table.

This table maps the vertex index inside a DGF block back to its original index in the input vertex array. Note, that the DGF baker duplicates input vertex positions and scatters them across multiple DGF blocks. This mapping allows us to manipulate the vertex data, without a need to extract the vertices inside a DGF block. Instead, we can directly read it from the Key Frame Vertex Data and write it to the Animated Vertex Data. By this, we only need to transform each vertex once during Animation. Eventually, we use the Vertex Table to update the positions inside DGF blocks during Quantization.

Enable User Data

The user-data field is an optional part of the DGF block, located after the DGF header. Its size is 32 bits. We need this field to store the vertex offset of the block. With the offset, we can retrieve the global vertex index for each local vertex in a given DGF block. The option for this is called config.enableUserData and needs to be set to true.

Example Config

To sum up, the minimal configuration to get animation is:

DGFBaker::Config bakerConfig = {
.blockForcedOffsetWidth = {16, 16, 16}, // {12, 12, 12}, {8, 8, 8}, or {11, 11, 10}
.generateVertexTable = true,
.enableUserData = true,
.quantizeForAnimation = true,
.clusterDeformationPadding = 1.03f
};

Set Animation Extents

The DGFBaker selects its quantization exponent EE based on the bounding box of the vertex coordinates. Larger coordinates require larger quantization factors to prevent the 24 bit anchors from overflowing. Animated vertices can easily drift outside of the initial bounding box, so a more conservative bound is required.

For animations, we provide a method BakerMesh::SetAnimationExtents to define a custom 3D bounding box. You can use this to tell the DGFBaker what the bounds of your animation are. By this you ensure that the DGF block anchors do not overflow in case the extents are too large to fit in 24 bits.

The following code shows how to set the animation extents:

DGFBaker::BakerMesh mesh(vertices, indices, numVerts, numTris);
DGFBaker::AnimationExtents extents = {};
float extentMin[3] = ...; // compute animation bounding box
float extentMax[3] = ...; // (NOT SHOWN)
extents.Set(extentMin[0],extentMin[1],extentMin[2],
extentMax[0],extentMax[1],extentMax[2]);
// add the animation extents to the baker mesh
mesh.SetAnimationExtents(extents);

If the animation extents are not specified at baking time, the bounding box of the vertices will be used instead, which may not be sufficient.

In our sample application, we compute the exact bounding box of the vertex animation. If the animation is not known in advance, a simple heuristic, such as padding the rest-pose bounding box by an arbitrary factor, can also work well.

Quantization

After animation, each floating-point position requires re-quantization before we can insert it back into the DGF block. We use a separate quantization compute shader to do this efficiently. In this section, we refer to the source code that comes with this blog post.

We launch the compute shader with one thread group for each DGF block and a wave size of 32. We use 32 instead of 64, because the fixed bit width limits the vertex count to be less than 32. While this might not seem to be the most optimal choice, we will show in the Results section, that the runtime of this compute kernel is negligible. Each thread, whose index is gtid (group thread ID), then manipulates one vertex.

Next, we describe the quantization process in more detail. We make use of the LDS (local data share) memory to allow quick read and write access: First, we load an entire DGF block into LDS. Thereby the group thread ID determines the block index.

We need an array of 32 DWORD’s to fit a DGF block with 128 Byte.

groupshared uint blockData[32]; // sizeof(uint) * 32 = 128 Bytes, i.e., size of a DGF block.
...
{
blockData[gtid] = DgfBlockBuffer.Load(blockInfo.blockStartOffset + 4 * gtid);
...
}

As mentioned, we use a fixed vertex-channel width of different sizes. We tightly pack the vertex data, and thus we pack one vertex into multiple DWORD’s with one DWORD being shared by up to three vertices. We get the two DWORD indices a vertex covers using GetIndices. However, manipulating the DWORD shared by two or more vertices requires additional synchronization. We use atomic bit operations (i.e., InterlockedAnd and InterlockedOr) to safely write the respective section of a DWORD. Depending on their byte address, we generate three bit masks for each thread using GetMasks. We compute the masks such that we zero out the vertices that we want to overwrite:

const uint vertexIndex = gtid; // use the group-thread id as vertex index within the DGF block.
const uint vertexByteAddress = (vertexIndex * blockInfo.bitsPerVertex);
const uint3 indices = GetIndices(vertexBitAddress);
const uint3 masks = GetMasks(vertexBitAddress, blockInfo.header.bitsPerComponent);
// Start location in 4 Byte increment
const uint bitsPerByte = 8;
const uint vertexStartIndex = blockInfo.header.bitSize / (bitsPerByte * 4);
if (gtid < blockInfo.header.numVerts)
{
InterlockedAnd(blockData[vertexStartIndex + indices.x], ~masks.x);
InterlockedAnd(blockData[vertexStartIndex + indices.y], ~masks.y);
InterlockedAnd(blockData[vertexStartIndex + indices.z], ~masks.z);
}

After synchronization, we can start quantizing the vertex positions. We fetch a thread’s designated vertex using the vertex table and vertex offset, which is stored in the user data of the DGF block. After that, we quantize the vertex using the unbiased exponent E127E - 127. The header stores the biased exponent EE in the lower parts of the second DWORD, which we extract and remove the bias. We use wave intrinsic function WaveActiveMin to find the minimal value of each channel. This minimum will be used as the new anchors for the block:

int3 quantizedPositions = int3(0x7fffffff, 0x7fffffff, 0x7fffffff);
if (gtid < blockInfo.header.numVerts)
{
const uint globalVertexIndex = VertexTable.Load(blockInfo.header.userData + gtid);
const float3 vertexPosition = Positions.Load(globalVertexIndex);
const uint headerDWORD1 = blockData[1];
const int unbiasedExponent = -((headerDWORD1 & 0xff) - 127);
quantizedPositions = Quantize(vertexPosition, pow(2, unbiasedExponent));
}
const int3 anchor = WaveActiveMin(quantizedPositions);

We adjust the quantized vertex positions by subtracting the anchor from each channel. Then, we combine the three channels into one 64-bit-wide codeword. Since the codeword needs up to three DWORD’s to write, we need to separate it into multiple values and bit-shifting them accordingly. Again, this is dependent on the byte address. Using the bit masks from earlier, we can write the values in their position in the block using atomic or-operations:

if (gtid < blockInfo.header.numVerts)
{
const int3 offsetVert = quantizedPositions - anchor;
const uint64_t xbits = blockInfo.header.bitsPerComponent.x;
const uint64_t xybits = blockInfo.header.bitsPerComponent.x + blockInfo.header.bitsPerComponent.y;
const uint64_t codedVert = offsetVert.x | (offsetVert.y << xbits) | ((uint64_t) offsetVert.z << xybits);
const uint3 values = GetValues(vertexBitAddress, blockInfo.header.bitsPerComponent, codedVert);
InterlockedOr(blockData[vertexStartIndex + indices.x], values.x & masks.x);
InterlockedOr(blockData[vertexStartIndex + indices.y], values.y & masks.y);
InterlockedOr(blockData[vertexStartIndex + indices.z], values.z & masks.z);
}

Finally, we need to update the block anchor, which we store in the block header. Each anchor channel is stored in the upper three bytes of the second, third and fourth DWORD of the header. We use bit masks and bit operations again, to leave other parts of the DWORD unchanged. With WaveIsFirstLane, we make sure that only one thread writes the anchor.

if (WaveIsFirstLane())
{
blockData[1] = (blockData[1] & 0x000000FF) | ((anchor.x << 8) & 0xFFFFFF00);
blockData[2] = (blockData[2] & 0x000000FF) | ((anchor.y << 8) & 0xFFFFFF00);
blockData[3] = (blockData[3] & 0x000000FF) | ((anchor.z << 8) & 0xFFFFFF00);
}

After synchronization, we write the new block data to memory and are finished with the quantization step.

Results

We created a sample that utilizes this approach. In the following table, we break down the impact of the different steps on the frame time: We show results for various models (left column) from the The Utah 3D Animation Repository with varying triangle (second column) and vertex count (third column). The remaining columns show the time in milliseconds of the different stages of our pipeline (see DGF Animation Pipeline Overview). All timings were measured on an AMD Radeon™ RX 7900 XT graphics card.

Triangle CountVertex CountAnimation (ms)Quantization (ms)BVH Build (ms)Ray Trace (ms)
wooddoll537838990.00180.00160.19530.4787
marbles880046200.00180.00160.20200.5254
toasters1114156280.00180.00170.24400.5675
hand1585586360.00190.00180.21430.5280
ben78029414740.00250.00430.41080.7776
24-cell122880637440.00300.00600.45161.0506
fairyforest174117965660.00360.02400.76461.0422
explodingDragon2525721928590.00630.01350.76761.4074

Quantization took less than 1% of the overall frame time, which means this process will not majorly affect rendering times in an animation pipeline. Likewise, animating the position data (Animation) has an almost insignificant contribution to the frametime. BVH Build and Ray Trace dominate the image computation.

Next, we analyze how the fixed-bit channel width impacts the DGF compression capabilities. We measure the triangle density, i.e., the average number of bytes per triangle. In the column Static DGF, you see the density of a single key-frame data, with a target bit-width of 14. This is for reference, to assess how a single key frame would compress with DGF. The column Uncompressed shows the density of a conventional indexed-face set (floats for positions and integers for indices) would require. The remaining columns are triangle densities that we measured with our sample when performing animation. The columns show the different options we used for config.blockForcedOffsetWidth. Numbers in parentheses are the ratios relative to static DGF.

Static DGFUncompressed16 16 1612 12 1211 11 108 8 8
wooddoll6.920.7 (3.0)13.4 (1.9)9.6 (1.4)8.8 (1.3)6.4 (0.9)
hand6.218.5 (3.0)12.1 (2.0)8.6 (1.4)7.6 (1.2)5.7 (0.9)
toasters6.618.1 (2.7)13.6 (2.1)9.9(1.5)8.8 (1.3)6.4 (1.0)
marbles4.818.3 (3.8)10.5 (2.2)7.4 (1.5)6.9 (1.4)4.8 (1.0)
ben5.218.4 (3.5)14.2 (2.7)9.9 (1.9)8.8 (1.7)6.6 (1.3)
fairyforest4.318.7 (4.3)14.0 (3.3)10.1 (2.3)8.9 (2.1)6.7 (1.6)
24-cell6.418.2 (2.8)14.1 (2.2)9.5 (1.5)8.5 (1.3)6.8 (1.1)
explodingDragon5.621.2 (3.8)15.0 (2.7)10.8 (1.9)9.5 (1.7)7.0 (1.3)

As expected, triangle density goes up for animations when using higher config.blockForcedOffsetWidth. This happens because a larger vertex size limits the number of triangles which will fit in a block, and results in more blocks. However, DGF is still capable of drastically reducing the memory footprint for triangle meshes over uncompressed geometry.

Conclusion

We have demonstrated that DGF is capable of handling animations efficiently. In particular, we are able to animate and update the DGF blocks for a 252K triangle scene in roughly 20 microseconds. This is due to the fact that we can access positions of DGF block in a random access fashion, when using bit operations during position update. This makes animating inexpensive and also negligible in terms of run-time performance.

At the same, we can benefit from the DGF compression ratios since triangles mesh become significantly smaller when encoding them with DGF even with animations.

The animation sample can be found in the updated version of the DGF SDK repository.

View references

[1] J. Barczak, C. Benthin, and D. McAllister, “DGF: A Dense, Hardware-Friendly Geometry Format for Lossily Compressing Meshlets with Arbitrary Topologies”, Proc. ACM Comput. Graph. Interact. Tech., 2024.

[2] Q. Meyer, J. Barczak, S. Reitter, and C. Benthin , “Parallel Dense-Geometry-Format Topology Decompression” in Eurographics 2025 - Short Papers, 2024.

[3] T. Akenine-Möller, E. Haines, N. Hoffmann, A. Pesce, M. Iwanicki, S. Hillaire, “Real-Time Rendering 4th Edition”, A K Peters/CRC Press, 2018.

Sander Reitter's avatar

Sander Reitter

Sander Reitter is an AMD Senior Software Development Engineer and after graduating from Coburg university with a master's in computer science, he started working at AMD with the raytracing team. His interests are raytracing, geometry compression and GPU architecture.
Quirin Meyer's avatar

Quirin Meyer

Before becoming a computer graphics professor at Coburg University, Quirin Meyer obtained a Ph.D. in graphics and worked as a software engineer in the industry. His research focuses on real-time geometry processing primarily on GPUs.
Josh Barczak's avatar

Josh Barczak

Josh Barczak is an AMD Fellow focused on GPU raytracing architecture and 3D API evolution. He has been enjoying GPU hardware and software architecture for roughly 6 years. Prior to this he spent time in the game industry as a rendering lead.