For GPU-side dynamically generated data structures which need 3D spherical mappings, two of the most useful mappings are cubemaps and octahedral maps. This post explores the overhead of both mappings.

Cubemaps

Sampling from a cubemap involves both VALU work and a VMEM instruction. In GL, AMD_gcn_shader provides access to some of these normally hidden VALU instructions in the form of cubeFaceIndexAMD() and cubeFaceCoordAMD(). These can be useful, for example, when doing image stores to a layered image representing cube faces. Disassembling the simple HLSL shader below provides details on the VALU work.

TextureCube t; SamplerState s;
float4 main(float3 p : TEXCOORD) : SV_Target { return t.Sample(s, p); }

This disassembles to the following VALU and VMEM instructions.

v_cubetc_f32  v1, v2, v3, v0     // v1 = face t coordinate
v_cubesc_f32  v4, v2, v3, v0     // v4 = face s coordinate
v_cubema_f32  v5, v2, v3, v0     // v5 = 2.0 * major axis
v_cubeid_f32  v6, v2, v3, v0     // v6 = face index (0 to 5)
v_rcp_f32     v2, abs(v5)        // v2 = 1.0 / abs(2.0 * majorAxis)
s_mov_b32     s0, 0x3fc00000     // s0 = 1.5
v_mad_legacy_f32  v5, v1, v2, s0 // v5 = faceT / abs(2.0 * majorAxis) + 1.5
v_mad_legacy_f32  v4, v4, v2, s0 // v4 = faceS / abs(2.0 * majorAxis) + 1.5
image_sample  v[0:3], v[4:7], s[4:11], s[12:15] dmask:0xf
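
As a CPU-side cross-check of the VALU work above, here is a minimal Python sketch of the face selection and the 1.5 bias. It follows the standard OpenGL cube map face selection table. The function name cube_face_coords and the tie-breaking order when two axes have equal magnitude are assumptions for illustration, not a statement of exact hardware behavior.

```python
def cube_face_coords(x, y, z):
    """Return (s, t, face) with s and t biased into the range 1.0 to 2.0."""
    ax, ay, az = abs(x), abs(y), abs(z)
    if az >= ax and az >= ay:      # Z is the major axis (faces 4 and 5)
        face = 5 if z < 0.0 else 4
        sc = -x if z < 0.0 else x
        tc = -y
        ma = 2.0 * z
    elif ay >= ax:                 # Y is the major axis (faces 2 and 3)
        face = 3 if y < 0.0 else 2
        sc = x
        tc = -z if y < 0.0 else z
        ma = 2.0 * y
    else:                          # X is the major axis (faces 0 and 1)
        face = 1 if x < 0.0 else 0
        sc = z if x < 0.0 else -z
        tc = -y
        ma = 2.0 * x
    inv = 1.0 / abs(ma)            # the reciprocal of abs(2.0 * majorAxis)
    return sc * inv + 1.5, tc * inv + 1.5, face

print(cube_face_coords(1.0, -1.0, 0.5))  # (2.0, 1.25, 3)
```

Subtracting 1.0 from the returned s and t gives the usual {0 to 1} face coordinates.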

The 1.5 constant is chosen so that the output face coordinates (v4 and v5 in the above example) fall in the range {1.0 <= x < 2.0}. This has an advantage in bit encoding compared to {0.0 <= x < 1.0}: the sign and exponent bits are constant throughout the entire output range, so only the mantissa bits vary.
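
The bit encoding property can be checked on the CPU. The sketch below is illustrative Python with hypothetical helper names (float_bits, mantissa_fixed_point). It shows that every 32-bit float in {1.0 <= x < 2.0} shares the same bits above the mantissa (sign and exponent), so the mantissa alone acts as a 23-bit fixed-point fraction.

```python
import struct

def float_bits(value):
    """Return the IEEE 754 binary32 bit pattern of value as an integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def mantissa_fixed_point(coord):
    """Extract the 23 mantissa bits of a coordinate in the range 1.0 to 2.0."""
    return float_bits(coord) & 0x007FFFFF

for coord in (1.0, 1.25, 1.5, 1.999):
    # The top 9 bits (sign + exponent) stay at 0x07F for every value here.
    print(f"{coord:<6} bits=0x{float_bits(coord):08X} top9=0x{float_bits(coord) >> 23:03X}")
```

Note that 1.5 encodes as 0x3FC00000, which is the constant loaded by s_mov_b32 in the disassembly.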

Total VALU overhead is 10 ops (v_rcp_f32 counts as 4 ops). When estimating shader cost it is often useful to think in terms of the GPU’s op:byte:tex ratio, where op represents VALU instructions, byte represents bytes of bandwidth, and tex represents simple 2D 32-bit per pixel texture fetch VMEM instructions. Numbers for Fury Nano in giga-units per second are 4096:512:256 (op:byte:tex), which reduces to the ratio 16:2:1. Note flop = op * 2, as one FMA or MAD is 2 flops.
 
Ratios for other AMD GPUs can be derived from the specifications listed on Wikipedia. With 16 VALU ops available per texture fetch, a 10 op VALU overhead for a cubemap fetch could be around 62.5% of the VALU capacity during the VMEM fetch instruction (assuming cache hits, actual results will vary).
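
The arithmetic behind these figures is simple enough to script. This is a small Python sketch using the Fury Nano numbers quoted above. The helper name valu_share is hypothetical, and the result is only a back-of-the-envelope estimate under the same cache-hit assumption.

```python
# Fury Nano throughput in giga-units per second, as quoted in the text.
OPS, BYTES, TEX = 4096, 512, 256

def valu_share(valu_ops, ops_per_tex=OPS / TEX):
    """Fraction of the VALU ops available per texture fetch that get used."""
    return valu_ops / ops_per_tex

print(OPS // TEX, BYTES // TEX, TEX // TEX)  # 16 2 1
print(valu_share(10))                        # 0.625
```

Swapping in the op and tex rates of another GPU gives the equivalent estimate for that part.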

Octahedron Maps

Cubemaps are great for filtered lookups, but have a disadvantage when point sampling and doing manual filtering: it is very complex to robustly sample a texel neighborhood via 2D texel offsets.
 
An alternative is to use an octahedral mapping as described by Krzysztof Narkowicz’s Octahedron Normal Vector Encoding blog post and others. The eight-sided octahedron is flattened and unwrapped into a 2D square. The octahedral mapping from un-normalized {x,y,z} coordinates to normalized {x,y} coordinates in the range of {-1 to 1} can be done as follows.

// 2 temp/return VGPRs
// 2 temp SGPRs (one bool)
// 17 VALU ops
float2 Oct3To2(float3 n) {
  float tx,ty;
  bool neg;
  // project into 2D
  tx = abs(n.x) + abs(n.y);
  tx = tx + abs(n.z);
  tx = rcp(tx); // counts for 4 VALU ops
  n.x = n.x * tx;
  n.y = n.y * tx;
  // unfold if on other half in Z
  // n.xy output stays in the range {-1.0 to 1.0}
  tx = 1.0 - abs(n.y);
  neg = n.x < 0.0;
  tx = neg ? -tx : tx;
  ty = 1.0 - abs(n.x);
  neg = n.y < 0.0;
  ty = neg ? -ty : ty;
  neg = n.z <= 0.0;
  n.x = neg ? tx : n.x;
  n.y = neg ? ty : n.y; 
  return n.xy; }

The above shader code is written with a 1:1 mapping to the output disassembly. Fetching from the texture takes an extra scale and bias from {-1 to 1} to {0 to 1}, 2 VALU ops, for a grand total of 19 VALU ops. For a single point-sampled texture fetch this makes the octahedral map almost 2x as expensive as the cubemap.
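
For testing the mapping outside of a shader, here is a CPU-side Python sketch. oct3to2() mirrors the steps of Oct3To2() above and oct_uv() applies the extra scale and bias. The decode function oct2to3() is an assumption that is not part of the shader code above: it is the usual inverse of the octahedral mapping, included only so a round trip can be checked.

```python
import math

def oct3to2(x, y, z):
    """Un-normalized direction to octahedral coordinates in {-1 to 1}."""
    inv = 1.0 / (abs(x) + abs(y) + abs(z))  # project into 2D
    x *= inv
    y *= inv
    if z <= 0.0:                            # unfold the other half in Z
        tx = 1.0 - abs(y)
        ty = 1.0 - abs(x)
        x, y = (-tx if x < 0.0 else tx), (-ty if y < 0.0 else ty)
    return x, y

def oct_uv(x, y, z):
    """Octahedral coordinates scaled and biased into {0 to 1}."""
    u, v = oct3to2(x, y, z)
    return u * 0.5 + 0.5, v * 0.5 + 0.5

def oct2to3(x, y):
    """Assumed inverse: octahedral {-1 to 1} coordinates to a unit direction."""
    z = 1.0 - abs(x) - abs(y)
    if z < 0.0:                             # fold the corners back down
        x, y = ((1.0 - abs(y)) * (1.0 if x >= 0.0 else -1.0),
                (1.0 - abs(x)) * (1.0 if y >= 0.0 else -1.0))
    length = math.sqrt(x * x + y * y + z * z)
    return x / length, y / length, z / length

print(oct_uv(0.0, 0.0, 1.0))  # (0.5, 0.5), straight up lands at the center
```

A round trip through oct3to2() and oct2to3() should return the normalized input direction to within floating-point error.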
 
Also note that, assuming cache hits and returning to the 16:1 (op:tex) ratio, it can in theory be more expensive to generate the coordinates for the octahedral map than to fetch from the texture. Barring the case where the offset wraps over the texture’s edge, the above Oct3To2() * 0.5 + 0.5 texture coordinate will just work with 2D texel offsets.
 
Unfortunately GPUs don’t have an octahedral wrapping mode. However, the mirrored repeat wrapping mode can be used with some VALU work to emulate one.

// Check for offset over a texture edge along one axis,
//   2 temp/return SGPRs (one bool)
//   1 VALU op
bool OctFlipped(float r) {
  return abs(r) >= 1.0; }

// Example of computing mirrored repeat sampling
// of an octahedron map with a small texel offset.
// Note this is not designed to solve the double wrap case.
// The "base" is as computed by Oct3To2() above.
// Crossing an edge along one axis negates the other axis,
// mirrored repeat then reflects the crossing axis itself.
float2 coord = base + float2(-2.0, 2.0);    // 2 VALU
bool flip = OctFlipped(coord.y);            // 1 VALU
coord.x = flip ? -coord.x : coord.x;        // 1 VALU
flip = OctFlipped(coord.x);                 // 1 VALU
coord.y = flip ? -coord.y : coord.y;        // 1 VALU
coord = coord * 0.5 + 0.5;                  // 2 VALU

Offset texel fetches into an octahedral map are just 8 VALU ops after the cost of the first fetch.
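
An edge-crossing rule can be sanity-checked on the CPU by decoding the wrapped coordinate and confirming it lands next to the base direction on the sphere. The Python sketch below is hypothetical and self-contained. It assumes the seam rule that a coordinate crossing an edge along one axis has its other axis negated while mirrored repeat reflects the crossing axis, it emulates mirrored repeat addressing in software, and it uses an assumed decode (oct2to3) that is not part of the shader code above.

```python
import math

def oct2to3(x, y):
    """Assumed inverse: octahedral {-1 to 1} coordinates to a unit direction."""
    z = 1.0 - abs(x) - abs(y)
    if z < 0.0:
        x, y = ((1.0 - abs(y)) * (1.0 if x >= 0.0 else -1.0),
                (1.0 - abs(x)) * (1.0 if y >= 0.0 else -1.0))
    length = math.sqrt(x * x + y * y + z * z)
    return x / length, y / length, z / length

def mirrored_repeat(u):
    """Software emulation of mirrored repeat for a normalized coordinate."""
    u = math.fmod(abs(u), 2.0)
    return 2.0 - u if u > 1.0 else u

def fold_and_address(x, y):
    """Fold a {-1 to 1} coordinate that crossed at most one edge."""
    flip_x = abs(y) >= 1.0          # crossed a Y edge, so negate X
    flip_y = abs(x) >= 1.0          # crossed an X edge, so negate Y
    x = -x if flip_x else x
    y = -y if flip_y else y
    u = mirrored_repeat(x * 0.5 + 0.5)
    v = mirrored_repeat(y * 0.5 + 0.5)
    return u * 2.0 - 1.0, v * 2.0 - 1.0  # back to {-1 to 1} for decoding

base = (0.98, 0.4)
wrapped = fold_and_address(base[0] + 0.04, base[1])  # steps over the X edge
a, b = oct2to3(*base), oct2to3(*wrapped)
print(sum(p * q for p, q in zip(a, b)))  # close to 1.0 for neighboring directions
```

Coordinates that stay inside the square pass through unchanged, so the same path can serve every tap of a small filter kernel.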


Other posts by Timothy Lottes