Texel Space Shading

Nvidia has been talking about Texel Space Shading support in their Turing GPU. From a distance, this technique somewhat resembles procedural virtual texturing, although the two techniques differ substantially. You might not have heard about procedural VT, so in this blog post we’ll compare the old and the new and discuss how each can be applied to reduce shading work on the GPU and increase frame rate. You can skip the section about procedural VT if you’re only interested in Texel Space Shading.

Procedural Virtual Texturing

This technique builds on virtual texturing (VT). You can read more about that here. In short, a virtual texturing system subdivides the mip maps of textures into small texture tiles (e.g., 128x128 texels) and loads a tile (a subregion of a mip map) into video memory only when that tile is actually visible. The tiles are not loaded into conventional textures but into a cache texture that contains a collection of tiles from many different texture assets. The visibility test is done in screen space. VT supports texture resolutions up to 256,000 x 256,000 pixels. It also requires much less video memory and allows for more artistic freedom than a mip map streaming system. GPU hardware provides some functionality to accelerate VT sampling, under the names Partially Resident Textures (OpenGL) or Tiled Resources (DirectX 11.2).
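To make the tile subdivision concrete, here is a minimal sketch of how many tiles a mip level contains and which tile a UV coordinate falls into. It assumes a square power-of-two texture, 128x128 tiles and no tile borders; the function names are hypothetical and not taken from any particular VT implementation.

```python
import math

TILE_SIZE = 128  # texels per tile side (the example size used above)


def tiles_per_side(texture_size, mip, tile_size=TILE_SIZE):
    """Number of tiles along one side of the given mip level."""
    mip_size = max(texture_size >> mip, 1)
    return max(math.ceil(mip_size / tile_size), 1)


def tile_for_uv(u, v, texture_size, mip, tile_size=TILE_SIZE):
    """Return (mip, tile_x, tile_y) for a UV coordinate in [0, 1]."""
    n = tiles_per_side(texture_size, mip, tile_size)
    tile_x = min(int(u * n), n - 1)   # clamp so u == 1.0 stays in range
    tile_y = min(int(v * n), n - 1)
    return (mip, tile_x, tile_y)
```

A screen-space visibility pass would call something like `tile_for_uv` per pixel (with the mip level derived from the UV derivatives) and collect the resulting tile ids as the set of tiles to make resident.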

A Procedural Virtual Texturing (PVT) system uses the same tile texture cache in GPU memory but generates the texture tiles on the fly instead of loading the tiles from disk. It has been used mostly in terrain shaders that blend (splat) many materials using some form of material mask. Conventionally you would execute the terrain shader for every pixel during the forward or deferred rendering pass, every frame. A PVT system computes the unlit terrain shader (material blending) per texture tile and stores that tile in the VT cache. The cache contains all the channels that are needed during the actual render pass (diffuse, normal, roughness, etc.). The tile is reused for many frames, and therefore the cost of these terrain calculations is amortized over many frames. During the forward or deferred pass you perform a VT texture sample instead of generating the terrain texels.
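The amortization can be illustrated with a small cache that generates a tile only on a miss. This is a sketch under assumptions: the `TileCache` class is hypothetical, the eviction policy (least recently used) is just one reasonable choice, and the `generate` callback stands in for the unlit terrain blend that would really run on the GPU.

```python
from collections import OrderedDict


class TileCache:
    """Fixed-capacity tile cache that generates missing tiles on demand."""

    def __init__(self, capacity, generate):
        self.capacity = capacity
        self.generate = generate      # stands in for the unlit terrain shader
        self.tiles = OrderedDict()    # tile_id -> tile data, oldest first
        self.generated = 0            # how many tiles were actually computed

    def fetch(self, tile_id):
        if tile_id in self.tiles:
            self.tiles.move_to_end(tile_id)   # mark as recently used
            return self.tiles[tile_id]
        tile = self.generate(tile_id)         # cache miss: run the expensive blend
        self.generated += 1
        self.tiles[tile_id] = tile
        if len(self.tiles) > self.capacity:
            self.tiles.popitem(last=False)    # evict the least recently used tile
        return tile
```

Fetching the same tile for a hundred frames runs `generate` once; the remaining fetches are plain lookups, which is where the saving over per-pixel, per-frame blending comes from.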

You could also use a “classic” VT system by baking the whole terrain into a huge texture, but this quickly results in massive files on disk. Procedural VT does have a runtime cost, so the complexity of the terrain shader still needs to be constrained. Also, to save on video memory the tile is typically encoded to a DXT/BC GPU texture format, which adds to the tile generation time.
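The memory saving from block compression is simple arithmetic: BC formats store each 4x4 texel block in 8 or 16 bytes. The sketch below compares one tile in uncompressed RGBA8 against a few BC formats; the helper name is hypothetical and the tile is assumed to be square.

```python
BLOCK_BYTES = {"BC1": 8, "BC3": 16, "BC5": 16, "BC7": 16}  # bytes per 4x4 block


def tile_bytes(tile_size, fmt):
    """Size in bytes of one square tile, uncompressed RGBA8 or block-compressed."""
    if fmt == "RGBA8":
        return tile_size * tile_size * 4
    blocks_per_side = (tile_size + 3) // 4   # round up to whole 4x4 blocks
    return blocks_per_side * blocks_per_side * BLOCK_BYTES[fmt]
```

For a 128x128 tile this gives 65,536 bytes for RGBA8, 16,384 bytes for BC7 and 8,192 bytes for BC1, so the cache holds four to eight times as many tiles at the price of an encoding step per generated tile.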

Battlefield 3 is probably one of the first titles to use procedural VT. You can find the GDC presentation here. Far Cry 4 used an adaptive virtual texture method that allows for even higher VT resolutions by dynamically lowering the VT space occupancy for distant world areas. A GPU-bound application could generate the texture tiles on the CPU, although GPU-side generation is the obvious approach. The tile generation budget can easily be configured, which makes the per-frame cost predictable. Lowering this budget increases the number of frames during which some screen pixels are displayed with less detailed (higher mip) texture data, until the requested tiles have been generated.
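The budget and its fallback behaviour can be sketched as follows. This is an illustration under assumptions, not how any of the titles above implement it: tile ids are hypothetical `(mip, x, y)` tuples, a set stands in for the VT cache, and the parent of a tile is assumed to be the tile at half the coordinates one mip level up.

```python
from collections import deque


def generate_with_budget(requests, resident, budget_per_frame):
    """Move at most `budget_per_frame` requested tiles into `resident` this frame.

    `requests` is a deque of (mip, x, y) tile ids, `resident` is the set of
    tile ids already available. Returns the number of tiles generated.
    """
    done = 0
    while requests and done < budget_per_frame:
        tile = requests.popleft()
        if tile not in resident:
            resident.add(tile)   # stands in for the real tile generation
            done += 1
    return done


def best_resident_mip(tile, resident, max_mip):
    """Walk up the mip chain until a resident ancestor tile is found."""
    mip, x, y = tile
    while mip <= max_mip:
        if (mip, x, y) in resident:
            return mip
        mip, x, y = mip + 1, x // 2, y // 2   # parent tile covers a 2x2 block
    return None
```

With a budget of two tiles per frame, a pixel that wants tile `(0, 4, 5)` is rendered from its coarser ancestor until its own tile reaches the front of the queue.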

Texel Space Shading

A forward renderer rasterizes the geometry to screen pixels, executes a pixel shader for each of these pixels and sends the shaded pixels to the framebuffer. During the shader execution, typically a few textures are sampled and used as input for the lighting equation. When using Texel Space Shading (TSS, which Nvidia calls Texture Space Shading), the renderer does not shade after rasterization but only records which texels are accessed. A separate pass computes the shading values and stores them as texels in texture space. To render the final frame, the geometry is rendered using a simple shader that fetches the shaded texel for each screen pixel using a single texture lookup. Using this approach, the sample visibility (rasterization and z-testing) and the sample calculation (shading) are decoupled and can run at completely independent rates. You can find a Eurographics short paper from 2016 here with the overview of the entire technique. Here’s also a blog post on a DirectX 11 implementation.
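The decoupling can be shown with a toy CPU model of the three passes. It is a sketch under heavy assumptions: point sampling on a single mip level, one object, dictionaries instead of GPU resources, and a hypothetical `shade` callback standing in for the lighting computation.

```python
def texel_space_shade(pixel_uvs, tex_size, shade):
    """Toy three-pass TSS: `pixel_uvs` maps pixel -> (u, v) from rasterization."""
    def texel_of(uv):
        u, v = uv
        return (min(int(u * tex_size), tex_size - 1),
                min(int(v * tex_size), tex_size - 1))

    # Pass 1: visibility only. Record which texels are touched, shade nothing.
    requested = {texel_of(uv) for uv in pixel_uvs.values()}

    # Pass 2: shade each requested texel exactly once, in texture space.
    shaded = {texel: shade(texel) for texel in requested}

    # Pass 3: cheap resolve. One lookup per screen pixel.
    image = {pixel: shaded[texel_of(uv)] for pixel, uv in pixel_uvs.items()}
    return image, len(shaded)
```

The second return value is the number of shader invocations. It depends on how many distinct texels are visible, not on how many screen pixels there are, which is what allows the shading rate to be chosen independently of the screen resolution.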

Figure: Texture Space Shading (TSS) rasterization and rendering steps (image from devblogs.nvidia.com).

Nvidia announced support for TSS for their new Turing GPU. They go into the details in their devblog. You can find their presentation at SIGGRAPH below. TSS has three main stages in the pipeline: first identify the texels to shade, then shade these texels and finally apply these texels to the image. The texture filtering mode (trilinear, anisotropic, etc.) determines which texels actually need to be shaded, and the texture sampling hardware already identifies these as part of the sampling routine. Turing has a new hardware feature that returns a list of texels touched by a texture sampling function. They call this the texture footprint. This removes the shader calculations required to manually compute the footprint and makes implementing this technique less complex. The texture footprint is returned as a bit mask: 64 bits that map to an 8x8 grid for 2D textures and a 4x4x4 grid for 3D textures.
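A 64-bit footprint mask for the 2D case can be modelled in a few lines. The bit layout used here (bit index = y * 8 + x) is an assumption made for illustration; the text above does not specify how the hardware orders the bits, and the helper names are hypothetical.

```python
GRID = 8  # 8x8 cells -> 64 bits


def footprint_mask(cells):
    """Pack (x, y) cells, each in 0..7, into a 64-bit mask (bit = y * 8 + x)."""
    mask = 0
    for x, y in cells:
        if not (0 <= x < GRID and 0 <= y < GRID):
            raise ValueError(f"cell {(x, y)} outside the 8x8 footprint grid")
        mask |= 1 << (y * GRID + x)
    return mask


def cells_from_mask(mask):
    """Unpack a 64-bit mask back into a sorted list of (x, y) cells."""
    return sorted((bit % GRID, bit // GRID)
                  for bit in range(GRID * GRID) if (mask >> bit) & 1)
```

A bilinear sample touching a 2x2 neighbourhood sets four bits, while a wide anisotropic sample sets a longer streak of bits along the axis of anisotropy.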

The concept of a status bit surface is introduced to keep a list of unique texels that need to be shaded. The texture footprint from each texture sample is used to update that status bit surface. The status bit surface is divided into a grid of 64-bit cells. The texture fetch returns a 64-bit texture footprint that maps to one of those cells. In other words, the footprint is aligned to that grid, so it’s very efficient to update the status bit surface. To reduce contention between atomic operations, Nvidia introduced new warp-wide operations that merge the footprint results so that only one thread needs to perform an atomic operation.
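The effect of merging footprints before writing can be sketched on the CPU. This does not model the actual warp-wide intrinsics; it only shows the bookkeeping they achieve. The status bit surface is assumed to be a flat list of 64-bit integers, and each sample is a hypothetical `(cell_index, mask)` pair.

```python
from collections import defaultdict


def merge_footprints(samples):
    """Combine per-thread (cell_index, mask) pairs so each cell gets one update.

    This mimics the effect of the warp-wide merge: threads hitting the same
    cell OR their masks together before anything touches memory.
    """
    merged = defaultdict(int)
    for cell_index, mask in samples:
        merged[cell_index] |= mask
    return dict(merged)


def update_status_surface(surface, samples):
    """OR merged footprints into `surface`; return the number of cell writes."""
    merged = merge_footprints(samples)
    for cell_index, mask in merged.items():
        surface[cell_index] |= mask   # stands in for one atomic OR per cell
    return len(merged)
```

Because neighbouring pixels usually sample neighbouring texels, many threads in a warp land in the same cell, so the number of writes tends to be far lower than the number of samples.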

The actual shading of the texels is done in a compute shader. Because the derivatives are not known in a compute shader, the regular Texture.Sample function cannot be used there, so Nvidia introduced compute_shader_derivatives. This allows us to treat four consecutive threads as a pixel quad and calculate the derivatives on a 2x2 pixel block. All this added functionality in the Nvidia Turing cards should allow us to efficiently implement Texture Space Shading in our renderer. Find more information on the new functionality and the extensions for Vulkan and OpenGL here.
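What the sampler needs from the quad can be written out as finite differences. The sketch below is a CPU illustration, not shader code: the thread-to-quad layout is an assumption, the derivatives are the coarse variant (one value per quad), and the mip selection is the simple isotropic formula rather than what any specific hardware does.

```python
import math


def quad_derivatives(uv):
    """Finite-difference derivatives for four threads treated as a 2x2 quad.

    `uv` holds the (u, v) values of threads 0..3, assumed to be laid out as
    0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
    """
    ddx = (uv[1][0] - uv[0][0], uv[1][1] - uv[0][1])   # horizontal neighbour
    ddy = (uv[2][0] - uv[0][0], uv[2][1] - uv[0][1])   # vertical neighbour
    return ddx, ddy


def mip_level(ddx, ddy, tex_size):
    """Isotropic mip selection: log2 of the longer derivative, in texels."""
    len_x = math.hypot(ddx[0] * tex_size, ddx[1] * tex_size)
    len_y = math.hypot(ddy[0] * tex_size, ddy[1] * tex_size)
    return max(0.0, math.log2(max(len_x, len_y, 1e-12)))
```

If neighbouring threads are two texels apart horizontally and four texels apart vertically, the longer step is four texels and the selected level is log2(4) = 2.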

Conclusion

Procedural VT and TSS are both performance optimization techniques: by caching results and decoupling work from the main render pass, shader execution is reduced.

Procedural VT is mainly used to cache complex terrain shaders that blend many complex materials. The cache contains texture tiles with a typical size of 256x256 texels and includes all the material channels (diffuse, normal, etc.). The tiles are typically kept in the cache for hundreds or thousands of frames and don’t include lighting information.

Texture Space Shading is a newer technique that caches the actual shaded object results (lit color samples) for any object in the scene. The main use case for TSS seems to be performance optimization for stereo rendering, because most of the cached texels can be shared between both eyes every frame. TSS also offers more controls to trade quality for a stable frame rate, either by lowering the shading rate or by reusing shaded texels across multiple frames. The shaded texels are used for one or a few frames only, because they also contain lighting information that is quickly outdated. Finally, the new Turing instructions also allow optimizing other techniques similar to TSS, such as subsurface scattering using texture space diffusion.

This blog post gives a good overview of the technical pros and cons of TSS. However, the impact on memory usage is not immediately clear and can be quite high. For each object, the result texture (shaded texels), the triangle index texture and the status bit surface need to be stored. A potential solution is to keep only a sparse representation in memory instead of allocating the entire result texture, since only the texels that are actually shaded are needed. If that is impractical, freeing up memory by applying a virtual texturing system to your material textures could also be a solution.
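A back-of-the-envelope estimate shows how the three per-object resources add up. Every format here is an assumption chosen for illustration, not something taken from Nvidia's material: a 16-bit float RGBA result texture (8 bytes per texel), a 32-bit triangle index per texel, one status bit per texel, and a full mip chain for all three.

```python
def mip_chain_texels(size):
    """Total texel count of a square texture including its full mip chain."""
    total = 0
    while size >= 1:
        total += size * size
        size //= 2
    return total


def tss_memory_bytes(size, result_bytes_per_texel=8, index_bytes_per_texel=4):
    """Rough per-object TSS memory: result + triangle index + status bits."""
    texels = mip_chain_texels(size)
    return {
        "result": texels * result_bytes_per_texel,
        "triangle_index": texels * index_bytes_per_texel,
        "status_bits": (texels + 7) // 8,   # one bit per texel
    }
```

Under these assumptions a single object with a 2048x2048 shading texture needs about 65 MiB when fully allocated, almost all of it in the result and triangle index textures, which is why a sparse allocation is attractive.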

