Voxelish Thoughts

My voxel rendering system

Searching for Voxel Rendering systems online will provide you with several thousand different ways that people have gone about putting cubes on the screen. This is mine.

The plan

I believe that the best way to avoid scope creep when experimenting is to lock yourself into an API early. Laying out your constraints means you have to innovate only within the box you give yourself, holding yourself somewhat to the task you set out to do. Here is my list:

  • We’re only rendering axis-aligned voxels. No entities which can have arbitrary meshes, and no blocks with a more complicated mesh like fence posts or doors. And no particles.
  • Voxels can be emissive or transparent, and can refract, reflect and diffuse light, but each has only a single solid colour – there are no textures. Every material property is uniform across the voxel, and voxels don’t store normals either.
  • Voxels can be scaled to powers of 2 arbitrarily, but must be aligned to an integer multiple of their size on every axis.
  • The GPU should store and do almost everything. The CPU should send only what changed since last frame.
  • The renderer must work on Web, using WebGPU.

These constraints might seem too restrictive, especially if you’re looking at them from a Minecraft perspective, but I would argue that one person’s constraint is another person’s art style. Hopefully, by the end of this blog series, you’ll think the same.

Once your constraints are set, the next step is to establish what you’d like to have in spite of your constraints. I call these ‘desires’, and here are mine:

  • The core rendering system should run at 60fps on mobile, and, without changing the data sent to the GPU by the CPU (i.e. while maintaining the constraints), optional graphical features should be available to enhance visual quality on higher-end devices.
  • The view distance should be so large that no user would notice that there is a view-distance cap at all. Atmospheric effects and the curvature of the earth should limit visibility before the distance cap does.
  • On high-end devices, some light effects should be ray-based.

These are pretty lofty goals, but by having high aims with tight constraints you lock yourself into a design space with no room for tangents.

As an aside, my above constraints and desires outline what is often called a ‘microvoxel’ game. Microvoxel Discord servers and YouTube comments house incredibly opinionated people, myself included. This makes planning a rendering architecture using research difficult, as everyone has an opinion but very few have a demo.

[Image: the cover of the video game "Teardown"]

Teardown is a microvoxel game which exists, meaning we can take sound inspiration from it.

I’ve tried to keep this in mind when creating my rendering architecture, and have mostly ignored ideas which have not been proven to work. The key exception is my sparse tree, which is one of the more controversial data structures to put on the GPU. If you read this, take into account whether I have a demo or not, as I may be yet another opinionated stranger on the internet.

Asides aside, using my constraints and my desires, we can draw some conclusions.

SVOs

  • If we only send the changed voxels to the GPU each frame, the GPU needs to store all visible voxels in VRAM.
  • If we want a huge view distance, we need to compress voxel data to be able to store the entire world in GPU memory.
  • Much of a 3D world is either air (empty), or not actively visible (indistinguishable from empty).
  • Better compression systems assign fewer bits to states with a higher probability of occurring (lower information content).

Therefore, air and hidden regions should be given special representation states.

Even Minecraft (in its un-optimised glory) does this to an extent, representing 16x16x16 sub-chunks that are entirely air with a single special value.

If we take this to the extreme, we can imagine storing our voxel world as a tree, where the root node encompasses the whole world, and then each child covers some smaller subdivision. To optimise the storage of empty areas, each child can be marked either as an entirely empty leaf, a leaf containing a single voxel, or a parent which is then further subdivided into children. This is a sparse tree.
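
As a minimal sketch of the idea – in TypeScript, with purely illustrative names – a node of such a tree might look like this:

```ts
// One node of a sparse voxel tree. Each region of space is either entirely
// empty, filled by a single voxel, or subdivided further into children.
type Node =
  | { kind: "empty" }                      // leaf: the region is all air
  | { kind: "voxel"; material: number }    // leaf: one voxel fills the region
  | { kind: "parent"; children: Node[] };  // interior: region is subdivided
```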

Even within microvoxel forums, the conclusion that sparse trees best compress sparse volumetric data isn’t controversial. The controversial bit is rendering from them.

Unfortunately, GPUs are not great at performing multiple indirect memory accesses in sequence. In fact, GPUs aren’t great at performing memory accesses at all: they mask their huge memory latency by keeping many threads in flight at once, interleaving computation with reads and writes to stay busy. This makes recursive data structures like trees – where there is very little computation to perform between reading each child – very slow.

So we optimise.

To make something faster, first do less of it, and then do it more efficiently. The traditional sparse voxel tree structure is an Octree (oct meaning eight), where each parent holds 8 children, subdividing each axis into 2 children for each step down the tree. This maximises compression, but also maximises the number of reads required to traverse down the tree.

As a first optimisation, my sparse tree subdivides each axis into 4, meaning a parent has 64 children total. This halves the number of reads required to get to a leaf, in exchange for reducing compression opportunities. From my experiments this factor gives a good tradeoff for both CPU and GPU sparse trees.
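
To make the ‘4 per axis’ concrete, here is a hedged sketch of the child-index arithmetic (names illustrative): each level of descent consumes two bits of each coordinate, where an octree would consume one.

```ts
// Child index inside a 4x4x4 parent: two bits per axis per level, so a
// 64-tree descends in half as many steps as an octree of equal depth.
function childIndex(x: number, y: number, z: number, level: number): number {
  const shift = 2 * level; // `level` counts levels above the leaves
  const cx = (x >> shift) & 3;
  const cy = (y >> shift) & 3;
  const cz = (z >> shift) & 3;
  return cx + cy * 4 + cz * 16; // 0..63
}
```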

Note that, abstractly, this creates a sparsity–indirection spectrum: at one end sit octrees, with maximum sparsity but maximum indirections; at the other sit dense 3D arrays, with zero sparsity and zero indirections. I’ve seen some very performant voxel engines which use a dense voxel grid to minimise lookup time, but in exchange they usually have comparatively small view distances. If you are using this post as inspiration for your own engine in the future, I’d advise you to consider more points on this spectrum, as 512 or even 4096 children may provide a better tradeoff on future hardware.

A second cheap optimisation is outlined in Laine et al. 2010 – rather than having leaves store whether they themselves contain any data, we have parents store which of their 64 children contain any data. This reduces the number of indirect reads by one for every leaf fetch, as we can exit tree traversal at the parent rather than at the child.
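
A hedged sketch of what that looks like, assuming a hypothetical node layout in which the 64-bit occupancy mask is split into two 32-bit words (WebGPU buffers have no native 64-bit integers):

```ts
// A parent stores which of its 64 children contain any data. Bit i set
// means child i is non-empty; if the bit is clear, we never fetch the child.
interface ParentNode {
  occupancyLo: number; // occupancy bits for children 0..31
  occupancyHi: number; // occupancy bits for children 32..63
  firstChild: number;  // index of the first stored (non-empty) child
}

function childOccupied(node: ParentNode, child: number): boolean {
  const word = child < 32 ? node.occupancyLo : node.occupancyHi;
  return ((word >>> (child & 31)) & 1) === 1;
}
```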

Storing trees with more children also lets us apply a brickmap traversal optimisation to skip some node traversals: For each cell in the 4x4x4 region, we calculate a mask of every cell accessible across some set of directions (up, down, etc.). We can then use bitmasking to check if all of the remaining cells that we might intersect with are empty in a single operation, and if they are then we can skip to the end of the parent immediately.
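
As a hedged sketch, assuming a precomputed table (hypothetical layout) that maps an entry cell and quantized direction to the set of cells a ray could still pass through, the skip test reduces to a bitwise AND against the parent’s occupancy mask:

```ts
// reachable[cell * DIRS + dir] holds the mask of every cell a ray entering
// at `cell` and travelling along quantized direction `dir` could still
// visit. Built offline; split into two 32-bit words as before.
const DIRS = 6; // illustrative quantization: the cardinal directions
const reachableLo = new Uint32Array(64 * DIRS);
const reachableHi = new Uint32Array(64 * DIRS);

function canSkipParent(occLo: number, occHi: number, cell: number, dir: number): boolean {
  const i = cell * DIRS + dir;
  // If no reachable cell is occupied, jump straight to the parent's far side.
  return (occLo & reachableLo[i]) === 0 && (occHi & reachableHi[i]) === 0;
}
```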

Some small tweaks that I made to this method include quantizing directions to some set other than the cardinal directions, quantizing at a higher resolution, and keeping a running mask of all accessible cells during a single parent’s traversal. These modifications work together to calculate a more precise set of accessible cells for a given quantization resolution, further increasing the opportunities for a ray to do less work.

We can also assume that there will be barely any duplication within the geometry of the top set of nodes – e.g. no two 1024x1024x1024 regions will be absolutely identical. This means that, beyond some size, storing geometry as a DAG, or even as a sparse tree, isn’t really beneficial at all, and so flattening the top n layers of the tree into one dense grid won’t cost us much storage space, and will decrease the number of pointer indirections that we need to traverse from the top to the bottom of the world.
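
A minimal sketch of what that flattening buys, with illustrative sizes: the top of every lookup becomes plain array indexing rather than n pointer chases.

```ts
// The top layers collapsed into one dense grid of node indices. World
// coordinates are divided by the side length of a top-level region, then
// indexed directly - no pointers until we're inside a region.
const TOP_GRID = 64;      // illustrative: 64^3 top-level regions
const REGION_SIZE = 1024; // illustrative side length of each region

function topLevelNode(grid: Uint32Array, wx: number, wy: number, wz: number): number {
  const rx = Math.floor(wx / REGION_SIZE);
  const ry = Math.floor(wy / REGION_SIZE);
  const rz = Math.floor(wz / REGION_SIZE);
  return grid[rx + ry * TOP_GRID + rz * TOP_GRID * TOP_GRID];
}
```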

Caching

Now that we’re doing less, we can focus on doing it faster. For memory accesses this consists pretty much just of cache optimisation, which in turn consists pretty much just of making our world as compressed as possible. Which, again, consists pretty much just of exploiting regularity.

Within ‘natural’ data (data which represents things we’d find in the real world), some shapes are more common than others. Imagine a 4x4x4 grid of voxels – it should be fairly obvious that a vertical wall where the right half of the voxels are all occupied is going to occur far more frequently than a completely random sponge-like configuration.

Since sparse trees are inherently pointer-based, Kämpe et al. 2013 highlighted that we can have some of those pointers point to the same node, forming a Directed Acyclic Graph instead of a tree and de-duplicating common voxel patterns.
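
A hedged sketch of the heart of such a de-duplication pass (on the CPU for clarity, as in the paper; names are illustrative): nodes are interned bottom-up by their contents, so identical subtrees collapse into a single shared entry.

```ts
// Intern nodes by content (occupancy mask + child indices). Because children
// are interned before their parents, equal subtrees always produce equal keys.
const nodePool: Uint32Array[] = [];
const internTable = new Map<string, number>();

function internNode(contents: Uint32Array): number {
  const key = contents.join(","); // a real build would use a proper hash
  const existing = internTable.get(key);
  if (existing !== undefined) return existing; // duplicate subtree: share it
  const index = nodePool.length;
  nodePool.push(contents);
  internTable.set(key, index);
  return index;
}
```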

Further building on this, Villanueva et al. 2016 explored how we might extend the DAG parents to store transformations on their children, allowing de-duplication of different orientations of the same geometry, and Dolonius et al. 2017 explored how we can separate material data from occupancy to allow common geometry to be de-duplicated even if the materials of the voxels are different.

All of these papers describe implementations of these steps on the CPU, while my constraints require them to run on the GPU – the detailed steps I took to implement the above compression algorithms on the GPU will be explored in a future blog post.

The first cast

As alluded to, simply ray-marching through this tree (while possible) does not come close to the performance available on modern systems.

If memory accesses are GPUs’ Achilles heel, rasterization is their Achilles everything else. If we want good performance, we need to use the hardware accelerators afforded to us, and that means triangles.

‘Traditional’ voxel engines (such as Minecraft) convert the voxels to triangles on the CPU and then stream these triangles to the GPU in chunks. This is horribly inefficient, because the CPU–GPU transfer bandwidth is far lower than either processor’s compute or local memory bandwidth – which is why I only want to send the changed voxels each frame and nothing more.

Instead we can move this mesh generation to the GPU, using instanced rendering to dispatch the visible chunks each frame.
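
A hedged sketch of the per-frame flow this implies, using the WebGPU API from TypeScript; every buffer, pipeline and count here is a hypothetical placeholder rather than my actual architecture:

```ts
// Upload only the edits, let a compute pass remesh dirty chunks on the GPU,
// then draw each chunk from GPU-written indirect arguments.
function renderFrame(
  device: GPUDevice,
  passDescriptor: GPURenderPassDescriptor,
  changedVoxels: Uint32Array,       // only the edits since last frame
  uploadBuffer: GPUBuffer,          // staging area read by the meshing pass
  meshPipeline: GPUComputePipeline,
  meshBindGroup: GPUBindGroup,
  drawPipeline: GPURenderPipeline,
  drawBindGroup: GPUBindGroup,
  indirectBuffer: GPUBuffer,        // draw arguments written by the GPU
  visibleChunkCount: number
): void {
  // 1. CPU -> GPU: upload just the changed voxels.
  device.queue.writeBuffer(uploadBuffer, 0, changedVoxels);

  const encoder = device.createCommandEncoder();

  // 2. GPU applies the edits and remeshes dirty chunks, writing vertex data
  //    and indirect draw arguments into GPU-resident buffers.
  const compute = encoder.beginComputePass();
  compute.setPipeline(meshPipeline);
  compute.setBindGroup(0, meshBindGroup);
  compute.dispatchWorkgroups(visibleChunkCount);
  compute.end();

  // 3. One indirect draw per visible chunk; the CPU never sees the geometry.
  const pass = encoder.beginRenderPass(passDescriptor);
  pass.setPipeline(drawPipeline);
  pass.setBindGroup(0, drawBindGroup);
  for (let chunk = 0; chunk < visibleChunkCount; chunk++) {
    pass.drawIndirect(indirectBuffer, chunk * 16); // 4 x u32 args per draw
  }
  pass.end();

  device.queue.submit([encoder.finish()]);
}
```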

There are a few fun optimisations that we can perform in this GPU driven rendering:

  • All of the standard optimisations: occluded faces don’t produce triangles, adjacent voxels on the same plane merge their faces (aka greedy meshing), and vertices for the faces have their attributes compressed given the constraints of the underlying geometry (only 6 possible normals, etc.).
  • We can mesh and store each of the 6 directions that the chunk can be viewed from separately, and then only dispatch draws for the face directions that are visible (implicit back-face culling – see the sketch after this list).
  • Each chunk can have a dirty flag, and remeshing on voxel change can be done lazily only when a dirty chunk is queued to be rendered. This avoids re-meshing the back-faces of chunks.
  • Using a flood fill, portal-based Minecraft-inspired occlusion culling can remove chunks hidden behind other chunks.
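
Here is the sketch promised in the list above – a minimal, illustrative version of per-direction dispatch: a chunk’s +X face mesh can only be front-facing if the camera sits beyond the chunk’s minimum X bound, and symmetrically for the other five directions.

```ts
// Decide which of a chunk's six face-direction meshes can possibly face the
// camera. Names are illustrative.
type Vec3 = [number, number, number];

function visibleFaceDirections(camera: Vec3, chunkMin: Vec3, chunkMax: Vec3): number[] {
  const visible: number[] = [];
  for (let axis = 0; axis < 3; axis++) {
    if (camera[axis] > chunkMin[axis]) visible.push(axis * 2);     // +axis mesh
    if (camera[axis] < chunkMax[axis]) visible.push(axis * 2 + 1); // -axis mesh
  }
  return visible; // 3 directions when outside the chunk, up to 6 inside it
}
```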

Additionally, note that by storing each direction independently, the geometry for a chunk can be stored from front to back, making transparency easy – although not as easy as would be nice, as unfortunately we almost always view a chunk from an oblique angle, making more than one face direction visible. However, we know that there is a cap on the number of out-of-order transparent surfaces that a pixel may encounter sequentially (namely 3x the width of a chunk), making stack-based order-independent transparency accurate on hardware with sufficient stack capacity.
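
As an illustrative CPU-side version of that resolve step (on the GPU the stack would live in registers), assuming the bounded set of surfaces has already been gathered for a pixel:

```ts
// Resolve one pixel's bounded stack of transparent surfaces: sort by depth,
// then blend front to back, tracking how much light still passes through.
interface Surface { depth: number; colour: [number, number, number]; alpha: number; }

function resolvePixel(stack: Surface[], background: [number, number, number]): [number, number, number] {
  stack.sort((a, b) => a.depth - b.depth); // nearest surface first
  const out: [number, number, number] = [0, 0, 0];
  let transmittance = 1; // fraction of light not yet absorbed
  for (const s of stack) {
    for (let c = 0; c < 3; c++) out[c] += transmittance * s.alpha * s.colour[c];
    transmittance *= 1 - s.alpha;
  }
  for (let c = 0; c < 3; c++) out[c] += transmittance * background[c];
  return out;
}
```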

The nth cast

Rasterization stops being quite as useful once the scene rays de-cohere. The magic sauce of most engines, then, is how they handle the nth cast – lighting and shadows, reflections, refractions and so on.

At the time of writing, the gold standard on modern hardware is to turn to ray-based methods, even if only at half resolution or similar, to get a convincing lighting model. However, not every platform can cope with that yet, so my engine still has a few tricks up its sleeve.

Hard Shadows

The primary (sun/moon) lights, and prominent lights within the scene, can still use rasterization via shadow mapping to achieve the high-frequency edges that we want on high-contrast shadows. We can use the same dirty chunk system as the primary cast to render an occlusion shadow map (with fractional values for transparent materials) and then sample it in the first cast stage.

Ambient Lighting

Modern GI lighting systems all try to amortize the potentially infinite cost of full light simulation across multiple frames, using radiance caching, reprojection and the like. The idea is to reduce the time spent each frame re-calculating global lighting information – which is typically fairly static – by holding some of it across several frames.

Minecraft does these lighting calculations on the CPU (the results of which are also used for mob spawning and so on), performing a flood fill with a drop-off and storing a light level for every air block. This has historically been notoriously laggy, and does not simulate many desirable aspects of lighting such as directionality, specular reflections, coloured lighting, etc.
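
For concreteness, here is an illustrative single-channel version of that flood fill (not Minecraft’s actual code): light spreads to the six neighbours, dropping one level per step, until it reaches zero.

```ts
// Breadth-first light propagation over a cubic grid of flat-indexed cells.
function propagateLight(
  light: Uint8Array,   // light level per cell; sources pre-seeded
  solid: Uint8Array,   // 1 where a cell blocks light
  size: number,        // side length of the cubic grid
  sources: number[]    // flat indices of the seeded sources
): void {
  const queue = [...sources];
  for (let head = 0; head < queue.length; head++) {
    const i = queue[head];
    const level = light[i];
    if (level <= 1) continue; // nothing left to give to neighbours
    const x = i % size;
    const y = Math.floor(i / size) % size;
    const z = Math.floor(i / (size * size));
    const neighbours = [
      x > 0 ? i - 1 : -1, x < size - 1 ? i + 1 : -1,
      y > 0 ? i - size : -1, y < size - 1 ? i + size : -1,
      z > 0 ? i - size * size : -1, z < size - 1 ? i + size * size : -1,
    ];
    for (const n of neighbours) {
      if (n >= 0 && !solid[n] && light[n] < level - 1) {
        light[n] = level - 1; // neighbour receives one level less
        queue.push(n);
      }
    }
  }
}
```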

A more interesting approach is taken by Felix Maier for his VoxelChain engine, seen here. It is deterministic and massively parallelizable, supports coarse specular and diffuse materials, and is effectively infinite bounce (given enough time). It is slow to propagate, but most light sources (such as fires) do not instantly extinguish in real life, so slow propagation isn’t necessarily a downside.

Most games perform lighting calculations client-side, but since it is commonplace for lighting information in voxel games to have an impact on other mechanics, it makes sense to perform this coarse GI server-side (and potentially duplicate it client-side, depending on the server connection and local device performance characteristics), and then polish it up a bit more on the client, adding the higher-frequency information where possible.

Taking inspiration from Minecraft, we can also reduce the cost by only simulating block lighting, keeping a secondary value for overhead lighting, since overhead lighting is expected to vary far more.

Ray batching

For reflections, refraction, GI probes and everything in-between, there is often exploitable coherence between bundles of rays. The initial rasterization pass is an example of one way to exploit the similarity between adjacent pixel rays. Another way to exploit similarity is using ‘Beam Optimisation’ outlined in Laine et al., 2010.

Unfortunately, beam optimisation does not work well for secondary rays since the points of intersection for nearby pixels will be close to geometry, making the cells they occupy smaller than the beam cell cutoff size.

Instead, we optimise clusters of rays using a system of speculative collaborative ray-marching.

Since every thread in a wavefront occupies its execution slot until the whole wavefront terminates, the time taken for a cluster of rays managed by a wavefront to terminate is the time taken for the longest ray to terminate. This means that the shorter rays in the cluster should support the execution of the longer rays however possible.

This support can be speculative – it doesn’t matter to the final image if the shorter rays don’t actually help at all, in the end. The shorter rays can simply guess how they might be able to help.

My engine uses a system I call runahead fetching, where idle threads read the cells at positions that a long-running ray will reach if it continues. They don’t do anything useful with this data – simply reading it forces the GPU to bring it into cache, allowing the long-running ray to perform its remaining reads more quickly.

Citations

Laine, S. and Karras, T., 2010. Efficient sparse voxel octrees: analysis, extensions, and implementation. NVIDIA Corporation, 2(6).

Kämpe, V., Sintorn, E. and Assarsson, U., 2013. High resolution sparse voxel DAGs. ACM Transactions on Graphics (TOG), 32(4), pp.1-13.

Villanueva, A.J., Marton, F. and Gobbetti, E., 2016, February. SSVDAGs: Symmetry-aware sparse voxel DAGs. In Proceedings of the 20th ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (pp. 7-14).

Dolonius, D., Sintorn, E., Kämpe, V. and Assarsson, U., 2017. Compressing color data for voxelized surface geometry. IEEE Transactions on Visualization and Computer Graphics, 25(2), pp.1270-1282.