Casting a Critical Eye on GPU PTex

Storing data on the surfaces of meshes is something of a pain. It involves unwrapping the surface into a 2D UV layout, which is time-consuming and can lead to tricky issues such as seams and packing inefficiencies. I’m sure you know all about it.

For this reason, it makes sense to want to switch to some kind of automatic parameterization. Recently PTex has been proposed as a suitable approach for real-time graphics, and there have been a couple of GPU implementations (from AMD and NVIDIA).

Now, I’ve heard PTex mentioned by a few people as being the solution to our parameterization woes, and figured I should write down the main reasons I think PTex is currently not a viable solution.

PTex only really works well with quads. The solution for triangles presumably involves trying to quadrangulate your mesh, or trying to pair up triangles into the same rectangular texture tile (with gutter regions in between). This is annoying, but I’m going to make a couple of generous concessions to PTex in the interest of space, and the first one is: let’s just assume for the purposes of discussion that all our meshes are made up of quads (or that PTex for triangles can be solved elegantly).

Tallying up the overheads

In short, the issue with PTex is the sheer amount of memory overhead you need to introduce for it to work robustly and efficiently.

MIP mapping

The idea of PTex is that each quad gets its own little texture stored in some way that can be dynamically addressed from the pixel shader. In principle you could support MIP mapping by just letting each quad’s texture have a full MIP chain, but there are a few reasons why you don’t do this: it increases the border regions, and NVIDIA’s implementation relies on storing data in texture arrays, which means that MIP 0 of one texture may be stored in the same array as MIP 1 of another texture. Instead, each MIP level of each PTex texture is stored separately and you address it by computing the MIP level in the shader. I’ll use the term “tile” for the data associated with a quad at a specific MIP level. So each quad has one tile per MIP level.

In order to take advantage of hardware trilinear filtering, we need to make sure that each PTex tile has a single extra MIP level in addition to its top level. The idea is that the shader picks the MIP level it needs, and truncates down to the nearest integer level, then finds the PTex tile corresponding to this MIP level and samples it with the fractional part of the original MIP level. This means you will only ever touch MIP 0 and 1 (if you went past 1, then it would simply address a different tile altogether, and touch MIP 0 and 1 of that tile).

In other words, each tile needs only a single extra MIP level to get hardware trilinear filtering, but this extra MIP level is redundant (it’s the same data as the tile associated with the next MIP level). Since the extra level holds a quarter as many texels as the tile’s top level, we get a 25% overhead for hardware trilinear filtering.
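The tile selection described above can be sketched on the CPU as follows. This is only an illustration of the arithmetic, not code from either GPU implementation; the function name `select_tile_and_lod` and the clamping behaviour are my own assumptions.

```python
import math

def select_tile_and_lod(mip_level, num_tile_levels):
    """Split a fractional MIP level into (tile level, LOD inside that tile).

    Hypothetical helper: the integer part picks which tile to address,
    the fractional part is the LOD used to sample that tile, so the
    sampler only ever blends MIP 0 and MIP 1 of the chosen tile.
    """
    max_level = num_tile_levels - 1
    # Clamp so we never address a tile that does not exist. The smallest
    # tile is therefore always sampled at LOD 0 and needs no extra MIP.
    clamped = min(max(mip_level, 0.0), float(max_level))
    tile_level = int(math.floor(clamped))
    lod_in_tile = clamped - tile_level
    return tile_level, lod_in_tile

# A MIP level of 2.25 on a quad with five tile levels addresses tile 2
# and samples it a quarter of the way between its MIP 0 and MIP 1.
print(select_tile_and_lod(2.25, 5))
```

Note that the fractional LOD never reaches 1.0: as soon as it would, the integer part increments and a different tile is addressed instead.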

Border regions

In the most naïve versions of PTex, seamless filtering across neighboring quads is achieved by adding a single texel border region. In reality, your texture data will be block compressed. The colors you get out of the sampler will therefore not exactly match the input colors. In order to get the exact right colors, you need to add a whole 4 texel border so that the BC block in the border region exactly matches the neighboring BC block.

You might just use a single texel border and think it doesn’t matter too much if it doesn’t exactly match the texels of the neighbor quad, and that’s pretty much what people do for current UV parameterization (with some manual tweaks every now and then). However, with PTex you’re adding texture seams on every single edge, rather than just at artist-specified locations, so you really can’t afford for the seams to be anything but completely invisible or you’ll get artifacts all over the place.

The next complication is that we’re storing two MIP levels per tile, but we need the second MIP level to filter correctly too. So really, we need a full 4 texel border in the second MIP level, which means the top level needs 8 texels of border region.

If trilinear filtering were the only reason we needed such big borders you might be tempted to try to do clever tricks in the BC encoder logic and use just one texel border. E.g. maybe BC blocks on the edges of the quad could share anchor points, so that the border texels are exactly identical. This sacrifices compression quality for space, and may not be acceptable given that block compression degrades quality by quite a bit already.

However, I think most agree that the days of straight bi- or trilinear filtering are gone or soon-to-be-gone. We need anisotropic filtering too. So since we probably need at least 8 texels of border to cover the anisotropic filter kernel anyway, we might as well just copy the BC blocks directly from the neighbors without messing with the encoder. It seems reasonable to proceed assuming that we’re going to use 8 texels of border.

Border region overhead varies with the size of the PTex texture. For a tile of size NxN, 8 texels of border on each side means the total number of texels in the top MIP level is (N+16)^2. Then, for every tile of a quad except the smallest-resolution one, we add the overhead of a single extra MIP level and end up with the total number of texels per tile: 1.25*(N+16)^2

For next-gen games we’re going to have to assume that primitives will be pretty small. Ideally larger than a few 2x2 pixel blocks to avoid pixel shading inefficiencies for forward rendering, but still not massively larger than that because it would make the geometry look too blocky. Let’s be generous here and say that the representative size of a quad is about 16x16 pixels, which means we’re going to need 16x16 texels to texture it at the ideal resolution.

For 16x16 PTex tiles you get 2400 texels for the three MIP levels from 16x16 down to 4x4, using the formula above. The total number of real texels is 16x16+8x8+4x4 = 336 texels. This gives an overhead of roughly 7x.
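To make the tally explicit, here is a small sketch that reproduces the numbers above. The 8 texel border, the 25% extra MIP level on all but the smallest tile, and the 4x4 smallest tile come from the text; the function names are just for illustration.

```python
BORDER_TEXELS = 8  # border width on each side of a tile's top MIP level

def stored_texels(tile_size, is_smallest):
    """Texels actually stored for one tile, borders and extra MIP included."""
    padded = (tile_size + 2 * BORDER_TEXELS) ** 2
    # Every tile except the smallest carries one redundant MIP level,
    # which adds a quarter of the padded top level (the 1.25x factor).
    return padded if is_smallest else padded * 5 // 4

def tally(top_size, smallest_size=4):
    """Return (stored texels, real texels) for a quad's whole tile chain."""
    sizes = []
    size = top_size
    while size >= smallest_size:
        sizes.append(size)
        size //= 2
    stored = sum(stored_texels(s, s == smallest_size) for s in sizes)
    real = sum(s * s for s in sizes)
    return stored, real

stored, real = tally(16)
print(stored, real, stored / real)  # 2400 stored vs 336 real texels
```

The three tiles contribute 1280, 720 and 400 stored texels respectively, so the small tiles at the bottom of the chain are almost entirely border.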

Size quantization overhead

In order to filter textures between quads of different sizes, both the AMD and NVIDIA implementations require that PTex quads use power-of-two texture sizes. This is because it’s easy to make the border texels for the larger tile equal the upsampled version of the smaller tile whenever the ratio of texture sizes is an integer. The easiest way to ensure this is to pick power-of-two texture sizes for each quad.

But the ideal texture size for a quad is unlikely to be exactly a power-of-two size! This means we have to bump up each quad’s texture size to the next power-of-two. You may only need 17x17 texels, but you’ll get 32x32!

If we assume that the ideal size for a quad is roughly uniformly distributed between the two nearest power-of-two sizes, the average overhead is about 1.71x. Multiply this into our existing number and we get approximately 12x overhead. I’m assuming square tiles here, btw, to simplify the comparison (NVIDIA’s sample implementation in particular deals poorly with non-squares anyway, since each different aspect ratio adds a draw call).
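One way to arrive at a figure like 1.71x is sketched below, assuming the ideal edge length is uniformly distributed between N/2 and N and that the overhead is measured as allocated texels divided by the average number of ideal texels. Both the helper name and that choice of measure are assumptions of this sketch.

```python
from fractions import Fraction

def mean_ideal_area(lo, hi):
    """Exact mean of s*s for an edge length s uniform on [lo, hi]."""
    lo, hi = Fraction(lo), Fraction(hi)
    # Integral of s^2 over [lo, hi], divided by the interval length.
    return (hi**3 - lo**3) / (3 * (hi - lo))

# Work in units of the allocated power-of-two edge length, so N = 1
# and every quad in the range (1/2, 1] is rounded up to a 1x1 tile.
allocated_area = Fraction(1)
round_up_overhead = allocated_area / mean_ideal_area(Fraction(1, 2), 1)

print(round_up_overhead, float(round_up_overhead))  # 12/7, about 1.71
```

Under these assumptions the average ideal tile covers 7/12 of the texels it is given, which is where the 12/7 comes from.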

Isn’t this a bit pessimistic, what about other tile sizes?

Okay, so I tried to be conservative with the 16x16 tile size, but let’s say you have extremely low-resolution meshes for whatever reason (e.g. maybe you use tessellation to dice them up on the GPU).

Running the same numbers again for a few tile sizes, we end up with overheads like so:

Tile size    Border overhead    MIP + size overhead    Total overhead
16x16        5.7x               1.25 * 1.71x           12.3x
32x32        3.1x               1.25 * 1.71x           6.7x
64x64        1.9x               1.25 * 1.71x           4.2x
128x128      1.4x               1.25 * 1.71x           3.1x
256x256      1.2x               1.25 * 1.71x           2.6x
512x512      1.1x               1.25 * 1.71x           2.4x

At larger sizes, most of the overhead comes from the inability to select arbitrary tile sizes, as well as the 25% overhead for the extra MIP level, but I actually expect real-world games would require the smaller tile sizes, where the overhead from border regions dominates.

Why does this work for movies then?

Movie renderers have a lot of flexibility that real-time renderers don’t. For example, they store no border regions, they just use more complicated shading logic instead (read the original PTex paper – it’s very clever, but doesn’t map to GPU filtering hardware). For the same reason, they don’t store redundant MIP levels to use hardware filtering, nor do they need to restrict their texture sizes to powers of two so the overhead caused by the size restrictions is gone too. Basically none of the sources of overhead exist for offline renderers.

Could PTex work for games with tessellated meshes?

PTex could potentially work okay for models with very big polygons. One situation where this might seem to be the case is a renderer that uses tessellation and displacement mapping pervasively, so that all the base meshes use very large polygons, with tessellation and displacement adding the fine-scale detail back. However, as I discussed in a previous post a couple of years ago, DX11-style tessellation is by no means a no-brainer for “everything”. The solution I suggest in that post, which amounts to the standard trick of reducing minification aliasing by band-limiting the input signal using MIP mapping, seems workable, even if it’s not ideal.

There are plenty of issues with pervasive tessellation, and I won’t rehash all of them here. Suffice it to say I’m not aware of anyone really adopting it as a mechanism for drastically reducing the polygon counts for the base meshes in general. It seems to mostly be used on high-polycount meshes for making earlobes slightly smoother and adding some small-scale details up close, as well as some “special” surfaces like terrain or water.

In any case, tying the viability of PTex to a workable tessellation strategy seems like a pretty serious caveat. So if that’s the argument for PTex, it should be stated that way by its proponents.

What would we need to change to make PTex work more generally?

It’s tempting to try to avoid many of these issues by just doing manual filtering near edges, but for high-quality next-gen games the requirement for anisotropic filtering means you need large filter kernels, which are expensive to fetch and filter. It also means it’s highly likely that at least one thread in a thread group needs to go through the costly path.

Edit:

I originally forgot to mention this old trick (that you may have seen in the context of virtual texturing):

One thing you can do to improve the border overhead is to store only one border “strip” for each pair of neighboring tiles. Basically instead of sampling into your left border region, you’d sample into the right border region of your left neighbor (by shifting/rotating the U-coordinate before looking up the tile, and then shifting it back just before sampling). You’d always keep the border of the higher resolution tile, if they don’t match.

You can skip quite a few border regions this way at the cost of some translation overhead to figure out where to sample from. So in a sense, this is a compromise between full hardware filtering and manual filtering. You do some UV transforms for pixels near edges, but still rely on the hardware to do the actual filtering. It would also be tricky to make this work for e.g. NVIDIA’s implementation, since it stores all the tiles in the same array, so a variable number of border “strips” would increase the number of draw calls. For AMD’s implementation it’s less problematic, although it would still lead to problems (the more different-sized quads you store, the more troublesome the atlas packing becomes).

So you could cut the border region overhead by about a factor of two if you’re lucky (again, it always keeps the “high res” border if there’s a mismatch). It’s a big win, but there are still many compounding sources of overhead.

End edit.

We might be able to get rid of at least some of the cost of the size restriction by letting some texture tiles snap to a lower power-of-two instead of always rounding up. This would reduce the overhead, at the expense of some quality in the form of blurrier textures. For example, clamping tiles to the nearest power-of-two instead of rounding up would change the overhead factor from 1.71x to 1.3x, but now half your textures will be slightly too blurry. This is an improvement, but isn’t really sufficient on its own to bring down the overheads enough.
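As a rough sanity check, here is one set of assumptions under which a figure close to 1.3x falls out. I’m assuming a uniform distribution of ideal edge lengths between N/2 and N, a snapping threshold at the arithmetic midpoint 3N/4, and that the overhead factor is measured only over the tiles that still get rounded up. Other definitions of “nearest” will give different numbers, so treat this as a sketch rather than a derivation.

```python
from fractions import Fraction

def mean_ideal_area(lo, hi):
    """Exact mean of s*s for an edge length s uniform on [lo, hi]."""
    lo, hi = Fraction(lo), Fraction(hi)
    return (hi**3 - lo**3) / (3 * (hi - lo))

# Edge lengths are in units of the larger power-of-two size, so N = 1.
half, midpoint, one = Fraction(1, 2), Fraction(3, 4), Fraction(1)

# Upper half of the range still rounds up to a 1x1 tile.
round_up_overhead = one / mean_ideal_area(midpoint, one)

# Lower half snaps down to a (1/2)x(1/2) tile, so it gets fewer texels
# than it ideally wants. A ratio below 1 is the "too blurry" case.
snap_down_ratio = Fraction(1, 4) / mean_ideal_area(half, midpoint)

print(float(round_up_overhead))  # about 1.30
print(float(snap_down_ratio))    # about 0.63
```

Under these assumptions the half of the tiles that snaps down ends up with roughly 63% of its ideal texel count, which is the quality cost being traded for the lower overhead.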

Basically, I think the GPU needs to know about mesh primitive adjacency and do filtering for us, even across edges. This includes anisotropic filtering. You’d pass in the primitive ID and a GPU generated intraprimitive coordinate (e.g. barycentric coordinates for triangles). If the GPU is already doing filtering, you might as well relax the power-of-two restriction too. And of course, it needs to support triangles as well as quads. At this point I’m not sure if you can call it PTex or if it’s become some hardware implementation of Mesh Colors, but it’s not really important what you call it.

Conclusion

I generally sympathize with the desire to store data directly on the surfaces of meshes without a complicated and tedious parameterization to a 2D UV-space, but PTex is not quite there yet. Even on a PC with 64-bit address space and gigs and gigs of RAM (and clever management of GPU memory), overheads of up to 12x are too high. Even with various mitigating strategies, the overheads remain substantial.

As much as it annoys me that we still don’t have an efficient way to store and retrieve filtered data on the surfaces of meshes, I think realistically we have to live with the 2D parameterization workaround for now.
