
07 August, 2017

Tiled hardware (speculations)

Over at Siggraph I had a discussion with some mobile GPU engineers about the pros and cons of tiled deferred rasterization. I prompted some discussion over Twitter (and privately) as well, and this is how I understand the matter of tiled versus immediate/"forward" hardware rasterization so far...

Thanks to all who participated in said discussions!

Disclaimer! I know (almost) nothing of hardware and electronics, so everything that follows is likely to be wrong. I write this just so that people who really know hardware can laugh at the naivety of mere mortals...



Tile-Based Deferred Rendering.

My understanding of TBDR is that it works by dividing all the incoming dispatches aimed at a given rendertarget into screen-space tiles.

For this to happen you have, at the very least, to invoke the part of the vertex shader that computes the output screen position of the triangles, for all dispatches, figure out which tile(s) each triangle belongs to, and store the vertex positions and indices in per-tile storage.
Note: considering the number of triangles in games nowadays, the per-tile storage has to be in main memory, not in an on-chip cache. In fact, as you can't predict up front how much memory you'll need, you have to allocate generously and then have some way to generate interrupts in case you end up needing even more memory...

Indices would be rather coherent, so I imagine they are stored with some sort of compression (probably patented), and I also imagine you would want to start rejecting invisible triangles as they enter the various tiles (e.g. backface culling).
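
To make the binning step concrete, here is a minimal CPU-side sketch of the idea (in C++, with made-up tile and screen sizes; real hardware obviously uses dedicated units and compressed streams, and this is not how any specific GPU does it):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative only: a toy CPU-side model of the binning pass.
struct Vec4 { float x, y, z, w; };

constexpr int kTileSize = 32;                // pixels per tile side (made up)
constexpr int kScreenW = 1920, kScreenH = 1080;
constexpr int kTilesX = (kScreenW + kTileSize - 1) / kTileSize;
constexpr int kTilesY = (kScreenH + kTileSize - 1) / kTileSize;

struct BinnedTri { uint32_t i0, i1, i2; };   // indices into post-transform positions

// "Position-only vertex shading" is assumed to have already produced screenPos;
// a real binner would run the position part of the application's vertex shader.
std::vector<std::vector<BinnedTri>> BinTriangles(
    const std::vector<Vec4>& screenPos,      // post-transform, viewport-mapped xy
    const std::vector<uint32_t>& indices)
{
    std::vector<std::vector<BinnedTri>> bins(kTilesX * kTilesY);
    for (size_t t = 0; t + 2 < indices.size(); t += 3) {
        const Vec4& a = screenPos[indices[t]];
        const Vec4& b = screenPos[indices[t + 1]];
        const Vec4& c = screenPos[indices[t + 2]];
        // Reject back-facing triangles as early as possible (signed area test).
        float area = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
        if (area <= 0.0f) continue;
        // Conservative screen-space bounding box -> range of touched tiles.
        int minTx = std::max(0, (int)std::floor(std::min({a.x, b.x, c.x})) / kTileSize);
        int maxTx = std::min(kTilesX - 1, (int)std::ceil(std::max({a.x, b.x, c.x})) / kTileSize);
        int minTy = std::max(0, (int)std::floor(std::min({a.y, b.y, c.y})) / kTileSize);
        int maxTy = std::min(kTilesY - 1, (int)std::ceil(std::max({a.y, b.y, c.y})) / kTileSize);
        for (int ty = minTy; ty <= maxTy; ++ty)
            for (int tx = minTx; tx <= maxTx; ++tx)
                bins[ty * kTilesX + tx].push_back(
                    { indices[t], indices[t + 1], indices[t + 2] });
    }
    return bins; // per-tile triangle lists, conceptually written to main memory
}
```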

Then the visibility of all triangles in a tile can be figured out (by sorting and testing against the z-buffer), the remaining portion of the vertex shader can be executed, and pixel shaders can be invoked.
Note: while this separation of vertex shading into two passes makes sense, I am not sure that current architectures do not simply emit all vertex outputs into a per-tile buffer in a single pass instead.
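
Continuing the toy model above (reusing Vec4, kTileSize and BinnedTri from the binning sketch), per-tile hidden-surface removal can be thought of as keeping, for every pixel in the tile, only the frontmost binned triangle and invoking pixel shading just once for the winners; again, only a sketch of the concept, not of any real chip:

```cpp
#include <array>

// Toy per-tile resolve: for each pixel keep only the closest covering triangle,
// then shade the winners once. Zero overdraw by construction.
struct TileVisibility {
    // Index into the tile's bin of the winning triangle per pixel, or -1 if none.
    std::array<int32_t, kTileSize * kTileSize> tri;
    std::array<float, kTileSize * kTileSize> depth;
};

TileVisibility ResolveTile(int tileX, int tileY,
                           const std::vector<Vec4>& screenPos,
                           const std::vector<BinnedTri>& bin)
{
    TileVisibility vis;
    vis.tri.fill(-1);
    vis.depth.fill(1.0f); // assume depth in [0,1], 1 = far plane
    for (int32_t t = 0; t < (int32_t)bin.size(); ++t) {
        const Vec4& a = screenPos[bin[t].i0];
        const Vec4& b = screenPos[bin[t].i1];
        const Vec4& c = screenPos[bin[t].i2];
        for (int py = 0; py < kTileSize; ++py)
            for (int px = 0; px < kTileSize; ++px) {
                float x = tileX * kTileSize + px + 0.5f;
                float y = tileY * kTileSize + py + 0.5f;
                // Barycentric coverage test (edge functions).
                float w0 = (b.x - a.x) * (y - a.y) - (b.y - a.y) * (x - a.x);
                float w1 = (c.x - b.x) * (y - b.y) - (c.y - b.y) * (x - b.x);
                float w2 = (a.x - c.x) * (y - c.y) - (a.y - c.y) * (x - c.x);
                if (w0 < 0 || w1 < 0 || w2 < 0) continue;
                float sum = w0 + w1 + w2;
                if (sum == 0.0f) continue;
                float z = (w0 * c.z + w1 * a.z + w2 * b.z) / sum; // interpolated depth
                int p = py * kTileSize + px;
                if (z < vis.depth[p]) { vis.depth[p] = z; vis.tri[p] = t; }
            }
    }
    return vis; // pixel shading then runs once per covered pixel, via vis.tri
}
```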

From here on you have the same pipeline as a forward renderer, but with perfect culling (no overdraw, other than, possibly, helper threads in a quad - and we really should have quad-merging rasterizers everywhere, never mind the ddx/ddy rules!).

Vertex and pixel work do not overlap within a single dispatch, but across different output buffers (you can bin geometry for one rendertarget while shading another), so balancing is different from an immediate-mode renderer.
Fabian Giesen also noted that wave sizes and scheduling might differ: because it can be hard to fill large waves with fragments in a tile, you might have only a few pixels that touch a given tile with a given shader/state, and thus more partial waves wasting time (though not energy).

Pros.

Let's start with the benefits. Clearly the idea behind all this is to have perfect culling in hardware, avoiding wasting (texture and rendertarget) bandwidth on invisible samples. As accessing memory takes a lot of power (moving things around is costly), culling so aggressively saves energy.

The other benefit is that all your rendertargets can be stored in a small per-tile on-chip memory, which can be made extremely fast and low-latency.
This is extremely interesting, because you can see this memory as effectively a scratch buffer for multi-pass rendering techniques, allowing, for example, deferred shading to be implemented without feeling too guilty about the bandwidth costs.
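
A back-of-the-envelope example of why this is attractive (all numbers are assumptions for illustration, not any specific GPU):

```cpp
// Back-of-the-envelope: a 32x32 tile with a fairly fat deferred-shading layout
// fits comfortably on chip, while the same G-buffer at full resolution would
// have to live in main memory. All byte counts are made-up assumptions.
constexpr int kTilePixels    = 32 * 32;
constexpr int kGBufferBytes  = 16;                       // e.g. a couple of RGBA8 targets + depth, say
constexpr int kTileFootprint = kTilePixels * kGBufferBytes;         // 16 KB per tile
constexpr long long kFullScreenFootprint =
    1920LL * 1080LL * kGBufferBytes;                                // ~33 MB at 1080p
```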

Also, as the hardware always splits things into tiles, you have strong guarantees about what areas of the screen a pixel-shader wave can access, thus allowing certain vector (wave-wide) operations to be turned into scalar ones if things are constant in a given tile (which would be very useful, for example, for "forward+" methods).
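
As a toy illustration of what that guarantee buys (a CPU-side stand-in; LightList and the per-tile light lists are hypothetical, in the spirit of a forward+ setup): if a wave is guaranteed to stay within one tile, per-tile data can be fetched once into scalar registers and shared by all lanes, instead of every lane loading it redundantly.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical per-tile light list, as a forward+ renderer might build.
struct LightList { const uint32_t* lights; uint32_t count; };

constexpr int kWaveSize = 64; // lanes per wave (assumption)

void ShadeWave(int tileIndex,
               const std::vector<LightList>& perTileLights,   // one list per screen tile
               const std::array<float, kWaveSize>& laneInputs) // per-pixel shading inputs
{
    // Because the whole wave lies inside one tile, this load is wave-uniform:
    // on tiled hardware it can be done once and kept in scalar registers.
    const LightList tileList = perTileLights[tileIndex];

    for (int lane = 0; lane < kWaveSize; ++lane) {
        float accum = 0.0f;
        for (uint32_t l = 0; l < tileList.count; ++l)
            accum += laneInputs[lane] * (float)tileList.lights[l]; // stand-in for real lighting
        (void)accum; // ... write accum out for this pixel ...
    }
}
```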

As the tile memory is quite fast, programmable blending becomes feasible as well.

Lastly, once the tile memory that holds triangle data is primed, in theory one could execute multiple shaders recycling the same vertex data, allowing further ways to split computation between passes.

Cons.

So why do we still have immediate-mode hardware out there? Well, the (probably wrong) way I see this is that TBDR is really "just" a hardware solution to zero overdraw, so it's amenable to the same trade-offs one always has when thinking about what should be done in hardware and what should be left programmable.

You have to dedicate a bunch of hardware, and thus area, to this functionality. Area that could be used for something else, such as more raw computational units.
Note though that even if immediate renderers do not need the sophistication of tiling and sorting, they still need space for rendertarget compression, which is less needed on deferred hardware.

Immediate-mode rasterizers do not necessarily have to overdraw. If we do a full depth prepass, for example, then the early-z test should cull away all invisible geometry exactly like TBDR.
We could even predicate the geometry pass after the prepass using the visibility data obtained from it, for example using hardware visibility queries or a compute shader. We could even go down to per-triangle culling granularity!
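
As a sketch of what such software culling could look like (a serial CPU stand-in for what would normally be a compute shader; the coarse per-tile max-depth buffer, the single-tile test and the byte layouts are simplifications I made up for brevity, reusing Vec4 from the earlier sketch):

```cpp
// Toy per-triangle culling against a conservative (max) depth buffer, producing
// a compacted index buffer for the main geometry pass. On a GPU this would be a
// compute shader appending via atomics; here a serial CPU stand-in.
std::vector<uint32_t> CullTriangles(
    const std::vector<Vec4>& screenPos,
    const std::vector<uint32_t>& indices,
    const std::vector<float>& coarseMaxDepth, // one conservative max-Z per tile, from the prepass
    int tilesX, int tileSize)
{
    std::vector<uint32_t> survivors;
    for (size_t t = 0; t + 2 < indices.size(); t += 3) {
        const Vec4& a = screenPos[indices[t]];
        const Vec4& b = screenPos[indices[t + 1]];
        const Vec4& c = screenPos[indices[t + 2]];
        // Compare the triangle's nearest depth against the farthest depth stored
        // for its tile: if even the closest vertex is behind everything, drop it.
        // (For brevity we only test the tile under the bounding box's min corner;
        // a real version would pick a conservative mip covering the whole triangle.)
        float nearest = std::min({a.z, b.z, c.z});
        int tx = (int)(std::min({a.x, b.x, c.x})) / tileSize;
        int ty = (int)(std::min({a.y, b.y, c.y})) / tileSize;
        if (nearest > coarseMaxDepth[ty * tilesX + tx]) continue; // occluded
        survivors.insert(survivors.end(),
                         { indices[t], indices[t + 1], indices[t + 2] });
    }
    return survivors; // feed this compacted index buffer to the main pass
}
```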

Also, if one looks at the bandwidth needed by the two solutions, it's not clear where the tipping point is. In both cases one has to go through all the vertex data, but in one case we emit triangle data per tile, and in the other we write a compressed z-buffer/early-z buffer.
Clearly, as triangles get denser and denser, there is a point where using the z-buffer results in less bandwidth use!
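
A crude model of that tipping point (every constant here is a made-up assumption, purely for illustration): binned-geometry traffic grows with the number of triangles and the tiles they touch, while the depth-prepass traffic is roughly fixed per pixel plus a second read of the geometry, so past some density the prepass wins on bandwidth.

```cpp
#include <cstdio>

// Crude bandwidth comparison; all constants are made-up assumptions.
int main() {
    const double bytesPerBinnedTri = 32.0;  // positions + indices written (and read back) per tile touched, say
    const double avgTilesPerTri    = 1.3;   // small triangles mostly touch one tile
    const double pixels            = 1920.0 * 1080.0;
    const double zBytesPerPixel    = 3.0;   // compressed depth write, say
    const double vertexFetchPerTri = 36.0;  // extra geometry read done by the prepass

    for (double tris = 1e5; tris <= 1e7; tris *= 10.0) {
        double tbdrBytes    = tris * bytesPerBinnedTri * avgTilesPerTri;
        double prepassBytes = pixels * zBytesPerPixel + tris * vertexFetchPerTri;
        std::printf("%8.0f tris: binning %6.1f MB vs z-prepass %6.1f MB\n",
                    tris, tbdrBytes / 1e6, prepassBytes / 1e6);
    }
    return 0;
}
```

With these particular made-up numbers the crossover sits around a million triangles per frame, but the point is only that a crossover exists, not where it is.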

Moreover, as this is a software implementation, we could always opt for different trade-offs and avoid doing a full depth pass, heuristically selecting just a few occluders, or reprojecting the previous frame's Z, and so on.

Lastly, I imagine that there are some trade-offs between area, power and wall-clock time.
If you care about optimizing for power and are not limited much by chip area, then building some smarts into the chip to avoid accessing memory looks very interesting.
If you only care about doing things as fast as possible, then you might want to dedicate all the area to processing power; even if you waste some bandwidth, that might be fine if you are good at latency hiding...
Of course that wasted bandwidth will cost power (and heat), but you might not see the performance implications if you have other work for your compute units to do while waiting for memory.

Conclusions.

I don't quite know enough about this to say anything too intelligent. I guess that as we're seeing tiled hardware in mobile but not in the high-end, and vice versa, tiled might excel at saving power but not at pure wall-clock performance versus simpler architectures that use all the area for computational units and latency hiding.

Round-tripping geometry to main RAM seems outrageously wasteful, but if you want perfect culling you have to compare it with a full-z prepass, which reads geometry data twice, and things start looking a bit more even.

Moreover, even with immediate rendering, it's not that you can really pump out a lot of vertex attributes and not suffer: these want to stay on chip (and are sometimes even redistributed in tile-like patterns), so in practice you are quite limited before you start stalling your pixel shaders because you're running out of parameter space...

Amplification, via tessellation or instancing, though, can save lots of data for an immediate renderer, and as noted before the second pass can be culled quite aggressively, which in an immediate renderer lets software balance how much one wants to pay for culling quality, so doing the math is not easy at all.

The truth is that for almost any rendering algorithm and rendering hardware there are ways to reach great utilization, and I doubt that, if one looked closely, the two architectures would be very far apart when fed appropriate workloads.
Often it's not a matter of what can be done, as there are always ways to make things work, but of how easy it is to achieve a given result.

And in the end it might even be that things are the way they are because of the expertise and legacy designs of the companies involved, rather than objective data. Or that things are hard to change due to myriads of patents, or, likely, a bit of all these reasons...

But it's interesting to think of how TBDR could change the software side of the equation. Perfect culling and per-tile fast memory would allow some cute tricks, especially in a console where we could have full exposure of the underlying hardware... Could be fun.

What do you think?

Post Scriptum.

Many are mentioning NVidia's tiled solution, and AMD now has something similar as well. I didn't talk about these because they seem, in the end, to be "just" another way to save rendertarget bandwidth.
I don't know if they even help with culling (I think not for NVidia, while AMD mentions they can do pixel shading after a whole tile batch has been processed), but they certainly don't allow splitting rendering passes more efficiently via an on-chip scratch, which to me (on the software side of things...) is the most interesting delta of TBDR.

Of course you could argue that tiles-as-a-cache instead of tiles-as-a-scratch might still save enough bandwidth, and latency-hide the rest, that in practice it allows doing deferred for "free". Hard to say, and in general blending units have always had some degree of caching...

Lastly, with these hybrid rasterizers, if they clip triangles/waves at tile boundaries (if), one could still in theory get some improvements in e.g. F+ methods, but it's questionable, because the tile sizes used seem too big to allow the light/attribute screen-space structures of an F+ renderer to match the hardware tile size.

External Links.

Apple's "GPU family 4" - notice the "imageblocks" section