Sunday, 30 May 2021

Fewer Samples per Pixel per Frame

My VR roundup turned into a bit of an impromptu comparison between various anti-aliasing techniques inside one of the most challenging environments we currently have. VR restricts acceptable (input to photons) latency, so can limit pipeline/work buffer design; uses a relatively extreme field of view (close inspection of pixel-scale details) combined with the ever-increasing raw pixel counts of screens; and demands more than 60 fps with good frame pacing. Add in lens distortion and an emergency temporal reprojection stage (to avoid dropped frames) and even without TAA you’ve got distortion and potentially an extra reprojection stage exaggerating artefacts in the frames you do render.

I think we’re at another rather interesting point for anti-aliasing techniques, as demands for offline-render quality real-time graphics at high resolutions with fewer compromises (like screen-space effect artefacts), enabled via ray tracing acceleration, become mainstream. Per-pixel shader calculation costs are going to jump just as we saw during the adoption of HDR/physically-based materials and expensive screen-space approximations like real-time SSAO. Samples per pixel per frame may not be forced to drop as quickly as when consoles jumped from targeting 1080p to targeting 4K, but we are going to need some new magic to ensure a lack of very uncinematic aliasing and luckily it looks like we’re getting there.

Sampling History

It is 1994 and I’m playing Doom on my PC. The CRT is capable of displaying VGA’s 640x480 but due to colour palette limitations most DOS games run 320x200 and Doom’s 3D area is widescreen aspect due to the status bar taking up the bottom area. To make matters worse, those of us without the processor required to software render 35 frames per second (Doom’s cap, half refresh for a VGA CRT’s 70Hz) would often shrink the 3D window to improve framerates. All of this is very common for earlier 3D games (I remember playing Quake 1 two years later similarly), which often had difficulties consistently staying in the “interactive framerate” category. For most it was a dream to output near the maximum displayable image while calculating an individual output value for every pixel of every scan-out and that limitation was not primarily due to early framebuffer limitations.

It is 2004 and I’m playing Half-Life 2. Rapid advancement then convergence under a couple of API families for hardware acceleration has meant most of the last decade provided amazing 3D games that grew with hardware capabilities (even if many earlier examples contain somewhat arbitrary resolution limitations). Even 1998’s Half-Life 1 has quickly jumped past low resolution 3D consoles like the PS2. Super-sampling (SSAA) where every final pixel was internally rendered several times then blended (used extensively for offline rendering) was usually too expensive, especially as screen resolutions continued to increase (initially for 4:3 CRT then LCDs moving to 16:9). But by this point, it was standard to use MSAA to blend samples from different polygons that partially covered a single pixel (the saving being that if multiple coverage points were covered by the same triangle, the shader for the final value was only run once, unlike SSAA). Two years later, nVidia would introduce CSAA to allow more coverage sample points than cached values, making it even cheaper to provide very accurate blending between polygon edges. It was even possible to mix in SSAA for transparent textures, where the edge of the triangle is not where the aliasing happens. Note how those 2006 benchmarks are already showing PC games running at the equivalent of 1080p120 with limited MSAA or 60 fps with many many samples per pixel.
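
To make the MSAA saving concrete, here is a minimal sketch that counts shader invocations for a single pixel under each scheme. It is a deliberately simplified model rather than how any particular GPU schedules work, and the function name and the list-of-triangle-ids input are hypothetical.

```python
def shader_runs(sample_owners, mode):
    """Count pixel-shader invocations for one pixel (simplified model).

    sample_owners: the triangle id covering each sample point inside the
    pixel (a hypothetical input format chosen for this illustration).
    """
    if mode == "ssaa":
        # Super-sampling shades every sample point independently.
        return len(sample_owners)
    if mode == "msaa":
        # Multi-sampling shades once per distinct triangle touching the
        # pixel, then reuses that value for each sample it covers.
        return len(set(sample_owners))
    raise ValueError(f"unknown mode: {mode}")


interior_pixel = ["A", "A", "A", "A"]  # fully inside one triangle
edge_pixel = ["A", "A", "B", "B"]      # two triangles share the pixel

print(shader_runs(interior_pixel, "ssaa"), shader_runs(interior_pixel, "msaa"))
print(shader_runs(edge_pixel, "ssaa"), shader_runs(edge_pixel, "msaa"))
```

Most pixels on screen are interior pixels, which is why the cost of 4x MSAA sits far closer to no anti-aliasing than to 4x SSAA.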

It is 2014 and I’m playing the recent reboot of Tomb Raider. MSAA continued to get faster and better in the intervening decade but unfortunately the move to deferred rendering made it extremely difficult to implement efficiently in newer engines (it is not possible in Tomb Raider, although some deferred renderers did get hacked by nVidia drivers that injected MSAA at an acceptable performance cost). The answer to major aliasing, which had been developed during the Xbox 360 generation of consoles, was to run a post-processing pass (MLAA) that looks for high contrast shapes typical of aliased lines and then employs a blur to ease the sudden gradient. This technique requires very clear telltale aliased line segments, so smaller detail like foliage systems becomes a huge issue, which really stands out in the sequel, Rise of the Tomb Raider. It also completely fails if you apply the pass after doing some other image manipulation that distorts the telltale shapes or edge gradients.

In this 2014 era, the use of HDR intermediate values later tonemapped down to the output range, which was just emerging after HL2, also means that internal calculations can output a much wider range of values. With only one sample per triangle per pixel, a new sort of temporal aliasing becomes dominant as the sampled locations move enough for slightly different angles to be calculated grazing incredibly bright light sources in sequential frames. Surfaces sparkle and flicker in regular patterns that become at least as distracting in motion as classic polygon edge aliasing, as I mention in my Dragon Age retrospective. A combination of the two aliasing types is easily recognisable where an angle creates a strong lighting highlight along the silhouette of a surface that may be less than a pixel wide, creating light ants crawling along those polygon edges which are too thin for MLAA to catch. A better solution was required. (And you may note the journey isn’t over as I just linked that to a trailer for a 2021 game with an engine that already uses...)

Temporal Accumulation

The problem is clear. By 2014 we are generally using one (complex) sample per pixel per frame and due to fine geometric detail (which older games lacked) plus an extreme range of possible lighting values (not to mention potential ordering issues in how various stages of calculating light and darkness components are blended) this is creating pixel-scale aliased elements that are also often not temporally stable. The screenshots look relatively good but in motion anyone with flicker-sensitivity is immediately distracted by aliasing. By this time the shaders have also become complex enough that various motion vectors (showing how far the object under each pixel has moved since the previous frame) are starting to be calculated to enable somewhat accurate motion blur to be added (very important on consoles targeting 30 fps, where this provides extra temporal information missing when not using higher framerate output - it’s also “more cinematic” because most people are used to 24 fps movies with a 180 degree shutter, which accumulates all light that hits the lens for 1/48th of a second before closing the shutter for another 1/48th of a second).

Those motion vectors, if they are sufficiently accurate, can point to the pixel location of the object in the previous frame. So expensive effects like real-time ambient occlusion estimation (checking the local depth buffer around a pixel to see how occluded the point is by other geometry that would limit how much bounce lighting it would likely receive) become an area of experimentation for temporal accumulation buffers. Sample less in each frame, create a noisy estimation of the ground truth, and filter for stability while reprojecting each frame along the motion vectors. Here’s a good walkthrough blog from this time period, and subsequent refinements have worked to deal with edge cases like an incremental buffer not handling geometry arriving from off-screen (causing some early examples to obviously slowly darken geometry as it appeared along the edge of the screen).
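
The accumulate-and-reproject loop can be sketched in a few lines. This is a one-dimensional toy under stated assumptions, not any shipping implementation: the function name, the integer motion vectors and the blend weight `alpha` are all hypothetical, and real renderers filter the history lookup and apply far more careful rejection tests.

```python
def accumulate(history, current, motion, alpha=0.1):
    """Blend a noisy current frame into reprojected history (1D for brevity).

    history, current: lists of floats, one value per pixel.
    motion: per-pixel integer offset; the surface now at pixel x was at
            pixel x - motion[x] in the previous frame.
    alpha: weight given to the new sample (an illustrative default).
    """
    out = []
    for x, value in enumerate(current):
        prev_x = x - motion[x]
        if 0 <= prev_x < len(history):
            # Valid history: exponential moving average along the motion vector.
            out.append((1 - alpha) * history[prev_x] + alpha * value)
        else:
            # The surface arrived from off-screen, so there is no history to
            # reuse. Falling back to the new sample avoids blending in stale
            # or empty data at the screen edge.
            out.append(value)
    return out


history = [0.0, 1.0, 2.0]
current = [10.0, 10.0, 10.0]
motion = [1, 1, 1]  # everything moved one pixel to the right
print(accumulate(history, current, motion, alpha=0.5))
```

The `else` branch is the off-screen edge case from the paragraph above: with no explicit fallback, newly revealed pixels start from whatever happens to be in the buffer and only slowly converge.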

As seen shipping in 2011's Crysis 2, temporal accumulation for reducing aliasing not only presents the answer to MLAA’s limitations but also can operate after a cheap MLAA pass to rapidly reduce all aliasing. If you consider a slightly jittered pixel centre location (a common enhancement) then a static scene under TAA effectively generates SSAA-quality images, only spreading the samples per pixel out over time. It was popularised further by nVidia with their branding of the process as TXAA, shipping in games in 2012. Some early implementations had major ghosting issues from motion vector precision and understanding when to reject a previous frame’s data as not contributing to this new location. The actual complexity of this problem becomes apparent when you consider how objects in a scene may have changing visibility (especially during motion and animation) or output values (consider a flickering light and the subsequent illumination between frames). Progress has not always been uniform and a couple of times I've stumbled upon an anti-aliasing fail state that's hard to even explain (Dishonored 2 doesn't have very satisfying TAA due to ghosting thin elements and I don't know what the MLAA is doing here to achieve what's visible in this capture). It is a process under constant refinement but in today’s best temporal accumulation implementations it is often relatively rare to see obvious issues. As mentioned, it also errs on the side of a softer final frame so can be combined with a sharpening filter. Unfortunately this can be handled poorly, effectively paying the computational cost of TAA while then also reintroducing exactly the obvious aliasing that it was meant to remove. It also doesn’t help if your TAA implementation is broken on a platform.
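
The claim that a jittered static scene converges on super-sampled quality can be checked with a toy pixel. The sketch below assumes a Halton sequence for the sub-pixel jitter (a common choice, though not one the games above are confirmed to use) and a made-up scene where an edge covers 30% of the pixel; it also uses a plain average where real TAA uses a running blend.

```python
def halton(index, base=2):
    """Return element `index` (1-based) of the Halton low-discrepancy sequence."""
    result, fraction = 0.0, 1.0
    while index > 0:
        fraction /= base
        result += fraction * (index % base)
        index //= base
    return result


def scene(x):
    """Hypothetical static scene: white left of an edge at x = 0.3, black after."""
    return 1.0 if x < 0.3 else 0.0


def jittered_average(frames):
    """Average one jittered sample per frame across `frames` frames."""
    total = 0.0
    for frame in range(1, frames + 1):
        total += scene(halton(frame))  # sub-pixel offset in [0, 1)
    return total / frames


for frames in (1, 4, 16, 256):
    print(frames, jittered_average(frames))
```

One frame gives a hard black or white answer (the aliased result), while the running average approaches the true 30% coverage that a super-sampled render would have produced in a single frame.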

Ray Tracing with DLSS and The Future

In the last couple of years, the new hotness that really explodes the computational costs of working out a stable final value of each pixel in a frame of a modern game is real-time ray tracing. Thanks to nVidia looking to brand the future, they have shipped all RTX GPUs with dedicated silicon to accelerate BVH intersection tests and machine learning tensor operations (big matrix multiplies, often with sparse data) and at least the former part of that is now also available on current AMD GPUs and consoles plus upcoming Intel discrete GPUs. If you thought the aliasing issues from rasterisation going to physically-based materials and HDR were a concern, welcome to a problem so far beyond that that if you look at the underlying data from a single frame using around one sample per pixel, it looks more like white noise than a coherent scene - accumulation with temporally reliable motion vectors is a must and site of ongoing research. The addition of Tensor cores to RTX GPUs was initially proposed as the place to run AI denoising on that ray tracing output, although most games today still denoise in the general purpose shaders. Luckily, another branch of research was to use those Tensor units to AI-accelerate all anti-aliasing and it has been wildly successful with many reviewers now noting that DLSS 2 outperforms native resolution TAA.

DLSS 1 was a bit of a mixed bag as the AI had to be trained on each game and took an aliased lower resolution image from the game then applied the classic AI Super Resolution techniques to “dream” or “hallucinate” the missing details and softened edges. However, DLSS 2 changed the inputs (this presentation originally convinced me AMD would add AI cores to RDNA2) and so required a buffer of previous low resolution input frames (including depth buffers and motion vectors) while removing the previous individual training requirement, effectively giving the AI the power of temporal accumulation information to generate the final output. So each new frame generated by the game can be run at a much lower resolution than the output, reducing the samples per output pixel, and yet will retain the look of a cleanly anti-aliased native resolution render. We are back to 1994 but rather than peering into a small box, the games look almost as good as offline rendering and output fullscreen. Even when not trained to give the exact same result as native processing, the AI seems to be quite stable and creates pleasing results in motion. It’s a game changer when targeting new screens that can accept 4K frames at or above 120Hz.

But nVidia do not have a monopoly on upscaling while anti-aliasing and more significant upscaling without compromises will be the new normal if my reading of the tea leaves (on samples per pixel per frame) is correct. Reusing information from previous frames is clearly a smart efficiency saving as long as we can reliably determine what information is useful and what isn’t (avoiding failures that create significant artefacts which are as distracting as the aliasing we’re trying to move beyond or the framerate drops we’re trying to avoid). The target of 4K on the PS4 Pro forced engines to pivot to smart upscaling strategies such as the use of checkerboarding and a rotated tangram resolve in Horizon: Zero Dawn, reducing GPU costs of each new frame by alternating which pixels in a checkerboard were rendered (then blending on the diagonals for that frame while adding in contributions from the previous frame). Recent years have seen an excellent execution of targeting the fixed scan-out time of non-VRR displays by managing the rendering load through modifying the internal render resolution then upscaling for the final presentation (usually with native UI compositing over the top for maximum text clarity). Even when dynamic resolution scaling is not available on PC, it has forced renderers to provide visually pleasing upscaling that gracefully handles even fine texture transparency and pixel-wide polygon details.
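
The core of checkerboarding is simple to sketch: shade half the pixels each frame and alternate which half. The version below is a bare-bones illustration with hypothetical names; it copies the missing pixels straight from the previous output and leaves out the diagonal blending, motion reprojection and tangram resolve that a real implementation such as the one described above would need.

```python
def checkerboard_frame(render, previous, frame_index, width, height):
    """Shade half the pixels this frame and reuse the rest from the last output.

    render(x, y): hypothetical shading function returning one pixel value.
    previous: 2D list (rows of pixels) holding the previous output frame.
    """
    parity = frame_index % 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            if (x + y) % 2 == parity:
                row.append(render(x, y))    # freshly shaded this frame
            else:
                row.append(previous[y][x])  # carried over from last frame
        out.append(row)
    return out


shaded = []

def render(x, y):
    shaded.append((x, y))
    return x + 10 * y

blank = [[0] * 4 for _ in range(2)]
frame0 = checkerboard_frame(render, blank, 0, 4, 2)
frame1 = checkerboard_frame(render, frame0, 1, 4, 2)
print(len(shaded), frame1)
```

For a static scene the second frame already holds a complete image while each frame only paid for half the shading; everything difficult about the technique lies in what to do when the scene moves.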

The Medium, TAA 50%
The Medium, TAA 75%
The Medium, TAA 100%

The last few years of Unreal Engine 4 have had quite a clean TAA with integrated upscaler (sometimes called TAAU) for dynamic internal resolution (it tracks the sub-pixel jitter so the samples can be correctly distributed even when changing the ratio of internal res to output res; primarily used on consoles, where the APIs for precise frame time calculation and estimation have existed for longer and the fixed platform makes it easier to define an ideal internal resolution window for reliable results that still come close to maximising GPU throughput - the skill is not underutilising the GPU by being too conservative and so being ready for scan-out milliseconds before needed). In the best cases, I am completely happy to run UE4 around 80% resolution (just under 1800p) and let the TAA upscaler reconstruct a soft and clean final image on my 4K PC big screen (getting close to home cinema levels of consuming my vision so making aliasing issues more apparent than for someone looking at a distant TV or small monitor). It doesn’t compete with DLSS (in Performance mode that is a 50% resolution so 1080p internal renders when the output is 4K) but then head to heads show DLSS 2 reaches close to image quality parity with UE4 TAA running at 100% internal resolution on PC. Dropping down to 1800p is clearly under 70% of the actual sample count (the previous percentages measure edge length, whereas sample count scales with area), and ensuring a relatively aliasing-free result without AI will err on the side of softer than DLSS Perf. The above captures from The Medium show a clear quality loss at 50% while the differences at 75% are more subtle compared to native internal resolution. The captures from Man of Medan below are where I think TAA with some upscaling is showing quality levels that you would not even imagine possible in the MLAA era (especially noting these captures have significantly fewer samples per pixel per frame than those games from a decade ago).
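
The edge-versus-area arithmetic is worth spelling out, since resolution sliders report the former while GPU cost tracks the latter. A small sketch, assuming a 2160-line (4K) output and hypothetical helper names:

```python
def internal_height(scale, output_height=2160):
    """Vertical resolution rendered internally at a given per-axis scale."""
    return round(scale * output_height)


def sample_fraction(scale):
    """Fraction of output pixels actually shaded: area is the scale squared."""
    return scale * scale


for label, scale in [("80%", 0.80), ("1800p", 1800 / 2160), ("50%", 0.50)]:
    print(f"{label}: {internal_height(scale)} lines, "
          f"{sample_fraction(scale):.0%} of the samples")
```

So an 80% slider setting is 1728 lines and 64% of the samples, 1800p is a little over 69%, and a 50% setting such as DLSS Performance shades only a quarter of the output pixels each frame.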

Man of Medan, TAA 85%
Man of Medan, TAA 85%
Man of Medan, TAA 85%

With the public release of Unreal Engine 5’s beta shipping with default-enabled Temporal Super Resolution, we are looking at the beginning of non-AI (or at least not running on Tensor cores) TAA plus upscaling that aims to hit the same milestones as DLSS when it comes to low internal resolution. The PR for the UE5 release announces 1080p internal render resolution, aiming to hit the quality bar of 4K native. That is an ambitious target and running the editor (which also uses UE5 TSR by default) there is a lot to appreciate about this beta’s visual quality, well beyond the 50% screenshot above from UE4’s technique (and that was already significantly above some previous branded sharpen plus upscale techniques as implemented in shipping games). We are approaching a point where continued refinement of this path of research will be able to pick away at the final issues and retain detail without turning the results into a mess of sharpening halos or lingering aliasing. From there we have a far more interesting future in which some games will be able to explore the artistic choice to reject such smoothing, rather than fall into it via broken PC releases, or even take the performance wins of significant upscaling while tweaking output to retain more of the underlying grainy component of ray tracing or other contributions (while adding noise to areas where it does not naturally occur and so approach something close to movie film grain that actually looks good but reduces render cost rather than increasing it slightly).

Edit (June 2021): This was written on the assumption that the imminent reveal of AMD's FidelityFX Super Resolution would confirm a very similar technique to UE5's Temporal Super Resolution, directly chasing after DLSS's impressive results at similarly low internal rendering resolutions (using fewer samples than checkerboarding and far fewer than where other TAA upscaling, such as in UE4, shines). It has since been announced that AMD are zagging where others have zigged and will not be using a temporal solution. Worryingly this has come with rather weak results on the one pre-release promotional image used to sell the technology. As I mentioned above, DLSS 1 did not come out of the gates a winner so AMD have plenty of time to iterate or to provide an open equivalent that replicates what Epic are doing with UE5.