Monday, 30 August 2021

Intel XeSS: Joining nVidia in Tensor-Accelerated TAAU

Back in May I wrote about the evolution of per-pixel rendering costs, expecting the imminent announcement of AMD's next generation temporal upscaling technique: a competitor to DLSS 2.x that would run on hardware from multiple vendors and even offer a fully open source release, which anyone could inspect or improve upon (if it wasn't a perfect match for one vendor's underlying hardware). That ended up not happening: FidelityFX Super Resolution, while an interesting alternative to more basic spatial upscaling, didn't quite match my hopes for real competition to nVidia's RTX-only DLSS.

I had started a draft post on implementing FidelityFX Super Resolution in your own engine but, really, I'm not sure how much it adds. If you want a more expensive upscale that retains far more sharpness than bilinear (so not a pass you're going to follow with a blur), or you're already doing an expensive sharpening pass like FidelityFX CAS after an (optional) upscale, then you should absolutely drop in FidelityFX Super Resolution anywhere you'd otherwise be weighing up Lanczos, because that's roughly what it is. As others have noted by now, players already make this choice: when not delegating the upscale to the output monitor, modern GPUs perform exactly this sort of scaling pass whenever the internal resolution is set lower than the native resolution of your system. I've often been quite happy running AAA titles at 1800p on a 4K screen (as long as the anti-aliasing was good) and FSR is an enhancement on that path, increasing quality and adding the option to composite the UI, and any pixel-scale noise like a film grain effect, at native resolution after the upscale.
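For anyone who hasn't stared at resampling kernels recently, here is a minimal sketch of the Lanczos-2 weight function that FSR is roughly approximating. The shipped EASU pass uses a polynomial approximation with edge-adaptive tweaks rather than this literal code, so treat it as a mental model, not AMD's implementation:

```cpp
#include <cmath>

// Lanczos-2 reconstruction kernel: sinc(x) windowed by sinc(x/2),
// non-zero over |x| < 2. Weighting source pixels by this (normalised
// over the footprint) gives the sharp upscale that bilinear lacks.
double lanczos2(double x) {
    x = std::fabs(x);
    if (x < 1e-8) return 1.0;  // limit of sinc at 0
    if (x >= 2.0) return 0.0;  // outside the two-lobe support
    const double pi = 3.14159265358979323846;
    double sinc   = std::sin(pi * x) / (pi * x);
    double window = std::sin(pi * x / 2.0) / (pi * x / 2.0);
    return sinc * window;
}
```

The negative lobes of that kernel are what preserve (even enhance) edge contrast, and also why a cheap polynomial approximation is attractive: the full transcendental version is overkill for a per-pixel pass.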

Intel XeSS

What has recently re-energised my interest in upscaling techniques is the Intel Architecture Day announcement of XeSS: a next generation temporal upscaling technique, offering a competitor to DLSS 2.x that will run on hardware from multiple vendors and will even provide a fully open source option (at some as yet unknown future date). So I had vaguely the right timescale for an announcement but had bet on the wrong non-nVidia GPU company making it.

[Images: XeSS outlined; DLSS 2.x outlined]

We do not have full access to XeSS, so for now we only have a rough roadmap of releases, starting with the initial SDK for use with their Arc series of GPUs (hardware that will not become available until early 2022). The design of the new Arc (Xe-HPG) series goes hard on matrix (Tensor) accelerators and so it is a natural fit to offer something broadly comparable to DLSS, which is accelerated by these AI/Tensor cores. Intel is actually dedicating even more of their GPU silicon to matrix acceleration than nVidia, so expect a major push to ensure software supports XeSS rather than leaving that silicon idle when running the latest AAA releases.

From the outline Intel has provided, it is easy to see that, beyond similar hardware being tapped to run deep learning algorithms, the inputs are also very similar to nVidia's DLSS 2.x: a jittered low resolution input frame, motion vectors noting the velocity of each pixel, and a history buffer of previous frames from which to extract information (which, even for a totally static scene, provides additional information thanks to the moving jitter pattern). The only additional inputs nVidia are explicit about collecting with their API are an exposure value (although the current SDK, 2.2.1, has added an auto-exposure function since these nVidia slides were published) and the depth buffer (which Intel may implicitly include as part of the complete input frame).
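To make that input list concrete, here is a hypothetical grouping of what an engine hands over each frame, based only on the inputs named above. Every name here is illustrative; neither Intel's nor nVidia's actual API looks like this:

```cpp
// Hypothetical per-frame inputs to a DL temporal upscaler, collecting
// the resources both XeSS and DLSS 2.x are described as consuming.
// Resource handles are left as opaque pointers for the sketch.
struct UpscalerInputs {
    const void* colorLowRes;     // jittered low resolution frame
    const void* motionVectors;   // per-pixel velocity
    const void* depthLowRes;     // depth buffer (explicit in DLSS 2.x)
    const void* historyHighRes;  // previous reconstructed output(s)
    float       jitterOffsetX;   // this frame's sub-pixel camera jitter
    float       jitterOffsetY;
    float       exposure;        // exposure value (or use auto-exposure)
};
```

The jitter offsets matter as much as the buffers themselves: the upscaler has to know where each low resolution sample actually landed within the high resolution pixel grid to accumulate detail over time.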

Intel, in comments to the press, have discussed the possibility of the industry converging on a common standard for DL upscaling APIs, allowing almost drop-in dll swaps that make it trivial to support the various alternatives. The way this is talked about as a future development means it is unlikely that the initial release of XeSS will be a drop-in dll replacement for DLSS 2.x (using identically named functions/entry-points and settings ranges), although it remains to be seen how difficult it would be for ingenious hackers to bridge the differences and allow current DLSS titles to run a bootleg XeSS mode under the hood (of course, not condoned by Intel itself).
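As an entirely speculative sketch of what that convergence could look like: if vendors agreed on one exported entry point and one input layout, swapping upscalers really would reduce to pointing at a different dll. The entry-point name and function shape below are invented for illustration; no such shared ABI exists today:

```cpp
#include <windows.h>

struct UpscalerInputs;  // the hypothetical grouping sketched earlier

// Hypothetical vendor-neutral upscaler signature. "EvaluateUpscaler"
// is an invented name - neither DLSS nor XeSS exports this today.
typedef int (*EvaluateUpscalerFn)(const UpscalerInputs* inputs,
                                  void* outputHighRes);

EvaluateUpscalerFn loadUpscaler(const wchar_t* dllPath) {
    HMODULE module = LoadLibraryW(dllPath);  // e.g. some vendor's dll
    if (!module) return nullptr;
    // A shared, documented entry point is what would turn "support
    // another vendor" into a dll swap rather than an integration project.
    return reinterpret_cast<EvaluateUpscalerFn>(
        GetProcAddress(module, "EvaluateUpscaler"));
}
```

Today the equivalent bridging work means mapping one SDK's functions and quality settings onto another's, which is exactly the gap those ingenious hackers would have to paper over.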

[Images: DLSS time savings; XeSS time savings; DLSS savings scaling]

This brings us to a major point of differentiation (vs nVidia) and something very exciting for the many users stuck with our current supply-constrained GPU market (which will not improve sufficiently to allow everyone to upgrade to an RTX card even by late next year): XeSS will provide a fallback mode that runs (be it somewhat slower) on GPUs without hardware (XMX) matrix acceleration. Starting with nVidia's Pascal (Series 10), AMD's Vega, and Intel's Xe-LP in Tiger/Rocket Lake (11th Gen Core processors), GPUs have offered an AI acceleration instruction for Int8 operations (DP4a) that provides quadruple the throughput of 32-bit operations by computing dot products on packed Int8 values - effectively a mid-ground between running AI workloads as generic shaders and the full acceleration of dedicated Tensor units.
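If you haven't met DP4a before, this scalar C++ reference shows what the (signed variant of the) instruction computes; on real hardware the whole loop is a single instruction (CUDA exposes it as the __dp4a intrinsic, for example), which is where the quadrupled throughput comes from:

```cpp
#include <cstdint>

// Reference semantics of a DP4a-style operation: a dot product of two
// registers each holding four packed int8 lanes, accumulated into a
// 32-bit integer. Hardware retires this in one instruction per pair
// of registers, versus four multiply-adds done with 32-bit maths.
int32_t dp4a_reference(uint32_t a, uint32_t b, int32_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        int8_t av = static_cast<int8_t>((a >> (lane * 8)) & 0xFF);
        int8_t bv = static_cast<int8_t>((b >> (lane * 8)) & 0xFF);
        acc += static_cast<int32_t>(av) * static_cast<int32_t>(bv);
    }
    return acc;
}
```

Dot products with accumulation are the inner loop of neural network inference, so an instruction shaped exactly like this goes a long way even without dedicated matrix units.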

With Intel so invested in matrix acceleration, it becomes ever more evident that AMD are being left behind - even mobile chips ship with limited amounts of this form of hardware acceleration (as I noted in 2019) - so this fallback provides a vital half-step (which should more than pay for itself via the reduced rendering cost of a lower resolution input image that needs no anti-aliasing). This also applies to the current consoles, which notably did not get left behind on ray tracing acceleration but are starting to look down a long generational window without hardware matrix acceleration. The Xbox Series consoles offer something equivalent to DP4a via DirectML (and Microsoft have said they are working on their own DL upscaling technique for use on those consoles in the future) but we don't yet know if Sony have an answer for the PS5.

In interviews it sounds like Intel are, at least initially, reserving the XMX path for their own Arc GPUs (despite nVidia RTX cards having equivalent matrix acceleration), so it will be DLSS only on RTX going up against XeSS XMX (fast) only on Arc and XeSS DP4a (slower) everywhere else. You could read the answers as being open to other vendors dropping in their own engine (say an nVidia Tensor engine, rather than being forced down the DP4a fallback codepath), but maybe not before Intel releases the full source code (for which no timeframe is provided). In that DF interview there is also the suggestion of a potential future development where laptops do the main rendering on a dGPU then hand it off to the iGPU, using Intel's matrix accelerators there to run the final stages (XeSS upsample, composite UI, etc). Given that current laptops with a discrete GPU already pass the completed 3D render to the iGPU for output via its direct connection to the screen, this would only be an incremental step forward (rather than completely reinventing the path a frame takes today).

One can even imagine, looking at the announced AVX-VNNI instructions for consumer CPUs and AMX instructions for server CPUs, a future where people working on interesting software renderers stay entirely on the CPU while taking advantage of DL upscaling, assuming the throughput is high enough (and power efficient enough) to provide a worthwhile wow factor. Real-time software renderers are not competitive with modern GPU-accelerated renderers (rasterisation being an embarrassingly parallel problem on hardware designed around accelerating exactly that) but they remain an interesting hobby niche that may enjoy playing with this new area of technology.
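For the curious, the CPU-side VNNI instructions are the same shape of operation as DP4a, just spread across a vector register. A minimal sketch using Intel's intrinsics (the wrapper function name is mine; the intrinsic and the underlying VPDPBUSD instruction are real):

```cpp
#include <immintrin.h>

// VPDPBUSD: multiply unsigned bytes from one register by signed bytes
// from another and accumulate each group of four products into a
// 32-bit lane - DP4a semantics, eight lanes at a time in a 256-bit
// register. Build with VNNI enabled (e.g. -mavx512vnni -mavx512vl for
// the AVX-512 spelling; the AVX-VNNI flavour on consumer parts exposes
// the same operation as _mm256_dpbusd_avx_epi32 under -mavxvnni).
__m256i vnni_dot_accumulate(__m256i acc, __m256i a_u8, __m256i b_s8) {
    return _mm256_dpbusd_epi32(acc, a_u8, b_s8);
}
```

Whether that (or the much beefier AMX tiles) adds up to enough sustained throughput for real-time inference alongside a software renderer is exactly the open question.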

[Images: Non-DL-based clamping limitations; DL-based denoising limitations]

Going back to a broader discussion, the reason for this excitement around DL upscaling (as I hopefully outlined in my previous post) is that it avoids the poor behaviour of TAA's rejection or clamping of values from the history buffer, which loses detail or fails outright around higher frequency information (as nVidia have made clear in their talks on this topic). When the buffer can be fully utilised, a well managed jittered history can reconstruct a lot of detail for any element that has been onscreen for even a couple of frames (and anything that hasn't is liable to be masked behind motion blur), despite using an internal resolution significantly below native output. Direct competition between two different implementations should provide even more impetus for advancement in this area. We are only scratching the surface of what deep learning algorithms can do to enhance our current rendering techniques.
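To make the clamping failure concrete, here is a minimal sketch of the classic non-DL history rectification step (neighbourhood clamping; real engines often work in YCoCg and use variance-based boxes, so this is the simplest possible form):

```cpp
#include <algorithm>

struct Color { float r, g, b; };

// Clamp the reprojected history sample into the min/max box of the
// current frame's 3x3 neighbourhood. Any detail the history holds
// that falls outside that box - including genuine sub-pixel detail
// accumulated from earlier jittered frames - is discarded, which is
// exactly the loss the learned approaches try to avoid.
Color clampHistory(Color history, const Color neighbourhood[9]) {
    Color lo = neighbourhood[0], hi = neighbourhood[0];
    for (int i = 1; i < 9; ++i) {
        lo.r = std::min(lo.r, neighbourhood[i].r);
        lo.g = std::min(lo.g, neighbourhood[i].g);
        lo.b = std::min(lo.b, neighbourhood[i].b);
        hi.r = std::max(hi.r, neighbourhood[i].r);
        hi.g = std::max(hi.g, neighbourhood[i].g);
        hi.b = std::max(hi.b, neighbourhood[i].b);
    }
    history.r = std::clamp(history.r, lo.r, hi.r);
    history.g = std::clamp(history.g, lo.g, hi.g);
    history.b = std::clamp(history.b, lo.b, hi.b);
    return history;
}
```

A hard-coded heuristic like this cannot tell "stale, disoccluded history" apart from "high frequency detail the current low resolution frame simply missed"; a trained network can learn to make that distinction, which is the whole pitch.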

Of course, there are some problems that nVidia consider potentially intractable, such as the many types of noise their DLSS 2.x approach cannot deal with (as it cannot provide a generalised solution that accounts for every noise type) and which, if unavoidable, must be denoised before DLSS is applied. This can force a traditional TAA stage (at a non-trivial rendering and memory cost) back into engines that could otherwise drop it entirely, the ultimate goal being to rely only on the anti-aliasing of DLSS for exceptional final results. Intel offers a second set of engineers looking at such problems, who may have fresh insights into what is possible. Microsoft are working on their own Xbox DL upscaling. There are signs Sony are up to something too. While AMD did not announce their plans for this area alongside the recent announcement of FSR, I am still convinced that the future of AMD GPUs will involve Tensor units and that they will justify that use of transistors with a DLSS-a-like - but we may be waiting for RDNA3 in late 2022 before we get that piece of the puzzle. For now, Intel are in the spotlight and anyone with a vaguely recent GPU (even the most recent iGPUs) is being invited to come along.