Top Ten Obstacles When Using NVIDIA Hardware
14 April 2005

(1) The float4 bandwidth problem. Until the latest drivers, there was a more-than-10x falloff (3300 MPix/s down to 233 MPix/s) when fetching data from a float2, float3, or float4 texture as compared to a float texture. With the latest drivers, there's still more than a 7x falloff (nearly 3500 MPix/s compared to 486 MPix/s). It's just not feasible for us to work around this by using lots of parallel float textures instead of fewer floatN textures, and the floatN performance is just awful.

(2) Stability problems with the DirectX drivers. We basically had to abandon the NV4x's for our raytracing work because shaders that worked perfectly when compiled for GL would either (1) bluescreen the host or (2) silently return incorrect results (ray-triangle calculations would claim a ray missed a triangle it clearly hit). We needed DirectX because getting good performance with GL is impossible (see (3) and (4)).

(3) Lack of render-to-texture in GL. FBOs are coming and should just fix this. Without them, we have to copy from a pbuffer back into a texture after every pass (there's a sketch of this at the end of this note).

(4) Lack of read-modify-write textures in GL. A number of our shaders currently want to bind the same buffer as an input texture and output buffer. For each output pixel, they want to fetch the corresponding input texel (and only that texel) and use it to determine the new value. For example, we have a texture that represents the current state of each ray in our raytracer (traversing, intersecting, shading, ...), and the shaders that do the actual traversal may need to set the state to intersecting once they're done. This works fine under DirectX with all the ATI and NV hardware we've used, but completely fails under the experimental FBO-capable drivers Nick has made available (sketched at the end of this note).

(5) Slow texture fetch instructions. A floatN texture fetch imposes an N-cycle latency on the execution time of the shader that is not overlapped with math. Since we fetch lots of float4's (and tend in general to have bandwidth-limited kernels), this costs us a lot compared to the single-cycle cost on ATI boards.

(6) Awkwardness with texture addressing. Having to use floats as texture addresses is really annoying, a waste of math ops, and problematic because of precision and biasing issues (there's a sketch of the address arithmetic at the end of this note). I know integer operations are coming, but until they arrive, their absence is actually a top-ten problem.

*** Things below here are not really showstoppers in the sense that the ATI board doesn't do them significantly better. Rather, they're changes we identify as potential ways of working around more fundamental bandwidth issues we have with all GPUs today. ***

(7) Wider paths into the arithmetic units. Our understanding is that only a single float can be pulled into the arithmetic units per clock per pipe today. Since we do our best to do vectorized math on 4xfp32 vectors, that's a big performance problem. We believe those paths will widen as graphics apps move from 8-bit fixed 4-component data to fp16 4-component data, but a CPU can already pull an fp32 4-component vector into SIMD registers in a single clock.

(8) Efficiency at the low end. What's the smallest number of fragments at which a shader gets a reasonable fraction of peak performance? In apps like the raytracer, we try to rasterize with a lot of pixels culled by early-z and see really low efficiency / utilization. In general, without fine-grained branching, there are a lot of scenarios where good efficiency with 'only' a few thousand fragments would be really nice.
(9) The ability to refine z without disabling early culling. It would be really nice if we could rasterize with early-z culling enabled and have fragments refine z (essentially, a fragment that ran on pass N could set z so that it wouldn't run on pass N+1 or later). Then we could run a lot of passes without having to interleave passes to update z or tolerate doing needless work (the current interleaving is sketched at the end of this note). A semantic where fragments are only allowed to cull themselves (i.e., any fragment that's behind z on pass N is guaranteed to remain early-z culled on all subsequent passes and never come back to life) would be fine.

(10) Branching. NV40 branching / data-dependent looping isn't a remotely viable alternative to early-z-based shader-level conditional execution today. Our tests show we need branch coherence across regions of 64x64 fragments or more, and even then there are penalties even when all fragments branch the same way. For that matter, the driver seems to silently convert some shaders with branching back to predication: timing a shader that branches between two math-intensive choices shows it takes roughly as long as the sum of the two execution times. Branching is especially appealing to our multipass algorithms because it would let them use their local registers as a scratchpad, whereas today every pass has to fetch e.g. the ray origin and direction (which never change, but have to be fetched from texture memory for every fragment on every pass) and the ray state (which is written in every pass and then refetched from texture memory at the start of the next pass). With looping, so long as the card had enough state to avoid spilling the registers, we could eliminate those repeated fetches (there's a sketch of the difference at the end of this note).

(*) Texture size limits. This isn't a showstopper or even a problem in practice so much as a conceptual limit that affects credibility. Also, we know there's "something" coming to virtualize texture memory, but we have no idea what it is or how it will appear programmatically: will we just be able to allocate huge textures? Will gathers work? Will there be any application control? Will there be any way to trap / count / notify on out-of-range gathers, or any other sort of memory protection mechanism?

(**) I intentionally omitted bandwidth between the GPU and system memory. The throughputs certainly aren't great (and PCIe is no faster than AGP), but they're much better than the ATI rates, and the vast majority of our apps seem to download data to the GPU and then iterate on it a lot, which mitigates the impact. If I were more exposed to GPU-CPU hybrid algorithms I'd probably complain a lot more.
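
*** Sketches referenced above. These are illustrative C against stand-in names (run_pass, draw_fullscreen_quad, etc.), not our actual code; only the GL entry points themselves are real. ***

Sketch for (3): the per-pass copy we're stuck with until FBOs arrive.

    #include <GL/gl.h>

    /* Stand-in: binds input_tex and draws a full-screen quad with the current
     * fragment program, rendering into the currently bound pbuffer. */
    extern void run_pass(int pass, GLuint input_tex);

    void multipass_with_copies(GLuint state_tex, int width, int height, int npasses)
    {
        for (int pass = 0; pass < npasses; ++pass) {
            run_pass(pass, state_tex);
            /* Without render-to-texture the results live in the pbuffer, so we
             * copy them back into the texture before the next pass can read them. */
            glBindTexture(GL_TEXTURE_2D, state_tex);
            glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);
        }
    }

With FBOs the copy disappears: attach state_tex as the color attachment and render straight into it.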
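
Sketch for (4): the read-modify-write binding that works for us under DirectX but fails under the experimental FBO drivers. Written against the EXT_framebuffer_object entry points; draw_fullscreen_quad is a stand-in.

    #include <GL/gl.h>
    #include <GL/glext.h>

    extern void draw_fullscreen_quad(void);

    /* The same texture is attached as the render target and bound as the input;
     * each fragment reads only its own texel and overwrites it. */
    void update_state_in_place(GLuint fbo, GLuint state_tex)
    {
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, state_tex, 0);   /* output */
        glBindTexture(GL_TEXTURE_2D, state_tex);                  /* input  */
        draw_fullscreen_quad();   /* fragment program computes state[i] = f(state[i]) */
    }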
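
Sketch for (6): the kind of float address arithmetic every gather costs us today, written as plain C for illustration (on the hardware this happens per fragment in the shader).

    #include <math.h>

    /* Turn a 1D element index into normalized 2D coordinates for a
     * width x height texture, all in fp32. The float divides and the +0.5
     * texel-center bias are where the precision and biasing problems creep in. */
    void index_to_texcoord(float index, float width, float height,
                           float *s, float *t)
    {
        float row = floorf(index / width);
        float col = index - row * width;   /* == fmodf(index, width) */
        *s = (col + 0.5f) / width;         /* bias to the texel center */
        *t = (row + 0.5f) / height;
    }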
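
Sketch for (9): the pass structure we use today. Because a work pass can't refine z without losing early-z, a separate depth-only pass has to "retire" finished fragments; run_work_pass and update_depth_pass are stand-ins for our shader passes.

    #include <GL/gl.h>

    extern void run_work_pass(int pass);
    extern void update_depth_pass(void);      /* writes z from the state texture */

    void multipass_with_depth_interleave(int npasses)
    {
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LESS);                  /* early-z skips retired fragments */

        for (int pass = 0; pass < npasses; ++pass) {
            glDepthMask(GL_FALSE);             /* work pass leaves z alone */
            run_work_pass(pass);

            glDepthMask(GL_TRUE);              /* depth-only bookkeeping pass */
            glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
            update_depth_pass();
            glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        }
    }

If work-pass fragments could set z themselves without disabling early culling, the whole bookkeeping pass would go away.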
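
Sketch for (10): the repeated fetches, written as per-fragment C. fetch/store and the *_tex handles stand in for texture reads and render-target writes; none of this is our actual kernel code.

    typedef struct { float x, y, z, w; } float4;
    typedef int tex_handle;

    extern tex_handle ray_origin_tex, ray_dir_tex, ray_state_tex;
    extern float4 fetch(tex_handle t, int frag);
    extern void   store(tex_handle t, int frag, float4 v);
    extern void   traversal_step(const float4 *o, const float4 *d, float4 *s);
    extern int    done(float4 s);

    /* Today: every pass re-reads the invariant ray data and the state ... */
    void one_pass(int frag)
    {
        float4 o = fetch(ray_origin_tex, frag);   /* never changes, fetched anyway */
        float4 d = fetch(ray_dir_tex, frag);      /* never changes, fetched anyway */
        float4 s = fetch(ray_state_tex, frag);    /* written last pass, read again */
        traversal_step(&o, &d, &s);
        store(ray_state_tex, frag, s);            /* spilled back to texture memory */
    }

    /* ... whereas a real loop could keep all of it in registers between steps. */
    void looped(int frag)
    {
        float4 o = fetch(ray_origin_tex, frag);
        float4 d = fetch(ray_dir_tex, frag);
        float4 s = fetch(ray_state_tex, frag);
        while (!done(s))
            traversal_step(&o, &d, &s);
        store(ray_state_tex, frag, s);
    }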