Ray tracing projects Winter 2006
--------------------------------

- Fastest possible GPU raycaster. Settle whether it can be competitive with the CPU and, if not, why not. Specific potential tasks:
  - Incorporate branching and looping to reduce passes.
  - Unroll / packetize kernels to operate on multiple rays or triangles at once.
  - Use derivative instructions to coordinate 2x2 packets of fragments.
  - Break the image into tiles and only run tiles that haven't yet completed.
  - OR, run compaction phases to increase efficiency.

- Reflection, refraction, and indirect illumination for the GPU raytracer. Specific potential tasks:
  - Kernels to generate secondary rays based on material properties and/or the scene lighting.
  - Kernels to cast the rays, associate them with their film position, and build up the resulting images.
  - Revisit GPU photon mapping, both building (unlikely) and using photon maps with today's GPUs and a kd-tree.
  - Texture support (i.e. support textured primitives).

- Explore a hybrid with actual ray casting (primary or secondary) on the CPU, but other ray generation, local shading computations, and other computationally expensive tasks on the GPU. Specific potential tasks:
  - Similar to the items for reflection and indirect illumination, but the big challenge is determining whether there's a structure where enough work gets done on the GPU for it to be worth it.
  - John Lewis and many others have expressed a belief that film shaders are very computationally intense without casting many (or any) secondary rays. That sort of shader would be ideal for a hybrid tracer. See the LPICS paper.

- Focus on shadows. Shadows are touted as a challenge for rasterization and natural for ray tracing, so demonstrate an actual app that has good performance with raytraced shadows. Potential tasks:
  - Rasterize instead of casting primary rays.
  - Modify the GPU tracer to terminate on any intersection and run it during fragment shading.
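The "terminate on any intersection" idea is an any-hit query: a shadow ray can stop at the first occluder instead of finding the closest one. A minimal CPU-side sketch, assuming a Moller-Trumbore triangle test and a brute-force triangle loop standing in for the real traversal (all names here are illustrative, not our tracer's):

```python
def ray_triangle_hit(orig, dir, tri, t_max, eps=1e-9):
    """Moller-Trumbore test: True if the ray orig + t*dir hits tri for eps < t < t_max."""
    a, b, c = tri
    e1 = [b[i] - a[i] for i in range(3)]
    e2 = [c[i] - a[i] for i in range(3)]
    def cross(u, v):
        return [u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0]]
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    p = cross(dir, e2)
    det = dot(e1, p)
    if abs(det) < eps:           # ray parallel to triangle plane
        return False
    inv = 1.0 / det
    s = [orig[i] - a[i] for i in range(3)]
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = cross(s, e1)
    v = dot(dir, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    t = dot(e2, q) * inv
    return eps < t < t_max

def in_shadow(point, light, triangles):
    """Any-hit query: return True as soon as one occluder is found."""
    dir = [light[i] - point[i] for i in range(3)]   # unnormalized; light at t = 1
    for tri in triangles:
        if ray_triangle_hit(point, dir, tri, t_max=1.0):
            return True          # early out -- no need for the closest hit
    return False
```

Because the loop returns on the first hit within (0, 1), occluders beyond the light are correctly ignored and traversal order doesn't matter, which is what makes the kernel cheaper than closest-hit casting.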
  - Alternatively, compare with reading back the hits and doing shadow casting with SSE packets.
  - Broaden from point lights to area lights for soft shadows.

- Cell based raytracer. Cell potentially offers some of the computational advantages of a GPU without as big a straitjacket, but requires coping with the very limited local memory and explicit movement of data from system memory. Potential tasks:
  - Implement an 8-threaded (each thread 4-wide SIMD) ray caster using the naive strategy of fetching kd-tree and triangle data as needed.
  - Implement some form of tree / triangle cache in each SPU and try to cover as much DMA latency as possible.
  - Instead, break up the kd-tree into a tree of subtrees and DMA entire subtrees at a time. Upon leaving a subtree, DMA in the new subtree.
  - Alternatively, run lots of rays through the current subtree and bin them based on their new subtree. Move on to the largest available subtree when all the rays are done.
  - Shading. We'll also need some form of shading that tackles many of the same issues described above.

- How good can a kd-tree get? Gordon lays out some naive and incredibly bad building heuristics and then a cost metric with some approximations in computing actual costs. pbrt has a 'good, but not great' building strategy. It should be possible to exhaustively (with aggressive culling of known non-optimal paths) find the 'best' tree meeting some cost criterion. It also might be possible to find the 'best' tree where 'best' is the tree that minimizes raycasting time instead of fitting any greedy local cost metric.
  - It would be nice to quantify the performance difference (at raycasting time) between the strategies and provide a sense of whether a great tree is 50% better, 2x better, or 10x better than a bad tree.
  - It would also be nice to quantify how much each increase in build time reduces raycasting time (or which optimizations don't affect build time).
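The cost metric in question is conventionally a surface-area heuristic: the chance a random ray passing through a node also passes through a child is approximated by the ratio of their surface areas. A sketch, where the constants `c_trav` and `c_isect` and the uniform-ray assumption are standard placeholder choices rather than values from these notes:

```python
def surface_area(bounds):
    """Surface area of an axis-aligned box given as (min, max) 3-tuples."""
    (x0, y0, z0), (x1, y1, z1) = bounds
    dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def leaf_cost(n, c_isect=80.0):
    """Cost of intersecting all n triangles in a leaf."""
    return c_isect * n

def split_cost(parent, left, right, n_left, n_right, c_trav=1.0, c_isect=80.0):
    """Expected cost of splitting: one traversal step plus each child's
    intersection cost weighted by the area-ratio probability that a ray
    through the parent also enters that child."""
    sa = surface_area(parent)
    p_l = surface_area(left) / sa
    p_r = surface_area(right) / sa
    return c_trav + c_isect * (p_l * n_left + p_r * n_right)
```

A greedy builder keeps the candidate split minimizing `split_cost` and makes a leaf when even the best split is no cheaper than `leaf_cost` -- exactly the local decision whose distance from globally optimal the items above propose to measure.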
    The question is: what's the best tree that can be built in N seconds total of building plus traversing? If we can build in 5 seconds instead of 10 and raycasting then takes 750 milliseconds instead of 300, that's a good trade.
  - Are there any theoretical results we can produce about kd-trees? Can we show that greedy local cost minimization is within some closed-form bound of globally optimal?
  - Just to be different, what about optimal trees for parallel architectures? GPUs value coherence much more highly than CPUs and don't incur nearly as high a relative cost to compute ray-triangle intersections. Does that change the cost metric? Does it change any of the approximations we use for the different terms when actually implementing a builder?
  - Pat makes the good point that a good starting point may be to work in 2D. It's much easier to visualize 2D kd-trees, and it's also (a bit) easier to brute-force generate and test a wider variety of trees in the same time. We can evaluate both the effectiveness of the standard cost metric versus the best possible tree and the effectiveness of the various approximations to the costs of the various terms during the build. We can also see if any interesting properties come to light by visually comparing 'very good' semi-randomly generated trees.

- How fast can we build a kd-tree? The current raytracing code can trace millions of rays per second but takes 13 seconds to build a kd-tree out of 70 thousand unsorted triangles (a build rate of 0.005 million triangles per second), though the builder is known to be unoptimized compared to the highly optimized ray caster. Historically, rendering times were so high that a few seconds or even minutes to build acceleration structures were tolerated. With interactive ray tracing that's clearly no longer true. Potential specific tasks:
  - Build by binning the triangles to avoid re-sorting all of them each time we descend the tree.
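The binning item above can be sketched roughly as follows: bucket triangle centroids into a fixed number of bins along one axis, then sweep the bin boundaries keeping running counts, so no level of the build re-sorts the triangle list. The bin count, the centroid approximation, and the crude max-side cost proxy are all illustrative assumptions; a real builder would plug in the surface-area cost instead.

```python
def best_binned_split(centroids, lo, hi, n_bins=16):
    """Pick a split plane along one axis from binned centroid counts.
    centroids: scalar positions along the axis; (lo, hi): node bounds.
    Returns (split_position, n_left, n_right), or None for a degenerate node."""
    if hi <= lo:
        return None
    counts = [0] * n_bins
    scale = n_bins / (hi - lo)
    for c in centroids:                       # one pass, no sorting
        i = min(n_bins - 1, int((c - lo) * scale))
        counts[i] += 1
    best = None
    n_left = 0
    n = len(centroids)
    for b in range(1, n_bins):                # candidate planes between bins
        n_left += counts[b - 1]
        n_right = n - n_left
        cost = max(n_left, n_right)           # crude stand-in for an SAH cost
        if best is None or cost < best[0]:
            best = (cost, lo + b / scale, n_left, n_right)
    return best[1], best[2], best[3]
```

Each node costs O(triangles + bins) instead of the O(triangles log triangles) re-sort, which is the whole point of the approach.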
  - Build by sorting the triangles once and then maintaining intervals (one in each of the axes). Each time you descend a layer, refine the intervals by intersecting them with the split plane. To enumerate the primitives in a given node, walk all the triangles in the narrowest interval and cull any that violate the other two intervals. Again, you can determine just the triangles in any cell without visiting or sorting all the others.
  - Build by first putting all the triangles in a grid. This is essentially a sophisticated version of building with intervals.
  - Build from a grid and maintain pointers from each grid cell to the tree leaves that intersect it. Then handle scenes with moving objects by updating the objects' positions on the grid and relocating triangles as appropriate among affected tree leaves. This seems like it should work well for scenes with lots of static geometry. If too much geometry is moving, at some point you'll need to rebuild the tree to maintain its 'goodness' (possibly, depending on how much impact the goodness of a tree really has on rendering times).
  - There's a similar argument (and theme) in building a kd-tree from a scene graph (usually equivalent to a BVH), though you lose the easy backmap for updating any tree cells affected by moving objects.
  - Is there a dynamic programming solution? If you just brute-force a fairly naive strategy, how good is it and how memory hungry is it? Can you simplify it (at the potential cost of a degree of optimality)?

- Build the kd-tree lazily. With interactive tracers, casting the rays takes fractions of a second, but building trees takes multiple seconds. Worse, with moving objects, the trees need to be rebuilt or updated every frame. Additionally, there is no purpose in building a tree for portions of the scene that rays never enter (i.e. aren't visible).
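A minimal sketch of the lazy-refinement decision: split a leaf on demand only when the estimated cost of traversing into the would-be children beats intersecting the leaf's whole triangle list. The constants and the `p_left`/`p_right` area-ratio estimates are placeholder assumptions, not tuned values:

```python
def should_refine(n_tris, n_left, n_right, p_left, p_right,
                  c_trav=1.0, c_isect=80.0):
    """Decide whether a ray arriving at a leaf should trigger a split.
    n_left/n_right: triangle counts in the candidate children (straddlers
    counted in both); p_left/p_right: probability that a ray entering the
    leaf enters each child (surface-area ratios under a uniform-ray
    assumption). Constants are made-up placeholders."""
    cost_as_leaf = c_isect * n_tris
    cost_if_split = c_trav + c_isect * (p_left * n_left + p_right * n_right)
    return cost_if_split < cost_as_leaf
```

In a lazy builder this test runs the first time a ray reaches a leaf; when it returns False the leaf is final and the ray just intersects its triangles, so invisible regions of the scene never pay for refinement at all.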
  So it seems to make sense to refine the tree on demand any time a ray hits a node that's currently a leaf but has an estimated cost of splitting (and continuing to traverse) below the cost of testing against all of its triangles. Potential specific tasks:
  - Organize the primitives into some form of hierarchy instead of a long list. Without it, refining a node is likely to be extremely expensive.
  - Determine a threshold number of primitives that comfortably fits in the cache. Refining a cell at or below that size only has computational cost. Refining a cell above that size also has a large memory bandwidth cost. Potentially, a good scene hierarchy mitigates the bandwidth cost at the coarser levels of the tree.
  - Formulate cost metrics, or at least heuristics, to determine when to refine vs. intersect and how far to refine (i.e. what metric to use to determine when to stop refining).
  - Read the literature on dynamic updates to BSP trees (which are just kd-trees with fewer limitations and more legitimacy among the theory folks). Being able to dynamically update kd-trees bears on dynamic scenes, lazy building, and interactivity in general.

Ray Tracing Themes Winter 2006
------------------------------

[ After the brainstorming laid out above and a few rounds of discussions, I've abstracted out the following higher level structure / combination of things we're pursuing and how they fit in a larger framework. ]

- Ray tracing efficiently on Cell based machines. The grander question here is: how do the best known CPU ray tracing algorithms adapt to the recent architectures that offer vastly more flops but trade off limitations in addressable memory, performance cliffs for non-SIMD execution, etc.?
  What aspects of the algorithms are key to their performance on these newer architectures? Can we actually harness those flops to outperform CPUs? Where are the most effective changes to enhance ray tracing on these architectures, what benefits can we expect from such changes, and how plausible are they in the constrained context that drives the design of the various architectures?

- Describe a vision for ray tracing capable future graphics systems. There are arguments both that rasterization has killed ray tracing and its zealots just haven't accepted it, and that rasterization is a doomed kludge for 3D because the horsepower for ray tracing (and physical simulation derived methods) hasn't quite been achieved. I [speaking for myself, though I suspect we agree at a high level] believe the optimal future is in between. There's a hefty existing investment in rasterization hardware and in amazing techniques for producing visibility from a single (per frame) viewpoint and shading local effects (plus precomputed, texture based faux global effects). Additionally, rasterization is much better suited to the various desktop and windowing systems, text rendering, and other basic end user tasks. Finally, there's strong commonality between the shading needs of rasterization and ray tracing, and again it makes more sense to leverage, reuse, and perhaps extend the current programmable shading hardware than to call for radical changes. Rather, ray tracing post-rasterization can enhance existing GPU capabilities in areas that are awkward today. Specifically, tracing rays simplifies shadows, reflection and refraction, and photon mapping and other techniques for global effects.

- Find ways to use kd-trees even better for ray tracing.
  The three areas we discuss extensively are: determining the ideal kd-tree (allowing expensive offline building) and thereby setting a baseline for judging how good the trees are that any given building strategy produces; devising a builder that runs quickly (relative to the time spent intersecting rays per build) and perhaps even trades off tree quality for reduced total build-plus-intersection time; and dynamically updating an existing kd-tree based upon the unstructured movement of a subset of the scene geometry. Note that we know of people working actively in the latter two areas, but not yet enough of the specifics to avoid reproducing their effort, learn from them, and produce contributions without trampling on their plans.