Ray tracing projects Winter 2006
--------------------------------

- Fastest possible GPU raycaster. Settle whether it can be competitive with the CPU and, if not, why not. Specific potential tasks:
  - Incorporate branching and looping to reduce passes.
  - Unroll / packetize kernels to operate on multiple rays or triangles at once.
  - Use derivative instructions to coordinate 2x2 packets of fragments.
  - Break the image into tiles and only run tiles that haven't yet completed.
  - OR, run compaction phases to increase efficiency.

- Reflection, refraction, and indirect illumination for the GPU raytracer. Specific potential tasks:
  - Kernels to generate secondary rays based on material properties and/or the scene lighting.
  - Kernels to cast the rays, associate them with their film position, and build up the resulting images.
  - Revisit GPU photon mapping, both building (unlikely) and using photon maps with today's GPUs and a kd-tree.
  - Texture support (i.e. support textured primitives).

- Explore a hybrid with actual ray casting (primary or secondary) on the CPU, but other ray generation, local shading computations, and other computationally expensive tasks on the GPU. Specific potential tasks:
  - Similar to the items for reflection and indirect illumination, but the big challenge is determining whether there's a structure where enough work gets done on the GPU for it to be worth it.
  - John Lewis and many others have expressed a belief that film shaders are very computationally intense without casting many (or any) secondary rays. That sort of shader would be ideal for a hybrid tracer. See the LPICS paper.

- Focus on shadows. Shadows are touted as a challenge for rasterization and natural for ray tracing, so demonstrate an actual app that has good performance with raytraced shadows. Potential tasks:
  - Rasterize instead of casting primary rays.
  - Modify the GPU tracer to terminate on any intersection and run it during fragment shading.
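The "terminate on any intersection" idea is an any-hit query: a shadow ray can stop at the first occluder instead of finding the closest one. A minimal CPU-side sketch, assuming a Moller-Trumbore triangle test and a brute-force triangle loop standing in for the real traversal (all names here are illustrative, not our tracer's):

```python
def ray_triangle_hit(orig, dir, tri, t_max, eps=1e-9):
    """Moller-Trumbore test: True if the ray orig + t*dir hits tri for eps < t < t_max."""
    a, b, c = tri
    e1 = [b[i] - a[i] for i in range(3)]
    e2 = [c[i] - a[i] for i in range(3)]
    def cross(u, v):
        return [u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0]]
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    p = cross(dir, e2)
    det = dot(e1, p)
    if abs(det) < eps:           # ray parallel to triangle plane
        return False
    inv = 1.0 / det
    s = [orig[i] - a[i] for i in range(3)]
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return False
    q = cross(s, e1)
    v = dot(dir, q) * inv
    if v < 0.0 or u + v > 1.0:
        return False
    t = dot(e2, q) * inv
    return eps < t < t_max

def in_shadow(point, light, triangles):
    """Any-hit query: return True as soon as one occluder is found."""
    dir = [light[i] - point[i] for i in range(3)]   # unnormalized; light at t = 1
    for tri in triangles:
        if ray_triangle_hit(point, dir, tri, t_max=1.0):
            return True          # early out -- no need for the closest hit
    return False
```

Because the loop returns on the first hit within (0, 1), occluders beyond the light are correctly ignored and traversal order doesn't matter, which is what makes the kernel cheaper than closest-hit casting.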
  - Alternatively, compare with reading back the hits and doing shadow casting with SSE packets.
  - Broaden from point lights to area lights for soft shadows.

- Cell based raytracer. Cell potentially offers some of the computational advantages of a GPU without as big a straitjacket, but requires coping with the very limited local memory and explicit movement of data from system memory. Potential tasks:
  - Implement an 8-threaded (each thread 4-wide SIMD) ray caster using the naive strategy of fetching kd-tree and triangle data as needed.
  - Implement some form of tree / triangle cache in each SPU and try to cover as much DMA latency as possible.
  - Instead, break up the kd-tree into a tree of subtrees and DMA entire subtrees at a time. Upon leaving a subtree, DMA in the new subtree.
  - Alternatively, run lots of rays through the current subtree and bin them based on their new subtree. Move on to the largest available subtree when all the rays are done.
  - Shading. We'll also need some form of shading that tackles many of the same issues described above.

- How good can a kd-tree get? Gordon lays out some naive and incredibly bad building heuristics and then a cost metric with some approximations in computing actual costs. pbrt has a 'good, but not great' building strategy. It should be possible to exhaustively (with aggressive culling of known non-optimal paths) find the 'best' tree meeting some cost criterion. It also might be possible to find the 'best' tree where 'best' is the tree that minimizes raycasting time instead of fitting any greedy local cost metric.
  - It would be nice to quantify the performance difference (at raycasting time) between the strategies and provide a sense of whether a great tree is 50% better, 2x better, or 10x better than a bad tree.
  - It would also be nice to quantify how much each increase in build time reduces raycasting time (or which optimizations don't affect build time).
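The cost metric in question is conventionally a surface-area heuristic: the chance a random ray passing through a node also passes through a child is approximated by the ratio of their surface areas. A sketch, where the constants `c_trav` and `c_isect` and the uniform-ray assumption are standard placeholder choices rather than values from these notes:

```python
def surface_area(bounds):
    """Surface area of an axis-aligned box given as (min, max) 3-tuples."""
    (x0, y0, z0), (x1, y1, z1) = bounds
    dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def leaf_cost(n, c_isect=80.0):
    """Cost of intersecting all n triangles in a leaf."""
    return c_isect * n

def split_cost(parent, left, right, n_left, n_right, c_trav=1.0, c_isect=80.0):
    """Expected cost of splitting: one traversal step plus each child's
    intersection cost weighted by the area-ratio probability that a ray
    through the parent also enters that child."""
    sa = surface_area(parent)
    p_l = surface_area(left) / sa
    p_r = surface_area(right) / sa
    return c_trav + c_isect * (p_l * n_left + p_r * n_right)
```

A greedy builder keeps the candidate split minimizing `split_cost` and makes a leaf when even the best split is no cheaper than `leaf_cost` -- exactly the local decision whose distance from globally optimal the items above propose to measure.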
    The question is: what's the best tree that can be built in N seconds total of building plus traversing? If we can build in 5 seconds instead of 10 and raycasting then takes 750 milliseconds instead of 300, that's a good trade.
  - Are there any theoretical results we can produce about kd-trees? Can we show that greedy local cost minimization is within some closed-form bound of globally optimal?
  - Just to be different, what about optimal trees for parallel architectures? GPUs value coherence much more highly than CPUs and don't incur nearly as high a relative cost to compute ray-triangle intersections. Does that change the cost metric? Does it change any of the approximations we use for the different terms when actually implementing a builder?
  - Pat makes the good point that a good starting point may be to work in 2D. It's much easier to visualize 2D kd-trees, and it's also (a bit) easier to brute-force generate and test a wider variety of trees in the same time. We can evaluate both the effectiveness of the standard cost metric versus the best possible tree and the effectiveness of the various approximations to the costs of the various terms during the build. We can also see if any interesting properties come to light by visually comparing 'very good' semi-randomly generated trees.

- How fast can we build a kd-tree? The current raytracing code can trace millions of rays per second but takes 13 seconds to build a kd-tree out of 70 thousand unsorted triangles (a build rate of 0.005 million triangles per second), though the builder is known to be unoptimized compared to the highly optimized ray caster. Historically, rendering times were so high that a few seconds or even minutes to build acceleration structures were tolerated. With interactive ray tracing that's clearly no longer true. Potential specific tasks:
  - Build by binning the triangles to avoid re-sorting all of them each time we descend the tree.
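The binning item above can be sketched roughly as follows: bucket triangle centroids into a fixed number of bins along one axis, then sweep the bin boundaries keeping running counts, so no level of the build re-sorts the triangle list. The bin count, the centroid approximation, and the crude max-side cost proxy are all illustrative assumptions; a real builder would plug in the surface-area cost instead.

```python
def best_binned_split(centroids, lo, hi, n_bins=16):
    """Pick a split plane along one axis from binned centroid counts.
    centroids: scalar positions along the axis; (lo, hi): node bounds.
    Returns (split_position, n_left, n_right), or None for a degenerate node."""
    if hi <= lo:
        return None
    counts = [0] * n_bins
    scale = n_bins / (hi - lo)
    for c in centroids:                       # one pass, no sorting
        i = min(n_bins - 1, int((c - lo) * scale))
        counts[i] += 1
    best = None
    n_left = 0
    n = len(centroids)
    for b in range(1, n_bins):                # candidate planes between bins
        n_left += counts[b - 1]
        n_right = n - n_left
        cost = max(n_left, n_right)           # crude stand-in for an SAH cost
        if best is None or cost < best[0]:
            best = (cost, lo + b / scale, n_left, n_right)
    return best[1], best[2], best[3]
```

Each node costs O(triangles + bins) instead of the O(triangles log triangles) re-sort, which is the whole point of the approach.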
  - Build by sorting the triangles once and then maintaining intervals (one in each of the axes). Each time you descend a layer, refine the intervals by intersecting them with the split plane. To enumerate the primitives in a given node, walk all the triangles in the narrowest interval and cull any that violate the other two intervals. Again, you can determine just the triangles in any cell without visiting or sorting all the others.
  - Build by first putting all the triangles in a grid. This is essentially a sophisticated version of building with intervals.
  - Build from a grid and maintain pointers from each grid cell to the tree leaves that intersect it. Then handle scenes with moving objects by updating the objects' positions on the grid and relocating triangles as appropriate among affected tree leaves. This seems like it should work well for scenes with lots of static geometry. If too much geometry is moving, at some point you'll need to rebuild the tree to maintain its 'goodness' (possibly, depending on how much impact the goodness of a tree really has on rendering times).
  - There's a similar argument (and theme) in building a kd-tree from a scene graph (usually equivalent to a BVH), though you lose the easy backmap for updating any tree cells affected by moving objects.
  - Is there a dynamic programming solution? If you just brute-force a fairly naive strategy, how good is it and how memory hungry is it? Can you simplify it (at the potential cost of a degree of optimality)?

- Build the kd-tree lazily. With interactive tracers, casting the rays takes fractions of a second, but building trees takes multiple seconds. Worse, with moving objects, the trees need to be rebuilt or updated every frame. Additionally, there is no purpose in building a tree for portions of the scene that rays never enter (i.e. aren't visible).
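A minimal sketch of the lazy-refinement decision: split a leaf on demand only when the estimated cost of traversing into the would-be children beats intersecting the leaf's whole triangle list. The constants and the `p_left`/`p_right` area-ratio estimates are placeholder assumptions, not tuned values:

```python
def should_refine(n_tris, n_left, n_right, p_left, p_right,
                  c_trav=1.0, c_isect=80.0):
    """Decide whether a ray arriving at a leaf should trigger a split.
    n_left/n_right: triangle counts in the candidate children (straddlers
    counted in both); p_left/p_right: probability that a ray entering the
    leaf enters each child (surface-area ratios under a uniform-ray
    assumption). Constants are made-up placeholders."""
    cost_as_leaf = c_isect * n_tris
    cost_if_split = c_trav + c_isect * (p_left * n_left + p_right * n_right)
    return cost_if_split < cost_as_leaf
```

In a lazy builder this test runs the first time a ray reaches a leaf; when it returns False the leaf is final and the ray just intersects its triangles, so invisible regions of the scene never pay for refinement at all.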
  So it seems to make sense to refine the tree on demand any time a ray hits a node that's currently a leaf but has an estimated cost of splitting (and continuing to traverse) below the cost of testing against all of its triangles. Potential specific tasks:
  - Organize the primitives into some form of hierarchy instead of a long list. Without it, refining a node is likely to be extremely expensive.
  - Determine a threshold number of primitives that comfortably fits in the cache. Refining a cell at or below that size only has computational cost. Refining a cell above that size also has a large memory bandwidth cost. Potentially, a good scene hierarchy mitigates the bandwidth cost at the coarser levels of the tree.
  - Formulate cost metrics, or at least heuristics, to determine when to refine vs. intersect and how far to refine (i.e. what metric to use to determine when to stop refining).
  - Read the literature on dynamic updates to BSP trees (which are just kd-trees with fewer limitations and more legitimacy among the theory folks). Being able to dynamically update kd-trees bears on dynamic scenes, lazy building, and interactivity in general.

Ray Tracing Themes Winter 2006
------------------------------

[ After the brainstorming laid out above and a few rounds of discussions, I've abstracted out the following higher level structure / combination of things we're pursuing and how they fit in a larger framework. ]

- Ray tracing efficiently on Cell based machines. The grander question here is: how do the best known CPU ray tracing algorithms adapt to the recent architectures that offer vastly more flops but trade off limitations in addressable memory, performance cliffs for non-SIMD execution, etc.?
  What aspects of the algorithms are key to their performance on these newer architectures? Can we actually harness those flops to outperform CPUs? Where are the most effective changes to enhance ray tracing on these architectures, what benefits can we expect from such changes, and how plausible are they in the constrained context that drives the design of the various architectures?

- Describe a vision for ray tracing capable future graphics systems. There are arguments both that rasterization has killed ray tracing and its zealots just haven't accepted it, and that rasterization is a doomed kludge for 3D because the horsepower for ray tracing (and physical simulation derived methods) hasn't quite been achieved. I [speaking for myself, though I suspect we agree at a high level] believe the optimal future is in between. There's a hefty existing investment in rasterization hardware and in amazing techniques for producing visibility from a single (per frame) viewpoint and shading local effects (plus precomputed, texture based faux global effects). Additionally, rasterization is much better suited to the various desktop and windowing systems, text rendering, and other basic end user tasks. Finally, there's strong commonality between the shading needs of rasterization and ray tracing, and again it makes more sense to leverage, reuse, and perhaps extend the current programmable shading hardware than to call for radical changes. Rather, ray tracing post-rasterization can enhance existing GPU capabilities in areas that are awkward today. Specifically, tracing rays simplifies shadows, reflection and refraction, and photon mapping and other techniques for global effects.

- Find ways to use kd-trees even better for ray tracing.
  The three areas we discuss extensively are: determining the ideal kd-tree (allowing expensive offline building) and thereby setting a baseline for judging how good the trees are that any given building strategy produces; devising a builder that runs quickly (relative to the time spent intersecting rays per build) and perhaps even trades off tree quality for reduced total build-plus-intersection time; and dynamically updating an existing kd-tree based upon the unstructured movement of a subset of the scene geometry. Note that we know of people working actively in the latter two areas, but not yet enough of the specifics to avoid reproducing their effort, learn from them, and produce contributions without trampling on their plans.