
David Redkey
Lucas Pereira
cs348c -- Virtual Reality
In the future, it will be interesting to combine light fields with other objects. However, in order to get proper occlusion, it is necessary to determine the depth of every ray in the field. This is easily computed for computer-generated lightfields that have an internal geometric representation of a scene. However, for light fields captured from the real world, we often do not have the depth information.
Our goal was to use multi-baseline computer vision techniques to generate the depth information for a light field from the images. Once we reconstructed the depth information, we could then recombine the light field with other objects or scenes, and get proper occlusion.

One fundamental advantage in stereo reconstruction is the "epi-polar line" constraint. If one traces a ray from the center of projection of one camera location through any one pixel in its image plane, the projection of that ray onto the image plane of any other camera will be a line. This is because the viewing transformation is perspective, and perspective transformations map lines to lines. This means we only have to search for pixel matches along a given line in a stereo image.
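To make the constraint concrete, here is a minimal Python sketch (not our project code; the calibration values and names are made up for illustration). Back-projecting a pixel of camera #1 at two different depths and projecting both points into camera #2 gives two image points, and every candidate match lies on the line through them.

    import numpy as np

    def project(K, R, t, X):
        """Project a 3-D world point X through a pinhole camera with intrinsics K and pose (R, t)."""
        x = K @ (R @ X + t)        # perspective projection
        return x[:2] / x[2]        # divide out depth to get pixel coordinates

    # Hypothetical calibration: camera #1 at the origin, camera #2 displaced along the baseline.
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    R1, t1 = np.eye(3), np.zeros(3)
    R2, t2 = np.eye(3), np.array([-0.2, 0.0, 0.0])

    # Back-project one pixel of camera #1 at two different depths along its viewing ray ...
    u, v = 400.0, 250.0
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    X_near, X_far = 1.0 * ray, 50.0 * ray
    assert np.allclose(project(K, R1, t1, X_near), [u, v])   # sanity check: it projects back to (u, v)

    # ... and project both points into camera #2.  All candidate matches for (u, v)
    # lie on the line through these two image points: the epi-polar line.
    p_near, p_far = project(K, R2, t2, X_near), project(K, R2, t2, X_far)
    print("epi-polar line in image #2 passes through", p_near, "and", p_far)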
Another advantage we can rely on comes from having multiple "baselines" at our disposal. "Baseline" is the computer-vision term for the displacement vector between the projection centers of two cameras. Each baseline provides a different epi-polar line through a different image, allowing better rejection of incorrect pixel matches. For example, we may be trying to stereo-match an orange pixel from camera #1 to determine its depth, and find that three orange pixels lie along the epi-polar line for camera #2. Without additional camera positions, we would be unable to choose from among three equally valid depth values for the corresponding point in space. If we include a third camera, however, its epi-polar line will cross different points in the scene than the epi-polar line in image #2. Even if the new epi-polar line contains, say, four orange pixels, chances are that only one of the four will correspond to the same depth value as any of the three pixels from image #2. This depth value would be the correct depth for the point in space.
In determining the depth of a given pixel, our algorithm selects a regular sampling pattern of inverse-depth values from the near clipping plane to infinity. By choosing regular spacing in inverse depth instead of depth, each increment of the depth term corresponds to a constant pixel increment along the epi-polar line in an image (the displacement along the line is proportional to the baseline divided by depth, i.e. proportional to inverse depth). This is a handy performance benefit. For each depth sample, we accumulate an error quantity corresponding to how different the color of the projected pixel is in each of the other frames. When we are done, we choose the depth sample with the least error as the actual depth. Here is a simplified description of the algorithm, followed by a code sketch:
For each camera frame
    For each pixel
        For all candidate depths
            For all other camera images
                Compute color difference with projection
        Choose best distance as minimizer of sum of squared error
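The following Python/NumPy sketch is a direct, brute-force translation of this loop. The camera parameterization, helper names, and lack of any optimization are our own simplifications for illustration, not the original implementation.

    import numpy as np

    def reconstruct_depth(ref_img, ref_cam, other_cams, other_imgs, z_near=1.0, n_samples=64):
        """Inverse-depth sweep for one reference image (illustrative sketch only).

        ref_cam and each entry of other_cams are (K, R, t) tuples; images are
        H x W x 3 float arrays.  Returns an H x W array of estimated depths.
        """
        K, R, t = ref_cam
        K_inv = np.linalg.inv(K)
        center = -R.T @ t                                    # center of projection of the reference camera
        H, W, _ = ref_img.shape
        depth = np.zeros((H, W))

        # Regularly spaced samples in inverse depth, from the near plane toward infinity.
        inv_depths = np.linspace(1.0 / z_near, 0.0, n_samples, endpoint=False)

        for v in range(H):
            for u in range(W):
                ray = R.T @ (K_inv @ np.array([u, v, 1.0]))  # world-space viewing ray for this pixel
                best_err, best_z = np.inf, np.inf
                for inv_z in inv_depths:
                    z = 1.0 / inv_z
                    X = center + z * ray                     # candidate point in space
                    err, hits = 0.0, 0
                    for (Ko, Ro, to), img in zip(other_cams, other_imgs):
                        x = Ko @ (Ro @ X + to)
                        if x[2] <= 0.0:                      # point is behind this camera
                            continue
                        uo, vo = (x[:2] / x[2]).astype(int)
                        if 0 <= uo < W and 0 <= vo < H:
                            err += np.sum((img[vo, uo] - ref_img[v, u]) ** 2)
                            hits += 1
                    if hits > 0 and err < best_err:          # keep the depth with the least summed squared error
                        best_err, best_z = err, z
                depth[v, u] = best_z
        return depth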
Most previous depth reconstruction efforts have tried to match texture "windows" around pixels, instead of just single pixel values, to provide more confidence in matching. The problem with this approach is that widely spaced cameras will greatly distort the appearance of even small texture windows in other images, making matches more difficult to determine. Our code supports this mode of operation, but our hope was that the vast array of cameras available to us would allow matching to be done by checking only single pixels, thus avoiding the texture warping problem.
Another difficulty faced by stereo matching is occlusion. If a point in space is visible to one camera, but another camera's view of it is blocked by an obstruction, stereo matching cannot be done with the blocked camera. We avoid corrupting our error data with obstructed cameras by rejecting all input from a given camera if none of the pixels on its epi-polar line are similar to the target pixel. Because of the 2-dimensional array of camera locations available to us, we are able to determine depth fairly well, even for points that are invisible from large regions of space. Corruption from false matches, however, remains a serious problem.
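A small sketch of that rejection test follows; the threshold value and names are assumptions for illustration, not values from our implementation.

    import numpy as np

    def camera_errors(target_color, epipolar_colors, threshold=0.05):
        """Squared color errors along one camera's epi-polar line, or None if the camera is rejected.

        epipolar_colors holds the color sampled at each candidate depth.  If none of
        them is within `threshold` of the target pixel, we assume the point is occluded
        in this view and contribute nothing, rather than corrupt the error sums.
        """
        errs = np.sum((epipolar_colors - target_color) ** 2, axis=1)
        return None if errs.min() > threshold else errs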
Extreme texture frequencies are another problem for stereo matching. Low-frequency regions (regions of roughly constant color) are very difficult to reconstruct precisely, because it is likely that a whole range of depth values will produce color matches. High-frequency textures are subject to aliasing: it is possible that a pixel in one image will have no corresponding pixel in another image. An example of this is our algorithm's difficulty with the wall behind the Buddha. This problem is especially apparent in real-world scenes if the cameras are not calibrated perfectly, making the epi-polar lines inexact.
Finally, depth discontinuities are a problem, because pixels near a silhouette are likely to be blends of foreground and background colors. Such a pixel will probably have no match in any other image, or worse, if the background is uniform, false matches will be created as cameras follow the moving silhouette.

Our algorithm works best on scenes like this, where the object has a texture and few specular highlights.

Our algorithm does fairly well on this object, even though it only has two colors. However, due to sparse spacing in the u-v plane for this particular light field (~15 degrees per image), the lower right face has some aliasing artifacts.

This scene demonstrates some of the things that cause confusion in our algorithm. The background wall texture has a high frequency component, so that there is very little coherence between images. The Buddha and the table have very low frequency textures, which make it difficult to calculate an exact depth. Finally, the Buddha has a specular highlight that moves across his chest, and disrupts our algorithm.

This shows the results of combining a GL-rendered object (the gray cube) with a lightfield object (the colored cube). We are using the Z-values computed by our depth-reconstruction algorithm.
For each ray, the lightfield data structure stores the color information. We added the Z data to the same structure. Thus, when the lightfield code assembles the rays for a particular view, the Z values get passed in, too. The RGBA colors create a color image, and the Z values create a corresponding Z image. At this point, the Z values are still in world coordinates, measured perpendicular to the s-t plane.
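A rough sketch of what this looks like in memory; the class and field names here are hypothetical, since the actual code extends the existing lightfield slab structure.

    import numpy as np

    class LightFieldSlab:
        """One light-field slab: RGBA color plus a Z value for every (u, v, s, t) ray."""
        def __init__(self, nu, nv, ns, nt):
            self.rgba = np.zeros((nu, nv, ns, nt, 4), dtype=np.uint8)      # color information per ray
            self.z = np.full((nu, nv, ns, nt), np.inf, dtype=np.float32)   # reconstructed depth per ray

        def sample(self, iu, iv, i_s, i_t):
            """Return (color, z) for one ray, so the renderer can assemble a color
            image and a matching Z image for the current view."""
            return self.rgba[iu, iv, i_s, i_t], self.z[iu, iv, i_s, i_t]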
When we're done rendering the scene, we use lrectread() to copy the Z-buffer data back into memory, so that we can do Z-comparisons in software.
For each ray in the light field image, we compute the vector Vp, the vector from the intersection of the ray with the s-t plane to the viewer. We dot this with the normalized viewing direction vector to get the projection of Vp along the viewing direction. We then multiply the ray's Z-value by the ratio (projection of Vp along the viewing direction) / (distance of the viewer from the s-t plane) to get the correct Z-value for the ray along the current viewing direction.
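In our own notation, that is z_view = z_st * (Vp . d) / D, where z_st is the stored Z perpendicular to the s-t plane, d is the normalized viewing direction, and D is the viewer's distance from the s-t plane. A sketch, with names and sign conventions of our own choosing:

    import numpy as np

    def view_aligned_z(z_st, st_hit, eye, view_dir, eye_to_plane_dist):
        """Convert a stored Z value (perpendicular distance to the s-t plane) into a Z value
        measured along the current viewing direction.  Illustrative restatement of the
        ratio described above, not the project code."""
        Vp = eye - st_hit                          # from the ray's s-t intersection to the viewer
        proj = abs(np.dot(Vp, view_dir))           # magnitude of Vp's projection on the view direction
        return z_st * proj / eye_to_plane_dist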
Note that this essentially involves a matrix multiplication for every ray (pixel) in the light field, which can really slow down display performance. It might be more efficient to map geometry-based Z-values into the lightfield's coordinate system instead, rather than mapping the lightfield's Z-values as we do now. The advantage is that the matrix multiplication could be folded into the projection matrix, so it wouldn't add any extra work. The major disadvantage is that this requires the geometry renderer to know about the lightfield parameters.
When we finish generating the alpha values, we copy the image onto the screen. The alpha values control which pixels get modified, and how much. Then we write the new Z-values into the Z-buffer, so that more images can be composited, if desired.
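Put together, the software compositing step looks roughly like the sketch below. This is written under our own assumptions about array layout, and it takes the alpha image to encode both the Z test and the light field's coverage.

    import numpy as np

    def composite(frame_rgb, frame_z, lf_rgb, lf_alpha, lf_z):
        """Composite a light-field image over an already-rendered frame in software.

        frame_rgb / frame_z come from the geometry renderer (the Z-buffer read back
        into memory); lf_rgb / lf_alpha / lf_z are the light field's color, coverage,
        and view-corrected Z images.  Illustrative sketch only.
        """
        in_front = lf_z < frame_z                       # where the light field passes the Z test
        a = (lf_alpha * in_front)[..., None]            # alpha controls which pixels change, and how much
        out_rgb = a * lf_rgb + (1.0 - a) * frame_rgb
        out_z = np.where(in_front, lf_z, frame_z)       # write new Z so further images can be composited
        return out_rgb, out_z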

This is another view of the composite image. This view shows an artifact that appears around the outside edges of lightfields. The multi-colored cube was rendered anti-aliased, so that it fades to black at the edges. These "partially black" pixels are matched by our reconstruction algorithm, and so they have the same depth as the object. Thus they partially occlude the cube behind the lightfield, and create a dark "fringe" around the edge of the lightfield object.

Buddha with a cube in his lap, and a cube on his head. Even with Buddha's specular reflections, we were able to reconstruct the depth information reasonably well in the areas where there was enough detail, such as his abdomen and his head.
Levoy, Marc and Pat Hanrahan. Light Field Rendering. Proceedings of SIGGRAPH 1996.
Okutomi, Masatoshi and Takeo Kanade. A Multiple-Baseline Stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 4, April 1993.
The High Performance Computing Graphics Project, Carnegie Mellon.
The Stanford Light Field Project.