A report by Gaurav Garg
CS448 (Computational Photography)
Spring 2004
Synthetic Aperture Photography [1] is a technique for generating a very shallow depth-of-field image of a scene at a particular focal plane (in other words, an image sharply focused at a single depth) from a set of images captured by one or more cameras. The technique is so effective that, with a wide enough aperture, one can completely blur out anything that is not on the focal plane. Moreover, the technique allows one to change the focal plane itself, making it possible to focus at any depth in the scene. Figure 1 illustrates one such scenario. The video on the left is a sideways-looking video of a city block taken from a moving car. The image on the right is a synthetic aperture image made from this video with the focal plane set at the storefront. Note that the light pole and the tree are completely blurred out in the image. If you click on the image, you will see a video sequence that shows what happens when we sweep the synthetic focal plane through the scene. Notice how, as we sweep the focal plane from the storefront to the front of the pole, the pole comes into focus and the wall goes out of focus.
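To make the refocusing step concrete, here is a minimal sketch of shift-and-average refocusing for a 1-D light field, assuming a camera translating along its image x-axis with known positions and focal length. The function and parameter names are ours, not from [1], and real code would use the full calibrated projections and sub-pixel interpolation rather than integer shifts:

```python
import numpy as np

def synthetic_aperture_image(frames, cam_x, focal_depth, focal_length_px):
    """Refocus a 1-D light field by shift-and-average.

    frames          : list of HxW float32 images
    cam_x           : camera position along the track for each frame
                      (same units as focal_depth)
    focal_depth     : distance of the synthetic focal plane from the track
    focal_length_px : focal length in pixels (from calibration)

    A point on the focal plane moves by disparity = f * dx / depth between
    a view and the reference view, so shifting each frame by -disparity
    aligns the focal plane; everything off the plane stays misaligned
    across views and averages out into blur.
    """
    ref_x = cam_x[len(cam_x) // 2]           # centre frame as reference view
    acc = np.zeros_like(frames[0])
    for img, x in zip(frames, cam_x):
        disparity = focal_length_px * (x - ref_x) / focal_depth
        shift = int(round(disparity))
        acc += np.roll(img, -shift, axis=1)  # integer shift; np.roll wraps at borders
    return acc / len(frames)
```

Sweeping `focal_depth` over a range of values produces exactly the focal-plane sweep shown in the linked video.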
The human eye can easily tell what is in focus and what is not at a particular depth by watching the synthetic aperture video. The natural question to ask, then, is: can a computer algorithm do as well? We explore this question in this project. These techniques fall under the general framework of shape from light fields; in the literature they are also known as shape-from-stereo and shape-from-focus techniques. In our context, since the video was captured from a moving vehicle, it is a 1-dimensional light field, and the synthetic aperture we get is also 1-dimensional.
The rest of the report is organized as follows. We present a short background summary of the related literature. The next section explains our pipeline for achieving the stated goal. We then describe the experiments we performed to choose the right focus operator. We then show the results, along with two applications, synthetic aperture photography and multi-perspective panoramas, to illustrate the usefulness of the scheme. We conclude with a discussion of why recovering precise shape from light fields is a hard problem in uncontrolled settings.
Depth from stereo/defocus is a well-studied problem in computer vision. It has been shown to work well in controlled lab settings, but it is not a solved problem yet: in general scenarios the assumptions made by these techniques do not hold, and the techniques break down. [4] is one of the earliest works on shape from focus and explains the use of the Laplacian as a focus operator. [5] is the earliest work on voxel-based algorithms for recovering shape from many cameras; it falls into the category of variance-based approaches. [6] generalizes this technique to full-surround light fields. All of these techniques model the scene as consisting of opaque objects. [7] and [8] explore probabilistic variations of these basic approaches: they model the scene as consisting of translucent objects and assign a probability to the occupancy of each voxel. We have not seen these techniques applied to complex outdoor scenes; we investigate that here.
Figure 2: The pipeline for foreground object removal from dense video data. The input to the pipeline is assumed to be a calibrated 1-D light field, and the output is a light field with the specified object removed from it.
The procedure for removing foreground objects from dense video data is outlined in Figure 2. The input to the pipeline is a sideways-looking video of a city block acquired from a moving vehicle. The video is first calibrated using a commercial structure-from-motion package called Boujou, which returns the camera position and pose for each frame in the video sequence. The rest of the pipeline follows the stages shown in Figure 2.
We performed experiments to determine which operator, variance or the Laplacian, is the right one to use for recovering shape.
Figure 3 compares the two operators. The variance operator works well inside the object but not on its edges: the variance is not low at the boundary because of mixture pixels. The Laplacian, in contrast, tends to find object edges but does not work well inside the object. This is expected given the nature of the two operators. Additionally, the Laplacian creates a halo around the object because of aliasing, and is noisier overall. The noise of the variance operator is more predictable, as it gives wrong depths in areas of uniform color. Empirically, we found that variance does a better job, and its noisy voxels are more easily explained.
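As a rough illustration of the two operators, the sketch below (our own simplified formulation; function names are hypothetical) computes both measures over a plane-sweep stack. The variance operator measures agreement across the views warped to a candidate focal plane, while the Laplacian operator measures the sharpness of the refocused (mean) image at that plane:

```python
import numpy as np
from scipy.ndimage import laplace

def focus_measures(sweep_stack):
    """Compute both focus measures for one candidate focal plane.

    sweep_stack : array of shape (n_views, H, W) holding the input views
                  warped to one candidate focal plane. A pixel that truly
                  lies on that plane looks the same in every warped view.
    """
    variance = sweep_stack.var(axis=0)       # low variance = views agree = in focus
    mean_img = sweep_stack.mean(axis=0)      # the synthetic aperture image at this plane
    lap = np.abs(laplace(mean_img))          # strong Laplacian = sharp = in focus
    return variance, lap

def depth_from_variance(stacks_per_plane, depths):
    """Pick, per pixel, the candidate depth with minimum cross-view variance."""
    v = np.stack([s.var(axis=0) for s in stacks_per_plane])  # (n_planes, H, W)
    return np.asarray(depths)[np.argmin(v, axis=0)]
```

For the Laplacian the selection rule flips: one keeps, per pixel, the plane that maximizes the Laplacian response.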
We also tried some combinations of variance and Laplacian. Note that we have no scale information relating the two measures, so they are impossible to combine in a rigorous sense. Taking the union of the two mattes is one option; this improves the matte for the pole, but we also get a big halo around the pole and a lot more noise compared to pure variance. Taking the intersection of the two was also not very useful because of the contradictory nature of the two operators: though the noise cancelled out, we got a worse matte for the pole. In conclusion, variance combined with the morphological operation of growing the matte does a good job of removing the pole, so we use variance as our focus operator.
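A minimal sketch of the combinations we tried, assuming the per-operator boolean mattes have already been thresholded out of the two depth maps (all names are illustrative):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def object_matte(variance_matte, laplacian_matte, mode="variance", grow=3):
    """Combine per-pixel mattes from the two focus operators.

    variance_matte, laplacian_matte : boolean HxW masks marking pixels whose
        estimated depth falls at the foreground object's depth.
    With no common scale, the measures can only be combined set-wise:
    union keeps anything either operator flags (bigger matte, more noise);
    intersection keeps only pixels both agree on (less noise, worse matte).
    """
    if mode == "union":
        matte = variance_matte | laplacian_matte
    elif mode == "intersection":
        matte = variance_matte & laplacian_matte
    else:                            # variance alone worked best in our tests
        matte = variance_matte
    # grow the matte a few pixels so mixture pixels at the boundary are covered
    return binary_dilation(matte, iterations=grow)
```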
We present all our results in detail here:
Video with holes
Figure 4 shows the videos with the light pole removed using the described techniques. Note that we recover a matte for almost the complete pole, except for its tip.
SAP Video Sequence Comparison
Figure 5 is a comparison of SAP video sequences. Note that the blur due to the pole disappears in (b) and (c), which were made from the video with the pole removed, as compared to (a), which was made from the original sequence. Some other regions in (b) also get removed because of noisy voxels due to incorrect depth estimation.
Figure 6 is another comparison of SAP video sequences. Note that the blur due to the waste bin completely disappears in (b), which was made from the video with the waste bin removed, as compared to (a), which was made from the original sequence.
Multi-Perspective Panorama Comparison with Automatic Hole Filling
We also created a multi-perspective panorama from the video sequence with the light pole and waste bin removed. If constructed naively, the new panorama would have holes in place of the light pole and the waste bin. To fill in the holes, we simply copied the strips containing holes from the nearest video frame that did not have a hole at that location. Since we have a very dense video data set and the panorama was created with the image plane set at the store facade, this assumption is mostly valid: it generates small artifacts, but they are unnoticeable unless there are significant depth differences. This technique can be directly extended to remove people from panoramas, since people are about the size of the waste bin, which is removed completely. Unfortunately, we did not have a good data set with people to try this on, but the idea should work. Figure 7 illustrates this: note that both the pole and the bin are removed completely. We see no artifacts on the facade, because that is our assumed depth, though we do see some artifacts in the alleyway because of the significant depth difference there.
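The hole-filling rule can be sketched as follows, under the simplifying assumption that each panorama strip maps to a fixed column range in its source frame (a reasonable approximation here because the panorama plane is the facade and the video is dense; all names are hypothetical):

```python
def fill_strip(frames, mattes, frame_idx, col_range):
    """Fill a panorama strip whose source frame is occluded by the object.

    frames    : list of HxW(x3) images
    mattes    : matching boolean hole masks, True where the object was removed
    frame_idx : the frame the strip would normally be copied from
    col_range : (start, end) columns of the strip in frame coordinates

    Searches outward from frame_idx for the nearest frame whose strip
    contains no hole; since the facade is the panorama plane, the same
    strip looks nearly identical in nearby frames.
    """
    c0, c1 = col_range
    for offset in range(len(frames)):
        for idx in (frame_idx - offset, frame_idx + offset):
            if 0 <= idx < len(frames) and not mattes[idx][:, c0:c1].any():
                return frames[idx][:, c0:c1]
    raise ValueError("object covers this strip in every frame")
```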
Shape from light fields is an inherently hard problem, and it is even harder in uncontrolled settings like ours. Both shape-from-stereo and shape-from-focus techniques make assumptions about the scene structure that often do not hold in outdoor settings: they assume that the objects in the scene are both textured and Lambertian. The texture assumption in particular fails in many places, such as the sky and store facades, as shown in our examples. This confuses the shape algorithm and results in wrong depth estimates. Furthermore, the depths obtained are coarse and noisy, as these operators do not impose any smoothness constraints on the scene. This makes it impractical to expect a very detailed depth map from these techniques. However, as shown in the results, for certain applications a rough depth map is good enough for removing foreground objects like light poles, waste bins, and people, and such a map can be obtained.
The technique creates videos with holes in place of the foreground objects. One direction for future research would be to try video inpainting, analogous to the image inpainting described in [2]. Another practical use of the technique could be as a plug-in for an interactive multi-perspective panorama design tool [3]: as shown in the results, just by selecting a rough mask over the object to be removed in one frame, a multi-perspective panorama can be constructed without the object in it.