Removing Foreground Objects from Urban Scenes using SAP for Matting

A report by Gaurav Garg
CS448 (Computational Photography)
Spring 2004

Abstract

This report describes a technique for removing foreground objects such as light poles, waste bins, and people from videos of urban scenes captured with a moving video camera. The technique is based on a shape-from-stereo algorithm. Once an object has been removed, the new video can be used for applications such as creating multi-perspective panoramas without foreground objects and creating synthetic aperture images. We show our results and discuss the limitations of the technique.

Introduction

Synthetic Aperture Photography (SAP) [1] is a technique for generating a very shallow depth-of-field image of a scene at a particular focal plane (in other words, an image sharply focused at a single depth) from a set of images captured by one or more cameras. The technique is effective enough that, with a wide enough synthetic aperture, everything off the focal plane can be blurred out completely. Moreover, the focal plane itself can be moved, making it possible to focus at any depth in the scene. Figure 1 illustrates one such scenario. The video on the left is a sideways-looking video of a city block taken from a moving car. The image on the right is a synthetic aperture image made from this video with the focal plane set at the storefront. Note that the light pole and the tree are completely blurred out. The accompanying video sequence shows what happens when the synthetic focal plane is swept through the scene: as it moves from the storefront to the front of the pole, the pole comes into focus and the wall goes out of focus.

Figure 1: The video on the left is a sideways-looking video of a city block taken from a moving vehicle. The video on the right is the synthetic aperture sequence constructed from it. The focal plane sweeps from back to front: first the wall is in focus and the light pole is completely blurred, then the light pole comes into focus and the wall is blurred.
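
To make the mechanics concrete, here is a minimal sketch of synthetic aperture focusing. It assumes the homography warping each frame onto the chosen focal plane has already been computed from the camera calibration; the function and parameter names are our illustrative choices, not from [1].

```python
import cv2
import numpy as np

def synthetic_aperture_image(frames, homographies, out_size):
    """Warp every frame onto the chosen focal plane and average.

    frames       : list of H x W x 3 uint8 frames from the moving camera
    homographies : list of 3x3 arrays, each mapping a frame onto the focal
                   plane in the reference view (assumed precomputed from
                   the calibration)
    out_size     : (width, height) of the output image
    """
    acc = np.zeros((out_size[1], out_size[0], 3), np.float32)
    wsum = np.zeros((out_size[1], out_size[0], 1), np.float32)
    for frame, H in zip(frames, homographies):
        # Scene points on the focal plane land at the same output pixel in
        # every warped frame and stay sharp; off-plane points drift between
        # frames and average into a blur.
        acc += cv2.warpPerspective(frame.astype(np.float32), H, out_size)
        ones = np.ones(frame.shape[:2], np.float32)
        wsum += cv2.warpPerspective(ones, H, out_size)[..., None]
    return (acc / np.maximum(wsum, 1e-6)).astype(np.uint8)
```

Sweeping the focal plane, as in the video of Figure 1, amounts to recomputing the homographies for each candidate plane and re-running the average.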

A human viewer can easily tell, by watching the synthetic aperture video, what is in focus at a particular depth and what is not. The natural question, then, is whether a computer algorithm can do as well. We explore this question in this project. Such techniques fall under the general framework of shape from light fields; in the literature they are also known as shape-from-stereo and shape-from-focus techniques. In our context, since the video was captured from a moving vehicle, it is a one-dimensional (1-D) light field, and the synthetic aperture we get is also 1-D.

The rest of the report is organized as follows. We first present a short summary of the related literature. The next section explains our pipeline for achieving the stated goal. We then describe the experiments we performed to choose the right focus operator. We then show our results, along with two applications, synthetic aperture photography and multi-perspective panoramas, that illustrate the usefulness of the scheme. We conclude with a discussion of why obtaining precise shape from a light field is a hard problem in uncontrolled settings.


Related Work

Depth from stereo/defocus is a well-studied problem in computer vision. It has been shown to work well in controlled lab settings, but it is not a solved problem: in general scenarios, the assumptions these techniques make do not hold and the techniques break down. [4] is one of the earliest works on shape from focus and explains the use of the Laplacian as a focus operator. [5] is the earliest voxel-based algorithm for recovering shape from many cameras and falls into the category of variance-based approaches. [6] generalizes this technique to full-surround light fields. All of these techniques model the scene as consisting of opaque objects. [7] and [8] explore probabilistic variations of these basic approaches: they model the scene as consisting of translucent objects and assign a probability to the occupancy of each voxel. We have not seen these techniques applied to complex outdoor scenes; we investigate that here.


Pipeline

Figure 2: The figure shows the pipeline for foreground object removal from dense video data. The input to the pipeline is assumed to be a calibrated 1-D light field, and the output is a light field with the specified object removed.

The procedure for removing foreground objects from dense video data is illustrated in Figure 2. The input to the pipeline is a sideways-looking video of a city block acquired from a moving vehicle. The video is calibrated using a commercial structure-from-motion package called Boujou, which returns the camera position and pose for each frame of the sequence. The rest of the pipeline is as follows:

  1. Rectification: One camera position and pose is chosen as the reference, and a focal plane is chosen in the world. All of the camera images are then rectified onto the chosen plane in the reference coordinate system. Since the calibration is computed in advance, this step is straightforward and reduces to applying a planar homography to each input image.
  2. Focus Estimation: After all the images have been rectified onto a common reference plane, we need to find out which pixels on that plane are in focus. There are two operators that can be used for estimating the focus:
    • Variance (Shape from stereo)
    • Laplacian (Shape from focus)
    The variance operator is based on the assumption that if a pixel is in focus, the contributing pixels from the different rectified images should have nearly the same color; in other words, the variance of the contributing pixels should be low. The Laplacian, on the other hand, is a local measure of sharpness: it assumes that when all the contributing pixels are averaged, a pixel that is in focus will appear sharper than its neighbors. In our case, since the light field is 1-D, only a 1-D filter is used. A detailed comparison of the two operators (where each does well, where each fails, and which one to use) is given in the section on experiments; since the Laplacian offered no additional benefit, variance was used for all the results. (See the first sketch after this list.)
    The first two steps are then repeated for a range of depths so as to get a focus estimate for each voxel in the volume.
  3. Depth Map from Focus: We assume the world consists of opaque objects, so each voxel is either in focus or not, and along any line of sight of the reference camera only one voxel can be in focus at a time. To find the in-focus voxels, we sweep the focal plane from back to front and choose the voxel that is most in focus. We also set a hard threshold that the variance at a voxel must stay below to qualify as a candidate for being in focus. This is based on the observation that when foreground objects are in focus their variance should be very low, since they receive no polluting pixels, whereas this does not hold for occluded objects. Along each line of sight of the reference camera, the qualifying voxel with the minimum variance is then chosen. This can leave many holes in our depth model, but they are mostly in the background rather than in the foreground objects, which are what we care about.
  4. Create Hole in Video: The next step is to choose the voxels from the 3-D map that we wish to remove. If we simply removed a whole depth range, we would also remove other objects at the same depth as the target object. A little manual intervention helps here: the user provides a rough mask around the object to be removed in one frame of the original video sequence. The 3-D voxels are projected onto that frame and tagged according to whether they fall inside or outside the mask; only the voxels inside the mask are tagged for removal. The tagged voxels are then projected onto each frame of the sequence, and the projected pixel locations give us a matte for the foreground object in the video. The rough mask has the additional benefit of discarding noisy voxels that were assigned wrong depths due to the fragility of the shape algorithm. The matte obtained is still not perfect, so we expand it to enclose neighboring pixels, which gives a conservative matte around the object to be removed. (See the second sketch after this list.)
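
The first sketch below shows the variance and Laplacian focus measures and the depth selection of step 3. It assumes the rectified images from step 1 have been stacked into a single array; the function names, array layout, and threshold parameter are our illustrative choices, not values from the report.

```python
import numpy as np

def depth_from_variance(rect, var_thresh):
    """Steps 2-3: per-pixel depth from variance across the rectified stack.

    rect       : (D, N, H, W) array -- the N frames rectified onto each of
                 D candidate focal planes, ordered back to front (grayscale
                 here for brevity), as produced by the homographies of step 1
    var_thresh : hard threshold the variance at a voxel must stay below to
                 qualify as a candidate for being in focus
    """
    # Focus measure: low variance across the N contributing rays means the
    # rays agree on a color, i.e. the voxel likely lies on a real surface.
    var = rect.var(axis=1)                    # (D, H, W)
    # argmin's tie-breaking prefers the lower index, i.e. the farther plane,
    # which matches the back-to-front sweep of the focal plane.
    depth = var.argmin(axis=0)                # best depth index per pixel
    valid = var.min(axis=0) < var_thresh      # unqualified pixels are holes
    return depth, valid

def laplacian_focus(mean_image):
    """Alternative focus measure: a 1-D Laplacian of the averaged stack,
    applied along the direction of camera motion (shape from focus)."""
    k = np.array([1.0, -2.0, 1.0])
    resp = np.apply_along_axis(np.convolve, 1, mean_image, k, mode='same')
    return np.abs(resp)
```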
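
The second sketch covers step 4, with a hypothetical project() callable standing in for the voxel-to-pixel projection given by the Boujou calibration; all names here are our assumptions. The final morphological growing of the matte is shown with the experiments below.

```python
import numpy as np

def matte_from_user_mask(voxels, user_mask, ref_idx, num_frames, hw, project):
    """Step 4: turn a rough user mask in one frame into a per-frame matte.

    voxels    : (V, 3) world positions of the in-focus voxels from step 3
    user_mask : boolean (H, W) rough mask drawn on frame ref_idx
    project   : hypothetical callable (points, frame_idx) -> (V, 2) integer
                pixel coordinates, built from the Boujou calibration
    """
    H, W = hw
    # Tag only the voxels whose projection lands inside the user's mask.
    uv = project(voxels, ref_idx)
    inb = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    tagged = voxels[inb][user_mask[uv[inb, 1], uv[inb, 0]]]

    # Splat the tagged voxels into every frame to get the raw matte.
    mattes = np.zeros((num_frames, H, W), bool)
    for f in range(num_frames):
        uv = project(tagged, f)
        keep = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        mattes[f, uv[keep, 1], uv[keep, 0]] = True
    return mattes
```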

Experiments

We performed experiments to determine which operator is the right one to use for recovering shape.

(a) Shape from Variance (b) Shape from Laplacian
Figure 3: Variance vs. Laplacian. The figure compares one frame of the mask generated using the variance operator against one generated using the Laplacian operator.

Figure 3 compares the two operators. The variance operator works well inside the object but not on its edges, because mixture pixels keep the variance from being low at the boundary. The Laplacian, by contrast, tends to find object edges but does not work well inside the object, as expected from the nature of the operators. The Laplacian also creates a halo around the object because of aliasing, and is noisier overall. The noise of the variance operator is more predictable: it gives wrong depths in areas of uniform color. Empirically, variance does a better job, and its noisy voxels are more easily explained.

We also tried combinations of the two operators. Note that we have no scale relationship between them, so they cannot be combined in a rigorous sense. Taking the union of the two masks is one option; this improves the matte for the pole, but we also get a large halo around the pole and much more noise than with pure variance. Taking the intersection was also not very useful: because of the contradictory nature of the two operators, the noise cancelled out but we got a worse matte for the pole. In conclusion, variance combined with the image-morphological operation of growing the matte does a good job of removing the pole, so we use variance as our focus operator. (The combinations are sketched below.)
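
The combinations reduce to boolean operations on the per-frame candidate masks, followed by the morphological growing used for the final matte. A minimal sketch with OpenCV; the stand-in masks and the kernel size are illustrative, not values from the report.

```python
import cv2
import numpy as np

# Stand-ins for one frame's candidate masks from the two operators
# (pixels each operator marks as in focus at the pole's depth).
rng = np.random.default_rng(0)
var_mask = rng.random((240, 320)) < 0.1
lap_mask = rng.random((240, 320)) < 0.1

union = var_mask | lap_mask        # better pole coverage, but halo + noise
intersect = var_mask & lap_mask    # less noise, but a worse pole matte

# Final choice: variance alone, grown by a morphological dilation so the
# matte conservatively swallows the mixture pixels at the object boundary.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
matte = cv2.dilate(var_mask.astype(np.uint8), kernel).astype(bool)
```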


Results

We present all our results in detail here:

Video with holes

Figure 4 shows the video with the light pole removed using the described technique. Note that we obtain the matte for almost the complete pole, except its tip.

(a) (b) (c)
Figure 4: (a) shows the resulting matte for the light pole without the manual mask. (b) shows the matte after the manual mask has been applied, and (c) is the result after growing that matte using image-morphological operations. (c) gives the best matte for the light pole.

SAP Video Sequence Comparison

Figure 5 compares SAP video sequences. Note that the blur due to the pole disappears in (b) and (c), which were made from the video with the pole removed, as compared to (a), which was made from the original sequence. Some other regions in (b) are also removed because of noisy voxels caused by incorrect depth estimation. (The hole-aware averaging behind (b) and (c) is sketched after the figure.)

(a) (b) (c)
Figure 5: The figure compares the SAP images. The focal plane is set at the facade. (a) is the SAP image made from the original video, (b) is after removing the pole but without the manual mask, and (c) is after removing the pole with the manual mask.
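
The hole-aware averaging can be sketched as a weighted version of the SAP function from the introduction: matted pixels are given zero weight, so the removed object contributes neither color nor blur. As before, the names and the assumption of precomputed homographies and mattes are ours.

```python
import cv2
import numpy as np

def sap_without_object(frames, mattes, homographies, out_size):
    """Synthetic aperture image that ignores matted (removed) pixels."""
    acc = np.zeros((out_size[1], out_size[0], 3), np.float32)
    wsum = np.zeros((out_size[1], out_size[0], 1), np.float32)
    for frame, matte, H in zip(frames, mattes, homographies):
        keep = (~matte).astype(np.float32)         # 0 inside the matte
        pre = frame.astype(np.float32) * keep[..., None]
        acc += cv2.warpPerspective(pre, H, out_size)
        wsum += cv2.warpPerspective(keep, H, out_size)[..., None]
    # Each output pixel is the mean of its unmatted contributions only.
    return (acc / np.maximum(wsum, 1e-6)).astype(np.uint8)
```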

Figure 6 is another comparison of SAP video sequences. Note that the blur due to the waste bin completely disappears in (b), which was made from the video with the waste bin removed, as compared to (a), which was made from the original sequence.

(a) (b)
Figure 6: The figure compares the SAP images. The focal plane is set at the facade. (a) is the SAP image made from the original video, and (b) is after removing the waste bin with the manual mask.

Multi-Perspective Panorama Comparison with Automatic Hole Filling

We also created a multi-perspective panorama from the video sequence with the light pole and waste bin removed. Constructed naively, the new panorama would have holes in place of the light pole and the waste bin. To fill the holes, we simply copy each strip containing a hole from the nearest video frame that has no hole at that location (the fill rule is sketched after Figure 7). Since our video data set is very dense and the panorama is created with the image plane set at the store facade, this assumption is mostly valid; it generates small artifacts, but they are unnoticeable unless there are significant depth differences. The technique extends directly to removing people from panoramas, since people are about the size of the waste bin, which is removed completely. Unfortunately, we did not have a good data set with people to try this on, but the idea should work. Figure 7 illustrates the result. Note that both the pole and the bin are removed completely. We see no artifacts on the facade, because that is our assumed depth, though there are some artifacts in the alleyway because of the significant depth difference.

Figure 7: The figure compares the multi-perspective panoramas generated from the video sequence. The panorama on the top is made from the original video, while the panorama on the bottom is created after the light pole and waste bin were removed from the video. The artifacts are mostly due to the alleyway being much deeper than the storefront, where the depth assumption breaks down.
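
A sketch of the fill rule, assuming the panorama's strip geometry has been precomputed from the calibration; src and col_in_frame are hypothetical stand-ins for that geometry, and the search order implements "nearest frame without a hole at that location":

```python
import numpy as np

def fill_panorama(frames, mattes, src, col_in_frame):
    """Strip panorama with matte holes filled from the nearest clean frame.

    frames       : (N, H, W, 3) video frames
    mattes       : (N, H, W) boolean removal mattes
    src          : (C,) natural source frame for each panorama column
    col_in_frame : (N, C) int map giving the column of frame j that images
                   panorama column c on the facade plane (-1 if unseen);
                   src and col_in_frame stand in for the real strip geometry
    """
    N, H, _, _ = frames.shape
    C = len(src)
    pano = np.zeros((H, C, 3), frames.dtype)
    for c in range(C):
        # Try the natural source frame first, then the nearest neighbors,
        # until we find a frame whose strip at this column is hole-free.
        for j in sorted(range(N), key=lambda f: abs(f - src[c])):
            col = col_in_frame[j, c]
            if col >= 0 and not mattes[j, :, col].any():
                pano[:, c] = frames[j, :, col]
                break
    return pano
```

Because the data set is dense, the nearest clean frame is usually only a few frames away, which is why the parallax artifacts stay small at the assumed facade depth.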


Discussion

Shape from light fields is an inherently hard problem, and it is even harder in uncontrolled settings like ours. Both shape-from-stereo and shape-from-focus techniques make assumptions about scene structure that often do not hold outdoors: they assume that objects in the scene are both textured and Lambertian. The texture assumption in particular fails in many places, such as the sky and store facades, as shown in our examples; this confuses the shape algorithm and results in wrong depth estimates. Furthermore, the recovered depths are coarse and noisy, since these operators place no smoothness constraints on the scene, so it is impractical to expect a very detailed depth map from these techniques. However, as our results show, for certain applications a rough depth map is good enough for removing foreground objects like light poles, waste bins, and people.

The technique creates videos with holes in place of the foreground objects. One possible direction for future research is video inpainting, analogous to the image inpainting described in [2]. Another practical use of the technique is as a plug-in for an interactive multi-perspective panorama design tool [3]: as shown in the results, just by selecting a rough mask over the object to be removed in one frame, a multi-perspective panorama can be constructed without the object in it.


Bibliography

  1. Vaish V., Wilburn B., Joshi N. and Levoy M., "Using Plane + Parallax for Calibrating Dense Camera Arrays," Proc. CVPR 2004 (to appear).
  2. Bertalmío M., Sapiro G., Caselles V. and Ballester C., "Image Inpainting," Proc. SIGGRAPH 2000.
  3. Román A., Garg G. and Levoy M., "Interactive Design of Multi-Perspective Images for Visualizing Urban Landscapes," IEEE Visualization 2004 (submitted).
  4. Nayar S. K. and Nakagawa Y., "Shape from Focus," IEEE PAMI 1994.
  5. Seitz S. M. and Dyer C. R., "Photorealistic Scene Reconstruction by Voxel Coloring," Proc. CVPR 1997.
  6. Kutulakos K. N. and Seitz S. M., "A Theory of Shape by Space Carving," Proc. ICCV 1999.
  7. De Bonet J. and Viola P., "Roxels: Responsibility Weighted 3D Volume Reconstruction," Proc. ICCV 1999.
  8. Bhotika R., Fleet D. J. and Kutulakos K. N., "A Probabilistic Theory of Occupancy and Emptiness," Proc. ECCV 2002.

