Foveated Video

Approach

One interesting aspect of the project is capturing a representation of a remote location with nonuniform resolution. Camera and bandwidth limitations prevent us from creating a single high-resolution image of the entire environment; instead, we allocate resources on a global and local scale to choose where high resolution is placed. While one camera with a wide field of view captures the macroview of the environment, one or more cameras with a narrower field of view obtain high-resolution images of portions of it. Because these cameras can pan, tilt, and zoom, their foveated vision moves across the environment. In this way, they can concentrate image resolution on objects of interest.
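
As a rough illustration of the resolution tradeoff, the sketch below (in Python, with placeholder numbers rather than measured values) models a camera view by its field of view and pixel count; the microcamera spends the same pixel budget on a much narrower field, which is where its high resolution comes from.

    from dataclasses import dataclass

    @dataclass
    class CameraView:
        name: str
        horizontal_fov_deg: float   # field of view currently covered
        width_px: int               # horizontal pixel count

        def pixels_per_degree(self) -> float:
            # Angular resolution: the same pixel budget over a narrower
            # field yields a sharper image of that region.
            return self.width_px / self.horizontal_fov_deg

    macro = CameraView("macro", horizontal_fov_deg=90.0, width_px=640)
    micro = CameraView("micro", horizontal_fov_deg=15.0, width_px=640)
    # macro: ~7 px/deg overall; micro: ~43 px/deg over its pan-tilt window.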

To construct these nonuniform-resolution representations, we must merge the overlapping video streams into one conceptual image. The macroview image serves as the basis of the representation, and the microview images are registered into it. Registration will be aided by intrinsic camera calibration and by the pan-tilt-zoom parameters of the microcameras; this information helps localize and warp the microviews for a seamless fit in the macroview.
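
A minimal registration sketch, assuming OpenCV is available and that feature matching is used to localize the microview: matched features feed a RANSAC homography fit, and the resulting warp places the microview in macroview coordinates. The function names are our own, and a real implementation would additionally use the intrinsic calibration and the current pan-tilt-zoom reading to seed or constrain the estimate.

    import cv2
    import numpy as np

    def register_microview(macro_gray, micro_gray):
        """Estimate a homography mapping microview pixels into the macroview."""
        orb = cv2.ORB_create(nfeatures=2000)
        kp_micro, des_micro = orb.detectAndCompute(micro_gray, None)
        kp_macro, des_macro = orb.detectAndCompute(macro_gray, None)

        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(des_micro, des_macro),
                         key=lambda m: m.distance)[:200]

        src = np.float32([kp_micro[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp_macro[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

        # RANSAC discards mismatched features before fitting the warp.
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return H

    def warp_into_macroview(micro_frame, H, macro_shape):
        h, w = macro_shape[:2]
        return cv2.warpPerspective(micro_frame, H, (w, h))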

The views from the cameras are transmitted as separate video streams and merged at the receiver (lecture location) side. The lecturer normally sees the macroview of the audience at the remote location but can zoom in on particular portions of the image to obtain a clearer view of, for example, a person asking a question. If the user also has control over the motion of the moving cameras, any part of the image can be magnified. Otherwise, an attempt to magnify a region where no high-resolution data is available will result in a blurry picture.
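
One possible shape for the receiver-side zoom logic, assuming each microview has already been warped into macroview coordinates along with a coverage mask (as sketched above); the fallback branch is what produces the blurry picture just mentioned. All names are illustrative.

    import cv2
    import numpy as np

    def zoomed_view(macro, micro_warped, micro_mask, roi, out_size=(640, 480)):
        """Magnify roi = (x, y, w, h), given in macroview coordinates."""
        x, y, w, h = roi
        macro_crop = macro[y:y+h, x:x+w]
        micro_crop = micro_warped[y:y+h, x:x+w]
        mask_crop = micro_mask[y:y+h, x:x+w]   # nonzero where hi-res pixels exist

        # Prefer high-resolution pixels; fill the rest from the macroview.
        merged = np.where(mask_crop[..., None] > 0, micro_crop, macro_crop)

        # Upsampling regions that only have macroview data looks blurry.
        return cv2.resize(merged, out_size, interpolation=cv2.INTER_CUBIC)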

Challenges

A major challenge is determining an accurate registration of the microviews within the macroview. Several phenomena make this difficult. First, the cameras have different lenses, each imposing a different warp on the objects; camera calibration will help us remove these effects. Second, the images may differ in brightness due to distinct camera responses, and these brightness discontinuities must be smoothed to create a coherent appearance. Finally, the pan-tilt-zoom parameters of each microcamera must be mapped into a global parameter system. In some sense, this is extrinsic camera calibration: we must determine the relationship between all the microcameras and the macrocamera in order to interpret the pan-tilt-zoom parameters.
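
For the brightness discontinuities specifically, one common remedy is to feather the blend near the seam and apply a gain correction estimated from the overlap region. The sketch below illustrates this under the same assumptions as before (a microview already warped into macroview coordinates, plus its mask); the feather width and the mean-intensity gain model are our assumptions, not project specifications.

    import cv2
    import numpy as np

    def feather_blend(macro, micro_warped, micro_mask, feather_px=31):
        # Blend weight falls from 1 deep inside the microview to 0 at its edge.
        dist = cv2.distanceTransform(micro_mask.astype(np.uint8), cv2.DIST_L2, 5)
        alpha = np.clip(dist / feather_px, 0.0, 1.0)[..., None]

        # Crude gain correction: match mean intensity over the overlap region.
        overlap = micro_mask > 0
        gain = macro[overlap].mean() / max(micro_warped[overlap].mean(), 1e-6)
        micro_adj = np.clip(micro_warped.astype(np.float32) * gain, 0, 255)

        blended = alpha * micro_adj + (1.0 - alpha) * macro.astype(np.float32)
        return blended.astype(np.uint8)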

Merging the video streams from the different cameras also poses challenges. The integration of the images should be seamless, or at least tolerable to the user, especially where the picture transitions from high-resolution to low-resolution video. On the other hand, it might be convenient to spotlight a speaking person whose image is captured in a high-resolution stream, taking advantage of the resolution difference with the rest of the image. Other user-interface choices probably need to be explored as well. It is conceivable that the high-resolution images can be cached: if a portion of the picture was zoomed in on at some point, but no changes occur in that region afterwards, then the last high-resolution view of that region is still valid. Probably the greatest challenge is that the merging of the video streams must be done in real time.
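
The caching idea might look roughly like the following: the last high-resolution view of a region is reused until a simple frame-difference test against the macroview indicates the region has changed. The difference measure and threshold are placeholders for whatever change detector the system actually adopts.

    import numpy as np

    class HiResCache:
        def __init__(self, change_threshold=12.0):
            self.entries = {}   # roi -> (hi-res view, macroview crop at capture time)
            self.change_threshold = change_threshold

        def store(self, roi, hi_res, macro):
            x, y, w, h = roi
            self.entries[roi] = (hi_res, macro[y:y+h, x:x+w].copy())

        def lookup(self, roi, macro_now):
            """Return the cached view if the region appears unchanged, else None."""
            if roi not in self.entries:
                return None
            hi_res, ref = self.entries[roi]
            x, y, w, h = roi
            diff = np.abs(macro_now[y:y+h, x:x+w].astype(np.float32)
                          - ref.astype(np.float32))
            if diff.mean() > self.change_threshold:
                del self.entries[roi]   # region changed; cached view is stale
                return None
            return hi_res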