One interesting aspect of the project is to capture a
representation of a remote location with nonuniform resolution.
Due to camera and bandwidth limitations, we cannot create a
single high-resolution image representing the environment.
However, we can allocate resources on a global and local scale to
direct where high resolution is needed. While one camera with a
wide field of view captures the macroview of the environment, one
or more cameras with a smaller field of view obtain a
high-resolution image of portions of the environment. Because
these cameras can pan, tilt, and zoom, their foveated vision
moves across the environment. In this way, they can concentrate
image resolution on objects of interest.
To construct these nonuniform resolution representations, we must
merge these overlapping video streams into one conceptual image.
The macroview image is used as the basis of the representation,
and the microview images are registered into the macroview.
Registration will be aided by intrinsic camera calibration and
the pan-tilt-zoom parameters of the microcameras. This
information will aid in localizing and warping the microviews for
a seamless fit in the macroview.
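A minimal sketch of this registration step, assuming the scene is distant enough that a planar homography adequately maps microview pixels into macroview coordinates (NumPy only; function and variable names are illustrative, not from the system itself):

```python
import numpy as np

def warp_into_macroview(macro, micro, H):
    """Paste a microview image into a copy of the macroview, given a
    3x3 homography H mapping microview pixel coords (x, y) to
    macroview coords. Uses inverse mapping with nearest-neighbor
    sampling; a production system would interpolate and blend seams."""
    h, w = macro.shape[:2]
    out = macro.copy()
    # Homogeneous coordinates of every macroview pixel.
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    # Map each macroview pixel back into the microview.
    src = np.linalg.inv(H) @ pts
    sx = np.round(src[0] / src[2]).astype(int).reshape(h, w)
    sy = np.round(src[1] / src[2]).astype(int).reshape(h, w)
    mh, mw = micro.shape[:2]
    inside = (sx >= 0) & (sx < mw) & (sy >= 0) & (sy < mh)
    out[inside] = micro[sy[inside], sx[inside]]
    return out
```

The inverse-mapping direction (iterating over destination pixels rather than source pixels) avoids holes in the pasted region when the microview is magnified.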
The views from the cameras are transmitted as separate video
streams and are merged together at the receiver (lecture location)
side. The lecturer normally sees the macroview of the audience at
the remote location, but has the option to zoom in on particular
portions of the image and obtain a clearer view of, for example, a
person asking a question. If the user also has control over the motion
of the moving cameras, he or she can zoom in on any part of the image.
Otherwise, an attempt to magnify a region where no high-resolution
data is available will result in a blurry picture.
A big challenge is determining accurate registration of the
microviews within the macroview. Several phenomena can make this
difficult. First, the cameras will have different lenses, each
imposing a different warp on the objects. Camera calibration
will help us remove those effects. Second, the images may have
differing brightness due to distinct camera responses. These
brightness discontinuities must be smoothed to create a coherent
appearance. Finally, the pan-tilt-zoom parameters of each
microcamera must be mapped into a global parameter system. In
some sense, this is extrinsic camera calibration because we must
determine the relationship between all the microcameras and the
macrocamera in order to interpret the pan-tilt-zoom parameters.
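One way to picture this mapping, with the axis conventions below being assumptions rather than the system's actual parameterization: convert each microcamera's pan-tilt reading into a viewing direction in its own base frame, then rotate that direction by the fixed extrinsic rotation relating the microcamera's mount to the macrocamera's.

```python
import math

def ptz_direction(pan_deg, tilt_deg):
    """Unit viewing direction of a pan-tilt camera in its own base
    frame. The convention here is an assumption: pan rotates about
    the vertical axis, tilt about the horizontal axis, and (0, 0)
    looks straight down the +z axis."""
    p, t = math.radians(pan_deg), math.radians(tilt_deg)
    return (math.sin(p) * math.cos(t), math.sin(t), math.cos(p) * math.cos(t))

def to_macro_frame(v, R):
    """Rotate a direction from a microcamera's base frame into the
    macrocamera's frame; R is the fixed 3x3 extrinsic rotation
    between the two mounts, estimated once during calibration."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))
```

Estimating each R once is what makes the pan-tilt-zoom readings interpretable in a single global frame at run time.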
Merging the video streams from the different cameras also poses some
challenges. The integration of the images should be seamless, or at
least tolerable to the user. This is especially true for the parts of the
picture where there is a transition from hi-res to lo-res video.
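One simple way to soften such a transition, sketched below under the assumption that we already know which pixels came from the hi-res stream, is to cross-fade over a band of pixels rather than switching abruptly (the mask-shrinking loop is a crude stand-in for a proper distance transform):

```python
import numpy as np

def feather_blend(lo, hi, mask, width=8):
    """Cross-fade a hi-res patch into the lo-res background.
    `mask` is 1 where hi-res data exists. Its hard edge is softened
    over roughly `width` pixels, so differences between the two
    streams fade gradually instead of forming a visible seam.
    Note: np.roll wraps at image borders, so a mask touching the
    border would need padding first."""
    alpha = mask.astype(float)
    for _ in range(width):
        # Shrink the mask by one pixel (4-neighborhood minimum) and
        # average with the current alpha, building an inward ramp.
        shrunk = np.minimum.reduce([
            alpha,
            np.roll(alpha, 1, 0), np.roll(alpha, -1, 0),
            np.roll(alpha, 1, 1), np.roll(alpha, -1, 1),
        ])
        alpha = 0.5 * (alpha + shrunk)
    return (1 - alpha) * lo + alpha * hi
```

The same ramp also diffuses brightness discontinuities between the two camera responses, though a full solution would equalize exposure before blending.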
On the other hand, it might be convenient to be able to spotlight a
speaking person whose image is being captured in a hi-res video stream,
thereby taking advantage of the resolution difference with the rest of
the image. There are probably other choices regarding the user interface
that need to be explored. It is conceivable that the hi-res images can be
cached, i.e., if a portion of the picture was zoomed in on at some point,
but no changes occur in that region afterwards, then the last
hi-res view of that region is still valid. Probably the greatest challenge
is that the process of merging the video streams must be done in