In our project we address these problems from multiple directions, applying the enhanced capabilities of DTV and image-based techniques. First of all, we have chosen to process and transmit the images of the lecturer, the room, the blackboard, and any additional material separately. Each of these sources has very distinct video characteristics that we plan to exploit. In the following we give a short overview of the whole project, concentrating on the extraction of a high-resolution blackboard image.
At the head-end we need to capture multiple video streams of the room with the lecturer and the blackboard. We then need to segment and broadcast the different objects (much in the spirit of MPEG-4's video objects). On the receiver side we need to recompose the different video streams into a single presentation, but we are now able to let the user customize it according to his preferences. This allows him, for instance, to look at a blackboard image for longer or to review some earlier slides again.
We start by creating a geometric model of the lecture hall, which is augmented with projective textures extracted occasionally from a video stream. This saves considerable bandwidth: instead of sending the image of the background with every video frame, we transmit only the model and the infrequently updated textures. The approach also allows a viewer to move freely within the room and view the classroom from whatever location he prefers, not just the angle chosen by the camera operator.
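As an illustration, extracting a projective texture for one planar surface of the model amounts to a homography warp. The following is a minimal sketch in Python with OpenCV; the function name, corner coordinates, and texture size are hypothetical placeholders, not part of our system:

    import cv2
    import numpy as np

    def extract_texture(frame, corners_px, tex_size=(512, 256)):
        # corners_px: image coordinates of the surface's four corners,
        # in top-left, top-right, bottom-right, bottom-left order.
        w, h = tex_size
        dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
        H = cv2.getPerspectiveTransform(np.float32(corners_px), dst)
        # Rectify the quad into an upright texture for the room model.
        return cv2.warpPerspective(frame, H, (w, h))

Since the room's appearance changes rarely, re-running such an extraction only occasionally keeps the transmitted data small.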
In order to display the lecturer within this model, we need to segment him from the background. We currently use a simple segmentation algorithm that relies on the known colors of the background to distinguish it from the lecturer. Using the known camera position and the geometry of the room, we roughly estimate the lecturer's position in front of the blackboard and place his video image into the scene as a 3D billboard. Although the technique is simple, it already provides a surprisingly realistic view of the lecture while using only a fraction of the bandwidth a full video transmission would require (see Figure 1).
Figure 1: In the background, the textured model of the lecture room; multiple textures are used after the speaker has been segmented out. The textured model is then used to better segment the speaker, who is transmitted and displayed as a billboard. To enhance the realism, the foreground is augmented with chairs.
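As a concrete illustration of the color-based segmentation described above, the following Python/OpenCV sketch compares each color frame against a reference image of the empty room; this per-pixel reference is a simplification of the known-background-colors idea, and all names and thresholds are illustrative:

    import cv2
    import numpy as np

    def lecturer_mask(frame, background, thresh=40):
        # Per-channel difference to the known background, summed into a
        # crude color distance per pixel.
        diff = cv2.absdiff(frame, background).astype(np.int32).sum(axis=2)
        mask = (diff > thresh).astype(np.uint8) * 255
        # Remove speckle noise so the mask can serve as a billboard alpha.
        kernel = np.ones((5, 5), np.uint8)
        return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)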
In order to display the blackboard at a resolution high enough to be readable, it is necessary to use several cameras to form a running image of the blackboard that is updated in real time. Since a single camera cannot capture the entire board with sufficient resolution, we use cameras that pan and zoom to areas of interest and integrate their data into the running high-resolution image of the board. A single fixed camera obtains a low-resolution reference image of the entire board to aid in integrating the image streams from the higher-resolution cameras. It also ensures that something can be said about every part of the board even when an area has not yet been scanned by one of the high-resolution cameras.
The lecturer is segmented out of the low-resolution camera's input to obtain the running reference image of the board without the lecturer obscuring the view. This segmentation problem is simpler than the previous one: since we are eliminating the lecturer rather than extracting him, a conservative algorithm can be used that may also remove a small border around him. The remainder of the frame is then copied over the running low-resolution image. To detect the lecturer, we threshold the intensity difference between the running image and the next video frame, exploiting the fact that the blackboard stays nearly the same. This simple technique can fail and is therefore augmented with a more robust but slower algorithm that analyzes the color distribution of areas that have not been updated for a while (because we might have wrongly identified a piece of blackboard as the lecturer).
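A minimal sketch of this update step, assuming registered grayscale images (the slower color-distribution fallback is omitted, and the threshold is illustrative):

    import cv2
    import numpy as np

    def update_running_image(running, frame, thresh=20):
        # Large differences mark the (probable) lecturer; everything
        # else is blackboard and may be copied into the running image.
        lecturer = (cv2.absdiff(frame, running) > thresh).astype(np.uint8)
        # Dilating the mask is the conservative step mentioned above:
        # it also discards a small border around the lecturer.
        lecturer = cv2.dilate(lecturer, np.ones((7, 7), np.uint8))
        board = lecturer == 0
        running[board] = frame[board]
        return running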
To decide where to point the pan-and-zoom cameras, we maintain a ``curiosity'' bitmap. A bit is marked when we see a large enough difference at the corresponding pixel in the low-resolution control image (which is updated in real time, regardless of the positions of the mobile cameras). A moving camera then sweeps out that area of the image, takes high-resolution images, and clears the corresponding areas of the curiosity bitmap.
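The bookkeeping itself can be sketched as follows, assuming the curiosity bitmap lives at the resolution of the low-resolution control image and is compared against the running low-resolution board image (the names and the threshold are ours, not part of the system):

    import numpy as np

    def mark_curiosity(curiosity, control, reference, thresh=15):
        # Set bits wherever the real-time control image disagrees with
        # the running board image, i.e. where the board may have changed.
        diff = np.abs(control.astype(np.int16) - reference.astype(np.int16))
        curiosity |= diff > thresh
        return curiosity

    def clear_scanned(curiosity, y0, y1, x0, x1):
        # After a pan/zoom camera has swept this region and its data
        # has been integrated, its curiosity bits are cleared.
        curiosity[y0:y1, x0:x1] = False
        return curiosity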
When integrating an image from a high-resolution camera into the output stream, we compute the mapping between the camera's image space and board space by identifying markers on the board. After reprojecting the high-resolution image into the blackboard image, it is masked with the lecturer and copied into the output image. Having a reference stream covering the entire board is crucial for integrating high-resolution images taken from arbitrary positions.
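The reprojection and masking step could look as follows; this sketch assumes the markers have already been located in the camera image, that their board-space coordinates are known, and that a lecturer mask exists at the output resolution (all names are illustrative):

    import cv2
    import numpy as np

    def integrate_highres(output, cam_img, markers_px, markers_board,
                          lecturer):
        # Homography from camera image space to board space, from at
        # least four marker correspondences.
        H, _ = cv2.findHomography(np.float32(markers_px),
                                  np.float32(markers_board))
        h, w = output.shape[:2]
        warped = cv2.warpPerspective(cam_img, H, (w, h))
        # Warp a ones-image the same way to find which output pixels
        # the camera actually covered.
        covered = cv2.warpPerspective(
            np.ones(cam_img.shape[:2], np.uint8), H, (w, h)) > 0
        # Copy only covered pixels not occluded by the lecturer.
        sel = covered & (lecturer == 0)
        output[sel] = warped[sel]
        return output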
We currently use a modified MPEG encoder to compress the high-resolution blackboard image at a significantly lower frame rate. Ideally, we would transmit only those areas that have changed, but MPEG already codes unchanged regions with fairly little overhead.