Broadcast Telepresence
Abstract
Both telepresence and digital television have received a lot of attention
recently. We foresee the convergence of the two in something we call
"broadcast telepresence." In broadcast telepresence a digital television
channel is used to transmit a more complete environment of the location
being displayed, allowing the viewer limited flexibility in determining
how they view the scene. This viewing interactivity allows the viewer
to have a better sense of being present at the remote location than is
possible in traditional television. This paper describes where broadcast
telepresence fits in the greater scope of telepresence and gives a model
for how broadcast telepresence content can be presented to the user.
It also briefly describes the stages required for broadcasting telepresence,
from capture to reconstruction for the user. Finally it discusses
the characteristics of broadcast program content which affect the ease
with which it can be used for broadcast telepresence.
Brad E. Johanson
Stanford University Computer Systems Laboratory
Palo Alto, CA. 94305
bjohanso@stanford.edu
Table of Contents
- Introduction
- Previous Work
- Categorizing Telepresence
- The Broadcast Telepresence Content Model
- The Broadcast Pipeline
- Scenario Considerations
- Conclusion
- References
Introduction
After years of gestation, digital television broadcasts are finally becoming
a reality. From the broadcaster's perspective this allows either more channels
in the same amount of bandwidth, or a single channel of much higher quality
than previously available. Even at higher resolutions, though, digital
television as it is now envisioned offers no significant change compared
to what has been broadcast for the past seventy years. The recent rise
of virtual reality and immersive environments suggests the possibility
of broadcasting more than just traditional two dimensional images. Now
enough information can be transmitted to allow the viewer to feel present
in the location being displayed, rather than just looking into the remote
scene through a small glass portal; in other words, they can become telepresent
to the environment being broadcast to them.
In this paper we attempt to clearly define broadcast telepresence and
the associated issues. We show where broadcast telepresence falls in the
range of telepresence applications, what content could be broadcast, and
the steps required to go from capture of data to final display to the end
user. As telepresence is not well defined right now, we begin by providing
a method of categorizing the various types of telepresence, and use that
scheme to define the key features of broadcast telepresence. We then present
a content model for the information which would be broadcast in telepresent
shows. After this we discuss the pipeline required for a broadcast telepresence
application, from initial capture to final display by the user. Finally
we discuss which issues arise in different types of material that might
be broadcast. This paper deliberately avoids discussing details of what
technologies and techniques would be used for broadcast telepresence, but
instead tries to provide a structure within which specific broadcast telepresence
scenarios can be more easily addressed.
Previous Work
The future of television in a general sense has been addressed by many
people over the years. In particular, the MIT Media Lab has a "Television
of Tomorrow" project [6]. It is dedicated to
research on technologies that will be important for television in the future.
In particular they have looked at transmission and compression of holovideo,
and methods for segmenting videos for more efficient transmission.
"Telepresence" is not a term which has as yet been well defined.
In its most general sense it means the ability to feel present at some
remote location. With this definition, even today's television programs
qualify as "telepresence." A slightly stricter definition of the
term would include video conferencing and distance learning applications
as the simplest telepresence applications. Gordon Bell and Jim Gemmell
at Microsoft address these applications in their article "Non-collaborative
Telepresentations Come of Age" [1].
Many uses of "telepresence" consider a much stricter definition-- telepresence
means the ability to feel present at a remote location through total immersion
in that environment, usually with stereo ear and eye phones, or augmented
reality techniques. The latter approach is used by the members of
the Tele-Immersion Project [7], which addresses
the specific goal of doing collaborative work through the use of telepresence.
Their concept is to have "tele-cubicles" which are open to their surrounding
area on two sides, and have a virtual window into four similar cubicles
through LCD shutter based stereo video display walls on the other cubicle
sides. In this way, users of all four cubicles can be present in
a shared location which combines their different physical locations.
In some virtual reality kiosks or displays, there is a need to provide
the user with some sort of guidance through the system. As will be
discussed in this paper, this is also a problem for broadcast telepresence.
Tinsley Galyean presented a method of providing guided navigation through
an environment in his paper [3]. He draws
an analogy between the user's path through a virtual environment and traveling
down a river. Although the user is allowed some leeway, they are
continuously pulled downstream until the tour or presentation is complete.
Both the user's position and viewing direction are guided in this manner.
Categorizing Telepresence
As noted in the previous section, there is no clear definition of telepresence.
To help clarify the definition, and to better position this paper, this
section provides a method for categorizing telepresence
applications. As mentioned earlier, the most general definition of telepresence
is any technology or presentation which allows the user to feel as if they
were present in a remote location. Beyond that, applications which
have been referred to as telepresence range from watching a lecture at
a remote location in a distance learning application, to a multi-player
game of "QUAKE," to a fully immersive simulation of molecules interacting.
As a means of categorizing the various applications which allow telepresence,
we propose the following three characteristics of telepresence applications:
- Remote Location Type: The type of location into which the user is
projecting their presence. The three main types are virtual, augmented,
and real. Virtual reality is a completely simulated environment.
Augmented reality is a hybrid of some real location and supplemental information
or objects which do not exist, but are made to appear in the real scene.
Reality is a representation of an actual location in the real world.
It may be a polygonal model, but it shows the remote location as it currently
appears, or as it appeared at the time of recording.
- Level of Interactivity: The degree to which the user is able to
interact with the remote environment. The three levels that we have
arrived at are full, viewing, and passive interactivity. With passive
viewing the user is only able to see and hear the remote location from
the point of view which is broadcast to them-- this is like normal television.
With viewing interactivity, the user is able to change their point of view
in the remote location, but has no impact on that location. In other
words, they cannot move remote objects and are not seen-- they are effectively
ghosts. With full interactivity the users can be seen by people at
the remote location and by other telepresent users. They can also
have a physical impact on the remote environment.
- Number of Users: The number of users who are projecting their presence
into the remote location. This can be divided in an arbitrarily fine
manner, but most generally can be looked at as single or multi-user.
Note that this does not necessarily have anything to do with whether the
users can interact with each other or the remote environment-- that is
a function of the level of interactivity.
With the categories just mentioned, the various telepresence
applications can be categorized, as shown in Table 1.
|                   | Full Interactivity: Single User | Full Interactivity: Multi-user | Viewing Interactivity: Single User | Viewing Interactivity: Multi-user | Passive Viewing: Single User | Passive Viewing: Multi-user |
|-------------------|---|---|---|---|---|---|
| Virtual Reality   | Immersive Molecular Simulation | 3-D Network Games, 3-D Chat Rooms | Virtual Museum | Virtual Plays | Video tours of virtual locations | Broadcast VR, i.e. "Toy Story" |
| Augmented Reality | Remote Exploration/Surgery with HUD | Tele-Cubicles | 3-D City Model with supplemental HUD Data | ***3-D Television with Embedded Objects & Information*** | Off-line Distance Learning | Live Distance Learning |
| Reality           | Remote Exploration/Surgery | Robot Sports | 3-D City Model | ***3-D Television, Panoramic Movies*** | Video Tourism | Most Modern Television |

Table 1- The Various Types of Telepresence Applications
As the table shows, there is a wide range of applications. For
this paper we are concerned with "Broadcast Telepresence." The application
types covered by this are highlighted in bold italic in the table.
Broadcast telepresence is multi-user since the remote location is being
made available to many users simultaneously through broadcasting.
The remote locations are either reality or augmented reality. Finally,
and this is the important difference between normal television and broadcast
telepresence, the users all have viewing interactivity, so they are able
to affect how they see the remote location.
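The three-axis categorization just described can be sketched in code. The following is a hypothetical Python sketch; the enum, class, and function names are our own illustrative choices, not part of any existing system:

```python
from dataclasses import dataclass
from enum import Enum

class Location(Enum):
    VIRTUAL = "virtual"
    AUGMENTED = "augmented"
    REAL = "real"

class Interactivity(Enum):
    PASSIVE = "passive"
    VIEWING = "viewing"
    FULL = "full"

@dataclass
class Application:
    name: str
    location: Location
    interactivity: Interactivity
    multi_user: bool

def is_broadcast_telepresence(app: Application) -> bool:
    """Broadcast telepresence: multi-user, viewing interactivity,
    and a real or augmented (not purely virtual) remote location."""
    return (app.multi_user
            and app.interactivity is Interactivity.VIEWING
            and app.location in (Location.REAL, Location.AUGMENTED))

print(is_broadcast_telepresence(
    Application("3-D Television", Location.REAL, Interactivity.VIEWING, True)))        # True
print(is_broadcast_telepresence(
    Application("Most Modern Television", Location.REAL, Interactivity.PASSIVE, True)))  # False
```

The predicate simply encodes the three table cells highlighted above: the conjunction of the multi-user and viewing-interactivity axes with a real or augmented location.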
The Broadcast Telepresence Content Model
Since viewing interactivity is the main difference between broadcast telepresence
and the standard television broadcasts that are available today, the goal
is to define a content model which provides this interactivity without
violating several constraints:
- The viewing interactivity needs to be provided within the context of a
program which evolves in time. In other words, broadcast telepresence
should not be the transmission of a large environment that can be toured
in an arbitrary fashion. The main reason for this is to utilize the
broadcast channel at all times-- downloading a large "set" which can be
toured uses a lot of bandwidth at startup, and little or none the rest
of the time.
- Broadcast telepresence content should offer a super-set of the functionality
of current broadcast (primarily television) content. In other words,
offering viewing interactivity should not require any activity of the user,
but rather allow them additional freedom should they so desire.
- The definition of the class of broadcast content needs to be flexible enough
to cover all the programming types which may be used. If possible,
everything from current television programs to immersive programs with
full flexibility should be covered.
Galyean's virtual reality navigation technique based on the "river analogy"
[3] (mentioned in the Previous Work Section) provides
a good place to begin defining the characteristics of a broadcast telepresence
content model. His method of navigation was used specifically for
a Virtual Reality presentation at a museum where a user was given a fixed
amount of time to travel through a set. To allow the user flexibility,
while ensuring that they completed their tour in a deterministic amount
of time, the user was connected by a spring to an anchor point that moved
through the set in a fixed amount of time. The tension and length
of the spring controlled the amount of leeway that the user was given during
any point in the presentation. With this interaction method, the
content of a broadcast telepresence program would be the immediate region
to which the user had access at that given time. If the user chose
not to interact, they would be dragged along with the anchor point through
a fixed tour.
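The spring-and-anchor interaction just described can be sketched as a one-dimensional simulation. This is purely illustrative; the spring constant, time step, and rates are arbitrary assumptions, not values from Galyean's system:

```python
def step(viewer, anchor, k=0.5, dt=0.1):
    """One update of a 1-D viewer position tied to the anchor by a spring:
    the restoring force is proportional to the viewer's displacement."""
    return viewer + k * (anchor - viewer) * dt

viewer, anchor = 0.0, 0.0
for _ in range(100):
    anchor += 0.1              # the anchor advances downstream at a fixed rate
    viewer = step(viewer, anchor)
# The viewer lags behind the anchor but is continuously dragged along,
# so the tour still completes in a deterministic amount of time.
```

Raising the spring constant `k` tightens the tour (less leeway); lowering it lets the viewer drift further from the prescribed path before being pulled back.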
Galyean's navigation technique was designed for tours of a fixed virtual
set. The range of broadcast telepresence programming is broader than
this, so we propose here a slightly more robust and flexible model for
broadcast telepresence content. In addition, we discuss here the
implications of the navigation system on the information that needs to
be transmitted, since typical broadcast telepresence applications will
not have the luxury of having the entire set available at any given time.
The major change in our content model from the system which Galyean
discusses is the use of "story-lines." Since broadcast programming
takes place across multiple different sets, there is a need to provide
for discontinuities in the flow of the program. Instead of having
a path through a set, a "story-line" is a series of paths through different
sets which combine to convey the story which the program is telling.
Further, there may be more than one way of telling a story: the same series
of sets could be viewed by traveling along slightly different paths through
each set. In other words a given program might have more than one
"story-line." Different directors could give the program slightly
different flavors by defining separate story-lines.
The use of multiple story-lines requires our next change from Galyean's
system, since multiple story-lines require multiple "anchors." Unfortunately,
if a user is attached to only one anchor point, there is no smooth way
of transferring to a different story-line. Instead of a spring, we
propose that each anchor is actually a gravitational source. Once the user
moves a certain distance away from one anchor, they will be drawn into
the gravitational field of another anchor. By adjusting the strength
of the source, the amount of freedom the user has to roam can be controlled.
With no gravity complete freedom is possible; with infinite gravity the
user is forced to the anchors. In addition, we propose that there
be anchors for both the viewer's position and the point at which they are
looking. In this way important sights are highlighted, as well as
the point from which they are viewed. Figure 1
below shows a basic scenario with two story-lines:
Figure 1- A Broadcast Telepresence Program
In the figure there are two story-lines. The main story-line,
A, is shown looking at the dark star from the dark circle. The other
is looking from the light circle toward the light star. As time progresses,
the viewpoints and positions advance along their trajectories.
Figure 1 also shows two regions: a movement
box, and a modeled region. As mentioned earlier, the broadcast nature
of the medium prevents an entire model or set from being transmitted all
at once. Instead, the information transmitted at any given moment
is the information needed to re-create the parts of the set becoming viewable
at that time. The modeled region is the area of the set which
is available at any specific point in time; the viewer is able to look
at anything in this region. The movement box (which may actually
be some other shape) is the region in which the viewer is allowed to freely
change their position. This is needed since the gravitational model,
unlike the spring model, does not place a hard limit on the region in which
the viewer can move. As time advances both the modeled region and
the movement box potentially change position.
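The gravitational model with a movement box can be sketched as a toy two-dimensional simulation. The following Python sketch is illustrative only: the inverse-square pull is capped near an anchor to keep the simulation stable, and all constants are assumptions rather than proposed values:

```python
import math

def gravity_pull(viewer, anchors, strength=1.0):
    """Sum the pulls from each 2-D anchor (x, y) on the viewer.
    Inverse-square attraction, capped so the force stays bounded."""
    fx = fy = 0.0
    for ax, ay in anchors:
        dx, dy = ax - viewer[0], ay - viewer[1]
        d2 = dx * dx + dy * dy
        if d2 < 1e-9:
            continue                       # already at the anchor
        d = math.sqrt(d2)
        f = strength / max(d2, 1.0)        # cap avoids instability near anchors
        fx += f * dx / d
        fy += f * dy / d
    return fx, fy

def clamp_to_box(p, box):
    """Keep the viewer inside the movement box ((x0, y0), (x1, y1))."""
    (x0, y0), (x1, y1) = box
    return (min(max(p[0], x0), x1), min(max(p[1], y0), y1))

def step(viewer, anchors, box, strength=1.0, dt=0.1):
    fx, fy = gravity_pull(viewer, anchors, strength)
    return clamp_to_box((viewer[0] + fx * dt, viewer[1] + fy * dt), box)

# Two story-line anchors; the viewer starts nearer the second one.
anchors = [(0.0, 0.0), (5.0, 0.0)]
box = ((-1.0, -1.0), (6.0, 1.0))
viewer = (4.0, 0.5)
for _ in range(200):
    viewer = step(viewer, anchors, box)
# The viewer is drawn into the gravitational field of the nearer
# anchor and settles close to it, while never leaving the movement box.
```

Raising `strength` approximates the infinite-gravity case where the viewer is pinned to a story-line; setting it near zero leaves the viewer free to roam anywhere within the movement box.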
The broadcast telepresence content model we have proposed satisfies
the three constraints with which we began. By having story-lines
and a changing modeled region the model accounts for the time evolving
nature of the broadcast medium. By allowing viewers to just sit back
and watch as they are drawn along a certain story-line, while still allowing
them to change their position and view if desired, the model provides benefits
over traditional broadcasting without removing functionality. Finally,
the model is flexible enough to account for a wide range of content.
For example:
- Traditional television is simply a single story-line with gravity so high
that the viewer can only view from the perspective broadcast to them.
- Shows which broadcast multiple camera views have high gravity with multiple
story-lines.
- Panoramic shows, which allow the user to change their view direction but
not their position, are single story-line with high gravity on position
and low gravity on viewpoint.
- Shows which allow full freedom have low gravity for both position and viewpoint.
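The four cases above amount to nothing more than different gravity settings. A minimal sketch, in which the numeric values and preset names are purely illustrative (infinite gravity standing for "pinned to the story-line"):

```python
INF = float("inf")

# (story-lines, position gravity, view gravity): INF pins the viewer,
# small values leave them nearly free to roam.
PRESETS = {
    "traditional television": (1, INF, INF),
    "multi-camera show":      (3, INF, INF),   # one story-line per camera
    "panoramic show":         (1, INF, 0.1),
    "full-freedom show":      (1, 0.1, 0.1),
}

def may_move(preset):
    """Viewer can change position only if position gravity is finite."""
    return preset[1] != INF

def may_look_around(preset):
    """Viewer can change view direction only if view gravity is finite."""
    return preset[2] != INF

print(may_look_around(PRESETS["panoramic show"]))   # True
print(may_move(PRESETS["panoramic show"]))          # False
```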
The model we provide gives a framework for the type of telepresence content
which could be broadcast. With this defined it is possible to think
about how one would go about broadcasting such content, and what sorts
of difficulties might be encountered in so doing.
The Broadcast Pipeline
For a given content type, there is a set of stages that must be followed
to capture, transmit and reconstruct the program for the viewer.
This section lists these stages and briefly describes what needs to be
done for each stage, along with the complications which may arise.
The stages are:
- Capture: At this stage the real scene is captured and digitized
for transmission to the viewer. For broadcast telepresence this involves
acquisition of more data than for current television programs. Depth
information may be captured using multiple cameras (see Kanade [4]),
or active range image devices. Wide angle images might also be captured
for image based rendering. Sound also needs to be captured for rendering
on the viewer's equipment.
- Compression: Before transmission to viewers, the captured information
must be compressed. If image maps, panoramas or multiple camera views
are all that is being captured, simple video and image compression techniques
could be used. If a more complex rendering method is being used,
polygonal models may be extracted and transmitted. Scene segmentation
methods could also be used to transmit only those objects in the scene that
are changing dynamically.
- Transmission: This stage should use existing infrastructure.
The main possibility is to use an MPEG transport stream over a 19.2 Megabit
per second HDTV broadcast channel. It may be possible to transmit
background or set information rapidly during transmission of commercials
or other low bandwidth sections of the programming.
- Decompression: The inverse of the compression stage. For content
that allows a lot of viewing interactivity, the decompression may actually
involve scene compositing and rendering from the position and viewpoint
that the viewer has chosen.
- Viewing and Interaction: The final stage is the actual interface
to the user. Even though complete 3-d environments are being transmitted,
the user's display could be as simple as a standard television. In
this case, the viewer would only see 2-d projections from their selected
viewpoint and position. Stereo displays using LCD shutter glasses
and VR headsets are other possible display types. The control of
the display could be accomplished using a remote control with a space orb
or similar device which allows several degrees of freedom. Different
story-lines could be selected using a story-line changer on the remote
(similar to a channel changer), or by displaying the different story-line
views and selecting from them with on screen menus.
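The five stages above can be sketched as a chain of placeholder functions. This is a hypothetical Python skeleton; real implementations of each stage would involve capture hardware, codecs, broadcast equipment, and renderers:

```python
def capture(scene):
    """Acquire images, depth, and sound from the real scene."""
    return {"frames": scene, "depth": [0.0] * len(scene), "audio": b""}

def compress(raw):
    """Stand-in for video/model compression; here it just tags the payload."""
    return ("compressed", raw)

def transmit(packet):
    """Stand-in for the broadcast channel (e.g. an MPEG transport stream)."""
    return packet  # in reality: modulation, broadcast, reception

def decompress(packet):
    """Inverse of the compression stage."""
    tag, raw = packet
    assert tag == "compressed"
    return raw

def view(raw, viewpoint):
    """Render a 2-D projection from the viewer's chosen viewpoint."""
    return f"rendering {len(raw['frames'])} frames from {viewpoint}"

def pipeline(scene, viewpoint):
    """Run the full capture-to-display chain for one viewer."""
    return view(decompress(transmit(compress(capture(scene)))), viewpoint)

print(pipeline(["frame0", "frame1"], viewpoint=(0, 0, 1)))
# rendering 2 frames from (0, 0, 1)
```

Note that only the viewing stage takes the viewer's chosen viewpoint: the earlier stages are shared by all viewers, which is what makes the medium a broadcast.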
Scenario Considerations
This section contains a description of some of the critical characteristics
of content for broadcast telepresence programs. It is assumed here
that the content for the programs will allow the user flexibility in either
viewpoint, view position, or both. For each characteristic, some
program types are listed, and the specific problems related to the characteristic
are discussed.
Any given program can be thought of as consisting of two elements: the
background scenery and the foreground action. The characteristics
given here relate to these two elements.
Background Characteristics
The background scenery, or set, of a program is important to characterize
since it may be possible to segment out the background to allow compression
and/or better quality for the main focus of a scene.
One important characteristic of the set is whether it is a fixed set,
or one that changes over time. An example of a fixed set would be
the background scenery in a situation comedy. In this case, the set
is always the same and covers a fixed volume. On account of this,
it might be possible to store a model of the set on the viewer's equipment
before a program starts, and only transmit the actions of the characters
for compositing with the set model. In an educational show which
follows the course of divers through an underwater shipwreck, however,
the background set is continuously changing as the divers continuously
change locations. In this case, the set is probably too large, and
probably not well enough known to model and transmit ahead of time to the
viewer's equipment. This scenario would then require a transmission
of the appropriate regions of the set as the show progresses.
A second characteristic of the set is its degree of complexity.
A simple set, such as that used in a political talk show can be very easily
modeled. As with a fixed set, this makes it easy to transmit a model
to the viewer's equipment so that the background does not need to be transmitted.
A more complex set, such as a forest, would be more difficult to model,
and could also cause problems since the principals in the scene may move
in and out of occlusion with parts of the set. In the case of a complex
set, the user's viewing interactivity may have to be limited, and it may
not be possible to transmit a background model to the user's equipment
ahead of time.
Foreground Characteristics
The foreground of a program is the most important, since that is where
the viewer's focus lies most of the time. The type of action that
is occurring can also have a big effect on the computational effort needed
to segment and compress a scene.
The first characteristic of the action is whether it is being captured
and transmitted live, or if it is being transmitted to users at some later
time. Delayed transmission, as is the case for most dramas, action
shows, and situational comedies, allows large amounts of computational
effort to be applied to compressing and segmenting the action. This
would allow broadcasters to give the viewer much more flexibility in navigating
around the scenes being transmitted. Live action, as in a news bulletin,
or a sports game, requires that the entire capture and compression process
take place in real time. Even allowing for the asymmetry of the compression/decompression
process (e.g. the broadcaster can afford more expensive equipment than
the end user), it will not be possible to perform as complex a capture
and compression process as for delayed transmission programs. This
means that users viewing live action will have more limited viewing interactivity.
As with the background, an important characteristic of the foreground
action is its complexity. Simple foregrounds are much easier to describe.
In the case of a political talk show, for example, you may only have three
or four people sitting at a desk. All of the principals in the scene
are convex objects and none of them interact with or occlude any of the
others in the scene. This makes the job of segmenting out each of
the foreground characters computationally much easier since many assumptions
may be made. A football game, on the other hand, has very complex
foreground action. There are many people interacting with each other,
and all of them are both occluding and are being occluded by others in
the scene. This makes it very difficult to obtain depth information
and segment the scene, and hence more difficult to provide a wide range
of views to the user.
A Simple Example
Based on the above characteristics, the simplest type of content would
be one with a fixed, simple set, and simple foreground action, that is
not being displayed live. One example of this might be a situational
comedy which may have only one or two rooms that are being used for the
set. Since the sets are fixed and relatively simple, they can be
transmitted ahead of time as a model. Figure 2 shows the content
model of Section 4 applied to this case:
Figure 2- Content Model Applied to a
Situational Comedy
As the figure shows, the set can be quite simple. The modeled
region for the program therefore remains stationary, so all of the background
can be kept and simply composited with the foreground action. The
capture of the foreground is also made simpler since the background is
already known. The background information can simply be subtracted
away to determine what information is in the foreground. Using this
technique and range capture from several different perspectives, it would
be fairly straightforward to allow the viewer a small range of motion and
a fairly wide viewing region. Since the show is not shown live, the
computational power needed is not much of an issue.
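The background-subtraction idea just described can be sketched in a few lines. This is a toy grayscale example, with rows of integers standing in for real images; the threshold value is an arbitrary assumption:

```python
def foreground_mask(frame, background, threshold=10):
    """Flag as foreground any pixel that differs from the stored
    background model by more than the threshold."""
    return [[abs(p - b) > threshold for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

background = [[50, 50, 50],
              [50, 50, 50]]
frame      = [[50, 200, 52],     # an actor's pixels (200, 199) over the set
              [48, 199, 50]]

mask = foreground_mask(frame, background)
print(mask)  # only the center column is flagged as foreground
```

Since the set model is known in advance, this subtraction is all that is needed to isolate the foreground action for separate compression and compositing; real footage would of course also require noise handling and shadow suppression.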
Conclusion
In this paper we have tried to present a definition of broadcast telepresence
and its characteristics. We began by giving a method for categorizing
telepresence applications, and showed that the key difference between normal
broadcast television and broadcast telepresence is the addition of viewing
interactivity. To this end, we then described a content model for
broadcast telepresence explaining how viewing interactivity of varying
amounts could be provided in programs. After this, the broadcast
pipeline required for capture, transmission and display of a broadcast
telepresence show was described, and finally various application characteristics
and the problems that they create were discussed. With the description
of the domain provided here, the key technical problems of broadcast telepresence,
namely capture, compression, and reconstruction, become easier to understand
for a given class of content.
References
1. Bell, G., Gemmell, J., "Non-collaborative Telepresentations Come of Age,"
Communications of the ACM, April 1997, Vol. 40, No. 4, pp. 79-89,
http://www.research.microsoft.com/research/barc/Telepresence/telepresentations/telepresentations.html
2. Fuchs, H., Bishop, G., Arthur, K., McMillan, L., Bajcsy, R., Wook Lee, S.,
Farid, H., and Kanade, T., "Visual Space Teleconferencing Using a Sea
of Cameras," Proceedings of the First International Symposium on Medical
Robotics and Computer Assisted Surgery, Vol. 2, Pittsburgh, PA, September
22-24, 1994, http://www-bcs.mit.edu/~farid/mrcas94.ps.gz
3. Galyean, T.A., "Guided Navigation of Virtual Environments,"
1995 Symposium on Interactive 3D Graphics, Monterey, CA, USA
4. Kanade, T., Yoshida, A., et al., "A Stereo Machine for Video-rate Dense
Depth Mapping and Its New Applications," Proceedings of the 15th Computer
Vision and Pattern Recognition Conference (CVPR), June 18-20, 1996, San
Francisco, http://www.cs.cmu.edu/afs/cs/project/stereo-machine/www/cvpr96.ps
5. Levoy, M., Hanrahan, P., "Light Field Rendering," Stanford University,
http://www-graphics.stanford.edu/papers/light/
6. MIT Media Lab Television of Tomorrow Project, http://tvot.www.media.mit.edu/projects/tvot/
7. Tele-Immersion Project Home Page, http://io.advanced.org/tele-immersion/