Broadcast Telepresence
Abstract
Both telepresence and digital television have received a lot of attention
recently. We foresee the convergence of the two in something we call
"broadcast telepresence." In broadcast telepresence a digital television
channel is used to transmit a more complete environment of the location
being displayed, allowing the viewer limited flexibility in determining
how they view the scene. This viewing interactivity allows the viewer
to have a better sense of being present at the remote location than is
possible in traditional television. This paper describes where broadcast
telepresence fits in the greater scope of telepresence and gives a model
for how broadcast telepresence content can be presented to the user.
It also briefly describes the stages required for broadcasting telepresence,
from capture to reconstruction for the user. Finally it discusses
the characteristics of broadcast program content which affect the ease
with which it can be used for broadcast telepresence.
Brad E. Johanson
Stanford University Computer Systems Laboratory
Palo Alto, CA. 94305
bjohanso@stanford.edu
Table of Contents
- Introduction
- Previous Work
- Categorizing Telepresence
- The Broadcast Telepresence Content Model
- The Broadcast Pipeline
- Scenario Considerations
- Conclusion
- References
Introduction
After years of gestation, digital television broadcasts are finally becoming
a reality. From the broadcaster's perspective this allows either more channels
in the same amount of bandwidth, or a single channel of much higher quality
than previously available. Even at higher resolutions, though, digital
television as it is now envisioned offers no significant change compared
to what has been broadcast for the past seventy years. The recent rise
of virtual reality and immersive environments suggests the possibility
of broadcasting more than just traditional two dimensional images. Now
enough information can be transmitted to allow the viewer to feel present
in the location being displayed, rather than just looking into the remote
scene through a small glass portal; in other words, they can become telepresent
to the environment being broadcast to them.
In this paper we attempt to clearly define broadcast telepresence and
the associated issues. We show where broadcast telepresence falls in the
range of telepresence applications, what content could be broadcast, and
the steps required to go from capture of data to final display to the end
user. As telepresence is not well defined right now, we begin by providing
a method of categorizing the various types of telepresence, and use that
scheme to define the key features of broadcast telepresence. We then present
a content model for the information which would be broadcast in telepresent
shows. After this we discuss the pipeline required for a broadcast telepresence
application, from initial capture to final display by the user. Finally
we discuss which issues arise in different types of material that might
be broadcast. This paper deliberately avoids discussing details of what
technologies and techniques would be used for broadcast telepresence, but
instead tries to provide a structure within which specific broadcast telepresence
scenarios can be more easily addressed.
Previous Work
The future of television in a general sense has been addressed by many
people over the years. In particular, the MIT Media Lab has a "Television
of Tomorrow" project [6]. It is dedicated to
research on technologies that will be important for television in the future.
In particular they have looked at transmission and compression of holovideo,
and methods for segmenting videos for more efficient transmission.
"Telepresence" is not a term which has as yet been well defined.
In its most general sense it means the ability to feel present at some
remote location. With this definition, even today's television programs
qualify as "telepresence." A slightly stricter definition of the
term would include video conferencing and distance learning applications
as the simplest telepresence applications. Gordon Bell and Jim Gemmell
at Microsoft address these applications in their article "Non-collaborative
Telepresentations Come of Age" [1].
Many uses of "telepresence" consider a much stricter definition-- telepresence
means the ability to feel present at a remote location through total immersion
in that environment, usually with stereo ear and eye phones, or augmented
reality techniques. The latter approach is used by the members of
the Tele-Immersion Project [7], which addresses
the specific goal of doing collaborative work through the use of telepresence.
Their concept is to have "tele-cubicles" which are open to their surrounding
area on two sides, and have a virtual window into four similar cubicles
through LCD shutter based stereo video display walls on the other cubicle
sides. In this way, users of all four cubicles can be present in
a shared location which combines their different physical locations.
In some virtual reality kiosks or displays, there is a need to provide
the user with some sort of guidance through the system. As will be
discussed in this paper, this is also a problem for broadcast telepresence.
Tinsley Galyean presented a method of providing guided navigation through
an environment in his paper [3]. He draws
an analogy between the user's path through a virtual environment and traveling
down a river. Although the user is allowed some leeway, they are
continuously pulled downstream until the tour or presentation is complete.
Both the user's position and viewing direction are guided in this manner.
Categorizing Telepresence
As noted in the previous section, there is no clear definition of telepresence.
To help clarify the definition, and to better position this paper, this
section provides a method for categorizing telepresence
applications. As mentioned earlier, the most general definition of telepresence
is any technology or presentation which allows the user to feel as if they
were present in a remote location. Beyond that, applications which
have been referred to as telepresence range from watching a lecture at
a remote location in a distance learning application, to a multi-player
game of "QUAKE," to a fully immersive simulation of molecules interacting.
As a means of categorizing the various applications which allow telepresence,
we propose the following three characteristics of telepresence applications:
- Remote Location Type: The type of location into which the user is
projecting their presence. The three main types are virtual, augmented,
and real. Virtual reality is a completely simulated environment.
Augmented reality is a hybrid of some real location and supplemental information
or objects which do not exist, but are made to appear in the real scene.
Reality is a representation of an actual location in the real world.
It may be a polygonal model, but it shows the remote location as it currently
appears, or as it appeared at the time of recording.
- Level of Interactivity: The degree to which the user is able to
interact with the remote environment. The three levels that we have
arrived at are full, viewing, and passive interactivity. With passive
viewing the user is only able to see and hear the remote location from
the point of view which is broadcast to them-- this is like normal television.
With viewing interactivity, the user is able to change their point of view
in the remote location, but has no impact on that location. In other
words, they cannot move remote objects and are not seen-- they are effectively
ghosts. With full interactivity the users can be seen by people at
the remote location and by other telepresent users. They can also
have a physical impact on the remote environment.
- Number of Users: The number of users who are projecting their presence
into the remote location. This can be divided in an arbitrarily fine
manner, but most generally can be looked at as single or multi-user.
Note that this does not necessarily have anything to do with whether the
users can interact with each other or the remote environment-- that is
a function of the level of interactivity.
With the categories just mentioned, the various telepresence
applications can be categorized, as shown in Table 1.
|                   | Full Interactivity: Single User | Full Interactivity: Multi-user | Viewing Interactivity: Single User | Viewing Interactivity: Multi-user | Passive Viewing: Single User | Passive Viewing: Multi-user |
|-------------------|---|---|---|---|---|---|
| Virtual Reality   | Immersive Molecular Simulation | 3-D Network Games, 3-D Chat Rooms | Virtual Museum | Virtual Plays | Video tours of virtual locations | Broadcast VR, i.e. "Toy Story" |
| Augmented Reality | Remote Exploration/Surgery with HUD | Tele-Cubicles | 3-D City Model with supplemental HUD Data | ***3-D Television with Embedded Objects & Information*** | Off-line Distance Learning | Live Distance Learning |
| Reality           | Remote Exploration/Surgery | Robot Sports | 3-D City Model | ***3-D Television, Panoramic Movies*** | Video Tourism | Most Modern Television |

Table 1- The Various Types of Telepresence Applications
As the table shows, there is a wide range of applications. For
this paper we are concerned with "Broadcast Telepresence." The application
types covered by this are highlighted in bold italic in the table.
Broadcast telepresence is multi-user since the remote location is being
made available to many users simultaneously through broadcasting.
The remote locations are either reality or augmented reality. Finally,
and this is the important difference between normal television and broadcast
telepresence, the users all have viewing interactivity, so they are able
to affect how they see the remote location.
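The three-axis categorization just described can be sketched in code. The following is a hypothetical Python sketch; the enum, class, and function names are our own illustrative choices, not part of any existing system:

```python
from dataclasses import dataclass
from enum import Enum

class Location(Enum):
    VIRTUAL = "virtual"
    AUGMENTED = "augmented"
    REAL = "real"

class Interactivity(Enum):
    PASSIVE = "passive"
    VIEWING = "viewing"
    FULL = "full"

@dataclass
class Application:
    name: str
    location: Location
    interactivity: Interactivity
    multi_user: bool

def is_broadcast_telepresence(app: Application) -> bool:
    """Broadcast telepresence: multi-user, viewing interactivity,
    and a real or augmented (not purely virtual) remote location."""
    return (app.multi_user
            and app.interactivity is Interactivity.VIEWING
            and app.location in (Location.REAL, Location.AUGMENTED))

print(is_broadcast_telepresence(
    Application("3-D Television", Location.REAL, Interactivity.VIEWING, True)))        # True
print(is_broadcast_telepresence(
    Application("Most Modern Television", Location.REAL, Interactivity.PASSIVE, True)))  # False
```

The predicate simply encodes the three table cells highlighted above: the conjunction of the multi-user and viewing-interactivity axes with a real or augmented location.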
The Broadcast Telepresence Content Model
Since viewing interactivity is the main difference between broadcast telepresence
and the standard television broadcasts that are available today, the goal
is to define a content model which provides this interactivity without
violating several constraints:
- The viewing interactivity needs to be provided within the context of a
program which evolves in time. In other words, broadcast telepresence
should not be the transmission of a large environment that can be toured
in an arbitrary fashion. The main reason for this is to utilize the
broadcast channel at all times-- downloading a large "set" which can be
toured uses a lot of bandwidth at startup, and little or none the rest
of the time.
- Broadcast telepresence content should offer a super-set of the functionality
of current broadcast (primarily television) content. In other words,
offering viewing interactivity should not require any activity of the user,
but rather allow them additional freedom should they so desire.
- The definition of the class of broadcast content needs to be flexible enough
to cover all the programming types which may be used. If possible,
everything from current television programs to immersive programs with
full flexibility should be covered.
Galyean's virtual reality navigation technique based on the "river analogy"
[3] (mentioned in the Previous Work Section) provides
a good place to begin defining the characteristics of a broadcast telepresence
content model. His method of navigation was used specifically for
a Virtual Reality presentation at a museum where a user was given a fixed
amount of time to travel through a set. To allow the user flexibility,
while ensuring that they completed their tour in a deterministic amount
of time, the user was connected by a spring to an anchor point that moved
through the set in a fixed amount of time. The tension and length
of the spring controlled the amount of leeway that the user was given during
any point in the presentation. With this interaction method, the
content of a broadcast telepresence program would be the immediate region
to which the user had access at that given time. If the user chose
not to interact, they would be dragged along with the anchor point through
a fixed tour.
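The spring-and-anchor interaction just described can be sketched as a one-dimensional simulation. This is purely illustrative; the spring constant, time step, and rates are arbitrary assumptions, not values from Galyean's system:

```python
def step(viewer, anchor, k=0.5, dt=0.1):
    """One update of a 1-D viewer position tied to the anchor by a spring:
    the restoring force is proportional to the viewer's displacement."""
    return viewer + k * (anchor - viewer) * dt

viewer, anchor = 0.0, 0.0
for _ in range(100):
    anchor += 0.1              # the anchor advances downstream at a fixed rate
    viewer = step(viewer, anchor)
# The viewer lags behind the anchor but is continuously dragged along,
# so the tour still completes in a deterministic amount of time.
```

Raising the spring constant `k` tightens the tour (less leeway); lowering it lets the viewer drift further from the prescribed path before being pulled back.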
Galyean's navigation technique was designed for tours of a fixed virtual
set. The range of broadcast telepresence programming is broader than
this, so we propose here a slightly more robust and flexible model for
broadcast telepresence content. In addition, we discuss here the
implications of the navigation system on the information that needs to
be transmitted, since typical broadcast telepresence applications will
not have the luxury of having the entire set available at any given time.
The major change in our content model from the system which Galyean
discusses is the use of "story-lines." Since broadcast programming
takes place across multiple different sets, there is a need to provide
for discontinuities in the flow of the program. Instead of having
a path through a set, a "story-line" is a series of paths through different
sets which combine to convey the story which the program is telling.
Further, there may be more than one way of telling a story: the same series
of sets could be viewed by traveling along slightly different paths through
each set. In other words a given program might have more than one
"story-line." Different directors could give the program slightly
different flavors by defining separate story-lines.
The use of multiple story-lines requires our next change from Galyean's
system, since multiple story-lines require multiple "anchors." Unfortunately,
if a user is attached to only one anchor point, there is no smooth way
of transferring to a different story-line. Instead of a spring, we
propose that each anchor is actually a gravitational source. Once the user
moves a certain distance away from one anchor, they will be drawn into
the gravitational field of another anchor. By adjusting the strength
of the source, the amount of freedom the user has to roam can be controlled.
With no gravity complete freedom is possible; with infinite gravity the
user is forced to the anchors. In addition, we propose that there
be anchors for both the viewer's position and the point at which they are
looking. In this way important sights are highlighted, as well as
the point from which they are viewed. Figure 1
below shows a basic scenario with two story-lines:
Figure 1- A Broadcast Telepresence Program
In the figure there are two story-lines. The main story-line,
A, is shown looking at the dark star from the dark circle. The other
is looking from the light circle toward the light star. As time progresses,
the viewpoints and positions advance along their trajectories.
Figure 1 also shows two regions: a movement
box, and a modeled region. As mentioned earlier, the broadcast nature
of the medium prevents an entire model or set from being transmitted all
at once. Instead, the information transmitted at any given moment
is the information needed to re-create the parts of the set becoming viewable
at that time. The modeled region is the area of the set which
is available at any specific point in time; the viewer is able to look
at anything in this region. The movement box (which may actually
be some other shape) is the region in which the viewer is allowed to freely
change their position. This is needed since the gravitational model,
unlike the spring model, does not place a hard limit on the region in which
the viewer can move. As time advances both the modeled region and
the movement box potentially change position.
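The gravitational model with a movement box can be sketched as a toy two-dimensional simulation. The following Python sketch is illustrative only: the inverse-square pull is capped near an anchor to keep the simulation stable, and all constants are assumptions rather than proposed values:

```python
import math

def gravity_pull(viewer, anchors, strength=1.0):
    """Sum the pulls from each 2-D anchor (x, y) on the viewer.
    Inverse-square attraction, capped so the force stays bounded."""
    fx = fy = 0.0
    for ax, ay in anchors:
        dx, dy = ax - viewer[0], ay - viewer[1]
        d2 = dx * dx + dy * dy
        if d2 < 1e-9:
            continue                       # already at the anchor
        d = math.sqrt(d2)
        f = strength / max(d2, 1.0)        # cap avoids instability near anchors
        fx += f * dx / d
        fy += f * dy / d
    return fx, fy

def clamp_to_box(p, box):
    """Keep the viewer inside the movement box ((x0, y0), (x1, y1))."""
    (x0, y0), (x1, y1) = box
    return (min(max(p[0], x0), x1), min(max(p[1], y0), y1))

def step(viewer, anchors, box, strength=1.0, dt=0.1):
    fx, fy = gravity_pull(viewer, anchors, strength)
    return clamp_to_box((viewer[0] + fx * dt, viewer[1] + fy * dt), box)

# Two story-line anchors; the viewer starts nearer the second one.
anchors = [(0.0, 0.0), (5.0, 0.0)]
box = ((-1.0, -1.0), (6.0, 1.0))
viewer = (4.0, 0.5)
for _ in range(200):
    viewer = step(viewer, anchors, box)
# The viewer is drawn into the gravitational field of the nearer
# anchor and settles close to it, while never leaving the movement box.
```

Raising `strength` approximates the infinite-gravity case where the viewer is pinned to a story-line; setting it near zero leaves the viewer free to roam anywhere within the movement box.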
The broadcast telepresence content model we have proposed satisfies
the three constraints with which we began. By having story-lines
and a changing modeled region the model accounts for the time evolving
nature of the broadcast medium. By allowing viewers to just sit back
and watch as they are drawn along a certain story-line, while still allowing
them to change their position and view if desired, the model provides benefits
over traditional broadcasting without removing functionality. Finally,
the model is flexible enough to account for a wide range of content.
For example:
- Traditional television is simply a single story-line with gravity so high
that the viewer can only view from the perspective broadcast to them.
- Shows which broadcast multiple camera views have high gravity with multiple
story-lines.
- Panoramic shows, which allow the user to change their view direction but
not their position, are single story-line with high gravity on position
and low gravity on viewpoint.
- Shows which allow full freedom have low gravity for both position and viewpoint.
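The four cases above amount to nothing more than different gravity settings. A minimal sketch, in which the numeric values and preset names are purely illustrative (infinite gravity standing for "pinned to the story-line"):

```python
INF = float("inf")

# (story-lines, position gravity, view gravity): INF pins the viewer,
# small values leave them nearly free to roam.
PRESETS = {
    "traditional television": (1, INF, INF),
    "multi-camera show":      (3, INF, INF),   # one story-line per camera
    "panoramic show":         (1, INF, 0.1),
    "full-freedom show":      (1, 0.1, 0.1),
}

def may_move(preset):
    """Viewer can change position only if position gravity is finite."""
    return preset[1] != INF

def may_look_around(preset):
    """Viewer can change view direction only if view gravity is finite."""
    return preset[2] != INF

print(may_look_around(PRESETS["panoramic show"]))   # True
print(may_move(PRESETS["panoramic show"]))          # False
```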
The model we provide gives a framework for the type of telepresence content
which could be broadcast. With this defined it is possible to think
about how one would go about broadcasting such content, and what sorts
of difficulties might be encountered in so doing.
The Broadcast Pipeline
For a given content type, there is a set of stages that must be followed
to capture, transmit and reconstruct the program for the viewer.
This section lists these stages and briefly describes what needs to be
done for each stage, along with the complications which may arise.
The stages are:
- Capture: At this stage the real scene is captured and digitized
for transmission to the viewer. For broadcast telepresence this involves
acquisition of more data than for current television programs. Depth
information may be captured using multiple cameras (see Kanade [4]),
or active range image devices. Wide angle images might also be captured
for image based rendering. Sound also needs to be captured for rendering
on the viewer's equipment.
- Compression: Before transmission to viewers, the captured information
must be compressed. If image maps, panoramas or multiple camera views
are all that is being captured, simple video and image compression techniques
could be used. If a more complex rendering method is being used,
polygonal models may be extracted and transmitted. Scene segmentation
methods could also be used to transmit only those objects in the scene that
are changing dynamically.
- Transmission: This stage should use existing infrastructure.
The main possibility is to use an MPEG transport stream over a 19.2 Megabit
per second HDTV broadcast channel. It may be possible to transmit
background or set information rapidly during transmission of commercials
or other low bandwidth sections of the programming.
- Decompression: The inverse of the compression stage. For content
that allows a lot of viewing interactivity, the decompression may actually
involve scene compositing and rendering from the position and viewpoint
that the viewer has chosen.
- Viewing and Interaction: The final stage is the actual interface
to the user. Even though complete 3-d environments are being transmitted,
the user's display could be as simple as a standard television. In
this case, the viewer would only see 2-d projections from their selected
viewpoint and position. Stereo displays using LCD shutter glasses
and VR headsets are other possible display types. The control of
the display could be accomplished using a remote control with a space orb
or similar device which allows several degrees of freedom. Different
story-lines could be selected using a story-line changer on the remote
(similar to a channel changer), or by displaying the different story-line
views and selecting from them with on screen menus.
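The five stages above can be sketched as a chain of placeholder functions. This is a hypothetical Python skeleton; real implementations of each stage would involve capture hardware, codecs, broadcast equipment, and renderers:

```python
def capture(scene):
    """Acquire images, depth, and sound from the real scene."""
    return {"frames": scene, "depth": [0.0] * len(scene), "audio": b""}

def compress(raw):
    """Stand-in for video/model compression; here it just tags the payload."""
    return ("compressed", raw)

def transmit(packet):
    """Stand-in for the broadcast channel (e.g. an MPEG transport stream)."""
    return packet  # in reality: modulation, broadcast, reception

def decompress(packet):
    """Inverse of the compression stage."""
    tag, raw = packet
    assert tag == "compressed"
    return raw

def view(raw, viewpoint):
    """Render a 2-D projection from the viewer's chosen viewpoint."""
    return f"rendering {len(raw['frames'])} frames from {viewpoint}"

def pipeline(scene, viewpoint):
    """Run the full capture-to-display chain for one viewer."""
    return view(decompress(transmit(compress(capture(scene)))), viewpoint)

print(pipeline(["frame0", "frame1"], viewpoint=(0, 0, 1)))
# rendering 2 frames from (0, 0, 1)
```

Note that only the viewing stage takes the viewer's chosen viewpoint: the earlier stages are shared by all viewers, which is what makes the medium a broadcast.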
Scenario Considerations
This section contains a description of some of the critical characteristics
of content for broadcast telepresence programs. It is assumed here
that the content for the programs will allow the user flexibility in either
viewpoint, view position, or both. For each characteristic, some
program types are listed, and the specific problems related to the characteristic
are discussed.
Any given program can be thought of as consisting of two elements: the
background scenery and the foreground action. The characteristics
given here relate to these two elements.
Background Characteristics
The background scenery, or set, of a program is important to characterize
since it may be possible to segment out the background to allow compression
and/or better quality for the main focus of a scene.
One important characteristic of the set is whether it is a fixed set,
or one that changes over time. An example of a fixed set would be
the background scenery in a situation comedy. In this case, the set
is always the same and covers a fixed volume. On account of this,
it might be possible to store a model of the set on the viewer's equipment
before a program starts, and only transmit the actions of the characters
for compositing with the set model. In an educational show which
follows the course of divers through an underwater shipwreck, however,
the background set is continuously changing as the divers continuously
change locations. In this case, the set is probably too large, and
probably not well enough known to model and transmit ahead of time to the
viewer's equipment. This scenario would then require a transmission
of the appropriate regions of the set as the show progresses.
A second characteristic of the set is its degree of complexity.
A simple set, such as that used in a political talk show can be very easily
modeled. As with a fixed set, this makes it easy to transmit a model
to the viewer's equipment so that the background does not need to be transmitted.
A more complex set, such as a forest, would be more difficult to model,
and could also cause problems since the principals in the scene may move
in and out of occlusion with parts of the set. In the case of a complex
set, the user's viewing interactivity may have to be limited, and it may
not be possible to transmit a background model to the user's equipment
ahead of time.
Foreground Characteristics
The foreground of a program is the most important, since that is where
the viewer's focus lies most of the time. The type of action that
is occurring can also have a big effect on the computational effort needed
to segment and compress a scene.
The first characteristic of the action is whether it is being captured
and transmitted live, or if it is being transmitted to users at some later
time. Delayed transmission, as is the case for most dramas, action
shows, and situational comedies, allows large amounts of computational
effort to be applied to compressing and segmenting the action. This
would allow broadcasters to give the viewer much more flexibility in navigating
around the scenes being transmitted. Live action, as in a news bulletin,
or a sports game, requires that the entire capture and compression process
take place in real time. Even allowing for the asymmetry of the compression/decompression
process (e.g. the broadcaster can afford more expensive equipment than
the end user), it will not be possible to perform as complex a capture
and compression process as for delayed transmission programs. This
means that users viewing live action will have more limited viewing interactivity.
As with the background, an important characteristic of the foreground
action is its complexity. Simple foregrounds are much easier to describe.
In the case of a political talk show, for example, you may only have three
or four people sitting at a desk. All of the principals in the scene
are convex objects and none of them interact with or occlude any of the
others in the scene. This makes the job of segmenting out each of
the foreground characters computationally much easier since many assumptions
may be made. A football game, on the other hand, has very complex
foreground action. There are many people interacting with each other,
and all of them are both occluding and are being occluded by others in
the scene. This makes it very difficult to obtain depth information
and segment the scene, and hence more difficult to provide a wide range
of views to the user.
A Simple Example
Based on the above characteristics, the simplest type of content would
be one with a fixed, simple set, and simple foreground action, that is
not being displayed live. One example of this might be a situational
comedy which may have only one or two rooms that are being used for the
set. Since the sets are fixed and relatively simple, they can be
transmitted ahead of time as a model. Figure 2 shows the content
model of Section 4 applied to this case:
Figure 2- Content Model Applied to a
Situational Comedy
As the figure shows, the set can be quite simple. The modeled
region for the program therefore remains stationary, so all of the background
can be kept and simply composited with the foreground action. The
capture of the foreground is also made simpler since the background is
already known. The background information can simply be subtracted
away to determine what information is in the foreground. Using this
technique and range capture from several different perspectives, it would
be fairly straightforward to allow the viewer a small range of motion and
a fairly wide viewing region. Since the show is not shown live, the
computational power needed is not much of an issue.
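The background-subtraction idea just described can be sketched in a few lines. This is a toy grayscale example, with rows of integers standing in for real images; the threshold value is an arbitrary assumption:

```python
def foreground_mask(frame, background, threshold=10):
    """Flag as foreground any pixel that differs from the stored
    background model by more than the threshold."""
    return [[abs(p - b) > threshold for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

background = [[50, 50, 50],
              [50, 50, 50]]
frame      = [[50, 200, 52],     # an actor's pixels (200, 199) over the set
              [48, 199, 50]]

mask = foreground_mask(frame, background)
print(mask)  # only the center column is flagged as foreground
```

Since the set model is known in advance, this subtraction is all that is needed to isolate the foreground action for separate compression and compositing; real footage would of course also require noise handling and shadow suppression.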
Conclusion
In this paper we have tried to present a definition of broadcast telepresence
and its characteristics. We began by giving a method for categorizing
telepresence applications, and showed that the key difference between normal
broadcast television and broadcast telepresence is the addition of viewing
interactivity. To this end, we then described a content model for
broadcast telepresence explaining how viewing interactivity of varying
amounts could be provided in programs. After this, the broadcast
pipeline required for capture, transmission and display of a broadcast
telepresence show was described, and finally various application characteristics
and the problems that they create were discussed. With the description
of the domain provided here, the key technical problems of broadcast telepresence,
namely capture, compression, and reconstruction, become easier to understand
for a given class of content.
References
1. Bell, G., Gemmell, J., "Non-collaborative Telepresentations Come of Age,"
Communications of the ACM, April 1997, Vol. 40, No. 4, pp. 79-89,
http://www.research.microsoft.com/research/barc/Telepresence/telepresentations/telepresentations.html
2. Fuchs, H., Bishop, G., Arthur, K., McMillan, L., Bajcsy, R., Wook Lee, S.,
Farid, H., and Kanade, T., "Visual Space Teleconferencing Using a Sea
of Cameras," Proceedings of the First International Symposium on Medical
Robotics and Computer Assisted Surgery, Vol. 2, Pittsburgh, PA, September
22-24, 1994, http://www-bcs.mit.edu/~farid/mrcas94.ps.gz
3. Galyean, T.A., "Guided Navigation of Virtual Environments,"
1995 Symposium on Interactive 3D Graphics, Monterey, CA, USA
4. Kanade, T., Yoshida, A., et al., "A Stereo Machine for Video-rate Dense
Depth Mapping and Its New Applications," Proceedings of the 15th Computer
Vision and Pattern Recognition Conference (CVPR), June 18-20, 1996, San
Francisco, http://www.cs.cmu.edu/afs/cs/project/stereo-machine/www/cvpr96.ps
5. Levoy, M., Hanrahan, P., "Light Field Rendering," Stanford University,
http://www-graphics.stanford.edu/papers/light/
6. MIT Media Lab Television of Tomorrow Project, http://tvot.www.media.mit.edu/projects/tvot/
7. Tele-Immersion Project Home Page, http://io.advanced.org/tele-immersion/