With increasing bandwidth, memory, and processor
speeds, it is becoming easier to feel present at remote locations in a
more meaningful sense. This ability to "feel" like you are someplace
other than your physical location is known as "telepresence." There
is a wide range of ways to be telepresent, and many different places at which
one might want to be present. For example, two friends could share
a conversation in a 3-d model of a cafe overlooking the rings of Saturn,
or two executives could hold a video conference, each being somehow present
in the other's location. The goal of this page is to investigate what
research is currently being done on telepresence, and to give some firmer
definition to what research is of interest to us. Specifically, the
main interest is to look into and determine the feasibility of using telepresence
to enhance digital television broadcasts. The emphasis is on the
broadcasting of real and augmented-reality scenes, as opposed to completely
virtual ones. In addition, this page also
contains some information on related topics, such as MPEG.
What is Telepresence? - A look at the different
things that telepresence could be, and an attempt to classify different
types of telepresence.
Problem Specification - Telepresence covers a lot
of different things. This section details more specifically what
problem we are investigating.
Panoramic Presentations - An idea for what might
be broadcast over a "3-D television channel." Also, some general
thoughts as to how to go about capturing, transmitting, and receiving such presentations.
A Scenario for the Near Future - A scenario for
the capture, transmission and display of some simple panoramic presentations
in the near future.
MPEG Information - Some notes on the different MPEG
specifications, and what they contain.
General Notes - Thoughts and ideas about telepresence
that don't fit in well in other sections.
Telepresence References - References to other
information on Telepresence. A brief summary of what is talked about
is also given for many of the references.
Related Topics References - References to related
material. Again, brief summaries are given for some of the references.
This section contains general observations made about Telepresence and
its related fields during the collection of the material in this note.
What is Telepresence?
Table 1 below, together with the axes that define it, gives rise to several observations; these are listed after the table.
There seem to be several axes which define the space of telepresence.
Any given research project can be (at least partially) classified by specifying
where it stands on each of these axes:
Table 1 gives a rough outline of the applications which fall into
the various combinations of the parameters defined below.
The combinations in which this project is most interested are
multi-user viewing of real and augmented-reality locations.
Telepresent Location - the location into which the
users "telepresent" themselves. This varies from a completely virtual
environment, to mixed virtual/real, to a completely real-world location.
Standard locations would be:
Virtual Reality - a completely imaginary location
Augmented Reality- a combination of a real location
(possibly recorded) and supplemental objects or information
Reality - a real location
Degree of Interactivity - the extent to which the
user, or users, can interact with the remote environment. Three classifications are used:
Full Interactivity - the users can interact with
other objects and users in the environment, as well as move around and
change their viewpoint.
Viewing Interactivity - the users can move around
and change their viewpoint, but have no effect on their environment.
Effectively, every user is a ghost.
Passive Viewing - the viewer is only able to observe
the remote location from the position and point of view which is transmitted
to them. This is how standard TV works.
Single vs. Multi-User - the number of users who are
simultaneously present at the remote location.
Single User - only a single person is "telepresent"
to the remote location at one time.
Multi User - more than one person can experience the
remote location at one time.
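As a concrete illustration, the three axes above can be captured in a small data model. The sketch below is a hypothetical Python rendering; the type and field names are ours, not from any telepresence standard:

```python
from dataclasses import dataclass
from enum import Enum

class Location(Enum):
    VIRTUAL_REALITY = "virtual reality"
    AUGMENTED_REALITY = "augmented reality"
    REALITY = "reality"

class Interactivity(Enum):
    FULL = "full interactivity"
    VIEWING = "viewing interactivity"
    PASSIVE = "passive viewing"

@dataclass(frozen=True)
class TelepresenceClass:
    """One cell of Table 1: a point in the space of telepresence."""
    location: Location
    interactivity: Interactivity
    multi_user: bool

# The focus described in the Problem Specification: multi-user
# viewing interactivity in real or augmented-real locations.
focus = [
    TelepresenceClass(Location.REALITY, Interactivity.VIEWING, True),
    TelepresenceClass(Location.AUGMENTED_REALITY, Interactivity.VIEWING, True),
]
```

Any given research project can then be located in the space by picking one value per axis.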
                        Virtual Reality         Augmented Reality            Reality
Full Interactivity
  Single User           3-D Games               Remote Exploration/Surgery
                                                with Embedded Objects/
                                                Information
  Multi User            3-D Network Games,
                        3-D Chat Rooms
Viewing Interactivity
  Single User           Virtual Museum          3-D City Model with          3-D City Model
                                                supplemental information
  Multi User
Passive Viewing
  Single User           Video tours of          Off-line Distance            Most Modern
                        virtual locations       Learning                     Television
  Multi User            Broadcast VR,           Live Distance Learning
                        ie. "Toy Story"

Table 1: Different Types of Telepresence
In the "Passive Viewing" category, there is little
difference between multi-user and single user, other than the potential
for the multi-user case to be a live broadcast experienced by many at the same time.
With "Viewing Interactivity" only, the multi-user
scenario is different in that it probably restricts all users to a common
area, or sequence of areas which change over time (a story-line).
This allows for a shared experience even though users cannot see or interact
with one another. The "Single User" scenario can be less restrictive,
and need not have such a story-line.
For "Full Interactivity," the key difference between
single and multi-user scenarios is whether other tele-users are present
at the remote location to interact with. Clearly, there
is some spectrum, as remote surgery could potentially be done by multiple remote participants.
Although Telepresence is the general area being addressed, the actual focus
for us is narrower. Specifically, we are interested in what Table
1 classifies as "multi-user reality or augmented reality telepresence
with viewing interactivity." In simpler terms, the interest is in
broadcast telepresence of real locations, possibly with supplemental information.
Details of this focus, and related ideas are maintained in this section.
The most important question, and the one that drives the answers to many
of the other points, is what benefit this "telepresence" or 3-D TV offers
over traditional programming. Below are lists of the types of programming
that would probably benefit, and of those that probably wouldn't.
The main target is broadcast telepresence-- that is to say some sort of
remote location which is captured and broadcast to "viewers" who then feel
as if they were present at the broadcast location.
By broadcast, we are speaking of a digital broadcast system, probably an
HDTV channel at a rate of 19.2 Mb/s, or perhaps a regular digital channel
at 3-6 Mb/s.
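To get a feel for these rates, a little arithmetic converts channel bandwidth into a per-frame byte budget; the 30 frames/s figure below is an assumption, not something specified above:

```python
def frame_budget_bytes(channel_mbps: float, frames_per_second: int = 30) -> float:
    """Bytes available per frame on a channel of the given rate (Mb/s)."""
    bits_per_frame = channel_mbps * 1_000_000 / frames_per_second
    return bits_per_frame / 8

# An HDTV channel at 19.2 Mb/s leaves roughly 80,000 bytes per frame
# at 30 fps; a regular digital channel at 3 Mb/s leaves about 12,500.
hdtv_budget = frame_budget_bytes(19.2)
regular_budget = frame_budget_bytes(3.0)
```

Whatever scene representation is chosen has to fit its dynamic updates into budgets of this order.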
An important aspect of broadcasting is that the information
sent needs to describe a time-varying set of information. In other
words, an entire show, or environment, is not downloaded for future perusal.
This poses some unique challenges if some sort of 3-d environment is being transmitted.
The target is the near future; in other words, whatever scheme is
devised must be implementable and deployable within the next 5 years.
Something that can only be done on high-end equipment today would probably
be affordable to deploy within that time frame.
How the user is "telepresent" can vary from a text transcription of an
occurring event, to a video camera, to a complete 3-d model of the remote
scene being broadcast. Our interest is in broadcasting some sort
of 3-d information of the remote location.
A set of problems need to be addressed:
What is going to be sent to the users?
How do we capture and compress the information for
transmission over a 2 MB/s channel?
How, and on what reconstruction (display) system,
do we reconstruct the transmission?
The previous problems are asymmetric -- more time/money/effort can be spent
on capture and compression than on reconstruction.
What is going to be sent to the users should be reasonably
independent of the user's display technology.
What sort of display is reasonable for the "viewer" side of things?
Some ideas follow:
Regular Television, with manipulable 3-d view of remote scene.
Standard display with LCD shutter glasses.
VR Goggles. These would allow a user to visually feel as if they were
at a location. The disadvantage is that they require a more significant
"viewer"-side investment, though they provide more of an advantage over traditional television.
Virtual Workbench - good for a top-down viewpoint,
but of limited general use.
CAVE. The viewer/user "watches" inside a room
which projects the remote location onto the walls.
Programming that would benefit (and how):
Nature shows-- the ability to look at the environment around you could
add a lot to a show. Imagine being out on the savanna in Africa:
instead of just having to focus on a specific area, the entire environment
could be viewed.
Tourism Shows-- as with Nature shows, the ability to see a broader range
of your environment would be quite beneficial. As the show takes
you on a tour of a town you could concentrate on things that interest you
for longer. You could also choose whether to look at the announcer
or what he is describing.
Sports-- multiple viewpoints might be of interest in sports. Users
could choose which angle of the tennis court they like best, for example.
Having depth information might also make it easier to follow the action,
particularly following the ball.
Programming that probably wouldn't benefit:
Broadcast lectures-- seeing the professor and his slides in 3-d makes little
difference. Broadcasting 3-d "slides" with models that are better
appreciated in stereo (ie. molecules, 3-d flow fields) could be a benefit
(but also would be difficult to show to those in class).
The previous section gave three problems that needed to be solved for broadcast
telepresence: what to broadcast, how to record it, and how to reconstruct
it. This section proposes an idea for what should be broadcast, the problem
that needs to be solved before the other two. Two things need to
be determined to answer the question of what to broadcast:
The solution proposed here is that of a "Panoramic Presentation".
The basic idea is that instead of broadcasting just a flat 2-d view of
the scene being portrayed, a complete 3-d local environment is being transmitted.
This gives the viewer the ability to look around themselves and get a better
appreciation of the location, hopefully allowing them to feel more like
they are actually there. In addition, the viewer would be allowed
to change their position within some fixed boundary, or movement box.
As time goes on in the show, the local area will change, either as the
viewers position changes (as in a tour or a nature show), or as actors
enter and depart (as in a sit-com or play).
What information can be broadcast?
Does it have significant value above and beyond regular broadcast television
(with the sub-point that it should not have any disadvantages compared to
regular television)?
There is a problem with such a concept in that the viewer may get lost
looking in some funny direction and miss something important. To
avoid such a problem, preassigned viewpoints, or items to look at, and
positions from which to look can be defined, including a main viewpoint
and position which are defaults. This way, if the viewer does nothing
at all, they have the same experience as watching a standard show (with
the possible benefit of depth information). This brings up a new
problem, which is how to switch viewpoints and positions. One option
would be to allow users to toggle through a set of fixed viewpoints, which
translates to just broadcasting several 2-d views of the scene. The
next option, similar to that given by Quicktime VR, is to jump from position
to position, but allow viewers to look around at each location. This
still allows the user to get lost looking in some strange direction, though.
A final option is to have each of the viewpoints, and viewing positions
serve as a sort of gravity well, such that as the viewer looks and moves
around they are naturally attracted to the predefined viewpoints and positions.
Figure 1 below shows how all of this might work.
Figure 1a shows a typical fixed set presentation, while Figure 1b shows
a scene with movement. Here are the key features common to each:
(a) Panoramic Presentation of a Fixed Set Show (ie. Sitcom)
(b) Panoramic Presentation of a Show with Changing Environment
(ie. Tour or nature show)
Figure 1- Examples of Scene Gravity for Different Shows
In both of the types of presentation, viewpoints and positions are defined.
The viewer may change position and view at will as long as they stay within
the movement box. As they move, their point of view will be drawn
towards the predefined viewpoints, accelerating as they come nearer to
them. This mimics the way our eyes are naturally drawn to certain
items. Similarly, certain positions will be attractive. This
allows the viewer to roam around some if they wish, but makes it easy to
be drawn back to the most interesting (or guaranteed interesting) viewpoints.
Gravity of course could be adjusted to be stronger or weaker. As
gravity becomes very high, the scene becomes one of multiple broadcast
views, and as gravity becomes very weak complete freedom of motion is granted.
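As an illustration, the gravity-well behavior could be implemented as a per-frame nudge of the gaze direction toward the predefined viewpoints. The following Python sketch is hypothetical; the force law and constants are invented for illustration:

```python
import math

def gravity_step(gaze, viewpoints, strength=0.5, dt=0.033):
    """Nudge a 2-D gaze direction (unit vector) toward attractor directions.

    Each predefined viewpoint direction pulls the gaze with a force that
    grows as the angular distance to it shrinks, mimicking a gravity well.
    Setting strength=0 grants complete freedom; a very large strength
    effectively snaps the gaze to the broadcast viewpoints.
    """
    gx, gy = gaze
    fx = fy = 0.0
    for vx, vy in viewpoints:
        # Angular separation between current gaze and this viewpoint.
        dot = max(-1.0, min(1.0, gx * vx + gy * vy))
        angle = math.acos(dot)
        if angle < 1e-6:
            continue  # already looking straight at it
        # Pull grows as the gaze approaches the attractor.
        pull = strength / (angle * angle + 0.1)
        fx += (vx - gx) * pull
        fy += (vy - gy) * pull
    # Integrate one time step and renormalize to keep a unit vector.
    gx, gy = gx + fx * dt, gy + fy * dt
    norm = math.hypot(gx, gy)
    return (gx / norm, gy / norm)

# Starting slightly off a single attractor at (1, 0), repeated steps
# draw the gaze onto it, accelerating as it gets closer.
gaze = (math.cos(0.5), math.sin(0.5))
for _ in range(200):
    gaze = gravity_step(gaze, [(1.0, 0.0)])
```

The strength parameter plays the role of the adjustable "gravity" knob described above.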
The dashed region is called the "modeled region." This is the area
that the viewer's machine has available to it at any given time.
The solid black enclosed region is the "movement box." The movement
box defines the range of positions over which the viewer can move.
In the case of Figure 1b, the movement box region changes over time.
The stars represent the viewpoints, and the circles represent positions.
The "Temporal Path" in Figure 1b represents the path over which the viewpoints,
positions, movement box and modeled region travel as time passes.
Of the two types of panoramic presentation shown in Figure 1, the static
scene in Figure 1a is simpler. In that case there is a fixed set
which it would be possible to cache on the viewer's machine. The viewpoints,
positions and movement box could change over time, but are constrained
by the fixed nature of the scene. The situation in Figure 1b is more
complicated in that everything changes as time goes on. Imagine a
nature show that swims through the Great Barrier Reef. The movement
region in this case is a volume which moves as time progresses and the
viewer is guided along the reef. The positions all move as time goes
by, and the viewpoints may change to various fish and coral. The
viewer may decide to continue looking at a fish after it is described,
and then allow themselves to be swept forward as they reach the back of
the viewing region, and have their viewpoint drawn to some new object.
Of course shows need not be entirely fixed, or entirely dynamic.
The types can be intermixed or interleaved at will, and the examples are
given as the two extremes.
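The movement box and temporal path of Figure 1b can be sketched as a time-varying clamp on the viewer's position. The sketch below is hypothetical; the box size and path are invented for illustration:

```python
def clamp(value, low, high):
    return max(low, min(high, value))

def constrain_viewer(pos, t, path, half_extent=(2.0, 2.0, 2.0)):
    """Keep the viewer inside a movement box centered on a temporal path.

    `path(t)` gives the box center at time t (e.g. a point drifting along
    the reef in the nature-show example); the box is an axis-aligned
    region of +/- half_extent around that center.
    """
    cx, cy, cz = path(t)
    hx, hy, hz = half_extent
    x, y, z = pos
    return (clamp(x, cx - hx, cx + hx),
            clamp(y, cy - hy, cy + hy),
            clamp(z, cz - hz, cz + hz))

# A viewer lingering at the origin is swept forward once the box,
# moving at 1 unit/s along x, leaves them behind.
path = lambda t: (1.0 * t, 0.0, 0.0)
print(constrain_viewer((0.0, 0.0, 0.0), t=10.0, path=path))  # (8.0, 0.0, 0.0)
```

A fixed-set show (Figure 1a) is simply the special case where `path(t)` is constant.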
As shown in Figure 1, both primary and secondary
positions and viewpoints could be sent to the viewer at any given time.
These will of course change over time as a show progresses. The main
viewpoint and position over time define the standard method of watching
the show as it would be seen on conventional television. This idea
could be extended by having several different sets of time varying positions
and viewpoints which extend throughout the show. Instead of constantly
choosing viewpoints and positions as the show progresses, the viewer could
lock in one particular "story-line" at the beginning of the show, and follow
the action from that sequence of positions and viewpoints. This would
allow a viewer who likes one character better than most to choose a "story-line"
that emphasizes that particular character. It could also be set up
that there are several different "directors" for a given show, each of
whom comes up with their own "story-line." The viewer could then
choose which "director's" story-line choice to follow through the show.
Eventually, as homes become better networked, viewers could exchange "story-lines",
or choose another viewer to be the active one who controls their viewing
perspective through the show.
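One simple way to represent such a "story-line" is as a sorted list of timed camera keyframes that the receiver interpolates between. The sketch below is hypothetical; the director names and keyframe values are invented:

```python
import bisect

def sample_storyline(keyframes, t):
    """Piecewise-linear sample of a story-line at time t.

    `keyframes` is a time-sorted list of (time, value) pairs, where value
    is any scalar camera parameter (one coordinate of a position or
    viewpoint); times outside the range clamp to the end keyframes.
    """
    times = [k[0] for k in keyframes]
    i = bisect.bisect_right(times, t)
    if i == 0:
        return keyframes[0][1]
    if i == len(keyframes):
        return keyframes[-1][1]
    (t0, v0), (t1, v1) = keyframes[i - 1], keyframes[i]
    w = (t - t0) / (t1 - t0)
    return v0 + w * (v1 - v0)

# Two hypothetical "directors" covering the same show; the viewer locks
# in one at the start and follows that camera path throughout.
storylines = {
    "director_a": [(0.0, 0.0), (10.0, 5.0)],
    "director_b": [(0.0, 0.0), (10.0, -5.0)],
}
chosen = storylines["director_a"]
print(sample_storyline(chosen, 4.0))  # 2.0
```

Exchanging story-lines between viewers then amounts to exchanging these keyframe lists.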
Two questions were posed at the beginning of this section: the first
was what could be broadcast, and that has been discussed throughout the
section. The second was what the benefit is, and whether it is actually
worse than regular TV. It is fairly easy to see that there is no
loss over regular viewing in this scheme. If you just sit back and
watch, your viewpoint and position will be changed automatically for you.
It is also difficult to accidentally shift your view, since gravity will
tend to bring you toward a desirable view. It seems pretty clear
that there are also advantages over regular TV since each viewer has the
choice of looking at what is most interesting to them, and can choose to
linger on certain items, or dash ahead to see what is coming next.
Before ending this section, a brief mention of how this might actually
be displayed to the user is appropriate. The panoramic presentation
method presented here is scalable, and could be displayed at least in the
following ways:
Regular Television - with a decoder, a fixed point of view could be displayed.
In addition, the viewer could be given a remote which allowed them to move.
Stereo Vision TV - as above, but depth information would be more apparent
with LCD shutter glasses.
VR Goggles - the viewer could look around using their head, and move with
a controller of some sort.
Virtual Workbench - the user changes view by moving around
a table, and position with a joystick or data glove.
CAVE - by walking around in the room and looking at different areas, the
projected view of the scene would change.
A Scenario for the Near Future
The previous section outlined the type of scene information that could
be transmitted in a broadcast telepresence situation -- a panoramic presentation.
This section presents some ideas for a prototype of a panoramic presentation
transmission and reception system that could be constructed within the
next year or so. It goes into some specifics on the entire process,
including capturing/recording the scene, compression and data segmentation
for transmission, transmission, decompression, and viewing and interaction
by the user.
Range Images - RGB-Z images are captured
Range Cameras (not yet real time)
Planar Light Field Distortion (Cyberware)
Laser Time-of-Flight Imaging
Light Field Capturing
Could allow light field compositing if a model is available
Model Assisted- Use existing knowledge of scene to
do a better job capturing the dynamic information.
3-d Sound Capture
Each actor and camera has a microphone pick-up
3-d sound field can be reconstructed
Some sounds should be omnipresent (ie. voice-over)
Multiple sound tracks possible (different languages, different announcers?)
Pre-Capture of Set
Background Light Field
Some other image map?
Set Removal from Scene information, (CGI assisted rendering?)
Construction of 3-d information on non-set objects
Transmitted info = dynamic objects + dynamic lighting effects + dynamic
Mapping 3-d extracted information to human models?
Final compression using standard techniques (any 3-d compression techniques?)
Transmission of Background Scene During Commercials (until 3-D commercials!)
Possible to cache scenes of popular shows (but why? Maybe better
detail the more you watch?)
Straightforward broadcasting of dynamic objects
Use existing transport protocols (ie. MPEG-2 Transport Stream)
Background scene rendered from desired position/viewpoint with dynamic
Dynamic objects warped or rendered from desired position/viewpoint and
Viewing and Interaction
"Story line" channel can be chosen for those that don't want to interact
Simple button press selects the desired "story line" (ie. channel changing
within the show)
Miniature views of all "story-lines" are displayed, and the user can select
one, or more than one, to appear on their screen.
Twist shuttle or space orb allows manipulation of position/viewpoint in 3-d
Gravity force draws users back onto a pre-defined "story line" when they stray
Gravity can be increased/decreased just as volume, brightness and contrast
can be changed; this allows the user to control the amount of interactivity/hand-holding
they receive.
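The idea of pre-loading the background scene during commercials can be sanity-checked with some arithmetic. The sketch below is hypothetical; the model size, per-frame object budget, and 30 fps figure are invented for illustration:

```python
def preload_seconds(background_bytes, dynamic_bytes_per_frame,
                    channel_bps, fps=30):
    """Seconds of airtime needed to pre-load the background scene.

    Assumes dynamic objects consume part of each frame's bit budget,
    and the background model is trickled out in the spare capacity
    (e.g. during commercials, until 3-D commercials arrive).
    """
    frame_budget = channel_bps / fps / 8          # bytes per frame
    spare = frame_budget - dynamic_bytes_per_frame
    if spare <= 0:
        raise ValueError("dynamic objects already exceed the channel")
    frames_needed = background_bytes / spare
    return frames_needed / fps                    # seconds

# On a 19.2 Mb/s channel, with 60 kB of dynamic objects per frame, a
# 50 MB background model needs roughly 83 seconds of spare capacity.
t = preload_seconds(50_000_000, 60_000, 19_200_000)
```

Even rough numbers like these suggest the background of a fixed set could plausibly be delivered within a single commercial break.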
The broadcast channels we are looking at are transmitted using the MPEG-2
Transport Protocol, and would normally carry an MPEG-2 encoded television program.
On account of this, it is helpful to have some idea of what the various MPEG standards cover:
MPEG-1-- The original MPEG encoding standard. It was designed to
encode progressive-scan video at a size and quality similar to that offered
by VHS. Stereo (two-channel) audio was also specified. The
standard only specified the format of the bit-stream, and included no provisions
for transport.
MPEG-2-- MPEG-2 is pretty much a superset of MPEG-1. It adds interlaced
video coding and takes advantage of correlation between fields in the interlaced
video. It also supports multichannel surround sound. In addition, more care
was taken to specify the streams which bundle several audio and video channels
together under a common time sequence. Program streams are specified
for error free environments and transport streams for environments where
noise and interference can cause loss. Transport streams are designed
to operate at the same level as IP/UDP.
MPEG-4-- This standard adds many scene compositing features, and allows
video of a non-rectangular shape to be transmitted. Thus, a static
background, a shaped video of a news anchor, and a small background news
clip could be transmitted and composited together to form a news bulletin.
In addition, a human voice coded stream for the newscaster, a background
MIDI sequence, and the sound for the news clip could be transmitted separately
and mixed at the destination. The transport layer is basically the
same as that specified in MPEG-2.
MPEG-7-- or, "Multimedia Content Description Interface." This standard
is an attempt to define how information about multimedia objects in general,
and those used to composite MPEG-4 scenes specifically, are attached to
the objects themselves. The primary goal of this is to facilitate
searching multimedia information, say for a "boy dressed in blue."
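For reference, the MPEG-2 transport streams mentioned above carry data in fixed 188-byte packets, each starting with the sync byte 0x47 and identifying its elementary stream with a 13-bit PID. A minimal header parser illustrates the layout; the sample packet is synthetic:

```python
def parse_ts_header(packet: bytes):
    """Parse the 4-byte header of an MPEG-2 Transport Stream packet.

    Transport packets are a fixed 188 bytes and start with sync byte
    0x47; the 13-bit PID says which elementary stream (video, audio,
    tables) the payload belongs to.
    """
    if len(packet) != 188 or packet[0] != 0x47:
        raise ValueError("not a valid transport stream packet")
    return {
        "transport_error": bool(packet[1] & 0x80),
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "continuity_counter": packet[3] & 0x0F,
    }

# A minimal synthetic packet: sync byte, PUSI set, PID 0x0100, counter 7.
pkt = bytes([0x47, 0x41, 0x00, 0x17]) + bytes(184)
print(parse_ts_header(pkt)["pid"])  # 256
```

Any panoramic-presentation payload carried over such a channel would be multiplexed into packets of exactly this shape.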
This is stuff that doesn't fit well in other sections.
If real, how is the remote location captured and modeled for the user?
If augmented reality, how are real-local, real-remote (possibly modeled)
and virtual objects made to co-exist?
If interactive, what methods do people use to interact with each other
and environment objects?
If Multi-user, how are user actions synchronized?
Reality check issues: How much network bandwidth can reasonably be
used? How much computational power is available?
- The stated goal of this project is to foster research into Tele-immersion,
which they define as enabling "users in different locations to collaborate in
a shared, simulated environment as if they were in the same physical room."
Their idea is to have users in different geographic locations interacting
in a shared virtual environment. So, telepresence here is projecting
one's presence into a shared virtual environment. *** This page is
only a project overview ***
Tele-Immersion Home Page
- The current thrust of the research is to set up tele-cubicles.
There will be four in the US, at USC, the Northeast, UNC-CH, and UIC.
The cubicles will be linked across two walls and a desk which are all stereo
projection systems. In this way a shared augmented reality system
will be created among the four locations. The main initial application
will be a collaborative CAD system. The description here is at a
very abstract level.
DARPA Video Surveillance
and Monitoring Project - This project is concerned with automatically
identifying salient features in remote video feeds coming from stationary
and autonomous vehicle cameras. The system is supposed to increase
the number of incoming streams that a human operator can monitor by bringing
in a human observer only when an important feature is detected. There
are a variety of groups contributing to this, and it is being attacked from
different angles, such as scene reconstruction and image recognition.
Telepresence - Concerned mainly with wide deployment of current technologies
like video-conferencing and shared whiteboards. Two papers are on-line:
the first, "On-Ramp Prospects," outlines issues with getting adequate
bandwidth to homes.
The second, "Non-Collaborative Telepresentations Come of Age," deals with
digital recording of presentations for transmission/viewing either
simultaneously or at a later time.
It gives bandwidths and storage capacities needed for various encodings.
It also outlines features of current products. It specifically ignores
presentations with audience participation.
Research Group - A project using multiple cameras to extract depth
information and reconstruct a scene, such that a remote user can move through
a reconstruction of the scene in real-time. The main targeted application
is remote medical consultations.
H. Fuchs, G. Bishop, K. Arthur, L. McMillan, R. Bajcsy, S. W. Lee, H. Farid,
and T. Kanade, "Virtual Space Teleconferencing Using a Sea of Cameras,"
Proceedings of the First International Conference on Medical Robotics and
Computer Assisted Surgery, Vol. 2, Pittsburgh, PA, 1994.
- This paper presents an idea for creating a room
with a "sea of cameras" which could be used to allow remote users to navigate
the room with a "virtual camera." The authors then describe an improved
algorithm for wide-baseline stereo that can acquire a depth map using
multiple cameras along a single baseline. Some preliminary work is
presented demonstrating the effectiveness of their algorithm, and they
propose creating a real-time camera to capture RGB-Z images this way (now
complete- see Video-Rate Stereo Machine).
Although the authors propose that using this technique and a "sea of cameras"
it would be possible to allow users complete freedom in a room, they provide
no specifics as to how the multiple depth maps could be combined to create
one 3-d scene, or how to warp/switch between depth map scenes at different locations.
Related Topics Resources
Virtual Worlds Data Server Project
- a project to create data servers which allow multiple people to have real-time
interaction with models that are too large to fit into any machine's memory.
Virtual LA - A project
to create a model of 10,000 square miles of the Los Angeles basin area.
Methods to quickly go from real-world data to a 3-d model are used to construct
the Virtual City, and the VWDS system is used to allow users to interact
with the large data set.