With increasing bandwidth, memory, and processor
speeds, it is becoming easier to feel present at remote locations in a
more meaningful sense. This ability to "feel" like you are someplace
other than your physical location is known as "telepresence." There
is a wide range of ways to be telepresent, and many different places at which
one might want to be present. For example, two friends could share
a conversation in a 3-d model of a cafe overlooking the rings of Saturn,
or two executives could hold a video conference, each being somehow present
in the other's location. The goal of this page is to investigate what
research is currently being done on telepresence, and to give some firmer
definition to what research is of interest to us. Specifically, the
main interest is to look into and determine the feasibility of using telepresence
to enhance digital television broadcasts. The emphasis is on the
broadcasting of real and augmented-reality scenes, as opposed to completely
virtual ones. In addition, this page also
contains some information on related topics, such as MPEG.
What is Telepresence? - A look at the different
things that telepresence could be, and an attempt to classify different
types of telepresence.
Problem Specification - Telepresence covers a lot
of different things. This section details more specifically what
problem we are investigating.
Panoramic Presentations - An idea for what might
be broadcast over a "3-D television channel." Also, some general
thoughts as to how to go about capturing, transmitting, and receiving such presentations.
A Scenario for the Near Future - A scenario for
the capture, transmission and display of some simple panoramic presentations
in the near future.
MPEG Information - Some notes on the different MPEG
specifications, and what they contain.
General Notes - Thoughts and ideas about telepresence
that don't fit in well in other sections.
Telepresence References - References to other
information on Telepresence. A brief summary of what is talked about
is also given for many of the references.
Related Topics References - References to related
material. Again, brief summaries are given for some of the references.
This section contains general observations made about Telepresence and
its related fields during the collection of the material in this note.
What is Telepresence?
Table 1 below, together with the axes that define it, gives rise to several observations; these are listed after the table.
There seem to be several axes which define the space of telepresence.
Any given research project can be (at least partially) classified by specifying
where it stands on each of these axes:
Table 1 gives a rough outline of the applications which fall into
the various combinations of the parameters defined below.
The combinations in which this project is most interested are
multi-user viewing of real and augmented-reality locations.
Telepresent Location - the location into which the
users "telepresent" themselves. This varies from a completely virtual
environment, to mixed virtual/real, to a completely real-world location.
Standard locations would be:
Virtual Reality - a completely imaginary location
Augmented Reality- a combination of a real location
(possibly recorded) and supplemental objects or information
Reality - a real location
Degree of Interactivity - the extent to which the
user, or users, can interact with the remote environment. Three classifications are used:
Full Interactivity - the users can interact with
other objects and users in the environment, as well as move around and
change their viewpoint.
Viewing Interactivity - the users can move around
and change their viewpoint, but have no effect on their environment.
Effectively, every user is a ghost.
Passive Viewing - the viewer is only able to observe
the remote location from the position and point of view which is transmitted
to them. This is how standard TV works.
Single vs. Multi-User - the number of users who are
simultaneously present at the remote location.
Single User - only a single person is "telepresent"
to the remote location at one time.
Multi User - more than one person can experience the
remote location at one time.
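As a concrete illustration, the three axes above can be captured in a small data model. The sketch below is a hypothetical Python rendering; the type and field names are ours, not from any telepresence standard:

```python
from dataclasses import dataclass
from enum import Enum

class Location(Enum):
    VIRTUAL_REALITY = "virtual reality"
    AUGMENTED_REALITY = "augmented reality"
    REALITY = "reality"

class Interactivity(Enum):
    FULL = "full interactivity"
    VIEWING = "viewing interactivity"
    PASSIVE = "passive viewing"

@dataclass(frozen=True)
class TelepresenceClass:
    """One cell of Table 1: a point in the space of telepresence."""
    location: Location
    interactivity: Interactivity
    multi_user: bool

# The focus described in the Problem Specification: multi-user
# viewing interactivity in real or augmented-real locations.
focus = [
    TelepresenceClass(Location.REALITY, Interactivity.VIEWING, True),
    TelepresenceClass(Location.AUGMENTED_REALITY, Interactivity.VIEWING, True),
]
```

Any given research project can then be located in the space by picking one value per axis.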
                        Virtual Reality         Augmented Reality            Reality
Full Interactivity
  Single User           3-D Games               Remote Exploration/Surgery
                                                with Embedded Objects/
                                                Information
  Multi User            3-D Network Games,
                        3-D Chat Rooms
Viewing Interactivity
  Single User           Virtual Museum          3-D City Model with          3-D City Model
                                                supplemental information
  Multi User
Passive Viewing
  Single User           Video tours of          Off-line Distance            Most Modern
                        virtual locations       Learning                     Television
  Multi User            Broadcast VR,           Live Distance Learning
                        ie. "Toy Story"

Table 1: Different Types of Telepresence
In the "Passive Viewing" category, there is little
difference between multi-user and single user, other than the potential
for the multi-user case to be a live broadcast experienced by many at the same time.
With "Viewing Interactivity" only, the multi-user
scenario is different in that it probably restricts all users to a common
area, or sequence of areas which change over time (a story-line).
This allows for a shared experience even though users cannot see or interact
with one another. The "Single User" scenario can be less restrictive,
and need not have such a story-line.
For "Full Interactivity," the key difference between
single and multi-user scenarios is whether other tele-users are present
at the remote location to interact with. Clearly, there
is some spectrum, as remote surgery could potentially be done by multiple remote participants.
Although Telepresence is the general area being addressed, the actual focus
for us is narrower. Specifically, we are interested in what Table
1 classifies as "multi-user reality or augmented reality telepresence
with viewing interactivity." In simpler terms, the interest is in
broadcast telepresence of real locations, possibly with supplemental information.
Details of this focus, and related ideas are maintained in this section.
The most important question, and the one that drives the answers to many
of the other points, is what benefit this "telepresence" or 3-D TV offers
over traditional programming. Below are lists of the types of programming
that would probably benefit, and of those that probably wouldn't.
The main target is broadcast telepresence-- that is to say some sort of
remote location which is captured and broadcast to "viewers" who then feel
as if they were present at the broadcast location.
By broadcast, we are speaking of a digital broadcast system, probably an
HDTV channel at a rate of 19.2 Mb/s, or perhaps a regular digital channel
at 3-6 Mb/s.
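To get a feel for these rates, a little arithmetic converts channel bandwidth into a per-frame byte budget; the 30 frames/s figure below is an assumption, not something specified above:

```python
def frame_budget_bytes(channel_mbps: float, frames_per_second: int = 30) -> float:
    """Bytes available per frame on a channel of the given rate (Mb/s)."""
    bits_per_frame = channel_mbps * 1_000_000 / frames_per_second
    return bits_per_frame / 8

# An HDTV channel at 19.2 Mb/s leaves roughly 80,000 bytes per frame
# at 30 fps; a regular digital channel at 3 Mb/s leaves about 12,500.
hdtv_budget = frame_budget_bytes(19.2)
regular_budget = frame_budget_bytes(3.0)
```

Whatever scene representation is chosen has to fit its dynamic updates into budgets of this order.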
An important aspect of broadcasting is that the information
sent needs to describe a time-varying set of information. In other
words, an entire show, or environment, is not downloaded for future perusal.
This poses some unique challenges if some sort of 3-d environment is being transmitted.
The target is the near future; in other words, whatever scheme is
devised must be implementable and deployable within the next 5 years.
Something that can only be done on high-end equipment today would probably
be affordable to deploy within that time frame.
How the user is "telepresent" can vary from a text transcription of an
occurring event, to a video camera, to a complete 3-d model of the remote
scene being broadcast. Our interest is in broadcasting some sort
of 3-d information of the remote location.
A set of problems need to be addressed:
What is going to be sent to the users?
How do we capture and compress the information for
transmission over a 2 MB/s channel?
How, and on what reconstruction (display) system,
do we reconstruct the transmission?
The previous problems are asymmetric -- more time/money/effort can be spent
on capture and compression than on reconstruction.
What is going to be sent to the users should be reasonably
independent of the user's display technology.
What sort of display is reasonable for the "viewer" side of things?
Some ideas follow:
Regular Television, with manipulable 3-d view of remote scene.
Standard display with LCD shutter glasses.
VR Goggles. These would allow a user to visually feel as if they were
at a location. The disadvantage is that they require a more significant
"viewer"-side investment, though they provide more of an advantage over traditional television.
Virtual Workbench - good for a top-down viewpoint,
but of limited general use.
CAVE. The viewer/user "watches" inside a room
which projects the remote location onto the walls.
Programming that would benefit (and how):
Nature shows-- the ability to look at the environment around you could
add a lot to a show. Imagine being out on the savanna in Africa:
instead of just having to focus on a specific area, the entire environment
could be viewed.
Tourism Shows-- as with Nature shows, the ability to see a broader range
of your environment would be quite beneficial. As the show takes
you on a tour of a town you could concentrate on things that interest you
for longer. You could also choose whether to look at the announcer
or what he is describing.
Sports-- multiple viewpoints might be of interest in sports. Users
could choose which angle of the tennis court they like best, for example.
Having depth information might also make it easier to follow the action,
particularly following the ball.
Programming that probably wouldn't benefit:
Broadcast lectures-- seeing the professor and his slides in 3-d makes little
difference. Broadcasting 3-d "slides" with models that are better
appreciated in stereo (ie. molecules, 3-d flow fields) could be a benefit
(but also would be difficult to show to those in class).
The previous section gave three problems that needed to be solved for broadcast
telepresence: what to broadcast, how to record it, and how to reconstruct
it. This section proposes an idea for what should be broadcast, the problem
that needs to be solved before the other two. Two things need to
be determined to answer the question of what to broadcast:
The solution proposed here is that of a "Panoramic Presentation".
The basic idea is that instead of broadcasting just a flat 2-d view of
the scene being portrayed, a complete 3-d local environment is being transmitted.
This gives the viewer the ability to look around themselves and get a better
appreciation of the location, hopefully allowing them to feel more like
they are actually there. In addition, the viewer would be allowed
to change their position within some fixed boundary, or movement box.
As time goes on in the show, the local area will change, either as the
viewers position changes (as in a tour or a nature show), or as actors
enter and depart (as in a sit-com or play).
What information can be broadcast?
Does it have significant value above and beyond regular broadcast television
(with the sub-point that it should not have any disadvantages compared to
regular television)?
There is a problem with such a concept in that the viewer may get lost
looking in some funny direction and miss something important. To
avoid such a problem, preassigned viewpoints, or items to look at, and
positions from which to look can be defined, including a main viewpoint
and position which are defaults. This way, if the viewer does nothing
at all, they have the same experience as watching a standard show (with
the possible benefit of depth information). This brings up a new
problem, which is how to switch viewpoints and positions. One option
would be to allow users to toggle through a set of fixed viewpoints, which
translates to just broadcasting several 2-d views of the scene. The
next option, similar to that given by Quicktime VR, is to jump from position
to position, but allow viewers to look around at each location. This
still allows the user to get lost looking in some strange direction, though.
A final option is to have each of the viewpoints, and viewing positions
serve as a sort of gravity well, such that as the viewer looks and moves
around they are naturally attracted to the predefined viewpoints and positions.
Figure 1 below shows how all of this might work.
Figure 1a shows a typical fixed set presentation, while Figure 1b shows
a scene with movement. Here are the key features common to each:
(a) Panoramic Presentation of a Fixed Set Show (ie. Sitcom)
(b) Panoramic Presentation of a Show with Changing Environment
(ie. Tour or nature show)
Figure 1- Examples of Scene Gravity for Different Shows
In both of the types of presentation, viewpoints and positions are defined.
The viewer may change position and view at will as long as they stay within
the movement box. As they move, their point of view will be drawn
towards the predefined viewpoints, accelerating as they come nearer to
them. This mimics the way our eyes are naturally drawn to certain
items. Similarly, certain positions will be attractive. This
allows the viewer to roam around some if they wish, but makes it easy to
be drawn back to the most interesting (or guaranteed interesting) viewpoints.
Gravity of course could be adjusted to be stronger or weaker. As
gravity becomes very high, the scene becomes one of multiple broadcast
views, and as gravity becomes very weak complete freedom of motion is granted.
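As an illustration, the gravity-well behavior could be implemented as a per-frame nudge of the gaze direction toward the predefined viewpoints. The following Python sketch is hypothetical; the force law and constants are invented for illustration:

```python
import math

def gravity_step(gaze, viewpoints, strength=0.5, dt=0.033):
    """Nudge a 2-D gaze direction (unit vector) toward attractor directions.

    Each predefined viewpoint direction pulls the gaze with a force that
    grows as the angular distance to it shrinks, mimicking a gravity well.
    Setting strength=0 grants complete freedom; a very large strength
    effectively snaps the gaze to the broadcast viewpoints.
    """
    gx, gy = gaze
    fx = fy = 0.0
    for vx, vy in viewpoints:
        # Angular separation between current gaze and this viewpoint.
        dot = max(-1.0, min(1.0, gx * vx + gy * vy))
        angle = math.acos(dot)
        if angle < 1e-6:
            continue  # already looking straight at it
        # Pull grows as the gaze approaches the attractor.
        pull = strength / (angle * angle + 0.1)
        fx += (vx - gx) * pull
        fy += (vy - gy) * pull
    # Integrate one time step and renormalize to keep a unit vector.
    gx, gy = gx + fx * dt, gy + fy * dt
    norm = math.hypot(gx, gy)
    return (gx / norm, gy / norm)

# Starting slightly off a single attractor at (1, 0), repeated steps
# draw the gaze onto it, accelerating as it gets closer.
gaze = (math.cos(0.5), math.sin(0.5))
for _ in range(200):
    gaze = gravity_step(gaze, [(1.0, 0.0)])
```

The strength parameter plays the role of the adjustable "gravity" knob described above.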
The dashed region is called the "modeled region." This is the area
that the viewer's machine has available to it at any given time.
The solid black enclosed region is the "movement box." The movement
box defines the range of positions over which the viewer can move.
In the case of Figure 1b, the movement box region changes over time.
The stars represent the viewpoints, and the circles represent positions.
The "Temporal Path" in Figure 1b represents the path over which the viewpoints,
positions, movement box and modeled region travel as time passes.
Of the two types of panoramic presentation shown in Figure 1, the static
scene in Figure 1a is simpler. In that case there is a fixed set
which it would be possible to cache on the viewer's machine. The viewpoints,
positions and movement box could change over time, but are constrained
by the fixed nature of the scene. The situation in Figure 1b is more
complicated in that everything changes as time goes on. Imagine a
nature show that swims through the Great Barrier Reef. The movement
region in this case is a volume which moves as time progresses and the
viewer is guided along the reef. The positions all move as time goes
by, and the viewpoints may change to various fish and coral. The
viewer may decide to continue looking at a fish after it is described,
and then allow themselves to be swept forward as they reach the back of
the viewing region, and have their viewpoint drawn to some new object.
Of course shows need not be entirely fixed, or entirely dynamic.
The types can be intermixed or interleaved at will, and the examples are
given as the two extremes.
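The movement box and temporal path of Figure 1b can be sketched as a time-varying clamp on the viewer's position. The sketch below is hypothetical; the box size and path are invented for illustration:

```python
def clamp(value, low, high):
    return max(low, min(high, value))

def constrain_viewer(pos, t, path, half_extent=(2.0, 2.0, 2.0)):
    """Keep the viewer inside a movement box centered on a temporal path.

    `path(t)` gives the box center at time t (e.g. a point drifting along
    the reef in the nature-show example); the box is an axis-aligned
    region of +/- half_extent around that center.
    """
    cx, cy, cz = path(t)
    hx, hy, hz = half_extent
    x, y, z = pos
    return (clamp(x, cx - hx, cx + hx),
            clamp(y, cy - hy, cy + hy),
            clamp(z, cz - hz, cz + hz))

# A viewer lingering at the origin is swept forward once the box,
# moving at 1 unit/s along x, leaves them behind.
path = lambda t: (1.0 * t, 0.0, 0.0)
print(constrain_viewer((0.0, 0.0, 0.0), t=10.0, path=path))  # (8.0, 0.0, 0.0)
```

A fixed-set show (Figure 1a) is simply the special case where `path(t)` is constant.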
As shown in Figure 1, both primary and secondary
positions and viewpoints could be sent to the viewer at any given time.
These will of course change over time as a show progresses. The main
viewpoint and position over time define the standard method of watching
the show as it would be seen on conventional television. This idea
could be extended by having several different sets of time varying positions
and viewpoints which extend throughout the show. Instead of constantly
choosing viewpoints and positions as the show progresses, the viewer could
lock in one particular "story-line" at the beginning of the show, and follow
the action from that sequence of positions and viewpoints. This would
allow a viewer who likes one character better than most to choose a "story-line"
that emphasizes that particular character. It could also be set up
that there are several different "directors" for a given show, each of
whom comes up with their own "story-line." The viewer could then
choose which "director's" story-line choice to follow through the show.
Eventually, as homes become better networked, viewers could exchange "story-lines",
or choose another viewer to be the active one who controls their viewing
perspective through the show.
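One simple way to represent such a "story-line" is as a sorted list of timed camera keyframes that the receiver interpolates between. The sketch below is hypothetical; the director names and keyframe values are invented:

```python
import bisect

def sample_storyline(keyframes, t):
    """Piecewise-linear sample of a story-line at time t.

    `keyframes` is a time-sorted list of (time, value) pairs, where value
    is any scalar camera parameter (one coordinate of a position or
    viewpoint); times outside the range clamp to the end keyframes.
    """
    times = [k[0] for k in keyframes]
    i = bisect.bisect_right(times, t)
    if i == 0:
        return keyframes[0][1]
    if i == len(keyframes):
        return keyframes[-1][1]
    (t0, v0), (t1, v1) = keyframes[i - 1], keyframes[i]
    w = (t - t0) / (t1 - t0)
    return v0 + w * (v1 - v0)

# Two hypothetical "directors" covering the same show; the viewer locks
# in one at the start and follows that camera path throughout.
storylines = {
    "director_a": [(0.0, 0.0), (10.0, 5.0)],
    "director_b": [(0.0, 0.0), (10.0, -5.0)],
}
chosen = storylines["director_a"]
print(sample_storyline(chosen, 4.0))  # 2.0
```

Exchanging story-lines between viewers then amounts to exchanging these keyframe lists.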
Two questions were posed at the beginning of this section: the first
was what could be broadcast, and that has been discussed throughout the
section. The second was what the benefit is, and whether it is actually
worse than regular TV. It is fairly easy to see that there is no
loss over regular viewing in this scheme. If you just sit back and
watch, your viewpoint and position will be changed automatically for you.
It is also difficult to accidentally shift your view, since gravity will
tend to bring you toward a desirable view. It seems pretty clear
that there are also advantages over regular TV since each viewer has the
choice of looking at what is most interesting to them, and can choose to
linger on certain items, or dash ahead to see what is coming next.
Before ending this section, a brief mention of how this might actually
be displayed to the user is appropriate. The panoramic presentation
method presented here is scalable, and could be displayed at least in the
following ways:
Regular Television - with a decoder, a fixed point of view could be displayed.
In addition, the viewer could be given a remote which allowed them to move.
Stereo Vision TV - as above, but depth information would be more apparent
with LCD shutter glasses.
VR Goggles - the viewer could look around using their head, and move with
a controller of some sort.
Virtual Workbench - the user changes view by moving around
a table, and position with a joystick or data glove.
CAVE - by walking around in the room and looking at different areas, the
projected view of the scene would change.
A Scenario for the Near Future
The previous section outlined the type of scene information that could
be transmitted in a broadcast telepresence situation -- a panoramic presentation.
This section presents some ideas for a prototype of a panoramic presentation
transmission and reception system that could be constructed within the
next year or so. It goes into some specifics on the entire process,
including capturing/recording the scene, compression and data segmentation
for transmission, transmission, decompression, and viewing and interaction
by the user.
Range Images - RGB-Z images are captured
Range Cameras (not yet real time)
Planar Light Field Distortion (Cyberware)
Laser Time-of-Flight Imaging
Light Field Capturing
Could allow light field compositing if a model is available
Model Assisted- Use existing knowledge of scene to
do a better job capturing the dynamic information.
3-d Sound Capture
Each actor and camera has a microphone pick-up
3-d sound field can be reconstructed
Some sounds should be omnipresent (ie. voice-over)
Multiple sound tracks possible (different languages, different announcers?)
Pre-Capture of Set
Background Light Field
Some other image map?
Set Removal from Scene information, (CGI assisted rendering?)
Construction of 3-d information on non-set objects
Transmitted info = dynamic objects + dynamic lighting effects + dynamic
Mapping 3-d extracted information to human models?
Final compression using standard techniques (any 3-d compression techniques?)
Transmission of Background Scene During Commercials (until 3-D commercials!)
Possible to cache scenes of popular shows (but why? Maybe better
detail the more you watch?)
Straightforward broadcasting of dynamic objects
Use existing transport protocols (ie. MPEG-2 Transport Stream)
Background scene rendered from desired position/viewpoint with dynamic
Dynamic objects warped or rendered from desired position/viewpoint and
Viewing and Interaction
"Story line" channel can be chosen for those that don't want to interact
Simple button press selects the desired "story line" (ie. channel changing
within the show)
Miniature views of all "story-lines" are displayed, and the user can select
one, or more than one, to appear on their screen.
Twist shuttle or space orb allows manipulation of position/viewpoint in 3-d
Gravity force draws users back onto a pre-defined "story line" when they stray
Gravity can be increased/decreased just as volume, brightness and contrast
can be changed; this allows the user to control the amount of interactivity/hand-holding
they receive.
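The idea of pre-loading the background scene during commercials can be sanity-checked with some arithmetic. The sketch below is hypothetical; the model size, per-frame object budget, and 30 fps figure are invented for illustration:

```python
def preload_seconds(background_bytes, dynamic_bytes_per_frame,
                    channel_bps, fps=30):
    """Seconds of airtime needed to pre-load the background scene.

    Assumes dynamic objects consume part of each frame's bit budget,
    and the background model is trickled out in the spare capacity
    (e.g. during commercials, until 3-D commercials arrive).
    """
    frame_budget = channel_bps / fps / 8          # bytes per frame
    spare = frame_budget - dynamic_bytes_per_frame
    if spare <= 0:
        raise ValueError("dynamic objects already exceed the channel")
    frames_needed = background_bytes / spare
    return frames_needed / fps                    # seconds

# On a 19.2 Mb/s channel, with 60 kB of dynamic objects per frame, a
# 50 MB background model needs roughly 83 seconds of spare capacity.
t = preload_seconds(50_000_000, 60_000, 19_200_000)
```

Even rough numbers like these suggest the background of a fixed set could plausibly be delivered within a single commercial break.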
The broadcast channels we are looking at are transmitted using the MPEG-2
Transport Protocol, and would normally carry an MPEG-2 encoded television program.
On account of this, it is helpful to have some idea of what the various MPEG standards cover:
MPEG-1-- The original MPEG encoding standard. It was designed to
encode progressive-scan video at a size and quality similar to that offered
by VHS. Stereo (two-channel) audio was also specified. The
standard only specified the format of the bit-stream, and included no provisions
for transport.
MPEG-2-- MPEG-2 is pretty much a superset of MPEG-1. It adds interlaced
video coding and takes advantage of correlation between fields in the interlaced
video. It also supports multichannel surround sound. In addition, more care
was taken to specify the streams which bundle several audio and video channels
together under a common time sequence. Program streams are specified
for error free environments and transport streams for environments where
noise and interference can cause loss. Transport streams are designed
to operate at the same level as IP/UDP.
MPEG-4-- This standard adds many scene compositing features, and allows
video of a non-rectangular shape to be transmitted. Thus, a static
background, a shaped video of a news anchor, and a small background news
clip could be transmitted and composited together to form a news bulletin.
In addition, a human voice coded stream for the newscaster, a background
MIDI sequence, and the sound for the news clip could be transmitted separately
and mixed at the destination. The transport layer is basically the
same as that specified in MPEG-2.
MPEG-7-- or, "Multimedia Content Description Interface." This standard
is an attempt to define how information about multimedia objects in general,
and those used to composite MPEG-4 scenes specifically, are attached to
the objects themselves. The primary goal of this is to facilitate
searching multimedia information, say for a "boy dressed in blue."
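For reference, the MPEG-2 transport streams mentioned above carry data in fixed 188-byte packets, each starting with the sync byte 0x47 and identifying its elementary stream with a 13-bit PID. A minimal header parser illustrates the layout; the sample packet is synthetic:

```python
def parse_ts_header(packet: bytes):
    """Parse the 4-byte header of an MPEG-2 Transport Stream packet.

    Transport packets are a fixed 188 bytes and start with sync byte
    0x47; the 13-bit PID says which elementary stream (video, audio,
    tables) the payload belongs to.
    """
    if len(packet) != 188 or packet[0] != 0x47:
        raise ValueError("not a valid transport stream packet")
    return {
        "transport_error": bool(packet[1] & 0x80),
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "continuity_counter": packet[3] & 0x0F,
    }

# A minimal synthetic packet: sync byte, PUSI set, PID 0x0100, counter 7.
pkt = bytes([0x47, 0x41, 0x00, 0x17]) + bytes(184)
print(parse_ts_header(pkt)["pid"])  # 256
```

Any panoramic-presentation payload carried over such a channel would be multiplexed into packets of exactly this shape.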
This is stuff that doesn't fit well in other sections.
If real, how is the remote location captured and modeled for the user?
If augmented reality, how are real-local, real-remote (possibly modeled)
and virtual objects made to co-exist?
If interactive, what methods do people use to interact with each other
and environment objects?
If Multi-user, how are user actions synchronized?
Reality check issues: How much network bandwidth can reasonably be
used? How much computational power is available?
- The stated goal of this project is to foster research into Tele-immersion,
which they define as enabling "users in different locations to collaborate in
a shared, simulated environment as if they were in the same physical room."
Their idea is to have users in different geographic locations interacting
in a shared virtual environment. So, telepresence here is projecting
one's presence into a shared virtual environment. *** This page is
only a project overview ***
Tele-Immersion Home Page
- The current thrust of the research is to set up tele-cubicles.
There will be four in the US, at USC, the Northeast, UNC-CH, and UIC.
The cubicles will be linked across two walls and a desk which are all stereo
projection systems. In this way a shared augmented reality system
will be created among the four locations. The main initial application
will be a collaborative CAD system. The description here is at a
very abstract level.
DARPA Video Surveillance
and Monitoring Project - This project is concerned with automatically
identifying salient features in remote video feeds coming from stationary
and autonomous vehicle cameras. The system is supposed to increase
the number of incoming streams that a human operator can monitor by bringing
in a human observer only when an important feature is detected. There
are a variety of groups contributing to this, and it is being attacked from
different angles, such as scene reconstruction and image recognition.
Telepresence - Concerned mainly with wide deployment of current technologies
like video-conferencing and shared whiteboards. Two papers are on-line:
the first, "On-Ramp Prospects," outlines issues with getting adequate
bandwidth to homes.
The second, "Non-Collaborative Telepresentations Come of Age," deals with
digital recording of presentations for transmission/viewing either
simultaneously or at a later time.
It gives bandwidths and storage capacities needed for various encodings.
It also outlines features of current products. It specifically ignores
presentations with audience participation.
Research Group - A project using multiple cameras to extract depth
information and reconstruct a scene, such that a remote user can move through
a reconstruction of the scene in real-time. The main targeted application
is remote medical consultations.
H. Fuchs, G. Bishop, K. Arthur, L. McMillan, R. Bajcsy, S. W. Lee, H. Farid,
and T. Kanade, "Virtual Space Teleconferencing Using a Sea of Cameras,"
Proceedings of the First International Conference on Medical Robotics and
Computer Assisted Surgery, Vol. 2, Pittsburgh, PA, 1994.
- This paper presents an idea for creating a room
with a "sea of cameras" which could be used to allow remote users to navigate
the room with a "virtual camera." The authors then describe an improved
algorithm for wide-baseline stereo that can acquire a depth map using
multiple cameras along a single baseline. Some preliminary work is
presented demonstrating the effectiveness of their algorithm, and they
propose creating a real-time camera to capture RGB-Z images this way (now
complete- see Video-Rate Stereo Machine).
Although the authors propose that using this technique and a "sea of cameras"
it would be possible to allow users complete freedom in a room, they provide
no specifics as to how the multiple depth maps could be combined to create
one 3-d scene, or how to warp/switch between depth map scenes at different locations.
Related Topics Resources
Virtual Worlds Data Server Project
- a project to create data servers which allow multiple people to have real-time
interaction with models that are too large to fit into any machine's memory.
Virtual LA - A project
to create a model of 10,000 square miles of the Los Angeles basin area.
Methods to quickly go from real-world data to a 3-d model are used to construct
the Virtual City, and the VWDS system is used to allow users to interact
with the large data set.