Face Driven Animation


COS 598d Modeling
Ian Buck
Abstract: As CPUs get faster, memory busses get wider, and multimedia hardware becomes more common, the potential for the computer to perceive its surroundings becomes more real. In this work, we explore the possibility of using computer vision to detect the expressions of the human face. Using SGI media hardware, this project shows that it is possible for a modern, inexpensive computer to perform real-time facial feature detection without any visual aids for tracking. Furthermore, we show that this feature data can then be used, in parallel with the feature detection, for a variety of different applications which benefit from the computer being able to process the human face.

Goal: The goal of this work is to implement a real-time gesture recognition program that can perturb a model of a human face to match the gesture expressed by a user seen through a camera attached to the computer. The basic idea is that the computer will grab frames off of the camera and perform an efficient pass over the image data to locate features such as the outline of the mouth, the positions of the eyes, or the tilt of the eyebrows. The resulting data is passed to a model manipulation engine, which takes that data and applies it to a model so that the model matches the features detected. This project could greatly benefit the motion capture industry, which currently uses position sensors taped to the subject's face to get all the necessary data.

The primary difficulty in doing the project is achieving a suitable image recognition algorithm whose output can be applied to a model. The algorithm must be fast and also pick out regions of space that can be quite small. However, people have implemented algorithms that work as an off-line process, and I don't see why they cannot be converted to an online one.

The image processing will use some pre-existing work that I've done in color filtering to determine skin tones. Deciding which pixels are skin colored and which are not is a problem that is solved by a simple one-pass filter. From the detection of skin tones, the outlines of the mouth, the eyes, the nostrils, and the eyebrows can be found. This can be done by processing regions of the image which are "holes" in the skin as detected by the skin processor. These can be processed with a fast flood fill to determine their area and position. Also, since the area being revisited is much smaller than the full image data, this should not add any delay to the capture rate of the camera. The result is a feature "vector" which contains the positions of all of the facial parts. This vector can then be passed to the model manipulation engine, either through a direct call or through a network connection. Since the work on skin detection has already been done, this part of the project will not be that time consuming.
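To make the skin filter concrete, here is a minimal sketch of the kind of one-pass color test described above, assuming 8-bit RGB frames in a NumPy array; the threshold values are illustrative, not the ones used in the project.

```python
import numpy as np

def skin_mask(frame, red_min=95, chroma_margin=15):
    """Tag pixels that look like skin: a strong red component that is
    noticeably larger than the green and blue components.
    `frame` is an (H, W, 3) uint8 RGB image; thresholds are illustrative."""
    r = frame[:, :, 0].astype(np.int16)
    g = frame[:, :, 1].astype(np.int16)
    b = frame[:, :, 2].astype(np.int16)
    return (r > red_min) & (r - g > chroma_margin) & (r - b > chroma_margin)
```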

The second part of the project is to take the feature vector from the image filtering and recognition, and apply that data to a model so that the model shows the same features that the vector describes. There are a variety of different model representations that could be used in manipulating the model. The basic guidelines are that the vector data should be applicable to the model fast enough that it can be done five, six, or seven times a second. Also, the different fittings to the gesture data should roughly match the properties of the human face, unless other models are desired (like cartoon figures). These questions of data representation are going to be the main bulk of the work. A simple solution may be to model the face as a mesh with tension between vertices to model the bending of the human skin. This would be quite simple to implement but may not provide realistic looking results around the mouth. Another possibility may be to model the face as a NURBS surface; however, the image-processing vector must then produce control points instead of actual sample points. I plan to implement a mesh algorithm first and then, if that produces unrealistic results, explore other representations.
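One way to read "a mesh with tension between vertices" is as a simple relaxation: pin the vertices that correspond to detected features at their new positions, then repeatedly pull every free vertex toward the average of its neighbors so the surrounding skin follows. The sketch below is only an illustration of that idea, not the project's implementation; all names and parameters are made up.

```python
import numpy as np

def relax_mesh(vertices, edges, pinned, iters=10, stiffness=0.5):
    """Pull free vertices toward the mean of their neighbors while the
    feature-driven vertices stay pinned at their new positions.
    `vertices`: (N, 2) float array, `edges`: list of (i, j) index pairs,
    `pinned`: dict mapping vertex index -> new (x, y) position."""
    verts = vertices.astype(float).copy()
    neighbors = {i: [] for i in range(len(verts))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for idx, pos in pinned.items():
        verts[idx] = pos                       # feature points drive the mesh
    for _ in range(iters):
        for i, nbrs in neighbors.items():
            if i in pinned or not nbrs:
                continue
            target = verts[nbrs].mean(axis=0)  # springs pull toward neighbors
            verts[i] += stiffness * (target - verts[i])
    return verts
```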

Success in the project will mean that any person could come and sit down in front of an O2, be recognized by my application, and be animated on the screen. This project, when it is completed, has the potential to produce some very interesting results and will also teach me about image recognition and about what you can and cannot do in real time when it comes to model manipulation.


Progress Report

(April 23, 1998)

One of the main questions surrounding this project was how well I would be able to track features on the face. In particular, I was concerned with how difficult it would be to pick out the more complicated facial features from the video image, especially the mouth with its irregular shape and size. So I decided to first focus on trying to accurately detect the contours of the mouth. Once the mouth can be detected and tracked in real time, the remaining features (eyes, nostrils, eyebrows, etc.) should follow naturally.

In trying to detect the position of the mouth, I considered a few different types of image filtering to help my search. First I tried to detect the "redness" of the lips. This was done by normalizing the color vector at each pixel and then tagging pixels which had a large red component and small blue and green components. This works fairly well for detecting skin tones within the image; however, the lips can often have very subtle shade differences from the surrounding skin. Also, the gums of the mouth appeared to have similar shades of red to the lips.

As it turns out, if a small lighting constraint is placed on the subject, a simple threshold filter works quite well for detecting pixels inside of the mouth. If the lighting is cast so that light shines down from above rather than directly into the mouth, the interior of the mouth is never lit and remains dark when the user opens their mouth. Therefore the mouth's contours can be detected by marking all of the pixels which are below a certain intensity.

Mouth Algorithm

In order to simplify the algorithm, we restrict the position of the mouth to be within a fixed viewing area. For each video frame, we perform a threshold test across the image and tag each pixel which is below a certain intensity. Next, beginning from the lowest pixel in the middle of the box, we scan upward until a tagged pixel is hit, then continue upward until the pixels are no longer tagged. This detects a cut of the mouth running vertically. Now, from the half-point of the cut, we scan out left and right until the edges of the mouth are detected. From each edge, we then walk further outward, traveling up or down along the edge of the mouth, to find the corner. This establishes the two corners of the mouth. Finally, we recompute the middle of the mouth by taking the midpoint of the two corners and performing the vertical search again.
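A minimal sketch of this search is given below, under simplifying assumptions: `tagged` is a NumPy boolean mask produced by the threshold test (e.g. `tagged = gray_frame < threshold`), `box` is the fixed viewing area, and the corner step stops at the edge of the tagged run rather than walking along it as the full algorithm does. Names and details are illustrative, not the project's actual code.

```python
def find_mouth_corners(tagged, box):
    """Approximate the left and right mouth corners inside `box`.
    `tagged` is an (H, W) boolean array of pixels below the intensity
    threshold; `box` is (x0, y0, x1, y1). Returns two (x, y) points,
    or None if no dark region is found."""
    x0, y0, x1, y1 = box
    mid_x = (x0 + x1) // 2

    # Scan upward from the bottom of the box until a tagged pixel is hit.
    y = y1
    while y > y0 and not tagged[y, mid_x]:
        y -= 1
    if y == y0:
        return None
    bottom = y
    # Continue upward while pixels stay tagged: a vertical cut of the mouth.
    while y > y0 and tagged[y, mid_x]:
        y -= 1
    top = y + 1
    cut_mid = (top + bottom) // 2

    def scan_to_edge(step):
        # From the half-point of the cut, move outward while pixels are
        # tagged; the last tagged pixel approximates the mouth corner.
        x = mid_x
        while x0 < x + step < x1 and tagged[cut_mid, x + step]:
            x += step
        return x, cut_mid

    return scan_to_edge(-1), scan_to_edge(+1)
```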

This algorithm makes quite a few assumptions about the mouth's shape and position, but it yields fast and often quite accurate results. Below is a screen shot of the algorithm in action. The green pixels are the tagged intensities while the blue boxes are the corners of the mouth. To show the effectiveness of the algorithm, the corners of the mouth were used to warp a texture map of a picture, controlling that image's mouth.

The current problems with the implementation are mainly that the mouth is constrained to be inside the box and to pass through the horizontal middle of the camera image. Also, if exaggerated expressions are made with the mouth, e.g. sticking out the tongue, the algorithm will get confused.

Of course, more points along the mouth or a smooth interpolation between the points would provide better results; however, the purpose of this intermediate goal of mouth detection was achieved. We are able to detect the location and general shape of the mouth in real time without using much of the processor. Although the final implementation will most likely relax most of the constraints the current algorithm requires, the basic intensity search yields satisfactory real-time results. This work can definitely be applied to the other features of the face.


Final Update

(May 18, 1998)

Performing the simple mouth detection revealed two interesting things about having the computer detect facial features. First, while imposing only slight restrictions on the user to limit the position of the mouth within the image, a simple intensity filter and walking search algorithm can pick up a facial feature quite well and efficiently. Second, and more importantly, the way in which the facial data is used is just as important as the detection itself. The warping of a texture map, although quite simple, provided a realistic-looking effect for the facial data. The expanded implementation took these observations into consideration for the next steps.

Related Work: Eric Petajan, a researcher at Lucent, is currently working on a face animation specification for the upcoming release of MPEG-4. His work defines which parts of the human face must be manipulated to capture expression. The proposed specification for MPEG-4 defines 68 unique necessary motions of the face, ranging from the extension of the jaw to the pitch of the eyeball. Although it is clearly not possible to detect all of these different positions from a single camera frame, the specification provides a reference guide for which motions are good candidates for detection. He has also done plenty of work in applying facial animation data to 3D face models for rendering, and he was kind enough to provide me with models and software for driving a 3D model of a face to test the output of my application.


All of the 68 different control points for a 3D input model as defined by the MPEG-4 specification.

Detecting Eyes, Nose, and more Mouth

Expanded Mouth Algorithm: The purpose of tracking facial features is to capture human expression. With only four points surrounding the mouth, only basic emotions can be represented. Adding more points greatly improved the expressiveness and contours of the mouth. This was done by first detecting the original four points and then performing a vertical linear search at the halfway horizontal positions. This results in 8 points around the mouth rather than the original four, and these 8 also match the MPEG specification provided by Petajan. Other small changes were made to the mouth detection to reduce the interference caused by the tongue obscuring the search for the left and right corners.
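As a rough illustration of the expanded search (with illustrative names, not the project's code), the extra points can be found by re-running the vertical tagged-pixel scan at the x positions halfway between the mouth center and each corner:

```python
def vertical_extent(tagged, x, y_lo, y_hi):
    """Scan column `x` upward from y_hi toward y_lo and return the
    (top, bottom) rows of the first tagged run, or None if none is found."""
    y = y_hi
    while y > y_lo and not tagged[y, x]:
        y -= 1
    if y == y_lo:
        return None
    bottom = y
    while y > y_lo and tagged[y, x]:
        y -= 1
    return y + 1, bottom

def eight_mouth_points(tagged, box, left, right, top, bottom):
    """Given the four initial mouth points (corners plus the top and bottom
    of the center cut), add four more by re-running the vertical search at
    the x positions halfway between the center and each corner."""
    x0, y0, x1, y1 = box
    (lx, _), (rx, _) = left, right
    cx = (lx + rx) // 2
    points = [left, right, top, bottom]
    for x in ((lx + cx) // 2, (cx + rx) // 2):
        extent = vertical_extent(tagged, x, y0, y1)
        if extent:
            t, b = extent
            points.append((x, t))      # upper lip sample
            points.append((x, b))      # lower lip sample
    return points
```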

Detecting the Nose: The nose is a critical part of the face since it provides a center for everything else. The simplest way to look for the nose is to detect its boldest features, the nostrils. We apply the same box-style restriction on the user, asserting that the nose must reside inside of the nose box. Under this assumption, the nostrils should be the darkest object within the box.

To find the nostrils, we apply the same intensity filter in the nose box that was applied for the mouth. Next, we use the average x and y of the dark pixels as the center of the nose. We can also figure out where the individual nostrils are by searching linearly left and right from this average point. This only requires one pass over the nose box to perform the filter.
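A minimal sketch of this step, assuming a grayscale NumPy frame and an illustrative darkness threshold:

```python
import numpy as np

def find_nose_center(gray, nose_box, dark_thresh=40):
    """Threshold the nose box for dark pixels and return the mean position
    of those pixels as the nose center (or None if the box has none).
    `gray` is an (H, W) intensity image; the threshold is illustrative.
    The write-up additionally locates each nostril by scanning left and
    right from this center point."""
    x0, y0, x1, y1 = nose_box
    region = gray[y0:y1, x0:x1]
    ys, xs = np.nonzero(region < dark_thresh)
    if len(xs) == 0:
        return None
    return x0 + int(xs.mean()), y0 + int(ys.mean())
```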

The nostrils provide an accurate center point for the head, since head rotations are usually centered at the top of the spine, which is in line with the nose. We can therefore use the position of the nose to indicate head tilt and yaw: vertical movement corresponds to head nodding (tilt), and horizontal movement corresponds to head shaking (yaw).

Pupil Detection: Despite their small size, the pupils are prime targets for computer vision simply because they are among the darkest objects in the image. Since the eyes can be considered to always be perpendicular to the vertical angle of the head, they can yield the roll information that the nose was unable to pick up.

Like the other detections, we define a box within which the eyes must reside. Next, since the eyes are far apart, we can also assume they reside on opposite sides of the box's horizontal midpoint. This greatly helps our searching, since we can begin from the center and travel outward efficiently. Also, once the pupils are found, a simple vertical search can find the dark eyebrows.
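The sketch below simplifies this to picking the darkest pixel in each half of the eye box rather than walking outward from the midpoint; it is only an illustration, with made-up parameter values.

```python
import numpy as np

def find_pupils(gray, eye_box, dark_thresh=40):
    """Split the eye box at its horizontal midpoint (one eye per side) and
    take the darkest sufficiently dark pixel in each half as that pupil.
    `gray` is an (H, W) intensity image; the threshold is illustrative."""
    x0, y0, x1, y1 = eye_box
    mid = (x0 + x1) // 2
    pupils = []
    for lo, hi in ((x0, mid), (mid, x1)):
        half = gray[y0:y1, lo:hi]
        y, x = np.unravel_index(np.argmin(half), half.shape)
        if half[y, x] < dark_thresh:
            pupils.append((lo + int(x), y0 + int(y)))
        else:
            pupils.append(None)       # no dark-enough pixel: pupil not found
    return pupils
```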


Here all the different points detected by the algorithm are labeled in blue. The blue rectangles define the areas in which the eyes, nose, and mouth must reside.

Applying the Features

With the face data detected, the next problem is applying that data: how do we use it to drive an animation in real time? There has been previous work done in this area for the MPEG-4 research, where the focus is on applying face data to a 3D model for teleconferencing. The basis for their standard is deviations from the "relaxed" face: each of the 68 different control points is fed displacements from its original position. Intensity-detected face data, as produced in this project, can also be used to drive an MPEG-style 3D model. The main difficulty in the mapping is that the input values are a function of distance from the relaxed face, something which has no direct equivalent in the camera view, and certain values are relative to facial rather than world distances. For example, the stretch of the mouth is a scalar factor of the relaxed mouth width, while the mouth height is a function of the relaxed distance between the nose and the closed mouth. This coordinate system makes it difficult to get correctly scaled input values; however, the detection boxes can be used as frames of reference to compute approximate values. Furthermore, 2D textures provide interesting targets to apply the feature data to, simply because the constrained environment allows a 2D approximation of the model, and it is much simpler to map 2D data straight to other 2D data than to perform complex transformations.
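To illustrate that kind of normalization (with hypothetical names; the relaxed measurements would be captured once while the user holds a neutral face), a sketch might look like this:

```python
def mouth_parameters(left, right, nose_center, relaxed):
    """Convert pixel measurements into relative, MPEG-style values:
    deviations from the relaxed face, scaled by facial rather than world
    distances. `left`/`right` are the mouth corners, `nose_center` the nose
    point, and `relaxed` a dict of neutral-face measurements, e.g.
    {"mouth_width": ..., "nose_to_mouth": ...}. All names are illustrative."""
    (lx, ly), (rx, ry) = left, right
    width = rx - lx
    mouth_mid_y = (ly + ry) / 2.0
    nose_to_mouth = mouth_mid_y - nose_center[1]
    return {
        # stretch: change in width as a fraction of the relaxed mouth width
        "stretch": (width - relaxed["mouth_width"]) / relaxed["mouth_width"],
        # open: change in nose-to-mouth distance, relative to its relaxed value
        "open": (nose_to_mouth - relaxed["nose_to_mouth"]) / relaxed["nose_to_mouth"],
    }
```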

Results:

The camera detection is able to detect the pupils, nostrils, eyebrows, and 8 points around the mouth. This data is then in a form which can be processed in a variety of different ways. Here is a list of the different uses for the camera data:
  • Wireframe 3D model. Using the rendering code provided by Eric Petajan, the camera data is transformed to relative displacements and applied to an MPEG model.
  • Texture 2D warp. Apply the feature data to warp a texture so that it appears to behave as a face.
  • Save the feature data for future off-line rendering. Since the feature data is so small, it's ideal for storage or for sending across the web. This also gets the fastest frame rate possible out of the camera, since nothing has to be rendered.


This demo shows the different warpings between the camera points and the texture.


Wireframe input model. This 3D model is controlled by translating the camera data into displacement values.

Frame Rates. The camera can only output frames at just over thirty frames per second. Most of this time the CPU is idle, since it is waiting for a vertical refresh. Furthermore, the total processor time used by the texture warping is only around 30%. This means that there is plenty of CPU time left over for other tasks that might want to use the data.

Conclusion

Overall, I think this project shows quite a bit of potential for computer vision in the area of face-driven animation. The image filters and searching algorithms used were quite simple, yet they were still able to capture a lot of personal expression in the face. Expanding these methods, for example by using color information for filtering or by removing the fixed box restraint, could only improve the detection model. One issue that was never resolved was determining the pupil position within the eye: since the area of the eye is so small on the screen, it is difficult to establish a stable position that remains consistent from frame to frame. This work definitely established that camera recognition can detect facial features and expressions; however, the limit of what it can and cannot do has yet to be established.

Final Shot