A Hand Pose and Position Tracker for the Interactive Table

CS223B Final Project


Brad Johanson, Rachel Kolodny, Daniel Russel

 

Introduction:

We implemented a prototype system for controlling the cursor of the interactive table, one of the many computer-controlled displays in the interactive room. The system uses a video camera mounted over the table to track the motion of a hand over the table (other hands and moving objects are ignored). Various techniques are used to establish the position of the tip of the index finger, which is used to set the cursor location. Clicking will be accomplished by sensor fusion with a capacitance-based touch pad, which can record finger contact with the surface but not its location. The current system consists of a number of .mex files which process frames loaded into Matlab. Currently no attempt is made to recognize gestures, although ultimately gestures might be used to generate left and right mouse button events.

We decided not to try to recognize clicks directly from the video, since we could not find any convenient gestures that could be reliably tracked. The natural thing to do is to touch the table, but an overhead camera clearly cannot pick that up, and moving the fingers sideways is sufficiently awkward to make any such motion unsatisfactory.

The camera recorded 720x480 frames at 30 frames per second.

Initial Hand Location:

This phase occurs when the system is not tracking a hand in the video stream, for example before anyone has put a hand over the screen, or when the video camera is first turned on. We locate areas of high motion by taking the center of mass of the difference between the current frame and the previous frame and then finding the nearest pixel with a significant difference. This last step avoids getting stuck between two areas of high motion. We then take all the moving pixels in a square surrounding the chosen pixel and use them as our initial guess for the hand. This process initially produces quite bad estimates but tends to converge after a few frames.
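For concreteness, here is a rough C sketch of this initialization step. The frame layout (8-bit grayscale, row-major), the threshold and box-size constants, and the function name are our own illustration and do not correspond to the actual InitialR.c interface.

#include <stdlib.h>

#define DIFF_THRESH 25   /* assumed threshold for a "significant" change */
#define BOX_HALF    40   /* assumed half-width of the initial square     */

typedef struct { int x, y; } Point;

/* Returns the center of the initial hand guess, or (-1, -1) if no motion. */
Point initial_hand_guess(const unsigned char *prev, const unsigned char *cur,
                         int w, int h, unsigned char *hand_mask)
{
    long sx = 0, sy = 0, n = 0;
    /* 1. Center of mass of the significantly different pixels. */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (abs(cur[y * w + x] - prev[y * w + x]) > DIFF_THRESH) {
                sx += x; sy += y; n++;
            }
    Point guess = { -1, -1 };
    if (n == 0) return guess;
    int cx = (int)(sx / n), cy = (int)(sy / n);

    /* 2. Snap to the nearest moving pixel, so we do not get stuck
     *    in the gap between two separate areas of motion. */
    long best = -1;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (abs(cur[y * w + x] - prev[y * w + x]) > DIFF_THRESH) {
                long d = (long)(x - cx) * (x - cx) + (long)(y - cy) * (y - cy);
                if (best < 0 || d < best) { best = d; guess.x = x; guess.y = y; }
            }

    /* 3. Every moving pixel in a square around the chosen pixel becomes
     *    part of the initial hand region. */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int moving = abs(cur[y * w + x] - prev[y * w + x]) > DIFF_THRESH;
            int inside = abs(x - guess.x) <= BOX_HALF && abs(y - guess.y) <= BOX_HALF;
            hand_mask[y * w + x] = (unsigned char)(moving && inside);
        }
    return guess;
}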

Tracking:

Now that we have some idea of where the hand is, we want to track it from frame to frame. First, we update the position of our guess by XORing the guess from the previous frame with the difference between the two frames. The rationale behind this step is that if we consider the hand as a uniformly colored, translating object, then the new position is the old position XORed with the difference. In order to cut out extraneous motion, we restrict the area of consideration to a neighborhood of the old guess. The resulting image is shown below.
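As a rough sketch (with assumed names, thresholds, and frame layout; the actual .mex code is organized differently), the update step looks something like this:

#include <stdlib.h>

#define DIFF_THRESH  25   /* assumed threshold for a "significant" change   */
#define NEIGHBORHOOD 60   /* assumed margin around the old guess, in pixels */

/* Bounding box of the nonzero pixels in a mask. */
static void mask_bounds(const unsigned char *mask, int w, int h,
                        int *x0, int *y0, int *x1, int *y1)
{
    *x0 = w; *y0 = h; *x1 = -1; *y1 = -1;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (mask[y * w + x]) {
                if (x < *x0) *x0 = x;
                if (y < *y0) *y0 = y;
                if (x > *x1) *x1 = x;
                if (y > *y1) *y1 = y;
            }
}

/* new_guess = old_guess XOR difference, clipped to a box around the old guess. */
void update_guess(const unsigned char *prev, const unsigned char *cur,
                  const unsigned char *old_guess, unsigned char *new_guess,
                  int w, int h)
{
    int x0, y0, x1, y1;
    mask_bounds(old_guess, w, h, &x0, &y0, &x1, &y1);
    x0 -= NEIGHBORHOOD; y0 -= NEIGHBORHOOD;
    x1 += NEIGHBORHOOD; y1 += NEIGHBORHOOD;

    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int i = y * w + x;
            int moved  = abs(cur[i] - prev[i]) > DIFF_THRESH;
            int inside = x >= x0 && x <= x1 && y >= y0 && y <= y1;
            /* For a uniformly colored, translating blob, the old position
             * XORed with the frame difference approximates the new position. */
            new_guess[i] = (unsigned char)(inside && (old_guess[i] ^ moved));
        }
}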

This gives us a fairly noisy blob around the hand, from which a human can generally figure out what is going on, but which would be very difficult to process on a computer. To clean this up, we first perform three-center EM on the colors of the region, trying to segment it into hand and background. The three centers were chosen to capture the hand, a dark background, and bright windows and icons; the number may have to be increased to make the tracking robust over more general computer images. We then take the largest connected component of the resulting pixels, which helps eliminate noise and makes the next processing stages simpler. The result of this processing is generally a rather good outline of the hand, as shown below.
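The following is a simplified sketch of the color segmentation step: EM for a three-component mixture of spherical Gaussians over the RGB values in the region. The initialization, the iteration count, and the interface are our own assumptions; the real ColorSmooth.c may differ in these details.

#include <math.h>
#include <stdlib.h>
#include <string.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define K        3      /* hand (skin), dark background, bright windows/icons */
#define EM_ITERS 10     /* assumed number of EM iterations */

typedef struct { double mu[3], var, weight; } Component;

/* pixels: n RGB triples (0..255) from the current guess region.
 * labels[i] receives the index of the most responsible component. */
void em_segment(const double (*pixels)[3], int n, Component comp[K], int *labels)
{
    /* Crude initialization: dark, skin-like, and bright centers (assumed). */
    double init[K][3] = { {40, 40, 40}, {190, 140, 110}, {240, 240, 240} };
    for (int k = 0; k < K; k++) {
        memcpy(comp[k].mu, init[k], sizeof init[k]);
        comp[k].var = 1000.0;
        comp[k].weight = 1.0 / K;
    }

    double *resp = malloc((size_t)n * K * sizeof *resp);
    for (int it = 0; it < EM_ITERS; it++) {
        /* E-step: responsibility of each component for each pixel. */
        for (int i = 0; i < n; i++) {
            double total = 0.0;
            for (int k = 0; k < K; k++) {
                double d2 = 0.0;
                for (int c = 0; c < 3; c++) {
                    double d = pixels[i][c] - comp[k].mu[c];
                    d2 += d * d;
                }
                double p = comp[k].weight *
                           exp(-d2 / (2.0 * comp[k].var)) /
                           pow(2.0 * M_PI * comp[k].var, 1.5);
                resp[i * K + k] = p;
                total += p;
            }
            for (int k = 0; k < K; k++)
                resp[i * K + k] /= (total > 0.0 ? total : 1.0);
        }
        /* M-step: re-estimate means, variances, and mixing weights. */
        for (int k = 0; k < K; k++) {
            double sum = 0.0, mu[3] = {0, 0, 0}, var = 0.0;
            for (int i = 0; i < n; i++) {
                sum += resp[i * K + k];
                for (int c = 0; c < 3; c++)
                    mu[c] += resp[i * K + k] * pixels[i][c];
            }
            if (sum <= 0.0) continue;
            for (int c = 0; c < 3; c++) comp[k].mu[c] = mu[c] / sum;
            for (int i = 0; i < n; i++) {
                double d2 = 0.0;
                for (int c = 0; c < 3; c++) {
                    double d = pixels[i][c] - comp[k].mu[c];
                    d2 += d * d;
                }
                var += resp[i * K + k] * d2;
            }
            comp[k].var = var / (3.0 * sum) + 1e-6;
            comp[k].weight = sum / n;
        }
    }
    /* Hard assignment: label each pixel by its most responsible component. */
    for (int i = 0; i < n; i++) {
        int best = 0;
        for (int k = 1; k < K; k++)
            if (resp[i * K + k] > resp[i * K + best]) best = k;
        labels[i] = best;
    }
    free(resp);
}

The pixels labeled with the skin-like component are then handed to the connected-component stage (concomp.c), which keeps only the largest contiguous piece.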


 
 

Finally, we have to determine the location of the tip of the index finger. To do this we first calculate the discrete medial axis using a very simple algorithm: the edge of the hand stencil is repeatedly moved in by one pixel, and whenever this leaves a gap of only one or two pixels at some point, that point is marked as part of the medial axis. The resulting axis is very noisy, which tends to be the case with medial axes, so a number of cleanup passes must be applied. The first simplifies the medial axis down to a single path of pixels (except at branch points). Then, all branches that come within a certain distance (about 2 pixels) of the edge of the hand are removed. Finally, gaps along the medial axis are filled in. The medial axis is drawn in red, with its intensity indicating the distance to the edge.
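The following is a crude sketch of this erosion-based idea (our own simplification; the actual medialAxis.c, including the cleanup passes, is more involved). A pixel is marked as axis, with intensity equal to the erosion pass at which it was found, once the stencil around it is only one or two pixels wide:

#include <stdlib.h>
#include <string.h>

/* "Thin" = the horizontal or vertical run of foreground pixels through
 * (x, y) is at most two pixels wide. */
static int is_thin(const unsigned char *m, int w, int h, int x, int y)
{
    int left  = x > 0     && m[y * w + x - 1];
    int right = x < w - 1 && m[y * w + x + 1];
    int up    = y > 0     && m[(y - 1) * w + x];
    int down  = y < h - 1 && m[(y + 1) * w + x];
    return (left + right <= 1) || (up + down <= 1);
}

/* mask holds 0/1; axis receives the erosion pass at which each axis pixel
 * was found, i.e. roughly its distance to the edge (0 = not on the axis). */
void medial_axis(const unsigned char *mask, unsigned char *axis, int w, int h)
{
    unsigned char *cur = malloc((size_t)w * h), *next = malloc((size_t)w * h);
    memcpy(cur, mask, (size_t)w * h);
    memset(axis, 0, (size_t)w * h);

    for (int pass = 1; ; pass++) {
        int changed = 0;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                int i = y * w + x;
                if (!cur[i]) { next[i] = 0; continue; }
                if (is_thin(cur, w, h, x, y)) {
                    /* Only one or two pixels wide here: mark as axis, with
                     * intensity recording the distance to the edge. */
                    if (!axis[i]) axis[i] = (unsigned char)(pass < 255 ? pass : 255);
                    next[i] = 0;
                } else if (x == 0 || y == 0 || x == w - 1 || y == h - 1 ||
                           !cur[i - 1] || !cur[i + 1] ||
                           !cur[i - w] || !cur[i + w]) {
                    next[i] = 0;       /* peel one layer off the stencil edge */
                } else {
                    next[i] = 1;
                }
                changed |= (next[i] != cur[i]);
            }
        if (!changed) break;
        unsigned char *tmp = cur; cur = next; next = tmp;
    }
    free(cur); free(next);
}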

The index finger is located by evaluating an error function at the tip of each arm of the medial axis. The error function has terms for how long the arm is, how far the tip is from the last location of the tip, the average width of the hand surrounding the arm of the axis, and the standard deviation of that width. These terms are summed for each tip and the tip with the minimum sum is chosen. Currently the weights are rather arbitrary, but they could easily be trained given sufficient time. The chosen point is shown in the last image at the center of the (clipped) green square.
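A sketch of this scoring follows; the weights and the per-arm summary structure are assumptions for illustration, not the actual findPoint.c interface:

#include <float.h>
#include <math.h>

typedef struct {
    int    tip_x, tip_y;   /* end pixel of this arm of the medial axis */
    double length;         /* number of axis pixels along the arm      */
    double mean_width;     /* average distance-to-edge along the arm   */
    double sd_width;       /* standard deviation of that width         */
} ArmTip;

/* Assumed weights; in practice these would be tuned or trained. */
#define W_LENGTH  -0.5   /* longer arms are more finger-like (negative cost) */
#define W_DIST     0.3   /* prefer staying near the previous tip location    */
#define W_WIDTH    1.0   /* fingers are narrow                               */
#define W_SDWIDTH  2.0   /* fingers have nearly constant width               */

/* Returns the index of the arm whose tip we take as the new cursor position. */
int pick_finger_tip(const ArmTip *arms, int n_arms, int last_x, int last_y)
{
    int best = -1;
    double best_err = DBL_MAX;
    for (int i = 0; i < n_arms; i++) {
        double dx = arms[i].tip_x - last_x, dy = arms[i].tip_y - last_y;
        double err = W_LENGTH  * arms[i].length
                   + W_DIST    * sqrt(dx * dx + dy * dy)
                   + W_WIDTH   * arms[i].mean_width
                   + W_SDWIDTH * arms[i].sd_width;
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;
}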

The final result looks something like:

The hand is clipped at the edge of the table because something was wrong with the color of the frames we grabbed, which made skin the same color as the edge of the table. Since the edge does not change, we could simply use a mask that we set up at the beginning. The located hand is shown in pink, and the green square surrounds the tip of the finger.
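Applying this precomputed mask is trivial; a sketch (with our own names) is:

/* Pixels on the (static) table-edge region are excluded from the hand mask. */
void apply_edge_mask(unsigned char *hand_mask, const unsigned char *edge_mask,
                     int w, int h)
{
    for (int i = 0; i < w * h; i++)
        if (edge_mask[i])
            hand_mask[i] = 0;   /* edge pixels can never be part of the hand */
}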

Making the system operational:

At the moment, the system consists of a collection of mex files controlled from Matlab. As a result it is nowhere near real time (we cannot really evaluate the speed, since we were running it over X and most of the time was spent drawing pictures). The overall process makes approximately 7 passes over the entire image and about 10 over the local area containing the hand being tracked. Both of these numbers could be reduced somewhat without too much difficulty. Tying the system into the interactive room framework can be done without too much trouble, but we have not tried anything in that direction, nor have we done anything with the touchpad.

The final cursor location tends to jump around a bit due to our method for choosing the tip of the index finger. In practice introducing an extra frame or two of latency to smooth the cursor position would be advantageous.
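One simple option, not implemented in the current system, would be a three-frame median filter on each coordinate, which adds one frame of latency but suppresses single-frame jumps in the reported tip position:

static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }
    if (b > c) { int t = b; b = c; c = t; }
    if (a > b) { int t = a; a = b; b = t; }
    return b;
}

/* Call once per frame with the raw tip estimate; returns the smoothed position. */
void smooth_tip(int raw_x, int raw_y, int *out_x, int *out_y)
{
    static int hx[3], hy[3], n = 0;
    hx[n % 3] = raw_x;
    hy[n % 3] = raw_y;
    n++;
    if (n < 3) {               /* not enough history yet: pass through */
        *out_x = raw_x;
        *out_y = raw_y;
        return;
    }
    *out_x = median3(hx[0], hx[1], hx[2]);
    *out_y = median3(hy[0], hy[1], hy[2]);
}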
 

Results:

Here are a few videos we constructed from our tracking results.
 

Implementation details:

The bulk of our work is done in C files called from Matlab, with Matlab handling the loading and displaying. The parts that are still in .m files can easily be reimplemented in C if necessary. Feel free to use the code, but we have moved on to other things and will not be supporting it.

Matlab files:
myzoom.m - Computes the bounding box coordinates of the non-black regions in the picture to allow zooming in on these regions. This is purely for display purposes.
diff2pic.m - Computes the significantly different pixels between two pictures. This can easily be done in C.
mydisplay.m - Allows nice displaying of pictures.
parse.m - The main loop that runs our algorithm on every image in the stream. This will eventually be replaced by the Active Streams architecture.
C files:
medialAxis.c - Computes the medial axis of a given object and simplifies it.
findPoint.c - Finds the position of the tip of the index finger.
concomp.c - Finds the largest connected component.
ColorSmooth.c - EM cleaning of the image with 3 color centers (light, dark, skin color).
InitialR.c - Finds the center of the motion and grows a rectangle around it to initialize the system.
bounds.h, connect.h, drawline.h, matlab_utils.h, smooth.h, uint8.h - Header files containing many utility functions.
