We decided not to try to recognize clicks directly from the video, since we could not find any convenient gesture that could be tracked reliably. The natural thing to do is to touch the table, but an overhead camera clearly cannot see that contact. Moving the fingers sideways proved awkward enough to make any such motion unsatisfactory.
The camera recorded 720x480 frames at 30 frames per second.
This gives us a fairly noisy blob around the hand, from which a human can generally tell what is going on but which would be very difficult to process automatically. To clean it up, we first run three-center EM on the colors of the region, segmenting it into hand and background. The three centers were chosen to capture the hand, a dark background, and bright windows and icons; the number of centers may have to be increased to make the tracking robust over more general screen content. We then keep the largest connected component of the resulting pixels, which eliminates most of the noise and simplifies the later processing stages. The result is generally a rather good outline of the hand:
Finally, we have to determine the location of the tip of the index finger. To do this we first compute a discrete medial axis using a very simple algorithm: the edge of the hand stencil is repeatedly moved inward by one pixel, and whenever this leaves a gap of only one or two pixels at a point, that point is marked as medial axis. The resulting axis is very noisy, as medial axes tend to be, so several cleanup passes follow. First, the axis is thinned to a single path of pixels (except at branch points). Then, all branches that come within about 2 pixels of the edge of the hand are removed. Finally, gaps along the axis are filled in. The medial axis is drawn in red, with its intensity encoding the distance to the edge.
The index finger is located by evaluating an error function at the tip of each arm of the medial axis. The error function has terms for the length of the arm, the distance from the tip to its last known location, the average width of the hand surrounding the arm, and the standard deviation of that width. These terms are summed for each tip and the tip with the minimum sum is chosen. Currently the weights are rather arbitrary, but they could easily be trained given sufficient time. The chosen point in the last image is at the center of the (clipped) green square.
The final result looks something like:
The hand is clipped at the edge of the table because something was wrong with the color of the frames we grabbed, which made skin the same color as the edge of the table. Since the edge did not change, we could simply apply a mask set up at the beginning. The located hand is shown in pink, and the green square marks the tip of the finger.
The final cursor location tends to jump around a bit because of how we choose the tip of the index finger. In practice, introducing an extra frame or two of latency to smooth the cursor position would be advantageous.
Source files:
myzoom.m | Computes the bounding-box coordinates of the non-black regions of the picture to allow zooming in on those regions. Purely for display purposes. |
diff2pic.m | Computes the significantly different pixels between two pictures. This could easily be done in C. |
mydisplay.m | Displays pictures nicely. |
parse.m | The main loop that runs our algorithm on every image in the stream. It will eventually be replaced by the Active Streams architecture. |
medialAxis.c | Computes the medial axis of a given object and simplifies it. |
findPoint.c | Finds the position of the tip of the index finger. |
concomp.c | Finds the largest connected component. |
ColorSmooth.c | EM cleaning of the image with 3 color centers (light, dark, skin color). |
InitialR.c | Finds the center of the motion and grows a rectangle around it to initialize the system. |
bounds.h, connect.h, drawline.h, matlab_utils.h, smooth.h, uint8.h | Header files containing many utility functions. |