IMPLEMENTED ALGORITHM
Our segmentation and tracking algorithm can be separated according to this
flowchart: The segmentation is the top box, and the tracking is the bottom box.
Image Segmentation
During skin detection, an initial silhouette is manually traced over the diver's initial pose. Those pixel values inside the silhouette are used to compute the mean HSV value of the skin.
Using this mean value for skin pixels, we iterate over all pixels in the image and remove any pixels whose values are greater than a given deviation:
However, this method proved insufficient since most of the background remained. Next we used a simple background subtraction algorithm which takes the last n frames of the video and computes an average per pixel. Let S(x,y) be the sum of pixels (over n frames), and Sq(x,y) be their squares. Then at each x,y location, m(x,y) = S(x,y)/n and the standard deviation is sigma(x,y) = sqrt(Sq(x,y)/n - (S(x,y)/n)^2). If a pixel at x,y satisfies the condition: abs(m(x,y) - p(x,y) > 3sigma(x,y) then it's considered part of the foreground. [OpenCV pg32]These averages are compared to the frame in question and if this difference is greater than some deviation then that pixel is considered a foreground pixel. Note: We decided not to use the bayesian skin classifier because we wanted to see if a simpler method existed and we did not want to force a restriction of having a training set.
Since much of the image has noise (ie: the fence in the figure above), we can apply certain morphology filters (erode) to help remove such noise. A morphology filter uses an image A and and a structuring element B (aka kernel). The filter was originally used for set operations. If we let B_t be the translation of B around the image, then the dialate of A by B is: ![]() Similarly, the erosion of an image A is given by: ![]() Basically, the dilate filter "grows" pixels around existing pixels and the erode filter trims the edges. ![]() Illustration of Dilating and Eroding Diagrams and equations taken from OpenCV Manual pg75. After applying a 5x5 Gaussian and the erode morphology filter. However, there still is a lot of noise after the erosion, so we tried applying a locality filter where hypothesized silhouette values are "local" to the previous silhouette, but this turned out disastrous since the locality constraint couldn't keep up with the moving diver. See Bloopers.
Tracking
So instead we used a ConDensation tracker that simply tracked the centroid
of the skin-colored pixels. The center of the window corresponds to the
centroid. In most of our footage, our tracker seemed to follow the diver
pretty well.
We used the window as a mask to zero out all the pixels outside the
window.
Disparity Calculation We then convolve the MHI image with two separable PreWitt kernels to extract the X and Y magnitudes of the gradient vector. Then the orientation is simply: Note that since intensity values on the silhouette image are discrete, all non-zero gradients are clustered around the "steps" and edges against the background. However, the gradient at a particular point is meaningful only when it lies on the step, and therefore, we must zero out all the vectors adjacent to the background pixels (i.e: black) before computing the global vector. A global vector is simply an aggregate of all the gradients, normalized by the total number of valid points. It signifies the general direction of the blob. By successively adding the global vectors over the time starting to absolute coordinate (0,0), we can derive an approxipate trajectory in 2D space. Another way of determining the general direction of the blob is by computing the centroids of the blob in question. Once again, motion history images are used because the successive layers of the past tend to stabilize the movement of the centroids over time. More weight is given to the more recent layer. Both methods are highly sensitive to garbage left over by the segmentation process. Fortunately, the condensation tracking filter, which defines a rectangular region of interest and zeros out pixel values outside of the region, improves the accuracy of the centroid computation significantly. Each of the above two methods returns a set of (x_i,y_i) coordinates, which may be used to fit a polynomial function. Since the camera is stationary and that the diver must obey the law of gravity, the trajectory should essentially resemble the shape of a paranola. However, we discovered that a simple quadratic function did not fit well due to photages containing scenes beyond the window of the actual dive, such as divers submerged under water, or ducking on the platform prior to jumping into the air. Instead of a second-degree polynomal function, we decided to fit a third-degree polynomal function using the least square solution to allow a certain degree of flexibility. One advantage of using trajectory curves as the basis of comparison is that their characterizations are invariant to the rate of video sampling. We don't have to worry about comparing a sequence played back at normal speed with another shown in slow motion. Another advantage is that it models a physical constraint, the law of gravity, conveniently. This constaint restricts the form of the curve function so it's not completely arbitrary. Finally, a prior knowledge like newtonian mechanics could be used to predict the centroid location in the next frame. I imagine this could be useful to further improve the performance of the tracker. After fitting polynomial curves onto the trajectories, we simply computed the affine mapping from one curve to another. We ignored translation and rotation, so all we were left with was scaling in the x- and y-directions. If the two divers were perfectly synchronized, this matrix would be the identity. So, we compared the affine mapping to the identity to arrive at a disparity measure. We penalized equally for differences of scale in x and y, but perhaps if we knew more about diving then we'd penalize more for one than the other.
Planar Tracking (cardboard people) Constraints:
Simplifications:
Therefore, our internal model is in a 7-dimensional space: A projection of the internal model. The tracking begins with a manual initialization of the model matching the first frame of the photage. Automating this would probably be difficult, so we did not attempt. At every frame, we have to somehow adjust the parameters of the model that should best fit the diver on the next frame. The search of the configuration space follows this form:
Ideally, we want to be able to traverse the entire space of configuration to find the globally optimal point, but that's obviously too expensive. An alternative is to search around the immediate neighbors, but that also is very expensive because in a 7D-space, there are 14 possible individual adjustments, and testing all immediate neighbors require 2^14 combinations (i.e: power set). Instead, we will adjust one parameter at a time, and allow N iterations in the hope that the search algorithm will locate the local minimum (error). In a way, this models the physical constraint where body parts may be not move across the 2D space too much in a certain amount of time, such as time elasped between each consecutive frames. Note that not testing all immediate neighbors might not lead to a locally minimum because of possible "ridges" in the search space. This is the price we pay for the simplification. We found that (N=10, translational_delta=4pixels, and angular_delta=4degrees) was effective while keeping the computation cost acceptable. The difference function is defined simply as: The projection of our internal model (red) superimposed onto an intensity map. Using an correspondence algorithm such as this frequently leads to an irrecoverable divergence. It's particularly distastrous if there are a lot of garbage left over from the segmentation stage. Fortunately, the condensation tracker at least discards the junk outside of the window, so the risk of the model leaving the intended region is minimized. Furthermore, the successive layers of the motion history image help pin down the model to the right place. The output of the cardboard tracker is a sequence of model configurations. Much more information about the diving performance is extracted this way compared to the simple center of mass trajectory. Complex characterizations of a dive, such as somersaults and torso orientation, can be easily derived from this model. However, this tracker is much more susceptible to mistracking due to divergence, and it's highly computationally expensive. |