
One consistent trend in electronics has been the “minification” of devices, and digital cameras have been subject to this movement from the start. On the one hand, these smaller devices sell extremely well; on the other, the small frame stands in the way of decent-looking flash photography. Placing a flash close to the optical axis creates unpleasant “frontal” (“deer in the headlights”) exposure, generating a booming flash-dome, increasing red-eye effects, and producing unflattering wrinkles in portraits. Furthermore, a small flash cannot fully illuminate a room in natural-looking light and often results in aggressive, cold white light. Numerous top-selling cameras, such as the Canon PowerShot or Nikon’s Coolpix series, produce such artifacts.
An alternative would be non-flash photography with a more sensitive imaging sensor. Unfortunately, this comes at the cost of greater noise or, if the exposure time is increased instead, a blurred image. On the positive side, a flash image has outstanding sharpness, while the “non-flash” image produces more “pleasant” colors.
With the emergence of CMOS image sensors, a digital camera will be able to take two images in rapid succession with varying exposure, sensitivity and lighting settings. We propose to investigate techniques for combining the strengths of the flash and non-flash images to produce a single sharper, less noisy and more uniformly lit image.
Shadows and noise not only affect the appearance of an image but may also cause an image segmentation algorithm to detect object boundaries in incorrect locations. In acquiring our two images, we gain knowledge about the location of shadows and highlights and can therefore attempt to improve accuracy in segmentation.
Final Power Point Presentation
Initial Project Scope and Value Proposition
Given two images, one taken with a flash, the other without, we hope to combine the information from both to (minimum):
After determining the shadow, specular and background locations we hope to achieve the following goal as well (best case):
Ultimately we hope this will yield:
We ended up exploring the following scenarios in our project:
Ideally we would like to have used a camera with the following exposure program:
While such cameras will be available in the foreseeable future, they do not currently exist, so we had to adjust the exposure settings manually between shots. This required us to secure the camera tightly on a static surface.
Furthermore, the intrinsic parameters should be the same in both images, which is easily accomplished. To avoid registration, though, the extrinsic parameters should remain constant as well. This holds true if both images can be taken within 1/30 of a second. If this is not possible, a tripod must be used, or the images have to be downsampled in resolution to correct for the disparity.
We assumed the flash image would always yield sharp results, whereas the non-flash image would vary in blurriness. If an image is too blurry or misaligned, it has to be downsampled in resolution. The following chart illustrates in which regions the algorithms hold.

Figure - Algorithm validity based on input image quality
Object Cutout
It is often useful to segment an object from its background, for example to create images for eBay auctions or “transparent” GIFs with a non-rectangular shape:



In the general case where nonuniformities exist in object and background, this is a difficult problem. Techniques based on finding closed contours seem appealing, but objects are often not completely separated from their background by sufficiently strong contours. Even if they were, finding this contour would be difficult because sharp gradients may also exist within the background or object.
The simple “Magic Wand” in Photoshop has the desirable property of finding patches of relatively uniform color, but often fails because of luminance variances in the background. These variances may result from rough texture, varying reflectance, shadows, or lighting nonuniformity (as in the flash halo).


The Magic Wand allows for background variance in a crude way, by letting the user define a tolerance threshold, thereby carving a “box” in color space centered on the selected point. However, if we examine the actual distribution of background pixels, we will most likely find this “box” to be a poor description of our background. The Magic Wand tool assumes that the box width is equal in all dimensions, and that the selected pixel lies at the centroid of our background data. Use of the Magic Wand therefore results in either over- or under-detection. To obtain good results, the user must tweak the tolerance and carefully select the initial background pixel.
Our Approach
We therefore propose a generalized “Magic Wand” that allows the user to mark several background pixels. We use this training data to create a hypercube in YCrCb space. We choose this color space because it tends to minimize correlation between bands. This low covariance is necessary if we are drawing cubes whose boundaries lie parallel to the component axes.
The cube dimensions are padded with a small tolerance, if desired (if a good sampling of background pixels are chosen, this is not absolutely necessary). Each pixel in the entire image is classified as lying within this cube or not. This will give us a binary detection mask. We initially define the background as the location of all 1’s connected to our first training pixel. We then define the “object” as the largest cluster of 0’s. Our final background mask includes everything but this object.
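The single-box classifier described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the project's Matlab code: the BT.601 conversion constants, the use of 4-connectivity, and the helper names `rgb_to_ycbcr`, `fit_box`, `classify`, `flood_fill` and `largest_cluster` are all choices made for this example.

```python
from collections import deque

import numpy as np


def rgb_to_ycbcr(rgb):
    """Convert float RGB in [0, 1] to YCbCr (BT.601 weights, chroma centered at 0.5)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 + 0.564 * (b - y)
    cr = 0.5 + 0.713 * (r - y)
    return np.stack([y, cb, cr], axis=-1)


def fit_box(train_pixels, tol=0.0):
    """Axis-aligned bounding box of the training pixels, padded by tol."""
    return train_pixels.min(axis=0) - tol, train_pixels.max(axis=0) + tol


def classify(image, box):
    """Binary detection mask: True where a pixel lies inside the box."""
    lo, hi = box
    return np.all((image >= lo) & (image <= hi), axis=-1)


def flood_fill(mask, seed):
    """4-connected region of pixels sharing the value mask[seed], grown from seed."""
    h, w = mask.shape
    target = mask[seed]
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= rr < h and 0 <= cc < w and not region[rr, cc] \
                    and mask[rr, cc] == target:
                region[rr, cc] = True
                queue.append((rr, cc))
    return region


def largest_cluster(mask, value):
    """Largest 4-connected cluster of pixels equal to value."""
    best = np.zeros(mask.shape, dtype=bool)
    seen = np.zeros(mask.shape, dtype=bool)
    for r, c in zip(*np.nonzero(mask == value)):
        if not seen[r, c]:
            region = flood_fill(mask, (r, c))
            seen |= region
            if region.sum() > best.sum():
                best = region
    return best


def segment(rgb, train_coords, tol=0.05):
    """Return (object mask, background mask) from user-marked background pixels."""
    ycc = rgb_to_ycbcr(rgb)
    rows, cols = zip(*train_coords)
    detect = classify(ycc, fit_box(ycc[list(rows), list(cols)], tol))
    background = flood_fill(detect, train_coords[0])  # 1's connected to first pixel
    obj = largest_cluster(background, False)          # largest cluster of 0's
    return obj, ~obj
```

Calling `segment(img, [(0, 0), (0, 1)])` on a float RGB array returns the object mask and the final background mask; isolated misclassified pixels outside the largest cluster end up in the background, as the text describes.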
Once we have our background mask, we can use it to create images of the segmented object. It is usually a good idea to erode the object by a few pixels and to Gaussian-blur the mask to obtain an alpha (transparency) mask. If this is done, the edges of the segmented object will blend more naturally with the new background.
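The erode-then-blur step can be illustrated with a short NumPy-only sketch. The 3x3 cross structuring element, the kernel radius of three sigma, and the function names `erode`, `gaussian_blur`, `alpha_mask` and `composite` are assumptions made for this example; the original Matlab code may differ.

```python
import numpy as np


def erode(mask, iterations=1):
    """Binary erosion with a 3x3 cross; pixels outside the image count as 0."""
    out = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        p = np.pad(out, 1, constant_values=False)
        out = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
               & p[1:-1, :-2] & p[1:-1, 2:])
    return out


def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D array with edge replication."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(img, dtype=float), radius, mode="edge")
    smooth = lambda v: np.convolve(v, kernel, mode="valid")
    rows = np.apply_along_axis(smooth, 1, padded)   # blur along each row
    return np.apply_along_axis(smooth, 0, rows)     # then along each column


def alpha_mask(obj_mask, erode_px=2, sigma=1.5):
    """Shrink the object slightly, then soften its edge into an alpha matte."""
    return gaussian_blur(erode(obj_mask, erode_px), sigma)


def composite(foreground, background, alpha):
    """Blend an RGB foreground over a new RGB background using the alpha matte."""
    a = alpha[..., None]
    return a * foreground + (1.0 - a) * background
```

Eroding first keeps a thin rim of leftover background color out of the cutout; the blur then turns the hard edge into a gradual transition.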
Use of Flash and Non-Flash Images for Improved Segmentation
As we see below, shadows may exist in the flash and non-flash images. To fully segment an object based on one of these images, someone using our tool would need to explicitly select training pixels in the shadowed region.


However, we can exploit the fact that our shadows look different in the flash and non-flash images. The user can look at either image and select a polygon defining some background pixel locations. These locations correspond to training data in both the flash and non-flash image. The flash image training data is used to classify pixels in the flash image, and the same is done for the non-flash image. The result will be two binary masks. Our hope is the following:
As we see below, this can yield improved segmentation results:


In the above image, the light gray area corresponds to where a background was detected in only the non-flash image. The dark gray area corresponds to where a background was detected in only the flash image. By simply combining the two masks, we get good segmentation and shadow removal.
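A minimal sketch of the mask combination follows. It assumes that "combining" means taking the union of the two background masks, which matches the gray-area description above but is not confirmed by the original code; the function name `combine_background_masks` is hypothetical.

```python
import numpy as np


def combine_background_masks(bg_flash, bg_noflash):
    """Union of two background masks, plus the regions seen by only one image.

    Assumption: a pixel is background if either image classifies it as such.
    """
    bg_flash = np.asarray(bg_flash, dtype=bool)
    bg_noflash = np.asarray(bg_noflash, dtype=bool)
    background = bg_flash | bg_noflash
    only_flash = bg_flash & ~bg_noflash      # dark gray area in the figure
    only_noflash = bg_noflash & ~bg_flash    # light gray area in the figure
    return background, only_flash, only_noflash
```

A shadow that is missed as background in one image is recovered as long as the other image classifies that location correctly.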
Another result




As we see below, our algorithm can succeed even when the background is considerably textured:




However, if the background contains several colors, which surround the object data in color space, a single box will perform poorly. Thus we wish to generalize our training algorithm to allow the creation of multiple boxes, or “nodes”. We present the training data pixel by pixel to our network. If the pixel does not fall into a previously created box, we measure its L1 distance to the nearest box. If the distance is within a tolerance, that box is expanded to contain the training pixel. Otherwise, a new node is created. The lower we set our tolerance, the more boxes we will create. This general idea is a simplified version of Fuzzy ARTMAP, a learning algorithm developed by Grossberg and Carpenter at Boston University.
Clearly, the resulting hypercubes will depend on the order in which we present our training data to the algorithm. For this reason, it is useful to create multiple (~5) classifiers, each created by a randomly ordered presentation of training pixels. The “committee” of classifiers then “votes” on each data pixel.
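The multi-box training and the committee vote can be sketched as follows. This is a simplified illustration, not Fuzzy ARTMAP itself and not the original Matlab code: the majority-vote rule, the fixed random seed, and the names `l1_distance_to_box`, `train_boxes`, `in_any_box` and `committee_classify` are assumptions made for this example.

```python
import numpy as np


def l1_distance_to_box(x, lo, hi):
    """L1 distance from point x to the axis-aligned box [lo, hi]; 0 if inside."""
    return float(np.sum(np.maximum(lo - x, 0.0) + np.maximum(x - hi, 0.0)))


def train_boxes(pixels, tol):
    """Present training pixels one by one, growing or creating boxes ("nodes")."""
    boxes = []
    for x in pixels:
        if boxes:
            dists = [l1_distance_to_box(x, lo, hi) for lo, hi in boxes]
            i = int(np.argmin(dists))
            if dists[i] == 0.0:
                continue                      # already inside an existing box
            if dists[i] <= tol:
                lo, hi = boxes[i]             # expand the nearest box
                boxes[i] = (np.minimum(lo, x), np.maximum(hi, x))
                continue
        boxes.append((x.copy(), x.copy()))    # otherwise create a new node
    return boxes


def in_any_box(points, boxes):
    """True for every point that falls inside at least one box."""
    inside = np.zeros(len(points), dtype=bool)
    for lo, hi in boxes:
        inside |= np.all((points >= lo) & (points <= hi), axis=1)
    return inside


def committee_classify(train, points, tol, n_members=5, seed=0):
    """Train several classifiers on shuffled data and take a majority vote."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(points), dtype=int)
    for _ in range(n_members):
        order = rng.permutation(len(train))
        votes += in_any_box(points, train_boxes(train[order], tol))
    return votes > n_members / 2
```

With two well-separated background colors, a small `tol` yields two boxes, so an object color lying between them in color space is no longer swallowed by one large box.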
Main Matlab Functions
[obj_mask] = fuzzy_segment('imageName.jpg', tol, mult)
Segments 'imageName.jpg'. The image is displayed and the user is prompted to define a polygon of background pixels for training. One or more hypercubes are created, padded in each dimension by tol (0.1 works fine) and used for classification. If mult equals zero, we create only one hypercube. If mult is set to one, we allow the creation of multiple nodes, and this process will be governed by our tolerance parameter as described earlier. The final result is obj_mask, a binary mask marking the object with 1’s.
[obj_mask] = fuzzy_segment2('flash.jpg', 'noflash.jpg', tol, mult)
Similar to fuzzy_segment, but now both flash and nonflash images are inputs. Once the polygon of background pixel locations is defined, training and classification are run on both images. The results are combined to produce obj_mask.
mask_image(mask, 'image.jpg', resultname)
Once we have our binary detection mask, we use it to segment 'image.jpg'. Erosion and blurring are performed to produce an alpha mask, which is layered with the original image to yield our final product.
FlashSharpen
FlashSharpen tries to enhance image quality by combining the strengths of flash photography with the strengths of natural-light imaging.
This could improve photography in low-light conditions such as a "candle light scene", indoor events or parties.
The idea behind FlashSharpen is to combine the natural-light image’s low-frequency components with the flash image’s high-frequency components.

Figure - which image contributes to what part of the frequency band

Figure - Sharpening a blurred image with a flash image (more results)
The interesting intellectual challenge now becomes how to set the cut-off frequency for the merge, that is, when to use the flash image and when not to. From a signal processing point of view, we could filter the whole image with a known cut-off filter (we used a Gaussian convolution kernel), keeping the low-pass part of the natural image and the high-pass part of the flash image.
In practice, though, some of the flash image's high-frequency information may be perceived as unpleasant by the viewer. Shadows are one example. We therefore implemented a shadow estimation algorithm that rejects high-frequency information from areas with shadows.
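The basic merge can be written in a few lines. The sketch below works on single-channel float images and uses the same Gaussian kernel for both the low-pass and the high-pass; the default `sigma`, the edge-replicating padding and the function names are assumptions made for this example.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian low-pass of a 2-D array with edge replication."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(img, dtype=float), radius, mode="edge")
    smooth = lambda v: np.convolve(v, kernel, mode="valid")
    rows = np.apply_along_axis(smooth, 1, padded)
    return np.apply_along_axis(smooth, 0, rows)


def flash_sharpen(noflash, flash, sigma=2.0):
    """Low frequencies from the natural-light image, high from the flash image."""
    low = gaussian_blur(noflash, sigma)           # colors and overall lighting
    high = flash - gaussian_blur(flash, sigma)    # edges and fine detail
    return low + high
```

A larger `sigma` lowers the cut-off frequency, so more of the result comes from the flash image; a smaller `sigma` keeps more of the natural-light image.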
FlashSharpen++
FlashSharpen++ extends FlashSharpen by computing a shadow map to help the merge. However, this algorithm requires a moderately sharp natural-light image.
FlashSharpen++ using shadow detection (more results)
As you can see from the figure above, the flash image contains an unwanted shadow around the glue stick. The shadow map indicates the regions in which the FlashSharpen algorithm may hold. Finally, we fuse the information in the last image where acceptable. If we had not accounted for the shadow, a "ghost" image would appear in the final result (example).
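One simple way to apply a shadow map during the merge is a per-pixel blend, sketched below. It assumes the shadow map is a weight in [0, 1] (1 meaning "shadow, do not trust the flash detail") and falls back to the natural-light image there; the function name `fuse_with_shadow_map` is hypothetical, and the original implementation may weight the images differently.

```python
import numpy as np


def fuse_with_shadow_map(sharpened, noflash, shadow_map):
    """Use the FlashSharpen result outside shadows and the natural image inside.

    shadow_map may be binary or soft; values are clipped to [0, 1].
    """
    w = np.clip(np.asarray(shadow_map, dtype=float), 0.0, 1.0)
    return w * noflash + (1.0 - w) * sharpened
```

Blurring the shadow map slightly before fusing would soften the transition between the two sources.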
For shadow detection we implemented three approaches and compared their effectiveness:
First we used a color-based probabilistic approach to detect shadows. The assumption was that a shadow is a rarely occurring dark region in the image; a specular highlight, conversely, is a rarely occurring bright spot. The final segmentation can be cross-verified between the two input images. However, this approach failed in cases where the image had no shadows. Furthermore, in some images a shadow could be commonplace, violating our assumption that it lies at the extreme of the color histogram. For example, thin objects cast much smaller shadows than thick ones. If the camera's flash is close to a plane, a large shadow may occur as well, as the glue stick example above illustrates.
Then we tried to detect shadows based on luminosity differences only. To make the comparison more meaningful, we first histogram-matched the flash image to the non-flash image. While this approach worked for images that looked relatively similar, it would fail miserably on images that diverged greatly in their luminosity distribution. In those cases the histogram match operation would introduce unwanted contours in the image, often not resembling the shadow at all.
The final attempt was a combined color and frequency approach. Here a significant difference in high-frequency components would be taken as an indicator for a flash/specular edge. Obviously, this technique requires the natural-light image to be reasonably sharp as well. Then the region around the shadow edge was verified to be an abnormality using a color-based comparison (is the surrounding area darker than its counterpart area in the other image?). Ultimately this yielded the best result. Still, on close inspection the images do not fuse perfectly: you can still see a ghost version of the shadow in the fused image (example). One explanation for this is that we did adjust our cut-off frequencies based on the computed shadow mask. Since we performed the merge in the image domain only, we may have introduced discontinuities in our frequency band.
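The second approach above (histogram matching followed by a luminosity difference) can be sketched as follows. The CDF-based matching is a standard technique rather than the project's exact code, and the threshold value and the names `match_histogram` and `shadow_candidates` are assumptions made for this example.

```python
import numpy as np


def match_histogram(source, reference):
    """Remap source values so their distribution follows that of reference."""
    src_vals, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    mapped = np.interp(src_cdf, ref_cdf, ref_vals)  # same quantile in reference
    return mapped[src_idx].reshape(source.shape)


def shadow_candidates(flash_lum, noflash_lum, threshold=0.1):
    """Pixels where the matched flash image is much darker than the non-flash one."""
    matched = match_histogram(flash_lum, noflash_lum)
    return (noflash_lum - matched) > threshold
```

Because the matching works on ranks only, it cannot compensate for lighting that differs spatially between the two shots, which is consistent with the failure cases described above.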
Overall, the results are very promising, and it would be interesting to combine our technique with a CMOS camera such as the Canon EOS-D60.
Red Eye Location
Another type of "abnormality" introduced by flash photography is the dreaded "red-eye". Luckily, though, the location of red-eye is rather easily recovered given our two images.
Locating Red-eye using a flash and non flash image (more results)
The idea here is to compute the "red" difference in both images and keep the extreme points. Here the challenge becomes which color space to use. We investigated the following color spaces.
Since we are comparing a flash and a non-flash image, we need a luminosity-independent red measure. This is where CIE L*a*b* stood out and yielded great results.
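For reference, the a* (green-red) channel can be computed from sRGB with the standard conversion through CIE XYZ. The sketch below assumes sRGB input in [0, 1] and a D65 white point; the original page does not state which RGB working space or white point was used, and the function name `lab_a_channel` is hypothetical.

```python
import numpy as np

# Linear sRGB to CIE XYZ (D65) matrix and the matching reference white.
_RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                        [0.2126729, 0.7151522, 0.0721750],
                        [0.0193339, 0.1191920, 0.9503041]])
_WHITE = np.array([0.95047, 1.0, 1.08883])


def lab_a_channel(rgb):
    """Return the CIE L*a*b* a* channel for sRGB values in [0, 1]."""
    rgb = np.asarray(rgb, dtype=float)
    # Undo the sRGB transfer curve to get linear light.
    linear = np.where(rgb <= 0.04045, rgb / 12.92,
                      ((rgb + 0.055) / 1.055) ** 2.4)
    xyz = linear @ _RGB_TO_XYZ.T / _WHITE
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz),
                 xyz / (3.0 * delta ** 2) + 4.0 / 29.0)
    return 500.0 * (f[..., 0] - f[..., 1])   # a* = 500 (f(X/Xn) - f(Y/Yn))
```

Neutral grays map to an a* of approximately zero at every brightness, which is the property that makes the channel a useful red measure across differently lit images.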
First we segment the "a" difference image using a low threshold of "greater than maximum - 30". This gives a binary image with a few probable blobs that could be red-eye. The next step is to keep only the blobs that pass the high threshold test of "greater than maximum - 5". At this point we obtain a rather robust classification: unless another surface in the image reflects red light, no other difference can come close to the maximum. As a corrective step, one could fit a model of an eye back into the image and seep any retained non-red color information into the red-eye area. A desaturate-and-darken operation works as well.
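The two-threshold selection above can be sketched as a small hysteresis step on the a* difference image. The inputs are assumed to be precomputed a* channels of the two registered images; the use of 4-connectivity, the sign of the difference (flash minus non-flash) and the function name `locate_red_eye` are assumptions made for this example.

```python
from collections import deque

import numpy as np


def locate_red_eye(a_flash, a_noflash, low_margin=30.0, high_margin=5.0):
    """Keep low-threshold blobs that contain at least one high-threshold pixel."""
    diff = np.asarray(a_flash, dtype=float) - np.asarray(a_noflash, dtype=float)
    peak = diff.max()
    low = diff > peak - low_margin     # candidate blobs
    high = diff > peak - high_margin   # strong evidence of red-eye
    keep = high.copy()
    queue = deque(zip(*np.nonzero(high)))
    h, w = low.shape
    while queue:                       # grow from strong pixels through each blob
        r, c = queue.popleft()
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= rr < h and 0 <= cc < w and low[rr, cc] and not keep[rr, cc]:
                keep[rr, cc] = True
                queue.append((rr, cc))
    return keep
```

The returned mask marks the pixels to correct, for instance with the desaturate-and-darken operation mentioned above.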
Conclusion
Digital cameras are revolutionizing photography and will surpass film in capability. Whereas previously better film chemicals and optics were the main distinguishing factors of a high-quality camera system, it is clear that chemistry will be replaced by computer vision, signal processing and computer graphics. The algorithms presented on this page are a humble attempt to show what the future may hold.
Copyright © 2002 Georg Petschnigg and Mike Braun