In a sense, the constraints specify "common sense" about how we expect architecture to be constructed. (Debevec's algorithm would have as much trouble as a human would in trying to make sense of an M.C. Escher drawing, which defies "common sense.") By applying the constraints of a particular domain (in Debevec's case, architecture), the search space of possible objects is reduced to a tractable size.
The goal of my project would be to develop a representation in which "common sense" constraints can be written about arbitrary domains, not just architecture. The representation (or "language") would describe the total search space of possible models in the given domain. Each possible variation would be described by a number with some default value and a measure of confidence in that default value, in the absence of further evidence. The algorithm would explore the different possible variations of each parameter and determine which variations best match the given photo(s). As it determined correspondences between the model and the photo, it would update the parameter values and the confidence values.
I hope to demonstrate that such a language can be built up with appropriate C++ objects. Each object would list its variable parameters (and confidences), its possible child objects, and functions for determining how the "appearance" of the object in a photograph should map into the parameters.
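As a rough illustration, the core of such a representation might look like the following sketch. All names here (Parameter, DomainObject, and so on) are hypothetical placeholders, not a settled design:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// One tunable parameter of a model: a default value plus a measure of
// confidence in that default, in the absence of further evidence.
struct Parameter {
    std::string name;
    double value;       // current best estimate (starts at the default)
    double confidence;  // 0 = pure guess, 1 = certain
};

// A node in the domain description: its own parameters, its possible
// child objects, and (eventually) functions that map the object's
// appearance in a photograph back into the parameters.
class DomainObject {
public:
    virtual ~DomainObject() = default;

    std::vector<Parameter> parameters;
    std::vector<std::unique_ptr<DomainObject>> children;

    // virtual void mapAppearance(const Photo& photo);
    // The Photo type and the matching machinery are deliberately left out.
};
```

Concrete domains would then derive from the base class and inherit the common appearance-mapping functions.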
For example:
The Domain of Cups/Mugs:
Cups and mugs generally consist of a central section, which 95% of the time (roughly) is an object of revolution, and may have a handle (30% yes, 70% no). The handle (if it exists) generally has the topology of a half-torus, and generally exhibits centerline symmetry. The outline of the handle and the mug can be parameterized with an array of numbers (as a parametric curve, a spline, a displacement map, etc.).
Functions that derive silhouettes against a colored background can be used to determine the shape of the cup, and a function that detects a torus topology can be used to positively identify a handle.
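Written as data in that style, the cup/mug priors above might look like this. This is a hypothetical sketch; the struct names, and the choice to store plain prior probabilities, are assumptions for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

// A boolean property with a prior probability, before any photo evidence.
struct BooleanPrior {
    std::string name;
    double probabilityTrue;
};

// The cup/mug domain, with the rough priors stated in the text.
struct CupDomain {
    BooleanPrior bodyIsObjectOfRevolution{"body is an object of revolution", 0.95};
    BooleanPrior hasHandle{"has a handle", 0.30};

    std::vector<double> bodyOutline;    // parametric outline of the body
    std::vector<double> handleOutline;  // outline of the half-torus handle
};
```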
Let's assume that the user has written these constraints somehow in our language. Then the user shows the program a photo of a specific cup, and asks the program to try to reconcile the photo with the constraints. First, the program would make a guess at the size and shape of an "average" cup (assuming it is an object of revolution, since this is the most likely case). Then it would try to match the model with the photo. It would try perturbing different parameters, and see whether each perturbation creates a better match or a worse match.
For example, one parameter would be whether the cup has a handle. The program would try matching a model with and without the handle, and pick the one that more closely matched the photo. If both models worked out equally well (for example, if the handle was hidden from that view), then it would pick the more probable model, but assign a low confidence value to that decision. In this case, it would decide, with low confidence, that the cup does not have a handle. ("No handle" has the higher prior probability, because it is more easily disproved.) If another photo shows the handle, then it would change the parameter to "cup has a handle," this time with high confidence.
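The handle decision above can be sketched as a small function. The match scores, the tie threshold, and the confidence numbers below are all placeholders for the real photo-matching machinery:

```cpp
#include <cassert>
#include <cmath>

// Result of deciding one boolean parameter from photo evidence.
struct Decision {
    bool hasHandle;
    double confidence;  // low when the photo cannot distinguish the cases
};

// Compare the match score of the model with a handle against the model
// without one. On a near-tie (e.g. the handle is hidden from this view),
// fall back on the prior, but record only low confidence in the choice.
Decision decideHandle(double scoreWithHandle, double scoreWithoutHandle,
                      double priorHasHandle, double tieThreshold = 0.01) {
    double gap = std::fabs(scoreWithHandle - scoreWithoutHandle);
    if (gap < tieThreshold) {
        bool guess = priorHasHandle > 0.5;  // pick the more probable model
        return {guess, 0.1};                // arbitrary low confidence
    }
    // The evidence decides; use the score gap as a crude confidence.
    return {scoreWithHandle > scoreWithoutHandle, std::fmin(gap, 1.0)};
}
```

A second photo that shows the handle would produce a large score gap in favor of the with-handle model, flipping the decision with high confidence.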
The user's role in this system would be to write the initial C++ description of the "cup domain", and then give the algorithm pictures of the object. Some of the more common functions that map appearance to parameters (such as the "silhouette of an object of revolution" function) would be inherited, so that the new object would hopefully require minimal programming.
One example of a special function that would be helpful in the cup domain would be a function that maps the inner wall of the cup to "1/4 inch inside the outer wall". With this function, the modeler could guess at the inside shape of the cup, without ever actually seeing a silhouette of the inside.
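That rule could be a one-line derivation over the body's radial profile. The following is a hypothetical sketch (radii in inches, fixed 1/4-inch wall thickness):

```cpp
#include <cassert>
#include <vector>

// Derive the inner-wall profile of an object of revolution by offsetting
// the outer-wall radii inward by the wall thickness (1/4 inch by default).
// This lets the modeler guess the inside shape without ever seeing it.
std::vector<double> innerWallFromOuter(const std::vector<double>& outerRadii,
                                       double wallThickness = 0.25) {
    std::vector<double> inner;
    inner.reserve(outerRadii.size());
    for (double r : outerRadii)
        inner.push_back(r > wallThickness ? r - wallThickness : 0.0);
    return inner;
}
```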
Two steps that I have not mentioned, camera calibration and registration, will need to be implemented. However, I am assuming that I will use known techniques (such as those of Tsai and Debevec), and will introduce nothing new here.
Due to the limited time available for this project, I will probably only tackle one or two domains. Another possible "easy" domain would be a limited subset of Lego bricks. Given several pictures of a simple Lego model, and knowledge of all the possible Lego bricks used, the program could try to identify the actual structure of the model.
Extensions, probably beyond the scope of this quarter, would include:
In an ideal system, the computer would display the model *as* it is being refined. The user could watch as the program places the Lego bricks in the model one by one, and provide feedback about which bricks are correct and which are questionable. The user could click on parts of the cup to indicate "this part of the cup looks fine" and "that part of the cup needs some more work." This feedback would modify the confidence values, and tell the program which parameters of the model to work on first.
Other interesting domains might include: faces, cars, books, flowers, forests, hands, or virtually anything that can be described procedurally.