CS468: Topics in Geometric Computing

Foundation Models for 3D/4D Scene Understanding and Content Creation

Leonidas Guibas

Fall 2024-25

In the last few years, large pre-trained models in the language and vision-language areas have shown impressive capabilities and emergent behaviors even for tasks they were not specifically trained on. These so-called foundation models (FMs) are re-shaping how we approach learning problems as we aim for the grand goal of artificial general intelligence (AGI).

When it comes to 3D or 4D tasks, however (tasks that involve spatial reasoning about geometry and motion), the state of FM development is less clear. This is because current FMs are trained on vast web data that includes text, images, and videos, but little 3D. It is important to assess the 3D / 4D awareness and capabilities of FMs and to study how to improve them, as our world is 3D, and perceiving, reasoning about, and acting on the real world requires 3D understanding. The obvious challenge is that the real 3D data we have is orders of magnitude less than what is available in the language and vision domains. Furthermore, 3D annotations are cumbersome to produce.

This course will survey the state of the art of 3D (space) / 4D (space+time) understanding of FMs, explore approaches to enhancing that understanding, and study how FMs can be used across a range of 3D / 4D tasks. Specific topics to be covered include:

Geometry Representations: Implicit and Explicit, Structured and Unstructured
Survey of Large Language and Language-Vision Models
3D Awareness Assessment of Current Foundation Models
In-Context Learning for 3D / 4D
Fine Tuning Foundation Models for 3D / 4D
Parametric 3D Geometries, Human Models
2Dfor3D: Distillation, Neural Rendering, 3D Features
Neural Approaches for 3D Point Clouds and Meshes
Programmatic Representations of Geometry; Synthetic 3D / 4D Data
Foundation-Assisted Agents for 3D Content Creation
Shaping Latent Spaces for Geometry, Topology, and Physics
Token-based and Diffusion Architectures
3D from Language, Image(s), and Video
Motion Models

The course will require presentations of papers from the current literature in class, active participation in the class discussions, and a collaborative project.

These pages are maintained by Leonidas Guibas guibas@cs.stanford.edu.
Last update September 18, 2024.