CS468: Topics in Geometric Computing

Foundation Models for 3D/4D Scene Understanding and Content Creation

Leonidas Guibas

Fall 2024-25


In the last few years, large pre-trained models in the language and vision-language areas have shown impressive capabilities and emergent behaviors even for tasks they were not specifically trained on. These so-called foundation models (FMs) are re-shaping how we approach learning problems as we aim for the grand goal of artificial general intelligence (AGI).

When it comes to 3D or 4D tasks, however -- tasks that involve spatial reasoning in 3D about geometry and motion -- the state of FM development is less clear. This is because current FMs are trained with vast web data that includes text, images, and videos -- but little 3D. It is important to assess the 3D / 4D awareness and capabilities of FMs and study how to improve them, as our world is 3D and perceiving, reasoning, and acting on the real world requires 3D understanding. The obvious challenge is that the real 3D data we have is orders of magnitude less than what is available in the language and vision domains. Furthermore, 3D annotations are cumbersome to produce.

This course will survey the state of the art of 3D (space) / 4D (space+time) understanding of FMs, explore a variety of approaches towards enhancing that understanding, and study how FMs can be used in a variety of 3D / 4D tasks. Specific topics to be covered include:

  • Geometry Representations: Implicit and Explicit, Structured and Unstructured
  • Survey of Large Language and Language-Vision Models
  • 3D Awareness Assessment of Current Foundation Models
  • In-Context Learning for 3D / 4D
  • Fine-Tuning Foundation Models for 3D / 4D
  • Parametric 3D Geometries, Human Models
  • 2D for 3D: Distillation, Neural Rendering, 3D Features
  • Neural Approaches for 3D Point Clouds and Meshes
  • Programmatic Representations of Geometry; Synthetic 3D / 4D Data
  • Foundation-Assisted Agents for 3D Content Creation
  • Shaping Latent Spaces for Geometry, Topology, and Physics
  • Token-based and Diffusion Architectures
  • 3D from Language, Image(s), and Video
  • Motion Models

The course will require presentations of papers from the current literature in class, active participation in the class discussions, and a collaborative project.


These pages are maintained by Leonidas Guibas guibas@cs.stanford.edu.
Last update September 18, 2024.