Computational Video Editing for Dialogue-Driven Scenes
Mackenzie Leake✧
Abe Davis✧
Anh Truong✻
Maneesh Agrawala✧
✧ Stanford University, ✻ Adobe Research
Abstract: We present a system for efficiently editing video of dialogue-driven scenes. The input to our system is a standard film script and multiple video takes, each capturing a different camera framing or performance of the complete scene. For each line of dialogue, our system automatically selects the most appropriate clip from one of the input takes, based on a user-specified set of film-editing idioms. Our system starts by segmenting the input script into lines of dialogue and then splitting each input take into a sequence of clips time-aligned with each line. Next, it labels the script and the clips with high-level structural information (e.g., emotional sentiment of dialogue, camera framing of clip). After this pre-processing step, our interface offers a set of basic idioms that users can combine in a variety of ways to build custom editing styles. Our system encodes each basic idiom as a Hidden Markov Model that relates editing decisions to the labels extracted during pre-processing. For short scenes (< 2 minutes, 8-16 takes, 6-27 lines of dialogue), applying the user-specified combination of idioms to the pre-processed inputs generates an edited sequence in 2-3 seconds. We show that this is significantly faster than the hours of user time skilled editors typically require to produce such edits and that the quick feedback lets users iteratively explore the space of edit designs.
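To make the Hidden Markov Model framing concrete, the following is a minimal sketch of how a clip could be chosen for each line by Viterbi decoding. It is not the system's actual implementation: the function `viterbi_edit`, the take dictionaries, and all score values are hypothetical and chosen only for illustration. States are takes, each line of dialogue is one time step, and the start, emission, and transition scores stand in for whatever an idiom would specify.

```python
import math

def _log(p):
    # Log of a score; zero scores become -inf so they are never selected.
    return math.log(p) if p > 0 else float("-inf")

def viterbi_edit(num_lines, takes, start, emit, trans):
    """Return one take per line that maximizes the product of scores.

    start(take), emit(line_index, take), and trans(prev_take, take) are
    hypothetical scoring callbacks returning non-negative numbers.
    """
    if num_lines == 0:
        return []
    n = len(takes)
    best = [[float("-inf")] * n for _ in range(num_lines)]
    back = [[0] * n for _ in range(num_lines)]
    for k in range(n):
        best[0][k] = _log(start(takes[k])) + _log(emit(0, takes[k]))
    for i in range(1, num_lines):
        for k in range(n):
            e = _log(emit(i, takes[k]))
            for j in range(n):
                s = best[i - 1][j] + _log(trans(takes[j], takes[k])) + e
                if s > best[i][k]:
                    best[i][k] = s
                    back[i][k] = j
    # Backtrack from the best final state.
    k = max(range(n), key=lambda idx: best[-1][idx])
    path = [k]
    for i in range(num_lines - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return [takes[idx] for idx in path]

# Illustrative data (assumed, not taken from the paper).
takes = [
    {"name": "wide", "framing": "wide", "visible": {"Stacy", "Ryan"}},
    {"name": "stacy_cu", "framing": "close", "visible": {"Stacy"}},
    {"name": "ryan_cu", "framing": "close", "visible": {"Ryan"}},
]
speakers = ["Stacy", "Ryan", "Stacy"]

edit = viterbi_edit(
    len(speakers),
    takes,
    # Favor a wide shot for the first line.
    start=lambda t: 1.0 if t["framing"] == "wide" else 0.01,
    # Favor takes in which the current speaker is on screen.
    emit=lambda i, t: 1.0 if speakers[i] in t["visible"] else 0.01,
    # Assumed mild penalty for staying on the same take.
    trans=lambda a, b: 0.5 if a["name"] == b["name"] else 1.0,
)
print([t["name"] for t in edit])
```

Working in log space avoids numerical underflow on longer scenes, and the dynamic program runs in time proportional to the number of lines times the square of the number of takes, which is small for scenes of the size described above.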
Example:
Fig. 1: Given a script and multiple video recordings, or takes, of a
dialogue-driven scene as input (left), our computational video
editing system automatically selects the most appropriate clip
from one of the takes for each line of dialogue in the script
based on a set of user-specified film-editing idioms (right). For
this scene, titled Fluffles, editing style A (top row) combines
two such idioms: start wide ensures that the first clip
is a wide, establishing shot of all the characters in the scene,
and speaker visible ensures that the speaker of each
line of dialogue is visible. Editing style B (middle) adds in the
intensify emotion idiom, which reserves close-ups for
strongly emotional lines of dialogue, as in lines 4 and 5 where
the emotional sentiment strength (shown in blue) is greater than
0.65. Editing style C (bottom) replaces the intensify emotion
idiom with emphasize character, which focuses on the
Stacy character whenever Ryan has a particularly short line of
dialogue, as in lines 1 and 3.
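The idioms named in this caption can be sketched as small scoring functions that are multiplied together to form an editing style. This is a simplified illustration under stated assumptions, not the system's implementation: the `Clip` and `Line` types, the `HIGH` and `LOW` score values, and the sample lines are hypothetical, and the sketch picks the best clip for each line independently rather than decoding a full Hidden Markov Model. Only the 0.65 sentiment threshold comes from the caption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clip:
    take: str
    framing: str        # "wide" or "close"
    visible: frozenset  # characters on screen

@dataclass(frozen=True)
class Line:
    speaker: str
    emotion: float      # sentiment strength in [0, 1]

HIGH, LOW = 1.0, 0.05   # assumed score values, for illustration only

def start_wide(i, line, clip):
    # Only constrains the first line of the scene.
    if i == 0:
        return HIGH if clip.framing == "wide" else LOW
    return HIGH

def speaker_visible(i, line, clip):
    return HIGH if line.speaker in clip.visible else LOW

def intensify_emotion(i, line, clip, threshold=0.65):
    # Close-ups are reserved for strongly emotional lines.
    strong = line.emotion > threshold
    return HIGH if strong == (clip.framing == "close") else LOW

def apply_style(idioms, lines, clips):
    """Pick, per line, the clip with the highest product of idiom scores.

    Ties go to the clip listed first, since max() returns the first
    maximal element.
    """
    chosen = []
    for i, line in enumerate(lines):
        def score(clip):
            total = 1.0
            for idiom in idioms:
                total *= idiom(i, line, clip)
            return total
        chosen.append(max(clips, key=score))
    return chosen

# Hypothetical labels; these are not the actual Fluffles script values.
clips = [
    Clip("wide", "wide", frozenset({"Stacy", "Ryan"})),
    Clip("stacy_close", "close", frozenset({"Stacy"})),
    Clip("ryan_close", "close", frozenset({"Ryan"})),
]
lines = [
    Line("Stacy", 0.2),
    Line("Ryan", 0.3),
    Line("Stacy", 0.8),
    Line("Ryan", 0.9),
]

style_a = apply_style([start_wide, speaker_visible], lines, clips)
style_b = apply_style([start_wide, speaker_visible, intensify_emotion], lines, clips)
print([c.take for c in style_a])
print([c.take for c in style_b])
```

Adding or removing a function from the list passed to `apply_style` mirrors how styles A and B differ by a single idiom, which is the kind of quick recombination the interface is designed to support.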