Computational Video Editing for Dialogue-Driven Scenes

✧ Stanford University, ✻ Adobe Research

Abstract: We present a system for efficiently editing video of dialogue-driven scenes. The input to our system is a standard film script and multiple video takes, each capturing a different camera framing or performance of the complete scene. For each line of dialogue, our system automatically selects the most appropriate clip from one of the input takes, based on a user-specified set of film-editing idioms. Our system starts by segmenting the input script into lines of dialogue and then splitting each input take into a sequence of clips time-aligned with each line. Next, it labels the script and the clips with high-level structural information (e.g., emotional sentiment of dialogue, camera framing of clip). After this pre-process, our interface offers a set of basic idioms that users can combine in a variety of ways to build custom editing styles. Our system encodes each basic idiom as a Hidden Markov Model that relates editing decisions to the labels extracted in the pre-process. For short scenes (< 2 minutes, 8-16 takes, 6-27 lines of dialogue), applying the user-specified combination of idioms to the pre-processed inputs generates an edited sequence in 2-3 seconds. We show that this is significantly faster than the hours of user time skilled editors typically require to produce such edits, and that the quick feedback lets users iteratively explore the space of edit designs.
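To make the Hidden Markov Model framing concrete, the following is a minimal sketch of Viterbi decoding, the standard way to find the most likely state sequence in an HMM. It is not the system's implementation. It assumes a simplification in which each hidden state is a take and each time step is a line of dialogue; the function name `viterbi`, the two-take setup, and all probabilities (including the mild preference for staying in the current take) are hypothetical values chosen only for illustration.

```python
import math

def viterbi(start_logp, trans_logp, emit_logp):
    """Return the most likely state sequence under an HMM.

    start_logp[s]    : log-probability of starting in state s
    trans_logp[r][s] : log-probability of moving from state r to state s
    emit_logp[t][s]  : log-score of state s at step t (here: line t)
    """
    n_states = len(start_logp)
    score = [start_logp[s] + emit_logp[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at step t
    for t in range(1, len(emit_logp)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_r = max(range(n_states),
                         key=lambda r: score[r] + trans_logp[r][s])
            ptr.append(best_r)
            new_score.append(score[best_r] + trans_logp[best_r][s]
                             + emit_logp[t][s])
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

log = math.log
# Hypothetical toy scene: state 0 = wide take, state 1 = close-up take.
start = [log(0.9), log(0.1)]            # strongly prefer opening on the wide take
trans = [[log(0.6), log(0.4)],          # assumed mild preference for not cutting
         [log(0.4), log(0.6)]]
emit = [[log(0.5), log(0.5)],           # line 1: no preference
        [log(0.2), log(0.8)],           # line 2: labels favor the close-up
        [log(0.5), log(0.5)]]           # line 3: no preference

path = viterbi(start, trans, emit)
print(path)  # [0, 1, 1]
```

Working in log space turns products of probabilities into sums, which avoids numerical underflow on longer scenes. The decoder runs in time proportional to the number of lines times the square of the number of states, which is consistent with edits for short scenes being generated in seconds, though the actual system's state space and costs may differ.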

Example:
Fig. 1: Given a script and multiple video recordings, or takes, of a dialogue-driven scene as input (left), our computational video editing system automatically selects the most appropriate clip from one of the takes for each line of dialogue in the script, based on a set of user-specified film-editing idioms (right). For this scene, titled Fluffles, editing style A (top row) combines two such idioms: "start wide" ensures that the first clip is a wide, establishing shot of all the characters in the scene, and "speaker visible" ensures that the speaker of each line of dialogue is visible. Editing style B (middle) adds the "intensify emotion" idiom, which reserves close-ups for strongly emotional lines of dialogue, as in lines 4 and 5, where the emotional sentiment strength (shown in blue) is greater than 0.65. Editing style C (bottom) replaces the "intensify emotion" idiom with "emphasize character", which focuses on the Stacy character whenever Ryan has a particularly short line of dialogue, as in lines 1 and 3.
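The way idioms combine into editing styles A and B can be sketched in a few lines of Python. This is a simplified illustration, not the system's HMM encoding: it assumes each idiom is a function returning a log-score for showing a given take on a given line, that idioms combine by summing log-scores, and that none of the three idioms constrains transitions between clips, so a per-line argmax suffices. The take labels, the line data, and the 0.99/0.01 scores are hypothetical; only the 0.65 sentiment threshold comes from the caption.

```python
import math

GOOD, BAD = math.log(0.99), math.log(0.01)  # assumed "satisfied" / "violated" scores

# Hypothetical labels of the kind produced by the pre-process.
TAKES = [
    {"name": "wide",     "framing": "wide",     "visible": {"Stacy", "Ryan"}},
    {"name": "cu_stacy", "framing": "close-up", "visible": {"Stacy"}},
    {"name": "cu_ryan",  "framing": "close-up", "visible": {"Ryan"}},
]
LINES = [
    {"speaker": "Ryan",  "sentiment": 0.10},
    {"speaker": "Stacy", "sentiment": 0.30},
    {"speaker": "Ryan",  "sentiment": 0.80},
    {"speaker": "Stacy", "sentiment": 0.70},
]

def start_wide(i, line, take):
    # Only the first line is constrained: it should use a wide shot.
    if i == 0:
        return GOOD if take["framing"] == "wide" else BAD
    return 0.0

def speaker_visible(i, line, take):
    return GOOD if line["speaker"] in take["visible"] else BAD

def intensify_emotion(i, line, take, threshold=0.65):
    # Close-ups are reserved for lines whose sentiment strength exceeds the threshold.
    emotional = line["sentiment"] > threshold
    close = take["framing"] == "close-up"
    return GOOD if emotional == close else BAD

def edit(lines, takes, idioms):
    """Pick, for each line, the take with the highest summed idiom score."""
    result = []
    for i, line in enumerate(lines):
        best = max(takes, key=lambda t: sum(f(i, line, t) for f in idioms))
        result.append(best["name"])
    return result

style_a = edit(LINES, TAKES, [start_wide, speaker_visible])
style_b = edit(LINES, TAKES, [start_wide, speaker_visible, intensify_emotion])
print(style_b)  # ['wide', 'wide', 'cu_ryan', 'cu_stacy']
```

In style A several takes can tie on a line, since both the wide shot and the speaker's close-up keep the speaker visible. Adding the intensify emotion idiom in style B breaks those ties, which is one way to see how stacking idioms narrows the space of acceptable edits.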