Video editing today is more art than science, which makes it exceedingly hard to automate. In fact, you might even think it would be impossible. But it's not; it's just really frustrating.
"These extremely creative people, when they have a problem, they solve it ad hoc," says Oren Boiman. "They have no idea how they did it." Boiman is the CEO and cofounder of Magisto, a mobile app that automatically edits videos for its 20 million users. For most app developers, automating something is mostly a matter of mimicking a clearly laid out human workflow. For Boiman, it's as if the people he's automating can't even explain how they do what they do. And that matters big time, he says, because the value of a piece of video is entirely in the way it's cut.
"The emotion is not in the footage itself. The emotion is in the editing," says Boiman. "In order to get a very strong emotional response you need to sync the audio, the soundtrack, the pace, the cuts—everything together."
To figure out how software could imitate this process, Boiman's engineers observed professional video editors in their natural element and codified their behavior into editing rules. Those rules are built into Magisto's "emotional" algorithms, which the company calls styles or themes. A style is a way of telling a story that creates a particular mood: the Party Beat style is quick and energetic, for example, while the Love style is dreamy and romantic. You can take exactly the same footage, apply a different style, and get a very different emotional response.
"The mood, which we set here with the editing style, sets expectations about how you want to re-create a life experience," says Boiman. "If I capture something with my kids I'm going to make it cute and when I see that I'm going to think 'yeah that's how I felt.' With Magisto we say we help you show how it feels."
Injecting emotion is actually the final stage in Magisto’s elaborate analysis process. First, the app needs to extract the basic elements of the story being told in the raw video footage: who the main characters are, what they did, and where they did it.
Magisto uses a plethora of vision analysis and machine learning techniques to go from a mass of pixels to a semantic understanding of the story the video captures. Object analysis identifies the objects in the footage and how often they appear. Action and topic analysis determine what is happening while scene analysis figures out where the action occurs.
In Magisto’s world a character is an object which gets a significant amount of screen time, displays some behavior and interacts with the camera or other characters. However, screen time isn’t enough to make an object into a character. Another person may be standing behind a character when you shoot a video but he will not be part of the story if the character doesn’t interact with him.
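The two-part test described above, screen time plus interaction, can be made concrete with a toy example. This is only an illustrative sketch, not Magisto's code: the `TrackedObject` fields, the `is_character` function, and both thresholds are assumptions invented here, and a real system would derive these signals from vision analysis rather than hand-entered numbers.

```python
from dataclasses import dataclass


@dataclass
class TrackedObject:
    name: str
    screen_time: float  # fraction of the footage in which the object is visible (0 to 1)
    interactions: int   # detected interactions with the camera or with other objects


def is_character(obj, min_screen_time=0.2, min_interactions=1):
    """Screen time alone is not enough: the object must also interact.

    Both thresholds are made-up values for illustration.
    """
    return obj.screen_time >= min_screen_time and obj.interactions >= min_interactions


objects = [
    TrackedObject("child", screen_time=0.7, interactions=5),
    TrackedObject("bystander", screen_time=0.6, interactions=0),  # visible, but never interacts
    TrackedObject("ball", screen_time=0.3, interactions=2),       # a non-human character
]

characters = [o.name for o in objects if is_character(o)]
```

Here the bystander has plenty of screen time but is left out of `characters`, while the ball qualifies because someone interacts with it.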
Characters are not always human. Magisto videos often feature animals, gadgets, or even computer games as characters. If a person grabs a ball and throws it, the person and the object interact, making the object more likely to become a character. When it pinpoints a human character, Magisto’s technology can recognize the character’s face when it occurs again, analyze the facial expression, select the right shots, frame them, and polish them up as a professional photographer would. It will also detect and analyze speech.
Once the software has recognized the story’s characters, it uses action analysis to study their movements and behavior and recognize common actions like walking or running as well as interactions—actions which involve more than one character. If a clearly important character A interacts with another character B, that implies that B is relevant to the story. Some interactions, for example when two characters talk, bind them together more strongly than merely standing together.
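One way to picture how an interaction makes character B relevant is a simple score that flows from one character to another, scaled by how strong the interaction is. The sketch below is a guess at the general idea rather than the company's method; the weights, the interaction labels, and the `propagate_relevance` function are all hypothetical.

```python
# Hypothetical strengths: talking binds two characters more than standing together.
INTERACTION_WEIGHT = {"talking": 0.8, "standing_together": 0.3}


def propagate_relevance(relevance, interactions):
    """Single pass: each interaction lifts a character's score to at least
    the partner's original score scaled by the interaction weight."""
    updated = dict(relevance)
    for a, b, kind in interactions:
        weight = INTERACTION_WEIGHT[kind]
        updated[b] = max(updated[b], relevance[a] * weight)
        updated[a] = max(updated[a], relevance[b] * weight)
    return updated


# A is clearly important; B and C start out as background objects.
relevance = {"A": 1.0, "B": 0.1, "C": 0.1}
interactions = [("A", "B", "talking"), ("A", "C", "standing_together")]

scores = propagate_relevance(relevance, interactions)
```

With these invented numbers, B ends up well ahead of C because talking to A counts for more than standing next to A.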
Topic analysis is a machine learning task where Magisto uses all the information from the other analysis modules to infer the topic of the story based on the topics of millions of previously analyzed videos. Is this video of a landscape, a wedding, or a sports event? The topic is derived from various story elements and is used, in addition to the selected style, to direct the editing and production processes. If Magisto detects a sports topic, for example, it will focus more on activities than on dialog when selecting cuts.
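The article does not say which learning method Magisto uses, so the following stands in a deliberately simple technique, a nearest-neighbor vote, to show the idea of inferring a topic from previously analyzed videos. The three features, the tiny example set, and the `infer_topic` function are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical per-video features:
# (share of shots with fast motion, share of shots with dialog, share of outdoor shots)
PREVIOUS_VIDEOS = [
    ((0.9, 0.1, 0.8), "sports"),
    ((0.8, 0.2, 0.9), "sports"),
    ((0.2, 0.7, 0.3), "wedding"),
    ((0.3, 0.8, 0.2), "wedding"),
    ((0.1, 0.1, 1.0), "landscape"),
]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def infer_topic(features, labeled_examples, k=3):
    """Return the most common topic among the k most similar earlier videos."""
    nearest = sorted(labeled_examples, key=lambda ex: squared_distance(features, ex[0]))[:k]
    return Counter(topic for _, topic in nearest).most_common(1)[0][0]


topic = infer_topic((0.85, 0.15, 0.7), PREVIOUS_VIDEOS)
```

A video with lots of fast motion and little dialog lands closest to the earlier sports videos, which is the kind of result that would then steer the edit toward activities rather than dialog.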
The initial analysis makes no judgment about the most important or interesting parts of the footage, but given the length chosen by the user for the final edited video, Magisto must now start to make those choices. Boiman tells me that anything that repeats itself is boring. "Boring is very easy to detect but the fact that it's not boring doesn't make it interesting," he says.
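Boiman's point that repetition is easy to detect can be illustrated with a small filter that drops any shot too similar to one already kept. This is a sketch under loud assumptions: each shot is reduced to a made-up feature vector, similarity is plain cosine similarity, and the 0.95 threshold is arbitrary. As the quote warns, what survives is merely not boring, which is not the same as interesting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_repeats(shots, threshold=0.95):
    """Keep a shot only if it is not a near-duplicate of an earlier kept shot."""
    kept = []
    for name, features in shots:
        if all(cosine(features, seen) < threshold for _, seen in kept):
            kept.append((name, features))
    return [name for name, _ in kept]


# Hypothetical shots: the second wave is almost identical to the first.
shots = [
    ("wave_1", [1.0, 0.0, 0.20]),
    ("wave_2", [1.0, 0.0, 0.21]),
    ("cake", [0.0, 1.0, 0.50]),
]

kept = drop_repeats(shots)
```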
Figuring out what’s interesting is part of the job of the style algorithm, and that’s not straightforward. "If I choose something like Sentimental," says Boiman, "it will be slow-paced editing that gives more space to, for instance, dialog of the main characters. If you take something like Party Beat, it will be an MTV-style video. It's going to be very high-paced and it will probably choose different parts of the video." For a travel video, Magisto will select more footage of landscapes, while the same scenes will be considered less relevant, or more boring, by the Party Beat style.
Fast cuts can generate an energetic feeling in the right setting, but discomfort in the wrong one. If the footage shows a couple dancing to slow music, fast erratic cuts will not match that mood. "So it's not what are the most important parts, but what are the most important parts to tell a story in the way you want to tell it," Boiman explains. The selection of shots which make up the final video must not only be suitable for the style but feel like a coherent whole. "You need to make sure that as much as possible you don't feel the cut," says Boiman.
Then it’s time to finally add some emotion. "In the editing we're controlling the mood, and the emotion is the resulting perception of that mood," says Boiman. "For instance, in a Love theme we try to generate a relaxed yet dramatic mood by keeping a low to medium pace, using close-ups and slow-mo for a bit of dramatic effect. A single ingredient doesn't generate a mood. This is art, not science, but only a coherent set of editing and production rules, in the right sequencing and timing, creates the right mood."
The Sentimental style is slow-paced so Magisto will choose longer cuts and use clean editing with few transition effects. The Sentimental theme will also override the soundtrack music with dialog more frequently than other styles. The Party Beat theme does almost the opposite, putting the soundtrack in the foreground. Thus each theme has its own recipe for which elements to emphasize and how to use them in conjunction.
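A theme's "recipe" can be thought of as a small bundle of settings that the editor consults at each decision. The sketch below is hypothetical: the `StyleRecipe` fields and the specific numbers are invented, and the Party Beat values are only inferred from the description of it as "almost the opposite" of Sentimental.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleRecipe:
    min_cut_seconds: float    # shortest shot the style will use (made-up values)
    transition_effects: bool  # whether flashy transitions are allowed
    dialog_over_music: bool   # whether dialog may push the soundtrack into the background


STYLES = {
    # Slow pace, clean editing, dialog favored over the soundtrack.
    "Sentimental": StyleRecipe(min_cut_seconds=4.0, transition_effects=False, dialog_over_music=True),
    # Assumed opposite: fast pace, soundtrack in the foreground.
    "Party Beat": StyleRecipe(min_cut_seconds=1.0, transition_effects=True, dialog_over_music=False),
}


def audio_focus(style_name, shot_has_dialog):
    """Decide what the viewer hears in a shot under the given style."""
    recipe = STYLES[style_name]
    if shot_has_dialog and recipe.dialog_over_music:
        return "dialog"
    return "soundtrack"
```

For the same shot containing speech, `audio_focus("Sentimental", True)` returns "dialog" while `audio_focus("Party Beat", True)` returns "soundtrack", which mirrors how identical footage can be steered toward different moods.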
The final ingredient in Magisto’s mood mixture is music. Magisto recommends music to match each style but users can also upload their own. "When people pick their own music," says Boiman, "it often feels better even if according to the analysis it doesn't fit as well." The software will analyze uploaded music in a similar way to the video, identifying low-level building blocks like the different instruments being played and then higher-level musical concepts such as a song’s chorus and emotional peak. The music elements will then be synced with the story elements by, for example, cutting on the beat.
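Cutting on the beat is the easiest of these ideas to show in a few lines: take the cut points the editor proposed and move each one to the nearest beat in the soundtrack. This is a minimal sketch that assumes the beat timestamps have already been extracted and sorted; the `snap_to_beat` function and the sample timings are illustrative, not taken from Magisto.

```python
import bisect


def snap_to_beat(cut_times, beat_times):
    """Move each proposed cut (in seconds) to the nearest beat timestamp.

    beat_times must be sorted in ascending order and must not be empty.
    """
    snapped = []
    for t in cut_times:
        i = bisect.bisect_left(beat_times, t)
        # Only the beat just before and the beat at or after t can be nearest.
        candidates = beat_times[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda beat: abs(beat - t)))
    return snapped


beats = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # a steady 120 beats per minute
proposed_cuts = [0.9, 1.8, 2.6]

snapped = snap_to_beat(proposed_cuts, beats)  # [1.0, 2.0, 2.5]
```

A fuller version would also have to respect the style's pacing, so that snapping never produces a shot shorter than the theme allows.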
Ultimately, though, Boiman’s objective is bigger than helping people to make better videos. "We are trying to shift the paradigm," says Boiman. "The paradigm right now says you capture and it goes to your archive. We say if it was important enough for you to take out the camera, we are going to make it unforgettable for you. We are talking about billions of life experiences per day that go to the archive. If they go to the Internet there is going to be a different Internet."