When Google bought the world’s biggest user-generated video site in 2006, the company knew it could also be acquiring the world’s biggest legal headache. So a year after YouTube became a Google subsidiary, it launched a copyright management tool that only a company with Google’s scale and knack for innovative problem solving could muster: Content ID.
Content ID gave rights holders an automated way of finding unauthorized copies of videos and songs uploaded to the site. They could then decide whether to block the content or run ads against it and start to rake in money. Predictably, the system was far from perfect: Leaving machines to make decisions about something as nuanced as copyright can easily lead to misfires and frustration for video uploaders and rights holders alike. Meanwhile, the most dedicated infringers can always find creative ways to sneak past the robotic copyright cops on the beat, primarily by distorting audio and video to avoid detection.
Nearly a decade in, Content ID has come a long way. The system—which the company says has generated $2 billion in revenue for rights holders—has grown smarter at matching video and audio and more nuanced in its approach to dealing with copyright disputes. While there’s always plenty of room for improvement—and the system’s successes don’t address the bigger, looming questions about YouTube’s paltry payouts to labels and artists or its broader role in the music industry—Content ID is a rare example of how a technology giant can take a proactive stab at a complex problem and learn from its mistakes along the way.
One of the most notable things about Content ID, flaws and criticisms aside, is that YouTube had no legal obligation to build it. Since the late 1990s, websites that allow users to upload content have been shielded from legal liability by the so-called “safe harbor” provision of the Digital Millennium Copyright Act (DMCA), as long as these sites provide a way for alleged infringement to be reported and infringing content to be taken down. While endlessly controversial (and the subject of debate as Congress reconsiders how to approach copyright in the 21st century), the DMCA is what we’ve got, and it’s the reason that sites like YouTube and SoundCloud have been able to exist at all.
For YouTube, which was massively popular in 2006 and has only exploded since, leaving everything up to the DMCA and the “notice and take down” system it prescribes would not only be a logistical, unscalable nightmare, but also one that’s bound to open up the company to legal troubles like Viacom’s failed $1 billion lawsuit against the site in 2007.
Enter Content ID. The system, which now handles 98% of copyright management on YouTube, scans new video uploads and compares them against a huge library of 50 million reference files provided by copyright holders like film studios and record labels. That library, which now holds the viewing-time equivalent of 600 years’ worth of material, includes things like Michael Jackson’s discography and blockbuster films in their entirety. This reference content comes mostly from thousands of hand-selected partners, an arrangement that naturally leaves some blind spots when it comes to independent musicians and other rights holders.
Content ID uses audio and video fingerprinting technology to detect matches between videos people upload and the reference files waiting in the system’s copyright library. In the case of video, for instance, the clip is sliced into thousands of frames, and each one is automatically checked against the fingerprinted files in the reference library. Using similar, less sophisticated fingerprinting technology, the system can also recognize raw audio, much like Shazam is able to identify hot new jams in a crowded bar. More recently, YouTube updated Content ID with the ability to detect actual melodies, which helps songwriters track down unauthorized covers (to the inevitable chagrin of those who take the time to learn Prince classics on the ukulele).
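YouTube has not published the internals of its fingerprinting, so the following is only a toy sketch of the general idea, not Content ID itself. It assumes frames arrive as tiny grayscale grids and swaps in a simple “average hash” for the real fingerprint. The names `average_hash`, `hamming`, and `best_match`, and both thresholds, are invented for illustration.

```python
from typing import Dict, List, Optional, Sequence

Frame = Sequence[Sequence[int]]  # grayscale pixel rows, values 0-255 (assumed input format)


def average_hash(frame: Frame) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the frame's mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def best_match(
    upload: List[Frame],
    references: Dict[str, List[Frame]],
    max_distance: int = 1,      # hypothetical tolerance per frame
    min_fraction: float = 0.8,  # hypothetical share of frames that must match
) -> Optional[str]:
    """Return the reference whose frames cover most of the upload, or None."""
    if not upload:
        return None
    upload_hashes = [average_hash(f) for f in upload]
    best_title, best_fraction = None, 0.0
    for title, ref_frames in references.items():
        ref_hashes = [average_hash(f) for f in ref_frames]
        hits = sum(
            1 for u in upload_hashes
            if any(hamming(u, r) <= max_distance for r in ref_hashes)
        )
        fraction = hits / len(upload_hashes)
        if fraction > best_fraction:
            best_title, best_fraction = title, fraction
    return best_title if best_fraction >= min_fraction else None
```

A real system would index the reference fingerprints rather than compare every pair, but the sketch shows why slightly noisy copies can still match: small pixel changes rarely flip many bits.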
If it finds a match (or thinks it found a match), the system will flag the video and let the purported rights holder decide what to do. They can block the video, ask to have the offending audio or video footage removed (in cases where only a portion of the video contains copyrighted material), or let the video stay up and decide to run ads against it, effectively monetizing what otherwise would have been considered pure piracy by funneling ad revenue to the rights holder (rather than the uploader, who would normally have the option to grab that ad revenue).
As you might imagine, this is where much of the controversy around Content ID comes from. “There are going to be mistakes,” says Harris Cohen, senior product manager of Content ID at YouTube. “There’s inaccurate data sometimes. The first line of defense against that type of stuff is always the dispute process.”
Some cases are more clear-cut than others. If I upload Jurassic Park to my personal YouTube account, perhaps distorting the video somewhat to evade Content ID, that’s an obvious violation. But in other cases (where a video clip is used in a context that legally counts as “fair use” under copyright law, or where I write a parody of a popular song and record my own version), an algorithm is bound to screw up from time to time, incorrectly flagging a video as a copyright violation. If that happens, the process that Cohen describes lets me file a dispute and suspend the copyright claim for up to 30 days, during which the person claiming ownership can respond. (If they don’t, the claim expires and my video stays put.) If they do respond, they need to clarify their case for upholding the video claim (or taking down the video), which I can then appeal. Fun times.
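For readers who think in code, the dispute timeline can be sketched roughly as follows. This models only the steps named above. The `Outcome` labels and the `dispute_status` function are hypothetical, and treating the window as expiring only after day 30 has fully passed is an assumption.

```python
from datetime import date, timedelta
from enum import Enum
from typing import Optional

RESPONSE_WINDOW = timedelta(days=30)  # "up to 30 days", per the article


class Outcome(Enum):
    PENDING = "claim suspended, waiting on the claimant"
    CLAIM_EXPIRED = "claimant never responded, video stays up"
    CLAIM_UPHELD = "claimant responded, uploader may appeal"


def dispute_status(
    filed: date,
    today: date,
    claimant_response: Optional[str] = None,  # e.g. "uphold" or "takedown"
) -> Outcome:
    """Report where a dispute stands on a given day (illustrative only)."""
    if claimant_response is not None:
        # Either response keeps the claim alive and leaves the uploader an appeal.
        return Outcome.CLAIM_UPHELD
    if today - filed > RESPONSE_WINDOW:
        return Outcome.CLAIM_EXPIRED
    return Outcome.PENDING
```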
It used to be that a Content ID claim would automatically divert ad revenue away from the uploader, which invited accusations of YouTube giving preferential treatment to copyright claimants. This apparent bias triggered controversy among the gaming community in particular, whose video game walkthroughs and related material were frequently affected by YouTube’s automated copyright crackdown.
Earlier this year, YouTube updated the dispute process to be more forgiving to uploaders. Now, when a copyright claim is filed, YouTube holds the video’s ad revenue in limbo until the dispute is resolved, at which point that money goes to whoever prevails.
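In bookkeeping terms, the new arrangement amounts to a simple escrow. The sketch below is a guess at the shape of that logic, not YouTube’s accounting code: the `DisputedVideo` class, its field names, and the use of integer cents are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DisputedVideo:
    """Hypothetical ledger for one video while its claim is in dispute."""
    uploader: str
    claimant: str
    held_cents: int = 0                                   # ad revenue in limbo
    paid: Dict[str, int] = field(default_factory=dict)    # payouts after resolution

    def record_ad_revenue(self, cents: int) -> None:
        """Ad money earned during the dispute is held, not paid to either side."""
        self.held_cents += cents

    def resolve(self, winner: str) -> None:
        """Release everything held to whichever party prevails."""
        if winner not in (self.uploader, self.claimant):
            raise ValueError("winner must be the uploader or the claimant")
        self.paid[winner] = self.paid.get(winner, 0) + self.held_cents
        self.held_cents = 0
```

Under the older behavior the article describes, the equivalent of `record_ad_revenue` would have paid the claimant immediately, which is what drew the accusations of bias.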
To minimize frustrations for both uploaders and copyright holders, YouTube has also improved the way its content-matching technology works. This includes a general refinement of how precise the machines’ artificial “vision” is. Two different soccer games, for example, can contain many scenes and images that are nearly identical, easily tripping up an algorithm. On the backend, YouTube has developed a system of abstract-looking “heat map” graphics to help visualize how similar two video frames are by analyzing things like the distance between objects in the frame. Whenever a reference file closely resembles an uploaded clip, this visualization draws out a line of bright colors in a pattern that signifies a match—that is, a possible infringement.
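The heat map idea can be approximated in a few lines. This is a simplified stand-in rather than YouTube’s tool: it assumes each frame has already been reduced to a short list of numbers (the “features”), scores every upload frame against every reference frame, and then looks for the longest diagonal run of high scores, which is the “line of bright colors” a copied stretch would produce. The function names and the 0.8 threshold are hypothetical.

```python
from typing import List, Sequence, Tuple

Features = Sequence[float]  # assumed per-frame summary, e.g. distances between objects


def similarity(a: Features, b: Features) -> float:
    """1.0 for identical feature vectors, falling toward 0.0 as they diverge."""
    distance = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return 1.0 / (1.0 + distance)


def heat_map(upload: List[Features], reference: List[Features]) -> List[List[float]]:
    """Rows are upload frames, columns are reference frames."""
    return [[similarity(u, r) for r in reference] for u in upload]


def longest_diagonal(heat: List[List[float]], threshold: float = 0.8) -> Tuple[int, int, int]:
    """Return (length, upload_start, reference_start) of the longest hot diagonal run."""
    rows = len(heat)
    cols = len(heat[0]) if rows else 0
    # run[i + 1][j + 1] holds the length of the hot diagonal ending at cell (i, j).
    run = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = (0, 0, 0)
    for i in range(rows):
        for j in range(cols):
            if heat[i][j] >= threshold:
                length = run[i][j] + 1
                run[i + 1][j + 1] = length
                if length > best[0]:
                    best = (length, i - length + 1, j - length + 1)
    return best
```

Two different soccer games would light up scattered cells across the map, while an actual copy produces one long, unbroken diagonal, which is why the run length is a more telling signal than any single similar frame.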
As the breadth of both YouTube’s uploaded videos and its under-the-hood copyright reference library continues to grow, this process naturally rubs up against the constraints of computing power. To help keep Content ID both discerning and speedy, YouTube’s engineers have plugged it into Google Brain’s deep learning system. By using the neural network developed by the Brain team, Content ID can more quickly ingest new imagery into its fingerprinting system and find matches.
Google Brain also helps with one of Content ID’s biggest weaknesses: Figuring out when people have modified videos and audio in order to upload infringing content. To evade Content ID, many uploaders have taken to tricks like horizontally flipping videos, changing their aspect ratio, warping the audio, or adding a bright “halo” effect to the center of the clip. As pirates get more clever, Content ID’s engineers must continually sharpen their claws for the inevitable cat-and-mouse chase that ensues. That means training the fingerprinting system to spot these types of distortions, and leaning on Google Brain’s machine learning mechanisms to retain that knowledge and apply it faster.
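To see why a trick like mirroring works at all, and what it takes to undo it, consider a hand-coded sketch. This is not the learned approach described above. It assumes tiny grayscale frames and a naive brightness hash, and it handles exactly one known distortion by fingerprinting both a frame and its mirror image and keeping the smaller value. All names here are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale pixel rows, values 0-255 (assumed)


def bit_hash(frame: Frame) -> int:
    """Naive fingerprint: one bit per pixel, set if brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def mirror(frame: Frame) -> List[List[int]]:
    """Flip a frame horizontally, as an evasive uploader might."""
    return [list(reversed(row)) for row in frame]


def flip_invariant_hash(frame: Frame) -> int:
    """Give a frame and its mirror image the same fingerprint."""
    return min(bit_hash(frame), bit_hash(mirror(frame)))
```

The limitation is plain: every new distortion (a halo, a changed aspect ratio) needs its own hand-written rule, which is presumably why a system that can be retrained on new patterns of abuse is attractive.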
“In the past, we may have seen some patterns of abuse out there, and it would take quite a long time to revamp and relaunch the fingerprint technology in order to address that,” says Cohen. “We can be much faster now in terms of teaching the system to recognize new patterns and ignore extraneous images and that kind of thing.”
Of course, these improvements don’t mean that YouTube has succeeded in steamrolling distorted videos and eliminating this sneaky means of piracy. This episode of Vice’s Party Legends, its frame shrunken down and embedded within a blurred, full-sized version of itself, has been live for one month. This slowed-down, visually altered version of Martin Scorsese’s 2011 documentary about George Harrison, Living in the Material World, has been up for six months. (Update: Both videos suddenly disappeared after this article was published.) It’s possible that Content ID flagged these videos and that the copyright owners chose to keep them up, although that seems unlikely, given their unflatteringly low quality.
Cohen says YouTube remains focused on refining its matching technology, as well as the dispute and appeals processes. That’s a good thing because YouTube’s role in the music industry is a critical, often controversial one. As streaming music explodes toward mainstream dominance, the company finds itself in the unique position of streaming more music than services like Spotify and Apple Music, yet contributing a comparatively small slice of the growing revenue pie that labels see from streaming overall. Frustration over this so-called “value grab” (in addition to longstanding concerns about piracy on YouTube) comes at a pivotal moment: YouTube’s major label licensing deals are up for renegotiation, pressure is growing for Congress to reexamine the DMCA, and competition in the music streaming space is heating up.
As the pieces of the new music economy puzzle continue to shift around, YouTube is eager to portray its own role as a positive one. And while an inherently controversial, copyright-focused tool like Content ID isn’t likely to please everyone—especially the music industry—don’t expect YouTube’s engineers to stop trying anytime soon.