Last February, April the Giraffe gave birth at Animal Adventure Park in Harpursville, New York. Eager viewers were able to watch the event, in real time, on Facebook Live. And they did, to the tune of more than 14 million views and 250,000 comments. Across the country, in Seattle, a group of Facebook engineers made sure that everyone watching the feed, and commenting, was able to see the latest updates as they happened.
Facebook has more than 2 billion users. Messenger has more than a billion, as does WhatsApp, while Instagram has more than 700 million. Ensuring that each of those users is able to watch real-time interactions with the people most important to them is therefore a crucial task for Mark Zuckerberg’s company. If people don’t see those little ellipses that indicate someone they care about is typing something right now, Facebook worries they might lose interest.
The key, says Shie Erlich, Facebook’s manager for real-time infrastructure, is to make sure that when any user does something on live features on the social network–or any of the services it owns–the people who matter to them see it immediately. “Let’s say that your friend Joe wants to go live and show the new tricks his dog learned,” Erlich says. “He opens up the app, [clicks] go live, and at that point, he starts streaming video out. We’re trying to create an audience for this person. Otherwise he’s lonely.”
While most of what people broadcast with Facebook Live is entertaining or at least appropriate, the company has had to grapple with controversy over the tool. People have used it to stream shocking events, such as a number of suicides and killings, and in some cases, Facebook has been accused of moving slowly to remove the videos.
Still, most live videos are harmless and Facebook wants to ensure that users can see them in real time, as well as any relevant engagement, as they’re happening.
In order to do that, Facebook needs to automatically find all the people that might be interested in Joe’s video–friends, family, and followers–and send them notifications that alert them that something is happening. If they care, they’ll click on the notification and start watching Joe’s dog show off. And soon, in all likelihood, some of those people will begin commenting or reacting–sending likes, hearts, and so forth–which everyone also wants to see.
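As a rough illustration of that fan-out step, here is a minimal Python sketch. It is not Facebook's code: the function names, the three relationship maps, and the notification text are all assumptions made up for the example.

```python
def audience_for(broadcaster, friends, family, followers):
    """Collect everyone connected to the broadcaster, without duplicates.

    Each argument after `broadcaster` is a hypothetical dict mapping a
    user name to the set of people connected to them in that way.
    """
    audience = set()
    for graph in (friends, family, followers):
        audience |= graph.get(broadcaster, set())
    audience.discard(broadcaster)  # never notify the broadcaster about themselves
    return audience


def build_notifications(broadcaster, audience):
    """Produce one alert per interested person, in a stable order."""
    return [f"{person}: {broadcaster} is live now" for person in sorted(audience)]


# Illustrative data only.
friends = {"joe": {"ana", "raj"}}
family = {"joe": {"ana", "mom"}}
followers = {"joe": {"dogfan42"}}

audience = audience_for("joe", friends, family, followers)
alerts = build_notifications("joe", audience)
```

Here Ana is both a friend and a family member, but she still receives a single alert, because the audience is built as a set.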
The engagement doesn’t end at seeing how people are reacting. Facebook’s system is also designed to let you know while you’re watching Joe’s video that another friend of yours is watching too. That’s useful, Erlich says, because now you can communicate directly with that friend. “Our systems enable this notion of online presence,” he says, “and let you know that someone is sharing the same experience” as you.
When Facebook talks about real-time, they mean it. The goal, Erlich says, is “sub-second latency,” despite the difficult challenges of making sure that hundreds of thousands or even millions of people engaging with a live video are seeing things as they happen. “Real-time is a feature,” Erlich explains. “Today, it’s more like a utility….A big part of why it’s hard is how to do it for so many people at once.”
What makes it hard, he says, is that even while thousands or millions of people are watching the same video, Facebook still wants to give each person an individual experience. “If you and me are watching [a live video] on election night, I want to see your comment, because you’re my friend, more than” random people, he says.
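The idea of putting a friend's comment ahead of a stranger's can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Comment` class, the `personalize` function, and the "friends first, then oldest first" rule are invented here, and Facebook's actual ranking is not described in detail.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    author: str
    text: str
    sent_at: float  # seconds since the stream started (assumed unit)


def personalize(comments, friends):
    """Order comments for one viewer: friends first, then by send time."""
    # `c.author not in friends` is False for friends, and False sorts before True.
    return sorted(comments, key=lambda c: (c.author not in friends, c.sent_at))


comments = [
    Comment("stranger_1", "wow", 1.0),
    Comment("joe", "look at this", 2.0),
    Comment("stranger_2", "first", 0.5),
]

# The same comment stream, ordered for a viewer whose only friend is Joe.
ordered = personalize(comments, friends={"joe"})
```

Every viewer passes in a different set of friends, so the same stream of comments produces a different ordering for each person, which is what makes this expensive at the scale of millions of viewers.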
At the same time, Facebook and its other services are global. And while the system is usually handling countless events across the world, sometimes, everyone’s attention is drawn to the same event.
An example, Erlich explained, was last month’s solar eclipse. “Suddenly, the world’s attention converges,” he says. “Think of it as the Eye of Sauron, a beam of intense interest pointing at one specific place. Our system needs to handle everyone wanting to get the same information or experience.”
During NASA’s live video of the eclipse–which created a computer-science event Erlich calls a “hot spot”–there were more than 31 million views and nearly 300,000 comments. That meant traffic to Facebook’s systems grew more than 20 times within a couple of minutes.
That was an event Erlich’s team could predict. But what about the unpredictable? He says Facebook’s infrastructure must withstand anything, such as Beyonce announcing her pregnancy on Instagram, a post that generated more than 11 million likes.
According to Erlich, successfully handling the expected and the unexpected alike comes down to three elements.
First, he says, Facebook has “badass engineers” who are given the freedom to explore, to get creative, and to arrive at “crazy solutions to hard problems.” Second, Facebook and its sibling services are meant to work on common platforms that are built and reused by multiple product teams. One-offs “would fall over under their own weight.”
Finally, Facebook is constantly upgrading its systems, ensuring that its current technological architecture is able to scale to today’s needs. Tomorrow’s will be handled by even more advanced infrastructure.
“Our system is built to take a beating,” Erlich says. “We know this. We know it’s going to happen.”
One way to cope is to make sure that the most important elements stay available. Sometimes, as in the case of an extremely compute-intensive hot spot like the eclipse or a giraffe being born, that may mean that personal interactions take an algorithmic back seat to real-time content. Comments might, for instance, be shown in the order they’re sent rather than showing comments only from friends. When enough bandwidth is available, the system will once again return to showing everything.
It’s called graceful degradation. “That means if the shit hits the fan, and you cannot provide the service you’re used to, how do you do it without causing a shit show for users,” he says. “These are things we are continuously obsessing about.”
The goal, Erlich added, is to help internal engineering teams make choices about how their systems will handle heavy loads in advance.
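A choice like that, made ahead of time, might look something like the following Python sketch. It is a hypothetical example, not Facebook's implementation: the `load` figure, the `max_load` threshold of 0.8, and the tuple layout for comments are all assumptions.

```python
def comments_to_show(comments, friends, load, max_load=0.8):
    """Pick a comment ordering based on how busy the system is.

    comments: list of (author, text, sent_at) tuples.
    load: assumed utilization figure between 0 and 1.
    max_load: threshold, agreed in advance, above which personalization is dropped.
    """
    in_order = sorted(comments, key=lambda c: c[2])
    if load >= max_load:
        # Degraded mode: skip per-viewer work and show comments as they were sent.
        return in_order
    # Normal mode: friends first. sorted() is stable, so send order is kept within each group.
    return sorted(in_order, key=lambda c: c[0] not in friends)


comments = [
    ("stranger", "first", 0.5),
    ("joe", "look at this", 2.0),
    ("stranger", "wow", 1.0),
]

normal = comments_to_show(comments, friends={"joe"}, load=0.3)
degraded = comments_to_show(comments, friends={"joe"}, load=0.95)
```

In the degraded case the viewer still sees a live, moving comment feed. It just stops being tailored to them until the load drops back under the threshold.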
These problems are likely to become more frequent over time, as more and more real-time experiences are broadcast on Facebook’s systems. So the company must do a better job as the stakes get higher and the cost of failure grows, he says.
“We spend a lot of time trying to understand what’s coming in the next six months or a year,” he says. “We’re trying to make sure we don’t get caught with our pants down.”