When the Library of Congress began archiving new tweets in February 2011, it would transfer about 140 million new tweets each day from temporary servers onto magnetic tape. By the next October, that number had soared to about 400 million tweets per day. There are now about 170 billion tweets in the archive, which also includes a collection of all tweets going back to 2006 that the library acquired from Twitter in 2010. The library’s two compressed copies of the data total 133.2 terabytes.
“It’s a few cabinets of tape,” explains Robert Dizard, the deputy librarian at the Library of Congress. “It’s not a roomful or roomfuls. And that’s just a testament of the storage capacity of tape.”
Since the library signed an agreement with Twitter in 2010 that gave it access to historical tweets, a small team of its staff has been working to establish a sustainable process for acquiring and storing tweets. Now the team is shifting its efforts to the significantly more difficult challenge of providing access to the archive it has built. Running a search term through a body of data as big as the tweet archive, Dizard says, could take more than 24 hours. The archived tweets are already indexed by time, but an hour of tweets could contain millions of 140-character snippets, which is not much help to someone researching a specific topic.
The library has experience with large digital collections. It regularly archives, for instance, websites, government databases, and policy events. But Twitter is new territory. “It’s not only very large,” Dizard says. “It’s expanding daily and at an increasing velocity. The variety of tweets is high.”
Not even Twitter, which employs some of the best engineers in Silicon Valley, has attempted to create a searchable archive of tweets. That’s partly because the commercial demand for historical access pales in comparison to that for real-time advertising. But the massive server space and resources such a project would consume are certainly another factor. Jamie de Guerre, VP Product at Topsy, a private company that provides some access to the Twitter archive, compares the task of indexing Twitter to indexing the entire Internet.
“Google’s index of the entire Internet ranges from about, in some estimates, 45 billion web pages to 125 billion web pages,” he told Fast Company in a recent interview. “So the size of Twitter is on the order of the size of the Internet, just in tweets instead of web pages. Having all of that data available, being able to query across and return a large data file to a user is definitely quite a challenge.”
As a first step toward addressing the challenge, the library is talking with third-party companies that could potentially manage access to the archive. Given current resources, a solution for access is not likely to be a search engine that can locate a specific tweet. Dizard says he has no idea what a viable solution might look like, but that abandoning the project isn’t an option.
“Our mission is to collect, preserve, and provide access to creative and historical record of America,” he says. “We’re looking at Twitter from a research and scholarship perspective as providing a reflection of everyday life as well as showing the development and impact of significant events. You also have the record and recordings of individuals. Which are also valuable.”
“I look at Twitter as a start of what the Library will be doing in the medium and long-term,” Dizard says. “Not a test of whether we’ll collect social media at all.”
[Image: Flickr user The Library of Congress]