The frantic, unprecedented race to save 700,000 NSFW Tumblrs for posterity

Volunteers are scrambling to download up to 800 terabytes of content from Tumblr’s adult-themed community before it disappears from view on December 17.

GeoCities, Vine, Friendster–communities live, thrive, and often die on the net. But the two-week timeframe in which content will disappear from Tumblr is unprecedented, says Jason Scott. He cofounded Archive Team, a volunteer project running software that scarfs copies of endangered websites for posterity.

They are now scrambling to preserve an estimated 700,000 Tumblr blogs that are expected to partly or entirely disappear due to a new, broadly defined ban on “adult content” announced on December 3. That makes Monday the 17th D-Day, when images, GIFs, and videos flagged as verboten by Tumblr’s AI will disappear from public view–and probably from the reach of archivists.

“Usually we’re given 30 or 60 days or 90 days warning. Fourteen days is insane,” says Scott. “So we’re going to probably get just a percentage. Frankly, I don’t know what that percentage is.” For comparison, the team has mostly finished archiving GeoCities Japan, which won’t go offline until March 2019.

Other people are also offering tools to preserve these blogs, but not at the industrial scale of the Archive Team effort–which may still not be enough.

Scott sprang into action as soon as the Tumblr content ban was announced, getting the mechanism in place to facilitate a mass-download of material, which began on December 7. So far, volunteers have copied over 40,000 blogs–a fraction of the platform’s total of 12 million blogs–amounting to about 10 terabytes of data. Scott estimates the total amount of content that may be banned is between 400 and 800 terabytes.

To take part, volunteers install a program for Windows, Mac, or Linux called ArchiveTeam Warrior, which makes their computer part of a distributed network. Individual systems scrape websites and forward the content to Archive Team’s servers. (The top volunteer had processed about a terabyte of data between Saturday and Tuesday afternoon, according to the group’s leaderboard.)

Much of the material Archive Team has scraped over the years winds up reproduced on the Wayback Machine, run by the nonprofit Internet Archive in San Francisco. There’s no formal relationship between the two groups, but a strong informal one: Scott holds the staff title of “freerange archivist” at the Internet Archive–facilitating connections with individuals or groups (like Archive Team) that have collected digital content for preservation.

Jason Scott [Photo: Dennis van Zuijlekom/Flickr]

The Internet Archive has agreed to take the rescued Tumblr content. “[Internet Archive is] the institution that is most open to receiving archived web content,” says Scott. “Sometimes Archive.org has said, ‘We can’t take this. This is too much.’ But it’s very rare.”

Scott’s effort is one of several to rescue Tumblr content ahead of the ban. The poorly trained machine learning software of Tumblr parent company Verizon has flagged a baffling number of images, GIFs, and videos as “adult”–seemingly anything that is beige or contains round shapes. On the 17th, this content will be hidden from public view, though not deleted from the servers, says Tumblr. Users will also have an opportunity to appeal decisions, which the company admits to us have been error-prone. (It has not commented on preservation projects like Archive Team’s.)

But a wide range of bloggers–from sex educators to porn aficionados to artists who make racy images–feel that they are no longer welcome on Tumblr, and it’s time to move on.

Some are building alternative sites, like one called Timbr that can scarf up and reproduce an entire Tumblr blog. (It worked quickly and almost perfectly with an old, safe-for-work Tumblr blog I had run years ago.) People need only to post the name of any Tumblr blog into a field on the site. The goal is to make it what Tumblr had been–not a pure-porn site, but a broad-based online community that doesn’t shun NSFW content. But adult-focused sites are also getting in on the act. One called Dark Cloud, for instance, also has a Tumblr-scarfing tool.


Some Tumblr users are heading to Twitter or other sites with liberal content policies, such as Dreamwidth and Pillowfort.

But there are many drawbacks to these efforts. New sites take time to build, and models of funding them–donations, memberships, ads, etc.–at scale are unclear. Twitter isn’t really a community site. Dreamwidth accounts are limited to 500MB of storage, and Pillowfort is in closed beta.

And the 17th is looming–after which Archive Team, Timbr, and other sites will no longer be able to access the hidden “adult” content. (Timbr’s creator is working on a fix that might allow owners to transfer content after the 17th.) Owners of blogs can still download all their content after the 17th, however, using a built-in Tumblr feature that creates a zip file of the site. But how useful is that to people not versed in web technologies?

“The people who are the most savvy will do things like port their work to WordPress or build something at a [web] host,” says Scott. But he fears a lot of people won’t be prepared to handle the shutdown. “They don’t really fully integrate what it all means for them… They don’t know their next moves.” Beyond making copies to put on the Wayback Machine, he says, Archive Team might also help people restore their blogs on other platforms.

Scott’s suspicious of Tumblr’s statement that no content will be permanently deleted (just hidden) and that only graphic visual material, not entire blogs, will be inaccessible. The sudden decision to change content guidelines, the limited time to adapt to the new policies, and the glitches in Verizon’s image-tagging software do not inspire confidence in Tumblr’s procedures among archivists and bloggers.

That’s another reason why Scott has pushed on so aggressively, despite criticism from some that Archive Team is taking people’s content without their permission. He is setting up a process for people to request that their blogs be removed from the sweep. Not everyone impacted by the ban is thrilled by that approach. “They really should have an opt-in, not a ‘whoops, maybe we’ll get around to taking it off if you DM us according to this post buried in a thread that people probably won’t see,’” one user wrote on Twitter.

For now, though, Scott says, Archive Team is most focused on saving as much as it can, while it still can.

About the author

Sean Captain is a Bay Area technology, science, and policy journalist. Follow him on Twitter @seancaptain.
