Pete Warden's been called the "Facebook Whisperer," because of an unusual hobby he took up more than a year ago. Using simple Web crawling software, he assembled a database of 210 million Facebook public profiles: Names, "fan" pages, friend listings, and locations. He then used that data to find, for example, where "God" was the top fan site, the geographic reach of virtual social networks, and more.
Warden realized that full-time social science researchers might take what he'd gathered in important new directions and was on the cusp of distributing that data when Facebook shut the whole thing down. FastCompany.com talked with Warden about his Facebook project--how it began, what it promised, and what pissed off Facebook so much--as well as his plans for the future.
How did you come up with this idea of using Facebook data to do these studies?
Right now, I’m creating an email service called Mailana, short for mail analysis. It tries to add intelligence to your inbox. One thing I realized was that when people meet someone new, the first thing they do is Google or Facebook them. I wanted to integrate that function, which is how I stumbled into this data mine. I’ve always been an engineer, but it was a bit like finding a pristine archaeological site. That’s why I wanted to share it. I wanted to give it to people who’ve spent their entire careers analyzing data that was much less rich. And they were desperate for that data.
How many requests for the data did you get, and were there obvious scam artists?
I had over 100 requests for the data set and I narrowed that to 54 researchers who were affiliated with academic institutions. But the interesting thing is that I seemed to have scared off the scam artists. They didn’t seem to want to talk to someone who pretty much blogs about everything they do.
So what do you think is the ultimate promise of this sort of data mining work? What were the most interesting proposals you got?
Actually, someone sent me an email today asking for that data, and they said it could feed “lifetimes of curiosity,” which is pretty accurate. There’s a massive blind spot about the social component in our lives, and that’s where this data helps. On Facebook, you’re essentially voting all the time on your profile and nobody yet has the ability to make those votes count.
One request I got was from someone hoping to study how social connectedness and social networking relate to finding jobs. If you can correlate high social connection and employment rates, you can draw a conclusion about what actually works in job searches. You can then design computer programs that help people do that. I could see that making a difference in day-to-day life.
Another request was from art historians trying to figure out how artist popularity changes, spreads, and grows over time. That’s not going to put food on the table, but that’s data you can’t get otherwise, because there’s no one collecting data on artists that no one has ever heard of.
There was also a lot of health-related work. One group wanted to see how socially connected different regions were and how that relates to disease transmission, because places like New York and L.A. might be more closely connected than L.A. and a city somewhere else in California. They wanted to see how that affected the spread of disease.
Wow. It seems like Facebook set a lot of important work back. And hurt themselves, because why wouldn’t they want to have you in the fold?
This project became so big so quickly that I think it became an embarrassment. But I was hoping to break it to them gently. I don’t think they realized that you could take so much fragmented data and answer such big questions. The public profiles aren’t that interesting, but when you have a couple hundred million of these, it starts becoming so.
I don’t think Facebook has a master plan for this stuff. They’re wrestling with these issues as they come up, because we’re all treading new territory and figuring out the rules.
What happened, anyway? It seemed like Facebook was helping you before they decided to sue you.
I’ve helped report bugs and security holes so I had a good relationship with the technical team. But once the legal team got involved they wanted it all to stop, and they had a fairly draconian set of demands. They threatened to sue me, but because of my relationship with the security team they held off after I agreed to destroy the data.
But this stuff is all public. Names, fan pages, a few connections. What were they so afraid of?
I don’t really have a firm grasp on that. The data’s available in Google’s index to anyone with a Web crawler. But Facebook did talk about the data being a snapshot in time of preferences, so 10 years down the road you might not want to have people know what you were a fan of at 18.
But is it really that scary that someone will know you liked Nickelback as a teenager?
Facebook has set things up so that they control user data and give users ways to go back and remove that data. Once that’s public, they’ve lost control, so that’s scary.
The nightmare scenario is spam, but is that really a threat?
I don’t see how this could be used to get anybody’s email.
Now that the Facebook experiment is over, what’s next?
I’m a [British] immigrant myself and I’m fascinated by how people move around the country. Google profiles have a history of where you are and where you’ve lived, and there could be fascinating work to be done around where people are leaving, and where they’re going to. The profiles also have things like college and job profiles, so there’s a way to slice that data to see how the country is changing, what’s happening to people’s jobs, and where unemployment is driving people.
We also have a chance to quite easily draw interactive maps. If you can see this data county by county, patterns will jump out at you. I’m also hopeful about finding other data sets. Nothing has the kind of ubiquity of Facebook, but there are Twitter and Google Buzz profiles. I think there’s less controversy over those. Hopefully! I’m trying very hard to keep Google apprised of what I’m doing.
Do you think Facebook’s privacy concerns were warranted?
This is an area I’m torn about, because, personally, I’m pretty concerned about privacy. The fundamental problem here is that the data I was planning to release is still crawlable by anyone else and there are a lot of commercial companies that have grabbed the same dataset. I think the privacy concern I have is that this is being put on public profiles in the first place.
But the bigger analogy I come back to is the phone book. We grew up with the idea that our names and addresses and contact details were available, and we’ve developed social conventions to prevent abuse, such as not listing a full first name. On the Web, no one understands those conventions yet. So we need to let people know what data is available on them and have a debate about acceptable uses and control. Right now it’s a bunch of geeks sprinting ahead with the technology, and the cultural debate hasn’t caught up.