Scroll down to the next section to learn why I'm spying on myself, or read on to see how readers helped me improve my model.
Update: Following up on some useful feedback from Hacker News users, I tweaked the spying experiment today to be a bit more effective. The first criticism was that stating the model's accuracy as a percentage is useless because I reported no baseline. In other words, if 67% of the sample data I gave the model belonged to one gender, then a 67% accuracy would mean that the model was just reproducing the distribution of records in the sample set and not actually finding any statistical pattern. I went back and checked my training data and found that it was roughly even, but did favor males by about 4%, meaning my real accuracy is closer to about 63%, which is only 13 percentage points better than flipping a coin.
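The baseline check can be sketched in a few lines of Python. This is only an illustration, not the code I actually ran: the function names are made up, and the 54/46 split is an assumption about what "favoring males by about 4%" looks like in the label counts.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy you would get by always guessing the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def lift_over_baseline(accuracy, labels):
    """How far a reported accuracy sits above the majority-class baseline."""
    return accuracy - majority_baseline(labels)


# Hypothetical label set (assumption): 54 male records and 46 female records.
labels = ["male"] * 54 + ["female"] * 46

baseline = majority_baseline(labels)     # share of the most common label
lift = lift_over_baseline(0.67, labels)  # reported accuracy minus baseline

print(f"baseline: {baseline:.2f}, lift: {lift:.2f}")
```

If the lift is close to zero, the model is probably just echoing the makeup of the training set.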
Similarly, the second criticism focused on the fact that I didn't do enough analysis on how good my model actually is given the limited criteria I gave it to train on. User quchen suggested verifying the model by increasing and decreasing the size of the training dataset and feeding it deliberately false data. When I did this, I noticed one thing in particular: the timestamp of the call did very little to affect the accuracy of the model. Looking at the output from the Prediction API's "analyze" method, I found out why: Google was treating each timestamp as a single token, and because it's pretty unlikely to receive two calls or text messages at the exact same time, it found almost no connections between the items in this category.
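Here is a rough Python sketch of why whole timestamps make poor tokens. The timestamps and the `repeated_share` helper are invented for illustration, and this is not how the Prediction API works internally; it only shows that a token which never repeats gives a model nothing to connect.

```python
from collections import Counter


def repeated_share(tokens):
    """Fraction of records whose token shows up more than once in the set."""
    counts = Counter(tokens)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(tokens)


# Hypothetical call timestamps (assumption: ISO-8601 strings).
timestamps = [
    "2013-06-03T09:15:00",
    "2013-06-03T17:42:00",
    "2013-06-04T09:05:00",
    "2013-06-10T09:30:00",
]

# Treat each full timestamp as one token, then only the hour as the token.
whole_share = repeated_share(timestamps)
hour_share = repeated_share([t[11:13] for t in timestamps])

print(whole_share, hour_share)
```

In this toy set no full timestamp repeats, while three of the four calls share the same hour.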
In retrospect, this was an obvious mistake. Even without understanding the nuances of the algorithms behind these models (and honestly, I really don't), it makes little intuitive sense to look for a pattern in the exact timestamp of a call. Where there may be a pattern, however, is in what times of the day or month or even hour each gender prefers to call. To rectify this problem, I split up the timestamp into four different tokens: the day of the week, day of the month, hour, and minute of the call.
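The split into four tokens might look something like this in Python. It is a sketch, not my actual script: the ISO-8601 input format and the field names are assumptions.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def timestamp_features(ts):
    """Split one timestamp string into four separate tokens."""
    dt = datetime.fromisoformat(ts)
    return {
        "day_of_week": DAY_NAMES[dt.weekday()],  # weekday() returns 0 for Monday
        "day_of_month": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
    }


# Hypothetical call time (assumption).
features = timestamp_features("2013-06-10T21:47:00")
print(features)
```

Each of the four values becomes its own column in the training data, so two calls made on different Mondays at 9 p.m. now share two tokens instead of none.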
This time when the model came back, it found lots of connections between gender and the various times they prefer to call or text, and the accuracy improved, too: Now I can say that my model can predict the gender of a caller with 80% accuracy. That's still not anything you'd want to deploy in a production system, but it is enough to suggest that there's a pattern, at least among the people unfortunate enough to interact with me via voice and SMS.
Last week, we learned that the NSA has been secretly collecting billions of phone records from major U.S. providers and mining the data, ostensibly to look for terrorists and other threats to national security. To justify these programs, the government is pointing to the fact that they don't collect the contents of these calls and text messages, just "metadata," and that to associate this data with real people, they need a warrant.
Here's the catch: there appears to be nothing that says the government can't use full, non-anonymous datasets to mine this metadata for pure gold. We've been covering data science in business at Co.Labs, but if you need a refresher, here's how basic data-mining typically works: you take a set of data that contains examples of the types of patterns you're looking for, and use it to train a computer to look for similar patterns in another set of data.
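To make the train-then-predict idea concrete, here is a toy Python sketch. It is not the algorithm Google's Prediction API uses (I don't know what that is); it simply remembers the most common label for each feature value, and the data and function names are invented.

```python
from collections import Counter, defaultdict


def train(examples):
    """Learn the most common label for each feature value, plus a fallback."""
    by_feature = defaultdict(Counter)
    overall = Counter()
    for feature, label in examples:
        by_feature[feature][label] += 1
        overall[label] += 1
    model = {f: c.most_common(1)[0][0] for f, c in by_feature.items()}
    default = overall.most_common(1)[0][0]
    return model, default


def predict(model, default, feature):
    """Return the learned label, or the overall favorite for unseen values."""
    return model.get(feature, default)


# Hypothetical training examples (assumption): (time of day, caller gender).
training = [
    ("evening", "female"),
    ("evening", "female"),
    ("morning", "male"),
    ("morning", "male"),
    ("morning", "female"),
]

model, default = train(training)
predictions = [predict(model, default, f) for f in ["morning", "evening", "night"]]
print(predictions)
```

Real tools look for far subtler patterns across many variables, but the shape is the same: labeled examples go in, and guesses about new data come out.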
These techniques are now so widespread that performing simple data-mining on an individual level is becoming much easier, thanks to numerous prediction libraries available in just about any programming language and powerful cloud-based tools like Google's Prediction API. To understand exactly what the government can do with this metadata, I decided to beat the NSA at its game by spying on my own data.
Unfortunately, getting access to it proved difficult. According to this support thread, my cellphone provider, Verizon, does not allow customers to export call data and only provides 30 days of logs online, even though it will willingly hand all of it over to the government. Luckily for me, I've been using Google Voice for most of my calls and text messages for the last three years, and Google provides a data export service called Google Takeout that includes everything the government has except device serial numbers and location.
The Takeout data wasn't perfectly formatted for data-mining, so I wrote a quick and delightfully inefficient Ruby script, which I'll be sharing on GitHub shortly, that converts the data to a CSV. Once I had the data in a workable format, the question became what to look for. My life probably isn't all that interesting to the NSA (I hope, anyway), so looking for signs of terrorism in my data seemed unlikely to turn up much. Plus, I'm new to data-mining, and my friends who aren't new to it convinced me to start with something simple. I decided to ask a basic question: can a computer tell, based only on the time of day and duration of a call, whether a given caller is male or female?
I randomly chose 20 phone numbers from my metadata, looked up the gender of the owners of those numbers, and marked all of their records as male or female. Then, I fed that set of 861 training examples to the Google Prediction API, and waited. As I'm sure any true data scientists reading this article are screaming right now, there are numerous caveats to be made here. First, 861 training examples is a very small sample, and isn't likely to produce a good result. Moreover, I'm only looking at a couple of variables, meaning any patterns it finds won't be very strong. Second, because I only have my data to play with and my call patterns are unique to me, any results I get from this experiment will probably only apply to me. Finally, randomly picking 20 numbers is a bad way to choose a sample population.
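The labeling step could be sketched like this in Python. Everything here is hypothetical: the record fields, the phone numbers, and the `gender_lookup` table, which stands in for looking up each owner by hand. The sketch samples 2 numbers rather than 20 to keep it short.

```python
import random


def label_records(records, gender_by_number):
    """Keep only records from labeled numbers and attach the gender label."""
    return [
        {**record, "gender": gender_by_number[record["number"]]}
        for record in records
        if record["number"] in gender_by_number
    ]


# Hypothetical call records (assumption): number, hour of day, duration.
records = [
    {"number": "555-0101", "hour": 9, "duration_s": 120},
    {"number": "555-0101", "hour": 21, "duration_s": 45},
    {"number": "555-0102", "hour": 13, "duration_s": 300},
    {"number": "555-0103", "hour": 8, "duration_s": 60},
    {"number": "555-0104", "hour": 19, "duration_s": 610},
    {"number": "555-0104", "hour": 20, "duration_s": 15},
]

# Stand-in for the manual lookup of each number's owner (assumption).
gender_lookup = {
    "555-0101": "female",
    "555-0102": "male",
    "555-0103": "male",
    "555-0104": "female",
}

rng = random.Random(0)  # fixed seed so the sample is repeatable
numbers = sorted({record["number"] for record in records})
sampled = rng.sample(numbers, 2)

labeled = label_records(records, {n: gender_lookup[n] for n in sampled})
print(len(labeled), "labeled records from", sampled)
```

Every record from a sampled number gets the same label, which is why 20 numbers can yield several hundred training examples.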
Nonetheless, when Google's API was done training my model, it reported that it could predict the gender of a caller with 67% confidence in the result. That's poor confidence for any production model (only 17 percentage points better than guessing), but when I tested it on other calls in my history, and even on friends' calls, I found it surprisingly good at determining a caller's gender. Moreover, we don't know what the NSA considers good enough to seek out a warrant, but the evidence suggests the threshold is fairly low: According to leaker Edward Snowden, an analyst at the NSA only needs to be 51% confident that their target isn't a U.S. citizen.
Most importantly, if that's what I can do with a limited set of my own data, imagine what the NSA can do with the datasets it has access to. If you don't think determining an anonymous caller's gender is particularly useful, think about the other things you might find out from a better set of data and more precise algorithms, like which callers are likely to be related to one another (I'm going to try that one on myself next), or with location data, where they're likely to be at any given time. Once you start combining these questions and running these algorithms on multiple people's sets of data, you start to see how you can build up a fairly complete picture of just about anyone's life without truly knowing anything about them at all.
What's next for this experiment? I'd love to hear your suggestions on what to look for in my own data. I'll also be cleaning up my scripts and posting them to GitHub so that others can mine their Google Voice data, and I'm working on exposing my models to the public online so that anyone can plug in their data. In the meantime, let me know if you have any suggestions on Twitter.
[Image: Flickr user Ralphbijker]