The New York City Taxi and Limousine Commission does not ordinarily release its data. But thanks to a Freedom of Information request, self-described data junkie Chris Whong was able to obtain 20 GB of data concerning 173 million taxi trips, including drop-off times, pickup locations, fare, tip amounts, and more, which he used for all sorts of cool data visualization projects. Private companies like Uber can even use this kind of information to accurately predict where a customer might want to be dropped off.
But it looks like the Taxi Commission may have made a mistake, as Vijay Pandurangan pointed out. Although the commission apparently attempted to anonymize taxi medallion and license numbers, Pandurangan was able to de-anonymize the entire dataset in about an hour thanks to an overlooked vulnerability. (He goes over how he was able to do this in this post.)
In a separate experiment that doubles as a spooky cautionary tale about privacy, Neustar researcher Anthony Tockar used the data to show what sort of damage could be done. Tockar was able to identify, for example, which celebrities took a specific cab at a specific time. All you have to do is Google publicly available images of the star getting out of a taxi. By cross-referencing the Taxi Commission’s data with photos published to celebrity gossip blogs, he was able to zero in on rides taken by stars like Bradley Cooper and Jessica Alba, and glean information like how much they paid. Tockar writes:
In Brad Cooper’s case, we now know that his cab took him to Greenwich Village, possibly to have dinner at Melibea, and that he paid $10.50, with no recorded tip. Ironically, he got in the cab to escape the photographers! We also know that Jessica Alba got into her taxi outside her hotel, the Trump SoHo, and somewhat surprisingly also did not add a tip to her $9 fare. Now while this information is relatively benign, particularly a year down the line, I have revealed information that was not previously in the public domain. Considering the speculative drivel that usually accompanies these photos (trust me, I know!), a celebrity journalist would be thrilled to learn this additional information.
Now, whether a celebrity tips or not might be harmless blog fodder. Maybe they left a cash tip, which isn’t recorded in the system. But Tockar took this opportunity to illustrate how this data could expose ordinary folks like you and me.
By using GPS coordinates, Tockar was able to track cab traffic to and from strip clubs located in Hell’s Kitchen between the hours of midnight and 6 a.m. By pinpointing the pickup and drop-off zones, Tockar could tell, with frightening precision, where a loyal customer of Larry Flynt’s Hustler club might reside. “The potential consequences of this analysis cannot be overstated,” writes Tockar. “Using this freely-obtainable, easily-created map, one can find out where many of Hustler’s customers live, as there are only a handful of locations possible for each point.”
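To see why so little effort is needed, here is a rough Python sketch of the kind of filter described above: keep trips picked up near a venue between midnight and 6 a.m., then count where they were dropped off. This is not Tockar's actual code. The `Trip` fields, the 100-meter radius, the rounding of coordinates to four decimal places, and the sample coordinates are all assumptions made up for illustration; the real dataset's column names and layout may differ.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


@dataclass
class Trip:
    # Hypothetical record layout, not the dataset's real schema.
    pickup_time: datetime
    pickup_lat: float
    pickup_lon: float
    dropoff_lat: float
    dropoff_lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    earth_radius_m = 6_371_000
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def late_night_dropoffs(trips, venue_lat, venue_lon,
                        radius_m=100.0, start_hour=0, end_hour=6):
    """Count drop-off points for trips that began near the venue late at night."""
    counts = Counter()
    for trip in trips:
        # Pickup hour must fall in [start_hour, end_hour), i.e. midnight to 6 a.m.
        in_window = start_hour <= trip.pickup_time.hour < end_hour
        near_venue = haversine_m(trip.pickup_lat, trip.pickup_lon,
                                 venue_lat, venue_lon) <= radius_m
        if in_window and near_venue:
            # Rounding groups nearby drop-offs into one bucket (an assumed choice).
            key = (round(trip.dropoff_lat, 4), round(trip.dropoff_lon, 4))
            counts[key] += 1
    return counts


# Made-up sample data: a venue at (40.0, -74.0) and four trips.
sample_trips = [
    Trip(datetime(2013, 1, 5, 1, 30), 40.0001, -74.0001, 40.0501, -73.9502),
    Trip(datetime(2013, 1, 12, 2, 15), 40.0001, -74.0001, 40.0501, -73.9502),
    Trip(datetime(2013, 1, 5, 14, 0), 40.0001, -74.0001, 40.0300, -73.9800),  # daytime
    Trip(datetime(2013, 1, 5, 3, 0), 40.0200, -74.0000, 40.0700, -73.9900),   # too far away
]
hotspots = late_night_dropoffs(sample_trips, venue_lat=40.0, venue_lon=-74.0)
```

In this toy data, only the first two trips pass both filters, and they share a drop-off point. A drop-off location that keeps recurring in the results is exactly the kind of signal that could point to where a repeat customer lives.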
Update: This article has been updated to clarify that Pandurangan had no hand in identifying the taxi trips taken by celebrities. His experiment merely showed how the Taxi Commission’s data could be de-anonymized.