When you tweet–even if you tweet under a pseudonym–how much do you reveal about yourself? More than you realize, argues a new paper from researchers at the Mitre Corporation. The paper, “Discriminating Gender on Twitter,” which is being presented this week at the Conference on Empirical Methods in Natural Language Processing in Scotland, demonstrates that machines can often figure out a person’s gender on Twitter just by reading their tweets. And such knowledge is power: the findings could be useful to advertisers and others.
To conduct their research, the Mitre folks–John Burger, John Henderson, George Kim, and Guido Zarrella–first had to assemble a corpus of Twitter users whose gender they were confident of. Since Twitter doesn’t ask users to specify gender, they narrowed their focus to Twitter users who had linked to major blog sites on which they had filled out that information. In addition to collecting these users’ tweets–many had tweeted only once, while one had tweeted 4,000 times–Burger et al. collected the minimal profile data that Twitter users sometimes do include: screen name, full name, location, URL, and description.
The dataset was about 55% female, 45% male (which squares roughly with estimates of Twitter’s overall gender breakdown). Thus, by guessing “female” for every user, a computer would be right 55% of the time. Simply by examining the full name of the user, a computer was accurate about 89% of the time–a remarkable improvement, if not an especially interesting one, since first names are highly predictive of gender. The Mitre findings become intriguing, though, when the team limited its analysis to tweets alone. By scanning for patterns across all the tweets of a given user, Mitre’s program was able to guess the correct gender 75.8% of the time–roughly 21 percentage points better than the baseline. And even from a single tweet of a user, it was right 65.9% of the time–nearly 11 percentage points better than the baseline.
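The gains quoted above are gains in accuracy over the simplest possible classifier, one that always guesses the majority class. A quick sanity check of the arithmetic (the accuracy figures are the paper’s; the labels are ours):

```python
baseline = 0.55  # always guess "female", the majority class

# Accuracy reported for each data source in the Mitre paper.
accuracies = {
    "full name": 0.89,
    "all tweets": 0.758,
    "single tweet": 0.659,
}

for source, acc in accuracies.items():
    gain = (acc - baseline) * 100  # gain in percentage points
    print(f"{source}: +{gain:.1f} points over baseline")
```

Note that these are percentage-point gains, not relative improvements: going from 55% to 75.8% is about a 21-point gain, but a 38% relative reduction in wrong guesses would be a different (and also defensible) way to report it.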
Here, from the paper, is a table breaking down how various combinations of data helped boost the gender prediction success rate:
How is this possible? How can we give away so much in 140 characters or fewer? A whole branch of study called “sociolinguistics” observes that different people speak differently. In the real world, for example, sociolinguists have found that women tend to laugh more than men. In the last few years, computational linguists like those at Mitre have sought to determine to what extent that research translates into cyberspace. They use a technique called n-gram analysis: given a certain sequence of characters or words, what is the likelihood that the speaker is, for instance, male or female?
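The idea can be made concrete with a toy sketch. Below is a minimal character-bigram classifier in the naive Bayes style: count each class’s character pairs during training, then label a new tweet with whichever class makes its character pairs most probable. This is our illustration of the general technique, not the Mitre team’s actual system, and the four training tweets are invented for the example:

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(labeled_tweets, n=2):
    """Count n-gram frequencies per class."""
    counts = {}
    for label, text in labeled_tweets:
        counts.setdefault(label, Counter()).update(char_ngrams(text, n))
    return counts

def classify(text, counts, n=2):
    """Pick the class whose n-gram counts give the text the highest
    smoothed log-likelihood (add-one smoothing)."""
    vocab = set()
    for c in counts.values():
        vocab.update(c)
    best_label, best_score = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[g] + 1) / (total + len(vocab)))
                    for g in char_ngrams(text, n))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration (not from the Mitre corpus).
data = [
    ("female", "omg that was so cute!! :)"),
    ("female", "love this song!!"),
    ("male", "the game starts at 8"),
    ("male", "new build is up on the server"),
]
model = train(data)
print(classify("so cute!!", model))  # the "!!" and "cu"/"ut" bigrams skew female here
```

A real system would train on millions of tweets and combine character n-grams with word n-grams and profile fields, but the mechanism is the same: each tiny fragment of text nudges the odds one way or the other, and the nudges add up.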
Mitre found that given certain characters or combinations of characters, the computer could wisely bet on the gender of the tweeter. The mere presence of an exclamation mark or a smiley face in a tweet, for instance, meant the odds were that a woman was tweeting. Of the most gender-skewed words, the majority were in the female category, while only a few were male, leading to this unintentionally hilarious figure from the Mitre paper: