When you tweet–even if you tweet under a pseudonym–how much do you reveal about yourself? More than you realize, argues a new paper from researchers at the Mitre Corporation. The paper, “Discriminating Gender on Twitter,” which is being presented this week at the Conference on Empirical Methods in Natural Language Processing in Scotland, demonstrates that machines can often figure out a person’s gender on Twitter just by reading their tweets. And such knowledge is power: the findings could be useful to advertisers and others.
To conduct their research, the Mitre folks–John Burger, John Henderson, George Kim, and Guido Zarrella–first had to assemble a corpus of Twitter users whose gender they were confident of. Since Twitter doesn’t demand that users specify gender, they narrowed their focus to Twitter users who had linked to major blog sites in which they had filled out that information. In addition to collecting the tweets of these folks–many users had only tweeted once, while one of them had tweeted 4,000 times–Burger et al. collected the minimal profile data that Twitter users sometimes do include: screen name, full name, location, URL, and description.
The dataset was about 55% female, 45% male (which squares roughly with estimates of Twitter’s overall gender breakdown). Thus, by guessing “female” for every user, a computer would be right 55% of the time. Simply by examining the full name of the user, a computer was accurate about 89% of the time–a remarkable improvement, if not an especially interesting one, since first names are highly predictive of gender. The Mitre findings become intriguing, though, when the team limited its analysis to tweets alone. By scanning for patterns in all the tweets of a given user, Mitre’s program was able to guess the correct gender 75.8% of the time–an improvement of about 20 percentage points over the baseline. And even by analyzing a single tweet from a user, it was right 65.9% of the time–about 11 percentage points better than the baseline.
Here, from the paper, is a table breaking down how various combinations of data helped boost the gender prediction success rate:
How is this possible? How can we give away so much in 140 characters or fewer? A whole branch of study called “sociolinguistics” examines how language use varies across social groups. In the real world, for example, sociolinguists have found that women tend to laugh more than men. In the last few years, computational linguists like those at Mitre have sought to determine to what extent that research translates into cyberspace. One of their core techniques is n-gram analysis: given a certain sequence of characters or words, what is the likelihood that the speaker is, for instance, male or female?
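To make the idea concrete, here is a minimal sketch of n-gram extraction in Python. The sample tweet and the choice of n are purely illustrative, not drawn from the Mitre dataset:

```python
def char_ngrams(text, n):
    """Overlapping character n-grams, e.g. char_ngrams('omg', 2) -> ['om', 'mg']."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """Overlapping word n-grams from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweet = "omg my yoga class was niceeee"   # made-up example tweet
print(char_ngrams(tweet, 3)[:3])          # ['omg', 'mg ', 'g m']
print(word_ngrams(tweet.split(), 2)[:2])  # [('omg', 'my'), ('my', 'yoga')]
```

In practice, a classifier is then trained on how often each n-gram appears in tweets whose authors’ genders are already known.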
Mitre found that given certain characters or combinations of characters, the computer could wisely bet on the gender of the tweeter. The mere presence of an exclamation mark or a smiley face in a tweet, for instance, meant the odds were that a woman was tweeting. Of the most gender-skewed words, the majority were in the female category, while only a few were male, leading to this unintentionally hilarious figure from the Mitre paper:
It’s fun to imagine these two characters on a date.
Of course, one shouldn’t run away with such a chart: it doesn’t mean that all women talk about on Twitter is chocolate and Etsy. It simply means that if those words do appear in a tweet, it’s a winning bet to guess that a woman is its author.
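That “winning bet” is, at bottom, an application of Bayes’ rule. Here is a toy sketch of the calculation; the token counts are invented for illustration (they are not from the Mitre paper), and the prior roughly matches the 55/45 split described above:

```python
# Invented counts of how often a token appears in tweets by each gender.
counts = {
    "!":    {"female": 80, "male": 40},
    ":)":   {"female": 60, "male": 25},
    "http": {"female": 20, "male": 45},
}
totals = {"female": 550, "male": 450}  # toy corpus sizes, ~55/45 prior

def p_female_given(token):
    """Posterior probability that the author is female, given one token."""
    p_f = totals["female"] / (totals["female"] + totals["male"])  # prior P(female)
    p_m = 1 - p_f
    like_f = counts[token]["female"] / totals["female"]  # P(token | female)
    like_m = counts[token]["male"] / totals["male"]      # P(token | male)
    num = like_f * p_f
    return num / (num + like_m * p_m)

print(round(p_female_given("!"), 2))     # 0.67 -> bet on a woman
print(round(p_female_given("http"), 2))  # 0.31 -> bet on a man
```

A real system combines evidence from thousands of such features per user, which is how accuracy climbs well above the single-feature bets shown here.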
The Mitre folks aren’t the first to have brought a sociolinguistic approach to Twitter. In 2010, Delip Rao and others at Johns Hopkins performed a similar study, this time looking not only at gender but also at age, geography, and politics. They identified, both manually and through computer programs, categories of linguistic features that were demographically predictive. Emoticons, abbreviations like OMG, repeated letters (“niceeee” or “noooo waaay”), expressions of affection (“xoxo”), and a category called “honorifics” (“dude,” “man,” “bro,” “sir”) were all predictive, one way or another.
The most intriguing category that Rao et al. focused on, though, was what they called “possessive bigrams”–word pairings beginning with the word “my” or “our.” If someone wrote “my zipper” or “my wife” or “my nigga,” odds were that person was a man. But if someone wrote “my yogurt” or “my husband” or “my yoga,” odds were you were dealing with a woman. Here’s a fuller chart:
They also did a breakdown of “my” phrases as they correlated with political identification. Looks like Democrats on Twitter are more likely to talk about their tofurkey and sushi than Republicans, while tweeting Republicans are more likely to talk about their weapons or their local Walmart. (In other news, canine Twitter users are more likely to bite men than the other way around.)
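Extracting these possessive bigrams from a tweet is straightforward. A minimal sketch follows; the sample sentence is made up, and the simple regex tokenizer is an assumption for illustration, not Rao et al.’s actual method:

```python
import re

def possessive_bigrams(text):
    """Return ('my' | 'our', next-word) pairs found in a tweet."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if a in ("my", "our")]

print(possessive_bigrams("Spent the day with my wife and our dog"))
# [('my', 'wife'), ('our', 'dog')]
```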
Rao, reached for comment, said he was impressed with the Mitre paper, but that he had “a feeling they must have missed a certain section of the population.” He was surprised that “http” and “google” were so highly predictive of a male speaker. “In our case, we saw references to ‘PS3,’ things like ‘bro,’ ‘man,’ and ‘dude,’” Rao tells Fast Company. (The Mitre authors, for their part–who were unavailable to speak to me due to their preparation for this week’s conference–write in their paper that “[o]ur Twitter-blog dataset may not be entirely representative of the Twitter population at general,” since Twitter users who also blog are a particular kind of Twitter user.)
“All data sets that we gather from social media are skewed,” at least when it comes to making points about the population at large, adds Jacob Eisenstein of Carnegie Mellon University, who has done similar work on Twitter. (He found, intriguingly, that a distinctive geographical region spanning from Philadelphia through Cleveland and into Detroit was most likely to use the abbreviation “CTFU,” for “cracking the fuck up,” on Twitter–even though that region is not generally considered a distinctive linguistic region outside of Twitter.)
Why, in the end, should anyone go to so much trouble to infer omitted details from a person’s Twitter profile? I asked Eisenstein if it all came down to marketing dollars. “I would assume that that’s the case,” he said, adding, “I’m an academic, so if I knew something about making money, I wouldn’t be here.”
Rao agreed that marketing is one of the major motivators here, adding that he had heard talk that Twitter was internally working on similar demographically identifying algorithms. (Twitter, through a spokesperson, declined to comment on that.) If knowing users’ gender is so valuable, why doesn’t Twitter just have people check a box as they sign up? “A lot of people will put any kind of garbage in those fields,” points out Rao. “People have things like ‘jellyfish’ or something, and you don’t know what ‘jellyfish’ means, or what gender is ‘jellyfish.’”