Should We Teach Literature Students How To Analyze Texts Algorithmically?

Textual analysis like the type that revealed J.K. Rowling’s nom de plume could change the way we understand the very concept of writing style. Is this the answer to the staleness and despair that has crept into the study of literature?

When the U.K. newspaper the Sunday Times outed J.K. Rowling as the author of detective novel The Cuckoo’s Calling earlier this year, computer scientists were among the first people called in. Although the novel was published under the pen name Robert Galbraith, two computational scholars–including Duquesne University’s Patrick Juola–were tasked with confirming or denying whether the novel belonged to the Harry Potter author, or one of three other possible writers.


That Juola succeeded (his conclusions were later confirmed by Rowling herself) speaks volumes about the potential of algorithms and computer science, even when applied to a field as notoriously subjective as literature. Which raises an interesting question: Can we use software to help us think about literature?

Reverse Engineering J.K. Rowling

To explore that question, we should first look at how Juola cracked Rowling’s writing style. To begin the process, Juola loaded 1,000-word samples of The Cuckoo’s Calling into his self-designed Java Graphical Authorship Attribution Program (JGAAP), along with several other texts, including The Casual Vacancy, Rowling’s first post-Harry Potter novel. A freely available, modular Java program for textual analysis, categorization, and authorship attribution, JGAAP analyzed the texts on four different variables: word-length distribution, the use of common words like “the” and “of,” recurring word pairings, and the distribution of “character 4-grams,” or groups of four adjacent characters, which may cover whole words, parts of words, or the boundaries between them. The computer analysis took around 30 minutes in total.
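To make those four variables concrete, here is a minimal Python sketch of how each one might be computed for a single text sample. It is an illustration only, not JGAAP’s code: the function names, the short FUNCTION_WORDS list, and the simple tokenizer are all assumptions made for this example.

```python
from collections import Counter
import re

# Illustrative list of common "function words"; the list Juola used is not
# given in this article, so this one is an assumption.
FUNCTION_WORDS = ("the", "of", "and", "a", "to", "in")


def tokenize(text):
    """Lowercase the text and split it into words (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_length_distribution(words):
    """Share of words having each length, e.g. {3: 0.4, 5: 0.1, ...}."""
    if not words:
        return {}
    counts = Counter(len(w) for w in words)
    return {length: n / len(words) for length, n in counts.items()}


def function_word_rates(words):
    """How often each common word appears, as a fraction of all words."""
    if not words:
        return {}
    counts = Counter(words)
    return {w: counts[w] / len(words) for w in FUNCTION_WORDS}


def word_pairs(words):
    """Counts of adjacent word pairings."""
    return Counter(zip(words, words[1:]))


def char_4grams(text):
    """Counts of every run of four adjacent characters, spaces included."""
    return Counter(text[i:i + 4] for i in range(len(text) - 3))


def profile(text):
    """Bundle the four measurements for one sample of text."""
    words = tokenize(text)
    return {
        "word_lengths": word_length_distribution(words),
        "function_words": function_word_rates(words),
        "word_pairs": word_pairs(words),
        "char_4grams": char_4grams(text.lower()),
    }
```

Calling profile() on each 1,000-word sample would yield one set of numbers per sample. Attributing authorship then comes down to checking which candidate author’s samples produce the most similar numbers, a comparison step this sketch leaves out.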

“Nothing that we’re doing is magic,” Juola said recently about the process. “What we are doing is the same type of judgment that experts have always done about reading documents and figuring out something about the author–just a lot faster, and more accurate than most.”

Juola’s work is hardly the first piece of evidence that algorithms may have a useful role to play in the field of textual analysis. Whether it is researchers using algorithms to determine the semantic differences between tweets written by men and by women, or companies like Narrative Science using machine learning tools to generate entirely new works in a number of styles, the possibility that computer science can change the way books are read is becoming hard to ignore.

Of all the people to celebrate literary analysis’s “algorithmic turn,” perhaps none has been more outspoken than Jonathan Gottschall, a literary scholar at Washington and Jefferson College in Pennsylvania, who specializes in the field of literature and evolution.

Not only does Gottschall believe that algorithmic analysis can change the way we read literature, he also believes it should profoundly alter the way we study it. Most recently the author of The Storytelling Animal: How Stories Make Us Human, Gottschall has spent much of the past five years arguing that literary studies needs to adopt a more scientific approach–including up-to-date scientific theories, research methods, use of statistical tools, and insistence on hypothesis and proof.


Curing The Despair Over Humanities

Speaking to FastCo.Labs, Gottschall says that he was prompted to embrace the digital humanities by “frustration and near despair with the way academic literary studies within the humanities were conducting themselves.” In particular, he points to literary studies’ inability (or refusal) to ever get closer to an objective truth–with there being, he claims, little accumulation of knowledge from generation to generation that meaningfully builds on the work of predecessors rather than trying to shoot them down first.

“The sciences are full of vigor,” Gottschall says. “They’re full of energy–there’s a real sense that, boy, we’re doing important stuff, and that we’re really getting somewhere. Compare that to my field of English, within the humanities, where there’s the feeling that our best days are behind us, that this is a dying field with no cultural prestige, that jobs aren’t safe–and all of those things are really quite true.”

Gottschall believes that these problems (the humanities’ increasing irrelevance, decreasing popularity, and the subjects’ inability to ever answer questions conclusively) are not only connected, but can be solved by implementing the kind of data mining tools and computational textual analysis carried out by the likes of Patrick Juola.

“For me, it all comes down to the question that we are asking,” Gottschall says, “and often that is a question that cannot be solved without the empirical method. If you look back over the history of literary studies what you will see is an endless argument that never gets anywhere. If you want to get a conclusion, you can turn to the sciences.”

Anatomy Of The “Reading Machines” Of Tomorrow

Taking something of a less positivistic approach is Stephen Ramsay, associate professor of English and a fellow at the Center for Digital Research in the Humanities at the University of Nebraska-Lincoln, as well as author of last year’s Reading Machines: Toward an Algorithmic Criticism. In his book, Ramsay attacks the idea that the digital humanities should take over entirely from traditional literary studies, while additionally detailing the ways in which algorithms can be usefully integrated into the field without attempting to turn the humanities into a branch of statistical science in the process.

Algorithms can, Ramsay points out, be used to determine “vocabulary richness” by measuring the number of different words that appear in a 50,000-word block of text. This, in turn, can reveal useful insights like the fact that a “popular” author like Sinclair Lewis (sometimes derided for his supposed lack of style) regularly demonstrates twice the vocabulary of Nobel Prize laureate William Faulkner, whose work is considered notoriously difficult.
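A vocabulary-richness count of this kind is simple to sketch. The Python below counts the distinct words in the first block of a text. Ramsay’s exact procedure is not described here, so the function name, the tokenizing rule, and the choice to look only at the first block are assumptions made for illustration.

```python
import re


def vocabulary_richness(text, block_size=50_000):
    """Count the distinct words in the first `block_size` words of `text`.

    Assumption: words are runs of letters and apostrophes, compared in
    lowercase, so "The" and "the" count as the same word.
    """
    words = re.findall(r"[a-z']+", text.lower())
    block = words[:block_size]
    return len(set(block))
```

Running the function on equal-sized blocks from two authors gives directly comparable numbers. Keeping the block size fixed matters because longer texts naturally contain more distinct words.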


Tools like Google Books, meanwhile, offer the possibility of completely transforming the way in which literary scholars approach questions and comparisons, by allowing the simultaneous searching of up to 35,000 novels, and perhaps opening up the cultural space for works such as Pierre Bayard’s How To Talk About Books You Haven’t Read.

As Ramsay observes, “The rigid calculus of computation, which knows nothing about the nature of what it’s examining, can shock us out of our preconceived notions on a particular subject. When we read, we do so with all kinds of biases. Algorithms have none of those. Because of that they can take us off our rails and make us say, ‘A-ha! I’d never noticed that before.’”

The idea that the literary studies of the (near) future may involve more data visualization and machine learning than speculation on the “death of the author” or studying the sexism of the Western canon could be enough to unsettle some of those working within the field. But it might also be closer to becoming a reality than many of us realize.

[Image: Flickr user Mikael Altemark]