For some, quantity trumps quality in scientific research. Countries that have submitted the fewest papers to the online repository arXiv since 2011 tended to plagiarize the most. That’s what Science’s news and policy tracker, ScienceInsider, found when it asked arXiv to share data about the papers researchers submitted to it.
Anyone can submit a manuscript to arXiv (pronounced “archive”) as long as it documents a study in the math or physics domains. The documents don’t have to go through the standard peer-review process, which makes acceptance relatively easy.
Instead, a bot vets each new study, paying particular attention to text reused from older studies. The automated program compares a new article’s text to the text of every other document in arXiv’s database. After ruling out exceptions, such as when an author cites her own work or uses quotes, the bot flags the submissions that lift long passages word for word from older studies.
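The idea of comparing a submission against a corpus can be sketched with shared word n-grams. This is a toy illustration, not arXiv’s actual pipeline (which is not described in detail in the article); the function names, the 7-word shingle size, and the 20% threshold are all invented for the example.

```python
# Toy sketch of verbatim-overlap detection via shared word 7-grams.
# arXiv's real detector is not public; names and thresholds here are invented.

def ngrams(text, n=7):
    """Set of consecutive n-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(new_text, old_text, n=7):
    """Fraction of the new document's n-grams also found in the old one."""
    new_grams = ngrams(new_text, n)
    if not new_grams:
        return 0.0
    return len(new_grams & ngrams(old_text, n)) / len(new_grams)

def flag_submission(new_text, corpus, threshold=0.2, n=7):
    """Return IDs of corpus documents whose overlap exceeds the threshold."""
    return [doc_id for doc_id, old_text in corpus.items()
            if overlap_fraction(new_text, old_text, n) >= threshold]
```

A real system would also have to whitelist quoted passages and an author’s own earlier papers, as the article notes arXiv’s bot does.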
Copying, the bot finds, is quite common: among the 767,000 papers submitted from arXiv’s inception in 1991 until 2012, one in 16 authors was found to have copied long phrases and sentences from their own previously published work, and about one out of every 1,000 authors copied about a paragraph’s worth of text from other people’s papers without citing them.
So what happens to these copycat studies? ArXiv’s founder, Paul Ginsparg, and Cornell PhD student Daniel Citron have conducted what they say is the first comprehensive study of patterns of text reuse within the full texts of an important large scientific corpus. They found that the papers with the most reused text tended to attract fewer citations from later researchers.
“One motivation for undertaking this analysis of arXiv data was the known incidence of text copying and plagiarism, usually noticed by readers, and sometimes reported in the news media,” the researchers write of their study, which attempts to focus on “textual overlap” within arXiv, not “plagiarism” per se. There are no universal guidelines for what constitutes plagiarism in science anyway, they note, but rather “a standard somewhat more lenient than currently applied to journalists, popular authors, and public figures.”
Even if the algorithm can’t prove clear-cut plagiarism, it can help: an author’s tendency to reuse text in an article is a good indicator of her likelihood to plagiarize. Citron and Ginsparg shared their results in PNAS earlier this month, and posited that the plagiarism was influenced by cultural differences “in academic infrastructure and mentoring, or incentives that emphasize quantity of publication over quality.”
Those not highly proficient in English might also be likely to lift text from English sources. The paper observes this at the student level, “where in order to explain concepts, students less confident in their English proficiency tended to employ longer phrases from other sources, rather than just words.” But even at a later career stage, there may be a continued impetus for plagiarism. “A researcher concerned that his or her articles are rejected due to the quality of writing may feel compelled to imitate sentence structures from other articles.”
But ScienceInsider wanted to get to the bottom of these cultural differences. Knowing that authors had to report their home countries with each submission, it asked Ginsparg to release this data. ScienceInsider then mapped out all the countries from which authors submitted at least 100 papers since August 2011 and found that a small number of countries, like the U.S., Canada, and Japan, had the fewest flagged authors.
Incidentally, authors from these industrialized countries turned out to submit the most papers to arXiv. Authors from less industrialized countries, ScienceInsider noted, had the most flagged studies but also tended to submit fewer papers:
For example, of all the authors from Bulgaria who submitted papers since August 2011, 20% submitted flagged articles, while arXiv’s bot flagged only 6% of the authors from Japan. In the same time frame, around 4,700 papers came out of Japan, but only around 200 came out of Bulgaria.
In the U.S., 1,236 out of 26,052 authors were flagged, while in Germany, 297 out of 9,201 authors were flagged. In Iran, 164 out of 1,054 authors were cited for “text overlap,” while in China, 688 out of 6,372 authors were flagged.
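Converting those raw counts into flag rates makes the cross-country comparison explicit. The numbers below are exactly the ones reported above; only the percentage arithmetic is added.

```python
# Flagged-author rates from the counts ScienceInsider reported
# (flagged authors, total authors, since August 2011).
counts = {
    "U.S.":    (1236, 26052),
    "Germany": (297, 9201),
    "Iran":    (164, 1054),
    "China":   (688, 6372),
}

rates = {country: flagged / total for country, (flagged, total) in counts.items()}

# Highest flag rate first: Iran ~15.6%, China ~10.8%, U.S. ~4.7%, Germany ~3.2%
for country, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {rate:.1%}")
```

The ordering echoes the article’s point: the countries submitting the most papers show the lowest rates of flagged authors.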
“While conceivably exacerbated by the ease of cutting and pasting text in electronic format,” the researchers note, “the problem does predate both the new technology and the use of preprints. Ironically the combination of those makes that reuse that much easier to detect.”
Leave it to a bot, a bit of data mining, and a map to keep researchers in check, or at least raise more questions about how the same science is conducted differently across the globe.