Fifty-three linear miles of shelves packed with manuscripts and books amassed over the course of 12 centuries, divided into 600 archival collections: an unfathomable labyrinth that hides more mysteries than we can possibly imagine. It’s the Vatican Secret Archives, and as The Atlantic recently reported, a new artificial intelligence project called In Codice Ratio is on track to expose its enigmas.
Not even the highest Roman Catholic church archivists know what’s hiding in the archives’ endless volumes, which are carefully stored near the Sistine Chapel. Only a tiny fraction of these archives has been digitized; the rest is an endless ocean of inaccessible papers and parchments. Going through their tomes in search of something would be a task that not even the goddess Minerva herself would be able to accomplish.
Some libraries have used technology to digitize their collections, such as optical character recognition (OCR) software that’s trained to recognize fixed, separated individual letter shapes. However, OCR is useless when it comes to the endless variety of free-flowing cursive styles featured in many of the Vatican’s tomes, which go all the way back to the eighth century. Handwriting in which letters and words run into each other has been undecipherable to non-human eyes, until now.
In a paper published on arXiv in March, researchers from the University of Rome, La Sapienza University of Rome, and the Vatican Secret Archives describe how they’ve developed a neural network capable of identifying entire handwritten Latin words rather than individual characters. Their method is based on TensorFlow, the widely used deep learning framework developed by Google, running on NVIDIA GeForce GTX graphics cards. According to the researchers, it “requires minimal training efforts, making the transcription process more scalable as the production of training sets requires a few pages and can be easily crowdsourced.”
That low training cost matters, because handwriting evolves as you move through the library’s 12 centuries of books, and the AI will need to be retrained for each handwriting style. According to The Atlantic, the AI trainers for the project’s test phase, which centered on the Vatican Registers of the 13th-century Pope Honorius III, were students recruited from 24 Italian high schools. The researchers built a platform that let these students judge the system’s analysis for accuracy and teach it using their own eyes. “Image by image, click by click, the students taught the software what each of the 22 characters in the medieval Latin alphabet (a–i, l–u, plus some alternative forms of s and d) looks like,” Sam Kean writes.
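To make the crowdsourcing idea concrete, here is a minimal Python sketch of how many students' clicks might be combined into training labels. It is an illustration only, not the project's actual platform: the function name `aggregate_votes`, the image IDs, and the two thresholds are all hypothetical, and the sketch assumes each click is recorded as an (image, character) pair.

```python
from collections import Counter


def aggregate_votes(votes, min_votes=3, min_agreement=0.6):
    """Combine crowdsourced clicks into one label per image.

    votes: iterable of (image_id, label) pairs, one per student click.
    An image keeps its most common label only if enough students voted
    and a large enough share of them agreed. Both thresholds are
    assumptions chosen for illustration.
    """
    counts_by_image = {}
    for image_id, label in votes:
        counts_by_image.setdefault(image_id, Counter())[label] += 1

    accepted = {}
    for image_id, counts in counts_by_image.items():
        total = sum(counts.values())
        label, top = counts.most_common(1)[0]
        if total >= min_votes and top / total >= min_agreement:
            accepted[image_id] = label
    return accepted


# Hypothetical clicks: three students on img-001, two on img-002, three on img-003.
votes = [
    ("img-001", "a"), ("img-001", "a"), ("img-001", "o"),
    ("img-002", "s"), ("img-002", "f"),
    ("img-003", "d"), ("img-003", "d"), ("img-003", "d"),
]
print(aggregate_votes(votes))  # {'img-001': 'a', 'img-003': 'd'}
```

In this sketch, img-002 is dropped because only two students voted on it and they disagreed, so it would simply go back into the queue for more clicks before reaching the training set.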
The researchers are excited about their preliminary results, which show an accuracy rate of 65%. They write, “our system has been able to produce good transcriptions that can be used by paleographers as a solid basis to speed up the transcription process at a large scale.”
Ultimately, the In Codice Ratio team is developing a general framework to digitize not only the entire Vatican Secret Archives but also any other large collection of ancient documents. “The goal,” the researchers say, “is to provide humanities scholars with novel tools to conduct data-driven studies over large historical sources,” which is why they have made their AI and training platform code available to anyone in the world.