The Long Now Foundation--currently breaking ground in Texas at the future site of its first monument-sized 10,000-year clock--is pursuing several programs in addition to the clock. One of these, the Rosetta Project, takes as its daunting mission the documentation of every human language currently in use, some 7,000 in total, the majority of which are in danger of disappearing without a trace. Directing this ambitious venture is Laura Welcher, a linguist who has specialized in building archival resources for indigenous North American languages. The project is documenting the world's languages and storing them on one small disc that currently contains 13,000 microetched pages of word lists from 1,500 languages.
On July 30, she and her colleagues, in conjunction with online translation service Mightyverse.com and the Internet Archive, will throw a “Record-a-thon” in San Francisco. They’ll be capturing video of speakers of the Bay Area’s more than 100 languages as they tell stories and converse. We reached Welcher by email this week to ask her about the Record-a-thon and the Rosetta Project’s broader purpose, scope, and ambitions.
You're trying to make a record of every language in the world. How do you go about that?
There are about 7,000 languages spoken in the world today, and it is likely that we will lose at least half of them--and some say up to 90%--in the next 100 years. With all our resources combined (money, experts in the field, community initiatives) we have a hope of maybe documenting 500 languages in the foreseeable future, but we need to scale this to about 5,000. The only way I can see to do this is by engaging speakers of languages themselves to produce their own language documentation. So, the question then becomes: What is the minimal amount of useful language documentation the average person might produce? I would argue it would be a verbal text--ideally a short video--and then I'd need to know what language the user thinks the recording is in (detailed identification can be done later).
The realization I've come to in the past year or so is most of us are carrying around language documentation devices in our own bag or back pocket--video enabled cell phones, cameras, laptops. If you project out 10 years, these devices become globally ubiquitous, and then anyone can create and contribute language documentation to a central repository. Then, as we assemble a collection of videos for any given language, we can start enriching them with transcriptions, translations, annotations--that is, building a corpus.
How will future researchers use that data, and what insights will they be able to glean?
A corpus can be used in many different ways--a small corpus can provide language learning and teaching materials, as well as materials for the building of linguistic resources such as grammars and dictionaries (this is the kind of language documentation linguists are producing today). Then, with a larger corpus--say tens of hours of transcribed speech--we can start building acoustic models for speech recognition. With a few million words we can start to do machine translation. And these are the tools that enable a language to be used online--which I would argue is a crucial new domain for language use in the modern world.
How will the corpus collected by the Rosetta Project differ from other archives of natural language?
Most language archives focus on languages of a particular region, or data collected under the umbrella of a particular project. The Rosetta Project is quite different in that we aim to assemble information on and in all human languages--all 7,000 of them. Not only is this a big effort, it is also a big challenge for how you organize all that information and make it usable to many different groups of people, from language specialists, to endangered language speech communities, to the interested general public, to an elementary school teacher or student.
So how are you going to reach the communities that speak endangered languages?
The project has to be visible and discoverable. What we've found over the past decade of building this collection is that speakers from small language groups find us--say if a speaker moves to a city, and has Internet access, and is doing a search to see what the Internet has to say about where he or she comes from. We may have some of the only documentation of their language available online. In the future, say within the next 10 years, I'm counting on ubiquitous Internet access through mobile devices, and those are the same ones that can be used to create language documentation. My want-to-have killer app would be a cross-platform "push this button and archive your language video in the Rosetta Project Collection."
What form will these recordings take, and how will their longevity be ensured?
Ideally video. Audio is fine too, but video is better because it is a richer source of language information--it also documents context (to help with what the speaker is talking about, or pointing to), speech participants (like a conversation, or public speaking event), as well as the speaker's body and facial gestures.
We'll be adding them to a collection in the Internet Archive which (along with the Rosetta Project) has a commitment to their long-term preservation, migration, and dissemination. To contribute recordings during the Record-a-thon event, participants will need to assign recordings a CC license, so we will be building an open collection. Openness is one of the keys to longevity, since, in the long run, unused or inaccessible resources are more likely to be lost. And unlike a lot of smaller language archives, the Internet Archive is very "discoverable" so people will come across the collection more easily, which promotes access, use and LOCKSS ("lots of copies keeps stuff safe").
Given that our knowledge of languages from 10,000 years in the past is shadowy and fragmented at best, the idea of offering a snapshot of living language today for our descendants ten millennia from now is awe-inspiring, to be sure. But what will be the value of collecting all of today's endangered languages?
Human language has taken thousands of years to reach the amount of differentiation we see today, with more than 100 different language families that aren't demonstrably related. Many of these languages have only a few thousand speakers and represent humanity's store of knowledge of how to survive--and indeed live--in all of the myriad environments of the planet. If languages are our how-to guides for living on planet earth, and we stand to lose up to 90% of them, then that seems like we are looking at handing our descendants an encyclopedia of human life on Earth with all of the pages ripped out, except sections X, Y, and Z.
But how does the project solve the problem of intelligibility going forward? Will future visitors to the Long Now clocks have some means of recording their own languages through the next 10,000 years' many generations?
One effort of the Rosetta Project is producing a physical backup in the form of the Rosetta Disk. This disk (microscopic pages of information formed in solid nickel, readable with 500 times magnification) has parallel information like word lists, texts, and grammatical information for as many human languages as we've been able to find this documentation for--so far about 2,500. This parallel collection effort has its inspiration in the original Rosetta Stone artifact, whose parallel texts enabled the decipherment of ancient Egyptian hieroglyphs, thereby unlocking an entire ancient civilization. And even if it is not used for such a future purpose, it is at least a pretty good snapshot of human cultural diversity on 21st century planet earth--a diversity we might not have in the near future, nor build up again for thousands of years.
[Images: Top, Wikimedia Commons; Bottom: Rosetta Project]