The Long Now Foundation, currently breaking ground in Texas at the future site of its first monument-sized 10,000-year clock, is pursuing several programs in addition to the clock. One of these, the Rosetta Project, takes as its daunting mission the documentation of every human language currently in use: some 7,000 in total, the majority of which are in danger of disappearing without a trace. Directing this ambitious venture is Laura Welcher, a linguist who has specialized in building archival resources for indigenous North American languages. The project is documenting the world’s languages and storing them on one small disc that currently contains 13,000 microetched pages of word lists from 1,500 languages.
On July 30, she and her colleagues, in conjunction with online translation service Mightyverse.com and the Internet Archive, will throw a “Record-a-thon” in San Francisco. They’ll be capturing video of speakers of the Bay Area’s more than 100 languages as they tell stories and converse. We reached Welcher by email this week to ask her about the Record-a-thon and the Rosetta Project’s broader purpose, scope, and ambitions.
You’re trying to make a record of every language in the world. How do you go about that?
There are about 7,000 languages spoken in the world today, and it is likely
that we will lose at least half of them–and some say up to 90%–in the
next 100 years. With all our resources combined (money, experts in the
field, community initiatives) we have a hope of maybe documenting 500
languages in the foreseeable future, but we need to scale this to about
5,000. The only way I can see to do this is by engaging speakers of
languages themselves to produce their own language documentation. So,
the question then becomes: What is the minimal amount of useful language
documentation the average person might produce? I would argue it would
be a verbal text–ideally a short video–and then I’d need to know what
language the user thinks the recording is in (detailed identification
can be done later).
The realization I’ve come to in the past year or so is that most of us are
carrying around language documentation devices in our own bag or back
pocket: video-enabled cell phones, cameras, laptops. If you project out
10 years, these devices become globally ubiquitous, and then anyone can
create and contribute language documentation to a central repository.
Then, as we assemble a collection of videos for any given language, we
can start enriching them with transcriptions, translations,
annotations–that is, building a corpus.
How will future researchers use that data, and what insights
will they be able to glean?
A corpus can be used in many different ways–a small corpus can provide
language learning and teaching materials, as well as materials for the
building of linguistic resources such as grammars and dictionaries (this
is the kind of language documentation linguists are producing today).
Then, with a larger corpus (say, tens of hours of transcribed speech), we
can start building acoustic models for speech recognition. With a few
million words we can start to do machine translation. And these are the
tools that enable a language to be used online–which I would argue is a
crucial new domain for language use in the modern world.
How will the corpus collected by the Rosetta Project differ from other archives of natural language?
Most language archives focus on languages of a particular region, or data
collected under the umbrella of a particular project. The Rosetta
Project is quite different in that we aim to assemble information on and
in all human languages–all 7,000 of them. Not only is this a big
effort, it is also a big challenge for how you organize all that
information and make it usable to many different groups of people, from
language specialists, to endangered language speech communities, to the
interested general public, to an elementary school teacher or student.
So how are you going to reach the communities that speak endangered languages?
The project has to be visible and discoverable. What we’ve found over the
past decade of building this collection is that speakers from small
language groups find us–say if a speaker moves to a city, and has
Internet access, and is doing a search to see what the Internet has to
say about where he or she comes from. We may have some of the only
documentation of their language available online. In the future, say
within the next 10 years, I’m counting on ubiquitous Internet access
through mobile devices, and those are the same ones that can be used to
create language documentation. My want-to-have killer app would be a
cross-platform “push this button and archive your language video in the
Rosetta Project Collection.”
What form will these recordings take, and how will their longevity be ensured?
Video. Audio is fine too, but video is better
because it is a richer source of language information–it also documents
context (to help with what the speaker is talking about, or pointing
to), speech participants (like a conversation, or public speaking
event), as well as the speaker’s body and facial gestures.
We’ll be adding them to a collection in the Internet
Archive which (along with the Rosetta Project) has a commitment to their
long-term preservation, migration, and dissemination. To contribute
recordings during the Record-a-thon event, participants will need to
assign recordings a CC license, so we will be building an open
collection. Openness is one of the keys to longevity, since, in the long
run, unused or inaccessible resources are more likely to be lost. And
unlike a lot of smaller language archives, the Internet Archive is very
“discoverable” so people will come across the collection more easily,
which promotes access, use and LOCKSS (“lots of copies keeps stuff safe”).
Given that our knowledge of languages from 10,000 years in the past is
shadowy and fragmented at best, the idea of offering a snapshot of
living language today for our descendants ten millennia from now is
awe-inspiring, to be sure. But what will be the value of collecting all
of today’s endangered languages?
Language has taken thousands of years to reach the amount of
differentiation we see today, with more than 100 different language families that
aren’t demonstrably related. Many of these languages have only a few
thousand speakers and represent humanity’s store of knowledge of how to
survive–and indeed live–in all of the myriad environments of the
planet. If languages are our how-to guides for living on planet earth,
and we stand to lose up to 90% of them, then that seems like we are
looking at handing our descendants an encyclopedia of human life on Earth with all of the pages ripped out, except sections X, Y, and Z.
But how does the project solve the problem of
intelligibility going forward? Will future visitors to the Long Now
clocks have some means of recording their own languages through the next
10,000 years’ many generations?
One effort of the Rosetta Project is producing a
physical backup in the form of the Rosetta Disk. This disk (microscopic
pages of information formed in solid nickel, readable with 500 times
magnification) has parallel information like word lists, texts, and
grammatical information for as many human languages as we’ve been able
to find this documentation for (so far about 2,500). This parallel collection
effort has its inspiration in the original Rosetta Stone artifact, whose
parallel texts enabled the decipherment of ancient Egyptian hieroglyphs, thereby unlocking an entire ancient civilization. And even
if it is not used for such a future purpose, it is at least a pretty
good snapshot of human cultural diversity on 21st century planet
earth–a diversity we might not have in the near future, nor build up
again for thousands of years.