Twitter’s robust community of non-English speakers just got another boost with the launch of a new site called Indigenous Tweets. The site, created by St. Louis-based computational linguistics professor Kevin Patrick Scannell, collects tweets in more than 70 languages. These range from better-known tongues such as Haitian Creole and Basque to the downright esoteric Gamilaraay, an Australian indigenous language with approximately three living speakers.
Indigenous Tweets launched in March 2011, following a linguistic survey of Twitter by Scannell that found almost 500 languages in use on the service. Tweet collection relies on a customized database of words and phrases in each language, which is used to build lists of which users tweet in which languages. Twitter’s API is then used to gather user information and posting frequency for each language. From there, the process is largely automated:
The site is generated by using a program that “crawls” Twitter users, grabbing the tweets on their timeline and performing statistical language recognition on those tweets […] Then, if a given user has more than a certain fraction of their tweets in the target language, that user’s followers are added to a queue to be checked in the same way. In the last couple of days, the initial crawls for Basque and Welsh were completed.
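The crawl Scannell describes can be illustrated with a minimal Python sketch. It is not the site's actual code: the function names, the in-memory timelines and follower lists, and the 0.5 thresholds are all assumptions made for illustration, and a simple word-list match stands in for the statistical language recognition mentioned in the quote. A real crawler would fetch timelines and followers through Twitter's API instead of the toy dictionaries used here.

```python
from collections import deque


def fraction_in_language(tweets, vocabulary):
    """Return the share of tweets whose words mostly appear in the vocabulary.

    Stand-in for real statistical language recognition (an assumption).
    """
    if not tweets:
        return 0.0
    hits = 0
    for tweet in tweets:
        words = [w.strip(".,!?;:").lower() for w in tweet.split()]
        words = [w for w in words if w]
        if words and sum(w in vocabulary for w in words) / len(words) >= 0.5:
            hits += 1
    return hits / len(tweets)


def crawl(seeds, get_tweets, get_followers, vocabulary, threshold=0.5):
    """Breadth-first crawl: keep users above the threshold, queue their followers."""
    queue = deque(seeds)
    seen = set(seeds)
    matched = []
    while queue:
        user = queue.popleft()
        if fraction_in_language(get_tweets(user), vocabulary) > threshold:
            matched.append(user)
            # Only followers of matching users are queued for checking.
            for follower in get_followers(user):
                if follower not in seen:
                    seen.add(follower)
                    queue.append(follower)
    return matched


# Toy data (hypothetical users; a handful of Welsh words as the vocabulary).
WELSH_WORDS = {"bore", "da", "diolch", "yn", "fawr", "croeso"}
TIMELINES = {
    "alice": ["Bore da!", "Diolch yn fawr", "Good morning everyone"],
    "bob": ["Croeso", "Bore da"],
    "carol": ["Hello world", "Nice weather today"],
    "dave": ["Diolch"],
}
FOLLOWERS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["dave"],
    "dave": [],
}

result = crawl(
    ["alice"],
    lambda user: TIMELINES.get(user, []),
    lambda user: FOLLOWERS.get(user, []),
    WELSH_WORDS,
)
print(result)
```

Starting from a single seed account, the sketch checks each queued user once and expands only through users who pass the threshold, which keeps the crawl inside the community of speakers instead of wandering across all of Twitter.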
According to the BBC World Service, Twitter has become a home for minority languages. More than 3000 Twitter accounts primarily post in Basque, 1000 users regularly post in Irish Gaelic, and Haitian Creole is used by more than 6500 Twitter users.
While Twitter itself categorizes tweets by language, its results are sometimes imperfect because minority languages often lack a corpus of written literature for its algorithms to run riot on. A major part of Indigenous Tweets’ work has consisted of categorizing the use of these languages on Twitter by both native speakers and linguists.
Scannell hopes that Indigenous Tweets will “help build online language communities through Twitter.” Helping the world’s Igbo- and Yiddish-speakers get together on Twitter is just one part of that.
Fast Company has previously reported on Twitter’s corporate forays into Arabic and Korean, along with the State Department’s decision to embrace foreign-language tweets for public diplomacy purposes.
[Image via Flickr user Leonard John Matthews]