Jean Véronis
Aix-en-Provence
(France)



Thursday, December 16, 2010

Google: The largest linguistic corpus of all time

When I was a student at the end of the 1970s, I never dared imagine, even in my wildest dreams, that the scientific community would one day have the means to analyse computerized corpora of several hundred billion words. At the time, I marvelled at the Brown Corpus, which contained what was then an extraordinary quantity of one million words of American English and which, after serving in the compilation of the American Heritage Dictionary, was made widely available to scientists. Despite its size, which now seems derisory, this corpus enabled an impressive number of studies and contributed greatly to the development of language technologies... The study to be published tomorrow in Science by a team comprising scientists from Google, Harvard, MIT, the Encyclopaedia Britannica and Houghton Mifflin Harcourt (publisher of the American Heritage Dictionary) deals with the largest linguistic corpus of all time: 500 billion words. This is the data collected by Google in its (sometimes controversial) programme to digitise books, used here, for the first time to my knowledge, for an extensive linguistic study.

I was lucky enough to have access to the study before publication, and I felt rather light-headed on reading it... My fingers were itching to talk about it on this blog, but I had to respect the embargo (I think the team has organised a bit of a buzz; judging by all the journalists calling me, you'll be hearing about it in the press). This corpus contains 4% of all the books ever published on Earth. As the authors point out, just to read the texts published in the year 2000 alone (a tiny fraction of the whole), without pausing to eat or sleep, you would need 80 years, a whole human lifetime. The sequence of letters in the whole corpus is 1000 times longer than our genome, and if it were written out on a single line, it would reach to the moon and back 10 times over!
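
A quick back-of-envelope check of that figure, with a reading speed that is purely my own assumption (the authors give no such number):

    # Rough sanity check of the "80 years of non-stop reading" claim.
    # The 200 words-per-minute reading speed is my assumption, not the paper's.
    words_per_minute = 200
    minutes_in_80_years = 80 * 365 * 24 * 60          # ignoring leap years
    print(words_per_minute * minutes_in_80_years)     # about 8.4 billion words

So the year-2000 slice alone would be on the order of ten billion words, indeed a small fraction of the 500 billion total.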

Let's not get carried away though: the corpus will not be accessible to ordinary mortals, who will have to make do with pre-computed results, namely the lists of words and "n-grams" (sequences of n consecutive words, up to 5) extracted from the corpus, for English and six other languages, including French. That is already a lot, let's not be churlish, all the more so as the data are broken down by year of publication, allowing for some very interesting studies, and can already be sampled through the on-line interface.
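
For readers unfamiliar with the notion, here is a minimal sketch (in Python, purely illustrative, and certainly not how Google's pipeline actually tokenizes the books) of what extracting n-grams from a text amounts to:

    # Illustrative only: naive n-gram extraction from a piece of text.
    # Google's own processing (tokenization, case handling, etc.) is certainly
    # more sophisticated; this just shows what an "n-gram" is.
    from collections import Counter

    def ngrams(text, n):
        tokens = text.lower().split()          # naive whitespace tokenization
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    counts = Counter(ngrams("the cat sat on the mat and the cat slept", 2))
    print(counts.most_common(3))               # [('the cat', 2), ...]

The published data are essentially such n-grams together with their counts, year by year.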

The authors provide a few examples, illustrated with curves rather like those of the Chronologue – some readers may remember this tool I built in 2005 for French (and which unfortunately died with the decline of the Dir.com search engine by Free, where I was working at the time). Except that, of course, I had neither the resources nor the material collected by Google, which can trace lexical curves over more than two centuries! The topics covered are as varied as the evolution of grammar (the comparative usage of regular and irregular forms of English verbs such as burnt/burned) or the effects of censorship (the disappearance of names such as Marc Chagall during the Nazi period)...
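
The principle behind such curves is simple: for each year, divide the number of occurrences of a form by the total number of words printed that year. Here is a hedged sketch of that computation; the tab-separated layout it assumes is my reading of the published count files, not a guaranteed specification:

    # Sketch: turn raw yearly counts into relative frequencies, e.g. to compare
    # "burnt" and "burned". Assumes a tab-separated file with lines like
    #   ngram <TAB> year <TAB> match_count <TAB> ...
    # (my assumption about the dataset format; check the real files).
    from collections import defaultdict

    def yearly_frequencies(path, word, totals):
        """Relative frequency of `word` per year; totals maps year -> total words."""
        counts = defaultdict(int)
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
                if ngram == word:
                    counts[year] += match_count
        return {y: c / totals[y] for y, c in counts.items() if totals.get(y)}

Plotting the result for burnt and burned side by side gives the kind of curve discussed in the paper.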

The correlation between the use of disease names and epidemic peaks particularly struck me, as it reminded me exactly of the curves I obtained for bird 'flu [fr] – except that these new data go all the way back to the 19th century! I won't reproduce an image from Science, I'll let you read the article, but here is another image, from an internal team report, which illustrates peaks in the use of the word cholera since 1800. The bluish zones correspond to the terrible epidemics that hit the United States and Europe (in particular the south of France, the area where I live, with thousands of deaths in Marseille, Toulon, etc.).


For the occasion, the team coined a new word, culturomics, to describe this new activity, a portmanteau that starts like culture and ends like genomics. It is interesting to note that besides computer scientists (Dan Clancy and Peter Norvig at Google, for example) and lexicographers (including Joe Pickett, the current director of the American Heritage Dictionary), the team includes cognitive scientists and biologists, such as the well-known Steven Pinker and Martin Nowak, and two mathematician-biologists, the main authors of the study, Jean-Baptiste Michel (a Frenchman, from the École Polytechnique, now doing a post-doc at Harvard) and Erez Lieberman Aiden. This is no coincidence: biology and language processing share a great deal in the way of algorithms and mathematics (I gave an example myself with phylogenetic trees – for instance here, here or here).

And for French? Well, it all remains to be done. My sleeves are rolled up! Here is the very first curve, obtained exclusively thanks to the complicity of the team, whom, in passing, I would like to thank warmly. It is for the word blog in French, whose adoption from English we can watch as it happened [see update below]...



Today, I am feeling the fascination that astronomers must have felt when they turned Hubble towards an unexplored corner of the universe for the first time. Something has happened: a giant step has been taken in the tools available to the linguist.

Will linguists (French ones, at any rate) take notice? That's a whole other story. There is often a huge gap between numbers and letters...




PS
Update: superimposed curves for blog in French (light blue) and in American English (dark blue). The shift between the two languages is clearly visible (NB: the vertical scales do not match).





4 Comments:

Olivier Aubert wrote...

Funny. Following the "blog" example, I tried with "internet", and guess what, it looks like some visionary used the word between 1900 and 1905 (see http://ngrams.googlelabs.com/graph?content=internet&year_start=1800&year_end=2008&corpus=0&smoothing=3 ).

For instance, the 1888 "Memoirs and proceedings of the Manchester Literary & Philosophical Society" (
http://books.google.com/books?id=y6vaAAAAMAAJ&q=%22internet%22&dq=%22internet%22&hl=fr&ei=DYMKTaCOOczysgapuICrCg&sa=X&oi=book_result&ct=result&resnum=15&ved=0CGYQ6AEwDg ) mentions that "The estimated user-base of the Internet is in excess of 20 million world-wide".


Alright, usual OCR+classification errors, but funny anyway. More seriously, has there been any study of the error rate that could give some idea of the precision of said data?

16 December 2010, 22:27
Jean Véronis wrote...

Right, probably OCR errors. Yes, although there was probably not enough space to detail this in the Science paper, the authors have been very careful about this and have done a precise evaluation of OCR error rates per language and period -- it's part of the Google Books process, actually. Books with low OCR quality have been eliminated, although the team admits that English has been checked more thoroughly than other languages, whose corpora "may not be as reliable". The team estimates that over 98% of words are correctly digitized in modern English books, which is not bad!

I assume that with such sizes we have to accept (as in all other sciences) that there is some noise in the data. It's the same for telescopes. It's up to us to develop filtering methods and so on -- although this area of linguistics is still in its infancy!
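
As an aside, the kind of filtering I have in mind can be as simple as a moving average over years. A tiny sketch (I am not claiming this is what the Ngram Viewer's "smoothing" parameter in the URL above does internally, only that it is in the same spirit):

    # A very simple noise filter: centred moving average over +/- k years.
    # Purely a sketch, not the Ngram Viewer's actual algorithm.
    def smooth(series, k):
        """series: dict mapping year -> frequency; returns a smoothed copy."""
        years = sorted(series)
        smoothed = {}
        for y in years:
            window = [series[z] for z in years if abs(z - y) <= k]
            smoothed[y] = sum(window) / len(window)
        return smoothed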

16 December 2010, 22:41
Jice wrote...

Jean, do you have a Twitter account? I always come back to your blog after weeks of forgetting about it, but I am always interested in your posts (and I found wikio a great tool).

I must confess that I don't have the strength to read through your whole blog to find out.... sorry, I was raised in Corsica ;-)

17 December 2010, 00:31
Jean Véronis wrote...

Jice> aixtal

17 December 2010, 07:20
