Monthly Archives: March 2014

Google Ngrams Viewer: How good is it really?

Whether you are technologically minded or not, the Google Books Ngram Viewer is a valuable digital tool. It is simple to use and easy to understand. The Ngram Viewer takes Big Data collected from Google Books and presents it in simple graphs, as seen below. This information enables historians and other academics to find patterns or long-term trends through data mining.

ImageSource: Google Books Ngram Viewer

What does it do? And how accessible is it?

Basically, the Ngram Viewer uses graphs to visualise how often phrases occur in the Google Books collection over a particular period of time. (For a definition of an Ngram see here.) Search dates range from 1500 to 2008, and it allows you to search across twenty-two different languages including British/American English, Russian, Chinese and Hebrew. Once the user has searched for their key words, a graph shows how frequently they were used over the set period of time. The dated hyperlinks below the graph take the user to Google Books and the selected books related to their search, as seen here:

ImageSource: Google Books Ngram Viewer
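
For readers who like to see the idea behind the graph, here is a minimal sketch of counting a phrase per year and dividing by all phrases of the same length in that year. It is an illustration under stated assumptions, not Google's actual code: the function names and the tiny corpus are made up for the example.

```python
from collections import Counter

def ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def yearly_frequency(corpus, phrase):
    """Map each year to the share of that year's n-grams equal to phrase."""
    n = len(phrase.split())
    result = {}
    for year, texts in corpus.items():
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text, n))
        total = sum(counts.values())
        result[year] = counts[phrase.lower()] / total if total else 0.0
    return result

# Hypothetical toy corpus: year -> texts "published" in that year.
corpus = {
    1900: ["sherlock holmes lit his pipe"],
    1950: ["the detective sherlock holmes and sherlock holmes again"],
}
freq = yearly_frequency(corpus, "sherlock holmes")
```

Plotting `freq` against the years would give a very small version of the kind of line the viewer draws.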

If the user clicks on the link at the top right-hand side of the graph, they are able to embed the graph on their own website. That is pretty nifty! To top that, Google allow their Ngrams users to download the Big Data from their site and use it in personal research. All that Google ask is that the use of their work is attributed to them.
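
As a rough sketch of what working with the downloaded data might look like, the snippet below reads a few lines and adds up the matches for one phrase per year. It assumes each line holds four tab-separated fields (ngram, year, match count, volume count), which is how I understand the version 2 dataset page to describe the files; the sample rows and their numbers are invented for illustration, and the function names are my own.

```python
from collections import defaultdict

def parse_line(line):
    """Split one tab-separated row into (ngram, year, match_count, volume_count)."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return ngram, int(year), int(match_count), int(volume_count)

def matches_by_year(lines, phrase):
    """Sum the match counts for an exact phrase, grouped by year."""
    totals = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _ = parse_line(line)
        if ngram == phrase:
            totals[year] += match_count
    return dict(totals)

# Invented sample rows in the assumed layout; not real Google figures.
sample = [
    "Sherlock Holmes\t1910\t120\t45\n",
    "Sherlock Holmes\t1911\t150\t52\n",
    "Sherlock Homes\t1910\t3\t3\n",
]
totals = matches_by_year(sample, "Sherlock Holmes")
```

With a real file, `sample` would be replaced by the open file object, since the function only needs something it can loop over line by line.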

So, for someone who is usually more at home in a library than on a digital search engine, I am impressed…so far. In terms of visual design for accessibility, Google Ngrams is appealing because of its simple yet colourful graphs and easy-to-navigate layout. It is possible to manipulate the data, to some extent, through the search terms, and you can take the Big Data away and use it for your own purposes. It’s a takeaway without the calories and with no washing up! Basically, they do all of the work for you. Now, that can’t be bad, can it?

How do they do it?

Optical Character Recognition or OCR is used by Google Books to digitize books and make the data available on Ngrams. A basic definition is below:

ImageSource: TechTerms.com 

There are, however, some flaws associated with the use of OCR. These include:

  • Accuracy rates are not 100% – a combination of manual and OCR transcription may produce better results. Danny Sullivan, an expert on search engines, talks about this problem here.
  • Images are also affected – image colour and detail may not be as precise when using OCR. Manual scanning may take longer but is possibly more cost-effective. Examples of mistakes can be seen below:

ImageSource: The Art of Google Books

ImageSource: The Art of Google Books

  • The case of the f-word – The medial or long s (ſ), which looks very much like an f, has proven to be something of a problem for OCR when it is used to scan older source material. Often the words and letters are partly or completely misread. Danny Sullivan highlights these problems here.
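
To make the long s problem concrete, here is a small sketch of how a misreading might arise and how one could look for likely corrections. It is a toy under loud assumptions: real OCR errors are messier than a single character swap, and the function names and the tiny word list are invented for the example.

```python
from itertools import product

LONG_S = "\u017f"  # the long s character used in older printing

def simulate_ocr_misread(text):
    """Mimic OCR software that reads every long s as the letter f."""
    return text.replace(LONG_S, "f")

def candidate_corrections(word, vocabulary):
    """Try swapping each f for s and keep the variants found in vocabulary."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    found = set()
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        variant = "".join(chars)
        if variant != word and variant in vocabulary:
            found.add(variant)
    return sorted(found)

printed = "be" + LONG_S + "t"           # "best" as an old book might print it
misread = simulate_ocr_misread(printed)  # what the OCR software reports
vocabulary = {"best", "safe"}            # hypothetical word list
fixes = candidate_corrections(misread, vocabulary)
```

A search for the misread form would miss the original word entirely, which is why such errors can distort counts for older books.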

A further aspect of Google Ngrams is its Extensible Markup Language or XML. This shows the bare basics of what the document is made of. When creating an XML document, the person building the resource defines their own tags and so controls the information available in the search engine. By comparison, an HTML document relies on a fixed set of predefined tags. For further comparison see here. In Google Ngrams the XML schema has less to offer than, for example, that of the Old Bailey Online. This may be a consequence of the types of data used in each digital resource. The data in Google Ngrams Viewer is formed of groups of letters, as seen here in a sample:

ImageSource: Google Ngrams Viewer

By contrast, the XML schema in the Old Bailey Online is much easier to follow. The search terms, for example, are easier to understand:

ImageSource: The Old Bailey Online
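
To show why descriptive tags are easier to work with, here is a minimal sketch that reads a small XML record using Python's standard library. The element and attribute names are invented for illustration and are not the Old Bailey Online's real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the tag names are made up for this example.
record_xml = """
<trial id="example-1">
  <offence category="theft">highway robbery</offence>
  <verdict>guilty</verdict>
</trial>
""".strip()

root = ET.fromstring(record_xml)
offence = root.find("offence")

# Because the tags describe their content, each field can be asked for by name.
summary = {
    "id": root.get("id"),
    "category": offence.get("category"),
    "offence": offence.text,
    "verdict": root.find("verdict").text,
}
```

The point is simply that a researcher can search on "verdict" or "category" directly when the markup names those things.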

Copyright has been something of an issue for Google Books. Although this may affect the Ngram Viewer and its content, the resource has made it possible to download the Big Data and use its XML schema for quantitative research. This avoids messy legal problems, as discussed in the following article by John Bohannon.

Overall, the Google Ngram Viewer has a lot to offer historians. It allows them to see patterns or trends in data over a longer period than would be possible if they were researching through traditional methods. It stores a vast amount of data in a small space which can be accessed immediately. Finally, it offers historians a simple and manageable tool in the emerging and sometimes complicated discipline of Digital History, as Dan Cohen discusses here.

Bibliography

 

‘Big Data definition’, PCMag Encyclopedia, http://www.pcmag.com/encyclopedia/term/62849/big-data; consulted 1st March 2014.

Bohannon, John, ‘Google Opens Books to New Cultural Studies’, Sciencemag.org, http://dericbownds.net/uploaded_images/Science-2010-Bohannon.pdf; consulted 1st March 2014.

Cohen, Dan, ‘Initial Thoughts on the Google Books Ngram Viewer and Datasets’, http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/; consulted 1st March 2014.

‘Data Mining definition’, Dictionary.com, http://dictionary.reference.com/browse/data+mining; consulted 1st March 2014.

Google Books Ngram Viewer, https://books.google.com/ngrams; consulted 27th February 2014.

‘Google Ngrams Dataset’, Google Books Ngram Viewer, http://storage.googleapis.com/books/ngrams/books/datasetsv2.html; consulted 28th March 2014.

‘Google “Sherlock Holmes” search’, Google Books Ngram Viewer, https://www.google.com/search?q=%22sherlock%20holmes%22&tbs=bks:1,cdr:1,cd_min:1910,cd_max:1977&lr=lang_en; consulted 27th February 2014.

‘Introduction to XML’, w3schools.com, http://www.w3schools.com/xml/xml_whatis.asp; consulted 1st March 2014.

‘Ngram definition’, Dictionary.com, http://dictionary.reference.com/browse/n-gram; consulted 1st March 2014.

‘OCR definition’, Techterms.com, http://www.techterms.com/definition/ocr; consulted 27th February 2014.

Sullivan, Danny, ‘When OCR Goes Bad: Google’s Ngram Viewer & the F-Word’, Search Engine Land, http://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181; consulted 28th February 2014.

The Art of Google Books, http://theartofgooglebooks.tumblr.com/; consulted 28th February 2014.

The Old Bailey Online, ‘Violent Theft, highway robbery: reference no. t16740909-6’, version 7.1, http://www.oldbaileyonline.org/browse.jsp?id=t16740909-6&div=t16740909-6&terms=horse#highlight; consulted 28th March 2014.

 
