When Google’s Ngram Viewer was the topic of a post on Science-Based Medicine, I knew it was becoming mainstream. No longer happy to only be toyed with by linguists killing time, the Ngram Viewer had entranced people from other walks of life. And I can understand why. Google’s Ngram Viewer is an impressive service that allows you to quickly and easily search for the frequency of words and phrases in millions of books. But I want to warn you about Google’s Ngram Viewer. As a corpus linguist, I think it’s important to explain just what Ngram Viewer is, what it can be used to do, and how I feel about it and about the praise it has been receiving since its inception. I’ll start out simple: despite all its power and what it seems to be capable of, looks can be deceiving.
Have we learned nothing?
Jann Bellamy wrote a post at Science-Based Medicine about using Google’s Ngram Viewer (GNV) to research some terms used to describe the very unscientific practice of Complementary and Alternative Medicine (CAM). Although an article of this type is unusual for the SBM site, it does show how intriguing GNV can be. And Ms. Bellamy does a good job of explaining a few of the caveats of GNV:
The database only goes through 2008, so searches have to end there. Also, the searches have to assume that the word or phrase has only one definition, or perhaps one definition that dominates all others. We also have to remember that only books were scanned, not, for example, academic journals or popular magazines. Or blog posts, for that matter.
Ms. Bellamy then goes on the search for some CAM terms. After noting which terms are more common and when they started to rise in usage, she does a very good job of explaining the reasons that certain terms have a higher frequency than others. At the end, however, she is left with more questions than answers. Although she discovered that alternative medicine appears more frequently than complementary medicine in the Google Books database, and although she did further research (outside of Google Books) to explain why, she is still left right where she started. Just looking at the numbers from GNV, she can’t say what kind of impact CAM has had on our (English-speaking) world or culture. So what was the point of looking at GNV at all (besides the pretty colors)?
In her post, Ms. Bellamy links to an article in the New York Times by Natasha Singer. In what is essentially an exposition of GNV, with quotes from two of its founders, Ms. Singer places a lot more stock in the value and capability of the program. But from a corpus linguist’s perspective, she leaps a bit too far to her conclusions.
Ms. Singer’s article begins with the phrase “Data is the new oil” and then goes on to explain the comparison between these two words offered by GNV. She writes:
I started my data-versus-oil quest with casual one-gram queries about the two words. The tool produced a chart showing that the word “data” appeared more often than “oil” in English-language texts as far back as 1953, and that its frequency followed a steep upward trajectory into the late 1980s. Of course, in the world of actual commerce, oil may have greater value than raw data. But in terms of book mentions, at least, the word-use graph suggests that data isn’t simply the new oil. It’s more like a decades-old front-runner.
But with the Google Books corpus (the set of texts that GNV analyzes), we need to remember what the corpus contains, i.e. what “book mentions” means. This lets us know how representative both the corpus and our analysis are. The Google Books corpus does not contain speech, newspapers, tweets, magazine articles, business letters, or financial reports. Sure, oil is important to our culture, and certainly to global and political history, but do people write books about it? We cannot directly extrapolate the findings from Google Books to Culture any more than we can tell people about the world of 16th-century England by studying the plays of Shakespeare. With GNV we can merely study the culture of books (or the culture of publishing). And there are many ways that GNV can mislead you. For example, are the hits in Ms. Singer’s search talking about crude oil, olive oil, or oil paintings? Google Ngrams will not tell you. Just for fun, here’s Ms. Singer’s search redone with some other terms. Feel free to draw your own conclusions.
The second article I want to talk about comes from Ben Zimmer. While I don’t think Mr. Zimmer needs to be told anything that’s in this post, his article in The Atlantic gets to the heart of my frustration with GNV. It features a more complex search on GNV to find out which nouns modify the word mogul and how they have changed over the last 100 years. In the following passage, he alludes to the reality of GNV without coming right out and saying it.
It’s possible to answer these questions using the publicly available corpora compiled by Mark Davies at Brigham Young University, but the peculiar interface can be off-putting to casual users. With the Ngram Viewer, you just need to enter a search like “*_NOUN mogul” or “ragtag *_NOUN” and select a year range. It turns out that in 20th-century sources, media moguls are joined by movie moguls, real estate moguls, and Hollywood moguls, while the most likely things to be ragtag are armies, groups, and bands.
There are a few points to make about this. First, the interface of the publicly available corpora compiled by Mark Davies could be described as “peculiar”, but that’s only because it’s not the lowest common denominator. And there’s the rub because researchers are capable of so much more using Mark Davies’ corpora. While the interface isn’t immediately intuitive, it certainly isn’t hard to learn. As a bad comparison, think about the differences between Windows, OSX, and a Linux OS. Windows is the lowest common denominator – easiest to use and most intuitive. OSX and Linux, on the other hand, take a bit of getting used to. But how many of us have learned OSX or Linux and willingly gone back to Windows?
The second point is not so much about casual users as it is about casual searches. I think Mr. Zimmer is right to talk about casual users since it’s probable that most of the people who use GNV will be looking for a quick and easy stroll down the cultural garden path. But more to the point, I think he’s right to offer different types of moguls as a search example because that’s about as far as GNV will take you. Can you see which types of moguls people are talking about? No. How about which types of moguls are being used in magazines? Nope. Newspapers? Nuh-uh. You have to turn to one of Mark Davies’ corpora for that. In fact, less casual users are even able to access Google Books (and other corpora) via Mark Davies’ site, and this allows them to conduct more complex searches. (For a much more detailed comparison of GNV and some of the corpora offered on Mark Davies’ site, see here.) So again, the question is: what’s the point of looking at GNV at all?
Final thoughts – Almost right but not quite
All this picking on GNV is not without reason. Even though what the people at Google have done is truly impressive, we have seen that the practical use of GNV is limited. As the saying in corpus linguistics goes, “Compiling is only half the battle”. GNV does not offer users a way to really measure what they are (usually) looking for. As an example, a quote from Ms. Singer’s article will suffice:
The system can also conduct quantitative checks on popular perceptions. Consider our current notion that we live in a time when technology is evolving faster than ever. Mr. Aiden and Mr. Michel [two of GNV’s creators] tested this belief by comparing the dates of invention of 147 technologies with the rates at which those innovations spread through English texts. They found that early 19th-century inventions, for instance, took 65 years to begin making a cultural impact, while turn-of-the-20th-century innovations took only 26 years. Their conclusion: the time it takes for society to learn about an invention has been shrinking by about 2.5 years every decade.
While this may be true, it’s not proven by looking at Google Books. For example, ask yourself these questions: what was the rate of literacy in the early 19th century? How many books did people read (or have read to them) in the early 19th century compared to the turn of the 20th century? What was the difference between the rate of dissemination of information in the two time periods? How about the rate of publishing? And what exactly qualifies as technology – farm equipment or fMRI machines? Or does it have to be more closely related to culture and Culturomics – like Facebook?
And most importantly, are books the best way to measure the cultural impact of an idea or technology? The fact is that the system cannot really conduct quantitative checks on popular perceptions. But it can make you think it can.
So GNV has a long way to go. I hesitate to say that they will get there because Google does not really have an interest in offering this kind of service to the public (I didn’t see any ads on the GNV page, did you?). While it may be fun to play around with GNV, I would advise against drawing any (serious) conclusions from what it spits out. Below are some other searches I ran. Again, feel free to draw your own conclusions about how these terms and the things they describe relate to human culture.
Notice how the middle initial of some presidents complicates things in the above search. It would be nice to be able to combine the frequencies for “John Fitzgerald Kennedy”, “John F Kennedy”, “John Kennedy”, and “JFK” into one line, and exclude hits like “John S Kennedy” from the results completely, but that’s not possible. You could, however, search GNV for the different ways to refer to President Kennedy and see the differences, for whatever that will tell you.
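For readers who want to try the combined line described above, here is a rough sketch of doing it offline in Python. It assumes you have somehow obtained per-year relative frequencies for each variant; the dictionary, the numbers, and the function name `combine_variants` are all made up for illustration and are not GNV output or part of any Google API. It also assumes the variants do not overlap and that their frequencies are comparable, which is shaky when the variants are n-grams of different lengths.

```python
from collections import defaultdict

# Hypothetical per-year relative frequencies for each way of writing the name.
# These figures are invented for illustration; they are NOT real GNV results.
variant_frequencies = {
    "John Fitzgerald Kennedy": {1960: 1.0e-7, 1970: 2.0e-7},
    "John F Kennedy": {1960: 4.0e-7, 1970: 9.0e-7},
    "John Kennedy": {1960: 3.0e-7, 1970: 5.0e-7},
    "JFK": {1960: 0.5e-7, 1970: 3.0e-7},
    "John S Kennedy": {1960: 0.2e-7, 1970: 0.1e-7},  # a different person
}


def combine_variants(frequencies, include):
    """Sum the yearly frequencies of the chosen variants into one series."""
    combined = defaultdict(float)
    for variant in include:
        for year, freq in frequencies[variant].items():
            combined[year] += freq
    # Return a plain dict ordered by year.
    return dict(sorted(combined.items()))


# Keep the four Kennedy variants and leave "John S Kennedy" out.
president_variants = [
    "John Fitzgerald Kennedy",
    "John F Kennedy",
    "John Kennedy",
    "JFK",
]
combined = combine_variants(variant_frequencies, president_variants)

for year, freq in combined.items():
    print(year, f"{freq:.2e}")
```

Even with one tidy line, the result only tells you how often these strings show up in books. It cannot tell you which John Kennedy a given hit refers to, so the caveats above still apply.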
Noam Chomsky has had a bigger effect on our culture than Audrey Hepburn, Bob Marley, and Brad Pitt? You be the judge!
3 thoughts on “Don’t Go Down the Google Books Garden Path”
Reblogged this on TESOL_Peter and commented:
Interesting post here about the Google Ngram viewer and its limitations. One possible limitation of this viewer is whether the amount of literature included for each time period is normalized. In other words, are all the time periods in Google Books represented in equal amounts, or is it possible that there are more texts from the 20th century onward than from earlier periods? Since the data for all this comes from Google Books itself, is it just a raw reading of the data, or are the years normalized? I have only casually looked into this tool, so I don’t know if this is true or not. If anybody knows the Google NGram viewer well or uses it on a regular basis, feel free to comment.
Thanks, Parise. The figures are normalized. Actually that is one of the questions answered on their About page (https://books.google.com/ngrams/info). I think many other problems with NGram Viewer can be solved by accessing the corpus via Mark Davies’ interface (or by just simply using COHA). Here is his explanation of the limitations of using the NGram Viewer compared to COHA and his advanced interface to Google Books: http://googlebooks.byu.edu/compare-googleBooks.asp.
Great! Thank you Joe! Much appreciated!