In two recent papers, one by Kloumann et al. (2012) and the other by Dodds et al. (2015), a group of researchers created a corpus to study the positivity of the English language. I looked at some of the problems with those papers here and here. For this post, however, I want to focus on one of the registers in the authors’ corpus – song lyrics. There is a problem with taking language such as lyrics out of context and then judging them based on the positivity of the words in the songs. But first I need to briefly explain what the authors did.

In the two papers, the authors created a corpus based on books, New York Times articles, tweets and song lyrics. They then created a list of the 10,000 most common word types in their corpus and had voluntary respondents rate how positive or negative they felt the words were. They used this information to claim that human language overall (and English) is emotionally positive.

That’s the idea anyway, but song lyrics exist as part of a multimodal genre. There are lyrics and there is music. These two modalities operate simultaneously to convey a message or feeling. This is important for a couple of reasons. First, the other registers in the corpus do not work like song lyrics. Books and news articles are black text on a white background with few or no pictures. And tweets are not always multimodal – it’s possible to include a short video or picture in a tweet, but it’s not necessary (Side note: I would like to know how many tweets in the corpus included pictures and/or videos, but the authors do not report that information).

So if we were to do a linguistic analysis of an artist or a genre of music, we would create a corpus of the lyrics of that artist or genre. We could then study the topics that are brought up in the lyrics, or even common words and expressions (lexical bundles or n-grams) that are used by the artist(s). We could perhaps even look at how the writing style of the artist(s) changed over time.

But if we wanted to perform an analysis of the positivity of the songs in our corpus, we would need to incorporate the music. The lyrics and music go hand in hand – without the music, you only have poetry. To see what I mean, take a look at the following word list. Do the words in this list look particularly positive or negative to you?

a

ain’t

all

and

as

away

back

bitch

body

breast

but

butterfly

can

can’t

caught

chasing

comin’

days

did

didn’t

do

dog

down

everytime

fairy

fantasy

for

ghost

guess

had

hand

harm

her

his

i

i’m

if

in

it

looked

lovely

jar

makes

mason

life

live

maybe

me

mean

momma’s

more

my

need

nest

never

no

of

on

outside

pet

pin

real

return

robin

scent

she

sighing

slips

smell

sorry

that

the

then

think

to

today

told

up

want

wash

went

what

when

with

withered

woke

would

yesterday

you

you’re

your

If we combine these words as Rivers Cuomo did in his song “Butterfly”, they average out to a positive score of 5.23. Here are the lyrics to that song.

Yesterday I went outside
With my momma’s mason jar
Caught a lovely Butterfly
When I woke up today
And looked in on my fairy pet
She had withered all away
No more sighing in her breast

I’m sorry for what I did
I did what my body told me to
I didn’t mean to do you harm
But everytime I pin down what I think I want
it slips away – the ghost slips away

I smell you on my hand for days
I can’t wash away your scent
If I’m a dog then you’re a bitch
I guess you’re as real as me
Maybe I can live with that
Maybe I need fantasy
A life of chasing Butterfly

I’m sorry for what I did
I did what my body told me to
I didn’t mean to do you harm
But everytime I pin down what I think I want
it slips away – the ghost slips away

I told you I would return
When the robin makes his nest
But I ain’t never comin’ back
I’m sorry, I’m sorry, I’m sorry

Does this look like a positive text to you? Does it look moderate, neither positive nor negative? I would say not. It seems negative to me, a sad song based on the opera Madame Butterfly, in which a man leaves his wife because he never really cared for her. When we take the music into consideration, the non-positivity of this song is clear.


Let’s take a look at another list. How does this one look?

above

absence

alive

an

animal

apart

are

away

become

brings

broke

can

closer

complicate

desecrate

down

drink

else

every

everything

existence

faith

feel

flawed

for

forest

from

fuck

get

god

got

hate

have

help

hive

honey

i

i’ve

inside

insides

is

isolation

it

it’s

knees

let

like

make

me

my

myself

no

of

off

only

penetrate

perfect

reason

scraped

sell

sex

smell

somebody

soul

stay

stomach

tear

that

the

thing

through

to

trees

violate

want

whole

within

works

you

your

Based on the ratings in the two papers, this list is slightly more positive, with an average happiness rating of 5.46. When the words were used by Trent Reznor, however, they expressed “a deeply personal meditation on self-hatred” (Huxley 1997: 179). Here are the lyrics for “Closer” by Nine Inch Nails:

You let me violate you
You let me desecrate you
You let me penetrate you
You let me complicate you

Help me
I broke apart my insides
Help me
I’ve got no soul to sell
Help me
The only thing that works for me
Help me get away from myself

I want to fuck you like an animal
I want to feel you from the inside
I want to fuck you like an animal
My whole existence is flawed
You get me closer to god

You can have my isolation
You can have the hate that it brings
You can have my absence of faith
You can have my everything

Help me
Tear down my reason
Help me
It’s your sex I can smell
Help me
You make me perfect
Help me become somebody else

I want to fuck you like an animal
I want to feel you from the inside
I want to fuck you like an animal
My whole existence is flawed
You get me closer to god

Through every forest above the trees
Within my stomach scraped off my knees
I drink the honey inside your hive
You are the reason I stay alive

As Reznor (the songwriter and lyricist) sees it, “Closer” is “supernegative and superhateful” and the song’s message is “I am a piece of shit and I am declaring that” (Huxley 1997: 179). You can see what he means when you listen to the song (minor NSFW warning for the imagery in the video). [1]

Nine Inch Nails: Closer (Uncensored) (1994) from Nine Inch Nails on Vimeo.

Then again, meaning is relative. Tommy Lee has said that “Closer” is “the all-time fuck song. Those are pure fuck beats – Trent Reznor knew what he was doing. You can fuck to it, you can dance to it and you can break shit to it.” And Tommy Lee should know. He played in the studio for NIИ and he is arguably more famous for fucking than he is for playing drums.

Nevertheless, the problem with the positivity rating of songs keeps popping up. The song “Mad World” was a pop hit for Tears for Fears, then reinterpreted in a more somber tone by Gary Jules and Michael Andrews. But it is rated a positive 5.39. Gotye’s global hit about failed relationships, “Somebody That I Used To Know”, is rated a positive 5.33. The anti-war and protest ballad “Eve of Destruction”, made famous by Barry McGuire, rates just barely on the negative side at 4.93. I guess there should have been more depressing references besides bodies floating, funeral processions, and race riots if the song writer really wanted to drive home the point.

For the song “Milkshake”, Kelis has said that it “means whatever people want it to” and that the milkshake referred to in the song is “the thing that makes women special […] what gives us our confidence and what makes us exciting”. It is rated less positive than “Mad World” at 5.24. That makes me want to doubt the authors’ commitment to Sparkle Motion.

Another upbeat jam that the kids listen to is the Ramones’ “Blitzkrieg Bop”. This is the energetic and exciting anthem of punk rock. It’s rated a negative 4.82. I wonder if we should even look at “Pinhead”.

Then there’s the old American folk classic “Where did you sleep last night”, which Nirvana performed a haunting version of on their album MTV Unplugged in New York. The song (also known as “In the Pines” and “Black Girl”) was first made famous by Lead Belly and it includes such catchy lines as

My girl, my girl, don’t lie to me
Tell me where did you sleep last night
In the pines, in the pines
Where the sun don’t ever shine
I would shiver the whole night through

And

Her husband was a hard working man
Just about a mile from here
His head was found in a driving wheel
But his body never was found

This song is rated a positive 5.24. I don’t know about you but neither the Lead Belly version, nor the Nirvana cover would give me that impression.

Even Pharrell Williams’ hit song “Happy” rates only 5.70. That’s a song so goddamn positive that it’s called “Happy”. But it’s only 0.03 points more positive than Eric Clapton’s “Tears in Heaven”, which is a song about the death of Clapton’s four-year-old son. Harry Chapin’s “Cat’s in the Cradle” was voted the fourth saddest song of all time by readers of Rolling Stone but it’s rated 5.55, while Willie Nelson’s “Always on My Mind” rates 5.63. So they are both sadder than “Happy”, but not by much. How many lyrics must a man research, before his corpus is questioned?

Corpus linguistics is not just gathering a bunch of words and calling it a day. The fact that the same “word” can have several meanings (known as polysemy) is a major feature of language. So before you ask people to rate a word’s positivity, you will want to make sure they at least know which meaning is being referred to. On top of that, words do not work in isolation. Spacing is an arbitrary construct in written language (remember that song lyrics are mostly heard, not read). The back used in the Ramones’ lines “Piling in the back seat” and “Pulsating to the back beat” is not about a body part. The Weezer song “Butterfly” uses the word mason, but it’s part of the compound noun mason jar, not a reference to a bricklayer. Words are also conditioned by the words around them. A word like eve may normally be considered positive as it brings to mind Christmas Eve and New Year’s Eve, but when used in a phrase like “the eve of destruction” our judgment of it is likely to change. In the corpus under discussion here, eat is rated 7.04, but that doesn’t consider what’s being eaten and so cannot account for lines like “Eat your next door neighbor” (from “Eve of Destruction”).
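To make the point about neighbouring words concrete, here is a minimal Python sketch that pairs each word with the one that follows it. The function name bigrams is just a label chosen for this illustration, and the input is the second line of “Butterfly” quoted above, lowercased and written with a straight apostrophe for simplicity. It is not how the authors of the papers processed their data.

```python
def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

# Second line of "Butterfly", lowercased and split on whitespace.
tokens = "with my momma's mason jar".split()
pairs = bigrams(tokens)

# "mason" only appears here as the first half of the pair ("mason", "jar").
print(pairs)
```

A rating attached to mason on its own never gets to see jar, which is exactly the information a reader or listener uses to work out what the word means.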

We could go on and on like this. The point is that the authors of both of the papers didn’t do enough work with their data before drawing conclusions. And they didn’t consider that some of the language in their corpus is part of a multimodal genre where there are other things affecting the meaning of the language used (though technically no language use is devoid of context). Whether or not the lyrics of a song are “positive” or “negative”, the style of singing and the music that they are sung to will strongly affect a person’s interpretation of the lyrics’ meaning and emotion. That’s just the way that music works.

This doesn’t mean that any of these songs are positive or negative based on their rating; it means that the system used by the authors of the two papers to rate the positivity or negativity of language seems to be flawed. I would have guessed that a rating system which took words out of context would be fundamentally flawed, but viewing the ratings of the songs in this post is a good way to visualize that. The fact that the two papers were published in reputable journals and picked up by reputable publications, such as the Atlantic and the New York Times, only adds insult to injury for the field of linguistics.

You can see a table of the songs I looked at for this post below and a spreadsheet with the ratings of the lyrics is here. I calculated the positivity ratings by averaging the scores for the word tokens in each song, rather than the types.

(By the way, Tupac is rated 4.76. It’s a good thing his attitude was fuck it ‘cause motherfuckers love it.)

Song Positivity score (1–9)
“Happy” by Pharrell Williams 5.70
“Tears in Heaven” by Eric Clapton 5.67
“Always on My Mind” by Willie Nelson 5.63
“Cat’s in the Cradle” by Harry Chapin 5.55
“Closer” by NIN 5.46
“Mad World” by Gary Jules and Michael Andrews 5.39
“Somebody that I Used to Know” by Gotye feat. Kimbra 5.33
“Waitin’ for a Superman” by The Flaming Lips 5.28
“Milkshake” by Kelis 5.24
“Where Did You Sleep Last Night” by Nirvana 5.24
“Butterfly” by Weezer 5.23
“Eve of Destruction” by Barry McGuire 4.93
“Blitzkrieg Bop” by The Ramones 4.82
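
For anyone who wants to see the difference between averaging over tokens and averaging over types, here is a rough Python sketch. The three ratings in it are made up for illustration and are not taken from the papers’ word list; the function names are also my own. Words without a rating are simply skipped, which is an assumption of this sketch.

```python
import re
from statistics import mean

# Hypothetical ratings on the 1-9 scale (NOT values from the papers).
RATINGS = {"lovely": 8.0, "butterfly": 7.0, "sorry": 3.0}

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z'’]+", text.lower())

def token_average(text, ratings):
    """Average over every rated occurrence, so repeated words count each time."""
    return mean(ratings[w] for w in tokenize(text) if w in ratings)

def type_average(text, ratings):
    """Average over distinct rated words, so each word counts once."""
    return mean(ratings[w] for w in set(tokenize(text)) if w in ratings)

lyric = "Caught a lovely butterfly. I'm sorry, I'm sorry, I'm sorry"
print(token_average(lyric, RATINGS))  # sorry is counted three times
print(type_average(lyric, RATINGS))   # sorry is counted once
```

With these invented numbers, the repeated sorry pulls the token average well below the type average. Either way, the score is still built from words stripped of their context.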

 

Footnotes

[1] Also, be aware that listening to these songs while watching their music videos has an effect on the way you interpret them.

References

Isabel M. Kloumann, Christopher M. Danforth, Kameron Decker Harris, Catherine A. Bliss, Peter Sheridan Dodds. 2012. “Positivity of the English Language”. PLoS ONE. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0029484

Dodds, Peter Sheridan, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth. 2015. “Human language reveals a universal positivity bias”. PNAS 112:8. http://www.pnas.org/content/112/8/2389

Huxley, Martin. 1997. Nine Inch Nails. New York: St. Martin’s Griffin.

Last week I wrote a post called “If you’re not a linguist, don’t do linguistics”. This got shared around Twitter quite a bit and made it to the front page of r/linguistics, so a lot of people saw it. Pretty much everyone had good insight on the topic and it generated some great discussion. I thought it would be good to write a follow-up to flesh out my main concerns in a more serious manner (this time sans emoticons!) and to address the concerns some people had with my reasoning.

The paper in question is by Dodds et al. (2015) and it is called “Human language reveals a universal positivity bias”. The certainty of that title is important since I’m going to try to show in this post that the authors make too many assumptions to reliably make any claims about all human language. I’m going to focus on the English data because that is what I am familiar with. But if anyone who is familiar with the data in other languages would like to weigh in, please do so in the comments.

The first assumption made by the authors is that it is possible to make universal claims about language using only written data. This is not a minor issue. The differences between spoken and written language are many and major (Linell 2005). But dealing with spoken data is difficult – it takes much more time and effort to collect and analyze than written data. We can argue, however, that even in highly literate societies, the majority of language use is spoken – and spoken language does not work like written language. Treating written data as a stand-in for all language is an assumption that no scholar should ever make. So any research which makes claims about all human language will have to include some form of spoken data. But the data set that the authors draw from (called their corpus) is made from tweets, song lyrics, New York Times articles and the Google Books project. Tweets and song lyrics, let alone news articles or books, do not mimic spoken language in an accurate way. For example, these registers may include the same words as human speech, but certainly not in the same proportion. Written language does not include false starts, nor does it include repetition or elision in anything like the same way that spoken language does. Anyone who has done any transcription work will tell you this.

The next assumption made by the authors is that their data is representative of all human language. Representativeness is a major issue in corpus linguistics. When linguists want to investigate a register or variety of language, they build a corpus which is representative of that register or variety by taking a large enough and balanced sample of texts from that register. What is important here, however, is that most linguists do not have a problem with a set of data representing a larger register – so long as that larger register isn’t all human language. For example, if we wanted to research modern English journalism (quite a large register), we would build a corpus of journalism texts from English-speaking countries and we would be careful to include various kinds of journalism – op-eds, sports reporting, financial news, etc. We would not build a corpus of articles from the Podunk Free Press and make claims about all English journalism. But representativeness is a tricky issue. The larger the language variety you are trying to investigate, the more data from that variety you will need in your corpus. Baker (2010: 7) notes that a corpus analysis of one novel is “unlikely to be representative of all language use, or all novels, or even the general writing style of that author”. The English sub-corpora in Dodds et al. exist somewhere in between a fully non-representative corpus of English (one novel) and a fully representative corpus of English (all human speech and writing in English). In fact, in another paper (Dodds et al. 2011), the representativeness of the Twitter corpus is explained as “First, in terms of basic sampling, tweets allocated to data feeds by Twitter were effectively chosen at random from all tweets. Our observation of this apparent absence of bias in no way dismisses the far stronger issue that the full collection of tweets is a non-uniform subsampling of all utterances made by a non-representative subpopulation of all people. While the demographic profile of individual Twitter users does not match that of, say, the United States, where the majority of users currently reside, our interest is in finding suggestions of universal patterns.” What I think that doozy of a sentence in the middle is saying is that the tweets come from an unrepresentative sample of the population but that the language in them may be suggestive of universal English usage. Does that mean we can assume that the English sub-corpora (specifically the Twitter data) in Dodds et al. are representative of all human communication in English?

Another assumption the authors make is that they have sampled their data correctly. The decisions on what texts will be sampled, as Tognini-Bonelli (2001: 59) points out, “will have a direct effect on the insights yielded by the corpus”. Following Biber (see Tognini-Bonelli 2001: 59), linguists can classify texts into various channels in order to ensure that their sample texts will be representative of a certain population of people and/or variety of language. They can start with general “channels” of the language (written texts, spoken data, scripted data, electronic communication) and move on to whether the language is private or published. Linguists can then sample language based on what type of person created it (their age, sex, gender, socio-economic situation, etc.). For example, if we made a corpus of the English articles on Wikipedia, we would have a massive amount of linguistic data. Literally billions of words. But 87% of it will have been written by men and 59% of it will have been written by people under the age of 40. Would you feel comfortable making claims about all human language based on that data? How about just all English language encyclopedias?

The next assumption made by the authors is that the relative positive or negative nature of the words in a text is indicative of how positive that text is. But words can have various and sometimes even opposing meanings. Texts are also likely to contain words that are written the same but have different meanings. For example, the word fine in the Dodds et al. corpus, like the rest of the words in the corpus, is just a four-letter word – free of context and naked as a jaybird. Is it an adjective that means “good, acceptable, or satisfactory”, which Merriam-Webster says is sometimes “used in an ironic way to refer to things that are not good or acceptable”? Or does it refer to that little piece of paper that the Philadelphia Parking Authority is so (in)famous for? We don’t know. All we know is that it has been rated 6.74 on the positivity scale by the respondents in Dodds et al. Can we assume that all the uses of fine in the New York Times are that positive? Can we assume that the use of fine on Twitter is always or even mostly non-ironic? On top of that, some of the most common words in English also tend to have the most meanings. There are 15 entries for get in the Macmillan Dictionary, including “kill/attack/punish” and “annoy”. Get in Dodds et al. is ranked on the positive side of things at 5.92. Can we assume that this rating carries across all the uses of get in the corpus? The authors found approximately 230 million unique “words” in their Twitter corpus (they counted all forms of a word separately, so banana, bananas, b-a-n-a-n-a-s! would be separate “words”; and they counted URLs as words). So they used the 50,000 most frequent ones to estimate the information content of texts. Can we assume that it is possible to make an accurate claim about how positive or negative a text is based on nothing but the words taken out of context?
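
Here is a small Python sketch of what a context-free lookup does with fine. The two ratings are the ones quoted above from Dodds et al.; the two example sentences and the function name score_words are my own inventions, and the code is only a guess at the general shape of such a lookup, not the authors’ actual procedure.

```python
# Ratings quoted in this post from Dodds et al.; everything else is illustrative.
RATINGS = {"fine": 6.74, "get": 5.92}

def score_words(sentence, ratings):
    """Return (word, rating) pairs for every rated word, ignoring context."""
    words = sentence.lower().split()
    return [(w, ratings[w]) for w in words if w in ratings]

weather = score_words("The weather is fine today", RATINGS)
parking = score_words("They had to pay a fine today", RATINGS)

# Both sentences contribute exactly the same score for "fine".
print(weather)
print(parking)
```

The lookup has no way of telling the pleasant adjective from the parking ticket, so both uses count as equally positive.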

Another assumption that the authors make is that the respondents in their survey can speak for the entire population. The authors used Amazon’s Mechanical Turk to crowdsource evaluations for the words in their sub-corpus. 60% of the American people on Mechanical Turk are women and 83.5% of them are white. The authors used respondents located in the United States and India. Can we assume that these respondents have opinions about the words in the corpus that are representative of the entire population of English speakers? Here are the ratings for the various ways of writing laughter in the authors’ corpus:

Laughter tokens Rating
ha 6
hah 5.92
haha 7.64
hahah 7.3
hahaha 7.94
hahahah 7.24
hahahaha 7.86
hahahahaha 7.7
hee 5.4
heh 5.98
hehe 6.48
hehehe 7.06

And here is a picture of a character expressing laughter:

Pictured: Good times. Credit: Batman #36, DC Comics, Scott Snyder (wr), Greg Capullo (p), Danny Miki (i), Fco Plascenia (c), Steve Wands (l).

Can we assume that the textual representation of laughter is always as positive as the respondents rated it? Can we assume that everyone or most people on Twitter use the various textual representations of laughter in a positive way – that they are laughing with someone and not at someone?

Finally, let’s compare some data. The good people at the Corpus of Contemporary American English (COCA) have created a word list based on their 450 million word corpus. The COCA corpus is specifically designed to be large and balanced (although the problem of dealing with spoken language might still remain). In addition, each word in their corpus is annotated for its part of speech, so they can recognize when a word like state is either a verb or a noun. This last point is something that Dodds et al. did not do – all forms of words that are spelled the same are collapsed into being one word. The compilers of the COCA list note that “there are more than 140 words that occur both as a noun and as a verb at least 10,000 times in COCA”. This is the type/token issue that came up in my previous post. A corpus that tags each word for its part of speech can tell the difference between different types of the “same” word (state as a verb vs. state as a noun), while an untagged corpus treats all occurrences of state as tokens of the same type. If we compare the 10,000 most common words in Dodds et al. to a sample of the 10,000 most common words in COCA, we see that there are 121 words on the COCA list but not the Dodds et al. list (Here is the spreadsheet from the Dodds et al. paper with the COCA data – pnas.1411678112.sd01 – Dodds et al corpus with COCA). And that’s just a sample of the COCA list. How many more differences would there be if we compared the Dodds et al. list to the whole COCA list?
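
The two ideas in the paragraph above (counting with and without part-of-speech tags, and comparing two word lists) can be sketched in a few lines of Python. Everything in this example is hypothetical: the tagged tokens, the tag labels and the two tiny word lists are invented for illustration and are not drawn from COCA or from Dodds et al.

```python
from collections import Counter

# Hypothetical tagged tokens; a real corpus supplies its own tag set.
tagged = [
    ("state", "NOUN"),
    ("state", "VERB"),
    ("state", "NOUN"),
    ("fine", "ADJ"),
]

# Untagged view: every "state" collapses into one entry.
untagged_counts = Counter(word for word, _tag in tagged)

# Tagged view: noun and verb uses are counted separately.
tagged_counts = Counter(tagged)

# Hypothetical frequency lists, standing in for two top-10,000 lists.
list_a = {"state", "fine", "get"}
list_b = {"state", "fine", "haha"}

# Words on the first list that are missing from the second.
only_in_a = sorted(list_a - list_b)

print(untagged_counts["state"], tagged_counts[("state", "NOUN")], only_in_a)
```

The same set difference, run over the two real lists, is the kind of comparison that produced the count of 121 mentioned above.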

To sum up, the authors use their corpus of tweets, New York Times articles, song lyrics and books and ask us to assume (1) that they can make universal claims about language despite using only written data; (2) that their data is representative of all human language despite including only four registers; (3) that they have sampled their data correctly despite not knowing what types of people created the linguistic data and only including certain channels of published language; (4) that the relative positive or negative nature of the words in a text is indicative of how positive that text is despite the obvious fact that words can be spelled the same and still have wildly different meanings; (5) that the respondents in their survey can speak for the entire population despite the English-speaking respondents being from only two subsets of two English-speaking populations (USA and India); and (6) that their list of the 10,000 most common words in their corpus (which they used to rate all human language) is representative despite being uncomfortably dissimilar to a well-balanced list that can differentiate between different types of words.

I don’t mean to sound like a Negative Nancy and I don’t want to trivialize the work of the authors in this paper. The corpus that they have built is nothing short of amazing. The amount of feedback they got from human respondents on language is also impressive (to say the least). I am merely trying to point out what we can and can not say based on the data. It would be nice to make universal claims about all human language, but the fact is that even with millions and billions of data points, we still are not able to do so unless the data is representative and sampled correctly. That means it has to include spoken data (preferably a lot of it) and it has to be sampled from all socio-economic human backgrounds.

Hat tip to the commenters on the last post and the redditors over at r/linguistics.

References

Dodds, Peter Sheridan, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth. 2015. “Human language reveals a universal positivity bias”. PNAS 112:8. http://www.pnas.org/content/112/8/2389

Dodds, Peter Sheridan, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, Christopher M. Danforth. 2011. “Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter”. PLOS One. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752#abstract0

Baker, Paul. 2010. Sociolinguistics and Corpus Linguistics. Edinburgh: Edinburgh University Press. http://www.ling.lancs.ac.uk/staff/paulb/socioling.htm

Linell, Per. 2005. The Written Language Bias in Linguistics. Oxon: Routledge.

Mair, Christian. 2015. “Responses to Davies and Fuchs”. English World-Wide 36:1, 29–33. doi: 10.1075/eww.36.1.02mai

Tognini-Bonelli, Elena. 2001. Studies in Corpus Linguistics, Volume 6: Corpus Linguistics at Work. John Benjamins. https://benjamins.com/#catalog/books/scl.6/main

A paper recently published in PNAS claims that human language tends to be positive. This was news enough to make the New York Times. But there are a few fundamental problems with the paper.

Linguistics – Now with less linguists!

The first thing you might notice about the paper is that it was written by mathematicians and computer scientists. I can understand the temptation to research and report on language. We all use it and we feel like masters of it. But that’s what makes language a tricky thing. You never hear people complain about math when they only have a high-school-level education in the subject. The “authorities” on language, however, are legion. My body has, like, a bunch of cells in it, but you don’t see me writing papers on biology. So it’s not surprising that the authors of this paper make some pretty basic errors in doing linguistic research. They should have been caught by the reviewers, but they weren’t. And the editor is a professor of demography and statistics, so that doesn’t help.

Too many claims and not enough data

The article is titled “Human language reveals a universal positivity bias” but what the authors really mean is “10 varieties of languages might reveal something about the human condition if we had more data”. That’s because the authors studied data in 10 different languages and they are making claims about ALL human languages. You can’t do that. There are some 6,000 languages in the world. If you’re going to make a claim about how every language works, you’re going to have to do a lot more than look at only 10 of them. Linguists know this, mathematicians apparently do not.

On top of that, the authors don’t even look at that much linguistic data. They extracted 5,000–10,000 of the most common words from larger corpora. Their combined corpora contain the 100,000 most common words in each of their sub-corpora. That is woefully inadequate. The Brown corpus contains 1 million words and it was made in the 1960s. In this paper, the authors claim that 20,000 words are representative of English. That is, not 20,000 different words, but the 5,000 most common words in each of their English sub-corpora. So 5,000 words each from Twitter, the New York Times, music lyrics, and the Google Books Project are supposed to represent the entire English language. This is shocking… to a linguist. Not so much to mathematicians, who don’t do linguistic research. It’s pretty frustrating, but this paper is a whole lotta ¯\_(ツ)_/¯.

To complete the trifecta of missing linguistic data, take a look at the sources for the English corpora:

Corpus Word count
English: Twitter 5,000
English: Google Books Project 5,000
English: The New York Times 5,000
English: Music lyrics 5,000

If you want to make a general claim about a language, you need to have data that is representative of that language. 5,000 words from Twitter, the New York Times, some books and music lyrics does not cut it. There are hundreds of other ways that language is used, such as recipes, academic writing, blogging, magazines, advertising, student essays, and stereo instructions. Linguists use the terms register and genre to refer to these and they know that you need more than four if you want your data to be representative of the language as a whole. I’m not even going to ask why the authors didn’t make use of publicly available corpora (such as COCA for English). Maybe they didn’t know about them. ¯\_(ツ)_/¯

Say what?

Speaking of registers, the overwhelmingly most common way that language is used is speech. Humans talking to other humans. No matter how many written texts you have, your analysis of ALL HUMAN LANGUAGE is not going to be complete until you address spoken language. But studying speech is difficult, especially if you’re not a linguist, so… ¯\_(ツ)_/¯

The fact of the matter is that you simply cannot make a sweeping claim about human language without studying human speech. It’s like doing math without the numeral 0. It doesn’t work. There are various ways to go about analyzing human speech, and there are ways of including spoken data into your materials in order to make claims about a language. But to not perform any kind of analysis of spoken data in an article about Language is incredibly disingenuous.

Same same but different

The authors claim their data set includes “global coverage of linguistically and culturally diverse languages” but that isn’t really true. Of the 10 languages that they analyze, 6 are Indo-European (English, Portuguese, Russian, German, Spanish, and French). Besides, what does “diverse” mean? We’re not told. And how are the cultures diverse? Because they speak different languages and/or because they live in different parts of the world? ¯\_(ツ)_/¯

The authors also had native speakers judge how positive, negative or neutral each word in their data set was. A word like “happy” would presumably be given the most positive rating, while a word like “frown” would be on the negative end of the scale, and a word like “the” would be rated neutral (neither positive nor negative). The people ranking the words, however, were “restricted to certain regions or countries”. So, not only are 14,000 words supposed to represent the entire Portuguese language, but residents of Brazil are rating them and therefore supposed to be representative of all Portuguese speakers. Or, perhaps that should be residents of Brazil with internet access.

[Update 2, March 2: In the following paragraph, I made some mistakes. I should not have said that ALL linguists believe that rating language is a notoriously poor way of doing an analysis. Obviously I can’t speak for all the linguists everywhere. That would be overgeneralizing, which is kind of what I’m criticizing the original paper for. Oops! :O I also shouldn’t have tied the rating used in the paper to grammaticality judgments. Grammaticality judgments have been shown to be very, very consistent for English sentences. I am not aware of whether people tend to be as consistent when rating words for how positive, negative, or neutral they are (but if you are, feel free to post in the comments). So I think the criticism still stands. Some say that the 384 English-speaking participants are more than enough to rate a word’s positivity. If people rate words as consistently as they do sentences, then this is true. I’m not as convinced that people do that (until I see some research on it), but I’ll revoke my claim anyway. Either way, the point still stands – the positivity of language does not lie in the relative positive or negative nature of the words in a text (the next point I make below). Thanks to u/rusoved, u/EvM and u/noahpoah on reddit for pointing this out to me.]

There are a couple of problems with this, but the main one is that having people rate language is a notoriously poor way of analyzing language (notorious to linguists, that is). If you ask ten people to rate the grammaticality of a sentence on a scale from 1 to 10, you will get ten different answers. I understand that the authors are taking averages of the answers their participants gave, but they only had 384 participants rating the English words. I wouldn’t call that representative of the language. The number of participants for the other languages goes down from there.

A loss for words

A further complication with this article is in how it rates the relative positive nature of words rather than sentences. Obviously words have meaning, but they are not really how humans communicate. Consider the sentence Happiness is a warm gun. Two of the words in that sentence are positive (happiness and warm), while only one is negative (gun). This does not mean it’s a positive sentence. That depends on your view of guns (and possibly Beatles songs). So it is potentially problematic to look at how positive or negative the words in a text are and then say that the text as a whole (or the corpus) presents a positive view of things.
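To make the problem concrete, here is a minimal sketch of what averaging word ratings does to a sentence. The scores are invented for illustration (I am assuming a 1 to 9 scale with 5 as neutral); they are not the ratings from the paper, and the function name is my own.

```python
# Hypothetical word ratings on a 1 (negative) to 9 (positive) scale.
# These numbers are made up for illustration, not taken from the paper.
HYPOTHETICAL_SCORES = {
    "happiness": 8.4,
    "is": 5.0,
    "a": 5.0,
    "warm": 7.0,
    "gun": 2.5,
}

def average_word_score(sentence, scores, neutral=5.0):
    """Average the per-word scores, ignoring word order and context."""
    words = sentence.lower().replace(".", "").split()
    return sum(scores.get(w, neutral) for w in words) / len(words)

original = "Happiness is a warm gun."
shuffled = "A gun is warm happiness."

# Both sentences get the same score because only the word list matters.
print(round(average_word_score(original, HYPOTHETICAL_SCORES), 2))
print(round(average_word_score(shuffled, HYPOTHETICAL_SCORES), 2))
```

Under these made-up scores both word orders come out above neutral, whatever the sentence actually means to the person reading it.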

Lost in Google’s Translation

The last problem I’ll mention concerns the authors’ use of Google Translate. They write

We now examine how individual words themselves vary in their average happiness score between languages. Owing to the scale of our corpora, we were compelled to use an online service, choosing Google Translate. For each of the 45 language pairs, we translated isolated words from one language to the other and then back. We then found all word pairs that (i) were translationally stable, meaning the forward and back translation returns the original word, and (ii) appeared in our corpora in each language.

This is ridiculous. As good as Google Translate may be in helping you understand a menu in another country, it is not a good translator. Asya Pereltsvaig writes that “Google Translate/Conversation do not translate. They match. More specifically, they match (bits of) the original text with best translations, where ‘best’ means most frequently found in a large corpus such as the World Wide Web.” And she has caught Google Translate using English as an intermediate language when translating from one language to another. That means that when going between two languages that are not English (say French and Russian), Google Translate will first translate the word into English and then into the target language. This represents a methodological problem for the article in that using the online Google Translate actually makes their analysis untrustworthy.
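For readers who want to see what “translationally stable” amounts to, here is a rough sketch of the round-trip check the authors describe. The two tiny dictionaries are hypothetical stand-ins that I made up; they are not Google Translate output, and the function name is mine.

```python
# Toy one-word "translators". These tiny dictionaries are hypothetical
# stand-ins for illustration, not output from Google Translate.
EN_TO_ES = {"happy": "feliz", "glad": "feliz", "war": "guerra"}
ES_TO_EN = {"feliz": "happy", "guerra": "war"}

def is_translationally_stable(word, forward, backward):
    """True if translating forward and then back returns the original word."""
    translated = forward.get(word)
    if translated is None:
        return False
    return backward.get(translated) == word

stable = [w for w in EN_TO_ES if is_translationally_stable(w, EN_TO_ES, ES_TO_EN)]
print(stable)  # ['happy', 'war']
```

In this toy setup “glad” falls out of the comparison because the round trip returns “happy”. The check only tells you that the matching is consistent in both directions. It says nothing about whether the two words carry the same emotional weight for their speakers.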

 

It’s unfortunate that this paper made it through to publication and it’s a shame that it was (positively) reported on by the New York Times. The paper should either be heavily edited or withdrawn. I’m doubtful that will happen.

 

Update: In the fourth paragraph of this post (the one which starts “On top of that…”), there was some type/token confusion concerning the corpora analyzed. I’ve made some minor edits to it to clear things up. Hat tip to Ben Zimmer on Twitter for pointing this out to me.

Update (March 17, 2015): I wrote a more detailed post (more references, less emoticons) on my problems with the article in question. You can find that here.

Peter Friederici, in a recent article in the Bulletin of the Atomic Scientists, reminds us that “the language used to characterize the climate problem is far more important than is generally recognized”. Mr. Friederici’s article links to a CBS piece which states things more bluntly:

If you’re trying to get someone to care about the way the environment is changing, you might want to refer to it as “global warming,” rather than “climate change,” according to a new study

The idea is that global warming sounds more dire than climate change. Global warming is more likely to inspire people to do something drastic or force their government to take major steps, but climate change requires only minor steps to solve. So tree-hugging liberals will want to use global warming to fire up their base, while the term climate change is more amenable to the conservative approach of letting the free market sort things out. This idea has been floating around for just over ten years. It was inspired by the American political pollster Frank Luntz. While consulting the Republican Party in 2002, Luntz wrote a memo to President George W. Bush’s staff which read in part:

It’s time for us to start talking about “climate change” instead of global warming […] “Climate change” is less frightening than “global warming.” […] While global warming has catastrophic connotations attached to it, climate change suggests a more controllable and less emotional challenge.

Similar ideas about the differences between these seemingly synonymous terms have been raised in other news outlets. The two articles above also report the results of the Yale Project on Climate Change Communication, which found that:

the term “global warming” is associated with greater public understanding, emotional engagement, and support for personal and national action than the term “climate change.” […] Our findings strongly suggest that the terms global warming and climate change are used differently and mean different things in the minds of many Americans.

The report also says that:

Americans are four times more likely to say they hear the term global warming in public discourse than climate change.

The crucial element missing from all of these news articles and reports is any actual data about how often these terms are used. So let’s see if we can find that out.

Easier said than done

There are a few things to think about before we get started with the data. First, although Luntz’s recommendations were informed by his discussions with voters, we don’t know if President Bush or the Republican party actually listened to him. Reporting that Republicans were advised to use climate change instead of global warming doesn’t mean that they actually did so. In fact, it seems Bush hardly used either term. He didn’t use them in his debates with Democratic presidential candidate John Kerry and he only used the term global climate change once in each of his 2007 and 2008 State of the Union addresses:

And these technologies will help us be better stewards of the environment, and they will help us to confront the serious challenge of global climate change. – George W. Bush, State of the Union 2007

The United States is committed to strengthening our energy security and confronting global climate change. – George W. Bush, State of the Union 2008

So it’s hard to report on something happening when it didn’t happen. Ironically, Kerry used global warming once in his debate in St. Louis and twice in Coral Gables, so maybe he also got Luntz’s memo?

The second thing to think about is that reporting that Americans claim they hear global warming more often than climate change doesn’t mean that they actually do. People are really bad at accurately reporting things like this. For example, before I present the data to you, I want you to ask yourself which term you think is more common on various American news outlets. Based on the information above, do you think Fox News uses global warming more often or climate change? How about NPR and MSNBC? We’ll see whether the numbers back you up in a bit.

Finally, I’m going to take my data from the Corpus of Contemporary American English (COCA), which is a 450 million word database of speech and writing that is “suitable for looking at current, ongoing changes in the language”. I wrote about why it is better to use corpora like COCA instead of the Google N-gram viewer here.

Crunching the numbers

Let’s first see how common each of these terms is. COCA allows us to split up our data into different genres depending on where the texts come from – Spoken, Fiction, Magazine, Academic, and Newspaper – so we can look at only the genres we are interested in. For the purposes of this blog post, I’m going to look at news texts, magazine texts and spoken language data. We could also look at academic genres, but that might be problematic since according to the CBS article “Scientists have largely started using the term climate change because it more accurately describes the myriad changes to the climate […] while global warming refers to a single phenomenon.” So academics are very particular in the terms they use (seriously, we write whole sections of our theses just to define our terms and we love doing it).

Climate change
SECTION   ALL    SPOKEN   MAGAZINE   NEWSPAPER
FREQ      3136   806      1510       820
PER MIL   6.77   8.43     15.8       8.94

Climate change
SECTION   1990-1994   1995-1999   2000-2004   2005-2009   2010-2012
FREQ      156         174         390         1541        883
PER MIL   1.5         1.68        3.79        15.1        17.01

Here we can see the raw count (FREQ) for climate change in the Spoken, Magazine, and Newspaper sections of COCA, as well as for the term in different time periods. This is basically the number of times that the term appears in each section. We also have the frequency per million words (PER MIL), which is a way of normalizing the various sections because they each have a different amount of total words. Looking at this more accurate stat, we can see that climate change is most common in the Magazine genre and that its usage (in all genres taken together) increases over time.

Global warming
SECTION   ALL    SPOKEN   MAGAZINE   NEWSPAPER
FREQ      4031   1063     1801       1147
PER MIL   8.68   11.12    18.85      12.51

Global warming
SECTION   1990-1994   1995-1999   2000-2004   2005-2009   2010-2012
FREQ      519         375         763         1854        520
PER MIL   4.99        3.63        7.41        18.17       10.02

Here we have the same stats for global warming. They show that the term is more common in all of the genres and time periods, except for 2010–2012, when the normalized frequency drops down to 10.02. In the same time period, the frequency for climate change is 17.01. Conservatives are winning!

Not so fast, tiger. We still don’t know who is using these words. Remember that global warming only refers to one of the many changes happening to our planet. Maybe those in the media picked up on this and started using climate change where it was more appropriate. So let’s cut up the genres.

Didn’t you get the memo?

So President Bush didn’t use climate change or global warming. But perhaps this idea that the opposing sides of the debate should use different terms has filtered down to the talking heads on TV. If we remember the idea that people believe they hear global warming more often than climate change in public discourse, we can look at the Spoken section of the corpus to check this claim. Here is where you can check your guesses about which term is more common on various news outlets. Below are the frequencies for climate change in the different sections of the Spoken corpus.

Climate change
Spoken   # PER MILLION   # TOKENS   # WORDS
FOX      19.51           123        6,302,918
NPR      18.45           321        17,399,724
PBS      12.1            80         6,612,202
CNN      5.37            111        20,656,861
NBC      4.41            28         6,348,632
MSNBC    3.68            3          814,156
CBS      3.41            44         12,887,290
ABC      3.29            51         15,514,463
Indep    0.23            1          4,343,343

So climate change occurs about 19 times per million words on Fox News and about 3 times per million words on MSNBC. # TOKENS refers to the actual number of times the term appears in each subsection, while # WORDS refers to how many words make up each subsection.
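If you want to check the normalization yourself, the per-million figure is just the token count divided by the size of the subsection, times one million. Here is a small sketch using three rows copied from the table (the function name is my own):

```python
def per_million(tokens, words):
    """Normalized frequency: occurrences per one million words of text."""
    return tokens / words * 1_000_000

# (tokens, words) for "climate change", copied from the Spoken table.
climate_change = {
    "FOX": (123, 6_302_918),
    "NPR": (321, 17_399_724),
    "MSNBC": (3, 814_156),
}

for outlet, (tokens, words) in climate_change.items():
    print(f"{outlet}: {per_million(tokens, words):.2f}")
```

Note that the MSNBC figure rests on only three tokens in a subsection of under a million words, so it should be read with more caution than the figures for the larger subsections.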

Here are the same stats for global warming:

Global warming
Spoken   # PER MILLION   # TOKENS   # WORDS
FOX      36.33           229        6,302,918
MSNBC    31.93           26         814,156
NPR      17.82           310        17,399,724
PBS      13.16           87         6,612,202
CNN      8.37            173        20,656,861
ABC      6.96            108        15,514,463
Indep    6.22            27         4,343,343
CBS      4.03            52         12,887,290
NBC      3.15            20         6,348,632

Interestingly enough, Fox News tops both lists. What’s strange, though, is that we should have expected a conservative/Republican news site like Fox to use climate change much more than global warming, but that is not the case (they really are fair and balanced!). NPR and PBS use the terms with almost equal frequency, while the commie pinkos over at MSNBC use global warming at a much higher rate than climate change (they’re coming for your guns too!).
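The comparison between the two tables can also be done outlet by outlet. Since each subsection has the same size in both tables, the raw token counts can be compared directly. This is a quick sketch with the counts copied from the tables; the ratio function is just my own shorthand.

```python
# Token counts per outlet, copied from the two Spoken tables:
# (climate change, global warming). Subsection sizes are identical for
# both terms, so raw counts can be compared directly within an outlet.
TOKENS = {
    "FOX": (123, 229), "NPR": (321, 310), "PBS": (80, 87),
    "CNN": (111, 173), "NBC": (28, 20), "MSNBC": (3, 26),
    "CBS": (44, 52), "ABC": (51, 108), "Indep": (1, 27),
}

def warming_to_change_ratio(climate_change, global_warming):
    """How many times global warming appears per use of climate change."""
    return global_warming / climate_change

for outlet, (cc, gw) in TOKENS.items():
    print(f"{outlet}: {warming_to_change_ratio(cc, gw):.2f}")

# Outlets where climate change is the more frequent term.
prefers_climate_change = [o for o, (cc, gw) in TOKENS.items() if cc > gw]
print(prefers_climate_change)  # ['NPR', 'NBC']
```

A ratio above 1 means global warming is the more frequent term on that outlet.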

Everybody chill

But hold on a second. What do these numbers really tell us? First, in terms of the spoken data in COCA, global warming really is more frequent. That doesn’t account for all of the language people hear every day, but it is representative of the public discourse they are likely to hear. Only NPR and NBC used climate change more often, and even then only barely.

While we can say that the issue of climate change or global warming seems to feature more prominently on Fox News compared to CBS or ABC, we don’t really have a way of saying how these terms are used on any channel.

For that we have to look at the concordances (the passages from the texts where our search terms appear). There we can see things like Fox News’s Sean Hannity saying:

Al Gore has a financial stake in spreading global warming hysteria…
 
Al Gore’s friends in the liberal media jumped on the global warming bandwagon…
 
And finally tonight, Al Gore’s global warming manipulation isn’t just affecting food prices…

Could it be possible that Fox News uses global warming in its scare tactics and/or liberal bashing?
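For anyone who hasn’t worked with concordances, here is a bare-bones sketch of a keyword-in-context search. COCA’s own interface does this for you; the function below is only my illustration of the idea, run on two of the Hannity lines quoted above.

```python
import re

def concordance(text, phrase, width=30):
    """Return keyword-in-context lines: `width` characters on each side."""
    lines = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Two of the quoted lines, joined into one stand-in "corpus".
sample = (
    "Al Gore has a financial stake in spreading global warming hysteria. "
    "Al Gore's friends in the liberal media jumped on the global warming bandwagon."
)

for line in concordance(sample, "global warming"):
    print(line)
```

Lining up the hits like this makes the neighboring words (hysteria, bandwagon) easy to scan, which is exactly what frequency counts alone cannot show.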

We can compare this with Hannity’s use of climate change:

the University of Alaska at Fairbanks used 50,000 stimulus dollars to send 11 students to Copenhagen for the failed climate change conference…
 
Jones findings have been used for years to bolster the U.N.’s findings on climate change….

But this is probably nitpicking and it misses the larger point. The words around global warming and climate change say more about their meaning than anything else. We know how Sean Hannity feels about climate change. He says so right here:

HANNITY: Carol, I love you. You’re a great liberal. You defend your side well. If it is hot, it is global warming. If it is cold, it is global warming. If it rains, it’s global warming. If it hails, it is global warming.
 
CAROLINE HELDMAN: Gingrich and Romney are both saying that climate change is happening, are you behind them on this one?
 
HANNITY: I disagree. I don’t think the science is conclusive. Now, I do believe man has an impact on the environment. I want clean air. I want clean water. I want to leave a good planet for our kids and grandkids. But I’m not going to buy lies that are perpetrated by people […] with a political agenda.

I can’t tell if that last line was tongue in cheek, but Hannity seems to opt for another message that was in Luntz’s memo and stress that the scientific jury is still out on global warming. This has also become a conservative talking point. Obviously, the science is firmly in favor of man-made climate change, but even if we replace climate change with global warming in any of the quotes from Sean Hannity, the meaning will not change. The same goes for any of the news outlets above because the difference between these two terms is not that vast. We can all think of two terms which mean roughly the same thing but are not fully interchangeable, and the same is true of climate change and global warming. (To his credit, Frank Luntz realizes the complex nature of language and his advice to President Bush on how to talk about environmental issues was nuanced and erudite.)

The idea here is to make sure not to put the cart in front of the horse. Frank Luntz advised President Bush to start using climate change instead of global warming as one way to swing the environmental issue into the Republicans’ favor. This idea would presumably trickle down to other Republicans in the government and to members of the media sympathetic to Republican views. So the first step would be to look at whether the frequency of global warming rose above that of climate change or not. Judging from the data in COCA, I would say this is not what happened. Global warming was already more common than climate change before Luntz issued his memo to President Bush, and both terms were on the rise. Luntz’s advice could certainly have been a contributing factor to climate change’s gain in usage, but it is certainly not the only one. And global warming is still more common on major American news outlets.

I don’t doubt that the terms have a difference in meaning for many people. No matter how small, there is always some semantic difference between even the closest of synonyms. These differences in meanings are based on many different factors, such as the hearer’s education, social background, nationality, familiarity with the speaker, and the context of the situation. What this boils down to is that it doesn’t matter what we call global warming. Focusing on who uses what term misses the point, even if people have more emotional reactions to one term or the other. Climate change is happening and all that matters is that we do something about it.

In the next post, I’ll do a more in depth quantitative analysis of President Bush’s use of these terms. I’ll also look at the problems with reporting Google Search statistics in research on language, which was a method employed by the Yale Project on Climate Change Communication (the same project that studied people’s feelings about the terms).

Dan Zarrella, the “social media scientist” at HubSpot, has an infographic on his website called “How to: Get More Clicks on Twitter”. In it he analyzes 200,000 link-containing tweets to find out which ones had the highest clickthrough rates (CTRs), which is another way of saying which tweets got the most people to click on the link in the tweet. Now, you probably already know that infographics are not the best form of advice, but Mr. Zarrella did a bit of linguistic analysis and I want to point out where he went wrong so that you won’t be misled. It may sound like I’m picking on Mr. Zarrella, but I’m really not. He’s not a linguist, so any mistakes he made are simply due to the fact that he doesn’t know how to analyze language. And nor should he be expected to – he’s not a linguist.

But there’s the rub. Since analyzing the language of your tweets, your marketing, your copy, and your emails is extremely important for knowing what language works better for you, it is extremely important that you do the analysis right. To use a bad analogy, I could tell you that teams wearing the color red have won six out of the last ten World Series, but that’s probably not information you want if you’re placing your bets in Vegas. You’d probably rather know who the players are, wouldn’t you?

Here’s a section of Mr. Zarrella’s infographic called “Use action words: more verbs, fewer nouns”:

Copyright Dan Zarrella

That’s it? Just adverbs, verbs, nouns, and adjectives? That’s only four parts of speech. Your average linguistic analysis is going to be able to differentiate between at least 60 parts of speech. But there’s another reason why this analysis really tells us nothing. The word less is an adjective, adverb, noun, and preposition; run is a verb, noun, and adjective; and check, a word which Mr. Zarrella found to be correlated with higher CTRs, is a verb and a noun.
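Here is a small sketch of why a word list alone cannot settle the part of speech. The lexicon holds only the three words mentioned above with the categories I listed for them; the example tweet is made up.

```python
# Possible parts of speech per word form, taken from the examples above.
POSSIBLE_TAGS = {
    "less": {"adjective", "adverb", "noun", "preposition"},
    "run": {"verb", "noun", "adjective"},
    "check": {"verb", "noun"},
}

def ambiguous_words(tweet, lexicon):
    """Words whose part of speech cannot be decided without context."""
    words = tweet.lower().split()
    return [w for w in words if len(lexicon.get(w, ())) > 1]

# A made-up tweet for illustration.
tweet = "check out this run"
print(ambiguous_words(tweet, POSSIBLE_TAGS))  # ['check', 'run']
```

A real part-of-speech tagger resolves cases like these by looking at the surrounding words, which is why counting “verbs” from a list of word forms tells you very little.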

I don’t really know what to draw from his oversimplified picture. He says, “I found that tweets that contained more adverbs and verbs had higher CTRs than noun and adjective heavy tweets”. The image seems to show that tweets that “contained more adverbs” had 4% higher CTRs than noun heavy tweets and 5-6% higher CTRs than adjective heavy tweets. Tweets that “contained more verbs” seem to have slightly lower CTRs in comparison. But what does this mean? How did the tweets contain more adverbs? More adverbs than what? More than tweets which contained no adverbs? This doesn’t make any sense.

The thing is that it’s impossible to write a tweet that has more adverbs and verbs than adjectives and nouns. I mean that. Go ahead and try to write a complete sentence that has more verbs in it than nouns. You can’t do it because that’s not how language works. You just can’t have more verbs than nouns in a sentence (with the exception of some one- and two-word phrases). In any type of writing – academic articles, fiction novels, whatever – about 37% of the words are going to be nouns (Hudson 1994). Some percentage (about 5-10%) of the words you say and write are going to be adjectives and adverbs. Think about it. If you try to remove adjectives from your language, you will sound like a Martian. You will also not be able to tell people how many more clickthroughs you’re getting from Twitter or the color of all the money you’re making.

I know it’s easy to think of Twitter as one entity, but we all know it’s not. Twitter is made up of all kinds of people, who tweet about all kinds of things. While anyone is able to follow anyone else, people of similar backgrounds and/or professions tend to group together. Take a look at the people you follow and the people who follow you. How many of them do you know personally and how many are in a similar business to yours? These people probably make up the majority of your Twitter world. So what we need to know from Mr. Zarrella is which Twitter accounts he analyzed. Who are these people? Are they on Twitter for professional or personal reasons? What were they tweeting about and where did the links in their tweets go – to news stories or to dancing cat videos? And who are their followers (the people who clicked on the links)? This is essential information to put the analysis of language in context.

Finally, what Mr. Zarrella’s analysis should be telling us is which kinds of verbs and adverbs equal higher CTRs. As I mentioned in a previous post, marketers would presumably favor some verbs over others. They want to say that their product “produces results” and not that it “produced results”. What we need is a type of analysis that can tell shit (noun and verb) from Shinola (just a noun). And this is what I can do – it’s what I invented Econolinguistics for. Marketers need to be able to empirically study the language that they are using, whether it be in their blog posts, their tweets, or their copy. That’s what Econolinguistics can do. With my analysis, you can forget about meaningless phrases like “use action words”. Econolinguistics will allow you to rely on a comprehensive linguistic analysis of your copy to know what works with your audience. If this sounds interesting, get in touch and let’s do some real language analysis (joseph.mcveigh (at) gmail.com).

 

Other posts on marketing and linguistics

How Linguistics can Improve your Marketing by Joe McVeigh

Adjectives just can’t get a break by Joe McVeigh

Everyone loves verbs, or so you would be led to believe by writing guides. Zack Rutherford, a professional freelance copywriter, posted an article on .eduGuru about how to write better marketing copy. In it he says:

Verbs work better than adjectives. A product can be quick, easy, and powerful. But it’s a bit more impressive if the product speeds through tasks, relieves stress, and produces results. Adjectives describe, while verbs do. People want a product or service that does. So make sure you provide them with one. [Emphasis his – JM]

If you’re a copy writer or marketer, chances are that you’ve heard this piece of advice. It sort of makes sense, right? Well as a linguist who studies marketing (and a former copy writer who was given this advice), I want to explain to you why it is misleading at best and flat out wrong at worst. These days it is very easy to check whether verbs actually work better than adjectives in copy. You simply take many pieces of copy (texts) and use computer programs to tag each word for the part of speech it is. Then you can see whether the better, i.e. more successful, pieces of copy use more verbs than adjectives. This type of analysis is what I’m writing my PhD on (marketers and copy writers, you should get in touch).

Don’t heed your own advice

So being the corpus linguist that I am, I decided to check whether Mr. Rutherford follows his own advice. His article has the following frequencies of usage for nouns, verbs, adjectives, and adverbs:

                 Nouns    Verbs    Adjectives   Adverbs   Word count
Total            275      208      135          90        1195
% of all words   23.01%   17.41%   11.30%       7.53%

Hooray! He uses more verbs than adjectives. The only thing is that those frequencies don’t tell the whole story. They would if all verbs were equal, but those of us who study language know that some verbs are more equal than others. Look at Mr. Rutherford’s advice again. He singles out the verbs speeds through, relieves, and produces as being better than the adjectives quick, easy, and powerful. Disregarding the fact that the first verb in there is a phrasal verb, what his examples have in common is that the verbs are all -s forms of lexical verbs (gives, takes, etc.) and the adjectives are all general adjectives (according to CLAWS, the part-of-speech tagger I used). This is important because a good copy writer would obviously want to say that their product produces results and not that it produced results. Or as Mr. Rutherford says “People want a product or service that does” and not presumably one that did. So what do the numbers look like if we compare his use of -s form lexical verbs to general adjectives?

                 -s form of lexical verbs   General adjectives
Total            24                         135
% of all words   2.01%                      11.30%

Uh oh. Things aren’t looking so good. Those frequencies exclude all forms of the verbs BE, HAVE, and DO, as well as modals and past tense verbs. So maybe this is being a bit unfair. What would happen if we included the base forms of lexical verbs (relieve, produce), the -ing participles (relieving, producing) and verbs in the infinitive (to relieve, it will produce)? The idea is that there would be positive ways for marketers to write their copy using these forms of the verbs. Here are the frequencies:

                 Verbs (base, -ing part., infin., and -s forms)   General adjectives
Total            127                                              135
% of all words   10.63%                                           11.30%

Again, things don’t look so good. The verbs are still less frequent than the general adjectives. So is there something to writing good copy other than just “use verbs instead of adjectives”? I thought you’d never ask.
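The percentages in the tables above are simply each count divided by the article’s total word count. If you want to rerun them, a short sketch (the labels and function name are mine, the counts are from the tables):

```python
WORD_COUNT = 1195  # total words in the article, from the first table

counts = {
    "all verbs": 208,
    "general adjectives": 135,
    "-s form lexical verbs": 24,
    "base, -ing, infinitive and -s verbs": 127,
}

def share(count, total=WORD_COUNT):
    """Percentage of all words in the text."""
    return count / total * 100

for label, count in counts.items():
    print(f"{label}: {share(count):.2f}%")
```

The same calculation works on any text once the words have been tagged for part of speech.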

Some good advice on copy writing

I wrote this post because the empirical research of marketing copy is exactly what I study. I call it Econolinguistics. Using this type of analysis, I have found that using more verbs or more adjectives does not relate to selling more products. Take a look at these numbers.

Copy text   Performance   Verbs – Adjectives
1           42.04         3.94%
2           11.82         0.63%
3           11.81         6.22%
4           10.75         -0.40%
5           2.39          3.21%
6           2.23          -0.78%
7           2.23          4.01%
8           1.88          1.14%
9           (baseline)    5.46%

These are the frequencies of verbs and adjectives in marketing texts ordered by how well they performed. The ninth text is the worst and the rest are ranked based on how much better they performed than this ninth text. The third column shows the difference between the verb frequency and adjective frequency for each text (verb % minus adjective %). If it looks like a mess, that’s because it is. There is not much to say about using more verbs than adjectives in your copy. You shouldn’t worry about it.
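One way to put a number on “a mess” is a rank correlation between performance and the verb-minus-adjective difference. To be clear, this is my own quick check with a hand-rolled Spearman correlation, not part of the original analysis, and I leave out the ninth text because it is the baseline and has no performance value of its own.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Texts 1 to 8 from the table (text 9 has no performance value).
performance = [42.04, 11.82, 11.81, 10.75, 2.39, 2.23, 2.23, 1.88]
verbs_minus_adjectives = [3.94, 0.63, 6.22, -0.40, 3.21, -0.78, 4.01, 1.14]

rho = spearman(performance, verbs_minus_adjectives)
print(f"{rho:.2f}")  # about 0.22
```

With only eight texts, a value this close to zero is nowhere near enough to say that more verbs than adjectives goes along with better performance.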

There is, however, something to say about the combination of nouns, verbs, adjectives, adverbs, prepositions, pronouns, etc., etc. in your copy. The ways that these kinds of words come together (and the frequencies at which they are used) will spell success or failure for your copy. Trust me. It’s what Econolinguistics was invented for. If you want to know more, I suggest you get in touch with me, especially if you’d like to check your copy before you send it out (email: joseph.mcveigh(at)gmail.com).

In order to really drive the point home, think about this: if you couldn’t use adjectives to describe your product, how would you tell people what color it is? Or how big it is? Or how long it lasts? You need adjectives. Don’t give up on them. They really do matter. And so do all the other words.

 

Other posts on marketing and linguistics

How Linguistics can Improve your Marketing by Joe McVeigh

When Google’s Ngram Viewer was the topic of a post on Science-Based Medicine, I knew it was becoming mainstream. No longer happy to only be toyed with by linguists killing time, the Ngram Viewer had entranced people from other walks of life. And I can understand why. Google’s Ngram Viewer is an impressive service that allows you to quickly and easily search for the frequency of words and phrases in millions of books. But I want to warn you about Google’s Ngram Viewer. As a corpus linguist, I think it’s important to explain just what Ngram Viewer is, what it can be used to do, how I feel about it, and the praise it has been receiving since its inception. I’ll start out simple: despite all its power and what it seems to be capable of, looks can be deceiving.

Have we learned nothing?

Jann Bellamy wrote a post at Science-Based Medicine about using Google’s Ngram Viewer (GNV) to research some terms used to describe the very unscientific practice of Complementary and Alternative Medicine (CAM). Although an article of this type is unusual for the SBM site, it does show how intriguing GNV can be. And Ms. Bellamy does a good job by explaining a few of the caveats of GNV:

The database only goes through 2008, so searches have to end there. Also, the searches have to assume that the word or phrase has only one definition, or perhaps one definition that dominates all others. We also have to remember that only books were scanned, not, for example, academic journals or popular magazines. Or blog posts, for that matter.

Ms. Bellamy then goes on to search for some CAM terms. After noting which terms are more common and when they started to rise in usage, she does a very good job at explaining the reasons that certain terms have a higher frequency than others. At the end, however, she is left with more questions than answers. Although she discovered that alternative medicine appears more frequently than complementary medicine in the Google Books database, and although she did further research (outside of Google Books) to explain why, she is still left right where she started. Just looking at the numbers from GNV, she can’t say what kind of impact CAM has had on our (English-speaking) world or culture. So what was the point of looking at GNV at all (besides the pretty colors)?

In her post, Ms. Bellamy links to an article in the New York Times by Natasha Singer. In what is essentially an exposition of GNV, with quotes from two of its founders, Ms. Singer places a lot more stock in the value and capability of the program. But from a corpus linguist’s perspective, she leaps a bit too far to her conclusions.

Ms. Singer’s article begins with the phrase “Data is the new oil” and then goes on to explain the comparison between these two words offered by GNV. She writes:

I started my data-versus-oil quest with casual one-gram queries about the two words. The tool produced a chart showing that the word “data” appeared more often than “oil” in English-language texts as far back as 1953, and that its frequency followed a steep upward trajectory into the late 1980s. Of course, in the world of actual commerce, oil may have greater value than raw data. But in terms of book mentions, at least, the word-use graph suggests that data isn’t simply the new oil. It’s more like a decades-old front-runner.

But with the Google Books corpus (the set of texts that GNV analyzes), we need to remember what the corpus contains, i.e. what “book mentions” means. This lets us know how representative both the corpus and our analysis are. The Google Books corpus does not contain speech, newspapers, tweets, magazine articles, business letters, or financial reports. Sure, oil is important to our culture, and certainly to global and political history, but do people write books about it? We cannot directly extrapolate the findings from Google Books to Culture any more than we can tell people about the world of 16th Century England by studying the plays of Shakespeare. With GNV we can merely study the culture of books (or the culture of publishing). And there are many ways that GNV can mislead you. For example, are the hits in Ms. Singer’s search talking about crude oil, olive oil, or oil paintings? Google Ngrams will not tell you. Just for fun, here’s Ms. Singer’s search redone with some other terms. Feel free to draw your own conclusions.

Search for “data, oil, chocolate, love” on GNV. (Just to be clear, searching for oil_NOUN doesn’t change things much; oil as a verb is almost non-existent in the corpus. Take that as you will.)
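To make the point about the different senses of oil concrete, here is a minimal sketch of the kind of check a bare frequency line can't give you: counting the words that sit directly next to the target word. The sample text and the `neighbours` helper are invented for illustration; none of this comes from the Google Books data.

```python
import re
from collections import Counter

# Invented sample text, purely for illustration (not from Google Books).
sample_text = (
    "The price of crude oil rose again. She fried the fish in olive oil. "
    "The gallery sold two oil paintings. Crude oil exports fell."
)


def neighbours(text, target):
    """Count the words immediately to the left and right of `target`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return left, right


left, right = neighbours(sample_text, "oil")
print(left.most_common())   # words before "oil", e.g. crude, olive
print(right.most_common())  # words after "oil", e.g. paintings
```

Even on four toy sentences, the neighbouring words separate crude oil from olive oil from oil paintings, which is exactly the information a single line for “oil” hides.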

Research casual

The second article I want to talk about comes from Ben Zimmer. While I don’t think Mr. Zimmer needs to be told anything that’s in this post, his article in The Atlantic gets to the heart of my frustration with GNV. It features a more complex search on GNV to find out which nouns modify the word mogul and how they have changed over the last 100 years. In the following passage, he alludes to the reality of GNV without coming right out and saying it.

It’s possible to answer these questions using the publicly available corpora compiled by Mark Davies at Brigham Young University, but the peculiar interface can be off-putting to casual users. With the Ngram Viewer, you just need to enter a search like “*_NOUN mogul” or “ragtag *_NOUN” and select a year range. It turns out that in 20th-century sources, media moguls are joined by movie moguls, real estate moguls, and Hollywood moguls, while the most likely things to be ragtag are armies, groups, and bands.

There are a few points to make about this. First, the interface of the publicly available corpora compiled by Mark Davies could be described as “peculiar”, but that’s only because it’s not the lowest common denominator. And there’s the rub because researchers are capable of so much more using Mark Davies’ corpora. While the interface isn’t immediately intuitive, it certainly isn’t hard to learn. As a bad comparison, think about the differences between Windows, OSX, and a Linux OS. Windows is the lowest common denominator – easiest to use and most intuitive. OSX and Linux, on the other hand, take a bit of getting used to. But how many of us have learned OSX or Linux and willingly gone back to Windows?

The second point is not so much about casual users as it is about casual searches. I think Mr. Zimmer is right to talk about casual users since it’s probable that most of the people who use GNV will be looking for a quick and easy stroll down the cultural garden path. But more to the point, I think he’s right to offer different types of moguls as a search example because that’s about as far as GNV will take you. Can you see which types of moguls people are talking about? No. How about which types of moguls are being used in magazines? Nope. Newspapers? Nuh-uh. You have to turn to one of Mark Davies’ corpora for that. In fact, less casual users are even able to access Google Books (and other corpora) via Mark Davies’ site, and this allows them to conduct more complex searches (For a much more detailed comparison of GNV and some of the corpora offered on Mark Davies’ site, see here). So again the question is what’s the point of looking at GNV at all?

Final thoughts – Almost right but not quite

All this picking on GNV is not without reason. Even though what the people at Google have done is truly impressive, we have seen that the practical use of GNV is limited. As the saying in corpus linguistics goes, “Compiling is only half the battle”. GNV does not offer users a way to really measure what they are (usually) looking for. As an example, a quote from Ms. Singer’s article will suffice:

The system can also conduct quantitative checks on popular perceptions. Consider our current notion that we live in a time when technology is evolving faster than ever. Mr. Aiden and Mr. Michel [two of GNV’s creators] tested this belief by comparing the dates of invention of 147 technologies with the rates at which those innovations spread through English texts. They found that early 19th-century inventions, for instance, took 65 years to begin making a cultural impact, while turn-of-the-20th-century innovations took only 26 years. Their conclusion: the time it takes for society to learn about an invention has been shrinking by about 2.5 years every decade.

While this may be true, it’s not proven by looking at Google Books. For example, ask yourself these questions: what was the rate of literacy in the early 19th century? How many books did people read (or have read to them) in the early 19th century compared to the turn of the 20th century? What was the difference between the rate of dissemination of information in the two time periods? How about the rate of publishing? And what exactly qualifies as technology – farm equipment or fMRI machines? Or does it have to be more closely related to culture and Culturomics – like Facebook?

And most importantly, are books the best way to measure the cultural impact of an idea or technology? The fact is that the system cannot really conduct quantitative checks on popular perceptions. But it can make you think it can.

So GNV has a long way to go. I hesitate to say that they will get there because Google does not really have an interest in offering this kind of service to the public (I didn’t see any ads on the GNV page, did you?). While it may be fun to play around with GNV, I would advise against drawing any (serious) conclusions from what it spits out. Below are some other searches I ran. Again, feel free to draw your own conclusions about how these terms and the things they describe relate to human culture.

Search for “blood, sugar, sex, magik”. Click here to see the results on GNV.

Search for “Bobby Fischer, Jay Z”. Click here to see the results on GNV.

Search for “Bobby Fischer, Jay Z, Eminem, Dr Dre, Run DMC, Noam Chomsky”. Click here to see the results on GNV.

Search for “Johnny Carson, Conan O’Brien, Jay Leno, David Letterman, Jimmy Kimmel, Jimmy Fallon, Big Bird, Saturday Night Live” (from the year 1950). Click here and then “Search lots of books” to see the results on GNV.

Search for “Superman, Batman, Wonder Woman, Buffy, King Arthur, Robin Hood, Hercules, Sherlock Holmes, Pele”. Click here to see the results on GNV.

Search for “Barack Obama, George Bush, Bill Clinton, Ronald Reagan, Richard Nixon, John * Kennedy, Dwight * Eisenhower, Harry * Truman, Franklin * Roosevelt, Abraham Lincoln, Beatles”. Click here to see the results on GNV.


Notice how the middle initial of some presidents complicates things in the above search. It would be nice to be able to combine the frequencies for “John Fitzgerald Kennedy”, “John F Kennedy”, “John Kennedy”, and “JFK” into one line, and exclude hits like “John S Kennedy” from the results completely, but that’s not possible. You could, however, search GNV for the different ways to refer to President Kennedy and see the differences, for whatever that will tell you.
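For what it’s worth, the combining I have in mind is simple once you have the per-year numbers for each variant in hand. Here is a rough sketch, assuming you had somehow exported yearly counts for each way of writing the name. The counts, the `variant_counts` table, and the `combine_variants` helper are all made up for illustration; they are not GNV output or part of any GNV interface.

```python
from collections import defaultdict

# Hypothetical yearly counts per name variant (made-up numbers, not GNV data).
variant_counts = {
    "John Fitzgerald Kennedy": {1960: 10, 1970: 30},
    "John F Kennedy": {1960: 40, 1970: 90},
    "John Kennedy": {1960: 20, 1970: 25},
    "JFK": {1960: 5, 1970: 15},
    "John S Kennedy": {1960: 2, 1970: 1},  # a different person, to be excluded
}

# Only the variants we have decided refer to the president.
KENNEDY_VARIANTS = [
    "John Fitzgerald Kennedy",
    "John F Kennedy",
    "John Kennedy",
    "JFK",
]


def combine_variants(counts_by_variant, include):
    """Sum the yearly counts of the chosen variants into one series."""
    combined = defaultdict(int)
    for variant in include:
        for year, count in counts_by_variant[variant].items():
            combined[year] += count
    return dict(sorted(combined.items()))


print(combine_variants(variant_counts, KENNEDY_VARIANTS))
```

Leaving “John S Kennedy” out of the list is the whole trick, but notice that the decision about which variants count is still made by a person. “JFK” is also the name of an airport, so even the combined line would need someone to go and read the hits.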
 
 
Search for “Brad Pitt, Audrey Hepburn, Noam Chomsky, Bob Marley”. Click here to see the results on GNV.


Noam Chomsky has had a bigger effect on our culture than Audrey Hepburn, Bob Marley, and Brad Pitt? You be the judge!
