NPR’s Code Switch did an interview about language a few months ago and it stayed on my mind because of how bad it was. I gave it a re-listen and I’d like to point out just why it’s so bad. You can listen to the episode below. It’s episode 42 and it’s called “Not-So-Simple Questions From Code Switch Listeners”. The interview in question starts at the 14:47 mark. The hosts, Gene Demby and Shereen Marisol Meraji, talk to Brent Blair about what it sounds like to be American. I couldn’t find a transcript of the interview, so I made my own, which you can find here. I’ll summarize Blair’s points below and briefly point out why they are wrong. The linguistics behind each of the topics that I discuss below is complex, but I will try to keep things simple in order to keep things short.

1. We understand this quote unquote “American dialect” or “Received American Pronunciation” based on culture and media: what sells.

No, we don’t. We (I mean linguists, people who study dialects) understand American dialects (plural) based on how the dialects sound. Non-linguists (and linguists when they’re not studying dialects) understand dialects through an array of socio-economic and linguistic factors.

“Received American Pronunciation” is not a thing. Blair is mixing up General American and Received Pronunciation, the accents with the highest prestige in the US and the UK, respectively. Many national newscasters in the US use General American on air (for example, Brian Williams). In the UK, Received Pronunciation is used by the Royal Family and members of parliament (with exceptions, of course). Mixing up the names of these two dialects is such an incredibly basic mistake that it’s hard to believe someone would make it. It’s like someone talking about the Boston Yankees baseball team. Or the band Led Sabbath. Or President Abraham E. Lee. The term General American is not without its problems.

2. What we understand as the American dialect comes from the West Coast, specifically Hollywood, and what Hollywood has considered the standard American dialect. This dialect is “vanilla” – its features do not include “twisty or harsh R sounds or twangy stuff or dropped AH” (quotes from Blair).

It’s probably not surprising that a theater professor would think that Hollywood is responsible for our thoughts on American dialects. Blair is almost correct on this – the dialect used in many popular movies is indeed General American. It doesn’t come from Hollywood, though. The dialect known as General American comes from the eastern part of the US, and it is often considered the dialect of the Midwestern region of the United States, not California. General American is believed to not have any regional or ethnic features, but obviously this is nonsense. It is a mish-mash of various dialects. It’s also (as far as I can tell) not really used in dialect studies anymore.

Map of the dialects of North America. From The Atlas of North American English by Labov, Ash and Boberg (2006; Map 11.15).

The terms “vanilla”, “twisty”, “harsh R”, “twangy”, and “dropped AH” are not used in dialect studies. These terms are problematic. For example, the dialect that Blair is calling standard, the one from Hollywood, uses an R sound. This is one of the ways that linguists describe dialects: whether they include a post-vocalic R or not. Linguists use the term rhotic to describe dialects which pronounce the R when it comes after a vowel, and non-rhotic to describe dialects which do not pronounce post-vocalic Rs. The Boston dialect is classically non-rhotic, with Hahvahd Yahd (Harvard Yard) being a common term used by people imitating the dialect. (Notice that the Boston dialect doesn’t drop all of its Rs, just the ones which come after a vowel and before a consonant. No one in Boston goes to watch the Pat_iots or B_uins play.) So, do rhotic dialects have “harsh R sounds”? I don’t know because I don’t know what the hell that means. What does “twangy” mean? What dialect sounds “twangy”? Does Nelly sound “twangy” (he’s from St. Louis)? Does Taylor Swift (she’s from eastern Pennsylvania)? Can I say that this whole interview sounds “twangy” or should I use the more technical term: shitty?

3. Regionalisms in dialects are disappearing rapidly. Today a person from Atlanta, Georgia, sounds like a person from California. You can’t tell the difference between people from Houston, Chicago and New York. By contrast, dialects in rural areas are still diverse.

Blair couldn’t be more wrong about this. Literally the first page of William Labov’s Dialect Diversity in America says “People tend to believe that dialect differences in American English are disappearing, especially given our exposure to a fairly uniform broadcast standard in the mass media. One can find this point of view in almost any discussion of American dialects […] This overwhelming common opinion is simply and jarringly wrong.” THE FIRST GODDAMN PAGE. Of a book that is sure to turn up in any Amazon or Google search on dialects in America. There is no way that Blair’s name showed up in a Google search of dialects in America.

Even though the Code Switch hosts didn’t need to read past the second page of Labov’s book to get better info than Blair gave them, if they had made it to page 35, they would have read “The dialects of Chicago, Philadelphia, Pittsburgh, and Los Angeles are now more different from each other than they were 50 or 100 years ago […] On the other hand, dialects of many smaller cities have receded in favor of the new regional patterns.” Again, exactly the opposite of what Blair told them. Labov also does something which Blair does not: he backs up his claims with (decades of) research. I guess they do linguistics differently in the field of theater studies.

As if that wasn’t enough, here’s a story from NPR about dialects NOT disappearing!

4. Globalization, commercialism, and our careers have made us say “We all want to sound the same”.


5. This “vanilla” Californian dialect, or this blending of dialects, and/or the disappearance of regionalisms is not due to class or race, but access and power. (It’s hard to tell what they are talking about here. They use the term “placeless”.)

Things kind of break down around point 5. Blair has dug himself into a hole and he can’t get out. He talks about how people of color are only allowed to use the Vanilla-fornian dialect based on the culture that is employing them and their relationship to systems of power, but it is unclear what he means and he is unable to explain. He only offers an immediate anecdote – the interviewer Meraji is able to say “Latino” with a Puerto Rican accent on NPR, so maybe she would allow herself to use more Spanish on air in the future. But Spanish isn’t a dialect. Meraji would allow herself to speak Spanish on NPR if she knew her audience would understand her. Blair wraps it all up with something truly bizarre when he says, “So for me, when we’re accent stereotyping, it just means we haven’t fallen in love enough with that community to understand its diversity and its complexity”. I don’t know what the hell this guy is talking about.

Pointing fingers

So who’s at fault here? I think partial blame falls on both sides.

First, Blair should be blamed for not saying no to the interview. If NPR called me up and asked me to talk about theater studies, I would say no. Because I’m not a theater scholar or professional. If someone called you up and said “Hey, we want to talk about theoretical mathematics on the radio,” would you say “Sure! I took math in high school. Let’s do this.”? No, of course you wouldn’t. But they called Blair up and he said, “Ummmm, I speak a language. Get me on the phone!” And then he proved that he knows about as much about language and dialects as I do about theater studies. It’s not that Blair can’t know anything about dialects in America, it’s that he showed he doesn’t know anything about dialects in America. If he had gotten everything right, I wouldn’t be writing this blog post.

Some of the blame also goes to the people at Code Switch though. If they wanted to talk about language and dialects, why didn’t they call a linguist? Why did they think calling a theater professor, who as far as I can tell has not written anything on language, would be ok? In an earlier part of this episode, the hosts have a discussion about the magical negro and they talk to Ebony Elizabeth Thomas, a professor and researcher who has published on representations of people of color in various media. Thomas is at the University of Pennsylvania, the same university as Labov, who I quoted above. She literally could have transferred them over to his office. Or they could have talked to Walt Wolfram or Natalie Schilling or John Baugh. Any of these people would have been far better than Blair.

Ok, I’ve been pretty hard on everyone in this interview. You may be thinking, jeez, this guy just doesn’t like it when people talk about language. That’s not the case. I don’t like it when prominent news organizations talk about language and get it so wrong (I see you, The New Yorker). If you want to hear a really great interview on language and linguistics, go listen to this Top of Mind interview (download it here). The host, Julie Rose, and the guests talk about filler words (um, uh, you know, etc.), which is – like dialects – a linguistic topic with a divide between what the public thinks and what linguists have discovered. To discuss this topic, the host invited two linguists who have researched filler words, Alexandra D’Arcy and Jena Barchas-Lichtenstein. I hope other interviewers listen to this and learn how to discuss language on air.

If you are interested in learning more about dialects in America and/or dialect discrimination, follow the links behind the researchers’ names in the previous two paragraphs. Most of them have written books and articles aimed at the general public. Walt Wolfram even has a movie about African American speech coming out and it sounds amazing. I’m not saying that all of the things you will read are going to be positive – discrimination based on language happens and it is terrible. But the research put out by these and other linguists is fascinating and it can actually do what the NPR Code Switch interview attempted to do: make you more informed about language.

Hat tip to Nicole Holliday on Twitter for pointing me to this Code Switch episode. Holliday would also have been good for this interview.

Update 14 June 2017:

Almost immediately after I posted this article and shared it on Twitter, Gene Demby reached out. Gene is one of the hosts of NPR’s Code Switch. According to him, this episode “was the source of much consternation”. Gene wanted to talk to a linguist but was overruled by an editor. He has also said that Code Switch will do better in the future and that they have an episode about African American Vernacular English (AAVE) coming up. I’d like to thank Gene for clearing things up and I look forward to that episode.

Also related to this post, Kevin Calcamp reached out to say that Blair’s views are not representative of the study of linguistics in theater and performance studies. Kevin says that theater/performance scholars have a good understanding of linguistics. I believe him. He also pointed out the complicated nature and the various ways of incorporating dialects into theater/performance studies (follow the tweet below to see more). Thanks, Kevin, for explaining things.

Mary Norris’s book Between You & Me: Confessions of a Comma Queen (2015, Norton) is part autobiography, part style guide. Norris has been an editor at The New Yorker magazine for many years and her voice can be heard through the text, which makes parts of this book an enjoyment to read, especially when she tells stories about her life. She says in the intro that her book is “for all of you who want to feel better about your grammar” (p. 14), which is an unfortunate dedication since the book goes off the rails when Norris discusses grammar and linguistics. In these sections, Norris doesn’t just make herself look bad, but she also ropes in the rest of the editorial staff at The New Yorker.

Between You & Me: Confessions of a Comma Queen by Mary Norris (2015, Norton)

Early on, Norris discusses the importance of dictionaries to editing. She also, however, walks right into a minefield when she discusses her and The New Yorker’s preference for a dictionary published in the 1930s over nearly all others:

If we cannot find something in the Little Red Web [Merriam-Webster’s Collegiate Dictionary 2003], our next resort is Webster’s New International Dictionary (Unabridged), Second Edition, which we call Web II. First published in 1934, it was the Great American Dictionary and is still an object of desire: 3,194 pages long, with leisurely definitions and detailed illustrations. It was supplanted in 1961 by Webster’s Third, whose editors, led by Philip Gove, caused a huge ruckus in the dictionary world by including commonly used words without warning people about which ones would betray their vulgar origins. (p. 18)

Norris is selling Gove and the other editors of Merriam-Webster’s short here. Gove actually wrote that “We must see to it that a mid-twentieth-century dictionary gives evidence of having been written by editors who lived in the twentieth century” (quote from The Story of Ain’t by David Skinner, p. 205) and what Gove did (besides dropping sick burns) was help systematize the way that dictionaries qualified words for their “vulgar” natures. Gove also saw to it that the quotes used to illustrate the meanings of words were neither archaic nor unnatural, i.e. contemporary quotes rather than contrived sentences written by the dictionary makers. But Gove’s actions caused a lot of uptight social commentators to get their knickers in a bunch, as Norris briefly explains:

On the publication of this dictionary, which we call Web 3, a seismic shift occurred between prescriptivists (who tell you what to do) and descriptivists (who describe what people say, without judging it). In March of 1962, The New Yorker, a bastion of prescriptivism, published an essay by Dwight MacDonald [who was not a linguist, nor a language scholar – JM] that attacked the dictionary and its linguistic principles: ‘The objection is not to recording the facts of actual usage. It is to failing to give the information that would enable the reader to decide which usage he wants to adopt.’ (p. 18)

It is no more surprising that Norris sticks by MacDonald’s essay than it is that MacDonald went to The New Yorker to voice his complaint. But romanticizing the fact that Norris and her fellow editors use a dictionary from the 1930s (Webster’s Second) over more modern ones doesn’t look prescriptivist, it looks downright foolish. Norris drives the point home:

Since the great dictionary war of the early sixties, there has been an institutional distrust of Web 3. It’s good for some scientific terms, we say, patronizingly. Its look is a lot cleaner than that of Web II. Lexicology aside, it is just not as beautiful. I would not haul a Web 3 home. You can even tell by the way it is abbreviated in our offices that it is less distinguished: Webster’s Second gets the Roman numeral, as if it were royalty, but Webster’s Third must make do with a plain old Arabic numeral. (p. 19)

This is nonsense. The editors at The New Yorker are prioritizing a dictionary from 1934 because it “enables the reader to decide which usage he wants to adopt”. Think about that for a second. Who in their right mind wants their writing to sound like it was published in 1934? The New Yorker is not a “bastion of prescriptivism”, it is an ancient ruin of unfounded notions about language.

MacDonald can maybe be excused for the incorrect ideas in his article. They were, after all, popular at the time. But Norris doesn’t get off so easy. She wrote her book in the 2010s, well after the ideas in MacDonald and W2 were shown to be incorrect. Think about what she is doing here. She is using a 50-year-old article with incorrect ideas about language to defend her use of an 80-year-old dictionary. If your doctor recommended that you start smoking Camels because a commercial in the 1950s said they activate your T-zone, you would find another doctor.

Later in the book, Norris visits the offices of Merriam-Webster and says “These people are having far too much fun to be lexicographers” (p. 29). This is perhaps true, and she might even believe it, but I doubt she likes any of the advice that the MW editors give online or in their videos.

Bad Grammar

Every chapter in Norris’s book starts with a personal story and moves into a topic of English grammar or style. In Chapter 2, titled “That witch!”, Norris discusses relative clauses. She gives some OK advice about how to distinguish whether the clause is restrictive or non-restrictive, but then makes some major mistakes on what to do after that:

If the phrase or clause introduced by a relative pronoun – “that” or “which” – is essential to the meaning of the sentence, “that” is preferred, and it is not separated from its antecedent by a comma. (p. 40)

I suppose Norris means that that is preferred in The New Yorker, but it sounds like she means that is preferred across the English language, which simply isn’t true. Anyone who has spent any time hanging out with the English language would know this. Perhaps she means that that is preferred by people (such as editors at The New Yorker?) who wish they could dictate which relative pronoun should be used in all cases across the English language. Norris then gives us a half-baked explanation of what’s going on with that and which in relative clauses:

If people are nervous, they sometimes use “which” when “that” would do. Politicians often say “which” instead of “that”, to sound important. A writer may say “which” instead of “that” – it’s no big deal. It would be much worse to say “that” instead of “which.” Apparently the British use “which” more and do not see anything wrong with it. Americans have agreed to use “that” when the clause is restrictive and to use “which,” set off with commas, when the clause is nonrestrictive. It works pretty well. (p. 41)

What? No. There is so much wrong with this paragraph. First, what the hell does Norris mean by the first two sentences? Is she a professional on spoken English now? The third sentence gives it away – writers don’t “say” things, they write things. But Norris doesn’t realize that she has blurred the line between spoken and written language so much that she’s erased it. This paragraph means that an admittedly prescriptivist editor of written language – who prefers a dictionary from 1934 – can’t tell the obvious difference between spoken and written English and that we are supposed to take for granted her claims about ALL spoken English, based on… something. Another thing that is wrong with this paragraph is that it is demonstrably wrong that Americans have “agreed to use ‘that’” with restrictive relative clauses. This was dictated by copy editors in the beginning of the 20th century! This hope/wish/desire to separate which and that comes from Fowler (1926), who wrote “The two kinds of relative clauses, to one of which that and the other to which which is appropriate, are the defining [restrictive] and the non-defining [non-restrictive]; and if writers would agree to regard that as the defining relative pronoun, and which as the non-defining, there would be much gain both in lucidity and in ease. Some there are who follow this principle now; but it would be idle to pretend that it is practice either of most or of the best writers.” (Fowler’s Dictionary of Modern English Usage, 4th ed., 2015, edited by J. Butterfield, p. 809) Even Fowler gave up on this that/which nonsense. You would think Norris would recognize this because of her preference for early 20th century English reference works. No one cares about this that/which distinction anymore, if they ever did. It wasn’t just the British who saw nothing wrong with using which in restrictive relative clauses. Americans have also never cared about this when they were speaking naturally*.

Norris also has a chapter on pronouns, in which she wastes four pages (pp. 60-63) blabbering about pronouns before we get to the point of the chapter, i.e. the (supposed) problem of English’s (supposed) lack of a gender-neutral third person singular pronoun. The chapter ends with a heartfelt and well written personal story about Norris having to switch the pronouns she used for a family member who transitioned. Norris quite deftly shows how personal our pronouns can be and this part of the chapter is definitely worth reading. What comes before it, however, are a bunch of pronoun howlers.

One of the stranger ones is when Norris claims that “There is only one documented instance of a gender-neutral pronoun springing from actual speech, and that is ‘yo,’ which ‘spontaneously appeared in Baltimore city schools in the early-to-mid 2000s.’” (p. 66) What? Does Norris actually believe this? The research cited on yo is from Stotko and Troyer, but they do not claim that yo is the only documented instance of a gender-neutral pronoun springing from actual speech (Stotko, Elaine M. and Margaret Troyer. 2007. “A New gender-neutral pronoun in Baltimore, Maryland: A preliminary study”. American Speech 82(3): 262–279. https://dx.doi.org/10.1215/00031283-2007-012).

Then Norris drops the bomb:

I hate to say it, but the colloquial use of “their” when you mean “his or her” is just wrong. (p. 69)

Ugh, where to start? Literally right before this sentence, Norris said that having singular you and plural you is fine. But then she says that singular they is not because… reasons? Norris actually tries to claim that the epicene he would be invisible if we didn’t “make such a fuss” about it. Guess what? It isn’t and we do. Does Norris really think that the epicene he is only visible because people complain about it? She has it backwards. The epicene he is complained about because it is so damn visible. And are we really to believe that he would be invisible to Norris? She devoted an entire chapter in her book to pronouns. Also, singular they isn’t colloquial (although I’m willing to bet that the editors of The New Yorker have a different definition of the term “colloquial” – one from the 1930s perhaps). It has been used across all types of texts and registers and first appeared 800 years ago. (Wait, is it possible that singular they SPRINGS FROM ACTUAL SPEECH?! Omg you guys!!1!) Basically, if you have a problem with singular they, maybe it’s time to get over it. Or, if you’re going to complain about singular they, maybe you shouldn’t use it in your writing. That’s right, Norris uses singular they in this book:

A notice from the editor, William Shawn, went up on the bulletin board, saying that anyone whose work was not “essential” could go home. Nobody wanted to think they were not essential. (p. 11)


The discussion of pronoun usage gets more convoluted after this. On the very next page (p. 70), after telling us that a writer was wrong for not using the epicene he, Norris says that a New Yorker staff writer was correct in using singular they. So what the hell is going on here? I don’t know and I’m starting to not care.

Chapter 4 – “Between you and me”

This might be the most confusing chapter in terms of grammar. Norris writes:

The most important verb is the verb “to be” in all its glory: am, are, is, were, will be, has been. (p. 84)

So will be and has been are part of the verb BE? Uhh… how? And why isn’t being in that list, or (by Norris’s logic) have been? No one knows.

The rest of this chapter goes from bad to worse. Immediately after this quote, Norris discusses nouns, rather than noun phrases, even though she uses noun phrases rather than single-word nouns (such as copy editor and my plumber). In a later admission that there are several copulative verbs, Norris says that “It is because these verbs are copulative and not merely transitive that we say something ‘tastes good’ (an adjective), not ‘well’ (an adverb): the verb is throwing the meaning back onto the noun”. What does this mean? Norris is also incorrect when she says that “nouns are modified by adjectives, not adverbs”. Noun phrases are modified by other noun phrases (a no-frills airline, sign language) as well as adverb phrases (the then President, a through road). Those examples are from Downing & Locke (2006: 436), but from The New Yorker we have “Danny Hartzell backed a Budget rental truck up to a no-frills apartment building…” from a piece called “Empty Wallets” by George Packer in the July 25, 2011 issue, perhaps edited by Norris. But this isn’t even a matter of modification. In Norris’s example (“Something tastes good”), the adjective phrase good does not modify the noun phrase something, but rather functions as a complement in the sentence. Essentially, the subject (which may be a noun phrase or may be something else) requires a complement when a copulative verb is used. And there is no reason that adverb phrases cannot act as complements after copulative verbs (They’re off!, I am through with you, That is quite all right).

In the following paragraph, Norris writes “One might reasonably ask, if we can use the objective for the subjective, as in ‘It’s me again,’ why can’t we use the subjective for the objective?” But again this is confusing and it’s hard to tell whether Norris believes that me is the subject in her example sentence (hint: it’s not, it’s what some grammars call an extraposed subject, but I can see how Norris would be confused – The New Yorker has proven its ineptitude when it comes to describing sentences of this type. See Downing & Locke 2006: 47–48, 261).

In discussing grammar, Norris also tells stories about working at The New Yorker. It’s hard to describe how shocking some of these are, so I’ll let Norris tell it:

Lu Burke once ridiculed a new copy editor who had come from another publication for taking the hyphen out of “pan-fry.” “But it’s in [Webster’s dictionary],” the novice chirped. “What are you even looking in the dictionary for?” Lu said, and I wish there were a way of styling that sentence so that you could see it getting louder and more incredulous toward the end. She spoke it in a crescendo, like Ralph Kramden, on The Honeymooners, saying, “Because I’ve got a BIG MOUTH!” Without the hyphen, “panfry” looks like “pantry.” “Panfree!” Lu guffawed, and said it again. “Panfree!” The copy editor was just following the rules, but Lu said she had no “word sense.” Lu was especially scornful of unnecessary hyphens in adverbs like “feet first” and “head on.” Of course, “head on” is hyphenated as an adjective in front of a noun – “The editors met in a head-on collision” – but in context there is no way of misreading “The editors clashed head on in the hall.” The novice argued that “head on” was ambiguous without the hyphen. Lu was incredulous. “Head on what?” she howled, over and over, as if it were an uproarious punch line. Eventually, that copy editor went back to where she had come from. “It’s as if I tried to become a nun and failed,” she confided. It did sometimes feel as if we belonged to some strange cloistered order, the Sisters of the Holy Humility of Hyphens. (p. 116)

Some strange cloistered order? Jesus Christ, working at The New Yorker sounds fucking miserable. “Pan-fry” needs a hyphen because, what, the readers of The New Yorker are so fucking dumb that they would think it means “panfree”? Probably not, but what a great excuse for one of the editors to be a total dick to an employee, huh? Hahaha, good times!

Here is the sentence in question, from a 1977 issue of The New Yorker:

“It’s heartening to see that a restaurant in a national park is going to take the time to pan-fry some chicken,” I told Tom.

Whoa! Good thing that hyphen was there or I would’ve thought this guy was taking time to panfree some chicken and WHAT THE FUCK WHY WOULD I THINK THAT.

Incredibly, the hits keep on coming in the next paragraph:

The writer-editor Veronica Geng once physically restrained me from looking in the dictionary for the word “hairpiece,” because she was afraid that the dictionary would make it two words and that I would follow it blindly. As soon as she left the office, I did look it up, and it was two words, but I respected her word sense and left it alone. (p. 117)

Ok, now respect the word sense of writers who use(d) singular they.

And if you’re wondering why The New Yorker still writes “teen-ager”:

Not everyone at The New Yorker is devoted to the diaeresis [the two little dots that The New Yorker – and only The New Yorker – places over the word cooperate]. Some have wondered why it’s still hanging around. Style does change sometimes. […]

Lu Burke used to pester the style editor Hobie Weekes, who had been at the magazine since 1928, to get rid of the diaeresis. Like Mr. Hyphen, Lu was a modern independent-minded reader, and she didn’t need to have her vowels micromanaged. Once, in the elevator, Weekes seemed to be weakening. He told her he was on the verge of changing that style and would be sending out a memo soon. And then he died.

This was in 1978. No one has had the nerve to raise the subject since. (pp. 123–124)

Kee-rist, I’m surprised they don’t write “base-ball” and “to-morrow” and “bull-shit”.

A chapter about pencil sharpening. Seriously.

Chapter 10 (“Ballad of a Pencil Junkie”) is some sort of dime store pencil porn as Norris describes pencils in such detail that only an actual pencil would find it interesting. I kept thinking that I would rather have pencils in my eyes, but then I came across the best line in the entire book:

David Rees specializes in the artisanal sharpening of No. 2 pencils: for a fee (at first, it was fifteen dollars, but like everything else, the price of sharpening pencils has gone up), he will hand-sharpen your pencil and return it to you (along with the shavings), its point sheathed in vinyl tubing. (p. 182)



The New Yorker hardly needs help in showing people that it has a very tenuous grasp of English grammar [links to LangLog and Arnold Zwicky]. They demonstrate that in their pages whenever the topic of grammar comes up. Apparently, decades of publishing some of the greatest writers has not helped anyone at the magazine to learn how English grammar works. Unfortunately, Norris’s book does nothing to help The New Yorker’s reputation when it comes to grammar. On top of that, some of the stories she tells about working at The New Yorker are pretty horrifying. If you are able to separate or skip over the discussions of grammar, this book may be enjoyable for you. It’s an easy read, but I couldn’t force myself to like it.



* Not to mention Norris doesn’t even follow her own advice –

p. 15: “It is one of those words which defy the old “i before e except after c” rule”

p. 54: “The piece also had numbers in it – that is, numerals – which I instinctively didn’t touch”

And she quotes A. A. Milne doing it: “If the English language had been properly organized … there would be a word which meant both ‘he’ and ‘she’” (p. 64)

And Henry James: “Poor Catherine was conscious of her freshness; it gave her a feeling about the future which rather added to the weight upon her mind.” (p. 143)

And Mark Twain: “It was what I thought when I stood before ‘The Last Supper’ and heard men apostrophizing wonders and beauties and perfections which had faded out of the picture and gone a hundred years before they were born.” (pp. 147-48)

You could argue that these are all old/dead writers and that no one should write like that anymore, but again, The New Yorker magazine, as well as the author of Between You & Me, prefers to use a dictionary from 193fucking4.

The following is a sentence on an exam I gave my students this semester. It’s a lyric from the totally awesome band The Go-Go’s (who are too punk rock to care about using your lame apostrophes correctly). Read it and decide which part of speech you think sealed is: verb or adjective?

In the jealous games people play, our lips are sealed.

I first thought that sealed is clearly an adjective and that it functions as the subject complement of the sentence (a subject complement is an element required by copular verbs, such as be and seem, which does not encode a different kind of participant to the subject in the phrase in the way that an object does). But many of my students analyzed it as a verb. This calls for some weekend grammar research (while listening to the Go-Go’s of course)!

On the exam, students had to mark the function (subject, predicate, object, etc.) of each clause in the sentence. In the grammar that we’re using (English Grammar: A University Course, 2nd ed., 2006, by Downing and Locke), only verb phrases can be included in the predicate. This means that if sealed is a verb, the phrase consists of only a subject (Our lips) and a predicate (are sealed).

Two dictionaries list sealed as an adjective: the OED and Macmillan Dictionary. The OED’s citation which mirrors this construction is a bit out of date though. It comes from the 1611 printing of the King James Bible: And the vision of all is become vnto you, as the wordes of a booke that is sealed. Macmillan Dictionary only offers “a sealed box/bag/envelope” as an example. Four other dictionaries (Merriam-Webster’s, Dictionary.com, Oxford Learner’s Dictionary, and Oxford Dictionaries) do not list sealed as an adjective, only as a transitive verb (i.e. it needs an object). Strangely, Oxford Learner’s Dictionary has this example sentence under the second entry for seal as a verb:

The organs are kept in sealed plastic bags.

In this case, sealed is definitely an adjective modifying a noun (plastic bags). This must be an oversight by the editors. More importantly, though, is the fact that sealed in Our lips are sealed does not have an object. What gives?

Well, sealed is more of a participial adjective than anything else (some grammars use the terms verbal adjective or attributive verb). It’s an adjective that has been derived from a verb. Participial adjectives look like verbs but they function grammatically like adjectives. I know. Welcome to the Twilight Zone. These are the cases which really show that there are not sharp limits between the parts of speech, but rather very hazy boundaries. Sometimes it is easy to tell whether the word in question is a verb or an adjective. For example:

This is the sealed envelope that you mailed. = adjective

I sealed the envelope with a kiss. = verb

Other times – such as the one under discussion here – things are not so clear cut. Downing & Locke (p. 479) say that “past participles may often have either an adjectival or a verbal interpretation. In The flat was furnished, the participle [furnished] may be understood either as part of a passive verb form or as the adjectival subject complement of the copula was.” This means that sealed could be a passive verb that is simply missing its agent (the by-phrase). The agent is presumably missing because we know that the person who owns the lips is the one who seals them, so it would sound ridiculous to say Our lips are sealed by us (although maybe not as ridiculous as the similar phrase My lips are sealed by me).

I want to argue that sealed is definitely an adjective, but like so much else in linguistics, it is hard to be definite about this. The verb analysis works just as well and sealed might be semantically closer to a verb in that we can think about the sealing of lips as resulting from an action taken. If we compare it to Our lips are chapped there isn’t as clear of an action present, except maybe the action of the weather. But I don’t like talking about verbs as action words.

For what it’s worth, 19 out of 25 people in my Twitter poll said that sealed is an adjective.

On the exam, I accepted both adjective/subject complement and verb/predicator. This made my students happy. Talking about sealed for 20 minutes in class did not make them so happy.

As a dictionary of English vocabulary and phrases, the American English Compendium by Marv Rubinstein is satisfactory. It is 500 pages long so it covers a lot of ground. As a book of American English or Americanisms, this book is not what it seems. A brief glance at any of the pages will make you question if the entries really are words or phrases that are exclusive to American English. And a comparison to another source will most likely show that they are not. As a commentary on language, however, this book is terrible.


Cover of American English Compendium by Marv Rubinstein. Published by Rowman & Littlefield. Cover design by Neil Cotterill.

The problems start on the first page of Chapter 1. The author defends the use of the term American English by proclaiming it is better than British English:

Dynamic. Versatile. Imaginative. Capable of capturing fine nuances. All these terms can truthfully be used to describe the American language. “Don’t you mean the ‘English language’?” some readers may ask. No, I mean the American language. Over many years, American English has vastly expanded and changed, a transmutation that has left it only loosely connected to its mother tongue, British English. (p. 3)

Although no one would (or should) argue that American English is a term that needs to be defended, the imaginary readers in this passage come off as more knowledgeable about language than the author. Are we really to believe American English is the only variation of English that is “dynamic” or “imaginative” or “capable of capturing fine nuances”? The problem gets compounded when the author recognizes the influence of American English in England, but seems to suggest that the reverse is not happening:

[W]hile there are numerous localisms [in countries where English is the primary language], more and more the terminology, idioms, slang, and colloquialisms smack of American English. Even in England this is slowly but surely happening. (p. 3)

And it only gets stranger from there. On the next page we are told:

Things have changed so much, and the use of American English in international communications has grown so much, one can now safely say that most English speakers use (to a greater or lesser degree) Americanized English – that is, the American language. And rightly so. The American language is so much richer and more adventurous. British English never stood a chance. (p. 4, emphasis mine)

Excuse me, Mr. Rubinstein, but H. G. Wells, J. K. Rowling, Grant Morrison, Agatha Christie and a thousand other British writers would like a word.

After this “proof” that ‘Murican English is better than British English, readers are given a “microcosm of what is happening” (p. 4) in the world. Rubinstein relates a story from an article by New York Times columnist and economist Thomas Friedman about how a senior Moroccan official is sending his kids to an American school even though he was educated in a French school. Rubinstein uses this story to claim that

There are now several American schools in Casablanca, each with a long waiting list. In addition, English (primarily American English) courses are springing up all over that country. If this is happening in Morocco, a country with long-lasting French connections and traditions, it is undoubtedly happening everywhere. The American language is becoming ubiquitous. (p. 5)

But it needs to be noted that Friedman does not claim that these English-language schools which are supposedly popping up all over Casablanca are teaching American English. Nor are readers given any proof that Casablanca is an example of what is happening around the world. I am very hesitant to believe it is. While it’s a cute story, this kind of claim needs to be backed up with evidence. How do we know that the English being taught in these schools is strictly British or American or some variation of English as an international language? We have to take Rubinstein’s word for it, but as we have seen with his dismissal of British English, he is not to be trusted when it comes to linguistics commentary.

Further down the page, in a section titled The Richness of the American Language, Rubinstein claims that “much of the richness of the American language lies in the fact that it has absorbed words and expressions from at least fifty other languages.” (p. 5) He lists some examples, but completely fails to acknowledge the fact that many of them, such as brogue and orangutan and typhoon, were originally borrowed into British English and then used by Americans.

Rubinstein then presumes readers will ask how the American language differs from other languages, which obviously also use foreign words and phrases. But the answer given is just as confused as the question. The author states that “there is no question that American English has been like a sponge absorbing and modifying words from many other languages” (p. 7) without realizing (or reporting) that this is true of English in general, not American English in particular. This is actually true of languages in general, although English does appear to be particularly greedy when it comes to borrowing words from other languages.

Later, there is a fairly reasonable, but short and undefinitive, discussion of “Black English” (African American Vernacular English). The section unfortunately ends with this quote: “Educated African Americans, of course, use standard American English” (pp. 11–12). Well, good for them.

Things get really bonkers in the section on compounding, which includes this howler:

Compound words exist in almost all languages, but never anywhere near the extent that they do in American English. […] during the last few decades, compounding has reached epidemic proportions. The vast majority of compound words are of relatively recent origin languagewise (p. 15)

This is nonsense. Does the author know how any other languages work? Finnish compounds words much more than English does. In fact, the syntax of Finnish demands it, unlike in English where compounding is very often a matter of style. And how do we know that the “vast majority” of compound words are not old? Let’s say “the last few decades” goes back to 1960. Do you really think words such as outcast, outdoors, outlook, output, overcome, overdoes, overdue, oversee, oddball, goofball, downfall, and downhill (all words supplied by the author) were made compound words after 1960?

Here are some other WTFs in this book along with the thoughts I had after reading them:

In general [the English speakers of Australia, Canada, Guyana, India, Ireland, New Zealand, and South Africa] all understand each other, but, as you have seen in the previous chapter on American and British English, there are substantial differences. The same can be said of the English used in the other countries listed above. With a few exceptions, Canadian English consists of a blending of American and British English, but the other English-speaking countries have all developed their own unique and distinctive expressions (including slang and colloquialisms). (p. 267)

Hahahahaha! Fuck you, Canada! Get your own expressions, eh!


English is an Anglo-Saxon language with roots in Latin, the Romance Languages, and German. [No.] This means that most, if not all, English words are variations of foreign words, and such words have legitimately entered the language. (p. 281)



The Oxford English Dictionary prides itself on keeping up to date, and it does pretty well (but not perfect) with including new words in its latest editions. Unfortunately, libraries with limited budgets these days do not always have the most recent revisions. Your best bet for researching neologisms is probably the Internet – for example, Google. (p. 403)

Because the OED is the only dictionary in the world. I’ve said it before and I’ll say it again: In linguistics research there is only the OED and Google. It’s a wonder we get anything done.


Chairman has become chairperson and has been further reduced to chair. But many gender-based terms remain unresolved. While, for example, policeman easily becomes police officer, other words and phrases resist change. One almost invariably hears expressions such as “Everyone to their own taste.” [What? Who invariably hears this?] Grammatically incorrect [Nope!] but why risk offending potential female customers of advertised products? [Bitches be trippin’, amiright?] However, when a woman mans the controls of an aircraft, should the term be changed even though it denotes action, not identity? What should we now call a “manhole cover”? [Serious questions, you guys.] Note that we no longer have actresses; they all insist on being called actors. [How dare they?!] (p. 13)

Based on the claims about language alone, I would not recommend this book. I don’t know how someone writes a book about language and gets so much wrong. The word and phrase entries may be useful, but any online dictionary will have most if not all of them. Go there instead or get a proper reference book from a respected dictionary.

As the authors state in their foreword (pp. xii-xiii):

This book represents an attempt to defang the slang and crack the code. In writing this, we tried to think back to when we were new to Washington and wishing, like wandering tourists lost in a foreign city, that we had a handy all-in-one-place phrasebook.

I would say they have largely accomplished this. Dog Whistles, Walk-backs & Washington Handshakes is an up-to-date glossary of American political terms. I think that people interested in language and politics would find this book enjoyable for a few reasons. First, the book is well referenced (always a plus). The authors are not trying to discover the first known use of some political code word, but rather to show that politicians from all sides use this type of language and that you are likely to come across it in tomorrow’s newspaper or news broadcast. So their references mostly come from very recent sources, which is refreshing. The foreword and introduction make nuanced points about language and slang, and the authors back up these points with references to reputable sources.

Dog Whistles has appeal for people who follow American politics, since although they are likely to already know some of the terms in here, they will probably find some they don’t know or haven’t thought about. That’s because the book isn’t just made up of eye-catching terms such as Overton window and San Francisco values. Readers will appreciate the care that the authors have taken to explain each term. For example, here is the entry for the seemingly innocent term bold (p. 40):

Bold: A politician’s most common description of their own or their party’s proposals. It manages to be a punchy, optimistic-sounding break with conventional thinking and deliberately vague all at once.

Image copyright ForeEdge and University Press of New England


But the book is not just for language and politics heads. In the introduction (p. ix), the authors recognize the problem that people who do not closely follow politics might have when reading about or listening to their representatives:

For most of the population – let’s call them “regular, normal people” – time spent listening to legislation, operatives, and journalists thrash over public policy on cable or a website can often result in something close to a fugue state, induced by the repeated use of words and phrases that have little if any connection to life as it is lived on planet Earth.

Later (p. 129), the authors explain the importance of their glossary by saying that:

Knowing the meanings of such specialized political terms can help cut through spin meant to obscure what’s really going on in a campaign. When politicians use the cliché, “The only poll that counts is the one on Election Day,” they really mean, “I wouldn’t win if the election were held today.”

I am all for educating people about the intricacies of language, especially when that means explaining the ways that politicians use words and phrases to trick people.

I am, however, not sure that all of the terms deserve to be placed in this book. I feel like a glossary should include words that are at least nominally used by a group of people. But in their attempt to be current, the authors have included phrases such as hardship porn. This is a phrase coined by Frank Bruni of the New York Times and it only returns two hits on Google News – the July 2015 article in which Bruni coined it and an October 2015 book review in the Missoula Independent. However influential Frank Bruni is, this term has not caught on yet.

This is really nitpicking though (something us academics excel at, thankyouverymuch). I really found this book enjoyable. If you like politics, language, or both, you will probably enjoy it too. You can check out the interactive website here: http://dogwhistlebook.com/ and even suggest your own term.




McCutcheon, Chuck and David Mark. 2014. Dog Whistles, Walk-backs & Washington Handshakes: Decoding the Jargon, Slang, and Bluster of American Political Speech. ForeEdge: New Hampshire.

In two recent papers, one by Kloumann et al. (2012) and the other by Dodds et al. (2015), a group of researchers created a corpus to study the positivity of the English language. I looked at some of the problems with those papers here and here. For this post, however, I want to focus on one of the registers in the authors’ corpus – song lyrics. There is a problem with taking language such as lyrics out of context and then judging them based on the positivity of the words in the songs. But first I need to briefly explain what the authors did.

In the two papers, the authors created a corpus based on books, New York Times articles, tweets and song lyrics. They then created a list of the 10,000 most common word types in their corpus and had voluntary respondents rate how positive or negative they felt the words were. They used this information to claim that human language overall (and English) is emotionally positive.
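To make that procedure concrete, here is a minimal Python sketch of the general idea as I understand it, not the authors' actual code. The tiny corpus, the ratings, and the function names (most_common_types, average_ratings) are all hypothetical stand-ins that I made up for illustration.

```python
from collections import Counter
from statistics import mean

# Toy stand-in for the corpus (hypothetical texts, not the authors' data).
corpus = [
    "our lips are sealed",
    "happy happy happy love",
    "no love no hate",
]

# Hypothetical respondent ratings on a 1-9 scale, several per word.
ratings = {
    "happy": [8, 9, 8],
    "love": [9, 8, 8],
    "no": [3, 4, 2],
    "hate": [2, 1, 3],
}

def most_common_types(texts, n):
    """Return the n most frequent word types across all texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(n)]

def average_ratings(words, ratings):
    """Average the respondent ratings for each word that has any."""
    return {w: mean(ratings[w]) for w in words if w in ratings}

top = most_common_types(corpus, 3)     # the "most common word types"
scores = average_ratings(top, ratings) # one averaged score per word
```

Notice that nothing in this kind of pipeline knows where a word came from or what surrounded it. Each word gets one number, and that number follows it everywhere.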

That’s the idea anyway, but song lyrics exist as part of a multimodal genre. There are lyrics and there is music. These two modalities operate simultaneously to convey a message or feeling. This is important for a couple of reasons. First, the other registers in the corpus do not work like song lyrics. Books and news articles are black text on a white background with few or no pictures. And tweets are not always multimodal – it’s possible to include a short video or picture in a tweet, but it’s not necessary (Side note: I would like to know how many tweets in the corpus included pictures and/or videos, but the authors do not report that information).

So if we were to do a linguistic analysis of an artist or a genre of music, we would create a corpus of the lyrics of that artist or genre. We could then study the topics that are brought up in the lyrics, or even common words and expressions (lexical bundles or n-grams) that are used by the artist(s). We could perhaps even look at how the writing style of the artist(s) changed over time.

But if we wanted to perform an analysis of the positivity of the songs in our corpus, we would need to incorporate the music. The lyrics and music go hand in hand – without the music, you only have poetry. To see what I mean, take a look at the following word list. Do the words in this list look particularly positive or negative to you?

[Word list; surviving entries: smell, sorry]

If we combine these words as Rivers Cuomo did in his song “Butterfly”, they average out to a positive score of 5.23. Here are the lyrics to that song.

Yesterday I went outside
With my momma’s mason jar
Caught a lovely Butterfly
When I woke up today
And looked in on my fairy pet
She had withered all away
No more sighing in her breast

I’m sorry for what I did
I did what my body told me to
I didn’t mean to do you harm
But everytime I pin down what I think I want
it slips away – the ghost slips away

I smell you on my hand for days
I can’t wash away your scent
If I’m a dog then you’re a bitch
I guess you’re as real as me
Maybe I can live with that
Maybe I need fantasy
A life of chasing Butterfly

I’m sorry for what I did
I did what my body told me to
I didn’t mean to do you harm
But everytime I pin down what I think I want
it slips away – the ghost slips away

I told you I would return
When the robin makes his nest
But I ain’t never comin’ back
I’m sorry, I’m sorry, I’m sorry

Does this look like a positive text to you? Does it look moderate, neither positive nor negative? I would say not. It seems negative to me, a sad song based on the opera Madame Butterfly, in which a man leaves his wife because he never really cared for her. When we take the music into consideration, the non-positivity of this song is clear.

Let’s take a look at another list. How does this one look?

[Word list]

Based on the ratings in the two papers, this list is slightly more positive, with an average happiness rating of 5.46. When the words were used by Trent Reznor, however, they expressed “a deeply personal meditation on self-hatred” (Huxley 1997: 179). Here are the lyrics for “Closer” by Nine Inch Nails:

You let me violate you
You let me desecrate you
You let me penetrate you
You let me complicate you

Help me
I broke apart my insides
Help me
I’ve got no soul to sell
Help me
The only thing that works for me
Help me get away from myself

I want to fuck you like an animal
I want to feel you from the inside
I want to fuck you like an animal
My whole existence is flawed
You get me closer to god

You can have my isolation
You can have the hate that it brings
You can have my absence of faith
You can have my everything

Help me
Tear down my reason
Help me
It’s your sex I can smell
Help me
You make me perfect
Help me become somebody else

I want to fuck you like an animal
I want to feel you from the inside
I want to fuck you like an animal
My whole existence is flawed
You get me closer to god

Through every forest above the trees
Within my stomach scraped off my knees
I drink the honey inside your hive
You are the reason I stay alive

As Reznor (the songwriter and lyricist) sees it, “Closer” is “supernegative and superhateful” and the song’s message is “I am a piece of shit and I am declaring that” (Huxley 1997: 179). You can see what he means when you listen to the song (minor NSFW warning for the imagery in the video). [1]

Nine Inch Nails: Closer (Uncensored) (1994) from Nine Inch Nails on Vimeo.

Then again, meaning is relative. Tommy Lee has said that “Closer” is “the all-time fuck song. Those are pure fuck beats – Trent Reznor knew what he was doing. You can fuck to it, you can dance to it and you can break shit to it.” And Tommy Lee should know. He played in the studio for NIИ and he is arguably more famous for fucking than he is for playing drums.

Nevertheless, the problem with the positivity rating of songs keeps popping up. The song “Mad World” was a pop hit for Tears for Fears, then reinterpreted in a more somber tone by Gary Jules and Michael Andrews. But it is rated a positive 5.39. Gotye’s global hit about failed relationships, “Somebody That I Used To Know”, is rated a positive 5.33. The anti-war and protest ballad “Eve of Destruction”, made famous by Barry McGuire, rates just barely on the negative side at 4.93. I guess there should have been more depressing references besides bodies floating, funeral processions, and race riots if the songwriter really wanted to drive home the point.

For the song “Milkshake”, Kelis has said that it “means whatever people want it to” and that the milkshake referred to in the song is “the thing that makes women special […] what gives us our confidence and what makes us exciting”. It is rated less positive than “Mad World” at 5.24. That makes me want to doubt the authors’ commitment to Sparkle Motion.

Another upbeat jam that the kids listen to is the Ramones’ “Blitzkrieg Bop”. This is the energetic and exciting anthem of punk rock. It’s rated a negative 4.82. I wonder if we should even look at “Pinhead”.

Then there’s the old American folk classic “Where Did You Sleep Last Night”, which Nirvana performed a haunting version of on their album MTV Unplugged in New York. The song (also known as “In the Pines” and “Black Girl”) was first made famous by Lead Belly and it includes such catchy lines as

My girl, my girl, don’t lie to me
Tell me where did you sleep last night
In the pines, in the pines
Where the sun don’t ever shine
I would shiver the whole night through


Her husband was a hard working man
Just about a mile from here
His head was found in a driving wheel
But his body never was found

This song is rated a positive 5.24. I don’t know about you but neither the Lead Belly version nor the Nirvana cover would give me that impression.

Even Pharrell Williams’ hit song “Happy” rates only 5.70. That’s a song so goddamn positive that it’s called “Happy”. But it’s only 0.03 points more positive than Eric Clapton’s “Tears in Heaven”, which is a song about the death of Clapton’s four-year-old son. Harry Chapin’s “Cat’s in the Cradle” was voted the fourth saddest song of all time by readers of Rolling Stone but it’s rated 5.55, while Willie Nelson’s “Always on My Mind” rates 5.63. So they are both sadder than “Happy”, but not by much. How many lyrics must a man research, before his corpus is questioned?

Corpus linguistics is not just gathering a bunch of words and calling it a day. The fact that the same “word” can have several meanings (known as polysemy) is a major feature of language. So before you ask people to rate a word’s positivity, you will want to make sure they at least know which meaning is being referred to. On top of that, words do not work in isolation. Spacing is an arbitrary construct in written language (remember that song lyrics are mostly heard, not read). The back used in the Ramones’ lines “Piling in the back seat” and “Pulsating to the back beat” is not about a body part. The Weezer song “Butterfly” uses the word mason, but it’s part of the compound noun mason jar, not a reference to a bricklayer. Words are also conditioned by the words around them. A word like eve may normally be considered positive as it brings to mind Christmas Eve and New Year’s Eve, but when used in a phrase like “the eve of destruction” our judgment of it is likely to change. In the corpus under discussion here, eat is rated 7.04, but that doesn’t consider what’s being eaten and so cannot account for lines like “Eat your next door neighbor” (from “Eve of Destruction”).
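A small Python sketch shows how these two problems play out in a word-by-word lookup. Everything here is an assumption made for illustration: only the 7.04 for eat comes from the corpus discussed above, while the other ratings, the COMPOUNDS list, and the function names are hypothetical.

```python
# Hypothetical 1-9 ratings; only "eat" (7.04) is a value quoted in the text.
WORD_SCORES = {"eat": 7.04, "eve": 6.5, "mason": 5.0}

# Hypothetical list of two-word units that should be looked up as a whole.
COMPOUNDS = {("mason", "jar"), ("back", "seat"), ("back", "beat")}

def score_in_isolation(phrase):
    """Look up each word separately, ignoring its neighbours."""
    return {w: WORD_SCORES[w] for w in phrase.lower().split() if w in WORD_SCORES}

def merge_compounds(phrase):
    """Join known two-word compounds into single tokens before any lookup."""
    words = phrase.lower().split()
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in COMPOUNDS:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged

# "eve" gets the same score no matter what surrounds it.
festive = score_in_isolation("Christmas Eve")
grim = score_in_isolation("the eve of destruction")

# After merging, "mason" no longer appears as a separate token to be rated.
tokens = merge_compounds("with my momma's mason jar")
```

Merging compounds like this is only one possible repair, and it does nothing for the eve problem, which needs ratings of phrases or senses rather than of isolated strings.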

We could go on and on like this. The point is that the authors of both of the papers didn’t do enough work with their data before drawing conclusions. And they didn’t consider that some of the language in their corpus is part of a multimodal genre where there are other things affecting the meaning of the language used (though technically no language use is devoid of context). Whether or not the lyrics of a song are “positive” or “negative”, the style of singing and the music that they are sung to will strongly affect a person’s interpretation of the lyrics’ meaning and emotion. That’s just the way that music works.

This doesn’t mean that any of these songs are positive or negative based on their rating; it means that the system used by the authors of the two papers to rate the positivity or negativity of language seems to be flawed. I would have guessed that a rating system which took words out of context would be fundamentally flawed, but viewing the ratings of the songs in this post is a good way to visualize that. The fact that the two papers were published in reputable journals and picked up by reputable publications, such as the Atlantic and the New York Times, only adds insult to injury for the field of linguistics.

You can see a table of the songs I looked at for this post below and a spreadsheet with the ratings of the lyrics is here. I calculated the positivity ratings by averaging the scores for the word tokens in each song, rather than the types.
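For anyone unsure about the difference between averaging over tokens and averaging over types, here is a rough Python sketch. The word ratings and the example line are made up for illustration, the whitespace tokenizer is deliberately crude, and skipping unrated words is my assumption here rather than a documented step.

```python
from statistics import mean

# Hypothetical word ratings on the 1-9 scale (not the papers' values).
word_scores = {"help": 7.0, "me": 6.0, "hate": 2.0}

def tokenize(lyrics):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return lyrics.lower().split()

def token_average(lyrics, scores):
    """Average over every occurrence (token) of a rated word."""
    rated = [scores[w] for w in tokenize(lyrics) if w in scores]
    return mean(rated)

def type_average(lyrics, scores):
    """Average over each distinct rated word (type) exactly once."""
    rated = [scores[w] for w in set(tokenize(lyrics)) if w in scores]
    return mean(rated)

lyrics = "help me help me help me hate"
by_token = token_average(lyrics, word_scores)  # repeated words count each time
by_type = type_average(lyrics, word_scores)    # each word counts once
```

With these made-up numbers the token average is 41/7 (about 5.86) and the type average is 5.0, so a heavily repeated chorus pulls the token-based score toward the words it repeats.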

(By the way, Tupac is rated 4.76. It’s a good thing his attitude was fuck it ‘cause motherfuckers love it.)

Song | Positivity score (1–9)
“Happy” by Pharrell Williams | 5.70
“Tears in Heaven” by Eric Clapton | 5.67
“Always on My Mind” by Willie Nelson | 5.63
“Cat’s in the Cradle” by Harry Chapin | 5.55
“Closer” by NIN | 5.46
“Mad World” by Gary Jules and Michael Andrews | 5.39
“Somebody that I Used to Know” by Gotye feat. Kimbra | 5.33
“Waitin’ for a Superman” by The Flaming Lips | 5.28
“Milkshake” by Kelis | 5.24
“Where Did You Sleep Last Night” by Nirvana | 5.24
“Butterfly” by Weezer | 5.23
“Eve of Destruction” by Barry McGuire | 4.93
“Blitzkrieg Bop” by The Ramones | 4.82



[1] Also, be aware that listening to these songs while watching their music videos has an effect on the way you interpret them.


Kloumann, Isabel M., Christopher M. Danforth, Kameron Decker Harris, Catherine A. Bliss, and Peter Sheridan Dodds. 2012. “Positivity of the English Language”. PLoS ONE. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0029484

Dodds, Peter Sheridan, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth. 2015. “Human language reveals a universal positivity bias”. PNAS 112:8. http://www.pnas.org/content/112/8/2389

Huxley, Martin. 1997. Nine Inch Nails. New York: St. Martin’s Griffin.

Last week I wrote a post called “If you’re not a linguist, don’t do linguistics”. This got shared around Twitter quite a bit and made it to the front page of r/linguistics, so a lot of people saw it. Pretty much everyone had good insight on the topic and it generated some great discussion. I thought it would be good to write a follow-up to flesh out my main concerns in a more serious manner (this time sans emoticons!) and to address the concerns some people had with my reasoning.

The paper in question is by Dodds et al. (2015) and it is called “Human language reveals a universal positivity bias”. The certainty of that title is important since I’m going to try to show in this post that the authors make too many assumptions to reliably make any claims about all human language. I’m going to focus on the English data because that is what I am familiar with. But if anyone who is familiar with the data in other languages would like to weigh in, please do so in the comments.

The first assumption made by the authors is that it is possible to make universal claims about language using only written data. This is not a minor issue, and it is an assumption that no scholar should ever make. The differences between spoken and written language are many and major (Linell 2005). But dealing with spoken data is difficult – it takes much more time and effort to collect and analyze than written data. We can argue, however, that even in highly literate societies, the majority of language use is spoken – and spoken language does not work like written language. Any research which makes claims about all human language will therefore have to include some form of spoken data. But the data set that the authors draw from (called their corpus) is made from tweets, song lyrics, New York Times articles and the Google Books project. Tweets and song lyrics, let alone news articles or books, do not mimic spoken language in an accurate way. For example, these registers may include the same words as human speech, but certainly not in the same proportion. Written language does not include false starts, nor does it include repetition or elision in anywhere near the same way that spoken language does. Anyone who has done any transcription work will tell you this.

The next assumption made by the authors is that their data is representative of all human language. Representativeness is a major issue in corpus linguistics. When linguists want to investigate a register or variety of language, they build a corpus which is representative of that register or variety by taking a large enough and balanced sample of texts from that register. What is important here, however, is that most linguists do not have a problem with a set of data representing a larger register – so long as that larger register isn’t all human language. For example, if we wanted to research modern English journalism (quite a large register), we would build a corpus of journalism texts from English-speaking countries and we would be careful to include various kinds of journalism – op-eds, sports reporting, financial news, etc. We would not build a corpus of articles from the Podunk Free Press and make claims about all English journalism. But representativeness is a tricky issue. The larger the language variety you are trying to investigate, the more data from that variety you will need in your corpus. Baker (2010: 7) notes that a corpus analysis of one novel is “unlikely to be representative of all language use, or all novels, or even the general writing style of that author”. The English sub-corpora in Dodds et al. exist somewhere in between a fully non-representative corpus of English (one novel) and a fully representative corpus of English (all human speech and writing in English). In fact, in another paper (Dodds et al. 2011), the representativeness of the Twitter corpus is explained as follows: “First, in terms of basic sampling, tweets allocated to data feeds by Twitter were effectively chosen at random from all tweets. Our observation of this apparent absence of bias in no way dismisses the far stronger issue that the full collection of tweets is a non-uniform subsampling of all utterances made by a non-representative subpopulation of all people.
While the demographic profile of individual Twitter users does not match that of, say, the United States, where the majority of users currently reside, our interest is in finding suggestions of universal patterns.” What I think that doozy of a sentence in the middle is saying is that the tweets come from an unrepresentative sample of the population but that the language in them may be suggestive of universal English usage. Does that mean we can assume that the English sub-corpora (specifically the Twitter data) in Dodds et al. are representative of all human communication in English?

Another assumption the authors make is that they have sampled their data correctly. The decisions on what texts will be sampled, as Tognini-Bonelli (2001: 59) points out, “will have a direct effect on the insights yielded by the corpus”. Following Biber (see Tognini-Bonelli 2001: 59), linguists can classify texts into various channels in order to ensure that their sample texts will be representative of a certain population of people and/or variety of language. They can start with general “channels” of the language (written texts, spoken data, scripted data, electronic communication) and move on to whether the language is private or published. Linguists can then sample language based on what type of person created it (their age, sex, gender, socio-economic situation, etc.). For example, if we made a corpus of the English articles on Wikipedia, we would have a massive amount of linguistic data. Literally billions of words. But 87% of it will have been written by men and 59% of it will have been written by people under the age of 40. Would you feel comfortable making claims about all human language based on that data? How about just all English language encyclopedias?
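
To make the idea of sampling by channel a little more concrete, here is a minimal sketch in Python. It is not taken from Biber, Tognini-Bonelli or Dodds et al.; the function name `stratified_sample`, the `channel` field and the toy texts are all hypothetical, and a real corpus design would involve many more decisions than this.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(texts, key, per_stratum, seed=0):
    """Draw up to `per_stratum` texts from each stratum (e.g. each channel).

    `texts` is any list of records, and `key` is a function that returns the
    stratum a record belongs to. Both are assumptions made for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for text in texts:
        strata[key(text)].append(text)

    sample = []
    for name in sorted(strata):
        group = strata[name]
        k = min(per_stratum, len(group))  # never ask for more than exists
        sample.extend(rng.sample(group, k))
    return sample


# Toy collection, invented for illustration: 90 written texts, 10 spoken ones.
texts = [{"id": i, "channel": "written"} for i in range(90)]
texts += [{"id": 90 + i, "channel": "spoken"} for i in range(10)]

balanced = stratified_sample(texts, key=lambda t: t["channel"], per_stratum=5)
channel_counts = Counter(t["channel"] for t in balanced)
```

Sampling the toy collection purely at random would mostly return written texts, simply because there are more of them. Sampling within each channel keeps the spoken texts from being swamped, which is the point of classifying texts before sampling them.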

The next assumption made by the authors is that the relative positive or negative nature of the words in a text is indicative of how positive that text is. But words can have various and sometimes even opposing meanings. Texts are also likely to contain words that are written the same but have different meanings. For example, the word fine in the Dodds et al. corpus, like the rest of the words in the corpus, is just a four-letter word – free of context and naked as a jaybird. Is it an adjective that means “good, acceptable, or satisfactory”, which Merriam-Webster says is sometimes “used in an ironic way to refer to things that are not good or acceptable”? Or does it refer to that little piece of paper that the Philadelphia Parking Authority is so (in)famous for? We don’t know. All we know is that it has been rated 6.74 on the positivity scale by the respondents in Dodds et al. Can we assume that all the uses of fine in the New York Times are that positive? Can we assume that the use of fine on Twitter is always or even mostly non-ironic? On top of that, some of the most common words in English also tend to have the most meanings. There are 15 entries for get in the Macmillan Dictionary, including “kill/attack/punish” and “annoy”. Get in Dodds et al. is ranked on the positive side of things at 5.92. Can we assume that this rating carries across all the uses of get in the corpus? The authors found approximately 230 million unique “words” in their Twitter corpus (they counted all forms of a word separately, so banana, bananas, b-a-n-a-n-a-s! would be separate “words”; and they counted URLs as words). So they used the 50,000 most frequent ones to estimate the information content of texts. Can we assume that it is possible to make an accurate claim about how positive or negative a text is based on nothing but the words taken out of context?
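
The problem with context-free ratings can be shown with a small sketch in Python. This is a simplified stand-in that I made up for illustration, not the authors’ actual pipeline: the function `average_rating` and the two example sentences are hypothetical, and only the two ratings quoted above (fine at 6.74 and get at 5.92) come from Dodds et al.

```python
# Ratings quoted above from Dodds et al.; every other word is left unrated.
RATINGS = {"fine": 6.74, "get": 5.92}


def average_rating(text, ratings=RATINGS):
    """Average the ratings of the rated words in `text`, ignoring context.

    Words without a rating are skipped. Returns None if no word is rated.
    """
    words = text.lower().split()
    scored = [ratings[word] for word in words if word in ratings]
    if not scored:
        return None
    return sum(scored) / len(scored)


# Two invented sentences that use "fine" in very different senses.
pleasant = "the weather is fine today"
unpleasant = "i got a parking fine today"

pleasant_score = average_rating(pleasant)
unpleasant_score = average_rating(unpleasant)
```

Both sentences come out with exactly the same score, because the only thing the scorer sees is the string fine. Note also that got is skipped entirely in this sketch, since only the form get has a rating.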

Another assumption that the authors make is that the respondents in their survey can speak for the entire population. The authors used Amazon’s Mechanical Turk to crowdsource evaluations for the words in their sub-corpus. 60% of the American people on Mechanical Turk are women and 83.5% of them are white. The authors used respondents located in the United States and India. Can we assume that these respondents have opinions about the words in the corpus that are representative of the entire population of English speakers? Here are the ratings for the various ways of writing laughter in the authors’ corpus:

Laughter tokens Rating
ha 6
hah 5.92
haha 7.64
hahah 7.3
hahaha 7.94
hahahah 7.24
hahahaha 7.86
hahahahaha 7.7
hee 5.4
heh 5.98
hehe 6.48
hehehe 7.06

And here is a picture of a character expressing laughter:

Pictured: Good times. Credit: Batman #36, DC Comics, Scott Snyder (wr), Greg Capullo (p), Danny Miki (i), Fco Plascenia (c), Steve Wands (l).

Can we assume that the textual representation of laughter is always as positive as the respondents rated it? Can we assume that everyone or most people on Twitter use the various textual representations of laughter in a positive way – that they are laughing with someone and not at someone?

Finally, let’s compare some data. The good people at the Corpus of Contemporary American English (COCA) have created a word list based on their 450-million-word corpus. The COCA corpus is specifically designed to be large and balanced (although the problem of dealing with spoken language might still remain). In addition, each word in their corpus is annotated for its part of speech, so they can recognize when a word like state is either a verb or a noun. This last point is something that Dodds et al. did not do – all forms of words that are spelled the same are collapsed into being one word. The compilers of the COCA list note that “there are more than 140 words that occur both as a noun and as a verb at least 10,000 times in COCA”. This is the type/token issue that came up in my previous post. A corpus that tags each word for its part of speech can tell the difference between different types of the “same” word (state as a verb vs. state as a noun), while an untagged corpus treats all occurrences of state as tokens of a single type. If we compare the 10,000 most common words in Dodds et al. to a sample of the 10,000 most common words in COCA, we see that there are 121 words on the COCA list but not the Dodds et al. list (Here is the spreadsheet from the Dodds et al. paper with the COCA data – pnas.1411678112.sd01 – Dodds et al corpus with COCA). And that’s just a sample of the COCA list. How many more differences would there be if we compared the Dodds et al. list to the whole COCA list?
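
Both points in the comparison above, tagged versus untagged counting and checking one word list against another, can be sketched in a few lines of Python. Everything in this sketch is invented for illustration: the tagged tokens, the two short word lists and the helper `missing_from` are hypothetical and have nothing to do with the real COCA or Dodds et al. data.

```python
from collections import Counter

# Hypothetical (word, part of speech) tokens, invented for illustration.
tagged_tokens = [
    ("state", "NOUN"),
    ("state", "VERB"),
    ("state", "NOUN"),
    ("fine", "ADJ"),
    ("fine", "NOUN"),
]

# A tagged corpus keeps state-as-noun and state-as-verb apart.
tagged_counts = Counter(tagged_tokens)

# An untagged corpus collapses everything that is spelled the same.
untagged_counts = Counter(word for word, _tag in tagged_tokens)


def missing_from(reference, other):
    """Return the words that are in `reference` but not in `other`."""
    return sorted(set(reference) - set(other))


# Two tiny made-up word lists standing in for the real frequency lists.
list_a = ["the", "state", "whom", "fine"]
list_b = ["the", "state", "fine", "haha"]

only_in_a = missing_from(list_a, list_b)
```

With these toy tokens the tagged counter distinguishes four entries while the untagged one sees only two. The list comparison is a plain set difference, which is roughly the kind of check behind a figure like the 121 words mentioned above, though I am not claiming this is how that number was produced.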

To sum up, the authors use their corpus of tweets, New York Times articles, song lyrics and books and ask us to assume (1) that they can make universal claims about language despite using only written data; (2) that their data is representative of all human language despite including only four registers; (3) that they have sampled their data correctly despite not knowing what types of people created the linguistic data and only including certain channels of published language; (4) that the relative positive or negative nature of the words in a text is indicative of how positive that text is despite the obvious fact that words can be spelled the same and still have wildly different meanings; (5) that the respondents in their survey can speak for the entire population despite the English-speaking respondents being from only two subsets of two English-speaking populations (USA and India); and (6) that their list of the 10,000 most common words in their corpus (which they used to rate all human language) is representative despite being uncomfortably dissimilar to a well-balanced list that can differentiate between different types of words.

I don’t mean to sound like a Negative Nancy and I don’t want to trivialize the work of the authors in this paper. The corpus that they have built is nothing short of amazing. The amount of feedback they got from human respondents on language is also impressive (to say the least). I am merely trying to point out what we can and cannot say based on the data. It would be nice to make universal claims about all human language, but the fact is that even with millions and billions of data points, we still are not able to do so unless the data is representative and sampled correctly. That means it has to include spoken data (preferably a lot of it) and it has to be sampled from all socio-economic human backgrounds.

Hat tip to the commenters on the last post and the redditors over at r/linguistics.


Dodds, Peter Sheridan, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth. 2015. “Human language reveals a universal positivity bias”. PNAS 112:8. http://www.pnas.org/content/112/8/2389

Dodds, Peter Sheridan, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth. 2011. “Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter”. PLOS One. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752#abstract0

Baker, Paul. 2010. Sociolinguistics and Corpus Linguistics. Edinburgh: Edinburgh University Press. http://www.ling.lancs.ac.uk/staff/paulb/socioling.htm

Linell, Per. 2005. The Written Language Bias in Linguistics. Oxon: Routledge.

Mair, Christian. 2015. “Responses to Davies and Fuchs”. English World-Wide 36:1, 29–33. doi: 10.1075/eww.36.1.02mai

Tognini-Bonelli, Elena. 2001. Studies in Corpus Linguistics, Volume 6: Corpus Linguistics at Work. John Benjamins. https://benjamins.com/#catalog/books/scl.6/main

