

In two recent papers, one by Kloumann et al. (2012) and the other by Dodds et al. (2015), a group of researchers created a corpus to study the positivity of the English language. I looked at some of the problems with those papers here and here. For this post, however, I want to focus on one of the registers in the authors’ corpus – song lyrics. There is a problem with taking language such as lyrics out of context and then judging it based on the positivity of the words in the songs. But first I need to briefly explain what the authors did.

In the two papers, the authors created a corpus based on books, New York Times articles, tweets and song lyrics. They then created a list of the 10,000 most common word types in their corpus and had voluntary respondents rate how positive or negative they felt the words were. They used this information to claim that human language overall (and English) is emotionally positive.
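The pipeline described above – count word frequencies across the corpus, keep the most common word types, and average crowd-sourced ratings over them – can be sketched in a few lines of Python. The mini-corpus and the ratings below are invented for illustration; the papers' actual scale runs from 1 (most negative) to 9 (most positive).

```python
from collections import Counter

def top_types(texts, n):
    """Count word types across all texts and keep the n most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(n)]

# Hypothetical mini-corpus and ratings (1 = most negative, 9 = most positive).
texts = ["the happy dog", "the sad dog", "the happy cat"]
ratings = {"the": 4.98, "happy": 8.30, "dog": 7.24, "sad": 2.10, "cat": 6.88}

vocab = top_types(texts, 3)  # ['the', 'happy', 'dog']
avg = sum(ratings[w] for w in vocab) / len(vocab)
```

Everything the papers conclude rests on averages like `avg` – which is exactly why what goes into the word list, and what context gets thrown away, matters so much.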

That’s the idea anyway, but song lyrics exist as part of a multimodal genre. There are lyrics and there is music. These two modalities operate simultaneously to convey a message or feeling. This is important for a couple of reasons. First, the other registers in the corpus do not work like song lyrics. Books and news articles are black text on a white background with few or no pictures. And tweets are not always multimodal – it’s possible to include a short video or picture in a tweet, but it’s not necessary (Side note: I would like to know how many tweets in the corpus included pictures and/or videos, but the authors do not report that information).

So if we were to do a linguistic analysis of an artist or a genre of music, we would create a corpus of the lyrics of that artist or genre. We could then study the topics that are brought up in the lyrics, or even common words and expressions (lexical bundles or n-grams) that are used by the artist(s). We could perhaps even look at how the writing style of the artist(s) changed over time.

But if we wanted to perform an analysis of the positivity of the songs in our corpus, we would need to incorporate the music. The lyrics and music go hand in hand – without the music, you only have poetry. To see what I mean, take a look at the following word list. Do the words in this list look particularly positive or negative to you?

a, ain’t, all, and, as, away, back, bitch, body, breast, but, butterfly, can, can’t, caught, chasing, comin’, days, did, didn’t, do, dog, down, everytime, fairy, fantasy, for, ghost, guess, had, hand, harm, her, his, i, i’m, if, in, it, jar, life, live, looked, lovely, makes, mason, maybe, me, mean, momma’s, more, my, need, nest, never, no, of, on, outside, pet, pin, real, return, robin, scent, she, sighing, slips, smell, sorry, that, the, then, think, to, today, told, up, want, wash, went, what, when, with, withered, woke, would, yesterday, you, you’re, your

If we combine these words as Rivers Cuomo did in his song “Butterfly”, they average out to a positive score of 5.23. Here are the lyrics to that song.

Yesterday I went outside
With my momma’s mason jar
Caught a lovely Butterfly
When I woke up today
And looked in on my fairy pet
She had withered all away
No more sighing in her breast

I’m sorry for what I did
I did what my body told me to
I didn’t mean to do you harm
But everytime I pin down what I think I want
it slips away – the ghost slips away

I smell you on my hand for days
I can’t wash away your scent
If I’m a dog then you’re a bitch
I guess you’re as real as me
Maybe I can live with that
Maybe I need fantasy
A life of chasing Butterfly

I’m sorry for what I did
I did what my body told me to
I didn’t mean to do you harm
But everytime I pin down what I think I want
it slips away – the ghost slips away

I told you I would return
When the robin makes his nest
But I ain’t never comin’ back
I’m sorry, I’m sorry, I’m sorry

Does this look like a positive text to you? Does it look moderate, neither positive nor negative? I would say not. It seems negative to me – a sad song based on the opera Madame Butterfly, in which a man leaves his wife because he never really cared for her. When we take the music into consideration, the non-positivity of this song is clear.


Let’s take a look at another list. How does this one look?

above, absence, alive, an, animal, apart, are, away, become, brings, broke, can, closer, complicate, desecrate, down, drink, else, every, everything, existence, faith, feel, flawed, for, forest, from, fuck, get, god, got, hate, have, help, hive, honey, i, i’ve, inside, insides, is, isolation, it, it’s, knees, let, like, make, me, my, myself, no, of, off, only, penetrate, perfect, reason, scraped, sell, sex, smell, somebody, soul, stay, stomach, tear, that, the, thing, through, to, trees, violate, want, whole, within, works, you, your

Based on the ratings in the two papers, this list is slightly more positive, with an average happiness rating of 5.46. When the words were used by Trent Reznor, however, they expressed “a deeply personal meditation on self-hatred” (Huxley 1997: 179). Here are the lyrics for “Closer” by Nine Inch Nails:

You let me violate you
You let me desecrate you
You let me penetrate you
You let me complicate you

Help me
I broke apart my insides
Help me
I’ve got no soul to sell
Help me
The only thing that works for me
Help me get away from myself

I want to fuck you like an animal
I want to feel you from the inside
I want to fuck you like an animal
My whole existence is flawed
You get me closer to god

You can have my isolation
You can have the hate that it brings
You can have my absence of faith
You can have my everything

Help me
Tear down my reason
Help me
It’s your sex I can smell
Help me
You make me perfect
Help me become somebody else

I want to fuck you like an animal
I want to feel you from the inside
I want to fuck you like an animal
My whole existence is flawed
You get me closer to god

Through every forest above the trees
Within my stomach scraped off my knees
I drink the honey inside your hive
You are the reason I stay alive

As Reznor (the songwriter and lyricist) sees it, “Closer” is “supernegative and superhateful” and the song’s message is “I am a piece of shit and I am declaring that” (Huxley 1997: 179). You can see what he means when you listen to the song (minor NSFW warning for the imagery in the video). [1]

Nine Inch Nails: Closer (Uncensored) (1994) from Nine Inch Nails on Vimeo.

Then again, meaning is relative. Tommy Lee has said that “Closer” is “the all-time fuck song. Those are pure fuck beats – Trent Reznor knew what he was doing. You can fuck to it, you can dance to it and you can break shit to it.” And Tommy Lee should know. He played in the studio for NIИ and he is arguably more famous for fucking than he is for playing drums.

Nevertheless, the problem with the positivity rating of songs keeps popping up. The song “Mad World” was a pop hit for Tears for Fears, then reinterpreted in a more somber tone by Gary Jules and Michael Andrews. But it is rated a positive 5.39. Gotye’s global hit about failed relationships, “Somebody That I Used To Know”, is rated a positive 5.33. The anti-war protest ballad “Eve of Destruction”, made famous by Barry McGuire, rates just barely on the negative side at 4.93. I guess there should have been more depressing references besides bodies floating, funeral processions, and race riots if the songwriter really wanted to drive home the point.

For the song “Milkshake”, Kelis has said that it “means whatever people want it to” and that the milkshake referred to in the song is “the thing that makes women special […] what gives us our confidence and what makes us exciting”. It is rated less positive than “Mad World” at 5.24. That makes me want to doubt the authors’ commitment to Sparkle Motion.

Another upbeat jam that the kids listen to is the Ramones’ “Blitzkrieg Bop”. This is the energetic and exciting anthem of punk rock. It’s rated a negative 4.82. I wonder if we should even look at “Pinhead”.

Then there’s the old American folk classic “Where did you sleep last night”, which Nirvana performed a haunting version of on their album MTV Unplugged in New York. The song (also known as “In the Pines” and “Black Girl”) was first made famous by Lead Belly and it includes such catchy lines as

My girl, my girl, don’t lie to me
Tell me where did you sleep last night
In the pines, in the pines
Where the sun don’t ever shine
I would shiver the whole night through

And

Her husband was a hard working man
Just about a mile from here
His head was found in a driving wheel
But his body never was found

This song is rated a positive 5.24. I don’t know about you, but neither the Lead Belly version nor the Nirvana cover would give me that impression.

Even Pharrell Williams’ hit song “Happy” rates only 5.70. That’s a song so goddamn positive that it’s called “Happy”. But it’s only 0.03 points more positive than Eric Clapton’s “Tears in Heaven”, which is a song about the death of Clapton’s four-year-old son. Harry Chapin’s “Cat’s in the Cradle” was voted the fourth saddest song of all time by readers of Rolling Stone but it’s rated 5.55, while Willie Nelson’s “Always on My Mind” rates 5.63. So they are both sadder than “Happy”, but not by much. How many lyrics must a man research, before his corpus is questioned?

Corpus linguistics is not just gathering a bunch of words and calling it a day. The fact that the same “word” can have several meanings (known as polysemy) is a major feature of language. So before you ask people to rate a word’s positivity, you will want to make sure they at least know which meaning is being referred to. On top of that, words do not work in isolation. Spacing is an arbitrary construct in written language (remember that song lyrics are mostly heard, not read). The back in the Ramones’ lines “Piling in the back seat” and “Pulsating to the back beat” is not a reference to a body part. The Weezer song “Butterfly” uses the word mason, but it’s part of the compound noun mason jar, not a reference to a bricklayer. Words are also conditioned by the words around them. A word like eve may normally be considered positive as it brings to mind Christmas Eve and New Year’s Eve, but when used in a phrase like “the eve of destruction” our judgment of it is likely to change. In the corpus under discussion here, eat is rated 7.04, but that rating doesn’t consider what’s being eaten and so cannot account for lines like “Eat your next door neighbor” (from “Eve of Destruction”).
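As a rough illustration of why context-free word scores mislead, here is a minimal bag-of-words scorer. The rating for eat (7.04) is the one quoted above; every other rating is invented for the sake of the example. On a scale where 5 is neutral, both phrases come out “positive”, even though only one of them is something you’d say at a birthday party.

```python
# Hypothetical ratings in the style of the papers' 1–9 scale;
# "eat" at 7.04 is the value quoted above, the rest are invented.
ratings = {"eat": 7.04, "your": 5.16, "next": 5.52, "door": 5.2,
           "neighbor": 5.7, "cake": 7.26}

def score(text):
    """Average the out-of-context word ratings, ignoring unknown words."""
    words = [w for w in text.lower().split() if w in ratings]
    return sum(ratings[w] for w in words) / len(words)

score("eat your cake")                # well above the neutral midpoint of 5
score("eat your next door neighbor")  # also above 5 – context is invisible
```

The scorer cannot tell the two phrases apart in kind, only in degree, because the meaning lives in the combination of words, not in the words themselves.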

We could go on and on like this. The point is that the authors of both papers didn’t do enough work with their data before drawing conclusions. And they didn’t consider that some of the language in their corpus is part of a multimodal genre where other things affect the meaning of the language used (though technically no language use is devoid of context). Whether or not the lyrics of a song are “positive” or “negative”, the style of singing and the music they are sung to will strongly affect a person’s interpretation of the lyrics’ meaning and emotion. That’s just the way that music works.

This doesn’t mean that any of these songs are positive or negative based on their rating; it means that the system used by the authors of the two papers to rate the positivity or negativity of language seems to be flawed. I would have guessed that a rating system which took words out of context would be fundamentally flawed, but viewing the ratings of the songs in this post is a good way to visualize it. The fact that the two papers were published in reputable journals and picked up by reputable publications, such as the Atlantic and the New York Times, only adds insult to injury for the field of linguistics.

You can see a table of the songs I looked at for this post below, and a spreadsheet with the ratings of the lyrics is here. I calculated the positivity ratings by averaging the scores for the word tokens in each song, rather than the types.

(By the way, Tupac is rated 4.76. It’s a good thing his attitude was fuck it ‘cause motherfuckers love it.)
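The token-based averaging I used can be sketched as follows, with a toy ratings dictionary (the values are invented). Averaging over tokens lets a repeated word pull the score around, while averaging over types counts each distinct word once – which is why the two methods can disagree.

```python
def positivity(lyrics, ratings):
    """Average the rating over word *tokens* (every occurrence counts)."""
    tokens = [w for w in lyrics.lower().split() if w in ratings]
    return sum(ratings[w] for w in tokens) / len(tokens)

def positivity_by_type(lyrics, ratings):
    """Average over word *types* (each distinct word counted once)."""
    types = {w for w in lyrics.lower().split() if w in ratings}
    return sum(ratings[w] for w in types) / len(types)

# Toy ratings; the repeated positive word pulls the token average up
# but leaves the type average unchanged.
ratings = {"happy": 8.3, "sad": 2.1}
line = "happy happy happy sad"
positivity(line, ratings)          # (8.3*3 + 2.1) / 4 = 6.75
positivity_by_type(line, ratings)  # (8.3 + 2.1) / 2 = 5.2
```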

Song Positivity score (1–9)
“Happy” by Pharrell Williams 5.70
“Tears in Heaven” by Eric Clapton 5.67
“You Were Always on My Mind” by Willie Nelson 5.63
“Cat’s in the Cradle” by Harry Chapin 5.55
“Closer” by NIN 5.46
“Mad World” by Gary Jules and Michael Andrews 5.39
“Somebody that I Used to Know” by Gotye feat. Kimbra 5.33
“Waitin’ for a Superman” by The Flaming Lips 5.28
“Milkshake” by Kelis 5.24
“Where Did You Sleep Last Night” by Nirvana 5.24
“Butterfly” by Weezer 5.23
“Eve of Destruction” by Barry McGuire 4.93
“Blitzkrieg Bop” by The Ramones 4.82

 

Footnotes

[1] Also, be aware that listening to these songs while watching their music videos has an effect on the way you interpret them. (Click here to go back up.)

References

Kloumann, Isabel M., Christopher M. Danforth, Kameron Decker Harris, Catherine A. Bliss, and Peter Sheridan Dodds. 2012. “Positivity of the English Language”. PLoS ONE. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0029484

Dodds, Peter Sheridan, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth. 2015. “Human language reveals a universal positivity bias”. PNAS 112:8. http://www.pnas.org/content/112/8/2389

Huxley, Martin. 1997. Nine Inch Nails. New York: St. Martin’s Griffin.


Last week I wrote a post called “If you’re not a linguist, don’t do linguistics”. This got shared around Twitter quite a bit and made it to the front page of r/linguistics, so a lot of people saw it. Pretty much everyone had good insight on the topic and it generated some great discussion. I thought it would be good to write a follow-up to flesh out my main concerns in a more serious manner (this time sans emoticons!) and to address the concerns some people had with my reasoning.

The paper in question is by Dodds et al. (2015) and it is called “Human language reveals a universal positivity bias”. The certainty of that title is important since I’m going to try to show in this post that the authors make too many assumptions to reliably make any claims about all human language. I’m going to focus on the English data because that is what I am familiar with. But if anyone who is familiar with the data in other languages would like to weigh in, please do so in the comments.

The first assumption made by the authors is that it is possible to make universal claims about language using only written data. This is an assumption that no scholar should make, and it is not a minor issue. The differences between spoken and written language are many and major (Linell 2005). Dealing with spoken data is difficult – it takes much more time and effort to collect and analyze than written data – but even in highly literate societies, the majority of language use is spoken, and spoken language does not work like written language. So any research which makes claims about all human language will have to include some form of spoken data. But the data set that the authors draw from (called their corpus) is made from tweets, song lyrics, New York Times articles and the Google Books project. Tweets and song lyrics, let alone news articles or books, do not mimic spoken language in an accurate way. These registers may include the same words as human speech, but certainly not in the same proportions. Written language does not include false starts, nor does it include repetition or elision in anything near the same way that spoken language does. Anyone who has done any transcription work will tell you this.

The next assumption made by the authors is that their data is representative of all human language. Representativeness is a major issue in corpus linguistics. When linguists want to investigate a register or variety of language, they build a corpus which is representative of that register or variety by taking a large enough and balanced sample of texts from it. What is important here, however, is that most linguists do not have a problem with a set of data representing a larger register – so long as that larger register isn’t all human language. For example, if we wanted to research modern English journalism (quite a large register), we would build a corpus of journalism texts from English-speaking countries and we would be careful to include various kinds of journalism – op-eds, sports reporting, financial news, etc. We would not build a corpus of articles from the Podunk Free Press and make claims about all English journalism. But representativeness is a tricky issue. The larger the language variety you are trying to investigate, the more data from that variety you will need in your corpus. Baker (2010: 7) notes that a corpus analysis of one novel is “unlikely to be representative of all language use, or all novels, or even the general writing style of that author”. The English sub-corpora in Dodds et al. exist somewhere in between a fully non-representative corpus of English (one novel) and a fully representative corpus of English (all human speech and writing in English). In fact, in another paper (Dodds et al. 2011), the representativeness of the Twitter corpus is explained like this: “First, in terms of basic sampling, tweets allocated to data feeds by Twitter were effectively chosen at random from all tweets. Our observation of this apparent absence of bias in no way dismisses the far stronger issue that the full collection of tweets is a non-uniform subsampling of all utterances made by a non-representative subpopulation of all people. While the demographic profile of individual Twitter users does not match that of, say, the United States, where the majority of users currently reside, our interest is in finding suggestions of universal patterns.” What I think that doozy of a sentence in the middle is saying is that the tweets come from an unrepresentative sample of the population, but that the language in them may be suggestive of universal English usage. Does that mean we can assume that the English sub-corpora (specifically the Twitter data) in Dodds et al. are representative of all human communication in English?

Another assumption the authors make is that they have sampled their data correctly. The decisions on what texts will be sampled, as Tognini-Bonelli (2001: 59) points out, “will have a direct effect on the insights yielded by the corpus”. Following Biber (see Tognini-Bonelli 2001: 59), linguists can classify texts into various channels in order to ensure that their sample texts will be representative of a certain population of people and/or variety of language. They can start with general “channels” of the language (written texts, spoken data, scripted data, electronic communication) and move on to whether the language is private or published. Linguists can then sample language based on what type of person created it (their age, sex, gender, socio-economic situation, etc.). For example, if we made a corpus of the English articles on Wikipedia, we would have a massive amount of linguistic data. Literally billions of words. But 87% of it will have been written by men and 59% of it will have been written by people under the age of 40. Would you feel comfortable making claims about all human language based on that data? How about just all English language encyclopedias?

The next assumption made by the authors is that the relative positive or negative nature of the words in a text is indicative of how positive that text is. But words can have various and sometimes even opposing meanings. Texts are also likely to contain words that are written the same but have different meanings. For example, the word fine in the Dodds et al. corpus, like the rest of the words in the corpus, is just a four-letter word – free of context and naked as a jaybird. Is it an adjective that means “good, acceptable, or satisfactory”, which Merriam-Webster says is sometimes “used in an ironic way to refer to things that are not good or acceptable”? Or does it refer to that little piece of paper that the Philadelphia Parking Authority is so (in)famous for? We don’t know. All we know is that it has been rated 6.74 on the positivity scale by the respondents in Dodds et al. Can we assume that all the uses of fine in the New York Times are that positive? Can we assume that the use of fine on Twitter is always or even mostly non-ironic? On top of that, some of the most common words in English also tend to have the most meanings. There are 15 entries for get in the Macmillan Dictionary, including “kill/attack/punish” and “annoy”. Get in Dodds et al. is ranked on the positive side of things at 5.92. Can we assume that this rating carries across all the uses of get in the corpus? The authors found approximately 230 million unique “words” in their Twitter corpus (they counted all forms of a word separately, so banana, bananas, b-a-n-a-n-a-s! would be separate “words”; and they counted URLs as words). So they used the 50,000 most frequent ones to estimate the information content of texts. Can we assume that it is possible to make an accurate claim about how positive or negative a text is based on nothing but the words taken out of context?

Another assumption that the authors make is that the respondents in their survey can speak for the entire population. The authors used Amazon’s Mechanical Turk to crowdsource evaluations for the words in their sub-corpus. 60% of the American people on Mechanical Turk are women and 83.5% of them are white. The authors used respondents located in the United States and India. Can we assume that these respondents have opinions about the words in the corpus that are representative of the entire population of English speakers? Here are the ratings for the various ways of writing laughter in the authors’ corpus:

Laughter tokens Rating
ha 6
hah 5.92
haha 7.64
hahah 7.3
hahaha 7.94
hahahah 7.24
hahahaha 7.86
hahahahaha 7.7
hee 5.4
heh 5.98
hehe 6.48
hehehe 7.06

And here is a picture of a character expressing laughter:

Pictured: Good times. Credit: Batman #36, DC Comics, Scott Snyder (wr), Greg Capullo (p), Danny Miki (i), Fco Plascenia (c), Steve Wands (l).


Can we assume that the textual representation of laughter is always as positive as the respondents rated it? Can we assume that everyone or most people on Twitter use the various textual representations of laughter in a positive way – that they are laughing with someone and not at someone?

Finally, let’s compare some data. The good people at the Corpus of Contemporary American English (COCA) have created a word list based on their 450-million-word corpus. The COCA corpus is specifically designed to be large and balanced (although the problem of dealing with spoken language might still remain). In addition, each word in their corpus is annotated for its part of speech, so they can recognize whether a word like state is a verb or a noun. This last point is something that Dodds et al. did not do – all forms of words that are spelled the same are collapsed into one word. The compilers of the COCA list note that “there are more than 140 words that occur both as a noun and as a verb at least 10,000 times in COCA”. This is the type/token issue that came up in my previous post. A corpus that tags each word for its part of speech can tell the difference between different types of the “same” word (state as a verb vs. state as a noun), while an untagged corpus treats all occurrences of state as the same type. If we compare the 10,000 most common words in Dodds et al. to a sample of the 10,000 most common words in COCA, we see that there are 121 words on the COCA list but not the Dodds et al. list (Here is the spreadsheet from the Dodds et al. paper with the COCA data – pnas.1411678112.sd01 – Dodds et al corpus with COCA). And that’s just a sample of the COCA list. How many more differences would there be if we compared the Dodds et al. list to the whole COCA list?
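Comparing two frequency lists like this is a simple set operation. A sketch, with hypothetical five-word lists standing in for the 10,000-word COCA and Dodds et al. lists:

```python
def list_diff(list_a, list_b):
    """Words that appear in one top-N frequency list but not the other."""
    a, b = set(list_a), set(list_b)
    return sorted(a - b), sorted(b - a)

# Hypothetical top-5 lists standing in for the real 10,000-word lists.
coca  = ["the", "be", "and", "of", "state"]
dodds = ["the", "lol", "and", "of", "state"]

only_coca, only_dodds = list_diff(coca, dodds)  # (['be'], ['lol'])
```

Note that an untagged list like the Dodds et al. one would still count state-the-noun and state-the-verb as a single entry, so even a perfect overlap between two lists can hide part-of-speech differences.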

To sum up, the authors use their corpus of tweets, New York Times articles, song lyrics and books and ask us to assume (1) that they can make universal claims about language despite using only written data; (2) that their data is representative of all human language despite including only four registers; (3) that they have sampled their data correctly despite not knowing what types of people created the linguistic data and only including certain channels of published language; (4) that the relative positive or negative nature of the words in a text is indicative of how positive that text is despite the obvious fact that words can be spelled the same and still have wildly different meanings; (5) that the respondents in their survey can speak for the entire population despite the English-speaking respondents being from only two subsets of two English-speaking populations (the USA and India); and (6) that their list of the 10,000 most common words in their corpus (which they used to rate all human language) is representative despite being uncomfortably dissimilar to a well-balanced list that can differentiate between different types of words.

I don’t mean to sound like a Negative Nancy and I don’t want to trivialize the work of the authors in this paper. The corpus that they have built is nothing short of amazing. The amount of feedback they got from human respondents on language is also impressive (to say the least). I am merely trying to point out what we can and can not say based on the data. It would be nice to make universal claims about all human language, but the fact is that even with millions and billions of data points, we still are not able to do so unless the data is representative and sampled correctly. That means it has to include spoken data (preferably a lot of it) and it has to be sampled from all socio-economic human backgrounds.

Hat tip to the commenters on the last post and the redditors over at r/linguistics.

References

Dodds, Peter Sheridan, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth. 2015. “Human language reveals a universal positivity bias”. PNAS 112:8. http://www.pnas.org/content/112/8/2389

Dodds, Peter Sheridan, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth. 2011. “Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter”. PLOS ONE. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752#abstract0

Baker, Paul. 2010. Sociolinguistics and Corpus Linguistics. Edinburgh: Edinburgh University Press. http://www.ling.lancs.ac.uk/staff/paulb/socioling.htm

Linell, Per. 2005. The Written Language Bias in Linguistics. Oxon: Routledge.

Mair, Christian. 2015. “Responses to Davies and Fuchs”. English World-Wide 36:1, 29–33. doi: 10.1075/eww.36.1.02mai

Tognini-Bonelli, Elena. 2001. Studies in Corpus Linguistics, Volume 6: Corpus Linguistics as Work. John Benjamins. https://benjamins.com/#catalog/books/scl.6/main


A paper recently published in PNAS claims that human language tends to be positive. This was news enough to make the New York Times. But there are a few fundamental problems with the paper.

Linguistics – Now with less linguists!

The first thing you might notice about the paper is that it was written by mathematicians and computer scientists. I can understand the temptation to research and report on language. We all use it and we feel like masters of it. But that’s what makes language a tricky thing. You never hear people complain about math when they only have a high-school-level education in the subject. The “authorities” on language, however, are legion. My body has, like, a bunch of cells in it, but you don’t see me writing papers on biology. So it’s not surprising that the authors of this paper make some pretty basic errors in doing linguistic research. They should have been caught by the reviewers, but they weren’t. And the editor is a professor of demography and statistics, so that doesn’t help.

Too many claims and not enough data

The article is titled “Human language reveals a universal positivity bias” but what the authors really mean is “10 varieties of languages might reveal something about the human condition if we had more data”. That’s because the authors studied data in 10 different languages and they are making claims about ALL human languages. You can’t do that. There are some 6,000 languages in the world. If you’re going to make a claim about how every language works, you’re going to have to do a lot more than look at only 10 of them. Linguists know this, mathematicians apparently do not.

On top of that, the authors don’t even look at that much linguistic data. They extracted the 5,000–10,000 most common words from larger corpora. Their combined corpora contain roughly the 100,000 most common words drawn from all of their sub-corpora. That is woefully inadequate. The Brown corpus contains 1 million words and it was made in the 1960s. In this paper, the authors claim that 20,000 words are representative of English. That is, not 20,000 different words, but the 5,000 most common words in each of their English sub-corpora. So 5,000 words each from Twitter, the New York Times, music lyrics, and the Google Books Project are supposed to represent the entire English language. This is shocking… to a linguist. Not so much to mathematicians, who don’t do linguistic research. It’s pretty frustrating, but this paper is a whole lotta ¯\_(ツ)_/¯.

To complete the trifecta of missing linguistic data, take a look at the sources for the English corpora:

Corpus Word count
English: Twitter 5,000
English: Google Books Project 5,000
English: The New York Times 5,000
English: Music lyrics 5,000

If you want to make a general claim about a language, you need to have data that is representative of that language. 5,000 words from Twitter, the New York Times, some books and music lyrics does not cut it. There are hundreds of other ways that language is used, such as recipes, academic writing, blogging, magazines, advertising, student essays, and stereo instructions. Linguists use the terms register and genre to refer to these and they know that you need more than four if you want your data to be representative of the language as a whole. I’m not even going to ask why the authors didn’t make use of publicly available corpora (such as COCA for English). Maybe they didn’t know about them. ¯\_(ツ)_/¯

Say what?

Speaking of registers, the overwhelmingly most common way that language is used is speech. Humans talking to other humans. No matter how many written texts you have, your analysis of ALL HUMAN LANGUAGE is not going to be complete until you address spoken language. But studying speech is difficult, especially if you’re not a linguist, so… ¯\_(ツ)_/¯

The fact of the matter is that you simply cannot make a sweeping claim about human language without studying human speech. It’s like doing math without the numeral 0. It doesn’t work. There are various ways to go about analyzing human speech, and there are ways of including spoken data in your materials in order to make claims about a language. But not performing any kind of analysis of spoken data in an article about Language is incredibly disingenuous.

Same same but different

The authors claim their data set includes “global coverage of linguistically and culturally diverse languages” but that isn’t really true. Of the 10 languages that they analyze, 6 are Indo-European (English, Portuguese, Russian, German, Spanish, and French). Besides, what does “diverse” mean? We’re not told. And how are the cultures diverse? Because they speak different languages and/or because they live in different parts of the world? ¯\_(ツ)_/¯

The authors also had native speakers judge how positive, negative or neutral each word in their data set was. A word like “happy” would presumably be given the most positive rating, while a word like “frown” would be on the negative end of the scale, and a word like “the” would be rated neutral (neither positive nor negative). The people ranking the words, however, were “restricted to certain regions or countries”. So not only are 14,000 words supposed to represent the entire Portuguese language, but the residents of Brazil who rated them are supposed to stand in for all Portuguese speakers. Or, perhaps that should be residents of Brazil with internet access.

[Update 2, March 2: In the following paragraph, I made some mistakes. I should not have said that ALL linguists believe that rating language is a notoriously poor way of doing an analysis. Obviously I can’t speak for all the linguists everywhere. That would be overgeneralizing, which is kind of what I’m criticizing the original paper for. Oops! :O I also shouldn’t have tied the ratings used in the paper to grammaticality judgments. Grammaticality judgments have been shown to be very, very consistent for English sentences. I am not aware of whether people tend to be as consistent when rating words for how positive, negative, or neutral they are (but if you are, feel free to post in the comments). So I think the criticism still stands. Some say that 384 English-speaking participants are more than enough to rate a word’s positivity. If people rate words as consistently as they do sentences, then this is true. I’m not convinced that they do (until I see some research on it), but I’ll revoke my claim anyway. Either way, the point still stands – the positivity of language does not lie in the relative positive or negative nature of the words in a text (the next point I make below). Thanks to u/rusoved, u/EvM and u/noahpoah on reddit for pointing this out to me.] There are a couple of problems with this, but the main one is that having people rate language is a notoriously poor way of analyzing language (notorious to linguists, that is). If you ask ten people to rate the grammaticality of a sentence on a scale from 1 to 10, you will get ten different answers. I understand that the authors are taking averages of the answers their participants gave, but they only had 384 participants rating the English words. I wouldn’t call that representative of the language. The number of participants for the other languages goes down from there.
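
Whether 384 raters is enough comes down to how consistent raters actually are. A quick simulation sketches the point (assuming, purely for illustration, that raters behave like independent noisy measurements of a word’s “true” valence – which is exactly the assumption in dispute; the numbers here are invented, not taken from the paper):

```python
import random
import statistics

random.seed(0)

def mean_of_raters(true_valence, rater_sd, n):
    """Average rating from n raters, each reporting the word's 'true'
    valence plus individual Gaussian noise."""
    return statistics.mean(random.gauss(true_valence, rater_sd) for _ in range(n))

# With n raters, the spread of the averaged score shrinks like 1/sqrt(n):
# here roughly 1.5 / sqrt(384), i.e. under a tenth of a point on a 1-9 scale.
estimates = [mean_of_raters(6.0, rater_sd=1.5, n=384) for _ in range(1000)]
print(round(statistics.stdev(estimates), 3))
```

So if raters really were this well-behaved, 384 of them would be plenty; the open question is whether valence ratings are anywhere near that consistent.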

A loss for words

A further complication with this article is that it rates the relative positivity of words rather than sentences. Obviously words have meaning, but words in isolation are not really how humans communicate. Consider the sentence Happiness is a warm gun. Two of the words in that sentence are positive (happiness and warm), while only one is negative (gun). That does not make it a positive sentence. That depends on your view of guns (and possibly Beatles songs). So it is potentially problematic to look at how positive or negative the words in a text are and then say that the text as a whole (or the corpus) presents a positive view of things.
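
A minimal sketch of this bag-of-words scoring shows the problem (the valence lexicon below is invented for illustration, on a 1–9 scale with 5 as neutral – these are not the authors’ actual ratings):

```python
# Toy word-valence lexicon, 1-9 scale, 5 = neutral.
# All numbers invented for illustration only.
valence = {
    "happiness": 8.0,
    "is": 5.0,
    "a": 5.0,
    "warm": 7.0,
    "gun": 3.0,
}

def score_text(text, lexicon):
    """Average the per-word valence scores, ignoring unknown words."""
    words = text.lower().split()
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else None

# (8 + 5 + 5 + 7 + 3) / 5 = 5.6, i.e. mildly "positive" -
# whatever the sentence actually means.
print(score_text("Happiness is a warm gun", valence))  # 5.6
```

The method happily labels the line positive because it never sees anything larger than a word.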

Lost in Google’s Translation

The last problem I’ll mention concerns the authors’ use of Google Translate. They write

We now examine how individual words themselves vary in their average happiness score between languages. Owing to the scale of our corpora, we were compelled to use an online service, choosing Google Translate. For each of the 45 language pairs, we translated isolated words from one language to the other and then back. We then found all word pairs that (i) were translationally stable, meaning the forward and back translation returns the original word, and (ii) appeared in our corpora in each language.

This is ridiculous. As helpful as Google Translate may be for understanding a menu in another country, it is not a good translator. Asya Pereltsvaig writes that “Google Translate/Conversation do not translate. They match. More specifically, they match (bits of) the original text with best translations, where ‘best’ means most frequently found in a large corpus such as the World Wide Web.” And she has caught Google Translate using English as an intermediate language when translating from one language to another. That means that when going between two languages that are not English (say French and Russian), Google Translate will first translate the word into English and then into the target language. This is a methodological problem for the article: relying on Google Translate makes the cross-language analysis untrustworthy.
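
The “translationally stable” filter itself is easy to sketch – and easy to fool. Here is a toy version with stub dictionaries standing in for Google Translate (this does not reproduce the real service’s behaviour). Note that French avocat (“lawyer” or “avocado”) passes the round-trip test even though the translation has silently collapsed two senses into one:

```python
# Stub translation tables standing in for a real translation service.
fr_to_en = {"chat": "cat", "avocat": "lawyer"}
en_to_fr = {"cat": "chat", "lawyer": "avocat", "avocado": "avocat"}

def is_stable(word, forward, backward):
    """A word is 'translationally stable' if translating it forward
    and then back returns the original word."""
    return backward.get(forward.get(word)) == word

print(is_stable("chat", fr_to_en, en_to_fr))    # True
print(is_stable("avocat", fr_to_en, en_to_fr))  # True - yet the filter
# never notices that 'avocat' is ambiguous between two English words
```

Round-trip stability guarantees only that the matching is reversible, not that the two words mean the same thing to speakers of the two languages.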

 

It’s unfortunate that this paper made it through to publication and it’s a shame that it was (positively) reported on by the New York Times. The paper should either be heavily edited or withdrawn. I’m doubtful that will happen.

 

Update: In the fourth paragraph of this post (the one which starts “On top of that…”), there was some type/token confusion concerning the corpora analyzed. I’ve made some minor edits to it to clear things up. Hat tip to Ben Zimmer on Twitter for pointing this out to me.

Update (March 17, 2015): I wrote a more detailed post (more references, less emoticons) on my problems with the article in question. You can find that here.

Read Full Post »

If you study linguistics, you will probably come across Anna Wierzbicka’s Cross-Cultural Pragmatics, perhaps as an undergrad, but definitely if you go into the fields of pragmatics or semantics. It’s a seminal work for reasons I will get into soon. The problem is that most of the data used to draw the conclusions are oversimplifications. This review is written for people who encounter this book in their early, impressionable semesters.

What’s it all about?

With Cross-cultural pragmatics, Wierzbicka was able to change the field of pragmatics for the better. Her basic argument runs like this: the previous “universal” rules of politeness that govern speech acts are wrong. The rules behind speech acts should instead be formulated in terms of culture-specific conversational strategies. Also, the mechanisms of speech acts are culture-specific, meaning that they reflect the norms and assumptions of a culture. Wierzbicka argues that language-specific norms of interaction should be linked to specific cultural values.

At the time Cross-cultural pragmatics was written, this needed to be said. There was more involved in speech acts than scholars were acknowledging. And the explanations used for speech acts in English were not entirely appropriate to explain speech acts in other languages or even other English-speaking cultures, even though they were being used that way. So Wierzbicka gets credit for helping to advance the field of linguistics.

So what’s wrong with that?

The problem I have with this book is that Wierzbicka lays out a research method designed to avoid oversimplifications, but then oversimplifies her data to reach conclusions. Wierzbicka’s method in Cross-cultural pragmatics can be seen as a step in the development of semantic primes, an approach that aims to explain all of the words in a language using a set of terms or concepts (do, say, want, etc.) that cannot be simplified further, their meanings being innately understood and their existence cross-cultural.

For example, Wierzbicka analyzes self-assertion in Japanese and English. She says that Japanese speakers DO NOT say “I want/think/like X”, while English speakers DO. She then translates the Japanese term enryo (restraint) like this:

X thinks: I can’t say “I want/think/like this” or “I don’t want/think/like this”
   Someone could feel bad because of this
X doesn’t say it because of this
X doesn’t do some things because of this

This is all well and good, but you can probably see how such an analysis has the potential to unravel. Just taking polysemy and context into account means that each and every term must be thoroughly explained using the above system.

But whatever. Let’s just say that it’s possible to do so. Semantic primes are still discussed in academia and I’m not here to debate their usefulness. What I want to talk about is how Wierzbicka oversimplifies the language and cultures that she compares. Although there are many examples to choose from, I’ll only list a few that come in quick succession.

cross-cultural pragmatics - wierzbicka

Those manly Aussies

In describing Australian culture, Wierzbicka says that “Shouting is a specifically Australian concept” (173). And yet she doesn’t explain how it is any different from buying a round or why this concept is “specifically Australian”. She then describes the Australian term dob in but does not tell us how it differs from snitch. Finally, she notes that Australians use the term whinge an awful lot. Whinge is used to bolster Wierzbicka’s claim that Australians value “tough masculinity, gameness, and resilience” and that they refer to British people as whingers.

First of all, how Wierzbicka misses the obvious similarities between whinging and whining is beyond me. She instead compares whinge to complain. Second, British people refer to other British people as “whingers”, so how exactly is whinge “marginal” in “other parts of the English-speaking world” (180)? Finally, wouldn’t using a negative term like whinge say more about the strained relations between the Australians and the British than it would about any sort of heightened “masculine” Australian identity? Does stunad prove that Italian-Americans have a particular or peculiar dislike of morons compared to other cultures?

We should have used a corpus

In other parts of Cross-cultural pragmatics, Wierzbicka seems to be cherry-picking the speech acts that she uses to evaluate the norms and values of the cultures she compares. This can be seen from the following passage on the differences between (white) Anglo-American culture and Jewish or black American culture:

The expansion of such expressions [Nice to have met you, Lovely to see you, etc.] fits in logically with the modern Anglo-American constraints on direct confrontation, direct clashes, direct criticisms, direct ‘personal remarks’ – features which are allowed and promoted in other cultures, for example, in Jewish culture or in Black American culture, in the interest of cultural values such as ‘closeness’, ‘spontaneity’, ‘animation’, or ‘emotional intensity’, which are given in these cultures priority over ‘social harmony’.
This is why, for example, one doesn’t say freely in (white) English, ‘You are wrong’, as one does in Hebrew or ‘You’re crazy’, as one does in Black English. Of course some ‘Anglos’ do say fairly freely things like Rubbish! or even Bullshit!. In particular, Bullshit! (as well as You bastard!) is widely used in conversational Australian English. Phrases of this kind, however, derive their social force and their popularity partly from the sense that one is violating a social constraint. In using phrases of this kind, the speaker defies a social constraint, and exploits it for an expressive purpose, indirectly, therefore, he (sometimes, she) acknowledges the existence of this constraint in the society at large. (pp. 118–9)

Do we know that white Anglo-Americans don’t say “You are wrong”, or that they say it less than Jewish people? I heard a white person say it today, but that is just anecdotal evidence. Obviously, large representative corpora were not around to consult when Wierzbicka wrote Cross-cultural pragmatics, but it would be nice to see at least some empirical data points. Instead we’re left with just the assertion that black Americans’ “You’re crazy” and Anglo-Americans’ “Bullshit!” are not equal, which to me is confusing and misguided. Also, aren’t black people violating a social norm by saying “you’re crazy”?

Wierzbicka couldn’t consult a corpus (because there wasn’t one available at the time), so I’ll allow myself the same liberty here, but just off the top of my head, I can think of other (common) expressions from both cultures that would say the exact opposite of what Wierzbicka claims. For example, as Pryor (1979) pointed out, whites have been known to say things like “Cut the shit!” How is this different from Black English’s “You’re crazy!”?

This leads me to the final major problem I have with Cross-cultural pragmatics: While classifications of speech acts based on “directness,” etc. were insufficient for the reasons that Wierzbicka points out, her classifications suffer from not being able to group similar constructions together, which is one of the goals in describing a large system such as language. They are too simplistic and specific to each construction. There are always certain constructions that don’t fit the mold that Wierzbicka lays out, which seems to me a similar problem to the one she’s trying to solve. So the problem gets shifted instead of solved.

Still, I think Wierzbicka was justified in changing the ways that researchers talked about speech acts. I also think she was right in shattering the Anglo-American and English language bias which was prevalent at the time. It’s those points that make Cross-cultural pragmatics an important work. The lack of empirical data and the over-generalizations are unfortunate, but so are lots of other things. Welcome to academia, folks.

 

 

 

Up next: Superman: The High-Flying History of America’s Most Enduring Hero by Larry Tye

Read Full Post »

When I last left you*, we had just talked about how Geoffrey Sampson’s The Language Instinct Debate is a remarkable take-down of Steven Pinker’s The Language Instinct and the nativist argument, or the idea that language is genetic. I came down pretty hard on the nativists, who I termed “Chomskers” (CHOMsky + PinKER + otherS) and rightly so since their theory amounts to a bunch of smoke and mirrors. For this post, I’m going to review the reviews of Sampson’s book. It’ll be like what scholars call a meta-analysis, except nowhere near as lengthy or peer-reviewed. For the absence of those, I promise more swear words. For those just joining us, here are my reviews of Pinker’s The Language Instinct and Sampson’s The Language Instinct Debate, the first two parts of this three-part series of posts. If you’re new to the subject matter (linguistic nativism), they’ll help you understand what this post is all about. If you already know all about Universal Grammar (and have read my totally bitchin’ reviews of the aforementioned books), then let’s get on with the show.

I know you are, but what am I?

Victor M. Longa’s review of The Language Instinct Debate

Longa’s review would be impressive if it wasn’t written in classic Chomskers’ style. He seems to address Sampson’s book in a thoughtful and step-by-step process, but his arguments boil down to nothing but “Sampson’s wrong because language is innate.” I know this sounds bad, but it’s the truth. A good example of Longa’s typical nativist style can be found here:

To sum up, S[ampson] tries, with difficulty, to explain the convergence between different languages by resorting only to the cultural nature of language. (Longa 1999: 338)

The disregard for other explanations is something to expect from the linguistic nativists. “You’re not considering that language is innate!” they protest. But innateness is all they consider. We must remember that linguistic nativism (or UG) is the unfalsifiable hypothesis. Any attempts to engage the theory in a logical way, such as Sampson has done, should be praised because of how much harm the proponents of the Universal Grammar Hypothesis (UGH) have done to the field of linguistics.

The belief that language is innate has become something more than an assumption to the nativists. This can be seen from Longa’s conclusion:

What is more, as I pointed out at the beginning of the paper, from the common-sense point of view, it is perfectly possible to conceive of a capacity such as language having been fixed in our species as a genetic endowment… (Longa 1999: 340)

It’s common-sense, godammit! What’s wrong with you people?! Why can’t everyone just see that something we have no evidence for is real? How many times do we have to say it? Language is innate. Never mind that it’s perfectly possible to conceive of just about anything (it’s called, you know, imagination), or that the arguments for linguistic nativism fall down easier than an elephant on ice skates, just trust us when we say that language is innate. OK?

Longa goes on about the innateness of language:

To deny this possibility a priori, claiming that it sounds almost mad, suggests a biased perspective that has little to offer to the scientific study of language.

Know what else has little to offer the scientific study of language (or the scientific study of anything, for that matter)? Unfalsifiable theories. That’s why linguistic nativism has been denied. Scientific hypotheses are accepted only so long as they stand up to the tests meant to falsify them. But first (and I can’t stress this enough) they have to be falsifiable or they’re not scientific theories. Linguistic nativism has been considered for so long only because Chomskers won’t stop writing bullshit books about it and forcing it down students’ throats. My fellow budding scholars who had to write about UGH, I feel for you.

Longa’s review is followed by a reply from Sampson, which offers a simple way to see how unfalsifiable nativism is. Sampson quite rightly points out that the speed-of-acquisition argument made by Chomskers, which says that language is innate because children learn language remarkably fast, is ridiculous because Chomskers have never said how long it should take children to learn language in the absence of an innate UG. They just say it’s innate and that kids learn language, like, really fast bro, and we’re supposed to take these claims as common-sense truth. This is par for the nativist course.

What he said

Stephen J. Cowley’s review of both books

Cowley’s review of both Pinker’s The Language Instinct and Sampson’s The Language Instinct Debate is a wonderful read and I want to quote the whole damn thing. While Cowley agrees that Sampson successfully refutes linguistic nativism, and that Pinker’s argument is akin to “saying that, because angels exist, miracles happen” (75), he rejects Sampson’s alternative to the origin of language, a topic I have not addressed in these reviews. Fortunately, I don’t have to quote the whole paper because it’s available online. And you should go read it here:
http://www.psy.herts.ac.uk/pub/sjcowley/docs/baby%26bathwater.pdf (PDF).

John H. McWhorter’s review in Language

Like Cowley, McWhorter writes that Sampson successfully refutes Chomskers’ theory, saying that he “makes a powerful case that linguistic nativism […] has been grievously underargued, and risks looking to scientists in a hundred years like the search for phlogiston does to us now” (434). That’s putting it nicely, I think.

McWhorter raises concerns with some of Sampson’s methods, such as his discussion of hypotaxis and complexity, his refutation of Berlin and Kay’s classic color-term study, and WH-movement. McWhorter also worries that since Sampson only covers Chomsky’s writings up to 1980, his take-down of linguistic nativism may not be as strong as could be hoped because of the post-1980 development of the Principles and Parameters theory and minimalism (two theories which are meant to deal with, you guessed it, problems with linguistic nativism. Surprise!). While I agree that it would have been nice to see Sampson discuss these theories (since they have their own typical nativism problems), I don’t believe their absence is as critical as McWhorter claims. McWhorter questions Sampson’s decision to stop at 1980 on the grounds that there’s nothing “solider to be pulled out of the bag” (Sampson 2005: 165), presuming that “certainly we would question a refutation of physics that used that justification to stop before string theory” (436). While I can see where he’s coming from, I think the bad analogy (which is something I’m pretty good at too) is particularly problematic here. Physics is founded on testable and falsifiable theories. Thanks to the contagious nature of nativism, linguistics these days is not.

What I especially like about McWhorter’s review is his acknowledgment that nativism has become something of a religion in linguistics. Commenting on the suspicious lack of response to Sampson’s book by nativists, McWhorter writes:

It may well be that Chomskyans harbor an argumentational firepower that would leave S[ampson] conclusively out-debated just as Chomsky’s detractors were in the 1960s and 1970s. But if such engagement is not even ventured, then claims that linguistic nativism is less a theory than a cult start looking plausible. (McWhorter 2008: 437)

Further Reading

This series of posts is by no means a review of all that has been said about UG or linguistic nativism. For those who wish to learn more, I suggest the following books.

The cultural origins of human cognition by Michael Tomasello

Tomasello’s book is a wonderful explanation of how children learn to speak and how human cognition does not need any innate language faculty. The theory he lays out has been called the Theory of Mind, which is an awful name, but it makes much more sense than anything I have ever read by nativists. Tomasello even has a few words for the nativists:

It is very telling that there are essentially no people who call themselves biologists who also call themselves nativists. When developmental biologists look at the developing embryo, they have no use for the concept of innateness. This is not because they underestimate the influence of genes – the essential role of the genome is assumed as a matter of course – but rather because the categorical judgment that a characteristic is innate simply does not help in understanding the process. (Tomasello 2000: 49)

If Chomskers’ theory left you shaking your head, and Sampson’s didn’t quite measure up, I highly recommend checking out Tomasello. As a bonus, this book is very much aimed at a wide audience, so three years of linguistics courses are not required.

What counts as evidence in linguistics, ed. by Martina Penke and Anette Rosenbach

This book is a collection of essays which address how the opposing fields in linguistics, formalism (or UG proponents) and functionalism, treat evidence in their research. The papers are excellent, not only because the authors are preeminent scholars in their fields, but also because each paper is followed by a response from an author of the opposing field. Even better, the responses are followed by replies from the author(s). It’s definitely on the hard-core linguistics side, so dabblers in this debate beware. As an example of what it contains, however, here is a link to a response to one of the articles by Michael Tomasello: http://www.eva.mpg.de/psycho/pdf/Publications_2004_PDF/what_kind_of_evidence_04.pdf (PDF). Not to toot Tomasello’s horn, but it really lays bare what scholars are up against when they attempt to engage nativists.

 

 

References

Cowley, Stephen J. 2001. “The baby, the bathwater and the ‘language instinct’ debate”. Language Sciences 23: 69–91. http://www.psy.herts.ac.uk/pub/sjcowley/docs/baby%26bathwater.pdf

Longa, Victor M. 1999. “Review article”. Linguistics 37(2): 325–343. http://dx.doi.org/10.1515/ling.37.2.325 (requires access to Linguistics).

McWhorter, John H. 2008. “The ‘language instinct’ debate (review)”. Language 84(2): 434–437. http://www.jstor.org/stable/40071054 http://dx.doi.org/10.1353/lan.0.0008 (requires access to either JSTOR or Project MUSE).

Penke, Martina and Anette Rosenbach (eds.). 2007. What counts as evidence in linguistics: The case of innateness. Amsterdam & Philadelphia: John Benjamins. http://benjamins.com/#catalog/books/bct.7/main

Sampson, Geoffrey. 1999. “Reply to Longa”. Linguistics 37(2): 345–350. http://dx.doi.org/10.1515/ling.37.2.345 (requires access to Linguistics, but a “submitted” online version can be found on Sampson’s site here: http://www.grsampson.net/ARtl.html)

Tomasello, Michael. 2000. The cultural origins of human cognition. Cambridge: Harvard University Press. On Amazon. On Abe Books. On Barnes&Noble.

 

 

Up next: Punctuation..? by User design.

 

 

* A long, long time ago, I know. But I decided to focus all my powers on writing my Master’s thesis, which meant this blog got the shaft. Now that’s done and we’re back in business, baby. Go back up for the sweet, sweet linguistic goodness.

Read Full Post »

The following is a book review and the second post in a series. The first post discussed Steven Pinker’s The Language Instinct. This post discusses Geoffrey Sampson’s The Language Instinct Debate, which is a critique of Pinker’s book. The third post will discuss some of the critics and reviews of Sampson’s book.

In a comment on the first post in this series, linguischtick (who has an awesome gravatar, by the way) pointed out that I didn’t mention two key points of the Chomskers (Chomsky + Pinker + their followers. Nom.) theory. As this post is about a book which is a direct “response to Steven Pinker’s The Language Instinct and Noam Chomsky’s nativism,” it would be good to remind ourselves of the claims that nativists make. Below are the claims along with some comments on them.

1. Speed of acquisition

Chomskyian linguists claim that kids learn language remarkably fast, so fast that it must be innate. But fast compared to what? How do we know kids don’t learn language very slowly? Chomskers has no answer. Sampson says this and then very cleverly points out that Chomsky has never supplied an amount of time it should take kids to learn language because “he argues that the data available to a language learner are so poor that accurate language learning would be impossible without innate knowledge – that is, no amount of time would suffice” (37, emphasis his).

2. Age dependence

Chomskers claim that the language instinct theory is supported by how our ability to learn a language diminishes greatly around puberty. Sampson quickly refutes this claim by showing how the evidence on which Chomskers based his claim fails “to distinguish language learning from any other case of learning” and that it is “perfectly compatible with the view that learning as a general process is for biological reasons far more rapid before puberty than later.” (41, emphasis his) So we see that leap of faith again. The evidence doesn’t suggest a language instinct, but that doesn’t stop Chomskers from jumping to that conclusion.

3. Poverty of the Stimulus

This is a major part of the Chomskers argument (and the only one that can be shortened into a perfectly applicable acronym – POS). Put simply, it goes like this: kids are not supplied with enough language info by their community to enable them to learn to speak. This is what Pinker was talking about when he snidely called Motherese – the style adults use when speaking to children – “folklore”. The poverty of the stimulus is a crazy idea, but don’t worry, it’s completely wrong. First, once linguists started researching Motherese, they found that it was much more “proper” than anyone had assumed. Sampson references one study that found “only one utterance out of 1500 spoken to the children was a disfluency.” (43) Chomskers also claim that some linguistic features never occur in spoken language and yet children learn the rules for them anyway. But wait a minute, has Chomskers ever looked for these mysterious linguistic features that never occur? Of course not. That’s not how they roll.

Sampson gives them a taste of their own medicine by writing

‘Hang on a minute,’ I hear the reader say. ‘You seem to be telling us that this man [Chomsky] who is by common consent the world’s leading living intellectual, according to Cambridge University a second Plato, is basing his radical reassessment of human nature largely on the claim that a certain thing never happens; he tells us that it strains his credulity to think that this might happen, but he has never looked, and people who have looked find that it happens a lot.’
Yes, that’s about the size of it. Funny old world, isn’t it! (47)

Another aspect of this piece of shit poverty of the stimulus argument is the so-called lack of negative evidence. This idea claims that kids aren’t given evidence of which types of constructions are not possible in language. It leads one to wonder how children could possibly learn which sentences to exclude as non-language. Sounds pretty interesting, huh? There must be a language instinct then, right? Sampson bursts Chomskers’ bubble:

The trouble with this argument is that, if it worked, it would not just show that language learning without innate knowledge is impossible: it would show that scientific discovery is impossible. We can argue about whether or not children get negative evidence from their elders’ language; but a scientist certainly gets no negative evidence from the natural world. When a heavy body is released near the surface of the Earth, it never remains stationary or floats upwards, displaying an asterisk or broadcasting a message ‘This is not how Nature works – devise a theory which excludes this possibility!’ (90)

4. Convergence of grammars

This claim wonders how both smart and dumb people grow up speaking essentially the same language.
Except they don’t, so forget it. Other linguists – the kind that like evidence and observable data – have proven that people don’t speak the same.

5. Language universals

This is the idea that there are some structural properties which are found across every language in the world, even though there is no reason why they should be (since they’re not necessary to language). This is where Universal Grammar comes in. Sampson devotes a chapter to this broad argument and in one of the many parts that make this book an excellent read, he very cleverly takes the argument down by pointing out that universals are better evidence of the cultural development of language than they are of the biological innate theory of language. Using a theory developed by Herbert Simon, Sampson shows that, basically, the structural dependencies that Chomskers are so fond of arose out of normal evolutionary development because evolution favors hierarchical structure. Complex evolutionary systems – something Sampson argues language is – are hierarchically structured for a reason; they do not have to be innate.

If this is the crux of the language instinct argument, it’s almost laughable how easily it falls. As Sampson notes, even Chomskers doesn’t think it carries weight.

Steven Pinker himself has suggested that nativist arguments do not amount to much. In a posting on the electronic LINGUIST List (posting 9.1209, 1 September 1998), he wrote: ‘I agree that U[niversal] G[rammar] has been poorly defended and documented in the linguistics literature.’ Yet that literature comprises the only grounds we are given for believing in the language universals theory. If the theory is more a matter of faith than evidence and reasoned argument even for its best-known advocate, why should anyone take it seriously? If it were not that students have to deal with this stuff in order to get their degrees, how many takers would there be for it? (166)

Even a blind squirrel finds a nut sometimes

The really sad thing is that Universal Grammar is the crux of the Chomskers argument. Sampson writes that “at heart linguistic nativism is a theory about grammatical structure.” (71) More importantly, it’s a theory that gathers all the “evidence” it thinks supports its beliefs and dismisses any that does not. It is Confirmation Bias 101.

But don’t take my word for it. Just before he knocks down the innatist belief that tree structures prove there’s a language instinct, Sampson points out that Chomskers don’t even know how to follow through with their own thoughts. He writes

Ironically, though, having been the first to realize that tree structure in human grammar is a universal feature that is telling us something about how human beings universally function, Chomsky failed to grasp what it is telling us. The universality of tree structuring tells us that languages are systems which human beings develop in the gradual, guess-and-test style by which, according to Karl Popper, all knowledge is brought into being. Tree structuring is the hallmark of gradual evolution. (141)

Hey-o!

So don’t violate or you’ll get violated

OK, right now the reader might think I’ve been too hard on Chomskers. Let me assuage your concerns. I’m a firm believer in treating people with the respect they deserve. So when I say that Chomskers have their heads stuck firmly up their own asses, it’s because saying “the facts don’t support their claims” is not what they deserve. A group of scientists that hates facts deserves derision. Researchers in every field use observable data to come to conclusions. Their publications are part of an ongoing debate among other researchers, who can support or refute their claims based on more data. Everyone plays by these rules because they are in everyone’s best interest. All infamous academic quarrels aside, Chomskers would prefer not to back up their claims with observable data or engage in any kind of debate with scientists. The bum on the street shouting that the world is going to end has the advantage of being bat-shit crazy. What’s Chomskers’ excuse?

I suppose they could say that they are well-established. But in my mind that just points out the reasons for their unscientific actions. What’s going to happen to those grants and faculty positions if people stop believing in Chomskers’ witchcraft? Sampson writes

“Nativist linguistics is now the basis of so many careers and so many university departments that it feels itself entitled to a degree of reverence. Someone who disagrees is expected to pull his punches, to couch his dissent in circumspect and opaquely academic terms – and of course, provided he does that, the nativist community is adept at verbally glossing over the critique in such a way that, for the general reader, not a ripple is left disturbing the public face of nativism. But reverence is out of place in science. The more widespread and influential a false theory has become, the more urgent it is to puncture its pretensions. Taxpayers who maintain the expensive establishment of nativist linguistics do not understand themselves to be paying for shrines of a cult: they suppose that they are supporting research based on objective data and logical argument.” (129)

Chomskers have been selling you snake oil for 60 years; they can’t give it up now. They have to double down. Now’s the time to really push the limits of decency in academia. Take a look:

“Paul Postal discusses in his Foreword the fact that my critique of linguistic nativism has been left unanswered by advocates of the theory. I am not alone there: various stories go the rounds about refusals by leading figures of the movement to engage with their intellectual opponents in the normal academic fashion, for fear that giving the oxygen of publicity to people who reject nativist theory might encourage the public to read those people and find themselves agreeing. […] The interesting point here is a different one. Nowhere in Words and Rules does Pinker say that he is responding to my objection. My book introduced the particular examples of Blackfoot and pinkfoot into this debate, and they are such unusual words that Pinker’s use of the same examples cannot be coincidence. He is replying to my book; but he does not mention me.” (127-8)

I don’t think I need to point out the shamefulness of such actions.

I read Steven Pinker and all I got was this lousy blog post

Reading Sampson after reading Pinker is a lesson in frustration, but not because of any problems with Sampson’s book. On the contrary, The Language Instinct Debate is very well written. Sampson not only clearly points out why Chomsky and Pinker’s theories are wrong, but he does so in a seemingly effortless way. Sometimes this is obvious because Chomskers didn’t even look at the evidence; they just made something up and held out their hands. Sometimes this is frustrating because I wasted time reading Pinker’s 450-page sand castle, which Sampson crumbled in less than half that length. The Language Instinct Debate may leave you wondering how you ever thought Chomskers were on to something when Sampson makes the counter-evidence seem so blatantly obvious.

In the next and final post of this series, I’ll talk about some of the reviews and critics of Sampson’s book. For now, I’ll leave you with how Chomskers’ refusal to check the evidence or believe anyone who has, along with their outstretched hand and their demand that you believe them, has inspired me to write a book of my own. It’s called Paris is the Capital of Germany, China is in South America, and Other Reasons Why I Hate Maps.

It’s due out at the end of never because ugh.

References

Sampson, Geoffrey. 2005. The Language Instinct Debate. London & New York: Continuum.


The following is a book review and the first post in a series. This post discusses Steven Pinker’s The Language Instinct. The second post discusses Geoffrey Sampson’s The Language Instinct Debate, which is a critique of Pinker’s book. The third post will discuss some of the critics and reviews of Sampson’s book.

In order to talk about Steven Pinker and linguistics, I first have to explain a bit about Noam Chomsky and linguistics. Chomsky started writing about linguistics in the 1950s and through sheer force became a major player in the field. This did not, however, mean that any of Chomsky’s theories carried weight. On the contrary, they were highly speculative and devoid of empirical evidence. Chomsky is the armchair linguist extraordinaire. The audacity of his theory, however, was that it proposed humans are born with something called Universal Grammar, an innate genetic trait that interprets the common underlying structure of all languages and allows us to effortlessly learn our first language. Extraordinary claims require extraordinary evidence, but it’s been over 50 years and the evidence has never come. On top of that, the linguist John McWhorter (who partly inspired this series of posts) has said that “There is an extent to which any scientific movement is partly a religion and that is definitely true of the Chomskyans.” As we’ll see, the analogy runs much deeper than that.

What you need to know for this review is that Steven Pinker is a Chomskyan. Therefore, this post will discuss not only The Language Instinct, but also the general theories behind it, since Pinker’s book is at the forefront of carrying on the (misguided) notions of Chomskyan linguistics. It’s not going to be pretty, but trust me, I know what I’m doing. To make things a bit easier on us all, instead of referring to Chomsky and Pinker and their cult followers separately, I’m going to call them Chomskers. (LOLcat says “meow”?).

Steven Pinker has got a bridge to sell you

On page 18, Pinker contrasts an innate origin of language with a cultural origin to define what he means by a language “instinct”:

Language is not a cultural artifact that we learn the way we learn to tell time or how the federal government works. Instead, it is a distinct piece of biological makeup of our brains. Language is a complex, specialized skill, which develops in the child spontaneously without conscious effort or formal instruction, is deployed without awareness of its underlying logic, is qualitatively the same in every individual, and is distinct from more general abilities to process information or behave intelligently. For these reasons some cognitive scientists have described language as a psychological faculty, a mental organ, a neural system, and a computational module. But I prefer the admittedly quaint term ‘instinct.’ It conveys the idea that people know how to talk in more or less the sense that spiders know how to spin webs.

It’s possible to deconstruct the incongruities of that passage, but that’s a job for another post (specifically, the one right after this, Sampson’s critique of Pinker). For now, just replace “language” in that passage with “making a sandwich” because to most linguists, the idea that our ability to make a sandwich is a “distinct piece of biological makeup of our brains” makes just as much sense as Pinker’s notion about language. So… Great argument, let’s eat!

Instead of focusing on the logical arguments that refute Pinker’s theory, what I want to discuss here is the frustration that comes from reading The Language Instinct and Chomskers literature when you know there are other more tenable theories out there.

Don’t drink the Kool-Aid

The first problem has to do with what I’ll call the Chomskers’ Leap of Faith. This involves the theory that there is an underlying structure common to all languages and that its form and reasoning is innate to the human brain. It is called Universal Grammar. In a sense, our brains give us a basic language structure that we can then extrapolate to our mother tongue, whatever that may be. To Chomskers, that is how people learn how to speak so quickly – they already have the fundamental tool, or language instinct, needed to develop language.

How did Chomskers arrive at such a theory, you ask? Simple, they made it up. Universal Grammar was conjured out of thin air (i.e. Chomsky’s mind) and after five decades there is still no solid evidence of its existence. This is the leap of faith I’m talking about. A good example of it comes from two bullet points on page 409:

  • Under the microscope, the babel of languages no longer appear to vary in arbitrary ways and without limit. One now sees a common design to the machinery underlying the world’s languages, a Universal Grammar.
  • Unless this basic design is built in to the mechanism that learns a particular grammar, learning would be impossible. There are many possible ways of generalizing from parents’ speech to the language as a whole, and children home in on the right ones, fast.

These ideas are completely speculative (also known as “pure bullshit”), but they illustrate Pinker’s leap of faith and circular logic. He thinks that because kids speak, they must have Universal Grammar, and because they have Universal Grammar, they must speak. Chomskers love circular logic. It’s what their temple is built on. Pinker’s The Language Instinct is 450 pages of that kind of reasoning. Nothing in the 400 pages leading up to those bullets requires a belief in Universal Grammar. They’re just cherry-picked, misleading, or outright refuted studies.

And the Lord said unto Chomskers…

Another infuriating aspect of reading Chomskers is the pretentiousness of their prose. One gets the feeling of reading the Word of God (Noam Chomsky, to the Chomskers) sent down from on high. Instead of taking other theories into account, or even trying to prove why other theories are wrong, they simply dismiss them presumptuously. And they lead unsuspecting readers to do the same. Take this quote from page 39:

First, let’s do away with the folklore that parents teach their children language. No one supposes that parents provide explicit grammar lessons, of course, but many parents (and some child psychologists who should know better) think that mothers provide children with implicit lessons […] called Motherese.

Calling “Motherese” – which is a seriously studied and empirically proven phenomenon – “folklore” doesn’t make it so. Why Pinker would do such a thing seems strange at first, but you have to realize that that’s what Chomskers do. That is how they deal with other solid linguistic studies that have the possibility of refuting their claims (which, remember, have no empirical evidence). The attitude of contempt didn’t work for Noam Chomsky and it’s not going to work for Steven Pinker.

So why does he do it? As the linguist Pieter A. Seuren wrote in Western Linguistics: An Historical Introduction:

Frequently one finds [Chomsky] use the term ‘exotic’ when referring to proposals or theories that he wishes to reject, whereas anything proposed by himself or his followers is ‘natural’ or ‘standard’. […]
One further, particularly striking feature of the Chomsky school must be mentioned in this context, the curious habit of referring to and quoting only members of the same school, ignoring all other linguists except when they have been long dead. The fact that the Chomsky school forms a close and entirely inward looking citation community has made some authors compare it to a religious sect or, less damningly, a village parish. No doubt there is a point to this kind of comparison, but one should realize that political considerations probably play a larger part in Chomskyan linguistics than is customary in either sects or village parishes. (525)

The problem again lies in Chomskers’ impression that only their theory exists. The bored, novice, or uncritical reader – and, you know, anyone being tested on this book – is liable to take Pinker at face value. In Chapter 8, aptly titled “The Tower of Babel,” Pinker really lays on the God-given truth of Universal Grammar. He writes

What is most striking of all is that we can look at a randomly picked language and find things that can sensibly be called subjects, objects, and verbs to begin with. After all, if we were asked to look for the order of subject, object, and verb in musical notation, or in the computer programming language FORTRAN, or in Morse code, or in arithmetic, we would protest that the very idea is nonsensical. It would be like assembling a representative collection of the world’s cultures from the six continents and trying to survey the colors of their hockey team jerseys or the form of their harakiri rituals. We should be impressed, first and foremost, that research on universals of grammar is even possible!

Except we shouldn’t. Chomskers have been pulling their “theories” out of their collective asses for decades now. Why would anyone be impressed that “research” on something they made up is “possible”? Are you impressed with people in tin foil hats researching UFO landings? That’s not to mention the fact that we invented the concepts of “subject” and “verb” to apply to language, just like we invented “base 10” and “base 60” to apply to arithmetic. Looking for those in language would be nonsensical. But looking for something that could sensibly be called a base in any randomly picked counting system would be – shock! awe! – possible and completely unimpressive. Pinker does a disservice to the reader by equating the existence of something like nouns in all of the world’s languages to the “existence” of Universal Grammar. There is evidence for one, not the other. The Bible tells us that the world was created. That is a fact. The Bible also tells us that God created the world. That is a statement of belief.

In a footnote, Seuren quotes Pinker’s admiration for Chomsky and then says “It seems that Pinker forgot to take into account the possibility that there may also be valid professional reasons for uttering severe criticisms vis-à-vis Chomsky.” (526) In the same way that a Catholic priest is unlikely to quote from the Koran in his sermon, Chomskers will not address any other theories in their writing. That’s alright for a parish, it’s not alright for academia.

At this point you may be wondering how the Chomskers’ theories have survived for so long. It has to do with their outlandishness and their unwillingness to engage with critics. As Seuren notes, “And since no other school of linguistics would be prepared to venture into areas of theorizing so far removed from verifiable facts and possible falsification, the Chomskyan proposals could be made to appear unchallenged.” (284) By the time other linguists took note of what the Chomskers were up to, it was too late. They had already established their old boys’ club. What’s interesting is that linguists need not bother trying to tear down the Chomskers, since books like The Language Instinct demonstrate that the closer Chomskers try to bring their theory to verifiable facts, the more they falsify it. I don’t know if Pinker realized this, but writing about shit as if it were Shinola has never been a problem for Chomskers. In a subsection titled “No arguments were produced, just rhetoric,” Seuren writes,

Despite twenty-odd years of disparagement from the side of Chomsky and his followers, one has to face the astonishing fact that not a single actual argument was produced during that period to support the attitude of dismissal and even contempt that one finds expressed, as a matter of routine, in the relevant Chomsky-inspired literature. Quasi-arguments, on the contrary, abounded. (514)

Linguistics does not work that way. Good night!

I told you the religion analogy was going to be more appropriate than it seemed at first. Belief in Universal Grammar is very much like belief in a god – you can’t see it, but it’s there. But that’s not science! To some people, the sunrise is proof that god exists. To astronomers, the sun does not actually “rise”. To Chomskers, speech is proof that Universal Grammar exists. To linguists, speech does not require such a leap of faith.

With his hawkish proclamations of the existence of Universal Grammar and his complete dismissal of any criticism, Noam Chomsky has done more harm than good to linguistics. Seuren says that “this behavior on Chomsky’s part has caused great harm to linguistics. Largely as a result of Chomsky’s actions, linguistics is now sociologically in a very unhealthy state. It has, moreover, lost most of the prestige and appeal it commanded forty years ago.” (526)

In an ironic turn of events considering his liberal political leanings, Chomsky and his ilk have become the Fox News of linguistics – they pull their theories out of thin air, shout them at the top of their lungs, and ridicule any who say otherwise. And just like the scare tactics of Fox News, the idea of a language instinct sells. McWhorter quite politely explains the Chomskers’ zealotry by saying “they want to find [a language instinct], they’re stimulated by this idea – as far as the counter evidence, most of them are too busy writing grants to pay much attention.” But that’s being too kind. If you ask me, bullshitting is their business… and business is good.

All this is unfortunate

To sum up, is there a language instinct? Maybe. Does Steven Pinker present a valid case for a language instinct? No.

To return to our religious analogy, you can believe in the Christian god, or in Buddha, or in the Flying Spaghetti Monster and there’s nothing wrong with that. But you can’t prove any of these gods exist (apologies to the Pastafarians, who have presented some very compelling evidence). Neither can Chomskers prove that a language instinct exists. I suppose there’s nothing wrong with believing it does, but you better have some facts to back up your theory if you want others to follow. Smoke and mirrors are interesting when used in magic shows, but infuriating when used in academic prose.

With a sly patronizing of those who cannot put up with Chomsky’s dense prose and a crafty acknowledgement of Chomsky’s intellectual superiority, Pinker writes

And who can blame the grammarphobe, when a typical passage from one of Chomsky’s technical works reads as follows? […quotes some mumbo jumbo from Chomsky…] All this is unfortunate […] Chomsky’s theory […] is a set of discoveries about the design of language that can be appreciated intuitively if one first understands the problems to which the theory provides solutions. (104)

Pinker complains about others who seem to have not read Chomsky, but I get the sense that Chomsky is the only linguist Pinker has ever read. Either Pinker knows of other linguistic theories and isn’t telling (i.e., he’s being deceptive), or he doesn’t know of them at all (i.e., he hasn’t done his research). Either way, it’s poor scholarship. As we’ll see in the next post, Pinker knows of Sampson’s theory and he uses examples from Sampson’s book without acknowledgment. That’s also poor scholarship, but of the kind that is common to Chomskers.

References

McWhorter, John. 2004. “When Language Began”. The Story of Human Language. The Great Courses: The Teaching Company. Course No. 1600.

Pinker, Steven. 1994. The Language Instinct: The New Science of Language and Mind. Penguin Group: London.

Seuren, Pieter A. M. 2004 (1998). Western Linguistics: an Historical Introduction. Oxford; Malden (MA): Blackwell.


Up next: A review of The Language Instinct Debate by Geoffrey Sampson.

 
[Update – This post originally had Noam Chomsky’s name written as “Chompsky”. Oops. Hehe. A word to the wise: Before adding words to your word processor’s dictionary, make sure they’re spelled correctly. Hat tip to Angela for pointing out the mistake.]
