Stop using the Flesch-Kincaid test

Before Language Log beats me to it, I want to hip you to another Bad Linguistics study out there. This one is called “Liberals lecture, conservatives communicate: Analyzing complexity and ideology in 381,609 political speeches” and it’s written by Martijn Schoonvelde, Anna Brosius, Gijs Schumacher and Bert Bakker. It was published in PLoS One (doi:10.1371/journal.pone.0208450).

The study analyzes almost 400,000 political speeches from different countries using a method called the Flesch-Kincaid Grade Score. The authors want to find out how complex the language in the speeches is and whether conservative or liberal politicians use more complex language. But hold up: what’s the Flesch-Kincaid score, you ask. Well, it’s a formula built on just two averages: the number of words per sentence and the number of syllables per word. The test gives a number that in theory corresponds to how many years of education someone would need in order to understand the text. This is called the “readability” of the text.
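For reference, the grade-level formula is short enough to sketch in a few lines of Python. The constants are the standard published ones. The syllable counter is a crude vowel-run heuristic assumed purely for illustration (it overcounts silent e, for one), and the function names are made up rather than taken from any particular tool.

```python
import re


def count_syllables(word: str) -> int:
    """Rough guess: one syllable per run of vowels. Not a real syllabifier."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39 * words/sentence + 11.8 * syllables/word - 15.59."""
    # A "sentence" is whatever sits between sentence-ending punctuation marks.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


print(round(fk_grade("The cat sat on the mat."), 2))
```

Notice what the function never looks at: what the words mean, how common they are, or how the sentence is built.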

So what’s the problem? Well, rather than spend too much time on it, I’ll listicle-ize the problems with this paper.

1. Writing is not speech

I can’t believe I’m saying this again, but here we are. The Flesch-Kincaid Grade Score is a test that was developed to measure written language. It specifically gives a “readability” score. Even the Wikipedia page for Flesch-Kincaid says it’s for written English. On the other hand, political speeches are spoken. It’s right there in what they’re called – speeches. But Schoonvelde et al. zoom right past this and apply a written language measurement (Flesch-Kincaid test) to spoken data (political speeches). Do they address the potential problems of doing this? No. No, they do not.

Why isn’t the language in Jurassic Park the same as the language in the book? Don’t the actors know they just have to read straight from the book?

2. Worst. Analysis. Ever.

Schoonvelde et al. claim that “All speeches were […] transcribed verbatim.” (p. 4) Nope. They most certainly were not. Do the speeches include false starts, ums and ers? Any mispronunciations? Laughing? Grunting? More importantly for the study at hand, where did the transcribers put the punctuation symbols? Schoonvelde et al. correctly claim that the Flesch-Kincaid scores increase with “an increasing number of clauses in a compound sentence” (p. 5), which would seem to indicate more complex language being used, but they don’t even mention that punctuation in speeches is essentially arbitrary. How did the transcribers decide where to put periods instead of commas or semi-colons and were their decisions uniform across all of the speeches? I don’t know and I bet the authors of this bad linguistics study don’t know either. I wonder if they even thought about this problem. They don’t seem to have when they say that Gordon Brown’s speech is more complex than David Cameron’s because “the Brown text consists of just one long sentence whereas the Cameron text contains multiple short sentences.” (p. 6) There’s a way to change that and it’s called “punctuation”. As Mark Liberman points out here, the F-K test “relies only on word length in syllables and sentence length in words, so that the resulting number depends crucially on punctuation choices.” Liberman adds that “any journalist who takes the Flesch-Kincaid test seriously is in dire need of remediation.” I’d say the same goes for political scientists.
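To see how much rides on the transcriber's punctuation, here is a small Python sketch. It computes only the sentence-length part of the score (0.39 times words per sentence), since the syllable part cannot change when the words stay the same. The utterance is invented for illustration and the function name is hypothetical.

```python
import re


def sentence_length_term(text: str) -> float:
    """The 0.39 * (words / sentences) part of the Flesch-Kincaid grade formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return 0.39 * len(words) / len(sentences)


# One made-up utterance, transcribed two ways. The 17 words are identical.
with_commas = "We will cut taxes, we will build homes, we will fund schools, and we will not apologise."
with_periods = "We will cut taxes. We will build homes. We will fund schools. And we will not apologise."

gap = sentence_length_term(with_commas) - sentence_length_term(with_periods)
print(f"Grade levels added by using commas: {gap:.2f}")
```

The comma version contributes 6.63 to the grade level and the period version about 1.66, so the same spoken words end up roughly five "grades" apart depending on who typed them up.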

3. Written speeches vs. spoken speeches

In a related problem, the authors don’t report how many of the speeches were prepared in advance and how many were given spontaneously. Probably because they don’t know, but also because they don’t seem to realize that there might be a difference. The difference (and hence the problem) is that spontaneous spoken language will be more informal and less complex than written and edited language (at least by the faulty F-K test; fewer syllables, fewer words per sentence, if we can talk about spoken language having sentences).

4. Y U No Cite Linguistic Sources?

The authors spend a lot of time citing other political scientists who have used the Flesch-Kincaid test, but they spend no time discussing the problems with it. They devote one measly paragraph to referencing a single other source that tried to develop a better model than the Flesch-Kincaid test, but then throw up their arms and say “while we think this measure is very promising, it is not feasible for our project”. Aka we can’t do better, so we won’t. We’ll stick with the garbage linguistic analysis we have, thanks.

The authors also suspiciously don’t have many sources that appeared in linguistic or language journals and used the Flesch-Kincaid test. I wonder why that is. They reference two studies which were published in linguistic/language journals (Wang & Liu 2018 and Dalvean 2016), but neither of these uses only the F-K test in its analysis. It’s almost like the field of linguistics and language studies has a higher standard for measuring a text’s complexity than the Flesch-Kincaid test can produce. Could that be the case? Nah. I bet linguists just haven’t thought of analyzing politicians’ speeches yet. Good thing we got political scientists!

5. No problemo?

The article uses the Flesch-Kincaid test on languages other than English. Can the Flesch-Kincaid test be used on languages other than English? No. Do the authors address this problem? Also no.

Again, Mark Liberman notes “The Flesch-Kincaid metric is so insensitive to actual reading difficulty that it doesn’t even matter whether the tested material is in the English language.”

6. Says who?

The authors make some general, unverified claims which should not have been accepted without evidence. For example, they write “this finding [that Trump uses less complex language than other politicians] speaks to the more general claim that conservative politicians use simpler, less complex language than liberals. […] Their divergence in linguistic complexity is argued to be rooted in personality differences.” (p. 1)

Who is making this claim? Who is arguing this? No one knows.

Thank u, next

The authors round off their analysis by saying that it “provides consistent evidence that the link between ideology and language complexity exists across countries; differences in linguistic complexity between liberals and conservatives transcend beyond the Anglo-Saxon world, despite language differences.” That’s a bold claim about language! Too bad the linguistic part of their analysis is garbage. And you know what they say: garbage in, garbage out.

I could go around looking for more hot takes on the Flesch-Kincaid test, but let’s stick with Mark Liberman, who calls it “an outdated and simple-minded metric that pretends to predict reading level based only on average word and sentence length.”

For what it’s worth, the article scores a 10.3 on the Flesch-Kincaid grade level, which means that you need to be at least 10 years and 3 months old to understand that this article should not have been published.

Update March 4

On Twitter, Alon Lischinsky and I talked about how someone should write a paper about how harmful readability tests are (the F-K test is a readability test). And wouldn’t you know it, Caroline Jarrett swooped in to show us that someone did! That someone is Ginny Redish and the article is called “Readability formulas have even more limitations than Klare discusses” (doi:10.1145/344599.344637). It’s behind a paywall, so I’ll give you the highlights. But first let me point out that it was published almost 20 years ago and people are still using the F-K test like it offers some revelations.

Redish points out that “a readability score is seldom useful and can, indeed, be misleading” (p. 132). It is misleading because we don’t know what an “eighth-grade reading level” means when we’re talking about adults. In a prescient quote for the article above, Redish says “An adult who reads at an ‘eighth-grade level’ may be a poor reader but may have a large spoken vocabulary from life experiences far beyond any eighth grader.” So the complexity of political speeches doesn’t boil down to just how many words and syllables are in a sentence. And (again pointed out by Redish) audience members can be very different, but readability scores can’t distinguish between audience members.

Highlights (my comments follow each quote):

“How valid are readability formulas for technical material for adult readers? No one knows. For technical materials for adults, using any readability formula means generalizing from situations that are decades old and not directly relevant to the audience or type of material.” (p. 133)

Same goes for political speeches. This should be obvious.

“[Flesch] created the formula by correlations with older comprehension tests and other formulas, not by redoing the research with adult readers. Before you use a readability formula, think about what, if anything, you are really learning from it.” (p. 134)

That last sentence scored very high on the Burn-o-meter.

Redish shows (p. 136) that the following two sentences would get the same score on the F-K test:

I wave my hand.

I waive my rights.

One of these sentences is complex. The other is not.
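The arithmetic is easy to check. This Python sketch plugs hand-counted words, sentences and syllables for each of the two sentences into the standard grade-level formula. The function name is made up, and the counts are tallied by hand, not produced by any tool.

```python
def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade-level formula, computed from raw counts."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59


# Hand-counted (words, sentences, syllables) for each example sentence.
counts = {
    "I wave my hand.": (4, 1, 4),
    "I waive my rights.": (4, 1, 4),
}

scores = {sentence: fk_grade(*c) for sentence, c in counts.items()}
for sentence, score in scores.items():
    print(f"{sentence!r}: {score:.2f}")
```

Both come out at -2.23. Four one-syllable words in one sentence is all the formula can see.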

That’s it. As Redish notes in a much better way than I could have above, “Long sentences are not a problem just because they are long. Length is only a corollary of several linguistic aspects that make sentences difficult.” (p. 136)

The citation for the article is below. If you have access you should go read it. It’s five pages long, it was written 19 years ago, and if Schoonvelde et al. had read it, we all would’ve saved ourselves a lot of time.

Redish, Janice. 2000. “Readability formulas have even more limitations than Klare discusses”. ACM Journal of Computer Documentation 24(3): 132-137. doi:10.1145/344599.344637

Update October 8

This academic article is still up on the journal’s website. I posted my comments on it back in March and April, and the authors responded to some of my points, but of course they couldn’t get out of the significant problems with their methodology. Go here if you want to read the exchange of comments.

Also, hi everyone from Hacker News!
