Analyzing language – You’re doing it wrong

Dan Zarrella, the “social media scientist” at HubSpot, has an infographic on his website called “How to: Get More Clicks on Twitter”. In it he analyzes 200,000 link-containing tweets to find out which ones had the highest clickthrough rates (CTRs) – which is another way of saying which tweets got the most people to click on the link in the tweet. Now, you probably already know that infographics are not the best form of advice, but Mr. Zarrella did a bit of linguistic analysis and I want to point out where he went wrong so that you won’t be misled. It may sound like I’m picking on Mr. Zarrella, but I’m really not. Any mistakes he made come down to not knowing how to analyze language – and he’s not a linguist, so he shouldn’t be expected to.

But there’s the rub. Analyzing the language of your tweets, your marketing, your copy, and your emails tells you what language works better for you – which is why it’s so important that you do the analysis right. To use a bad analogy: I could tell you that teams wearing the color red have won six of the last ten World Series, but that’s probably not the information you want if you’re placing your bets in Vegas. You’d rather know who the players are, wouldn’t you?

Here’s a section of Mr. Zarrella’s infographic called “Use action words: more verbs, fewer nouns”:

Copyright Dan Zarrella

That’s it? Just adverbs, verbs, nouns, and adjectives? That’s only four parts of speech. Your average linguistic analysis is going to be able to differentiate between at least 60 parts of speech. But there’s another reason why this analysis really tells us nothing. The word less is an adjective, adverb, noun, and preposition; run is a verb, noun, and adjective; and check, a word which Mr. Zarrella found to be correlated with higher CTRs, is a verb and a noun.

I don’t really know what to draw from his oversimplified picture. He says, “I found that tweets that contained more adverbs and verbs had higher CTRs than noun and adjective heavy tweets”. The image seems to show that tweets that “contained more adverbs” had 4% higher CTRs than noun heavy tweets and 5-6% higher CTRs than adjective heavy tweets. Tweets that “contained more verbs” seem to have slightly lower CTRs in comparison. But what does this mean? How did the tweets contain more adverbs? More adverbs than what? More than tweets which contained no adverbs? This doesn’t make any sense.

The thing is that it’s impossible to write a tweet that has more adverbs and verbs than adjectives and nouns. I mean that. Go ahead and try to write a complete sentence that has more verbs in it than nouns. You can’t do it because that’s not how language works. You just can’t have more verbs than nouns in a sentence (with the exception of some one- and two-word phrases). In any type of writing – academic articles, novels, whatever – about 37% of the words are going to be nouns (Hudson 1994). Some percentage (about 5–10%) of the words you say and write are going to be adjectives and adverbs. Think about it. If you try to remove adjectives from your language, you will sound like a Martian. You will also not be able to tell people how many more clickthroughs you’re getting from Twitter, or the color of all the money you’re making.
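If you want to check this claim on your own text, counting part-of-speech categories is straightforward once the words are tagged. Here is a minimal sketch in Python; the sentence is tagged by hand with Penn Treebank-style tags purely for illustration (a real study would run a tagger such as CLAWS over a full corpus):

```python
from collections import Counter

# A hand-tagged sentence (Penn Treebank-style tags, assigned by hand
# for illustration): "The quick analysis clearly shows real results."
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("analysis", "NN"),
    ("clearly", "RB"), ("shows", "VBZ"), ("real", "JJ"),
    ("results", "NNS"),
]

# Collapse fine-grained tags into four coarse classes.
def coarse(tag):
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    if tag.startswith("RB"):
        return "adverb"
    return "other"

counts = Counter(coarse(tag) for _, tag in tagged)
print(counts)  # nouns outnumber verbs even in this verb-friendly sentence
```

Even in a sentence built around an action verb, the nouns (and here the adjectives too) keep pace with or outnumber the verbs.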

I know it’s easy to think of Twitter as one entity, but we all know it’s not. Twitter is made up of all kinds of people, who tweet about all kinds of things. While anyone is able to follow anyone else, people of similar backgrounds and/or professions tend to group together. Take a look at the people you follow and the people who follow you. How many of them do you know personally, and how many are in a similar business to yours? These people probably make up the majority of your Twitter world. So what we need to know from Mr. Zarrella is which Twitter accounts he analyzed. Who are these people? Are they on Twitter for professional or personal reasons? What were they tweeting about and where did the links in their tweets go – to news stories or to dancing cat videos? And who are their followers (the people who clicked on the links)? This is essential information to put the analysis of language in context.

Finally, what Mr. Zarrella’s analysis should be telling us is which kinds of verbs and adverbs equal higher CTRs. As I mentioned in a previous post, marketers would presumably favor some verbs over others. They want to say that their product “produces results” and not that it “produced results”. What we need is a type of analysis that can tell shit (noun and verb) from Shinola (just a noun). And this is what I can do – it’s what I invented Econolinguistics for. Marketers need to be able to empirically study the language that they are using, whether it be in their blog posts, their tweets, or their copy. That’s what Econolinguistics can do. With my analysis, you can forget about meaningless phrases like “use action words”. Econolinguistics will allow you to rely on a comprehensive linguistic analysis of your copy to know what works with your audience. If this sounds interesting, get in touch and let’s do some real language analysis (joseph.mcveigh (at) gmail.com).

 

Other posts on marketing and linguistics

How Linguistics can Improve your Marketing by Joe McVeigh

Adjectives just can’t get a break by Joe McVeigh

Adjectives just can’t get a break

Everyone loves verbs, or so you would be led to believe by writing guides. Zack Rutherford, a professional freelance copywriter, posted an article on .eduGuru about how to write better marketing copy. In it he says:

Verbs work better than adjectives. A product can be quick, easy, and powerful. But it’s a bit more impressive if the product speeds through tasks, relieves stress, and produces results. Adjectives describe, while verbs do. People want a product or service that does. So make sure you provide them with one. [Emphasis his – JM]

If you’re a copy writer or marketer, chances are that you’ve heard this piece of advice. It sort of makes sense, right? Well, as a linguist who studies marketing (and a former copy writer who was given this advice), I want to explain to you why it is misleading at best and flat out wrong at worst. These days it is very easy to check whether verbs actually work better than adjectives in copy. You simply take many pieces of copy (texts) and use computer programs to tag each word for the part of speech it is. Then you can see whether the better, i.e. more successful, pieces of copy use more verbs than adjectives. This type of analysis is what I’m writing my PhD on (marketers and copy writers, you should get in touch).
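To give a sense of what that check looks like in practice, here is a minimal sketch in Python. The two toy “copy texts” and their tags are invented for illustration; in a real study the (word, tag) pairs would come from a part-of-speech tagger run over actual copy:

```python
# Sketch: given POS-tagged copy texts, compare verb vs adjective rates.
# The texts and their Penn Treebank-style tags are made up for this
# example; in practice a tagger would produce them from real copy.
texts = {
    "copy_A": [("It", "PRP"), ("speeds", "VBZ"), ("through", "IN"),
               ("tasks", "NNS"), ("and", "CC"), ("relieves", "VBZ"),
               ("stress", "NN")],
    "copy_B": [("It", "PRP"), ("is", "VBZ"), ("quick", "JJ"), (",", ","),
               ("easy", "JJ"), ("and", "CC"), ("powerful", "JJ")],
}

def rate(tagged, prefix):
    """Share of tokens whose tag starts with the given prefix."""
    hits = sum(1 for _, tag in tagged if tag.startswith(prefix))
    return hits / len(tagged)

for name, tagged in texts.items():
    print(name, "verbs:", round(rate(tagged, "VB"), 2),
          "adjectives:", round(rate(tagged, "JJ"), 2))
```

Once every text has a verb rate and an adjective rate, you can line those numbers up against whatever success metric you have and see whether the “verbs beat adjectives” advice actually holds.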

Don’t heed your own advice

So being the corpus linguist that I am, I decided to check whether Mr. Rutherford follows his own advice. His article has the following frequencies of usage for nouns, verbs, adjectives, and adverbs:

                Nouns    Verbs    Adjectives    Adverbs    Word count
Total           275      208      135           90         1195
% of all words  23.01%   17.41%   11.30%        7.53%

Hooray! He uses more verbs than adjectives. The only thing is that those frequencies don’t tell the whole story. They would if all verbs are equal, but those of us who study language know that some verbs are more equal than others. Look at Mr. Rutherford’s advice again. He singles out the verbs speeds through, relieves, and produces as being better than the adjectives quick, easy, and powerful. Disregarding the fact that the first verb in there is a phrasal verb, what his examples have in common is that the verbs are all -s forms of lexical verbs (gives, takes, etc.) and the adjectives are all general adjectives (according to CLAWS, the part-of-speech tagger I used). This is important because a good copy writer would obviously want to say that their product produces results and not that it produced results. Or as Mr. Rutherford says “People want a product or service that does” and not presumably one that did. So what do the numbers look like if we compare his use of -s form lexical verbs to general adjectives?

                -s form of lexical verbs    General adjectives
Total           24                          135
% of all words  2.01%                       11.30%

Uh oh. Things aren’t looking so good. Those frequencies exclude all forms of the verbs BE, HAVE, and DO, as well as modals and past tense verbs. So maybe this is being a bit unfair. What would happen if we included the base forms of lexical verbs (relieve, produce), the -ing participles (relieving, producing) and verbs in the infinitive (to relieve, it will produce)? The idea is that there would be positive ways for marketers to write their copy using these forms of the verbs. Here are the frequencies:

                Verbs (base, -ing part., infin., and -s forms)    General adjectives
Total           127                                               135
% of all words  10.63%                                            11.30%
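For readers who want to replicate this kind of narrowing, it boils down to filtering tokens by tag. Here is a sketch assuming CLAWS7-style tags (VV0/VVZ/VVG/VVI for the base, -s, -ing, and infinitive forms of lexical verbs; JJ for general adjectives); the tagged tokens are invented for illustration:

```python
# CLAWS7-style tags for lexical verb forms and general adjectives.
VERB_FORMS = {"VV0", "VVZ", "VVG", "VVI"}  # base, -s form, -ing, infinitive
ADJECTIVE = "JJ"

# Invented tagged tokens standing in for a real tagged text.
tagged = [
    ("produces", "VVZ"), ("results", "NN2"), ("powerful", "JJ"),
    ("relieve", "VV0"), ("relieving", "VVG"), ("quick", "JJ"),
    ("produced", "VVD"),  # past tense: excluded from the verb set
]

verb_count = sum(1 for _, t in tagged if t in VERB_FORMS)
adj_count = sum(1 for _, t in tagged if t == ADJECTIVE)
print(verb_count, adj_count)  # the past-tense "produced" doesn't count
```

The whole point of tag-level filtering is that “verbs” stops being one undifferentiated pile: you can include exactly the forms a copy writer would actually want to use and exclude the rest.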

Again, things don’t look so good. The verbs are still less frequent than the general adjectives. So is there something to writing good copy other than just “use verbs instead of adjectives”? I thought you’d never ask.

Some good advice on copy writing

I wrote this post because the empirical research of marketing copy is exactly what I study. I call it Econolinguistics. Using this type of analysis, I have found that using more verbs or more adjectives does not relate to selling more products. Take a look at these numbers.

Copy text    Performance    Verbs – Adjectives
1            42.04           3.94%
2            11.82           0.63%
3            11.81           6.22%
4            10.75          -0.40%
5             2.39           3.21%
6             2.23          -0.78%
7             2.23           4.01%
8             1.88           1.14%
9            (baseline)      5.46%

These are the frequencies of verbs and adjectives in marketing texts ordered by how well they performed. The ninth text is the worst and the rest are ranked based on how much better they performed than this ninth text. The third column shows the difference between the verb frequency and adjective frequency for each text (verb % minus adjective %). If it looks like a mess, that’s because it is. There is not much to say about using more verbs than adjectives in your copy. You shouldn’t worry about it.
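To put a number on “it’s a mess”: the Pearson correlation between performance and the verb–adjective difference for the eight ranked texts is weak. A quick check in plain Python, using the figures from the table above:

```python
import math

# Performance and (verb % minus adjective %) for the eight ranked texts
# from the table above (text 9 is the baseline and has no score).
performance = [42.04, 11.82, 11.81, 10.75, 2.39, 2.23, 2.23, 1.88]
diff = [3.94, 0.63, 6.22, -0.40, 3.21, -0.78, 4.01, 1.14]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(performance, diff)
print(round(r, 2))  # weakly positive at best, nowhere near +1 or -1
```

A correlation this far from ±1 is exactly what “not much to say” looks like in numbers.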

There is, however, something to say about the combination of nouns, verbs, adjectives, adverbs, prepositions, pronouns, etc., etc. in your copy. The ways that these kinds of words come together (and the frequencies at which they are used) will spell success or failure for your copy. Trust me. It’s what Econolinguistics was invented for. If you want to know more, I suggest you get in touch with me, especially if you’d like to check your copy before you send it out (email: joseph.mcveigh(at)gmail.com).

In order to really drive the point home, think about this: if you couldn’t use adjectives to describe your product, how would you tell people what color it is? Or how big it is? Or how long it lasts? You need adjectives. Don’t give up on them. They really do matter. And so do all the other words.

 


Don’t Go Down the Google Books Garden Path

When Google’s Ngram Viewer was the topic of a post on Science-Based Medicine, I knew it was becoming mainstream. No longer content to be toyed with only by linguists killing time, the Ngram Viewer had entranced people from other walks of life. And I can understand why. Google’s Ngram Viewer is an impressive service that allows you to quickly and easily search for the frequency of words and phrases in millions of books. But I want to warn you about Google’s Ngram Viewer. As a corpus linguist, I think it’s important to explain just what Ngram Viewer is, what it can be used to do, how I feel about it, and the praise it has been receiving since its inception. I’ll start out simple: despite all its power and what it seems to be capable of, looks can be deceiving.

Have we learned nothing?

Jann Bellamy wrote a post at Science-Based Medicine about using Google’s Ngram Viewer (GNV) to research some terms used to describe the very unscientific practice of Complementary and Alternative Medicine (CAM). Although an article of this type is unusual for the SBM site, it does show how intriguing GNV can be. And Ms. Bellamy does a good job of explaining a few of the caveats of GNV:

The database only goes through 2008, so searches have to end there. Also, the searches have to assume that the word or phrase has only one definition, or perhaps one definition that dominates all others. We also have to remember that only books were scanned, not, for example, academic journals or popular magazines. Or blog posts, for that matter.

Ms. Bellamy then goes on to search for some CAM terms. After noting which terms are more common and when they started to rise in usage, she does a very good job of explaining the reasons that certain terms have a higher frequency than others. At the end, however, she is left with more questions than answers. Although she discovered that alternative medicine appears more frequently than complementary medicine in the Google Books database, and although she did further research (outside of Google Books) to explain why, she is still left right where she started. Just looking at the numbers from GNV, she can’t say what kind of impact CAM has had on our (English-speaking) world or culture. So what was the point of looking at GNV at all (besides the pretty colors)?

In her post, Ms. Bellamy links to an article in the New York Times by Natasha Singer. In what is essentially an exposition of GNV, with quotes from two of its founders, Ms. Singer places a lot more stock in the value and capability of the program. But from a corpus linguist’s perspective, she leaps a bit too far to her conclusions.

Ms. Singer’s article begins with the phrase “Data is the new oil” and then goes on to explain the comparison between these two words offered by GNV. She writes:

I started my data-versus-oil quest with casual one-gram queries about the two words. The tool produced a chart showing that the word “data” appeared more often than “oil” in English-language texts as far back as 1953, and that its frequency followed a steep upward trajectory into the late 1980s. Of course, in the world of actual commerce, oil may have greater value than raw data. But in terms of book mentions, at least, the word-use graph suggests that data isn’t simply the new oil. It’s more like a decades-old front-runner.

But with the Google Books corpus (the set of texts that GNV analyzes), we need to remember what the corpus contains, i.e. what “book mentions” means. This tells us how representative both the corpus and our analysis are. The Google Books corpus does not contain speech, newspapers, tweets, magazine articles, business letters, or financial reports. Sure, oil is important to our culture, and certainly to global and political history, but do people write books about it? We cannot directly extrapolate the findings from Google Books to Culture any more than we can tell people about the world of 16th-century England by studying the plays of Shakespeare. With GNV we can merely study the culture of books (or the culture of publishing). And there are many ways that GNV can mislead you. For example, are the hits in Ms. Singer’s search talking about crude oil, olive oil, or oil paintings? Google Ngrams will not tell you. Just for fun, here’s Ms. Singer’s search redone with some other terms. Feel free to draw your own conclusions.

Search for “data, oil, chocolate, love” on GNV. (Just to be clear, searching for oil_NOUN doesn’t change things much; oil as a verb is almost non-existent in the corpus. Take that as you will)

Research casual

The second article I want to talk about comes from Ben Zimmer. While I don’t think Mr. Zimmer needs to be told anything that’s in this post, his article in The Atlantic gets to the heart of my frustration with GNV. It features a more complex search on GNV to find out which nouns modify the word mogul and how they have changed over the last 100 years. In the following passage, he alludes to the reality of GNV without coming right out and saying it.

It’s possible to answer these questions using the publicly available corpora compiled by Mark Davies at Brigham Young University, but the peculiar interface can be off-putting to casual users. With the Ngram Viewer, you just need to enter a search like “*_NOUN mogul” or “ragtag *_NOUN” and select a year range. It turns out that in 20th-century sources, media moguls are joined by movie moguls, real estate moguls, and Hollywood moguls, while the most likely things to be ragtag are armies, groups, and bands.

There are a few points to make about this. First, the interface of the publicly available corpora compiled by Mark Davies could be described as “peculiar”, but that’s only because it’s not the lowest common denominator. And there’s the rub because researchers are capable of so much more using Mark Davies’ corpora. While the interface isn’t immediately intuitive, it certainly isn’t hard to learn. As a bad comparison, think about the differences between Windows, OSX, and a Linux OS. Windows is the lowest common denominator – easiest to use and most intuitive. OSX and Linux, on the other hand, take a bit of getting used to. But how many of us have learned OSX or Linux and willingly gone back to Windows?

The second point is not so much about casual users as it is about casual searches. I think Mr. Zimmer is right to talk about casual users since it’s probable that most of the people who use GNV will be looking for a quick and easy stroll down the cultural garden path. But more to the point, I think he’s right to offer different types of moguls as a search example because that’s about as far as GNV will take you. Can you see which types of moguls people are talking about? No. How about which types of moguls are being used in magazines? Nope. Newspapers? Nuh-uh. You have to turn to one of Mark Davies’ corpora for that. In fact, less casual users are even able to access Google Books (and other corpora) via Mark Davies’ site, and this allows them to conduct more complex searches (For a much more detailed comparison of GNV and some of the corpora offered on Mark Davies’ site, see here). So again the question is what’s the point of looking at GNV at all?

Final thoughts – Almost right but not quite

All this picking on GNV is not without reason. Even though what the people at Google have done is truly impressive, we have seen that the practical use of GNV is limited. As the saying in corpus linguistics goes, “Compiling is only half the battle”. GNV does not offer users a way to really measure what they are (usually) looking for. As an example, a quote from Ms. Singer’s article will suffice:

The system can also conduct quantitative checks on popular perceptions. Consider our current notion that we live in a time when technology is evolving faster than ever. Mr. Aiden and Mr. Michel [two of GNV’s creators] tested this belief by comparing the dates of invention of 147 technologies with the rates at which those innovations spread through English texts. They found that early 19th-century inventions, for instance, took 65 years to begin making a cultural impact, while turn-of-the-20th-century innovations took only 26 years. Their conclusion: the time it takes for society to learn about an invention has been shrinking by about 2.5 years every decade.

While this may be true, it’s not proven by looking at Google Books. For example, ask yourself these questions: what was the rate of literacy in the early 19th century? How many books did people read (or have read to them) in the early 19th century compared to the turn of the 20th century? What was the difference between the rate of dissemination of information in the two time periods? How about the rate of publishing? And what exactly qualifies as technology – farm equipment or fMRI machines? Or does it have to be more closely related to culture and Culturomics – like Facebook?

And most importantly, are books the best way to measure the cultural impact of an idea or technology? The fact is that the system cannot really conduct quantitative checks on popular perceptions. But it can make you think it can.

So GNV has a long way to go. I hesitate to say that they will get there because Google does not really have an interest in offering this kind of service to the public (I didn’t see any ads on the GNV page, did you?). While it may be fun to play around with GNV, I would advise against drawing any (serious) conclusions from what it spits out. Below are some other searches I ran. Again, feel free to draw your own conclusions about how these terms and the things they describe relate to human culture.

Search for “blood, sugar, sex, magik”. Click here to see the results on GNV.
Search for “Bobby Fischer, Jay Z”. Click here to see the results on GNV.
Search for “Bobby Fischer, Jay Z, Eminem, Dr Dre, Run DMC, Noam Chomsky”. Click here to see the results on GNV.
Search for “Johnny Carson, Conan O’Brien, Jay Leno, David Letterman, Jimmy Kimmel, Jimmy Fallon, Big Bird, Saturday Night Live” (from the year 1950). Click here and then “Search lots of books” to see the results on GNV.
Search for “Superman, Batman, Wonder Woman, Buffy, King Arthur, Robin Hood, Hercules, Sherlock Holmes, Pele”. Click here to see the results on GNV.

Search for “Barack Obama, George Bush, Bill Clinton, Ronald Reagan, Richard Nixon, John * Kennedy, Dwight * Eisenhower, Harry * Truman, Franklin * Roosevelt, Abraham Lincoln, Beatles”. Click here to see the results on GNV.

Notice how the middle initial of some presidents complicates things in the above search. It would be nice to be able to combine the frequencies for “John Fitzgerald Kennedy”, “John F Kennedy”, “John Kennedy”, and “JFK” into one line, and exclude hits like “John S Kennedy” from the results completely, but that’s not possible. You could, however, search GNV for the different ways to refer to President Kennedy and see the differences, for whatever that will tell you.
 
 
Search for “Brad Pitt, Audrey Hepburn, Noam Chomsky, Bob Marley”. Click here to see the result on GNV.

Noam Chomsky has had a bigger effect on our culture than Audrey Hepburn, Bob Marley, and Brad Pitt? You be the judge!

Unsurprisingly, corpus linguists have already answered your question

This post is a response to a corpus search done on another blog. Over on What You’re Doing Is Rather Desperate, Neil Saunders wanted to research how adverbs are used in academic articles, specifically the sentence adverb, or as he says, adverbs which are used “with a comma to make a point at the start of a sentence”. I’m not trying to pick on Mr. Saunders (because what he did was pretty great for a non-linguist), but I think his post, and the media reports on it, make a great excuse to write about the really, really awesome corpus linguistics resources available to the public. I’ll go through what Mr. Saunders did, and list what he could have done had he known about corpus linguistics.

Mr. Saunders wanted to know about sentence adverbs in academic texts so he wrote a script to download abstracts from PubMed Central. Right off the bat, he could have gone looking for either (1) articles on sentence adverbs or (2) already available corpora. As I pointed out in a comment on his post (which has mysteriously disappeared, probably due to the URLs I put in it), there are corpora with science texts from as far back as 1375 AD. There are also modern alternatives, such as the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC), both of which (and much, much more) are available through Mark Davies’ awesome site.

I bring this up because there are several benefits of using these corpora instead of compiling your own, especially if you’re not a linguist. The first is time and space. Saunders says that his uncompressed corpus of abstracts is 47 GB (!) and that it took “overnight” (double !) for his script to comb through the abstracts. Using an online corpus drops the space required on your home machine down to 0 GB. And running searches on COCA, which contains 450 million words, takes a matter of seconds.

The second benefit is a pretty major one for linguists. After noting that his search only looks for words ending in -ly, Saunders says:

There will of course be false positives – words ending with “ly,” that are not adverbs. Some of these include: the month of July, the country of Italy, surnames such as Whitely, medical conditions such as renomegaly and typographical errors such as “Findingsinitially“. These examples are uncommon and I just ignore them where they occur.

This is a big deal. First of all, the idea of using “ly” as a way to search for adverbs is profoundly misguided. Saunders seems to realize this, since he notes that not all words that end in -ly are adverbs. But where he really goes wrong, as we’ll soon see, is in disregarding all of the adverbs that do not end in -ly. If Saunders had used a corpus that already had each word tagged for its part of speech (POS), or if he had run a POS-tagger on his own corpus, he could have had an accurate measurement of the use of adverbs in academic articles. This is because POS-tagging allows researchers to find adverbs, adjectives, nouns, etc., as well as search for words that end in -ly – or even just adverbs that end in -ly. And remember, it can all be done in a matter of moments (even the POS tagging). You won’t even have time to make a cup of coffee, although consumption of caffeinated beverages is highly recommended when doing linguistics (unless you’re at a conference, in which case you should substitute alcohol for caffeine).
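The gap between suffix matching and tag-based searching is easy to demonstrate. Here is a toy sketch in Python, with a hand-tagged sentence (Penn Treebank-style tags) invented for illustration:

```python
# Hand-tagged tokens: suffix matching both misses real adverbs
# ("However", "often") and catches non-adverbs ("Italy", "July").
tagged = [
    ("However", "RB"), ("results", "NNS"), ("often", "RB"),
    ("improved", "VBD"), ("significantly", "RB"), ("in", "IN"),
    ("Italy", "NNP"), ("last", "JJ"), ("July", "NNP"),
]

ly_matches = [w for w, _ in tagged if w.lower().endswith("ly")]
true_adverbs = [w for w, t in tagged if t.startswith("RB")]

print(ly_matches)    # includes the false positives Italy and July
print(true_adverbs)  # includes adverbs the -ly search would never find
```

The two lists overlap on a single word; everything else is either a false positive or a missed adverb, which is exactly why you search by tag rather than by suffix.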

Here is where I break from following Saunders’ method. I would like to show you what’s possible with some of the publicly available corpora online – that is, how a linguist would conduct an inquiry into the use of adverbs in academia.

Looking for sentence-initial adverbs in academic texts, I went to COCA. I know the COCA interface can seem a bit daunting to the uninitiated, but there are very clear instructions (with examples) of how to do everything. Just remember: if confusion persists for more than four hours, consult your local linguist.

On the COCA page, I searched for adverbs coming after a period, or sentence initial adverbs, in the Medical and Science/Technology texts in the Academic section (Click here to rerun my exact search on COCA. Just hit “Search” on the left when you get there). Here’s what I came up with:

Top ten sentence initial adverbs in medical and science academic texts in COCA.
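The logic behind that COCA query – grab the first token of each sentence and keep it if it’s tagged as an adverb – can be mimicked on any tagged text of your own. A toy sketch, with hand-made tagged sentences standing in for real tagger output:

```python
from collections import Counter

# Each sentence is a list of (word, tag) pairs; tags are Penn
# Treebank-style and assigned by hand for this toy example.
sentences = [
    [("Finally", "RB"), (",", ","), ("we", "PRP"), ("conclude", "VBP")],
    [("However", "RB"), (",", ","), ("results", "NNS"), ("vary", "VBP")],
    [("The", "DT"), ("data", "NNS"), ("agree", "VBP")],
    [("Finally", "RB"), (",", ","), ("it", "PRP"), ("works", "VBZ")],
]

# Count sentence-initial adverbs (first token tagged RB*).
initial_adverbs = Counter(
    s[0][0] for s in sentences if s[0][1].startswith("RB")
)
print(initial_adverbs.most_common())
```

The same two-step pattern – segment into sentences, then test the tag of the first token – is all a sentence-initial adverb search amounts to, whether it runs over four toy sentences or 450 million words.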

You’ll notice that only one of the adverbs on this list (“finally”) ends in “ly”. That word is also coincidentally the top word on Saunders’ list. Notice also that the list above includes the kind of sentence adverbs that Saunders’ search deliberately excludes – those not ending in -ly, such as “for” and “in” – despite the examples of such given on the Wikipedia page that Saunders linked to in his post. (For those wondering, the POS-tagger treated these as parts of adverbial phrases, hence the “REX21” and “RR21” tags.)

Searching for only those sentence initial adverbs that end in -ly, we find a list similar to Saunders’, but with only five of the same words on it. (Saunders’ top ten are: finally, additionally, interestingly, recently, importantly, similarly, surprisingly, specifically, conversely, consequentially)

Top ten sentence initial adverbs ending in -ly in medical and science academic texts in COCA.

So what does this tell us? Well, for starters, my shooting-from-the-hip research is insufficient to draw any great conclusions from, even if it is more systematic than Saunders’. Seeing what adverbs are used to start sentences doesn’t really tell us much about, for example, what the journals, authors, or results of the papers are like. This is the mistake that Mr. Saunders makes in his conclusions. After ranking the usage frequencies of surprising by journal, he writes:

The message seems clear: go with a Nature or specialist PLoS journal if your results are surprising.

Unfortunately for Mr. Saunders, a linguist would find the message anything but clear. For starters, the relative use of surprising in a journal does not tell us that the results in the articles are actually surprising, but rather that the authors wish to present their results as surprising. That is, as long as the word surprising in the articles is not preceded by Our results are not. This is another problem with Mr. Saunders’ conclusions – not placing his results in context – and it is something that linguists would research, perhaps by scrolling through the concordances using corpus linguistics software, or software designed exactly for the type of research that Mr. Saunders wished to do.

The second thing to notice about my results is that they probably look a whole lot more boring than Saunders’. Such is the nature of researching things that people think matter (like those nasty little adverbs) but that professionals know really don’t. So it goes.

Finally, what we really should be looking at is how scientists use adverbs in comparison to other writers. I chose to contrast the frequencies of sentence-initial adverbs in the medical and science/technology articles with the frequencies found in academic articles from the (oft-disparaged) humanities. (Here is the link to that search.)

Top ten sentence initial adverbs in humanities academic texts in COCA.

Six of the top ten sentence initial adverbs in the humanities texts are also on the list for the (hard) science texts. What does this tell us? Again, not much. But we can get an idea that either the styles in the two subjects are not that different, or that sentence initial adverbs might be similar across other genres as well (since the words on these lists look rather pedestrian). We won’t know, of course, until we do more research. And if you really want to know, I suggest you do some corpus searches of your own because the end of this blog post is long overdue.

I also think I’ve picked on Mr. Saunders enough. After all, it’s not really his fault if he didn’t do as I have suggested. How was he supposed to know all these corpora are available? He’s a bioinformatician, not a corpus linguist. And yet, sadly, he’s the one who gets written up in the Smithsonian’s blog, even though linguists have been publishing about these matters since at least the late 1980s.

Before I end, though, I want to offer a word of warning. Although I said that anyone who knows where to look can and should do their own corpus linguistic research, and although I tried to keep my searches as simple as possible, I couldn’t have done them without my background in linguistics. Doing linguistic research on Big Data is tempting. But doing linguistic research on a corpus, especially one that you compiled yourself, can be misleading at best and flat out wrong at worst if you don’t know what you’re doing. The problem is that Mr. Saunders isn’t alone. I’ve seen other non-linguists try this type of research. My message here is similar to the one in my previous post, which was directed to marketers: linguistic research is interesting and it can tell you a lot about the subject of your interest, but only if you do it right. So get a linguist to do it or see if a linguist has already done it. If neither of these is possible, then feel free to do your own research, but tread lightly, young padawans.

If you’re wondering whether academia overuses adverbs (hint: it doesn’t) or just how much adverbs get tossed into academic articles, I recommend reading papers written by Douglas Biber and/or Susan Conrad. They have published extensively on the linguistic nature of many different writing genres. Here’s a link to a Google Scholar search to get you started. You can also have a look at the Longman Grammar, which is probably available at your library.

Book review: Punctuation..? by User Design

This is by far the hippest book on punctuation I’ve ever read. That may sound strange, but I study linguistics, so I’ve read a few good books on punctuation.

Front and back covers of Punctuation..?
Front and back covers of Punctuation..?

Punctuation..? intends to explain the “functions and correct uses of 21 of the most used punctuation marks.” I say “intends” because it’s always a toss-up with grammar books. Some people get very picky about what is verboten in written and spoken English. The problem is that when these people get bent out of shape one too many times, they start convincing publishers to bind their rantings and ravings.

But Punctuation..? takes a different approach. The slick, minimalist artwork matches the concise and reasonable explanations of punctuation marks. This book will not tell you that you’re going to die poor and lonely if you don’t use an Oxford comma. Instead it very succinctly explains what a comma is and how it is used.

According to the book’s website, Punctuation..? is for “a wide age range (young to ageing) and intelligence (emerging to expert).” As someone who probably resides on the more expert end of punctuation intelligence, or who at least doesn’t need to be told what an ellipsis is, I still found this book enjoyable for two reasons.

First, the explanations are not only easy to understand, they’re also correct. This is kind of important for educational books. While it was nice that the interpunct (·) and pilcrow (¶) were included, it was even better that the semicolon got some (well deserved) respect and that the exclamation point came with a word of caution.

Pages 34 and 35, which feature some semicolon love.
Pages 34 and 35, which feature some semicolon love.

Second, although Punctuation..? is of more practical benefit to learners of English, it’s probably more of a joy to language enthusiasts because the book is actually funny. If a punctuation book has you laughing, I think that’s a good sign.

I guess the only problem I had with this book was its definition of a noun, which was a little too traditional for my tastes (you know the one). But I think that’s neither here nor there, since if you have another definition for a noun, you’re probably a linguist. And in that case you’ll just be glad to see such a cool book about punctuation aimed at a wide audience.

Check out the User Design website for more info and links to where you can buy it.


Up next: A twenty-years-too-late look at a seminal work in pragmatics, Cross-cultural pragmatics: the semantics of human interaction by Anna Wierzbicka.

Autocorrected

James Gleick has a recent article in the New York Times about Autocorrect (“Auto Crrect Ths!” – Aug. 4, 2012), that bane of impatient texters and Tweeters everywhere. Besides recounting some of the more hilarious and embarrassing autocorrections out there, he deftly explains how Autocorrect works and how it is advancing as computers get better at making predictions.

But in the second to last paragraph, he missteps. He writes:

One more thing to worry about: the better Autocorrect gets, the more we will come to rely on it. It’s happening already. People who yesterday unlearned arithmetic will soon forget how to spell. One by one we are outsourcing our mental functions to the global prosthetic brain.

I don’t know whether Mr. Gleick’s writing was the victim of an editor trying to save space, but that seems unlikely since there’s room on the internet for a bit of qualification, which is what could save these statements from being common cases of declinism. Let me explain.

“People who yesterday unlearned arithmetic” probably refers to the use of calculators. But I would hesitate to say that the power and ubiquity of modern calculators has caused people to unlearn arithmetic. Let’s take a simple calculation such as 4 x 4. Anyone punching it into a calculator knows the arithmetic behind it. If the answer comes back as 0 or 8 or 1 or even 20, they are more than likely to realize something went wrong, namely that they pressed the minus or plus button instead of the multiplication button. Likewise, they know the arithmetic behind 231 x 47.06.

Mr. Gleick implies that the efficiency of calculators has caused people to rely too much on them. But this is backwards. The more difficult that calculations get, the more arithmetical knowledge a user is likely to have. Relying on a machine to tell me the square root of 144 doesn’t necessarily mean I “unlearned” arithmetic. It only means that I trust the calculator to give me the correct answer to the equation I gave it. If I trust that I pressed the buttons in the right order, the answer I am given will be sufficient for me, even if I do not know how to work out the equation with pen and paper. I doubt any mathematicians out there are worried about “unlearning” arithmetic because of the power of their calculators. Rather, they’re probably more worried about how to enter the equations correctly. And just like I know 8 is not the answer to 4 x 4, they probably know x = 45 is not the answer to x² + 2x – 4 = 0.
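That kind of sanity check, knowing an answer is wrong even though a machine did the computing, amounts to plugging the candidate answer back into the equation. A quick Python sketch of the idea:

```python
# Check whether a proposed answer actually satisfies x^2 + 2x - 4 = 0,
# the way a mathematician would sniff-test a calculator's output.
import math

def is_root(x, a=1, b=2, c=-4):
    """Return True if x satisfies ax^2 + bx + c = 0 (within rounding error)."""
    return abs(a * x**2 + b * x + c) < 1e-9

print(is_root(45))  # x = 45 is clearly not a root

# The actual positive root, via the quadratic formula: (-b + sqrt(b^2 - 4ac)) / 2a
root = (-2 + math.sqrt(2**2 - 4 * 1 * -4)) / (2 * 1)
print(is_root(root))
```

You don’t need to be able to extract square roots by hand to run this check; you just need to know what the equation means. That’s the point.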

Taking the analogy to language, we see the same thing. Not being able to spell quixotic, but knowing that chaotic is not the word I’m looking for, does not mean that I have lost the ability to spell. It merely means that I have enough trust in my Autocorrect to suggest the correct word I’m looking for. If it throws something else at me, I’ll consult a dictionary.

If the Autocorrect cannot give me the correct word I’m looking for because it is a recent coinage, there may not be a standard spelling yet, in which case I am able to disregard any suggestions. I’ll spell the word as I want and trust the reader to understand it. Ya dig?

None of the infamous stories of Autocorrect turning normal language into gibberish involve someone who didn’t know how to spell. None of them end with someone pleading for the correct spelling of whatever word Autocorrect mangled. As Autocorrect gets better, people will just learn to trust its suggestions more with words that are difficult to spell. This doesn’t mean we have lost the ability to spell. Spelling in English is a tour de force in memorization because the spelling of English words is a notorious mess. If all I can remember is that the word I’m looking for has a q and an x in it, does it really mean I have unlearned how to spell or that I have just forgotten the exact spelling of quixotic and am willing to trust Autocorrect’s suggestion?
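For the curious, that suggestion mechanism can be sketched in a few lines of Python: pick the dictionary word closest to the typo by edit distance. Real autocorrectors also weight by word frequency, context, and keyboard layout (and the four-word “dictionary” here is obviously a toy), so treat this as the bare core of the idea:

```python
# A bare-bones sketch of autocorrect-style suggestion: find the dictionary
# word with the smallest Levenshtein (edit) distance to the typo.
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

dictionary = ["quixotic", "chaotic", "exotic", "quiet"]  # toy word list

def suggest(typo):
    return min(dictionary, key=lambda w: edit_distance(typo, w))

print(suggest("quixotc"))  # → quixotic
```

Notice what the sketch requires of the user: remembering enough of the word (the q, the x) to produce a typo that is close to the target. That’s the “basic understanding” argued for above.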

Learning arithmetic is learning a system. Once you know how 2 x 2 works, you can multiply any numbers. The English spelling system is nowhere near a system like arithmetic, so the analogy Mr. Gleick used doesn’t really work for this reason either. But there is one thing that spelling and arithmetic have in common when it comes to computers. Calculators and Autocorrect are only beneficial to those who already have at least a basic understanding of arithmetic and spelling. The advance of Autocorrect will have the same effect on people’s ability to spell as the advance of calculators did on people’s ability to do arithmetic, which is to say, hardly any effect at all.

By the way, I once looked up took (meaning the past tense of take) in a dictionary because after writing it I was sure that wasn’t the way to spell it. And that’s my memory getting worse, not my Autocorrect unlearning me.

[Update – Aug. 6, 2012] If our spelling really does go down the drain, it should at least make this kind of spelling bee more interesting (if only it were true).

Poetry and Prose, Computers and Code

Back in February, I analyzed WordPress’s automated grammar checker, After the Deadline, by running some famous and well-regarded pieces of prose through it. I found the program lacking. What I wrote was:

If you have understood this article so far, you already know more about writing than After the Deadline. It will not improve your writing. It will most likely make it worse. Contrary to what is claimed on its homepage, you will not write better and you will spend more time editing.

I think my test of After the Deadline proved its inadequacy, especially since I noticed that the program finds errors in its own suggestions. Talk about needing to heed your own advice…

A comment by one of the program’s developers, Raphael Mudge, however, got me thinking about what benefit (if any) automatic grammar checkers can offer. Mr. Mudge noted that the program was written for bloggers so running famous prose through it was not fair. He is right about that, but as I replied, the problem with automated grammar checkers really lies with the confidence and capability of writers who use them:

[The effect that computer grammar checkers could have on uncertain writers] is even more important when we think of running After the Deadline against a random sample of blog posts, as you suggest. While that would be more fair than what I did, it wouldn’t necessarily tell us anything. What’s needed is a second step of deciding which editing suggestions will be accepted. If we accept only the correct suggestions, we assume an extremely capable author who is therefore not in need of the program. As the threshold for our accepted suggestions lowers, however, we will begin to see a muddying of the waters – the more poorly written posts will be made better, but the more well written posts will be made worse. The question then becomes: where do we draw the line on accepted suggestions to ensure that the program is not doing more harm than good? That will decide the program’s worth, in my opinion.

As it turns out, after that review of After the Deadline, I was contacted by someone from Grammarly, another automated grammar checker. For some reason they wanted me to review their program. I said sure, I’d love to, and then I promptly did nothing. In truth, I was sidetracked by other things – kids, work, beer, school, the NHL playoffs, more beer, and recycling. So much for that.

Now R.L.G. over at the Economist’s Johnson blog has a post about these programs and a short discussion of Ben Yagoda’s review of Grammarly at Lingua Franca, a Chronicle of Higher Education blog. I want to quickly review these posts and add to my thoughts about these programs.

First, R.L.G. rightly points out that “computers can be very good at parsing natural language, finding determiners and noun phrases and verb phrases and organising them into trees.” I’m happy to agree with that. Part-of-speech taggers alone are amazing, and they open up new ways of researching language. But, as he again rightly points out, “Online grammar coaches and style checkers will be snake oil for some time, precisely due to some of the things that separate formal and natural languages.”
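To give a feel for what a tagger does, here’s a toy unigram baseline in Python: assign each word its most frequent tag from some training data (the training pairs below are invented for illustration). Real taggers use context and far richer models; this is only the core idea:

```python
# Toy unigram POS tagger: tag each word with its most frequent training tag.
# Penn Treebank tags are used (VB = verb, NN = noun, DT = determiner, JJ = adjective).
from collections import Counter, defaultdict

training = [  # invented (word, tag) training pairs
    ("run", "VB"), ("run", "VB"), ("run", "NN"),
    ("check", "VB"), ("check", "NN"), ("check", "VB"),
    ("the", "DT"), ("quick", "JJ"),
]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def tag(word):
    """Return the word's most frequent tag, or 'NN' as a default guess."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return "NN"  # unseen words are most often nouns

print([(w, tag(w)) for w in ["check", "the", "run"]])
```

Notice that a unigram tagger can never tell the verb run from the noun run; it always guesses the more frequent one. That’s exactly why real taggers look at the surrounding words, and why “more verbs, fewer nouns” advice built on naive counts should make you suspicious.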

Second, Mr. Yagoda’s review of Grammarly is spot on. (I’m impressed by how much he was able to do with only a five-day trial. They gave me three months, Ben. Have your people call mine.) Not to take anything away from Mr. Yagoda, but reviewing these checkers is like shooting fish in a barrel because they’re pretty awful. A rudimentary understanding of writing is enough to protect you from their “corrections”. But it’s the lofty claims of these programs that make testing them irresistible to people like Mr. Yagoda and myself.

So who uses automated grammar checkers and who could possibly benefit from them? The answer takes us back to the confidence of writers. Obviously, writers like R.L.G. and Ben Yagoda are out of the question. As I noted in my comment to Mr. Mudge, the developer of After the Deadline, “a confident writer doesn’t need computer grammar checkers for a variety of reasons, so it’s the uncertain writers that matter. They may have perfect grammar, but be led astray by a computer grammar checker.” It’s even worse if we take into account Mr. Yagoda’s point that “when it comes to computer programs evaluating prose, the cards never tell the truth.”

We do not have computers that can edit prose, not even close. What we have right now are inadequate grammar checkers that may be doing more harm than good since the suggestions they make are either useless or flat out wrong. They are also being peddled to writers who may not be able to recognize how bad they are. So there’s a danger that competent but insecure writers will follow the program’s misguided attempts to improve their prose.

It’s strange that Grammarly would ask Mr. Yagoda or myself to review their program since Mr. Yagoda is clearly immune to the program’s snake oil charm and I wasn’t exactly kind to After the Deadline. But such bad business decisions might prove helpful for everyone. Respected writers will point out the inadequacy of these automatic grammar checkers, which will hopefully influence people to not use them. At the same time, until these programs can really prove their worth – or at least not make their inadequacy so glaringly obvious – they will not receive any good press from those who know how to write (nor will they get any from lowly bloggers like myself). In this case, any press is not good press since anyone reading R.L.G. or Ben Yagoda’s discussion of automated grammar checkers is unlikely to use one, especially if they have to pay for it.

[Update – Aug. 9, 2012] R.L.G. at Johnson, the Economist’s language blog that I linked to above, heard from Grammarly’s chief executive about what the program was meant for (“to proofread mainstream text like student papers, cover letters and proposals”). So he decided to put Grammarly through some more tests. Want to guess how it did? Check it.

The Problem with Computer Grammar Checkers [Updated]

When I moved this blog over to WordPress, I noticed that under the Users > Personal Settings page there is an option to turn on a computer proofreader. The program is from Automattic (the same people that make WordPress) and it’s called After the Deadline. While an automatic proofreader isn’t anything spectacular in itself, the grammar and style mistakes that this proofreader can supposedly prevent you from making are eye-popping:

bias language, cliches, complex phrases, diacritical marks, double negatives, hidden verbs, jargon, passive voice, phrases to avoid, and redundant phrases.

It’s an impressive looking list, but anyone with even mediocre writing skills and experience with computer proofreaders is likely to be wary. How often has Microsoft Word mistakenly underlined some of your text? How many times has your smartphone autocorrected you into incomprehension?

The thing is, when presented with such a list, even a confident writer couldn’t be blamed for being curious. Are you unwittingly making grammar mistakes in your carefully crafted prose? Have you been straying outside the accepted limits of complex and redundant phrases? Are there verbs hiding in your text? And holy shit, what the hell are diacritical marks?

Let’s put those ridiculous questions aside for a moment. Many people have pointed out what’s wrong with automatic spelling and grammar checkers. What I want to do here is show you why there are problems with these programs by using some highly regarded prose.

Let’s fire up the incinerator.

"To the Lighthouse" by Virginia Woolf*

At the first green line, After the Deadline suggests, “Did you mean… ‘its fine tomorrow?’” Things are not off to a good start. The three other green lines warn me (or Ms. Woolf) about the Dreaded Passive Voice™. The blue line suggests that “Complex Expression” be changed to “plans.” But perhaps the worst suggestion is given by clicking on the red line – “Did you mean… ‘sense,’ ‘cents,’ ‘scents?’” Moving on…

"Sense and Sensibility" by Jane Austen

The blue line is another “Complex Expression,” which After the Deadline suggests be changed to “way.” That’s not so bad. The green line, however, is (according to the proofreader) an example of a “Hidden Verb.” What’s a hidden verb, you ask? As After the Deadline explains, “A hidden verb (aka nominalization) is a verb made into a noun. They often need extra words to make sense. Strong verbs are easier to read and use less words.” But this doesn’t make any sense. “Constant” has not been nominalized, while “had” is one of the most common (and easiest to read) verbs in English. I’m told to “revise ‘had a constant’ to bring out the verb,” but I don’t know what that means. Alert readers will begin to see the problem here. So will everyone else.

"Great Expectations" by Charles Dickens

Here’s the Dreaded Passive Voice™ again. Geoffrey Pullum would have a fit with this program (comments are open, Geoff! Let us know how you really feel!). I guess the proofreader wants me to change the sentence to something like, “So I called myself Pip, and people called me Pip?” It was the best of times, it was the blurst of times.
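The passive-voice flags are a good place to show why rule-based checks misfire. Here’s a naive detector in Python of the kind these tools are often accused of using (“a form of be followed by a word ending in -ed”; this rule is my invention for illustration, not After the Deadline’s actual code). It both misses real passives and flags non-passives:

```python
# Naive passive-voice "detector": flag any form of 'be' followed by an -ed word.
# Illustrative only; real passive detection requires actual parsing.
import re

NAIVE_PASSIVE = re.compile(r"\b(am|is|are|was|were|been|being|be)\s+\w+ed\b", re.I)

examples = [
    "The ball was thrown by the pitcher.",  # true passive, but 'thrown' lacks -ed: missed
    "She was excited about the game.",      # adjectival, not passive: false alarm
    "I called myself Pip.",                 # active: correctly ignored
]

for sentence in examples:
    print(bool(NAIVE_PASSIVE.search(sentence)), sentence)
```

One miss and one false alarm out of three sentences. Multiply that error rate across a novel and you get the green-line carnage described above.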

"The Jungle" by Upton Sinclair

All I really need to say here is that the second green line says “Hyphen Required” and suggests I change the phrase to “out-of-the-way.” Really? Yes, really.

To be sure, I ran some other styles of writing through After the Deadline, such as Pulitzer Prize winners, and got the same results. You’re welcome to run anything you want through there, but I’ve got $20 saying you’re going to get the same nonsense I did.

Getting back to those ridiculous questions, the answers are all irrelevant. If you have understood this article so far, you already know more about writing than After the Deadline. It will not improve your writing. It will most likely make it worse. Contrary to what is claimed on its homepage, you will not write better and you will spend more time editing.

I can’t believe anyone except the most inexperienced writers would be fooled by After the Deadline’s “corrections.” This isn’t exactly surprising when it comes to grammar checkers because they are at best useless and at worst harmful. But the way in which we rely on technology threatens to undermine our own writing. Insecure writers might be tricked into believing that After the Deadline’s suggestions are legit. And that is the real problem with these programs. The ratio of harm to good they do keeps climbing, since it’s almost impossible for them to do any good.

Finally, I’d just like to add that when I used After the Deadline on this post, two terms were underlined in the explanation of hidden verbs:

“A hidden verb (aka nominalization) is a verb made into a noun. They often need extra words to make sense. Strong verbs are easier to read and use less words.”

The program says that nominalization isn’t a word and that I should write “fewer words” instead of “less words.” But that is a quote from the program itself! If even the makers of After the Deadline can’t (or won’t) follow their own guidelines, why should you?

And so I have decided to destroy the machine. Feeding this next piece of prose into your grammar checker is equivalent to setting its controls for the heart of the sun.

riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs.
Finnegans Wake by James Joyce


[Update – Feb. 28, 2011] It’s always nice when someone with first-hand knowledge weighs in on the discussion. In this case, former After the Deadline developer Raphael Mudge was kind enough to stop by and leave his thoughts, to which I responded below.

[Update – Mar. 16, 2012] I heard from the WordPress staff about why they chose to incorporate After the Deadline into their software. Actually, I was directed to the post on the WordPress.com blog about the incorporation. I’m a bit disappointed in this, however. First, although the WordPress staff tells me that “There are many reasons to explain why we chose this service to help WordPress.com users with their writing, but you can read our announcement post for the full details,” their post is not full of “details.” Second, neither the email I got nor the WordPress blog post addresses any of the problems with automatic grammar or spell checkers. Oh well.

But most importantly, I don’t think the author of the post is serious when he says he “was blown away” by After the Deadline. Did he run his own post through there? What the hell did it look like before he did? And why didn’t he accept all of the suggestions? And judging by the comments on the post, it’s about time a psychologist ran a study feeding people deliberately incorrect grammar-checker suggestions just to see how blindly they will obey their master.

By the way, running this update through AtD underlines “incorporate,” “was directed,” and “all of the.” Feel free to guess why if you really have nothing better to do.


*So much for only using said to carry dialogue, amIright, Elmore? Way to go, Virginia, you dope.

Write as Elmore Leonard Says, Not as Elmore Leonard Does

Speaking of how to write well, Dangerous Minds contributor Paul Gallagher has posted Elmore Leonard’s “10 Rules for Writing Fiction.” The list is from a 2001 New York Times article and it goes like this:

1. Never open a book with the weather.
2. Avoid prologues.
3. Never use a verb other than “said” to carry dialogue.
4. Never use an adverb to modify the verb “said”…
5. Keep your exclamation points under control.
6. Never use the words “suddenly” or “all hell broke loose.”
7. Use regional dialect, patois, sparingly.
8. Avoid detailed descriptions of characters.
9. Don’t go into great detail describing places and things.
10. Try to leave out the part that readers tend to skip.

I commented that Leonard breaks at least two of these rules in his books. That knowledge came from five minutes’ worth of what we in the business call “using the internet.” I’m not going to spend more time on this, but I wanted to relay another comment by witzed that had me rolling:

A lot of people don’t know it, but Elmore Leonard is also an architect, and he has some really good rules for that, too.
1) Do not start with the roof.
2) Make sure there is another room on the other side of the door.
3) Carpeting must always face left.
4) No boilers in the elevator!
5) All arbitrary lists must have at least ten items.
6) A bedroom is not a bowling alley.
7) Make up your mind: shoes or no shoes.
8) Think first: Is this supposed to be a bathroom?
9) Pay attention to the stuff everyone can see.

Remember, folks, the best players often make the worst coaches. For more hilarity, check out the contest held by the National Post and CBC Radio. The contest is closed, but you can check the comments.

The Real Reason Short Words Are Best

In the opening of a recent Macmillan Dictionary Blog post, Robert Lane Greene quotes the editor of the Economist’s style guide, who in turn quotes Winston Churchill as saying “Short words are best, and old words, when short, are best of all.” Greene then goes on to discuss how difficult it is to write clearly. If you think you’ve heard this one before, don’t. Greene’s post is brief, practical, and a touch insightful. He believes that journalists often get a “bad rap” as writers of plain English because of the schedules they are under. I can go along with that.

Greene also says that metaphors are one way writers can improve. He says there are “three ways to use a metaphor to get ideas across, and two of them are bad.” The two bad kinds are tired metaphors and strained metaphors. Greene suggests using the best kind of metaphors, those that are “simple, clear, memorable and quite often short.”

Greene uses the conventional meaning of “metaphor,” of course, since that’s how most people still understand the term. But the updated meaning shows us that phrases like on Wednesday and the sun came out are also metaphors (for those unfamiliar, think about actually putting something on a day in the way we put something on a table). This realization of metaphors lurking all around our language is important because it adds what I think is the most important element to Greene’s (and Churchill’s and the unnamed Economist editor’s) belief that short words are best. (Don’t worry, I’m not going to get into Conceptual Metaphor Theory or Blending. I’m trying to keep your attention, believe it or not.)

Consider the opening to Greene’s post, which is really the opening to the Economist editorial:

“Short words are best, and old words, when short, are best of all.” Thus, quoting Winston Churchill, began an editorial in The Economist that consisted entirely of one-syllable words. It went on:
“AND, not for the first time, he was right: short words are best. Plain they may be, but that is their strength. They are clear, sharp and to the point. You can get your tongue round them. You can spell them. Eye, brain and mouth work as one to greet them as friends, not foes. For that is what they are.”

Churchill, Greene, and our anonymous editor aren’t the only ones that love short words. You’ll hear language gurus promoting them all over the place. It’s a common idea, but a good one. It goes: Keep it simple, stupid.

And yet, I can’t help feeling that short words are anything but “plain.” The more I think about them, the more I realize that short words are downright complex, especially ones like prepositions. For example, you know what on, of, at, in, etc. mean, but could you define them? It’s pretty tough when you think about it. Fortunately, every language has a way of expressing the notions that prepositions in English express, such as spatial relations. So when you encounter a new language, no matter if it has prepositions or suffixes doing the job of English prepositions, you will be able to understand them. That’s not plain, in my mind. Prepositions do some complicated things.

I don’t think Greene, Churchill or Mr. Editor were talking about prepositions, though. So let’s think about some other short and “plain” words. The English word set, according to Macmillan, has fifteen definitions. Stand has seventeen definitions. Run has nineteen. And that’s not counting the entries for phrases that include these words.

These are not plain words. Short words are not great because they are “to the point,” but because they are to so many points. The fact is, I can do a lot more with set, stand, and run than I can with Australopithecus, midi-chlorians, and Tyrannosaurus rex. That’s because English packs a lot of information into little tiny words.

Or, then again, maybe it doesn’t. Sometimes we’re forced to say yesterday or tomorrow, pretentious or university (two unrelated words), Superman or Professor Xavier. That’s just the way things are.

In his editorial, the Masked Editor uses literature as an example of what can be done with short words – “to be or not to be,” “The year’s at the spring/And day’s at the morn…/The lark’s on the wing;/The snail’s on the thorn.” But he’s using a double-edged sword and he’s not using it well. Sometimes people write things like this:

Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore–
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door–
“‘Tis some visitor,” I muttered, “tapping at my chamber door–
Only this and nothing more.”

Or this:

In my younger and more vulnerable years my father gave me some advice that I’ve been turning over in my mind ever since.

Or this:

All this happened, more or less. The war parts, anyway, are pretty much true. One guy I knew really was shot in Dresden for taking a teapot that wasn’t his. Another guy I knew really did threaten to have his personal enemies killed by hired gunmen after the war. And so on. I’ve changed all the names.

So it goes. I guess some folks know what they’re doing. Neither Churchill, nor Greene, nor the artist formerly known as an editor tells us what a short word is. One syllable? Two? Three is stretching it, I guess.

The point is, Greene, Churchill, and the editor who wasn’t there are correct. Everyone should keep it simple (stupid). They should do that all the time. It’s a good rule to follow. But we should realize that in English our “short and simple” words are often only the former, not the latter. I’m not picking on Greene, who I think is a great journalist (seriously, DuckDuckGo his name, read his articles, watch his TED Talks). It’s just that his article made me think of this idea, which has probably been brewing for a while.

By the way, here are the first three sentences of The Gathering Storm, the first book in a series which won Churchill the Nobel Prize in Literature:

After the end of the World War of 1914 there was a deep conviction and almost universal hope that peace would reign in the world. This heart’s desire of all the peoples could easily have been gained by steadfastness in righteous convictions, and by reasonable common sense and prudence. The phrase “the war to end war” was on every lip, and measures had been taken to turn it into a reality.

And then there’s the rest of Hamlet’s speech that our friendly neighborhood editor used as an example. Guess what, there’s disagreement over its meaning. So much for short words. Shakespeare does away with them after the first line. To wit:

To be, or not to be, that is the question:
Whether ’tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing end them: to die, to sleep
No more; and by a sleep, to say we end
The heart-ache, and the thousand Natural shocks
That Flesh is heir to? ‘Tis a consummation
Devoutly to be wished. To die to sleep,
To sleep, perchance to Dream; Ay, there’s the rub,
For in that sleep of death, what dreams may come…

[Update: Mr. Greene was kind enough to drop by and leave a link to his reply, which you can find here.]