This post is a response to a corpus search done on another blog. Over on What You’re Doing Is Rather Desperate, Neil Saunders wanted to research how adverbs are used in academic articles, specifically the sentence adverb, or as he says, adverbs which are used “with a comma to make a point at the start of a sentence”. I’m not trying to pick on Mr. Saunders (because what he did was pretty great for a non-linguist), but I think his post, and the media reports on it, make a great excuse to write about the really, really awesome corpus linguistics resources available to the public. I’ll go through what Mr. Saunders did, and list what he could have done had he known about corpus linguistics.
Mr. Saunders wanted to know about sentence adverbs in academic texts, so he wrote a script to download abstracts from PubMed Central. Right off the bat, he could have gone looking for either (1) articles on sentence adverbs or (2) already available corpora. As I pointed out in a comment on his post (which has mysteriously disappeared, probably due to the URLs I included in it), there are corpora with science texts from as far back as 1375 AD. There are also modern alternatives, such as the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC), both of which (and much, much more) are available through Mark Davies’ awesome site.
I bring this up because there are several benefits of using these corpora instead of compiling your own, especially if you’re not a linguist. The first is time and space. Saunders says that his uncompressed corpus of abstracts is 47 GB (!) and that it took “overnight” (double !) for his script to comb through the abstracts. Using an online corpus drops the space required on your home machine down to 0 GB. And running searches on COCA, which contains 450 million words, takes a matter of seconds.
The second benefit is a pretty major one for linguists. After noting that his search only looks for words ending in -ly, Saunders says:
There will of course be false positives – words ending with “ly,” that are not adverbs. Some of these include: the month of July, the country of Italy, surnames such as Whitely, medical conditions such as renomegaly and typographical errors such as “Findingsinitially“. These examples are uncommon and I just ignore them where they occur.
This is a big deal. First of all, the idea of using “ly” as a way to search for adverbs is profoundly misguided. Saunders seems to realize this, since he notes that not all words that end in -ly are adverbs. But where he really goes wrong, as we’ll soon see, is in disregarding all of the adverbs that do not end in -ly. If Saunders had used a corpus that already had each word tagged for its part of speech (POS), or if he had run a POS-tagger on his own corpus, he could have had an accurate measurement of the use of adverbs in academic articles. This is because POS-tagging allows researchers to find adverbs, adjectives, nouns, etc., as well as to search for words that end in -ly – or even just adverbs that end in -ly. And remember, it can all be done in a matter of moments (even the POS tagging). You won’t even have time to make a cup of coffee, although consumption of caffeinated beverages is highly recommended when doing linguistics (unless you’re at a conference, in which case you should substitute alcohol for caffeine).
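To make the difference concrete, here is a minimal Python sketch that contrasts the two approaches on a toy sample. Everything in it is an assumption made for illustration: the two sentences are invented, the tags are hand-assigned in the style of the Universal POS labels (ADV, NOUN, PROPN and so on) rather than produced by a real tagger, and the function names are hypothetical. It is not Saunders’ script and it is not how COCA works internally.

```python
# Toy, hand-tagged sample: each sentence is a list of (word, tag) pairs.
# The tags are assigned by hand for illustration, not by a real POS-tagger.
SENTENCES = [
    [("However", "ADV"), (",", "PUNCT"), ("results", "NOUN"), ("in", "ADP"),
     ("July", "PROPN"), ("were", "AUX"), ("surprisingly", "ADV"),
     ("weak", "ADJ"), (".", "PUNCT")],
    [("Thus", "ADV"), (",", "PUNCT"), ("renomegaly", "NOUN"), ("was", "AUX"),
     ("often", "ADV"), ("reported", "VERB"), ("in", "ADP"),
     ("Italy", "PROPN"), (".", "PUNCT")],
]


def ly_suffix_hits(sentences):
    """Suffix heuristic: every word ending in 'ly', adverb or not."""
    return [word for sent in sentences for word, _ in sent
            if word.lower().endswith("ly")]


def tagged_adverbs(sentences):
    """Tag-based search: every word tagged as an adverb, whatever its ending."""
    return [word for sent in sentences for word, tag in sent if tag == "ADV"]


def sentence_initial_adverbs(sentences):
    """Adverbs that occupy the first slot of a sentence."""
    return [sent[0][0] for sent in sentences if sent and sent[0][1] == "ADV"]


if __name__ == "__main__":
    print("ends in -ly:     ", ly_suffix_hits(SENTENCES))
    print("tagged as adverb:", tagged_adverbs(SENTENCES))
    print("sentence-initial:", sentence_initial_adverbs(SENTENCES))
```

On this sample the suffix heuristic picks up “July”, “renomegaly” and “Italy” while missing “However”, “Thus” and “often”, which is exactly the pair of problems described above. With a real corpus the tags would come from a tagger or from the corpus itself, and they would not be perfect either.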
Here is where I break from following Saunders’ method. I would like to show you what’s possible with some of the publicly available corpora online, or how a linguist would conduct an inquiry into the use of adverbs in academia.
Looking for sentence-initial adverbs in academic texts, I went to COCA. I know the COCA interface can seem a bit daunting to the uninitiated, but there are very clear instructions (with examples) of how to do everything. Just remember: if confusion persists for more than four hours, consult your local linguist.
On the COCA page, I searched for adverbs coming after a period, that is, sentence-initial adverbs, in the Medical and Science/Technology texts in the Academic section (Click here to rerun my exact search on COCA. Just hit “Search” on the left when you get there). Here’s what I came up with:
You’ll notice that only one of the adverbs on this list (“finally”) ends in “ly”. Coincidentally, that word is also the top word on Saunders’ list. Notice also that the list above includes the kind of sentence adverbs that Saunders’ search deliberately leaves out, namely those not ending in -ly, such as “for” and “in”, despite the examples of such adverbs given on the Wikipedia page that Saunders linked to in his post. (For those wondering, the POS-tagger treated these as parts of adverbial phrases, hence the “REX21” and “RR21” tags.)
Searching for only those sentence-initial adverbs that end in -ly, we find a list similar to Saunders’, but with only five of the same words on it. (Saunders’ top ten are: finally, additionally, interestingly, recently, importantly, similarly, surprisingly, specifically, conversely, consequentially.)
So what does this tell us? Well, for starters, my shooting-from-the-hip research is insufficient to draw any great conclusions from, even if it is more systematic than Saunders’. Seeing what adverbs are used to start sentences doesn’t really tell us much about, for example, what the journals, authors, or results of the papers are like. This is the mistake that Mr. Saunders makes in his conclusions. After ranking the usage frequencies of surprising by journal, he writes:
The message seems clear: go with a Nature or specialist PLoS journal if your results are surprising.
Unfortunately for Mr. Saunders, a linguist would find the message anything but clear. For starters, the relative use of “surprising” in a journal does not tell us that the results in the articles are actually surprising, but rather that the authors wish to present their results as surprising. And even that holds only if the word “surprising” in the articles is not preceded by “Our results are not”. This is another problem with Mr. Saunders’ conclusions – not placing his results in context – and it is something that linguists would research, perhaps by scrolling through the concordances using corpus linguistics software, that is, software designed exactly for the type of research that Mr. Saunders wished to do.
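For readers who have never seen a concordance, the idea can be sketched in a few lines of Python: pull out every hit for a word together with a few words of context on either side, then look at what surrounds it. This is only a rough illustration under stated assumptions. The sample text is invented, the function names and the short list of negators are hypothetical, and real concordance software does far more (and respects sentence boundaries, which this sketch does not).

```python
import re

# Invented sample text, used only to illustrate the idea.
SAMPLE = ("Our results are surprising given earlier work. "
          "Our results are not surprising in this light.")

# A deliberately tiny, hypothetical list of negating words.
NEGATORS = {"not", "no", "hardly"}


def concordance(text, keyword, window=4):
    """Return (left context, hit, right context) for each match of keyword.

    Context windows are counted in words and may cross sentence boundaries.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits


def is_negated(left_context, reach=2):
    """Crude check: is there a negator within `reach` words before the hit?"""
    return any(word.lower() in NEGATORS for word in left_context[-reach:])


if __name__ == "__main__":
    for left, hit, right in concordance(SAMPLE, "surprising"):
        flag = "NEGATED" if is_negated(left) else "       "
        print(f"{flag}  {' '.join(left):>25} [{hit}] {' '.join(right)}")
```

A raw frequency count would treat both sentences in the sample as evidence of “surprising” results. Reading the hits in context shows that one of them says the opposite.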
The second thing to notice about my results is that they probably look a whole lot more boring than Saunders’. Such is the nature of researching things that people think matter (like those nasty little adverbs), but professionals know really don’t. So it goes.
Finally, what we really should be looking at is how scientists use adverbs in comparison to other writers. I chose to contrast the frequencies of sentence-initial adverbs in the medical and science/technology articles with the frequencies found in academic articles from the (oft-disparaged) humanities. (Here is the link to that search.)
Six of the top ten sentence-initial adverbs in the humanities texts are also on the list for the (hard) science texts. What does this tell us? Again, not much. But we can get an idea that either the styles in the two subjects are not that different, or that sentence-initial adverbs might be similar across other genres as well (since the words on these lists look rather pedestrian). We won’t know, of course, until we do more research. And if you really want to know, I suggest you do some corpus searches of your own because the end of this blog post is long overdue.
I also think I’ve picked on Mr. Saunders enough. After all, it’s not really his fault if he didn’t do as I have suggested. How was he supposed to know all these corpora are available? He’s a bioinformatician, not a corpus linguist. And yet, sadly, he’s the one who gets written up in the Smithsonian’s blog, even though linguists have been publishing about these matters since at least the late 1980s.
Before I end, though, I want to offer a word of warning. Although I said that anyone who knows where to look can and should do their own corpus linguistic research, and although I tried to keep my searches as simple as possible, I couldn’t have done them without my background in linguistics. Doing linguistic research on Big Data is tempting. But doing linguistic research on a corpus, especially one that you compiled, can be misleading at best and flat out wrong at worst if you don’t know what you’re doing. The problem is that Mr. Saunders isn’t alone. I’ve seen other non-linguists try this type of research. My message here is similar to the one in my previous post, which was directed to marketers: linguistic research is interesting and it can tell you a lot about the subject of your interest, but only if you do it right. So get a linguist to do it or see if a linguist has already done it. If neither of these is possible, then feel free to do your own research, but tread lightly, young padawans.
If you’re wondering whether academia overuses adverbs (hint: it doesn’t) or just how many adverbs get tossed into academic articles, I recommend reading papers written by Douglas Biber and/or Susan Conrad. They have published extensively on the linguistic nature of many different writing genres. Here’s a link to a Google Scholar search to get you started. You can also have a look at the Longman Grammar, which is probably available at your library.