Unsurprisingly, corpus linguists have already answered your question

This post is a response to a corpus search done on another blog. Over on What You’re Doing Is Rather Desperate, Neil Saunders wanted to research how adverbs are used in academic articles, specifically the sentence adverb, or as he says, adverbs which are used “with a comma to make a point at the start of a sentence”. I’m not trying to pick on Mr. Saunders (because what he did was pretty great for a non-linguist), but I think his post, and the media reports on it, make a great excuse to write about the really, really awesome corpus linguistics resources available to the public. I’ll go through what Mr. Saunders did, and list what he could have done had he known about corpus linguistics.

Mr. Saunders wanted to know about sentence adverbs in academic texts, so he wrote a script to download abstracts from PubMed Central. Right off the bat, he could have gone looking for either (1) articles on sentence adverbs or (2) already available corpora. As I pointed out in a comment on his post (which has mysteriously disappeared, probably due to the URLs I put in it), there are corpora with science texts from as far back as 1375 AD. There are also modern alternatives, such as the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC), both of which (and much, much more) are available through Mark Davies’ awesome site.

I bring this up because there are several benefits of using these corpora instead of compiling your own, especially if you’re not a linguist. The first is time and space. Saunders says that his uncompressed corpus of abstracts is 47 GB (!) and that it took “overnight” (double !) for his script to comb through the abstracts. Using an online corpus drops the space required on your home machine down to 0 GB. And running searches on COCA, which contains 450 million words, takes a matter of seconds.

The second benefit is a pretty major one for linguists. After noting that his search only looks for words ending in -ly, Saunders says:

There will of course be false positives – words ending with “ly,” that are not adverbs. Some of these include: the month of July, the country of Italy, surnames such as Whitely, medical conditions such as renomegaly and typographical errors such as “Findingsinitially“. These examples are uncommon and I just ignore them where they occur.

This is a big deal. First of all, the idea of using “ly” as a way to search for adverbs is profoundly misguided. Saunders seems to realize this, since he notes that not all words that end in -ly are adverbs. But where he really goes wrong, as we’ll soon see, is in disregarding all of the adverbs that do not end in -ly. If Saunders had used a corpus that already had each word tagged for its part of speech (POS), or if he had run a POS-tagger on his own corpus, he could have had an accurate measurement of the use of adverbs in academic articles. This is because POS-tagging allows researchers to find adverbs, adjectives, nouns, etc., as well as to search for words that end in -ly, or even just adverbs that end in -ly. And remember, it can all be done in a matter of moments (even the POS tagging). You won’t even have time to make a cup of coffee, although consumption of caffeinated beverages is highly recommended when doing linguistics (unless you’re at a conference, in which case you should substitute alcohol for caffeine).
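To make the difference concrete, here is a minimal sketch in Python. It is not Saunders’ script and not how COCA works internally. It compares the “ends in -ly” heuristic with part-of-speech tags on a tiny hand-tagged sample. The tokens and the simplified tags (ADV, NOUN and so on) are assumptions invented for illustration; a real study would use a tagger or a pre-tagged corpus.

```python
# Toy, hand-tagged tokens (invented for illustration; the tags are simplified
# stand-ins, not the tagset that COCA uses).
TAGGED = [
    ("However", "ADV"), ("the", "DET"), ("results", "NOUN"),
    ("from", "PREP"), ("Italy", "NOUN"), ("were", "VERB"),
    ("surprisingly", "ADV"), ("clear", "ADJ"),
    ("Thus", "ADV"), ("renomegaly", "NOUN"), ("was", "VERB"),
    ("often", "ADV"), ("seen", "VERB"), ("in", "PREP"), ("July", "NOUN"),
    ("Finally", "ADV"), ("we", "PRON"), ("also", "ADV"),
    ("tested", "VERB"), ("it", "PRON"),
]


def ends_in_ly(word):
    """The suffix heuristic: treat any word ending in -ly as an adverb."""
    return word.lower().endswith("ly")


def compare(tagged):
    """Split the sample into hits, false positives and missed adverbs."""
    hits = [w for w, t in tagged if t == "ADV" and ends_in_ly(w)]
    false_positives = [w for w, t in tagged if t != "ADV" and ends_in_ly(w)]
    missed = [w for w, t in tagged if t == "ADV" and not ends_in_ly(w)]
    return hits, false_positives, missed


hits, false_positives, missed = compare(TAGGED)
print("tagged ADV and found by suffix:", hits)
print("false positives:", false_positives)
print("adverbs the suffix search misses:", missed)
```

Even in this made-up sample, the suffix search both lets in non-adverbs (Italy, July) and skips adverbs that have no -ly ending (however, thus), which is the second and more serious problem.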

Here is where I break from following Saunders’ method. I would like to show you what’s possible with some of the publicly available corpora online, or how a linguist would conduct an inquiry into the use of adverbs in academia.

Looking for sentence-initial adverbs in academic texts, I went to COCA. I know the COCA interface can seem a bit daunting to the uninitiated, but there are very clear instructions (with examples) of how to do everything. Just remember: if confusion persists for more than four hours, consult your local linguist.

On the COCA page, I searched for adverbs coming after a period, or sentence initial adverbs, in the Medical and Science/Technology texts in the Academic section (Click here to rerun my exact search on COCA. Just hit “Search” on the left when you get there). Here’s what I came up with:

[Figure: Top ten sentence-initial adverbs in medical and science academic texts in COCA.]

You’ll notice that only one of the adverbs on this list (“finally”) ends in “ly”. That word is also, coincidentally, the top word on Saunders’ list. Notice also that the list above includes the kind of sentence adverbs that Saunders’ search deliberately leaves out, namely those not ending in -ly, such as “for” and “in”, despite the examples of such adverbs given on the Wikipedia page that Saunders linked to in his post. (For those wondering, the POS-tagger treated these as parts of adverbial phrases, hence the “REX21” and “RR21” tags.)
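For readers who would rather see the logic of that search than the COCA interface, the idea of “an adverb coming after a period” can be sketched in a few lines of Python. This is only an illustration under assumptions: the tagged tokens are invented, the tags are simplified, and the function name sentence_initial_adverbs is hypothetical. It says nothing about how COCA runs the query.

```python
from collections import Counter

# Toy tagged text (invented for illustration). Each item is (token, tag);
# "ADV" and "PUNCT" are simplified stand-ins for real POS tags.
TOKENS = [
    ("However", "ADV"), (",", "PUNCT"), ("the", "DET"), ("effect", "NOUN"),
    ("was", "VERB"), ("small", "ADJ"), (".", "PUNCT"),
    ("Finally", "ADV"), (",", "PUNCT"), ("we", "PRON"), ("repeated", "VERB"),
    ("it", "PRON"), (".", "PUNCT"),
    ("The", "DET"), ("results", "NOUN"), ("were", "VERB"), ("also", "ADV"),
    ("clear", "ADJ"), (".", "PUNCT"),
    ("However", "ADV"), (",", "PUNCT"), ("more", "ADJ"), ("work", "NOUN"),
    ("remains", "VERB"), (".", "PUNCT"),
]


def sentence_initial_adverbs(tokens):
    """Count adverbs that open the text or directly follow a full stop."""
    counts = Counter()
    previous = "."  # treat the start of the text as a sentence boundary
    for word, tag in tokens:
        if previous == "." and tag == "ADV":
            counts[word.lower()] += 1
        previous = word
    return counts


counts = sentence_initial_adverbs(TOKENS)
print(counts.most_common())
```

Because the filter is the tag rather than the spelling, “however” is counted even though it does not end in -ly, while the mid-sentence “also” is left out.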

Searching for only those sentence initial adverbs that end in -ly, we find a list similar to Saunders’, but with only five of the same words on it. (Saunders’ top ten are: finally, additionally, interestingly, recently, importantly, similarly, surprisingly, specifically, conversely, consequentially)

So what does this tell us? Well, for starters, my shooting-from-the-hip research is insufficient to draw any great conclusions from, even if it is more systematic than Saunders’. Seeing what adverbs are used to start sentences doesn’t really tell us much about, for example, what the journals, authors, or results of the papers are like. This is the mistake that Mr. Saunders makes in his conclusions. After ranking the usage frequencies of surprising by journal, he writes:

The message seems clear: go with a Nature or specialist PLoS journal if your results are surprising.

Unfortunately for Mr. Saunders, a linguist would find the message anything but clear. For starters, the relative use of surprising in a journal does not tell us that the results in the articles are actually surprising, but rather that the authors wish to present their results as surprising. That is, if the word surprising in the articles is not preceded by “Our results are not”. This is another problem with Mr. Saunders’ conclusions, namely that he does not place his results in context. It is something that linguists would research, perhaps by scrolling through the concordances using corpus linguistics software, or software designed exactly for the type of research that Mr. Saunders wished to do.
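A concordance is simply every hit for a word shown with the text around it. As a rough sketch of what such software does (real concordancers offer far more), the Python below prints keyword-in-context lines. The function name concordance, the width parameter and the sample sentences are all assumptions made up for this illustration.

```python
import re


def concordance(text, keyword, width=30):
    """Return keyword-in-context lines with `width` characters on each side."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for match in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented example sentences, not taken from any real abstract.
SAMPLE = (
    "Our results are not surprising given earlier work. "
    "The surprising finding was the size of the effect. "
    "It is hardly surprising that the two groups differ."
)

for line in concordance(SAMPLE, "surprising"):
    print(line)
```

Reading the left-hand context is what reveals that two of these three invented hits (“not surprising”, “hardly surprising”) say the opposite of what a raw frequency count would suggest.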

The second thing to notice about my results is that they probably look a whole lot more boring than Saunders’. Such is the nature of researching things that people think matter (like those nasty little adverbs), but professionals know really don’t. So it goes.

Finally, what we really should be looking at is how scientists use adverbs in comparison to other writers. I chose to contrast the frequencies of sentence-initial adverbs in the medical and science/technology articles with the frequencies found in academic articles from the (oft-disparaged) humanities. (Here is the link to that search.)

Six of the top ten sentence initial adverbs in the humanities texts are also on the list for the (hard) science texts. What does this tell us? Again, not much. But we can get an idea that either the styles in the two subjects are not that different, or that sentence initial adverbs might be similar across other genres as well (since the words on these lists look rather pedestrian). We won’t know, of course, until we do more research. And if you really want to know, I suggest you do some corpus searches of your own because the end of this blog post is long overdue.

I also think I’ve picked on Mr. Saunders enough. After all, it’s not really his fault if he didn’t do as I have suggested. How was he supposed to know all these corpora are available? He’s a bioinformatician, not a corpus linguist. And yet, sadly, he’s the one who gets written up in the Smithsonian’s blog, even though linguists have been publishing about these matters since at least the late 1980s.

Before I end, though, I want to offer a word of warning. Although I said that anyone who knows where to look can and should do their own corpus linguistic research, and although I tried to keep my searches as simple as possible, I couldn’t have done them without my background in linguistics. Doing linguistic research on Big Data is tempting. But doing linguistic research on a corpus, especially one that you compiled, can be misleading at best and flat out wrong at worst if you don’t know what you’re doing. The problem is that Mr. Saunders isn’t alone. I’ve seen other non-linguists try this type of research. My message here is similar to the one in my previous post, which was directed to marketers: linguistic research is interesting and it can tell you a lot about the subject of your interest, but only if you do it right. So get a linguist to do it or see if a linguist has already done it. If neither of these is possible, then feel free to do your own research, but tread lightly, young padawans.

If you’re wondering whether academia overuses adverbs (hint: it doesn’t) or just how many adverbs get tossed into academic articles, I recommend reading papers written by Douglas Biber and/or Susan Conrad. They have published extensively on the linguistic nature of many different writing genres. Here’s a link to a Google Scholar search to get you started. You can also have a look at the Longman Grammar, which is probably available at your library.

7 thoughts on “Unsurprisingly, corpus linguists have already answered your question”

alf August 26, 2013 12:14 am

Is there a way to find out which journals are indexed in the Academic > Medical and Science/Technology section of COCA? The corpus documentation mentions “nearly 100 different peer-reviewed journals […] selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.”. [http://corpus.byu.edu/coca/help/texts_e.asp]

Even though “Medicine” and “Science” are major classes of the LOCC, that could still end up being less than 10 journals included in the corpus – much less than the coverage of even the Open Access subset of PubMed Central.
1. alf August 26, 2013 12:42 am
  
  From the Excel file listing all the texts in the COCA corpus (http://corpus.byu.edu/coca/help/coca_2012_06_22.zip), there are 2201 texts categorised as “Medicine”, from 51 journals, and 4176 texts categorised as “Sci/Text”, from 23 journals.
  1. Joe McVeigh August 26, 2013 3:51 pm
    
    Thanks, alf, for finding that info. I know that COCA tries to be representative of the fields that it includes. Representativeness is a big issue when compiling general (as opposed to specialist) corpora. Of course, nothing is stopping anyone from making a corpus of just medical journals and researching that to get a better idea of how the authors use language. But for general purposes, using a corpus like COCA is sufficient.
Warren M Tang September 5, 2013 4:11 pm

I have to admit 47GB is going a bit overboard. Surely a smaller corpus would have sufficed. One has to admire someone willing to get their hands dirty in this manner for an answer … even if he didn’t check the literature on his question.
Maybe Corpus Linguists Have, Maybe They Haven’t | LogBook September 10, 2013 12:07 am

[…] I do feel like corpus linguists have already answered my question, but sometimes I feel like they didn’t ask very interesting questions in the first place, and […]
Joe McVeigh September 11, 2013 12:36 pm

Warren: I can certainly admire how willing Mr. Saunders was to answer his question. But I think the problems with his method are indicative of a larger problem in academia, which is how people often don’t know what’s going on outside their own field. It’s not unheard of for a linguist to be asked, “So what do you do?” People often either don’t know or just assume that linguists are those people who know, like, lots of languages. But the answer to Mr. Saunders’ query is exactly what (some) linguists do and have been doing for decades. I’m not saying anyone is at fault for this. It’s just that the actions of some fields are more well known than others. For a really bad example, let’s say I wanted to know what the inside of a frog looks like. Rather than opening a frog, I would open a biology book. Or at least let a biology book tell me how to open a frog. That’s because I know that scientists have already done the research. But this raises another question. If I had a question about biology, I would consult research in that field because I know I’m not an expert (nor even an amateur; something below that). If someone has a question about language, they should consult linguistic research. The problem is that everyone has a very strong and intuitive command of language, which can make them feel capable enough to do their own linguistic research. And that’s when problems arise. They make assumptions that they really shouldn’t make, like Mr. Saunders did. So I’m all for getting your hands dirty (with either a language corpus or a frog corpse), but you should know what you’re doing beforehand.

John: I agree that not all work done in corpus linguistics is the most fascinating thing I’ve ever seen. And I think that there are some very good reasons to go over or even redo previous research. That’s part of what scientific inquiry is all about. If you’re looking into speech corpora, I think you’ll soon be seeing some really interesting things. Technology has now advanced enough that linguists can present their annotated speech corpus along with audio samples. Some are already out, but of course there will be more. It’s interesting if nothing else because it lets people study speech corpora in the same way as written corpora. The transcribing can be a helluva lot of work, but once it’s done it’s off to the races.
