n-grams: for fun (and profit?)

In the last post I delved into LSSU’s list of “banned words” for 2016 using the Google Ngram viewer. If you’re interested in modern language use (primarily English, but it dips into a few others), that’s a tool you should know about and learn to use. But what exactly are “n-grams”?

N-grams (properly n-grams) weren’t created by Google. The idea goes back several decades. I can’t pin down a good origination point, but Wikipedia suggests the idea comes from Claude Shannon’s work on information theory (critically important to the later development of many things we rely on today, including much of modern computing). But as with many things wiki, a source isn’t cited. A different source credits the idea to another researcher in 1949, building on Shannon’s work, and gives yet a third researcher credit for first use of the actual term “n-gram” in 1957.

Regardless, n-grams are an interesting thing for language people to geek out on, even though they’re not exclusively restricted to applications in language. N-grams, to simplify the definition, are strings of consecutive units from a longer text. Note that “text” here has a broader meaning — I’ll get to that. N-grams can be words, but they can also be phonemes or single characters; in principle, there’s no reason why they can’t be entire sentences or paragraphs (but in practice that undermines their application).

Let’s use the first sentence of Lincoln’s Gettysburg Address for an example:

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Using words as our unit of measure, “four,” “score,” “and,” “seven,” and “years” are all n-grams of length 1 (they are 1-grams, or unigrams). Every other word in the sentence is also a 1-gram.

“Four score,” “score and,” “and seven,” and “seven years” are also n-grams, but of length 2 (2-grams, or bigrams). You can see how this works: “four score and” and “score and seven” are 3-grams; “four score and seven” is a 4-gram, and so on.
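The sliding window described above is easy to sketch in Python. This is only a minimal illustration, not how Google or anyone else builds their data; the names `tokenize` and `ngrams` and the crude letters-only word splitter are my own assumptions.

```python
import re

SENTENCE = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent a new nation, conceived in liberty, and dedicated to the "
    "proposition that all men are created equal."
)

def tokenize(text):
    """Lowercase the text and keep only runs of letters (a crude word splitter)."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(units, n):
    """Return every run of n consecutive units, in order, as tuples."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

words = tokenize(SENTENCE)
print(ngrams(words, 1)[:5])     # the first five unigrams
print(ngrams(words, 2)[:4])     # the first four bigrams
print(ngrams(words, 3)[:2])     # the first two 3-grams
print(ngrams(list("four"), 2))  # the same function works on single characters
```

Because `ngrams` only cares about a sequence of units, the same function covers words, characters, or anything else you can put in a list.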

In computational linguistics, where this idea is used heavily, n-grams can help predict things, such as which word comes next. In our example sentence, for instance, there are two 2-grams beginning with “and”: “and seven” and “and dedicated.” Looking at the entire short address, more appear: “and so,” “and proper,” “and dead,” and “and that.” As more text is analyzed, the list will grow and patterns will emerge.

A better example with this text would be the word “we.” When taken as the start of a 2-gram we get: “we are,” “we are,” “we have,” “we should,” “we can,” “we can,” “we can,” “we say,” “we take,” and “we here.” You can see something happening already: “we” is always followed by a verb (except in one case), and some n-grams have already repeated. Probabilities have taken the field.
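Here is a rough sketch of that kind of tally in Python. It counts, for every word in the address, which words follow it, then turns the counts for “we” into simple relative frequencies. The text is the commonly reproduced Bliss copy as best I can transcribe it, and the function name `followers` and the letters-only tokenizer are my own assumptions; a real language model would be far more careful about sentence boundaries and smoothing.

```python
import re
from collections import Counter, defaultdict

# The address (Bliss copy), pasted in so the sketch is self-contained.
ADDRESS = """
Four score and seven years ago our fathers brought forth on this continent,
a new nation, conceived in Liberty, and dedicated to the proposition that
all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field,
as a final resting place for those who here gave their lives that that nation
might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can
not hallow -- this ground. The brave men, living and dead, who struggled here,
have consecrated it, far above our poor power to add or detract. The world
will little note, nor long remember what we say here, but it can never forget
what they did here. It is for us the living, rather, to be dedicated here to
the unfinished work which they who fought here have thus far so nobly
advanced. It is rather for us to be here dedicated to the great task remaining
before us -- that from these honored dead we take increased devotion to that
cause for which they gave the last full measure of devotion -- that we here
highly resolve that these dead shall not have died in vain -- that this
nation, under God, shall have a new birth of freedom -- and that government
of the people, by the people, for the people, shall not perish from the earth.
"""

def followers(text):
    """Map each word to a Counter of the words that come directly after it."""
    words = re.findall(r"[a-z]+", text.lower())
    table = defaultdict(Counter)
    # Pair every word with its successor; pairs that straddle a sentence
    # boundary are counted too, which a careful model would avoid.
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table

table = followers(ADDRESS)

we = table["we"]
total = sum(we.values())
for word, count in we.most_common():
    # Relative frequency: a crude estimate of P(next word | "we").
    print(f"we {word}: {count}/{total} = {count / total:.2f}")

print(sorted(table["and"]))  # every word that follows "and" in the address
```

Even on a text this small, “we can” comes out ahead of “we are,” which is exactly the kind of lopsidedness a predictive keyboard exploits when it guesses your next word.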

While an amateur hack linguist like myself might use a tool like Google’s Ngrams viewer for occasional fun, one-off comparisons of the relative frequency of words and short phrases, a serious researcher (or applications programmer) can do a lot more with n-grams. They can begin to detect larger patterns — of two and three and more words — and use them to speed processing, whether it be for data transmission, speech recognition, or automated translation. Without knowing it, you most likely use more than one application every day that makes use of n-grams in some way. Google Translate? Siri? Your smartphone’s auto-correct? Viewing a video online? N-grams are probably involved.

Where else have n-grams been used? In areas conceptually similar to linguistics, such as communications theory and natural language processing. Their use is also cited in probability and in data compression, as well as a place you might not guess: computational biology. Think about it: if a ‘text’ is interpreted loosely to include any sequence built from a set of discrete, repeatable units, then the genetic sequence of any organism — a ‘text’ composed of long sequences of the units A, C, G, and T — fits the definition very well. It’s easy to see how larger n-grams — TTT, AGC, CTG — can be used in DNA analysis. Applying 10-grams or 20-grams to millions of base pairs can speed computations greatly.
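The DNA case needs nothing new; it is the same sliding window over a string of letters (biologists usually call these k-mers). A minimal sketch follows, in which the sequence is made up purely for illustration and the function name `kmer_counts` is my own invention.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping run of k bases in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Made-up sequence for illustration only; this is not real genomic data.
dna = "ATGCTGAGCTTTCTGAGC"

counts = kmer_counts(dna, 3)
for kmer, count in counts.most_common():
    print(kmer, count)
```

Swap the 3 for 10 or 20 and point it at a real genome and you have the raw material for the kind of analysis mentioned above, although real tools use far more memory-efficient structures than a plain Counter.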

Google’s Ngrams data is a good place to start when looking at n-grams. Not only has Google scanned and analyzed over 5 million books, but the database is publicly available with an easy-to-use search feature. Of course, as wonderful as this is, it’s got limitations. The two that you’re most likely to run into are that it doesn’t include everything (only books and some periodicals) and it was last updated in 2012, with material that stops abruptly at 2008. Very low frequency n-grams (found in fewer than 40 sources) were excluded, to make the size of the database manageable.
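To make the low-frequency cutoff concrete, here is a toy sketch of the idea: count how many sources each n-gram appears in, then drop anything below a threshold. I have no knowledge of how Google actually implemented its filter; the three “books” are invented, the function names are hypothetical, and the threshold is 2 rather than 40 so that something survives.

```python
from collections import Counter

def source_counts(sources, n):
    """Count in how many sources each word n-gram appears at least once."""
    seen = Counter()
    for text in sources:
        words = text.lower().split()
        # A set, so an n-gram repeated within one source still counts once.
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        seen.update(grams)
    return seen

def prune(seen, min_sources):
    """Keep only the n-grams found in at least min_sources sources."""
    return {gram: count for gram, count in seen.items() if count >= min_sources}

# Invented stand-ins for scanned books.
books = [
    "the new nation was young",
    "a new nation needs laws",
    "the old nation was tired",
]

kept = prune(source_counts(books, 2), min_sources=2)
print(kept)
```

Only the bigrams shared by at least two of the toy books survive; everything that appears in a single book is discarded, which is the trade-off the real cutoff makes at a much larger scale.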

Another set of publicly usable n-grams looks at a completely different corpus (n-grams lingo for database). The statistical analysis site FiveThirtyEight generated a corpus using about 8 years of user comments from Reddit (more than 1.7 billion text fragments). Although it has been described as many different things, Reddit refers to itself as the “online community where users vote on content.”

Reddit can be a dimly lit place, and comment threads on any site are the Internet’s landfill. But there’s potentially a lot to mine in there about contemporary language use. Interested in the relative frequency of stfu vs gtfo? Wondering how netflix and chill track together? Curious about which game has been discussed the most, Call of Duty or World of Warcraft? (Minecraft beat them both, actually.) Use their simple interface and bang away. The data was truncated in August of 2015, though, so up-to-the-minute usage isn’t available. Wouldn’t you love to know, for example, how tiny hands and orange hair correlate? Or how socialist and fascist compare over the past six months?

Another relevant n-grams site is the Corpus of Contemporary American English, which contains only 440 million words from 190,000 texts, but is better structured for linguistic research than Google’s corpus (it doesn’t exclude low-frequency n-grams). That same site hosts a list of several other n-grams corpora (their plural for corpus) which are searchable online.

Very specialized sites exist, too. This one, for example, looks only at roughly 40,000 works printed in English before the year 1700.

Not all are free; some can only be accessed for a fee, or downloaded after payment. That makes sense, though. A huge effort goes into creating these corpora, and they are potentially important factors in making a great deal of money online. They can be fun to poke through for those who are curious about language, but they might also help some company generate revenue with a better search engine, or produce a lucrative text-to-speech application. The use of n-grams involves potentially a whole lot more than a few good one-liners about the inanity of Internet comments.


About thebettereditor

Chris holds a BA degree in history from the University of Virginia and a Master of Fine Arts (MFA) Degree in writing from the University of Southern Maine (Stonecoast). He has worked extensively with professional and semi-professional writers and enthusiastic amateurs for about 20 years. He has several years’ experience in scientific publishing, but has also worked in information technology, insurance, health care, and education (he taught writing at the university level for a number of years). Since 2011, he's also specialized in helping small businesses meet their writing and editing needs on a budget.
This entry was posted in Culture, Language, Things you should know.
