What?

N-grams are a simple class of language models (LMs). LMs are models that assign probabilities to sequences of words. With them, we can estimate how likely a word is to appear in a given position, for instance the last word of an n-gram given the previous words.

An n-gram, then, is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”.
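As a quick illustration (the helper function and example sentence here are just a sketch), we can pull the n-grams out of a tokenized sentence in Python:

    def ngrams(tokens, n):
        # Slide a window of length n over the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "please turn your homework".split()
    print(ngrams(tokens, 2))  # [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
    print(ngrams(tokens, 3))  # [('please', 'turn', 'your'), ('turn', 'your', 'homework')]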

How?

To measure the probability of a word w given some history h, we need to calculate P(w|h). But first, we should learn how to calculate the probability of an entire sequence like w1, …, wn. To do this, we use the chain rule of probability:
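P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1, \ldots, w_{k-1})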

Calculating all of the above probabilities is hard and time-consuming. Instead, n-grams rely on the Markov assumption: the probability of a word depends only on the last few words, not the entire history. The intuition of the n-gram model is therefore that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words. The bigram model, for example, approximates the probability of a word given all the previous words by the conditional probability of the preceding word alone. Therefore, we will have this formula:
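P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-1})

and, applying this approximation to every word in the sequence,

P(w_1, \ldots, w_n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})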

Finally, maximum likelihood estimation (MLE) lets us estimate these bigram probabilities from counts in a corpus:
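P(y \mid x) = \frac{C(xy)}{\sum_{w} C(xw)} = \frac{C(xy)}{C(x)}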

The above equation says that to compute a particular bigram probability of a word y given a previous word x, we compute the count of the bigram C(xy) and normalize it by the sum of the counts of all the bigrams that share the same first word x, which is simply the unigram count C(x).
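To make this concrete, here is a minimal Python sketch of bigram MLE over a toy corpus (the corpus, the <s>/</s> markers, and the function name are just illustrative):

    from collections import Counter

    def bigram_mle(sentences):
        unigrams, bigrams = Counter(), Counter()
        for sentence in sentences:
            # Pad each sentence with start/end markers so boundary bigrams are counted too.
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        # P(y | x) = C(xy) / C(x)
        return {(x, y): count / unigrams[x] for (x, y), count in bigrams.items()}

    probs = bigram_mle(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
    print(probs[("<s>", "I")])  # 2/3 -- two of the three sentences start with "I"
    print(probs[("I", "am")])   # 2/3 -- "I" occurs three times and is followed by "am" twice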

Evaluating?

To evaluate the performance of an n-gram model, we use a metric called perplexity. The perplexity (sometimes abbreviated as PP) of a language model on a test set is the inverse probability of the test set, normalized by the number of words.
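In formula form, for a test set W = w_1 w_2 \ldots w_N:

PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

Expanding P(w_1 w_2 \ldots w_N) with the chain rule (and, for a bigram model, the Markov approximation) gives

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

The lower the perplexity, the better the model predicts the test set.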

Generalization

In LMs, we should consider several criteria when choosing a suitable training set; otherwise, our model is subject to bias.

In particular, the training and test sets should be of the same genre, dialect, and level of sparsity.

There are other problems pertinent to n-grams, like zeros and unknown words. Refer to Jurafsky for further reading!