N Grams model for word prediction

Exploratory Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

Tasks to accomplish

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Data cleaning and filtering

At this stage we try to have a initial look at the data, filter out special characters and organised the data into a clean format for further analysis.

I start with the English twitter data set (en_US.twitter.txt). The data set is split into training(80%), testing (10%), and validation (10%) sets randomly. Line by line, I seperate the lines into sentences, then clean the sentences by removing all spaces and special charaters and tokenized the sentences into words, and subsequently into 2-grams and 3-grams.

## scanned  1000  lines

Compuate Frequencies

Questions to consider

Then an exploratory analysis is preformed and I try to answer the following questions. 1. Some words are more frequent than others - what are the distributions of word frequencies? 2. What are the frequencies of 2-grams and 3-grams in the dataset? 3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

The top 20 most frequent words are:

##  <s> </s>    i  the    a    t   to  you  and   it   in   of  for   on    e 
## 1998  999  552  415  313  307  305  229  185  184  179  175  152  143  140 
##   my that    o    u   be 
##  126  119  110   99   95

Now plot the frequncies in decreasing order:

The number of unique words needed in a frequency sorted dictionary to cover 50% of all word instances is:

## ed 
## 92

In other words, we only need

##         ed 
## 0.03028308

of all words to cover 50% of all word instances with the twitter data set.

The number of unique words needed in a frequency sorted dictionary to cover 90% of all word instances is:

## thin 
## 1661

In other words, we only need the following percentage of all words to cover 90% of all word instances with the twitter data set.

##      thin 
## 0.5467413

What are the frequencies of 2-grams and 3-grams in the dataset?

Here “s” and “/s” are used to mark the start and end of a sentence. In order to deal with 3-gram and 4-gram, two “s” are prefixed into each sentence.

Here’s 2-grams

The most frequent bigrams are

##      ju.t       i.m     don.t    of.the    in.the   for.the     to.be 
##        57        50        41        35        34        29        26 
##   <s>.the <s>.thank    on.the  day.</s>      la.t   it.</s>     can.t 
##        25        25        22        22        22        18        18 
##       n.t    <s>.it    to.the    at.the    <s>.rt 
##        18        17        17        17        16

Here’s 3-grams

The most frequent 3-grams are

## <s>_<s>.thank       <s>_i.m    <s>_<s>.it    <s>_<s>.rt   <s>_<s>.you 
##            25            18            17            16            16 
##   <s>_<s>.hey     <s>_<s>.o    <s>_i.love    la_t.night       i_can.t 
##            14            11            11            11            11 
##    <s>_<s>.no    <s>_<s>.if     <s>_<s>.a  <s>_<s>.what       i_don.t 
##            10            10            10            10            10 
##       doe_n.t    <s>_<s>.ju  <s>_<s>.that thank_for.the 
##            10             9             9             9

How do you evaluate how many of the words come from foreign languages? This can be worked out by comparing against an English dictionary.
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

If synonymous words, verb tenses, singleton and plura of nones etc are mapped into a single representation in text pre-processing, the size of the dictionary can be substentially reduced.

How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Modelling

The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.

Tasks to accomplish

Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.

I have built unigram, 2-gram, 3-gram and 4-gram dictionaries for model prediction.

Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

Here we use Backoff Interpolation to handel unseen phrases that can appear in the validation or testing datasets.

Questions to consider

How can you efficiently store an n-gram model (think Markov Chains)? Here we store n-grams as nested hash lists. This can be further improved by storing them into trees.
How can you use the knowledge about word frequencies to make your model smaller and more efficient? It will substantially improve the effency of the model by matching genre of between the training set and the testing set. We can use human knowledge to find suitable training texts for certain prediction tasks.
How many parameters do you need (i.e. how big is n in your n-gram model)? In the n-gram model presented here, I used Backoff Interpolation to model unseen phrases, for which there are four lambdas, with 3 free parameters to estimate. It is straightforward to search these variable by standard optimization libraries, such as optim in R.
Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data) ? The simplest way is to add a pseudo count of 1 into all frequencies to avoid 0 counts (Laplace Smoothing). A better way of doing it is to use Backoff Interpolation, that is using values from N-1 gram if N gram is not available, untill unigram. Each probability can be assigned a weight, which can be optimised using a validation set.
How do you evaluate whether your model is any good? By testing against a test set.
How can you use backoff models to estimate the probability of unobserved n-grams? We can build a composite probability by adding N-gram, N-1 gram, …, till unigram, with a set of weights. The benefit of doing so is to alleviate the problems that any of these N-grams can give a zero probability. This provides a fall-back mechanism. The weights can be optimised if we create a validation data set.