Exploratory Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

Tasks to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Data cleaning and filtering

At this stage we try to have a initial look at the data, filter out special characters and organised the data into a clean format for further analysis.

The data set is split into training(80%), testing (10%), and validation (10%) sets randomly. Line by line, I seperate the lines into sentences, then clean the sentences by removing all spaces and special charaters and tokenized the sentences into words, and subsequently into 2-grams, 3-grams and 4-grams. Each sentence is prefixed with two “s” to mark the start and one “s” to mark the end.

Number of lines in each data file

##    nLines   nWords           dataset
## 1  899288 37334690   en_US.blogs.txt
## 2 1010242 34372720    en_US.news.txt
## 3 2360148 30374206 en_US.twitter.txt

Below I focus on data from “en_US.twitter.txt” for more detailed analysis. Due to limited computing time available to me, I demonstrate the analysis using the first 5000 sentences.

Compuate Frequencies

Questions to consider

Then an exploratory analysis is preformed and I try to answer the following questions. 1. Some words are more frequent than others - what are the distributions of word frequencies? 2. What are the frequencies of 2-grams and 3-grams in the dataset? 3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

The top 20 most frequent words are:

##  <s> </s>  the    i   to    a  you  and   it   of  for   is   in   on that 
## 1998  999  414  399  311  248  242  173  170  169  158  158  147  131  121 
##    s   my    t   so your 
##  121  108  103   92   82

Now plot the frequncies in decreasing order:

The number of unique words needed in a frequency sorted dictionary to cover 50% of all word instances is:

## any 
## 110

In other words, we only need

##        any 
## 0.03212617

of all words to cover 50% of all word instances with the twitter data set.

The number of unique words needed in a frequency sorted dictionary to cover 90% of all word instances is:

## mariano 
##    2118

In other words, we only need the following percentage of all words to cover 90% of all word instances with the twitter data set.

##   mariano 
## 0.6185748
  1. What are the frequencies of 2-grams and 3-grams in the dataset?

Here “s” and “/s” are used to mark the start and end of a sentence. In order to deal with 3-gram and 4-gram, two “s” are prefixed into each sentence.

Here’s 2-grams

The most frequent bigrams are

##        i.m      don.t     of.the       it.s    for.the     in.the 
##         60         38         35         34         29         27 
##    <s>.the      to.be <s>.thanks    it.</s>     on.the     to.the 
##         25         24         22         21         21         20 
##      can.t     i.know      i.can    <s>.you     at.the     i.love 
##         20         19         18         17         17         16 
##     it.was 
##         16

Here’s 3-grams

The most frequent 3-grams are

## <s>_<s>.thanks        <s>_i.m    <s>_<s>.you    <s>_<s>.hey     <s>_<s>.it 
##             22             20             17             15             14 
##     <s>_<s>.rt     <s>_<s>.so   <s>_<s>.just     <s>_<s>.oh        i_can.t 
##             12             11             11             11             11 
##     <s>_<s>.my thanks_for.the     <s>_i.love <s>_thanks.for     <s>_<s>.no 
##             10             10             10             10              9 
##   <s>_<s>.that   <s>_<s>.what        i_don.t     <s>_<s>.we 
##              9              9              8              7
  1. How do you evaluate how many of the words come from foreign languages? This can be worked out by comparing against an English dictionary.

  2. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

If synonymous words, verb tenses, singleton and plura of nones etc are mapped into a single representation in text pre-processing, the size of the dictionary can be substentially reduced.

  1. How do you evaluate how many of the words come from foreign languages?
  2. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Modelling

The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.

Tasks to accomplish

  1. Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.

I have build dictionaries for unigram, 2-gram, 3-gram and 4-gram. This is to enable predicting a word by just itself, as how likely it would appear at all, the word before it, two words and three words before it. The context normally helps prediction quite substantially. However, they may not be always available in the training data, for example, we may find “How”, “How are”, “How are you”, in the training data, but not “How are you, Tom”. Using a series of N grams can help to build a composite probability, using as much information as is available.

  1. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

The simplest way is to add a pseudo count to avoid 0 observations. For example, adding 1 to the frequencies of all N-grams. A better way of doing it is to use Backoff Interpolation. The basic idea is to use the probability of a shorter phrase if the one for the longer one is not available. For example, If we were trying to predict the word “so” in sentence “It cannot be but so”, we could searh for the probability of “so” following “It cannot be but”. If this phrase is not available in N-gram, we could reduce the search phrase to “cannot be but”, and then to “be but” and then “but”. We could also consider the probability of just the word “so” itself. We could combine these probabilities so that we use as much information as possible.

Questions to consider

  1. How can you efficiently store an n-gram model (think Markov Chains)? Here we store n-grams as nested hash lists. This can be further improved by storing them into trees.

  2. How can you use the knowledge about word frequencies to make your model smaller and more efficient? It will substantially improve the effency of the model by matching genre of between the training set and the testing set. We can use human knowledge to find suitable training texts for certain prediction tasks.

  3. How many parameters do you need (i.e. how big is n in your n-gram model)? In the n-gram model presented here, I used Backoff Interpolation to model unseen phrases, for which there are four lambdas, with 3 free parameters to estimate. It is straightforward to search these variable by standard optimization libraries, such as optim in R.

  4. Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data) ? The simplest way is to add a pseudo count of 1 into all frequencies to avoid 0 counts (Laplace Smoothing). A better way of doing it is to use Backoff Interpolation, that is using values from N-1 gram if N gram is not available, untill unigram. Each probability can be assigned a weight, which can be optimised using a validation set.

  5. How do you evaluate whether your model is any good? By testing against a test set.

  6. How can you use backoff models to estimate the probability of unobserved n-grams? We can build a composite probability by adding N-gram, N-1 gram, …, till unigram, with a set of weights. The benefit of doing so is to alleviate the problems that any of these N-grams can give a zero probability. This provides a fall-back mechanism. The weights can be optimised if we create a validation data set.