The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
At this stage we try to have a initial look at the data, filter out special characters and organised the data into a clean format for further analysis.
I start with the English twitter data set (en_US.twitter.txt). The data set is split into training(80%), testing (10%), and validation (10%) sets randomly. Line by line, I seperate the lines into sentences, then clean the sentences by removing all spaces and special charaters and tokenized the sentences into words, and subsequently into 2-grams and 3-grams.
## scanned 1000 lines
Compuate Frequencies
Then an exploratory analysis is preformed and I try to answer the following questions. 1. Some words are more frequent than others - what are the distributions of word frequencies? 2. What are the frequencies of 2-grams and 3-grams in the dataset? 3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
The top 20 most frequent words are:
## <s> </s> i the a t to you and it in of for on e
## 1998 999 552 415 313 307 305 229 185 184 179 175 152 143 140
## my that o u be
## 126 119 110 99 95
Now plot the frequncies in decreasing order:
The number of unique words needed in a frequency sorted dictionary to cover 50% of all word instances is:
## ed
## 92
In other words, we only need
## ed
## 0.03028308
of all words to cover 50% of all word instances with the twitter data set.
The number of unique words needed in a frequency sorted dictionary to cover 90% of all word instances is:
## thin
## 1661
In other words, we only need the following percentage of all words to cover 90% of all word instances with the twitter data set.
## thin
## 0.5467413
Here “s” and “/s” are used to mark the start and end of a sentence. In order to deal with 3-gram and 4-gram, two “s” are prefixed into each sentence.
Here’s 2-grams
The most frequent bigrams are
## ju.t i.m don.t of.the in.the for.the to.be
## 57 50 41 35 34 29 26
## <s>.the <s>.thank on.the day.</s> la.t it.</s> can.t
## 25 25 22 22 22 18 18
## n.t <s>.it to.the at.the <s>.rt
## 18 17 17 17 16
Here’s 3-grams
The most frequent 3-grams are
## <s>_<s>.thank <s>_i.m <s>_<s>.it <s>_<s>.rt <s>_<s>.you
## 25 18 17 16 16
## <s>_<s>.hey <s>_<s>.o <s>_i.love la_t.night i_can.t
## 14 11 11 11 11
## <s>_<s>.no <s>_<s>.if <s>_<s>.a <s>_<s>.what i_don.t
## 10 10 10 10 10
## doe_n.t <s>_<s>.ju <s>_<s>.that thank_for.the
## 10 9 9 9
How do you evaluate how many of the words come from foreign languages? This can be worked out by comparing against an English dictionary.
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
If synonymous words, verb tenses, singleton and plura of nones etc are mapped into a single representation in text pre-processing, the size of the dictionary can be substentially reduced.
The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.
I have built unigram, 2-gram, 3-gram and 4-gram dictionaries for model prediction.
Here we use Backoff Interpolation to handel unseen phrases that can appear in the validation or testing datasets.
How can you efficiently store an n-gram model (think Markov Chains)? Here we store n-grams as nested hash lists. This can be further improved by storing them into trees.
How can you use the knowledge about word frequencies to make your model smaller and more efficient? It will substantially improve the effency of the model by matching genre of between the training set and the testing set. We can use human knowledge to find suitable training texts for certain prediction tasks.
How many parameters do you need (i.e. how big is n in your n-gram model)? In the n-gram model presented here, I used Backoff Interpolation to model unseen phrases, for which there are four lambdas, with 3 free parameters to estimate. It is straightforward to search these variable by standard optimization libraries, such as optim in R.
Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data) ? The simplest way is to add a pseudo count of 1 into all frequencies to avoid 0 counts (Laplace Smoothing). A better way of doing it is to use Backoff Interpolation, that is using values from N-1 gram if N gram is not available, untill unigram. Each probability can be assigned a weight, which can be optimised using a validation set.
How do you evaluate whether your model is any good? By testing against a test set.
How can you use backoff models to estimate the probability of unobserved n-grams? We can build a composite probability by adding N-gram, N-1 gram, …, till unigram, with a set of weights. The benefit of doing so is to alleviate the problems that any of these N-grams can give a zero probability. This provides a fall-back mechanism. The weights can be optimised if we create a validation data set.