The aim of this report is to explore the data in the set of text files in the en_US folder of the Coursera-Swiftkey dataset and to define a language model that could help predict the next word in a text autocomplete program.
There are a total of 4,269,678 lines and 102,059,872 words across all the files under the en_US folder.
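As a rough sketch, counts like these can be obtained with base R as follows; the file names below are the usual en_US blogs, news and twitter files and are my assumption here, not taken from the report itself.

```r
# A minimal sketch of how the line and word counts could be obtained,
# assuming the three en_US files sit in an "en_US" folder.
files <- c("en_US/en_US.blogs.txt", "en_US/en_US.news.txt", "en_US/en_US.twitter.txt")
lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))
total_lines <- length(lines)
# Count words by splitting each line on runs of whitespace.
total_words <- sum(sapply(strsplit(lines, "\\s+"), length))
c(lines = total_lines, words = total_words)
```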
If you are a non-data scientist, please skip to the Conclusion section.
Language models are evaluated by computing a metric called perplexity. The perplexity (PP) of a test set is its inverse probability, normalized by the number of words.
For a test set \(\displaystyle (w_1, w_2,\ldots, w_m)\)
the perplexity is given by:
\(\displaystyle PP = P(w_1, w_2, \ldots, w_m)^{-\frac{1}{m}}\)
The higher the probability of a word sequence, the lower the perplexity. Empirical tests on language data have shown that the perplexity of a trigram model is much lower than that of a unigram model. That is, the longer the context on which a model is trained, the more coherent a generated sentence will be. For this reason, I have chosen to use a trigram model in this project.
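As an illustration, perplexity can be computed from per-word probabilities as in the sketch below; `model_prob` is a hypothetical function that returns the model's conditional probability of a word given its context.

```r
# A small illustration of the perplexity formula, assuming a hypothetical
# function model_prob(w, context) that returns P(w | context) for the model.
perplexity <- function(words, model_prob) {
  # Sum log probabilities to avoid numerical underflow on long texts.
  log_p <- sum(sapply(seq_along(words), function(i) {
    context <- if (i == 1) character(0) else words[max(1, i - 2):(i - 1)]
    log(model_prob(words[i], context))
  }))
  exp(-log_p / length(words))   # PP = P(w_1, ..., w_m)^(-1/m)
}
```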
Below is the frequency distribution of the 50 most common trigrams in the en_US corpora.
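The counts behind this distribution can be tabulated with a sketch like the one below; the tokenisation here (lowercase, split on non-letters) is deliberately crude, only one file is read, and real preprocessing such as profanity filtering and sampling is omitted.

```r
# A rough sketch of tabulating trigram frequencies with base R.
lines  <- readLines("en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
tokens <- tokens[tokens != ""]
n <- length(tokens)
# Paste each word with the two words that follow it to form trigrams.
trigrams <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
head(trigram_freq, 50)   # the 50 most common trigrams
```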
Another observation is that the trigram distribution approximately follows Zipf’s law. Zipf’s law states that for many frequency distributions, the n-th largest frequency is proportional to a negative power of the rank order n.
The dark line in the plot represents Zipf’s law.
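For reference, a rank-frequency plot on log-log axes like the one described above could be produced as follows, assuming `trigram_freq` is the sorted frequency table from the earlier sketch.

```r
# A minimal sketch of the rank-frequency plot, assuming trigram_freq is the
# sorted frequency table built in the previous snippet.
freq <- as.numeric(trigram_freq)
rank <- seq_along(freq)
plot(rank, freq, log = "xy", type = "l",
     xlab = "Rank (log scale)", ylab = "Frequency (log scale)",
     main = "Trigram frequencies vs. Zipf's law")
# Zipf's law: frequency proportional to a negative power of rank, i.e. a
# straight line in log-log space; fit that line and overlay it.
fit <- lm(log(freq) ~ log(rank))
lines(rank, exp(fitted(fit)), lwd = 2)
```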
While modeling language we often encounter phrases in the test set that were not present in the training data, so the probability of these new phrases would be zero. One way to raise their probability above zero is to redistribute some of the probability mass. However, because of the massive number of zero probabilities that can occur, a simple redistribution of mass would drastically reduce the known conditional probabilities.
To improve the model without a large change in the known probabilities we can fall back on less context; this technique is known as smoothing. For a relatively rare trigram, use a bigram instead, and for a rare bigram use a unigram. A further improvement is to use a weighted mix of the unigram, bigram and trigram probabilities; this method is called interpolation.
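As a concrete example, a simple linear interpolation of the three estimates can be written as:

\(\displaystyle \hat{p}(w_i|w_{i-2},w_{i-1}) = \lambda_3\, p(w_i|w_{i-2},w_{i-1}) + \lambda_2\, p(w_i|w_{i-1}) + \lambda_1\, p(w_i)\)

where \(\lambda_1 + \lambda_2 + \lambda_3 = 1\) and the weights are typically tuned on held-out data.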
One smoothing method that is widely considered effective is Kneser-Ney smoothing. It builds on absolute discounting:
\(p_{AbsoluteDiscounting}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1},w_{i}) - \delta,0)}{\sum_{w'} c(w_{i-1},w')} + \lambda_{w_{i-1}} p(w_i)\)
Here \(\delta\) is the absolute discount (for example 0.75), \(\lambda_{w_{i-1}}\) is the interpolation weight and \(p(w_i)\) is the unigram probability.
The drawback of the unigram probability is that it carries no context. Instead, it is preferable to use the continuation probability of that unigram, given below.
\(p_{continuation}(w_i) = \frac { |\{ w_{i-1} : c(w_{i-1},w_i) > 0 \} | } { |\{ (w_{j-1},w_j) : c(w_{j-1},w_j) > 0\} | }\)
The numerator is the number of distinct words that precede the given word, i.e. the number of bigram types the word completes. The denominator is the total number of bigram types. For instance, a frequent word like Francisco that occurs in only a few contexts, such as San, will have a low continuation probability.
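To make this concrete, below is a rough base-R sketch of interpolated Kneser-Ney for bigrams; the function name and the `tokens` vector (a plain word vector like the one built earlier) are my own illustrative choices, not the final implementation.

```r
# A rough base-R sketch of interpolated Kneser-Ney for bigrams, assuming
# `tokens` is a character vector of words such as the one built earlier.
# Names here are illustrative only.
kn_bigram_prob <- function(w_prev, w, tokens, delta = 0.75) {
  n <- length(tokens)
  bigrams <- paste(tokens[1:(n - 1)], tokens[2:n])
  bigram_counts <- table(bigrams)
  types <- names(bigram_counts)
  # Continuation probability: distinct contexts in which w appears,
  # divided by the total number of distinct bigram types.
  second <- sub("^\\S+ ", "", types)
  p_cont <- sum(second == w) / length(types)
  c_prev <- sum(tokens[1:(n - 1)] == w_prev)
  if (c_prev == 0) return(p_cont)            # unseen context: back off fully
  # Absolute discounting on the observed bigram count.
  c_bigram <- bigram_counts[paste(w_prev, w)]
  c_bigram <- ifelse(is.na(c_bigram), 0, c_bigram)
  # Interpolation weight: discount mass spread over the distinct words
  # that follow w_prev.
  n_follow <- sum(sub(" \\S+$", "", types) == w_prev)
  lambda <- delta * n_follow / c_prev
  max(c_bigram - delta, 0) / c_prev + lambda * p_cont
}
```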
The plan is to build a trigram model using Kneser-Ney smoothing. Given a bigram such as “expecting a”, the model knows the probability of each third word already seen in the training set, for instance “expecting a baby” or “expecting a package”. The UI would then display a list of suggestions (for the third word of the trigram) sorted by the probability of the trigram, and the user can choose one if it matches what they intended to type.
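A sketch of how that lookup might work, assuming a precomputed table `trigram_probs` with columns `w1`, `w2`, `w3` and `prob` holding smoothed trigram probabilities (hypothetical names):

```r
# A sketch of the prediction step, assuming a precomputed data frame
# `trigram_probs` with columns w1, w2, w3 and prob; names are illustrative.
predict_next <- function(bigram, trigram_probs, k = 5) {
  words <- strsplit(tolower(bigram), "\\s+")[[1]]   # assumes at least two words
  w1 <- words[length(words) - 1]
  w2 <- words[length(words)]
  hits <- trigram_probs[trigram_probs$w1 == w1 & trigram_probs$w2 == w2, ]
  head(hits[order(-hits$prob), "w3"], k)            # top-k suggestions for the UI
}

# Example: predict_next("expecting a", trigram_probs) might return
# c("baby", "package", ...), depending on the training data.
```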
If you are a reviewer, any feedback on this approach is welcome.
In the Shiny app there will be a text box where users can input text. Once at least two words have been typed, the app will suggest a list of words from the language model. For example, if you type “Expecting a” the app will suggest “baby, package” as a drop-down list. The goal of this app is to provide a smart autocomplete system that speeds up typing by predicting the next word.
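A minimal Shiny sketch of that interaction, reusing the hypothetical `predict_next` helper and `trigram_probs` table from above:

```r
# A minimal Shiny sketch of the planned UI, assuming the hypothetical
# predict_next() helper and trigram_probs table described earlier exist.
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type your text:", value = ""),
  uiOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderUI({
    words <- strsplit(trimws(input$phrase), "\\s+")[[1]]
    if (length(words) < 2) return(helpText("Type at least two words."))
    # Use the last two words typed as the context bigram.
    choices <- predict_next(paste(tail(words, 2), collapse = " "), trigram_probs)
    selectInput("pick", "Suggestions:", choices = choices)
  })
}

shinyApp(ui = ui, server = server)
```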