Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant.
The objective of this project is to put into practice the knowledge acquired during the previous 9 courses of the Data Science Specialization.
The project consists in showcasing, in a Shiny app, a predictive model of the next word someone is typing on a keyboard.
A secondary objective is to do this in a reproducible way.
The data is from a corpus called HC Corpora (http://www.corpora.heliohost.org).
The following directories are available once the archive has been unzipped:
Each language specific sub-folder contains three files:
In this project, we will focus on the en_US locale.
Some interesting facts on these files:

| file_name | file_size | nb_of_lines |
|---|---|---|
| en_US.twitter.txt | 159.4 MB | 2,360,148 |
| en_US.blogs.txt | 200.4 MB | 899,288 |
| en_US.news.txt | 196.3 MB | 1,010,242 |
The dataset is partitioned into a training (99%) and a testing (1%) set using random binomial sampling via the splitCorpus function.
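As an illustration, here is a minimal sketch of such a binomial split, assuming the corpus is held in a character vector with one line per element; the actual splitCorpus function may differ in its details.

```r
# Minimal sketch of a random binomial 99%/1% split, assuming the corpus is
# a character vector `corpus_lines` (one line per element). Illustrative only.
splitCorpus <- function(corpus_lines, train_ratio = 0.99, seed = 1234) {
  set.seed(seed)                       # for reproducibility
  in_train <- rbinom(length(corpus_lines), size = 1, prob = train_ratio) == 1
  list(training = corpus_lines[in_train],
       testing  = corpus_lines[!in_train])
}
```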
The training set will be used to create the Language Model: the n-gram tables.
The testing set will be used to:
Some refinement is done on the data via the cleaningCorpus function:
Other cleaning rules have been reviewed but not used:
The textcat package showed very poor results, so I decided not to use it. A future refinement would consist in splitting lines into sentences, so that each line contains exactly one sentence.
Implementing such a refinement will be challenging: there is no dedicated character that reliably marks the end of a sentence!
Furthermore, many exceptions apply:
Another problem that would need to be tackled is the multiple purposes of the “.” character:
Implementing such a refinement should significantly increase the accuracy of the model.
Our predictive model will rely on a Language Model, which “is a probability distribution over sequences of words. Given such a sequence, say of length \(m\), it assigns a probability \(P\left(w_{1},\ldots ,w_{m}\right)\) to the whole sequence.
[…]
In speech recognition, the computer tries to match sounds with word sequences. The language model provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases “recognize speech” and “wreck a nice beach” are pronounced almost the same but mean very different things. These ambiguities are easier to resolve when evidence from the language model is incorporated with the pronunciation model and the acoustic model." [1].
In our context, it will be used to assess the probability of the possible next words and therefore to identify the most probable one.
Data sparsity is a major problem in building language models. Most possible word sequences will not be observed in training. One solution is to make the assumption that the probability of a word only depends on the previous n words. [1].
This assumption is also known as the nth order Markov property[2].
It is built on the concept of a Markov Process, which considers a sequence of random variables \(X_{1},\,X_{2},\,...,\,X_{m}\).
In our context, each word in the sentence is a random variable.
Each random variable can take any value in a finite set \(V\).
We could rewrite our probability function as: \[P\left(X_{1} = x_{1},\,X_{2} = x_{2},\,...,\,X_{m} = x_{m}\right)\]
Thanks to the chain rule property of probabilities, we can rewrite our Language Model probability function as: \[P\left(X_{1} = x_{1},\,X_{2} = x_{2},\,...,\,X_{m} = x_{m}\right) \\ = P\left(X_{1} = x_{1}\right) \cdot P\left(X_{2} = x_{2}\, | \,X_{1} = x_{1}\right) \\ \prod\limits_{i=3}^{m} P\left(X_{i} = x_{i}\, | \,X_{1} = x_{1},\,...,\,X_{i-1} = x_{i-1}\right) \]
Making the 2nd order Markov Assumption, we obtain:
\[= P\left(X_{1} = x_{1}\right) \cdot P\left(X_{2} = x_{2}\, | \,X_{1} = x_{1}\right) \\ \prod\limits_{i=3}^{m} P\left(X_{i} = x_{i}\, | \,X_{i-2} = x_{i-2},\,X_{i-1} = x_{i-1}\right) \]
Each word probability now only depends on the two previous words. Our Language Model is therefore a tri-gram model.
Several techniques allow accounting for unobserved n-grams, including interpolation and discounting methods.
I have decided to implement two methods:
The Katz Back-off Model is a discounting method: it lowers the probability of observed n-grams so that the leftover probability mass can be assigned to unobserved ones.
In practice, we will subtract a constant amount \(d = 0.5\) from the n-gram counts: \[Count^{*}\left(w_{i-2},\,w_{i-1},\,w_{i}\right) = Count\left(w_{i-2},\,w_{i-1},\,w_{i}\right) - d\] \[Count^{*}\left(w_{i-1},\,w_{i}\right) = Count\left(w_{i-1},\,w_{i}\right) - d\]
This model also backs off to n-grams with a shorter history ((n-1)-grams, (n-2)-grams, …, 1-grams) in order to generate a more accurate probability for unobserved n-grams.
For a trigram model, let’s first define two sets: \[A\left(w_{i-2},\,w_{i-1}\right) = \left\{w: Count\left(w_{i-2},\,w_{i-1},\,w\right) > 0\right\}\] \[B\left(w_{i-2},\,w_{i-1}\right) = \left\{w: Count\left(w_{i-2},\,w_{i-1},\,w\right) = 0\right\}\] The Katz Backed-Off (KBO) estimate is then defined as: \[q_{KBO}\left(w_{i}\,|\,w_{i-2},\,w_{i-1}\right) = \left\{ \begin{array}{ll} \frac{Count^{*}\left(w_{i-2},\,w_{i-1},\,w_{i}\right)}{Count\left(w_{i-2},\,w_{i-1}\right)}\,\,\,if\,\,w_{i} \in A\left(w_{i-2},\,w_{i-1}\right)\\ \frac{\alpha\left(w_{i-2},\,w_{i-1}\right)\,q_{KBO}\left(w_{i}\,|\,w_{i-1}\right)}{\sum_{w \in B\left(w_{i-2},\,w_{i-1}\right)} q_{KBO}\left(w\,|\,w_{i-1}\right)}\,\,\,if\,\,w_{i} \in B\left(w_{i-2},\,w_{i-1}\right) \end{array} \right.\] where \[\alpha\left(w_{i-2},\,w_{i-1}\right) = 1 - \sum_{w \in A\left(w_{i-2},\,w_{i-1}\right)} \frac{Count^{*}\left(w_{i-2},\,w_{i-1},\,w\right)}{Count\left(w_{i-2},\,w_{i-1}\right)}\]
And now, for a bigram model, let’s define two sets: \[A\left(w_{i-1}\right) = \left\{w: Count\left(w_{i-1},\,w\right) > 0\right\}\] \[B\left(w_{i-1}\right) = \left\{w: Count\left(w_{i-1},\,w\right) = 0\right\}\] The Katz Backed-Off (KBO) estimate is then defined as: \[q_{KBO}\left(w_{i}\,|\,w_{i-1}\right) = \left\{ \begin{array}{ll} \frac{Count^{*}\left(w_{i-1},\,w_{i}\right)}{Count\left(w_{i-1}\right)}\,\,\,if\,\,w_{i} \in A\left(w_{i-1}\right)\\ \frac{\alpha\left(w_{i-1}\right)\,q_{ML}\left(w_{i}\right)}{\sum_{w \in B\left(w_{i-1}\right)} q_{ML}\left(w\right)}\,\,\,if\,\,w_{i} \in B\left(w_{i-1}\right) \end{array} \right.\] where \[\alpha\left(w_{i-1}\right) = 1 - \sum_{w \in A\left(w_{i-1}\right)} \frac{Count^{*}\left(w_{i-1},\,w\right)}{Count\left(w_{i-1}\right)}\]
The Maximum Likelihood (ML) of unigrams is simply defined as : \[q_{ML}\left(w_{i}\right) = \frac{Count\left(w_{i}\right)}{\sum\limits_{j=1}^{m}Count\left(w_{j}\right)}\] where \(m\) is the number of words [5].
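To make the back-off mechanics concrete, below is a minimal sketch of the bigram-level estimate \(q_{KBO}\left(w_{i}\,|\,w_{i-1}\right)\) in R. It assumes named count vectors uni_counts and bi_counts whose names are space-separated words (such as those produced by the counting sketch shown further below); the trigram case follows the same pattern, backing off to this function. The actual predictNextWord implementation may differ.

```r
# Minimal sketch of the bigram-level Katz Back-off estimate q_KBO(w | w1),
# assuming named count vectors `uni_counts` and `bi_counts` (bigram names
# are space-separated words). Illustrative only.
d <- 0.5                                            # discount constant

q_ml <- function(w) unname(uni_counts[w]) / sum(uni_counts)   # unigram ML estimate

q_kbo_bigram <- function(w, w1) {
  bi_key <- paste(w1, w)
  if (!is.na(bi_counts[bi_key]) && bi_counts[bi_key] > 0) {
    # w is in A(w1): discounted relative frequency
    return(unname(bi_counts[bi_key] - d) / unname(uni_counts[w1]))
  }
  # w is in B(w1): redistribute the leftover probability mass alpha(w1)
  seen_keys  <- names(bi_counts)[startsWith(names(bi_counts), paste0(w1, " "))]
  alpha      <- 1 - sum(bi_counts[seen_keys] - d) / unname(uni_counts[w1])
  seen_words <- sub("^\\S+\\s+", "", seen_keys)     # second word of each observed bigram
  unseen     <- setdiff(names(uni_counts), seen_words)
  alpha * q_ml(w) / sum(q_ml(unseen))
}
```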
The Stupid Back-off Model is a simpler model that does not generate normalized probabilities, but rather what we could call a score.
The main difference is that it does not apply any discounting and instead directly uses the relative frequencies.
For a trigram model, the Stupid Backed Off (SBO) estimate is defined as: \[s_{SBO}\left(w_{i}\,|\,w_{i-2},\,w_{i-1}\right) = \left\{ \begin{array}{ll} \frac{Count\left(w_{i-2},\,w_{i-1},\,w_{i}\right)}{Count\left(w_{i-2},\,w_{i-1}\right)}\,\,\,if\,\,Count\left(w_{i-2},\,w_{i-1},\,w_{i}\right) > 0\\ \alpha \cdot s_{SBO}\left(w_{i}\,|\,w_{i-1}\right)\,\,\,otherwise \end{array} \right.\]
For a bigram model, the Stupid Backed Off (SBO) estimate is defined as: \[s_{SBO}\left(w_{i}\,|\,w_{i-1}\right) = \left\{ \begin{array}{ll} \frac{Count\left(w_{i-1},\,w_{i}\right)}{Count\left(w_{i-1}\right)}\,\,\,if\,\,Count\left(w_{i-1},\,w_{i}\right) > 0\\ \alpha \cdot q_{ML}\left(w_{i}\right)\,\,\,otherwise \end{array} \right.\] In general, the back-off factor \(\alpha\) may be made to depend on \(n\). Here, a single value is used and heuristically set to \(\alpha\) = 0.4.
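For comparison, here is a minimal sketch of the full Stupid Back-off cascade (trigram, then bigram, then unigram), under the same assumed count vectors as in the Katz sketch above; the actual stupidPredictNextWord function may be organised differently.

```r
# Minimal sketch of the Stupid Back-off score s_SBO(w | w2, w1), assuming
# the same named count vectors uni_counts, bi_counts and tri_counts.
alpha <- 0.4                                        # heuristic back-off factor

s_sbo <- function(w, w1, w2) {                      # w1 = word i-1, w2 = word i-2
  tri_key <- paste(w2, w1, w)
  bi_key  <- paste(w1, w)
  if (!is.na(tri_counts[tri_key]) && tri_counts[tri_key] > 0) {
    return(unname(tri_counts[tri_key]) / unname(bi_counts[paste(w2, w1)]))
  }
  if (!is.na(bi_counts[bi_key]) && bi_counts[bi_key] > 0) {
    return(alpha * unname(bi_counts[bi_key]) / unname(uni_counts[w1]))
  }
  alpha^2 * unname(uni_counts[w]) / sum(uni_counts) # back off to the unigram ML estimate
}
```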
Both back-off models rely on the counts of trigrams, bigrams, and unigrams.
These counts are calculated by the calculatingNgrams function, which takes 3 parameters (ratio, cutoff, filter_one_freq) that are later used during the model tweaking phase.
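As an illustration of the counting step, here is a minimal, base-R sketch that builds such count tables from a character vector of training lines; the real calculatingNgrams function additionally handles the ratio, cutoff and filter_one_freq parameters as well as chunked, parallel processing.

```r
# Minimal sketch of n-gram counting from a character vector `training_lines`,
# keyed by space-separated words. Illustrative only.
count_ngrams <- function(lines, n) {
  tokens <- strsplit(tolower(lines), "\\s+")
  ngrams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(ngrams), decreasing = TRUE)             # counts, most frequent first
}

uni_counts <- count_ngrams(training_lines, 1)
bi_counts  <- count_ngrams(training_lines, 2)
tri_counts <- count_ngrams(training_lines, 3)
```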
The predictNextWord function implements the Katz Back-off Model. It takes 4 parameters:
The stupidPredictNextWord function implements the Stupid Back-off Model. It takes 3 parameters:
The context of use of our feature is the writing of a text message.
Users would only use this functionality if it increased their productivity. Therefore, we should aim at getting a result in less than 2 seconds. Unfortunately, I was not able to reach this speed without drastically lowering the accuracy of the model.
I chose instead a limit of 15 seconds.
During my tests, I measured the execution time of both the Stupid and Katz back-off algorithms.
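For reference, execution time can be measured with base R's system.time; the calls below assume, for illustration only, that both functions can be invoked with just the input phrase, their remaining parameters taking default values.

```r
# Minimal sketch of the timing measurement (call signatures assumed; the
# actual calls use the parameters listed above).
katz_timing   <- system.time(predictNextWord("I went to the"))["elapsed"]
stupid_timing <- system.time(stupidPredictNextWord("I went to the"))["elapsed"]
```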
I first intended to evaluate the accuracy of my model by calculating its perplexity. Indeed, perplexity is a
measurement of how well a […] probability model predicts a sample. […] A low perplexity indicates the probability model is good at predicting the sample [6].
Perplexity can be calculated as follows:
\[Perplexity = 2^{-l}\] where \[l = \frac{1}{M}\sum\limits_{i=1}^{m} \log p\left(s_{i}\right)\] and \(M\) is the total number of words in the test dataset.
It is therefore necessary to calculate the probability of each sentence. As seen above: \[P\left(X_{1} = x_{1},\,X_{2} = x_{2},\,...,\,X_{n} = x_{n}\right) \\ = P\left(X_{1} = x_{1}\right) \cdot P\left(X_{2} = x_{2}\, | \,X_{1} = x_{1}\right) \\ \prod\limits_{i=3}^{n} P\left(X_{i} = x_{i}\, | \,X_{i-2} = x_{i-2},\,X_{i-1} = x_{i-1}\right) \]
Due to a technical constraint (the trigram tokenizer does not return trigrams in their order of appearance in the sentence), I had to simplify the equation: \[\approx \prod\limits_{i=3}^{n} P\left(X_{i} = x_{i}\, | \,X_{i-2} = x_{i-2},\,X_{i-1} = x_{i-1}\right) \] The calculation was done on the test set, first on a few lines (10, 50, …) and then on a larger set (500). Results were as follows:

| nb_lines | perplexity |
|---|---|
| 10 | 150 |
| 50 | 1,435 |
| 500 | 1,585,188 |
We can observe that the perplexity is not stable as the number of lines taken into account increases…
Looking at the formula, the \(\frac{1}{M}\) term should in theory provide this stability.
Unfortunately, I could not find a way to fix the issue.
The perplexity function was kept for the record.
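For the record as well, here is a minimal sketch of the simplified perplexity computation, assuming a conditional probability function q(w, w1, w2) (for example the Katz estimate) and a character vector of test lines; the project's perplexity function may differ.

```r
# Minimal sketch of the simplified perplexity: Perplexity = 2^(-l),
# with l = (sum of log2 sentence probabilities) / M.
perplexity_sketch <- function(test_lines, q) {
  tokens <- strsplit(tolower(test_lines), "\\s+")
  M      <- sum(lengths(tokens))                    # total number of words in the test set
  log_p  <- sum(vapply(tokens, function(w) {
    if (length(w) < 3) return(0)
    sum(vapply(3:length(w),
               function(i) log2(q(w[i], w[i - 1], w[i - 2])),
               numeric(1)))
  }, numeric(1)))
  2^(-log_p / M)
}
```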
A more natural way to estimate the accuracy of the model is to count the number of hits (among top 3 predicted words) and divide by the total number of tests.
The accuracyNChoices function implements this method. It gives much better results that are actionable.
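A minimal sketch of this hit-rate measurement, assuming a hypothetical predictor predict_top3() that returns up to three candidate words; the actual accuracyNChoices interface may differ.

```r
# Minimal sketch of top-3 accuracy: proportion of test cases where the true
# next word appears among the three predicted candidates.
accuracy_top3 <- function(test_phrases, test_next_words, predict_top3) {
  hits <- mapply(function(phrase, truth) truth %in% predict_top3(phrase),
                 test_phrases, test_next_words)
  mean(hits)
}
```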
Parameters of the model used to optimize the Performance vs Accuracy tradeoff were:
As seen before, the Corpus is very large. One could ask whether using all the data brings a significant benefit, or whether we could take a subset without losing too much variance.
I decided to test different ratios: 20%, 25%, 30% and 100%.
Looking at the distribution of word frequencies, we can see that a lot of words appear only once in the Corpus.
Their probability of appearing is very low (\(\frac{1}{M}\), where \(M\) is the total number of words in the training dataset).
Therefore, their predictive power is poor. As a consequence, bigrams and trigrams including these words also have poor predictive power.
Removing them from the dictionary should therefore have very little impact on the predictive power of our model.
Besides, it will drastically reduce the size of our n-gram tables.
The same principle will be applied to bigrams and trigrams with a frequency of 1.
One could now ask whether we could remove more than just words with a frequency of 1.
The cutoff parameter represents the percentile of the distribution we want to keep (keeping high-frequency words in priority).
I decided to test different values: 90%, 85% and 80%.
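One plausible reading of these two filters, as a minimal sketch on a named count vector (the actual calculatingNgrams logic may differ):

```r
# Minimal sketch of the frequency-based filters: drop n-grams seen only once,
# then keep the most frequent entries covering `cutoff` of the total
# occurrences. One possible interpretation only.
filter_counts <- function(counts, cutoff = 0.90, filter_one_freq = TRUE) {
  if (filter_one_freq) counts <- counts[counts > 1]
  counts <- sort(counts, decreasing = TRUE)
  keep   <- cumsum(counts) / sum(counts) <= cutoff   # high-frequency entries first
  counts[keep]
}
```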
These 3 model parameters give us 24 combinations.
We are looking at optimizing the Performance vs Accuracy tradeoff.
We need to calculate, for each parameter combination:
In addition, we can store the size of n-gram tables.
Results should be stored in a file for future analysis.
The optimizeModel function is doing just that.
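A minimal sketch of such a grid search, assuming (hypothetically) that calculatingNgrams returns the n-gram tables and that accuracyNChoices can be evaluated on them; the actual optimizeModel function may differ in how it wires these pieces together.

```r
# Minimal sketch of the parameter grid search (4 ratios x 3 cutoffs x 2 = 24).
param_grid <- expand.grid(
  ratio           = c(0.20, 0.25, 0.30, 1.00),
  cutoff          = c(0.80, 0.85, 0.90),
  filter_one_freq = c(TRUE, FALSE)
)

results <- lapply(seq_len(nrow(param_grid)), function(i) {
  p <- param_grid[i, ]
  ngrams <- calculatingNgrams(p$ratio, p$cutoff, p$filter_one_freq)
  timing <- unname(system.time(acc <- accuracyNChoices(ngrams))["elapsed"])
  data.frame(p, accuracy = acc, timing = timing,
             table_size = as.numeric(object.size(ngrams)))
})
write.csv(do.call(rbind, results), "optimization_results.csv", row.names = FALSE)
```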
Tests have been performed on a Linux server with 16 cores and 60 GB of memory. Below is some detailed information on the CPU and memory:
lscpu
| Property | Value |
|---|---|
| Architecture | x86_64 |
| CPU op-mode(s) | 32-bit, 64-bit |
| Byte Order | Little Endian |
| CPU(s) | 16 |
| Vendor ID | GenuineIntel |
| CPU family | 6 |
| Model | 42 |
| Model name | Intel Xeon E312xx (Sandy Bridge) |
| … |
cat /proc/meminfo
| Property | Value |
|---|---|
| MemTotal | 64428136 kB |
| … |
In the Optimization problem statement, we made the assumption that removing n-grams with a frequency of 1 would have a very small impact on the Accuracy.
Let’s see if this is true:
We can observe that the impact on Accuracy is small.
Therefore any future result will be presented with n-grams with a frequency of 1 filtered out.
Looking at the “Accuracy by Cutoff and Ratio values” chart below, we can observe that when increasing the ratio and cutoff values, the accuracy also increases:
Looking at the “Performance by Ratio and Cutoff values” chart below, we can observe that when increasing the ratio and cutoff values, the timing also increases:
As stated in the Optimization problem statement, a timing > 15 s is not acceptable. Therefore, we can eliminate several combinations:
We can highlight the corresponding Accuracies:
As we are interested in maximizing the Accuracy, the winning combination is:
We can also note that if the timing could be reduced by 25%, we could use a Ratio of 30%, which would lead to a 9% increase in Accuracy.
This project was the opportunity to work on a very large dataset (4,269,678 rows).
During the n-gram tables creation phase, it was not possible to process 100% of the Corpus at once due to a lack of memory.
To work around this, it was necessary to split the Corpus into chunks, process each of them, and merge the results back together.
Several tasks were computationally heavy:
By default, R runs only on a single thread on the CPU.
In order to speed up the above tasks, it is possible to run them in parallel on multiple cores.
foreach package

I used the doParallel package as a “parallel backend” for the foreach package. It provides the mechanism needed to execute foreach loops in parallel.
The doParallel package acts as an interface between foreach and the parallel package of R.
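A minimal sketch of this pattern, assuming the corpus chunks are stored in a list corpus_chunks and processed by a hypothetical process_chunk() function:

```r
# Minimal sketch of running foreach loops in parallel via doParallel.
library(foreach)
library(doParallel)

cl <- makeCluster(15)                 # leave one of the 16 cores free
registerDoParallel(cl)

chunk_counts <- foreach(chunk = corpus_chunks, .combine = rbind) %dopar% {
  process_chunk(chunk)                # e.g. tokenize and count n-grams for one chunk
}

stopCluster(cl)
```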
I implemented foreach loops to perform the following tasks:
multidplyr package

I used the multidplyr package as a “backend” for the dplyr package; it partitions a data frame across multiple cores.
I used it essentially to reprocess n-gram tables:
The calculation of bigram and trigram probabilities is done in parallel (15 cores).
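A minimal sketch of that pattern with the current multidplyr API (which may differ from the version used at the time), assuming a data frame trigrams_df with columns w1, w2, w3 and count:

```r
# Minimal sketch: compute trigram relative frequencies in parallel with
# multidplyr, keeping each (w1, w2) history on a single worker.
library(dplyr)
library(multidplyr)

cluster <- new_cluster(15)

trigram_probs <- trigrams_df %>%
  group_by(w1, w2) %>%
  partition(cluster) %>%                  # spread the groups across 15 workers
  mutate(prob = count / sum(count)) %>%   # relative frequency within each history
  collect()                               # gather the results back
```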
It is important to control the loading of data in memory since it will be duplicated in the 15 instances running in parallel.
In order to reduce the memory footprint, I used the bigmemory package so that the bigram and trigram tables are shared across all instances.
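A minimal sketch of the idea, assuming the bigram counts can be represented as a numeric matrix (bigmemory only stores numeric types, so words would have to be encoded as integer IDs); the project's actual data structures may differ.

```r
# Minimal sketch of sharing a matrix across parallel workers with bigmemory.
library(bigmemory)

# bigram_matrix: assumed numeric matrix of (word_id_1, word_id_2, count) rows
shared_bigrams <- as.big.matrix(bigram_matrix, shared = TRUE)
desc <- describe(shared_bigrams)      # lightweight descriptor, cheap to send to workers

# Inside each parallel worker, re-attach the shared memory instead of copying it:
# bigrams <- attach.big.matrix(desc)
```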