Background

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant.

Objective

The objective of this project is to put into practice the knowledge acquired during the previous 9 courses of the Data Science Specialization.
The project consists in showcasing, in a shinyApp, a predictive model of the next word someone is typing on a keyboard.

A secondary objective will be to do it in a reproducible way.

Data retrieval and cleaning

Data retrieval

The data is from a corpus called HC Corpora (http://www.corpora.heliohost.org).

The following directories are available once the archive has been unzipped:

  • final:
    • de_DE
    • en_US
    • fi_FI
    • ru_RU

Each language-specific sub-folder contains three files:

  • [locale].blogs.txt
  • [locale].news.txt
  • [locale].twitter.txt

In this project, we will focus on the en_US locale.

Some interesting facts on these files:
file_name          file_size  nb_of_lines
en_US.twitter.txt  159.4 Mb     2,360,148
en_US.blogs.txt    200.4 Mb       899,288
en_US.news.txt     196.3 Mb     1,010,242

Splitting Corpus

The dataset is partitioned into a training (99%) and a testing (1%) set using random binomial sampling via the splitCorpus function.
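
For reference, here is a minimal sketch of what such a binomial split could look like (an illustration only; the actual splitCorpus implementation may differ):

    # Hypothetical sketch of a 99% / 1% binomial split of the corpus lines.
    splitCorpus <- function(lines, train_ratio = 0.99, seed = 1234) {
      set.seed(seed)                                   # for reproducibility
      in_train <- rbinom(length(lines), size = 1, prob = train_ratio) == 1
      list(training = lines[in_train], testing = lines[!in_train])
    }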

The training set will be used to create the Language Model: the n-gram tables.
The testing set will be used to:

  • Optimize the Accuracy vs Performance tradeoff when tweaking the model
  • Calculate the Accuracy of the model

Data cleaning

Some refinement is done on the data via the cleaningCorpus function (sketched after the list):

  • URL, hashtag and mention filtering: These elements are too specific to the context they appear in.
  • Lowercase transformation: In order to simplify the problem, the text is converted to lowercase.
  • Removing non-ASCII characters
  • Removing profanities
  • Removing punctuation while preserving intra-word punctuation
  • Removing extra whitespace
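
For illustration only, these rules could be sketched in base R as follows (an assumption on my part; the actual cleaningCorpus implementation may differ):

    # Hypothetical sketch of the cleaning rules listed above.
    cleaningCorpus <- function(lines, profanities = character(0)) {
      x <- gsub("http[s]?://\\S+|www\\.\\S+", " ", lines)     # URLs
      x <- gsub("[#@]\\S+", " ", x)                           # hashtags and mentions
      x <- tolower(x)                                         # lowercase
      x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")  # drop non-ASCII characters
      if (length(profanities) > 0) {                          # profanity filtering
        x <- gsub(paste0("\\b(", paste(profanities, collapse = "|"), ")\\b"), " ", x)
      }
      x <- gsub("[^a-z' ]", " ", x)                           # punctuation, keeping intra-word apostrophes
      gsub("\\s+", " ", trimws(x))                            # extra whitespace
    }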

Other cleaning rules have been reviewed but not used:

  • Stemming words: as our application comprises a generation phase (predicting the next word), it would have been necessary to add back what the stemming process removed, a complexity that was not worth the gain.
  • Foreign words: tests of the textcat package showed very poor results, so I decided not to use it.

A future refinement would consist in splitting lines into sentences, so that each line contains exactly one sentence.
Implementing such a refinement will be challenging: there is no single dedicated character that marks the end of a sentence!
Furthermore, many exceptions apply:

  • It can be any of the following characters: “.”, “!”, “?”.
  • The character can be used multiple times to accentuate its meaning: “!!!”, “??”, “…”.
  • Multiple characters can be used in conjunction: “!!?”, “??!”.
  • It can be missing at the end of the sentence.

Another problem that would need to be tackled is the multiple purposes of the “.” character:

  • Abbreviate a word (e.g. St.)
  • After an initial letter used to stand for a name (e.g. J. K. Rowling)
  • Placed after each individual letter in an initialism (e.g. U.S.A.)
  • It has multiple contexts in mathematics and computing

Implementing such a refinement should significantly increase the accuracy of the model.

Language Model and Algorithms

Language Model

Our predictive model will rely on a Language Model which

is a probability distribution over sequences of words. Given such a sequence, say of length \(m\), it assigns a probability \(P\left(w_{1},\ldots ,w_{m}\right)\) to the whole sequence.
[…]
In speech recognition, the computer tries to match sounds with word sequences. The language model provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases “recognize speech” and “wreck a nice beach” are pronounced almost the same but mean very different things. These ambiguities are easier to resolve when evidence from the language model is incorporated with the pronunciation model and the acoustic model. [1]

In our context, it will be used to assess the probability of the possible next words and therefore to identify the most probable ones.

Data sparsity is a major problem in building language models. Most possible word sequences will not be observed in training. One solution is to make the assumption that the probability of a word only depends on the previous n words. [1].

This assumption is also known as the nth order Markov property[2].

It is built on the concept of a Markov Process, which considers a sequence of random variables \(X_{1},\,X_{2},\,...,\,X_{m}\).
In our context, each word in the sentence is a random variable.
Each random variable can take any value in a finite set \(V\).
We could rewrite our probability function as: \[P\left(X_{1} = x_{1},\,X_{2} = x_{2},\,...,\,X_{m} = x_{m}\right)\]

Thanks to the chain rule property of probabilities, we can rewrite our Language Model probability function as: \[P\left(X_{1} = x_{1},\,X_{2} = x_{2},\,...,\,X_{m} = x_{m}\right) \\ = P\left(X_{1} = x_{1}\right) \cdot P\left(X_{2} = x_{2}\, | \,X_{1} = x_{1}\right) \\ \prod\limits_{i=3}^{m} P\left(X_{i} = x_{i}\, | \,X_{1} = x_{1},\,...,\,X_{i-1} = x_{i-1}\right) \]

Making the 2nd order Markov Assumption, we obtain:

\[= P\left(X_{1} = x_{1}\right) \cdot P\left(X_{2} = x_{2}\, | \,X_{1} = x_{1}\right) \\ \prod\limits_{i=3}^{m} P\left(X_{i} = x_{i}\, | \,X_{i-2} = x_{i-2},\,X_{i-1} = x_{i-1}\right) \]

Each word probability now only depends on the two previous words. Our Language Model is therefore a tri-gram model.
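
As an illustration, using the (lowercased) sentence from the introduction, the trigram model approximates: \[P\left(\text{i},\,\text{went},\,\text{to},\,\text{the},\,\text{gym}\right) \\ \approx P\left(\text{i}\right) \cdot P\left(\text{went}\,|\,\text{i}\right) \cdot P\left(\text{to}\,|\,\text{i},\,\text{went}\right) \cdot P\left(\text{the}\,|\,\text{went},\,\text{to}\right) \cdot P\left(\text{gym}\,|\,\text{to},\,\text{the}\right)\]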

Accounting for unobserved n-grams

Several techniques allow accounting for unobserved n-grams, including interpolation and discounting methods.
I have decided to implement two methods:

  • Katz Back-off Model [3]
  • Stupid Back-off Model [4]

Katz Back-off Model

The Katz Back-off Model is a discounting method: it lowers the probability of observed n-grams so that the leftover probability mass can be assigned to unobserved ones.

In practice, we will subtract a constant amount \(d = 0.5\) from the n-gram counts: \[Count^{*}\left(w_{i-2},\,w_{i-1},\,w_{i}\right) = Count\left(w_{i-2},\,w_{i-1},\,w_{i}\right) - d\] \[Count^{*}\left(w_{i-1},\,w_{i}\right) = Count\left(w_{i-1},\,w_{i}\right) - d\]
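
For example, if the trigram “i went to” appeared, say, 3 times in the training set, its discounted count would be: \[Count^{*}\left(\text{i},\,\text{went},\,\text{to}\right) = 3 - 0.5 = 2.5\]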

This model also backs off to lower-order n-grams ((n-1)-grams, (n-2)-grams, …, 1-grams) in order to generate a more accurate probability for unobserved n-grams.

For a trigram model, let’s first define two sets: \[A\left(w_{i-2},\,w_{i-1}\right) = \left\{w: Count\left(w_{i-2},\,w_{i-1},\,w\right) > 0\right\}\] \[B\left(w_{i-2},\,w_{i-1}\right) = \left\{w: Count\left(w_{i-2},\,w_{i-1},\,w\right) = 0\right\}\] The Katz Backed-Off (KBO) estimate is then defined as: \[q_{KBO}\left(w_{i}\,|\,w_{i-2},\,w_{i-1}\right) = \left\{ \begin{array}{ll} \frac{Count^{*}\left(w_{i-2},\,w_{i-1},\,w_{i}\right)}{Count\left(w_{i-2},\,w_{i-1}\right)}\,\,\,if\,\,w_{i} \in A\left(w_{i-2},\,w_{i-1}\right)\\ \frac{\alpha\left(w_{i-2},\,w_{i-1}\right)\,q_{KBO}\left(w_{i}\,|\,w_{i-1}\right)}{\sum_{w \in B\left(w_{i-2},\,w_{i-1}\right)} q_{KBO}\left(w\,|\,w_{i-1}\right)}\,\,\,if\,\,w_{i} \in B\left(w_{i-2},\,w_{i-1}\right) \end{array} \right.\] where \[\alpha\left(w_{i-2},\,w_{i-1}\right) = 1 - \sum_{w \in A\left(w_{i-2},\,w_{i-1}\right)} \frac{Count^{*}\left(w_{i-2},\,w_{i-1},\,w\right)}{Count\left(w_{i-2},\,w_{i-1}\right)}\]

And now, for a bigram model, let’s define two sets: \[A\left(w_{i-1}\right) = \left\{w: Count\left(w_{i-1},\,w\right) > 0\right\}\] \[B\left(w_{i-1}\right) = \left\{w: Count\left(w_{i-1},\,w\right) = 0\right\}\] The Katz Backed-Off (KBO) estimate is then defined as: \[q_{KBO}\left(w_{i}\,|\,w_{i-1}\right) = \left\{ \begin{array}{ll} \frac{Count^{*}\left(w_{i-1},\,w_{i}\right)}{Count\left(w_{i-1}\right)}\,\,\,if\,\,w_{i} \in A\left(w_{i-1}\right)\\ \frac{\alpha\left(w_{i-1}\right)\,q_{ML}\left(w_{i}\right)}{\sum_{w \in B\left(w_{i-1}\right)} q_{ML}\left(w\right)}\,\,\,if\,\,w_{i} \in B\left(w_{i-1}\right) \end{array} \right.\] where \[\alpha\left(w_{i-1}\right) = 1 - \sum_{w \in A\left(w_{i-1}\right)} \frac{Count^{*}\left(w_{i-1},\,w\right)}{Count\left(w_{i-1}\right)}\]

The Maximum Likelihood (ML) estimate of unigrams is simply defined as: \[q_{ML}\left(w_{i}\right) = \frac{Count\left(w_{i}\right)}{\sum\limits_{j=1}^{m}Count\left(w_{j}\right)}\] where \(m\) is the number of words [5].

Stupid Back-off Model

The Stupid Back-off Model is a simpler model that does not generate normalized probabilities but rather what we could call a score.
The main difference is that it does not apply any discounting and instead directly uses the relative frequencies.

For a trigram model, the Stupid Backed Off (SBO) estimate is defined as: \[s_{SBO}\left(w_{i}\,|\,w_{i-2},\,w_{i-1}\right) = \left\{ \begin{array}{ll} \frac{Count\left(w_{i-2},\,w_{i-1},\,w_{i}\right)}{Count\left(w_{i-2},\,w_{i-1}\right)}\,\,\,if\,\,Count\left(w_{i-2},\,w_{i-1},\,w_{i}\right) > 0\\ \alpha \cdot s_{SBO}\left(w_{i}\,|\,w_{i-1}\right)\,\,\,otherwise \end{array} \right.\]

For a bigram model, the Stupid Backed Off (SBO) estimate is defined as: \[s_{SBO}\left(w_{i}\,|\,w_{i-1}\right) = \left\{ \begin{array}{ll} \frac{Count\left(w_{i-1},\,w_{i}\right)}{Count\left(w_{i-1}\right)}\,\,\,if\,\,Count\left(w_{i-1},\,w_{i}\right) > 0\\ \alpha \cdot q_{ML}\left(w_{i}\right)\,\,\,otherwise \end{array} \right.\] In general, the back-off factor \(\alpha\) may be made to depend on \(n\). Here, a single value is used and heuristically set to \(\alpha\) = 0.4.
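
To make the back-off recursion concrete, here is a hypothetical sketch of the scoring logic; count3(), count2(), count1() and n_words are assumed lookups into the trigram, bigram and unigram tables described in the next section, not actual project functions:

    # Hypothetical sketch of the Stupid Back-off score for the word w3
    # following the context (w1, w2).
    stupid_backoff_score <- function(w1, w2, w3, alpha = 0.4) {
      c3 <- count3(w1, w2, w3)
      if (c3 > 0) return(c3 / count2(w1, w2))      # observed trigram: relative frequency
      c2 <- count2(w2, w3)
      if (c2 > 0) return(alpha * c2 / count1(w2))  # back off to the bigram
      alpha * alpha * count1(w3) / n_words         # back off to the unigram ML estimate
    }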

Creating N-gram tables

The two Back-off Models use the counts of trigrams, bigrams and unigrams.

They are calculated using the calculatingNgrams function which takes 3 parameters (ratio, cutoff, filter_one_freq) later used during the Model tweaking phase.
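
Ignoring those parameters, the counting step itself can be sketched as follows (a simplified illustration; the actual calculatingNgrams implementation works on chunks and in parallel, as described in the Technical Optimization section):

    # Simplified sketch of n-gram counting on cleaned lines.
    count_ngrams <- function(lines, n) {
      tokens <- strsplit(lines, " ", fixed = TRUE)
      grams <- unlist(lapply(tokens, function(w) {
        if (length(w) < n) return(character(0))
        sapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
      }))
      sort(table(grams), decreasing = TRUE)          # frequency table, most frequent first
      # e.g. count_ngrams("i went to the gym", 3) counts
      # "i went to", "went to the" and "to the gym" once each
    }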

Predicting next word

Katz Back-off Model

The predictNextWord function implements the Katz Back-off Model. It takes 4 parameters:

  • words: a string of characters that must be 2 words long.
  • resultNb (default = 10): the number of returned results.
  • incStopWords (default = TRUE) [experimental]: when set to FALSE, English stop words are filtered from the results.
  • d (default = .5): the discounting value.

Stupid Back-off Model

The stupidPredictNextWord function implements the Stupid Back-off Model. It takes 3 parameters (example calls follow the list):

  • words: a string of characters that must be 2 words long.
  • resultNb (default = 10): the number of returned results.
  • incStopWords (default = TRUE) [experimental]: when set to FALSE, English stop words are filtered from the results.
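
For illustration, hypothetical calls could look like this (the exact shape of the returned object depends on the implementation):

    # Hypothetical usage; both functions are assumed to return the candidate
    # next words ranked by probability (Katz) or score (Stupid Back-off).
    predictNextWord("went to", resultNb = 3)
    stupidPredictNextWord("went to", resultNb = 3)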

Model Performance vs Accuracy Tradeoff Optimization

Performance

The context of use of our feature is the writing of a text message.
Users would only use this functionality if it increased their productivity. Therefore, we should aim at getting a result in less than 2 seconds. Unfortunately, I was not able to reach this speed without drastically lowering the accuracy of the model.
I chose instead a limit of 15 seconds.

During my tests, I measured the execution time of both the Stupid and Katz back-off algorithms.

Accuracy

Perplexity

I first intended to evaluate the accuracy of my model by calculating its perplexity. Indeed, perplexity is a

measurement of how well a […] probability model predicts a sample. […] A low perplexity indicates the probability model is good at predicting the sample [6].

Perplexity can be calculated as follows:
\[Perplexity = 2^{-l}\] where \[l = \frac{1}{M}\sum\limits_{i=1}^{m} \log_{2} p\left(s_{i}\right)\], \(s_{1},\,\ldots,\,s_{m}\) are the sentences of the test dataset and \(M\) is the total number of words in the test dataset.
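
In R terms, and assuming the base-2 log-probability of each test sentence has already been computed by the model, the final step could be sketched as (not the project's actual perplexity function):

    # Sketch: perplexity from per-sentence base-2 log-probabilities.
    # `log2_probs` is assumed to hold log2 p(s_i) for each test sentence,
    # `total_words` the total number of words M in the test set.
    perplexity_sketch <- function(log2_probs, total_words) {
      l <- sum(log2_probs) / total_words   # average log-probability per word
      2^(-l)
    }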

It is therefore necessary to calculate the probability of each sentence. As seen above: \[P\left(X_{1} = x_{1},\,X_{2} = x_{2},\,...,\,X_{n} = x_{n}\right) \\ = P\left(X_{1} = x_{1}\right) \cdot P\left(X_{2} = x_{2}\, | \,X_{1} = x_{1}\right) \\ \prod\limits_{i=3}^{n} P\left(X_{i} = x_{i}\, | \,X_{i-2} = x_{i-2},\,X_{i-1} = x_{i-1}\right) \]

Due to a technical constraint (the trigram tokenizer does not return trigrams in their order of appearance in the sentence), I had to simplify the equation: \[\approx \prod\limits_{i=3}^{n} P\left(X_{i} = x_{i}\, | \,X_{i-2} = x_{i-2},\,X_{i-1} = x_{i-1}\right) \] The calculation was done on the test set, first on a few lines (10, 50, …) and then on a larger set (500). Results were as follows:
nb_lines  perplexity
      10         150
      50       1,435
     500   1,585,188

We can observe that the perplexity is not stable when increasing the number of lines taken into account…
Looking at the formula, it seems that the \(\frac{1}{M}\) part should in theory bring this stability.
Unfortunately, I could not find a way to fix the issue.

The perplexity function was kept for the record.

Hits versus Total

A more natural way to estimate the accuracy of the model is to count the number of hits (among top 3 predicted words) and divide by the total number of tests.

The accuracyNChoices function implements this method. It gives much better, actionable results.
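
The underlying idea can be sketched as follows; predict_top() and the structure of test_cases are assumptions made for the illustration, not the actual accuracyNChoices signature:

    # Sketch of hits versus total: a hit is scored when the actual next word
    # appears among the top `n_choices` predictions.
    # `test_cases` is assumed to be a list of list(context = "two words", next_word = "word");
    # predict_top() is a hypothetical wrapper around one of the prediction functions.
    accuracy_n_choices <- function(test_cases, n_choices = 3) {
      hits <- vapply(test_cases, function(tc) {
        tc$next_word %in% predict_top(tc$context, n = n_choices)
      }, logical(1))
      sum(hits) / length(hits)
    }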

Model Parameters

Parameters of the model used to optimize the Performance vs Accuracy tradeoff were:

  • Sampling of the training set
  • Filtering (or not) bigrams and trigrams with only 1 occurrence
  • Filtering words from the dictionary based on the distribution of unigram frequencies, sorted in descending order

Ratio

As seen before, the Corpus is very large. One could ask whether using all the data brings a significant benefit, or whether we could take a subset without losing too much variance.

I decided to test different ratios: 20%, 25%, 30% and 100%.

Frequency of 1

Looking at the distribution of word frequencies, we can see that a lot of words appear only once in the Corpus.
Their probability of appearing is very low (\(\frac{1}{M}\), where \(M\) is the total number of words in the training dataset).
Therefore, their predictive power is poor. As a consequence, bigrams and trigrams including these words also have poor predictive power.

Removing them from the dictionary should therefore have very little impact on the predictive power of our model.
Besides, it will drastically reduce the size of our n-grams tables.

The same principle will be applied to bigrams and trigrams with a frequency of 1.

Cutoff

One could now ask whether we could remove more than just the words with a frequency of 1.

The cutoff parameter represents the percentile of the distribution we want to keep (keeping the most frequent words first).

I decided to test different values: 90%, 85% and 80%.
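
One way to read this parameter can be sketched under the assumption of a unigram table sorted by decreasing count (column names are assumptions):

    # Sketch: keep the most frequent words until they cover `cutoff` of all
    # word occurrences. `unigrams` is assumed to be a data frame with columns
    # `word` and `count`, sorted by decreasing count.
    apply_cutoff <- function(unigrams, cutoff = 0.90) {
      coverage <- cumsum(unigrams$count) / sum(unigrams$count)
      unigrams[coverage <= cutoff, ]
    }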

Method

These 3 Model Parameters give us 24 combinations.

We are looking at optimizing the Performance vs Accuracy tradeoff.
We need to calculate, for each parameter combination:

  • The Performance of the Stupid and Katz back-off algorithms
  • The Accuracy of the model

In addition, we can store the size of n-gram tables.
Results should be stored in a file for future analysis.

The optimizeModel function is doing just that.
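
Schematically, the exploration could be organised as below; evaluate_timing() and evaluate_accuracy() are hypothetical wrappers around the functions described earlier, and the real optimizeModel also records the n-gram table sizes and writes everything to a file:

    # Schematic sketch of the 24-combination grid explored by optimizeModel.
    grid <- expand.grid(ratio = c(0.20, 0.25, 0.30, 1.00),
                        cutoff = c(0.80, 0.85, 0.90),
                        filter_one_freq = c(TRUE, FALSE))
    results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
      p <- grid[i, ]
      calculatingNgrams(p$ratio, p$cutoff, p$filter_one_freq)
      data.frame(p,
                 timing   = evaluate_timing(),    # hypothetical wrapper
                 accuracy = evaluate_accuracy())  # hypothetical wrapper
    }))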

Hardware

Tests have been performed on a Linux server with 16 cores and 60 GB of memory. Below are some details on the CPU and memory:
lscpu

Property         Value
Architecture     x86_64
CPU op-mode(s)   32-bit, 64-bit
Byte Order       Little Endian
CPU(s)           16
Vendor ID        GenuineIntel
CPU family       6
Model            42
Model name       Intel Xeon E312xx (Sandy Bridge)

cat /proc/meminfo

Property   Value
MemTotal   64428136 kB

Results

In the Optimization problem statement, we made the assumption that removing n-grams with a frequency of 1 would have a very small impact on the Accuracy.
Let’s see if this is true:

We can observe that the impact on Accuracy is small.
Therefore any future result will be presented with n-grams with a frequency of 1 filtered out.

Looking at the “Accuracy by Cutoff and Ratio values” chart below, we can observe that when increasing the ratio and cutoff values, the accuracy also increases:

Looking at the “Performance by Ratio and Cutoff values” chart below, we can observe that when increasing the ratio and cutoff values, the timing also increases:

As stated in the Optimization problem statement, a timing > 15 s is not acceptable. Therefore, we can eliminate several combinations:

We can highlight the corresponding Accuracies:

As we are interested in maximizing the Accuracy, the winning combination is:

  • Ratio of 20%
  • Cutoff of 90%

We can also note that if the timing could be reduced by 25%, we could use a Ratio of 30%, which would lead to a 9% increase in Accuracy.

Technical Optimization

Challenges

The Corpus size

This project was the opportunity to work on a very large dataset (4,269,678 rows).

During the n-gram table creation phase, it was not possible to process 100% of the Corpus at once due to a lack of memory.

In order to work around this, it was necessary to split the Corpus into chunks, process each of them, and merge the results back together.
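
Schematically (count_ngrams from the earlier sketch; merge_counts is a hypothetical helper summing the partial frequency tables):

    # Sketch of the work-around: split the Corpus into chunks, build partial
    # n-gram counts per chunk, then merge them back together.
    process_in_chunks <- function(lines, n_chunks = 16) {
      chunk_id <- cut(seq_along(lines), breaks = n_chunks, labels = FALSE)
      partial <- lapply(split(lines, chunk_id), function(chunk) {
        count_ngrams(cleaningCorpus(chunk), n = 3)
      })
      merge_counts(partial)   # hypothetical: sum counts of identical n-grams
    }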

Heavy computation

Several tasks were heavy in computation:

  • Creating the n-gram tables: cleaning the Corpus, creating the bigram and trigram tables, converting words to their corresponding index in the unigram table
  • Predicting the next word: calculating the probability of N trigrams (where N equals the total number of unigrams) using either the Stupid or the Katz Back-off algorithm

By default, R runs only on a single CPU thread.
In order to speed up the above tasks, it is possible to run them in parallel across several cores.

The foreach package

I used the doParallel package as a “parallel backend” for the foreach package. It provides the mechanism needed to execute foreach loops in parallel.
The doParallel package acts as an interface between foreach and R's parallel package.

I implemented foreach loops to perform the following tasks (see the sketch after this list):

  • Cleaning the Corpus and creating bigram and trigram tables
  • Calculating the probability of bigrams and trigrams
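
A minimal sketch of the pattern (placeholder work only; the project's actual loops operate on corpus chunks and n-gram tables):

    # Minimal foreach/doParallel pattern: register a cluster, run the
    # iterations in parallel with %dopar%, combine the results, stop the cluster.
    library(foreach)
    library(doParallel)

    cl <- makeCluster(15)              # e.g. 15 of the 16 available cores
    registerDoParallel(cl)

    chunks <- split(1:1000, rep(1:15, length.out = 1000))
    chunk_sizes <- foreach(idx = chunks, .combine = c) %dopar% {
      length(idx)                      # placeholder for the real per-chunk work
    }

    stopCluster(cl)
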
The multidplyr package

I used the multidplyr package as a “backend” for the dplyr package that partitions a data frame across multiple cores.

I used it essentially to reprocess n-gram tables:

  • Filtering
  • Converting words to their corresponding index in the unigram table

N-grams tables size

The calculation of bigram and trigram probabilities is done in parallel (15 cores).
It is important to control how much data is loaded into memory, since it will be duplicated across the 15 instances running in parallel.

In order to reduce the memory footprint, I used the bigmemory package so that the bigram and trigram tables are shared across all instances.
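
The general pattern, sketched with a toy matrix rather than the project's actual tables:

    # Sketch of sharing a matrix across parallel workers with bigmemory:
    # workers attach the matrix through its descriptor instead of copying it.
    library(bigmemory)
    library(foreach)
    library(doParallel)

    shared <- big.matrix(nrow = 10, ncol = 4, type = "double", init = 1)
    desc <- describe(shared)           # lightweight descriptor sent to the workers

    cl <- makeCluster(2)
    registerDoParallel(cl)

    col_sums <- foreach(j = 1:4, .combine = c, .packages = "bigmemory") %dopar% {
      m <- attach.big.matrix(desc)     # attach the shared data, no duplication
      sum(m[, j])
    }

    stopCluster(cl)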

References

[5] Columbia University - Course: Natural Language Processing

http://www.cs.columbia.edu/~mcollins/lm-spring2013.pdf