1. Overview

Whenever we type text into search engines, word processors, mobile phone apps, etc., the application often suggests the next word we are about to type. In this project, we try to build a predictive model of English text that works in the same way, using a large text dataset sourced from Twitter posts, news articles and blogs. The goal is to build a model that is accurate as well as reasonably fast. Below is a breakdown of the tasks. We will try to gain insights toward this goal by building a first, simple model.

Goal :

To build a model that predicts the next word from the previous input with good accuracy and reasonable speed.

Steps :

  1. Obtain the dataset and sample data for modeling.
  2. Clean and analyze the data.
  3. Find the optimal model to predict the next word.
  4. Evaluate the model.
  5. Repeat steps 2-4.
  6. Deploy the model on a Shiny server for users to try.

Questions to consider :

  1. How can we efficiently store an N-gram model?
  2. Do we need all the words that appear in the corpus, or can words that appear less frequently than others be removed (i.e., what is the size of the word dictionary)?
  3. How far back in the previous input do we need to look to predict the next word?
  4. How can we evaluate the model accuracy?
  5. Can the model we are going to build be deployed on a Shiny server with limited resources?

2. Data

2.1 Load the data

First, we download the data from ‘https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip’ and read only the English text files.
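A minimal sketch of this step is shown below, assuming the standard final/en_US/ directory layout inside the zip and the three English file names listed in Section 2.2.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) download.file(url, "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")

# read one corpus file as a character vector of documents (assumed paths)
read_corpus <- function(path) readLines(path, encoding = "UTF-8", skipNul = TRUE)
twitter <- read_corpus("final/en_US/en_US.twitter.txt")
news    <- read_corpus("final/en_US/en_US.news.txt")
blogs   <- read_corpus("final/en_US/en_US.blogs.txt")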

2.2 Details of the data

The output below shows details of the downloaded data. N.documents is the number of documents (each consisting of one or more sentences), N.words is the total word count, and N.unique.words is the count of unique words in each file.

##                   Object.size N.documents   N.words N.unique.words
## en_US.twitter.txt   334.48 MB     2360148  30093413         370390
## en_US.news.txt      269.84 MB     1010242  34762395         284533
## en_US.blogs.txt     267.76 MB      899288  37546250         320008
## total               872.08 MB     4269678 102402058         974931
2.3 Subset the data

For the purpose of exploratory analysis, we take a 50% sample of the data and split it into a training set (70%), a test set (20%) and a held-out set (10%). We first combine all the Twitter, blog and news data, randomly select 50% of the documents, and then split them accordingly.
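A sketch of the sampling and split, assuming the twitter, news and blogs vectors from the previous sketch; the seed is arbitrary, and the label-based split only approximates the 70/20/10 proportions, which matches the counts reported below.

set.seed(1234)
all_docs <- c(twitter, news, blogs)
sampled  <- sample(all_docs, size = round(0.5 * length(all_docs)))

# assign each document to a split with probabilities 0.7 / 0.2 / 0.1
idx <- sample(c("train", "test", "heldout"), size = length(sampled),
              replace = TRUE, prob = c(0.7, 0.2, 0.1))
train   <- sampled[idx == "train"]
test    <- sampled[idx == "test"]
heldout <- sampled[idx == "heldout"]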

After subsetting the data, we have 1494387 documents for training, 422698 documents for testing and 217754 documents (the held-out set) for tuning the model.

2.4 Clean the data

In order to extract useful features from the input data, we need to clean it before the exploratory analysis. Our training dataset is a list of documents. We first split each document into sentences, then split each sentence into words, and finally clean the sentences as follows (a rough code sketch follows the list):

  1. Standardize the apostrophe characters (e.g., convert the curly apostrophe ’ to the straight apostrophe ').
  2. Replace non-English words, words containing numbers, and profane or offensive words with “UNK” (unknown, treated as a regular word).
  3. Replace words that appear infrequently (low-probability words) with “UNK”.
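A rough sketch of these three steps applied to a vector of words; the regular expression, the lowercasing and the profanity and vocab vectors are illustrative assumptions, not the exact rules used.

# words: character vector of tokens from one sentence
# vocab: words kept after the frequency cut-off; profanity: banned words (both assumed)
clean_words <- function(words, vocab, profanity) {
  words <- tolower(words)
  words <- gsub("\u2019", "'", words, fixed = TRUE)    # 1. standardize apostrophes
  bad   <- !grepl("^[a-z']+$", words) | words %in% profanity
  words[bad] <- "UNK"                                  # 2. non-English / numbers / profanity
  words[!(words %in% vocab)] <- "UNK"                  # 3. low-frequency words
  words
}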

After cleaning step 2, we need to decide how many times a word must appear in order to be used as a parameter of the model. The output below shows the coverage rates when we keep only the words that appear at least 2/3/4/5 times. Even though we keep only about a quarter of the unique words, we still cover 98.9% of all the words that appear in the sentences. Thus, in step 3 we replace words that appear fewer than 5 times with “UNK”.

##                         N.words.total N.words.subset N.unique.words Coverage
## Use all the words            35084938       35084938         333058    100 %
## Use the words freq >= 2      35084938       34909879         157999   99.5 %
## Use the words freq >= 3      35084938       34826893         116506  99.26 %
## Use the words freq >= 4      35084938       34766563          96396  99.09 %
## Use the words freq >= 5      35084938       34716867          83972  98.95 %
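The coverage figures above can be computed along these lines, assuming tokens is the character vector of all cleaned words in the training sentences.

# share of all word occurrences covered when keeping words seen at least min_freq times
coverage <- function(tokens, min_freq) {
  freq <- table(tokens)
  kept <- freq[freq >= min_freq]
  c(N.unique.words = length(kept),
    Coverage = round(100 * sum(kept) / length(tokens), 2))
}
sapply(c(1, 2, 3, 4, 5), function(k) coverage(tokens, k))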

3. Frequencies of words/N-grams

Now we have a list of cleaned sentences to analyze. To build the predictive model, we need to know the frequency of each word and of each combination of words.

3.1 Word Distributions

Fig.1 below shows the distribution of the top 50 of the 83972 unique words. “the” appears most often, followed by “to”, “and”, “a” and so on, so “the” has the highest probability if we do not consider the previous history.

In the data cleaning process, low-frequency words (appearing fewer than 5 times), profane words, numbers and non-English words were replaced by “UNK” (unknown). The count of unknown words is 521934, which ranks 8th from the top and accounts for around 1.5% of the total word count in the training data. We treat UNK as a regular word in order to estimate its probability.

3.2 Frequencies of Sequence of Words (N-gram)

An N-gram is a sequence of N words. For example, the 2-grams (bigrams) of the sentence “The sun goes down.” are “The sun”, “sun goes” and “goes down”. In the same way we can generate longer N-grams. Using N-grams is a good way to see how words appear in different contexts. Table.1 shows the 10 most frequent sequences for each N-gram order.
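Before looking at the table, here is a minimal sketch of how the N-grams of a tokenized sentence can be generated.

# all consecutive n-word sequences of a tokenized sentence
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
ngrams(c("the", "sun", "goes", "down"), 2)
# [1] "the sun"   "sun goes"  "goes down"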

Table.1: The most frequent N-grams
Rank  Unigram   Counts    2-gram    Counts   3-gram           Counts   4-gram                  Counts
   1  the      1669791    of the    151370   one of the        12013   the end of the            2789
   2  to        967805    in the    143844   a lot of          10378   UNK UNK UNK UNK           2385
   3  and       847103    to the     74583   thanks for the     8468   the rest of the           2372
   4  a         835364    for the    70346   to be a            6453   at the end of             2328
   5  of        703150    on the     68990   going to be        6014   thanks for the follow     2235
   6  i         580551    to be      57141   the end of         5341   for the first time        2175
   7  in        579858    at the     49771   i want to          5272   at the same time          1810
   8  UNK       521934    and the    44162   out of the         5252   is going to be            1540
   9  for       386144    the UNK    42330   it was a           5059   one of the most           1505
  10  is        376252    in a       41889   as well as         4838   is one of the             1448

As N gets larger, the frequency of each individual N-gram decreases because the N-grams become more restricted. What about the total number of unique N-grams? The output below shows the total count of each N-gram order (Total.Count), the number of unique N-grams (Unique.Ngrams.Count), and the number of N-grams that appear only once along with their share of the total count (Singleton.Count / Percentage.Singleton).

##         Total.Count Unique.Ngrams.Count Singleton.Count Percentage.Singleton
## unigram    35084938               83972               0                  0 %
## 2-gram     32376453             5487582         3746127              11.57 %
## 3-gram     29667968            16194884        13613296              45.89 %
## 4-gram     27107569            22569873        20989072              77.43 %

As N gets larger, the number of unique N-grams grows dramatically. This is natural because N-grams are combinations drawn from 83972 words. But is it necessary to keep all of these combinations? A 4-gram that appeared only once is expected to appear roughly once in 27107569 times.

3.3 Summary of Findings

  1. Keeping only the words that appear at least 5 times reduces the vocabulary to 83972 unique words while still covering about 99% of all word occurrences.
  2. “the” is the most frequent word; the “UNK” token ranks 8th and accounts for around 1.5% of the total word count.
  3. As N grows, the number of unique N-grams grows dramatically and a large share of them are singletons (about 77% of the 4-gram occurrences appear only once), which suggests the N-gram tables can be pruned.

4. Modeling

Predicting the next word is hard because the number of options is almost unlimited in many cases. The important thing in this task is to show users the options that are most likely and grammatically correct.

4.1 Predicting next word using N-gram

We use an N-gram model to predict the next word from the preceding words in the sentence. Consider the following example.

Example sentence: “The sun goes ( )”
We want to predict the last word of this sentence.

How do we compute the prediction? Simply from the counts of N-grams. See the output below.

Example of 4-gram probabilities

##         history    word  N  n       prob                ngram
## 1: the sun goes    down 28 25 0.89285714    the sun goes down
## 2: the sun goes through 28  1 0.03571429 the sun goes through
## 3: the sun goes  behind 28  1 0.03571429  the sun goes behind
## 4: the sun goes      on 28  1 0.03571429      the sun goes on

There are 28 (N) 4-grams that start with “the sun goes” (history), and 25 (n) of them end with “down” (word). Therefore, “down” has the highest probability (25/28 \(\approx\) 0.89) in this case.

Formula to calculate the probability of 4-gram model for the example

\[P(down\,|\,the\,sun\,goes) = \dfrac{Counts\,of\,<the\,sun\,goes\,down>}{\sum_{w}\,Counts\,of\,<the\,sun\,goes\,\,w>}\]

In words: the probability that the next word is “down”, given that the previous words are “the sun goes”, is the count of the 4-gram “the sun goes down” divided by the total count of 4-grams starting with “the sun goes”. Consequently, the probabilities of all 4-grams sharing the same history sum to 1. We construct the formulas for the other N-gram orders in the same way.
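A sketch of this maximum-likelihood estimate with data.table, assuming a 4-gram count table dt4 with columns history, word and count (the table name and column names are illustrative).

library(data.table)

# probabilities of all observed continuations of a given 3-word history
prob_4gram <- function(dt4, h) {
  sub <- dt4[history == h]
  sub[, .(word, N = sum(count), n = count, prob = count / sum(count))][order(-prob)]
}
prob_4gram(dt4, "the sun goes")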

4.2 Backoff Model and Interpolation

In the previous section we predicted the word using the 4-gram model, and the prediction looked quite reasonable. But what if the exact combination of words does not appear in our training set? For example, if the sentence is “The moon goes ( )”, we may have no matching 4-gram or 3-gram, but we do have 2-grams with the history “goes”. We can therefore use the 2-grams to predict the word. This is called a back-off model because it backs off to lower-order N-grams until it finds a match. A back-off model requires a redistribution of probability mass to keep the sum of the probabilities of all possible options from exceeding one.

Interpolation instead uses all available N-gram probabilities, each weighted by a value between 0 and 1. This makes the computation simpler than a back-off model. In this project, we use an interpolation model. The output below shows the probabilities of the word “down” under each N-gram order (4 down to 1).

##              history word        N     n         prob             ngram
## 4-gram  the sun goes down       28    25 0.8928571429 the sun goes down
## 3_gram      sun goes down       30    25 0.8333333333     sun goes down
## 2_gram          goes down     7037   178 0.0252948700         goes down
## unigram         <NA> down 37793423 31809 0.0008416544              down

Within each N-gram order, the probabilities of all options sum to 1. To combine the orders into a single probability distribution, we need weights to multiply each probability by. For example, if we multiply every N-gram order by 0.25, the weights sum to 1 and the interpolated probability of the next word being “down” stays between 0 and 1. But is it a good idea to multiply every order by the same value? We should put more weight on the more reliable N-grams.

4.3 Find the best parameters using held-out dataset

Formula for a simple interpolation

\[\hat{P}(down\,|\,the\,sun\,goes) \approx \lambda_{1} P(down) + \lambda_{2}P(down\,|\,goes) + \lambda_{3}P(down\,|\,sun\,goes) + \lambda_{4}P(down\,|\, the\,sun\,goes) \\ where\,\, \lambda_{1} + \lambda_{2} + \lambda_{3} + \lambda_{4}\,= 1\]

If we set \(\lambda_{1}\) to \(\lambda_{4}\) in the formula above to c(0.25, 0.25, 0.25, 0.25) respectively, the probability of the next word after “the sun goes” being “down” is modified as below.

##                    history word       prob lambda prob_times_lambda
## 4-gram        the sun goes down 0.89285714   0.25        0.22321429
## 3_gram            sun goes down 0.83333333   0.25        0.20833333
## 2_gram                goes down 0.02529487   0.25        0.00632372
## unigram               <NA> down 0.00084165   0.25        0.00021041
## modified.prob                                            0.43808175

How does the interpolation model help us predict the next word from the previous input? We simply subset each N-gram table by its history, multiply the probabilities by the respective lambda values, and sum the contributions of the same word. We can then show users the options with the highest probabilities. The output below shows the top 5 words with the highest modified probabilities for the previous example.

##       word modified.prob
## 1:    down    0.43808175
## 2:      on    0.04535482
## 3:      to    0.04104028
## 4:     EOS    0.03621253
## 5: through    0.02287673
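A sketch of this interpolated prediction, assuming a list tabs of per-order probability tables (data.tables with columns history, word and prob; the unigram table without a history column) and a weight vector lambda ordered from unigram to 4-gram. The object names and table layout are illustrative.

library(data.table)

predict_next <- function(tabs, lambda, history_words, top_n = 5) {
  parts <- lapply(seq_along(tabs), function(k) {
    sub <- if (k == 1) tabs[[1]]
           else tabs[[k]][history == paste(tail(history_words, k - 1), collapse = " ")]
    sub[, .(word, weighted = lambda[k] * prob)]      # lambda_k * P_k(word | history)
  })
  scores <- rbindlist(parts)[, .(modified.prob = sum(weighted)), by = word]
  head(scores[order(-modified.prob)], top_n)
}
predict_next(tabs, c(0.25, 0.25, 0.25, 0.25), c("the", "sun", "goes"))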

In this example, equal lambda values worked well enough, but in general we do not know which N-gram order needs more weight. To find the optimal lambda values, we use our held-out dataset (the data we did not use for training): we calculate the probability of each sentence under different sets of lambda values and see which set gives the highest probability per sentence on average. The probability of a sentence is calculated as below. Since it is a product of per-word probabilities, the value becomes very small, so we use the sum of log probabilities as the metric rather than the raw probabilities.

Probability of the sentence (length N), where each \(P(word_{i})\) is the interpolated probability of \(word_{i}\) given its preceding words: \[ P(sentence) = P(word_{1}) \times P(word_{2}) \times \dots \times P(word_{N}) = \exp(\log P(word_{1}) + \log P(word_{2}) + \dots + \log P(word_{N})) \]

Lambda values were generated in all possible combinations at 0.1 intervals, excluding combinations where \(\lambda_{1}\) (the unigram weight) is 0 (we always need the lowest-order N-gram), giving 220 patterns in total. We calculate the log probabilities of the sentences under each set of lambdas and take the average. Fig.2 shows the average log probability of 10000 sentences grouped by how the weights are distributed.
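The 220 weight patterns can be enumerated as follows (a quick check of the count quoted above).

# all (l1, l2, l3, l4) in steps of 0.1 that sum to 1 with a non-zero unigram weight
grid <- expand.grid(l1 = 1:10, l2 = 0:9, l3 = 0:9, l4 = 0:9)
grid <- grid[rowSums(grid) == 10, ] / 10
nrow(grid)
# [1] 220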

The output below shows the lambda values with the highest average log probability within each group.

##               lambdas                    weight sentence.log.prob
## 1: 0.2, 0.5, 0.2, 0.1 unigram_2gram_3gram_4gram         -79.87554
## 2:   0.2, 0.5, 0.3, 0       unigram_2gram_3gram         -80.18512
## 3:   0.2, 0.7, 0, 0.1       unigram_2gram_4gram         -81.38315
## 4:     0.2, 0.8, 0, 0             unigram_2gram         -83.15348
## 5:   0.5, 0, 0.4, 0.1       unigram_3gram_4gram         -85.57074
## 6:     0.5, 0, 0.5, 0             unigram_3gram         -85.89380
## 7:     0.7, 0, 0, 0.3             unigram_4gram         -92.19254
## 8:         1, 0, 0, 0              unigram_only         -99.17568

Results show that the combination using all four N-gram orders (lambdas of 0.2, 0.5, 0.2, 0.1) gives the highest average log probability, and that the 2-gram receives the largest weight in every group that includes it.

5. Test the First Model

We have cleaned and analyzed the data and built the first model. Now we need to test it.

5.1 Data size

The platform on which we are going to deploy our application (the Shiny server free plan) has a data size limit of 1 GB, but our data tables are closer to 3 GB.

In Section 3.3, we found that there are many N-grams that appear only once (singletons). Since those N-grams are unlikely to appear often, removing them should not affect the performance of the model too much, and because there are so many of them, we can reduce the data size significantly.

To see the difference between the two models, we calculate the perplexity of the sentences in the test set using both the probability tables we trained and new tables whose probabilities are re-calculated without the singletons. (The optimal lambda values also have to be searched again on the held-out set for the latter.)
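Pruning the singletons and re-normalizing could look like the sketch below, reusing the hypothetical dt4 count table from Section 4.1; the other N-gram tables would be treated the same way.

library(data.table)

dt4_pruned <- dt4[count > 1]                               # drop 4-grams seen only once
dt4_pruned[, prob := count / sum(count), by = history]     # re-normalize per history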

The formula for a sentence of length N

\(Perplexity(sentence) = P(sentence)^{-1/N} = \sqrt[N]{\frac{1}{P(sentence)}}\)

(Perplexity measures how well the model assigns probabilities to unseen sentences. In the formula above, a higher sentence probability gives a lower perplexity, so a better model has a lower perplexity.)
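A direct transcription of the formula, computed from the per-word log probabilities to avoid numerical underflow.

# log_probs: vector of log P(word_i | history_i) for the N words of a sentence
perplexity <- function(log_probs) {
  exp(-mean(log_probs))    # = P(sentence)^(-1/N)
}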

The summary of the perplexities of the two models on 10000 sentences from the test set

##                        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## with singletons    5.547354 108.2008 244.2925 799.8691 553.5817 254647.2
## without singletons 5.188467 111.1114 267.7871 945.7274 633.0662 254647.2

Comparison of data size, unique N-gram counts and interpolation weights with (left) and without (right) singletons

##               With singletons  #Unique Weight Without singletons #Unique Weight
## Unigram.table         6.69 MB    83973    0.2            6.69 MB   83973    0.2
## 2-gram.table        144.42 MB  5600541    0.5            52.4 MB 1823229    0.5
## 3-gram.table        806.97 MB 17857256    0.2          130.34 MB 3064093    0.2
## 4-gram.table       1973.35 MB 27471362    0.1          157.66 MB 2499625    0.1

Results show that removing the singletons increases the perplexity only slightly (median 244.3 vs. 267.8), while the total size of the N-gram tables drops from roughly 2.9 GB to about 350 MB, which fits within the 1 GB limit.

5.2 Accuracy and Runtime of the first model

Now we actually predict every word in 100 sentences from the test set from its previous words and evaluate whether the correct word is in the set of predicted words. The model we use is the one with the singletons excluded.
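A sketch of the top-k check for a single position, reusing the hypothetical predict_next() from the earlier interpolation sketch.

# TRUE if the actual next word is among the model's top k suggestions
top_k_hit <- function(tabs, lambda, history_words, actual, k = 5) {
  preds <- predict_next(tabs, lambda, history_words, top_n = k)$word
  actual %in% preds
}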

The table below shows the predictions for a few sentences as examples. The “word” column contains the actual words of the sentences (EOS means the end of the sentence, i.e. “.”), the V1 - V5 columns are the top 5 words the model predicted in descending order of probability, the “top_3” and “top_5” columns show whether the correct word is among the predicted top 3/5 words respectively, and the “runtime” column is the execution time in seconds to obtain the predictions.

##         word    V1    V2     V3   V4    V5 top_3 top_5 runtime
##  1:       oe     i   the    UNK  and    it    NO    NO    1.51
##  2:       on   EOS    in    the   to   and    NO    NO    1.50
##  3:        a   the     a    EOS   my   UNK   YES   YES    1.53
##  4:     full   EOS   UNK    new  few   lot    NO    NO    1.50
##  5:  stomach    of  time  count  EOS   day    NO    NO    1.50
##  6:      EOS    or   EOS     is  and   the   YES   YES    1.42
##  7:   always     i   the    UNK  and    it    NO    NO    1.54
##  8:     such     a    be    the been   EOS    NO    NO    1.46
##  9:        a     a    as     an  EOS   the   YES   YES    1.54
## 10:     good great   big  treat good   UNK    NO   YES    1.59
## 11:     idea   EOS   day   idea time thing   YES   YES    1.61
## 12:      EOS   EOS    of     to that   for   YES   YES    1.51
## 13:    green     i   the    UNK  and    it    NO    NO    1.58
## 14: chillies   bay   EOS    and  tea beans    NO    NO    1.50
## 15:  chopped   EOS   and medium slit    in    NO    NO    1.47
## 16:      EOS   EOS fresh  onion  and    up   YES   YES    1.39
## 17:     work     i   the    UNK  and    it    NO    NO    1.66
## 18:     flow   EOS    is     on with    in    NO    NO    1.58
## 19:      EOS   EOS    of     in    i   and   YES   YES    1.54
## 20:      the     i   the    UNK  and    it   YES   YES    1.66

Results

##   n_words top_3_accuracy top_5_accuracy avg_runtime
## 1    1233         33.09%          38.2%    1.58 sec

5.3 Findings

  1. The top-3 accuracy is about 33% and the top-5 accuracy about 38%, so the correct word is often missing from the suggestions.
  2. The average runtime of about 1.6 seconds per prediction leaves room for improvement given the goal of reasonable speed.
  3. Removing the singletons brought the data size within the deployment limit with only a small loss in perplexity, but both accuracy and speed still need work.

6. Conclusion

The results show that the model still needs to be improved. We experimented with half of the data, but the model does not generalize very well to unseen data. We can add more data while removing low-probability N-grams, repeat the process from this analysis to get better accuracy, and find a more efficient way to store the model to improve speed.

References

Speech and Language Processing : N-gram Language Model / Daniel Jurafsky & James H. Martin