1. Overview
Whenever we type text into search engines, word processors, mobile phone apps, etc., applications often suggest the next word we are going to type. In this project, we build a predictive model of English text that works in the same way, using a large text dataset sourced from Twitter posts, news articles and blogs. The goal is to build a model that is accurate as well as reasonably fast. Below is a breakdown of the tasks. I will try to gain insights toward this goal by building a first simple model.
Goal :
To build a model that predicts the next word from the previous input with accuracy and reasonable speed.
Steps :
Questions to consider :
2. Data
First, we download the data from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and read only the English text data.
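A minimal sketch of the download and import step, assuming base R and the archive's usual `final/en_US/` layout (the exact code used here may differ):

```r
url     <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipfile <- "Coursera-SwiftKey.zip"

# download once and extract only the English files (assumed archive layout)
if (!file.exists(zipfile)) download.file(url, zipfile, mode = "wb")
unzip(zipfile, files = c("final/en_US/en_US.twitter.txt",
                         "final/en_US/en_US.news.txt",
                         "final/en_US/en_US.blogs.txt"))

# read each file line by line, keeping UTF-8 and skipping embedded NULs
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
```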
The outputs below give details of the downloaded data: N.documents is the number of documents (each consisting of one or more sentences), N.words is the count of all words, and N.unique.words is the count of unique words in each file.
## Object.size N.documents N.words N.unique.words
## en_US.twitter.txt 334.48 MB 2360148 30093413 370390
## en_US.news.txt 269.84 MB 1010242 34762395 284533
## en_US.blogs.txt 267.76 MB 899288 37546250 320008
## total 872.08 MB 4269678 102402058 974931
For the purpose of exploratory analysis, we subset 50% of the data and split it into a training set (70%), a test set (20%) and a held-out set (10%): we first combine all Twitter, blog and news data, randomly select 50% of it, and then split it accordingly.
After subsetting the data, we have 1494387 documents for training, 422698 documents for testing and 217754 documents for tuning the model.
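A sketch of how the 50% subsample and the 70/20/10 split might be done; the object names (`twitter`, `news`, `blogs`, `all_docs`) are illustrative:

```r
set.seed(1234)                                 # for reproducibility
all_docs <- c(twitter, news, blogs)            # combine the three sources

# keep a random 50% of all documents
keep     <- sample(length(all_docs), size = round(0.5 * length(all_docs)))
subset50 <- all_docs[keep]

# assign each document to training (70%), test (20%) or held-out (10%)
grp     <- sample(c("train", "test", "heldout"), length(subset50),
                  replace = TRUE, prob = c(0.7, 0.2, 0.1))
train   <- subset50[grp == "train"]
test    <- subset50[grp == "test"]
heldout <- subset50[grp == "heldout"]
```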
In order to extract useful features from the input data, we need to clean it before the exploratory analysis. Our training dataset is a list of documents. First, we split each document into sentences, then split each sentence into words, and then clean the sentences as described below.
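As an illustration only (the exact rules and profanity list used here are not shown), a cleaning pipeline along these lines might look like:

```r
library(stringi)   # fast string operations

# `profanity` is a placeholder vector; the actual list is an assumption.
clean_sentences <- function(docs, profanity = character(0)) {
  # 1. split documents into sentences on ., ! and ?, lowercase and trim
  sents <- unlist(stri_split_regex(docs, "[.!?]+"))
  sents <- stri_trim_both(stri_trans_tolower(sents))
  sents <- sents[sents != ""]
  # 2. replace numbers with UNK, drop other non-letter characters
  sents <- stri_replace_all_regex(sents, "[0-9]+", " UNK ")
  sents <- stri_replace_all_regex(sents, "[^a-zA-Z' ]+", " ")
  # 3. replace profane words with UNK
  for (w in profanity)
    sents <- stri_replace_all_regex(sents, paste0("\\b", w, "\\b"), "UNK")
  # collapse repeated spaces; rare words are replaced with UNK in a later step
  stri_trim_both(stri_replace_all_regex(sents, "\\s+", " "))
}
```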
After these cleaning steps, we need to decide how many times a word has to appear in order to be used as a parameter of the model. The outputs below show the coverage rates when we keep only words that appear at least 2/3/4/5 times. Even if we keep only about a quarter of all unique words, we still cover 98.95% of all word occurrences in the sentences. Thus, as a final cleaning step, we replace a word with “UNK” if it appears fewer than 5 times.
## N.words.total N.words.subset N.unique.words Coverage
## Use all the words 35084938 35084938 333058 100 %
## Use the words freq >= 2 35084938 34909879 157999 99.5 %
## Use the words freq >= 3 35084938 34826893 116506 99.26 %
## Use the words freq >= 4 35084938 34766563 96396 99.09 %
## Use the words freq >= 5 35084938 34716867 83972 98.95 %
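The coverage table above can be reproduced from a word-frequency table; a sketch with data.table, where `train_words` is assumed to be the vector of all word tokens in the training sentences:

```r
library(data.table)

# frequency of every word token in the training data
word_freq <- data.table(word = train_words)[, .(n = .N), by = word]

coverage <- function(min_freq) {
  kept <- word_freq[n >= min_freq]
  data.table(min_freq       = min_freq,
             N.words.total  = word_freq[, sum(n)],
             N.words.subset = kept[, sum(n)],
             N.unique.words = nrow(kept),
             Coverage       = round(100 * kept[, sum(n)] / word_freq[, sum(n)], 2))
}

rbindlist(lapply(1:5, coverage))   # one row per frequency cutoff
```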
3. Frequencies of words/N-grams
Now we have a list of cleaned sentences to analyze. To build the predictive model, we need to know the frequency of each word and of each combination of words.
Fig. 1 below shows the distribution of the top 50 of the 83972 unique words. “The” appears most often, followed by “to”, “and”, “a” and so on. “The” has the highest probability if we do not consider any previous history.
In the data cleaning process, low-frequency words (appearing fewer than 5 times), profane words, numbers and non-English words were replaced by “UNK” (unknown). The count of unknown words is 521934, which places it 8th from the top and accounts for around 1.5% of the total word count in the training data. We treat UNK as a regular word when computing probabilities.
An N-gram is a sequence of N words. For example, the 2-grams (bigrams) of the sentence “The sun goes down.” are “The sun”, “sun goes” and “goes down”. In the same way we can generate longer N-grams. Using N-grams is a good way to see how words appear in different contexts. Table 1 lists the 10 most frequent sequences for each N-gram order; a sketch of how such counts can be built follows the table.
| Rank | Unigram | counts | 2-gram | counts | 3-gram | counts | 4-gram | counts |
|---|---|---|---|---|---|---|---|---|
| 1 | the | 1669791 | of the | 151370 | one of the | 12013 | the end of the | 2789 |
| 2 | to | 967805 | in the | 143844 | a lot of | 10378 | UNK UNK UNK UNK | 2385 |
| 3 | and | 847103 | to the | 74583 | thanks for the | 8468 | the rest of the | 2372 |
| 4 | a | 835364 | for the | 70346 | to be a | 6453 | at the end of | 2328 |
| 5 | of | 703150 | on the | 68990 | going to be | 6014 | thanks for the follow | 2235 |
| 6 | i | 580551 | to be | 57141 | the end of | 5341 | for the first time | 2175 |
| 7 | in | 579858 | at the | 49771 | i want to | 5272 | at the same time | 1810 |
| 8 | UNK | 521934 | and the | 44162 | out of the | 5252 | is going to be | 1540 |
| 9 | for | 386144 | the UNK | 42330 | it was a | 5059 | one of the most | 1505 |
| 10 | is | 376252 | in a | 41889 | as well as | 4838 | is one of the | 1448 |
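For reference, a sketch of building N-gram counts like those in Table 1, assuming `train_clean` is the vector of cleaned training sentences (tokenization details may differ from what was actually used):

```r
library(data.table)
library(stringi)

count_ngrams <- function(sentences, n) {
  tokens <- stri_split_fixed(sentences, " ")          # word tokens per sentence
  grams  <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    # sliding window of n consecutive words
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  data.table(ngram = grams)[, .(counts = .N), by = ngram][order(-counts)]
}

bigrams <- count_ngrams(train_clean, 2)
head(bigrams, 10)    # ten most frequent 2-grams
```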
As N gets larger, the frequency of each individual N-gram decreases because the N-grams are more specific. What about the total number of unique N-grams? The outputs below show the total count of each N-gram order (Total.Count), the number of unique N-grams (Unique.Ngrams.Count), and the number and percentage of N-grams that appeared only once (Singleton.Count / Percentage.Singleton, the percentage being relative to the total count).
## Total.Count Unique.Ngrams.Count Singleton.Count Percentage.Singleton
## unigram 35084938 83972 0 0 %
## 2-gram 32376453 5487582 3746127 11.57 %
## 3-gram 29667968 16194884 13613296 45.89 %
## 4-gram 27107569 22569873 20989072 77.43 %
As N gets larger, the number of unique N-grams grows dramatically. This is natural because N-grams are combinations drawn from 83972 words. But is it necessary to keep all of these combinations? A 4-gram that appeared only once has an estimated probability of roughly 1 in 27107569.
4. Modeling
Predicting the next word is hard because the number of options is almost infinite in many cases. The important thing in this task is to show users the options that are most likely and grammatically correct.
We use an N-gram model to predict the next word from the previous words in the sentence. Consider the example below.
Example sentence: “The sun goes ( )”
We want to predict the last word of this sentence.
How do we compute the probability? Simply from the counts of N-grams; see the outputs below.
Example of 4-gram probabilities
## history word N n prob ngram
## 1: the sun goes down 28 25 0.89285714 the sun goes down
## 2: the sun goes through 28 1 0.03571429 the sun goes through
## 3: the sun goes behind 28 1 0.03571429 the sun goes behind
## 4: the sun goes on 28 1 0.03571429 the sun goes on
There are 28 (N) 4-grams that start with “the sun goes” (history), and 25 (n) of them end with “down” (word). Therefore, “down” has the highest probability (25/28 \(\approx\) 0.89) in this case.
Formula for the 4-gram probability in this example
\[P(down\,|\,the\,sun\,goes) = \dfrac{Count(the\ sun\ goes\ down)}{\sum_{w}Count(the\ sun\ goes\ w)}\]
In words: the probability of the next word being “down”, given that the previous words are “the sun goes”, is the count of the 4-gram “the sun goes down” divided by the total count of 4-grams starting with “the sun goes”. Thus the probabilities of all 4-grams with the same history sum to 1. In the same way we construct the formula for the other N-gram orders.
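A minimal sketch of this count-based computation with data.table; `ngram4` (columns `ngram`, `counts`) is an assumed count table like the one sketched after Table 1:

```r
library(data.table)

# split each 4-gram into its history (first three words) and final word
ngram4[, history := sub("\\s+\\S+$", "", ngram)]
ngram4[, word    := sub("^.*\\s",    "", ngram)]

# N = total count of 4-grams sharing the history, n = count of this 4-gram
setnames(ngram4, "counts", "n")
ngram4[, N := sum(n), by = history]
ngram4[, prob := n / N]

# maximum-likelihood probabilities for the example history
ngram4[history == "the sun goes"][order(-prob)]
```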
In the previous section we predicted the word using the 4-gram model, and the result was quite reasonable. But what if we don’t have the exact combination of words in our training set? For example, if the sentence is “The moon goes ( )”, then we have no matching 4-gram or 3-gram, but we do have 2-grams with the history “goes”, so we can use 2-grams to predict the word. This is called a back-off model because it backs off to lower-order N-grams until it finds a match. The back-off model requires redistributing probability mass so that the probabilities of all possible options do not sum to more than one.
Interpolation instead uses all available N-gram probabilities, weighted by values between 0 and 1. This makes the computation simpler than the back-off model, so in this project we use an interpolation model. The outputs below show the probabilities of the word “down” using all the N-gram orders (4-gram down to unigram).
## history word N n prob ngram
## 4-gram the sun goes down 28 25 0.8928571429 the sun goes down
## 3_gram sun goes down 30 25 0.8333333333 sun goes down
## 2_gram goes down 7037 178 0.0252948700 goes down
## unigram <NA> down 37793423 31809 0.0008416544 down
The probabilities of all options within each N-gram order sum to 1. In order to combine them into a single probability distribution, we need weights to multiply each probability by. For example, if we multiply every N-gram probability by 0.25, the weights sum to 1 and the combined probability of the next word being “down” stays between 0 and 1. But is it a good idea to multiply by the same value everywhere? We should put more weight on the more reliable N-grams.
Formula for a simple interpolation
\[\hat{P}(down\,|\,the\,sun\,goes) \approx \lambda_{1} P(down) + \lambda_{2}P(down\,|\,goes) + \lambda_{3}P(down\,|\,sun\,goes) + \lambda_{4}P(down\,|\, the\,sun\,goes) \\ where\,\, \lambda_{1} + \lambda_{2} + \lambda_{3} + \lambda_{4}\,= 1\]
If we set \(\lambda_{1}\) to \(\lambda_{4}\) in the above formula to 0.25 each, the probability of the next word after “the sun goes” being “down” is modified as below.
## history word prob lambda prob_times_lambda
## 4-gram the sun goes down 0.89285714 0.25 0.22321429
## 3_gram sun goes down 0.83333333 0.25 0.20833333
## 2_gram goes down 0.02529487 0.25 0.00632372
## unigram <NA> down 0.00084165 0.25 0.00021041
## modified.prob 0.43808175
How does the interpolation model help us predict the next word from the previous input? We simply subset all the N-gram tables by the history and, for each candidate word, sum its probabilities multiplied by the respective lambda values; we can then show users the options with the highest probabilities. The output below shows the top 5 words with the highest modified probabilities for the previous example.
## word modified.prob
## 1: down 0.43808175
## 2: on 0.04535482
## 3: to 0.04104028
## 4: EOS 0.03621253
## 5: through 0.02287673
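As a sketch of this procedure, a hypothetical `predict_next()` could subset each table by its history and sum the weighted probabilities per candidate word. The table layout (columns `history`, `word`, `prob`) and all names are assumptions:

```r
library(data.table)

# tables: list(unigram, 2-gram, 3-gram, 4-gram) data.tables; lambdas sum to 1
predict_next <- function(history_words, tables, lambdas, top = 5) {
  parts <- lapply(seq_along(tables), function(k) {
    if (k == 1) {
      cand <- copy(tables[[1]])                       # unigrams need no history
    } else {
      h    <- paste(tail(history_words, k - 1), collapse = " ")
      cand <- tables[[k]][history == h]
    }
    cand[, .(word, weighted = prob * lambdas[k])]
  })
  res <- rbindlist(parts)[, .(modified.prob = sum(weighted)), by = word]
  head(res[order(-modified.prob)], top)
}

predict_next(c("the", "sun", "goes"),
             list(ngram1, ngram2, ngram3, ngram4),
             lambdas = c(0.25, 0.25, 0.25, 0.25))
```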
In this example, equal lambda values worked well enough, but we do not know which N-gram order deserves more weight. To find optimal lambda values, we use our held-out dataset (the data we did not use for training): we calculate the probability of each sentence under different sets of lambda values and see which set gives the highest probability per sentence on average. The probability of a sentence is calculated as below. Since it is a product of per-word probabilities, the value becomes very small, so we use the sum of log probabilities as the metric rather than raw probabilities.
Probability of the sentence (length N) \[ P(sentence) = P(word_{1}) \times P(word_{2}) \times ... \times P(word_{N}) = \exp(logP(word_{1}) + logP(word_{2}) + ... + logP(word_{N})) \]
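A sketch of this calculation under the same assumed table layout; `interp_prob()` and `sentence_log_prob()` are hypothetical helpers, not the exact code used here:

```r
library(data.table)

# interpolated probability of one word given its (up to 3) preceding words
interp_prob <- function(word_i, history_words, tables, lambdas) {
  p <- 0
  for (k in seq_along(tables)) {
    if (k == 1) {
      row <- tables[[1]][word == word_i]
    } else {
      h   <- paste(tail(history_words, k - 1), collapse = " ")
      row <- tables[[k]][history == h & word == word_i]
    }
    if (nrow(row) > 0) p <- p + lambdas[k] * row$prob[1]
  }
  p   # zero-probability words would need UNK handling in practice
}

# sum of per-word log probabilities; exp() of the result is P(sentence)
sentence_log_prob <- function(words, tables, lambdas) {
  sum(sapply(seq_along(words), function(i)
    log(interp_prob(words[i], head(words, i - 1), tables, lambdas))))
}
```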
Lambda values were generated in all possible combinations on a 0.1 grid, excluding \(\lambda_{1}\) (the unigram weight) = 0 since we always need the lowest-order N-gram, giving 220 combinations in total. We calculate the log probabilities of the sentences for each set of lambdas and take the average. Fig. 2 shows the average log probability over 10000 sentences, grouped by how the weight was distributed.
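The 220 weight combinations can be generated on a 0.1 grid as in the following base R sketch:

```r
# all (lambda1, ..., lambda4) on a 0.1 grid that sum to 1, with lambda1 > 0
grid <- expand.grid(l1 = seq(0.1, 1.0, 0.1),
                    l2 = seq(0.0, 0.9, 0.1),
                    l3 = seq(0.0, 0.9, 0.1),
                    l4 = seq(0.0, 0.9, 0.1))
grid <- grid[abs(rowSums(grid) - 1) < 1e-9, ]   # tolerance for floating point
nrow(grid)                                      # 220 combinations
```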
The outputs below show the lambda values with the highest average log probability in each group.
## lambdas weight sentence.log.prob
## 1: 0.2, 0.5, 0.2, 0.1 unigram_2gram_3gram_4gram -79.87554
## 2: 0.2, 0.5, 0.3, 0 unigram_2gram_3gram -80.18512
## 3: 0.2, 0.7, 0, 0.1 unigram_2gram_4gram -81.38315
## 4: 0.2, 0.8, 0, 0 unigram_2gram -83.15348
## 5: 0.5, 0, 0.4, 0.1 unigram_3gram_4gram -85.57074
## 6: 0.5, 0, 0.5, 0 unigram_3gram -85.89380
## 7: 0.7, 0, 0, 0.3 unigram_4gram -92.19254
## 8: 1, 0, 0, 0 unigram_only -99.17568
Results show that the combination using all four N-gram orders, with weights (0.2, 0.5, 0.2, 0.1), gives the highest average log probability, and that dropping higher-order N-grams steadily lowers it, with the unigram-only model performing worst.
5. Test the First Model
We have cleaned and analyzed the data and built the first model. Now we need to test it.
The platform on which we plan to deploy our application (the free Shiny server plan) has a data size limit of 1 GB, but our data tables are closer to 3 GB.
In Section 3, we found that there are many N-grams that appear only once (singletons). Since those N-grams are not likely to appear often, the performance of the model may not be affected much without them, and because there are so many of them, removing them reduces the data size significantly. A sketch of this pruning step follows.
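A sketch of the pruning step, assuming count tables shaped like the hypothetical `ngram4` above (columns `history`, `word`, `n`):

```r
library(data.table)

# drop N-grams seen only once and recompute the conditional probabilities
prune_singletons <- function(tab) {
  pruned <- tab[n > 1]
  pruned[, N := sum(n), by = history]
  pruned[, prob := n / N]
  pruned[]
}

ngram4_small <- prune_singletons(ngram4)
format(object.size(ngram4_small), units = "MB")   # check the size reduction
```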
To see the difference between the two models, we calculate the perplexity of the sentences in the test set using the probability tables we trained and using new tables whose probabilities were re-calculated without singletons. (The optimal lambda values also have to be re-tuned on the held-out set for the latter.)
The formula for a sentence of length N
\(Perplexity(sentence) = P(sentence)^{-1/N} = \sqrt[N]{\frac{1}{P(sentence)}}\)
(Perplexity measures how well the model assigns probabilities to unseen sentences. In the formula above, a higher sentence probability gives a lower perplexity, so a better model gives lower perplexity.)
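Building on the hypothetical `sentence_log_prob()` sketch above, perplexity could be computed in log space for numerical stability:

```r
# Perplexity(sentence) = P(sentence)^(-1/N), via the sentence log probability
sentence_perplexity <- function(words, tables, lambdas) {
  logp <- sentence_log_prob(words, tables, lambdas)
  exp(-logp / length(words))
}
```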
Summary of the perplexities of the two models on 10000 sentences from the test set
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## with singletons 5.547354 108.2008 244.2925 799.8691 553.5817 254647.2
## without singletons 5.188467 111.1114 267.7871 945.7274 633.0662 254647.2
Comparison of data size, unique N-gram counts and weights with (left) and without (right) singletons
## With singletons #Unique Weight Without singletons #Unique Weight
## Unigram.table 6.69 MB 83973 0.2 6.69 MB 83973 0.2
## 2-gram.table 144.42 MB 5600541 0.5 52.4 MB 1823229 0.5
## 3-gram.table 806.97 MB 17857256 0.2 130.34 MB 3064093 0.2
## 4-gram.table 1973.35 MB 27471362 0.1 157.66 MB 2499625 0.1
Results show that removing singletons increases perplexity only slightly (the median rises from about 244 to 268), while the total size of the N-gram tables drops from roughly 2.9 GB to about 350 MB, well under the 1 GB limit.
Now we actually predict every word in 100 sentences from the test set from its preceding words and evaluate whether the correct word is in the set of predicted words. The model we use is the one without singletons.
The table below shows the predictions for a few sentences as examples. The “word” column contains the words of the sentences (EOS means end of sentence, “.”), the V1 - V5 columns are the top 5 words the model predicted in decreasing order of probability, the “top_3” and “top_5” columns show whether the correct word is among the predicted top 3/5 words respectively, and the “runtime” column is the execution time in seconds to get the predictions.
## word V1 V2 V3 V4 V5 top_3 top_5 runtime
## 1: oe i the UNK and it NO NO 1.51
## 2: on EOS in the to and NO NO 1.50
## 3: a the a EOS my UNK YES YES 1.53
## 4: full EOS UNK new few lot NO NO 1.50
## 5: stomach of time count EOS day NO NO 1.50
## 6: EOS or EOS is and the YES YES 1.42
## 7: always i the UNK and it NO NO 1.54
## 8: such a be the been EOS NO NO 1.46
## 9: a a as an EOS the YES YES 1.54
## 10: good great big treat good UNK NO YES 1.59
## 11: idea EOS day idea time thing YES YES 1.61
## 12: EOS EOS of to that for YES YES 1.51
## 13: green i the UNK and it NO NO 1.58
## 14: chillies bay EOS and tea beans NO NO 1.50
## 15: chopped EOS and medium slit in NO NO 1.47
## 16: EOS EOS fresh onion and up YES YES 1.39
## 17: work i the UNK and it NO NO 1.66
## 18: flow EOS is on with in NO NO 1.58
## 19: EOS EOS of in i and YES YES 1.54
## 20: the i the UNK and it YES YES 1.66
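A sketch of how the top-3/top-5 accuracy and the average runtime could be summarised from a prediction table shaped like the one above (`preds` is an assumed data.table with the same columns):

```r
library(data.table)

# preds: one row per predicted word, with top_3 / top_5 in c("YES", "NO")
# and runtime in seconds, as in the table above
summarise_results <- function(preds) {
  preds[, .(n_words        = .N,
            top_3_accuracy = sprintf("%.2f%%", 100 * mean(top_3 == "YES")),
            top_5_accuracy = sprintf("%.2f%%", 100 * mean(top_5 == "YES")),
            avg_runtime    = sprintf("%.2f sec", mean(runtime)))]
}
```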
Results
## n_words top_3_accuracy top_5_accuracy avg_runtime
## 1 1233 33.09% 38.2% 1.58 sec
6. Conclusion
The results show that the model still needs improvement. We experimented with half of the data, but the model does not generalize very well to unseen data. We can add more data while removing low-probability N-grams and repeat the process of this analysis to improve accuracy, as well as find a more efficient way to store the model for speed.
References
Speech and Language Processing : N-gram Language Model / Daniel Jurafsky & James H. Martin