This is a milestone report for Coursera’s Data Science Specialization capstone project. The ultimate objective of the capstone is to build a Shiny app that, similarly to SwiftKey’s keyboard, predicts the next word as a user types.
In order to do that we need a language model that tells us which words are the most probable continuations. To train the model we will use a data set (corpus) made up of several files containing text samples. Here we explore the data included in the corpus.
The motivation for this report is to:
The corpus provided includes samples from Twitter, blogs, and news in four languages: English (en), German (de), Finnish (fi), and Russian (ru). Table 1 shows a brief summary of the file statistics.
To keep this report brief, I will focus on the English-language files.
In some of the data files, especially the news and blog files, we can find several sentences on each line. Our aim here is to predict words within a sentence, not across sentences. To avoid mixing words from different sentences, I split each line into sentences and then create new files with one sentence per line. An example follows; this procedure was applied to every line in each English file.
Line 3 of the file en_US.blogs.txt is:
## [1] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
When I split this line into sentences, I get the following:
# Use the openNLP package to split the line into sentences.
library(NLP)       # provides as.String() and annotate()
library(openNLP)   # provides Maxent_Sent_Token_Annotator()

# 'line' holds line 3 of en_US.blogs.txt, read beforehand with readLines().
sent_token_annotator <- Maxent_Sent_Token_Annotator(language = 'en')
s <- as.String(line)
s[annotate(s, list(sent_token_annotator))]
## [1] "Chad has been awesome with the kids and holding down the fort while I work later than usual!"
## [2] "The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank."
## [3] "He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad)."
## [4] "We made him count all of his money to make sure that he had enough!"
## [5] "It was very cute to watch his reaction when he realized he did!"
## [6] "He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters!"
## [7] "She loves it almost as much as him."
It is important to note that the tool used to annotate sentences is only available for English and German, so I will not be able to do something similar for the Russian and Finnish files.
Table 2 shows the statistics of the train and test files. As expected, the average number of words per sentence in the Twitter file is lower than in the other two files. The average number of words per sentence is also consistent between the train and test files.
As we are training a model, it is important to hold out part of our dataset to evaluate the performance of the final model. Here we divide each data file into a training set and a test set: the training set includes 85% of the sentences in each file and the test set the remaining 15%.
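This split can be sketched as follows, assuming the one-sentence-per-line file has already been read into a character vector called sentences (the object names here are illustrative, not the exact code used):

set.seed(1234)                                   # reproducible split (assumed seed)
n <- length(sentences)                           # 'sentences': one sentence per element
train_idx <- sample.int(n, size = floor(0.85 * n))
train_sentences <- sentences[train_idx]          # 85% of the sentences
test_sentences  <- sentences[-train_idx]         # remaining 15%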
Each sentence in the training and test set was preprocessed in the same way:
In order to explore the statistical properties of the English dataset, I took a random sample from each of the three files, amounting to 10% of the sentences in each file. Then I split each sentence into single words (terms). Table 3 shows the number of unique terms in each English file. We can see that the number of distinct terms is similar in the three files.
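The sampling and tokenization can be sketched like this, again with illustrative names and a simple whitespace split (assumptions, since the exact tokenizer is not reproduced here):

set.seed(5678)                                               # reproducible sample (assumed seed)
sample_sentences <- sample(sentences, floor(0.10 * length(sentences)))
terms <- unlist(strsplit(sample_sentences, "\\s+"))          # split each sentence into terms
length(unique(terms))                                        # number of unique terms (Table 3)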
I computed the frequency of each term by counting the number of times it appears in a file, and then analyzed the distribution of those frequencies in each file.
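One way to compute these frequencies, continuing from the terms vector above (a sketch, not necessarily the exact code used):

term_freq <- sort(table(terms), decreasing = TRUE)   # counts per term, most frequent first
head(term_freq)                                      # most frequent terms in the sample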
As we can see in Figure 1, the term frequency distribution in the three files approximately follows a power law. This means that a large number of words have a low frequency, while a few words have a very high frequency.
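A quick visual check of this behaviour is a rank-frequency plot on log-log axes, similar in spirit to Figure 1 (a sketch based on the term_freq table above):

freqs <- as.numeric(term_freq)                # frequencies sorted in decreasing order
plot(seq_along(freqs), freqs, log = "xy",     # log-log axes: a power law looks roughly linear
     xlab = "Term rank", ylab = "Term frequency")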
As a result, a relatively small set of words in our vocabulary represents a high percentage of the words in the corpus. Table 4 shows how many terms account for 50% and 90% of the words in each file (50% and 90% coverage). We can see that the 50% and 90% coverage require only a fraction of the total number of terms in each file (Table 3).
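The coverage figures in Table 4 can be obtained from the sorted term frequencies; a minimal sketch, reusing term_freq from above:

cum_share   <- cumsum(term_freq) / sum(term_freq)   # cumulative share of word occurrences
coverage_50 <- which(cum_share >= 0.50)[1]          # number of terms needed for 50% coverage
coverage_90 <- which(cum_share >= 0.90)[1]          # number of terms needed for 90% coverage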
If we consider only the terms in the 90% coverage, we can considerably reduce the size of our model and still be able to predict most of the words a user would write, except for the rarest ones. A smaller model is easier to fit into a platform with limited computing power, such as a smartphone.
It is interesting to notice that the coverage in the news dataset requires more terms than in the blogs and Twitter datasets. This means that news writers use a richer vocabulary than blog and Twitter authors, which makes perfect sense.
Figure 2 shows a word cloud for each sample; here I have used percentages instead of absolute frequencies to account for differences in sample size. The words pictured are those included in the 50% coverage for each file. The bigger the word, the more frequent it is in the sample.
The Twitter and blogs word clouds have fewer words and a higher proportion of very frequent words than the news word cloud, which is a consequence of the coverage differences. But the word clouds also tell us which words are the most frequent.
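Word clouds like these can be generated with the wordcloud package; a minimal sketch for one sample, reusing term_freq and coverage_50 from above (not necessarily the exact code behind Figure 2):

library(wordcloud)
top_terms <- term_freq[seq_len(coverage_50)]              # terms in the 50% coverage
term_pct  <- as.numeric(100 * top_terms / sum(term_freq)) # percentages instead of counts
wordcloud(words = names(top_terms), freq = term_pct,
          min.freq = 0, random.order = FALSE, scale = c(4, 0.5))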
The most striking fact is that the personal pronouns are among the most frequent words in the Twitter and blogs samples, while they are secondary in the news sample. We can also notice that they are even more frequent in the Twitter sample than in the blogs sample. These differences make perfect sense because news is a far more impersonal communication channel than Twitter, which can sometimes be closer to instant messaging. Blogs are somewhere in between because they often convey personal experiences and are often less objective than the news channel.
This means that we may want to include the user’s writing context in the model: is the user writing something like a chat message, or composing an article?
Figure 2: wordclouds of the terms included in the 50% coverage
To end this section, Table 5 shows the words that are included in the 90% coverage of each sample; blank cells mean that the word is not in the 90% coverage for that sample. I display percentages instead of absolute frequencies to account for sample size differences.
Using just single-word frequencies and probabilities in our model is equivalent to ignoring the context of the prediction: whatever the user has written up to that point. So, in addition to considering the distribution of single words, we need to incorporate into our model the distribution of sequences of words.
Usually the context used is the two words written immediately before the prediction point. So we need to explore the distribution of sequences of two words (bigrams) and three words (trigrams). As happens with the single-word distributions, the bigram and trigram distributions (not shown for brevity) also follow a power law. Therefore, similarly to what we did with single words, we can use just a fraction of the total number of bigrams and trigrams found in the sample and still cover most of their occurrences.
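Bigrams and trigrams can be extracted with the ngrams() helper from the NLP package already used for sentence splitting; a minimal sketch on the tokens of one sentence (illustrative input, not the exact code used):

library(NLP)
tokens   <- c("she", "loves", "it", "almost", "as", "much", "as", "him")  # example tokens
bigrams  <- vapply(ngrams(tokens, 2L), paste, character(1), collapse = " ")
trigrams <- vapply(ngrams(tokens, 3L), paste, character(1), collapse = " ")
head(bigrams)    # "she loves" "loves it" "it almost" ...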
My next step will be to select some models, train them for each language, and compare their performance. At this moment my best candidates are trigram models, for their simplicity and small size, and log-linear models, for their ability to incorporate extra information beyond the word, bigram, and trigram distributions. I will use perplexity as a measure of performance.
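For reference, perplexity is the inverse probability of the held-out test set, normalised by the number of words; for a trigram model over test words $w_1, \dots, w_N$ it can be written as

$$PP(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-2}, w_{i-1})} \right)^{1/N}$$

Lower perplexity means the model assigns higher probability to the test sentences.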
At the very least I will need one model per language, but given the results above it is very likely that I will need three models per language if the trained models are not capable of distinguishing between Twitter-, blog-, and news-like writing contexts.
The model must be small enough to work in real time in a Shiny app. This will be a determining factor in the final model selection.