The objective of the Coursera Data Science Capstone Project is to develop an app that predicts, based on a sequence of 2 words, a third one. The prediction model should be based on documents provided by SwiftKey.
This report is the week 2 assignment of that course. Basically, the aim of this paper is to get the data required by the course, preprocess it, and make some exploratory analysis. It then outlines some ideas about how I would build a predictive model.
I use the following packages: dplyr, ggplot2, tidyr, and tm.
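For reference, a setup chunk loading these packages could look like this (the NLP package, used later through NLP::ngrams, comes in as a dependency of tm):

```r
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(tidyr)    # reshaping data
library(tm)       # text mining: corpus handling and term-document matrices
```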
First, according to the task specification, the data is downloaded as Coursera-SwiftKey.zip if it is not already available locally.
Then, the file is unzipped. It contains 4 directories (de_DE, en_US, fi_FI, ru_RU). The files from the ‘en_US’ directory (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) are used to create a corpus. These are large text files (200.4, 196.3 and 159.4 MB, with 899288, 77259 and 2360148 lines respectively).
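A sketch of this acquisition step is shown below; the download URL and the local paths under `final/` are assumptions about how the course zip unpacks, and the object names (`blogs`, `news`, `twitter`) are mine:

```r
zip_file <- "Coursera-SwiftKey.zip"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"  # assumed course URL

# Download and unpack only if the data is not already available locally
if (!file.exists(zip_file)) download.file(url, destfile = zip_file, mode = "wb")
if (!dir.exists("final")) unzip(zip_file)

# Read the three English files, one character vector element per line
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```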
Summary of the corpus dimensions.
| Document | Size (MB) | Lines (n) | Characters (n) |
|---|---|---|---|
| blogs | 200.4 | 899288 | 208361438 |
| news | 196.3 | 77259 | 15683765 |
| twitter | 159.4 | 2360148 | 162384825 |
In order to handle the data on my PC, I decided to sample 1 in 100 lines from the content of the corpus. Then I filtered the content to keep only letters, remove extra spaces, and convert everything to lower case.
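A minimal sketch of this sampling and cleaning step, assuming the `blogs`, `news` and `twitter` vectors read above; the seed, the sampling helper and the exact cleaning regex are my own illustrative choices:

```r
library(tm)

set.seed(1234)  # arbitrary seed, only for reproducibility

# Keep roughly 1 line in 100 from each file
sample_lines <- function(x, fraction = 1 / 100) {
  x[as.logical(rbinom(length(x), size = 1, prob = fraction))]
}

# One corpus document per source, built from the sampled lines
corpus <- VCorpus(VectorSource(c(
  blogs   = paste(sample_lines(blogs),   collapse = " "),
  news    = paste(sample_lines(news),    collapse = " "),
  twitter = paste(sample_lines(twitter), collapse = " ")
)))

# Keep only letters and spaces, squeeze extra whitespace, convert to lower case
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^A-Za-z ]", " ", x)))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
```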
I decided not to remove stop words, because I believe I will need them in my predictive model (see below).
Using the NLP::ngrams function, term-document matrices of 1-, 2- and 3-word n-grams were built.
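The construction of these matrices can be sketched as follows, using the usual tm/NLP tokenizer pattern; the helper name `ngram_tokenizer` and the matrix names are mine:

```r
library(tm)
library(NLP)

# Build a tokenizer that splits a document into n-word terms
ngram_tokenizer <- function(n) {
  function(x) unlist(lapply(ngrams(words(x), n), paste, collapse = " "),
                     use.names = FALSE)
}

tdm_1 <- TermDocumentMatrix(corpus)  # unigrams (default tokenizer)
tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer(2)))
tdm_3 <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer(3)))
```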
These 1-, 2- and 3-gram term-document matrices have 44221, 334334 and 592876 terms respectively. Many of them have a very low frequency: for example, 24289, 269253 and 549203 unique terms per matrix appear only once. Thus, one may consider a lot of these terms as noise that should be removed before continuing. The tm::removeSparseTerms function allows filtering out these infrequent terms. Figure 1 shows, for the 3 term-document matrices, the number of terms that remain depending on the ‘sparse’ parameter of that function.
Figure 1. Number of terms in each term-document matrix according to the sparse parameter of the tm::removeSparseTerms function.
A good approach to reduce the number of words (unigrams), as suggested above, would be to use the tm::removeSparseTerms function with a sparse parameter between 1/3 and 2/3; thus I use 0.5. Figure 2 shows the 20 most frequent words in the corpus after tidying it.
Figure 2. Word distribution in the corpus. Notice that many of the most frequent words are ‘stop words’.
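As an illustration of the filtering and counting behind this figure, a short sketch, assuming the unigram matrix `tdm_1` built above:

```r
library(tm)

# Drop terms missing from more than 50% of the documents (sparse = 0.5)
tdm_1_filtered <- removeSparseTerms(tdm_1, sparse = 0.5)

# Total frequency of each remaining word, highest first
word_freq <- sort(rowSums(as.matrix(tdm_1_filtered)), decreasing = TRUE)
head(word_freq, 20)  # the 20 most frequent words
```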
Figures 3 and 4 show the frequency of the 20 most frequent bigrams and trigrams in the corpus.
Figure 3. Distribution of 2-word n-grams (bigrams) across the documents in the corpus.
Figure 4. Distribution of 3-word n-grams (trigrams) across the documents in the corpus.
In order to build the predictive model, I will study n-gram models and how to deal with out-of-vocabulary terms. I will continue processing the data and will explore whether word stemming is useful to strengthen the model.
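As a first idea of what the prediction step could look like, here is a purely hypothetical sketch of a frequency-based backoff lookup built from the bigram and trigram matrices above; all names are placeholders, not the final model:

```r
library(dplyr)
library(tidyr)

# Turn a term-document matrix into a data frame of terms and total counts
ngram_counts <- function(tdm) {
  m <- as.matrix(tdm)
  data.frame(term = rownames(m), freq = rowSums(m), stringsAsFactors = FALSE)
}

trigrams <- ngram_counts(tdm_3) %>% separate(term, into = c("w1", "w2", "w3"), sep = " ")
bigrams  <- ngram_counts(tdm_2) %>% separate(term, into = c("w1", "w2"), sep = " ")

# Given two words, propose the most frequent trigram completion,
# backing off to bigrams when no trigram matches
predict_word <- function(word1, word2) {
  hit <- trigrams %>% filter(w1 == word1, w2 == word2) %>% arrange(desc(freq))
  if (nrow(hit) > 0) return(hit$w3[1])
  hit <- bigrams %>% filter(w1 == word2) %>% arrange(desc(freq))
  if (nrow(hit) > 0) return(hit$w2[1])
  NA_character_  # out-of-vocabulary case, still to be handled
}

predict_word("one", "of")
```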
Finally, I will build a Shiny web app and a presentation.