The objective of the Coursera Data Science Capstone Project is to develop an app that predicts, based on a sequence of 2 words, a third one. The prediction model should be based on documents provided by SwiftKey.
This report is the week 2 assignment of that course. Basically, the aim of this paper is to get the data required by the course, preprocess it, and make some exploratory analysis. It then outlines some ideas about how I would build a predictive model.
I use the following packages: dplyr, ggplot2, tidyr, and tm.
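For reference, a setup chunk loading these packages could look like this (the NLP package, used later through NLP::ngrams, comes in as a dependency of tm):

```r
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(tidyr)    # reshaping data
library(tm)       # text mining: corpus handling and term-document matrices
```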
First, according to the task specification, the data is downloaded as Coursera-SwiftKey.zip if it is not already available locally.
Then, the file is unzipped. It contains 4 directories (de_DE, en_US, fi_FI, ru_RU). The files from the ‘en_US’ directory (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) are used to create a corpus. These are large text files (200.4, 196.3 and 159.4 MB, with 899288, 77259 and 2360148 lines respectively).
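A sketch of this acquisition step is shown below; the download URL and the local paths under `final/` are assumptions about how the course zip unpacks, and the object names (`blogs`, `news`, `twitter`) are mine:

```r
zip_file <- "Coursera-SwiftKey.zip"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"  # assumed course URL

# Download and unpack only if the data is not already available locally
if (!file.exists(zip_file)) download.file(url, destfile = zip_file, mode = "wb")
if (!dir.exists("final")) unzip(zip_file)

# Read the three English files, one character vector element per line
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```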
Summary of the corpus dimensions.
| Document | Size (MB) | Lines (n) | Characters (n) |
|---|---|---|---|
| blogs | 200.4 | 899288 | 208361438 |
| news | 196.3 | 77259 | 15683765 |
| twitter | 159.4 | 2360148 | 162384825 |
In order to handle the data on my PC, I decided to sample 1 in 100 lines from the content of the corpus. Then I filtered the content to keep only letters, remove extra spaces, and convert everything to lower case.
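A minimal sketch of this sampling and cleaning step, assuming the `blogs`, `news` and `twitter` vectors read above; the seed, the sampling helper and the exact cleaning regex are my own illustrative choices:

```r
library(tm)

set.seed(1234)  # arbitrary seed, only for reproducibility

# Keep roughly 1 line in 100 from each file
sample_lines <- function(x, fraction = 1 / 100) {
  x[as.logical(rbinom(length(x), size = 1, prob = fraction))]
}

# One corpus document per source, built from the sampled lines
corpus <- VCorpus(VectorSource(c(
  blogs   = paste(sample_lines(blogs),   collapse = " "),
  news    = paste(sample_lines(news),    collapse = " "),
  twitter = paste(sample_lines(twitter), collapse = " ")
)))

# Keep only letters and spaces, squeeze extra whitespace, convert to lower case
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^A-Za-z ]", " ", x)))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
```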
I decided not to remove stop words, because I believe I will need them in my predictive model (see below).
Using the NLP::ngrams function, term-document matrices of 1-, 2- and 3-word n-grams were built.
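The construction of these matrices can be sketched as follows, using the usual tm/NLP tokenizer pattern; the helper name `ngram_tokenizer` and the matrix names are mine:

```r
library(tm)
library(NLP)

# Build a tokenizer that splits a document into n-word terms
ngram_tokenizer <- function(n) {
  function(x) unlist(lapply(ngrams(words(x), n), paste, collapse = " "),
                     use.names = FALSE)
}

tdm_1 <- TermDocumentMatrix(corpus)  # unigrams (default tokenizer)
tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer(2)))
tdm_3 <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer(3)))
```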
These 1-, 2- and 3-gram term-document matrices have 44221, 334334 and 592876 terms respectively. Many of them have a very low frequency: for example, 24289, 269253 and 549203 unique terms per matrix appear only once. Thus, one may consider a lot of these terms as noise that should be removed before continuing. The tm::removeSparseTerms function allows filtering out these infrequent terms. Figure 1 shows, for the 3 term-document matrices, the number of terms that remain depending on the ‘sparse’ parameter of that function.
Figure 1. Number of terms in each term-document matrix according to the sparse parameter of the tm::removeSparseTerms function.
A good approach to reduce the number of words (unigrams), as suggested above, would be to use the tm::removeSparseTerms function with a sparse parameter between 1/3 and 2/3; thus I use 0.5. Figure 2 shows the 20 most frequent words in the corpus after tidying it.
Figure 2. Word distribution in the corpus. Notice that many of the most frequent words are ‘stop words’.
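As an illustration of the filtering and counting behind this figure, a short sketch, assuming the unigram matrix `tdm_1` built above:

```r
library(tm)

# Drop terms missing from more than 50% of the documents (sparse = 0.5)
tdm_1_filtered <- removeSparseTerms(tdm_1, sparse = 0.5)

# Total frequency of each remaining word, highest first
word_freq <- sort(rowSums(as.matrix(tdm_1_filtered)), decreasing = TRUE)
head(word_freq, 20)  # the 20 most frequent words
```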
Figures 3 and 4 show the frequency of the 20 most frequent bigrams and trigrams in the corpus.
Figure 3. Distribution of 2-word n-grams (bigrams) across the documents in the corpus.
Figure 4. Distribution of 3-word n-grams (trigrams) across the documents in the corpus.
In order to build the predictive model, I will study n-gram models and how to deal with out-of-vocabulary terms. I will continue processing the data and will explore whether word stemming is useful to strengthen the model.
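As a first idea of what the prediction step could look like, here is a purely hypothetical sketch of a frequency-based backoff lookup built from the bigram and trigram matrices above; all names are placeholders, not the final model:

```r
library(dplyr)
library(tidyr)

# Turn a term-document matrix into a data frame of terms and total counts
ngram_counts <- function(tdm) {
  m <- as.matrix(tdm)
  data.frame(term = rownames(m), freq = rowSums(m), stringsAsFactors = FALSE)
}

trigrams <- ngram_counts(tdm_3) %>% separate(term, into = c("w1", "w2", "w3"), sep = " ")
bigrams  <- ngram_counts(tdm_2) %>% separate(term, into = c("w1", "w2"), sep = " ")

# Given two words, propose the most frequent trigram completion,
# backing off to bigrams when no trigram matches
predict_word <- function(word1, word2) {
  hit <- trigrams %>% filter(w1 == word1, w2 == word2) %>% arrange(desc(freq))
  if (nrow(hit) > 0) return(hit$w3[1])
  hit <- bigrams %>% filter(w1 == word2) %>% arrange(desc(freq))
  if (nrow(hit) > 0) return(hit$w2[1])
  NA_character_  # out-of-vocabulary case, still to be handled
}

predict_word("one", "of")
```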
Finally, I will build a Shiny web app and a presentation.