Objective

The objective of this capstone project is to put into practice the knowledge acquired during the previous 9 courses.
The project consists of showcasing, in a Shiny app, a predictive model that suggests the next word someone is typing on a keyboard.

Data retrieval and loading

Data must be downloaded from the following link:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

# Download and unzip the dataset if it is not already present
if(!file.exists("../Coursera-SwiftKey.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
                  "../Coursera-SwiftKey.zip", method = "curl")
    unzip("../Coursera-SwiftKey.zip", exdir = "..")   # extract next to the zip so ../final/ exists
}

Once the archive has been unzipped, the data is available in a final/ folder containing one sub-folder per locale.

Each language-specific sub-folder contains three files: a blogs, a news and a twitter text file.

In this project, we will focus on the en_US locale. Data can be loaded as follows:

US_twitter <- readLines("../final/en_US/en_US.twitter.txt", skipNul = TRUE)
US_blogs <- readLines("../final/en_US/en_US.blogs.txt", skipNul = TRUE)
US_news <- readLines("../final/en_US/en_US.news.txt", skipNul = TRUE)

The skipNul argument of the readLines function skips embedded NUL characters instead of raising a warning for each of them.

Some interesting facts about these files:

file_name          file_size  nb_of_lines
en_US.twitter.txt  159.4 Mb   2,360,148
en_US.blogs.txt    200.4 Mb   899,288
en_US.news.txt     196.3 Mb   1,010,242
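
These figures can be reproduced roughly as follows (file_stats is a hypothetical helper written for this report; the exact values depend on the downloaded archive):

# Hypothetical helper used only for illustration: size on disk and
# number of lines for each file
file_stats <- function(path, lines) {
    data.frame(file_name = basename(path),
               file_size = sprintf("%.1f Mb", file.info(path)$size / 2^20),
               nb_of_lines = format(length(lines), big.mark = ","))
}

rbind(file_stats("../final/en_US/en_US.twitter.txt", US_twitter),
      file_stats("../final/en_US/en_US.blogs.txt", US_blogs),
      file_stats("../final/en_US/en_US.news.txt", US_news))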

Exploratory analysis

Subsetting strategy

With a total of 4,269,678 lines, the Corpus is far too big to be used in its entirety.

My strategy is to keep a random sample of roughly 1% of each file:

# keep each line independently with probability 0.01 (roughly 1% of each file)
sub_US_twitter <- US_twitter[as.logical(rbinom(length(US_twitter), 1, .01))]
sub_US_blogs <- US_blogs[as.logical(rbinom(length(US_blogs), 1, .01))]
sub_US_news <- US_news[as.logical(rbinom(length(US_news), 1, .01))]

sub_US <- c(sub_US_twitter, sub_US_blogs, sub_US_news)

Using the tm package, I can create a Corpus that I will later use to apply data cleaning transformations and to create n-grams:

library(tm)

corpus <- VCorpus(VectorSource(sub_US))

Data cleaning

I will apply several cleaning transformations on the data via the tm package:

library(dplyr)

# profanity list (one word per line, no header)
swearWords <- read.csv("../swearWords.csv", header = FALSE, stringsAsFactors = FALSE)

# Successively: remove numbers, URLs, Twitter handles/hashtags, convert to
# lower case, drop non-ASCII characters, remove profanity, remove punctuation
# (keeping intra-word dashes) and collapse extra whitespace
cleanCorpus <- tm_map(corpus, removeNumbers) %>%
    tm_map(content_transformer(function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+[^\\s.]+", " ", x, perl = TRUE))) %>%
    tm_map(content_transformer(function(x) gsub("[@#][^\\s]+", " ", x, perl = TRUE))) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = ""))) %>%
    tm_map(removeWords, swearWords$V1) %>%
    tm_map(removePunctuation, preserve_intra_word_dashes = TRUE) %>%
    tm_map(stripWhitespace)

Other cleaning rules were reviewed but not used:

  • stemming: since the application includes a generation phase (predicting the next word), whatever the stemmer removed would have to be added back afterwards, a complexity that was not worth the gain.
  • foreign words: tests of the textcat package for language detection gave very poor results, so I decided not to use it.

Word distribution

We are now going to look at the distribution of words: uni-grams, bi-grams and tri-grams.
Using the tm package, we can create a DocumentTermMatrix object which is a sparse matrix of Terms appearing in each Document.
Using the tidytext package, we can manipulate the sparse matrix as a data.frame.

We are first interested in the frequency of uni-grams (each Term in the Corpus):

library(tidytext)

dtm_one <- DocumentTermMatrix(cleanCorpus)
    
dtm_one_td <- tidy(dtm_one) %>%
    group_by(term) %>%
    summarise(count = sum(count)) %>%
    arrange(desc(count))

Using the wordcloud package, we can represent the most frequent uni-grams graphically.
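
The word cloud can be generated along the following lines (a minimal sketch; the wordcloud() settings shown here are assumptions, not necessarily those used for the original figure):

library(wordcloud)

# word cloud of the most frequent uni-grams; parameter values are illustrative
wordcloud(words = dtm_one_td$term, freq = dtm_one_td$count,
          max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))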

Then, we can calculate bi-grams using a special tokenizer during the creation of the sparse matrix:

options(java.parameters = "-Xmx14g")
options(mc.cores = 1)
library(RWeka)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

dtm_two <- DocumentTermMatrix(cleanCorpus, control = list(tokenize = BigramTokenizer))
    
dtm_two_td <- tidy(dtm_two) %>% 
    group_by(term) %>%
    summarise(count = sum(count))

terms <- matrix(unlist(strsplit(dtm_two_td$term, " ")), ncol = 2, byrow = TRUE)

dtm_two_td <- dtm_two_td %>%
    mutate(
        first_word = terms[, 1],
        second_word = terms[, 2]) %>% 
    arrange(desc(count))

Finally, we can calculate tri-grams:

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

dtm_three <- DocumentTermMatrix(cleanCorpus, control = list(tokenize = TrigramTokenizer))
    
dtm_three_td <- tidy(dtm_three) %>% 
    group_by(term) %>%
    summarise(count = sum(count)) 

terms <- matrix(unlist(strsplit(dtm_three_td$term, " ")), ncol = 3, byrow = TRUE)

dtm_three_td <- dtm_three_td %>%
    mutate(
        first_two_words = paste(terms[, 1], terms[, 2]),
        third_word = terms[, 3]) %>% 
    arrange(desc(count))

We can look at the frequency of our n-grams using bar charts.
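
As an example, the uni-gram chart can be drawn with ggplot2 along these lines (a sketch, not the exact plotting code behind the original figures):

library(ggplot2)

# bar chart of the 20 most frequent uni-grams; the same pattern applies to
# dtm_two_td and dtm_three_td (settings are illustrative)
dtm_one_td %>%
    slice(1:20) %>%
    ggplot(aes(x = reorder(term, count), y = count)) +
    geom_col() +
    coord_flip() +
    labs(x = "uni-gram", y = "count")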

The distribution of our n-grams is similar in shape, but the highest count decreases as N increases.

Here is a summary of the number of observed unique n-grams compared to the number of possible n-grams given the vocabulary size:

                              unigram  bigram         trigram
# of observed unique n-grams  55,420   437,973        764,925
possible (V, V^2, V^3)        55,420   3,071,376,400  1.702157e+14
percentage observed           100%     0.01425983%    4.493857e-07%

V corresponds to the vocabulary size: the number of unique uni-grams taken into account.
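
The figures in this table can be derived directly from the tidy term matrices, for instance as follows (the exact values depend on the 1% sample drawn above):

V <- nrow(dtm_one_td)   # vocabulary size: number of unique uni-grams

observed <- c(unigram = nrow(dtm_one_td), bigram = nrow(dtm_two_td), trigram = nrow(dtm_three_td))
possible <- c(V, V^2, V^3)

# share (in %) of the possible n-grams actually observed in the sample
100 * observed / possible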

Coverage

One of the future objectives will be to find the correct balance between accuracy and performance.
It will therefore be necessary to truncate our dictionary in order to increase performance and reduce the memory footprint.

Let's calculate how many words we need, in a frequency-sorted dictionary, to cover 50% of all word instances in the language, and then 90%.

cumul_nb <- 0
i <- 1

# walk down the frequency-sorted dictionary, accumulating counts until
# 50% of all word instances are covered
while(cumul_nb < (.5 * sum(dtm_one_td$count))) {
    cumul_nb <- cumul_nb + dtm_one_td[i, ]$count
    i <- i + 1
}

The number of words required to cover 50% of all word instances is 312.
The number of instances of the word at the 50% threshold is 300.

# continue from the 50% position up to 90% coverage
while(cumul_nb < (.9 * sum(dtm_one_td$count))) {
    cumul_nb <- cumul_nb + dtm_one_td[i, ]$count
    i <- i + 1
}

The number of words required to cover 90% of all word instances is 9689.
The number of instances of the word at the 90% threshold is 6.

In the future, we will use this method to choose the minimum number of instances required for a word to be kept in the dictionary.
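
The same computation can be expressed more compactly with cumsum(); coverage_cutoff below is a hypothetical helper sketching how such a threshold could be chosen:

# hypothetical helper: size of the smallest frequency-sorted dictionary
# reaching a given coverage, and the count of the last word kept
coverage_cutoff <- function(counts, coverage) {
    n <- which(cumsum(counts) >= coverage * sum(counts))[1]
    list(nb_of_words = n, min_count = counts[n])
}

coverage_cutoff(dtm_one_td$count, .5)   # around 312 words, count around 300 (see above)
coverage_cutoff(dtm_one_td$count, .9)   # around 9689 words, count around 6 (see above)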

Thoughts for a prediction model

I am going to create a predictive model based on a tri-gram language model, which relies on the theory of second-order Markov processes.
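
In other words, the model approximates the probability of the next word using only the two preceding words:

P(w_i | w_1, ..., w_(i-1)) ≈ P(w_i | w_(i-2), w_(i-1))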

As seen above, the percentage of observed tri-grams in our Corpus compared to the number of possible tri-grams is very small: 4.493857e-07%. This means that any tri-gram not observed in the training Corpus would never be proposed during the testing phase.
Therefore, I am going to apply a discounting method: the Katz back-off model.
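
To give an idea of the mechanism, here is a highly simplified sketch of the back-off lookup, using the frequency-sorted n-gram tables built earlier and ignoring the Katz discounting weights for now (predict_next is a hypothetical name, not the final implementation):

# simplified back-off: most frequent observed continuation, falling back from
# tri-grams to bi-grams to uni-grams (no discounting applied in this sketch)
predict_next <- function(w1, w2) {
    hit <- dtm_three_td %>% filter(first_two_words == paste(w1, w2))
    if (nrow(hit) > 0) return(hit$third_word[1])
    hit <- dtm_two_td %>% filter(first_word == w2)
    if (nrow(hit) > 0) return(hit$second_word[1])
    dtm_one_td$term[1]
}

predict_next("in", "the")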

In order to evaluate the accuracy of the produced model, I am considering different strategies.

In any case, the corpus will need to be split into training and testing sets.
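
For example, a simple random split of the sampled lines could look like this (the 80/20 proportion and the seed are illustrative assumptions):

set.seed(42)   # illustrative seed for a reproducible split
train_idx <- sample(length(sub_US), size = round(.8 * length(sub_US)))
train_US <- sub_US[train_idx]    # used to build the n-gram tables
test_US <- sub_US[-train_idx]    # held out to evaluate prediction accuracy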