library(tm)
library(RWeka)
The source data used for this project exists in the following URL:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The downloaded file has been unzipped in the working directory, and the following function is used to read the corpora from the given folder/file structure into a vector of strings. It is written to be general in terms of localization, character set, data sources, and number of sampled lines:
loading.data <- function(localization, charset="UTF-8", src=c("news", "blogs", "twitter"), samples=-1){
corpora <- c()
# loop for each dataset/source
for(source in src){
# build up the path of the file to be loaded (e.g. "./final/en_US/en_US.blogs.txt")
source.file <- paste0("./final/", localization, "/", localization, ".", source, ".txt")
# read the text file lines into vector of strings
source.lines <- readLines(source.file, skipNul=TRUE, encoding=charset, warn=FALSE, n=samples)
# combine all requested datasets/sources into one corpus
corpora <- c(corpora, source.lines)
}
return(corpora)
}
Each dataset has been loaded separately to get its own basic summaries in the next step:
news.data <- loading.data("en_US", src="news")
blogs.data <- loading.data("en_US", src="blogs")
twitter.data <- loading.data("en_US", src="twitter")
Wikipedia may be a good external data set to augment the model.
We don’t want our model to predict any numbers or special characters; we will focus only on words to serve the auto-complete functionality we are going to build. To do this, we took a simple and fast approach: exclude all characters other than letters, keeping the apostrophe, dot, and dash characters only when they appear within words (e.g. don’t, e-mail, ph.d.).
tokenization <- function(x){
# transliterate to latin1 and replace unconvertible characters with a space
x <- iconv(x, from="UTF-8", to="latin1", sub=" ")
# convert the whole string to lower-case
x <- tolower(x)
# remove all digits and special characters but letters, space, apostrophe, dot, and dash characters
# to keep counting words like: don't, u.s.a, and e-mail
x <- gsub("[^a-z'. -]", " ", x)
# remove apostrophe, dot, and dash characters if they are at the beginning or end of the line
x <- gsub("^['.-]", "", x)
x <- gsub("['.-]$", "", x)
# remove apostrophe, dot, and dash characters if they are at the beginning or end of the word
x <- gsub(" ['.-]", " ", x)
x <- gsub("['.-] ", " ", x)
# strip extra spaces
x <- gsub(" {2,}", " ", x)
return(x)
}
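As a quick sanity check, the tokenizer can be applied to a short made-up string (the example input below is purely illustrative):
# digits and the exclamation mark are dropped, while the apostrophe in "don't"
# and the dash in "e-mail" are kept because they appear inside words
tokenization("Don't e-mail me 123 times!")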
It is not proper to let our model predict any bad words, so we will exclude them from the corpora before building the model. Several resources on the web provide lists of bad words; we used the simple list published at http://www.bannedwordlist.com/.
To perform this task we utilise the R Text Mining (tm) package, first creating a corpus object from our corpora data, which is loaded as a vector of strings:
x <- VCorpus(VectorSource(x))
Then we pass that corpus object to the following filtering function:
profanity.filtering <- function(x){
# assuming the bad words list has been downloaded and saved in the project working directory
bad.words <- readLines("./bad_words.txt")
x <- tm_map(x, removeWords, bad.words)
# removing a bad word may leave two consecutive spaces (the ones before and after it)
x <- tm_map(x, stripWhitespace)
return(x)
}
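Putting the steps together, a minimal usage sketch (using the Twitter dataset purely as an example) would be:
# tokenize the raw lines, wrap them in a tm corpus, then filter out profanity
x <- tokenization(twitter.data)
x <- VCorpus(VectorSource(x))
x <- profanity.filtering(x)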
Please note that it is the bad words list that should be excluded from the corpora, NOT the stop words list, which contains the most common words in the language (in other words, stop words may appear at the top of our prediction list).
# to get the number of lines in a loaded corpus x
length(x)
# to get the length of the longest line in a given corpus x
max(nchar(x))
# to get the (approximate) total number of words in a given corpus x
sum(sapply(gregexpr("\\s", x), length) + 1)
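A minimal sketch that collects these summaries for the three datasets loaded above (the helper name corpora.summary is only illustrative, and the word count is an approximation based on whitespace):
corpora.summary <- function(x){
  clean <- tokenization(x)  # word counts are taken after cleanup (see the note below the table)
  c(lines   = length(x),
    words   = sum(sapply(gregexpr("\\s", clean), length) + 1),
    longest = max(nchar(x)))  # the longest line is measured on the raw text
}
sapply(list(news=news.data, blogs=blogs.data, twitter=twitter.data), corpora.summary)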
| Language | Dataset | # Lines | # Words | Longest Line |
|---|---|---|---|---|
| en_US | News | 77259 | 2609413 | 5760 |
| en_US | Blogs | 899288 | 37246919 | 40833 |
| en_US | Twitter | 2360148 | 30649855 | 140 |
Please note that the total number of words for each dataset has been calculated after cleaning up the corpora, while the longest line has been calculated from the raw data as written in the corpus file.
To build models we don’t need to load in and use all of the data, so we created the following function to sample it, with the flexibility to define several subsets via a vector of splitting ratios and an associated name for each subset. This function also saves each subset to a separate file as a checkpoint, to avoid re-running all the previous steps each time.
sampling.data <- function(x, pr=c(0.7, 0.2, 0.1), txt=c("training", "testing", "validation")){
# randomly assign each line to one of the subsets according to the given splitting ratios
index <- sample(1:length(pr), size=length(x), prob=pr, replace=TRUE)
for(i in 1:length(pr)){
# save each subset in a separate file as a checkpoint
sub.sample <- x[index==i]
writeLines(sub.sample, paste0("./", txt[i], ".txt"))
}
# return the first subset (the training set by default)
return(x[index==1])
}
Using this function we split our corpus into training (70%), testing (20%), and validation (10%) subsets.
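A minimal usage sketch, applied here to the combined raw datasets for illustration (the seed value is an assumption, added only to make the split reproducible):
set.seed(1234)  # assumed seed value
training.data <- sampling.data(c(news.data, blogs.data, twitter.data))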
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus source.
# tokenizer that extracts single words (unigrams)
UniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
# build the term-document matrix, aggregate the term counts, and convert them to probabilities
uni.gram <- TermDocumentMatrix(x, control=list(tokenize=UniGramTokenizer))
uni.freq <- findFreqTerms(uni.gram)
uni.freq <- sort(rowSums(as.matrix(uni.gram[uni.freq,])), decreasing=TRUE)
uni.prob <- uni.freq/sum(uni.freq)
barplot(uni.prob[1:20], las=3, main="Top Unigrams by Probability", ylab="Probability")
# the same steps repeated for word pairs (bigrams)
BiGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
bi.gram <- TermDocumentMatrix(x, control=list(tokenize=BiGramTokenizer))
bi.freq <- findFreqTerms(bi.gram)
bi.freq <- sort(rowSums(as.matrix(bi.gram[bi.freq,])), decreasing=TRUE)
bi.prob <- bi.freq/sum(bi.freq)
barplot(bi.prob[1:20], las=3, main="Top Bigrams by Probability", ylab="Probability")
We will use the created n-gram frequency tables to calculate the probability of the next word occurring given the previous words. We can use a dictionary to reduce the size required to store the model by referring to each word by a number, and we may also exclude rare n-grams that have no chance of appearing in the suggestions.
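As a concrete illustration of the idea, the conditional probability of a next word can be estimated from the unigram and bigram counts computed above as count(w1 w2) / count(w1); a minimal sketch follows (the function name next.word.prob is hypothetical):
# hypothetical helper: estimate P(next | previous) from the counts computed above
next.word.prob <- function(previous, next.word){
  bigram <- paste(previous, next.word)
  if(!(previous %in% names(uni.freq)) || !(bigram %in% names(bi.freq))) return(0)
  unname(bi.freq[bigram] / uni.freq[previous])
}
next.word.prob("new", "york")  # example query; both words must occur in the corpus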
This prediction algorithm will be implemented in a simple Shiny app with an input field where the user can type text; the application will interactively list the top 5 suggested words to auto-complete the current word, and the predictions will be refined by filtering the suggestions according to the letters of the word typed so far.
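A minimal sketch of such an app is shown below; predict.next.words is a hypothetical placeholder for the prediction function described above, and the UI element names are assumptions:
library(shiny)

# hypothetical placeholder: returns the most frequent unigrams and ignores the
# typed text, until the real n-gram model is plugged in
predict.next.words <- function(text, n=5) head(names(uni.freq), n)

ui <- fluidPage(
  textInput("user_text", "Type your text:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output){
  output$suggestions <- renderText({
    paste(predict.next.words(input$user_text, 5), collapse=", ")
  })
}

shinyApp(ui=ui, server=server)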
To avoid resource limitations when calculating n-gram frequencies, you may benefit from this available resource: Google Research Blog, All Our N-gram are Belong to You.