Summary

This is a milestone report for the Data Science Capstone NLP (Natural Language Processing) project, run in cooperation with the company SwiftKey, founded in 2008 by Jon Reynolds and Dr Ben Medlock. The goal of the final project is to implement an online typing assistant, powered by a predictive text model, that suggests the next word for a user-entered phrase. This report explains the fundamental steps towards that goal: cleaning, analyzing and restructuring the raw source data.

Download data and load it into R

The raw data, consisting of newspaper and magazine articles, (personal and professional) blogs and Twitter updates, is provided by HC Corpora. It can be downloaded from the Coursera webpage: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The scope of this paper is the analysis of the English corpora only.
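A minimal sketch of how the archive could be fetched and unpacked, assuming it is stored under a local data/ directory:

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the archive once and extract it; only the en_US files are used afterwards.
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip", exdir = "data")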

library(tm)  # corpus infrastructure: VCorpus, DirSource, tm_map, ...

corpus <- VCorpus(DirSource(directory = "~/source/r/capstone/data/sample/", 
    encoding = "UTF-8"), readerControl = list(language = "en"))
summary(corpus)
##                   Length Class             Mode
## en_US.blogs.txt   2      PlainTextDocument list
## en_US.news.txt    2      PlainTextDocument list
## en_US.twitter.txt 2      PlainTextDocument list

Basic analysis of the input files (number of lines, number of words and size on disk) reveals that using the full source data is impractical, considering the limited memory and processing capacity of the available hardware.

| Source  | Line count | Word count | Size on disk |
|:--------|-----------:|-----------:|-------------:|
| blogs   |     899288 |   37334690 |       200 MB |
| news    |    1010242 |   34372720 |       196 MB |
| twitter |    2360148 |   30374206 |       159 MB |
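A minimal sketch of how these per-file statistics could be computed; the helper name and file path are illustrative assumptions:

fileStats <- function(path) {
  # Line count, whitespace-separated word count and size on disk for one file.
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(lines = length(lines),
             words = sum(sapply(strsplit(lines, "\\s+"), length)),
             size_MB = round(file.info(path)$size / 1024^2))
}
fileStats("data/final/en_US/en_US.blogs.txt")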

Sampling

As indicated by the earlier file analysis, the raw data files are far too large to process in their entirety. The exploratory analysis will therefore be based on a randomized sample of the dataset, drawn by selecting lines of the source files with a binomial trial per line.

createSample <- function(pInFile, pOutFile, pRatio) {
  # Read the full source file into memory.
  con <- file(pInFile, "r")
  fullFile <- readLines(con)
  close(con)
  
  # Keep each line with probability pRatio (one binomial trial per line).
  to_select <- rbinom(n = length(fullFile), size = 1, prob = pRatio) > 0
  file_subset <- fullFile[to_select]
  
  # Write the sampled lines to the output file.
  outCon <- file(pOutFile, "w")
  writeLines(file_subset, con = outCon)
  close(outCon)
}
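The sampling itself might then be invoked along these lines; the seed value and file paths are assumptions:

set.seed(1234)  # assumed seed, for reproducible binomial sampling
createSample("data/final/en_US/en_US.blogs.txt", "data/sample/en_US.blogs.txt", 0.01)
createSample("data/final/en_US/en_US.news.txt", "data/sample/en_US.news.txt", 0.01)
createSample("data/final/en_US/en_US.twitter.txt", "data/sample/en_US.twitter.txt", 0.01)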


The sample size is set to 1% of the total file lines, which corresponds to approximately 1.88%, 1.69% and 1.9% of the number of words contained in the original text sources (blogs, news and twitter, respectively).

+ The sampled blogs text source contains 701928 words.
+ The sampled news text source contains 581459 words.
+ The sampled twitter text source contains 578294 words.
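These ratios follow directly from the word counts above:

# Sampled word counts divided by the full-source word counts from the earlier table.
round(c(blogs   = 701928 / 37334690,
        news    = 581459 / 34372720,
        twitter = 578294 / 30374206) * 100, 2)
##   blogs    news twitter 
##    1.88    1.69    1.90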

The number of words in the English language was estimated at 1,025,109 by the Global Language Monitor on January 1, 2014. This means that a rough estimate of the word coverage of the English language is 30%. Given the nature of the problem, I consider the 1% sample size an adequate approximation of the results that would be obtained using all the data.

Preprocessing

Profanity filtering

Removing profanities from the source data is a requirement of the final application. For this purpose I am using a list of profanity words downloaded from github.com. It is possible that in the final product the set of words we do not want to predict will be further expanded.
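As an illustration, the list could be loaded along these lines; the local file name is an assumed placeholder, and any plain-text word list (one word per line) from github.com works:

# Read the downloaded profanity list; the file name is an assumed placeholder.
profanity <- readLines("data/profanity_words.txt", encoding = "UTF-8", warn = FALSE)
profanity <- unique(tolower(profanity[profanity != ""]))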

Data cleaning and filtering

A cursory investigation of the source data indicates that apostrophes in the text follow two different conventions, which need to be normalized. In addition, numbers, punctuation, non-printable characters and English stop words are removed from the data. Finally, the words are reduced to their root form, also known as word stemming.

stopWords_en <- stopwords("en")                    # English stop word list from the tm package

corpus_cleaned <- tm_map(corpus, removeWords, profanity)   # profanity filtering
cleanApostrophes <- function(x) gsub("’", "'", x)          # normalize curly apostrophes
corpus_cleaned <- tm_map(corpus_cleaned, content_transformer(cleanApostrophes))
corpus_cleaned <- tm_map(corpus_cleaned, removeNumbers)
corpus_cleaned <- tm_map(corpus_cleaned, removeWords, stopWords_en)
corpus_cleaned <- tm_map(corpus_cleaned, removePunctuation)
corpus_cleaned <- tm_map(corpus_cleaned, stripWhitespace)
corpus_cleaned <- tm_map(corpus_cleaned, content_transformer(tolower))
corpus_cleaned <- tm_map(corpus_cleaned, content_transformer(function(x) gsub("[^'[:alnum:] ]", 
    "", x, perl = TRUE)))                                  # keep only letters, digits and apostrophes
corpus_cleaned <- tm_map(corpus_cleaned, content_transformer(function(x) gsub("[[:punct:]]", 
    "", x, perl = TRUE)))                                  # drop remaining punctuation
corpus_cleaned <- tm_map(corpus_cleaned, stemDocument)     # word stemming (requires SnowballC)

Exploratory Analysis

I am plotting the 50 words with the highest occurrences, considering all the text sources simultaneously. This is a quick way to see the relation between common and rare words, which will profoundly affect the development of our word predictor model.

corpus.tdm <- TermDocumentMatrix(corpus_cleaned, control = list(wordLengths = c(1, Inf)))
# Total occurrences of each term across all three text sources, most frequent first.
corpus.freq <- sort(rowSums(as.matrix(corpus.tdm)), decreasing = TRUE)
corpus.df <- data.frame(word = names(corpus.freq), freq = corpus.freq, row.names = NULL)
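A minimal sketch of the plot itself; the use of ggplot2 is an assumption, base graphics would work equally well:

library(ggplot2)
top50 <- head(corpus.df, 50)
ggplot(top50, aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    labs(x = "word (stemmed)", y = "occurrences",
         title = "Top 50 words across all text sources") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))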

[Figure: the 50 words with the highest occurrences across all text sources]

The word distribution has an exponential shape, meaning that most of the words have a very low occurrence. In fact, if we plot the top 5000 terms in descending order of occurrence, we can see that beyond roughly the 4000th term the occurrences of the remaining words become sparse.
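A sketch of that view, reusing the frequency table computed above:

# Occurrence counts of the 5000 most frequent terms, ordered by rank.
plot(head(corpus.df$freq, 5000), type = "l",
     xlab = "term rank", ylab = "occurrences",
     main = "Occurrences of the top 5000 terms")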

[Figure: occurrence counts of the top 5000 terms, in descending order]

Next I will construct the uni-, bi- and tri-grams, which will form the basis of our predictive text model.

An n-gram is simply a combination of n consecutive words, together with its frequency distribution. The focus will be on single words (n = 1), word pairs (n = 2) and word triplets (n = 3). To visualize the n-grams I am displaying word clouds, which quickly highlight the most frequent terms in each data structure.

# The word cloud requires the wordcloud and RColorBrewer packages.
unigrams <- removeSparseTerms(TermDocumentMatrix(corpus_cleaned), 0.3)
freq <- sort(rowSums(as.matrix(unigrams)))
par(mar = c(0, 6, 8, 6) + 0.5)
wordcloud(names(freq), freq, max.words = 200, random.order = FALSE, colors = brewer.pal(8, 
    "Dark2"))
title("Word cloud of the unigrams", line = 7.7)

[Figure: word cloud of the unigrams]
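The bi- and tri-grams are built analogously; below is a minimal sketch using the RWeka n-gram tokenizer (the exact control settings and object names are assumptions):

library(RWeka)  # provides NGramTokenizer / Weka_control

bigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigrams  <- TermDocumentMatrix(corpus_cleaned, control = list(tokenize = bigramTokenizer))
trigrams <- TermDocumentMatrix(corpus_cleaned, control = list(tokenize = trigramTokenizer))

# Word clouds of the bi- and tri-grams follow the same pattern as the unigram one.
bigram.freq <- sort(rowSums(as.matrix(bigrams)), decreasing = TRUE)
wordcloud(names(bigram.freq), bigram.freq, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))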

[Figures: corresponding plots for the bi- and tri-grams]

Plans for implementing the predictive text algorithm

Future

The final web application will be implemented using R Shiny. The predictive text model will continue to rely on the RWeka n-gram tokenizer and other relevant R libraries, with a strong focus on smoothing of the n-gram model.
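As one illustration of the direction (not the final design), a simple back-off lookup over the n-gram frequency tables could look like the sketch below; the function name and the layout of the frequency tables (data frames with history, word and freq columns) are assumptions:

# Hypothetical back-off predictor: try trigrams first, then bigrams, then the
# most frequent unigram. Table layouts are assumptions, not the final design.
predictNext <- function(history, tri.df, bi.df, uni.df) {
    words <- tail(strsplit(tolower(history), "\\s+")[[1]], 2)

    # Trigrams conditioned on the last two words of the input phrase.
    if (length(words) == 2) {
        hits <- tri.df[tri.df$history == paste(words, collapse = " "), ]
        if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
    }
    # Back off to bigrams conditioned on the last word only.
    hits <- bi.df[bi.df$history == tail(words, 1), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])

    # Final fallback: the single most frequent unigram.
    uni.df$word[which.max(uni.df$freq)]
}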

Questions to consider

+ How to create a web application (Shiny app) that houses a large n-gram model while keeping the application responsive?
+ How to handle combinations of words that are unknown to the model?
+ How to estimate accuracy and overall performance?