This milestone report is the first major task in the Coursera Capstone Project of the Data Science Specialization. One of the final deliverables of the Capstone Project is a Shiny app in which the user enters a word sequence and the app predicts, with fairly good accuracy and within a reasonable response time, the next word the user might enter. The objective of this milestone report is therefore to demonstrate that we are on track to create the prediction algorithm for the app, by presenting the findings of the exploratory data analysis performed thus far and outlining the future work required to develop the text prediction system.
SwiftKey is a predictive text app that uses smart prediction technology to make it easier to enter text on a mobile device. It accomplishes this by predicting the next word the user intends to type and offering suggestions. Through text analysis and cleaning, sampling, prediction and evaluation, the final goal is to build a Shiny app in which the user enters a word sequence and the app predicts, with fairly good accuracy and within a reasonable response time, the next word the user might enter.
While external datasets may be used to augment the model later in the project, the data used at this stage can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
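For completeness, below is a sketch of a one-time download and extraction step. The file layout (a "final/en_US" folder inside the zip) is assumed from the standard Coursera-SwiftKey archive, and the download is skipped if the data is already on disk.
zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
## Download and unzip only once; mode="wb" keeps the zip intact on Windows.
if (!file.exists(zipFile)) download.file(zipURL, zipFile, mode="wb")
if (!dir.exists("final")) unzip(zipFile)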
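The analysis below relies on a handful of text-mining and plotting packages. The library calls are assumed to have been run in a setup chunk; the package list is inferred from the functions used in this report.
library(tm)           ## Corpus, tm_map, TermDocumentMatrix, DocumentTermMatrix
library(stringi)      ## stri_extract_all_words
library(RWeka)        ## NGramTokenizer, Weka_control
library(ggplot2)      ## frequency histograms
library(wordcloud)    ## wordcloud
library(RColorBrewer) ## brewer.pal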
Supplied for our initial use is a collection of text sources in German, Finnish, English and Russian. For this task, I will only be using the three English files in the “en_US” folder. These three files represent data from blogs, news and Twitter. To get the ball rolling, we start by loading the files.
set.seed(2015)
## File locations, assuming the standard layout of the unzipped Coursera-SwiftKey archive.
fnameBlogs   <- "final/en_US/en_US.blogs.txt"
fnameNews    <- "final/en_US/en_US.news.txt"
fnameTwitter <- "final/en_US/en_US.twitter.txt"
## Given that the files are extremely large, check that the datasets have not already been loaded before reading them in.
if (!exists("textBlogs"))
    textBlogs <- readLines(fnameBlogs, encoding="UTF-8", skipNul=TRUE)
if (!exists("textNews")) {
    textNews <- readLines(fnameNews, encoding="UTF-8", skipNul=TRUE)
    ## Draw a 15,000-line random sample of the news data for the exploratory corpus.
    sampleTextNews <- textNews[sample(1:length(textNews), 15000)]
}
if (!exists("textTwitter"))
    textTwitter <- readLines(fnameTwitter, encoding="UTF-8", skipNul=TRUE)
Given the size of the data, and given my assumption that the news dataset will be the cleanest of the three, I chose it as the representative source for a closer look. By "cleanest" I mean fewer spelling and grammatical errors, fewer problematic language elements (e.g. contractions, emoticons and hashtags, as found in tweets) and, hopefully, an absence of profanity.
However, even a quick visual inspection of the News text at this point reveals that the data is still riddled with raw Unicode sequences, odd non-English characters, extraneous backslashes used for escaping, and so on.
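One possible way to handle this, sketched below but not applied in the analysis that follows, is to coerce the sample to ASCII with iconv(), silently dropping any characters that cannot be converted.
## Sketch only: strip characters that cannot be represented in ASCII.
## (Not applied above; a production cleaning step would be more selective.)
cleanTextNews <- iconv(sampleTextNews, from="UTF-8", to="ASCII", sub="")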
## Get the file sizes
fsizeBlogs <- utils:::format.object_size(file.info(fnameBlogs)$size, "auto")
fsizeNews <- utils:::format.object_size(file.info(fnameNews)$size, "auto")
fsizeTwitter <- utils:::format.object_size(file.info(fnameTwitter)$size, "auto")
## Extract all words from each file so that total and unique word counts can be computed.
blobBlogs <- unlist(stri_extract_all_words(textBlogs))
blobNews <- unlist(stri_extract_all_words(textNews))
blobTwitter <- unlist(stri_extract_all_words(textTwitter))
data.frame(source = c("Blogs", "News", "Twitter"),
File_size = c(fsizeBlogs,
fsizeNews,
fsizeTwitter),
Lines = c(length(textBlogs),
length(textNews),
length(textTwitter)),
Words = c(length(blobBlogs),
length(blobNews),
length(blobTwitter)),
Words_Unique = c(length(unique(blobBlogs)),
length(unique(blobNews)),
length(unique(blobTwitter))))
## source File_size Lines Words Words_Unique
## 1 Blogs 200.4 Mb 899288 37546249 396194
## 2 News 196.3 Mb 77259 2674536 100973
## 3 Twitter 159.4 Mb 2360148 30093410 486660
To simplify the data exploration, I opted to use only the 15,000-line sample of the News dataset drawn above as the source for the text corpus.
sourceEN <- c(sampleTextNews)
## sourceEN <- c(textBlogs, textNews, textTwitter)
corpus <- Corpus(VectorSource(sourceEN))
Now that the corpus is ready, several transformations are required before it can be used for text mining: converting all letters to lower case, removing punctuation marks and numbers, stripping extra whitespace, and removing common English stopwords.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
Next I perform stemming, which essentially removes affixes from words. For example, "run", "runs" and "running" all converge to "run".
corpus <- tm_map(corpus, stemDocument, language="english")
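As a quick illustration of the stemmer's behaviour (tm delegates English stemming to the Porter stemmer in the SnowballC package), the three example words above all reduce to the same stem:
## Expected output: "run" "run" "run"
SnowballC::wordStem(c("run", "runs", "running"), language="english")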
A Term Document Matrix (TDM) is then created to reflect the number of times each word in the corpus occurs in each document.
Note that the Document Term Matrix (DTM) is simply the transpose of the Term Document Matrix; a quick dimension check is sketched after the chunk below.
tdm <- TermDocumentMatrix(corpus)
##findFreqTerms(tdm, 5) ## Display words which occur at least 5 times.
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
## Now we can sum the columns of the matrix and sort to see the most frequently used words.
freqUnigrams <- sort(colSums(dtm2), decreasing=TRUE)
##head(freqUnigrams)
dfUnigrams <- data.frame(word=names(freqUnigrams), freq=freqUnigrams)
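As a quick sanity check of the transposition noted above, the dimensions of the two matrices should simply be reversed:
## The TDM is terms x documents, the DTM is documents x terms.
dim(tdm)
dim(dtm)
identical(dim(tdm), rev(dim(dtm))) ## expected TRUE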
Here I create a word cloud for a more intuitive visual of the words that make up the corpus. This is followed by histograms detailing the most commonly recurring words, two-word and three-word phrases.
## Create a word cloud to visualize the text data.
wordcloud(names(freqUnigrams), freqUnigrams, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
## Plot the unigram frequency histogram to show the top 20 most frequently recurring words.
ggplot(dfUnigrams[1:20, ], aes(x=reorder(word, freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="20 most common unigrams in News")
# NGramTokenizer splits strings into n-grams with given minimal and maximal numbers of grams.
tokenizerBigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
tokenizerTrigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
tokenizerFourgram <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
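As a toy illustration of what the bigram tokenizer produces (assuming RWeka and a working Java installation are available), a short sentence yields its overlapping two-word windows:
## Expected: the five overlapping bigrams
## "to be" "be or" "or not" "not to" "to be"
tokenizerBigram("to be or not to be")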
## Plot the bigram frequency histogram to show the top 20 most frequently recurring 2-word phrases.
bigrams <- TermDocumentMatrix(corpus, control=list(tokenize=tokenizerBigram))
## From our first look at the TDM we know that there are many terms which do not occur very often. It might make sense to simply remove these sparse terms from the analysis.
bigrams2 <- removeSparseTerms(bigrams, 0.999)
freqBigrams2 <- sort(rowSums(as.matrix(bigrams2)), decreasing=TRUE)
dfBigrams2 <- data.frame(word=names(freqBigrams2), freq=freqBigrams2)
ggplot(dfBigrams2[1:20, ], aes(x=reorder(word, freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="20 most common bigrams in News")
## Plot the trigram frequency histogram to show the top 20 most frequently recurring 3-word phrases.
trigrams <- TermDocumentMatrix(corpus, control=list(tokenize=tokenizerTrigram))
trigrams2 <- removeSparseTerms(trigrams, 0.999)
freqTrigrams2 <- sort(rowSums(as.matrix(trigrams2)), decreasing=TRUE)
dfTrigrams2 <- data.frame(word=names(freqTrigrams2), freq=freqTrigrams2)
## head() avoids NA rows (and a ggplot warning) when fewer than 20 trigrams survive the sparsity filter.
ggplot(head(dfTrigrams2, 20), aes(x=reorder(word, freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="20 most common Trigrams in News")
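These n-gram frequency tables are the raw material for the prediction step of the project. As a rough sketch of where this is heading, the function below (predictNext is a name introduced here purely for illustration) does a simple highest-frequency lookup with a crude backoff from trigrams to bigrams, using the dfTrigrams2 and dfBigrams2 data frames built above. The final model will need a larger corpus, proper smoothing and a more principled backoff scheme.
## Sketch: look up the most frequent n-gram starting with the last word(s)
## typed and return its final word. Assumes the prefix contains no regex
## metacharacters and that the input has been lower-cased/cleaned the same
## way as the corpus.
predictNext <- function(phrase, dfTrigrams, dfBigrams) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(words)
    if (n >= 2) {
        ## Try trigrams first: match on the last two words typed.
        prefix <- paste(words[n - 1], words[n])
        hits <- dfTrigrams[grepl(paste0("^", prefix, " "), dfTrigrams$word), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits$word[which.max(hits$freq)])
            return(tail(unlist(strsplit(best, " ")), 1))
        }
    }
    ## Back off to bigrams: match on the last word only.
    prefix <- words[n]
    hits <- dfBigrams[grepl(paste0("^", prefix, " "), dfBigrams$word), ]
    if (nrow(hits) > 0) {
        best <- as.character(hits$word[which.max(hits$freq)])
        return(tail(unlist(strsplit(best, " ")), 1))
    }
    NA_character_  ## no match found
}
## Example usage (the result depends on the sampled corpus):
## predictNext("new york", dfTrigrams2, dfBigrams2)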