Executive Summary:

This report summarizes the data preparation and exploratory analysis performed on the SwiftKey corpora of blogs, news articles and Twitter feeds provided by Coursera, with the final objective of building a predictive text model that predicts the next word in a sequence. R packages written for Natural Language Processing, namely tm and RWeka, are used to create the corpus and then to build Ngrams (Uni, Bi, Tri and Quadragrams). The results of these Ngrams are summarized into data frames that are used to create summary tables, barplots and word clouds. The modeling steps envisaged are stated at the end.

Reading in the data:

The files are downloaded and unzipped. For this project, I am limiting the analyses to the English-language files in the en_US directory, namely the 3 files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. The files are read in using the appropriate Unicode encoding and conversions. A 5% sample of each of these files is written into the /cleanedSamples directory. Since the memory needs for Ngram tokenization are very high, I had to settle for a small sample size.

#Read in US.blogs and write out sample
USblogsRaw <- file("./en_US/en_US.blogs.txt", open="rb")
USblogs <- readLines(USblogsRaw, skipNul=TRUE,  encoding="UTF-8")
close(USblogsRaw)
USblogs <- iconv(USblogs, "latin1", "ASCII", sub="")
rm(USblogsRaw)
set.seed(123)
sampBlogs <- sample(USblogs, 0.05*length(USblogs))
write(sampBlogs, file="./cleanedSamples/en_US.blogs.txt")

The same kind of processing is done on the news and Twitter files and the samples written out (code chunk hidden using the echo=FALSE option).
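
For completeness, a minimal sketch of that hidden chunk (assuming the same 5% sampling, file locations and output directory as for the blogs) would look like this:

#Read in US.news and US.twitter and write out samples (sketch of the hidden chunk)
USnewsRaw <- file("./en_US/en_US.news.txt", open="rb")
USnews <- readLines(USnewsRaw, skipNul=TRUE, encoding="UTF-8")
close(USnewsRaw)
USnews <- iconv(USnews, "latin1", "ASCII", sub="")
sampNews <- sample(USnews, 0.05*length(USnews))
write(sampNews, file="./cleanedSamples/en_US.news.txt")
UStweetsRaw <- file("./en_US/en_US.twitter.txt", open="rb")
UStweets <- readLines(UStweetsRaw, skipNul=TRUE, encoding="UTF-8")
close(UStweetsRaw)
UStweets <- iconv(UStweets, "latin1", "ASCII", sub="")
sampTweets <- sample(UStweets, 0.05*length(UStweets))
write(sampTweets, file="./cleanedSamples/en_US.twitter.txt")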

Summary table

The line count, total word count and mean word count per line for each of the 3 files are output below:

library(stringi)
#Count words and lines
WordCountBlogs <- stri_count_words(USblogs)
WordCountNews <- stri_count_words(USnews)
WordCountTweets <- stri_count_words(UStweets)
#output a summary table 
data.frame(filename = c("USblogs", "USnews", "UStweets"),
           LineCount = c(length(USblogs), length(USnews), length(UStweets)),
           WordCount = c(sum(WordCountBlogs), sum(WordCountNews), sum(WordCountTweets)),
           MeanWordCount = c(round(mean(WordCountBlogs), 2), round(mean(WordCountNews), 2),
                             round(mean(WordCountTweets), 2)))
##   filename LineCount WordCount MeanWordCount
## 1  USblogs    899288  37510168         41.71
## 2   USnews   1010242  34749301         34.40
## 3 UStweets   2360148  30088605         12.75

All the intermediate data sets are removed in order to clear memory (code hidden).
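
A minimal sketch of that cleanup (the exact object names removed may differ) is:

#Remove intermediate objects and reclaim memory (object names are illustrative)
rm(USblogs, USnews, UStweets, sampBlogs, sampNews, sampTweets,
   WordCountBlogs, WordCountNews, WordCountTweets)
gc()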

Building the corpus

The VCorpus function in the tm package is used to read the three sample files into one single corpus. Subsequently, the tm_map function is used to apply the necessary transformations, as in the code below.

library(tm)
UScorpus <- VCorpus(DirSource("./cleanedSamples", encoding = "UTF-8"), readerControl = list(language= "en"))
UScorpus <- tm_map(UScorpus, content_transformer(tolower))  #content_transformer keeps documents as PlainTextDocuments
UScorpus <- tm_map(UScorpus, stripWhitespace)
UScorpus <- tm_map(UScorpus, removePunctuation)
UScorpus <- tm_map(UScorpus, removeNumbers)

Stop words and Stemming

Since the objective is to predict the next word, removing stop words is not appropriate here: the word to be predicted is often itself a stop word, and the preceding stop words carry information about what follows (for example, the most frequent trigrams shown later, such as "one of the", consist almost entirely of stop words). Similarly, stemming is not done, because it would strip contextual information from the words that is important for predicting the next word.

Tokenization

The NGramTokenizer function in the RWeka package is used to create the Ngrams. Separate Ngrams are built for the blogs, news and Twitter files and then rolled up into data frames, which are merged to create the unigram, bigram, trigram and quadragram data frames.

library(RWeka)
#Wrapper functions for RWeka Ngram tokenization
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min =3, max =3))
QuadragramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min =4, max =4))
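
As a quick illustration (not part of the original pipeline), applying one of these wrappers to a short character string shows the kind of tokens produced:

#Example: bigram tokens from a short sentence
BigramTokenizer("this is a short example sentence")
#expected to return the five consecutive bigrams "this is", "is a", "a short", "short example", "example sentence"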

Working around memory constraints

To work around the massive memory resources needed for Ngram tokenization, each of the corpora (blogs, news and tweets) is processed separately, converted to a data frame and then merged to get the Ngram data frame. After each Ngram data frame is created, all the intermediate data sets are removed to free memory (code hidden; a sketch of this cleanup for the unigram case follows the unigram code below).

Unigram Tokenization

The code for Unigram tokenization is shown below:

#Create Unigrams for blogs, news and twitter separately and merge the dataframes to UnigramData
blogsUnigram<- DocumentTermMatrix(UScorpus[1], control = list(tokenize = UnigramTokenizer)) 
newsUnigram<- DocumentTermMatrix(UScorpus[2], control = list(tokenize = UnigramTokenizer)) 
tweetsUnigram<- DocumentTermMatrix(UScorpus[3], control = list(tokenize = UnigramTokenizer)) 
blogsUnimat <- sort(colSums(as.matrix(blogsUnigram)), decreasing=TRUE)
blogsUniData <- data.frame(word = names(blogsUnimat), freq = blogsUnimat)
newsUnimat <- sort(colSums(as.matrix(newsUnigram)), decreasing=TRUE)
newsUniData <- data.frame(word = names(newsUnimat), freq = newsUnimat)
tweetsUnimat <- sort(colSums(as.matrix(tweetsUnigram)), decreasing=TRUE)
tweetsUniData <- data.frame(word = names(tweetsUnimat), freq = tweetsUnimat)
UnigramData <- merge(merge(blogsUniData, newsUniData, by = "word", all = TRUE),
                     tweetsUniData, by = "word", all = TRUE)
#Words missing from a file get NA after the outer merge; treat those as zero counts
UnigramData$frequency <- rowSums(UnigramData[, c("freq.x", "freq.y", "freq")], na.rm = TRUE)
UnigramData <- subset(UnigramData, select = -c(freq.x, freq.y, freq))
UnigramData <- UnigramData[order(-UnigramData$frequency),]
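
After the merge, the intermediate term matrices and per-file data frames can be dropped; a minimal sketch of that (hidden) cleanup is:

#Drop intermediate objects from unigram tokenization and reclaim memory
rm(blogsUnigram, newsUnigram, tweetsUnigram, blogsUnimat, newsUnimat, tweetsUnimat,
   blogsUniData, newsUniData, tweetsUniData)
gc()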

The code chunks for the Bigram, Trigram and Quadragram tokenization are not displayed, in order to keep the report concise.
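
As an indication of the pattern, a sketch of the Bigram version (the hidden chunks may differ in detail) is shown below:

#Create Bigrams for blogs, news and twitter separately and merge the dataframes to BigramData (sketch)
blogsBigram <- DocumentTermMatrix(UScorpus[1], control = list(tokenize = BigramTokenizer))
newsBigram <- DocumentTermMatrix(UScorpus[2], control = list(tokenize = BigramTokenizer))
tweetsBigram <- DocumentTermMatrix(UScorpus[3], control = list(tokenize = BigramTokenizer))
blogsBimat <- sort(colSums(as.matrix(blogsBigram)), decreasing=TRUE)
blogsBiData <- data.frame(word = names(blogsBimat), freq = blogsBimat)
newsBimat <- sort(colSums(as.matrix(newsBigram)), decreasing=TRUE)
newsBiData <- data.frame(word = names(newsBimat), freq = newsBimat)
tweetsBimat <- sort(colSums(as.matrix(tweetsBigram)), decreasing=TRUE)
tweetsBiData <- data.frame(word = names(tweetsBimat), freq = tweetsBimat)
BigramData <- merge(merge(blogsBiData, newsBiData, by = "word", all = TRUE),
                    tweetsBiData, by = "word", all = TRUE)
BigramData$frequency <- rowSums(BigramData[, c("freq.x", "freq.y", "freq")], na.rm = TRUE)
BigramData <- subset(BigramData, select = -c(freq.x, freq.y, freq))
BigramData <- BigramData[order(-BigramData$frequency),]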

Top Ngrams

#Unigram
head(UnigramData, n = 10)
##       word frequency
## 21222  the    238146
## 749    and    119907
## 8278   for     54865
## 21214 that     52209
## 23591  you     47015
## 23339 with     35723
## 22936  was     31236
## 21298 this     27150
## 9637  have     26576
## 1015   are     24535
#Bigram
head(BigramData, n = 10)
##           word frequency
## 46733   of the     21463
## 33491   in the     20712
## 71719   to the     10727
## 24809  for the     10035
## 47688   on the      9683
## 70277    to be      8073
## 9503    at the      7196
## 7251   and the      6286
## 32610     in a      5892
## 78509 with the      5259
#Trigram
head(TrigramData, n = 10)
##                 word frequency
## 27597     one of the      1729
## 767         a lot of      1506
## 32250 thanks for the      1188
## 38037        to be a       873
## 12773    going to be       858
## 33772     the end of       747
## 16607      i want to       731
## 20514       it was a       724
## 28147     out of the       716
## 5006      as well as       707
#Quadragram
head(QuadragramData, n = 10)
##                    word frequency
## 7338     the end of the       373
## 7690    the rest of the       368
## 1193      at the end of       315
## 2282 for the first time       312
## 1214   at the same time       253
## 6264    one of the most       231
## 4506     is going to be       221
## 4298   in the middle of       207
## 9811   when it comes to       205
## 4589      is one of the       204

Barplots of top Ngrams

library(ggplot2)
#Barplot for Unigram
ggplot(UnigramData[1:10,], aes(x=reorder(word, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill = "brown") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    labs(title = "Top Unigrams by Frequency")

#Barplot for bigram
ggplot(BigramData[1:10,], aes(x=reorder(word, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill = "brown") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    labs(title = "Top Bigrams by Frequency")

#Barplot for Trigram
ggplot(TrigramData[1:10,], aes(x=reorder(word, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill = "brown") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    labs(title = "Top Trigrams by Frequency")

#Barplot for Quadragram
ggplot(QuadragramData[1:10,], aes(x=reorder(word, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill = "brown") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Quadragram") + ylab("Frequency") +
    labs(title = "Top Quadragrams by Frequency")

Word Clouds

Word clouds with the top 50 terms for each of the Ngrams are shown below.

library(wordcloud)
#Unigram Word Cloud
wordcloud(UnigramData[1:50,]$word, UnigramData[1:50,]$frequency, max.words=50, colors=brewer.pal(6, "Dark2"))

#Bigram Word Cloud
wordcloud(BigramData[1:50,]$word, BigramData[1:50,]$frequency, max.words=50, colors=brewer.pal(6, "Dark2"))

#Trigram Word Cloud
wordcloud(TrigramData[1:50,]$word, TrigramData[1:50,]$frequency, max.words=50, colors=brewer.pal(6, "Dark2"))

#Quadragram Word Cloud
wordcloud(QuadragramData[1:50,]$word, QuadragramData[1:50,]$frequency, max.words=50, colors=brewer.pal(6, "Dark2"))

Next Steps:

The next step is to build a prediction model using a combination of bigrams, trigrams and quadragrams. An appropriate smoothing or back-off technique will be chosen from among Stupid Backoff, Good-Turing and Kneser-Ney, based on further study and analysis.
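
As a first illustration of how these Ngram tables could feed such a model, the sketch below implements a simple highest-frequency back-off lookup (without any smoothing or score discounting). predictNext is a hypothetical helper, not part of the final model, and assumes the UnigramData, BigramData, TrigramData and QuadragramData data frames built above.

#Hypothetical sketch: back off from quadragram to trigram to bigram matches
predictNext <- function(phrase) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    for (n in 3:1) {
        if (length(words) >= n) {
            prefix <- paste(tail(words, n), collapse = " ")
            gramData <- list(BigramData, TrigramData, QuadragramData)[[n]]
            hits <- gramData[startsWith(as.character(gramData$word), paste0(prefix, " ")), ]
            if (nrow(hits) > 0) {
                best <- as.character(hits$word[which.max(hits$frequency)])
                return(tail(unlist(strsplit(best, " ")), 1))   #last word of the best matching Ngram
            }
        }
    }
    as.character(UnigramData$word[1])   #fall back to the most frequent unigram
}
predictNext("thanks for")   #expected to return "the", from the trigram "thanks for the"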