Introduction

This report is the first milestone report for the DSS Capstone project, which uses the SwiftKey data. SwiftKey has created a keyboard app for mobile devices that uses natural language processing (NLP) to implement predictive text models: the app predicts what the user is typing and makes suggestions based on the NLP model.

Objective

The final objective of the DSS Capstone Project is to create a similar predictive model, implement it in a Shiny app, and present it with accompanying slides and a report. The model will be created for the English locale only.

This is a milestone report towards that final objective, with the following goals:

  1. Download and load the data.
  2. Create summary statistics.
  3. Highlight preliminary findings.
  4. Identify the plan to create a prediction algorithm.

The above objectives are discussed below in greater detail.

Data Exploration & Data Cleansing

The corpus used for this project is HC Corpora.

The SwiftKey data consists of three sets of files: blogs, news, and twitter. Each set is available in four locales; only the English (en_US) locale is used for this project.

The file sizes (in bytes) for the English files are shown below:

file.info("data/final/en_US/en_US.blogs.txt")$size   
## [1] 210160014
file.info("data/final/en_US/en_US.news.txt")$size    
## [1] 205811889
file.info("data/final/en_US/en_US.twitter.txt")$size 
## [1] 167105338
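
The analysis below relies on the stringi package (for corpus statistics), the tm package (for the text transformations and document-term matrices), and the wordcloud package (for the word clouds). Assuming these packages are installed, they are loaded as follows:

library(stringi)
library(tm)
library(wordcloud)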

The three English files are loaded first. As an initial cleansing step, certain special characters, such as left and right single quotes and left and right double quotes, are removed.
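
The exact loading and cleansing code is not reproduced in this report; a minimal sketch of this step, assuming the replacement rules are limited to the quote characters mentioned above, could look like the following:

# Read the raw files; skipNul = TRUE skips embedded NUL characters
blogs   <- readLines("data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Replace curly single quotes with a plain apostrophe and curly double quotes with a plain double quote
blogs   <- gsub("[\u2018\u2019]", "'", gsub("[\u201C\u201D]", '"', blogs))
news    <- gsub("[\u2018\u2019]", "'", gsub("[\u201C\u201D]", '"', news))
twitter <- gsub("[\u2018\u2019]", "'", gsub("[\u201C\u201D]", '"', twitter))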

The initial statistics for the complete data sets are shown below:

stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223032   169860744
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162384825   134370070

The sample size for this milestone report is 100,000 lines from each of the three data sets: blogs, news, and twitter.
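
The exact sampling code is not reproduced here; a minimal sketch, assuming a simple random sample of lines and a hypothetical seed chosen only for reproducibility, is:

set.seed(1234)                        # hypothetical seed for reproducibility
sampleBlogs   <- sample(blogs,   100000)
sampleNews    <- sample(news,    100000)
sampleTwitter <- sample(twitter, 100000)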

Each sample is transformed by removing numbers and punctuation, as well as stopwords, i.e., common words such as "for", "and", and "of".

Below is the analysis of the blogs data set.

sampleBlogsTransformed<-removeWords(sampleBlogs,stopwords())      # drop common English stopwords
sampleBlogsTransformed<-removeNumbers(sampleBlogsTransformed)     # drop digits
sampleBlogsTransformed<-removePunctuation(sampleBlogsTransformed) # drop punctuation
sampleBlogsTransformed<-stripWhitespace(sampleBlogsTransformed)   # collapse repeated whitespace

# Collapse the sample into a single document and build a tm corpus
sampleBlogsTransformed2<-paste(sampleBlogsTransformed,collapse = " ")
sampleBlogsTransformed2<- VectorSource(sampleBlogsTransformed2)
sampleBlogsCorpus <- Corpus(sampleBlogsTransformed2)

# Build a document-term matrix and compute sorted word frequencies
dtmBlogs <- DocumentTermMatrix(sampleBlogsCorpus)
dtmBlogsM <- as.matrix(dtmBlogs)
freqBlogs <- colSums(dtmBlogsM)
freqBlogs <- sort(freqBlogs, decreasing=TRUE)

The ten most frequently used words in the blogs data set are shown below, followed by a word cloud of the one hundred most frequent words.

head(freqBlogs, 10)
##    the    one   will   like   just    can   time    get   know people 
##  20683  13658  12259  11128  11014  10646   9828   7919   6670   6607
words <- names(freqBlogs)
wordcloud(words[1:100], freqBlogs[1:100])

A similar analysis follows for the news data set.

sampleNewsTransformed<-removeWords(sampleNews,stopwords())
sampleNewsTransformed<-removeNumbers(sampleNewsTransformed)
sampleNewsTransformed<-removePunctuation(sampleNewsTransformed)
sampleNewsTransformed<-stripWhitespace(sampleNewsTransformed)

sampleNewsTransformed2<-paste(sampleNewsTransformed,collapse = " ")
sampleNewsTransformed2<- VectorSource(sampleNewsTransformed2)
sampleNewsCorpus <- Corpus(sampleNewsTransformed2)

dtmNews <- DocumentTermMatrix(sampleNewsCorpus)
dtmNewsM <- as.matrix(dtmNews)
freqNews <- colSums(dtmNewsM)
freqNews <- sort(freqNews, decreasing=TRUE)

The ten most frequently used words in the news data set are shown below, followed by a word cloud of the one hundred most frequent words.

head(freqNews, 10)
##  said   the  will   one   new   can  also   but   two  year 
## 24821 24801 10563  8128  6970  5801  5786  5606  5605  5602
words <- names(freqNews)
wordcloud(words[1:100], freqNews[1:100])

Below is the analysis of the twitter data set.

sampleTwitterTransformed<-removeWords(sampleTwitter,stopwords())
sampleTwitterTransformed<-removeNumbers(sampleTwitterTransformed)
sampleTwitterTransformed<-removePunctuation(sampleTwitterTransformed)
sampleTwitterTransformed<-stripWhitespace(sampleTwitterTransformed)

sampleTwitterTransformed2<-paste(sampleTwitterTransformed,collapse = " ")
sampleTwitterTransformed2<- VectorSource(sampleTwitterTransformed2)
sampleTwitterCorpus <- Corpus(sampleTwitterTransformed2)

dtmTwitter <- DocumentTermMatrix(sampleTwitterCorpus)
dtmTwitterM <- as.matrix(dtmTwitter)
freqTwitter <- colSums(dtmTwitterM)
freqTwitter <- sort(freqTwitter, decreasing=TRUE)

The ten most frequently used words in the twitter data set are shown below, followed by a word cloud of the one hundred most frequent words.

head(freqTwitter, 10)
##   just   like    get   love   good   will    the thanks    can    day 
##   6262   5238   4750   4485   4204   4087   3974   3857   3824   3810
words <- names(freqTwitter)
wordcloud(words[1:100], freqTwitter[1:100])
## Warning in wordcloud(words[1:100], freqTwitter[1:100]): just could not be
## fit on page. It will not be plotted.

Future Plans

This milestone report is an initial step with further analysis to follow. To that end, the final report will include the following:

  1. A bi-gram, tri-gram, and quad-gram analysis of all three data sets to better predict the next word (a minimal bi-gram sketch is shown after this list).
  2. A Shiny app that implements the predictive model.
  3. Improved memory and resource usage, since the target devices are mobile and the Shiny server imposes its own constraints; the goal is a smaller footprint.
  4. A slide deck with a user guide describing how to use the Shiny app.
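
As a preview of item 1, a minimal bi-gram sketch using only base R on the cleaned blog sample (sampleBlogsTransformed, created earlier in this report) is shown below; the final report may instead use a dedicated tokenizer.

tokens  <- unlist(strsplit(sampleBlogsTransformed, "\\s+"))   # split the cleaned blog sample into words
tokens  <- tokens[tokens != ""]                               # drop empty tokens
bigrams <- paste(head(tokens, -1), tail(tokens, -1))          # pair each word with the word that follows it
head(sort(table(bigrams), decreasing = TRUE), 10)             # ten most frequent bi-grams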