This is the first milestone report for the Data Science Specialization (DSS) Capstone project, which uses the SwiftKey data. SwiftKey has created a keyboard app for mobile devices that uses Natural Language Processing (NLP) to implement predictive text models: the app predicts what the user is typing and makes suggestions based on the NLP model.
The final objective of the DSS Capstone project is to create a similar predictive model, present it in a Shiny app, and accompany it with slides and a report. The model will be built for the English locale only.
This milestone report is a step toward that final objective. Its goals are to load and summarize the SwiftKey data, draw and clean a sample from it, and explore the word frequencies in each data set. Each of these steps is discussed below in greater detail.
The corpus used for this project is HC Corpora.
The SwiftKey data consists of three sets of files: blogs, news, and twitter.
Each set is available in four locales: English (en_US), German (de_DE), Finnish (fi_FI), and Russian (ru_RU).
The file sizes, in bytes, for the English files are shown below:
file.info("data/final/en_US/en_US.blogs.txt")$size
## [1] 210160014
file.info("data/final/en_US/en_US.news.txt")$size
## [1] 205811889
file.info("data/final/en_US/en_US.twitter.txt")$size
## [1] 167105338
The three English files are loaded first. As an initial cleansing step, certain special characters, such as left and right single quotes and left and right double quotes, are removed.
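The loading code itself is not shown in this report; a minimal sketch, assuming readLines with UTF-8 encoding and a small set of quote substitutions, might look like this:

# Assumed loading step: read each file as UTF-8, skipping embedded NULs
blogs   <- readLines("data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# One possible cleansing rule: replace curly quotes with plain ASCII equivalents
cleanQuotes <- function(x) {
  x <- gsub("\u2018|\u2019", "'", x)   # left/right single quotes
  x <- gsub("\u201C|\u201D", "\"", x)  # left/right double quotes
  x
}
blogs   <- cleanQuotes(blogs)
news    <- cleanQuotes(news)
twitter <- cleanQuotes(twitter)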
The initial statistics for the complete data set are shown below (stri_stats_general is from the stringi package):
library(stringi)  # provides stri_stats_general
stri_stats_general(blogs)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(news)
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223032 169860744
stri_stats_general(twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162384825 134370070
For this milestone report, a sample of 100,000 lines is drawn from each of the three data sets: blogs, news, and twitter.
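The sampling step is not shown in the report; a minimal sketch, assuming simple random sampling of lines with a fixed (hypothetical) seed, might look like this:

set.seed(1234)  # assumed seed, for reproducibility
sampleBlogs   <- sample(blogs,   100000)
sampleNews    <- sample(news,    100000)
sampleTwitter <- sample(twitter, 100000)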
Each sample is transformed by removing numbers and punctuation, as well as stopwords, i.e., common words such as "for", "and", and "of". These transformations use the tm package.
Below is the analysis of the blogs data.
library(tm)        # removeWords, removeNumbers, removePunctuation, stripWhitespace
library(wordcloud) # for the word-cloud plots below

# Remove stopwords, numbers, punctuation, and extra whitespace
sampleBlogsTransformed <- removeWords(sampleBlogs, stopwords())
sampleBlogsTransformed <- removeNumbers(sampleBlogsTransformed)
sampleBlogsTransformed <- removePunctuation(sampleBlogsTransformed)
sampleBlogsTransformed <- stripWhitespace(sampleBlogsTransformed)

# Collapse the sample into a single document and build a corpus
sampleBlogsTransformed2 <- paste(sampleBlogsTransformed, collapse = " ")
sampleBlogsTransformed2 <- VectorSource(sampleBlogsTransformed2)
sampleBlogsCorpus <- Corpus(sampleBlogsTransformed2)

# Term frequencies from the document-term matrix, sorted in descending order
dtmBlogs <- DocumentTermMatrix(sampleBlogsCorpus)
dtmBlogsM <- as.matrix(dtmBlogs)
freqBlogs <- colSums(dtmBlogsM)
freqBlogs <- sort(freqBlogs, decreasing=TRUE)
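As an aside, calling as.matrix on a document-term matrix materializes a dense matrix, which can exhaust memory on larger samples. The same frequencies can be obtained directly from the sparse representation, for example via the slam package (on which tm depends):

library(slam)
freqBlogs <- sort(col_sums(dtmBlogs), decreasing = TRUE)  # no dense matrix needed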
The ten most frequently used words in the blogs data set are shown below, followed by a word cloud of the one hundred most frequent words. Note that "the" still appears at the top even though stopwords were removed: removeWords is case-sensitive, so capitalized occurrences such as "The" survive the cleaning step and are only lowercased later, when DocumentTermMatrix builds its terms.
head(freqBlogs, 10)
## the one will like just can time get know people
## 20683 13658 12259 11128 11014 10646 9828 7919 6670 6607
words <- names(freqBlogs)
wordcloud(words[1:100], freqBlogs[1:100])
Following is a similar analysis for the news data set.
# Same transformation and frequency pipeline as for the blogs sample
sampleNewsTransformed <- removeWords(sampleNews, stopwords())
sampleNewsTransformed <- removeNumbers(sampleNewsTransformed)
sampleNewsTransformed <- removePunctuation(sampleNewsTransformed)
sampleNewsTransformed <- stripWhitespace(sampleNewsTransformed)
sampleNewsTransformed2 <- paste(sampleNewsTransformed, collapse = " ")
sampleNewsTransformed2 <- VectorSource(sampleNewsTransformed2)
sampleNewsCorpus <- Corpus(sampleNewsTransformed2)
dtmNews <- DocumentTermMatrix(sampleNewsCorpus)
dtmNewsM <- as.matrix(dtmNews)
freqNews <- colSums(dtmNewsM)
freqNews <- sort(freqNews, decreasing=TRUE)
The most frequently used words in the news data set are shown below, together with a word cloud of the one hundred most frequent words.
head(freqNews, 10)
## said the will one new can also but two year
## 24821 24801 10563 8128 6970 5801 5786 5606 5605 5602
words <- names(freqNews)
wordcloud(words[1:100], freqNews[1:100])
Below is the analysis of the twitter data set.
# Same transformation and frequency pipeline as for the blogs sample
sampleTwitterTransformed <- removeWords(sampleTwitter, stopwords())
sampleTwitterTransformed <- removeNumbers(sampleTwitterTransformed)
sampleTwitterTransformed <- removePunctuation(sampleTwitterTransformed)
sampleTwitterTransformed <- stripWhitespace(sampleTwitterTransformed)
sampleTwitterTransformed2 <- paste(sampleTwitterTransformed, collapse = " ")
sampleTwitterTransformed2 <- VectorSource(sampleTwitterTransformed2)
sampleTwitterCorpus <- Corpus(sampleTwitterTransformed2)
dtmTwitter <- DocumentTermMatrix(sampleTwitterCorpus)
dtmTwitterM <- as.matrix(dtmTwitter)
freqTwitter <- colSums(dtmTwitterM)
freqTwitter <- sort(freqTwitter, decreasing=TRUE)
The most frequently used words in the twitter data set are shown below, together with a word cloud of the one hundred most frequent words.
head(freqTwitter, 10)
## just like get love good will the thanks can day
## 6262 5238 4750 4485 4204 4087 3974 3857 3824 3810
words <- names(freqTwitter)
wordcloud(words[1:100], freqTwitter[1:100])
## Warning in wordcloud(words[1:100], freqTwitter[1:100]): just could not be
## fit on page. It will not be plotted.
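The warning simply means the highest-frequency word could not be placed at the default font scale and was dropped from the plot. One possible remedy, given that wordcloud's default scale is c(4, 0.5), is to shrink the maximum font size:

wordcloud(words[1:100], freqTwitter[1:100], scale = c(3, 0.5))  # smaller maximum size so frequent words fit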
This milestone report is an initial step, with further analysis to follow. Building on it, the final report will cover the construction of the predictive text model, the Shiny app that presents it, and the accompanying slides.
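The prediction work is expected to move from the single-word frequencies explored here to n-gram frequencies, since a predictive model needs word context. A minimal sketch of the idea, assuming a simple bigram lookup built from the cleaned blogs sample (the helper name predictNext and all details below are hypothetical, not the final model):

# Count bigrams in the cleaned text (line boundaries ignored for simplicity)
tokens  <- unlist(strsplit(tolower(sampleBlogsTransformed), "\\s+"))
tokens  <- tokens[tokens != ""]
bigrams <- paste(tokens[-length(tokens)], tokens[-1])
bigramFreq <- sort(table(bigrams), decreasing = TRUE)

# Suggest the n most likely next words after a given word
predictNext <- function(word, n = 3) {
  matches <- bigramFreq[grep(paste0("^", word, " "), names(bigramFreq))]
  sub(paste0("^", word, " "), "", names(head(matches, n)))
}
predictNext("love")  # e.g. top-3 candidate continuations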