The goal of this milestone report is to demonstrate that we have basic knowledge of the data and that we are on track to create our own prediction algorithm. It describes the exploratory analysis performed so far and the goal for the eventual application, covering only the major features of the data identified to date, and briefly summarizes the plan for creating the prediction algorithm and Shiny app.
The motivation for this report is to:
- Demonstrate that we have downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings.
- Get feedback on the plans for creating a prediction algorithm and Shiny app.
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.
The training data was downloaded from the Coursera website: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The dataset contains text in four languages (German, English, Finnish, and Russian) drawn from three sources: Twitter, news, and blogs. In this report we analyze only the English-language files.
library(R.utils)
library(tm)
library(SnowballC)
library(RWeka)
library(wordcloud)
library(ggplot2)
library(dplyr)
# File sizes in MB
blogsize <- file.info("en_US.blogs.txt")$size / (1024*1024)
twittersize <- file.info("en_US.twitter.txt")$size / (1024*1024)
newssize <- file.info("en_US.news.txt")$size / (1024*1024)
# Line counts per file
bloglines <- countLines("en_US.blogs.txt")
twitterlines <- countLines("en_US.twitter.txt")
newslines <- countLines("en_US.news.txt")
myblogs <- readLines('en_US.blogs.txt', encoding = 'UTF-8')
mytwitter <- readLines('en_US.twitter.txt', encoding = 'UTF-8')
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
mynews <- readLines('en_US.news.txt', encoding = 'UTF-8')
# Approximate word counts (split on runs of non-word characters)
mywordblog <- sum(sapply(gregexpr("\\W+", myblogs), length) + 1)
mywordtwitter <- sum(sapply(gregexpr("\\W+", mytwitter), length) + 1)
mywordnews <- sum(sapply(gregexpr("\\W+", mynews), length) + 1)
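For convenience, these basic statistics can be gathered into a single summary table. The short sketch below simply reuses the variables defined above (the summarystats name is only for illustration).
# Sketch: collect the basic statistics computed above into one table
summarystats <- data.frame(
  source  = c("blogs", "twitter", "news"),
  size_MB = round(c(blogsize, twittersize, newssize), 1),
  lines   = c(bloglines, twitterlines, newslines),
  words   = c(mywordblog, mywordtwitter, mywordnews))
summarystats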
Given the size of each file, we sample a subset of the lines (100,000 blog lines, 50,000 tweets, and 50,000 news lines, roughly 5% of the corpus overall) and then perform some additional cleaning and analysis.
set.seed(21)
samplingmyblogs <- myblogs[sample(1:length(myblogs),100000)]
samplingmytwitter <- mytwitter[sample(1:length(mytwitter),50000)]
samplingmynews <- mynews[sample(1:length(mynews),50000)]
sampleddataset <- c(samplingmyblogs,samplingmytwitter,samplingmynews)
dir.create("./folder", showWarnings = FALSE)
writeLines(sampleddataset, "./folder/sampleddataset.txt")
folderlocation <- file.path(".", "folder")
tmdoc <- Corpus(DirSource(folderlocation))
tmdoc <- tm_map(tmdoc, content_transformer(tolower))
tmdoc <- tm_map(tmdoc, removeNumbers)
tmdoc <- tm_map(tmdoc, removePunctuation)
tmdoc <- tm_map(tmdoc, stripWhitespace)
tmdoc <- tm_map(tmdoc, removeWords, stopwords("english"))
spaces <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
tmdoc <- tm_map(tmdoc, spaces, "/|@|\\|")
# Using SnowballC, which implements Porter's word-stemming algorithm, to collapse words to a common root and aid comparison of vocabulary.
tmdoc <- tm_map(tmdoc, stemDocument)
tmdoctoken <- tmdoc
options(mc.cores=1) # single core to avoid RWeka/parallel tokenization issues
tritoken <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
tridtm <- DocumentTermMatrix(tmdoc,control = list(tokenize = tritoken))
bitoken <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
bidtm <- DocumentTermMatrix(tmdoc,control = list(tokenize = bitoken))
onetoken <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
unidtm <- DocumentTermMatrix(tmdoc,control = list(tokenize = onetoken))
unifrequency <- sort(colSums(as.matrix(unidtm)), decreasing=TRUE)
uniwordfrequency <- data.frame(word=names(unifrequency), freq=unifrequency)
paste("Top 10 highest frequency Unigrams")
## [1] "Top 10 highest frequency Unigrams"
head(uniwordfrequency,10)
## word freq
## one one 20983
## will will 20577
## like like 17813
## said said 16985
## just just 16757
## time time 16745
## get get 16458
## can can 15839
## year year 13653
## make make 13019
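The ggplot2 and dplyr packages loaded earlier can also give a quick bar chart of the most frequent unigrams. This is a minimal sketch based on the uniwordfrequency data frame built above; the 15-term cutoff is arbitrary.
# Sketch: bar chart of the 15 most frequent (stemmed) unigrams
uniwordfrequency %>%
  top_n(15, freq) %>%
  ggplot(aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "Top 15 unigrams in the sample")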
bifrequency <- sort(colSums(as.matrix(bidtm)), decreasing=TRUE)
biwordfrequency <- data.frame(word=names(bifrequency), freq=bifrequency)
paste("Top 10 highest frequency Bigrams")
## [1] "Top 10 highest frequency Bigrams"
head(biwordfrequency,10)
## word freq
## last year last year 1362
## new york new york 1252
## look like look like 1193
## dont know dont know 1135
## year ago year ago 1118
## right now right now 1058
## feel like feel like 942
## last week last week 935
## first time first time 809
## high school high school 807
trifrequency <- sort(colSums(as.matrix(tridtm)), decreasing=TRUE)
triwordfrequency <- data.frame(word=names(trifrequency), freq=trifrequency)
paste("Top 10 highest frequency Trigrams")
## [1] "Top 10 highest frequency Trigrams"
head(triwordfrequency,10)
## word freq
## new york citi new york citi 170
## new york time new york time 115
## presid barack obama presid barack obama 111
## cant wait see cant wait see 101
## happi mother day happi mother day 96
## im pretti sure im pretti sure 89
## let us know let us know 86
## two year ago two year ago 85
## feel like im feel like im 67
## gov chris christi gov chris christi 66
wordcloud(names(unifrequency), unifrequency, max.words=25, scale=c(5, .1), colors=brewer.pal(8, "Paired"))
wordcloud(names(bifrequency), bifrequency, max.words=25, scale=c(3, .1), colors=brewer.pal(8, "Paired"))
wordcloud(names(trifrequency), trifrequency, max.words=25, scale=c(3, .1), colors=brewer.pal(8, "Paired"))
It was interesting to see the word clouds and the various n-gram frequencies, which give a basic idea of the structure of the vocabulary. This will help guide the next phase of the project.
We sampled only about 5% of the data, and the overall processing performance was still not great, even without any advanced analytical models applied against the data. It will be interesting to see the overall performance once the whole application is completed.
Before we can create and run any prediction algorithm, there is still quite a bit of data cleaning and preprocessing to do. Here is a sample of what needs to be accomplished:
- Adjust the size of the vocabulary because of computing resource limitations.
- Correct words that have been spelled incorrectly.
- Remove duplicate words by first lowercasing sentences.
- Use a hashing strategy to improve overall efficiency.
- Build n-gram frequency models (see the sketch below). We are not sure yet what the maximum n-gram order will be; it may be limited by computing resources.
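As an illustration of where the n-gram models are headed, the sketch below shows one possible way to turn the bigram and trigram frequency tables built above into a naive next-word lookup with a simple trigram-to-bigram backoff. This is not the final algorithm; the predictnextword name is hypothetical, and the code relies on the tables being sorted by decreasing frequency, as constructed earlier.
# Sketch (not the final algorithm): naive next-word lookup with a
# trigram-to-bigram backoff, reusing the frequency tables built above.
# Both tables are already sorted by decreasing frequency.
predictnextword <- function(phrase, n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  # Try the last two words against the trigram table first
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- triwordfrequency[grepl(paste0("^", prefix, " "), triwordfrequency$word), ]
    if (nrow(hits) > 0) {
      return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))
    }
  }
  # Back off to the bigram table using only the last word
  prefix <- tail(words, 1)
  hits <- biwordfrequency[grepl(paste0("^", prefix, " "), biwordfrequency$word), ]
  head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n)
}
predictnextword("new york")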
Once this is done we will need to try a variety of prediction algorithms and assess training vs. validation vs. test performance, while taking computing resource limitations into account. The outcome will be a working Shiny app.
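For that evaluation, the sampled data could be split into training, validation, and test sets along the following lines; the 60/20/20 proportions are only an assumption for illustration.
# Sketch: split the sampled lines into training / validation / test sets.
# The 60/20/20 proportions are an assumption, not a final decision.
set.seed(21)
partition <- sample(c("train", "valid", "test"), length(sampleddataset),
                    replace = TRUE, prob = c(0.6, 0.2, 0.2))
trainset <- sampleddataset[partition == "train"]
validset <- sampleddataset[partition == "valid"]
testset  <- sampleddataset[partition == "test"]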