Summary

The goal of this milestone report is to demonstrate that we have a basic understanding of the data and that we are on track to create our own prediction algorithm. It describes the exploratory analysis accomplished so far and the goal for the eventual application and algorithm, covering only the major features of the data identified to date, and briefly summarizes the plan for creating the prediction algorithm and Shiny app.

The motivation for this report is to:

- Demonstrate that we have downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings.
- Get feedback on the plans for creating a prediction algorithm and Shiny app.

Understanding the problem

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models.

Exploratory data analysis

The training data was downloaded from the Coursera website: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The dataset contains text in four languages (German, English, Finnish and Russian) originating from three sources: Twitter, news and blogs. In this report we analyze only the English-language files.

library(R.utils)
library(tm)
library(SnowballC)
library(RWeka)
library(wordcloud)
library(ggplot2)
library(dplyr)

Data Set Size

myblog <- file.info("en_US.blogs.txt")$size / (1024*1024)
mytwitter <- file.info("en_US.twitter.txt")$size / (1024*1024)
mynews <- file.info("en_US.news.txt")$size / (1024*1024)

The size of the Blog file is 200.4242 MB.

The size of the Twitter file is 159.3641 MB.

The size of the News file is 196.2775 MB.

myblog <- countLines("en_US.blogs.txt")
mytwitter <- countLines("en_US.twitter.txt")
mynews <- countLines("en_US.news.txt")

The number of lines in the Blog file is 899288.

The number of lines in the Twitter file is 2360148.

The number of lines in the News file is 1010242.

myblogs <- readLines('en_US.blogs.txt', encoding = 'UTF-8')
mytwitter <- readLines('en_US.twitter.txt', encoding = 'UTF-8')
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
mynews <- readLines('en_US.news.txt', encoding = 'UTF-8')
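
The embedded-nul warnings come from a handful of lines in the Twitter file. If they become an issue later, one option (shown here only as an alternative, not what was run for this report) is to have readLines drop the nul characters:

# Optional alternative: skip embedded nul characters while reading the Twitter file
mytwitter <- readLines('en_US.twitter.txt', encoding = 'UTF-8', skipNul = TRUE)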

# Approximate word counts: number of non-word separator runs per line plus one, summed over all lines
mywordblog <- sum(sapply(gregexpr("\\W+", myblogs), length) + 1)
mywordtwitter <- sum(sapply(gregexpr("\\W+", mytwitter), length) + 1)
mywordnews <- sum(sapply(gregexpr("\\W+", mynews), length) + 1)

The number of words in the Blog file is 3.9121 × 10^7.

The number of words in the Twitter file is 3.2793 × 10^7.

The number of words in the News file is 3.6721 × 10^7.

Random Sampling

Given the size of each file, we sample roughly 5% of the combined data (100,000 lines from the blog file and 50,000 lines each from the Twitter and news files) and then perform some additional cleaning and analysis.

set.seed(21)

samplingmyblogs <- myblogs[sample(1:length(myblogs), 100000)]
samplingmytwitter <- mytwitter[sample(1:length(mytwitter), 50000)]
samplingmynews <- mynews[sample(1:length(mynews), 50000)]

sampleddataset <- c(samplingmyblogs, samplingmytwitter, samplingmynews)

# Create the output folder if needed, then write the combined sample to disk
dir.create("./folder", showWarnings = FALSE)
writeLines(sampleddataset, "./folder/sampleddataset.txt")

Cleaning and Pre-Processing the Sampled Dataset Using the tm Package

folderlocation <- file.path(".", "folder")
tmdoc <- Corpus(DirSource(folderlocation))

# Replace slashes, @ signs and pipes with spaces before stripping punctuation,
# so that joined tokens such as "word/word" are split rather than merged together
spaces <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
tmdoc <- tm_map(tmdoc, spaces, "/|@|\\|")

tmdoc <- tm_map(tmdoc, content_transformer(tolower))
tmdoc <- tm_map(tmdoc, removeNumbers)
tmdoc <- tm_map(tmdoc, removePunctuation)
tmdoc <- tm_map(tmdoc, stripWhitespace)
tmdoc <- tm_map(tmdoc, removeWords, stopwords("english"))

# Use SnowballC, which implements Porter's stemming algorithm, to collapse words
# to a common root and make vocabulary comparison easier
tmdoc <- tm_map(tmdoc, stemDocument)
tmdoctoken <- tmdoc

Initial Try at Tokenization

# Use a single core; the RWeka tokenizers can fail when tm runs in parallel
options(mc.cores = 1)

# Trigram, bigram and unigram tokenizers built on RWeka's NGramTokenizer
tritoken <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
tridtm <- DocumentTermMatrix(tmdoc, control = list(tokenize = tritoken))

bitoken <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
bidtm <- DocumentTermMatrix(tmdoc, control = list(tokenize = bitoken))

onetoken <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
unidtm <- DocumentTermMatrix(tmdoc, control = list(tokenize = onetoken))

unifrequency <- sort(colSums(as.matrix(unidtm)), decreasing=TRUE)
uniwordfrequency <- data.frame(word=names(unifrequency), freq=unifrequency)
paste("Top 10 highest frequency Unigrams")
## [1] "Top 10 highest frequency Unigrams"
head(uniwordfrequency,10)
##      word  freq
## one   one 20983
## will will 20577
## like like 17813
## said said 16985
## just just 16757
## time time 16745
## get   get 16458
## can   can 15839
## year year 13653
## make make 13019
bifrequency <- sort(colSums(as.matrix(bidtm)), decreasing=TRUE)
biwordfrequency <- data.frame(word=names(bifrequency), freq=bifrequency)
paste("Top 10 highest frequency Bigrams")
## [1] "Top 10 highest frequency Bigrams"
head(biwordfrequency,10)
##                    word freq
## last year     last year 1362
## new york       new york 1252
## look like     look like 1193
## dont know     dont know 1135
## year ago       year ago 1118
## right now     right now 1058
## feel like     feel like  942
## last week     last week  935
## first time   first time  809
## high school high school  807
trifrequency <- sort(colSums(as.matrix(tridtm)), decreasing=TRUE)
triwordfrequency <- data.frame(word=names(trifrequency), freq=trifrequency)
paste("Top 10 highest frequency Trigrams")
## [1] "Top 10 highest frequency Trigrams"
head(triwordfrequency,10)
##                                    word freq
## new york citi             new york citi  170
## new york time             new york time  115
## presid barack obama presid barack obama  111
## cant wait see             cant wait see  101
## happi mother day       happi mother day   96
## im pretti sure           im pretti sure   89
## let us know                 let us know   86
## two year ago               two year ago   85
## feel like im               feel like im   67
## gov chris christi     gov chris christi   66

Wordcloud Top 25 Unigrams

wordcloud(names(unifrequency), unifrequency, max.words=25, scale=c(5, .1), colors=brewer.pal(8, "Paired"))

[Figure: word cloud of the top 25 unigrams]

Wordcloud Top 25 Bigrams

wordcloud(names(bifrequency), bifrequency, max.words=25, scale=c(3, .1), colors=brewer.pal(8, "Paired"))

[Figure: word cloud of the top 25 bigrams]

Wordcloud Top 25 Trigrams

wordcloud(names(trifrequency), trifrequency, max.words=25, scale=c(3, .1), colors=brewer.pal(8, "Paired"))

[Figure: word cloud of the top 25 trigrams]

Frequency Analysis

[Figure: frequency bar charts for the top unigrams, bigrams and trigrams]
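
The bar charts above were built from the unigram, bigram and trigram frequency tables. As a minimal sketch (not the original plotting code), assuming the uniwordfrequency, biwordfrequency and triwordfrequency data frames created earlier are in memory, similar plots can be produced with ggplot2; the helper name plot_top_ngrams is illustrative only:

# Sketch: bar chart of the most frequent n-grams from a word/freq data frame
plot_top_ngrams <- function(freq_df, title, n = 20) {
  top <- head(freq_df, n)  # rows are already sorted by decreasing frequency
  ggplot(top, aes(x = reorder(word, freq), y = freq)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(title = title, x = "N-gram", y = "Frequency")
}

plot_top_ngrams(uniwordfrequency, "Top 20 Unigrams")
plot_top_ngrams(biwordfrequency, "Top 20 Bigrams")
plot_top_ngrams(triwordfrequency, "Top 20 Trigrams")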

Conclusion

It was interesting to see the word cloud for each set of n-grams to get a basic idea of the structure of the vocabulary, as well as the various n-gram frequencies. This will help guide the next phase of the project.

We sampled about 5% of the combined data, and even so the overall processing performance (memory use and run time for building the document-term matrices) was not great. This is without any advanced analytical models applied to the data yet. It will be interesting to see the overall performance once the whole application is completed.
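
One rough way to quantify this in the next phase is to time the document-term-matrix construction and inspect object sizes with base R. This is only a sketch, reusing the tmdoc corpus and tritoken tokenizer defined above:

# Rough resource check: build time and in-memory size of the trigram DTM
timing <- system.time(
  tridtm_check <- DocumentTermMatrix(tmdoc, control = list(tokenize = tritoken))
)
print(timing["elapsed"])                                # elapsed seconds
print(format(object.size(tridtm_check), units = "MB"))  # approximate memory footprint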

Future work

Before we can create and run any prediction algorithm there is still quite a bit of data cleaning and preprocessing that needs to occur. Here is a sample of what needs to be accomplished:

- Because of computing resource limitations we will need to adjust the size of the vocabulary.
- We will need to correct words that have been spelled incorrectly.
- Remove duplicate words by first lowercasing sentences.
- Use a hashing strategy to improve overall efficiency.
- Build n-gram frequency models (a rough sketch of how these could drive prediction follows this list). We are not sure at this time what the maximum n-gram order will be; we may be limited by computing resources.
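
To illustrate how the n-gram frequency tables could eventually drive prediction, here is a rough backoff-style sketch against the bigram and trigram tables built above. It is not the final design; predict_next_word is a hypothetical helper introduced only for illustration, and the input phrase is assumed to be pre-processed (lowercased and stemmed) the same way as the corpus:

# Sketch of a naive backoff lookup: try the trigram table first, then fall back
# to the bigram table, returning the most frequent continuation found
predict_next_word <- function(phrase, tri_freq = triwordfrequency, bi_freq = biwordfrequency) {
  tokens <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  if (length(tokens) == 0) return(NA_character_)

  # Match the last two words against the trigram table
  if (length(tokens) >= 2) {
    prefix <- paste(tail(tokens, 2), collapse = " ")
    hits <- tri_freq[grepl(paste0("^", prefix, " "), tri_freq$word), ]
    if (nrow(hits) > 0) return(sub(paste0("^", prefix, " "), "", as.character(hits$word[1])))
  }

  # Back off to the bigram table using only the last word
  prefix <- tail(tokens, 1)
  hits <- bi_freq[grepl(paste0("^", prefix, " "), bi_freq$word), ]
  if (nrow(hits) > 0) return(sub(paste0("^", prefix, " "), "", as.character(hits$word[1])))
  NA_character_  # no match in either table
}

predict_next_word("new york")  # should return "citi" given the trigram counts above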

Once this is done we will need to try a variety of prediction algorithms and assess training, validation and test performance, while taking computing resource limitations into account. The outcome will be a working Shiny app.