I began with the files downloaded directly from the provided URL (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and extracted them into the working directory.

Reading the Data

#read files
blogs<-readLines("final/en_US/en_US.blogs.txt")
news<-readLines("final/en_US/en_US.news.txt")
twitter<-readLines("final/en_US/en_US.twitter.txt")

Data Exploration - Initial Analysis

The three files have very different line counts and word counts. This should be considered in any evaluation of the text.

#get basic data about each file and put into table
lines<-c(length(blogs),length(news),length(twitter))
size<-c(file.size("final/en_US/en_US.blogs.txt"),file.size("final/en_US/en_US.news.txt"),file.size("final/en_US/en_US.twitter.txt"))
summary<-rbind(size,lines)
rownames(summary)<-c("Size","Lines")
colnames(summary)<-c("blogs","news","twitter")

Here’s the summary and plot for the line counts to compare the files.

summary
##           blogs      news   twitter
## Size  210160014 205811889 167105338
## Lines    899288     77259   2360148
barplot(summary[2,],main="Line Count", col="darkred")

Here’s the summary and plot for the word counts to compare the files.

wordcount
##         Word Count
## blogs     38154238
## news      30218125
## twitter    2693898
#plot max words for each type
wordcount<-t(wordcount)
barplot(wordcount,main="Word Count by Text Type", col="darkblue")
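
The chunk that builds wordcount is not shown in this report. The following is only a sketch of one way such a table could be produced, assuming words are delimited by whitespace. The helper count_words and the toy input are hypothetical, and this approach will not necessarily reproduce the figures above.

```r
# Hypothetical helper: count whitespace-delimited words across a character vector
count_words <- function(x) {
  sum(lengths(strsplit(trimws(x), "[[:space:]]+")))
}

# Toy stand-ins for the blogs, news and twitter vectors (assumption, not the real data)
toy <- list(
  blogs   = c("a quick example", "two words"),
  news    = "one",
  twitter = c("hi there you", "ok")
)

# Build a one-column matrix shaped like the wordcount table above
wordcount_sketch <- matrix(
  sapply(toy, count_words),
  ncol = 1,
  dimnames = list(names(toy), "Word Count")
)
wordcount_sketch
```

On the real data, the same helper could be applied to blogs, news and twitter in place of the toy list.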

Data Exploration - In-depth Analysis

The files are extremely large. For this exploration, I take a random sample of just 5,000 lines from each file. I also combine the three samples into one larger character vector for further analysis.

#create subsets of data by sampling 5k lines from each file due to time and space limitations
set.seed(9009)
bdata<-sample(blogs,replace=FALSE,size=5000)
ndata<-sample(news,replace=FALSE,size=5000)
tdata<-sample(twitter,replace=FALSE,size=5000)

#combine datasets into one for later use
data<- c(bdata,ndata,tdata)

I retrieved a list of profanity to remove from the text and created functions that use the tm package to remove numbers, punctuation, non-ASCII characters, stopwords and profanity.

#load the tm package, which provides the corpus and transformation functions used below
library(tm)
#create profanity list
profanity<- readLines("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
#create a transformer that replaces any text matching a pattern with a space
rmJunk<- content_transformer(function(x, pattern) gsub(pattern, " ", x))
#create a function that converts to ASCII and lower case and removes punctuation, numbers, stopwords, profanity and extra whitespace
cleandata<-function(x) {
  #drop any characters that cannot be represented in ASCII
  x<- iconv(x, 'utf-8', 'ascii', sub='')
  x<- VCorpus(VectorSource(x))
  x<- tm_map(x, rmJunk, "[[:digit:]]+")
  x<- tm_map(x, rmJunk, "=")
  x<- tm_map(x, rmJunk, "\\(")
  x<- tm_map(x, rmJunk, "\\)")
  x<- tm_map(x, removePunctuation)
  x<- tm_map(x, content_transformer(tolower))
  x<- tm_map(x, removeNumbers)
  x<- tm_map(x, removeWords, stopwords('english'))
  x<- tm_map(x, removeWords, profanity)
  #strip whitespace last so the gaps left by removed words are collapsed
  x<- tm_map(x, stripWhitespace)
  #return the cleaned corpus explicitly
  x
}
#run the cleaning function on each dataset
blog_clean<-cleandata(bdata)
twt_clean<-cleandata(tdata)
news_clean<-cleandata(ndata)
data_clean<-cleandata(data)

Plot the word frequencies for each dataset.
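
The plotting code is not included in this report. As a rough sketch of how word frequencies could be computed and plotted from a tm corpus, the example below builds a term-document matrix and sums the counts per term. The helper top_terms and the toy corpus are hypothetical. Note that TermDocumentMatrix drops terms shorter than three characters by default.

```r
library(tm)

# Hypothetical helper: return the n most frequent terms in a tm corpus
top_terms <- function(corpus, n = 10) {
  tdm <- TermDocumentMatrix(corpus)
  # slam::row_sums adds up counts per term without building a dense matrix
  freq <- slam::row_sums(tdm)
  head(sort(freq, decreasing = TRUE), n)
}

# Toy corpus standing in for a cleaned dataset (assumption, not the real data)
toy <- VCorpus(VectorSource(c("apple banana apple", "banana cherry apple")))
top <- top_terms(toy, n = 3)

# Bar plot of the most frequent terms
barplot(top, main = "Most Frequent Words (toy corpus)", col = "darkgreen", las = 2)
```

The same helper could be called on blog_clean, news_clean, twt_clean and data_clean to get one plot per dataset.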

Working with n-grams

In order to work with n-grams, I used the RWeka, tidytext, and tm packages to tokenize the text and structure each dataset for analysis.
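
The tokenization code is not shown here. As a minimal sketch of one way to count 3-grams with tidytext (one of the packages named above), the example below assumes the cleaned text is available as a data frame with one line per row. The toy data frame and the column names are assumptions made for illustration.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the cleaned text, one line per row (assumption, not the real data)
toy <- data.frame(
  text = c("happy mothers day mom", "happy mothers day", "good morning"),
  stringsAsFactors = FALSE
)

trigrams <- toy %>%
  # split each line into overlapping three-word sequences
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  # lines with fewer than three words yield NA and are dropped
  filter(!is.na(trigram)) %>%
  # count each distinct trigram, most frequent first
  count(trigram, sort = TRUE)

trigrams
```

Changing n to 2 or 4 would give bigram or 4-gram counts in the same way.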

Final Notes

  1. This analysis was done only on the English files. Converting the text to ASCII stripped non-ASCII characters, which removes most of the non-English text in the corpus.
  2. Assumptions: Numbers, punctuation, profanity and stopwords will not help determine next words, so all four were removed from the text.
  3. The next step is to build a prediction algorithm based on the completed 3-gram analysis.
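
The prediction algorithm itself has not been written yet. Purely as an illustration of the idea in note 3, the sketch below looks up the most frequent third word that follows a two-word prefix. The helpers build_trigram_table and predict_next and the toy lines are hypothetical, and a real model would also need a fallback for unseen prefixes.

```r
# Hypothetical helper: count (two-word prefix, next word) pairs in text lines.
# Assumes at least one line has three or more words.
build_trigram_table <- function(lines) {
  words <- strsplit(trimws(tolower(lines)), "[[:space:]]+")
  rows <- lapply(words, function(w) {
    if (length(w) < 3) return(NULL)
    i <- seq_len(length(w) - 2)
    data.frame(prefix = paste(w[i], w[i + 1]), nxt = w[i + 2],
               stringsAsFactors = FALSE)
  })
  tri <- do.call(rbind, rows)
  tri$n <- 1L
  # sum the occurrences of each prefix and next-word pair
  aggregate(n ~ prefix + nxt, data = tri, FUN = sum)
}

# Hypothetical helper: most frequent next word for a prefix, NA if unseen
predict_next <- function(tbl, w1, w2) {
  hits <- tbl[tbl$prefix == paste(w1, w2), ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$nxt[which.max(hits$n)]
}

# Toy lines standing in for cleaned text (assumption, not the real data)
toy_lines <- c("happy mothers day mom", "happy mothers day", "happy mothers club")
tbl <- build_trigram_table(toy_lines)
predict_next(tbl, "happy", "mothers")
```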