I began with the files downloaded directly from the provided URL (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and extracted into the working directory.
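The download and extraction step can also be scripted. Here is a minimal sketch (the destination file name Coursera-SwiftKey.zip is my choice, not part of the original analysis):
#download and unzip the dataset if it is not already in the working directory
if(!file.exists("final/en_US/en_US.blogs.txt")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile="Coursera-SwiftKey.zip", mode="wb")
  unzip("Coursera-SwiftKey.zip")
}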
#read the files (skipNul avoids warnings from embedded nul characters in the twitter file)
blogs<-readLines("final/en_US/en_US.blogs.txt")
news<-readLines("final/en_US/en_US.news.txt")
twitter<-readLines("final/en_US/en_US.twitter.txt", skipNul=TRUE)
The three files differ substantially in both line counts and word counts, which should be kept in mind in any comparison of the sources.
#get basic data about each file and put into table
lines<-c(length(blogs),length(news),length(twitter))
size<-c(file.size("final/en_US/en_US.blogs.txt"),file.size("final/en_US/en_US.news.txt"),file.size("final/en_US/en_US.twitter.txt"))
summary<-rbind(size,lines)
rownames(summary)<-c("Size","Lines")
colnames(summary)<-c("blogs","news","twitter")
Here’s the summary and plot for the line counts to compare the files.
summary
##           blogs      news   twitter
## Size  210160014 205811889 167105338
## Lines    899288     77259   2360148
barplot(summary[2,],main="Line Count", col="darkred")
Here’s the summary and plot for the word counts to compare the files.
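The wordcount table was built before plotting; the exact counting method is not shown above, so here is a sketch of one way to produce it with base R whitespace splitting (the helper countwords is hypothetical):
#count words in each file by splitting lines on whitespace
countwords<-function(x) sum(lengths(strsplit(x, "\\s+")))
wordcount<-matrix(c(countwords(blogs),countwords(news),countwords(twitter)),
                  ncol=1, dimnames=list(c("blogs","news","twitter"),"Word Count"))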
wordcount
##         Word Count
## blogs     38154238
## news      30218125
## twitter    2693898
#plot the word count for each text type
wordcount<-t(wordcount)
barplot(wordcount,main="Word Count by Text Type", col="darkblue")
The files are extremely large, so for this exploration I take a sample of 5,000 lines from each file as a training set. I also combine the three samples into one dataset for further analysis.
#create subsets of the data by sampling 5,000 lines from each file due to time and memory limitations
set.seed(9009)
bdata<-sample(blogs,replace=FALSE,size=5000)
ndata<-sample(news,replace=FALSE,size=5000)
tdata<-sample(twitter,replace=FALSE,size=5000)
#combine datasets into one for later use
data<- c(bdata,ndata,tdata)
I retrieved a list of profanity to filter from the text and, using the tm package, created functions to remove numbers, punctuation, non-ASCII characters, extra whitespace, and stopwords.
#create profanity list
profanity<- readLines("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
#load the tm package and create a helper to replace unwanted patterns, plus a cleaning function to convert the text to lower case and remove punctuation, numbers, whitespace and profanity
library(tm)
rmJunk<- content_transformer(function(x, pattern) gsub(pattern, " ", x))
cleandata<-function(x) {
  #strip non-ASCII characters and build a corpus
  x<- iconv(x, 'utf-8', 'ascii', sub='')
  x<- VCorpus(VectorSource(x))
  #remove digits and stray characters
  x<- tm_map(x, rmJunk, "[[:digit:]]+")
  x<- tm_map(x, rmJunk, "=")
  x<- tm_map(x, rmJunk, "\\(")
  x<- tm_map(x, rmJunk, "\\)")
  #lower-case and strip punctuation, numbers and extra whitespace
  x<- tm_map(x, removePunctuation)
  x<- tm_map(x, content_transformer(tolower))
  x<- tm_map(x, removeNumbers)
  x<- tm_map(x, stripWhitespace)
  #drop English stopwords and profanity
  x<- tm_map(x, removeWords, stopwords('english'))
  x<- tm_map(x, removeWords, profanity)
  x
}
#run the cleaning function on each dataset
blog_clean<-cleandata(bdata)
twt_clean<-cleandata(tdata)
news_clean<-cleandata(ndata)
data_clean<-cleandata(data)
Next, I plot the word frequencies for each cleaned dataset.
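A sketch of one way to produce these plots with tm's TermDocumentMatrix (the helper plotFreq is my own, not from the original analysis):
#tabulate term frequencies in a cleaned corpus and plot the top 20 terms
plotFreq<-function(corpus, title) {
  tdm<-TermDocumentMatrix(corpus)
  freq<-sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  barplot(head(freq, 20), las=2, main=title, col="darkgreen")
}
plotFreq(blog_clean, "Top Words: Blogs")
plotFreq(news_clean, "Top Words: News")
plotFreq(twt_clean, "Top Words: Twitter")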
In order to work with n-grams, I used the RWeka, tidytext, and tm packages to tokenize the text and structure each dataset for analysis.
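As an illustration of that tokenization step, here is a minimal sketch of a bigram tokenizer built with RWeka and fed into tm's TermDocumentMatrix (the object names are my own):
#bigram tokenizer for use with tm's TermDocumentMatrix
library(RWeka)
BigramTokenizer<-function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
bigram_tdm<-TermDocumentMatrix(data_clean, control=list(tokenize=BigramTokenizer))
bigram_freq<-sort(rowSums(as.matrix(bigram_tdm)), decreasing=TRUE)
head(bigram_freq)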