The capstone project of the Johns Hopkins Data Science Specialization is to build a model that predicts up to 3 words of a user’s next input on a tablet device. This milestone report concludes the first phase of the project and lays out the plans for the second phase ahead of the final submission.
The following code downloads the data from the Coursera site and unzips it into the data folder.
library(RCurl)
# create the data folder if it does not already exist
if (!dir.exists("data")) dir.create("data")
# download the data in binary format
binData <- getBinaryURL("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", ssl.verifypeer = FALSE)
# write the zip file to disk
conObj <- file("data/dataset_capstone.zip", open = "wb")
writeBin(binData, conObj)
close(conObj)
# unzip the files
unzip("data/dataset_capstone.zip", exdir = "./data/")
A quick review of the file names and sizes follows. I also calculated the number of lines in each file, the length of the longest line, and the total, minimum, and maximum number of words per line; the summary is printed below.
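The code that reads the files and builds `files.summary` is not shown in this report; the sketch below illustrates one way it could be assembled. It assumes the stringi package, that the unzipped files sit under `data/final/en_US/`, and that the raw lines are kept in character vectors named `en_US.blogs.txt`, `en_US.news.txt`, and `en_US.twitter.txt`, which are also the objects indexed by the sampling code further down.
library(stringi)
# Sketch only (not the report's original code): read each corpus file into a
# character vector (one line per element) and summarize it. Paths and object
# names are assumptions.
en_US.blogs.txt   <- readLines("data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
en_US.news.txt    <- readLines("data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
en_US.twitter.txt <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
summarize.lines <- function(name, lines, path){
    words <- stri_count_words(lines)
    data.frame(Name       = name,
               Size       = round(file.info(path)$size / 1024^2),  # size in MB
               Lines      = length(lines),
               Longest.Ln = max(nchar(lines)),
               No.of.Wrds = sum(words),
               Min.Wrds   = min(words),
               Max.Wrds   = max(words))
}
files.summary <- rbind(
    summarize.lines("en_US.blogs.txt",   en_US.blogs.txt,   "data/final/en_US/en_US.blogs.txt"),
    summarize.lines("en_US.news.txt",    en_US.news.txt,    "data/final/en_US/en_US.news.txt"),
    summarize.lines("en_US.twitter.txt", en_US.twitter.txt, "data/final/en_US/en_US.twitter.txt"))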
print(files.summary)
##                Name Size (MB)   Lines Longest Ln. No. of Wrds. Min. Wrds. Max. Wrds.
## 1   en_US.blogs.txt       200  899288       40833     37345419          2       6630
## 2    en_US.news.txt       196 1010242       11384     34376579          2       1792
## 3 en_US.twitter.txt       159 2360148         140     30374815          2         47
From the summary we can see that the blog file is the largest in size at 200 MB, yet it contains the fewest lines and the longest ones. The Twitter file, in contrast, has the highest number of lines, 2.36 million, but the smallest file size and the shortest lines. All three files contain 30-37 million words. The sheer size of these corpus files poses a challenge for data cleaning and exploration.
To keep the processing time for this report reasonable, I randomly sampled roughly 1% of the lines from each file. Using the tm_map function, a preliminary clean-up was performed by removing numbers, punctuation, and extra white space, and by converting the text to lower case.
library(tm)
clean.corpus <- function(x){
    x <- tm_map(x, removeNumbers)                 # drop digits
    x <- tm_map(x, removePunctuation)             # drop punctuation
    x <- tm_map(x, content_transformer(tolower))  # convert to lower case, keeping PlainTextDocuments
    x <- tm_map(x, stripWhitespace)               # collapse extra white space
    return(x)
}
sample <- 0.01
set.seed(999)
# taking samples from blog corpus
blog <- VCorpus(VectorSource(en_US.blogs.txt[as.logical(rbinom(length(en_US.blogs.txt),1,sample))]))
# clean up the blog corpus
blog.corpus <- clean.corpus(blog)
# taking samples from twitter corpus
tweet <- VCorpus(VectorSource(en_US.twitter.txt[as.logical(rbinom(length(en_US.twitter.txt),1,sample))]))
# clean up the twitter corpus
twitter.corpus <- clean.corpus(tweet)
# taking samples from news corpus
news <- VCorpus(VectorSource(en_US.news.txt[as.logical(rbinom(length(en_US.news.txt),1,sample))]))
# clean up the news corpus
news.corpus <- clean.corpus(news)
Three document-term matrices are created, one for each sampled corpus, as follows.
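The code below relies on a convertDTM() helper that is not defined in this report; judging from its output further down, it is assumed to collapse a document-term matrix into a data frame of terms sorted by decreasing total frequency, roughly as sketched here (an assumption, not the report's original code):
# Assumed helper: turn a DocumentTermMatrix into a data frame of terms and
# their total counts, ordered from most to least frequent.
convertDTM <- function(dtm){
    freq <- colSums(as.matrix(dtm))
    d <- data.frame(term = names(freq), freq = freq,
                    row.names = NULL, stringsAsFactors = FALSE)
    d[order(d$freq, decreasing = TRUE), ]
}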
blog.dtm <- DocumentTermMatrix(blog.corpus,control=list(minWordLength=2, minDocFreq=1))
blog.dtm <- removeSparseTerms(blog.dtm, 0.9999)
blog.1gram.plot.data <- head(convertDTM(blog.dtm),100)
blog.top.counts <- data.frame(head(convertDTM(blog.dtm), 10), type = "blog")
twitter.dtm <- DocumentTermMatrix(twitter.corpus,control=list(minWordLength=2, minDocFreq=1))
twitter.dtm <- removeSparseTerms(twitter.dtm, 0.9999)
twitter.1gram.plot.data <- head(convertDTM(twitter.dtm),100)
twitter.top.counts <- data.frame(head(convertDTM(twitter.dtm), 10), type = "tweet")
news.dtm <- DocumentTermMatrix(news.corpus,control=list(minWordLength=2, minDocFreq=1))
news.dtm <- removeSparseTerms(news.dtm, 0.9999)
news.1gram.plot.data <- head(convertDTM(news.dtm),100)
news.top.counts <- data.frame(head(convertDTM(news.dtm), 10), type = "news")
The most frequently used words in each corpus are plotted below.
library(ggplot2)
freq.words <- rbind(twitter.top.counts, blog.top.counts, news.top.counts)
ggplot(data = freq.words, aes(x = reorder(term, -freq), y = freq)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    scale_y_continuous("Frequency") +
    scale_x_discrete("") +
    facet_grid(type ~ .)
We can see from the plot above that stop words such as “the”, “of”, and “that” are the most frequently used words in each corpus. This is because the stop words have not been removed at this stage. That was intentional: further tokenizing and stop-word handling are closely tied to the final prediction algorithm, so I have left that task to the second stage of the project.
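For reference, one likely option when stop-word removal is tackled in stage 2 is tm's removeWords transformation with its built-in English stop word list; the snippet below is only an illustration of that option (the helper name remove.stopwords is hypothetical), not code that was run for this report.
# Illustration only: stop-word removal is deliberately deferred to stage 2.
remove.stopwords <- function(x){
    tm_map(x, removeWords, stopwords("english"))
}
# e.g. blog.nostop.corpus <- remove.stopwords(blog.corpus)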
A similar analysis can be performed on bi- and trigrams. Again, stopwords are over-represented.
library(RWeka)
bigrammization  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigrammization <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
blog.2gram <- DocumentTermMatrix(blog.corpus, control=list(tokenize = bigrammization))
blog.2gram <- removeSparseTerms(blog.2gram, 0.999)
blog.2gram <- convertDTM(blog.2gram)
blog.2gram.plot.data <- head(blog.2gram, 100)
head(blog.2gram.plot.data)
## term freq
## 1848 of the 1852
## 1269 in the 1556
## 2916 to the 851
## 1097 i am 820
## 1894 on the 753
## 2791 to be 734
blog.3gram <- DocumentTermMatrix(blog.corpus, control=list(tokenize = trigrammization))
blog.3gram <- removeSparseTerms(blog.3gram, 0.999)
blog.3gram <- convertDTM(blog.3gram)
blog.3gram.plot.data <- head(blog.3gram, 100)
head(blog.3gram.plot.data)
## term freq
## 394 one of the 153
## 200 i do not 133
## 15 a lot of 129
## 187 i am not 90
## 212 i have been 87
## 464 the fact that 77
Using word clouds, we first look at the most frequently used words in the 1-gram data of each corpus.
library(wordcloud)
library(RColorBrewer)
wordcloud.colors <- brewer.pal(8, "Dark2")
plotFreqTerms <- function(plot.data, scale){
    # cap each term's frequency at 1.1 times the next smaller one, working from
    # the bottom up, so a few very frequent terms do not dominate the cloud
    # (assumes plot.data holds the 100 most frequent terms, as built above)
    for (i in 99:1) {
        if (plot.data$freq[i] > plot.data$freq[i+1] * 1.1) {
            plot.data$freq[i] <- plot.data$freq[i+1] * 1.1
        }
    }
    wordcloud(plot.data$term, plot.data$freq, colors = wordcloud.colors, scale = scale)
}
par(mfrow = c(1,3), mgp=c(0,0,0))
plotFreqTerms(blog.1gram.plot.data, c(3.2,0.3))
title("Blog 1gram")
plotFreqTerms(news.1gram.plot.data, c(3.3,0.3))
title("News 1gram")
plotFreqTerms(twitter.1gram.plot.data, c(3,0.3))
title("Twitter 1gram")
From the visualization above, we can see that the most frequently used words in each corpus are mostly stop words, with a few exceptions, such as “you” and “your” in the Twitter corpus, “said”, “have”, and “his” in the news corpus, and “you” and “have” in the blog corpus.
We are also interested in comparing the most frequent terms across different n-gram sizes. Below is the visualization of the blog 2-gram and 3-gram data.
par(mfrow = c(1,2))
plotFreqTerms(blog.2gram.plot.data, c(1.5,0.3))
title("blog 2gram")
plotFreqTerms(blog.3gram.plot.data, c(1.4,0.3))
title("blog 3gram")
Again, stop words and combinations of stop words are the most frequent terms in the 2-gram and 3-gram data.
The goal of stage 2 of the project is to use n-grams to build a predictive model. This includes the following tasks: