The capstone project of the Johns Hopkins Data Science Specialization is to build a model that predicts up to 3 words of a user’s next input on a tablet device. This milestone report concludes the first phase of the project and lays out the plans for the second phase ahead of the final submission.
The following code downloads the data from the Coursera site and unzips it into the data folder.
library(RCurl)
# create the data folder if it does not already exist
if (!dir.exists("data")) dir.create("data")
# download the data in binary format
binData <- getBinaryURL("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", ssl.verifypeer = FALSE)
# write the zip file to disk
conObj <- file("data/dataset_capstone.zip", open = "wb")
writeBin(binData, conObj)
close(conObj)
# unzip the files
unzip("data/dataset_capstone.zip", exdir = "./data/")
A quick review of the file names and sizes follows. I also calculated the number of lines in each file, the length of the longest line, and the total, minimum, and maximum number of words per line; the summary is printed below.
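The code that reads the files and builds `files.summary` is not shown in this report; the sketch below illustrates one way it could be assembled. It assumes the stringi package, that the unzipped files sit under `data/final/en_US/`, and that the raw lines are kept in character vectors named `en_US.blogs.txt`, `en_US.news.txt`, and `en_US.twitter.txt`, which are also the objects indexed by the sampling code further down.
library(stringi)
# Sketch only (not the report's original code): read each corpus file into a
# character vector (one line per element) and summarize it. Paths and object
# names are assumptions.
en_US.blogs.txt   <- readLines("data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
en_US.news.txt    <- readLines("data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
en_US.twitter.txt <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
summarize.lines <- function(name, lines, path){
    words <- stri_count_words(lines)
    data.frame(Name       = name,
               Size       = round(file.info(path)$size / 1024^2),  # size in MB
               Lines      = length(lines),
               Longest.Ln = max(nchar(lines)),
               No.of.Wrds = sum(words),
               Min.Wrds   = min(words),
               Max.Wrds   = max(words))
}
files.summary <- rbind(
    summarize.lines("en_US.blogs.txt",   en_US.blogs.txt,   "data/final/en_US/en_US.blogs.txt"),
    summarize.lines("en_US.news.txt",    en_US.news.txt,    "data/final/en_US/en_US.news.txt"),
    summarize.lines("en_US.twitter.txt", en_US.twitter.txt, "data/final/en_US/en_US.twitter.txt"))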
print(files.summary)
##                Name Size (MB)   Lines Longest Ln. No. of Wrds. Min. Wrds. Max. Wrds.
## 1   en_US.blogs.txt       200  899288       40833     37345419          2       6630
## 2    en_US.news.txt       196 1010242       11384     34376579          2       1792
## 3 en_US.twitter.txt       159 2360148         140     30374815          2         47
From the summary we can see that the blog file is the largest in size at 200 MB, yet it contains the fewest lines and the longest ones. The Twitter file, in contrast, has the highest number of lines, 2.36 million, but the smallest file size and the shortest lines. All three files contain 30-37 million words. The sheer size of these corpus files poses a challenge for data cleaning and exploration.
To keep the processing time for this report reasonable, I randomly sampled roughly 1% of the lines from each file. Using the tm_map function, a preliminary clean-up was performed by removing numbers, punctuation, and extra white space, and by converting the text to lower case.
library(tm)
clean.corpus <- function(x){
    x <- tm_map(x, removeNumbers)                 # drop digits
    x <- tm_map(x, removePunctuation)             # drop punctuation
    x <- tm_map(x, content_transformer(tolower))  # convert to lower case, keeping PlainTextDocuments
    x <- tm_map(x, stripWhitespace)               # collapse extra white space
    return(x)
}
sample <- 0.01
set.seed(999)
# taking samples from blog corpus
blog <- VCorpus(VectorSource(en_US.blogs.txt[as.logical(rbinom(length(en_US.blogs.txt),1,sample))]))
# clean up the blog corpus
blog.corpus <- clean.corpus(blog)
# taking samples from twitter corpus
tweet <- VCorpus(VectorSource(en_US.twitter.txt[as.logical(rbinom(length(en_US.twitter.txt),1,sample))]))
# clean up the twitter corpus
twitter.corpus <- clean.corpus(tweet)
# taking samples from news corpus
news <- VCorpus(VectorSource(en_US.news.txt[as.logical(rbinom(length(en_US.news.txt),1,sample))]))
# clean up the news corpus
news.corpus <- clean.corpus(news)
Three document-term matrices are created, one for each sampled corpus, as follows.
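The code below relies on a convertDTM() helper that is not defined in this report; judging from its output further down, it is assumed to collapse a document-term matrix into a data frame of terms sorted by decreasing total frequency, roughly as sketched here (an assumption, not the report's original code):
# Assumed helper: turn a DocumentTermMatrix into a data frame of terms and
# their total counts, ordered from most to least frequent.
convertDTM <- function(dtm){
    freq <- colSums(as.matrix(dtm))
    d <- data.frame(term = names(freq), freq = freq,
                    row.names = NULL, stringsAsFactors = FALSE)
    d[order(d$freq, decreasing = TRUE), ]
}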
blog.dtm <- DocumentTermMatrix(blog.corpus,control=list(minWordLength=2, minDocFreq=1))
blog.dtm <- removeSparseTerms(blog.dtm, 0.9999)
blog.1gram.plot.data <- head(convertDTM(blog.dtm),100)
blog.top.counts <- data.frame(head(convertDTM(blog.dtm), 10), type = "blog")
twitter.dtm <- DocumentTermMatrix(twitter.corpus,control=list(minWordLength=2, minDocFreq=1))
twitter.dtm <- removeSparseTerms(twitter.dtm, 0.9999)
twitter.1gram.plot.data <- head(convertDTM(twitter.dtm),100)
twitter.top.counts <- data.frame(head(convertDTM(twitter.dtm), 10), type = "tweet")
news.dtm <- DocumentTermMatrix(news.corpus,control=list(minWordLength=2, minDocFreq=1))
news.dtm <- removeSparseTerms(news.dtm, 0.9999)
news.1gram.plot.data <- head(convertDTM(news.dtm),100)
news.top.counts <- data.frame(head(convertDTM(news.dtm), 10), type = "news")
The most frequently used words in each corpus are plotted below.
library(ggplot2)
freq.words <- rbind(twitter.top.counts, blog.top.counts, news.top.counts)
ggplot(data = freq.words, aes(x = reorder(term, -freq), y = freq)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    scale_y_continuous("Frequency") +
    scale_x_discrete("") +
    facet_grid(type ~ .)
We can see from the plot above that stop words such as “the”, “of”, and “that” are the most frequently used words in each corpus. This is because the stop words have not been removed at this stage. That was intentional: further tokenizing and stop-word handling are closely tied to the final prediction algorithm, so I have left that task to the second stage of the project.
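For reference, one likely option when stop-word removal is tackled in stage 2 is tm's removeWords transformation with its built-in English stop word list; the snippet below is only an illustration of that option (the helper name remove.stopwords is hypothetical), not code that was run for this report.
# Illustration only: stop-word removal is deliberately deferred to stage 2.
remove.stopwords <- function(x){
    tm_map(x, removeWords, stopwords("english"))
}
# e.g. blog.nostop.corpus <- remove.stopwords(blog.corpus)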
A similar analysis can be performed on bi- and trigrams. Again, stopwords are over-represented.
library(RWeka)
bigrammization  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigrammization <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
blog.2gram <- DocumentTermMatrix(blog.corpus, control=list(tokenize = bigrammization))
blog.2gram <- removeSparseTerms(blog.2gram, 0.999)
blog.2gram <- convertDTM(blog.2gram)
blog.2gram.plot.data <- head(blog.2gram, 100)
head(blog.2gram.plot.data)
## term freq
## 1848 of the 1852
## 1269 in the 1556
## 2916 to the 851
## 1097 i am 820
## 1894 on the 753
## 2791 to be 734
blog.3gram <- DocumentTermMatrix(blog.corpus, control=list(tokenize = trigrammization))
blog.3gram <- removeSparseTerms(blog.3gram, 0.999)
blog.3gram <- convertDTM(blog.3gram)
blog.3gram.plot.data <- head(blog.3gram, 100)
head(blog.3gram.plot.data)
## term freq
## 394 one of the 153
## 200 i do not 133
## 15 a lot of 129
## 187 i am not 90
## 212 i have been 87
## 464 the fact that 77
Using word clouds, we first look at the most frequently used words in the 1-gram data of each corpus.
library(wordcloud)
library(RColorBrewer)
wordcloud.colors <- brewer.pal(8, "Dark2")
plotFreqTerms <- function(plot.data, scale){
    # cap each term's frequency at 1.1 times the next smaller one, working from
    # the bottom up, so a few very frequent terms do not dominate the cloud
    # (assumes plot.data holds the 100 most frequent terms, as built above)
    for (i in 99:1) {
        if (plot.data$freq[i] > plot.data$freq[i+1] * 1.1) {
            plot.data$freq[i] <- plot.data$freq[i+1] * 1.1
        }
    }
    wordcloud(plot.data$term, plot.data$freq, colors = wordcloud.colors, scale = scale)
}
par(mfrow = c(1,3), mgp=c(0,0,0))
plotFreqTerms(blog.1gram.plot.data, c(3.2,0.3))
title("Blog 1gram")
plotFreqTerms(news.1gram.plot.data, c(3.3,0.3))
title("News 1gram")
plotFreqTerms(twitter.1gram.plot.data, c(3,0.3))
title("Twitter 1gram")
From the visualization above, we can see that the most frequently used words in each corpus are mostly stop words, with a few exceptions, such as “you” and “your” in the Twitter corpus, “said”, “have”, and “his” in the news corpus, and “you” and “have” in the blog corpus.
We are also interested in comparing the most frequent terms across different n-gram sizes. Below is the visualization of the blog 2-gram and 3-gram data.
par(mfrow = c(1,2))
plotFreqTerms(blog.2gram.plot.data, c(1.5,0.3))
title("blog 2gram")
plotFreqTerms(blog.3gram.plot.data, c(1.4,0.3))
title("blog 3gram")
Again, stop words and combinations of stop words are the most frequent terms in the 2-gram and 3-gram data.
The goal of stage 2 of the project is to use n-grams to build a predictive model. This includes the following tasks: