This is the milestone report for the word prediction capstone project of the Johns Hopkins Data Science Specialization. The objectives of this report are to load the three given data sets, give a summary of the data, and explore the data to understand the frequency distributions of unigrams, bigrams and trigrams.

1. Data Load

First, set the working directory to the local folder where the downloaded data resides. Then create three file paths pointing to the blog, news and twitter files respectively, and read the data into memory with readLines.

setwd("C:/Users/twming/OneDrive/Documents/DataScience/datascience-swiftkeycapstone")
file_blog<-"Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
file_news<-"Coursera-SwiftKey/final/en_US/en_US.news.txt"
file_twitter<-"Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
myblog<-readLines(file_blog,skipNul = TRUE,warn=FALSE)
mynews<-readLines(file_news,skipNul = TRUE,warn=FALSE)
mytwitter<-readLines(file_twitter,skipNul = TRUE,warn=FALSE)

Check the file size of each data set: blog - 200 MB, news - 196 MB and twitter - 159 MB.

paste("blog size:",round(file.size(file_blog)/(1024*1024)),"MB")
## [1] "blog size: 200 MB"
paste("news size:",round(file.size(file_news)/(1024*1024)),"MB")
## [1] "news size: 196 MB"
paste("twitter size:",round(file.size(file_twitter)/(1024*1024)),"MB")
## [1] "twitter size: 159 MB"

Check the number of lines in each data set: blog - 899288, news - 77259 and twitter - 2360148.

paste("blog line:",length(myblog))
## [1] "blog line: 899288"
paste("news line:",length(mynews))
## [1] "news line: 77259"
paste("twitter line:",length(mytwitter))
## [1] "twitter line: 2360148"

2. Data Sampling

Since the data sets are large, we sample 500 lines from each one to cut down the exploration time. Sampling from all three sources ensures that words from blogs, news and twitter are all represented. To conserve memory, we remove the original data from memory after sampling.

set.seed(120987)

samplesize_blog=500
samplesize_news=500
samplesize_twitter=500

sampleblog <- myblog[sample(1:length(myblog),samplesize_blog)]
samplenews <- mynews[sample(1:length(mynews),samplesize_news)]
sampletwitter <- mytwitter[sample(1:length(mytwitter),samplesize_twitter)]

sampledata<-c(sampleblog,samplenews,sampletwitter) # combine the three samples into one character vector
rm(myblog,mynews,mytwitter)
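
Optionally, the combined sample could be written to disk so that later runs can reuse it without reloading the full files; this step and the file name below are just an illustration, not part of the original workflow.

writeLines(sampledata,"en_US.sample.txt") # save the combined sample for reuse (example file name)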

3. Data Cleaning

We perform a few known cleaning steps on the data:

  1. Convert all words to lower case, so that the same word is not counted twice due to case differences.
  2. Remove all punctuation, so that the word counts exclude punctuation marks.
  3. Remove all numbers, which are not part of our study interest.
  4. Collapse multiple whitespace characters into a single space.
  5. Replace the separators "/", "@" and "|" with spaces.

To perform this, we use the tm package in R. First, we construct the corpus with VectorSource and VCorpus, then use tm_map to clean up the data.

library(tm)
## Loading required package: NLP
mycorpus<-VCorpus(VectorSource(sampledata))
mycorpus <- tm_map(mycorpus, content_transformer(tolower)) # convert to lowercase
mycorpus <- tm_map(mycorpus, removePunctuation) # remove punctuation
mycorpus <- tm_map(mycorpus, removeNumbers) # remove numbers
mycorpus <- tm_map(mycorpus, stripWhitespace) # remove multiple whitespace
changetospace <- content_transformer(function(x, pattern) gsub(pattern, " ", x)) # helper: replace a pattern with a space
mycorpus <- tm_map(mycorpus, changetospace, "/|@|\\|") # replace "/", "@" and "|" with spaces
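
To verify the transformations, we can peek at a cleaned document directly; this quick check is a sketch and its output is not shown in the original report:

as.character(mycorpus[[1]]) # inspect the first cleaned document to confirm lowercasing and punctuation removal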

4. N-Gram Tokenization

With the help of the RWeka package, we first define functions for unigram, bigram and trigram tokenization using NGramTokenizer from that package.

In this section, we tokenize the data into unigrams, bigrams and trigrams, and hold the resulting document-term matrices in variables called one_matrix, two_matrix and three_matrix respectively.

library(RWeka)
OneGramToken <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
TwoGramToken <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
ThreeGramToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

one_matrix<-DocumentTermMatrix(mycorpus,control = list(tokenize = OneGramToken))
two_matrix<-DocumentTermMatrix(mycorpus,control = list(tokenize = TwoGramToken))
three_matrix<-DocumentTermMatrix(mycorpus,control = list(tokenize = ThreeGramToken))

After the tokenization, we sum up the counts of each term across all documents and sort them in descending order of frequency.

onefreq <- sort(colSums(as.matrix(one_matrix)), decreasing=TRUE)
onefreq_df <- data.frame(word=names(onefreq), freq=onefreq)

twofreq <- sort(colSums(as.matrix(two_matrix)), decreasing=TRUE)
twofreq_df <- data.frame(word=names(twofreq), freq=twofreq)

threefreq <- sort(colSums(as.matrix(three_matrix)), decreasing=TRUE)
threefreq_df <- data.frame(word=names(threefreq), freq=threefreq)
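
Before counting unique terms, it can help to glance at the most frequent entries in each table; head() lists them (a quick check whose output is not shown here):

head(onefreq_df,n=10) # ten most frequent unigrams
head(twofreq_df,n=10) # ten most frequent bigrams
head(threefreq_df,n=10) # ten most frequent trigrams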

The numbers of unique unigrams, bigrams and trigrams are shown below:

paste("unique unigram:",nrow(onefreq_df))
## [1] "unique unigram: 9256"
paste("unique bigram:",nrow(twofreq_df))
## [1] "unique bigram: 32281"
paste("unique trigram:",nrow(threefreq_df))
## [1] "unique trigram: 40057"

The total numbers of unigram, bigram and trigram tokens are shown below:

paste("unigram word count:",sum(onefreq_df[,2]))
## [1] "unigram word count: 35519"
paste("bigram word count:",sum(twofreq_df[,2]))
## [1] "bigram word count: 42954"
paste("trigram word count:",sum(threefreq_df[,2]))
## [1] "trigram word count: 41473"

5. Plot Top 20 Histograms

We use the ggplot2 package to plot the frequency histograms. Below are the histograms of the top 20 unigrams, bigrams and trigrams.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
top20_one <-head(onefreq_df,n=20)
top20_one <- transform(top20_one,word = reorder(word, -freq))
top20_two <-head(twofreq_df,n=20)
top20_two <- transform(top20_two,word = reorder(word, -freq))
top20_three <-head(threefreq_df,n=20)
top20_three <- transform(top20_three,word = reorder(word, -freq))

ggplot(top20_one,aes(x=word,y=freq))+geom_bar(stat="identity")+theme(axis.text.x=element_text(angle=90, hjust=1))

ggplot(top20_two,aes(x=word,y=freq))+geom_bar(stat="identity")+theme(axis.text.x=element_text(angle=90, hjust=1))

ggplot(top20_three,aes(x=word,y=freq))+geom_bar(stat="identity")+theme(axis.text.x=element_text(angle=90, hjust=1))

6. Summary

In this exploration, we find that many words that are frequent as unigrams also appear repeatedly inside frequent bigrams and trigrams. We might therefore consider a further study of bigrams and trigrams built around the words that appear frequently as unigrams. For example, “the” is a frequently used word, and we could study the bigrams and trigrams to predict the phrases that use “the”, like “one of the”, “part of the”, “in the”, etc.
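
As a sketch of that idea (illustrative code, not an analysis performed in this report), the bigram and trigram tables can be filtered for the most frequent phrases that begin with a chosen word such as “the”:

head(twofreq_df[grepl("^the ",twofreq_df$word),],n=10) # most frequent bigrams starting with "the"
head(threefreq_df[grepl("^the ",threefreq_df$word),],n=10) # most frequent trigrams starting with "the"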
