Data Science Capstone - Milestone Report

Overview

This is the week 2 Milestone Report for the Coursera Data Science Capstone Project. The goal of this report is to show that I have become familiar with the data and am ready to create my own prediction algorithm. The training data (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) consist of three sets, Twitter, news, and blogs, in multiple languages. I will use the English-language data only.

Motivation

  1. Demonstrate that the data has been downloaded and loaded successfully.
  2. Create a basic report of summary statistics about the data sets.
  3. Report some interesting findings.
  4. Get feedback on plans for creating a prediction algorithm and Shiny app.

Loading Data

Load the libraries first.

library(tm)
library(wordcloud2)
library(stringi)
library(RWeka)
library(ggplot2)
library(DT)
library(plotly)

Load the three English files (news, Twitter, and blogs) using UTF-8 encoding. Only the English corpora are loaded.

blogs = readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = 'UTF-8', warn = FALSE)
twitter = readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = 'UTF-8', warn = FALSE)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = 'UTF-8', warn = FALSE)

For a basic statistical analysis, the table below summarizes the size characteristics of the three English training data sets.

DataStats <- rbind(stri_stats_general(news), stri_stats_general(blogs), stri_stats_general(twitter))
DataStats <- as.data.frame(DataStats)
row.names(DataStats) <- c("news", "blogs", "twitter")
datatable(DataStats)

As the table shows, the Twitter file has the most lines, but the blogs file has the most non-whitespace characters.
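
To make the columns of that table concrete, the sketch below reproduces the four counts reported by stri_stats_general (lines, non-empty lines, characters, non-whitespace characters) with base R on a small made-up vector. The vector toy and the helper size_stats are hypothetical names used only for illustration, and the base-R counts are meant to approximate the stringi output for simple ASCII input, not to replace it.

```r
# Hypothetical toy input standing in for one of the text files.
toy <- c("Hello world", "", "  ", "R is fun!")

# Base-R approximation of the four counts returned by stri_stats_general().
size_stats <- function(x) {
  c(Lines       = length(x),                       # number of lines
    LinesNEmpty = sum(grepl("\\S", x)),            # lines with any non-whitespace character
    Chars       = sum(nchar(x)),                   # all characters
    CharsNWhite = sum(nchar(gsub("\\s", "", x))))  # characters that are not whitespace
}

size_stats(toy)
```

On the toy vector, two of the four lines count as non-empty, because the line holding only spaces contains no non-whitespace character.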

Because the training data sets are large, this report uses only the first 50,000 lines of each file.

blogs = readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = 'UTF-8', warn = FALSE, n = 50000)
twitter = readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = 'UTF-8', warn = FALSE, n = 50000)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = 'UTF-8', warn = FALSE, n = 50000)

Creating the Corpus and cleaning the data

Remove all characters other than letters, spaces, and apostrophes, then convert the text to lower case.

news = gsub("[^a-zA-Z ']", "", news, perl = TRUE)
news = tolower(news)

blogs = gsub("[^a-zA-Z ']", "", blogs, perl = TRUE)
blogs = tolower(blogs)

twitter = gsub("[^a-zA-Z ']", "", twitter, perl = TRUE)
twitter = tolower(twitter)
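
As a quick check of what this cleaning step does, the sketch below applies the same gsub pattern and tolower call to a single made-up line. The string raw is hypothetical and not taken from the data sets.

```r
# Hypothetical sample line; not taken from the data set.
raw <- "Hello, World! It's 2 o'clock."

# Keep only ASCII letters, spaces, and apostrophes, then lower-case the result.
clean <- tolower(gsub("[^a-zA-Z ']", "", raw, perl = TRUE))
clean
# [1] "hello world it's  o'clock"
```

Removing the digit leaves two consecutive spaces, which the later stripWhitespace step collapses. The pattern keeps only the ASCII apostrophe, so a typographic apostrophe (’) would be removed along with the other punctuation.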

Convert the data to corpora.

news_corpus <- Corpus(VectorSource(news))
blogs_corpus <- Corpus(VectorSource(blogs))
twitter_corpus <- Corpus(VectorSource(twitter))
rm(news,blogs, twitter)

Clean the corpora further: remove the stop words, strip extra white space, and stem the words.

news_corpus <- tm_map(news_corpus, removeWords, stopwords("english"))
news_corpus <- tm_map(news_corpus, stripWhitespace)
news_corpus <- tm_map(news_corpus, stemDocument)

blogs_corpus <- tm_map(blogs_corpus, removeWords, stopwords("english"))
blogs_corpus <- tm_map(blogs_corpus, stripWhitespace)
blogs_corpus <- tm_map(blogs_corpus, stemDocument)

twitter_corpus <- tm_map(twitter_corpus, removeWords, stopwords("english"))
twitter_corpus <- tm_map(twitter_corpus, stripWhitespace)
twitter_corpus <- tm_map(twitter_corpus, stemDocument)

Based on the cleaned corpora, we can create a term-document matrix (terms as rows, documents as columns) for each data set.

news.dtm <- TermDocumentMatrix(news_corpus)
blogs.dtm <- TermDocumentMatrix(blogs_corpus)
twitter.dtm <- TermDocumentMatrix(twitter_corpus)

After creating a term-document matrix for each file, I remove the terms with high sparsity (>99.5%), that is, terms that appear in very few documents. This reduces the vocabulary size of each data set.

news.dtms = removeSparseTerms(news.dtm, 0.995)
blogs.dtms = removeSparseTerms(blogs.dtm, 0.995)
twitter.dtms = removeSparseTerms(twitter.dtm, 0.995)

vocab.stat = data.frame(c("news", "blogs", "twitter"), c(news.dtm$nrow, blogs.dtm$nrow, twitter.dtm$nrow), c(news.dtms$nrow, blogs.dtms$nrow, twitter.dtms$nrow))
names(vocab.stat) = c("Data Sets", "Terms", "Non-Sparse Terms")
datatable(vocab.stat)
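
As a rough illustration of what the 0.995 threshold does, the sketch below applies the same kind of rule to a small hand-made term-document matrix in base R: a term is kept only if it appears in more than ncol * (1 - sparse) documents. With 50,000 documents and sparse = 0.995, that means more than 250 documents. The matrix toy_tdm and the helper drop_sparse are hypothetical, and the rule is my reading of how removeSparseTerms behaves, not its actual code.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents.
toy_tdm <- matrix(c(1, 0, 0, 0,    # "rare"   appears in 1 of 4 documents
                    2, 1, 0, 1,    # "common" appears in 3 of 4 documents
                    0, 1, 1, 0),   # "some"   appears in 2 of 4 documents
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("rare", "common", "some"),
                                  paste0("doc", 1:4)))

# Keep only the terms whose document count exceeds ncol * (1 - sparse).
drop_sparse <- function(m, sparse) {
  doc_count <- rowSums(m > 0)                      # documents containing each term
  m[doc_count > ncol(m) * (1 - sparse), , drop = FALSE]
}

rownames(drop_sparse(toy_tdm, 0.6))
```

With four documents and sparse = 0.6, a term must appear in more than 1.6 documents, so "rare" is dropped while "common" and "some" are kept.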

As the vocabulary table shows, removing the sparse terms leaves a much smaller and more meaningful set of terms for building a model. Next, we compute the frequency of the terms in each data set.

news.m = as.matrix(news.dtms)
blogs.m = as.matrix(blogs.dtms)
twitter.m = as.matrix(twitter.dtms)

news.v = sort(rowSums(news.m), decreasing = TRUE)
blogs.v = sort(rowSums(blogs.m), decreasing = TRUE)
twitter.v = sort(rowSums(twitter.m), decreasing = TRUE)

news.freq = data.frame(word = names(news.v), freq=news.v)
blogs.freq = data.frame(word = names(blogs.v), freq=blogs.v)
twitter.freq = data.frame(word = names(twitter.v), freq=twitter.v)

Plot a word cloud of the most frequent words (top 200) in each data set.

wordcloud2(news.freq[c(1:200),],size=0.5, shape="circle")
wordcloud2(blogs.freq[c(1:200),],size=0.5, shape="circle")
wordcloud2(twitter.freq[c(1:200),],size=0.5, shape="circle")

Plot the 10 most frequent terms in each data set as bar charts.

news.top = news.freq[1:10,]
blogs.top = blogs.freq[1:10,]
twitter.top = twitter.freq[1:10,]

news.top$word = as.factor(news.top$word)
blogs.top$word = as.factor(blogs.top$word)
twitter.top$word = as.factor(twitter.top$word)
p.news = ggplot(news.top, aes(word, freq, fill=word)) + geom_bar(stat="identity")
ggplotly(p.news)
p.blogs = ggplot(blogs.top, aes(word, freq, fill=word)) + geom_bar(stat="identity")
ggplotly(p.blogs)
p.twitter = ggplot(twitter.top, aes(word, freq, fill=word)) + geom_bar(stat="identity")
ggplotly(p.twitter)

From the plots above, we can tell that the most frequent terms in news, blogs, and Twitter are “said”, “one”, and “just”, respectively.

We can also find the associations between different terms based on the term-document matrix.

findAssocs(news.dtms, "year", 0.05) 
## $year
##    last     ago million    next   three    past     two    five    four 
##    0.25    0.22    0.11    0.10    0.10    0.09    0.09    0.08    0.08 
## percent  averag billion increas  budget earlier   first  compar   later 
##    0.08    0.07    0.07    0.07    0.06    0.06    0.06    0.05    0.05 
##     old  school   state     tax 
##    0.05    0.05    0.05    0.05
findAssocs(blogs.dtms, "know", 0.05) 
## $know
##    dont    just    want     get     let    like   peopl  realli   think 
##    0.18    0.15    0.14    0.13    0.12    0.12    0.12    0.12    0.12 
##    even    time    feel     one     say  someth   thing    will     can 
##    0.11    0.11    0.10    0.10    0.10    0.10    0.10    0.10    0.09 
##   didnt    love     now  someon    tell  happen    life    make    much 
##    0.09    0.09    0.09    0.09    0.09    0.08    0.08    0.08    0.08 
##    need   never   right    sure    take     tri     way   alway   anyon 
##    0.08    0.08    0.08    0.08    0.08    0.08    0.08    0.07    0.07 
##     ask  believ  better    come     els    ever everyon everyth  friend 
##    0.07    0.07    0.07    0.07    0.07    0.07    0.07    0.07    0.07 
##    good    look    mayb    mind   still    well   anyth    back    cant 
##    0.07    0.07    0.07    0.07    0.07    0.07    0.06    0.06    0.06 
##    care     day  doesnt  enough    find    give     god     guy    help 
##    0.06    0.06    0.06    0.06    0.06    0.06    0.06    0.06    0.06 
##     ive    mani  matter    mean    part  person     put     see   start 
##    0.06    0.06    0.06    0.06    0.06    0.06    0.06    0.06    0.06 
##    talk thought    told    work   world     yet    your    also    best 
##    0.06    0.06    0.06    0.06    0.06    0.06    0.06    0.05    0.05 
##     end    fact    girl     goe     got    hard  honest    keep    knew 
##    0.05    0.05    0.05    0.05    0.05    0.05    0.05    0.05    0.05 
##   littl   often    read  realiz  whatev    word   wrong    year 
##    0.05    0.05    0.05    0.05    0.05    0.05    0.05    0.05
findAssocs(twitter.dtms, "love", 0.05) 
## $love
## much 
## 0.07
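
The association scores above can be read as correlations. As far as I understand findAssocs, it reports the Pearson correlation between the per-document counts of the chosen term and those of every other term, keeping the terms at or above the given limit. The sketch below performs that calculation with base R's cor on a made-up matrix. The names toy_tdm and toy_assocs are hypothetical, and this is an approximation of the idea, not the tm implementation.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents.
toy_tdm <- matrix(c(1, 2, 0, 3, 0,    # "year"
                    1, 2, 0, 2, 0,    # "last"
                    0, 0, 1, 0, 2),   # "cat"
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("year", "last", "cat"),
                                  paste0("doc", 1:5)))

# Correlate one term's counts across documents with every other term's counts.
toy_assocs <- function(m, term, corlimit) {
  others <- setdiff(rownames(m), term)
  r <- sapply(others, function(w) cor(m[term, ], m[w, ]))
  r <- round(r, 2)
  sort(r[r >= corlimit], decreasing = TRUE)   # keep terms at or above the limit
}

toy_assocs(toy_tdm, "year", 0.05)
```

In this toy matrix only "last" is returned for "year", because "cat" occurs in exactly the documents where "year" does not and is therefore negatively correlated with it.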

Next Steps

Based on the cleaned corpus of each training data set, we plan to implement an n-gram model. This means combining the frequency tables obtained here with n-gram information and using the previous one, two, three, or more words to predict the next word. One of the simplest such prediction models is a back-off model, such as Katz back-off. We will pick the model with the best performance. The final prediction model will list several candidate next words, ranked by probability. We will then use this model to build an online Shiny app.
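
A minimal sketch of the back-off idea, in base R on three made-up sentences, is given below. It is a simplified back-off that falls back from trigram to bigram to unigram counts without the discounting that Katz back-off applies, so it illustrates the lookup order only and is not the planned final model. All names (toy_text, ngram_counts, predict_next) are hypothetical.

```r
# Hypothetical training text; the real model would use the cleaned corpora.
toy_text <- c("i love data science",
              "i love data analysis",
              "we love data science")

# Count all n-grams of size n, never crossing sentence boundaries.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " +"), function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  tab <- table(grams)
  sort(setNames(as.vector(tab), names(tab)), decreasing = TRUE)
}

# counts[[1]] holds unigrams, counts[[2]] bigrams, counts[[3]] trigrams.
counts <- lapply(1:3, function(n) ngram_counts(toy_text, n))

# Predict the next word from the longest matching context, backing off to shorter ones.
predict_next <- function(phrase, counts) {
  words <- strsplit(tolower(phrase), " +")[[1]]
  for (k in 2:1) {                       # k = number of context words to use
    if (length(words) < k) next
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    tab <- counts[[k + 1]]
    hits <- tab[startsWith(names(tab), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(prefix, "", best, fixed = TRUE))
    }
  }
  names(counts[[1]])[1]                  # last resort: a most frequent single word
}

predict_next("i love", counts)     # trigram context "i love" is known
predict_next("big data", counts)   # "big data" is unseen, so back off to "data"
```

The second call shows the back-off step: no trigram starts with "big data", so the function falls back to bigrams that start with "data".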