This is the Week 2 Milestone Report for the Coursera Data Science Capstone Project. The goal of this report is to show that I have become familiar with the data and am ready to build my own prediction algorithm. The training data (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) consists of three sets of text, from Twitter, news, and blogs, in multiple languages. I will use only the English data.
Load the libraries first.
library(tm)
library(wordcloud2)
library(stringi)
library(RWeka)
library(ggplot2)
library(DT)
library(plotly)
Load the three English files (news, Twitter, and blogs) using UTF-8 encoding.
blogs = readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = 'UTF-8', warn = FALSE)
twitter = readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = 'UTF-8', warn = FALSE)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = 'UTF-8', warn = FALSE)
Basic statistics: the table below summarizes the size of the three English training data sets.
DataStats <- rbind(stri_stats_general(news), stri_stats_general(blogs), stri_stats_general(twitter))
DataStats <- as.data.frame(DataStats)
row.names(DataStats) <- c("news", "blogs", "twitter")
datatable(DataStats)
As the table shows, the Twitter file has the most lines, but the blogs file has the most non-whitespace characters.
Because the training data sets are large, this report uses only the first 50,000 lines of each file.
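To make the columns of the table concrete, here is a minimal base R sketch that computes the same four quantities that `stri_stats_general` reports (Lines, LinesNEmpty, Chars, CharsNWhite) on a toy vector. The vector `toy` is invented for illustration, and the base R expressions are my approximation of what stringi counts for plain ASCII text.

```r
# Toy input standing in for a data file (hypothetical example data)
toy <- c("Hello world", "", "  R is fun  ")

stats <- c(
  Lines       = length(toy),                               # number of lines
  LinesNEmpty = sum(grepl("[^[:space:]]", toy)),           # lines with at least one non-whitespace character
  Chars       = sum(nchar(toy)),                           # all characters
  CharsNWhite = sum(nchar(gsub("[[:space:]]", "", toy)))   # characters excluding whitespace
)
stats
```

A file with many short lines (such as tweets) can therefore lead in Lines while trailing in CharsNWhite.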
blogs = readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = 'UTF-8', warn = FALSE, n = 50000)
twitter = readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = 'UTF-8', warn = FALSE, n = 50000)
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = 'UTF-8', warn = FALSE, n = 50000)
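Note that `n = 50000` reads the first 50,000 lines rather than a random subset. If the files turn out to be ordered in some way, a random sample may be more representative. The sketch below shows one way to do that; `sample_lines` is a hypothetical helper that is not part of this report, and the seed value is arbitrary.

```r
# Hypothetical helper: draw a reproducible random sample of lines
sample_lines <- function(lines, n, seed = 1234) {
  set.seed(seed)                        # fix the seed so the sample is reproducible
  n <- min(n, length(lines))            # never request more lines than exist
  lines[sample.int(length(lines), n)]   # sample without replacement
}

# Toy input standing in for a full data file
toy <- paste("line", 1:100)
toy_sample <- sample_lines(toy, 10)
length(toy_sample)
```

With the real data, this would be applied to the full vector returned by `readLines` (without the `n` argument), at the cost of reading each whole file into memory first.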
Remove all characters except letters, spaces, and apostrophes, then convert the text to lower case.
news = gsub("[^a-zA-Z ']", "", news, perl = TRUE)
news = tolower(news)
blogs = gsub("[^a-zA-Z ']", "", blogs, perl = TRUE)
blogs = tolower(blogs)
twitter = gsub("[^a-zA-Z ']", "", twitter, perl = TRUE)
twitter = tolower(twitter)
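To see what this cleaning step does to a line, the same two calls can be wrapped in a small function and applied to sample text. `clean_text` and the example strings are hypothetical and only serve as an illustration.

```r
# Hypothetical helper wrapping the two cleaning calls used above
clean_text <- function(x) {
  x <- gsub("[^a-zA-Z ']", "", x, perl = TRUE)  # keep only letters, spaces, and apostrophes
  tolower(x)                                    # convert to lower case
}

example <- c("Hello, World! It's 2016.", "R&D costs $5")
cleaned <- clean_text(example)
cleaned
# "hello world it's " "rd costs "
```

Digits and punctuation are deleted rather than replaced by a space, so tokens such as "R&D" collapse into "rd", and trailing spaces remain until `stripWhitespace` is applied later.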
Convert the data to corpora.
news_corpus <- Corpus(VectorSource(news))
blogs_corpus <- Corpus(VectorSource(blogs))
twitter_corpus <- Corpus(VectorSource(twitter))
rm(news,blogs, twitter)
Clean the corpora further: remove English stop words, strip extra white space, and stem the words.
news_corpus <- tm_map(news_corpus, removeWords, stopwords("english"))
news_corpus <- tm_map(news_corpus, stripWhitespace)
news_corpus <- tm_map(news_corpus, stemDocument)
blogs_corpus <- tm_map(blogs_corpus, removeWords, stopwords("english"))
blogs_corpus <- tm_map(blogs_corpus, stripWhitespace)
blogs_corpus <- tm_map(blogs_corpus, stemDocument)
twitter_corpus <- tm_map(twitter_corpus, removeWords, stopwords("english"))
twitter_corpus <- tm_map(twitter_corpus, stripWhitespace)
twitter_corpus <- tm_map(twitter_corpus, stemDocument)
Based on the cleaned corpora, we can create a term-document matrix (terms as rows, documents as columns) for each data set.
news.dtm <- TermDocumentMatrix(news_corpus)
blogs.dtm <- TermDocumentMatrix(blogs_corpus)
twitter.dtm <- TermDocumentMatrix(twitter_corpus)
After creating a term-document matrix for each file, I removed the terms with high sparsity (>99.5%), that is, terms that are absent from almost all lines. This reduces the vocabulary size of each data set.
news.dtms = removeSparseTerms(news.dtm, 0.995)
blogs.dtms = removeSparseTerms(blogs.dtm, 0.995)
twitter.dtms = removeSparseTerms(twitter.dtm, 0.995)
vocab.stat = data.frame(c("news", "blogs", "twitter"), c(news.dtm$nrow, blogs.dtm$nrow, twitter.dtm$nrow), c(news.dtms$nrow, blogs.dtms$nrow, twitter.dtms$nrow))
names(vocab.stat) = c("Data Sets", "Terms", "Non-Sparse Terms")
datatable(vocab.stat)
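The sparsity threshold can be read as a minimum document count. As I understand the tm documentation, `removeSparseTerms` keeps only terms whose share of empty documents is below the threshold. The base R sketch below imitates that rule on a toy matrix; the matrix and the 0.5 threshold are invented for illustration, and the code is not the tm implementation.

```r
# Toy term-document matrix: rows are terms, columns are documents
tdm <- matrix(c(1, 1, 1, 0,    # "data": occurs in 3 of 4 documents
                0, 1, 0, 0,    # "rare": occurs in 1 of 4 documents
                2, 0, 1, 1),   # "word": occurs in 3 of 4 documents
              nrow = 3, byrow = TRUE,
              dimnames = list(c("data", "rare", "word"), paste0("doc", 1:4)))

# Sparsity of a term = share of documents in which it does not occur
sparsity <- rowMeans(tdm == 0)

# Keep the terms whose sparsity is below the (toy) threshold
threshold <- 0.5
kept <- names(sparsity)[sparsity < threshold]
kept

# For this report: 50,000 lines per file and a threshold of 0.995
min_docs <- 50000 * (1 - 0.995)   # about 250 documents
min_docs
```

Under this reading, a term survives in this report only if it appears in more than about 250 of the 50,000 sampled lines.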
As the table above shows, removing the sparse terms leaves a much smaller and more meaningful vocabulary for building the model. Next, we compute the frequency of each term in each data set.
news.m = as.matrix(news.dtms)
blogs.m = as.matrix(blogs.dtms)
twitter.m = as.matrix(twitter.dtms)
news.v = sort(rowSums(news.m), decreasing = TRUE)
blogs.v = sort(rowSums(blogs.m), decreasing = TRUE)
twitter.v = sort(rowSums(twitter.m), decreasing = TRUE)
news.freq = data.frame(word = names(news.v), freq=news.v)
blogs.freq = data.frame(word = names(blogs.v), freq=blogs.v)
twitter.freq = data.frame(word = names(twitter.v), freq=twitter.v)
Plot a word cloud of the most frequent words (top 200) in each data set, followed by a bar chart of the top 10.
wordcloud2(news.freq[c(1:200),],size=0.5, shape="circle")
wordcloud2(blogs.freq[c(1:200),],size=0.5, shape="circle")
wordcloud2(twitter.freq[c(1:200),],size=0.5, shape="circle")
news.top = news.freq[1:10,]
blogs.top = blogs.freq[1:10,]
twitter.top = twitter.freq[1:10,]
news.top$word = as.factor(news.top$word)
blogs.top$word = as.factor(blogs.top$word)
twitter.top$word = as.factor(twitter.top$word)
p.news = ggplot(news.top, aes(word, freq, fill=word)) + geom_bar(stat="identity")
ggplotly(p.news)
p.blogs = ggplot(blogs.top, aes(word, freq, fill=word)) + geom_bar(stat="identity")
ggplotly(p.blogs)
p.twitter = ggplot(twitter.top, aes(word, freq, fill=word)) + geom_bar(stat="identity")
ggplotly(p.twitter)
From the plots above, we can see that the most frequent terms in news, blogs, and Twitter are “said”, “one”, and “just”, respectively.
We can also find associations between terms based on the term-document matrix.
findAssocs(news.dtms, "year", 0.05)
## $year
## last ago million next three past two five four
## 0.25 0.22 0.11 0.10 0.10 0.09 0.09 0.08 0.08
## percent averag billion increas budget earlier first compar later
## 0.08 0.07 0.07 0.07 0.06 0.06 0.06 0.05 0.05
## old school state tax
## 0.05 0.05 0.05 0.05
findAssocs(blogs.dtms, "know", 0.05)
## $know
## dont just want get let like peopl realli think
## 0.18 0.15 0.14 0.13 0.12 0.12 0.12 0.12 0.12
## even time feel one say someth thing will can
## 0.11 0.11 0.10 0.10 0.10 0.10 0.10 0.10 0.09
## didnt love now someon tell happen life make much
## 0.09 0.09 0.09 0.09 0.09 0.08 0.08 0.08 0.08
## need never right sure take tri way alway anyon
## 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.07 0.07
## ask believ better come els ever everyon everyth friend
## 0.07 0.07 0.07 0.07 0.07 0.07 0.07 0.07 0.07
## good look mayb mind still well anyth back cant
## 0.07 0.07 0.07 0.07 0.07 0.07 0.06 0.06 0.06
## care day doesnt enough find give god guy help
## 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
## ive mani matter mean part person put see start
## 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
## talk thought told work world yet your also best
## 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.05 0.05
## end fact girl goe got hard honest keep knew
## 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
## littl often read realiz whatev word wrong year
## 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
findAssocs(twitter.dtms, "love", 0.05)
## $love
## much
## 0.07
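The association scores above are correlations between term counts across documents, filtered by the lower limit passed as the third argument (0.05 here). A minimal base R sketch of the same idea on a toy matrix follows; the matrix is invented, and the code illustrates the concept rather than reproducing tm's internals.

```r
# Toy term-document matrix (rows = terms, columns = documents)
tdm <- matrix(c(1, 0, 2, 0, 1,
                1, 0, 1, 0, 1,
                0, 1, 0, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("year", "last", "tax"), paste0("doc", 1:5)))

# Correlate the counts of "year" with the counts of every other term
others <- setdiff(rownames(tdm), "year")
assoc <- sapply(others, function(term) cor(tdm[term, ], tdm["year", ]))

# Keep only associations at or above a lower limit, rounded as in the output above
corlimit <- 0.05
round(assoc[assoc >= corlimit], 2)
```

In this toy example "last" tends to occur in the same documents as "year" and is kept, while "tax" occurs only where "year" is absent, so its correlation is negative and it is dropped.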
Based on the cleaned corpus of each training data set, we plan to implement an n-gram model: we will combine the frequency tables obtained here with n-gram counts and use the previous one, two, three, or more words to predict the next word. A simple way to combine n-grams of different orders is a back-off model, such as Katz back-off. We will pick the model with the best performance. The final prediction model will list the most likely next words, ranked by probability. We will then use this model to build an online Shiny app.
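As a rough illustration of the back-off idea, the base R sketch below builds unigram and bigram counts from a few toy sentences and predicts the next word from bigrams, falling back to the most frequent unigrams when the previous word was never seen. This is a simplified count-based back-off without discounting, not Katz back-off; the sentences and the helper `predict_next` are hypothetical.

```r
# Toy training text; the real model would use the cleaned corpora
sentences <- c("i love data science",
               "i love r",
               "data science is fun",
               "i like data")
tokens <- strsplit(sentences, " ", fixed = TRUE)

# Unigram counts
unigrams <- table(unlist(tokens))

# Bigram counts: pair each word with its successor within a sentence
bigram_list <- lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
})
bigrams <- table(unlist(bigram_list))

# Hypothetical helper: suggest up to k next words, backing off to unigrams
predict_next <- function(prev, k = 3) {
  hits <- bigrams[startsWith(names(bigrams), paste0(prev, " "))]
  if (length(hits) > 0) {
    top <- sort(hits, decreasing = TRUE)
    return(head(sub("^[^ ]+ ", "", names(top)), k))   # drop the first word of each bigram
  }
  # Back-off: the previous word was never seen, so use the most frequent words
  head(names(sort(unigrams, decreasing = TRUE)), k)
}

predict_next("i")       # "love" "like"
predict_next("zebra")   # falls back to the most frequent unigrams
```

A full model would extend this to trigrams and higher orders and replace raw counts with discounted probabilities.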