This milestone report presents an exploratory data analysis, along with a modeling and app concept, for the SwiftKey Capstone Project, part of Coursera's Data Science Specialization. Using the Coursera SwiftKey data, we analyze uni-, bi- and tri-gram counts and line counts, then plot histograms and word clouds based on the data. The report closes with a preliminary concept of how these data could be used to build a recommender system.
We download the data and save it in RDS format for better performance. We clean the data, for example by removing stop words and whitespace, then tokenize the text into uni-, bi- and tri-grams for the Twitter, blog and news datasets. We then summarize each of them.
library(tm) #text mining
library(SnowballC) #stemming
library(wordcloud) #word cloud
library(ggplot2) #plotting
library(cluster) #clustering
library(RWeka) #ngrams
library(qdap) #mgsub
if (!file.exists('Coursera-SwiftKey.zip')){
  download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
                destfile = 'Coursera-SwiftKey.zip')
  unzip('Coursera-SwiftKey.zip')
}
RDS has better retrieval performance than a typical text file. We load the three files into the twitter, blog and news character vectors, then sample 10,000 documents (lines) from each to perform the analysis.
if (!file.exists('blogs.RDS')){
  con_twitter <- file("final/en_US/en_US.twitter.txt", "r")
  con_blog <- file("final/en_US/en_US.blogs.txt", "r")
  con_news <- file("final/en_US/en_US.news.txt", "r")
  twitter <- readLines(con_twitter, skipNul = TRUE)
  blog <- readLines(con_blog, skipNul = TRUE)
  news <- readLines(con_news, skipNul = TRUE)
  #Close the connections after reading
  close(con_twitter); close(con_blog); close(con_news)
  saveRDS(twitter, 'twitter.RDS')
  saveRDS(blog, 'blogs.RDS')
  saveRDS(news, 'news.RDS')
}
#Read from RDS for better performance
twitter_ori <- readRDS('twitter.RDS')
blog_ori <- readRDS('blogs.RDS')
news_ori <- readRDS('news.RDS')
#Create a sample of 10,000 lines from each
set.seed(1412)
twitter <- sample(twitter_ori, 10000)
blog <- sample(blog_ori, 10000)
news <- sample(news_ori, 10000)
Using each data frame, we create a corpus in which each document is one line of text. We then remove profanity, punctuation, white space, numbers and English stop words, and finally perform stemming.
#Transform to lowercase; remove punctuation, white space, numbers and stop words; then stem
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
#Download profanity list
if (!file.exists('bad-words.txt'))
  download.file(url='http://www.cs.cmu.edu/~biglou/resources/bad-words.txt', destfile='bad-words.txt')
profane <- readLines('bad-words.txt')
#The first line of the file is blank, so drop it
profane <- profane[-1]
#Corpus creating function
createCorp <- function (df) {
  #Remove profanity before building the corpus
  df <- sapply(df, FUN=function(x) mgsub(pattern=profane, replacement='', x))
  corp <- Corpus(DataframeSource(df))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, removeWords, stopwords('english'))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, stemDocument)
  return(corp)
}
#Create corpora from data frames
df_twitter <- data.frame(text=twitter, stringsAsFactors = FALSE)
corp_twitter <- createCorp(df_twitter)
df_blog <- data.frame(text=blog, stringsAsFactors = FALSE)
corp_blog <- createCorp(df_blog)
df_news <- data.frame(text=news, stringsAsFactors = FALSE)
corp_news <- createCorp(df_news)
We create uni-, bi- and tri-grams for each corpus. The uni-gram document-term matrix is limited to terms that appear in 10 or more documents and are at most 20 characters long. This excludes terms that are rare or come from non-English languages.
#Function to make document-term matrices for uni-, bi- and tri-grams
#Limit the uni-gram matrix to 1-20 character terms that appear in at least 10 documents
ngrammer <- function(corp){
  dtm_uni <- DocumentTermMatrix(corp, control=list(wordLengths=c(1, 20),
                                                   bounds = list(global = c(10, Inf))))
  #Set the default number of threads to use; prevents compatibility problems on macOS
  options(mc.cores=1)
  #Bi- and tri-gram tokenizers
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
  dtm_bi <- DocumentTermMatrix(corp, control = list(tokenize = BigramTokenizer))
  dtm_tri <- DocumentTermMatrix(corp, control = list(tokenize = TrigramTokenizer))
  return(list(dtm_uni, dtm_bi, dtm_tri))
}
twitter_ngram <- ngrammer(corp_twitter)
blog_ngram <- ngrammer(corp_blog)
news_ngram <- ngrammer(corp_news)
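As a quick sanity check, tm's findFreqTerms() can list frequent terms straight from a document-term matrix. A minimal sketch (the floor of 50 occurrences is an arbitrary choice for a 10,000-line sample):
#Hedged example: list bi-grams that occur at least 50 times in the Twitter sample
findFreqTerms(twitter_ngram[[2]], lowfreq = 50)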
First, we identify the number of documents (lines) in each corpus. Then, we use the following plotter function to plot the frequency and word cloud of the 100 most frequent terms in each corpus. The results show that the n-gram frequencies roughly follow a power-law distribution.
plotter <- function(dtm){
  #Get frequency in descending order
  freq <- colSums(as.matrix(dtm))
  freq <- sort(freq, decreasing=TRUE)
  #Plot the 100 most frequent terms
  df100 <- data.frame(term=names(freq), occurrences=freq)
  df100 <- df100[order(-df100$occurrences),][1:100,]
  p <- ggplot(df100, aes(x=reorder(term,-occurrences), y=occurrences))
  p <- p + geom_bar(stat='identity')
  p <- p + labs(x='Terms', y='Occurrences', title='Top 100 Most Frequent Terms')
  p <- p + theme(axis.text.x=element_text(angle=90, size=5, hjust=1))
  #Term names for the word cloud
  words <- names(freq)
  return(list(words, freq, p))
}
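To check the power-law claim directly, the tm package also ships a Zipf_plot() helper that plots log term frequency against log rank; an approximately straight line indicates Zipf-like behavior. A minimal sketch, applied to the Twitter uni-gram matrix purely as an illustration:
#Zipf's law check: log-log plot of term frequency versus rank
Zipf_plot(twitter_ngram[[1]])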
### Twitter
The number of documents, i.e. the number of lines in the original Twitter file:
length(twitter_ori)
## [1] 2360148
#### Uni-gram
u <- plotter(twitter_ngram[[1]])
#Plot
u[[3]]
#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))
#### Bi-gram
u <- plotter(twitter_ngram[[2]])
#Plot
u[[3]]
#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))
#### Tri-gram
u <- plotter(twitter_ngram[[3]])
#Plot
u[[3]]
#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))
### Blogs
The number of documents, i.e. the number of lines in the original blogs file:
length(blog_ori)
## [1] 899288
#### Uni-gram
u <- plotter(blog_ngram[[1]])
#Plot
u[[3]]
#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))
#### Bi-gram
u <- plotter(blog_ngram[[2]])
#Plot
u[[3]]
#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))
#### Tri-gram
u <- plotter(blog_ngram[[3]])
#Plot
u[[3]]
### News
The number of documents, i.e. the number of lines in the original news file:
length(news_ori)
## [1] 1010242
#### Uni-gram
u <- plotter(news_ngram[[1]])
#Plot
u[[3]]
#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))
#### Bi-gram
u <- plotter(news_ngram[[2]])
#Plot
u[[3]]
#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))
#### Tri-gram
u <- plotter(news_ngram[[3]])
#Plot
u[[3]]
We plan to adopt a Markov chain for the recommender system: the next word is predicted from the most recent words only, i.e. the memoryless (Markov) property.
For a given uni-gram (for instance, ‘Hi’), we treat it as the first term of a bi-gram and find the most likely bi-gram (for instance, if we type ‘Hi’, ‘Hi I’ is suggested).
We repeat the process for the next term using tri-gram probabilities (for instance, if we type ‘Hi I’, ‘Hi I am’ is suggested).
If no tri-gram is available for the chain, we back off to a bi-gram (for instance, if no tri-gram starts with ‘Hi I’, we find the most probable bi-gram starting with ‘I’ instead). If no bi-gram is available either, we use the uni-gram with the highest association (for instance, if no bi-gram starts with ‘I’, we use the word most strongly associated with ‘I’).
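A minimal sketch of this backoff logic follows. It assumes bi_freq and tri_freq are named frequency vectors sorted in decreasing order, e.g. sort(colSums(as.matrix(twitter_ngram[[2]])), decreasing=TRUE); the function name predictNext is our own, and for brevity the sketch skips the cleaning steps (a real app would apply the same stemming and stop-word treatment to the input as to the corpus).
#Hedged sketch of the backoff chain; bi_freq and tri_freq are assumed to be
#named n-gram frequency vectors sorted in decreasing order
predictNext <- function(input, bi_freq, tri_freq){
  words <- unlist(strsplit(tolower(input), '\\s+'))
  n <- length(words)
  #Try the tri-gram table first when at least two words of context exist
  if (n >= 2){
    prefix <- paste(words[n-1], words[n])
    hits <- tri_freq[startsWith(names(tri_freq), paste0(prefix, ' '))]
    if (length(hits) > 0) return(sub('.* ', '', names(hits)[1])) #last word of best tri-gram
  }
  #Back off to the bi-gram table keyed on the last word only
  hits <- bi_freq[startsWith(names(bi_freq), paste0(words[n], ' '))]
  if (length(hits) > 0) return(sub('.* ', '', names(hits)[1]))
  NA_character_ #no match; the app could then fall back to findAssocs() below
}
#Hypothetical example call: predictNext('Hi I', bi_freq, tri_freq)
Since only the single best suggestion is needed, raw counts suffice for ranking; converting the tables to probabilities would not change the result.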
#Find word association with a given word
findAssocs(twitter_ngram[[1]],'just',0)
## $just
## get wish take anoth best busi wonder caus d
## 0.10 0.10 0.09 0.08 0.08 0.08 0.08 0.07 0.07
## doesnt dont well know week friend lol one thing
## 0.07 0.07 0.07 0.06 0.06 0.05 0.05 0.05 0.05
## want cant end give great happen long pick right
## 0.05 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
## someth way come didnt email help hope im read
## 0.04 0.04 0.03 0.03 0.03 0.03 0.03 0.03 0.03
## readi start that time what can check make need
## 0.03 0.03 0.03 0.03 0.03 0.02 0.02 0.02 0.02
## realli tri twitter ask everi follow hous littl night
## 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01
## real think work call gonna hate home job let
## 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00
## next still sure use watch
## 0.00 0.00 0.00 0.00 0.00
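For the final fallback, the top entry of this association table could serve as the suggestion. A hedged one-liner, reusing the stemmed uni-gram ‘just’ from above (here the result would be ‘get’):
#Take the most strongly associated term as the fallback suggestion
assoc <- findAssocs(twitter_ngram[[1]], 'just', 0)$just
names(assoc)[1]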