Executive Summary

This milestone report performs an exploratory data analysis as well as modeling and app concept on the SwiftKey Capstone Project as a part of the Coursera’s Data Science Specialization. Using the Coursera SwiftKey data, we analyze uni-, bi- and tri-gram counts and line counts then plot histograms and wordclouds based on the data. It then gives a prelimanry concept on how these data could be used to build a recommender system.

Data Processing

We download and read the data into RDS format for better performance. We clean the data– such as removing stop words and whitespaces–then tokenize the words into uni-, bi- and tri-grams for twitter, blog and news dataset. We perform a summary for each of them.

Import necessary libraries

library(tm) #text mining
library(SnowballC) #stemming
library(wordcloud) #word cloud
library(ggplot2) #plotting
library(cluster) #clustering
library(RWeka) #ngrams
library(qdap) #mgsub

Download and unzip Coursera-SwiftKey.zip

if (!file.exists('Coursera-SwiftKey.zip')){
    download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',
                  destfile = 'Coursera-SwiftKey.zip')
    unzip('Coursera-SwiftKey.zip')
}

Read and save to RDS

RDS has better retrieval performance than a typical text file. We loaded the three files into twitter, blog and news data frames. We sample 10,000 documents each from the data frames to perform the analysis.

if (!file.exists('blogs.RDS')){
    con_twitter <- file("final/en_US/en_US.twitter.txt", "r")
    con_blog <- file("final/en_US/en_US.blogs.txt", "r")
    con_news <- file("final/en_US/en_US.news.txt", "r")

    twitter <-readLines(con_twitter,skipNul = TRUE)
    blog <-readLines(con_blog,skipNul = TRUE)
    news <-readLines(con_news,skipNul = TRUE)

    saveRDS(twitter,'twitter.RDS')
    saveRDS(blog,'blogs.RDS')
    saveRDS(news,'news.RDS')}

#Read from RDS for better performance
twitter_ori<-readRDS('twitter.RDS')
blog_ori<-readRDS('blogs.RDS')
news_ori<-readRDS('news.RDS')

#Create sample of 100,000 from each
set.seed(1412)
twitter <- sample(twitter_ori,10000)
blog <- sample(blog_ori,10000)
news <- sample(news_ori,10000)

The Corpora

Using each data frame, we create a corpus of words in each document (line). We then remove profanity, punctuations, white spaces, numbers and stop words in the English language from them, then perform stemming.

#Transform to lowercase; remove punctuations, white spaces, numbers, stop words; stem them
getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"
#Download profanity
if(!file.exists('bad-words.txt'))
download.file(url='http://www.cs.cmu.edu/~biglou/resources/bad-words.txt',destfile='bad-words.txt')

profane <- readLines('bad-words.txt')
profane <- profane[-1]


#Corpus creating function
createCorp <- function (df,ngram) {
    df <- sapply(df,FUN=function(x) mgsub(pattern=profane,replacement='',x))
    corp <- Corpus(DataframeSource(df))
    corp <- tm_map(corp, content_transformer(tolower))
    corp <- tm_map(corp, removePunctuation)
    corp <- tm_map(corp, stripWhitespace)
    corp <- tm_map(corp, removeWords, stopwords('english'))
    corp <-tm_map(corp, removeNumbers)
    corp <- tm_map(corp, stemDocument)
    return(corp)
}

#Create corpora from data frames

df_twitter<-data.frame(text=twitter,stringsAsFactors = FALSE)
corp_twitter <- createCorp(df_twitter)

df_blog<-data.frame(text=blog,stringsAsFactors = FALSE)
corp_blog <- createCorp(df_blog)

df_news<-data.frame(text=news,stringsAsFactors = FALSE)
corp_news <- createCorp(df_news)

The N-grams

We create uni-, bi- and tri-grams for each corpus. The resulting document term matrices are limited to only words that appear in 10 or more documents and words that are less than 20 characters long. This is to exclude words that are rare or from non-English language.

#Function to make document term matrix for uni-, bi- and tri-grams
#Limit to 1-20 character-long words and words that appear in more than 10 documents only

ngrammer <- function(corp){
    dtm_uni <-DocumentTermMatrix(corp, control=list(wordLengths=c(1, 20),
                            bounds = list(global = c(10,Inf))))
    
    #Set the default number of threads to use; prevents compatibility problem on MacOS
    options(mc.cores=1)

    #Bi- and Tri-grams
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

    dtm_bi <-DocumentTermMatrix(corp, control = list(tokenize = BigramTokenizer))
    dtm_tri <-DocumentTermMatrix(corp, control = list(tokenize = TrigramTokenizer))
    #inspect(dtm)
    
    return(list(dtm_uni,dtm_bi,dtm_tri))
}

twitter_ngram <- ngrammer(corp_twitter)
blog_ngram <- ngrammer(corp_blog)
news_ngram <- ngrammer(corp_news)

Exploratory data analysis

First, we identify the number of documents (lines) in each corpus. Then, we use the following plotter function to plot the frequency and wordcloud of top 100 most frequent words in each corpus. The results show that the n-grams (words) are distributed similar to a power law distribution.

plotter <- function(dtm){
    #Get frequency in descending order
    freq <-  colSums(as.matrix(dtm))
    freq <- sort(freq, decreasing=TRUE)
    
    #Plot top 100 most frequent words
    df100 <- data.frame(term=names(freq),occurrences=freq)
    df100 <- df100[order(-df100$occurrences),][1:100,]
    p <- ggplot(df100, aes(x=reorder(term,-occurrences), y=occurrences))
    p <- p + geom_bar(stat='identity')
    p <- p + labs(x='Terms',y='Occurrences',title='Top 10 Most Frequent Words')
    p <- p + theme(axis.text.x=element_text(angle=90, size=5,hjust=1))
    p

    #for word cloud
    words <- names(freq)
    
    return(list(words,freq,p))
}

Twitter

Number of documents

Aka the number of lines in the original file.

length(twitter_ori)
## [1] 2360148

Word Frequency

Uni-gram

u <- plotter(twitter_ngram[[1]])
#Plot
u[[3]]

#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))

Bi-gram

u <- plotter(twitter_ngram[[2]])
#Plot
u[[3]]

#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))

Tri-gram

u <- plotter(twitter_ngram[[3]])
#Plot
u[[3]]

#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))

Blogs

Number of documents

Aka the number of lines in the original file.

length(blog_ori)
## [1] 899288

Word Frequency

Uni-gram

u <- plotter(blog_ngram[[1]])
#Plot
u[[3]]

#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))

Bi-gram

u <- plotter(blog_ngram[[2]])
#Plot
u[[3]]

#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))

#### Tri-gram

u <- plotter(blog_ngram[[3]])
#Plot
u[[3]]

News

Number of documents

Aka the number of lines in the original file.

length(news_ori)
## [1] 1010242

Word Frequency

Uni-gram

u <- plotter(news_ngram[[1]])
#Plot
u[[3]]

#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))

Bi-gram

u <- plotter(news_ngram[[2]])
#Plot
u[[3]]

#Word cloud
wordcloud(u[[1]][1:100], u[[2]][1:100],colors=brewer.pal(6,'Dark2'))

Tri-gram

u <- plotter(news_ngram[[3]])
#Plot
u[[3]]

Modeling Concept: Markov Chain, Frequency and Word Association

We plan to adapt a memoryless Markov Chain for the recommender system.

  1. For a given uni-gram (for instance, ‘Hi’), we assume it as the first term of the bi-gram and find the most likely bi-gram (for instance, if we type ‘Hi’, ‘Hi I’ is suggested).

  2. Repeat the process for the next term using the probability of a tri-gram (for instance, if we type ‘Hi I’, ‘Hi I am’ is suggested).

  3. In case that there is no tri-gram available for the chain, we instead use a bi-gram (for instance, if there is no tri-gram starting with ‘Hi I’, we find a bi-gram with the highest probability starting with ‘I’ instead). And in case that there is no bi-gram available we use the uni-gram with the highest association (for instance, if there is no bi-gram starting with ‘I’, we use the word with highest association with ‘I’ instead).

#Find word association with a given word
findAssocs(twitter_ngram[[1]],'just',0)
## $just
##     get    wish    take   anoth    best    busi  wonder    caus       d 
##    0.10    0.10    0.09    0.08    0.08    0.08    0.08    0.07    0.07 
##  doesnt    dont    well    know    week  friend     lol     one   thing 
##    0.07    0.07    0.07    0.06    0.06    0.05    0.05    0.05    0.05 
##    want    cant     end    give   great  happen    long    pick   right 
##    0.05    0.04    0.04    0.04    0.04    0.04    0.04    0.04    0.04 
##  someth     way    come   didnt   email    help    hope      im    read 
##    0.04    0.04    0.03    0.03    0.03    0.03    0.03    0.03    0.03 
##   readi   start    that    time    what     can   check    make    need 
##    0.03    0.03    0.03    0.03    0.03    0.02    0.02    0.02    0.02 
##  realli     tri twitter     ask   everi  follow    hous   littl   night 
##    0.02    0.02    0.02    0.01    0.01    0.01    0.01    0.01    0.01 
##    real   think    work    call   gonna    hate    home     job     let 
##    0.01    0.01    0.01    0.00    0.00    0.00    0.00    0.00    0.00 
##    next   still    sure     use   watch 
##    0.00    0.00    0.00    0.00    0.00