Executive Summary

This milestone report for the Data Science Capstone project summarizes the data preprocessing and exploratory data analysis performed on the provided data sets. Plans for the prediction algorithm and the Shiny app are also discussed.

Summary Statistics

We load the data provided for this project, available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
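If the archive has not been downloaded yet, a minimal sketch along these lines (assuming the working directory, and that the zip extracts into ./final/) fetches and unpacks it:

# Download and unpack the corpus if the extracted folder is not already present.
# The local file name and directory check are assumptions, not part of the course setup.
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!dir.exists("./final")) {
  download.file(zip_url, zip_file, mode = "wb")
  unzip(zip_file)
}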

We then build a summary table to get a first look at the three data sets.

library(stringi)

# Files are opened in binary mode so embedded control characters do not truncate readLines().
blogp <- file("./final/en_US/en_US.blogs.txt", "rb")
blogs <- readLines(blogp, encoding="UTF-8", skipNul = TRUE)
close(blogp)

newsp <- file("./final/en_US/en_US.news.txt", "rb")
news <- readLines(newsp, encoding="UTF-8", skipNul = TRUE)
close(newsp)


twitterp <- file("./final/en_US/en_US.twitter.txt", "rb")
twitter <- readLines(twitterp, encoding="UTF-8", skipNul = TRUE)
close(twitterp)

rm(blogp, newsp, twitterp)

words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)

summary_table <- data.frame(filename = c("blogs","news","twitter"),
        num_lines = c(length(blogs),length(news),length(twitter)),
        num_words = c(sum(words_blogs),sum(words_news),sum(words_twitter)),
        mean_num_words = c(mean(words_blogs),mean(words_news),mean(words_twitter)))

summary_table
##   filename num_lines num_words mean_num_words
## 1    blogs    899288  37546246       41.75108
## 2     news   1010242  34762395       34.40997
## 3  twitter   2360148  30093410       12.75065

Data Preprocessing

We will randomly choose 1% of each data set to demonstrate data preprocessing and exploratory data analysis. The full data sets will be used later when building the prediction algorithm.

set.seed(12345)   # fix the seed so the sample is reproducible
blogsSample <- sample(blogs, length(blogs)*0.01)
newsSample <- sample(news, length(news)*0.01)
twitterSample <- sample(twitter, length(twitter)*0.01)
# Strip non-ASCII characters (e.g. emoji) from the Twitter sample.
twitterSample <- sapply(twitterSample, 
                        function(row) iconv(row, "latin1", "ASCII", sub=""))
text_sample  <- c(blogsSample,newsSample,twitterSample)
length(text_sample)
## [1] 42695
sum(stri_count_words(text_sample))
## [1] 1023080

Data Cleaning

The first task is profanity filtering, that is, removing profanity and other words we do not wish to predict. To do so, we use the list of profanity terms available at the link below; it is saved as “Terms-to-Block.csv” and used to clean the data.

http://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/

profanity_list <- read.csv("./final/Terms-to-Block.csv", header = FALSE,  stringsAsFactors=FALSE, skip = 4)
profanity <- profanity_list$V2

Tokenization

We will now break the text into appropriate tokens, removing numbers, punctuation, URLs, and Twitter handles along the way. During this process we also perform profanity filtering with the word list from the previous section.

library(tm)
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})

# Create a corpus with the data.
sample.corpus<- VCorpus(VectorSource(text_sample))
sample.corpus<- tm_map(sample.corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
sample.corpus<- tm_map(sample.corpus, toSpace, "@[^\\s]+")
sample.corpus<- tm_map(sample.corpus, removeNumbers)
sample.corpus<- tm_map(sample.corpus, removePunctuation)
# tolower must be wrapped in content_transformer() so the documents remain PlainTextDocuments.
sample.corpus<- tm_map(sample.corpus, content_transformer(tolower))
sample.corpus<- tm_map(sample.corpus, removeWords, stopwords("english"))
sample.corpus<- tm_map(sample.corpus, removeWords, profanity)
sample.corpus<- tm_map(sample.corpus, stripWhitespace)
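As a quick sanity check (not part of the original pipeline), we can print one cleaned document to confirm that numbers, punctuation, stop words, and profanity have been removed:

# Inspect the first cleaned document; purely an illustrative check.
writeLines(as.character(sample.corpus[[1]]))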

Exploratory Analysis

We will perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.

# This custom function returns a frequency table (term and count) for the input term-document matrix, sorted in decreasing order.
getFreq <- function(input_data) {
  freq <- sort(rowSums(as.matrix(input_data)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

# These tokenizer functions build 2-grams and 3-grams (ngrams() and words() come from the NLP package loaded with tm).

BigramTokenizer  <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

TrigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

# Generate the frequency table for each n-gram size.
unigram <- getFreq(removeSparseTerms(TermDocumentMatrix(sample.corpus), 0.9999))

bigram <- getFreq(removeSparseTerms(TermDocumentMatrix(sample.corpus, control = list(tokenize = BigramTokenizer)), 0.9999))

trigram <- getFreq(removeSparseTerms(TermDocumentMatrix(sample.corpus, control = list(tokenize = TrigramTokenizer)), 0.9999))

We first glance at a word cloud, then look at the top 20 most frequent unigrams in the data set.

library(wordcloud2)
uni_cloud<- wordcloud2(unigram, size = 0.5, color = "random-light")

uni_cloud
library(ggplot2)
uni_plot <-  ggplot(unigram[1:20,], aes(x=reorder(word, freq), y=freq, fill = freq)) +
             labs(x = "Word", y = "Count") +
             theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5)) +
             coord_flip() +
             geom_bar(stat = "identity") +
             ggtitle("Top 20 Most Frequent Unigrams")

uni_plot

Similarly, we take a look at the top 20 most frequent 2-grams of the dataset.

bi_plot <-  ggplot(bigram[1:20,], aes(x=reorder(word, freq), y=freq, fill=freq)) +
            labs(x = "Word", y = "Count") +
            theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5)) +
            coord_flip() +
            geom_bar(stat = "identity") +
            ggtitle("Top 20 Most Frequent 2-grams")

bi_plot

Finally, this is the result for 3-grams.

tri_plot <-  ggplot(trigram[1:20,], aes(x= reorder(word, freq), y=freq, fill=freq)) +
             labs(x = "Word", y = "Count") +
             theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5)) +
             coord_flip() +
             geom_bar(stat = "identity") +
             ggtitle("Top 20 Most Frequent 3-grams")

tri_plot

Interesting Findings

As we can see from the results, the tokenization will need to retain punctuation such as the apostrophes and hyphens in expressions like “I’ve” and “I’m”, which the current removePunctuation step strips out.
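One possible refinement for a later iteration (a sketch, not part of the current pipeline; keepIntraWordPunct is a name introduced here) is to replace removePunctuation with a transformation that removes punctuation except apostrophes and hyphens:

# Remove all punctuation except apostrophes and hyphens, so tokens like "I've" survive.
keepIntraWordPunct <- content_transformer(function(x) gsub("[^[:alnum:][:space:]'-]", " ", x))
# sample.corpus <- tm_map(sample.corpus, keepIntraWordPunct)   # would replace removePunctuation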

Next steps

Before building the first predictive text application, we need to refine the tokenization process. We may drop n-grams with very low frequency so that the lookup tables stay small and fast. The basic mechanism of the algorithm will be to look for a match in the highest-order n-gram table and back off to lower-order n-grams when no match is found. Once the application is in service, it can be further improved by collecting n-grams entered by users that are not yet in the model.
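A minimal sketch of this back-off lookup, assuming the unigram, bigram, and trigram frequency tables built above (columns word and freq, sorted by frequency), could look like the following; the real application will add frequency pruning and smoothing:

# Predict the next word: try trigrams on the last two words, back off to bigrams
# on the last word, and finally fall back to the most frequent unigram.
# Illustrative sketch only; the function name and matching strategy are assumptions.
predict_next <- function(phrase, uni = unigram, bi = bigram, tri = trigram) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(tokens)
  if (n >= 2) {
    hits <- tri[grepl(paste0("^", tokens[n - 1], " ", tokens[n], " "), tri$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  if (n >= 1) {
    hits <- bi[grepl(paste0("^", tokens[n], " "), bi$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  as.character(uni$word[1])   # last resort: the single most frequent word
}

predict_next("thanks for")   # returns the most frequent word seen after "thanks for" in the sample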