Executive Summary
This milestone report for the Data Science Capstone project summarizes the data preprocessing and exploratory data analysis performed on the provided data sets, and discusses plans for the prediction algorithm and the Shiny app.
Summary Statistics
We start with a summary table comparing the three data sets: line counts, total word counts, and mean words per line.
library(stringi)
# Read the three corpora in binary mode to avoid a premature end-of-file on
# embedded control characters; skip NUL bytes.
blogp <- file("./final/en_US/en_US.blogs.txt", "rb")
blogs <- readLines(blogp, encoding = "UTF-8", skipNul = TRUE)
close(blogp)
newsp <- file("./final/en_US/en_US.news.txt", "rb")
news <- readLines(newsp, encoding = "UTF-8", skipNul = TRUE)
close(newsp)
twitterp <- file("./final/en_US/en_US.twitter.txt", "rb")
twitter <- readLines(twitterp, encoding = "UTF-8", skipNul = TRUE)
close(twitterp)
rm(blogp, newsp, twitterp)
# Word counts per line for each corpus
words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
# Line counts, total word counts, and mean words per line
summary_table <- data.frame(filename = c("blogs", "news", "twitter"),
                            num_lines = c(length(blogs), length(news), length(twitter)),
                            num_words = c(sum(words_blogs), sum(words_news), sum(words_twitter)),
                            mean_num_words = c(mean(words_blogs), mean(words_news), mean(words_twitter)))
summary_table
## filename num_lines num_words mean_num_words
## 1 blogs 899288 37546246 41.75108
## 2 news 1010242 34762395 34.40997
## 3 twitter 2360148 30093410 12.75065
Data Preprocessing
We randomly choose 1% of each data set to demonstrate data preprocessing and exploratory data analysis. The full data sets will be used later when building the prediction algorithm.
set.seed(12345)
# Draw a 1% random sample from each corpus
blogsSample <- sample(blogs, length(blogs) * 0.01)
newsSample <- sample(news, length(news) * 0.01)
twitterSample <- sample(twitter, length(twitter) * 0.01)
# Strip non-ASCII characters (e.g. emoji) from the Twitter sample
twitterSample <- sapply(twitterSample,
                        function(row) iconv(row, "latin1", "ASCII", sub = ""))
# Combine the three samples into a single corpus
text_sample <- c(blogsSample, newsSample, twitterSample)
length(text_sample)
## [1] 42695
sum(stri_count_words(text_sample))
## [1] 1023080
Data Cleaning
The first cleaning task is profanity filtering, i.e. removing profanity and other words we do not wish to predict. To do so, we use a profanity word list available online, saved as “profanity.csv”, to clean the data.
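As a minimal sketch (the layout of “profanity.csv” and the choice to blank out the offending words rather than drop whole lines are assumptions), the filtering step could look like this:
# Assumes "profanity.csv" holds one banned word per line with no header.
# Offending words are removed from the text; an alternative is to drop
# whole lines that contain them.
profanity <- read.csv("profanity.csv", header = FALSE, stringsAsFactors = FALSE)[[1]]
pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
text_sample <- gsub(pattern, "", text_sample, ignore.case = TRUE)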
Exploratory Analysis
We first take a glance at a word cloud of the sample, then plot the top 20 most frequent unigrams.
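The plots below expect unigram, bigram and trigram frequency tables with the columns word and freq, sorted by decreasing frequency. Since that step is not shown in this report, here is a minimal sketch of how such tables could be built, assuming the tidytext and dplyr packages (an assumption; the original tokenization code may differ):
library(dplyr)
library(tidytext)
# Build an n-gram frequency table with columns `word` and `freq`,
# sorted by decreasing frequency.
count_ngrams <- function(text, n) {
  tibble(text = text) %>%
    unnest_tokens(word, text, token = "ngrams", n = n) %>%
    filter(!is.na(word)) %>%
    count(word, sort = TRUE, name = "freq")
}
unigram <- count_ngrams(text_sample, 1)
bigram <- count_ngrams(text_sample, 2)
trigram <- count_ngrams(text_sample, 3)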
library(wordcloud2)
# Word cloud of unigram frequencies
uni_cloud <- wordcloud2(unigram, size = 0.5, color = "random-light")
uni_cloud
library(ggplot2)
uni_plot <- ggplot(unigram[1:20, ], aes(x = reorder(word, freq), y = freq, fill = freq)) +
  labs(x = "Word", y = "Count") +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5)) +
  coord_flip() +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Most Frequent Unigrams")
uni_plot

Similarly, we take a look at the top 20 most frequent 2-grams of the dataset.
bi_plot <- ggplot(bigram[1:20, ], aes(x = reorder(word, freq), y = freq, fill = freq)) +
  labs(x = "Word", y = "Count") +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5)) +
  coord_flip() +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Most Frequent 2-grams")
bi_plot

Finally, we look at the top 20 most frequent 3-grams.
tri_plot <- ggplot(trigram[1:20, ], aes(x = reorder(word, freq), y = freq, fill = freq)) +
  labs(x = "Word", y = "Count") +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5)) +
  coord_flip() +
  geom_bar(stat = "identity") +
  ggtitle("Top 20 Most Frequent 3-grams")
tri_plot

Interesting Findings
As the results show, we will need to retain certain punctuation during tokenization: the apostrophes in contractions such as “I’ve” and “I’m”, and hyphens inside hyphenated words.
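One way this could be handled (a sketch only; the final cleaning rules may differ) is to strip punctuation while keeping apostrophes and hyphens that sit inside a word:
# Lower-case the text and drop punctuation, but keep apostrophes and
# hyphens inside words (e.g. "i've", "well-known").
keep_contractions <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9' -]", " ", x)               # drop other punctuation
  x <- gsub("(^|\\s)['-]+|['-]+(\\s|$)", " ", x)  # drop stray ' or - between words
  gsub("\\s+", " ", trimws(x))
}
keep_contractions("I've -- really -- enjoyed it, haven't I?")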
Next steps
Before building the first version of the predictive text application, we need to refine the tokenization process. We may also exclude low-frequency n-grams so that the lookup tables stay small and predictions remain fast. The basic mechanism of the algorithm will be to look for a match at the highest n-gram order and back off to lower-order n-grams when no match is found. Once the application is in service, it can be further improved by collecting previously unseen n-grams entered by users.
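To illustrate the back-off idea (a sketch only, not the final design; it assumes the unigram, bigram and trigram tables above, sorted by decreasing frequency), the prediction could work like this:
# Try the trigram table with the last two words typed, fall back to the
# bigram table with the last word, and finally to the most frequent unigram.
predict_next <- function(input, unigram, bigram, trigram) {
  tokens <- unlist(strsplit(tolower(trimws(input)), "\\s+"))
  n <- length(tokens)
  if (n >= 2) {
    prefix <- paste(tokens[(n - 1):n], collapse = " ")
    hits <- trigram$word[startsWith(trigram$word, paste0(prefix, " "))]
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
  }
  if (n >= 1) {
    hits <- bigram$word[startsWith(bigram$word, paste0(tokens[n], " "))]
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
  }
  unigram$word[1]
}
predict_next("thanks for the", unigram, bigram, trigram)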