This report has been created to fulfill the requirements of the exploratory analysis milestone for the Data Science Capstone offered by the Johns Hopkins University on Coursera. Here we will be analyzing the English dataset from the data provided by SwiftKey belonging to a corpus called HC Corpora.
The main objectives of this report are as follows
This section consists of the steps followed to setup our R environment, get the required data, load it and perform basic exploratory analysis.
In the following code segment, we set the required global options and load the required packages in R.
library(knitr)
opts_chunk$set(cache=TRUE,echo=TRUE)
options(width=120)
library(pander)
library(stringr)
library(stringi)
library(tm)
library(ggplot2)
library(gridExtra)
Here we set the working directory as required, it can be changed according to your preference in your personal computer.
setwd("E:/MOOCs/Coursera/Data Science - Specialization/10. Data Science Capstone/CapstoneProject")
The dataset can be downloaded using the following code segment.
download_url = 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
data_file = 'Coursera-SwiftKey.zip'
if (!file.exists(data_file)) {
cat('Downloading Dataset...\n')
download.file(download_url, destfile=data_file, method="curl")
cat('Unzipping Dataset...\n')
unzip(data_file)
}else{
cat('Dataset is already downloaded!\n')
}
## Dataset is already downloaded!
Next, we load in the english datasets using the following commands so that we can start analyzing them.
blog_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")
con <- file("Coursera-SwiftKey/final/en_US/en_US.news.txt", open="rb")
news_data <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
In this section, we will be looking at various summary statistics for the overall english datasets provided to get a feel of the data including basic counts and frequencies.
The following code segment computes basic summary statistics for the three datasets. We using the wc
system command here which is used for word, line and character counts using the system
function call. Do note that I have tested this on Windows and it works fine, hopefully should work on other Operating Systems too.
# Dataset names
dataset_names <- c("Blogs", "News", "Twitter")
# Summary Stats function
get_summary_stats <- function(file_path){
dataset_size <- round(file.info(file_path)$size/1024/1024, digits = 1) # file size in MB
lwc <- unlist(strsplit(system(paste("wc ",file_path), intern=TRUE), " "))
lwc <- lwc[lwc != ""]
lines <- as.numeric(lwc[1])
words <- as.numeric(lwc[2])
chars <- as.numeric(lwc[3])
avg_words_per_line <- round(words / lines, 1)
avg_chars_per_line <- round(chars / lines, 1)
summary_stats <- list(size=dataset_size, lines=lines, words=words,
chars=chars, avg_wpl=avg_words_per_line,
avg_cpl=avg_chars_per_line)
return (summary_stats)
}
# Get Summary Statistics
blog_data_path <- "./Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
blog_stats <- get_summary_stats(blog_data_path)
news_data_path <- "./Coursera-SwiftKey/final/en_US/en_US.news.txt"
news_stats <- get_summary_stats(news_data_path)
twitter_data_path <- "./Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
twitter_stats <- get_summary_stats(twitter_data_path)
# Summary Stats Vectors
sizes <- c(blog_stats$size, news_stats$size, twitter_stats$size)
line_counts <- c(blog_stats$lines, news_stats$lines, twitter_stats$lines)
word_counts <- c(blog_stats$words, news_stats$words, twitter_stats$words)
char_counts <- c(blog_stats$chars, news_stats$chars, twitter_stats$chars)
avg_wpl_counts <- c(blog_stats$avg_wpl, news_stats$avg_wpl, twitter_stats$avg_wpl)
avg_cpl_counts <- c(blog_stats$avg_cpl, news_stats$avg_cpl, twitter_stats$avg_cpl)
# Combining stats vectors into a Summary Stats Table
summary_stats_table <- data.frame(
cbind(dataset_names, sizes, line_counts, word_counts,
char_counts, avg_wpl_counts, avg_cpl_counts)
)
colnames(summary_stats_table) <- c("File Name", "Size (in MB)", "Lines", "Words",
"Characters", "Avg. Words/Line", "Avg. Chars/Line")
pandoc.table(summary_stats_table, style = "simple", justify = 'left',
caption = "Datasets Summary Statistics", split.table = Inf)
File Name Size (in MB) Lines Words Characters Avg. Words/Line Avg. Chars/Line
----------- -------------- ------- -------- ------------ ----------------- -----------------
Blogs 200.4 899288 37334114 210160014 41.5 233.7
News 196.3 1010242 34365936 205811889 34 203.7
Twitter 159.4 2360148 30359852 167105338 12.9 70.8
Table: Datasets Summary Statistics
From the above table, it is quite expected that the Blogs
dataset has the highest count in all the fields and Twitter
has the lowest count because each tweet as you already know is always limited to 140
characters. We will now visualize the same in the following section.
The following code segment plots histograms of word count distributions across different articles or tweets in the different datasets.
blog_wordcounts <- nchar(blog_data)
news_wordcounts <- nchar(news_data)
twitter_wordcounts <- nchar(twitter_data)
par(mfrow = c(1, 3))
hist(blog_wordcounts[blog_wordcounts<=1000], xlab = "Word Count", ylab = "Frequency",
main = "Blog Word Counts", col="lightblue", breaks=15)
hist(news_wordcounts[news_wordcounts<=1000], xlab = "Word Count", ylab = "Frequency",
main = "News Word Counts", col="lightblue", breaks=15)
hist(twitter_wordcounts, xlab = "Word Count", ylab = "Frequency", main = "Twitter Word Counts", col="lightblue", breaks=15)
The histograms above depict the distribution of wordcounts for each dataset. The blog and news datasets are having a high right skew hence to compare across datasets we take the major chunk of the dataset distribution showing articles falling within 1000 words. From the plots above, it is quite clear that twitter data will always be within
140
characters and is bimodal compared to the unimodal distributions of the other two datasets.
Now, as we know for text prediction it is always useful to know specific words or rather sequence of words which occur a lot in common. n-grams
is specifically a technique to achieve that and in the following section, we will be generating top unigrams
, bigrams
and trigrams
from a sample subset of our datasets since in the above section you may have noticed the datasets were quite big and for quick predictions, we need to sample the dataset.
I created the following utility functions to process the text data in the datasets to clean text documents and also remove profanity using a pre-defined dictionary which I obtained online.
badwords = readLines('badwords.txt')
badwords = badwords[order(nchar(badwords),decreasing = TRUE)]
clean_text <- function(text)
{
# drop all non unicode characters
text <- iconv(text, from = "latin1", to = "UTF-8", sub="")
text <- stri_replace_all_regex(text, "\u2019|`","'")
text <- stri_replace_all_regex(text, "\u201c|\u201d|u201f|``",'"')
text <- iconv(text, from = "latin1", to = "ASCII", sub="")
# tolower
text <- tolower(text)
# tabs and whitespace removal
# remove tabs
text <- gsub("[ |\t]{2,}", "", text)
# remove blank spaces at the beginning
text <- gsub("^ ", "", text)
# remove blank spaces at the end
text <- gsub(" $", "", text)
# remove punctuation
text <- gsub("(?!')[[:punct:]]", "", text, perl=TRUE)
# remove links
text <- gsub('http\\S+\\s*|http\\S+$', '', text)
# remove profanity
for(badword in badwords){
text = gsub(badword, "", text)
}
# remove rt
text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", text)
# remove at
text <- gsub("@\\w+", "", text)
# remove numbers
text <- gsub("[[:digit:]]", "", text)
# remove extra spaces and tabs
text = unlist(strsplit(text, " "))
text = text[nchar(text) > 0]
text = paste(text, sep = ' ', collapse = ' ')
return(text)
}
clean_dataset <- function(text_docs){
return (simplify2array(lapply(text_docs, clean_text)))
}
Now, we will use the above functions to actually clean, sample the datasets and save them.
# taking sample data from datasets
blog_data_subset <- sample(blog_data, 10000)
twitter_data_subset <- sample(twitter_data, 10000)
news_data_subset <- sample(news_data, 10000)
# cleaning datasets from extra spaces, tabs, punctuations, symbols, profanity
blog_data_subset = clean_dataset(blog_data_subset)
twitter_data_subset = clean_dataset(twitter_data_subset)
news_data_subset = clean_dataset(news_data_subset)
# saving datasets to disk
save(blog_data_subset, file="blog_data_subset.RData")
save(twitter_data_subset, file="twitter_data_subset.RData")
save(news_data_subset, file="news_data_subset.RData")
Here, we will be loading the saved clean datasets, create a corpus from them with additional text cleaning and normalization methods and then generate n-grams
like we discussed earlier. We will be using a lot of functions from the tm
package here in some of our functions below.
# loading and combining the datasets
load("blog_data_subset.RData")
load("news_data_subset.RData")
load("twitter_data_subset.RData")
sample_dataset <- c(paste(blog_data_subset[1:length(blog_data_subset)], collapse = " "),
paste(news_data_subset[1:length(news_data_subset)], collapse = " "),
paste(twitter_data_subset[1:length(twitter_data_subset)], collapse = " ")
)
sample_dataset <- paste(sample_dataset, collapse = " ")
# clean and create a corpus of documents with and without stopwords as needed
create_corpus <- function(dataset, remove_stopwords=FALSE){
corpus <- Corpus(VectorSource(sample_dataset))
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, to_space, "/|@|//|$|:|:)|*|&|!|?|_|-|#|")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
if (remove_stopwords){
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
}
return (corpus)
}
# get the top N frequent occuring n-grams from the corpus
get_top_ngrams <- function(corpus, tokenize_function=NULL, top_ngram_limit=20){
if (is.null(tokenize_function)){
corpus_tdm <- TermDocumentMatrix(corpus)
} else{
corpus_tdm <- TermDocumentMatrix(corpus, control=list(tokenize=tokenize_function))
}
corpus_tdm <- removeSparseTerms(corpus_tdm, 0.99)
corpus_tdm <- as.matrix(corpus_tdm)
ngram_freqs <- sort(rowSums(corpus_tdm), decreasing=TRUE)
ngrams <- data.frame(ngram=names(ngram_freqs),freq=ngram_freqs)
top_ngrams <- ngrams[1:top_ngram_limit,]
return (top_ngrams)
}
# plot top ngrams
plot_ngrams <- function(top_ngrams_data, xlabel, ylabel, title, cg_low, cg_high){
ggplot(data = top_ngrams_data, aes(x = reorder(ngram, -freq), y = freq, fill=freq)) +
geom_bar(stat = "identity") +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size=15, color="black"),
axis.text.y = element_text(size=12, color="black"),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
title = element_text(size=20),
axis.line = element_line(size = 1, colour = "black", linetype = "dashed"),
panel.background = element_rect(fill = "white"),
panel.grid.major = element_line(colour = "black", linetype = "dotted"),
panel.grid.minor = element_line(colour = "black", linetype = "dotted"),
panel.background = element_rect(fill = "white")
) +
xlab(xlabel) + ylab(ylabel) + ggtitle(title) +
scale_fill_gradient(low=cg_low,high=cg_high, guide = FALSE)
}
# create the corpuses now, one with stopwords and one without them
corpus <- create_corpus(sample_dataset)
corpus_without_stopwords <- create_corpus(sample_dataset, remove_stopwords = TRUE)
Now we use the above functions to generate n-grams below.
The following code segment plots the top 15 unigrams from two corpuses, the original one and one after removing stopwords. The idea is to compare both of them and see which words occur a lot and if we should really be removing stopwords for our final model.
top_unigrams <- get_top_ngrams(corpus = corpus, top_ngram_limit = 15)
top_unigrams_without_stopwords <- get_top_ngrams(corpus = corpus_without_stopwords,
top_ngram_limit = 15)
g1 <- plot_ngrams(top_ngrams_data = top_unigrams, xlabel = "Unigram", ylabel = "Frequency",
title = "Most Frequent Unigrams", cg_low = "#FF8888", cg_high = "#FF0000")
g2 <- plot_ngrams(top_ngrams_data = top_unigrams_without_stopwords, xlabel = "Unigram",
ylabel = "Frequency", title = "Most Frequent Unigrams without stopwords",
cg_low = "#BADDFA", cg_high = "#2095F2")
grid.arrange(arrangeGrob(g1, g2, ncol = 1, nrow = 2))
From the above plots, it is clear that there is a marked difference between the frequency counts of the two plots and the ones involving stop words have higher frequency compared to the others.
The following code segment plots the top 15 bigrams from our original corpus as well as the corpus without stop words.
tokenizer_bigrams <- function(x) unlist(lapply(ngrams(words(x),2),paste,collapse=" "),use.names = FALSE)
top_bigrams <- get_top_ngrams(corpus = corpus,tokenize_function = tokenizer_bigrams, top_ngram_limit = 15)
top_bigrams_without_stopwords <- get_top_ngrams(corpus = corpus_without_stopwords,
tokenize_function = tokenizer_bigrams,
top_ngram_limit = 15)
g3 <- plot_ngrams(top_ngrams_data = top_bigrams, xlabel = "Bigram", ylabel = "Frequency",
title = "Most Frequent Bigrams", cg_low = "#FF8888", cg_high = "#FF0000")
g4 <- plot_ngrams(top_ngrams_data = top_bigrams_without_stopwords, xlabel = "Bigram",
ylabel = "Frequency", title = "Most Frequent Bigrams without stopwords",
cg_low = "#BADDFA", cg_high = "#2095F2")
grid.arrange(arrangeGrob(g3, g4, ncol = 1, nrow = 2))
As expected, words like
the
and a
occur frequently within the bigrams because these words are used a lot in combination with other words when communicating anything in the form of sentences in the first plot. Their corresponding frequencies are also higher compared to the bigrams in the second plot which does not involve stopwords.
The following code segment plots the top 15 trigrams from our original corpus and the corpus without stopwords.
tokenizer_trigrams <- function(x) unlist(lapply(ngrams(words(x),3),paste,collapse=" "),use.names = FALSE)
top_trigrams <- get_top_ngrams(corpus = corpus,
tokenize_function = tokenizer_trigrams,
top_ngram_limit = 15)
top_trigrams_without_stopwords <- get_top_ngrams(corpus = corpus_without_stopwords,
tokenize_function = tokenizer_trigrams,
top_ngram_limit = 15)
g5 <- plot_ngrams(top_ngrams_data = top_trigrams, xlabel = "Trigram", ylabel = "Frequency",
title = "Most Frequent Trigrams", cg_low = "#FF8888", cg_high = "#FF0000")
g6 <- plot_ngrams(top_ngrams_data = top_trigrams_without_stopwords, xlabel = "Trigram",
ylabel = "Frequency", title = "Most Frequent Trigrams without stopwords",
cg_low = "#BADDFA", cg_high = "#2095F2")
grid.arrange(arrangeGrob(g5, g6, ncol = 1, nrow = 2))
Here, as before, frequencies of the trigrams in the first plot are higher corresponding to the ones in the second plot. Also we notice, trigrams in the first plot are more in the lines of phrases people use frequently when beginning sentences or communicating anything like,
it was a
, i want to
, i have to
and so on. More interestingly, we notice entities which occur a lot in the second plot which include, new york city
, president barack obama
, happy mothers day
and so on.
From the above plots, it is clear that there is a marked difference between the frequency counts of each of the pair-plots. Definitely stop words occur a lot more in every sentence compared to normal words because these are use a lot in joining words and sentences together i.e., subjects and predicates to communicate information. Hence ideally in this use-case for word predictions in sentences, we should not be removing stop words, but we can definitely augment our model with ngrams obtained from removing stop words.
Now that I have got a working mechanism to generate n-grams, I will be using this to generate n-grams upto a certain level ( maybe 3 or 4 ) and then store this data and use this in developing a predictive model by testing various methods like weighted approach and priority to frequently occuring ngrams based on input text using backoff models and Bayesian approach also if possible. Once this is done, I will be looking at ways to improve efficiency of the model to improve response time since this will be finally plugged into a Shiny Web Application.