Executive Summary

This report has been created to fulfill the requirements of the exploratory analysis milestone for the Data Science Capstone offered by the Johns Hopkins University on Coursera. Here we will be analyzing the English dataset from the data provided by SwiftKey belonging to a corpus called HC Corpora.

The main objectives of this report are as follows

Data retrieval and Loading

This section consists of the steps followed to setup our R environment, get the required data, load it and perform basic exploratory analysis.


Setting up required environment in R

In the following code segment, we set the required global options and load the required packages in R.

library(knitr)
opts_chunk$set(cache=TRUE,echo=TRUE)
options(width=120)
library(pander)
library(stringr)
library(stringi)
library(tm)
library(ggplot2)
library(gridExtra)


Setting the working directory

Here we set the working directory as required, it can be changed according to your preference in your personal computer.

setwd("E:/MOOCs/Coursera/Data Science - Specialization/10. Data Science Capstone/CapstoneProject")


Getting the required data

The dataset can be downloaded using the following code segment.

download_url = 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
data_file = 'Coursera-SwiftKey.zip'

if (!file.exists(data_file)) {
  cat('Downloading Dataset...\n')
  download.file(download_url, destfile=data_file, method="curl")
  cat('Unzipping Dataset...\n')
  unzip(data_file)
}else{
  cat('Dataset is already downloaded!\n')
}
## Dataset is already downloaded!


Loading the data

Next, we load in the english datasets using the following commands so that we can start analyzing them.

blog_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter_data <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")
con <- file("Coursera-SwiftKey/final/en_US/en_US.news.txt", open="rb")
news_data <- readLines(con, encoding="UTF-8")
close(con)
rm(con)


Datasets Summary Statistics

In this section, we will be looking at various summary statistics for the overall english datasets provided to get a feel of the data including basic counts and frequencies.


Basic Summary statistics

The following code segment computes basic summary statistics for the three datasets. We using the wc system command here which is used for word, line and character counts using the system function call. Do note that I have tested this on Windows and it works fine, hopefully should work on other Operating Systems too.

# Dataset names
dataset_names <- c("Blogs", "News", "Twitter")

# Summary Stats function
get_summary_stats <- function(file_path){
  dataset_size <- round(file.info(file_path)$size/1024/1024, digits = 1) # file size in MB
  lwc <- unlist(strsplit(system(paste("wc ",file_path), intern=TRUE), " "))
  lwc <- lwc[lwc != ""]
  lines <- as.numeric(lwc[1])
  words <- as.numeric(lwc[2])
  chars <- as.numeric(lwc[3])
  avg_words_per_line <- round(words / lines, 1)
  avg_chars_per_line <- round(chars / lines, 1)
  
  summary_stats <- list(size=dataset_size, lines=lines, words=words, 
                        chars=chars, avg_wpl=avg_words_per_line, 
                        avg_cpl=avg_chars_per_line)
  return (summary_stats)
}

# Get Summary Statistics
blog_data_path <- "./Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
blog_stats <- get_summary_stats(blog_data_path)

news_data_path <- "./Coursera-SwiftKey/final/en_US/en_US.news.txt"
news_stats <- get_summary_stats(news_data_path)

twitter_data_path <- "./Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
twitter_stats <- get_summary_stats(twitter_data_path)

# Summary Stats Vectors
sizes <- c(blog_stats$size, news_stats$size, twitter_stats$size)
line_counts <- c(blog_stats$lines, news_stats$lines, twitter_stats$lines)
word_counts <- c(blog_stats$words, news_stats$words, twitter_stats$words)
char_counts <- c(blog_stats$chars, news_stats$chars, twitter_stats$chars)
avg_wpl_counts <- c(blog_stats$avg_wpl, news_stats$avg_wpl, twitter_stats$avg_wpl)    
avg_cpl_counts <- c(blog_stats$avg_cpl, news_stats$avg_cpl, twitter_stats$avg_cpl) 

# Combining stats vectors into a Summary Stats Table
summary_stats_table <- data.frame(
                        cbind(dataset_names, sizes, line_counts, word_counts, 
                              char_counts, avg_wpl_counts, avg_cpl_counts)
                       ) 
colnames(summary_stats_table) <- c("File Name", "Size (in MB)", "Lines", "Words", 
                                   "Characters", "Avg. Words/Line", "Avg. Chars/Line")
pandoc.table(summary_stats_table, style = "simple", justify = 'left', 
             caption = "Datasets Summary Statistics", split.table = Inf)


File Name   Size (in MB)   Lines   Words    Characters   Avg. Words/Line   Avg. Chars/Line  
----------- -------------- ------- -------- ------------ ----------------- -----------------
Blogs       200.4          899288  37334114 210160014    41.5              233.7            
News        196.3          1010242 34365936 205811889    34                203.7            
Twitter     159.4          2360148 30359852 167105338    12.9              70.8             

Table: Datasets Summary Statistics

From the above table, it is quite expected that the Blogs dataset has the highest count in all the fields and Twitter has the lowest count because each tweet as you already know is always limited to 140 characters. We will now visualize the same in the following section.


Basic Dataset Visualization

The following code segment plots histograms of word count distributions across different articles or tweets in the different datasets.

blog_wordcounts <- nchar(blog_data)
news_wordcounts <- nchar(news_data)
twitter_wordcounts <- nchar(twitter_data)

par(mfrow = c(1, 3))
hist(blog_wordcounts[blog_wordcounts<=1000], xlab = "Word Count", ylab = "Frequency",
     main = "Blog Word Counts", col="lightblue", breaks=15)
hist(news_wordcounts[news_wordcounts<=1000], xlab = "Word Count", ylab = "Frequency", 
     main = "News Word Counts", col="lightblue", breaks=15)
hist(twitter_wordcounts, xlab = "Word Count", ylab = "Frequency", main = "Twitter Word Counts", col="lightblue", breaks=15)

The histograms above depict the distribution of wordcounts for each dataset. The blog and news datasets are having a high right skew hence to compare across datasets we take the major chunk of the dataset distribution showing articles falling within 1000 words. From the plots above, it is quite clear that twitter data will always be within 140 characters and is bimodal compared to the unimodal distributions of the other two datasets.


Datasets N-Grams Analysis

Now, as we know for text prediction it is always useful to know specific words or rather sequence of words which occur a lot in common. n-grams is specifically a technique to achieve that and in the following section, we will be generating top unigrams, bigrams and trigrams from a sample subset of our datasets since in the above section you may have noticed the datasets were quite big and for quick predictions, we need to sample the dataset.


Datasets Text Processing and Sampling

I created the following utility functions to process the text data in the datasets to clean text documents and also remove profanity using a pre-defined dictionary which I obtained online.

badwords = readLines('badwords.txt')
badwords = badwords[order(nchar(badwords),decreasing = TRUE)]

clean_text <- function(text)
{
  
  # drop all non unicode characters
  text <- iconv(text, from = "latin1", to = "UTF-8", sub="")
  text <- stri_replace_all_regex(text, "\u2019|`","'")
  text <- stri_replace_all_regex(text, "\u201c|\u201d|u201f|``",'"')
  text <- iconv(text, from = "latin1", to = "ASCII", sub="")
  
  # tolower
  text <- tolower(text)
  
  # tabs and whitespace removal
  # remove tabs
  text <- gsub("[ |\t]{2,}", "", text)
  # remove blank spaces at the beginning
  text <- gsub("^ ", "", text)
  # remove blank spaces at the end
  text <- gsub(" $", "", text)
  
  # remove punctuation
  text <- gsub("(?!')[[:punct:]]", "", text, perl=TRUE)
  
  # remove links
  text <- gsub('http\\S+\\s*|http\\S+$', '', text)
  
  # remove profanity
  for(badword in badwords){
    text = gsub(badword, "", text)
  }

  # remove rt
  text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", text)
  # remove at
  text <- gsub("@\\w+", "", text)
  
  # remove numbers
  text <- gsub("[[:digit:]]", "", text)

  # remove extra spaces and tabs
  text = unlist(strsplit(text, " "))
  text = text[nchar(text) > 0]
  text = paste(text, sep = ' ', collapse = ' ')

  return(text)
}


clean_dataset <- function(text_docs){
  return (simplify2array(lapply(text_docs, clean_text)))
}


Now, we will use the above functions to actually clean, sample the datasets and save them.

# taking sample data from datasets
blog_data_subset <- sample(blog_data, 10000)
twitter_data_subset <- sample(twitter_data, 10000)
news_data_subset <- sample(news_data, 10000)

# cleaning datasets from extra spaces, tabs, punctuations, symbols, profanity
blog_data_subset = clean_dataset(blog_data_subset)
twitter_data_subset = clean_dataset(twitter_data_subset)
news_data_subset = clean_dataset(news_data_subset)

# saving datasets to disk
save(blog_data_subset, file="blog_data_subset.RData")
save(twitter_data_subset, file="twitter_data_subset.RData")
save(news_data_subset, file="news_data_subset.RData")


Corpus creation, n-grams generation and visualization

Here, we will be loading the saved clean datasets, create a corpus from them with additional text cleaning and normalization methods and then generate n-grams like we discussed earlier. We will be using a lot of functions from the tm package here in some of our functions below.

# loading and combining the datasets
load("blog_data_subset.RData")
load("news_data_subset.RData")
load("twitter_data_subset.RData")

sample_dataset <- c(paste(blog_data_subset[1:length(blog_data_subset)], collapse = " "),
                    paste(news_data_subset[1:length(news_data_subset)], collapse = " "),
                    paste(twitter_data_subset[1:length(twitter_data_subset)], collapse = " ")
                    )

sample_dataset <- paste(sample_dataset, collapse = " ")

# clean and create a corpus of documents with and without stopwords as needed
create_corpus <- function(dataset, remove_stopwords=FALSE){
  corpus <- Corpus(VectorSource(sample_dataset))
  
  to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
  corpus <- tm_map(corpus, to_space, "/|@|//|$|:|:)|*|&|!|?|_|-|#|")  
  corpus <- tm_map(corpus, content_transformer(tolower)) 
  corpus <- tm_map(corpus, removePunctuation)  
  corpus <- tm_map(corpus, removeNumbers) 
  corpus <- tm_map(corpus, stripWhitespace)
  
  if (remove_stopwords){
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)
    
  }
  
  return (corpus)
}

# get the top N frequent occuring n-grams from the corpus
get_top_ngrams <- function(corpus, tokenize_function=NULL, top_ngram_limit=20){
  
  if (is.null(tokenize_function)){
    corpus_tdm <- TermDocumentMatrix(corpus)
  } else{
    corpus_tdm <- TermDocumentMatrix(corpus, control=list(tokenize=tokenize_function))
  }
  
  corpus_tdm <- removeSparseTerms(corpus_tdm, 0.99)
  corpus_tdm <- as.matrix(corpus_tdm)
  ngram_freqs <- sort(rowSums(corpus_tdm), decreasing=TRUE)
  ngrams <- data.frame(ngram=names(ngram_freqs),freq=ngram_freqs) 
  top_ngrams <- ngrams[1:top_ngram_limit,]
  return (top_ngrams)
}

# plot top ngrams
plot_ngrams <- function(top_ngrams_data, xlabel, ylabel, title, cg_low, cg_high){
  ggplot(data = top_ngrams_data, aes(x = reorder(ngram, -freq), y = freq, fill=freq)) + 
      geom_bar(stat = "identity") +
      theme(
          axis.text.x = element_text(angle = 45, hjust = 1, size=15, color="black"),
          axis.text.y = element_text(size=12, color="black"),
          axis.title.x = element_text(size=15),
          axis.title.y = element_text(size=15),
          title = element_text(size=20),
          axis.line = element_line(size = 1, colour = "black", linetype = "dashed"), 
          panel.background = element_rect(fill = "white"), 
          panel.grid.major = element_line(colour = "black", linetype = "dotted"), 
          panel.grid.minor = element_line(colour = "black", linetype = "dotted"), 
          panel.background = element_rect(fill = "white")
        ) +
      xlab(xlabel) + ylab(ylabel) + ggtitle(title) + 
      scale_fill_gradient(low=cg_low,high=cg_high, guide = FALSE)
}

# create the corpuses now, one with stopwords and one without them
corpus <- create_corpus(sample_dataset)
corpus_without_stopwords <- create_corpus(sample_dataset, remove_stopwords = TRUE)


Now we use the above functions to generate n-grams below.


Unigrams

The following code segment plots the top 15 unigrams from two corpuses, the original one and one after removing stopwords. The idea is to compare both of them and see which words occur a lot and if we should really be removing stopwords for our final model.

top_unigrams <- get_top_ngrams(corpus = corpus, top_ngram_limit = 15)
top_unigrams_without_stopwords <- get_top_ngrams(corpus = corpus_without_stopwords, 
                                                 top_ngram_limit = 15)

g1 <- plot_ngrams(top_ngrams_data = top_unigrams, xlabel = "Unigram", ylabel = "Frequency",
                  title = "Most Frequent Unigrams", cg_low = "#FF8888", cg_high = "#FF0000")
g2 <- plot_ngrams(top_ngrams_data = top_unigrams_without_stopwords, xlabel = "Unigram", 
                  ylabel = "Frequency", title = "Most Frequent Unigrams without stopwords", 
                  cg_low = "#BADDFA", cg_high = "#2095F2")

grid.arrange(arrangeGrob(g1, g2, ncol = 1, nrow = 2))

From the above plots, it is clear that there is a marked difference between the frequency counts of the two plots and the ones involving stop words have higher frequency compared to the others.


Bigrams

The following code segment plots the top 15 bigrams from our original corpus as well as the corpus without stop words.

tokenizer_bigrams <- function(x) unlist(lapply(ngrams(words(x),2),paste,collapse=" "),use.names = FALSE)
top_bigrams <- get_top_ngrams(corpus = corpus,tokenize_function = tokenizer_bigrams, top_ngram_limit = 15)
top_bigrams_without_stopwords <- get_top_ngrams(corpus = corpus_without_stopwords, 
                                                tokenize_function = tokenizer_bigrams,
                                                top_ngram_limit = 15)

g3 <- plot_ngrams(top_ngrams_data = top_bigrams, xlabel = "Bigram", ylabel = "Frequency",
                  title = "Most Frequent Bigrams", cg_low = "#FF8888", cg_high = "#FF0000")
g4 <- plot_ngrams(top_ngrams_data = top_bigrams_without_stopwords, xlabel = "Bigram", 
                  ylabel = "Frequency", title = "Most Frequent Bigrams without stopwords", 
                  cg_low = "#BADDFA", cg_high = "#2095F2")

grid.arrange(arrangeGrob(g3, g4, ncol = 1, nrow = 2))

As expected, words like the and a occur frequently within the bigrams because these words are used a lot in combination with other words when communicating anything in the form of sentences in the first plot. Their corresponding frequencies are also higher compared to the bigrams in the second plot which does not involve stopwords.


Trigrams

The following code segment plots the top 15 trigrams from our original corpus and the corpus without stopwords.

tokenizer_trigrams <- function(x) unlist(lapply(ngrams(words(x),3),paste,collapse=" "),use.names = FALSE)
top_trigrams <- get_top_ngrams(corpus = corpus,
                               tokenize_function = tokenizer_trigrams, 
                               top_ngram_limit = 15)
top_trigrams_without_stopwords <- get_top_ngrams(corpus = corpus_without_stopwords, 
                                                tokenize_function = tokenizer_trigrams,
                                                top_ngram_limit = 15)

g5 <- plot_ngrams(top_ngrams_data = top_trigrams, xlabel = "Trigram", ylabel = "Frequency",
                  title = "Most Frequent Trigrams", cg_low = "#FF8888", cg_high = "#FF0000")
g6 <- plot_ngrams(top_ngrams_data = top_trigrams_without_stopwords, xlabel = "Trigram", 
                  ylabel = "Frequency", title = "Most Frequent Trigrams without stopwords", 
                  cg_low = "#BADDFA", cg_high = "#2095F2")

grid.arrange(arrangeGrob(g5, g6, ncol = 1, nrow = 2))

Here, as before, frequencies of the trigrams in the first plot are higher corresponding to the ones in the second plot. Also we notice, trigrams in the first plot are more in the lines of phrases people use frequently when beginning sentences or communicating anything like, it was a, i want to, i have to and so on. More interestingly, we notice entities which occur a lot in the second plot which include, new york city, president barack obama, happy mothers day and so on.


Conclusion

From the above plots, it is clear that there is a marked difference between the frequency counts of each of the pair-plots. Definitely stop words occur a lot more in every sentence compared to normal words because these are use a lot in joining words and sentences together i.e., subjects and predicates to communicate information. Hence ideally in this use-case for word predictions in sentences, we should not be removing stop words, but we can definitely augment our model with ngrams obtained from removing stop words.


Next Steps

Now that I have got a working mechanism to generate n-grams, I will be using this to generate n-grams upto a certain level ( maybe 3 or 4 ) and then store this data and use this in developing a predictive model by testing various methods like weighted approach and priority to frequently occuring ngrams based on input text using backoff models and Bayesian approach also if possible. Once this is done, I will be looking at ways to improve efficiency of the model to improve response time since this will be finally plugged into a Shiny Web Application.