SUMMARY

The ultimate goal of the capstone project is to build a predictive text model. The data used for this project include blogs, news articles and tweets. The final product will be a Shiny application that predicts the next word(s) when the user inputs a phrase. This milestone report presents the data preprocessing and exploratory data analysis and sketches out the plan for the rest of the Data Science Capstone project.

The reference text corpus is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. A description of the corpus was archived from heliohost.org on September 30, 2016 and retrieved via the Wayback Machine on April 24, 207: https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html. The corpora were collected from publicly available sources by a web crawler.

Text Corpora Description

The given text corpus contains three files for each of four languages (Russian, Finnish, German and English). This project will focus on the English language datasets. The names of the data files are as follows:

  1. en_US.blogs.txt
  2. en_US.twitter.txt
  3. en_US.news.txt

Load Packages

library(plyr)
library(stringi)
library(NLP)
library(tm)
library(wordcloud)
library(RWeka)
library(stringr)
library(ggplot2)
library(grid)
library(gridExtra)

Download and unzip the reference Text Corpora

if(!file.exists("Coursera-SwiftKey.zip")){
    #Download the specified training data
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "Coursera-SwiftKey.zip")
}else{
    print("Reference training data is already downloaded!")
}
[1] "Reference training data is already downloaded!"
## check if the file is unzipped. Unpack only if it doesn't exist 
if(!file.exists("./final/en_US")){
  print("Unzipping Coursera-SwiftKey.zip!")
  unzip("./Coursera-SwiftKey.zip")
}

Exploratory Analysis

  1. Summary statistics of original datasets
  2. Since the datasets are fairly large, we will conduct the analysis on a small sample (say 1%).
  3. In the English language, some of the most common words in sentences are articles and prepositions. However, if we want to extract the important messages from the texts, we should try to eliminate these less meaningful words.
  4. Profanity can be present in our data and we would like to get rid of it. However, I believe people sometimes express their emotions through their use of language, so it can also be useful to analyze it separately.
  5. Create TermDocumentMatrix and subsequently perform 1-gram, 2-gram, and 3-gram analysis

Summary Statistics of the complete dataset

Here we collect summary statistics on all English files without loading them into memory.

dataFiles <- dir("./final/en_US", full.names=TRUE)
datasets.stats <- ldply(dataFiles, function(ds) {
    files.name <- basename(ds)
    # file size in MB, truncated to a whole number
    files.size <- as.integer(file.info(ds)$size / 1024^2)
    # line count, word count and longest line via the Unix wc utility
    # (wc -L is an extension that is not available in every wc implementation)
    files.linesCount <- as.integer(sub(" .*$", "", system(paste("wc -l", shQuote(ds)), intern=TRUE)))
    files.wordsCount <- as.integer(sub(" .*$", "", system(paste("wc -w", shQuote(ds)), intern=TRUE)))
    files.maxLineLen <- as.integer(sub(" .*$", "", system(paste("wc -L", shQuote(ds)), intern=TRUE)))

    data.frame(files.name, files.size, files.linesCount, files.wordsCount, files.maxLineLen)
})
colnames(datasets.stats) <- c("File", "File Size (MB)", "Lines", "Words", "Max length")
datasets.stats
               File File Size (MB)   Lines    Words Max length
1   en_US.blogs.txt            200  899288 37272578      40832
2    en_US.news.txt            196 1010242 34309642      11384
3 en_US.twitter.txt            159 2360148 30341028        140
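
The statistics above rely on the Unix `wc` utility. Where `wc` is unavailable, a rough equivalent can be sketched in base R by reading each file in chunks so that it is never fully held in memory. The function name `fileStatsChunked` and the chunk size are assumptions made for illustration, and the longest line is measured in bytes, which may differ from what `wc -L` reports.

```r
# Hypothetical helper: count lines and find the longest line (in bytes)
# by streaming the file in chunks instead of loading it all at once.
fileStatsChunked <- function(path, chunkSize = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  nLines <- 0L
  maxLen <- 0L
  repeat {
    chunk <- readLines(con, n = chunkSize, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0L) break          # end of file reached
    nLines <- nLines + length(chunk)
    maxLen <- max(maxLen, nchar(chunk, type = "bytes"))
  }
  data.frame(lines = nLines, maxLineLen = maxLen)
}

# Usage on a small temporary file
demoFile <- tempfile(fileext = ".txt")
writeLines(c("a", "bcd", "ef"), demoFile)
demoStats <- fileStatsChunked(demoFile, chunkSize = 2L)
print(demoStats)
```

Reading in chunks trades some speed for a bounded memory footprint, which matters for files of roughly 200 MB each.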

Sampling a small portion of the training data for Exploratory Data Analysis

As all the text files are relatively large (~200 MB), we perform EDA on a small portion (1%) of the training dataset.

# Fix the random seed so that the same sample can be drawn again
set.seed(1234)
blogsSampleId <- as.logical(rbinom(datasets.stats$Lines[1], 1, prob=0.01))
newsSampleId <- as.logical(rbinom(datasets.stats$Lines[2], 1, prob=0.01))
twittersSampleId <- as.logical(rbinom(datasets.stats$Lines[3], 1, prob=0.01))
blogsSample <- readLines(dataFiles[1], encoding="UTF-8", skipNul=TRUE)[blogsSampleId]
newsSample <- readLines(dataFiles[2], encoding="UTF-8", skipNul=TRUE)[newsSampleId]
incomplete final line found on './final/en_US/en_US.news.txt'
twittersSample <- readLines(dataFiles[3], encoding="UTF-8", skipNul=TRUE)[twittersSampleId]
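
The sampling above reads each full file into memory before subsetting. As a minimal sketch of an alternative, assuming a hypothetical helper `sampleLinesToFile` that is not part of the original code, the same Bernoulli sampling can be applied chunk by chunk while writing the kept lines to disk, so that later runs only need to load the sample.

```r
# Hypothetical helper: keep each line with probability `prob` and write the
# kept lines to `outPath`, reading the input in chunks.
sampleLinesToFile <- function(inPath, outPath, prob = 0.01, seed = 1234,
                              chunkSize = 10000L) {
  set.seed(seed)                            # makes the sample reproducible
  inCon <- file(inPath, open = "r")
  outCon <- file(outPath, open = "w")
  on.exit({ close(inCon); close(outCon) })
  kept <- 0L
  repeat {
    chunk <- readLines(inCon, n = chunkSize, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0L) break
    keep <- rbinom(length(chunk), 1, prob) == 1
    writeLines(chunk[keep], outCon)
    kept <- kept + sum(keep)
  }
  kept
}

# Usage on synthetic data: two runs with the same seed give the same sample
inFile <- tempfile()
outFile <- tempfile()
allLines <- sprintf("line %d", 1:5000)
writeLines(allLines, inFile)
n1 <- sampleLinesToFile(inFile, outFile, prob = 0.1)
firstRun <- readLines(outFile)
n2 <- sampleLinesToFile(inFile, outFile, prob = 0.1)
secondRun <- readLines(outFile)
```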

Summary statistics of sample data before data cleaning

Note that the line and word counts are roughly 1/100th of those in the original datasets, as expected for a 1% sample. The exception is the news word count, which is disproportionately low (about 0.1% of the original) and needs to be investigated later (data cleaning would most likely remove a large number of redundant terms in any case).

samples.list <- list(blog = blogsSample, news = newsSample, twitter = twittersSample)
samples.df <- data.frame(text.source = c("blog", "news", "twitter"), lines.count = NA, words.count = NA)
samples.df$lines.count <- sapply(samples.list, length)
samples.df$words.count <- stri_count_words(samples.list)
print("Summary statistics of sample data before data cleaning")
[1] "Summary statistics of sample data before data cleaning"
samples.df
  text.source lines.count words.count
1        blog        8877      366961
2        news       10140       35974
3     twitter       23612      301322
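
One way to cross-check the word counts above is to count words line by line and sum them per source. The sketch below uses base R only; the helper name `countWords` and the tokenization rule (runs of letters, digits and apostrophes) are assumptions for illustration, so the totals will not match `stri_count_words` exactly.

```r
# Hypothetical helper: count word-like tokens in every line and add them up.
countWords <- function(lines) {
  tokens <- regmatches(lines, gregexpr("[[:alnum:]']+", lines))
  sum(lengths(tokens))
}

# Toy stand-in for the sampled text sources
toySamples <- list(
  blog = c("Hello world", "A third-party test"),
  news = c("One two three")
)
toyCounts <- vapply(toySamples, countWords, integer(1))
print(toyCounts)
```

Applying such a per-line count to each sampled source separately makes it easier to spot whether an unexpectedly low total comes from the data or from how the counting function was called.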

Data cleaning to prepare text corpus

Note that even the 1% sample has roughly 10k documents and 30k terms on average across the blogs, news and Twitter datasets, including a large number of irrelevant terms. It is costly and unnecessary to identify all unique terms in the raw sample; the sample data should instead be cleaned before constructing a bag-of-words (BOW) representation such as a TermDocumentMatrix for subsequent processing.

Rationale behind the different data cleaning steps:
- Punctuation: Punctuation characters such as commas, parentheses and the like have little or no impact on word order and n-gram composition.
- Stop words: Stop words are so common that their information value is very low.
- ToLower: The words themselves, not their case, are important for prediction. Note that the resulting predictive model will predict lowercase words only. For English text this is less of an issue than for a language like German, which capitalizes nouns.
- http/ftp/mail URIs and Twitter hashtags and handles: These social-media-specific terms are not helpful for word prediction.
- Profanity: The word list was downloaded from https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/ and is used to remove all occurrences of these words.
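
The URI, e-mail, handle and hashtag rules listed above can be tried out on a single string with base R. This is a minimal sketch; the patterns and the sample sentence are assumptions for illustration and are deliberately simple.

```r
# Hypothetical sample line containing a URL, an e-mail address,
# a Twitter handle and a hashtag
rawText <- "Check http://example.com/page now, mail me@example.org or ping @user #fun today"

cleaned <- gsub("(f|ht)tps?://\\S+", "", rawText)             # URLs
cleaned <- gsub("[[:alnum:].-]+@[[:alnum:].-]+", "", cleaned)  # e-mail addresses
cleaned <- gsub("@\\S+", "", cleaned)                          # Twitter handles
cleaned <- gsub("#\\S+", "", cleaned)                          # hashtags
cleaned <- trimws(gsub("\\s+", " ", cleaned))                  # squeeze whitespace
print(cleaned)
```

E-mail addresses are removed before handles so that the handle pattern does not strip only the part of an address that follows the `@`.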

cleanUri <- function(text){
        ## remove http/https/ftp URLs; \\S+ stops at the next whitespace, so
        ## text that follows a URL on the same line is kept
        text <- gsub("(f|ht)tps?://\\S+", "", text) 
        ## remove e-mail addresses
        text <- gsub("[[:alnum:].-]+@[[:alnum:].-]+", "", text) 
        return(text)
}
cleanTwitterTagsHandles <- function(text){
        ## remove twitter handles
        text <- gsub("@\\S+", "", text)
        ## remove twitter hashtags
        text <- gsub("#\\S+", "", text)
        return(text)
}
samplesList <- list(blogs = blogsSample, news = newsSample, twitters = twittersSample)
samplesCorpus <- VCorpus(VectorSource(samplesList))
rm(samplesList)
# Remove URIs & Twitter hashtag and handles from all types of documents/messages
samplesCorpus <- tm_map(samplesCorpus, content_transformer(cleanUri))
samplesCorpus <- tm_map(samplesCorpus, content_transformer(cleanTwitterTagsHandles))
samplesCorpus <- tm_map(samplesCorpus, removeWords, stopwords("english"))
samplesCorpus <- tm_map(samplesCorpus, content_transformer(function(x) iconv(x, from="UTF-8", to="ASCII", sub="")))
samplesCorpus <- tm_map(samplesCorpus, content_transformer(tolower))
samplesCorpus <- tm_map(samplesCorpus, removePunctuation)
samplesCorpus <- tm_map(samplesCorpus, removeNumbers)
samplesCorpus <- tm_map(samplesCorpus, stripWhitespace)
# Remove profanity words as per https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/ 
samplesCorpus <- tm_map(samplesCorpus, removeWords, readLines("./badWords_2018_03_26.txt"))
incomplete final line found on './badWords_2018_03_26.txt'
# Apply stemming to the resulting corpus
samplesCorpus <- tm_map(samplesCorpus, stemDocument)
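
The order of the cleaning steps matters. The sketch below, which uses base R and a tiny hypothetical stop-word list rather than the tm functions above, shows how removing lowercase stop words before lowercasing the text leaves capitalized forms such as "The" and "I" in place.

```r
# Tiny hypothetical stop-word list (all lowercase)
toyStopWords <- c("i", "the", "a")

# Hypothetical helper: case-sensitive removal of whole words
removeStopWords <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  stripped <- gsub(pattern, "", text, perl = TRUE)
  trimws(gsub("\\s+", " ", stripped))
}

toyText <- "The cat and I saw the dog"

# Stop words removed first, then lowercased: "The" and "I" survive
removeThenLower <- tolower(removeStopWords(toyText, toyStopWords))

# Lowercased first, then stop words removed: all three forms are dropped
lowerThenRemove <- removeStopWords(tolower(toyText), toyStopWords)

print(removeThenLower)
print(lowerThenRemove)
```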

Text analysis for deriving n-grams

## Tokenizers
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
## TermDocumentMatrix for 1/2/3-grams
tdmBlogs_1gram = TermDocumentMatrix(samplesCorpus["blogs"], control = list(tokenize = UnigramTokenizer, wordLengths = c(3,Inf)))
tdmNews_1gram = TermDocumentMatrix(samplesCorpus["news"], control = list(tokenize = UnigramTokenizer, wordLengths = c(3,Inf)))
tdmTwitters_1gram = TermDocumentMatrix(samplesCorpus["twitters"], control = list(tokenize = UnigramTokenizer, wordLengths = c(3,Inf)))
tdmBlogs_2gram = TermDocumentMatrix(samplesCorpus["blogs"], control = list(tokenize = BigramTokenizer, wordLengths = c(3,Inf)))
tdmNews_2gram = TermDocumentMatrix(samplesCorpus["news"], control = list(tokenize = BigramTokenizer, wordLengths = c(3,Inf)))
tdmTwitters_2gram = TermDocumentMatrix(samplesCorpus["twitters"], control = list(tokenize = BigramTokenizer, wordLengths = c(3,Inf)))
tdmBlogs_3gram = TermDocumentMatrix(samplesCorpus["blogs"], control = list(tokenize = TrigramTokenizer, wordLengths = c(3,Inf)))
tdmNews_3gram = TermDocumentMatrix(samplesCorpus["news"], control = list(tokenize = TrigramTokenizer, wordLengths = c(3,Inf)))
tdmTwitters_3gram = TermDocumentMatrix(samplesCorpus["twitters"], control = list(tokenize = TrigramTokenizer, wordLengths = c(3,Inf)))
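
To make clear what the tokenizers above produce, here is a minimal base R sketch of n-gram extraction from a vector of tokens. The helper `makeNgrams` is hypothetical and only illustrates the idea; it is not how RWeka implements `NGramTokenizer`.

```r
# Hypothetical helper: slide a window of n tokens over the token vector
makeNgrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

toyTokens <- strsplit("to be or not to be", " ", fixed = TRUE)[[1]]
toyBigrams <- makeNgrams(toyTokens, 2)
toyTrigrams <- makeNgrams(toyTokens, 3)

# Frequency of each bigram, most frequent first
toyBigramFreq <- sort(table(toyBigrams), decreasing = TRUE)
print(toyBigramFreq)
```

A sentence of k tokens yields k - n + 1 n-grams, which is why higher-order n-grams are sparser for the same amount of text.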

Extract the most frequently occurring 1/2/3-grams along with their frequency

# put word count from term-document matrices into data frames
wfmBlogs_1gram <- data.frame(word = tdmBlogs_1gram$dimnames$Terms, frequency = tdmBlogs_1gram$v)
wfmNews_1gram <- data.frame(word = tdmNews_1gram$dimnames$Terms, frequency = tdmNews_1gram$v)
wfmTwitters_1gram <- data.frame(word = tdmTwitters_1gram$dimnames$Terms, frequency = tdmTwitters_1gram$v)
wfmBlogs_2gram <- data.frame(word = tdmBlogs_2gram$dimnames$Terms, frequency = tdmBlogs_2gram$v)
wfmNews_2gram <- data.frame(word = tdmNews_2gram$dimnames$Terms, frequency = tdmNews_2gram$v)
wfmTwitters_2gram <- data.frame(word = tdmTwitters_2gram$dimnames$Terms, frequency = tdmTwitters_2gram$v)
wfmBlogs_3gram <- data.frame(word = tdmBlogs_3gram$dimnames$Terms, frequency = tdmBlogs_3gram$v)
wfmNews_3gram <- data.frame(word = tdmNews_3gram$dimnames$Terms, frequency = tdmNews_3gram$v)
wfmTwitters_3gram <- data.frame(word = tdmTwitters_3gram$dimnames$Terms, frequency = tdmTwitters_3gram$v)
# reorder by decreasing frequency
wfmBlogs_1gram <- plyr::arrange(wfmBlogs_1gram, -frequency)
wfmNews_1gram <- plyr::arrange(wfmNews_1gram, -frequency)
wfmTwitters_1gram <- plyr::arrange(wfmTwitters_1gram, -frequency)
wfmBlogs_2gram <- plyr::arrange(wfmBlogs_2gram, -frequency)
wfmNews_2gram <- plyr::arrange(wfmNews_2gram, -frequency)
wfmTwitters_2gram <- plyr::arrange(wfmTwitters_2gram, -frequency)
wfmBlogs_3gram <- plyr::arrange(wfmBlogs_3gram, -frequency)
wfmNews_3gram <- plyr::arrange(wfmNews_3gram, -frequency)
wfmTwitters_3gram <- plyr::arrange(wfmTwitters_3gram, -frequency)
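
The sorting above, and the top-n selection used for the plots that follow, repeat the same steps for every source and n-gram order. As a sketch of how this could be condensed, assuming a hypothetical helper `topTerms` and a data frame with `word` and `frequency` columns as built above:

```r
# Hypothetical helper: sort by decreasing frequency, keep the top n rows and
# order the factor levels by frequency so that bar charts plot in that order.
topTerms <- function(freq, n = 20L) {
  ordered <- freq[order(-freq$frequency), , drop = FALSE]
  top <- head(ordered, n)
  top$word <- reorder(factor(top$word), top$frequency)
  top
}

# Usage on a toy frequency table
toyFreq <- data.frame(word = c("a", "b", "c"), frequency = c(2, 5, 1))
toyTop <- topTerms(toyFreq, n = 2L)
print(toyTop)
```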

Plot most frequent Unigrams

n <- 20L # show the 20 most frequently occurring terms/words
# isolate top n words by decreasing frequency
blogs.top <- wfmBlogs_1gram[1:n, ]
news.top <- wfmNews_1gram[1:n, ]
twitters.top <- wfmTwitters_1gram[1:n, ]
blogs.top$word <- reorder(blogs.top$word, blogs.top$frequency)
news.top$word <- reorder(news.top$word, news.top$frequency)
twitters.top$word <- reorder(twitters.top$word, twitters.top$frequency)
# plots
g.blogs.top <- ggplot(blogs.top, aes(x = word, y = frequency))
g.blogs.top <- g.blogs.top + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Uni-grams in Blogs")
g.news.top <- ggplot(news.top, aes(x = word, y = frequency))
g.news.top <- g.news.top + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Uni-grams in News")
g.twitters.top <- ggplot(twitters.top, aes(x = word, y = frequency))
g.twitters.top <- g.twitters.top + geom_bar(stat = "identity") + coord_flip() + 
  labs(title = "Uni-grams in Twitters")
grid.arrange(g.blogs.top, g.news.top, g.twitters.top, ncol = 3)

Plot most frequent Bigrams

# isolate top n words by decreasing frequency
blogs.top <- wfmBlogs_2gram[1:n, ]
news.top <- wfmNews_2gram[1:n, ]
twitters.top <- wfmTwitters_2gram[1:n, ]
blogs.top$word <- reorder(blogs.top$word, blogs.top$frequency)
news.top$word <- reorder(news.top$word, news.top$frequency)
twitters.top$word <- reorder(twitters.top$word, twitters.top$frequency)
# plots
g.blogs.top <- ggplot(blogs.top, aes(x = word, y = frequency))
g.blogs.top <- g.blogs.top + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Bigrams in Blogs")
g.news.top <- ggplot(news.top, aes(x = word, y = frequency))
g.news.top <- g.news.top + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Bigrams in News")
g.twitters.top <- ggplot(twitters.top, aes(x = word, y = frequency))
g.twitters.top <- g.twitters.top + geom_bar(stat = "identity") + coord_flip() + 
  labs(title = "Bigrams in Twitters")
grid.arrange(g.blogs.top, g.news.top, g.twitters.top, ncol = 3)

Plot most frequent Trigrams

# isolate top n words by decreasing frequency
blogs.top <- wfmBlogs_3gram[1:n, ]
news.top <- wfmNews_3gram[1:n, ]
twitters.top <- wfmTwitters_3gram[1:n, ]
blogs.top$word <- reorder(blogs.top$word, blogs.top$frequency)
news.top$word <- reorder(news.top$word, news.top$frequency)
twitters.top$word <- reorder(twitters.top$word, twitters.top$frequency)
# plots
g.blogs.top <- ggplot(blogs.top, aes(x = word, y = frequency))
g.blogs.top <- g.blogs.top + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Trigrams in Blogs")
g.news.top <- ggplot(news.top, aes(x = word, y = frequency))
g.news.top <- g.news.top + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Trigrams in News")
g.twitters.top <- ggplot(twitters.top, aes(x = word, y = frequency))
g.twitters.top <- g.twitters.top + geom_bar(stat = "identity") + coord_flip() + 
  labs(title = "Trigrams in Twitters")
grid.arrange(g.blogs.top, g.news.top, g.twitters.top, ncol = 3)

Next Steps…

The plan for completing the Capstone Project mostly involves the following tasks:
- Create the Shiny app
- Plan the approach to learning/training either a basic n-gram model or a better alternative; this will mostly involve quantifying the probability and confidence interval of the proposed word predictions
- Profile and optimize the solution to achieve good performance and accuracy
- Fix stop-word removal, which currently leaves very frequent words such as ‘I’ and ‘the’ in the corpus; a likely cause is that stop words are removed before the text is lowercased, so capitalized forms do not match the lowercase stop-word list
- Find the impact/usefulness of the existing data cleaning steps and revise them to achieve better performance and accuracy in modeling and prediction
- Investigate whether any dimensionality / sparsity reduction would be feasible and helpful
- Identify an appropriate training size to balance bias and variance
- Save the sample training dataset to disk to avoid loading the full training dataset
- Determine the best data structure in R to store n-gram language models
- Make the code terse
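
As a starting point for the modeling tasks above, a basic bigram model can be sketched in base R as a data frame of continuation probabilities. This is an illustrative sketch under simplifying assumptions (no smoothing, no backoff, whitespace tokenization), and the names `buildBigramModel` and `predictNext` are hypothetical rather than part of the planned application.

```r
# Hypothetical helper: estimate P(second | first) by relative frequency
buildBigramModel <- function(tokens) {
  first <- head(tokens, -1)
  second <- tail(tokens, -1)
  counts <- as.data.frame(table(first, second), stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]
  counts$prob <- counts$Freq / ave(counts$Freq, counts$first, FUN = sum)
  counts[order(counts$first, -counts$prob), ]
}

# Hypothetical helper: return the k most probable continuations of `word`
predictNext <- function(model, word, k = 1) {
  hits <- model[model$first == word, ]
  head(hits$second, k)
}

toyCorpus <- "i like tea i like coffee i drink tea"
toyModel <- buildBigramModel(strsplit(toyCorpus, " ", fixed = TRUE)[[1]])
nextAfterI <- predictNext(toyModel, "i")
nextAfterUnseen <- predictNext(toyModel, "zzz")
print(nextAfterI)
```

An unseen word returns no prediction here, which is the gap that backoff or smoothing schemes are meant to fill. A flat data frame is only one possible storage format; keyed lookups may scale better for large models.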

