Capstone Project Milestone Report

Objectives of the milestone report

to demonstrate basic competency in downloading and loading the data
to create a basic report of summary statistics about the data sets
to report interesting findings
to briefly summarize plans for creating a prediction algorithm and Shiny app in a way understandable to a non-data science manager

Preload necessary libraries

library(tm)
library(SnowballC)
library(RWeka)
library(slam)
library(wordcloud)
library(stringi)
library(RColorBrewer)
library(ggplot2)

Download data files

This project uses the English (en_US) files from the SwiftKey dataset. We check whether the archive Coursera-Swiftkey.zip already exists, and download and unzip it if necessary.

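# Check for the zip file and download it if necessary
# (mode = "wb" keeps the binary archive intact on Windows)
if (!file.exists("Coursera-Swiftkey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-Swiftkey.zip", mode = "wb")
}

# Check for the data file and unzip if necessary.
# The file name matches the download above, and list = TRUE is dropped because
# it only lists the archive contents without extracting them. The archive is
# extracted into the working directory so that the final/en_US/ paths used
# below resolve.
if (!file.exists("final/en_US/en_US.blogs.txt")) {
  unzip("Coursera-Swiftkey.zip")
}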

Load Raw Data

# Import the blogs and twitter datasets in text mode
blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)

# Import the news dataset in binary mode; on some platforms (e.g. Windows)
# control characters in this file can end a text-mode read prematurely
con <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, skipNul = TRUE)
close(con)
rm(con)

Basic summary of files imported

##      File   Lines     Chars TotalWords
## 1   blogs  899288 208361438   37865888
## 2    news 1010242 203791405   34678691
## 3 twitter 2360148 162385035   30578933
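
The code that produced this table is not shown in the report. A minimal sketch of how such a summary could be computed in base R follows. The helper name `summarise_text` and the whitespace-based word count are assumptions made for illustration, so totals from this sketch may differ slightly from those above.

```r
# Hypothetical helper: line, character and word counts for one character vector
summarise_text <- function(name, x) {
  data.frame(
    File = name,
    Lines = length(x),
    Chars = sum(nchar(x, type = "chars", allowNA = TRUE), na.rm = TRUE),
    # Words are approximated as runs of non-whitespace characters
    TotalWords = sum(lengths(regmatches(x, gregexpr("\\S+", x)))),
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for the real blogs/news/twitter vectors
toy <- c("the quick brown fox", "jumps over", "")
summarise_text("toy", toy)
```

Applied to the real data, the three results could be stacked with `rbind()` to give one row per file.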

As the table shows, each file is large: between roughly 0.9 and 2.4 million lines and more than 30 million words. To keep preprocessing and the subsequent operations tractable, we sample about 1% of the lines from each file.

Sample the data

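# Select a random 1% of lines, uniformly and without replacement.
# (Indexing with rbinom(n, size, .5) would draw indices clustered around the
# middle of each file, with duplicates, rather than a uniform sample.)
set.seed(123)

blogs_sample <- sample(blogs, size = floor(length(blogs) * 0.01))

twitter_sample <- sample(twitter, size = floor(length(twitter) * 0.01))

news_sample <- sample(news, size = floor(length(news) * 0.01))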


#Clean up the global environment
rm(blogs, news, twitter)

Create corpora

blogs_source <- VectorSource(blogs_sample)
blogs_corpus <- VCorpus(blogs_source)

news_source <- VectorSource(news_sample)
news_corpus <- VCorpus(news_source)

twitter_source <- VectorSource(twitter_sample)
twitter_corpus <- VCorpus(twitter_source)

Preprocess corpus

Raw text can cause significant problems in text mining, so the corpora are pre-processed with common transformation and filtering functions. The function ‘clean_corpus’ takes a corpus and applies the following transformations in order:

removes special characters
removes numbers
removes punctuation
changes words to lower case
removes common English stopwords
removes profane words, using the full list of offensive words banned by Google
converts words to stems
removes extra whitespace
converts the corpus to plain text documents
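
The definition of ‘clean_corpus’ is not included in the report. The sketch below shows one way the listed steps could be chained with `tm_map()`; it is an illustration, not the author's implementation. The `profanity` vector is a placeholder for the downloaded word list, and treating "special characters" as anything that is not alphanumeric, whitespace or punctuation is an assumption.

```r
library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument()

# Placeholder list (assumption); the report uses a downloaded list of banned words
profanity <- c("darn", "heck")

# Replace every match of a regular expression with a space
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

clean_corpus <- function(corpus, profane_words = profanity) {
  # Special characters: anything that is not alphanumeric, whitespace or punctuation
  corpus <- tm_map(corpus, to_space, "[^[:alnum:][:space:][:punct:]]")
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, profane_words)
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, PlainTextDocument)
  corpus
}
```

One consequence of this ordering is that punctuation is stripped before stopwords are removed, so contractions such as "don't" become "dont" and no longer match the stopword list.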

blogs_clean <- clean_corpus(blogs_corpus)
twitter_clean <- clean_corpus(twitter_corpus)
news_clean <- clean_corpus(news_corpus)
full_clean <- c(blogs_clean, twitter_clean, news_clean, recursive = FALSE)

Create NGram Tokenizers for one, two and three words

unitokenizer <- function(x) 
  NGramTokenizer(x, Weka_control(min = 1, max = 1))

bitokenizer <- function(x) 
  NGramTokenizer(x, Weka_control(min = 2, max = 2))


tritokenizer <- function(x) 
  NGramTokenizer(x, Weka_control(min = 3, max = 3))

Convert individual and full corpora into Term Document Matrices

blogs_tdm <- TermDocumentMatrix(blogs_clean, control = list(tokenize = unitokenizer))

news_tdm <- TermDocumentMatrix(news_clean, control = list(tokenize = unitokenizer))

twitter_tdm <- TermDocumentMatrix(twitter_clean, control = list(tokenize = unitokenizer))

uni_tdm <- TermDocumentMatrix(full_clean, control = list(tokenize = unitokenizer))

bi_tdm <- TermDocumentMatrix(full_clean, control = list(tokenize = bitokenizer))

tri_tdm <- TermDocumentMatrix(full_clean, control = list(tokenize = tritokenizer))
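
The top-term summaries reported below can be derived from these matrices. A minimal sketch follows, assuming a hypothetical helper `top_terms` that sums each term's row with `slam::row_sums()` and reports both the count and the relative frequency. The toy corpus stands in for the cleaned corpora and uses tm's default tokenizer rather than the RWeka tokenizers above.

```r
library(tm)
library(slam)

# Hypothetical helper: the n most frequent terms of a term-document matrix
top_terms <- function(tdm, n = 10) {
  # row_sums() works on the sparse representation without densifying it
  freq <- sort(row_sums(tdm), decreasing = TRUE)
  out <- data.frame(term = names(freq),
                    count = as.numeric(freq),
                    share = as.numeric(freq) / sum(freq),
                    row.names = NULL, stringsAsFactors = FALSE)
  head(out, n)
}

# Toy corpus standing in for full_clean
toy <- VCorpus(VectorSource(c("will come home", "will see", "will come")))
top_terms(TermDocumentMatrix(toy), n = 2)
```

The same helper could be applied to `uni_tdm`, `bi_tdm` and `tri_tdm`, and the resulting `term` and `count` columns passed to `wordcloud()` or `ggplot()`.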

Top most common terms for blogs, news, twitter

Top 10 frequent unigrams and unigram wordcloud

Top 10 frequent bigrams and bigram wordcloud

Top 10 frequent trigrams and trigram wordcloud

Observations

An interesting observation is that the most frequent words differ by source (blogs, news and twitter). Whether the source should be taken into account when building the n-gram model remains to be evaluated.
A second observation is that the relative frequencies of the top unigrams are quite low (less than 1% for the top unigram ‘will’), and they fall drastically for bigrams and trigrams. The data are therefore sparse at higher orders, which the model will need to handle, for instance through the smoothing and backoff described in the plan below.

Plan For Creating Prediction Algorithm and Shiny App

The final prediction algorithm will be an n-gram model, which predicts the next item in a sequence in the form of an (n-1)-order Markov model. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability: with larger n, a model can store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently.

The 2-gram and 3-gram frequency tables calculated above (and higher-order n-grams) will be used to train the model under a Markov (independence) assumption, so that each word depends only on the preceding n − 1 words. This Markov model is used as an approximation of the true underlying language.

In a simple n-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.) can be described as following a categorical distribution (often imprecisely called a “multinomial distribution”).
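
For the trigram case, the maximum-likelihood estimate of this conditional distribution is a ratio of n-gram counts. The notation $C(\cdot)$ for a count in the sampled corpus is introduced here for illustration only:

```latex
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
```

The bigram case is analogous, with $C(w_{i-1}\, w_i)$ in the numerator and $C(w_{i-1})$ in the denominator.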

In practice, the probability distributions will be smoothed by assigning non-zero probabilities to unseen words or infrequent n-grams.
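
The report does not say which smoothing method will be used. As an illustration only, the sketch below applies add-one (Laplace) smoothing to the words observed after a single context; the function name `laplace_probs` and the toy counts are assumptions.

```r
# Add-one (Laplace) smoothing for the words following one fixed context.
# counts: named vector of observed next-word counts (names must be in vocab)
# vocab:  character vector of all known words
laplace_probs <- function(counts, vocab) {
  full <- setNames(numeric(length(vocab)), vocab)
  full[names(counts)] <- counts  # unseen words keep a count of 0
  (full + 1) / (sum(full) + length(vocab))
}

# Toy counts of words seen after the context "i will"
counts <- c(be = 3, go = 1)
vocab <- c("be", "go", "see", "try")
p <- laplace_probs(counts, vocab)
p
```

Every word in the vocabulary now has a non-zero probability and the probabilities still sum to one. More refined schemes discount observed counts less aggressively than add-one does.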

Finally, a simple Shiny app will be created that:

examines a text string entered by a user
compares that string to the smoothed n-gram frequency matrix
tries to match the string to the existing n-grams.

If there is a match, then the app will look for the maximum probability word that follows the n-gram. If there is no match, then it will check the (n-1)-grams, and then the (n-2)-grams and so on. At each step, the app will look for a match, and if there is a match, the app will identify the word with the highest probability of occurring next, using the smoothed frequency matrix. This is a backoff model approach.
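
The backoff lookup described above can be sketched in base R as follows. The toy count tables and the function name `predict_next` are assumptions for illustration. The sketch uses raw counts rather than the smoothed frequency matrix, and picks the most frequent continuation at the longest matching context.

```r
# Toy n-gram counts keyed by context (lower case); assumptions for illustration
trigrams <- list("i will" = c(be = 3, go = 1))
bigrams  <- list("will" = c(be = 4, not = 2), "you" = c(are = 5))
unigrams <- c(the = 10, will = 6)

predict_next <- function(text) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  n <- length(words)
  # Try the longest context first (last two words)
  if (n >= 2) {
    hit <- trigrams[[paste(words[n - 1], words[n])]]
    if (!is.null(hit)) return(names(which.max(hit)))
  }
  # Back off to the last word only
  if (n >= 1) {
    hit <- bigrams[[words[n]]]
    if (!is.null(hit)) return(names(which.max(hit)))
  }
  # No context matched: fall back to the most frequent word overall
  names(which.max(unigrams))
}

predict_next("I will")       # matched at the trigram level
predict_next("thank you")    # backs off to the bigram level
predict_next("hello there")  # backs off to the unigram level
```

In the app, the same lookup would run against the full n-gram tables, and the lists here could be replaced by a keyed data structure for faster matching.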