The goal of this report is to describe the basic exploratory analysis performed on the datasets and to outline plans for the production app and algorithm. It covers only the major features identified in the data and briefly summarizes how the prediction algorithm and Shiny app will be built.
As a first step, we load the dependencies required to perform the basic exploratory analysis.
require("quanteda")
require("tokenizers")
require("data.table")
require("ggplot2")
Next, we load the datasets into the R environment.
blog   <- readLines("data/en_US.blogs.txt",   skipNul = TRUE)
news   <- readLines("data/en_US.news.txt",    skipNul = TRUE)
tweets <- readLines("data/en_US.twitter.txt", skipNul = TRUE)
Let’s look at some basic summaries of the datasets to see what we’re working with here:
summary(blog); summary(news); summary(tweets)
##    Length     Class      Mode
##    899288 character character
##    Length     Class      Mode
##   1010242 character character
##    Length     Class      Mode
##   2360148 character character
In the three files combined, we have more than 4 million lines of text! To save time and resources, we'll take a 1% sample of each file and bundle the samples into a corpus for our exploratory analysis. In the process of creating the corpus, we'll remove non-ASCII characters and convert everything to lowercase.
set.seed(1234)
blogSample <- sample(blog, 0.01*length(blog))
newsSample <- sample(news, 0.01*length(news))
tweetsSample <- sample(tweets, 0.01*length(tweets))
data <- c(blogSample, newsSample, tweetsSample)
rm(blogSample, newsSample, tweetsSample)
data <- iconv(data, "latin1", "ASCII", sub = "")   # drop non-ASCII characters
data <- tolower(data)                              # convert everything to lowercase
data <- corpus(data)                               # bundle into a quanteda corpus
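As a quick check of the sample size, quanteda's ndoc() returns the number of documents in the corpus (here, one document per sampled line):
ndoc(data)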
We now have more than 40,000 lines of text in a corpus to perform a cursory exploratory analysis. We'll tokenize the data into unigrams, bigrams, trigrams, and quadgrams and store them in document-feature matrices. In the process, we'll clean the data by removing numbers, punctuation, symbols, hyphens, URLs, and Twitter-specific characters:
# tokenize once with the cleaning options, then build n-grams and a document-feature matrix per order
toks <- tokens(data, what = "word", remove_numbers = TRUE, remove_punct = TRUE,
               remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE)
ngram1 <- tokens_ngrams(toks, n = 1, concatenator = " "); dfm1 <- dfm(ngram1, stem = FALSE)
ngram2 <- tokens_ngrams(toks, n = 2, concatenator = " "); dfm2 <- dfm(ngram2, stem = FALSE)
ngram3 <- tokens_ngrams(toks, n = 3, concatenator = " "); dfm3 <- dfm(ngram3, stem = FALSE)
ngram4 <- tokens_ngrams(toks, n = 4, concatenator = " "); dfm4 <- dfm(ngram4, stem = FALSE)
Next, we'll count how many times each unigram, bigram, trigram, and quadgram appears in its document-feature matrix and store the results in tables with two columns: "term" (the n-gram) and "freq" (the number of times it appears). Each table is then arranged in descending order of frequency so we can see which n-grams appear most often.
# sum feature counts across documents to get total frequencies, then sort by frequency
table1 <- colSums(dfm1)
table1 <- data.table(term = names(table1), freq = table1)
setkey(table1, term)
table1 <- table1[order(-freq)]

table2 <- colSums(dfm2)
table2 <- data.table(term = names(table2), freq = table2)
setkey(table2, term)
table2 <- table2[order(-freq)]

table3 <- colSums(dfm3)
table3 <- data.table(term = names(table3), freq = table3)
setkey(table3, term)
table3 <- table3[order(-freq)]

table4 <- colSums(dfm4)
table4 <- data.table(term = names(table4), freq = table4)
setkey(table4, term)
table4 <- table4[order(-freq)]
Now we'll create some basic frequency plots to view how often the various n-grams appear.
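The full plotting code is not reproduced here; as an illustration, a top-20 chart for the unigram table could be drawn with ggplot2 roughly as follows (the same pattern applies to table2 through table4):
# sketch: plot the 20 most frequent unigrams from table1 (already sorted by freq)
top20 <- table1[1:20]
ggplot(top20, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "unigram", y = "frequency")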
The top 20 unigrams are the following:
The top 20 bigrams are the following:
The top 20 trigrams are the following:
And the top 20 quadgrams are the following:
As n increases, the frequency of any given term decreases. However, many of the most frequent terms are built from the same words, many of which are stopwords. Stopwords are of limited value on their own, so we will likely remove them when building the production app.
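For example, stopwords could be dropped from the unigram table with quanteda's built-in English stopword list (filtering the higher-order tables would require checking each word of the n-gram, which is omitted in this sketch):
# sketch: remove unigrams that are English stopwords
table1_nostop <- table1[!term %in% stopwords("en")]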
As we move towards production, we will use a maximum likelihood estimate to calculate how often a given term appears in the corpus, together with a smoothing method that assigns some probability to unseen n-grams, specifically modified Kneser-Ney smoothing with an interpolation-backoff scheme: higher-order n-grams are tried first to predict the next word, and probability mass from lower-order n-grams is used to estimate probabilities for higher-order n-grams that were never observed.
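For reference, the standard interpolated Kneser-Ney estimate for a bigram has the form below (modified Kneser-Ney replaces the single discount $d$ with count-dependent discounts); this is the textbook formulation rather than code from this project:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\, 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)$$

where $\lambda(w_{i-1})$ is the discounted probability mass redistributed from the context $w_{i-1}$, and $P_{\text{cont}}(w_i)$ is the continuation probability, i.e. the fraction of distinct bigram types that end in $w_i$.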
We'll remove stopwords from the n-gram tables and then use the tables to predict the next word from a string of words. By default, the app will suggest the word with the highest probability given the preceding context.
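To make the planned lookup concrete, here is a minimal sketch of a frequency-based backoff prediction using the tables built above. The function name predictNext is illustrative, and its simple highest-count fallback stands in for the Kneser-Ney-smoothed model planned for production:
# sketch: predict the next word from the last two words of a phrase by backing off
# from the trigram table to the bigram table (raw counts only, no smoothing)
predictNext <- function(phrase) {
  w <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)             # last two words
  hits <- table3[grepl(paste0("^", paste(w, collapse = " "), " "), term)]
  if (nrow(hits) == 0)                                             # back off to bigrams
    hits <- table2[grepl(paste0("^", tail(w, 1), " "), term)]
  if (nrow(hits) == 0) return(table1$term[1])                      # last resort: top unigram
  tail(strsplit(hits$term[1], " ")[[1]], 1)                        # last word of best match
}
predictNext("thanks for the")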