Exploratory Analysis of the English Text Corpora

In this assignment, I explore a provided corpus of English text in the form of blogs, news articles, and tweets. The goal is to get a sense of the shape of the data, the cleaning requirements, and preliminary considerations for building a predictive text algorithm.

The corpora can be found at www.corpora.heliohost.org. You will notice in my code that I have downloaded them locally to save time. I also used this list of profane words from GitHub to clean the corpora: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en. I saved this locally in a .txt file named ‘badwords.txt’.
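
For completeness, here is a minimal sketch of how the one-time download could be scripted. The zip URL below is a placeholder for wherever you obtain the archive, and the unzip call assumes the archive extracts to the final/en_US/ folder that the working directory below points to.

zip_url  <- "https://example.com/Coursera-SwiftKey.zip"  # placeholder URL, not the real link
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file)  # assumed to create the final/en_US/ folder used below
}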

setwd("~/Desktop/final/en_US")
getwd()

library(slam)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## R.utils v2.2.0 (2015-12-09) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(tm)
## Loading required package: NLP
library(stringr)
library(stringi)
library(RWeka)
library(knitr)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
profanities <- scan("badwords.txt", what="", sep="\n")
profanities <- as.character(profanities)
tfile <- "en_US.twitter.txt"
twitter_con <- file(tfile, open="r")
twitter_text <- readLines(twitter_con, encoding = 'UTF-8')
## Warning in readLines(twitter_con, encoding = "UTF-8"): line 167155 appears
## to contain an embedded nul
## Warning in readLines(twitter_con, encoding = "UTF-8"): line 268547 appears
## to contain an embedded nul
## Warning in readLines(twitter_con, encoding = "UTF-8"): line 1274086 appears
## to contain an embedded nul
## Warning in readLines(twitter_con, encoding = "UTF-8"): line 1759032 appears
## to contain an embedded nul
close(twitter_con)

nfile <- "en_US.news.txt"
news_con <- file(nfile, open="r")
news_text <- readLines(news_con, encoding = 'UTF-8')
close(news_con)

bfile <- "en_US.blogs.txt"
blogs_con <- file(bfile, open="r")
blogs_text <- readLines(blogs_con, encoding = 'UTF-8')
close(blogs_con)
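
The embedded-nul warnings above are harmless for this analysis, but for reference they can be avoided by passing skipNul = TRUE to readLines, e.g.:

# alternative read that silently drops the embedded nul characters
twitter_text <- readLines(tfile, encoding = "UTF-8", skipNul = TRUE)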

I created a summary table to understand the size of the different English-language corpora. In total, the three files come to ~556 MB, ~4.2M lines, and over 100M words!

summary_matrix <- data.frame(
  size  = c(file.info(bfile)$size, file.info(nfile)$size, file.info(tfile)$size) / (1024*1024),
  lines = c(countLines(bfile), countLines(nfile), countLines(tfile)),
  words = c(sum(stri_count_words(blogs_text)), sum(stri_count_words(news_text)), sum(stri_count_words(twitter_text)))
)

rownames(summary_matrix) <- c("Blog", "News", "Twitter")
colnames(summary_matrix) <- c("Size (in MB)", "Line Count", "Word Count")
summary_matrix <- rbind(summary_matrix, colSums(summary_matrix))
rownames(summary_matrix)[4] <- "TOTAL"

summary_matrix
##         Size (in MB) Line Count Word Count
## Blog        200.4242     899288   37546246
## News        196.2775    1010242   34762395
## Twitter     159.3641    2360148   30093369
## TOTAL       556.0658    4269678  102402010

One method for developing predictive text is to use an ‘n-gram model’: a representative corpus is used to determine the frequency of n-word sequences of text, and then, based on the text a user has typed so far, the model tries to guess what the next word should be. For example, if a common 4-gram in the corpus is “I want my MTV”, it would be reasonable for the model to predict “MTV” as the next word after I type “I want my”.
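
To make that concrete, here is a toy sketch of the lookup; the four_grams table and its counts are made up purely for illustration.

# toy 4-gram frequency table -- made-up counts, for illustration only
four_grams <- data.frame(
  word = c("i want my mtv", "i want my money", "at the end of"),
  freq = c(12, 7, 50),
  stringsAsFactors = FALSE
)

# keep the 4-grams that start with the typed context, then propose the
# most frequent continuation as the predicted next word
context <- "i want my"
candidates <- four_grams[startsWith(four_grams$word, paste0(context, " ")), ]
sub(paste0(context, " "), "", candidates$word[which.max(candidates$freq)])
## [1] "mtv"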

I wanted to get a sense of what common n-grams would look like without any cleaning, and also what they would look like if I removed ‘stop words’: common fillers like “of”, “a”, or “the”. I saved both of these experiments as .csv files to demonstrate why cleaning the data was necessary, and why I ultimately left stop words in. Removing stop words produced phrases that would likely not be helpful to a user typing a sentence; for example, the 2nd most common 3-gram without stop words was “happy mothers day”.
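
As a minimal, self-contained sketch, the ‘no stopwords’ variant adds one extra step: removing tm’s built-in English stop-word list before tokenizing. (The code I actually used to generate the .csv files is not reproduced here.)

# extra step used only in the stop-word experiment: drop tm's English stop words
example_corpus <- VCorpus(VectorSource("happy mothers day to all of the moms"))
example_corpus <- tm_map(example_corpus, content_transformer(tolower))
example_corpus <- tm_map(example_corpus, removeWords, stopwords("en"))
content(example_corpus[[1]])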

messy <- read.csv('messytable.csv')
cleaned <- read.csv('cleanedtable.csv')

mplot <- ggplot(messy[1:10, 2:3], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top Ten 3-grams with unclean data")

cplot <- ggplot(cleaned[1:10, 2:3], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top Ten 3-grams with no stopwords")

mplot

cplot

Using the tm package’s common cleaning functions, I created a master function for cleaning the corpora. Note that the order of the transformations matters: the list of profanities is entirely lowercase, so if ‘tolower’ did not come before the profanity removal, some bad words would have been left in.

Per the recommendation of the RWeka documentation, I also created tokenizer functions for the 2-, 3-, and 4-word tokens that I will be creating.

clean_text <- function(text){
  # lowercase first so that the all-lowercase profanity list matches below
  text <- tm_map(text, content_transformer(tolower))
  text <- tm_map(text, removeNumbers)
  text <- tm_map(text, removePunctuation)
  text <- tm_map(text, stripWhitespace)
  # profanity filter; depends on the text already being lowercase
  text <- tm_map(text, removeWords, profanities)
  return(text)
}

BiTokens <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TriTokens <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
QuadTokens <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))}

The full corpora are not necessary for this exercise and would take quite a bit of time to process, so I am using a random sample of 2% of each document and joining the samples together. I then apply the cleaning function.

set.seed(309)
blogsSample <- sample(blogs_text, length(blogs_text)*0.02)
newsSample <- sample(news_text, length(news_text)*0.02)
twitterSample <- sample(twitter_text, length(twitter_text)*0.02)

joined_text <- c(blogsSample, newsSample, twitterSample)
joined_text <- VCorpus(VectorSource(joined_text))
clean_join <- clean_text(joined_text)

For some reason, RWeka has trouble with multicore parallelization on my OS X machine; I could only get the tokenizers to work by explicitly limiting the session to one core. Here, I am creating 2-, 3-, and 4-gram term-document matrices. TermDocumentMatrix produces a very sparse matrix, which I trim with removeSparseTerms, collapse with row_sums, and then turn into data frames sorted by frequency.

options(mc.cores = 1)

two_gram_tdm <- TermDocumentMatrix(clean_join, control=list(tokenize=BiTokens))
two_gram <- removeSparseTerms(two_gram_tdm, 0.9999)
two_freq <- sort(row_sums(two_gram, na.rm = T), decreasing=TRUE)
two_freqdf <- data.frame(word=names(two_freq), freq=two_freq)

three_gram_tdm <- TermDocumentMatrix(clean_join, control=list(tokenize=TriTokens))
three_gram <- removeSparseTerms(three_gram_tdm, 0.9999)
three_freq <- sort(row_sums(three_gram, na.rm = T), decreasing=TRUE)
three_freqdf <- data.frame(word=names(three_freq), freq=three_freq)

four_gram_tdm <- TermDocumentMatrix(clean_join, control=list(tokenize=QuadTokens))
four_gram <- removeSparseTerms(four_gram_tdm, 0.9999)
four_freq <- sort(row_sums(four_gram, na.rm = T), decreasing=TRUE)
four_freqdf <- data.frame(word=names(four_freq), freq=four_freq)

Here are charts of the ten most common n-grams for each n.

twoplot <- ggplot(two_freqdf[1:10, ], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top Ten 2-grams")

threeplot <- ggplot(three_freqdf[1:10, ], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top Ten 3-grams")

fourplot <- ggplot(four_freqdf[1:10, ], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top 10 4-grams")

twoplot

threeplot

fourplot

In trying to figure out which model to use (e.g., Katz’s backoff model), I reviewed some different smoothing methods from a Stanford tutorial: http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

Some examples the tutorial covers are Jelinek-Mercer, Witten-Bell, Good-Turing, and Kneser-Ney smoothing. I am not yet confident that the raw n-gram frequencies from this analysis will be highly predictive on their own, as they lack the broader context of what the user is trying to communicate.
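
As a rough sketch of the direction I am leaning, here is a simple backoff-style lookup over the frequency tables built above. It falls back from the 4-gram table to the 3-gram and 2-gram tables until it finds a match; this is closer to a ‘stupid backoff’ than a true Katz or Kneser-Ney model, and predict_next_word() is a helper I am sketching here, not part of any package.

# backoff-style next-word lookup over the n-gram frequency tables built above
# (a sketch only -- no discounting or smoothing is applied)
predict_next_word <- function(input, n_suggestions = 3) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  tables <- list(four_freqdf, three_freqdf, two_freqdf)
  context_sizes <- c(3, 2, 1)

  for (i in seq_along(tables)) {
    k <- context_sizes[i]
    if (length(words) < k) next
    context <- paste(tail(words, k), collapse = " ")
    # rows whose n-gram starts with the context followed by one more word;
    # assumes the context contains no regex metacharacters (true after cleaning)
    hits <- tables[[i]][grepl(paste0("^", context, " "), tables[[i]]$word), ]
    if (nrow(hits) > 0) {
      # the tables are already sorted by frequency, so take the top continuations
      return(sub(paste0("^", context, " "), "", head(as.character(hits$word), n_suggestions)))
    }
  }
  NA_character_
}

predict_next_word("i went to the")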

In going through this exercise, here are a few things I’ve learned: