In this assignment, I explore a provided corpus of English text in the form of blogs, news articles, and tweets. The goal is to get a sense of the shape of the data, the cleaning requirements, and preliminary considerations for building a predictive text algorithm.
The corpora can be found at www.corpora.heliohost.org. You will notice in my code that I have downloaded them locally to save time. I also used this list of profane words from GitHub to clean the corpora: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en. I have saved it locally in a .txt file named ‘badwords.txt’.
setwd("~/Desktop/final/en_US")
getwd()
library(slam)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
## R.utils v2.2.0 (2015-12-09) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(tm)
## Loading required package: NLP
library(stringr)
library(stringi)
library(RWeka)
library(knitr)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
profanities <- scan("badwords.txt", what="", sep="\n")
profanities <- as.character(profanities)
tfile <- "en_US.twitter.txt"
twitter_con <- file(tfile, open="r")
twitter_text <- readLines(twitter_con, encoding = 'UTF-8')
## Warning in readLines(twitter_con, encoding = "UTF-8"): line 167155 appears
## to contain an embedded nul
## Warning in readLines(twitter_con, encoding = "UTF-8"): line 268547 appears
## to contain an embedded nul
## Warning in readLines(twitter_con, encoding = "UTF-8"): line 1274086 appears
## to contain an embedded nul
## Warning in readLines(twitter_con, encoding = "UTF-8"): line 1759032 appears
## to contain an embedded nul
close(twitter_con)
nfile <- "en_US.news.txt"
news_con <- file(nfile, open="r")
news_text <- readLines(news_con, encoding = 'UTF-8')
close(news_con)
bfile <- "en_US.blogs.txt"
blogs_con <- file(bfile, open="r")
blogs_text <- readLines(blogs_con, encoding = 'UTF-8')
close(blogs_con)
I created a summary matrix to understand the size of the different English-language corpora. In total, they comprise ~556 MB, ~4.2 million lines, and over 100 million words!
summary_matrix<-matrix()
summary_matrix$size<-c(file.info(bfile)$size / (1024*1024), file.info(nfile)$size / (1024*1024), file.info(tfile)$size / (1024*1024))
## Warning in summary_matrix$size <- c(file.info(bfile)$size/(1024 * 1024), :
## Coercing LHS to a list
summary_matrix$lines<-c(countLines(bfile), countLines(nfile), countLines(tfile))
summary_matrix$words<-c(sum(stri_count_words(blogs_text)), sum(stri_count_words(news_text)), sum(stri_count_words(twitter_text)))
summary_matrix<-as.data.frame(summary_matrix)
rownames(summary_matrix)<-c("Blog", "News", "Twitter")
colnames(summary_matrix)<-c("NA", "Size (in MB)", "Line Count", "Word Count")
summary_matrix <- rbind(summary_matrix,colSums(summary_matrix))
rownames(summary_matrix)[4] <- "TOTAL"
summary_matrix
## NA Size (in MB) Line Count Word Count
## Blog NA 200.4242 899288 37546246
## News NA 196.2775 1010242 34762395
## Twitter NA 159.3641 2360148 30093369
## TOTAL NA 556.0658 4269678 102402010
One method for developing predictive text is to use an ‘n-gram model’: a representative corpus is used to determine the frequency of n-word sequences of text. Then, based on the text a user provides, the model can try to guess what the next word should be. For example, if a common four-gram in the corpus is “I want my MTV”, it would be reasonable for the model to predict “MTV” as the next word when I type “I want my”.
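To make this concrete, here is a minimal sketch (my own illustration, not part of the assignment code) of how a 4-gram frequency table could be used to guess the next word after a three-word prefix. The toy table and the predict_next() helper are hypothetical; the real frequency tables are built later in this report.
# toy 4-gram frequency table (made-up counts, for illustration only)
toy_4grams <- data.frame(
  word = c("i want my mtv", "i want my money", "at the end of"),
  freq = c(120, 45, 300),
  stringsAsFactors = FALSE
)
# given a three-word prefix, return the last word of the most frequent matching 4-gram
predict_next <- function(prefix, freqdf) {
  matches <- freqdf[startsWith(as.character(freqdf$word), paste0(prefix, " ")), ]
  if (nrow(matches) == 0) return(NA_character_)
  top <- as.character(matches$word[which.max(matches$freq)])
  tail(strsplit(top, " ")[[1]], 1)
}
predict_next("i want my", toy_4grams)  # returns "mtv"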
I wanted to get a sense of what common n-grams would look like without cleaning, and what they would look like if I took out ‘stop words’. Stop words are common fillers like “of”, “a”, or “the”. I saved both of these experiments as .csv files to demonstrate why cleaning the data was necessary, and why I left stop words in. Removing stop words produced n-grams that would likely not be helpful for prediction, since users type stop words constantly; for example, the 2nd most common 3-gram without stop words was “happy mothers day”.
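For reference, one plausible way to produce such a no-stopword variant is the same tm cleaning pipeline with a single extra removeWords() step using tm’s built-in English stop word list. This is a sketch of that idea, not the exact code behind ‘cleanedtable.csv’:
library(tm)
# sketch only: drop English stop words from an already-cleaned corpus
# (`cleaned_corpus` is a stand-in for a VCorpus built like `clean_join` below)
remove_stopwords <- function(cleaned_corpus) {
  tm_map(cleaned_corpus, removeWords, stopwords("english"))
}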
messy <- read.csv('messytable.csv')
cleaned <- read.csv('cleanedtable.csv')
mplot <- ggplot(messy[1:10, 2:3], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top Ten 3-grams with unclean data")
cplot <- ggplot(cleaned[1:10, 2:3], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top Ten 3-grams with no stopwords")
mplot
cplot
Using the tm package’s common cleaning functions, I created a master function for cleaning the corpora. Note that the order of the transformations matters: the list of profanities is entirely lowercase, so if ‘tolower’ did not come before the profanity removal, some bad words would have been left in.
Per the recommendation of the RWeka documentation, I also created tokenizer functions for the 2-, 3-, and 4-word tokens that I will be creating.
clean_text <- function(text){
  text <- tm_map(text, content_transformer(tolower))
  text <- tm_map(text, removeNumbers)
  text <- tm_map(text, removePunctuation)
  text <- tm_map(text, stripWhitespace)
  text <- tm_map(text, removeWords, profanities)
  return(text)
}
BiTokens <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TriTokens <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
QuadTokens <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))}
The full corpora are not necessary for this exercise and would take quite a bit of time to process, so I am using a random sample of 2% of each document and joining the samples together. I then apply the cleaning function.
set.seed(309)
blogsSample <- sample(blogs_text, length(blogs_text)*0.02)
newsSample <- sample(news_text, length(news_text)*0.02)
twitterSample <- sample(twitter_text, length(twitter_text)*0.02)
joined_text <- c(blogsSample, newsSample, twitterSample)
joined_text <- VCorpus(VectorSource(joined_text))
clean_join <- clean_text(joined_text)
For some reason, RWeka has trouble parallelizing across multiple cores on my Mac (OS X); I could only get the tokenizers to work by explicitly restricting R to a single core. Here, I create 2-, 3-, and 4-gram models. TermDocumentMatrix() produces a very sparse matrix, which I trim with removeSparseTerms(), collapse into per-term counts, and then turn into data frames sorted by frequency.
options(mc.cores = 1)
two_gram_tdm <- TermDocumentMatrix(clean_join, control=list(tokenize=BiTokens))
two_gram <- removeSparseTerms(two_gram_tdm, 0.9999)
two_freq <- sort(row_sums(two_gram, na.rm = T), decreasing=TRUE)
two_freqdf <- data.frame(word=names(two_freq), freq=two_freq)
three_gram_tdm <- TermDocumentMatrix(clean_join, control=list(tokenize=TriTokens))
three_gram <- removeSparseTerms(three_gram_tdm, 0.9999)
three_freq <- sort(row_sums(three_gram, na.rm = T), decreasing=TRUE)
three_freqdf <- data.frame(word=names(three_freq), freq=three_freq)
four_gram_tdm <- TermDocumentMatrix(clean_join, control=list(tokenize=QuadTokens))
four_gram <- removeSparseTerms(four_gram_tdm, 0.9999)
four_freq <- sort(row_sums(four_gram, na.rm = T), decreasing=TRUE)
four_freqdf <- data.frame(word=names(four_freq), freq=four_freq)
Here are charts of the top ten most common n-grams for each n-gram size.
twoplot <- ggplot(two_freqdf[1:10, ], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top Ten 2-grams")
threeplot <- ggplot(three_freqdf[1:10, ], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top Ten 3-grams")
fourplot <- ggplot(four_freqdf[1:10, ], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Top 10 4-grams")
twoplot
threeplot
fourplot
In trying to figure out which model to use (e.g., Katz’s backoff model), I reviewed some different smoothing methods from Stanford: http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
Some examples they give are Jelinek-Mercer, Witten-Bell, Good-Turing, and Kneser-Ney. I am not confident that the n-gram models from this analysis are highly predictive on their own, as they lack the broader context of what the user is trying to communicate.
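As a rough sketch of the direction I am leaning, here is a naive back-off lookup in the spirit of Katz’s backoff, but without proper discounting or redistribution of probability mass. It assumes the two_freqdf, three_freqdf, and four_freqdf data frames built above, plus a prefix-lookup helper like the hypothetical predict_next() sketched earlier in this report.
backoff_predict <- function(words) {
  # words: character vector of the most recent words the user has typed
  n <- length(words)
  if (n >= 3) {
    guess <- predict_next(paste(tail(words, 3), collapse = " "), four_freqdf)
    if (!is.na(guess)) return(guess)  # found a match in the 4-gram table
  }
  if (n >= 2) {
    guess <- predict_next(paste(tail(words, 2), collapse = " "), three_freqdf)
    if (!is.na(guess)) return(guess)  # fall back to the 3-gram table
  }
  predict_next(tail(words, 1), two_freqdf)  # last resort: the 2-gram table
}
# usage (input should be lowercased/cleaned the same way as the corpus):
# backoff_predict(c("at", "the", "end"))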
In going through this exercise, here are a few things I’ve learned: