Introduction
It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
In this project I used emails classified as ‘spam’ or ‘ham’ to build an email classifier. There were several technical challenges in this assignment that I will address as we move through the methods and results.
Methods
- Load the libraries that we will initially need to parse the data
library(rio)
library(dplyr)
library(stringi)
library(stringr)
library(prettydoc)
library(readr)
#usethis::edit_r_environ()
- We are pulling the data directly from the online archive and unzipping it within R. To do otherwise would be sub-optimal considering the number of files we need to unzip and load.
I used some of the functionality of the rio package to create temporary directories to store my archive data
I used the download.file function to put my tarball in a temporary directory
I save out a list of file names that we will need to iterate over
I then untar my temporary file ‘tf’, but I am extracting it into my working directory, which isn’t ideal; I had trouble extracting it back into my temporary directory (a possible fix is sketched after the code below)
# create a temporary directory
td <- tempdir()

# create a temporary file
tf <- tempfile(tmpdir=td)

# download file from internet into temporary location
download.file("https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2", tf)

# list zip archive
file_names <- untar(tf, list=TRUE)
file_names <- file_names[2:length(file_names)]
untar(tf)
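As an aside, base R’s untar() accepts an exdir argument that controls where the archive is extracted, which should keep everything inside the temporary directory; a minimal sketch of the fix I didn’t get working at the time (file_paths is just an illustrative name):

# Sketch: extract into the temporary directory instead of the working directory
untar(tf, exdir = td)

# build full paths into the temp directory for the later read step
file_paths <- file.path(td, file_names)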
- We use a for loop to load and attempt to parse each file. Surprisingly, with all the new packages out there, read.csv does a pretty good job of extracting our files (a more defensive line-based reader is sketched after the output below)
If an email has been misparsed by read.csv I catch it when I end up with named row names and don’t parse it further
I then attempt to remove the email header by subsetting with a grep of common terms delimiting the transition from header to email body. This list isn’t exhaustive but removes a lot of unwanted header text
I use str_replace to remove all non-letter characters. I know that the tm package has this functionality, but there were some special characters that were throwing errors when I used it
all_spam <- data.frame()

# use when zip file has only one file
for(j in 1:length(file_names)) {
  try(data <- read.csv(file.path(getwd(), file_names[j]), encoding = "UTF-8"))
  #print(file_names[j])
  #don't try and parse emails where some of the text has been converted to row names
  try(if (!is.na(sum(as.numeric(rownames(data))))) {
    #Remove the header
    data <- as.data.frame(data[grep('X-Spam-Level:|Precedence: bulk|> >|Message-Id:|Message-ID:',
                                    data[, 1]):nrow(data), 1])
    colnames(data) <- c('data')
    all_spam <- rbind(all_spam, data)
  })
}
## Error in read.table(file = file, header = header, sep = sep, quote = quote, :
## duplicate 'row.names' are not allowed
## Error in nchar(sm[1L], type = "w"): invalid multibyte string, element 1
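Since read.csv trips over some of these raw email files (the duplicate row-name errors above), one defensive alternative is to read each message as plain lines with base readLines. This is a sketch rather than what I ran; read_email_body is a hypothetical helper, and it assumes the standard convention that the first blank line separates an email’s header from its body:

# Sketch: treat each message as plain text instead of forcing it through read.csv
read_email_body <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  blank <- which(lines == "")[1]  # first blank line marks the end of the header
  if (is.na(blank)) return(paste(lines, collapse = " "))
  paste(lines[(blank + 1):length(lines)], collapse = " ")
}

# usage: bodies <- vapply(file.path(getwd(), file_names), read_email_body, character(1))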
#Remove all numbers, punctuation, special characters
all_spam$data <- str_replace_all(all_spam$data, "[^a-zA-Z ]", " ")
# delete the files and directories
unlink(td)
unlink(tf)
- Creating my corpus
I use the tm package to create my corpus and clean up the text
I strip the white space
Cast the text to lower case
Remove English stopwords
Stem my words and cast the documents back to plain text
I create a meta tag for ham_spam (a quick sanity check of the result is sketched after the code below)
library(tm)

spam_corpus <- VCorpus(VectorSource(all_spam$data)) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stemDocument) %>%
  tm_map(PlainTextDocument)
meta(spam_corpus, tag = "ham_spam") <- "spam"
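A quick sanity check, not part of the original pipeline, that the cleaning and tagging behaved as expected:

# Spot-check one cleaned, stemmed document and confirm the meta tag took
inspect(spam_corpus[[1]])
head(meta(spam_corpus, tag = "ham_spam"))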
- Now for the ham
- I do the same thing that I did for the spam (a reusable helper that would avoid this duplication is sketched at the end of this subsection)
# create a temporary directory
td <- tempdir()

# create a temporary file
tf <- tempfile(tmpdir=td)

# download file from internet into temporary location
download.file("https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2", tf)

# list zip archive
file_names <- untar(tf, list=TRUE)
file_names <- file_names[2:length(file_names)]
untar(tf)
all_ham <- data.frame()

# use when zip file has only one file
for(i in 1:length(file_names)) {
  try(data <- read.csv(file.path(getwd(), file_names[i]), encoding = "UTF-8"))
  #print(file_names[i])
  #don't try and parse emails where some of the text has been converted to row names
  if(!is.na(sum(as.numeric(rownames(data))))){
    #Remove the header
    data <- as.data.frame(data[grep('X-Spam-Level:|Precedence: bulk|> >|Message-Id:|Message-ID:',
                                    data[, 1]):nrow(data), 1])
    colnames(data) <- c('data')
    all_ham <- rbind(all_ham, data)
  }
}
## Error in read.table(file = file, header = header, sep = sep, quote = quote, :
## duplicate 'row.names' are not allowed
#Remove all numbers, punctuation, special characters
all_ham$data <- str_replace_all(all_ham$data, "[^a-zA-Z ]", " ")
# delete the files and directories
unlink(td)
unlink(tf)
ham_corpus <- VCorpus(VectorSource(all_ham$data)) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stemDocument) %>%
  tm_map(PlainTextDocument)
meta(ham_corpus, tag = "ham_spam") <- "ham"
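Because the spam and ham pipelines differ only in the URL and the label, they could be collapsed into one helper. The sketch below (build_corpus is a hypothetical name) mirrors the download, untar, parse, and clean steps above, but skips the header-stripping grep for brevity:

# Sketch: one helper for both corpora; header stripping omitted for brevity
build_corpus <- function(url, label) {
  td <- tempdir()
  tf <- tempfile(tmpdir = td)
  download.file(url, tf)
  file_names <- untar(tf, list = TRUE)[-1]
  untar(tf, exdir = td)

  texts <- character(0)
  for (fn in file_names) {
    try({
      data <- read.csv(file.path(td, fn), encoding = "UTF-8")
      texts <- c(texts, as.character(data[, 1]))
    }, silent = TRUE)
  }
  texts <- str_replace_all(texts, "[^a-zA-Z ]", " ")

  corp <- VCorpus(VectorSource(texts)) %>%
    tm_map(stripWhitespace) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stemDocument) %>%
    tm_map(PlainTextDocument)
  meta(corp, tag = "ham_spam") <- label
  corp
}

# usage:
# spam_corpus <- build_corpus("https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2", "spam")
# ham_corpus  <- build_corpus("https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2", "ham")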
- Joining my corpora and viewing word frequencies
- I use the wordcloud package to view frequencies for the spam, ham, and combined corpora
joint_corpus <- c(ham_corpus, spam_corpus)

# Scramble the order
set.seed(1234)
joint_corpus <- joint_corpus[sample(c(1:length(joint_corpus)))]

library(wordcloud)

#Look at word clouds of frequent terms
temp <- wordcloud(joint_corpus, max.words = 100, random.order = FALSE, min.freq=500)
wordcloud(spam_corpus, max.words = 100, random.order = FALSE, min.freq=500)
wordcloud(ham_corpus, max.words = 100, random.order = FALSE, min.freq=500)

dtm_ham_spam <- unlist(meta(joint_corpus, "ham_spam"))
- Creating Document Term Matrices
I create DTMs for all three of my corpora
I use findFreqTerms to find the most common words for each matrix
It’s worth noting that findFreqTerms() does something I couldn’t do by converting the DTMs to a dense matrix and counting the columns: I ran out of memory (a sparse alternative is sketched after the code below)
I am also not able to run randomForest on the full matrix due to memory issues
I find the most common words in the ham corpus and subtract the common terms from the spam corpus to give me a workable list of columns
s_dtm <- spam_corpus %>%
  DocumentTermMatrix()
h_dtm <- ham_corpus %>%
  DocumentTermMatrix()
dtm <- joint_corpus %>%
  DocumentTermMatrix()

d_temp <- findFreqTerms(dtm, lowfreq=500)
s_temp <- findFreqTerms(s_dtm, lowfreq=500)
h_temp <- findFreqTerms(h_dtm, lowfreq=500)
f_temp <- setdiff(h_temp, s_temp)

dtm_short <- dtm[,f_temp]
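A side note on the memory issue: a DocumentTermMatrix is stored as a sparse slam simple_triplet_matrix, so term totals can be computed without ever calling as.matrix(). A sketch of the idea (slam is the sparse-matrix package tm builds on):

# Sketch: column sums on the sparse DTM, no densifying required
library(slam)
term_totals <- col_sums(dtm)                        # total count of each term
frequent <- names(term_totals[term_totals >= 500])  # roughly findFreqTerms(dtm, 500)
head(sort(term_totals, decreasing = TRUE))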
- Random Forest
Random Forest is a great non-parametric method for classification that gives pretty good performance with minimal tuning
I have to attach my spam/ham column to my matrix as a factor for RF to work
I do a 70/30 split on my dataset
I pass my outcome and predictors to RF as a formula (the matrix interface sketched after the code below avoids this step)
I run a minimal number of trees on my model as a demonstration
library(randomForest)
set.seed(999)
ind <- sample(2, nrow(dtm_short), replace = TRUE, prob = c(0.7, 0.3))
train <- dtm_short[ind==1,]
train <- as.matrix(train)
ind2 <- sample(nrow(dtm), replace = FALSE) # (not used below)
test <- as.matrix(dtm_short[ind==2,])

is_spam <- factor(dtm_ham_spam[ind==1])
is_spam_test <- factor(dtm_ham_spam[ind==2])

colnames(train) <- make.names(colnames(train))
colnames(test) <- make.names(colnames(test))
df <- data.frame(Is_Spam = is_spam, train)
df_test <- data.frame(Is_Spam = is_spam_test, test)

f <- formula(paste0("Is_Spam ~ ", paste0(colnames(train), collapse = "+")))
rf <- randomForest(f, df, ntree = 10) # Run the random forest model
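As an aside, randomForest also accepts predictors and response directly (its x/y interface), which sidesteps building the very wide formula; a sketch using the same objects:

# Sketch: matrix interface instead of a pasted-together formula
rf_xy <- randomForest(x = as.data.frame(train), y = is_spam, ntree = 10)
# pred_xy <- predict(rf_xy, newdata = as.data.frame(test))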
- Prediction with my test set
- I now have a model that I can run my test set through
RF_pred <- predict(rf, newdata=df_test, type="response")
Results
I use the confusionMatrix function from the caret package
With the low number of trees that I ran, Random Forest performs at the level of the prevalence, which is a floor for performance (a sketch with more trees follows the output below)
library(caret)
confusionMatrix(RF_pred, factor(df_test$Is_Spam),dnn=c("Prediction", "Reference"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 45342 12323
## spam 44 204
##
## Accuracy : 0.7865
## 95% CI : (0.7831, 0.7898)
## No Information Rate : 0.7837
## P-Value [Acc > NIR] : 0.05356
##
## Kappa : 0.0237
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.99903
## Specificity : 0.01628
## Pos Pred Value : 0.78630
## Neg Pred Value : 0.82258
## Prevalence : 0.78369
## Detection Rate : 0.78293
## Detection Prevalence : 0.99572
## Balanced Accuracy : 0.50766
##
## 'Positive' Class : ham
##
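Given the conclusion below that ntree = 10 was criminally low, the obvious follow-up, if memory allows, is simply more trees; a sketch (ntree = 100 is an arbitrary choice, and the improvement is an expectation, not a measured result):

# Sketch: refit with more trees and re-score the test set
rf_100 <- randomForest(f, df, ntree = 100)
pred_100 <- predict(rf_100, newdata = df_test, type = "response")
confusionMatrix(pred_100, factor(df_test$Is_Spam), dnn = c("Prediction", "Reference"))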
Conclusion
Most of the effort in any data science task is in the data gathering and processing. Getting the data to parse directly from the web turned out to be surprisingly tricky.
There is additional cleaning and parsing of the data that could have improved my model building. I could have extracted more information from the header, which I simply discarded.
Originally I had saved the header and flagged it in the metadata. I trimmed that part of the analysis when I realized just how long it would take to run models on the full data set.
I was also surprised when my laptop was unable to operate on the full matrix. This is something I could have done on AWS if this were a genuine project.
This was a lackluster performance by Random Forest, but I ran an almost criminally small number of trees and it still gave me performance on par with the prevalence, which I consider a win.