DATA 607: Project 4
library(tidyverse)
library(R.utils)
library(tm)
library(caret)
library(magrittr)
library(e1071)
Project Task Overview
It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/
Steps Taken to Classify Emails as Ham or Spam
- Obtain Data
  - Download, unzip, and untar files from the provided resource
- Read and Store Text in the Files
  - Create a function to read text from each file in both the spam and ham data folders
  - Reduce the size of the datasets to shorten the run time of functions and to save memory
- Preprocess the Emails
  - Create a function that builds a corpus, performs the necessary text mining cleanup (remove punctuation, remove stop words, transform to lower case, etc.), and transforms the corpus to a document term matrix to allow for classification
- Create Training and Testing Data
  - Divide the data into a 70% training and a 30% testing dataframe
- Classify Emails
  - Use a Naive Bayes classifier to create a model that classifies whether an email is spam or ham
  - Generate a confusion matrix to display the results
Obtain Data
The data was collected via the online email corpus resource provided by Professor Catlin: https://spamassassin.apache.org/old/publiccorpus/. I specifically used the links https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2 (spam) and https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2 (ham).
# Obtain the expected files after downloading the data
url.spam <- "https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2"
spam1 <- "20030228_spam.tar.bz2"
spam2 <- "20030228_spam.tar"
url.ham <- "https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2"
ham1 <- "20030228_easy_ham.tar.bz2"
ham2 <- "20030228_easy_ham.tar"
Removing the files from the directory before trying to download them again allows for the code to be rerun automatically without having to manually delete the directories.
# Check if files exist
if (file.exists(spam1) | file.exists(ham1) | file.exists(spam2) | file.exists(ham2)) {
# Delete file if it exists
file.remove(spam1)
file.remove(spam2)
file.remove(ham1)
file.remove(ham2)
}
## [1] TRUE
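A slightly more defensive variant of this cleanup is sketched below. file.remove() warns for every path that does not exist, so filtering with file.exists() first keeps reruns quiet. The helper name remove_if_present is an assumption made for illustration, and the demonstration uses temporary files rather than the real archives.

```r
# Hypothetical helper: delete only the paths that currently exist.
remove_if_present <- function(paths) {
  present <- paths[file.exists(paths)]
  if (length(present) > 0) {
    file.remove(present)
  }
  invisible(present)
}

# Demonstration with temporary files instead of the downloaded archives.
tmp_a <- tempfile(fileext = ".tar")
tmp_b <- tempfile(fileext = ".tar")   # never created on disk
invisible(file.create(tmp_a))
removed <- remove_if_present(c(tmp_a, tmp_b))
```

In the project itself the call would presumably be remove_if_present(c(spam1, spam2, ham1, ham2)).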
Here I downloaded each archive to a file in the working directory. Because the archives have a .bz2 extension, I decompressed them with bunzip2() from R.utils, then untarred the resulting .tar files to extract the emails into the Project4_Data folder.
# download file
download.file(url.spam, destfile = spam1)
download.file(url.ham, destfile = ham1)
# unzip the bz2 file
bunzip2(spam1)
bunzip2(ham1)
# unzip the tar file
untar(spam2, exdir = "Project4_Data")
untar(ham2, exdir = "Project4_Data")
Here I store the path to the newly created folders and output the number of files in each folder.
spam_folder <- 'C:\\Users\\ericl\\OneDrive\\Documents\\CUNY MS in Data Science\\DATA-607\\R\\DATA607\\Project4_Data\\spam_2\\'
ham_folder <- 'C:\\Users\\ericl\\OneDrive\\Documents\\CUNY MS in Data Science\\DATA-607\\R\\DATA607\\Project4_Data\\easy_ham\\'
length(list.files(path = spam_folder))
## [1] 1397
length(list.files(path = ham_folder))
## [1] 2501
Read and Store text in the Files
I created a function to read and store the text in the email files. The function takes a folder of files and the class of the emails in it; I used 1 for spam emails and 0 for ham emails. It reads each file with read_lines() and collapses the lines into a single string, so the result has one row per email.
read.files <- function(folder_of_files, classification){
  files <- list.files(folder_of_files, full.names = TRUE)
  text.data <- list.files(folder_of_files) %>%
    as.data.frame() %>%
    set_colnames("file") %>%
    mutate(text = lapply(files, read_lines)) %>%
    unnest(c(text)) %>%
    mutate(class = classification) %>%
    group_by(file) %>%
    mutate(text = paste(text, collapse = " ")) %>%
    ungroup() %>%
    distinct()
  return(text.data)
}
spam.data <- read.files(spam_folder, 1)
ham.data <- read.files(ham_folder, 0)
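The same one-row-per-email shape can be produced with base R alone. The sketch below is an alternative written for illustration, not the function used in this project; the name read_folder and the temporary demo folder are assumptions.

```r
# Hypothetical base-R helper: one row per file, lines collapsed into one string.
read_folder <- function(folder, classification) {
  paths <- list.files(folder, full.names = TRUE)
  text <- vapply(
    paths,
    function(p) paste(readLines(p, warn = FALSE), collapse = " "),
    character(1)
  )
  data.frame(
    file = basename(paths),
    text = unname(text),
    class = classification,
    stringsAsFactors = FALSE
  )
}

# Demonstration on a temporary folder holding two small "emails".
demo_dir <- tempfile("demo_emails")
dir.create(demo_dir)
writeLines(c("Subject: hello", "See you at noon"), file.path(demo_dir, "0001"))
writeLines(c("Subject: WIN", "Click now"), file.path(demo_dir, "0002"))
demo.data <- read_folder(demo_dir, 0)
```

Collapsing inside vapply() avoids the unnest, group and distinct round trip, at the cost of not using the tidyverse pipeline style of the rest of the project.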
I decided to reduce both datasets to 500 emails each to decrease the runtime of later functions. The spam emails were chosen by random sample and the ham emails are the first 500 rows. I also combined the spam and ham data into a single dataframe.
set.seed(1999)
spam_index <- sample(seq(nrow(spam.data)), size = 500)
ham.data <- na.omit(ham.data)
ham.data <- ham.data[1:500,]
ham.data$class <- 0
spam.data <- spam.data[spam_index,]
email.data <- rbind(spam.data, ham.data)
Preprocess the emails to more easily classify
I replaced the non-graphic characters (anything outside [:graph:]) that were proving problematic to my analysis with spaces. I also created a function to carry out the necessary corpus preprocessing steps. The function removes numbers, punctuation, stop words, and extra white space, transforms the text to lowercase, stems the words, and returns a document term matrix.
email.data$text <- email.data$text %>%
  str_replace_all("[^[:graph:]]", " ")
clean.data <- function(dataframe){
  emails <- VectorSource(dataframe$text)
  emails <- VCorpus(emails)
  corpus <- emails %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(tolower) %>%
    tm_map(PlainTextDocument) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stripWhitespace) %>%
    tm_map(stemDocument) %>%
    DocumentTermMatrix()
  return(corpus)
}
email.data.dtm <- clean.data(email.data)
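To make the effect of each cleanup step concrete, here is a base-R approximation applied to a single string. It is only a sketch: it does not call tm, it skips stemming, and the function name clean_text, the four-word stop list, and the sample sentence are all assumptions for illustration.

```r
# Base-R approximation of the cleanup steps (not tm's implementation).
clean_text <- function(x, stop_words = c("the", "a", "to", "and")) {
  x <- gsub("[[:digit:]]+", "", x)                     # remove numbers
  x <- gsub("[[:punct:]]+", "", x)                     # remove punctuation
  x <- tolower(x)                                      # transform to lower case
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)               # drop the demo stop words
  x <- gsub("[[:space:]]+", " ", x)                    # collapse white space
  trimws(x)
}

cleaned <- clean_text("WIN $1000 now!!!  Reply to the OFFER and claim a prize.")
cleaned
# [1] "win now reply offer claim prize"
```

The order matters in the same way as in the tm pipeline: lowercasing before stop word removal lets a lowercase stop list match words such as "The" at the start of a sentence.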
Creating Training and Testing Data
To reduce the size of the document term matrix I excluded terms that were more than 95% sparse. I had to add the classification column from the email.data to the document term matrix of the email data.
email.data.classify <- email.data.dtm %>%
  removeSparseTerms(0.95) %>%
  as.matrix() %>%
  as.data.frame() %>%
  mutate(class = email.data$class)
email.data.classify$class <- factor(email.data.classify$class)
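As an illustration of what the 0.95 sparsity threshold means, the sketch below applies the same idea to a small hand-made count matrix in base R. The matrix dtm.toy and its term names are made up, and the filter is an approximation of what removeSparseTerms() does rather than its actual code: a term is kept only if the share of documents in which it is absent is below the threshold.

```r
# Toy document-term counts: 40 documents (rows) by 3 terms (columns).
dtm.toy <- cbind(
  free  = c(rep(1, 20), rep(0, 20)),   # present in 20 of 40 documents
  click = c(rep(2, 4),  rep(0, 36)),   # present in 4 of 40 documents
  zzz   = c(1, rep(0, 39))             # present in 1 of 40 documents
)

# Fraction of documents in which each term is absent.
sparsity <- colMeans(dtm.toy == 0)

# Keep only terms whose sparsity is below the 0.95 threshold.
dtm.kept <- dtm.toy[, sparsity < 0.95, drop = FALSE]
colnames(dtm.kept)
# [1] "free"  "click"
```

With 1000 emails, the same threshold would keep only terms that appear in more than 50 of them.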
I assigned 70% of the data to train the classification model (naiveBayes) and the remaining 30% to test the results of the model.
set.seed(1999)
train_index <- sample(seq(nrow(email.data.classify)), size = nrow(email.data.classify)*0.7)
train.data <- email.data.classify[train_index,]
test.data <- email.data.classify[-train_index,]
train.data[ , 1:583] <- ifelse(train.data[ , 1:583] == 0, "No", "Yes")
test.data[ , 1:583] <- ifelse(test.data[ , 1:583] == 0, "No", "Yes")
train.data[ , 584] <- ifelse(train.data[ , 584] == 0, "Ham", "Spam")
test.data[ , 584] <- ifelse(test.data[ , 584] == 0, "Ham", "Spam")
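The split and recoding above rely on the term columns sitting in positions 1 to 583 and the class in position 584. A variant that selects columns by name is sketched below on a toy data frame; the data, the helper name recode, and the use of factors with fixed levels are assumptions for illustration, not part of the original analysis.

```r
set.seed(1999)

# Toy stand-in for email.data.classify: term counts plus a 0/1 class factor.
toy <- data.frame(
  free  = c(0, 2, 0, 1, 0, 3, 0, 0, 1, 0),
  meet  = c(1, 0, 2, 0, 1, 0, 0, 1, 0, 2),
  class = factor(c(0, 1, 0, 1, 0, 1, 0, 0, 1, 0))
)

# 70/30 split by row index.
train_index <- sample(seq_len(nrow(toy)), size = floor(nrow(toy) * 0.7))

# Select term columns by name instead of by position.
term_cols <- setdiff(names(toy), "class")

# Recode counts as presence/absence and the class as Ham/Spam, using fixed
# factor levels so the training and testing sets share the same coding.
recode <- function(df) {
  df[term_cols] <- lapply(df[term_cols], function(x) {
    factor(ifelse(x == 0, "No", "Yes"), levels = c("No", "Yes"))
  })
  df$class <- factor(ifelse(df$class == 0, "Ham", "Spam"),
                     levels = c("Ham", "Spam"))
  df
}

train.toy <- recode(toy[train_index, ])
test.toy  <- recode(toy[-train_index, ])
```

Selecting by name means the code keeps working if a different sparsity threshold changes the number of term columns.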
Classify Emails
The naiveBayes() function will classify the emails.
nb.model <- naiveBayes(train.data, train.data$class)
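One point that may be worth checking: the call above passes all of train.data as the predictors, and train.data still contains the class column, so the label itself is among the features. A variant that drops the label from the predictors is sketched below on toy data. The data frame toy and the names predictors and nb.toy are made up for illustration, laplace = 1 is an added smoothing choice, and the results reported in this document come from the original call, not from this variant.

```r
library(e1071)

# Toy presence/absence features plus the label (hypothetical data).
toy <- data.frame(
  free  = factor(c("Yes", "Yes", "No", "No", "Yes", "No"), levels = c("No", "Yes")),
  meet  = factor(c("No", "No", "Yes", "Yes", "No", "Yes"), levels = c("No", "Yes")),
  class = factor(c("Spam", "Spam", "Ham", "Ham", "Spam", "Ham"))
)

# Train on the term columns only; the label is passed separately as y.
predictors <- setdiff(names(toy), "class")
nb.toy <- naiveBayes(toy[, predictors], toy$class, laplace = 1)

# Predict from the same predictor columns.
pred <- predict(nb.toy, newdata = toy[, predictors])
```

Applied to the project data, the equivalent call would presumably be naiveBayes(train.data[, -584], train.data$class), with the accuracy then re-evaluated on test.data.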
To predict the results of the model, it is important to use the testing data.
model.results <- predict(nb.model, newdata = test.data)
The results of the model are shown in a confusion matrix, with predicted classes in rows and actual classes in columns.
testing.results <- test.data$class
model.vs.testing <- table(model.results, testing.results)
colnames(model.vs.testing) <- c("Ham", "Spam")
rownames(model.vs.testing) <- c("Ham", "Spam")
confusionMatrix(model.vs.testing)
## Confusion Matrix and Statistics
##
##              testing.results
## model.results Ham Spam
##          Ham  147    1
##          Spam   7  145
##
## Accuracy : 0.9733
## 95% CI : (0.9481, 0.9884)
## No Information Rate : 0.5133
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9467
##
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.9545
## Specificity : 0.9932
## Pos Pred Value : 0.9932
## Neg Pred Value : 0.9539
## Prevalence : 0.5133
## Detection Rate : 0.4900
## Detection Prevalence : 0.4933
## Balanced Accuracy : 0.9738
##
## 'Positive' Class : Ham
##
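The headline statistics can be reproduced by hand from the four counts in the table. The sketch below copies those counts into a matrix (the object name cm is an assumption) and recomputes accuracy, sensitivity, and specificity with Ham as the positive class, matching the caret output.

```r
# Counts copied from the confusion matrix above
# (rows = predicted class, columns = actual class).
cm <- matrix(c(147, 7, 1, 145), nrow = 2,
             dimnames = list(predicted = c("Ham", "Spam"),
                             actual    = c("Ham", "Spam")))

accuracy    <- sum(diag(cm)) / sum(cm)                  # (147 + 145) / 300
sensitivity <- cm["Ham", "Ham"]   / sum(cm[, "Ham"])    # 147 / (147 + 7)
specificity <- cm["Spam", "Spam"] / sum(cm[, "Spam"])   # 145 / (145 + 1)

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
#    accuracy sensitivity specificity
#      0.9733      0.9545      0.9932
```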
Conclusion
The Naive Bayes model correctly classified 97.33% of the 300 test emails (292 of 300). It misclassified only 8 emails: 7 ham emails were labeled as spam and 1 spam email was labeled as ham.