library(tm)
library(caret)
library(e1071)
set.seed(123)

Overview

This is project four of the Fall 2024 edition of DATA 607 at the CUNY School of Professional Studies. The assignment states:

“It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/”

I have taken one spam and one ham dataset from the provided link.

Data Organization

This first chunk defines the locations of the spam and ham email folders and converts their contents into usable datasets. text_sweep removes all non-ASCII characters to prevent encoding issues. That function is then used inside emailer, a function that reads each email, collapses its lines into a single string, applies text_sweep to remove non-ASCII characters, and returns a vector of the processed emails.

# Local folders holding the raw spam and ham emails downloaded from the SpamAssassin corpus
spam_folder <- "/Users/uwsthoughts/Desktop/github_sync/data_science_masters_work/2024_Fall/data_607_data_management/project_four/spam"
ham_folder <- "/Users/uwsthoughts/Desktop/github_sync/data_science_masters_work/2024_Fall/data_607_data_management/project_four/ham"

spam <- list.files(spam_folder, full.names = TRUE)
ham <- list.files(ham_folder, full.names = TRUE)

# Strip non-ASCII characters so downstream text processing does not hit encoding issues
text_sweep <- function(text) {
  iconv(text, from = "UTF-8", to = "ASCII", sub = "")
}

# Read each email file, collapse its lines into a single string, and clean it with
# text_sweep; files that fail to read fall back to an empty string
emailer <- function(files) sapply(files, function(f) {
  tryCatch({
    text <- paste(readLines(f, warn = FALSE, encoding = "UTF-8"), collapse = " ")
    text_sweep(text)
  }, error = function(e) {
    ""
  })
})

Model Training and Evaluation

The chunk starts by using the functions above to read and preprocess the emails. It then builds a structured DocumentTermMatrix by tokenizing the text and dropping sparse terms that appear in fewer than 1% of documents. The matrix is converted into a data frame with a label column marking each document as spam or ham. The data is then split into training and testing sets, a Naive Bayes classifier is fit on the training data, and the fitted model is used to predict labels for the test set. Performance is evaluated with a confusion matrix.

# Read and clean the raw spam and ham emails
corpus_spamus <- emailer(spam)
corpus_hamus <- emailer(ham)

corpus_d <- c(corpus_spamus, corpus_hamus)
labels <- c(rep("spam", length(corpus_spamus)), rep("ham", length(corpus_hamus)))

# Drop any emails that failed to read and came back as empty strings
valid_indices <- corpus_d != ""
corpus_d <- corpus_d[valid_indices]
labels <- labels[valid_indices]

corpus_w <- Corpus(VectorSource(corpus_d))
# Standard tm cleanup: lowercase, then strip punctuation, numbers, English stopwords, and extra whitespace
corpus_w <- tm_map(corpus_w, content_transformer(tolower))
corpus_w <- tm_map(corpus_w, removePunctuation)
corpus_w <- tm_map(corpus_w, removeNumbers)
corpus_w <- tm_map(corpus_w, removeWords, stopwords("en"))
corpus_w <- tm_map(corpus_w, stripWhitespace)

# Build the document-term matrix and drop terms that appear in fewer than 1% of documents
terminator <- DocumentTermMatrix(corpus_w)
terminator <- removeSparseTerms(terminator, 0.99)

# Convert to a data frame and attach the spam/ham labels
terminator_df <- as.data.frame(as.matrix(terminator))
terminator_df$label <- factor(labels)

# Stratified 80/20 train/test split
trainer <- createDataPartition(terminator_df$label, p = 0.8, list = FALSE)
train_x <- terminator_df[trainer, ]
test_y <- terminator_df[-trainer, ]

# Fit Naive Bayes on the training set, predict on the test set, and evaluate
very_naive <- naiveBayes(label ~ ., data = train_x)
spam_ham <- predict(very_naive, test_y)
confusionMatrix(spam_ham, test_y$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham   24    0
##       spam 486  100
##                                          
##                Accuracy : 0.2033         
##                  95% CI : (0.172, 0.2374)
##     No Information Rate : 0.8361         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0159         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.04706        
##             Specificity : 1.00000        
##          Pos Pred Value : 1.00000        
##          Neg Pred Value : 0.17065        
##              Prevalence : 0.83607        
##          Detection Rate : 0.03934        
##    Detection Prevalence : 0.03934        
##       Balanced Accuracy : 0.52353        
##                                          
##        'Positive' Class : ham            
## 

Analysis and Conclusion

My model showed a low overall accuracy of ~20%, which is unfortunate but also a good learning lesson. The model correctly flagged every spam email, but it could not effectively separate out the ham emails: most of the ham was misclassified as spam. The low sensitivity of ~4.7% (with ham as the positive class) reflects that inability to recognize ham and suggests the model is heavily biased toward predicting spam.
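
A likely contributor to this result is that the naiveBayes() function in e1071 treats the raw term counts in the document-term matrix as continuous predictors and models them with Gaussian densities, which is a poor fit for sparse word-count data. A common adjustment is to recode each term column as a "Yes"/"No" presence factor and add Laplace smoothing before fitting. The chunk below is a minimal sketch of that idea, reusing the terminator_df and trainer objects created above; it has not been run here, so any improvement would need to be verified.

binarize <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))

# Recode every term column as a presence/absence factor, keeping the label column as-is
binary_df <- as.data.frame(lapply(terminator_df[, names(terminator_df) != "label"], binarize))
binary_df$label <- terminator_df$label

train_b <- binary_df[trainer, ]
test_b <- binary_df[-trainer, ]

# laplace = 1 smooths away zero-probability terms that never appear in one class
nb_binary <- naiveBayes(label ~ ., data = train_b, laplace = 1)
confusionMatrix(predict(nb_binary, test_b), test_b$label)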