Project overview

This R Markdown file performs a full document classification workflow for spam detection using the SpamAssassin public corpus (or any similarly structured spam/ham email folders). The pipeline covers reading the raw emails into a data frame, basic exploratory checks, text preprocessing and document-feature matrix construction, Naive Bayes and Random Forest classifiers, model comparison, and error inspection.

Important: Update params$spam_dir and params$ham_dir in the YAML header at the top of this document if your directories differ.
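For reference, a minimal sketch of how these parameters might be declared in the YAML header (the paths below are placeholders, not the actual directories):

---
params:
  spam_dir: "C:/path/to/spam_2"
  ham_dir: "C:/path/to/easy_ham"
---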

1. Load required packages
# Install only if missing (commented out by default)
# install.packages(c("tidyverse","quanteda","caret","e1071","naivebayes","randomForest","pROC","wordcloud","knitr","kableExtra","ggrepel","quanteda.textplots"))

suppressPackageStartupMessages({
  library(tidyverse)
  library(quanteda)      # fast tokenization and dfm
  library(quanteda.textplots)
  library(caret)         # train/test split, confusionMatrix
  library(e1071)         # naiveBayes (alternative)
  library(naivebayes)    # naive_bayes (fast)
  library(randomForest)  # random forest classifier
  library(pROC)          # ROC / AUC
  library(wordcloud)
  library(knitr)
  library(kableExtra)
  library(ggplot2)
  library(ggrepel)
})
2. Parameters / Paths
spam_dir <- params$spam_dir
ham_dir  <- params$ham_dir

cat("Spam folder:", spam_dir, "\n")
## Spam folder: C:/Users/taham/OneDrive/Documents/Data 607/Project 4/20050311_spam_2/spam_2
cat("Ham folder: ", ham_dir, "\n")
## Ham folder:  C:/Users/taham/OneDrive/Documents/Data 607/Project 4/20030228_easy_ham/easy_ham

Note: If you get encoding or path errors on Windows, use double backslashes (\\) or forward slashes (/) in paths. The default parameter values point to the directories above; change them if your layout differs.
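A quick fail-fast guard, as a sketch (dir.exists() is base R):

stopifnot(dir.exists(spam_dir), dir.exists(ham_dir))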

3. Utility functions: read emails into data.frame
# Read all files in a directory; collapse lines into single text string per file
read_emails_from_dir <- function(dir_path, label) {
  files <- list.files(dir_path, full.names = TRUE)
  # Exclude potential control files like "cmds"
  files <- files[basename(files) != "cmds"]
  # Read each file and collapse into single text field (handle encoding differences)
  df <- tibble(
    file = files,
    text = map_chr(files, ~ {
      txt <- tryCatch(readLines(.x, warn = FALSE, encoding = "UTF-8"),
                      error = function(e) tryCatch(readLines(.x, warn = FALSE, encoding = "latin1"),
                                                   error = function(e2) paste0(readBin(.x, what = "raw", n = file.info(.x)$size), collapse = "")))
      paste(txt, collapse = "\n")
    }),
    label = label
  )
  return(df)
}

# load spam and ham
spam_df <- read_emails_from_dir(spam_dir, "spam")
ham_df  <- read_emails_from_dir(ham_dir,  "ham")

# Combine and show counts
emails_df <- bind_rows(spam_df, ham_df) %>% mutate(doc_id = row_number())
emails_df %>% count(label) %>% knitr::kable()
|label |    n|
|:-----|----:|
|ham   | 2500|
|spam  | 1396|
4. Quick data checks and basic EDA
# Fix encoding so nchar() doesn't break on invalid bytes
emails_df$text <- iconv(emails_df$text, from = "", to = "UTF-8", sub = "byte")

# Basic statistics
emails_df <- emails_df %>%
  mutate(n_chars = nchar(text),
         n_words = stringr::str_count(text, "\\w+"))

summary_stats <- emails_df %>%
  group_by(label) %>%
  summarise(n = n(),
            mean_chars = mean(n_chars),
            median_chars = median(n_chars),
            mean_words = mean(n_words),
            median_words = median(n_words)) %>%
  arrange(label)

kable(summary_stats, caption = "Basic length statistics by class") %>% 
  kable_styling(full_width = FALSE)
Basic length statistics by class

|label |    n| mean_chars| median_chars| mean_words| median_words|
|:-----|----:|----------:|------------:|----------:|------------:|
|ham   | 2500|   3441.826|       3156.0|   576.2912|        540.0|
|spam  | 1396|   6341.804|       4108.5|   970.6712|        672.5|
# Histogram of message lengths by class
ggplot(emails_df, aes(x = n_words, fill = label)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 60) +
  scale_x_log10() +
  labs(title = "Distribution of message lengths (words) by class",
       x = "Words (log10 scale)", y = "Count", fill = "Label") +
  theme_minimal()

Interpretation: Ham emails are generally shorter (median ~3156 characters and ~540 words), while spam emails are noticeably longer (median ~4108 characters and ~672 words). The log-scaled histogram shows that spam has a heavier right tail, consistent with very long advertisement-style messages. These differences indicate that message length carries some class-related signal, though length alone is not sufficient for reliable classification (see the sketch below).
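To make that last point concrete, a minimal sketch (not run above) of a length-only logistic baseline using the n_words column computed earlier; resubstitution accuracy is optimistic and is shown for illustration only:

# Length-only baseline: predict spam from (log) word count alone
len_model <- glm(factor(label) ~ log10(n_words + 1), data = emails_df, family = binomial)
len_pred  <- ifelse(predict(len_model, type = "response") > 0.5, "spam", "ham")
mean(len_pred == emails_df$label)  # in-sample accuracy of the length-only rule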

5. Rationale for preprocessing choices

Rationale (short academic justification; a toy example follows the list):

- Stemming reduces sparsity by collapsing inflected and derived word forms to a common root (e.g., “clicking”, “clicked” -> “click”), improving the overlap between documents without substantially changing topical content.
- Stopword removal drops very high-frequency function words (like “the”, “and”) that carry little topical information and can dominate raw frequency counts.
- Removing numbers/punctuation and lowercasing simplifies the token set and reduces spurious features (e.g., assorted punctuation tokens).
- Trimming very rare and extremely frequent terms prevents overfitting and reduces model complexity; rare terms often add noise, while near-ubiquitous tokens are non-informative.
- TF (term frequency) is a good baseline; TF-IDF accounts for term specificity and is included below as a comparative baseline.
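A toy example (a sketch using the same quanteda calls as Section 6) of what these steps do to a single sentence:

toy <- tokens("Click HERE for 100% FREE offers! Clicking costs nothing.",
              remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
toy %>% tokens_tolower() %>% tokens_remove(stopwords("en")) %>% tokens_wordstem(language = "en")
# -> roughly: "click" "free" "offer" "click" "cost" "noth"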

6. Text preprocessing and Document-Feature Matrix (DFM)

We’ll use quanteda for tokenization and DFM creation; the steps below implement the rationale above.

# Create a corpus (quanteda) preserving original order to maintain mapping to emails_df
qcorpus <- corpus(emails_df$text, docvars = data.frame(label = emails_df$label, file = emails_df$file, doc_id = emails_df$doc_id))

# Tokenize and clean
tokens_clean <- tokens(qcorpus,
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_numbers = TRUE,
                       remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem(language = "en")

# Create dfm (document-feature matrix) with term frequency weighting
dfm_full <- dfm(tokens_clean)
dim(dfm_full)  # documents x features
## [1]  3896 90828
# Trim - keep terms that appear in at least 0.5% of documents and at most 99%
min_docfreq <- 0.005   # 0.5% of documents
max_docfreq <- 0.99    # 99% of documents

dfm_trimmed <- dfm_trim(dfm_full, min_docfreq = min_docfreq, max_docfreq = max_docfreq, docfreq_type = "prop")
cat("Trimmed dfm dims:", dim(dfm_trimmed), " (documents x features)\n")
## Trimmed dfm dims: 3896 3542  (documents x features)
6b. Exploratory features: most frequent terms & wordclouds
# Top terms bar plot (compute topfeatures once, reuse names and counts)
tf25 <- topfeatures(dfm_trimmed, 25)
top_features_df <- tibble(term = names(tf25), freq = as.numeric(tf25))

ggplot(top_features_df, aes(x = reorder(term, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 25 terms (overall)", x = NULL, y = "Frequency") +
  theme_minimal()

# Wordcloud - overall
set.seed(42)
textplot_wordcloud(dfm_trimmed, max_words = 150, color = RColorBrewer::brewer.pal(8, "Dark2"))

Interpretation: The most frequent spam-associated terms include “click,” “free,” “html,” and URL-related patterns such as “href,” reflecting advertising and promotional content. Ham emails show higher frequencies of mailing-list or conversational terms (such as dates, system-generated headers, and routine communication phrases). This confirms that vocabulary usage differs strongly by class, and supports the use of bag-of-words features for downstream modeling.

dfm_byclass <- dfm_group(dfm_trimmed, groups = docvars(dfm_trimmed, "label"))

# convert to term-frequency df for each class safely
tf_matrix_df <- convert(dfm_byclass, to = "data.frame")
# The first column may be 'document' or 'doc_id' depending on quanteda version
first_col_name <- colnames(tf_matrix_df)[1]
tf_matrix <- tf_matrix_df %>% column_to_rownames(first_col_name)
tf_matrix <- as.data.frame(t(tf_matrix))  # terms x classes

# Top terms per class table
top_terms_per_class <- map_df(colnames(tf_matrix), function(cl) {
  freqs <- sort(tf_matrix[[cl]], decreasing = TRUE)
  n <- min(20, length(freqs))
  tibble(label = cl, term = names(freqs)[1:n], freq = unname(freqs)[1:n])
})
top_terms_per_class %>% group_by(label) %>% slice(1:6) %>% knitr::kable()
|label |  freq|
|:-----|-----:|
|ham   | 14239|
|ham   | 10152|
|ham   |  9790|
|ham   |  8406|
|ham   |  7348|
|ham   |  6170|
|spam  | 32179|
|spam  | 32076|
|spam  | 16750|
|spam  | 15691|
|spam  | 11900|
|spam  | 11656|
# Wordclouds per class
par(mfrow = c(1,2))
set.seed(100)
textplot_wordcloud(dfm_trimmed[docvars(dfm_trimmed, "label") == "spam", ], max_words = 100, color = brewer.pal(8, "Reds"))
title("Spam wordcloud")

textplot_wordcloud(dfm_trimmed[docvars(dfm_trimmed, "label") == "ham", ], max_words = 100, color = brewer.pal(8, "Blues"))
title("Ham wordcloud")

par(mfrow = c(1,1))

Interpretation: Terms like “free”, “click”, “http” may be more frequent in spam, while ham may contain words associated with mailing lists, dates, and conversational text.
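To quantify which terms are distinctive of each class (rather than merely frequent), a keyness test is a natural follow-up. A sketch, assuming the quanteda.textstats package is installed (it is not loaded above):

library(quanteda.textstats)
key_stats <- textstat_keyness(dfm_byclass, target = "spam", measure = "chi2")
head(key_stats, 10)  # terms most over-represented in spam
tail(key_stats, 10)  # terms most over-represented in ham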

7. Prepare training and test data (80/20 split)
# Convert the trimmed dfm to a dense matrix for modeling.
# quanteda keeps dfms sparse by default; as.matrix() creates a dense copy, so be mindful of memory.
dfm_mat <- as.matrix(dfm_trimmed)
df_features <- as.data.frame(dfm_mat)
# Add label column preserving document order
df_features$label <- docvars(dfm_trimmed, "label")

# Shuffle and split (stratified)
set.seed(123)
train_index <- createDataPartition(df_features$label, p = 0.8, list = FALSE)
train_df <- df_features[train_index, ]
test_df  <- df_features[-train_index, ]

cat("Train size:", nrow(train_df), "Test size:", nrow(test_df), "\n")
## Train size: 3117 Test size: 779
table(train_df$label) %>% knitr::kable()
|Var1 | Freq|
|:----|----:|
|ham  | 2000|
|spam | 1117|
8. Model 1: Naive Bayes (TF)

Naive Bayes is a classical and strong baseline for text classification.
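Concretely, under the conditional-independence assumption the classifier picks

$$\hat{y} = \arg\max_{y \in \{\text{ham},\, \text{spam}\}} P(y) \prod_{j} p(x_j \mid y),$$

and with usekernel = FALSE the naivebayes package models each numeric feature $x_j$ as a class-conditional Gaussian, as the mean/sd tables in the printout below show.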

# Remove label column for predictors
x_train <- as.matrix(select(train_df, -label))
y_train <- factor(train_df$label)
x_test  <- as.matrix(select(test_df, -label))
y_test  <- factor(test_df$label)

# Train Naive Bayes on raw term-frequency counts
nb_model <- naive_bayes(x = x_train, y = y_train, usekernel = FALSE)
nb_model
## 
## ================================= Naive Bayes ==================================
## 
## Call:
## naive_bayes.default(x = x_train, y = y_train, usekernel = FALSE)
## 
## -------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## -------------------------------------------------------------------------------- 
##  
## A priori probabilities: 
## 
##       ham      spam 
## 0.6416426 0.3583574 
## 
## -------------------------------------------------------------------------------- 
##  
## Tables: 
## 
## -------------------------------------------------------------------------------- 
## :: ilug-admin@linux.i (Gaussian) 
## -------------------------------------------------------------------------------- 
##                   
## ilug-admin@linux.i        ham       spam
##               mean 0.15800000 0.03312444
##               sd   0.78698518 0.36877983
## 
## -------------------------------------------------------------------------------- 
## :: tue (Gaussian) 
## -------------------------------------------------------------------------------- 
##       
## tue         ham     spam
##   mean 1.238000 1.120859
##   sd   2.487660 2.138712
## 
## -------------------------------------------------------------------------------- 
## :: aug (Gaussian) 
## -------------------------------------------------------------------------------- 
##       
## aug          ham      spam
##   mean 1.7520000 0.9695613
##   sd   3.9656474 2.2810905
## 
## -------------------------------------------------------------------------------- 
## :: return-path (Gaussian) 
## -------------------------------------------------------------------------------- 
##            
## return-path        ham       spam
##        mean 1.00150000 0.85586392
##        sd   0.03871045 0.35895369
## 
## -------------------------------------------------------------------------------- 
## :: delivered-to (Gaussian) 
## -------------------------------------------------------------------------------- 
##             
## delivered-to       ham      spam
##         mean 1.3775000 0.5496867
##         sd   0.6067766 0.6282567
## 
## --------------------------------------------------------------------------------
## 
## # ... and 3536 more tables
## 
## --------------------------------------------------------------------------------
# Predict probabilities and labels
nb_pred <- predict(nb_model, x_test)
nb_prob <- predict(nb_model, x_test, type = "prob")

# Confusion matrix and metrics
cm_nb <- confusionMatrix(nb_pred, y_test, positive = "spam")
cm_nb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  348    0
##       spam 152  279
##                                           
##                Accuracy : 0.8049          
##                  95% CI : (0.7753, 0.8322)
##     No Information Rate : 0.6418          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6212          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6960          
##          Pos Pred Value : 0.6473          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.3582          
##          Detection Rate : 0.3582          
##    Detection Prevalence : 0.5533          
##       Balanced Accuracy : 0.8480          
##                                           
##        'Positive' Class : spam            
## 

Show precision, recall, F1:

precision <- cm_nb$byClass["Precision"]
recall    <- cm_nb$byClass["Recall"]
f1        <- cm_nb$byClass["F1"]

tibble(Model = "Naive Bayes (TF)", Precision = precision, Recall = recall, F1 = f1) %>% knitr::kable()
|Model            | Precision| Recall|        F1|
|:----------------|---------:|------:|---------:|
|Naive Bayes (TF) | 0.6473318|      1| 0.7859155|

ROC / AUC (we need numeric scores for the positive class):

if ("spam" %in% colnames(nb_prob)) {
  roc_nb <- roc(response = as.numeric(y_test == "spam"), predictor = nb_prob[ , "spam"])
} else {
  roc_nb <- roc(response = as.numeric(y_test == "spam"), predictor = nb_prob[ , 1])
}
plot.roc(roc_nb, main = "Naive Bayes (TF) ROC", col = "#1c61b6")

auc_nb <- auc(roc_nb)
auc_nb
## Area under the curve: 0.858

Interpretation: Naive Bayes reaches an accuracy of 0.8049, with Precision = 0.647, Recall = 1.00, and F1 = 0.786. The model recovers every spam email (perfect recall) but misclassifies many ham emails as spam (low precision). This behavior is typical of Naive Bayes when the class-conditional independence assumption is violated in high-dimensional text data; in addition, naive_bayes() here models the integer term counts as Gaussian (visible in the mean/sd tables above), which is a poor fit for sparse counts. The ROC AUC of 0.858 indicates moderate discriminative ability, with clear room for improvement.
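One likely culprit is that Gaussian treatment of counts. As a sketch (not run here), the naivebayes package also provides multinomial_naive_bayes(), which models term counts directly; laplace = 1 is an assumed smoothing value, not a tuned one:

# Multinomial NB on the same train/test matrices (sketch)
mnb_model <- multinomial_naive_bayes(x = x_train, y = y_train, laplace = 1)
mnb_pred  <- predict(mnb_model, newdata = x_test)
confusionMatrix(mnb_pred, y_test, positive = "spam")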

8b. TF-IDF baseline (Naive Bayes on TF-IDF)

We include a TF-IDF baseline for comparison.
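For reference, dfm_tfidf()'s default weighting multiplies the raw count by a base-10 inverse document frequency:

$$w_{d,t} = \mathrm{tf}_{d,t} \times \log_{10}\frac{N}{\mathrm{df}_t},$$

where $N$ is the number of documents and $\mathrm{df}_t$ is the number of documents containing term $t$.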

# Compute TF-IDF on the trimmed dfm
dfm_tfidf_obj <- dfm_tfidf(dfm_trimmed)
df_tfidf_mat <- as.matrix(dfm_tfidf_obj)
df_tfidf <- as.data.frame(df_tfidf_mat)
df_tfidf$label <- docvars(dfm_tfidf_obj, "label")

# Use the same train/test split indices (train_index)
x_train_tfidf <- as.matrix(select(df_tfidf[train_index, ], -label))
x_test_tfidf  <- as.matrix(select(df_tfidf[-train_index, ], -label))
y_train_tfidf <- factor(df_tfidf$label[train_index])
y_test_tfidf  <- factor(df_tfidf$label[-train_index])

# Train Naive Bayes on TF-IDF
nb_tfidf_model <- naive_bayes(x = x_train_tfidf, y = y_train_tfidf, usekernel = FALSE)
nb_tfidf_pred <- predict(nb_tfidf_model, x_test_tfidf)
nb_tfidf_prob <- predict(nb_tfidf_model, x_test_tfidf, type = "prob")

cm_nb_tfidf <- confusionMatrix(nb_tfidf_pred, y_test_tfidf, positive = "spam")
cm_nb_tfidf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  306    0
##       spam 194  279
##                                         
##                Accuracy : 0.751         
##                  95% CI : (0.719, 0.781)
##     No Information Rate : 0.6418        
##     P-Value [Acc > NIR] : 4.253e-11     
##                                         
##                   Kappa : 0.5305        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 1.0000        
##             Specificity : 0.6120        
##          Pos Pred Value : 0.5899        
##          Neg Pred Value : 1.0000        
##              Prevalence : 0.3582        
##          Detection Rate : 0.3582        
##    Detection Prevalence : 0.6072        
##       Balanced Accuracy : 0.8060        
##                                         
##        'Positive' Class : spam          
## 
# ROC/AUC
roc_nb_tfidf <- roc(response = as.numeric(y_test_tfidf == "spam"), predictor = nb_tfidf_prob[, "spam"])
auc_nb_tfidf <- auc(roc_nb_tfidf)
auc_nb_tfidf
## Area under the curve: 0.81

Quick comparison: with this Gaussian Naive Bayes, TF-IDF weighting actually reduces precision (0.590 vs 0.647) and AUC (0.810 vs 0.858) relative to raw term frequencies; both variants are included in the model comparison table below.

9. Model 2: Random Forest (trained on top-K features matrix)

Random forests are more computationally intensive, so we limit the features to the top-K most frequent terms (selected on the training data only, to avoid leakage from the test set) to reduce memory use and training time. We train on matrix inputs rather than the formula interface, which is slow on wide data.

# choose top K features by overall frequency using training data only
term_sums_train <- colSums(select(train_df, -label))
K <- 1000  # adjust as needed
top_terms <- names(sort(term_sums_train, decreasing = TRUE))[1:min(K, length(term_sums_train))]

x_train_rf <- as.matrix(train_df %>% select(all_of(top_terms)))
x_test_rf  <- as.matrix(test_df  %>% select(all_of(top_terms)))
y_train_rf <- y_train
y_test_rf  <- y_test

cat("RF features:", length(top_terms), "\n")
## RF features: 1000

Train a random forest (on matrix):

set.seed(234)
rf_model <- randomForest(x = x_train_rf, y = y_train_rf, ntree = 200, importance = TRUE)
print(rf_model)
## 
## Call:
##  randomForest(x = x_train_rf, y = y_train_rf, ntree = 200, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 31
## 
##         OOB estimate of  error rate: 0.58%
## Confusion matrix:
##       ham spam class.error
## ham  1996    4  0.00200000
## spam   14 1103  0.01253357
rf_pred <- predict(rf_model, x_test_rf)
cm_rf <- confusionMatrix(rf_pred, y_test_rf, positive = "spam")
cm_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  499    2
##       spam   1  277
##                                           
##                Accuracy : 0.9961          
##                  95% CI : (0.9888, 0.9992)
##     No Information Rate : 0.6418          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9916          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9928          
##             Specificity : 0.9980          
##          Pos Pred Value : 0.9964          
##          Neg Pred Value : 0.9960          
##              Prevalence : 0.3582          
##          Detection Rate : 0.3556          
##    Detection Prevalence : 0.3569          
##       Balanced Accuracy : 0.9954          
##                                           
##        'Positive' Class : spam            
## 

Show RF feature importance (top 20):

imp <- importance(rf_model)
imp_df <- as.data.frame(imp) %>% rownames_to_column("term") %>% arrange(desc(MeanDecreaseGini))
imp_df %>% slice(1:20) %>% knitr::kable()
|term            |      ham|     spam| MeanDecreaseAccuracy| MeanDecreaseGini|
|:---------------|--------:|--------:|--------------------:|----------------:|
|jul             | 9.210616| 7.819458|             9.488142|         60.52765|
|ist             | 5.441118| 5.779723|             6.657390|         54.31205|
|jalapeno        | 5.134075| 6.245511|             6.225045|         49.12091|
|imap            | 5.750577| 4.232255|             5.972405|         43.39703|
|localhost       | 5.027364| 3.629080|             5.413346|         35.89660|
|click           | 6.672682| 4.520735|             6.817995|         34.92019|
|fetchmail-5.9.0 | 4.402641| 4.083347|             4.933638|         33.86985|
|br              | 5.536451| 3.649553|             5.704030|         32.91306|
|                | 3.908763| 4.830615|             4.791449|         30.74311|
|href            | 5.328216| 4.177501|             5.502422|         30.54192|
|jmason.org      | 4.232763| 5.397292|             5.573973|         29.57949|
|                | 6.153284| 5.101657|             6.276801|         25.12110|
|x-keyword       | 4.295148| 4.287863|             4.862084|         24.59174|
|sep             | 5.949370| 6.842204|             7.042187|         24.35211|
|single-drop     | 4.076746| 2.849343|             4.218388|         23.93181|
|html            | 4.252411| 2.415504|             3.727842|         21.16160|
|remov           | 5.789474| 5.137593|             6.309525|         19.85000|
|p               | 4.598776| 1.973599|             4.633249|         19.78174|
|@localhost      | 4.588750| 2.355616|             4.606266|         18.77095|
|jun             | 3.717090| 4.416518|             4.609552|         17.25340|
ggplot(imp_df %>% slice(1:20), aes(reorder(term, MeanDecreaseGini), MeanDecreaseGini)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(title = "Top 20 important terms (Random Forest)", x = NULL, y = "MeanDecreaseGini") +
  theme_minimal()

Compute precision/recall/F1 and ROC:

# Precision/Recall/F1
tibble(
  Model = "Random Forest",
  Precision = cm_rf$byClass["Precision"],
  Recall    = cm_rf$byClass["Recall"],
  F1        = cm_rf$byClass["F1"]
) %>% knitr::kable()
|Model         | Precision|    Recall|       F1|
|:-------------|---------:|---------:|--------:|
|Random Forest | 0.9964029| 0.9928315| 0.994614|
# Predict probabilities on the test set using the matrix interface
rf_prob_spam <- predict(rf_model, newdata = x_test_rf, type = "prob")[, "spam"]

# Compute ROC
roc_rf <- roc(response = as.numeric(y_test_rf == "spam"), predictor = rf_prob_spam)
plot.roc(roc_rf, main = "Random Forest ROC", col = "#b21c1c")

auc_rf <- auc(roc_rf)
auc_rf
## Area under the curve: 0.9999

Interpretation: The Random Forest performs extremely well, achieving Accuracy = 0.996, Precision = 0.996, Recall = 0.993, and F1 = 0.995, with only three misclassifications in the entire test set; the AUC of 0.9999 confirms near-perfect separation. The top features mix promotional vocabulary (“click,” “html,” “href,” “remov”) with corpus-specific header artifacts (“jalapeno,” “fetchmail-5.9.0,” “jmason.org”), so part of the near-perfect score likely reflects structural regularities of this particular corpus rather than spam content alone. Compared to Naive Bayes, Random Forest captures far richer nonlinear interactions among terms.
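Because false positives (legitimate mail flagged as spam) are usually costlier than false negatives, the default 0.5 probability cutoff need not be the right operating point. A sketch using pROC (the 0.8 cutoff is an arbitrary illustration, not a recommendation):

# Youden-optimal threshold on the ROC curve
coords(roc_rf, "best", ret = c("threshold", "specificity", "sensitivity"))
# A stricter cutoff that protects ham at the cost of some spam recall
rf_pred_strict <- factor(ifelse(rf_prob_spam > 0.8, "spam", "ham"), levels = levels(y_test_rf))
table(Predicted = rf_pred_strict, Reference = y_test_rf)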

10. Model comparison
# Collect metrics for comparison
nb_prec <- as.numeric(cm_nb$byClass["Precision"])
nb_rec  <- as.numeric(cm_nb$byClass["Recall"])
nb_f1   <- as.numeric(cm_nb$byClass["F1"])
nb_auc  <- as.numeric(auc_nb)

nb_tfidf_prec <- as.numeric(cm_nb_tfidf$byClass["Precision"])
nb_tfidf_rec  <- as.numeric(cm_nb_tfidf$byClass["Recall"])
nb_tfidf_f1   <- as.numeric(cm_nb_tfidf$byClass["F1"])
nb_tfidf_auc  <- as.numeric(auc_nb_tfidf)

rf_prec <- as.numeric(cm_rf$byClass["Precision"])
rf_rec  <- as.numeric(cm_rf$byClass["Recall"])
rf_f1   <- as.numeric(cm_rf$byClass["F1"])
rf_auc  <- as.numeric(auc_rf)

results_df <- tibble(
  Model = c("Naive Bayes (TF)", "Naive Bayes (TF-IDF)", "Random Forest"),
  Precision = c(nb_prec, nb_tfidf_prec, rf_prec),
  Recall = c(nb_rec, nb_tfidf_rec, rf_rec),
  F1 = c(nb_f1, nb_tfidf_f1, rf_f1),
  AUC = c(nb_auc, nb_tfidf_auc, rf_auc)
)

results_df %>% knitr::kable(digits = 4)
|Model                | Precision| Recall|     F1|    AUC|
|:--------------------|---------:|------:|------:|------:|
|Naive Bayes (TF)     |    0.6473| 1.0000| 0.7859| 0.8580|
|Naive Bayes (TF-IDF) |    0.5899| 1.0000| 0.7420| 0.8100|
|Random Forest        |    0.9964| 0.9928| 0.9946| 0.9999|
# barplot for primary metrics
results_long <- results_df %>% pivot_longer(cols = Precision:F1, names_to = "Metric", values_to = "Value")

ggplot(results_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "Model performance comparison", y = "Score") +
  theme_minimal()

Interpretation: The Random Forest substantially outperforms Naive Bayes across all metrics. Naive Bayes offers perfect spam recall but poor precision, producing many false positives; Random Forest achieves high precision and high recall simultaneously. On this dataset, the tree ensemble handles sparse, high-dimensional text features far more effectively than the Gaussian Naive Bayes baseline, and it is clearly the model to prefer for operational use, subject to the leakage caveat noted above.

11. Example predictions (inspect errors / false positives & false negatives)
# Reconstruct test indices relative to dfm_trimmed
test_index <- setdiff(seq_len(nrow(dfm_trimmed)), train_index)

# Build table with original text and predictions
inspect_df <- tibble(
  id         = test_index,
  true_label = y_test,
  pred_nb    = nb_pred,
  pred_rf    = rf_pred,
  text       = emails_df$text[test_index]
)

# Add short preview for readability
inspect_df <- inspect_df %>%
  mutate(text_preview = str_replace_all(substr(text, 1, 160), "\\s+", " "))

# Filter only misclassified by Naive Bayes and label error type
errors_df <- inspect_df %>%
  filter((pred_nb == "spam" & true_label == "ham") |
         (pred_nb == "ham" & true_label == "spam")) %>%
  mutate(error_type = case_when(
      pred_nb == "spam" & true_label == "ham" ~ "False Positive (NB)",
      pred_nb == "ham" & true_label == "spam" ~ "False Negative (NB)"
  )) %>%
  slice(1:10) %>%
  select(id, true_label, pred_nb, pred_rf, error_type, text_preview)

# Print as a table (kept separate so errors_df remains a plain tibble)
errors_df %>% knitr::kable(caption = "Example errors by Naive Bayes (showing first 10)")
Example errors by Naive Bayes (showing first 10)

|   id|true_label |pred_nb |pred_rf |error_type          |text_preview |
|----:|:----------|:-------|:-------|:-------------------|:------------|
| 1481|ham        |spam    |ham     |False Positive (NB) |From Mon Sep 2 16:22:38 2002 Return-Path: Delivered-To: Received: from localhost (loca |
| 1569|ham        |spam    |ham     |False Positive (NB) |From Wed Aug 28 10:53:37 2002 Return-Path: Delivered-To: Received: from localhost ( |
| 1583|ham        |spam    |ham     |False Positive (NB) |From Wed Aug 28 10:55:28 2002 Return-Path: Delivered-To: Received: from localhost (lo |
| 1631|ham        |spam    |ham     |False Positive (NB) |From Wed Aug 28 13:55:31 2002 Return-Path: Delivered-To: Received: from localhost (localhost [1 |
| 1633|ham        |spam    |ham     |False Positive (NB) |From Wed Aug 28 14:05:36 2002 Return-Path: Delivered-To: Received: from localhost (loca |
| 1661|ham        |spam    |ham     |False Positive (NB) |From Wed Oct 9 10:55:52 2002 Return-Path: Delivered-To: Received: from localho |
| 1698|ham        |spam    |ham     |False Positive (NB) |From Mon Aug 26 15:32:24 2002 Return-Path: Delivered-To: Received: from localhost (loca |
| 1704|ham        |spam    |ham     |False Positive (NB) |From Mon Aug 26 16:57:05 2002 Return-Path: Delivered-To: Received: from localhost (loca |
| 1720|ham        |spam    |ham     |False Positive (NB) |From Mon Aug 26 22:18:19 2002 Return-Path: Delivered-To: Received: from localhost (loca |
| 1734|ham        |spam    |ham     |False Positive (NB) |From Tue Aug 27 03:46:39 2002 Return-Path: Delivered-To: Received: from localhost (loca |

To inspect the raw text of a particular example, locate it via its id and print the corresponding emails_df$text entry to examine why the model erred.
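For example, a sketch for the first false positive listed above (ids index rows of emails_df):

cat(substr(emails_df$text[1481], 1, 1000))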

12. Notes, limitations, and suggestions for improvements
13. Conclusions

Conclusions: This project implemented a full supervised document-classification pipeline for spam detection, including preprocessing, tokenization, feature construction, exploratory analysis, and model evaluation. Naive Bayes provided a strong baseline but showed limitations, achieving high recall at the cost of misclassifying many ham emails. The Random Forest model demonstrated near-perfect performance, correctly identifying almost all messages and achieving an AUC of 0.9999.

These results indicate that more expressive, nonlinear models better capture the structure of email text. Future enhancements such as n-grams, character-level features, and hyperparameter tuning could further strengthen the classifier (TF-IDF weighting was already tried above and did not help this Naive Bayes). The Random Forest already performs near-ceiling on this corpus, though the header-artifact features noted in Section 9 should be re-examined for dataset-specific leakage before treating it as deployment-ready.
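As a sketch of one such enhancement, bigram features can be added with quanteda before re-running the split and models (the trim thresholds repeat the assumptions used in Section 6):

tokens_12 <- tokens_ngrams(tokens_clean, n = 1:2)  # unigrams + bigrams
dfm_12 <- dfm_trim(dfm(tokens_12), min_docfreq = 0.005, max_docfreq = 0.99, docfreq_type = "prop")
dim(dfm_12)  # expect many more features than the unigram dfm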

14. Reproducibility: Session Info
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggrepel_0.9.6             kableExtra_1.4.0         
##  [3] knitr_1.50                wordcloud_2.6            
##  [5] RColorBrewer_1.1-3        pROC_1.18.5              
##  [7] randomForest_4.7-1.2      naivebayes_1.0.0         
##  [9] e1071_1.7-16              caret_7.0-1              
## [11] lattice_0.22-6            quanteda.textplots_0.96.1
## [13] quanteda_4.3.1            lubridate_1.9.4          
## [15] forcats_1.0.0             stringr_1.5.1            
## [17] dplyr_1.1.4               purrr_1.1.0              
## [19] readr_2.1.5               tidyr_1.3.1              
## [21] tibble_3.3.0              ggplot2_3.5.1            
## [23] tidyverse_2.0.0          
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.2    timeDate_4041.110   
##  [4] farver_2.1.2         fastmap_1.2.0        digest_0.6.37       
##  [7] rpart_4.1.23         timechange_0.3.0     lifecycle_1.0.4     
## [10] survival_3.6-4       magrittr_2.0.3       compiler_4.4.1      
## [13] rlang_1.1.4          sass_0.4.9           tools_4.4.1         
## [16] yaml_2.3.10          data.table_1.16.0    labeling_0.4.3      
## [19] stopwords_2.3        xml2_1.3.7           plyr_1.8.9          
## [22] withr_3.0.2          nnet_7.3-19          grid_4.4.1          
## [25] stats4_4.4.1         future_1.34.0        globals_0.16.3      
## [28] scales_1.4.0         iterators_1.0.14     MASS_7.3-60.2       
## [31] cli_3.6.3            rmarkdown_2.29       generics_0.1.3      
## [34] rstudioapi_0.17.1    future.apply_1.11.3  reshape2_1.4.4      
## [37] tzdb_0.4.0           cachem_1.1.0         proxy_0.4-27        
## [40] splines_4.4.1        parallel_4.4.1       vctrs_0.6.5         
## [43] hardhat_1.4.1        Matrix_1.7-0         jsonlite_2.0.0      
## [46] hms_1.1.3            listenv_0.9.1        systemfonts_1.2.1   
## [49] foreach_1.5.2        gower_1.0.2          jquerylib_0.1.4     
## [52] recipes_1.1.1        glue_1.8.0           parallelly_1.38.0   
## [55] codetools_0.2-20     stringi_1.8.4        gtable_0.3.6        
## [58] pillar_1.10.1        htmltools_0.5.8.1    ipred_0.9-15        
## [61] lava_1.8.1           R6_2.6.1             evaluate_1.0.3      
## [64] SnowballC_0.7.1      bslib_0.9.0          class_7.3-22        
## [67] Rcpp_1.0.13          fastmatch_1.1-6      svglite_2.1.3       
## [70] nlme_3.1-164         prodlim_2024.06.25   xfun_0.51           
## [73] pkgconfig_2.0.3      ModelMetrics_1.2.2.2