This R Markdown file performs a full document classification workflow
for spam detection using the SpamAssassin public corpus (or any
similarly structured spam/ham email folders). The pipeline includes:
- Loading raw email files from two directories (spam and ham)
- Corpus creation and cleaning (tokenization, stopword removal,
stemming / optionally lemmatization)
- Exploratory data analysis: message lengths, most frequent words,
word clouds, and per-class comparisons
- Feature engineering using document-feature matrices (DFM / term
frequency and a TF-IDF baseline)
- Training and evaluating multiple classifiers (Naive Bayes and Random
Forest)
- Evaluation metrics: confusion matrix, precision/recall/F1,
ROC/AUC
- Conclusions, interpretation, and reproducible session info
Important: Update the params$spam_dir and params$ham_dir values at the top of this document if your directories differ.
| 1. Load required packages |
# Install only if missing (commented out by default)
# install.packages(c("tidyverse","quanteda","caret","e1071","naivebayes","randomForest","pROC","wordcloud","knitr","kableExtra","ggrepel","quanteda.textplots"))
suppressPackageStartupMessages({
library(tidyverse)
library(quanteda) # fast tokenization and dfm
library(quanteda.textplots)
library(caret) # train/test split, confusionMatrix
library(e1071) # naiveBayes (alternative)
library(naivebayes) # naive_bayes (fast)
library(randomForest) # random forest classifier
library(pROC) # ROC / AUC
library(wordcloud)
library(knitr)
library(kableExtra)
library(ggplot2)
library(ggrepel)
})
| 2. Parameters: data directories |
spam_dir <- params$spam_dir
ham_dir <- params$ham_dir
cat("Spam folder:", spam_dir, "\n")
## Spam folder: C:/Users/taham/OneDrive/Documents/Data 607/Project 4/20050311_spam_2/spam_2
cat("Ham folder: ", ham_dir, "\n")
## Ham folder: C:/Users/taham/OneDrive/Documents/Data 607/Project 4/20030228_easy_ham/easy_ham
Note: If you get encoding or path errors on Windows, use double backslashes (\\) or forward slashes (/) in the paths. The default parameter values are set to the directories shown above; change them if needed.
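As a quick guard against path problems, the optional check below (it only uses the spam_dir and ham_dir objects already defined above) stops with an error before any files are read if either folder is missing:
# Optional sanity check: fail early if either configured folder does not exist
stopifnot(dir.exists(spam_dir), dir.exists(ham_dir))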
| 3. Utility functions: read emails into data.frame |
# Read all files in a directory; collapse lines into single text string per file
read_emails_from_dir <- function(dir_path, label) {
files <- list.files(dir_path, full.names = TRUE)
# Exclude potential control files like "cmds"
files <- files[basename(files) != "cmds"]
# Read each file and collapse into single text field (handle encoding differences)
df <- tibble(
file = files,
text = map_chr(files, ~ {
txt <- tryCatch(readLines(.x, warn = FALSE, encoding = "UTF-8"),
error = function(e) tryCatch(readLines(.x, warn = FALSE, encoding = "latin1"),
error = function(e2) paste0(readBin(.x, what = "raw", n = file.info(.x)$size), collapse = "")))
paste(txt, collapse = "\n")
}),
label = label
)
return(df)
}
# load spam and ham
spam_df <- read_emails_from_dir(spam_dir, "spam")
ham_df <- read_emails_from_dir(ham_dir, "ham")
# Combine and show counts
emails_df <- bind_rows(spam_df, ham_df) %>% mutate(doc_id = row_number())
emails_df %>% count(label) %>% knitr::kable()
| 4. Quick data checks and basic EDA |
# Fix encoding so nchar() doesn't break on invalid bytes
emails_df$text <- iconv(emails_df$text, from = "", to = "UTF-8", sub = "byte")
# Basic statistics
emails_df <- emails_df %>%
mutate(n_chars = nchar(text),
n_words = stringr::str_count(text, "\\w+"))
summary_stats <- emails_df %>%
group_by(label) %>%
summarise(n = n(),
mean_chars = mean(n_chars),
median_chars = median(n_chars),
mean_words = mean(n_words),
median_words = median(n_words)) %>%
arrange(label)
kable(summary_stats, caption = "Basic length statistics by class") %>%
kable_styling(full_width = FALSE)
Basic length statistics by class

| label | n    | mean_chars | median_chars | mean_words | median_words |
|-------|------|------------|--------------|------------|--------------|
| ham   | 2500 | 3441.826   | 3156.0       | 576.2912   | 540.0        |
| spam  | 1396 | 6341.804   | 4108.5       | 970.6712   | 672.5        |
# Histogram of message lengths by class
ggplot(emails_df, aes(x = n_words, fill = label)) +
geom_histogram(position = "identity", alpha = 0.6, bins = 60) +
scale_x_log10() +
labs(title = "Distribution of message lengths (words) by class",
x = "Words (log10 scale)", y = "Count", fill = "Label") +
theme_minimal()

Interpretation: The ham emails are generally shorter, with median
length ~3156 characters and ~540 words, while spam emails are noticeably
longer on average, with a median ~4108 characters and ~672 words. The
log-scaled histogram shows that spam tends to have a heavier tail,
suggesting the presence of very long advertisement-style messages. These
differences indicate that message length may carry class-related signal,
though length alone is not sufficient for reliable classification.
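As a rough, purely illustrative check of that claim, the length column alone can be scored with pROC (this is a sketch, not part of the main pipeline; it reuses the n_words column computed above):
# How much class signal does length alone carry?
roc_len <- roc(response = as.numeric(emails_df$label == "spam"),
               predictor = emails_df$n_words)
auc(roc_len)  # quantifies how well word count by itself separates spam from ham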
| 5. Rationale for preprocessing choices |
Rationale (short academic justification):
- Stemming reduces sparsity by collapsing inflected and derived word forms to a common root (e.g., “clicking”, “clicked” -> “click”), improving the overlap between documents without substantially changing topical content (a small illustration follows this list).
- Stopword removal drops very high-frequency function words (like “the”, “and”) that carry little topical information and can dominate raw frequency counts.
- Removing numbers/punctuation and lowercasing simplifies the token set and reduces spurious features (e.g., various punctuation tokens).
- Trimming very rare and extremely frequent terms prevents overfitting and reduces model complexity; rare terms often add noise, while extremely common tokens may be non-informative.
- TF (term frequency) is a good baseline; TF-IDF helps account for term specificity and is included below as a comparative baseline.
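A tiny illustration of what these steps do to a toy sentence (a sketch; the exact tokens depend on the stemmer and stopword list):
# Toy example of number/punctuation removal, lowercasing, stopword removal, and stemming
toy <- tokens("Clicking the FREE link? Click it 2 times!",
              remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem(language = "en")
as.list(toy)
# roughly: "click" "free" "link" "click" "time"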
| 6. Text preprocessing and Document-Feature Matrix (DFM) |
We’ll use quanteda for tokenization and dfm creation. Steps:
- Lowercase
- Remove numbers, punctuation
- Remove stopwords
- Stem words
- Trim features by document frequency
# Create a corpus (quanteda) preserving original order to maintain mapping to emails_df
qcorpus <- corpus(emails_df$text, docvars = data.frame(label = emails_df$label, file = emails_df$file, doc_id = emails_df$doc_id))
# Tokenize and clean
tokens_clean <- tokens(qcorpus,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_remove(pattern = stopwords("en")) %>%
tokens_wordstem(language = "en")
# Create dfm (document-feature matrix) with term frequency weighting
dfm_full <- dfm(tokens_clean)
dim(dfm_full) # documents x features
## [1] 3896 90828
# Trim - keep terms that appear in at least 0.5% of documents and at most 99%
min_docfreq <- 0.005 # 0.5% of documents
max_docfreq <- 0.99 # 99% of documents
dfm_trimmed <- dfm_trim(dfm_full, min_docfreq = min_docfreq, max_docfreq = max_docfreq, docfreq_type = "prop")
cat("Trimmed dfm dims:", dim(dfm_trimmed), " (documents x features)\n")
## Trimmed dfm dims: 3896 3542 (documents x features)
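For reference, the proportional thresholds translate into absolute document counts roughly as follows (a small sketch using quanteda's ndoc() on the full dfm):
# What the proportional trim thresholds mean in document counts
ndoc(dfm_full) * min_docfreq   # ~19.5 -> a term must appear in at least ~20 documents
ndoc(dfm_full) * max_docfreq   # ~3857 -> ...and in at most ~3857 documents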
| 6b. Exploratory features: most frequent terms & wordclouds |
# Top terms bar plot
top_features <- names(topfeatures(dfm_trimmed, 25))
top_features_df <- tibble(term = top_features, freq = as.numeric(topfeatures(dfm_trimmed, 25)))
ggplot(top_features_df, aes(x = reorder(term, freq), y = freq)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 25 terms (overall)", x = NULL, y = "Frequency") +
theme_minimal()

# Wordcloud - overall
set.seed(42)
textplot_wordcloud(dfm_trimmed, max_words = 150, color = RColorBrewer::brewer.pal(8, "Dark2"))

Interpretation: The most frequent spam-associated terms include
“click,” “free,” “html,” and URL-related patterns such as “href,”
reflecting advertising and promotional content. Ham emails show higher
frequencies of mailing-list or conversational terms (such as dates,
system-generated headers, and routine communication phrases). This
confirms that vocabulary usage differs strongly by class, and supports
the use of bag-of-words features for downstream modeling.
dfm_byclass <- dfm_group(dfm_trimmed, groups = docvars(dfm_trimmed, "label"))
# convert to term-frequency df for each class safely
tf_matrix_df <- convert(dfm_byclass, to = "data.frame")
# The first column may be 'document' or 'doc_id' depending on quanteda version
first_col_name <- colnames(tf_matrix_df)[1]
tf_matrix <- tf_matrix_df %>% column_to_rownames(first_col_name)
tf_matrix <- as.data.frame(t(tf_matrix)) # terms x classes
# Top terms per class table
top_terms_per_class <- map_df(colnames(tf_matrix), function(cl) {
freqs <- sort(tf_matrix[[cl]], decreasing = TRUE)
n <- min(20, length(freqs))
tibble(label = cl, term = names(freqs)[1:n], freq = unname(freqs)[1:n])
})
top_terms_per_class %>% group_by(label) %>% slice(1:6) %>% knitr::kable()
| label | freq  |
|-------|-------|
| ham   | 14239 |
| ham   | 10152 |
| ham   | 9790  |
| ham   | 8406  |
| ham   | 7348  |
| ham   | 6170  |
| spam  | 32179 |
| spam  | 32076 |
| spam  | 16750 |
| spam  | 15691 |
| spam  | 11900 |
| spam  | 11656 |
# Wordclouds per class
par(mfrow = c(1,2))
set.seed(100)
textplot_wordcloud(dfm_trimmed[docvars(dfm_trimmed,"label") == "spam", ], max_words = 100, colors = brewer.pal(8, "Reds"))
title("Spam wordcloud")

textplot_wordcloud(dfm_trimmed[docvars(dfm_trimmed,"label") == "ham", ], max_words = 100, colors = brewer.pal(8, "Blues"))
title("Ham wordcloud")

par(mfrow = c(1,1))
Interpretation: Terms like “free”, “click”, “http” may be more
frequent in spam, while ham may contain words associated with mailing
lists, dates, and conversational text.
| 7. Prepare training and test data (80/20 split) |
# Convert trimmed dfm to a dense matrix only for modeling (quanteda keeps dfm sparse by default; as.matrix will produce dense, be mindful for memory)
dfm_mat <- as.matrix(dfm_trimmed)
df_features <- as.data.frame(dfm_mat)
# Add label column preserving document order
df_features$label <- docvars(dfm_trimmed, "label")
# Shuffle and split (stratified)
set.seed(123)
train_index <- createDataPartition(df_features$label, p = 0.8, list = FALSE)
train_df <- df_features[train_index, ]
test_df <- df_features[-train_index, ]
cat("Train size:", nrow(train_df), "Test size:", nrow(test_df), "\n")
## Train size: 3117 Test size: 779
table(train_df$label) %>% knitr::kable()
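A quick optional check that the stratified split preserved the class balance (the two proportions should be nearly identical):
# Class proportions in the two partitions
prop.table(table(train_df$label))
prop.table(table(test_df$label))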
| 8. Model 1: Naive Bayes (TF) |
Naive Bayes is a classical and strong baseline for text
classification.
# Remove label column for predictors
x_train <- as.matrix(select(train_df, -label))
y_train <- factor(train_df$label)
x_test <- as.matrix(select(test_df, -label))
y_test <- factor(test_df$label)
# Train Naive Bayes on raw term-frequency counts
nb_model <- naive_bayes(x = x_train, y = y_train, usekernel = FALSE)
nb_model
##
## ================================= Naive Bayes ==================================
##
## Call:
## naive_bayes.default(x = x_train, y = y_train, usekernel = FALSE)
##
## --------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## --------------------------------------------------------------------------------
##
## A priori probabilities:
##
## ham spam
## 0.6416426 0.3583574
##
## --------------------------------------------------------------------------------
##
## Tables:
##
## --------------------------------------------------------------------------------
## :: ilug-admin@linux.i (Gaussian)
## --------------------------------------------------------------------------------
##
## ilug-admin@linux.i ham spam
## mean 0.15800000 0.03312444
## sd 0.78698518 0.36877983
##
## --------------------------------------------------------------------------------
## :: tue (Gaussian)
## --------------------------------------------------------------------------------
##
## tue ham spam
## mean 1.238000 1.120859
## sd 2.487660 2.138712
##
## --------------------------------------------------------------------------------
## :: aug (Gaussian)
## --------------------------------------------------------------------------------
##
## aug ham spam
## mean 1.7520000 0.9695613
## sd 3.9656474 2.2810905
##
## --------------------------------------------------------------------------------
## :: return-path (Gaussian)
## --------------------------------------------------------------------------------
##
## return-path ham spam
## mean 1.00150000 0.85586392
## sd 0.03871045 0.35895369
##
## --------------------------------------------------------------------------------
## :: delivered-to (Gaussian)
## --------------------------------------------------------------------------------
##
## delivered-to ham spam
## mean 1.3775000 0.5496867
## sd 0.6067766 0.6282567
##
## --------------------------------------------------------------------------------
##
## # ... and 3536 more tables
##
## --------------------------------------------------------------------------------
# Predict probabilities and labels
nb_pred <- predict(nb_model, x_test)
nb_prob <- predict(nb_model, x_test, type = "prob")
# Confusion matrix and metrics
cm_nb <- confusionMatrix(nb_pred, y_test, positive = "spam")
cm_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 348 0
## spam 152 279
##
## Accuracy : 0.8049
## 95% CI : (0.7753, 0.8322)
## No Information Rate : 0.6418
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6212
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.6960
## Pos Pred Value : 0.6473
## Neg Pred Value : 1.0000
## Prevalence : 0.3582
## Detection Rate : 0.3582
## Detection Prevalence : 0.5533
## Balanced Accuracy : 0.8480
##
## 'Positive' Class : spam
##
Show precision, recall, F1:
precision <- cm_nb$byClass["Precision"]
recall <- cm_nb$byClass["Recall"]
f1 <- cm_nb$byClass["F1"]
tibble(Model = "Naive Bayes (TF)", Precision = precision, Recall = recall, F1 = f1) %>% knitr::kable()
| Model            | Precision | Recall | F1        |
|------------------|-----------|--------|-----------|
| Naive Bayes (TF) | 0.6473318 | 1      | 0.7859155 |
ROC / AUC (we need numeric scores for the positive class):
if ("spam" %in% colnames(nb_prob)) {
roc_nb <- roc(response = as.numeric(y_test == "spam"), predictor = nb_prob[ , "spam"])
} else {
roc_nb <- roc(response = as.numeric(y_test == "spam"), predictor = nb_prob[ , 1])
}
plot.roc(roc_nb, main = "Naive Bayes (TF) ROC", col = "#1c61b6")

auc_nb <- auc(roc_nb)
auc_nb
## Area under the curve: 0.858
Interpretation: Naive Bayes reaches an accuracy of 0.8049, with
Precision = 0.647, Recall = 1.00, and F1 = 0.786. The model correctly
recovers all spam emails (high recall), but it misclassifies a large
number of ham emails as spam (lower precision). This behavior is typical
of naïve Bayes when class-conditional independence assumptions are
violated in high-dimensional text data. The ROC AUC of 0.858 indicates
moderate discriminative ability, but there is clear room for
improvement.
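Note that the printed model treats each term count as a Gaussian feature, because naive_bayes() received a numeric count matrix. A Bernoulli (presence/absence) event model with Laplace smoothing is a common alternative for text; the following is a minimal sketch reusing x_train / x_test, assuming bernoulli_naive_bayes() from the already-loaded naivebayes package (not run above):
# Sketch: Bernoulli Naive Bayes on binary presence/absence features
x_train_bin <- (x_train > 0) * 1   # 1 if the term occurs in the document, else 0
x_test_bin  <- (x_test  > 0) * 1
nb_bern <- bernoulli_naive_bayes(x = x_train_bin, y = y_train, laplace = 1)
nb_bern_pred <- predict(nb_bern, newdata = x_test_bin)
confusionMatrix(nb_bern_pred, y_test, positive = "spam")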
| 8b. TF-IDF baseline (Naive Bayes on TF-IDF) |
We include a TF-IDF baseline as requested.
# Compute TF-IDF on the trimmed dfm
dfm_tfidf_obj <- dfm_tfidf(dfm_trimmed)
df_tfidf_mat <- as.matrix(dfm_tfidf_obj)
df_tfidf <- as.data.frame(df_tfidf_mat)
df_tfidf$label <- docvars(dfm_tfidf_obj, "label")
# Use the same train/test split indices (train_index)
x_train_tfidf <- as.matrix(select(df_tfidf[train_index, ], -label))
x_test_tfidf <- as.matrix(select(df_tfidf[-train_index, ], -label))
y_train_tfidf <- factor(df_tfidf$label[train_index])
y_test_tfidf <- factor(df_tfidf$label[-train_index])
# Train Naive Bayes on TF-IDF
nb_tfidf_model <- naive_bayes(x = x_train_tfidf, y = y_train_tfidf, usekernel = FALSE)
nb_tfidf_pred <- predict(nb_tfidf_model, x_test_tfidf)
nb_tfidf_prob <- predict(nb_tfidf_model, x_test_tfidf, type = "prob")
cm_nb_tfidf <- confusionMatrix(nb_tfidf_pred, y_test_tfidf, positive = "spam")
cm_nb_tfidf
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 306 0
## spam 194 279
##
## Accuracy : 0.751
## 95% CI : (0.719, 0.781)
## No Information Rate : 0.6418
## P-Value [Acc > NIR] : 4.253e-11
##
## Kappa : 0.5305
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.6120
## Pos Pred Value : 0.5899
## Neg Pred Value : 1.0000
## Prevalence : 0.3582
## Detection Rate : 0.3582
## Detection Prevalence : 0.6072
## Balanced Accuracy : 0.8060
##
## 'Positive' Class : spam
##
# ROC/AUC
roc_nb_tfidf <- roc(response = as.numeric(y_test_tfidf == "spam"), predictor = nb_tfidf_prob[, "spam"])
auc_nb_tfidf <- auc(roc_nb_tfidf)
auc_nb_tfidf
## Area under the curve: 0.81
Quick comparison: on this data, TF-IDF slightly lowers precision relative to raw term frequency while recall stays at 1; it is included in the model comparison table below.
| 9. Model 2: Random Forest (trained on top-K features matrix) |
Random forest is more computationally intensive, so we limit the features to the top-K most frequent terms (selected using the training data only) to reduce memory use and training time, and we train on matrix inputs (the x/y interface), which is more efficient than the formula interface for this many columns.
# choose top K features by overall frequency using training data only
term_sums_train <- colSums(select(train_df, -label))
K <- 1000 # adjust as needed
top_terms <- names(sort(term_sums_train, decreasing = TRUE))[1:min(K, length(term_sums_train))]
x_train_rf <- as.matrix(train_df %>% select(all_of(top_terms)))
x_test_rf <- as.matrix(test_df %>% select(all_of(top_terms)))
y_train_rf <- y_train
y_test_rf <- y_test
cat("RF features:", length(top_terms), "\n")
## RF features: 1000
Train a random forest (on matrix):
set.seed(234)
rf_model <- randomForest(x = x_train_rf, y = y_train_rf, ntree = 200, importance = TRUE)
print(rf_model)
##
## Call:
## randomForest(x = x_train_rf, y = y_train_rf, ntree = 200, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 31
##
## OOB estimate of error rate: 0.58%
## Confusion matrix:
## ham spam class.error
## ham 1996 4 0.00200000
## spam 14 1103 0.01253357
rf_pred <- predict(rf_model, x_test_rf)
cm_rf <- confusionMatrix(rf_pred, y_test_rf, positive = "spam")
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 499 2
## spam 1 277
##
## Accuracy : 0.9961
## 95% CI : (0.9888, 0.9992)
## No Information Rate : 0.6418
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9916
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9928
## Specificity : 0.9980
## Pos Pred Value : 0.9964
## Neg Pred Value : 0.9960
## Prevalence : 0.3582
## Detection Rate : 0.3556
## Detection Prevalence : 0.3569
## Balanced Accuracy : 0.9954
##
## 'Positive' Class : spam
##
Show RF feature importance (top 20):
imp <- importance(rf_model)
imp_df <- as.data.frame(imp) %>% rownames_to_column("term") %>% arrange(desc(MeanDecreaseGini))
imp_df %>% slice(1:20) %>% knitr::kable()
| term | ham | spam | MeanDecreaseAccuracy | MeanDecreaseGini |
|------|----------|----------|----------|----------|
| jul | 9.210616 | 7.819458 | 9.488142 | 60.52765 |
| ist | 5.441118 | 5.779723 | 6.657390 | 54.31205 |
| jalapeno | 5.134075 | 6.245511 | 6.225045 | 49.12091 |
| imap | 5.750577 | 4.232255 | 5.972405 | 43.39703 |
| localhost | 5.027364 | 3.629080 | 5.413346 | 35.89660 |
| click | 6.672682 | 4.520735 | 6.817995 | 34.92019 |
| fetchmail-5.9.0 | 4.402641 | 4.083347 | 4.933638 | 33.86985 |
| br | 5.536451 | 3.649553 | 5.704030 | 32.91306 |
| yyyy@localhost.spamassassin.taint.org | 3.908763 | 4.830615 | 4.791449 | 30.74311 |
| href | 5.328216 | 4.177501 | 5.502422 | 30.54192 |
| jmason.org | 4.232763 | 5.397292 | 5.573973 | 29.57949 |
| jm@netnoteinc.com | 6.153284 | 5.101657 | 6.276801 | 25.12110 |
| x-keyword | 4.295148 | 4.287863 | 4.862084 | 24.59174 |
| sep | 5.949370 | 6.842204 | 7.042187 | 24.35211 |
| single-drop | 4.076746 | 2.849343 | 4.218388 | 23.93181 |
| html | 4.252411 | 2.415504 | 3.727842 | 21.16160 |
| remov | 5.789474 | 5.137593 | 6.309525 | 19.85000 |
| p | 4.598776 | 1.973599 | 4.633249 | 19.78174 |
| @localhost | 4.588750 | 2.355616 | 4.606266 | 18.77095 |
| jun | 3.717090 | 4.416518 | 4.609552 | 17.25340 |
ggplot(imp_df %>% slice(1:20), aes(reorder(term, MeanDecreaseGini), MeanDecreaseGini)) +
geom_col(fill = "darkorange") +
coord_flip() +
labs(title = "Top 20 important terms (Random Forest)", x = NULL, y = "MeanDecreaseGini") +
theme_minimal()

Compute precision/recall/F1 and ROC:
# Precision/Recall/F1
tibble(
Model = "Random Forest",
Precision = cm_rf$byClass["Precision"],
Recall = cm_rf$byClass["Recall"],
F1 = cm_rf$byClass["F1"]
) %>% knitr::kable()
| Model         | Precision | Recall    | F1       |
|---------------|-----------|-----------|----------|
| Random Forest | 0.9964029 | 0.9928315 | 0.994614 |
# Predict probabilities on TEST SET USING MATRIX
rf_prob_spam <- predict(rf_model, newdata = x_test_rf, type = "prob")[, "spam"]
# Compute ROC
roc_rf <- roc(response = as.numeric(y_test_rf == "spam"), predictor = rf_prob_spam)
plot.roc(roc_rf, main = "Random Forest ROC", col = "#b21c1c")

auc_rf <- auc(roc_rf)
auc_rf
## Area under the curve: 0.9999
Interpretation: The Random Forest model performs extremely well,
achieving Accuracy = 0.996, Precision = 0.996, Recall = 0.993, and F1 =
0.995. Only three misclassifications occur in the entire test set. The
AUC of 0.9999 confirms near-perfect separation between spam and ham. Top
features such as “click,” “html,” “href,” “jalapeno,” and server-related
tokens indicate that the model leverages both promotional vocabulary and
structural artifacts of the emails. Compared to Naive Bayes, Random
Forest captures far richer nonlinear interactions among words.
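Because several top features (localhost, fetchmail-5.9.0, jmason.org, yyyy@localhost.spamassassin.taint.org) come from transport headers rather than message bodies, a useful robustness check is to re-run the pipeline on body text only. A minimal header-stripping sketch, assuming headers end at the first blank line (as in standard RFC 822 messages):
# Sketch: drop the header block (everything up to the first blank line), keep the body
strip_headers <- function(txt) {
  sub("(?s)^.*?\n[ \t]*\n", "", txt, perl = TRUE)  # (?s) lets '.' match newlines
}
emails_body <- emails_df %>% mutate(text = strip_headers(text))
# Re-running the corpus / dfm / model steps on emails_body would show how much
# signal remains once header artifacts are removed.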
| 10. Model comparison |
# Collect metrics for comparison
nb_prec <- as.numeric(cm_nb$byClass["Precision"])
nb_rec <- as.numeric(cm_nb$byClass["Recall"])
nb_f1 <- as.numeric(cm_nb$byClass["F1"])
nb_auc <- as.numeric(auc_nb)
nb_tfidf_prec <- as.numeric(cm_nb_tfidf$byClass["Precision"])
nb_tfidf_rec <- as.numeric(cm_nb_tfidf$byClass["Recall"])
nb_tfidf_f1 <- as.numeric(cm_nb_tfidf$byClass["F1"])
nb_tfidf_auc <- as.numeric(auc_nb_tfidf)
rf_prec <- as.numeric(cm_rf$byClass["Precision"])
rf_rec <- as.numeric(cm_rf$byClass["Recall"])
rf_f1 <- as.numeric(cm_rf$byClass["F1"])
rf_auc <- as.numeric(auc_rf)
results_df <- tibble(
Model = c("Naive Bayes (TF)", "Naive Bayes (TF-IDF)", "Random Forest"),
Precision = c(nb_prec, nb_tfidf_prec, rf_prec),
Recall = c(nb_rec, nb_tfidf_rec, rf_rec),
F1 = c(nb_f1, nb_tfidf_f1, rf_f1),
AUC = c(nb_auc, nb_tfidf_auc, rf_auc)
)
results_df %>% knitr::kable(digits = 4)
| Model                | Precision | Recall | F1     | AUC    |
|----------------------|-----------|--------|--------|--------|
| Naive Bayes (TF)     | 0.6473    | 1.0000 | 0.7859 | 0.8580 |
| Naive Bayes (TF-IDF) | 0.5899    | 1.0000 | 0.7420 | 0.8100 |
| Random Forest        | 0.9964    | 0.9928 | 0.9946 | 0.9999 |
# barplot for primary metrics
results_long <- results_df %>% pivot_longer(cols = Precision:F1, names_to = "Metric", values_to = "Value")
ggplot(results_long, aes(x = Metric, y = Value, fill = Model)) +
geom_col(position = "dodge") +
labs(title = "Model performance comparison", y = "Score") +
theme_minimal()

Interpretation: The Random Forest substantially outperforms Naive
Bayes across all metrics. Naive Bayes offers perfect spam recall but
struggles with precision, producing many false positives. Random Forest,
by contrast, achieves both high precision and high recall
simultaneously. This demonstrates that tree-based ensemble models handle
sparse, high-dimensional text features far more effectively than
linear-probabilistic methods in this dataset. For operational use,
Random Forest is clearly the superior model.
| 11. Example predictions (inspect errors / false positives & false negatives) |
# Reconstruct test indices relative to dfm_trimmed
test_index <- setdiff(seq_len(nrow(dfm_trimmed)), train_index)
# Build table with original text and predictions
inspect_df <- tibble(
id = test_index,
true_label = y_test,
pred_nb = nb_pred,
pred_rf = rf_pred,
text = emails_df$text[test_index]
)
# Add short preview for readability
inspect_df <- inspect_df %>%
mutate(text_preview = str_replace_all(substr(text, 1, 160), "\\s+", " "))
# Keep only messages misclassified by Naive Bayes and label the error type
errors_df <- inspect_df %>%
  filter((pred_nb == "spam" & true_label == "ham") |
           (pred_nb == "ham" & true_label == "spam")) %>%
  mutate(error_type = case_when(
    pred_nb == "spam" & true_label == "ham" ~ "False Positive (NB)",
    pred_nb == "ham" & true_label == "spam" ~ "False Negative (NB)"
  )) %>%
  select(id, true_label, pred_nb, pred_rf, error_type, text_preview)
# Display the first 10 errors; errors_df itself stays a data frame for later inspection
errors_df %>%
  slice(1:10) %>%
  knitr::kable(caption = "Example errors by Naive Bayes (showing first 10)")
To inspect the raw text of a particular example, locate it via its id
and print the corresponding emails_df$text entry to examine why the
model erred.
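For example, a minimal sketch that prints the opening of the first Naive Bayes error (which id to inspect is an arbitrary, illustrative choice):
# Print the first 1000 characters of one misclassified message
id_to_check <- inspect_df$id[inspect_df$pred_nb != inspect_df$true_label][1]
cat(substr(emails_df$text[id_to_check], 1, 1000))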
| 12. Notes, limitations, and suggestions for improvements |
- Preprocessing: We used stemming and stopword removal. Consider
leaving words unstemmed or using lemmatization (via spaCy / UDPipe) for
potentially better interpretability.
- Feature engineering: Consider TF-IDF weighting (we included a
baseline), n-grams (bigrams/trigrams), character-level n-grams, and
phrase detection; a bigram sketch follows this list.
- Class imbalance: If your dataset has class imbalance, try stratified
sampling, class weights, or resampling (SMOTE).
- Model tuning: Use caret or tidymodels to perform cross-validation
and hyperparameter tuning for Random Forest, SVM, or gradient boosting
(xgboost).
- Computational limits: For very large corpora, consider using a more
memory-efficient pipeline (sparse matrices on disk) or feature selection
before model training.
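As an example of the n-gram suggestion above, a minimal sketch that adds bigrams on top of the cleaned tokens (reusing tokens_clean from section 6; the trimming threshold is illustrative):
# Sketch: unigram + bigram features (a larger, sparser feature space)
tokens_uni_bi <- tokens_ngrams(tokens_clean, n = 1:2)
dfm_uni_bi <- dfm_trim(dfm(tokens_uni_bi), min_docfreq = 0.005, docfreq_type = "prop")
dim(dfm_uni_bi)   # compare with the unigram-only dfm_trimmed above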
Conclusions: This project implemented a full supervised
document-classification pipeline for spam detection, including
preprocessing, tokenization, feature construction, exploratory analysis,
and model evaluation. Naive Bayes provided a strong baseline but showed
limitations, achieving high recall at the cost of misclassifying many
ham emails. The Random Forest model demonstrated near-perfect
performance, correctly identifying almost all messages and achieving an
AUC of 0.9999.
These results indicate that more expressive, nonlinear models better
capture the structure of email text. Future enhancements such as TF-IDF
weighting, n-grams, character-level features, and hyperparameter tuning
could further strengthen the classifier, but the current Random Forest
already performs at a level suitable for real-world deployment.
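For operational scoring of new mail, the key step is aligning a new message's features with the training vocabulary before calling the fitted model. A minimal sketch (new_text is a made-up example; it reuses the preprocessing settings, dfm_trimmed, top_terms, and rf_model from above):
# Sketch: score a single new message with the fitted Random Forest
new_text <- "Click here for your FREE prize http://example.com"   # made-up example
new_toks <- tokens(new_text,
                   remove_punct = TRUE, remove_symbols = TRUE,
                   remove_numbers = TRUE, remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem(language = "en")
# Align to the training vocabulary, then to the top-K columns used by the forest
new_dfm <- dfm_match(dfm(new_toks), features = featnames(dfm_trimmed))
new_x   <- as.matrix(new_dfm)[, top_terms, drop = FALSE]
predict(rf_model, new_x, type = "prob")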
| 14. Reproducibility: Session Info |
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggrepel_0.9.6 kableExtra_1.4.0
## [3] knitr_1.50 wordcloud_2.6
## [5] RColorBrewer_1.1-3 pROC_1.18.5
## [7] randomForest_4.7-1.2 naivebayes_1.0.0
## [9] e1071_1.7-16 caret_7.0-1
## [11] lattice_0.22-6 quanteda.textplots_0.96.1
## [13] quanteda_4.3.1 lubridate_1.9.4
## [15] forcats_1.0.0 stringr_1.5.1
## [17] dplyr_1.1.4 purrr_1.1.0
## [19] readr_2.1.5 tidyr_1.3.1
## [21] tibble_3.3.0 ggplot2_3.5.1
## [23] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 timeDate_4041.110
## [4] farver_2.1.2 fastmap_1.2.0 digest_0.6.37
## [7] rpart_4.1.23 timechange_0.3.0 lifecycle_1.0.4
## [10] survival_3.6-4 magrittr_2.0.3 compiler_4.4.1
## [13] rlang_1.1.4 sass_0.4.9 tools_4.4.1
## [16] yaml_2.3.10 data.table_1.16.0 labeling_0.4.3
## [19] stopwords_2.3 xml2_1.3.7 plyr_1.8.9
## [22] withr_3.0.2 nnet_7.3-19 grid_4.4.1
## [25] stats4_4.4.1 future_1.34.0 globals_0.16.3
## [28] scales_1.4.0 iterators_1.0.14 MASS_7.3-60.2
## [31] cli_3.6.3 rmarkdown_2.29 generics_0.1.3
## [34] rstudioapi_0.17.1 future.apply_1.11.3 reshape2_1.4.4
## [37] tzdb_0.4.0 cachem_1.1.0 proxy_0.4-27
## [40] splines_4.4.1 parallel_4.4.1 vctrs_0.6.5
## [43] hardhat_1.4.1 Matrix_1.7-0 jsonlite_2.0.0
## [46] hms_1.1.3 listenv_0.9.1 systemfonts_1.2.1
## [49] foreach_1.5.2 gower_1.0.2 jquerylib_0.1.4
## [52] recipes_1.1.1 glue_1.8.0 parallelly_1.38.0
## [55] codetools_0.2-20 stringi_1.8.4 gtable_0.3.6
## [58] pillar_1.10.1 htmltools_0.5.8.1 ipred_0.9-15
## [61] lava_1.8.1 R6_2.6.1 evaluate_1.0.3
## [64] SnowballC_0.7.1 bslib_0.9.0 class_7.3-22
## [67] Rcpp_1.0.13 fastmatch_1.1-6 svglite_2.1.3
## [70] nlme_3.1-164 prodlim_2024.06.25 xfun_0.51
## [73] pkgconfig_2.0.3 ModelMetrics_1.2.2.2