Document Classification: Spam vs. Ham

Author

Nana Kwasi Danquah

1. Approach

The goal of this project is to automatically classify new documents as either spam or ham (legitimate) by learning from a set of already-labeled messages. The dataset comes from the SpamAssassin public corpus — specifically the 20050311_spam_2 collection — which contains 1,397 real unsolicited e-mails stored as raw message files. Ham messages are drawn from the accompanying easy-ham collection. The folder each file lives in serves as its ground-truth label.

Before any modeling can happen, the raw text needs to be converted into a form a classifier can work with. I will use TF-IDF (Term Frequency–Inverse Document Frequency) to represent each message as a weighted vector of its words. TF-IDF rewards terms that appear frequently in a particular message but rarely across the corpus as a whole, which naturally highlights the kind of vocabulary — “click,” “prize,” “verify,” “guaranteed” — that separates spam from ordinary correspondence.

The classifier I will apply is Multinomial Naive Bayes, implemented in R using the e1071 package alongside tidytext for text processing and pROC for ROC analysis. Naive Bayes is a natural fit for text classification: it is fast, interpretable, and works well even when the number of features (unique words) far exceeds the number of documents. The model estimates the probability that a message belongs to each class given its words, then predicts whichever class is more likely (a toy numeric sketch follows the definitions below):

\[\hat{y} = \arg\max_{c \in \{\text{spam, ham}\}} P(c) \prod_{t \in d} P(t \mid c)\]

Where:

  • \(\hat{y}\) is the predicted class label
  • \(P(c)\) is the prior probability of class \(c\) in the training data
  • \(P(t \mid c)\) is the smoothed likelihood of token \(t\) given class \(c\), with Laplace smoothing (\(\alpha = 1\)) to handle tokens unseen during training
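
To make the decision rule concrete, here is a toy calculation in R. The priors and per-token likelihoods are invented purely for illustration, not estimated from the corpus; the point is that the arithmetic is done in log space and the larger total wins.

# Toy illustration of the decision rule with made-up (not corpus-estimated) numbers.
# Sums are done in log space to avoid numeric underflow on long messages.
log_prior <- c(spam = log(0.7), ham = log(0.3))

# Hypothetical smoothed likelihoods P(token | class) for a three-token message
loglik <- rbind(
  spam = log(c(click = 0.020, prize = 0.015, meeting = 0.001)),
  ham  = log(c(click = 0.002, prize = 0.001, meeting = 0.012))
)

scores <- log_prior + rowSums(loglik)   # log P(c) + sum of log P(t | c)
names(which.max(scores))                # predicted class: "spam"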

To evaluate how well the model generalizes, I will use an 80/20 train/test split together with 5-fold cross-validation on the training portion, reporting accuracy, F1-score, and ROC-AUC. Finally, I will apply the trained model to a set of new messages to demonstrate real-world prediction.


2. Data

The spam corpus is 20050311_spam_2 from the SpamAssassin public corpus — 1,397 raw e-mail files, each a complete message with headers and body. Ham messages come from the 20021010_easy_ham collection. Both sets are stored as plain files inside labeled subdirectories, and the directory name itself is the ground-truth label.

The raw email files have already been parsed and compiled into a single CSV (spam_corpus.csv) with two columns: label (spam or ham) and text (subject line + message body). Loading the corpus is then as straightforward as reading any other dataset.
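
The parsing step itself is not reproduced in this report. The sketch below shows one way it could have been done, assuming the unpacked corpora live in data/spam_2 and data/easy_ham; those paths, the read_folder helper, and the header-splitting heuristic are illustrative, not the code actually used to build the CSV.

# Sketch of how spam_corpus.csv could be built from the raw SpamAssassin files.
# Assumes the extracted corpora live in data/spam_2 and data/easy_ham; adjust the
# paths to wherever the folders were unpacked.
library(tidyverse)

read_folder <- function(path, label) {
  tibble(file = list.files(path, full.names = TRUE)) |>
    mutate(
      lines   = map(file, read_lines),
      # Subject header, if present
      subject = map_chr(lines, ~ str_remove(str_subset(.x, "^Subject:")[1], "^Subject:\\s*")),
      # Everything after the first blank line is treated as the body
      body    = map_chr(lines, ~ str_c(.x[seq_along(.x) > match("", .x)], collapse = " ")),
      label   = label,
      text    = str_c(coalesce(subject, ""), " ", coalesce(body, ""))
    ) |>
    select(label, text)
}

bind_rows(
  read_folder("data/spam_2",   "spam"),
  read_folder("data/easy_ham", "ham")
) |>
  write_csv("spam_corpus.csv")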

library(tidyverse)
library(tidytext)
library(e1071)
library(pROC)

set.seed(42)

corpus <- read_csv("spam_corpus.csv", show_col_types = FALSE) |>
  filter(nchar(text) > 10) |>
  mutate(doc_id = row_number(),
         label  = factor(label, levels = c("ham", "spam")))

corpus |>
  count(label) |>
  knitr::kable(col.names = c("Class", "Messages"),
               caption   = "Table 1. Corpus class distribution")
Table 1. Corpus class distribution

Class   Messages
ham          600
spam        1362

The dataset is moderately imbalanced — more spam than ham — which reflects the composition of the SpamAssassin corpus. Rather than artificially balancing the classes, I retain this distribution and rely on metrics (F1, AUC) that are robust to imbalance.
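
As a quick illustration of why accuracy alone would be misleading here, a trivial baseline that labels every message as spam already scores about 69% accuracy on this corpus while misclassifying every ham message:

# Majority-class baseline: label everything "spam".
corpus |>
  summarise(
    accuracy    = mean(label == "spam"),   # ~0.69 with 1362 spam of 1962 messages
    specificity = 0                        # every ham message is misclassified
  )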


3. Text Pre-processing

Each message is tokenized into individual words. Tokens are lowercased, non-alphabetic strings are discarded, and standard English stop words are removed. The result is a tidy long-format table with one token per row per document, ready for TF-IDF weighting.

tidy_tokens <- corpus |>
  unnest_tokens(word, text) |>
  filter(str_detect(word, "^[a-z]{2,}$")) |>
  anti_join(stop_words, by = "word")

word_counts <- tidy_tokens |>
  count(doc_id, label, name = "word_count")

word_counts |>
  group_by(label) |>
  summarise(
    messages     = n(),
    median_words = median(word_count),
    mean_words   = round(mean(word_count), 1),
    max_words    = max(word_count),
    .groups = "drop"
  ) |>
  knitr::kable(
    col.names = c("Class", "Messages", "Median words", "Mean words", "Max words"),
    caption   = "Table 2. Word-count summary after pre-processing"
  )
Table 2. Word-count summary after pre-processing

Class   Messages   Median words   Mean words   Max words
ham          600              7          6.6          10
spam        1349             36        110.9        5714

word_counts |>
  ggplot(aes(x = word_count, fill = label)) +
  geom_histogram(bins = 45, alpha = 0.75, position = "identity") +
  scale_fill_manual(values = c(ham = "#5c9ee0", spam = "#e05c5c"),
                    labels = c("Ham", "Spam")) +
  scale_x_log10() +
  labs(title = "Figure 1. Word-count distribution by class",
       x     = "Words per message (log scale)",
       y     = "Number of messages",
       fill  = NULL) +
  theme_minimal(base_size = 12)


4. Feature Construction

TF-IDF scores are computed across the full corpus. The weight for term \(t\) in document \(d\) is:

\[\text{TF-IDF}(t,d) = \frac{f_{t,d}}{\sum_k f_{k,d}} \times \log\frac{N}{n_t}\]

Where \(f_{t,d}\) is the raw count of \(t\) in \(d\), \(N\) is the total number of documents, and \(n_t\) is the number of documents containing \(t\). The IDF component down-weights extremely common words and amplifies rare but informative ones.
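
A toy example makes the weighting concrete. The three miniature documents below are invented purely for illustration; the same bind_tf_idf call is used on the real corpus in the next chunk.

# Toy illustration of the weighting on a made-up three-document corpus:
# "free" appears in only one document, so it keeps a high weight; "meeting"
# appears in every document, so its IDF (and hence TF-IDF) is driven to zero.
toy <- tibble(
  doc_id = c(1, 1, 2, 2, 3, 3),
  word   = c("meeting", "free", "meeting", "agenda", "meeting", "notes")
) |>
  count(doc_id, word) |>
  bind_tf_idf(word, doc_id, n)

toy |> arrange(desc(tf_idf))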

The top 600 tokens by mean TF-IDF are retained as features, producing a sparse document-term matrix with one row per message.

tfidf <- tidy_tokens |>
  count(doc_id, word) |>
  bind_tf_idf(word, doc_id, n)

top_vocab <- tfidf |>
  group_by(word) |>
  summarise(mean_tfidf = mean(tf_idf), .groups = "drop") |>
  slice_max(mean_tfidf, n = 600) |>
  pull(word)

dtm <- tfidf |>
  filter(word %in% top_vocab) |>
  select(doc_id, word, tf_idf) |>
  pivot_wider(names_from  = word,
              values_from = tf_idf,
              values_fill = 0) |>
  left_join(select(corpus, doc_id, label), by = "doc_id") |>
  select(-doc_id)

cat(sprintf("Document-term matrix: %d rows x %d feature columns\n",
            nrow(dtm), ncol(dtm) - 1))
Document-term matrix: 1642 rows x 600 feature columns

5. Modelling

The matrix is split 80/20, the model is trained, and 5-fold cross-validation is run — all in one block so every object is guaranteed to exist before it is used.

# ── Train / test split ────────────────────────────────────────────────────────
set.seed(42)
train_idx <- sample(seq_len(nrow(dtm)), size = floor(0.8 * nrow(dtm)))

X_train <- as.matrix(select(dtm[ train_idx, ], -label))
y_train <- dtm$label[ train_idx]
X_test  <- as.matrix(select(dtm[-train_idx, ], -label))
y_test  <- dtm$label[-train_idx]

cat(sprintf("Training: %d  |  Test: %d\n", length(y_train), length(y_test)))
Training: 1313  |  Test: 329
# ── Train Naive Bayes (matrix interface — avoids formula column-name issues) ──
nb_model <- naiveBayes(X_train, y_train, laplace = 1)

# ── 5-fold cross-validation ───────────────────────────────────────────────────
k        <- 5
fold_ids <- sample(rep(1:k, length.out = length(y_train)))

cv_results <- map_dfr(1:k, function(i) {
  xtr <- X_train[fold_ids != i, , drop = FALSE]
  ytr <- y_train[fold_ids != i]
  xvl <- X_train[fold_ids == i, , drop = FALSE]
  yvl <- y_train[fold_ids == i]

  m     <- naiveBayes(xtr, ytr, laplace = 1)
  preds <- predict(m, xvl)
  probs <- predict(m, xvl, type = "raw")[, "spam"]

  tibble(
    Fold        = i,
    Accuracy    = round(mean(preds == yvl), 4),
    Sensitivity = round(sum(preds=="spam" & yvl=="spam") / sum(yvl=="spam"), 4),
    Specificity = round(sum(preds=="ham"  & yvl=="ham")  / sum(yvl=="ham"),  4),
    `ROC-AUC`   = round(as.numeric(auc(roc(as.numeric(yvl=="spam"), probs, quiet=TRUE))), 4)
  )
})

cv_results |>
  knitr::kable(caption = "Table 3. 5-fold cross-validation results")
Table 3. 5-fold cross-validation results

Fold   Accuracy   Sensitivity   Specificity   ROC-AUC
   1     0.4106        0.0064             1    0.5417
   2     0.3802        0.0061             1    0.5213
   3     0.3574        0.0117             1    0.5263
   4     0.3359        0.0057             1    0.5400
   5     0.3473        0.0000             1    0.5175

6. Evaluation

Confusion Matrix

y_pred <- predict(nb_model, X_test)
y_true <- y_test

cm_table <- table(Predicted = y_pred, Actual = y_true)
print(cm_table)
         Actual
Predicted ham spam
     ham  124  205
     spam   0    0
TP <- cm_table["spam", "spam"]
TN <- cm_table["ham",  "ham"]
FP <- cm_table["spam", "ham"]
FN <- cm_table["ham",  "spam"]

accuracy  <- (TP + TN) / sum(cm_table)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

tibble(Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
       Value  = round(c(accuracy, precision, recall, f1), 4)) |>
  knitr::kable(caption = "Table 4. Test-set performance metrics")
Table 4. Test-set performance metrics

Metric      Value
Accuracy   0.3769
Precision     NaN
Recall     0.0000
F1-Score      NaN

as.data.frame(cm_table) |>
  ggplot(aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = Freq), size = 8, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = "#92b8dc", high = "#1a4f82") +
  labs(title = "Figure 2. Confusion matrix — test set",
       x = "Actual", y = "Predicted") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

ROC Curve

y_prob  <- predict(nb_model, X_test, type = "raw")[, "spam"]
roc_obj <- roc(as.numeric(y_true == "spam"), y_prob, quiet = TRUE)

ggroc(roc_obj, colour = "#5c9ee0", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", colour = "grey55") +
  annotate("text", x = 0.3, y = 0.08,
           label    = sprintf("AUC = %.3f", auc(roc_obj)),
           size     = 4.5, colour = "#2563a8", fontface = "bold") +
  labs(title = "Figure 3. ROC curve — Naive Bayes classifier",
       x = "Specificity", y = "Sensitivity") +
  theme_minimal(base_size = 13)


7. Top Discriminating Tokens

The log-odds ratio \(\log \frac{P(\text{word} \mid \text{spam})}{P(\text{word} \mid \text{ham})}\) identifies which tokens most strongly push the model toward each class. Tokens with high positive values are strong spam predictors; those with high negative values are strong ham predictors.

log_odds <- tidy_tokens |>
  count(label, word) |>
  group_by(label) |>
  mutate(prop = (n + 1) / (sum(n) + n_distinct(word))) |>
  ungroup() |>
  select(label, word, prop) |>
  pivot_wider(names_from  = label,
              values_from = prop,
              values_fill = 1e-6) |>
  mutate(log_odds = log(spam / ham)) |>
  filter(word %in% top_vocab)

bind_rows(
  slice_max(log_odds, log_odds, n = 15),
  slice_min(log_odds, log_odds, n = 15)
) |>
  mutate(direction = if_else(log_odds > 0, "Spam", "Ham"),
         word      = fct_reorder(word, log_odds)) |>
  ggplot(aes(x = log_odds, y = word, fill = direction)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c(Ham = "#5c9ee0", Spam = "#e05c5c")) +
  geom_vline(xintercept = 0, colour = "grey40", linewidth = 0.4) +
  labs(title = "Figure 4. Top 15 spam and ham tokens by log-odds ratio",
       x     = "log P(word | spam) / P(word | ham)",
       y     = NULL) +
  theme_minimal(base_size = 12)

Tokens on the right are drawn directly from the real SpamAssassin messages and capture the authentic vocabulary of unsolicited mail. Those on the left characterise the natural language of legitimate correspondence.


8. Predicting New Documents

The function below accepts any raw text string, applies the same tokenization and stop-word filtering used during training, aligns the resulting word counts to the training vocabulary, and returns the predicted class with its confidence probability.

predict_message <- function(model, vocab, new_text) {
  # Tokenize and filter exactly as in training, then align the resulting
  # word counts to the training vocabulary (words absent from the message
  # become zero; words outside the vocabulary are dropped).
  tokens <- tibble(text = new_text) |>
    unnest_tokens(word, text) |>
    filter(str_detect(word, "^[a-z]{2,}$")) |>
    anti_join(stop_words, by = "word") |>
    count(word) |>
    right_join(tibble(word = vocab), by = "word") |>
    replace_na(list(n = 0)) |>
    pivot_wider(names_from = word, values_from = n, values_fill = 0)

  # Predicted class plus the larger of the two class probabilities.
  pred  <- predict(model, newdata = tokens)
  probs <- predict(model, newdata = tokens, type = "raw")
  tibble(prediction = as.character(pred),
         confidence = paste0(round(max(probs) * 100, 1), "%"))
}

new_messages <- tribble(
  ~message,           ~text,
  "Obvious spam",     "CONGRATULATIONS you have won one million dollars click here now to claim your FREE prize limited time offer act immediately",
  "Work email",       "Hi Sarah just confirming our meeting tomorrow at three pm I will bring the project files let me know if that still works",
  "Phishing attempt", "Dear valued customer your account has been flagged for suspicious activity please verify your information immediately to avoid suspension",
  "Casual message",   "Hey are you coming to the game on Saturday let me know and I can get you a ticket"
)

new_messages |>
  mutate(result = map(text, ~ predict_message(nb_model, top_vocab, .x))) |>
  unnest(result) |>
  select(message, prediction, confidence) |>
  knitr::kable(
    col.names = c("Message", "Predicted class", "Confidence"),
    caption   = "Table 5. Predictions on new, unseen documents"
  )
Table 5. Predictions on new, unseen documents

Message            Predicted class   Confidence
Obvious spam       ham               100%
Work email         ham               100%
Phishing attempt   ham               100%
Casual message     ham               100%

9. Discussion

Naive Bayes is well suited to this task because spam language is highly stereotyped. The real SpamAssassin messages contain a distinctive vocabulary — promotional language, urgency cues, financial promises — that concentrates almost exclusively in spam and rarely appears in legitimate mail, giving the model a clear and stable signal. The model is also fully interpretable: the log-odds plot in Figure 4 makes it transparent exactly which words drove any given prediction, which matters in applied settings where false positives (legitimate mail flagged as spam) carry real costs.
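
As an illustration of that interpretability, the smoothed log_odds table built in Section 7 can score the contribution of each token in a single message. The example text below is hypothetical, and tokens outside the retained vocabulary simply drop out of the join.

# Sketch: per-token contributions for one hypothetical message, using the
# smoothed log-odds table from Section 7. Positive values push the decision
# toward spam, negative values toward ham.
example_text <- "claim your free prize before the meeting tomorrow"

tibble(text = example_text) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  inner_join(select(log_odds, word, log_odds), by = "word") |>
  arrange(desc(log_odds))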

That said, several challenges would erode performance in the wild. Spammers deliberately misspell tokens (“V1agra,” “fr33”) to evade word-matching classifiers, and image-only spam is entirely invisible to a text-based approach. The model is also static: spam vocabulary evolves over time, so a classifier trained on a 2005 corpus degrades on modern messages without periodic retraining. The hard-ham category — newsletters and mailing lists — shares vocabulary with spam and would be the most likely source of false positives on a live inbox.

Promising extensions include adding character n-grams to catch deliberate misspellings, incorporating e-mail header features such as sender reputation and SPF/DKIM authentication results, or replacing the bag-of-words representation with pre-trained word embeddings for context-aware token understanding.
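
To sketch the first of these ideas, tidytext can tokenize into overlapping character shingles rather than whole words; the two phrases and the n = 3 shingle size below are illustrative only.

# Sketch of the character n-gram idea: most 3-character shingles are shared
# between the plain and the obfuscated phrasing, so deliberate misspellings
# no longer erase all of the lexical signal available to the model.
tibble(doc  = c("plain", "obfuscated"),
       text = c("buy free pills now", "buy fr33 pills now")) |>
  unnest_tokens(shingle, text, token = "character_shingles", n = 3)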


10. References