Project 4 – Document Classification of Spam and Ham Emails: Codebase
Introduction
This codebase implements a supervised document classification workflow using the SpamAssassin Public Mail Corpus. The goal is to classify email documents as either spam or ham using labeled training documents and then evaluate the model on withheld test documents.
The workflow includes reading raw email files, creating a structured document-level dataset, cleaning and tokenizing text, building text features, training a classifier, evaluating predictions, and visualizing important patterns in the data.
Load Required Packages
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(purrr)
library(fs)
library(e1071)
library(caret)
Define Project Paths
The raw email files were downloaded from the SpamAssassin Public Mail Corpus and extracted locally into the data/raw/ folder.
ham_dir <- "data/raw/20030228_easy_ham/easy_ham"
spam_dir <- "data/raw/20030228_spam/spam"
processed_dir <- "data/processed"
Validate Folder Structure
Before reading the email files, I first check that the expected raw data folders exist. This helps confirm that the project is being run from the correct root directory and that the extracted corpus files are stored in the correct location.
dir.exists(ham_dir)
[1] TRUE
dir.exists(spam_dir)
[1] TRUE
if (!dir.exists(ham_dir)) {
stop("The ham folder was not found. Please check the path: ", ham_dir)
}
if (!dir.exists(spam_dir)) {
stop("The spam folder was not found. Please check the path: ", spam_dir)
}
if (!dir.exists(processed_dir)) {
dir.create(processed_dir, recursive = TRUE)
}
Read Raw Email File Paths
Each individual file represents one email document. The class label is inferred from the folder where the file is stored.
ham_files <- dir_ls(ham_dir, type = "file")
spam_files <- dir_ls(spam_dir, type = "file")
length(ham_files)
[1] 2501
length(spam_files)
[1] 501
Create a Safe Email Reading Function
Some raw emails contain unusual characters or encoding issues. This helper function reads each file with a Latin-1 locale, which tolerates non-UTF-8 bytes, collapses the lines into one character string, and returns NA if the file cannot be read.
read_email_safe <- function(path) {
tryCatch(
{
read_lines(path, locale = locale(encoding = "Latin1"), progress = FALSE) |>
paste(collapse = " ")
},
error = function(e) {
NA_character_
}
)
}
Build Document-Level Dataset
The ham and spam files are converted into two labeled datasets and then combined into one document-level table. Each row represents one email document.
ham_emails <- tibble(
doc_id = paste0("ham_", seq_along(ham_files)),
label = "ham",
file_path = as.character(ham_files),
text = map_chr(ham_files, read_email_safe)
)
spam_emails <- tibble(
doc_id = paste0("spam_", seq_along(spam_files)),
label = "spam",
file_path = as.character(spam_files),
text = map_chr(spam_files, read_email_safe)
)
emails <- bind_rows(ham_emails, spam_emails) |>
mutate(
label = factor(label, levels = c("ham", "spam")),
text = str_squish(text)
) |>
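# drop unreadable or empty documents and the corpus's non-email "cmds" index files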
filter(
!is.na(text),
text != "",
!str_detect(file_path, "cmds$")
)
glimpse(emails)
Rows: 3,000
Columns: 4
$ doc_id <chr> "ham_1", "ham_2", "ham_3", "ham_4", "ham_5", "ham_6", "ham_7…
$ label <fct> ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, …
$ file_path <chr> "data/raw/20030228_easy_ham/easy_ham/00001.7c53336b37003a928…
$ text <chr> "From exmh-workers-admin@redhat.com Thu Aug 22 12:36:23 2002…
Validate Document-Level Dataset
These checks confirm that the dataset contains both classes, that each document has a label, and that there are no missing or empty email texts.
emails |>
count(label)
# A tibble: 2 × 2
label n
<fct> <int>
1 ham 2500
2 spam 500
validation_summary <- emails |>
summarize(
total_documents = n(),
missing_labels = sum(is.na(label)),
missing_text = sum(is.na(text)),
empty_text = sum(str_trim(text) == "", na.rm = TRUE)
)
validation_summary
# A tibble: 1 × 4
total_documents missing_labels missing_text empty_text
<int> <int> <int> <int>
1 3000 0 0 0
if (validation_summary$missing_text > 0) {
warning("Some email files could not be read and produced missing text.")
}
if (validation_summary$empty_text > 0) {
warning("Some email files appear to have empty text.")
}
Save Processed Document-Level Dataset
The raw email files are converted into a structured CSV file for transparency and reproducibility. This processed file contains one row per email document.
emails_clean_export <- emails |>
mutate(text = str_squish(text))
write_csv(emails_clean_export, file.path(processed_dir, "spam_ham_emails.csv"))
Initial Class Distribution
This plot shows the number of ham and spam documents in the dataset and makes the class imbalance visible before modeling: the corpus contains roughly five ham documents for every spam document.
emails |>
count(label) |>
ggplot(aes(x = label, y = n)) +
geom_col() +
labs(
title = "Class Distribution of Spam and Ham Emails",
x = "Email Class",
y = "Number of Documents"
)
Text Cleaning and Tokenization
The next step is to convert the document-level dataset into a word-level dataset. Each email is tokenized into individual words, common stop words are removed, and only purely alphabetic words of at least three characters are kept.
data("stop_words")
email_tokens <- emails |>
select(doc_id, label, text) |>
unnest_tokens(word, text) |>
anti_join(stop_words, by = "word") |>
filter(
str_detect(word, "^[a-z]+$"),
str_length(word) >= 3
)
glimpse(email_tokens)
Rows: 705,559
Columns: 3
$ doc_id <chr> "ham_1", "ham_1", "ham_1", "ham_1", "ham_1", "ham_1", "ham_1", …
$ label <fct> ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham…
$ word <chr> "exmh", "workers", "admin", "thu", "aug", "return", "path", "ex…
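For intuition, here is a minimal standalone sketch of the tokenization step on an invented string (the toy text and doc_id are hypothetical):
# Toy illustration: unnest_tokens lowercases the text, strips punctuation,
# and returns one row per word.
tibble(doc_id = "toy_1", text = "Win a FREE prize now!!!") |>
  unnest_tokens(word, text)
# yields the words "win", "a", "free", "prize", "now"; the three-character
# filter above would then remove "a", and the stop-word anti_join removes
# other common function words.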
Validate Tokenized Dataset
These checks confirm that tokenization produced a usable word-level dataset and that both classes are still represented.
email_tokens |>
count(label)
# A tibble: 2 × 2
label n
<fct> <int>
1 ham 528901
2 spam 176658
email_tokens |>
summarize(
total_tokens = n(),
unique_words = n_distinct(word),
documents_with_tokens = n_distinct(doc_id)
)
# A tibble: 1 × 3
total_tokens unique_words documents_with_tokens
<int> <int> <int>
1 705559 31893 3000
Most Common Words by Class
This table shows the most common remaining words in ham and spam emails after stop word removal and basic filtering.
top_words <- email_tokens |>
count(label, word, sort = TRUE) |>
group_by(label) |>
slice_max(n, n = 15) |>
ungroup()
top_words
# A tibble: 30 × 3
label word n
<fct> <chr> <int>
1 ham received 14086
2 ham list 13388
3 ham localhost 12624
4 ham fork 10877
5 ham sep 9790
6 ham esmtp 8407
7 ham http 7518
8 ham subject 7151
9 ham mailto 6170
10 ham admin 6058
# ℹ 20 more rows
Top Word Frequency Visualization
This plot compares the most frequent words in spam and ham emails. It provides an initial look at vocabulary differences between the two classes.
top_words |>
mutate(word = reorder_within(word, n, label)) |>
ggplot(aes(x = word, y = n)) +
geom_col() +
coord_flip() +
facet_wrap(~ label, scales = "free_y") +
scale_x_reordered() +
labs(
title = "Most Common Words in Spam and Ham Emails",
x = "Word",
y = "Frequency"
)
TF-IDF Feature Exploration
TF-IDF gives more weight to words that are frequent within one class but rare or absent in the other, which helps identify the terms most distinctive for spam or ham. Here bind_tf_idf treats each class label as a single document, so tf is a word's share of all tokens in its class.
tfidf_by_class <- email_tokens |>
count(label, word, sort = TRUE) |>
bind_tf_idf(word, label, n) |>
arrange(desc(tf_idf))
tfidf_by_class |>
group_by(label) |>
slice_max(tf_idf, n = 15) |>
ungroup()
# A tibble: 30 × 6
label word n tf idf tf_idf
<fct> <chr> <int> <dbl> <dbl> <dbl>
1 ham fork 10877 0.0206 0.693 0.0143
2 ham rpm 5869 0.0111 0.693 0.00769
3 ham exmh 4972 0.00940 0.693 0.00652
4 ham zzzlist 2726 0.00515 0.693 0.00357
5 ham razor 2076 0.00393 0.693 0.00272
6 ham rssfeeds 1869 0.00353 0.693 0.00245
7 ham zzzzteana 926 0.00175 0.693 0.00121
8 ham wrote 925 0.00175 0.693 0.00121
9 ham devel 907 0.00171 0.693 0.00119
10 ham khare 740 0.00140 0.693 0.000970
# ℹ 20 more rows
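With only two class-level "documents", idf can take just two values: log(2/2) = 0 for words that appear in both classes and log(2/1) ≈ 0.693 for words unique to one class, which is why every idf in the table reads 0.693. A quick hand check against the "fork" row above:
# Hand check of the tf-idf values reported for "fork", which occurs only in ham:
log(2 / 1)            # idf ≈ 0.693
0.0206 * log(2 / 1)   # tf * idf ≈ 0.0143, matching the table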
TF-IDF Visualization
This plot shows the most distinctive words for each email class based on TF-IDF.
tfidf_by_class |>
group_by(label) |>
slice_max(tf_idf, n = 15) |>
ungroup() |>
mutate(word = reorder_within(word, tf_idf, label)) |>
ggplot(aes(x = word, y = tf_idf)) +
geom_col() +
coord_flip() +
facet_wrap(~ label, scales = "free_y") +
scale_x_reordered() +
labs(
title = "Distinctive Terms by Email Class Using TF-IDF",
x = "Word",
y = "TF-IDF"
)
Because the corpus contains raw email files, some frequent and distinctive terms come from email headers, mailing-list metadata, and HTML formatting. These terms are still useful for document classification because they are part of the raw documents the classifier sees, but they should be interpreted as document-level signals rather than purely message-body language.
Save Token-Level Feature Output
The token-level feature table is saved as a processed output. This file supports transparency because it shows the cleaned words and their class-level TF-IDF values.
write_csv(tfidf_by_class, file.path(processed_dir, "spam_ham_features.csv"))
Train/Test Split
The dataset is split into training and testing sets using a stratified split. This keeps the class proportions similar in both sets while making sure the testing data remains unseen during model training.
set.seed(607)
train_index <- createDataPartition(emails$label, p = 0.80, list = FALSE)
email_train <- emails[train_index, ]
email_test <- emails[-train_index, ]
email_train |> count(label)
# A tibble: 2 × 2
label n
<fct> <int>
1 ham 2000
2 spam 400
email_test |> count(label)
# A tibble: 2 × 2
label n
<fct> <int>
1 ham 500
2 spam 100
Create Document-Term Matrix Features
For modeling, I create document-level word-count features. Only the most frequent terms are kept, up to 750 per class, which leaves 1,163 unique words after the two lists are deduplicated; this keeps the model manageable and avoids an extremely sparse feature table.
top_model_terms <- email_tokens |>
count(label, word, sort = TRUE) |>
group_by(label) |>
slice_max(n, n = 750) |>
ungroup() |>
distinct(word) |>
pull(word)
model_tokens <- email_tokens |>
filter(word %in% top_model_terms) |>
count(doc_id, word, name = "count")
dtm <- model_tokens |>
pivot_wider(
names_from = word,
values_from = count,
values_fill = 0
)
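# Re-attach labels via emails so all 3,000 documents are kept, then fill the
# word-count columns with zeros for documents that lost every token to filtering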
dtm <- emails |>
select(doc_id, label) |>
left_join(dtm, by = "doc_id") |>
mutate(across(where(is.numeric), ~replace_na(.x, 0)))
dtm_summary <- tibble(
documents = nrow(dtm),
total_columns = ncol(dtm),
feature_columns = ncol(dtm) - 2
)
dtm_summary
# A tibble: 1 × 3
documents total_columns feature_columns
<int> <int> <dbl>
1 3000 1165 1163
Create Training and Testing Matrices
For the final model, I use binary word-presence features instead of raw word counts. Each selected word is represented as either present or absent in a document. This approach works well for Naive Bayes because it focuses on whether important terms appear in an email rather than how many times they appear.
train_dtm <- dtm |>
filter(doc_id %in% email_train$doc_id) |>
select(-doc_id)
test_dtm <- dtm |>
filter(doc_id %in% email_test$doc_id) |>
select(-doc_id)
train_labels <- train_dtm$label
test_labels <- test_dtm$label
train_x <- train_dtm |>
select(-label) |>
mutate(across(everything(), ~factor(if_else(.x > 0, "yes", "no"),
levels = c("no", "yes"))))
test_x <- test_dtm |>
select(-label) |>
mutate(across(everything(), ~factor(if_else(.x > 0, "yes", "no"),
levels = c("no", "yes"))))
Train Naive Bayes Classifier
A Naive Bayes classifier is trained using binary word-presence features from the training set. Laplace smoothing is included to help handle words that may appear in one class but not the other.
nb_model <- naiveBayes(
x = train_x,
y = train_labels,
laplace = 1
)
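To see what the smoothing does, consider a hypothetical word that appears in none of the 400 spam training emails (the counts are illustrative):
# Without smoothing, P(word = "yes" | spam) = 0/400 would veto any email
# containing the word. With laplace = 1 added to each cell of the count
# table (two factor levels, so the denominator grows by 2):
(0 + 1) / (400 + 2 * 1)   # ≈ 0.0025: small, but no longer zero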
model_summary <- tibble(
model_type = "Naive Bayes",
feature_type = "Binary word-presence",
training_documents = nrow(train_x),
testing_documents = nrow(test_x),
laplace_smoothing = 1
)
model_summary
# A tibble: 1 × 5
model_type feature_type training_documents testing_documents laplace_smoothing
<chr> <chr> <int> <int> <dbl>
1 Naive Bay… Binary word… 2400 600 1
Generate Predictions
The trained model is used to predict labels for the withheld testing documents.
nb_pred <- predict(nb_model, newdata = test_x)
prediction_results <- tibble(
doc_id = email_test$doc_id,
label = test_labels,
predicted_label = nb_pred
)
prediction_results |> count(label, predicted_label)
# A tibble: 4 × 3
label predicted_label n
<fct> <fct> <int>
1 ham ham 496
2 ham spam 4
3 spam ham 7
4 spam spam 93
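If posterior probabilities are needed rather than hard labels (for example, to adjust the spam decision threshold), e1071's predict method for naiveBayes also accepts type = "raw"; a brief sketch, not used further in this workflow:
# Posterior class probabilities: one row per test document,
# one column per class ("ham" and "spam").
nb_prob <- predict(nb_model, newdata = test_x, type = "raw")
head(nb_prob)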
Evaluate Model Performance
The confusion matrix compares the predicted labels with the true labels, treating spam as the positive class.
confusion_output <- confusionMatrix(
data = prediction_results$predicted_label,
reference = prediction_results$label,
positive = "spam"
)
confusion_output
Confusion Matrix and Statistics
Reference
Prediction ham spam
ham 496 7
spam 4 93
Accuracy : 0.9817
95% CI : (0.9674, 0.9908)
No Information Rate : 0.8333
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9332
Mcnemar's Test P-Value : 0.5465
Sensitivity : 0.9300
Specificity : 0.9920
Pos Pred Value : 0.9588
Neg Pred Value : 0.9861
Prevalence : 0.1667
Detection Rate : 0.1550
Detection Prevalence : 0.1617
Balanced Accuracy : 0.9610
'Positive' Class : spam
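These headline statistics follow directly from the four cells of the table; a quick hand check:
# Recomputing the reported metrics from the confusion matrix counts:
93 / (93 + 7)      # sensitivity (spam recall): 0.93
496 / (496 + 4)    # specificity (ham recall): 0.992
93 / (93 + 4)      # positive predictive value (precision): ~0.9588
(496 + 93) / 600   # accuracy: ~0.9817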
Confusion Matrix Table
confusion_table <- as.data.frame(confusion_output$table)
confusion_table
  Prediction Reference Freq
1 ham ham 496
2 spam ham 4
3 ham spam 7
4 spam spam 93
Confusion Matrix Visualization
confusion_table |>
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), size = 5) +
labs(
title = "Confusion Matrix for Spam and Ham Classification",
x = "True Class",
y = "Predicted Class"
)
Metric Summary
metric_summary <- tibble(
metric = c("Accuracy", "Sensitivity / Spam Recall", "Specificity", "Precision"),
estimate = c(
confusion_output$overall["Accuracy"],
confusion_output$byClass["Sensitivity"],
confusion_output$byClass["Specificity"],
confusion_output$byClass["Precision"]
)
)
metric_summary
# A tibble: 4 × 2
metric estimate
<chr> <dbl>
1 Accuracy 0.982
2 Sensitivity / Spam Recall 0.93
3 Specificity 0.992
4 Precision 0.959
Metric Summary Visualization
metric_summary |>
ggplot(aes(x = metric, y = estimate)) +
geom_col() +
ylim(0, 1) +
labs(
title = "Naive Bayes Classification Metric Summary",
x = "Metric",
y = "Estimate"
)
Save Model Predictions
The final prediction results are saved as a processed output for transparency.
write_csv(prediction_results, file.path(processed_dir, "spam_ham_predictions.csv"))
Interpretation of Model Results
The final Naive Bayes model performed strongly on the withheld testing data. Out of 600 test emails, the model correctly classified 496 ham emails and 93 spam emails. Only 4 ham emails were incorrectly predicted as spam, and only 7 spam emails were incorrectly predicted as ham.
The overall accuracy was 98.17%, but accuracy alone is not a sufficient measure here because the dataset is imbalanced: a classifier that always predicted ham would already achieve the No Information Rate of 83.33%. The spam recall (sensitivity) was 93.00%, meaning the model identified most of the true spam messages. The precision was 95.88%, meaning that most emails predicted as spam were actually spam. The specificity was 99.20%, showing that the model was also very strong at correctly identifying legitimate ham emails.
The model therefore clearly outperforms a majority-class baseline while performing well on both classes. The small number of false positives and false negatives shows that the word-presence Naive Bayes approach is effective for this spam/ham classification task.