Project 4 – Document Classification of Spam and Ham Emails: Approach

Author

Muhammad Suffyan Khan

Published

April 30, 2026

Introduction

The objective of this project is to build a supervised document classification workflow using labeled email documents. A common real-world example of document classification is spam detection, where previously classified spam and non-spam emails are used to train a model that can classify new or unseen emails.

For this project, I will use the SpamAssassin Public Mail Corpus, which contains email messages that are already separated into spam and ham categories. The project will focus on converting raw email files into a structured dataset, preparing text features, training a predictive classifier, and evaluating how accurately the model classifies withheld test documents.

This project follows the same general idea described in the assignment prompt: already classified training documents can be used to predict the class of new test documents. In this case, the test documents will be withheld from the original labeled dataset so the model can be evaluated on emails it has not seen during training.


Objective

The primary objective is to classify email documents as either:

  • ham: legitimate non-spam email
  • spam: unwanted or spam email

The final workflow will demonstrate how raw text documents can be transformed into machine-readable features and used for predictive classification.

The main goals are to:

  • Load labeled spam and ham email documents from the extracted corpus folders.
  • Convert raw email files into a structured document-level dataset.
  • Clean and tokenize email text for text mining.
  • Create useful document features using word counts or TF-IDF values.
  • Train a supervised classification model.
  • Predict labels for withheld test documents.
  • Evaluate model performance using a confusion matrix and classification metrics.
  • Visualize important patterns in the corpus and model results.

Dataset Selection

The dataset selected for this project is the SpamAssassin Public Mail Corpus.

Dataset source:

https://spamassassin.apache.org/old/publiccorpus/

The two corpus files selected for this project are:

  • 20030228_easy_ham.tar.bz2
  • 20030228_spam.tar.bz2

After downloading and extracting the files, the raw email documents are stored locally in the following project folders:

data/raw/20030228_easy_ham/easy_ham/
data/raw/20030228_spam/spam/

The easy_ham folder contains legitimate non-spam emails, while the spam folder contains spam emails. These folder names provide the labels needed for supervised learning.


Reason for Dataset Choice

This dataset is appropriate for Project 4 because it provides a clear binary document classification problem. The emails are already classified, which allows the project to focus on the full text mining and modeling workflow rather than manual labeling.

The SpamAssassin corpus is also directly suggested in the project instructions, making it a strong and assignment-aligned dataset choice. It is realistic enough to demonstrate text classification, but manageable enough for a reproducible DATA 607 project.

I will use easy_ham and spam as the main datasets. I am not starting with hard_ham because those documents are intentionally more difficult and could introduce additional complexity before the main classification workflow is validated. A harder ham dataset can be discussed as a possible future extension.


Data Structure Before Processing

The raw data begins as many individual email files stored inside two labeled folders.

The structure is:

Project4_DATA_607/
├── data/
│   └── raw/
│       ├── 20030228_easy_ham/
│       │   └── easy_ham/
│       │       ├── 00001...
│       │       ├── 00002...
│       │       └── ...
│       └── 20030228_spam/
│           └── spam/
│               ├── 00001...
│               ├── 00002...
│               └── ...

Each file represents one email document. The class label is not stored inside the file itself; instead, it is inferred from the folder that contains the file.

This means the first major data preparation step is to convert the folder-based document collection into a structured table with one row per email.

The planned document-level structure is:

doc_id     label   file_path     text
ham_001    ham     data/raw/…    email text
spam_001   spam    data/raw/…    email text

Overall Strategy

1. Data Ingestion

The raw email files will be read from the local extracted folders.

The planned folder paths are:

ham_dir  <- "data/raw/20030228_easy_ham/easy_ham"
spam_dir <- "data/raw/20030228_spam/spam"

All files from the ham folder will be labeled as ham, and all files from the spam folder will be labeled as spam.

The files will be read into R as plain text. If some files contain unusual characters or encoding issues, the reading function will handle them safely so that the project can continue without failing on one problematic email.
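
A minimal sketch of that safe-reading idea in base R (read_email() is a hypothetical helper name): tryCatch() lets an unreadable file come back as NA instead of stopping the whole run.

read_email <- function(path) {
  # Collapse the raw lines into one string; tolerate encoding warnings.
  tryCatch(
    paste(readLines(path, warn = FALSE, encoding = "latin1"), collapse = "\n"),
    error = function(e) NA_character_  # skip files that cannot be read
  )
}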

2. Document-Level Dataset Creation

After reading the files, I will create one combined dataset with one row per email.

The planned columns are:

  • doc_id
  • label
  • file_path
  • text

This dataset will act as the foundation for the rest of the project.
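
A sketch of assembling that table, assuming the tibble package and the read_email() helper sketched above (build_corpus() is a hypothetical name):

library(tibble)

build_corpus <- function(dir, label) {
  files <- list.files(dir, full.names = TRUE)
  tibble(
    doc_id    = sprintf("%s_%03d", label, seq_along(files)),  # e.g. ham_001
    label     = label,
    file_path = files,
    text      = vapply(files, read_email, character(1))
  )
}

emails <- rbind(build_corpus(ham_dir, "ham"), build_corpus(spam_dir, "spam"))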

3. Text Cleaning

Raw emails often contain headers, punctuation, numbers, HTML fragments, URLs, and other formatting. I will clean the text enough to support classification while avoiding excessive manual changes.

Planned cleaning steps include the following (a code sketch appears after this list):

  • Convert text to lowercase.
  • Remove or reduce email header noise where appropriate.
  • Remove punctuation and unnecessary symbols during tokenization.
  • Remove common stop words.
  • Remove very short or uninformative terms.
  • Preserve the original label for each document.
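
A rough cleaning sketch using stringr; clean_email() is a hypothetical helper, and the header-stripping rule is the simple heuristic that a message body begins after the first blank line. Punctuation and stop words are handled later, during tokenization.

library(stringr)

clean_email <- function(text) {
  body <- str_split_fixed(text, "\n\n", 2)[, 2]   # drop the header block
  body <- str_to_lower(body)                      # lowercase
  body <- str_replace_all(body, "<[^>]+>", " ")   # strip HTML tags
  body <- str_replace_all(body, "http\\S+", " ")  # strip URLs
  str_squish(body)                                # collapse extra whitespace
}

emails$text_clean <- vapply(emails$text, clean_email, character(1))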

4. Tokenization and Feature Engineering

The cleaned text will be tokenized into words using text mining tools in R.

The project will create features such as:

  • word counts by document
  • term frequency
  • TF-IDF values

TF-IDF is useful because it gives more weight to terms that are frequent within a document but rare across the corpus: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t.

The final modeling table will convert text into a format that a classification model can use.
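
A tokenization and feature sketch using dplyr and tidytext; unnest_tokens() lowercases and strips punctuation as it splits, and bind_tf_idf() adds the tf, idf, and tf_idf columns:

library(dplyr)
library(tidytext)

word_counts <- emails %>%
  unnest_tokens(word, text_clean) %>%     # one row per word per document
  anti_join(stop_words, by = "word") %>%  # remove common stop words
  filter(nchar(word) > 2) %>%             # drop very short terms
  count(doc_id, label, word, name = "n")

tfidf <- bind_tf_idf(word_counts, word, doc_id, n)  # adds tf, idf, tf_idf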

5. Train/Test Split

The labeled document dataset will be divided into training and testing sets.

The planned split is:

  • 80% training data
  • 20% testing data

The split will be stratified by class where possible, so that spam and ham appear in similar proportions in both the training and testing sets.

A random seed will be set to make the results reproducible.
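
A stratified 80/20 split sketch using rsample (the strata argument keeps class proportions similar in both sets); caret or base R sampling would work equally well:

library(rsample)

set.seed(607)  # fixed seed for a reproducible split
email_split  <- initial_split(emails, prop = 0.8, strata = label)
train_emails <- training(email_split)
test_emails  <- testing(email_split)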

6. Classification Model

The main model planned for this project is a Naive Bayes classifier.

Naive Bayes is appropriate for text classification because it works well with word-based features and is commonly used for spam detection problems. It estimates the probability that an email belongs to each class from the words that appear in the document, under the simplifying ("naive") assumption that words occur independently given the class.

The model will be trained only on the training data and then evaluated on the withheld testing data.
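
A minimal training-and-prediction sketch using e1071::naiveBayes(); train_dtm and test_dtm are assumed document-term matrices, for example cast from the word counts with tidytext::cast_dtm(). Note that e1071 treats numeric features as Gaussian, so binarizing the counts or using a dedicated multinomial implementation is a common refinement for text.

library(e1071)

# train_dtm and test_dtm are assumed document-term matrices (e.g. built
# with tidytext::cast_dtm()), rows aligned with the email tables above.
nb_model <- naiveBayes(x = as.matrix(train_dtm),
                       y = factor(train_emails$label),
                       laplace = 1)  # Laplace smoothing for unseen terms
preds <- predict(nb_model, newdata = as.matrix(test_dtm))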


Planned Visualizations

Although this project is mainly a classification project, visualizations will be included to make the analysis clearer and more professional.

The planned visualizations are:

1. Class Distribution Plot

A bar chart will show the number of ham and spam documents in the dataset; a plotting sketch appears below.

Purpose:

  • Confirm how many documents are available in each class.
  • Show whether the dataset is imbalanced.
  • Provide context before modeling.
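
A minimal ggplot2 sketch of this chart, assuming the emails table built earlier:

library(ggplot2)

ggplot(emails, aes(x = label, fill = label)) +
  geom_bar() +
  labs(title = "Ham vs. Spam Document Counts",
       x = "Class", y = "Number of documents")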

2. Top Word Frequency Plot

A bar chart will compare the most frequent words in spam and ham emails.

Purpose:

  • Explore vocabulary differences between classes.
  • Identify words that may help the classifier separate spam from ham.
  • Support the text mining interpretation.

3. TF-IDF Term Importance Plot

A TF-IDF visualization will show high-value words for each class; a plotting sketch appears below.

Purpose:

  • Highlight terms that are especially distinctive in spam or ham documents.
  • Provide evidence that the feature engineering step is meaningful.
  • Make the text classification process easier to explain.
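
A sketch of this plot using the tfidf table from the feature step; swapping tf_idf for the raw count n gives the word-frequency plot from the previous section. reorder_within() and scale_x_reordered() are tidytext helpers for ordering bars within facets.

library(dplyr)
library(ggplot2)
library(tidytext)

tfidf %>%
  group_by(label) %>%
  slice_max(tf_idf, n = 15) %>%  # top distinctive terms per class
  ungroup() %>%
  ggplot(aes(x = reorder_within(word, tf_idf, label), y = tf_idf)) +
  geom_col() +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ label, scales = "free_y")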

4. Confusion Matrix Visualization

The final confusion matrix will be shown as either a formatted table or heatmap-style plot.

Purpose:

  • Show correct and incorrect predictions.
  • Identify whether the model makes more false spam or false ham predictions.
  • Connect model performance directly to the project objective.

5. Model Metric Summary Plot

If appropriate, a small bar chart will summarize evaluation metrics such as accuracy, precision, recall, and F1 score.

Purpose:

  • Provide a quick visual summary of model performance.
  • Make the final results easier to compare and interpret.

Evaluation Plan

The model will be evaluated on the withheld testing data.

The main evaluation outputs will include:

  • Confusion matrix
  • Accuracy
  • Precision
  • Recall / sensitivity
  • Specificity
  • F1 score, if appropriate

The confusion matrix is important because the two error types have different practical costs: a false positive (a legitimate ham email classified as spam) can be more harmful than a false negative (a spam message allowed into the inbox). The analysis will discuss both types of errors.
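
A base R sketch of the confusion matrix and derived metrics, treating spam as the positive class and using the test-set predictions (preds) from the modeling step:

cm <- table(Predicted = preds,
            Actual = factor(test_emails$label, levels = c("ham", "spam")))

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["spam", "spam"] / sum(cm["spam", ])  # predicted spam that is spam
recall    <- cm["spam", "spam"] / sum(cm[, "spam"])  # actual spam that was caught
f1        <- 2 * precision * recall / (precision + recall)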


Validation Plan

To make the workflow reliable, I will include several validation checks.

Data Validation Checks

Before modeling, I will check the following (a code sketch appears after this list):

  • whether both raw data folders exist
  • how many files are read from each folder
  • whether every document has a label
  • whether any documents have missing or empty text
  • whether the final dataset contains both spam and ham classes
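
A sketch of these checks with stopifnot(), which stops loudly rather than letting a broken dataset flow into modeling:

stopifnot(
  dir.exists(ham_dir), dir.exists(spam_dir),      # both raw folders exist
  !any(is.na(emails$label)),                      # every document is labeled
  !any(is.na(emails$text) | emails$text == ""),   # no missing or empty text
  all(c("ham", "spam") %in% emails$label)         # both classes present
)
table(emails$label)  # file counts read from each folder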

Feature Validation Checks

After tokenization and feature creation, I will check:

  • number of tokens produced
  • most common words after cleaning
  • whether stop words were removed
  • whether rare terms were filtered properly
  • whether each document is represented in the feature table

Model Validation Checks

After training and prediction, I will check:

  • whether predictions were generated for all test documents
  • whether predicted labels use only valid classes
  • whether the confusion matrix dimensions are correct
  • whether evaluation metrics are calculated from the withheld test set only

These checks are included to avoid silent errors and to make the project more reproducible and trustworthy.


Planned Processed Outputs

The raw emails may be converted into processed outputs during the analysis.

The planned processed files are:

data/processed/spam_ham_emails.csv
data/processed/spam_ham_features.csv

The first file will contain the structured document-level version of the raw corpus. The second file will contain the feature-ready representation used for modeling, such as token counts or TF-IDF values.
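
A sketch of writing these outputs with readr, assuming the emails and word_counts tables from the earlier steps:

library(readr)

dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)
write_csv(emails,      "data/processed/spam_ham_emails.csv")
write_csv(word_counts, "data/processed/spam_ham_features.csv")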

If a processed output contains full email text, I will consider carefully whether it should be uploaded publicly. The GitHub repository will include the code and folder instructions, while the raw email folders will remain local.


Reproducibility Plan

The raw extracted email folders will not be uploaded directly to GitHub because they contain many individual files. Instead, the GitHub repository will include the project code, folder structure, and instructions for reproducing the analysis.

To reproduce the project, a user should:

  1. Download 20030228_easy_ham.tar.bz2 and 20030228_spam.tar.bz2 from the SpamAssassin Public Mail Corpus.
  2. Extract both files.
  3. Place the extracted folders inside data/raw/.
  4. Confirm the final paths are:
data/raw/20030228_easy_ham/easy_ham/
data/raw/20030228_spam/spam/
  5. Run the main Quarto file from the project root directory.

The repository will contain data/raw/.gitkeep to show the expected folder structure without uploading the raw corpus files.


Anticipated Challenges

  1. Reading many individual email files into one structured dataset.
  2. Handling unusual characters, HTML fragments, or email header content.
  3. Avoiding data leakage between training and testing sets.
  4. Managing class imbalance between ham and spam documents.
  5. Choosing enough text features for useful prediction without making the feature table too large.
  6. Interpreting model errors in a practical way.
  7. Keeping the workflow reproducible without uploading thousands of raw email files to GitHub.

Expected Outcome

The expected result is that the Naive Bayes classifier will be able to distinguish between spam and ham emails with reasonable accuracy. Spam emails often contain different vocabulary, promotional terms, links, and formatting compared with legitimate emails, so the text features should provide useful classification signals.

The final analysis will report the model’s performance and discuss which kinds of errors occurred. It will also use visualizations to show class balance, important terms, and final classification results.


Conclusion

This project is a complete document classification workflow using labeled spam and ham emails. By using the SpamAssassin corpus, the project aligns directly with the assignment instructions while still allowing a full supervised learning pipeline.

The focus of the project will be correctness, reproducibility, clear text preparation, meaningful visualizations, and honest model evaluation. The final result will demonstrate how raw email documents can be transformed into structured text features and used to classify unseen documents as spam or ham.