The objective of this project is to build a predictive text classification model that can distinguish between spam and ham (non-spam) emails. The analysis uses a labeled corpus of email messages as training data and applies text preprocessing techniques to prepare both training and test datasets for classification.
The final goal is to use already labeled emails to predict the class (spam or ham) of new, unseen email documents.
Approach
For this project, I used a combination of labeled email datasets from the SpamAssassin public corpus, including spam and easy ham folders, along with an additional MBOX file containing my personal spam emails used as test data.
The central question is whether a machine learning model trained on labeled email data can accurately classify new, unseen emails as spam or ham based on text content.
The dataset consists of raw email messages that include headers, system metadata, HTML content, encoding artifacts, and unstructured text. Therefore, significant preprocessing is required to convert the raw emails into a consistent and analyzable format.
The analysis follows a supervised learning approach, where:
Training data consists of labeled spam and ham emails
Test data consists of unseen spam emails (MBOX file)
Features are derived from cleaned email text
Data Analysis Steps
The analysis is structured as follows:
Data Collection (Training Set):
Spam and ham emails are loaded from local directories (spam_2 and easy_ham)
Each file is read and combined into a single raw text string per email
Label Assignment:
Emails from spam_2 are labeled as spam
Emails from easy_ham are labeled as ham
Email Parsing:
Each email is split into headers and body
Only the email body is retained for text analysis
Text Cleaning (Training Data):
Convert text to lowercase
Remove HTML tags, URLs, and email addresses
Remove punctuation, digits, and non-ASCII characters
Normalize whitespace for consistency
Test Data Preparation (MBOX File):
The MBOX file is read as a single raw text file
Emails are separated using "From " delimiters
Email body is extracted from each message
Quoted-printable encoding artifacts are decoded
System metadata and email headers are removed
Same cleaning pipeline is applied as in training data
Feature Standardization:
Both training and test datasets are transformed into a consistent format containing only cleaned text
Filtering:
Very short or incomplete emails are removed to improve data quality
Classification Preparation:
The cleaned training dataset is used to build a model
The test dataset is prepared for prediction using the same feature structure
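The parsing and cleaning steps above can be sketched in base R as follows. This is a minimal illustration, not the report's actual code: the helper names split_email and clean_text, the regular expressions, and the sample message are all assumptions made for the example.

```r
# Hypothetical helper: split a raw email into headers and body
# at the first blank line.
split_email <- function(raw) {
  raw <- gsub("\r\n", "\n", raw, fixed = TRUE)   # normalize line endings
  pos <- as.integer(regexpr("\n\n", raw, fixed = TRUE))
  if (pos == -1L) {
    return(list(headers = raw, body = ""))       # no blank line found
  }
  list(
    headers = substr(raw, 1L, pos - 1L),
    body    = substr(raw, pos + 2L, nchar(raw))
  )
}

# Hypothetical cleaning function following the listed steps.
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")   # non-ASCII -> space
  x <- tolower(x)                                          # lowercase
  x <- gsub("<[^>]+>", " ", x)                             # HTML tags
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE) # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)              # email addresses
  x <- gsub("[[:punct:][:digit:]]", " ", x)                # punctuation, digits
  x <- gsub("\\s+", " ", x)                                # collapse whitespace
  trimws(x)
}

# Made-up sample message for illustration only.
raw <- paste(
  "Subject: Win now",
  "From: a@b.com",
  "",
  "<p>Visit http://spam.example/win NOW!</p> Call 555-1234 or mail me@x.org",
  sep = "\n"
)
parts   <- split_email(raw)
cleaned <- clean_text(parts$body)
print(cleaned)
```

Tags are removed before URLs so that a link followed directly by a closing tag does not swallow the tag, and whitespace is normalized last because every earlier step replaces matches with a space.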
Dataset Structure
Training Data:
Spam emails (spam_2) → labeled as spam
Ham emails (easy_ham) → labeled as ham
Test Data:
Personal spam emails stored in MBOX format (unlabeled)
Modeling Strategy Overview
The overall workflow follows a standard supervised text classification pipeline:
Extract text-based features from cleaned email content
Train a model using labeled spam/ham emails
Apply the trained model to predict labels for new unseen emails
Evaluate how well the model generalizes to real-world spam messages
Anticipated Challenges
A key challenge in this project is handling highly unstructured email data. Emails contain mixed formats such as plain text, HTML, encoded characters, and system-generated headers, all of which must be carefully removed or standardized.
Another challenge is ensuring consistency between training and test datasets. Since they originate from different sources (folder-based corpus vs. MBOX file), they require separate preprocessing pipelines that still produce comparable cleaned outputs.
Additionally, decoding quoted-printable content and removing encoding artifacts without corrupting the actual message text requires careful implementation.
Finally, balancing text cleaning is critical: excessive cleaning may remove meaningful words important for classification, while insufficient cleaning may retain noise that reduces model performance.
Implementation of Data Import and Cleaning
The following code demonstrates how the training and test email datasets are loaded and structured for analysis in R.
Training Data Import and Cleaning
This section loads and processes 3,879 labeled emails (1,391 spam and 2,488 ham) from the SpamAssassin corpus. After HTML, URLs, punctuation, and other noise are removed, each email is reduced to a standardized text format. The final output is a structured dataset (train_emails) with cleaned text and class labels for model training. The classes are moderately imbalanced, at roughly 36% spam and 64% ham.
Rows: 3,879
Columns: 2
$ text <chr> "greetings you are receiving this letter because you have expres…
$ label <fct> spam, spam, spam, spam, spam, spam, spam, spam, spam, spam, spam…
table(train_emails$label)
spam ham
1391 2488
print(train_emails, n =10)
# A tibble: 3,879 × 2
text label
<chr> <fct>
1 greetings you are receiving this letter because you have expressed an … spam
2 the need for safety is real in you might only get one chance be ready … spam
3 bonus fat absorbers as seen on tv included free with purchase of or mo… spam
4 bonus fat absorbers as seen on tv included free with purchase of or mo… spam
5 government grants e book edition just $ summer sale good until august … spam
6 cpurf = = = = = = = =b z=c =d =a = b=a =ce = =aa=ba=abh=a =ce=a d=b =d… spam
7 new product announcement from outsource eng mfg inc sir madam this not… spam
8 thank you for your interest judgment courses offers an extensive audio… spam
9 === secatt fnapngoonxcrm content type text plain charset= us ascii con… spam
10 internet service providers we apologize if this is an unwanted email w… spam
# ℹ 3,869 more rows
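A dataset like the one printed above could be assembled along these lines. The sketch assumes one email per file in each folder; the helper read_email_dir, the temporary toy folders, and the min_chars threshold are hypothetical stand-ins, since the report does not show its loading code or its length cutoff.

```r
# Hypothetical loader: one raw text string per file, plus a label column.
read_email_dir <- function(dir, label) {
  files <- list.files(dir, full.names = TRUE)
  text <- vapply(files, function(f) {
    paste(readLines(f, warn = FALSE), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  data.frame(text = text, label = rep(label, length(text)),
             stringsAsFactors = FALSE)
}

# Toy stand-ins for the spam_2 and easy_ham folders (not the real corpus).
root <- tempfile("corpus_")
dir.create(file.path(root, "spam_2"), recursive = TRUE)
dir.create(file.path(root, "easy_ham"), recursive = TRUE)
writeLines(c("Subject: offer", "", "Buy cheap pills now"),
           file.path(root, "spam_2", "0001"))
writeLines(c("Subject: hi", "", "ok"),
           file.path(root, "spam_2", "0002"))
writeLines(c("Subject: meeting", "", "See you at the meeting tomorrow"),
           file.path(root, "easy_ham", "0001"))

# Label by source folder and combine.
train_raw <- rbind(
  read_email_dir(file.path(root, "spam_2"), "spam"),
  read_email_dir(file.path(root, "easy_ham"), "ham")
)
train_raw$label <- factor(train_raw$label, levels = c("spam", "ham"))

# Drop very short emails; 20 bytes is an illustrative threshold only.
min_chars   <- 20
train_small <- train_raw[nchar(train_raw$text, type = "bytes") >= min_chars, ]
print(table(train_small$label))
```

Counting bytes rather than characters avoids errors on messages whose encoding is not valid in the current locale, which is common in raw email corpora.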
Test Data Import and Cleaning (MBOX Dataset)
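This section processes 50 real-world Gmail emails exported in MBOX format by splitting the raw file into individual messages and extracting each body. The cleaning pipeline targets encoded text artifacts, system headers, and HTML content, and collapses each email into a single line of lowercase text. As the sample output below shows, some MIME part headers (for example, "content transfer encoding 7bit") and CSS rules from the HTML part still survive cleaning. The final output (test_df) contains 50 cleaned, unlabeled emails, all presumed to be spam, used for external validation.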
Rows: 50
Columns: 2
$ raw_text <chr> "From 1861021134463221248@xxx Sun Mar 29 18:14:18 +0000 2026\…
$ text <chr> "de e1 26169 7fb69c96 content transfer encoding 7bit content …
print(test_df, n =1, width =Inf)
# A tibble: 50 × 2
raw_text
<chr>
1 "From 1861021134463221248@xxx Sun Mar 29 18:14:18 +0000 2026\nX-GM-THRID: 186…
text
<chr>
1 de e1 26169 7fb69c96 content transfer encoding 7bit content type text plain c…
# ℹ 49 more rows
cat(test_df$text[1])
de e1 26169 7fb69c96 content transfer encoding 7bit content type text plain charset utf 8 rescue food and save money this weekend spring break for your wallet looking for ways to save on food start a springtime saving streak find a store close to you and pick up a bag weekly stock your fridge or pantry and watch the savings add up find my bag facebook instagram please do not reply to this email you are receiving this email because you created a too good to go account from our mobile app if you no longer wish to receive emails please update your email settings or unsubscribe for more information visit toogoodtogo com tgtg claims too good to go aps landskronagade 66 2100 copenhagen denmark de e1 26169 7fb69c96 content transfer encoding quoted printable content type text html charset utf 8 96 box sizing border box body margin 0 padding 0 a x apple data detectors color inherit important text decoration inherit important a color inherit text decoration none p line height inherit desktop hide desktop hide table mso hide all display none max height 0 overflow hidden image block img div display none sub sup font size 75 line height 0 media max width 620px mobile hide display none row content width 100 important stack column width 100 display block mobile hide min height 0 max height 0 max width 0 overflow hidden font size 0 desktop hide desktop hide table display table important max height none important sup sub font size 100 important sup mso text raise 10 sub mso text raise 10 rescue food and save money this weekend spring break for your wallet looking for ways to save on food start a springtime saving streak find a store close to you and pick up a bag weekly stock your fridge or pantry and watch the savings add up find my bag please do not reply to this email you are receiving this email because you created a too good to go account from our mobile app if you no longer wish to receive emails please update your email settings or unsubscribe for more information visit 
toogoodtogo com tgtg claims too good to go aps landskronagade 66 2100 copenhagen denmark de e1 26169 7fb69c96
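The MBOX handling described for the test data (splitting on "From " lines, extracting the body, decoding quoted-printable) might look roughly like this in base R. The functions split_mbox and decode_qp and the two-message sample are hypothetical, and the decoder assumes a UTF-8 payload, which matches the charset shown in the output above but is not guaranteed for every message.

```r
# Hypothetical splitter: a new message starts at each line beginning "From ".
split_mbox <- function(lines) {
  msg_id <- cumsum(grepl("^From ", lines))
  keep   <- msg_id > 0                      # ignore anything before the first message
  msgs   <- split(lines[keep], msg_id[keep])
  unname(vapply(msgs, paste, character(1), collapse = "\n"))
}

# Hypothetical quoted-printable decoder working on raw bytes.
decode_qp <- function(x) {
  x <- gsub("=\r?\n", "", x)                # remove soft line breaks
  hex_digits <- charToRaw("0123456789ABCDEFabcdef")
  b <- charToRaw(x)
  n <- length(b)
  out <- raw(n)
  i <- 1L
  j <- 0L
  while (i <= n) {
    j <- j + 1L
    if (b[i] == as.raw(0x3d) && i + 2L <= n &&
        all(b[i + 1:2] %in% hex_digits)) {
      # "=XX" escape: convert the two hex digits to one byte
      out[j] <- as.raw(strtoi(rawToChar(b[i + 1:2]), 16L))
      i <- i + 3L
    } else {
      out[j] <- b[i]
      i <- i + 1L
    }
  }
  out <- out[seq_len(j)]
  out <- out[out != as.raw(0)]              # rawToChar cannot hold NUL bytes
  res <- rawToChar(out)
  Encoding(res) <- "UTF-8"                  # assumption: payload is UTF-8
  res
}

# Made-up two-message MBOX content for illustration.
lines <- c(
  "From 111@xxx Sun Mar 29 18:14:18 +0000 2026",
  "Subject: A",
  "",
  "Save 50=25 on fo=",
  "od today",
  "From 222@xxx Mon Mar 30 09:00:00 +0000 2026",
  "Subject: B",
  "",
  "caf=C3=A9"
)
msgs    <- split_mbox(lines)
bodies  <- sub("(?s)^.*?\n\n", "", msgs, perl = TRUE)   # drop headers
decoded <- vapply(bodies, decode_qp, character(1), USE.NAMES = FALSE)
print(decoded)
```

Decoding byte by byte keeps multi-byte sequences such as =C3=A9 intact. Note that a body line that itself starts with "From " would be treated as a new message by this splitter unless the exporter escaped it.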
Feature Engineering and Cross-Validation Setup
This section converts email text into TF-IDF features with a maximum of 2000 tokens after tokenization and stopword removal. A 5-fold cross-validation framework is created to ensure robust model evaluation. The output is a reproducible modeling pipeline that prepares consistent input for both SVM and Logistic Regression.
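To make the feature step concrete, here is a small base R sketch of TF-IDF with a capped vocabulary, followed by a 5-fold assignment. It is an illustration under stated assumptions, not the report's tidymodels pipeline: the four documents and the stopword list are made up, max_tokens is set to 5 instead of 2000, the weighting shown is one common TF-IDF definition (term frequency times log(N / document frequency)), and the folds are not stratified by label.

```r
# Made-up cleaned documents and a tiny illustrative stopword list.
docs <- c(
  "win free money now",
  "free prize claim now",
  "meeting agenda for monday",
  "lunch on monday"
)
stop_words <- c("for", "on", "the", "a")

# Tokenize on whitespace and drop stopwords.
tokens <- lapply(strsplit(docs, "\\s+"), function(w) w[!w %in% stop_words])

# Keep only the most frequent terms (the report uses a cap of 2000).
max_tokens <- 5
freq  <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- names(freq)[seq_len(min(max_tokens, length(freq)))]

# Term frequency: count of each vocabulary term divided by document length.
tf <- t(vapply(tokens, function(w) {
  as.numeric(table(factor(w, levels = vocab))) / length(w)
}, numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency and the final TF-IDF matrix.
doc_freq <- colSums(tf > 0)
idf      <- log(length(docs) / doc_freq)
tfidf    <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))

# 5-fold assignment for the 3,879 training emails: shuffle fold ids 1..5.
set.seed(123)
n_train <- 3879
fold_id <- sample(rep(seq_len(5), length.out = n_train))
print(table(fold_id))
```

With 3,879 rows, four folds hold 776 emails and one holds 775, so each model is trained on roughly 80% of the data and assessed on the remaining 20% in every round.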
Linear SVM Model Training: A Linear Support Vector Machine is tuned using cross-validation over cost values and evaluated using accuracy, precision, recall, and F1-score. The model achieves extremely strong performance during cross-validation with accuracy around 0.992 and recall up to 0.996. The final fitted SVM model is selected based on highest accuracy and used for prediction.
Logistic Regression Model Training: A regularized logistic regression model is tuned using a grid of penalty values and evaluated using ROC-AUC and classification metrics. Cross-validation results show strong performance, with accuracy around 0.992 for the selected penalty, close to the SVM results. The best penalty model is selected and refit on the full training dataset.
Model Comparison (Cross-Validation): Both models are compared using aggregated cross-validation results. Linear SVM slightly outperforms Logistic Regression across most metrics, especially recall and F1-score, while both maintain very high accuracy. The results indicate strong and stable performance for both models, with a small advantage for SVM.
Holdout Test Set Evaluation (Labeled Split Data): Both models are evaluated on a holdout test split using confusion matrices and classification metrics. Linear SVM achieves 0.990 accuracy, 0.988 precision, 0.996 recall, and 0.992 F1-score, while Logistic Regression achieves 0.973 accuracy, 0.980 precision, 0.978 recall, and 0.979 F1-score. The results indicate good generalization, with SVM ahead of Logistic Regression on every one of these metrics.
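For readers who want to see how these metrics relate to a confusion matrix, the following base R sketch computes them from four counts. The function metrics_from_counts and the counts passed to it are hypothetical and are not the report's confusion matrix. The last two lines check that the F1-scores reported for the holdout split are consistent with the reported precision and recall.

```r
# Hypothetical helper: classification metrics from confusion-matrix counts,
# where "positive" is whichever class is treated as the event of interest.
metrics_from_counts <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(
    accuracy  = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall    = recall,
    f_meas    = 2 * precision * recall / (precision + recall)
  )
}

# Illustrative counts only (not taken from the report).
example <- metrics_from_counts(tp = 90, fp = 5, fn = 10, tn = 95)
print(round(example, 3))

# F1 is the harmonic mean of precision and recall.
f1 <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}
svm_f1 <- round(f1(0.988, 0.996), 3)   # reported SVM precision and recall
log_f1 <- round(f1(0.980, 0.978), 3)   # reported logistic regression values
print(c(svm = svm_f1, logistic = log_f1))
```

Both recomputed values match the reported F1-scores of 0.992 and 0.979.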
# ============================================================
# MODEL EVALUATION AND COMPARISON
# Compare:
# 1. Cross Validation Results
# 2. Holdout Test Set Performance
# 3. Final Best Model
# ============================================================
library(tidyverse)
library(tidymodels)

#--------------------------------------------------
# 1. CROSS VALIDATION RESULTS
#--------------------------------------------------
# SVM CV Results
# Note: these rows are not filtered to a single tuned configuration,
# so each metric can appear once per cost value in the output
svm_metrics <- collect_metrics(svm_cv_results) %>%
  select(.metric, mean, std_err) %>%
  mutate(model = "Linear SVM")

# Logistic Regression CV Results (best penalty only)
log_metrics <- collect_metrics(log_cv_results) %>%
  filter(.config == best_lambda$.config) %>%
  select(.metric, mean, std_err) %>%
  mutate(model = "Logistic Regression")

# Combine Results
cv_results <- bind_rows(svm_metrics, log_metrics) %>%
  select(model, .metric, mean, std_err)

print(cv_results)
# A tibble: 37 × 4
model .metric mean std_err
<chr> <chr> <dbl> <dbl>
1 Linear SVM accuracy 0.992 0.00114
2 Linear SVM f_meas 0.994 0.000887
3 Linear SVM precision 0.991 0.000998
4 Linear SVM recall 0.996 0.00101
5 Linear SVM accuracy 0.992 0.00114
6 Linear SVM f_meas 0.994 0.000887
7 Linear SVM precision 0.991 0.000998
8 Linear SVM recall 0.996 0.00101
9 Linear SVM accuracy 0.992 0.00114
10 Linear SVM f_meas 0.994 0.000887
# ℹ 27 more rows
#--------------------------------------------------
# 3. FINAL TEST SET COMPARISON
#--------------------------------------------------
final_results <- bind_rows(svm_test_metrics, log_test_metrics) %>%
  select(model, .metric, .estimate)

print(final_results)
# A tibble: 9 × 3
model .metric .estimate
<chr> <chr> <dbl>
1 Linear SVM accuracy 0.990
2 Linear SVM precision 0.988
3 Linear SVM recall 0.996
4 Linear SVM f_meas 0.992
5 Logistic Regression accuracy 0.973
6 Logistic Regression precision 0.980
7 Logistic Regression recall 0.978
8 Logistic Regression f_meas 0.979
9 Logistic Regression roc_auc 0.00240
#--------------------------------------------------
# 4. BEST MODEL BASED ON ACCURACY
#--------------------------------------------------
final_results %>%
  filter(.metric == "accuracy") %>%
  arrange(desc(.estimate))
# A tibble: 2 × 3
model .metric .estimate
<chr> <chr> <dbl>
1 Linear SVM accuracy 0.990
2 Logistic Regression accuracy 0.973
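One value in the holdout table stands out: the Logistic Regression roc_auc of 0.00240 is far below 0.5 even though its accuracy, precision, and recall are all around 0.98. A possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the AUC was computed with the predicted probability of the opposite class (or with the event level reversed). In that case the true value would be 1 - 0.00240 = 0.9976. The base R sketch below uses made-up scores and a hypothetical auc helper (the rank-based Mann-Whitney formula) to show how swapping the probability column mirrors the AUC around 0.5.

```r
# Hypothetical helper: rank-based AUC, i.e. the probability that a randomly
# chosen positive case receives a higher score than a randomly chosen negative.
auc <- function(score, is_positive) {
  r     <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and predicted spam probabilities for illustration.
truth  <- c("spam", "spam", "spam", "ham", "ham", "ham")
p_spam <- c(0.9, 0.8, 0.4, 0.6, 0.2, 0.1)

auc_right <- auc(p_spam, truth == "spam")       # spam scored with P(spam)
auc_wrong <- auc(1 - p_spam, truth == "spam")   # spam scored with P(ham)
print(c(right = auc_right, wrong = auc_wrong))

# Applying the same mirror to the reported value:
print(1 - 0.00240)
```

The two AUC values always sum to 1 when there are no tied scores, so a near-zero AUC alongside high accuracy usually points to a labeling or column mix-up rather than a poor model.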
Real-World Gmail Spam Evaluation
Both models are applied to 50 real Gmail emails assumed to be spam based on Gmail labeling. Linear SVM classifies about 92% of them as spam (~0.92 accuracy), while Logistic Regression drops to ~0.58, which is consistent with the 29 of 50 spam predictions in the table below. Because every email in this set belongs to a single class, accuracy here is the same as spam recall, and precision is not informative. Within that limit, the results suggest that SVM is more robust under noisy real-world conditions.
# A tibble: 2 × 2
prediction count
<fct> <int>
1 ham 21
2 spam 29
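The effect of evaluating on a single class can be worked through directly from the counts in the table above. The sketch assumes, as the report does, that all 50 emails are truly spam. The variable names are illustrative.

```r
# Predictions reconstructed from the table above: 29 spam, 21 ham.
lv    <- c("spam", "ham")
pred  <- factor(c(rep("spam", 29), rep("ham", 21)), levels = lv)

# Assumption from the report: every one of the 50 emails is spam.
truth <- factor(rep("spam", 50), levels = lv)

tab <- table(truth = truth, prediction = pred)
print(tab)

accuracy       <- mean(pred == truth)
recall_spam    <- tab["spam", "spam"] / sum(tab["spam", ])
precision_spam <- tab["spam", "spam"] / sum(tab[, "spam"])
recall_ham     <- tab["ham", "ham"] / sum(tab["ham", ])   # 0 / 0

print(c(accuracy = accuracy, recall_spam = recall_spam,
        precision_spam = precision_spam, recall_ham = recall_ham))
```

Accuracy and spam recall are both 29 / 50 = 0.58, spam precision is trivially 1 because no ham emails exist to be misclassified as spam, and ham recall is undefined. A test set containing both classes would be needed to measure false positives on real-world mail.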
Conclusion
This project built and evaluated spam classification models using TF-IDF features and supervised learning. Both Linear SVM and Logistic Regression performed strongly on the holdout test split, with accuracy above 97%. Linear SVM consistently outperformed Logistic Regression, reaching about 99% accuracy with higher recall. On the 50 real-world Gmail spam emails, SVM also held up better (~0.92 accuracy), while Logistic Regression dropped to ~0.58. Overall, Linear SVM was the more reliable and effective of the two models in both controlled and real-world settings.
References
OpenAI. (2026). ChatGPT conversation: Email spam classification approach and R preprocessing pipeline [Large language model]. https://chat.openai.com/
OpenAI. (2026). ChatGPT conversation: Spam email classification project using SVM and logistic regression in R [Large language model]. https://chat.openai.com/