Project 4

Dataset Setup Instructions

To replicate this project:

  1. Clone GitHub repo: https://github.com/tcgraham-data/data-607-project-4
  2. Unzip the compressed datasets in the data/ folder
  3. Rename the extracted folders to spam/ and ham/ for consistency (steps 2 and 3 are sketched in R below)
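
If you prefer to script steps 2 and 3, a minimal sketch in base R follows. The archive locations and extracted folder names are assumptions based on the layout described below; adjust them to match what you actually downloaded.

# Optional helper: extract and rename the archives from R.
# Paths and extracted folder names are assumptions -- check what the
# archives contain on your machine before running.
untar("data/spam.tar.bz2", exdir = "data")
untar("data/easy_ham.tar.bz2", exdir = "data")

# Rename the extracted ham folder to the name the project expects
if (dir.exists("data/easy_ham")) file.rename("data/easy_ham", "data/ham")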

This project applies a document classification pipeline using a labeled dataset of spam and ham emails from the SpamAssassin corpus. The objective is to use this corpus to build a predictive model that can classify unseen documents (emails) as spam or not spam.

Manual Extraction (one-time setup)

After extraction, the project expects the following directory layout:

data/
├── spam/            ← extracted from spam.tar.bz2
└── ham/             ← extracted from easy_ham.tar.bz2
tg-project4.Rmd      ← the R Markdown file lives at the project root, alongside data/

Setup & Scaffold

We load all required libraries for text preprocessing (tidytext, tm), model training (e1071, caret), and file handling (fs, readr). This ensures we have the tools to transform raw text into numeric features, train a classifier, and evaluate results.

Package installs are commented out since they only need to be run once; the libraries used throughout the rest of the project are loaded here.
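
The original library chunk is not shown in this document; a minimal sketch of the loads implied by the code that follows would look like this (the package set is inferred from the functions used, so adjust if your setup differs):

# install.packages(c("tidytext", "tm", "e1071", "caret", "fs", "readr",
#                    "dplyr", "tibble", "purrr", "stringr"))  # one-time installs

library(tidytext)   # unnest_tokens(), stop_words, cast_dtm()
library(tm)         # Document-Term Matrix backend used by cast_dtm()
library(e1071)      # naiveBayes()
library(caret)      # createDataPartition(), confusionMatrix()
library(fs)         # dir_ls()
library(readr)      # read_file()
library(dplyr)      # pipes, slice_sample(), bind_rows(), mutate()
library(tibble)     # tibble()
library(purrr)      # map_chr()
library(stringr)    # str_detect()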

We read in raw emails from two folders — one labeled spam and the other ham — using a custom function. Each email is read as plain text and assigned a label.

Initially, we had ~500 spam and ~2500 ham emails, which created a severe class imbalance. To address this and give the model a balanced starting point, we randomly sampled 500 ham emails to match the spam count.

# Define paths
spam_path <- "data/spam"
ham_path <- "data/ham"

# Load email files into a dataframe
read_emails <- function(path, label) {
  files <- dir_ls(path, type = "file")  # safeguard against reading directories
  tibble(
    text = map_chr(files, ~ read_file(.x)),
    label = label
  )
}

spam_df <- read_emails(spam_path, "spam")
ham_df  <- read_emails(ham_path, "ham")

# Balance the dataset (undersample ham)
set.seed(123)
ham_bal <- ham_df %>% slice_sample(n = nrow(spam_df))
email_df <- bind_rows(spam_df, ham_bal) %>% 
  mutate(id = row_number())

Preprocessing

Next, we tokenize each email into individual words (unigrams), filter out common stop words, and remove non-alphabetic content. The cleaned words are then converted into a Document-Term Matrix (DTM), which records how often each word appears in each email.

This numeric matrix serves as input for our machine learning model.

# Tokenize and clean text
clean_tokens <- email_df %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

# Create a Document-Term Matrix (DTM)
dtm <- clean_tokens %>%
  count(id, word) %>%
  cast_dtm(document = id, term = word, value = n)

# Get labels and factor them consistently
labels <- factor(email_df$label, levels = c("ham", "spam"))

Train/Test Split

We split the DTM into training and test sets (80/20 split) while preserving the balance between spam and ham labels. Factor levels are explicitly set to avoid label mismatches during prediction.

# Train/test split using rows of the DTM
set.seed(42)
train_indices <- createDataPartition(labels, p = 0.8, list = FALSE)
dtm_train <- dtm[train_indices, ]
dtm_test  <- dtm[-train_indices, ]
labels_train <- labels[train_indices]
labels_test  <- labels[-train_indices]

Train Naive Bayes Classifier

We train a Naive Bayes classifier on the training portion of the data. Naive Bayes is a simple, efficient probabilistic model widely used in text classification, particularly spam detection. Its core assumption is that the features (here, word counts) are conditionally independent given the class. Credit to online resources for pointing me toward this approach.

model <- naiveBayes(as.matrix(dtm_train), as.factor(labels_train))
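
As a quick sanity check on the fitted e1071 model (a sketch, assuming the objects defined above), we can look at the class distribution the priors are built from and request posterior probabilities instead of hard labels:

# Class distribution of the training labels (used to form the priors)
model$apriori

# Posterior probabilities for the test documents, one row per email
head(predict(model, as.matrix(dtm_test), type = "raw"))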

Predictions and Evaluation

We use the trained model to predict the class of the test emails and generate a confusion matrix to evaluate how well the predictions matched the true labels.

The distribution of predictions is also printed to check for model bias — in earlier attempts, the model predicted overwhelmingly one class, which signaled imbalance or a feature mismatch.

predictions <- predict(model, as.matrix(dtm_test))

# Show how many of each label were predicted
table(predictions)
## predictions
##  ham spam 
##  198    2
# Confusion Matrix
confusion <- confusionMatrix(predictions, as.factor(labels_test))
confusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham   99   99
##       spam   1    1
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4287, 0.5713)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.5282          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.990           
##             Specificity : 0.010           
##          Pos Pred Value : 0.500           
##          Neg Pred Value : 0.500           
##              Prevalence : 0.500           
##          Detection Rate : 0.495           
##    Detection Prevalence : 0.990           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : ham             
## 

Results and Reflections

The confusion matrix and class-based metrics (precision, recall, F1) show that the model achieved approximately 50% accuracy on the test set. While this isn’t high, it is a substantial improvement over our first attempt.

In our initial model (not shown here), we used the full unbalanced dataset and found the classifier predicted almost every email as spam — resulting in only 16% accuracy. That failure highlighted the importance of balancing class distributions and ensuring proper feature alignment.

By undersampling ham to match spam and reprocessing the data, we corrected those issues and produced a functioning pipeline. The results show the model can now identify both classes, even if it’s far from perfect.

# Display per-class metrics: precision, recall, F1, etc.
confusion$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9900000            0.0100000            0.5000000 
##       Neg Pred Value            Precision               Recall 
##            0.5000000            0.5000000            0.9900000 
##                   F1           Prevalence       Detection Rate 
##            0.6644295            0.5000000            0.4950000 
## Detection Prevalence    Balanced Accuracy 
##            0.9900000            0.5000000
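
As a sanity check on these metrics, the F1 score can be reproduced by hand from the precision and recall reported above (values rounded from the output):

# F1 is the harmonic mean of precision and recall
precision <- 0.50
recall    <- 0.99
2 * precision * recall / (precision + recall)
# ~0.664, matching the F1 reported by confusionMatrix()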

Conclusion

This project was not about creating a world-class spam filter. It was about implementing a working classification pipeline: loading real-world data, preparing it for machine learning, training a model, and evaluating results. We also learned how model performance can be shaped by data imbalance and preprocessing decisions — an important takeaway for future text analysis projects.