Project Four - Document Classification

Author

Brandon Chanderban

Published

April 29, 2026

Introduction/Approach

The objective of this assignment is to build a document classification model that predicts whether an email is spam or ham (legitimate mail). The SpamAssassin Public Corpus (Apache SpamAssassin Project, n.d.) is the planned data source, since it already contains labeled spam and non-spam email messages.

Data Preparation

The spam and ham files will first be downloaded and extracted. Each email will then be imported into R and labeled according to the folder it came from.

  • spam = 1

  • ham = 0

The text will then be cleaned by removing unnecessary punctuation, numbers, stopwords, and extra whitespace. After preprocessing, the emails will be converted into a Document-Term Matrix, where each row represents an email and each column represents a term.

Model Construction

A predictive classifier will then be trained on the labeled email data. One likely method is Naive Bayes, since it is commonly used for text classification and spam filtering.

Evaluation Plan

The data will be split into training and testing sets. The model will be trained on the training set and evaluated on the withheld test set using measures such as accuracy, precision, recall, and F1-score.

Particular attention will be paid to false positives and false negatives, since legitimate emails being classified as spam, or spam emails being missed, would both affect the usefulness of the classifier.

Potential Challenges

One possible challenge is that spam messages may use varied or misleading language, making classification more difficult. Another challenge is balancing the tradeoff between catching spam and avoiding the incorrect classification of legitimate emails.

Code Base/Body

Because of file size constraints, and to keep the analysis reproducible, the separate spam and ham data folders were first combined into a single CSV file. This consolidated dataset was then uploaded to my personal GitHub repository so that it can be read directly from a raw GitHub URL rather than from local file paths.
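The script used to combine the folders is not shown in this report. As a minimal sketch of how such a step could look, the base R code below reads every file in a folder into one row per email and writes the result to a CSV. The helper name read_email_folder, the ham/ and spam/ folder layout, and the toy files are all assumptions made for illustration.

```r
# Hypothetical helper: one row per file, with the whole email as a single string.
read_email_folder <- function(folder, label) {
  files <- list.files(folder, full.names = TRUE)
  data.frame(
    file_name = basename(files),
    text = vapply(
      files,
      function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
      character(1),
      USE.NAMES = FALSE
    ),
    label = rep(label, length(files)),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

# Toy folders standing in for the extracted corpus (assumed layout).
base_dir <- tempfile("corpus_")
dir.create(file.path(base_dir, "ham"), recursive = TRUE)
dir.create(file.path(base_dir, "spam"), recursive = TRUE)
writeLines("Meeting moved to Friday", file.path(base_dir, "ham", "00001.a"))
writeLines(c("WIN CASH NOW", "click here"), file.path(base_dir, "spam", "00001.b"))

# Stack the two labeled data frames and write a single CSV.
combined <- rbind(
  read_email_folder(file.path(base_dir, "ham"), "ham"),
  read_email_folder(file.path(base_dir, "spam"), "spam")
)

csv_path <- file.path(base_dir, "spam_ham_email_dataset.csv")
write.csv(combined, csv_path, row.names = FALSE)
```

The resulting file has the same three columns (file_name, text, label) that the imported dataset is described as having.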

Loading the Required Libraries

As with most analyses conducted in RStudio, the first step is to load the required libraries. The tidyverse and tidytext packages are used for data preparation and text processing, rsample for the stratified train/test split, and e1071 and caret for the Naive Bayes classifier and model evaluation.

Code
library(tidyverse)
library(tidytext)
library(e1071)
library(caret)
library(rsample)

Importing the Spam/Ham Dataset

The dataset will now be imported from the raw GitHub URL. This dataset contains the original email file name, the email text, and a label indicating whether the message is spam or ham.

Code
url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Project%20Four%20Assignment/spam_ham_email_dataset.csv"

email_data <- read_csv(url)

glimpse(email_data)
Rows: 3,896
Columns: 3
$ file_name <chr> "00001.7c53336b37003a9286aba55d2945844c", "00002.9c4069e25e1…
$ text      <chr> "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 200…
$ label     <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham…
Code
table(email_data$label)

 ham spam 
2500 1396 

At this stage, the dataset has been successfully imported: 3,896 emails, of which 2,500 are ham and 1,396 are spam. The label variable holds the known class of each email and will be used to train the classifier.

Preparing the Data

Before the text can be modeled, a document identifier will be created and the label variable will be converted into a factor. This will make it easier to track each email during tokenization and modeling.

Code
email_data <- email_data %>%
  mutate(
    doc_id = row_number(),
    label = factor(label, levels = c("ham", "spam"))
  )

glimpse(email_data)
Rows: 3,896
Columns: 4
$ file_name <chr> "00001.7c53336b37003a9286aba55d2945844c", "00002.9c4069e25e1…
$ text      <chr> "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 200…
$ label     <fct> ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, …
$ doc_id    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
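
The Data Preparation plan describes the labels as spam = 1 and ham = 0, while the code stores them as a factor with levels ham and spam. If a numeric label were needed, a minimal sketch of the mapping could look like this (the three-element vector is a toy example, not the report's data):

```r
# Toy labels with the same level order used in the report.
label <- factor(c("ham", "spam", "ham"), levels = c("ham", "spam"))

# as.integer() returns level positions (ham = 1, spam = 2), so subtract 1
# to obtain ham = 0 and spam = 1.
label_numeric <- as.integer(label) - 1L
label_numeric
```

Keeping ham as the first level also means that spam has to be named explicitly as the positive class when the model is evaluated.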

Splitting the Data into Training and Testing Sets

The data will now be split into training and testing sets. The training set will be used to build the model, while the testing set will be withheld and used to evaluate how well the classifier performs on unseen emails.

Code
set.seed(6767)

email_split <- initial_split(email_data, prop = 0.80, strata = label)

train_data <- training(email_split)
test_data <- testing(email_split)

table(train_data$label)

 ham spam 
2000 1116 
Code
table(test_data$label)

 ham spam 
 500  280 

This split allows the classifier to learn from one portion of the data and then be evaluated on a separate portion.
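
Because the split is stratified on label, the share of spam should be nearly identical in both partitions. The short check below recomputes those shares from the class counts printed above; it is a sketch using base R only, and the object names are illustrative.

```r
# Class counts reported above for the full data and for each partition.
counts <- rbind(
  full  = c(ham = 2500, spam = 1396),
  train = c(ham = 2000, spam = 1116),
  test  = c(ham = 500,  spam = 280)
)

# Row-wise proportions: each row sums to 1.
class_share <- prop.table(counts, margin = 1)
round(class_share, 3)
```

All three rows show a spam share of roughly 36%, so neither partition is skewed toward one class.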

Tokenizing and Cleaning the Data

The next step involves converting the email text into individual words. Common stopwords, numbers, and non-alphabetic tokens will be removed so that the model focuses more heavily on meaningful terms.

Code
train_tokens <- train_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "^[a-z]+$")) %>%
  count(doc_id, label, word, sort = TRUE)

head(train_tokens)
# A tibble: 6 × 4
  doc_id label word      n
   <int> <fct> <chr> <int>
1   2528 spam  font   1627
2   3590 spam  font   1102
3   3591 spam  font   1102
4   2551 spam  br      812
5   2551 spam  nbsp    567
6   3481 spam  font    542

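At this stage, the training emails have been converted into a tidy format with one row per document-word pair. The largest counts belong to HTML markup tokens such as font, br, and nbsp, because tags are not stripped before tokenization.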
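
To illustrate what the pattern "^[a-z]+$" keeps, the sketch below applies it to a handful of made-up tokens using base R's grepl(), which is assumed here to behave like str_detect() for this simple pattern. Only tokens made up entirely of lowercase letters survive, which is why markup names such as font pass the filter.

```r
# Made-up tokens, chosen to show which kinds of strings the filter removes.
tokens <- c("free", "font", "2002", "e-mail", "b2b", "click", "3d")

# TRUE only when the whole token consists of lowercase letters a-z.
keep <- grepl("^[a-z]+$", tokens)
tokens[keep]
```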

Selecting the Most Common Terms

To keep the model manageable, the most frequent terms from the training data will be selected as the model vocabulary.

Code
top_terms <- train_tokens %>%
  group_by(word) %>%
  summarise(total_count = sum(n), .groups = "drop") %>%
  slice_max(total_count, n = 1000)

vocabulary <- top_terms$word

head(top_terms)
# A tibble: 6 × 2
  word      total_count
  <chr>           <int>
1 font            27905
2 id              16840
3 received        16351
4 br              14160
5 http            13496
6 localhost       12998

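Limiting the vocabulary reduces noise and keeps the feature matrix at a manageable width. Because the vocabulary is built from the training set only, the test emails do not influence which terms the model sees.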
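
One detail worth checking is how slice_max() handles ties. By default it keeps every row tied at the cutoff, so it can return more than n rows. This is a plausible reason the feature matrices in this report have 1,003 columns (doc_id, label, and 1,001 terms) rather than 1,002. The toy data below is invented to show the behavior.

```r
library(dplyr)

# Toy term counts with a tie at the third position.
term_counts <- tibble(
  word = c("font", "id", "http", "list"),
  total_count = c(10L, 7L, 4L, 4L)
)

# Default with_ties = TRUE keeps every row tied at the cutoff.
with_ties <- slice_max(term_counts, total_count, n = 3)

# with_ties = FALSE returns exactly n rows.
exact_n <- slice_max(term_counts, total_count, n = 3, with_ties = FALSE)

nrow(with_ties)
nrow(exact_n)
```

Either setting is defensible. The point is only that the vocabulary size may differ slightly from the requested n.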

Creating Document-Term Matrices

The cleaned tokens will now be converted into a document-term structure. In this format, each row represents an email, each column represents a word, and the cell values represent how many times that word appears in the email.

Code
create_feature_matrix <- function(data, vocabulary) {
  features <- data %>%
    select(doc_id, label, text) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    filter(str_detect(word, "^[a-z]+$")) %>%
    filter(word %in% vocabulary) %>%
    count(doc_id, word) %>%
    pivot_wider(
      names_from = word,
      values_from = n,
      values_fill = 0
    )
  
  features <- data %>%
    select(doc_id, label) %>%
    left_join(features, by = "doc_id") %>%
    mutate(across(-c(doc_id, label), ~replace_na(.x, 0)))
  
  missing_terms <- setdiff(vocabulary, names(features))
  
  for(term in missing_terms) {
    features[[term]] <- 0
  }
  
  features %>%
    select(doc_id, label, all_of(vocabulary))
}
Code
train_features <- create_feature_matrix(train_data, vocabulary)
test_features <- create_feature_matrix(test_data, vocabulary)

head(train_features)
# A tibble: 6 × 1,003
  doc_id label  font    id received    br  http localhost    td  list  size
   <int> <fct> <int> <int>    <int> <int> <int>     <int> <int> <int> <int>
1      1 ham       0    12       10     0     0         7     0    12     0
2      2 ham       0     5       10     0     2         5     0     4     0
3      3 ham       0     4        9     0     2         5     0     4     0
4      4 ham       0     8        7     0     5         5     0     8     0
5      5 ham       0     4        9     0     3         6     0     4     0
6      6 ham       0     4       11     0     2         6     0     4     0
# ℹ 992 more variables: fork <int>, esmtp <int>, nbsp <int>, jm <int>,
#   color <int>, sep <int>, subject <int>, tr <int>, width <int>,
#   content <int>, mailto <int>, align <int>, admin <int>, arial <int>,
#   date <int>, aug <int>, mon <int>, message <int>, postfix <int>, rpm <int>,
#   type <int>, version <int>, text <int>, thu <int>, oct <int>, wed <int>,
#   mailman <int>, exmh <int>, ist <int>, center <int>, request <int>,
#   spamassassin <int>, href <int>, jul <int>, tue <int>, users <int>, …
Code
head(test_features)
# A tibble: 6 × 1,003
  doc_id label  font    id received    br  http localhost    td  list  size
   <int> <fct> <int> <int>    <int> <int> <int>     <int> <int> <int> <int>
1     18 ham       0     8        6     0     2         6     0     3     0
2     34 ham       0     6        7     0     1         8     0     2     0
3     47 ham       0     6        7     0     2         7     0     3     0
4     49 ham       0     6        7     0     4         6     0     6     0
5     63 ham       0     3        4     0     2         5     0     5     0
6     67 ham       0     3        4     0     1         5     0     1     0
# ℹ 992 more variables: fork <int>, esmtp <int>, nbsp <int>, jm <int>,
#   color <int>, sep <int>, subject <int>, tr <int>, width <int>,
#   content <int>, mailto <int>, align <int>, admin <int>, arial <int>,
#   date <int>, aug <int>, mon <int>, message <int>, postfix <int>, rpm <int>,
#   type <int>, version <int>, text <int>, thu <int>, oct <int>, wed <int>,
#   mailman <int>, exmh <int>, ist <int>, center <int>, request <int>,
#   spamassassin <int>, href <int>, jul <int>, tue <int>, users <int>, …

The training and testing datasets are now represented in a structure suitable for classification.
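
The key requirement in this step is that the training and testing matrices share exactly the same columns, including terms that never occur in one of the sets. A minimal base R sketch of that idea, using an invented three-word vocabulary and toy tokens, fixes the column set through factor levels:

```r
toy_vocabulary <- c("free", "meeting", "click")   # assumed, for illustration

# Toy tokens: document 2 contains one out-of-vocabulary word ("lunch").
tokens <- data.frame(
  doc_id = c(1, 1, 1, 2, 2),
  word   = c("free", "click", "free", "meeting", "lunch"),
  stringsAsFactors = FALSE
)

# Out-of-vocabulary words become NA and are dropped by table();
# vocabulary words that never occur would still get an all-zero column.
dtm <- table(
  doc_id = tokens$doc_id,
  word   = factor(tokens$word, levels = toy_vocabulary)
)
dtm
```

This mirrors what create_feature_matrix() achieves with filter(word %in% vocabulary) and the loop over missing_terms, just on a much smaller scale.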

Building the Naive Bayes Classifier

A Naive Bayes classifier will now be trained using the document-term features. This method is commonly used for spam filtering because it estimates the probability that an email belongs to a particular class based on the words it contains.

Code
train_model_data <- train_features %>%
  select(-doc_id)

test_model_data <- test_features %>%
  select(-doc_id)

nb_model <- naiveBayes(label ~ ., data = train_model_data)

The fitted model now holds the class priors and, for each term, a summary of how its counts are distributed within spam and within ham emails.
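
For numeric predictors such as term counts, e1071's naiveBayes() assumes a Gaussian distribution within each class. The sketch below works through that calculation by hand for a single term, using made-up counts and equal priors. It is an illustration of the idea, not a reproduction of the fitted model. With many terms, the per-term likelihoods are multiplied together.

```r
# Toy counts of one term in four ham and four spam training emails (made up).
ham_counts  <- c(0, 0, 1, 0)
spam_counts <- c(5, 8, 6, 7)
prior <- c(ham = 0.5, spam = 0.5)

# Per-class Gaussian parameters for the term.
params <- rbind(
  ham  = c(mean = mean(ham_counts),  sd = sd(ham_counts)),
  spam = c(mean = mean(spam_counts), sd = sd(spam_counts))
)

# Posterior for a new email in which the term appears 6 times:
# prior times likelihood, normalized so the two classes sum to 1.
new_count  <- 6
likelihood <- dnorm(new_count, mean = params[, "mean"], sd = params[, "sd"])
posterior  <- prior * likelihood / sum(prior * likelihood)
round(posterior, 4)
```

A count of 6 is far from the ham mean and close to the spam mean, so nearly all of the posterior probability goes to spam.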

Generating Predictions

The trained model will now be used to predict whether the emails in the testing set should be classified as spam or ham.

Code
nb_predictions <- predict(
  nb_model,
  newdata = test_model_data
)

head(nb_predictions)
[1] ham  ham  ham  ham  ham  spam
Levels: ham spam

The predictions represent the model’s classification of the withheld test emails.

Evaluating the Classifier

The model’s predictions will now be compared against the actual labels from the test set. This will allow for evaluation using accuracy, sensitivity, specificity, and other classification metrics.

Code
confusion_results <- confusionMatrix(
  nb_predictions,
  test_model_data$label,
  positive = "spam"
)

confusion_results
Confusion Matrix and Statistics

          Reference
Prediction ham spam
      ham  486   45
      spam  14  235
                                          
               Accuracy : 0.9244          
                 95% CI : (0.9035, 0.9419)
    No Information Rate : 0.641           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8315          
                                          
 Mcnemar's Test P-Value : 9.397e-05       
                                          
            Sensitivity : 0.8393          
            Specificity : 0.9720          
         Pos Pred Value : 0.9438          
         Neg Pred Value : 0.9153          
             Prevalence : 0.3590          
         Detection Rate : 0.3013          
   Detection Prevalence : 0.3192          
      Balanced Accuracy : 0.9056          
                                          
       'Positive' Class : spam            
                                          

The confusion matrix provides insight into how well the model classified spam and ham messages, including how often it correctly identified spam and how often it misclassified legitimate messages.

Conclusion/Interpretation

The Naive Bayes classifier performed well overall, achieving an accuracy of approximately 92.44% on the test set. This is well above the 64.1% no-information rate that would result from labeling every email as ham.

More specifically, the model had a sensitivity of 83.93% and a specificity of 97.20%, meaning that it was stronger at correctly identifying ham emails than spam emails. This is reflected in the confusion matrix, where 45 spam emails were incorrectly classified as ham (false negatives), compared to 14 ham emails incorrectly classified as spam (false positives).
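
The evaluation plan also named precision, recall, and F1-score. These can be recomputed directly from the four cells of the confusion matrix above, as in the following sketch (base R only, variable names are illustrative). Recall is the same quantity caret reports as Sensitivity, and precision is the same as Pos Pred Value.

```r
# Counts taken from the confusion matrix above (positive class = spam).
tp <- 235  # spam predicted as spam
fn <- 45   # spam predicted as ham
fp <- 14   # ham predicted as spam
tn <- 486  # ham predicted as ham

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # caret's "Pos Pred Value"
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(accuracy = accuracy, recall = sensitivity, specificity = specificity,
        precision = precision, f1 = f1), 4)
```

The first four values match the caret output, and the F1-score for the spam class works out to about 0.8885.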

This suggests that the model is somewhat conservative: it rarely flags legitimate emails as spam, but at the cost of letting some spam messages through.
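
One way this tradeoff could be explored is by thresholding posterior probabilities instead of using class predictions. For e1071 models, predict() accepts type = "raw" to return class probabilities. The sketch below does not rerun the model. It uses six invented probabilities and a hypothetical classify() helper to show how lowering the threshold trades false negatives for false positives.

```r
# Hypothetical posterior spam probabilities for six test emails (made up).
p_spam <- c(0.02, 0.35, 0.48, 0.61, 0.90, 0.99)
truth  <- factor(c("ham", "ham", "spam", "spam", "spam", "spam"),
                 levels = c("ham", "spam"))

# Label an email as spam when its probability reaches the threshold.
classify <- function(p, threshold) {
  factor(ifelse(p >= threshold, "spam", "ham"), levels = c("ham", "spam"))
}

# Lowering the threshold flags more emails as spam.
at_50 <- table(Prediction = classify(p_spam, 0.50), Reference = truth)
at_30 <- table(Prediction = classify(p_spam, 0.30), Reference = truth)

at_50
at_30
```

In this toy example, the 0.50 threshold misses one spam email and flags no ham, while the 0.30 threshold catches all spam but flags one legitimate email.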

References:

LLM Used