Project 4 Document Classification

Author

Ciara Bonnett

Intro

The goal of this project is to develop a supervised machine learning model capable of classifying documents as either Spam or Ham. I will be using the SpamAssassin Public Corpus.

Approach

I plan on using the tm and tidytext packages in R to handle the raw email files. I will download and decompress the SpamAssassin tarballs, then read the files into a volatile corpus using VCorpus.

To reduce noise, I will apply transformations to lowercase the text, remove punctuation, strip numbers, and eliminate stop words. I will convert the cleaned corpus into a Document Term Matrix.

I want to remove infrequent terms to prevent the model from overfitting and to keep the matrix computationally manageable. I will also replace each term count with a binary indicator (present or absent) to simplify the features passed to the classifier.
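As a rough illustration of these two steps, the sketch below uses a tiny hand-made count matrix in base R. The data are hypothetical (not the SpamAssassin corpus), the 0.5 document-frequency cut-off is chosen only to suit the toy example, and names such as `counts` and `doc_freq` are illustrative.

```r
# Toy document-term counts: 4 documents (rows) x 3 terms (columns).
# Hypothetical values chosen for illustration only.
counts <- matrix(
  c(2, 0, 1,
    0, 0, 3,
    1, 0, 0,
    4, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("free", "rare", "meeting"))
)

# Share of documents that contain each term at least once
doc_freq <- colMeans(counts > 0)

# Keep terms present in at least half of the documents
# (an arbitrary cut-off for this toy matrix)
kept <- counts[, doc_freq >= 0.5, drop = FALSE]

# Replace counts with a presence/absence indicator
binary <- ifelse(kept > 0, "Yes", "No")
print(binary)
```

In this toy case the term "rare" appears in only one of four documents and is dropped, while the surviving columns record only whether a term occurs, not how often.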

Code

library(tm)
library(tidyverse)
library(e1071)

# Only unzip if the folders don't exist
if (!dir.exists("data/easy_ham")) {
  untar("data/easy_ham.tar.bz2", exdir = "data")
}

if (!dir.exists("data/spam")) {
  untar("data/spam.tar.bz2", exdir = "data")
}

# Load the ham and spam corpora
ham_corpus <- VCorpus(DirSource("data/easy_ham/", encoding = "UTF-8"))
spam_corpus <- VCorpus(DirSource("data/spam/", encoding = "UTF-8"))

# Check the counts
print(ham_corpus)

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2501

print(spam_corpus)

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 501

# Tag each corpus so the class labels can be recovered later
meta(ham_corpus, tag = "type") <- "ham"
meta(spam_corpus, tag = "type") <- "spam"

# Combine both corpora into a single corpus
all_corpus <- c(ham_corpus, spam_corpus)

# Confirm the size of the combined corpus
print(all_corpus)

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 1
Content:  documents: 3002

# Start from the combined corpus
clean_corp <- all_corpus

# Convert to ASCII first, dropping characters that cannot be converted,
# so the later transformations do not fail on invalid input
clean_corp <- tm_map(clean_corp, content_transformer(function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}))

# Apply the remaining cleaning steps
clean_corp <- tm_map(clean_corp, content_transformer(tolower))
clean_corp <- tm_map(clean_corp, removeNumbers)
clean_corp <- tm_map(clean_corp, removePunctuation)
clean_corp <- tm_map(clean_corp, removeWords, stopwords("en"))
clean_corp <- tm_map(clean_corp, stripWhitespace)

# Build the Document Term Matrix
all_dtm <- DocumentTermMatrix(clean_corp)

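# Drop terms that are absent from 99% or more of the documents
all_dtm <- removeSparseTerms(all_dtm, 0.99)

# Convert term counts to a presence/absence indicator
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Apply column by column; apply() coerces the DTM to a dense matrix
all_dtm_binary <- apply(all_dtm, MARGIN = 2, convert_counts)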

set.seed(123) # For reproducibility
sample_size <- floor(0.75 * nrow(all_dtm_binary))
train_ind <- sample(seq_len(nrow(all_dtm_binary)), size = sample_size)

train_data <- all_dtm_binary[train_ind, ]
test_data <- all_dtm_binary[-train_ind, ]

# 1. Pull all labels into a simple vector
all_labels <- unlist(meta(clean_corp, "type"))

# 2. Now subset that vector using your training indices
train_labels <- all_labels[train_ind]
test_labels  <- all_labels[-train_ind]

spam_classifier <- naiveBayes(train_data, train_labels)

# Predict on test data
test_pred <- predict(spam_classifier, test_data)

# Evaluate with a Confusion Matrix
table(Predicted = test_pred, Actual = test_labels)

         Actual
Predicted ham spam
     ham  616    2
     spam   3  130

accuracy <- sum(test_pred == test_labels) / length(test_labels)
print(paste0("Accuracy:", round(accuracy * 100, 2), "%"))

[1] "Accuracy:99.33%"

Analysis

The Naive Bayes classifier was trained on 2,251 of the 3,002 emails in the Document Term Matrix (DTM), with the remaining 751 held out for testing. Applying a sparsity threshold of 0.99 with removeSparseTerms dropped every term absent from 99% or more of the documents, which reduced computational load while retaining the terms frequent enough to carry predictive signal.

Model Performance: The confusion matrix shows that the model classified 746 of the 751 test emails correctly (99.33%), misclassifying three ham emails as spam and two spam emails as ham. Naive Bayes likely performs well on this dataset because spam often contains specific “trigger” tokens (e.g., “money,” “free,” “click”) that have much higher conditional probabilities in the Spam class than in the Ham class.
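To make the role of those conditional probabilities concrete, here is a minimal base R sketch of the Naive Bayes calculation for a single email. The priors and per-token probabilities are made-up numbers for illustration, not estimates from the fitted `spam_classifier`.

```r
# Hypothetical class priors and per-class probabilities that a token is
# present in an email. These numbers are made up for illustration.
prior <- c(ham = 0.8, spam = 0.2)
p_present <- rbind(
  ham  = c(free = 0.05, click = 0.10, meeting = 0.30),
  spam = c(free = 0.60, click = 0.50, meeting = 0.02)
)

# An email that contains "free" and "click" but not "meeting"
email <- c(free = TRUE, click = TRUE, meeting = FALSE)

# Likelihood of the observed presence/absence pattern under each class,
# multiplying per-token probabilities (the conditional independence assumption)
likelihood <- apply(p_present, 1, function(p) prod(ifelse(email, p, 1 - p)))

# Posterior is proportional to prior times likelihood; normalise to sum to 1
posterior <- prior * likelihood / sum(prior * likelihood)
print(round(posterior, 3))
```

Even with a prior that favours ham, two tokens that are far more common in spam are enough to push the posterior strongly towards the Spam class in this toy example.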

The Cost of Errors: In this context, a False Positive (classifying Ham as Spam) is more “expensive” than a False Negative (classifying Spam as Ham). Missing a legitimate email is generally worse for a user than seeing a piece of junk in their inbox. On the test set the model produced three False Positives and two False Negatives.
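Because the two error types carry different costs, accuracy alone can hide what matters. The sketch below recomputes a few standard metrics from the confusion matrix reported above, treating spam as the positive class; the variable names are illustrative.

```r
# Confusion matrix as reported above (rows = predicted, columns = actual)
cm <- matrix(
  c(616,   2,
      3, 130),
  nrow = 2, byrow = TRUE,
  dimnames = list(Predicted = c("ham", "spam"), Actual = c("ham", "spam"))
)

# Treat "spam" as the positive class
tp <- cm["spam", "spam"]  # spam correctly flagged
fp <- cm["spam", "ham"]   # ham wrongly flagged as spam
fn <- cm["ham", "spam"]   # spam that reached the inbox
tn <- cm["ham", "ham"]    # ham correctly delivered

metrics <- c(
  accuracy            = (tp + tn) / sum(cm),
  precision           = tp / (tp + fp),
  recall              = tp / (tp + fn),
  false_positive_rate = fp / (fp + tn)
)
print(round(metrics, 4))
```

The false positive rate (3 of 619 ham emails) is the figure most directly tied to the cost of losing legitimate mail.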

Conclusion

This project demonstrated that the Naive Bayes algorithm is effective for binary text classification on this corpus. Despite its “naive” assumption that words occur independently of one another given the class, the model achieved 99.33% accuracy on the held-out test set in distinguishing Spam from Ham.

Key takeaways include:

Data Pre-processing is Critical: Converting the text from UTF-8 to ASCII was essential to stop the tm_map cleaning steps from failing on invalid characters, and removing “stop words” helped the model focus on meaningful language patterns.

Binary vs. Frequency: Using a binary indicator (Yes/No) for word presence proved sufficient for this classification task, which suggests that the occurrence of certain words can be more telling than their frequency. A count-based model was not trained here, so this is an observation rather than a direct comparison.

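Future Work: Further improvements could involve using “Laplace Smoothing” (the laplace argument of naiveBayes, which defaults to 0) so that a word never observed with one class during training does not receive a zero conditional probability for that class.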
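The effect of smoothing can be sketched by hand in base R. The counts below are hypothetical, and the add-one formula shown is the textbook version for a two-valued feature; it is assumed, not verified here, that e1071 applies the same form internally.

```r
# Hypothetical training counts: number of ham and spam emails in which a
# token was present. "invoice" never appeared in a spam email.
n_docs  <- c(ham = 100, spam = 40)
present <- rbind(
  ham  = c(free = 5,  invoice = 12),
  spam = c(free = 24, invoice = 0)
)

# Unsmoothed estimate of P(token present | class);
# each row is divided by that class's document count
p_raw <- present / n_docs

# Add-one (Laplace) estimate: one pseudo-count for each of the two
# possible values of a binary feature (present / absent)
laplace <- 1
p_smooth <- (present + laplace) / (n_docs + 2 * laplace)

print(round(p_raw, 3))
print(round(p_smooth, 3))
```

Without smoothing, a single zero such as P("invoice" present | spam) forces the whole product for that class to zero regardless of the other words. In the project code, the corresponding change would be to pass laplace = 1 in the naiveBayes(train_data, train_labels) call.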

AI Transcript

User: “Error: attempt to use zero-length variable name.” AI: Explained that this usually stems from a stray pipe (%>%) or empty quotes. Advised on checking the tm_map syntax for trailing operators.

User: “Error: object ‘all_corpus’ not found.” AI: Identified a break in the variable chain. Provided the code to combine ham_corpus and spam_corpus using the c() function to create the required all_corpus object.

User: “Error: object ‘clean_corp’ not found.” AI: Clarified that the cleaning steps must be assigned to a new object name. Provided a consolidated block of tm_map functions (tolower, removeNumbers, etc.) to define clean_corp.

User: “Error in FUN(content(x), …) : invalid input… in ‘utf8towcs’.” AI: Diagnosed a character encoding conflict common in the SpamAssassin dataset. Provided a “Blunt Force” fix using iconv() to convert text to ASCII and strip non-translatable characters.

User: “Error in `[.data.frame`(meta(clean_corp, "type"), train_ind) : undefined columns selected.” AI: Explained that meta() returns a data frame, so indexing it with a single vector selects columns rather than rows. Recommended using unlist() to turn the metadata into a simple vector for stable training/testing splits.
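The last error can be reproduced without tm. The sketch below uses a one-column data frame as a stand-in for what meta(clean_corp, "type") is described as returning; the toy labels and the names `labels_df` and `idx` are hypothetical.

```r
# A one-column data frame standing in for the result of meta(clean_corp, "type")
# (hypothetical toy labels)
labels_df <- data.frame(type = c("ham", "spam", "ham", "spam"))
idx <- c(1, 3)

# A single index on a data frame selects columns, so asking for
# columns 1 and 3 of a one-column data frame fails
result <- tryCatch(labels_df[idx], error = function(e) conditionMessage(e))
print(result)

# Flattening to a plain vector makes the same index select elements
labels_vec <- unlist(labels_df, use.names = FALSE)
print(labels_vec[idx])
```

An alternative that avoids unlist() would be row indexing with an explicit comma, as in labels_df[idx, "type"].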