The goal of this project is to develop a supervised machine learning model capable of classifying documents as either Spam or Ham. I will be using the SpamAssassin Public Corpus.
Approach
I plan on using the tm and tidytext packages in R to handle the raw email files. I will download and decompress the SpamAssassin tarballs, then read the files into a volatile corpus using VCorpus.
To reduce noise, I will apply transformations to lowercase the text, remove punctuation, strip numbers, and eliminate stop words. I will convert the cleaned corpus into a Document Term Matrix.
I want to remove infrequent terms to prevent the model from overfitting and to keep the matrix computationally manageable. I will also convert the term counts into a binary indicator (word present or absent) to simplify the features.
Code
library(tm)
library(tidyverse)
library(e1071)

# Only extract the archives if the folders don't already exist
if (!dir.exists("data/easy_ham")) {
  untar("data/easy_ham.tar.bz2", exdir = "data")
}
if (!dir.exists("data/spam")) {
  untar("data/spam.tar.bz2", exdir = "data")
}

# Load each folder into its own volatile corpus
ham_corpus <- VCorpus(DirSource("data/easy_ham/", encoding = "UTF-8"))
spam_corpus <- VCorpus(DirSource("data/spam/", encoding = "UTF-8"))

# Check the document counts
print(ham_corpus)
# 1. Tag each corpus so the labels can be recovered later
#    (ham_corpus and spam_corpus were loaded above)
meta(ham_corpus, tag = "type") <- "ham"
meta(spam_corpus, tag = "type") <- "spam"

# 2. Combine both into a single corpus
all_corpus <- c(ham_corpus, spam_corpus)

# 3. Confirm the combined corpus exists
print(all_corpus)
# 1. Start from the combined corpus
clean_corp <- all_corpus

# 2. Fix the encoding first: convert to ASCII and drop characters that
#    cannot be converted (this avoids the 'utf8towcs' invalid input error)
clean_corp <- tm_map(clean_corp, content_transformer(function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}))

# 3. Apply the remaining cleaning steps
clean_corp <- tm_map(clean_corp, content_transformer(tolower))
clean_corp <- tm_map(clean_corp, removeNumbers)
clean_corp <- tm_map(clean_corp, removePunctuation)
clean_corp <- tm_map(clean_corp, removeWords, stopwords("en"))
clean_corp <- tm_map(clean_corp, stripWhitespace)

# 4. Build the Document Term Matrix
all_dtm <- DocumentTermMatrix(clean_corp)
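# all_dtm_binary is not defined anywhere in the code shown. The two steps
# below are a reconstruction from the write-up (0.99 sparsity threshold,
# Yes/No indicator), not the original code.
all_dtm_sparse <- removeSparseTerms(all_dtm, 0.99)
all_dtm_binary <- apply(as.matrix(all_dtm_sparse), 2,
                        function(x) ifelse(x > 0, "Yes", "No"))

set.seed(123) # For reproducibility
sample_size <- floor(0.75 * nrow(all_dtm_binary))
train_ind <- sample(seq_len(nrow(all_dtm_binary)), size = sample_size)
train_data <- all_dtm_binary[train_ind, ]
test_data <- all_dtm_binary[-train_ind, ]

# 1. Pull all labels into a simple vector (meta() returns a data frame)
all_labels <- factor(unlist(meta(clean_corp, "type")))

# 2. Subset that vector using the training indices
train_labels <- all_labels[train_ind]
test_labels <- all_labels[-train_ind]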
spam_classifier <- naiveBayes(train_data, train_labels)

# Predict on the test data
test_pred <- predict(spam_classifier, test_data)

# Evaluate with a confusion matrix
table(Predicted = test_pred, Actual = test_labels)
The Naive Bayes classifier was trained on 75% of a Document Term Matrix (DTM) built from approximately 3,000 emails, with the remaining 25% held out for testing. Applying a sparsity threshold of 0.99 dropped every term absent from 99% or more of the emails, so only terms appearing in more than about 1% of them remained. This reduced the computational load while keeping the terms frequent enough to carry predictive signal.
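To make the sparsity threshold concrete, here is a minimal base-R sketch on an invented count matrix. The matrix `toy_counts` and the helper `keep_frequent_terms` are illustrative names, not part of the project code; the helper only mimics the idea behind tm's `removeSparseTerms`, where a term's sparsity is the share of documents in which it does not appear. A threshold of 0.7 is used so the effect is visible on five documents.

```r
# Toy document-term counts: 5 documents (rows) x 4 terms (columns).
# All values are invented for illustration.
toy_counts <- matrix(
  c(2, 0, 0, 1,
    1, 0, 0, 0,
    0, 0, 3, 1,
    1, 1, 0, 2,
    4, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5),
                  c("free", "zebra", "meeting", "click"))
)

# Hypothetical helper: keep terms whose sparsity (share of documents
# with a zero count) is below the threshold.
keep_frequent_terms <- function(counts, sparse) {
  sparsity <- colMeans(counts == 0)
  counts[, sparsity < sparse, drop = FALSE]
}

filtered <- keep_frequent_terms(toy_counts, sparse = 0.7)
colnames(filtered)
```

Here "zebra" and "meeting" each appear in only one of five documents (sparsity 0.8), so they are dropped, while "free" and "click" are kept.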
Model Performance: The confusion matrix shows how many emails of each class the model labeled correctly and how many it confused with the other class. Naive Bayes performs well on this dataset because spam often contains specific “trigger” tokens (e.g., “money,” “free,” “click”) that have significantly higher conditional probabilities in the Spam class than in the Ham class.
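The role of trigger tokens can be sketched with a hand-computed Naive Bayes posterior. Everything below is an assumption made for illustration: the class sizes, the token counts, and the three-token vocabulary are invented, and the calculation is a bare-bones version of the idea rather than what `naiveBayes` does internally.

```r
# Invented training summary: how many spam / ham emails contain each token.
n_spam <- 20
n_ham  <- 80
contains <- rbind(
  spam = c(free = 15, click = 12, meeting = 1),
  ham  = c(free = 8,  click = 4,  meeting = 40)
)

# Conditional probability of seeing each token given the class
# (the length-2 divisor is recycled down the rows: spam, ham).
p_token <- contains / c(spam = n_spam, ham = n_ham)

# Class priors from the class sizes.
prior <- c(spam = n_spam, ham = n_ham) / (n_spam + n_ham)

# Score a message that contains "free" and "click" but not "meeting".
present <- c(free = TRUE, click = TRUE, meeting = FALSE)
likelihood <- apply(p_token, 1, function(p) prod(ifelse(present, p, 1 - p)))

# Bayes' rule: normalise prior * likelihood across the two classes.
posterior <- prior * likelihood / sum(prior * likelihood)
round(posterior, 3)
```

Even though ham is four times more common in this toy setup, the two trigger tokens push the spam posterior well above 0.9, because each is several times more likely under the Spam class.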
The Cost of Errors: In this specific context, a False Positive (classifying Ham as Spam) is more “expensive” than a False Negative. Missing a legitimate email is generally worse for a user than seeing a piece of junk in their inbox.
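One way to make this asymmetry measurable is to read the two error types off the confusion matrix separately instead of relying on overall accuracy. The sketch below uses placeholder counts (they are not results from this project), and the 5-to-1 cost weighting is an arbitrary assumption.

```r
# Hypothetical confusion matrix; the counts are placeholders,
# not project results.
conf_mat <- matrix(
  c(600,  10,
      5, 135),
  nrow = 2, byrow = TRUE,
  dimnames = list(Predicted = c("ham", "spam"), Actual = c("ham", "spam"))
)

false_pos <- conf_mat["spam", "ham"]   # ham wrongly flagged as spam
false_neg <- conf_mat["ham", "spam"]   # spam that reached the inbox

# Share of legitimate email lost to the spam folder.
false_positive_rate <- false_pos / sum(conf_mat[, "ham"])
# Share of spam flags that were correct.
precision <- conf_mat["spam", "spam"] / sum(conf_mat["spam", ])
# Share of spam that was caught.
recall <- conf_mat["spam", "spam"] / sum(conf_mat[, "spam"])

# Assumed cost weights: one false positive counts as much as
# five false negatives.
weighted_cost <- 5 * false_pos + 1 * false_neg
```

Under a weighting like this, a model with slightly lower recall but fewer false positives can be preferable to one with higher overall accuracy.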
Conclusion
This project successfully demonstrated the effectiveness of the Naive Bayes algorithm for binary text classification. Despite its “naive” assumption that words are conditionally independent of one another given the class, the model achieved high accuracy in distinguishing Spam from Ham.
Key takeaways include:
Data Pre-processing is Critical: Converting the character encoding (UTF-8 to ASCII) was essential to keep the cleaning pipeline from crashing, and removing “stop words” helped the model focus on meaningful language patterns.
Binary vs. Frequency: Using a binary indicator (Yes/No) for word presence proved sufficient for this classification task, suggesting that whether certain words occur at all is often more telling than how often they occur.
Future Work: Further improvements could involve using “Laplace Smoothing” so that a word never observed in one class during training does not receive a zero conditional probability, which would otherwise rule out that class for any email containing the word.
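The effect of Laplace smoothing can be sketched by hand on invented counts. The helper `smoothed_prob` and all numbers below are hypothetical; the formula is the standard add-one style estimate for a two-outcome (present/absent) feature, shown here as an illustration rather than as the exact computation inside e1071.

```r
# Invented counts of emails containing each token, by class.
n_docs <- c(spam = 20, ham = 80)
contains <- rbind(
  spam = c(free = 15, viagra = 6, meeting = 0),
  ham  = c(free = 8,  viagra = 0, meeting = 40)
)

# Hypothetical helper: P(token present | class) with add-`laplace` smoothing.
# Each token has two outcomes (present / absent), hence the 2 * laplace term.
smoothed_prob <- function(contains, n_docs, laplace = 0) {
  (contains + laplace) / (n_docs + 2 * laplace)
}

unsmoothed <- smoothed_prob(contains, n_docs)               # has exact zeros
smoothed   <- smoothed_prob(contains, n_docs, laplace = 1)  # no zeros remain

unsmoothed["ham", "viagra"]
smoothed["ham", "viagra"]
```

Without smoothing, a single occurrence of "viagra" would force the ham likelihood to zero regardless of the other words. In e1071, `naiveBayes()` exposes this through its `laplace` argument (default 0), so a call such as `naiveBayes(train_data, train_labels, laplace = 1)` would be the natural place to try it, ideally with the Yes/No features stored as factors.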
AI Transcript
User: “Error: attempt to use zero-length variable name.” AI: Explained that this usually stems from a stray pipe (%>%) or empty quotes. Advised on checking the tm_map syntax for trailing operators.
User: “Error: object ‘all_corpus’ not found.” AI: Identified a break in the variable chain. Provided the code to combine ham_corpus and spam_corpus using the c() function to create the required all_corpus object.
User: “Error: object ‘clean_corp’ not found.” AI: Clarified that the cleaning steps must be assigned to a new object name. Provided a consolidated block of tm_map functions (tolower, removeNumbers, etc.) to define clean_corp.
User: “Error in FUN(content(x), …) : invalid input… in ‘utf8towcs’.” AI: Diagnosed a character encoding conflict common in the SpamAssassin dataset. Provided a “Blunt Force” fix using iconv() to convert text to ASCII and strip non-translatable characters.
User: “Error in `[.data.frame`(meta(clean_corp, "type"), train_ind) : undefined columns selected.” AI: Explained that meta() returns a data frame that can confuse standard indexing. Recommended using unlist() to turn metadata into a simple vector for stable training/testing splits.
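The last error in the transcript can be reproduced without tm. The sketch below builds a one-column data frame by hand as a stand-in for what `meta(clean_corp, "type")` returns (`type_meta` and the five labels are invented), then shows why single-bracket indexing fails and why flattening with `unlist()` fixes it.

```r
# Stand-in for the metadata: a one-column data frame, built by hand
# so the example runs without tm.
type_meta <- data.frame(type = c("ham", "ham", "ham", "spam", "spam"))
train_ind <- c(1, 3, 5)

# Single-bracket indexing on a data frame selects columns, so asking for
# columns 1, 3 and 5 of a one-column data frame fails.
bad <- tryCatch(type_meta[train_ind],
                error = function(e) conditionMessage(e))

# Flattening to a vector first makes the indices refer to documents.
all_labels   <- unlist(type_meta, use.names = FALSE)
train_labels <- all_labels[train_ind]
test_labels  <- all_labels[-train_ind]
```

An alternative with the same effect is to index rows explicitly, for example `type_meta[train_ind, "type"]`.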