The objective of this assignment is to build a document classification model that can predict whether an email should be classified as spam or ham. For this project, the SpamAssassin Public Corpus (Apache SpamAssassin Project, n.d.) will be used, since it already contains labeled spam and non-spam email messages.
The spam and ham files will first be downloaded and extracted. Each email will then be imported into R and assigned a label based on the folder it came from (spam = 1, ham = 0).
The text will then be cleaned by removing unnecessary punctuation, numbers, stopwords, and extra whitespace. After preprocessing, the emails will be converted into a Document-Term Matrix, where each row represents an email and each column represents a term.
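As a brief preview of that structure, a minimal tidytext sketch is shown below. This is only an illustration: the emails data frame (with doc_id and text columns) is hypothetical at this point, and cast_dtm() requires the tm package to be installed.
# Hypothetical `emails` data frame with doc_id and text columns
emails %>%
  unnest_tokens(word, text) %>%            # one row per (document, word)
  anti_join(stop_words, by = "word") %>%   # drop common stopwords
  count(doc_id, word) %>%                  # word counts per document
  cast_dtm(doc_id, word, n)                # tidy counts -> DocumentTermMatrix
The analysis below ultimately builds an equivalent matrix with pivot_wider() instead, which keeps everything in ordinary tibbles.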
A predictive classifier will then be trained on the labeled email data. One likely method is Naive Bayes, since it is commonly used for text classification and spam filtering.
The data will be split into training and testing sets. The model will be trained on the training set and evaluated on the withheld test set using measures such as accuracy, precision, recall, and F1-score.
Particular attention will be paid to false positives and false negatives, since legitimate emails being classified as spam, or spam emails being missed, would both affect the usefulness of the classifier.
One possible challenge is that spam messages may use varied or misleading language, making classification more difficult. Another challenge is balancing the tradeoff between catching spam and avoiding the incorrect classification of legitimate emails.
Due to file size constraints, and to keep the analysis reproducible, the separate spam and ham data folders were first combined into a single CSV file. This consolidated dataset was then uploaded to my personal GitHub repository so that it can be accessed directly through a raw GitHub URL rather than relying on local file paths. A sketch of that consolidation step follows.
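The consolidation itself was done outside this report; a minimal sketch of the idea, assuming the extracted corpus sits in local spam/ and ham/ folders (hypothetical paths), is:
library(tidyverse)
# Read every file in a folder into one row per email, tagged with its label
read_folder <- function(folder, label) {
  files <- list.files(folder, full.names = TRUE)
  tibble(
    file_name = basename(files),
    # collapse each raw email file into a single text string
    text  = map_chr(files, ~ paste(readLines(.x, warn = FALSE), collapse = "\n")),
    label = label
  )
}
email_data <- bind_rows(
  read_folder("ham",  "ham"),
  read_folder("spam", "spam")
)
write_csv(email_data, "spam_ham_email_dataset.csv")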
As with most analyses conducted in RStudio, the first step involves loading the required libraries. In this project, the tidyverse and tidytext packages will be used for data preparation and text processing, while e1071 and caret will assist with the Naive Bayes classifier and model evaluation.
library(tidyverse)
library(tidytext)
library(e1071)
library(caret)
library(rsample)
The dataset will now be imported from the raw GitHub URL. This dataset contains the original email file name, the email text, and a label indicating whether the message is spam or ham.
url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Project%20Four%20Assignment/spam_ham_email_dataset.csv"
email_data <- read_csv(url)
glimpse(email_data)Rows: 3,896
Columns: 3
$ file_name <chr> "00001.7c53336b37003a9286aba55d2945844c", "00002.9c4069e25e1…
$ text <chr> "From exmh-workers-admin@redhat.com Thu Aug 22 12:36:23 200…
$ label <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham…
table(email_data$label)
ham spam
2500 1396
At this stage, the dataset has been successfully imported. The label variable identifies the already-classified email type, which will be used to train the classifier.
Before the text can be modeled, a document identifier will be created and the label variable will be converted into a factor. This will make it easier to track each email during tokenization and modeling.
email_data <- email_data %>%
mutate(
doc_id = row_number(),
label = factor(label, levels = c("ham", "spam"))
)
glimpse(email_data)
Rows: 3,896
Columns: 4
$ file_name <chr> "00001.7c53336b37003a9286aba55d2945844c", "00002.9c4069e25e1…
$ text <chr> "From exmh-workers-admin@redhat.com Thu Aug 22 12:36:23 200…
$ label <fct> ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, …
$ doc_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
The data will now be split into training and testing sets. The training set will be used to build the model, while the testing set will be withheld and used to evaluate how well the classifier performs on unseen emails.
set.seed(6767)
email_split <- initial_split(email_data, prop = 0.80, strata = label)
train_data <- training(email_split)
test_data <- testing(email_split)
table(train_data$label)
ham spam
2000 1116
table(test_data$label)
ham spam
500 280
This split allows the classifier to learn from one portion of the data and then be evaluated on a separate portion.
The next step involves converting the email text into individual words. Common stopwords, numbers, and non-alphabetic tokens will be removed so that the model focuses more heavily on meaningful terms.
train_tokens <- train_data %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "^[a-z]+$")) %>%
count(doc_id, label, word, sort = TRUE)
head(train_tokens)
# A tibble: 6 × 4
doc_id label word n
<int> <fct> <chr> <int>
1 2528 spam font 1627
2 3590 spam font 1102
3 3591 spam font 1102
4 2551 spam br 812
5 2551 spam nbsp 567
6 3481 spam font 542
At this stage, the training emails have been converted into a tidy word-level format. Notably, the head of the output shows HTML tokens such as font, br, and nbsp dominating certain spam messages, a hint that markup artifacts themselves carry signal.
To keep the model manageable, the most frequent terms from the training data will be selected as the model vocabulary.
top_terms <- train_tokens %>%
group_by(word) %>%
summarise(total_count = sum(n), .groups = "drop") %>%
slice_max(total_count, n = 1000)
vocabulary <- top_terms$word
head(top_terms)
# A tibble: 6 × 2
word total_count
<chr> <int>
1 font 27905
2 id 16840
3 received 16351
4 br 14160
5 http 13496
6 localhost 12998
Limiting the vocabulary helps reduce noise and prevents the model from becoming unnecessarily large. Note that slice_max() keeps ties by default, which is why the feature matrices below contain 1,001 word columns rather than exactly 1,000.
The cleaned tokens will now be converted into a document-term structure. In this format, each row represents an email, each column represents a word, and the cell values represent how many times that word appears in the email.
create_feature_matrix <- function(data, vocabulary) {
  # Tokenize, clean, and count vocabulary words per email,
  # then spread the counts into one column per word
  features <- data %>%
    select(doc_id, label, text) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    filter(str_detect(word, "^[a-z]+$")) %>%
    filter(word %in% vocabulary) %>%
    count(doc_id, word) %>%
    pivot_wider(
      names_from = word,
      values_from = n,
      values_fill = 0
    )
  # Re-attach every email (some contain no vocabulary words at all)
  # and fill the resulting NA counts with zeros
  features <- data %>%
    select(doc_id, label) %>%
    left_join(features, by = "doc_id") %>%
    mutate(across(-c(doc_id, label), ~replace_na(.x, 0)))
  # Add zero-filled columns for vocabulary words absent from this dataset,
  # so the training and testing matrices share identical columns
  missing_terms <- setdiff(vocabulary, names(features))
  for (term in missing_terms) {
    features[[term]] <- 0
  }
  features %>%
    select(doc_id, label, all_of(vocabulary))
}
train_features <- create_feature_matrix(train_data, vocabulary)
test_features <- create_feature_matrix(test_data, vocabulary)
head(train_features)
# A tibble: 6 × 1,003
doc_id label font id received br http localhost td list size
<int> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 ham 0 12 10 0 0 7 0 12 0
2 2 ham 0 5 10 0 2 5 0 4 0
3 3 ham 0 4 9 0 2 5 0 4 0
4 4 ham 0 8 7 0 5 5 0 8 0
5 5 ham 0 4 9 0 3 6 0 4 0
6 6 ham 0 4 11 0 2 6 0 4 0
# ℹ 992 more variables: fork <int>, esmtp <int>, nbsp <int>, jm <int>,
# color <int>, sep <int>, subject <int>, tr <int>, width <int>,
# content <int>, mailto <int>, align <int>, admin <int>, arial <int>,
# date <int>, aug <int>, mon <int>, message <int>, postfix <int>, rpm <int>,
# type <int>, version <int>, text <int>, thu <int>, oct <int>, wed <int>,
# mailman <int>, exmh <int>, ist <int>, center <int>, request <int>,
# spamassassin <int>, href <int>, jul <int>, tue <int>, users <int>, …
head(test_features)
# A tibble: 6 × 1,003
doc_id label font id received br http localhost td list size
<int> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 18 ham 0 8 6 0 2 6 0 3 0
2 34 ham 0 6 7 0 1 8 0 2 0
3 47 ham 0 6 7 0 2 7 0 3 0
4 49 ham 0 6 7 0 4 6 0 6 0
5 63 ham 0 3 4 0 2 5 0 5 0
6 67 ham 0 3 4 0 1 5 0 1 0
# ℹ 992 more variables: fork <int>, esmtp <int>, nbsp <int>, jm <int>,
# color <int>, sep <int>, subject <int>, tr <int>, width <int>,
# content <int>, mailto <int>, align <int>, admin <int>, arial <int>,
# date <int>, aug <int>, mon <int>, message <int>, postfix <int>, rpm <int>,
# type <int>, version <int>, text <int>, thu <int>, oct <int>, wed <int>,
# mailman <int>, exmh <int>, ist <int>, center <int>, request <int>,
# spamassassin <int>, href <int>, jul <int>, tue <int>, users <int>, …
The training and testing datasets are now represented in a structure suitable for classification.
A Naive Bayes classifier will now be trained using the document-term features. This method is commonly used for spam filtering because it estimates the probability that an email belongs to a particular class based on the words it contains.
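Conceptually, the classifier combines a class prior with per-word likelihoods under a (naive) independence assumption. In the classic multinomial form for text, the scoring rule is:

$$P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$$

One caveat: given numeric count features, e1071's naiveBayes() models each feature with a per-class Gaussian rather than with multinomial word probabilities, so the fitted model is an approximation of this word-count formulation.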
train_model_data <- train_features %>%
select(-doc_id)
test_model_data <- test_features %>%
select(-doc_id)
nb_model <- naiveBayes(label ~ ., data = train_model_data)
The model has now learned patterns in the training data that are associated with spam and ham emails.
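As a quick optional sanity check (an addition here, not part of the graded workflow), the fitted e1071 object exposes its class priors and per-feature conditional summaries:
nb_model$apriori           # class frequencies used as prior probabilities
nb_model$tables[["font"]]  # per-class mean and sd of the "font" count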
The trained model will now be used to predict whether the emails in the testing set should be classified as spam or ham.
nb_predictions <- predict(
nb_model,
newdata = test_model_data
)
head(nb_predictions)
[1] ham ham ham ham ham spam
Levels: ham spam
The predictions represent the model’s classification of the withheld test emails.
The model’s predictions will now be compared against the actual labels from the test set. This will allow for evaluation using accuracy, sensitivity, specificity, and other classification metrics.
confusion_results <- confusionMatrix(
nb_predictions,
test_model_data$label,
positive = "spam"
)
confusion_results
Confusion Matrix and Statistics
Reference
Prediction ham spam
ham 486 45
spam 14 235
Accuracy : 0.9244
95% CI : (0.9035, 0.9419)
No Information Rate : 0.641
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8315
Mcnemar's Test P-Value : 9.397e-05
Sensitivity : 0.8393
Specificity : 0.9720
Pos Pred Value : 0.9438
Neg Pred Value : 0.9153
Prevalence : 0.3590
Detection Rate : 0.3013
Detection Prevalence : 0.3192
Balanced Accuracy : 0.9056
'Positive' Class : spam
The confusion matrix provides insight into how well the model classified spam and ham messages, including how often it correctly identified spam and how often it misclassified legitimate messages.
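Since the project plan also called for precision, recall, and F1-score, these follow directly from the same counts. caret can report them via the mode = "prec_recall" argument, or they can be computed by hand; the figures below are derived from the matrix above:
confusionMatrix(nb_predictions, test_model_data$label,
                positive = "spam", mode = "prec_recall")
# By hand, from the counts above:
precision <- 235 / (235 + 14)                               # ~0.944
recall    <- 235 / (235 + 45)                               # ~0.839
f1        <- 2 * precision * recall / (precision + recall)  # ~0.889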
The Naive Bayes classifier performed well overall, achieving an accuracy of approximately 92.44%, meaning the large majority of emails were correctly classified.
More specifically, the model had a sensitivity of 83.93% and a specificity of 97.20%, meaning that it was stronger at correctly identifying ham emails than spam emails. This is reflected in the confusion matrix, where 45 spam emails were incorrectly classified as ham (false negatives), compared to 14 ham emails incorrectly classified as spam (false positives).
These results suggest that the model is somewhat conservative: it prioritizes not misclassifying legitimate emails as spam, although this comes at the cost of letting some spam messages pass through. One way to adjust that balance is sketched below.
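To explore that tradeoff, predict() for an e1071 Naive Bayes model can return class probabilities with type = "raw", and a custom decision threshold can then be applied; the 0.3 cutoff below is purely illustrative, not tuned:
nb_probs <- predict(nb_model, newdata = test_model_data, type = "raw")
# Lowering the spam threshold below 0.5 catches more spam,
# at the cost of flagging more legitimate mail
threshold <- 0.3  # illustrative value
custom_preds <- factor(
  ifelse(nb_probs[, "spam"] > threshold, "spam", "ham"),
  levels = c("ham", "spam")
)
confusionMatrix(custom_preds, test_model_data$label, positive = "spam")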