For this assignment, I downloaded the SpamAssassin public corpus from Apache and unzipped the files locally. The dataset contains two groups of emails: easy_ham, which represents non-spam emails, and spam_2, which represents spam emails.
The goal of this project is to use already labeled emails to build a spam/ham classifier. I first loaded the email files from each folder, assigned each email a label, and combined them into one data frame. I then cleaned and tokenized the text using tidy text methods, removed stop words, and transformed the email text into features that can be used for classification.
Due to the large number of individual files, the dataset is stored locally in the project directory and is not included in the RPubs publication.
## Reproducibility Note
To reproduce this analysis (an R sketch of steps 1–3 appears after this list):
1. Download the SpamAssassin corpus from the link above.
2. Extract the folders `easy_ham` and `spam_2`.
3. Place them in the project directory.
4. Run the code as provided.
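Steps 1–3 can also be scripted directly in R. This is only a sketch: the archive file names below are assumptions, so check the corpus index page for the exact names before running it.

# Base URL of the Apache SpamAssassin public corpus
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"

# Archive names are assumptions -- confirm them on the index page above
archives <- c("20030228_easy_ham.tar.bz2", "20050311_spam_2.tar.bz2")

for (f in archives) {
  download.file(paste0(base_url, f), destfile = f, mode = "wb")
  untar(f)   # extracts easy_ham/ and spam_2/ into the project directory
}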
library(tidyverse)
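The chunk that reads the raw email files into a data frame and tokenizes them is not reproduced in this publication. A minimal sketch of that step, assuming the `easy_ham` and `spam_2` folders sit in the project directory and using a hypothetical `read_folder()` helper:

library(tidytext)   # unnest_tokens(), stop_words

# Hypothetical helper: read every file in a folder, one row per email
read_folder <- function(path, label) {
  files <- list.files(path, full.names = TRUE)
  tibble(
    label = label,
    text  = map_chr(files, \(f) paste(read_lines(f), collapse = " "))
  )
}

emails <- bind_rows(
  read_folder("easy_ham", "ham"),
  read_folder("spam_2", "spam")
) |>
  mutate(id = row_number())   # unique id per email

# One row per (email, word)
tidy_emails <- emails |>
  unnest_tokens(word, text)

head(tidy_emails)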
# A tibble: 6 × 3
label id word
<chr> <int> <chr>
1 ham 1 from
2 ham 1 exmh
3 ham 1 workers
4 ham 1 admin
5 ham 1 redhat.com
6 ham 1 thu
data(stop_words)

tidy_emails <- tidy_emails |>
  anti_join(stop_words, by = "word")

head(tidy_emails)
# A tibble: 6 × 3
label id word
<chr> <int> <chr>
1 ham 1 exmh
2 ham 1 workers
3 ham 1 admin
4 ham 1 redhat.com
5 ham 1 thu
6 ham 1 aug
tidy_emails <- tidy_emails |>
  filter(str_detect(word, "[a-z]")) |>    # keep real words
  filter(!str_detect(word, "^\\d+$"))     # remove pure numbers

tidy_emails
# A tibble: 1,472,313 × 3
label id word
<chr> <int> <chr>
1 ham 1 exmh
2 ham 1 workers
3 ham 1 admin
4 ham 1 redhat.com
5 ham 1 thu
6 ham 1 aug
7 ham 1 return
8 ham 1 path
9 ham 1 exmh
10 ham 1 workers
# ℹ 1,472,303 more rows
word_counts <- tidy_emails |>
  count(label, word, sort = TRUE)

head(word_counts)
# A tibble: 6 × 3
label word n
<chr> <chr> <int>
1 spam font 33903
2 spam 3d 32154
3 spam br 16751
4 spam td 15691
5 ham id 14726
6 ham received 14318
word_counts <- word_counts |>
  filter(n > 5)
## Visualization
top_words <- word_counts |>
  group_by(label) |>
  slice_max(n, n = 10) |>
  ungroup()

ggplot(top_words, aes(x = reorder(word, n), y = n, fill = label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~label, scales = "free") +
  coord_flip() +
  labs(title = "Top Words in Spam vs Ham")
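The feature-engineering and train/test-split chunk is also not shown here. A minimal sketch that turns the tokenized emails into TF-IDF features over a reduced vocabulary and splits them 70/30 (the vocabulary size, split ratio, and seed are all assumptions) could look like this:

library(e1071)   # naiveBayes()

set.seed(123)   # assumed seed, only to make the split reproducible

# Assumption: keep the 500 most frequent terms to limit the feature space
top_terms <- tidy_emails |>
  count(word, sort = TRUE) |>
  slice_head(n = 500) |>
  pull(word)

email_features <- tidy_emails |>
  filter(word %in% top_terms) |>
  count(label, id, word) |>
  bind_tf_idf(word, id, n) |>                 # TF-IDF, with id as the document
  select(label, id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0) |>
  mutate(label = factor(label))

# Some tokens (e.g. "3d") are not valid R column names; make them syntactic
names(email_features) <- make.names(names(email_features), unique = TRUE)

# 70/30 train-test split (assumed proportion)
train_idx <- sample(nrow(email_features), size = floor(0.7 * nrow(email_features)))
train <- email_features[train_idx, ]
test  <- email_features[-train_idx, ]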
model <- naiveBayes(label ~ . - id, data = train)
predictions <- predict(model, test)
table(Predicted = predictions, Actual = test$label)
Actual
Predicted ham spam
ham 498 10
spam 2 280
mean(predictions == test$label)
[1] 0.9848101
cm <- as.data.frame(table(Predicted = predictions, Actual = test$label))

ggplot(cm, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Confusion Matrix Heatmap") +
  theme_minimal()
## Interpretation
The Naive Bayes model performed very well, achieving an accuracy of approximately 98.5%.
The model correctly classified 498 ham emails and 280 spam emails. Only a small number of errors were made, with 10 spam emails incorrectly classified as ham (false negatives) and 2 ham emails incorrectly classified as spam (false positives).
The model demonstrates high precision (about 99.3%): when it predicts an email as spam, it is almost always correct. Recall is also high (about 96.6%), indicating that the model identifies most spam emails, although a small number are missed.
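For reference, these rates can be computed directly from the confusion matrix above, treating spam as the positive class:

# Counts taken from the confusion matrix above (spam = positive class)
tp <- 280   # spam correctly flagged as spam
fp <- 2     # ham incorrectly flagged as spam
fn <- 10    # spam missed (classified as ham)
tn <- 498   # ham correctly classified as ham

precision <- tp / (tp + fp)                    # 280 / 282  ~ 0.993
recall    <- tp / (tp + fn)                    # 280 / 290  ~ 0.966
accuracy  <- (tp + tn) / (tp + fp + fn + tn)   # 778 / 790  ~ 0.985

c(precision = precision, recall = recall, accuracy = accuracy)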
In practical terms, false negatives are more concerning because spam emails that are not detected may reach the user. However, the model’s overall performance suggests it is highly effective for spam detection.
## Conclusion
This assignment was challenging because the dataset contained many individual email files stored in separate spam and ham folders. After loading and cleaning the emails, I used tidy text methods to tokenize the messages, remove stop words, and create TF-IDF features.
The Naive Bayes model performed very well, with an accuracy of approximately 98.5%. The model correctly classified most spam and ham emails, although a small number of spam emails were incorrectly classified as ham. This shows that text classification can be an effective method for detecting spam, but the model is not perfect.
Overall, this project demonstrates how labeled email data can be transformed into structured features and used to build a machine learning model for spam detection.