Ham and Spam Detection

Author

Ramde Guibril

Approach

For this assignment, I downloaded the SpamAssassin public corpus from Apache and unzipped the files locally. The dataset contains two groups of emails: easy_ham, which represents non-spam emails, and spam_2, which represents spam emails.

The goal of this project is to use already labeled emails to build a spam/ham classifier. I first loaded the email files from each folder, assigned each email a label, and combined them into one data frame. I then cleaned and tokenized the text using tidy text methods, removed stop words, and transformed the email text into TF-IDF features for classification.
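
The load, label, tokenize, and filter steps can be sketched on a toy example in base R. This is only an illustration under simplifying assumptions: the two messages, the tiny `toy_stop_words` vector, and the `simple_tokens()` helper are all hypothetical, and the regular-expression split is a rough stand-in for the tidytext tokenizer used in the analysis, not a reimplementation of it.

```r
# Toy stand-ins for labeled emails (hypothetical data, not from the corpus)
toy_emails <- data.frame(
  text  = c("Meeting moved to Thursday at the office",
            "WIN a FREE prize now, click the link"),
  label = c("ham", "spam"),
  stringsAsFactors = FALSE
)

# A tiny illustrative stop-word list (the analysis uses tidytext's stop_words)
toy_stop_words <- c("a", "at", "the", "to")

# Lowercase, split on anything that is not a letter or digit, drop empty strings
simple_tokens <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens[nzchar(tokens)]
}

# One row per (email id, label, word), with stop words removed
toy_tidy <- do.call(rbind, lapply(seq_len(nrow(toy_emails)), function(i) {
  words <- simple_tokens(toy_emails$text[i])
  words <- words[!words %in% toy_stop_words]
  data.frame(id = i, label = toy_emails$label[i], word = words,
             stringsAsFactors = FALSE)
}))

toy_tidy
```

The result has the same one-token-per-row shape as the tidy data frame built from the real emails, which is what makes the later counting and TF-IDF steps simple group-wise operations.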

Data Source

The dataset used in this project comes from the Apache SpamAssassin public corpus: https://spamassassin.apache.org/old/publiccorpus/

Due to the large number of individual files, the dataset is stored locally in the project directory and is not included in the RPubs publication.

Reproducibility Note

To reproduce this analysis:

1. Download the SpamAssassin corpus from the link above.
2. Extract the folders `easy_ham` and `spam_2`.
3. Place them in the project directory.
4. Run the code as provided.
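
Before running the analysis, it can help to confirm that both folders are where the code expects them. The sketch below is a minimal check, assuming the folders sit directly under the working directory; `check_corpus()` is a hypothetical helper, not part of the original code.

```r
# Report whether each expected corpus folder exists under `root`
# and how many files it contains (hypothetical helper)
check_corpus <- function(root = ".", folders = c("easy_ham", "spam_2")) {
  paths <- file.path(root, folders)
  data.frame(
    folder  = folders,
    exists  = dir.exists(paths),
    # list.files() returns an empty vector for a missing folder, so this gives 0
    n_files = unname(vapply(paths, function(p) length(list.files(p)), integer(1))),
    stringsAsFactors = FALSE
  )
}

check_corpus()
```

In the run reported here, the two folders contained 2,551 and 1,397 files respectively, so noticeably different counts would suggest a different archive version or an incomplete extraction.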

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ham_files <- list.files("easy_ham", full.names = TRUE, all.files = TRUE)
spam_files <- list.files("spam_2", full.names = TRUE, all.files = TRUE)

ham_files <- ham_files[!basename(ham_files) %in% c(".", "..")]
spam_files <- spam_files[!basename(spam_files) %in% c(".", "..")]

length(ham_files)
[1] 2551
length(spam_files)
[1] 1397
read_email <- function(file) {
  paste(readLines(file, warn = FALSE), collapse = " ")
}

ham_texts <- sapply(ham_files, read_email)
spam_texts <- sapply(spam_files, read_email)

Convert emails to tidy text

emails_df <- tibble(
  text = c(ham_texts, spam_texts),
  label = c(
    rep("ham", length(ham_texts)),
    rep("spam", length(spam_texts))
  )
)
glimpse(emails_df)
Rows: 3,948
Columns: 2
$ text  <chr> "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002 Re…
$ label <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "…
table(emails_df$label)

 ham spam 
2551 1397 
library(tidytext)

tidy_emails <- emails_df |>
  mutate(id = row_number()) |>
  unnest_tokens(word, text)

head(tidy_emails)
# A tibble: 6 × 3
  label    id word      
  <chr> <int> <chr>     
1 ham       1 from      
2 ham       1 exmh      
3 ham       1 workers   
4 ham       1 admin     
5 ham       1 redhat.com
6 ham       1 thu       
data(stop_words)

tidy_emails <- tidy_emails |>
  anti_join(stop_words, by = "word")

head(tidy_emails)
# A tibble: 6 × 3
  label    id word      
  <chr> <int> <chr>     
1 ham       1 exmh      
2 ham       1 workers   
3 ham       1 admin     
4 ham       1 redhat.com
5 ham       1 thu       
6 ham       1 aug       
tidy_emails <- tidy_emails |>
  filter(str_detect(word, "[a-z]")) |>   # keep tokens containing at least one letter
  filter(!str_detect(word, "^\\d+$"))   # drop digit-only tokens (safeguard; the filter above already excludes them)
tidy_emails
# A tibble: 1,472,313 × 3
   label    id word      
   <chr> <int> <chr>     
 1 ham       1 exmh      
 2 ham       1 workers   
 3 ham       1 admin     
 4 ham       1 redhat.com
 5 ham       1 thu       
 6 ham       1 aug       
 7 ham       1 return    
 8 ham       1 path      
 9 ham       1 exmh      
10 ham       1 workers   
# ℹ 1,472,303 more rows
top_words <- tidy_emails |>
  count(word, sort = TRUE) |>
  slice_head(n = 1000) |>
  pull(word)

tidy_emails <- tidy_emails |>
  filter(word %in% top_words)
# TF-IDF per email: each id is a document, n is the word's count in that email
tf_idf <- tidy_emails |>
  count(id, word) |>
  bind_tf_idf(word, id, n)

tf_idf
# A tibble: 356,276 × 6
      id word          n      tf   idf  tf_idf
   <int> <chr>     <int>   <dbl> <dbl>   <dbl>
 1     1 admin         4 0.0131  0.792 0.0104 
 2     1 ago           1 0.00327 3.18  0.0104 
 3     1 archive       1 0.00327 0.900 0.00294
 4     1 ascii         1 0.00327 1.07  0.00351
 5     1 aug          13 0.0425  1.62  0.0689 
 6     1 beenthere     1 0.00327 0.849 0.00277
 7     1 bulk          1 0.00327 0.702 0.00229
 8     1 cc            1 0.00327 1.45  0.00475
 9     1 charset       1 0.00327 0.430 0.00141
10     1 chris         2 0.00654 3.56  0.0233 
# ℹ 356,266 more rows
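
To make the tf, idf, and tf_idf columns concrete, the same quantities can be computed by hand on a toy corpus. The sketch below uses base R and assumes the definitions that are consistent with the output above: tf is a word's count divided by the total retained words in that email, and idf is the natural log of the number of emails divided by the number of emails containing the word. The three toy documents and their counts are hypothetical.

```r
# Hypothetical word counts for three toy "emails"
toy_counts <- data.frame(
  id   = c(1, 1, 2, 2, 3),
  word = c("free", "meeting", "free", "prize", "meeting"),
  n    = c(2, 2, 1, 3, 4),
  stringsAsFactors = FALSE
)

n_docs     <- length(unique(toy_counts$id))
# Total words per email, indexed by id
doc_totals <- tapply(toy_counts$n, toy_counts$id, sum)
# Number of distinct emails in which each word appears
docs_with  <- tapply(toy_counts$id, toy_counts$word, function(x) length(unique(x)))

# Term frequency: share of the email's words taken up by this word
toy_counts$tf  <- toy_counts$n / as.numeric(doc_totals[as.character(toy_counts$id)])
# Inverse document frequency: rarer words across emails get a larger weight
toy_counts$idf <- log(n_docs / as.numeric(docs_with[toy_counts$word]))
toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf

toy_counts
```

Here "prize" appears in only one of the three documents, so it receives the largest idf, while "free" and "meeting" each appear in two and are weighted less.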
tf_idf |> 
  arrange(desc(tf_idf)) |> 
  head(10)
# A tibble: 10 × 6
      id word                 n    tf   idf tf_idf
   <int> <chr>            <int> <dbl> <dbl>  <dbl>
 1  3948 mv                1400 1      6.89   6.89
 2  2777 webmaster           69 0.489  3.41   1.67
 3   507 alb                119 0.263  5.88   1.55
 4   509 alb                119 0.263  5.88   1.55
 5  2855 3d                 157 0.599  2.27   1.36
 6  3206 netnovations.com    50 0.246  5.51   1.36
 7  3210 netnoir.net         50 0.233  5.45   1.27
 8  3901 3d                 143 0.534  2.27   1.21
 9  3329 3e                 376 0.222  5.19   1.15
10  3362 3e                 376 0.205  5.19   1.07
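
The first row of this table stands out: among the retained words, email 3948 consists entirely of the token mv, repeated 1,400 times, which looks more like a file of shell commands than a message. The SpamAssassin archives appear to ship such a helper file inside each folder. The sketch below shows one way to exclude it before reading the emails, assuming the helper file is named `cmds`; `drop_helper_files()` and the example paths are hypothetical, and this filter was not applied to the results reported here.

```r
# Keep only paths whose file name is not a known helper file.
# `helper_names` is an assumption about the corpus layout.
drop_helper_files <- function(files, helper_names = c("cmds")) {
  files[!basename(files) %in% helper_names]
}

# Hypothetical paths for illustration
example_files <- c("spam_2/00001.abc", "spam_2/00002.def", "spam_2/cmds")
drop_helper_files(example_files)
```

Applied to `ham_files` and `spam_files` right after `list.files()`, a filter like this would change the file counts and the downstream numbers, so the results would need to be regenerated.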
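# Drop the token "id" so it does not collide with the id column when pivoting
tf_idf <- tf_idf |>
  filter(!word %in% c("id"))

# One row per email, one column per word, TF-IDF as the value (0 when absent)
dtm <- tf_idf |>
  select(id, word, tf_idf) |>
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)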

dtm
# A tibble: 3,948 × 1,000
      id  admin    ago archive   ascii    aug beenthere    bulk      cc charset
   <int>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
 1     1 0.0104 0.0104 0.00294 0.00351 0.0689   0.00277 0.00229 0.00475 0.00141
 2     2 0      0      0       0.00684 0.124    0       0.00447 0       0.00274
 3     3 0      0      0       0.00596 0.0992   0       0.00390 0       0.00239
 4     4 0.0223 0      0.00634 0.00756 0.103    0.00598 0.00494 0       0.00303
 5     5 0.0113 0      0.00321 0       0.0869   0.00303 0.00251 0       0      
 6     6 0      0      0       0.00701 0.117    0       0.00459 0       0.00281
 7     7 0      0      0       0.00692 0.136    0       0.00453 0       0.00278
 8     8 0      0      0       0       0.121    0       0.00401 0       0.00246
 9     9 0      0      0       0.00684 0.114    0       0.00447 0       0.00274
10    10 0      0      0       0       0.100    0       0.00309 0       0.00190
# ℹ 3,938 more rows
# ℹ 990 more variables: chris <dbl>, code <dbl>, content <dbl>,
#   corp.example.com <dbl>, corp.redhat.com <dbl>, cvs <dbl>, cwg <dbl>,
#   date <dbl>, dated <dbl>, day <dbl>, deepeddy.com <dbl>,
#   deepeddy.vircio.com <dbl>, delivered <dbl>, developers <dbl>,
#   discussion <dbl>, dogma.slashnull.org <dbl>, drop <dbl>, edt <dbl>,
#   error <dbl>, errors <dbl>, esmtp <dbl>, example.com <dbl>, exmh <dbl>, …
word_counts <- tidy_emails |>
  count(label, word, sort = TRUE)

head(word_counts)
# A tibble: 6 × 3
  label word         n
  <chr> <chr>    <int>
1 spam  font     33903
2 spam  3d       32154
3 spam  br       16751
4 spam  td       15691
5 ham   id       14726
6 ham   received 14318
word_counts <- word_counts |>
  filter(n > 5)

Visualization

# Ten most frequent words within each label (separate from the top_words vocabulary above)
top_words_by_label <- word_counts |>
  group_by(label) |>
  slice_max(n, n = 10) |>
  ungroup()

ggplot(top_words_by_label, aes(x = reorder(word, n), y = n, fill = label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~label, scales = "free") +
  coord_flip() +
  labs(title = "Top Words in Spam vs Ham")

dtm <- dtm |>
  left_join(
    emails_df |> 
      mutate(id = row_number()) |> 
      select(id, label),
    by = "id"
  )
set.seed(123)

train_index <- sample(nrow(dtm), 0.8 * nrow(dtm))

train <- dtm[train_index, ]
test  <- dtm[-train_index, ]

train$label <- as.factor(train$label)
test$label  <- as.factor(test$label)

Naive Bayes Model

library(e1071)

Attaching package: 'e1071'
The following object is masked from 'package:ggplot2':

    element
model <- naiveBayes(label ~ . - id, data = train)
predictions <- predict(model, test)
table(Predicted = predictions, Actual = test$label)
         Actual
Predicted ham spam
     ham  498   10
     spam   2  280
mean(predictions == test$label)
[1] 0.9848101
cm <- as.data.frame(table(Predicted = predictions, Actual = test$label))

ggplot(cm, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Confusion Matrix Heatmap") +
  theme_minimal()

Interpretation

The Naive Bayes model performed very well, achieving an accuracy of approximately 98.5%.

The model correctly classified 498 ham emails and 280 spam emails. Only a small number of errors were made, with 10 spam emails incorrectly classified as ham (false negatives) and 2 ham emails incorrectly classified as spam (false positives).

The model demonstrates high precision: of the 282 emails it predicted as spam, 280 were actually spam (about 99.3%). Recall is also high: it identified 280 of the 290 spam emails in the test set (about 96.6%), so a small number are missed.

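In practical terms, false negatives are a concern because undetected spam reaches the user's inbox, while false positives risk hiding legitimate mail from the user. Here both are rare (10 and 2 of the 790 test emails), so the model's overall performance suggests it is highly effective for spam detection on this corpus.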
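
The precision and recall discussed above follow directly from the confusion matrix. The sketch below recomputes them in base R, treating spam as the positive class; the counts are copied from the reported table, and the variable names are illustrative rather than part of the original code.

```r
# Confusion matrix counts copied from the reported table
# (matrix() fills column by column: Actual ham first, then Actual spam)
cm_counts <- matrix(
  c(498, 2, 10, 280),
  nrow = 2,
  dimnames = list(Predicted = c("ham", "spam"), Actual = c("ham", "spam"))
)

tp <- cm_counts["spam", "spam"]  # spam correctly flagged
fp <- cm_counts["spam", "ham"]   # ham wrongly flagged as spam
fn <- cm_counts["ham", "spam"]   # spam that slipped through as ham
tn <- cm_counts["ham", "ham"]    # ham correctly passed

precision <- tp / (tp + fp)      # share of predicted spam that is spam
recall    <- tp / (tp + fn)      # share of actual spam that was caught
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / sum(cm_counts)

round(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy), 3)
```

Reporting precision and recall alongside accuracy is useful here because the classes are imbalanced (500 ham versus 290 spam in the test set), so accuracy alone can hide which kind of error the model makes.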

Conclusion

This assignment was challenging because the dataset contained many individual email files stored in separate spam and ham folders. After loading and cleaning the emails, I used tidy text methods to tokenize the messages, remove stop words, and create TF-IDF features.

The Naive Bayes model reached an accuracy of approximately 98.5% on the held-out test set. It correctly classified most spam and ham emails, although 10 of the 290 spam emails in the test set were incorrectly classified as ham. This shows that text classification can be an effective method for detecting spam, even though the model is not perfect.

Overall, this project demonstrates how labeled email data can be transformed into structured features and used to build a machine learning model for spam detection.