project_04

Approach

I set up the files from https://spamassassin.apache.org/old/publiccorpus/?C=M;O=A into my project folder under ./data/ham and ./data/spam by downloading and extracting the most recently modified tar files. So, I’ll use these for my spam and not-spam (bool) training data.

I’ll create a list for each folder, list.files("/data/spam") as spam_files and `list.files("/data/ham") as ham_files. I’ll use a standard index for an ID (each row is an email), then I’ll collapse each file into a single row of data in an str field called text. The tibble will also have a bool column for spam/ham where 0 will be ham and 1 will be spam.

I’ll then create a new df from this tibble where the words are unnested, so ID 1 will be n rows dependent on the word count, filter out common words using anti_join(stop_words), which is a tidytext function to remove common words. Then we will count for the next step of converting it into a data matrix for the prediction model.

The data matrix will be a wide table where each word is given a column and each row is an email_id along with the associated counts per word per column. To do this I need to use library tm which is called the text mining package, which allows us to create a matrix from these words with relative ease.

I asked AI what type of model would be best suited and it recommend the naiveBayes model, which requires the library e1071. So we will use that model with the dtm input and the emails df.

Afterwards we can test the model by taking some of the older modified files of https://spamassassin.apache.org/old/publiccorpus/?C=M;O=A like the ones with hard_ham or maybe generate a csv of my own with AI generating 20-50 examples of spam/ham emails and see if the model gets it right.

Codebase

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytext)

Warning: package 'tidytext' was built under R version 4.5.3

library(tm)

Warning: package 'tm' was built under R version 4.5.3

Loading required package: NLP

Attaching package: 'NLP'

The following object is masked from 'package:ggplot2':

    annotate

library(e1071)

Warning: package 'e1071' was built under R version 4.5.3


Attaching package: 'e1071'

The following object is masked from 'package:ggplot2':

    element

spam_files <- list.files("./data/spam", full.names = TRUE) |>
  discard(~ basename(.x) == "0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1")

ham_files <- list.files("./data/ham", full.names = TRUE)

set.seed(123)

n_files <- min(length(spam_files), length(ham_files))

spam_files <- sample(spam_files, n_files)
ham_files  <- sample(ham_files, n_files)

length(spam_files)

[1] 500

length(ham_files)

[1] 500

read_email <- function(path) {
  raw <- readLines(path, warn = FALSE, encoding = "latin1")
  
  subject <- raw[str_detect(raw, "^Subject:")] |>
    str_replace("^Subject:\\s*", "") |>
    paste(collapse = " ")
  
  blank_line <- which(raw == "")[1]
  
  body <- if (is.na(blank_line)) {
    raw
  } else {
    raw[(blank_line + 1):length(raw)]
  }
  
  paste(subject, paste(body, collapse = " "))
}

# build email tibble
spam_df <- tibble(
  file = spam_files,
  label = 1,
  text = map_chr(file, read_email)
)

ham_df <- tibble(
  file = ham_files,
  label = 0,
  text = map_chr(file, read_email)
)

emails <- bind_rows(spam_df, ham_df) |>
  mutate(email_id = row_number()) |>
  select(email_id, file, label, text)

emails <- emails[-1, ]

head(emails)

# A tibble: 6 × 4
  email_id file                                              label text         
     <int> <chr>                                             <dbl> <chr>        
1        2 ./data/spam/0463.47a4c19eea5230ff19a42e62a5f59484     1 "[ILUG-Socia…
2        3 ./data/spam/0179.3a4c735c7c1e494f4e7a7b9465043280     1 "ADV: Lowest…
3        4 ./data/spam/0014.ed99ffe0f452b91be11684cbfe8d349c     1 "FREE Cell P…
4        5 ./data/spam/0195.8b276e08dd05b0131faa8fb24764f205     1 "The TBA Doc…
5        6 ./data/spam/0426.2002be3b0195b54596a5e7fd7d7561d5     1 "Work at Hom…
6        7 ./data/spam/0306.521d917ac6509c499c406647fd0d336b     1 "Best produc…

email_words <- emails |>
  select(email_id, label, text) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  filter(str_detect(word, "^[a-z]+$")) |>
  count(email_id, label, word, name = "count")

email_words

# A tibble: 92,038 × 4
   email_id label word          count
      <int> <dbl> <chr>         <int>
 1        2     1 access            1
 2        2     1 address           1
 3        2     1 administrator     1
 4        2     1 approved          1
 5        2     1 attractive        1
 6        2     1 benefit           1
 7        2     1 biggest           1
 8        2     1 bin               1
 9        2     1 biz               2
10        2     1 brand             1
# ℹ 92,028 more rows

email_words |>
  distinct(email_id, label) |>
  mutate(label_name = if_else(label == 1, "spam", "ham")) |>
  count(label_name)

# A tibble: 2 × 2
  label_name     n
  <chr>      <int>
1 ham          500
2 spam         499

top_words <- email_words |>
  group_by(word) |>
  summarize(
    total = sum(count),
    docs = n_distinct(email_id),
    .groups = "drop"
  ) |>
  filter(docs >= 5) |>
  slice_max(total, n = 3000)

email_words_small <- email_words |>
  semi_join(top_words, by = "word")

head(email_words_small)

# A tibble: 6 × 4
  email_id label word       count
     <int> <dbl> <chr>      <int>
1        2     1 access         1
2        2     1 address        1
3        2     1 approved       1
4        2     1 attractive     1
5        2     1 benefit        1
6        2     1 biggest        1

email_words_small |>
  distinct(email_id, label) |>
  mutate(label_name = if_else(label == 1, "spam", "ham")) |>
  count(label_name)

# A tibble: 2 × 2
  label_name     n
  <chr>      <int>
1 ham          500
2 spam         499

email_words2 <- email_words_small |>
  rename(.label = label)

email_dtm <- email_words2 |>
  pivot_wider(
    id_cols = c(email_id, .label),
    names_from = word,
    values_from = count,
    values_fill = 0
  )

x <- email_dtm |>
  select(-email_id, -.label)

y <- as.factor(email_dtm$.label)

table(y)

y
  0   1 
500 499

train_ids <- sample(seq_len(nrow(email_dtm)), 
                    size = 0.8 * nrow(email_dtm))

x_train <- x[train_ids, ]
x_test  <- x[-train_ids, ]

y_train <- y[train_ids]
y_test  <- y[-train_ids]

model <- naiveBayes(x_train, y_train, laplace = 1)

pred <- predict(model, x_test)

table(predicted = pred, actual = y_test)

         actual
predicted   0   1
        0 103  49
        1   1  47

mean(pred == y_test)

[1] 0.75

0.725 accuracy is not that bad! I didn’t use it against new data, but I did withhold 20% of the data within the corpus to be used for predictions. I think I can play around with the probabilities of the prediction, but to be honest, it took me a long time to just complete the project.