The Project

We have been tasked with creating a model to correctly separate spam emails from ham emails. Apparently, ham is industry jargon for the emails you actually want to receive.

We will try to build a model using boilerplate text analysis, first creating a corpus and document-term matrix from the bank of labeled emails.

The Data

SpamAssassin is an open-source project from Apache meant to enable system administrators to implement text analysis, Bayesian filtering, and other methods to help stamp out the global spamdemic.

Version 3.4.6 was released on April 12 of this year, and it is the last update before the release of 4.0. I found it helpful to read through the source code. You guessed it, this release comes with an Apache v2.0 license: https://tinyurl.com/ryfnrwhx

We will be loading some of the project's old datasets rather than using the software itself. The data can be found here:

https://spamassassin.apache.org/old/publiccorpus/

library(tidyverse)
library(magrittr)
library(tidytext)
library(tm)
library(tidymodels)

set.seed(1337)

hamURL <- 
  "https://github.com/TheWerefriend/data607/raw/master/project4/ham.zip"
spamURL <- 
  "https://github.com/TheWerefriend/data607/raw/master/project4/spam.zip"
collectMail <- function(zipURL) {
  # get the working directory and the file
  dir <- getwd()
  temp <- tempfile()
  download.file(zipURL, temp)
  
  # get a list of the files in the archive, unzip to WD
  fileList <- unzip(temp, list = TRUE)
  unzip(temp)
  
  # read the files into a dataframe, preserve column for filename
  df <- list.files(path = dir) %>%
    as.data.frame() %>%
    set_colnames("filename") %>%
    mutate(text = lapply(list.files(path = dir, full.names = TRUE),
                         read_lines)) %>%
    unnest(c(text)) %>%
    group_by(filename) %>%
    mutate(text = paste(text, collapse = " ")) %>%
    ungroup() %>%
    distinct()
  
  # clean up the WD, return a tibble
  unlink(temp)
  file.remove(fileList$Name)
  return(tibble(df))
}

We want to create a document-term matrix using the tm package (like in the Eric Cartman meets Reddit project…) and use it to train our model. First, we have to apply some preprocessing to create a corpus.

transformMail <- function(mail) {
  # generate a Corpus and return it
  transformedCorpus <- VCorpus(VectorSource(mail$text)) %>%
    tm_map(removePunctuation, ucp = TRUE) %>%
    tm_map(removeNumbers) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("SMART")) %>%
    tm_map(stemDocument, "english")
  
  return(transformedCorpus)
}

Now, there are some invisible problems here. The calls were throwing encoding errors, and according to the Linux utilities encguess and uchardet, nearly all of the files are encoded as US-ASCII, but some are not. Rather than fight every encoding issue, for the sake of this project I will throw out the emails that were not encoded properly.
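To get a sense of how much data this throws away, base R's validEnc() can be tabulated on an unfiltered collection first. A minimal sketch (rawHam is a hypothetical throwaway object, and this repeats the download, so it is purely illustrative):

# illustrative only: count valid vs. invalid encodings before filtering
rawHam <- collectMail(hamURL)
table(validEnc(rawHam$text))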

# remove badly encoded emails
ham <- collectMail(hamURL) %>%
  filter(validEnc(text))

spam <- collectMail(spamURL) %>%
  filter(validEnc(text))

hamCorpus <- transformMail(ham)
spamCorpus <- transformMail(spam)
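
Before building the document-term matrices, it is worth spot-checking that the transformations behaved as expected. A minimal sketch that prints the start of the first cleaned ham document (lowercased, stemmed, with numbers, punctuation, and stopwords removed):

# peek at the beginning of the first transformed ham document
substr(as.character(hamCorpus[[1]]), 1, 200)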

inspect(DocumentTermMatrix(hamCorpus))
## <<DocumentTermMatrix (documents: 3858, terms: 67894)>>
## Non-/sparse entries: 593860/261341192
## Sparsity           : 100%
## Maximal term length: 280
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   <td aug esmtp localhost mon postfix receiv sep wed width=
##   105    0   0     2         0   0       0      3   0   0      0
##   1428   0   0     5         4   0       3      8   8   0      0
##   1443   0   0     4         3   0       3      8   9   0      0
##   263  143   0     1         0   0       0      2   0   3    108
##   2689   0  12     2         3   0       2      7   0   0      0
##   2757   0  11     1         5   2       1      9   0   4      0
##   303  260   0     1         0   0       0      4   0   2     94
##   48   143   0     1         0   0       0      2   0   3    108
##   538  224  11     0         0   0       0      9   0   0    298
##   89   108   0     1         0   0       0      3   0   0    222
inspect(DocumentTermMatrix(spamCorpus))
## <<DocumentTermMatrix (documents: 1694, terms: 68645)>>
## Non-/sparse entries: 304735/115979895
## Sparsity           : 100%
## Maximal term length: 888
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   <br> <font <td <tr> email esmtp jul receiv size= width=
##   132   188     6  15   26    10     2   4      6    24      3
##   1383    1     8   2    6     2     6  16     20     0      4
##   1403    0     0   0    0     1     4  10      9     0      0
##   1407    1     8   2    6     2     6  14     18     0      4
##   1414    0     7   0    0     4     1   5     11     0      0
##   1415    0     7   0    0     4     1   5     11     0      0
##   1623  250     2   1   12    10     5   0      6    22      3
##   538     0     0   0    0     0     1   0      4     0      0
##   588     0     0   0    0    10     2   0      5     0      0
##   86    368     6  15   26    16     2   0      8    24      3
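
A quick way to compare the two corpora is to list their most frequent terms with tm's findFreqTerms(). A sketch (the 1000-occurrence cutoff is an arbitrary choice of mine):

# terms occurring at least 1000 times in each corpus
head(findFreqTerms(DocumentTermMatrix(hamCorpus), lowfreq = 1000), 20)
head(findFreqTerms(DocumentTermMatrix(spamCorpus), lowfreq = 1000), 20)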

There appear to be significant differences between these two matrices! So… we have to combine the two sets with a column for spam status and make a new document-term matrix before training… Maybe I could have done these things in a more efficient order.

mail <- rbind(ham %>% mutate(status = FALSE),
              spam %>% mutate(status = TRUE))

mail$status <- factor(mail$status)

# Note: the sparse parameter drops terms absent from more than 97% of documents
combinedDTM <- transformMail(mail) %>%
  DocumentTermMatrix() %>%
  removeSparseTerms(sparse = 0.97)
total <- combinedDTM %>%
  as.matrix() %>%
  as.data.frame() %>%
  sapply(., as.numeric) %>%
  as.data.frame() %>%
  cbind(status = mail$status)
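
A quick sanity check on the combined data frame before modeling (the exact dimensions depend on the sparsity cutoff above):

# one row per email, one column per retained term, plus the status factor
dim(total)
table(total$status)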

The Model

We split the data:

split <- initial_split(total, strata = status, prop = 0.67)
trainer <- training(split)
tester <- testing(split)
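
Because we stratified on status, the spam proportion should be roughly equal in both partitions. A quick check:

# class balance in each partition; strata = status should keep these close
prop.table(table(trainer$status))
prop.table(table(tester$status))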

We define the model with a recipe:

spamRecipe <- recipe(status ~ ., data = trainer) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  prep(training = trainer)

spamTrain <- juice(spamRecipe)
spamTest <- bake(spamRecipe, tester)
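
A quick way to confirm that the centering and scaling steps did their job (a sketch; every predictor mean should come out near zero):

# after step_center and step_scale, predictor means should all be ~0
predictorMeans <- colMeans(select(spamTrain, -status))
summary(round(predictorMeans, 3))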

We fit the model:

spamModel <- rand_forest(mode = "classification") %>%
  set_engine("ranger")

spamFit <- spamModel %>%
  fit(status ~ ., data = spamTrain)

We evaluate the results:

results <- spamTest %>%
  select(status) %>%
  bind_cols(predict(spamFit, new_data = spamTest)) %>%
  rename(predicted = .pred_class)

precision(results, truth = status, estimate = predicted)
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.986
recall(results, truth = status, estimate = predicted)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.984
f_meas(results, truth = status, estimate = predicted)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.985
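
A confusion matrix from yardstick gives the same picture at a glance (a quick sketch; exact counts depend on the split):

# cross-tabulate predicted vs. actual spam status
conf_mat(results, truth = status, estimate = predicted)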

Conclusions

It worked! We have a pretty solid spam classifier based on the easy and hard ham sets and both spam sets described in the introduction. Room for improvement: we did not use all of the data, and the classifier will not work on emails encoded with anything other than UTF-8.