We have been tasked with creating a model to correctly separate spam emails from ham emails. Apparently, ham is industry jargon for the emails you actually want to receive.
We will build the model with standard text-analysis tooling, first creating a corpus and a document-term matrix from the bank of labeled emails.
SpamAssassin is an open-source project from Apache meant to enable system administrators to implement text analysis, Bayesian filtering, and other methods to help stamp out the global spamdemic.
Version 3.4.6 was released on April 12, 2021, and it is the last update before the release of 4.0. I found it helpful to read through the source code. You guessed it, this release comes with an Apache v2.0 license: https://tinyurl.com/ryfnrwhx
We will be loading some old datasets rather than using the software itself. The data can be found here:
https://spamassassin.apache.org/old/publiccorpus/
library(tidyverse)
library(magrittr)
library(tidytext)
library(tm)
library(tidymodels)
set.seed(1337)
hamURL <-
  "https://github.com/TheWerefriend/data607/raw/master/project4/ham.zip"
spamURL <-
  "https://github.com/TheWerefriend/data607/raw/master/project4/spam.zip"
collectMail <- function(zipURL) {
  # get the working directory and download the archive to a temp file
  dir <- getwd()
  temp <- tempfile()
  download.file(zipURL, temp)
  # get a list of the files in the archive, then unzip into the WD
  fileList <- unzip(temp, list = TRUE)
  unzip(temp)
  # read the files into a data frame, keeping a column for the filename;
  # each message is collapsed into a single string
  df <- list.files(path = dir) %>%
    as.data.frame() %>%
    set_colnames("filename") %>%
    mutate(text = lapply(list.files(path = dir, full.names = TRUE),
                         read_lines)) %>%
    unnest(c(text)) %>%
    group_by(filename) %>%
    mutate(text = paste(text, collapse = " ")) %>%
    ungroup() %>%
    distinct()
  # clean up the WD and return a tibble
  # (remove only the extracted files, by name)
  unlink(temp)
  file.remove(fileList$Name)
  return(tibble(df))
}
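One caveat worth flagging: collectMail() unzips into the working directory and then reads every file it finds there, so any unrelated files sitting in that directory would be swept into the data frame. A safer variant, sketched below (not the version that produced the results in this write-up), extracts into a throwaway directory instead:

collectMailSafe <- function(zipURL) {
  # download the archive to a temp file
  temp <- tempfile(fileext = ".zip")
  download.file(zipURL, temp)
  # extract into a dedicated temp directory instead of the WD
  exdir <- tempfile("mail")
  unzip(temp, exdir = exdir)
  files <- list.files(path = exdir, full.names = TRUE, recursive = TRUE)
  # one row per message, each collapsed to a single string
  df <- tibble(filename = basename(files),
               text = vapply(files, function(f)
                 paste(read_lines(f), collapse = " "), character(1)))
  unlink(c(temp, exdir), recursive = TRUE)
  df
}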
We want to create a document-term matrix using the tm package (like in the Eric Cartman meets Reddit project…) and use it to train our model. First, we apply some preprocessing to create a corpus.
transformMail <- function(mail) {
  # build a corpus and normalize it step by step:
  # strip punctuation (ucp = TRUE uses Unicode character classes),
  # drop numbers, lowercase, remove SMART stopwords, and stem
  transformedCorpus <- VCorpus(VectorSource(mail$text)) %>%
    tm_map(removePunctuation, ucp = TRUE) %>%
    tm_map(removeNumbers) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("SMART")) %>%
    tm_map(stemDocument, "english")
  return(transformedCorpus)
}
Now, there are some invisible bugs here: the calls were throwing encoding errors. According to the Linux utilities encguess and uchardet, nearly all of the files are encoded as US-ASCII, but some are not. Rather than fight every encoding issue, for the sake of this project I will throw out the files that are not encoded properly. (validEnc() is a base R function that flags strings whose declared encoding is invalid.)
# remove badly encoded emails
ham <- collectMail(hamURL) %>%
  filter(validEnc(text))
spam <- collectMail(spamURL) %>%
  filter(validEnc(text))
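An alternative to discarding, sketched here but untested on this corpus, would be to re-encode the offending strings with base R's iconv(). The from = "latin1" is a pure guess, since we never identified what the bad encodings actually are:

fixEncoding <- function(text) {
  # keep strings whose declared encoding is valid; re-encode the rest
  # (from = "latin1" is an assumption, not a verified fact)
  ifelse(validEnc(text),
         text,
         iconv(text, from = "latin1", to = "UTF-8", sub = ""))
}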
hamCorpus <- transformMail(ham)
spamCorpus <- transformMail(spam)
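Before building the document-term matrices, a quick spot check (not part of the original pipeline) confirms the transformations are doing what we expect, comparing the first 80 characters of a raw message with its cleaned counterpart:

substr(ham$text[1], 1, 80)                   # first 80 chars, raw
substr(as.character(hamCorpus[[1]]), 1, 80)  # same document, cleaned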
inspect(DocumentTermMatrix(hamCorpus))
## <<DocumentTermMatrix (documents: 3858, terms: 67894)>>
## Non-/sparse entries: 593860/261341192
## Sparsity : 100%
## Maximal term length: 280
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs <td aug esmtp localhost mon postfix receiv sep wed width=
## 105 0 0 2 0 0 0 3 0 0 0
## 1428 0 0 5 4 0 3 8 8 0 0
## 1443 0 0 4 3 0 3 8 9 0 0
## 263 143 0 1 0 0 0 2 0 3 108
## 2689 0 12 2 3 0 2 7 0 0 0
## 2757 0 11 1 5 2 1 9 0 4 0
## 303 260 0 1 0 0 0 4 0 2 94
## 48 143 0 1 0 0 0 2 0 3 108
## 538 224 11 0 0 0 0 9 0 0 298
## 89 108 0 1 0 0 0 3 0 0 222
inspect(DocumentTermMatrix(spamCorpus))
## <<DocumentTermMatrix (documents: 1694, terms: 68645)>>
## Non-/sparse entries: 304735/115979895
## Sparsity : 100%
## Maximal term length: 888
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs <br> <font <td <tr> email esmtp jul receiv size= width=
## 132 188 6 15 26 10 2 4 6 24 3
## 1383 1 8 2 6 2 6 16 20 0 4
## 1403 0 0 0 0 1 4 10 9 0 0
## 1407 1 8 2 6 2 6 14 18 0 4
## 1414 0 7 0 0 4 1 5 11 0 0
## 1415 0 7 0 0 4 1 5 11 0 0
## 1623 250 2 1 12 10 5 0 6 22 3
## 538 0 0 0 0 0 1 0 4 0 0
## 588 0 0 0 0 10 2 0 5 0 0
## 86 368 6 15 26 16 2 0 8 24 3
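The sampled terms already look quite different, but we can make the comparison explicit with tm's findFreqTerms(), which lists every term above a frequency threshold (1000 here is an arbitrary cutoff):

findFreqTerms(DocumentTermMatrix(hamCorpus), lowfreq = 1000)
findFreqTerms(DocumentTermMatrix(spamCorpus), lowfreq = 1000)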
There appear to be significant differences between these two matrices! So… we have to combine the two sets with a column for spam status, and build a new document-term matrix before training… Perhaps I could have done these steps in a more efficient order.
mail <- rbind(ham %>% mutate(status = FALSE),
              spam %>% mutate(status = TRUE))
mail$status <- factor(mail$status)
# Note: removeSparseTerms() drops terms (not documents) that are
# absent from more than sparse * 100% of the documents
combinedDTM <- transformMail(mail) %>%
  DocumentTermMatrix() %>%
  removeSparseTerms(sparse = 0.97)
total <- combinedDTM %>%
  as.matrix() %>%
  as.data.frame() %>%
  sapply(., as.numeric) %>%
  as.data.frame() %>%
  cbind(status = mail$status)
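A quick dimension check shows how aggressively the 0.97 threshold prunes the vocabulary; the exact counts will vary, but the roughly 70,000 raw terms should shrink to a few hundred survivors:

dim(combinedDTM)  # documents x terms surviving the sparsity filter
dim(total)        # same rows; surviving terms plus the status column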
We split the data:
split <- initial_split(total, strata = status, prop = 0.67)
trainer <- training(split)
tester <- testing(split)
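Since we stratified on status, ham and spam should appear in roughly their original proportions on both sides of the split; a quick check:

table(trainer$status)
table(tester$status)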
We define the preprocessing with a recipe: center and scale every predictor. prep() estimates the centers and scales from the training data, juice() returns the transformed training set, and bake() applies those same transformations to the test set:
spamRecipe <- recipe(status ~ ., data = trainer) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  prep(training = trainer)
spamTrain <- juice(spamRecipe)
spamTest <- bake(spamRecipe, tester)
We fit the model, a random forest via the ranger engine, on the preprocessed training data:
spamModel <- rand_forest(mode = "classification") %>%
  set_engine("ranger")
# fit on the training set (not the test set!)
spamFit <- spamModel %>%
  fit(status ~ ., data = spamTrain)
We evaluate the results:
# predict on the held-out test set and compare to the true labels
results <- spamTest %>%
  select(status) %>%
  mutate(predicted = predict(spamFit, new_data = spamTest)$.pred_class)
precision(results, truth = status, estimate = predicted)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.986
recall(results, truth = status, estimate = predicted)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.984
f_meas(results, truth = status, estimate = predicted)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.985
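Beyond these summary numbers, yardstick's conf_mat() shows where the errors actually land. One subtlety: yardstick treats the first factor level, FALSE (i.e. ham), as the event of interest by default, so the precision and recall above describe ham detection:

conf_mat(results, truth = status, estimate = predicted)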
It worked! We have a pretty solid spam classifier based on the easy and hard ham sets and both spam sets described in the introduction. Room for improvement: we did not use all of the data, and the pipeline will not handle emails encoded in anything other than ASCII/UTF-8.