Assignment

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham data-set, then predict the class of new documents (either withheld from the training data-set or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/

Solution

Overview

Executive Summary

The tm package will be used to create a corpus of data which will serve as the source of features and observations for the analysis. This will then be converted into a document-term matrix. Finally, the caret package will be used for model fitting, validation, and testing.

The process of building a ham/spam filter is an oft-used pedagogical tool when teaching predictive modeling. Therefore, there is a multitude of information available on-line and in texts, of which we availed ourselves.

It should be noted that one of the more common packages in recent use for text mining, the RTextTools package, was recently removed from CRAN, and personal communication by one of us with the author (who is now building the news feed at LinkedIn) confirmed that the package is abandonware.

Lastly, we understand that the object of this exercise is not to build an excellent predictor but to demonstrate the knowledge required to build classification algorithms.

Document-Term Matrix

A document-term matrix (DTM) is the model matrix used in natural language processing (NLP). Its rows represent the documents in the corpus and its columns represent the selected terms or tokens, which are treated as features. The value in each cell depends on the weighting scheme selected. The simplest is term frequency (tf), which is just the number of times the term is found in that document. A more sophisticated scheme is term frequency–inverse document frequency (tf-idf). This measure increases with the frequency of the term but is offset by the number of documents in which the term appears. This lowers the weight of words that naturally appear very often in all kinds of documents and so do not shed much light on the type of document. The same problem is also addressed by removing words so common as to have no predictive power at all, like “and” or “the”. These are often called stop words.
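
To make the weighting schemes concrete, here is a minimal sketch on a made-up three-document corpus (separate from the project pipeline) contrasting term-frequency and tf-idf weighting with the same tm functions used below.

# Minimal illustration of tf vs. tf-idf weighting on a toy, made-up corpus.
library(tm)
toy <- VCorpus(VectorSource(c("free money now",
                              "meeting agenda and notes",
                              "free meeting today")))
dtm_tf    <- DocumentTermMatrix(toy)   # default weighting: term frequency
dtm_tfidf <- DocumentTermMatrix(toy, control = list(weighting = weightTfIdf))
inspect(dtm_tf)      # cells are raw counts of each term in each document
inspect(dtm_tfidf)   # terms appearing in many documents are down-weighted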

Code and Process

Style

In the following document, all user-created variables will be in snake_case and all user-created functions will be in CamelCase. Unfortunately, the tm package uses camelCase for its functions. wE aPoLoGIze fOr anY IncoNVenIence.

Load Libraries and Set Seed

set.seed(12)
library(doParallel)
cl <- makePSOCKcluster(6L)
registerDoParallel(cl)
library(tm)
library(SnowballC)
library(caret)
library(wordcloud)

List files

The files were downloaded from the link above, and the spam_2 and easy_ham sets were selected for analysis. These were unzipped so that each email is its own file in the directory.

s_files <- list.files("./Data/spam_2", full.names = TRUE)
h_files <- list.files("./Data/easy_ham", full.names = TRUE)
h_len <- length(h_files)
s_len <- length(s_files)

Building the Corpus

Email Headers

We will be focusing on email content, and not the meta information or doing reverse DNS lookups. Therefore, it makes sense to remove the email headers. According to the most recent RFC about email, RFC 5322, Section 2.2, the header should not contain any purely blank lines. Therefore, it is a very reasonable approach to look for the first blank line and only start ingesting the email from the next line. That is what is searched for by the regex pattern "^$" in the function below.
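
As a toy illustration (the message below is made up), the first blank line is located and only the lines after it are kept:

# Toy example: find the blank line that separates the header from the body.
toy_email <- c("From: alice@example.com",
               "Subject: lunch",
               "",
               "Are you free at noon?",
               "I found a new place.")
body_start <- min(grep("^$", toy_email)) + 1L
toy_email[body_start:length(toy_email)]
## [1] "Are you free at noon?" "I found a new place."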

Raw Corpus

The readLines function reads each line of a file as a separate element of a character vector. To turn this into a single character string, the paste function is used with the appropriate sep and collapse values. The class of the document is passed as a parameter to the `BuildCorpus` function below and stored as document-level metadata.

BuildCorpus <- function(files, class){
  for (i in seq_along(files)) {
    raw_text <- readLines(files[i])
    em_length <- length(raw_text)
    body_start <- min(grep("^$", raw_text, fixed = FALSE)) + 1L
    em_body <- paste(raw_text[body_start:em_length],
                     sep = "", collapse = " ")
    if (i == 1L) {
      ret_Corpus <- VCorpus(VectorSource(em_body))
    } else {
      tmp_Corpus <- VCorpus(VectorSource(em_body))
      ret_Corpus <- c(ret_Corpus, tmp_Corpus)
    }
  }
  meta(ret_Corpus, tag = "class", type = "indexed") <- class
  return(ret_Corpus)
}

h_corp_raw <- BuildCorpus(h_files, "ham")
s_corp_raw <- BuildCorpus(s_files, "spam")

Cleaning the Corpus

We used many of the default cleaning tools in the tm package to perform standard adjustments like lower-casing, removing numbers, etc. We made two non-native adjustments. First, we stripped out anything that looked like a URL; this needed to be done prior to removing punctuation, of course. We also added a few words to the removal list which we think have little predictive power due to their overuse. We considered removing all punctuation, but decided to preserve intra-word contractions and intra-word dashes.

Lastly, we used the SnowballC package to stem the document. This process tries to identify common roots shared by similar words and then treat them as one. For example:

wordStem(c('run', 'running', 'ran', 'runt'), language = 'porter')
## [1] "run"  "run"  "ran"  "runt"

The complete cleaning rules are in the CleanCorpus function.

# https://stackoverflow.com/questions/47410866/r-inspect-document-term-matrix-results-in-error-repeated-indices-currently-not
CleanCorpus <- function(corpus){
  overused_words <- c("ok", 'okay', 'day', "might", "bye", "hello", "hi",
                      "dear", "thank", "you", "please", "sorry")
  StripURL <- function(x) {gsub("(http[^ ]*)|(www\\.[^ ]*)", "", x)}
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, content_transformer(StripURL))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation,
                   preserve_intra_word_contractions = TRUE,
                   preserve_intra_word_dashes = TRUE)
  corpus <- tm_map(corpus, removeWords, c(stopwords("english"), overused_words))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}

Removing Very Sparse Terms

Even with a cleaned corpus, the overwhelming majority of the terms are rare. There are two ways to address sparsity of terms in the tm package. The first is to generate a list of words that appear at least \(k\) times in the corpus. This is done using the findFreqTerms command. Then the document-term matrix (DTM) can be built using only those words.

The second way is to build the DTM with all words and then remove the words that do not appear in at least \(p\%\) of documents. This is done using the removeSparseTerms function in tm. Both methods make manual inspection of more than one row of the matrix difficult: the matrix is stored sparsely as a triplet, and once terms are removed, R can no longer print it properly.
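
For concreteness, here is a minimal sketch of the two approaches on another made-up corpus; the object names are illustrative and not part of the pipeline.

# Toy sketch of both approaches to sparsity (illustrative only).
library(tm)
toy <- VCorpus(VectorSource(c("spam spam offer", "meeting notes", "offer meeting")))
dtm_all <- DocumentTermMatrix(toy)

# Build-up: keep only terms appearing at least k = 2 times in total, then rebuild the DTM.
keep <- findFreqTerms(dtm_all, lowfreq = 2)
dtm_built_up <- DocumentTermMatrix(toy, control = list(dictionary = keep))

# Tear-down: drop terms absent from more than 50% of the documents.
dtm_dense <- removeSparseTerms(dtm_all, sparse = 0.5)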

The removeSparseTerms approach is intuitively more appealing, as it measures frequency by document rather than across documents. However, applying it to three separate corpora would result in the validation and testing sets not having the same words as the training set. Therefore, the build-up method will be used: a dictionary is constructed from the terms that appear in a minimum number of documents, and that dictionary is then applied when building each of the three DTMs.

However, before we do that, we need to discuss…

Training, Validation, and Testing

Hastie & Tibshirani, in their seminal work ESL, suggest breaking ones data into three parts: 50% training, 25% validation, and 25% testing. Confusingly, some literature uses “test” for the validation set and “holdout” for the test set. Regardless, the idea is that you train your model on 50% of the data, and use 25% of the data (the validation set) to refine any hyper-parameters of the model. You do this for each model, and then once all the models are tuned as best possible, they are compared with each other by their performance on the heretofore unused testing/holdout set. The SplitSample function was used to split the data at the start.

SplitSample <- function(n) {
  if (n %% 4 == 0) {
    n_split <- sample(c(rep("train", n / 2),
                        rep("validate", n / 4),
                        rep("test", n / 4)))
  } else {
    n_split <- sample(x = c("train", "validate", "test"), size = n,
                      replace = TRUE, prob = c(0.5, 0.25, 0.25))
  }
  return(n_split)
}
h_split <- SplitSample(h_len)
s_split <- SplitSample(s_len)

Building the Term List

As both training and validation are part of the model construction, we feel that the term list can be built from the combination of the two. The terms in the testing/holdout set will not be seen prior to testing. We will restrict the word list to words that appear in at least 100 of the combined 2922 documents.

raw_train <- c(h_corp_raw[h_split == "train"],
               s_corp_raw[s_split == "train"])
raw_val <- c(h_corp_raw[h_split == "validate"],
             s_corp_raw[s_split == "validate"])
raw_test <- c(h_corp_raw[h_split == "test"],
              s_corp_raw[s_split == "train"])
raw_term_corp <- c(raw_train, raw_val)
clean_term_corp <- CleanCorpus(raw_term_corp)
dtm_terms <- DocumentTermMatrix(clean_term_corp, control = list(
  bounds = list(global = c(100L, Inf))))
freq_terms <- Terms(dtm_terms)

Here are the top 20 stemmed terms out of the 542 terms we will use in the dictionary:

ft <- colSums(as.matrix(dtm_terms))
ft_df <- data.frame(term = names(ft), count = as.integer(ft))
knitr::kable(head(ft_df[order(ft, decreasing = TRUE), ], n = 20L),
             row.names = FALSE)
term        count
---------   -----
size         3578
font         2902
will         2872
widthd       2474
use          2402
can          2393
tabl         2359
width        2275
email        2262
get          2101
list         2070
one          1932
helvetica    1853
mail         1826
time         1640
just         1597
free         1538
href         1529
new          1480
div          1407

Here is a histogram of word frequency using the Freedman-Diaconis rule for binwidth.

bw_fd <- 2 * IQR(ft_df$count) / (dim(ft_df)[[1]]) ^ (1/3)
ggplot(ft_df, aes(x = count)) + geom_histogram(binwidth = bw_fd) + xlab("Term count")

Finally, a wordcloud of the stemmed terms appearing at least 250 times:

wordcloud(ft_df$term,ft_df$count, scale = c(3, 0.6), min.freq = 250L,
          colors = brewer.pal(5, "Dark2"), random.color = TRUE,
          random.order = TRUE, rot.per = 0, fixed.asp = FALSE)

Building the Training Set

# sample is to randomize the observations
clean_train <- sample(CleanCorpus(raw_train))
clean_train_type <- unlist(meta(clean_train, tag = "class"))
attributes(clean_train_type) <- NULL
dtm_train <- DocumentTermMatrix(clean_train,
                                control = list(dictionary = freq_terms))
dtm_train
## <<DocumentTermMatrix (documents: 1948, terms: 542)>>
## Non-/sparse entries: 76647/979169
## Sparsity           : 93%
## Maximal term length: 41
## Weighting          : term frequency (tf)

Compare the above with the sparsity of the cleaned training corpus without the limiting dictionary:

dtm_train_S <- DocumentTermMatrix(clean_train)
dtm_train_S
## <<DocumentTermMatrix (documents: 1948, terms: 38356)>>
## Non-/sparse entries: 187674/74529814
## Sparsity           : 100%
## Maximal term length: 868
## Weighting          : term frequency (tf)

Building the Validation Set

clean_val <- sample(CleanCorpus(raw_val))
clean_val_type <- unlist(meta(clean_val, tag = "class"))
attributes(clean_val_type) <- NULL
dtm_val <- DocumentTermMatrix(clean_val,
                              control = list(dictionary = freq_terms))

Building the Testing Set

clean_test <- sample(CleanCorpus(raw_test))
clean_test_type <- unlist(meta(clean_test, tag = "class"))
attributes(clean_test_type) <- NULL
dtm_test <- DocumentTermMatrix(clean_test,
                              control = list(dictionary = freq_terms))

Last step

The caret package requires its input to be a numeric matrix. As the DTM is a special form of sparse matrix, we need to convert it to something caret understands. The response vector must be a factor for classification, which is why all three clean_x_type vectors are converted to factors below.

train_m <- as.matrix(dtm_train)
clean_train_type <- factor(clean_train_type, levels = c("spam", "ham"))
val_m <- as.matrix(dtm_val)
clean_val_type <- factor(clean_val_type, levels = c("spam", "ham"))
test_m <- as.matrix(dtm_test)
clean_test_type <- factor(clean_test_type, levels = c("spam", "ham"))

Train Models

Overview

Now we can train the models. The process will generally follow this path:

  1. Select a model family (logistic regression, random forest, etc.)
  2. Use the caret package on the training set to pick “best” model given the supplied control, pre-processing, or other [hyper-]parameters. This may include some level of validation
  3. Switch the hyper-parameters, train again, and compare using validation set
  4. Select “best” model from family
  5. Repeat with other families
  6. Compare performance of final selections using testing/holdout set
  7. Take a well-deserved vacation

As the caret package serves as an umbrella for over 230 model types living in different packages, we may select a less-sophisticated version of a family if it reduces code complexity and migraine propensity. Forgive us as well if we don’t explain every family and every selection. The model matrices created above will be passed to caret.

Experimentation was done with many of the tuning parameters. However, most increases in accuracy came at an inordinate expense of time. Therefore, for the purposes of this exercise, many of the more advantageous options will be limited. For example, cross-validation will be limited to single-pass ten-fold. In production, one should be more rigorous, of course.
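
For instance, a more thorough, and much slower, resampling scheme would be repeated cross-validation; something along these lines (not used in this report) could replace the single-pass control objects defined below.

# Example of a more rigorous resampling scheme (not used here): 5x repeated 10-fold CV.
tr_ctrl_slow <- trainControl(method = "repeatedcv", number = 10L, repeats = 5L,
                             classProbs = TRUE, summaryFunction = prSummary)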

Optimization Metric

Usually AUC, the area under the ROC curve, is used for classification problems. However, for imbalanced data sets it is often suggested to use one of precision, recall, or F1 instead; this recommendation appears throughout the applied literature on imbalanced classification.

In our case, the data set is imbalanced, and the cost of a false positive (classifying ham as spam) is greater than a false negative. Originally, we selected precision as the metric, as hitting the “junk” button for something in your inbox is less annoying than having your boss’s email sit in your junk folder.

However, as we trained models, we found some fascinating results. In one of the random forest models, the algorithm found a better model with one less false positive, at the expense of 61 more false negatives. Therefore, we decided to redo the tests using the balanced F1 as the optimization metric.
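
As a quick reminder of the arithmetic behind these metrics, here they are computed directly from confusion-matrix counts; the counts below are taken from the logistic-regression validation results shown later.

# Precision, recall, and F1 from confusion-matrix counts ("spam" is the positive class).
tp <- 312   # spam correctly flagged as spam
fp <- 50    # ham incorrectly flagged as spam (the costly error)
fn <- 37    # spam that slipped through as ham
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, F1 = f1), 4)
# precision 0.8619, recall 0.8940, F1 0.8776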

Logistic Regression

This is the classic good-old logistic regression in R. There are no hyper/tuning parameters, so the only comparison is between methods of cross-validation.

# 10-fold CV
tr_ctrl <- trainControl(method = "cv", number = 10L, classProbs = TRUE,
                        summaryFunction = prSummary)
LogR1 <- train(x = train_m, y = clean_train_type, method = "glm",
              family = "binomial", trControl = tr_ctrl, metric = "F")
LogR1
## Generalized Linear Model 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1753, 1753, 1753, 1753, 1753, 1753, ... 
## Resampling results:
## 
##   AUC        Precision  Recall     F       
##   0.3178984  0.8351154  0.8652588  0.849491
LogR1v <- predict(LogR1, val_m)
confusionMatrix(LogR1v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  312  50
##       ham    37 575
##                                          
##                Accuracy : 0.9107         
##                  95% CI : (0.891, 0.9278)
##     No Information Rate : 0.6417         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8073         
##                                          
##  Mcnemar's Test P-Value : 0.1983         
##                                          
##               Precision : 0.8619         
##                  Recall : 0.8940         
##                      F1 : 0.8776         
##              Prevalence : 0.3583         
##          Detection Rate : 0.3203         
##    Detection Prevalence : 0.3717         
##       Balanced Accuracy : 0.9070         
##                                          
##        'Positive' Class : spam           
## 
# Monte-Carlo cross-validation using 75/25 splits and 10 iterations
tr_ctrl <- trainControl(method = "LGOCV", number = 10L, p = 0.75,
                        classProbs = TRUE, summaryFunction = prSummary)
LogR2 <- train(x = train_m, y = clean_train_type, method = "glm",
              family = "binomial", trControl = tr_ctrl, metric = "F")
LogR2
## Generalized Linear Model 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (10 reps, 75%) 
## Summary of sample sizes: 1462, 1462, 1462, 1462, 1462, 1462, ... 
## Resampling results:
## 
##   AUC        Precision  Recall     F        
##   0.3573015  0.817922   0.8465517  0.8313286
LogR2v <- predict(LogR2, val_m)
confusionMatrix(LogR2v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  312  50
##       ham    37 575
##                                          
##                Accuracy : 0.9107         
##                  95% CI : (0.891, 0.9278)
##     No Information Rate : 0.6417         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8073         
##                                          
##  Mcnemar's Test P-Value : 0.1983         
##                                          
##               Precision : 0.8619         
##                  Recall : 0.8940         
##                      F1 : 0.8776         
##              Prevalence : 0.3583         
##          Detection Rate : 0.3203         
##    Detection Prevalence : 0.3717         
##       Balanced Accuracy : 0.9070         
##                                          
##        'Positive' Class : spam           
## 

Both versions performed the same on the validation set. As the first has a slightly better F-score, we will select that one.

Random Forest

The ranger package is used as the random forest engine due to its being optimized for higher dimensions.

tr_ctrl <- trainControl(method = "cv", number = 10L, classProbs = TRUE,
                        summaryFunction = prSummary)
RF1 <- train(x = train_m, y = clean_train_type, method = 'ranger',
             trControl = tr_ctrl, metric = "F", tuneLength = 5L)
RF1
## Random Forest 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1753, 1753, 1753, 1753, 1753, 1754, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   AUC        Precision  Recall     F        
##     2   gini        0.9795280  0.9889771  0.8897516  0.9364257
##     2   extratrees  0.9780115  0.9934074  0.8595859  0.9213395
##     8   gini        0.9801904  0.9797804  0.9570393  0.9680785
##     8   extratrees  0.9800284  0.9825914  0.9484472  0.9649337
##    32   gini        0.9555496  0.9780804  0.9569565  0.9672975
##    32   extratrees  0.9662968  0.9727490  0.9598551  0.9660542
##   133   gini        0.8940063  0.9479018  0.9569565  0.9521676
##   133   extratrees  0.8983431  0.9617815  0.9627329  0.9620140
##   541   gini        0.7892907  0.9192789  0.9513043  0.9345954
##   541   extratrees  0.8915232  0.9398438  0.9555901  0.9473738
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## F was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 8, splitrule = gini
##  and min.node.size = 1.
RF1v <- predict(RF1, val_m)
confusionMatrix(RF1v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  338   4
##       ham    11 621
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9747, 0.9914)
##     No Information Rate : 0.6417          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9664          
##                                           
##  Mcnemar's Test P-Value : 0.1213          
##                                           
##               Precision : 0.9883          
##                  Recall : 0.9685          
##                      F1 : 0.9783          
##              Prevalence : 0.3583          
##          Detection Rate : 0.3470          
##    Detection Prevalence : 0.3511          
##       Balanced Accuracy : 0.9810          
##                                           
##        'Positive' Class : spam            
## 

Let’s do a bit wider search among tuning parameters.

rf_grid <- expand.grid(mtry = seq(8, 48, 4),
                       splitrule = c('gini', 'extratrees'),
                       min.node.size = c(1L, 10L))
RF2 <- train(x = train_m, y = clean_train_type, method = 'ranger',
             trControl = tr_ctrl, metric = "F", tuneGrid = rf_grid)
RF2
## Random Forest 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1753, 1753, 1753, 1754, 1753, 1753, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   min.node.size  AUC        Precision  Recall     F        
##    8    gini         1             0.9798741  0.9840928  0.9526294  0.9677003
##    8    gini        10             0.9795270  0.9824006  0.9512215  0.9662812
##    8    extratrees   1             0.9795727  0.9837457  0.9440580  0.9631549
##    8    extratrees  10             0.9787052  0.9794574  0.9454865  0.9618236
##   12    gini         1             0.9785521  0.9824863  0.9526708  0.9670886
##   12    gini        10             0.9786194  0.9795419  0.9555280  0.9671638
##   12    extratrees   1             0.9797120  0.9794532  0.9497723  0.9641250
##   12    extratrees  10             0.9788159  0.9809273  0.9469358  0.9633318
##   16    gini         1             0.9765000  0.9795473  0.9540994  0.9664470
##   16    gini        10             0.9786215  0.9825510  0.9569565  0.9693207
##   16    extratrees   1             0.9794289  0.9783860  0.9555280  0.9664933
##   16    extratrees  10             0.9785257  0.9796119  0.9541201  0.9664684
##   20    gini         1             0.9732387  0.9795071  0.9540994  0.9664580
##   20    gini        10             0.9784366  0.9795920  0.9569772  0.9679398
##   20    extratrees   1             0.9761725  0.9739802  0.9526915  0.9629547
##   20    extratrees  10             0.9782321  0.9738838  0.9555280  0.9643596
##   24    gini         1             0.9617282  0.9810152  0.9540994  0.9671329
##   24    gini        10             0.9765661  0.9809719  0.9540994  0.9671223
##   24    extratrees   1             0.9747931  0.9769215  0.9555487  0.9658744
##   24    extratrees  10             0.9782076  0.9753190  0.9526708  0.9635991
##   28    gini         1             0.9670348  0.9795454  0.9540994  0.9664146
##   28    gini        10             0.9762022  0.9766118  0.9526708  0.9643225
##   28    extratrees   1             0.9749260  0.9740871  0.9584058  0.9659367
##   28    extratrees  10             0.9767339  0.9754316  0.9541201  0.9644253
##   32    gini         1             0.9555198  0.9752862  0.9569565  0.9658230
##   32    gini        10             0.9758785  0.9766344  0.9526501  0.9642316
##   32    extratrees   1             0.9690636  0.9740702  0.9612629  0.9674091
##   32    extratrees  10             0.9778254  0.9754535  0.9584058  0.9666249
##   36    gini         1             0.9581448  0.9781013  0.9555280  0.9664908
##   36    gini        10             0.9755307  0.9751570  0.9526501  0.9635597
##   36    extratrees   1             0.9644583  0.9742062  0.9612629  0.9674162
##   36    extratrees  10             0.9747970  0.9739576  0.9541201  0.9636843
##   40    gini         1             0.9578235  0.9767483  0.9555280  0.9657892
##   40    gini        10             0.9772231  0.9767869  0.9555280  0.9657556
##   40    extratrees   1             0.9647315  0.9741283  0.9612629  0.9674187
##   40    extratrees  10             0.9775023  0.9711491  0.9555487  0.9630607
##   44    gini         1             0.9592389  0.9766508  0.9555280  0.9657711
##   44    gini        10             0.9768456  0.9752405  0.9540994  0.9643209
##   44    extratrees   1             0.9585495  0.9697864  0.9598344  0.9645734
##   44    extratrees  10             0.9761985  0.9710527  0.9555280  0.9630053
##   48    gini         1             0.9559455  0.9710025  0.9512215  0.9607170
##   48    gini        10             0.9750970  0.9722507  0.9483644  0.9599066
##   48    extratrees   1             0.9642692  0.9698828  0.9598344  0.9645819
##   48    extratrees  10             0.9777043  0.9741031  0.9584058  0.9659152
## 
## F was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 16, splitrule = gini
##  and min.node.size = 10.
RF2v <- predict(RF2, val_m)
confusionMatrix(RF2v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  337   5
##       ham    12 620
##                                           
##                Accuracy : 0.9825          
##                  95% CI : (0.9722, 0.9898)
##     No Information Rate : 0.6417          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9619          
##                                           
##  Mcnemar's Test P-Value : 0.1456          
##                                           
##               Precision : 0.9854          
##                  Recall : 0.9656          
##                      F1 : 0.9754          
##              Prevalence : 0.3583          
##          Detection Rate : 0.3460          
##    Detection Prevalence : 0.3511          
##       Balanced Accuracy : 0.9788          
##                                           
##        'Positive' Class : spam            
## 

Interestingly, the first model performed better on the validation set despite performing more poorly on the training set. Possibly an example of overfitting.

Naive Bayes

tr_ctrl <- trainControl(method = "cv", number = 10L, classProbs = TRUE,
                        summaryFunction = prSummary)
NB1 <- train(x = train_m, y = clean_train_type, method = "nb",
             trControl = tr_ctrl, metric = "F")
NB1
## Naive Bayes 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1753, 1753, 1753, 1754, 1753, 1753, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  AUC        Precision  Recall      F         
##   FALSE            NaN  NaN               NaN         NaN
##    TRUE      0.7850906    1        0.01142857  0.03729786
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning parameter 'adjust' was held constant at a value of 1
## F was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
##  and adjust = 1.
NB1v <- predict(NB1, val_m)
confusionMatrix(NB1v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam   11   1
##       ham   338 624
##                                           
##                Accuracy : 0.652           
##                  95% CI : (0.6211, 0.6819)
##     No Information Rate : 0.6417          
##     P-Value [Acc > NIR] : 0.2634          
##                                           
##                   Kappa : 0.038           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##               Precision : 0.91667         
##                  Recall : 0.03152         
##                      F1 : 0.06094         
##              Prevalence : 0.35832         
##          Detection Rate : 0.01129         
##    Detection Prevalence : 0.01232         
##       Balanced Accuracy : 0.51496         
##                                           
##        'Positive' Class : spam            
## 

This model performs awfully. Naive Bayes is known to be very sensitive to class imbalances. Let’s implement up-sampling and a wider search.

tr_ctrl <- trainControl(method = "cv", number = 10L, classProbs = TRUE,
                        summaryFunction = prSummary, sampling = 'up')
nb_grid <- expand.grid(usekernel = TRUE,
                       fL = seq(0.25, 0.75, 0.05),
                       adjust = 1)
NB2 <- train(x = train_m, y = clean_train_type, method = "nb",
             trControl = tr_ctrl, metric = "F", tuneGrid = nb_grid)
NB2
## Naive Bayes 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1754, 1753, 1753, 1753, 1753, 1753, ... 
## Addtional sampling using up-sampling
## 
## Resampling results across tuning parameters:
## 
##   fL    AUC        Precision  Recall      F         
##   0.25  0.7681896  1.0000000  0.04873706  0.11355084
##   0.30  0.7693804  0.8750000  0.04585921  0.12139227
##   0.35  0.7755739  0.8750000  0.04298137  0.11448366
##   0.40  0.7738335  0.8750000  0.04300207  0.11374212
##   0.45  0.7743898  1.0000000  0.04726708  0.12550687
##   0.50  0.7729407  0.8888889  0.04443064  0.10360982
##   0.55  0.7736464  1.0000000  0.04871636  0.11373861
##   0.60  0.7794932  0.8750000  0.03298137  0.08972103
##   0.65  0.7712956  0.8888889  0.04300207  0.10050668
##   0.70  0.7752014  1.0000000  0.04443064  0.10391559
##   0.75  0.7761318  0.8888889  0.03728778  0.08846916
## 
## Tuning parameter 'usekernel' was held constant at a value of TRUE
## 
## Tuning parameter 'adjust' was held constant at a value of 1
## F was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0.45, usekernel = TRUE
##  and adjust = 1.
NB2v <- predict(NB2, val_m)
confusionMatrix(NB2v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam   22   1
##       ham   327 624
##                                           
##                Accuracy : 0.6632          
##                  95% CI : (0.6326, 0.6929)
##     No Information Rate : 0.6417          
##     P-Value [Acc > NIR] : 0.08491         
##                                           
##                   Kappa : 0.0774          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##               Precision : 0.95652         
##                  Recall : 0.06304         
##                      F1 : 0.11828         
##              Prevalence : 0.35832         
##          Detection Rate : 0.02259         
##    Detection Prevalence : 0.02361         
##       Balanced Accuracy : 0.53072         
##                                           
##        'Positive' Class : spam            
## 

Results are still miserable; perhaps naive Bayes simply is not being applied well here.
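
One commonly suggested adjustment for naive Bayes on text, which we did not pursue here, is to binarize the counts into presence/absence features (a Bernoulli-style model). A sketch of what that would look like with the objects already defined (not evaluated for this report):

# Sketch only: presence/absence features for naive Bayes (not evaluated here).
train_bin <- (as.matrix(dtm_train) > 0) * 1   # 1 if the term appears at all, else 0
val_bin <- (as.matrix(dtm_val) > 0) * 1
NB_bin <- train(x = train_bin, y = clean_train_type, method = "nb",
                trControl = tr_ctrl, metric = "F")
confusionMatrix(predict(NB_bin, val_bin), clean_val_type,
                mode = "prec_recall", positive = "spam")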

Neural Network

tr_ctrl <- trainControl(method = "cv", number = 10L, classProbs = TRUE,
                        summaryFunction = prSummary)
NN1 <- train(x = train_m, y = clean_train_type, method = "nnet", trace = FALSE,
             trControl = tr_ctrl, metric = "F", tuneLength = 5L, maxit = 250L)
NN1
## Neural Network 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1753, 1753, 1753, 1754, 1753, 1753, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  AUC        Precision  Recall     F        
##   1     0e+00  0.1404975  0.9637886  0.9699379  0.9665824
##   1     1e-04  0.2989530  0.9485876  0.9642443  0.9555692
##   1     1e-03  0.5008390  0.9570043  0.9742443  0.9654024
##   1     1e-02  0.6164534  0.9729661  0.9684679  0.9705386
##   1     1e-01  0.6630303  0.9731230  0.9699586  0.9713496
##   3     0e+00        NaN        NaN        NaN        NaN
##   3     1e-04        NaN        NaN        NaN        NaN
##   3     1e-03        NaN        NaN        NaN        NaN
##   3     1e-02        NaN        NaN        NaN        NaN
##   3     1e-01        NaN        NaN        NaN        NaN
##   5     0e+00        NaN        NaN        NaN        NaN
##   5     1e-04        NaN        NaN        NaN        NaN
##   5     1e-03        NaN        NaN        NaN        NaN
##   5     1e-02        NaN        NaN        NaN        NaN
##   5     1e-01        NaN        NaN        NaN        NaN
##   7     0e+00        NaN        NaN        NaN        NaN
##   7     1e-04        NaN        NaN        NaN        NaN
##   7     1e-03        NaN        NaN        NaN        NaN
##   7     1e-02        NaN        NaN        NaN        NaN
##   7     1e-01        NaN        NaN        NaN        NaN
##   9     0e+00        NaN        NaN        NaN        NaN
##   9     1e-04        NaN        NaN        NaN        NaN
##   9     1e-03        NaN        NaN        NaN        NaN
##   9     1e-02        NaN        NaN        NaN        NaN
##   9     1e-01        NaN        NaN        NaN        NaN
## 
## F was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.
NN1v <- predict(NN1, val_m)
confusionMatrix(NN1v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  338   7
##       ham    11 618
##                                          
##                Accuracy : 0.9815         
##                  95% CI : (0.9709, 0.989)
##     No Information Rate : 0.6417         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9597         
##                                          
##  Mcnemar's Test P-Value : 0.4795         
##                                          
##               Precision : 0.9797         
##                  Recall : 0.9685         
##                      F1 : 0.9741         
##              Prevalence : 0.3583         
##          Detection Rate : 0.3470         
##    Detection Prevalence : 0.3542         
##       Balanced Accuracy : 0.9786         
##                                          
##        'Positive' Class : spam           
## 

Some light tuning:

nn_grid <- expand.grid(size = 1L, decay = c(0.99, seq(0.95, 0.05, -0.05), 0.01))
NN2 <- train(x = train_m, y = clean_train_type, method = "nnet", trace = FALSE,
             trControl = tr_ctrl, metric = "F", tuneGrid = nn_grid,
             maxit = 250L)
NN2
## Neural Network 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1753, 1754, 1753, 1754, 1753, 1753, ... 
## Resampling results across tuning parameters:
## 
##   decay  AUC        Precision  Recall     F        
##   0.01   0.6037821  0.9771174  0.9628157  0.9695605
##   0.05   0.7361770  0.9744610  0.9771222  0.9756923
##   0.10   0.6950868  0.9642958  0.9756936  0.9696577
##   0.15   0.7396476  0.9745013  0.9771222  0.9757125
##   0.20   0.7058696  0.9722723  0.9799793  0.9759194
##   0.25   0.7255797  0.9771951  0.9756936  0.9763400
##   0.30   0.7564494  0.9772549  0.9756936  0.9763496
##   0.35   0.7549194  0.9772963  0.9771222  0.9770896
##   0.40   0.7606259  0.9786645  0.9771222  0.9777785
##   0.45   0.7591476  0.9786828  0.9756936  0.9770481
##   0.50   0.7679155  0.9772743  0.9756936  0.9763385
##   0.55   0.7648583  0.9772743  0.9756936  0.9763385
##   0.60   0.7650416  0.9772116  0.9742650  0.9755881
##   0.65   0.7706187  0.9799911  0.9742650  0.9769860
##   0.70   0.7720380  0.9785826  0.9728157  0.9755574
##   0.75   0.7739467  0.9798496  0.9728157  0.9762260
##   0.80   0.7750922  0.9812770  0.9713872  0.9761839
##   0.85   0.7738389  0.9799088  0.9713872  0.9754950
##   0.90   0.7841840  0.9826641  0.9713872  0.9768825
##   0.95   0.7807338  0.9798466  0.9713872  0.9754845
##   0.99   0.7809652  0.9812149  0.9713872  0.9761735
## 
## Tuning parameter 'size' was held constant at a value of 1
## F was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.4.
NN2v <- predict(NN2, val_m)
confusionMatrix(NN2v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  338   7
##       ham    11 618
##                                          
##                Accuracy : 0.9815         
##                  95% CI : (0.9709, 0.989)
##     No Information Rate : 0.6417         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9597         
##                                          
##  Mcnemar's Test P-Value : 0.4795         
##                                          
##               Precision : 0.9797         
##                  Recall : 0.9685         
##                      F1 : 0.9741         
##              Prevalence : 0.3583         
##          Detection Rate : 0.3470         
##    Detection Prevalence : 0.3542         
##       Balanced Accuracy : 0.9786         
##                                          
##        'Positive' Class : spam           
## 

Both models performed the same on the validation set. As the second performed better on the training set, we will use it.

Gradient Boosted Machines

tr_ctrl <- trainControl(method = "cv", number = 10L, classProbs = TRUE,
                        summaryFunction = prSummary)
GBM1 <- train(x = train_m, y = clean_train_type, method = "gbm", verbose = FALSE,
              trControl = tr_ctrl, tuneLength = 5L, metric = "F")
GBM1v <- predict(GBM1, val_m)
confusionMatrix(GBM1v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  339   8
##       ham    10 617
##                                          
##                Accuracy : 0.9815         
##                  95% CI : (0.9709, 0.989)
##     No Information Rate : 0.6417         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9598         
##                                          
##  Mcnemar's Test P-Value : 0.8137         
##                                          
##               Precision : 0.9769         
##                  Recall : 0.9713         
##                      F1 : 0.9741         
##              Prevalence : 0.3583         
##          Detection Rate : 0.3480         
##    Detection Prevalence : 0.3563         
##       Balanced Accuracy : 0.9793         
##                                          
##        'Positive' Class : spam           
## 

This model looks really good. Let’s throw in a little extra fine-tuning. After running a wide-scale grid, only the best option is specified below, so that the entire grid does not have to be rerun every time.
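
For reference, a wide-scale search of the sort just described might look like the grid below; this is a hypothetical reconstruction, not the exact grid that was run.

# Hypothetical wide grid of the kind described above (not the exact grid used).
gbm_wide_grid <- expand.grid(n.trees = seq(100L, 500L, by = 100L),
                             interaction.depth = c(1L, 3L, 5L, 7L),
                             shrinkage = c(0.01, 0.1),
                             n.minobsinnode = c(5L, 10L))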

gbm_grid <- expand.grid(n.trees = 400L,
                        interaction.depth = 7L,
                        shrinkage = 0.1,
                        n.minobsinnode = 10L)
GBM2 <- train(x = train_m, y = clean_train_type, method = "gbm", verbose = FALSE,
              trControl = tr_ctrl, tuneGrid = gbm_grid, metric = "F")
GBM2
## Stochastic Gradient Boosting 
## 
## 1948 samples
##  542 predictor
##    2 classes: 'spam', 'ham' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1753, 1753, 1754, 1754, 1753, 1753, ... 
## Resampling results:
## 
##   AUC        Precision  Recall     F        
##   0.9824148  0.9700152  0.9655072  0.9675497
## 
## Tuning parameter 'n.trees' was held constant at a value of 400
## Tuning parameter 'interaction.depth' was held constant at a value of 7
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
GBM2v <- predict(GBM2, val_m)
confusionMatrix(GBM2v, clean_val_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  341   7
##       ham     8 618
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9747, 0.9914)
##     No Information Rate : 0.6417          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9665          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##               Precision : 0.9799          
##                  Recall : 0.9771          
##                      F1 : 0.9785          
##              Prevalence : 0.3583          
##          Detection Rate : 0.3501          
##    Detection Prevalence : 0.3573          
##       Balanced Accuracy : 0.9829          
##                                           
##        'Positive' Class : spam            
## 

The second model performed better.

Other models

With over 230 possible models, there are many more options to train, such as XGBoost, Bayesian regression, support vector machines, etc. We don’t need to exhaust the possibilities here.
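
Swapping in another family usually amounts to changing the method argument (and installing its backing package). For example, a linear support vector machine via the kernlab package could be tried as follows; this is a sketch only, and its results are not reported here.

# Sketch only: a linear SVM through caret (requires the kernlab package); not run for this report.
SVM1 <- train(x = train_m, y = clean_train_type, method = "svmLinear",
              trControl = tr_ctrl, metric = "F", tuneLength = 5L)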

Test Models

The best models in the above categories will now be compared against the testing/holdout set:

LogRt <- predict(LogR1, test_m)
RFt <- predict(RF1, test_m)
NNt <- predict(NN2, test_m)
NBt <- predict(NB2, test_m) # For laughs
GBMt <- predict(GBM2, test_m)
confusionMatrix(LogRt, clean_test_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  698  57
##       ham     0 568
##                                           
##                Accuracy : 0.9569          
##                  95% CI : (0.9445, 0.9672)
##     No Information Rate : 0.5276          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9132          
##                                           
##  Mcnemar's Test P-Value : 1.195e-13       
##                                           
##               Precision : 0.9245          
##                  Recall : 1.0000          
##                      F1 : 0.9608          
##              Prevalence : 0.5276          
##          Detection Rate : 0.5276          
##    Detection Prevalence : 0.5707          
##       Balanced Accuracy : 0.9544          
##                                           
##        'Positive' Class : spam            
## 
confusionMatrix(RFt, clean_test_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  694   2
##       ham     4 623
##                                           
##                Accuracy : 0.9955          
##                  95% CI : (0.9902, 0.9983)
##     No Information Rate : 0.5276          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9909          
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##               Precision : 0.9971          
##                  Recall : 0.9943          
##                      F1 : 0.9957          
##              Prevalence : 0.5276          
##          Detection Rate : 0.5246          
##    Detection Prevalence : 0.5261          
##       Balanced Accuracy : 0.9955          
##                                           
##        'Positive' Class : spam            
## 
confusionMatrix(NNt, clean_test_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  696   8
##       ham     2 617
##                                           
##                Accuracy : 0.9924          
##                  95% CI : (0.9861, 0.9964)
##     No Information Rate : 0.5276          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9848          
##                                           
##  Mcnemar's Test P-Value : 0.1138          
##                                           
##               Precision : 0.9886          
##                  Recall : 0.9971          
##                      F1 : 0.9929          
##              Prevalence : 0.5276          
##          Detection Rate : 0.5261          
##    Detection Prevalence : 0.5321          
##       Balanced Accuracy : 0.9922          
##                                           
##        'Positive' Class : spam            
## 
confusionMatrix(NBt, clean_test_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam   46   0
##       ham   652 625
##                                           
##                Accuracy : 0.5072          
##                  95% CI : (0.4799, 0.5345)
##     No Information Rate : 0.5276          
##     P-Value [Acc > NIR] : 0.935           
##                                           
##                   Kappa : 0.0625          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##               Precision : 1.00000         
##                  Recall : 0.06590         
##                      F1 : 0.12366         
##              Prevalence : 0.52759         
##          Detection Rate : 0.03477         
##    Detection Prevalence : 0.03477         
##       Balanced Accuracy : 0.53295         
##                                           
##        'Positive' Class : spam            
## 
confusionMatrix(GBMt, clean_test_type, mode = "prec_recall", positive = "spam")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction spam ham
##       spam  696   7
##       ham     2 618
##                                           
##                Accuracy : 0.9932          
##                  95% CI : (0.9871, 0.9969)
##     No Information Rate : 0.5276          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9863          
##                                           
##  Mcnemar's Test P-Value : 0.1824          
##                                           
##               Precision : 0.9900          
##                  Recall : 0.9971          
##                      F1 : 0.9936          
##              Prevalence : 0.5276          
##          Detection Rate : 0.5261          
##    Detection Prevalence : 0.5314          
##       Balanced Accuracy : 0.9930          
##                                           
##        'Positive' Class : spam            
## 

Among these models, the logistic regression had no false negatives (a recall of 1), but it achieved that by coding 57 good emails as spam. The remaining models all did quite well, but the winner is the random forest model, with the highest F-score and the fewest miscategorized emails of any type.

Epilogue

sessionInfo()
## R version 3.6.1 Patched (2019-10-25 r77334)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] wordcloud_2.6      RColorBrewer_1.1-2 caret_6.0-84      
##  [4] ggplot2_3.2.1      lattice_0.20-38    SnowballC_0.6.0   
##  [7] tm_0.7-6           NLP_0.2-0          doParallel_1.0.15 
## [10] iterators_1.0.12   foreach_1.4.7     
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.10          slam_0.1-46       
##  [4] reshape2_1.4.3     purrr_0.3.3        splines_3.6.1     
##  [7] colorspace_1.4-1   generics_0.0.2     stats4_3.6.1      
## [10] htmltools_0.4.0    yaml_2.2.0         survival_2.44-1.1 
## [13] prodlim_2018.04.18 rlang_0.4.1        ModelMetrics_1.2.2
## [16] pillar_1.4.2       glue_1.3.1         withr_2.1.2       
## [19] plyr_1.8.4         lava_1.6.6         stringr_1.4.0     
## [22] timeDate_3043.102  munsell_0.5.0      gtable_0.3.0      
## [25] recipes_0.1.7      codetools_0.2-16   evaluate_0.14     
## [28] labeling_0.3       knitr_1.25         class_7.3-15      
## [31] highr_0.8          Rcpp_1.0.2         scales_1.0.0      
## [34] ipred_0.9-9        digest_0.6.22      stringi_1.4.3     
## [37] dplyr_0.8.3        grid_3.6.1         tools_3.6.1       
## [40] magrittr_1.5       lazyeval_0.2.2     tibble_2.1.3      
## [43] crayon_1.3.4       pkgconfig_2.0.3    MASS_7.3-51.4     
## [46] Matrix_1.2-17      data.table_1.12.6  xml2_1.2.2        
## [49] lubridate_1.7.4    gower_0.2.1        assertthat_0.2.1  
## [52] rmarkdown_1.16     R6_2.4.0           rpart_4.1-15      
## [55] nnet_7.3-12        nlme_3.1-141       compiler_3.6.1
stopCluster(cl)