Introduction

The purpose of this project is to build a classification model that can accurately distinguish spam email messages from ham (legitimate) email messages. We will do this by using pre-classified email messages to build a training set, and then build a predictive model to classify unseen email messages as either spam or ham. In order to build this predictive model, we’ll also rely heavily on several text mining techniques, which are demonstrated below. We’ll begin by loading the necessary libraries.

Loading Libraries

library(readr)
library(tidyverse)
library(tidytext)
library(ggplot2)
library(tm)
library(caret)
library(tidymodels)

Reading in the data and combining

We will take our data from several locations, the first of which is a repository of 6,046 emails that I have downloaded from here as individual files. We will need to read in each file from its location on my computer and then create a data frame with one email per row. We’ll start by getting the file path to each file and storing the paths as a list.

ham_file_path <- "C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Spring 2020/DATA607/Week 13-Classification/SPAMHAM/ham"
spam_file_path <- "C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Spring 2020/DATA607/Week 13-Classification/SPAMHAM/spam"

ham <- list.files(ham_file_path, full.names = TRUE)
spam <- list.files(spam_file_path, full.names = TRUE)

Next, we will build a function that takes in a file path, reads the file, and removes the newline characters. We will use purrr::map() to apply this function to our list of file paths. Using map allows us to take advantage of vectorization instead of writing an explicit loop. Once we have our data frame, we will rename the “value” column to “text” and add an “indicator” variable labeling each row as spam (1) or ham (0); we will run this function twice, once over the set of ham files and again over the set of spam files. Lastly, we’ll combine both data frames together.

convert_line <- function(path) {
  read_file(path) %>% 
    str_replace_all("\\\n+|\\n|_+", "")
}

spam_list <- purrr::map(spam, convert_line)
spam_df <- as_tibble(unlist(spam_list)) %>%
  rename(text = value) %>%
  mutate(indicator = 1)

ham_list <- purrr::map(ham, convert_line)
ham_df <- as_tibble(unlist(ham_list)) %>%
  rename(text = value) %>%
  mutate(indicator = 0)

spam_ham <- rbind(spam_df, ham_df)

Let’s take a look at the first few rows of our data:

head(spam_ham)
## # A tibble: 6 x 2
##   text                                                                 indicator
##   <chr>                                                                    <dbl>
## 1 "From ilug-admin@linux.ie  Tue Aug  6 11:51:02 2002Return-Path: <il~         1
## 2 "From lmrn@mailexcite.com  Mon Jun 24 17:03:24 2002Return-Path: mer~         1
## 3 "From amknight@mailexcite.com  Mon Jun 24 17:03:49 2002Return-Path:~         1
## 4 "From jordan23@mailexcite.com  Mon Jun 24 17:04:20 2002Return-Path:~         1
## 5 "From merchantsworld2001@juno.com  Tue Aug  6 11:01:33 2002Return-P~         1
## 6 "Received: from hq.pro-ns.net (localhost [127.0.0.1])\tby hq.pro-ns~         1

Looking at the rows above, we can see there is going to be a lot of garbage in each of these emails (web addresses, dates, digits). While attempting some data extraction, we found that the structure of the emails varies significantly from message to message, so attempts to extract the “Content” or other individual attributes return many null values. Instead, we’ll work with each email string in its entirety so that we don’t lose potentially informative data.
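As a minimal sketch of the kind of extraction that proved unreliable (the “Subject:” header and the regex here are purely illustrative, not part of the final pipeline):

# Hypothetical attempt: capture up to 50 characters following a "Subject:"
# header. Emails whose headers are missing or formatted differently come
# back NA, which is why we keep the full string instead.
spam_ham %>%
  mutate(subject = str_match(text, "Subject: (.{1,50})")[, 2]) %>%
  summarise(missing_subject = sum(is.na(subject)))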

Let’s see how many spam emails we have and how many ham emails:

spam_ham %>% count(indicator)
## # A tibble: 2 x 2
##   indicator     n
##       <dbl> <int>
## 1         0  4150
## 2         1  1896

It looks like our data set is roughly 31% spam.
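As a quick sanity check on that figure (1,896 spam out of 6,046 emails):

spam_ham %>% 
  count(indicator) %>% 
  mutate(prop = n / sum(n))  # spam (indicator = 1) works out to ~31%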

Total dimensions for our data frame:

dim(spam_ham)
## [1] 6046    2

In an effort to make our model more robust, and so our entire training data set isn’t from a single source, let’s bring in one more data set to use in training and prediction. This data set is from Kaggle and can be downloaded here. I have downloaded the file and placed it in my GitHub, from which it is read below.

test_data <- readr::read_csv("https://raw.githubusercontent.com/christianthieme/MSDS-DATA607/master/SpamHamTestData.csv")

Again, let’s take a look at the first few rows of data:

head(test_data)
## # A tibble: 6 x 5
##      v1 v2                                                     X3    X4    X5   
##   <dbl> <chr>                                                  <chr> <chr> <chr>
## 1     0 "Subject: for vince j kaminski ' s approval  below yo~ <NA>  <NA>  <NA> 
## 2     0 "Subject: continental phone #  1 - 800 - 621 - 7467 o~ <NA>  <NA>  <NA> 
## 3     0 "Subject: exmar purchase decision  fyi  - - - - - - -~ <NA>  <NA>  <NA> 
## 4     0 "Subject: re : next step  al ,  i have spoken with ma~ <NA>  <NA>  <NA> 
## 5     0 "Subject: summer opportunity  vince :  i did get your~ <NA>  <NA>  <NA> 
## 6     0 "Subject: interviewing for the associate and analyst ~ <NA>  <NA>  <NA>

Looking at the first few rows of this data, we can see that we’ll need to rename and reorder some columns as well as get rid of a few blank columns that were read in.

test_data <- test_data %>% 
  rename(indicator = v1, text = v2) %>%
  select(text, indicator)
head(test_data)
## # A tibble: 6 x 2
##   text                                                                 indicator
##   <chr>                                                                    <dbl>
## 1 "Subject: for vince j kaminski ' s approval  below you will find a ~         0
## 2 "Subject: continental phone #  1 - 800 - 621 - 7467 or ? cust . ser~         0
## 3 "Subject: exmar purchase decision  fyi  - - - - - - - - - - - - - -~         0
## 4 "Subject: re : next step  al ,  i have spoken with mark lay and he ~         0
## 5 "Subject: summer opportunity  vince :  i did get your phone message~         0
## 6 "Subject: interviewing for the associate and analyst programs  the ~         0

After our transformations, the dimensions of the data set are:

dim(test_data)
## [1] 5726    2

Our new data set will add 5,726 emails to our 6,046, bringing our total to 11,772.

In this additional data set, let’s see how many spam and ham emails we have:

test_data %>% 
  count(indicator)
## # A tibble: 2 x 2
##   indicator     n
##       <dbl> <int>
## 1         0  4358
## 2         1  1368

Now that we’ve cleaned both data sets, let’s combine them.

spam_ham_final <- rbind(spam_ham, test_data)

Now, because of the way the original data was stored (one folder of ham and a separate folder of spam) and read in, the rows are grouped by class and source rather than randomly ordered. Let’s shuffle the data set so we aren’t introducing this ordering bias into our model. Here we’ll use sample() to shuffle the rows, setting a seed so the shuffle is reproducible.

set.seed(42)

rows <- sample(nrow(spam_ham_final))
spam_ham_final <- spam_ham_final[rows,]

Before building a model, let’s do some analysis on this data to see which text-weighting method (term frequency vs. tf-idf) makes more sense for our model.

Analysis

Now that we’ve combined the data sets, let’s do some analysis. We’ll start by creating a few new feature columns:

  1. text_count: how many words are in the email
  2. char_count: how many alphabetic characters are in the email
  3. digit_count: how many digits/numeric values are in the email
  4. non_count: how many non-alphanumeric characters are in the email
  5. total_char_count: the total number of characters (sum of 2 through 4 above)
  6. char_perc: percentage of alphabetic characters out of total characters
  7. digit_perc: percentage of numeric characters out of total characters
  8. non_per: percentage of non-alphanumeric characters out of total characters
analysis <- spam_ham_final %>% 
  mutate(text_count = map_int(str_split(text, " "), length)) %>%
  mutate(char_count = str_count(text, "[A-Za-z]")) %>%
  mutate(digit_count = str_count(text, "[0-9]")) %>%
  mutate(non_count = str_count(text, "[^[:alnum:]]")) %>%
  mutate(total_char_count = char_count + digit_count + non_count) %>%
  mutate(char_perc = char_count / total_char_count,
         digit_perc = digit_count / total_char_count, 
         non_per = non_count / total_char_count)
head(analysis)
## # A tibble: 6 x 10
##   text  indicator text_count char_count digit_count non_count total_char_count
##   <chr>     <dbl>      <int>      <int>       <int>     <int>            <int>
## 1 "Sub~         0        301        800          53       362             1215
## 2 "Ret~         0       4942      30077        1474     11539            43090
## 3 "Ret~         0        399        942         214       532             1688
## 4 "Sub~         0        366       1031          98       437             1566
## 5 "Fro~         1        578       4936         638      1843             7417
## 6 "Sub~         1       1029       3697          48      1174             4919
## # ... with 3 more variables: char_perc <dbl>, digit_perc <dbl>, non_per <dbl>

Now that we’ve created these columns, let’s see if we can identify any differences between spam and ham.

summary(analysis %>% 
  filter(indicator == 0))
##      text             indicator   text_count      char_count    
##  Length:8508        Min.   :0   Min.   :    2   Min.   :    10  
##  Class :character   1st Qu.:0   1st Qu.:  178   1st Qu.:   661  
##  Mode  :character   Median :0   Median :  323   Median :  1464  
##                     Mean   :0   Mean   :  456   Mean   :  2078  
##                     3rd Qu.:0   3rd Qu.:  490   3rd Qu.:  2353  
##                     Max.   :0   Max.   :36450   Max.   :125449  
##   digit_count        non_count        total_char_count   char_perc     
##  Min.   :    0.0   Min.   :     2.0   Min.   :    12   Min.   :0.1140  
##  1st Qu.:   26.0   1st Qu.:   259.0   1st Qu.:  1031   1st Qu.:0.6348  
##  Median :  130.0   Median :   593.0   Median :  2244   Median :0.6667  
##  Mean   :  207.8   Mean   :   843.5   Mean   :  3129   Mean   :0.6713  
##  3rd Qu.:  324.0   3rd Qu.:   893.2   3rd Qu.:  3554   3rd Qu.:0.7130  
##  Max.   :11078.0   Max.   :169467.0   Max.   :296769   Max.   :0.8378  
##    digit_perc         non_per       
##  Min.   :0.00000   Min.   :0.04376  
##  1st Qu.:0.02185   1st Qu.:0.23439  
##  Median :0.05435   Median :0.24800  
##  Mean   :0.06352   Mean   :0.26518  
##  3rd Qu.:0.10292   3rd Qu.:0.28332  
##  Max.   :0.32863   Max.   :0.84205
summary(analysis %>% 
  filter(indicator == 1))
##      text             indicator   text_count        char_count    
##  Length:3264        Min.   :1   Min.   :    5.0   Min.   :    10  
##  Class :character   1st Qu.:1   1st Qu.:  156.0   1st Qu.:   560  
##  Mode  :character   Median :1   Median :  272.5   Median :  1496  
##                     Mean   :1   Mean   :  591.5   Mean   :  2772  
##                     3rd Qu.:1   3rd Qu.:  579.2   3rd Qu.:  3146  
##                     Max.   :1   Max.   :13991.0   Max.   :199264  
##   digit_count        non_count     total_char_count     char_perc     
##  Min.   :    0.0   Min.   :    7   Min.   :    17.0   Min.   :0.2180  
##  1st Qu.:   10.0   1st Qu.:  242   1st Qu.:   872.5   1st Qu.:0.6073  
##  Median :  219.5   Median :  550   Median :  2307.5   Median :0.6659  
##  Mean   :  335.4   Mean   : 1135   Mean   :  4242.0   Mean   :0.6626  
##  3rd Qu.:  396.0   3rd Qu.: 1223   3rd Qu.:  4844.0   3rd Qu.:0.7307  
##  Max.   :28465.0   Max.   :18112   Max.   :229237.0   Max.   :0.8808  
##    digit_perc         non_per      
##  Min.   :0.00000   Min.   :0.0113  
##  1st Qu.:0.01267   1st Qu.:0.2323  
##  Median :0.06702   Median :0.2594  
##  Mean   :0.06557   Mean   :0.2718  
##  3rd Qu.:0.10331   3rd Qu.:0.2936  
##  Max.   :0.50404   Max.   :0.7377

Looking at the two summaries above (indicator = 0 is ham and indicator = 1 is spam), it appears that there is a noticeable difference in word count, digit count, and digit percentage between spam and ham, with spam having more words and digits on average.

Looking at the summary is all well and good, but let’s see if we can visualize some of these differences with density plots.

ggplot(analysis) + 
  aes(x = text_count, fill = as.factor(indicator)) + 
  geom_density(alpha = 0.4) + 
  labs(title = "Spam vs Ham - Text Count") + 
  xlim(0, 3000)
## Warning: Removed 172 rows containing non-finite values (stat_density).

Looking at the density plot above, you can see a slight difference in the distribution of text count between spam (1) and ham (0): spam, on average, has a higher text count than ham.

Now let’s turn our attention to character count:

ggplot(analysis) + 
  aes(x = char_count, fill = as.factor(indicator)) + 
  geom_density(alpha = 0.4) + 
  labs(title = "Spam vs Ham - Character Count") + 
  xlim(0, 10000)
## Warning: Removed 320 rows containing non-finite values (stat_density).

This is a tricky chart. From our summaries above, we know that spam does have a higher average character count than ham, but not by much. However, the chart shows something the summaries hide: at the lower end of the range, ham is more heavily concentrated, while at higher character counts the trend switches and spam dominates. This is why it is important to create a chart: the summary numbers above are accurate, but on their own they don’t tell the whole story.

The charts for the remaining variables we created are shown below.

ggplot(analysis) + 
  aes(x = char_perc, fill = as.factor(indicator)) + 
  labs(title = "Spam vs Ham - Character Percentage") + 
  geom_density(alpha = 0.4)

ggplot(analysis) + 
  aes(x = digit_count, fill = as.factor(indicator)) + 
  geom_density(alpha = 0.4) + 
  labs(title = "Spam vs Ham - Digit Count") + 
  xlim(0, 2500)
## Warning: Removed 48 rows containing non-finite values (stat_density).

ggplot(analysis) + 
  aes(x = digit_perc, fill = as.factor(indicator)) + 
  labs(title = "Spam vs Ham - Digit Percentage") + 
  geom_density(alpha = 0.4)

ggplot(analysis) + 
  aes(x = non_count, fill = as.factor(indicator)) + 
  geom_density(alpha = 0.4) + 
  labs(title = "Spam vs Ham - Non-AlphaNumeric Count") + 
  xlim(0, 7500)
## Warning: Removed 148 rows containing non-finite values (stat_density).

ggplot(analysis) + 
  aes(x = non_per, fill = as.factor(indicator)) + 
  labs(title = "Spam vs Ham - Non-AlphaNumeric Percentage") + 
  geom_density(alpha = 0.4)

Looking at the charts above, we can clearly see there are some differences between spam and ham when it comes to word count, character count (alphabetic and numeric), and non-alphanumeric count. It may make sense to add these variables to our data set to use as predictors.

Text Analysis

Having looked at some quantitative features, let’s now turn our attention to the text in this data set. First, let’s create a row number that will be used frequently for reference going forward.

spam_ham_final <- spam_ham_final %>% 
  mutate(row_num = row_number())
head(spam_ham_final)
## # A tibble: 6 x 3
##   text                                                         indicator row_num
##   <chr>                                                            <dbl>   <int>
## 1 "Subject: re : video conference for interview : stig faltin~         0       1
## 2 "Return-Path: <fool@motleyfool.com>Received: (qmail 18981 i~         0       2
## 3 "Return-Path: nas@python.caDelivery-Date: Mon Sep  9 02:20:~         0       3
## 4 "Subject: re : message from ken rice  vince :  thanks for r~         0       4
## 5 "From jm@netnoteinc.com  Mon Jul 29 11:22:06 2002Return-Pat~         1       5
## 6 "Subject: strictly private .  gooday ,  with warm heart my ~         1       6

Now that our data set is combined, let’s take a look at the final breakout of ham vs spam:

spam_ham_final %>% 
  count(indicator)
## # A tibble: 2 x 2
##   indicator     n
##       <dbl> <int>
## 1         0  8508
## 2         1  3264

It looks like our spam percentage has dropped a little with the addition of the new data set; the combined data is now roughly 28% spam.

Many critical components of text analysis rely on our ability to look at individual words. To do that, we’ll use unnest_tokens() from tidytext to break each word into its own row. We’ll also lowercase every word, remove all stop words (it, as, the, etc.), and extract each word’s stem (thankful -> thank).
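As a quick standalone illustration of the stemming step (the example words are my own, chosen just to show the behavior):

# SnowballC's Porter stemmer collapses inflected forms onto a common stem,
# e.g. "thankful" -> "thank", "emails" -> "email", "running" -> "run"
SnowballC::wordStem(c("thankful", "emails", "running"))

Now the tokenization, stop-word removal, and stemming in one pipeline: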

tidy_spam_ham <- spam_ham_final %>% 
  unnest_tokens(output = word, input = text) %>% 
  mutate(word = tolower(word)) %>%
  anti_join(stop_words) %>%
  mutate(word = SnowballC::wordStem(word))
## Joining, by = "word"
tidy_spam_ham
## # A tibble: 3,851,053 x 3
##    indicator row_num word     
##        <dbl>   <int> <chr>    
##  1         0       1 subject  
##  2         0       1 video    
##  3         0       1 confer   
##  4         0       1 interview
##  5         0       1 stig     
##  6         0       1 faltinsen
##  7         0       1 shirlei  
##  8         0       1 hope     
##  9         0       1 pleasant 
## 10         0       1 easter   
## # ... with 3,851,043 more rows

Above, we can see this has really lengthened our data set. We are now working with almost 4M rows of data.

Let’s see if we can get a feel for the difference between spam and ham and which words are used most frequently in each.

#looking at top words in ham
tidy_spam_ham %>% 
  filter(indicator == 0) %>%
  count(indicator, word, sort = TRUE) %>%
  top_n(15, n)
## # A tibble: 15 x 3
##    indicator word          n
##        <dbl> <chr>     <int>
##  1         0 http      36996
##  2         0 2002      34168
##  3         0 id        26148
##  4         0 td        23445
##  5         0 localhost 21865
##  6         0 1         21421
##  7         0 list      20915
##  8         0 width     18206
##  9         0 subject   17943
## 10         0 0         15414
## 11         0 3d        15108
## 12         0 esmtp     14982
## 13         0 xent.com  13788
## 14         0 font      13484
## 15         0 enron     13252
#looking at top words in spam
tidy_spam_ham %>% 
  filter(indicator == 1) %>%
  count(indicator, word, sort = TRUE) %>%
  top_n(15, n)
## # A tibble: 15 x 3
##    indicator word      n
##        <dbl> <chr> <int>
##  1         1 3d    43261
##  2         1 font  42963
##  3         1 td    21149
##  4         1 br    20167
##  5         1 size  16147
##  6         1 tr    12640
##  7         1 nbsp  11869
##  8         1 http  11621
##  9         1 color 11387
## 10         1 width 11158
## 11         1 2002  10359
## 12         1 id     9212
## 13         1 1      8787
## 14         1 align  8188
## 15         1 2      7761

Looking at the output above, there isn’t anything tremendously meaningful that we can glean from this word-count exercise. Many of the highest-count terms actually come from the headers or HTML sections of the emails. While the exercise wasn’t terribly fruitful, it does suggest that raw word count (term frequency) may not be the best approach for this analysis, since both classes share many of the same high-frequency header terms.

Let’s now change direction and look at term frequency-inverse document frequency (tf-idf), which measures how “important” a term is to a document within a corpus.
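Before applying it to our emails, here is a minimal sketch of how tidytext’s bind_tf_idf() behaves on a toy two-document corpus (the documents and words are made up purely for illustration):

# Toy corpus: "meeting" appears in both documents, so its idf = ln(2/2) = 0
# and its tf-idf is 0; "free" appears in only one document, so its
# idf = ln(2/1) > 0 and it receives a positive tf-idf score.
tibble(doc = c(1, 1, 2), word = c("free", "meeting", "meeting")) %>%
  count(doc, word) %>%
  bind_tf_idf(term = word, document = doc, n = n)

With that intuition in hand, we’ll start by looking at spam: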

spam_tf_idf <- tidy_spam_ham %>% 
  filter(indicator == 1) %>%
  count(row_num, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = row_num, n = n) %>%
  arrange(desc(tf_idf))
spam_tf_idf
## # A tibble: 553,950 x 6
##    row_num word                 n    tf   idf tf_idf
##      <int> <chr>            <int> <dbl> <dbl>  <dbl>
##  1   10625 jif                  1 0.5    8.09   4.05
##  2     615 4623                 2 0.333  7.40   2.47
##  3   11486 ya                   2 0.4    6.14   2.46
##  4    2342 oreo                 4 0.222  8.09   1.80
##  5    7384 126432211            2 0.222  7.40   1.64
##  6    5202 graand               2 0.25   6.30   1.57
##  7    5093 requisit             1 0.2    6.70   1.34
##  8    6195 www.laxpress.com    63 0.163  8.09   1.32
##  9     646 lambino              5 0.161  8.09   1.30
## 10    6195 50site              61 0.158  8.09   1.28
## # ... with 553,940 more rows
spam_tf_idf %>% 
  top_n(15, wt = tf_idf) %>%
  ggplot() + 
  aes(x = reorder(word, tf_idf), y = tf_idf) + 
  geom_col() + 
  coord_flip()

Looking at the output above, while there is still some garble (e.g., 126432211), there are some subtleties here. Words like Jif and Oreo may indicate some type of advertising, while words like striptease look like typical spam content.

Now let’s look at ham:

ham_tf_idf <- tidy_spam_ham %>% 
  filter(indicator == 0) %>%
  count(row_num, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = row_num, n = n) %>%
  arrange(desc(tf_idf))
ham_tf_idf
## # A tibble: 1,211,558 x 6
##    row_num word             n    tf   idf tf_idf
##      <int> <chr>        <int> <dbl> <dbl>  <dbl>
##  1   10778 elana            1 0.333  9.05   3.02
##  2    8814 rank             1 0.5    4.74   2.37
##  3    7729 congrat          2 0.333  6.75   2.25
##  4    1830 gillian          1 0.333  6.65   2.22
##  5    1592 congratul        2 0.5    4.35   2.17
##  6    5108 exl             21 0.231  9.05   2.09
##  7   10778 hurrican         1 0.333  6.05   2.02
##  8   11623 statistician     2 0.25   7.95   1.99
##  9    4083 sddp             2 0.25   7.44   1.86
## 10    1956 fyi              1 0.5    3.57   1.78
## # ... with 1,211,548 more rows
ham_tf_idf %>% 
  top_n(15, wt = tf_idf) %>%
  ggplot() + 
  aes(x = reorder(word, tf_idf), y = tf_idf) + 
  geom_col() + 
  coord_flip()

This output is completely different from what we saw above for spam. For the most part, these look like normal words and names. Comparing term frequency (raw term counts) with tf-idf, we can make the educated assumption that tf-idf is going to be more helpful to us in a predictive model.

The Model

In order to build a predictive model, we’ll need to get our data into a digestible format. To do this, we’ll create a document-term matrix, which turns our current data set (one row per word) into one row per email with one column for every term in the corpus. We’ll weight those values with tf-idf.

dtm_spam_ham <- tidy_spam_ham %>%
  count(row_num, word) %>%
  cast_dtm(document = row_num, term = word, value = n, weighting = tm::weightTfIdf)
dtm_spam_ham
## <<DocumentTermMatrix (documents: 11772, terms: 221798)>>
## Non-/sparse entries: 1765508/2609240548
## Sparsity           : 100%
## Maximal term length: 19844
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Looking at the output above, our data set has 11,772 rows and 221,798 columns. That is far too many columns to work with, so we’ll want to remove some of the sparse terms, meaning words that don’t occur across many of the documents. Too many columns would add to the complexity of our model as well as substantially add to the time it takes to run.

We will set our sparsity threshold to .98, meaning we will remove terms that are absent from more than 98% of the documents.

dtm_s_h <- tm::removeSparseTerms(dtm_spam_ham, sparse = .98)
dtm_s_h
## <<DocumentTermMatrix (documents: 11772, terms: 1123)>>
## Non-/sparse entries: 992711/12227245
## Sparsity           : 92%
## Maximal term length: 38
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Looking at the output above, we’ve trimmed our data set down to 1,123 columns. This will be a bit easier for our model to chew on.
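If we want to eyeball a corner of the trimmed matrix before converting it, tm’s inspect() prints a small slice (a quick exploratory peek, not required for the model):

# Show the tf-idf weights for the first five documents and first five terms
tm::inspect(dtm_s_h[1:5, 1:5])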

dtm_matrix <- as.matrix(dtm_s_h)  # avoid shadowing base::matrix
full_df <- as.data.frame(dtm_matrix)
dim(full_df)
## [1] 11772  1123

Now that our data set is cleaned and ready to go, we’re ready to split it into training and testing sets. We can do this with tidymodels’ initial_split() function. We’ll break the data into two chunks: 75% train and 25% test.

full_df_split <- initial_split(full_df, prop = 3/4)

train_df <- training(full_df_split)
test_df <- testing(full_df_split)

Now that we’ve split the data, let’s take a look at the dimensions of each set:

dim(train_df)
## [1] 8829 1123
dim(test_df)
## [1] 2943 1123

Now, let’s build our model. We’ll use a random forest (ranger is a fast implementation of random forest) with 3-fold cross-validation on our training set. We’ll also add the importance argument so we can see the importance of each variable once the model is fit.

# Labels are looked up by position: row_num (and therefore the DTM's
# document name / row name) matches each row's position in spam_ham_final.
model <- train(
  x = train_df,
  y = as.factor(spam_ham_final$indicator[as.integer(rownames(train_df))]),
  method = "ranger",        # fast random forest implementation
  num.trees = 200,
  importance = "impurity",  # track variable importance for later inspection
  trControl = trainControl(method = "cv",
                           number = 3,
                           verboseIter = TRUE)
)
## + Fold1: mtry=   2, min.node.size=1, splitrule=gini 
## - Fold1: mtry=   2, min.node.size=1, splitrule=gini 
## + Fold1: mtry=  47, min.node.size=1, splitrule=gini 
## - Fold1: mtry=  47, min.node.size=1, splitrule=gini 
## + Fold1: mtry=1123, min.node.size=1, splitrule=gini 
## Growing trees.. Progress: 69%. Estimated remaining time: 14 seconds.
## - Fold1: mtry=1123, min.node.size=1, splitrule=gini 
## + Fold1: mtry=   2, min.node.size=1, splitrule=extratrees 
## - Fold1: mtry=   2, min.node.size=1, splitrule=extratrees 
## + Fold1: mtry=  47, min.node.size=1, splitrule=extratrees 
## - Fold1: mtry=  47, min.node.size=1, splitrule=extratrees 
## + Fold1: mtry=1123, min.node.size=1, splitrule=extratrees 
## Growing trees.. Progress: 64%. Estimated remaining time: 17 seconds.
## - Fold1: mtry=1123, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry=   2, min.node.size=1, splitrule=gini 
## - Fold2: mtry=   2, min.node.size=1, splitrule=gini 
## + Fold2: mtry=  47, min.node.size=1, splitrule=gini 
## - Fold2: mtry=  47, min.node.size=1, splitrule=gini 
## + Fold2: mtry=1123, min.node.size=1, splitrule=gini 
## Growing trees.. Progress: 70%. Estimated remaining time: 13 seconds.
## - Fold2: mtry=1123, min.node.size=1, splitrule=gini 
## + Fold2: mtry=   2, min.node.size=1, splitrule=extratrees 
## - Fold2: mtry=   2, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry=  47, min.node.size=1, splitrule=extratrees 
## - Fold2: mtry=  47, min.node.size=1, splitrule=extratrees 
## + Fold2: mtry=1123, min.node.size=1, splitrule=extratrees 
## Growing trees.. Progress: 61%. Estimated remaining time: 20 seconds.
## - Fold2: mtry=1123, min.node.size=1, splitrule=extratrees 
## + Fold3: mtry=   2, min.node.size=1, splitrule=gini 
## - Fold3: mtry=   2, min.node.size=1, splitrule=gini 
## + Fold3: mtry=  47, min.node.size=1, splitrule=gini 
## - Fold3: mtry=  47, min.node.size=1, splitrule=gini 
## + Fold3: mtry=1123, min.node.size=1, splitrule=gini 
## Growing trees.. Progress: 73%. Estimated remaining time: 11 seconds.
## - Fold3: mtry=1123, min.node.size=1, splitrule=gini 
## + Fold3: mtry=   2, min.node.size=1, splitrule=extratrees 
## - Fold3: mtry=   2, min.node.size=1, splitrule=extratrees 
## + Fold3: mtry=  47, min.node.size=1, splitrule=extratrees 
## - Fold3: mtry=  47, min.node.size=1, splitrule=extratrees 
## + Fold3: mtry=1123, min.node.size=1, splitrule=extratrees 
## Growing trees.. Progress: 61%. Estimated remaining time: 20 seconds.
## - Fold3: mtry=1123, min.node.size=1, splitrule=extratrees 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 47, splitrule = gini, min.node.size = 1 on full training set
model
## Random Forest 
## 
## 8829 samples
## 1123 predictors
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 5886, 5886, 5886 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##      2  gini        0.9193567  0.7823802
##      2  extratrees  0.9083701  0.7488211
##     47  gini        0.9800657  0.9502959
##     47  extratrees  0.9768943  0.9421166
##   1123  gini        0.9619436  0.9047048
##   1123  extratrees  0.9790463  0.9479912
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 47, splitrule = gini
##  and min.node.size = 1.

Above, we can see the output of the model. On our training data, the accuracy of the best model (mtry = 47, gini splitrule) was ~98%.

Let’s visualize the different tuning parameters the model tried and their accuracy using ggplot:

ggplot(model)

Because we used importance = “impurity”, we can also look at the individual predictors’ strengths. This will show us the most predictive attributes:

varImp(model, scale = TRUE)$importance %>% 
  rownames_to_column() %>% 
  arrange(-Overall) %>% 
  top_n(25) %>%
  ggplot() + 
  aes(x = reorder(rowname, Overall), y = Overall) +
  geom_col() + 
  labs(title = "Most Predictive Words") +
  coord_flip()
## Selecting by Overall

Looking at the above, some obvious words like “click” and “offer” probably indicate spam pretty easily. Another thing to note is that there appear to be quite a few HTML terms here. Perhaps HTML markup in an email often indicates spam, while people emailing back and forth rarely include HTML tags.

Finally, let’s use our model to make predictions on our test data set.

# Predict on the held-out emails and score against the true labels,
# again looked up by row-name position
predictions <- predict(model, test_df)
confusionMatrix(predictions, as.factor(spam_ham_final$indicator[as.integer(rownames(test_df))]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2108   26
##          1   35  774
##                                           
##                Accuracy : 0.9793          
##                  95% CI : (0.9735, 0.9841)
##     No Information Rate : 0.7282          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9478          
##                                           
##  Mcnemar's Test P-Value : 0.3057          
##                                           
##             Sensitivity : 0.9837          
##             Specificity : 0.9675          
##          Pos Pred Value : 0.9878          
##          Neg Pred Value : 0.9567          
##              Prevalence : 0.7282          
##          Detection Rate : 0.7163          
##    Detection Prevalence : 0.7251          
##       Balanced Accuracy : 0.9756          
##                                           
##        'Positive' Class : 0               
## 

It looks like our model was 97.93% accurate in classifying ham versus spam. Looking at the confusion matrix above, the model incorrectly labeled 35 ham emails as spam and 26 spam emails as ham, while correctly categorizing 2,882 of the 2,943 test emails.
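As a quick arithmetic check of that accuracy figure:

# (correct ham + correct spam) / total test emails = 2882 / 2943 ~ 0.9793
(2108 + 774) / (2108 + 26 + 35 + 774)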

Conclusion

We were able to build an accurate model to classify spam versus ham using a random forest. Because the model was quite accurate, other models were not explored; to extend this work, additional models should be tested, especially those with particular strength in classification, such as a support vector machine (SVM). Additionally, there is further text pre-processing that could potentially improve our results, such as combining term frequency with tf-idf or using a different weighting method when creating the document-term matrix. A sketch of the SVM extension is below.
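As one possible extension (a sketch only, not run here), caret makes swapping the learner nearly a one-line change; “svmLinear” is caret’s interface to kernlab’s linear SVM:

# Sketch: same features and labels, linear SVM instead of random forest.
# The SVM's cost parameter C would still need tuning (e.g. via tuneLength).
svm_model <- train(
  x = train_df,
  y = as.factor(spam_ham_final$indicator[as.integer(rownames(train_df))]),
  method = "svmLinear",
  trControl = trainControl(method = "cv", number = 3)
)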