The purpose of this project is to build a classification model that can accurately distinguish spam email messages from ham (legitimate) email messages. We will use pre-classified email messages to build a training set and then fit a predictive model to classify unseen email messages as either spam or ham. Building this model will rely heavily on several text mining techniques, which are demonstrated below. We’ll begin by loading the necessary libraries.
library(readr)
library(tidyverse)
library(tidytext)
library(ggplot2)
library(tm)
library(caret)
library(tidymodels)
We will take our data from two locations, the first of which is a repository of 6,046 emails that I have downloaded from here as individual files. We will need to read in each file from its location on my computer and then create a data frame with one email per row. We’ll start by getting the file path to each file and storing the paths as a vector.
ham_file_path <- "C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Spring 2020/DATA607/Week 13-Classification/SPAMHAM/ham"
spam_file_path <- "C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Spring 2020/DATA607/Week 13-Classification/SPAMHAM/spam"
ham <- list.files(ham_file_path, full.names = TRUE)
spam <- list.files(spam_file_path, full.names = TRUE)
Next, we will build a function that takes in a file path, reads the file, and removes the newline characters. We will use purrr::map() to apply this function to our vector of file paths; using map lets us take advantage of vectorization as opposed to writing an explicit loop. Once we have our data frame, we will rename the “value” column to “text” and add an “indicator” variable labeling each row as spam (1) or ham (0) (we will run this process twice, once over the set of ham files and again over the set of spam files). Lastly, we’ll combine both data frames.
convert_line <- function(path) {
  read_file(path) %>%
    str_replace_all("\\\n+|\\n|_+", "")
}
spam_list <- purrr::map(spam, convert_line)
spam_df <- tibble(value = unlist(spam_list)) %>%
  rename(text = value) %>%
  mutate(indicator = 1)  # 1 = spam
ham_list <- purrr::map(ham, convert_line)
ham_df <- tibble(value = unlist(ham_list)) %>%
  rename(text = value) %>%
  mutate(indicator = 0)  # 0 = ham
spam_ham <- rbind(spam_df, ham_df)
Let’s take a look at the first few rows of our data:
head(spam_ham)
## # A tibble: 6 x 2
## text indicator
## <chr> <dbl>
## 1 "From ilug-admin@linux.ie Tue Aug 6 11:51:02 2002Return-Path: <il~ 1
## 2 "From lmrn@mailexcite.com Mon Jun 24 17:03:24 2002Return-Path: mer~ 1
## 3 "From amknight@mailexcite.com Mon Jun 24 17:03:49 2002Return-Path:~ 1
## 4 "From jordan23@mailexcite.com Mon Jun 24 17:04:20 2002Return-Path:~ 1
## 5 "From merchantsworld2001@juno.com Tue Aug 6 11:01:33 2002Return-P~ 1
## 6 "Received: from hq.pro-ns.net (localhost [127.0.0.1])\tby hq.pro-ns~ 1
Looking at the rows above, we can see there is going to be a lot of garbage in each of these emails (web addresses, dates, digits). While attempting some data extraction, we found that the structure varies significantly from email to email, so attempts to extract the “Content” or some other attribute return many null values. We’ll therefore work with each email string in its entirety so we don’t lose potentially informative data.
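To illustrate the problem, here is a minimal sketch of the kind of extraction that fails (the “Content-Type” pattern is a hypothetical example of a header field, not something every email in this corpus shares):
# Hypothetical extraction attempt: pull a Content-Type header from each email.
# Because header layouts vary from message to message, many rows come back NA.
extraction_check <- spam_ham %>%
  mutate(content_type = str_extract(text, "Content-Type: [^;]+"))
sum(is.na(extraction_check$content_type))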
Let’s see how many spam emails we have and how many ham emails:
spam_ham %>% count(indicator)
## # A tibble: 2 x 2
## indicator n
## <dbl> <int>
## 1 0 4150
## 2 1 1896
It looks like our data set is roughly 31% spam.
Total dimensions for our data frame:
dim(spam_ham)
## [1] 6046 2
To make our model more robust, and so our entire training corpus isn’t from a single source, let’s bring in one more data set to use in training and prediction. This data set is from Kaggle and can be downloaded here. Below, I have downloaded the file and put it in my GitHub.
test_data <- readr::read_csv("https://raw.githubusercontent.com/christianthieme/MSDS-DATA607/master/SpamHamTestData.csv")
Again, let’s take a look at the first few rows of data:
head(test_data)
## # A tibble: 6 x 5
## v1 v2 X3 X4 X5
## <dbl> <chr> <chr> <chr> <chr>
## 1 0 "Subject: for vince j kaminski ' s approval below yo~ <NA> <NA> <NA>
## 2 0 "Subject: continental phone # 1 - 800 - 621 - 7467 o~ <NA> <NA> <NA>
## 3 0 "Subject: exmar purchase decision fyi - - - - - - -~ <NA> <NA> <NA>
## 4 0 "Subject: re : next step al , i have spoken with ma~ <NA> <NA> <NA>
## 5 0 "Subject: summer opportunity vince : i did get your~ <NA> <NA> <NA>
## 6 0 "Subject: interviewing for the associate and analyst ~ <NA> <NA> <NA>
Looking at the first few rows of this data, we can see that we’ll need to rename and reorder some columns as well as get rid of a few blank columns that were read in.
test_data <- test_data %>%
  rename(indicator = v1, text = v2) %>%
  select(text, indicator)
head(test_data)
## # A tibble: 6 x 2
## text indicator
## <chr> <dbl>
## 1 "Subject: for vince j kaminski ' s approval below you will find a ~ 0
## 2 "Subject: continental phone # 1 - 800 - 621 - 7467 or ? cust . ser~ 0
## 3 "Subject: exmar purchase decision fyi - - - - - - - - - - - - - -~ 0
## 4 "Subject: re : next step al , i have spoken with mark lay and he ~ 0
## 5 "Subject: summer opportunity vince : i did get your phone message~ 0
## 6 "Subject: interviewing for the associate and analyst programs the ~ 0
After our transformations, the dimensions of the data set are:
dim(test_data)
## [1] 5726 2
Our new data set will add 5,726 emails to our 6,046, bringing our total to 11,772.
In this additional data set, let’s see how many spam and ham emails we have:
test_data %>%
count(indicator)
## # A tibble: 2 x 2
## indicator n
## <dbl> <int>
## 1 0 4358
## 2 1 1368
Now that we’ve cleaned both data sets, let’s combine them.
spam_ham_final <- rbind(spam_ham, test_data)
Now, because of the way the original data was stored (a folder of ham and a separate folder of spam) and read in, the spam and ham emails sit in contiguous blocks rather than being mixed throughout the data set. Let’s randomize the row order so we aren’t introducing bias to our model. Here we’ll use sample() to shuffle our data set and set a seed so the result is reproducible.
set.seed(42)
rows <- sample(nrow(spam_ham_final))
spam_ham_final <- spam_ham_final[rows,]
Before building a model, let’s do some exploratory analysis to see which approach makes more sense when implementing our model (term frequency vs. tf-idf). We’ll start by engineering a few new feature columns:
analysis <- spam_ham_final %>%
  mutate(
    text_count  = lengths(str_split(text, " ")),    # approximate word count
    char_count  = str_count(text, "[A-Za-z]"),      # letter count
    digit_count = str_count(text, "[0-9]"),         # digit count
    non_count   = str_count(text, "[^[:alnum:]]"),  # non-alphanumeric count
    total_char_count = char_count + digit_count + non_count,
    char_perc  = char_count / total_char_count,
    digit_perc = digit_count / total_char_count,
    non_per    = non_count / total_char_count
  )
head(analysis)
## # A tibble: 6 x 10
## text indicator text_count char_count digit_count non_count total_char_count
## <chr> <dbl> <int> <int> <int> <int> <int>
## 1 "Sub~ 0 301 800 53 362 1215
## 2 "Ret~ 0 4942 30077 1474 11539 43090
## 3 "Ret~ 0 399 942 214 532 1688
## 4 "Sub~ 0 366 1031 98 437 1566
## 5 "Fro~ 1 578 4936 638 1843 7417
## 6 "Sub~ 1 1029 3697 48 1174 4919
## # ... with 3 more variables: char_perc <dbl>, digit_perc <dbl>, non_per <dbl>
Now that we’ve created these columns, let’s see if we can identify any differences between spam and ham.
# summary statistics for ham
analysis %>%
  filter(indicator == 0) %>%
  summary()
## text indicator text_count char_count
## Length:8508 Min. :0 Min. : 2 Min. : 10
## Class :character 1st Qu.:0 1st Qu.: 178 1st Qu.: 661
## Mode :character Median :0 Median : 323 Median : 1464
## Mean :0 Mean : 456 Mean : 2078
## 3rd Qu.:0 3rd Qu.: 490 3rd Qu.: 2353
## Max. :0 Max. :36450 Max. :125449
## digit_count non_count total_char_count char_perc
## Min. : 0.0 Min. : 2.0 Min. : 12 Min. :0.1140
## 1st Qu.: 26.0 1st Qu.: 259.0 1st Qu.: 1031 1st Qu.:0.6348
## Median : 130.0 Median : 593.0 Median : 2244 Median :0.6667
## Mean : 207.8 Mean : 843.5 Mean : 3129 Mean :0.6713
## 3rd Qu.: 324.0 3rd Qu.: 893.2 3rd Qu.: 3554 3rd Qu.:0.7130
## Max. :11078.0 Max. :169467.0 Max. :296769 Max. :0.8378
## digit_perc non_per
## Min. :0.00000 Min. :0.04376
## 1st Qu.:0.02185 1st Qu.:0.23439
## Median :0.05435 Median :0.24800
## Mean :0.06352 Mean :0.26518
## 3rd Qu.:0.10292 3rd Qu.:0.28332
## Max. :0.32863 Max. :0.84205
# summary statistics for spam
analysis %>%
  filter(indicator == 1) %>%
  summary()
## text indicator text_count char_count
## Length:3264 Min. :1 Min. : 5.0 Min. : 10
## Class :character 1st Qu.:1 1st Qu.: 156.0 1st Qu.: 560
## Mode :character Median :1 Median : 272.5 Median : 1496
## Mean :1 Mean : 591.5 Mean : 2772
## 3rd Qu.:1 3rd Qu.: 579.2 3rd Qu.: 3146
## Max. :1 Max. :13991.0 Max. :199264
## digit_count non_count total_char_count char_perc
## Min. : 0.0 Min. : 7 Min. : 17.0 Min. :0.2180
## 1st Qu.: 10.0 1st Qu.: 242 1st Qu.: 872.5 1st Qu.:0.6073
## Median : 219.5 Median : 550 Median : 2307.5 Median :0.6659
## Mean : 335.4 Mean : 1135 Mean : 4242.0 Mean :0.6626
## 3rd Qu.: 396.0 3rd Qu.: 1223 3rd Qu.: 4844.0 3rd Qu.:0.7307
## Max. :28465.0 Max. :18112 Max. :229237.0 Max. :0.8808
## digit_perc non_per
## Min. :0.00000 Min. :0.0113
## 1st Qu.:0.01267 1st Qu.:0.2323
## Median :0.06702 Median :0.2594
## Mean :0.06557 Mean :0.2718
## 3rd Qu.:0.10331 3rd Qu.:0.2936
## Max. :0.50404 Max. :0.7377
Looking at the two summaries above (indicator = 0 is ham and indicator = 1 is spam), there is a noticeable difference in word count, digit count, and digit percentage between spam and ham, with spam having more words and digits on average.
Looking at the summary is all well and good, but let’s see if we can visualize some of these differences with density plots.
ggplot(analysis) +
  aes(x = text_count, fill = as.factor(indicator)) +
  geom_density(alpha = 0.4) +
  labs(title = "Spam vs Ham - Text Count") +
  xlim(0, 3000)
## Warning: Removed 172 rows containing non-finite values (stat_density).
Looking at the density plot above, we can see a slight difference in the distribution of text count between spam (1) and ham (0): spam, on average, has a higher text count than ham.
Now let’s turn our attention to character count:
ggplot(analysis) +
  aes(x = char_count, fill = as.factor(indicator)) +
  geom_density(alpha = 0.4) +
  labs(title = "Spam vs Ham - Character Count") +
  xlim(0, 10000)
## Warning: Removed 320 rows containing non-finite values (stat_density).
This is a tricky chart. From our summaries above, we know that spam does have a higher character count than ham, but not by much. The chart, however, shows something interesting: at the lower end of character count, ham has more density than spam, but as we move to higher counts on the x-axis the trend flips and spam dominates. This is why it is important to chart the data: the summary numbers above are accurate, but they don’t tell the whole story.
I’ll show the remainder of the charts for the variables we created below.
ggplot(analysis) +
  aes(x = char_perc, fill = as.factor(indicator)) +
  labs(title = "Spam vs Ham - Character Percentage") +
  geom_density(alpha = 0.4)
ggplot(analysis) +
  aes(x = digit_count, fill = as.factor(indicator)) +
  geom_density(alpha = 0.4) +
  labs(title = "Spam vs Ham - Digit Count") +
  xlim(0, 2500)
## Warning: Removed 48 rows containing non-finite values (stat_density).
ggplot(analysis) +
  aes(x = digit_perc, fill = as.factor(indicator)) +
  labs(title = "Spam vs Ham - Digit Percentage") +
  geom_density(alpha = 0.4)
ggplot(analysis) +
  aes(x = non_count, fill = as.factor(indicator)) +
  geom_density(alpha = 0.4) +
  labs(title = "Spam vs Ham - Non-AlphaNumeric Count") +
  xlim(0, 7500)
## Warning: Removed 148 rows containing non-finite values (stat_density).
ggplot(analysis) +
  aes(x = non_per, fill = as.factor(indicator)) +
  labs(title = "Spam vs Ham - Non-AlphaNumeric Percentage") +
  geom_density(alpha = 0.4)
Looking at the charts above, we can clearly see there are differences between spam and ham when it comes to word count, character count (alpha and numeric), and non-alphanumeric count. It may make sense to add these variables to our data set as predictors; a sketch of how they could be collected for that purpose follows.
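We won’t fold these into the final model below, but as a minimal sketch (assuming the rows of analysis stay in the same order as spam_ham_final), the engineered features could be pulled into their own predictor table for later column-binding:
# Hypothetical predictor table of engineered features; rows remain in the same
# order as spam_ham_final, so they could later be bound to a document-term
# matrix built from the same data.
engineered_features <- analysis %>%
  select(text_count, char_count, digit_count, non_count,
         char_perc, digit_perc, non_per)
dim(engineered_features)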
Having looked at some quantitative features, let’s now turn our attention to the text in this data set. First, let’s create a row number that will be used frequently for reference going forward.
spam_ham_final <- spam_ham_final %>%
mutate(row_num = row_number())
head(spam_ham_final)
## # A tibble: 6 x 3
## text indicator row_num
## <chr> <dbl> <int>
## 1 "Subject: re : video conference for interview : stig faltin~ 0 1
## 2 "Return-Path: <fool@motleyfool.com>Received: (qmail 18981 i~ 0 2
## 3 "Return-Path: nas@python.caDelivery-Date: Mon Sep 9 02:20:~ 0 3
## 4 "Subject: re : message from ken rice vince : thanks for r~ 0 4
## 5 "From jm@netnoteinc.com Mon Jul 29 11:22:06 2002Return-Pat~ 1 5
## 6 "Subject: strictly private . gooday , with warm heart my ~ 1 6
Now that our data set is combined, let’s take a look at the final breakout of ham vs spam:
spam_ham_final %>%
count(indicator)
## # A tibble: 2 x 2
## indicator n
## <dbl> <int>
## 1 0 8508
## 2 1 3264
It looks like our spam percentage has dropped a little with the addition of the new data: the combined data set is now roughly 28% spam.
Many critical components of text analysis rely on our ability to look at individual words. To do that, we’ll use unnest_tokens from tidytext to break each word into its own row. We’ll also change every word to lowercase, remove all stop words (it, as, the, etc.), and extract each word’s stem (thankful –> thank).
tidy_spam_ham <- spam_ham_final %>%
  unnest_tokens(output = word, input = text) %>%
  mutate(word = tolower(word)) %>%
  anti_join(stop_words) %>%
  mutate(word = SnowballC::wordStem(word))
## Joining, by = "word"
tidy_spam_ham
## # A tibble: 3,851,053 x 3
## indicator row_num word
## <dbl> <int> <chr>
## 1 0 1 subject
## 2 0 1 video
## 3 0 1 confer
## 4 0 1 interview
## 5 0 1 stig
## 6 0 1 faltinsen
## 7 0 1 shirlei
## 8 0 1 hope
## 9 0 1 pleasant
## 10 0 1 easter
## # ... with 3,851,043 more rows
Above, we can see this has really lengthened our data set. We are now working with almost 4M rows of data.
Let’s see if we can get a feel for the difference between spam and ham and which words are used most frequently in each.
# looking at top words in ham
tidy_spam_ham %>%
  filter(indicator == 0) %>%
  count(indicator, word, sort = TRUE) %>%
  top_n(15, n)
## # A tibble: 15 x 3
## indicator word n
## <dbl> <chr> <int>
## 1 0 http 36996
## 2 0 2002 34168
## 3 0 id 26148
## 4 0 td 23445
## 5 0 localhost 21865
## 6 0 1 21421
## 7 0 list 20915
## 8 0 width 18206
## 9 0 subject 17943
## 10 0 0 15414
## 11 0 3d 15108
## 12 0 esmtp 14982
## 13 0 xent.com 13788
## 14 0 font 13484
## 15 0 enron 13252
# looking at top words in spam
tidy_spam_ham %>%
  filter(indicator == 1) %>%
  count(indicator, word, sort = TRUE) %>%
  top_n(15, n)
## # A tibble: 15 x 3
## indicator word n
## <dbl> <chr> <int>
## 1 1 3d 43261
## 2 1 font 42963
## 3 1 td 21149
## 4 1 br 20167
## 5 1 size 16147
## 6 1 tr 12640
## 7 1 nbsp 11869
## 8 1 http 11621
## 9 1 color 11387
## 10 1 width 11158
## 11 1 2002 10359
## 12 1 id 9212
## 13 1 1 8787
## 14 1 align 8188
## 15 1 2 7761
Looking at the output above, there isn’t anything tremendously meaningful we can glean from this word-count exercise. Many of the highest-count items are actually from the header or HTML sections of the emails. While the exercise wasn’t terribly fruitful, what we do learn is that raw word count (term frequency) is probably not the best approach for this analysis, since both classes will share many of the same header words.
Let’s now change direction and look at term frequency–inverse document frequency (tf-idf), which helps us measure how “important” a term is in a corpus: a term’s frequency within a document is scaled by how rare the term is across documents.
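As a quick illustration (a toy corpus, not part of the email data), bind_tf_idf zeroes out terms that appear in every document and up-weights terms unique to one document:
# Toy corpus: "apple" appears in both documents, so its idf (and tf-idf) is 0,
# while "zebra" appears in only one document and is up-weighted.
toy <- tibble(doc = c(1, 1, 2, 2),
              word = c("apple", "zebra", "apple", "banana")) %>%
  count(doc, word) %>%
  bind_tf_idf(term = word, document = doc, n = n)
toy
With that intuition in hand, we’ll start by looking first at spam: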
spam_tf_idf <- tidy_spam_ham %>%
  filter(indicator == 1) %>%
  count(row_num, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = row_num, n = n) %>%
  arrange(desc(tf_idf))
spam_tf_idf
## # A tibble: 553,950 x 6
## row_num word n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 10625 jif 1 0.5 8.09 4.05
## 2 615 4623 2 0.333 7.40 2.47
## 3 11486 ya 2 0.4 6.14 2.46
## 4 2342 oreo 4 0.222 8.09 1.80
## 5 7384 126432211 2 0.222 7.40 1.64
## 6 5202 graand 2 0.25 6.30 1.57
## 7 5093 requisit 1 0.2 6.70 1.34
## 8 6195 www.laxpress.com 63 0.163 8.09 1.32
## 9 646 lambino 5 0.161 8.09 1.30
## 10 6195 50site 61 0.158 8.09 1.28
## # ... with 553,940 more rows
spam_tf_idf %>%
  top_n(15, wt = tf_idf) %>%
  ggplot() +
  aes(x = reorder(word, tf_idf), y = tf_idf) +
  geom_col() +
  coord_flip()
Looking at the output above, while there is still some garble (e.g., 126432211), there are some subtleties here. Words like Jif and Oreo may indicate some type of advertising, while words like striptease look like typical spam garbage.
Now let’s look at ham:
ham_tf_idf <- tidy_spam_ham %>%
  filter(indicator == 0) %>%
  count(row_num, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = row_num, n = n) %>%
  arrange(desc(tf_idf))
ham_tf_idf
## # A tibble: 1,211,558 x 6
## row_num word n tf idf tf_idf
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 10778 elana 1 0.333 9.05 3.02
## 2 8814 rank 1 0.5 4.74 2.37
## 3 7729 congrat 2 0.333 6.75 2.25
## 4 1830 gillian 1 0.333 6.65 2.22
## 5 1592 congratul 2 0.5 4.35 2.17
## 6 5108 exl 21 0.231 9.05 2.09
## 7 10778 hurrican 1 0.333 6.05 2.02
## 8 11623 statistician 2 0.25 7.95 1.99
## 9 4083 sddp 2 0.25 7.44 1.86
## 10 1956 fyi 1 0.5 3.57 1.78
## # ... with 1,211,548 more rows
ham_tf_idf %>%
  top_n(15, wt = tf_idf) %>%
  ggplot() +
  aes(x = reorder(word, tf_idf), y = tf_idf) +
  geom_col() +
  coord_flip()
This output is completely different from what we saw above for spam. For the most part, these look like normal words and names. Comparing term frequency (raw counts) with tf-idf, we can make the educated assumption that tf-idf is going to be more helpful in a predictive model.
In order to build a predictive model, we’ll need to get our data into a digestible format. To do this, we’ll create a document-term matrix, which takes our current data set (one row per word) and turns it into one row per email, with one column per word. We’ll weight the values with tf-idf.
dtm_spam_ham <- tidy_spam_ham %>%
  count(row_num, word) %>%
  cast_dtm(document = row_num, term = word, value = n, weighting = tm::weightTfIdf)
dtm_spam_ham
## <<DocumentTermMatrix (documents: 11772, terms: 221798)>>
## Non-/sparse entries: 1765508/2609240548
## Sparsity : 100%
## Maximal term length: 19844
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Looking at the output above, our data set has 11,772 rows and 221,798 columns. That is far too many columns to work with, so we’ll remove some of the sparse columns (words), meaning those words that don’t occur across many of the documents. Too many columns adds complexity to our model and substantially increases the time it takes to run.
We will set the sparsity threshold to .98, meaning we remove any word that is missing from more than 98% of the documents; with 11,772 documents, that keeps only words appearing in roughly 2% of them (about 236 or more).
dtm_s_h <- tm::removeSparseTerms(dtm_spam_ham, sparse = .98)
dtm_s_h
## <<DocumentTermMatrix (documents: 11772, terms: 1123)>>
## Non-/sparse entries: 992711/12227245
## Sparsity : 92%
## Maximal term length: 38
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Looking at the output above, we’ve trimmed our data set down to 1,123 columns. This will be a bit easier for our model to chew on.
dtm_matrix <- as.matrix(dtm_s_h)      # densify the trimmed document-term matrix
full_df <- as.data.frame(dtm_matrix)  # one row per email, one column per term
dim(full_df)
## [1] 11772 1123
Now that our data set is cleaned and ready to go, we’re ready to split it into training and testing sets. We can do this with tidymodels’ initial_split function. We’ll break the data set into two chunks: 75% train and 25% test.
full_df_split <- initial_split(full_df, prop = 3/4)
train_df <- training(full_df_split)
test_df <- testing(full_df_split)
Now that we’ve split the data, let’s take a look at the dimensions of each set:
dim(train_df)
## [1] 8829 1123
dim(test_df)
## [1] 2943 1123
Now, let’s build our model. We’ll use a random forest (“ranger” is a fast implementation of random forest) with 3-fold cross-validation on our training set. We’ll also add the importance argument so we can inspect each variable’s importance once the model is fit.
model <- train(
  x = train_df,
  # rownames(train_df) are the row_num document ids, so convert them to
  # integers to pull the matching labels out of spam_ham_final
  y = as.factor(spam_ham_final[as.integer(rownames(train_df)), ]$indicator),
  method = "ranger",
  num.trees = 200,
  importance = "impurity",
  trControl = trainControl(method = "cv",
                           number = 3,
                           verboseIter = TRUE)
)
## + Fold1: mtry= 2, min.node.size=1, splitrule=gini
## - Fold1: mtry= 2, min.node.size=1, splitrule=gini
## + Fold1: mtry= 47, min.node.size=1, splitrule=gini
## - Fold1: mtry= 47, min.node.size=1, splitrule=gini
## + Fold1: mtry=1123, min.node.size=1, splitrule=gini
## Growing trees.. Progress: 69%. Estimated remaining time: 14 seconds.
## - Fold1: mtry=1123, min.node.size=1, splitrule=gini
## + Fold1: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold1: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold1: mtry= 47, min.node.size=1, splitrule=extratrees
## - Fold1: mtry= 47, min.node.size=1, splitrule=extratrees
## + Fold1: mtry=1123, min.node.size=1, splitrule=extratrees
## Growing trees.. Progress: 64%. Estimated remaining time: 17 seconds.
## - Fold1: mtry=1123, min.node.size=1, splitrule=extratrees
## + Fold2: mtry= 2, min.node.size=1, splitrule=gini
## - Fold2: mtry= 2, min.node.size=1, splitrule=gini
## + Fold2: mtry= 47, min.node.size=1, splitrule=gini
## - Fold2: mtry= 47, min.node.size=1, splitrule=gini
## + Fold2: mtry=1123, min.node.size=1, splitrule=gini
## Growing trees.. Progress: 70%. Estimated remaining time: 13 seconds.
## - Fold2: mtry=1123, min.node.size=1, splitrule=gini
## + Fold2: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold2: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold2: mtry= 47, min.node.size=1, splitrule=extratrees
## - Fold2: mtry= 47, min.node.size=1, splitrule=extratrees
## + Fold2: mtry=1123, min.node.size=1, splitrule=extratrees
## Growing trees.. Progress: 61%. Estimated remaining time: 20 seconds.
## - Fold2: mtry=1123, min.node.size=1, splitrule=extratrees
## + Fold3: mtry= 2, min.node.size=1, splitrule=gini
## - Fold3: mtry= 2, min.node.size=1, splitrule=gini
## + Fold3: mtry= 47, min.node.size=1, splitrule=gini
## - Fold3: mtry= 47, min.node.size=1, splitrule=gini
## + Fold3: mtry=1123, min.node.size=1, splitrule=gini
## Growing trees.. Progress: 73%. Estimated remaining time: 11 seconds.
## - Fold3: mtry=1123, min.node.size=1, splitrule=gini
## + Fold3: mtry= 2, min.node.size=1, splitrule=extratrees
## - Fold3: mtry= 2, min.node.size=1, splitrule=extratrees
## + Fold3: mtry= 47, min.node.size=1, splitrule=extratrees
## - Fold3: mtry= 47, min.node.size=1, splitrule=extratrees
## + Fold3: mtry=1123, min.node.size=1, splitrule=extratrees
## Growing trees.. Progress: 61%. Estimated remaining time: 20 seconds.
## - Fold3: mtry=1123, min.node.size=1, splitrule=extratrees
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 47, splitrule = gini, min.node.size = 1 on full training set
model
## Random Forest
##
## 8829 samples
## 1123 predictors
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 5886, 5886, 5886
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.9193567 0.7823802
## 2 extratrees 0.9083701 0.7488211
## 47 gini 0.9800657 0.9502959
## 47 extratrees 0.9768943 0.9421166
## 1123 gini 0.9619436 0.9047048
## 1123 extratrees 0.9790463 0.9479912
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 47, splitrule = gini
## and min.node.size = 1.
Above, we can see the output of the model. On our training data, the accuracy of the best model (mtry = 47, gini) was ~98%.
Let’s use ggplot to visualize the accuracy across the different tuning parameters the model tried:
ggplot(model)
Because we used importance = “impurity”, we can also look at the individual predictors’ strengths. This will show us the most predictive attributes:
varImp(model, scale = TRUE)$importance %>%
  rownames_to_column() %>%
  arrange(-Overall) %>%
  top_n(25) %>%
  ggplot() +
  aes(x = reorder(rowname, Overall), y = Overall) +
  geom_col() +
  labs(title = "Most Predictive Words") +
  coord_flip()
## Selecting by Overall
Looking at the chart above, some obvious words like “click” and “offer” probably indicate spam fairly reliably. Another thing to note is that there are quite a few HTML terms here; perhaps HTML tags in an email often indicate spam, while people emailing each other directly rarely include them.
Finally, let’s use our model to make predictions on our test data set.
predictions <- predict(model, test_df)
confusionMatrix(predictions, as.factor(spam_ham_final[as.integer(rownames(test_df)), ]$indicator))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2108 26
## 1 35 774
##
## Accuracy : 0.9793
## 95% CI : (0.9735, 0.9841)
## No Information Rate : 0.7282
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9478
##
## Mcnemar's Test P-Value : 0.3057
##
## Sensitivity : 0.9837
## Specificity : 0.9675
## Pos Pred Value : 0.9878
## Neg Pred Value : 0.9567
## Prevalence : 0.7282
## Detection Rate : 0.7163
## Detection Prevalence : 0.7251
## Balanced Accuracy : 0.9756
##
## 'Positive' Class : 0
##
It looks like our model was 97.93% accurate in distinguishing ham from spam. Looking at the confusion matrix above, the model incorrectly labeled 35 ham emails as spam and 26 spam emails as ham, while correctly categorizing 2,882 emails.
We were able to build an accurate spam/ham classifier using a random forest. Because the model was quite accurate, other models were not explored here; to extend this work, additional models should be tested, especially those with particular strength in classification such as support vector machines (SVM). Additionally, there is more text pre-processing we could do to potentially improve accuracy, such as combining term frequency with tf-idf or using a different weighting method when creating our document-term matrix.
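As a hedged sketch of that SVM extension (not run here; “svmLinear” is a method caret supports via the kernlab package, but the settings below are assumptions rather than tested choices), a linear SVM could be trained on the same tf-idf features:
# Hypothetical extension: a linear SVM on the same tf-idf features via caret.
# Reuses the label-alignment approach from the random forest fit above.
svm_model <- train(
  x = train_df,
  y = as.factor(spam_ham_final[as.integer(rownames(train_df)), ]$indicator),
  method = "svmLinear",
  trControl = trainControl(method = "cv", number = 3)
)
svm_predictions <- predict(svm_model, test_df)
confusionMatrix(svm_predictions,
                as.factor(spam_ham_final[as.integer(rownames(test_df)), ]$indicator))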