Project 4: Spam Filter

Data

The two files selected are located here:

https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2

To begin, the following packages are loaded: tidyverse, tidytext, stringr, caret, tm, and data.table (the last is needed for setDT, used below).

rm(list=ls())

library(tidyverse)
library(tidytext)
library(stringr)
library(caret)
library(tm)
library(data.table)

Spam and ham emails are loaded separately as simple text files, then combined for cleaning.

#list ham and spam files, then build full paths to each file
ham<-list.files('C:/MSDS/spamham/easy_ham')
ham2<-paste('C:/MSDS/spamham/easy_ham/',ham,sep='')
spam<-list.files('C:/MSDS/spamham/spam')
spam2<-paste('C:/MSDS/spamham/spam/',spam,sep='')

#combine paths: ham first, then spam
mlist<-append(ham2,spam2)

#read each file's full contents into one row of a data frame
emails <- data.frame(files= sapply(mlist, FUN = function(x)readChar(x, file.info(x)$size)),
                  stringsAsFactors=FALSE)

The row names (the file paths) are captured with the data.table package, and the cleaning proceeds slowly and methodically. While this could be accomplished in fewer lines of code, the result was checked after each step. Ultimately each document is reduced to a list of single-word tokens.
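The tokenizing code itself is not reproduced here. The sketch below shows one way the email_tokens object could have been built, assuming the file path (row name) serves as the document ID, stop words are removed, and SnowballC::wordStem supplies the stemming suggested by tokens such as "receiv" and "repli"; the exact original steps may differ.

#hypothetical sketch (not the original code): label, ID, and tokenize the emails
emails$spam <- c(rep(0, length(ham2)), rep(1, length(spam2)))  #ham first, then spam
setDT(emails, keep.rownames = TRUE)                            #file path captured as column "rn"
emails <- as_tibble(emails) %>% rename(ID = rn)

email_tokens <- emails %>%
  unnest_tokens(word, files) %>%              #one row per single-word token per document
  anti_join(stop_words, by = "word") %>%      #drop common English stop words
  filter(str_detect(word, "^[a-z]+$")) %>%    #keep purely alphabetic tokens
  mutate(word = SnowballC::wordStem(word))    #stem, e.g. "replies" -> "repli"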

A document-term matrix is created. A tf-idf weighted version is inspected first: the tf-idf weight increases with a term's frequency within a document but is discounted for terms that appear in many documents across the corpus. A count-based matrix is then cast for modeling, and sparse terms are eliminated with the removeSparseTerms function.
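As a quick aside (a toy example, not part of the original analysis), the weighting can be seen on a three-document corpus: "prize" occurs only in the first document and receives a high tf-idf, while "meeting" occurs in every document and receives zero. The full-corpus matrix is built below.

#toy illustration: a term concentrated in one document gets a high tf-idf weight,
#while a term found in every document gets zero
tibble(doc  = c(1, 1, 1, 2, 2, 3, 3),
       word = c("prize", "prize", "meeting", "meeting", "agenda", "meeting", "notes")) %>%
  count(doc, word) %>%
  bind_tf_idf(term = word, document = doc, n = n)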

email_tokens %>%
  #get count
  count(ID, word) %>%
  #document term matrix created with tf-idf
  cast_dtm(document = ID, term = word, value = n,
           weighting = tm::weightTfIdf)
## <<DocumentTermMatrix (documents: 3000, terms: 24737)>>
## Non-/sparse entries: 138283/74072717
## Sparsity           : 100%
## Maximal term length: 10
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
emails_dtm <- email_tokens %>%
   count(ID,word) %>%
   cast_dtm(document=ID,term = word, value = n)


#drop terms that are absent from more than 99% of documents
emailsNoSparse_dtm <- removeSparseTerms(emails_dtm, sparse = .99)
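As a quick check (not shown in the original output), the dimensions before and after removing sparse terms can be compared; the reduced matrix keeps the 928 terms that later serve as model predictors.

#compare matrix dimensions before and after dropping sparse terms
dim(emails_dtm)
dim(emailsNoSparse_dtm)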

Tf-idf weighted word frequencies are plotted for ham and spam separately (grouped by the spam indicator: 0 = ham, 1 = spam).

emails_tfidf <- email_tokens %>%
   count(spam, word) %>%
   bind_tf_idf(term = word, document = spam, n = n)


#sort, convert to factor
plot_emails <- emails_tfidf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

#plot 10 most frequent spam and ham tokens
plot_emails %>%
  filter(spam %in% c(0, 1)) %>%
  mutate(spam = factor(spam, levels = c(0, 1),
                        labels = c("Ham", "Spam"))) %>%
  group_by(spam) %>%
  top_n(10) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf)) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~spam, scales = "free") +
  coord_flip()

The reduced matrix is converted to a data frame and the spam indicator is appended. A seed is set, and a random sample of 70% of the documents forms the training set, with the remaining 30% held out for testing. The spam classification column is then separated from the predictor columns in both the training and testing sets.

#form data frame with cleaned tokens and indicator variable
CleanEmail<-data.frame(as.matrix(emailsNoSparse_dtm),emails$spam)

#set seed to maintain random sample consistency; make 70-30 split
set.seed(1973)
rownums<-sample(nrow(CleanEmail),nrow(CleanEmail)*.7)

#form training set and test set; drop the final column (the spam indicator) from the predictor frames
trainSet<-CleanEmail[rownums,]
trainSet_Pre<-trainSet[,-ncol(trainSet)]
testSet<-CleanEmail[-rownums,]
testSet_Pre<-testSet[,-ncol(testSet)]
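As an optional sanity check on the split (not part of the original write-up), the class balance of the two sets can be compared; the proportions should be similar to the full corpus.

#check the proportion of ham (0) and spam (1) in each split
prop.table(table(trainSet$emails.spam))
prop.table(table(testSet$emails.spam))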

Model 1 - Random Forest

The first model is created with the Random Forest algorithm, with ntree set to 50. The out-of-bag (oob) estimate is set as the trainControl method, with other options left at their defaults.

#train a random forest on the predictor columns against the spam indicator
model1 <- train(x = trainSet_Pre,
                     y = factor(trainSet$emails.spam),
                     method = "rf",
                     ntree = 50,
                     trControl = trainControl(method = "oob"))


#view result
model1$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 50, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 43
## 
##         OOB estimate of  error rate: 4.62%
## Confusion matrix:
##      0   1 class.error
## 0 1719  27  0.01546392
## 1   70 284  0.19774011

The fitted model is then tested against the held-out test set, and accuracy is calculated.

#test model
predictions<-predict(model1,newdata = testSet_Pre)

#calculate accuracy
compare<-data.frame(testSet$emails.spam,predictions)
compare$correct<-ifelse(compare$testSet.emails.spam == compare$predictions,1,0)
accuracy<-round(sum(compare$correct)*100/nrow(compare),1)

cat("Accuracy:",accuracy,"%")
## Accuracy: 97 %
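As an optional cross-check (not shown in the original output), caret's confusionMatrix reports the same accuracy along with sensitivity, specificity, and kappa:

#same comparison via caret; align factor levels between predictions and reference
confusionMatrix(predictions, factor(testSet$emails.spam, levels = levels(predictions)))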

The importance of each term is plotted; if necessary, further data cleaning is performed and the model is retrained.

#grab importance via varImp
imp<-varImp(model1,scale=FALSE)

imp2<-data.frame(imp["importance"])
setDT(imp2, keep.rownames = TRUE)[]
##           rn      Overall
##   1:    chri 0.0385162907
##   2: command 0.1026969697
##   3:  compil 0.1826702603
##   4:   creat 0.7032410226
##   5: develop 0.8482434523
##  ---                     
## 924:   spamd 0.1412943316
## 925:  jabber 0.0695238095
## 926:  jeremi 0.0007573919
## 927: spambay 0.4758320720
## 928:   guido 0.0345516437
imp2<-imp2[which(imp2$Overall>9),]
colnames(imp2)[1]<-"word"
colnames(imp2)[2]<-"importance"

ggplot(imp2, aes(x=reorder(word, importance), weight=importance, fill=as.factor(importance)))+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model 2 - Stochastic Gradient Boosting

In order to train a model with the gbm method, I loaded the package of the same name. The trainControl method is set to cv, the number of folds to 3, classProbs to TRUE, and summaryFunction to twoClassSummary. Since spam vs. ham classification is a two-class problem, I use metric = "ROC" in the train function; according to the caret documentation, the area under the ROC curve is calculated only for two-class models.

library('gbm')

#recode spam indicator variable
trainSet$emails.spam<-ifelse(trainSet$emails.spam==1,"spam","ham")
testSet$emails.spam<-ifelse(testSet$emails.spam==1,"spam","ham")

ctrl <- trainControl(method='cv',
                     number=3,
                     returnResamp='none',
                     summaryFunction = twoClassSummary, 
                     classProbs = TRUE)

model2 <- train(x = trainSet_Pre,
                y = factor(trainSet$emails.spam),
                method='gbm',
                trControl=ctrl,
                metric = "ROC",
                preProc = c("center", "scale"))
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8826             nan     0.1000    0.0113
##      2        0.8659             nan     0.1000    0.0072
##      3        0.8518             nan     0.1000    0.0033
##      4        0.8367             nan     0.1000    0.0069
##      5        0.8254             nan     0.1000    0.0037
##      6        0.8085             nan     0.1000    0.0058
##      7        0.7962             nan     0.1000    0.0050
##      8        0.7825             nan     0.1000    0.0060
##      9        0.7749             nan     0.1000    0.0030
##     10        0.7646             nan     0.1000    0.0034
##     20        0.6968             nan     0.1000    0.0003
##     40        0.6188             nan     0.1000    0.0013
##     60        0.5673             nan     0.1000   -0.0001
##     80        0.5306             nan     0.1000   -0.0005
##    100        0.4969             nan     0.1000    0.0001
##    120        0.4709             nan     0.1000    0.0000
##    140        0.4466             nan     0.1000    0.0003
##    150        0.4326             nan     0.1000   -0.0000
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8697             nan     0.1000    0.0155
##      2        0.8394             nan     0.1000    0.0149
##      3        0.8145             nan     0.1000    0.0094
##      4        0.7920             nan     0.1000    0.0089
##      5        0.7752             nan     0.1000    0.0066
##      6        0.7589             nan     0.1000    0.0073
##      7        0.7415             nan     0.1000    0.0081
##      8        0.7258             nan     0.1000    0.0048
##      9        0.7143             nan     0.1000    0.0053
##     10        0.7004             nan     0.1000    0.0059
##     20        0.6182             nan     0.1000    0.0020
##     40        0.5278             nan     0.1000    0.0022
##     60        0.4609             nan     0.1000    0.0003
##     80        0.4170             nan     0.1000   -0.0005
##    100        0.3821             nan     0.1000    0.0002
##    120        0.3544             nan     0.1000   -0.0005
##    140        0.3320             nan     0.1000   -0.0003
##    150        0.3230             nan     0.1000   -0.0006
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8668             nan     0.1000    0.0173
##      2        0.8308             nan     0.1000    0.0161
##      3        0.7963             nan     0.1000    0.0162
##      4        0.7735             nan     0.1000    0.0096
##      5        0.7515             nan     0.1000    0.0084
##      6        0.7312             nan     0.1000    0.0080
##      7        0.7116             nan     0.1000    0.0081
##      8        0.6949             nan     0.1000    0.0059
##      9        0.6812             nan     0.1000    0.0045
##     10        0.6674             nan     0.1000    0.0055
##     20        0.5558             nan     0.1000    0.0021
##     40        0.4589             nan     0.1000   -0.0004
##     60        0.4013             nan     0.1000   -0.0001
##     80        0.3569             nan     0.1000   -0.0000
##    100        0.3174             nan     0.1000    0.0008
##    120        0.2927             nan     0.1000   -0.0006
##    140        0.2682             nan     0.1000    0.0003
##    150        0.2577             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8781             nan     0.1000    0.0109
##      2        0.8565             nan     0.1000    0.0086
##      3        0.8419             nan     0.1000    0.0047
##      4        0.8215             nan     0.1000    0.0102
##      5        0.8070             nan     0.1000    0.0062
##      6        0.7969             nan     0.1000    0.0018
##      7        0.7856             nan     0.1000    0.0033
##      8        0.7713             nan     0.1000    0.0039
##      9        0.7632             nan     0.1000    0.0025
##     10        0.7536             nan     0.1000    0.0037
##     20        0.6699             nan     0.1000    0.0020
##     40        0.5834             nan     0.1000    0.0003
##     60        0.5312             nan     0.1000   -0.0000
##     80        0.4898             nan     0.1000    0.0004
##    100        0.4547             nan     0.1000    0.0007
##    120        0.4243             nan     0.1000    0.0001
##    140        0.3975             nan     0.1000   -0.0001
##    150        0.3858             nan     0.1000   -0.0000
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8665             nan     0.1000    0.0162
##      2        0.8370             nan     0.1000    0.0121
##      3        0.8073             nan     0.1000    0.0131
##      4        0.7804             nan     0.1000    0.0116
##      5        0.7561             nan     0.1000    0.0094
##      6        0.7342             nan     0.1000    0.0090
##      7        0.7190             nan     0.1000    0.0063
##      8        0.7028             nan     0.1000    0.0061
##      9        0.6883             nan     0.1000    0.0054
##     10        0.6766             nan     0.1000    0.0031
##     20        0.5845             nan     0.1000    0.0021
##     40        0.4884             nan     0.1000    0.0013
##     60        0.4267             nan     0.1000    0.0001
##     80        0.3820             nan     0.1000    0.0004
##    100        0.3422             nan     0.1000    0.0004
##    120        0.3127             nan     0.1000    0.0002
##    140        0.2906             nan     0.1000   -0.0006
##    150        0.2791             nan     0.1000   -0.0004
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8561             nan     0.1000    0.0201
##      2        0.8146             nan     0.1000    0.0189
##      3        0.7742             nan     0.1000    0.0164
##      4        0.7412             nan     0.1000    0.0145
##      5        0.7175             nan     0.1000    0.0094
##      6        0.6979             nan     0.1000    0.0092
##      7        0.6768             nan     0.1000    0.0083
##      8        0.6580             nan     0.1000    0.0072
##      9        0.6417             nan     0.1000    0.0051
##     10        0.6285             nan     0.1000    0.0052
##     20        0.5291             nan     0.1000    0.0017
##     40        0.4203             nan     0.1000    0.0003
##     60        0.3523             nan     0.1000    0.0005
##     80        0.3096             nan     0.1000    0.0000
##    100        0.2758             nan     0.1000   -0.0006
##    120        0.2502             nan     0.1000   -0.0003
##    140        0.2259             nan     0.1000    0.0000
##    150        0.2171             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8816             nan     0.1000    0.0124
##      2        0.8610             nan     0.1000    0.0087
##      3        0.8430             nan     0.1000    0.0079
##      4        0.8238             nan     0.1000    0.0104
##      5        0.8122             nan     0.1000    0.0044
##      6        0.7966             nan     0.1000    0.0062
##      7        0.7876             nan     0.1000    0.0040
##      8        0.7774             nan     0.1000    0.0048
##      9        0.7650             nan     0.1000    0.0065
##     10        0.7570             nan     0.1000    0.0024
##     20        0.6861             nan     0.1000    0.0043
##     40        0.6082             nan     0.1000    0.0008
##     60        0.5518             nan     0.1000    0.0003
##     80        0.5082             nan     0.1000    0.0001
##    100        0.4727             nan     0.1000    0.0010
##    120        0.4431             nan     0.1000    0.0009
##    140        0.4209             nan     0.1000    0.0000
##    150        0.4095             nan     0.1000   -0.0006
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8677             nan     0.1000    0.0187
##      2        0.8351             nan     0.1000    0.0150
##      3        0.8072             nan     0.1000    0.0140
##      4        0.7818             nan     0.1000    0.0111
##      5        0.7604             nan     0.1000    0.0067
##      6        0.7431             nan     0.1000    0.0072
##      7        0.7257             nan     0.1000    0.0069
##      8        0.7083             nan     0.1000    0.0084
##      9        0.6948             nan     0.1000    0.0051
##     10        0.6855             nan     0.1000    0.0032
##     20        0.6014             nan     0.1000    0.0015
##     40        0.5022             nan     0.1000    0.0016
##     60        0.4337             nan     0.1000   -0.0003
##     80        0.3885             nan     0.1000    0.0001
##    100        0.3560             nan     0.1000    0.0004
##    120        0.3313             nan     0.1000    0.0006
##    140        0.3105             nan     0.1000   -0.0005
##    150        0.2987             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8490             nan     0.1000    0.0254
##      2        0.8107             nan     0.1000    0.0150
##      3        0.7749             nan     0.1000    0.0166
##      4        0.7479             nan     0.1000    0.0095
##      5        0.7257             nan     0.1000    0.0085
##      6        0.7037             nan     0.1000    0.0092
##      7        0.6857             nan     0.1000    0.0076
##      8        0.6695             nan     0.1000    0.0077
##      9        0.6561             nan     0.1000    0.0060
##     10        0.6434             nan     0.1000    0.0044
##     20        0.5516             nan     0.1000    0.0027
##     40        0.4364             nan     0.1000    0.0016
##     60        0.3726             nan     0.1000    0.0008
##     80        0.3239             nan     0.1000    0.0003
##    100        0.2908             nan     0.1000    0.0001
##    120        0.2629             nan     0.1000   -0.0002
##    140        0.2423             nan     0.1000   -0.0000
##    150        0.2318             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.8578             nan     0.1000    0.0247
##      2        0.8231             nan     0.1000    0.0162
##      3        0.7945             nan     0.1000    0.0121
##      4        0.7685             nan     0.1000    0.0135
##      5        0.7433             nan     0.1000    0.0103
##      6        0.7257             nan     0.1000    0.0062
##      7        0.7074             nan     0.1000    0.0078
##      8        0.6925             nan     0.1000    0.0066
##      9        0.6741             nan     0.1000    0.0077
##     10        0.6610             nan     0.1000    0.0049
##     20        0.5665             nan     0.1000    0.0019
##     40        0.4539             nan     0.1000    0.0030
##     60        0.3930             nan     0.1000    0.0004
##     80        0.3470             nan     0.1000    0.0003
##    100        0.3114             nan     0.1000    0.0001
##    120        0.2858             nan     0.1000    0.0004
##    140        0.2634             nan     0.1000    0.0001
##    150        0.2526             nan     0.1000    0.0002
summary(model2)

##                   var    rel.inf
## offer           offer 7.02960982
## price           price 6.79891606
## receiv         receiv 6.78576067
## repli           repli 6.09359345
## friend         friend 4.92132807
## guarante     guarante 4.68478927
## express       express 3.86752829
## visit           visit 3.85957286
## market         market 2.87438036
## invest         invest 2.76901818
## servic         servic 2.39294092
## compani       compani 2.04942107
## insur           insur 1.89899953
## sponsor       sponsor 1.79690258
## assist         assist 1.71460335
## discuss       discuss 1.62791137
## onlin           onlin 1.61915292
## request       request 1.57159721
## lowest         lowest 1.55299280
## payment       payment 1.48622362
## copyright   copyright 1.45478962
## address       address 1.19678453
## mortgag       mortgag 1.19312593
## affili         affili 1.00099669
## career         career 0.97652546
## sincer         sincer 0.96975360
## busi             busi 0.92842568
## freebsd       freebsd 0.90538667
## world           world 0.84906260
## purchas       purchas 0.79653181
## right           right 0.74622896
## websit         websit 0.71689583
## futur           futur 0.71673288
## fortun         fortun 0.64641897
## internet     internet 0.63006192
## doesn           doesn 0.62026735
## movi             movi 0.61012127
## invok           invok 0.60868245
## social         social 0.59321035
## access         access 0.55777236
## newslett     newslett 0.51246726
## review         review 0.47494571
## deliv           deliv 0.46925789
## credit         credit 0.44781685
## enter           enter 0.43975721
## minut           minut 0.39813084
## time             time 0.38584862
## found           found 0.37050368
## contract     contract 0.36453958
## develop       develop 0.35995112
## deposit       deposit 0.32504765
## brand           brand 0.31874301
## subject       subject 0.31527663
## site             site 0.30167574
## technologi technologi 0.30138848
## system         system 0.29501632
## press           press 0.29010999
## dollar         dollar 0.28037739
## motiv           motiv 0.27831719
## email           email 0.24932220
## sens             sens 0.24240280
## retail         retail 0.24170587
## pai               pai 0.24125785
## qualifi       qualifi 0.23028890
## peopl           peopl 0.22539038
## effici         effici 0.21595124
## central       central 0.21443726
## couldn         couldn 0.20423612
## unknown       unknown 0.18119613
## linux           linux 0.18087056
## boundari     boundari 0.17790217
## spammer       spammer 0.17074729
## quick           quick 0.16856366
## simpli         simpli 0.16822611
## suppli         suppli 0.15991908
## spambay       spambay 0.15541360
## video           video 0.14691636
## server         server 0.14565499
## attack         attack 0.14335135
## header         header 0.14328251
## design         design 0.14027181
## director     director 0.13737613
## help             help 0.13637051
## check           check 0.13597645
## obtain         obtain 0.13231715
## septemb       septemb 0.12571596
## post             post 0.12477388
## class           class 0.12435633
## user             user 0.12293578
## dogma           dogma 0.11867748
## condit         condit 0.11658123
## wait             wait 0.11566286
## remain         remain 0.11457226
## rate             rate 0.11019752
## regist         regist 0.10851258
## david           david 0.10762326
## irish           irish 0.10614492
## ship             ship 0.10609431
## hous             hous 0.10340517
## caus             caus 0.10310448
## messag         messag 0.09997960
## bring           bring 0.09373585
## score           score 0.09296631
## includ         includ 0.09031352
## advic           advic 0.08787705
## propos         propos 0.08563564
## hardwar       hardwar 0.08226531
## stuff           stuff 0.08094199
## approv         approv 0.08073328
## tell             tell 0.07780140
## content       content 0.07539032
## titl             titl 0.07145358
## straight     straight 0.07137487
## argument     argument 0.06878779
## subscrib     subscrib 0.06837872
## countri       countri 0.06471925
## word             word 0.06136472
## watch           watch 0.05936653
## guess           guess 0.05856722
## oblig           oblig 0.05769310
## privat         privat 0.05763819
## train           train 0.05685886
## sender         sender 0.05637963
## run               run 0.05486857
## produc         produc 0.05393823
## spend           spend 0.05057506
## chang           chang 0.04594050
## special       special 0.04569715
## origin         origin 0.04280646
## account       account 0.04265751
## opinion       opinion 0.04239576
## softwar       softwar 0.04217730
## execut         execut 0.04031843
## file             file 0.03952879
## black           black 0.03946849
## yahoo           yahoo 0.03896471
## chanc           chanc 0.03735082
## mobil           mobil 0.03673048
## languag       languag 0.03626788
## sale             sale 0.03532746
## note             note 0.03505997
## result         result 0.03394751
## appear         appear 0.03350941
## come             come 0.03235181
## chri             chri 0.00000000
## command       command 0.00000000
## compil         compil 0.00000000
## creat           creat 0.00000000
## error           error 0.00000000
## garrigu       garrigu 0.00000000
## happen         happen 0.00000000
## haven           haven 0.00000000
## issu             issu 0.00000000
## local           local 0.00000000
## mail             mail 0.00000000
## mark             mark 0.00000000
## reach           reach 0.00000000
## relev           relev 0.00000000
## repositori repositori 0.00000000
## search         search 0.00000000
## sequenc       sequenc 0.00000000
## version       version 0.00000000
## window         window 0.00000000
## worker         worker 0.00000000
## commun         commun 0.00000000
## featur         featur 0.00000000
## plan             plan 0.00000000
## pretti         pretti 0.00000000
## agenc           agenc 0.00000000
## attempt       attempt 0.00000000
## august         august 0.00000000
## block           block 0.00000000
## build           build 0.00000000
## carri           carri 0.00000000
## claim           claim 0.00000000
## continu       continu 0.00000000
## demand         demand 0.00000000
## detail         detail 0.00000000
## feder           feder 0.00000000
## forc             forc 0.00000000
## front           front 0.00000000
## govern         govern 0.00000000
## hour             hour 0.00000000
## locat           locat 0.00000000
## offic           offic 0.00000000
## offici         offici 0.00000000
## polic           polic 0.00000000
## presid         presid 0.00000000
## radio           radio 0.00000000
## report         report 0.00000000
## secur           secur 0.00000000
## street         street 0.00000000
## talk             talk 0.00000000
## thursdai     thursdai 0.00000000
## approach     approach 0.00000000
## british       british 0.00000000
## daili           daili 0.00000000
## estim           estim 0.00000000
## expert         expert 0.00000000
## hold             hold 0.00000000
## magazin       magazin 0.00000000
## month           month 0.00000000
## summer         summer 0.00000000
## viru             viru 0.00000000
## virus           virus 0.00000000
## ad                 ad 0.00000000
## effect         effect 0.00000000
## ident           ident 0.00000000
## person         person 0.00000000
## suppos         suppos 0.00000000
## univers       univers 0.00000000
## basic           basic 0.00000000
## attract       attract 0.00000000
## beauti         beauti 0.00000000
## final           final 0.00000000
## heart           heart 0.00000000
## imagin         imagin 0.00000000
## moment         moment 0.00000000
## owner           owner 0.00000000
## partner       partner 0.00000000
## prefer         prefer 0.00000000
## promis         promis 0.00000000
## quickli       quickli 0.00000000
## sell             sell 0.00000000
## slashnul     slashnul 0.00000000
## woman           woman 0.00000000
## women           women 0.00000000
## apologi       apologi 0.00000000
## possibli     possibli 0.00000000
## resid           resid 0.00000000
## worst           worst 0.00000000
## believ         believ 0.00000000
## book             book 0.00000000
## call             call 0.00000000
## choic           choic 0.00000000
## creativ       creativ 0.00000000
## critic         critic 0.00000000
## cultur         cultur 0.00000000
## current       current 0.00000000
## danger         danger 0.00000000
## death           death 0.00000000
## discov         discov 0.00000000
## doubt           doubt 0.00000000
## dream           dream 0.00000000
## earn             earn 0.00000000
## educ             educ 0.00000000
## expens         expens 0.00000000
## extra           extra 0.00000000
## feel             feel 0.00000000
## figur           figur 0.00000000
## focu             focu 0.00000000
## freedom       freedom 0.00000000
## gordon         gordon 0.00000000
## guid             guid 0.00000000
## hand             hand 0.00000000
## health         health 0.00000000
## hear             hear 0.00000000
## histori       histori 0.00000000
## industri     industri 0.00000000
## insid           insid 0.00000000
## laugh           laugh 0.00000000
## learn           learn 0.00000000
## light           light 0.00000000
## live             live 0.00000000
## machin         machin 0.00000000
## michael       michael 0.00000000
## million       million 0.00000000
## move             move 0.00000000
## octob           octob 0.00000000
## past             past 0.00000000
## power           power 0.00000000
## publish       publish 0.00000000
## read             read 0.00000000
## reason         reason 0.00000000
## recent         recent 0.00000000
## releas         releas 0.00000000
## rememb         rememb 0.00000000
## respect       respect 0.00000000
## secret         secret 0.00000000
## seri             seri 0.00000000
## share           share 0.00000000
## sign             sign 0.00000000
## societi       societi 0.00000000
## space           space 0.00000000
## speak           speak 0.00000000
## spent           spent 0.00000000
## stand           stand 0.00000000
## start           start 0.00000000
## steve           steve 0.00000000
## style           style 0.00000000
## tabl             tabl 0.00000000
## teach           teach 0.00000000
## troubl         troubl 0.00000000
## virtual       virtual 0.00000000
## william       william 0.00000000
## worth           worth 0.00000000
## write           write 0.00000000
## written       written 0.00000000
## atalk           atalk 0.00000000
## base             base 0.00000000
## client         client 0.00000000
## engin           engin 0.00000000
## exist           exist 0.00000000
## filter         filter 0.00000000
## give             give 0.00000000
## interfac     interfac 0.00000000
## option         option 0.00000000
## phone           phone 0.00000000
## procmail     procmail 0.00000000
## provid         provid 0.00000000
## research     research 0.00000000
## script         script 0.00000000
## tire             tire 0.00000000
## articl         articl 0.00000000
## devel           devel 0.00000000
## handl           handl 0.00000000
## rule             rule 0.00000000
## default       default 0.00000000
## faster         faster 0.00000000
## flow             flow 0.00000000
## format         format 0.00000000
## requir         requir 0.00000000
## respond       respond 0.00000000
## suggest       suggest 0.00000000
## trick           trick 0.00000000
## updat           updat 0.00000000
## fail             fail 0.00000000
## instanc       instanc 0.00000000
## link             link 0.00000000
## notic           notic 0.00000000
## pass             pass 0.00000000
## popul           popul 0.00000000
## solv             solv 0.00000000
## austin         austin 0.00000000
## begin           begin 0.00000000
## bottom         bottom 0.00000000
## congress     congress 0.00000000
## displai       displai 0.00000000
## doer             doer 0.00000000
## fix               fix 0.00000000
## folder         folder 0.00000000
## funni           funni 0.00000000
## gnupg           gnupg 0.00000000
## leav             leav 0.00000000
## micalg         micalg 0.00000000
## robert         robert 0.00000000
## screen         screen 0.00000000
## signatur     signatur 0.00000000
## suit             suit 0.00000000
## take             take 0.00000000
## wrong           wrong 0.00000000
## acquir         acquir 0.00000000
## american     american 0.00000000
## avoid           avoid 0.00000000
## california california 0.00000000
## campaign     campaign 0.00000000
## chief           chief 0.00000000
## choos           choos 0.00000000
## commerci     commerci 0.00000000
## decid           decid 0.00000000
## defin           defin 0.00000000
## delet           delet 0.00000000
## direct         direct 0.00000000
## directli     directli 0.00000000
## effort         effort 0.00000000
## ensur           ensur 0.00000000
## equal           equal 0.00000000
## experi         experi 0.00000000
## field           field 0.00000000
## fund             fund 0.00000000
## level           level 0.00000000
## list             list 0.00000000
## make             make 0.00000000
## manag           manag 0.00000000
## media           media 0.00000000
## method         method 0.00000000
## nation         nation 0.00000000
## neg               neg 0.00000000
## opt               opt 0.00000000
## perfect       perfect 0.00000000
## perform       perform 0.00000000
## plai             plai 0.00000000
## polit           polit 0.00000000
## potenti       potenti 0.00000000
## practic       practic 0.00000000
## prepar         prepar 0.00000000
## primari       primari 0.00000000
## public         public 0.00000000
## qualiti       qualiti 0.00000000
## recipi         recipi 0.00000000
## respons       respons 0.00000000
## seek             seek 0.00000000
## send             send 0.00000000
## simpl           simpl 0.00000000
## specif         specif 0.00000000
## standard     standard 0.00000000
## support       support 0.00000000
## target         target 0.00000000
## typic           typic 0.00000000
## cheer           cheer 0.00000000
## connect       connect 0.00000000
## network       network 0.00000000
## question     question 0.00000000
## appar           appar 0.00000000
## children     children 0.00000000
## comput         comput 0.00000000
## evid             evid 0.00000000
## googl           googl 0.00000000
## hope             hope 0.00000000
## kiddi           kiddi 0.00000000
## name             name 0.00000000
## pictur         pictur 0.00000000
## similar       similar 0.00000000
## step             step 0.00000000
## trust           trust 0.00000000
## complet       complet 0.00000000
## oper             oper 0.00000000
## anymor         anymor 0.00000000
## bui               bui 0.00000000
## famili         famili 0.00000000
## inlin           inlin 0.00000000
## maintain     maintain 0.00000000
## redhat         redhat 0.00000000
## tool             tool 0.00000000
## vendor         vendor 0.00000000
## term             term 0.00000000
## product       product 0.00000000
## relat           relat 0.00000000
## statement   statement 0.00000000
## truth           truth 0.00000000
## agent           agent 0.00000000
## america       america 0.00000000
## assum           assum 0.00000000
## decad           decad 0.00000000
## enjoi           enjoi 0.00000000
## human           human 0.00000000
## keep             keep 0.00000000
## legal           legal 0.00000000
## meant           meant 0.00000000
## middl           middl 0.00000000
## obviou         obviou 0.00000000
## packag         packag 0.00000000
## parti           parti 0.00000000
## profit         profit 0.00000000
## properti     properti 0.00000000
## put               put 0.00000000
## record         record 0.00000000
## refer           refer 0.00000000
## roger           roger 0.00000000
## section       section 0.00000000
## solut           solut 0.00000000
## sound           sound 0.00000000
## download     download 0.00000000
## driver         driver 0.00000000
## folk             folk 0.00000000
## idea             idea 0.00000000
## instal         instal 0.00000000
## panel           panel 0.00000000
## peter           peter 0.00000000
## memori         memori 0.00000000
## uniqu           uniqu 0.00000000
## wonder         wonder 0.00000000
## activ           activ 0.00000000
## answer         answer 0.00000000
## arriv           arriv 0.00000000
## close           close 0.00000000
## copi             copi 0.00000000
## drive           drive 0.00000000
## game             game 0.00000000
## initi           initi 0.00000000
## physic         physic 0.00000000
## situat         situat 0.00000000
## understand understand 0.00000000
## wednesdai   wednesdai 0.00000000
## wors             wors 0.00000000
## wrote           wrote 0.00000000
## heard           heard 0.00000000
## random         random 0.00000000
## author         author 0.00000000
## confid         confid 0.00000000
## declin         declin 0.00000000
## econom         econom 0.00000000
## economi       economi 0.00000000
## french         french 0.00000000
## georg           georg 0.00000000
## harlei         harlei 0.00000000
## leader         leader 0.00000000
## recal           recal 0.00000000
## stori           stori 0.00000000
## mondai         mondai 0.00000000
## return         return 0.00000000
## white           white 0.00000000
## modul           modul 0.00000000
## separ           separ 0.00000000
## appli           appli 0.00000000
## catch           catch 0.00000000
## expect         expect 0.00000000
## mimeol         mimeol 0.00000000
## music           music 0.00000000
## abil             abil 0.00000000
## agre             agre 0.00000000
## britain       britain 0.00000000
## center         center 0.00000000
## confirm       confirm 0.00000000
## elect           elect 0.00000000
## emerg           emerg 0.00000000
## franc           franc 0.00000000
## line             line 0.00000000
## natur           natur 0.00000000
## sourc           sourc 0.00000000
## speech         speech 0.00000000
## spring         spring 0.00000000
## strategi     strategi 0.00000000
## studi           studi 0.00000000
## washington washington 0.00000000
## cover           cover 0.00000000
## improv         improv 0.00000000
## knowledg     knowledg 0.00000000
## piec             piec 0.00000000
## fals             fals 0.00000000
## blame           blame 0.00000000
## collect       collect 0.00000000
## depend         depend 0.00000000
## mean             mean 0.00000000
## accur           accur 0.00000000
## care             care 0.00000000
## correct       correct 0.00000000
## desir           desir 0.00000000
## devic           devic 0.00000000
## identifi     identifi 0.00000000
## load             load 0.00000000
## model           model 0.00000000
## surpris       surpris 0.00000000
## appl             appl 0.00000000
## framework   framework 0.00000000
## fight           fight 0.00000000
## daniel         daniel 0.00000000
## night           night 0.00000000
## sundai         sundai 0.00000000
## wouldn         wouldn 0.00000000
## absolut       absolut 0.00000000
## configur     configur 0.00000000
## forget         forget 0.00000000
## function.   function. 0.00000000
## manual         manual 0.00000000
## port             port 0.00000000
## advantag     advantag 0.00000000
## hundr           hundr 0.00000000
## innov           innov 0.00000000
## lawrenc       lawrenc 0.00000000
## lead             lead 0.00000000
## measur         measur 0.00000000
## murphi         murphi 0.00000000
## picasso       picasso 0.00000000
## telephon     telephon 0.00000000
## thousand     thousand 0.00000000
## useless       useless 0.00000000
## bunch           bunch 0.00000000
## basi             basi 0.00000000
## benefit       benefit 0.00000000
## charg           charg 0.00000000
## coupl           coupl 0.00000000
## habea           habea 0.00000000
## heaven         heaven 0.00000000
## incom           incom 0.00000000
## justin         justin 0.00000000
## letter         letter 0.00000000
## licens         licens 0.00000000
## mason           mason 0.00000000
## posit           posit 0.00000000
## purpos         purpos 0.00000000
## reject         reject 0.00000000
## warrant       warrant 0.00000000
## directori   directori 0.00000000
## mozilla       mozilla 0.00000000
## speed           speed 0.00000000
## week             week 0.00000000
## short           short 0.00000000
## amount         amount 0.00000000
## annoi           annoi 0.00000000
## easili         easili 0.00000000
## increas       increas 0.00000000
## instant       instant 0.00000000
## document     document 0.00000000
## util             util 0.00000000
## imag             imag 0.00000000
## headlin       headlin 0.00000000
## mailer         mailer 0.00000000
## pudg             pudg 0.00000000
## reserv         reserv 0.00000000
## burn             burn 0.00000000
## comment       comment 0.00000000
## evolut         evolut 0.00000000
## import         import 0.00000000
## permiss       permiss 0.00000000
## weight         weight 0.00000000
## ximian         ximian 0.00000000
## backup         backup 0.00000000
## databas       databas 0.00000000
## process       process 0.00000000
## traffic       traffic 0.00000000
## transfer     transfer 0.00000000
## winter         winter 0.00000000
## archiv         archiv 0.00000000
## combin         combin 0.00000000
## confus         confus 0.00000000
## convinc       convinc 0.00000000
## earth           earth 0.00000000
## event           event 0.00000000
## form             form 0.00000000
## larger         larger 0.00000000
## launch         launch 0.00000000
## liber           liber 0.00000000
## look             look 0.00000000
## modern         modern 0.00000000
## photo           photo 0.00000000
## print           print 0.00000000
## track           track 0.00000000
## trade           trade 0.00000000
## equip           equip 0.00000000
## fridai         fridai 0.00000000
## limit           limit 0.00000000
## action         action 0.00000000
## attach         attach 0.00000000
## custom         custom 0.00000000
## easier         easier 0.00000000
## individu     individu 0.00000000
## intend         intend 0.00000000
## promot         promot 0.00000000
## total           total 0.00000000
## worri           worri 0.00000000
## amend           amend 0.00000000
## browser       browser 0.00000000
## edit             edit 0.00000000
## follow         follow 0.00000000
## permit         permit 0.00000000
## string         string 0.00000000
## built           built 0.00000000
## clean           clean 0.00000000
## freshrpm     freshrpm 0.00000000
## rebuild       rebuild 0.00000000
## addit           addit 0.00000000
## button         button 0.00000000
## charact       charact 0.00000000
## commiss       commiss 0.00000000
## decis           decis 0.00000000
## financi       financi 0.00000000
## matter         matter 0.00000000
## polici         polici 0.00000000
## treat           treat 0.00000000
## wireless     wireless 0.00000000
## entri           entri 0.00000000
## protect       protect 0.00000000
## technic       technic 0.00000000
## shouldn       shouldn 0.00000000
## announc       announc 0.00000000
## honor           honor 0.00000000
## intern         intern 0.00000000
## modifi         modifi 0.00000000
## program       program 0.00000000
## project       project 0.00000000
## replac         replac 0.00000000
## resolv         resolv 0.00000000
## confer         confer 0.00000000
## consum         consum 0.00000000
## depart         depart 0.00000000
## determin     determin 0.00000000
## difficult   difficult 0.00000000
## electron     electron 0.00000000
## extens         extens 0.00000000
## gener           gener 0.00000000
## involv         involv 0.00000000
## materi         materi 0.00000000
## period         period 0.00000000
## readi           readi 0.00000000
## school         school 0.00000000
## warn             warn 0.00000000
## beberg         beberg 0.00000000
## domain         domain 0.00000000
## duncan         duncan 0.00000000
## find             find 0.00000000
## major           major 0.00000000
## scientist   scientist 0.00000000
## angl             angl 0.00000000
## popular       popular 0.00000000
## cheap           cheap 0.00000000
## complex       complex 0.00000000
## cost             cost 0.00000000
## platform     platform 0.00000000
## reduc           reduc 0.00000000
## revers         revers 0.00000000
## test             test 0.00000000
## differ         differ 0.00000000
## object         object 0.00000000
## rang             rang 0.00000000
## techniqu     techniqu 0.00000000
## kevin           kevin 0.00000000
## count           count 0.00000000
## admit           admit 0.00000000
## amaz             amaz 0.00000000
## code             code 0.00000000
## librari       librari 0.00000000
## proper         proper 0.00000000
## think           think 0.00000000
## wast             wast 0.00000000
## yesterdai   yesterdai 0.00000000
## deal             deal 0.00000000
## remot           remot 0.00000000
## setup           setup 0.00000000
## accept         accept 0.00000000
## deliveri     deliveri 0.00000000
## set               set 0.00000000
## card             card 0.00000000
## earlier       earlier 0.00000000
## consid         consid 0.00000000
## stick           stick 0.00000000
## actual         actual 0.00000000
## enabl           enabl 0.00000000
## global         global 0.00000000
## tuesdai       tuesdai 0.00000000
## colleg         colleg 0.00000000
## concept       concept 0.00000000
## europ           europ 0.00000000
## extend         extend 0.00000000
## rais             rais 0.00000000
## scienc         scienc 0.00000000
## structur     structur 0.00000000
## happi           happi 0.00000000
## laptop         laptop 0.00000000
## emac             emac 0.00000000
## exchang       exchang 0.00000000
## mention       mention 0.00000000
## procedur     procedur 0.00000000
## singl           singl 0.00000000
## store           store 0.00000000
## eventu         eventu 0.00000000
## failur         failur 0.00000000
## morn             morn 0.00000000
## perfectli   perfectli 0.00000000
## prevent       prevent 0.00000000
## upgrad         upgrad 0.00000000
## allow           allow 0.00000000
## valid           valid 0.00000000
## forward       forward 0.00000000
## court           court 0.00000000
## advanc         advanc 0.00000000
## stream         stream 0.00000000
## largest       largest 0.00000000
## page             page 0.00000000
## shape           shape 0.00000000
## south           south 0.00000000
## theori         theori 0.00000000
## writer         writer 0.00000000
## arrest         arrest 0.00000000
## defend         defend 0.00000000
## north           north 0.00000000
## regular       regular 0.00000000
## anim             anim 0.00000000
## lower           lower 0.00000000
## player         player 0.00000000
## razor           razor 0.00000000
## excit           excit 0.00000000
## militari     militari 0.00000000
## bother         bother 0.00000000
## miss             miss 0.00000000
## python         python 0.00000000
## femal           femal 0.00000000
## control       control 0.00000000
## recommend   recommend 0.00000000
## favor           favor 0.00000000
## ground         ground 0.00000000
## break.         break. 0.00000000
## kill             kill 0.00000000
## damag           damag 0.00000000
## reveal         reveal 0.00000000
## statu           statu 0.00000000
## mistak         mistak 0.00000000
## parent         parent 0.00000000
## verifi         verifi 0.00000000
## abus             abus 0.00000000
## paper           paper 0.00000000
## deserv         deserv 0.00000000
## doubl           doubl 0.00000000
## success       success 0.00000000
## voic             voic 0.00000000
## assur           assur 0.00000000
## behaviour   behaviour 0.00000000
## common         common 0.00000000
## england       england 0.00000000
## join             join 0.00000000
## journal       journal 0.00000000
## meet             meet 0.00000000
## opposit       opposit 0.00000000
## detect         detect 0.00000000
## jame             jame 0.00000000
## previous     previous 0.00000000
## billion       billion 0.00000000
## organ           organ 0.00000000
## previou       previou 0.00000000
## select         select 0.00000000
## mother         mother 0.00000000
## capit           capit 0.00000000
## reader         reader 0.00000000
## entir           entir 0.00000000
## concern       concern 0.00000000
## ignor           ignor 0.00000000
## scale           scale 0.00000000
## biggest       biggest 0.00000000
## compet         compet 0.00000000
## conclud       conclud 0.00000000
## corpor         corpor 0.00000000
## editor         editor 0.00000000
## fastest       fastest 0.00000000
## realiz         realiz 0.00000000
## stock           stock 0.00000000
## submit         submit 0.00000000
## sort             sort 0.00000000
## suspect       suspect 0.00000000
## emploi         emploi 0.00000000
## extrem         extrem 0.00000000
## match           match 0.00000000
## occur           occur 0.00000000
## ultim           ultim 0.00000000
## averag         averag 0.00000000
## compar         compar 0.00000000
## pull             pull 0.00000000
## english       english 0.00000000
## green           green 0.00000000
## realiti       realiti 0.00000000
## region         region 0.00000000
## republ         republ 0.00000000
## travel         travel 0.00000000
## centuri       centuri 0.00000000
## disabl         disabl 0.00000000
## foreign       foreign 0.00000000
## patch           patch 0.00000000
## energi         energi 0.00000000
## tomorrow     tomorrow 0.00000000
## sampl           sampl 0.00000000
## strang         strang 0.00000000
## awar             awar 0.00000000
## date             date 0.00000000
## touch           touch 0.00000000
## brian           brian 0.00000000
## grant           grant 0.00000000
## label           label 0.00000000
## listen         listen 0.00000000
## lose             lose 0.00000000
## stupid         stupid 0.00000000
## unit             unit 0.00000000
## indic           indic 0.00000000
## financ         financ 0.00000000
## insert         insert 0.00000000
## station       station 0.00000000
## behavior     behavior 0.00000000
## china           china 0.00000000
## binari         binari 0.00000000
## forev           forev 0.00000000
## integr         integr 0.00000000
## matthia       matthia 0.00000000
## output         output 0.00000000
## resourc       resourc 0.00000000
## rpmforg       rpmforg 0.00000000
## split           split 0.00000000
## sylphe         sylphe 0.00000000
## valhalla     valhalla 0.00000000
## attent         attent 0.00000000
## child           child 0.00000000
## father         father 0.00000000
## rel               rel 0.00000000
## pick             pick 0.00000000
## proven         proven 0.00000000
## strong         strong 0.00000000
## digit           digit 0.00000000
## storag         storag 0.00000000
## valuabl       valuabl 0.00000000
## pack             pack 0.00000000
## spread         spread 0.00000000
## type             type 0.00000000
## bearer         bearer 0.00000000
## hettinga     hettinga 0.00000000
## agreeabl     agreeabl 0.00000000
## antiqu         antiqu 0.00000000
## boston         boston 0.00000000
## edward         edward 0.00000000
## empir           empir 0.00000000
## farquhar     farquhar 0.00000000
## gibbon         gibbon 0.00000000
## predict       predict 0.00000000
## roman           roman 0.00000000
## us                 us 0.00000000
## commit         commit 0.00000000
## balanc         balanc 0.00000000
## brought       brought 0.00000000
## progress     progress 0.00000000
## correctli   correctli 0.00000000
## cach             cach 0.00000000
## save             save 0.00000000
## brain           brain 0.00000000
## excel           excel 0.00000000
## capabl         capabl 0.00000000
## geeg             geeg 0.00000000
## brightli     brightli 0.00000000
## compliant   compliant 0.00000000
## canada         canada 0.00000000
## token           token 0.00000000
## rock             rock 0.00000000
## minim           minim 0.00000000
## broken         broken 0.00000000
## unseen         unseen 0.00000000
## barcelona   barcelona 0.00000000
## edificio     edificio 0.00000000
## nort             nort 0.00000000
## planta         planta 0.00000000
## spain           spain 0.00000000
## bliss           bliss 0.00000000
## wed               wed 0.00000000
## eugen           eugen 0.00000000
## strongli     strongli 0.00000000
## prompt         prompt 0.00000000
## classifi     classifi 0.00000000
## deploy         deploy 0.00000000
## corpu           corpu 0.00000000
## brent           brent 0.00000000
## spamd           spamd 0.00000000
## jabber         jabber 0.00000000
## jeremi         jeremi 0.00000000
## guido           guido 0.00000000
print(model2)
## Stochastic Gradient Boosting 
## 
## 2100 samples
##  928 predictor
##    2 classes: 'ham', 'spam' 
## 
## Pre-processing: centered (928), scaled (928) 
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 1400, 1400, 1400 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens       Spec     
##   1                   50      0.8831081  0.9873998  0.3276836
##   1                  100      0.9121430  0.9873998  0.4491525
##   1                  150      0.9290582  0.9839633  0.5677966
##   2                   50      0.9133467  0.9845361  0.4830508
##   2                  100      0.9342419  0.9845361  0.6186441
##   2                  150      0.9443134  0.9839633  0.6638418
##   3                   50      0.9245394  0.9822451  0.5480226
##   3                  100      0.9483371  0.9833906  0.6666667
##   3                  150      0.9583430  0.9816724  0.7090395
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Class predictions are obtained with type = "raw" and accuracy is calculated.

#predict classes and use postResample as a second check on accuracy
predictions2 <- predict(object=model2, testSet_Pre, type='raw')

print(postResample(pred=predictions2, obs=as.factor(testSet$emails.spam)))
##  Accuracy     Kappa 
## 0.9466667 0.7860624
#calculate accuracy using a second method to confirm
compare2<-data.frame(testSet$emails.spam,predictions2)
compare2$correct<-ifelse(compare2$testSet.emails.spam == compare2$predictions,1,0)
accuracy2<-round(sum(compare2$correct)*100/nrow(compare2),1)

cat("Accuracy confirmed:",accuracy2,"%")
## Accuracy confirmed: 94.7 %

The pROC package is used to calculate the AUC score. According to https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc:
“AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.”

In simplest terms, AUC is a measure of model performance. Using the package pROC I calculated AUC for the second model below from the predicted class probabilities on the test set.

#obtain probabilities
predictions2b <- predict(object=model2, testSet_Pre, type='prob')

library(pROC)

#get AUC score "AUC ranges between 0.5 and 1, where 0.5 is random and 1 is perfect" from https://amunategui.github.io/binary-outcome-modeling/
auc <- roc(ifelse(testSet$emails.spam=="spam",1,0), predictions2b[[2]])
print(auc$auc)
## Area under the curve: 0.9624
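The ROC curve itself can also be drawn from the roc object (an optional step, not shown in the original output):

#plot the ROC curve for the gbm model and print the AUC on the figure
plot(auc, print.auc = TRUE, legacy.axes = TRUE)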

Plotting the importance of each term allows for refinement in cleaning; the most important terms are scrutinized for authenticity.

#grab importance via varImp
imp<-varImp(model2,scale=FALSE)

imp2<-data.frame(imp["importance"])
setDT(imp2, keep.rownames = TRUE)[]
##           rn  Overall
##   1:    chri 0.000000
##   2: command 0.000000
##   3:  compil 0.000000
##   4:   creat 0.000000
##   5: develop 2.504518
##  ---                 
## 924:   spamd 0.000000
## 925:  jabber 0.000000
## 926:  jeremi 0.000000
## 927: spambay 1.081359
## 928:   guido 0.000000
imp2<-imp2[which(imp2$Overall>15),]
colnames(imp2)[1]<-"word"
colnames(imp2)[2]<-"importance"

ggplot(imp2, aes(x=reorder(word, importance), weight=importance, fill=as.factor(importance)))+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions

A more rigorous approach would yield additional insight into the tuning options for the models above; adjusting the parameters passed to the train function will alter the results. In this analysis the Random Forest model made the more accurate predictions (97% versus 94.7% on the held-out test set). Both models should be tested against other sets of spam and ham to assess how well they generalize.

Stephen Jones

April 7, 2019