Project 4: Spam Filter
Data
The two files selected are located here:
https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2
To begin, the following packages are loaded: tidyverse, tidytext, stringr, caret, tm.
rm(list=ls())
library(tidyverse)
library(tidytext)
library(stringr)
library(caret)
library(tm)
Spam and ham emails are loaded separately as simple text files, then combined for cleaning.
ham<-list.files('C:/MSDS/spamham/easy_ham')
ham2<-paste('C:/MSDS/spamham/easy_ham/',ham,sep='')
spam<-list.files('C:/MSDS/spamham/spam')
spam2<-paste('C:/MSDS/spamham/spam/',spam,sep='')
mlist<-append(ham2,spam2)
emails <- data.frame(files= sapply(mlist, FUN = function(x)readChar(x, file.info(x)$size)),
stringsAsFactors=FALSE)
The row names are captured with the package data.table, and slow, methodical cleaning follows. Although this could be accomplished with fewer lines of code, the result was checked after each step. Ultimately, a list of single-word tokens is developed for each document.
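A document-term matrix is created. The first call below prints a matrix weighted by term frequency-inverse document frequency (tf-idf), which increases with a term's frequency within a document but is adjusted downward for terms that are prevalent across all documents in the analysis. The matrix carried forward, emails_dtm, is cast with raw term counts, and sparse terms are then eliminated with the removeSparseTerms function; with sparse = .99, terms absent from more than roughly 99% of the documents are dropped.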
email_tokens %>%
#get count
count(ID, word) %>%
#document term matrix created with tf-idf
cast_dtm(document = ID, term = word, value = n,
weighting = tm::weightTfIdf)
## <<DocumentTermMatrix (documents: 3000, terms: 24737)>>
## Non-/sparse entries: 138283/74072717
## Sparsity : 100%
## Maximal term length: 10
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
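As a rough illustration of the tf-idf weighting described above, the sketch below computes it by hand in base R on a tiny invented corpus. The data frame `counts` and its two documents are hypothetical and are not part of the project data. The natural logarithm is used for the inverse document frequency, as in tidytext's bind_tf_idf; tm's weightTfIdf uses log base 2, so absolute values differ between the two while the ordering of terms within a document does not.

```r
# Toy counts: one row per (document, word) pair; all values are invented
counts <- data.frame(
  doc  = c("a", "a", "a", "b", "b"),
  word = c("offer", "price", "linux", "linux", "kernel"),
  n    = c(2, 1, 1, 3, 1),
  stringsAsFactors = FALSE
)

doc_len  <- tapply(counts$n, counts$doc, sum)  # total tokens per document
doc_freq <- table(counts$word)                 # number of documents containing each word
n_docs   <- length(doc_len)

# term frequency: share of the document's tokens taken up by the word
counts$tf     <- counts$n / as.numeric(doc_len[counts$doc])
# inverse document frequency: zero when a word occurs in every document
counts$idf    <- log(n_docs / as.numeric(doc_freq[counts$word]))
counts$tf_idf <- counts$tf * counts$idf

counts[order(-counts$tf_idf), ]
```

In this toy corpus "linux" appears in both documents, so its idf and tf-idf are zero no matter how often it occurs. The same effect applies whenever tf-idf is computed over only two "documents", such as the spam and ham groups: any term present in both groups scores zero.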
emails_dtm <- email_tokens %>%
count(ID,word) %>%
cast_dtm(document=ID,term = word, value = n)
#omit sparse words
emailsNoSparse_dtm <- removeSparseTerms(emails_dtm, sparse = .99)
Tf-idf scores are computed per class (grouped by spam = 0, 1), and the highest-scoring terms are plotted for both spam and ham.
emails_tfidf <- email_tokens %>%
count(spam, word) %>%
bind_tf_idf(term = word, document = spam, n = n)
#sort, convert to factor
plot_emails <- emails_tfidf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))))
#plot the 10 highest tf-idf tokens for spam and for ham
plot_emails %>%
filter(spam %in% c(0, 1)) %>%
mutate(spam = factor(spam, levels = c(0, 1),
labels = c("Ham", "Spam"))) %>%
group_by(spam) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
ggplot(aes(word, tf_idf)) +
geom_col() +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~spam, scales = "free") +
coord_flip()
The matrix is converted to a data frame and a seed is set for reproducibility. A random sample of 70% of the documents forms the training set, and the remaining rows are held out for testing the model. The spam indicator is separated from the predictors in both the training and testing datasets.
#form data frame with cleaned tokens and indicator variable
CleanEmail<-data.frame(as.matrix(emailsNoSparse_dtm),emails$spam)
#set seed to maintain random sample consistency; make 70-30 split
set.seed(1973)
rownums<-sample(nrow(CleanEmail),nrow(CleanEmail)*.7)
#form training set and test set
trainSet<-CleanEmail[rownums,]
#drop the last column (spam indicator); parentheses make the range 1..(ncol-1) explicit
trainSet_Pre<-trainSet[,1:(ncol(trainSet)-1)]
testSet<-CleanEmail[-rownums,]
testSet_Pre<-testSet[,1:(ncol(testSet)-1)]
Model 1 - Random Forest
The first model was created using the Random Forest algorithm (caret method "rf"), with ntree set to 50. The out-of-bag (OOB) estimate is used as the resampling method in trainControl, with other options left at their default settings.
#train by comparing dataframe to spam indicator
model1 <- train(x = trainSet_Pre,
y = factor(trainSet$emails.spam),
method = "rf",
ntree = 50,
trControl = trainControl(method = "oob"))
#view result
model1$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 50, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 43
##
## OOB estimate of error rate: 4.62%
## Confusion matrix:
##      0   1 class.error
## 0 1719  27  0.01546392
## 1   70 284  0.19774011
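The OOB confusion matrix above can be unpacked into per-class rates. The following is a minimal base R sketch that re-enters the four counts from the printed output by hand (rows are the actual class, columns the predicted class, with 1 meaning spam) and derives the overall error, recall and precision. The object names are illustrative only.

```r
# Counts copied from the OOB confusion matrix printed above
oob <- matrix(c(1719,  27,
                  70, 284),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(oob)) / sum(oob)   # share of all documents classified correctly
recall    <- diag(oob) / rowSums(oob)    # per actual class: 1 - class.error
precision <- diag(oob) / colSums(oob)    # per predicted class

round(c(accuracy = accuracy, error = 1 - accuracy), 4)
round(rbind(recall, precision), 4)
```

Read this way, the low overall error is driven mostly by the ham class: about one in five spam messages in the training data is still classified as ham, which is the class.error reported for row 1.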
The created model is tested against the testing dataset. Accuracy is calculated.
#test model
predictions<-predict(model1,newdata = testSet_Pre)
#calculate accuracy
compare<-data.frame(testSet$emails.spam,predictions)
compare$correct<-ifelse(compare$testSet.emails.spam == compare$predictions,1,0)
accuracy<-round(sum(compare$correct)*100/nrow(compare),1)
cat("Accuracy:",accuracy,"%")
## Accuracy: 97 %
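The accuracy calculation above can also be written more compactly, and it is easier to interpret next to a majority-class baseline. The sketch below uses small invented label vectors (`actual` and `predicted` are hypothetical, not the project's test set) to show the shortcut with mean(), a confusion table with table(), and the accuracy obtained by always predicting the most common class.

```r
# Invented labels for illustration only
lv        <- c("ham", "spam")
actual    <- factor(c("ham", "ham", "ham", "ham", "spam", "spam", "ham", "spam"), levels = lv)
predicted <- factor(c("ham", "ham", "spam", "ham", "spam", "ham", "ham", "spam"), levels = lv)

# Accuracy is the mean of the element-wise comparison
accuracy <- mean(predicted == actual)

# Cross-tabulation shows where the errors fall
conf <- table(actual = actual, predicted = predicted)

# Accuracy of a classifier that always predicts the majority class
baseline <- max(prop.table(table(actual)))

conf
cat("Accuracy:", round(100 * accuracy, 1), "%  Baseline:", round(100 * baseline, 1), "%\n")
```

Because ham makes up most of this corpus, a classifier that labels every message as ham would already be correct for the majority of documents, so the baseline is a useful reference point for any reported accuracy.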
Importance of each term is plotted; further data cleaning is performed if necessary and the model trained again.
#grab importance via varImp
imp<-varImp(model1,scale=FALSE)
imp2<-data.frame(imp["importance"])
#setDT() comes from data.table; keep.rownames moves the terms into the column "rn"
setDT(imp2, keep.rownames = TRUE)[]
## rn Overall
## 1: chri 0.0385162907
## 2: command 0.1026969697
## 3: compil 0.1826702603
## 4: creat 0.7032410226
## 5: develop 0.8482434523
## ---
## 924: spamd 0.1412943316
## 925: jabber 0.0695238095
## 926: jeremi 0.0007573919
## 927: spambay 0.4758320720
## 928: guido 0.0345516437
imp2<-imp2[which(imp2$Overall>9),]
colnames(imp2)[1]<-"word"
colnames(imp2)[2]<-"importance"
ggplot(imp2, aes(x=reorder(word, importance), weight=importance, fill=as.factor(importance)))+
geom_bar()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Model 2 - Stochastic Gradient Boosting
To train a model with the method gbm, I loaded the package of the same name. The trainControl method is set to cv with 3 folds, classProbs is set to TRUE, and summaryFunction is set to twoClassSummary. Since the classification of spam vs ham is a two-class problem, I use metric="ROC" in the train function; according to the documentation, caret calculates the area under the ROC curve only for 2-class models. The spam indicator is recoded to the labels "ham" and "spam" because classProbs = TRUE requires class levels that are valid R variable names, and the predictors are centered and scaled through preProc.
library('gbm')
#recode spam indicator variable
trainSet$emails.spam<-ifelse(trainSet$emails.spam==1,"spam","ham")
testSet$emails.spam<-ifelse(testSet$emails.spam==1,"spam","ham")
ctrl <- trainControl(method='cv',
number=3,
returnResamp='none',
summaryFunction = twoClassSummary,
classProbs = TRUE)
model2 <- train(x = trainSet_Pre,
y = factor(trainSet$emails.spam),
method='gbm',
trControl=ctrl,
metric = "ROC",
preProc = c("center", "scale"))
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8826 nan 0.1000 0.0113
## 2 0.8659 nan 0.1000 0.0072
## 3 0.8518 nan 0.1000 0.0033
## 4 0.8367 nan 0.1000 0.0069
## 5 0.8254 nan 0.1000 0.0037
## 6 0.8085 nan 0.1000 0.0058
## 7 0.7962 nan 0.1000 0.0050
## 8 0.7825 nan 0.1000 0.0060
## 9 0.7749 nan 0.1000 0.0030
## 10 0.7646 nan 0.1000 0.0034
## 20 0.6968 nan 0.1000 0.0003
## 40 0.6188 nan 0.1000 0.0013
## 60 0.5673 nan 0.1000 -0.0001
## 80 0.5306 nan 0.1000 -0.0005
## 100 0.4969 nan 0.1000 0.0001
## 120 0.4709 nan 0.1000 0.0000
## 140 0.4466 nan 0.1000 0.0003
## 150 0.4326 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8697 nan 0.1000 0.0155
## 2 0.8394 nan 0.1000 0.0149
## 3 0.8145 nan 0.1000 0.0094
## 4 0.7920 nan 0.1000 0.0089
## 5 0.7752 nan 0.1000 0.0066
## 6 0.7589 nan 0.1000 0.0073
## 7 0.7415 nan 0.1000 0.0081
## 8 0.7258 nan 0.1000 0.0048
## 9 0.7143 nan 0.1000 0.0053
## 10 0.7004 nan 0.1000 0.0059
## 20 0.6182 nan 0.1000 0.0020
## 40 0.5278 nan 0.1000 0.0022
## 60 0.4609 nan 0.1000 0.0003
## 80 0.4170 nan 0.1000 -0.0005
## 100 0.3821 nan 0.1000 0.0002
## 120 0.3544 nan 0.1000 -0.0005
## 140 0.3320 nan 0.1000 -0.0003
## 150 0.3230 nan 0.1000 -0.0006
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8668 nan 0.1000 0.0173
## 2 0.8308 nan 0.1000 0.0161
## 3 0.7963 nan 0.1000 0.0162
## 4 0.7735 nan 0.1000 0.0096
## 5 0.7515 nan 0.1000 0.0084
## 6 0.7312 nan 0.1000 0.0080
## 7 0.7116 nan 0.1000 0.0081
## 8 0.6949 nan 0.1000 0.0059
## 9 0.6812 nan 0.1000 0.0045
## 10 0.6674 nan 0.1000 0.0055
## 20 0.5558 nan 0.1000 0.0021
## 40 0.4589 nan 0.1000 -0.0004
## 60 0.4013 nan 0.1000 -0.0001
## 80 0.3569 nan 0.1000 -0.0000
## 100 0.3174 nan 0.1000 0.0008
## 120 0.2927 nan 0.1000 -0.0006
## 140 0.2682 nan 0.1000 0.0003
## 150 0.2577 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8781 nan 0.1000 0.0109
## 2 0.8565 nan 0.1000 0.0086
## 3 0.8419 nan 0.1000 0.0047
## 4 0.8215 nan 0.1000 0.0102
## 5 0.8070 nan 0.1000 0.0062
## 6 0.7969 nan 0.1000 0.0018
## 7 0.7856 nan 0.1000 0.0033
## 8 0.7713 nan 0.1000 0.0039
## 9 0.7632 nan 0.1000 0.0025
## 10 0.7536 nan 0.1000 0.0037
## 20 0.6699 nan 0.1000 0.0020
## 40 0.5834 nan 0.1000 0.0003
## 60 0.5312 nan 0.1000 -0.0000
## 80 0.4898 nan 0.1000 0.0004
## 100 0.4547 nan 0.1000 0.0007
## 120 0.4243 nan 0.1000 0.0001
## 140 0.3975 nan 0.1000 -0.0001
## 150 0.3858 nan 0.1000 -0.0000
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8665 nan 0.1000 0.0162
## 2 0.8370 nan 0.1000 0.0121
## 3 0.8073 nan 0.1000 0.0131
## 4 0.7804 nan 0.1000 0.0116
## 5 0.7561 nan 0.1000 0.0094
## 6 0.7342 nan 0.1000 0.0090
## 7 0.7190 nan 0.1000 0.0063
## 8 0.7028 nan 0.1000 0.0061
## 9 0.6883 nan 0.1000 0.0054
## 10 0.6766 nan 0.1000 0.0031
## 20 0.5845 nan 0.1000 0.0021
## 40 0.4884 nan 0.1000 0.0013
## 60 0.4267 nan 0.1000 0.0001
## 80 0.3820 nan 0.1000 0.0004
## 100 0.3422 nan 0.1000 0.0004
## 120 0.3127 nan 0.1000 0.0002
## 140 0.2906 nan 0.1000 -0.0006
## 150 0.2791 nan 0.1000 -0.0004
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8561 nan 0.1000 0.0201
## 2 0.8146 nan 0.1000 0.0189
## 3 0.7742 nan 0.1000 0.0164
## 4 0.7412 nan 0.1000 0.0145
## 5 0.7175 nan 0.1000 0.0094
## 6 0.6979 nan 0.1000 0.0092
## 7 0.6768 nan 0.1000 0.0083
## 8 0.6580 nan 0.1000 0.0072
## 9 0.6417 nan 0.1000 0.0051
## 10 0.6285 nan 0.1000 0.0052
## 20 0.5291 nan 0.1000 0.0017
## 40 0.4203 nan 0.1000 0.0003
## 60 0.3523 nan 0.1000 0.0005
## 80 0.3096 nan 0.1000 0.0000
## 100 0.2758 nan 0.1000 -0.0006
## 120 0.2502 nan 0.1000 -0.0003
## 140 0.2259 nan 0.1000 0.0000
## 150 0.2171 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8816 nan 0.1000 0.0124
## 2 0.8610 nan 0.1000 0.0087
## 3 0.8430 nan 0.1000 0.0079
## 4 0.8238 nan 0.1000 0.0104
## 5 0.8122 nan 0.1000 0.0044
## 6 0.7966 nan 0.1000 0.0062
## 7 0.7876 nan 0.1000 0.0040
## 8 0.7774 nan 0.1000 0.0048
## 9 0.7650 nan 0.1000 0.0065
## 10 0.7570 nan 0.1000 0.0024
## 20 0.6861 nan 0.1000 0.0043
## 40 0.6082 nan 0.1000 0.0008
## 60 0.5518 nan 0.1000 0.0003
## 80 0.5082 nan 0.1000 0.0001
## 100 0.4727 nan 0.1000 0.0010
## 120 0.4431 nan 0.1000 0.0009
## 140 0.4209 nan 0.1000 0.0000
## 150 0.4095 nan 0.1000 -0.0006
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8677 nan 0.1000 0.0187
## 2 0.8351 nan 0.1000 0.0150
## 3 0.8072 nan 0.1000 0.0140
## 4 0.7818 nan 0.1000 0.0111
## 5 0.7604 nan 0.1000 0.0067
## 6 0.7431 nan 0.1000 0.0072
## 7 0.7257 nan 0.1000 0.0069
## 8 0.7083 nan 0.1000 0.0084
## 9 0.6948 nan 0.1000 0.0051
## 10 0.6855 nan 0.1000 0.0032
## 20 0.6014 nan 0.1000 0.0015
## 40 0.5022 nan 0.1000 0.0016
## 60 0.4337 nan 0.1000 -0.0003
## 80 0.3885 nan 0.1000 0.0001
## 100 0.3560 nan 0.1000 0.0004
## 120 0.3313 nan 0.1000 0.0006
## 140 0.3105 nan 0.1000 -0.0005
## 150 0.2987 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8490 nan 0.1000 0.0254
## 2 0.8107 nan 0.1000 0.0150
## 3 0.7749 nan 0.1000 0.0166
## 4 0.7479 nan 0.1000 0.0095
## 5 0.7257 nan 0.1000 0.0085
## 6 0.7037 nan 0.1000 0.0092
## 7 0.6857 nan 0.1000 0.0076
## 8 0.6695 nan 0.1000 0.0077
## 9 0.6561 nan 0.1000 0.0060
## 10 0.6434 nan 0.1000 0.0044
## 20 0.5516 nan 0.1000 0.0027
## 40 0.4364 nan 0.1000 0.0016
## 60 0.3726 nan 0.1000 0.0008
## 80 0.3239 nan 0.1000 0.0003
## 100 0.2908 nan 0.1000 0.0001
## 120 0.2629 nan 0.1000 -0.0002
## 140 0.2423 nan 0.1000 -0.0000
## 150 0.2318 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.8578 nan 0.1000 0.0247
## 2 0.8231 nan 0.1000 0.0162
## 3 0.7945 nan 0.1000 0.0121
## 4 0.7685 nan 0.1000 0.0135
## 5 0.7433 nan 0.1000 0.0103
## 6 0.7257 nan 0.1000 0.0062
## 7 0.7074 nan 0.1000 0.0078
## 8 0.6925 nan 0.1000 0.0066
## 9 0.6741 nan 0.1000 0.0077
## 10 0.6610 nan 0.1000 0.0049
## 20 0.5665 nan 0.1000 0.0019
## 40 0.4539 nan 0.1000 0.0030
## 60 0.3930 nan 0.1000 0.0004
## 80 0.3470 nan 0.1000 0.0003
## 100 0.3114 nan 0.1000 0.0001
## 120 0.2858 nan 0.1000 0.0004
## 140 0.2634 nan 0.1000 0.0001
## 150 0.2526 nan 0.1000 0.0002
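The "ROC" metric used to tune this model is the area under the ROC curve (AUC). As a hedged illustration of what that number measures, the sketch below computes AUC in base R from ranks (the Mann-Whitney formulation) on invented scores. The function `auc_rank` and the toy vectors are hypothetical helpers, not part of caret or of this project.

```r
# AUC as the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case
auc_rank <- function(scores, labels, positive = "spam") {
  pos   <- labels == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(scores)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented predicted spam probabilities and true labels
labels <- c("spam", "spam", "spam", "ham", "ham", "ham", "ham")
scores <- c(0.90, 0.80, 0.35, 0.60, 0.30, 0.20, 0.10)

auc_rank(scores, labels)
```

Here 11 of the 12 spam/ham pairs are ordered correctly, so the AUC is 11/12. Unlike accuracy, this measure does not depend on a particular probability cutoff, which is one reason it is a common tuning metric for two-class problems.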
summary(model2)
## var rel.inf
## offer offer 7.02960982
## price price 6.79891606
## receiv receiv 6.78576067
## repli repli 6.09359345
## friend friend 4.92132807
## guarante guarante 4.68478927
## express express 3.86752829
## visit visit 3.85957286
## market market 2.87438036
## invest invest 2.76901818
## servic servic 2.39294092
## compani compani 2.04942107
## insur insur 1.89899953
## sponsor sponsor 1.79690258
## assist assist 1.71460335
## discuss discuss 1.62791137
## onlin onlin 1.61915292
## request request 1.57159721
## lowest lowest 1.55299280
## payment payment 1.48622362
## copyright copyright 1.45478962
## address address 1.19678453
## mortgag mortgag 1.19312593
## affili affili 1.00099669
## career career 0.97652546
## sincer sincer 0.96975360
## busi busi 0.92842568
## freebsd freebsd 0.90538667
## world world 0.84906260
## purchas purchas 0.79653181
## right right 0.74622896
## websit websit 0.71689583
## futur futur 0.71673288
## fortun fortun 0.64641897
## internet internet 0.63006192
## doesn doesn 0.62026735
## movi movi 0.61012127
## invok invok 0.60868245
## social social 0.59321035
## access access 0.55777236
## newslett newslett 0.51246726
## review review 0.47494571
## deliv deliv 0.46925789
## credit credit 0.44781685
## enter enter 0.43975721
## minut minut 0.39813084
## time time 0.38584862
## found found 0.37050368
## contract contract 0.36453958
## develop develop 0.35995112
## deposit deposit 0.32504765
## brand brand 0.31874301
## subject subject 0.31527663
## site site 0.30167574
## technologi technologi 0.30138848
## system system 0.29501632
## press press 0.29010999
## dollar dollar 0.28037739
## motiv motiv 0.27831719
## email email 0.24932220
## sens sens 0.24240280
## retail retail 0.24170587
## pai pai 0.24125785
## qualifi qualifi 0.23028890
## peopl peopl 0.22539038
## effici effici 0.21595124
## central central 0.21443726
## couldn couldn 0.20423612
## unknown unknown 0.18119613
## linux linux 0.18087056
## boundari boundari 0.17790217
## spammer spammer 0.17074729
## quick quick 0.16856366
## simpli simpli 0.16822611
## suppli suppli 0.15991908
## spambay spambay 0.15541360
## video video 0.14691636
## server server 0.14565499
## attack attack 0.14335135
## header header 0.14328251
## design design 0.14027181
## director director 0.13737613
## help help 0.13637051
## check check 0.13597645
## obtain obtain 0.13231715
## septemb septemb 0.12571596
## post post 0.12477388
## class class 0.12435633
## user user 0.12293578
## dogma dogma 0.11867748
## condit condit 0.11658123
## wait wait 0.11566286
## remain remain 0.11457226
## rate rate 0.11019752
## regist regist 0.10851258
## david david 0.10762326
## irish irish 0.10614492
## ship ship 0.10609431
## hous hous 0.10340517
## caus caus 0.10310448
## messag messag 0.09997960
## bring bring 0.09373585
## score score 0.09296631
## includ includ 0.09031352
## advic advic 0.08787705
## propos propos 0.08563564
## hardwar hardwar 0.08226531
## stuff stuff 0.08094199
## approv approv 0.08073328
## tell tell 0.07780140
## content content 0.07539032
## titl titl 0.07145358
## straight straight 0.07137487
## argument argument 0.06878779
## subscrib subscrib 0.06837872
## countri countri 0.06471925
## word word 0.06136472
## watch watch 0.05936653
## guess guess 0.05856722
## oblig oblig 0.05769310
## privat privat 0.05763819
## train train 0.05685886
## sender sender 0.05637963
## run run 0.05486857
## produc produc 0.05393823
## spend spend 0.05057506
## chang chang 0.04594050
## special special 0.04569715
## origin origin 0.04280646
## account account 0.04265751
## opinion opinion 0.04239576
## softwar softwar 0.04217730
## execut execut 0.04031843
## file file 0.03952879
## black black 0.03946849
## yahoo yahoo 0.03896471
## chanc chanc 0.03735082
## mobil mobil 0.03673048
## languag languag 0.03626788
## sale sale 0.03532746
## note note 0.03505997
## result result 0.03394751
## appear appear 0.03350941
## come come 0.03235181
## chri chri 0.00000000
## command command 0.00000000
## compil compil 0.00000000
## creat creat 0.00000000
## error error 0.00000000
## garrigu garrigu 0.00000000
## happen happen 0.00000000
## haven haven 0.00000000
## issu issu 0.00000000
## local local 0.00000000
## mail mail 0.00000000
## mark mark 0.00000000
## reach reach 0.00000000
## relev relev 0.00000000
## repositori repositori 0.00000000
## search search 0.00000000
## sequenc sequenc 0.00000000
## version version 0.00000000
## window window 0.00000000
## worker worker 0.00000000
## commun commun 0.00000000
## featur featur 0.00000000
## plan plan 0.00000000
## pretti pretti 0.00000000
## agenc agenc 0.00000000
## attempt attempt 0.00000000
## august august 0.00000000
## block block 0.00000000
## build build 0.00000000
## carri carri 0.00000000
## claim claim 0.00000000
## continu continu 0.00000000
## demand demand 0.00000000
## detail detail 0.00000000
## feder feder 0.00000000
## forc forc 0.00000000
## front front 0.00000000
## govern govern 0.00000000
## hour hour 0.00000000
## locat locat 0.00000000
## offic offic 0.00000000
## offici offici 0.00000000
## polic polic 0.00000000
## presid presid 0.00000000
## radio radio 0.00000000
## report report 0.00000000
## secur secur 0.00000000
## street street 0.00000000
## talk talk 0.00000000
## thursdai thursdai 0.00000000
## approach approach 0.00000000
## british british 0.00000000
## daili daili 0.00000000
## estim estim 0.00000000
## expert expert 0.00000000
## hold hold 0.00000000
## magazin magazin 0.00000000
## month month 0.00000000
## summer summer 0.00000000
## viru viru 0.00000000
## virus virus 0.00000000
## ad ad 0.00000000
## effect effect 0.00000000
## ident ident 0.00000000
## person person 0.00000000
## suppos suppos 0.00000000
## univers univers 0.00000000
## basic basic 0.00000000
## attract attract 0.00000000
## beauti beauti 0.00000000
## final final 0.00000000
## heart heart 0.00000000
## imagin imagin 0.00000000
## moment moment 0.00000000
## owner owner 0.00000000
## partner partner 0.00000000
## prefer prefer 0.00000000
## promis promis 0.00000000
## quickli quickli 0.00000000
## sell sell 0.00000000
## slashnul slashnul 0.00000000
## woman woman 0.00000000
## women women 0.00000000
## apologi apologi 0.00000000
## possibli possibli 0.00000000
## resid resid 0.00000000
## worst worst 0.00000000
## believ believ 0.00000000
## book book 0.00000000
## call call 0.00000000
## choic choic 0.00000000
## creativ creativ 0.00000000
## critic critic 0.00000000
## cultur cultur 0.00000000
## current current 0.00000000
## danger danger 0.00000000
## death death 0.00000000
## discov discov 0.00000000
## doubt doubt 0.00000000
## dream dream 0.00000000
## earn earn 0.00000000
## educ educ 0.00000000
## expens expens 0.00000000
## extra extra 0.00000000
## feel feel 0.00000000
## figur figur 0.00000000
## focu focu 0.00000000
## freedom freedom 0.00000000
## gordon gordon 0.00000000
## guid guid 0.00000000
## hand hand 0.00000000
## health health 0.00000000
## hear hear 0.00000000
## histori histori 0.00000000
## industri industri 0.00000000
## insid insid 0.00000000
## laugh laugh 0.00000000
## learn learn 0.00000000
## light light 0.00000000
## live live 0.00000000
## machin machin 0.00000000
## michael michael 0.00000000
## million million 0.00000000
## move move 0.00000000
## octob octob 0.00000000
## past past 0.00000000
## power power 0.00000000
## publish publish 0.00000000
## read read 0.00000000
## reason reason 0.00000000
## recent recent 0.00000000
## releas releas 0.00000000
## rememb rememb 0.00000000
## respect respect 0.00000000
## secret secret 0.00000000
## seri seri 0.00000000
## share share 0.00000000
## sign sign 0.00000000
## societi societi 0.00000000
## space space 0.00000000
## speak speak 0.00000000
## spent spent 0.00000000
## stand stand 0.00000000
## start start 0.00000000
## steve steve 0.00000000
## style style 0.00000000
## tabl tabl 0.00000000
## teach teach 0.00000000
## troubl troubl 0.00000000
## virtual virtual 0.00000000
## william william 0.00000000
## worth worth 0.00000000
## write write 0.00000000
## written written 0.00000000
## atalk atalk 0.00000000
## base base 0.00000000
## client client 0.00000000
## engin engin 0.00000000
## exist exist 0.00000000
## filter filter 0.00000000
## give give 0.00000000
## interfac interfac 0.00000000
## option option 0.00000000
## phone phone 0.00000000
## procmail procmail 0.00000000
## provid provid 0.00000000
## research research 0.00000000
## script script 0.00000000
## tire tire 0.00000000
## articl articl 0.00000000
## devel devel 0.00000000
## handl handl 0.00000000
## rule rule 0.00000000
## default default 0.00000000
## faster faster 0.00000000
## flow flow 0.00000000
## format format 0.00000000
## requir requir 0.00000000
## respond respond 0.00000000
## suggest suggest 0.00000000
## trick trick 0.00000000
## updat updat 0.00000000
## fail fail 0.00000000
## instanc instanc 0.00000000
## link link 0.00000000
## notic notic 0.00000000
## pass pass 0.00000000
## popul popul 0.00000000
## solv solv 0.00000000
## austin austin 0.00000000
## begin begin 0.00000000
## bottom bottom 0.00000000
## congress congress 0.00000000
## displai displai 0.00000000
## doer doer 0.00000000
## fix fix 0.00000000
## folder folder 0.00000000
## funni funni 0.00000000
## gnupg gnupg 0.00000000
## leav leav 0.00000000
## micalg micalg 0.00000000
## robert robert 0.00000000
## screen screen 0.00000000
## signatur signatur 0.00000000
## suit suit 0.00000000
## take take 0.00000000
## wrong wrong 0.00000000
## acquir acquir 0.00000000
## american american 0.00000000
## avoid avoid 0.00000000
## california california 0.00000000
## campaign campaign 0.00000000
## chief chief 0.00000000
## choos choos 0.00000000
## commerci commerci 0.00000000
## decid decid 0.00000000
## defin defin 0.00000000
## delet delet 0.00000000
## direct direct 0.00000000
## directli directli 0.00000000
## effort effort 0.00000000
## ensur ensur 0.00000000
## equal equal 0.00000000
## experi experi 0.00000000
## field field 0.00000000
## fund fund 0.00000000
## level level 0.00000000
## list list 0.00000000
## make make 0.00000000
## manag manag 0.00000000
## media media 0.00000000
## method method 0.00000000
## nation nation 0.00000000
## neg neg 0.00000000
## opt opt 0.00000000
## perfect perfect 0.00000000
## perform perform 0.00000000
## plai plai 0.00000000
## polit polit 0.00000000
## potenti potenti 0.00000000
## practic practic 0.00000000
## prepar prepar 0.00000000
## primari primari 0.00000000
## public public 0.00000000
## qualiti qualiti 0.00000000
## recipi recipi 0.00000000
## respons respons 0.00000000
## seek seek 0.00000000
## send send 0.00000000
## simpl simpl 0.00000000
## specif specif 0.00000000
## standard standard 0.00000000
## support support 0.00000000
## target target 0.00000000
## typic typic 0.00000000
## cheer cheer 0.00000000
## connect connect 0.00000000
## network network 0.00000000
## question question 0.00000000
## appar appar 0.00000000
## children children 0.00000000
## comput comput 0.00000000
## evid evid 0.00000000
## googl googl 0.00000000
## hope hope 0.00000000
## kiddi kiddi 0.00000000
## name name 0.00000000
## pictur pictur 0.00000000
## similar similar 0.00000000
## step step 0.00000000
## trust trust 0.00000000
## complet complet 0.00000000
## oper oper 0.00000000
## anymor anymor 0.00000000
## bui bui 0.00000000
## famili famili 0.00000000
## inlin inlin 0.00000000
## maintain maintain 0.00000000
## redhat redhat 0.00000000
## tool tool 0.00000000
## vendor vendor 0.00000000
## term term 0.00000000
## product product 0.00000000
## relat relat 0.00000000
## statement statement 0.00000000
## truth truth 0.00000000
## agent agent 0.00000000
## america america 0.00000000
## assum assum 0.00000000
## decad decad 0.00000000
## enjoi enjoi 0.00000000
## human human 0.00000000
## keep keep 0.00000000
## legal legal 0.00000000
## meant meant 0.00000000
## middl middl 0.00000000
## obviou obviou 0.00000000
## packag packag 0.00000000
## parti parti 0.00000000
## profit profit 0.00000000
## properti properti 0.00000000
## put put 0.00000000
## record record 0.00000000
## refer refer 0.00000000
## roger roger 0.00000000
## section section 0.00000000
## solut solut 0.00000000
## sound sound 0.00000000
## download download 0.00000000
## driver driver 0.00000000
## folk folk 0.00000000
## idea idea 0.00000000
## instal instal 0.00000000
## panel panel 0.00000000
## peter peter 0.00000000
## memori memori 0.00000000
## uniqu uniqu 0.00000000
## wonder wonder 0.00000000
## activ activ 0.00000000
## answer answer 0.00000000
## arriv arriv 0.00000000
## close close 0.00000000
## copi copi 0.00000000
## drive drive 0.00000000
## game game 0.00000000
## initi initi 0.00000000
## physic physic 0.00000000
## situat situat 0.00000000
## understand understand 0.00000000
## wednesdai wednesdai 0.00000000
## wors wors 0.00000000
## wrote wrote 0.00000000
## heard heard 0.00000000
## random random 0.00000000
## author author 0.00000000
## confid confid 0.00000000
## declin declin 0.00000000
## econom econom 0.00000000
## economi economi 0.00000000
## french french 0.00000000
## georg georg 0.00000000
## harlei harlei 0.00000000
## leader leader 0.00000000
## recal recal 0.00000000
## stori stori 0.00000000
## mondai mondai 0.00000000
## return return 0.00000000
## white white 0.00000000
## modul modul 0.00000000
## separ separ 0.00000000
## appli appli 0.00000000
## catch catch 0.00000000
## expect expect 0.00000000
## mimeol mimeol 0.00000000
## music music 0.00000000
## abil abil 0.00000000
## agre agre 0.00000000
## britain britain 0.00000000
## center center 0.00000000
## confirm confirm 0.00000000
## elect elect 0.00000000
## emerg emerg 0.00000000
## franc franc 0.00000000
## line line 0.00000000
## natur natur 0.00000000
## sourc sourc 0.00000000
## speech speech 0.00000000
## spring spring 0.00000000
## strategi strategi 0.00000000
## studi studi 0.00000000
## washington washington 0.00000000
## cover cover 0.00000000
## improv improv 0.00000000
## knowledg knowledg 0.00000000
## piec piec 0.00000000
## fals fals 0.00000000
## blame blame 0.00000000
## collect collect 0.00000000
## depend depend 0.00000000
## mean mean 0.00000000
## accur accur 0.00000000
## care care 0.00000000
## correct correct 0.00000000
## desir desir 0.00000000
## devic devic 0.00000000
## identifi identifi 0.00000000
## load load 0.00000000
## model model 0.00000000
## surpris surpris 0.00000000
## appl appl 0.00000000
## framework framework 0.00000000
## fight fight 0.00000000
## daniel daniel 0.00000000
## night night 0.00000000
## sundai sundai 0.00000000
## wouldn wouldn 0.00000000
## absolut absolut 0.00000000
## configur configur 0.00000000
## forget forget 0.00000000
## function. function. 0.00000000
## manual manual 0.00000000
## port port 0.00000000
## advantag advantag 0.00000000
## hundr hundr 0.00000000
## innov innov 0.00000000
## lawrenc lawrenc 0.00000000
## lead lead 0.00000000
## measur measur 0.00000000
## murphi murphi 0.00000000
## picasso picasso 0.00000000
## telephon telephon 0.00000000
## thousand thousand 0.00000000
## useless useless 0.00000000
## bunch bunch 0.00000000
## basi basi 0.00000000
## benefit benefit 0.00000000
## charg charg 0.00000000
## coupl coupl 0.00000000
## habea habea 0.00000000
## heaven heaven 0.00000000
## incom incom 0.00000000
## justin justin 0.00000000
## letter letter 0.00000000
## licens licens 0.00000000
## mason mason 0.00000000
## posit posit 0.00000000
## purpos purpos 0.00000000
## reject reject 0.00000000
## warrant warrant 0.00000000
## directori directori 0.00000000
## mozilla mozilla 0.00000000
## speed speed 0.00000000
## week week 0.00000000
## short short 0.00000000
## amount amount 0.00000000
## annoi annoi 0.00000000
## easili easili 0.00000000
## increas increas 0.00000000
## instant instant 0.00000000
## document document 0.00000000
## util util 0.00000000
## imag imag 0.00000000
## headlin headlin 0.00000000
## mailer mailer 0.00000000
## pudg pudg 0.00000000
## reserv reserv 0.00000000
## burn burn 0.00000000
## comment comment 0.00000000
## evolut evolut 0.00000000
## import import 0.00000000
## permiss permiss 0.00000000
## weight weight 0.00000000
## ximian ximian 0.00000000
## backup backup 0.00000000
## databas databas 0.00000000
## process process 0.00000000
## traffic traffic 0.00000000
## transfer transfer 0.00000000
## winter winter 0.00000000
## archiv archiv 0.00000000
## combin combin 0.00000000
## confus confus 0.00000000
## convinc convinc 0.00000000
## earth earth 0.00000000
## event event 0.00000000
## form form 0.00000000
## larger larger 0.00000000
## launch launch 0.00000000
## liber liber 0.00000000
## look look 0.00000000
## modern modern 0.00000000
## photo photo 0.00000000
## print print 0.00000000
## track track 0.00000000
## trade trade 0.00000000
## equip equip 0.00000000
## fridai fridai 0.00000000
## limit limit 0.00000000
## action action 0.00000000
## attach attach 0.00000000
## custom custom 0.00000000
## easier easier 0.00000000
## individu individu 0.00000000
## intend intend 0.00000000
## promot promot 0.00000000
## total total 0.00000000
## worri worri 0.00000000
## amend amend 0.00000000
## browser browser 0.00000000
## edit edit 0.00000000
## follow follow 0.00000000
## permit permit 0.00000000
## string string 0.00000000
## built built 0.00000000
## clean clean 0.00000000
## freshrpm freshrpm 0.00000000
## rebuild rebuild 0.00000000
## addit addit 0.00000000
## button button 0.00000000
## charact charact 0.00000000
## commiss commiss 0.00000000
## decis decis 0.00000000
## financi financi 0.00000000
## matter matter 0.00000000
## polici polici 0.00000000
## treat treat 0.00000000
## wireless wireless 0.00000000
## entri entri 0.00000000
## protect protect 0.00000000
## technic technic 0.00000000
## shouldn shouldn 0.00000000
## announc announc 0.00000000
## honor honor 0.00000000
## intern intern 0.00000000
## modifi modifi 0.00000000
## program program 0.00000000
## project project 0.00000000
## replac replac 0.00000000
## resolv resolv 0.00000000
## confer confer 0.00000000
## consum consum 0.00000000
## depart depart 0.00000000
## determin determin 0.00000000
## difficult difficult 0.00000000
## electron electron 0.00000000
## extens extens 0.00000000
## gener gener 0.00000000
## involv involv 0.00000000
## materi materi 0.00000000
## period period 0.00000000
## readi readi 0.00000000
## school school 0.00000000
## warn warn 0.00000000
## beberg beberg 0.00000000
## domain domain 0.00000000
## duncan duncan 0.00000000
## find find 0.00000000
## major major 0.00000000
## scientist scientist 0.00000000
## angl angl 0.00000000
## popular popular 0.00000000
## cheap cheap 0.00000000
## complex complex 0.00000000
## cost cost 0.00000000
## platform platform 0.00000000
## reduc reduc 0.00000000
## revers revers 0.00000000
## test test 0.00000000
## differ differ 0.00000000
## object object 0.00000000
## rang rang 0.00000000
## techniqu techniqu 0.00000000
## kevin kevin 0.00000000
## count count 0.00000000
## admit admit 0.00000000
## amaz amaz 0.00000000
## code code 0.00000000
## librari librari 0.00000000
## proper proper 0.00000000
## think think 0.00000000
## wast wast 0.00000000
## yesterdai yesterdai 0.00000000
## deal deal 0.00000000
## remot remot 0.00000000
## setup setup 0.00000000
## accept accept 0.00000000
## deliveri deliveri 0.00000000
## set set 0.00000000
## card card 0.00000000
## earlier earlier 0.00000000
## consid consid 0.00000000
## stick stick 0.00000000
## actual actual 0.00000000
## enabl enabl 0.00000000
## global global 0.00000000
## tuesdai tuesdai 0.00000000
## colleg colleg 0.00000000
## concept concept 0.00000000
## europ europ 0.00000000
## extend extend 0.00000000
## rais rais 0.00000000
## scienc scienc 0.00000000
## structur structur 0.00000000
## happi happi 0.00000000
## laptop laptop 0.00000000
## emac emac 0.00000000
## exchang exchang 0.00000000
## mention mention 0.00000000
## procedur procedur 0.00000000
## singl singl 0.00000000
## store store 0.00000000
## eventu eventu 0.00000000
## failur failur 0.00000000
## morn morn 0.00000000
## perfectli perfectli 0.00000000
## prevent prevent 0.00000000
## upgrad upgrad 0.00000000
## allow allow 0.00000000
## valid valid 0.00000000
## forward forward 0.00000000
## court court 0.00000000
## advanc advanc 0.00000000
## stream stream 0.00000000
## largest largest 0.00000000
## page page 0.00000000
## shape shape 0.00000000
## south south 0.00000000
## theori theori 0.00000000
## writer writer 0.00000000
## arrest arrest 0.00000000
## defend defend 0.00000000
## north north 0.00000000
## regular regular 0.00000000
## anim anim 0.00000000
## lower lower 0.00000000
## player player 0.00000000
## razor razor 0.00000000
## excit excit 0.00000000
## militari militari 0.00000000
## bother bother 0.00000000
## miss miss 0.00000000
## python python 0.00000000
## femal femal 0.00000000
## control control 0.00000000
## recommend recommend 0.00000000
## favor favor 0.00000000
## ground ground 0.00000000
## break. break. 0.00000000
## kill kill 0.00000000
## damag damag 0.00000000
## reveal reveal 0.00000000
## statu statu 0.00000000
## mistak mistak 0.00000000
## parent parent 0.00000000
## verifi verifi 0.00000000
## abus abus 0.00000000
## paper paper 0.00000000
## deserv deserv 0.00000000
## doubl doubl 0.00000000
## success success 0.00000000
## voic voic 0.00000000
## assur assur 0.00000000
## behaviour behaviour 0.00000000
## common common 0.00000000
## england england 0.00000000
## join join 0.00000000
## journal journal 0.00000000
## meet meet 0.00000000
## opposit opposit 0.00000000
## detect detect 0.00000000
## jame jame 0.00000000
## previous previous 0.00000000
## billion billion 0.00000000
## organ organ 0.00000000
## previou previou 0.00000000
## select select 0.00000000
## mother mother 0.00000000
## capit capit 0.00000000
## reader reader 0.00000000
## entir entir 0.00000000
## concern concern 0.00000000
## ignor ignor 0.00000000
## scale scale 0.00000000
## biggest biggest 0.00000000
## compet compet 0.00000000
## conclud conclud 0.00000000
## corpor corpor 0.00000000
## editor editor 0.00000000
## fastest fastest 0.00000000
## realiz realiz 0.00000000
## stock stock 0.00000000
## submit submit 0.00000000
## sort sort 0.00000000
## suspect suspect 0.00000000
## emploi emploi 0.00000000
## extrem extrem 0.00000000
## match match 0.00000000
## occur occur 0.00000000
## ultim ultim 0.00000000
## averag averag 0.00000000
## compar compar 0.00000000
## pull pull 0.00000000
## english english 0.00000000
## green green 0.00000000
## realiti realiti 0.00000000
## region region 0.00000000
## republ republ 0.00000000
## travel travel 0.00000000
## centuri centuri 0.00000000
## disabl disabl 0.00000000
## foreign foreign 0.00000000
## patch patch 0.00000000
## energi energi 0.00000000
## tomorrow tomorrow 0.00000000
## sampl sampl 0.00000000
## strang strang 0.00000000
## awar awar 0.00000000
## date date 0.00000000
## touch touch 0.00000000
## brian brian 0.00000000
## grant grant 0.00000000
## label label 0.00000000
## listen listen 0.00000000
## lose lose 0.00000000
## stupid stupid 0.00000000
## unit unit 0.00000000
## indic indic 0.00000000
## financ financ 0.00000000
## insert insert 0.00000000
## station station 0.00000000
## behavior behavior 0.00000000
## china china 0.00000000
## binari binari 0.00000000
## forev forev 0.00000000
## integr integr 0.00000000
## matthia matthia 0.00000000
## output output 0.00000000
## resourc resourc 0.00000000
## rpmforg rpmforg 0.00000000
## split split 0.00000000
## sylphe sylphe 0.00000000
## valhalla valhalla 0.00000000
## attent attent 0.00000000
## child child 0.00000000
## father father 0.00000000
## rel rel 0.00000000
## pick pick 0.00000000
## proven proven 0.00000000
## strong strong 0.00000000
## digit digit 0.00000000
## storag storag 0.00000000
## valuabl valuabl 0.00000000
## pack pack 0.00000000
## spread spread 0.00000000
## type type 0.00000000
## bearer bearer 0.00000000
## hettinga hettinga 0.00000000
## agreeabl agreeabl 0.00000000
## antiqu antiqu 0.00000000
## boston boston 0.00000000
## edward edward 0.00000000
## empir empir 0.00000000
## farquhar farquhar 0.00000000
## gibbon gibbon 0.00000000
## predict predict 0.00000000
## roman roman 0.00000000
## us us 0.00000000
## commit commit 0.00000000
## balanc balanc 0.00000000
## brought brought 0.00000000
## progress progress 0.00000000
## correctli correctli 0.00000000
## cach cach 0.00000000
## save save 0.00000000
## brain brain 0.00000000
## excel excel 0.00000000
## capabl capabl 0.00000000
## geeg geeg 0.00000000
## brightli brightli 0.00000000
## compliant compliant 0.00000000
## canada canada 0.00000000
## token token 0.00000000
## rock rock 0.00000000
## minim minim 0.00000000
## broken broken 0.00000000
## unseen unseen 0.00000000
## barcelona barcelona 0.00000000
## edificio edificio 0.00000000
## nort nort 0.00000000
## planta planta 0.00000000
## spain spain 0.00000000
## bliss bliss 0.00000000
## wed wed 0.00000000
## eugen eugen 0.00000000
## strongli strongli 0.00000000
## prompt prompt 0.00000000
## classifi classifi 0.00000000
## deploy deploy 0.00000000
## corpu corpu 0.00000000
## brent brent 0.00000000
## spamd spamd 0.00000000
## jabber jabber 0.00000000
## jeremi jeremi 0.00000000
## guido guido 0.00000000
print(model2)
## Stochastic Gradient Boosting
##
## 2100 samples
## 928 predictor
## 2 classes: 'ham', 'spam'
##
## Pre-processing: centered (928), scaled (928)
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 1400, 1400, 1400
## Resampling results across tuning parameters:
##
## interaction.depth n.trees ROC Sens Spec
## 1 50 0.8831081 0.9873998 0.3276836
## 1 100 0.9121430 0.9873998 0.4491525
## 1 150 0.9290582 0.9839633 0.5677966
## 2 50 0.9133467 0.9845361 0.4830508
## 2 100 0.9342419 0.9845361 0.6186441
## 2 150 0.9443134 0.9839633 0.6638418
## 3 50 0.9245394 0.9822451 0.5480226
## 3 100 0.9483371 0.9833906 0.6666667
## 3 150 0.9583430 0.9816724 0.7090395
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
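When reading the Sens and Spec columns above, it helps to know which class counts as "positive". caret's twoClassSummary treats the first factor level as the event of interest, so with the levels 'ham' and 'spam' the Sens column would be the share of ham correctly kept and the Spec column the share of spam correctly caught. The base R sketch below illustrates that convention; the vectors obs and pred are made-up toy labels, not data from this analysis.

```r
# Toy labels (hypothetical, for illustration only)
lev  <- c("ham", "spam")
obs  <- factor(c("ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"), levels = lev)
pred <- factor(c("ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"), levels = lev)

# Confusion table: rows = predicted class, columns = observed class
tab <- table(pred, obs)

# The first level ("ham") is treated as the positive class
sens <- as.numeric(tab["ham", "ham"] / sum(tab[, "ham"]))     # ham correctly labelled ham
spec <- as.numeric(tab["spam", "spam"] / sum(tab[, "spam"]))  # spam correctly labelled spam

cat("Sensitivity (ham):", round(sens, 3), " Specificity (spam):", round(spec, 3), "\n")
```

Under that reading, the low Spec values for the shallow models in the table mean that much of the spam was labelled ham at the default cutoff, even where ROC was already high.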
Get “raw” class predictions (type='raw' returns predicted labels, not probabilities) and calculate accuracy.
#use another method to get accuracy and confirm it against the earlier process
predictions2 <- predict(object=model2, testSet_Pre, type='raw')
print(postResample(pred=predictions2, obs=as.factor(testSet$emails.spam)))
## Accuracy Kappa
## 0.9466667 0.7860624
#calculate accuracy using a second method to confirm
compare2<-data.frame(testSet$emails.spam,predictions2)
#use the full column name; partial matching with $ on a data frame is fragile
compare2$correct<-ifelse(compare2$testSet.emails.spam == compare2$predictions2,1,0)
accuracy2<-round(sum(compare2$correct)*100/nrow(compare2),1)
cat("Accuracy confirmed:",accuracy2,"%")
## Accuracy confirmed: 94.7 %
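postResample reports Kappa alongside accuracy. As a rough illustration of how the two relate, the base R sketch below computes both from a confusion table, using the standard definition of Cohen's kappa (observed agreement corrected for the agreement expected by chance). The toy vectors obs and pred and the name kappa_hat are hypothetical and only serve the example.

```r
# Toy labels (hypothetical, for illustration only)
lev  <- c("ham", "spam")
obs  <- factor(c("ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"), levels = lev)
pred <- factor(c("ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"), levels = lev)

tab <- table(pred, obs)   # rows = predicted, columns = observed
n   <- sum(tab)

# Observed agreement: this is the accuracy
po <- sum(diag(tab)) / n

# Agreement expected by chance, from the row and column totals
pe <- sum(rowSums(tab) * colSums(tab)) / n^2

# Cohen's kappa: agreement beyond chance, scaled to a maximum of 1
kappa_hat <- (po - pe) / (1 - pe)

cat("Accuracy:", po, " Kappa:", round(kappa_hat, 4), "\n")
```

Kappa is usually lower than accuracy when one class dominates, which is likely why the reported Kappa (0.786) sits well below the accuracy (0.947): a filter that labels most messages ham can score a high accuracy while adding little beyond chance.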
Use package pROC to calculate AUC score. According to https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc:
“AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.”
In simplest terms, it measures how well the model ranks spam above ham across all classification thresholds. Using the package pROC, I calculated AUC for the second model below from the class probabilities predicted for the test set.
#obtain probabilities
predictions2b <- predict(object=model2, testSet_Pre, type='prob')
library(pROC)
#get AUC score "AUC ranges between 0.5 and 1, where 0.5 is random and 1 is perfect" from https://amunategui.github.io/binary-outcome-modeling/
auc <- roc(ifelse(testSet$emails.spam=="spam",1,0), predictions2b[[2]])
print(auc$auc)
## Area under the curve: 0.9624
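To make the AUC figure more concrete, here is a minimal base R sketch that does not use pROC. It relies on the standard rank interpretation: AUC equals the probability that a randomly chosen spam message receives a higher spam score than a randomly chosen ham message. The vectors labels and scores are made-up values, and the names auc_rank and auc_pairs are hypothetical.

```r
# Toy data (hypothetical): 1 = spam, 0 = ham, with predicted spam probabilities
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90)

n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)

# Rank-based (Mann-Whitney) formula; rank() averages ties by default
r <- rank(scores)
auc_rank <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Same quantity by brute force: compare every spam score with every ham score
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

cat("AUC (ranks):", auc_rank, " AUC (pairs):", auc_pairs, "\n")
```

This also reconciles the two quoted ranges: AUC can in principle fall anywhere between 0 and 1, but 0.5 corresponds to random ranking, so a value below 0.5 means the scores are ordered worse than chance.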
Plotting the importance of each term allows for refinement in cleaning; the most important terms are scrutinized for authenticity.
#grab importance via varImp
imp<-varImp(model2,scale=FALSE)
imp2<-data.frame(imp["importance"])
setDT(imp2, keep.rownames = TRUE)[]
## rn Overall
## 1: chri 0.000000
## 2: command 0.000000
## 3: compil 0.000000
## 4: creat 0.000000
## 5: develop 2.504518
## ---
## 924: spamd 0.000000
## 925: jabber 0.000000
## 926: jeremi 0.000000
## 927: spambay 1.081359
## 928: guido 0.000000
imp2<-imp2[which(imp2$Overall>15),]
colnames(imp2)[1]<-"word"
colnames(imp2)[2]<-"importance"
ggplot(imp2, aes(x=reorder(word, importance), weight=importance, fill=as.factor(importance)))+
geom_bar()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Conclusions
A more rigorous approach would explore the tuning options available for the models above, since adjusting the model parameters in the train function will alter the result. In this analysis, the Random Forest model made more accurate predictions than the stochastic gradient boosting model. The model should be tested against other sets of spam and ham to assess how well it generalizes.
Sources
The following sites were useful resources in developing this analysis:
use of alternative models and methods with caret: https://amunategui.github.io/binary-outcome-modeling/
general use of caret: https://cfss.uchicago.edu/notes/supervised-text-classification/
general knowledge and methods:
https://topepo.github.io/caret/variable-importance.html
https://topepo.github.io/caret/model-training-and-tuning.html
https://www.tidytextmining.com/nasa.html
https://www.rdocumentation.org/packages/caret/versions/4.47/topics/train
https://www.rdocumentation.org/packages/caret/versions/5.05.004/topics/predict.train
http://www.rebeccabarter.com/blog/2017-11-17-caret_tutorial/
https://github.com/topepo/caret/issues/141
https://www.hvitfeldt.me/blog/binary-text-classification-with-tidytext-and-caret/