Background Information on the Dataset

Nearly every email user has at some point encountered a “spam” email: an unsolicited message that often advertises a product, contains links to malware, or attempts to scam the recipient. Roughly 80-90% of the more than 100 billion emails sent each day are spam, most of them sent from botnets of malware-infected computers. The remaining, legitimate messages are called “ham” emails.

As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.

In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

The dataset contains just two fields:

  1. text: the text of the email.

  2. spam: a binary variable indicating whether the email is spam (1) or ham (0).

R Exercises

Loading the Dataset

Begin by loading the dataset emails.csv into a data frame called emails. Remember to pass the stringsAsFactors=FALSE option when loading the data.

emails = read.csv("emails.csv", stringsAsFactors=FALSE)

How many emails are in the dataset?

# Examine the structure of the emails data frame
str(emails)
## 'data.frame':    5728 obs. of  2 variables:
##  $ text: chr  "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market"| __truncated__ "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__ "Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved"| __truncated__ "Subject: 4 color printing special  request additional information now ! click here  click here for a printable "| __truncated__ ...
##  $ spam: int  1 1 1 1 1 1 1 1 1 1 ...

There are 5728 emails in the dataset.

How many of the emails are spam?

# Tabulate how many emails are spam
table(emails$spam)
## 
##    0    1 
## 4360 1368

1368 emails are spam.

Which word appears at the beginning of every email in the dataset? Respond as a lower-case word with punctuation removed.

# Inspect the text of one email; every email begins the same way
str(emails$text[2])
##  chr "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__

The word “subject” appears at the beginning of every email in the dataset (each message starts with “Subject:”).

Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?

We know that each email has the word “subject” appear at least once, but the frequency with which it appears might help us differentiate spam from ham. For instance, a long email chain would have the word “subject” appear a number of times, and this higher frequency might be indicative of a ham message.
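
As a quick check of this idea, we can count how many times “subject” actually occurs in each email. The sketch below uses only base R; the variable name subjectCount is purely illustrative.

# Count occurrences of "subject" in each email, ignoring case
# (gregexpr returns -1 when there is no match, so we count the positive match positions)
subjectCount = sapply(gregexpr("subject", tolower(emails$text), fixed = TRUE),
                      function(m) sum(m > 0))
# Distribution of the per-email counts
summary(subjectCount)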

How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)?

max(nchar(emails$text))
## [1] 43952

43952 characters are in the longest email.

Which row contains the shortest email in the dataset? (Just like in the previous problem, shortest is measured in terms of the fewest number of characters.)

# Finds the row with the shortest email
which.min(nchar(emails$text))
## [1] 1992

Row 1992 contains the shortest email in the dataset.
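
Note that which.min returns only the first row attaining the minimum, so if several emails tied for the shortest length we would only see one of them. A quick sketch to check for ties:

# All rows whose email length equals the minimum length
which(nchar(emails$text) == min(nchar(emails$text)))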

Preparing the Corpus

  1. Build a new corpus variable called corpus.

  2. Using tm_map, convert the text to lowercase.

  3. Using tm_map, remove all punctuation from the corpus.

  4. Using tm_map, remove all English stopwords from the corpus.

  5. Using tm_map, stem the words in the corpus.

  6. Build a document term matrix from the corpus, called dtm.

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).

How many terms are in dtm?

# Preparing the Corpus
library(tm)
# Build a new corpus variable called corpus
corpus = VCorpus(VectorSource(emails$text))
# Convert the text to lowercase.
corpus = tm_map(corpus, content_transformer(tolower))
# Remove all punctuation from the corpus
corpus = tm_map(corpus, removePunctuation)
# Remove all English stopwords from the corpus
corpus = tm_map(corpus, removeWords, stopwords("english"))
# Stem the words in the corpus
corpus = tm_map(corpus, stemDocument)
# Build a document term matrix from the corpus, called dtm
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)

28687 terms are in dtm.

To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents, and store this result as spdtm (don’t overwrite dtm, because we will use it in a later step of this homework). How many terms are in spdtm?

# Remove the sparse terms
spdtm = removeSparseTerms(dtm, 0.95)
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity           : 89%
## Maximal term length: 10
## Weighting          : term frequency (tf)

330 terms are in spdtm.
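
The sparsity threshold of 0.95 keeps the terms that appear in more than 5% of the documents. We can check this interpretation directly against dtm; the sketch below uses the fact that a DocumentTermMatrix is stored as a sparse triplet matrix, so dtm$j holds the term index of every non-zero cell.

# Document frequency of each term = number of documents in which it appears
docFreq = tabulate(dtm$j, nbins = ncol(dtm))
# Terms appearing in more than 5% of the documents; this should come out to the same 330 terms kept in spdtm
sum(docFreq > 0.05 * nrow(dtm))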

Build a data frame called emailsSparse from spdtm, and use the make.names function to make the variable names of emailsSparse valid. What is the word stem that shows up most frequently across all the emails in the dataset?

# Build data frame called emailsSparse from spdtm
emailsSparse = as.data.frame(as.matrix(spdtm))
colnames(emailsSparse) = make.names(colnames(emailsSparse))
# Find the word stem with the largest total frequency
frequency <- colSums(emailsSparse)
which.max(frequency)
## enron 
##    92

The stem “enron” shows up most frequently across all the emails in the dataset.
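
Note that which.max reports the position of the maximum: the 92 printed above is the column index of “enron” in emailsSparse, not its frequency. To see the actual counts:

# Largest total frequency and the most frequent stems
max(frequency)
head(sort(frequency, decreasing = TRUE))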

How many word stems appear at least 5000 times in the ham emails in the dataset? Hint: in this and the next question, remember not to count the dependent variable we just added.

# Add a variable called spam
emailsSparse$spam = emails$spam
# Total frequency of each word stem in the ham emails, sorted in increasing order
library(knitr)
a = sort(colSums(subset(emailsSparse, spam == 0)))
kable(a)
x
spam 0
life 80
remov 103
money 114
onlin 173
without 191
websit 194
click 217
special 226
wish 229
repli 239
buy 243
net 243
link 247
immedi 249
done 254
mean 259
design 261
lot 268
effect 270
info 273
either 279
read 279
write 286
line 289
begin 291
sorri 293
success 293
involv 294
creat 299
softwar 299
better 301
vkamin 301
say 305
keep 306
bring 311
believ 313
full 317
increas 320
realli 324
mention 325
thought 325
idea 327
invest 327
secur 337
specif 338
sever 340
experi 346
thing 347
allow 348
check 351
due 351
type 352
happi 354
return 355
expect 356
short 357
effort 358
open 360
internet 361
sincer 361
public 364
recent 368
anoth 369
alreadi 372
home 375
made 380
respond 382
given 383
etc 385
put 385
within 386
place 388
right 390
version 390
hello 395
sure 396
area 397
run 398
arrang 399
account 401
join 403
hour 404
locat 406
togeth 406
engin 411
import 411
per 412
corpor 414
high 416
result 418
hear 420
final 422
deal 423
applic 428
even 429
web 430
custom 433
soon 435
long 436
sinc 439
futur 440
member 446
X000 447
event 447
don 450
part 450
feel 453
tuesday 454
wednesday 456
still 457
unit 457
site 458
X853 461
continu 464
understand 464
resourc 466
robert 466
analysi 468
form 468
point 474
assist 475
confirm 485
differ 489
intern 489
might 490
real 490
case 492
howev 496
comment 505
abl 515
complet 515
rate 516
appreci 518
tri 521
move 526
updat 527
approv 533
suggest 533
free 535
contract 544
detail 546
morn 546
end 550
mani 550
attend 558
thursday 558
direct 561
requir 562
cours 567
person 569
relat 573
depart 575
today 577
start 580
way 586
mark 588
valu 590
problem 593
peopl 599
note 600
school 607
invit 614
access 617
term 625
juli 630
monday 630
gibner 633
base 635
director 640
offer 643
cost 646
addit 648
kevin 654
great 655
set 658
file 659
find 665
much 669
oper 669
order 669
deriv 673
doc 673
april 677
book 680
address 693
copi 700
financi 702
month 709
student 710
respons 711
possibl 712
associ 715
particip 717
now 725
first 726
industri 731
dear 734
support 734
plan 738
back 739
name 745
come 748
opportun 760
report 772
product 776
two 787
origin 796
ask 797
credit 798
state 806
system 816
process 826
hope 828
london 828
just 830
receiv 830
chang 831
review 834
current 841
shall 844
friday 847
team 850
phone 858
issu 865
data 868
avail 872
last 874
good 876
give 883
www 897
gas 905
list 907
posit 917
visit 920
includ 924
resum 928
best 933
offic 935
servic 942
talk 943
number 951
well 961
fax 963
provid 970
sent 971
next. 975
send 986
http 1009
john 1022
univers 1025
financ 1038
stinson 1051
schedul 1054
take 1057
date 1060
want 1068
question 1069
program 1080
think 1084
X713 1097
crenshaw 1115
attach 1155
trade 1167
help 1168
email 1201
compani 1225
request 1227
see 1238
communic 1251
confer 1264
discuss 1270
make 1281
contact 1301
follow 1308
interview 1320
project 1328
mail 1352
present 1397
busi 1416
interest 1429
option 1432
day 1440
call 1497
one 1516
year 1523
week 1527
messag 1538
houston 1577
also 1604
look 1607
edu 1620
corp 1643
shirley 1687
develop 1691
get 1768
new 1777
use 1784
let 1856
regard 1859
inform 1883
need 1890
power 1972
may 1976
like 1980
risk 2097
energi 2124
market 2150
model 2170
price 2191
work 2293
manag 2334
know 2345
group 2474
meet 2544
time 2552
research 2752
forward 2952
X2001 3060
can 3426
thank 3558
com 4444
pleas 4494
kaminski 4801
X2000 4935
hou 5569
will 6802
vinc 8531
subject 8625
ect 11417
enron 13388

Reading off the bottom of the table, 6 word stems (“hou”, “will”, “vinc”, “subject”, “ect”, and “enron”) appear at least 5000 times in the ham emails in the dataset.
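
Rather than scanning the full table, the same count can be computed directly. A short sketch, dropping the dependent variable spam before counting (as the hint suggests):

# Word stems whose total frequency in the ham emails is at least 5000
hamFreq = colSums(subset(emailsSparse, spam == 0))
hamFreq = hamFreq[names(hamFreq) != "spam"]   # drop the dependent variable
sum(hamFreq >= 5000)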

How many word stems appear at least 1000 times in the spam emails in the dataset?

# Total frequency of each word stem in the spam emails, sorted in increasing order
a = sort(colSums(subset(emailsSparse, spam == 1)))
kable(a)
x
X713 0
crenshaw 0
enron 0
gibner 0
kaminski 0
stinson 0
vkamin 0
X853 1
vinc 1
doc 2
kevin 2
shirley 2
deriv 3
april 5
houston 5
resum 5
edu 7
friday 7
hou 8
wednesday 8
ect 10
arrang 11
interview 13
attend 15
london 15
robert 16
student 16
schedul 17
thursday 17
monday 19
john 20
tuesday 20
attach 21
suggest 21
appreci 23
mark 25
begin 26
comment 26
analysi 27
X2001 29
model 29
hope 30
mention 30
X2000 32
togeth 32
confer 33
invit 33
univers 34
financ 35
talk 38
either 39
run 39
morn 40
shall 40
happi 42
thought 42
depart 46
confirm 47
respond 48
school 48
corp 49
etc 49
hear 49
howev 49
sorri 50
idea 51
energi 55
discuss 56
open 56
option 56
soon 57
understand 57
cours 59
experi 59
associ 62
point 62
bring 63
director 65
particip 65
anoth 66
join 66
still 66
final 68
research 68
case 69
set 69
specif 69
given 70
juli 71
problem 73
put 73
alreadi 74
ask 74
abl 75
deal 75
fax 75
book 76
team 76
issu 79
locat 79
meet 79
updat 79
lot 80
sincer 80
better 82
short 82
sinc 82
done 83
question 83
recent 83
possibl 84
contract 85
end 85
move 86
data 87
might 87
continu 88
note 88
feel 90
resourc 90
sever 90
area 92
communic 92
realli 93
due 94
direct 96
origin 96
copi 97
unit 97
long 98
member 99
sure 99
allow 102
dear 104
public 104
write 104
event 105
let 107
differ 109
file 111
involv 111
respons 113
creat 114
type 114
approv 115
detail 115
effort 115
intern 117
request 117
say 118
import 119
support 120
part 121
relat 121
assist 123
last 124
two 124
back 125
keep 125
addit 126
date 127
place 128
group 130
mean 131
valu 131
think 132
offic 133
read 134
immedi 136
check 137
applic 139
hello 139
tri 140
review 142
believ 143
phone 143
hour 144
power 145
present 146
process 149
corpor 151
oper 151
full 152
return 154
come 155
sent 155
opportun 158
real 158
repli 158
line 159
engin 160
term 161
credit 162
well 164
gas 165
info 165
plan 166
next. 170
risk 170
increas 171
access 172
give 172
thank 172
link 174
requir 174
version 174
cost 175
great 182
wish 185
regard 186
posit 187
thing 188
call 190
develop 191
complet 192
much 192
even 193
project 194
design 196
form 196
expect 198
person 198
without 198
buy 199
trade 199
effect 201
rate 201
base 202
find 202
current 203
first 203
chang 204
visit 206
financi 207
high 208
mani 208
forward 209
good 221
special 225
don 226
success 226
per 230
number 231
week 231
result 237
web 238
industri 239
contact 242
made 242
follow 244
month 249
right 249
today 251
also 260
help 262
internet 262
manag 266
know 269
way 278
avail 280
state 280
futur 282
home 285
start 300
system 302
take 304
net 305
includ 314
life 320
see 329
name 344
onlin 345
within 346
remov 357
best 358
program 358
peopl 359
custom 363
year 367
like 372
interest 385
send 393
servic 395
look 396
work 415
day 420
want 420
product 421
www 426
account 428
provid 435
need 438
softwar 440
messag 445
site 455
address 461
may 489
list 503
price 503
new 504
websit 506
report 507
secur 520
just 524
offer 528
invest 540
order 541
use 546
click 552
X000 560
now 575
one 592
time 593
http 600
market 600
make 603
free 606
pleas 619
money 662
get 694
receiv 727
inform 818
can 831
email 865
busi 897
mail 917
com 999
compani 1065
spam 1368
will 1450
subject 1577

Excluding the dependent variable spam (which is not a word stem), 3 word stems (“compani”, “will”, and “subject”) appear at least 1000 times in the spam emails in the dataset.
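
The same programmatic count works here; dropping the spam column matters in this case, because it sums to 1368 in the spam subset and would otherwise be counted as a fourth stem:

# Word stems whose total frequency in the spam emails is at least 1000
spamFreq = colSums(subset(emailsSparse, spam == 1))
spamFreq = spamFreq[names(spamFreq) != "spam"]   # drop the dependent variable
sum(spamFreq >= 1000)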

Building Machine Learning Models (Training Set)

First, convert the dependent variable to a factor with "emailsSparse$spam = as.factor(emailsSparse$spam)".

Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called “train” and a testing set called “test”. Make sure to perform this step on emailsSparse instead of emails.

Using the training set, train the following three machine learning models. The models should predict the dependent variable “spam”, using all other available variables as independent variables. Please be patient, as these models may take a few minutes to train.

  1. A logistic regression model called spamLog. You may see a warning message here - we’ll discuss this more later.

  2. A CART model called spamCART, using the default parameters to train the model (don’t worry about adding minbucket or cp). Remember to add the argument method="class" since this is a binary classification problem.

  3. A random forest model called spamRF, using the default parameters to train the model (don’t worry about specifying ntree or nodesize). Directly before training the random forest model, set the random seed to 123 (even though we’ve already done this earlier in the problem, it’s important to set the seed right before training the model so we all obtain the same results. Keep in mind though that on certain operating systems, your results might still be slightly different).

For each model, obtain the predicted spam probabilities for the training set. Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values. Recall that you can obtain probabilities for CART models by not passing any type parameter to the predict() function, and you can obtain probabilities from a random forest by adding the argument type="prob". For CART and random forest, you need to select the second column of the output of the predict() function, corresponding to the probability of a message being spam.

You may have noticed that training the logistic regression model yielded the messages “algorithm did not converge” and “fitted probabilities numerically 0 or 1 occurred”. Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model. Let’s investigate the predicted probabilities from the logistic regression model.

# Convert the dependent variable
emailsSparse$spam = as.factor(emailsSparse$spam)
# Split the dataset into training and testing sets
set.seed(123)
library(caTools)
spl = sample.split(emailsSparse$spam, 0.7)
train = subset(emailsSparse, spl == TRUE)
test = subset(emailsSparse, spl == FALSE)
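
As an optional sanity check, we can verify that sample.split kept the proportion of spam roughly the same in the two sets:

# Proportion of ham (0) and spam (1) in the training and testing sets
prop.table(table(train$spam))
prop.table(table(test$spam))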

Logistic Regression

# Create the logistic regression model
spamLog = glm(spam~., data=train, family="binomial")
# Create the predictions
predTrainLog = predict(spamLog, type="response")

How many of the training set predicted probabilities from spamLog are less than 0.00001?

# Tabulate the predictions
table(predTrainLog < 0.00001)
## 
## FALSE  TRUE 
##   964  3046

3046 training set predicted probabilities from spamLog are less than 0.00001.

How many of the training set predicted probabilities from spamLog are more than 0.99999?
# Tabulate the predictions
table(predTrainLog > 0.99999)
## 
## FALSE  TRUE 
##  3056   954

954 training set predicted probabilities from spamLog are more than 0.99999.

How many of the training set predicted probabilities from spamLog are between 0.00001 and 0.99999?

# Tabulate the predictions
table(predTrainLog >= 0.00001 & predTrainLog <= 0.99999)

Of the 4010 training set observations, 3046 have predicted probabilities below 0.00001 and 954 have predicted probabilities above 0.99999, so 10 training set predicted probabilities from spamLog are between 0.00001 and 0.99999.

How many variables are labeled as significant (at the p=0.05 level) in the logistic regression summary output?
# Output the summary
summary(spamLog)
## 
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.011   0.000   0.000   0.000   1.354  
## 
## Coefficients:
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)   -30.81671 10548.74309  -0.003    0.998
## X000           14.73835 10583.79587   0.001    0.999
## X2000         -36.30653 15559.78162  -0.002    0.998
## X2001         -32.14770 13177.68792  -0.002    0.998
## X713          -24.27301 29138.37993  -0.001    0.999
## X853           -1.21231 59416.82728   0.000    1.000
## abl            -2.04851 20883.26713   0.000    1.000
## access        -14.79724 13353.48989  -0.001    0.999
## account        24.88120  8164.78793   0.003    0.998
## addit           1.46349 27027.11666   0.000    1.000
## address        -4.61291 11134.38676   0.000    1.000
## allow          18.99178  6436.37103   0.003    0.998
## alreadi       -24.07476 33188.28852  -0.001    0.999
## also           29.89671 13781.79015   0.002    0.998
## analysi       -24.05003 38603.03061  -0.001    1.000
## anoth          -8.74404 20316.93640   0.000    1.000
## applic         -2.64873 16735.57715   0.000    1.000
## appreci       -21.44644 27616.28096  -0.001    0.999
## approv         -1.30155 15894.77923   0.000    1.000
## april         -26.20274 22080.53147  -0.001    0.999
## area           20.40642 22657.77444   0.001    0.999
## arrang         10.69469 21352.21386   0.001    1.000
## ask            -7.74592 19763.16826   0.000    1.000
## assist        -11.28267 24895.25793   0.000    1.000
## associ          9.04942 19093.54135   0.000    1.000
## attach        -10.36592 15343.29985  -0.001    0.999
## attend        -34.50552 32573.32268  -0.001    0.999
## avail           8.65114 17094.57157   0.001    1.000
## back          -13.23471 22723.02376  -0.001    1.000
## base          -13.54255 21218.17140  -0.001    0.999
## begin          22.28011 29731.41257   0.001    0.999
## believ         32.32591 21360.08248   0.002    0.999
## best           -8.20054  1333.38661  -0.006    0.995
## better         42.63151 23599.88794   0.002    0.999
## book            4.30072 20235.79190   0.000    1.000
## bring          16.06635 67670.96796   0.000    1.000
## busi           -4.80293 10002.28921   0.000    1.000
## buy            41.70188 38923.92521   0.001    0.999
## call           -1.14501 11111.06778   0.000    1.000
## can             3.76174  7673.89305   0.000    1.000
## case          -33.72402 28804.23279  -0.001    0.999
## chang         -27.16799 22152.89068  -0.001    0.999
## check           1.42516 19631.44189   0.000    1.000
## click          13.76120  7076.98961   0.002    0.998
## com             1.93633  4039.20494   0.000    1.000
## come           -1.16616 15107.73858   0.000    1.000
## comment        -3.25141 33870.01419   0.000    1.000
## communic       15.79546  8958.08782   0.002    0.999
## compani         4.78131  9186.32551   0.001    1.000
## complet       -13.62879 20237.90361  -0.001    0.999
## confer         -0.75029  8557.36337   0.000    1.000
## confirm       -12.99690 15139.72575  -0.001    0.999
## contact         1.53001 12616.52550   0.000    1.000
## continu        14.86611 15351.23871   0.001    0.999
## contract      -12.95405 14984.74369  -0.001    0.999
## copi          -42.73831 30699.56822  -0.001    0.999
## corp           16.05505 27083.03847   0.001    1.000
## corpor         -0.82863 28181.34783   0.000    1.000
## cost           -1.93757 18329.88729   0.000    1.000
## cours          16.65262 18338.38154   0.001    0.999
## creat          13.37623 39460.05157   0.000    1.000
## credit         26.17376 13138.00273   0.002    0.998
## crenshaw       99.94406 67692.02756   0.001    0.999
## current         3.62913 17066.24264   0.000    1.000
## custom         18.28821 10079.08744   0.002    0.999
## data          -26.09087 22714.27741  -0.001    0.999
## date           -2.78615 16985.30607   0.000    1.000
## day            -6.09984  5866.28762  -0.001    0.999
## deal          -11.29372 14476.48731  -0.001    0.999
## dear           -2.31316 23063.89229   0.000    1.000
## depart        -40.68465 25092.95410  -0.002    0.999
## deriv         -49.71057 35873.67244  -0.001    0.999
## design         -7.92306 29388.93892   0.000    1.000
## detail         11.96923 23008.84872   0.001    1.000
## develop         5.97638  9454.56063   0.001    0.999
## differ         -2.29290 10749.59972   0.000    1.000
## direct        -20.50611 31942.88823  -0.001    0.999
## director      -17.69812 17932.01295  -0.001    0.999
## discuss       -10.51005 19154.35311  -0.001    1.000
## doc           -25.97116 26031.83704  -0.001    0.999
## don            21.28659 14561.06709   0.001    0.999
## done            6.82837 18822.05005   0.000    1.000
## due            -4.16267 35316.37257   0.000    1.000
## ect             0.86849  5341.51294   0.000    1.000
## edu            -0.21215   691.74099   0.000    1.000
## effect         19.48236 21002.41283   0.001    0.999
## effort         16.05818 56700.57914   0.000    1.000
## either        -27.44247 39997.01701  -0.001    0.999
## email           3.83283 11856.54591   0.000    1.000
## end           -13.10536 29380.68822   0.000    1.000
## energi        -16.19710 16457.87662  -0.001    0.999
## engin          26.64290 23936.07677   0.001    0.999
## enron          -8.78876  5718.87819  -0.002    0.999
## etc             0.94697 15694.76515   0.000    1.000
## even          -16.53893 22886.63796  -0.001    0.999
## event          16.94185 18505.84730   0.001    0.999
## expect        -11.78693 19139.41707  -0.001    1.000
## experi          2.45969 22404.65521   0.000    1.000
## fax             3.53700 33855.88989   0.000    1.000
## feel            2.59590 23476.27698   0.000    1.000
## file          -29.43243 21649.57371  -0.001    0.999
## final           8.07492 50075.45250   0.000    1.000
## financ         -9.12241  7523.95040  -0.001    0.999
## financi        -9.74670 17271.83784  -0.001    1.000
## find           -2.62282  9727.09459   0.000    1.000
## first          -0.46663 20429.80447   0.000    1.000
## follow         17.65781  3079.68087   0.006    0.995
## form            8.48346 16736.54461   0.001    1.000
## forward        -3.48404 18642.93644   0.000    1.000
## free            6.11316  8121.04177   0.001    0.999
## friday        -11.46161 19964.73259  -0.001    1.000
## full           21.25102 21904.34008   0.001    0.999
## futur          41.45948 14387.24195   0.003    0.998
## gas            -3.90086  4160.29256  -0.001    0.999
## get             5.15375  9737.07069   0.001    1.000
## gibner         29.01185 24595.48183   0.001    0.999
## give          -25.18310 21296.83494  -0.001    0.999
## given         -21.86413 54264.02633   0.000    1.000
## good            5.39940 16193.42812   0.000    1.000
## great          12.21940 10901.07901   0.001    0.999
## group           0.52639 10371.47801   0.000    1.000
## happi           0.01939 12018.68812   0.000    1.000
## hear           28.86533 22809.11427   0.001    0.999
## hello          21.65549 13606.73123   0.002    0.999
## help           17.30963  2790.89981   0.006    0.995
## high           -1.98198 25536.23275   0.000    1.000
## home            5.97294  8964.82707   0.001    0.999
## hope          -14.35451 21794.88576  -0.001    0.999
## hou             6.85153  6436.89472   0.001    0.999
## hour            2.47799 13334.90035   0.000    1.000
## houston       -18.54502  7305.03681  -0.003    0.998
## howev         -34.49274 35618.85713  -0.001    0.999
## http           25.27938 21071.12399   0.001    0.999
## idea          -18.44864 38918.50700   0.000    1.000
## immedi         62.85329 33464.69294   0.002    0.999
## import         -1.85930 22364.33823   0.000    1.000
## includ         -3.45439 17988.89125   0.000    1.000
## increas         6.47593 23286.64042   0.000    1.000
## industri      -31.60069 23734.81080  -0.001    0.999
## info           -1.25474  4857.12017   0.000    1.000
## inform         20.78075  8549.02454   0.002    0.998
## interest       26.98037 11587.59215   0.002    0.998
## intern         -7.99071 33512.78147   0.000    1.000
## internet        8.74897 10999.92712   0.001    0.999
## interview     -16.40484 18733.97043  -0.001    0.999
## invest         32.01252 23934.41479   0.001    0.999
## invit           4.30368 22150.24289   0.000    1.000
## involv         38.14864 33152.60845   0.001    0.999
## issu          -37.08367 33960.70787  -0.001    0.999
## john           -0.53256 28562.06741   0.000    1.000
## join          -38.24082 23338.62282  -0.002    0.999
## juli          -13.57779 30093.27084   0.000    1.000
## just          -10.21157 11140.82560  -0.001    0.999
## kaminski      -18.11964  6029.07127  -0.003    0.998
## keep           18.66596 27816.06998   0.001    0.999
## kevin         -37.79040 47379.74713  -0.001    0.999
## know           12.77077 15263.56770   0.001    0.999
## last            1.04644 13724.44714   0.000    1.000
## let           -27.63338 14620.67500  -0.002    0.998
## life           58.12464 38643.08273   0.002    0.999
## like            5.64936  7659.87875   0.001    0.999
## line            8.74324 12361.53963   0.001    0.999
## link           -6.92851 13446.94610  -0.001    1.000
## list           -8.69209  2148.97953  -0.004    0.997
## locat          20.72567 15965.71676   0.001    0.999
## london          6.74530 16419.73479   0.000    1.000
## long          -14.89135 19336.44934  -0.001    0.999
## look           -7.03074 15631.44591   0.000    1.000
## lot           -19.63678 13211.37522  -0.001    0.999
## made            2.82049 27432.26185   0.000    1.000
## mail            7.58373 10210.95687   0.001    0.999
## make           29.00542 15276.35270   0.002    0.998
## manag           6.01449 14452.54495   0.000    1.000
## mani           18.85052 14418.02739   0.001    0.999
## mark          -33.50071 32080.87051  -0.001    0.999
## market          7.89523  8012.29528   0.001    0.999
## may            -9.43386 13969.56515  -0.001    0.999
## mean            0.60776 29518.71186   0.000    1.000
## meet           -1.06259 12633.55749   0.000    1.000
## member         13.81301 23429.90857   0.001    1.000
## mention       -22.78594 27136.91573  -0.001    0.999
## messag         17.15699  2561.57560   0.007    0.995
## might          12.44156 17533.00513   0.001    0.999
## model         -22.92334 10487.34692  -0.002    0.998
## monday         -1.03402 32330.80963   0.000    1.000
## money          32.63552 13212.06828   0.002    0.998
## month          -3.72670 11123.66899   0.000    1.000
## morn          -26.44760 34027.89144  -0.001    0.999
## move          -38.33622 30112.46626  -0.001    0.999
## much            0.37747 13921.57766   0.000    1.000
## name           16.72141 13218.44812   0.001    0.999
## need            0.84367 12207.61715   0.000    1.000
## net            12.56157 21972.81289   0.001    1.000
## new             1.00331 10091.52526   0.000    1.000
## next.          14.92299 17244.68652   0.001    0.999
## note           14.46034 22937.89167   0.001    0.999
## now            37.89680 12190.24904   0.003    0.998
## number         -9.62184 15914.59792  -0.001    1.000
## offer          11.73834 10837.22113   0.001    0.999
## offic         -13.44163 23114.72339  -0.001    1.000
## one            12.41238  6652.03196   0.002    0.999
## onlin          35.88623 16649.73495   0.002    0.998
## open           21.14171 29613.79926   0.001    0.999
## oper          -16.95704 27565.60102  -0.001    1.000
## opportun       -4.13117 19183.21135   0.000    1.000
## option         -1.08516  9325.32428   0.000    1.000
## order           6.53265 12424.08477   0.001    1.000
## origin         32.26280 38175.24720   0.001    0.999
## part            4.59427 34830.42984   0.000    1.000
## particip      -11.54271 17383.30582  -0.001    0.999
## peopl         -18.63789 14389.74787  -0.001    0.999
## per            13.67495 12732.83389   0.001    0.999
## person         18.69761  9575.47655   0.002    0.998
## phone          -6.95663 11717.69170  -0.001    1.000
## place           9.00530 36608.96507   0.000    1.000
## plan          -18.30364  6320.49885  -0.003    0.998
## pleas          -7.96138  9484.46386  -0.001    0.999
## point           5.49836 34025.65614   0.000    1.000
## posit         -15.43111 23155.99226  -0.001    0.999
## possibl       -13.65960 24918.15730  -0.001    1.000
## power          -5.64308 11727.15930   0.000    1.000
## present        -6.16295 12775.05633   0.000    1.000
## price           3.42759  7849.85957   0.000    1.000
## problem        12.62018  9763.03191   0.001    0.999
## process        -0.29572 11905.84841   0.000    1.000
## product        10.15835 13447.64033   0.001    0.999
## program         1.44411 11831.16188   0.000    1.000
## project         2.17330 14973.05155   0.000    1.000
## provid          0.24225 18589.08726   0.000    1.000
## public        -52.49850 23410.58227  -0.002    0.998
## put           -10.51886 26812.43218   0.000    1.000
## question      -34.67470 18588.44086  -0.002    0.999
## rate           -3.11213 13189.56979   0.000    1.000
## read          -15.27446 21446.74926  -0.001    0.999
## real           20.45912 23580.85242   0.001    0.999
## realli        -26.66848 46403.45625  -0.001    1.000
## receiv          0.57652 15848.49610   0.000    1.000
## recent         -2.06668 17795.16989   0.000    1.000
## regard         -3.66813 15110.01493   0.000    1.000
## relat         -51.13833 17926.46118  -0.003    0.998
## remov          23.25452 24837.86579   0.001    0.999
## repli          15.37977 29155.61883   0.001    1.000
## report        -14.82125 14769.91974  -0.001    0.999
## request       -12.31889 11669.66111  -0.001    0.999
## requir          0.50042 29365.45474   0.000    1.000
## research      -28.25897 15526.46633  -0.002    0.999
## resourc       -27.34889 35221.06048  -0.001    0.999
## respond        29.74186 38879.30348   0.001    0.999
## respons       -19.59598 36666.00577  -0.001    1.000
## result         -0.50024 31401.05156   0.000    1.000
## resum          -9.21906 20996.14073   0.000    1.000
## return         17.45096 18435.18761   0.001    0.999
## review         -4.82452 10132.79683   0.000    1.000
## right          23.11851 15904.45788   0.001    0.999
## risk           -4.00079 17177.99841   0.000    1.000
## robert        -20.95504 29071.43181  -0.001    0.999
## run           -51.62204 44337.51560  -0.001    0.999
## say             7.36621 22174.24418   0.000    1.000
## schedul         1.91913 35796.84272   0.000    1.000
## school         -3.87014 28823.46891   0.000    1.000
## secur         -16.03677  2200.71431  -0.007    0.994
## see           -11.19904 12932.46795  -0.001    0.999
## send          -24.26771 12224.21338  -0.002    0.998
## sent          -14.88198 21953.79637  -0.001    0.999
## servic         -7.16432 12351.22106  -0.001    1.000
## set            -9.35324 26268.89516   0.000    1.000
## sever          20.41198 30927.28109   0.001    0.999
## shall          19.29869 30748.77616   0.001    0.999
## shirley       -71.32873 63289.37737  -0.001    0.999
## short          -8.97353 17207.51481  -0.001    1.000
## sinc           -3.43847 35455.98205   0.000    1.000
## sincer        -20.73171 35145.26470  -0.001    1.000
## site            8.68864 14955.35264   0.001    1.000
## softwar        25.74855 10593.09469   0.002    0.998
## soon           23.49750 37313.28390   0.001    0.999
## sorri           6.03563 22992.82314   0.000    1.000
## special        17.77075 27552.36443   0.001    0.999
## specif        -23.36688 30834.20294  -0.001    0.999
## start          14.37480 18972.26951   0.001    0.999
## state          12.20754 16772.13151   0.001    0.999
## still           3.87790 26222.21248   0.000    1.000
## stinson       -43.45351 26967.01750  -0.002    0.999
## student       -18.14731 21856.41556  -0.001    0.999
## subject        30.41125 10548.74309   0.003    0.998
## success         4.34358 27830.47372   0.000    1.000
## suggest       -38.42169 44745.18597  -0.001    0.999
## support       -15.39269 19761.55243  -0.001    0.999
## sure           -5.50273 20777.09818   0.000    1.000
## system          3.77801  9148.65860   0.000    1.000
## take            5.73138 17156.12167   0.000    1.000
## talk          -10.10574 20206.41806  -0.001    1.000
## team            7.94049 25703.84987   0.000    1.000
## term           20.13285 23031.54376   0.001    0.999
## thank         -38.90473 10586.96129  -0.004    0.997
## thing          25.78599 13405.17195   0.002    0.998
## think         -12.18122 20772.99986  -0.001    1.000
## thought        12.43295 30228.11251   0.000    1.000
## thursday      -14.91355 32617.92027   0.000    1.000
## time           -5.92102  8334.70945  -0.001    0.999
## today         -17.61557 19649.57463  -0.001    0.999
## togeth        -23.54813 18689.97232  -0.001    0.999
## trade         -17.55016 14825.10071  -0.001    0.999
## tri             0.92783 12819.63747   0.000    1.000
## tuesday       -28.08297 39588.86870  -0.001    0.999
## two           -25.72666 18439.43987  -0.001    0.999
## type          -14.47371 27548.25790  -0.001    1.000
## understand      9.30723 23416.65694   0.000    1.000
## unit           -4.02049 30080.64655   0.000    1.000
## univers        12.27580 21969.41146   0.001    1.000
## updat         -15.09781 14480.71856  -0.001    0.999
## use           -13.85349  9381.76315  -0.001    0.999
## valu            0.90239 13599.59155   0.000    1.000
## version       -36.06359 29386.80360  -0.001    0.999
## vinc          -37.34756  8647.15534  -0.004    0.997
## visit          25.84604 11697.85338   0.002    0.998
## vkamin        -66.48981 57028.76975  -0.001    0.999
## want           -2.55510 11057.56463   0.000    1.000
## way            13.38972 11375.38536   0.001    0.999
## web             2.79074 16859.81655   0.000    1.000
## websit        -25.62659 18475.02800  -0.001    0.999
## wednesday     -15.26360 26422.76449  -0.001    1.000
## week           -6.79505 10458.98638  -0.001    0.999
## well          -22.21928  9713.40116  -0.002    0.998
## will          -11.19383  5980.47999  -0.002    0.999
## wish           11.73089 31747.37935   0.000    1.000
## within         29.00289 21632.49653   0.001    0.999
## without        19.41978 17628.74297   0.001    0.999
## work          -10.98745 11596.31706  -0.001    0.999
## write          44.06181 28249.11863   0.002    0.999
## www            -7.86715 22237.59888   0.000    1.000
## year          -10.10293 10394.69041  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4409.49  on 4009  degrees of freedom
## Residual deviance:   13.46  on 3679  degrees of freedom
## AIC: 675.46
## 
## Number of Fisher Scoring iterations: 25

No variables are labeled as significant at the p = 0.05 level; every coefficient has a p-value close to 1, which is consistent with the overfitting warnings discussed above.
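
Rather than scanning the long coefficient table, the significant coefficients can also be counted from the summary object; the p-values Pr(>|z|) are stored in the fourth column of the coefficient matrix:

# Number of coefficients with a p-value below 0.05
coefs = summary(spamLog)$coefficients
sum(coefs[, 4] < 0.05)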

What is the training set accuracy of spamLog, using a threshold of 0.5 for predictions?

# Tabulate actual spam values in the training set against predictions above 0.5
a = table(train$spam, predTrainLog > 0.5)
kable(a)
    FALSE  TRUE
0    3052     0
1       4   954
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9990025

Training Set Accuracy = 0.9990025
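
Since the same accuracy computation is repeated for every model below, it could be wrapped in a small helper. A sketch; the function name accuracyAt is purely illustrative:

# Accuracy of probability predictions at a given threshold
accuracyAt = function(actual, predictedProb, threshold = 0.5) {
  cm = table(actual, predictedProb > threshold)
  sum(diag(cm)) / sum(cm)
}
# Reproduces the training set accuracy of spamLog
accuracyAt(train$spam, predTrainLog)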

What is the training set AUC of spamLog?
# Calculate the training set AUC
library(ROCR)
ROCRpred = prediction(predTrainLog, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9999959

Training Set AUC of spamLog = 0.9999959

CART Model


library(rpart)
library(rpart.plot)

spamCART = rpart(spam ~ ., data=train, method="class")

How many of the word stems “enron”, “hou”, “vinc”, and “kaminski” appear in the CART tree? Recall that we suspect these word stems are specific to Vincent Kaminski and might affect the generalizability of a spam filter built with his ham data.

# Plot the CART model
prp(spamCART)

2 of the 4 stems, “vinc” and “enron”, appear in the CART tree.
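
The split variables can also be read directly off the fitted rpart object instead of the plot; rpart records the variable used at each node in spamCART$frame$var, with "<leaf>" marking terminal nodes:

# Word stems actually used as splits in the CART tree
setdiff(unique(as.character(spamCART$frame$var)), "<leaf>")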

What is the training set accuracy of spamCART, using a threshold of 0.5 for predictions?

# Obtain predicted spam probabilities on the training set (second column of the output)
predTrainCART = predict(spamCART)[,2]
# Tabulate actual spam values in the training set against predictions above 0.5
a = table(train$spam, predTrainCART > 0.5)
kable(a)
    FALSE  TRUE
0    2885   167
1      64   894
# Calculate the accuracy
sum(diag(a))/(sum(a))
## [1] 0.942394

Training Set Accuracy of spamCART = 0.942394

What is the training set AUC of spamCART?
# Calculate the ROCR
library(ROCR)
ROCRpred = prediction(predTrainCART, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9696044

Training Set AUC of spamCART = 0.9696044

Random Forest (RF) Model

# Implement the Random Forest (RF) algorithm
library(randomForest)
set.seed(123)
spamRF = randomForest(spam~., data=train)

What is the training set accuracy of spamRF, using a threshold of 0.5 for predictions?

# Obtain predicted spam probabilities on the training set (second column of the output)
predTrainRF = predict(spamRF, type="prob")[,2]
# Tabulate actual spam values in the training set against predictions above 0.5
a = table(train$spam, predTrainRF > 0.5)
kable(a)
    FALSE  TRUE
0    3015    37
1      42   916
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9802993

Training Set Accuracy = 0.9802993

What is the training set AUC of spamRF?
# Calculate the training set AUC
library(ROCR)
ROCRpred = prediction(predTrainRF, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9978155

Training Set AUC of spamRF = 0.9978155

Building Machine Learning Models (Testing Set)

Logistic Regression

# Make predictions using Logistic Regression
predTestLog = predict(spamLog, newdata = test, type="response")

What is the testing set accuracy of spamLog, using a threshold of 0.5 for predictions?

# Tabulate actual spam values in the testing set against predictions above 0.5
a = table(test$spam, predTestLog > 0.5)
kable(a)
    FALSE  TRUE
0    1257    51
1      34   376
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9505239

Testing Set Accuracy = 0.9505239

What is the testing set AUC of spamLog?
# Calculate the testing set AUC 
library(ROCR)
ROCRpred = prediction(predTestLog, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9627517

Testing Set AUC of spamLog = 0.9627517

CART Model

# Make predictions on the testing set using the CART model (second column = probability of spam)
predTestCART = predict(spamCART, newdata = test)[,2]

What is the testing set accuracy of spamCART, using a threshold of 0.5 for predictions?

# Tabulate actual spam values in the testing set against predictions above 0.5
a = table(test$spam, predTestCART > 0.5)
kable(a)
    FALSE  TRUE
0    1228    80
1      24   386
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9394645

Testing Set Accuracy = 0.9394645

What is the testing set AUC of spamCART?
# Calculating the testing set AUC of spamCART
library(ROCR)
ROCRpred = prediction(predTestCART, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.963176

Testing Set AUC of spamCART = 0.963176

Random Forest (RF) Model

# Make predictions using random forest
predTestRF = predict(spamRF, newdata = test, type="prob")[,2]

What is the testing set accuracy of spamRF, using a threshold of 0.5 for predictions?

# Tabulate actual spam values in the testing set against predictions above 0.5
a = table(test$spam, predTestRF > 0.5)
kable(a)
    FALSE  TRUE
0    1291    17
1      23   387
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9767171

Testing Set Accuracy = 0.9767171

What is the testing set AUC of spamRF?
# Calculate the testing set AUC of spamRF
library(ROCR)
ROCRpred = prediction(predTestRF, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9975899

Testing Set AUC of spamRF = 0.9975899
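
To compare the three models at a glance, the testing set numbers reported above can be collected into a small table (a sketch that simply reuses the values computed earlier):

# Testing set performance of the three models, as reported above
data.frame(model    = c("spamLog", "spamCART", "spamRF"),
           accuracy = c(0.9505239, 0.9394645, 0.9767171),
           auc      = c(0.9627517, 0.9631760, 0.9975899))

The random forest has both the highest testing set accuracy and the highest testing set AUC.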