Background Information on the Dataset

Nearly every email user has at some point encountered a “spam” email: an unsolicited message that often advertises a product, contains links to malware, or attempts to scam the recipient. Roughly 80-90% of the more than 100 billion emails sent each day are spam, most of them sent from botnets of malware-infected computers. The remaining, legitimate messages are called “ham” emails.

As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.

In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

The dataset contains just two fields:

  1. text: the text of the email.

  2. spam: a binary variable indicating whether the email is spam (1) or ham (0).

R Exercises

Loading the Dataset

Begin by loading the dataset emails.csv into a data frame called emails. Remember to pass the stringsAsFactors=FALSE option when loading the data.

emails = read.csv("emails.csv", stringsAsFactors=FALSE)

How many emails are in the dataset?

# Examine the structure of the emails data frame
str(emails)
## 'data.frame':    5728 obs. of  2 variables:
##  $ text: chr  "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market"| __truncated__ "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__ "Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved"| __truncated__ "Subject: 4 color printing special  request additional information now ! click here  click here for a printable "| __truncated__ ...
##  $ spam: int  1 1 1 1 1 1 1 1 1 1 ...

There are 5728 emails in the dataset.

How many of the emails are spam?

# Tabulate how many emails are spam
table(emails$spam)
## 
##    0    1 
## 4360 1368

1368 emails are spam.

Which word appears at the beginning of every email in the dataset? Respond as a lower-case word with punctuation removed.

# Inspect the text of one email; every email begins the same way
str(emails$text[2])
##  chr "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__

The word “subject” appears at the beginning of every email in the dataset (each message starts with “Subject:”).

Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?

We know that each email has the word “subject” appear at least once, but the frequency with which it appears might help us differentiate spam from ham. For instance, a long email chain would have the word “subject” appear a number of times, and this higher frequency might be indicative of a ham message.
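
As a quick check of this idea, we can count how many times “subject” actually occurs in each email. The sketch below uses only base R; the variable name subjectCount is purely illustrative.

# Count occurrences of "subject" in each email, ignoring case
# (gregexpr returns -1 when there is no match, so we count the positive match positions)
subjectCount = sapply(gregexpr("subject", tolower(emails$text), fixed = TRUE),
                      function(m) sum(m > 0))
# Distribution of the per-email counts
summary(subjectCount)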

How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)?

max(nchar(emails$text))
## [1] 43952

43952 characters are in the longest email.

Which row contains the shortest email in the dataset? (Just like in the previous problem, shortest is measured in terms of the fewest number of characters.)

# Finds the row with the shortest email
which.min(nchar(emails$text))
## [1] 1992

Row 1992 contains the shortest email in the dataset.
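
Note that which.min returns only the first row attaining the minimum, so if several emails tied for the shortest length we would only see one of them. A quick sketch to check for ties:

# All rows whose email length equals the minimum length
which(nchar(emails$text) == min(nchar(emails$text)))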

Preparing the Corpus

  1. Build a new corpus variable called corpus.

  2. Using tm_map, convert the text to lowercase.

  3. Using tm_map, remove all punctuation from the corpus.

  4. Using tm_map, remove all English stopwords from the corpus.

  5. Using tm_map, stem the words in the corpus.

  6. Build a document term matrix from the corpus, called dtm.

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).

How many terms are in dtm?

# Preparing the Corpus
library(tm)
# Build a new corpus variable called corpus
corpus = VCorpus(VectorSource(emails$text))
# Convert the text to lowercase.
corpus = tm_map(corpus, content_transformer(tolower))
# Remove all punctuation from the corpus
corpus = tm_map(corpus, removePunctuation)
# Remove all English stopwords from the corpus
corpus = tm_map(corpus, removeWords, stopwords("english"))
# Stem the words in the corpus
corpus = tm_map(corpus, stemDocument)
# Build a document term matrix from the corpus, called dtm
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)

28687 terms are in dtm.

To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents, and store this result as spdtm (don’t overwrite dtm, because we will use it in a later step of this homework). How many terms are in spdtm?

# Remove the sparse terms
spdtm = removeSparseTerms(dtm, 0.95)
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity           : 89%
## Maximal term length: 10
## Weighting          : term frequency (tf)

330 terms are in spdtm.
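
The sparsity threshold of 0.95 keeps the terms that appear in more than 5% of the documents. We can check this interpretation directly against dtm; the sketch below uses the fact that a DocumentTermMatrix is stored as a sparse triplet matrix, so dtm$j holds the term index of every non-zero cell.

# Document frequency of each term = number of documents in which it appears
docFreq = tabulate(dtm$j, nbins = ncol(dtm))
# Terms appearing in more than 5% of the documents; this should come out to the same 330 terms kept in spdtm
sum(docFreq > 0.05 * nrow(dtm))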

Build a data frame called emailsSparse from spdtm, and use the make.names function to make the variable names of emailsSparse valid. What is the word stem that shows up most frequently across all the emails in the dataset?

# Build data frame called emailsSparse from spdtm
emailsSparse = as.data.frame(as.matrix(spdtm))
colnames(emailsSparse) = make.names(colnames(emailsSparse))
# Find the word stem with the largest total frequency
frequency <- colSums(emailsSparse)
which.max(frequency)
## enron 
##    92

The stem “enron” shows up most frequently across all the emails in the dataset.
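
Note that which.max reports the position of the maximum: the 92 printed above is the column index of “enron” in emailsSparse, not its frequency. To see the actual counts:

# Largest total frequency and the most frequent stems
max(frequency)
head(sort(frequency, decreasing = TRUE))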

How many word stems appear at least 5000 times in the ham emails in the dataset? Hint: in this and the next question, remember not to count the dependent variable we just added.

# Add a variable called spam
emailsSparse$spam = emails$spam
# Total frequency of each word stem in the ham emails, sorted in increasing order
library(knitr)
a = sort(colSums(subset(emailsSparse, spam == 0)))
kable(a)
x
spam 0
life 80
remov 103
money 114
onlin 173
without 191
websit 194
click 217
special 226
wish 229
repli 239
buy 243
net 243
link 247
immedi 249
done 254
mean 259
design 261
lot 268
effect 270
info 273
either 279
read 279
write 286
line 289
begin 291
sorri 293
success 293
involv 294
creat 299
softwar 299
better 301
vkamin 301
say 305
keep 306
bring 311
believ 313
full 317
increas 320
realli 324
mention 325
thought 325
idea 327
invest 327
secur 337
specif 338
sever 340
experi 346
thing 347
allow 348
check 351
due 351
type 352
happi 354
return 355
expect 356
short 357
effort 358
open 360
internet 361
sincer 361
public 364
recent 368
anoth 369
alreadi 372
home 375
made 380
respond 382
given 383
etc 385
put 385
within 386
place 388
right 390
version 390
hello 395
sure 396
area 397
run 398
arrang 399
account 401
join 403
hour 404
locat 406
togeth 406
engin 411
import 411
per 412
corpor 414
high 416
result 418
hear 420
final 422
deal 423
applic 428
even 429
web 430
custom 433
soon 435
long 436
sinc 439
futur 440
member 446
X000 447
event 447
don 450
part 450
feel 453
tuesday 454
wednesday 456
still 457
unit 457
site 458
X853 461
continu 464
understand 464
resourc 466
robert 466
analysi 468
form 468
point 474
assist 475
confirm 485
differ 489
intern 489
might 490
real 490
case 492
howev 496
comment 505
abl 515
complet 515
rate 516
appreci 518
tri 521
move 526
updat 527
approv 533
suggest 533
free 535
contract 544
detail 546
morn 546
end 550
mani 550
attend 558
thursday 558
direct 561
requir 562
cours 567
person 569
relat 573
depart 575
today 577
start 580
way 586
mark 588
valu 590
problem 593
peopl 599
note 600
school 607
invit 614
access 617
term 625
juli 630
monday 630
gibner 633
base 635
director 640
offer 643
cost 646
addit 648
kevin 654
great 655
set 658
file 659
find 665
much 669
oper 669
order 669
deriv 673
doc 673
april 677
book 680
address 693
copi 700
financi 702
month 709
student 710
respons 711
possibl 712
associ 715
particip 717
now 725
first 726
industri 731
dear 734
support 734
plan 738
back 739
name 745
come 748
opportun 760
report 772
product 776
two 787
origin 796
ask 797
credit 798
state 806
system 816
process 826
hope 828
london 828
just 830
receiv 830
chang 831
review 834
current 841
shall 844
friday 847
team 850
phone 858
issu 865
data 868
avail 872
last 874
good 876
give 883
www 897
gas 905
list 907
posit 917
visit 920
includ 924
resum 928
best 933
offic 935
servic 942
talk 943
number 951
well 961
fax 963
provid 970
sent 971
next. 975
send 986
http 1009
john 1022
univers 1025
financ 1038
stinson 1051
schedul 1054
take 1057
date 1060
want 1068
question 1069
program 1080
think 1084
X713 1097
crenshaw 1115
attach 1155
trade 1167
help 1168
email 1201
compani 1225
request 1227
see 1238
communic 1251
confer 1264
discuss 1270
make 1281
contact 1301
follow 1308
interview 1320
project 1328
mail 1352
present 1397
busi 1416
interest 1429
option 1432
day 1440
call 1497
one 1516
year 1523
week 1527
messag 1538
houston 1577
also 1604
look 1607
edu 1620
corp 1643
shirley 1687
develop 1691
get 1768
new 1777
use 1784
let 1856
regard 1859
inform 1883
need 1890
power 1972
may 1976
like 1980
risk 2097
energi 2124
market 2150
model 2170
price 2191
work 2293
manag 2334
know 2345
group 2474
meet 2544
time 2552
research 2752
forward 2952
X2001 3060
can 3426
thank 3558
com 4444
pleas 4494
kaminski 4801
X2000 4935
hou 5569
will 6802
vinc 8531
subject 8625
ect 11417
enron 13388

Reading off the bottom of the table, 6 word stems (“hou”, “will”, “vinc”, “subject”, “ect”, and “enron”) appear at least 5000 times in the ham emails in the dataset.
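
Rather than scanning the full table, the same count can be computed directly. A short sketch, dropping the dependent variable spam before counting (as the hint suggests):

# Word stems whose total frequency in the ham emails is at least 5000
hamFreq = colSums(subset(emailsSparse, spam == 0))
hamFreq = hamFreq[names(hamFreq) != "spam"]   # drop the dependent variable
sum(hamFreq >= 5000)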

How many word stems appear at least 1000 times in the spam emails in the dataset?

# Total frequency of each word stem in the spam emails, sorted in increasing order
a = sort(colSums(subset(emailsSparse, spam == 1)))
kable(a)
x
X713 0
crenshaw 0
enron 0
gibner 0
kaminski 0
stinson 0
vkamin 0
X853 1
vinc 1
doc 2
kevin 2
shirley 2
deriv 3
april 5
houston 5
resum 5
edu 7
friday 7
hou 8
wednesday 8
ect 10
arrang 11
interview 13
attend 15
london 15
robert 16
student 16
schedul 17
thursday 17
monday 19
john 20
tuesday 20
attach 21
suggest 21
appreci 23
mark 25
begin 26
comment 26
analysi 27
X2001 29
model 29
hope 30
mention 30
X2000 32
togeth 32
confer 33
invit 33
univers 34
financ 35
talk 38
either 39
run 39
morn 40
shall 40
happi 42
thought 42
depart 46
confirm 47
respond 48
school 48
corp 49
etc 49
hear 49
howev 49
sorri 50
idea 51
energi 55
discuss 56
open 56
option 56
soon 57
understand 57
cours 59
experi 59
associ 62
point 62
bring 63
director 65
particip 65
anoth 66
join 66
still 66
final 68
research 68
case 69
set 69
specif 69
given 70
juli 71
problem 73
put 73
alreadi 74
ask 74
abl 75
deal 75
fax 75
book 76
team 76
issu 79
locat 79
meet 79
updat 79
lot 80
sincer 80
better 82
short 82
sinc 82
done 83
question 83
recent 83
possibl 84
contract 85
end 85
move 86
data 87
might 87
continu 88
note 88
feel 90
resourc 90
sever 90
area 92
communic 92
realli 93
due 94
direct 96
origin 96
copi 97
unit 97
long 98
member 99
sure 99
allow 102
dear 104
public 104
write 104
event 105
let 107
differ 109
file 111
involv 111
respons 113
creat 114
type 114
approv 115
detail 115
effort 115
intern 117
request 117
say 118
import 119
support 120
part 121
relat 121
assist 123
last 124
two 124
back 125
keep 125
addit 126
date 127
place 128
group 130
mean 131
valu 131
think 132
offic 133
read 134
immedi 136
check 137
applic 139
hello 139
tri 140
review 142
believ 143
phone 143
hour 144
power 145
present 146
process 149
corpor 151
oper 151
full 152
return 154
come 155
sent 155
opportun 158
real 158
repli 158
line 159
engin 160
term 161
credit 162
well 164
gas 165
info 165
plan 166
next. 170
risk 170
increas 171
access 172
give 172
thank 172
link 174
requir 174
version 174
cost 175
great 182
wish 185
regard 186
posit 187
thing 188
call 190
develop 191
complet 192
much 192
even 193
project 194
design 196
form 196
expect 198
person 198
without 198
buy 199
trade 199
effect 201
rate 201
base 202
find 202
current 203
first 203
chang 204
visit 206
financi 207
high 208
mani 208
forward 209
good 221
special 225
don 226
success 226
per 230
number 231
week 231
result 237
web 238
industri 239
contact 242
made 242
follow 244
month 249
right 249
today 251
also 260
help 262
internet 262
manag 266
know 269
way 278
avail 280
state 280
futur 282
home 285
start 300
system 302
take 304
net 305
includ 314
life 320
see 329
name 344
onlin 345
within 346
remov 357
best 358
program 358
peopl 359
custom 363
year 367
like 372
interest 385
send 393
servic 395
look 396
work 415
day 420
want 420
product 421
www 426
account 428
provid 435
need 438
softwar 440
messag 445
site 455
address 461
may 489
list 503
price 503
new 504
websit 506
report 507
secur 520
just 524
offer 528
invest 540
order 541
use 546
click 552
X000 560
now 575
one 592
time 593
http 600
market 600
make 603
free 606
pleas 619
money 662
get 694
receiv 727
inform 818
can 831
email 865
busi 897
mail 917
com 999
compani 1065
spam 1368
will 1450
subject 1577

Excluding the dependent variable spam (which is not a word stem), 3 word stems (“compani”, “will”, and “subject”) appear at least 1000 times in the spam emails in the dataset.
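
The same programmatic count works here; dropping the spam column matters in this case, because it sums to 1368 in the spam subset and would otherwise be counted as a fourth stem:

# Word stems whose total frequency in the spam emails is at least 1000
spamFreq = colSums(subset(emailsSparse, spam == 1))
spamFreq = spamFreq[names(spamFreq) != "spam"]   # drop the dependent variable
sum(spamFreq >= 1000)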

Building Machine Learning Models (Training Set)

First, convert the dependent variable to a factor with "emailsSparse$spam = as.factor(emailsSparse$spam)".

Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called “train” and a testing set called “test”. Make sure to perform this step on emailsSparse instead of emails.

Using the training set, train the following three machine learning models. The models should predict the dependent variable “spam”, using all other available variables as independent variables. Please be patient, as these models may take a few minutes to train.

  1. A logistic regression model called spamLog. You may see a warning message here - we’ll discuss this more later.

  2. A CART model called spamCART, using the default parameters to train the model (don’t worry about adding minbucket or cp). Remember to add the argument method="class" since this is a binary classification problem.

  3. A random forest model called spamRF, using the default parameters to train the model (don’t worry about specifying ntree or nodesize). Directly before training the random forest model, set the random seed to 123 (even though we’ve already done this earlier in the problem, it’s important to set the seed right before training the model so we all obtain the same results. Keep in mind though that on certain operating systems, your results might still be slightly different).

For each model, obtain the predicted spam probabilities for the training set. Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values. Recall that you can obtain probabilities for CART models by not passing any type parameter to the predict() function, and you can obtain probabilities from a random forest by adding the argument type="prob". For CART and random forest, you need to select the second column of the output of the predict() function, corresponding to the probability of a message being spam.

You may have noticed that training the logistic regression model yielded the messages “algorithm did not converge” and “fitted probabilities numerically 0 or 1 occurred”. Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model. Let’s investigate the predicted probabilities from the logistic regression model.

# Convert the dependent variable
emailsSparse$spam = as.factor(emailsSparse$spam)
# Split the dataset into training and testing sets
set.seed(123)
library(caTools)
spl = sample.split(emailsSparse$spam, 0.7)
train = subset(emailsSparse, spl == TRUE)
test = subset(emailsSparse, spl == FALSE)
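
As an optional sanity check, we can verify that sample.split kept the proportion of spam roughly the same in the two sets:

# Proportion of ham (0) and spam (1) in the training and testing sets
prop.table(table(train$spam))
prop.table(table(test$spam))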

Logistic Regression

# Create the logistic regression model
spamLog = glm(spam~., data=train, family="binomial")
# Create the predictions
predTrainLog = predict(spamLog, type="response")

How many of the training set predicted probabilities from spamLog are less than 0.00001?

# Tabulate the predictions
table(predTrainLog < 0.00001)
## 
## FALSE  TRUE 
##   964  3046

3046 training set predicted probabilities from spamLog are less than 0.00001.

How many of the training set predicted probabilities from spamLog are more than 0.99999?
# Tabulate the predictions
table(predTrainLog > 0.99999)
## 
## FALSE  TRUE 
##  3056   954

954 training set predicted probabilities from spamLog are more than 0.99999.

How many of the training set predicted probabilities from spamLog are between 0.00001 and 0.99999?

# Tabulate the predictions
table(predTrainLog >= 0.00001 & predTrainLog <= 0.99999)

Of the 4010 training set observations, 3046 have predicted probabilities below 0.00001 and 954 have predicted probabilities above 0.99999, so 10 training set predicted probabilities from spamLog are between 0.00001 and 0.99999.

How many variables are labeled as significant (at the p=0.05 level) in the logistic regression summary output?
# Output the summary
summary(spamLog)
## 
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.011   0.000   0.000   0.000   1.354  
## 
## Coefficients:
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)   -30.81671 10548.74309  -0.003    0.998
## X000           14.73835 10583.79587   0.001    0.999
## X2000         -36.30653 15559.78162  -0.002    0.998
## X2001         -32.14770 13177.68792  -0.002    0.998
## X713          -24.27301 29138.37993  -0.001    0.999
## X853           -1.21231 59416.82728   0.000    1.000
## abl            -2.04851 20883.26713   0.000    1.000
## access        -14.79724 13353.48989  -0.001    0.999
## account        24.88120  8164.78793   0.003    0.998
## addit           1.46349 27027.11666   0.000    1.000
## address        -4.61291 11134.38676   0.000    1.000
## allow          18.99178  6436.37103   0.003    0.998
## alreadi       -24.07476 33188.28852  -0.001    0.999
## also           29.89671 13781.79015   0.002    0.998
## analysi       -24.05003 38603.03061  -0.001    1.000
## anoth          -8.74404 20316.93640   0.000    1.000
## applic         -2.64873 16735.57715   0.000    1.000
## appreci       -21.44644 27616.28096  -0.001    0.999
## approv         -1.30155 15894.77923   0.000    1.000
## april         -26.20274 22080.53147  -0.001    0.999
## area           20.40642 22657.77444   0.001    0.999
## arrang         10.69469 21352.21386   0.001    1.000
## ask            -7.74592 19763.16826   0.000    1.000
## assist        -11.28267 24895.25793   0.000    1.000
## associ          9.04942 19093.54135   0.000    1.000
## attach        -10.36592 15343.29985  -0.001    0.999
## attend        -34.50552 32573.32268  -0.001    0.999
## avail           8.65114 17094.57157   0.001    1.000
## back          -13.23471 22723.02376  -0.001    1.000
## base          -13.54255 21218.17140  -0.001    0.999
## begin          22.28011 29731.41257   0.001    0.999
## believ         32.32591 21360.08248   0.002    0.999
## best           -8.20054  1333.38661  -0.006    0.995
## better         42.63151 23599.88794   0.002    0.999
## book            4.30072 20235.79190   0.000    1.000
## bring          16.06635 67670.96796   0.000    1.000
## busi           -4.80293 10002.28921   0.000    1.000
## buy            41.70188 38923.92521   0.001    0.999
## call           -1.14501 11111.06778   0.000    1.000
## can             3.76174  7673.89305   0.000    1.000
## case          -33.72402 28804.23279  -0.001    0.999
## chang         -27.16799 22152.89068  -0.001    0.999
## check           1.42516 19631.44189   0.000    1.000
## click          13.76120  7076.98961   0.002    0.998
## com             1.93633  4039.20494   0.000    1.000
## come           -1.16616 15107.73858   0.000    1.000
## comment        -3.25141 33870.01419   0.000    1.000
## communic       15.79546  8958.08782   0.002    0.999
## compani         4.78131  9186.32551   0.001    1.000
## complet       -13.62879 20237.90361  -0.001    0.999
## confer         -0.75029  8557.36337   0.000    1.000
## confirm       -12.99690 15139.72575  -0.001    0.999
## contact         1.53001 12616.52550   0.000    1.000
## continu        14.86611 15351.23871   0.001    0.999
## contract      -12.95405 14984.74369  -0.001    0.999
## copi          -42.73831 30699.56822  -0.001    0.999
## corp           16.05505 27083.03847   0.001    1.000
## corpor         -0.82863 28181.34783   0.000    1.000
## cost           -1.93757 18329.88729   0.000    1.000
## cours          16.65262 18338.38154   0.001    0.999
## creat          13.37623 39460.05157   0.000    1.000
## credit         26.17376 13138.00273   0.002    0.998
## crenshaw       99.94406 67692.02756   0.001    0.999
## current         3.62913 17066.24264   0.000    1.000
## custom         18.28821 10079.08744   0.002    0.999
## data          -26.09087 22714.27741  -0.001    0.999
## date           -2.78615 16985.30607   0.000    1.000
## day            -6.09984  5866.28762  -0.001    0.999
## deal          -11.29372 14476.48731  -0.001    0.999
## dear           -2.31316 23063.89229   0.000    1.000
## depart        -40.68465 25092.95410  -0.002    0.999
## deriv         -49.71057 35873.67244  -0.001    0.999
## design         -7.92306 29388.93892   0.000    1.000
## detail         11.96923 23008.84872   0.001    1.000
## develop         5.97638  9454.56063   0.001    0.999
## differ         -2.29290 10749.59972   0.000    1.000
## direct        -20.50611 31942.88823  -0.001    0.999
## director      -17.69812 17932.01295  -0.001    0.999
## discuss       -10.51005 19154.35311  -0.001    1.000
## doc           -25.97116 26031.83704  -0.001    0.999
## don            21.28659 14561.06709   0.001    0.999
## done            6.82837 18822.05005   0.000    1.000
## due            -4.16267 35316.37257   0.000    1.000
## ect             0.86849  5341.51294   0.000    1.000
## edu            -0.21215   691.74099   0.000    1.000
## effect         19.48236 21002.41283   0.001    0.999
## effort         16.05818 56700.57914   0.000    1.000
## either        -27.44247 39997.01701  -0.001    0.999
## email           3.83283 11856.54591   0.000    1.000
## end           -13.10536 29380.68822   0.000    1.000
## energi        -16.19710 16457.87662  -0.001    0.999
## engin          26.64290 23936.07677   0.001    0.999
## enron          -8.78876  5718.87819  -0.002    0.999
## etc             0.94697 15694.76515   0.000    1.000
## even          -16.53893 22886.63796  -0.001    0.999
## event          16.94185 18505.84730   0.001    0.999
## expect        -11.78693 19139.41707  -0.001    1.000
## experi          2.45969 22404.65521   0.000    1.000
## fax             3.53700 33855.88989   0.000    1.000
## feel            2.59590 23476.27698   0.000    1.000
## file          -29.43243 21649.57371  -0.001    0.999
## final           8.07492 50075.45250   0.000    1.000
## financ         -9.12241  7523.95040  -0.001    0.999
## financi        -9.74670 17271.83784  -0.001    1.000
## find           -2.62282  9727.09459   0.000    1.000
## first          -0.46663 20429.80447   0.000    1.000
## follow         17.65781  3079.68087   0.006    0.995
## form            8.48346 16736.54461   0.001    1.000
## forward        -3.48404 18642.93644   0.000    1.000
## free            6.11316  8121.04177   0.001    0.999
## friday        -11.46161 19964.73259  -0.001    1.000
## full           21.25102 21904.34008   0.001    0.999
## futur          41.45948 14387.24195   0.003    0.998
## gas            -3.90086  4160.29256  -0.001    0.999
## get             5.15375  9737.07069   0.001    1.000
## gibner         29.01185 24595.48183   0.001    0.999
## give          -25.18310 21296.83494  -0.001    0.999
## given         -21.86413 54264.02633   0.000    1.000
## good            5.39940 16193.42812   0.000    1.000
## great          12.21940 10901.07901   0.001    0.999
## group           0.52639 10371.47801   0.000    1.000
## happi           0.01939 12018.68812   0.000    1.000
## hear           28.86533 22809.11427   0.001    0.999
## hello          21.65549 13606.73123   0.002    0.999
## help           17.30963  2790.89981   0.006    0.995
## high           -1.98198 25536.23275   0.000    1.000
## home            5.97294  8964.82707   0.001    0.999
## hope          -14.35451 21794.88576  -0.001    0.999
## hou             6.85153  6436.89472   0.001    0.999
## hour            2.47799 13334.90035   0.000    1.000
## houston       -18.54502  7305.03681  -0.003    0.998
## howev         -34.49274 35618.85713  -0.001    0.999
## http           25.27938 21071.12399   0.001    0.999
## idea          -18.44864 38918.50700   0.000    1.000
## immedi         62.85329 33464.69294   0.002    0.999
## import         -1.85930 22364.33823   0.000    1.000
## includ         -3.45439 17988.89125   0.000    1.000
## increas         6.47593 23286.64042   0.000    1.000
## industri      -31.60069 23734.81080  -0.001    0.999
## info           -1.25474  4857.12017   0.000    1.000
## inform         20.78075  8549.02454   0.002    0.998
## interest       26.98037 11587.59215   0.002    0.998
## intern         -7.99071 33512.78147   0.000    1.000
## internet        8.74897 10999.92712   0.001    0.999
## interview     -16.40484 18733.97043  -0.001    0.999
## invest         32.01252 23934.41479   0.001    0.999
## invit           4.30368 22150.24289   0.000    1.000
## involv         38.14864 33152.60845   0.001    0.999
## issu          -37.08367 33960.70787  -0.001    0.999
## john           -0.53256 28562.06741   0.000    1.000
## join          -38.24082 23338.62282  -0.002    0.999
## juli          -13.57779 30093.27084   0.000    1.000
## just          -10.21157 11140.82560  -0.001    0.999
## kaminski      -18.11964  6029.07127  -0.003    0.998
## keep           18.66596 27816.06998   0.001    0.999
## kevin         -37.79040 47379.74713  -0.001    0.999
## know           12.77077 15263.56770   0.001    0.999
## last            1.04644 13724.44714   0.000    1.000
## let           -27.63338 14620.67500  -0.002    0.998
## life           58.12464 38643.08273   0.002    0.999
## like            5.64936  7659.87875   0.001    0.999
## line            8.74324 12361.53963   0.001    0.999
## link           -6.92851 13446.94610  -0.001    1.000
## list           -8.69209  2148.97953  -0.004    0.997
## locat          20.72567 15965.71676   0.001    0.999
## london          6.74530 16419.73479   0.000    1.000
## long          -14.89135 19336.44934  -0.001    0.999
## look           -7.03074 15631.44591   0.000    1.000
## lot           -19.63678 13211.37522  -0.001    0.999
## made            2.82049 27432.26185   0.000    1.000
## mail            7.58373 10210.95687   0.001    0.999
## make           29.00542 15276.35270   0.002    0.998
## manag           6.01449 14452.54495   0.000    1.000
## mani           18.85052 14418.02739   0.001    0.999
## mark          -33.50071 32080.87051  -0.001    0.999
## market          7.89523  8012.29528   0.001    0.999
## may            -9.43386 13969.56515  -0.001    0.999
## mean            0.60776 29518.71186   0.000    1.000
## meet           -1.06259 12633.55749   0.000    1.000
## member         13.81301 23429.90857   0.001    1.000
## mention       -22.78594 27136.91573  -0.001    0.999
## messag         17.15699  2561.57560   0.007    0.995
## might          12.44156 17533.00513   0.001    0.999
## model         -22.92334 10487.34692  -0.002    0.998
## monday         -1.03402 32330.80963   0.000    1.000
## money          32.63552 13212.06828   0.002    0.998
## month          -3.72670 11123.66899   0.000    1.000
## morn          -26.44760 34027.89144  -0.001    0.999
## move          -38.33622 30112.46626  -0.001    0.999
## much            0.37747 13921.57766   0.000    1.000
## name           16.72141 13218.44812   0.001    0.999
## need            0.84367 12207.61715   0.000    1.000
## net            12.56157 21972.81289   0.001    1.000
## new             1.00331 10091.52526   0.000    1.000
## next.          14.92299 17244.68652   0.001    0.999
## note           14.46034 22937.89167   0.001    0.999
## now            37.89680 12190.24904   0.003    0.998
## number         -9.62184 15914.59792  -0.001    1.000
## offer          11.73834 10837.22113   0.001    0.999
## offic         -13.44163 23114.72339  -0.001    1.000
## one            12.41238  6652.03196   0.002    0.999
## onlin          35.88623 16649.73495   0.002    0.998
## open           21.14171 29613.79926   0.001    0.999
## oper          -16.95704 27565.60102  -0.001    1.000
## opportun       -4.13117 19183.21135   0.000    1.000
## option         -1.08516  9325.32428   0.000    1.000
## order           6.53265 12424.08477   0.001    1.000
## origin         32.26280 38175.24720   0.001    0.999
## part            4.59427 34830.42984   0.000    1.000
## particip      -11.54271 17383.30582  -0.001    0.999
## peopl         -18.63789 14389.74787  -0.001    0.999
## per            13.67495 12732.83389   0.001    0.999
## person         18.69761  9575.47655   0.002    0.998
## phone          -6.95663 11717.69170  -0.001    1.000
## place           9.00530 36608.96507   0.000    1.000
## plan          -18.30364  6320.49885  -0.003    0.998
## pleas          -7.96138  9484.46386  -0.001    0.999
## point           5.49836 34025.65614   0.000    1.000
## posit         -15.43111 23155.99226  -0.001    0.999
## possibl       -13.65960 24918.15730  -0.001    1.000
## power          -5.64308 11727.15930   0.000    1.000
## present        -6.16295 12775.05633   0.000    1.000
## price           3.42759  7849.85957   0.000    1.000
## problem        12.62018  9763.03191   0.001    0.999
## process        -0.29572 11905.84841   0.000    1.000
## product        10.15835 13447.64033   0.001    0.999
## program         1.44411 11831.16188   0.000    1.000
## project         2.17330 14973.05155   0.000    1.000
## provid          0.24225 18589.08726   0.000    1.000
## public        -52.49850 23410.58227  -0.002    0.998
## put           -10.51886 26812.43218   0.000    1.000
## question      -34.67470 18588.44086  -0.002    0.999
## rate           -3.11213 13189.56979   0.000    1.000
## read          -15.27446 21446.74926  -0.001    0.999
## real           20.45912 23580.85242   0.001    0.999
## realli        -26.66848 46403.45625  -0.001    1.000
## receiv          0.57652 15848.49610   0.000    1.000
## recent         -2.06668 17795.16989   0.000    1.000
## regard         -3.66813 15110.01493   0.000    1.000
## relat         -51.13833 17926.46118  -0.003    0.998
## remov          23.25452 24837.86579   0.001    0.999
## repli          15.37977 29155.61883   0.001    1.000
## report        -14.82125 14769.91974  -0.001    0.999
## request       -12.31889 11669.66111  -0.001    0.999
## requir          0.50042 29365.45474   0.000    1.000
## research      -28.25897 15526.46633  -0.002    0.999
## resourc       -27.34889 35221.06048  -0.001    0.999
## respond        29.74186 38879.30348   0.001    0.999
## respons       -19.59598 36666.00577  -0.001    1.000
## result         -0.50024 31401.05156   0.000    1.000
## resum          -9.21906 20996.14073   0.000    1.000
## return         17.45096 18435.18761   0.001    0.999
## review         -4.82452 10132.79683   0.000    1.000
## right          23.11851 15904.45788   0.001    0.999
## risk           -4.00079 17177.99841   0.000    1.000
## robert        -20.95504 29071.43181  -0.001    0.999
## run           -51.62204 44337.51560  -0.001    0.999
## say             7.36621 22174.24418   0.000    1.000
## schedul         1.91913 35796.84272   0.000    1.000
## school         -3.87014 28823.46891   0.000    1.000
## secur         -16.03677  2200.71431  -0.007    0.994
## see           -11.19904 12932.46795  -0.001    0.999
## send          -24.26771 12224.21338  -0.002    0.998
## sent          -14.88198 21953.79637  -0.001    0.999
## servic         -7.16432 12351.22106  -0.001    1.000
## set            -9.35324 26268.89516   0.000    1.000
## sever          20.41198 30927.28109   0.001    0.999
## shall          19.29869 30748.77616   0.001    0.999
## shirley       -71.32873 63289.37737  -0.001    0.999
## short          -8.97353 17207.51481  -0.001    1.000
## sinc           -3.43847 35455.98205   0.000    1.000
## sincer        -20.73171 35145.26470  -0.001    1.000
## site            8.68864 14955.35264   0.001    1.000
## softwar        25.74855 10593.09469   0.002    0.998
## soon           23.49750 37313.28390   0.001    0.999
## sorri           6.03563 22992.82314   0.000    1.000
## special        17.77075 27552.36443   0.001    0.999
## specif        -23.36688 30834.20294  -0.001    0.999
## start          14.37480 18972.26951   0.001    0.999
## state          12.20754 16772.13151   0.001    0.999
## still           3.87790 26222.21248   0.000    1.000
## stinson       -43.45351 26967.01750  -0.002    0.999
## student       -18.14731 21856.41556  -0.001    0.999
## subject        30.41125 10548.74309   0.003    0.998
## success         4.34358 27830.47372   0.000    1.000
## suggest       -38.42169 44745.18597  -0.001    0.999
## support       -15.39269 19761.55243  -0.001    0.999
## sure           -5.50273 20777.09818   0.000    1.000
## system          3.77801  9148.65860   0.000    1.000
## take            5.73138 17156.12167   0.000    1.000
## talk          -10.10574 20206.41806  -0.001    1.000
## team            7.94049 25703.84987   0.000    1.000
## term           20.13285 23031.54376   0.001    0.999
## thank         -38.90473 10586.96129  -0.004    0.997
## thing          25.78599 13405.17195   0.002    0.998
## think         -12.18122 20772.99986  -0.001    1.000
## thought        12.43295 30228.11251   0.000    1.000
## thursday      -14.91355 32617.92027   0.000    1.000
## time           -5.92102  8334.70945  -0.001    0.999
## today         -17.61557 19649.57463  -0.001    0.999
## togeth        -23.54813 18689.97232  -0.001    0.999
## trade         -17.55016 14825.10071  -0.001    0.999
## tri             0.92783 12819.63747   0.000    1.000
## tuesday       -28.08297 39588.86870  -0.001    0.999
## two           -25.72666 18439.43987  -0.001    0.999
## type          -14.47371 27548.25790  -0.001    1.000
## understand      9.30723 23416.65694   0.000    1.000
## unit           -4.02049 30080.64655   0.000    1.000
## univers        12.27580 21969.41146   0.001    1.000
## updat         -15.09781 14480.71856  -0.001    0.999
## use           -13.85349  9381.76315  -0.001    0.999
## valu            0.90239 13599.59155   0.000    1.000
## version       -36.06359 29386.80360  -0.001    0.999
## vinc          -37.34756  8647.15534  -0.004    0.997
## visit          25.84604 11697.85338   0.002    0.998
## vkamin        -66.48981 57028.76975  -0.001    0.999
## want           -2.55510 11057.56463   0.000    1.000
## way            13.38972 11375.38536   0.001    0.999
## web             2.79074 16859.81655   0.000    1.000
## websit        -25.62659 18475.02800  -0.001    0.999
## wednesday     -15.26360 26422.76449  -0.001    1.000
## week           -6.79505 10458.98638  -0.001    0.999
## well          -22.21928  9713.40116  -0.002    0.998
## will          -11.19383  5980.47999  -0.002    0.999
## wish           11.73089 31747.37935   0.000    1.000
## within         29.00289 21632.49653   0.001    0.999
## without        19.41978 17628.74297   0.001    0.999
## work          -10.98745 11596.31706  -0.001    0.999
## write          44.06181 28249.11863   0.002    0.999
## www            -7.86715 22237.59888   0.000    1.000
## year          -10.10293 10394.69041  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4409.49  on 4009  degrees of freedom
## Residual deviance:   13.46  on 3679  degrees of freedom
## AIC: 675.46
## 
## Number of Fisher Scoring iterations: 25

No variables are labeled as significant at the p = 0.05 level; every coefficient has a p-value close to 1, which is consistent with the overfitting warnings discussed above.
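
Rather than scanning the long coefficient table, the significant coefficients can also be counted from the summary object; the p-values Pr(>|z|) are stored in the fourth column of the coefficient matrix:

# Number of coefficients with a p-value below 0.05
coefs = summary(spamLog)$coefficients
sum(coefs[, 4] < 0.05)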

What is the training set accuracy of spamLog, using a threshold of 0.5 for predictions?

# Tabulate actual spam values in the training set against predictions above 0.5
a = table(train$spam, predTrainLog > 0.5)
kable(a)
    FALSE  TRUE
0    3052     0
1       4   954
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9990025

Training Set Accuracy = 0.9990025
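
Since the same accuracy computation is repeated for every model below, it could be wrapped in a small helper. A sketch; the function name accuracyAt is purely illustrative:

# Accuracy of probability predictions at a given threshold
accuracyAt = function(actual, predictedProb, threshold = 0.5) {
  cm = table(actual, predictedProb > threshold)
  sum(diag(cm)) / sum(cm)
}
# Reproduces the training set accuracy of spamLog
accuracyAt(train$spam, predTrainLog)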

What is the training set AUC of spamLog?
# Calculate the training set AUC
library(ROCR)
ROCRpred = prediction(predTrainLog, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9999959

Training Set AUC of spamLog = 0.9999959

CART Model


library(rpart)
library(rpart.plot)

spamCART = rpart(spam ~ ., data=train, method="class")

How many of the word stems “enron”, “hou”, “vinc”, and “kaminski” appear in the CART tree? Recall that we suspect these word stems are specific to Vincent Kaminski and might affect the generalizability of a spam filter built with his ham data.

# Plot the CART model
prp(spamCART)

2 of the 4 stems, “vinc” and “enron”, appear in the CART tree.
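
The split variables can also be read directly off the fitted rpart object instead of the plot; rpart records the variable used at each node in spamCART$frame$var, with "<leaf>" marking terminal nodes:

# Word stems actually used as splits in the CART tree
setdiff(unique(as.character(spamCART$frame$var)), "<leaf>")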

What is the training set accuracy of spamCART, using a threshold of 0.5 for predictions?

# Obtain predicted spam probabilities on the training set (second column of the output)
predTrainCART = predict(spamCART)[,2]
# Tabulate actual spam values in the training set against predictions above 0.5
a = table(train$spam, predTrainCART > 0.5)
kable(a)
    FALSE  TRUE
0    2885   167
1      64   894
# Calculate the accuracy
sum(diag(a))/(sum(a))
## [1] 0.942394

Training Set Accuracy of spamCART = 0.942394

What is the training set AUC of spamCART?
# Calculate the ROCR
library(ROCR)
ROCRpred = prediction(predTrainCART, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9696044

Training Set AUC of spamCART = 0.9696044

Random Forest (RF) Model

# Implement the Random Forest (RF) algorithm
library(randomForest)
set.seed(123)
spamRF = randomForest(spam~., data=train)

What is the training set accuracy of spamRF, using a threshold of 0.5 for predictions?

# Obtain predicted spam probabilities on the training set (second column of the output)
predTrainRF = predict(spamRF, type="prob")[,2]
# Tabulate actual spam values in the training set against predictions above 0.5
a = table(train$spam, predTrainRF > 0.5)
kable(a)
    FALSE  TRUE
0    3015    37
1      42   916
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9802993

Training Set Accuracy = 0.9802993

What is the training set AUC of spamRF?
# Calculate the training set AUC
library(ROCR)
ROCRpred = prediction(predTrainRF, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9978155

Training Set AUC of spamRF = 0.9978155

Building Machine Learning Models (Testing Set)

Logistic Regression

# Make predictions using Logistic Regression
predTestLog = predict(spamLog, newdata = test, type="response")

What is the testing set accuracy of spamLog, using a threshold of 0.5 for predictions?

# Tabulate actual spam values in the testing set against predictions above 0.5
a = table(test$spam, predTestLog > 0.5)
kable(a)
    FALSE  TRUE
0    1257    51
1      34   376
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9505239

Testing Set Accuracy = 0.9505239

What is the testing set AUC of spamLog?
# Calculate the testing set AUC 
library(ROCR)
ROCRpred = prediction(predTestLog, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9627517

Testing Set AUC of spamLog = 0.9627517

CART Model

# Make predictions on the testing set using the CART model (second column = probability of spam)
predTestCART = predict(spamCART, newdata = test)[,2]

What is the testing set accuracy of spamCART, using a threshold of 0.5 for predictions?

# Tabulate actual spam values in the testing set against predictions above 0.5
a = table(test$spam, predTestCART > 0.5)
kable(a)
    FALSE  TRUE
0    1228    80
1      24   386
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9394645

Testing Set Accuracy = 0.9394645

What is the testing set AUC of spamCART?
# Calculating the testing set AUC of spamCART
library(ROCR)
ROCRpred = prediction(predTestCART, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.963176

Testing Set AUC of spamCART = 0.963176

Random Forest (RF) Model

# Make predictions using random forest
predTestRF = predict(spamRF, newdata = test, type="prob")[,2]

What is the testing set accuracy of spamRF, using a threshold of 0.5 for predictions?

# Tabulate actual spam values in the testing set against predictions above 0.5
a = table(test$spam, predTestRF > 0.5)
kable(a)
    FALSE  TRUE
0    1291    17
1      23   387
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9767171

Testing Set Accuracy = 0.9767171

What is the testing set AUC of spamRF?
# Calculate the testing set AUC of spamRF
library(ROCR)
ROCRpred = prediction(predTestRF, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9975899

Testing Set AUC of spamRF = 0.9975899
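
To compare the three models at a glance, the testing set numbers reported above can be collected into a small table (a sketch that simply reuses the values computed earlier):

# Testing set performance of the three models, as reported above
data.frame(model    = c("spamLog", "spamCART", "spamRF"),
           accuracy = c(0.9505239, 0.9394645, 0.9767171),
           auc      = c(0.9627517, 0.9631760, 0.9975899))

The random forest has both the highest testing set accuracy and the highest testing set AUC.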