As a result of the huge number of spam emails sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g., looking up the sender in a so-called “Blackhole List” of IP addresses of likely spammers), most rely heavily on analyzing the contents of an email via text analytics.
In this homework problem, we will build and evaluate a spam filter using a publicly available dataset. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as a roughly 75/25 mix of ham and spam messages.
setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_5_Text_analytics")
emails <- read.csv("emails.csv", stringsAsFactors = FALSE)  # keep the text column as character
The nchar() function counts the number of characters in a piece of text. How many characters are in the longest email in the dataset (where “longest” is measured in number of characters)?
max(nchar(emails$text))
## [1] 43952
library(tm)
## Loading required package: NLP
corpus <- Corpus(VectorSource(emails$text))                  # one document per email
corpus <- tm_map(corpus, tolower)                            # lowercase all text
corpus <- tm_map(corpus, PlainTextDocument)                  # tolower returns plain character; restore document objects
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove English stopwords
corpus <- tm_map(corpus, stemDocument)                       # reduce words to their stems
dtm <- DocumentTermMatrix(corpus)                            # rows = emails, columns = stems
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
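As a quick sanity check before trimming the matrix, tm’s findFreqTerms() lists the terms whose total count clears a threshold (the 1000 cutoff here is arbitrary):
# stems appearing at least 1000 times across the corpus
findFreqTerms(dtm, lowfreq = 1000)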
spdtm <- removeSparseTerms(dtm, 0.95)  # keep only stems appearing in at least ~5% of the emails
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity : 89%
## Maximal term length: 10
## Weighting : term frequency (tf)
Build a data frame called emailsSparse from spdtm, and use the make.names function to make the variable names of emailsSparse valid.
emailsSparse <- as.data.frame(as.matrix(spdtm))
colnames(emailsSparse) <- make.names(colnames(emailsSparse))
colSums() is an R function that returns the sum of values for each variable in our data frame. Our data frame contains the number of times each word stem (columns) appeared in each email (rows). Therefore, colSums(emailsSparse) returns the number of times a word stem appeared across all the emails in the dataset. What is the word stem that shows up most frequently across all the emails in the dataset?
which.max(colSums(emailsSparse))
## enron
## 92
The most frequent stem is “enron”; which.max() returns its column index (92), labeled with the stem name.
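To see more than just the single top stem, sort the column sums (a quick sketch):
# the ten most frequent stems overall
head(sort(colSums(emailsSparse), decreasing = TRUE), 10)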
Add a variable called “spam” to emailsSparse containing the email spam labels.
emailsSparse$spam <- emails$spam
How many word stems appear at least 5000 times in the ham emails in the dataset?
sum(colSums(emailsSparse[emailsSparse$spam == 0,]) >= 5000)
## [1] 6
How many word stems appear at least 1000 times in the spam emails in the dataset?
sum(colSums(emailsSparse[emailsSparse$spam == 1,]) >= 1000)
## [1] 4
Note that this count includes the “spam” indicator column itself, which sums to the number of spam emails and therefore clears the 1,000 threshold, so only three of the four columns are actual word stems. (In the ham subset above, the spam column sums to 0, so the count of 6 is unaffected.)
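To list the qualifying columns by name rather than just counting them (a quick sketch; per the note above, the “spam” label column shows up in the second list):
# name the columns clearing each threshold
names(which(colSums(emailsSparse[emailsSparse$spam == 0,]) >= 5000))
names(which(colSums(emailsSparse[emailsSparse$spam == 1,]) >= 1000))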
The lists of most common words differ significantly between the spam and ham emails. This tells us that the frequencies of these most common words are likely to help differentiate between spam and ham.
Several of the most common word stems from the ham documents, such as “enron”, “hou” (short for Houston), “vinc” (the word stem of “Vince”), and “kaminski”, are likely specific to Vincent Kaminski’s inbox. This means the models we build are personalized to this inbox and would need further testing before being used as a spam filter for anyone else.
For each model, obtain the predicted spam probabilities for the training set.
Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values.
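The predict() conventions differ across the three model types; as a minimal sketch (using the model objects spamLog, spamCART, and spamRF that are fitted below):
logProbs <- predict(spamLog, type = "response")  # glm defaults to log-odds; "response" gives probabilities
cartProbs <- predict(spamCART)[, 2]              # rpart (method = "class") returns a matrix of class probabilities
rfProbs <- predict(spamRF, type = "prob")[, 2]   # randomForest needs type = "prob"; column 2 is P(spam)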
First, convert the dependent variable to a factor:
class(emailsSparse$spam)
## [1] "integer"
emailsSparse$spam <- as.factor(emailsSparse$spam)
library(caTools)
set.seed(123)
split <- sample.split(emailsSparse$spam, SplitRatio = 0.7)
train <- subset(emailsSparse, split == TRUE)
test <- subset(emailsSparse, split == FALSE)
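sample.split() stratifies on the outcome variable, so both sets should preserve the original spam rate; a quick check:
# both proportions should be close to the full dataset's spam rate
prop.table(table(train$spam))
prop.table(table(test$spam))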
spamLog <- glm(spam ~ ., data = train, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
These warnings signal (quasi-)complete separation: with 330 word-stem predictors, the model can push fitted probabilities to exactly 0 or 1 on the training data, so the coefficient estimates are unstable (note the enormous standard errors and near-1 p-values in the summary below).
summary(spamLog)
##
## Call:
## glm(formula = spam ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.011 0.000 0.000 0.000 1.354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.082e+01 1.055e+04 -0.003 0.998
## X000 1.474e+01 1.058e+04 0.001 0.999
## X2000 -3.631e+01 1.556e+04 -0.002 0.998
## X2001 -3.215e+01 1.318e+04 -0.002 0.998
## X713 -2.427e+01 2.914e+04 -0.001 0.999
## X853 -1.212e+00 5.942e+04 0.000 1.000
## abl -2.049e+00 2.088e+04 0.000 1.000
## access -1.480e+01 1.335e+04 -0.001 0.999
## account 2.488e+01 8.165e+03 0.003 0.998
## addit 1.463e+00 2.703e+04 0.000 1.000
## address -4.613e+00 1.113e+04 0.000 1.000
## allow 1.899e+01 6.436e+03 0.003 0.998
## alreadi -2.407e+01 3.319e+04 -0.001 0.999
## also 2.990e+01 1.378e+04 0.002 0.998
## analysi -2.405e+01 3.860e+04 -0.001 1.000
## anoth -8.744e+00 2.032e+04 0.000 1.000
## applic -2.649e+00 1.674e+04 0.000 1.000
## appreci -2.145e+01 2.762e+04 -0.001 0.999
## approv -1.302e+00 1.589e+04 0.000 1.000
## april -2.620e+01 2.208e+04 -0.001 0.999
## area 2.041e+01 2.266e+04 0.001 0.999
## arrang 1.069e+01 2.135e+04 0.001 1.000
## ask -7.746e+00 1.976e+04 0.000 1.000
## assist -1.128e+01 2.490e+04 0.000 1.000
## associ 9.049e+00 1.909e+04 0.000 1.000
## attach -1.037e+01 1.534e+04 -0.001 0.999
## attend -3.451e+01 3.257e+04 -0.001 0.999
## avail 8.651e+00 1.709e+04 0.001 1.000
## back -1.323e+01 2.272e+04 -0.001 1.000
## base -1.354e+01 2.122e+04 -0.001 0.999
## begin 2.228e+01 2.973e+04 0.001 0.999
## believ 3.233e+01 2.136e+04 0.002 0.999
## best -8.201e+00 1.333e+03 -0.006 0.995
## better 4.263e+01 2.360e+04 0.002 0.999
## book 4.301e+00 2.024e+04 0.000 1.000
## bring 1.607e+01 6.767e+04 0.000 1.000
## busi -4.803e+00 1.000e+04 0.000 1.000
## buy 4.170e+01 3.892e+04 0.001 0.999
## call -1.145e+00 1.111e+04 0.000 1.000
## can 3.762e+00 7.674e+03 0.000 1.000
## case -3.372e+01 2.880e+04 -0.001 0.999
## chang -2.717e+01 2.215e+04 -0.001 0.999
## check 1.425e+00 1.963e+04 0.000 1.000
## click 1.376e+01 7.077e+03 0.002 0.998
## com 1.936e+00 4.039e+03 0.000 1.000
## come -1.166e+00 1.511e+04 0.000 1.000
## comment -3.251e+00 3.387e+04 0.000 1.000
## communic 1.580e+01 8.958e+03 0.002 0.999
## compani 4.781e+00 9.186e+03 0.001 1.000
## complet -1.363e+01 2.024e+04 -0.001 0.999
## confer -7.503e-01 8.557e+03 0.000 1.000
## confirm -1.300e+01 1.514e+04 -0.001 0.999
## contact 1.530e+00 1.262e+04 0.000 1.000
## continu 1.487e+01 1.535e+04 0.001 0.999
## contract -1.295e+01 1.498e+04 -0.001 0.999
## copi -4.274e+01 3.070e+04 -0.001 0.999
## corp 1.606e+01 2.708e+04 0.001 1.000
## corpor -8.286e-01 2.818e+04 0.000 1.000
## cost -1.938e+00 1.833e+04 0.000 1.000
## cours 1.665e+01 1.834e+04 0.001 0.999
## creat 1.338e+01 3.946e+04 0.000 1.000
## credit 2.617e+01 1.314e+04 0.002 0.998
## crenshaw 9.994e+01 6.769e+04 0.001 0.999
## current 3.629e+00 1.707e+04 0.000 1.000
## custom 1.829e+01 1.008e+04 0.002 0.999
## data -2.609e+01 2.271e+04 -0.001 0.999
## date -2.786e+00 1.699e+04 0.000 1.000
## day -6.100e+00 5.866e+03 -0.001 0.999
## deal -1.129e+01 1.448e+04 -0.001 0.999
## dear -2.313e+00 2.306e+04 0.000 1.000
## depart -4.068e+01 2.509e+04 -0.002 0.999
## deriv -4.971e+01 3.587e+04 -0.001 0.999
## design -7.923e+00 2.939e+04 0.000 1.000
## detail 1.197e+01 2.301e+04 0.001 1.000
## develop 5.976e+00 9.455e+03 0.001 0.999
## differ -2.293e+00 1.075e+04 0.000 1.000
## direct -2.051e+01 3.194e+04 -0.001 0.999
## director -1.770e+01 1.793e+04 -0.001 0.999
## discuss -1.051e+01 1.915e+04 -0.001 1.000
## doc -2.597e+01 2.603e+04 -0.001 0.999
## don 2.129e+01 1.456e+04 0.001 0.999
## done 6.828e+00 1.882e+04 0.000 1.000
## due -4.163e+00 3.532e+04 0.000 1.000
## ect 8.685e-01 5.342e+03 0.000 1.000
## edu -2.122e-01 6.917e+02 0.000 1.000
## effect 1.948e+01 2.100e+04 0.001 0.999
## effort 1.606e+01 5.670e+04 0.000 1.000
## either -2.744e+01 4.000e+04 -0.001 0.999
## email 3.833e+00 1.186e+04 0.000 1.000
## end -1.311e+01 2.938e+04 0.000 1.000
## energi -1.620e+01 1.646e+04 -0.001 0.999
## engin 2.664e+01 2.394e+04 0.001 0.999
## enron -8.789e+00 5.719e+03 -0.002 0.999
## etc 9.470e-01 1.569e+04 0.000 1.000
## even -1.654e+01 2.289e+04 -0.001 0.999
## event 1.694e+01 1.851e+04 0.001 0.999
## expect -1.179e+01 1.914e+04 -0.001 1.000
## experi 2.460e+00 2.240e+04 0.000 1.000
## fax 3.537e+00 3.386e+04 0.000 1.000
## feel 2.596e+00 2.348e+04 0.000 1.000
## file -2.943e+01 2.165e+04 -0.001 0.999
## final 8.075e+00 5.008e+04 0.000 1.000
## financ -9.122e+00 7.524e+03 -0.001 0.999
## financi -9.747e+00 1.727e+04 -0.001 1.000
## find -2.623e+00 9.727e+03 0.000 1.000
## first -4.666e-01 2.043e+04 0.000 1.000
## follow 1.766e+01 3.080e+03 0.006 0.995
## form 8.483e+00 1.674e+04 0.001 1.000
## forward -3.484e+00 1.864e+04 0.000 1.000
## free 6.113e+00 8.121e+03 0.001 0.999
## friday -1.146e+01 1.996e+04 -0.001 1.000
## full 2.125e+01 2.190e+04 0.001 0.999
## futur 4.146e+01 1.439e+04 0.003 0.998
## gas -3.901e+00 4.160e+03 -0.001 0.999
## get 5.154e+00 9.737e+03 0.001 1.000
## gibner 2.901e+01 2.460e+04 0.001 0.999
## give -2.518e+01 2.130e+04 -0.001 0.999
## given -2.186e+01 5.426e+04 0.000 1.000
## good 5.399e+00 1.619e+04 0.000 1.000
## great 1.222e+01 1.090e+04 0.001 0.999
## group 5.264e-01 1.037e+04 0.000 1.000
## happi 1.939e-02 1.202e+04 0.000 1.000
## hear 2.887e+01 2.281e+04 0.001 0.999
## hello 2.166e+01 1.361e+04 0.002 0.999
## help 1.731e+01 2.791e+03 0.006 0.995
## high -1.982e+00 2.554e+04 0.000 1.000
## home 5.973e+00 8.965e+03 0.001 0.999
## hope -1.435e+01 2.179e+04 -0.001 0.999
## hou 6.852e+00 6.437e+03 0.001 0.999
## hour 2.478e+00 1.333e+04 0.000 1.000
## houston -1.855e+01 7.305e+03 -0.003 0.998
## howev -3.449e+01 3.562e+04 -0.001 0.999
## http 2.528e+01 2.107e+04 0.001 0.999
## idea -1.845e+01 3.892e+04 0.000 1.000
## immedi 6.285e+01 3.346e+04 0.002 0.999
## import -1.859e+00 2.236e+04 0.000 1.000
## includ -3.454e+00 1.799e+04 0.000 1.000
## increas 6.476e+00 2.329e+04 0.000 1.000
## industri -3.160e+01 2.373e+04 -0.001 0.999
## info -1.255e+00 4.857e+03 0.000 1.000
## inform 2.078e+01 8.549e+03 0.002 0.998
## interest 2.698e+01 1.159e+04 0.002 0.998
## intern -7.991e+00 3.351e+04 0.000 1.000
## internet 8.749e+00 1.100e+04 0.001 0.999
## interview -1.640e+01 1.873e+04 -0.001 0.999
## invest 3.201e+01 2.393e+04 0.001 0.999
## invit 4.304e+00 2.215e+04 0.000 1.000
## involv 3.815e+01 3.315e+04 0.001 0.999
## issu -3.708e+01 3.396e+04 -0.001 0.999
## john -5.326e-01 2.856e+04 0.000 1.000
## join -3.824e+01 2.334e+04 -0.002 0.999
## juli -1.358e+01 3.009e+04 0.000 1.000
## just -1.021e+01 1.114e+04 -0.001 0.999
## kaminski -1.812e+01 6.029e+03 -0.003 0.998
## keep 1.867e+01 2.782e+04 0.001 0.999
## kevin -3.779e+01 4.738e+04 -0.001 0.999
## know 1.277e+01 1.526e+04 0.001 0.999
## last 1.046e+00 1.372e+04 0.000 1.000
## let -2.763e+01 1.462e+04 -0.002 0.998
## life 5.812e+01 3.864e+04 0.002 0.999
## like 5.649e+00 7.660e+03 0.001 0.999
## line 8.743e+00 1.236e+04 0.001 0.999
## link -6.929e+00 1.345e+04 -0.001 1.000
## list -8.692e+00 2.149e+03 -0.004 0.997
## locat 2.073e+01 1.597e+04 0.001 0.999
## london 6.745e+00 1.642e+04 0.000 1.000
## long -1.489e+01 1.934e+04 -0.001 0.999
## look -7.031e+00 1.563e+04 0.000 1.000
## lot -1.964e+01 1.321e+04 -0.001 0.999
## made 2.820e+00 2.743e+04 0.000 1.000
## mail 7.584e+00 1.021e+04 0.001 0.999
## make 2.901e+01 1.528e+04 0.002 0.998
## manag 6.014e+00 1.445e+04 0.000 1.000
## mani 1.885e+01 1.442e+04 0.001 0.999
## mark -3.350e+01 3.208e+04 -0.001 0.999
## market 7.895e+00 8.012e+03 0.001 0.999
## may -9.434e+00 1.397e+04 -0.001 0.999
## mean 6.078e-01 2.952e+04 0.000 1.000
## meet -1.063e+00 1.263e+04 0.000 1.000
## member 1.381e+01 2.343e+04 0.001 1.000
## mention -2.279e+01 2.714e+04 -0.001 0.999
## messag 1.716e+01 2.562e+03 0.007 0.995
## might 1.244e+01 1.753e+04 0.001 0.999
## model -2.292e+01 1.049e+04 -0.002 0.998
## monday -1.034e+00 3.233e+04 0.000 1.000
## money 3.264e+01 1.321e+04 0.002 0.998
## month -3.727e+00 1.112e+04 0.000 1.000
## morn -2.645e+01 3.403e+04 -0.001 0.999
## move -3.834e+01 3.011e+04 -0.001 0.999
## much 3.775e-01 1.392e+04 0.000 1.000
## name 1.672e+01 1.322e+04 0.001 0.999
## need 8.437e-01 1.221e+04 0.000 1.000
## net 1.256e+01 2.197e+04 0.001 1.000
## new 1.003e+00 1.009e+04 0.000 1.000
## next. 1.492e+01 1.724e+04 0.001 0.999
## note 1.446e+01 2.294e+04 0.001 0.999
## now 3.790e+01 1.219e+04 0.003 0.998
## number -9.622e+00 1.591e+04 -0.001 1.000
## offer 1.174e+01 1.084e+04 0.001 0.999
## offic -1.344e+01 2.311e+04 -0.001 1.000
## one 1.241e+01 6.652e+03 0.002 0.999
## onlin 3.589e+01 1.665e+04 0.002 0.998
## open 2.114e+01 2.961e+04 0.001 0.999
## oper -1.696e+01 2.757e+04 -0.001 1.000
## opportun -4.131e+00 1.918e+04 0.000 1.000
## option -1.085e+00 9.325e+03 0.000 1.000
## order 6.533e+00 1.242e+04 0.001 1.000
## origin 3.226e+01 3.818e+04 0.001 0.999
## part 4.594e+00 3.483e+04 0.000 1.000
## particip -1.154e+01 1.738e+04 -0.001 0.999
## peopl -1.864e+01 1.439e+04 -0.001 0.999
## per 1.367e+01 1.273e+04 0.001 0.999
## person 1.870e+01 9.575e+03 0.002 0.998
## phone -6.957e+00 1.172e+04 -0.001 1.000
## place 9.005e+00 3.661e+04 0.000 1.000
## plan -1.830e+01 6.320e+03 -0.003 0.998
## pleas -7.961e+00 9.484e+03 -0.001 0.999
## point 5.498e+00 3.403e+04 0.000 1.000
## posit -1.543e+01 2.316e+04 -0.001 0.999
## possibl -1.366e+01 2.492e+04 -0.001 1.000
## power -5.643e+00 1.173e+04 0.000 1.000
## present -6.163e+00 1.278e+04 0.000 1.000
## price 3.428e+00 7.850e+03 0.000 1.000
## problem 1.262e+01 9.763e+03 0.001 0.999
## process -2.957e-01 1.191e+04 0.000 1.000
## product 1.016e+01 1.345e+04 0.001 0.999
## program 1.444e+00 1.183e+04 0.000 1.000
## project 2.173e+00 1.497e+04 0.000 1.000
## provid 2.422e-01 1.859e+04 0.000 1.000
## public -5.250e+01 2.341e+04 -0.002 0.998
## put -1.052e+01 2.681e+04 0.000 1.000
## question -3.467e+01 1.859e+04 -0.002 0.999
## rate -3.112e+00 1.319e+04 0.000 1.000
## read -1.527e+01 2.145e+04 -0.001 0.999
## real 2.046e+01 2.358e+04 0.001 0.999
## realli -2.667e+01 4.640e+04 -0.001 1.000
## receiv 5.765e-01 1.585e+04 0.000 1.000
## recent -2.067e+00 1.780e+04 0.000 1.000
## regard -3.668e+00 1.511e+04 0.000 1.000
## relat -5.114e+01 1.793e+04 -0.003 0.998
## remov 2.325e+01 2.484e+04 0.001 0.999
## repli 1.538e+01 2.916e+04 0.001 1.000
## report -1.482e+01 1.477e+04 -0.001 0.999
## request -1.232e+01 1.167e+04 -0.001 0.999
## requir 5.004e-01 2.937e+04 0.000 1.000
## research -2.826e+01 1.553e+04 -0.002 0.999
## resourc -2.735e+01 3.522e+04 -0.001 0.999
## respond 2.974e+01 3.888e+04 0.001 0.999
## respons -1.960e+01 3.667e+04 -0.001 1.000
## result -5.002e-01 3.140e+04 0.000 1.000
## resum -9.219e+00 2.100e+04 0.000 1.000
## return 1.745e+01 1.844e+04 0.001 0.999
## review -4.825e+00 1.013e+04 0.000 1.000
## right 2.312e+01 1.590e+04 0.001 0.999
## risk -4.001e+00 1.718e+04 0.000 1.000
## robert -2.096e+01 2.907e+04 -0.001 0.999
## run -5.162e+01 4.434e+04 -0.001 0.999
## say 7.366e+00 2.217e+04 0.000 1.000
## schedul 1.919e+00 3.580e+04 0.000 1.000
## school -3.870e+00 2.882e+04 0.000 1.000
## secur -1.604e+01 2.201e+03 -0.007 0.994
## see -1.120e+01 1.293e+04 -0.001 0.999
## send -2.427e+01 1.222e+04 -0.002 0.998
## sent -1.488e+01 2.195e+04 -0.001 0.999
## servic -7.164e+00 1.235e+04 -0.001 1.000
## set -9.353e+00 2.627e+04 0.000 1.000
## sever 2.041e+01 3.093e+04 0.001 0.999
## shall 1.930e+01 3.075e+04 0.001 0.999
## shirley -7.133e+01 6.329e+04 -0.001 0.999
## short -8.974e+00 1.721e+04 -0.001 1.000
## sinc -3.438e+00 3.546e+04 0.000 1.000
## sincer -2.073e+01 3.515e+04 -0.001 1.000
## site 8.689e+00 1.496e+04 0.001 1.000
## softwar 2.575e+01 1.059e+04 0.002 0.998
## soon 2.350e+01 3.731e+04 0.001 0.999
## sorri 6.036e+00 2.299e+04 0.000 1.000
## special 1.777e+01 2.755e+04 0.001 0.999
## specif -2.337e+01 3.083e+04 -0.001 0.999
## start 1.437e+01 1.897e+04 0.001 0.999
## state 1.221e+01 1.677e+04 0.001 0.999
## still 3.878e+00 2.622e+04 0.000 1.000
## stinson -4.345e+01 2.697e+04 -0.002 0.999
## student -1.815e+01 2.186e+04 -0.001 0.999
## subject 3.041e+01 1.055e+04 0.003 0.998
## success 4.344e+00 2.783e+04 0.000 1.000
## suggest -3.842e+01 4.475e+04 -0.001 0.999
## support -1.539e+01 1.976e+04 -0.001 0.999
## sure -5.503e+00 2.078e+04 0.000 1.000
## system 3.778e+00 9.149e+03 0.000 1.000
## take 5.731e+00 1.716e+04 0.000 1.000
## talk -1.011e+01 2.021e+04 -0.001 1.000
## team 7.940e+00 2.570e+04 0.000 1.000
## term 2.013e+01 2.303e+04 0.001 0.999
## thank -3.890e+01 1.059e+04 -0.004 0.997
## thing 2.579e+01 1.341e+04 0.002 0.998
## think -1.218e+01 2.077e+04 -0.001 1.000
## thought 1.243e+01 3.023e+04 0.000 1.000
## thursday -1.491e+01 3.262e+04 0.000 1.000
## time -5.921e+00 8.335e+03 -0.001 0.999
## today -1.762e+01 1.965e+04 -0.001 0.999
## togeth -2.355e+01 1.869e+04 -0.001 0.999
## trade -1.755e+01 1.483e+04 -0.001 0.999
## tri 9.278e-01 1.282e+04 0.000 1.000
## tuesday -2.808e+01 3.959e+04 -0.001 0.999
## two -2.573e+01 1.844e+04 -0.001 0.999
## type -1.447e+01 2.755e+04 -0.001 1.000
## understand 9.307e+00 2.342e+04 0.000 1.000
## unit -4.020e+00 3.008e+04 0.000 1.000
## univers 1.228e+01 2.197e+04 0.001 1.000
## updat -1.510e+01 1.448e+04 -0.001 0.999
## use -1.385e+01 9.382e+03 -0.001 0.999
## valu 9.024e-01 1.360e+04 0.000 1.000
## version -3.606e+01 2.939e+04 -0.001 0.999
## vinc -3.735e+01 8.647e+03 -0.004 0.997
## visit 2.585e+01 1.170e+04 0.002 0.998
## vkamin -6.649e+01 5.703e+04 -0.001 0.999
## want -2.555e+00 1.106e+04 0.000 1.000
## way 1.339e+01 1.138e+04 0.001 0.999
## web 2.791e+00 1.686e+04 0.000 1.000
## websit -2.563e+01 1.848e+04 -0.001 0.999
## wednesday -1.526e+01 2.642e+04 -0.001 1.000
## week -6.795e+00 1.046e+04 -0.001 0.999
## well -2.222e+01 9.713e+03 -0.002 0.998
## will -1.119e+01 5.980e+03 -0.002 0.999
## wish 1.173e+01 3.175e+04 0.000 1.000
## within 2.900e+01 2.163e+04 0.001 0.999
## without 1.942e+01 1.763e+04 0.001 0.999
## work -1.099e+01 1.160e+04 -0.001 0.999
## write 4.406e+01 2.825e+04 0.002 0.999
## www -7.867e+00 2.224e+04 0.000 1.000
## year -1.010e+01 1.039e+04 -0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4409.49 on 4009 degrees of freedom
## Residual deviance: 13.46 on 3679 degrees of freedom
## AIC: 675.46
##
## Number of Fisher Scoring iterations: 25
# predict() on a glm returns log-odds by default (use type = "response" for
# probabilities); the 0.5 cutoff below is therefore on the log-odds scale
spamLog_pred <- predict(spamLog)
table(train$spam, spamLog_pred >= 0.5)
##
## FALSE TRUE
## 0 3052 0
## 1 4 954
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
# AUC depends only on the ranking of predictions, so log-odds work here as well as probabilities
ROCRpred <- prediction(spamLog_pred, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9999959
library(rpart)
library(rpart.plot)
spamCART <- rpart(spam ~ ., data = train, method = "class")
prp(spamCART)
spamCART_pred <- predict(spamCART)           # matrix of class probabilities
table(train$spam, spamCART_pred[,2] >= 0.5)  # column 2 is P(spam)
##
## FALSE TRUE
## 0 2885 167
## 1 64 894
ROCRpred <- prediction(spamCART_pred[, 2], train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9696044
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
spamRF <- randomForest(spam ~ ., data = train)
# with no newdata, predict() on a randomForest returns out-of-bag class predictions
spamRF_pred <- predict(spamRF)
table(train$spam, spamRF_pred)
## spamRF_pred
## 0 1
## 0 3013 39
## 1 44 914
What is the training set AUC of spamRF? (Remember to pass the argument type = "prob" to the predict function to get predicted probabilities for a random forest model. The probabilities will be the second column of the output.)
spamRF_pred <- predict(spamRF, type = "prob")
ROCRpred <- prediction(spamRF_pred[,2], train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9979116
The logistic regression model has the best training set performance in terms of both accuracy and AUC.
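These claims can be verified by computing accuracies from the training set confusion matrices above (a quick sketch; counts copied from the tables):
# training set accuracies
(3052 + 954)/nrow(train)  # logistic regression, ~0.999
(2885 + 894)/nrow(train)  # CART, ~0.942
(3013 + 914)/nrow(train)  # random forest (out-of-bag), ~0.979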
Next, obtain each model's predictions on the test set:
spamLog_pred <- predict(spamLog, newdata = test)               # log-odds again
spamCART_pred <- predict(spamCART, newdata = test)             # matrix of class probabilities
spamRF_pred <- predict(spamRF, newdata = test, type = "prob")  # matrix of class probabilities
Logistic regression model
table(test$spam, spamLog_pred >= 0.5)
##
## FALSE TRUE
## 0 1258 50
## 1 34 376
ROCRpred <- prediction(spamLog_pred, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9767994
CART model
table(test$spam, spamCART_pred[,2] >= 0.5)
##
## FALSE TRUE
## 0 1228 80
## 1 24 386
ROCRpred <- prediction(spamCART_pred[,2], test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.963176
Random Forest model
table(test$spam, spamRF_pred[,2]>=0.5)
##
## FALSE TRUE
## 0 1290 18
## 1 24 386
ROCRpred <- prediction(spamRF_pred[,2], test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9975656
The random forest model has the best test set performance in terms of both accuracy and AUC.
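Again, the accuracies can be read off the test set confusion matrices above (a quick sketch; counts copied from the tables):
# test set accuracies
(1258 + 376)/nrow(test)  # logistic regression, ~0.951
(1228 + 386)/nrow(test)  # CART, ~0.939
(1290 + 386)/nrow(test)  # random forest, ~0.976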
CART and random forest achieved very similar accuracies on the training and test sets. Logistic regression, however, obtained nearly perfect accuracy and AUC on the training set but far-from-perfect performance on the test set. This gap is an indicator of overfitting.