Homework assignment for MITx “The Analytics Edge”
Nearly every email user has at some point encountered a “spam” email, which is an unsolicited message often advertising a product, containing links to malware, or attempting to scam the recipient. Roughly 80-90% of more than 100 billion emails sent each day are spam emails, most being sent from botnets of malware-infected computers. The remainder of emails are called “ham” emails.
As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.
In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.
knitr::opts_chunk$set(echo = TRUE)
library(tm) # text analytics
library(caTools) # create training and test sets
library(rpart) # Classification and Regression Trees
library(rpart.plot)
library(randomForest) # random forest
library(SnowballC)
library(ROCR) # receiver operating characteristic
The dataset contains just two fields:
text: The text of the email.
spam: A binary variable indicating if the email was spam.
emails <- read.csv("emails.csv", stringsAsFactors = FALSE)
str(emails)
## 'data.frame': 5728 obs. of 2 variables:
## $ text: chr "Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market"| __truncated__ "Subject: the stock trading gunslinger fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__ "Subject: unbelievable new homes made easy im wanting to show you this homeowner you have been pre - approved"| __truncated__ "Subject: 4 color printing special request additional information now ! click here click here for a printable "| __truncated__ ...
## $ spam: int 1 1 1 1 1 1 1 1 1 1 ...
cat("\nNumber of spam emails :", sum(emails$spam))
##
## Number of spam emails : 1368
cat("\nNumber of characters in the longest email in the dataset :", max(nchar(emails$text)))
##
## Number of characters in the longest email in the dataset : 43952
cat("\nRow of the shortest email in the dataset :", which.min(nchar(emails$text)))
##
## Row of the shortest email in the dataset : 1992
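As a quick sanity check of our own (output omitted here), we can print the shortest message itself:
# Inspect the text of the shortest email in the dataset.
emails$text[which.min(nchar(emails$text))]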
corpus <- VCorpus(VectorSource(emails$text))
corpus <- tm_map(corpus, content_transformer(tolower)) # lower-case
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) # remove stop words
corpus <- tm_map(corpus, stemDocument) # reduce words to 'stems'
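To see what these transformations do, it can help to compare one raw email with its processed counterpart (a quick check we add here; output omitted):
# First 100 characters of the first email, before and after preprocessing.
substr(emails$text[1], 1, 100) # raw text
substr(as.character(corpus[[1]]), 1, 100) # lower-cased, stripped of punctuation and stop words, stemmed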
dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
# keep only terms that appear in at least 5% of documents
spdtm <- removeSparseTerms(dtm, 0.95)
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity : 89%
## Maximal term length: 10
## Weighting : term frequency (tf)
emailsSparse <- as.data.frame(as.matrix(spdtm))
colnames(emailsSparse) <- make.names(colnames(emailsSparse))
# some word stems start with a number, but R struggles with variable names that start with a number
cat("Most common stem in the emails :", colnames(spdtm)[which.max(colSums(emailsSparse))])
## Most common stem in the emails : enron
emailsSparse$spam <- emails$spam # add variable 'spam' containing the email spam labels
cat("Number of word stems which appear at least 5000 times\nin ham emails : ",
sum(colSums(emailsSparse[emailsSparse$spam == 0,!names(emailsSparse) %in% "spam"])>=5000))
## Number of word stems which appear at least 5000 times
## in ham emails : 6
cat("\nMost common stems in ham emails\n")
##
## Most common stems in ham emails
sort(colSums(subset(emailsSparse, spam == 0)), decreasing = TRUE)[1:10]
## enron ect subject vinc will hou X2000 kaminski
## 13388 11417 8625 8531 6802 5569 4935 4801
## pleas com
## 4494 4444
cat("\nNumber of word stems which appear at least 1000 times\nin spam emails : ",
sum(colSums(emailsSparse[emailsSparse$spam == 1,!names(emailsSparse) %in% "spam"])>=1000))
##
## Number of word stems which appear at least 1000 times
## in spam emails : 3
cat("\nMost common stems in spam emails\n")
##
## Most common stems in spam emails
sort(colSums(emailsSparse[emailsSparse$spam == 1, !names(emailsSparse) %in% "spam"]), decreasing = TRUE)[1:10]
## subject will compani com mail busi email can inform
## 1577 1450 1065 999 917 897 865 831 818
## receiv
## 727
Several of the most common word stems from the ham documents, such as “enron”, “hou” (short for Houston), “vinc” (the word stem of “Vince”) and “kaminski”, are likely specific to Vincent Kaminski’s inbox.
Because the ham in this dataset is personalized to Vincent Kaminski, a filter trained on it might not generalize well to a typical email user. Caution is therefore needed before applying the filters derived in this problem to other users.
emailsSparse$spam <- as.factor(emailsSparse$spam)
set.seed(123) # seed for random number generator
split <- sample.split(emailsSparse$spam, SplitRatio = 0.7) # 70% training, preserving the spam ratio
train <- subset(emailsSparse, split == TRUE)
test <- subset(emailsSparse, split == FALSE)
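sample.split stratifies on the outcome, so the spam rate should be nearly identical in both sets; a quick sanity check of our own:
# Proportion of spam in each set should match the full dataset (about 24%).
prop.table(table(train$spam))
prop.table(table(test$spam))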
spamlog <- glm(spam ~ ., data = train, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(spamlog)
##
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.011 0.000 0.000 0.000 1.354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.082e+01 1.055e+04 -0.003 0.998
## X000 1.474e+01 1.058e+04 0.001 0.999
## X2000 -3.631e+01 1.556e+04 -0.002 0.998
## X2001 -3.215e+01 1.318e+04 -0.002 0.998
## X713 -2.427e+01 2.914e+04 -0.001 0.999
## X853 -1.212e+00 5.942e+04 0.000 1.000
## abl -2.049e+00 2.088e+04 0.000 1.000
## access -1.480e+01 1.335e+04 -0.001 0.999
## account 2.488e+01 8.165e+03 0.003 0.998
## addit 1.463e+00 2.703e+04 0.000 1.000
## address -4.613e+00 1.113e+04 0.000 1.000
## allow 1.899e+01 6.436e+03 0.003 0.998
## alreadi -2.407e+01 3.319e+04 -0.001 0.999
## also 2.990e+01 1.378e+04 0.002 0.998
## analysi -2.405e+01 3.860e+04 -0.001 1.000
## anoth -8.744e+00 2.032e+04 0.000 1.000
## applic -2.649e+00 1.674e+04 0.000 1.000
## appreci -2.145e+01 2.762e+04 -0.001 0.999
## approv -1.302e+00 1.589e+04 0.000 1.000
## april -2.620e+01 2.208e+04 -0.001 0.999
## area 2.041e+01 2.266e+04 0.001 0.999
## arrang 1.069e+01 2.135e+04 0.001 1.000
## ask -7.746e+00 1.976e+04 0.000 1.000
## assist -1.128e+01 2.490e+04 0.000 1.000
## associ 9.049e+00 1.909e+04 0.000 1.000
## attach -1.037e+01 1.534e+04 -0.001 0.999
## attend -3.451e+01 3.257e+04 -0.001 0.999
## avail 8.651e+00 1.709e+04 0.001 1.000
## back -1.323e+01 2.272e+04 -0.001 1.000
## base -1.354e+01 2.122e+04 -0.001 0.999
## begin 2.228e+01 2.973e+04 0.001 0.999
## believ 3.233e+01 2.136e+04 0.002 0.999
## best -8.201e+00 1.333e+03 -0.006 0.995
## better 4.263e+01 2.360e+04 0.002 0.999
## book 4.301e+00 2.024e+04 0.000 1.000
## bring 1.607e+01 6.767e+04 0.000 1.000
## busi -4.803e+00 1.000e+04 0.000 1.000
## buy 4.170e+01 3.892e+04 0.001 0.999
## call -1.145e+00 1.111e+04 0.000 1.000
## can 3.762e+00 7.674e+03 0.000 1.000
## case -3.372e+01 2.880e+04 -0.001 0.999
## chang -2.717e+01 2.215e+04 -0.001 0.999
## check 1.425e+00 1.963e+04 0.000 1.000
## click 1.376e+01 7.077e+03 0.002 0.998
## com 1.936e+00 4.039e+03 0.000 1.000
## come -1.166e+00 1.511e+04 0.000 1.000
## comment -3.251e+00 3.387e+04 0.000 1.000
## communic 1.580e+01 8.958e+03 0.002 0.999
## compani 4.781e+00 9.186e+03 0.001 1.000
## complet -1.363e+01 2.024e+04 -0.001 0.999
## confer -7.503e-01 8.557e+03 0.000 1.000
## confirm -1.300e+01 1.514e+04 -0.001 0.999
## contact 1.530e+00 1.262e+04 0.000 1.000
## continu 1.487e+01 1.535e+04 0.001 0.999
## contract -1.295e+01 1.498e+04 -0.001 0.999
## copi -4.274e+01 3.070e+04 -0.001 0.999
## corp 1.606e+01 2.708e+04 0.001 1.000
## corpor -8.286e-01 2.818e+04 0.000 1.000
## cost -1.938e+00 1.833e+04 0.000 1.000
## cours 1.665e+01 1.834e+04 0.001 0.999
## creat 1.338e+01 3.946e+04 0.000 1.000
## credit 2.617e+01 1.314e+04 0.002 0.998
## crenshaw 9.994e+01 6.769e+04 0.001 0.999
## current 3.629e+00 1.707e+04 0.000 1.000
## custom 1.829e+01 1.008e+04 0.002 0.999
## data -2.609e+01 2.271e+04 -0.001 0.999
## date -2.786e+00 1.699e+04 0.000 1.000
## day -6.100e+00 5.866e+03 -0.001 0.999
## deal -1.129e+01 1.448e+04 -0.001 0.999
## dear -2.313e+00 2.306e+04 0.000 1.000
## depart -4.068e+01 2.509e+04 -0.002 0.999
## deriv -4.971e+01 3.587e+04 -0.001 0.999
## design -7.923e+00 2.939e+04 0.000 1.000
## detail 1.197e+01 2.301e+04 0.001 1.000
## develop 5.976e+00 9.455e+03 0.001 0.999
## differ -2.293e+00 1.075e+04 0.000 1.000
## direct -2.051e+01 3.194e+04 -0.001 0.999
## director -1.770e+01 1.793e+04 -0.001 0.999
## discuss -1.051e+01 1.915e+04 -0.001 1.000
## doc -2.597e+01 2.603e+04 -0.001 0.999
## don 2.129e+01 1.456e+04 0.001 0.999
## done 6.828e+00 1.882e+04 0.000 1.000
## due -4.163e+00 3.532e+04 0.000 1.000
## ect 8.685e-01 5.342e+03 0.000 1.000
## edu -2.122e-01 6.917e+02 0.000 1.000
## effect 1.948e+01 2.100e+04 0.001 0.999
## effort 1.606e+01 5.670e+04 0.000 1.000
## either -2.744e+01 4.000e+04 -0.001 0.999
## email 3.833e+00 1.186e+04 0.000 1.000
## end -1.311e+01 2.938e+04 0.000 1.000
## energi -1.620e+01 1.646e+04 -0.001 0.999
## engin 2.664e+01 2.394e+04 0.001 0.999
## enron -8.789e+00 5.719e+03 -0.002 0.999
## etc 9.470e-01 1.569e+04 0.000 1.000
## even -1.654e+01 2.289e+04 -0.001 0.999
## event 1.694e+01 1.851e+04 0.001 0.999
## expect -1.179e+01 1.914e+04 -0.001 1.000
## experi 2.460e+00 2.240e+04 0.000 1.000
## fax 3.537e+00 3.386e+04 0.000 1.000
## feel 2.596e+00 2.348e+04 0.000 1.000
## file -2.943e+01 2.165e+04 -0.001 0.999
## final 8.075e+00 5.008e+04 0.000 1.000
## financ -9.122e+00 7.524e+03 -0.001 0.999
## financi -9.747e+00 1.727e+04 -0.001 1.000
## find -2.623e+00 9.727e+03 0.000 1.000
## first -4.666e-01 2.043e+04 0.000 1.000
## follow 1.766e+01 3.080e+03 0.006 0.995
## form 8.483e+00 1.674e+04 0.001 1.000
## forward -3.484e+00 1.864e+04 0.000 1.000
## free 6.113e+00 8.121e+03 0.001 0.999
## friday -1.146e+01 1.996e+04 -0.001 1.000
## full 2.125e+01 2.190e+04 0.001 0.999
## futur 4.146e+01 1.439e+04 0.003 0.998
## gas -3.901e+00 4.160e+03 -0.001 0.999
## get 5.154e+00 9.737e+03 0.001 1.000
## gibner 2.901e+01 2.460e+04 0.001 0.999
## give -2.518e+01 2.130e+04 -0.001 0.999
## given -2.186e+01 5.426e+04 0.000 1.000
## good 5.399e+00 1.619e+04 0.000 1.000
## great 1.222e+01 1.090e+04 0.001 0.999
## group 5.264e-01 1.037e+04 0.000 1.000
## happi 1.939e-02 1.202e+04 0.000 1.000
## hear 2.887e+01 2.281e+04 0.001 0.999
## hello 2.166e+01 1.361e+04 0.002 0.999
## help 1.731e+01 2.791e+03 0.006 0.995
## high -1.982e+00 2.554e+04 0.000 1.000
## home 5.973e+00 8.965e+03 0.001 0.999
## hope -1.435e+01 2.179e+04 -0.001 0.999
## hou 6.852e+00 6.437e+03 0.001 0.999
## hour 2.478e+00 1.333e+04 0.000 1.000
## houston -1.855e+01 7.305e+03 -0.003 0.998
## howev -3.449e+01 3.562e+04 -0.001 0.999
## http 2.528e+01 2.107e+04 0.001 0.999
## idea -1.845e+01 3.892e+04 0.000 1.000
## immedi 6.285e+01 3.346e+04 0.002 0.999
## import -1.859e+00 2.236e+04 0.000 1.000
## includ -3.454e+00 1.799e+04 0.000 1.000
## increas 6.476e+00 2.329e+04 0.000 1.000
## industri -3.160e+01 2.373e+04 -0.001 0.999
## info -1.255e+00 4.857e+03 0.000 1.000
## inform 2.078e+01 8.549e+03 0.002 0.998
## interest 2.698e+01 1.159e+04 0.002 0.998
## intern -7.991e+00 3.351e+04 0.000 1.000
## internet 8.749e+00 1.100e+04 0.001 0.999
## interview -1.640e+01 1.873e+04 -0.001 0.999
## invest 3.201e+01 2.393e+04 0.001 0.999
## invit 4.304e+00 2.215e+04 0.000 1.000
## involv 3.815e+01 3.315e+04 0.001 0.999
## issu -3.708e+01 3.396e+04 -0.001 0.999
## john -5.326e-01 2.856e+04 0.000 1.000
## join -3.824e+01 2.334e+04 -0.002 0.999
## juli -1.358e+01 3.009e+04 0.000 1.000
## just -1.021e+01 1.114e+04 -0.001 0.999
## kaminski -1.812e+01 6.029e+03 -0.003 0.998
## keep 1.867e+01 2.782e+04 0.001 0.999
## kevin -3.779e+01 4.738e+04 -0.001 0.999
## know 1.277e+01 1.526e+04 0.001 0.999
## last 1.046e+00 1.372e+04 0.000 1.000
## let -2.763e+01 1.462e+04 -0.002 0.998
## life 5.812e+01 3.864e+04 0.002 0.999
## like 5.649e+00 7.660e+03 0.001 0.999
## line 8.743e+00 1.236e+04 0.001 0.999
## link -6.929e+00 1.345e+04 -0.001 1.000
## list -8.692e+00 2.149e+03 -0.004 0.997
## locat 2.073e+01 1.597e+04 0.001 0.999
## london 6.745e+00 1.642e+04 0.000 1.000
## long -1.489e+01 1.934e+04 -0.001 0.999
## look -7.031e+00 1.563e+04 0.000 1.000
## lot -1.964e+01 1.321e+04 -0.001 0.999
## made 2.820e+00 2.743e+04 0.000 1.000
## mail 7.584e+00 1.021e+04 0.001 0.999
## make 2.901e+01 1.528e+04 0.002 0.998
## manag 6.014e+00 1.445e+04 0.000 1.000
## mani 1.885e+01 1.442e+04 0.001 0.999
## mark -3.350e+01 3.208e+04 -0.001 0.999
## market 7.895e+00 8.012e+03 0.001 0.999
## may -9.434e+00 1.397e+04 -0.001 0.999
## mean 6.078e-01 2.952e+04 0.000 1.000
## meet -1.063e+00 1.263e+04 0.000 1.000
## member 1.381e+01 2.343e+04 0.001 1.000
## mention -2.279e+01 2.714e+04 -0.001 0.999
## messag 1.716e+01 2.562e+03 0.007 0.995
## might 1.244e+01 1.753e+04 0.001 0.999
## model -2.292e+01 1.049e+04 -0.002 0.998
## monday -1.034e+00 3.233e+04 0.000 1.000
## money 3.264e+01 1.321e+04 0.002 0.998
## month -3.727e+00 1.112e+04 0.000 1.000
## morn -2.645e+01 3.403e+04 -0.001 0.999
## move -3.834e+01 3.011e+04 -0.001 0.999
## much 3.775e-01 1.392e+04 0.000 1.000
## name 1.672e+01 1.322e+04 0.001 0.999
## need 8.437e-01 1.221e+04 0.000 1.000
## net 1.256e+01 2.197e+04 0.001 1.000
## new 1.003e+00 1.009e+04 0.000 1.000
## next. 1.492e+01 1.724e+04 0.001 0.999
## note 1.446e+01 2.294e+04 0.001 0.999
## now 3.790e+01 1.219e+04 0.003 0.998
## number -9.622e+00 1.591e+04 -0.001 1.000
## offer 1.174e+01 1.084e+04 0.001 0.999
## offic -1.344e+01 2.311e+04 -0.001 1.000
## one 1.241e+01 6.652e+03 0.002 0.999
## onlin 3.589e+01 1.665e+04 0.002 0.998
## open 2.114e+01 2.961e+04 0.001 0.999
## oper -1.696e+01 2.757e+04 -0.001 1.000
## opportun -4.131e+00 1.918e+04 0.000 1.000
## option -1.085e+00 9.325e+03 0.000 1.000
## order 6.533e+00 1.242e+04 0.001 1.000
## origin 3.226e+01 3.818e+04 0.001 0.999
## part 4.594e+00 3.483e+04 0.000 1.000
## particip -1.154e+01 1.738e+04 -0.001 0.999
## peopl -1.864e+01 1.439e+04 -0.001 0.999
## per 1.367e+01 1.273e+04 0.001 0.999
## person 1.870e+01 9.575e+03 0.002 0.998
## phone -6.957e+00 1.172e+04 -0.001 1.000
## place 9.005e+00 3.661e+04 0.000 1.000
## plan -1.830e+01 6.320e+03 -0.003 0.998
## pleas -7.961e+00 9.484e+03 -0.001 0.999
## point 5.498e+00 3.403e+04 0.000 1.000
## posit -1.543e+01 2.316e+04 -0.001 0.999
## possibl -1.366e+01 2.492e+04 -0.001 1.000
## power -5.643e+00 1.173e+04 0.000 1.000
## present -6.163e+00 1.278e+04 0.000 1.000
## price 3.428e+00 7.850e+03 0.000 1.000
## problem 1.262e+01 9.763e+03 0.001 0.999
## process -2.957e-01 1.191e+04 0.000 1.000
## product 1.016e+01 1.345e+04 0.001 0.999
## program 1.444e+00 1.183e+04 0.000 1.000
## project 2.173e+00 1.497e+04 0.000 1.000
## provid 2.422e-01 1.859e+04 0.000 1.000
## public -5.250e+01 2.341e+04 -0.002 0.998
## put -1.052e+01 2.681e+04 0.000 1.000
## question -3.467e+01 1.859e+04 -0.002 0.999
## rate -3.112e+00 1.319e+04 0.000 1.000
## read -1.527e+01 2.145e+04 -0.001 0.999
## real 2.046e+01 2.358e+04 0.001 0.999
## realli -2.667e+01 4.640e+04 -0.001 1.000
## receiv 5.765e-01 1.585e+04 0.000 1.000
## recent -2.067e+00 1.780e+04 0.000 1.000
## regard -3.668e+00 1.511e+04 0.000 1.000
## relat -5.114e+01 1.793e+04 -0.003 0.998
## remov 2.325e+01 2.484e+04 0.001 0.999
## repli 1.538e+01 2.916e+04 0.001 1.000
## report -1.482e+01 1.477e+04 -0.001 0.999
## request -1.232e+01 1.167e+04 -0.001 0.999
## requir 5.004e-01 2.937e+04 0.000 1.000
## research -2.826e+01 1.553e+04 -0.002 0.999
## resourc -2.735e+01 3.522e+04 -0.001 0.999
## respond 2.974e+01 3.888e+04 0.001 0.999
## respons -1.960e+01 3.667e+04 -0.001 1.000
## result -5.002e-01 3.140e+04 0.000 1.000
## resum -9.219e+00 2.100e+04 0.000 1.000
## return 1.745e+01 1.844e+04 0.001 0.999
## review -4.825e+00 1.013e+04 0.000 1.000
## right 2.312e+01 1.590e+04 0.001 0.999
## risk -4.001e+00 1.718e+04 0.000 1.000
## robert -2.096e+01 2.907e+04 -0.001 0.999
## run -5.162e+01 4.434e+04 -0.001 0.999
## say 7.366e+00 2.217e+04 0.000 1.000
## schedul 1.919e+00 3.580e+04 0.000 1.000
## school -3.870e+00 2.882e+04 0.000 1.000
## secur -1.604e+01 2.201e+03 -0.007 0.994
## see -1.120e+01 1.293e+04 -0.001 0.999
## send -2.427e+01 1.222e+04 -0.002 0.998
## sent -1.488e+01 2.195e+04 -0.001 0.999
## servic -7.164e+00 1.235e+04 -0.001 1.000
## set -9.353e+00 2.627e+04 0.000 1.000
## sever 2.041e+01 3.093e+04 0.001 0.999
## shall 1.930e+01 3.075e+04 0.001 0.999
## shirley -7.133e+01 6.329e+04 -0.001 0.999
## short -8.974e+00 1.721e+04 -0.001 1.000
## sinc -3.438e+00 3.546e+04 0.000 1.000
## sincer -2.073e+01 3.515e+04 -0.001 1.000
## site 8.689e+00 1.496e+04 0.001 1.000
## softwar 2.575e+01 1.059e+04 0.002 0.998
## soon 2.350e+01 3.731e+04 0.001 0.999
## sorri 6.036e+00 2.299e+04 0.000 1.000
## special 1.777e+01 2.755e+04 0.001 0.999
## specif -2.337e+01 3.083e+04 -0.001 0.999
## start 1.437e+01 1.897e+04 0.001 0.999
## state 1.221e+01 1.677e+04 0.001 0.999
## still 3.878e+00 2.622e+04 0.000 1.000
## stinson -4.345e+01 2.697e+04 -0.002 0.999
## student -1.815e+01 2.186e+04 -0.001 0.999
## subject 3.041e+01 1.055e+04 0.003 0.998
## success 4.344e+00 2.783e+04 0.000 1.000
## suggest -3.842e+01 4.475e+04 -0.001 0.999
## support -1.539e+01 1.976e+04 -0.001 0.999
## sure -5.503e+00 2.078e+04 0.000 1.000
## system 3.778e+00 9.149e+03 0.000 1.000
## take 5.731e+00 1.716e+04 0.000 1.000
## talk -1.011e+01 2.021e+04 -0.001 1.000
## team 7.940e+00 2.570e+04 0.000 1.000
## term 2.013e+01 2.303e+04 0.001 0.999
## thank -3.890e+01 1.059e+04 -0.004 0.997
## thing 2.579e+01 1.341e+04 0.002 0.998
## think -1.218e+01 2.077e+04 -0.001 1.000
## thought 1.243e+01 3.023e+04 0.000 1.000
## thursday -1.491e+01 3.262e+04 0.000 1.000
## time -5.921e+00 8.335e+03 -0.001 0.999
## today -1.762e+01 1.965e+04 -0.001 0.999
## togeth -2.355e+01 1.869e+04 -0.001 0.999
## trade -1.755e+01 1.483e+04 -0.001 0.999
## tri 9.278e-01 1.282e+04 0.000 1.000
## tuesday -2.808e+01 3.959e+04 -0.001 0.999
## two -2.573e+01 1.844e+04 -0.001 0.999
## type -1.447e+01 2.755e+04 -0.001 1.000
## understand 9.307e+00 2.342e+04 0.000 1.000
## unit -4.020e+00 3.008e+04 0.000 1.000
## univers 1.228e+01 2.197e+04 0.001 1.000
## updat -1.510e+01 1.448e+04 -0.001 0.999
## use -1.385e+01 9.382e+03 -0.001 0.999
## valu 9.024e-01 1.360e+04 0.000 1.000
## version -3.606e+01 2.939e+04 -0.001 0.999
## vinc -3.735e+01 8.647e+03 -0.004 0.997
## visit 2.585e+01 1.170e+04 0.002 0.998
## vkamin -6.649e+01 5.703e+04 -0.001 0.999
## want -2.555e+00 1.106e+04 0.000 1.000
## way 1.339e+01 1.138e+04 0.001 0.999
## web 2.791e+00 1.686e+04 0.000 1.000
## websit -2.563e+01 1.848e+04 -0.001 0.999
## wednesday -1.526e+01 2.642e+04 -0.001 1.000
## week -6.795e+00 1.046e+04 -0.001 0.999
## well -2.222e+01 9.713e+03 -0.002 0.998
## will -1.119e+01 5.980e+03 -0.002 0.999
## wish 1.173e+01 3.175e+04 0.000 1.000
## within 2.900e+01 2.163e+04 0.001 0.999
## without 1.942e+01 1.763e+04 0.001 0.999
## work -1.099e+01 1.160e+04 -0.001 0.999
## write 4.406e+01 2.825e+04 0.002 0.999
## www -7.867e+00 2.224e+04 0.000 1.000
## year -1.010e+01 1.039e+04 -0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4409.49 on 4009 degrees of freedom
## Residual deviance: 13.46 on 3679 degrees of freedom
## AIC: 675.46
##
## Number of Fisher Scoring iterations: 25
The logistic regression model yielded the messages “algorithm did not converge” and “fitted probabilities numerically 0 or 1 occurred”. Both messages are warning signs of overfitting, and the first in particular indicates severe overfitting, often to the point that the model fits the training observations perfectly.
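We can confirm the convergence failure programmatically (a check we add here; given the warning above, the flag will be FALSE):
# glm records whether its fitting algorithm converged.
spamlog$converged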
spamCART <- rpart(spam ~ ., data = train, method = "class")
prp(spamCART)
We see that “vinc” and “enron” (both specific to the user Vincent Kaminski) appear in the CART tree as the top two splits, but that “hou” and “kaminski” (also specific to that user) do not appear.
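As a quick check (our own addition, not part of the assignment), the split variables can be read directly off the fitted rpart object:
# Variables used as split points in the tree; "<leaf>" rows mark terminal nodes.
setdiff(unique(as.character(spamCART$frame$var)), "<leaf>")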
set.seed(123)
spamRF <- randomForest(spam ~ . , data = train)
predlog <- predict(spamlog, type = "response") # training-set fitted probabilities
predCART <- predict(spamCART)[,2] # P(spam) from the CART model
predRF <- predict(spamRF, type = "prob")[,2] # P(spam) from the random forest
cat("\nTraining set predicted probabilities from spamLog < 0.00001 :",
sum(predlog<0.00001))
##
## Training set predicted probabilities from spamLog < 0.00001 : 3046
cat("\nTraining set predicted probabilities from spamLog > 0.99999 :",
sum(predlog>0.99999))
##
## Training set predicted probabilities from spamLog > 0.99999 : 954
cat("\nTraining set predicted probabilities from spamLog between 0.00001 and 0.99999 :",
sum(predlog>=0.00001 & predlog<=0.99999))
##
## Training set predicted probabilities from spamLog between 0.00001 and 0.99999 : 10
Almost all of the training-set predicted probabilities lie within 0.00001 of 0 or 1 (only 10 of the 4,010 training observations fall in between), confirming the warnings above: the model has essentially memorized the training labels.
Show some of the coefficients in the logistic regression model.
head(coef(summary(spamlog)),10)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.816712 10548.743 -2.921363e-03 0.9976691
## X000 14.738355 10583.796 1.392540e-03 0.9988889
## X2000 -36.306532 15559.782 -2.333357e-03 0.9981383
## X2001 -32.147703 13177.688 -2.439556e-03 0.9980535
## X713 -24.273005 29138.380 -8.330252e-04 0.9993353
## X853 -1.212310 59416.827 -2.040349e-05 0.9999837
## abl -2.048513 20883.267 -9.809349e-05 0.9999217
## access -14.797240 13353.490 -1.108118e-03 0.9991159
## account 24.881195 8164.788 3.047378e-03 0.9975685
## addit 1.463489 27027.117 5.414892e-05 0.9999568
cat("\nNumber of variables labeled as significant (at p=0.05 level)\nin logistic regression model :",
sum(coef(summary(spamlog))[,4] < 0.05))
##
## Number of variables labeled as significant (at p=0.05 level)
## in logistic regression model : 0
We see that none of the variables are labeled as significant (a symptom of the logistic regression algorithm not converging).
logconfusion <- table(train$spam, predlog >= 0.5)
logconfusion
##
## FALSE TRUE
## 0 3052 0
## 1 4 954
cat("\nAccuracy : ", sum(diag(logconfusion))/nrow(train))
##
## Accuracy : 0.9990025
cat("\nSensitivity : ", logconfusion["1","TRUE"]/sum(logconfusion["1",]))
##
## Sensitivity : 0.9958246
cat("\nSpecificity : ", logconfusion["0","FALSE"]/sum(logconfusion["0",]))
##
## Specificity : 1
predictROCR <- prediction(predlog, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate
plot(perfROCR, colorize=TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
##
## AUC : 0.9999959
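The same confusion-matrix and ROC calculations recur for every model below. A small helper along these lines (a sketch of our own; the function name and structure are not part of the assignment) would avoid the repetition:
# Confusion-matrix metrics and AUC for predicted spam probabilities.
evalModel <- function(actual, predProb, threshold = 0.5) {
  cm <- table(actual, predProb >= threshold) # rows: actual, cols: predicted
  rocrPred <- prediction(predProb, actual)
  list(confusion = cm,
       accuracy = sum(diag(cm)) / sum(cm),
       sensitivity = cm["1", "TRUE"] / sum(cm["1", ]),
       specificity = cm["0", "FALSE"] / sum(cm["0", ]),
       auc = performance(rocrPred, "auc")@y.values[[1]])
}
# Example: evalModel(train$spam, predlog)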
Using a threshold of 0.5:
CARTconfusion <- table(train$spam, predCART >= 0.5)
CARTconfusion
##
## FALSE TRUE
## 0 2885 167
## 1 64 894
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(train))
##
## Accuracy : 0.942394
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
##
## Sensitivity : 0.9331942
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
##
## Specificity : 0.9452818
predictROCR <- prediction(predCART, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate
plot(perfROCR, colorize=TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
##
## AUC : 0.9696044
RFconfusion <- table(train$spam, predRF >= 0.5)
RFconfusion
##
## FALSE TRUE
## 0 3014 38
## 1 41 917
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(train))
##
## Accuracy : 0.9802993
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
##
## Sensitivity : 0.9572025
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
##
## Specificity : 0.9875491
predictROCR <- prediction(predRF, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate
plot(perfROCR, colorize=TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
##
## AUC : 0.9978155
In terms of both accuracy and AUC, logistic regression is nearly perfect when used on the training set and outperforms the other two models.
predlog <- predict(spamlog, newdata = test, type = "response")
predCART <- predict(spamCART, newdata = test)[,2]
predRF <- predict(spamRF, newdata = test, type = "prob")[,2]
logconfusion <- table(test$spam, predlog >= 0.5)
logconfusion
##
## FALSE TRUE
## 0 1257 51
## 1 34 376
cat("\nAccuracy : ", sum(diag(logconfusion))/nrow(test))
##
## Accuracy : 0.9505239
cat("\nSensitivity : ", logconfusion["1","TRUE"]/sum(logconfusion["1",]))
##
## Sensitivity : 0.9170732
cat("\nSpecificity : ", logconfusion["0","FALSE"]/sum(logconfusion["0",]))
##
## Specificity : 0.9610092
predictROCR <- prediction(predlog, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate
plot(perfROCR, colorize=TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
##
## AUC : 0.9627517
Using a threshold of 0.5:
CARTconfusion <- table(test$spam, predCART >= 0.5)
CARTconfusion
##
## FALSE TRUE
## 0 1228 80
## 1 24 386
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(test))
##
## Accuracy : 0.9394645
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
##
## Sensitivity : 0.9414634
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
##
## Specificity : 0.9388379
predictROCR <- prediction(predCART, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate
plot(perfROCR, colorize=TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
##
## AUC : 0.963176
RFconfusion <- table(test$spam, predRF >= 0.5)
RFconfusion
##
## FALSE TRUE
## 0 1290 18
## 1 23 387
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(test))
##
## Accuracy : 0.976135
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
##
## Sensitivity : 0.9439024
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
##
## Specificity : 0.9862385
predictROCR <- prediction(predRF, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate
plot(perfROCR, colorize=TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
##
## AUC : 0.9975899
Both CART and random forest had very similar accuracies on the training and testing sets. However, logistic regression obtained nearly perfect accuracy and AUC on the training set and had far-from-perfect performance on the testing set. This is an indicator of overfitting.
The random forest outperformed logistic regression and CART in both measures on the test set, obtaining an impressive AUC of 0.997 on the test set.
While we have thus far mostly dealt with frequencies of specific words in our analysis, we can extract other information from text. The last two sections of this problem will deal with two other types of information we can extract.
We will use the number of words in each email as an independent variable. We can use the original document term matrix, dtm, for this task. The document term matrix has documents (in this case, emails) as its rows, terms (in this case, word stems) as its columns, and frequencies as its values. As a result, the sum of all the elements in a row of the document term matrix equals the number of terms present in the corresponding document.
wordCount <- rowSums(as.matrix(dtm))
hist(wordCount)
Nearly all the observations are at the far left of the graph, representing small values, so the distribution is skewed right.
hist(log(wordCount))
After the log transformation, the histogram is much more balanced, suggesting that log(wordCount) is not heavily skewed.
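To quantify what the two histograms show, we can compute the sample skewness by hand (a sketch of our own; packages such as e1071 also provide a skewness() function):
# Sample skewness: large positive values indicate a right-skewed distribution.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(wordCount) # strongly positive
skew(log(wordCount)) # much closer to zero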
Plot logWordCount against whether a message is spam.
emailsSparse$logWordCount <- log(wordCount)
boxplot(logWordCount ~ spam, data = emailsSparse)
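To read the boxplot numerically, we can compare the median log word count by class (a small addition of our own):
# Median log(wordCount) for ham (spam = 0) and spam (spam = 1).
tapply(emailsSparse$logWordCount, emailsSparse$spam, median)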
train2 <- subset(emailsSparse, split == TRUE)
test2 <- subset(emailsSparse, split == FALSE)
spam2CART <- rpart(spam ~ ., data = train2)
prp(spam2CART)
spam2RF <- randomForest(spam ~ ., data = train2)
summary(spam2RF)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 4010 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 8020 matrix numeric
## oob.times 4010 -none- numeric
## classes 2 -none- character
## importance 331 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 4010 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
predCART <- predict(spam2CART, newdata = test2)[,2]
predRF <- predict(spam2RF, newdata = test2, type = "prob")[,2]
Using a threshold of 0.5:
CARTconfusion <- table(test2$spam, predCART >= 0.5)
CARTconfusion
##
## FALSE TRUE
## 0 1214 94
## 1 26 384
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(test2))
##
## Accuracy : 0.9301513
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
##
## Sensitivity : 0.9365854
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
##
## Specificity : 0.9281346
predictROCR <- prediction(predCART, test2$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate
plot(perfROCR, colorize=TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
##
## AUC : 0.9582438
RFconfusion <- table(test2$spam, predRF >= 0.5)
RFconfusion
##
## FALSE TRUE
## 0 1298 10
## 1 28 382
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(test2))
##
## Accuracy : 0.9778813
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
##
## Sensitivity : 0.9317073
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
##
## Specificity : 0.9923547
predictROCR <- prediction(predRF, test2$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate
plot(perfROCR, colorize=TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
##
## AUC : 0.9979265
In this case, adding the logWordCount variable did not improve test-set performance for either the CART model or the random forest model.