Nearly every email user has at some point encountered a “spam” email, which is an unsolicited message often advertising a product, containing links to malware, or attempting to scam the recipient. Roughly 80-90% of more than 100 billion emails sent each day are spam emails, most being sent from botnets of malware-infected computers. The remainder of emails are called “ham” emails.
As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.
In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email address that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.
The dataset contains just two fields:
text: The text of the email.
spam: A binary variable indicating if the email was spam.
# Data
setwd("/home/tarek/Analytics/Weeks/Week5-TextAnalytics/Data")
emails <- read.csv("emails.csv", stringsAsFactors = F)
str(emails)
## 'data.frame': 5728 obs. of 2 variables:
## $ text: chr "Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market is full of suqg"| __truncated__ "Subject: the stock trading gunslinger fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ra"| __truncated__ "Subject: unbelievable new homes made easy im wanting to show you this homeowner you have been pre - approved for a $ 454 , 1"| __truncated__ "Subject: 4 color printing special request additional information now ! click here click here for a printable version of our o"| __truncated__ ...
## $ spam: int 1 1 1 1 1 1 1 1 1 1 ...
# Number of spam emails
sum(emails$spam == 1)
## [1] 1368
# How many characters are in the longest email?
max(nchar(emails$text))
## [1] 43952
# Which row contains the shortest email in the dataset?
which.min(nchar(emails$text))
## [1] 1992
# Pre process data
library(tm)
# Create Corpus
corpus <- Corpus(VectorSource(emails$text))
# Convert to lower case
corpus <- tm_map(corpus, tolower)
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove Stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Stem the words
corpus <- tm_map(corpus, stemDocument)
# Create matrix
dtm <- DocumentTermMatrix(corpus)
dtm
## A document-term matrix (5728 documents, 28687 terms)
##
## Non-/sparse entries: 481719/163837417
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
# Filter out sparse terms by keeping only terms that appear in at least 5%
# or more of the documents
spdtm <- removeSparseTerms(dtm, 0.95)
spdtm
## A document-term matrix (5728 documents, 330 terms)
##
## Non-/sparse entries: 213551/1676689
## Sparsity : 89%
## Maximal term length: 10
## Weighting : term frequency (tf)
# Convert spdtm and to a data frame
emailsSparse <- as.data.frame(as.matrix(spdtm)) # make.names is set to true to make the variable names of emailsSparse valid
colnames(emailsSparse) <- make.names(colnames(emailsSparse), unique = T)
# What is the word stem that shows up most frequently across all the emails
# in the dataset?
which.max(colSums(emailsSparse))
## enron
## 92
# Add variable spam
emailsSparse$spam <- emails$spam
# Create a data set where spam == 0 (ham). The ham dataset is certainly
# personalized to Vincent Kaminski, and therefore it might not generalize
# well to a general email user. Caution is definitely necessary before
# applying the filters derived in this problem to other email users.
ham <- emailsSparse[emailsSparse$spam == 0, ]
# How many word stems appear at least 5000 times in the ham emails in the
# dataset?
sum(colSums(ham) >= 5000)
## [1] 6
# Spam data set
sort(colSums(subset(emailsSparse, spam == 1)))
## X713 crenshaw enron gibner kaminski stinson
## 0 0 0 0 0 0
## vkamin X853 vinc doc kevin shirley
## 0 1 1 2 2 2
## deriv april houston resum edu friday
## 3 5 5 5 7 7
## hou wednesday ect arrang interview attend
## 8 8 10 11 13 15
## london robert student schedul thursday monday
## 15 16 16 17 17 19
## john tuesday attach suggest appreci mark
## 20 20 21 21 23 25
## begin comment analysi X2001 model hope
## 26 26 27 29 29 30
## mention X2000 togeth confer invit univers
## 30 32 32 33 33 34
## financ talk either run morn shall
## 35 38 39 39 40 40
## happi thought depart confirm respond school
## 42 42 46 47 48 48
## corp etc hear howev sorri idea
## 49 49 49 49 50 51
## energi discuss open option soon understand
## 55 56 56 56 57 57
## cours experi associ point bring director
## 59 59 62 62 63 65
## particip anoth join still final research
## 65 66 66 66 68 68
## case set specif given juli problem
## 69 69 69 70 71 73
## put alreadi ask abl deal fax
## 73 74 74 75 75 75
## book team issu locat meet updat
## 76 76 79 79 79 79
## lot sincer better short sinc done
## 80 80 82 82 82 83
## question recent possibl contract end move
## 83 83 84 85 85 86
## data might continu note feel resourc
## 87 87 88 88 90 90
## sever area communic realli due direct
## 90 92 92 93 94 96
## origin copi unit long member sure
## 96 97 97 98 99 99
## allow dear public write event let
## 102 104 104 104 105 107
## differ file involv respons creat type
## 109 111 111 113 114 114
## approv detail effort intern request say
## 115 115 115 117 117 118
## import support part relat assist last
## 119 120 121 121 123 124
## two back keep addit date place
## 124 125 125 126 127 128
## group mean valu think offic read
## 130 131 131 132 133 134
## immedi check applic hello tri review
## 136 137 139 139 140 142
## believ phone hour power present process
## 143 143 144 145 146 149
## corpor oper full return come sent
## 151 151 152 154 155 155
## opportun real repli line engin term
## 158 158 158 159 160 161
## credit well gas info plan next.
## 162 164 165 165 166 170
## risk increas access give thank link
## 170 171 172 172 172 174
## requir version cost great wish regard
## 174 174 175 182 185 186
## posit thing call develop complet much
## 187 188 190 191 192 192
## even project design form expect person
## 193 194 196 196 198 198
## without buy trade effect rate base
## 198 199 199 201 201 202
## find current first chang visit financi
## 202 203 203 204 206 207
## high mani forward good special don
## 208 208 209 221 225 226
## success per number week result web
## 226 230 231 231 237 238
## industri contact made follow month right
## 239 242 242 244 249 249
## today also help internet manag know
## 251 260 262 262 266 269
## way avail state futur home start
## 278 280 280 282 285 300
## system take net includ life see
## 302 304 305 314 320 329
## name onlin within remov best program
## 344 345 346 357 358 358
## peopl custom year like interest send
## 359 363 367 372 385 393
## servic look work day want product
## 395 396 415 420 420 421
## www account provid need softwar messag
## 426 428 435 438 440 445
## site address may list price new
## 455 461 489 503 503 504
## websit report secur just offer invest
## 506 507 520 524 528 540
## order use click X000 now one
## 541 546 552 560 575 592
## time http market make free pleas
## 593 600 600 603 606 619
## money get receiv inform can email
## 662 694 727 818 831 865
## busi mail com compani spam will
## 897 917 999 1065 1368 1450
## subject
## 1577
First, convert the dependent variable to a factor with “emailsSparse$spam = as.factor(emailsSparse$spam)”.
Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called “train” and a testing set called “test”. Make sure to perform this step on emailsSparse instead of emails.
Using the training set, train the following three machine learning models. The models should predict the dependent variable “spam”, using all other available variables as independent variables.
1) A logistic regression model called spamLog.
2) A CART model called spamCART, using the default parameters to train the model.
3) A random forest model called spamRF, using the default parameters to train the model. Directly before training the random forest model, set the random seed to 123.
For each model, obtain the predicted spam probabilities for the training set. Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values. Recall that you can obtain probabilities for CART models by not passing any type parameter to the predict() function, and you can obtain probabilities from a random forest by adding the argument type=“prob”. For CART and random forest, you need to select the second column of the output of the predict() function, corresponding to the probability of a message being spam.
You may have noticed that training the logistic regression model yielded the messages “algorithm did not converge” and “fitted probabilities numerically 0 or 1 occurred”. Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model. Let's investigate the predicted probabilities from the logistic regression model.
emailsSparse$spam <- as.factor(emailsSparse$spam)
# Load CaTools
library(caTools)
set.seed(123)
spl <- sample.split(emailsSparse$spam, SplitRatio = 0.7)
train <- subset(emailsSparse, spl == T)
test <- subset(emailsSparse, spl == F)
# Logistic regression model
spamLog <- glm(spam ~ ., data = train, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(spamLog)
##
## Call:
## glm(formula = spam ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.01 0.00 0.00 0.00 1.35
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.08e+01 1.05e+04 0.00 1.00
## X000 1.47e+01 1.06e+04 0.00 1.00
## X2000 -3.63e+01 1.56e+04 0.00 1.00
## X2001 -3.21e+01 1.32e+04 0.00 1.00
## X713 -2.43e+01 2.91e+04 0.00 1.00
## X853 -1.21e+00 5.94e+04 0.00 1.00
## abl -2.05e+00 2.09e+04 0.00 1.00
## access -1.48e+01 1.34e+04 0.00 1.00
## account 2.49e+01 8.16e+03 0.00 1.00
## addit 1.46e+00 2.70e+04 0.00 1.00
## address -4.61e+00 1.11e+04 0.00 1.00
## allow 1.90e+01 6.44e+03 0.00 1.00
## alreadi -2.41e+01 3.32e+04 0.00 1.00
## also 2.99e+01 1.38e+04 0.00 1.00
## analysi -2.41e+01 3.86e+04 0.00 1.00
## anoth -8.74e+00 2.03e+04 0.00 1.00
## applic -2.65e+00 1.67e+04 0.00 1.00
## appreci -2.14e+01 2.76e+04 0.00 1.00
## approv -1.30e+00 1.59e+04 0.00 1.00
## april -2.62e+01 2.21e+04 0.00 1.00
## area 2.04e+01 2.27e+04 0.00 1.00
## arrang 1.07e+01 2.14e+04 0.00 1.00
## ask -7.75e+00 1.98e+04 0.00 1.00
## assist -1.13e+01 2.49e+04 0.00 1.00
## associ 9.05e+00 1.91e+04 0.00 1.00
## attach -1.04e+01 1.53e+04 0.00 1.00
## attend -3.45e+01 3.26e+04 0.00 1.00
## avail 8.65e+00 1.71e+04 0.00 1.00
## back -1.32e+01 2.27e+04 0.00 1.00
## base -1.35e+01 2.12e+04 0.00 1.00
## begin 2.23e+01 2.97e+04 0.00 1.00
## believ 3.23e+01 2.14e+04 0.00 1.00
## best -8.20e+00 1.33e+03 -0.01 1.00
## better 4.26e+01 2.36e+04 0.00 1.00
## book 4.30e+00 2.02e+04 0.00 1.00
## bring 1.61e+01 6.77e+04 0.00 1.00
## busi -4.80e+00 1.00e+04 0.00 1.00
## buy 4.17e+01 3.89e+04 0.00 1.00
## call -1.15e+00 1.11e+04 0.00 1.00
## can 3.76e+00 7.67e+03 0.00 1.00
## case -3.37e+01 2.88e+04 0.00 1.00
## chang -2.72e+01 2.22e+04 0.00 1.00
## check 1.43e+00 1.96e+04 0.00 1.00
## click 1.38e+01 7.08e+03 0.00 1.00
## com 1.94e+00 4.04e+03 0.00 1.00
## come -1.17e+00 1.51e+04 0.00 1.00
## comment -3.25e+00 3.39e+04 0.00 1.00
## communic 1.58e+01 8.96e+03 0.00 1.00
## compani 4.78e+00 9.19e+03 0.00 1.00
## complet -1.36e+01 2.02e+04 0.00 1.00
## confer -7.50e-01 8.56e+03 0.00 1.00
## confirm -1.30e+01 1.51e+04 0.00 1.00
## contact 1.53e+00 1.26e+04 0.00 1.00
## continu 1.49e+01 1.54e+04 0.00 1.00
## contract -1.30e+01 1.50e+04 0.00 1.00
## copi -4.27e+01 3.07e+04 0.00 1.00
## corp 1.61e+01 2.71e+04 0.00 1.00
## corpor -8.29e-01 2.82e+04 0.00 1.00
## cost -1.94e+00 1.83e+04 0.00 1.00
## cours 1.67e+01 1.83e+04 0.00 1.00
## creat 1.34e+01 3.95e+04 0.00 1.00
## credit 2.62e+01 1.31e+04 0.00 1.00
## crenshaw 9.99e+01 6.77e+04 0.00 1.00
## current 3.63e+00 1.71e+04 0.00 1.00
## custom 1.83e+01 1.01e+04 0.00 1.00
## data -2.61e+01 2.27e+04 0.00 1.00
## date -2.79e+00 1.70e+04 0.00 1.00
## day -6.10e+00 5.87e+03 0.00 1.00
## deal -1.13e+01 1.45e+04 0.00 1.00
## dear -2.31e+00 2.31e+04 0.00 1.00
## depart -4.07e+01 2.51e+04 0.00 1.00
## deriv -4.97e+01 3.59e+04 0.00 1.00
## design -7.92e+00 2.94e+04 0.00 1.00
## detail 1.20e+01 2.30e+04 0.00 1.00
## develop 5.98e+00 9.45e+03 0.00 1.00
## differ -2.29e+00 1.07e+04 0.00 1.00
## direct -2.05e+01 3.19e+04 0.00 1.00
## director -1.77e+01 1.79e+04 0.00 1.00
## discuss -1.05e+01 1.92e+04 0.00 1.00
## doc -2.60e+01 2.60e+04 0.00 1.00
## don 2.13e+01 1.46e+04 0.00 1.00
## done 6.83e+00 1.88e+04 0.00 1.00
## due -4.16e+00 3.53e+04 0.00 1.00
## ect 8.69e-01 5.34e+03 0.00 1.00
## edu -2.12e-01 6.92e+02 0.00 1.00
## effect 1.95e+01 2.10e+04 0.00 1.00
## effort 1.61e+01 5.67e+04 0.00 1.00
## either -2.74e+01 4.00e+04 0.00 1.00
## email 3.83e+00 1.19e+04 0.00 1.00
## end -1.31e+01 2.94e+04 0.00 1.00
## energi -1.62e+01 1.65e+04 0.00 1.00
## engin 2.66e+01 2.39e+04 0.00 1.00
## enron -8.79e+00 5.72e+03 0.00 1.00
## etc 9.47e-01 1.57e+04 0.00 1.00
## even -1.65e+01 2.29e+04 0.00 1.00
## event 1.69e+01 1.85e+04 0.00 1.00
## expect -1.18e+01 1.91e+04 0.00 1.00
## experi 2.46e+00 2.24e+04 0.00 1.00
## fax 3.54e+00 3.39e+04 0.00 1.00
## feel 2.60e+00 2.35e+04 0.00 1.00
## file -2.94e+01 2.16e+04 0.00 1.00
## final 8.07e+00 5.01e+04 0.00 1.00
## financ -9.12e+00 7.52e+03 0.00 1.00
## financi -9.75e+00 1.73e+04 0.00 1.00
## find -2.62e+00 9.73e+03 0.00 1.00
## first -4.67e-01 2.04e+04 0.00 1.00
## follow 1.77e+01 3.08e+03 0.01 1.00
## form 8.48e+00 1.67e+04 0.00 1.00
## forward -3.48e+00 1.86e+04 0.00 1.00
## free 6.11e+00 8.12e+03 0.00 1.00
## friday -1.15e+01 2.00e+04 0.00 1.00
## full 2.13e+01 2.19e+04 0.00 1.00
## futur 4.15e+01 1.44e+04 0.00 1.00
## gas -3.90e+00 4.16e+03 0.00 1.00
## get 5.15e+00 9.74e+03 0.00 1.00
## gibner 2.90e+01 2.46e+04 0.00 1.00
## give -2.52e+01 2.13e+04 0.00 1.00
## given -2.19e+01 5.43e+04 0.00 1.00
## good 5.40e+00 1.62e+04 0.00 1.00
## great 1.22e+01 1.09e+04 0.00 1.00
## group 5.26e-01 1.04e+04 0.00 1.00
## happi 1.94e-02 1.20e+04 0.00 1.00
## hear 2.89e+01 2.28e+04 0.00 1.00
## hello 2.17e+01 1.36e+04 0.00 1.00
## help 1.73e+01 2.79e+03 0.01 1.00
## high -1.98e+00 2.55e+04 0.00 1.00
## home 5.97e+00 8.96e+03 0.00 1.00
## hope -1.44e+01 2.18e+04 0.00 1.00
## hou 6.85e+00 6.44e+03 0.00 1.00
## hour 2.48e+00 1.33e+04 0.00 1.00
## houston -1.85e+01 7.31e+03 0.00 1.00
## howev -3.45e+01 3.56e+04 0.00 1.00
## http 2.53e+01 2.11e+04 0.00 1.00
## idea -1.84e+01 3.89e+04 0.00 1.00
## immedi 6.29e+01 3.35e+04 0.00 1.00
## import -1.86e+00 2.24e+04 0.00 1.00
## includ -3.45e+00 1.80e+04 0.00 1.00
## increas 6.48e+00 2.33e+04 0.00 1.00
## industri -3.16e+01 2.37e+04 0.00 1.00
## info -1.25e+00 4.86e+03 0.00 1.00
## inform 2.08e+01 8.55e+03 0.00 1.00
## interest 2.70e+01 1.16e+04 0.00 1.00
## intern -7.99e+00 3.35e+04 0.00 1.00
## internet 8.75e+00 1.10e+04 0.00 1.00
## interview -1.64e+01 1.87e+04 0.00 1.00
## invest 3.20e+01 2.39e+04 0.00 1.00
## invit 4.30e+00 2.22e+04 0.00 1.00
## involv 3.81e+01 3.32e+04 0.00 1.00
## issu -3.71e+01 3.40e+04 0.00 1.00
## john -5.33e-01 2.86e+04 0.00 1.00
## join -3.82e+01 2.33e+04 0.00 1.00
## juli -1.36e+01 3.01e+04 0.00 1.00
## just -1.02e+01 1.11e+04 0.00 1.00
## kaminski -1.81e+01 6.03e+03 0.00 1.00
## keep 1.87e+01 2.78e+04 0.00 1.00
## kevin -3.78e+01 4.74e+04 0.00 1.00
## know 1.28e+01 1.53e+04 0.00 1.00
## last 1.05e+00 1.37e+04 0.00 1.00
## let -2.76e+01 1.46e+04 0.00 1.00
## life 5.81e+01 3.86e+04 0.00 1.00
## like 5.65e+00 7.66e+03 0.00 1.00
## line 8.74e+00 1.24e+04 0.00 1.00
## link -6.93e+00 1.34e+04 0.00 1.00
## list -8.69e+00 2.15e+03 0.00 1.00
## locat 2.07e+01 1.60e+04 0.00 1.00
## london 6.75e+00 1.64e+04 0.00 1.00
## long -1.49e+01 1.93e+04 0.00 1.00
## look -7.03e+00 1.56e+04 0.00 1.00
## lot -1.96e+01 1.32e+04 0.00 1.00
## made 2.82e+00 2.74e+04 0.00 1.00
## mail 7.58e+00 1.02e+04 0.00 1.00
## make 2.90e+01 1.53e+04 0.00 1.00
## manag 6.01e+00 1.45e+04 0.00 1.00
## mani 1.89e+01 1.44e+04 0.00 1.00
## mark -3.35e+01 3.21e+04 0.00 1.00
## market 7.90e+00 8.01e+03 0.00 1.00
## may -9.43e+00 1.40e+04 0.00 1.00
## mean 6.08e-01 2.95e+04 0.00 1.00
## meet -1.06e+00 1.26e+04 0.00 1.00
## member 1.38e+01 2.34e+04 0.00 1.00
## mention -2.28e+01 2.71e+04 0.00 1.00
## messag 1.72e+01 2.56e+03 0.01 0.99
## might 1.24e+01 1.75e+04 0.00 1.00
## model -2.29e+01 1.05e+04 0.00 1.00
## monday -1.03e+00 3.23e+04 0.00 1.00
## money 3.26e+01 1.32e+04 0.00 1.00
## month -3.73e+00 1.11e+04 0.00 1.00
## morn -2.64e+01 3.40e+04 0.00 1.00
## move -3.83e+01 3.01e+04 0.00 1.00
## much 3.78e-01 1.39e+04 0.00 1.00
## name 1.67e+01 1.32e+04 0.00 1.00
## need 8.44e-01 1.22e+04 0.00 1.00
## net 1.26e+01 2.20e+04 0.00 1.00
## new 1.00e+00 1.01e+04 0.00 1.00
## next. 1.49e+01 1.72e+04 0.00 1.00
## note 1.45e+01 2.29e+04 0.00 1.00
## now 3.79e+01 1.22e+04 0.00 1.00
## number -9.62e+00 1.59e+04 0.00 1.00
## offer 1.17e+01 1.08e+04 0.00 1.00
## offic -1.34e+01 2.31e+04 0.00 1.00
## one 1.24e+01 6.65e+03 0.00 1.00
## onlin 3.59e+01 1.66e+04 0.00 1.00
## open 2.11e+01 2.96e+04 0.00 1.00
## oper -1.70e+01 2.76e+04 0.00 1.00
## opportun -4.13e+00 1.92e+04 0.00 1.00
## option -1.09e+00 9.33e+03 0.00 1.00
## order 6.53e+00 1.24e+04 0.00 1.00
## origin 3.23e+01 3.82e+04 0.00 1.00
## part 4.59e+00 3.48e+04 0.00 1.00
## particip -1.15e+01 1.74e+04 0.00 1.00
## peopl -1.86e+01 1.44e+04 0.00 1.00
## per 1.37e+01 1.27e+04 0.00 1.00
## person 1.87e+01 9.58e+03 0.00 1.00
## phone -6.96e+00 1.17e+04 0.00 1.00
## place 9.01e+00 3.66e+04 0.00 1.00
## plan -1.83e+01 6.32e+03 0.00 1.00
## pleas -7.96e+00 9.48e+03 0.00 1.00
## point 5.50e+00 3.40e+04 0.00 1.00
## posit -1.54e+01 2.32e+04 0.00 1.00
## possibl -1.37e+01 2.49e+04 0.00 1.00
## power -5.64e+00 1.17e+04 0.00 1.00
## present -6.16e+00 1.28e+04 0.00 1.00
## price 3.43e+00 7.85e+03 0.00 1.00
## problem 1.26e+01 9.76e+03 0.00 1.00
## process -2.96e-01 1.19e+04 0.00 1.00
## product 1.02e+01 1.34e+04 0.00 1.00
## program 1.44e+00 1.18e+04 0.00 1.00
## project 2.17e+00 1.50e+04 0.00 1.00
## provid 2.42e-01 1.86e+04 0.00 1.00
## public -5.25e+01 2.34e+04 0.00 1.00
## put -1.05e+01 2.68e+04 0.00 1.00
## question -3.47e+01 1.86e+04 0.00 1.00
## rate -3.11e+00 1.32e+04 0.00 1.00
## read -1.53e+01 2.14e+04 0.00 1.00
## real 2.05e+01 2.36e+04 0.00 1.00
## realli -2.67e+01 4.64e+04 0.00 1.00
## receiv 5.77e-01 1.58e+04 0.00 1.00
## recent -2.07e+00 1.78e+04 0.00 1.00
## regard -3.67e+00 1.51e+04 0.00 1.00
## relat -5.11e+01 1.79e+04 0.00 1.00
## remov 2.33e+01 2.48e+04 0.00 1.00
## repli 1.54e+01 2.92e+04 0.00 1.00
## report -1.48e+01 1.48e+04 0.00 1.00
## request -1.23e+01 1.17e+04 0.00 1.00
## requir 5.00e-01 2.94e+04 0.00 1.00
## research -2.83e+01 1.55e+04 0.00 1.00
## resourc -2.73e+01 3.52e+04 0.00 1.00
## respond 2.97e+01 3.89e+04 0.00 1.00
## respons -1.96e+01 3.67e+04 0.00 1.00
## result -5.00e-01 3.14e+04 0.00 1.00
## resum -9.22e+00 2.10e+04 0.00 1.00
## return 1.75e+01 1.84e+04 0.00 1.00
## review -4.82e+00 1.01e+04 0.00 1.00
## right 2.31e+01 1.59e+04 0.00 1.00
## risk -4.00e+00 1.72e+04 0.00 1.00
## robert -2.10e+01 2.91e+04 0.00 1.00
## run -5.16e+01 4.43e+04 0.00 1.00
## say 7.37e+00 2.22e+04 0.00 1.00
## schedul 1.92e+00 3.58e+04 0.00 1.00
## school -3.87e+00 2.88e+04 0.00 1.00
## secur -1.60e+01 2.20e+03 -0.01 0.99
## see -1.12e+01 1.29e+04 0.00 1.00
## send -2.43e+01 1.22e+04 0.00 1.00
## sent -1.49e+01 2.20e+04 0.00 1.00
## servic -7.16e+00 1.24e+04 0.00 1.00
## set -9.35e+00 2.63e+04 0.00 1.00
## sever 2.04e+01 3.09e+04 0.00 1.00
## shall 1.93e+01 3.07e+04 0.00 1.00
## shirley -7.13e+01 6.33e+04 0.00 1.00
## short -8.97e+00 1.72e+04 0.00 1.00
## sinc -3.44e+00 3.55e+04 0.00 1.00
## sincer -2.07e+01 3.51e+04 0.00 1.00
## site 8.69e+00 1.50e+04 0.00 1.00
## softwar 2.57e+01 1.06e+04 0.00 1.00
## soon 2.35e+01 3.73e+04 0.00 1.00
## sorri 6.04e+00 2.30e+04 0.00 1.00
## special 1.78e+01 2.76e+04 0.00 1.00
## specif -2.34e+01 3.08e+04 0.00 1.00
## start 1.44e+01 1.90e+04 0.00 1.00
## state 1.22e+01 1.68e+04 0.00 1.00
## still 3.88e+00 2.62e+04 0.00 1.00
## stinson -4.35e+01 2.70e+04 0.00 1.00
## student -1.81e+01 2.19e+04 0.00 1.00
## subject 3.04e+01 1.05e+04 0.00 1.00
## success 4.34e+00 2.78e+04 0.00 1.00
## suggest -3.84e+01 4.47e+04 0.00 1.00
## support -1.54e+01 1.98e+04 0.00 1.00
## sure -5.50e+00 2.08e+04 0.00 1.00
## system 3.78e+00 9.15e+03 0.00 1.00
## take 5.73e+00 1.72e+04 0.00 1.00
## talk -1.01e+01 2.02e+04 0.00 1.00
## team 7.94e+00 2.57e+04 0.00 1.00
## term 2.01e+01 2.30e+04 0.00 1.00
## thank -3.89e+01 1.06e+04 0.00 1.00
## thing 2.58e+01 1.34e+04 0.00 1.00
## think -1.22e+01 2.08e+04 0.00 1.00
## thought 1.24e+01 3.02e+04 0.00 1.00
## thursday -1.49e+01 3.26e+04 0.00 1.00
## time -5.92e+00 8.33e+03 0.00 1.00
## today -1.76e+01 1.96e+04 0.00 1.00
## togeth -2.35e+01 1.87e+04 0.00 1.00
## trade -1.76e+01 1.48e+04 0.00 1.00
## tri 9.28e-01 1.28e+04 0.00 1.00
## tuesday -2.81e+01 3.96e+04 0.00 1.00
## two -2.57e+01 1.84e+04 0.00 1.00
## type -1.45e+01 2.75e+04 0.00 1.00
## understand 9.31e+00 2.34e+04 0.00 1.00
## unit -4.02e+00 3.01e+04 0.00 1.00
## univers 1.23e+01 2.20e+04 0.00 1.00
## updat -1.51e+01 1.45e+04 0.00 1.00
## use -1.39e+01 9.38e+03 0.00 1.00
## valu 9.02e-01 1.36e+04 0.00 1.00
## version -3.61e+01 2.94e+04 0.00 1.00
## vinc -3.73e+01 8.65e+03 0.00 1.00
## visit 2.58e+01 1.17e+04 0.00 1.00
## vkamin -6.65e+01 5.70e+04 0.00 1.00
## want -2.56e+00 1.11e+04 0.00 1.00
## way 1.34e+01 1.14e+04 0.00 1.00
## web 2.79e+00 1.69e+04 0.00 1.00
## websit -2.56e+01 1.85e+04 0.00 1.00
## wednesday -1.53e+01 2.64e+04 0.00 1.00
## week -6.79e+00 1.05e+04 0.00 1.00
## well -2.22e+01 9.71e+03 0.00 1.00
## will -1.12e+01 5.98e+03 0.00 1.00
## wish 1.17e+01 3.17e+04 0.00 1.00
## within 2.90e+01 2.16e+04 0.00 1.00
## without 1.94e+01 1.76e+04 0.00 1.00
## work -1.10e+01 1.16e+04 0.00 1.00
## write 4.41e+01 2.82e+04 0.00 1.00
## www -7.87e+00 2.22e+04 0.00 1.00
## year -1.01e+01 1.04e+04 0.00 1.00
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4409.49 on 4009 degrees of freedom
## Residual deviance: 13.46 on 3679 degrees of freedom
## AIC: 675.5
##
## Number of Fisher Scoring iterations: 25
predLog <- predict(spamLog, type = "response")
sum(predLog < 1e-05)
## [1] 3046
sum(predLog > 0.99999)
## [1] 954
sum(predLog > 1e-05 & predLog < 0.99999)
## [1] 10
# Accuracy
tLog <- table(train$spam, predLog >= 0.5)
(tLog[1, 1] + tLog[2, 2])/sum(tLog)
## [1] 0.999
# AUC
library(ROCR)
## Loading required package: gplots
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
predROCR = prediction(predLog, train$spam)
perfROCR = performance(predROCR, "tpr", "fpr")
plot(perfROCR, colorize = TRUE)
# Compute AUC
performance(predROCR, "auc")@y.values
## [[1]]
## [1] 1
# CART Model
library(rpart)
library(rpart.plot)
spamCART <- rpart(spam ~ ., data = train, method = "class")
prp(spamCART)
# Predict using the trainig set.
predTrain <- predict(spamCART)[, 2]
# Accuracy on the training set
tCART <- table(train$spam, predTrain >= 0.5)
(tCART[1, 1] + tCART[2, 2])/(sum(tCART))
## [1] 0.9424
# AUC of the CART model
predROCRCART = prediction(predTrain, train$spam)
perfROCRCART = performance(predROCRCART, "tpr", "fpr")
performance(predROCRCART, "auc")@y.values
## [[1]]
## [1] 0.9696
# Random Forest model
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
spamRF <- randomForest(spam ~ ., data = train, method = "class")
# Accuracy of RF Model
predRF <- predict(spamRF, type = "prob")[, 2]
tRF <- table(train$spam, predRF >= 0.5)
(tRF[1, 1] + tRF[2, 2])/(sum(tRF))
## [1] 0.9803
# Performance of RF Model
predROCRRF = prediction(predRF, train$spam)
performance(predROCRRF, "auc")@y.values
## [[1]]
## [1] 0.998
# Testing set accuracy
predTestLog <- predict(spamLog, newdata = test, type = "response")
t2 <- table(test$spam, predTestLog >= 0.5)
(t2[1, 1] + t2[2, 2])/(sum(t2))
## [1] 0.9505
# ROC Curve
library(ROCR)
predROCRLog = prediction(predTestLog, test$spam)
# Compute AUC
performance(predROCRLog, "auc")@y.values
## [[1]]
## [1] 0.9628
# CART Test Accuracy and performance
predTestCART <- predict(spamCART, newdata = test)[, 2]
t3 <- table(test$spam, predTestCART >= 0.5)
(t3[1, 1] + t3[2, 2])/(sum(t3))
## [1] 0.9395
predROCRCART = prediction(predTestCART, test$spam)
# Compute AUC
performance(predROCRCART, "auc")@y.values
## [[1]]
## [1] 0.9632
Thus far, we have used a threshold of 0.5 as the cutoff for predicting that an email message is spam, and we have used accuracy as one of our measures of model quality. As we have previously learned, these are good choices when we have no preference for different types of errors (false positives vs. false negatives), but other choices might be better if we assign a higher cost to one type of error.
Consider the case of an email provider using the spam filter we have developed. The email provider moves all of the emails flagged as spam to a separate “Junk Email” folder, meaning those emails are not displayed in the main inbox. The emails not flagged as spam by the algorithm are displayed in the inbox. Many of this provider's email users never check the spam folder, so they will never see emails delivered there.
A false negative means the model labels a spam email as ham. This results in a spam email being displayed in the main inbox.
A false positive means the model labels a ham email as spam. This results in a ham email being sent to the Junk Email folder.
A false negative is largely a nuisance (the user will need to delete the unsolicited email). However a false positive can be very costly, since the user might completely miss an important email due to it being delivered to the spam folder. Therefore, the false positive is more costly.
A false negative results in spam reaching a user's main inbox, which is a nuisance. A user who is particularly annoyed by such spam would assign a particularly high cost to a false negative.
A false positive results in ham being sent to a user's Junk Email folder. While the user might catch the mistake upon checking the Junk Email folder, users who never check this folder will miss the email, incurring a particularly high cost.
While before many users would completely miss a ham email labeled as spam (false positive), now users will not miss an email after this sort of mistake. As a result, the cost of a false positive has been decreased.
While using expert opinion is practical, it is not personalized (we would use the same cost for all users). Likewise, a random sample of user preferences doesn't enable personalized costs for each user.
While a survey of all users would enable personalization, it is impractical to obtain survey results from all or most of the users.
While it's impractical to survey all users, it is easy to automatically collect their usage patterns. This could enable us to select higher regression thresholds for users who rarely check their Junk Email folder but lower thresholds for users who regularly check the folder.
While we have thus far mostly dealt with frequencies of specific words in our analysis, we can extract other information from text. The last two sections of this problem will deal with two other types of information we can extract.
First, we will use the number of words in the each email as an independent variable. We can use the original document term matrix called dtm for this task. The document term matrix has documents (in this case, emails) as its rows, terms (in this case word stems) as its columns, and frequencies as its values. As a result, the sum of all the elements in a row of the document term matrix is equal to the number of terms present in this document
# Obtain the word counts for each email with the command wordCount =
# rowSums(as.matrix(dtm)) # Gives an error because of size. Thus run the
# following.
library(slam)
wordCount = rollup(dtm, 2, FUN = sum)$v
# Histogram of wordCount
hist(wordCount)
# Histogram of log(wordCount)
hist(log(wordCount))
# Add the variable logWordCount to emailsSparse
emailsSparse$logWordCount <- log(wordCount)
# boxplot
boxplot(emailsSparse$logWordCount ~ emailsSparse$spam)
# train
train2 <- subset(emailsSparse, spl == T)
test2 <- subset(emailsSparse, spl == F)
# CART Model
spam2CART <- rpart(spam ~ ., data = train2, method = "class")
# test-set Accuracy of spam2CART
pred2CART <- predict(spam2CART, newdata = test2)[, 2]
t2CART <- table(test2$spam, pred2CART >= 0.5)
(t2CART[1, 1] + t2CART[2, 2])/sum(t2CART)
## [1] 0.9302
# Performance
predROCR2CART = prediction(pred2CART, test$spam)
performance(predROCR2CART, "auc")@y.values
## [[1]]
## [1] 0.9582
# Random Forest
set.seed(123)
spam2RF <- randomForest(spam ~ ., data = train2, method = "class")
# Accuracy of RF Model
pred2RF <- predict(spam2RF, type = "prob")[, 2]
t2RF <- table(train2$spam, pred2RF >= 0.5)
(t2RF[1, 1] + t2RF[2, 2])/(sum(t2RF))
## [1] 0.9805
# Performance of RF Model
predROCR2RF = prediction(pred2RF, train2$spam)
performance(predROCR2RF, "auc")@y.values
## [[1]]
## [1] 0.9973
Another source of information that might be extracted from text is the frequency of various n-grams. An n-gram is a sequence of n consecutive words in the document. For instance, for the document “Text analytics rocks!”, which we would preprocess to “text analyt rock”, the 1-grams are “text”, “analyt”, and “rock”, the 2-grams are “text analyt” and “analyt rock”, and the only 3-gram is “text analyt rock”. n-grams are order-specific, meaning the 2-grams “text analyt” and “analyt text” are considered two separate n-grams. We can see that so far our analysis has been extracting only 1-grams.
In this last subproblem, we will add 2-grams to our predictive model. Begin by installing and loading the RTextTools package.
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
##
## The following object is masked from 'package:base':
##
## backsolve
# Create a document term matrix containing all 2-grams in our dataset
dtm2gram = create_matrix(as.character(corpus), ngramLength = 2)
dtm2gram
## A document-term matrix (5728 documents, 304449 terms)
##
## Non-/sparse entries: 707213/1743176659
## Sparsity : 100%
## Maximal term length: 35
## Weighting : term frequency (tf)
# Filter out sparse terms by keeping only terms that appear in at least 5%
# or more of the documents
spdtm2Gram <- removeSparseTerms(dtm2gram, 0.95)
spdtm2Gram
## A document-term matrix (5728 documents, 35 terms)
##
## Non-/sparse entries: 25293/175187
## Sparsity : 87%
## Maximal term length: 16
## Weighting : term frequency (tf)
# Create a dataframe from spdtm2Gram
emailsSparse2gram <- as.data.frame(as.matrix(spdtm2Gram))
# Convert the column names of emailsSparse2gram to valid names using
# make.names()
colnames(emailsSparse2gram) <- make.names(colnames(emailsSparse2gram), unique = T)
# Combine the original emailsSparse with emailsSparse2gram into a final data
# frame
emailsCombined = cbind(emailsSparse, emailsSparse2gram)
# Split the above data frame into training and testing sets
trainCombined <- subset(emailsCombined, spl == T)
testCombined <- subset(emailsCombined, spl == F)
# Use trainCombined to train a CART tree with the default parameter
spamCARTCombined <- rpart(spam ~ ., data = trainCombined, method = "class")
# Tree
prp(spamCARTCombined, varlen = 0)
# Perform test-set predictions using the new CART. Accuracy
predCARTCombined <- predict(spamCARTCombined, newdata = testCombined)[, 2]
tCARTCombined <- table(testCombined$spam, predCARTCombined >= 0.5)
(tCARTCombined[1, 1] + tCARTCombined[2, 2])/sum(tCARTCombined)
## [1] 0.9354
# Performance of the CART Model
predROCRCombCART = prediction(predCARTCombined, testCombined$spam)
performance(predROCRCombCART, "auc")@y.values
## [[1]]
## [1] 0.9648
# Use trainCombined to train a random forest with the default parameters
set.seed(123)
spamRFCombined <- randomForest(spam ~ ., data = trainCombined, method = "class")
# Perform test-set predictions using the new RF model. Accuracy
predRFCombined <- predict(spamRFCombined, newdata = testCombined, type = "prob")[,
2]
tRFCombined <- table(testCombined$spam, predRFCombined >= 0.5)
(tRFCombined[1, 1] + tRFCombined[2, 2])/sum(tRFCombined)
## [1] 0.9796
# What is the test-set AUC of spamRFcombined? Performance of RF model
predROCRCombRF = prediction(predRFCombined, testCombined$spam)
performance(predROCRCombRF, "auc")@y.values
## [[1]]
## [1] 0.9977
For this problem, adding 2-grams did not dramatically improve our test-set performance. Adding n-grams is most effective in large datasets. Given the billions of emails sent each day, it's reasonable to expect that email providers would be able to construct datasets large enough for n-grams to provide useful predictive power.