Homework assignment for MITx “The Analytics Edge”

Nearly every email user has at some point encountered a “spam” email: an unsolicited message that advertises a product, contains links to malware, or attempts to scam the recipient. Roughly 80-90% of the more than 100 billion emails sent each day are spam, most of them sent from botnets of malware-infected computers. The remaining, legitimate messages are called “ham” emails.

As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.

In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

knitr::opts_chunk$set(echo = TRUE)
library(tm)           # text analytics
library(caTools)      # create training and test sets
library(rpart)        # Classification and Regression Trees
library(rpart.plot)   # plot CART trees
library(randomForest) # random forest
library(SnowballC)    # word stemming
library(ROCR)         # receiver operating characteristic curves

The dataset contains just two fields:

text: The text of the email.
spam: A binary variable indicating if the email was spam.
emails <- read.csv("emails.csv", stringsAsFactors = FALSE)
str(emails)
## 'data.frame':    5728 obs. of  2 variables:
##  $ text: chr  "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market"| __truncated__ "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__ "Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved"| __truncated__ "Subject: 4 color printing special  request additional information now ! click here  click here for a printable "| __truncated__ ...
##  $ spam: int  1 1 1 1 1 1 1 1 1 1 ...
cat("\nNumber of spam emails :", sum(emails$spam))
## 
## Number of spam emails : 1368
cat("\nNumber of characters in the longest emial in the dataset :", max(nchar(emails$text)))
## 
## Number of characters in the longest emial in the dataset : 43952
cat("\nRow of the shortest email in the dataset :", which.min(nchar(emails$text)))
## 
## Row of the shortest email in the dataset : 1992

pre-process emails

corpus <- VCorpus(VectorSource(emails$text))

corpus <- tm_map(corpus, content_transformer(tolower))      # lower-case
corpus <- tm_map(corpus, removePunctuation)                 # remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) # remove stop words
corpus <- tm_map(corpus, stemDocument)                      # reduce words to 'stems'
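
As a quick sanity check (optional, not part of the assignment), we can inspect the first email after these transformations:

as.character(corpus[[1]]) # content of the first email: lower-cased, punctuation and stop words removed, stemmed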

bag of words

dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)
# remove words which don't appear in at least 5% of documents
spdtm <- removeSparseTerms(DocumentTermMatrix(corpus), 0.95)
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity           : 89%
## Maximal term length: 10
## Weighting          : term frequency (tf)
emailsSparse <- as.data.frame(as.matrix(spdtm))
colnames(emailsSparse) <- make.names(colnames(emailsSparse))
# some words start with a number, but R struggles with variable names that start with a number
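
For example (an optional illustration), make.names() prepends an "X" to any name that begins with a digit, which is why stems such as "000" and "2000" appear later as X000 and X2000:

make.names(c("000", "2000")) # returns "X000" "X2000"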

cat("Most common stem in the emails :", colnames(spdtm)[which.max(colSums(emailsSparse))])
## Most common stem in the emails : enron
emailsSparse$spam <- emails$spam # add variable 'spam' containing the email spam labels

cat("Number of word stems which appear at least 5000 times\nin ham emails : ",
    sum(colSums(emailsSparse[emailsSparse$spam == 0,!names(emailsSparse) %in% "spam"])>=5000))
## Number of word stems which appear at least 5000 times
## in ham emails :  6
cat("\nMost common stems in ham emails\n")
## 
## Most common stems in ham emails
sort(colSums(subset(emailsSparse, spam == 0)), decreasing = TRUE)[1:10]
##    enron      ect  subject     vinc     will      hou    X2000 kaminski 
##    13388    11417     8625     8531     6802     5569     4935     4801 
##    pleas      com 
##     4494     4444
cat("\nNumber of word stems which appear at least 1000 times\nin spam emails : ",
    sum(colSums(emailsSparse[emailsSparse$spam == 1,!names(emailsSparse) %in% "spam"])>=1000))
## 
## Number of word stems which appear at least 1000 times
## in spam emails :  3
cat("\nMost common stems in spam emails\n")
## 
## Most common stems in spam emails
sort(colSums(emailsSparse[emailsSparse$spam == 1, !names(emailsSparse) %in% "spam"]), decreasing = TRUE)[1:10]
## subject    will compani     com    mail    busi   email     can  inform 
##    1577    1450    1065     999     917     897     865     831     818 
##  receiv 
##     727

Several of the most common word stems from the ham documents, such as “enron”, “hou” (short for Houston), “vinc” (the word stem of “Vince”) and “kaminski”, are likely specific to Vincent Kaminski’s inbox.

The ham dataset is personalized to Vincent Kaminski, and therefore it might not generalize well to a general email user. Caution is definitely necessary before applying the filters derived in this problem to other email users.

building machine learning models

create training and test sets

emailsSparse$spam <- as.factor(emailsSparse$spam)

set.seed(123) # seed for random number generator

split <- sample.split(emailsSparse$spam, SplitRatio = 0.7)
train <- subset(emailsSparse, split == TRUE)
test <- subset(emailsSparse, split == FALSE)
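
sample.split() balances the outcome variable across the two sets, so both should contain roughly 24% spam. An optional check (output omitted):

prop.table(table(train$spam)) # proportion of ham/spam in the training set
prop.table(table(test$spam))  # proportion of ham/spam in the test set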

logistic regression model

spamlog <- glm(spam ~ ., data = train, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(spamlog)
## 
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.011   0.000   0.000   0.000   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.082e+01  1.055e+04  -0.003    0.998
## X000         1.474e+01  1.058e+04   0.001    0.999
## X2000       -3.631e+01  1.556e+04  -0.002    0.998
## X2001       -3.215e+01  1.318e+04  -0.002    0.998
## X713        -2.427e+01  2.914e+04  -0.001    0.999
## X853        -1.212e+00  5.942e+04   0.000    1.000
## abl         -2.049e+00  2.088e+04   0.000    1.000
## access      -1.480e+01  1.335e+04  -0.001    0.999
## account      2.488e+01  8.165e+03   0.003    0.998
## addit        1.463e+00  2.703e+04   0.000    1.000
## address     -4.613e+00  1.113e+04   0.000    1.000
## allow        1.899e+01  6.436e+03   0.003    0.998
## alreadi     -2.407e+01  3.319e+04  -0.001    0.999
## also         2.990e+01  1.378e+04   0.002    0.998
## analysi     -2.405e+01  3.860e+04  -0.001    1.000
## anoth       -8.744e+00  2.032e+04   0.000    1.000
## applic      -2.649e+00  1.674e+04   0.000    1.000
## appreci     -2.145e+01  2.762e+04  -0.001    0.999
## approv      -1.302e+00  1.589e+04   0.000    1.000
## april       -2.620e+01  2.208e+04  -0.001    0.999
## area         2.041e+01  2.266e+04   0.001    0.999
## arrang       1.069e+01  2.135e+04   0.001    1.000
## ask         -7.746e+00  1.976e+04   0.000    1.000
## assist      -1.128e+01  2.490e+04   0.000    1.000
## associ       9.049e+00  1.909e+04   0.000    1.000
## attach      -1.037e+01  1.534e+04  -0.001    0.999
## attend      -3.451e+01  3.257e+04  -0.001    0.999
## avail        8.651e+00  1.709e+04   0.001    1.000
## back        -1.323e+01  2.272e+04  -0.001    1.000
## base        -1.354e+01  2.122e+04  -0.001    0.999
## begin        2.228e+01  2.973e+04   0.001    0.999
## believ       3.233e+01  2.136e+04   0.002    0.999
## best        -8.201e+00  1.333e+03  -0.006    0.995
## better       4.263e+01  2.360e+04   0.002    0.999
## book         4.301e+00  2.024e+04   0.000    1.000
## bring        1.607e+01  6.767e+04   0.000    1.000
## busi        -4.803e+00  1.000e+04   0.000    1.000
## buy          4.170e+01  3.892e+04   0.001    0.999
## call        -1.145e+00  1.111e+04   0.000    1.000
## can          3.762e+00  7.674e+03   0.000    1.000
## case        -3.372e+01  2.880e+04  -0.001    0.999
## chang       -2.717e+01  2.215e+04  -0.001    0.999
## check        1.425e+00  1.963e+04   0.000    1.000
## click        1.376e+01  7.077e+03   0.002    0.998
## com          1.936e+00  4.039e+03   0.000    1.000
## come        -1.166e+00  1.511e+04   0.000    1.000
## comment     -3.251e+00  3.387e+04   0.000    1.000
## communic     1.580e+01  8.958e+03   0.002    0.999
## compani      4.781e+00  9.186e+03   0.001    1.000
## complet     -1.363e+01  2.024e+04  -0.001    0.999
## confer      -7.503e-01  8.557e+03   0.000    1.000
## confirm     -1.300e+01  1.514e+04  -0.001    0.999
## contact      1.530e+00  1.262e+04   0.000    1.000
## continu      1.487e+01  1.535e+04   0.001    0.999
## contract    -1.295e+01  1.498e+04  -0.001    0.999
## copi        -4.274e+01  3.070e+04  -0.001    0.999
## corp         1.606e+01  2.708e+04   0.001    1.000
## corpor      -8.286e-01  2.818e+04   0.000    1.000
## cost        -1.938e+00  1.833e+04   0.000    1.000
## cours        1.665e+01  1.834e+04   0.001    0.999
## creat        1.338e+01  3.946e+04   0.000    1.000
## credit       2.617e+01  1.314e+04   0.002    0.998
## crenshaw     9.994e+01  6.769e+04   0.001    0.999
## current      3.629e+00  1.707e+04   0.000    1.000
## custom       1.829e+01  1.008e+04   0.002    0.999
## data        -2.609e+01  2.271e+04  -0.001    0.999
## date        -2.786e+00  1.699e+04   0.000    1.000
## day         -6.100e+00  5.866e+03  -0.001    0.999
## deal        -1.129e+01  1.448e+04  -0.001    0.999
## dear        -2.313e+00  2.306e+04   0.000    1.000
## depart      -4.068e+01  2.509e+04  -0.002    0.999
## deriv       -4.971e+01  3.587e+04  -0.001    0.999
## design      -7.923e+00  2.939e+04   0.000    1.000
## detail       1.197e+01  2.301e+04   0.001    1.000
## develop      5.976e+00  9.455e+03   0.001    0.999
## differ      -2.293e+00  1.075e+04   0.000    1.000
## direct      -2.051e+01  3.194e+04  -0.001    0.999
## director    -1.770e+01  1.793e+04  -0.001    0.999
## discuss     -1.051e+01  1.915e+04  -0.001    1.000
## doc         -2.597e+01  2.603e+04  -0.001    0.999
## don          2.129e+01  1.456e+04   0.001    0.999
## done         6.828e+00  1.882e+04   0.000    1.000
## due         -4.163e+00  3.532e+04   0.000    1.000
## ect          8.685e-01  5.342e+03   0.000    1.000
## edu         -2.122e-01  6.917e+02   0.000    1.000
## effect       1.948e+01  2.100e+04   0.001    0.999
## effort       1.606e+01  5.670e+04   0.000    1.000
## either      -2.744e+01  4.000e+04  -0.001    0.999
## email        3.833e+00  1.186e+04   0.000    1.000
## end         -1.311e+01  2.938e+04   0.000    1.000
## energi      -1.620e+01  1.646e+04  -0.001    0.999
## engin        2.664e+01  2.394e+04   0.001    0.999
## enron       -8.789e+00  5.719e+03  -0.002    0.999
## etc          9.470e-01  1.569e+04   0.000    1.000
## even        -1.654e+01  2.289e+04  -0.001    0.999
## event        1.694e+01  1.851e+04   0.001    0.999
## expect      -1.179e+01  1.914e+04  -0.001    1.000
## experi       2.460e+00  2.240e+04   0.000    1.000
## fax          3.537e+00  3.386e+04   0.000    1.000
## feel         2.596e+00  2.348e+04   0.000    1.000
## file        -2.943e+01  2.165e+04  -0.001    0.999
## final        8.075e+00  5.008e+04   0.000    1.000
## financ      -9.122e+00  7.524e+03  -0.001    0.999
## financi     -9.747e+00  1.727e+04  -0.001    1.000
## find        -2.623e+00  9.727e+03   0.000    1.000
## first       -4.666e-01  2.043e+04   0.000    1.000
## follow       1.766e+01  3.080e+03   0.006    0.995
## form         8.483e+00  1.674e+04   0.001    1.000
## forward     -3.484e+00  1.864e+04   0.000    1.000
## free         6.113e+00  8.121e+03   0.001    0.999
## friday      -1.146e+01  1.996e+04  -0.001    1.000
## full         2.125e+01  2.190e+04   0.001    0.999
## futur        4.146e+01  1.439e+04   0.003    0.998
## gas         -3.901e+00  4.160e+03  -0.001    0.999
## get          5.154e+00  9.737e+03   0.001    1.000
## gibner       2.901e+01  2.460e+04   0.001    0.999
## give        -2.518e+01  2.130e+04  -0.001    0.999
## given       -2.186e+01  5.426e+04   0.000    1.000
## good         5.399e+00  1.619e+04   0.000    1.000
## great        1.222e+01  1.090e+04   0.001    0.999
## group        5.264e-01  1.037e+04   0.000    1.000
## happi        1.939e-02  1.202e+04   0.000    1.000
## hear         2.887e+01  2.281e+04   0.001    0.999
## hello        2.166e+01  1.361e+04   0.002    0.999
## help         1.731e+01  2.791e+03   0.006    0.995
## high        -1.982e+00  2.554e+04   0.000    1.000
## home         5.973e+00  8.965e+03   0.001    0.999
## hope        -1.435e+01  2.179e+04  -0.001    0.999
## hou          6.852e+00  6.437e+03   0.001    0.999
## hour         2.478e+00  1.333e+04   0.000    1.000
## houston     -1.855e+01  7.305e+03  -0.003    0.998
## howev       -3.449e+01  3.562e+04  -0.001    0.999
## http         2.528e+01  2.107e+04   0.001    0.999
## idea        -1.845e+01  3.892e+04   0.000    1.000
## immedi       6.285e+01  3.346e+04   0.002    0.999
## import      -1.859e+00  2.236e+04   0.000    1.000
## includ      -3.454e+00  1.799e+04   0.000    1.000
## increas      6.476e+00  2.329e+04   0.000    1.000
## industri    -3.160e+01  2.373e+04  -0.001    0.999
## info        -1.255e+00  4.857e+03   0.000    1.000
## inform       2.078e+01  8.549e+03   0.002    0.998
## interest     2.698e+01  1.159e+04   0.002    0.998
## intern      -7.991e+00  3.351e+04   0.000    1.000
## internet     8.749e+00  1.100e+04   0.001    0.999
## interview   -1.640e+01  1.873e+04  -0.001    0.999
## invest       3.201e+01  2.393e+04   0.001    0.999
## invit        4.304e+00  2.215e+04   0.000    1.000
## involv       3.815e+01  3.315e+04   0.001    0.999
## issu        -3.708e+01  3.396e+04  -0.001    0.999
## john        -5.326e-01  2.856e+04   0.000    1.000
## join        -3.824e+01  2.334e+04  -0.002    0.999
## juli        -1.358e+01  3.009e+04   0.000    1.000
## just        -1.021e+01  1.114e+04  -0.001    0.999
## kaminski    -1.812e+01  6.029e+03  -0.003    0.998
## keep         1.867e+01  2.782e+04   0.001    0.999
## kevin       -3.779e+01  4.738e+04  -0.001    0.999
## know         1.277e+01  1.526e+04   0.001    0.999
## last         1.046e+00  1.372e+04   0.000    1.000
## let         -2.763e+01  1.462e+04  -0.002    0.998
## life         5.812e+01  3.864e+04   0.002    0.999
## like         5.649e+00  7.660e+03   0.001    0.999
## line         8.743e+00  1.236e+04   0.001    0.999
## link        -6.929e+00  1.345e+04  -0.001    1.000
## list        -8.692e+00  2.149e+03  -0.004    0.997
## locat        2.073e+01  1.597e+04   0.001    0.999
## london       6.745e+00  1.642e+04   0.000    1.000
## long        -1.489e+01  1.934e+04  -0.001    0.999
## look        -7.031e+00  1.563e+04   0.000    1.000
## lot         -1.964e+01  1.321e+04  -0.001    0.999
## made         2.820e+00  2.743e+04   0.000    1.000
## mail         7.584e+00  1.021e+04   0.001    0.999
## make         2.901e+01  1.528e+04   0.002    0.998
## manag        6.014e+00  1.445e+04   0.000    1.000
## mani         1.885e+01  1.442e+04   0.001    0.999
## mark        -3.350e+01  3.208e+04  -0.001    0.999
## market       7.895e+00  8.012e+03   0.001    0.999
## may         -9.434e+00  1.397e+04  -0.001    0.999
## mean         6.078e-01  2.952e+04   0.000    1.000
## meet        -1.063e+00  1.263e+04   0.000    1.000
## member       1.381e+01  2.343e+04   0.001    1.000
## mention     -2.279e+01  2.714e+04  -0.001    0.999
## messag       1.716e+01  2.562e+03   0.007    0.995
## might        1.244e+01  1.753e+04   0.001    0.999
## model       -2.292e+01  1.049e+04  -0.002    0.998
## monday      -1.034e+00  3.233e+04   0.000    1.000
## money        3.264e+01  1.321e+04   0.002    0.998
## month       -3.727e+00  1.112e+04   0.000    1.000
## morn        -2.645e+01  3.403e+04  -0.001    0.999
## move        -3.834e+01  3.011e+04  -0.001    0.999
## much         3.775e-01  1.392e+04   0.000    1.000
## name         1.672e+01  1.322e+04   0.001    0.999
## need         8.437e-01  1.221e+04   0.000    1.000
## net          1.256e+01  2.197e+04   0.001    1.000
## new          1.003e+00  1.009e+04   0.000    1.000
## next.        1.492e+01  1.724e+04   0.001    0.999
## note         1.446e+01  2.294e+04   0.001    0.999
## now          3.790e+01  1.219e+04   0.003    0.998
## number      -9.622e+00  1.591e+04  -0.001    1.000
## offer        1.174e+01  1.084e+04   0.001    0.999
## offic       -1.344e+01  2.311e+04  -0.001    1.000
## one          1.241e+01  6.652e+03   0.002    0.999
## onlin        3.589e+01  1.665e+04   0.002    0.998
## open         2.114e+01  2.961e+04   0.001    0.999
## oper        -1.696e+01  2.757e+04  -0.001    1.000
## opportun    -4.131e+00  1.918e+04   0.000    1.000
## option      -1.085e+00  9.325e+03   0.000    1.000
## order        6.533e+00  1.242e+04   0.001    1.000
## origin       3.226e+01  3.818e+04   0.001    0.999
## part         4.594e+00  3.483e+04   0.000    1.000
## particip    -1.154e+01  1.738e+04  -0.001    0.999
## peopl       -1.864e+01  1.439e+04  -0.001    0.999
## per          1.367e+01  1.273e+04   0.001    0.999
## person       1.870e+01  9.575e+03   0.002    0.998
## phone       -6.957e+00  1.172e+04  -0.001    1.000
## place        9.005e+00  3.661e+04   0.000    1.000
## plan        -1.830e+01  6.320e+03  -0.003    0.998
## pleas       -7.961e+00  9.484e+03  -0.001    0.999
## point        5.498e+00  3.403e+04   0.000    1.000
## posit       -1.543e+01  2.316e+04  -0.001    0.999
## possibl     -1.366e+01  2.492e+04  -0.001    1.000
## power       -5.643e+00  1.173e+04   0.000    1.000
## present     -6.163e+00  1.278e+04   0.000    1.000
## price        3.428e+00  7.850e+03   0.000    1.000
## problem      1.262e+01  9.763e+03   0.001    0.999
## process     -2.957e-01  1.191e+04   0.000    1.000
## product      1.016e+01  1.345e+04   0.001    0.999
## program      1.444e+00  1.183e+04   0.000    1.000
## project      2.173e+00  1.497e+04   0.000    1.000
## provid       2.422e-01  1.859e+04   0.000    1.000
## public      -5.250e+01  2.341e+04  -0.002    0.998
## put         -1.052e+01  2.681e+04   0.000    1.000
## question    -3.467e+01  1.859e+04  -0.002    0.999
## rate        -3.112e+00  1.319e+04   0.000    1.000
## read        -1.527e+01  2.145e+04  -0.001    0.999
## real         2.046e+01  2.358e+04   0.001    0.999
## realli      -2.667e+01  4.640e+04  -0.001    1.000
## receiv       5.765e-01  1.585e+04   0.000    1.000
## recent      -2.067e+00  1.780e+04   0.000    1.000
## regard      -3.668e+00  1.511e+04   0.000    1.000
## relat       -5.114e+01  1.793e+04  -0.003    0.998
## remov        2.325e+01  2.484e+04   0.001    0.999
## repli        1.538e+01  2.916e+04   0.001    1.000
## report      -1.482e+01  1.477e+04  -0.001    0.999
## request     -1.232e+01  1.167e+04  -0.001    0.999
## requir       5.004e-01  2.937e+04   0.000    1.000
## research    -2.826e+01  1.553e+04  -0.002    0.999
## resourc     -2.735e+01  3.522e+04  -0.001    0.999
## respond      2.974e+01  3.888e+04   0.001    0.999
## respons     -1.960e+01  3.667e+04  -0.001    1.000
## result      -5.002e-01  3.140e+04   0.000    1.000
## resum       -9.219e+00  2.100e+04   0.000    1.000
## return       1.745e+01  1.844e+04   0.001    0.999
## review      -4.825e+00  1.013e+04   0.000    1.000
## right        2.312e+01  1.590e+04   0.001    0.999
## risk        -4.001e+00  1.718e+04   0.000    1.000
## robert      -2.096e+01  2.907e+04  -0.001    0.999
## run         -5.162e+01  4.434e+04  -0.001    0.999
## say          7.366e+00  2.217e+04   0.000    1.000
## schedul      1.919e+00  3.580e+04   0.000    1.000
## school      -3.870e+00  2.882e+04   0.000    1.000
## secur       -1.604e+01  2.201e+03  -0.007    0.994
## see         -1.120e+01  1.293e+04  -0.001    0.999
## send        -2.427e+01  1.222e+04  -0.002    0.998
## sent        -1.488e+01  2.195e+04  -0.001    0.999
## servic      -7.164e+00  1.235e+04  -0.001    1.000
## set         -9.353e+00  2.627e+04   0.000    1.000
## sever        2.041e+01  3.093e+04   0.001    0.999
## shall        1.930e+01  3.075e+04   0.001    0.999
## shirley     -7.133e+01  6.329e+04  -0.001    0.999
## short       -8.974e+00  1.721e+04  -0.001    1.000
## sinc        -3.438e+00  3.546e+04   0.000    1.000
## sincer      -2.073e+01  3.515e+04  -0.001    1.000
## site         8.689e+00  1.496e+04   0.001    1.000
## softwar      2.575e+01  1.059e+04   0.002    0.998
## soon         2.350e+01  3.731e+04   0.001    0.999
## sorri        6.036e+00  2.299e+04   0.000    1.000
## special      1.777e+01  2.755e+04   0.001    0.999
## specif      -2.337e+01  3.083e+04  -0.001    0.999
## start        1.437e+01  1.897e+04   0.001    0.999
## state        1.221e+01  1.677e+04   0.001    0.999
## still        3.878e+00  2.622e+04   0.000    1.000
## stinson     -4.345e+01  2.697e+04  -0.002    0.999
## student     -1.815e+01  2.186e+04  -0.001    0.999
## subject      3.041e+01  1.055e+04   0.003    0.998
## success      4.344e+00  2.783e+04   0.000    1.000
## suggest     -3.842e+01  4.475e+04  -0.001    0.999
## support     -1.539e+01  1.976e+04  -0.001    0.999
## sure        -5.503e+00  2.078e+04   0.000    1.000
## system       3.778e+00  9.149e+03   0.000    1.000
## take         5.731e+00  1.716e+04   0.000    1.000
## talk        -1.011e+01  2.021e+04  -0.001    1.000
## team         7.940e+00  2.570e+04   0.000    1.000
## term         2.013e+01  2.303e+04   0.001    0.999
## thank       -3.890e+01  1.059e+04  -0.004    0.997
## thing        2.579e+01  1.341e+04   0.002    0.998
## think       -1.218e+01  2.077e+04  -0.001    1.000
## thought      1.243e+01  3.023e+04   0.000    1.000
## thursday    -1.491e+01  3.262e+04   0.000    1.000
## time        -5.921e+00  8.335e+03  -0.001    0.999
## today       -1.762e+01  1.965e+04  -0.001    0.999
## togeth      -2.355e+01  1.869e+04  -0.001    0.999
## trade       -1.755e+01  1.483e+04  -0.001    0.999
## tri          9.278e-01  1.282e+04   0.000    1.000
## tuesday     -2.808e+01  3.959e+04  -0.001    0.999
## two         -2.573e+01  1.844e+04  -0.001    0.999
## type        -1.447e+01  2.755e+04  -0.001    1.000
## understand   9.307e+00  2.342e+04   0.000    1.000
## unit        -4.020e+00  3.008e+04   0.000    1.000
## univers      1.228e+01  2.197e+04   0.001    1.000
## updat       -1.510e+01  1.448e+04  -0.001    0.999
## use         -1.385e+01  9.382e+03  -0.001    0.999
## valu         9.024e-01  1.360e+04   0.000    1.000
## version     -3.606e+01  2.939e+04  -0.001    0.999
## vinc        -3.735e+01  8.647e+03  -0.004    0.997
## visit        2.585e+01  1.170e+04   0.002    0.998
## vkamin      -6.649e+01  5.703e+04  -0.001    0.999
## want        -2.555e+00  1.106e+04   0.000    1.000
## way          1.339e+01  1.138e+04   0.001    0.999
## web          2.791e+00  1.686e+04   0.000    1.000
## websit      -2.563e+01  1.848e+04  -0.001    0.999
## wednesday   -1.526e+01  2.642e+04  -0.001    1.000
## week        -6.795e+00  1.046e+04  -0.001    0.999
## well        -2.222e+01  9.713e+03  -0.002    0.998
## will        -1.119e+01  5.980e+03  -0.002    0.999
## wish         1.173e+01  3.175e+04   0.000    1.000
## within       2.900e+01  2.163e+04   0.001    0.999
## without      1.942e+01  1.763e+04   0.001    0.999
## work        -1.099e+01  1.160e+04  -0.001    0.999
## write        4.406e+01  2.825e+04   0.002    0.999
## www         -7.867e+00  2.224e+04   0.000    1.000
## year        -1.010e+01  1.039e+04  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4409.49  on 4009  degrees of freedom
## Residual deviance:   13.46  on 3679  degrees of freedom
## AIC: 675.46
## 
## Number of Fisher Scoring iterations: 25

The logistic regression model yielded the messages “algorithm did not converge” and “fitted probabilities numerically 0 or 1 occurred”. Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model.

CART model

spamCART <- rpart(spam ~ ., data = train, method = "class")
prp(spamCART)

We see that “vinc” and “enron” (specific to the user ‘Vincent Kaminski’) appear in the CART tree as the top two branches, but that “hou” and “kaminski” (also specific to that user) do not appear.

random forest model

set.seed(123)

spamRF <- randomForest(spam ~ . , data = train)
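
As an optional side check (not required by the assignment), randomForest's varImpPlot() shows which stems the forest relies on most when splitting:

varImpPlot(spamRF, n.var = 15, main = "Top 15 stems by importance")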

calculate predicted probabilities for the three models on the training set

predlog <- predict(spamlog, type = "response")
predCART <- predict(spamCART)[,2]
predRF <- predict(spamRF, type = "prob")[,2]
cat("\nTraining set predicted probabilities from spamLog < 0.00001 :",
    sum(predlog<0.00001))
## 
## Training set predicted probabilities from spamLog < 0.00001 : 3046
cat("\nTraining set predicted probabilities from spamLog > 0.99999 :",
    sum(predlog>0.99999))
## 
## Training set predicted probabilities from spamLog > 0.99999 : 954
cat("\nTraining set predicted probabilities from spamLog between 0.00001 and 0.99999 :",
    sum(predlog>=0.00001 & predlog<=0.99999))
## 
## Training set predicted probabilities from spamLog between 0.00001 and 0.99999 : 10

These predicted probabilities echo the warnings produced when the model was fit: almost every training-set prediction is either below 0.00001 or above 0.99999, meaning the logistic regression model has essentially separated the training data perfectly, a sign of severe overfitting.

p-values of logistic regression model

Show some of the coefficients in the logistic regression model.

head(coef(summary(spamlog)),10)
##               Estimate Std. Error       z value  Pr(>|z|)
## (Intercept) -30.816712  10548.743 -2.921363e-03 0.9976691
## X000         14.738355  10583.796  1.392540e-03 0.9988889
## X2000       -36.306532  15559.782 -2.333357e-03 0.9981383
## X2001       -32.147703  13177.688 -2.439556e-03 0.9980535
## X713        -24.273005  29138.380 -8.330252e-04 0.9993353
## X853         -1.212310  59416.827 -2.040349e-05 0.9999837
## abl          -2.048513  20883.267 -9.809349e-05 0.9999217
## access      -14.797240  13353.490 -1.108118e-03 0.9991159
## account      24.881195   8164.788  3.047378e-03 0.9975685
## addit         1.463489  27027.117  5.414892e-05 0.9999568
cat("\nNumber of variables labeled as significant (at p=0.05 level)\nin logistic regression model :",
    sum(coef(summary(spamlog))[,4] < 0.05))
## 
## Number of variables labeled as significant (at p=0.05 level)
## in logistic regression model : 0

We see that none of the variables are labeled as significant (a symptom of the logistic regression algorithm not converging).

model accuracy on the training set
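
The confusion-matrix calculations below are repeated for each model. A small helper along the following lines (a sketch; the hypothetical classStats() is not used in the chunks below) could compute all three summary statistics in one call:

classStats <- function(actual, predProb, threshold = 0.5) {
  predicted <- predProb >= threshold
  actual    <- as.character(actual) == "1"
  c(accuracy    = mean(predicted == actual),
    sensitivity = sum(predicted & actual) / sum(actual),     # true-positive rate
    specificity = sum(!predicted & !actual) / sum(!actual))  # true-negative rate
}
# e.g. classStats(train$spam, predlog)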

logistic regression model, using a threshold of 0.5

logconfusion <- table(train$spam, predlog >= 0.5)
logconfusion
##    
##     FALSE TRUE
##   0  3052    0
##   1     4  954
cat("\nAccuracy : ", sum(diag(logconfusion))/nrow(train))
## 
## Accuracy :  0.9990025
cat("\nSensitivity : ", logconfusion["1","TRUE"]/sum(logconfusion["1",]))
## 
## Sensitivity :  0.9958246
cat("\nSpecificity : ", logconfusion["0","FALSE"]/sum(logconfusion["0",]))
## 
## Specificity :  1
predictROCR <- prediction(predlog, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9999959

CART (classification and regression tree) accuracy on training set

Using a threshold of 0.5

CARTconfusion <- table(train$spam, predCART >= 0.5)
CARTconfusion
##    
##     FALSE TRUE
##   0  2885  167
##   1    64  894
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(train))
## 
## Accuracy :  0.942394
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
## 
## Sensitivity :  0.9331942
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
## 
## Specificity :  0.9452818
predictROCR <- prediction(predCART, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9696044

random forest accuracy on training set

RFconfusion <- table(train$spam, predRF >= 0.5)
RFconfusion
##    
##     FALSE TRUE
##   0  3014   38
##   1    41  917
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(train))
## 
## Accuracy :  0.9802993
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
## 
## Sensitivity :  0.9572025
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
## 
## Specificity :  0.9875491
predictROCR <- prediction(predRF, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9978155

In terms of both accuracy and AUC, logistic regression is nearly perfect when used on the training set and outperforms the other two models.

model accuracy on the test set

Calculate predicted probabilities for the three models on the test set

predlog <- predict(spamlog, newdata = test, type = "response")
predCART <- predict(spamCART, newdata = test)[,2]
predRF <- predict(spamRF, newdata = test, type = "prob")[,2]

logistic regression model, using a threshold of 0.5

logconfusion <- table(test$spam, predlog >= 0.5)
logconfusion
##    
##     FALSE TRUE
##   0  1257   51
##   1    34  376
cat("\nAccuracy : ", sum(diag(logconfusion))/nrow(test))
## 
## Accuracy :  0.9505239
cat("\nSensitivity : ", logconfusion["1","TRUE"]/sum(logconfusion["1",]))
## 
## Sensitivity :  0.9170732
cat("\nSpecificity : ", logconfusion["0","FALSE"]/sum(logconfusion["0",]))
## 
## Specificity :  0.9610092
predictROCR <- prediction(predlog, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9627517

CART (classification and regression tree) accuracy on test set

Using a threshold of 0.5

CARTconfusion <- table(test$spam, predCART >= 0.5)
CARTconfusion
##    
##     FALSE TRUE
##   0  1228   80
##   1    24  386
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(test))
## 
## Accuracy :  0.9394645
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
## 
## Sensitivity :  0.9414634
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
## 
## Specificity :  0.9388379
predictROCR <- prediction(predCART, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.963176

random forest accuracy on test set

RFconfusion <- table(test$spam, predRF >= 0.5)
RFconfusion
##    
##     FALSE TRUE
##   0  1290   18
##   1    23  387
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(test))
## 
## Accuracy :  0.976135
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
## 
## Sensitivity :  0.9439024
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
## 
## Specificity :  0.9862385
predictROCR <- prediction(predRF, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9975899

Both CART and random forest had very similar accuracies on the training and testing sets. However, logistic regression obtained nearly perfect accuracy and AUC on the training set and had far-from-perfect performance on the testing set. This is an indicator of overfitting.

The random forest outperformed logistic regression and CART in both measures on the test set, obtaining an impressive AUC of 0.997 on the test set.

Integrating word count information

While we have thus far mostly dealt with frequencies of specific words in our analysis, we can extract other information from text. This final section deals with one such source of information.

We will use the number of words in each email as an independent variable. We can use the original document term matrix called dtm for this task. The document term matrix has documents (in this case, emails) as its rows, terms (in this case, word stems) as its columns, and frequencies as its values. As a result, the sum of all the elements in a row of the document term matrix equals the number of terms present in the corresponding document.

wordCount <- rowSums(as.matrix(dtm))
hist(wordCount)
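
Note that as.matrix(dtm) expands the full 5728 x 28687 matrix in memory. If that is too memory-hungry, the sparse row sums from the slam package (which tm uses internally for its sparse matrices) give the same word counts:

wordCount <- slam::row_sums(dtm) # same values, computed directly on the sparse matrix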

Nearly all of the observations lie at the far left of the graph, representing small values; the distribution of wordCount is therefore skewed right.

hist(log(wordCount))

In the histogram of log(wordCount), the frequencies are much more balanced, suggesting that log(wordCount) is not heavily skewed.

Plot logWordCount against whether a message is spam.

emailsSparse$logWordCount <- log(wordCount)
boxplot(logWordCount ~ spam, data = emailsSparse)
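
To complement the boxplot, the group medians can be compared directly (optional; output not reproduced here):

tapply(emailsSparse$logWordCount, emailsSparse$spam, median) # median log word count for ham (0) and spam (1)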

Create new training and test sets with Word Count information

train2 <- subset(emailsSparse, split == TRUE)
test2 <- subset(emailsSparse, split == FALSE)

Create new CART model with word count information

spam2CART <- rpart(spam ~ ., data = train2)
prp(spam2CART)

Create new random forest model with word count information

spam2RF <- randomForest(spam ~ ., data = train2)
summary(spam2RF)
##                 Length Class  Mode     
## call               3   -none- call     
## type               1   -none- character
## predicted       4010   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           8020   matrix numeric  
## oob.times       4010   -none- numeric  
## classes            2   -none- character
## importance       331   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               4010   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
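
summary() on a randomForest object only lists its internal components; printing the model itself (optional) reports the out-of-bag error estimate and confusion matrix:

print(spam2RF) # out-of-bag error rate and confusion matrix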

predictions on test set

predCART <- predict(spam2CART, newdata = test2)[,2]
predRF <- predict(spam2RF, newdata = test2, type = "prob")[,2]

CART (classification and regression tree) accuracy on test set

Using a threshold of 0.5

CARTconfusion <- table(test2$spam, predCART >= 0.5)
CARTconfusion
##    
##     FALSE TRUE
##   0  1214   94
##   1    26  384
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(test2))
## 
## Accuracy :  0.9301513
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
## 
## Sensitivity :  0.9365854
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
## 
## Specificity :  0.9281346
predictROCR <- prediction(predCART, test2$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9582438

random forest accuracy on test set using word count

RFconfusion <- table(test2$spam, predRF >= 0.5)
RFconfusion
##    
##     FALSE TRUE
##   0  1298   10
##   1    28  382
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(test2))
## 
## Accuracy :  0.9778813
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
## 
## Sensitivity :  0.9317073
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
## 
## Specificity :  0.9923547
predictROCR <- prediction(predRF, test2$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9979265

In this case, adding the logWordCount variable did not improve the results on the test set for either the CART model or the random forest model.