Homework assignment for MITx “The Analytics Edge”

Nearly every email user has at some point encountered a “spam” email: an unsolicited message that advertises a product, contains links to malware, or attempts to scam the recipient. Roughly 80-90% of the more than 100 billion emails sent each day are spam, most of them sent from botnets of malware-infected computers. The remaining, legitimate messages are called “ham” emails.

As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.

In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.

knitr::opts_chunk$set(echo = TRUE)
library(tm)           # text analytics
library(caTools)      # create training and test sets
library(rpart)        # Classification and Regression Trees
library(rpart.plot)   # plot CART trees
library(randomForest) # random forest
library(SnowballC)    # word stemming
library(ROCR)         # receiver operating characteristic curves

The dataset contains just two fields:

text: The text of the email.
spam: A binary variable indicating if the email was spam.
emails <- read.csv("emails.csv", stringsAsFactors = FALSE)
str(emails)
## 'data.frame':    5728 obs. of  2 variables:
##  $ text: chr  "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market"| __truncated__ "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__ "Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved"| __truncated__ "Subject: 4 color printing special  request additional information now ! click here  click here for a printable "| __truncated__ ...
##  $ spam: int  1 1 1 1 1 1 1 1 1 1 ...
cat("\nNumber of spam emails :", sum(emails$spam))
## 
## Number of spam emails : 1368
cat("\nNumber of characters in the longest emial in the dataset :", max(nchar(emails$text)))
## 
## Number of characters in the longest emial in the dataset : 43952
cat("\nRow of the shortest email in the dataset :", which.min(nchar(emails$text)))
## 
## Row of the shortest email in the dataset : 1992

pre-process emails

corpus <- VCorpus(VectorSource(emails$text))

corpus <- tm_map(corpus, content_transformer(tolower))      # lower-case
corpus <- tm_map(corpus, removePunctuation)                 # remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) # remove stop words
corpus <- tm_map(corpus, stemDocument)                      # reduce words to 'stems'
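
As a quick sanity check (optional, not part of the assignment), we can inspect the first email after these transformations:

as.character(corpus[[1]]) # content of the first email: lower-cased, punctuation and stop words removed, stemmed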

bag of words

dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)
# remove words which don't appear in at least 5% of documents
spdtm <- removeSparseTerms(DocumentTermMatrix(corpus), 0.95)
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity           : 89%
## Maximal term length: 10
## Weighting          : term frequency (tf)
emailsSparse <- as.data.frame(as.matrix(spdtm))
colnames(emailsSparse) <- make.names(colnames(emailsSparse))
# some words start with a number, but R struggles with variable names that start with a number
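
For example (an optional illustration), make.names() prepends an "X" to any name that begins with a digit, which is why stems such as "000" and "2000" appear later as X000 and X2000:

make.names(c("000", "2000")) # returns "X000" "X2000"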

cat("Most common stem in the emails :", colnames(spdtm)[which.max(colSums(emailsSparse))])
## Most common stem in the emails : enron
emailsSparse$spam <- emails$spam # add variable 'spam' containing the email spam labels

cat("Number of word stems which appear at least 5000 times\nin ham emails : ",
    sum(colSums(emailsSparse[emailsSparse$spam == 0,!names(emailsSparse) %in% "spam"])>=5000))
## Number of word stems which appear at least 5000 times
## in ham emails :  6
cat("\nMost common stems in ham emails\n")
## 
## Most common stems in ham emails
sort(colSums(subset(emailsSparse, spam == 0)), decreasing = TRUE)[1:10]
##    enron      ect  subject     vinc     will      hou    X2000 kaminski 
##    13388    11417     8625     8531     6802     5569     4935     4801 
##    pleas      com 
##     4494     4444
cat("\nNumber of word stems which appear at least 1000 times\nin spam emails : ",
    sum(colSums(emailsSparse[emailsSparse$spam == 1,!names(emailsSparse) %in% "spam"])>=1000))
## 
## Number of word stems which appear at least 1000 times
## in spam emails :  3
cat("\nMost common stems in spam emails\n")
## 
## Most common stems in spam emails
sort(colSums(emailsSparse[emailsSparse$spam == 1, !names(emailsSparse) %in% "spam"]), decreasing = TRUE)[1:10]
## subject    will compani     com    mail    busi   email     can  inform 
##    1577    1450    1065     999     917     897     865     831     818 
##  receiv 
##     727

Several of the most common word stems from the ham documents, such as “enron”, “hou” (short for Houston), “vinc” (the word stem of “Vince”) and “kaminski”, are likely specific to Vincent Kaminski’s inbox.

The ham dataset is personalized to Vincent Kaminski, and therefore it might not generalize well to a general email user. Caution is definitely necessary before applying the filters derived in this problem to other email users.

building machine learning models

create training and test sets

emailsSparse$spam <- as.factor(emailsSparse$spam)

set.seed(123) # seed for random number generator

split <- sample.split(emailsSparse$spam, SplitRatio = 0.7)
train <- subset(emailsSparse, split == TRUE)
test <- subset(emailsSparse, split == FALSE)
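
sample.split() balances the outcome variable across the two sets, so both should contain roughly 24% spam. An optional check (output omitted):

prop.table(table(train$spam)) # proportion of ham/spam in the training set
prop.table(table(test$spam))  # proportion of ham/spam in the test set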

logistic regression model

spamlog <- glm(spam ~ ., data = train, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(spamlog)
## 
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.011   0.000   0.000   0.000   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.082e+01  1.055e+04  -0.003    0.998
## X000         1.474e+01  1.058e+04   0.001    0.999
## X2000       -3.631e+01  1.556e+04  -0.002    0.998
## X2001       -3.215e+01  1.318e+04  -0.002    0.998
## X713        -2.427e+01  2.914e+04  -0.001    0.999
## X853        -1.212e+00  5.942e+04   0.000    1.000
## abl         -2.049e+00  2.088e+04   0.000    1.000
## access      -1.480e+01  1.335e+04  -0.001    0.999
## account      2.488e+01  8.165e+03   0.003    0.998
## addit        1.463e+00  2.703e+04   0.000    1.000
## address     -4.613e+00  1.113e+04   0.000    1.000
## allow        1.899e+01  6.436e+03   0.003    0.998
## alreadi     -2.407e+01  3.319e+04  -0.001    0.999
## also         2.990e+01  1.378e+04   0.002    0.998
## analysi     -2.405e+01  3.860e+04  -0.001    1.000
## anoth       -8.744e+00  2.032e+04   0.000    1.000
## applic      -2.649e+00  1.674e+04   0.000    1.000
## appreci     -2.145e+01  2.762e+04  -0.001    0.999
## approv      -1.302e+00  1.589e+04   0.000    1.000
## april       -2.620e+01  2.208e+04  -0.001    0.999
## area         2.041e+01  2.266e+04   0.001    0.999
## arrang       1.069e+01  2.135e+04   0.001    1.000
## ask         -7.746e+00  1.976e+04   0.000    1.000
## assist      -1.128e+01  2.490e+04   0.000    1.000
## associ       9.049e+00  1.909e+04   0.000    1.000
## attach      -1.037e+01  1.534e+04  -0.001    0.999
## attend      -3.451e+01  3.257e+04  -0.001    0.999
## avail        8.651e+00  1.709e+04   0.001    1.000
## back        -1.323e+01  2.272e+04  -0.001    1.000
## base        -1.354e+01  2.122e+04  -0.001    0.999
## begin        2.228e+01  2.973e+04   0.001    0.999
## believ       3.233e+01  2.136e+04   0.002    0.999
## best        -8.201e+00  1.333e+03  -0.006    0.995
## better       4.263e+01  2.360e+04   0.002    0.999
## book         4.301e+00  2.024e+04   0.000    1.000
## bring        1.607e+01  6.767e+04   0.000    1.000
## busi        -4.803e+00  1.000e+04   0.000    1.000
## buy          4.170e+01  3.892e+04   0.001    0.999
## call        -1.145e+00  1.111e+04   0.000    1.000
## can          3.762e+00  7.674e+03   0.000    1.000
## case        -3.372e+01  2.880e+04  -0.001    0.999
## chang       -2.717e+01  2.215e+04  -0.001    0.999
## check        1.425e+00  1.963e+04   0.000    1.000
## click        1.376e+01  7.077e+03   0.002    0.998
## com          1.936e+00  4.039e+03   0.000    1.000
## come        -1.166e+00  1.511e+04   0.000    1.000
## comment     -3.251e+00  3.387e+04   0.000    1.000
## communic     1.580e+01  8.958e+03   0.002    0.999
## compani      4.781e+00  9.186e+03   0.001    1.000
## complet     -1.363e+01  2.024e+04  -0.001    0.999
## confer      -7.503e-01  8.557e+03   0.000    1.000
## confirm     -1.300e+01  1.514e+04  -0.001    0.999
## contact      1.530e+00  1.262e+04   0.000    1.000
## continu      1.487e+01  1.535e+04   0.001    0.999
## contract    -1.295e+01  1.498e+04  -0.001    0.999
## copi        -4.274e+01  3.070e+04  -0.001    0.999
## corp         1.606e+01  2.708e+04   0.001    1.000
## corpor      -8.286e-01  2.818e+04   0.000    1.000
## cost        -1.938e+00  1.833e+04   0.000    1.000
## cours        1.665e+01  1.834e+04   0.001    0.999
## creat        1.338e+01  3.946e+04   0.000    1.000
## credit       2.617e+01  1.314e+04   0.002    0.998
## crenshaw     9.994e+01  6.769e+04   0.001    0.999
## current      3.629e+00  1.707e+04   0.000    1.000
## custom       1.829e+01  1.008e+04   0.002    0.999
## data        -2.609e+01  2.271e+04  -0.001    0.999
## date        -2.786e+00  1.699e+04   0.000    1.000
## day         -6.100e+00  5.866e+03  -0.001    0.999
## deal        -1.129e+01  1.448e+04  -0.001    0.999
## dear        -2.313e+00  2.306e+04   0.000    1.000
## depart      -4.068e+01  2.509e+04  -0.002    0.999
## deriv       -4.971e+01  3.587e+04  -0.001    0.999
## design      -7.923e+00  2.939e+04   0.000    1.000
## detail       1.197e+01  2.301e+04   0.001    1.000
## develop      5.976e+00  9.455e+03   0.001    0.999
## differ      -2.293e+00  1.075e+04   0.000    1.000
## direct      -2.051e+01  3.194e+04  -0.001    0.999
## director    -1.770e+01  1.793e+04  -0.001    0.999
## discuss     -1.051e+01  1.915e+04  -0.001    1.000
## doc         -2.597e+01  2.603e+04  -0.001    0.999
## don          2.129e+01  1.456e+04   0.001    0.999
## done         6.828e+00  1.882e+04   0.000    1.000
## due         -4.163e+00  3.532e+04   0.000    1.000
## ect          8.685e-01  5.342e+03   0.000    1.000
## edu         -2.122e-01  6.917e+02   0.000    1.000
## effect       1.948e+01  2.100e+04   0.001    0.999
## effort       1.606e+01  5.670e+04   0.000    1.000
## either      -2.744e+01  4.000e+04  -0.001    0.999
## email        3.833e+00  1.186e+04   0.000    1.000
## end         -1.311e+01  2.938e+04   0.000    1.000
## energi      -1.620e+01  1.646e+04  -0.001    0.999
## engin        2.664e+01  2.394e+04   0.001    0.999
## enron       -8.789e+00  5.719e+03  -0.002    0.999
## etc          9.470e-01  1.569e+04   0.000    1.000
## even        -1.654e+01  2.289e+04  -0.001    0.999
## event        1.694e+01  1.851e+04   0.001    0.999
## expect      -1.179e+01  1.914e+04  -0.001    1.000
## experi       2.460e+00  2.240e+04   0.000    1.000
## fax          3.537e+00  3.386e+04   0.000    1.000
## feel         2.596e+00  2.348e+04   0.000    1.000
## file        -2.943e+01  2.165e+04  -0.001    0.999
## final        8.075e+00  5.008e+04   0.000    1.000
## financ      -9.122e+00  7.524e+03  -0.001    0.999
## financi     -9.747e+00  1.727e+04  -0.001    1.000
## find        -2.623e+00  9.727e+03   0.000    1.000
## first       -4.666e-01  2.043e+04   0.000    1.000
## follow       1.766e+01  3.080e+03   0.006    0.995
## form         8.483e+00  1.674e+04   0.001    1.000
## forward     -3.484e+00  1.864e+04   0.000    1.000
## free         6.113e+00  8.121e+03   0.001    0.999
## friday      -1.146e+01  1.996e+04  -0.001    1.000
## full         2.125e+01  2.190e+04   0.001    0.999
## futur        4.146e+01  1.439e+04   0.003    0.998
## gas         -3.901e+00  4.160e+03  -0.001    0.999
## get          5.154e+00  9.737e+03   0.001    1.000
## gibner       2.901e+01  2.460e+04   0.001    0.999
## give        -2.518e+01  2.130e+04  -0.001    0.999
## given       -2.186e+01  5.426e+04   0.000    1.000
## good         5.399e+00  1.619e+04   0.000    1.000
## great        1.222e+01  1.090e+04   0.001    0.999
## group        5.264e-01  1.037e+04   0.000    1.000
## happi        1.939e-02  1.202e+04   0.000    1.000
## hear         2.887e+01  2.281e+04   0.001    0.999
## hello        2.166e+01  1.361e+04   0.002    0.999
## help         1.731e+01  2.791e+03   0.006    0.995
## high        -1.982e+00  2.554e+04   0.000    1.000
## home         5.973e+00  8.965e+03   0.001    0.999
## hope        -1.435e+01  2.179e+04  -0.001    0.999
## hou          6.852e+00  6.437e+03   0.001    0.999
## hour         2.478e+00  1.333e+04   0.000    1.000
## houston     -1.855e+01  7.305e+03  -0.003    0.998
## howev       -3.449e+01  3.562e+04  -0.001    0.999
## http         2.528e+01  2.107e+04   0.001    0.999
## idea        -1.845e+01  3.892e+04   0.000    1.000
## immedi       6.285e+01  3.346e+04   0.002    0.999
## import      -1.859e+00  2.236e+04   0.000    1.000
## includ      -3.454e+00  1.799e+04   0.000    1.000
## increas      6.476e+00  2.329e+04   0.000    1.000
## industri    -3.160e+01  2.373e+04  -0.001    0.999
## info        -1.255e+00  4.857e+03   0.000    1.000
## inform       2.078e+01  8.549e+03   0.002    0.998
## interest     2.698e+01  1.159e+04   0.002    0.998
## intern      -7.991e+00  3.351e+04   0.000    1.000
## internet     8.749e+00  1.100e+04   0.001    0.999
## interview   -1.640e+01  1.873e+04  -0.001    0.999
## invest       3.201e+01  2.393e+04   0.001    0.999
## invit        4.304e+00  2.215e+04   0.000    1.000
## involv       3.815e+01  3.315e+04   0.001    0.999
## issu        -3.708e+01  3.396e+04  -0.001    0.999
## john        -5.326e-01  2.856e+04   0.000    1.000
## join        -3.824e+01  2.334e+04  -0.002    0.999
## juli        -1.358e+01  3.009e+04   0.000    1.000
## just        -1.021e+01  1.114e+04  -0.001    0.999
## kaminski    -1.812e+01  6.029e+03  -0.003    0.998
## keep         1.867e+01  2.782e+04   0.001    0.999
## kevin       -3.779e+01  4.738e+04  -0.001    0.999
## know         1.277e+01  1.526e+04   0.001    0.999
## last         1.046e+00  1.372e+04   0.000    1.000
## let         -2.763e+01  1.462e+04  -0.002    0.998
## life         5.812e+01  3.864e+04   0.002    0.999
## like         5.649e+00  7.660e+03   0.001    0.999
## line         8.743e+00  1.236e+04   0.001    0.999
## link        -6.929e+00  1.345e+04  -0.001    1.000
## list        -8.692e+00  2.149e+03  -0.004    0.997
## locat        2.073e+01  1.597e+04   0.001    0.999
## london       6.745e+00  1.642e+04   0.000    1.000
## long        -1.489e+01  1.934e+04  -0.001    0.999
## look        -7.031e+00  1.563e+04   0.000    1.000
## lot         -1.964e+01  1.321e+04  -0.001    0.999
## made         2.820e+00  2.743e+04   0.000    1.000
## mail         7.584e+00  1.021e+04   0.001    0.999
## make         2.901e+01  1.528e+04   0.002    0.998
## manag        6.014e+00  1.445e+04   0.000    1.000
## mani         1.885e+01  1.442e+04   0.001    0.999
## mark        -3.350e+01  3.208e+04  -0.001    0.999
## market       7.895e+00  8.012e+03   0.001    0.999
## may         -9.434e+00  1.397e+04  -0.001    0.999
## mean         6.078e-01  2.952e+04   0.000    1.000
## meet        -1.063e+00  1.263e+04   0.000    1.000
## member       1.381e+01  2.343e+04   0.001    1.000
## mention     -2.279e+01  2.714e+04  -0.001    0.999
## messag       1.716e+01  2.562e+03   0.007    0.995
## might        1.244e+01  1.753e+04   0.001    0.999
## model       -2.292e+01  1.049e+04  -0.002    0.998
## monday      -1.034e+00  3.233e+04   0.000    1.000
## money        3.264e+01  1.321e+04   0.002    0.998
## month       -3.727e+00  1.112e+04   0.000    1.000
## morn        -2.645e+01  3.403e+04  -0.001    0.999
## move        -3.834e+01  3.011e+04  -0.001    0.999
## much         3.775e-01  1.392e+04   0.000    1.000
## name         1.672e+01  1.322e+04   0.001    0.999
## need         8.437e-01  1.221e+04   0.000    1.000
## net          1.256e+01  2.197e+04   0.001    1.000
## new          1.003e+00  1.009e+04   0.000    1.000
## next.        1.492e+01  1.724e+04   0.001    0.999
## note         1.446e+01  2.294e+04   0.001    0.999
## now          3.790e+01  1.219e+04   0.003    0.998
## number      -9.622e+00  1.591e+04  -0.001    1.000
## offer        1.174e+01  1.084e+04   0.001    0.999
## offic       -1.344e+01  2.311e+04  -0.001    1.000
## one          1.241e+01  6.652e+03   0.002    0.999
## onlin        3.589e+01  1.665e+04   0.002    0.998
## open         2.114e+01  2.961e+04   0.001    0.999
## oper        -1.696e+01  2.757e+04  -0.001    1.000
## opportun    -4.131e+00  1.918e+04   0.000    1.000
## option      -1.085e+00  9.325e+03   0.000    1.000
## order        6.533e+00  1.242e+04   0.001    1.000
## origin       3.226e+01  3.818e+04   0.001    0.999
## part         4.594e+00  3.483e+04   0.000    1.000
## particip    -1.154e+01  1.738e+04  -0.001    0.999
## peopl       -1.864e+01  1.439e+04  -0.001    0.999
## per          1.367e+01  1.273e+04   0.001    0.999
## person       1.870e+01  9.575e+03   0.002    0.998
## phone       -6.957e+00  1.172e+04  -0.001    1.000
## place        9.005e+00  3.661e+04   0.000    1.000
## plan        -1.830e+01  6.320e+03  -0.003    0.998
## pleas       -7.961e+00  9.484e+03  -0.001    0.999
## point        5.498e+00  3.403e+04   0.000    1.000
## posit       -1.543e+01  2.316e+04  -0.001    0.999
## possibl     -1.366e+01  2.492e+04  -0.001    1.000
## power       -5.643e+00  1.173e+04   0.000    1.000
## present     -6.163e+00  1.278e+04   0.000    1.000
## price        3.428e+00  7.850e+03   0.000    1.000
## problem      1.262e+01  9.763e+03   0.001    0.999
## process     -2.957e-01  1.191e+04   0.000    1.000
## product      1.016e+01  1.345e+04   0.001    0.999
## program      1.444e+00  1.183e+04   0.000    1.000
## project      2.173e+00  1.497e+04   0.000    1.000
## provid       2.422e-01  1.859e+04   0.000    1.000
## public      -5.250e+01  2.341e+04  -0.002    0.998
## put         -1.052e+01  2.681e+04   0.000    1.000
## question    -3.467e+01  1.859e+04  -0.002    0.999
## rate        -3.112e+00  1.319e+04   0.000    1.000
## read        -1.527e+01  2.145e+04  -0.001    0.999
## real         2.046e+01  2.358e+04   0.001    0.999
## realli      -2.667e+01  4.640e+04  -0.001    1.000
## receiv       5.765e-01  1.585e+04   0.000    1.000
## recent      -2.067e+00  1.780e+04   0.000    1.000
## regard      -3.668e+00  1.511e+04   0.000    1.000
## relat       -5.114e+01  1.793e+04  -0.003    0.998
## remov        2.325e+01  2.484e+04   0.001    0.999
## repli        1.538e+01  2.916e+04   0.001    1.000
## report      -1.482e+01  1.477e+04  -0.001    0.999
## request     -1.232e+01  1.167e+04  -0.001    0.999
## requir       5.004e-01  2.937e+04   0.000    1.000
## research    -2.826e+01  1.553e+04  -0.002    0.999
## resourc     -2.735e+01  3.522e+04  -0.001    0.999
## respond      2.974e+01  3.888e+04   0.001    0.999
## respons     -1.960e+01  3.667e+04  -0.001    1.000
## result      -5.002e-01  3.140e+04   0.000    1.000
## resum       -9.219e+00  2.100e+04   0.000    1.000
## return       1.745e+01  1.844e+04   0.001    0.999
## review      -4.825e+00  1.013e+04   0.000    1.000
## right        2.312e+01  1.590e+04   0.001    0.999
## risk        -4.001e+00  1.718e+04   0.000    1.000
## robert      -2.096e+01  2.907e+04  -0.001    0.999
## run         -5.162e+01  4.434e+04  -0.001    0.999
## say          7.366e+00  2.217e+04   0.000    1.000
## schedul      1.919e+00  3.580e+04   0.000    1.000
## school      -3.870e+00  2.882e+04   0.000    1.000
## secur       -1.604e+01  2.201e+03  -0.007    0.994
## see         -1.120e+01  1.293e+04  -0.001    0.999
## send        -2.427e+01  1.222e+04  -0.002    0.998
## sent        -1.488e+01  2.195e+04  -0.001    0.999
## servic      -7.164e+00  1.235e+04  -0.001    1.000
## set         -9.353e+00  2.627e+04   0.000    1.000
## sever        2.041e+01  3.093e+04   0.001    0.999
## shall        1.930e+01  3.075e+04   0.001    0.999
## shirley     -7.133e+01  6.329e+04  -0.001    0.999
## short       -8.974e+00  1.721e+04  -0.001    1.000
## sinc        -3.438e+00  3.546e+04   0.000    1.000
## sincer      -2.073e+01  3.515e+04  -0.001    1.000
## site         8.689e+00  1.496e+04   0.001    1.000
## softwar      2.575e+01  1.059e+04   0.002    0.998
## soon         2.350e+01  3.731e+04   0.001    0.999
## sorri        6.036e+00  2.299e+04   0.000    1.000
## special      1.777e+01  2.755e+04   0.001    0.999
## specif      -2.337e+01  3.083e+04  -0.001    0.999
## start        1.437e+01  1.897e+04   0.001    0.999
## state        1.221e+01  1.677e+04   0.001    0.999
## still        3.878e+00  2.622e+04   0.000    1.000
## stinson     -4.345e+01  2.697e+04  -0.002    0.999
## student     -1.815e+01  2.186e+04  -0.001    0.999
## subject      3.041e+01  1.055e+04   0.003    0.998
## success      4.344e+00  2.783e+04   0.000    1.000
## suggest     -3.842e+01  4.475e+04  -0.001    0.999
## support     -1.539e+01  1.976e+04  -0.001    0.999
## sure        -5.503e+00  2.078e+04   0.000    1.000
## system       3.778e+00  9.149e+03   0.000    1.000
## take         5.731e+00  1.716e+04   0.000    1.000
## talk        -1.011e+01  2.021e+04  -0.001    1.000
## team         7.940e+00  2.570e+04   0.000    1.000
## term         2.013e+01  2.303e+04   0.001    0.999
## thank       -3.890e+01  1.059e+04  -0.004    0.997
## thing        2.579e+01  1.341e+04   0.002    0.998
## think       -1.218e+01  2.077e+04  -0.001    1.000
## thought      1.243e+01  3.023e+04   0.000    1.000
## thursday    -1.491e+01  3.262e+04   0.000    1.000
## time        -5.921e+00  8.335e+03  -0.001    0.999
## today       -1.762e+01  1.965e+04  -0.001    0.999
## togeth      -2.355e+01  1.869e+04  -0.001    0.999
## trade       -1.755e+01  1.483e+04  -0.001    0.999
## tri          9.278e-01  1.282e+04   0.000    1.000
## tuesday     -2.808e+01  3.959e+04  -0.001    0.999
## two         -2.573e+01  1.844e+04  -0.001    0.999
## type        -1.447e+01  2.755e+04  -0.001    1.000
## understand   9.307e+00  2.342e+04   0.000    1.000
## unit        -4.020e+00  3.008e+04   0.000    1.000
## univers      1.228e+01  2.197e+04   0.001    1.000
## updat       -1.510e+01  1.448e+04  -0.001    0.999
## use         -1.385e+01  9.382e+03  -0.001    0.999
## valu         9.024e-01  1.360e+04   0.000    1.000
## version     -3.606e+01  2.939e+04  -0.001    0.999
## vinc        -3.735e+01  8.647e+03  -0.004    0.997
## visit        2.585e+01  1.170e+04   0.002    0.998
## vkamin      -6.649e+01  5.703e+04  -0.001    0.999
## want        -2.555e+00  1.106e+04   0.000    1.000
## way          1.339e+01  1.138e+04   0.001    0.999
## web          2.791e+00  1.686e+04   0.000    1.000
## websit      -2.563e+01  1.848e+04  -0.001    0.999
## wednesday   -1.526e+01  2.642e+04  -0.001    1.000
## week        -6.795e+00  1.046e+04  -0.001    0.999
## well        -2.222e+01  9.713e+03  -0.002    0.998
## will        -1.119e+01  5.980e+03  -0.002    0.999
## wish         1.173e+01  3.175e+04   0.000    1.000
## within       2.900e+01  2.163e+04   0.001    0.999
## without      1.942e+01  1.763e+04   0.001    0.999
## work        -1.099e+01  1.160e+04  -0.001    0.999
## write        4.406e+01  2.825e+04   0.002    0.999
## www         -7.867e+00  2.224e+04   0.000    1.000
## year        -1.010e+01  1.039e+04  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4409.49  on 4009  degrees of freedom
## Residual deviance:   13.46  on 3679  degrees of freedom
## AIC: 675.46
## 
## Number of Fisher Scoring iterations: 25

The logistic regression model yielded the messages “algorithm did not converge” and “fitted probabilities numerically 0 or 1 occurred”. Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model.

CART model

spamCART <- rpart(spam ~ ., data = train, method = "class")
prp(spamCART)

We see that “vinc” and “enron” (specific to the user ‘Vincent Kaminski’) appear in the CART tree as the top two branches, but that “hou” and “kaminski” (also specific to that user) do not appear.

random forest model

set.seed(123)

spamRF <- randomForest(spam ~ . , data = train)
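
As an optional side check (not required by the assignment), randomForest's varImpPlot() shows which stems the forest relies on most when splitting:

varImpPlot(spamRF, n.var = 15, main = "Top 15 stems by importance")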

calculate predicted probabilities for the three models on the training set

predlog <- predict(spamlog, type = "response")
predCART <- predict(spamCART)[,2]
predRF <- predict(spamRF, type = "prob")[,2]
cat("\nTraining set predicted probabilities from spamLog < 0.00001 :",
    sum(predlog<0.00001))
## 
## Training set predicted probabilities from spamLog < 0.00001 : 3046
cat("\nTraining set predicted probabilities from spamLog > 0.99999 :",
    sum(predlog>0.99999))
## 
## Training set predicted probabilities from spamLog > 0.99999 : 954
cat("\nTraining set predicted probabilities from spamLog between 0.00001 and 0.99999 :",
    sum(predlog>=0.00001 & predlog<=0.99999))
## 
## Training set predicted probabilities from spamLog between 0.00001 and 0.99999 : 10

These predicted probabilities echo the warnings produced when the model was fit: almost every training-set prediction is either below 0.00001 or above 0.99999, meaning the logistic regression model has essentially separated the training data perfectly, a sign of severe overfitting.

p-values of logistic regression model

Show some of the coefficients in the logistic regression model.

head(coef(summary(spamlog)),10)
##               Estimate Std. Error       z value  Pr(>|z|)
## (Intercept) -30.816712  10548.743 -2.921363e-03 0.9976691
## X000         14.738355  10583.796  1.392540e-03 0.9988889
## X2000       -36.306532  15559.782 -2.333357e-03 0.9981383
## X2001       -32.147703  13177.688 -2.439556e-03 0.9980535
## X713        -24.273005  29138.380 -8.330252e-04 0.9993353
## X853         -1.212310  59416.827 -2.040349e-05 0.9999837
## abl          -2.048513  20883.267 -9.809349e-05 0.9999217
## access      -14.797240  13353.490 -1.108118e-03 0.9991159
## account      24.881195   8164.788  3.047378e-03 0.9975685
## addit         1.463489  27027.117  5.414892e-05 0.9999568
cat("\nNumber of variables labeled as significant (at p=0.05 level)\nin logistic regression model :",
    sum(coef(summary(spamlog))[,4] < 0.05))
## 
## Number of variables labeled as significant (at p=0.05 level)
## in logistic regression model : 0

We see that none of the variables are labeled as significant (a symptom of the logistic regression algorithm not converging).

model accuracy on the training set
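
The confusion-matrix calculations below are repeated for each model. A small helper along the following lines (a sketch; the hypothetical classStats() is not used in the chunks below) could compute all three summary statistics in one call:

classStats <- function(actual, predProb, threshold = 0.5) {
  predicted <- predProb >= threshold
  actual    <- as.character(actual) == "1"
  c(accuracy    = mean(predicted == actual),
    sensitivity = sum(predicted & actual) / sum(actual),     # true-positive rate
    specificity = sum(!predicted & !actual) / sum(!actual))  # true-negative rate
}
# e.g. classStats(train$spam, predlog)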

logistic regression model, using a threshold of 0.5

logconfusion <- table(train$spam, predlog >= 0.5)
logconfusion
##    
##     FALSE TRUE
##   0  3052    0
##   1     4  954
cat("\nAccuracy : ", sum(diag(logconfusion))/nrow(train))
## 
## Accuracy :  0.9990025
cat("\nSensitivity : ", logconfusion["1","TRUE"]/sum(logconfusion["1",]))
## 
## Sensitivity :  0.9958246
cat("\nSpecificity : ", logconfusion["0","FALSE"]/sum(logconfusion["0",]))
## 
## Specificity :  1
predictROCR <- prediction(predlog, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9999959

CART (classification and regression tree) accuracy on training set

Using a threshold of 0.5

CARTconfusion <- table(train$spam, predCART >= 0.5)
CARTconfusion
##    
##     FALSE TRUE
##   0  2885  167
##   1    64  894
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(train))
## 
## Accuracy :  0.942394
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
## 
## Sensitivity :  0.9331942
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
## 
## Specificity :  0.9452818
predictROCR <- prediction(predCART, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9696044

random forest accuracy on training set

RFconfusion <- table(train$spam, predRF >= 0.5)
RFconfusion
##    
##     FALSE TRUE
##   0  3014   38
##   1    41  917
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(train))
## 
## Accuracy :  0.9802993
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
## 
## Sensitivity :  0.9572025
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
## 
## Specificity :  0.9875491
predictROCR <- prediction(predRF, train$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9978155

In terms of both accuracy and AUC, logistic regression is nearly perfect when used on the training set and outperforms the other two models.

model accuracy on the test set

Calculate predicted probabilities for the three models on the test set

predlog <- predict(spamlog, newdata = test, type = "response")
predCART <- predict(spamCART, newdata = test)[,2]
predRF <- predict(spamRF, newdata = test, type = "prob")[,2]

logistic regression model, using a threshold of 0.5

logconfusion <- table(test$spam, predlog >= 0.5)
logconfusion
##    
##     FALSE TRUE
##   0  1257   51
##   1    34  376
cat("\nAccuracy : ", sum(diag(logconfusion))/nrow(test))
## 
## Accuracy :  0.9505239
cat("\nSensitivity : ", logconfusion["1","TRUE"]/sum(logconfusion["1",]))
## 
## Sensitivity :  0.9170732
cat("\nSpecificity : ", logconfusion["0","FALSE"]/sum(logconfusion["0",]))
## 
## Specificity :  0.9610092
predictROCR <- prediction(predlog, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9627517

CART (classification and regression tree) accuracy on test set

Using a threshold of 0.5

CARTconfusion <- table(test$spam, predCART >= 0.5)
CARTconfusion
##    
##     FALSE TRUE
##   0  1228   80
##   1    24  386
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(test))
## 
## Accuracy :  0.9394645
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
## 
## Sensitivity :  0.9414634
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
## 
## Specificity :  0.9388379
predictROCR <- prediction(predCART, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.963176

random forest accuracy on test set

RFconfusion <- table(test$spam, predRF >= 0.5)
RFconfusion
##    
##     FALSE TRUE
##   0  1290   18
##   1    23  387
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(test))
## 
## Accuracy :  0.976135
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
## 
## Sensitivity :  0.9439024
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
## 
## Specificity :  0.9862385
predictROCR <- prediction(predRF, test$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9975899

Both CART and random forest had very similar accuracies on the training and testing sets. However, logistic regression obtained nearly perfect accuracy and AUC on the training set and had far-from-perfect performance on the testing set. This is an indicator of overfitting.

The random forest outperformed logistic regression and CART in both measures on the test set, obtaining an impressive AUC of 0.997 on the test set.

Integrating word count information

While we have thus far mostly dealt with frequencies of specific words in our analysis, we can extract other information from text. This final section deals with one such source of information.

We will use the number of words in each email as an independent variable. We can use the original document term matrix called dtm for this task. The document term matrix has documents (in this case, emails) as its rows, terms (in this case, word stems) as its columns, and frequencies as its values. As a result, the sum of all the elements in a row of the document term matrix equals the number of terms present in the corresponding document.

wordCount <- rowSums(as.matrix(dtm))
hist(wordCount)
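
Note that as.matrix(dtm) expands the full 5728 x 28687 matrix in memory. If that is too memory-hungry, the sparse row sums from the slam package (which tm uses internally for its sparse matrices) give the same word counts:

wordCount <- slam::row_sums(dtm) # same values, computed directly on the sparse matrix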

Nearly all of the observations lie at the far left of the graph, representing small values; the distribution of wordCount is therefore skewed right.

hist(log(wordCount))

In the histogram of log(wordCount), the frequencies are much more balanced, suggesting that log(wordCount) is not heavily skewed.

Plot logWordCount against whether a message is spam.

emailsSparse$logWordCount <- log(wordCount)
boxplot(logWordCount ~ spam, data = emailsSparse)
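
To complement the boxplot, the group medians can be compared directly (optional; output not reproduced here):

tapply(emailsSparse$logWordCount, emailsSparse$spam, median) # median log word count for ham (0) and spam (1)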

Create new training and test sets with Word Count information

train2 <- subset(emailsSparse, split == TRUE)
test2 <- subset(emailsSparse, split == FALSE)

Create new CART model with word count information

spam2CART <- rpart(spam ~ ., data = train2)
prp(spam2CART)

Create new random forest model with word count information

spam2RF <- randomForest(spam ~ ., data = train2)
summary(spam2RF)
##                 Length Class  Mode     
## call               3   -none- call     
## type               1   -none- character
## predicted       4010   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           8020   matrix numeric  
## oob.times       4010   -none- numeric  
## classes            2   -none- character
## importance       331   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               4010   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
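
summary() on a randomForest object only lists its internal components; printing the model itself (optional) reports the out-of-bag error estimate and confusion matrix:

print(spam2RF) # out-of-bag error rate and confusion matrix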

predictions on test set

predCART <- predict(spam2CART, newdata = test2)[,2]
predRF <- predict(spam2RF, newdata = test2, type = "prob")[,2]

CART (classification and regression tree) accuracy on test set

Using a threshold of 0.5

CARTconfusion <- table(test2$spam, predCART >= 0.5)
CARTconfusion
##    
##     FALSE TRUE
##   0  1214   94
##   1    26  384
cat("\nAccuracy : ", sum(diag(CARTconfusion))/nrow(test2))
## 
## Accuracy :  0.9301513
cat("\nSensitivity : ", CARTconfusion["1","TRUE"]/sum(CARTconfusion["1",]))
## 
## Sensitivity :  0.9365854
cat("\nSpecificity : ", CARTconfusion["0","FALSE"]/sum(CARTconfusion["0",]))
## 
## Specificity :  0.9281346
predictROCR <- prediction(predCART, test2$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9582438

random forest accuracy on test set using word count

RFconfusion <- table(test2$spam, predRF >= 0.5)
RFconfusion
##    
##     FALSE TRUE
##   0  1298   10
##   1    28  382
cat("\nAccuracy : ", sum(diag(RFconfusion))/nrow(test2))
## 
## Accuracy :  0.9778813
cat("\nSensitivity : ", RFconfusion["1","TRUE"]/sum(RFconfusion["1",]))
## 
## Sensitivity :  0.9317073
cat("\nSpecificity : ", RFconfusion["0","FALSE"]/sum(RFconfusion["0",]))
## 
## Specificity :  0.9923547
predictROCR <- prediction(predRF, test2$spam)
perfROCR <- performance(predictROCR, "tpr", "fpr") # true-positive rate vs false-positive rate

plot(perfROCR, colorize=TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

cat("\nAUC :", performance(predictROCR, "auc")@y.values[[1]])
## 
## AUC : 0.9979265

In this case, adding the logWordCount variable did not improve the results on the test set for either the CART model or the random forest model.