Nearly every email user has at some point encountered a “spam” email: an unsolicited message that often advertises a product, contains links to malware, or attempts to scam the recipient. Roughly 80-90% of the more than 100 billion emails sent each day are spam, most of them sent from botnets of malware-infected computers. The legitimate remainder are called “ham” emails.
As a result of the huge number of spam emails being sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the ham. Though these filters use a number of techniques (e.g. looking up the sender in a so-called “Blackhole List” that contains IP addresses of likely spammers), most rely heavily on the analysis of the contents of an email via text analytics.
In this homework problem, we will build and evaluate a spam filter using a publicly available dataset first described in the 2006 conference paper “Spam Filtering with Naive Bayes – Which Naive Bayes?” by V. Metsis, I. Androutsopoulos, and G. Paliouras. The “ham” messages in this dataset come from the inbox of former Enron Managing Director for Research Vincent Kaminski, one of the inboxes in the Enron Corpus. One source of spam messages in this dataset is the SpamAssassin corpus, which contains hand-labeled spam messages contributed by Internet users. The remaining spam was collected by Project Honey Pot, a project that collects spam messages and identifies spammers by publishing email addresses that humans would know not to contact but that bots might target with spam. The full dataset we will use was constructed as roughly a 75/25 mix of the ham and spam messages.
The dataset contains just two fields: text, the text of the email in question, and spam, a binary (0/1) variable indicating whether the email is spam.
Begin by loading the dataset emails.csv into a data frame called emails. Remember to pass the stringsAsFactors=FALSE option when loading the data.
emails = read.csv("emails.csv", stringsAsFactors=FALSE)
# Examine the structure of emails
str(emails)
## 'data.frame': 5728 obs. of 2 variables:
## $ text: chr "Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market"| __truncated__ "Subject: the stock trading gunslinger fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__ "Subject: unbelievable new homes made easy im wanting to show you this homeowner you have been pre - approved"| __truncated__ "Subject: 4 color printing special request additional information now ! click here click here for a printable "| __truncated__ ...
## $ spam: int 1 1 1 1 1 1 1 1 1 1 ...
There are 5728 emails in the dataset.
# Tabulate how many emails are spam
table(emails$spam)
##
## 0 1
## 4360 1368
1368 of the emails are spam.
# Examine the text of the second email
str(emails$text[2])
## chr "Subject: the stock trading gunslinger fanny is merrill but muzo not colza attainder and penultimate like esmar"| __truncated__Subject appears at the beginning of every email in the dataset.
We know that each email has the word “subject” appear at least once, but the frequency with which it appears might help us differentiate spam from ham. For instance, a long email chain would have the word “subject” appear a number of times, and this higher frequency might be indicative of a ham message.
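As a rough check of this intuition, we could count how many times “subject” occurs in each message and compare the averages for ham and spam. The sketch below uses only base R; subjectCount is an illustrative name, and it counts substring matches rather than whole words.
# Count occurrences of "subject" (in any case) in each email
subjectCount = sapply(gregexpr("subject", tolower(emails$text), fixed = TRUE),
                      function(m) sum(m > 0))
# Compare the average count for ham (spam == 0) and spam (spam == 1) messages
tapply(subjectCount, emails$spam, mean)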
max(nchar(emails$text))
## [1] 43952
The longest email in the dataset contains 43952 characters.
# Finds the row with the shortest email
which.min(nchar(emails$text))
## [1] 1992
Row 1992 contains the shortest email in the dataset.
Follow these steps to prepare the corpus:

- Build a new corpus variable called corpus.
- Using tm_map, convert the text to lowercase.
- Using tm_map, remove all punctuation from the corpus.
- Using tm_map, remove all English stopwords from the corpus.
- Using tm_map, stem the words in the corpus.
- Build a document term matrix from the corpus, called dtm.
If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).
# Preparing the Corpus
library(tm)
# Build a new corpus variable called corpus
corpus = VCorpus(VectorSource(emails$text))
# Convert the text to lowercase.
corpus = tm_map(corpus, content_transformer(tolower))
# Remove all punctuation from the corpus
corpus = tm_map(corpus, removePunctuation)
# Remove all English stopwords from the corpus
corpus = tm_map(corpus, removeWords, stopwords("english"))
# Stem the words in the corpus
corpus = tm_map(corpus, stemDocument)
# Build a document term matrix from the corpus, called dtm
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
There are 28687 terms in dtm.
# Remove the sparse terms
spdtm = removeSparseTerms(dtm, 0.95)
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity : 89%
## Maximal term length: 10
## Weighting : term frequency (tf)
There are 330 terms in spdtm.
# Build data frame called emailsSparse from spdtm
emailsSparse = as.data.frame(as.matrix(spdtm))
colnames(emailsSparse) = make.names(colnames(emailsSparse))
# Find the most frequent word stem
frequency <- colSums(emailsSparse)
which.max(frequency)
## enron
## 92
“enron” is the most frequent word stem (92 is its column index in emailsSparse).
# Add a variable called spam
emailsSparse$spam = emails$spam
# Sort the word-stem frequencies in the ham emails
a = sort((colSums(subset(emailsSparse, spam == 0))))
kable(a)

| stem | frequency |
|---|---|
| spam | 0 |
| life | 80 |
| remov | 103 |
| money | 114 |
| onlin | 173 |
| without | 191 |
| websit | 194 |
| click | 217 |
| special | 226 |
| wish | 229 |
| repli | 239 |
| buy | 243 |
| net | 243 |
| link | 247 |
| immedi | 249 |
| done | 254 |
| mean | 259 |
| design | 261 |
| lot | 268 |
| effect | 270 |
| info | 273 |
| either | 279 |
| read | 279 |
| write | 286 |
| line | 289 |
| begin | 291 |
| sorri | 293 |
| success | 293 |
| involv | 294 |
| creat | 299 |
| softwar | 299 |
| better | 301 |
| vkamin | 301 |
| say | 305 |
| keep | 306 |
| bring | 311 |
| believ | 313 |
| full | 317 |
| increas | 320 |
| realli | 324 |
| mention | 325 |
| thought | 325 |
| idea | 327 |
| invest | 327 |
| secur | 337 |
| specif | 338 |
| sever | 340 |
| experi | 346 |
| thing | 347 |
| allow | 348 |
| check | 351 |
| due | 351 |
| type | 352 |
| happi | 354 |
| return | 355 |
| expect | 356 |
| short | 357 |
| effort | 358 |
| open | 360 |
| internet | 361 |
| sincer | 361 |
| public | 364 |
| recent | 368 |
| anoth | 369 |
| alreadi | 372 |
| home | 375 |
| made | 380 |
| respond | 382 |
| given | 383 |
| etc | 385 |
| put | 385 |
| within | 386 |
| place | 388 |
| right | 390 |
| version | 390 |
| hello | 395 |
| sure | 396 |
| area | 397 |
| run | 398 |
| arrang | 399 |
| account | 401 |
| join | 403 |
| hour | 404 |
| locat | 406 |
| togeth | 406 |
| engin | 411 |
| import | 411 |
| per | 412 |
| corpor | 414 |
| high | 416 |
| result | 418 |
| hear | 420 |
| final | 422 |
| deal | 423 |
| applic | 428 |
| even | 429 |
| web | 430 |
| custom | 433 |
| soon | 435 |
| long | 436 |
| sinc | 439 |
| futur | 440 |
| member | 446 |
| X000 | 447 |
| event | 447 |
| don | 450 |
| part | 450 |
| feel | 453 |
| tuesday | 454 |
| wednesday | 456 |
| still | 457 |
| unit | 457 |
| site | 458 |
| X853 | 461 |
| continu | 464 |
| understand | 464 |
| resourc | 466 |
| robert | 466 |
| analysi | 468 |
| form | 468 |
| point | 474 |
| assist | 475 |
| confirm | 485 |
| differ | 489 |
| intern | 489 |
| might | 490 |
| real | 490 |
| case | 492 |
| howev | 496 |
| comment | 505 |
| abl | 515 |
| complet | 515 |
| rate | 516 |
| appreci | 518 |
| tri | 521 |
| move | 526 |
| updat | 527 |
| approv | 533 |
| suggest | 533 |
| free | 535 |
| contract | 544 |
| detail | 546 |
| morn | 546 |
| end | 550 |
| mani | 550 |
| attend | 558 |
| thursday | 558 |
| direct | 561 |
| requir | 562 |
| cours | 567 |
| person | 569 |
| relat | 573 |
| depart | 575 |
| today | 577 |
| start | 580 |
| way | 586 |
| mark | 588 |
| valu | 590 |
| problem | 593 |
| peopl | 599 |
| note | 600 |
| school | 607 |
| invit | 614 |
| access | 617 |
| term | 625 |
| juli | 630 |
| monday | 630 |
| gibner | 633 |
| base | 635 |
| director | 640 |
| offer | 643 |
| cost | 646 |
| addit | 648 |
| kevin | 654 |
| great | 655 |
| set | 658 |
| file | 659 |
| find | 665 |
| much | 669 |
| oper | 669 |
| order | 669 |
| deriv | 673 |
| doc | 673 |
| april | 677 |
| book | 680 |
| address | 693 |
| copi | 700 |
| financi | 702 |
| month | 709 |
| student | 710 |
| respons | 711 |
| possibl | 712 |
| associ | 715 |
| particip | 717 |
| now | 725 |
| first | 726 |
| industri | 731 |
| dear | 734 |
| support | 734 |
| plan | 738 |
| back | 739 |
| name | 745 |
| come | 748 |
| opportun | 760 |
| report | 772 |
| product | 776 |
| two | 787 |
| origin | 796 |
| ask | 797 |
| credit | 798 |
| state | 806 |
| system | 816 |
| process | 826 |
| hope | 828 |
| london | 828 |
| just | 830 |
| receiv | 830 |
| chang | 831 |
| review | 834 |
| current | 841 |
| shall | 844 |
| friday | 847 |
| team | 850 |
| phone | 858 |
| issu | 865 |
| data | 868 |
| avail | 872 |
| last | 874 |
| good | 876 |
| give | 883 |
| www | 897 |
| gas | 905 |
| list | 907 |
| posit | 917 |
| visit | 920 |
| includ | 924 |
| resum | 928 |
| best | 933 |
| offic | 935 |
| servic | 942 |
| talk | 943 |
| number | 951 |
| well | 961 |
| fax | 963 |
| provid | 970 |
| sent | 971 |
| next. | 975 |
| send | 986 |
| http | 1009 |
| john | 1022 |
| univers | 1025 |
| financ | 1038 |
| stinson | 1051 |
| schedul | 1054 |
| take | 1057 |
| date | 1060 |
| want | 1068 |
| question | 1069 |
| program | 1080 |
| think | 1084 |
| X713 | 1097 |
| crenshaw | 1115 |
| attach | 1155 |
| trade | 1167 |
| help | 1168 |
| | 1201 |
| compani | 1225 |
| request | 1227 |
| see | 1238 |
| communic | 1251 |
| confer | 1264 |
| discuss | 1270 |
| make | 1281 |
| contact | 1301 |
| follow | 1308 |
| interview | 1320 |
| project | 1328 |
| | 1352 |
| present | 1397 |
| busi | 1416 |
| interest | 1429 |
| option | 1432 |
| day | 1440 |
| call | 1497 |
| one | 1516 |
| year | 1523 |
| week | 1527 |
| messag | 1538 |
| houston | 1577 |
| also | 1604 |
| look | 1607 |
| edu | 1620 |
| corp | 1643 |
| shirley | 1687 |
| develop | 1691 |
| get | 1768 |
| new | 1777 |
| use | 1784 |
| let | 1856 |
| regard | 1859 |
| inform | 1883 |
| need | 1890 |
| power | 1972 |
| may | 1976 |
| like | 1980 |
| risk | 2097 |
| energi | 2124 |
| market | 2150 |
| model | 2170 |
| price | 2191 |
| work | 2293 |
| manag | 2334 |
| know | 2345 |
| group | 2474 |
| meet | 2544 |
| time | 2552 |
| research | 2752 |
| forward | 2952 |
| X2001 | 3060 |
| can | 3426 |
| thank | 3558 |
| com | 4444 |
| pleas | 4494 |
| kaminski | 4801 |
| X2000 | 4935 |
| hou | 5569 |
| will | 6802 |
| vinc | 8531 |
| subject | 8625 |
| ect | 11417 |
| enron | 13388 |
6 word stems appear at least 5000 times in the ham emails in the dataset.
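The same count can be computed directly from the column sums instead of being read off the table; a minimal sketch (hamFreq is an illustrative name):
# Count the word stems that appear at least 5000 times across the ham emails
hamFreq = colSums(subset(emailsSparse, spam == 0))
sum(hamFreq >= 5000)   # 6, matching the count above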
# Sort the word-stem frequencies in the spam emails
a = sort((colSums(subset(emailsSparse, spam == 1))))
kable(a)

| stem | frequency |
|---|---|
| X713 | 0 |
| crenshaw | 0 |
| enron | 0 |
| gibner | 0 |
| kaminski | 0 |
| stinson | 0 |
| vkamin | 0 |
| X853 | 1 |
| vinc | 1 |
| doc | 2 |
| kevin | 2 |
| shirley | 2 |
| deriv | 3 |
| april | 5 |
| houston | 5 |
| resum | 5 |
| edu | 7 |
| friday | 7 |
| hou | 8 |
| wednesday | 8 |
| ect | 10 |
| arrang | 11 |
| interview | 13 |
| attend | 15 |
| london | 15 |
| robert | 16 |
| student | 16 |
| schedul | 17 |
| thursday | 17 |
| monday | 19 |
| john | 20 |
| tuesday | 20 |
| attach | 21 |
| suggest | 21 |
| appreci | 23 |
| mark | 25 |
| begin | 26 |
| comment | 26 |
| analysi | 27 |
| X2001 | 29 |
| model | 29 |
| hope | 30 |
| mention | 30 |
| X2000 | 32 |
| togeth | 32 |
| confer | 33 |
| invit | 33 |
| univers | 34 |
| financ | 35 |
| talk | 38 |
| either | 39 |
| run | 39 |
| morn | 40 |
| shall | 40 |
| happi | 42 |
| thought | 42 |
| depart | 46 |
| confirm | 47 |
| respond | 48 |
| school | 48 |
| corp | 49 |
| etc | 49 |
| hear | 49 |
| howev | 49 |
| sorri | 50 |
| idea | 51 |
| energi | 55 |
| discuss | 56 |
| open | 56 |
| option | 56 |
| soon | 57 |
| understand | 57 |
| cours | 59 |
| experi | 59 |
| associ | 62 |
| point | 62 |
| bring | 63 |
| director | 65 |
| particip | 65 |
| anoth | 66 |
| join | 66 |
| still | 66 |
| final | 68 |
| research | 68 |
| case | 69 |
| set | 69 |
| specif | 69 |
| given | 70 |
| juli | 71 |
| problem | 73 |
| put | 73 |
| alreadi | 74 |
| ask | 74 |
| abl | 75 |
| deal | 75 |
| fax | 75 |
| book | 76 |
| team | 76 |
| issu | 79 |
| locat | 79 |
| meet | 79 |
| updat | 79 |
| lot | 80 |
| sincer | 80 |
| better | 82 |
| short | 82 |
| sinc | 82 |
| done | 83 |
| question | 83 |
| recent | 83 |
| possibl | 84 |
| contract | 85 |
| end | 85 |
| move | 86 |
| data | 87 |
| might | 87 |
| continu | 88 |
| note | 88 |
| feel | 90 |
| resourc | 90 |
| sever | 90 |
| area | 92 |
| communic | 92 |
| realli | 93 |
| due | 94 |
| direct | 96 |
| origin | 96 |
| copi | 97 |
| unit | 97 |
| long | 98 |
| member | 99 |
| sure | 99 |
| allow | 102 |
| dear | 104 |
| public | 104 |
| write | 104 |
| event | 105 |
| let | 107 |
| differ | 109 |
| file | 111 |
| involv | 111 |
| respons | 113 |
| creat | 114 |
| type | 114 |
| approv | 115 |
| detail | 115 |
| effort | 115 |
| intern | 117 |
| request | 117 |
| say | 118 |
| import | 119 |
| support | 120 |
| part | 121 |
| relat | 121 |
| assist | 123 |
| last | 124 |
| two | 124 |
| back | 125 |
| keep | 125 |
| addit | 126 |
| date | 127 |
| place | 128 |
| group | 130 |
| mean | 131 |
| valu | 131 |
| think | 132 |
| offic | 133 |
| read | 134 |
| immedi | 136 |
| check | 137 |
| applic | 139 |
| hello | 139 |
| tri | 140 |
| review | 142 |
| believ | 143 |
| phone | 143 |
| hour | 144 |
| power | 145 |
| present | 146 |
| process | 149 |
| corpor | 151 |
| oper | 151 |
| full | 152 |
| return | 154 |
| come | 155 |
| sent | 155 |
| opportun | 158 |
| real | 158 |
| repli | 158 |
| line | 159 |
| engin | 160 |
| term | 161 |
| credit | 162 |
| well | 164 |
| gas | 165 |
| info | 165 |
| plan | 166 |
| next. | 170 |
| risk | 170 |
| increas | 171 |
| access | 172 |
| give | 172 |
| thank | 172 |
| link | 174 |
| requir | 174 |
| version | 174 |
| cost | 175 |
| great | 182 |
| wish | 185 |
| regard | 186 |
| posit | 187 |
| thing | 188 |
| call | 190 |
| develop | 191 |
| complet | 192 |
| much | 192 |
| even | 193 |
| project | 194 |
| design | 196 |
| form | 196 |
| expect | 198 |
| person | 198 |
| without | 198 |
| buy | 199 |
| trade | 199 |
| effect | 201 |
| rate | 201 |
| base | 202 |
| find | 202 |
| current | 203 |
| first | 203 |
| chang | 204 |
| visit | 206 |
| financi | 207 |
| high | 208 |
| mani | 208 |
| forward | 209 |
| good | 221 |
| special | 225 |
| don | 226 |
| success | 226 |
| per | 230 |
| number | 231 |
| week | 231 |
| result | 237 |
| web | 238 |
| industri | 239 |
| contact | 242 |
| made | 242 |
| follow | 244 |
| month | 249 |
| right | 249 |
| today | 251 |
| also | 260 |
| help | 262 |
| internet | 262 |
| manag | 266 |
| know | 269 |
| way | 278 |
| avail | 280 |
| state | 280 |
| futur | 282 |
| home | 285 |
| start | 300 |
| system | 302 |
| take | 304 |
| net | 305 |
| includ | 314 |
| life | 320 |
| see | 329 |
| name | 344 |
| onlin | 345 |
| within | 346 |
| remov | 357 |
| best | 358 |
| program | 358 |
| peopl | 359 |
| custom | 363 |
| year | 367 |
| like | 372 |
| interest | 385 |
| send | 393 |
| servic | 395 |
| look | 396 |
| work | 415 |
| day | 420 |
| want | 420 |
| product | 421 |
| www | 426 |
| account | 428 |
| provid | 435 |
| need | 438 |
| softwar | 440 |
| messag | 445 |
| site | 455 |
| address | 461 |
| may | 489 |
| list | 503 |
| price | 503 |
| new | 504 |
| websit | 506 |
| report | 507 |
| secur | 520 |
| just | 524 |
| offer | 528 |
| invest | 540 |
| order | 541 |
| use | 546 |
| click | 552 |
| X000 | 560 |
| now | 575 |
| one | 592 |
| time | 593 |
| http | 600 |
| market | 600 |
| make | 603 |
| free | 606 |
| pleas | 619 |
| money | 662 |
| get | 694 |
| receiv | 727 |
| inform | 818 |
| can | 831 |
| | 865 |
| busi | 897 |
| | 917 |
| com | 999 |
| compani | 1065 |
| spam | 1368 |
| will | 1450 |
| subject | 1577 |
3 word stems (“compani”, “will”, and “subject”) appear at least 1000 times in the spam emails; the dependent variable spam is excluded because it is not a word stem.
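As above, this count can be computed directly; a minimal sketch (spamFreq is an illustrative name, and the spam indicator column is dropped because it is not a word stem):
# Count the word stems that appear at least 1000 times across the spam emails
spamFreq = colSums(subset(emailsSparse, spam == 1))
sum(spamFreq[names(spamFreq) != "spam"] >= 1000)   # 3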
First, convert the dependent variable to a factor with "emailsSparse$spam = as.factor(emailsSparse$spam)".
Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called “train” and a testing set called “test”. Make sure to perform this step on emailsSparse instead of emails.
Using the training set, train the following three machine learning models. The models should predict the dependent variable “spam”, using all other available variables as independent variables. Please be patient, as these models may take a few minutes to train.
A logistic regression model called spamLog. You may see a warning message here - we’ll discuss this more later.
A CART model called spamCART, using the default parameters to train the model (don’t worry about adding minbucket or cp). Remember to add the argument method="class" since this is a binary classification problem.
A random forest model called spamRF, using the default parameters to train the model (don’t worry about specifying ntree or nodesize). Directly before training the random forest model, set the random seed to 123 (even though we’ve already done this earlier in the problem, it’s important to set the seed right before training the model so we all obtain the same results. Keep in mind though that on certain operating systems, your results might still be slightly different).
For each model, obtain the predicted spam probabilities for the training set. Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values. Recall that you can obtain probabilities for CART models by not passing any type parameter to the predict() function, and you can obtain probabilities from a random forest by adding the argument type="prob". For CART and random forest, you need to select the second column of the output of the predict() function, corresponding to the probability of a message being spam.
You may have noticed that training the logistic regression model yielded the messages “algorithm did not converge” and “fitted probabilities numerically 0 or 1 occurred”. Both of these messages often indicate severe overfitting, to the point that many training set observations are fit essentially perfectly by the model (with predicted probabilities numerically equal to 0 or 1). Let’s investigate the predicted probabilities from the logistic regression model.
# Convert the dependent variable
emailsSparse$spam = as.factor(emailsSparse$spam)
# Split the dataset into training and testing sets
set.seed(123)
library(caTools)
spl = sample.split(emailsSparse$spam, 0.7)
train = subset(emailsSparse, spl == TRUE)
test = subset(emailsSparse, spl == FALSE)
# Create the logistic regression model
spamLog = glm(spam~., data=train, family="binomial")
# Create the predictions
predTrainLog = predict(spamLog, type="response")
# Tabulate the predictions
table(predTrainLog < 0.00001)
##
## FALSE TRUE
## 964 3046
3046 training set predicted probabilities from spamLog are less than 0.00001.
# Tabulate the predictions
table(predTrainLog > 0.99999)
##
## FALSE TRUE
## 3056 954
954 training set predicted probabilities from spamLog are more than 0.99999.
# Tabulate the predictions
table(predTrainLog >= 0.00001 & predTrainLog <= 0.99999)
##
## FALSE TRUE
## 4000 10
10 training set predicted probabilities from spamLog are between 0.00001 and 0.99999 (4010 - 3046 - 954 = 10).
# Output the summary
summary(spamLog)
##
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.011 0.000 0.000 0.000 1.354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.81671 10548.74309 -0.003 0.998
## X000 14.73835 10583.79587 0.001 0.999
## X2000 -36.30653 15559.78162 -0.002 0.998
## X2001 -32.14770 13177.68792 -0.002 0.998
## X713 -24.27301 29138.37993 -0.001 0.999
## X853 -1.21231 59416.82728 0.000 1.000
## abl -2.04851 20883.26713 0.000 1.000
## access -14.79724 13353.48989 -0.001 0.999
## account 24.88120 8164.78793 0.003 0.998
## addit 1.46349 27027.11666 0.000 1.000
## address -4.61291 11134.38676 0.000 1.000
## allow 18.99178 6436.37103 0.003 0.998
## alreadi -24.07476 33188.28852 -0.001 0.999
## also 29.89671 13781.79015 0.002 0.998
## analysi -24.05003 38603.03061 -0.001 1.000
## anoth -8.74404 20316.93640 0.000 1.000
## applic -2.64873 16735.57715 0.000 1.000
## appreci -21.44644 27616.28096 -0.001 0.999
## approv -1.30155 15894.77923 0.000 1.000
## april -26.20274 22080.53147 -0.001 0.999
## area 20.40642 22657.77444 0.001 0.999
## arrang 10.69469 21352.21386 0.001 1.000
## ask -7.74592 19763.16826 0.000 1.000
## assist -11.28267 24895.25793 0.000 1.000
## associ 9.04942 19093.54135 0.000 1.000
## attach -10.36592 15343.29985 -0.001 0.999
## attend -34.50552 32573.32268 -0.001 0.999
## avail 8.65114 17094.57157 0.001 1.000
## back -13.23471 22723.02376 -0.001 1.000
## base -13.54255 21218.17140 -0.001 0.999
## begin 22.28011 29731.41257 0.001 0.999
## believ 32.32591 21360.08248 0.002 0.999
## best -8.20054 1333.38661 -0.006 0.995
## better 42.63151 23599.88794 0.002 0.999
## book 4.30072 20235.79190 0.000 1.000
## bring 16.06635 67670.96796 0.000 1.000
## busi -4.80293 10002.28921 0.000 1.000
## buy 41.70188 38923.92521 0.001 0.999
## call -1.14501 11111.06778 0.000 1.000
## can 3.76174 7673.89305 0.000 1.000
## case -33.72402 28804.23279 -0.001 0.999
## chang -27.16799 22152.89068 -0.001 0.999
## check 1.42516 19631.44189 0.000 1.000
## click 13.76120 7076.98961 0.002 0.998
## com 1.93633 4039.20494 0.000 1.000
## come -1.16616 15107.73858 0.000 1.000
## comment -3.25141 33870.01419 0.000 1.000
## communic 15.79546 8958.08782 0.002 0.999
## compani 4.78131 9186.32551 0.001 1.000
## complet -13.62879 20237.90361 -0.001 0.999
## confer -0.75029 8557.36337 0.000 1.000
## confirm -12.99690 15139.72575 -0.001 0.999
## contact 1.53001 12616.52550 0.000 1.000
## continu 14.86611 15351.23871 0.001 0.999
## contract -12.95405 14984.74369 -0.001 0.999
## copi -42.73831 30699.56822 -0.001 0.999
## corp 16.05505 27083.03847 0.001 1.000
## corpor -0.82863 28181.34783 0.000 1.000
## cost -1.93757 18329.88729 0.000 1.000
## cours 16.65262 18338.38154 0.001 0.999
## creat 13.37623 39460.05157 0.000 1.000
## credit 26.17376 13138.00273 0.002 0.998
## crenshaw 99.94406 67692.02756 0.001 0.999
## current 3.62913 17066.24264 0.000 1.000
## custom 18.28821 10079.08744 0.002 0.999
## data -26.09087 22714.27741 -0.001 0.999
## date -2.78615 16985.30607 0.000 1.000
## day -6.09984 5866.28762 -0.001 0.999
## deal -11.29372 14476.48731 -0.001 0.999
## dear -2.31316 23063.89229 0.000 1.000
## depart -40.68465 25092.95410 -0.002 0.999
## deriv -49.71057 35873.67244 -0.001 0.999
## design -7.92306 29388.93892 0.000 1.000
## detail 11.96923 23008.84872 0.001 1.000
## develop 5.97638 9454.56063 0.001 0.999
## differ -2.29290 10749.59972 0.000 1.000
## direct -20.50611 31942.88823 -0.001 0.999
## director -17.69812 17932.01295 -0.001 0.999
## discuss -10.51005 19154.35311 -0.001 1.000
## doc -25.97116 26031.83704 -0.001 0.999
## don 21.28659 14561.06709 0.001 0.999
## done 6.82837 18822.05005 0.000 1.000
## due -4.16267 35316.37257 0.000 1.000
## ect 0.86849 5341.51294 0.000 1.000
## edu -0.21215 691.74099 0.000 1.000
## effect 19.48236 21002.41283 0.001 0.999
## effort 16.05818 56700.57914 0.000 1.000
## either -27.44247 39997.01701 -0.001 0.999
## email 3.83283 11856.54591 0.000 1.000
## end -13.10536 29380.68822 0.000 1.000
## energi -16.19710 16457.87662 -0.001 0.999
## engin 26.64290 23936.07677 0.001 0.999
## enron -8.78876 5718.87819 -0.002 0.999
## etc 0.94697 15694.76515 0.000 1.000
## even -16.53893 22886.63796 -0.001 0.999
## event 16.94185 18505.84730 0.001 0.999
## expect -11.78693 19139.41707 -0.001 1.000
## experi 2.45969 22404.65521 0.000 1.000
## fax 3.53700 33855.88989 0.000 1.000
## feel 2.59590 23476.27698 0.000 1.000
## file -29.43243 21649.57371 -0.001 0.999
## final 8.07492 50075.45250 0.000 1.000
## financ -9.12241 7523.95040 -0.001 0.999
## financi -9.74670 17271.83784 -0.001 1.000
## find -2.62282 9727.09459 0.000 1.000
## first -0.46663 20429.80447 0.000 1.000
## follow 17.65781 3079.68087 0.006 0.995
## form 8.48346 16736.54461 0.001 1.000
## forward -3.48404 18642.93644 0.000 1.000
## free 6.11316 8121.04177 0.001 0.999
## friday -11.46161 19964.73259 -0.001 1.000
## full 21.25102 21904.34008 0.001 0.999
## futur 41.45948 14387.24195 0.003 0.998
## gas -3.90086 4160.29256 -0.001 0.999
## get 5.15375 9737.07069 0.001 1.000
## gibner 29.01185 24595.48183 0.001 0.999
## give -25.18310 21296.83494 -0.001 0.999
## given -21.86413 54264.02633 0.000 1.000
## good 5.39940 16193.42812 0.000 1.000
## great 12.21940 10901.07901 0.001 0.999
## group 0.52639 10371.47801 0.000 1.000
## happi 0.01939 12018.68812 0.000 1.000
## hear 28.86533 22809.11427 0.001 0.999
## hello 21.65549 13606.73123 0.002 0.999
## help 17.30963 2790.89981 0.006 0.995
## high -1.98198 25536.23275 0.000 1.000
## home 5.97294 8964.82707 0.001 0.999
## hope -14.35451 21794.88576 -0.001 0.999
## hou 6.85153 6436.89472 0.001 0.999
## hour 2.47799 13334.90035 0.000 1.000
## houston -18.54502 7305.03681 -0.003 0.998
## howev -34.49274 35618.85713 -0.001 0.999
## http 25.27938 21071.12399 0.001 0.999
## idea -18.44864 38918.50700 0.000 1.000
## immedi 62.85329 33464.69294 0.002 0.999
## import -1.85930 22364.33823 0.000 1.000
## includ -3.45439 17988.89125 0.000 1.000
## increas 6.47593 23286.64042 0.000 1.000
## industri -31.60069 23734.81080 -0.001 0.999
## info -1.25474 4857.12017 0.000 1.000
## inform 20.78075 8549.02454 0.002 0.998
## interest 26.98037 11587.59215 0.002 0.998
## intern -7.99071 33512.78147 0.000 1.000
## internet 8.74897 10999.92712 0.001 0.999
## interview -16.40484 18733.97043 -0.001 0.999
## invest 32.01252 23934.41479 0.001 0.999
## invit 4.30368 22150.24289 0.000 1.000
## involv 38.14864 33152.60845 0.001 0.999
## issu -37.08367 33960.70787 -0.001 0.999
## john -0.53256 28562.06741 0.000 1.000
## join -38.24082 23338.62282 -0.002 0.999
## juli -13.57779 30093.27084 0.000 1.000
## just -10.21157 11140.82560 -0.001 0.999
## kaminski -18.11964 6029.07127 -0.003 0.998
## keep 18.66596 27816.06998 0.001 0.999
## kevin -37.79040 47379.74713 -0.001 0.999
## know 12.77077 15263.56770 0.001 0.999
## last 1.04644 13724.44714 0.000 1.000
## let -27.63338 14620.67500 -0.002 0.998
## life 58.12464 38643.08273 0.002 0.999
## like 5.64936 7659.87875 0.001 0.999
## line 8.74324 12361.53963 0.001 0.999
## link -6.92851 13446.94610 -0.001 1.000
## list -8.69209 2148.97953 -0.004 0.997
## locat 20.72567 15965.71676 0.001 0.999
## london 6.74530 16419.73479 0.000 1.000
## long -14.89135 19336.44934 -0.001 0.999
## look -7.03074 15631.44591 0.000 1.000
## lot -19.63678 13211.37522 -0.001 0.999
## made 2.82049 27432.26185 0.000 1.000
## mail 7.58373 10210.95687 0.001 0.999
## make 29.00542 15276.35270 0.002 0.998
## manag 6.01449 14452.54495 0.000 1.000
## mani 18.85052 14418.02739 0.001 0.999
## mark -33.50071 32080.87051 -0.001 0.999
## market 7.89523 8012.29528 0.001 0.999
## may -9.43386 13969.56515 -0.001 0.999
## mean 0.60776 29518.71186 0.000 1.000
## meet -1.06259 12633.55749 0.000 1.000
## member 13.81301 23429.90857 0.001 1.000
## mention -22.78594 27136.91573 -0.001 0.999
## messag 17.15699 2561.57560 0.007 0.995
## might 12.44156 17533.00513 0.001 0.999
## model -22.92334 10487.34692 -0.002 0.998
## monday -1.03402 32330.80963 0.000 1.000
## money 32.63552 13212.06828 0.002 0.998
## month -3.72670 11123.66899 0.000 1.000
## morn -26.44760 34027.89144 -0.001 0.999
## move -38.33622 30112.46626 -0.001 0.999
## much 0.37747 13921.57766 0.000 1.000
## name 16.72141 13218.44812 0.001 0.999
## need 0.84367 12207.61715 0.000 1.000
## net 12.56157 21972.81289 0.001 1.000
## new 1.00331 10091.52526 0.000 1.000
## next. 14.92299 17244.68652 0.001 0.999
## note 14.46034 22937.89167 0.001 0.999
## now 37.89680 12190.24904 0.003 0.998
## number -9.62184 15914.59792 -0.001 1.000
## offer 11.73834 10837.22113 0.001 0.999
## offic -13.44163 23114.72339 -0.001 1.000
## one 12.41238 6652.03196 0.002 0.999
## onlin 35.88623 16649.73495 0.002 0.998
## open 21.14171 29613.79926 0.001 0.999
## oper -16.95704 27565.60102 -0.001 1.000
## opportun -4.13117 19183.21135 0.000 1.000
## option -1.08516 9325.32428 0.000 1.000
## order 6.53265 12424.08477 0.001 1.000
## origin 32.26280 38175.24720 0.001 0.999
## part 4.59427 34830.42984 0.000 1.000
## particip -11.54271 17383.30582 -0.001 0.999
## peopl -18.63789 14389.74787 -0.001 0.999
## per 13.67495 12732.83389 0.001 0.999
## person 18.69761 9575.47655 0.002 0.998
## phone -6.95663 11717.69170 -0.001 1.000
## place 9.00530 36608.96507 0.000 1.000
## plan -18.30364 6320.49885 -0.003 0.998
## pleas -7.96138 9484.46386 -0.001 0.999
## point 5.49836 34025.65614 0.000 1.000
## posit -15.43111 23155.99226 -0.001 0.999
## possibl -13.65960 24918.15730 -0.001 1.000
## power -5.64308 11727.15930 0.000 1.000
## present -6.16295 12775.05633 0.000 1.000
## price 3.42759 7849.85957 0.000 1.000
## problem 12.62018 9763.03191 0.001 0.999
## process -0.29572 11905.84841 0.000 1.000
## product 10.15835 13447.64033 0.001 0.999
## program 1.44411 11831.16188 0.000 1.000
## project 2.17330 14973.05155 0.000 1.000
## provid 0.24225 18589.08726 0.000 1.000
## public -52.49850 23410.58227 -0.002 0.998
## put -10.51886 26812.43218 0.000 1.000
## question -34.67470 18588.44086 -0.002 0.999
## rate -3.11213 13189.56979 0.000 1.000
## read -15.27446 21446.74926 -0.001 0.999
## real 20.45912 23580.85242 0.001 0.999
## realli -26.66848 46403.45625 -0.001 1.000
## receiv 0.57652 15848.49610 0.000 1.000
## recent -2.06668 17795.16989 0.000 1.000
## regard -3.66813 15110.01493 0.000 1.000
## relat -51.13833 17926.46118 -0.003 0.998
## remov 23.25452 24837.86579 0.001 0.999
## repli 15.37977 29155.61883 0.001 1.000
## report -14.82125 14769.91974 -0.001 0.999
## request -12.31889 11669.66111 -0.001 0.999
## requir 0.50042 29365.45474 0.000 1.000
## research -28.25897 15526.46633 -0.002 0.999
## resourc -27.34889 35221.06048 -0.001 0.999
## respond 29.74186 38879.30348 0.001 0.999
## respons -19.59598 36666.00577 -0.001 1.000
## result -0.50024 31401.05156 0.000 1.000
## resum -9.21906 20996.14073 0.000 1.000
## return 17.45096 18435.18761 0.001 0.999
## review -4.82452 10132.79683 0.000 1.000
## right 23.11851 15904.45788 0.001 0.999
## risk -4.00079 17177.99841 0.000 1.000
## robert -20.95504 29071.43181 -0.001 0.999
## run -51.62204 44337.51560 -0.001 0.999
## say 7.36621 22174.24418 0.000 1.000
## schedul 1.91913 35796.84272 0.000 1.000
## school -3.87014 28823.46891 0.000 1.000
## secur -16.03677 2200.71431 -0.007 0.994
## see -11.19904 12932.46795 -0.001 0.999
## send -24.26771 12224.21338 -0.002 0.998
## sent -14.88198 21953.79637 -0.001 0.999
## servic -7.16432 12351.22106 -0.001 1.000
## set -9.35324 26268.89516 0.000 1.000
## sever 20.41198 30927.28109 0.001 0.999
## shall 19.29869 30748.77616 0.001 0.999
## shirley -71.32873 63289.37737 -0.001 0.999
## short -8.97353 17207.51481 -0.001 1.000
## sinc -3.43847 35455.98205 0.000 1.000
## sincer -20.73171 35145.26470 -0.001 1.000
## site 8.68864 14955.35264 0.001 1.000
## softwar 25.74855 10593.09469 0.002 0.998
## soon 23.49750 37313.28390 0.001 0.999
## sorri 6.03563 22992.82314 0.000 1.000
## special 17.77075 27552.36443 0.001 0.999
## specif -23.36688 30834.20294 -0.001 0.999
## start 14.37480 18972.26951 0.001 0.999
## state 12.20754 16772.13151 0.001 0.999
## still 3.87790 26222.21248 0.000 1.000
## stinson -43.45351 26967.01750 -0.002 0.999
## student -18.14731 21856.41556 -0.001 0.999
## subject 30.41125 10548.74309 0.003 0.998
## success 4.34358 27830.47372 0.000 1.000
## suggest -38.42169 44745.18597 -0.001 0.999
## support -15.39269 19761.55243 -0.001 0.999
## sure -5.50273 20777.09818 0.000 1.000
## system 3.77801 9148.65860 0.000 1.000
## take 5.73138 17156.12167 0.000 1.000
## talk -10.10574 20206.41806 -0.001 1.000
## team 7.94049 25703.84987 0.000 1.000
## term 20.13285 23031.54376 0.001 0.999
## thank -38.90473 10586.96129 -0.004 0.997
## thing 25.78599 13405.17195 0.002 0.998
## think -12.18122 20772.99986 -0.001 1.000
## thought 12.43295 30228.11251 0.000 1.000
## thursday -14.91355 32617.92027 0.000 1.000
## time -5.92102 8334.70945 -0.001 0.999
## today -17.61557 19649.57463 -0.001 0.999
## togeth -23.54813 18689.97232 -0.001 0.999
## trade -17.55016 14825.10071 -0.001 0.999
## tri 0.92783 12819.63747 0.000 1.000
## tuesday -28.08297 39588.86870 -0.001 0.999
## two -25.72666 18439.43987 -0.001 0.999
## type -14.47371 27548.25790 -0.001 1.000
## understand 9.30723 23416.65694 0.000 1.000
## unit -4.02049 30080.64655 0.000 1.000
## univers 12.27580 21969.41146 0.001 1.000
## updat -15.09781 14480.71856 -0.001 0.999
## use -13.85349 9381.76315 -0.001 0.999
## valu 0.90239 13599.59155 0.000 1.000
## version -36.06359 29386.80360 -0.001 0.999
## vinc -37.34756 8647.15534 -0.004 0.997
## visit 25.84604 11697.85338 0.002 0.998
## vkamin -66.48981 57028.76975 -0.001 0.999
## want -2.55510 11057.56463 0.000 1.000
## way 13.38972 11375.38536 0.001 0.999
## web 2.79074 16859.81655 0.000 1.000
## websit -25.62659 18475.02800 -0.001 0.999
## wednesday -15.26360 26422.76449 -0.001 1.000
## week -6.79505 10458.98638 -0.001 0.999
## well -22.21928 9713.40116 -0.002 0.998
## will -11.19383 5980.47999 -0.002 0.999
## wish 11.73089 31747.37935 0.000 1.000
## within 29.00289 21632.49653 0.001 0.999
## without 19.41978 17628.74297 0.001 0.999
## work -10.98745 11596.31706 -0.001 0.999
## write 44.06181 28249.11863 0.002 0.999
## www -7.86715 22237.59888 0.000 1.000
## year -10.10293 10394.69041 -0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4409.49 on 4009 degrees of freedom
## Residual deviance: 13.46 on 3679 degrees of freedom
## AIC: 675.46
##
## Number of Fisher Scoring iterations: 25
0 variables are labeled as significant at the 0.05 level: every coefficient has an enormous standard error and a p-value near 1.
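This can be verified programmatically from the coefficient table returned by summary(); a minimal sketch (the 0.05 cutoff is the usual significance convention):
# Count coefficients with a p-value below 0.05 in the logistic regression summary
coefs = summary(spamLog)$coefficients
sum(coefs[, "Pr(>|z|)"] < 0.05)   # 0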
# Tabulate the spam in the training set and predTrainLog
a = table(train$spam, predTrainLog > 0.5)
kable(a)

| | FALSE | TRUE |
|---|---|---|
| 0 | 3052 | 0 |
| 1 | 4 | 954 |
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9990025
Training Set Accuracy = 0.9990025
# Calculate the training set AUC
library(ROCR)
ROCRpred = prediction(predTrainLog, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9999959
Training Set AUC of spamLog = 0.9999959
library(rpart)
library(rpart.plot)
spamCART = rpart(spam ~ ., data=train, method="class")
# Plot of CART Model
prp(spamCART)
“vinc” and “enron” appear in the CART tree.
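The split variables can also be listed without reading the plot; a minimal sketch that inspects the fitted tree’s frame (rpart marks leaf rows with "<leaf>"):
# List the variables actually used for splits in the CART tree
splitVars = spamCART$frame$var
unique(as.character(splitVars[splitVars != "<leaf>"]))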
# Make predictions on the training set
predTrainCART = predict(spamCART)[,2]
# Tabulate the spam in the training set and predictions
a = table(train$spam, predTrainCART > 0.5)
kable(a)

| | FALSE | TRUE |
|---|---|---|
| 0 | 2885 | 167 |
| 1 | 64 | 894 |
# Calculate the accuracy
sum(diag(a))/(sum(a))
## [1] 0.942394
Training Set Accuracy of spamCART = 0.942394
# Calculate the training set AUC
library(ROCR)
ROCRpred = prediction(predTrainCART, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9696044
Training Set AUC of spamCART = 0.9696044
# Implement the Random Forest (RF) algorithm
library(randomForest)
set.seed(123)
spamRF = randomForest(spam~., data=train)
# Make predictions using the RF model
predTrainRF = predict(spamRF, type="prob")[,2]
# Tabulate the spam in the training set and predictions
a = table(train$spam, predTrainRF > 0.5)
kable(a)

| | FALSE | TRUE |
|---|---|---|
| 0 | 3015 | 37 |
| 1 | 42 | 916 |
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9802993
Training Set Accuracy = 0.9802993
# Calculate the training set AUC
library(ROCR)
ROCRpred = prediction(predTrainRF, train$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9978155
Training Set AUC of spamRF = 0.9978155
# Make predictions using Logistic Regression
predTestLog = predict(spamLog, newdata = test, type="response")
# Tabulate the spam in the testing set and predictions
a = table(test$spam, predTestLog > 0.5)
kable(a)

| | FALSE | TRUE |
|---|---|---|
| 0 | 1257 | 51 |
| 1 | 34 | 376 |
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9505239
Testing Set Accuracy = 0.9505239
# Calculate the testing set AUC
library(ROCR)
ROCRpred = prediction(predTestLog, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9627517
Testing Set AUC of spamLog = 0.9627517
# Make predictions using the CART model
predTestCART = predict(spamCART, newdata = test)[,2]
# Tabulate the spam in the testing set and predTestCART
a = table(test$spam, predTestCART > 0.5)
kable(a)

| | FALSE | TRUE |
|---|---|---|
| 0 | 1228 | 80 |
| 1 | 24 | 386 |
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9394645
Testing Set Accuracy = 0.9394645
# Calculate the testing set AUC of spamCART
library(ROCR)
ROCRpred = prediction(predTestCART, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.963176
Testing Set AUC of spamCART = 0.963176
# Make predictions using random forest
predTestRF = predict(spamRF, newdata = test, type="prob")[,2]
# Tabulate the spam in the testing set and predTestRF
a = table(test$spam, predTestRF > 0.5)
kable(a)

| | FALSE | TRUE |
|---|---|---|
| 0 | 1291 | 17 |
| 1 | 23 | 387 |
# Compute the accuracy
sum(diag(a))/(sum(a))
## [1] 0.9767171
Testing Set Accuracy = 0.9767171
# Calculate the testing set AUC of spamRF
library(ROCR)
ROCRpred = prediction(predTestRF, test$spam)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9975899
Testing Set AUC of spamRF = 0.9975899