packages = c(
  "dplyr","ggplot2","caTools","tm","SnowballC","ROCR","rpart",
  "rpart.plot","randomForest")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
rm(list=ls(all=TRUE))
Sys.setlocale("LC_ALL","C")
## [1] "C"
options(digits=5, scipen=10)

library(dplyr)
library(tm)
library(SnowballC)
library(ROCR)
library(caTools)
library(rpart)
library(rpart.plot)
library(randomForest)


Problem 1 - Exploration

Begin by loading the dataset emails.csv into a data frame called emails. Remember to pass the stringsAsFactors=FALSE option when loading the data.

emails = read.csv("data/emails.csv", stringsAsFactors = FALSE)
1.1 How many emails are in the dataset?
  • 5728
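For reference, the answer is just the row count of the data frame:

nrow(emails)
## [1] 5728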
1.2 How many of the emails are spam?
table(emails$spam)
## 
##    0    1 
## 4360 1368
  • 1368
1.3 Which word appears at the beginning of every email in the dataset?
substr(emails$text[1:5], 1, 60) #Extract or replace substrings in a character vector.
## [1] "Subject: naturally irresistible your corporate identity  lt "
## [2] "Subject: the stock trading gunslinger  fanny is merrill but "
## [3] "Subject: unbelievable new homes made easy  im wanting to sho"
## [4] "Subject: 4 color printing special  request additional inform"
## [5] "Subject: do not have money , get software cds from here !  s"
#head(emails$text)
  • Subject (every email in the dataset begins with the word "Subject")
1.4 Words in every document

【P1.4】Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?

  • Yes – the number of times the word appears might help us differentiate spam from ham.
1.5 How many characters are in the longest email?
nchar(emails$text) %>% max
## [1] 43952
1.6 Which row contains the shortest email in the dataset?
nchar(emails$text) %>% which.min
## [1] 1992
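If desired, that email can be printed for inspection:

emails$text[which.min(nchar(emails$text))] # output omitted here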

Problem 2 - Preparing the Corpus

2.1 Corpus and DTM

Follow the standard steps to build and pre-process the corpus:

    1. Build a new corpus variable called corpus.
    2. Using tm_map, convert the text to lowercase.
    3. Using tm_map, remove all punctuation from the corpus.
    4. Using tm_map, remove all English stopwords from the corpus.
    5. Using tm_map, stem the words in the corpus.
    6. Build a document term matrix from the corpus, called dtm.

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).
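A quick sanity check before the removal step (per the note above, this should return 174):

length(stopwords("english"))
## [1] 174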

corpus = Corpus(VectorSource(emails$text))
corpus = tm_map(corpus, content_transformer(tolower)) 
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus = tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation
## drops documents
corpus = tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus = tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents

(The repeated "transformation drops documents" warnings are a known quirk of tm's SimpleCorpus and can be safely ignored; no documents are actually dropped.)

dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)

How many terms are in dtm?

  • 28687
2.2 Remove less frequent words

To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents, and store this result as spdtm (don’t overwrite dtm, because we will use it in a later step of this homework).

spdtm = removeSparseTerms(dtm, 0.95) # the 0.95 sparsity threshold keeps terms appearing in at least 5% of documents
spdtm
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity           : 89%
## Maximal term length: 10
## Weighting          : term frequency (tf)

【P2.2】How many terms are in spdtm?

  • 330
2.3 Build data frame

Build a data frame ems from spdtm (ems stands in for emailsSparse in the problem statements).

ems = as.data.frame(as.matrix(spdtm))

【P2.3】What is the most frequent word in spdtm?

colSums(ems) %>% sort %>% tail
##     hou    will    vinc subject     ect   enron 
##    5577    8252    8532   10202   11427   13388
  • enron
2.4 Most frequent words in HAM emails

Add a variable called “spam” to emailsSparse containing the email spam labels. You can do this by copying over the “spam” variable from the original data frame (remember how we did this in the Twitter lecture).

Incorporate target variable spam

ems$spam = emails$spam

How many word stems appear at least 5000 times in the ham emails in the dataset? Hint: in this and the next question, remember not to count the dependent variable we just added.

# Ham emails are the non-spam emails, so select rows with spam == 0
subset(ems, spam==0) %>% colSums %>% sort %>% tail(7)
##    2000     hou    will    vinc subject     ect   enron 
##    4935    5569    6802    8531    8625   11417   13388
  • 6
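Equivalently, the count can be computed directly in the same pipe style, dropping the dependent variable first (this should return 6):

subset(ems, spam==0) %>% select(-spam) %>% colSums %>% {sum(. >= 5000)}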
2.5 Most frequent words in SPAM emails

【P2.5】How many word stems appear at least 1000 times in the spam emails in the dataset?

subset(ems, spam==1) %>% colSums %>% sort %>% tail()
##    mail     com compani    spam    will subject 
##     917     999    1065    1368    1450    1577
  • 3
  • Note that "spam" is the dependent variable we just added, not the frequency of a word stem, so its count of 1368 is excluded.
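The analogous direct count for the spam emails (this should return 3):

subset(ems, spam==1) %>% select(-spam) %>% colSums %>% {sum(. >= 1000)}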
2.6 Observation 1

【P2.6】The lists of most common words are significantly different between the spam and ham emails. What does this likely imply?

  • The frequencies of these most common words are likely to help differentiate between spam and ham.
2.7 Observation 2

【P2.7】Several of the most common word stems from the ham documents, such as “enron”, “hou” (short for Houston), “vinc” (the word stem of “Vince”) and “kaminski”, are likely specific to Vincent Kaminski’s inbox. What does this mean about the applicability of the text analytics models we will train for the spam filtering problem?

  • The models we build are personalized, and would need to be further tested before being used as a spam filter for another person.
  • The ham dataset is certainly personalized to Vincent Kaminski, and therefore it might not generalize well to a general email user. Caution is definitely necessary before applying the filters derived in this problem to other email users. The data are clearly quite personalized: the model may be quite accurate for this individual, but it lacks generality.


Problem 3 - Building machine learning models

First, convert the dependent variable to a factor with "emailsSparse$spam = as.factor(emailsSparse$spam)".

Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called “train” and a testing set called “test”. Make sure to perform this step on emailsSparse instead of emails.

Split the data first (the code uses tr and ts for the train and test sets)

set.seed(123)
ems$spam = as.factor(ems$spam)
# make.names() makes the column names syntactically valid; without this,
# randomForest's formula interface fails on names like "next" or stems that
# start with a digit (they become next., X000, X2001, ... in the output below).
names(ems) = make.names(names(ems))

spl = sample.split(ems$spam, SplitRatio = 0.7)
tr = subset(ems, spl)
ts = subset(ems, !spl)

and build three models (the code below uses the short names glm, cart, and rf for spamLog, spamCART, and spamRF):

  1. A logistic regression model called spamLog, using all of the word stems as independent variables.

  2. A CART model called spamCART, using the default parameters to train the model (don't worry about adding minbucket or cp). Remember to add the argument method="class" since this is a binary classification problem.

  3. A random forest model called spamRF, using the default parameters to train the model (don't worry about specifying ntree or nodesize). Directly before training the random forest model, set the random seed to 123 (even though we've already done this earlier in the problem, it's important to set the seed right before training the model so we all obtain the same results; keep in mind though that on certain operating systems, your results might still be slightly different).

#1 GLM
glm = glm(spam~., tr, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#2 CART
cart= rpart(spam~., tr, method = "class")

#3 random forest
set.seed(123)
rf= randomForest(spam~., data=tr)
3.1 Prediction of Logistic Model
predglm = predict(glm, type="response")

【P3.1a】 How many of the training set predicted probabilities from spamLog are less than 0.00001?

sum(predglm < 0.00001)
## [1] 3046

【P3.1b】 How many of the training set predicted probabilities from spamLog are more than 0.99999?

sum(predglm>0.99999)
## [1] 954

【P3.1c】 How many of the training set predicted probabilities from spamLog are between 0.00001 and 0.99999?

sum(predglm > 0.00001 & predglm < 0.99999)
## [1] 10
3.2 Significant predictors in the GLM model

【P3.2】How many variables are labeled as significant (at the p=0.05 level) in the logistic regression summary output?

summary(glm)
## 
## Call:
## glm(formula = spam ~ ., family = "binomial", data = tr)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -1.01    0.00    0.00    0.00    1.35  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -30.8167 10548.7431    0.00     1.00
## busi           -4.8029 10002.2892    0.00     1.00
## chang         -27.1680 22152.8907    0.00     1.00
## compani         4.7813  9186.3255    0.00     1.00
## corpor         -0.8286 28181.3478    0.00     1.00
## day            -6.0998  5866.2876    0.00     1.00
## done            6.8284 18822.0500    0.00     1.00
## effect         19.4824 21002.4128    0.00     1.00
## effort         16.0582 56700.5791    0.00     1.00
## even          -16.5389 22886.6379    0.00     1.00
## full           21.2510 21904.3401    0.00     1.00
## good            5.3994 16193.4281    0.00     1.00
## inform         20.7808  8549.0245    0.00     1.00
## interest       26.9804 11587.5921    0.00     1.00
## list           -8.6921  2148.9795    0.00     1.00
## look           -7.0307 15631.4459    0.00     1.00
## made            2.8205 27432.2618    0.00     1.00
## make           29.0054 15276.3527    0.00     1.00
## manag           6.0145 14452.5449    0.00     1.00
## market          7.8952  8012.2953    0.00     1.00
## much            0.3775 13921.5777    0.00     1.00
## order           6.5327 12424.0848    0.00     1.00
## origin         32.2628 38175.2472    0.00     1.00
## product        10.1584 13447.6403    0.00     1.00
## provid          0.2423 18589.0873    0.00     1.00
## realli        -26.6685 46403.4563    0.00     1.00
## result         -0.5002 31401.0516    0.00     1.00
## see           -11.1990 12932.4679    0.00     1.00
## special        17.7707 27552.3644    0.00     1.00
## subject        30.4112 10548.7431    0.00     1.00
## system          3.7780  9148.6586    0.00     1.00
## use           -13.8535  9381.7631    0.00     1.00
## websit        -25.6266 18475.0280    0.00     1.00
## will          -11.1938  5980.4800    0.00     1.00
## within         29.0029 21632.4965    0.00     1.00
## without        19.4198 17628.7430    0.00     1.00
## continu        14.8661 15351.2387    0.00     1.00
## group           0.5264 10371.4780    0.00     1.00
## like            5.6494  7659.8787    0.00     1.00
## trade         -17.5502 14825.1007    0.00     1.00
## tri             0.9278 12819.6375    0.00     1.00
## approv         -1.3015 15894.7792    0.00     1.00
## ask            -7.7459 19763.1683    0.00     1.00
## complet       -13.6288 20237.9036    0.00     1.00
## credit         26.1738 13138.0027    0.00     1.00
## form            8.4835 16736.5446    0.00     1.00
## hear           28.8653 22809.1143    0.00     1.00
## home            5.9729  8964.8271    0.00     1.00
## new             1.0033 10091.5253    0.00     1.00
## offer          11.7383 10837.2211    0.00     1.00
## opportun       -4.1312 19183.2113    0.00     1.00
## rate           -3.1121 13189.5698    0.00     1.00
## take            5.7314 17156.1217    0.00     1.00
## time           -5.9210  8334.7095    0.00     1.00
## visit          25.8460 11697.8534    0.00     1.00
## want           -2.5551 11057.5646    0.00     1.00
## way            13.3897 11375.3854    0.00     1.00
## addit           1.4635 27027.1167    0.00     1.00
## click          13.7612  7076.9896    0.00     1.00
## com             1.9363  4039.2049    0.00     1.00
## fax             3.5370 33855.8899    0.00     1.00
## mail            7.5837 10210.9569    0.00     1.00
## messag         17.1570  2561.5756    0.01     0.99
## now            37.8968 12190.2490    0.00     1.00
## phone          -6.9566 11717.6917    0.00     1.00
## request       -12.3189 11669.6611    0.00     1.00
## version       -36.0636 29386.8036    0.00     1.00
## best           -8.2005  1333.3866   -0.01     1.00
## end           -13.1054 29380.6882    0.00     1.00
## get             5.1538  9737.0707    0.00     1.00
## great          12.2194 10901.0790    0.00     1.00
## money          32.6355 13212.0683    0.00     1.00
## softwar        25.7485 10593.0947    0.00     1.00
## custom         18.2882 10079.0874    0.00     1.00
## hello          21.6555 13606.7312    0.00     1.00
## one            12.4124  6652.0320    0.00     1.00
## onlin          35.8862 16649.7350    0.00     1.00
## pleas          -7.9614  9484.4639    0.00     1.00
## access        -14.7972 13353.4899    0.00     1.00
## account        24.8812  8164.7879    0.00     1.00
## allow          18.9918  6436.3710    0.00     1.00
## alreadi       -24.0748 33188.2885    0.00     1.00
## also           29.8967 13781.7901    0.00     1.00
## applic         -2.6487 16735.5771    0.00     1.00
## area           20.4064 22657.7744    0.00     1.00
## assist        -11.2827 24895.2579    0.00     1.00
## base          -13.5426 21218.1714    0.00     1.00
## believ         32.3259 21360.0825    0.00     1.00
## buy            41.7019 38923.9252    0.00     1.00
## can             3.7617  7673.8931    0.00     1.00
## cost           -1.9376 18329.8873    0.00     1.00
## creat          13.3762 39460.0516    0.00     1.00
## current         3.6291 17066.2426    0.00     1.00
## design         -7.9231 29388.9389    0.00     1.00
## develop         5.9764  9454.5606    0.00     1.00
## differ         -2.2929 10749.5997    0.00     1.00
## director      -17.6981 17932.0129    0.00     1.00
## discuss       -10.5101 19154.3531    0.00     1.00
## due            -4.1627 35316.3726    0.00     1.00
## email           3.8328 11856.5459    0.00     1.00
## event          16.9419 18505.8473    0.00     1.00
## expect        -11.7869 19139.4171    0.00     1.00
## file          -29.4324 21649.5737    0.00     1.00
## forward        -3.4840 18642.9364    0.00     1.00
## futur          41.4595 14387.2419    0.00     1.00
## gas            -3.9009  4160.2926    0.00     1.00
## give          -25.1831 21296.8349    0.00     1.00
## given         -21.8641 54264.0263    0.00     1.00
## high           -1.9820 25536.2327    0.00     1.00
## import         -1.8593 22364.3382    0.00     1.00
## includ         -3.4544 17988.8912    0.00     1.00
## increas         6.4759 23286.6404    0.00     1.00
## industri      -31.6007 23734.8108    0.00     1.00
## invest         32.0125 23934.4148    0.00     1.00
## involv         38.1486 33152.6085    0.00     1.00
## just          -10.2116 11140.8256    0.00     1.00
## know           12.7708 15263.5677    0.00     1.00
## locat          20.7257 15965.7168    0.00     1.00
## mani           18.8505 14418.0274    0.00     1.00
## may            -9.4339 13969.5651    0.00     1.00
## mean            0.6078 29518.7119    0.00     1.00
## mention       -22.7859 27136.9157    0.00     1.00
## might          12.4416 17533.0051    0.00     1.00
## month          -3.7267 11123.6690    0.00     1.00
## need            0.8437 12207.6171    0.00     1.00
## note           14.4603 22937.8916    0.00     1.00
## number         -9.6218 15914.5979    0.00     1.00
## offic         -13.4416 23114.7234    0.00     1.00
## oper          -16.9570 27565.6010    0.00     1.00
## person         18.6976  9575.4766    0.00     1.00
## posit         -15.4311 23155.9923    0.00     1.00
## possibl       -13.6596 24918.1573    0.00     1.00
## present        -6.1630 12775.0563    0.00     1.00
## price           3.4276  7849.8596    0.00     1.00
## problem        12.6202  9763.0319    0.00     1.00
## process        -0.2957 11905.8484    0.00     1.00
## project         2.1733 14973.0515    0.00     1.00
## read          -15.2745 21446.7493    0.00     1.00
## relat         -51.1383 17926.4612    0.00     1.00
## report        -14.8212 14769.9198    0.00     1.00
## requir          0.5004 29365.4547    0.00     1.00
## research      -28.2590 15526.4663    0.00     1.00
## resourc       -27.3489 35221.0605    0.00     1.00
## return         17.4510 18435.1876    0.00     1.00
## review         -4.8245 10132.7968    0.00     1.00
## risk           -4.0008 17177.9984    0.00     1.00
## secur         -16.0368  2200.7143   -0.01     0.99
## servic         -7.1643 12351.2210    0.00     1.00
## set            -9.3532 26268.8952    0.00     1.00
## short          -8.9735 17207.5148    0.00     1.00
## specif        -23.3669 30834.2029    0.00     1.00
## state          12.2075 16772.1315    0.00     1.00
## term           20.1329 23031.5438    0.00     1.00
## thing          25.7860 13405.1719    0.00     1.00
## today         -17.6156 19649.5746    0.00     1.00
## two           -25.7267 18439.4399    0.00     1.00
## understand      9.3072 23416.6569    0.00     1.00
## unit           -4.0205 30080.6466    0.00     1.00
## well          -22.2193  9713.4012    0.00     1.00
## work          -10.9874 11596.3171    0.00     1.00
## hour            2.4780 13334.9003    0.00     1.00
## lot           -19.6368 13211.3752    0.00     1.00
## real           20.4591 23580.8524    0.00     1.00
## right          23.1185 15904.4579    0.00     1.00
## start          14.3748 18972.2695    0.00     1.00
## X000           14.7384 10583.7959    0.00     1.00
## X2001         -32.1477 13177.6879    0.00     1.00
## follow         17.6578  3079.6809    0.01     1.00
## name           16.7214 13218.4481    0.00     1.00
## sent          -14.8820 21953.7964    0.00     1.00
## last            1.0464 13724.4471    0.00     1.00
## avail           8.6511 17094.5716    0.00     1.00
## first          -0.4666 20429.8045    0.00     1.00
## http           25.2794 21071.1240    0.00     1.00
## join          -38.2408 23338.6228    0.00     1.00
## line            8.7432 12361.5396    0.00     1.00
## next.          14.9230 17244.6865    0.00     1.00
## remov          23.2545 24837.8658    0.00     1.00
## repli          15.3798 29155.6188    0.00     1.00
## wish           11.7309 31747.3794    0.00     1.00
## www            -7.8672 22237.5989    0.00     1.00
## year          -10.1029 10394.6904    0.00     1.00
## back          -13.2347 22723.0238    0.00     1.00
## internet        8.7490 10999.9271    0.00     1.00
## member         13.8130 23429.9085    0.00     1.00
## receiv          0.5765 15848.4961    0.00     1.00
## site            8.6886 14955.3526    0.00     1.00
## anoth          -8.7440 20316.9364    0.00     1.00
## associ          9.0494 19093.5413    0.00     1.00
## comment        -3.2514 33870.0142    0.00     1.00
## corp           16.0550 27083.0385    0.00     1.00
## date           -2.7862 16985.3060    0.00     1.00
## find           -2.6228  9727.0946    0.00     1.00
## free            6.1132  8121.0418    0.00     1.00
## issu          -37.0837 33960.7079    0.00     1.00
## long          -14.8913 19336.4493    0.00     1.00
## move          -38.3362 30112.4663    0.00     1.00
## particip      -11.5427 17383.3058    0.00     1.00
## recent         -2.0667 17795.1699    0.00     1.00
## respons       -19.5960 36666.0058    0.00     1.00
## say             7.3662 22174.2442    0.00     1.00
## week           -6.7950 10458.9864    0.00     1.00
## dear           -2.3132 23063.8923    0.00     1.00
## regard         -3.6681 15110.0149    0.00     1.00
## thank         -38.9047 10586.9613    0.00     1.00
## address        -4.6129 11134.3868    0.00     1.00
## contact         1.5300 12616.5255    0.00     1.00
## engin          26.6429 23936.0768    0.00     1.00
## etc             0.9470 15694.7652    0.00     1.00
## immedi         62.8533 33464.6929    0.00     1.00
## net            12.5616 21972.8129    0.00     1.00
## per            13.6749 12732.8339    0.00     1.00
## place           9.0053 36608.9650    0.00     1.00
## respond        29.7419 38879.3034    0.00     1.00
## sincer        -20.7317 35145.2647    0.00     1.00
## type          -14.4737 27548.2578    0.00     1.00
## come           -1.1662 15107.7386    0.00     1.00
## confirm       -12.9969 15139.7258    0.00     1.00
## analysi       -24.0500 38603.0306    0.00     1.00
## bring          16.0664 67670.9680    0.00     1.00
## call           -1.1450 11111.0678    0.00     1.00
## data          -26.0909 22714.2774    0.00     1.00
## detail         11.9692 23008.8487    0.00     1.00
## happi           0.0194 12018.6881    0.00     1.00
## idea          -18.4486 38918.5070    0.00     1.00
## info           -1.2547  4857.1202    0.00     1.00
## send          -24.2677 12224.2134    0.00     1.00
## success         4.3436 27830.4737    0.00     1.00
## sure           -5.5027 20777.0982    0.00     1.00
## team            7.9405 25703.8499    0.00     1.00
## web             2.7907 16859.8165    0.00     1.00
## don            21.2866 14561.0671    0.00     1.00
## copi          -42.7383 30699.5682    0.00     1.00
## help           17.3096  2790.8998    0.01     1.00
## part            4.5943 34830.4298    0.00     1.00
## life           58.1246 38643.0827    0.00     1.00
## meet           -1.0626 12633.5575    0.00     1.00
## sever          20.4120 30927.2811    0.00     1.00
## question      -34.6747 18588.4409    0.00     1.00
## write          44.0618 28249.1186    0.00     1.00
## think         -12.1812 20772.9999    0.00     1.00
## point           5.4984 34025.6561    0.00     1.00
## let           -27.6334 14620.6750    0.00     1.00
## link           -6.9285 13446.9461    0.00     1.00
## communic       15.7955  8958.0878    0.00     1.00
## contract      -12.9540 14984.7437    0.00     1.00
## either        -27.4425 39997.0170    0.00     1.00
## final           8.0749 50075.4525    0.00     1.00
## howev         -34.4927 35618.8571    0.00     1.00
## peopl         -18.6379 14389.7479    0.00     1.00
## power          -5.6431 11727.1593    0.00     1.00
## put           -10.5189 26812.4322    0.00     1.00
## run           -51.6220 44337.5156    0.00     1.00
## shall          19.2987 30748.7761    0.00     1.00
## soon           23.4975 37313.2839    0.00     1.00
## support       -15.3927 19761.5524    0.00     1.00
## attach        -10.3659 15343.2998    0.00     1.00
## abl            -2.0485 20883.2671    0.00     1.00
## program         1.4441 11831.1619    0.00     1.00
## sorri           6.0356 22992.8231    0.00     1.00
## valu            0.9024 13599.5916    0.00     1.00
## check           1.4252 19631.4419    0.00     1.00
## feel            2.5959 23476.2770    0.00     1.00
## better         42.6315 23599.8879    0.00     1.00
## plan          -18.3036  6320.4988    0.00     1.00
## experi          2.4597 22404.6552    0.00     1.00
## hope          -14.3545 21794.8858    0.00     1.00
## begin          22.2801 29731.4126    0.00     1.00
## X2000         -36.3065 15559.7816    0.00     1.00
## case          -33.7240 28804.2328    0.00     1.00
## depart        -40.6847 25092.9541    0.00     1.00
## financi        -9.7467 17271.8378    0.00     1.00
## houston       -18.5450  7305.0368    0.00     1.00
## intern         -7.9907 33512.7814    0.00     1.00
## john           -0.5326 28562.0674    0.00     1.00
## juli          -13.5778 30093.2708    0.00     1.00
## mark          -33.5007 32080.8705    0.00     1.00
## open           21.1417 29613.7993    0.00     1.00
## public        -52.4985 23410.5823    0.00     1.00
## sinc           -3.4385 35455.9820    0.00     1.00
## still           3.8779 26222.2125    0.00     1.00
## thought        12.4329 30228.1125    0.00     1.00
## univers        12.2758 21969.4115    0.00     1.00
## appreci       -21.4464 27616.2809    0.00     1.00
## keep           18.6660 27816.0700    0.00     1.00
## cours          16.6526 18338.3815    0.00     1.00
## direct        -20.5061 31942.8882    0.00     1.00
## togeth        -23.5481 18689.9723    0.00     1.00
## energi        -16.1971 16457.8766    0.00     1.00
## london          6.7453 16419.7348    0.00     1.00
## updat         -15.0978 14480.7185    0.00     1.00
## suggest       -38.4217 44745.1860    0.00     1.00
## option         -1.0852  9325.3243    0.00     1.00
## monday         -1.0340 32330.8096    0.00     1.00
## kevin         -37.7904 47379.7471    0.00     1.00
## book            4.3007 20235.7919    0.00     1.00
## deal          -11.2937 14476.4873    0.00     1.00
## invit           4.3037 22150.2429    0.00     1.00
## tuesday       -28.0830 39588.8687    0.00     1.00
## interview     -16.4048 18733.9704    0.00     1.00
## schedul         1.9191 35796.8427    0.00     1.00
## school         -3.8701 28823.4689    0.00     1.00
## model         -22.9233 10487.3469    0.00     1.00
## financ         -9.1224  7523.9504    0.00     1.00
## morn          -26.4476 34027.8914    0.00     1.00
## attend        -34.5055 32573.3227    0.00     1.00
## robert        -20.9550 29071.4318    0.00     1.00
## student       -18.1473 21856.4156    0.00     1.00
## april         -26.2027 22080.5315    0.00     1.00
## talk          -10.1057 20206.4181    0.00     1.00
## arrang         10.6947 21352.2139    0.00     1.00
## deriv         -49.7106 35873.6724    0.00     1.00
## thursday      -14.9135 32617.9203    0.00     1.00
## resum          -9.2191 20996.1407    0.00     1.00
## doc           -25.9712 26031.8370    0.00     1.00
## confer         -0.7503  8557.3634    0.00     1.00
## wednesday     -15.2636 26422.7645    0.00     1.00
## edu            -0.2122   691.7410    0.00     1.00
## friday        -11.4616 19964.7326    0.00     1.00
## ect             0.8685  5341.5129    0.00     1.00
## hou             6.8515  6436.8947    0.00     1.00
## vinc          -37.3476  8647.1553    0.00     1.00
## X853           -1.2123 59416.8273    0.00     1.00
## shirley       -71.3287 63289.3774    0.00     1.00
## enron          -8.7888  5718.8782    0.00     1.00
## kaminski      -18.1196  6029.0713    0.00     1.00
## X713          -24.2730 29138.3799    0.00     1.00
## crenshaw       99.9441 67692.0276    0.00     1.00
## vkamin        -66.4898 57028.7697    0.00     1.00
## gibner         29.0119 24595.4818    0.00     1.00
## stinson       -43.4535 26967.0175    0.00     1.00
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4409.49  on 4009  degrees of freedom
## Residual deviance:   13.46  on 3679  degrees of freedom
## AIC: 675.5
## 
## Number of Fisher Scoring iterations: 25
  • 0 (none)
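This can be confirmed programmatically; column 4 of the coefficient matrix holds the p-values:

sum(coef(summary(glm))[, 4] < 0.05)
## [1] 0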
3.3 Words in the Decision Tree

【P3.3】How many of the word stems “enron”, “hou”, “vinc”, and “kaminski” appear in the CART tree?

prp(cart)

  • 2 (enron, vinc). Recall that we suspect these word stems are specific to Vincent Kaminski and might affect the generalizability of a spam filter built with his ham data.
3.4 Training accuracy of the GLM model

What is the training set accuracy of spamLog, using a threshold of 0.5 for predictions?

table(tr$spam, predglm>0.5) %>% {sum(diag(.))/sum(.)}
## [1] 0.999
3.5 What is the training AUC of the GLM model?
colAUC(predglm, tr$spam)
##         [,1]
## 0 vs. 1    1
#prediction = prediction(predglm, tr$spam)
#as.numeric(performance(prediction, "auc")@y.values)
3.6 Training accuracy of the CART model

What is the training set accuracy of spamCART, using a threshold of 0.5 for predictions? (Remember that if you used the type=“class” argument when making predictions, you automatically used a threshold of 0.5. If you did not add in the type argument to the predict function, the probabilities are in the second column of the predict output.)

predcart = predict(cart)[,2]
table(tr$spam, predcart>0.5) %>% {sum(diag(.))/sum(.)}
## [1] 0.94239
3.7 What is the training AUC of the CART model?

What is the training set AUC of spamCART? (Remember that you have to pass the prediction function predicted probabilities, so don’t include the type argument when making predictions for your CART model.)

colAUC(predcart, tr$spam)
##           [,1]
## 0 vs. 1 0.9696
3.8 What is the training accuracy of the RF model?
# With no newdata, predict() on a randomForest returns out-of-bag (OOB)
# predictions, which is why this "training" accuracy is not exactly 1.
predrf = predict(rf, type="prob")[,2]
table(tr$spam, predrf>0.5) %>% {sum(diag(.))/sum(.)}
## [1] 0.9808
3.9 What is the training AUC of the RF model?

What is the training set AUC of spamRF? (Remember to pass the argument type=“prob” to the predict function to get predicted probabilities for a random forest model. The probabilities will be the second column of the output.)

colAUC(predrf, tr$spam)
##           [,1]
## 0 vs. 1 0.9979
3.10 Which model had the best training set performance, in terms of accuracy & AUC?
  • Logistic regression (training accuracy 0.999 and AUC 1.000, essentially perfect)


Problem 4 - Evaluating on the Test Set

Obtain predicted probabilities for the testing set for each of the models, again ensuring that probabilities instead of classes are obtained.

# Combine the test-set predictions from the three models into one data frame
pred2 = data.frame(
  predglm2 = predict(glm, ts, type='response'),
  predcart2 = predict(cart, ts)[,2],
  predrf2 = predict(rf, ts, type='prob')[,2] )
rbind(
  ACC = apply(pred2, 2, function(x) {
    table(ts$spam, x > 0.5) %>% {sum(diag(.)) / sum(.)} } ),
  AUC = colAUC(pred2, ts$spam)  ) %>% t
##               ACC 0 vs. 1
## predglm2  0.95052 0.96275
## predcart2 0.93946 0.96318
## predrf2   0.97555 0.99777
4.1

What is the testing set accuracy of spamLog, using a threshold of 0.5 for predictions?

  • 0.95052
4.2

What is the testing set AUC of spamLog?

  • 0.96275
4.3

What is the testing set accuracy of spamCART, using a threshold of 0.5 for predictions?

  • 0.93946
4.4

What is the testing set AUC of spamCART?

  • 0.96318
4.5

What is the testing set accuracy of spamRF, using a threshold of 0.5 for predictions?

  • 0.97555
4.6

What is the testing set AUC of spamRF?

  • 0.99777
4.7

Which model had the best testing set performance, in terms of accuracy and AUC?

  • Random forest
4.8

Which model demonstrated the greatest degree of overfitting?

  • Logistic regression. Its near-perfect training performance (accuracy 0.999, AUC 1.000) fell to 0.95052 and 0.96275 on the test set, the largest train-to-test drop of the three models.