CAP 5703 Final Project Danilo Martinez
Problem A (classification). Consider the spam email detection problem. The data set is available at https://archive.ics.uci.edu/ml/datasets/Spambase. The objective is to apply data mining methods to determine whether an email is a regular or spam email. Thus, this is a classification problem with binary responses. You will apply a number of supervised learning methods to this classification problem, including logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and K-nearest neighbors (KNN). Randomly divide the data into a training set with sample size 4000 and a test set with sample size 601. You will use the training sample to train a number of models and then use the test sample to compare them.
Setting up parallel processing for faster processing of the data.
# Recording the start time
ptm <- proc.time()
# Loading the library for multi-core processing
library(doParallel)
# Setting up parallel processing
cl <- makeCluster(detectCores())
registerDoParallel(cl)
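One housekeeping detail not shown in the original script: the cluster created here is never shut down. A minimal sketch of the cleanup call, to be run after the final timing block at the end of the analysis:
# Shut down the worker processes once the analysis is finished
stopCluster(cl)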
In the context of this problem:
PERCENT CORRECT FOR EACH CATEGORY: The proportion of correct classifications computed separately for the predicted spam and NOT spam categories. We would like these two figures to have similarly high values.
ACCURACY: The total number of emails correctly classified as spam or NOT spam, divided by the total number of emails classified. We would like this figure to be high.
TRUE POSITIVE: The emails that are predicted as spam and truly are spam. We would like this figure to be high.
FALSE NEGATIVE: The emails that are NOT predicted as spam but actually are spam. We would like this figure to be low.
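To make these definitions concrete, here is a minimal sketch with a made-up confusion table (predicted labels in the rows, true labels in the columns); the formulas mirror the calculations used throughout this report:
# Illustrative only: a hypothetical 2 x 2 confusion table,
# 0 = not spam, 1 = spam, rows = predictions, columns = truth
ct <- as.table(matrix(c(330,  20,
                         15, 236),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(pred = c("0", "1"), truth = c("0", "1"))))
# Percent correct for each predicted category (row-wise proportions)
diag(prop.table(ct, 1))
# Overall accuracy: correctly classified emails over all emails
sum(diag(prop.table(ct)))
# "True positive" figure as computed later in this report
ct[2, 2] / sum(ct[1, ])
# False negative rate: spam emails that were predicted as not spam
ct[1, 2] / sum(ct[, 2])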
Importing data file from the internet and verifying content.
info <- read.csv(file = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data",
                 header = FALSE,
                 col.names = c("word_freq_make","word_freq_address","word_freq_all","word_freq_3d",
                               "word_freq_our","word_freq_over","word_freq_remove","word_freq_internet",
                               "word_freq_order","word_freq_mail","word_freq_receive","word_freq_will",
                               "word_freq_people","word_freq_report","word_freq_addresses","word_freq_free",
                               "word_freq_business","word_freq_email","word_freq_you","word_freq_credit",
                               "word_freq_your","word_freq_font","word_freq_000","word_freq_money",
                               "word_freq_hp","word_freq_hpl","word_freq_george","word_freq_650",
                               "word_freq_lab","word_freq_labs","word_freq_telnet","word_freq_857",
                               "word_freq_data","word_freq_415","word_freq_85","word_freq_technology",
                               "word_freq_1999","word_freq_parts","word_freq_pm","word_freq_direct",
                               "word_freq_cs","word_freq_meeting","word_freq_original","word_freq_project",
                               "word_freq_re","word_freq_edu","word_freq_table","word_freq_conference",
                               "char_freq_;","char_freq_(","char_freq_[","char_freq_!","char_freq_$",
                               "char_freq_#","capital_run_length_average","capital_run_length_longest",
                               "capital_run_length_total","spam"),
                 na.strings = c("NA", "", " ", "999"))
head(info,10)
##    word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our
## 1            0.00              0.64          0.64            0          0.32
## 2            0.21              0.28          0.50            0          0.14
## 3            0.06              0.00          0.71            0          1.23
## 4            0.00              0.00          0.00            0          0.63
## 5            0.00              0.00          0.00            0          0.63
## 6            0.00              0.00          0.00            0          1.85
## 7            0.00              0.00          0.00            0          1.92
## 8            0.00              0.00          0.00            0          1.88
## 9            0.15              0.00          0.46            0          0.61
## 10           0.06              0.12          0.77            0          0.19
## [console output truncated: head() goes on to print the first 10 rows of the
##  remaining 53 columns, ending with spam = 1 for all 10 rows shown]
Dividing data into training and test sets.
# Number of observations sampled for the training set
n = 4000
# Setting seed to produce the same results
set.seed(123)
# Sampling and setting up the training data
train_ind <- sample(nrow(info), n, replace = FALSE)
train = info[train_ind, ]
# Setting up the testing data
test = info[-train_ind, ]
# Removing unnecessary objects
rm(info)
rm(n)
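As a quick sanity check (not part of the original output), one could verify that the split has the intended sizes and that both classes appear in similar proportions in the training and test sets; a minimal sketch:
# Sizes of the two sets (should be 4000 and 601)
nrow(train); nrow(test)
# Class proportions of not spam (0) vs. spam (1) in each set
prop.table(table(train$spam))
prop.table(table(test$spam))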
Logistic regression is a special type of regression in which a binary response variable is related to a set of explanatory variables, which can be discrete and/or continuous. The important distinction is that linear regression models the expected value of the response as a function of the predictors, whereas logistic regression models the probability (or odds) of the response taking a particular value as a function of the predictors. Like regression, and unlike log-linear models, we make an explicit distinction between a response variable and one or more predictor (explanatory) variables. The response variable spam is binary, and we use 0.5 as the predicted-probability cutoff for the classifier.
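To make the cutoff rule concrete, here is a minimal sketch (with made-up coefficients b0 and b1, not taken from the fitted model) of how a log-odds value is turned into a probability and then into a class label:
# Hypothetical intercept and slope for a single predictor x
b0 <- -1.5
b1 <- 2.0
x  <- 1.2
# Log-odds (the linear predictor) and the logistic transformation to a probability
log_odds <- b0 + b1 * x
p_spam   <- 1 / (1 + exp(-log_odds))   # same as plogis(log_odds)
# Classify as spam (1) if the predicted probability exceeds the 0.5 cutoff
as.numeric(p_spam > 0.5)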
Creating model for logistic regression using training set.
# Loading the caret library (used later for the train function)
library(caret)
# Creating the training model using logistic regression
model = glm(spam ~ ., data = train, family = binomial(link = "logit"))
# Printing out the logistic model
summary(model)
## 
## Call:
## glm(formula = spam ~ ., family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0181  -0.2041   0.0000   0.1120   5.2670  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.588e+00  1.515e-01 -10.484  < 2e-16 ***
## word_freq_make             -2.312e-01  2.330e-01  -0.992 0.321202    
## word_freq_address          -1.367e-01  7.056e-02  -1.937 0.052699 .  
## word_freq_all               1.339e-01  1.170e-01   1.144 0.252471    
## word_freq_3d                1.710e+00  1.496e+00   1.143 0.253137    
## word_freq_our               5.958e-01  1.128e-01   5.283 1.27e-07 ***
## word_freq_over              9.175e-01  2.710e-01   3.385 0.000711 ***
## word_freq_remove            2.145e+00  3.466e-01   6.189 6.05e-10 ***
## word_freq_internet          6.311e-01  1.849e-01   3.413 0.000642 ***
## word_freq_order             5.934e-01  2.971e-01   1.998 0.045769 *  
## word_freq_mail              9.300e-02  7.038e-02   1.321 0.186384    
## word_freq_receive          -4.412e-01  3.359e-01  -1.314 0.189004    
## word_freq_will             -1.075e-01  7.491e-02  -1.435 0.151322    
## word_freq_people           -1.004e-01  2.399e-01  -0.418 0.675741    
## word_freq_report            1.697e-01  1.454e-01   1.167 0.243312    
## word_freq_addresses         8.941e-01  7.105e-01   1.259 0.208206    
## word_freq_free              1.030e+00  1.556e-01   6.624 3.50e-11 ***
## word_freq_business          9.278e-01  2.325e-01   3.991 6.57e-05 ***
## word_freq_email             1.729e-01  1.242e-01   1.391 0.164147    
## word_freq_you               7.936e-02  3.721e-02   2.133 0.032960 *  
## word_freq_credit            1.116e+00  6.011e-01   1.857 0.063304 .  
## word_freq_your              2.176e-01  5.695e-02   3.820 0.000133 ***
## word_freq_font              1.721e-01  1.577e-01   1.091 0.275108    
## word_freq_000               2.143e+00  4.776e-01   4.486 7.26e-06 ***
## word_freq_money             6.321e-01  2.684e-01   2.355 0.018536 *  
## word_freq_hp               -2.183e+00  3.905e-01  -5.591 2.26e-08 ***
## word_freq_hpl              -9.135e-01  5.130e-01  -1.781 0.074973 .  
## word_freq_george           -1.117e+01  2.143e+00  -5.214 1.85e-07 ***
## word_freq_650               5.382e-01  2.414e-01   2.229 0.025810 *  
## word_freq_lab              -2.094e+00  1.362e+00  -1.537 0.124255    
## word_freq_labs             -2.466e-01  3.118e-01  -0.791 0.429068    
## word_freq_telnet           -1.259e-01  3.883e-01  -0.324 0.745859    
## word_freq_857               2.371e+00  3.504e+00   0.677 0.498544    
## word_freq_data             -8.358e-01  3.554e-01  -2.352 0.018687 *  
## word_freq_415              -1.312e+01  4.063e+00  -3.228 0.001247 ** 
## word_freq_85               -1.848e+00  8.046e-01  -2.297 0.021624 *  
## word_freq_technology        9.075e-01  3.268e-01   2.777 0.005487 ** 
## word_freq_1999             -1.755e-02  1.891e-01  -0.093 0.926070    
## word_freq_parts             1.214e+00  1.080e+00   1.124 0.261170    
## word_freq_pm               -8.460e-01  4.045e-01  -2.092 0.036472 *  
## word_freq_direct           -1.893e-01  4.721e-01  -0.401 0.688405    
## word_freq_cs               -4.327e+01  2.665e+01  -1.624 0.104414    
## word_freq_meeting          -2.808e+00  9.481e-01  -2.962 0.003055 ** 
## word_freq_original         -1.692e+00  1.044e+00  -1.621 0.105040    
## word_freq_project          -1.829e+00  6.244e-01  -2.929 0.003395 ** 
## word_freq_re               -7.616e-01  1.650e-01  -4.616 3.91e-06 ***
## word_freq_edu              -1.355e+00  2.711e-01  -5.000 5.74e-07 ***
## word_freq_table            -2.188e+00  1.650e+00  -1.326 0.184745    
## word_freq_conference       -3.994e+00  1.632e+00  -2.447 0.014418 *  
## char_freq_.                -1.232e+00  4.367e-01  -2.820 0.004798 ** 
## char_freq_..1              -2.315e-01  2.649e-01  -0.874 0.382246    
## char_freq_..2              -5.956e-01  8.178e-01  -0.728 0.466468    
## char_freq_..3               3.900e-01  1.039e-01   3.754 0.000174 ***
## char_freq_..4               5.011e+00  7.393e-01   6.778 1.22e-11 ***
## char_freq_..5               2.472e+00  1.135e+00   2.178 0.029435 *  
## capital_run_length_average -4.235e-04  1.908e-02  -0.022 0.982292    
## capital_run_length_longest  1.063e-02  2.734e-03   3.889 0.000101 ***
## capital_run_length_total    8.561e-04  2.380e-04   3.597 0.000322 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5349.1  on 3999  degrees of freedom
## Residual deviance: 1579.3  on 3942  degrees of freedom
## AIC: 1695.3
## 
## Number of Fisher Scoring iterations: 13
Fitting model to the test set and checking accuracy.
# Fitting the training model on the test set
pred = predict(model, newdata = test, type = "response")
glm.pred <- rep(0, nrow(test))
glm.pred[pred > .5] <- 1
# Percent correct for each category
print("Percent Correct for Each Category")
## [1] "Percent Correct for Each Category"
ct <- table(glm.pred, test$spam)
diag(prop.table(ct, 1))
##         0         1 
## 0.9271709 0.9344262 
# Calculating accuracy
print("Accuracy Rate")
## [1] "Accuracy Rate"
sum(diag(prop.table(ct)))
## [1] 0.9301165
# Displaying true positive
print("True Positive Rate")
## [1] "True Positive Rate"
ct[2,2] / sum(ct[1,])
## [1] 0.6386555
# Displaying false negative
print("False Negative Rate")
## [1] "False Negative Rate"
ct[1,2] / sum(ct[,2])
## [1] 0.1023622
Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic function for the case of two response classes. In statistical jargon, we model the conditional distribution of the response Y given the predictor(s) X. We now consider an alternative and less direct approach to estimating these probabilities.
Linear Discriminant Analysis. In this alternative approach, we model the distribution of the predictors X separately in each of the response classes (i.e., given Y), and then use Bayes' theorem to flip these around into estimates for Pr(Y = k|X = x). When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression.
Why do we need another method when we have logistic regression? There are several reasons. First, when the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem. Second, if n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model. Third, linear discriminant analysis is popular when we have more than two response classes. The sketch below illustrates the Bayes "flip" for a single predictor.
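A toy one-dimensional illustration of the flip described above, using made-up class means, a shared standard deviation, and rough class priors (this is not the fitted model, only a sketch of the mechanism):
# Class-conditional normal densities plus priors, combined via Bayes' theorem
prior_0 <- 0.61; prior_1 <- 0.39   # rough class proportions (not spam / spam)
mu_0 <- 0.1; mu_1 <- 0.5           # hypothetical class means of one predictor
sigma <- 0.3                       # common standard deviation assumed by LDA
x <- 0.4                           # a new observation
# Pr(Y = 1 | X = x) is proportional to prior * normal density
num_1 <- prior_1 * dnorm(x, mean = mu_1, sd = sigma)
num_0 <- prior_0 * dnorm(x, mean = mu_0, sd = sigma)
posterior_spam <- num_1 / (num_0 + num_1)
posterior_spam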
Creating model for linear discriminant analysis using training set.
# Loading the library that provides the lda function
library(MASS)
# Creating the training model using linear discriminant analysis
model = lda(spam ~ ., data = train, family = binomial(link = "logit"))
# Printing out the fitted model
model
## Call:
## lda(spam ~ ., data = train, family = binomial(link = "logit"))
## 
## Prior probabilities of groups:
##       0       1 
## 0.61025 0.38975 
## 
## Group means:
## [group means of all 57 predictors for class 0 (not spam) and class 1 (spam);
##  lengthy console output omitted. The spam class has noticeably higher means for
##  words such as free, remove, money, 000 and for the capital-run-length features,
##  while the not-spam class has higher means for hp, hpl, george, edu and meeting.]
## 
## Coefficients of linear discriminants:
## [one LD1 coefficient per predictor; lengthy console output omitted. The largest
##  positive weights include word_freq_857, char_freq_..4, word_freq_remove and
##  word_freq_000; the largest negative weights include word_freq_415,
##  word_freq_table and char_freq_.]
Fitting model to the test set and checking accuracy.
# Fitting the training model on the test set
pred = predict(model, newdata = test, type = "response")
# Assess the accuracy of the prediction
# Percent correct for each category
print("Percent Correct for Each Category")
## [1] "Percent Correct for Each Category"
ct <- table(pred$class, test$spam)
diag(prop.table(ct, 1))
##         0         1 
## 0.8631579 0.9140271 
# Calculating accuracy
print("Accuracy Rate")
## [1] "Accuracy Rate"
sum(diag(prop.table(ct)))
## [1] 0.8818636
# Displaying true positive
print("True Positive Rate")
## [1] "True Positive Rate"
ct[2,2] / sum(ct[1,])
## [1] 0.5315789
# Displaying false negative
print("False Negative Rate")
## [1] "False Negative Rate"
ct[1,2] / sum(ct[,2])
## [1] 0.2047244
LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes.
Quadratic discriminant analysis (QDA) provides an alternative. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes' theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. So the QDA classifier involves assigning an observation X = x to the class for which this quantity is largest. The quantity appears as a quadratic function of x, and this is where QDA gets its name.
Why does it matter whether or not we assume that the K classes share a common covariance matrix? In other words, why would one prefer LDA to QDA, or vice versa? The answer lies in the bias-variance trade-off. With p predictors, estimating a single covariance matrix requires estimating p(p + 1)/2 parameters. QDA estimates a separate covariance matrix for each class, for a total of Kp(p + 1)/2 parameters; with the 57 predictors in this data set that is 3,306 covariance parameters, which is a lot. By instead assuming that the K classes share a common covariance matrix, the LDA model becomes linear in x, which means there are only Kp = 114 linear coefficients to estimate. The short calculation below makes these counts explicit.
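A quick back-of-the-envelope check of these parameter counts for this data set (p = 57 predictors, K = 2 classes):
p <- 57   # number of predictors in the spam data
K <- 2    # number of classes (spam / not spam)
# Parameters in one covariance matrix, in QDA's K class-specific matrices,
# and the number of linear coefficients LDA needs instead
c(one_cov    = p * (p + 1) / 2,
  qda_total  = K * p * (p + 1) / 2,
  lda_linear = K * p)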
Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance. But there is a trade-off: if LDA's assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias. Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
Creating model for quadratic discriminant analysis using training set.
# Creating the training model using quadratic discriminant analysis
model = qda(spam ~ ., data = train, family = binomial(link = "logit"))
# Printing out the fitted model
model
## Call:
## qda(spam ~ ., data = train, family = binomial(link = "logit"))
## 
## Prior probabilities of groups:
##       0       1 
## 0.61025 0.38975 
## 
## Group means:
## [identical to the group means printed for the LDA model above; lengthy console
##  output omitted for brevity]
Fitting model to the test set and checking accuracy.
# Fitting the training model on the test set
pred = predict(model, newdata = test, type = "response")
# Assess the accuracy of the prediction
# Percent correct for each category
print("Percent Correct for Each Category")
## [1] "Percent Correct for Each Category"
ct <- table(pred$class, test$spam)
diag(prop.table(ct, 1))
##         0         1 
## 0.9600000 0.7453988 
# Calculating accuracy
print("Accuracy Rate")
## [1] "Accuracy Rate"
sum(diag(prop.table(ct)))
## [1] 0.843594
# Displaying true positive
print("True Positive Rate")
## [1] "True Positive Rate"
ct[2,2] / sum(ct[1,])
## [1] 0.8836364
# Displaying false negative
print("False Negative Rate")
## [1] "False Negative Rate"
ct[1,2] / sum(ct[,2])
## [1] 0.04330709
The K-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. The distance between the stored data and the new instance is calculated by means of some similarity measure, typically a distance such as Euclidean distance, cosine similarity, or Manhattan distance. In other words, for any new data point entered into the system, its similarity to the data already in the system is calculated, and this similarity value is then used for predictive modeling: either classification (assigning a label or class to the new instance) or regression (assigning a value to the new instance). Basically, tell me who your neighbors are, and I will tell you who you are. Choosing K is an important step in KNN. In theory, if an infinite number of samples were available, the larger K is, the better the classification. The caveat is that all K neighbors have to be close to the query point; this is possible with infinitely many samples but impossible in practice, since the number of samples is finite. KNN is referred to as a lazy learning algorithm because the function is only approximated locally and all computation is deferred until classification. The sketch below illustrates the basic mechanism on a single query point.
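A minimal from-scratch sketch of that mechanism (illustrative only; the analysis below uses knn() from the class package rather than this toy function, and the tiny data set here is made up):
# Toy KNN classifier for a single query point, using Euclidean distance
# and a simple majority vote among the k nearest labeled points.
knn_one <- function(X_train, y_train, x_new, k = 3) {
  # Euclidean distance from the query point to every stored instance
  dists <- sqrt(rowSums(sweep(as.matrix(X_train), 2, x_new)^2))
  # Labels of the k closest instances
  nearest <- y_train[order(dists)[1:k]]
  # Majority vote (ties broken by the first label encountered)
  names(which.max(table(nearest)))
}
# Tiny made-up example with two numeric features
X <- data.frame(f1 = c(0.1, 0.2, 0.9, 1.0), f2 = c(0.0, 0.1, 0.8, 0.9))
y <- c("not spam", "not spam", "spam", "spam")
knn_one(X, y, x_new = c(0.85, 0.75), k = 3)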
Creating model for K Nearest Neighbor using training set, for K = 1 through 15 (odd values), showing accuracy output for each. Note that knn() from the class package computes Euclidean distances on the predictors as provided; no scaling is applied here.
# Loading the library that provides the knn function
library(class)
# Creating models using K-nearest neighbors
for (i in seq(1, 15, 2))
{
  cat("KNN using K = ", i, "\n", sep = "")
  # Fitting the training model on the test set
  pred = knn(train[,-58], test[,-58], train$spam, k = i)
  # Assess the accuracy of the prediction
  # Percent correct for each category
  print("Percent Correct for Each Category")
  ct <- table(pred, test$spam)
  print(diag(prop.table(ct, 1)))
  # Calculating accuracy
  print("Accuracy Rate")
  print(sum(diag(prop.table(ct))))
  # Displaying true positive
  print("True Positive Rate")
  print(ct[2,2] / sum(ct[1,]))
  # Displaying false negative
  print("False Negative Rate")
  print(ct[1,2] / sum(ct[,2]))
  cat("\n")
}
## KNN using K = 1
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.8415301 0.8340426 
## [1] "Accuracy Rate"
## [1] 0.8386023
## [1] "True Positive Rate"
## [1] 0.5355191
## [1] "False Negative Rate"
## [1] 0.2283465
## 
## KNN using K = 3
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.8328767 0.8177966 
## [1] "Accuracy Rate"
## [1] 0.8269551
## [1] "True Positive Rate"
## [1] 0.5287671
## [1] "False Negative Rate"
## [1] 0.2401575
## 
## KNN using K = 5
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.8243243 0.8181818 
## [1] "Accuracy Rate"
## [1] 0.8219634
## [1] "True Positive Rate"
## [1] 0.5108108
## [1] "False Negative Rate"
## [1] 0.2559055
## 
## KNN using K = 7
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.8147139 0.7948718 
## [1] "Accuracy Rate"
## [1] 0.8069884
## [1] "True Positive Rate"
## [1] 0.506812
## [1] "False Negative Rate"
## [1] 0.2677165
## 
## KNN using K = 9
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.8085106 0.8088889 
## [1] "Accuracy Rate"
## [1] 0.8086522
## [1] "True Positive Rate"
## [1] 0.4840426
## [1] "False Negative Rate"
## [1] 0.2834646
## 
## KNN using K = 11
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.8021390 0.7929515 
## [1] "Accuracy Rate"
## [1] 0.7986689
## [1] "True Positive Rate"
## [1] 0.4812834
## [1] "False Negative Rate"
## [1] 0.2913386
## 
## KNN using K = 13
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.8005319 0.7955556 
## [1] "Accuracy Rate"
## [1] 0.7986689
## [1] "True Positive Rate"
## [1] 0.4760638
## [1] "False Negative Rate"
## [1] 0.2952756
## 
## KNN using K = 15
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.7827225 0.7808219 
## [1] "Accuracy Rate"
## [1] 0.78203
## [1] "True Positive Rate"
## [1] 0.447644
## [1] "False Negative Rate"
## [1] 0.3267717
# Removing unnecessary objects
rm(pred, ct)
The best accuracy for K Nearest Neighbor is achieved with K = 1 (0.8386023), which also gives the highest true positive rate (0.5355191) and the lowest false negative rate (0.2283465). Overall, these are not impressive results in comparison to the other approaches explored above, and the results get worse as K increases.
Below, I used the kknn function, which performs weighted k-nearest neighbor classification of a test set using a training set. For each row of the test set, the k nearest training set vectors (according to Minkowski distance) are found, and the classification is done via the maximum of summed kernel densities; even ordinal and continuous variables can be predicted. Using a kd-tree algorithm helped with processing, but it still took longer to run than the plain knn with Euclidean distance. Here the data are scaled through caret's preProcess option when building the model. The short sketch after this paragraph shows what the Minkowski distance itself looks like.
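For reference, a minimal sketch of the Minkowski distance between two numeric feature vectors (p = 2 gives the Euclidean distance used earlier, p = 1 the Manhattan distance); this helper and the example vectors are illustrative and not part of the original analysis:
# Minkowski distance of order p between two numeric vectors
minkowski <- function(a, b, p = 2) {
  sum(abs(a - b)^p)^(1 / p)
}
x1 <- c(0.00, 0.64, 0.64)
x2 <- c(0.21, 0.28, 0.50)
minkowski(x1, x2, p = 2)   # Euclidean distance
minkowski(x1, x2, p = 1)   # Manhattan distance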
Creating model for K Nearest Neighbor using training set and Minkowski distance, for K = 1 through 15 (odd values), showing accuracy output for each. The train function uses preProcess to scale the data when building the model.
# Creating models using K-nearest neighbors (kknn through caret)
for (i in seq(1, 15, 2))
{
  cat("KNN using K = ", i, "\n", sep = "")
  model <- train(spam ~ ., data = train,
                 method = 'kknn', algorithm = c("kd_tree"), tuneLength = 3,
                 number = 1, k = i, preProcess = c("scale"))
  # Fitting the training model on the test set
  pred = predict(model, newdata = test)
  # Assess the accuracy of the prediction
  # Percent correct for each category
  print("Percent Correct for Each Category")
  ct <- table(pred, test$spam)
  print(diag(prop.table(ct, 1)))
  # Calculating accuracy
  print("Accuracy Rate")
  print(sum(diag(prop.table(ct))))
  # Displaying true positive
  print("True Positive Rate")
  print(ct[2,2] / sum(ct[1,]))
  # Displaying false negative
  print("False Negative Rate")
  print(ct[1,2] / sum(ct[,2]))
  cat("\n")
}
## KNN using K = 1
## [1] "Percent Correct for Each Category"
##         0         1 
## 0.9295775 0.9308943 
## [1] "Accuracy Rate"
## [1] 0.9301165
## [1] "True Positive Rate"
## [1] 0.6450704
## [1] "False Negative Rate"
## [1] 0.0984252
## 
## KNN using K = 3
## [1] "Percent Correct for Each Category"
## [1] 0.9799331 0.1363636
## [1] "Accuracy Rate"
## [1] 0.4925125
## [1] "True Positive Rate"
## [1] 0.01003344
## [1] "False Negative Rate"
## [1] 0.02362205
## 
## KNN using K = 5
## [1] "Percent Correct for Each Category"
## [1] 0.9775281 0.0000000
## [1] "Accuracy Rate"
## [1] 0.4342762
## [1] "True Positive Rate"
## [1] 0
## [1] "False Negative Rate"
## [1] 0.02362205
## 
## KNN using K = 7
## [1] "Percent Correct for Each Category"
## [1] 0.9875 0.0000
## [1] "Accuracy Rate"
## [1] 0.3943428
## [1] "True Positive Rate"
## [1] 0
## [1] "False Negative Rate"
## [1] 0.01181102
## 
## KNN using K = 9
## [1] "Percent Correct for Each Category"
## [1] 0.9865471 0.0000000
## [1] "Accuracy Rate"
## [1] 0.3660566
## [1] "True Positive Rate"
## [1] 0
## [1] "False Negative Rate"
## [1] 0.01181102
## 
## KNN using K = 11
## [1] "Percent Correct for Each Category"
## [1] 0.9853659 0.0000000
## [1] "Accuracy Rate"
## [1] 0.3361065
## [1] "True Positive Rate"
## [1] 0
## [1] "False Negative Rate"
## [1] 0.01181102
## 
## KNN using K = 13
## [1] "Percent Correct for Each Category"
## [1] 0.9846939 0.0000000
## [1] "Accuracy Rate"
## [1] 0.3211314
## [1] "True Positive Rate"
## [1] 0
## [1] "False Negative Rate"
## [1] 0.01181102
## 
## KNN using K = 15
## [1] "Percent Correct for Each Category"
## [1] 0.9828571 0.0000000
## [1] "Accuracy Rate"
## [1] 0.2861897
## [1] "True Positive Rate"
## [1] 0
## [1] "False Negative Rate"
## [1] 0.01181102
# Removing unnecessary objects
rm(pred, ct, i)
Using the kknn method, the best accuracy for K Nearest Neighbor is achieved with K = 1 (0.9301165), which also gives the highest true positive rate (0.6450704). The lowest false negative rate (0.01181102) occurs for K = 7 and above, but only because those models predict almost everything as NOT spam: their true positive figure drops to 0 and their overall accuracy collapses below 0.40, so they are not useful choices. Overall, K = 1 is a decent improvement over the earlier knn attempt and rivals logistic regression (the best approach explored so far), tying its accuracy while giving a slightly higher true positive figure and a lower false negative rate. Therefore K = 1 is the best KNN model in this modification using Minkowski distance, and the best model overall so far, having tied for the highest accuracy while achieving the highest true positive and the lowest false negative among the useful models.
Though their motivations differ, the logistic regression and LDA methods are closely connected. Both produce linear decision boundaries. The only difference between the two approaches lies in the fact that the coefficients are estimated using maximum likelihood in logistic regression, whereas in LDA they are computed using the estimated means and variance from a normal distribution. This same connection between LDA and logistic regression also holds for multidimensional data with p > 1. Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results. This is often, but not always, the case. LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvement over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.
KNN takes a completely different approach from the logistic regression, LDA, and QDA classifiers. In order to make a prediction for an observation X = x, the K training observations that are closest to x are identified, and X is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary. We can therefore expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important. QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of problems than the linear methods can. Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary. From the results above, the spam email data do not appear to follow a class-conditional Gaussian distribution, and the decision boundary appears to be roughly linear. The best approaches so far, using accuracy as the measure, are logistic regression and KNN with K = 1 using Minkowski distance, both at 93.01165%. Linear discriminant analysis is below at 88.18636%, followed by quadratic discriminant analysis at 84.3594%, and finally KNN with K = 1 using Euclidean distance at 83.86023%.
Finally, I decided to try the AdaBoost algorithm, selecting a model with the Extreme Gradient Boosting package (xgboost). AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm that can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers, but in some problems it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (i.e., its error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner. The sketch below illustrates the weighted-vote idea on a toy scale.
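A toy sketch of the classic AdaBoost weighting scheme described above, with made-up weak-learner outputs and error rates (this is the textbook formulation, not the exact scheme xgboost uses internally):
# Predictions (+1 = spam, -1 = not spam) from three hypothetical weak learners
# on five hypothetical emails, together with each learner's weighted error rate.
weak_preds <- rbind(l1 = c( 1, -1,  1,  1, -1),
                    l2 = c( 1,  1, -1,  1, -1),
                    l3 = c(-1, -1,  1,  1,  1))
errors <- c(0.30, 0.25, 0.40)
# AdaBoost gives each weak learner the weight alpha = 0.5 * log((1 - err) / err);
# learners that beat random guessing (err < 0.5) receive positive weight.
alpha <- 0.5 * log((1 - errors) / errors)
# Final boosted prediction: sign of the weighted vote of the weak learners
sign(colSums(alpha * weak_preds))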
While every learning algorithm tends to suit some problem types better than others and typically has many parameters and configurations to adjust before achieving optimal performance on a dataset, AdaBoost with decision trees as the weak learners is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm, so that later trees tend to focus on harder-to-classify examples. I tested several ranges for the maximum number of iterations in the training model and selected 200 as the maximum number of boosting rounds, with a learning rate of 0.5; the range up to 200 contained the iteration with the lowest training error, which also yielded the lowest test error.
Creating model for AdaBoost method using training set.
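The chunk that created the exploratory boosted model is not echoed in the output below (the next block refers to model$evaluation_log before refitting). A minimal sketch of what that initial run presumably looked like, assuming 200 rounds and the settings described in the text; the as.matrix() conversion is required by xgboost and is an assumption about how the data were passed:
# Loading the extreme gradient boosting library
library(xgboost)
# Exploratory boosted model: shallow trees, learning rate 0.5, up to 200 rounds.
# xgboost records the training error per round in model$evaluation_log.
model = xgboost(data = as.matrix(train[, -58]), label = train[, 58],
                max.depth = 2, eta = .5, nthread = 4, nrounds = 200,
                objective = "binary:logistic", verbose = 0)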
Fitting model to the test set and checking accuracy.
# Establishing the minimum training error and its iteration from the boosted model
error = model$evaluation_log[which(model$evaluation_log$train_error == min(model$evaluation_log$train_error)), ]
# Refitting the model using the iteration with the minimum training error
# (nrounds corrected from the misspelled "nroun"; predictors passed as a matrix,
# as xgboost requires)
model = xgboost(data = as.matrix(train[, -58]), label = train[, 58],
                max.depth = 2, eta = .5, nthread = 4, nrounds = min(error$iter),
                objective = "binary:logistic", verbose = 0)
# Fitting the training model on the test set
pred = predict(model, newdata = as.matrix(test[, -58]))
glm.pred <- rep(0, nrow(test))
glm.pred[pred > .5] <- 1
# Assess the accuracy of the prediction
# Percent correct for each category
print("Percent Correct for Each Category")
## [1] "Percent Correct for Each Category"
ct <- table(glm.pred, test[, 58])
print(diag(prop.table(ct, 1)))
##         0         1 
## 0.9628571 0.9601594 
# Calculating accuracy
print("Accuracy Rate")
## [1] "Accuracy Rate"
print(sum(diag(prop.table(ct))))
## [1] 0.9617304
# Displaying true positive
print("True Positive Rate")
## [1] "True Positive Rate"
print(ct[2,2] / sum(ct[1,]))
## [1] 0.6885714
# Displaying false negative
print("False Negative Rate")
## [1] "False Negative Rate"
print(ct[1,2] / sum(ct[,2]))
## [1] 0.0511811
cat("\n")
cat("\n")
print("Total time to process this file is:")
## [1] "Total time to process this file is:"
proc.time() - ptm
##    user  system elapsed 
##   21.22    1.06  442.39 
# Removing unnecessary objects
#rm(list = ls())
SUMMARY          ACCURACY RATE   TRUE POSITIVE   FALSE NEGATIVE
AdaBoost         0.9617304       0.6885714       0.0511811
KNN1 Minkowski   0.9301165       0.6450704       0.0984252
Logistic Reg.    0.9301165       0.6386555       0.1023622
Linear Disc.     0.8818636       0.5315789       0.2047244
Quad. Disc.      0.843594        0.8836364       0.04330709
KNN1 Euclidean   0.8386023       0.5355191       0.2283465
Overall, the AdaBoost approach improved on the previous results and is the best model, almost cutting the false negative rate of the previous best in half. This is a great improvement, particularly since a false negative means a spam email reaching the inbox, which is exactly what we do not want the software to allow.
References:
* An Introduction to Statistical Learning (ISLR) text
* Wikipedia