Chapter 09 (page 368): 5, 7, 8

Chapter 9 ex 5

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.

x1=runif(500)-0.5
x2=runif(500)-0.5
y=1*(x1^2-x2^2 > 0)
plot(x1[y == 0], x2[y == 0], col = "purple", xlab = "X1", ylab = "X2", pch = 4)
points(x1[y == 1], x2[y == 1], col = "blue", pch = 5)
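
Since x1^2 - x2^2 = 0 exactly when x2 = x1 or x2 = -x1, the true class boundary is the pair of diagonals through the origin. As a small addition, those two lines can be overlaid on the scatterplot just produced:

abline(0, 1, lty = 2)    # the line x2 = x1
abline(0, -1, lty = 2)   # the line x2 = -x1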

glm.random = glm(y ~ x1 + x2, family = binomial)
summary(glm.random)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.314  -1.124  -1.034   1.183   1.327  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.02597    0.08993  -0.289   0.7727  
## x1           0.06184    0.31699   0.195   0.8453  
## x2           0.63439    0.30817   2.059   0.0395 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 693.02  on 499  degrees of freedom
## Residual deviance: 688.73  on 497  degrees of freedom
## AIC: 694.73
## 
## Number of Fisher Scoring iterations: 3
# Not using SVM yet
data = data.frame(x1 = x1, x2 = x2, y = y)
glm.prob = predict(glm.random, data, type = "response")
glm.pred = ifelse(glm.prob > 0.50, 1, 0)
data.pos = data[glm.pred == 1, ]
data.neg = data[glm.pred == 0, ]
plot(data.pos$x1, data.pos$x2, col = "purple", xlab = "X1", ylab = "X2", pch = 4)
points(data.neg$x1, data.neg$x2, col = "blue", pch = 5)

# glm.random2 = glm(y ~ x1 + poly(x1, 2) + poly(x1, 3) + x2 + poly(x2, 2), family = binomial)


glm.random2 = glm(y ~ poly(x1,2) + poly(x2,2), family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.random2)
## 
## Call:
## glm(formula = y ~ poly(x1, 2) + poly(x2, 2), family = binomial)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -7.643e-04  -2.000e-08  -2.000e-08   2.000e-08   1.120e-03  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -76.81    2886.88  -0.027    0.979
## poly(x1, 2)1    682.65   52298.32   0.013    0.990
## poly(x1, 2)2  25774.14  855891.10   0.030    0.976
## poly(x2, 2)1    459.52   50367.37   0.009    0.993
## poly(x2, 2)2 -27093.14  900121.22  -0.030    0.976
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6.9302e+02  on 499  degrees of freedom
## Residual deviance: 2.9345e-06  on 495  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25

Lovaas note ex 5e: I am ignoring the warnings here, since this fit is only meant to illustrate the non-linearity.
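
Those two warnings are actually expected here: the true boundary is exactly quadratic, so a logistic model with quadratic terms can separate the two classes almost perfectly, which pushes the fitted probabilities toward 0 and 1. A quick sanity check on the fitted values (my addition, reusing glm.random2 from above):

summary(fitted(glm.random2))   # fitted probabilities pile up at (essentially) 0 and 1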

glm.prob2 = predict(glm.random2, data, type = "response")
glm.pred2 = ifelse(glm.prob2 > 0.50, 1, 0)
data.pos2 = data[glm.pred2 == 1, ]
data.neg2 = data[glm.pred2 == 0, ]
plot(data.pos2$x1, data.pos2$x2, col = "purple", xlab = "X1", ylab = "X2", pch = 4)
points(data.neg2$x1, data.neg2$x2, col = "blue", pch = 5)

Lovaas note ex 5f: The fitted decision boundary is clearly non-linear.

library(e1071)  # provides svm() and tune()
svm1 = svm(as.factor(y) ~ x1 + x2, data, gamma = 1)
svm1
## 
## Call:
## svm(formula = as.factor(y) ~ x1 + x2, data = data, gamma = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  158
svm.pred = predict(svm1, data)
data.pos.svm = data[svm.pred == 1, ]
data.neg.svm = data[svm.pred == 0, ]
plot(data.pos.svm$x1, data.pos.svm$x2, col = "purple", xlab = "X1", ylab = "X2", pch = 4)
points(data.neg.svm$x1, data.neg.svm$x2, col = "blue", pch = 5)

Lovaas commenting on findings: This was a fun experiment! The SVM was pretty good at fitting a non-linear boundary: the GLM in 5d did a pretty poor job, while the SVM in 5h fit the data much better.
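
To put rough numbers on that comparison (my addition, reusing the predictions computed above), the training error rate of each fit is:

mean(glm.pred != y)                   # linear logistic regression (5d)
mean(glm.pred2 != y)                  # logistic regression with quadratic terms (5e)
mean(predict(svm1, data) != data$y)   # radial SVM (5h); the factor levels are "0"/"1", so the comparison works

The first of these should sit near 50%, while the other two should be close to zero.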

Chapter 9 ex 7

library(ISLR)   # the Auto data set
gas.med = median(Auto$mpg)
binary.mpg = ifelse(Auto$mpg > gas.med, 1, 0)
Auto$binary.mpg = as.factor(binary.mpg)
set.seed(42)
svm7b=tune(svm, binary.mpg ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(0.01, 
    0.1, 1, 5, 10, 100)))
summary(svm7b)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.0075 
## 
## - Detailed performance results:
##    cost      error dispersion
## 1 1e-02 0.07391026 0.05301906
## 2 1e-01 0.04326923 0.03594224
## 3 1e+00 0.00750000 0.01687371
## 4 5e+00 0.01519231 0.01760469
## 5 1e+01 0.01775641 0.01700310
## 6 1e+02 0.03044872 0.02842322

Lovaas answer ex 7b: Cross-validated error is minimized at cost = 1 (0.0075), and the dispersion is smallest there as well.
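
As an aside (my addition), the winning parameters and their CV error can be read straight off the tune() object rather than the printed table, and tune() also keeps a model refit at the best cost:

svm7b$best.parameters            # cost = 1
svm7b$best.performance           # cross-validated error, 0.0075
best.linear = svm7b$best.model   # an svm refit at that cost, ready for predict()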

set.seed(42)
svm7c = tune(svm, binary.mpg ~ ., data = Auto, kernel = "polynomial", ranges = list(cost = c(0.1, 
    1, 5, 10), degree = c(2, 3, 4)))
summary(svm7c)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##    10      2
## 
## - best performance: 0.5814103 
## 
## - Detailed performance results:
##    cost degree     error dispersion
## 1   0.1      2 0.5967949 0.05312225
## 2   1.0      2 0.5967949 0.05312225
## 3   5.0      2 0.5967949 0.05312225
## 4  10.0      2 0.5814103 0.05427855
## 5   0.1      3 0.5967949 0.05312225
## 6   1.0      3 0.5967949 0.05312225
## 7   5.0      3 0.5967949 0.05312225
## 8  10.0      3 0.5967949 0.05312225
## 9   0.1      4 0.5967949 0.05312225
## 10  1.0      4 0.5967949 0.05312225
## 11  5.0      4 0.5967949 0.05312225
## 12 10.0      4 0.5967949 0.05312225

Lovaas answer ex 7c: CV error is minimized at cost = 10 and degree = 2. You can see in the table above that the error is lowest at that point, although the dispersion is actually the largest there. There is very little variation across the grid, so I would not say this approach works well here.

set.seed(463)
svm7c2 = tune(svm, binary.mpg ~ ., data = Auto, kernel = "radial", ranges = list(cost = c(0.1, 
    1, 5, 10), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(svm7c2)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##    10  0.01
## 
## - best performance: 0.02551282 
## 
## - Detailed performance results:
##    cost gamma      error dispersion
## 1   0.1 1e-02 0.09429487 0.04814900
## 2   1.0 1e-02 0.07897436 0.03875105
## 3   5.0 1e-02 0.05352564 0.02532795
## 4  10.0 1e-02 0.02551282 0.02417610
## 5   0.1 1e-01 0.07891026 0.03847631
## 6   1.0 1e-01 0.05602564 0.02881876
## 7   5.0 1e-01 0.03826923 0.03252085
## 8  10.0 1e-01 0.03320513 0.02964746
## 9   0.1 1e+00 0.57660256 0.05479863
## 10  1.0 1e+00 0.06628205 0.02996211
## 11  5.0 1e+00 0.06115385 0.02733573
## 12 10.0 1e+00 0.06115385 0.02733573
## 13  0.1 5e+00 0.57660256 0.05479863
## 14  1.0 5e+00 0.51538462 0.06642516
## 15  5.0 5e+00 0.50775641 0.07152757
## 16 10.0 5e+00 0.50775641 0.07152757
## 17  0.1 1e+01 0.57660256 0.05479863
## 18  1.0 1e+01 0.53833333 0.05640443
## 19  5.0 1e+01 0.53070513 0.05708644
## 20 10.0 1e+01 0.53070513 0.05708644
## 21  0.1 1e+02 0.57660256 0.05479863
## 22  1.0 1e+02 0.57660256 0.05479863
## 23  5.0 1e+02 0.57660256 0.05479863
## 24 10.0 1e+02 0.57660256 0.05479863

Lovaas answer for the other part of ex 7c: This is the radial kernel. It took quite a while to run; CV error is minimized at cost = 10 and gamma = 0.01, and the dispersion is smallest at that point as well.
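
Pulling the best cross-validated error out of each tune() object gives a compact summary of the three kernels (my addition; it assumes svm7b, svm7c, and svm7c2 are still in the workspace):

c(linear = svm7b$best.performance,
  polynomial = svm7c$best.performance,
  radial = svm7c2$best.performance)

Both the linear and the radial kernels do far better than the polynomial kernel on this data.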

gas.med = median(Auto$mpg)
new.var = ifelse(Auto$mpg > gas.med, 1, 0)
Auto$notfactor = as.factor(new.var)

svm.linear = svm(notfactor ~ . - name, data = Auto, kernel = "linear", cost = 1)
svm.poly = svm(notfactor ~ . - name, data = Auto, kernel = "polynomial", cost = 10,
    degree = 2)
svm.radial = svm(notfactor ~ . - name, data = Auto, kernel = "radial", cost = 10, gamma = 0.01)

# notfactor is left as a factor: converting it back to numeric at this point
# confuses plot.svm(), which expects the response the models were fit with.

plotpairs = function(fit) {
    # Plot the fitted SVM against mpg and each remaining predictor in turn.
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "notfactor", "binary.mpg", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}
#plotpairs(svm.linear)
#plotpairs(svm.poly)
#plotpairs(svm.radial)
length(Auto$name)
## [1] 392
length(Auto$notfactor)
## [1] 392

Lovaas note: I originally had no idea why this wasn't working. It turns out that dropping name inside a formula needs `- name` (not `- Auto$name`), the exclusion list has to refer to the notfactor column that was actually created, and notfactor has to stay a factor for plot.svm() to behave. I have still left the plot calls commented out.

Chapter 9 ex 8

attach(OJ)
set.seed(42)
train = sample(dim(OJ)[1], 800)
OJ.train = OJ[train, ]
OJ.test = OJ[-train, ]
(b) Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
svm.linear = svm(Purchase ~ ., kernel = "linear", data = OJ.train, cost = 0.01)
summary(svm.linear)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  432
## 
##  ( 215 217 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

Lovaas answer ex 8b: The support vector classifier uses 432 support vectors out of the 800 training observations, split almost evenly between the two classes (215 and 217). Here CH is Citrus Hill and MM is Minute Maid.

(c) What are the training and test error rates?
train.pred = predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 432  60
##   MM  77 231
test.pred = predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
##     test.pred
##       CH  MM
##   CH 142  19
##   MM  25  84
(60 + 77)/(432 + 60 + 77 + 231)
## [1] 0.17125
(19 + 25)/(142 + 19 + 25 + 84)
## [1] 0.162963

Lovaas answer ex 8c: The training error rate is 17.1% and the test error rate is 16.3%.
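
Equivalently (my addition), each error rate can be computed in one line with mean(), which avoids copying the confusion-matrix counts by hand:

mean(predict(svm.linear, OJ.train) != OJ.train$Purchase)   # training error, about 0.171
mean(predict(svm.linear, OJ.test) != OJ.test$Purchase)     # test error, about 0.163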

(d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
set.seed(42)
tune.out = tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear", ranges = list(cost = 10^seq(-2, 
    1, by = 0.25)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.175 
## 
## - Detailed performance results:
##           cost   error dispersion
## 1   0.01000000 0.17750 0.02415229
## 2   0.01778279 0.17625 0.02664713
## 3   0.03162278 0.17625 0.02913689
## 4   0.05623413 0.18125 0.03498512
## 5   0.10000000 0.17625 0.03356689
## 6   0.17782794 0.17625 0.02913689
## 7   0.31622777 0.17750 0.02874698
## 8   0.56234133 0.17875 0.02503470
## 9   1.00000000 0.17500 0.02886751
## 10  1.77827941 0.18000 0.02140872
## 11  3.16227766 0.18000 0.02648375
## 12  5.62341325 0.18625 0.02972676
## 13 10.00000000 0.18625 0.02729087

Lovaas answer ex 8d: Tuning selects cost = 1, which has the lowest cross-validated error (0.175). Several neighboring costs perform almost identically, so the result is not very sensitive to the exact cost chosen.

(e) Compute the training and test error rates using this new value for cost.
svm.linear = svm(Purchase ~ ., kernel = "linear", data = OJ.train, cost = tune.out$best.parameters$cost)
train.pred = predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 434  58
##   MM  76 232
test.pred = predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
##     test.pred
##       CH  MM
##   CH 140  21
##   MM  23  86
(58 + 76)/(434 + 58 + 76 + 232)
## [1] 0.1675
(21 + 23)/(140 + 21 + 23 + 86)
## [1] 0.162963

Lovaas answer ex 8e: With the tuned cost, the training error is 16.75% and the test error is 16.3%.

set.seed(42)
svm.radial = svm(Purchase ~ ., data = OJ.train, kernel = "radial")
summary(svm.radial)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  375
## 
##  ( 183 192 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
train.pred = predict(svm.radial, OJ.train)
table(OJ.train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 453  39
##   MM  81 227
test.pred = predict(svm.radial, OJ.test)
table(OJ.test$Purchase, test.pred)
##     test.pred
##       CH  MM
##   CH 146  15
##   MM  28  81
(39 + 81)/(453 + 39 + 81 + 227)
## [1] 0.15
(15 + 28)/(146 + 15 + 28 + 81)
## [1] 0.1592593

Lovaas answer ex 8f: Training error 15% and test error 15.93%. This is an improvement!

set.seed(42)
svm.poly = svm(Purchase ~ ., data = OJ.train, kernel = "poly", degree = 2)
summary(svm.poly)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "poly", degree = 2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  443
## 
##  ( 217 226 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
train.pred = predict(svm.poly, OJ.train)
table(OJ.train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 461  31
##   MM 112 196
test.pred = predict(svm.poly, OJ.test)
table(OJ.test$Purchase, test.pred)
##     test.pred
##       CH  MM
##   CH 149  12
##   MM  41  68
(31 + 112)/(461 + 31 + 112 + 196)
## [1] 0.17875
(12 + 41)/(149 + 12 + 41 + 68)
## [1] 0.1962963

Lovaas answer ex 8g: Training error 17.9% and test error 19.6%. This is definitely not an improvement!

(h) Overall, which approach seems to give the best results on this data?

Lovaas answer ex 8h: Overall, the SVM with a radial kernel and the default value of gamma gives the lowest training and test error rates, so that is my choice.
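
To back that conclusion up in one place (my addition; it assumes svm.linear now holds the tuned linear fit from 8e, with svm.radial and svm.poly from 8f and 8g), the test error rates can be recomputed side by side:

test.err = function(fit) mean(predict(fit, OJ.test) != OJ.test$Purchase)
sapply(list(linear = svm.linear, radial = svm.radial, polynomial = svm.poly), test.err)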