We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
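The chunks below assume the e1071 package (for svm and tune) and the ISLR package (for the Auto and OJ data sets) are loaded; the seed set here is only an illustrative choice for reruns, not the seed behind the output shown.
library(e1071)  # provides svm() and tune()
library(ISLR)   # provides the Auto and OJ data sets used later
set.seed(131)   # arbitrary seed; the simulated values below will differ by run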
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1 * (x1^2 - x2^2 > 0)  # class 1 whenever x1^2 > x2^2
plot(x1[y == 0], x2[y == 0], col = "purple", xlab = "X1", ylab = "X2", pch = 4)
points(x1[y == 1], x2[y == 1], col = "blue", pch = 5)
glm.random = glm(y ~ x1 + x2, family = binomial)
summary(glm.random)
##
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.314 -1.124 -1.034 1.183 1.327
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.02597 0.08993 -0.289 0.7727
## x1 0.06184 0.31699 0.195 0.8453
## x2 0.63439 0.30817 2.059 0.0395 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 693.02 on 499 degrees of freedom
## Residual deviance: 688.73 on 497 degrees of freedom
## AIC: 694.73
##
## Number of Fisher Scoring iterations: 3
# Not using SVM yet
data = data.frame(x1 = x1, x2 = x2, y = y)
glm.prob = predict(glm.random, data, type = "response")
glm.pred = ifelse(glm.prob > 0.50, 1, 0)
data.pos = data[glm.pred == 1, ]
data.neg = data[glm.pred == 0, ]
plot(data.pos$x1, data.pos$x2, col = "purple", xlab = "X1", ylab = "X2", pch = 4)
points(data.neg$x1, data.neg$x2, col = "blue", pch = 5)
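As a quick sanity check (not shown in the output above), the training confusion matrix and accuracy of this linear logistic fit can be read straight off the objects just created:
table(glm.pred, y)   # rows: predicted class, columns: true class
mean(glm.pred == y)  # training accuracy of the linear logistic fit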
# glm.random2 = glm(y ~ x1 + poly(x1, 2) + poly(x1, 3) + x2 + poly(x2, 2), family = binomial)  # not run
glm.random2 = glm(y ~ poly(x1,2) + poly(x2,2), family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.random2)
##
## Call:
## glm(formula = y ~ poly(x1, 2) + poly(x2, 2), family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.643e-04 -2.000e-08 -2.000e-08 2.000e-08 1.120e-03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -76.81 2886.88 -0.027 0.979
## poly(x1, 2)1 682.65 52298.32 0.013 0.990
## poly(x1, 2)2 25774.14 855891.10 0.030 0.976
## poly(x2, 2)1 459.52 50367.37 0.009 0.993
## poly(x2, 2)2 -27093.14 900121.22 -0.030 0.976
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9302e+02 on 499 degrees of freedom
## Residual deviance: 2.9345e-06 on 495 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
Lovaas note ex 5e: ignoring the warnings, since this fit is just meant to illustrate the non-linearity. (They appear because the quadratic terms separate the two classes perfectly, so the fitted probabilities are pushed to 0 and 1 and the coefficients blow up.)
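The same behaviour shows up with explicit squared terms; this small sketch (glm.explicit is a name introduced here, not part of the original) makes the link to the true boundary x1^2 = x2^2 more obvious:
# expect the same perfect-separation warnings: a large positive coefficient on
# x1^2 and a large negative one on x2^2 reproduce y exactly
glm.explicit = glm(y ~ I(x1^2) + I(x2^2), family = binomial)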
glm.prob2 = predict(glm.random2, data, type = "response")
glm.pred2 = ifelse(glm.prob2 > 0.50, 1, 0)
data.pos2 = data[glm.pred2 == 1, ]
data.neg2 = data[glm.pred2 == 0, ]
plot(data.pos2$x1, data.pos2$x2, col = "purple", xlab = "X1", ylab = "X2", pch = 4)
points(data.neg2$x1, data.neg2$x2, col = "blue", pch = 5)
Lovaas note ex 5f: The predicted classes clearly follow a non-linear decision boundary.
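Since y was generated from the sign of x1^2 - x2^2, the true boundary is the pair of lines x2 = x1 and x2 = -x1; assuming the plot from the previous chunk is still the active device, they can be overlaid for comparison:
abline(0, 1, lty = 2)   # true boundary: x2 = x1
abline(0, -1, lty = 2)  # true boundary: x2 = -x1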
svm1 = svm(as.factor(y) ~ x1 + x2, data, gamma = 1)
svm1
##
## Call:
## svm(formula = as.factor(y) ~ x1 + x2, data = data, gamma = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 158
svm.pred = predict(svm1, data)
data.pos.svm = data[svm.pred == 1, ]
data.neg.svm = data[svm.pred == 0, ]
plot(data.pos.svm$x1, data.pos.svm$x2, col = "purple", xlab = "X1", ylab = "X2", pch = 4)
points(data.neg.svm$x1, data.neg.svm$x2, col = "blue", pch = 5)
Lovaas commenting on findings: This was a fun experiment! The SVM was pretty good at fitting a model that's non-linear. The GLM model from 5d did a pretty poor job, while the SVM in 5h fit the data much better.
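To put rough numbers on that comparison (these two lines were not part of the original output), the training error rates of the two fits are:
mean(glm.pred != y)                            # linear logistic regression
mean(as.numeric(as.character(svm.pred)) != y)  # radial-kernel SVM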
gas.med = median(Auto$mpg)  # median mpg as the threshold for the binary response
binary.mpg = ifelse(Auto$mpg > gas.med, 1, 0)
Auto$binary.mpg = as.factor(binary.mpg)
set.seed(42)
svm7b = tune(svm, binary.mpg ~ ., data = Auto, kernel = "linear",
    ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100)))
summary(svm7b)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.0075
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-02 0.07391026 0.05301906
## 2 1e-01 0.04326923 0.03594224
## 3 1e+00 0.00750000 0.01687371
## 4 5e+00 0.01519231 0.01760469
## 5 1e+01 0.01775641 0.01700310
## 6 1e+02 0.03044872 0.02842322
Lovaas answer ex 7b: The CV error is minimized at cost = 1 (error = 0.0075), and the dispersion is also at its lowest there.
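A natural follow-up, sketched here with names of my own choosing, is to refit the classifier at the selected cost and look at its confusion matrix on the full data set:
best.linear = svm(binary.mpg ~ ., data = Auto, kernel = "linear", cost = 1)
table(predict(best.linear, Auto), Auto$binary.mpg)  # rows: predicted, columns: actual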
set.seed(42)
svm7c = tune(svm, binary.mpg ~ ., data = Auto, kernel = "polynomial",
    ranges = list(cost = c(0.1, 1, 5, 10), degree = c(2, 3, 4)))
summary(svm7c)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 10 2
##
## - best performance: 0.5814103
##
## - Detailed performance results:
## cost degree error dispersion
## 1 0.1 2 0.5967949 0.05312225
## 2 1.0 2 0.5967949 0.05312225
## 3 5.0 2 0.5967949 0.05312225
## 4 10.0 2 0.5814103 0.05427855
## 5 0.1 3 0.5967949 0.05312225
## 6 1.0 3 0.5967949 0.05312225
## 7 5.0 3 0.5967949 0.05312225
## 8 10.0 3 0.5967949 0.05312225
## 9 0.1 4 0.5967949 0.05312225
## 10 1.0 4 0.5967949 0.05312225
## 11 5.0 4 0.5967949 0.05312225
## 12 10.0 4 0.5967949 0.05312225
Lovaas answer ex 7c: The CV error is minimized at cost = 10 and degree = 2, although the dispersion is actually at its maximum there. There is very little variation across the grid (every other combination gives an identical 0.597 error), so I wouldn't say this kernel works well here.
set.seed(463)
svm7c2 = tune(svm, binary.mpg ~ ., data = Auto, kernel = "radial",
    ranges = list(cost = c(0.1, 1, 5, 10), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(svm7c2)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 10 0.01
##
## - best performance: 0.02551282
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.1 1e-02 0.09429487 0.04814900
## 2 1.0 1e-02 0.07897436 0.03875105
## 3 5.0 1e-02 0.05352564 0.02532795
## 4 10.0 1e-02 0.02551282 0.02417610
## 5 0.1 1e-01 0.07891026 0.03847631
## 6 1.0 1e-01 0.05602564 0.02881876
## 7 5.0 1e-01 0.03826923 0.03252085
## 8 10.0 1e-01 0.03320513 0.02964746
## 9 0.1 1e+00 0.57660256 0.05479863
## 10 1.0 1e+00 0.06628205 0.02996211
## 11 5.0 1e+00 0.06115385 0.02733573
## 12 10.0 1e+00 0.06115385 0.02733573
## 13 0.1 5e+00 0.57660256 0.05479863
## 14 1.0 5e+00 0.51538462 0.06642516
## 15 5.0 5e+00 0.50775641 0.07152757
## 16 10.0 5e+00 0.50775641 0.07152757
## 17 0.1 1e+01 0.57660256 0.05479863
## 18 1.0 1e+01 0.53833333 0.05640443
## 19 5.0 1e+01 0.53070513 0.05708644
## 20 10.0 1e+01 0.53070513 0.05708644
## 21 0.1 1e+02 0.57660256 0.05479863
## 22 1.0 1e+02 0.57660256 0.05479863
## 23 5.0 1e+02 0.57660256 0.05479863
## 24 10.0 1e+02 0.57660256 0.05479863
Lovaas answer for the other part of ex 7c: This is for the radial basis kernel. It took quite a while to run, and the CV error is minimized at cost = 10 and gamma = 0.01; the dispersion is also at its minimum at that point.
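tune() also stores the model refit at the winning parameters in its best.model component, so the chosen radial fit can be inspected without refitting (a small convenience not used in the original):
best.radial = svm7c2$best.model  # refit at cost = 10, gamma = 0.01
summary(best.radial)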
gas.med = median(Auto$mpg)
new.var = ifelse(Auto$mpg > gas.med, 1, 0)
Auto$notfactor = as.factor(new.var)
# exclude the name column with "- name" inside the formula (not "- Auto$name")
svm.linear = svm(notfactor ~ . - name, data = Auto, kernel = "linear", cost = 1)
svm.poly = svm(notfactor ~ . - name, data = Auto, kernel = "polynomial", cost = 10, degree = 2)
svm.radial = svm(notfactor ~ . - name, data = Auto, kernel = "radial", cost = 10, gamma = 0.01)
Auto$notfactor = as.numeric(new.var)
plotpairs = function(fit) {
    # plot the fitted SVM over mpg and each remaining predictor, one pair at a time
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "notfactor", "binary.mpg", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}
#plotpairs(svm.linear)
#plotpairs(svm.poly)
#plotpairs(svm.radial)
length(Auto$name)
## [1] 392
length(Auto$notfactor)
## [1] 392
Ya know what, I have no idea why these plots aren't working, and I don't care! I don't care! (One guess at a fix is sketched below.)
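For what it's worth, here is a sketch of how the plots might be attempted, assuming the trouble comes from the extra columns hanging around in Auto and from turning notfactor back into a numeric before plotting (both of these are guesses, and auto.sub and svm.demo are made-up names):
auto.sub = Auto[, !(names(Auto) %in% c("name", "binary.mpg"))]  # keep only what the fit needs
auto.sub$notfactor = as.factor(new.var)                         # keep the response a factor
svm.demo = svm(notfactor ~ ., data = auto.sub, kernel = "linear", cost = 1)
plot(svm.demo, auto.sub, weight ~ horsepower)  # one pair of predictors at a time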
attach(OJ)
set.seed(42)
train = sample(dim(OJ)[1], 800)
OJ.train = OJ[train, ]
OJ.test = OJ[-train, ]
svm.linear = svm(Purchase ~ ., kernel = "linear", data = OJ.train, cost = 0.01)
summary(svm.linear)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 432
##
## ( 215 217 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
Lovaas answer ex 8b: The support vector classifier uses 432 support vectors out of the 800 training points. Of these, 217 belong to level CH (Citrus Hill) and the remaining 215 belong to level MM (Minute Maid).
train.pred = predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 432 60
## MM 77 231
test.pred = predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 142 19
## MM 25 84
(60 + 77)/(432 + 60 + 77 + 231)
## [1] 0.17125
(19 + 25)/(142 + 19 + 25 + 84)
## [1] 0.162963
Lovaas answer ex 8c: The training error rate is 17.1% and the test error rate is 16.3%.
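The same error rates come out of mean() directly, which is a little less error-prone than the hand arithmetic above:
mean(train.pred != OJ.train$Purchase)  # training error rate
mean(test.pred != OJ.test$Purchase)    # test error rate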
set.seed(42)
tune.out = tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear",
    ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.175
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.17750 0.02415229
## 2 0.01778279 0.17625 0.02664713
## 3 0.03162278 0.17625 0.02913689
## 4 0.05623413 0.18125 0.03498512
## 5 0.10000000 0.17625 0.03356689
## 6 0.17782794 0.17625 0.02913689
## 7 0.31622777 0.17750 0.02874698
## 8 0.56234133 0.17875 0.02503470
## 9 1.00000000 0.17500 0.02886751
## 10 1.77827941 0.18000 0.02140872
## 11 3.16227766 0.18000 0.02648375
## 12 5.62341325 0.18625 0.02972676
## 13 10.00000000 0.18625 0.02729087
Lovaas answer ex 8d: Tuning selects cost = 1, where the CV error is minimized at 0.175. The errors across the whole cost grid are within about a percentage point of each other, and well within the dispersion, so the exact choice of cost matters little here.
svm.linear = svm(Purchase ~ ., kernel = "linear", data = OJ.train, cost = tune.out$best.parameters$cost)
train.pred = predict(svm.linear, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 434 58
## MM 76 232
test.pred = predict(svm.linear, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 140 21
## MM 23 86
(58 + 76)/(434 + 58 + 76 + 232)
## [1] 0.1675
(21 + 23)/(140 + 21 + 23 + 86)
## [1] 0.162963
Lovaas answer ex 8e: With the tuned cost, the training error drops slightly to 16.75% while the test error stays at 16.3%.
set.seed(42)
svm.radial = svm(Purchase ~ ., data = OJ.train, kernel = "radial")
summary(svm.radial)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 375
##
## ( 183 192 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred = predict(svm.radial, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 453 39
## MM 81 227
test.pred = predict(svm.radial, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 146 15
## MM 28 81
(39 + 81)/(453 + 39 + 81 + 227)
## [1] 0.15
(15 + 28)/(146 + 15 + 28 + 81)
## [1] 0.1592593
Lovaas answer ex 8f: Training error 15% and test error 15.93%. This is an improvement!
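For symmetry with the linear kernel in 8d, cost could also be tuned for the radial kernel; this was not run here (it would take a while), so it is only a sketch with a made-up name:
tune.radial = tune(svm, Purchase ~ ., data = OJ.train, kernel = "radial",
    ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.radial)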
set.seed(42)
svm.poly = svm(Purchase ~ ., data = OJ.train, kernel = "poly", degree = 2)
summary(svm.poly)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "poly", degree = 2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 443
##
## ( 217 226 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
train.pred = predict(svm.poly, OJ.train)
table(OJ.train$Purchase, train.pred)
## train.pred
## CH MM
## CH 461 31
## MM 112 196
test.pred = predict(svm.poly, OJ.test)
table(OJ.test$Purchase, test.pred)
## test.pred
## CH MM
## CH 149 12
## MM 41 68
(31 + 112)/(461 + 31 + 112 + 196)
## [1] 0.17875
(12 + 41)/(149 + 12 + 41 + 68)
## [1] 0.1962963
Lovaas answer ex 8g: Training error 17.9% and test error 19.6%. This is definitely not an improvement!
Lovaas answer ex 8h: The SVM with a radial kernel and the default value of gamma gave the lowest training and test errors, so that is the model I would choose.