5. We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.
set.seed(12)
x1 <- runif(500) - 0.5  # 500 uniform draws, shifted to be centered at 0
x2 <- runif(500) - 0.5
y <- 1*(x1^2 - x2^2 > 0)  # class 1 when x1^2 > x2^2, giving a quadratic boundary
(b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.
df=data.frame(x1=x1, x2=x2, y=as.factor(y))
plot(x=x1, y=x2, col=(2-y))
(c) Fit a logistic regression model to the data, using X1 and X2 as predictors.
glm_fit <- glm(y~x1+x2,data=df, family = "binomial")
summary(glm_fit)
##
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.350 -1.165 1.050 1.151 1.291
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.04927 0.08978 0.549 0.583
## x1 -0.23002 0.31534 -0.729 0.466
## x2 0.51072 0.31560 1.618 0.106
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.86 on 499 degrees of freedom
## Residual deviance: 689.58 on 497 degrees of freedom
## AIC: 695.58
##
## Number of Fisher Scoring iterations: 3
(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
glm_probs <- predict(glm_fit, newdata = df, type = "response")
glm_preds <- rep(0,500)
glm_preds[glm_probs > 0.5] <- 1
plot(x1,x2, col = 2-glm_preds)
We can see from the plot that the decision boundary is linear.
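As a check, the fitted boundary is the line where the predicted probability equals 0.5, i.e. where the linear predictor is zero. A minimal sketch overlaying it on the plot above, using the coefficients of glm_fit:
# Boundary: cf[1] + cf[2]*x1 + cf[3]*x2 = 0, solved for x2
cf <- coef(glm_fit)
abline(a = -cf[1]/cf[3], b = -cf[2]/cf[3], lty = 2)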
table(predictions=glm_preds, truth = df$y)
## truth
## predictions 0 1
## 0 125 73
## 1 119 183
(119+73)/500
## [1] 0.384
The training error rate is 0.384, or 38.4%.
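Equivalently, the training error rate can be computed directly rather than read off the table (a one-line sketch using the objects above):
mean(glm_preds != y)  # 0.384, matching (119+73)/500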
(e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1^2, X1 × X2, log(X2), and so forth).
set.seed(12)
glm_fit2 <- glm(y~I(x1^2)+I(x2^2),data=df, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_fit2)
##
## Call:
## glm(formula = y ~ I(x1^2) + I(x2^2), family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.004247 0.000000 0.000000 0.000000 0.004301
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.291 165.476 -0.026 0.979
## I(x1^2) 67337.813 812667.866 0.083 0.934
## I(x2^2) -67293.307 809254.186 -0.083 0.934
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.9286e+02 on 499 degrees of freedom
## Residual deviance: 3.8093e-05 on 497 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
glm_probs <- predict(glm_fit2, newdata = df, type = "response")
glm_preds <- rep(0,500)
glm_preds[glm_probs > 0.5] <- 1
plot(x1,x2, col = 2-glm_preds)
The decision boundary between the predicted classes here is obviously non-linear.
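This makes sense: the true boundary generated in (a) is x1^2 = x2^2, i.e. the pair of lines x2 = x1 and x2 = -x1. A minimal sketch overlaying those lines on the plot above:
# The true quadratic boundary x1^2 - x2^2 = 0 is the pair of lines x2 = x1 and x2 = -x1
abline(a = 0, b = 1, lty = 2)
abline(a = 0, b = -1, lty = 2)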
table(predictions=glm_preds, truth = df$y)
## truth
## predictions 0 1
## 0 244 0
## 1 0 256
The model with quadratic terms in x1 and x2 made zero training prediction errors.
(g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
library(e1071)
#Best model
tune_out=tune(svm,y~x1+x2,data = df,kernel='linear', ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
bestmod=tune_out$best.model
plot(bestmod,df)
ypred <- predict(bestmod, newdata = df)  # returns predicted class labels
plot(x1,x2,col=ypred)
table(predict=ypred, truth=df$y)
## truth
## predict 0 1
## 0 127 56
## 1 117 200
(117+56)/500
## [1] 0.346
The error rate here is 0.346, or 34.6%.
(h) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
#Best model
tune_out2=tune(svm,y~x1+x2,data = df,kernel='radial', ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
bestmod2=tune_out2$best.model
plot(bestmod2,df)
ypred <- predict(bestmod2, newdata = df)  # returns predicted class labels
plot(x1,x2,col=ypred)
table(predict=ypred, truth=df$y)
## truth
## predict 0 1
## 0 243 1
## 1 1 255
2/500
## [1] 0.004
The error rate here is 0.4%.
(i) Comment on your results.
The logistic regression model that used linear predictors had a training error rate of 38.4%, while the logistic regression model that used quadratic predictors had a training error rate of 0%. The SVM with a linear kernel had a training error rate of 34.6%, while the SVM with a radial kernel had a training error rate of 0.4%. As mentioned in the ISLR textbook, these results suggest that logistic regression with non-linear transformations of the features and SVMs with non-linear kernels are closely related techniques.
7. In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
library(ISLR)
set.seed(88)
Auto$mpg <- ifelse(Auto$mpg>median(Auto$mpg),1,0)
table(Auto$mpg)
##
## 0 1
## 196 196
Here we have split the observations into two groups, 0 and 1, based on whether their gas mileage is above the median.
(b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.
Auto$mpg <- as.factor(Auto$mpg)
linear_tune <- tune(svm, mpg~., data = Auto, kernel = "linear", ranges = list(cost = c(0.001,0.01,0.1,1,5,10,100,1000)))
summary(linear_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.08653846
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-03 0.13243590 0.07995006
## 2 1e-02 0.08653846 0.07188448
## 3 1e-01 0.09673077 0.06285281
## 4 1e+00 0.10448718 0.04953303
## 5 5e+00 0.11205128 0.04915905
## 6 1e+01 0.11461538 0.04777235
## 7 1e+02 0.13000000 0.06252021
## 8 1e+03 0.12480769 0.05901774
We can see from this summary that the cross-validation error is lowest (0.0865) when the cost is 0.01. As the cost increases past this value, the error generally increases.
(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.
set.seed(88)
radial_tune <- tune(svm, mpg~., data = Auto, kernel = "radial", ranges = list(cost = c(0.001,0.01,0.1,1,5,10,100,1000),gamma=c(0.5,1,2,3,4)))
summary(radial_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 0.1 0.5
##
## - best performance: 0.08653846
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-03 0.5 0.57121795 0.06318174
## 2 1e-02 0.5 0.57121795 0.06318174
## 3 1e-01 0.5 0.08653846 0.06688000
## 4 1e+00 0.5 0.08660256 0.06800544
## 5 5e+00 0.5 0.08923077 0.06725632
## 6 1e+01 0.5 0.09429487 0.06369561
## 7 1e+02 0.5 0.09173077 0.06031128
## 8 1e+03 0.5 0.09173077 0.06031128
## 9 1e-03 1.0 0.57121795 0.06318174
## 10 1e-02 1.0 0.57121795 0.06318174
## 11 1e-01 1.0 0.57121795 0.06318174
## 12 1e+00 1.0 0.08910256 0.06926571
## 13 5e+00 1.0 0.08916667 0.07344863
## 14 1e+01 1.0 0.08916667 0.07344863
## 15 1e+02 1.0 0.08916667 0.07344863
## 16 1e+03 1.0 0.08916667 0.07344863
## 17 1e-03 2.0 0.57121795 0.06318174
## 18 1e-02 2.0 0.57121795 0.06318174
## 19 1e-01 2.0 0.57121795 0.06318174
## 20 1e+00 2.0 0.12724359 0.07158933
## 21 5e+00 2.0 0.12474359 0.07015852
## 22 1e+01 2.0 0.12474359 0.07015852
## 23 1e+02 2.0 0.12474359 0.07015852
## 24 1e+03 2.0 0.12474359 0.07015852
## 25 1e-03 3.0 0.57121795 0.06318174
## 26 1e-02 3.0 0.57121795 0.06318174
## 27 1e-01 3.0 0.57121795 0.06318174
## 28 1e+00 3.0 0.35416667 0.21004072
## 29 5e+00 3.0 0.33634615 0.20588448
## 30 1e+01 3.0 0.33634615 0.20588448
## 31 1e+02 3.0 0.33634615 0.20588448
## 32 1e+03 3.0 0.33634615 0.20588448
## 33 1e-03 4.0 0.57121795 0.06318174
## 34 1e-02 4.0 0.57121795 0.06318174
## 35 1e-01 4.0 0.57121795 0.06318174
## 36 1e+00 4.0 0.43314103 0.17153477
## 37 5e+00 4.0 0.42814103 0.16154198
## 38 1e+01 4.0 0.42814103 0.16154198
## 39 1e+02 4.0 0.42814103 0.16154198
## 40 1e+03 4.0 0.42814103 0.16154198
We can see from this summary that the lowest error comes when the cost is 0.1 and the gamma is 0.5.
set.seed(88)
polynomial_tune <- tune(svm, mpg~., data = Auto, kernel = "polynomial", ranges = list(cost = c(0.001,0.01,0.1,1,5,10,100,1000),gamma=c(0.5,1,2,3,4)))
summary(polynomial_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 0.01 2
##
## - best performance: 0.08416667
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-03 0.5 0.25724359 0.10393495
## 2 1e-02 0.5 0.10198718 0.06462921
## 3 1e-01 0.5 0.09948718 0.05018848
## 4 1e+00 0.5 0.08673077 0.04693472
## 5 5e+00 0.5 0.10205128 0.04326256
## 6 1e+01 0.5 0.09955128 0.04715212
## 7 1e+02 0.5 0.09955128 0.04867674
## 8 1e+03 0.5 0.09955128 0.04867674
## 9 1e-03 1.0 0.10198718 0.05993767
## 10 1e-02 1.0 0.09948718 0.05018848
## 11 1e-01 1.0 0.08673077 0.04693472
## 12 1e+00 1.0 0.10205128 0.04326256
## 13 5e+00 1.0 0.09955128 0.04867674
## 14 1e+01 1.0 0.09955128 0.04867674
## 15 1e+02 1.0 0.09955128 0.04867674
## 16 1e+03 1.0 0.09955128 0.04867674
## 17 1e-03 2.0 0.09179487 0.04973522
## 18 1e-02 2.0 0.08416667 0.04667329
## 19 1e-01 2.0 0.10205128 0.04326256
## 20 1e+00 2.0 0.09955128 0.04867674
## 21 5e+00 2.0 0.09955128 0.04867674
## 22 1e+01 2.0 0.09955128 0.04867674
## 23 1e+02 2.0 0.09955128 0.04867674
## 24 1e+03 2.0 0.09955128 0.04867674
## 25 1e-03 3.0 0.10205128 0.05101139
## 26 1e-02 3.0 0.09435897 0.04954391
## 27 1e-01 3.0 0.10211538 0.04952109
## 28 1e+00 3.0 0.09955128 0.04867674
## 29 5e+00 3.0 0.09955128 0.04867674
## 30 1e+01 3.0 0.09955128 0.04867674
## 31 1e+02 3.0 0.09955128 0.04867674
## 32 1e+03 3.0 0.09955128 0.04867674
## 33 1e-03 4.0 0.08673077 0.05279469
## 34 1e-02 4.0 0.10205128 0.04326256
## 35 1e-01 4.0 0.09955128 0.04867674
## 36 1e+00 4.0 0.09955128 0.04867674
## 37 5e+00 4.0 0.09955128 0.04867674
## 38 1e+01 4.0 0.09955128 0.04867674
## 39 1e+02 4.0 0.09955128 0.04867674
## 40 1e+03 4.0 0.09955128 0.04867674
We can see from this summary that the error is lowest when the cost is 0.01 and the gamma is 2. The lowest error using the linear SVM was 0.0865, the lowest error using the radial kernel was 0.0865, and the lowest using the polynomial kernel was 0.0842. These results suggest that the error rate is similar no matter what method we decide to use for classification.
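As a quick programmatic check, the best cross-validation error for each kernel can be pulled straight from the tune objects (a sketch; tune() stores it in $best.performance):
linear_tune$best.performance      # 0.0865
radial_tune$best.performance      # 0.0865
polynomial_tune$best.performance  # 0.0842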
(d) Make some plots to back up your assertions in (b) and (c).
plot(linear_tune$performances[,c(1,2)], type = "l", xlim = c(0,1))
This plot shows that the error for the linear model is lowest when the cost is 0.01.
radial_perf <- radial_tune$performances
plot(x=radial_perf$cost, radial_perf$error, xlim=c(0,1))
plot(x=radial_perf$gamma,y=radial_perf$error)
We can see from the first plot that the error is lowest when the cost is 0.1, and from the second plot we can see that error is lowest when the gamma is 0.5.
poly_perf <- polynomial_tune$performances
plot(x=poly_perf$cost, poly_perf$error, xlim=c(0,0.05))
plot(x=poly_perf$gamma,y=poly_perf$error)
From the first plot, we can see that the error is lowest when the cost is 0.01. We can see from the second plot that the error is lowest when gamma is 2.
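Because the cost grid spans several orders of magnitude, the xlim values above hide most of the points; a log-scale x-axis shows the whole grid at once (a sketch reusing the performance data frames from above):
# CV error vs. cost on a log scale, one plot per kernel
plot(error ~ cost, data = linear_tune$performances, type = "b", log = "x")
plot(error ~ cost, data = radial_perf, log = "x")
plot(error ~ cost, data = poly_perf, log = "x")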
8. This problem involves the OJ data set which is part of the ISLR package.
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
library(ISLR)
set.seed(76)
model_train <- sample(1:nrow(OJ), 800)
OJ_train <- OJ[model_train,]
OJ_test <- OJ[-model_train,]
(b) Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
OJ$Purchase <- as.factor(OJ$Purchase)
OJ_svm <-svm(Purchase~., data = OJ_train, kernel = "linear", cost = 0.01)
summary(OJ_svm)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ_train, kernel = "linear", cost = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 448
##
## ( 223 225 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
We can see from this summary that of the 800 observations in the training data, 448 were used as support vectors. This large number of support vectors indicates that the margin is wide.
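As a sketch, the proportion of training observations used as support vectors can be computed directly from the fitted object (svm() stores their row indices in $index):
length(OJ_svm$index) / nrow(OJ_train)  # 448/800 = 0.56, over half the training set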
(c) What are the training and test error rates?
svm_pred <- predict(OJ_svm,OJ_train)
table(prediction = svm_pred, truth = OJ_train$Purchase)
## truth
## prediction CH MM
## CH 432 81
## MM 55 232
(55+81)/800
## [1] 0.17
The training error rate is 0.17, or 17%.
svm_test_pred <- predict(OJ_svm,OJ_test)
table(prediction = svm_test_pred, truth = OJ_test$Purchase)
## truth
## prediction CH MM
## CH 147 26
## MM 19 78
(19+26)/270
## [1] 0.1666667
The test error rate is 0.167, or 16.7%.
(d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
set.seed(76)
OJ_tune_out <- tune(svm, Purchase~., data = OJ_train, kernel = "linear", ranges = list(cost = c(0.01,0.1,1,5,10)))
summary(OJ_tune_out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 5
##
## - best performance: 0.17
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.1800 0.02898755
## 2 0.10 0.1700 0.02838231
## 3 1.00 0.1725 0.03106892
## 4 5.00 0.1700 0.03016160
## 5 10.00 0.1725 0.02934469
The cross-validation error is lowest (0.17) when the cost is 0.1 or 5; tune() selects 5 as the best value.
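To visualize this near-tie, a minimal sketch plotting the cross-validation error against cost on a log scale, using the performances stored in the tune object:
plot(error ~ cost, data = OJ_tune_out$performances, type = "b", log = "x")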
(e) Compute the training and test error rates using this new value for cost.
OJ_svm <-svm(Purchase~., data = OJ_train, kernel = "linear", cost = 5)
summary(OJ_svm)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ_train, kernel = "linear", cost = 5)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 5
##
## Number of Support Vectors: 341
##
## ( 171 170 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
svm_pred <- predict(OJ_svm,OJ_train)
table(prediction = svm_pred, truth = OJ_train$Purchase)
## truth
## prediction CH MM
## CH 427 70
## MM 60 243
(60+70)/800
## [1] 0.1625
The training error rate is 0.1625, or 16.25%.
svm_test_pred <- predict(OJ_svm,OJ_test)
table(prediction = svm_test_pred, truth = OJ_test$Purchase)
## truth
## prediction CH MM
## CH 144 19
## MM 22 85
(22+19)/270
## [1] 0.1518519
The test error rate is 0.152, or 15.2%.
(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
set.seed(76)
OJ_tune_out <- tune(svm, Purchase~., data = OJ_train, kernel = "radial", ranges = list(cost = c(0.01,0.1,1,5,10)))
summary(OJ_tune_out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.1675
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.39125 0.05591723
## 2 0.10 0.17875 0.03729108
## 3 1.00 0.16750 0.03496029
## 4 5.00 0.17500 0.03280837
## 5 10.00 0.18500 0.02622022
The lowest error is obtained when the cost is 1.
OJ_svm <-svm(Purchase~., data = OJ_train, kernel = "radial", cost = 1)
summary(OJ_svm)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ_train, kernel = "radial", cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 380
##
## ( 188 192 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
svm_pred <- predict(OJ_svm,OJ_train)
table(prediction = svm_pred, truth = OJ_train$Purchase)
## truth
## prediction CH MM
## CH 444 76
## MM 43 237
(43+76)/800
## [1] 0.14875
The training error rate is 0.14875, or 14.875%.
svm_test_pred <- predict(OJ_svm,OJ_test)
table(prediction = svm_test_pred, truth = OJ_test$Purchase)
## truth
## prediction CH MM
## CH 147 29
## MM 19 75
(29+19)/270
## [1] 0.1777778
The test error rate is 0.178, or 17.8%.
(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.
set.seed(76)
OJ_tune_out <- tune(svm, Purchase~., data = OJ_train, kernel = "polynomial", degree = 2, ranges = list(cost = c(0.01,0.1,1,5,10)))
summary(OJ_tune_out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 5
##
## - best performance: 0.19125
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.38750 0.05137012
## 2 0.10 0.31875 0.04649149
## 3 1.00 0.20625 0.03186887
## 4 5.00 0.19125 0.03007514
## 5 10.00 0.19125 0.03821086
The cross-validation error is lowest (0.19125) when the cost is 5 or 10.
OJ_svm <-svm(Purchase~., data = OJ_train, kernel = "polynomial", cost = 5, degree=2)
summary(OJ_svm)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ_train, kernel = "polynomial",
## cost = 5, degree = 2)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 5
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 377
##
## ( 184 193 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
svm_pred <- predict(OJ_svm,OJ_train)
table(prediction = svm_pred, truth = OJ_train$Purchase)
## truth
## prediction CH MM
## CH 448 89
## MM 39 224
(39+89)/800
## [1] 0.16
The training error rate is 0.16, or 16%.
svm_test_pred <- predict(OJ_svm,OJ_test)
table(prediction = svm_test_pred, truth = OJ_test$Purchase)
## truth
## prediction CH MM
## CH 153 30
## MM 13 74
(13+30)/270
## [1] 0.1592593
The test error rate is 0.159, or 15.9%.
(h) Overall, which approach seems to give the best results on this data?
Using the linear SVM with the optimal cost, the training error rate was 16.25% and the test error rate was 15.2%. Using the radial kernel with the optimal cost, the training error rate was 14.875% and the test error rate was 17.8%. Using the polynomial kernel with the optimal cost, the training error rate was 16% and the test error rate was 15.9%. The linear and polynomial kernels had similar training error rates, and the radial kernel had the lowest training error rate. However, the radial kernel also had the highest test error rate, so it does not give the best results. Since the linear SVM had a slightly lower test error rate than the polynomial kernel, the support vector classifier with a linear kernel gives the best results on this data.
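For reference, a sketch collecting the tuned training and test error rates from (e) through (g) into one table (counts copied from the confusion matrices above):
# Side-by-side comparison of tuned train/test error rates
data.frame(kernel = c("linear", "radial", "polynomial"),
           train_error = c(130/800, 119/800, 128/800),
           test_error  = c(41/270, 48/270, 43/270))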