Chapter 9: Problems 5, 7, 8

Problem 5

5. We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.

(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them.

set.seed(12)
x1 <- runif(500)-0.5
x2 <- runif(500)-0.5
y <- 1*(x1^2-x2^2 > 0)

(b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.

df=data.frame(x1=x1, x2=x2, y=as.factor(y))
plot(x=x1, y=x2, col=(2-y))
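
For reference, a labeled version of the same plot (purely cosmetic; the colors and point style are arbitrary choices):

plot(x1, x2, col = ifelse(y == 1, "red", "blue"), pch = 19,
     xlab = "X1", ylab = "X2", main = "True class labels")
legend("topright", legend = c("class 1", "class 0"), col = c("red", "blue"), pch = 19)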

(c) Fit a logistic regression model to the data, using X1 and X2 as predictors.

glm_fit <- glm(y~x1+x2,data=df, family = "binomial")
summary(glm_fit)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = df)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.350  -1.165   1.050   1.151   1.291  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.04927    0.08978   0.549    0.583
## x1          -0.23002    0.31534  -0.729    0.466
## x2           0.51072    0.31560   1.618    0.106
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 692.86  on 499  degrees of freedom
## Residual deviance: 689.58  on 497  degrees of freedom
## AIC: 695.58
## 
## Number of Fisher Scoring iterations: 3

(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.

glm_probs <- predict(glm_fit, newdata = df, type = "response")
glm_preds <- rep(0,500)
glm_preds[glm_probs > 0.5] =1
plot(x1,x2, col = 2-glm_preds)

We can see from the plot that the decision boundary is linear.
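
To make this concrete, the fitted boundary can be overlaid on the plot above: logistic regression predicts class 1 where the linear predictor b0 + b1*X1 + b2*X2 exceeds zero, so the boundary is the straight line X2 = -(b0 + b1*X1)/b2. A minimal sketch, added to the plot just drawn:

b <- coef(glm_fit)
abline(a = -b[1]/b[3], b = -b[2]/b[3], lty = 2)  # fitted linear decision boundary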

table(predictions=glm_preds, truth = df$y)
##            truth
## predictions   0   1
##           0 125  73
##           1 119 183
(119+73)/500
## [1] 0.384

The training error rate is 0.384, or 38.4%.
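
The same training error rate can also be computed in one line from the numeric y generated in part (a), as a sanity check on the hand computation above:

mean(glm_preds != y)  # 0.384, matching the confusion table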

(e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1^2 , X1 × X2, log(X2), and so forth).

set.seed(12)
glm_fit2 <- glm(y~I(x1^2)+I(x2^2),data=df, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_fit2)
## 
## Call:
## glm(formula = y ~ I(x1^2) + I(x2^2), family = "binomial", data = df)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.004247   0.000000   0.000000   0.000000   0.004301  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -4.291    165.476  -0.026    0.979
## I(x1^2)      67337.813 812667.866   0.083    0.934
## I(x2^2)     -67293.307 809254.186  -0.083    0.934
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6.9286e+02  on 499  degrees of freedom
## Residual deviance: 3.8093e-05  on 497  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25

(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

glm_probs <- predict(glm_fit2, newdata = df, type = "response")
glm_preds <- rep(0,500)
glm_preds[glm_probs > 0.5] =1
plot(x1,x2, col = 2-glm_preds)

The decision boundary implied by the predicted class labels is clearly non-linear.
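
The true boundary from part (a) is x1^2 = x2^2, i.e. the pair of lines x2 = x1 and x2 = -x1. A quick sketch overlaying them on the plot above shows that the predicted labels follow this boundary closely:

abline(a = 0, b = 1, lty = 2)   # x2 =  x1
abline(a = 0, b = -1, lty = 2)  # x2 = -x1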

table(predictions=glm_preds, truth = df$y)
##            truth
## predictions   0   1
##           0 244   0
##           1   0 256

The model using quadratic terms in x1 and x2 made zero training errors.

(g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

library(e1071)
#Best model
tune_out=tune(svm,y~x1+x2,data = df,kernel='linear', ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
bestmod=tune_out$best.model
plot(bestmod,df)

ypred <- predict(bestmod, newdata=df, type='response')
plot(x1,x2,col=ypred)

table(predict=ypred, truth=df$y)
##        truth
## predict   0   1
##       0 127  56
##       1 117 200
(117+56)/500
## [1] 0.346

The training error rate here is 0.346, or 34.6%.

(h) Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

#Best model
tune_out2=tune(svm,y~x1+x2,data = df,kernel='radial', ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100)))
bestmod2=tune_out2$best.model
plot(bestmod2,df)

ypred <- predict(bestmod2, newdata=df, type='response')
plot(x1,x2,col=ypred)

table(predict=ypred, truth=df$y)
##        truth
## predict   0   1
##       0 243   1
##       1   1 255
2/500
## [1] 0.004

The training error rate here is 0.004, or 0.4%.

(i) Comment on your results.

The logistic regression model that used linear predictors had a training error rate of 38.4%, while the logistic regression model that used quadratic predictors had a training error rate of 0%. The SVM with a linear kernel (the support vector classifier) had a training error rate of 34.6%, while the SVM with a radial kernel had a training error rate of 0.4%. As mentioned in the ISLR textbook, these results suggest that logistic regression with non-linear transformations of the features and SVMs with non-linear kernels are closely related and can produce similarly flexible decision boundaries.
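
For reference, a small data frame collecting the four training error rates quoted above (values copied from the earlier output, not recomputed):

data.frame(
  model = c("logistic, linear terms", "logistic, quadratic terms",
            "SVC, linear kernel", "SVM, radial kernel"),
  train_error = c(0.384, 0.000, 0.346, 0.004)
)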

Problem 7

7. In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.

library(ISLR)
set.seed(88)
Auto$mpg <- ifelse(Auto$mpg>median(Auto$mpg),1,0)
table(Auto$mpg)
## 
##   0   1 
## 196 196

Here we have split the observations into two groups, 0 and 1, according to whether mpg is above the median.
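
An alternative sketch keeps the original mpg column and adds a separate binary variable instead of overwriting it (the names Auto2 and mpg01 are just illustrative and are not used below; with this approach mpg would have to be excluded from the predictors, which is why overwriting is convenient here):

Auto2 <- ISLR::Auto
Auto2$mpg01 <- as.factor(ifelse(Auto2$mpg > median(Auto2$mpg), 1, 0))
table(Auto2$mpg01)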

(b) Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.

Auto$mpg <- as.factor(Auto$mpg)
linear_tune <- tune(svm, mpg~., data = Auto, kernel = "linear", ranges = list(cost = c(0.001,0.01,0.1,1,5,10,100,1000)))
summary(linear_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##  0.01
## 
## - best performance: 0.08653846 
## 
## - Detailed performance results:
##    cost      error dispersion
## 1 1e-03 0.13243590 0.07995006
## 2 1e-02 0.08653846 0.07188448
## 3 1e-01 0.09673077 0.06285281
## 4 1e+00 0.10448718 0.04953303
## 5 5e+00 0.11205128 0.04915905
## 6 1e+01 0.11461538 0.04777235
## 7 1e+02 0.13000000 0.06252021
## 8 1e+03 0.12480769 0.05901774

We can see from this summary that the cross-validation error is lowest (0.0865) when the cost is 0.01. As the cost increases past this value, the error generally increases.

(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.

set.seed(88)
radial_tune <- tune(svm, mpg~., data = Auto, kernel = "radial", ranges = list(cost = c(0.001,0.01,0.1,1,5,10,100,1000),gamma=c(0.5,1,2,3,4)))
summary(radial_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##   0.1   0.5
## 
## - best performance: 0.08653846 
## 
## - Detailed performance results:
##     cost gamma      error dispersion
## 1  1e-03   0.5 0.57121795 0.06318174
## 2  1e-02   0.5 0.57121795 0.06318174
## 3  1e-01   0.5 0.08653846 0.06688000
## 4  1e+00   0.5 0.08660256 0.06800544
## 5  5e+00   0.5 0.08923077 0.06725632
## 6  1e+01   0.5 0.09429487 0.06369561
## 7  1e+02   0.5 0.09173077 0.06031128
## 8  1e+03   0.5 0.09173077 0.06031128
## 9  1e-03   1.0 0.57121795 0.06318174
## 10 1e-02   1.0 0.57121795 0.06318174
## 11 1e-01   1.0 0.57121795 0.06318174
## 12 1e+00   1.0 0.08910256 0.06926571
## 13 5e+00   1.0 0.08916667 0.07344863
## 14 1e+01   1.0 0.08916667 0.07344863
## 15 1e+02   1.0 0.08916667 0.07344863
## 16 1e+03   1.0 0.08916667 0.07344863
## 17 1e-03   2.0 0.57121795 0.06318174
## 18 1e-02   2.0 0.57121795 0.06318174
## 19 1e-01   2.0 0.57121795 0.06318174
## 20 1e+00   2.0 0.12724359 0.07158933
## 21 5e+00   2.0 0.12474359 0.07015852
## 22 1e+01   2.0 0.12474359 0.07015852
## 23 1e+02   2.0 0.12474359 0.07015852
## 24 1e+03   2.0 0.12474359 0.07015852
## 25 1e-03   3.0 0.57121795 0.06318174
## 26 1e-02   3.0 0.57121795 0.06318174
## 27 1e-01   3.0 0.57121795 0.06318174
## 28 1e+00   3.0 0.35416667 0.21004072
## 29 5e+00   3.0 0.33634615 0.20588448
## 30 1e+01   3.0 0.33634615 0.20588448
## 31 1e+02   3.0 0.33634615 0.20588448
## 32 1e+03   3.0 0.33634615 0.20588448
## 33 1e-03   4.0 0.57121795 0.06318174
## 34 1e-02   4.0 0.57121795 0.06318174
## 35 1e-01   4.0 0.57121795 0.06318174
## 36 1e+00   4.0 0.43314103 0.17153477
## 37 5e+00   4.0 0.42814103 0.16154198
## 38 1e+01   4.0 0.42814103 0.16154198
## 39 1e+02   4.0 0.42814103 0.16154198
## 40 1e+03   4.0 0.42814103 0.16154198

We can see from this summary that the lowest error comes when the cost is 0.1 and the gamma is 0.5.

set.seed(88)
polynomial_tune <- tune(svm, mpg~., data = Auto, kernel = "polynomial", ranges = list(cost = c(0.001,0.01,0.1,1,5,10,100,1000),gamma=c(0.5,1,2,3,4)))
summary(polynomial_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##  0.01     2
## 
## - best performance: 0.08416667 
## 
## - Detailed performance results:
##     cost gamma      error dispersion
## 1  1e-03   0.5 0.25724359 0.10393495
## 2  1e-02   0.5 0.10198718 0.06462921
## 3  1e-01   0.5 0.09948718 0.05018848
## 4  1e+00   0.5 0.08673077 0.04693472
## 5  5e+00   0.5 0.10205128 0.04326256
## 6  1e+01   0.5 0.09955128 0.04715212
## 7  1e+02   0.5 0.09955128 0.04867674
## 8  1e+03   0.5 0.09955128 0.04867674
## 9  1e-03   1.0 0.10198718 0.05993767
## 10 1e-02   1.0 0.09948718 0.05018848
## 11 1e-01   1.0 0.08673077 0.04693472
## 12 1e+00   1.0 0.10205128 0.04326256
## 13 5e+00   1.0 0.09955128 0.04867674
## 14 1e+01   1.0 0.09955128 0.04867674
## 15 1e+02   1.0 0.09955128 0.04867674
## 16 1e+03   1.0 0.09955128 0.04867674
## 17 1e-03   2.0 0.09179487 0.04973522
## 18 1e-02   2.0 0.08416667 0.04667329
## 19 1e-01   2.0 0.10205128 0.04326256
## 20 1e+00   2.0 0.09955128 0.04867674
## 21 5e+00   2.0 0.09955128 0.04867674
## 22 1e+01   2.0 0.09955128 0.04867674
## 23 1e+02   2.0 0.09955128 0.04867674
## 24 1e+03   2.0 0.09955128 0.04867674
## 25 1e-03   3.0 0.10205128 0.05101139
## 26 1e-02   3.0 0.09435897 0.04954391
## 27 1e-01   3.0 0.10211538 0.04952109
## 28 1e+00   3.0 0.09955128 0.04867674
## 29 5e+00   3.0 0.09955128 0.04867674
## 30 1e+01   3.0 0.09955128 0.04867674
## 31 1e+02   3.0 0.09955128 0.04867674
## 32 1e+03   3.0 0.09955128 0.04867674
## 33 1e-03   4.0 0.08673077 0.05279469
## 34 1e-02   4.0 0.10205128 0.04326256
## 35 1e-01   4.0 0.09955128 0.04867674
## 36 1e+00   4.0 0.09955128 0.04867674
## 37 5e+00   4.0 0.09955128 0.04867674
## 38 1e+01   4.0 0.09955128 0.04867674
## 39 1e+02   4.0 0.09955128 0.04867674
## 40 1e+03   4.0 0.09955128 0.04867674

We can see from this summary that the cross-validation error is lowest when the cost is 0.01 and gamma is 2. The lowest cross-validation error with the linear kernel was 0.0865, with the radial kernel 0.0865, and with the polynomial kernel 0.0842. The three kernels therefore achieve very similar cross-validation error rates on this data set.
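
The best cross-validation error from each tune object can also be pulled out programmatically; a quick sketch using the three objects fitted above:

sapply(list(linear = linear_tune, radial = radial_tune, polynomial = polynomial_tune),
       function(t) t$best.performance)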

(d) Make some plots to back up your assertions in (b) and (c).

plot(linear_tune$performances[,c(1,2)], type = "l", xlim = c(0,1))

This plot shows that the cross-validation error for the linear kernel is lowest when the cost is 0.01.

radial_perf <- radial_tune$performances
plot(x=radial_perf$cost, radial_perf$error, xlim=c(0,1))

plot(x=radial_perf$gamma,y=radial_perf$error)

We can see from the first plot that the error is lowest when the cost is 0.1, and from the second plot that the error is lowest when gamma is 0.5.

poly_perf <- polynomial_tune$performances
plot(x=poly_perf$cost, poly_perf$error, xlim=c(0,0.05))

plot(x=poly_perf$gamma,y=poly_perf$error)

From the first plot, we can see that the error is lowest when the cost is 0.01. We can see from the second plot that the error is lowest when gamma is 2.
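
Because the cost grid spans several orders of magnitude, the xlim values used above crop most of the candidate costs. A log-scale x-axis (a minor variation, shown here for the polynomial results) displays the whole grid at once:

plot(poly_perf$cost, poly_perf$error, log = "x",
     xlab = "cost (log scale)", ylab = "CV error")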

Problem 8

8. This problem involves the OJ data set which is part of the ISLR package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

library(ISLR)
set.seed(76)
model_train <- sample(1:nrow(OJ), 800)
OJ_train <- OJ[model_train,]
OJ_test <- OJ[-model_train,]

(b) Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

OJ$Purchase <- as.factor(OJ$Purchase)
OJ_svm <-svm(Purchase~., data = OJ_train, kernel = "linear", cost = 0.01)
summary(OJ_svm)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ_train, kernel = "linear", cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  448
## 
##  ( 223 225 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

We can see from this summary that of the 800 observations in the training data, 448 were used as support vectors. This large number of support vectors is expected with such a small cost (0.01): a low cost yields a wide margin that tolerates many observations on or inside it.

(c) What are the training and test error rates?

svm_pred <- predict(OJ_svm,OJ_train)
table(prediction = svm_pred, truth = OJ_train$Purchase)
##           truth
## prediction  CH  MM
##         CH 432  81
##         MM  55 232
(55+81)/800
## [1] 0.17

The training error rate is 0.17, or 17%.

svm_test_pred <- predict(OJ_svm,OJ_test)
table(prediction = svm_test_pred, truth = OJ_test$Purchase)
##           truth
## prediction  CH  MM
##         CH 147  26
##         MM  19  78
(19+26)/270
## [1] 0.1666667

The test error rate is 0.167, or 16.7%.
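
Since the same predict-and-tabulate computation is repeated for every kernel below, the error rate can also be wrapped in a small helper; err_rate is a hypothetical convenience function (the later chunks keep the explicit tables):

err_rate <- function(fit, data) mean(predict(fit, data) != data$Purchase)
err_rate(OJ_svm, OJ_train)  # 0.17, matching the training table above
err_rate(OJ_svm, OJ_test)   # about 0.167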

(d) Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

set.seed(76)
OJ_tune_out <- tune(svm, Purchase~., data = OJ_train, kernel = "linear", ranges = list(cost = c(0.01,0.1,1,5,10)))
summary(OJ_tune_out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     5
## 
## - best performance: 0.17 
## 
## - Detailed performance results:
##    cost  error dispersion
## 1  0.01 0.1800 0.02898755
## 2  0.10 0.1700 0.02838231
## 3  1.00 0.1725 0.03106892
## 4  5.00 0.1700 0.03016160
## 5 10.00 0.1725 0.02934469

The cross-validation error is lowest (0.17) when the cost is 0.1 or 5.

(e) Compute the training and test error rates using this new value for cost.

OJ$Purchase <- as.factor(OJ$Purchase)
OJ_svm <-svm(Purchase~., data = OJ_train, kernel = "linear", cost = 5)
summary(OJ_svm)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ_train, kernel = "linear", cost = 5)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  5 
## 
## Number of Support Vectors:  341
## 
##  ( 171 170 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
svm_pred <- predict(OJ_svm,OJ_train)
table(prediction = svm_pred, truth = OJ_train$Purchase)
##           truth
## prediction  CH  MM
##         CH 427  70
##         MM  60 243
(60+70)/800
## [1] 0.1625

The training error rate is 0.1625, or 16.25%.

svm_test_pred <- predict(OJ_svm,OJ_test)
table(prediction = svm_test_pred, truth = OJ_test$Purchase)
##           truth
## prediction  CH  MM
##         CH 144  19
##         MM  22  85
(22+19)/270
## [1] 0.1518519

The test error rate is 0.152, or 15.2%.

(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

set.seed(76)
OJ_tune_out <- tune(svm, Purchase~., data = OJ_train, kernel = "radial", ranges = list(cost = c(0.01,0.1,1,5,10)))
summary(OJ_tune_out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.1675 
## 
## - Detailed performance results:
##    cost   error dispersion
## 1  0.01 0.39125 0.05591723
## 2  0.10 0.17875 0.03729108
## 3  1.00 0.16750 0.03496029
## 4  5.00 0.17500 0.03280837
## 5 10.00 0.18500 0.02622022

The lowest error is obtained when the cost is 1.

OJ$Purchase <- as.factor(OJ$Purchase)
OJ_svm <-svm(Purchase~., data = OJ_train, kernel = "radial", cost = 1)
summary(OJ_svm)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ_train, kernel = "radial", cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  380
## 
##  ( 188 192 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
svm_pred <- predict(OJ_svm,OJ_train)
table(prediction = svm_pred, truth = OJ_train$Purchase)
##           truth
## prediction  CH  MM
##         CH 444  76
##         MM  43 237
(43+76)/800
## [1] 0.14875

The training error rate is 0.14875, or 14.875%.

svm_test_pred <- predict(OJ_svm,OJ_test)
table(prediction = svm_test_pred, truth = OJ_test$Purchase)
##           truth
## prediction  CH  MM
##         CH 147  29
##         MM  19  75
(29+19)/270
## [1] 0.1777778

The test error rate is 0.178, or 17.8%.

(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.

set.seed(76)
OJ_tune_out <- tune(svm, Purchase~., data = OJ_train, kernel = "polynomial", degree = 2, ranges = list(cost = c(0.01,0.1,1,5,10)))
summary(OJ_tune_out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     5
## 
## - best performance: 0.19125 
## 
## - Detailed performance results:
##    cost   error dispersion
## 1  0.01 0.38750 0.05137012
## 2  0.10 0.31875 0.04649149
## 3  1.00 0.20625 0.03186887
## 4  5.00 0.19125 0.03007514
## 5 10.00 0.19125 0.03821086

The error is lowest when the cost is 5 or 10.

OJ$Purchase <- as.factor(OJ$Purchase)
OJ_svm <-svm(Purchase~., data = OJ_train, kernel = "polynomial", cost = 5, degree=2)
summary(OJ_svm)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ_train, kernel = "polynomial", 
##     cost = 5, degree = 2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  5 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  377
## 
##  ( 184 193 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
svm_pred <- predict(OJ_svm,OJ_train)
table(prediction = svm_pred, truth = OJ_train$Purchase)
##           truth
## prediction  CH  MM
##         CH 448  89
##         MM  39 224
(39+89)/800
## [1] 0.16

The training error rate is 0.16, or 16%.

svm_test_pred <- predict(OJ_svm,OJ_test)
table(prediction = svm_test_pred, truth = OJ_test$Purchase)
##           truth
## prediction  CH  MM
##         CH 153  30
##         MM  13  74
(13+30)/270
## [1] 0.1592593

The test error rate is 0.159, or 15.9%.

(h) Overall, which approach seems to give the best results on this data?

Using the linear kernel with the tuned cost, the training error rate was 16.25% and the test error rate was 15.2%. Using the radial kernel with the tuned cost, the training error rate was 14.875% and the test error rate was 17.8%. Using the polynomial kernel with the tuned cost, the training error rate was 16% and the test error rate was 15.9%. The linear and polynomial kernels had similar training error rates, and the radial kernel had the lowest; however, the radial kernel also had the highest test error rate, so it does not give the best results. Since the linear kernel has a slightly lower test error rate than the polynomial kernel, the support vector classifier with a linear kernel gives the best results on this data.
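
A compact recap of the tuned results quoted above (numbers copied from the output, not recomputed):

data.frame(
  kernel = c("linear (cost = 5)", "radial (cost = 1)", "polynomial, degree 2 (cost = 5)"),
  train_error = c(0.1625, 0.14875, 0.1600),
  test_error = c(0.1519, 0.1778, 0.1593)
)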