---
title: "R Notebook"
output: 
  html_notebook:
    toc: true
    toc_float: true
---
In this assignment, we explore the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and has grown in popularity since then. SVMs have been shown to perform well in a variety of settings, and are often considered one of the best “out of the box” classifiers.

## Problem 5
__We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by
performing logistic regression using non-linear transformations of the features.__

__(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:__
```{r, message=FALSE,warning=FALSE}
library(tidyverse)
set.seed(421)
x1=runif(500) - 0.5 
x2=runif(500) - 0.5
# class 1 where x1^2 - x2^2 > 0: a quadratic decision boundary
y=as_factor(1*(x1*x1 - x2*x2 > 0))
df<-tibble(x1=x1,
           x2=x2,
           y=y)
df
```

__(b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.__
```{r}
library(plotly)
p<-ggplot(data=df,mapping=aes(x=x1,y=x2,color=y))+
  geom_point() +
  theme_light() +
  theme(legend.position = "none")
ggplotly(p)
```

__(c) Fit a logistic regression model to the data, using X1 and X2 as predictors.__
```{r}
glm_fit = glm(y ~ x1 + x2, data = df, family = binomial)
summary(glm_fit)
```
Neither variable is significant for predicting _y_.

__(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.__
```{r}
glm_prob = predict(glm_fit, df, type = "response")
# cutoff of 0.52 rather than 0.5 so that both predicted classes appear (see note below)
df<-mutate(df, pred = as_factor(ifelse(glm_prob > 0.52, 1, 0)))
q<-ggplot(data=df,mapping=aes(x=x1,y=x2,color=pred))+
  geom_point() +
  theme_light() +
  theme(legend.position = "none")
ggplotly(q)
```
With this model and a probability threshold of 0.5, all points are classified into a single class, so no decision boundary can be shown. I therefore shifted the threshold to 0.52 to make the boundary visible. The boundary is linear.
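
As a quick check of that claim, we can tabulate how many fitted probabilities fall on each side of the conventional 0.5 cutoff:

```{r}
# nearly all fitted probabilities land on one side of 0.5,
# which is why the 0.5 threshold predicts a single class
table(glm_prob > 0.5)
```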

__(e) Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. X1^2, X1×X2, log(X2), and so forth).__  

I'll use squared terms and a product interaction term to fit the model.

```{r,error=TRUE}
glm_fit2 = glm(y ~ x1 + x2+ I(x1^2) + I(x2^2) + I(x1 * x2), data = df, family = binomial)
```
The warnings (non-convergence, fitted probabilities numerically 0 or 1) indicate that the quadratic terms separate the two classes almost perfectly, which is expected since the true boundary is exactly quadratic.

__(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.__

```{r}
glm_prob = predict(glm_fit2, df, type = "response")
df2<-mutate(df, nonlin_pred = as_factor(ifelse(glm_prob > 0.52, 1, 0)))
r<-ggplot(data=df2,mapping=aes(x=x1,y=x2,color=nonlin_pred))+
  geom_point() +
  theme_light() +
  theme(legend.position = "none")
ggplotly(r)
```
This non-linear decision boundary closely resembles the true decision boundary.
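
A quick check of how well these predictions recover the true labels (training accuracy, which should be close to 1 given the separation warnings above):

```{r}
mean(df2$nonlin_pred == df2$y)
```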

__(g) Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.__
```{r}
library(e1071)
svm_fit = svm(y ~ x1 + x2, df, kernel = "linear", cost = 0.1)
df3<-mutate(df, svm_pred = as_factor(predict(svm_fit, df)))
s<-ggplot(data=df3,mapping=aes(x=x1,y=x2,color=svm_pred))+
  geom_point() +
  theme_light() +
  theme(legend.position = "none")
ggplotly(s)
```
A linear kernel, even with a low cost, fails to find the non-linear decision boundary and classifies all points into a single class.

__(h) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.__
```{r}
# radial kernel is svm()'s default; stated explicitly here for clarity
svm_fit2 = svm(y ~ x1 + x2, df, kernel = "radial", gamma = 1)
df4<-mutate(df, svm_pred = as_factor(predict(svm_fit2, df)))
t<-ggplot(data=df4,mapping=aes(x=x1,y=x2,color=svm_pred))+
  geom_point() +
  theme_light() +
  theme(legend.position = "none")
ggplotly(t)
```
As with the non-linear logistic regression model, the decision boundary implied by the predicted labels closely resembles the true decision boundary.

__(i) Comment on your results.__  
This experiment reinforces the idea that SVMs with a non-linear kernel are extremely powerful at finding non-linear boundaries. Both logistic regression without interaction terms and SVMs with a linear kernel fail to find the decision boundary. Adding quadratic and interaction terms to logistic regression gives it roughly the same power as a radial-basis kernel. However, picking the right interaction terms requires manual effort and tuning, which can become prohibitive with a large number of features. Radial-basis kernels, on the other hand, require tuning only one parameter, `gamma`, which is easily done with cross-validation.
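
As an illustrative sketch (the gamma grid below is an arbitrary choice, not a tuned result from this assignment), `tune()` can cross-validate `gamma` for the radial SVM on this data:

```{r}
set.seed(1)
# 10-fold CV over a small illustrative gamma grid for the radial kernel
tune_p5 <- tune(svm, y ~ x1 + x2, data = df, kernel = "radial",
                ranges = list(gamma = c(0.1, 0.5, 1, 2, 5)))
summary(tune_p5)
```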

## Problem 7   
__In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the `Auto` data set.__  

__(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.__
```{r}
library(ISLR)
auto <- as_tibble(Auto) %>%
  mutate(mpglevel = as_factor(ifelse(mpg > median(mpg), 1, 0)))
auto
```
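
Since `mpglevel` comes from a median split, the two classes should be roughly balanced; a quick count shows the class sizes:

```{r}
count(auto, mpglevel)
```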

__(b) Fit a support vector classifier to the data with various values of `cost`, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.__
```{r}
library(e1071)
set.seed(3255)
tune_out = tune(svm, mpglevel ~ . -mpg, data = auto, kernel = "linear", ranges = list(cost = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune_out)
```
We see that cross-validation error is minimized for `cost=1`.
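
The per-cost CV errors are stored in `tune_out$performances`; plotting them (a quick base-graphics sketch) makes the minimum easy to see:

```{r}
plot(tune_out$performances$cost, tune_out$performances$error,
     type = "b", log = "x", xlab = "cost", ylab = "CV error")
```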

__(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of `gamma` and `degree` and `cost`. Comment on your results.__
```{r}
set.seed(21)
tune_out = tune(svm, mpglevel ~ . -mpg, data = auto, kernel = "polynomial", ranges = list(cost = c(0.1, 1, 5, 10), degree = c(2, 3, 4)))
summary(tune_out)
```
The lowest cross-validation error is obtained for `cost=10` and `degree=2`.

```{r}
set.seed(463)
tune_out = tune(svm, mpglevel ~ .-mpg, data = auto, kernel = "radial", ranges = list(cost = c(0.1, 1, 5, 10), gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune_out)
```
Finally, for the radial basis kernel, the lowest cross-validation error is obtained with `cost=10` and `gamma=0.01`.

__(d) Make some plots to back up your assertions in (b) and (c).__
```{r}
# Refit each kernel with the parameters selected by cross-validation above.
# Unlike in the tuning calls, mpg is kept as a predictor here so that
# plot.svm can display it on one axis of each plot.
svm_linear = svm(mpglevel ~ ., data = auto, kernel = "linear", cost = 1)
svm_poly = svm(mpglevel ~ ., data = auto, kernel = "polynomial", cost = 10,
    degree = 2)
svm_radial = svm(mpglevel ~ ., data = auto, kernel = "radial", cost = 10, gamma = 0.01)

# Plot the fitted class regions of mpg against every other predictor,
# skipping mpg itself, the car name, and the response.
plotpairs<-function(fit){
  for(name in names(auto)[!(names(auto) %in% c("mpg","name","mpglevel"))]){
    plot(fit, auto, as.formula(paste("mpg~", name, sep="")))
  }
}
plotpairs(svm_linear)
```
```{r}
plotpairs(svm_poly)
```
```{r}
plotpairs(svm_radial)
```

## Problem 8

__This problem involves the [OJ](https://rdrr.io/cran/ISLR/man/OJ.html) data set which is part of the `ISLR` package.__

__(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.__
```{r}
library(ISLR)
set.seed(9004)
inTrain = sample(nrow(OJ), 800)
train_oj = OJ[inTrain,]
test_oj = OJ[-inTrain,]
```
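
A quick sanity check on the split sizes (800 training rows, the rest held out):

```{r}
dim(train_oj)
dim(test_oj)
```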

__(b) Fit a support vector classifier to the training data using `cost=0.01`, with Purchase as the response and the other variables as predictors. Use the `summary()` function to produce summary statistics, and describe the results obtained.__
```{r,warning=FALSE,message=FALSE}
library(e1071)
svm_linear = svm(Purchase ~ ., kernel = "linear", data = train_oj, cost = 0.01)
summary(svm_linear)
```
The support vector classifier uses 442 of the 800 training points as support vectors: 222 belong to level `CH` and the remaining 220 to level `MM`.

__(c) What are the training and test error rates?__
```{r}
train_pred = predict(svm_linear, train_oj)
(t<-table(train_oj$Purchase, train_pred))
# off-diagonal cells t[2] and t[3] are the misclassified counts
(train_error=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
```
```{r}
test_pred = predict(svm_linear, test_oj)
(t<-table(test_oj$Purchase, test_pred))
(test_error=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
```
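
Equivalently, the misclassification rate is the mean disagreement between predictions and truth; a small helper (a name of my own, not from the original) makes the repeated table arithmetic less error-prone:

```{r}
err_rate <- function(pred, truth) mean(pred != truth)
err_rate(train_pred, train_oj$Purchase)
err_rate(test_pred, test_oj$Purchase)
```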
The training error rate is `r paste(round(100*train_error,2),'%',sep="")` and test error rate is about `r paste(round(100*test_error,2),'%',sep="")`.

__(d) Use the `tune()` function to select an optimal cost. Consider values in the range 0.01 to 10.__
```{r}
set.seed(1554)
tune_out = tune(svm, Purchase ~ ., data = train_oj, kernel = "linear", ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune_out)
best<-tune_out$best.parameters$cost
```
Tuning shows that the optimal cost is `r best`.

__(e) Compute the training and test error rates using this new value for cost.__
```{r}
svm_linear = svm(Purchase ~ ., kernel = "linear", data = train_oj, cost = best)
train_pred = predict(svm_linear, train_oj)
(t<-table(train_oj$Purchase, train_pred))
(train_error_linear=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
test_pred = predict(svm_linear, test_oj)
(t<-table(test_oj$Purchase, test_pred))
(test_error_linear=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
```
Using the best cost, the training error decreases to `r paste(round(100*train_error_linear,2),'%',sep="")` and the test error also drops slightly, to `r paste(round(100*test_error_linear,2),'%',sep="")`.

__(f) Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.__
```{r}
set.seed(410)
svm_radial = svm(Purchase ~ ., kernel = "radial", data = train_oj)
summary(svm_radial)
train_pred = predict(svm_radial, train_oj)
(t<-table(train_oj$Purchase, train_pred))
(train_error=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
test_pred = predict(svm_radial, test_oj)
(t<-table(test_oj$Purchase, test_pred))
(test_error=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
```
The radial basis kernel with the default gamma creates 371 support vectors, of which 188 belong to level `CH` and the remaining 183 to level `MM`. The classifier has a training error of `r paste(round(100*train_error,2),'%',sep="")` and a test error of `r paste(round(100*test_error,2),'%',sep="")`; the test error is not an improvement over the linear kernel. We now use cross-validation to select the cost (gamma stays at its default).

```{r}
set.seed(755)
tune_out = tune(svm, Purchase ~ ., data = train_oj, kernel = "radial", ranges = list(cost = 10^seq(-2,1, by = 0.25)))
summary(tune_out)
```
```{r}
svm_radial = svm(Purchase ~ ., kernel = "radial", data = train_oj, cost = tune_out$best.parameters$cost)
train_pred = predict(svm_radial, train_oj)
(t<-table(train_oj$Purchase, train_pred))
(train_error_radial=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
test_pred = predict(svm_radial, test_oj)
(t<-table(test_oj$Purchase, test_pred))
(test_error_radial=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
```
With the tuned cost, the training error rises to `r paste(round(100*train_error_radial,2),'%',sep="")` and the test error to `r paste(round(100*test_error_radial,2),'%',sep="")`, so the radial kernel is still not better than the linear kernel.

__(g) Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.__
```{r}
svm_poly = svm(Purchase ~ ., kernel = "poly", data = train_oj, degree=2)
summary(svm_poly)
train_pred = predict(svm_poly, train_oj)
(t<-table(train_oj$Purchase, train_pred))
(train_error=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
test_pred = predict(svm_poly, test_oj)
(t<-table(test_oj$Purchase, test_pred))
(test_error=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
```
The polynomial kernel produces 456 support vectors, of which 232 belong to level `CH` and the remaining 224 to level `MM`. It yields a training error of `r paste(round(100*train_error,2),'%',sep="")` and a test error of `r paste(round(100*test_error,2),'%',sep="")`, slightly higher than the errors produced by the linear and radial kernels.

```{r}
set.seed(322)
tune_out = tune(svm, Purchase ~ ., data = train_oj, kernel = "poly", degree = 2, ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune_out)
```
```{r}
svm_poly = svm(Purchase ~ ., kernel = "poly", data = train_oj, degree=2, cost = tune_out$best.parameters$cost)
train_pred = predict(svm_poly, train_oj)
(t<-table(train_oj$Purchase, train_pred))
(train_error_poly=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
test_pred = predict(svm_poly, test_oj)
(t<-table(test_oj$Purchase, test_pred))
(test_error_poly=(t[2]+t[3])/(t[1]+t[2]+t[3]+t[4]))
```
Tuning the cost reduces the training error to `r paste(round(100*train_error_poly,2),'%',sep="")` and the test error to `r paste(round(100*test_error_poly,2),'%',sep="")`, slightly better than the tuned radial kernel but still worse than the linear kernel.

__(h) Overall, which approach seems to give the best results on this data?__
```{r}
df<-tibble(`SVM kernel`=c("Linear","Radial","Polynomial"),
           `Training Error`=c(train_error_linear,train_error_radial,train_error_poly),
           `Test Error`=c(test_error_linear,test_error_radial,test_error_poly))
df
```
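
The same comparison is easier to read with the errors formatted as percentages (a small dplyr sketch, assuming dplyr >= 1.0 for `across()`):

```{r}
mutate(df, across(where(is.numeric), ~ paste0(round(100 * .x, 2), "%")))
```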

Overall, the linear kernel produces the lowest misclassification error on the test data.