Exercise 5

set.seed(421)
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1 * (x1^2 - x2^2 > 0)

The quadratic decision boundary between the two classes is

\(x_1^2 - x_2^2 = 0\)

plot(x1[y == 0], x2[y == 0], col = "green", xlab = "X1", ylab = "X2", pch = "*")
points(x1[y == 1], x2[y == 1], col = "blue", pch = 1)

The plot shows that the two classes are separated by a non-linear (quadratic) decision boundary.
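
Since \(x_1^2 - x_2^2 = 0\) is equivalent to \(x_2 = \pm x_1\), the true boundary can be overlaid on the plot for reference (a small sketch, assuming the plot above is still the active graphics device):

# overlay the true boundary x2 = x1 and x2 = -x1 as dashed lines
abline(0, 1, lty = 2)
abline(0, -1, lty = 2)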

fit_lm = glm(y ~ x1 + x2, family = binomial)
summary(fit_lm)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.278  -1.227   1.089   1.135   1.175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.11999    0.08971   1.338    0.181
## x1          -0.16881    0.30854  -0.547    0.584
## x2          -0.08198    0.31476  -0.260    0.795
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 691.35  on 499  degrees of freedom
## Residual deviance: 690.99  on 497  degrees of freedom
## AIC: 696.99
## 
## Number of Fisher Scoring iterations: 3

Neither \(X_1\) nor \(X_2\) is a statistically significant predictor of \(y\) at the \(\alpha = 0.05\) significance level. This is consistent with the non-linear boundary seen in the plot, which a model that is linear in \(X_1\) and \(X_2\) cannot capture.

data = data.frame(x1 = x1, x2 = x2, y = y)
prob = predict(fit_lm, data, type = "response")
pred = ifelse(prob > 0.52, 1, 0)
pos = data[pred == 1, ]
neg = data[pred == 0, ]
plot(pos$x1, pos$x2, col = "blue", xlab = "X1", ylab = "X2", pch = "*")
points(neg$x1, neg$x2, col = "green", pch = 1)

Because the fitted probabilities all lie close to 0.5, a cutoff of 0.52 is used so that both predicted classes appear; the resulting decision boundary is linear, unlike the true quadratic boundary.
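
The concentration of the fitted probabilities around 0.5 can be verified directly (a quick check, not part of the original output):

# the fitted probabilities are concentrated near 0.5, so a 0.50 cutoff
# would put nearly all observations in the same predicted class
summary(prob)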

# note: log(x2, 2) is undefined for x2 <= 0, so those rows are dropped from the fit (see the output below)
fit_poly = glm(y ~ poly(x1, 2) + log(x2, 2) + I(x1 * x2), data = data, family = binomial)
summary(fit_poly)
## 
## Call:
## glm(formula = y ~ poly(x1, 2) + log(x2, 2) + I(x1 * x2), family = binomial, 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2453  -0.2280   0.0006   0.1853   1.7565  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.253      1.016  -6.153 7.61e-10 ***
## poly(x1, 2)1   12.537     17.244   0.727    0.467    
## poly(x1, 2)2   74.737     11.176   6.688 2.27e-11 ***
## log(x2, 2)     -2.881      0.448  -6.432 1.26e-10 ***
## I(x1 * x2)     -4.653      8.011  -0.581    0.561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 333.89  on 240  degrees of freedom
## Residual deviance: 102.21  on 236  degrees of freedom
##   (259 observations deleted due to missingness)
## AIC: 112.21
## 
## Number of Fisher Scoring iterations: 7

The model with non-linear terms has some statistically significant predictors, namely \(X_1^2\) and \(\log_2(X_2)\). Note that because \(\log_2(X_2)\) is undefined for \(X_2 \le 0\), 259 observations were dropped from this fit.
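
A variant that keeps all 500 observations (an alternative specification, not the model fit above) could use a quadratic term in \(x_2\) instead of the logarithm:

# alternative non-linear specification that avoids taking logs of
# non-positive values, so no observations are dropped
fit_poly2 = glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), data = data, family = binomial)
summary(fit_poly2)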

prob_p = predict(fit_poly, data, type = "response")
pred_p = ifelse(prob_p > 0.5, 1, 0)
pos_p = data[pred_p == 1, ]
neg_p = data[pred_p == 0, ]
plot(pos_p$x1, pos_p$x2, col = "blue", xlab = "X1", ylab = "X2", pch = 1)
points(neg_p$x1, neg_p$x2, col = "red", pch = 1)

The predictions are much better than those of the linear model. The new model produces a non-linear decision boundary, although it differs somewhat from the original quadratic boundary.
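
One way to quantify the improvement is to compare training misclassification rates (a sketch; observations dropped by the log term have no prediction and are excluded):

# training misclassification rates of the two logistic models
mean(pred != y)                  # linear terms
mean(pred_p != y, na.rm = TRUE)  # non-linear terms (NA for dropped rows)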

library(e1071)
svm.fit = svm(as.factor(y) ~ x1 + x2, data, kernel = "linear", cost = 0.1)
svm.pred = predict(svm.fit, data)
pos = data[svm.pred == 1, ]
neg = data[svm.pred == 0, ]
plot(pos$x1, pos$x2, col = "blue", xlab = "X1", ylab = "X2", pch = "*")
points(neg$x1, neg$x2, col = "red", pch = 3)

The linear kernel produces a linear boundary that assigns only a handful of observations to one class; the large majority are classified into a single group.
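
This can be confirmed by tabulating the predicted labels against the true classes (a quick check):

# most observations receive the same predicted label under the linear kernel
table(predicted = svm.pred, truth = y)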

library(e1071)
svm.fit = svm(as.factor(y) ~ x1 + x2, data, kernel = "radial", gamma = 1)  # radial is svm()'s default kernel
svm.pred = predict(svm.fit, data)
pos = data[svm.pred == 1, ]
neg = data[svm.pred == 0, ]
plot(pos$x1, pos$x2, col = "blue", xlab = "X1", ylab = "X2", pch = "+")
points(neg$x1, neg$x2, col = "red", pch = 4)

The predicted labels from the radial-kernel SVM closely resemble the original quadratic decision boundary.
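
The agreement with the true labels can be checked directly (a small sketch):

# proportion of observations whose predicted label matches the true class
mean(svm.pred == y)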

This experiment illustrates why SVMs are useful for finding non-linear decision boundaries, and cross-validation makes it straightforward to tune parameters such as gamma. Logistic regression with only linear terms performs poorly on a non-linear boundary, the SVM with a linear kernel does little better (here, with a small cost, it assigns almost everything to one class), while both logistic regression with non-linear terms and the SVM with a radial kernel capture the boundary well.
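
As a concrete illustration of the cross-validation point, cost and gamma for the radial kernel could be tuned jointly with tune() from e1071 (a minimal sketch; the grid values are illustrative assumptions):

# illustrative cross-validation over cost and gamma for the radial kernel
cv.radial = tune(svm, as.factor(y) ~ x1 + x2, data = data, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
summary(cv.radial)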

Exercise 7

library(ISLR)
mileage.median <- median(Auto$mpg)
Auto$mb <- ifelse(Auto$mpg > mileage.median, 1, 0)
Auto$mpglevel = as.factor(Auto$mb)
cost.grid <- c(0.001,0.01, 0.1, 1, 5, 10, 100)
set.seed(1)
tune.res <- tune(svm, mb ~ . - mpg, data = Auto, kernel = 'linear', ranges = list(cost = cost.grid))
summary(tune.res)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     5
## 
## - best performance: 0.001749865 
## 
## - Detailed performance results:
##    cost       error   dispersion
## 1 1e-03 0.094369989 0.0221743982
## 2 1e-02 0.038760453 0.0128111486
## 3 1e-01 0.002059194 0.0003321964
## 4 1e+00 0.001750708 0.0001172866
## 5 5e+00 0.001749865 0.0001196184
## 6 1e+01 0.001749865 0.0001196184
## 7 1e+02 0.001749865 0.0001196184
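
The cross-validation above selects cost = 5 (CV error about 0.0017). Note, however, that mb is numeric, so svm() fits a regression here and the reported error is a mean squared error rather than a misclassification rate; the formula also leaves mpglevel, which encodes the same information as mb, among the predictors. A classification version using the factor response (a sketch under that assumption, also excluding mb and the high-cardinality name variable) might look like:

# classification version: a factor response makes tune() report misclassification error
set.seed(1)
tune.class <- tune(svm, mpglevel ~ . - mpg - mb - name, data = Auto,
                   kernel = 'linear', ranges = list(cost = cost.grid))
summary(tune.class)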

SVM with radial kernel

grid <- c(0.01, 0.1, 1, 5, 10)
set.seed(1)
tune.out <- tune(svm, mb ~ . - mpg, data = Auto, kernel = 'radial', ranges = list(cost = grid), gamma=c(0.01, 0.1, 1, 5, 10, 100))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    10
## 
## - best performance: 0.002002138 
## 
## - Detailed performance results:
##    cost       error   dispersion
## 1  0.01 0.287651483 0.0351963527
## 2  0.10 0.078886860 0.0206536281
## 3  1.00 0.017484810 0.0071390321
## 4  5.00 0.002281495 0.0005543150
## 5 10.00 0.002002138 0.0004984504
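
Here the best cost is 10 (CV error about 0.002), but note that gamma is passed directly to svm() rather than searched over; for the cross-validation to actually tune gamma together with cost, it must go inside the ranges list (a sketch, continuing the classification setup above):

# tune cost and gamma jointly for the radial kernel
set.seed(1)
tune.radial <- tune(svm, mpglevel ~ . - mpg - mb - name, data = Auto,
                    kernel = 'radial',
                    ranges = list(cost = grid, gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.radial)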

SVM with polynomial kernel

grid <- c(0.01, 0.1, 1, 5, 10)
set.seed(1)
tune.out <- tune(svm, mb ~ . - mpg, data = Auto, kernel = 'polynomial', ranges = list(cost = grid),degree=c(2,3,4))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    10
## 
## - best performance: 0.3026392 
## 
## - Detailed performance results:
##    cost     error dispersion
## 1  0.01 0.4981933 0.03934104
## 2  0.10 0.4956527 0.03969561
## 3  1.00 0.4706992 0.04479729
## 4  5.00 0.3777593 0.07172821
## 5 10.00 0.3026392 0.07693095
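
Similarly, degree is not part of the ranges list here, so only cost is actually tuned; a sketch tuning both (same classification setup as above):

# tune cost and degree jointly for the polynomial kernel
set.seed(1)
tune.poly <- tune(svm, mpglevel ~ . - mpg - mb - name, data = Auto,
                  kernel = 'polynomial',
                  ranges = list(cost = grid, degree = c(2, 3, 4)))
summary(tune.poly)
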
The following plots back up the assertions above about the linear, polynomial, and radial fits (parts (b) and (c) of the exercise).

# refit the three SVMs with the chosen parameter values for plotting
svm.linear = svm(mpglevel ~ ., data = Auto, kernel = "linear", cost = 1)
svm.poly = svm(mpglevel ~ ., data = Auto, kernel = "polynomial", cost = 100, degree = 2)
svm.radial = svm(mpglevel ~ ., data = Auto, kernel = "radial", cost = 100, gamma = 0.01)
# plot each fitted SVM in the (mpg, predictor) plane for every remaining predictor;
# mb is skipped as well since it is just the binarized version of mpg
plotpairs = function(fit) {
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "mb", "mpglevel", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}
plotpairs(svm.linear)

plotpairs(svm.poly)

plotpairs(svm.radial)

Exercise 8

set.seed(1)
s = sample(dim(OJ)[1], 800)
train = OJ[s, ]
test = OJ[-s, ]
library(e1071)
svm.linear = svm(Purchase ~ ., kernel = "linear", data = train, cost = 0.01)
summary(svm.linear)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  435
## 
##  ( 219 216 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

The support vector classifier uses 435 support vectors out of the 800 training observations. They are almost evenly split between the two classes, with 219 for Citrus Hill (CH) and 216 for Minute Maid (MM).

train.pred = predict(svm.linear, train)
mean(train.pred != train$Purchase)
## [1] 0.175
test.pred = predict(svm.linear, test)
mean(test.pred != test$Purchase)
## [1] 0.1777778

The training error rate is 17.5% and the test error rate is approximately 17.78%.

set.seed(1)
tune.out = tune(svm, Purchase ~ ., data = train, kernel = "linear", ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  3.162278
## 
## - best performance: 0.16875 
## 
## - Detailed performance results:
##           cost   error dispersion
## 1   0.01000000 0.17625 0.02853482
## 2   0.01778279 0.17625 0.03143004
## 3   0.03162278 0.17125 0.02829041
## 4   0.05623413 0.17625 0.02853482
## 5   0.10000000 0.17250 0.03162278
## 6   0.17782794 0.17125 0.02829041
## 7   0.31622777 0.17125 0.02889757
## 8   0.56234133 0.17125 0.02703521
## 9   1.00000000 0.17500 0.02946278
## 10  1.77827941 0.17375 0.02729087
## 11  3.16227766 0.16875 0.03019037
## 12  5.62341325 0.17375 0.03304563
## 13 10.00000000 0.17375 0.03197764

The tuning selected cost = 3.162278 as optimal, with a cross-validation error of 0.16875.

svm.linear = svm(Purchase ~ ., kernel = "linear", data = train, cost = tune.out$best.parameters$cost)
train.pred = predict(svm.linear, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 423  62
##   MM  70 245
test.pred = predict(svm.linear, test)
mean(test.pred != test$Purchase)
## [1] 0.1518519

With the tuned cost, the test error rate decreases to approximately 15.19% (the training error, from the confusion matrix above, is (62 + 70) / 800 = 16.5%).

set.seed(1)
svm.radial = svm(Purchase ~ ., data = train, kernel = "radial")
summary(svm.radial)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  373
## 
##  ( 188 185 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

The radial basis kernel with the default gamma uses 373 support vectors out of the 800 training observations, split into 188 for CH and 185 for MM.

train.pred = predict(svm.radial, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 441  44
##   MM  77 238
test.pred = predict(svm.radial, test)
mean(test.pred != test$Purchase)
## [1] 0.1851852

The test error for the radial basis kernel is approximately 18.52%.

set.seed(1)
tune.out = tune(svm, Purchase ~ ., data = train, kernel = "radial", ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       cost
##  0.5623413
## 
## - best performance: 0.16875 
## 
## - Detailed performance results:
##           cost   error dispersion
## 1   0.01000000 0.39375 0.04007372
## 2   0.01778279 0.39375 0.04007372
## 3   0.03162278 0.35750 0.05927806
## 4   0.05623413 0.19500 0.02443813
## 5   0.10000000 0.18625 0.02853482
## 6   0.17782794 0.18250 0.03291403
## 7   0.31622777 0.17875 0.03230175
## 8   0.56234133 0.16875 0.02651650
## 9   1.00000000 0.17125 0.02128673
## 10  1.77827941 0.17625 0.02079162
## 11  3.16227766 0.17750 0.02266912
## 12  5.62341325 0.18000 0.02220485
## 13 10.00000000 0.18625 0.02853482

The tuning selected cost = 0.5623413 as optimal, with a cross-validation error of 0.16875.

svm.radial = svm(Purchase ~ ., data = train, kernel = "radial", cost = tune.out$best.parameters$cost)
train.pred = predict(svm.radial, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 437  48
##   MM  71 244
test.pred<-predict(svm.radial, test)
mean(test.pred != test$Purchase)
## [1] 0.1777778

The test error rate for the radial basis kernel with the tuned cost is approximately 17.78%, a slight decrease from the 18.52% obtained with the default cost of 1 (gamma is left at its default in both fits).

svm.poly<-svm(Purchase~ ., kernel='polynomial', data=train, degree=2 )
summary(svm.poly)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "polynomial", 
##     degree = 2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  447
## 
##  ( 225 222 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

The degree-2 polynomial kernel uses 447 support vectors: 225 belong to Citrus Hill and 222 to Minute Maid.

train.pred<-predict(svm.poly, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 449  36
##   MM 110 205
test.pred<-predict(svm.poly, test)
mean(test.pred != test$Purchase)
## [1] 0.2222222

The test error rate for the degree-2 polynomial kernel is approximately 22.22%.

tunepoly<-tune(svm, Purchase ~ ., data=train, kernel='polynomial', degree=2, ranges=list(cost=10^seq(-2,1, by=.5)))
summary(tunepoly)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  3.162278
## 
## - best performance: 0.18375 
## 
## - Detailed performance results:
##          cost   error dispersion
## 1  0.01000000 0.39125 0.05239076
## 2  0.03162278 0.36125 0.05993341
## 3  0.10000000 0.32125 0.05001736
## 4  0.31622777 0.20250 0.04031129
## 5  1.00000000 0.20000 0.04249183
## 6  3.16227766 0.18375 0.03682259
## 7 10.00000000 0.18875 0.04143687

The tuning selected cost = 3.162278 as optimal, with a cross-validation error of 0.18375.

svm.poly<-svm(Purchase ~ ., kernel='polynomial', data=train, degree=2, cost=tunepoly$best.parameters$cost)
train.pred<-predict(svm.poly, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 451  34
##   MM  90 225
test.pred<-predict(svm.poly, test)
mean(test.pred != test$Purchase)
## [1] 0.2037037

The test error rate for the polynomial kernel with the tuned cost is approximately 20.37%, a reduction from the 22.22% obtained with the default cost, but still higher than the linear and radial kernels.

Overall, the approach that gives the best results on this data is the support vector classifier with a linear kernel and the tuned cost, which achieves the lowest test misclassification rate (about 15.19%). The radial kernel with the tuned cost attains a slightly lower training error but does not generalize as well here, and the polynomial kernel performs worst on the test set.
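
For reference, the test error rates reported above can be collected in one place (values copied from the output above, not recomputed):

# test error rates reported above for the OJ data
test.err <- c(linear.default = 0.1778, linear.tuned = 0.1519,
              radial.default = 0.1852, radial.tuned = 0.1778,
              poly.default   = 0.2222, poly.tuned   = 0.2037)
sort(test.err)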