Exercise 5

set.seed(421)
x1 = runif(500) - 0.5
x2 = runif(500) - 0.5
y = 1 * (x1^2 - x2^2 > 0)

The quadratic decision boundary between the two classes is

\(x_1^2 - x_2^2 = 0\)

plot(x1[y == 0], x2[y == 0], col = "green", xlab = "X1", ylab = "X2", pch = "*")
points(x1[y == 1], x2[y == 1], col = "blue", pch = 1)

The plot shows that the two classes are separated by a non-linear (quadratic) decision boundary.
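
Since \(x_1^2 - x_2^2 = 0\) is equivalent to \(x_2 = \pm x_1\), the true boundary can be overlaid on the plot for reference (a small sketch, assuming the plot above is still the active graphics device):

# overlay the true boundary x2 = x1 and x2 = -x1 as dashed lines
abline(0, 1, lty = 2)
abline(0, -1, lty = 2)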

fit_lm = glm(y ~ x1 + x2, family = binomial)
summary(fit_lm)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.278  -1.227   1.089   1.135   1.175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.11999    0.08971   1.338    0.181
## x1          -0.16881    0.30854  -0.547    0.584
## x2          -0.08198    0.31476  -0.260    0.795
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 691.35  on 499  degrees of freedom
## Residual deviance: 690.99  on 497  degrees of freedom
## AIC: 696.99
## 
## Number of Fisher Scoring iterations: 3

Neither \(X_1\) nor \(X_2\) is a statistically significant predictor of \(y\) at the \(\alpha = 0.05\) significance level. This is consistent with the non-linear boundary seen in the plot, which a model that is linear in \(X_1\) and \(X_2\) cannot capture.

data = data.frame(x1 = x1, x2 = x2, y = y)
prob = predict(fit_lm, data, type = "response")
pred = ifelse(prob > 0.52, 1, 0)
pos = data[pred == 1, ]
neg = data[pred == 0, ]
plot(pos$x1, pos$x2, col = "blue", xlab = "X1", ylab = "X2", pch = "*")
points(neg$x1, neg$x2, col = "green", pch = 1)

Because the fitted probabilities all lie close to 0.5, a cutoff of 0.52 is used so that both predicted classes appear; the resulting decision boundary is linear, unlike the true quadratic boundary.
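
The concentration of the fitted probabilities around 0.5 can be verified directly (a quick check, not part of the original output):

# the fitted probabilities are concentrated near 0.5, so a 0.50 cutoff
# would put nearly all observations in the same predicted class
summary(prob)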

# note: log(x2, 2) is undefined for x2 <= 0, so those rows are dropped from the fit (see the output below)
fit_poly = glm(y ~ poly(x1, 2) + log(x2, 2) + I(x1 * x2), data = data, family = binomial)
summary(fit_poly)
## 
## Call:
## glm(formula = y ~ poly(x1, 2) + log(x2, 2) + I(x1 * x2), family = binomial, 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2453  -0.2280   0.0006   0.1853   1.7565  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.253      1.016  -6.153 7.61e-10 ***
## poly(x1, 2)1   12.537     17.244   0.727    0.467    
## poly(x1, 2)2   74.737     11.176   6.688 2.27e-11 ***
## log(x2, 2)     -2.881      0.448  -6.432 1.26e-10 ***
## I(x1 * x2)     -4.653      8.011  -0.581    0.561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 333.89  on 240  degrees of freedom
## Residual deviance: 102.21  on 236  degrees of freedom
##   (259 observations deleted due to missingness)
## AIC: 112.21
## 
## Number of Fisher Scoring iterations: 7

The model with non-linear terms has some statistically significant predictors, namely \(X_1^2\) and \(\log_2(X_2)\). Note that because \(\log_2(X_2)\) is undefined for \(X_2 \le 0\), 259 observations were dropped from this fit.
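
A variant that keeps all 500 observations (an alternative specification, not the model fit above) could use a quadratic term in \(x_2\) instead of the logarithm:

# alternative non-linear specification that avoids taking logs of
# non-positive values, so no observations are dropped
fit_poly2 = glm(y ~ poly(x1, 2) + poly(x2, 2) + I(x1 * x2), data = data, family = binomial)
summary(fit_poly2)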

prob_p = predict(fit_poly, data, type = "response")
pred_p = ifelse(prob_p > 0.5, 1, 0)
pos_p = data[pred_p == 1, ]
neg_p = data[pred_p == 0, ]
plot(pos_p$x1, pos_p$x2, col = "blue", xlab = "X1", ylab = "X2", pch = 1)
points(neg_p$x1, neg_p$x2, col = "red", pch = 1)

The predictions are much better than those of the linear model. The new model produces a non-linear decision boundary, although it differs somewhat from the original quadratic boundary.
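
One way to quantify the improvement is to compare training misclassification rates (a sketch; observations dropped by the log term have no prediction and are excluded):

# training misclassification rates of the two logistic models
mean(pred != y)                  # linear terms
mean(pred_p != y, na.rm = TRUE)  # non-linear terms (NA for dropped rows)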

library(e1071)
svm.fit = svm(as.factor(y) ~ x1 + x2, data, kernel = "linear", cost = 0.1)
svm.pred = predict(svm.fit, data)
pos = data[svm.pred == 1, ]
neg = data[svm.pred == 0, ]
plot(pos$x1, pos$x2, col = "blue", xlab = "X1", ylab = "X2", pch = "*")
points(neg$x1, neg$x2, col = "red", pch = 3)

The linear kernel produces a linear boundary that assigns only a handful of observations to one class; the large majority are classified into a single group.
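
This can be confirmed by tabulating the predicted labels against the true classes (a quick check):

# most observations receive the same predicted label under the linear kernel
table(predicted = svm.pred, truth = y)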

library(e1071)
svm.fit = svm(as.factor(y) ~ x1 + x2, data, kernel = "radial", gamma = 1)  # radial is svm()'s default kernel
svm.pred = predict(svm.fit, data)
pos = data[svm.pred == 1, ]
neg = data[svm.pred == 0, ]
plot(pos$x1, pos$x2, col = "blue", xlab = "X1", ylab = "X2", pch = "+")
points(neg$x1, neg$x2, col = "red", pch = 4)

The predicted labels from the radial-kernel SVM closely resemble the original quadratic decision boundary.
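
The agreement with the true labels can be checked directly (a small sketch):

# proportion of observations whose predicted label matches the true class
mean(svm.pred == y)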

This experiment illustrates why SVMs are useful for finding non-linear decision boundaries, and cross-validation makes it straightforward to tune parameters such as gamma. Logistic regression with only linear terms performs poorly on a non-linear boundary, the SVM with a linear kernel does little better (here, with a small cost, it assigns almost everything to one class), while both logistic regression with non-linear terms and the SVM with a radial kernel capture the boundary well.
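
As a concrete illustration of the cross-validation point, cost and gamma for the radial kernel could be tuned jointly with tune() from e1071 (a minimal sketch; the grid values are illustrative assumptions):

# illustrative cross-validation over cost and gamma for the radial kernel
cv.radial = tune(svm, as.factor(y) ~ x1 + x2, data = data, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
summary(cv.radial)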

Exercise 7

library(ISLR)
mileage.median <- median(Auto$mpg)
Auto$mb <- ifelse(Auto$mpg > mileage.median, 1, 0)
Auto$mpglevel = as.factor(Auto$mb)
cost.grid <- c(0.001,0.01, 0.1, 1, 5, 10, 100)
set.seed(1)
tune.res <- tune(svm, mb ~ . - mpg, data = Auto, kernel = 'linear', ranges = list(cost = cost.grid))
summary(tune.res)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     5
## 
## - best performance: 0.001749865 
## 
## - Detailed performance results:
##    cost       error   dispersion
## 1 1e-03 0.094369989 0.0221743982
## 2 1e-02 0.038760453 0.0128111486
## 3 1e-01 0.002059194 0.0003321964
## 4 1e+00 0.001750708 0.0001172866
## 5 5e+00 0.001749865 0.0001196184
## 6 1e+01 0.001749865 0.0001196184
## 7 1e+02 0.001749865 0.0001196184
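
The cross-validation above selects cost = 5 (CV error about 0.0017). Note, however, that mb is numeric, so svm() fits a regression here and the reported error is a mean squared error rather than a misclassification rate; the formula also leaves mpglevel, which encodes the same information as mb, among the predictors. A classification version using the factor response (a sketch under that assumption, also excluding mb and the high-cardinality name variable) might look like:

# classification version: a factor response makes tune() report misclassification error
set.seed(1)
tune.class <- tune(svm, mpglevel ~ . - mpg - mb - name, data = Auto,
                   kernel = 'linear', ranges = list(cost = cost.grid))
summary(tune.class)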

SVM with radial kernel

grid <- c(0.01, 0.1, 1, 5, 10)
set.seed(1)
tune.out <- tune(svm, mb ~ . - mpg, data = Auto, kernel = 'radial', ranges = list(cost = grid), gamma=c(0.01, 0.1, 1, 5, 10, 100))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    10
## 
## - best performance: 0.002002138 
## 
## - Detailed performance results:
##    cost       error   dispersion
## 1  0.01 0.287651483 0.0351963527
## 2  0.10 0.078886860 0.0206536281
## 3  1.00 0.017484810 0.0071390321
## 4  5.00 0.002281495 0.0005543150
## 5 10.00 0.002002138 0.0004984504
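
Here the best cost is 10 (CV error about 0.002), but note that gamma is passed directly to svm() rather than searched over; for the cross-validation to actually tune gamma together with cost, it must go inside the ranges list (a sketch, continuing the classification setup above):

# tune cost and gamma jointly for the radial kernel
set.seed(1)
tune.radial <- tune(svm, mpglevel ~ . - mpg - mb - name, data = Auto,
                    kernel = 'radial',
                    ranges = list(cost = grid, gamma = c(0.01, 0.1, 1, 5, 10, 100)))
summary(tune.radial)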

SVM with polynomial kernel

grid <- c(0.01, 0.1, 1, 5, 10)
set.seed(1)
tune.out <- tune(svm, mb ~ . - mpg, data = Auto, kernel = 'polynomial', ranges = list(cost = grid),degree=c(2,3,4))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    10
## 
## - best performance: 0.3026392 
## 
## - Detailed performance results:
##    cost     error dispersion
## 1  0.01 0.4981933 0.03934104
## 2  0.10 0.4956527 0.03969561
## 3  1.00 0.4706992 0.04479729
## 4  5.00 0.3777593 0.07172821
## 5 10.00 0.3026392 0.07693095
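
Similarly, degree is not part of the ranges list here, so only cost is actually tuned; a sketch tuning both (same classification setup as above):

# tune cost and degree jointly for the polynomial kernel
set.seed(1)
tune.poly <- tune(svm, mpglevel ~ . - mpg - mb - name, data = Auto,
                  kernel = 'polynomial',
                  ranges = list(cost = grid, degree = c(2, 3, 4)))
summary(tune.poly)
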
The following plots back up the assertions above about the linear, polynomial, and radial fits (parts (b) and (c) of the exercise).

# refit the three SVMs with the chosen parameter values for plotting
svm.linear = svm(mpglevel ~ ., data = Auto, kernel = "linear", cost = 1)
svm.poly = svm(mpglevel ~ ., data = Auto, kernel = "polynomial", cost = 100, degree = 2)
svm.radial = svm(mpglevel ~ ., data = Auto, kernel = "radial", cost = 100, gamma = 0.01)
# plot each fitted SVM in the (mpg, predictor) plane for every remaining predictor;
# mb is skipped as well since it is just the binarized version of mpg
plotpairs = function(fit) {
    for (name in names(Auto)[!(names(Auto) %in% c("mpg", "mb", "mpglevel", "name"))]) {
        plot(fit, Auto, as.formula(paste("mpg~", name, sep = "")))
    }
}
plotpairs(svm.linear)

plotpairs(svm.poly)

plotpairs(svm.radial)

Exercise 8

set.seed(1)
s = sample(dim(OJ)[1], 800)
train = OJ[s, ]
test = OJ[-s, ]
library(e1071)
svm.linear = svm(Purchase ~ ., kernel = "linear", data = train, cost = 0.01)
summary(svm.linear)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "linear", cost = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  435
## 
##  ( 219 216 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

The support vector classifier uses 435 support vectors out of the 800 training observations. They are almost evenly split between the two classes, with 219 for Citrus Hill (CH) and 216 for Minute Maid (MM).

train.pred = predict(svm.linear, train)
mean(train.pred != train$Purchase)
## [1] 0.175
test.pred = predict(svm.linear, test)
mean(test.pred != test$Purchase)
## [1] 0.1777778

The training error rate is 17.5% and the test error rate is approximately 17.78%.

set.seed(1)
tune.out = tune(svm, Purchase ~ ., data = train, kernel = "linear", ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  3.162278
## 
## - best performance: 0.16875 
## 
## - Detailed performance results:
##           cost   error dispersion
## 1   0.01000000 0.17625 0.02853482
## 2   0.01778279 0.17625 0.03143004
## 3   0.03162278 0.17125 0.02829041
## 4   0.05623413 0.17625 0.02853482
## 5   0.10000000 0.17250 0.03162278
## 6   0.17782794 0.17125 0.02829041
## 7   0.31622777 0.17125 0.02889757
## 8   0.56234133 0.17125 0.02703521
## 9   1.00000000 0.17500 0.02946278
## 10  1.77827941 0.17375 0.02729087
## 11  3.16227766 0.16875 0.03019037
## 12  5.62341325 0.17375 0.03304563
## 13 10.00000000 0.17375 0.03197764

The tuning selected cost = 3.162278 as optimal, with a cross-validation error of 0.16875.

svm.linear = svm(Purchase ~ ., kernel = "linear", data = train, cost = tune.out$best.parameters$cost)
train.pred = predict(svm.linear, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 423  62
##   MM  70 245
test.pred = predict(svm.linear, test)
mean(test.pred != test$Purchase)
## [1] 0.1518519

With the tuned cost, the test error rate decreases to approximately 15.19% (the training error, from the confusion matrix above, is (62 + 70) / 800 = 16.5%).

set.seed(1)
svm.radial = svm(Purchase ~ ., data = train, kernel = "radial")
summary(svm.radial)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  373
## 
##  ( 188 185 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

The radial basis kernel with the default gamma uses 373 support vectors out of the 800 training observations, split into 188 for CH and 185 for MM.

train.pred = predict(svm.radial, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 441  44
##   MM  77 238
test.pred = predict(svm.radial, test)
mean(test.pred != test$Purchase)
## [1] 0.1851852

The test error for the radial basis kernel is approximately 18.52%.

set.seed(1)
tune.out = tune(svm, Purchase ~ ., data = train, kernel = "radial", ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       cost
##  0.5623413
## 
## - best performance: 0.16875 
## 
## - Detailed performance results:
##           cost   error dispersion
## 1   0.01000000 0.39375 0.04007372
## 2   0.01778279 0.39375 0.04007372
## 3   0.03162278 0.35750 0.05927806
## 4   0.05623413 0.19500 0.02443813
## 5   0.10000000 0.18625 0.02853482
## 6   0.17782794 0.18250 0.03291403
## 7   0.31622777 0.17875 0.03230175
## 8   0.56234133 0.16875 0.02651650
## 9   1.00000000 0.17125 0.02128673
## 10  1.77827941 0.17625 0.02079162
## 11  3.16227766 0.17750 0.02266912
## 12  5.62341325 0.18000 0.02220485
## 13 10.00000000 0.18625 0.02853482

The tuning selected cost = 0.5623413 as optimal, with a cross-validation error of 0.16875.

svm.radial = svm(Purchase ~ ., data = train, kernel = "radial", cost = tune.out$best.parameters$cost)
train.pred = predict(svm.radial, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 437  48
##   MM  71 244
test.pred<-predict(svm.radial, test)
mean(test.pred != test$Purchase)
## [1] 0.1777778

The test error rate for the radial basis kernel with the tuned cost is approximately 17.78%, a slight decrease from the 18.52% obtained with the default cost of 1 (gamma is left at its default in both fits).

svm.poly<-svm(Purchase~ ., kernel='polynomial', data=train, degree=2 )
summary(svm.poly)
## 
## Call:
## svm(formula = Purchase ~ ., data = train, kernel = "polynomial", 
##     degree = 2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  447
## 
##  ( 225 222 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

The degree-2 polynomial kernel uses 447 support vectors: 225 belong to Citrus Hill and 222 to Minute Maid.

train.pred<-predict(svm.poly, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 449  36
##   MM 110 205
test.pred<-predict(svm.poly, test)
mean(test.pred != test$Purchase)
## [1] 0.2222222

The test error rate for the degree-2 polynomial kernel is approximately 22.22%.

tunepoly<-tune(svm, Purchase ~ ., data=train, kernel='polynomial', degree=2, ranges=list(cost=10^seq(-2,1, by=.5)))
summary(tunepoly)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  3.162278
## 
## - best performance: 0.18375 
## 
## - Detailed performance results:
##          cost   error dispersion
## 1  0.01000000 0.39125 0.05239076
## 2  0.03162278 0.36125 0.05993341
## 3  0.10000000 0.32125 0.05001736
## 4  0.31622777 0.20250 0.04031129
## 5  1.00000000 0.20000 0.04249183
## 6  3.16227766 0.18375 0.03682259
## 7 10.00000000 0.18875 0.04143687

The tuning selected cost = 3.162278 as optimal, with a cross-validation error of 0.18375.

svm.poly<-svm(Purchase ~ ., kernel='polynomial', data=train, degree=2, cost=tunepoly$best.parameters$cost)
train.pred<-predict(svm.poly, train)
table(train$Purchase, train.pred)
##     train.pred
##       CH  MM
##   CH 451  34
##   MM  90 225
test.pred<-predict(svm.poly, test)
mean(test.pred != test$Purchase)
## [1] 0.2037037

The test error rate for the polynomial kernel with the tuned cost is approximately 20.37%, a reduction from the 22.22% obtained with the default cost, but still higher than the linear and radial kernels.

Overall, the approach that gives the best results on this data is the support vector classifier with a linear kernel and the tuned cost, which achieves the lowest test misclassification rate (about 15.19%). The radial kernel with the tuned cost attains a slightly lower training error but does not generalize as well here, and the polynomial kernel performs worst on the test set.
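
For reference, the test error rates reported above can be collected in one place (values copied from the output above, not recomputed):

# test error rates reported above for the OJ data
test.err <- c(linear.default = 0.1778, linear.tuned = 0.1519,
              radial.default = 0.1852, radial.tuned = 0.1778,
              poly.default   = 0.2222, poly.tuned   = 0.2037)
sort(test.err)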