Notebook prepared by Everton Lima

Introduction to Statistical Learning Solutions (ISLR)

Ch 9 Exercises

Table of Contents

Conceptual Exercises

Applied Exercises

Conceptual Exercises

1

1a

X1=seq(-1,1,0.1)
plot(X1,1+3*X1,xlab='X1',ylab='X2',type='l',xlim=c(-1,1),ylim=c(-1,4))

for(i in seq(-1,1,length.out = 25)){
  pts=data.frame(rep(i,25),seq(-1,4,length.out = 25))
  points(pts,col=ifelse(1+3*pts[,1]-pts[,2]>0,'red','purple'))
}

The plot shows the linear decision boundary \(1 + 3 X_1 - X_2 = 0\). Purple indicates the points for which \(1 + 3 X_1 - X_2 \leq 0\), and red the points for which \(1 + 3 X_1 - X_2 > 0\).

1b

X1=seq(-1,1,0.1)
plot(X1,1+3*X1,xlab='X1',ylab='X2',type='l',xlim=c(-1,1),ylim=c(-1,4))
lines(X1,1-1/2*X1)

for(i in seq(-1,1,length.out = 25)){
  pts=data.frame(rep(i,25),seq(-1,4,length.out = 25))
  points(pts,col=ifelse(1+3*pts[,1]-pts[,2]>0,'red','purple'),pch=ifelse(-2+pts[,1]+2*pts[,2]>0,1,2))
}

Triangles indicate the points that lie below the second hyperplane, \(-2 + X_1 + 2 X_2 = 0\) (i.e. the points where \(-2 + X_1 + 2 X_2 \leq 0\)); circles lie above it.

2

2a-b

# a
X1=seq(-3,3,0.01)

X2=2-sqrt(4-(1+X1)^2)
## Warning in sqrt(4 - (1 + X1)^2): NaNs produced
plot(X1,X2,type='l',xlim=c(-3,1.5),ylim=c(0,4.5))

X2=2+sqrt(4-(1+X1)^2)
## Warning in sqrt(4 - (1 + X1)^2): NaNs produced
lines(X1,X2)

# b

for(i in seq(-3,3,length.out = 25)){
  pts=data.frame(rep(i,25),seq(-1,5,length.out = 25))
  points(pts,col=ifelse( (1 + pts[,1])^2 + (2-pts[,2])^2 > 4,'blue','red'))
}

The equation describes a circle of radius 2 centered at the point \((-1,2)\). The blue points show the set of points for which \((1+X_1)^2+(2-X_2)^2>4\), and the red points the set for which \((1+X_1)^2+(2-X_2)^2\leq 4\).

2c

From the plot we can see that the points (0,0),(2,2),(3,8) belong to the blue class, and the point (-1,1) to the red class.

2d

In order to express the decision boundary in terms of \(X_1\) and \(X_2\) it is necessary to use quadratic terms. However, if we treat the polynomial transformations of these predictors as completely separate variables, then the decision boundary becomes linear. What is happening here is that we are looking for a higher-dimensional boundary that separates the two classes, which (in this context) is equivalent to adding more predictors.

Let \(Y = X_1^2\) and \(Z = X_2^2\) be two additional predictors. Then,

\[(1+X_1)^2+(2-X_2)^2 = X_1^2+2 X_1 + 1 + X_2^2 - 4 X_2 + 4 = Y + 2 X_1 + Z - 4 X_2 + 5 > 4\]

which is clearly linear in terms of the predictors \(X_1\), \(X_2\), \(Y\) and \(Z\).
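
As a quick supplementary check (not part of the original solution), we can verify in R that the expanded-feature expression assigns the points from 2c to the same side of the boundary as the original circle equation:

# evaluate both forms at the points from 2c; the class assignments agree
pts2c = data.frame(X1=c(0,-1,2,3), X2=c(0,1,2,8))
Y     = pts2c$X1^2                                 # Y = X1^2
Z     = pts2c$X2^2                                 # Z = X2^2
lin   = Y + 2*pts2c$X1 + Z - 4*pts2c$X2 + 5        # linear in X1, X2, Y, Z
quad  = (1+pts2c$X1)^2 + (2-pts2c$X2)^2            # original quadratic form
data.frame(pts2c, linear=lin>4, circle=quad>4)     # the two logical columns are identical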

3

data=data.frame(X1=c(3,2,4,1,2,4,4),X2=c(4,2,4,4,1,3,1),Y=c('red','red','red','red','blue','blue','blue'))

require(knitr)
## Loading required package: knitr
kable(data)
X1 X2 Y
3 4 red
2 2 red
4 4 red
1 4 red
2 1 blue
4 3 blue
4 1 blue

3a

plot(data[,c(1,2)],col=data$Y)

3b

plot(data[,c(1,2)],col=c('red','red','red','red','blue','blue','blue'))
abline(-0.5,1)                     # -0.5+X1=X2

\[-\frac{1}{2}+X_1-X_2 = 0\]

3c

The classification rules are,

\[-\frac{1}{2}+X_1-X_2 > 0 \rightarrow Blue\]

and,

\[-\frac{1}{2}+X_1-X_2 \leq 0 \rightarrow Red\]
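
A quick supplementary check in R confirms that this rule classifies every training observation correctly:

all(ifelse(-0.5 + data$X1 - data$X2 > 0, 'blue', 'red') == data$Y)   # TRUE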

3d

The margin is the minimum distance from the support vectors to the separating hyperplane. Since the (unnormalized) distance is proportional to \(Y(\beta_0+\beta_1 X_1 + \beta_2 X_2)\), we only need to evaluate this quantity at each observation and take the minimum.

min(abs(data[,1]-data[,2]-0.5))  # Margin
## [1] 0.5
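
Note that 0.5 is expressed in the unnormalized units of \(-\frac{1}{2}+X_1-X_2\). As a small supplementary step, dividing by \(\sqrt{\beta_1^2+\beta_2^2}=\sqrt{2}\) gives the Euclidean margin:

min(abs(data[,1]-data[,2]-0.5))/sqrt(1^2+(-1)^2)   # Euclidean margin, approximately 0.354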

Sketching the margin is as easy as plotting two additional lines, obtained by shifting the intercept of the decision boundary:

plot(data[,c(1,2)],col=data$Y)
abline(-0.5,1)                       # -0.5+X1-X2=0

# Sketch Margin

abline(0,1,lty=2)
abline(-1,1,lty=2)

3e

There are four support vectors for the maximal margin classifier.

kable(data[which(abs(data[,1]-data[,2]-0.5) ==0.5),])
X1 X2 Y
2 2 2 red
3 4 4 red
5 2 1 blue
6 4 3 blue

3f

The seventh observation is not a support vector. Hence, a small perturbation of its position does not change the optimal boundary.
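
As a supplementary check, the seventh observation's (unnormalized) distance from the boundary is far larger than the margin of 0.5, so slightly moving it cannot make it a support vector:

abs(data[7,1]-data[7,2]-0.5)   # 2.5, well outside the margin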

3g

\[-\frac{1}{4} + X_1 - X_2 = 0\]

The hyperplane above separates all of the observations, but it is not the optimal boundary: its margin is smaller than that of the optimal choice.

plot(data[,c(1,2)],col=data$Y)

abline(-0.5,1)                          # -0.5+X1-X2=0
abline(0,1,lty=2)
abline(-1,1,lty=2)

abline(-0.25,1,col='red')               # -0.25+X1-X2=0
margin=min(abs(data[,1]-data[,2]-0.25))
abline(-0.25+margin,1,lty=2,col='red')
abline(-0.23-margin,1,lty=2,col='red')  # intercept shifted slightly so the line is easier to see in the plot

3h

plot(data[,c(1,2)],col=data$Y)
abline(-0.5,1)                       # -0.5+X1-X2=0

# Sketch Margin
abline(0,1,lty=2)
abline(-1,1,lty=2)

points(2.5,3.5,col='blue')                # an extra blue observation at (2.5,3.5) makes the classes non-separable

Applied Exercises

4

The first step is to generate data. A simple way to do so is to sample from a normal distribution, as in the code snippet below. The class of each point is then assigned by checking whether it lies above or below a decision boundary; for this question, I have selected \(X_2 = \frac{2}{3}(X_1-3)^2+X_1-2\) as that boundary.

# generate dataset
set.seed(22)
X=data.frame(X1=abs(rnorm(100,1.5,1.2)),X2=abs(rnorm(100,1.5,1)+0.2))
pts=seq(0,5,length.out = 100)

# nonlinear decision boundary
class=ifelse(2/3*(X[,1]-3)^2+X[,1]-2-X[,2]>0,'red','blue')  # boundary value minus X2 > 0 => point lies below the curve

plot(X,xlab='X1',ylab='X2',col=class)

Now that we understand how to generate data, the next step is to train the two classifiers. This can easily be done with the svm function from the (uniquely named) e1071 library, which can fit both linear and polynomial decision boundaries.

# linear SVM
require(e1071)
## Loading required package: e1071
svm.linear=svm(class~.,data=data.frame(X,class=as.factor(class)),kernel='linear')
summary(svm.linear)
## 
## Call:
## svm(formula = class ~ ., data = data.frame(X, class = as.factor(class)), 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  59
## 
##  ( 30 29 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  blue red
svm.linear.pred=predict(svm.linear,X,type='response')
table(class,svm.linear.pred)
##       svm.linear.pred
## class  blue red
##   blue   38   9
##   red    16  37

The linear model achieves a misclassification rate of 25% on the training data. Moreover, more than half of the observations are used as support vectors.
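
For reference, the quoted rate can be computed directly from the predictions (a small supplementary check):

mean(svm.linear.pred != class)   # 0.25 on this training run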

We continue on and fit an SVM with a polynomial kernel.

# fitting a polynomial kernel
svm.poly=svm(class~.,data=data.frame(X,class=as.factor(class)),kernel='polynomial')
summary(svm.poly)
## 
## Call:
## svm(formula = class ~ ., data = data.frame(X, class = as.factor(class)), 
##     kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.5 
##      coef.0:  0 
## 
## Number of Support Vectors:  69
## 
##  ( 35 34 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  blue red
svm.poly.pred=predict(svm.poly,X,type='response')
table(class,svm.poly.pred)
##       svm.poly.pred
## class  blue red
##   blue   47   0
##   red    16  37

From the output of the function we can note that a polynomial kernel of degree 3 was selected and that 69 out of 100 observations are used as support vectors. Moreover, we can see from the confusion matrix above that 16 observations are misclassified; this model has a 16% classification error rate on the training set.

In order to get the most out of our SVM with a polynomial decision boundary we should tune its parameters. As shown in the lab, this can be done with the tune function.

# parameter tuning for a polynomial SVM
set.seed(42)
svm.poly.tune=tune(method=svm,y~.,data=data.frame(X,y=as.factor(class)),
                   kernel="polynomial",ranges=list(cost=c(seq(0.05,1,length.out = 23),5,10,100)) )
svm.poly.tune
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   100
## 
## - best performance: 0.14
svm.poly.tune$best.model
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = data.frame(X, 
##     y = as.factor(class)), ranges = list(cost = c(seq(0.05, 1, 
##     length.out = 23), 5, 10, 100)), kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  100 
##      degree:  3 
##       gamma:  0.5 
##      coef.0:  0 
## 
## Number of Support Vectors:  41
svm.poly.pred=predict(svm.poly.tune$best.model,X,type='response')
table(class,svm.poly.pred)
##       svm.poly.pred
## class  blue red
##   blue   46   1
##   red    10  43

From the output of the snippet above we can see that increasing the cost improves the model. When the cost parameter is set to 100, the polynomial SVM has a misclassification rate of 11% on the training data. Note that fewer observations are used as support vectors; increasing the cost parameter has had the effect of decreasing the bias of this model.
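
A supplementary way to see the drop in support vectors is to compare the two fits directly, using the index component of e1071's fitted svm objects (which lists the support-vector rows); the counts correspond to the 69 and 41 reported above.

length(svm.poly$index)                   # support vectors with the default cost of 1
length(svm.poly.tune$best.model$index)   # support vectors with cost = 100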

5

5a

x1=runif(500)-0.5
x2=runif(500)-0.5
y=1*(x1^2-x2^2 > 0 )

5b

plot(x1,x2,col=ifelse(y,'red','blue'))

5c

# glm() is part of base R (the stats package), so no additional package needs to be loaded.
glm.fit=glm(y~. ,family='binomial', data=data.frame(x1,x2,y))
glm.fit
## 
## Call:  glm(formula = y ~ ., family = "binomial", data = data.frame(x1, 
##     x2, y))
## 
## Coefficients:
## (Intercept)           x1           x2  
##     0.05897     -0.41248      0.17815  
## 
## Degrees of Freedom: 499 Total (i.e. Null);  497 Residual
## Null Deviance:       692.6 
## Residual Deviance: 690.4     AIC: 696.4

5d

In the plot below the circles are observations that are classified correctly, and the crosses are misclassified. It is clear that this model performs poorly: the fitted decision boundary is linear, so it cannot capture the true quadratic boundary and many observations are misclassified.

glm.pred=predict(glm.fit,data.frame(x1,x2))   # returns the log odds value.
plot(x1,x2,col=ifelse(glm.pred>0,'red','blue'),pch=ifelse(as.integer(glm.pred>0) == y,1,4))
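
Equivalently (a small aside), predict() can return fitted probabilities via type='response'; thresholding them at 0.5 gives the same classification as thresholding the log-odds at 0:

glm.prob=predict(glm.fit,data.frame(x1,x2),type='response')   # fitted probabilities
all((glm.prob>0.5) == (glm.pred>0))                           # TRUE: identical classifications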

5e

glm.fit=glm(y~poly(x1,2)+poly(x2,2) ,family='binomial', data=data.frame(x1,x2,y))
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

5f

Again, the circles are observations that are classified correctly. As we can see from the absence of crosses, all training observations are correctly classified.

glm.pred=predict(glm.fit,data.frame(x1,x2))     # returns the log-odds.
plot(x1,x2,col=ifelse(glm.pred>0,'red','blue'),pch=ifelse(as.integer(glm.pred>0) == y,1,4))

5g

The same is done as before, but now with a linear SVM.

svm.fit=svm(y~.,data=data.frame(x1,x2,y=as.factor(y)),kernel='linear')
svm.pred=predict(svm.fit,data.frame(x1,x2),type='response')
plot(x1,x2,col=ifelse(svm.pred!=0,'red','blue'),pch=ifelse(svm.pred == y,1,4))

5h

We do the same again, but now the SVM uses a polynomial kernel of degree 2. As can be observed from the plot below, there is a significant improvement compared to the linear kernel, as only a few observations are misclassified.

svm.fit=svm(y~.,data=data.frame(x1,x2,y=as.factor(y)),kernel='polynomial',degree=2)
svm.pred=predict(svm.fit,data.frame(x1,x2),type='response')
plot(x1,x2,col=ifelse(svm.pred!=0,'red','blue'),pch=ifelse(svm.pred == y,1,4))

5i

From the plots it is clear that the logit model with quadratic terms performs much better on the training data than the SVM with a polynomial kernel. If we examine the confusion matrices (shown below) we can see that 24 observations are misclassified by the SVM, whilst the logit model has 100% training accuracy.

require(knitr)

kable(table(y,as.integer(glm.pred>0)))
0 1
0 242 0
1 0 258
kable(table(y,svm.pred))
0 1
0 226 16
1 8 250
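
For completeness, the two training error rates can also be computed directly (a small supplementary step; the values follow from the confusion matrices above):

mean(as.integer(glm.pred>0) != y)   # 0 for the quadratic logit fit
mean(svm.pred != y)                 # 24/500 = 0.048 for the polynomial-kernel SVM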

6

6a

set.seed(42)

x1=runif(500)-0.5
x2=runif(500)-0.5

y=ifelse(-0.2*x1+x2 > 0,'red','blue')
plot(x1,x2,col=y)

6b

A straightforward way to produce the training errors is simply to train several models, one for each value of the cost parameter.

costs=seq(1,50,length.out = 20)

perf=c()
for(c in costs){
  svm.fit=svm(y~.,data=data.frame(x1,x2,y),kernel='linear',cost=c)
  svm.pred=predict(svm.fit,data.frame(x1,x2))
  error=mean(svm.pred!=y)

  perf=rbind(perf,c(c,error))
}

The tune function can be used to obtain cross-validation errors; I have chosen to use 20-fold cross-validation. Note that different executions of the snippet below will produce different results due to the random fold assignment used in cross-validation.

set.seed(42)
costs=data.frame(cost=costs)
svm.tune=tune(svm,y~.,data=data.frame(x1,x2,y),ranges=costs,tunecontrol=tune.control(cross=20))

From the plot below one can note that the cross-validation error (shown in black) and the training error (shown in red) follow the same trend, with the cross-validation error taking somewhat higher values.

plot(svm.tune$performances[,c(1,2)],type='l',ylim=c(0,0.02),xlim=c(0,50))
lines(perf,col='red')
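
If needed, the best cost found by cross-validation and its error estimate can be read straight off the tune object (a supplementary note; the exact values vary with the random fold assignment):

svm.tune$best.parameters    # cost with the lowest 20-fold CV error
svm.tune$best.performance   # the corresponding CV error estimate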

7

7a

require(ISLR)
## Loading required package: ISLR
Auto$mpg=as.factor(ifelse(Auto$mpg>median(Auto$mpg),1,0))   # encode as a factor so that svm() performs classification
table(Auto$mpg)
## 
##   0   1 
## 196 196

7b

The cross-validation errors can be produced with ease by the tune function from the e1071 package. This is done by the code below.

require(e1071)

costs=data.frame(cost=seq(0.05,100,length.out = 15))               # tuning grid for the cost parameter.
svm.tune=tune(svm,mpg~.,data=Auto,ranges=costs,kernel='linear')     # 10-fold cross validation.
svm.tune
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##  0.05
## 
## - best performance: 0.1029192

Moreover, if we plot these values we can see that the smallest cost in the grid, 0.05, already attains the lowest cross-validation error; increasing the cost does not reduce the misclassification error further.

plot(svm.tune$performances[,c(1,2)],type='l')

7c

params=data.frame(cost=seq(0.05,100,length.out = 5),degree=seq(1,100,length.out = 5))
svm.poly=tune(svm,mpg~.,data=Auto,ranges=params,kernel='polynomial')
svm.poly
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##   100      1
## 
## - best performance: 0.09874441
params=data.frame(cost=seq(0.05,100,length.out = 5),gamma=seq(0.1,100,length.out = 5))
svm.radial=tune(svm,mpg~.,data=Auto,ranges=params,kernel='radial')
svm.radial
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##     cost gamma
##  25.0375   0.1
## 
## - best performance: 0.06606846
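
Commenting on these results: the radial kernel achieves the lowest cross-validated error (about 6.6%), followed by the degree-1 polynomial (about 9.9%) and the linear kernel (about 10.3%). As a small supplementary step, the best performances can be collected from the three tune objects:

c(linear     = svm.tune$best.performance,
  polynomial = svm.poly$best.performance,
  radial     = svm.radial$best.performance)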

8

8a

set.seed(42)
train=sample(1:1070,800)
test=(1:1070)[-train]

tb=c()
res=c()

8b

require(ISLR)
require(e1071)

svm.fit=svm(Purchase~.,data=OJ,subset=train,cost=0.01,kernel='linear')
summary(svm.fit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "linear", 
##     subset = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  439
## 
##  ( 219 220 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

From the output of R’s summary function we can see that 439 observations are used as support vectors. Moreover, the support vectors are split almost equally between the two classes.
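
The same counts can be read off the fitted object directly (a supplementary check, using the index and nSV components of e1071's svm objects):

length(svm.fit$index)   # total number of support vectors
svm.fit$nSV             # how they split across the two classes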

8c

# train
svm.pred=predict(svm.fit,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 428 54
MM 78 240
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.165
res=cbind(res,'train'=mean(OJ$Purchase[train] != svm.pred))
# test
svm.pred=predict(svm.fit,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 150 21
MM 29 70
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.1851852
res=cbind(res,'test'=mean(OJ$Purchase[test] != svm.pred))

8d

svm.tune=tune(svm,Purchase~.,data=OJ[train,],ranges=data.frame(cost=seq(0.01,10,length.out=25)),kernel='linear')
summary(svm.tune)
## 
## Error estimation of 'svm' using 10-fold cross validation: 0.17375
res=cbind(res,'CV'=svm.tune$best.performance)

8e

svm.pred=predict(svm.tune$best.model,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 428 54
MM 78 240
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.165
res=cbind(res,'train.tuned'=mean(OJ$Purchase[train] != svm.pred))
svm.pred=predict(svm.tune$best.model,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 150 21
MM 29 70
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.1851852
res=cbind(res,'test.tuned'=mean(OJ$Purchase[test] != svm.pred))

tb=rbind(tb,res)
res=c()

8f

# repeat (b): radial kernel with default gamma and cost 0.01
svm.fit=svm(Purchase~.,data=OJ,subset=train,cost=0.01,kernel='radial')
summary(svm.fit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "radial", 
##     subset = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.01 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  638
## 
##  ( 318 320 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# train
svm.pred=predict(svm.fit,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 482 0
MM 318 0
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.3975
res=cbind(res,'train'=mean(OJ$Purchase[train] != svm.pred))


# test
svm.pred=predict(svm.fit,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 171 0
MM 99 0
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.3666667
res=cbind(res,'test'=mean(OJ$Purchase[test] != svm.pred))
svm.tune=tune(svm,Purchase~.,data=OJ[train,],ranges=data.frame(cost=seq(0.01,10,length.out=25)))   # default kernel is radial
summary(svm.tune)
## 
## Error estimation of 'svm' using 10-fold cross validation: 0.3975
res=cbind(res,'CV'=svm.tune$best.performance)
# train
svm.pred=predict(svm.tune$best.model,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 482 0
MM 318 0
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.3975
res=cbind(res,'train.tuned'=mean(OJ$Purchase[train] != svm.pred))


# test
svm.pred=predict(svm.tune$best.model,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 171 0
MM 99 0
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.3666667
res=cbind(res,'test.tuned'=mean(OJ$Purchase[test] != svm.pred))

tb=rbind(tb,res)
res=c()

8g

# repeat (b): polynomial kernel with cost 0.01
svm.fit=svm(Purchase~.,data=OJ,subset=train,cost=0.01,kernel='polynomial')
summary(svm.fit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "polynomial", 
##     subset = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  0.01 
##      degree:  3 
##       gamma:  0.05555556 
##      coef.0:  0 
## 
## Number of Support Vectors:  634
## 
##  ( 315 319 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# train
svm.pred=predict(svm.fit,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 475 7
MM 298 20
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.38125
res=cbind(res,'train'=mean(OJ$Purchase[train] != svm.pred))

# test
svm.pred=predict(svm.fit,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 171 0
MM 90 9
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.3333333
res=cbind(res,'test'=mean(OJ$Purchase[test] != svm.pred))
svm.tune=tune(svm,Purchase~.,data=OJ[train,],ranges=data.frame(cost=seq(0.01,10,length.out=25)),kernel='polynomial')
summary(svm.tune)
## 
## Error estimation of 'svm' using 10-fold cross validation: 0.38125
res=cbind(res,'CV'=svm.tune$best.performance)
# train
svm.pred=predict(svm.tune$best.model,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 475 7
MM 298 20
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.38125
res=cbind(res,'train.tuned'=mean(OJ$Purchase[train] != svm.pred))


# test
svm.pred=predict(svm.tune$best.model,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 171 0
MM 90 9
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.3333333
res=cbind(res,'test.tuned'=mean(OJ$Purchase[test] != svm.pred))

tb=rbind(tb,res)

8h

The linear SVM performs best on this data, attaining the lowest test error (about 18.5%). At cost 0.01 the radial and polynomial fits do little better than always predicting CH.

rownames(tb)=c('LINEAR','RADIAL','POLYNOMIAL')   # rows were added in the order linear, radial, polynomial
kable(tb)
train test CV train.tuned test.tuned
LINEAR 0.16500 0.1851852 0.17375 0.16500 0.1851852
RADIAL 0.39750 0.3666667 0.39750 0.39750 0.3666667
POLYNOMIAL 0.38125 0.3333333 0.38125 0.38125 0.3333333