Notebook prepared by Everton Lima

Introduction to Statistical Learning Solutions (ISLR)

Ch 9 Exercises

Table of Contents

Conceptual Exercises

Applied Exercises

Conceptual Exercises

1

1a

X1=seq(-1,1,0.1)
plot(X1,1+3*X1,xlab='X1',ylab='X2',type='l',xlim=c(-1,1),ylim=c(-1,4))

for(i in seq(-1,1,length.out = 25)){
  pts=data.frame(rep(i,25),seq(-1,4,length.out = 25))
  points(pts,col=ifelse(1+3*pts[,1]-pts[,2]>0,'red','purple'))
}

The plot shows the linear decision boundary \(1 + 3 X_1 - X_2 = 0\). Purple indicates the points for which \(1 + 3 X_1 - X_2 \leq 0\), and red the points for which \(1 + 3 X_1 - X_2 > 0\).

1b

X1=seq(-1,1,0.1)
plot(X1,1+3*X1,xlab='X1',ylab='X2',type='l',xlim=c(-1,1),ylim=c(-1,4))
lines(X1,1-1/2*X1)

for(i in seq(-1,1,length.out = 25)){
  pts=data.frame(rep(i,25),seq(-1,4,length.out = 25))
  points(pts,col=ifelse(1+3*pts[,1]-pts[,2]>0,'red','purple'),pch=ifelse(-2+pts[,1]+2*pts[,2]>0,1,2))
}

Triangles indicate the points that lie below the second hyperplane, \(-2 + X_1 + 2 X_2 = 0\) (i.e. the points where \(-2 + X_1 + 2 X_2 \leq 0\)); circles lie above it.

2

2a-b

# a
X1=seq(-3,3,0.01)

X2=2-sqrt(4-(1+X1)^2)
## Warning in sqrt(4 - (1 + X1)^2): NaNs produced
plot(X1,X2,type='l',xlim=c(-3,1.5),ylim=c(0,4.5))

X2=2+sqrt(4-(1+X1)^2)
## Warning in sqrt(4 - (1 + X1)^2): NaNs produced
lines(X1,X2)

# b

for(i in seq(-3,3,length.out = 25)){
  pts=data.frame(rep(i,25),seq(-1,5,length.out = 25))
  points(pts,col=ifelse( (1 + pts[,1])^2 + (2-pts[,2])^2 > 4,'blue','red'))
}

The equation describes a circle of radius 2 centered at the point \((-1,2)\). The blue points show the set of points for which \((1+X_1)^2+(2-X_2)^2>4\), and the red points the set for which \((1+X_1)^2+(2-X_2)^2\leq 4\).

2c

From the plot we can see that the points (0,0),(2,2),(3,8) belong to the blue class, and the point (-1,1) to the red class.

2d

In order to express the decision boundary in terms of \(X_1\) and \(X_2\) it is necessary to use quadratic terms. However, if we treat the polynomial transformations of these predictors as completely separate variables, then the decision boundary becomes linear. What is happening here is that we are looking for a higher-dimensional boundary that separates the two classes, which (in this context) is equivalent to adding more predictors.

Let \(Y = X_1^2\) and \(Z = X_2^2\) be two additional predictors. Then,

\[(1+X_1)^2+(2-X_2)^2 = X_1^2+2 X_1 + 1 + X_2^2 - 4 X_2 + 4 = Y + 2 X_1 + Z - 4 X_2 + 5 > 4\]

which is clearly linear in terms of the predictors \(X_1\), \(X_2\), \(Y\) and \(Z\).
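
As a quick supplementary check (not part of the original solution), we can verify in R that the expanded-feature expression assigns the points from 2c to the same side of the boundary as the original circle equation:

# evaluate both forms at the points from 2c; the class assignments agree
pts2c = data.frame(X1=c(0,-1,2,3), X2=c(0,1,2,8))
Y     = pts2c$X1^2                                 # Y = X1^2
Z     = pts2c$X2^2                                 # Z = X2^2
lin   = Y + 2*pts2c$X1 + Z - 4*pts2c$X2 + 5        # linear in X1, X2, Y, Z
quad  = (1+pts2c$X1)^2 + (2-pts2c$X2)^2            # original quadratic form
data.frame(pts2c, linear=lin>4, circle=quad>4)     # the two logical columns are identical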

3

data=data.frame(X1=c(3,2,4,1,2,4,4),X2=c(4,2,4,4,1,3,1),Y=c('red','red','red','red','blue','blue','blue'))

require(knitr)
## Loading required package: knitr
kable(data)
X1 X2 Y
3 4 red
2 2 red
4 4 red
1 4 red
2 1 blue
4 3 blue
4 1 blue

3a

plot(data[,c(1,2)],col=data$Y)

3b

plot(data[,c(1,2)],col=c('red','red','red','red','blue','blue','blue'))
abline(-0.5,1)                     # -0.5+X1=X2

\[-\frac{1}{2}+X_1-X_2 = 0\]

3c

The classification rules are,

\[-\frac{1}{2}+X_1-X_2 > 0 \rightarrow Blue\]

and,

\[-\frac{1}{2}+X_1-X_2 \leq 0 \rightarrow Red\]
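
A quick supplementary check in R confirms that this rule classifies every training observation correctly:

all(ifelse(-0.5 + data$X1 - data$X2 > 0, 'blue', 'red') == data$Y)   # TRUE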

3d

The margin is the minimum distance from the support vectors to the separating hyperplane. Since the (unnormalized) distance is proportional to \(Y(\beta_0+\beta_1 X_1 + \beta_2 X_2)\), we only need to evaluate this quantity at each observation and take the minimum.

min(abs(data[,1]-data[,2]-0.5))  # Margin
## [1] 0.5
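
Note that 0.5 is expressed in the unnormalized units of \(-\frac{1}{2}+X_1-X_2\). As a small supplementary step, dividing by \(\sqrt{\beta_1^2+\beta_2^2}=\sqrt{2}\) gives the Euclidean margin:

min(abs(data[,1]-data[,2]-0.5))/sqrt(1^2+(-1)^2)   # Euclidean margin, approximately 0.354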

Sketching the margin is as easy as plotting two additional lines, obtained by shifting the intercept of the decision boundary:

plot(data[,c(1,2)],col=data$Y)
abline(-0.5,1)                       # -0.5+X1-X2=0

# Sketch Margin

abline(0,1,lty=2)
abline(-1,1,lty=2)

3e

There are four support vectors for the maximal margin classifier.

kable(data[which(abs(data[,1]-data[,2]-0.5) ==0.5),])
X1 X2 Y
2 2 2 red
3 4 4 red
5 2 1 blue
6 4 3 blue

3f

The seventh observation is not a support vector. Hence, a small perturbation of its position does not change the optimal boundary.
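
As a supplementary check, the seventh observation's (unnormalized) distance from the boundary is far larger than the margin of 0.5, so slightly moving it cannot make it a support vector:

abs(data[7,1]-data[7,2]-0.5)   # 2.5, well outside the margin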

3g

\[-\frac{1}{4} + X_1 - X_2 = 0\]

The hyperplane above separates all of the observations, but it is not the optimal boundary: its margin is smaller than that of the optimal choice.

plot(data[,c(1,2)],col=data$Y)

abline(-0.5,1)                          # -0.5+X1-X2=0
abline(0,1,lty=2)
abline(-1,1,lty=2)

abline(-0.25,1,col='red')               # -0.25+X1-X2=0
margin=min(abs(data[,1]-data[,2]-0.25))
abline(-0.25+margin,1,lty=2,col='red')
abline(-0.23-margin,1,lty=2,col='red')  # intercept shifted slightly so the line is easier to see in the plot

3h

plot(data[,c(1,2)],col=data$Y)
abline(-0.5,1)                       # -0.5+X1-X2=0

# Sketch Margin
abline(0,1,lty=2)
abline(-1,1,lty=2)

points(2.5,3.5,col='blue')                # an extra blue observation at (2.5,3.5) makes the classes non-separable

Applied Exercises

4

The first step is to generate data. A simple way to do so is to sample from a normal distribution, as in the code snippet below. The class of each point is then assigned by checking whether it lies above or below a decision boundary; for this question, I have selected \(X_2 = \frac{2}{3}(X_1-3)^2+X_1-2\) as that boundary.

# generate dataset
set.seed(22)
X=data.frame(X1=abs(rnorm(100,1.5,1.2)),X2=abs(rnorm(100,1.5,1)+0.2))
pts=seq(0,5,length.out = 100)

# nonlinear decision boundary
class=ifelse(2/3*(X[,1]-3)^2+X[,1]-2-X[,2]>0,'red','blue')  # boundary value minus X2 > 0 => point lies below the curve

plot(X,xlab='X1',ylab='X2',col=class)

Now that we understand how to generate data, the next step is to train the two classifiers. This can easily be done with the svm function from the (uniquely named) e1071 library, which can fit both linear and polynomial decision boundaries.

# linear SVM
require(e1071)
## Loading required package: e1071
svm.linear=svm(class~.,data=data.frame(X,class=as.factor(class)),kernel='linear')
summary(svm.linear)
## 
## Call:
## svm(formula = class ~ ., data = data.frame(X, class = as.factor(class)), 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  59
## 
##  ( 30 29 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  blue red
svm.linear.pred=predict(svm.linear,X,type='response')
table(class,svm.linear.pred)
##       svm.linear.pred
## class  blue red
##   blue   38   9
##   red    16  37

The linear model achieves a misclassification rate of 25% on the training data. Moreover, more than half of the observations are used as support vectors.
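
For reference, the quoted rate can be computed directly from the predictions (a small supplementary check):

mean(svm.linear.pred != class)   # 0.25 on this training run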

We continue on and fit an SVM with a polynomial kernel.

# fitting a polynomial kernel
svm.poly=svm(class~.,data=data.frame(X,class=as.factor(class)),kernel='polynomial')
summary(svm.poly)
## 
## Call:
## svm(formula = class ~ ., data = data.frame(X, class = as.factor(class)), 
##     kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.5 
##      coef.0:  0 
## 
## Number of Support Vectors:  69
## 
##  ( 35 34 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  blue red
svm.poly.pred=predict(svm.poly,X,type='response')
table(class,svm.poly.pred)
##       svm.poly.pred
## class  blue red
##   blue   47   0
##   red    16  37

From the output of the function we can note that a polynomial kernel of degree 3 was selected and that 69 out of 100 observations are used as support vectors. Moreover, we can see from the confusion matrix above that 16 observations are misclassified; this model has a 16% classification error rate on the training set.

In order to get the most out of our SVM with a polynomial decision boundary we should tune its parameters. As shown in the lab, this can be done with the tune function.

# parameter tuning for a polynomial SVM
set.seed(42)
svm.poly.tune=tune(method=svm,y~.,data=data.frame(X,y=as.factor(class)),
                   kernel="polynomial",ranges=list(cost=c(seq(0.05,1,length.out = 23),5,10,100)) )
svm.poly.tune
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   100
## 
## - best performance: 0.14
svm.poly.tune$best.model
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = data.frame(X, 
##     y = as.factor(class)), ranges = list(cost = c(seq(0.05, 1, 
##     length.out = 23), 5, 10, 100)), kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  100 
##      degree:  3 
##       gamma:  0.5 
##      coef.0:  0 
## 
## Number of Support Vectors:  41
svm.poly.pred=predict(svm.poly.tune$best.model,X,type='response')
table(class,svm.poly.pred)
##       svm.poly.pred
## class  blue red
##   blue   46   1
##   red    10  43

From the output of the snippet above we can see that increasing the cost improves the model. When the cost parameter is set to 100, the polynomial SVM has a misclassification rate of 11% on the training data. Note that fewer observations are used as support vectors; increasing the cost parameter has had the effect of decreasing the bias of this model.
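
A supplementary way to see the drop in support vectors is to compare the two fits directly, using the index component of e1071's fitted svm objects (which lists the support-vector rows); the counts correspond to the 69 and 41 reported above.

length(svm.poly$index)                   # support vectors with the default cost of 1
length(svm.poly.tune$best.model$index)   # support vectors with cost = 100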

5

5a

x1=runif(500)-0.5
x2=runif(500)-0.5
y=1*(x1^2-x2^2 > 0 )

5b

plot(x1,x2,col=ifelse(y,'red','blue'))

5c

# glm() is part of base R (the stats package), so no additional package needs to be loaded.
glm.fit=glm(y~. ,family='binomial', data=data.frame(x1,x2,y))
glm.fit
## 
## Call:  glm(formula = y ~ ., family = "binomial", data = data.frame(x1, 
##     x2, y))
## 
## Coefficients:
## (Intercept)           x1           x2  
##     0.05897     -0.41248      0.17815  
## 
## Degrees of Freedom: 499 Total (i.e. Null);  497 Residual
## Null Deviance:       692.6 
## Residual Deviance: 690.4     AIC: 696.4

5d

In the plot below the circles are observations that are classified correctly, and the crosses are misclassified. It is clear that this model performs poorly: the fitted decision boundary is linear, so it cannot capture the true quadratic boundary and many observations are misclassified.

glm.pred=predict(glm.fit,data.frame(x1,x2))   # returns the log odds value.
plot(x1,x2,col=ifelse(glm.pred>0,'red','blue'),pch=ifelse(as.integer(glm.pred>0) == y,1,4))
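
Equivalently (a small aside), predict() can return fitted probabilities via type='response'; thresholding them at 0.5 gives the same classification as thresholding the log-odds at 0:

glm.prob=predict(glm.fit,data.frame(x1,x2),type='response')   # fitted probabilities
all((glm.prob>0.5) == (glm.pred>0))                           # TRUE: identical classifications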

5e

glm.fit=glm(y~poly(x1,2)+poly(x2,2) ,family='binomial', data=data.frame(x1,x2,y))
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

5f

Again, the circles are observations that are classified correctly. As we can see from the absence of crosses, all training observations are correctly classified.

glm.pred=predict(glm.fit,data.frame(x1,x2))     # returns the log-odds.
plot(x1,x2,col=ifelse(glm.pred>0,'red','blue'),pch=ifelse(as.integer(glm.pred>0) == y,1,4))

5g

The same is done as before, but now with a linear SVM.

svm.fit=svm(y~.,data=data.frame(x1,x2,y=as.factor(y)),kernel='linear')
svm.pred=predict(svm.fit,data.frame(x1,x2),type='response')
plot(x1,x2,col=ifelse(svm.pred!=0,'red','blue'),pch=ifelse(svm.pred == y,1,4))

5h

We do the same again, but now the SVM uses a polynomial kernel of degree 2. As can be observed from the plot below, there is a significant improvement compared to the linear kernel, as only a few observations are misclassified.

svm.fit=svm(y~.,data=data.frame(x1,x2,y=as.factor(y)),kernel='polynomial',degree=2)
svm.pred=predict(svm.fit,data.frame(x1,x2),type='response')
plot(x1,x2,col=ifelse(svm.pred!=0,'red','blue'),pch=ifelse(svm.pred == y,1,4))

5i

From the plots it is clear that the logit model with quadratic terms performs much better on the training data than the SVM with a polynomial kernel. If we examine the confusion matrices (shown below) we can see that 24 observations are misclassified by the SVM, whilst the logit model has 100% training accuracy.

require(knitr)

kable(table(y,as.integer(glm.pred>0)))
0 1
0 242 0
1 0 258
kable(table(y,svm.pred))
0 1
0 226 16
1 8 250
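
For completeness, the two training error rates can also be computed directly (a small supplementary step; the values follow from the confusion matrices above):

mean(as.integer(glm.pred>0) != y)   # 0 for the quadratic logit fit
mean(svm.pred != y)                 # 24/500 = 0.048 for the polynomial-kernel SVM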

6

6a

set.seed(42)

x1=runif(500)-0.5
x2=runif(500)-0.5

y=ifelse(-0.2*x1+x2 > 0,'red','blue')
plot(x1,x2,col=y)

6b

A straightforward way to produce the training errors is simply to train several models, one for each value of the cost parameter.

costs=seq(1,50,length.out = 20)

perf=c()
for(c in costs){
  svm.fit=svm(y~.,data=data.frame(x1,x2,y),kernel='linear',cost=c)
  svm.pred=predict(svm.fit,data.frame(x1,x2))
  error=mean(svm.pred!=y)

  perf=rbind(perf,c(c,error))
}

The tune function can be used to obtain cross-validation errors; I have chosen to use 20-fold cross-validation. Note that different executions of the snippet below will produce different results due to the random fold assignment used in cross-validation.

set.seed(42)
costs=data.frame(cost=costs)
svm.tune=tune(svm,y~.,data=data.frame(x1,x2,y),ranges=costs,tunecontrol=tune.control(cross=20))

From the plot below one can note that the cross-validation error (shown in black) and the training error (shown in red) follow the same trend, with the cross-validation error taking somewhat higher values.

plot(svm.tune$performances[,c(1,2)],type='l',ylim=c(0,0.02),xlim=c(0,50))
lines(perf,col='red')
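
If needed, the best cost found by cross-validation and its error estimate can be read straight off the tune object (a supplementary note; the exact values vary with the random fold assignment):

svm.tune$best.parameters    # cost with the lowest 20-fold CV error
svm.tune$best.performance   # the corresponding CV error estimate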

7

7a

require(ISLR)
## Loading required package: ISLR
Auto$mpg=as.factor(ifelse(Auto$mpg>median(Auto$mpg),1,0))   # encode as a factor so that svm() performs classification
table(Auto$mpg)
## 
##   0   1 
## 196 196

7b

The cross-validation errors can be produced with ease by the tune function from the e1071 package. This is done by the code below.

require(e1071)

costs=data.frame(cost=seq(0.05,100,length.out = 15))               # tuning grid for the cost parameter.
svm.tune=tune(svm,mpg~.,data=Auto,ranges=costs,kernel='linear')     # 10-fold cross validation.
svm.tune
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##  0.05
## 
## - best performance: 0.1029192

Moreover, if we plot these values we can see that the smallest cost in the grid, 0.05, already attains the lowest cross-validation error; increasing the cost does not reduce the misclassification error further.

plot(svm.tune$performances[,c(1,2)],type='l')

7c

params=data.frame(cost=seq(0.05,100,length.out = 5),degree=seq(1,100,length.out = 5))
svm.poly=tune(svm,mpg~.,data=Auto,ranges=params,kernel='polynomial')
svm.poly
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##   100      1
## 
## - best performance: 0.09874441
params=data.frame(cost=seq(0.05,100,length.out = 5),gamma=seq(0.1,100,length.out = 5))
svm.radial=tune(svm,mpg~.,data=Auto,ranges=params,kernel='radial')
svm.radial
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##     cost gamma
##  25.0375   0.1
## 
## - best performance: 0.06606846
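
Commenting on these results: the radial kernel achieves the lowest cross-validated error (about 6.6%), followed by the degree-1 polynomial (about 9.9%) and the linear kernel (about 10.3%). As a small supplementary step, the best performances can be collected from the three tune objects:

c(linear     = svm.tune$best.performance,
  polynomial = svm.poly$best.performance,
  radial     = svm.radial$best.performance)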

8

8a

set.seed(42)
train=sample(1:1070,800)
test=(1:1070)[-train]

tb=c()
res=c()

8b

require(ISLR)
require(e1071)

svm.fit=svm(Purchase~.,data=OJ,subset=train,cost=0.01,kernel='linear')
summary(svm.fit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "linear", 
##     subset = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  439
## 
##  ( 219 220 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

From the output of R’s summary function we can see that 439 observations are used as support vectors. Moreover, the support vectors are split almost equally between the two classes.
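
The same counts can be read off the fitted object directly (a supplementary check, using the index and nSV components of e1071's svm objects):

length(svm.fit$index)   # total number of support vectors
svm.fit$nSV             # how they split across the two classes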

8c

# train
svm.pred=predict(svm.fit,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 428 54
MM 78 240
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.165
res=cbind(res,'train'=mean(OJ$Purchase[train] != svm.pred))
# test
svm.pred=predict(svm.fit,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 150 21
MM 29 70
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.1851852
res=cbind(res,'test'=mean(OJ$Purchase[test] != svm.pred))

8d

svm.tune=tune(svm,Purchase~.,data=OJ[train,],ranges=data.frame(cost=seq(0.01,10,length.out=25)),kernel='linear')
summary(svm.tune)
## 
## Error estimation of 'svm' using 10-fold cross validation: 0.17375
res=cbind(res,'CV'=svm.tune$best.performance)

8e

svm.pred=predict(svm.tune$best.model,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 428 54
MM 78 240
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.165
res=cbind(res,'train.tuned'=mean(OJ$Purchase[train] != svm.pred))
svm.pred=predict(svm.tune$best.model,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 150 21
MM 29 70
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.1851852
res=cbind(res,'test.tuned'=mean(OJ$Purchase[test] != svm.pred))

tb=rbind(tb,res)
res=c()

8f

# repeat (b): radial kernel with default gamma and cost 0.01
svm.fit=svm(Purchase~.,data=OJ,subset=train,cost=0.01,kernel='radial')
summary(svm.fit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "radial", 
##     subset = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.01 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  638
## 
##  ( 318 320 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# train
svm.pred=predict(svm.fit,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 482 0
MM 318 0
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.3975
res=cbind(res,'train'=mean(OJ$Purchase[train] != svm.pred))


# test
svm.pred=predict(svm.fit,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 171 0
MM 99 0
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.3666667
res=cbind(res,'test'=mean(OJ$Purchase[test] != svm.pred))
svm.tune=tune(svm,Purchase~.,data=OJ[train,],ranges=data.frame(cost=seq(0.01,10,length.out=25)))   # default kernel is radial
summary(svm.tune)
## 
## Error estimation of 'svm' using 10-fold cross validation: 0.3975
res=cbind(res,'CV'=svm.tune$best.performance)
# train
svm.pred=predict(svm.tune$best.model,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 482 0
MM 318 0
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.3975
res=cbind(res,'train.tuned'=mean(OJ$Purchase[train] != svm.pred))


# test
svm.pred=predict(svm.tune$best.model,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 171 0
MM 99 0
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.3666667
res=cbind(res,'test.tuned'=mean(OJ$Purchase[test] != svm.pred))

tb=rbind(tb,res)
res=c()

8g

# repeat (b): polynomial kernel with cost 0.01
svm.fit=svm(Purchase~.,data=OJ,subset=train,cost=0.01,kernel='polynomial')
summary(svm.fit)
## 
## Call:
## svm(formula = Purchase ~ ., data = OJ, cost = 0.01, kernel = "polynomial", 
##     subset = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  0.01 
##      degree:  3 
##       gamma:  0.05555556 
##      coef.0:  0 
## 
## Number of Support Vectors:  634
## 
##  ( 315 319 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
# train
svm.pred=predict(svm.fit,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 475 7
MM 298 20
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.38125
res=cbind(res,'train'=mean(OJ$Purchase[train] != svm.pred))

# test
svm.pred=predict(svm.fit,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 171 0
MM 90 9
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.3333333
res=cbind(res,'test'=mean(OJ$Purchase[test] != svm.pred))
svm.tune=tune(svm,Purchase~.,data=OJ[train,],ranges=data.frame(cost=seq(0.01,10,length.out=25)),kernel='polynomial')
summary(svm.tune)
## 
## Error estimation of 'svm' using 10-fold cross validation: 0.38125
res=cbind(res,'CV'=svm.tune$best.performance)
# train
svm.pred=predict(svm.tune$best.model,OJ[train,])
kable(table(OJ[train,'Purchase'],svm.pred))
CH MM
CH 475 7
MM 298 20
mean(OJ$Purchase[train] != svm.pred)
## [1] 0.38125
res=cbind(res,'train.tuned'=mean(OJ$Purchase[train] != svm.pred))


# test
svm.pred=predict(svm.tune$best.model,OJ[test,])
kable(table(OJ[test,'Purchase'],svm.pred))
CH MM
CH 171 0
MM 90 9
mean(OJ$Purchase[test] != svm.pred)
## [1] 0.3333333
res=cbind(res,'test.tuned'=mean(OJ$Purchase[test] != svm.pred))

tb=rbind(tb,res)

8h

The linear SVM performs best on this data, attaining the lowest test error (about 18.5%). At cost 0.01 the radial and polynomial fits do little better than always predicting CH.

rownames(tb)=c('LINEAR','RADIAL','POLYNOMIAL')   # rows were added in the order linear, radial, polynomial
kable(tb)
train test CV train.tuned test.tuned
LINEAR 0.16500 0.1851852 0.17375 0.16500 0.1851852
RADIAL 0.39750 0.3666667 0.39750 0.39750 0.3666667
POLYNOMIAL 0.38125 0.3333333 0.38125 0.38125 0.3333333