Assigned Exercises 9.7 and 9.8

Chapter 8

Exercise 8.8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

Split the data set into a training set and a test set.

data(Carseats)
set.seed(1)
train<-sample(1:nrow(Carseats), nrow(Carseats)*0.5)

train.set<-Carseats[train,]
test.set<-Carseats[-train,]

Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

#tree.cs<-rpart(Sales~. , train.set, method = 'anova')
tree.cs<-tree(Sales~. , data=train.set)

summary(tree.cs)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train.set)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "CompPrice"  
## [6] "US"         
## Number of terminal nodes:  18 
## Residual mean deviance:  2.167 = 394.3 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.88200 -0.88200 -0.08712  0.00000  0.89590  4.09900
plot(tree.cs, type = c("proportional", "uniform"))
text(tree.cs, pretty = 0, cex=0.5)

#fancyRpartPlot(tree.cs, cex=0.5)

# Each node box displays the classification and the percentage of observations used at that node.
# The number at the top of each box is the average Sales value (the response to be predicted) for that split; e.g. for the top box the value shown equals mean(train.set$Sales).
yhat<-predict (tree.cs, newdata = test.set)
testMSE<-mean((yhat-test.set$Sales)^2)
testMSE 
## [1] 4.922039
#plotcp(tree.cs)

One of the most important indicators of Sales appears to be shelving location, since the first branch separates Good locations from Bad and Medium locations. Another very important predictor is Price, which appears in both branches of shelving location ({Good} and {Medium, Bad}).

Splitting on shelving location (cutpoint {Good} vs {Medium, Bad}) leads to the greatest possible reduction in RSS.
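As a reminder, the splitting criterion for regression trees (Equation 8.3 in the book) chooses the predictor \(X_j\) and cutpoint \(s\) that minimize the combined RSS of the two resulting regions,

\[
\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2 + \sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2 ,
\]

where \(\hat{y}_{R_1}\) and \(\hat{y}_{R_2}\) are the mean responses of the training observations falling in the two regions.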

The test MSE is 4.9220391.

Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

#set.seed(1)
cv.carseats<-cv.tree(tree.cs, FUN = prune.tree)
#Runs a K-fold cross-validation experiment to find the deviance (or the number of misclassifications, for classification problems) as a function of the cost-complexity parameter k.

# Alternative, with Caret package
# train.ctrl<-trainControl(method = "repeatedcv", number = 5 , repeats = 3)
# set.seed(1)
# rpart_tree<-train(Sales~. , data=train.set, method = "rpart", trControl = train.ctrl )



cv.carseats
## $size
##  [1] 18 17 16 15 14 13 12 11 10  8  7  6  5  4  3  2  1
## 
## $dev
##  [1]  831.3437  852.3639  868.6815  862.3400  862.3400  893.4641  911.2580
##  [8]  950.2691  955.2535 1039.1241 1066.6899 1125.0894 1205.5806 1273.2889
## [15] 1302.8607 1349.9273 1620.4687
## 
## $k
##  [1]      -Inf  16.99544  20.56322  25.01730  25.57104  28.01938  30.36962
##  [8]  31.56747  31.80816  40.75445  44.44673  52.57126  76.21881  99.59459
## [15] 116.69889 159.79501 337.60153
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
best.size<-cv.carseats$size[which.min(cv.carseats$dev)]

best.size
## [1] 18
data.frame(Size=cv.carseats$size, CVerror=cv.carseats$dev) %>% 
  ggplot(mapping=aes(x=Size,y=CVerror)) +
  geom_point(color="red") +
  geom_line() + scale_x_continuous(breaks=seq(0,20,1)) 

data.frame(Alpha=cv.carseats$k, CVerror=cv.carseats$dev) %>% 
  ggplot(mapping=aes(x=Alpha,y=CVerror)) +
  geom_point(color="Orange") +
  geom_line() + scale_x_continuous(breaks=seq(0,400,20)) 

#The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used (k, which corresponds to alpha in (8.4))

pruned.carseats<-prune.tree(tree.cs, best = best.size)
pruned.carseats %>% summary()
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train.set)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "CompPrice"  
## [6] "US"         
## Number of terminal nodes:  18 
## Residual mean deviance:  2.167 = 394.3 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.88200 -0.88200 -0.08712  0.00000  0.89590  4.09900
yhat.pruned<-predict(pruned.carseats, newdata= test.set)
MSE.pruned<-mean((test.set$Sales - yhat.pruned)^2)
MSE.pruned
## [1] 4.922039

Cross-validation determines that the best size (optimal level of tree complexity) is 18, and the corresponding test MSE is 4.9220391. The MSE has not changed, and neither has the number of terminal nodes, so no actual pruning took place.
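As a quick check (a sketch, using the objects already in the workspace), we can force some actual pruning, for example to 8 terminal nodes, the next candidate size reported in cv.carseats$size, and see how the test MSE reacts:

pruned.8<-prune.tree(tree.cs, best = 8) # prune the fitted tree to 8 terminal nodes
mean((predict(pruned.8, newdata = test.set) - test.set$Sales)^2) # test MSE of the smaller tree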

Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

set.seed (1)
p<-ncol(Carseats)-1 # number of predictors
bag.cs<-randomForest(Sales~.,data = train.set, mtry=p, importance =TRUE)

importance(bag.cs) %>% data.frame()
##               X.IncMSE IncNodePurity
## CompPrice   24.8888481    170.182937
## Income       4.7121131     91.264880
## Advertising 12.7692401     97.164338
## Population  -1.8074075     58.244596
## Price       56.3326252    502.903407
## ShelveLoc   48.8886689    380.032715
## Age         17.7275460    157.846774
## Education    0.5962186     44.598731
## Urban        0.1728373      9.822082
## US           4.2172102     18.073863
# %IncMSE is based upon the mean decrease of accuracy in predictions on the out of bag samples when a given variable is excluded from the model.

# IncNodePurity is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees. In the case of regression trees, the node impurity is measured by the training RSS


varImpPlot(bag.cs, main = "Importance")

yhat.bag<-predict(bag.cs, newdata = test.set)
mse.bag<-mean((yhat.bag - test.set$Sales)^2) # test MSE (mean of squared prediction errors)
mse.bag
## [1] 0.2656695

Price and ShelveLoc are by far the most important variables. The bagging approach leads to a much lower test MSE than the single regression tree.

Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

set.seed (1)
m.rf<-round(sqrt(p))
rf.cs<-randomForest(Sales~.,data = train.set, mtry=m.rf, importance =TRUE)

importance(rf.cs) %>% data.frame()
##               X.IncMSE IncNodePurity
## CompPrice   14.8840765     158.82956
## Income       4.3293950     125.64850
## Advertising  8.2215192     107.51700
## Population  -0.9488134      97.06024
## Price       34.9793386     385.93142
## ShelveLoc   34.9248499     298.54210
## Age         14.3055912     178.42061
## Education    1.3117842      70.49202
## Urban       -1.2680807      17.39986
## US           6.1139696      33.98963
varImpPlot(rf.cs, main = "Importance")

yhat.rf<-predict(rf.cs, newdata = test.set)
mse.rf<-mean((yhat.rf - test.set$Sales)^2) # test MSE (mean of squared prediction errors)
mse.rf
## [1] 0.2402389

The test MSE from the random forest is slightly lower than the one obtained with bagging. Price and ShelveLoc are still the most important variables; however, Age has gained importance relative to the other variables. At each split the random forest considers m = 3 randomly chosen variables instead of the full set of predictors used by bagging (see the sketch below).
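To make the effect of m concrete, here is a minimal sketch (assuming train.set, test.set and p from the chunks above) that refits the random forest over a grid of mtry values and records the test MSE for each:

set.seed(1)
mse.by.m<-sapply(1:p, function(m){
  fit<-randomForest(Sales~., data = train.set, mtry = m) # m variables tried at each split
  mean((predict(fit, newdata = test.set) - test.set$Sales)^2) # test MSE for this m
})
data.frame(m = 1:p, test.MSE = mse.by.m)

Typically the test error drops quickly as m grows from very small values and then levels off; at m = p the model coincides with bagging.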

Exercise 8.9

This problem involves the OJ data set which is part of the ISLR package

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

data(OJ)

set.seed(1)
train<-sample(1:nrow(OJ), 800)

train.set<-OJ[train,]
test.set<-OJ[-train,]

Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

#tree.cs<-rpart(Sales~. , train.set, method = 'anova')
tree.oj<-tree(Purchase~. , data=train.set) 
p<-ncol(OJ)-1
#?OJ

# Purchase: A factor with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice

#head
summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train.set)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800

The tree has 9 terminal nodes. The model uses the following variables:

LoyalCH, PriceDiff, SpecialCH, ListPriceDiff, PctDiscMM

The training error rate (the proportion of misclassified training observations) is 0.15875.

Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

tree.oj
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1073.00 CH ( 0.60625 0.39375 )  
##    2) LoyalCH < 0.5036 365  441.60 MM ( 0.29315 0.70685 )  
##      4) LoyalCH < 0.280875 177  140.50 MM ( 0.13559 0.86441 )  
##        8) LoyalCH < 0.0356415 59   10.14 MM ( 0.01695 0.98305 ) *
##        9) LoyalCH > 0.0356415 118  116.40 MM ( 0.19492 0.80508 ) *
##      5) LoyalCH > 0.280875 188  258.00 MM ( 0.44149 0.55851 )  
##       10) PriceDiff < 0.05 79   84.79 MM ( 0.22785 0.77215 )  
##         20) SpecialCH < 0.5 64   51.98 MM ( 0.14062 0.85938 ) *
##         21) SpecialCH > 0.5 15   20.19 CH ( 0.60000 0.40000 ) *
##       11) PriceDiff > 0.05 109  147.00 CH ( 0.59633 0.40367 ) *
##    3) LoyalCH > 0.5036 435  337.90 CH ( 0.86897 0.13103 )  
##      6) LoyalCH < 0.764572 174  201.00 CH ( 0.73563 0.26437 )  
##       12) ListPriceDiff < 0.235 72   99.81 MM ( 0.50000 0.50000 )  
##         24) PctDiscMM < 0.196197 55   73.14 CH ( 0.61818 0.38182 ) *
##         25) PctDiscMM > 0.196197 17   12.32 MM ( 0.11765 0.88235 ) *
##       13) ListPriceDiff > 0.235 102   65.43 CH ( 0.90196 0.09804 ) *
##      7) LoyalCH > 0.764572 261   91.20 CH ( 0.95785 0.04215 ) *

We choose to analyze the node labeled "3)". For each node the printout reports the following information:

node), split, n, deviance, yval, (yprob) * denotes terminal node

This node is not a terminal node, since it is not marked with *. The splitting variable at node 3 is LoyalCH and the corresponding split value is 0.5036. There are 435 observations in the subtree below this node, and the deviance of the observations in the region below it is 337.90. Next to the deviance we find the prediction for this node, which is CH: about 87% of the observations in this region have Purchase equal to CH, while the remaining 13% have MM.

Create a plot of the tree, and interpret the results.

plot(tree.oj, type = c("proportional", "uniform"))
text(tree.oj, pretty = 0, cex=0.7)

LoyalCH appears in several nodes of the tree, including the ones at the top. This indicates that it is one of the most important predictors.

We can follow a specific path in the tree and check what’s the related prediction.

For example: if LoyalCH is lower than 0.5036 but greater than 0.280875, and PriceDiff is greater than 0.05, then the model predicts that Purchase will be CH.

Summarized:

If {0.280875 < LoyalCH < 0.5036 & PriceDiff > 0.05} then Purchase = CH.

Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

yhat<-predict(tree.oj, newdata = test.set, type = "class")
table(test.set$Purchase,yhat) %>% kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>%
  add_header_above(c(" ", "Prediction" = 2))
               Prediction
                CH    MM
Actual    CH   160     8
          MM    38    64
test.er<-mean(yhat!=test.set$Purchase)
#(8+38)/(160+8+38+64)

Test error rate = 0.1703704. It corresponds to the proportion of misclassified observations in the test set.

Apply the cv.tree() function to the training set in order to determine the optimal tree size.

cv.oj<-cv.tree(tree.oj, FUN = prune.tree)
#Runs a K-fold cross-validation experiment to find the deviance (or the number of misclassifications, for classification problems) as a function of the cost-complexity parameter k.

cv.oj
## $size
## [1] 9 8 7 6 5 4 3 2 1
## 
## $dev
## [1]  685.6493  698.8799  702.8083  702.8083  714.1093  725.4734  780.2099
## [8]  790.0301 1074.2062
## 
## $k
## [1]      -Inf  12.62207  13.94616  14.35384  26.21539  35.74964  43.07317
## [8]  45.67120 293.15784
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
best.size<-cv.oj$size[which.min(cv.oj$dev)]

best.size
## [1] 9

A tree size of 9 returns the lowest cross-validation error.

Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

data.frame(Size=cv.oj$size, CVerror=cv.oj$dev) %>% 
  ggplot(mapping=aes(x=Size,y=CVerror)) +
  geom_point(color="red") +
  geom_line() + scale_x_continuous(breaks=seq(0,10,1)) 

#data.frame(Alpha=cv.oj$k, CVerror=cv.oj$dev) %>% 
  #ggplot(mapping=aes(x=Alpha,y=CVerror)) +
  #geom_point(color="Orange") +
  #geom_line() + scale_x_continuous(breaks=seq(0,300,20)) 

A tree size of 9 returns the lowest cross-validation error. However, from the plot we can see that tree sizes of 6 and 7 have comparable cross-validation errors.
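Since a 7-node tree attains almost the same CV error, here is a quick sketch (using the objects defined above) of its test performance:

pruned.oj7<-prune.tree(tree.oj, best = 7) # prune to 7 terminal nodes
mean(predict(pruned.oj7, newdata = test.set, type = "class") != test.set$Purchase) # test error rate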

Which tree size corresponds to the lowest cross-validated classification error rate?

best.size<-cv.oj$size[which.min(cv.oj$dev)]

best.size
## [1] 9

A tree size of 9 returns the lowest cross-validation error.

Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

pruned.oj<-prune.tree(tree.oj, best = 5)

Compare the training error rates between the pruned and unpruned trees. Which is higher?

pruned.oj %>% summary()
## 
## Classification tree:
## snip.tree(tree = tree.oj, nodes = c(4L, 12L, 5L))
## Variables actually used in tree construction:
## [1] "LoyalCH"       "ListPriceDiff"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.8239 = 655 / 795 
## Misclassification error rate: 0.205 = 164 / 800
tree.oj %>% summary()
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train.set)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800

Misclassification error (training error) for the unpruned tree: 0.15875

Misclassification error (training error) for the pruned tree: 0.205

The unpruned tree has a lower training misclassification error than the pruned version.

Compare the test error rates between the pruned and unpruned trees. Which is higher?

yhat.pruned<-predict(pruned.oj, newdata = test.set, type = "class")

test.er.pruned<-mean(yhat.pruned!=test.set$Purchase)

test.er.pruned
## [1] 0.1925926

For the pruned tree the test error rate is 0.1925926, while for the unpruned tree it is 0.1703704.

The unpruned tree returns a lower test error rate.

Exercise 8.10

We now use boosting to predict Salary in the Hitters data set.

Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

data("Hitters")
hit<-Hitters %>% drop_na("Salary")
hit$Salary<-log(hit$Salary)

Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations

set.seed(1)
train<-1:200

train.set<-hit[train,]
test.set<-hit[-train,]

Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

lambda<-seq(from=0.001, to= 0.05, by= 0.001)
B<-1000
train.mse<-rep(NA,length(lambda))
test.mse<-rep(NA,length(lambda))
for (i in seq_along(lambda)){
  boost.hit<-gbm(Salary~.,data=train.set,
               distribution = "gaussian", n.trees = B,
               shrinkage = lambda[i])
  train.pred = predict(boost.hit, newdata = train.set, n.trees = B)
  test.pred = predict(boost.hit, newdata = test.set, n.trees = B)
  train.mse[i] = mean((train.set$Salary - train.pred)^2)
  test.mse[i] = mean((test.set$Salary - test.pred)^2)
}

data.frame(Lambda = lambda, train.mse = train.mse) %>%
  ggplot(mapping = aes(x = Lambda, y = train.mse)) +
  geom_point(color = "red") +
  geom_line() +
  labs(y = "Training set MSE")

Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

data.frame(Lambda = lambda, test.mse = test.mse) %>%
  ggplot(mapping = aes(x = Lambda, y = test.mse)) +
  geom_point(color = "blue") +
  geom_line() +
  labs(y = "Test set MSE")

e)

Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

# linear regression
lm.hit<-lm(Salary ~ ., data = train.set)
lm.pred<-predict(lm.hit, newdata = test.set)
lm.test.mse<-mean((test.set$Salary - lm.pred)^2)

# lasso
set.seed(1)
x.train<-model.matrix(Salary ~ ., data = train.set)
y.train<-train.set$Salary
x.test<-model.matrix(Salary ~ ., data = test.set)
lasso.hit<-glmnet::glmnet(x.train, y.train, alpha = 1)
Lcv.out<-glmnet::cv.glmnet (x.train, y.train, alpha =1)
bestlam.L<-Lcv.out$lambda.min
bestlam.L
## [1] 0.00346633
lasso.pred<-predict(lasso.hit, s = bestlam.L, newx = x.test)
lasso.test.mse<-mean((test.set$Salary - lasso.pred)^2)

lm.test.mse
## [1] 0.4917959
lasso.test.mse
## [1] 0.4706337
boosting.test.mse<-min(test.mse)
boosting.test.mse
## [1] 0.255546

Boosting yields a lower test MSE (0.255546) than linear regression (0.4917959) and the lasso (0.4706337).

Which variables appear to be the most important predictors in the boosted model?

boost.best<-gbm(Salary ~ ., data = train.set, distribution = "gaussian", 
    n.trees = B, shrinkage = lambda[which.min(test.mse)])

summary(boost.best)

##                 var    rel.inf
## CAtBat       CAtBat 14.6821030
## CHits         CHits 13.1718861
## CRuns         CRuns  9.2318785
## CWalks       CWalks  8.1675269
## PutOuts     PutOuts  7.7964606
## CRBI           CRBI  7.4968907
## Years         Years  6.5191419
## Walks         Walks  5.9508931
## CHmRun       CHmRun  5.6988413
## AtBat         AtBat  4.5363925
## RBI             RBI  3.5439356
## Assists     Assists  3.2086097
## Hits           Hits  2.5620480
## HmRun         HmRun  2.5547074
## Errors       Errors  2.0569793
## Runs           Runs  1.5230104
## Division   Division  0.6218924
## NewLeague NewLeague  0.4610069
## League       League  0.2157957

CAtBat is the most important predictor, followed by CHits and CRuns.
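As an extra illustration (a sketch using gbm's partial dependence plot), we can look at the marginal effect of the top predictor on log-Salary:

plot(boost.best, i = "CAtBat") # partial dependence of log(Salary) on CAtBat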

Now apply bagging to the training set. What is the test set MSE for this approach?

set.seed(1)
p<-ncol(hit)-1
bag.hit<-randomForest(Salary ~ ., data = train.set, mtry = p)
bag.pred<-predict(bag.hit, newdata = test.set)
bagging.test.mse<-mean((test.set$Salary - bag.pred)^2)
bagging.test.mse
## [1] 0.2299324

The bagging method returns a slightly lower test MSE (0.2299324) than boosting (0.255546).

Exercise 8.12

Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?

data(College)

#response: Apps, number of application received
set.seed(1)
train=sample(1:nrow(College) , 0.8*nrow(College))

train.set<-College[train,-3] # drop the 3rd column (Accept)
test.set<-College[-train,-3]
# view(train.set) # interactive viewer; not needed when knitting
is.na(College) %>% sum()
## [1] 0

We will use the College data set with Apps (number of applications received) as the response variable.

Linear Regression

lm.college<-lm(Apps ~ ., data = train.set)
lm.pred<-predict(lm.college, newdata = test.set)
lm.test.mse<-mean((test.set$Apps - lm.pred)^2)
lm.test.mse
## [1] 3045265

Boosting

B<-3000
boost.college<-gbm(Apps~. ,data = train.set,
                   distribution = "gaussian", n.trees = B,
                   shrinkage = 0.001,
                   interaction.depth = 2)
test.pred<-predict(boost.college, newdata = test.set, n.trees = B)
boosting.test.mse<-mean((test.set$Apps - test.pred)^2)
boosting.test.mse
## [1] 2541130

Boosting with Caret

set.seed(1)
train.ctrl<-trainControl(method = "repeatedcv", number = 5 , repeats = 3)
gbm.caret<-train(Apps~. ,data = train.set, method = "gbm", distribution = "gaussian", trControl = train.ctrl, verbose = F)

#getModelInfo()$gbm$parameters

myGrid<-expand.grid(n.trees = seq(1000,5000, by=1000),
                    interaction.depth = seq(1,5, by=1),
                    shrinkage = c(0.001, 0.01, 0.02, 0.03),
                    n.minobsinnode = 10)
set.seed(1)
gbm.caret.tune<-train(Apps~ ., data = train.set, method = "gbm", distribution = "gaussian", trControl = train.ctrl, verbose = F, tuneGrid=myGrid)

myGrid.best<-gbm.caret.tune$bestTune

set.seed(1)
gbm.caret.best<-train(Apps~ ., data=train.set, method = "gbm", distribution="gaussian", trControl = train.ctrl, verbose = F, tuneGrid=myGrid.best)


test.pred<-predict(gbm.caret.best, newdata = test.set)
boostcaret.test.mse<-mean((test.set$Apps - test.pred)^2)
boostcaret.test.mse
## [1] 1942572

Bagging

p<-ncol(train.set)-1
bag.college<-randomForest(Apps ~ ., data = train.set, mtry = p)
bag.pred<-predict(bag.college, newdata = test.set)
bagging.test.mse<-mean((test.set$Apps - bag.pred)^2)
bagging.test.mse
## [1] 3875409

Random Forest

set.seed(1)
m.rf<-round(sqrt(p))
rf.college<-randomForest(Apps ~ ., data = train.set, mtry = m.rf)
rf.pred<-predict(rf.college, newdata = test.set)
rf.test.mse<-mean((test.set$Apps - rf.pred)^2)
rf.test.mse
## [1] 2014468
a<-rbind(Linear_Reg = lm.test.mse,
         Boosting = boosting.test.mse,
         Boosting_caret=boostcaret.test.mse,
         Bagging = bagging.test.mse,
         Random_Forest = rf.test.mse) %>% data.frame()
colnames(a)<-"Prediction_error"
  
kable(a) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Model             Prediction_error
Linear_Reg                 3045265
Boosting                   2541130
Boosting_caret             1942572
Bagging                    3875409
Random_Forest              2014468
bestmodel<-subset(a, a$Prediction_error==min(a$Prediction_error)) %>% rownames()

Boosting_caret returns the lowest test MSE.

Chapter 9

Exercise 9.4

Generate a simulated two-class data set with 100 observations and two features in which there is a visible but non-linear separation between the two classes.

Show that in this setting, a support vector machine with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector classifier on the training data.

Which technique performs best on the test data? Make plots and report training and test error rates in order to back up your assertions.

Data

n<-100
set.seed(3)
x1<-rnorm(n)
x2<-x1 + 4*x1^2 + 1 + rnorm(n)
A<-sample(1:n, n/2) 
B<- -A
x2[A]<-x2[A] + 2 # shift the points in order to make 2 distinguishable areas
x2[B]<-x2[B] - 2

df<-data.frame(x1,x2) %>% mutate(type = NA)
df$type[A]<-"A"
df$type[B]<-"B"
df$type<-as.factor(df$type)
# Plot using different colors
ggplot(df) + geom_point(mapping = aes(x=x1, y=x2, col=type))

Support Vector Classifier - Linear Kernel

set.seed(5)
#train<-df %>% group_by(type) %>% sample_n(n/4)

train<-sample(1:n,n/2)
train.set<-df[train,]
test.set<-df[-train,]
svc.fit<-svm(type~., data=train.set , kernel = "linear", cost =10, scale =FALSE)
# The decision boundary between the two classes is linear (because we used the argument kernel="linear")
# If we use a  smaller value of the cost parameter, we obtain a larger number of support vectors, because the margin is now wider.
# crossvalidation with tune() can help us finding the optimal value for cost
plot(svc.fit, data=train.set)

summary(svc.fit)
## 
## Call:
## svm(formula = type ~ ., data = train.set, kernel = "linear", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  31
## 
##  ( 16 15 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  A B

Training Error & Test Error - Support Vector Classifier

lin.trn.er<-mean(svc.fit$fitted != train.set$type) # training error
lin.trn.er
## [1] 0.2
lin.tst.er<-mean(predict(svc.fit, test.set) != test.set$type) # test error
lin.tst.er
## [1] 0.24

Support Vector Machine - Polynomial Kernel

svm.fit<-svm(type~., data=train.set , kernel = "polynomial", cost =10,scale =FALSE)
plot(svm.fit, data=train.set)

summary(svm.fit)
## 
## Call:
## svm(formula = type ~ ., data = train.set, kernel = "polynomial", 
##     cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  10 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  32
## 
##  ( 15 17 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  A B

Training Error & Test Error- Support Vector Machine (Polynomial)

pol.trn.er<-mean(svm.fit$fitted != train.set$type) # Training error
pol.trn.er
## [1] 0.24
pol.tst.er<-mean(predict(svm.fit, test.set) != test.set$type) # Test error
pol.tst.er
## [1] 0.38

Support Vector Machine - Radial Kernel

svm.fit.r<-svm(type~., data=train.set, kernel = "radial", cost=10)
plot(svm.fit.r, data=train.set)

summary(svm.fit.r)
## 
## Call:
## svm(formula = type ~ ., data = train.set, kernel = "radial", cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  15
## 
##  ( 7 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  A B

Training Error & Test Error- Support Vector Machine (Radial)

rad.trn.er<-mean(svm.fit.r$fitted != train.set$type) # Training error
rad.trn.er
## [1] 0
rad.tst.er<-mean(predict(svm.fit.r, test.set) != test.set$type) # Test error
rad.tst.er
## [1] 0.18
s<-rbind(Linear = lin.tst.er,
         Polynomial = pol.tst.er,
         Radial = rad.tst.er) %>% data.frame() %>% 
  cbind(c(lin.trn.er,pol.trn.er,rad.trn.er))

colnames(s)<-c("Test_error","Train_error")
  
kable(s,  caption = "Kernel") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Kernel       Test_error   Train_error
Linear             0.24          0.20
Polynomial         0.38          0.24
Radial             0.18          0.00
bestmodel<-subset(s, s$Test_error==min(s$Test_error)) %>% rownames()

bestmodel
## [1] "Radial"

The table summarizes the performance of the three models in terms of training and test error.

The radial kernel returns the lowest test error, meaning that the rate of misclassified observations is lowest when using the radial kernel. The same can be seen in the plots: the boundary produced by the radial kernel follows much more closely the split between the two classes, which was generated from a specific non-linear equation.

Exercise 9.5

We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.

Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:

set.seed(123)
x1=runif(500) -0.5
x2=runif(500) -0.5
y=1*(x1^2-x2^2 > 0)

Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the yaxis.

df<-data.frame(x1,x2,y=as.factor(y))
# Plot using different colors
df %>% ggplot(mapping = aes(x=x1, y=x2, color=factor(y))) + geom_point() + labs(color ="Class")

Fit a logistic regression model to the data, using X1 and X2 as predictors.

log.fit<-glm(y~x1+x2, family = binomial)
log.fit %>% summary
## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.227  -1.200   1.133   1.157   1.188  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.04792    0.08949   0.535    0.592
## x1          -0.03999    0.31516  -0.127    0.899
## x2           0.11509    0.30829   0.373    0.709
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 692.86  on 499  degrees of freedom
## Residual deviance: 692.71  on 497  degrees of freedom
## AIC: 698.71
## 
## Number of Fisher Scoring iterations: 3

Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.

log.prob<-predict(log.fit, newdata=df, type = "response")
log.pred<-ifelse(log.prob > 0.50, 1, 0)
mean(log.pred == df$y) # proportion of correctly classified observations
## [1] 0.574
df %>% cbind(log.pred) %>%
  ggplot(mapping = aes(x=x1,y=x2,color=factor(log.pred))) +
  geom_point() + labs(color="Class")

Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. \(X_1^2\) , \(X_1X_2\), \(log(X_2)\), and so forth).

logp.fit<-glm(y ~ poly(x1,2) + I(log(abs(x2)))+ I(x1 * x2), data = df, family = binomial)

Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.

set.seed(123)
train<-sample(1:nrow(df), nrow(df)*0.5)

train.set<-df[train,]
test.set<-df[-train,]

logp.prob<-predict(logp.fit, newdata=train.set, type = "response")
logp.pred<-ifelse(logp.prob > 0.50, 1, 0)
mean(logp.pred != train.set$y)
## [1] 0.116
train.set %>% cbind(logp.pred) %>%
  ggplot(mapping = aes(x=x1,y=x2,color=factor(logp.pred))) +
  geom_point() + labs(color="Class")

logp.prob.ts<-predict(logp.fit, newdata=test.set, type = "response")
logp.pred.ts<-ifelse(logp.prob.ts > 0.50, 1, 0)
test.er.logp<-mean(logp.pred.ts != test.set$y)
test.er.logp
## [1] 0.08

Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

svc.fit<-svm(y~., data=train.set , kernel = "linear", cost =10, scale =FALSE)

plot(svc.fit, data=train.set)

summary(svc.fit)
## 
## Call:
## svm(formula = y ~ ., data = train.set, kernel = "linear", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  237
## 
##  ( 118 119 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
tst.er.lin<-mean(predict(svc.fit, test.set) != test.set$y) # Test error
tst.er.lin
## [1] 0.472

Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.

Polynomial Kernel

svm.fit.pol<-svm(y~., data=train.set , kernel = "polynomial", degree=2, cost =10, scale =FALSE)

plot(svm.fit.pol, data=train.set)

summary(svm.fit.pol)
## 
## Call:
## svm(formula = y ~ ., data = train.set, kernel = "polynomial", degree = 2, 
##     cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  10 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  163
## 
##  ( 82 81 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
tst.er.pol<-mean(predict(svm.fit.pol, test.set) != test.set$y) # Test error
tst.er.pol
## [1] 0.044

Radial Kernel

svm.fit.rad<-svm(y~., data=train.set , kernel = "radial",cost =10, scale =FALSE)

plot(svm.fit.rad, data=train.set)

summary(svm.fit.rad)
## 
## Call:
## svm(formula = y ~ ., data = train.set, kernel = "radial", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  142
## 
##  ( 71 71 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
tst.er.rad<-mean(predict(svm.fit.rad, test.set) != test.set$y) # Test error
tst.er.rad
## [1] 0.072

Comment on your results.

The plots above show that an SVM with a non-linear kernel and logistic regression with transformed predictors both do a good job when the boundary between the two classes is non-linear. The boundaries found by the SVM and by the logistic regression are quite similar. However, the main advantage of the SVM is that it has far fewer tuning options (gamma or polynomial degree, plus cost) compared to the logistic regression, whose tuning options are as numerous as the possible transformations of the variables.

Furthermore, the optimal tuning parameters for the SVM can be selected through cross-validation, as sketched below.
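For example, a minimal sketch (the parameter grids are only illustrative) of selecting cost and gamma for the radial kernel on this training set with 10-fold CV via tune():

set.seed(123)
tune.rad<-tune(svm, y~., data = train.set, kernel = "radial",
               ranges = list(cost = c(0.1, 1, 10, 100),
                             gamma = c(0.5, 1, 2, 4)))
summary(tune.rad)$best.parameters # cost/gamma pair with the lowest CV error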

Exercise 9.7 - Assigned Exercise

In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.

Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.

data(Auto)
Auto<-Auto %>% mutate(mpg.high = ifelse(mpg>median(mpg),1,0)) %>% 
  mutate(mpg.high=factor(mpg.high), cylinders=factor(cylinders))

Auto<-Auto %>% dplyr::select(-mpg,-name)
Auto %>% sample_n(6)
##   cylinders displacement horsepower weight acceleration year origin mpg.high
## 1         4           97         60   1834         19.0   71      2        1
## 2         4           79         58   1825         18.6   77      2        1
## 3         8          350        175   4100         13.0   73      1        0
## 4         6          258        110   2962         13.5   71      1        0
## 5         4           90         75   2108         15.5   74      2        1
## 6         4          134         96   2702         13.5   75      3        1

Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.

set.seed(123)

tune.out<-tune(svm, mpg.high ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(1,5,10,25)))

summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     5
## 
## - best performance: 0.07865385 
## 
## - Detailed performance results:
##   cost      error dispersion
## 1    1 0.08641026 0.04893284
## 2    5 0.07865385 0.05317556
## 3   10 0.07865385 0.05317556
## 4   25 0.07865385 0.05317556
#summary(tune.out)$performances
#summary(tune.out)$best.model

best.par<-summary(tune.out)$best.parameters
best.per<-summary(tune.out)$best.performance

10-fold cross-validation selects a cost of 5. The corresponding CV error is 0.0786538.

Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.

Radial

set.seed(123)

tune.out.rad<-tune(svm, mpg.high ~ ., data = Auto, kernel = "radial",
                   ranges = list(cost = c(1,5,10,25), 
                                 gamma = 2^(-3:3)))

summary(tune.out.rad)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     5  0.25
## 
## - best performance: 0.06096154 
## 
## - Detailed performance results:
##    cost gamma      error dispersion
## 1     1 0.125 0.09160256 0.04304211
## 2     5 0.125 0.07621795 0.04867524
## 3    10 0.125 0.07365385 0.04480097
## 4    25 0.125 0.08384615 0.05014207
## 5     1 0.250 0.07634615 0.03557092
## 6     5 0.250 0.06096154 0.04311842
## 7    10 0.250 0.06615385 0.05089670
## 8    25 0.250 0.08397436 0.06066946
## 9     1 0.500 0.07371795 0.03988464
## 10    5 0.500 0.07365385 0.05975763
## 11   10 0.500 0.07871795 0.05459458
## 12   25 0.500 0.09416667 0.04722024
## 13    1 1.000 0.07628205 0.04241762
## 14    5 1.000 0.08391026 0.05149746
## 15   10 1.000 0.08647436 0.05174863
## 16   25 1.000 0.08903846 0.05324797
## 17    1 2.000 0.08397436 0.04088294
## 18    5 2.000 0.09679487 0.04065897
## 19   10 2.000 0.09423077 0.04280336
## 20   25 2.000 0.09935897 0.04185417
## 21    1 4.000 0.09929487 0.04667955
## 22    5 4.000 0.09679487 0.03882073
## 23   10 4.000 0.09679487 0.04241762
## 24   25 4.000 0.09935897 0.04007080
## 25    1 8.000 0.12743590 0.06693855
## 26    5 8.000 0.12493590 0.07469619
## 27   10 8.000 0.12493590 0.07469619
## 28   25 8.000 0.12493590 0.07469619
#summary(tune.out.rad)$performances
#summary(tune.out.rad)$best.model
best.par.rad<-summary(tune.out.rad)$best.parameters
best.per.rad<-summary(tune.out.rad)$best.performance

With the radial kernel, 10-fold cross-validation selects a cost of 5 and a gamma of 0.25. The corresponding CV error is 0.0609615.

Polynomial

set.seed(123)

tune.out.pol<-tune(svm, mpg.high ~ ., data = Auto, kernel = "polynomial",
                   ranges = list(cost = c(1,5,10,25),
                                 degree=c(2,3,4)))

summary(tune.out.pol)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost degree
##     5      3
## 
## - best performance: 0.06096154 
## 
## - Detailed performance results:
##    cost degree      error dispersion
## 1     1      2 0.09410256 0.04575425
## 2     5      2 0.08134615 0.04550073
## 3    10      2 0.07884615 0.04176353
## 4    25      2 0.07108974 0.05400402
## 5     1      3 0.09929487 0.04011231
## 6     5      3 0.06096154 0.04478058
## 7    10      3 0.06096154 0.04778157
## 8    25      3 0.06608974 0.04935338
## 9     1      4 0.14275641 0.08367018
## 10    5      4 0.10692308 0.05017119
## 11   10      4 0.10685897 0.04978224
## 12   25      4 0.08897436 0.04753954
#summary(tune.out.pol)$performances
#summary(tune.out.pol)$best.model

best.par.pol<-summary(tune.out.pol)$best.parameters
best.per.pol<-summary(tune.out.pol)$best.performance

With the polynomial kernel, 10-fold cross-validation selects a cost of 5 and a degree of 3. The corresponding CV error is 0.0609615.

Make some plots to back up your assertions in (b) and (c).

summary(tune.out)$performance %>% ggplot(mapping=aes(x=cost,y=error)) + geom_point() + geom_line()

summary(tune.out.pol)$performance %>% ggplot(mapping=aes(x=cost,y=error, col=factor(degree))) + geom_point() + geom_line() + labs(color="Degree")

summary(tune.out.rad)$performance %>% ggplot(mapping=aes(x=cost,y=error, col=factor(gamma))) + geom_point() + geom_line() + labs(color="Gamma")

svm.linear<-svm(mpg.high ~ ., data = Auto, kernel = "linear",
                cost = best.par[1,1])

svm.poly<-svm(mpg.high ~ ., data = Auto, kernel = "polynomial",
              cost = best.par.pol[1,1], degree = best.par.pol[1,2])

svm.radial<-svm(mpg.high ~ ., data = Auto, kernel = "radial",
                cost = best.par.rad[1,1], gamma = best.par.rad[1,2])

The plots above show how the error (the estimated prediction error based on 10-fold cross-validation) varies with different values of the tuning parameters: cost, gamma and degree.

The support vector classifier with a linear kernel yields its lowest error with a cost of 5 or higher.

The SVM with a polynomial kernel yields its lowest error with degree equal to 3, across all cost values.

The SVM with a radial kernel yields its lowest error with gamma equal to 0.25 and cost equal to 5. However, as cost increases the curves (one per gamma value) tend to converge to similar error values.

# plot(svm.linear, Auto, horsepower~year)

name.plot<-names(Auto)[!(names(Auto) %in% c("cylinders","mpg.high","origin"))]
choose(length(name.plot),2) # number of pairwise plots for each model
## [1] 10
name.plot
## [1] "displacement" "horsepower"   "weight"       "acceleration" "year"
plot(svm.linear, Auto, displacement~horsepower)

plot(svm.linear, Auto, displacement~weight)

plot(svm.linear, Auto, displacement~acceleration)

plot(svm.linear, Auto, displacement~year)

plot(svm.linear, Auto, weight~year) #

plot(svm.linear, Auto, weight~acceleration)

plot(svm.linear, Auto, weight~horsepower)

plot(svm.linear, Auto, acceleration~horsepower)

plot(svm.linear, Auto, acceleration~year)

plot(svm.linear, Auto, year~horsepower)

Extra Question + Answer

Explain how the cost parameter works. What happens when we increase it?

Answer:

The cost parameter allows us to specify how "expensive" it is to misclassify an observation or violate the margin, where the margin is the distance from the decision surface to the closest data points. The support vector classifier, in order to make the model more robust to individual observations/outliers, allows some training data to be misclassified. In the optimization problem this characteristic is expressed through the following constraint.

\(\sum_{i=1}^n \epsilon_i \leq C\)

Here we are setting a non-negative "budget" \(C\), where the \(\epsilon_i\) are called slack variables because they allow individual observations to be on the wrong side of the margin or of the hyperplane. So \(C\) puts a cap on the misclassifications/violations.

When we code this in R we do not set a budget for misclassifications/margin violations. Instead we specify how "expensive" it is to misclassify a training observation or violate the margin. The higher the cost parameter, the smaller the margin and the fewer misclassifications/violations in the training set, since each one is "expensive" in terms of the optimization. This also leads to a decrease in the number of support vectors.

The same logic can be extended to support vector machines.
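A small sketch on the Auto data from Exercise 9.7 (the two cost values are arbitrary) illustrating that increasing the cost reduces the number of support vectors:

svm.lowC<-svm(mpg.high~., data = Auto, kernel = "linear", cost = 0.01) # cheap violations -> wide margin
svm.highC<-svm(mpg.high~., data = Auto, kernel = "linear", cost = 100) # expensive violations -> narrow margin
c(cost_0.01 = svm.lowC$tot.nSV, cost_100 = svm.highC$tot.nSV) # number of support vectors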

Exercise 9.8 - Assigned Exercise

This problem involves the OJ data set which is part of the ISLR package.

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

data(OJ)

set.seed(777)
train<-sample(1:nrow(OJ),800)
train.set<-OJ[train,]
test.set<-OJ[-train,]

Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

svc.fit<-svm(Purchase~., data=train.set, kernel= "linear", cost=0.01, scale=F)
summary(svc.fit)
## 
## Call:
## svm(formula = Purchase ~ ., data = train.set, kernel = "linear", 
##     cost = 0.01, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
## 
## Number of Support Vectors:  618
## 
##  ( 310 308 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM

Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. These observations do affect the support vector classifier.

In our case 618 of the 800 training observations are used as support vectors: 310 belong to class CH and 308 to class MM. The proportion of support vectors is quite high. This is a direct consequence of the cost parameter, which tells the model how "expensive" it is to misclassify an observation or violate the margin. Here the cost is very low, so it makes sense that almost 80% of the observations are support vectors.

What are the training and test error rates?

lin.trn.er<-mean(svc.fit$fitted != train.set$Purchase) # training error

lin.trn.er
## [1] 0.24
lin.tst.er<-mean(predict(svc.fit, test.set) != test.set$Purchase) # test error

lin.tst.er
## [1] 0.262963

Training error = 0.24

Test error = 0.262963

Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

set.seed(4234)

tune.out<-tune(svm, Purchase~., data=train.set, kernel = "linear",
               ranges = list(cost = 10^seq(-2,1, by=0.5)))
               

tune.out %>% summary()          
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      cost
##  3.162278
## 
## - best performance: 0.165 
## 
## - Detailed performance results:
##          cost   error dispersion
## 1  0.01000000 0.17750 0.02934469
## 2  0.03162278 0.17000 0.03073181
## 3  0.10000000 0.17125 0.02703521
## 4  0.31622777 0.16750 0.02898755
## 5  1.00000000 0.16625 0.02638523
## 6  3.16227766 0.16500 0.02687419
## 7 10.00000000 0.16500 0.03050501

Compute the training and test error rates using this new value for cost.

best.cost<-summary(tune.out)$best.parameters[1,1]
svc.best<-svm(Purchase~. ,data=train.set, kernel="linear", cost = best.cost)

lin.trn.er.bc<-mean(svc.best$fitted != train.set$Purchase) # training error
lin.trn.er.bc
## [1] 0.155
lin.tst.er.bc<-mean(predict(svc.best, test.set) != test.set$Purchase) # test error

lin.tst.er.bc
## [1] 0.1814815

Training error = 0.155

Test error = 0.1814815

Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.

### b)
# Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

svc.rad<-svm(Purchase~., data=train.set, kernel= "radial", cost=0.01, scale=F)
summary(svc.rad)
## 
## Call:
## svm(formula = Purchase ~ ., data = train.set, kernel = "radial", 
##     cost = 0.01, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.01 
## 
## Number of Support Vectors:  633
## 
##  ( 323 310 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
### c)
# What are the training and test error rates?

rad.trn.er<-mean(svc.rad$fitted != train.set$Purchase) 
rad.trn.er # training error with cost = 0.01
## [1] 0.3875
rad.tst.er<-mean(predict(svc.rad, test.set) != test.set$Purchase) 
rad.tst.er # test error with cost = 0.01
## [1] 0.3962963
### d)
# Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

set.seed(4234)
tune.out.rad<-tune(svm, Purchase~., data=train.set, kernel = "radial",
               ranges = list(cost = 10^seq(-2,1, by=0.5)))
tune.out.rad %>% summary()
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       cost
##  0.3162278
## 
## - best performance: 0.16625 
## 
## - Detailed performance results:
##          cost   error dispersion
## 1  0.01000000 0.38750 0.08354141
## 2  0.03162278 0.37000 0.09721825
## 3  0.10000000 0.17750 0.02554952
## 4  0.31622777 0.16625 0.01671867
## 5  1.00000000 0.16875 0.02651650
## 6  3.16227766 0.17250 0.03322900
## 7 10.00000000 0.17500 0.02946278
best.cost.rad<-summary(tune.out.rad)$best.parameters[1,1]
best.cost.rad # cost value yielding the lowest  error
## [1] 0.3162278
### e)
# Compute the training and test error rates using this new value for cost.

svc.rad.bc<-svm(Purchase~., data = train.set, kernel = "radial",
                cost = best.cost.rad, scale = F)

rad.trn.er.bc<-mean(svc.rad.bc$fitted != train.set$Purchase) 
rad.trn.er.bc # training error with cost value from cross validation
## [1] 0.28625
rad.tst.er.bc<-mean(predict(svc.rad.bc, test.set) != test.set$Purchase)
rad.tst.er.bc # test error with cost value from cross validation
## [1] 0.3185185

With cost = 0.01 the radial kernel uses 633 support vectors: 323 from class CH and 310 from class MM. Both the training and test errors increase compared to the linear kernel. With the cost selected by cross-validation:

Training error = 0.28625

Test error = 0.3185185

Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.

### b)
# Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.

svc.pol<-svm(Purchase~., data=train.set, kernel= "polynomial", cost=0.01, scale=F, degree = 2)

summary(svc.pol)
## 
## Call:
## svm(formula = Purchase ~ ., data = train.set, kernel = "polynomial", 
##     cost = 0.01, degree = 2, scale = F)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  0.01 
##      degree:  2 
##      coef.0:  0 
## 
## Number of Support Vectors:  326
## 
##  ( 162 164 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  CH MM
### c)
# What are the training and test error rates?

pol.trn.er<-mean(svc.pol$fitted != train.set$Purchase) 
pol.trn.er # training error with cost = 0.01
## [1] 0.155
pol.tst.er<-mean(predict(svc.pol, test.set) != test.set$Purchase) 
pol.tst.er # test error with cost = 0.01
## [1] 0.1814815
### d)
# Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.

set.seed(4234)
tune.out.pol<-tune(svm, Purchase~., data=train.set, kernel = "polynomial",
               ranges = list(cost = 10^seq(-2,1, by=0.5)), degree = 2)
tune.out.pol %>% summary()
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    10
## 
## - best performance: 0.1725 
## 
## - Detailed performance results:
##          cost   error dispersion
## 1  0.01000000 0.38875 0.08174529
## 2  0.03162278 0.35125 0.06832571
## 3  0.10000000 0.31375 0.06547105
## 4  0.31622777 0.20500 0.04609772
## 5  1.00000000 0.19375 0.03397814
## 6  3.16227766 0.18000 0.02838231
## 7 10.00000000 0.17250 0.02934469
best.cost.pol<-summary(tune.out.pol)$best.parameters[1,1]
best.cost.pol # cost value yielding the lowest  error
## [1] 10
### e)
# Compute the training and test error rates using this new value for cost.

svc.pol.bc<-svm(Purchase~., data=train.set, kernel= "polynomial", cost=best.cost.pol, scale=F, degree = 2)

pol.trn.er.bc<-mean(svc.pol.bc$fitted != train.set$Purchase)
pol.trn.er.bc # training error with cost value from cross validation
## [1] 0.15875
pol.tst.er.bc<-mean(predict(svc.pol.bc, test.set) != test.set$Purchase)
pol.tst.er.bc # test error with cost value from cross validation
## [1] 0.1740741

The polynomial kernel uses fewer support vectors than the other two kernels: 326 out of 800. The SVM with a polynomial kernel performs slightly better than the linear one and considerably better than the radial kernel.

Within the set of support vectors, 162 observations belong to class CH and 164 to class MM.

Training error = 0.15875

Test error = 0.1740741

Overall, which approach seems to give the best results on this data?

h<-rbind(Linear = lin.tst.er.bc,
         Radial = rad.tst.er.bc,
         Polynomial = pol.tst.er.bc) %>%  data.frame()

colnames(h)<-"Prediction_error"

kable(h, caption = "Kernel") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Kernel        Prediction_error
Linear               0.1814815
Radial               0.3185185
Polynomial           0.1740741
bestmodel<-subset(h, h$Prediction_error==min(h$Prediction_error)) %>% rownames()
bestmodel
## [1] "Polynomial"

The polynomial kernel is the best model, as it yields the lowest test error. Surprisingly, the radial kernel performs the worst. Perhaps if we also tuned the gamma value we would obtain a better test error.
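A sketch of that idea (the gamma grid is only illustrative), tuning cost and gamma jointly for the radial kernel on the OJ training set:

set.seed(4234)
tune.rad.g<-tune(svm, Purchase~., data = train.set, kernel = "radial",
                 ranges = list(cost = 10^seq(-2, 1, by = 0.5),
                               gamma = c(0.01, 0.05, 0.1, 0.5, 1)))
summary(tune.rad.g)$best.parameters # cost/gamma pair with the lowest CV error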

Extra Question + Answer

Does it make sense to try a radial kernel here? Why?

Answer:

Yes, it makes sense, since the data set has 18 variables (17 predictors + 1 response), so we cannot clearly visualize the features in a 2-dimensional plot (unless we use PCA, for example, or variable selection methods) and verify that the CH and MM observations are separated by a straight line (or by a plane if we had 3 features and generated a 3-dimensional plot). If we could see that the observations were clearly separated by a straight line/plane, then a linear kernel might be the best choice and there would be no need for the radial kernel.

Chapter 10

Exercise 10.8

In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the sdev output of the prcomp() function.

On the USArrests data, calculate PVE in two ways:

Using the sdev output of the prcomp() function, as was done in Section 10.2.3.

data("USArrests")
names(USArrests)
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"
dim(USArrests) # PC = min(n-1,p)
## [1] 50  4
pr.out<-prcomp(USArrests , scale = TRUE)
# scale = TRUE scales the variables to have standard deviation 1. This is necessary since the 4 variables have very different value ranges and scales (e.g. UrbanPop is a percentage).

# center = TRUE is the default. It means the function centers the variables to have mean zero: the mean of each variable (the center of the data) is subtracted, shifting the data so that its center coincides with the origin.



pr.var<-pr.out$sdev^2 # Variance of each principal component.
# Sum of squared distances from the origin of each observation projected onto each PC, divided by n-1.

# The sum of squared distances from the origin of the projections onto a PC equals its eigenvalue.

# Var(PC1) = sum(squared distances from the origin of each obs projected onto PC1)/(n-1)

# PC1 is chosen so that it has the largest sum of squared distances from the origin of the projected observations.
pve<-pr.var/sum(pr.var)
pve # Proportion of variance explained by each principal component.
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

By applying Equation 10.8 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.
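For reference, Equation 10.8 gives the PVE of the mth principal component (for centered, and here also scaled, observations \(x_{ij}\)) as

\[
\mathrm{PVE}_m = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}},
\]

where the \(\phi_{jm}\) are the loadings of the mth component.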

loadings<-pr.out$rotation
# The rotation matrix provides the principal component loadings; each column of pr.out$rotation contains the corresponding principal component loading vector

# It is a matrix whose columns contain the eigenvectors (unit vectors)
# sum(pr.out$rotation[,1]^2) = 1, same for the other PCs

# Eigenvectors are the linear combinations of the original variables (in the simple or rotated factor space); they describe how the variables "contribute" to each factor axis. Basically, PCA can be thought of as a way to construct new axes that point in the directions of maximal variance (in the original variable space), as expressed by the eigenvalues, with the variables' contributions weighted by the loadings in this new space.

# PC scores can be seen as the co-ordinates of each point/obs, with respect to the principal component axes. The principal component scores describe where each data point lies on each straight line, relative to the "centroid" of the data.


stnd.df<-scale(USArrests) # centers and scales the columns of a numeric matrix.
# Center: X (matrix with all the observations: dataframe) - Xbar(vector with mean value for each col/var)
# Scale: X/st.dev where st.dev is the vector with the standard dev value for each col/var.

pve2 = rep(NA, 4)

zetas<-stnd.df %*% loadings # Data points in the new coordinates system spanned out by the PCs.
var.pc<-apply(zetas,2,FUN = function(x) sum(x^2)) # sum of square distances from the origin (new "centered" origin)

pve.b<-var.pc/sum(var.pc)
pve.b # Proportion of variance explained by each principal component.
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752
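As a quick sanity check (sketch), the two approaches should agree:

all.equal(as.numeric(pve.b), pve) # PVE from Equation 10.8 matches PVE from sdev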

Exercise 10.9

Consider the USArrests data. We will now perform hierarchical clustering on the states.

Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

hc.complete<-hclust(dist(USArrests), method="complete")

# Compute all pairwise dissimilarities (Euclidean distance) between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

# The first step is to calculate the distance between every point with every other point. The result is a distance matrix, which can be computed with the dist() function.

plot(hc.complete, xlab = "", sub="Complete Linkage", hang = -1,cex = 0.75)

Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

cutree(hc.complete, k=3) # With k=3 I specify how many clusters I want.
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
ct<-cutree(hc.complete, 3) %>% table() %>% data.frame()
ct<-dplyr::rename(ct, Class = .)
ct
##   Class Freq
## 1     1   16
## 2     2   14
## 3     3   20

The cutree() function cuts the dendrogram so that we obtain k clusters. It returns the cluster label for each observation (state) associated with that cut of the dendrogram.
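The tree can also be cut at a given height instead of a given number of clusters; the height of 150 below is an arbitrary value chosen only for illustration:

cutree(hc.complete, h = 150) # cut at height 150; the number of clusters depends on where this height falls in the dendrogram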

Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

USArrests.sc<-scale(USArrests)
hc.complete.sc<-hclust(dist(USArrests.sc), method = "complete")
plot(hc.complete.sc, xlab = "", sub = "", hang = -1, cex = 0.75)

ct.sc<-cutree(hc.complete.sc, 3) %>% table() %>% data.frame()
ct.sc<-dplyr::rename(ct.sc, Class = .)
ct.sc
##   Class Freq
## 1     1    8
## 2     2   11
## 3     3   31

What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed?

Provide a justification for your answer.

In this case, scaling the variables changes the clustering output. The scale of the height axis of the dendrogram changes as well.

The function scale(x, center = TRUE, scale = TRUE) centers and scales the variables (columns) of the data frame, which means:

Center: X (matrix with all the observations: dataframe) - Xbar(vector with mean value for each col/var)

Scale: X/st.dev where st.dev is the vector with the standard dev value for each col/var.

I would say that we should scale the variables when their units of measurement are not comparable or their scales are not consistent.

For example, in this case the variable ‘UrbanPop’ is expressed as a percentage, while the other variables are arrest counts per 100,000 residents, so the ranges are very different and scaling is wise.

In addition, scaling can be needed when the units of measurement are inconsistent across variables. If, say, ‘Assault’ were recorded as cases per 100 citizens and ‘Rape’ as cases per 100,000 citizens, scaling would again be needed.

More generally, scaling matters because the dissimilarity between observations is measured here with Euclidean distance, so a variable with much larger numeric values would dominate that measurement. We need to make sure that no variable has a much bigger value range merely because of the units in which it happens to be measured (as illustrated below).
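A quick look at the raw column variances makes the point: Assault is on a much larger numeric scale than the other variables, so without scaling it would dominate the Euclidean distances.

apply(USArrests, 2, var) # Assault's variance dwarfs the others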

Exercise 10.10

In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.

Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.

n<-60
p<-50
c<-3
set.seed (2)
x1<-matrix(rnorm ((n/c)*p, mean = 0, sd = 0.1), ncol = p) %>%
  data.frame(Class = "A")
x2<-matrix(rnorm ((n/c)*p, mean = 3, sd = 0.6), ncol = p) %>% 
  data.frame(Class = "B")
x3<-matrix(rnorm ((n/c)*p, mean = 7, sd = 1), ncol = p) %>% 
  data.frame(Class = "C")

x<-rbind(x1,x2,x3)

x[sample(1:n,5),sample(1:p,6)]
##            X41          X17        X47         X1          X8        X45
## 54  6.18867869  7.351543751  8.3784004 5.13129919  5.57463495 5.55168039
## 49  5.92711167  7.402272088  6.4773200 8.00098243  6.69375726 7.93824121
## 44  9.32816114 10.246016418  7.2988462 8.04093832  5.96825350 7.24870630
## 12 -0.09875221 -0.008767276 -0.1030572 0.09817528 -0.01694232 0.00991457
## 51  7.10684099  6.920039601  7.0448973 7.44346277  8.47999033 6.34946173

Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.

pr.out<-prcomp(dplyr::select(x,-Class))

plot(pr.out$x[,1:2], col=factor(x$Class), xlab="PC1", ylab="PC2", pch=19) # wrap Class in factor() so each class maps to a distinct plotting colour

ggbiplot(pr.out, choice = 1:2, groups = x$Class, var.axes = F, 
         ellipse = TRUE) + geom_point(aes(col = x$Class), size = 4)

Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?

Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.

km3.out<-kmeans(dplyr::select(x,-Class), 3, nstart=20)

# nstart = 20 runs the algorithm from 20 different random initial cluster assignments and reports only the run with the lowest total within-cluster sum of squares (see the short sketch after the interpretation below).

table(km3.out$cluster, x$Class)
##    
##      A  B  C
##   1  0  0 20
##   2 20  0  0
##   3  0 20  0

class A from the original data frame corresponds to cluster 2.

class B from the original data frame corresponds to cluster 3.

class C from the original data frame corresponds to cluster 1.

As we can see from the table, K-means groups the observations correctly: all class A observations fall in cluster 2, all class B observations in cluster 3, and all class C observations in cluster 1.
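To illustrate the nstart comment above, here is a minimal sketch comparing a single random start with the 20-start run (the seed and the object name km3.once are arbitrary, for illustration only):

set.seed(4) # arbitrary seed
km3.once<-kmeans(dplyr::select(x,-Class), 3, nstart=1) # a single random start
c(nstart1 = km3.once$tot.withinss, nstart20 = km3.out$tot.withinss)
# With nstart = 20, kmeans() keeps the start with the lowest total within-cluster
# sum of squares, which guards against a poor local optimum; with classes as well
# separated as these, the two values will typically coincide.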

Perform K-means clustering with K = 2. Describe your results.

km2.out<-kmeans(dplyr::select(x,-Class), 2, nstart=20)

table(km2.out$cluster, x$Class)
##    
##      A  B  C
##   1  0  0 20
##   2 20 20  0

Original classes A and B are grouped together in cluster 2, while class C forms cluster 1 on its own.

Now perform K-means clustering with K = 4, and describe your results.

km4.out<-kmeans(dplyr::select(x,-Class), 4, nstart=20)

table(km4.out$cluster, x$Class)
##    
##      A  B  C
##   1  0  0  9
##   2  0  0 11
##   3 20  0  0
##   4  0 20  0

Original class C is now split (almost evenly) between clusters 1 and 2, while classes A and B keep their own clusters (3 and 4 respectively).

Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.

km3pc.out<-kmeans(pr.out$x[,1:2], 3, nstart=20)

pr.out$x[,1:2] %>% data.frame() %>% 
  ggplot(mapping = aes(x=PC1, y=PC2, color=factor(km3pc.out$cluster))) + geom_point() + labs(color="Cluster")

table(km3pc.out$cluster, x$Class)
##    
##      A  B  C
##   1  0  0 20
##   2 20  0  0
##   3  0 20  0

Applying K-means clustering with K = 3 on PC1 and PC2 yields the same result as applying the same method to the full original data set (see part (c)).

Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.

km3sc.out<-kmeans(scale(dplyr::select(x,-Class), center = T, scale = T),
                  3, nstart=20)

km3sc.out$cluster
##  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
km3sc.out$betweenss # The between-cluster sum of squares with scaling
## [1] 2802.036
km3.out$betweenss # The between-cluster sum of squares with no scaling
## [1] 25149.31
table(km3sc.out$cluster, x$Class)
##    
##      A  B  C
##   1  0 20  0
##   2  0  0 20
##   3 20  0  0

Scaling the data does not change the result: the way the algorithm clusters the observations still matches the true class labels, consistent with the clear class separation seen in the PCA plot from (b). The between-cluster sum of squares is smaller after scaling only because it is computed on the standardized data; the cluster assignments themselves are unchanged.
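As a small check (not part of the original answer) that only the arbitrary cluster numbering, and not the grouping itself, differs between the scaled and unscaled runs:

table(Scaled = km3sc.out$cluster, Unscaled = km3.out$cluster) # each scaled cluster lines up with exactly one unscaled cluster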

Exercise 10.11

On the book website, www.StatLearning.com, there is a gene expression data set (Ch10Ex11.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

Load in the data using read.csv(). You will need to select header=F.

g.df<-read.csv("Ch10Ex11.csv", header=F)
#view(g.df)
#summary(g.df)
dim(g.df)
## [1] 1000   40
# row: gene
# column: tissue sample

Apply hierarchical clustering to the samples using correlation based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?

dd=as.dist(1- cor(g.df))
# two observations are similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.
# cor(g.df) gives correlations between the columns (the 40 samples), which is what we want; with cor(t(g.df)) I would end up clustering the 1,000 genes instead of the 40 samples.

hc.comp.corr<-hclust(dd, method = "complete")
hc.sing.corr<-hclust(dd, method = "single")
hc.avg.corr<-hclust(dd, method = "average")
hc.cent.corr<-hclust(dd, method = "centroid")
plot(hc.comp.corr, xlab = "", sub = "Complete Linkage",cex = 0.75, hang = -1)

plot(hc.sing.corr, xlab = "", sub = "Single Linkage",cex = 0.75, hang = -1)

plot(hc.avg.corr, xlab = "", sub = "Average Linkage", cex = 0.75, hang= -1)

plot(hc.cent.corr, xlab = "", sub = "centroid", cex = 0.75)

The dendrograms suggest that there are two or three main groups, and the choice of linkage clearly affects the results obtained. Single linkage tends to yield trailing clusters: very large clusters onto which individual observations attach one by one. Complete and average linkage, on the other hand, tend to yield more balanced clusters. For this reason, complete and average linkage are generally preferred to single linkage.
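Since the exercise states that the first 20 samples are healthy and the last 20 diseased, one way to check how well a two-cluster cut recovers that split is a cross-tabulation like the sketch below (true.group is a label vector constructed from the exercise statement):

true.group<-rep(c("Healthy","Diseased"), each=20) # known sample labels
table(Complete = cutree(hc.comp.corr, k=2), true.group)
table(Single = cutree(hc.sing.corr, k=2), true.group)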

Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.

pca.out<-prcomp(t(g.df))

plot(pca.out$x[,1:2], xlab="PC1", ylab="PC2", pch=19) 


pca2<-pca.out$x %>%
    data.frame() %>%
    dplyr::select(PC1, PC2)

pca.kmeans<-kmeans(pca2,centers = 2)

ggbiplot(pca.out, groups = factor(pca.kmeans$cluster),var.axes = F, ellipse = TRUE) + geom_point(aes(col = factor(pca.kmeans$cluster)))

topgen.PC1<-abs(pca.out$rotation) %>% data.frame %>%
  dplyr::select(PC1) %>% mutate(Gene = 1:1000) %>% arrange(desc(PC1))

topgen.PC1[1:10,] # Top 10 genes by absolute PC1 loading. Intuitively, these 10 genes are the ones that differ the most between the 2 clusters.
##          PC1 Gene
## 1  0.1142430  600
## 2  0.1103983  584
## 3  0.1083294  549
## 4  0.1081255  540
## 5  0.1074596  502
## 6  0.1065494  582
## 7  0.1062712  565
## 8  0.1061876  568
## 9  0.1041632  529
## 10 0.1038207  599

PCA tells us in which directions the data vary the most. For simplicity's sake we consider only the first two PCs. Each PC is a linear combination of the original variables (genes), so the genes with the biggest "weight" (loading, in absolute value) on PC1 and PC2 should be the genes that differ the most between the two groups.

This is also why the cluster analysis above was performed on PC1 and PC2 rather than on the raw data.
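The ranking above uses the PC1 loadings only; a sketch that folds PC2 into the ranking (squared loadings summed over the first two PCs; topgen.PC12 is a hypothetical name, not part of the original analysis) could look like this:

topgen.PC12<-data.frame(Weight = rowSums(pca.out$rotation[,1:2]^2),
                        Gene = 1:1000) %>%
  arrange(desc(Weight))

topgen.PC12[1:10,] # top 10 genes by combined PC1/PC2 weight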