Assigned Exercises 9.7 and 9.8
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
Split the data set into a training set and a test set.
data(Carseats)
set.seed(1)
train<-sample(1:nrow(Carseats), nrow(Carseats)*0.5)
train.set<-Carseats[train,]
test.set<-Carseats[-train,]
Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
#tree.cs<-rpart(Sales~. , train.set, method = 'anova')
tree.cs<-tree(Sales~. , data=train.set)
summary(tree.cs)
##
## Regression tree:
## tree(formula = Sales ~ ., data = train.set)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "CompPrice"
## [6] "US"
## Number of terminal nodes: 18
## Residual mean deviance: 2.167 = 394.3 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.88200 -0.88200 -0.08712 0.00000 0.89590 4.09900
plot(tree.cs, type = c("proportional", "uniform"))
text(tree.cs, pretty = 0, cex=0.5)
#fancyRpartPlot(tree.cs, cex=0.5)
# For a regression tree, each node displays the predicted value, i.e. the average Sales
# (the response to be predicted) of the training observations falling in that region.
# E.g. the value shown at the root equals mean(train.set$Sales).
yhat<-predict (tree.cs, newdata = test.set)
testMSE<-mean((yhat-test.set$Sales)^2)
testMSE
## [1] 4.922039
#plotcp(tree.cs)
One of the most important indicators of Sales appears to be shelving location, since the first branch separates Good locations from Bad and Medium locations. Another very important predictor is Price, which appears on both sides of that first split ({Good} and {Medium, Bad}).
Splitting on shelving location (cutpoint {Good} vs {Medium, Bad}) leads to the greatest possible reduction in RSS.
The test MSE is 4.9220391.
Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
#set.seed(1)
cv.carseats<-cv.tree(tree.cs, FUN = prune.tree)
#Runs a K-fold cross-validation experiment to find the deviance ( misclassifications for classification prob) as a function of the cost-complexity parameter k.
# Alternative, with Caret package
# train.ctrl<-trainControl(method = "repeatedcv", number = 5 , repeats = 3)
# set.seed(1)
# rpart_tree<-train(Sales~. , data=train.set, method = "rpart", trControl = train.ctrl )
cv.carseats
## $size
## [1] 18 17 16 15 14 13 12 11 10 8 7 6 5 4 3 2 1
##
## $dev
## [1] 831.3437 852.3639 868.6815 862.3400 862.3400 893.4641 911.2580
## [8] 950.2691 955.2535 1039.1241 1066.6899 1125.0894 1205.5806 1273.2889
## [15] 1302.8607 1349.9273 1620.4687
##
## $k
## [1] -Inf 16.99544 20.56322 25.01730 25.57104 28.01938 30.36962
## [8] 31.56747 31.80816 40.75445 44.44673 52.57126 76.21881 99.59459
## [15] 116.69889 159.79501 337.60153
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
best.size<-cv.carseats$size[which.min(cv.carseats$dev)]
best.size
## [1] 18
data.frame(Size=cv.carseats$size, CVerror=cv.carseats$dev) %>%
ggplot(mapping=aes(x=Size,y=CVerror)) +
geom_point(color="red") +
geom_line() + scale_x_continuous(breaks=seq(0,20,1))
data.frame(Alpha=cv.carseats$k, CVerror=cv.carseats$dev) %>%
ggplot(mapping=aes(x=Alpha,y=CVerror)) +
geom_point(color="Orange") +
geom_line() + scale_x_continuous(breaks=seq(0,400,20))
#The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate and the value of the cost-complexity parameter used (k, which corresponds to alpha in (8.4))
pruned.carseats<-prune.tree(tree.cs, best = best.size)
pruned.carseats %>% summary()
##
## Regression tree:
## tree(formula = Sales ~ ., data = train.set)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "CompPrice"
## [6] "US"
## Number of terminal nodes: 18
## Residual mean deviance: 2.167 = 394.3 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.88200 -0.88200 -0.08712 0.00000 0.89590 4.09900
yhat.pruned<-predict(pruned.carseats, newdata= test.set)
MSE.pruned<-mean((test.set$Sales - yhat.pruned)^2)
MSE.pruned
## [1] 4.922039
Cross-validation determines that the best size (optimal level of tree complexity) is 18, and the corresponding test MSE is 4.9220391. The MSE has not changed, and the total number of terminal nodes has not changed either, so no actual pruning took place.
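For context, each candidate subtree in the pruning sequence above minimizes the cost-complexity criterion (Equation 8.4 in the text, as recalled here)
\[
\sum_{m=1}^{|T|}\ \sum_{i:\,x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 \;+\; \alpha\,|T|,
\]
where \(|T|\) is the number of terminal nodes and \(\alpha\) corresponds to the k values reported by cv.tree(). As a hypothetical check, not required by the exercise and assuming the objects created above, we could still force a smaller tree and see how the test MSE reacts:
pruned8<-prune.tree(tree.cs, best = 8)  # e.g. prune back to 8 terminal nodes
mean((predict(pruned8, newdata = test.set) - test.set$Sales)^2)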
Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
set.seed (1)
p<-ncol(Carseats)-1 # number of predictors
bag.cs<-randomForest(Sales~.,data = train.set, mtry=p, importance =TRUE)
importance(bag.cs) %>% data.frame()
## X.IncMSE IncNodePurity
## CompPrice 24.8888481 170.182937
## Income 4.7121131 91.264880
## Advertising 12.7692401 97.164338
## Population -1.8074075 58.244596
## Price 56.3326252 502.903407
## ShelveLoc 48.8886689 380.032715
## Age 17.7275460 157.846774
## Education 0.5962186 44.598731
## Urban 0.1728373 9.822082
## US 4.2172102 18.073863
# %IncMSE is based on the mean increase in prediction error on the out-of-bag samples
# when the values of a given variable are randomly permuted.
# IncNodePurity is the total decrease in node impurity resulting from splits on that variable,
# averaged over all trees. For regression trees, node impurity is measured by the training RSS.
varImpPlot(bag.cs, main = "Importance")
yhat.bag<-predict(bag.cs, newdata = test.set)
mse.bag<-mean((yhat.bag - test.set$Sales)^2)
mse.bag
## [1] 0.2656695
Price and ShelveLoc are by far the most important variables. The bagging approach yields a considerably lower test MSE than the single regression tree.
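As a rough, hedged illustration of the idea behind %IncMSE (the actual measure uses the out-of-bag samples and a per-tree standardization, and permutes rather than excludes the variable), assuming bag.cs and test.set from the chunks above:
set.seed(1)
perm.set<-test.set
perm.set$Price<-sample(perm.set$Price)  # break the association between Price and Sales
mse.orig<-mean((predict(bag.cs, newdata = test.set) - test.set$Sales)^2)
mse.perm<-mean((predict(bag.cs, newdata = perm.set) - test.set$Sales)^2)
mse.perm - mse.orig  # a large increase signals an important variable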
Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
set.seed (1)
m.rf<-round(sqrt(p))
rf.cs<-randomForest(Sales~.,data = train.set, mtry=m.rf, importance =TRUE)
importance(rf.cs) %>% data.frame()
## X.IncMSE IncNodePurity
## CompPrice 14.8840765 158.82956
## Income 4.3293950 125.64850
## Advertising 8.2215192 107.51700
## Population -0.9488134 97.06024
## Price 34.9793386 385.93142
## ShelveLoc 34.9248499 298.54210
## Age 14.3055912 178.42061
## Education 1.3117842 70.49202
## Urban -1.2680807 17.39986
## US 6.1139696 33.98963
varImpPlot(rf.cs, main = "Importance")
yhat.rf<-predict(rf.cs, newdata = test.set)
mse.rf<-mean((yhat.rf - test.set$Sales)^2)
mse.rf
## [1] 0.2402389
The test MSE from the random forest is slightly better than the one we obtained with bagging. Price and ShelveLoc are still the most important variables; however, Age has now increased in importance relative to the other variables. At each split the random forest considers only m = 3 randomly chosen variables instead of the full set of predictors as in bagging; decorrelating the trees in this way usually lowers the test error, although too small an m can hurt when only a few predictors are strong.
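To make the effect of m more explicit, a minimal sketch (assuming train.set, test.set and p from the chunks above; ntree and the grid of m values are arbitrary choices) that refits the forest for every value of mtry and records the test MSE:
library(randomForest)
set.seed(1)
mse.by.m<-sapply(1:p, function(m) {
  fit<-randomForest(Sales ~ ., data = train.set, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test.set) - test.set$Sales)^2)
})
plot(1:p, mse.by.m, type = "b", xlab = "m (variables tried at each split)", ylab = "Test MSE")
# m = p reproduces bagging; smaller m decorrelates the trees, which typically reduces the
# test error up to a point, after which too few candidate variables per split hurts.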
This problem involves the OJ data set which is part of the ISLR package
Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
data(OJ)
set.seed(1)
train<-sample(1:nrow(OJ), 800)
train.set<-OJ[train,]
test.set<-OJ[-train,]
Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
tree.oj<-tree(Purchase~. , data=train.set)
p<-ncol(OJ)-1
#?OJ
# Purchase: A factor with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice
#head
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train.set)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
The tree has 9 terminal nodes. The model uses the following variables:
LoyalCH, PriceDiff, SpecialCH, ListPriceDiff, PctDiscMM
The training error rate (misclassification rate) is 0.15875.
Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree.oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196197 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196197 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
We choose to analyze the node labeled "3)". For each node the output prints the following information:
node), split, n, deviance, yval, (yprob) * denotes terminal node
Node 3) is not a terminal node, since it is not marked with *. Its splitting variable is LoyalCH, with split value 0.5036. There are 435 observations in the branch rooted at this node, the deviance over those observations is 337.90, and the predicted class (yval) is CH: about 87% of the observations in this region have Purchase equal to CH, the remaining 13% MM. An example of a terminal node is node 7): it contains the 261 observations with LoyalCH > 0.764572, has deviance 91.20, and predicts CH, with roughly 96% of its observations actually purchasing CH.
Create a plot of the tree, and interpret the results.
plot(tree.oj, type = c("proportional", "uniform"))
text(tree.oj, pretty = 0, cex=0.7)
LoyalCH appears in several nodes of the tree, including the top splits. This indicates that it is one of the most important predictors.
We can follow a specific path in the tree and check the corresponding prediction.
For example: if LoyalCH is below 0.5036 but above 0.280875, and PriceDiff is greater than 0.05, then the model predicts that Purchase will be CH.
Summarized:
If {0.280875 < LoyalCH < 0.5036 and PriceDiff > 0.05} then Purchase = CH.
Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
yhat<-predict(tree.oj, newdata = test.set, type = "class")
table(test.set$Purchase,yhat) %>% kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F) %>%
add_header_above(c(" ", "Prediction" = 2))
|  | CH (predicted) | MM (predicted) |
|---|---|---|
| CH | 160 | 8 |
| MM | 38 | 64 |
test.er<-mean(yhat!=test.set$Purchase)
#(8+38)/(160+8+38+64)
Test error rate = 0.1703704. It corresponds to the proportion of misclassified observations in the test set.
Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv.oj<-cv.tree(tree.oj, FUN = prune.tree)
#Runs a K-fold cross-validation experiment to find the deviance ( misclassifications for classification prob) as a function of the cost-complexity parameter k.
cv.oj
## $size
## [1] 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 685.6493 698.8799 702.8083 702.8083 714.1093 725.4734 780.2099
## [8] 790.0301 1074.2062
##
## $k
## [1] -Inf 12.62207 13.94616 14.35384 26.21539 35.74964 43.07317
## [8] 45.67120 293.15784
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
best.size<-cv.oj$size[which.min(cv.oj$dev)]
best.size
## [1] 9
A tree size of 9 returns the lowest cross-validation error.
Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
data.frame(Size=cv.oj$size, CVerror=cv.oj$dev) %>%
ggplot(mapping=aes(x=Size,y=CVerror)) +
geom_point(color="red") +
geom_line() + scale_x_continuous(breaks=seq(0,10,1))
#data.frame(Alpha=cv.oj$k, CVerror=cv.oj$dev) %>%
#ggplot(mapping=aes(x=Alpha,y=CVerror)) +
#geom_point(color="Orange") +
#geom_line() + scale_x_continuous(breaks=seq(0,300,20))
A tree size of 9 returns the lowest cross-validation error. However, from the plot we can see that tree sizes 6 and 7 have comparable cross-validation errors.
Which tree size corresponds to the lowest cross-validated classification error rate?
best.size<-cv.oj$size[which.min(cv.oj$dev)]
best.size
## [1] 9
A tree size of 9 returns the lowest cross-validation error.
Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
pruned.oj<-prune.tree(tree.oj, best = 5)
Compare the training error rates between the pruned and unpruned trees. Which is higher?
pruned.oj %>% summary()
##
## Classification tree:
## snip.tree(tree = tree.oj, nodes = c(4L, 12L, 5L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "ListPriceDiff"
## Number of terminal nodes: 5
## Residual mean deviance: 0.8239 = 655 / 795
## Misclassification error rate: 0.205 = 164 / 800
tree.oj %>% summary()
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train.set)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
Misclassification error (training error) for the unpruned tree: 0.15875
Misclassification error (training error) for the pruned tree: 0.205
The unpruned tree has a lower training misclassification error than the pruned version.
Compare the test error rates between the pruned and unpruned trees. Which is higher?
yhat.pruned<-predict(pruned.oj, newdata = test.set, type = "class")
test.er.pruned<-mean(yhat.pruned!=test.set$Purchase)
test.er.pruned
## [1] 0.1925926
For the pruned tree the test error rate is 0.1925926, while for the unpruned tree it is 0.1703704.
The unpruned tree has the lower test error rate.
We now use boosting to predict Salary in the Hitters data set.
Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
data("Hitters")
hit<-Hitters %>% drop_na("Salary")
hit$Salary<-log(hit$Salary)
Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations
set.seed(1)
train<-1:200
train.set<-hit[train,]
test.set<-hit[-train,]
Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
lambda<-seq(from=0.001, to= 0.05, by= 0.001)
B<-1000
train.mse<-rep(NA,length(lambda))
test.mse<-rep(NA,length(lambda))
for (i in seq_along(lambda)){
boost.hit<-gbm(Salary~.,data=train.set,
distribution = "gaussian", n.trees = B,
shrinkage = lambda[i])
train.pred = predict(boost.hit, newdata = train.set, n.trees = B)
test.pred = predict(boost.hit, newdata = test.set, n.trees = B)
train.mse[i] = mean((train.set$Salary - train.pred)^2)
test.mse[i] = mean((test.set$Salary - test.pred)^2)
}
data.frame(Lambda = lambda, train.mse = train.mse) %>%
ggplot(mapping = aes(x = Lambda, y = train.mse)) +
geom_point(color = "red") +
geom_line() +
labs(y = "Training set MSE")
Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
data.frame(Lambda = lambda, test.mse = test.mse) %>%
ggplot(mapping = aes(x = Lambda, y = test.mse)) +
geom_point(color = "blue") +
geom_line() +
labs(y = "Test set MSE")
e)
Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
# linear regression
lm.hit<-lm(Salary ~ ., data = train.set)
lm.pred<-predict(lm.hit, newdata = test.set)
lm.test.mse<-mean((test.set$Salary - lm.pred)^2)
# lasso
set.seed(1)
x.train<-model.matrix(Salary ~ ., data = train.set)
y.train<-train.set$Salary
x.test<-model.matrix(Salary ~ ., data = test.set)
lasso.hit<-glmnet::glmnet(x.train, y.train, alpha = 1)
Lcv.out<-glmnet::cv.glmnet (x.train, y.train, alpha =1)
bestlam.L<-Lcv.out$lambda.min
bestlam.L
## [1] 0.00346633
lasso.pred<-predict(lasso.hit, s = bestlam.L, newx = x.test)
lasso.test.mse<-mean((test.set$Salary - lasso.pred)^2)
lm.test.mse
## [1] 0.4917959
lasso.test.mse
## [1] 0.4706337
boosting.test.mse<-min(test.mse)
boosting.test.mse
## [1] 0.255546
Boosting yields a lower test MSE (0.255546) than linear regression (0.4917959) and the lasso (0.4706337).
Which variables appear to be the most important predictors in the boosted model?
boost.best<-gbm(Salary ~ ., data = train.set, distribution = "gaussian",
n.trees = B, shrinkage = lambda[which.min(test.mse)])
summary(boost.best)
## var rel.inf
## CAtBat CAtBat 14.6821030
## CHits CHits 13.1718861
## CRuns CRuns 9.2318785
## CWalks CWalks 8.1675269
## PutOuts PutOuts 7.7964606
## CRBI CRBI 7.4968907
## Years Years 6.5191419
## Walks Walks 5.9508931
## CHmRun CHmRun 5.6988413
## AtBat AtBat 4.5363925
## RBI RBI 3.5439356
## Assists Assists 3.2086097
## Hits Hits 2.5620480
## HmRun HmRun 2.5547074
## Errors Errors 2.0569793
## Runs Runs 1.5230104
## Division Division 0.6218924
## NewLeague NewLeague 0.4610069
## League League 0.2157957
CAtBat is the most important predictor, followed by CHits and CRuns.
Now apply bagging to the training set. What is the test set MSE for this approach?
set.seed(1)
p<-ncol(hit)-1
bag.hit<-randomForest(Salary ~ ., data = train.set, mtry = p)
bag.pred<-predict(bag.hit, newdata = test.set)
bagging.test.mse<-mean((test.set$Salary - bag.pred)^2)
bagging.test.mse
## [1] 0.2299324
Bagging returns a slightly lower test MSE (0.2299324) than boosting (0.255546).
Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?
data(College)
# Response: Apps, the number of applications received
set.seed(1)
train<-sample(1:nrow(College), 0.8*nrow(College))
train.set<-College[train,-3]  # drop column 3 (Accept) from the predictors
test.set<-College[-train,-3]
view(train.set)
is.na(College) %>% sum()
## [1] 0
We will use the College data set, with Apps (the number of applications received) as the response variable.
Linear Regression
lm.college<-lm(Apps ~ ., data = train.set)
lm.pred<-predict(lm.college, newdata = test.set)
lm.test.mse<-mean((test.set$Apps - lm.pred)^2)
lm.test.mse
## [1] 3045265
Boosting
B<-3000
boost.college<-gbm(Apps~. ,data = train.set,
distribution = "gaussian", n.trees = B,
shrinkage = 0.001,
interaction.depth = 2)
test.pred<-predict(boost.college, newdata = test.set, n.trees = B)
boosting.test.mse<-mean((test.set$Apps - test.pred)^2)
boosting.test.mse
## [1] 2541130
Boosting with Caret
set.seed(1)
train.ctrl<-trainControl(method = "repeatedcv", number = 5 , repeats = 3)
gbm.caret<-train(Apps~. ,data = train.set, method = "gbm", distribution = "gaussian", trControl = train.ctrl, verbose = F)
#getModelInfo()$gbm$parameters
myGrid<-expand.grid(n.trees = seq(1000,5000, by=1000),
interaction.depth = seq(1,5, by=1),
shrinkage = c(0.001, 0.01, 0.02, 0.03),
n.minobsinnode = 10)
set.seed(1)
gbm.caret.tune<-train(Apps~ ., data = train.set, method = "gbm", distribution = "gaussian", trControl = train.ctrl, verbose = F, tuneGrid=myGrid)
myGrid.best<-gbm.caret.tune$bestTune
set.seed(1)
gbm.caret.best<-train(Apps~ ., data=train.set, method = "gbm", distribution="gaussian", trControl = train.ctrl, verbose = F, tuneGrid=myGrid.best)
test.pred<-predict(gbm.caret.best, newdata = test.set)
boostcaret.test.mse<-mean((test.set$Apps - test.pred)^2)
boostcaret.test.mse
## [1] 1942572
Bagging
p<-ncol(train.set)-1
bag.college<-randomForest(Apps ~ ., data = train.set, mtry = p)
bag.pred<-predict(bag.college, newdata = test.set)
bagging.test.mse<-mean((test.set$Apps - bag.pred)^2)
bagging.test.mse
## [1] 3875409
Random Forest
set.seed(1)
m.rf<-round(sqrt(p))
rf.college<-randomForest(Apps ~ ., data = train.set, mtry = m.rf)
rf.pred<-predict(rf.college, newdata = test.set)
rf.test.mse<-mean((test.set$Apps - rf.pred)^2)
rf.test.mse
## [1] 2014468
a<-rbind(Linear_Reg = lm.test.mse,
Boosting = boosting.test.mse,
Boosting_caret=boostcaret.test.mse,
Bagging = bagging.test.mse,
Random_Forest = rf.test.mse) %>% data.frame()
colnames(a)<-"Prediction_error"
kable(a) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
|  | Prediction_error |
|---|---|
| Linear_Reg | 3045265 |
| Boosting | 2541130 |
| Boosting_caret | 1942572 |
| Bagging | 3875409 |
| Random_Forest | 2014468 |
bestmodel<-subset(a, a$Prediction_error==min(a$Prediction_error)) %>% rownames()
Boosting_caret returns the lowest test MSE.
Generate a simulated two-class data set with 100 observations and two features in which there is a visible but non-linear separation between the two classes.
Show that in this setting, a support vector machine with a polynomial kernel (with degree greater than 1) or a radial kernel will outperform a support vector classifier on the training data.
Which technique performs best on the test data? Make plots and report training and test error rates in order to back up your assertions.
Data
n<-100
set.seed(3)
x1<-rnorm(n)
x2<-x1 + 4*x1^2 + 1 + rnorm(n)
A<-sample(1:n, n/2)
B<- -A
x2[A]<-x2[A] + 2 # shift the points in order to make 2 distinguishable areas
x2[B]<-x2[B] - 2
df<-data.frame(x1,x2) %>% mutate(type = NA)
df$type[A]<-"A"
df$type[B]<-"B"
df$type<-as.factor(df$type)
# Plot using different colors
ggplot(df) + geom_point(mapping = aes(x=x1, y=x2, col=type))
Support Vector Classifier - Linear Kernel
set.seed(5)
#train<-df %>% group_by(type) %>% sample_n(n/4)
train<-sample(1:n,n/2)
train.set<-df[train,]
test.set<-df[-train,]
svc.fit<-svm(type~., data=train.set , kernel = "linear", cost =10, scale =FALSE)
# The decision boundary between the two classes is linear (because we used kernel = "linear").
# A smaller value of the cost parameter gives a wider margin and therefore more support vectors.
# Cross-validation with tune() can help us find the optimal value for cost.
plot(svc.fit, data=train.set)
summary(svc.fit)
##
## Call:
## svm(formula = type ~ ., data = train.set, kernel = "linear", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 31
##
## ( 16 15 )
##
##
## Number of Classes: 2
##
## Levels:
## A B
Training Error & Test Error - Support Vector Classifier
lin.trn.er<-mean(svc.fit$fitted != train.set$type) # training error
lin.trn.er
## [1] 0.2
lin.tst.er<-mean(predict(svc.fit, test.set) != test.set$type) # test error
lin.tst.er
## [1] 0.24
Support Vector Machine - Polynomial Kernel
svm.fit<-svm(type~., data=train.set , kernel = "polynomial", cost =10,scale =FALSE)
plot(svm.fit, data=train.set)
summary(svm.fit)
##
## Call:
## svm(formula = type ~ ., data = train.set, kernel = "polynomial",
## cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 10
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 32
##
## ( 15 17 )
##
##
## Number of Classes: 2
##
## Levels:
## A B
Training Error & Test Error- Support Vector Machine (Polynomial)
pol.trn.er<-mean(svm.fit$fitted != train.set$type) # Training error
pol.trn.er
## [1] 0.24
pol.tst.er<-mean(predict(svm.fit, test.set) != test.set$type) # Test error
pol.tst.er
## [1] 0.38
svm.fit.r<-svm(type~., data=train.set, kernel = "radial", cost=10)
plot(svm.fit.r, data=train.set)
summary(svm.fit.r)
##
## Call:
## svm(formula = type ~ ., data = train.set, kernel = "radial", cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 15
##
## ( 7 8 )
##
##
## Number of Classes: 2
##
## Levels:
## A B
Training Error & Test Error- Support Vector Machine (Radial)
rad.trn.er<-mean(svm.fit.r$fitted != train.set$type) # Training error
rad.trn.er
## [1] 0
rad.tst.er<-mean(predict(svm.fit.r, test.set) != test.set$type) # Test error
rad.tst.er
## [1] 0.18
s<-rbind(Linear = lin.tst.er,
Polynomial = pol.tst.er,
Radial = rad.tst.er) %>% data.frame() %>%
cbind(c(lin.trn.er,pol.trn.er,rad.trn.er))
colnames(s)<-c("Test_error","Train_error")
kable(s, caption = "Kernel") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
|  | Test_error | Train_error |
|---|---|---|
| Linear | 0.24 | 0.20 |
| Polynomial | 0.38 | 0.24 |
| Radial | 0.18 | 0.00 |
bestmodel<-subset(s, s$Test_error==min(s$Test_error)) %>% rownames()
bestmodel
## [1] "Radial"
The table summarizes the performance of the three models in terms of training and test error.
The radial kernel returns the lowest test error, meaning that the proportion of misclassified observations is smallest with this kernel. The same can be seen in the plots: the boundary produced by the radial kernel follows the split between the two classes, which was generated from a non-linear equation, much more closely.
We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:
set.seed(123)
x1=runif(500) -0.5
x2=runif(500) -0.5
y=1*(x1^2-x2^2 > 0)
Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the yaxis.
df<-data.frame(x1,x2,y=as.factor(y))
# Plot using different colors
df %>% ggplot(mapping = aes(x=x1, y=x2, color=factor(y))) + geom_point() + labs(color ="Class")
Fit a logistic regression model to the data, using X1 and X2 as predictors.
log.fit<-glm(y~x1+x2, family = binomial)
log.fit %>% summary
##
## Call:
## glm(formula = y ~ x1 + x2, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.227 -1.200 1.133 1.157 1.188
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.04792 0.08949 0.535 0.592
## x1 -0.03999 0.31516 -0.127 0.899
## x2 0.11509 0.30829 0.373 0.709
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.86 on 499 degrees of freedom
## Residual deviance: 692.71 on 497 degrees of freedom
## AIC: 698.71
##
## Number of Fisher Scoring iterations: 3
Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
log.prob<-predict(log.fit, newdata=df, type = "response")
log.pred<-ifelse(log.prob > 0.50, 1, 0)
mean(log.pred == df$y)
## [1] 0.574
df %>% cbind(log.pred) %>%
ggplot(mapping = aes(x=x1,y=x2,color=factor(log.pred))) +
geom_point() + labs(color="Class")
Now fit a logistic regression model to the data using non-linear functions of X1 and X2 as predictors (e.g. \(X_1^2\) , \(X_1X_2\), \(log(X_2)\), and so forth).
logp.fit<-glm(y ~ poly(x1,2) + I(log(abs(x2)))+ I(x1 * x2), data = df, family = binomial)
Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
set.seed(123)
train<-sample(1:nrow(df), nrow(df)*0.5)
train.set<-df[train,]
test.set<-df[-train,]
logp.prob<-predict(logp.fit, newdata=train.set, type = "response")
logp.pred<-ifelse(logp.prob > 0.50, 1, 0)
mean(logp.pred != train.set$y)
## [1] 0.116
train.set %>% cbind(logp.pred) %>%
ggplot(mapping = aes(x=x1,y=x2,color=factor(logp.pred))) +
geom_point() + labs(color="Class")
logp.prob.ts<-predict(logp.fit, newdata=test.set, type = "response")
logp.pred.ts<-ifelse(logp.prob.ts > 0.50, 1, 0)
test.er.logp<-mean(logp.pred.ts != test.set$y)
test.er.logp
## [1] 0.08
Fit a support vector classifier to the data with X1 and X2 as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
svc.fit<-svm(y~., data=train.set , kernel = "linear", cost =10, scale =FALSE)
plot(svc.fit, data=train.set)
summary(svc.fit)
##
## Call:
## svm(formula = y ~ ., data = train.set, kernel = "linear", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 237
##
## ( 118 119 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
tst.er.lin<-mean(predict(svc.fit, test.set) != test.set$y) # Test error
tst.er.lin
## [1] 0.472
Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
Polynomial Kernel
svm.fit.pol<-svm(y~., data=train.set , kernel = "polynomial", degree=2, cost =10, scale =FALSE)
plot(svm.fit.pol, data=train.set)
summary(svm.fit.pol)
##
## Call:
## svm(formula = y ~ ., data = train.set, kernel = "polynomial", degree = 2,
## cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 10
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 163
##
## ( 82 81 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
tst.er.pol<-mean(predict(svm.fit.pol, test.set) != test.set$y) # Test error
tst.er.pol
## [1] 0.044
Radial Kernel
svm.fit.rad<-svm(y~., data=train.set , kernel = "radial",cost =10, scale =FALSE)
plot(svm.fit.rad, data=train.set)
summary(svm.fit.rad)
##
## Call:
## svm(formula = y ~ ., data = train.set, kernel = "radial", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 142
##
## ( 71 71 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
tst.er.rad<-mean(predict(svm.fit.rad, test.set) != test.set$y) # Test error
tst.er.rad
## [1] 0.072
Comment on your results.
The plots above show that an SVM with a non-linear kernel and a logistic regression with transformed predictors both do a good job when the boundary between the two classes is non-linear, and the boundaries they find are quite similar. The main practical advantage of the SVM is that it has far fewer tuning options (gamma or polynomial degree, plus cost) than the logistic regression, whose "tuning options" are as many as the possible transformations of the variables.
Furthermore, the optimal tuning parameters for the SVM can be selected through cross-validation.
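As a hedged sketch of that last point (assuming the train.set simulated above, with predictors x1, x2 and factor response y; the grids below are illustrative only), cost and gamma for the radial SVM can be chosen with tune():
library(e1071)
set.seed(123)
tune.rad<-tune(svm, y~., data = train.set, kernel = "radial",
               ranges = list(cost = c(0.1, 1, 10, 100), gamma = c(0.5, 1, 2, 4)))
summary(tune.rad)$best.parameters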
In this problem, you will use support vector approaches in order to predict whether a given car gets high or low gas mileage based on the Auto data set.
Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars with gas mileage below the median.
data(Auto)
Auto<-Auto %>% mutate(mpg.high = ifelse(mpg>median(mpg),1,0)) %>%
mutate(mpg.high=factor(mpg.high), cylinders=factor(cylinders))
Auto<-Auto %>% dplyr::select(-mpg,-name)
Auto %>% sample_n(6)
## cylinders displacement horsepower weight acceleration year origin mpg.high
## 1 4 97 60 1834 19.0 71 2 1
## 2 4 79 58 1825 18.6 77 2 1
## 3 8 350 175 4100 13.0 73 1 0
## 4 6 258 110 2962 13.5 71 1 0
## 5 4 90 75 2108 15.5 74 2 1
## 6 4 134 96 2702 13.5 75 3 1
Fit a support vector classifier to the data with various values of cost, in order to predict whether a car gets high or low gas mileage. Report the cross-validation errors associated with different values of this parameter. Comment on your results.
set.seed(123)
tune.out<-tune(svm, mpg.high ~ ., data = Auto, kernel = "linear", ranges = list(cost = c(1,5,10,25)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 5
##
## - best performance: 0.07865385
##
## - Detailed performance results:
## cost error dispersion
## 1 1 0.08641026 0.04893284
## 2 5 0.07865385 0.05317556
## 3 10 0.07865385 0.05317556
## 4 25 0.07865385 0.05317556
#summary(tune.out)$performances
#summary(tune.out)$best.model
best.par<-summary(tune.out)$best.parameters
best.per<-summary(tune.out)$best.performance
10-fold cross-validation selects cost equal to 5. The corresponding CV error is 0.0786538.
Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of gamma and degree and cost. Comment on your results.
Radial
set.seed(123)
tune.out.rad<-tune(svm, mpg.high ~ ., data = Auto, kernel = "radial",
ranges = list(cost = c(1,5,10,25),
gamma = 2^(-3:3)))
summary(tune.out.rad)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 5 0.25
##
## - best performance: 0.06096154
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1 0.125 0.09160256 0.04304211
## 2 5 0.125 0.07621795 0.04867524
## 3 10 0.125 0.07365385 0.04480097
## 4 25 0.125 0.08384615 0.05014207
## 5 1 0.250 0.07634615 0.03557092
## 6 5 0.250 0.06096154 0.04311842
## 7 10 0.250 0.06615385 0.05089670
## 8 25 0.250 0.08397436 0.06066946
## 9 1 0.500 0.07371795 0.03988464
## 10 5 0.500 0.07365385 0.05975763
## 11 10 0.500 0.07871795 0.05459458
## 12 25 0.500 0.09416667 0.04722024
## 13 1 1.000 0.07628205 0.04241762
## 14 5 1.000 0.08391026 0.05149746
## 15 10 1.000 0.08647436 0.05174863
## 16 25 1.000 0.08903846 0.05324797
## 17 1 2.000 0.08397436 0.04088294
## 18 5 2.000 0.09679487 0.04065897
## 19 10 2.000 0.09423077 0.04280336
## 20 25 2.000 0.09935897 0.04185417
## 21 1 4.000 0.09929487 0.04667955
## 22 5 4.000 0.09679487 0.03882073
## 23 10 4.000 0.09679487 0.04241762
## 24 25 4.000 0.09935897 0.04007080
## 25 1 8.000 0.12743590 0.06693855
## 26 5 8.000 0.12493590 0.07469619
## 27 10 8.000 0.12493590 0.07469619
## 28 25 8.000 0.12493590 0.07469619
#summary(tune.out.rad)$performances
#summary(tune.out.rad)$best.model
best.par.rad<-summary(tune.out.rad)$best.parameters
best.per.rad<-summary(tune.out.rad)$best.performance
With the radial kernel, 10-fold cross-validation selects cost equal to 5 and gamma equal to 0.25. The corresponding CV error is 0.0609615.
Polynomial
set.seed(123)
tune.out.pol<-tune(svm, mpg.high ~ ., data = Auto, kernel = "polynomial",
ranges = list(cost = c(1,5,10,25),
degree=c(2,3,4)))
summary(tune.out.pol)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost degree
## 5 3
##
## - best performance: 0.06096154
##
## - Detailed performance results:
## cost degree error dispersion
## 1 1 2 0.09410256 0.04575425
## 2 5 2 0.08134615 0.04550073
## 3 10 2 0.07884615 0.04176353
## 4 25 2 0.07108974 0.05400402
## 5 1 3 0.09929487 0.04011231
## 6 5 3 0.06096154 0.04478058
## 7 10 3 0.06096154 0.04778157
## 8 25 3 0.06608974 0.04935338
## 9 1 4 0.14275641 0.08367018
## 10 5 4 0.10692308 0.05017119
## 11 10 4 0.10685897 0.04978224
## 12 25 4 0.08897436 0.04753954
#summary(tune.out.pol)$performances
#summary(tune.out.pol)$best.model
best.par.pol<-summary(tune.out.pol)$best.parameters
best.per.pol<-summary(tune.out.pol)$best.performance
With the polynomial kernel, 10-fold cross-validation selects cost equal to 5 and degree equal to 3. The corresponding CV error is 0.0609615.
summary(tune.out)$performance %>% ggplot(mapping=aes(x=cost,y=error)) + geom_point() + geom_line()
summary(tune.out.pol)$performance %>% ggplot(mapping=aes(x=cost,y=error, col=factor(degree))) + geom_point() + geom_line() + labs(color="Degree")
summary(tune.out.rad)$performance %>% ggplot(mapping=aes(x=cost,y=error, col=factor(gamma))) + geom_point() + geom_line() + labs(color="Gamma")
svm.linear<-svm(mpg.high ~ ., data = Auto, kernel = "linear",
cost = best.par[1,1])
svm.poly<-svm(mpg.high ~ ., data = Auto, kernel = "polynomial",
cost = best.par.pol[1,1], degree = best.par.pol[1,2])
svm.radial<-svm(mpg.high ~ ., data = Auto, kernel = "radial",
cost = best.par.rad[1,1], gamma = best.par.rad[1,2])
The plots above show how the error (estimated prediction error based on 10-fold cross-validation) varies with the parameters cost, gamma and degree.
The support vector classifier with a linear kernel yields its lowest error for cost equal to 5 or higher.
The SVM with a polynomial kernel yields its lowest error with degree equal to 3, across all cost values.
The SVM with a radial kernel yields its lowest error with gamma equal to 0.25 and cost equal to 5; however, as cost increases the curves (one per gamma value) tend to converge to similar error values.
# plot(svm.linear, Auto, horsepower~year)
name.plot<-names(Auto)[!(names(Auto) %in% c("cylinders","mpg.high","origin"))]
choose(length(name.plot),2) # number of plot for each model
## [1] 10
name.plot
## [1] "displacement" "horsepower" "weight" "acceleration" "year"
plot(svm.linear, Auto, displacement~horsepower)
plot(svm.linear, Auto, displacement~weight)
plot(svm.linear, Auto, displacement~acceleration)
plot(svm.linear, Auto, displacement~year)
plot(svm.linear, Auto, weight~year) #
plot(svm.linear, Auto, weight~acceleration)
plot(svm.linear, Auto, weight~horsepower)
plot(svm.linear, Auto, acceleration~horsepower)
plot(svm.linear, Auto, acceleration~year)
plot(svm.linear, Auto, year~horsepower)
Explain how the cost parameter works. What happens when we increase it?
Answer:
The cost parameter allows us to specify how "expensive" it is to misclassify an observation or violate the margin, where the margin is the distance from the decision boundary to the closest data points. In order to be more robust to individual observations/outliers, the support vector classifier allows some training data to be misclassified. In the optimization problem this is expressed by the following constraint:
\(\sum_{i=1}^n \epsilon_i \leq C\)
Here we are setting a non-negative "budget" \(C\), where the \(\epsilon_i\) are slack variables that allow individual observations to be on the wrong side of the margin or of the hyperplane; \(C\) puts a cap on the total amount of misclassification/violation.
When we code this in R we do not set such a budget directly. Instead, the cost argument specifies how "expensive" it is to misclassify a training observation or violate the margin. The higher the cost, the narrower the margin and the fewer misclassifications/violations there are on the training set, since each violation is "expensive" in terms of the optimization. This also leads to a decrease in the number of support vectors.
The same logic can be extended to support vector machines.
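A small self-contained sketch of this behaviour (simulated toy data and an arbitrary cost grid; the exact numbers will vary): as cost increases, the number of support vectors and the number of training misclassifications both tend to drop.
library(e1071)
set.seed(1)
x<-matrix(rnorm(40), ncol = 2)
grp<-c(rep("A", 10), rep("B", 10))
x[grp == "B", ]<-x[grp == "B", ] + 1.5        # two partially overlapping classes
dat<-data.frame(x1 = x[,1], x2 = x[,2], y = factor(grp))
sapply(c(0.01, 0.1, 1, 10, 100), function(C) {
  fit<-svm(y~., data = dat, kernel = "linear", cost = C)
  c(cost = C, n.SV = fit$tot.nSV, train.err = mean(fit$fitted != dat$y))
})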
This problem involves the OJ data set which is part of the ISLR package.
Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
data(OJ)
set.seed(777)
train<-sample(1:nrow(OJ),800)
train.set<-OJ[train,]
test.set<-OJ[-train,]
Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
svc.fit<-svm(Purchase~., data=train.set, kernel= "linear", cost=0.01, scale=F)
summary(svc.fit)
##
## Call:
## svm(formula = Purchase ~ ., data = train.set, kernel = "linear",
## cost = 0.01, scale = F)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.01
##
## Number of Support Vectors: 618
##
## ( 310 308 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors; only these observations affect the support vector classifier.
In our case 618 of the 800 training observations are support vectors: 310 belong to class CH and 308 to class MM. The percentage of support vectors is quite high. This is a direct consequence of the cost parameter, which tells the model how "expensive" it is to misclassify an observation or violate the margin: since the cost value here is very low, it makes sense that almost 80% of the observations end up as support vectors.
What are the training and test error rates?
lin.trn.er<-mean(svc.fit$fitted != train.set$Purchase) # training error
lin.trn.er
## [1] 0.24
lin.tst.er<-mean(predict(svc.fit, test.set) != test.set$Purchase) # test error
lin.tst.er
## [1] 0.262963
Training error = 0.24
Test error = 0.262963
Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
set.seed(4234)
tune.out<-tune(svm, Purchase~., data=train.set, kernel = "linear",
ranges = list(cost = 10^seq(-2,1, by=0.5)))
tune.out %>% summary()
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 3.162278
##
## - best performance: 0.165
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.17750 0.02934469
## 2 0.03162278 0.17000 0.03073181
## 3 0.10000000 0.17125 0.02703521
## 4 0.31622777 0.16750 0.02898755
## 5 1.00000000 0.16625 0.02638523
## 6 3.16227766 0.16500 0.02687419
## 7 10.00000000 0.16500 0.03050501
Compute the training and test error rates using this new value for cost.
best.cost<-summary(tune.out)$best.parameters[1,1]
svc.best<-svm(Purchase~. ,data=train.set, kernel="linear", cost = best.cost)
lin.trn.er.bc<-mean(svc.best$fitted != train.set$Purchase) # training error
lin.trn.er.bc
## [1] 0.155
lin.tst.er.bc<-mean(predict(svc.best, test.set) != test.set$Purchase) # test error
lin.tst.er.bc
## [1] 0.1814815
Training error = 0.155
Test error = 0.1814815
Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for gamma.
### b)
# Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
svc.rad<-svm(Purchase~., data=train.set, kernel= "radial", cost=0.01, scale=F)
summary(svc.rad)
##
## Call:
## svm(formula = Purchase ~ ., data = train.set, kernel = "radial",
## cost = 0.01, scale = F)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.01
##
## Number of Support Vectors: 633
##
## ( 323 310 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
### c)
# What are the training and test error rates?
rad.trn.er<-mean(svc.rad$fitted != train.set$Purchase)
rad.trn.er # training error with cost = 0.01
## [1] 0.3875
rad.tst.er<-mean(predict(svc.rad, test.set) != test.set$Purchase)
rad.tst.er # test error with cost = 0.01
## [1] 0.3962963
### d)
# Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
set.seed(4234)
tune.out.rad<-tune(svm, Purchase~., data=train.set, kernel = "radial",
ranges = list(cost = 10^seq(-2,1, by=0.5)))
tune.out.rad %>% summary()
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.3162278
##
## - best performance: 0.16625
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.38750 0.08354141
## 2 0.03162278 0.37000 0.09721825
## 3 0.10000000 0.17750 0.02554952
## 4 0.31622777 0.16625 0.01671867
## 5 1.00000000 0.16875 0.02651650
## 6 3.16227766 0.17250 0.03322900
## 7 10.00000000 0.17500 0.02946278
best.cost.rad<-summary(tune.out.rad)$best.parameters[1,1]
best.cost.rad # cost value yielding the lowest error
## [1] 0.3162278
### e)
# Compute the training and test error rates using this new value for cost.
svc.rad.bc<-svm(Purchase~., data = train.set, kernel = "radial",
cost = best.cost.rad, scale = F)
rad.trn.er.bc<-mean(svc.rad.bc$fitted != train.set$Purchase)
rad.trn.er.bc # training error with cost value from cross validation
## [1] 0.28625
rad.tst.er.bc<-mean(predict(svc.rad.bc, test.set) != test.set$Purchase)
rad.tst.er.bc # test error with cost value from cross validation
## [1] 0.3185185
With cost = 0.01 the radial kernel uses 633 support vectors: 323 from class CH and 310 from class MM. Compared with the linear kernel, both the training and test error rates increase.
Training error = 0.28625
Test error = 0.3185185
Repeat parts (b) through (e) using a support vector machine with a polynomial kernel. Set degree=2.
### b)
# Fit a support vector classifier to the training data using cost=0.01, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained.
svc.pol<-svm(Purchase~., data=train.set, kernel= "polynomial", cost=0.01, scale=F, degree = 2)
summary(svc.pol)
##
## Call:
## svm(formula = Purchase ~ ., data = train.set, kernel = "polynomial",
## cost = 0.01, degree = 2, scale = F)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.01
## degree: 2
## coef.0: 0
##
## Number of Support Vectors: 326
##
## ( 162 164 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
### c)
# What are the training and test error rates?
pol.trn.er<-mean(svc.pol$fitted != train.set$Purchase)
pol.trn.er # training error with cost = 0.01
## [1] 0.155
pol.tst.er<-mean(predict(svc.pol, test.set) != test.set$Purchase)
pol.tst.er # test error with cost = 0.01
## [1] 0.1814815
### d)
# Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
set.seed(4234)
tune.out.pol<-tune(svm, Purchase~., data=train.set, kernel = "polynomial",
ranges = list(cost = 10^seq(-2,1, by=0.5)), degree = 2)
tune.out.pol %>% summary()
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.1725
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01000000 0.38875 0.08174529
## 2 0.03162278 0.35125 0.06832571
## 3 0.10000000 0.31375 0.06547105
## 4 0.31622777 0.20500 0.04609772
## 5 1.00000000 0.19375 0.03397814
## 6 3.16227766 0.18000 0.02838231
## 7 10.00000000 0.17250 0.02934469
best.cost.pol<-summary(tune.out.pol)$best.parameters[1,1]
best.cost.pol # cost value yielding the lowest error
## [1] 10
### e)
# Compute the training and test error rates using this new value for cost.
svc.pol.bc<-svm(Purchase~., data=train.set, kernel= "polynomial", cost=best.cost.pol, scale=F, degree = 2)
pol.trn.er.bc<-mean(svc.pol.bc$fitted != train.set$Purchase)
pol.trn.er.bc # training error with cost value from cross validation
## [1] 0.15875
pol.tst.er.bc<-mean(predict(svc.pol.bc, test.set) != test.set$Purchase)
pol.tst.er.bc # test error with cost value from cross validation
## [1] 0.1740741
The polynomial kernel uses fewer support vectors than the other two kernels: 326 out of 800, of which 162 belong to class CH and 164 to class MM. The SVM with a polynomial kernel performs slightly better than the linear one and considerably better than the radial kernel.
Training error = 0.15875
Test error = 0.1740741
Overall, which approach seems to give the best results on this data?
h<-rbind(Linear = lin.tst.er.bc,
Radial = rad.tst.er.bc,
Polynomial = pol.tst.er.bc) %>% data.frame()
colnames(h)<-"Prediction_error"
kable(h, caption = "Kernel") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
|  | Prediction_error |
|---|---|
| Linear | 0.1814815 |
| Radial | 0.3185185 |
| Polynomial | 0.1740741 |
bestmodel<-subset(h, h$Prediction_error==min(h$Prediction_error)) %>% rownames()
bestmodel
## [1] "Polynomial"
The polynomial kernel is the best model here, as it yields the lowest test error. Surprisingly, the radial kernel performs the worst; note that the final radial fit was run with scale = F while tune() used the default scaling, which likely contributes to this, and tuning gamma as well would probably also improve its test error.
Does it make sense to try a radial kernel here? Why?
Answer:
Yes, it makes sense: the data set has 18 variables (17 predictors plus the response), so we cannot directly visualize the features in a two-dimensional plot (unless we use PCA, for example, or variable selection) and verify that the CH and MM observations are separated by a straight line (or a plane, if there were three features). If we could see that the observations are clearly separated by a line/plane, then a linear kernel would likely be the best choice and there would be no need to try the radial kernel.
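As a rough visual check of this point (a sketch only; projecting onto two principal components discards information), assuming the OJ data loaded above, we can plot the observations on the first two PCs of the numeric design matrix, coloured by Purchase:
# Numeric design matrix of the 17 predictors (factors expanded to dummy variables)
X<-model.matrix(Purchase~., data = OJ)[,-1]
pc<-prcomp(X, scale. = TRUE)
plot(pc$x[,1], pc$x[,2], col = OJ$Purchase, pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(OJ$Purchase), col = 1:2, pch = 19)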
In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the sdev output of the prcomp() function.
On the USArrests data, calculate PVE in two ways:
Using the sdev output of the prcomp() function, as was done in Section 10.2.3.
data("USArrests")
names(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
dim(USArrests) # PC = min(n-1,p)
## [1] 50 4
pr.out<-prcomp(USArrests , scale = TRUE)
# scale = TRUE scales the variables to have standard deviation 1. This is necessary since the 4 variables have very different ranges and units (e.g. UrbanPop is a percentage).
# center = TRUE is the default: the function centers the variables to have mean zero, i.e. it computes the mean of each variable (the centroid of the data) and shifts the data so that the centroid coincides with the origin.
pr.var<-pr.out$sdev^2 # Variance of each principal component.
# For each PC this equals the sum of squared distances from the origin of the observations
# projected onto that PC, divided by n-1 (i.e. the eigenvalue of the covariance matrix).
# PC1 is the direction with the largest such sum of squared projected distances.
pve<-pr.var/sum(pr.var)
pve # Proportion of variance explained by each principal component.
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
By applying Equation 10.8 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.
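For reference, Equation 10.8 (as recalled from the text) defines the proportion of variance explained by the \(m\)th principal component of the centered (and here also scaled) data \(x_{ij}\), with loadings \(\phi_{jm}\) and scores \(z_{im}=\sum_{j=1}^{p}\phi_{jm}x_{ij}\), as
\[
\mathrm{PVE}_m \;=\; \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}} \;=\; \frac{\sum_{i=1}^{n} z_{im}^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n}x_{ij}^{2}},
\]
which is what the chunk below computes from the loadings and the standardized data.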
loadings<-pr.out$rotation
# The rotation matrix provides the principal component loadings; each column of pr.out$rotation contains the corresponding principal component loading vector
# It is a matrix whose columns contain the eigenvectors (unit vectors)
# sum(pr.out$rotation[,1]^2) = 1, same for the other PCs
# The eigenvectors are the linear combinations of the original variables that define the new axes; they describe how the variables "contribute" to each component. PCA can be thought of as a way to construct new axes that point in the directions of maximal variance (in the original variable space), as measured by the eigenvalues.
# The PC scores are the coordinates of each observation with respect to the principal component axes, i.e. where each data point lies along each new axis relative to the centroid of the data.
stnd.df<-scale(USArrests) # centers and scales the columns of a numeric matrix.
# Center: X (matrix with all the observations: dataframe) - Xbar(vector with mean value for each col/var)
# Scale: X/st.dev where st.dev is the vector with the standard dev value for each col/var.
pve2 = rep(NA, 4)
zetas<-stnd.df %*% loadings # Data points in the new coordinates system spanned out by the PCs.
var.pc<-apply(zetas,2,FUN = function(x) sum(x^2)) # sum of square distances from the origin (new "centered" origin)
pve.b<-var.pc/sum(var.pc)
pve.b # Proportion of variance explained by each principal component.
## PC1 PC2 PC3 PC4
## 0.62006039 0.24744129 0.08914080 0.04335752
Consider the USArrests data. We will now perform hierarchical clustering on the states.
Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
hc.complete<-hclust(dist(USArrests), method="complete")
# Compute all pairwise dissimilarities (Euclidean distance) between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
# The first step is to calculate the distance between every point with every other point. The result is a distance matrix, which can be computed with the dist() function.
plot(hc.complete, xlab = "", sub="Complete Linkage", hang = -1,cex = 0.75)
Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
cutree(hc.complete, k=3) # With k=3 I specify I many cluster I want.
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
ct<-cutree(hc.complete, 3) %>% table() %>% data.frame()
ct<-dplyr::rename(ct, Class = .)
ct
## Class Freq
## 1 1 16
## 2 2 14
## 3 3 20
The cutree() function cuts the dendrogram so that we obtain k clusters; it returns the cluster label of each observation (state) for the chosen cut of the dendrogram.
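To answer "which states belong to which clusters" explicitly, a short sketch (assuming hc.complete from the chunk above) that lists the state names per cluster:
split(rownames(USArrests), cutree(hc.complete, k = 3))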
Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
USArrests.sc<-scale(USArrests)
hc.complete.sc<-hclust(dist(USArrests.sc), method = "complete")
plot(hc.complete.sc, xlab = "", sub = "", hang = -1, cex = 0.75)
ct.sc<-cutree(hc.complete.sc, 3) %>% table() %>% data.frame()
ct.sc<-dplyr::rename(ct.sc, Class = .)
ct.sc
## Class Freq
## 1 1 8
## 2 2 11
## 3 3 31
What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed?
Provide a justification for your answer.
In this case scaling the variables changes the clustering output; the scale of the dendrogram's height axis changes as well.
The function scale(x, center = TRUE, scale = TRUE) centers and scales the columns of the data frame:
Center: subtract from each column of X (the matrix of observations) its mean (the vector Xbar of column means).
Scale: divide each column by its standard deviation.
We should scale the variables when their units of measurement are not comparable or their scales are not consistent.
For example, in this case the variable UrbanPop is expressed as a percentage while the others are counts, so the ranges are very different and it is wise to scale.
Scaling is also needed when the units of measurement are inconsistent: if Assault were expressed as cases per 100 citizens and Rape as cases per 100,000 citizens, scaling would again be required.
More generally, scaling matters because the dissimilarity between observations is measured with the Euclidean distance, so a variable with a much larger numeric range (due only to the unit of measurement used) would dominate that measurement.
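A minimal numerical illustration of this point (assuming USArrests is loaded): Assault has a standard deviation roughly an order of magnitude larger than the other variables, so without scaling it dominates the Euclidean distances.
apply(USArrests, 2, sd)                      # very different spreads
as.matrix(dist(USArrests))[1:3, 1:3]         # unscaled pairwise distances
as.matrix(dist(scale(USArrests)))[1:3, 1:3]  # distances after scaling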
In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
Hint: There are a number of functions in R that you can use to generate data. One example is the rnorm() function; runif() is another option. Be sure to add a mean shift to the observations in each class so that there are three distinct classes.
n<-60
p<-50
c<-3
set.seed(2)
x1<-matrix(rnorm((n/c)*p, mean = 0, sd = 0.1), ncol = p) %>%
  data.frame(Class = "A")
x2<-matrix(rnorm((n/c)*p, mean = 3, sd = 0.6), ncol = p) %>%
  data.frame(Class = "B")
x3<-matrix(rnorm((n/c)*p, mean = 7, sd = 1), ncol = p) %>%
  data.frame(Class = "C")
x<-rbind(x1,x2,x3)
x[sample(1:n,5),sample(1:p,6)]
## X41 X17 X47 X1 X8 X45
## 54 6.18867869 7.351543751 8.3784004 5.13129919 5.57463495 5.55168039
## 49 5.92711167 7.402272088 6.4773200 8.00098243 6.69375726 7.93824121
## 44 9.32816114 10.246016418 7.2988462 8.04093832 5.96825350 7.24870630
## 12 -0.09875221 -0.008767276 -0.1030572 0.09817528 -0.01694232 0.00991457
## 51 7.10684099 6.920039601 7.0448973 7.44346277 8.47999033 6.34946173
Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
pr.out<-prcomp(dplyr::select(x,-Class))
plot(pr.out$x[,1:2], col=x$Class, xlab="PC1", ylab="PC2", pch=19)
ggbiplot(pr.out, choices = 1:2, groups = x$Class, var.axes = F,
  ellipse = TRUE) + geom_point(aes(col = x$Class), size = 4)
Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
Hint: You can use the table() function in R to compare the true class labels to the class labels obtained by clustering. Be careful how you interpret the results: K-means clustering will arbitrarily number the clusters, so you cannot simply check whether the true class labels and clustering labels are the same.
km3.out<-kmeans(dplyr::select(x,-Class), 3, nstart=20)
# nstart = 20 runs the whole algorithm 20 times from different random initial assignments; kmeans() keeps the run with the lowest total within-cluster sum of squares.
table(km3.out$cluster, x$Class)
##
## A B C
## 1 0 0 20
## 2 20 0 0
## 3 0 20 0
Class A from the original data frame corresponds to cluster 2.
Class B from the original data frame corresponds to cluster 3.
Class C from the original data frame corresponds to cluster 1.
As we can see from the table, K-means groups the observations correctly: all class A observations fall in cluster 2, all class B observations fall in cluster 3, and all class C observations fall in cluster 1.
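Because K-means numbers the clusters arbitrarily, here is a small sketch of how the agreement can be quantified without relying on the numbering (a suggestion on my part, not part of the original analysis): count, for each cluster, the observations that fall in its majority class.
conf<-table(km3.out$cluster, x$Class) # confusion table of cluster vs true class
sum(apply(conf, 1, max)) / nrow(x) # 1 means a perfect match up to relabelling of the clusters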
Perform K-means clustering with K = 2. Describe your results.
km2.out<-kmeans(dplyr::select(x,-Class), 2, nstart=20)
table(km2.out$cluster, x$Class)
##
## A B C
## 1 0 0 20
## 2 20 20 0
Original classes A and B are merged into cluster 2, while class C forms its own cluster (cluster 1).
Now perform K-means clustering with K = 4, and describe your results.
km4.out<-kmeans(dplyr::select(x,-Class), 4, nstart=20)
table(km4.out$cluster, x$Class)
##
## A B C
## 1 0 0 9
## 2 0 0 11
## 3 20 0 0
## 4 0 20 0
Original class C is now split (almost evenly) between clusters 1 and 2, while classes A and B each keep their own cluster.
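As an optional check on the choice of K (not required by the exercise), the total within-cluster sum of squares can be computed for a range of K values; the “elbow” should appear around K = 3 for this simulation.
wss<-sapply(1:6, function(k) kmeans(dplyr::select(x,-Class), centers = k, nstart = 20)$tot.withinss) # within-SS for K = 1..6
plot(1:6, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS") # look for the elbow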
Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
km3pc.out<-kmeans(pr.out$x[,1:2], 3, nstart=20)
pr.out$x[,1:2] %>% data.frame() %>%
ggplot(mapping = aes(x=PC1, y=PC2, color=factor(km3pc.out$cluster))) + geom_point() + labs(color="Cluster")
table(km3pc.out$cluster, x$Class)
##
## A B C
## 1 0 0 20
## 2 20 0 0
## 3 0 20 0
Applying K-means clustering on PC1 and PC2 with K = 3 yields the same grouping as applying the method to the full original data set (see point c).
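To verify this claim without worrying about the arbitrary cluster numbering, I can cross-tabulate the two sets of cluster labels (an illustrative check): each PC-based cluster should map onto exactly one raw-data cluster.
table(PCs = km3pc.out$cluster, Raw = km3.out$cluster) # a one-to-one pattern confirms the partitions agree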
Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
km3sc.out<-kmeans(scale(dplyr::select(x,-Class), center = T, scale = T),
3, nstart=20)
km3sc.out$cluster
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
km3sc.out$betweenss # The between-cluster sum of squares with scaling
## [1] 2802.036
km3.out$betweenss # The between-cluster sum of squares with no scaling
## [1] 25149.31
table(km3sc.out$cluster, x$Class)
##
## A B C
## 1 0 20 0
## 2 0 0 20
## 3 20 0 0
Scaling the data doesn’t change the partition: the cluster assignments still match the class labels exactly, even though the between-cluster sum of squares is reported on a different scale.
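A quick cross-tabulation (an optional check) confirms that the scaled and unscaled fits produce the same partition up to relabelling of the clusters:
table(Scaled = km3sc.out$cluster, Unscaled = km3.out$cluster) # one non-zero cell per row/column means identical partitions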
On the book website, www.StatLearning.com, there is a gene expression data set (Ch10Ex11.csv) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.
Load in the data using read.csv(). You will need to select header=F.
g.df<-read.csv("Ch10Ex11.csv", header=F)
#view(g.df)
#summary(g.df)
dim(g.df)
## [1] 1000 40
# row: gene
# column: tissue sample
Apply hierarchical clustering to the samples using correlation based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?
dd<-as.dist(1 - cor(g.df))
# two observations are similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.
# if I use cor(t(g.df)) I end up clustering the genes, but I am interested in clustering the 40 samples.
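# Quick sanity check (illustrative, not required by the exercise): the
# correlation-based dissimilarity should be a 40 x 40 object (one entry per
# pair of tissue samples) with values between 0 and 2.
dim(as.matrix(dd))
range(dd)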
hc.comp.corr<-hclust(dd, method = "complete")
hc.sing.corr<-hclust(dd, method = "single")
hc.avg.corr<-hclust(dd, method = "average")
hc.cent.corr<-hclust(dd, method = "centroid")
plot(hc.comp.corr, xlab = "", sub = "Complete Linkage",cex = 0.75, hang = -1)
plot(hc.sing.corr, xlab = "", sub = "Single Linkage",cex = 0.75, hang = -1)
plot(hc.avg.corr, xlab = "", sub = "Average Linkage", cex = 0.75, hang= -1)
plot(hc.cent.corr, xlab = "", sub = "centroid", cex = 0.75)
The dendrograms suggest that there are two or three main groups. The choice of linkage affects the results obtained. Single linkage tends to yield trailing clusters: very large clusters onto which individual observations attach one-by-one. Complete and average linkage, on the other hand, tend to yield more balanced clusters, and for this reason they are generally preferred to single linkage.
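To answer the question directly, I can cut each dendrogram into two clusters and cross-tabulate the result against the known group membership (the first 20 samples are healthy, the second 20 diseased). This is an illustrative check; whether the two clusters line up with the groups can then be read off the table for each linkage.
true.group<-rep(c("Healthy", "Diseased"), each = 20) # columns 1-20 healthy, 21-40 diseased
table(Complete = cutree(hc.comp.corr, 2), True = true.group)
table(Single = cutree(hc.sing.corr, 2), True = true.group)
table(Average = cutree(hc.avg.corr, 2), True = true.group)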
Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.
pca.out<-prcomp(t(g.df))
plot(pca.out$x[,1:2], xlab="PC1", ylab="PC2", pch=19)
pca2<-pca.out$x %>%
data.frame() %>%
dplyr::select(PC1, PC2)
pca.kmeans<-kmeans(pca2,centers = 2)
ggbiplot(pca.out, groups = factor(pca.kmeans$cluster),var.axes = F, ellipse = TRUE) + geom_point(aes(col = factor(pca.kmeans$cluster)))
topgen.PC1<-abs(pca.out$rotation) %>% data.frame %>%
dplyr::select(PC1) %>% mutate(Gene = 1:1000) %>% arrange(desc(PC1))
topgen.PC1[1:10,] # Top 10 genes by absolute PC1 loading. Intuitively, these 10 genes should be the ones that differ the most between the two clusters.
## PC1 Gene
## 1 0.1142430 600
## 2 0.1103983 584
## 3 0.1083294 549
## 4 0.1081255 540
## 5 0.1074596 502
## 6 0.1065494 582
## 7 0.1062712 565
## 8 0.1061876 568
## 9 0.1041632 529
## 10 0.1038207 599
PCA tells us in which directions the data vary the most. For simplicity’s sake we consider only the first two PCs. Each PC is a linear combination of the original variables (genes), so the genes with the largest absolute weights (loadings) on PC1 and PC2 should be the ones that differ the most between the two groups.
This reasoning implies that we first perform a cluster analysis on PC1 and PC2, and then inspect the loadings of those components.
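An alternative, more direct approach (a suggestion on my part, not part of the analysis above) is to compare the mean expression of each gene in the two known groups and rank the genes by the absolute difference in means; a two-sample t-statistic per gene would be a natural refinement.
healthy.mean<-rowMeans(g.df[, 1:20]) # samples 1-20: healthy
diseased.mean<-rowMeans(g.df[, 21:40]) # samples 21-40: diseased
gene.diff<-data.frame(Gene = 1:1000, MeanDiff = abs(healthy.mean - diseased.mean))
head(gene.diff[order(gene.diff$MeanDiff, decreasing = TRUE), ], 10) # ten genes with the largest mean difference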