0.1 Executive Summary

- Goal and Background – What is the problem?
- Approach – What have you done?
- Major findings – What do you find and what is your conclusion?

0.2 Load the Dataset

0.2.1 Randomly sample a training data set that contains 75% of the original data points

# load Boston data
library(MASS)
data(Boston)
# randomly split into 75% training and 25% testing
index <- sample(nrow(Boston), nrow(Boston) * 0.75)
boston.train <- Boston[index, ]
boston.test <- Boston[-index, ]
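
For reproducibility, the split could be seeded before sampling; a minimal sketch (the seed value is an arbitrary assumption, not from the original report):

set.seed(2023)  # hypothetical seed; any fixed value makes the 75/25 split reproducible
index <- sample(nrow(Boston), nrow(Boston) * 0.75)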

0.2.2 Fit a linear regression (a baseline sketch follows this outline)

0.2.3 Tree models: regression tree, bagging, random forests, and boosting trees.

0.2.4 Compare model performance based on fits of the training data (in-sample), then test the out-of-sample performance: using the final model built from the 75% training data, predict the remaining 25% testing data and compare the model fits.
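
Section 0.2.2 above calls for a linear-regression baseline; a minimal sketch of how it could be fit and scored under the same split (the variable names are illustrative, not from the original report):

# hypothetical OLS baseline on all predictors
boston.lm <- lm(medv ~ ., data = boston.train)

# in-sample MSE
mean((boston.train$medv - fitted(boston.lm))^2)

# out-of-sample MSE on the 25% holdout
boston.lm.pred <- predict(boston.lm, newdata = boston.test)
mean((boston.test$medv - boston.lm.pred)^2)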

0.3 Bagging

library(ipred)
library(rpart)

Bagging fits 100 regression trees on bootstrap samples and averages their predictions to improve accuracy.

# bagging with 100 bootstrap replications
boston.bag <- bagging(medv ~ ., data = boston.train, nbagg = 100)
boston.bag
## 
## Bagging regression trees with 100 bootstrap replications 
## 
## Call: bagging.data.frame(formula = medv ~ ., data = boston.train, nbagg = 100)
summary(boston.bag)
##        Length Class      Mode   
## y      379    -none-     numeric
## X       13    data.frame list   
## mtrees 100    -none-     list   
## OOB      1    -none-     logical
## comb     1    -none-     logical
## call     4    -none-     call

Prediction on the testing sample.

boston.bag.pred <- predict(boston.bag, newdata = boston.test)
mean((boston.test$medv - boston.bag.pred)^2)  # out-of-sample MSE
## [1] 11.09548

In-sample prediction on the training sample.

boston.bag.pred_train <- predict(boston.bag, newdata = boston.train)
mean((boston.train$medv - boston.bag.pred_train)^2)  # in-sample MSE
## [1] 10.90266

Comparing with a single tree.

# single regression tree for comparison (rpart already loaded above)
boston.tree <- rpart(medv ~ ., data = boston.train)
boston.tree.pred <- predict(boston.tree, newdata = boston.test)
mean((boston.test$medv - boston.tree.pred)^2)  # single-tree test MSE
## [1] 15.86878

Bagging performs better than the single tree (test MSE 11.10 vs. 15.87). The bootstrap resamples the data with replacement, and averaging predictions over many resampled trees reduces variance. Next, check how the test MSE changes with the number of bootstrap replications.

ntree <- seq(1, 200, 10)
MSE.test <- rep(0, length(ntree))
for (i in 1:length(ntree)) {
  boston.bag1 <- bagging(medv ~ ., data = boston.train, nbagg = ntree[i])
  boston.bag.pred1 <- predict(boston.bag1, newdata = boston.test)
  MSE.test[i] <- mean((boston.test$medv - boston.bag.pred1)^2)
}
# test MSE vs. number of bootstrap replications
plot(ntree, MSE.test, type = "l", col = 2, lwd = 2, xaxt = "n")
axis(1, at = ntree, las = 1)

0.4 Out-of-bag (OOB) prediction for the bagged regression tree.

# coob = T requests the out-of-bag estimate of prediction error
boston.bag.oob<- bagging(medv~., data = boston.train, coob=T, nbagg=100)
boston.bag.oob
## 
## Bagging regression trees with 100 bootstrap replications 
## 
## Call: bagging.data.frame(formula = medv ~ ., data = boston.train, coob = T, 
##     nbagg = 100)
## 
## Out-of-bag estimate of root mean squared error:  4.2074
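
For comparability with the MSE values reported above, the OOB root mean squared error can be squared; this estimate uses only the training sample, leaving the 25% test set untouched (a quick check, not part of the original output):

4.2074^2  # OOB MSE estimate, approximately 17.70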

0.5 Bagging for classification trees.

For classification trees, performance is measured by misclassification error, and bagging again outperforms a single tree; a sketch follows below.
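
The report shows no code for this step; a minimal sketch, assuming a hypothetical binary label medv.hi = (medv > 25) so that ipred's bagging builds classification trees:

# hypothetical binary label for illustration only
boston.train$medv.hi <- factor(boston.train$medv > 25)
boston.test$medv.hi <- factor(boston.test$medv > 25)

# bagged classification trees (ipred dispatches on the factor response)
boston.bag.class <- bagging(medv.hi ~ . - medv, data = boston.train, nbagg = 100)
bag.class.pred <- predict(boston.bag.class, newdata = boston.test)
mean(bag.class.pred != boston.test$medv.hi)  # misclassification error, bagging

# single classification tree for comparison
boston.tree.class <- rpart(medv.hi ~ . - medv, data = boston.train, method = "class")
tree.class.pred <- predict(boston.tree.class, newdata = boston.test, type = "class")
mean(tree.class.pred != boston.test$medv.hi)  # misclassification error, single tree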
