▲ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
▲ Trees can easily handle qualitative predictors without the need to create dummy variables.
▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.
**Trees combined with many other trees (via bagging, random forests, and boosting) become very powerful.**
Decision trees suffer from high variance.
This means that if we split the training data into two parts at random and fit a decision tree to each half, the results that we get could be quite different. In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets; linear regression tends to have low variance if the ratio of n to p is moderately large.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
Averaging a set of observations reduces variance: given n independent observations each with variance \(\sigma^2\), their mean has variance \(\sigma^2/n\). Hence a natural way to reduce the variance, and thereby increase the prediction accuracy, of a statistical learning method is to take many training sets from the population, build a separate prediction model on each training set, and average the resulting predictions.
In practice we do not have many training sets, so we repeatedly sample with replacement from the single training set (bootstrapping), fit a model to each bootstrap sample, and combine the predictions (see the sketch below):
In regression: average the predictions.
In classification: take the majority vote.
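A hand-rolled sketch of bagged regression trees, assuming hypothetical data frames train_data (with response y) and test_data; the randomForest call used later in this section does the same thing more efficiently:

library(tree)
B <- 100                                            # number of bootstrap samples / trees
preds <- matrix(NA, nrow(test_data), B)             # one column of test predictions per tree
for (b in 1:B) {
  idx <- sample(nrow(train_data), replace = TRUE)   # bootstrap sample of the training rows
  fit <- tree(y ~ ., data = train_data[idx, ])      # grow a tree on that sample
  preds[, b] <- predict(fit, newdata = test_data)
}
yhat.bag <- rowMeans(preds)                         # regression: average the B predictions
# (for classification, one would instead take a majority vote across the B trees)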
One can show that on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the ith observation using each of the trees in which that observation was OOB.
The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets for which cross-validation would be computationally onerous.
It can be shown that with B sufficiently large, the OOB error is virtually equivalent to the leave-one-out cross-validation error.
Bagging improves prediction accuracy at the expense of interpretability, since the result can no longer be displayed as a single tree.
Random forests provide an improvement over bagged trees by way of a small random tweak that decorrelates the trees.
Steps:
1. Build a number of decision trees on bootstrapped training samples.
2. When building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.
A fresh sample of m predictors is taken at each split, and typically we choose \(m \approx \sqrt{p}\). This is done so that a single important variable does not drive the whole relationship, and so that the trees look different from one another.
Suppose the data contain one very strong predictor: with bagging, most trees will use it in the top split, so the bagged trees will look similar and be highly correlated, and averaging them will not reduce the variance by much. Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average \((p − m)/p\) of the splits will not even consider the strong predictor, and so other predictors will have more of a chance.
In building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors.
Note: Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors.
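As a concrete example of the \((p − m)/p\) calculation above: the Boston data used below has \(p = 13\) predictors, so \(m \approx \sqrt{13} \approx 4\), and roughly \((13 − 4)/13 \approx 69\%\) of the splits never even get to consider the strongest predictor.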
In bagging we make bootstrap copies of the training data, and each tree is built on its own bootstrap data set, independently of the other trees.
Boosting works in a similar way, except that the trees are grown sequentially: each tree uses information from the previously grown trees.
Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.
Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response.
The shrinkage parameter λ slows the process down even further, allowing more and differently shaped trees to attack the residuals.
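Written out, the residual-fitting scheme described above is the standard boosting algorithm for regression trees:
1. Set \(\hat{f}(x) = 0\) and residuals \(r_i = y_i\) for every observation in the training set.
2. For \(b = 1, 2, \dots, B\): fit a tree \(\hat{f}^b\) with d splits to the training data \((X, r)\); update the model, \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\); and update the residuals, \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\).
3. Output the boosted model, \(\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)\).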
library(gbm)            # boosting
library(MASS)           # contains the Boston housing data
library(tree)           # single regression trees
data("Boston")
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)   # indices of a random half used for training
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
# bagging: a random forest with mtry = 13 uses all p = 13 predictors at every split
library(randomForest)
set.seed(1)
bag.boston = randomForest(medv ~ ., data = Boston, subset = train, mtry = 13, importance = TRUE)
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.15723
## % Var explained: 86.49
plot(bag.boston)   # out-of-bag MSE as a function of the number of trees
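A true random forest uses a smaller mtry (the randomForest default for regression is p/3; \(\sqrt{p}\) is common for classification). The fit below is a sketch with the illustrative choice mtry = 6, not something taken from the output above; note also that the "Mean of squared residuals" reported above is an out-of-bag estimate.

# random forest: only 6 of the 13 predictors are candidates at each split
set.seed(1)
rf.boston = randomForest(medv ~ ., data = Boston, subset = train, mtry = 6, importance = TRUE)
rf.boston
importance(rf.boston)   # variable importance: %IncMSE and IncNodePurity
varImpPlot(rf.boston)   # plot the two importance measures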
# boosted regression trees with gbm; interaction.depth limits the depth of each tree,
# and the shrinkage parameter is left at the package default here
boost.boston = gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian", n.trees = 5000, interaction.depth = 4)
boost.boston
## gbm(formula = medv ~ ., distribution = "gaussian", data = Boston[train,
## ], n.trees = 5000, interaction.depth = 4)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## There were 13 predictors of which 13 had non-zero influence.
summary(boost.boston)
## var rel.inf
## lstat lstat 45.6039091
## rm rm 31.6910080
## dis dis 6.5429510
## crim crim 3.8480451
## nox nox 2.5780827
## ptratio ptratio 2.3140760
## black black 1.8736901
## age age 1.7592798
## tax tax 1.5159821
## indus indus 1.2273867
## chas chas 0.8252505
## rad rad 0.2048336
## zn zn 0.0155052
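As a follow-up sketch using the gbm package: partial dependence plots for the two most influential variables above, and the test MSE on the held-out half of the data.

plot(boost.boston, i = "rm")      # partial dependence of medv on rm
plot(boost.boston, i = "lstat")   # partial dependence of medv on lstat
yhat.boost = predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - Boston[-train, "medv"])^2)   # test-set MSE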
**Remember that there are many other boosting implementations, such as AdaBoost, LightGBM, and XGBoost.**
SVMs have been shown to perform well in a variety of settings, and are often considered one of the best “out of the box” classifiers.
The support vector machine is a generalization of a simple and intuitive classifier called the maximal margin classifier.
Although there are extensions to more than two classes, support vector machines are intended for the binary classification setting.
In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p − 1.
The kernel trick implicitly maps the data into a higher-dimensional space so that it becomes separable. There are different kinds of kernels, such as linear, polynomial, and radial kernels.
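A minimal sketch of fitting an SVM with a radial kernel using the e1071 package, assuming a hypothetical data frame dat with a factor response y; the cost and gamma values are illustrative, and in practice they would be chosen by cross-validation (e.g. with tune()).

library(e1071)
svmfit = svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)   # radial-kernel SVM
summary(svmfit)
# cross-validate over a grid of cost and gamma values
tune.out = tune(svm, y ~ ., data = dat, kernel = "radial",
                ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
summary(tune.out)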