1 Tree Models:

1.0.1 Advantages and Disadvantages of Trees

▲ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!

▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.

▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).

▲ Trees can easily handle qualitative predictors without the need to create dummy variables.

▼ Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book.

**Trees combined with other trees (i.e., ensembles of trees) become very powerful.**

1.1 Bootstrap Aggregation, or Bagging

Decision trees suffer from high variance.

This means that if we split the training data into two parts at random and fit a decision tree to each half, the results could be quite different. In contrast, a procedure with low variance will yield similar results when applied repeatedly to distinct data sets; linear regression tends to have low variance if the ratio of n to p is moderately large.

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.

Averaging a set of observations reduces variance. Hence a natural way to reduce the variance, and thereby increase the prediction accuracy, of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.
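Why this works: given \(n\) independent observations \(Z_1, \dots, Z_n\), each with variance \(\sigma^2\), the variance of their mean is

\[
\mathrm{Var}(\bar{Z}) = \mathrm{Var}\!\left(\frac{1}{n} \sum_{i=1}^{n} Z_i\right) = \frac{\sigma^2}{n},
\]

which shrinks as more (independent) predictions are averaged.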

In practice, we draw bootstrap samples from the training set, fit a model to each sample, and combine the resulting predictions (a hand-rolled sketch follows the list below):

  • In regression: average the predictions

  • In classification: take a majority vote
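A minimal sketch of bagging regression trees by hand, using the Boston data from the lab later in this section (B = 100 bootstrap samples is an illustrative choice, not a tuned value):

library(tree)
library(MASS)
data("Boston")
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
test <- Boston[-train, ]

B <- 100
preds <- matrix(NA, nrow = nrow(test), ncol = B)
for (b in 1:B) {
  boot <- sample(train, length(train), replace = TRUE)  # bootstrap sample of the training rows
  fit <- tree(medv ~ ., data = Boston[boot, ])          # grow a tree on the bootstrap sample
  preds[, b] <- predict(fit, newdata = test)            # predict on the held-out half
}
bag.pred <- rowMeans(preds)                             # regression: average the B predictions
mean((bag.pred - test$medv)^2)                          # test MSE of the bagged predictions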

1.1.1 Out-of-Bag Error Estimation

One can show that on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the ith observation using each of the trees in which that observation was OOB.

The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets for which cross-validation would be computationally onerous.

It can be shown that with B sufficiently large, the OOB error is virtually equivalent to leave-one-out cross-validation error.
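With randomForest the OOB predictions and the OOB error estimate can be read directly off the fitted object. A sketch, assuming the same Boston data and train split as in the lab code below:

library(randomForest)
library(MASS)
data("Boston")
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

rf.fit <- randomForest(medv ~ ., data = Boston, subset = train)
oob.pred <- predict(rf.fit)                 # with no newdata, these are the OOB predictions
mean((oob.pred - Boston[train, "medv"])^2)  # OOB estimate of the test MSE
tail(rf.fit$mse, 1)                         # same estimate, stored per number of trees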

1.1.2 Variable Importance Measures:

Bagging improves prediction accuracy at the expense of interpretability. However, an overall summary of the importance of each predictor can still be obtained, using the total decrease in RSS (for regression trees) or in the Gini index (for classification trees) due to splits over that predictor, averaged over all B trees.
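With randomForest, importance() returns these measures and varImpPlot() plots them. A sketch, refitting the same bagged model (mtry = 13) that appears in the lab code below:

library(randomForest)
library(MASS)
data("Boston")
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
importance(bag.boston)   # %IncMSE and IncNodePurity for each predictor
varImpPlot(bag.boston)   # plot both importance measures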

1.2 Random Forests

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees.

Steps:

  1. We build a number of decision trees on bootstrapped training samples.

  2. When building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

  3. A fresh sample of m predictors is taken at each split, and typically we choose \(m ≈ √p\). This is done so that a single important variable does not drive the whole relationship, and so that the trees look different from one another (see the sketch after this list).
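A sketch of a random forest on the Boston data with \(m ≈ √p\) (here p = 13, so mtry = 4; the default of 500 trees is kept):

library(randomForest)
library(MASS)
data("Boston")
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

rf.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = 4, importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - Boston[-train, "medv"])^2)  # test MSE; compare with bagging (mtry = 13) in the lab below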

1.2.0.1 Tree Decorrelation

Suppose there is one very strong predictor in the data set. In bagging, most or all of the trees will use this predictor in the top split, so the bagged trees will look alike and their predictions will be highly correlated; averaging highly correlated quantities does not reduce variance by much. Random forests overcome this problem by forcing each split to consider only a subset of the predictors. Therefore, on average \((p − m)/p\) of the splits will not even consider the strong predictor, and so other predictors will have more of a chance.

In building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors.

Note: Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors.

1.3 Boosting

In bagging, we draw bootstrap copies of the training data and each tree is built on its own bootstrap data set, independently of the other trees.

  1. Boosting works in a similar way, except that the trees are grown sequentially.

  2. Each tree uses information from the trees grown before it.

  3. Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.

  4. Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response.

The shrinkage parameter λ slows the process down even further, allowing more and differently shaped trees to attack the residuals.
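In symbols (the boosting algorithm for regression trees): starting from \(\hat{f}(x) = 0\) and residuals \(r_i = y_i\), each of the B steps fits a small tree \(\hat{f}^b\) to the current residuals and adds a shrunken version of it to the model,

\[
\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x), \qquad
r_i \leftarrow r_i - \lambda \hat{f}^b(x_i),
\]

so that the final boosted model is \(\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)\).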

# load packages and the Boston housing data
library(gbm)
library(MASS)
library(tree)
data("Boston")
set.seed(1)

# split the data: half of the rows for training, the rest for testing
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
# bagging: a random forest that uses all 13 predictors at each split (mtry = 13)
library(randomForest)
set.seed(1)
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
bag.boston
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE,      subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.15723
##                     % Var explained: 86.49
plot(bag.boston)  # OOB error (MSE) as the number of trees grows

# boosted regression trees with gbm: 5000 trees of interaction depth 4
boost.boston <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                    n.trees = 5000, interaction.depth = 4)
boost.boston
## gbm(formula = medv ~ ., distribution = "gaussian", data = Boston[train, 
##     ], n.trees = 5000, interaction.depth = 4)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## There were 13 predictors of which 13 had non-zero influence.
summary(boost.boston)

##             var    rel.inf
## lstat     lstat 45.6039091
## rm           rm 31.6910080
## dis         dis  6.5429510
## crim       crim  3.8480451
## nox         nox  2.5780827
## ptratio ptratio  2.3140760
## black     black  1.8736901
## age         age  1.7592798
## tax         tax  1.5159821
## indus     indus  1.2273867
## chas       chas  0.8252505
## rad         rad  0.2048336
## zn           zn  0.0155052
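To get test-set predictions from the boosted model, predict() needs the number of trees; continuing the lab code above:

yhat.boost <- predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - Boston[-train, "medv"])^2)  # test MSE of the boosted model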

**Remember that there are many boosting algorithms, such as AdaBoost, LightGBM, and XGBoost.**

2 Support Vector Machine (SVM)

2.1 Hyperplane:

In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p − 1.
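In two dimensions a hyperplane is simply a line; in general it is the set of points \(X = (X_1, \dots, X_p)^T\) satisfying

\[
\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0.
\]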


  • SVM finds the separating hyperplane that leaves as much room as possible on either side (the widest margin that separates the classes). The support vectors are the points that lie on the edge of that margin, closest to the separating hyperplane.

2.2 Kernel:

A trick that implicitly adds dimensions to your data so that it becomes separable. There are different kinds of kernels (linear, polynomial, radial, ...).
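For example, the radial (RBF) kernel used in the sketch below is

\[
K(x_i, x_{i'}) = \exp\!\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\right),
\]

where a larger \(\gamma\) makes the fit more local and more flexible.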

2.3 Tuning parameters:

  • cost (C): controls how heavily margin violations are penalized; a larger cost gives a narrower margin and a more complex fit, a smaller cost gives a wider, more tolerant margin (see the sketch below)
  • gamma: controls the reach of the radial kernel; smaller gamma gives a smoother, less complex decision boundary
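A sketch of choosing cost and gamma by cross-validation with the e1071 package (the synthetic two-class data and the grid of candidate values are illustrative choices):

library(e1071)
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
x[1:100, ] <- x[1:100, ] + 2                 # shift one class so the groups are (mostly) separable
y <- factor(c(rep(1, 100), rep(2, 100)))
dat <- data.frame(x = x, y = y)

tune.out <- tune(svm, y ~ ., data = dat, kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100),
                               gamma = c(0.5, 1, 2)))
tune.out$best.parameters   # cost/gamma pair with the lowest cross-validation error
svmfit <- tune.out$best.model
plot(svmfit, dat)          # decision boundary with support vectors marked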

3 Source

http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf