class: inverse, center, middle
background-image: url(images/title_slide_trees.jpeg)
background-position: center
background-size: cover

# Tree-based methods

---

## R package prerequisites

.scrollable[

```r
# Install required packages from CRAN
install.packages(c(
  "AmesHousing",   # for Ames housing data
  "caret",         # for data splitting function
  "gbm",           # for generalized boosted models
  "ggplot2",       # for nicer graphics
  "kernlab",       # for email spam data
  "pdp",           # for partial dependence plots
  "ranger",        # for faster random forest
  "randomForest",  # for classic random forest
  "RColorBrewer",  # for nicer looking color palettes
  "rpart",         # for binary recursive partitioning (i.e., CART)
  "tibble",        # for nicer data frames
  "vip",           # for variable importance plots
  "xaringan",      # for building this presentation
  "xgboost"        # for eXtreme Gradient Boosting
))

# Install required packages from GitHub
# remotes::install_github("koalaverse/vip")  # for variable importance plots
```

]

---

## Data set information

.scrollable[

.pull-left[

### Classification:

* [mushroom](https://archive.ics.uci.edu/ml/datasets/mushroom) - 21 physical characteristics on 8124 mushrooms classified as either <span style="color:Tomato">poisonous</span> or <span style="color:MediumSeaGreen">edible</span>

* [spam](http://search.r-project.org/library/kernlab/html/spam.html) - 57 variables on 4601 emails classified as <span style="color:Tomato">spam</span> or <span style="color:MediumSeaGreen">ham</span>

* [banknote](http://search.r-project.org/library/alr3/html/banknote.html) - Six measurements made on 100 <span style="color:MediumSeaGreen">genuine</span> Swiss banknotes and 100 <span style="color:Tomato">counterfeit</span> ones

]

.pull-right[

### Regression:

* [boston](http://search.r-project.org/library/pdp/html/boston.html) - Data on median housing values from 506 census tracts in the suburbs of Boston from the 1970 census

* [ames](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) - Data describing the sale of individual residential property in Ames, Iowa from 2006 to 2010

]

]

---

## A good modelling tool

.pull-left[

### <span style="color:MediumSeaGreen">At a minimum:</span>

* Applicable to classification and regression

* Competitive accuracy

* Capable of handling large data sets

* Handles missing values effectively

]

--

.pull-right[

### <span style="color:MediumSeaGreen">Bonus features:</span>

* Which predictors are important? ([`vip`](https://github.com/AFIT-R/vip))

* How do the predictors functionally relate to the response? ([`pdp`](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html))

* How do the predictors interact? ([`pdp`](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html), [`vip`](https://github.com/AFIT-R/vip))

* What is the shape of the data (i.e., how does it cluster)?

* Are there novel cases and outliers?

]

---
class: inverse, center, middle
background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg)

# Decision trees

???

Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg)

---

## Mushroom classification

There is no simple rule for determining the edibility of a [mushroom](https://raw.githubusercontent.com/bgreenwell/MLDay18/master/docs/data/mushroom.csv); there are no rules like **"leaflets three, let it be"**, **"hairy vine, no friend of mine"**, and **"berries white, run in fright"**, as there are for poison ivy.
<img src="images/edible.jpg" width="50%" style="display: block; margin: auto;" /> -- ```r path <- paste0("https://raw.githubusercontent.com/bgreenwell/", "MLDay18/master/docs/data/mushroom.csv") mushroom <- read.csv(path) # load the data from GitHub *mushroom$veil.type <- NULL # only takes on a single value ``` --- ## Mushroom classification ```r # Load required packages library(caret) # for data splitting function library(rpart) # for binary recursive partitioning # Partition the data into train/test sets set.seed(101) trn_id <- createDataPartition( y = mushroom$Edibility, p = 0.5, list = FALSE ) trn <- mushroom[trn_id, ] # training data tst <- mushroom[-trn_id, ] # test data # Function to calculate accuracy accuracy <- function(pred, obs) { sum(diag(table(pred, obs))) / length(obs) } ``` --- ## Mushroom classification .pull-left[ ```r # Decision stump (test error = 1.53%): cart1 <- rpart( Edibility ~ ., data = trn, * control = rpart.control(maxdepth = 1) ) # Get test set predictions pred1 <- predict( cart1, newdata = tst, type = "class" ) # Compute test set accuracy accuracy( pred = pred1, obs = tst$Edibility ) ``` ``` ## [1] 0.9847366 ``` ] .pull-right[ ```r # Optimal tree (test error = 0%): cart2 <- rpart( Edibility ~ ., data = trn, * control = list(cp = 0, minbucket = 1, minsplit = 1) ) # Get test set predictions pred2 <- predict( cart2, newdata = tst, type = "class" ) # Compute test set accuracy accuracy( pred = pred2, obs = tst$Edibility ) ``` ``` ## [1] 1 ``` ] --- ## Mushroom classification .pull-left[ Decision stump (test error = 1.53%): <img src="figures/mushroom-tree-diagram-1-1.svg" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ Optimal tree (test error = 0%): <img src="figures/mushroom-tree-diagram-2-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- ## A handy rule for mushrooms -- <br><br><br> .right[<div class="centered"> <span style="color:Tomato"><font size="25">"If it smells bad, don't eat it!"</font></span> </div>] <br> .right[-Decision Stump] --- class: center, middle, inverse background-image: url(images/your-turn.jpg) background-position: center background-size: contain ??? Image credit: [imgflip](https://imgflip.com/) --- class: middle .large[ Use the [`pROC`](https://cran.r-project.org/package=pROC) package to compute and plot the train and test ROC curves based on the shallow tree (i.e., `cart2`). What is AUC for the train and test sets? **Hint:** Just modify the code from the job attrition example in .red[`08-SVM.R`]. 
] --- class: middle ```r # Load required packages library(pROC) # Plot train and test ROC curves roc_trn <- roc( # train AUC: 0.9717 predictor = predict(cart1, newdata = trn, type = "prob")[, 1L], response = trn$Edibility, levels = rev(levels(trn$Edibility)) ) roc_tst <- roc( # test AUC: 0.8567 predictor = predict(cart1, newdata = tst, type = "prob")[, 1L], response = tst$Edibility, levels = rev(levels(tst$Edibility)) ) plot(roc_trn) lines(roc_tst, col = "dodgerblue2", lty = 2) legend("bottomright", legend = c("Train", "Test"), bty = "n", cex = 1.5, col = c("black", "dodgerblue2"), inset = 0.01, lwd = 2, lty = c(1, 2)) ``` --- class: center, middle <img src="figures/mushroom-cart-roc-02-1.svg" width="80%" style="display: block; margin: auto;" /> --- ## Predicting email spam .pull-left[ - Data from 4601 email messages collected at Hewlett-Packard Labs - **Goal:** predict whether an email message is <span style="color:Tomato">spam</span> (junk email) or <span style="color:MediumSeaGreen ">ham</span> (good email) - **Features:** relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training the email messages - For this problem, not all errors are equal; misclassifying <span style="color:Tomato">spam</span> is not as bad as misclassifying <span style="color:MediumSeaGreen">ham</span>! ] .pull-right[ <img src="images/this-is-spam.jpg" width="100%" style="display: block; margin: auto;" /> ] --- ## Predicting email spam ```r # Load the data *data(spam, package = "kernlab") # Partition the data into train/test sets set.seed(101) # for reproducibility trn_id <- createDataPartition(spam$type, p = 0.7, list = FALSE) trn <- spam[trn_id, ] # training data tst <- spam[-trn_id, ] # test data xtrn <- subset(trn, select = -type) # training data features xtst <- subset(tst, select = -type) # test data features ytrn <- trn$type # training data response # Fit a classification tree (cp found using k-fold CV) *spam_tree <- rpart(type ~ ., data = trn, cp = 0.001) pred <- predict(spam_tree, newdata = xtst, type = "class") # Compute test set accuracy (spam_tree_acc <- accuracy(pred = pred, obs = tst$type)) ``` ``` ## [1] 0.9260334 ``` --- ## Predicting email spam .scrollable[ <img src="figures/spam-tree-diagram-1.svg" width="70%" style="display: block; margin: auto;" /> ] --- ## Predicting email spam .pull-left[ ```r # Extract tibble of variable importance scores *vip::vi(spam_tree) ``` ``` ## # A tibble: 48 x 2 ## Variable Importance ## <chr> <dbl> ## 1 charExclamation 508. ## 2 capitalLong 260. ## 3 capitalAve 199. ## 4 free 198. ## 5 charDollar 195. ## 6 your 183. ## 7 remove 177. ## 8 all 127. ## 9 capitalTotal 108. ## 10 hp 74.3 ## # with 38 more rows ``` ] .pull-right[ ```r # Construct ggplot2-based variable importance plot *vip::vip(spam_tree, num_features = 10) ``` <img src="figures/spam-vip-1.svg" width="100%" style="display: block; margin: auto;" /> ] --- ## How do trees measure up? .pull-left[ ### Advantages: * <span style="color:MediumSeaGreen">Small trees are easy to interpret</span> * <span style="color:MediumSeaGreen">Trees scale well to large `\(N\)`</span> (fast!!) * <span style="color:MediumSeaGreen">Can handle data of all types</span> (i.e., requires little, if any, coding) * <span style="color:MediumSeaGreen">Automatic variable selection</span> * <span style="color:MediumSeaGreen">Can handle missing data</span> (through *surrogate splits*) * <span style="color:MediumSeaGreen">Completely nonparametric</span> (great for DM and EDA tasks!) 
] -- .pull-right[ ### Disadvantages: * <span style="color:Tomato">Large trees can be difficult to interpret</span> * <span style="color:Tomato">Trees are step functions</span> (i.e., binary splits) * <span style="color:Tomato">Greedy splitting algorithms</span> (i.e., trees are noisy) * <span style="color:Tomato">All splits depend on previous splits</span> (i.e., high order interactions) * <span style="color:Tomato">Data is essentially taken away after each split</span> ] --- ## How do trees measure up? <img src="images/esl-tbl.png" width="65%" style="display: block; margin: auto;" /> <br> .center[**Source:** The Elements of Statistical Learning, Second Edition] --- ## Creating ensembles of trees * The key to accuracy is **low bias** and **low variance** * **Main idea:** breaking the [*bias-variance trade-off*](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) - `\(Error \approx Bias + Variance\)` * Two popular algorithms (can be applied to any model, **not just trees!**): - Bagging (<span style="color:Magenta">reduce variance by averaging</span>) - Boosting (<span style="color:Magenta">reduce bias by building models sequentially</span>) * Random forest is just a slight modification over bagging applied to decision trees! <br> <center> <font size="8"><span style="color:DarkBlue">Boosting</span> >= <span style="color:MediumSeaGreen">Random forest</span> > <span style="color:MediumOrchid ">Bagging</span> > <span style="color:DarkOrange">Single tree</span></font> </center> --- class: inverse, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) # Bagging ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- ## Background .pull-left[ * **B**ootstrap **agg**regat**ing** * Bagging trees: 1. Sample records **with replacement** (i.e., bootstrap the training data) 2. Fit an **overgrown tree** to the resampled data set 3. Repeat a large number of times ( `\(\ge 500\)` , say) ] .pull-right[ <img src="images/millionaire.png" width="80%" style="display: block; margin: auto;" /> ] * Predictions are combined by popular vote (**classification**) or averaging (**regression**) * Same idea as ["wisdom of the crowd"](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd) (<span style="color:purple">Francis Galton's Ox weight survey</span>) * Improves the stability and accuracy of noisy models (e.g., individual trees) --- ## Bagging trees <img src="figures/bagging-1-1.svg" width="65%" style="display: block; margin: auto;" /> --- ## Trees are step functions <img src="figures/bagging-2-1.svg" width="65%" style="display: block; margin: auto;" /> --- ## Trees are noisy <img src="figures/bagging-3-1.svg" width="65%" style="display: block; margin: auto;" /> --- ## Bagging reduces variance <img src="figures/bagging-4-1.svg" width="65%" style="display: block; margin: auto;" /> --- class: center, middle .large[Remember this data set?] 
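A minimal, hand-rolled sketch of the bagging recipe from the previous slides (illustrative only; the `bagger()` and `predict_bagger()` helpers are made up here, and in practice `randomForest()` with `mtry` set to the number of predictors does the same job):

```r
# Bagging regression trees "by hand" with rpart (illustration only)
library(rpart)

bagger <- function(formula, data, B = 500) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # 1. bootstrap the rows
    rpart(formula, data = boot,                         # 2. fit an overgrown tree
          control = rpart.control(cp = 0, minsplit = 2))
  })                                                    # 3. repeat B times
}

predict_bagger <- function(trees, newdata) {
  rowMeans(sapply(trees, predict, newdata = newdata))   # average the predictions (regression)
}
```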
<img src="figures/circle-01-1.svg" width="70%" style="display: block; margin: auto;" /> --- class: center, middle <img src="figures/circle-02-1.svg" width="100%" style="display: block; margin: auto;" /> --- ## Predicting email spam .pull-left[ ```r # Load required packages *library(randomForest) # Fit a bagger model set.seed(1633) # for reproducibility spam_bag <- randomForest( type ~ ., data = trn, ntree = 250, * mtry = ncol(xtrn), # use all available features xtest = subset(tst, select = -type), ytest = tst$type, keep.forest = TRUE ) ``` ] .pull-right[ <img src="figures/unnamed-chunk-3-1.svg" width="400" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) # Random forest ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- ## Background * **Bagging on steroids!!** * Correlation limits the effect of bagging! * Grow trees just as in bagging, but with a small twist - Before each split in each tree in the forest, select a subset of the predictors **at random** as candidate splitters - This essentially **"decorrelates"** the trees in the ensemble! - The number of randomly selected variables, denoted `\(m_{try}\)`, is a **tuning parameter** (arguably the only tuning parameter in a random forest) * Bagging introduces <span style="color:CornFlowerBlue">randomness into the rows</span> of the data * Random forest introduces <span style="color:CornFlowerBlue">randomness into the rows and columns</span> of the data * [*Extremely randomized trees*](https://link.springer.com/article/10.1007/s10994-006-6226-1) take this randomness one step further to reduce variance! --- ## Random forest packages in R .scrollable[ .pull-left[ * [`randomForest`](https://cran.r-project.org/package=randomForest) - The standard (implements most of the **bells and whistles**; great for EDA!!) - Classification and regression - Does not scale well, but can be parallelized using `foreach`! * [`randomForestSRC`](https://kogalur.github.io/randomForestSRC/index.html) - Classification, regression, and survival analysis (including ***competing risks***) - MPI/OpenMP support * [`ranger`](https://github.com/imbs-hl/ranger): **ran**dom forest **ge**ne**r**ator - Classification, regression, and survival analysis - [Fast!](https://www.jstatsoft.org/article/view/v077i01) - Accepts sparse numeric matrices (i.e., `dgcMatrix` objects from the [`Matrix`](https://cran.r-project.org/package=Matrix) package) - Estimated time to completion!! <U+0001F4B4> ] .pull-right[ * [`party`/`partykit`](https://cran.r-project.org/web/packages/party/index.html) - Uses *conditional inference trees* as the base learners - Unbiased variable selection - Can be slow on large data sets * [`bigrf`](https://github.com/aloysius-lim/bigrf) - Classification only - Currently **ORPHANED** on CRAN - Disk caching using [`bigmmory`](https://cran.r-project.org/web/packages/bigmemory/) * [`Rborist`](https://github.com/suiji/Arborist) - Offers parallel, distributed tree construction * [`h2o`](https://github.com/h2oai/h2o-3/tree/master/h2o-r) - Distributed random forests via cloud computing - Can be stacked with other [`h2o`](https://github.com/h2oai/h2o-3/tree/master/h2o-r) models, like GBMs and DNNs!! 
]

]

---

## Predicting email spam

.pull-left[

```r
# Load required packages
library(randomForest)

# Fit a random forest
set.seed(1633)  # for reproducibility
spam_rf <- randomForest(
  type ~ ., data = trn,
  ntree = 250,
* mtry = 7,  # floor(sqrt(p))
  xtest = subset(tst, select = -type),
  ytest = tst$type,
  keep.forest = TRUE
)
```

]

.pull-right[

<img src="figures/unnamed-chunk-5-1.svg" width="400" style="display: block; margin: auto;" />

]

---

## Random forest

.pull-left[

### <span style="color:MediumSeaGreen">Advantages:</span>

* Competitive accuracy

* Supervised and unsupervised learning

* *Out-of-bag* data 😎

    - Free cross-validation

    - Novel variable importance

* Novel <span style="color:Purple">proximity matrix</span> 😎

    - Outlier/novelty detection

    - Multi-dimensional scaling

    - Missing value imputation

]

.pull-right[

### <span style="color:Tomato">Disadvantages:</span>

* Missing values (**why?**)

* Can be slow on large data sets (deep trees)!

    - `\(m_{try}\)` mitigates this issue to some extent

<img src="images/bells-and-whistles.jpg" width="80%" style="display: block; margin: auto;" />

]

---

## Out-of-bag data

* For large enough `\(N\)`, on average, `\(1 - e^{-1} \approx 63.21\)`% of the original records end up in any bootstrap sample

```r
N <- 100000
set.seed(1537)  # for reproducibility
x <- rnorm(N)
mean(x %in% sample(x, replace = TRUE))  # non-OOB proportion
```

```
## [1] 0.63232
```

* Thus, roughly `\(e^{-1} \approx 36.79\)`% of the original observations are not used in the construction of a particular tree

* These observations are considered *out-of-bag* (OOB) and can be used to

    - Provide an unbiased assessment of model performance (**sort of an unstructured, but free, cross-validation**)

    - Construct a novel variable importance measure based on the **predictive strength** of each predictor!!

---

## Predicting email spam

<img src="figures/unnamed-chunk-6-1.svg" width="60%" style="display: block; margin: auto;" />

---

## OOB-based variable importance

* **Traditional approach:** Average variable importance scores (i.e., total goodness of split) over all the trees in the ensemble (this is what is done for GBMs)

* **Novel approach:** To estimate the importance of the `\(k\)`-th variable, `\(x_k\)`:

    1. Record the OOB performance of the model

    2. Randomly permute all the values of `\(x_k\)` in the OOB data

    3. Recompute the OOB performance of the model

    4. The difference between the two OOB performance measures quantifies the importance of `\(x_k\)`

* This is fundamentally different because these importance scores are **NOT** based on data seen by the individual trees!
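---

## OOB-based variable importance

To make the permutation idea concrete, here is a rough, hand-rolled sketch. It is **not** how `randomForest()` computes its importance internally (the real version permutes within each tree's own OOB cases); for simplicity, this illustration just measures the drop in accuracy on the test set, and the helper `perm_vi()` is made up for this example.

```r
# Illustrative only: permutation-based importance measured on the test set
perm_vi <- function(object, var, data, y, nperm = 5) {
  baseline <- accuracy(pred = predict(object, newdata = data), obs = y)
  drop <- replicate(nperm, {
    permuted <- data
    permuted[[var]] <- sample(permuted[[var]])  # break the association with y
    baseline - accuracy(pred = predict(object, newdata = permuted), obs = y)
  })
  mean(drop)  # average drop in accuracy across permutations
}

# For example, how much does shuffling `charExclamation` hurt the spam forest?
perm_vi(spam_rf, var = "charExclamation", data = tst, y = tst$type)
```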
--- ## Boston housing example * Housing data from `\(N = 506\)` census tracts in the city of Boston for the year 1970 * The data violate many classical assumptions like linearity, normality, and constant variance -- * Harrison and Rubinfeld's housing value equation ( `\(R^2 = 0.81\)` ): <img src="images/boston-eqn.png" width="80%" style="display: block; margin: auto;" /> -- * Nowadays, many supervised learning algorithms can fit the data automatically in seconds (**typically with higher accuracy!**) * The downfall, however, is some loss of interpretation since these algorithms typically do not produce simple prediction formulas (**but there's still hope!** <U+0001F64F>) --- ```r # Load required packages *library(ranger) # a much faster implementation of random forest # Load the (corrected) Boston housing data data(boston, package = "pdp") ``` .pull-left[ ```r # Using the randomForest package set.seed(2007) # for reproducibility system.time( boston_rf <- randomForest( cmedv ~ ., data = boston, ntree = 5000, mtry = 5, importance = FALSE ) ) ``` ``` ## user system elapsed ## 6.774 0.133 6.927 ``` ``` ## [1] 0.893066 ``` ] .pull-right[ ```r # Using the ranger package set.seed(1652) # for reproducibility system.time( boston_ranger <- ranger( cmedv ~ ., data = boston, num.trees = 5000, * mtry = 5, # :/ importance = "impurity" ) ) ``` ``` ## user system elapsed ## 5.002 0.163 0.927 ``` ``` ## [1] 0.892634 ``` ] --- ## Variable importance plots .pull-left[ * Most RF packages provide variable importance measures, but not all of them provide variable importance plots (VIPs) * Enter...[`vip`](https://github.com/AFIT-R/vip) - Provides a consistent framework for extracting and plotting variable importance scores from many types of ML models - `vi()` always returns a [*tibble*](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) - `vip()` uses [`ggplot2`](http://ggplot2.org/) ] .pull-right[ ```r # Not yet on CRAN (experimental) devtools::install_github("AFIT-R/vip") ``` <img src="images/vip-logo.svg" width="80%" style="display: block; margin: auto;" /> ] --- ```r # Construct variable importance plots (the old way) par(mfrow = c(1, 2)) # side-by-side plots *varImpPlot(boston_rf) # randomForest::varImpPlot() ``` <img src="figures/boston-rf-varImpPlot-1.svg" width="100%" style="display: block; margin: auto;" /> --- ```r # Load required packages library(vip) # for better (and consistent) variable importance plots # Construct variable importance plots p1 <- vip(boston_rf, type = 1) + ggtitle("randomForest") p2 <- vip(boston_rf, type = 2) + ggtitle("randomForest") p3 <- vip(boston_ranger) + ggtitle("ranger") *grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="figures/boston-rf-vip-1.svg" width="100%" style="display: block; margin: auto;" /> --- ## Partial dependence plots .scrollable[ * [Partial dependence plots (PDPs)](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) help visualize the relationship between a subset of the features (typically 1-3) and the response * Let `\(x_1\)` be the predictor variable of interest with unique values `\(\left\{x_{11}, x_{12}, \dots, x_{1k}\right\}\)` * The partial dependence of the response on `\(x_1\)` can be constructed as follows: 1. For `\(i \in \left\{1, 2, \dots, k\right\}\)`: a. Copy the training data and replace the original values of `\(x_1\)` with the constant `\(x_{1i}\)` b. Compute the vector of predicted values from the modified copy of the training data c. 
Compute the average prediction to obtain `\(\bar{f}_1\left(x_{1i}\right)\)` 2. Plot the pairs `\(\left\{x_{1i}, \bar{f}_1\left(x_{1i}\right)\right\}\)` for `\(i = 1, 2, \dotsc, k\)` ] --- ## Partial dependence plots .pull-left[ * Not all RF implementations provide support for constructing partial dependence plots (PDPs) * Enter...[`pdp`](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) - Provides a consistent way of constructing PDPs (and more) from many types of ML models (not just RFs) - Allows for multivariate displays (i.e., interactions) and so much more!! - Includes options for **parallel processing** and **progress bars** <U+0001F60E> ] .pull-right[ ```r # Install from CRAN install.packages("pdp") # Install from GitHub devtools::install_github("bgreenwell/pdp") ``` <img src="images/pdp-logo-trees.png" width="65%" style="display: block; margin: auto;" /> ] --- ## Partial dependence plots ```r # Load required packages library(pdp) # PDPs for the top two predictors p1 <- pdp::partial(boston_ranger, pred.var = "lstat", plot = TRUE) p2 <- pdp::partial(boston_ranger, pred.var = "rm", plot = TRUE) *p3 <- pdp::partial(boston_ranger, pred.var = c("lstat", "rm"), * chull = TRUE, plot = TRUE) grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="figures/boston-rf-pdps-1.svg" width="100%" style="display: block; margin: auto;" /> --- ## Partial dependence plots ```r # 3-D plots *pd <- attr(p3, "partial.data") # no need to recalculate p1 <- plotPartial(pd, levelplot = FALSE, drape = TRUE, colorkey = FALSE, screen = list(z = -20, x = -60) ) ``` ```r # Using ggplot2 library(ggplot2) p2 <- autoplot(pd) ``` ```r # ICE and c-ICE curves p3 <- boston_ranger %>% # %>% is automatically imported! pdp::partial(pred.var = "rm", ice = TRUE, center = TRUE) %>% autoplot(alpha = 0.1) ``` --- ## Partial dependence plots ```r # Display all three plots side-by-side grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="figures/unnamed-chunk-12-1.svg" width="100%" style="display: block; margin: auto;" /> --- ## Proximity matrix * In random forests (and bagging), **trees are intentionally grown deep**! --- <img src="figures/mushroom-tree-diagram-3-1.svg" width="80%" style="display: block; margin: auto;" /> --- ## Proximity matrix * In random forests (and bagging), **trees are intentionally grown deep**! 1. Initialize the proximity between cases `\(i\)` and `\(j\)` to zero: `\(prox_{ij} = 0\)` 2. After a tree is grown, put all of the data, both training and OOB, down the tree 3. If cases `\(i\)` and `\(j\)` land in the same terminal node, increase `\(prox_{ij}\)` by one 4. 
At the end, normalize `\(prox_{ij}\)` by dividing by the number of trees in the forest * The pairwise proximities can be used to construct a useful distance matrix * **Requires the storage of an `\(N \times N\)` matrix!!** <U+0001F631> * Only available in [`randomForest`](https://cran.r-project.org/package=randomForest), [`party`](https://cran.r-project.org/web/packages/party/index.html), and [`bigrf`](https://github.com/aloysius-lim/bigrf) --- ## Swiss banknote data ```r # Load the data data(banknote, package = "alr3") # Fit a random forest set.seed(1701) # for reproducibility banknote_rf <- randomForest( as.factor(Y) ~ ., data = banknote, * proximity = TRUE ) # Print the OOB confusion matrix banknote_rf$confusion ``` ``` ## 0 1 class.error ## 0 99 1 0.01 ## 1 0 100 0.00 ``` --- ## Proximity matrix ```r # Heatmap of proximity-based distance matrix heatmap(1 - banknote_rf$proximity, col = viridis::plasma(256)) ``` <img src="figures/heatmap-1.svg" width="50%" style="display: block; margin: auto;" /> --- ## Outlier scores Outlyingness of case `\(n\)` in class `\(j\)` to all other cases in class `\(j\)`: `$$out\left(n\right) = N / \sum_{cl\left(k\right) = j} prox_{ik} ^ 2$$` .pull-left[ ```r # Dot chart of proximity-based outlier scores outlyingness <- tibble::tibble( * "out" = outlier(banknote_rf), "obs" = seq_along(out), "class" = as.factor(banknote$Y) ) ggplot(outlyingness, aes(x = obs, y = out)) + geom_point(aes(color = class, size = out), alpha = 0.5) + * geom_hline(yintercept = 10, linetype = 2) + labs(x = "Observation", y = "Outlyingness") + theme_light() + theme(legend.position = "none") ``` ] .pull-right[ <img src="figures/outlier-plot-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- ## Multi-dimensional scaling ```r # Multi-dimensional scaling plot of proximity matrix MDSplot(banknote_rf, fac = as.factor(banknote$Y), k = 2, cex = 1.5) ``` <img src="figures/MDS-1.svg" width="50%" style="display: block; margin: auto;" /> --- ## Missing value imputation * Random forest offers a novel approach to imputing missing values (i.e., `NA`s) 1. Start with cheap imputation (e.g., median for continuous features) 2. Run a random forest and obtain the proximity matrix 3. Imputed values are updated using: - The weighted average of the non-missing observations, where the weights are the proximity (**continuous**) - The category with the largest average proximity (**categorical**) 4. Iterate this procedure until convergence (~ 4-6 times seems sufficient) * Highly computational (requires many random forest runs!!) * Resulting **OOB error estimate tends to be overly optimistic** * Only available in [`randomForest`](https://cran.r-project.org/package=randomForest) --- class: center, middle, inverse background-image: url(images/your-turn.jpg) background-position: center background-size: contain ??? Image credit: [imgflip](https://imgflip.com/) --- class: middle .large[ Fit a random forest to the training data from the job attrition example. What is the OOB error for each class? What is the corresponding test error? Which variables seem to be the most important? Does this match with the SVM results obtained earlier? 
] --- class: middle .scrollable[ ```r # Load required packages library(rsample) # for attrition data and data splitting # Load the attrition data attrition <- attrition %>% dplyr::mutate( # convert some numeric features to factors JobLevel = factor(JobLevel), StockOptionLevel = factor(StockOptionLevel), TrainingTimesLastYear = factor(TrainingTimesLastYear) ) # Train and test splits set.seed(123) # for reproducibility split <- initial_split(attrition, prop = 0.7, strata = "Attrition") trn <- training(split) tst <- testing(split) # Fit a random forest set.seed(1710) attr_rf <- randomForest( x = subset(trn, select = -Attrition), y = trn$Attrition, * xtest = subset(tst, select = -Attrition), * ytest = tst$Attrition, ntree = 1000, * strata = trn$Attrition, * sampsize = rep(min(table(trn$Attrition)), times = 2) ) attr_rf # test error ~ 16.36% ``` ``` ## ## Call: ## randomForest(x = subset(trn, select = -Attrition), y = trn$Attrition, xtest = subset(tst, select = -Attrition), ytest = tst$Attrition, ntree = 1000, strata = trn$Attrition, sampsize = rep(min(table(trn$Attrition)), times = 2)) ## Type of random forest: classification ## Number of trees: 1000 ## No. of variables tried at each split: 5 ## ## OOB estimate of error rate: 16.7% ## Confusion matrix: ## No Yes class.error ## No 767 97 0.1122685 ## Yes 75 91 0.4518072 ## Test set error rate: 17.05% ## Confusion matrix: ## No Yes class.error ## No 322 47 0.1273713 ## Yes 28 43 0.3943662 ``` ```r # Variable importance plot vip(attr_rf) ``` <img src="figures/attrition-rf-01-1.svg" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/99/Fog_forrest_frickberg.jpg) # Boosting ??? Image credit: [Wikimedia Commons](https://commons.wikimedia.org/wiki/Forest#/media/File:Fog_forrest_frickberg.jpg) --- ## Background .scrollable[ * Boosting is a general technique for creating an ensemble of models (*meta-algorithm*) * In contrast to bagging, boosting: - Uses a *weak learner* for the base learners - Builds the models sequentially (fix past mistakes) * Most commonly used with **shallow decision trees** (e.g., decision stumps), but other types of models can be boosted * **Boosting requires more tuning**, but is typically faster and more accurate than RFs! * **Many flavors of boosting:** AdaBoost, Real AdaBoost, gradient boosting, and stochastic gradient boosting (GBM), for example * GBMs are the most **general** and widely used flavor of boosting - Introduces subsampling into the rows and columns (**stochastic**) - Loss function is optimized using *gradient descent* (**gradient**) - With the right combination of parameters, GBMs can emulate RFs! ] --- ## GBMs with squared-error loss 1. Given a training sample `\(\boldsymbol{d} = \left(\boldsymbol{X}, \boldsymbol{y}\right)\)`. Fix the number of steps `\(B\)`, the shrinkage factor `\(\epsilon\)`, and the tree depth `\(d\)`. Set the initial fit `\(\widehat{G}_0 \equiv 0\)`, and the residual vector `\(\boldsymbol{r} = \boldsymbol{y}\)`. 2. For `\(b = 1, 2, \dots, B\)` repeat: a) Fit a regression tree `\(\tilde{g}_b\)` to the data `\(\left(\boldsymbol{X}, \boldsymbol{r}\right)\)`, grown best-first to depth `\(d\)` b) Update the fitted model with a shrunken version of `\(\tilde{g}_b\)`: `\(\widehat{G}_b = \widehat{G}_{b - 1} + \epsilon \tilde{g}_b\)` c) Update the residuals accordingly: `\(r_i = r_i - \epsilon \tilde{g}_b\)`, `\(i = 1, 2, \dots, n\)` 3. 
Return the sequence of fitted functions `\(\widehat{G}_b\)`, `\(b = 1, 2, \dots, B\)`.

* Here, `\(B\)`, `\(\epsilon\)`, and `\(d\)` are tuning parameters

* [R code](https://github.com/bgreenwell/MLDay18/blob/master/code/rpartBoost.R)

???

Source: [Computer Age Statistical Inference, pg. 333](https://web.stanford.edu/~hastie/CASI/)

---
class: center, middle

## [Boosting regression stumps](https://github.com/bgreenwell/MLDay18/blob/master/code/rpartBoost.R)



---

## GBM packages in R

.scrollable[

* [`gbm`](https://github.com/gbm-developers/gbm): **g**eneralized **b**oosted **m**odels

    - The original R implementation of GBMs (by Greg Ridgeway)

    - Slower than modern implementations (but still pretty fast)

    - Provides an OOB error estimate

    - Supports the *weighted tree traversal* method for fast construction of PDPs

    - Currently orphaned on CRAN (for now)

* [`gbm3`](https://github.com/gbm-developers/gbm3): **g**eneralized **b**oosted **m**odels

    - Shiny new version of `gbm` that is not backwards compatible

    - Faster and supports parallel tree building

    - Not currently listed on CRAN

* [`xgboost`](https://github.com/dmlc/xgboost): <span style="color:DeepSkyBlue">eXtreme Gradient Boosting</span>

    - Widely used by data scientists to achieve **state-of-the-art** results across many machine learning challenges (e.g., Kaggle)

    - Designed to be highly efficient, flexible, and portable

    - Supports user-defined loss functions

    - Supports early stopping

    - Includes a fast histogram method for numeric features

    - Parallel tree boosting and GPU support!!

<img src="images/gpu.jpg" width="50%" style="display: block; margin: auto;" />

* [`lightgbm`](https://github.com/Microsoft/LightGBM):

    - A Microsoft-supported, and somewhat [faster](https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#comparison-experiment), alternative to `xgboost`

    - Uses *leaf-wise* (i.e., best-first) tree growth

    - Not currently listed on CRAN

* [`mboost`](https://github.com/boost-R): **m**odel-based **boost**ing

    - Implements tools for fitting and evaluating a variety of regression and classification models (somewhere between GLMs/GAMs and GBMs)

]

---

## Missing values

* Unlike RFs, GBMs typically support missing values!

* In most GBM implementations, `NA`s are interpreted as containing information (i.e., missing for a reason), rather than missing at random

    - In [`gbm`](https://github.com/gbm-developers/gbm) and [`gbm3`](https://github.com/gbm-developers/gbm3), a separate branch is created at each split for `NA`s (i.e., each decision node contains three splits: left, right, and missing)

    - In [`h2o`](https://github.com/h2oai/h2o-3/tree/master/h2o-r) and [`xgboost`](https://github.com/dmlc/xgboost), each split has a **default direction** that is learned during training (`NA`s take the default route!)

---

## Boston housing example

```r
# Load required packages
library(gbm)

# Fit a GBM to the Boston housing data
set.seed(1053)  # for reproducibility
boston_gbm <- gbm(
  cmedv ~ .,
  data = boston,
* var.monotone = NULL,
* distribution = "gaussian",  # "bernoulli", "coxph", etc.
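  # shrinkage and n.trees work together: a smaller learning rate (shrinkage)
  # generally requires more trees; gbm.perf() on the next slide uses the CV
  # results to pick the best number of iterations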
* n.trees = 10000, * interaction.depth = 5, * n.minobsinnode = 10, * shrinkage = 0.005, * bag.fraction = 1, * train.fraction = 1, cv.folds = 10 # k-fold CV often gives the best results ) ``` --- ## Boston housing example ```r best_iter <- gbm.perf( boston_gbm, * method = "cv" # or "OOB" or "test" ) ``` <img src="figures/boston_gbm_best-1.svg" width="50%" style="display: block; margin: auto;" /> --- ## Partial dependence plots .pull-left[ ### Brute force method: ```r system.time( pd1 <- pdp::partial( boston_gbm, pred.var = c("lon", "nox"), * recursive = FALSE, chull = TRUE, * n.trees = best_iter ) ) ``` ``` ## user system elapsed ## 115.772 0.165 116.038 ``` ] .pull-right[ ### Recursive method: ```r system.time( pd2 <- pdp::partial( boston_gbm, pred.var = c("lon", "nox"), * recursive = TRUE, chull = TRUE, * n.trees = best_iter ) ) ``` ``` ## user system elapsed ## 0.455 0.000 0.455 ``` ] --- ## Partial dependence plots ```r # Display plots side-by-side grid.arrange(autoplot(pd1), autoplot(pd2), ncol = 2) ``` <img src="figures/boston-pdps-1.svg" style="display: block; margin: auto;" /> --- ## Ames housing data .pull-left[ * A contemporary alternative to the often cited Boston housing data * Contains the sale of individual residential property in Ames, Iowa from 2006 to 2010 * Contains `\(N = 2930\)` observations and `\(p = 80\)` features involved in assessing home values * Lots of potential for *feature engineering*!! ] .pull-right[ ```r ames <- AmesHousing::make_ames() ggplot(ames, aes(x = Sale_Price, y = Overall_Qual)) + * ggridges::geom_density_ridges(aes(fill = Overall_Qual)) + scale_x_continuous(labels = scales::dollar) + labs(x = "Sale price", y = "Overall quality") + theme_light() + theme(legend.position = "none") ``` <img src="figures/ames-densities-1.svg" width="70%" style="display: block; margin: auto;" /> **Source:** [Bradley Boehmke](https://github.com/bradleyboehmke) ] --- ```r # Load required packages library(xgboost) # Construct data set ames <- AmesHousing::make_ames() *# Feature matrix # or xgb.DMatrix or sparse matrix X <- data.matrix(subset(ames, select = -Sale_Price)) # Use k-fold cross-validation to find the "optimal" number of trees set.seed(1214) # for reproducibility ames_xgb_cv <- xgb.cv( data = X, label = ames$Sale_Price, * objective = "reg:linear", # loss function * nrounds = 10000, # number of trees * max_depth = 5, # interaction depth * eta = 0.01, # learning rate subsample = 1, colsample = 1, num_parallel_tree = 1, * eval_metric = "rmse", * early_stopping_rounds = 50, verbose = 0, nfold = 5 ) ``` --- <img src="figures/ames-xgboost-cv-plot-1.svg" width="80%" style="display: block; margin: auto;" /> --- ## Ames housing data .scrollable[ ```r # Train an XGBoost model set.seed(203) # for reproducibility ames_xgb <- xgboost( data = X, label = ames$Sale_Price, * objective = "reg:linear", # loss function * nrounds = 2771, # number of trees * max_depth = 5, # interaction depth * eta = 0.01, # learning rate subsample = 1, colsample = 1, num_parallel_tree = 1, eval_metric = "rmse", verbose = 0, * save_period = NULL ) ``` ] --- ## Ames housing data ```r # Variable importance plots p1 <- vip(ames_xgb, feature_names = colnames(X), type = "Gain") p2 <- vip(ames_xgb, feature_names = colnames(X), type = "Cover") p3 <- vip(ames_xgb, feature_names = colnames(X), type = "Frequency") grid.arrange(p1, p2, p3, ncol = 3) ``` <img src="figures/xgboost-vip-1.svg" style="display: block; margin: auto;" /> --- ## Ames housing data ```r # By default, `vip()` plots the top 10 features 
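# (Gain = each feature's total contribution to reducing the loss,
#  Cover = the relative number of observations affected by its splits,
#  Frequency = how often the feature is used to split)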
vip(ames_xgb, feature_names = colnames(X), type = "Gain", num_features = ncol(X), bar = FALSE) ``` <img src="figures/xgboost-vip-2-1.svg" width="50%" style="display: block; margin: auto;" /> --- ## Ames housing data ```r # Partial dependence plots oq_ice <- pdp::partial(ames_xgb, pred.var = "Overall_Qual", ice = TRUE, center = TRUE, train = X) p4 <- autoplot(pdp::partial(ames_xgb, pred.var = "Gr_Liv_Area", train = X)) p5 <- autoplot(pdp::partial(ames_xgb, pred.var = "Garage_Cars", train = X)) p6 <- autoplot(oq_ice, alpha = 0.1) grid.arrange(p4, p5, p6, ncol = 3) ``` <img src="figures/xgboost-pdp-1.svg" style="display: block; margin: auto;" /> --- ## Ames housing data <img src="figures/xgboost-pdp-vi-1.svg" style="display: block; margin: auto;" /> --- ## Summary .scrollable[ * Check out the [machine learning task view](https://cran.r-project.org/web/views/MachineLearning.html)! .pull-left[ #### <span style="color:MediumSeaGreen">Random forests:</span> * Builds an ensemble of fully grown decision trees (**low bias, high variance**) - Correlation between trees is reduced through subsampling the columns - Variance is reduced through averaging * Nearly automatic * Lots of cool bells and whistles * Trees are independently grown (embarrassingly parallel) ] .pull-right[ #### <span style="color:MediumSeaGreen">Gradient boosting machines:</span> * Builds an ensemble of small decision trees (**high bias, low variance**) - Bias is reduced through sequential learning and fixing past mistakes * Requires more tuning, but can be more accurate * Flexible loss functions * Trees are grown **NOT** independent, but training times are usually pretty fast since trees are not grown too deep ] ] --- ## Questions? <img src="images/questions-trees.png" width="100%" style="display: block; margin: auto;" />