0. Setup & Packages Activation

knitr::opts_chunk$set(echo = TRUE)
options(scipen = 999) # to avoid using scientific notation
library(rpart) # Recursive PARTitioning -> Decision Trees
library(rpart.plot) # To plot Decision Trees
library(caret) # Classification And REgression Training -> Automatic Machine Learning
library(Metrics) # For Model Performance Metrics Calculation
library(tidyverse) # For Data Organization, Wrangling & Cleaning
library(e1071) # Miscellaneous Statistics Functions

1. Reading and Exploring the Data

Reading original data: weight in pounds and height in inches.

# Data from:
# https://www.kaggle.com/mustafaali96/weight-height
data <- read.csv("data/weight-height.csv", stringsAsFactors = T)
head(data)
##   Gender   Height   Weight
## 1   Male 73.84702 241.8936
## 2   Male 68.78190 162.3105
## 3   Male 74.11011 212.7409
## 4   Male 71.73098 220.0425
## 5   Male 69.88180 206.3498
## 6   Male 67.25302 152.2122

Transforming data: weight in kilograms and height in meters.

gwh <- data.frame(gen = data$Gender, wt = data$Weight/2.205, ht = data$Height/39.37)
head(gwh)
##    gen        wt       ht
## 1 Male 109.70230 1.875718
## 2 Male  73.61019 1.747064
## 3 Male  96.48111 1.882400
## 4 Male  99.79250 1.821970
## 5 Male  93.58268 1.775001
## 6 Male  69.03046 1.708230
summary(gwh)
##      gen             wt               ht       
##  Female:5000   Min.   : 29.34   Min.   :1.378  
##  Male  :5000   1st Qu.: 61.60   1st Qu.:1.613  
##                Median : 73.11   Median :1.684  
##                Mean   : 73.22   Mean   :1.686  
##                3rd Qu.: 84.88   3rd Qu.:1.757  
##                Max.   :122.44   Max.   :2.007
gwh %>% group_by(gen) %>% summarize(n = n(), avght = mean(ht), avgwt = mean(wt))
## # A tibble: 2 x 4
##   gen        n avght avgwt
##   <fct>  <int> <dbl> <dbl>
## 1 Female  5000  1.62  61.6
## 2 Male    5000  1.75  84.8

A brief aside on the sample() function

sample(x = 1:10, size = 5)
## [1] 1 5 2 4 6
sample(x = 1:10, size = 5, replace = T)
## [1]  6 10  1  5  1
sample(1:10, 10)
##  [1]  6  9  3  4  8  1  2  7 10  5
#sample(1:10, 11) # What's up with this one?
sample(1:10, 11, replace = T)
##  [1]  3  2  6  4  2 10  5  4  8  8  9

Generate a random sample of 100 individuals

n <- nrow(gwh)
nt <- 100
set.seed(123)
subset <- sample(1:n, nt)
gwh_sub <- gwh[subset, ]  
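
A quick sanity check (an addition, not in the original code): thanks to set.seed(123) the draw above is reproducible, and we can see how the subsample is composed.

table(gwh_sub$gen)                 # how many Females / Males landed in the subsample
summary(gwh_sub[, c("wt", "ht")])  # weight and height ranges within the subsample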

2. First Model: Only Weights

How could you predict the weight of each individual while minimizing the sum of squared errors?

mean(gwh_sub$wt)
## [1] 71.08253
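
As a quick check (a small sketch, not in the original code), the mean does give a smaller sum of squared errors than any other single-number prediction, for example the median:

sum((gwh_sub$wt - mean(gwh_sub$wt))^2)    # squared errors when predicting the mean
sum((gwh_sub$wt - median(gwh_sub$wt))^2)  # predicting the median: never smaller than the line above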

How does this look in a regression with no explanatory (independent) variables?

reg <- lm(wt ~ 1, data = gwh_sub)
summary(reg)
## 
## Call:
## lm(formula = wt ~ 1, data = gwh_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.592 -11.323   0.148  11.723  32.092 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   71.083      1.441   49.33 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.41 on 99 degrees of freedom
plot(x = 1:100, y = gwh_sub$wt, 
     xlab = "Observation", ylab = "Weight", 
     main = "Decision Tree")
y <- rep(mean(gwh_sub$wt), nrow(gwh_sub))
lines(x = 1:100, y = y, col = "red")

Recall that the mean of a series is its expected value, so it is our base model. The \(R^2\) of a regression is the proportion of the variance around the mean that the model can explain; in other words, how much the model outperforms the mean. Notice that no \(R^2\) is shown for the model above. The definitions below make this precise, and a quick numeric check follows the list.

  • Total Sum of Squares: \(SST = \sum(y - \overline{y})^2\)

  • Error Sum of Squares: \(SSE = \sum(y - \hat{y})^2 = \sum\epsilon_i^2\)

  • Regression Sum of Squares: \(SSR = \sum(\hat{y} - \overline{y})^2\)

  • Coefficient of determination: \(R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\)
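
As that quick numeric check (a small sketch, not part of the original code), we can compute these sums of squares for the intercept-only reg model fitted above. Its fitted values are just the mean, so SSR is zero and \(R^2\) is zero, which is why summary() does not report one.

y_obs <- gwh_sub$wt
y_fit <- predict(reg)                  # fitted values of the intercept-only model: the mean
sst <- sum((y_obs - mean(y_obs))^2)    # Total Sum of Squares
sse <- sum((y_obs - y_fit)^2)          # Error Sum of Squares
ssr <- sum((y_fit - mean(y_obs))^2)    # Regression Sum of Squares
c(SST = sst, SSE = sse, SSR = ssr, R2 = 1 - sse / sst)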

3. Second Model: With Gender

What if we had additional information, like the sex or gender of each individual? Could we improve our performance?

mean(gwh_sub[gwh_sub$gen=="Female", "wt"])
## [1] 61.29075
mean(gwh_sub[gwh_sub$gen=="Male", "wt"])
## [1] 84.06232

This is essentially a Decision Tree:

dt <- rpart(wt ~ gen, data = gwh_sub)
rpart.plot(dt, digits = 5)

We can do the same thing with a linear regression though:

reg <- lm(wt ~ gen, data = gwh_sub)
summary(reg)
## 
## Call:
## lm(formula = wt ~ gen, data = gwh_sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.2888  -6.0126  -0.1374   6.8094  19.1126 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   61.291      1.185    51.7 <0.0000000000000002 ***
## genMale       22.772      1.808    12.6 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.95 on 98 degrees of freedom
## Multiple R-squared:  0.6182, Adjusted R-squared:  0.6143 
## F-statistic: 158.7 on 1 and 98 DF,  p-value: < 0.00000000000000022

How do we interpret the coefficients? Do they match the decision tree? Notice that now we do have an \(R^2\).

sum(reg$coefficients)
## [1] 84.06232
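
As another check (a small sketch, not in the original code), the regression's fitted values and the tree's leaf predictions should coincide, since both are just the group means (assuming, as happens here, that the tree did split on gender):

head(data.frame(gen  = gwh_sub$gen,
                tree = predict(dt, newdata = gwh_sub),
                lm   = predict(reg, newdata = gwh_sub)))
all.equal(unname(predict(dt, newdata = gwh_sub)),
          unname(predict(reg, newdata = gwh_sub)))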
plot(x = 1:100, y = gwh_sub$wt, col = as.numeric(gwh_sub$gen)*2,
     xlab = "Observation", ylab = "Weight", 
     main = "Decision Tree: Females (red) vs Males (blue)")
lines(x = 1:100, y = rep(dt$frame[2,5],100), col = "red")
lines(x = 1:100, y = rep(dt$frame[3,5],100), col = "blue")

4. Third Model: With Heights

What if we had heights instead of gender? Decision Trees are binary… right?

dt <- rpart(wt ~ ht, data = gwh_sub, control = rpart.control(maxdepth = 1))
rpart.plot(dt, digits = 5)

# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt, 
     xlab = "Height", ylab = "Weight", 
     main = "Decision Tree", col = "blue")
df <- data.frame(ht = sort(gwh_sub$ht))
df$avgwt <- ifelse(df$ht < dt$splits[1,4], dt$frame[2,5], dt$frame[3,5])
lines(x = df$ht, y = df$avgwt, col = "red")
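
A peek at the tree objects that the plotting code above indexes by position (an explanatory addition, not in the original): in dt$splits, the column named "index" holds the split point on ht, and in dt$frame, the column named "yval" holds each node's predicted (mean) weight.

dt$splits[, "index", drop = FALSE]   # the split point used by the tree
dt$frame[, c("var", "n", "yval")]    # node sizes and predicted means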

What about regression?

reg <- lm(wt ~ ht, data = gwh_sub)
summary(reg)
## 
## Call:
## lm(formula = wt ~ ht, data = gwh_sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2611  -3.5415   0.4786   3.4256  13.8621 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -164.324      9.690  -16.96 <0.0000000000000002 ***
## ht           140.829      5.788   24.33 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.458 on 98 degrees of freedom
## Multiple R-squared:  0.858,  Adjusted R-squared:  0.8565 
## F-statistic:   592 on 1 and 98 DF,  p-value: < 0.00000000000000022
plot(x = gwh_sub$ht, y = gwh_sub$wt, 
     xlab = "Height", ylab = "Weight", 
     main = "Linear Regression", col = "blue")
abline(reg, col = "red")

Case closed?

Notice that, in this case, whether or not we include a constant (intercept) affects how easy the model is to interpret. For mathematical purposes, such as a valid calculation of \(R^2\), it is generally better to keep the constant even if it is not significant (unlike other explanatory variables, which should be removed if they are not significant). In fact, when the intercept is dropped, R computes \(R^2\) around zero instead of around the mean, which is why the no-intercept model below reports a misleadingly high value.

reg <- lm(wt ~ ht - 1, data = gwh_sub)
summary(reg)
## 
## Call:
## lm(formula = wt ~ ht - 1, data = gwh_sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0357  -8.1149  -0.9992   8.6216  21.9532 
## 
## Coefficients:
##    Estimate Std. Error t value            Pr(>|t|)    
## ht  42.8361     0.6434   66.58 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.77 on 99 degrees of freedom
## Multiple R-squared:  0.9782, Adjusted R-squared:  0.9779 
## F-statistic:  4433 on 1 and 99 DF,  p-value: < 0.00000000000000022
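
To see why that 0.9782 is misleading (a small sketch, not in the original code), we can compute \(R^2\) for the no-intercept model the usual way, around the mean; it comes out lower than the 0.858 of the model with an intercept.

sse_noint <- sum(residuals(reg)^2)               # reg is the no-intercept model here
sst <- sum((gwh_sub$wt - mean(gwh_sub$wt))^2)
1 - sse_noint / sst                              # the "honest" R^2, measured around the mean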
plot(x = gwh_sub$ht, y = gwh_sub$wt, 
     xlab = "Height", ylab = "Weight", 
     main = "Linear Regression", col = "blue")
abline(reg, col = "red")

5. Allowing More Splits

dt <- rpart(wt ~ ht, data = gwh_sub, 
            control = rpart.control(minsplit = 20, cp = 0.05))
rpart.plot(dt, digits = 5)

# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[2] <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out[2] <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt, 
     xlab = "Height", ylab = "Weight", 
     main = "Decision Tree", col = "blue")
df <- gwh_sub %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
  for (j in 1:length(index)) {
    df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i]) 
  }
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")
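
As an aside (a small sketch, not part of the original code), the same red step line can be drawn more directly, since predict() already returns each observation's leaf mean:

df2 <- gwh_sub %>% arrange(ht)
plot(x = df2$ht, y = df2$wt,
     xlab = "Height", ylab = "Weight",
     main = "Decision Tree (line via predict)", col = "blue")
lines(x = df2$ht, y = predict(dt, newdata = df2), col = "red")

The same shortcut works for the deeper trees in the rest of this section.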

dt <- rpart(wt ~ ht, data = gwh_sub, 
            control = rpart.control(minsplit = 10, cp = 0.005))
rpart.plot(dt, digits = 5)

# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[3] <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out[3] <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt, 
     xlab = "Height", ylab = "Weight", 
     main = "Decision Tree", col = "blue")
df <- gwh_sub %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
  for (j in 1:length(index)) {
    df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i]) 
  }
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")

dt <- rpart(wt ~ ht, data = gwh_sub, 
            control = rpart.control(minsplit = 10, cp = 0.003))
rpart.plot(dt, digits = 5)

# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[4] <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out[4] <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt, 
     xlab = "Height", ylab = "Weight", 
     main = "Decision Tree", col = "blue")
df <- gwh_sub %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
  for (j in 1:length(index)) {
    df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i]) 
  }
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")

6. Flexibility vs Bias

6.1 Linear Regression

Linear Regression for a:

Linear Relationship:

x <- 0:40
b0 <- 10
b1 <- 5
ybp <- b0 + b1*x + rnorm(41, sd = 20)
ybn <- b0 - b1*x + rnorm(41, sd = 20)

par(mfrow = c(1,2))
plot(x = x, y = ybp, type = "p", main = "Positive Linear Relationship")
abline(lm(ybp ~ x), col = "blue")
plot(x = x, y = ybn, type = "p", main = "Negative Linear Relationship")
abline(lm(ybn ~ x), col = "red")

Quadratic Relationship:

x <- -20:20
b0 <- 1
b1 <- 1
ybp <- b0 + b1*x^2 + rnorm(41, sd = 20)
ybn <- b0 - b1*x^2 + rnorm(41, sd = 20)

par(mfrow = c(2,2))
plot(x = x, y = ybp, main = "Positive Quadratic Relationship")
abline(lm(ybp ~ x), col = "blue")
plot(x = x, y = ybn, main = "Negative Quadratic Relationship")
abline(lm(ybn ~ x), col = "red")
plot(x = x^2, y = ybp, main = "Positive Linear Relationship")
abline(lm(ybp ~ I(x^2)), col = "blue")
plot(x = x^2, y = ybn, main = "Negative Linear Relationship")
abline(lm(ybn ~ I(x^2)), col = "red")

To use a linear regression model, the relationship that we want to model should be linear, since we can only use lines (intercept + slope) to capture it. Linear regression has a bias towards a specific shape, which might not be the real shape of the relationship. A positive consequence of this bias is that the result has low variance, which means that selecting a new random sample shouldn’t change the results much.

If the relationship between the original data is not linear, then an alternative is to transform either \(y\), \(x\) or both to a different functional form. Typical transformations are: quadratic, cubic, logs, square root, reciprocal (1/x), etc.

This approach can work fine; however, there are some caveats:

  • A transformation that produces a perfectly linear relationship might not exist.

  • If we estimate the model with a transformation of:

      • \(x\), we need to apply the same transformation to any new \(x\) before using it for forecasts.

      • \(y\), we need to apply the inverse transformation to the forecasted \(y\) to compare it with the historical / original \(y\) (see the sketch below).
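
For instance, here is a minimal sketch of the second caveat (using made-up data, not part of the original): if we model \(\log(y)\), the forecasts come back on the log scale and need the inverse transformation, exp(), before we can compare them with the original \(y\).

set.seed(1)
xs <- 1:50
ys <- exp(0.1 * xs + rnorm(50, sd = 0.2))          # y grows exponentially with x
reg_log  <- lm(log(ys) ~ xs)                       # estimate on the transformed (log) scale
pred_log <- predict(reg_log, newdata = data.frame(xs = 51:55))
exp(pred_log)                                      # back-transform to the original scale of y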

6.2 Decision Trees

Decision Trees don't care; they adapt to the functional form (high flexibility):

First row: a simple decision tree with just three terminal nodes (red).
Second row: a complex decision tree with a bunch of splits (blue).

dt1 <- rpart(ybp ~ x, control = rpart.control(minsplit = 20, cp = 0.1))
dt2 <- rpart(ybn ~ x, control = rpart.control(minsplit = 20, cp = 0.1))
dt3 <- rpart(ybp ~ x, control = rpart.control(minsplit = 2, cp = 0.005))
dt4 <- rpart(ybn ~ x, control = rpart.control(minsplit = 2, cp = 0.005))

par(mfrow = c(2,2))

plot(x = x, y = ybp, main = "Positive Quadratic Relationship")
index <- sort(dt1$splits[,4])
lnode <- rep(1, length(x))
for (i in 1:length(x)) {
  for (j in 1:length(index)) {
    lnode[i] <- ifelse(x[i] > index[j], j+1, lnode[i]) 
  }
}
df_agg <- data.frame(lnode, ybp) %>% group_by(lnode) %>% summarize(avgybp = mean(ybp))
df <- merge(cbind(x, ybp, lnode), df_agg)
lines(x = df$x, y = df$avgybp, col = "red")

plot(x = x, y = ybn, main = "Negative Quadratic Relationship")
index <- sort(dt2$splits[,4])
lnode <- rep(1, length(x))
for (i in 1:length(x)) {
  for (j in 1:length(index)) {
    lnode[i] <- ifelse(x[i] > index[j], j+1, lnode[i]) 
  }
}
df_agg <- data.frame(lnode, ybn) %>% group_by(lnode) %>% summarize(avgybn = mean(ybn))
df <- merge(cbind(x, ybn, lnode), df_agg)
lines(x = df$x, y = df$avgybn, col = "red")

plot(x = x, y = ybp, main = "Positive Quadratic Relationship")
index <- sort(dt3$splits[,4])
lnode <- rep(1, length(x))
for (i in 1:length(x)) {
  for (j in 1:length(index)) {
    lnode[i] <- ifelse(x[i] > index[j], j+1, lnode[i]) 
  }
}
df_agg <- data.frame(lnode, ybp) %>% group_by(lnode) %>% summarize(avgybp = mean(ybp))
df <- merge(cbind(x, ybp, lnode), df_agg)
lines(x = df$x, y = df$avgybp, col = "blue")

plot(x = x, y = ybn, main = "Negative Quadratic Relationship")
index <- sort(dt4$splits[,4])
lnode <- rep(1, length(x))
for (i in 1:length(x)) {
  for (j in 1:length(index)) {
    lnode[i] <- ifelse(x[i] > index[j], j+1, lnode[i]) 
  }
}
df_agg <- data.frame(lnode, ybn) %>% group_by(lnode) %>% summarize(avgybn = mean(ybn))
df <- merge(cbind(x, ybn, lnode), df_agg)
lines(x = df$x, y = df$avgybn, col = "blue")

FLEXIBILITY IS GREAT! … right?

Decision Trees are very flexible and have very low bias, but have higher variance. Having a high variance means that the results might change considerably if we estimate the model using a new random sample.
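
A rough way to see this variance (a small sketch, not in the original code): fit a similarly flexible tree (small minsplit, tiny cp) on two new random subsamples of gwh and compare their predictions at a few heights. They can differ noticeably, whereas two fitted regression lines would typically stay much closer to each other.

set.seed(456)
s1 <- gwh[sample(1:nrow(gwh), 100), ]
s2 <- gwh[sample(1:nrow(gwh), 100), ]
t1 <- rpart(wt ~ ht, data = s1, control = rpart.control(minsplit = 5, cp = 0.001))
t2 <- rpart(wt ~ ht, data = s2, control = rpart.control(minsplit = 5, cp = 0.001))
nd <- data.frame(ht = c(1.55, 1.70, 1.85))
cbind(sample1 = predict(t1, nd), sample2 = predict(t2, nd))  # flexible trees: predictions move around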

6.3 Decision Trees to the Extreme

dt <- rpart(wt ~ ht, data = gwh_sub, 
            control = rpart.control(minsplit = 5, cp = 0.001))
rpart.plot(dt, digits = 5)

# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[5] <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out[5] <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt, 
     xlab = "Height", ylab = "Weight", 
     main = "Decision Tree", col = "blue")
df <- gwh_sub %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
  for (j in 1:length(index)) {
    df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i]) 
  }
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")

7. Overfitting

Let's look at the Root Mean Squared Errors of the previous Decision Trees that model weight as a function of height. Recall that we fit five such trees, each more complex (more splits) than the previous one. We will omit the first one, since its single split was just for didactic purposes, and compare the other four.

plot(y = rmse_in[2:5], x = 1:4, 
     ylab = "RMSE", xlab = "Decision Trees Model: Increasing in Complexity", 
     ylim = c(3.5,7), xaxt = "n",
     type = "b", col = "blue", main = "In-Sample: Blue | Out-Of-Sample: Red")
lines(y = rmse_out[2:5], x = 1:4, col = "red", type = "b")
axis(side = 1, at = 1:4)

The more complex the model, the better its in-sample performance. However, that is not the case when measured out-of-sample.

As complexity increases, the model's performance may improve at first, but there comes a point where the additional complexity just overfits the model to the sample. Overfitting a model decreases its out-of-sample performance, because the model becomes less "generalizable" to the population.
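
A related diagnostic (a small sketch, not in the original code): rpart runs an internal cross-validation while growing the tree, so we can inspect the cross-validated relative error (xerror) for each value of the complexity parameter; it typically stops improving once the tree becomes too complex.

printcp(dt)   # cp table with the cross-validated error (xerror) per tree size
plotcp(dt)    # visual version of the same table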

8. Machine Learning

dt <- rpart(wt ~ ht, data = gwh, 
            control = rpart.control(minsplit = 5, cp = 0.001))
rpart.plot(dt, digits = 5)

plot(x = gwh$ht, y = gwh$wt, 
     xlab = "Height", ylab = "Weight", 
     main = "Decision Tree", col = "blue")
df <- gwh %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
  for (j in 1:length(index)) {
    df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i]) 
  }
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")

In-Sample RMSE (computed on the 100-observation subsample): the first five models were trained on the subsample, while the last one was trained on the whole sample.

pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[6] <- rmse(actual = gwh_sub$wt, predicted = pred)
rmse_in
## [1] 8.349013 6.042384 5.136755 4.968517 3.794229 5.464297

Out-Of-Sample RMSE (computed on the whole sample): the first five models were trained on the subsample, so for them this is genuinely out-of-sample; the last one was trained on the whole sample, so for it this is actually in-sample.

pred <- predict(object = dt, newdata = gwh)
rmse_out[6] <- rmse(actual = gwh$wt, predicted = pred)
rmse_out
## [1] 9.050297 6.941852 6.365973 6.048274 6.748140 5.550644

Machine Learning is a process that can be used with any model, but a very attractive characteristic is that it enables the use of complex models, thanks to the recent combination of:

  • Availability of large data sets, locally and in the cloud
  • Development of cross-validation techniques
  • Development of more advanced models
  • More computing / processing power, which enables the usage of all of the above.

Cross-validation is a very important concept in machine learning, and we will discuss it further in the next document. In summary: we estimate ("train") the model several times (say, \(k\) times), each time using a different random subsample for training and testing it on the data left out of that training sample. The error estimates are then averaged over all \(k\) trials to measure the overall effectiveness of the model. This reduces bias, because most of the data is used for training, and it also reduces variance, because most of the data also ends up being used in a test set.
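
As a preview (a minimal sketch, not part of the original code) of \(k\)-fold cross-validation with the caret package loaded at the top: train() fits an rpart tree on each training fold, evaluates it on the held-out fold, and reports the average RMSE for every value of cp it tries.

set.seed(123)
cv_ctrl <- trainControl(method = "cv", number = 5)       # 5-fold cross-validation
cv_tree <- train(wt ~ ht, data = gwh_sub, method = "rpart",
                 trControl = cv_ctrl, tuneLength = 10)    # tries 10 values of cp
cv_tree$results[, c("cp", "RMSE")]                        # average out-of-fold RMSE per cp
cv_tree$bestTune                                          # cp with the lowest cross-validated RMSE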