knitr::opts_chunk$set(echo = TRUE)
options(scipen = 999) # to avoid using scientific notation
library(rpart) # Recursive PARTitioning -> Decision Trees
library(rpart.plot) # To plot Decision Trees
library(caret) # Classification And REgression Training -> Automatic Machine Learning
library(Metrics) # For Model Performance Metrics Calculation
library(tidyverse) # For Data Organization, Wrangling & Cleaning
library(e1071) # Miscellaneous Statistics Functions
Reading original data: weight in pounds and height in inches.
# Data from:
# https://www.kaggle.com/mustafaali96/weight-height
data <- read.csv("data/weight-height.csv", stringsAsFactors = T)
head(data)
## Gender Height Weight
## 1 Male 73.84702 241.8936
## 2 Male 68.78190 162.3105
## 3 Male 74.11011 212.7409
## 4 Male 71.73098 220.0425
## 5 Male 69.88180 206.3498
## 6 Male 67.25302 152.2122
Transforming data: weight in kilograms and height in meters.
gwh <- data.frame(gen = data$Gender, wt = data$Weight/2.205, ht = data$Height/39.37)
head(gwh)
## gen wt ht
## 1 Male 109.70230 1.875718
## 2 Male 73.61019 1.747064
## 3 Male 96.48111 1.882400
## 4 Male 99.79250 1.821970
## 5 Male 93.58268 1.775001
## 6 Male 69.03046 1.708230
summary(gwh)
## gen wt ht
## Female:5000 Min. : 29.34 Min. :1.378
## Male :5000 1st Qu.: 61.60 1st Qu.:1.613
## Median : 73.11 Median :1.684
## Mean : 73.22 Mean :1.686
## 3rd Qu.: 84.88 3rd Qu.:1.757
## Max. :122.44 Max. :2.007
gwh %>% group_by(gen) %>% summarize(n = n(), avght = mean(ht), avgwt = mean(wt))
## # A tibble: 2 x 4
## gen n avght avgwt
## <fct> <int> <dbl> <dbl>
## 1 Female 5000 1.62 61.6
## 2 Male 5000 1.75 84.8
A brief aside on the sample() function
sample(x = 1:10, size = 5)
## [1] 1 5 2 4 6
sample(x = 1:10, size = 5, replace = T)
## [1] 6 10 1 5 1
sample(1:10, 10)
## [1] 6 9 3 4 8 1 2 7 10 5
#sample(1:10, 11) # What's up with this one?
sample(1:10, 11, replace = T)
## [1] 3 2 6 4 2 10 5 4 8 8 9
Generate a random sample of 100 individuals
n <- nrow(gwh)
nt <- 100
set.seed(123)
subset <- sample(1:n, nt)
gwh_sub <- gwh[subset, ]
How could you predict the weight of each individual while minimizing the sum of squared errors?
mean(gwh_sub$wt)
## [1] 71.08253
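As a quick check that the mean is the constant prediction that minimizes the sum of squared errors, we can compare it against another natural candidate, the median (sse_for is just a throwaway helper defined here for illustration):
sse_for <- function(m) sum((gwh_sub$wt - m)^2) # sum of squared errors for a constant prediction m
c(mean = sse_for(mean(gwh_sub$wt)), median = sse_for(median(gwh_sub$wt)))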
How does this look in a regression with no explanatory (independent) variables?
reg <- lm(wt ~ 1, data = gwh_sub)
summary(reg)
##
## Call:
## lm(formula = wt ~ 1, data = gwh_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.592 -11.323 0.148 11.723 32.092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.083 1.441 49.33 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.41 on 99 degrees of freedom
plot(x = 1:100, y = gwh_sub$wt,
xlab = "Observation", ylab = "Weight",
main = "Decision Tree")
y <- rep(mean(gwh_sub$wt), nrow(gwh_sub))
lines(x = 1:100, y = y, col = "red")
Recall that the mean of a series is its expected value, so it serves as our baseline model. The \(R^2\) of a regression is the proportion of the variance around the mean that can be explained by the model; in other words, it measures how much the model outperforms the mean. Notice that no \(R^2\) is shown for the intercept-only model above.
Total Sum of Squares: \(SST = \sum(y_i - \overline{y})^2\)
Error Sum of Squares: \(SSE = \sum(y_i - \hat{y}_i)^2 = \sum\epsilon_i^2\)
Regression Sum of Squares: \(SSR = \sum(\hat{y}_i - \overline{y})^2\)
Coefficient of determination: \(R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\)
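As a sanity check, these quantities are easy to compute by hand for the intercept-only model fitted above (where \(\hat{y}_i = \overline{y}\), so \(SSR = 0\) and \(R^2 = 0\) by construction):
obs <- gwh_sub$wt
fit <- predict(reg)                  # fitted values of the intercept-only model: all equal to the mean
sst <- sum((obs - mean(obs))^2)      # total sum of squares
sse <- sum((obs - fit)^2)            # error (residual) sum of squares
ssr <- sum((fit - mean(obs))^2)      # regression sum of squares
c(R2_from_SSR = ssr/sst, R2_from_SSE = 1 - sse/sst)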
What if we had additional information, like the sex or gender of each individual? Could we improve our performance?
mean(gwh_sub[gwh_sub$gen=="Female", "wt"])
## [1] 61.29075
mean(gwh_sub[gwh_sub$gen=="Male", "wt"])
## [1] 84.06232
This is essentially a Decision Tree:
dt <- rpart(wt ~ gen, data = gwh_sub)
rpart.plot(dt, digits = 5)
We can do the same thing with a linear regression though:
reg <- lm(wt ~ gen, data = gwh_sub)
summary(reg)
##
## Call:
## lm(formula = wt ~ gen, data = gwh_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.2888 -6.0126 -0.1374 6.8094 19.1126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.291 1.185 51.7 <0.0000000000000002 ***
## genMale 22.772 1.808 12.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.95 on 98 degrees of freedom
## Multiple R-squared: 0.6182, Adjusted R-squared: 0.6143
## F-statistic: 158.7 on 1 and 98 DF, p-value: < 0.00000000000000022
How do we interpret the coefficients? Do they match the decision tree? Notice that now we do have an \(R^2\).
sum(reg$coefficients)
## [1] 84.06232
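A minimal check, using the reg and dt objects fitted above: the intercept is the average weight for females, and intercept + genMale is the average for males, which should match the tree’s two terminal nodes.
c(female = unname(reg$coefficients[1]), male = sum(reg$coefficients))
dt$frame[dt$frame$var == "<leaf>", "yval"]   # leaf averages of the decision tree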
plot(x = 1:100, y = gwh_sub$wt, col = as.numeric(gwh_sub$gen)*2,
xlab = "Observation", ylab = "Weight",
main = "Decision Tree: Females (red) vs Males (blue)")
lines(x = 1:100, y = rep(dt$frame[2,5],100), col = "red")
lines(x = 1:100, y = rep(dt$frame[3,5],100), col = "blue")
What if we used height instead of gender? Decision Trees are binary… right?
dt <- rpart(wt ~ ht, data = gwh_sub, control = rpart.control(maxdepth = 1))
rpart.plot(dt, digits = 5)
# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out <- rmse(actual = gwh$wt, predicted = pred)
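For reference, rmse() from the Metrics package is just the square root of the average squared error, so the same numbers can be reproduced by hand:
sqrt(mean((gwh_sub$wt - predict(dt, newdata = gwh_sub))^2))   # should match rmse_in
sqrt(mean((gwh$wt - predict(dt, newdata = gwh))^2))           # should match rmse_out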
plot(x = gwh_sub$ht, y = gwh_sub$wt,
xlab = "Height", ylab = "Weight",
main = "Decision Tree", col = "blue")
df <- data.frame(ht = sort(gwh_sub$ht))
df$avgwt <- ifelse(df$ht < dt$splits[1,4], dt$frame[2,5], dt$frame[3,5])
lines(x = df$ht, y = df$avgwt, col = "red")
What about regression?
reg <- lm(wt ~ ht, data = gwh_sub)
summary(reg)
##
## Call:
## lm(formula = wt ~ ht, data = gwh_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2611 -3.5415 0.4786 3.4256 13.8621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -164.324 9.690 -16.96 <0.0000000000000002 ***
## ht 140.829 5.788 24.33 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.458 on 98 degrees of freedom
## Multiple R-squared: 0.858, Adjusted R-squared: 0.8565
## F-statistic: 592 on 1 and 98 DF, p-value: < 0.00000000000000022
plot(x = gwh_sub$ht, y = gwh_sub$wt,
xlab = "Height", ylab = "Weight",
main = "Linear Regression", col = "blue")
abline(reg, col = "red")
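Once fitted, the regression can be used directly to forecast the weight of new heights; for example (the 1.60 m and 1.80 m values are just illustrative):
predict(reg, newdata = data.frame(ht = c(1.60, 1.80)))   # predicted weight in kg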
Case closed?
Notice that, in this case, whether or not we include a constant (intercept) affects how easy the model is to interpret. For mathematical purposes, such as a valid calculation of \(R^2\), it is generally better to keep the constant even if it is not significant (unlike other explanatory variables, which should be removed if they are not significant).
reg <- lm(wt ~ ht - 1, data = gwh_sub)
summary(reg)
##
## Call:
## lm(formula = wt ~ ht - 1, data = gwh_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0357 -8.1149 -0.9992 8.6216 21.9532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## ht 42.8361 0.6434 66.58 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.77 on 99 degrees of freedom
## Multiple R-squared: 0.9782, Adjusted R-squared: 0.9779
## F-statistic: 4433 on 1 and 99 DF, p-value: < 0.00000000000000022
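The inflated \(R^2\) above is an artifact: with no intercept, summary.lm() compares the model against a baseline of zero rather than against the mean of wt. A quick check by hand:
fit <- predict(reg)                                             # fitted values of the no-intercept model
1 - sum((gwh_sub$wt - fit)^2) / sum(gwh_sub$wt^2)               # should match the reported Multiple R-squared
1 - sum((gwh_sub$wt - fit)^2) / sum((gwh_sub$wt - mean(gwh_sub$wt))^2)   # R^2 measured against the mean (much lower)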
plot(x = gwh_sub$ht, y = gwh_sub$wt,
xlab = "Height", ylab = "Weight",
main = "Linear Regression", col = "blue")
abline(reg, col = "red")
dt <- rpart(wt ~ ht, data = gwh_sub,
control = rpart.control(minsplit = 20, cp = 0.05))
rpart.plot(dt, digits = 5)
# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[2] <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out[2] <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt,
xlab = "Height", ylab = "Weight",
main = "Decision Tree", col = "blue")
df <- gwh_sub %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
for (j in 1:length(index)) {
df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i])
}
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")
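As a sanity check, the hand-built step line should coincide with what the tree itself predicts for these observations:
all.equal(unname(predict(dt, newdata = df)), df$avgwt)   # leaf averages vs predict(), should be TRUE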
dt <- rpart(wt ~ ht, data = gwh_sub,
control = rpart.control(minsplit = 10, cp = 0.005))
rpart.plot(dt, digits = 5)
# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[3] <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out[3] <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt,
xlab = "Height", ylab = "Weight",
main = "Decision Tree", col = "blue")
df <- gwh_sub %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
for (j in 1:length(index)) {
df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i])
}
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")
dt <- rpart(wt ~ ht, data = gwh_sub,
control = rpart.control(minsplit = 10, cp = 0.003))
rpart.plot(dt, digits = 5)
# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[4] <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out[4] <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt,
xlab = "Height", ylab = "Weight",
main = "Decision Tree", col = "blue")
df <- gwh_sub %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
for (j in 1:length(index)) {
df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i])
}
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")
Linear Regression for a:
Linear Relationship:
x <- 0:40
b0 <- 10
b1 <- 5
ybp <- b0 + b1*x + rnorm(41, sd = 20)
ybn <- b0 - b1*x + rnorm(41, sd = 20)
par(mfrow = c(1,2))
plot(x = x, y = ybp, type = "p", main = "Positive Linear Relationship")
abline(lm(ybp ~ x), col = "blue")
plot(x = x, y = ybn, type = "p", main = "Negative Linear Relationship")
abline(lm(ybn ~ x), col = "red")
Quadratic Relationship:
x <- -20:20
b0 <- 1
b1 <- 1
ybp <- b0 + b1*x^2 + rnorm(41, sd = 20)
ybn <- b0 - b1*x^2 + rnorm(41, sd = 20)
par(mfrow = c(2,2))
plot(x = x, y = ybp, main = "Positive Quadratic Relationship")
abline(lm(ybp ~ x), col = "blue")
plot(x = x, y = ybn, main = "Negative Quadratic Relationship")
abline(lm(ybn ~ x), col = "red")
plot(x = x^2, y = ybp, main = "Positive Linear Relationship")
abline(lm(ybp ~ I(x^2)), col = "blue")
plot(x = x^2, y = ybn, main = "Negative Linear Relationship")
abline(lm(ybn ~ I(x^2)), col = "red")
To use a linear regression model, the relationship that we want to model should be linear, since we can only use lines (intercept + slope) to capture it. Linear regression has a bias towards a specific shape, which might not be the real shape of the relationship. A positive consequence of this bias is that the result has low variance, which means that selecting a new random sample shouldn’t change the results much.
If the relationship between the original data is not linear, then an alternative is to transform either \(y\), \(x\) or both to a different functional form. Typical transformations are: quadratic, cubic, logs, square root, reciprocal (1/x), etc.
This approach could work fine; however, there are some caveats:
- A transformation that yields a perfectly linear relationship might not exist.
- If we estimate the model with a transformation of:
  + \(x\), we need to apply the same transformation to the new \(x\) values before using them for forecasts.
  + \(y\), we need to apply the inverse transformation to the forecasted \(y\) to compare it with the historical / original \(y\) (see the sketch below).
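A minimal sketch of the \(y\)-transformation caveat, reusing the height/weight subsample from above (the log transform is just an illustration, and we ignore retransformation bias):
log_fit <- lm(log(wt) ~ ht, data = gwh_sub)          # model the log of weight
new_ht <- data.frame(ht = c(1.60, 1.80))
exp(predict(log_fit, newdata = new_ht))              # back-transform to the original weight scale (kg)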
Decision Trees don’t care; they adapt to the functional form (high flexibility):
First row: a simple decision tree with just three terminal nodes (red).
Second row: a complex decision tree with a bunch of splits (blue).
dt1 <- rpart(ybp ~ x, control = rpart.control(minsplit = 20, cp = 0.1))
dt2 <- rpart(ybn ~ x, control = rpart.control(minsplit = 20, cp = 0.1))
dt3 <- rpart(ybp ~ x, control = rpart.control(minsplit = 2, cp = 0.005))
dt4 <- rpart(ybn ~ x, control = rpart.control(minsplit = 2, cp = 0.005))
par(mfrow = c(2,2))
plot(x = x, y = ybp, main = "Positive Quadratic Relationship")
index <- sort(dt1$splits[,4])
lnode <- rep(1, length(x))
for (i in 1:length(x)) {
for (j in 1:length(index)) {
lnode[i] <- ifelse(x[i] > index[j], j+1, lnode[i])
}
}
df_agg <- data.frame(lnode, ybp) %>% group_by(lnode) %>% summarize(avgybp = mean(ybp))
df <- merge(cbind(x, ybp, lnode), df_agg)
lines(x = df$x, y = df$avgybp, col = "red")
plot(x = x, y = ybn, main = "Negative Quadratic Relationship")
index <- sort(dt2$splits[,4])
lnode <- rep(1, length(x))
for (i in 1:length(x)) {
for (j in 1:length(index)) {
lnode[i] <- ifelse(x[i] > index[j], j+1, lnode[i])
}
}
df_agg <- data.frame(lnode, ybn) %>% group_by(lnode) %>% summarize(avgybn = mean(ybn))
df <- merge(cbind(x, ybn, lnode), df_agg)
lines(x = df$x, y = df$avgybn, col = "red")
plot(x = x, y = ybp, main = "Positive Quadratic Relationship")
index <- sort(dt3$splits[,4])
lnode <- rep(1, length(x))
for (i in 1:length(x)) {
for (j in 1:length(index)) {
lnode[i] <- ifelse(x[i] > index[j], j+1, lnode[i])
}
}
df_agg <- data.frame(lnode, ybp) %>% group_by(lnode) %>% summarize(avgybp = mean(ybp))
df <- merge(cbind(x, ybp, lnode), df_agg)
lines(x = df$x, y = df$avgybp, col = "blue")
plot(x = x, y = ybn, main = "Negative Quadratic Relationship")
index <- sort(dt4$splits[,4])
lnode <- rep(1, length(x))
for (i in 1:length(x)) {
for (j in 1:length(index)) {
lnode[i] <- ifelse(x[i] > index[j], j+1, lnode[i])
}
}
df_agg <- data.frame(lnode, ybn) %>% group_by(lnode) %>% summarize(avgybn = mean(ybn))
df <- merge(cbind(x, ybn, lnode), df_agg)
lines(x = df$x, y = df$avgybn, col = "blue")
FLEXIBILITY IS GREAT! … right?
Decision Trees are very flexible and have very low bias, but have higher variance. Having a high variance means that the results might change considerably if we estimate the model using a new random sample.
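A quick way to see this variance in action: fit the same deep tree on two fresh random subsamples of the same size and compare their predictions (this is just a sketch; the seeds and the grid of heights are arbitrary):
set.seed(1); sub_a <- gwh[sample(1:n, nt), ]
set.seed(2); sub_b <- gwh[sample(1:n, nt), ]
dt_a <- rpart(wt ~ ht, data = sub_a, control = rpart.control(minsplit = 5, cp = 0.001))
dt_b <- rpart(wt ~ ht, data = sub_b, control = rpart.control(minsplit = 5, cp = 0.001))
grid <- data.frame(ht = seq(1.5, 1.9, by = 0.1))
cbind(grid, pred_a = predict(dt_a, newdata = grid), pred_b = predict(dt_b, newdata = grid))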
dt <- rpart(wt ~ ht, data = gwh_sub,
control = rpart.control(minsplit = 5, cp = 0.001))
rpart.plot(dt, digits = 5)
# saving these for later... don't ask =P
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[5] <- rmse(actual = gwh_sub$wt, predicted = pred)
pred <- predict(object = dt, newdata = gwh)
rmse_out[5] <- rmse(actual = gwh$wt, predicted = pred)
plot(x = gwh_sub$ht, y = gwh_sub$wt,
xlab = "Height", ylab = "Weight",
main = "Decision Tree", col = "blue")
df <- gwh_sub %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
for (j in 1:length(index)) {
df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i])
}
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")
Let’s look at the Root Mean Squared Errors of the previous Decision Trees for the weight variable as a function of height. Recall that we had five models, each more complex (more splits) than the previous ones. We will omit the first model since it had a single split just for didactic purposes, but we will compare the other four models.
plot(y = rmse_in[2:5], x = 1:4,
ylab = "RMSE", xlab = "Decision Trees Model: Increasing in Complexity",
ylim = c(3.5,7), xaxt = "n",
type = "b", col = "blue", main = "In-Sample: Blue | Out-Of-Sample: Red")
lines(y = rmse_out[2:5], x = 1:4, col = "red", type = "b")
axis(side = 1, at = 1:4)
The more complex the model, the better its in-sample performance. However, that is not the case when measured out-of-sample.
As complexity increases, the model’s performance might improve too, but there comes a point where the additional complexity just overfits the model to the sample. Overfitting decreases a model’s out-of-sample performance because it is less “generalizable” to the population.
dt <- rpart(wt ~ ht, data = gwh,
control = rpart.control(minsplit = 5, cp = 0.001))
rpart.plot(dt, digits = 5)
plot(x = gwh$ht, y = gwh$wt,
xlab = "Height", ylab = "Weight",
main = "Decision Tree", col = "blue")
df <- gwh %>% arrange(ht)
index <- sort(dt$splits[,4])
df$lnode <- 1
for (i in 1:nrow(df)) {
for (j in 1:length(index)) {
df$lnode[i] <- ifelse(df$ht[i] > index[j], j+1, df$lnode[i])
}
}
df_agg <- df %>% select(lnode, wt) %>% group_by(lnode) %>% summarize(avgwt = mean(wt))
df <- merge(df, df_agg)
lines(x = df$ht, y = df$avgwt, col = "red")
In-Sample RMSE (all evaluated on the 100-observation subsample): the first five models were trained on that subsample, while the last one was trained on the whole sample.
pred <- predict(object = dt, newdata = gwh_sub)
rmse_in[6] <- rmse(actual = gwh_sub$wt, predicted = pred)
rmse_in
## [1] 8.349013 6.042384 5.136755 4.968517 3.794229 5.464297
Out-Of-Sample RMSE (all evaluated on the whole 10,000-observation sample): the first five models were trained on the subsample, while the last one was trained on the whole sample, so its value here is not truly out-of-sample.
pred <- predict(object = dt, newdata = gwh)
rmse_out[6] <- rmse(actual = gwh$wt, predicted = pred)
rmse_out
## [1] 9.050297 6.941852 6.365973 6.048274 6.748140 5.550644
Machine Learning is a process that can be used with any model, but a very attractive characteristic is that it enables the use of complex models thanks to the recent combination of:
Cross-validation is a very important concept in machine learning, and we will discuss it more in the next document. As a summary: it involves estimating (“training”) the model several times (e.g. \(k\) times), each time using a different random sample and testing it on data outside the training sample used. The error is then averaged over all \(k\) trials to estimate the overall effectiveness of our model. This reduces bias, since most of the data is used for training, and it also reduces variance, since most of the data is also used in the test sets.
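As a preview of what that looks like in practice, here is a minimal sketch of k-fold cross-validation with the caret package loaded above (k = 5 and tuneLength = 5 are arbitrary choices):
ctrl <- trainControl(method = "cv", number = 5)        # 5-fold cross-validation
cv_dt <- train(wt ~ ht, data = gwh_sub, method = "rpart",
               trControl = ctrl, tuneLength = 5)       # tune the complexity parameter cp
cv_dt                                                  # cross-validated RMSE for each cp value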