Supervised Learning in R - Regression (DataCamp)

Ch. 1 - What is Regression?

Welcome and Introduction

[Video]

Identify the regression tasks

From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values.

Which of the following tasks are regression problems?

Predict (yes/no) whether a student will pass the final exam, given scores on midterms and homework.
[*] Predict the student’s score (0 - 100) on the final exam, given scores on midterms and homework.
Predict the student’s final grade (A, B, C, D, F) in the class given scores on midterms and homework (before they’ve taken the final exam).

Linear regression - the fundamental method

[Video]

Linear Regression in R: lm()

# cmodel <- lm(temperature ~ chirps_per_sec, data = cricket())

Formulas

fmla_1 <- temperature ~ chirps_per_sec
fmla_2 <- blood_pressure ~ age + weight

fmla_1 <- as.formula("temperature ~ chirps_per_sec")

Looking at the Model

# cmodel

More Information about the Model

# summary(cmodel)
# 
# broom::glance(cmodel)
# 
# sigr::wrapFTest(cmodel)

Code a simple one-variable regression

# unemployment is loaded in the workspace
summary(unemployment)

##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900

# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment

# Print it
fmla

## female_unemployment ~ male_unemployment

# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)

# Print it
unemployment_model

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

Examining a model

# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

# Call summary() on unemployment_model to get more details
summary(unemployment_model)

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

# Call glance() on unemployment_model to see the details in a tidier form
broom::glance(unemployment_model)

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.821         0.805 0.580      50.6 1.97e-5     2  -10.3  26.6  28.3
## # … with 2 more variables: deviance <dbl>, df.residual <int>

# Call wrapFTest() on unemployment_model to see the most relevant details
sigr::wrapFTest(unemployment_model)

## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."

Predicting once you fit a model

[Video]

Predicting From the Training Data

# cricket$prediction <- predict(model)

Looking at the Predictions

# ggplot(cricket, aes(x = prediction, y = temperature)) +
#   geom_point() +
#   geom_abline(color = "darkblue") +
#   ggtitle("temperature vs. linear model prediction")

Predicting on New Data

# newchirps <- data.frame(chirps_per_sec = 16.5)
# newchirps$prediction <- predict(cmodel, newdata = newchirps)
# newchirps

Predicting from the unemployment model

# unemployment is in your workspace
summary(unemployment)

##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900

# newrates is in your workspace
newrates

##   male_unemployment
## 1                 5

# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model, data = unemployment)

# load the ggplot2 package
library(ggplot2)

# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newdata = newrates)
# Print it
pred

##        1 
## 4.906757

Multivariate linear regression (Part 1)

# bloodpressure is in the workspace
summary(bloodpressure)

##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220

# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla

## blood_pressure ~ age + weight

# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model

## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349

summary(bloodpressure_model)

## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4640 -1.1949 -0.4078  1.8511  2.6981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  30.9941    11.9438   2.595  0.03186 * 
## age           0.8614     0.2482   3.470  0.00844 **
## weight        0.3349     0.1307   2.563  0.03351 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9711 
## F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07

Multivariate linear regression (Part 2)

# bloodpressure is in your workspace
summary(bloodpressure)

##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220

# bloodpressure_model is in your workspace
bloodpressure_model

## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349

# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, data = bloodpressure)

# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")

Wrapping up linear regression

[Video]

Ch. 2 - Training and Evaluating Regression Models

Evaluating a model graphically

[Video]

Reading the Gain Curve

# GainCurvePlot(houseprices, "prediction", "price", "Home price model")

Graphically evaluate the unemployment model

# unemployment, unemployment_model are in the workspace
summary(unemployment)

##  male_unemployment female_unemployment   prediction   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240

summary(unemployment_model)

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, data = unemployment)

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) + 
  geom_point() + 
  geom_abline()

# From previous step
unemployment$predictions <- predict(unemployment_model)

# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions

# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

The gain curve to evaluate the unemployment model

# unemployment is in the workspace (with predictions)
summary(unemployment)

##  male_unemployment female_unemployment   prediction     predictions   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601   Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240  
##    residuals       
##  Min.   :-0.77621  
##  1st Qu.:-0.34050  
##  Median :-0.09004  
##  Mean   : 0.00000  
##  3rd Qu.: 0.27911  
##  Max.   : 1.31254

# unemployment_model is in the workspace
summary(unemployment_model)

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

# Load the package WVPlots
library(WVPlots)

# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")

Root Mean Squared Error (RMSE)

[Video]

RMSE of the Home Sales Price Model

# Calculate error
# err <- houseprices$prediction - houseprices$price

# Square the error vector
# err2 <- err ^ 2

# Take the mean, and sqrt it
# (rmse <- sqrt(mean(err2)))

Is the RMSE Large or Small?

RMSE ≈ 58.3 sd(price) ≈ 135

Calculate RMSE

# unemployment is in the workspace
summary(unemployment)

##  male_unemployment female_unemployment   prediction     predictions   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601   Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240  
##    residuals       
##  Min.   :-0.77621  
##  1st Qu.:-0.34050  
##  Median :-0.09004  
##  Mean   : 0.00000  
##  3rd Qu.: 0.27911  
##  Max.   : 1.31254

# For convenience put the residuals in the variable res
res <- unemployment$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res ^ 2)))

## [1] 0.5337612

# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))

## [1] 1.314271

R-Squared

[Video]

What is R^2?

A measure of how well the model fits or explains the data

A value between 0-1 near 1: model fits well near 0: no better than guessing the average value

Calculating R^2R

R^2 is the variance explained by the model.

R^2 = 1 - R

R^2 = 1 − RSS / SStot

where

Residual sum of squares (variance from model) RSS = ∑(y−prediction)^2

Total sum of squares (variance of data) SStot = ∑(y−y‾)^2

Calculate R^2R

Calculate error

# err <- houseprices$prediction - houseprices$price

Square it and take the sum

# rss <- sum(err^2)

price: column of actual sale prices (in thousands) pred: column of predicted sale prices (in thousands) RSS ≈ 136138

Calculate R^2 of the House Price Model: SStot

Take the difference of prices from the mean price

# toterr <- houseprices$price - mean(houseprices$price)

Square it and take the sum

# sstot <- sum(toterr^2)

RSS ≈ 136138 SStot ≈ 713615

Calculate R^2 of the House Price Model

# (r_squared <- 1 - (rss/sstot))

RSS ≈ 136138 SStot ≈ 713615 R^2 ≈ 0.809

Reading R^2 from the lm() model

# # From summary()
# summary(hmodel)

# summary(hmodel)$r.squared

# # From glance()
# glance(hmodel)$r.squared

Correlation and R^2

# rho <- cor(houseprices$prediction, houseprices$price)

# rho^2

ρ = cor(prediction, price) = 0.8995709 ρ^2 = 0.8092278 = R^2

Calculate R-Squared

# unemployment is in your workspace
summary(unemployment)

##  male_unemployment female_unemployment   prediction     predictions   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601   Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240  
##    residuals       
##  Min.   :-0.77621  
##  1st Qu.:-0.34050  
##  Median :-0.09004  
##  Mean   : 0.00000  
##  3rd Qu.: 0.27911  
##  Max.   : 1.31254

# unemployment_model is in the workspace
summary(unemployment_model)

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))

## [1] 5.569231

# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))

## [1] 20.72769

# Calculate residual sum of squares: rss. Print it
(rss <- sum(unemployment$residuals ^ 2))

## [1] 3.703714

# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - rss / tss)

## [1] 0.8213157

# Get R-squared from glance. Print it
(rsq_glance <- broom::glance(unemployment_model)$r.squared)

## [1] 0.8213157

Correlation and R-squared

# unemployment is in your workspace
summary(unemployment)

##  male_unemployment female_unemployment   prediction     predictions   
##  Min.   :2.900     Min.   :4.000       Min.   :3.448   Min.   :3.448  
##  1st Qu.:4.900     1st Qu.:4.400       1st Qu.:4.837   1st Qu.:4.837  
##  Median :6.000     Median :5.200       Median :5.601   Median :5.601  
##  Mean   :5.954     Mean   :5.569       Mean   :5.569   Mean   :5.569  
##  3rd Qu.:6.700     3rd Qu.:6.100       3rd Qu.:6.087   3rd Qu.:6.087  
##  Max.   :9.800     Max.   :7.900       Max.   :8.240   Max.   :8.240  
##    residuals       
##  Min.   :-0.77621  
##  1st Qu.:-0.34050  
##  Median :-0.09004  
##  Mean   : 0.00000  
##  3rd Qu.: 0.27911  
##  Max.   : 1.31254

# unemployment_model is in the workspace
summary(unemployment_model)

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))

## [1] 0.9062647

# Square rho: rho2 and print it
(rho2 <- rho ^ 2)

## [1] 0.8213157

# Get R-squared from glance and print it
(rsq_glance <- broom::glance(unemployment_model)$r.squared)

## [1] 0.8213157

Properly Training a Model

[Video]

Create a cross-validation plan

# library(vtreat)
# splitPlan <- kWayCrossValidation(nRows, nSplits, NULL, NULL)

# library(vtreat)
# splitPlan <- kWayCrossValidation(10, 3, NULL, NULL)

First fold (A and B to train, C to test)

# splitPlan[[1]]

Train A and B, test on C, etc.

# split <- splitPlan[[1]]
# model <- lm(fmla, data = df[split$train, ])
# df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])

Generating a random test/train split

# mpg is in the workspace
summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

dim(mpg)

## [1] 234  11

# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))

## [1] 234

# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))

## [1] 176

# Create the vector of N uniform random variables: gp
gp <- runif(N)

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)

## [1] 185

nrow(mpg_test)

## [1] 49

Train a model using test/train split

# mpg_train is in the workspace
summary(mpg_train)

##  manufacturer          model               displ            year     
##  Length:185         Length:185         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :1999  
##                                        Mean   :3.438   Mean   :2003  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :6.200   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:185         Length:185         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.854                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:185         Length:185        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :25.00   Mode  :character   Mode  :character  
##  Mean   :23.45                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)

## cty ~ hwy

# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, data = mpg_train)

# Use summary() to examine the model
summary(mpg_model)

## 
## Call:
## lm(formula = fmla, data = mpg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9294 -0.6836 -0.0437  0.6449  4.5621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.71513    0.38349   1.865   0.0638 .  
## hwy          0.68857    0.01586  43.413   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.267 on 183 degrees of freedom
## Multiple R-squared:  0.9115, Adjusted R-squared:  0.911 
## F-statistic:  1885 on 1 and 183 DF,  p-value: < 2.2e-16

Evaluate a model using test/train split

# Examine the objects in the workspace
ls.str()

## bloodpressure : 'data.frame':    11 obs. of  4 variables:
##  $ blood_pressure: int  132 143 153 162 154 168 137 149 159 128 ...
##  $ age           : int  52 59 67 73 64 74 54 61 65 46 ...
##  $ weight        : int  173 184 194 211 196 220 188 188 207 167 ...
##  $ prediction    : num  134 143 154 165 152 ...
## bloodpressure_model : List of 12
##  $ coefficients : Named num [1:3] 30.994 0.861 0.335
##  $ residuals    : Named num [1:11] -1.718 -0.432 -0.672 -2.533 2.243 ...
##  $ effects      : Named num [1:11] -497.8 42.17 5.94 -2.11 2.65 ...
##  $ rank         : int 3
##  $ fitted.values: Named num [1:11] 134 143 154 165 152 ...
##  $ assign       : int [1:3] 0 1 2
##  $ qr           :List of 5
##  $ df.residual  : int 8
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = fmla, data = bloodpressure)
##  $ terms        :Classes 'terms', 'formula'  language blood_pressure ~ age + weight
##  $ model        :'data.frame':   11 obs. of  3 variables:
## fe_mean :  num 5.57
## fmla : Class 'formula'  language cty ~ hwy
## fmla_1 : Class 'formula'  language temperature ~ chirps_per_sec
## fmla_2 : Class 'formula'  language blood_pressure ~ age + weight
## gp :  num [1:234] 0.3815 0.0202 0.2264 0.5662 0.4676 ...
## mpg_model : List of 12
##  $ coefficients : Named num [1:2] 0.715 0.689
##  $ residuals    : Named num [1:185] -2.684 0.316 -2.061 -0.372 -2.618 ...
##  $ effects      : Named num [1:185] -229.387 -55.007 -1.919 -0.225 -2.447 ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:185] 20.7 20.7 22.1 21.4 18.6 ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##  $ df.residual  : int 183
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = fmla, data = mpg_train)
##  $ terms        :Classes 'terms', 'formula'  language cty ~ hwy
##  $ model        :'data.frame':   185 obs. of  2 variables:
## mpg_test : Classes 'tbl_df', 'tbl' and 'data.frame': 49 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "chevrolet" "chevrolet" ...
##  $ model       : chr  "a4" "a4 quattro" "corvette" "corvette" ...
##  $ displ       : num  3.1 2 5.7 7 6.5 2.4 3 3.3 3.7 3.9 ...
##  $ year        : int  2008 2008 1999 2008 1999 2008 1999 1999 2008 1999 ...
##  $ cyl         : int  6 4 8 8 8 4 6 6 6 6 ...
##  $ trans       : chr  "auto(av)" "manual(m6)" "manual(m6)" "manual(m6)" ...
##  $ drv         : chr  "f" "4" "r" "r" ...
##  $ cty         : int  18 20 16 15 14 22 17 16 15 14 ...
##  $ hwy         : int  27 28 26 24 17 30 24 22 19 17 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "2seater" "2seater" ...
## mpg_train : Classes 'tbl_df', 'tbl' and 'data.frame':    185 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 1.8 1.8 2 2.8 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 1999 1999 2008 1999 ...
##  $ cyl         : int  4 4 4 4 6 6 4 4 4 6 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 16 19 15 ...
##  $ hwy         : int  29 29 31 30 26 26 26 25 27 25 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...
## mpgtest : Classes 'tbl_df', 'tbl' and 'data.frame':  54 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4 quattro" "a4 quattro" ...
##  $ displ       : num  2 3.1 1.8 2 4.2 5.3 6.5 2.4 3.7 5.2 ...
##  $ year        : int  2008 2008 1999 2008 2008 2008 1999 1999 2008 1999 ...
##  $ cyl         : int  4 6 4 4 8 8 8 4 6 8 ...
##  $ trans       : chr  "auto(av)" "auto(av)" "auto(l5)" "auto(s6)" ...
##  $ drv         : chr  "f" "f" "4" "4" ...
##  $ cty         : int  21 18 16 19 16 11 14 19 15 11 ...
##  $ hwy         : int  30 27 25 27 23 14 17 27 19 17 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...
## mpgtrain : Classes 'tbl_df', 'tbl' and 'data.frame': 180 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2.8 2.8 1.8 2 2.8 2.8 3.1 ...
##  $ year        : int  1999 1999 2008 1999 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 6 6 4 4 6 6 6 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(l5)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 16 18 18 20 15 17 17 ...
##  $ hwy         : int  29 29 31 26 26 26 28 25 25 25 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...
## N :  int 234
## newrates : 'data.frame': 1 obs. of  1 variable:
##  $ male_unemployment: int 5
## pred :  Named num 4.91
## r_squared : function (prediction, actual)  
## res :  num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
## rho :  num 0.906
## rho2 :  num 0.821
## rmse : function (prediction, actual)  
## rsq :  num 0.821
## rsq_glance :  num 0.821
## rss :  num 3.7
## sd_unemployment :  num 1.31
## target :  num 176
## tss :  num 20.7
## unemployment : 'data.frame': 13 obs. of  5 variables:
##  $ male_unemployment  : num  2.9 6.7 4.9 7.9 9.8 ...
##  $ female_unemployment: num  4 7.4 5 7.2 7.9 ...
##  $ prediction         : num  3.45 6.09 4.84 6.92 8.24 ...
##  $ predictions        : num  3.45 6.09 4.84 6.92 8.24 ...
##  $ residuals          : num  0.552 1.313 0.163 0.279 -0.34 ...
## unemployment_model : List of 12
##  $ coefficients : Named num [1:2] 1.434 0.695
##  $ residuals    : Named num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
##  $ effects      : Named num [1:13] -20.08 -4.126 0.106 -0.264 -1.192 ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:13] 3.45 6.09 4.84 6.92 8.24 ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##  $ df.residual  : int 11
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = fmla, data = unemployment)
##  $ terms        :Classes 'terms', 'formula'  language female_unemployment ~ male_unemployment
##  $ model        :'data.frame':   13 obs. of  2 variables:

# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model, data = mpg_train)

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, newdata = mpg_test)

# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))

## [1] 1.260217

(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))

## [1] 1.198051

# Evaluate the r-squared on both training and test data.and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))

## [1] 0.9114938

(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))

## [1] 0.9219146

# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()

Create a cross validation plan

# Load the package vtreat
library(vtreat)

## Loading required package: wrapr

# mpg is in the workspace
summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

# Get the number of rows in mpg
nRows <- nrow(mpg)

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)

# Examine the split plan
str(splitPlan)

## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 6 8 11 13 14 15 16 19 ...
##   ..$ app  : int [1:78] 223 9 54 77 27 99 67 18 119 103 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 4 5 6 7 8 9 10 12 ...
##   ..$ app  : int [1:78] 19 23 159 219 138 154 204 139 124 101 ...
##  $ :List of 2
##   ..$ train: int [1:156] 2 3 4 5 7 9 10 11 12 14 ...
##   ..$ app  : int [1:78] 85 150 217 51 122 175 135 31 76 60 ...
##  - attr(*, "splitmethod")= chr "kwaycross"

Evaluate a modeling procedure using n-fold cross-validation

# mpg is in the workspace
summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

# splitPlan is in the workspace
str(splitPlan)

## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 6 8 11 13 14 15 16 19 ...
##   ..$ app  : int [1:78] 223 9 54 77 27 99 67 18 119 103 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 4 5 6 7 8 9 10 12 ...
##   ..$ app  : int [1:78] 19 23 159 219 138 154 204 139 124 101 ...
##  $ :List of 2
##   ..$ train: int [1:156] 2 3 4 5 7 9 10 11 12 14 ...
##   ..$ app  : int [1:78] 85 150 217 51 122 175 135 31 76 60 ...
##  - attr(*, "splitmethod")= chr "kwaycross"

# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)

## [1] 1.247045

# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)

## [1] 1.289782

Ch. 3 - Issues to Consider

Categorical inputs

Examining the structure of categorical inputs

Modeling with categorical inputs

Interactions

Modeling an interaction

Modeling an interaction (2)

Transforming the response before modeling

Relative error

Modeling log-transformed monetary output

Comparing RMSE and root-mean-squared Relative Error

Transforming inputs before modeling

Input transforms: the “hockey stick”

Input transforms: the “hockey stick” (2)

Ch. 4 - Dealing with Non-Linear Responses

Logistic regression to predict probabilities

Fit a model of sparrow survival probability

Predict sparrow survival

Poisson and quasipoisson regression to predict counts

Poisson or quasipoisson

Fit a model to predict bike rental counts

Predict bike rentals on new data

Visualize the Bike Rental Predictions

GAM to learn non-linear transforms

Writing formulas for GAM models

Writing formulas for GAM models (2)

Model soybean growth with GAM

Predict with the soybean model on test data

Ch. 5 - Tree-Based Methods

The intuition behind tree-based methods

[Video]

Predicting with a decision tree

Here you see the decision tree learned from the brain data set shown in the previous video. The tree predicts the expected intelligence (humans have an intelligence of 1) of several mammals, as a function of gestation time (in days) and average litter size for that species.

The leaf nodes show the expected brain size for the datums in that node, as well as how many (percentage-wise) of the datums fall into the node.

You want to predict the intelligence of a gorilla, which has a gestation time of 265 days and an average litter size of 1.

What relative brain size does this tree model predict?

0.073
0.131
[*] 0.148
0.161
0.274

Random forests

[Video]

Random Forests with ranger()

# model <- ranger(fmla, bikesJan, num.trees = 500, respect.unordered.factors = "order")

Predicting with a ranger() model

# bikesFed$pred <- predict(model, bikesFeb)$predictions

Build a random forest model for bike rentals

# bikesJuly is in the workspace
str(bikesJuly)

## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
##  $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
##  $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
##  $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
##  $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
##  $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
##  $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

# Random seed to reproduce results
seed <- 423563

# The outcome column
(outcome <- "cnt")

## [1] "cnt"

# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))

## [1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
## [6] "atemp"      "hum"        "windspeed"

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))

## [1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"

# Load the package ranger
library(ranger)

# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula 
                         bikesJuly, # data
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))

## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       8299.851 
## R squared (OOB):                  0.8190328

Predict bike rentals with the random forest model

# bikesAugust is in the workspace
str(bikesAugust)

## 'data.frame':    744 obs. of  12 variables:
##  $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
##  $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
##  $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
##  $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
##  $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
##  $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
##  $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
##  $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...

# bike_model_rf is in the workspace
bike_model_rf

## Ranger result
## 
## Call:
##  ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      744 
## Number of independent variables:  8 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       8299.851 
## R squared (OOB):                  0.8190328

# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt - pred)  %>%        # calculate the residual
  summarize(rmse  = sqrt(mean(residual ^ 2))) # calculate rmse

##       rmse
## 1 96.66032

# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

Visualize random forest bike model predictions

first_two_weeks <- bikesAugust %>% 
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>% 
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two
  filter(instant < 14) 

# Plot predictions and cnt by date/time 
ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot")

One-Hot-Encoding Categorical Variables

[Video]

vtreat on a small example

Novel levels

vtreat the bike rental data

Gradient boosting machines

Find the right number of trees for a gradient boosting machine

Fit an xgboost bike rental model and predict

Evaluate the xgboost bike rental model

Visualize the xgboost bike rental model

About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470