In this tutorial, we will learn about linear regression with tidymodels. We will start by fitting a linear regression model to the advertising data set that is used throughout chapter 3 of our course textbook, An Introduction to Statistical Learning. Then we will focus on building our first machine learning pipeline, with data resampling, feature engineering, model fitting, and model accuracy assessment using the workflows, rsample, recipes, parsnip, and tune packages from tidymodels.
The R code below will load the data and packages we will be working with throughout this tutorial. The vip package is used for exploring predictor variable importance. We will use it to visualize which predictors have the most predictive power in our linear regression models.
# Load libraries
library(tidyverse)
## Registered S3 methods overwritten by 'tibble':
## method from
## format.tbl pillar
## print.tbl pillar
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom 0.7.0 ✓ recipes 0.1.13
## ✓ dials 0.0.8 ✓ rsample 0.0.7
## ✓ infer 0.5.3 ✓ tune 0.1.1
## ✓ modeldata 0.0.2 ✓ workflows 0.1.3
## ✓ parsnip 0.1.3 ✓ yardstick 0.0.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
library(vip) # for variable importance
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
# Load data sets
advertising <- read_rds('./data/advertising.rds')
home_sales <- read_rds('./data/home_sales.rds') %>%
select(-selling_date)
print(home_sales)
## # A tibble: 6,659 x 10
## selling_price city house_age bedrooms bathrooms sqft_living sqft_lot
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 280000 Aubu… 23 6 3 2400 9373
## 2 487000 Seat… 9 4 2.5 2540 5001
## 3 465000 Seat… 0 3 2.25 1530 1245
## 4 411000 Seat… 7 2 2 1130 1148
## 5 570000 Bell… 16 3 2.5 1530 3296
## 6 546000 Bell… 16 3 2.5 1530 3464
## 7 617000 Bell… 16 3 2.5 1910 4488
## 8 635000 Kirk… 9 3 2.5 3350 4007
## 9 872750 Redm… 24 3 2.5 2870 13695
## 10 843000 Redm… 24 3 2.5 3130 8750
## # … with 6,649 more rows, and 3 more variables: sqft_basement <dbl>,
## # floors <dbl>, scenic_views <dbl>
We will be working with the advertising data set, where each row represents a store from a large retail chain and its associated sales revenue and advertising budgets, and the home_sales data, where each row represents a real estate home sale in the Seattle area between 2014 and 2015.
Take a moment to explore these data sets below.
The first step in building regression models is to split our original data into a training and test set. We then perform all feature engineering and model fitting tasks on the training set and use the test set as an independent assessment of our model’s prediction accuracy.
We will be using the initial_split()
function from rsample
to partition the advertising
data into training and test sets. Remember to always use set.seed()
to ensure your results are reproducible.
set.seed(314)
# Create a split object
advertising_split <- initial_split(advertising, prop = 0.75,
strata = Sales)
# Build training data set
advertising_training <- advertising_split %>%
training()
# Build testing data set
advertising_test <- advertising_split %>%
testing()
The next step in the process is to build a linear regression model object, which we will fit to our training data.
For every model type, such as linear regression, there are numerous packages (or engines) in R
that can be used.
For example, we can use the lm()
function from base R
or the stan_glm()
function from the rstanarm
package. Both of these functions will fit a linear regression model to our data with slightly different implementations.
The parsnip
package from tidymodels
acts like an aggregator across the various modeling engines within R
. This makes it easy to implement machine learning algorithms from different R
packages with one unifying syntax.
To specify a model object with parsnip, we must choose the model type, set the engine, and set the mode (regression or classification).
Linear regression is implemented with the linear_reg()
function in parsnip
. To set the engine and mode, we use set_engine()
and set_mode()
respectively. Each one of these functions takes a parsnip object as an argument and updates its properties.
To explore all parsnip
models, please see the documentation where you can search by keyword.
Let’s create a linear regression model object with the lm
engine. This is the default engine for most applications.
lm_model <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
# View object properties
lm_model
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
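If we wanted the stan_glm() implementation from rstanarm mentioned earlier, only the engine would change. The code below is a sketch of that alternative specification and assumes the rstanarm package is installed.
# Same model type, different engine (assumes rstanarm is installed)
stan_model <- linear_reg() %>%
  set_engine('stan') %>%  # Bayesian implementation from rstanarm
  set_mode('regression')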
Now we are ready to train our model object on the advertising_training
data. We can do this using the fit()
function from the parsnip
package. The fit()
function takes the following arguments: a parsnip model object specification, a model formula, and a data frame with the training data.
The code below trains our linear regression model on the advertising_training
data. In our formula, we have specified that Sales
is the response variable and TV
, Radio
, and Newspaper
are our predictor variables.
We have assigned the name lm_fit
to our trained linear regression model.
lm_fit <- lm_model %>%
fit(Sales ~ ., data = advertising_training)
# View lm_fit properties
lm_fit
## parsnip model object
##
## Fit time: 4ms
##
## Call:
## stats::lm(formula = Sales ~ ., data = data)
##
## Coefficients:
## (Intercept) TV Radio Newspaper
## 3.020478 0.044105 0.198164 -0.003748
As mentioned in the first R tutorial, most model objects in R
are stored as specialized lists.
The lm_fit
object is a list that contains all of the information about how our model was trained, as well as the detailed results. Let’s use the names()
function to print the named objects that are stored within lm_fit
.
The important objects are fit
and preproc
. These contain the trained model and preprocessing steps (if any are used), respectively.
names(lm_fit)
## [1] "lvl" "spec" "fit" "preproc" "elapsed"
To print a summary of our model, we can extract fit
from lm_fit
and pass it to the summary()
function. We can explore the estimated coefficients, F-statistics, p-values, residual standard error (also known as RMSE) and R2 value.
However, this output is best suited for visually exploring our results on the training data, since the results are not returned as a data frame.
summary(lm_fit$fit)
##
## Call:
## stats::lm(formula = Sales ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2834 -0.9082 0.2456 1.1927 2.8446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.020478 0.339772 8.890 2.03e-15 ***
## TV 0.044105 0.001487 29.659 < 2e-16 ***
## Radio 0.198164 0.009143 21.674 < 2e-16 ***
## Newspaper -0.003748 0.006234 -0.601 0.549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.577 on 147 degrees of freedom
## Multiple R-squared: 0.9071, Adjusted R-squared: 0.9052
## F-statistic: 478.6 on 3 and 147 DF, p-value: < 2.2e-16
We can use the plot()
function to obtain diagnostic plots for our trained regression model. Again, we must first extract the fit
object from lm_fit
and then pass it into plot()
. These plots provide a check for the main assumptions of the linear regression model.
par(mfrow=c(2,2)) # plot all 4 plots in one
plot(lm_fit$fit,
pch = 16, # optional parameters to make points blue
col = '#006EA1')
To obtain the detailed results from our trained linear regression model in a data frame, we can use the tidy()
and glance()
functions directly on our trained parsnip
model, lm_fit
.
The tidy()
function takes a fitted linear regression object and returns a data frame of the estimated model coefficients and their associated t-statistics and p-values.
The glance()
function will return performance metrics obtained on the training data such as the R2 value (r.squared
) and the RMSE (sigma
).
# Data frame of estimated coefficients
tidy(lm_fit)
# Performance metrics on training data
glance(lm_fit)
We can also use the vip()
function to plot the variable importance for each predictor in our model. The importance values are determined from the t-statistics of the estimated coefficients in our trained model object.
vip(lm_fit)
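If we want the importance scores themselves rather than a plot, the vip package also provides the vi() function (the object masked from utils when we loaded the package). A minimal sketch, assuming vi() accepts the same fitted model object that vip() does:
# Variable importance scores as a data frame (assumes vi() accepts a parsnip fit)
vi(lm_fit)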
To assess the accuracy of our trained linear regression model, lm_fit
, we must use it to make predictions on our test data, advertising_test
.
This is done with the predict()
function from parsnip. This function takes two important arguments: a trained parsnip model object and the new_data for which to generate predictions.
The code below uses the predict() function to generate a data frame with a single column, .pred, which contains the predicted Sales values on the advertising_test data.
predict(lm_fit, new_data = advertising_test)
Generally it’s best to combine the test data set and the predictions into a single data frame. We create a data frame with the predictions on the advertising_test
data and then use bind_cols
to add the advertising_test
data to the results.
Now we have the model results and the test data in a single data frame.
advertising_test_results <- predict(lm_fit, new_data = advertising_test) %>%
bind_cols(advertising_test)
# View results
advertising_test_results
To obtain the RMSE and R2 values on our test set results, we can use the rmse()
and rsq()
functions.
Both functions take the following arguments:
data - a data frame with columns that have the true values and predictions
truth - the column with the true response values
estimate - the column with the predicted values
In the examples below we pass our advertising_test_results data frame to these functions to obtain these values for our test set. Results are always returned as a data frame with the following columns: .metric, .estimator, and .estimate.
# RMSE on test set
rmse(advertising_test_results,
truth = Sales,
estimate = .pred)
# R2 on test set
rsq(advertising_test_results,
truth = Sales,
estimate = .pred)
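If we want both metrics from a single call, yardstick also lets us bundle them with metric_set(). The code below is a sketch of the same calculation, assuming only that yardstick is attached (it loads with tidymodels).
# Bundle RMSE and R2 into one metric function
ad_metrics <- metric_set(rmse, rsq)

ad_metrics(advertising_test_results,
           truth = Sales,
           estimate = .pred)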
The best way to assess the test set accuracy is by making an R2 plot. This is a plot that can be used for any regression model.
It plots the actual values (Sales
) versus the model predictions (.pred
) as a scatter plot. It also plots the line y = x through the origin. This line is a visual representation of the perfect model, where all predicted values are equal to the true values in the test set. The farther the points are from this line, the worse the model fit.
The reason this plot is called an R2 plot is that the R2 value is simply the squared correlation between the true and predicted values, which are plotted as pairs in the scatter plot.
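As a quick check of this relationship, we can square the correlation between the true and predicted values ourselves and compare the result to the rsq() output above. This is just a sketch using base R’s cor() on the columns of advertising_test_results.
# R2 as the squared correlation between truth and predictions
cor(advertising_test_results$Sales,
    advertising_test_results$.pred) ^ 2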
In the code below, we use geom_point()
and geom_abline()
to make this plot using our advertising_test_results
data. The geom_abline()
function will plot a line with the provided slope
and intercept
arguments.
ggplot(data = advertising_test_results,
mapping = aes(x = .pred, y = Sales)) +
geom_point(color = '#006EA1') +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Advertising Test Set',
x = 'Predicted Sales',
y = 'Actual Sales')
In the previous section, we trained a linear regression model on the advertising data step-by-step. In this section, we will go over how to combine all of the modeling steps into a single workflow.
We will be using the workflows package, which combines a parsnip model with a recipe, and the last_fit() function to build an end-to-end model training pipeline.
Let’s assume we would like to do the following with the advertising data: split it into training and test sets, transform the numeric predictors to remove skewness and normalize them, fit a linear regression model to the training data, and assess the model’s accuracy on the test data.
This machine learning workflow can be accomplished with a few steps using tidymodels.
First we split our data into training and test sets.
set.seed(314)
# Create a split object
advertising_split <- initial_split(advertising, prop = 0.75,
strata = Sales)
# Build training data set
advertising_training <- advertising_split %>%
training()
# Build testing data set
advertising_test <- advertising_split %>%
testing()
Next, we specify our feature engineering recipe. In this step, we do not use prep()
or bake()
. This recipe will be automatically applied in a later step using the workflow()
and last_fit()
functions.
advertising_recipe <- recipe(Sales ~ ., data = advertising_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes())
Next, we specify our linear regression model with parsnip
.
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
The workflows package was designed to combine models and recipes into a single object. To create a workflow, we start with workflow() to create an empty workflow and then add our model and recipe with add_model() and add_recipe().
advertising_workflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(advertising_recipe)
The last_fit()
function will take a workflow object and apply the recipe and model to a specified data split object.
In the code below, we pass the advertising_workflow
object and advertising_split
object into last_fit()
.
The last_fit()
function will then train the feature engineering steps on the training data, fit the model to the training data, apply the feature engineering steps to the test data, and calculate the predictions on the test data, all in one step!
advertising_fit <- advertising_workflow %>%
last_fit(split = advertising_split)
To obtain the performance metrics and predictions on the test set, we use the collect_metrics()
and collect_predictions()
functions on our advertising_fit
object.
# Obtain performance metrics on test data
advertising_fit %>% collect_metrics()
We can save the test set predictions by using the collect_predictions()
function. This function returns a data frame which will have the response variable values from the test set and a column named .pred
with the model predictions.
# Obtain test set predictions data frame
test_results <- advertising_fit %>%
collect_predictions()
# View results
test_results
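Because collect_predictions() returns both the true Sales values and the .pred column, we could also recompute the test set metrics by hand with the yardstick functions from earlier. A quick sketch:
# Recompute RMSE from the collected predictions (should match collect_metrics())
rmse(test_results,
     truth = Sales,
     estimate = .pred)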
For another example of fitting a machine learning workflow, let’s use linear regression to predict the selling price of homes using the home_sales
data.
For our feature engineering steps, we will remove skewness from and normalize the numeric predictors, and create dummy variables for the city
variable.
Remember that all machine learning algorithms need a numeric feature matrix. Therefore we must also transform character or factor predictor variables to dummy variables.
First we split our data into training and test sets.
set.seed(271)
# Create a split object
homes_split <- initial_split(home_sales, prop = 0.75,
strata = selling_price)
# Build training data set
homes_training <- homes_split %>%
training()
# Build testing data set
homes_test <- homes_split %>%
testing()
Next, we specify our feature engineering recipe. In this step, we do not use prep()
or bake()
. This recipe will be automatically applied in a later step using the workflow()
and last_fit()
functions.
For our model formula, we are specifying that selling_price
is our response variable and all others are predictor variables.
homes_recipe <- recipe(selling_price ~ ., data = homes_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), - all_outcomes())
As an intermediate step, let’s check our recipe by prepping it on the training data and applying it to the test data. We want to make sure that we get the correct transformations.
From the results below, things look correct.
homes_recipe %>%
prep() %>%
bake(new_data = homes_test)
Next, we specify our linear regression model with parsnip
.
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
Next, we combine our model and recipe into a workflow object.
homes_workflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(homes_recipe)
Finally, we process our machine learning workflow with last_fit()
.
homes_fit <- homes_workflow %>%
last_fit(split = homes_split)
To obtain the performance metrics and predictions on the test set, we use the collect_metrics()
and collect_predictions()
functions on our homes_fit
object.
# Obtain performance metrics on test data
homes_fit %>% collect_metrics()
We can save the test set predictions by using the collect_predictions()
function. This function returns a data frame which will have the response variable values from the test set and a column named .pred
with the model predictions.
# Obtain test set predictions data frame
homes_results <- homes_fit %>%
collect_predictions()
# View results
homes_results
Finally, let’s use the homes_results
data frame to make an R2 plot to visualize our model performance on the test data set.
ggplot(data = homes_results,
mapping = aes(x = .pred, y = selling_price)) +
geom_point(color = '#006EA1', alpha = 0.25) +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Home Sales Test Set',
x = 'Predicted Selling Price',
y = 'Actual Selling Price')
Creating a workflow and using the last_fit()
function is a great option for automating a machine learning pipeline. However, we are not able to explore variable importance on the training data when we fit our model with last_fit()
.
To obtain a variable importance plot with vip()
, we must use the methods introduced at the beginning of the tutorial. This involves fitting the model with the fit()
function on the training data.
To do this, we will train our homes_recipe
and transform our training data. Then we use the fit()
function to train our linear regression object, lm_model
, on our processed data.
Then we can use the vip()
function to see which predictors were most important.
homes_training_baked <- homes_recipe %>%
prep() %>%
bake(new_data = homes_training)
# View results
homes_training_baked
Now we fit our linear regression model to the baked training data.
homes_lm_fit <- lm_model %>%
fit(selling_price ~ ., data = homes_training_baked)
Finally, we use vip()
on our trained linear regression model. It appears that square footage and location are the most important predictor variables.
vip(homes_lm_fit)
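To see the estimated coefficients behind these importance scores, we can reuse the tidy() function from earlier on the trained model. A minimal sketch:
# Estimated coefficients for the home sales model
tidy(homes_lm_fit)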