In this tutorial, we will learn about linear regression with tidymodels. We will start by fitting a linear regression model to the advertising data set that is used throughout chapter 3 of our course textbook, An Introduction to Statistical Learning. Then we will focus on building our first machine learning pipeline, with data resampling, feature engineering, model fitting, and model accuracy assessment using the workflows, rsample, recipes, parsnip, and tune packages from tidymodels.
The R code below will load the data and packages we will be working with throughout this tutorial. The vip package is used for exploring predictor variable importance. We will use it to visualize which predictors have the most predictive power in our linear regression models.
# Load libraries
library(tidyverse)
## Registered S3 methods overwritten by 'tibble':
## method from
## format.tbl pillar
## print.tbl pillar
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom 0.7.0 ✓ recipes 0.1.13
## ✓ dials 0.0.8 ✓ rsample 0.0.7
## ✓ infer 0.5.3 ✓ tune 0.1.1
## ✓ modeldata 0.0.2 ✓ workflows 0.1.3
## ✓ parsnip 0.1.3 ✓ yardstick 0.0.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
library(vip) # for variable importance
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
# Load data sets
advertising <- read_rds('./data/advertising.rds')
home_sales <- read_rds('./data/home_sales.rds') %>%
select(-selling_date)
print(home_sales)
## # A tibble: 6,659 x 10
## selling_price city house_age bedrooms bathrooms sqft_living sqft_lot
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 280000 Aubu… 23 6 3 2400 9373
## 2 487000 Seat… 9 4 2.5 2540 5001
## 3 465000 Seat… 0 3 2.25 1530 1245
## 4 411000 Seat… 7 2 2 1130 1148
## 5 570000 Bell… 16 3 2.5 1530 3296
## 6 546000 Bell… 16 3 2.5 1530 3464
## 7 617000 Bell… 16 3 2.5 1910 4488
## 8 635000 Kirk… 9 3 2.5 3350 4007
## 9 872750 Redm… 24 3 2.5 2870 13695
## 10 843000 Redm… 24 3 2.5 3130 8750
## # … with 6,649 more rows, and 3 more variables: sqft_basement <dbl>,
## # floors <dbl>, scenic_views <dbl>
We will be working with the advertising data set, where each row represents a store from a large retail chain and its associated sales revenue and advertising budgets, and the home_sales data, where each row represents a real estate home sale in the Seattle area between 2014 and 2015.
Take a moment to explore these data sets below.
The first step in building regression models is to split our original data into a training and test set. We then perform all feature engineering and model fitting tasks on the training set and use the test set as an independent assessment of our model’s prediction accuracy.
We will be using the initial_split()
function from rsample
to partition the advertising
data into training and test sets. Remember to always use set.seed()
to ensure your results are reproducible.
set.seed(314)
# Create a split object
advertising_split <- initial_split(advertising, prop = 0.75,
strata = Sales)
# Build training data set
advertising_training <- advertising_split %>%
training()
# Build testing data set
advertising_test <- advertising_split %>%
testing()
The next step in the process is to build a linear regression model object, which we will fit to our training data.
For every model type, such as linear regression, there are numerous packages (or engines) in R
that can be used.
For example, we can use the lm()
function from base R
or the stan_glm()
function from the rstanarm
package. Both of these functions will fit a linear regression model to our data with slightly different implementations.
The parsnip
package from tidymodels
acts like an aggregator across the various modeling engines within R
. This makes it easy to implement machine learning algorithms from different R
packages with one unifying syntax.
To specify a model object with parsnip, we must choose the model type, set the engine, and set the mode (regression or classification).
Linear regression is implemented with the linear_reg()
function in parsnip
. To set the engine and mode, we use set_engine()
and set_mode()
respectively. Each one of these functions takes a parsnip object as an argument and updates its properties.
To explore all parsnip
models, please see the documentation where you can search by keyword.
Let’s create a linear regression model object with the lm
engine. This is the default engine for most applications.
lm_model <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
# View object properties
lm_model
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
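If we wanted the stan_glm() implementation from rstanarm mentioned earlier, only the engine would change. The code below is a sketch of that alternative specification and assumes the rstanarm package is installed.
# Same model type, different engine (assumes rstanarm is installed)
stan_model <- linear_reg() %>%
  set_engine('stan') %>%  # Bayesian implementation from rstanarm
  set_mode('regression')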
Now we are ready to train our model object on the advertising_training
data. We can do this using the fit()
function from the parsnip
package. The fit()
function takes the following arguments: a parsnip model object specification, a model formula, and a data frame with the training data.
The code below trains our linear regression model on the advertising_training
data. In our formula, we have specified that Sales
is the response variable and TV
, Radio
, and Newspaper
are our predictor variables.
We have assigned the name lm_fit
to our trained linear regression model.
lm_fit <- lm_model %>%
fit(Sales ~ ., data = advertising_training)
# View lm_fit properties
lm_fit
## parsnip model object
##
## Fit time: 4ms
##
## Call:
## stats::lm(formula = Sales ~ ., data = data)
##
## Coefficients:
## (Intercept) TV Radio Newspaper
## 3.020478 0.044105 0.198164 -0.003748
As mentioned in the first R tutorial, most model objects in R
are stored as specialized lists.
The lm_fit
object is a list that contains all of the information about how our model was trained, as well as the detailed results. Let’s use the names()
function to print the named objects that are stored within lm_fit
.
The important objects are fit
and preproc
. These contain the trained model and preprocessing steps (if any are used), respectively.
names(lm_fit)
## [1] "lvl" "spec" "fit" "preproc" "elapsed"
To print a summary of our model, we can extract fit
from lm_fit
and pass it to the summary()
function. We can explore the estimated coefficients, F-statistics, p-values, residual standard error (also known as RMSE) and R2 value.
However, this output is best suited for visually exploring our results on the training data, since the results are not returned as a data frame.
summary(lm_fit$fit)
##
## Call:
## stats::lm(formula = Sales ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2834 -0.9082 0.2456 1.1927 2.8446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.020478 0.339772 8.890 2.03e-15 ***
## TV 0.044105 0.001487 29.659 < 2e-16 ***
## Radio 0.198164 0.009143 21.674 < 2e-16 ***
## Newspaper -0.003748 0.006234 -0.601 0.549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.577 on 147 degrees of freedom
## Multiple R-squared: 0.9071, Adjusted R-squared: 0.9052
## F-statistic: 478.6 on 3 and 147 DF, p-value: < 2.2e-16
We can use the plot()
function to obtain diagnostic plots for our trained regression model. Again, we must first extract the fit
object from lm_fit
and then pass it into plot()
. These plots provide a check for the main assumptions of the linear regression model.
par(mfrow=c(2,2)) # plot all 4 plots in one
plot(lm_fit$fit,
pch = 16, # optional parameters to make points blue
col = '#006EA1')
To obtain the detailed results from our trained linear regression model in a data frame, we can use the tidy()
and glance()
functions directly on our trained parsnip
model, lm_fit
.
The tidy()
function takes a fitted linear regression object and returns a data frame of the estimated model coefficients and their associated t-statistics and p-values.
The glance()
function will return performance metrics obtained on the training data such as the R2 value (r.squared
) and the RMSE (sigma
).
# Data frame of estimated coefficients
tidy(lm_fit)
# Performance metrics on training data
glance(lm_fit)
We can also use the vip()
function to plot the variable importance for each predictor in our model. The importance values are determined from the t-statistics of the estimated coefficients in our trained model object.
vip(lm_fit)
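If we want the importance scores themselves rather than a plot, the vip package also provides the vi() function (the object masked from utils when we loaded the package). A minimal sketch, assuming vi() accepts the same fitted model object that vip() does:
# Variable importance scores as a data frame (assumes vi() accepts a parsnip fit)
vi(lm_fit)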
To assess the accuracy of our trained linear regression model, lm_fit
, we must use it to make predictions on our test data, advertising_test
.
This is done with the predict()
function from parsnip. This function takes two important arguments: a trained parsnip model object and the new_data for which to generate predictions.
The code below uses the predict() function to generate a data frame with a single column, .pred, which contains the predicted Sales values on the advertising_test data.
predict(lm_fit, new_data = advertising_test)
Generally it’s best to combine the test data set and the predictions into a single data frame. We create a data frame with the predictions on the advertising_test
data and then use bind_cols
to add the advertising_test
data to the results.
Now we have the model results and the test data in a single data frame.
advertising_test_results <- predict(lm_fit, new_data = advertising_test) %>%
bind_cols(advertising_test)
# View results
advertising_test_results
To obtain the RMSE and R2 values on our test set results, we can use the rmse()
and rsq()
functions.
Both functions take the following arguments:
data - a data frame with columns that have the true values and predictions
truth - the column with the true response values
estimate - the column with the predicted values
In the examples below we pass our advertising_test_results data frame to these functions to obtain these values for our test set. Results are always returned as a data frame with the following columns: .metric, .estimator, and .estimate.
# RMSE on test set
rmse(advertising_test_results,
truth = Sales,
estimate = .pred)
# R2 on test set
rsq(advertising_test_results,
truth = Sales,
estimate = .pred)
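If we want both metrics from a single call, yardstick also lets us bundle them with metric_set(). The code below is a sketch of the same calculation, assuming only that yardstick is attached (it loads with tidymodels).
# Bundle RMSE and R2 into one metric function
ad_metrics <- metric_set(rmse, rsq)

ad_metrics(advertising_test_results,
           truth = Sales,
           estimate = .pred)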
The best way to assess the test set accuracy is by making an R2 plot. This is a plot that can be used for any regression model.
It plots the actual values (Sales
) versus the model predictions (.pred
) as a scatter plot. It also plots the line y = x through the origin. This line is a visual representation of the perfect model, where all predicted values are equal to the true values in the test set. The farther the points are from this line, the worse the model fit.
The reason this plot is called an R2 plot is that the R2 value is simply the squared correlation between the true and predicted values, which are plotted as pairs in the scatter plot.
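As a quick check of this relationship, we can square the correlation between the true and predicted values ourselves and compare the result to the rsq() output above. This is just a sketch using base R’s cor() on the columns of advertising_test_results.
# R2 as the squared correlation between truth and predictions
cor(advertising_test_results$Sales,
    advertising_test_results$.pred) ^ 2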
In the code below, we use geom_point()
and geom_abline()
to make this plot using our advertising_test_results
data. The geom_abline()
function will plot a line with the provided slope
and intercept
arguments.
ggplot(data = advertising_test_results,
mapping = aes(x = .pred, y = Sales)) +
geom_point(color = '#006EA1') +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Advertising Test Set',
x = 'Predicted Sales',
y = 'Actual Sales')
In the previous section, we trained a linear regression model on the advertising data step-by-step. In this section, we will go over how to combine all of the modeling steps into a single workflow.
We will be using the workflows package, which combines a parsnip model with a recipe, and the last_fit() function to build an end-to-end model training pipeline.
Let’s assume we would like to do the following with the advertising data: split it into training and test sets, transform the numeric predictors to remove skewness and normalize them, fit a linear regression model to the training data, and assess the model’s accuracy on the test data.
This machine learning workflow can be accomplished with a few steps using tidymodels.
First we split our data into training and test sets.
set.seed(314)
# Create a split object
advertising_split <- initial_split(advertising, prop = 0.75,
strata = Sales)
# Build training data set
advertising_training <- advertising_split %>%
training()
# Build testing data set
advertising_test <- advertising_split %>%
testing()
Next, we specify our feature engineering recipe. In this step, we do not use prep()
or bake()
. This recipe will be automatically applied in a later step using the workflow()
and last_fit()
functions.
advertising_recipe <- recipe(Sales ~ ., data = advertising_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes())
Next, we specify our linear regression model with parsnip
.
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
The workflows package was designed to combine models and recipes into a single object. To create a workflow, we start with workflow() to create an empty workflow and then add our model and recipe with add_model() and add_recipe().
advertising_workflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(advertising_recipe)
The last_fit()
function will take a workflow object and apply the recipe and model to a specified data split object.
In the code below, we pass the advertising_workflow
object and advertising_split
object into last_fit()
.
The last_fit()
function will then train the feature engineering steps on the training data, fit the model to the training data, apply the feature engineering steps to the test data, and calculate the predictions on the test data, all in one step!
advertising_fit <- advertising_workflow %>%
last_fit(split = advertising_split)
To obtain the performance metrics and predictions on the test set, we use the collect_metrics()
and collect_predictions()
functions on our advertising_fit
object.
# Obtain performance metrics on test data
advertising_fit %>% collect_metrics()
We can save the test set predictions by using the collect_predictions()
function. This function returns a data frame which will have the response variable values from the test set and a column named .pred
with the model predictions.
# Obtain test set predictions data frame
test_results <- advertising_fit %>%
collect_predictions()
# View results
test_results
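Because collect_predictions() returns both the true Sales values and the .pred column, we could also recompute the test set metrics by hand with the yardstick functions from earlier. A quick sketch:
# Recompute RMSE from the collected predictions (should match collect_metrics())
rmse(test_results,
     truth = Sales,
     estimate = .pred)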
For another example of fitting a machine learning workflow, let’s use linear regression to predict the selling price of homes using the home_sales
data.
For our feature engineering steps, we will remove skewness from and normalize the numeric predictors, and create dummy variables for the city
variable.
Remember that all machine learning algorithms need a numeric feature matrix. Therefore we must also transform character or factor predictor variables to dummy variables.
First we split our data into training and test sets.
set.seed(271)
# Create a split object
homes_split <- initial_split(home_sales, prop = 0.75,
strata = selling_price)
# Build training data set
homes_training <- homes_split %>%
training()
# Build testing data set
homes_test <- homes_split %>%
testing()
Next, we specify our feature engineering recipe. In this step, we do not use prep()
or bake()
. This recipe will be automatically applied in a later step using the workflow()
and last_fit()
functions.
For our model formula, we are specifying that selling_price
is our response variable and all others are predictor variables.
homes_recipe <- recipe(selling_price ~ ., data = homes_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), - all_outcomes())
As an intermediate step, let’s check our recipe by prepping it on the training data and applying it to the test data. We want to make sure that we get the correct transformations.
From the results below, things look correct.
homes_recipe %>%
prep() %>%
bake(new_data = homes_test)
Next, we specify our linear regression model with parsnip
.
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
Next, we combine our model and recipe into a workflow object.
homes_workflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(homes_recipe)
Finally, we process our machine learning workflow with last_fit()
.
homes_fit <- homes_workflow %>%
last_fit(split = homes_split)
To obtain the performance metrics and predictions on the test set, we use the collect_metrics()
and collect_predictions()
functions on our homes_fit
object.
# Obtain performance metrics on test data
homes_fit %>% collect_metrics()
We can save the test set predictions by using the collect_predictions()
function. This function returns a data frame which will have the response variable values from the test set and a column named .pred
with the model predictions.
# Obtain test set predictions data frame
homes_results <- homes_fit %>%
collect_predictions()
# View results
homes_results
Finally, let’s use the homes_results
data frame to make an R2 plot to visualize our model performance on the test data set.
ggplot(data = homes_results,
mapping = aes(x = .pred, y = selling_price)) +
geom_point(color = '#006EA1', alpha = 0.25) +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Home Sales Test Set',
x = 'Predicted Selling Price',
y = 'Actual Selling Price')
Creating a workflow and using the last_fit()
function is a great option for automating a machine learning pipeline. However, we are not able to explore variable importance on the training data when we fit our model with last_fit()
.
To obtain a variable importance plot with vip()
, we must use the methods introduced at the beginning of the tutorial. This involves fitting the model with the fit()
function on the training data.
To do this, we will train our homes_recipe
and transform our training data. Then we use the fit()
function to train our linear regression object, lm_model
, on our processed data.
Then we can use the vip()
function to see which predictors were most important.
homes_training_baked <- homes_recipe %>%
prep() %>%
bake(new_data = homes_training)
# View results
homes_training_baked
Now we fit our linear regression model to the baked training data.
homes_lm_fit <- lm_model %>%
fit(selling_price ~ ., data = homes_training_baked)
Finally, we use vip()
on our trained linear regression model. It appears that square footage and location are the most important predictor variables.
vip(homes_lm_fit)
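To see the estimated coefficients behind these importance scores, we can reuse the tidy() function from earlier on the trained model. A minimal sketch:
# Estimated coefficients for the home sales model
tidy(homes_lm_fit)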