1 Overview

How much orders my company will receive from my customers next month? How many customers will churn when the contract expire? How many catalogues do I need to mail in order to increase the probability of my potential customers to buy? These and many other related questions are the challenges face by you and many business analysts.

Multiple Linear Regression (MLR) models are one of the three most common analytics modelling techniques used by practitioners today. In this session, I would like to share with you a collection of R packages specially design for building better MLR models.

This is not a session for explaining the concepts and methods of MLR.

1.1 Setting the Scene

TBA

1.2 The data

Diamond data set includes in ggplot2 package. It consists of 53940 records and 10 variables. the variables are:

  • price: price in US dollars ($326–$18,823)
  • carat: weight of the diamond (0.2–5.01)
  • cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color: diamond colour, from D (best) to J (worst)
  • clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x: length in mm (0–10.74)
  • y: width in mm (0–58.9)
  • z: depth in mm (0–31.8)
  • depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  • table: width of top of diamond relative to widest point (43–95)

1.3 Getting Started

Before we get started, it is important for us to install the necessary R packages into R and launch these R packages into R environment.

The R packages needed for this exercise are as follows:

  • Attribute data handling
    • tidyverse, especially dplyr.
  • Exploratory Data Analysis
    • ggplot2, skimr, and corrplot.
  • Model building

The code chunks below installs and launches these R packages into R environment.

packages = c('corrplot', 'skimr', 'tidymodels', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2 Data Wrangling

2.1 Importing the data

data(diamonds)

2.2 Checking data structure

After importing the data file into R, it is important for us to examine if the data file has been imported correctly.

2.2.1 Checking data types using glimpse()

The codes chunks below uses glimpse() to display the data structure of will do the job.

glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

Important

  • In regression modelling using R (i.e. lm()), categorical variables encoded in character and logical data type are required to converted to factor data type. This is because in regression modelling, categorical variables need to be transformed to dummy variables.

3 Exploratory Data Analysis (EDA)

3.1 Univariate data analysis

Now, we can use summary() to examine the summary statistics of the variables again.

summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

3.2 skimr package

A tidyverse complied summary statistics and EDA package.

skim(diamonds)
Data summary
Name diamonds
Number of rows 53940
Number of columns 10
_______________________
Column type frequency:
factor 3
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
cut 0 1 TRUE 5 Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color 0 1 TRUE 7 G: 11292, E: 9797, F: 9542, H: 8304
clarity 0 1 TRUE 8 SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
carat 0 1 0.80 0.47 0.2 0.40 0.70 1.04 5.01 ▇▂▁▁▁
depth 0 1 61.75 1.43 43.0 61.00 61.80 62.50 79.00 ▁▁▇▁▁
table 0 1 57.46 2.23 43.0 56.00 57.00 59.00 95.00 ▁▇▁▁▁
price 0 1 3932.80 3989.44 326.0 950.00 2401.00 5324.25 18823.00 ▇▂▁▁▁
x 0 1 5.73 1.12 0.0 4.71 5.70 6.54 10.74 ▁▁▇▃▁
y 0 1 5.73 1.14 0.0 4.72 5.71 6.54 58.90 ▇▁▁▁▁
z 0 1 3.54 0.71 0.0 2.91 3.53 4.04 31.80 ▇▁▁▁▁

3.3 Bivariate data analysis

Before building a multiple regression model, it is important to ensure that the independent variables used are not highly correlated to each other. If these highly correlated independent variables are used in building a regression model by mistake, the quality of the model will be compromised. This phenomenon is known as multicollinearity in statistics.

Correlation matrix is commonly used to visualise the relationships between the independent variables. Beside the pairs() of R, there are many packages support the display of a correlation matrix. In this section, the corrplot package will be used.

The code chunk below is used to plot a scatterplot matrix of the relationship between the independent variables in diamonds data.frame.

corrplot(cor(diamonds[, c(1,5:6, 8:10)]), 
         diag = FALSE, 
         tl.pos = "td", 
         tl.cex = 1, 
         method = "number", 
         type = "upper",
         mar = c(0, 0, 1.5, 0),
         title = "Correlations between predictors")

4 Data Sampling

First of all, we want to extract a data set for testing the predictions in the end. We’ll keep 60% of the data for training and 40% of the data for testing.

Initial code chunk without excluding variables that are highly correlated.

set.seed(1243)

diamonds_split <- initial_split(diamonds, 
                           prop = .6, 
                           strata = price)

initial_split() of rsample package. initial_split creates a single binary split of the data into a training set and testing set. initial_time_split does the same, but takes the first prop samples for training, instead of a random selection. training and testing are used to extract the resulting data.

Code chunk that excludes predictors that are highly correlated.

set.seed(1243)
diamonds_split <- diamonds %>%
  select(c(1:7)) %>%
  initial_split(prop = .6, 
                strata = price)

The complete code chunk.

set.seed(1243)
diamonds_split <- diamonds %>%
  select(c(1:7)) %>%
  initial_split(prop = .6, 
                strata = price)

training_data <- training(diamonds_split)
testing_data  <- testing(diamonds_split)

4.1 Cross-validation data sets

Furthermore, the training data set will be prepared for 3-fold cross-validation (using three here to speed things up). All this is accomplished using the rsample package:

vfold_data <- vfold_cv(training_data, v = 3, 
                      repeats = 1, 
                      strata = price)
vfold_data %>% 
  mutate(df_ana = map(splits, analysis),
         df_ass = map(splits, assessment))
## #  3-fold cross-validation using stratification 
## # A tibble: 3 x 4
##   splits                id    df_ana                df_ass               
##   <list>                <chr> <list>                <list>               
## 1 <split [21.6K/10.8K]> Fold1 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
## 2 <split [21.6K/10.8K]> Fold2 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
## 3 <split [21.6K/10.8K]> Fold3 <tibble [21,576 x 7]> <tibble [10,788 x 7]>

5 Data Pre-Processing and Feature Engineering

The recipes package provides a collection of useful functions for performing data pre-processing feature engineering tasks. By and large these tasks can be accomplished by using a family of step_()* functions. For example, the plot below indicates that there may be a nonlinear relationship between price and carat, and I want to address that using higher-order terms.

qplot(carat, price, data = training_data) +
    scale_y_continuous(trans = log_trans(), labels = function(x) round(x, -2)) +
    geom_smooth(method = "lm", formula = "y ~ poly(x, 4)") +
    labs(title = "Nonlinear relationship between price and carat of diamonds",
         subtitle = "The degree of the polynomial is a potential tuning parameter")

5.1 Working with step_()* functions

Herein, I want to log transform price (step_log()), I want to center and scale all numeric predictors (step_normalize()), and the categorical predictors should be dummy coded (step_dummy()). Furthermore, a quadratic effect of carat is added using step_poly().

processed_data <- recipe(price ~ ., data = training_data) %>%
  step_log(all_outcomes()) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_poly(carat, degree = 2)

Calling prep() on a recipe applies all the steps. You can now call juice() to extract the transformed data set or call bake() on a new data set.

prep(processed_data)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          6
## 
## Training data contained 32364 data points and no missing data.
## 
## Operations:
## 
## Log transformation on price [trained]
## Centering and scaling for carat, depth, table [trained]
## Dummy variables from cut, color, clarity [trained]
## Orthogonal polynomials on carat [trained]
juiced_data <- juice(prep(processed_data))
names(juiced_data)
##  [1] "depth"        "table"        "price"        "cut_1"        "cut_2"       
##  [6] "cut_3"        "cut_4"        "color_1"      "color_2"      "color_3"     
## [11] "color_4"      "color_5"      "color_6"      "clarity_1"    "clarity_2"   
## [16] "clarity_3"    "clarity_4"    "clarity_5"    "clarity_6"    "clarity_7"   
## [21] "carat_poly_1" "carat_poly_2"

6 Defining and Fitting Models

The parsnip package has wrappers around many popular machine learning algorithms, and you can fit them using a unified interface. This is extremely helpful, since you have to remember only one rather then dozens of interfaces.

The models are separated into two modes/categories, namely, regression and classification (set_mode()). The model is defined using a function specific to each algorithm (e.g., linear_reg(), rand_forest()). Finally, the backend/engine/implementation is selected using set_engine().

Herein, I will start with a basic linear regression model as implemented in stats::lm().

lm_model <-
    linear_reg() %>%
    set_mode("regression") %>%
    set_engine("lm")

Next, we can fit() the model.

lm_fit1 <- fit(lm_model, price ~ ., juiced_data)
lm_fit1
## parsnip model object
## 
## Fit time:  30ms 
## 
## Call:
## stats::lm(formula = price ~ ., data = data)
## 
## Coefficients:
##  (Intercept)         depth         table         cut_1         cut_2  
##    7.720e+00    -2.957e-03    -9.111e-04     9.191e-02    -7.278e-03  
##        cut_3         cut_4       color_1       color_2       color_3  
##    1.423e-02    -1.951e-04    -4.524e-01    -9.292e-02    -5.462e-03  
##      color_4       color_5       color_6     clarity_1     clarity_2  
##    1.683e-02    -1.066e-02    -7.511e-04     8.133e-01    -2.264e-01  
##    clarity_3     clarity_4     clarity_5     clarity_6     clarity_7  
##    1.181e-01    -4.116e-02     2.266e-02     3.736e-03     2.340e-02  
## carat_poly_1  carat_poly_2  
##    1.888e+02    -5.429e+01

7 Summarizing Fitted Models

Many models have implemented summary() or coef() methods. However, the output of these is usually not in a tidy format, and the broom package has the aim to resolve this issue.

glance() gives us information about the whole model. Here, R squared is pretty high and the RMSE equals 0.154.

glance(lm_fit1$fit)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik     AIC     BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1     0.974         0.974 0.164    57320.       0    21 12605. -25164. -24971.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

tidy() gives us information about the model parameters, and we see that we have a significant quadratic effect of carat.

tidy(lm_fit1) %>% 
  arrange(desc(abs(statistic)))
## # A tibble: 22 x 5
##    term         estimate std.error statistic   p.value
##    <chr>           <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)    7.72     0.00184    4186.  0.       
##  2 carat_poly_1 189.       0.189       998.  0.       
##  3 carat_poly_2 -54.3      0.167      -325.  0.       
##  4 clarity_1      0.813    0.00567     143.  0.       
##  5 color_1       -0.452    0.00325    -139.  0.       
##  6 clarity_2     -0.226    0.00530     -42.7 0.       
##  7 color_2       -0.0929   0.00296     -31.4 1.18e-213
##  8 clarity_3      0.118    0.00454      26.0 5.28e-148
##  9 cut_1          0.0919   0.00420      21.9 3.44e-105
## 10 clarity_4     -0.0412   0.00361     -11.4 5.22e- 30
## # ... with 12 more rows

Finally, augment() can be used to get model predictions, residuals, etc.

lm_predicted <- augment(lm_fit1$fit, data = juiced_data) %>% 
  rowid_to_column()
select(lm_predicted, rowid, price, .fitted:.std.resid)
## # A tibble: 32,364 x 5
##    rowid price .fitted  .resid .std.resid
##    <int> <dbl>   <dbl>   <dbl>      <dbl>
##  1     1  5.79    6.04 -0.253      -1.54 
##  2     2  5.79    6.33 -0.544      -3.32 
##  3     3  5.81    6.19 -0.381      -2.33 
##  4     4  5.81    5.79  0.0211      0.129
##  5     5  5.82    6.16 -0.344      -2.10 
##  6     6  5.82    6.14 -0.317      -1.93 
##  7     7  5.83    5.92 -0.0964     -0.588
##  8     8  5.83    5.94 -0.112      -0.685
##  9     9  5.84    5.87 -0.0330     -0.201
## 10    10  5.84    5.89 -0.0451     -0.275
## # ... with 32,354 more rows

8 Evaluating Model Performance

We already saw performance measures RMSE and R squared in the output of glance() above. The yardstick package is specifically designed for such measures for both numeric and categorical outcomes, and it plays well with grouped predictions (e.g., from cross-validation).

Let’s use rsample, parsnip, and yardstick for cross-validation to get a more accurate estimation of RMSE.

In the following pipeline, the model is fit() separately to the three analysis data sets, and then the fitted models are used to predict() on the three corresponding assessment data sets (i.e., 3-fold cross-validation).

Before that, analysis() and assessment() are used to extract the respective folds from dia_vfold. Furthermore, the recipe dia_rec is prepped (i.e., trained) using the analysis data of each fold, and this prepped recipe is then applied to the assessment data of each fold using bake(). Preparing the recipe separately for each fold (rather than once for the whole training data set dia_train) guards against data leakage.

The code in the following chunk makes use of list columns to store all information about the three folds in a single tibble lm_fit2, and a combination of dplyr::mutate() and purrr::map() is used to “loop” across the three rows of the tibble.

8.1 How does dia_vfold look?

vfold_data
## #  3-fold cross-validation using stratification 
## # A tibble: 3 x 2
##   splits                id   
##   <list>                <chr>
## 1 <split [21.6K/10.8K]> Fold1
## 2 <split [21.6K/10.8K]> Fold2
## 3 <split [21.6K/10.8K]> Fold3

8.2 Extracting analysis/training and assessment/testing data

extracted_data <- mutate(vfold_data,
                         df_ana = map (splits,
                                       analysis),
                         df_ass = map (splits,
                                       assessment))
extracted_data
## #  3-fold cross-validation using stratification 
## # A tibble: 3 x 4
##   splits                id    df_ana                df_ass               
##   <list>                <chr> <list>                <list>               
## 1 <split [21.6K/10.8K]> Fold1 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
## 2 <split [21.6K/10.8K]> Fold2 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
## 3 <split [21.6K/10.8K]> Fold3 <tibble [21,576 x 7]> <tibble [10,788 x 7]>

8.3 Fitting the cross-validation data sets

lm_fit2 <- extracted_data %>% 
  mutate(recipe = map(df_ana, 
                      ~prep(processed_data, 
                            training = .x)),
         df_ana = map (recipe,  
                       juice),
         df_ass = map2(recipe, 
                       df_ass, 
                       ~bake(.x, new_data = .y))) %>%
  mutate(model_fit = map(df_ana, 
                         ~fit(lm_model, 
                              price ~ ., data = .x))) %>% 
  mutate(model_pred = map2(model_fit, 
                           df_ass, ~predict(.x,
                                            new_data = .y)))
  
select(lm_fit2, id, recipe:model_pred)
## # A tibble: 3 x 4
##   id    recipe   model_fit model_pred           
##   <chr> <list>   <list>    <list>               
## 1 Fold1 <recipe> <fit[+]>  <tibble [10,788 x 1]>
## 2 Fold2 <recipe> <fit[+]>  <tibble [10,788 x 1]>
## 3 Fold3 <recipe> <fit[+]>  <tibble [10,788 x 1]>

8.4 Extracting actual prices

Now, we can extract the actual prices price from the assessment data and compare them to the predicted prices .pred. Then, the yardstick package comes into play: The function metrics() calculates three different metrics for numeric outcomes. Furthermore, it automatically recognizes that lm_preds is grouped by folds and thus calculates the metrics for each fold.

Across the three folds, we see that the RMSE is a little higher and R squared a little smaller compared to above (see output of glance(lm_fit1$fit)). This is expected, since out-of-sample prediction is harder but also way more useful.

lm_preds <- 
  lm_fit2 %>% 
  mutate(res = map2(df_ass, 
                    model_pred, 
                    ~data.frame(price = .x$price,
                                                      .pred = .y$.pred))) %>% 
  select(id, res) %>% 
  tidyr::unnest(res) %>% 
  group_by(id)
lm_preds
## # A tibble: 32,364 x 3
## # Groups:   id [3]
##    id    price .pred
##    <chr> <dbl> <dbl>
##  1 Fold1  5.79  6.04
##  2 Fold1  5.82  6.14
##  3 Fold1  5.83  5.94
##  4 Fold1  5.86  5.93
##  5 Fold1  5.88  6.34
##  6 Fold1  6.00  6.33
##  7 Fold1  6.00  6.43
##  8 Fold1  6.00  6.11
##  9 Fold1  6.00  6.00
## 10 Fold1  6.00  6.15
## # ... with 32,354 more rows

8.5 Camparing the results

The code chunk below uses metrics() of yardstick package.

metrics(lm_preds, 
        truth = price, 
        estimate = .pred)
## # A tibble: 9 x 4
##   id    .metric .estimator .estimate
##   <chr> <chr>   <chr>          <dbl>
## 1 Fold1 rmse    standard       0.162
## 2 Fold2 rmse    standard       0.180
## 3 Fold3 rmse    standard       0.153
## 4 Fold1 rsq     standard       0.974
## 5 Fold2 rsq     standard       0.969
## 6 Fold3 rsq     standard       0.977
## 7 Fold1 mae     standard       0.123
## 8 Fold2 mae     standard       0.121
## 9 Fold3 mae     standard       0.122

Note that metrics() has default measures for numeric and categorical outcomes, and here RMSE, R squared, and the mean absolute difference (MAE) are returned. You could also use one metric directly like rmse() or define a custom set of metrics via metric_set().