How many orders will my company receive from its customers next month? How many customers will churn when their contracts expire? How many catalogues do I need to mail in order to increase the probability that my potential customers will buy? These and many other related questions are the challenges faced by you and many other business analysts.
Multiple Linear Regression (MLR) models are one of the three most common analytics modelling techniques used by practitioners today. In this session, I would like to share with you a collection of R packages specially designed for building better MLR models.
This is not a session for explaining the concepts and methods of MLR.
TBA
The diamonds data set is included in the ggplot2 package. It consists of 53,940 records and 10 variables. The variables are:

- carat: weight of the diamond
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from D (best) to J (worst)
- clarity: a measurement of how clear the diamond is
- depth: total depth percentage
- table: width of the top of the diamond relative to its widest point
- price: price in US dollars
- x, y, z: length, width and depth in mm
Before we get started, it is important for us to install the necessary R packages and load them into the R environment.
The R packages needed for this exercise are as follows:
The code chunk below installs and loads these R packages into the R environment.
packages <- c('corrplot', 'skimr', 'tidymodels', 'tidyverse')
for (p in packages) {
  # Install the package first if it is not already available
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
data(diamonds)   # load the diamonds data set bundled with ggplot2
After loading the data into R, it is important for us to examine if it has been loaded correctly.
The code chunk below uses glimpse() of the dplyr package to display the data structure.
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
Next, we can use summary() to examine the summary statistics of the variables.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
skimr is a tidyverse-compliant summary statistics and EDA package. The code chunk below uses skim() to display a more comprehensive set of summary statistics.
skim(diamonds)
| Name | diamonds |
|---|---|
| Number of rows | 53940 |
| Number of columns | 10 |
| Column type frequency: | |
| factor | 3 |
| numeric | 7 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cut | 0 | 1 | TRUE | 5 | Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906 |
| color | 0 | 1 | TRUE | 7 | G: 11292, E: 9797, F: 9542, H: 8304 |
| clarity | 0 | 1 | TRUE | 8 | SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| carat | 0 | 1 | 0.80 | 0.47 | 0.2 | 0.40 | 0.70 | 1.04 | 5.01 | ▇▂▁▁▁ |
| depth | 0 | 1 | 61.75 | 1.43 | 43.0 | 61.00 | 61.80 | 62.50 | 79.00 | ▁▁▇▁▁ |
| table | 0 | 1 | 57.46 | 2.23 | 43.0 | 56.00 | 57.00 | 59.00 | 95.00 | ▁▇▁▁▁ |
| price | 0 | 1 | 3932.80 | 3989.44 | 326.0 | 950.00 | 2401.00 | 5324.25 | 18823.00 | ▇▂▁▁▁ |
| x | 0 | 1 | 5.73 | 1.12 | 0.0 | 4.71 | 5.70 | 6.54 | 10.74 | ▁▁▇▃▁ |
| y | 0 | 1 | 5.73 | 1.14 | 0.0 | 4.72 | 5.71 | 6.54 | 58.90 | ▇▁▁▁▁ |
| z | 0 | 1 | 3.54 | 0.71 | 0.0 | 2.91 | 3.53 | 4.04 | 31.80 | ▇▁▁▁▁ |
Before building a multiple linear regression model, it is important to ensure that the independent variables used are not highly correlated with each other. If highly correlated independent variables are used in building a regression model, the quality of the model will be compromised. This phenomenon is known as multicollinearity in statistics.
A correlation matrix is commonly used to visualise the relationships between the independent variables. Besides the pairs() function of base R, there are many packages that support the display of a correlation matrix. In this section, the corrplot package will be used.
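As a quick sketch of the base-R alternative mentioned above, pairs() can display a scatterplot matrix of the same numeric variables; plotting a random subsample of 2,000 rows keeps the plot responsive.

# A sketch: base-R scatterplot matrix on a random subsample
set.seed(1243)
pairs(dplyr::sample_n(diamonds[, c(1, 5:6, 8:10)], 2000))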
The code chunk below plots a correlation matrix of the numeric independent variables (carat, depth, table, x, y and z, i.e. columns 1, 5, 6 and 8 to 10) in the diamonds data.frame.
corrplot(cor(diamonds[, c(1, 5:6, 8:10)]),  # carat, depth, table, x, y, z
         diag = FALSE,              # omit the diagonal
         tl.pos = "td",             # labels on top and along the diagonal
         tl.cex = 1,
         method = "number",         # print correlation coefficients
         type = "upper",            # upper triangle only
         mar = c(0, 0, 1.5, 0),
         title = "Correlations between predictors")
First of all, we want to set aside a data set for testing the predictions at the end. We'll keep 60% of the data for training and 40% of the data for testing.
The initial code chunk below performs the split without excluding the variables that are highly correlated.
set.seed(1243)
diamonds_split <- initial_split(diamonds,
prop = .6,
strata = price)
initial_split() of the rsample package creates a single binary split of the data into a training set and a testing set. initial_time_split() does the same, but takes the first prop samples for training instead of a random selection. training() and testing() are used to extract the resulting data sets.
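As a minimal sketch (not used in the rest of this exercise), a time-ordered split would look like this:

# A sketch: keep the first 60% of rows for training instead of a
# random sample -- useful for time-ordered data
diamonds_time_split <- initial_time_split(diamonds, prop = .6)
dim(training(diamonds_time_split))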
The correlation plot above shows that x, y and z are highly correlated with carat, so the code chunk below excludes these predictors (keeping only columns 1 to 7) before splitting.
set.seed(1243)
diamonds_split <- diamonds %>%
select(c(1:7)) %>%
initial_split(prop = .6,
strata = price)
The complete code chunk, including the extraction of the training and testing data sets, is shown below.
set.seed(1243)
diamonds_split <- diamonds %>%
select(c(1:7)) %>%
initial_split(prop = .6,
strata = price)
training_data <- training(diamonds_split)
testing_data <- testing(diamonds_split)
Furthermore, the training data set will be prepared for 3-fold cross-validation (using three here to speed things up). All this is accomplished using the rsample package:
vfold_data <- vfold_cv(training_data, v = 3,
repeats = 1,
strata = price)
vfold_data %>%
mutate(df_ana = map(splits, analysis),
df_ass = map(splits, assessment))
## # 3-fold cross-validation using stratification
## # A tibble: 3 x 4
## splits id df_ana df_ass
## <list> <chr> <list> <list>
## 1 <split [21.6K/10.8K]> Fold1 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
## 2 <split [21.6K/10.8K]> Fold2 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
## 3 <split [21.6K/10.8K]> Fold3 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
The recipes package provides a collection of useful functions for performing data pre-processing and feature engineering tasks. By and large, these tasks can be accomplished using a family of step_*() functions. For example, the plot below indicates that there may be a nonlinear relationship between price and carat, and I want to address that using higher-order terms.
qplot(carat, price, data = training_data) +
  scale_y_continuous(trans = scales::log_trans(),  # log scale for price
                     labels = function(x) round(x, -2)) +
  geom_smooth(method = "lm", formula = "y ~ poly(x, 4)") +
  labs(title = "Nonlinear relationship between price and carat of diamonds",
       subtitle = "The degree of the polynomial is a potential tuning parameter")
Herein, I want to log-transform price (step_log()), center and scale all numeric predictors (step_normalize()), and dummy-code the categorical predictors (step_dummy()). Furthermore, a quadratic effect of carat is added using step_poly().
processed_data <- recipe(price ~ ., data = training_data) %>%
  step_log(all_outcomes()) %>%                           # log-transform price
  step_normalize(all_predictors(), -all_nominal()) %>%   # centre and scale numeric predictors
  step_dummy(all_nominal()) %>%                          # dummy-code categorical predictors
  step_poly(carat, degree = 2)                           # quadratic effect of carat
Calling prep() on a recipe applies all the steps. You can now call juice() to extract the transformed data set or call bake() on a new data set.
prep(processed_data)
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 6
##
## Training data contained 32364 data points and no missing data.
##
## Operations:
##
## Log transformation on price [trained]
## Centering and scaling for carat, depth, table [trained]
## Dummy variables from cut, color, clarity [trained]
## Orthogonal polynomials on carat [trained]
juiced_data <- juice(prep(processed_data))
names(juiced_data)
## [1] "depth" "table" "price" "cut_1" "cut_2"
## [6] "cut_3" "cut_4" "color_1" "color_2" "color_3"
## [11] "color_4" "color_5" "color_6" "clarity_1" "clarity_2"
## [16] "clarity_3" "clarity_4" "clarity_5" "clarity_6" "clarity_7"
## [21] "carat_poly_1" "carat_poly_2"
The parsnip package has wrappers around many popular machine learning algorithms, and you can fit them using a unified interface. This is extremely helpful, since you have to remember only one interface rather than dozens.
The models are separated into two modes/categories, namely, regression and classification (set_mode()). The model is defined using a function specific to each algorithm (e.g., linear_reg(), rand_forest()). Finally, the backend/engine/implementation is selected using set_engine().
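For instance, a random forest specification (a sketch only, assuming the ranger package is installed; it is not used further in this session) follows exactly the same pattern:

# A sketch: same unified interface, different algorithm and engine
rf_model <- rand_forest(trees = 500) %>%
  set_mode("regression") %>%
  set_engine("ranger")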
Herein, I will start with a basic linear regression model as implemented in stats::lm().
lm_model <-
linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
Next, we can fit() the model.
lm_fit1 <- fit(lm_model, price ~ ., juiced_data)
lm_fit1
## parsnip model object
##
## Fit time: 30ms
##
## Call:
## stats::lm(formula = price ~ ., data = data)
##
## Coefficients:
## (Intercept) depth table cut_1 cut_2
## 7.720e+00 -2.957e-03 -9.111e-04 9.191e-02 -7.278e-03
## cut_3 cut_4 color_1 color_2 color_3
## 1.423e-02 -1.951e-04 -4.524e-01 -9.292e-02 -5.462e-03
## color_4 color_5 color_6 clarity_1 clarity_2
## 1.683e-02 -1.066e-02 -7.511e-04 8.133e-01 -2.264e-01
## clarity_3 clarity_4 clarity_5 clarity_6 clarity_7
## 1.181e-01 -4.116e-02 2.266e-02 3.736e-03 2.340e-02
## carat_poly_1 carat_poly_2
## 1.888e+02 -5.429e+01
Many models have implemented summary() or coef() methods. However, their output is usually not in a tidy format, and the broom package aims to resolve this issue.
glance() gives us information about the whole model. Here, R squared is pretty high and the RMSE (sigma) equals 0.164.
glance(lm_fit1$fit)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.974 0.974 0.164 57320. 0 21 12605. -25164. -24971.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy() gives us information about the model parameters, and we see that we have a significant quadratic effect of carat.
tidy(lm_fit1) %>%
arrange(desc(abs(statistic)))
## # A tibble: 22 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.72 0.00184 4186. 0.
## 2 carat_poly_1 189. 0.189 998. 0.
## 3 carat_poly_2 -54.3 0.167 -325. 0.
## 4 clarity_1 0.813 0.00567 143. 0.
## 5 color_1 -0.452 0.00325 -139. 0.
## 6 clarity_2 -0.226 0.00530 -42.7 0.
## 7 color_2 -0.0929 0.00296 -31.4 1.18e-213
## 8 clarity_3 0.118 0.00454 26.0 5.28e-148
## 9 cut_1 0.0919 0.00420 21.9 3.44e-105
## 10 clarity_4 -0.0412 0.00361 -11.4 5.22e- 30
## # ... with 12 more rows
Finally, augment() can be used to get model predictions, residuals, etc.
lm_predicted <- augment(lm_fit1$fit, data = juiced_data) %>%
rowid_to_column()
select(lm_predicted, rowid, price, .fitted:.std.resid)
## # A tibble: 32,364 x 5
## rowid price .fitted .resid .std.resid
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.79 6.04 -0.253 -1.54
## 2 2 5.79 6.33 -0.544 -3.32
## 3 3 5.81 6.19 -0.381 -2.33
## 4 4 5.81 5.79 0.0211 0.129
## 5 5 5.82 6.16 -0.344 -2.10
## 6 6 5.82 6.14 -0.317 -1.93
## 7 7 5.83 5.92 -0.0964 -0.588
## 8 8 5.83 5.94 -0.112 -0.685
## 9 9 5.84 5.87 -0.0330 -0.201
## 10 10 5.84 5.89 -0.0451 -0.275
## # ... with 32,354 more rows
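As a minimal sketch, the augmented data can also be used directly for diagnostic plots, e.g. residuals against fitted values:

# A sketch: residuals vs fitted values from the augmented data
ggplot(lm_predicted, aes(.fitted, .resid)) +
  geom_point(alpha = .2) +
  geom_hline(yintercept = 0, colour = "red") +
  labs(x = "Fitted log(price)", y = "Residual")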
We already saw performance measures RMSE and R squared in the output of glance() above. The yardstick package is specifically designed for such measures for both numeric and categorical outcomes, and it plays well with grouped predictions (e.g., from cross-validation).
Let’s use rsample, parsnip, and yardstick for cross-validation to get a more accurate estimation of RMSE.
In the following pipeline, the model is fit() separately to the three analysis data sets, and then the fitted models are used to predict() on the three corresponding assessment data sets (i.e., 3-fold cross-validation).
Before that, analysis() and assessment() are used to extract the respective folds from vfold_data. Furthermore, the recipe processed_data is prepped (i.e., trained) using the analysis data of each fold, and this prepped recipe is then applied to the assessment data of each fold using bake(). Preparing the recipe separately for each fold (rather than once for the whole training data set training_data) guards against data leakage.
The code in the following chunk makes use of list columns to store all information about the three folds in a single tibble lm_fit2, and a combination of dplyr::mutate() and purrr::map() is used to “loop” across the three rows of the tibble.
vfold_data
## # 3-fold cross-validation using stratification
## # A tibble: 3 x 2
## splits id
## <list> <chr>
## 1 <split [21.6K/10.8K]> Fold1
## 2 <split [21.6K/10.8K]> Fold2
## 3 <split [21.6K/10.8K]> Fold3
extracted_data <- mutate(vfold_data,
                         df_ana = map(splits, analysis),
                         df_ass = map(splits, assessment))
extracted_data
## # 3-fold cross-validation using stratification
## # A tibble: 3 x 4
## splits id df_ana df_ass
## <list> <chr> <list> <list>
## 1 <split [21.6K/10.8K]> Fold1 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
## 2 <split [21.6K/10.8K]> Fold2 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
## 3 <split [21.6K/10.8K]> Fold3 <tibble [21,576 x 7]> <tibble [10,788 x 7]>
lm_fit2 <- extracted_data %>%
  # Prep (train) the recipe on the analysis data of each fold,
  # then juice the analysis data and bake the assessment data
  mutate(recipe = map(df_ana, ~prep(processed_data, training = .x)),
         df_ana = map(recipe, juice),
         df_ass = map2(recipe, df_ass, ~bake(.x, new_data = .y))) %>%
  # Fit the model to the analysis data of each fold
  mutate(model_fit = map(df_ana, ~fit(lm_model, price ~ ., data = .x))) %>%
  # Predict on the assessment data of each fold
  mutate(model_pred = map2(model_fit, df_ass, ~predict(.x, new_data = .y)))
select(lm_fit2, id, recipe:model_pred)
## # A tibble: 3 x 4
## id recipe model_fit model_pred
## <chr> <list> <list> <list>
## 1 Fold1 <recipe> <fit[+]> <tibble [10,788 x 1]>
## 2 Fold2 <recipe> <fit[+]> <tibble [10,788 x 1]>
## 3 Fold3 <recipe> <fit[+]> <tibble [10,788 x 1]>
Now, we can extract the actual prices (price) from the assessment data and compare them to the predicted prices (.pred). Then, the yardstick package comes into play: the function metrics() calculates three different metrics for numeric outcomes. Furthermore, it automatically recognizes that lm_preds is grouped by folds and thus calculates the metrics for each fold.
Across the three folds, we see that the RMSE is a little higher and R squared a little smaller compared to above (see output of glance(lm_fit1$fit)). This is expected, since out-of-sample prediction is harder but also way more useful.
lm_preds <-
lm_fit2 %>%
mutate(res = map2(df_ass,
model_pred,
~data.frame(price = .x$price,
.pred = .y$.pred))) %>%
select(id, res) %>%
tidyr::unnest(res) %>%
group_by(id)
lm_preds
## # A tibble: 32,364 x 3
## # Groups: id [3]
## id price .pred
## <chr> <dbl> <dbl>
## 1 Fold1 5.79 6.04
## 2 Fold1 5.82 6.14
## 3 Fold1 5.83 5.94
## 4 Fold1 5.86 5.93
## 5 Fold1 5.88 6.34
## 6 Fold1 6.00 6.33
## 7 Fold1 6.00 6.43
## 8 Fold1 6.00 6.11
## 9 Fold1 6.00 6.00
## 10 Fold1 6.00 6.15
## # ... with 32,354 more rows
The code chunk below uses metrics() of the yardstick package.
metrics(lm_preds,
truth = price,
estimate = .pred)
## # A tibble: 9 x 4
## id .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 Fold1 rmse standard 0.162
## 2 Fold2 rmse standard 0.180
## 3 Fold3 rmse standard 0.153
## 4 Fold1 rsq standard 0.974
## 5 Fold2 rsq standard 0.969
## 6 Fold3 rsq standard 0.977
## 7 Fold1 mae standard 0.123
## 8 Fold2 mae standard 0.121
## 9 Fold3 mae standard 0.122
Note that metrics() has default measures for numeric and categorical outcomes; here RMSE, R squared, and the mean absolute error (MAE) are returned. You could also use a single metric directly, like rmse(), or define a custom set of metrics via metric_set().
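Finally, as a closing sketch (assuming the objects created in the chunks above), the held-out testing data set aside at the beginning can be scored with a custom metric set:

# A sketch: evaluate the final model on the held-out testing data
trained_rec  <- prep(processed_data)        # recipe trained on training_data
final_model  <- fit(lm_model, price ~ ., data = juice(trained_rec))
test_baked   <- bake(trained_rec, new_data = testing_data)
test_results <- test_baked %>%
  mutate(.pred = predict(final_model, new_data = test_baked)$.pred)
multi_metric <- metric_set(rmse, rsq, mae)  # custom set of metrics
multi_metric(test_results, truth = price, estimate = .pred)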