My motivation and code I used in here is an adaptation from Dr.Julia Silage’s blog bird-baths. I’ve applied the similar modeling process to Default dataset from {ISLR} package.
The goal is to build logistic regression model to predict default status.
library(tidyverse)
library(ISLR)
theme_set(theme_bw())
Let’s take a look at the Default data set. It has 2 numeric variables: balance and income; and 2 factor variables: default and student
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
Detailed summary can be done with skimr::skim()
skimr::skim(Default)
| Name | Default |
| Number of rows | 10000 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| default | 0 | 1 | FALSE | 2 | No: 9667, Yes: 333 |
| student | 0 | 1 | FALSE | 2 | No: 7056, Yes: 2944 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| balance | 0 | 1 | 835.37 | 483.71 | 0.00 | 481.73 | 823.64 | 1166.31 | 2654.32 | ▆▇▅▁▁ |
| income | 0 | 1 | 33516.98 | 13336.64 | 771.97 | 21340.46 | 34552.64 | 43807.73 | 73554.23 | ▂▇▇▅▁ |
Please note that there are no missing values, which is great!
The goal here is to build a logistic regression model to predict default.
I will explore the relationship between default and other variables (income, balace, student). Let’s make some plots!
Default %>%
ggplot(aes(balance, income, color = default)) +
geom_point(alpha = 0.4) +
scale_color_brewer(palette = "Set1", direction = -1) +
labs(title = "Default status by income and balance")
By visual inspection, balance looks like a better predictor for default than income.
p1 <- Default %>%
ggplot(aes(balance, default, fill = student)) +
geom_boxplot(alpha = 0.8) +
scale_fill_brewer(palette = "Dark2") +
labs(title = "Distribution of default",
subtitle = "by balance and student status",
caption = "Data from ISLR package")
p1
This plot shows that balance and student seem to be a decent predictor of default.
Goal: Outcome variable = default (factor)
First, I will use glm() function to build logistic regression model using all predictors.
glm_fit_all <- glm(default ~ ., data = Default, family = binomial)
summary(glm_fit_all)
##
## Call:
## glm(formula = default ~ ., family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
Coefficient of income is not significant (p = 0.7115203). Let’s try remove income out.
glm_fit_st_bal <- glm(default ~ student + balance,
data = Default,
family = binomial)
glm_fit_st_bal
##
## Call: glm(formula = default ~ student + balance, family = binomial,
## data = Default)
##
## Coefficients:
## (Intercept) studentYes balance
## -10.749496 -0.714878 0.005738
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9997 Residual
## Null Deviance: 2921
## Residual Deviance: 1572 AIC: 1578
Model that include only student and balance has lower estimation of test error (i.e., AIC & BIC).
list("All Predictors" = glm_fit_all,
"student + balance" = glm_fit_st_bal) %>%
map_dfr(
broom::glance,
.id = "Model") %>%
select(Model, AIC, BIC)
Now, let’s apply full modeling process with tidymodels.
library(tidymodels)
First, we need to spending data budgets in 2 portions: training and testing data.
set.seed(123)
# Split to Test and Train
Default_split <- initial_split(Default, strata = default)
Default_train <- training(Default_split) # 75% to traning data
Default_test <- testing(Default_split) # 25% to test data
Default_split
## <Analysis/Assess/Total>
## <7500/2500/10000>
I will use 10-Fold Cross-Validation to resample the training data.
set.seed(234)
Default_folds <- vfold_cv(Default_train, v = 10, strata = default)
Default_folds
Note that I use strata = default because the frequency of each class is quite difference.
Default %>% count(default)
For each folds, 6750 rows will spend on fitting models and 750 spend on analysis of model performance:
Default_folds$splits[[1]]
## <Analysis/Assess/Total>
## <6750/750/7500>
glm_spec <- logistic_reg()
glm_spec
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
rec_basic <- recipe(default ~ ., data = Default) %>%
step_dummy(all_nominal_predictors())
summary(rec_basic)
Preview of engineered training data
You can see that student was replaced by student_Yes with 0-1 encoding. What does it mean?
rec_basic %>%
prep(log_change = T) %>%
bake(new_data = Default_train)
## step_dummy (dummy_72FQt):
## new (1): student_Yes
## removed (1): student
From contrast() function we can see the dummy encoding of the student (factor) variable: No = 0, Yes = 1.
contrasts(Default$student)
## Yes
## No 0
## Yes 1
workflow() will bundle blueprint of feature engineering and model specification together.
wf_basic <- workflow(rec_basic, glm_spec)
wf_basic
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
doParallel::registerDoParallel()
ctrl_preds <- control_resamples(save_pred = TRUE)
## Fit
rs_basic <- fit_resamples(wf_basic, resamples = Default_folds, control = ctrl_preds)
head(rs_basic)
From ROC curve we can see that it has pretty good upper-left bulging curve.
augment(rs_basic) %>%
roc_curve(truth = default, .pred_No) %>%
autoplot()
This would result in AUC (area under the ROC curve) and accuracy close to 1.
collect_metrics(rs_basic)
As we can see from the beginning, removing income from predictor result in better estimation of test error by AIC and BIC.
Now, I will remove income from the recipes:
rec_simple <- rec_basic %>%
remove_role(income, old_role = "predictor")
summary(rec_simple)
And I will update the WorkFlow
wf_simple <- workflow(rec_simple, glm_spec)
Then the rest is the same. So, I wrote a simple wrapper function to do it.
update_workflow <- function(wf) {
ctrl_preds <- control_resamples(save_pred = TRUE)
rs <- fit_resamples(wf, resamples = Default_folds, control = ctrl_preds)
rs
}
rs_simple <- update_workflow(wf_simple)
rs_ls <- list("All Predictors" = rs_basic,
"student + balance" = rs_simple)
roc_df <- rs_ls %>%
map_dfr(~augment(.x) %>% roc_curve(truth = default, .pred_No),
.id = "Predictors"
)
This plot show comparison of ROC curves of the 2 logistic regression models.
student, balance and income.student and balance.roc_df %>%
ggplot(aes(1-specificity, sensitivity, color = Predictors)) +
geom_line(size = 1, alpha = 0.6, linetype = "dashed") +
geom_abline(intercept = 0, slope = 1, linetype = "dotted")
rs_ls %>%
map_dfr(
collect_metrics, .id = "Features"
)
You can see that a simpler model result in a similar (or may be slightly improved) AUC. So it’s reasonable to prefer it over more complicated model.
In this section, I will evaluate the simpler model with 2 predictors (student, balance).
First, fit the model to training data.
Default_fit <- fit(wf_simple, Default_train)
Default_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
##
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) balance student_Yes
## -10.85340 0.00578 -0.71313
##
## Degrees of Freedom: 7499 Total (i.e. Null); 7497 Residual
## Null Deviance: 2158
## Residual Deviance: 1139 AIC: 1145
Then, use test data to predict default.
Default_pred <-
augment(Default_fit, Default_test) %>%
bind_cols(
predict(Default_fit, Default_test, type = "conf_int")
)
library(latex2exp)
p2 <- Default_pred %>%
ggplot(aes(balance, .pred_Yes, fill = student)) +
geom_line(aes(color = student)) +
geom_ribbon(aes(ymin = .pred_lower_Yes, ymax = .pred_upper_Yes),
alpha = 0.3) +
scale_fill_brewer(palette = "Dark2") +
scale_color_brewer(palette = "Dark2") +
labs(y = TeX("$\\hat{p}(default)$"),
caption = "Area indicate 95% CI of the estimated probability") +
labs(title = "Predicted probability of default",
subtitle = "by logistic regression with 2 predictors")
p2
This plot show estimated probability of
default if we know the values of predictors: student and balance by using logistic regression model fitted on training data. The prediction was made by plugging test data to the model.
library(patchwork)
p1 + p2
The left-sided plot showed default status by balance and student status as observed in the Default data set. After multivariate logistic regression model (default on balance and student) was fitted to the training data, the predicted probability of default using the model was shown in the right-sided plot.
If you have any comment or noticed any mistake, please feel free to share it with me in the comment section.
Thanks a lot for reading.
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.1.0 (2021-05-18)
## os macOS Big Sur 10.16
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Asia/Bangkok
## date 2021-10-11
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
## backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.1.0)
## broom * 0.7.9 2021-07-27 [1] CRAN (R 4.1.0)
## bslib 0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0)
## cachem 1.0.5 2021-05-15 [1] CRAN (R 4.1.0)
## callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
## class 7.3-19 2021-05-03 [1] CRAN (R 4.1.0)
## cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
## codetools 0.2-18 2020-11-04 [1] CRAN (R 4.1.0)
## colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
## crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
## DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
## dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
## desc 1.3.0 2021-03-05 [1] CRAN (R 4.1.0)
## devtools 2.4.2 2021-06-07 [1] CRAN (R 4.1.0)
## dials * 0.0.10 2021-09-10 [1] CRAN (R 4.1.0)
## DiceDesign 1.9 2021-02-13 [1] CRAN (R 4.1.0)
## digest 0.6.28 2021-09-23 [1] CRAN (R 4.1.0)
## doParallel 1.0.16 2020-10-16 [1] CRAN (R 4.1.0)
## dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
## fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
## farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.0)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
## forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
## foreach 1.5.1 2020-10-15 [1] CRAN (R 4.1.0)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
## furrr 0.2.3 2021-06-25 [1] CRAN (R 4.1.0)
## future 1.22.1 2021-08-25 [1] CRAN (R 4.1.0)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
## ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
## globals 0.14.0 2020-11-22 [1] CRAN (R 4.1.0)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
## gower 0.2.2 2020-06-23 [1] CRAN (R 4.1.0)
## GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.1.0)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
## hardhat 0.1.6 2021-07-14 [1] CRAN (R 4.1.0)
## haven 2.4.3 2021-08-04 [1] CRAN (R 4.1.0)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.1.0)
## highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
## hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
## htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.0)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
## infer * 1.0.0 2021-08-13 [1] CRAN (R 4.1.0)
## ipred 0.9-11 2021-03-12 [1] CRAN (R 4.1.0)
## ISLR * 1.2 2017-10-20 [1] CRAN (R 4.1.0)
## iterators 1.0.13 2020-10-15 [1] CRAN (R 4.1.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
## jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
## knitr 1.34 2021-09-09 [1] CRAN (R 4.1.0)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
## latex2exp * 0.5.0 2021-03-18 [1] CRAN (R 4.1.0)
## lattice 0.20-44 2021-05-02 [1] CRAN (R 4.1.0)
## lava 1.6.9 2021-03-11 [1] CRAN (R 4.1.0)
## lhs 1.1.3 2021-09-08 [1] CRAN (R 4.1.0)
## lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.0)
## listenv 0.8.0 2019-12-05 [1] CRAN (R 4.1.0)
## lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
## magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
## MASS 7.3-54 2021-05-03 [1] CRAN (R 4.1.0)
## Matrix 1.3-3 2021-05-04 [1] CRAN (R 4.1.0)
## memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0)
## modeldata * 0.1.1 2021-07-14 [1] CRAN (R 4.1.0)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
## nnet 7.3-16 2021-05-03 [1] CRAN (R 4.1.0)
## parallelly 1.28.1 2021-09-09 [1] CRAN (R 4.1.0)
## parsnip * 0.1.7 2021-07-21 [1] CRAN (R 4.1.0)
## patchwork * 1.1.1 2020-12-17 [1] CRAN (R 4.1.0)
## pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
## pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.1.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
## pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.1.0)
## plyr 1.8.6 2020-03-03 [1] CRAN (R 4.1.0)
## prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
## pROC 1.17.0.1 2021-01-13 [1] CRAN (R 4.1.0)
## processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0)
## prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.1.0)
## ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
## RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.1.0)
## Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
## readr * 2.0.1 2021-08-10 [1] CRAN (R 4.1.0)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
## recipes * 0.1.16 2021-04-16 [1] CRAN (R 4.1.0)
## remotes 2.4.0 2021-06-02 [1] CRAN (R 4.1.0)
## repr 1.1.3 2021-01-21 [1] CRAN (R 4.1.0)
## reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
## rlang * 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
## rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.0)
## rpart 4.1-15 2019-04-12 [1] CRAN (R 4.1.0)
## rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
## rsample * 0.1.0 2021-05-08 [1] CRAN (R 4.1.0)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
## rvest 1.0.1 2021-07-26 [1] CRAN (R 4.1.0)
## sass 0.4.0 2021-05-12 [1] CRAN (R 4.1.0)
## scales * 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
## skimr 2.1.3 2021-03-07 [1] CRAN (R 4.1.0)
## stringi 1.7.4 2021-08-25 [1] CRAN (R 4.1.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
## survival 3.2-11 2021-04-26 [1] CRAN (R 4.1.0)
## testthat 3.0.4 2021-07-01 [1] CRAN (R 4.1.0)
## tibble * 3.1.4 2021-08-25 [1] CRAN (R 4.1.0)
## tidymodels * 0.1.3 2021-04-19 [1] CRAN (R 4.1.0)
## tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
## tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
## tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
## timeDate 3043.102 2018-02-21 [1] CRAN (R 4.1.0)
## tune * 0.1.6 2021-07-21 [1] CRAN (R 4.1.0)
## tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.1.0)
## usethis 2.0.1.9000 2021-09-23 [1] Github (r-lib/usethis@3385e14)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
## vctrs * 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
## withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
## workflows * 0.2.3 2021-07-16 [1] CRAN (R 4.1.0)
## workflowsets * 0.1.0 2021-07-22 [1] CRAN (R 4.1.0)
## xfun 0.26 2021-09-14 [1] CRAN (R 4.1.0)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
## yardstick * 0.0.8 2021-03-28 [1] CRAN (R 4.1.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library