00 - Introduction

This week's data dive focuses on expanding my understanding of generalized linear models (GLMs), with an emphasis on the diagnostic tools that are available. It will accomplish three objectives:

  1. Build a generalized linear model.

  2. Use tools outlined in prior weeks to diagnose the model and any issues that may be present.

  3. Interpret at least one of the model’s coefficients and how it relates to the business context of the dataset.

For the purposes of this data dive, I will focus on improving my prior logistic regression model for the outcome variable (whether a bank client subscribes to a term deposit) by incorporating new variables and/or adjusting the model, using some of the comparison techniques explained in the week #11 class notebook.

Once again, the dataset used for this data dive is the Bank Marketing dataset provided by the UC Irvine Machine Learning Repository. We will begin by:

  1. Referencing relevant libraries, setting the working directory, and reading in the appropriate dataset.
  2. Pulling in all the code from the previous logistic regression model.
# Declare libraries
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(broom)
## Warning: package 'broom' was built under R version 4.5.3
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(GGally)
library(lindia)
## Warning: package 'lindia' was built under R version 4.5.3
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.3
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.5.3
library(CalibrationCurves)
## Warning: package 'CalibrationCurves' was built under R version 4.5.3
## Loading required package: rms
## Warning: package 'rms' was built under R version 4.5.3
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.5.3
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Registered S3 method overwritten by 'riskRegression':
##   method        from 
##   nobs.multinom broom
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Previous Model

# Remove Unknown Categories
cleaned_data <- bank_marketing |>
  mutate(
    outcome = if_else(y == "yes", 1, 0) # Convert response variable to binary
  ) |>
  filter(
    job != "unknown",
    education != "unknown"
  )

prev_model <- glm(outcome ~ job + balance + education + poutcome, data = cleaned_data, family = binomial(link = 'logit'))

tbl_regression(prev_model, exponentiate = TRUE) |> # Applying tbl_regression function per Modern Statistics with R textbook
   add_glance_table(
    include = AIC,
    label = list(AIC ~ "AIC")
  )
Characteristic          OR      95% CI        p-value
job
    admin.              (ref)
    blue-collar         0.69    0.61, 0.78    <0.001
    entrepreneur        0.64    0.51, 0.79    <0.001
    housemaid           0.80    0.63, 1.00    0.056
    management          0.88    0.78, 1.00    0.056
    retired             2.01    1.74, 2.32    <0.001
    self-employed       0.85    0.70, 1.03    0.094
    services            0.78    0.67, 0.90    <0.001
    student             2.45    2.02, 2.96    <0.001
    technician          0.88    0.78, 0.99    0.037
    unemployed          1.29    1.07, 1.55    0.007
balance                 1.00    1.00, 1.00    <0.001
education
    primary             (ref)
    secondary           1.17    1.05, 1.31    0.004
    tertiary            1.63    1.44, 1.85    <0.001
poutcome
    failure             (ref)
    other               1.36    1.17, 1.59    <0.001
    success             11.2    9.72, 12.9    <0.001
    unknown             0.72    0.65, 0.79    <0.001
AIC                     28,018

Abbreviations: CI = Confidence Interval, OR = Odds Ratio; (ref) = reference category

01 - The Improved Logistic Regression Model

The first improvement I am going to make in my new logistic regression model is adjusting the reference categories for my categorical variables.

Additionally, I noticed that in my initial logistic regression model the predictions of subscription to a term deposit carried a high cost. In the previous data dive I said that I was over-predicting term deposit subscription – but this is incorrect. Upon review I realize that I was under-predicting, which may in part be caused by the disproportionate skew of the data set towards bank clients who ultimately did not subscribe. For reference, see the 4,117 successful term deposit outcomes that were predicted as failures by the previous model in the table below (where a prediction of term deposit subscription is determined by a probability greater than or equal to 0.5).

Note: This is very similar to the example pointed out in Modern Statistics with R section 8.3.2 where they are trying to predict whether a wine is red and found that the model was almost always predicting white. They mentioned that this is likely because of the disproportionate skew in the data set towards white wine and that adding more explanatory variables may assist.

# Create Comparison Table on Prior Model vs Actual

# Get fitted values from the model and compute p-hat
fitted_values <- fitted(prev_model)
predicted_class <- ifelse(fitted_values >= 0.5, 1, 0)

# Constructing Comparison table
comparison_table <- tibble(
  actual_value = cleaned_data$outcome,
  predicted_value = predicted_class
) |>
  count(actual_value, predicted_value) |>
  pivot_wider(
    names_from = predicted_value,
    values_from = n,
    values_fill = 0,
    names_prefix = "predicted_"
  )

comparison_table
## # A tibble: 2 × 3
##   actual_value predicted_0 predicted_1
##          <dbl>       <int>       <int>
## 1            0       37685         487
## 2            1        4117         904
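To put numbers on both the class imbalance and the miss rate, here is a quick sketch using the objects defined above (it assumes, as the comparison table itself does, that the fitted values align row-for-row with cleaned_data):

# Share of clients who actually subscribed, i.e., the class imbalance
mean(cleaned_data$outcome)

# Share of actual subscribers the model misses at the 0.5 cutoff
# (the false negative rate): 4117 / (4117 + 904), roughly 0.82
mean(predicted_class[cleaned_data$outcome == 1] == 0)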

So one solution is to increase the number of explanatory variables in our logistic regression model and see how that impacts our predictions, our Akaike Information Criterion (AIC), our coefficients, and some of the more advanced model diagnostics we are going to employ (such as the VIF). Let's go ahead and incorporate the following additional predictor variables into our model and see how that impacts the results:

Continuous Data:

  • age
  • campaign

Categorical Data:

  • default (converted to default_bin)
  • housing (converted to housing_bin)
  • loan (converted to loan_bin)
  • contact (converted to cellular_bin)
  • marital

We will first need to make some alterations to how the data are cleaned: converting the yes/no categorical variables default, housing, and loan (plus contact) into single binary predictor variables for ease of understanding (and so we don't have to deal with reference-category shenanigans), and then releveling the reference category for the remaining categorical columns to improve interpretability.

# Improved Cleaning of Data before Input into Model
improved_cleaned_data <- bank_marketing |>
  mutate(
    outcome = if_else(y == "yes", 1, 0), # Convert response variable to binary
    default_bin = if_else(default == "yes", 1, 0),
    housing_bin = if_else(housing == "yes", 1, 0),
    loan_bin = if_else(loan == "yes", 1, 0),
    cellular_bin = if_else(contact == "cellular", 1, 0),
    
    # Relevel appropriate columns for Logistic Regression call
    job = relevel(factor(job), ref = "blue-collar"),
    education = relevel(factor(education), ref = "primary"),
    marital = relevel(factor(marital), ref = "single"),
    poutcome = relevel(factor(poutcome), ref = "unknown")
  ) |>
  filter(
    job != "unknown",
    education != "unknown",
    marital != "unknown",
    contact != "unknown"
  )

# Call the Regression
improved_model <- glm(outcome ~ job + balance + education + poutcome + default_bin + housing_bin + loan_bin + cellular_bin + age + campaign + marital,
                      data = improved_cleaned_data, 
                      family = binomial(link = 'logit'))

tbl_regression(improved_model, exponentiate = TRUE) |>
  add_glance_table(
    include = AIC,
    label = list(AIC ~ "AIC")
  )
Characteristic          OR      95% CI        p-value
job
    blue-collar         (ref)
    admin.              1.31    1.14, 1.51    <0.001
    entrepreneur        0.92    0.72, 1.16    0.5
    housemaid           0.91    0.70, 1.16    0.4
    management          1.12    0.97, 1.29    0.13
    retired             2.05    1.72, 2.45    <0.001
    self-employed       1.03    0.83, 1.27    0.8
    services            1.12    0.95, 1.31    0.2
    student             2.19    1.77, 2.71    <0.001
    technician          1.07    0.94, 1.23    0.3
    unemployed          1.32    1.08, 1.62    0.006
balance                 1.00    1.00, 1.00    <0.001
education
    primary             (ref)
    secondary           1.11    0.99, 1.26    0.081
    tertiary            1.36    1.19, 1.57    <0.001
poutcome
    unknown             (ref)
    failure             1.14    1.03, 1.26    0.010
    other               1.55    1.35, 1.78    <0.001
    success             10.5    9.30, 11.8    <0.001
default_bin             0.63    0.43, 0.88    0.011
housing_bin             0.53    0.49, 0.57    <0.001
loan_bin                0.60    0.53, 0.67    <0.001
cellular_bin            1.23    1.08, 1.40    0.002
age                     1.00    1.00, 1.01    0.033
campaign                0.88    0.87, 0.90    <0.001
marital
    single              (ref)
    divorced            0.81    0.72, 0.92    0.001
    married             0.71    0.65, 0.78    <0.001
AIC                     22,561

Abbreviations: CI = Confidence Interval, OR = Odds Ratio; (ref) = reference category

02 - Interpreting the Expanded Model's Results

Within this 'expanded' model encompassing additional variables in the data set we observe:

  • The AIC drops substantially (from 28,018 to 22,561) – but this comparison is not apples-to-apples, because the expanded model filters out observations with an unknown contact type, so the two models are no longer fit on the same set of observations.

  • The reference categories for the categorical variables were releveled (e.g., job is now measured against blue-collar rather than admin.), so the coefficients are no longer reported against the same baselines as the prior model.

To account for both of these errors, I will re-run the prior model by applying the same modifications.

# Previous Model Adjusted to Remove Those with an Unknown Contact Type

# Remove Unknown Categories
cleaned_data <- bank_marketing |>
  mutate(
    outcome = if_else(y == "yes", 1, 0), # Convert response variable to binary
    
     # Relevel appropriate columns for Logistic Regression call
    job = relevel(factor(job), ref = "blue-collar"),
    education = relevel(factor(education), ref = "primary"),
    marital = relevel(factor(marital), ref = "single"),
    poutcome = relevel(factor(poutcome), ref = "unknown")
  ) |>
  filter(
    job != "unknown",
    education != "unknown",
    marital != "unknown", # technically 'unknown' isn't a category for the marital variable, but it doesn't hurt to play it safe
    contact != "unknown"
  )

# Generate previous model
prev_model <- glm(outcome ~ job + balance + education + poutcome, data = cleaned_data, family = binomial(link = 'logit'))

# Print Regression Table
tbl_regression(prev_model, exponentiate = TRUE) |>
   add_glance_table(
    include = AIC,
    label = list(AIC ~ "AIC")
  )
Characteristic          OR      95% CI        p-value
job
    blue-collar         (ref)
    admin.              1.49    1.30, 1.71    <0.001
    entrepreneur        0.91    0.72, 1.16    0.5
    housemaid           1.12    0.87, 1.41    0.4
    management          1.26    1.10, 1.45    0.001
    retired             2.98    2.56, 3.46    <0.001
    self-employed       1.18    0.96, 1.45    0.11
    services            1.19    1.02, 1.39    0.031
    student             3.64    2.97, 4.45    <0.001
    technician          1.23    1.08, 1.41    0.002
    unemployed          1.82    1.49, 2.21    <0.001
balance                 1.00    1.00, 1.00    <0.001
education
    primary             (ref)
    secondary           1.13    1.00, 1.27    0.049
    tertiary            1.49    1.30, 1.71    <0.001
poutcome
    unknown             (ref)
    failure             1.08    0.98, 1.19    0.10
    other               1.46    1.28, 1.67    <0.001
    success             12.3    10.9, 13.8    <0.001
AIC                     23,242

Abbreviations: CI = Confidence Interval, OR = Odds Ratio; (ref) = reference category

Now that the previous model is updated so that it is computing across the same observations and reference categories, we can see that the extended model still achieves the lower AIC (22,561 versus 23,242), so the additional predictors improve the fit even after paying their complexity penalty.
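As a quick sketch (assuming both re-fit model objects are still in scope), the two AICs can also be pulled side by side directly with base R:

# Direct AIC comparison, now valid since both models use the same observations
AIC(prev_model, improved_model)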

Jitter Comparison

Another interesting way to compare our models is to look at jitter plots of the predicted probabilities relative to the actual values.

# Get the Previous Model's Residual Values
plot_previous_df <- augment(
  prev_model,
  type.predict = "response",
  type.residuals = "deviance"
) |>
  mutate(
    observed = model.response(model.frame(prev_model)),
    pred_prob = .fitted,
    response_residual = observed - pred_prob
  )

# Construct the Chart
plot_prev <- ggplot(plot_previous_df, aes(x = pred_prob, y = observed)) +
  geom_jitter(width = 0.01, height = 0.05, alpha = 0.4) +
  geom_vline(xintercept = 0.5, color = "red", linetype = "dotted") +
  scale_y_continuous(
    breaks = c(0, 1),
    limits = c(-0.1, 1.1)
  ) +
  scale_x_continuous(
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.25)
  ) +
  labs(
    x = "Predicted Probability",
    y = "Actual Observation",
    title = "Old Model"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )


# Get the Extended Model's Residual Values
plot_extended_df <- augment(
  improved_model,
  type.predict = "response",
  type.residuals = "deviance"
) |>
  mutate(
    observed = model.response(model.frame(improved_model)),
    pred_prob = .fitted,
    response_residual = observed - pred_prob
  )

# Construct the Chart
plot_extended <- ggplot(plot_extended_df, aes(x = pred_prob, y = observed)) +
  geom_jitter(width = 0.01, height = 0.05, alpha = 0.4) +
  geom_vline(xintercept = 0.5, color = "red", linetype = "dotted") +
  scale_y_continuous(
    breaks = c(0, 1),
    limits = c(-0.1, 1.1)
  ) +
  scale_x_continuous(
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.25)
  ) +
  labs(
    x = "Predicted Probability",
    y = "Actual Observation",
    title = "New Model"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )


plot_extended | plot_prev
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).

Here we can see that with the new variables introduced, the spread of the predicted probabilities is larger, whereas in the previous model there is a gap where the model isn't predicting probabilities just shy of 0.5. In the newer model, the addition of other variables may have smoothed this gap by providing more parameters that can produce these 'in-between' probabilities. Also notice the red dotted line marking the prediction threshold, where a probability >= 0.5 means we would have predicted a successful term deposit subscription.

  • In both cases, the dots appear visually denser to the right of the 0.5 threshold in the row where the actual observation resulted in a successful term deposit subscription.
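As a rough numeric check on that visual impression, this sketch (assuming both model objects are in scope) tallies the share of fitted probabilities that land in the band just below the 0.5 threshold for each model:

# Share of predictions landing just below the 0.5 threshold, per model
tibble(
  model = c("previous", "improved"),
  share_just_below = c(
    mean(fitted(prev_model) >= 0.4 & fitted(prev_model) < 0.5),
    mean(fitted(improved_model) >= 0.4 & fitted(improved_model) < 0.5)
  )
)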

Calibration Curves

Another method for evaluating model fit (and this comes in part from the OpenIntro Statistics book, pg. 377) is to look at a calibration curve. R has a library, CalibrationCurves, that can be used to construct these plots; additional details can be found in the package documentation. Since our AIC is lower on our extended model and its chart doesn't have a discontinuity (if that matters), we're going to graph the calibration curve only for this model and then interpret what it means.

# Plot Calibration Curve
cal_improved <- valProbggplot(
  p = predict(improved_model, type = "response"),
  y = model.response(model.frame(improved_model)),
  smooth = "rcs"
)

plot_improved <- cal_improved$ggPlot +
  labs(title = "Improved model Calibration Curve") +
  theme_classic(base_size = 16) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 18),
    axis.title = element_text(size = 16),
    axis.text = element_text(size = 13)
  )

plot_improved

The calibration curve essentially takes our predicted probabilities within groups and checks whether each predicted probability was actually observed in the underlying data. As Diez et al. put it, "if we look at resumes that we modeled as having a 10% chance of getting a callback, do we find about 10% of them actually receive a callback?" (pg. 377). In this case, if our 'group' probability is 0.25, do 25% of these individuals (1 in 4) actually subscribe to a term deposit as we would expect? Ideally, the calibration line for the model would follow the ideal slope of \(y = x\), where each observed probability corresponds to the predicted probability.
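To make the grouping idea concrete, here is a minimal hand-rolled version of the same check using dplyr (a sketch, not the CalibrationCurves internals): bin the predicted probabilities into deciles and compare the average prediction in each bin against the observed subscription rate.

# Manual calibration check: mean predicted probability vs. observed
# subscription rate within each decile of predicted probability
tibble(
  pred = fitted(improved_model),
  obs  = model.response(model.frame(improved_model))
) |>
  mutate(bin = ntile(pred, 10)) |>
  group_by(bin) |>
  summarise(
    mean_predicted = mean(pred),
    observed_rate  = mean(obs),
    n = n()
  )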

In our case, we can see that the model appears reasonably well calibrated. Both the intercept and slope of our calibration lines are near the ideal values of 0 and 1 (even with a 95% confidence interval), and the flexible calibration curve appears to track the line closely across most predicted probabilities. There is some evidence of over-prediction towards the end of the line: for groups where the observed subscription rate is 65 percent and higher, we are effectively predicting that about 80% of these clients will subscribe to a term deposit.

I do want to note that reading Chapter 9 of OpenIntro Statistics really helped me get an understanding of the principles of logistic regression and how to interpret its results – so that was a worthwhile investment, although there is still much I would like to learn 'under the hood', so to speak.

Variance Inflation Factor

Finally, before I conclude that I am happy with this model, I am going to check the Variance Inflation Factor (VIF) to identify any multicollinearity issues among my predictor variables that would inflate the variance of the coefficient estimates.

  • VIF works by performing an OLS regression of each explanatory variable \(X_i\) on all of the other explanatory variables.

  • We then calculate the VIF with the formula \(\frac{1}{1 - R_i^2}\).

  • The higher the \(R_i^2\) from that auxiliary regression, the greater the VIF. For example, if the \(R^2\) of a predictor variable regressed against the other variables were \(0.9\), the VIF would be \(\frac{1}{1 - 0.9} = 10\) (see the sketch after this list).

  • The rule of thumb when interpreting the VIF is to keep values that are below 10, although a cutoff of 5 is also commonly used, as noted per Wikipedia.
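
As a sketch of that formula in action, we can compute the VIF for age by hand from improved_cleaned_data. Note this raw OLS-based value may not match the packaged output below exactly, since the loaded vif() works from the fitted GLM, but it illustrates the mechanics:

# Hand-computed VIF for 'age': regress it on the other explanatory
# variables with OLS, then apply 1 / (1 - R^2)
aux_fit <- lm(age ~ job + balance + education + poutcome + default_bin +
                housing_bin + loan_bin + cellular_bin + campaign + marital,
              data = improved_cleaned_data)
1 / (1 - summary(aux_fit)$r.squared)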

We can see below that none of our VIF values are anywhere remotely near 10, or even 5. One additional wrinkle: because the rms package is loaded here, vif() reports one value per dummy coefficient; for factors with more than one degree of freedom, the car package instead calculates generalized variance inflation factors (GVIFs) – see the sketch after the output below.

# Variance Inflation Factor (vif() here comes from the rms package, loaded with CalibrationCurves)
vif(improved_model)
##          jobadmin.    jobentrepreneur       jobhousemaid      jobmanagement 
##           1.853917           1.229995           1.175777           3.406901 
##         jobretired   jobself-employed        jobservices         jobstudent 
##           2.121337           1.357861           1.515107           1.422355 
##      jobtechnician      jobunemployed            balance educationsecondary 
##           2.201165           1.294766           1.031330           3.219842 
##  educationtertiary    poutcomefailure      poutcomeother    poutcomesuccess 
##           4.086999           1.085470           1.040962           1.041998 
##        default_bin        housing_bin           loan_bin       cellular_bin 
##           1.009676           1.145267           1.026390           1.063966 
##                age           campaign    maritaldivorced     maritalmarried 
##           2.158696           1.028248           1.420999           1.602795
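
For reference, the GVIF calculation lives in the car package rather than the function used above; a sketch of that call (assuming car is installed, since it is not loaded here):

# car::vif() reports one GVIF per term; GVIF^(1/(2*Df)) is the column
# comparable to an ordinary VIF for factors with more than one level
car::vif(improved_model)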

Overall we can conclude:

  • The extended model has a lower AIC than the previous model on the same observations, even after accounting for its additional parameters.

  • The extended model has a smoother distribution of predicted probabilities ~ e.g., it lacks the disjointedness or bimodality of the prior model. We may want to actually generate the cumulative distribution function here (see the sketch after this list) – but this may not really matter and was more so an interesting observation.

  • The extended model probabilities seem to generally align well with observed probabilities based on the calibration curve – although it starts to deviate and over-predict among some groups.

  • We do not observe evidence that multicollinearity among predictor variables is a problem in the model.
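
A minimal sketch of that cumulative-distribution idea, plotted over the two models' predicted probabilities rather than their residuals:

# Empirical CDFs of the predicted probabilities for both models
ggplot() +
  stat_ecdf(aes(x = fitted(prev_model), colour = "previous")) +
  stat_ecdf(aes(x = fitted(improved_model), colour = "improved")) +
  labs(
    x = "Predicted probability",
    y = "Cumulative proportion of observations",
    colour = "Model"
  )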

03 - Interpreting the Model’s Coefficients and Business Relevance

In prior versions, when interpreting the model's coefficients, I relied on expressing the relative odds as percentage increases. This time around I'm going to focus on interpreting three key coefficients within my extended ('improved') model:

  • job (student): Relative to the blue-collar reference category, students have approximately 2.19 times the odds of subscribing to a term deposit, holding all other variables constant.

  • housing_bin: Clients with a housing loan have approximately 0.53 times the odds of subscribing (roughly 47% lower odds) than clients without one, holding all else constant.

  • poutcome (success): Clients who subscribed during a previous marketing campaign have approximately 10.5 times the odds of subscribing again, relative to clients whose prior outcome is unknown, holding all else constant.

These findings are interesting! I certainly did expect that subscription to a term deposit in a previous marketing campaign would result in a higher likelihood that the client would subscribe again, and that those who have a housing loan may be less likely to put money away in a term deposit due to other obligations. What I wasn't expecting was to find that students have the highest odds of all occupational groups (relative to blue-collar workers) of subscribing to a term deposit. That's an interesting finding, and I wonder if there are omitted variables that may influence this decision.

Note: When conversing with ChatGPT and providing feedback on this section, it mentioned that the standard interpretation is to apply odds ratios ~ e.g., students have approximately 2.18 times the odds of subscribing, which equates to about 118% higher odds than blue-collar clients, holding all else equal. It says both are valid, and I prefer this way of expressing things in terms of a baseline; I think it makes the results more tangible.
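As a small sketch of that conversion (assuming improved_model is still in scope), pulling the student coefficient off the log-odds scale:

# Convert the 'student' log-odds coefficient into an odds ratio and
# into a percentage change in odds vs. the blue-collar baseline
b_student <- coef(improved_model)["jobstudent"]
exp(b_student)               # odds ratio, approximately 2.18
(exp(b_student) - 1) * 100   # approximately 118% higher odds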

# Model Results for Reference
tbl_regression(improved_model, intercept = TRUE) # Note: not calling exponentiate so I can do the conversion manually
Characteristic          log(OR)   95% CI         p-value
(Intercept)             -1.9      -2.1, -1.6     <0.001
job
    blue-collar         (ref)
    admin.              0.27      0.13, 0.42     <0.001
    entrepreneur        -0.09     -0.33, 0.15    0.5
    housemaid           -0.10     -0.35, 0.15    0.4
    management          0.11      -0.03, 0.25    0.13
    retired             0.72      0.54, 0.90     <0.001
    self-employed       0.03      -0.18, 0.24    0.8
    services            0.11      -0.05, 0.27    0.2
    student             0.78      0.57, 1.0      <0.001
    technician          0.07      -0.06, 0.21    0.3
    unemployed          0.28      0.08, 0.48     0.006
balance                 0.00      0.00, 0.00     <0.001
education
    primary             (ref)
    secondary           0.11      -0.01, 0.23    0.081
    tertiary            0.31      0.17, 0.45     <0.001
poutcome
    unknown             (ref)
    failure             0.13      0.03, 0.23     0.010
    other               0.44      0.30, 0.57     <0.001
    success             2.4       2.2, 2.5       <0.001
default_bin             -0.46     -0.84, -0.12   0.011
housing_bin             -0.64     -0.72, -0.57   <0.001
loan_bin                -0.52     -0.63, -0.41   <0.001
cellular_bin            0.21      0.08, 0.34     0.002
age                     0.00      0.00, 0.01     0.033
campaign                -0.12     -0.14, -0.10   <0.001
marital
    single              (ref)
    divorced            -0.21     -0.33, -0.08   0.001
    married             -0.34     -0.43, -0.25   <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio; (ref) = reference category

04 - Ideas for Improvement and Questions