00 - Introduction

This week's data dive focuses on expanding my understanding of generalized linear models (GLMs), with an emphasis on the diagnostic tools that are available. It will accomplish three objectives:

  1. Build a generalized linear model.

  2. Use tools outlined in prior weeks to diagnose the model and any issues that may be present.

  3. Interpret at least one of the model’s coefficients and how it relates to the business context of the dataset.

For the purposes of this data dive, I will focus on improving my prior logistic regression model for the outcome variable (whether a bank client subscribes to a term deposit) by incorporating new variables and/or adjusting the model, using some of the comparison techniques explained in the week #11 class notebook.

Once again, the dataset used for this data dive is the Bank Marketing dataset provided by the UC Irvine Machine Learning Repository. We will begin by:

  1. Referencing relevant libraries, setting the working directory, and reading in the appropriate dataset.
  2. Pulling in all the code from the previous logistic regression model.
# Declare libraries
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(broom)
## Warning: package 'broom' was built under R version 4.5.3
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(GGally)
library(lindia)
## Warning: package 'lindia' was built under R version 4.5.3
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.3
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.5.3
library(CalibrationCurves)
## Warning: package 'CalibrationCurves' was built under R version 4.5.3
## Loading required package: rms
## Warning: package 'rms' was built under R version 4.5.3
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.5.3
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Registered S3 method overwritten by 'riskRegression':
##   method        from 
##   nobs.multinom broom
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Previous Model

# Remove Unknown Categories
cleaned_data <- bank_marketing |>
  mutate(
    outcome = if_else(y == "yes", 1, 0) # Convert response variable to binary
  ) |>
  filter(
    job != "unknown",
    education != "unknown"
  )

prev_model <- glm(outcome ~ job + balance + education + poutcome, data = cleaned_data, family = binomial(link = 'logit'))

tbl_regression(prev_model, exponentiate = TRUE) |> # Applying tbl_regression function per Modern Statistics with R textbook
   add_glance_table(
    include = AIC,
    label = list(AIC ~ "AIC")
  )
Characteristic          OR      95% CI        p-value
job
    admin.              (ref)
    blue-collar         0.69    0.61, 0.78    <0.001
    entrepreneur        0.64    0.51, 0.79    <0.001
    housemaid           0.80    0.63, 1.00    0.056
    management          0.88    0.78, 1.00    0.056
    retired             2.01    1.74, 2.32    <0.001
    self-employed       0.85    0.70, 1.03    0.094
    services            0.78    0.67, 0.90    <0.001
    student             2.45    2.02, 2.96    <0.001
    technician          0.88    0.78, 0.99    0.037
    unemployed          1.29    1.07, 1.55    0.007
balance                 1.00    1.00, 1.00    <0.001
education
    primary             (ref)
    secondary           1.17    1.05, 1.31    0.004
    tertiary            1.63    1.44, 1.85    <0.001
poutcome
    failure             (ref)
    other               1.36    1.17, 1.59    <0.001
    success             11.2    9.72, 12.9    <0.001
    unknown             0.72    0.65, 0.79    <0.001
AIC                     28,018

Abbreviations: CI = Confidence Interval, OR = Odds Ratio; (ref) = reference category

01 - The Improved Logistic Regression Model

The first improvement I am going to make in my new logistic regression model is adjusting the reference categories for my categorical variables.

Additionally, I noticed that in my initial logistic regression model the predictions of subscription to a term deposit carried a high cost. In the previous data dive I said that I was over-predicting term deposit subscription – but this is incorrect. Upon review I realize that I was under-predicting, which may in part be caused by the disproportionate skew of the data set towards bank clients who ultimately did not subscribe. For reference, see the 4,117 successful term deposit outcomes that were predicted as failures by the previous model in the table below (where a prediction of term deposit subscription is determined by a probability greater than or equal to 0.5).

Note: This is very similar to the example pointed out in Modern Statistics with R section 8.3.2 where they are trying to predict whether a wine is red and found that the model was almost always predicting white. They mentioned that this is likely because of the disproportionate skew in the data set towards white wine and that adding more explanatory variables may assist.

# Create Comparison Table on Prior Model vs Actual

# Get fitted values from the model and compute p-hat
fitted_values <- fitted(prev_model)
predicted_class <- ifelse(fitted_values >= 0.5, 1, 0)

# Constructing Comparison table
comparison_table <- tibble(
  actual_value = cleaned_data$outcome,
  predicted_value = predicted_class
) |>
  count(actual_value, predicted_value) |>
  pivot_wider(
    names_from = predicted_value,
    values_from = n,
    values_fill = 0,
    names_prefix = "predicted_"
  )

comparison_table
## # A tibble: 2 × 3
##   actual_value predicted_0 predicted_1
##          <dbl>       <int>       <int>
## 1            0       37685         487
## 2            1        4117         904
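To put numbers on both the class imbalance and the miss rate, here is a quick sketch using the objects defined above (it assumes, as the comparison table itself does, that the fitted values align row-for-row with cleaned_data):

# Share of clients who actually subscribed, i.e., the class imbalance
mean(cleaned_data$outcome)

# Share of actual subscribers the model misses at the 0.5 cutoff
# (the false negative rate): 4117 / (4117 + 904), roughly 0.82
mean(predicted_class[cleaned_data$outcome == 1] == 0)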

So one solution is to increase the number of explanatory variables in our logistic regression model and see how that impacts our predictions, our Akaike Information Criterion (AIC), our coefficients, and some of the more advanced model diagnostics we are going to employ (such as the VIF). Let's go ahead and incorporate the following additional predictor variables into our model and see how that impacts the results:

Continuous Data:

  • age
  • campaign

Categorical Data:

  • default (converted to default_bin)
  • housing (converted to housing_bin)
  • loan (converted to loan_bin)
  • contact (converted to cellular_bin)
  • marital

We will first need to make some alterations to how the data are cleaned: converting the yes/no categorical variables default, housing, and loan (plus contact) into single binary predictor variables for ease of understanding (and so we don't have to deal with reference-category shenanigans), and then releveling the reference category for the remaining categorical columns to improve interpretability.

# Improved Cleaning of Data before Input into Model
improved_cleaned_data <- bank_marketing |>
  mutate(
    outcome = if_else(y == "yes", 1, 0), # Convert response variable to binary
    default_bin = if_else(default == "yes", 1, 0),
    housing_bin = if_else(housing == "yes", 1, 0),
    loan_bin = if_else(loan == "yes", 1, 0),
    cellular_bin = if_else(contact == "cellular", 1, 0),
    
    # Relevel appropriate columns for Logistic Regression call
    job = relevel(factor(job), ref = "blue-collar"),
    education = relevel(factor(education), ref = "primary"),
    marital = relevel(factor(marital), ref = "single"),
    poutcome = relevel(factor(poutcome), ref = "unknown")
  ) |>
  filter(
    job != "unknown",
    education != "unknown",
    marital != "unknown",
    contact != "unknown"
  )

# Call the Regression
improved_model <- glm(outcome ~ job + balance + education + poutcome + default_bin + housing_bin + loan_bin + cellular_bin + age + campaign + marital,
                      data = improved_cleaned_data, 
                      family = binomial(link = 'logit'))

tbl_regression(improved_model, exponentiate = TRUE) |>
  add_glance_table(
    include = AIC,
    label = list(AIC ~ "AIC")
  )
Characteristic          OR      95% CI        p-value
job
    blue-collar         (ref)
    admin.              1.31    1.14, 1.51    <0.001
    entrepreneur        0.92    0.72, 1.16    0.5
    housemaid           0.91    0.70, 1.16    0.4
    management          1.12    0.97, 1.29    0.13
    retired             2.05    1.72, 2.45    <0.001
    self-employed       1.03    0.83, 1.27    0.8
    services            1.12    0.95, 1.31    0.2
    student             2.19    1.77, 2.71    <0.001
    technician          1.07    0.94, 1.23    0.3
    unemployed          1.32    1.08, 1.62    0.006
balance                 1.00    1.00, 1.00    <0.001
education
    primary             (ref)
    secondary           1.11    0.99, 1.26    0.081
    tertiary            1.36    1.19, 1.57    <0.001
poutcome
    unknown             (ref)
    failure             1.14    1.03, 1.26    0.010
    other               1.55    1.35, 1.78    <0.001
    success             10.5    9.30, 11.8    <0.001
default_bin             0.63    0.43, 0.88    0.011
housing_bin             0.53    0.49, 0.57    <0.001
loan_bin                0.60    0.53, 0.67    <0.001
cellular_bin            1.23    1.08, 1.40    0.002
age                     1.00    1.00, 1.01    0.033
campaign                0.88    0.87, 0.90    <0.001
marital
    single              (ref)
    divorced            0.81    0.72, 0.92    0.001
    married             0.71    0.65, 0.78    <0.001
AIC                     22,561

Abbreviations: CI = Confidence Interval, OR = Odds Ratio; (ref) = reference category

02 - Interpreting the Expanded Model's Results

Within this 'expanded' model encompassing additional variables in the data set we observe:

  • The AIC drops substantially (from 28,018 to 22,561) – but this comparison is not apples-to-apples, because the expanded model filters out observations with an unknown contact type, so the two models are no longer fit on the same set of observations.

  • The reference categories for the categorical variables were releveled (e.g., job is now measured against blue-collar rather than admin.), so the coefficients are no longer reported against the same baselines as the prior model.

To account for both of these errors, I will re-run the prior model by applying the same modifications.

# Previous Model Adjusted to Remove Those with an Unknown Contact Type

# Remove Unknown Categories
cleaned_data <- bank_marketing |>
  mutate(
    outcome = if_else(y == "yes", 1, 0), # Convert response variable to binary
    
     # Relevel appropriate columns for Logistic Regression call
    job = relevel(factor(job), ref = "blue-collar"),
    education = relevel(factor(education), ref = "primary"),
    marital = relevel(factor(marital), ref = "single"),
    poutcome = relevel(factor(poutcome), ref = "unknown")
  ) |>
  filter(
    job != "unknown",
    education != "unknown",
    marital != "unknown", # technically 'unknown' isn't a category for the marital variable, but it doesn't hurt to play it safe
    contact != "unknown"
  )

# Generate previous model
prev_model <- glm(outcome ~ job + balance + education + poutcome, data = cleaned_data, family = binomial(link = 'logit'))

# Print Regression Table
tbl_regression(prev_model, exponentiate = TRUE) |>
   add_glance_table(
    include = AIC,
    label = list(AIC ~ "AIC")
  )
Characteristic          OR      95% CI        p-value
job
    blue-collar         (ref)
    admin.              1.49    1.30, 1.71    <0.001
    entrepreneur        0.91    0.72, 1.16    0.5
    housemaid           1.12    0.87, 1.41    0.4
    management          1.26    1.10, 1.45    0.001
    retired             2.98    2.56, 3.46    <0.001
    self-employed       1.18    0.96, 1.45    0.11
    services            1.19    1.02, 1.39    0.031
    student             3.64    2.97, 4.45    <0.001
    technician          1.23    1.08, 1.41    0.002
    unemployed          1.82    1.49, 2.21    <0.001
balance                 1.00    1.00, 1.00    <0.001
education
    primary             (ref)
    secondary           1.13    1.00, 1.27    0.049
    tertiary            1.49    1.30, 1.71    <0.001
poutcome
    unknown             (ref)
    failure             1.08    0.98, 1.19    0.10
    other               1.46    1.28, 1.67    <0.001
    success             12.3    10.9, 13.8    <0.001
AIC                     23,242

Abbreviations: CI = Confidence Interval, OR = Odds Ratio; (ref) = reference category

Now that the previous model is updated so that it is computing across the same observations and reference categories, we can see that the extended model still achieves the lower AIC (22,561 versus 23,242), so the additional predictors improve the fit even after paying their complexity penalty.
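As a quick sketch (assuming both re-fit model objects are still in scope), the two AICs can also be pulled side by side directly with base R:

# Direct AIC comparison, now valid since both models use the same observations
AIC(prev_model, improved_model)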

Jitter Comparison

Another interesting way to compare our models is to look at jitter plots of the predicted probabilities relative to the actual values.

# Get the Previous Model's Residual Values
plot_previous_df <- augment(
  prev_model,
  type.predict = "response",
  type.residuals = "deviance"
) |>
  mutate(
    observed = model.response(model.frame(prev_model)),
    pred_prob = .fitted,
    response_residual = observed - pred_prob
  )

# Construct the Chart
plot_prev <- ggplot(plot_previous_df, aes(x = pred_prob, y = observed)) +
  geom_jitter(width = 0.01, height = 0.05, alpha = 0.4) +
  geom_vline(xintercept = 0.5, color = "red", linetype = "dotted") +
  scale_y_continuous(
    breaks = c(0, 1),
    limits = c(-0.1, 1.1)
  ) +
  scale_x_continuous(
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.25)
  ) +
  labs(
    x = "Predicted Probability",
    y = "Actual Observation",
    title = "Old Model"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )


# Get the Extended Model's Residual Values
plot_extended_df <- augment(
  improved_model,
  type.predict = "response",
  type.residuals = "deviance"
) |>
  mutate(
    observed = model.response(model.frame(improved_model)),
    pred_prob = .fitted,
    response_residual = observed - pred_prob
  )

# Construct the Chart
plot_extended <- ggplot(plot_extended_df, aes(x = pred_prob, y = observed)) +
  geom_jitter(width = 0.01, height = 0.05, alpha = 0.4) +
  geom_vline(xintercept = 0.5, color = "red", linetype = "dotted") +
  scale_y_continuous(
    breaks = c(0, 1),
    limits = c(-0.1, 1.1)
  ) +
  scale_x_continuous(
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.25)
  ) +
  labs(
    x = "Predicted Probability",
    y = "Actual Observation",
    title = "New Model"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )


plot_extended | plot_prev
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).

Here we can see that with the new variables introduced, the spread of the predicted probabilities is larger, whereas in the previous model there is a gap where the model isn't predicting probabilities just shy of 0.5. In the newer model, the addition of other variables may have smoothed this gap by providing more parameters that can produce these 'in-between' probabilities. Also notice the red dotted line marking the prediction threshold, where a probability >= 0.5 means we would have predicted a successful term deposit subscription.

  • In both cases, the dots appear visually denser to the right of the 0.5 threshold in the row where the actual observation resulted in a successful term deposit subscription.
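As a rough numeric check on that visual impression, this sketch (assuming both model objects are in scope) tallies the share of fitted probabilities that land in the band just below the 0.5 threshold for each model:

# Share of predictions landing just below the 0.5 threshold, per model
tibble(
  model = c("previous", "improved"),
  share_just_below = c(
    mean(fitted(prev_model) >= 0.4 & fitted(prev_model) < 0.5),
    mean(fitted(improved_model) >= 0.4 & fitted(improved_model) < 0.5)
  )
)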

Calibration Curves

Another method for evaluating model fit (and this comes in part from the OpenIntro Statistics book, pg. 377) is to look at a calibration curve. R has a library, CalibrationCurves, that can be used to construct these plots; additional details can be found in the package documentation. Since our AIC is lower on our extended model and its chart doesn't have a discontinuity (if that matters), we're going to graph the calibration curve only for this model and then interpret what it means.

# Plot Calibration Curve
cal_improved <- valProbggplot(
  p = predict(improved_model, type = "response"),
  y = model.response(model.frame(improved_model)),
  smooth = "rcs"
)

plot_improved <- cal_improved$ggPlot +
  labs(title = "Improved model Calibration Curve") +
  theme_classic(base_size = 16) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 18),
    axis.title = element_text(size = 16),
    axis.text = element_text(size = 13)
  )

plot_improved

The calibration curve essentially takes our predicted probabilities within groups and checks whether each predicted probability was actually observed in the underlying data. As Diez et al. put it, "if we look at resumes that we modeled as having a 10% chance of getting a callback, do we find about 10% of them actually receive a callback?" (pg. 377). In this case, if our 'group' probability is 0.25, do 25% of these individuals (1 in 4) actually subscribe to a term deposit as we would expect? Ideally, the calibration line for the model would follow the ideal slope of \(y = x\), where each observed probability corresponds to the predicted probability.
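To make the grouping idea concrete, here is a minimal hand-rolled version of the same check using dplyr (a sketch, not the CalibrationCurves internals): bin the predicted probabilities into deciles and compare the average prediction in each bin against the observed subscription rate.

# Manual calibration check: mean predicted probability vs. observed
# subscription rate within each decile of predicted probability
tibble(
  pred = fitted(improved_model),
  obs  = model.response(model.frame(improved_model))
) |>
  mutate(bin = ntile(pred, 10)) |>
  group_by(bin) |>
  summarise(
    mean_predicted = mean(pred),
    observed_rate  = mean(obs),
    n = n()
  )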

In our case, we can see that the model appears reasonably well calibrated. Both the intercept and slope of our calibration lines are near the ideal values of 0 and 1 (even with a 95% confidence interval), and the flexible calibration curve appears to track the line closely across most predicted probabilities. There is some evidence of over-prediction towards the end of the line: for groups where the observed subscription rate is 65 percent and higher, we are effectively predicting that about 80% of these clients will subscribe to a term deposit.

I do want to note that reading Chapter 9 of OpenIntro Statistics really helped me get an understanding of the principles of logistic regression and how to interpret its results – so that was a worthwhile investment, although there is still much I would like to learn 'under the hood', so to speak.

Variance Inflation Factor

Finally, before I conclude that I am happy with this model, I am going to check the Variance Inflation Factor (VIF) to identify any multicollinearity issues among my predictor variables that would inflate the variance of the coefficient estimates.

  • VIF works by performing an OLS regression of each explanatory variable \(X_i\) on all of the other explanatory variables.

  • We then calculate the VIF with the formula \(\frac{1}{1 - R_i^2}\).

  • The higher the \(R_i^2\) from that auxiliary regression, the greater the VIF. For example, if the \(R^2\) of a predictor variable regressed against the other variables were \(0.9\), the VIF would be \(\frac{1}{1 - 0.9} = 10\) (see the sketch after this list).

  • The rule of thumb when interpreting the VIF is to keep values that are below 10, although a cutoff of 5 is also commonly used, as noted per Wikipedia.
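
As a sketch of that formula in action, we can compute the VIF for age by hand from improved_cleaned_data. Note this raw OLS-based value may not match the packaged output below exactly, since the loaded vif() works from the fitted GLM, but it illustrates the mechanics:

# Hand-computed VIF for 'age': regress it on the other explanatory
# variables with OLS, then apply 1 / (1 - R^2)
aux_fit <- lm(age ~ job + balance + education + poutcome + default_bin +
                housing_bin + loan_bin + cellular_bin + campaign + marital,
              data = improved_cleaned_data)
1 / (1 - summary(aux_fit)$r.squared)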

We can see below that none of our VIF values are anywhere remotely near 10, or even 5. One additional wrinkle: because the rms package is loaded here, vif() reports one value per dummy coefficient; for factors with more than one degree of freedom, the car package instead calculates generalized variance inflation factors (GVIFs) – see the sketch after the output below.

# Variance Inflation Factor (vif() here comes from the rms package, loaded with CalibrationCurves)
vif(improved_model)
##          jobadmin.    jobentrepreneur       jobhousemaid      jobmanagement 
##           1.853917           1.229995           1.175777           3.406901 
##         jobretired   jobself-employed        jobservices         jobstudent 
##           2.121337           1.357861           1.515107           1.422355 
##      jobtechnician      jobunemployed            balance educationsecondary 
##           2.201165           1.294766           1.031330           3.219842 
##  educationtertiary    poutcomefailure      poutcomeother    poutcomesuccess 
##           4.086999           1.085470           1.040962           1.041998 
##        default_bin        housing_bin           loan_bin       cellular_bin 
##           1.009676           1.145267           1.026390           1.063966 
##                age           campaign    maritaldivorced     maritalmarried 
##           2.158696           1.028248           1.420999           1.602795
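
For reference, the GVIF calculation lives in the car package rather than the function used above; a sketch of that call (assuming car is installed, since it is not loaded here):

# car::vif() reports one GVIF per term; GVIF^(1/(2*Df)) is the column
# comparable to an ordinary VIF for factors with more than one level
car::vif(improved_model)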

Overall we can conclude:

  • The extended model has a lower AIC than the previous model on the same observations, even after accounting for its additional parameters.

  • The extended model has a smoother distribution of predicted probabilities ~ e.g., it lacks the disjointedness or bimodality of the prior model. We may want to actually generate the cumulative distribution function here (see the sketch after this list) – but this may not really matter and was more so an interesting observation.

  • The extended model probabilities seem to generally align well with observed probabilities based on the calibration curve – although it starts to deviate and over-predict among some groups.

  • We do not observe evidence that multicollinearity among predictor variables is a problem in the model.
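
A minimal sketch of that cumulative-distribution idea, plotted over the two models' predicted probabilities rather than their residuals:

# Empirical CDFs of the predicted probabilities for both models
ggplot() +
  stat_ecdf(aes(x = fitted(prev_model), colour = "previous")) +
  stat_ecdf(aes(x = fitted(improved_model), colour = "improved")) +
  labs(
    x = "Predicted probability",
    y = "Cumulative proportion of observations",
    colour = "Model"
  )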

03 - Interpreting the Model’s Coefficients and Business Relevance

In prior versions, when interpreting the model's coefficients, I relied on expressing the relative odds as percentage increases. This time around I'm going to focus on interpreting three key coefficients within my extended ('improved') model:

  • job (student): Relative to the blue-collar reference category, students have approximately 2.19 times the odds of subscribing to a term deposit, holding all other variables constant.

  • housing_bin: Clients with a housing loan have approximately 0.53 times the odds of subscribing (roughly 47% lower odds) than clients without one, holding all else constant.

  • poutcome (success): Clients who subscribed during a previous marketing campaign have approximately 10.5 times the odds of subscribing again, relative to clients whose prior outcome is unknown, holding all else constant.

These findings are interesting! I certainly did expect that subscription to a term deposit in a previous marketing campaign would result in a higher likelihood that the client would subscribe again, and that those who have a housing loan may be less likely to put money away in a term deposit due to other obligations. What I wasn't expecting was to find that students have the highest odds of all occupational groups (relative to blue-collar workers) of subscribing to a term deposit. That's an interesting finding, and I wonder if there are omitted variables that may influence this decision.

Note: When conversing with ChatGPT and providing feedback on this section, it mentioned that the standard interpretation is to apply odds ratios ~ e.g., students have approximately 2.18 times the odds of subscribing, which equates to about 118% higher odds than blue-collar clients, holding all else equal. It says both are valid, and I prefer this way of expressing things in terms of a baseline; I think it makes the results more tangible.
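As a small sketch of that conversion (assuming improved_model is still in scope), pulling the student coefficient off the log-odds scale:

# Convert the 'student' log-odds coefficient into an odds ratio and
# into a percentage change in odds vs. the blue-collar baseline
b_student <- coef(improved_model)["jobstudent"]
exp(b_student)               # odds ratio, approximately 2.18
(exp(b_student) - 1) * 100   # approximately 118% higher odds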

# Model Results for Reference
tbl_regression(improved_model, intercept = TRUE) # Note: not calling exponentiate so I can do the conversion manually
Characteristic          log(OR)   95% CI         p-value
(Intercept)             -1.9      -2.1, -1.6     <0.001
job
    blue-collar         (ref)
    admin.              0.27      0.13, 0.42     <0.001
    entrepreneur        -0.09     -0.33, 0.15    0.5
    housemaid           -0.10     -0.35, 0.15    0.4
    management          0.11      -0.03, 0.25    0.13
    retired             0.72      0.54, 0.90     <0.001
    self-employed       0.03      -0.18, 0.24    0.8
    services            0.11      -0.05, 0.27    0.2
    student             0.78      0.57, 1.0      <0.001
    technician          0.07      -0.06, 0.21    0.3
    unemployed          0.28      0.08, 0.48     0.006
balance                 0.00      0.00, 0.00     <0.001
education
    primary             (ref)
    secondary           0.11      -0.01, 0.23    0.081
    tertiary            0.31      0.17, 0.45     <0.001
poutcome
    unknown             (ref)
    failure             0.13      0.03, 0.23     0.010
    other               0.44      0.30, 0.57     <0.001
    success             2.4       2.2, 2.5       <0.001
default_bin             -0.46     -0.84, -0.12   0.011
housing_bin             -0.64     -0.72, -0.57   <0.001
loan_bin                -0.52     -0.63, -0.41   <0.001
cellular_bin            0.21      0.08, 0.34     0.002
age                     0.00      0.00, 0.01     0.033
campaign                -0.12     -0.14, -0.10   <0.001
marital
    single              (ref)
    divorced            -0.21     -0.33, -0.08   0.001
    married             -0.34     -0.43, -0.25   <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio; (ref) = reference category

04 - Ideas for Improvement and Questions