00 - Introduction

The purpose of this data dive is to learn logistic regression by modeling an interesting binary column in our bank marketing dataset, which is provided by the UC Irvine Machine Learning Repository. Documentation on this dataset is available here. We will have met the learning objectives when:

A logistic regression model is built using 1 - 4 explanatory variables.
The coefficients and model output are explained to the reader.
A confidence interval is built from the standard error of at least one coefficient, and its meaning is translated for the reader.

Part of what makes me excited about this data dive is that the key objective stated in the UC Irvine Machine Learning Repository is to use the data to ‘predict’ whether a bank client will subscribe to a term deposit, following a telephone marketing campaign. We code this as a binary variable: ‘1’ for subscription and ‘0’ for non-subscription. So this fits perfectly with how the data are supposed to be used.

# Declare libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)

## Warning: package 'lindia' was built under R version 4.5.3

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")

## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

01 - Choosing Variables for the Model

To fit within the parameters of the assignment, we will keep the variable selection simple and base it on what we as laypeople might expect to affect campaign outreach success.

Job - Categorical variable indicating the type of industry the bank client works in.
Balance - Average annual balance of a client’s account. It can be negative (presumably when the account is overdrawn).
Education - Highest level of education that a bank client has received – either primary, tertiary, secondary, or unknown.
Poutcome - Outcome from the previous bank marketing campaign the bank client was exposed to – e.g., did they subscribe to a term deposit the last time they were solicited?

All of these variables express some business logic behind what factors may influence a bank client to subscribe to a term deposit.

A bank client who works in finance may understand the benefits it provides and be more susceptible to the campaign.
Bank clients who have more money (denoted by a higher average annual balance) may want to park some of that money in a term deposit.
Bank clients who have obtained higher education levels may already have their money invested elsewhere and not see a need to tuck some away into a term deposit (which runs somewhat against the balance hypothesis above).
A bank client who had previously subscribed may have enjoyed the experience and may be willing to do it again.

02 - Reviewing Assumptions for Model

Since the outcome of logistic regression is binary, some of its assumptions differ from those of linear regression. Each linear regression assumption is listed below, followed by how it changes.

Assumption #1: Variable x is linearly correlated with response y.
- Instead the predictor variables should be linearly related to the log-odds of the outcome.
Assumption #2: Errors have constant variance across all predictions.
- This assumption doesn’t necessarily hold since we are trying to determine a binary outcome – which means probabilities that are predicted near 0.5 will have more variance than probabilities predicted near 0 or 1.
Assumption #3: Observations are independent and uncorrelated
- This assumption holds in logistic regression as well.
Assumption #4: Independent variables cannot be linearly correlated (multicollinearity).
- This assumption also holds for logistic regression.
Assumption #5: Errors are normally distributed over the prediction line.
- This assumption does not apply since the outcome is modeled as a binary outcome – where residuals cannot be normally distributed.
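For reference, the first two points can be written out. This is the standard textbook form of the logistic model, with $p$ the probability of subscribing and $x_1, \dots, x_k$ the predictors (the symbols are generic notation, not taken from the course notebook):

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}},
\qquad
\operatorname{Var}(y \mid x) = p\,(1 - p)
```

The variance $p(1-p)$ peaks at 0.25 when $p = 0.5$ and shrinks toward 0 as $p$ approaches 0 or 1, which is why constant variance cannot hold.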

Additionally, if we wanted to test whether some variables should be considered for a logit regression (which we won’t do here), we could apply a few statistical techniques such as t-tests, the normal test of equal proportions, ANOVA, etc. Final variable selection, however, should be based on our understanding of the business logic and the problem that we are attempting to model.

03 - Data Cleaning and Preparation

If we remember from the data dives performed earlier in the semester, we know that some of these categorical variables contain an ‘unknown’ option. Within our business context, it doesn’t make sense to act on ‘unknown’ information. After all, if our logit regression finds that bank clients with an ‘unknown’ education are more likely to subscribe to a term deposit, that doesn’t help our business end users in any way. We also know there are some potentially extreme values in some of our columns, like previous, which may be errors in the dataset. We’re not going to concern ourselves with cleaning these values since this variable is not used in the regression model. Removing the records where job or education was unknown drops 2,018 of the 45,211 records (roughly 4.5 percent). Note that poutcome keeps its ‘unknown’ category, which appears in the model as poutcomeunknown.

# Remove Unknown Categories
cleaned_data <- bank_marketing |>
  mutate(
    outcome = if_else(y == "yes", 1, 0) # Convert response variable to binary (1 = subscribed)
  ) |>
  filter(
    job != "unknown",
    education != "unknown",
  )

# Calculate rows removed after the data are cleaned
rows_removed <- count(bank_marketing) - count(cleaned_data)
rows_removed

##      n
## 1 2018

04 - The Logit Regression Model

Now that the data are cleaned, I will construct the logistic regression model (a generalized linear model) by calling the glm function with the family set to binomial and a logit link.

Note: The categorical variables job, education, and poutcome are converted into binary dummy variables, one for every category except a reference category, which is why we see two of the three possible education values below. This is done to avoid perfect multicollinearity between the predictor variables. This results in 16 predictor terms from 4 explanatory variables, rather than the 1 - 4 variables the lab calls for, but I thought that this would help me understand how a categorical variable can turn into multiple variables while fitting the ‘idea’ of the data dive.
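To illustrate the dummy coding described in the note, here is a minimal sketch on a made-up four-row data frame (the `toy` object is hypothetical and not part of the bank data). It assumes R's default treatment contrasts, which is what glm uses unless told otherwise.

```r
# Toy data: the three education levels used in the model (rows are made up)
toy <- data.frame(
  education = factor(c("primary", "secondary", "tertiary", "secondary"))
)

# model.matrix() builds the design matrix that glm() uses internally.
# With default treatment contrasts, the first factor level ("primary")
# becomes the reference and gets no column of its own.
design <- model.matrix(~ education, data = toy)
colnames(design)
design
```

Three levels produce an intercept plus two dummy columns, and a row of all zeros in those two columns identifies a client in the reference level.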

# Generate the logistic regression model
model <- glm(outcome ~ job + balance + education + poutcome, data = cleaned_data, family = binomial(link = 'logit'))

summary(model)

## 
## Call:
## glm(formula = outcome ~ job + balance + education + poutcome, 
##     family = binomial(link = "logit"), data = cleaned_data)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.158e+00  8.100e-02 -26.636  < 2e-16 ***
## jobblue-collar     -3.717e-01  6.382e-02  -5.825 5.72e-09 ***
## jobentrepreneur    -4.440e-01  1.112e-01  -3.993 6.53e-05 ***
## jobhousemaid       -2.253e-01  1.180e-01  -1.910 0.056160 .  
## jobmanagement      -1.235e-01  6.470e-02  -1.909 0.056248 .  
## jobretired          6.969e-01  7.428e-02   9.382  < 2e-16 ***
## jobself-employed   -1.631e-01  9.732e-02  -1.676 0.093684 .  
## jobservices        -2.505e-01  7.361e-02  -3.403 0.000667 ***
## jobstudent          8.950e-01  9.759e-02   9.171  < 2e-16 ***
## jobtechnician      -1.260e-01  6.040e-02  -2.086 0.036959 *  
## jobunemployed       2.561e-01  9.484e-02   2.700 0.006941 ** 
## balance             2.687e-05  4.236e-06   6.343 2.26e-10 ***
## educationsecondary  1.580e-01  5.490e-02   2.878 0.004001 ** 
## educationtertiary   4.903e-01  6.379e-02   7.686 1.51e-14 ***
## poutcomeother       3.091e-01  7.863e-02   3.931 8.45e-05 ***
## poutcomesuccess     2.415e+00  7.176e-02  33.646  < 2e-16 ***
## poutcomeunknown    -3.335e-01  4.828e-02  -6.907 4.94e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31045  on 43192  degrees of freedom
## Residual deviance: 27984  on 43176  degrees of freedom
## AIC: 28018
## 
## Number of Fisher Scoring iterations: 5
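
The deviance lines at the bottom of the summary can also be read directly. The sketch below copies the printed values by hand (it does not refit the model) and computes a likelihood-ratio test against the intercept-only model plus McFadden's pseudo R-squared. The deviance-based shortcut for the pseudo R-squared assumes a 0/1 response with one row per client, which is the case here.

```r
# Values copied from the summary(model) output above
null_deviance     <- 31045
residual_deviance <- 27984
df_null           <- 43192
df_residual       <- 43176

# Likelihood-ratio test: does the full model beat the intercept-only model?
lr_stat <- null_deviance - residual_deviance   # drop in deviance
lr_df   <- df_null - df_residual               # number of added coefficients
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared: share of the null deviance explained
mcfadden_r2 <- 1 - residual_deviance / null_deviance

c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p, mcfadden_r2 = mcfadden_r2)
```

A deviance drop of 3,061 on 16 degrees of freedom is far more than chance would produce, while a pseudo R-squared near 0.10 suggests that much of the variation in the outcome is still unexplained by these four variables.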

Here we can see that our model has constructed a binary variable for every category except one reference category per categorical variable, alongside our intercept and the numeric variable balance. The table below explains each coefficient in terms of the log-odds, the odds, and its standard error. I start with a long-hand example and get progressively shorter as the variables go on.

Predictor Variable	Interpretation
intercept	The intercept estimate is -2.158, with a standard error of 0.081 and a z-value of -26.6, yielding a statistically significant p-value of p < 0.01. In a logistic regression model, the intercept represents the expected log-odds of subscribing to a term deposit for the reference group, when all numeric predictors are equal to zero and all categorical predictors are at their baseline/reference categories (here an administrative worker with a primary education, a failed previous campaign, and a balance of zero). In this case, the baseline log-odds of subscribing are -2.158, which corresponds to odds of $e^{-2.158} \approx 0.116$, or a predicted probability of roughly 10% for the reference client profile.
jobblue-collar	The coefficient for the blue-collar job category is -0.372, with a standard error of 0.064. This means that, holding all other variables constant, clients with blue-collar jobs have log-odds of subscribing to a term deposit that are 0.372 lower than clients in the reference job category. Exponentiating the coefficient gives an odds ratio of $e^{-0.372} \approx 0.69$ which, when converted to a percentage $(0.69 - 1) \cdot 100$, gives us $-31$%. This means that blue-collar bank clients have 31% lower odds of subscribing to a term deposit compared to bank clients in the reference job category (the one not displayed as a variable ~ admin workers). If we build a 95% confidence interval around the coefficient using the standard error multiplied by the critical z score ~ $-0.372 \pm 1.96(0.064)$, we get an interval of $[-0.497, -0.247]$ for the coefficient. Converted into odds-ratio terms, bank clients working blue-collar jobs have between roughly 22% and 39% lower odds of subscribing to a term deposit than administrative workers, holding other variables constant.
jobentrepreneur	The coefficient for entrepreneur is -0.444, with a standard error of 0.111 and a statistically significant p-value of p < 0.01. Holding other variables constant, entrepreneurs have log-odds of subscribing that are 0.444 lower than clients in the reference category. Exponentiating as above, this equates to about 36% lower odds of subscribing to a term deposit than the reference category, all else equal. Within a 95% confidence interval, this effect is between 20% and 48% lower odds.
jobhousemaid	The coefficient for housemaid is -0.225, with a standard error of 0.118 and a p-value of 0.056. Holding all other variables constant, bank clients who are housemaids have log-odds of subscribing that are 0.225 lower than clients in administrative work. This corresponds to 20% lower odds of subscribing, and within a 95% confidence interval this may range from 37% lower odds to 1% higher odds relative to bank clients in the administrative field. However, not only is this p-value not statistically significant at the 0.01 or 0.05 levels, the coefficient’s standard error is large enough that the effect could be anywhere from negative to slightly positive, meaning we should be cautious about concluding that housemaid clients differ meaningfully from the reference job category of administrative bank clients.
jobmanagement	The coefficient for management is -0.124, with a standard error of 0.065 and a p-value of 0.056. Holding other variables constant, management clients have a log-odds of subscribing that are 0.124 lower than clients in the administrative job category. This equates to about 12% lower odds of subscribing. However, again the p-value is not statistically significant and based on a 95% confidence level our effect size (coefficient) may actually turn positive or be effectively close to zero. Thus, we should be cautious about accepting whether management bank clients differ meaningfully in term deposit subscription outcomes.
jobretired	The coefficient for retired is 0.697, with a standard error of 0.074 and a statistically significant p-value of p < 0.01. Holding all other variables constant, retired clients have log-odds of subscribing that are 0.697 higher than clients in the administrative category. This means that they have roughly 2.01 times the odds of subscribing in comparison with bank clients in the administrative field ~ equating to 101% higher odds of subscribing, holding all else constant. At a 95% confidence level for this coefficient, the interval on the percent increase in odds ranges from 74% to 132%, meaning that this is a highly influential variable for predicting subscription to a term deposit, especially given that most of the other job categories (all but students and the unemployed) show lower odds than the administrative job category.
jobself-employed	The coefficient for self-employed bank clients is -0.163, with a standard error of 0.097 and a p-value of 0.094. This means that self-employed bank clients have 15% lower odds of subscribing to a term deposit in contrast with administrative bank clients. However, this coefficient is not statistically significant, and based on the standard error the 95% confidence interval ranges from lower odds, through essentially no difference, to slightly higher odds. This is not a uniquely compelling variable.
jobservices	The coefficient for service job bank clients is -0.25, with a standard error of 0.074 and a statistically significant p-value of p < 0.01. This means that bank clients who work in the services industry have a 22% lower odds of subscribing to a term deposit than administrative workers. Within a 95% confidence interval, this negative association of lower odds holds based on the specified standard error.
jobstudent	The coefficient for bank clients who are students is 0.895, with a standard error of 0.098, and a statistically significant p-value of p < 0.01. This means that students have a 145% higher odds of subscribing to a term deposit than bank clients of job category admin. This positive association holds within a 95% confidence interval based on the specified standard error.
jobtechnician	The coefficient for technician clients is -0.126, with a standard error of 0.06 and a p-value of 0.037 (significant where p < 0.05, but not at p < 0.01). This means that technicians have 12% lower odds of subscribing to a term deposit than bank clients of job category admin. This negative association holds within a 95% confidence interval based on the specified standard error, but the coefficient is not significant at the 0.01 level, and the association no longer strictly holds when the interval is widened to a 99% confidence level.
jobunemployed	The coefficient for unemployed bank clients is 0.256, with a standard error of 0.095 and a statistically significant p-value of p < 0.01. This means that unemployed bank clients have a 29% higher odds of subscribing in contrast with admin bank clients. This positive association holds at the 95% confidence interval based on the specified standard error.
balance	The coefficient for balance is 0.0000269, with a standard error of 0.00000424 and a statistically significant p-value of p < 0.01. Holding all other variables constant, a one-unit increase in balance is associated with a very small increase in the log-odds of subscribing: about a 0.0027% increase in the odds. Scaled to 1,000 units of balance (the dataset documentation records balance in euros), this is about 2.7% higher odds of subscribing for each additional 1,000 that the bank client holds on average in their account.
educationsecondary	The coefficient for secondary education is 0.158, with a standard error of 0.055 and a statistically significant p value of p < 0.01. This means bank clients who have obtained up to a secondary level of education have a 17% higher odds of subscribing than bank clients who have achieved up to a primary level of education, holding all other variables equal. Within a 95% confidence interval, this positive association holds based on the standard error specified.
educationtertiary	The coefficient for tertiary education is 0.490, with a standard error of 0.064 and a statistically significant p-value of p < 0.01. This means that bank clients who obtained up to a tertiary level of education have a 63% higher odds of subscribing to a term deposit than those who have only completed up to a primary level of education. Within a 95% confidence interval, this positive association holds.
poutcomeother	This variable likely indicates bank clients who had been previously advertised to and some other outcome aside from a success or a failure occurred – e.g., maybe they opened up a credit card or a debit card. The coefficient for poutcomeother is 0.309, with a standard error of 0.079 and a statistically significant p-value of p < 0.01. This means that bank clients whose prior marketing campaign led to an ‘other’ result have 36% higher odds of subscribing to a term deposit than those who failed to subscribe to a term deposit in a previous marketing campaign.
poutcomesuccess	The coefficient for poutcome success is 2.415, with a standard error of 0.072 and a statistically significant p-value of p < 0.01. This means that bank clients who had previously responded favorably to a marketing campaign have a 1,018 percent higher odds of subscribing to a term deposit in comparison to those who failed to subscribe to a term deposit in the previous marketing campaign.
poutcomeunknown	This variable likely indicates bank clients who had not previously been advertised to from the bank ~ e.g., their outcome is unknown or not relevant. The coefficient of -0.333, with a standard error of 0.048 and a statistically significant p-value < 0.01 means that these clients have a 28% lower odds of subscribing to a term deposit than those who failed to subscribe in the last marketing campaign outreach that was sent. This negative association remains when the coefficient is computed within a 95% confidence interval.
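
The arithmetic used throughout the table can be sketched in a few lines. The example below hard-codes the jobblue-collar estimate and standard error as printed by summary(model) rather than pulling them from the fitted object, so it runs on its own; the variable names are illustrative only.

```r
# Estimate and standard error for jobblue-collar, copied from summary(model)
estimate  <- -0.3717
std_error <- 0.06382

# 95% Wald interval on the log-odds scale
z_crit   <- qnorm(0.975)
ci_logit <- estimate + c(-1, 1) * z_crit * std_error

# Exponentiate to move from log-odds to odds ratios,
# then express each as a percent change in the odds
odds_ratio     <- exp(estimate)
ci_odds_ratio  <- exp(ci_logit)
percent_change <- (c(odds_ratio, ci_odds_ratio) - 1) * 100
round(percent_change)

# The intercept converts to a baseline probability with the inverse logit
baseline_prob <- plogis(-2.158)
round(baseline_prob, 3)
```

This reproduces the roughly 31% lower odds for blue-collar clients, with an interval of about 22% to 39% lower. Swapping in another row's estimate and standard error gives that row's figures.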

One quick reflection point on the binary variables: R sets aside a reference category for each categorical variable (by default the first level in alphabetical order) to avoid perfect multicollinearity, so from a business perspective it may be best to specify which category should be the reference. In our business case, we may actually want to use the unknown category as the reference instead of admin workers for job or primary for education level. This deserves more thought, because some of the value in explaining which bank clients are more likely to subscribe to a term deposit depends on which reference category is used.
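
Changing the reference category is a one-line operation in base R. The sketch below uses a made-up job vector. The label "admin." is an assumption about how the category is spelled in the data, and using "unknown" as a reference would also require keeping those rows instead of filtering them out.

```r
# Toy job column (values are made up for illustration)
job <- factor(c("admin.", "retired", "student", "unknown", "admin."))

# By default R orders factor levels alphabetically, so "admin." is the reference
levels(job)[1]

# relevel() moves a chosen level to the front, making it the new reference
job_releveled <- relevel(job, ref = "unknown")
levels(job_releveled)[1]
```

Refitting the model with the releveled factor would leave the fitted probabilities unchanged. Only the comparison each coefficient expresses would differ.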

Another interesting finding of this model is that students and unemployed bank clients both have higher odds of subscribing to a term deposit (relative to admin workers). I would have assumed that students and unemployed clients have less money to put into a term deposit, so this runs contrary to my expectations and may require some further investigation.

Overall, the variables most strongly associated with a client subscribing to a term deposit are:

If the bank client is a student, unemployed, or retired. I would probably add admin workers as well, since every other job category has a negative coefficient relative to admin (although we don’t have direct evidence for this without changing the baseline);
If the bank client on average has more funds in their account;
If the bank client has completed a secondary, and especially a tertiary, level of education (relative to the primary education baseline);
If the bank client had previously subscribed to a term deposit.

Thus, if I were advising a bank on which clients to prioritize for outreach to get a more immediate and efficient increase in term deposits, I would recommend bank clients who fit within those categories.
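
To make that recommendation concrete, the coefficients can be combined into a predicted probability for a specific client profile. The sketch below hard-codes the relevant estimates from summary(model). The "priority" client and the balance of 1,000 are hypothetical choices for illustration.

```r
# Coefficients copied from summary(model); only the terms needed here
coefs <- c(
  intercept         = -2.158,
  jobretired        = 0.6969,
  balance           = 2.687e-05,
  educationtertiary = 0.4903,
  poutcomesuccess   = 2.415
)

# Hypothetical client: retired, tertiary education, previous campaign success,
# average balance of 1,000 (the balance value is an arbitrary illustration)
log_odds_priority <- coefs["intercept"] + coefs["jobretired"] +
  coefs["educationtertiary"] + coefs["poutcomesuccess"] +
  coefs["balance"] * 1000

# Reference client: admin job, primary education, previous failure, zero balance
log_odds_reference <- coefs["intercept"]

# plogis() is the inverse logit: it maps log-odds to a probability
prob_priority  <- unname(plogis(log_odds_priority))
prob_reference <- unname(plogis(log_odds_reference))
round(c(priority = prob_priority, reference = prob_reference), 3)
```

Under these assumptions the priority profile has a predicted probability of roughly 0.8, against roughly 0.1 for the reference profile. Most of that gap comes from the poutcomesuccess term.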

# Plotting Cost for the Model
y <- cleaned_data$outcome
p_hat <- model$fitted.values
cost <- -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

ggplot() +
  geom_boxplot(mapping = aes(x = as_factor(cleaned_data$outcome), 
                             y = cost), 
               orientation = 'h') +
  labs(x = "Term Deposit", y = 'Cost') +
  scale_y_log10()

I also constructed the cost graph above, borrowed from this week’s notebook. Here we can see that my model is much better at predicting non-subscription than subscription to a term deposit – e.g., there is relatively low cost for bank clients who did not subscribe (as denoted by the median line of each box) in contrast with a relatively high cost for clients who did subscribe. A high cost for actual subscribers means the model assigned them low predicted probabilities, so it is ‘under-predicting’ term deposit subscription.
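
The cost plotted above is the per-observation log loss. A small sketch with made-up outcomes and probabilities (the `_toy` names are hypothetical, chosen so they do not overwrite the objects used for the plot) shows how the formula behaves:

```r
# Toy outcomes and predicted probabilities (made up for illustration)
y_toy <- c(1, 1, 0, 0)
p_toy <- c(0.9, 0.1, 0.1, 0.9)

# Per-observation log loss, the same formula used for the plot
cost_toy <- -(y_toy * log(p_toy) + (1 - y_toy) * log(1 - p_toy))

# A confident, correct prediction gets a small cost;
# a confident, wrong prediction gets a large one
data.frame(y = y_toy, p_hat = p_toy, cost = round(cost_toy, 3))

# Average cost within each outcome class
tapply(cost_toy, y_toy, mean)
```

For a client who did subscribe (y = 1), the cost reduces to -log(p_hat), so it grows quickly as the predicted probability of subscribing falls toward zero.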

05 - Building a Confidence Interval

To reiterate how the confidence interval works on a coefficient within the model, let’s take a look at the educationtertiary variable and its outputs from the logistic regression model. We can see that we have (1) a coefficient, (2) a standard error, (3) a z-value, and (4) a p-value:

The coefficient specifies the effect size of the variable on the log-odds scale ~ e.g., does it increase or decrease the odds of the client subscribing to a term deposit;
The standard error estimates the standard deviation of this coefficient’s sampling distribution, i.e., how much the estimate would vary from sample to sample;
The z-value is the test statistic: the coefficient divided by its standard error, which measures how many standard errors the estimate lies from the null-hypothesis value of 0;
The p-value translates that z-value into the probability of observing a test statistic at least as extreme if the null hypothesis (a coefficient of 0) were true.
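
As a quick check on those definitions, the z-value and p-value for educationtertiary can be recomputed from its estimate and standard error alone. The numbers are copied from the summary(model) output, and the two-sided normal test is the one that output reports.

```r
# Estimate and standard error for educationtertiary, from summary(model)
estimate  <- 0.4903
std_error <- 0.06379

# Wald z statistic: how many standard errors the estimate sits from zero
z_value <- estimate / std_error

# Two-sided p-value: probability of a |z| at least this large if the true
# coefficient were zero
p_value <- 2 * pnorm(-abs(z_value))

c(z = z_value, p = p_value)
```

The result should agree with the z value and Pr(>|z|) columns printed for educationtertiary, up to rounding of the inputs.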

The standard error estimates the standard deviation of the sampling distribution of our coefficient (how much the estimate would vary if the model were refit on many random samples from the same population), and that sampling distribution is approximately normal. We can therefore use it to build a confidence interval at a specified level. The interval shows where the coefficient value plausibly lies and how strong the evidence for it is. After all, if the interval spans both positive and negative values, the model doesn’t offer strong evidence about the direction of the effect, or the effect may be negligible.

For educationtertiary the coefficient value is 0.49 with a standard error of 0.063. If we apply the normal distribution as $0.49 \pm 0.063z$, we get the hypothetical sampling distribution of our coefficient. We can then specify a significance level such as $\alpha = 0.05$ (a 95% confidence level) with two tails as shown below. There the 95% confidence interval is the gray area between the two red lines, covering coefficient values between roughly 0.37 and 0.61. Since none of the values in the interval are negative, and a negative value would sit far out in the left tail, we have evidence that an effect exists and that it is positive.

# Coefficients
beta_hat <- 0.49
se <- 0.063
alpha <- 0.05

# Get the critical z value for our specified alpha level
z_crit <- qnorm(1 - alpha / 2)

# Calculate the upper and lower bounds at 95% confidence level - two tailed
ci_lower <- beta_hat - z_crit * se
ci_upper <- beta_hat + z_crit * se

df <- data.frame(
  beta = seq(beta_hat - 4 * se, beta_hat + 4 * se, length.out = 1000)
)

df$density <- dnorm(df$beta, mean = beta_hat, sd = se)

# Plot the graph
ggplot(df, aes(x = beta, y = density)) +
  geom_line() +
  geom_area(
  data = subset(df, beta >= ci_lower & beta <= ci_upper),
  aes(x = beta, y = density),
  fill = "gray",
  alpha = 0.4
  ) +
  geom_vline(xintercept = beta_hat) +
  geom_vline(xintercept = c(ci_lower, ci_upper), linetype = "dashed", color = "red") +
  labs(
    title = "Sampling Distribution for educationtertiary",
    x = "Coefficient value",
    y = "Density"
  ) +
  theme_minimal()

06 - Other Questions / Observations

How does Simpson’s paradox affect generalized linear models that are constructed using transformations of the explanatory variables? On a dataset with many categorical columns, if one were to plot an explanatory variable against the response variable and identify a polynomial or logarithmic relationship, who’s to say that this relationship holds within every category? Couldn’t this negatively impact the model in some cases, because you are assuming that the nonlinear relationship holds everywhere?
Or what about the opposite: an apparently linear relationship that hides nonlinear relationships within subcategories?
How would I really test my logit regression for transformations of the predictor variables? Is that even possible (e.g., imagine the log of balance better models the relationship with the log-odds of our outcome variable; is that even foreseeable)?
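
One possible way to approach the last question is to fit the model with and without the transformation and compare an information criterion such as AIC. The sketch below does this on simulated data, not the bank data: `x` is a made-up positive predictor whose true relationship with the log-odds is logarithmic, and the seed and coefficients are arbitrary. Since balance can be negative, a plain log would not apply to it directly and some shifted or signed transform would have to be assumed.

```r
set.seed(510)  # arbitrary seed so the simulation is reproducible

# Simulated data where the true relationship is linear in log(x), not in x.
# x is a made-up positive, right-skewed predictor (not the real balance column).
n <- 2000
x <- rlnorm(n, meanlog = 7, sdlog = 1)
true_log_odds <- -6 + 0.8 * log(x)
y <- rbinom(n, size = 1, prob = plogis(true_log_odds))

# Fit the same logistic model with the raw and the log-transformed predictor
fit_raw <- glm(y ~ x,      family = binomial(link = "logit"))
fit_log <- glm(y ~ log(x), family = binomial(link = "logit"))

# Lower AIC indicates the better trade-off between fit and complexity
aic_table <- AIC(fit_raw, fit_log)
aic_table
```

Here the log-transformed fit should come out with the lower AIC because that is how the data were generated. On real data the comparison could go either way, and it would be worth repeating within categories to address the Simpson’s paradox concern.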

Data Dives Week #10

2026-05-02