00 - Introduction

The purpose of this data dive is to learn logistic regression using an interesting binary column in our bank marketing dataset, which is provided by the UC Irvine Machine Learning Repository. Documentation on this dataset is available here. We will have met the learning objectives when:

Part of what makes me excited about this data dive is that the key objective stated in the UC Irvine Machine Learning Repository is to use the data to ‘predict’ whether a bank client will subscribe to a term deposit, a binary variable coded ‘1’ for subscription and ‘0’ for non-subscription, following a phone-based marketing approach. So this fits perfectly with how the data are supposed to be used.

# Declare libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.5.3
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

01 - Choosing Variables for the Model

To fit within the parameters of the assignment, we will keep the variable selection simple and base it on what we, as laypeople, might expect to influence campaign outreach success.

All of these variables express some business logic behind what factors may influence a bank client to subscribe to a term deposit.

02 - Reviewing Assumptions for Model

Since the outcome of logistic regression is binary, there are some differences in the assumptions made about the model.

Additionally, if we wanted to test whether some variables should be considered for a logit regression (which we won’t do here), we could apply statistical techniques such as t-tests, a test of equal proportions, or ANOVA. Final variable selection, however, should be based on our understanding of the business logic and the problem that we are attempting to model.

03 - Data Cleaning and Preparation

If we remember from the data dives performed earlier in the semester, some of these categorical variables contain an ‘unknown’ option. Within our business context, it doesn’t make sense to act on ‘unknown’ information. After all, if our logit regression finds that bank clients with an ‘unknown’ education are more likely to subscribe to a term deposit, that doesn’t help our business end users in any way. We also know there are some potentially extreme values in columns like previous, which may be errors in the dataset. We’re not going to concern ourselves with cleaning these values since this variable is not used in the regression model. We can see that we remove 2,018 of the 45,211 records because job or education was unknown (roughly 4.5 percent of all records).

# Remove Unknown Categories
cleaned_data <- bank_marketing |>
  mutate(
    outcome = if_else(y == "yes", 1, 0) # Convert the outcome (response) variable to binary
  ) |>
  filter(
    job != "unknown",
    education != "unknown",
  )

# Calculate rows removed after the data are cleaned
rows_removed <- count(bank_marketing) - count(cleaned_data)
rows_removed
##      n
## 1 2018
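
To make the share of removed records concrete, the percentage can be computed directly from the two counts printed in the output above. This is a minimal sketch with the counts typed in by hand; the variable names are illustrative and are not part of the original analysis.

```r
# Row counts copied by hand from the output above (illustrative names)
n_total   <- 45211  # rows reported when the file was read in
n_removed <- 2018   # rows dropped by the 'unknown' filters

# Share of records removed, expressed as a percentage
pct_removed <- round(100 * n_removed / n_total, 2)
pct_removed
```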

04 - The Logit Regression Model

Now that the data are cleaned, I will construct the logistic regression model, a generalized linear model, by calling the glm function with the family set to binomial and a logit link.

# Generate the logistic regression model
model <- glm(outcome ~ job + balance + education + poutcome, data = cleaned_data, family = binomial(link = 'logit'))

summary(model)
## 
## Call:
## glm(formula = outcome ~ job + balance + education + poutcome, 
##     family = binomial(link = "logit"), data = cleaned_data)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.158e+00  8.100e-02 -26.636  < 2e-16 ***
## jobblue-collar     -3.717e-01  6.382e-02  -5.825 5.72e-09 ***
## jobentrepreneur    -4.440e-01  1.112e-01  -3.993 6.53e-05 ***
## jobhousemaid       -2.253e-01  1.180e-01  -1.910 0.056160 .  
## jobmanagement      -1.235e-01  6.470e-02  -1.909 0.056248 .  
## jobretired          6.969e-01  7.428e-02   9.382  < 2e-16 ***
## jobself-employed   -1.631e-01  9.732e-02  -1.676 0.093684 .  
## jobservices        -2.505e-01  7.361e-02  -3.403 0.000667 ***
## jobstudent          8.950e-01  9.759e-02   9.171  < 2e-16 ***
## jobtechnician      -1.260e-01  6.040e-02  -2.086 0.036959 *  
## jobunemployed       2.561e-01  9.484e-02   2.700 0.006941 ** 
## balance             2.687e-05  4.236e-06   6.343 2.26e-10 ***
## educationsecondary  1.580e-01  5.490e-02   2.878 0.004001 ** 
## educationtertiary   4.903e-01  6.379e-02   7.686 1.51e-14 ***
## poutcomeother       3.091e-01  7.863e-02   3.931 8.45e-05 ***
## poutcomesuccess     2.415e+00  7.176e-02  33.646  < 2e-16 ***
## poutcomeunknown    -3.335e-01  4.828e-02  -6.907 4.94e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31045  on 43192  degrees of freedom
## Residual deviance: 27984  on 43176  degrees of freedom
## AIC: 28018
## 
## Number of Fisher Scoring iterations: 5

Here we can see that the model has constructed a binary (dummy) variable for every category except one reference level per categorical predictor, alongside the intercept and the numeric variable balance. The table below explains each coefficient in terms of the log-odds, the odds ratio, and the reported standard error. I start with a long-hand example and get progressively shorter as the variables go on.
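
To illustrate how the dummy variables and the dropped reference level arise, here is a minimal sketch using a toy job column. The values are hypothetical and are not drawn from the bank data; `model.matrix()` is the base R function that builds the same kind of design matrix `glm()` uses internally.

```r
# Toy job column (hypothetical values, not the real data)
job <- factor(c("admin.", "blue-collar", "retired", "student", "admin."))

# Factor levels are sorted, so the first level becomes the reference category
reference_level <- levels(job)[1]
reference_level

# model.matrix() shows the columns a model formula would create:
# an intercept plus one 0/1 column for each non-reference level
X <- model.matrix(~ job)
colnames(X)
```

The reference level has no column of its own; its effect is absorbed into the intercept, which is why every job coefficient is read relative to it.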

Predictor Variable Interpretation
intercept

The intercept estimate is -2.158, with a standard error of 0.081 and a z-value of -26.6, yielding a statistically significant p-value (p < 0.01).

In a logistic regression model, the intercept represents the expected log-odds of subscribing to a term deposit for the reference group, when all numeric predictors are equal to zero and all categorical predictors are at their baseline/reference categories. In this case, the baseline log-odds of subscribing are -2.158, which corresponds to odds of about 0.12 (a probability of roughly 10%) for the reference client profile.

jobblue-collar

The coefficient for the blue-collar job category is -0.372, with a standard error of 0.064. This means that, holding all other variables constant, clients with blue-collar jobs have log-odds of subscribing to a term deposit that are 0.372 lower than clients in the reference job category.

Exponentiating the coefficient gives an odds ratio of \(e^{-0.372} \approx 0.69\), which, when converted to a percentage via \((0.69 - 1) \cdot 100\), gives us \(-31\)%. This means that blue-collar bank clients have 31% lower odds of subscribing to a term deposit compared to bank clients in the reference job category (the one not displayed as a variable, i.e., admin workers).

If we build a 95% confidence interval around the coefficient by multiplying the standard error by the critical z score, \(-0.372 \pm 1.96(0.064)\), we get an interval of \([-0.497, -0.247]\) for the coefficient. Converted into odds-ratio terms, bank clients working blue-collar jobs have between roughly 22% and 39% lower odds of subscribing to a term deposit than administrative workers, holding other variables constant.

jobentrepreneur

The coefficient for entrepreneur is -0.444, with a standard error of 0.111 and a statistically significant p-value (p < 0.01). Holding other variables constant, entrepreneurs have log-odds of subscribing that are 0.444 lower than clients in the reference category. Applying the same exponentiation as above, this equates to about 36% lower odds of subscribing to a term deposit than the reference category, all else equal. Within a 95% confidence interval, this effect is between 20% and 48% lower odds.

jobhousemaid

The coefficient for housemaid is -0.225, with a standard error of 0.118 and a p-value of 0.056. Holding all other variables constant, bank clients who are housemaids have a log-odds of subscribing that are 0.225 lower than clients in administrative work. This corresponds to a 20% lower odds of subscribing – and within a 95% confidence interval this may range from a 37% lower odds of subscribing to a 1% increased odds of subscribing relative to bank clients in the administrative field.

However, not only is this p-value not statistically significant at the 0.01 or 0.05 levels, the coefficient’s standard error also means the effect could range from negative to positive, so we should be cautious about concluding that housemaid clients differ meaningfully from the reference job category of administrative bank clients.

jobmanagement

The coefficient for management is -0.124, with a standard error of 0.065 and a p-value of 0.056. Holding other variables constant, management clients have log-odds of subscribing that are 0.124 lower than clients in the administrative job category. This equates to about 12% lower odds of subscribing. However, again the p-value is not statistically significant, and at a 95% confidence level the coefficient may be effectively zero or even slightly positive. Thus, we should be cautious about concluding that management bank clients differ meaningfully in term deposit subscription outcomes.

jobretired

The coefficient for retired is 0.697, with a standard error of 0.074 and a statistically significant p-value (p < 0.01). Holding all other variables constant, retired clients have log-odds of subscribing that are 0.697 higher than clients in the administrative category. This means that they have roughly 2.01 times the odds of subscribing compared with bank clients in the administrative field, equating to 101% higher odds of subscribing, holding all else constant. At a 95% confidence level, the percent increase in odds ranges from 74% to 132%, meaning that this is a highly influential variable in determining subscription to a term deposit, especially given that most other job categories (all except student and unemployed) have lower odds than the administrative job category.

jobself-employed

The coefficient for self-employed bank clients is -0.163, with a standard error of 0.097 and a p-value of 0.094. This means that self-employed bank clients have 15% lower odds of subscribing to a term deposit than administrative bank clients, holding other variables constant. However, this coefficient is not statistically significant, and within a 95% confidence interval the effect may range from lower odds, through no difference, to slightly higher odds. This is not a uniquely compelling variable.

jobservices

The coefficient for service job bank clients is -0.25, with a standard error of 0.074 and a statistically significant p-value (p < 0.01). This means that bank clients who work in the services industry have 22% lower odds of subscribing to a term deposit than administrative workers. This negative association holds across the 95% confidence interval implied by the standard error.

jobstudent

The coefficient for bank clients who are students is 0.895, with a standard error of 0.098 and a statistically significant p-value (p < 0.01). This means that students have 145% higher odds of subscribing to a term deposit than bank clients in the admin job category. This positive association holds across the 95% confidence interval implied by the standard error.

jobtechnician

The coefficient for technician clients is -0.126, with a standard error of 0.06 and a p-value of 0.037 (significant at p < 0.05, but not at p < 0.01). This means that technicians have 12% lower odds of subscribing to a term deposit than bank clients in the admin job category. This negative association holds across the 95% confidence interval, but the coefficient is not significant at the 0.01 level, and the corresponding 99% confidence interval includes zero.

jobunemployed

The coefficient for unemployed bank clients is 0.256, with a standard error of 0.095 and a statistically significant p-value (p < 0.01). This means that unemployed bank clients have 29% higher odds of subscribing than admin bank clients. This positive association holds across the 95% confidence interval implied by the standard error.

balance

The coefficient for balance is 0.0000269, with a standard error of 0.00000424 and a statistically significant p-value (p < 0.01). Holding all other variables constant, a one-unit increase in balance is associated with a very small increase in the log-odds of subscribing: about a 0.0027% increase in the odds. Scaled to 1,000 units of currency (the dataset documentation lists balance in euros), this is about 2.7% higher odds of subscribing for each additional 1,000 that the bank client holds on average in their bank account.

educationsecondary

The coefficient for secondary education is 0.158, with a standard error of 0.055 and a statistically significant p-value (p < 0.01). This means bank clients whose highest level of education is secondary have 17% higher odds of subscribing than bank clients whose highest level is primary, holding all other variables equal. This positive association holds across the 95% confidence interval implied by the standard error.

educationtertiary

The coefficient for tertiary education is 0.490, with a standard error of 0.064 and a statistically significant p-value (p < 0.01). This means that bank clients with a tertiary level of education have 63% higher odds of subscribing to a term deposit than those who have only completed a primary level of education. This positive association holds across the 95% confidence interval.

poutcomeother

This variable likely indicates bank clients who had previously been marketed to and for whom some outcome other than a success or a failure occurred, e.g., maybe they opened a credit card or a debit card. The coefficient for poutcomeother is 0.309, with a standard error of 0.079 and a statistically significant p-value (p < 0.01). This means that bank clients whose prior marketing campaign led to an ‘other’ result have 36% higher odds of subscribing to a term deposit than those who failed to subscribe in a previous marketing campaign.

poutcomesuccess

The coefficient for poutcomesuccess is 2.415, with a standard error of 0.072 and a statistically significant p-value (p < 0.01). This means that bank clients who had previously responded favorably to a marketing campaign have about 11.2 times the odds (roughly 1,020 percent higher odds) of subscribing to a term deposit compared with those who failed to subscribe in the previous marketing campaign.

poutcomeunknown

This variable likely indicates bank clients who had not previously been marketed to by the bank, i.e., their prior outcome is unknown or not relevant. The coefficient of -0.333, with a standard error of 0.048 and a statistically significant p-value (p < 0.01), means that these clients have 28% lower odds of subscribing to a term deposit than those who failed to subscribe in the last marketing campaign. This negative association holds across the 95% confidence interval.
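
The arithmetic behind these interpretations can be reproduced in a few lines. The sketch below is illustrative only: the estimates and standard errors are typed in by hand from the summary(model) output for four example coefficients, and the column names are my own rather than anything produced by the original analysis.

```r
# Estimates and standard errors copied by hand from summary(model) above
coefs <- data.frame(
  term     = c("jobblue-collar", "jobretired", "educationtertiary", "poutcomesuccess"),
  estimate = c(-0.3717, 0.6969, 0.4903, 2.415),
  se       = c(0.06382, 0.07428, 0.06379, 0.07176)
)

z_crit <- qnorm(0.975)  # two-sided 95% critical value, about 1.96

# Exponentiate the log-odds estimate and its Wald interval to get odds ratios
coefs$odds_ratio <- exp(coefs$estimate)
coefs$or_lower   <- exp(coefs$estimate - z_crit * coefs$se)
coefs$or_upper   <- exp(coefs$estimate + z_crit * coefs$se)

# Percent change in odds relative to the reference category
coefs$pct_change <- 100 * (coefs$odds_ratio - 1)

# Baseline probability implied by the intercept (inverse logit of -2.158)
baseline_prob <- plogis(-2.158)

# Effect of balance per 1,000 units: scale the coefficient, then exponentiate
balance_pct_per_1000 <- 100 * (exp(2.687e-05 * 1000) - 1)

print(coefs, digits = 3)
```

Note that the interval is built on the log-odds scale and only then exponentiated, which is why the resulting odds-ratio interval is not symmetric around the point estimate.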

One quick reflection point on the binary variables: the model drops one level of each categorical predictor to serve as the reference category (to avoid perfect multicollinearity), so from a business perspective it may be best to specify which category should be the reference. In our business case, we may actually want to use the unknown category as the reference instead of admin workers for job, and likewise for education level. This deserves a clearer definition, because some of the value in explaining which bank clients are more likely to subscribe to a term deposit depends on which reference category is used.
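
One way to choose the reference category explicitly is base R's `relevel()`. The sketch below uses a toy education column with hypothetical values, and picking "tertiary" as the baseline is only an example, not a recommendation from the original analysis.

```r
# Toy education column (hypothetical values, not the real data)
education <- factor(c("primary", "secondary", "tertiary", "secondary"))
default_ref <- levels(education)[1]  # default reference: first sorted level

# relevel() moves the chosen level to the front so it becomes the baseline
education_ref <- relevel(education, ref = "tertiary")
new_ref <- levels(education_ref)[1]

# The dummy columns now describe the remaining levels relative to "tertiary"
dummy_names <- colnames(model.matrix(~ education_ref))
dummy_names
```

Releveling does not change the fitted probabilities, only which comparisons the coefficients express.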

Some other interesting findings of this model are that students and unemployed bank clients both have higher odds of subscribing to a term deposit (relative to admin workers). This runs contrary to what I would expect, since I would assume students and unemployed clients have less money available to put into a term deposit, and it may require some further investigation.

That said, the most significant variables influencing whether a client will subscribe to a term deposit are:

Thus, if I were advising a bank on which clients to prioritize for outreach in order to get a more immediate and efficient increase in term deposits, I would recommend bank clients who fit within that category.

# Plotting Cost for the Model
y <- cleaned_data$outcome
p_hat <- model$fitted.values
cost <- -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# Boxplot of per-observation cost by actual outcome (log-scaled cost axis)
ggplot() +
  geom_boxplot(mapping = aes(x = as_factor(cleaned_data$outcome), 
                             y = cost)) +
  labs(x = "Term Deposit", y = 'Cost') +
  scale_y_log10()

I also constructed the cost graph above, borrowed from this week’s notebook. Here we can see that my model is much better at predicting non-subscription than subscription to a term deposit: there is relatively low loss for bank clients who did not subscribe (as shown by the median line) in contrast with a relatively high cost for clients who did subscribe. A high cost for actual subscribers means the model assigned them low predicted probabilities, so it is ‘under-predicting’ term deposit subscription.
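
To see why the cost differs so much between the two groups, it can help to evaluate the same log-loss formula for a single predicted probability. This is a minimal sketch; the helper name `log_loss` and the probability of 0.10 are hypothetical choices for illustration, not outputs of the model.

```r
# Per-observation log loss, the same formula used for `cost` above
log_loss <- function(y, p_hat) {
  -(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}

# Hypothetical predicted probability of subscribing (illustrative only)
p_hat <- 0.10

loss_non_subscriber <- log_loss(y = 0, p_hat = p_hat)  # equals -log(0.90)
loss_subscriber     <- log_loss(y = 1, p_hat = p_hat)  # equals -log(0.10)

c(non_subscriber = loss_non_subscriber, subscriber = loss_subscriber)
```

A low predicted probability costs little when the client did not subscribe and a lot when the client did, which is the pattern the boxplot shows.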

05 - Building a Confidence Interval

To reiterate how the confidence interval works on a coefficient within the model, let’s take a look at the educationtertiary variable and its outputs from the logistic regression model. We can see that we have (1) a coefficient, (2) a standard error, (3) a z-value, and (4) a p-value:

  1. The coefficient specifies the direction and size of the variable’s effect on the log-odds, e.g., whether it increases or decreases the odds of the client subscribing to a term deposit;

  2. The standard error estimates the standard deviation of the coefficient’s sampling distribution, i.e., how much the estimate would vary across repeated samples;

  3. The z-value is the test statistic: the coefficient divided by its standard error, which measures how many standard errors the estimate lies from the null-hypothesis value of 0;

  4. The p-value is the probability of observing a test statistic at least as extreme as z if the null hypothesis (a true coefficient of 0) were true.
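
These four quantities are tied together by simple arithmetic. As a minimal sketch, with the estimate and standard error typed in by hand from the educationtertiary row of the summary output, the z-value and two-sided p-value can be recomputed like this:

```r
# Values copied by hand from the educationtertiary row of summary(model)
estimate <- 0.4903
se       <- 0.06379

# Wald z statistic: how many standard errors the estimate sits from zero
z_value <- estimate / se

# Two-sided p-value under the null hypothesis that the true coefficient is 0
p_value <- 2 * pnorm(-abs(z_value))

c(z = z_value, p = p_value)
```

The results should agree with the z value and Pr(>|z|) columns of the summary table up to rounding.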

The standard error represents the standard deviation that would be present in a sampling distribution of our coefficient if the model were re-estimated on many random samples. Because that sampling distribution is approximately normal, we can use the standard error to build a confidence interval at a specified level, which tells us where the coefficient plausibly lies and how strong the evidence for its direction is. After all, if the interval spans both positive and negative values, that is a sign that the data do not give strong evidence about the direction of the effect, or that the effect may be negligible.

For educationtertiary the coefficient value is 0.49 with a standard error of 0.064. If we apply the normal distribution through \(0.49 + 0.064z\), where \(z\) is a standard normal value, we get the hypothetical sampling distribution of our coefficient. We can then specify a significance level such as \(\alpha = 0.05\) with two tails, as shown below. There we can see the 95% confidence interval as the gray area between the two red lines, covering coefficient values between roughly 0.36 and 0.62. Since none of these values are negative, and a negative value would sit far out in the left tail, this indicates that we have evidence that an effect exists and that it is positive.

# Coefficient and standard error for educationtertiary (from summary(model))
beta_hat <- 0.49
se <- 0.064
alpha <- 0.05

# Get the z critical value for our specified alpha level
z_crit <- qnorm(1 - alpha / 2)

# Calculate the upper and lower bounds at 95% confidence level - two tailed
ci_lower <- beta_hat - z_crit * se
ci_upper <- beta_hat + z_crit * se

df <- data.frame(
  beta = seq(beta_hat - 4 * se, beta_hat + 4 * se, length.out = 1000)
)

df$density <- dnorm(df$beta, mean = beta_hat, sd = se)

# Plot the graph
ggplot(df, aes(x = beta, y = density)) +
  geom_line() +
  geom_area(
  data = subset(df, beta >= ci_lower & beta <= ci_upper),
  aes(x = beta, y = density),
  fill = "gray",
  alpha = 0.4
  ) +
  geom_vline(xintercept = beta_hat) +
  geom_vline(xintercept = c(ci_lower, ci_upper), linetype = "dashed", color = "red") +
  labs(
    title = "Sampling Distribution for educationtertiary",
    x = "Coefficient value",
    y = "Density"
  ) +
  theme_minimal()
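
The same Wald interval can be pulled from a fitted model without typing coefficients by hand. The sketch below uses simulated data (the variable names, seed, and true coefficients are all made up for illustration) and checks that `confint.default()` from base R's stats package matches the manual estimate plus or minus z times standard error calculation.

```r
# Simulated data for illustration only (not the bank marketing data)
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 0.8 * x))

fit <- glm(y ~ x, family = binomial(link = "logit"))

# Wald intervals computed by R: estimate +/- z * standard error
wald_ci <- confint.default(fit, level = 0.95)

# The same intervals computed by hand from the coefficient table
est       <- coef(fit)
se        <- sqrt(diag(vcov(fit)))
manual_ci <- cbind(est - qnorm(0.975) * se, est + qnorm(0.975) * se)

wald_ci
```

On a real analysis, `confint.default(model)` would apply the same calculation to every coefficient at once; exponentiating the result gives intervals on the odds-ratio scale.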

06 - Other Questions / Observations