00 - Introduction

For Week #9 data dive we will be focusing on regression diagnostics by enhancing the bivariate model submitted in week #8. Specifically, we will:

To revisit the business context – this work focuses on the the Bank Marketing dataset provided by UC Irvine Machine Learning Repository. While the main goal of this is to ‘predict’ whether a client subscribes to a term deposit based on a phone call outreach campaign – for the purposes of this linear regression lab, focus was not placed on using this binary outcome classifying variable.

AI Disclosure:

# Declare libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.1
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(GGally)
library(lindia)
## Warning: package 'lindia' was built under R version 4.5.3
setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

01 - Evaluating The Previous Model

The prior regression model applied the following variables:

This was done in attempt to understand what factors may have influenced whether a bank client was previously solicited for term deposit or bank marketing – and those who have higher balances, one would suspect of being prime targets!

Of course, it’s entirely possible that the prior bank outreach isn’t targeted at all based on income (which perhaps explains why people with average balances below $1,000 were a part of this calling campaign) and thus this hypothesis of a linear relationship may not be true.

# Linear Regression Results
model <- lm(previous ~ balance, bank_marketing)
summary(model)
## 
## Call:
## lm(formula = previous ~ balance, data = bank_marketing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -1.851  -0.576  -0.566  -0.563 274.430 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.631e-01  1.187e-02  47.456  < 2e-16 ***
## balance     1.261e-05  3.558e-06   3.546 0.000392 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.303 on 45209 degrees of freedom
## Multiple R-squared:  0.000278,   Adjusted R-squared:  0.0002559 
## F-statistic: 12.57 on 1 and 45209 DF,  p-value: 0.000392

Based on this prior model, there are a couple of things that we know:

  1. The overall model is statistically significant at the 0.01 level (F = 12.57, p = 0.000392). This indicates that including balance as a predictor provides a significantly better fit than an intercept-only model, allowing us to reject the null hypothesis that the slope on balance is 0.

  2. The coefficient for balance is statistically significant at the 0.01 level (t = 3.546, p = 0.000392). However, the estimated effect size is extremely small: the coefficient (0.0000126) implies that an increase of approximently $80,000 in average balance is associated with only one additional prior contact. Thus, while statistically significant, the effect is practically negligible.

  3. The models explanatory power is extremely low, with an \(R^2\) of 0.000278. This means that the model explains only about 0.028% of the variance in the previous variable, indicating that balance – while significant, provides virtually no meaningful predictive value for the number of prior contacts.

  4. The large sample size (over 45,000 observations) increases the poewr of statistical tests, making it possible to detect even extremely small effects as statistically significant. Thus, the significance observed for balance is likely just a result of the influence of the sample size rather than a meaningful relationship.

*ChatGPT / AI was used to improve the language on these bullet points; although they were initially written manually.

# Plotting Residuals vs Fitted Values to test for Heteroskedasticity

# Get fitted values and residuals from the model
plot_data <- data.frame(
  fitted_vals = fitted(model),
  residuals = resid(model)
)

plot_data |>
  ggplot(mapping = aes(x = fitted_vals, y = residuals)) +
  geom_point() +
  labs(
    x = "Fitted Values",
    y = "Residuals",
    title = "Figure 1: Residuals vs Fitted Values"
  ) +
  geom_line(y = 0, color="red") +
  theme_minimal()

Based on our summary understanding of the model – and then in addition, the heteroskedasticity present in the Figure 1: Residuals vs Fitted Values – e.g., the visual relationship between the variance in our predictions from their fitted values appears to be non-random bunching around 0.5, or moving outward in a funnel motion from a fitted value of 0.8 – I can quite confidently say that this model is terrible.

More broadly, from a business context, the variable previous represents a count (i.e., the number of prior contacts). As such, it contains many zero values and is likely left-skewed. Because count data often violate the assumptions of linear regression—particularly normality and constant variance—a Poisson regression model may be more appropriate. Poisson regression is a type of generalized linear model (GLM) designed specifically for modeling count outcomes that is mentioned in a future week.

02 - Improving the Model

For now, let’s focus on meeting the requirements of the assignment by introducing some new variables to potentially explain prior contacts into our model based on business logic.

Let’s go ahead and check assumptions #1 (Variable \(x\) has a linear correlation with response variable \(y\) ) and #4 (Independent variables are not linearly correlated with one another) of the linear model we are about to create. We will do this by creating a correlation plot. We will need to:

  1. First we will need to convert the variable loan from a ‘yes’ - ‘no’ text field, to a binary value of 1 or 0.

  2. Then we will test whether a potential relationship exists between previous and education by testing with ANOVA – since education is a categorical column – it cannot be incorporated into the correlation matrix.

# Creating Correlation Matrix
bank_marketing |>
  mutate(loan = if_else(loan == "yes", 1, 0)) |> # converting to binary value for correlation
  select(previous, balance, age, loan) |>
  ggcorr(label = TRUE) +
  labs(title = "Figure 2: Correlation Heatmap of Independent Variables")

anova_model <- aov(previous ~ education, data = bank_marketing)
summary(anova_model)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## education       3    165   54.95   10.36 8.19e-07 ***
## Residuals   45207 239712    5.30                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results:

Regardless of these findings, let’s go ahead and power through, construct the model, and then engage in some evaluation with the plots in the next section. It’s important to note that since education is a categorical variable with four categories (unknown, primary, secondary, and tertiary), the linear model function (lm) in R will convert these categories into binary variables that will each be applied into the model.

# Creating the "Improved" Linear Model
improved_model = lm(previous ~ balance + age + loan + education, bank_marketing)
summary(improved_model)
## 
## Call:
## lm(formula = previous ~ balance + age + loan + education, data = bank_marketing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -1.736  -0.605  -0.563  -0.483 274.342 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.415e-01  5.560e-02   7.940 2.06e-15 ***
## balance             1.052e-05  3.601e-06   2.921  0.00350 ** 
## age                 9.514e-04  1.050e-03   0.906  0.36487    
## loanyes            -6.031e-02  2.971e-02  -2.030  0.04237 *  
## educationsecondary  8.728e-02  3.227e-02   2.705  0.00684 ** 
## educationtertiary   1.724e-01  3.495e-02   4.933 8.13e-07 ***
## educationunknown   -7.155e-03  6.031e-02  -0.119  0.90555    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.302 on 45204 degrees of freedom
## Multiple R-squared:  0.001023,   Adjusted R-squared:  0.0008905 
## F-statistic: 7.716 on 6 and 45204 DF,  p-value: 2.61e-08

03 - Diagnosing and Evaluating Improved Model

From the improved model, it does appear to be a slight improvement over the original in terms of explanatory power (\(R^2\)) – although it does also seem to fall into similar pitfalls.

  1. The overall model is significant with an f-statistic of 7.716

  2. Balance, loan indicator, and secondary / tertiary education variables are significant below a 0.05 value. Age and education unknown (binary variable indicating that the bank clients education is unknown) are insignificant.

  3. \(R^2\) is at 0.001 and adjusted \(R^2\) is 0.00089 – or a slight reduction based on the increased number of parameters. These values are about 4 - 5 x higher than in the initial model, suggesting an improvement in explanatory power. Although again, this is still incredibly low. This basically means that only 0.1 % of the variance in the previous variable are explained by these four factors.

  4. Again, based on the large sample size of around n = 45,000 the presence of this regression model may just be noise.

Let’s go ahead and test by creating some diagnostic plots:

Plotting Residuals vs Fitted Values

We’ll start by plotting the residuals vs fitted values to identify if ‘errors’ are pooling or if a pattern emerges. Clearly, it does appear that residuals are associated heavily between values of 0 - 1 and the model becomes more ‘accurate’ in predicting fitted values after 1. This tracks with our expectations of the poison distribution / count aspect where since we lean heavily towards 0 values, our prediction gears towards 0 values yielding a lot of fitted values that have a lot of error.

Another interesting observation from this plot is that are fitted values do not extend beyond around ~1.7. That means based on the underlying data, the regression is not computing any values greater than 2. Which we know is technically not accurate with the distribution of the previous variable and count (as displayed in the distribution of values by previous variable. Afterall there ~ 5,000 values that have a previous count over 2 and you can clearly see the anonymous count of around ~270 which had a predicted value of around 0.8.

Clearly the plot violates our assumption of randomly distributed residuals – although the spread of the errors do appear more random than in the prior model which were super concentrated between fitted values of 0.5 and 0.6 – whereas this runs from 0.4 to 0.8.

# Plotting Residuals vs Fitted Values to test for Heteroskedasticity

# Get fitted values and residuals from the model
plot_data <- data.frame(
  fitted_vals = fitted(improved_model),
  residuals = resid(improved_model)
)

plot_data |>
  ggplot(mapping = aes(x = fitted_vals, y = residuals)) +
  geom_point() +
  labs(
    x = "Fitted Values",
    y = "Residuals",
    title = "Figure 3: Residuals vs Fitted Values"
  ) +
  geom_line(y = 0, color="red") +
  theme_minimal()

# Plotting the Count of 
bank_marketing |>
  group_by(previous) |>
  count()
## # A tibble: 41 × 2
## # Groups:   previous [41]
##    previous     n
##       <dbl> <int>
##  1        0 36954
##  2        1  2772
##  3        2  2106
##  4        3  1142
##  5        4   714
##  6        5   459
##  7        6   277
##  8        7   205
##  9        8   129
## 10        9    92
## # ℹ 31 more rows

Applying Cook’s D

Looking at Figure 3: Residuals vs Fitted Values the large abnormality of the residual above 270, I am curious to see whether that point had a high level of influence on the model. To investigate this, I examined Cook’s distance using the same threshold applied by Professor Leon in the nootbok: three times the mean Cook’s D value – or the ‘matlab’ threshold specified in the gg_cooksd function I call. I also looked reviewed further some guidance on Cook’s distance from Statology and Penn State.

  • As shown below in Figure 4: Cook’s Distance for Improved Model, a few observations appear to have a relatively large influence on the model – particularly observations 29,183 and 38,327.

  • However, based on the common rule of thumb for interpreting cook’s distance, I identify 1,477 observations within the dataset of ~ 45,000 observations that are highly influential – again with numbers 29,183 and 38,327 yielding the largest influence. Observation #29,183 actually represents the highest count of 273.

  • When we plot out a histogram of our response variable previous as done in Figure 5: Distribution of Previous Response Variable Among Influential Points, we can see that our distribution is no longer zero-inflated and the mean of our highly influential values is 8. Which makes sense – the values that are highly influencing our regression model and associated coefficients are those which have high previous values relative to the distribution of the underlying data.

# Plotting Cook's Distance
gg_cooksd(improved_model, threshold = "matlab") +
  labs(
    x = "Observation Number",
    y = "Cook's Distance",
    title = "Figure 4: Cook's Distance for Improved Model"
  ) +
  theme_minimal()
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Compute Cook's D values
cooks_d <- cooks.distance(improved_model)

# Compute MATLAB-style threshold
matlab_threshold <- 3 * mean(cooks_d, na.rm = TRUE)

# Return influential observations with explanatory variables
influential_points <- bank_marketing |>
  mutate(
    observation = row_number(),
    cooksd = as.numeric(cooks_d)
  ) |>
  select(
    observation,
    cooksd,
    previous,
    balance,
    age,
    loan
  ) |>
  filter(cooksd > matlab_threshold) |>
  arrange(desc(cooksd))

influential_points
## # A tibble: 1,477 × 6
##    observation  cooksd previous balance   age loan 
##          <int>   <dbl>    <dbl>   <dbl> <dbl> <chr>
##  1       29183 0.167        275     543    40 no   
##  2       38327 0.0238        58    1085    46 yes  
##  3       30493 0.0110        27      68    31 no   
##  4       44439 0.0105        23   18931    53 no   
##  5       44823 0.00827       41     821    27 yes  
##  6       28887 0.00664       51     358    31 no   
##  7       44485 0.00553       35    6791    28 no   
##  8       35316 0.00497        9   34230    53 yes  
##  9       28592 0.00470       24    2291    32 yes  
## 10       28585 0.00465       29     702    31 yes  
## # ℹ 1,467 more rows
# Generate a Historgram of 'Previous' Column based on Influential points
ggplot(influential_points, aes(x = previous)) +
  geom_bar() +
  labs(
    title = "Figure 5: Distribution of Previous Response Variable Among Influential Points",
    x = "Previous",
    y = "Count"
  ) +
  geom_vline(xintercept = mean(influential_points$previous), color = "red") +
  annotate(
    "text",
    x = mean(influential_points$previous, na.rm = TRUE)+5,
    y = -5,
    label = round(mean(influential_points$previous, na.rm = TRUE), 0),
    color = "red"
  ) +
  theme_minimal()

In part, especially witnessing the 275 value in our previous column that these ‘hyper’ values for previous communications being sent to these bank clients may represent outliers, such as a bank client accidentally being included in a campaign twice, or potentially data issues and by cleaning up some of these extreme values we may end up improving the model. This would likely require further validation / a better conceptual understanding of the underlying data to truly make a good judgement call here.

Plotting Residuals vs X-Values

Now let’s plot the residuals vs their x-values for each of the predictor variables – and in the case of our categorical variables, I will apply a box and whisker plot to compare.

  • Balance: We can see that as balance increases, the value of our errors decreases. Errors are clustered around those who have a smaller balance.
  • Age: The distribution of the residuals across age actually seems pretty uniform. Errors seem to decrease as age increases – e.g., perhaps a slight linear trend.
  • Loan and Education: It’s really hard to discern any difference in the ‘no’ vs ‘yes’ categories here because of how the box and whisker plot is distributed, as well as for the education category.
# Plotting Residuals vs Predictor Values
gg_resX(improved_model, plot.all = FALSE)
## $balance

## 
## $age

## 
## $loan

## 
## $education

Overall, the residuals plotted against their x-values show a violation of the normality assumption in how errors are distributed for balance and age – but we cannot make that determination based on the categorical variables at this time.

  • The residuals for balance seem like they could be approximated by a \(\frac{1}{x}\) function. This makes me think that I may be able to transform the predictor variable with a \(\log{x}\) approach or some other functional approach with an offset to avoid an ‘undefined’ balance (e.g., log of 0).

  • The residuals for age seem almost multi-modal where they are relatively normal and then differ based on age range. There may be some statistical techniques to approach this – but I am not specifically aware of how. Age isn’t significant at my alpha level of p < 0.01 – so I am not going to worry about that.

Histogram of Residuals

Based on Figure 6: Histogram of Residuals we can see that our residuals are not normally distributed. In fact, the bulk of them are likely around 0 ~ which go figure makes sense because most of the values in the model are zero. This obviously violates the normality assumption for residuals and again emphasizes a flaw within our model discussed above.

# Create a histogram
res <- resid(improved_model)

ggplot(data.frame(res = res), aes(x = res)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Figure 6: Histogram of Residuals",
    x = "Residuals",
    y = "Count"
  ) +
  theme_minimal()

QQ Plot

From the QQ plot – as demonstrated as well in other conversations above, we can see we have some very strong deviation from normality. This is particularly demonstrated on the right tail, where there is a huge vertical spread on our residuals at high quantile – or in other words, extremely high positive residuals indicating that we are under predicting on some overvaluations in our model. This indicates an abnormal residual distribution and a central issue with our model – which again likely aligns with some assumptions / issues I’ve pointed out with regression on a count column that is heavily zero-inflated.

gg_qqplot(improved_model)

Conclusion of Analysis

We have a pretty terrible model that is:

  1. statistically significant:

  2. pragmatically ineffective based on severely low \(R^2\) and effect size (variable coefficients):

  3. violates the linearity, constant variance of errors, and normality of error distribution assumptions in linear regression.

It’s unusable – but there are some things that maybe could be done to improve in that will be discussed in the next section.

04 - Lessons / Reflections for the Next Week

From this analysis there are a couple of key takeaways that I can reflect on:

  1. There is limited business value in modeling the number of prior contacts (previous) a bank client may have received from the bank. I actually selected this variable because it was partially continuous and I had already used by other continuous variable balance in a prior analysis, so I wanted to try something fresh. However, in a practical setting, it’s hard to imagine a business use case here – especially since this variable is kinda ‘odd’ and is hard to think about how to understand what influences it.

  2. This exercise highlights the importance of engaging with exploratory data analysis before choosing a model. What I would like to do – and this kind of relates to concepts that are explored in the GLM week – would be to examine how predictor variables are distributed against the response variable to identify any potential nonlinear relationships that may make their way into how the residuals are distributed if not transformed.

  3. Given my understanding of some future course work – to fully construct a more robust model here – we may need to apply poisson regression and explore how to approach zero-inflated models. Reading a little into the zero inflation models – it seems kind of interesting like maybe you do something with a logit regression to predict whether the count will be zero or greater than zero, and if it is greater than zero you then apply a regression after the fact or something like a chained order of operations to predict the count?