00 - Introduction to Week #8 Data Dive

This weeks data dive will focus on gaining experience with running AVOA tests and building regression models. Specifically, it will meet the following requirements:

  1. Selection of a continuous (or ordered integer) column of data of most “value” within the context of the bank marketing dataset – e.g., response variable.
  2. Selection of a categorical variable (explanatory variable) that may influence the respponse variable above.
  3. Devised a null hypothesis for an ANOVA test given this situation. Test the hypothesis using ANOVA and summarize results with box plots. With clarity about how the R output relates to conclusions.
    • If more than 10 categories, consolidate before running tests using methods previously used.

    • Explain what this may mean for people interested in data.

  4. Find a single continuous (or ordered, non-binary) column of data that may influence the response variable. Relationship between this variable and response should be roughly linear.
  5. Build a linear regression model using just this column and evaluate it’s fit. Interpret coefficients of the model and explain how they relate to the context of the data.

Let’s go ahead and declare our libraries and load in the bank marketing dataset. Further documentation on this dataset can be found here.

# Declare libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.1
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

01 & 02 - Choosing Response and Explanatory Variables

The main purpose of this dataset from the UC Irvine Machine Learning Repository is to understand the predictors that influence whether a client subscribes to a term deposit (e.g., setting aside a certain amount of money with the bank that they cannot deposit for a set period of time in exchange for a higher interest earnings return). This is indicated by a binary column y in the dataset where a 1 indicates subscription to a term deposit.

However, since the requirements above require a continuous or ordered integer column of data–instead we are going to look at a different variable. Eventually we will likely engage with a logarithmic regression for an inferential understanding of predictors for term deposits – but since this module focuses on linear regression, this just ain’t that day pal!

One variable that stands out is the previous variable. This indicates the number of contacts performed to the client before this current term deposit campaign. Now it seems likely that those who have a higher level of education may receive more communications from the bank in general. With a higher level of education, they are more likely to make more at their job and have a higher re-occurring direct deposit setup, a higher average balance, and even the potential for solicitation on student loans and other interesting financial instruments / arrangements. This will make for a good comparison with the ANOVA test.

Thus we have chosen accordingly:

03 - Testing with ANOVA

To begin, our null hypothesis outlining our expectation of explanatory variable education on the number of prior contacts previous, is as follows:

\[ H_0 : \text{average number of prior client contacts is equal across education levels} \]

We will analyze this by applying the Analysis of Variance (ANOVA) test, which allows us to understand how each of the categorical education values vary against one another, with respect to our response variable, previous. Before we officially test, we need to check our assumptions:

  1. Independence: Is every data point independent of each other?
    • Likely yes. Typically the education level of a client is not going to be dependent on the education level of another client. If clients are related, then that may be a different story but we shall assume independence for now.

    • It’s also possible that if two clients are related and share the same address, then they may receive less communications to avoid mailing to the same address, etc. but we will assume that most clients in the bank are not related.

  2. Normality: Is the distribution of values within each group normally distributed?
    • No the values do not fit within a standard “normal gaussian distribution.” Looking back the modules on distribution types they seem to fall more in line with a geometric or exponential distribution. Additionally, the probability mass function below demonstrates that each education level has roughly the same distribution of previous contacts from the bank – which makes my null hypothesis look pretty compelling at this point in time. At this point in time we’ll move forward with this variable since I chose it – but it’s not looking too hot out there chief.
  3. Homoscedansicity: Is the variance within groups consistent?
    • Kind of – while the standard deviation of the primary and secondary education levels are very similar – there appears to be a sizable right-tail for bank clients who have completed up to a tertiary level of education (college or higher) which results in a standard deviation almost double that of the other groups. We’ll have to see what ANOVA returns – it will be interesting.
# Checking the Normality assumption for ANOVA within groups with a Probability Mass Function
bank_marketing |>
  count(education, previous) |>
  group_by(education) |>
  mutate(prob = n / sum(n)) |>
  ggplot(aes(x = previous, y = prob)) +
  geom_line() +
  geom_area(alpha = 0.4)+
  coord_cartesian(xlim = c(0, 10)) + # Limiting result for viewability
  scale_x_continuous(breaks = 0:10) + # Limit x-axis to whole numbers
  theme_classic() +
  labs(
    title = "Probability Mass Function of Previous Contacts by Education Level",
    x = "# of Previous Contacts",
    y = "Probability"
  )+
  facet_wrap(~ education)

# Checking the standard Deviations within each Group for Homoscedansicity
bank_marketing |>
  group_by(education) |>
  summarize(
    count = n(),
    stdev = sd(previous),
    min_val = min(previous),
    avg_val = mean(previous),
    max_val = max(previous)
  )
## # A tibble: 4 × 6
##   education count stdev min_val avg_val max_val
##   <chr>     <int> <dbl>   <dbl>   <dbl>   <dbl>
## 1 primary    6851  1.90       0   0.489      58
## 2 secondary 23202  1.85       0   0.568      55
## 3 tertiary  13301  3.14       0   0.662     275
## 4 unknown    1857  1.57       0   0.488      27

While I don’t feel comfortable claiming that our assumptions have been fully satisfied–it should nevertheless still be interesting to soldier on. So let’s go ahead and calculate the test statistic F for ANOVA – and we’ll do it both manually and then validate by using the anova function within R as well. But first – a refresher on ANOVA. The Analysis of Variance (ANOVA) test is statistical test designed to test whether a dataset has statistically distinct variances within specific categorical subsets or groups of the data and a response variable. It’s akin to thinking through whether some specific groups of a given explanatory variable have different levels of variance between each other relative to an explanatory variable.

# Computing the F Test Statistic for ANOVA Manually
# Global Values
k = n_distinct(bank_marketing$education)
x_bar = mean(bank_marketing$previous)
n = nrow(bank_marketing)

# Mean Squared Group Variance -- e.g., Variance between each group, weighted by the size of the group relative to the "population" as a whole.
msg <- bank_marketing |>
  group_by(education) |>
  summarize(
    nj = n(),
    xbar_j = mean(previous)
  ) |>
  summarize(
    SSG = sum(nj * (xbar_j - x_bar)^2),
    MSG = SSG / (k - 1)
  )

# Mean Square Error -- weighted average standard deviation between groups.
mse <- bank_marketing |>
  group_by(education) |>
  summarize(
    nj = n(),
    xbar_j = mean(previous),
    s2 = sum((previous - xbar_j)^2) / (nj - 1)
  ) |>
  summarize(
    SSE = sum((nj - 1) * s2),
    MSE = SSE / (n - k)
  )

results <- c(
  SSG = msg$SSG,
  MSG = msg$MSG,
  SSE = mse$SSE,
  MSE = mse$MSE,
  F = msg$MSG / mse$MSE
)

results
##          SSG          MSG          SSE          MSE            F 
## 1.648371e+02 5.494569e+01 2.397122e+05 5.302546e+00 1.036213e+01
m <- aov(previous ~ education, data = bank_marketing)
summary(m)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## education       3    165   54.95   10.36 8.19e-07 ***
## Residuals   45207 239712    5.30                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The findings demonstrate that there is is a statistically significant difference between education categories and our response variable previous. Thus we can reject the null hypothesis that the average number of prior client contacts is equal across education levels. This is demonstrated by an F Statistic that of 10.36 (with a p value < 0.01) – which means that the variance between the groups is far larger than the variance within the groups – indicating that the groupings of education are introducing variance that is not present within data itself. Let’s go ahead and setup a box and whisker plot to validate these findings–and we will find that we really can’t tell anything about the difference in variance between groups across these plots because there are so many clients that have had 0 previous contacts!

# Create 4 different box and whisker plots of varying coordinate view
p <- bank_marketing |>
  ggplot(aes(education, previous)) +
  geom_boxplot() +
  theme_classic() +
  labs(
    x = 'Education',
    y = 'Previous',
    title = 'Quartile Distribution of Response Variable Previous by Education'
  )

p

p + coord_cartesian(ylim = c(0,5))

p + coord_cartesian(ylim = c(0,10))

p + coord_cartesian(ylim = c(0,20))

This produces a tricky challenge – but I did wonder if we could take or transform the response variable previous by either excluding the zero values or some other approach. ChatGPT recommends taking the log and adding +1 to each value and then creating a box and whisker plot to spot any differences in conjunction with a jitterplot. So let’s go ahead and do that. Here we can see that there does seem to be some difference between the groups with respect to non-zero values. The unknown group tends to have the least “density” of clients that have recieved a prior contact – and that makes sense! If the bank doesn’t know the level of education a person has received, this likely correlates with a general lack of knowledge about the client, so they would be less likely to be solicited by the bank in other product marketing campaigns!

# Log Transformation of Response Variable Previous
bank_marketing |>
  mutate(previous_log = log1p(previous)) |>
  ggplot(aes(x = education, y = previous_log)) +
  geom_boxplot() +
  geom_jitter(width = 0.15, height = 0, alpha = 0.4, size = 0.8) +
  theme_minimal()

04 - Finding a Continuous Explanatory Variable

Now I’m willing to bet that the prevalence of prior contacts probably relates in part to the current account balance that a client has. If a client has a higher account balance, then they may have been solicited more frequently since they have a greater potential to have funds to take away in a term deposit. They also just may receive more communications from the bank in general since they have a higher balance.

Thus we have chosen accordingly:

We do need to be careful here since we could potentially run into a reverse causality error or omitted variable bias issues:

The first step is plotting! Let’s plot this relationship and see if there could be a linear relationship among these two variables as hypothesized – and the evidence doesn’t appear to demonstrate any strong relationship here, visually. But i I do include the linear model line, we do appear to see a roughly positive linear relationship – so perhaps one may be present warranted we continue to further investigate.

# Plotting Number of Prior Contacts (Previous) to Average Annual Balance (Balance)
bank_marketing |>
  ggplot(mapping = aes(x = balance, y = previous)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_classic() +
  labs(
    x = 'Average Annual Balance',
    y = '# of Prior Campaign Contacts',
    title = 'Correlation Chart of Balance on Previous Contacts'
  )
## `geom_smooth()` using formula = 'y ~ x'

05 - Building the Regression Model

Now let’s construct a simple bivariate regression model to understand the potential interactions and the strength of these interactions between these two variables. If we remember the principles of regression, it follows as: \[y = \beta_0 + \beta_1x + \varepsilon \]

In this case, \(\beta_0\) is our intercept or natural starting point, \(y\) is the variable previous, and our \(x\) variable will be balance multiplied by the appropriate scaling coefficient \(\beta_1\). The \(\varepsilon\) refers to the error between our prediction components: \(\beta + \beta_1x\) which can be colloquially referred to as \(\hat{y}\) and the actual observed value \(y\) in the dataset.

# Validate On Regression Model
model <- lm(previous ~ balance, bank_marketing)
print(model)
## 
## Call:
## lm(formula = previous ~ balance, data = bank_marketing)
## 
## Coefficients:
## (Intercept)      balance  
##   5.631e-01    1.261e-05
# Calculating Beta 1 Value
reg_b1 <- bank_marketing |>
  mutate(
    xy_num = (balance - mean(balance))*(previous - mean(previous)),
    x_denom = (balance - mean(balance))^2
  ) |>
  summarize(
    numerator = sum(xy_num),
    denominator = sum(x_denom)
  ) |>
  summarize(beta_1 = numerator / denominator)

# Calculating Beta 0 (Intercept Value)
reg_intcpt <- bank_marketing |>
  summarize(beta_0 = mean(previous) - reg_b1 * mean(balance))


outputs <- c(reg_intcpt, reg_b1)
print(outputs)
## $beta_0
##      beta_1
## 1 0.5631396
## 
## $beta_1
## [1] 1.261402e-05

Here based on this simple linear model we can see that there is a positive relationship between the average annual balance and the number of prior contacts – to the extent where every $1 in average annual balance increases the number of contacts they may receive from the bank by 0.00001 – which is an incredibly small number. To think more intuitively, let’s multiply by a factor of $10,0000 – this means that every $100,000 in annual average balance results in an extra contact from a previous campaign. This is an incredibly small affect size and to be honest, almost has no real meaning, especially since the number of prior contacts for an individual observation should always be a whole number.

Thinking ahead – it may be wise to “refactor” the client population into those who have had at least 1 contact – indicating that they have been with the bank at least some time – and see if that may impact the results, especially since the bulk of the bank clients have not had previous contact. Here we actually see the opposite of the model improving. In fact, the number of prior contacts becomes a function of balance and the intercept is much larger. Overall – the positive relationship is what we would expect, but the effect size is very small.

# Running Regression Model For Only Those Who 'HAD' a Prior Contact
filtered_reg <- bank_marketing |>
  filter(previous > 0)

filt_model <- lm(previous ~ balance, filtered_reg)
print(filt_model)
## 
## Call:
## lm(formula = previous ~ balance, data = filtered_reg)
## 
## Coefficients:
## (Intercept)      balance  
##   3.174e+00    2.339e-06