00 - Introduction to Week #6 Data Dive

This weeks data dive focuses on looking at the relationships between two variables (explanatory and response variables) and then ideally building a boostrapped confidence interval to further analyze the data. Specific requirements include:

Let’s go ahead and load in the bank-marketing dataset and read in the appropriate libraries. Proper documentation on the dataset can be found here.

# Declare libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.1
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

01 - Age and Balance

To begin, let’s start with a business question:

This is a valid question since we would expect older individuals to have more time to build wealth – and thus, more likely to have a higher average balance – and if they have a higher average balance, they may more money to spare and thus a higher likelihood to subscribe to a term deposit. Hence, in this case our explanatory variable is age and our response variable is balance. To help answer this business question, we will first:

  1. Identify basic summary statistics on both variables jointly.
  2. Compute the covariance of age and balance and visualize it.
  3. Compute the correlation of age and balance and visualize it.
  4. Construct a bootstrapped confidence interval for our response variable to determine whether the correlation coefficient we computed fits.

Basic Summary Statistics

Let’s start with the basic interquartile ranges of age and balance:

# View Interquartile ranges of age
age_df <- bank_marketing |>
  select(age)

ggplot(age_df)+
  geom_boxplot(mapping = aes(x = age))+
  labs(
    title = "IQR of Age"
  )+
  theme_classic()

# View Interquartile ranges of average balance
balance_df <- bank_marketing |>
  select(balance)

ggplot(balance_df)+
  geom_boxplot(mapping = aes(x = balance))+
  labs(
    title = "IQR of Average Annual Balance"
  )+
  theme_classic()

Clearly, based on box plots above, we can see that there is significantly more variance on our average balance variable as opposed to our age variable. Let’s go ahead and transform the annual balance ranges into a categorical bin and understand the how

Covariance

Let’s now compute the covariance for age and average annual balance. Covariance defines the average variance in how much one variable varies from it’s mean in relationship to another variables variance from it’s mean. Mathematically, it can be defined as \[\frac{\sum (x_i-\bar x)(y_i-\bar y)}{n-1} \] where \(x_i - \bar x\) represents the variance of each observation of x from it’s mean multiplied by \(y_i - \bar y\), the variance of each observation of y from it’s mean – all divided by the number of samples minus 1 (for covariance of the sample instead of population).

# Compute the Covariance of Age and Balance
bank_updt <- bank_marketing |>
  mutate(
    age_var = age - mean(age),
    bal_var = balance - mean(balance),
    prod_age_bal = age_var * bal_var,
    sum_covar = sum(prod_age_bal),
    covariance = sum_covar / (nrow(bank_marketing) - 1)
  ) |>
  select(age, balance, age_var, bal_var, prod_age_bal, sum_covar, covariance)

print(bank_updt)
## # A tibble: 45,211 × 7
##      age balance age_var bal_var prod_age_bal  sum_covar covariance
##    <dbl>   <dbl>   <dbl>   <dbl>        <dbl>      <dbl>      <dbl>
##  1    58    2143   17.1     781.       13322. 142930362.      3161.
##  2    44      29    3.06  -1333.       -4085. 142930362.      3161.
##  3    33       2   -7.94  -1360.       10795. 142930362.      3161.
##  4    47    1506    6.06    144.         872. 142930362.      3161.
##  5    33       1   -7.94  -1361.       10803. 142930362.      3161.
##  6    35     231   -5.94  -1131.        6715. 142930362.      3161.
##  7    28     447  -12.9    -915.       11840. 142930362.      3161.
##  8    42       2    1.06  -1360.       -1447. 142930362.      3161.
##  9    58     121   17.1   -1241.      -21181. 142930362.      3161.
## 10    43     593    2.06   -769.       -1588. 142930362.      3161.
## # ℹ 45,201 more rows

Let’s now plot each observation in the dataset that makes up the covariance summation to get an understanding of how they vary together – but we’ll choose a random assortment of 30 points as to not choose from all 45,000

# Plot Numerator of Covariance, Point by Point
set.seed(175) # Setting seed per in-class guidelines

bank_updt |>
  slice_sample(n = 30) |>
  ggplot()+
  geom_point(mapping = aes(x = age_var, y = bal_var))+
  labs(
    x = "Age Variance",
    y = "Balance Variance",
    title = "Variance of Age and Balance from 'Population' Mean (n = 30)"
  )+
  theme_classic()

Above – we can see that since the relationship between the variances doesn’t appear squarish (e.g., equidistant in the scatterplot), it’s likely there will not be a high correlation when we perform the computation. So let’s go ahead and compute the correlation coefficient – and then visualize it against a scatterplot where both variables are plotted against each other.

Correlation

Correlation can be calculated by taking the covariance defined above and dividing it by the product of \(x\) and \(y\)’s standard deviations – to scale the value between -1 and 1. The formula for the standard deviation is as follows: \[\frac {\frac{\sum (x_i-\bar x)(y_i-\bar y)}{n-1}}{\sigma x \sigma y} \]

# Finishing up the Correlation Calculation and Visualizing It
bank_updt <- bank_updt |>
  mutate(
    stdev_age = sd(age),
    stdev_bal = sd(balance),
    age_bal_corr = covariance / (stdev_age * stdev_bal),
    act_age_bal_corr = cor(age, balance) # For validation
  )

bank_updt |>
  ggplot(aes(x = age, y = balance))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)+
  annotate("text",
           x = Inf, y = Inf,
           label = paste0("r = ", round(bank_updt$age_bal_corr, 3)),
           hjust = 1.1, vjust = 1.5)+
  labs(
    x = "Age",
    y = "Balance",
    title = "Age x Balance Correlation"
  )+
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Here we can see that the correlation coefficient comes out to be a positive 0.098, which indicates a pretty weak relationship between age and average annual balance held by a bank client for the entire dataset. It’s possible that if we were to split up this dataset into multiple different components based on other factors like occupation or education, these relationships may be stronger or weaker accordingly. Let’s quickly split up the dataset by occupation and perform these computations within each group to see what effect this may have.

# Calculate Correlation within each Occupational Group
bank_marketing |>
  group_by(job) |>
  summarise(
    n = n(),
    age_bal_corr = cor(age, balance), .groups = "drop"
  )
## # A tibble: 12 × 3
##    job               n age_bal_corr
##    <chr>         <int>        <dbl>
##  1 admin.         5171      0.113  
##  2 blue-collar    9732      0.100  
##  3 entrepreneur   1487      0.118  
##  4 housemaid      1240      0.0354 
##  5 management     9458      0.0734 
##  6 retired        2264      0.138  
##  7 self-employed  1579      0.113  
##  8 services       4154      0.108  
##  9 student         938      0.0539 
## 10 technician     7597      0.0388 
## 11 unemployed     1303      0.0786 
## 12 unknown         288      0.00598

Interestingly enough – while there are differences between the occupational correlation coefficients for age and balance and the dataset as a whole, it is not as pronounced as I would have thought. They still mostly seem to hover within the range of 0.098. It will be interesting to see how these correlation coefficients within each occupation compare to the cofidence interval we will construct for the dataset correlation coefficient as a whole.

Constructing a Confidence Interval

Now we are going to bootstrap the original dataset by creating n samples of our dataset with replacement that are each of equal size to our dataset. Then we will go ahead and compute the correlation between age and balance for each sample and plot those on a histogram to understand the normal distribution of the age and balance correlation coefficient. From this, we will identify the 95% confidence interval (within 3 standard deviations of the mean) and determine whether our original correlation coefficient lies within it – which spoilers, it definitely should.

# Bootstrapping Bank Marketing Dataset
bootstrap <- function(data, func, n_iter = 10^4) {
  
  # number of rows
  n <- nrow(data)
  
  # vector to store bootstrap statistics
  func_values <- numeric(n_iter)
  
  for (i in 1:n_iter) {
    
    # resample rows with replacement
    sample_indices <- sample(1:n, size = n, replace = TRUE)
    x_sample <- data[sample_indices, ]
    
    # compute statistic
    func_values[i] <- func(x_sample)
  }
  
  return(func_values)
}

# Call Function to get Correlation
boot_age_bal_corr <- bootstrap(
  bank_marketing,
  func = function(d) cor(d$age, d$balance),
  n_iter = 10^4
)

# Construct Histrogram from Function Call
ggplot() +
  geom_histogram(mapping = aes(x = boot_age_bal_corr), color = "white")+
  labs(
    title = "10K Bootstrapped Age and Balance Correlations",
    x = "Boostrapped Sample Correlation",
    y = "Number of Samples"
  )+
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Well that took some sweet time! But you can clearly see that it looks like the mean of this bootstrapped population falls around 0.097 or 0.098 – which tells us exactly that we are well within the 95% confidence interval ~ or our initial age and balance correlation falls well within three standard deviations from our “simulated population.”

Mathematically, since this follows an normal distribution – we should be able to construct our confidence interval by taking / applying the standard error multiplied by 3 in both directions from the mean to get where 95% of the values should lie. Note – standard error also refers to standard deviation as well in these contexts.

# Calculating Standard Error / Standard Deviation and Mean
c(
  mean = mean(boot_age_bal_corr),
  se   = sd(boot_age_bal_corr)
)
##        mean          se 
## 0.097774763 0.005218586

This means that if construct our confidence interval we get that 95% of values lie approximately between ~0.082 and ~0.1127. Which confirms our suspicious that our sample correlation is well-suited and also emphasizes that the correlation coefficients observed within the various job categories may actually signify a greater difference than I initially anticipated. Especially those coefficients of 0.03 and 0.013 – which technically represent a 70% decrease from the original value or a 30% increase from it. Regardless, they are all still technically in the same classification of a weak positive relationship.