00 - Introduction to Week 3 Data Dive

The data dive for this week will focus on grouping rows of data from the Bank Marketing data set to understand probabilities and connections among variables.

To begin, let’s recall the objective of this bank marketing data:

Thus, for this data dive we will be focusing on analysis that get’s us closer to understanding that objective. More details on the specific dataset can be found here.

# Declare libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.1
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Set global tibble printing for pivot tables
options(tibble.print_max = Inf)
options(tibble.print_min = Inf)

01 - Key Business Questions

The key business questions that apply to this week’s data dives will be:

02 - Exploring Conditional Probability of being Single and Age on Term Deposit Outcome

Conditional probability is defined as: \[P(A|B) = \frac{P(A \cap B)}{P(B)}\]In the case of our business question, we will let \(A\) reference a term deposit and \(B\) reference the martial status being single – if we extrapolate to incorporate age, we would get \[P(A|B,C) = \frac{P(A \cap B \cap C)}{P(B,C)}\] where \(C\), would represent our age status of over 30 years old (or under).

To reflect this within the data set, we will first identify the intersection of all these traits being true and then understand the probability that they are to either subscribe or not subscribe to a term deposit given we know this information about them.

We would ideally like to compare this to the population average absent these individuals to understand how they may have any affect. However – this is seperate and unrelated to understanding conditional probability.

bank_marketing |>
  
  # Data manipulation to get distribution
  mutate(
    marital_age_cat = case_when(
      age > 30 & marital == "single"  ~ "Single and Over 30",
      age <= 30 & marital == "single" ~ "Single and Under 30",
      TRUE                            ~ "Control"
    )
  ) |>
  count(marital_age_cat, y) |>                  
  group_by(marital_age_cat) |>                       
  mutate(pct = round(100 * n / sum(n), 1)) |> # Get percentage within each category (e.g., conditional probabiltiy)
  ungroup() |>
  select(marital_age_cat, y, pct) |> # Could also use select(-n), but appears less intuitive
  filter(y == 'yes') |>
  
  # Plot outcome
  ggplot(aes(x = marital_age_cat, y = pct)) +
  geom_col() +
  geom_text(
    aes(label = paste0(pct, "%")),
    vjust = -0.3
  ) +
  labs(title= "Percentage Subscribed to Term Deposit", x = "Category", y = "Perecent") +
  theme_classic()

The results indicate a clear difference for the campaign population that are both (1) single and (2) under 30. Assuming that we know a person is both single and under 30, they would have a conditional probability of 19.7% for subscribing to a term deposit , this is almost double our control population of 10.4%. This may likely be because people who are single (1) have less decision making barriers – they can act independently, and (2) they may have less financial experience and more likely to act on financial advice.

03 - Exploring Different Combinations of Education and Job

We would expect the bank clientele to likely be “wealthier” than your average population – especially since this is a clientele list that was marketed to directly. Further understanding the clientele base could help us figure out how the eventual outcomes or predictors of term deposit subscription may generalize – and whether they can be generalized – across a broader, non-current bank clientele population. This should essentially help us think about expansion and also understand our clientele better.

bank_marketing |>
  
  # Creating a pivot table of all education and job combinations
  count(job, education) |>
  mutate(pct = round(100 * n / sum(n), 1)) |>
  select(-n) |>
  
  # Creating a heatmap with ggplot
  ggplot(aes (x = education, y = job, fill = pct)) + 
  geom_tile(color = "white") +
  geom_text(aes(label = paste0(pct, "%")), size = 3) +
  scale_fill_gradient(low = "white", high = "firebrick") +
  labs(
    title = "Job vs Education Heatmap",
    x = "Education Level",
    y = "Job Category",
    fill = "Percent"
  ) +
  theme_classic()

The distribution of the clientele across all education levels clearly shows a concentration among individuals who have completed secondary (high-school level) education. However, the largest concentration occurred among all categories are those of job category “management” with up to and beyond the third level of education completed (college or above). This indicates that a specific subsection of the bank clientele when comparing across job category and education level tend to be well off.

There also are a sizeable portion of blue collar workers that are a part of the banks clientele that have completed between primary and up to secondary levels of education. Collectively, this group is larger than management.

Finally, we see that there are a relatively small subsection that fit into the unknown categories from both the job category and education level making up collectively less than 8% of the banks total clientele. This doesn’t present any specific worry, but may require us to filter out these clients to better understand campaign impact when engaging in future analysis.

04 - Basic Group by Dataframes of Categorical Columns

To better understand the basic structure absent the complexity present in the prior examples, we will create basic aggregation dataframes where we summarize the min, max, mean, median, standard deviation, variance, and basic count / probability.

To begin we will cut our bank marketing data to curate an age category column instead of an integer-based continuous variable.

# Create an age category column
bank_marketing_updt <- bank_marketing |>
  mutate(
    age_group = cut(
      age,
      breaks = c(0, 18, 25, 35, 45, 55, 65, Inf),
      right = FALSE,
      labels = c("0-17","18-24","25-34","35-44","45-54","55-64","65+")
    )
  )

04.1 - Age Group by Count Summary Statistics

# Create a dataframe that stores the Age Categorical variable
df_age <- bank_marketing_updt |>
  group_by(age_group) |>
  summarize(n = n(),
            min_age = min(age),
            med_age = median(age),
            mean_age = mean(age),
            max_age = max(age),
            var_age = var(age),
            sd_age = sd(age),
            .groups = "drop") |>
  mutate(prob = n / sum(n),
         flagged = prob < 0.02,
         flag_label = if_else(flagged, "Less than 2%", "")) 

# Display basic table
df_age 
## # A tibble: 6 × 11
##   age_group     n min_age med_age mean_age max_age var_age sd_age   prob flagged
##   <fct>     <int>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <lgl>  
## 1 18-24       809      18      23     22.6      24    2.37   1.54 0.0179 TRUE   
## 2 25-34     14204      25      31     30.6      34    6.62   2.57 0.314  FALSE  
## 3 35-44     14534      35      39     39.0      44    8.22   2.87 0.321  FALSE  
## 4 45-54      9958      45      49     49.2      54    8.29   2.88 0.220  FALSE  
## 5 55-64      4896      55      58     57.7      64    4.34   2.08 0.108  FALSE  
## 6 65+         810      65      72     73.1      95   36.2    6.02 0.0179 TRUE   
## # ℹ 1 more variable: flag_label <chr>

Age groups 18 - 24 and 65+ are the least likely age categories to occur within the bank clientele sample population. Collectively, the probability of a bank client being between 18 to 24 years old, or older than 65+ is less than 4%. The probability frequency plot is displayed below with these specific age categories highlighted.

# Histogram / Frequency Plot by Age Category
df_age |>
  ggplot() +
    geom_col(aes(x = age_group, y = prob, fill = flagged)) +
    labs(
      title = "Age Group Frequency Distribution",
      x = "Age Group",
      y = "Probability",
      fill = "Less than 2%"
    ) +
    theme_classic()

Since younger individuals have less work experience / history, they are also probably less likely to be in credit default, have a housing loan, or a personal loan. This may explain why such a small portion of the bank’s clientele are between 18 and 24 years of age. If we were to engage in hypothesis testing, we may do something similar to example one where we explored the differences between those single, whether they are younger or older than 30, and compare it to the boarder bank population. Specifically:

\(H_0\) = There is no difference between financial asset / instrument exposure (loans, defaults, etc.) for bank clients between the ages of 18 to 24 and those who are older.

\(Ha\) = There is a statistically significant difference between the financial asset / instrument exposure for bank clients between the ages of 18 and 24 and bank clients who are older. Those within the age ranges of 18 - 24 will have significantly less financial asset / instrument / liability exposure.

04.2 - Marital by Average Yearly Balance Summary Statistics

Let’s repeat / apply a similar analysis to the marital status categorical variable but instead of counting the population, let’s look at balance.

# Create and display marital status group by categorical dataframe

df_marital <- bank_marketing_updt |>
  group_by(marital) |>
  summarize(
    min_bal = min(balance),
    med_bal = median(balance),
    mean_bal = mean(balance),
    max_bal = max(balance),
    var_bal = var(balance),
    sd_bal = sd(balance)
  )

df_marital
## # A tibble: 3 × 7
##   marital  min_bal med_bal mean_bal max_bal  var_bal sd_bal
##   <chr>      <dbl>   <dbl>    <dbl>   <dbl>    <dbl>  <dbl>
## 1 divorced   -4057     348    1179.   66721 8564130.  2926.
## 2 married    -6847     477    1426.   98417 9865646.  3141.
## 3 single     -8019     437    1301.  102127 8267452.  2875.

Here we can see that married clientele who were a part of this marketing campaign tend to hold the highest average yearly balance both from the median balance held and the mean balance. We can also see that the significant skew between the mean and median balance indicates that our extreme balance holders are skewing the average – so if we were to plot a box and wisker plot for this data, we would see very long wiskers leading to the high balance amounts that skew the average, as shown below. In fact, the wiskers end up being quite

# Box and Wisker Plots for Average Yearly Balance by Marital Status
bank_marketing_updt |>
  ggplot() +
  geom_boxplot(mapping = aes(x = marital, y = balance)) + 
  labs(
    title = "Interquartile Ranges on Average Yearly Balance by Marital Status",
    x = "Marital Status",
    y = "Average Yearly Balance"
  ) +
  theme_classic()

04.3 - Job

To round the analysis out, let’s look at the job categorical variable and compute some basic summary statistics on the average yearly balance here. As we demonstrated in subsection 03 - Exploring Different Combinations of Education and Job, there should be quite a few clients with management positions that we would expect to have a higher average annual balance in terms of deposits – so let’s see if that is true.

# Job and Balance Basic Dataframe
df_job <- bank_marketing_updt |>
  group_by(job) |>
  summarize(
    min_bal = min(balance),
    med_bal = median(balance),
    mean_bal = mean(balance),
    max_bal = max(balance),
    var_bal = var(balance),
    sd_bal = sd(balance)
  )

df_job
## # A tibble: 12 × 7
##    job           min_bal med_bal mean_bal max_bal   var_bal sd_bal
##    <chr>           <dbl>   <dbl>    <dbl>   <dbl>     <dbl>  <dbl>
##  1 admin.          -1601    396     1136.   64343  6979967.  2642.
##  2 blue-collar     -8019    388     1079.   66653  5019944.  2241.
##  3 entrepreneur    -2082    352     1521.   59649 17251086.  4153.
##  4 housemaid       -1941    406     1392.   45141  8908387.  2985.
##  5 management      -6847    572     1764.  102127 14615066.  3823.
##  6 retired         -1598    787     1984.   81204 19333997.  4397.
##  7 self-employed   -3313    526     1648.   52587 13573769.  3684.
##  8 services        -2122    340.     997.   57435  4685032.  2164.
##  9 student          -679    502     1388.   24025  5961916.  2442.
## 10 technician      -2827    421     1253.   45248  6495077.  2549.
## 11 unemployed      -1270    529     1522.   44134  9888929.  3145.
## 12 unknown          -295    677     1772.   19706  8822614.  2970.

As we can can see, clients employed in the management profession hold among the highest mean average annual balances – but they are eclipsed by those who are retired. This is interesting because in the heatmap we constructed earlier, we saw that most retired clients have completed up to the secondary level of education whereas most managers have completed up to a tertiary level of education. Likely the story here is being driven by age – where those who are retired have had more time to build up their wealth or just have a preference to hold their wealth in their bank account.