00 - Introduction

This weeks data dive is all about investigating the bank_marketing dataset for any missing documentation or values to better understand the dataset wholistically. Specifically, we will:

For reference, the dataset is available here.

# Declare libraries
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.2.1
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

setwd("C:/Users/chris/OneDrive - Indiana University/Graduate School/MIS/INFO-H 510/Project Data")

# Read in dataframe
bank_marketing <- read_delim("bank-marketing.csv",delim=";")
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

01 - Clarity Through Documentation

Two columns that are unclear until you read documentation are:

  1. Contact – specifies whether the phone reached was a cellular phone, or a telephone and

  2. Housing – specifies whether the bank client has a housing loan.

If the categories in the contact column were not specified, then at a glance a researcher may imagine a plethora of different contact types for a calling campaign, whether it’s a home phone number, mobile, office, or potentially work-related. By specifying that the categorical types are – this helps a user at a glance get a more firm understanding of it’s scope. Although, documentation could be improved if they included the specifics on what each of these categories entail, how the data on classifying the contact type for these clients was gathered, and whether this data has been validated.

Housing is a little bit more problematic because if it was not specified, most users may infer that this indicates whether the bank client owns a home. And while this may be related to whether a bank client has a housing loan, it would not always fall hand in hand – especially if that client had paid off their mortgage.

In this case, I think that housing was an easier and simpler variable name than housing_loan–in fact, of the 17 variables included in the dataset, only 1 is a compound word / phrase separated by underscores: day_of_week.

Takeaway: It is incredibly important to be clear on the data that you are producing, thorough in defining that data – e.g., it’s source and collections, and then what different categories signify in the data. This is something that I have experienced at work as well – people tend to sleep on data documentation.

02 - Lack of Clarity Despite Documentation with Visualizations

One unclear variable is the balance variable, or average yearly balance. While this represents the average yearly balance that the client has with the bank – this doesn’t specific if it is aggregated across all customer accounts (e.g., a client with a checkings, savings, and then a CD), nor does it explain why so much of the client population carries a low average balance. If these are working professionals, we would expect their average balance to hopefully sit in a range of $500 - $1,000, but that does not appear to be the case as demonstrated via the balance range table below where 52% of all clients contacted have a balance below $500 or in the negative!

# Recode Education Table

bank_marketing <- bank_marketing |>
  mutate(
    education = recode(
      education,
      "primary" = "P",
      "secondary" = "S",
      "tertiary" = "T",
      "unknown" = "N"
    )
  )

# Balance Range Table

df_balrange <- bank_marketing |>
  mutate(balrange = case_when(
    balance < 0 ~ "1 - Negative",
    balance < 200 ~ "2 - Below $200",
    balance < 500 ~ "3 - Below $500",
    balance < 1000 ~ "4 - Below $1,000",
    balance < 2000 ~ "5 - Below $2,000",
    balance >= 2000 ~ "6 - Over $2,000",

  ))

df_balrange |>
  count(balrange) |>
  arrange(balrange) |>
  mutate(
    percent = round(n / sum(n) * 100, 2),
    running_percent = round(cumsum(percent),2)
  )
## # A tibble: 6 × 4
##   balrange             n percent running_percent
##   <chr>            <int>   <dbl>           <dbl>
## 1 1 - Negative      3766    8.33            8.33
## 2 2 - Below $200   12119   26.8            35.1 
## 3 3 - Below $500    7752   17.2            52.3 
## 4 4 - Below $1,000  6932   15.3            67.6 
## 5 5 - Below $2,000  6136   13.6            81.2 
## 6 6 - Over $2,000   8506   18.8           100

Let’s go ahead and focus on the population with a negative balance as well. How is this even possible – is it overdrafting? Why would we advertise a term deposit opportunity to clients who owe the bank money? And if we were to do a predictive model based on this data, it seems like these negative balance clients could negatively impact the validity of the model. Let’s explore and see if they are also connected to any missing values in the meantime – just to check for possible connection.

Let’s check on the job and education variables since we have an unknown category option on both of these variables which may signify incomplete profiles and then it also may reveal whether these negative balance holders are of a certain socioeconomic class. We’ll apply the rules of conditional probability to compare against the general population as well. From what we can observe – it’s clear that ‘lower-class’ professions are more likely to carry a negative balance. This makes sense intuitively and may suggest that nothing is intuitively wrong with negative balance holders–it may in fact just be those who have over drafted or express some other unknown financial predicament, which is again why I would like more information about how this column is derived.

# Education and Job Heatmaps for Negative Balance Holders

df_negbal <- bank_marketing |>
  filter(balance < 0)

df_posbal <- bank_marketing |>
  filter(balance >= 0)

df_all <- bank_marketing

# Combined tables for faceting
df_combined = bind_rows(
  negbal = df_negbal,
  posbal = df_posbal,
  all = df_all,
  .id = "table_name"
)

# Heatmap
df_combined |>
  count(job, education, table_name) |>
  group_by(table_name) |>                       # <--- group here
  mutate(pct = round(100 * n / sum(n), 1)) |>
  ungroup() |>
  ggplot(aes(x = education, y = job, fill = pct)) +
  facet_wrap(vars(table_name)) +
  geom_tile(color = "white") +
  geom_text(aes(label = paste0(pct, "%")), size = 3) +
  scale_fill_gradient(low = "white", high = "firebrick") +
  theme_classic() +
  labs(
    title = "Client Distribution by Balance Category, Job, and Education",
    x = "Education",
    y = "Job"
  )

Another tricky part about this variable is how clients who pull away from the bank may be included. For instance, if a client decided to withdrawal all $1000 of their funds prior to the campaign – that may be indicative of closing their account, but may reflect their average annual balance at around $750 if they pulled out that $1,000 three months ago and it is computed on a daily / monthly window within a year. This may signal to us as marketing practitioners that this client would be much less likely to subscribe to a term deposit than otherwise would be entailed in the dataset. Although, while such scenarios may be rare, I believe that this demonstrates some of the gaps in the documentation on this variable. Unfortunately, I do not believe there is any way to discern this behavior from our data to understand if it has any influence.

Instead, let’s go ahead and analyze how are various balance ranges – from negative to positive balance, relate to our marketing output variable to see if we can spot any trends. I’ve gone ahead and constructed a faceted jitterplot below. here we can see that the ‘yes’ or ‘no’ responses to a client term deposit seem to be ‘random’ from the sea set – but for those with negative balances, the higher the lower the balance (or more negative it is), the less subscriptions seem to be plucked in a pattern that doesn’t appear random. This likely tells us that there is some potential relationship between negative balances greater than $500 and impact on subscription to a term deposit – which honestly begs the question of the dataset, how can a bank let someone get a negative balance of $500 or greater in the first place? And then why would we even market to these folks at all if we know that they likely wouldn’t have the fund to even engage with a term deposit in the first place.

# Faceted Jitterplot of Balances by Range Type and Outcome Variable

df_balrange |>
  ggplot(aes(x = balance, y = y)) +
  facet_wrap(vars(balrange),  scales = "free") +
  geom_jitter() +
  theme_classic() +
  labs(
    title = "Term Deposit Subscription by Balance Range",
    y = "Subscribe to Term Deposit",
    x = "Balance"
  )

03 - Checking At Least Two Categorical Columns

We already know that we have null values in some of our columns – as denoted by the ‘unknown’ category in education and occupation. Let’s go ahead and check a different variable – the marital status column for (1) any explicit missing values, (2) any implicit missing values, or (3) any empty groups.

# Count of Clients by Marital Status

bank_marketing |>
  mutate(marital_is_na = is.na(marital)) |>
  ggplot() +
  geom_bar(mapping = aes(x = marital, fill = marital_is_na)) +
  labs(
    title = "Missing Marital Values by Marital Status",
    x = "Marital Status",
    y = "Number of Clients",
    fill = "Missing Value"
  ) +
  theme_classic()

Here we can see that there are three categories a client can belong to – single, married, or divorced. As relationship definitions broaden – we may be ‘missing’ some categories we otherwise would expect such as ‘partner’ for those who live together but are unmarried and ‘widowed’ for those whose spouses have passed away. These could potentially be considered implicit missing values, but that would be based on our preconceived notion of marital status that not everyone may agree to. The data documentation does explicitly state that divorced refers to both divorced or widowed – which obscures the data, but likely represents a rare occurrence in the roughly 44,000 bank clients in the dataset. Additionally, since each of the marital groups has data, we don’t have any empty groups – actually except for unknown. Within the data documentation under categories, the authors explicitly mention that ‘unknown’ could be a valid categorical variable entered into this field – which isn’t present in this current variable. Although generally – no, this variable does not have any explicit or implicit missing values, or any empty groups.

Let’s go ahead and take a look at the the education level variable education in the dataset to identify if there are any shenanigans going on as seen previously.

# Education Missing Values Chart
bank_marketing |>
  mutate(education_is_na = is.na(education)) |>
  ggplot(mapping = aes(x = education, fill = education_is_na)) +
  geom_bar() +
  labs(
    title = "Missing Education Values",
    x = "Number of Clients",
    y = "Education Level",
    fill = "Missing Values"
  ) +
  theme_classic()

In the case of education, while there are no null or missing categorical education values assigned to clients, ‘N’ stands for ‘unknown’ and almost certainly represents explicitly missing values that the data set author has tidied up. We can see here that those ‘missing’ the education values represent a small subsection of the overall data set. If we are going to perform any inferential analysis in the future or create any regression models, we may want to be cautious of these missing values and either remove them from the data set under the conditions that the education level would need to be specified as an accuracy condition or we may include it as it’s own ‘unknown’ category in the analysis.

I think that this may beg the question – why don’t we know their current education level? It could be that that older bank clients who set up their accounts prior to modern-day technology may not have had as many information tracked upon initially creating their accounts and haven’t updated their profile, etc. with this pertinent information. Is this a voluntarily collected field. Was it voluntary beforehand and then switched to being mandatory? These questions represent business logic that may suggest that ‘unknown’ education values may have some level of semantic meaning, and may be worth including in any modeling approaches.

Additionally, the data set doesn’t have great documentation on what these education levels mean. I suspect that primary represents those completing their degree up through Middle School, secondary through high school, and tertiary as college and beyond. This is certainly missing a lot of categories (e.g., missing groups) for bank clients who could fall below or outside of those educational ranges. Imagine a bank client who has a professional graduate degree like a medicine or law degree – that is beyond tertiary and may contain some helpful information about their propensity to subscribe to a term deposit. On the flip side, there are likely folks who didn’t receive any education at all – not even up to the primary level, although that would be rare.

Personally, I wouldn’t argue that this data set is implicitly missing this data or contain any empty groups since the education levels do arrange hierarchically without skipping any levels in an intentional way that reflects the vast majority of the human experience (local to the first world). If we only had clients that had completed up to a primary level of education or up to at tertiary level of education and completely skipped that secondary level, then I would be inclined to believe we have implicit missing values. So yes, we likely have missing groups based on the limited nature of how this variable is described, but we do not have implicit missing values.

Now, from what I’ve seen it appears that missing values are explicitly identified as ‘unknown’ across all columns. I’m not sure if that also tracks across continuous columns as well (e.g., a missing balance or prior contacts variable). We can see that this assumption is validated in the table below, where none of the columns possess a true ‘null’ or ‘missing value’.

# Computing Null Values Across Columns
bank_marketing |>
  summarise(across(
    everything(),
    list(na_count = ~ sum(is.na(.)))
  ))
## # A tibble: 1 × 17
##   age_na_count job_na_count marital_na_count education_na_count default_na_count
##          <int>        <int>            <int>              <int>            <int>
## 1            0            0                0                  0                0
## # ℹ 12 more variables: balance_na_count <int>, housing_na_count <int>,
## #   loan_na_count <int>, contact_na_count <int>, day_na_count <int>,
## #   month_na_count <int>, duration_na_count <int>, campaign_na_count <int>,
## #   pdays_na_count <int>, previous_na_count <int>, poutcome_na_count <int>,
## #   y_na_count <int>

04 - Defining an Outlier Within a Continuous Column

Let’s use the previous column to try and detect some outliers. In this context previous refers to the number of prior contacts from previous campaigns a bank client has been solicited from. I’ve gathered some quick summary statistics and we can see that:

This tells us that the distribution on the previous column is heavily skewed to the right and we may have some potential outliers or anomalies in the data.

# Computing Summary Statistics on Previous
bank_marketing |>
  summarise(
    prev_min = min(previous),
    prev_med = median(previous),
    prev_mean = mean(previous),
    prev_max = max(previous),
    prev_iqr = IQR(previous)
  )
## # A tibble: 1 × 5
##   prev_min prev_med prev_mean prev_max prev_iqr
##      <dbl>    <dbl>     <dbl>    <dbl>    <dbl>
## 1        0        0     0.580      275        0

Using Interquartile Range to Identify Outliers

Let’s first investigate by looking at detecting outliers using a quick and simple Interquartile Range approach – where any values not falling with 1.5x the interquartile range are considered outliers, where the interquartile range is considered as values lying between 25% - 75% of all data values.

# Constructing IQR plot for Previous

bank_marketing |>
  ggplot(mapping = aes(x = previous, y = "")) +
  geom_boxplot() +
  theme_classic() +
  labs(
    x = "Number of Prior Contacts (Previous)",
    y = "All Bank clients",
    title = "IQR Plot of Previous Field"
  )

In the plot above, we can see that there are several outliers that fall outside the IQR range. Although, there really does appear to be one extraordinary point where the number of prior contacts is 275. This may actually be a data anomaly or inaccuracy. We can imagine that perhaps the bank accidentally solicited a client multiple times on the same day in error or something along those lines. We can’t really know without knowing the underlying time distribution behind these values. Either way, let’s remove all values or at least attempt to – since the IQR function returns a ‘0’ since so many of the values concentrate around 0. So let’s go ahead and only keep values in the range of 0 - 4. When we do this, we can see that there still are significant outliers per the IQR. Even values of ‘1’ are outside of the IQR range. So all values greater ‘0’ would be classified as outliers – which is a bit ludicrous.

# Constructing Adjusted IQR Plot for Previous (For Previous values 0 - 4)

bank_marketing |>
  filter(previous < 4) |>
  ggplot(mapping = aes(x = previous, y = "")) +
  geom_boxplot() +
  theme_classic() +
  labs(
    x = "Number of Prior Contacts (Previous)",
    y = "Bank Clients with < 5 Previous Contacts",
    title = "Adjusted IQR Plot for Previous"
  )

Using High or Low Percentiles to Determine Outliers

Perhaps a better way to analyze outliers in this context would be to look at values within the one percentile of the data. Data points in this category are probably more likely to be errors such as the 275 identified above. So let’s go ahead and map all the values to identify those in (1) the 99 percentile and (2) 99.9 percentile.

In this context, since 0 is the lowest value in the dataset and it is heavily skewed to the left, we won’t look to the left or values within the 0.01 percentile, since none technically exist for this data field.

# Plotting Outliers Based on Percentile
percentiles_99 <- quantile(bank_marketing$previous, 0.99)
percentiles_999 <- quantile(bank_marketing$previous, 0.999)

bank_marketing |>
  ggplot() +
  geom_jitter(mapping = aes(x = previous, y = "")) +
  geom_vline(mapping = aes(xintercept = percentiles_99, color = "99% percentile")) +
  geom_vline(mapping = aes(xintercept = percentiles_999, color = "99.9% percentile")) +
  theme_classic() +
  labs(
    x = "Number of Prior Contacts (Previous)",
    y = "All Bank Clients",
    title = "Outliers in Previous Field"
  )

Here we can see that there are quite a few points past the 99% percentile, but very few past the 99.9% percentile. In this context, again the 275 number seems absolutely wild and I feel pretty confident that it may be an error.

Generally speaking, these all could be valid values that provide useful information about our dataset and these values arguably shouldn’t be removed unless there is a specific business reason or context. In this case, given how extreme some of the values are – especially those in the 99.9 percentile, I would feel comfortable removing them because they genuinely could be caused by an error in the banks CRM system or something of that sort. What I would need to do to be sure whether to make this determination is to investigate the underlying data rows and determine whether these are actually real previous contacts.