This assignment aims to perform exploratory data analysis of UCI bank marketing datasets and recommend suitable machine learning algorithms to predict whether a client will subscribe to a term deposit.

0. Data input and cleaning

I read the four datasets into dataframes and performed some initial checks on data types, duplicates, and missing data.

Dataset 1: bank-full

url <- 'https://raw.githubusercontent.com/alexandersimon1/Data622/refs/heads/main/Assignment1/bank-full.csv'
bank_full_df <- read_delim(url, delim = ';', show_col_types = FALSE)

This dataset has 45,211 rows (clients) and 17 columns (features), and the data appear to be tidy.

glimpse(bank_full_df)
## Rows: 45,211
## Columns: 17
## $ age       <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

I used the descriptions in the Variables Table on the source data website to rename the columns to make them more descriptive.

bank_full_df <- bank_full_df %>%
  select(age,
         job_type = job, 
         marital_status = marital, 
         highest_education = education,
         credit_is_defaulted = default,
         avg_yearly_balance = balance,
         has_housing_loan = housing,
         has_personal_loan = loan,
         communication_type = contact,
         last_contact_dow = day,
         last_contact_month = month,
         last_contact_duration_sec = duration,
         campaign_contacts = campaign,
         days_since_last_contact = pdays,
         previous_contacts = previous,
         previous_contact_outcome = poutcome,
         has_term_deposit = y)

To facilitate analysis, I also changed the categorical variables from character type to factors and yes/no variables to logical variables.

categorical_variables <- c('job_type', 'marital_status', 'highest_education',
                           'communication_type', 'previous_contact_outcome')

bank_full_df <- bank_full_df %>%
  mutate_at(categorical_variables, factor)
bank_full_df <- bank_full_df %>%
  mutate(
    credit_is_defaulted = ifelse(credit_is_defaulted == 'yes', TRUE, FALSE),
    has_housing_loan = ifelse(has_housing_loan == 'yes', TRUE, FALSE),
    has_personal_loan = ifelse(has_personal_loan == 'yes', TRUE, FALSE),
    has_term_deposit = ifelse(has_term_deposit == 'yes', TRUE, FALSE)
  )

Then I checked for duplicates and missing values (there were none).

sprintf("Duplicate rows in dataset: %d", count_duplicate_rows(bank_full_df))
## [1] "Duplicate rows in dataset: 0"
get_na_columns(bank_full_df)
## [1] "There are no columns with missing values"

Dataset 2: bank

url <- 'https://raw.githubusercontent.com/alexandersimon1/Data622/refs/heads/main/Assignment1/bank.csv'
bank_subset_df <- read_delim(url, delim = ';', show_col_types = FALSE)

This dataset has 4,521 rows (clients) and 17 columns (features), and the data appear to be tidy.

glimpse(bank_subset_df)
## Rows: 4,521
## Columns: 17
## $ age       <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job       <chr> "unemployed", "services", "management", "management", "blue-…
## $ marital   <chr> "married", "married", "single", "married", "married", "singl…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ balance   <dbl> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
## $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
## $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
## $ day       <dbl> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
## $ duration  <dbl> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign  <dbl> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays     <dbl> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous  <dbl> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

As with the bank_full dataset, I renamed some columns and changed some data types.

bank_subset_df <- bank_subset_df %>%
  select(age,
         job_type = job, 
         marital_status = marital, 
         highest_education = education,
         credit_is_defaulted = default,
         avg_yearly_balance = balance,
         has_housing_loan = housing,
         has_personal_loan = loan,
         communication_type = contact,
         last_contact_dow = day,
         last_contact_month = month,
         last_contact_duration_sec = duration,
         campaign_contacts = campaign,
         days_since_last_contact = pdays,
         previous_contacts = previous,
         previous_contact_outcome = poutcome,
         has_term_deposit = y)
bank_subset_df <- bank_subset_df %>%
  mutate_at(categorical_variables, factor)
bank_subset_df <- bank_subset_df %>%
  mutate(
    credit_is_defaulted = ifelse(credit_is_defaulted == 'yes', TRUE, FALSE),
    has_housing_loan = ifelse(has_housing_loan == 'yes', TRUE, FALSE),
    has_personal_loan = ifelse(has_personal_loan == 'yes', TRUE, FALSE),
    has_term_deposit = ifelse(has_term_deposit == 'yes', TRUE, FALSE)
  )

There were no duplicates or missing values in this dataset.

sprintf("Duplicate rows in dataset: %d", count_duplicate_rows(bank_subset_df))
## [1] "Duplicate rows in dataset: 0"
get_na_columns(bank_subset_df)
## [1] "There are no columns with missing values"

Dataset 3: bank-additional-full

url <- 'https://raw.githubusercontent.com/alexandersimon1/Data622/refs/heads/main/Assignment1/bank-additional-full.csv'
bank_additional_full_df <- read_delim(url, delim = ';', show_col_types = FALSE)

This dataset has 41,188 rows (clients) and 21 columns (features), and the data appear to be tidy.

glimpse(bank_additional_full_df)
## Rows: 41,188
## Columns: 21
## $ age            <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job            <chr> "housemaid", "services", "services", "admin.", "service…
## $ marital        <chr> "married", "married", "married", "married", "married", …
## $ education      <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
## $ default        <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
## $ housing        <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
## $ loan           <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
## $ contact        <chr> "telephone", "telephone", "telephone", "telephone", "te…
## $ month          <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
## $ day_of_week    <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
## $ duration       <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate   <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons.conf.idx  <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m      <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr.employed    <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…

As with the other datasets, I made the column names more descriptive.

bank_additional_full_df <- bank_additional_full_df %>%
  select(age,
         job_type = job, 
         marital_status = marital, 
         highest_education = education,
         credit_is_defaulted = default,
         has_housing_loan = housing,
         has_personal_loan = loan,
         communication_type = contact,
         last_contact_dow = day_of_week,
         last_contact_month = month,
         last_contact_duration_sec = duration,
         campaign_contacts = campaign,
         days_since_last_contact = pdays,
         previous_contacts = previous,
         previous_contact_outcome = poutcome,
         employee_variation_rate = emp.var.rate,
         consumer_price_index = cons.price.idx,
         consumer_confidence_index = cons.conf.idx,
         euribor_rate_3m = euribor3m,
         n_employees = nr.employed,
         has_term_deposit = y)

The bank_additional_full dataset has 5 new variables that are not found in the bank_full dataset. In addition, the bank_additional_full dataset does not have an avg_yearly_balance variable.

setdiff(colnames(bank_additional_full_df), colnames(bank_full_df))
## [1] "employee_variation_rate"   "consumer_price_index"     
## [3] "consumer_confidence_index" "euribor_rate_3m"          
## [5] "n_employees"
setdiff(colnames(bank_full_df), colnames(bank_additional_full_df))
## [1] "avg_yearly_balance"

I also coerced categorical variables to be factors and binary (yes/no) variables to be logical.

bank_additional_full_df <- bank_additional_full_df %>%
  mutate_at(categorical_variables, factor)
bank_additional_full_df <- bank_additional_full_df %>%
  mutate(
    credit_is_defaulted = ifelse(credit_is_defaulted == 'yes', TRUE, FALSE),
    has_housing_loan = ifelse(has_housing_loan == 'yes', TRUE, FALSE),
    has_personal_loan = ifelse(has_personal_loan == 'yes', TRUE, FALSE),
    has_term_deposit = ifelse(has_term_deposit == 'yes', TRUE, FALSE)
  )

This dataset had 14 duplicate rows and no missing values.

sprintf("Duplicate rows in dataset: %d", count_duplicate_rows(bank_additional_full_df))
## [1] "Duplicate rows in dataset: 14"
get_na_columns(bank_additional_full_df)
## [1] "There are no columns with missing values"

Dataset 4: bank-additional

url <- 'https://raw.githubusercontent.com/alexandersimon1/Data622/refs/heads/main/Assignment1/bank-additional.csv'
bank_additional_df <- read_delim(url, delim = ';', show_col_types = FALSE)

This dataset has 4,119 rows (clients) and 21 columns (features), and the data appear to be tidy.

glimpse(bank_additional_df)
## Rows: 4,119
## Columns: 21
## $ age            <dbl> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36, 36, 47,…
## $ job            <chr> "blue-collar", "services", "services", "services", "adm…
## $ marital        <chr> "married", "single", "married", "married", "married", "…
## $ education      <chr> "basic.9y", "high.school", "high.school", "basic.9y", "…
## $ default        <chr> "no", "no", "no", "no", "no", "no", "no", "unknown", "n…
## $ housing        <chr> "yes", "no", "yes", "unknown", "yes", "no", "yes", "yes…
## $ loan           <chr> "no", "no", "no", "unknown", "no", "no", "no", "no", "n…
## $ contact        <chr> "cellular", "telephone", "telephone", "telephone", "cel…
## $ month          <chr> "may", "may", "jun", "jun", "nov", "sep", "sep", "nov",…
## $ day_of_week    <chr> "fri", "fri", "wed", "fri", "mon", "thu", "mon", "mon",…
## $ duration       <dbl> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 301, 148,…
## $ campaign       <dbl> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2, 6, 4, 2…
## $ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous       <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate   <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0.1, 1.1,…
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199, 94.199,…
## $ cons.conf.idx  <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5, -42.0,…
## $ euribor3m      <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879, 4.191,…
## $ nr.employed    <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6, 4963.6,…
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…

As with the bank_additional_full dataset, I renamed some columns and changed some data types.

bank_additional_df <- bank_additional_df %>%
  select(age,
         job_type = job, 
         marital_status = marital, 
         highest_education = education,
         credit_is_defaulted = default,
         has_housing_loan = housing,
         has_personal_loan = loan,
         communication_type = contact,
         last_contact_dow = day_of_week,
         last_contact_month = month,
         last_contact_duration_sec = duration,
         campaign_contacts = campaign,
         days_since_last_contact = pdays,
         previous_contacts = previous,
         previous_contact_outcome = poutcome,
         employee_variation_rate = emp.var.rate,
         consumer_price_index = cons.price.idx,
         consumer_confidence_index = cons.conf.idx,
         euribor_rate_3m = euribor3m,
         n_employees = nr.employed,
         has_term_deposit = y)
bank_additional_df <- bank_additional_df %>%
  mutate_at(categorical_variables, factor)
bank_additional_df <- bank_additional_df %>%
  mutate(
    credit_is_defaulted = ifelse(credit_is_defaulted == 'yes', TRUE, FALSE),
    has_housing_loan = ifelse(has_housing_loan == 'yes', TRUE, FALSE),
    has_personal_loan = ifelse(has_personal_loan == 'yes', TRUE, FALSE),
    has_term_deposit = ifelse(has_term_deposit == 'yes', TRUE, FALSE)
  )

This dataset did not have any duplicate rows or missing values.

sprintf("Duplicate rows in dataset: %d", count_duplicate_rows(bank_additional_df))
## [1] "Duplicate rows in dataset: 0"
get_na_columns(bank_additional_df)
## [1] "There are no columns with missing values"

1. Exploratory data analysis

Because the bank and bank_additional datasets are subsets of the full datasets, I only performed EDA on the latter.

1.1. Logical variables

1.1.1. Credit in default

In the bank_full dataset, a little less than 2% of clients have credit in default.

fct_proportions(bank_full_df$credit_is_defaulted) %>%
  kbl() %>%
  kable_styling()
variable Freq
FALSE 98.2
TRUE 1.8

In the bank_additional_full dataset, the proportion is much smaller (0.01%), ie the clients have better credit than those in the other dataset.

fct_proportions(bank_additional_full_df$credit_is_defaulted) %>%
  kbl() %>%
  kable_styling()
variable Freq
FALSE 99.99
TRUE 0.01

1.1.2. Housing loan

In both datasets, approximately half of the clients have a housing loan.

fct_proportions(bank_full_df$has_housing_loan) %>%
  kbl() %>%
  kable_styling()
variable Freq
TRUE 55.58
FALSE 44.42
fct_proportions(bank_additional_full_df$has_housing_loan) %>%
  kbl() %>%
  kable_styling()
variable Freq
TRUE 52.38
FALSE 47.62

1.1.3. Personal loan

In both datasets, most (~85%) clients have a personal loan.

fct_proportions(bank_full_df$has_personal_loan) %>%
  kbl() %>%
  kable_styling()
variable Freq
FALSE 83.98
TRUE 16.02
fct_proportions(bank_additional_full_df$has_personal_loan) %>%
  kbl() %>%
  kable_styling()
variable Freq
FALSE 84.83
TRUE 15.17

1.1.4. Term deposit

In both datasets, most (~88%) clients have a term deposit.

fct_proportions(bank_full_df$has_term_deposit) %>%
  kbl() %>%
  kable_styling()
variable Freq
FALSE 88.3
TRUE 11.7
fct_proportions(bank_additional_full_df$has_term_deposit) %>%
  kbl() %>%
  kable_styling()
variable Freq
FALSE 88.73
TRUE 11.27

1.2. Categorical variables

1.2.1. Job type

In the bank_full dataset, the most common job types are blue collar (21.5%) and management (20.9%) and the least common is student (2.1%).

fct_proportions(bank_full_df$job_type) %>%
  kbl() %>%
  kable_styling()  
variable Freq
blue-collar 21.53
management 20.92
technician 16.80
admin. 11.44
services 9.19
retired 5.01
self-employed 3.49
entrepreneur 3.29
unemployed 2.88
housemaid 2.74
student 2.07
unknown 0.64
ggplot(bank_full_df, aes(x = fct_rev(fct_infreq(job_type)))) + 
  geom_bar(stat = 'count', fill = 'steelblue') +
  coord_flip() +
  labs(y = 'Count', x = 'Job Type') +
  ggtitle('Job types in bank_full dataset') +
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

The distribution of job types in the bank_additional_full dataset is similar. The most common job types are admin (25.3%) and blue collar (22.5%) and the least common is student (2.1%).

fct_proportions(bank_additional_full_df$job_type) %>%
  kbl() %>%
  kable_styling()
variable Freq
admin. 25.30
blue-collar 22.47
technician 16.37
services 9.64
management 7.10
retired 4.18
entrepreneur 3.54
self-employed 3.45
housemaid 2.57
unemployed 2.46
student 2.12
unknown 0.80
ggplot(bank_additional_full_df, aes(x = fct_rev(fct_infreq(job_type)))) + 
  geom_bar(stat = 'count', fill = '#E1BE6A') +
  coord_flip() +
  labs(y = 'Count', x = 'Job Type') +
  ggtitle('Job types in bank_additional_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

1.2.2. Marital status

In the bank_full dataset, most clients (60.2%) are married.

fct_proportions(bank_full_df$marital_status) %>%
  kbl() %>%
  kable_styling()
variable Freq
married 60.19
single 28.29
divorced 11.52
ggplot(bank_full_df, aes(x = fct_rev(fct_infreq(marital_status)))) + 
  geom_bar(stat = 'count', fill = 'steelblue') +
  ylim(0, 30000) +  
  coord_flip() +
  labs(y = 'Count', x = 'Marital Status') +
  ggtitle('Marital status in bank_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

The distribution of marital status in the bank_additional_full dataset is similar. Most clients (60.5%) are married.

fct_proportions(bank_additional_full_df$marital_status) %>%
  kbl() %>%
  kable_styling()
variable Freq
married 60.52
single 28.09
divorced 11.20
unknown 0.19
ggplot(bank_additional_full_df, aes(x = fct_rev(fct_infreq(marital_status)))) + 
  geom_bar(stat = 'count', fill = '#E1BE6A') +
  ylim(0, 30000) +  
  coord_flip() +
  labs(y = 'Count', x = 'Marital Status') +
  ggtitle('Marital status in bank_additional_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

1.2.3. Education level

In the bank_full dataset, most (51.3%) clients completed secondary education.

fct_proportions(bank_full_df$highest_education) %>%
  kbl() %>%
  kable_styling()
variable Freq
secondary 51.32
tertiary 29.42
primary 15.15
unknown 4.11
ggplot(bank_full_df, aes(x = fct_rev(fct_infreq(highest_education)))) + 
  geom_bar(stat = 'count', fill = 'steelblue') +
  ylim(0, 25000) +
  coord_flip() +
  labs(y = 'Count', x = 'Highest Education Level') +
  ggtitle('Education levels in bank_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

The bank_additional_full dataset provides more detailed information about clients’ education level. Specifically, primary is broken down into basic 9y, 6y, and 4y, and there is an additional category for professional coursework.

Unlike the bank_full dataset, the most common education level in the bank_additional_full dataset is university degree (29.5%), which is equivalent to tertiary education in the bank_full dataset.

fct_proportions(bank_additional_full_df$highest_education) %>%
  kbl() %>%
  kable_styling()
variable Freq
university.degree 29.54
high.school 23.10
basic.9y 14.68
professional.course 12.73
basic.4y 10.14
basic.6y 5.56
unknown 4.20
illiterate 0.04
ggplot(bank_additional_full_df, aes(x = fct_rev(fct_infreq(highest_education)))) + 
  geom_bar(stat = 'count', fill = '#E1BE6A') +
  coord_flip() +
  labs(y = 'Count', x = 'Highest Education Level') +
  ggtitle('Education levels in bank_additional_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

1.2.4. Communication type

In both the bank_full and bank_additional_full datasets, the majority of clients are contacted by cell phone.

fct_proportions(bank_full_df$communication_type) %>%
  kbl() %>%
  kable_styling()
variable Freq
cellular 64.77
unknown 28.80
telephone 6.43
ggplot(bank_full_df, aes(x = fct_rev(fct_infreq(communication_type)))) + 
  geom_bar(stat = 'count', fill = 'steelblue') +
  ylim(0, 30000) +
  coord_flip() +
  labs(y = 'Count', x = 'Communication Type') +
  ggtitle('Communication types in bank_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

fct_proportions(bank_additional_full_df$communication_type) %>%
  kbl() %>%
  kable_styling()
variable Freq
cellular 63.47
telephone 36.53
ggplot(bank_additional_full_df, aes(x = fct_rev(fct_infreq(communication_type)))) + 
  geom_bar(stat = 'count', fill = '#E1BE6A') +
  ylim(0, 30000) +
  coord_flip() +
  labs(y = 'Count', x = 'Communication Type') +
  ggtitle('Communication types in bank_additional_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

1.2.5. Outcome of previous contacts

In the bank_full and bank_additional_full datasets, the outcomes of previous contacts is mostly unknown or nonexistent (ie, no previous contact). Among known outcomes, ‘failure’ is more common than ‘success’ (~10% vs 3% of total clients).

fct_proportions(bank_full_df$previous_contact_outcome) %>%
  kbl() %>%
  kable_styling()
variable Freq
unknown 81.75
failure 10.84
other 4.07
success 3.34
ggplot(bank_full_df, aes(x = fct_rev(fct_infreq(previous_contact_outcome)))) + 
  geom_bar(stat = 'count', fill = 'steelblue') +
  ylim(0, 40000) +
  coord_flip() +
  labs(y = 'Count', x = 'Outcome') +
  ggtitle('Outcomes of previous contact in bank_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

fct_proportions(bank_additional_full_df$previous_contact_outcome) %>%
  kbl() %>%
  kable_styling()
variable Freq
nonexistent 86.34
failure 10.32
success 3.33
ggplot(bank_additional_full_df, aes(x = fct_rev(fct_infreq(previous_contact_outcome)))) + 
  geom_bar(stat = 'count', fill = '#E1BE6A') +
  ylim(0, 40000) +
  coord_flip() +
  labs(y = 'Count', x = 'Outcome') +
  ggtitle('Outcomes of previous contact in bank_additional_full dataset') +  
  theme_bw() +
  theme(
    axis.title = element_text(face = "bold")
  )

1.3. Numeric variables

1.3.1. Age

The age distribution is similar in both datasets. In the bank_full dataset, age ranges from 18 to 95 years and the mean (SD) age is 40.9 (10.6) years. In the bank_additional_full dataset, age ranges from 17 to 98 and the mean (SD) age is 40.0 (10.4) years.

Histograms and boxplots show that most clients are between 25 to 60 years old, and ~1% of clients are potential outliers.

Bank_full dataset details

bank_full_df %>%
  summarise(min = min(age), 
            max = max(age),
            mean = round(mean(age), 1),
            sd = round(sd(age), 1),
            median = median(age),
            IQR = IQR(age)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
18 95 40.9 10.6 39 15

The histogram below shows that most clients In the bank_full dataset are between 25 to 60 years old. A boxplot shows that 487 clients (~1% of the total dataset) older than 70.5 years are potential outliers.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(quantile(bank_full_df$age, 0.75) + 1.5 * IQR(bank_full_df$age))
n_outliers <- sum(bank_full_df$age > upper_fence)
sprintf('There are %d potential outliers >%.1f years old', n_outliers, upper_fence)
## [1] "There are 487 potential outliers >70.5 years old"
plot_boxhist(data = bank_full_df, var = bank_full_df$age, 
             xmin = 0, xmax = 100, ymax = 2250, 
             plot_title = 'Distribution of age in bank_full dataset (N=45,211)*\n',
             fill_color = 'steelblue',
             hist_binwidth = 1, upper_fence_xint = 70.5,
             arrow_coord_vec = c(71, 2250, 97, 2250), 
             label_coord_vec = c(84, 2100),
             xlab = 'Age (years)', 
             plot_caption = '',
             plot_ratio_vec = c(1,4))

Bank_additional_full dataset details

bank_additional_full_df %>%
  summarise(min = min(age), 
            max = max(age),
            mean = round(mean(age), 1),
            sd = round(sd(age), 1),
            median = median(age),
            IQR = IQR(age)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
17 98 40 10.4 38 15

The histogram below shows that most clients In the bank_additional_full dataset are between 25 to 60 years old. A boxplot shows that 469 clients (~1% of the total dataset) older than 69.5 years are potential outliers.

upper_fence <- as.numeric(quantile(bank_additional_full_df$age, 0.75) + 
                            1.5 * IQR(bank_additional_full_df$age))
n_outliers <- sum(bank_additional_full_df$age > upper_fence)
sprintf('There are %d potential outliers >%.1f years old', n_outliers, upper_fence)
## [1] "There are 469 potential outliers >69.5 years old"
plot_boxhist(data = bank_additional_full_df, var = bank_additional_full_df$age, 
             xmin = 0, xmax = 100, ymax = 2250, 
             plot_title = 'Distribution of age in bank_additional_full dataset (N=41,188)*\n',
             fill_color = '#E1BE6A',
             hist_binwidth = 1, upper_fence_xint = 69.5,
             arrow_coord_vec = c(71, 2250, 100, 2250), 
             label_coord_vec = c(86, 2100),
             xlab = 'Age (years)', 
             plot_caption = '',
             plot_ratio_vec = c(1,4))

1.3.2. Average yearly balance

In the bank_full dataset, the average yearly balance ranges from -8019 to 102127 (currency unknown). The mean (SD) balance is 1362.3 (3044.8), and the median (IQR) is 448 (1356). A boxplot shows that ~1% of clients have average yearly balances that are potential outliers.

This variable does not exist in the bank_additional_full dataset.

bank_full_df %>%
  summarise(min = min(avg_yearly_balance), 
            max = max(avg_yearly_balance),
            mean = round(mean(avg_yearly_balance), 1),
            sd = round(sd(avg_yearly_balance), 1),
            median = median(avg_yearly_balance),
            IQR = IQR(avg_yearly_balance)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
-8019 102127 1362.3 3044.8 448 1356

The histogram below shows that the distribution of balances In the bank_full dataset is strongly right skewed. A boxplot shows that 4729 clients (~10% of the total dataset) who have average yearly balances greater than 3462 or less than -1962 are potential outliers.

upper_fence <- as.numeric(quantile(bank_full_df$avg_yearly_balance, 0.75) + 
                            1.5 * IQR(bank_full_df$avg_yearly_balance))
n_outliers <- sum(bank_full_df$avg_yearly_balance > upper_fence)
sprintf('There are %d potential outliers with average yearly balance >%.0f', 
        n_outliers, upper_fence)
## [1] "There are 4712 potential outliers with average yearly balance >3462"
# Lower fence = Q1 - 1.5 * IQR
lower_fence <- as.numeric(quantile(bank_full_df$avg_yearly_balance, 0.25) - 
                            1.5 * IQR(bank_full_df$avg_yearly_balance))
n_outliers <- sum(bank_full_df$avg_yearly_balance < lower_fence)
sprintf('There are %d potential outliers with average yearly balance <%.0f', 
        n_outliers, lower_fence)
## [1] "There are 17 potential outliers with average yearly balance <-1962"
plot_boxhist(data = bank_full_df, var = bank_full_df$avg_yearly_balance,
             xmin = -10000, xmax = 120000, ymax = 25000,
             plot_title = paste0('Distribution of average yearly balance ',
                                 'in bank_full dataset (N=45,211)\n'),
             fill_color = 'steelblue',
             hist_binwidth = 1000, upper_fence_xint = 3462,
             arrow_coord_vec = c(3500, 20000, 100000, 20000),
             label_coord_vec = c(55000, 18000),
             xlab = 'Average yearly balance',
             plot_caption = '',
             plot_ratio_vec = c(1,4))

1.3.3. Last contact duration

The distribution of last contact duration is similar in both datasets and have the same range (0 to 4918 seconds). In the bank_full dataset, the mean (SD) duration was 258.2 (257.5) seconds and the median (IQR) was 180 (216) seconds. In the bank_additional_full dataset, the mean (SD) duration was 258.3 (259.3) seconds and the median (IQR) was 180 (217) seconds.

Boxplots show that the duration of last contact for ~7% of clients are potential outliers.

Bank_full dataset details

bank_full_df %>%
  summarise(min = min(last_contact_duration_sec), 
            max = max(last_contact_duration_sec),
            mean = round(mean(last_contact_duration_sec), 1),
            sd = round(sd(last_contact_duration_sec), 1),
            median = median(last_contact_duration_sec),
            IQR = IQR(last_contact_duration_sec)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
0 4918 258.2 257.5 180 216

The histogram below shows that the last contact duration In the bank_full dataset was right skewed. A boxplot shows that 3235 clients (~7% of the total dataset) with duration greater than 643 seconds are potential outliers.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(quantile(bank_full_df$last_contact_duration_sec, 0.75) + 
                            1.5 * IQR(bank_full_df$last_contact_duration_sec))
n_outliers <- sum(bank_full_df$last_contact_duration_sec > upper_fence)
sprintf('There are %d potential outliers with duration >%.0f seconds', 
        n_outliers, upper_fence)
## [1] "There are 3235 potential outliers with duration >643 seconds"
plot_boxhist(data = bank_full_df, var = bank_full_df$last_contact_duration_sec,
             xmin = 0, xmax = 5000, ymax = 10000,
             plot_title = paste0('Distribution of last contact duration ',
                                 'in bank_full dataset (N=45,211)\n'),
             fill_color = 'steelblue',
             hist_binwidth = 60, upper_fence_xint = 643,
             arrow_coord_vec = c(650, 7500, 5000, 7500),
             label_coord_vec = c(2900, 7000),
             xlab = 'Duration of last contact (seconds)',
             plot_caption = '',
             plot_ratio_vec = c(1,4))

Bank_additional_full dataset details

bank_additional_full_df %>%
  summarise(min = min(last_contact_duration_sec), 
            max = max(last_contact_duration_sec),
            mean = round(mean(last_contact_duration_sec), 1),
            sd = round(sd(last_contact_duration_sec), 1),
            median = median(last_contact_duration_sec),
            IQR = IQR(last_contact_duration_sec)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
0 4918 258.3 259.3 180 217

The histogram below shows that the last contact duration In the bank_additional_full dataset was right skewed. A boxplot shows that 2963 clients (~7% of the total dataset) with duration greater than 644 seconds are potential outliers.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(quantile(bank_additional_full_df$last_contact_duration_sec, 0.75) + 
                            1.5 * IQR(bank_additional_full_df$last_contact_duration_sec))
n_outliers <- sum(bank_additional_full_df$last_contact_duration_sec > upper_fence)
sprintf('There are %d potential outliers with duration >%.0f seconds', 
        n_outliers, upper_fence)
## [1] "There are 2963 potential outliers with duration >644 seconds"
plot_boxhist(data = bank_additional_full_df, 
             var = bank_additional_full_df$last_contact_duration_sec,
             xmin = 0, xmax = 5000, ymax = 10000,
             plot_title = paste0('Distribution of last contact duration ',
                                 'in bank_additional_full dataset (N=41,188)\n'),
             fill_color = '#E1BE6A',
             hist_binwidth = 60, upper_fence_xint = 644,
             arrow_coord_vec = c(650, 7500, 5000, 7500),
             label_coord_vec = c(2900, 7000),
             xlab = 'Duration of last contact (seconds)',
             plot_caption = '',
             plot_ratio_vec = c(1,4))

1.3.4. Number of contacts during this campaign

The distribution of the number of campaign contacts is similar in both datasets. In the bank_full dataset, the number of contacts ranged from 1 to 63, and the mean (SD) was 2.8 (3.1). In the bank_additional_full dataset, the number of contacts ranged from 1 to 56, and the mean (SD) was 2.6 (2.8).

Histograms and boxplots show that most clients have fewer than 6 contacts, and the number of contacts for ~7% of clients are potential outliers.

Bank_full dataset details

bank_full_df %>%
  summarise(min = min(campaign_contacts), 
            max = max(campaign_contacts),
            mean = round(mean(campaign_contacts), 1),
            sd = round(sd(campaign_contacts), 1),
            median = median(campaign_contacts),
            IQR = IQR(campaign_contacts)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
1 63 2.8 3.1 2 2

The histogram below shows that most clients In the bank_full dataset have fewer than 6 contacts. A boxplot shows that 3064 clients (~7% of the total dataset) with more than 6 contacts are potential outliers.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(quantile(bank_full_df$campaign_contacts, 0.75) + 
                            1.5 * IQR(bank_full_df$campaign_contacts))
n_outliers <- sum(bank_full_df$campaign_contacts > upper_fence)
sprintf('There are %d potential outliers with >%.0f contacts', n_outliers, upper_fence)
## [1] "There are 3064 potential outliers with >6 contacts"
plot_boxhist(data = bank_full_df, var = bank_full_df$campaign_contacts,
             xmin = 0, xmax = 70, ymax = 20000,
             plot_title = paste0('Distribution of number of campaign contacts ',
                                 'in bank_full dataset (N=45,211)\n'),
             fill_color = 'steelblue',
             hist_binwidth = 1, upper_fence_xint = 6,
             arrow_coord_vec = c(6, 15000, 65, 15000),
             label_coord_vec = c(37, 14000),
             xlab = 'Number of campaign contacts',
             plot_caption = '',
             plot_ratio_vec = c(1,4))

Bank_additional_full dataset details

bank_additional_full_df %>%
  summarise(min = min(campaign_contacts), 
            max = max(campaign_contacts),
            mean = round(mean(campaign_contacts), 1),
            sd = round(sd(campaign_contacts), 1),
            median = median(campaign_contacts),
            IQR = IQR(campaign_contacts)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
1 56 2.6 2.8 2 2

The histogram below shows that most clients In the bank_additional_full dataset have fewer than 6 contacts. A boxplot shows that 2406 clients (~7% of the total dataset) with more than 6 contacts are potential outliers.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(quantile(bank_additional_full_df$campaign_contacts, 0.75) + 
                            1.5 * IQR(bank_additional_full_df$campaign_contacts))
n_outliers <- sum(bank_additional_full_df$campaign_contacts > upper_fence)
sprintf('There are %d potential outliers with >%.0f contacts', n_outliers, upper_fence)
## [1] "There are 2406 potential outliers with >6 contacts"
plot_boxhist(data = bank_additional_full_df, 
             var = bank_additional_full_df$campaign_contacts,
             xmin = 0, xmax = 70, ymax = 20000,
             plot_title = paste0('Distribution of number of campaign contacts in\n',
                                 'bank_additional_full dataset (N=41,188)\n'),
             fill_color = '#E1BE6A',
             hist_binwidth = 1, upper_fence_xint = 6,
             arrow_coord_vec = c(6, 15000, 56, 15000),
             label_coord_vec = c(30, 14000),
             xlab = 'Number of campaign contacts',
             plot_caption = '',
             plot_ratio_vec = c(1,3))

1.3.5. Days since last contact

The majority of clients (>80%) in the two datasets have not been contacted previously. Among clients with previous contact, the distribution of the number of days since last contact differs between datasets.

In the bank_full dataset, among clients with previous contact, the number of days since last contact ranged from 0 to 871, and the mean (SD) number of days was 224.6 (115.3). Histogram and boxplot analysis showed that in this subset, most clients have been contacted during the past year.

In the bank_additional_full dataset, among clients with previous contact, the spread and variance in the number of days since last contact was less than that in the bank_full dataset (range: 0-27 days, mean [SD]: 6 [3.8]). Similarly, histogram and boxplot analysis showed that most clients were previously contacted within 1-2 weeks. This suggests that the bank_additional_full dataset includes clients who are more responsive to outreach.

Bank_full dataset details

bank_full_previous_contact_df <- bank_full_df %>%
  # -1 is used as a placeholder for clients who have no previous contact
  filter(days_since_last_contact >= 0)
  
no_previous_contact <- nrow(bank_full_df) - nrow(bank_full_previous_contact_df)
total_clients <- nrow(bank_full_df)
sprintf('%.0f (%.1f%%) clients do not have any previous contact', 
        no_previous_contact, (no_previous_contact / total_clients)*100)
## [1] "36954 (81.7%) clients do not have any previous contact"
bank_full_previous_contact_df %>%
  summarise(min = min(days_since_last_contact), 
            max = max(days_since_last_contact),
            mean = round(mean(days_since_last_contact), 1),
            sd = round(sd(days_since_last_contact), 1),
            median = median(days_since_last_contact),
            IQR = IQR(days_since_last_contact)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
1 871 224.6 115.3 194 194

A histogram of the subset of clients In the bank_full dataset who have been previously contacted shows that most were last contacted within the past year. A boxplot shows that 49 clients (~3% of this subset) who were last contacted >618 days ago are potential outliers.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(
  quantile(bank_full_previous_contact_df$days_since_last_contact, 0.75) + 
  1.5 * IQR(bank_full_previous_contact_df$days_since_last_contact))
n_outliers <- sum(bank_full_previous_contact_df$days_since_last_contact > upper_fence)
sprintf('There are %d potential outliers with >%.0f days since last contact', 
        n_outliers, upper_fence)
## [1] "There are 49 potential outliers with >618 days since last contact"
# Boxplot
p1 <- ggplot(bank_full_previous_contact_df, aes(x = days_since_last_contact)) +
  geom_boxplot(fill = 'steelblue') +
  xlim(0, 1000) +
  guides(x = guide_axis(minor.ticks = TRUE, cap = 'upper'),
         y = guide_axis(cap = 'upper')) +
  labs(title = paste0('Distribution of number of days since last contact ',
                      'in bank_full dataset (N=8257)*\n')) +
  theme_classic() +
theme(
  plot.title = element_text(size = 12, face = 'bold'),
  axis.title = element_blank(),
  axis.text = element_blank(),
  axis.ticks = element_blank(),
  axis.line = element_blank()
)

# Histogram
p2 <- ggplot(bank_full_previous_contact_df, aes(x = days_since_last_contact)) +
  geom_histogram(binwidth = 10, fill = 'steelblue', color = 'gray30') +
  geom_vline(xintercept = 618,
             linetype = 'dashed', color = 'gray30') +
  # Label outliers
  annotate('segment', x = 620, y = 600, xend = 870, yend = 600,
           arrow = arrow(ends = 'both'), color = 'gray60') +
  annotate("text", x = 750, y = 550, label = 'Outliers', color = 'gray40') +
  # Label peaks
  annotate('segment', x = 35, y = 700, xend = 80, yend = 650,
           arrow = arrow(length = unit(2, 'mm'), ends = 'last'), color = 'steelblue') +
  annotate("text", x = 45, y = 725, label = '91 days', color = 'steelblue') +  
  annotate('segment', x = 200, y = 700, xend = 250, yend = 750,
           arrow = arrow(length = unit(2, 'mm'), ends = 'first'), color = 'steelblue') +
  annotate("text", x = 310, y = 760, label = '182 days', color = 'steelblue') +
  annotate('segment', x = 375, y = 425, xend = 425, yend = 475,
           arrow = arrow(length = unit(2, 'mm'), ends = 'first'), color = 'steelblue') +
  annotate("text", x = 450, y = 510, label = '370 days', color = 'steelblue') +  
  coord_cartesian(xlim = c(0, 1000), ylim = c(0, 800)) +
  guides(x = guide_axis(minor.ticks = TRUE, cap = 'upper'),
         y = guide_axis(minor.ticks = TRUE, cap = 'upper')) +
  labs(x = 'Number of days since last contact', y = 'Count',
       caption = '\n*Excluding 36,954 clients who have no previous contact.') +
  theme_classic() +
  theme(
    axis.title = element_text(face = "bold"),
    plot.caption = element_text(color = "gray30", hjust = 0)    
  )

# Combine plots
cowplot::plot_grid(p1, p2, ncol = 1, rel_heights = c(1, 4), align = 'v', axis = 'lr')

The highest peaks in the histogram above occur at approximately 91, 182, and 370 days since the last contact, which suggests that most client outreach is performed on a quarterly or semiannual basis.

as.data.frame(table(bank_full_previous_contact_df$days_since_last_contact)) %>%
  filter(Freq >= 90) %>%
  arrange(desc(Freq))
##   Var1 Freq
## 1  182  167
## 2   92  147
## 3   91  126
## 4  183  126
## 5  181  117
## 6  370   99

Bank_additional_full dataset details

bank_addl_full_previous_contact_df <- bank_additional_full_df %>%
  # 999 is used as a placeholder for clients who have no previous contact
  filter(days_since_last_contact != 999)
  
no_previous_contact <- nrow(bank_additional_full_df) - 
  nrow(bank_addl_full_previous_contact_df)
total_clients <- nrow(bank_additional_full_df)
sprintf('%.0f (%.1f%%) clients do not have any previous contact', 
        no_previous_contact, (no_previous_contact / total_clients)*100)
## [1] "39673 (96.3%) clients do not have any previous contact"
bank_addl_full_previous_contact_df %>%
  summarise(min = min(days_since_last_contact), 
            max = max(days_since_last_contact),
            mean = round(mean(days_since_last_contact), 1),
            sd = round(sd(days_since_last_contact), 1),
            median = median(days_since_last_contact),
            IQR = IQR(days_since_last_contact)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
0 27 6 3.8 6 4

A histogram of the subset of clients In the bank_additional_full dataset who have been previously contacted shows that most were last contacted within the past two weeks, with most outreach occurring on days 3 and 6. A boxplot shows that 82 clients (~5% of this subset) with >13 days since last contact are potential outliers.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(
  quantile(bank_addl_full_previous_contact_df$days_since_last_contact, 0.75) + 
  1.5 * IQR(bank_addl_full_previous_contact_df$days_since_last_contact))
n_outliers <- sum(bank_addl_full_previous_contact_df$days_since_last_contact > upper_fence)
sprintf('There are %d potential outliers with >%.0f days since last contact', 
        n_outliers, upper_fence)
## [1] "There are 82 potential outliers with >13 days since last contact"
plot_boxhist(data = bank_addl_full_previous_contact_df, 
             var = bank_addl_full_previous_contact_df$days_since_last_contact,
             xmin = 0, xmax = 30, ymax = 500,
             plot_title = paste0('Distribution of number of days since last contact in\n',
                                 'bank_additional_full dataset (N=1515)*\n'),
             fill_color = '#E1BE6A',
             hist_binwidth = 1, upper_fence_xint = 13,
             arrow_coord_vec = c(13, 400, 27, 400),
             label_coord_vec = c(20, 375),
             xlab = 'Number of days since last contact',
             plot_caption = '\n*Excluding 39,673 clients who have no previous contact.',
             plot_ratio_vec = c(1,3))

1.3.6. Number of contacts during previous campaigns

The majority of clients (>80%) in the two datasets did not have any contacts in previous campaigns. Among clients with contacts, the distribution of the number of contacts differs between datasets.

In the bank_full dataset, among clients with previous contact, the number of contacts during previous campaigns ranged from 1 to 275, and the mean (SD) number of contacts was 3.2 (4.6). Histogram and boxplot analysis showed that in this subset, most clients had less than 8 contacts in previous campaigns, and that the client with 275 contacts was an extreme outlier.

In the bank_additional_full dataset, among clients with contacts during previous campaigns, the spread and variance in the number of contacts was less than that in the bank_full dataset (range: 1-7 contacts, mean [SD]: 1.3 [0.6]). Similarly, histogram and boxplot analysis showed that most clients had 1 contact.

Bank_full dataset details

bank_full_previous_contact_df <- bank_full_df %>%
  filter(previous_contacts > 0)
  
no_previous_contact <- nrow(bank_full_df) - nrow(bank_full_previous_contact_df)
total_clients <- nrow(bank_full_df)
sprintf('%.0f (%.1f%%) clients do not have any contacts during previous campaigns', 
        no_previous_contact, (no_previous_contact / total_clients)*100)
## [1] "36954 (81.7%) clients do not have any contacts during previous campaigns"
bank_full_previous_contact_df %>%
  summarise(min = min(previous_contacts), 
            max = max(previous_contacts),
            mean = round(mean(previous_contacts), 1),
            sd = round(sd(previous_contacts), 1),
            median = median(previous_contacts),
            IQR = IQR(previous_contacts)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
1 275 3.2 4.6 2 3

Among clients In the bank_full dataset who had contacts during previous campaigns, the number of contacts was strongly right skewed. A boxplot showed that 453 clients (~5% of this subset) who had >8 contacts are potential outliers, and that the client with 275 contacts is an extreme outlier.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(quantile(bank_full_previous_contact_df$previous_contacts, 0.75) + 
                            1.5 * IQR(bank_full_previous_contact_df$previous_contacts))
n_outliers <- sum(bank_full_previous_contact_df$previous_contacts > upper_fence)
sprintf('There are %d potential outliers with >%.0f contacts', n_outliers, upper_fence)
## [1] "There are 453 potential outliers with >8 contacts"
plot_boxhist(data = bank_full_previous_contact_df, 
             var = bank_full_previous_contact_df$previous_contacts,
             xmin = 0, xmax = 300, ymax = 4000,
             plot_title = paste0('Distribution of number of contacts during ',
                                 'previous campaigns in\n',
                                 'bank_full dataset (N=8257)*\n'),
             fill_color = 'steelblue',
             hist_binwidth = 3, upper_fence_xint = 8,
             arrow_coord_vec = c(9, 3000, 300, 3000),
             label_coord_vec = c(150, 2800),
             xlab = 'Number of contacts during previous campaigns',
             plot_caption = paste0('\n*Excluding 36,954 clients who had no contacts ',
                                   'during previous campaigns.'),
             plot_ratio_vec = c(1,3))

Bank_additional_full dataset details

bank_addl_full_previous_contact_df <- bank_additional_full_df %>%
  filter(previous_contacts > 0)
  
no_previous_contact <- nrow(bank_additional_full_df) - 
  nrow(bank_addl_full_previous_contact_df)
total_clients <- nrow(bank_additional_full_df)
sprintf('%.0f (%.1f%%) clients do not have any contacts during previous campaigns', 
        no_previous_contact, (no_previous_contact / total_clients)*100)
## [1] "35563 (86.3%) clients do not have any contacts during previous campaigns"
bank_addl_full_previous_contact_df %>%
  summarise(min = min(previous_contacts), 
            max = max(previous_contacts),
            mean = round(mean(previous_contacts), 1),
            sd = round(sd(previous_contacts), 1),
            median = median(previous_contacts),
            IQR = IQR(previous_contacts)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
1 7 1.3 0.6 1 0

Among clients In the bank_additional_full dataset who had contacts during previous campaigns, most had 1 contact. A boxplot showed that 1064 clients (~19% of this subset) with >1 contact are potential outliers.

# Upper fence = Q3 + 1.5 * IQR
upper_fence <- as.numeric(
  quantile(bank_addl_full_previous_contact_df$previous_contacts, 0.75) + 
  1.5 * IQR(bank_addl_full_previous_contact_df$previous_contacts))
n_outliers <- sum(bank_addl_full_previous_contact_df$previous_contacts > upper_fence)
sprintf('There are %d potential outliers with >%.0f contacts', n_outliers, upper_fence)
## [1] "There are 1064 potential outliers with >1 contacts"
plot_boxhist(data = bank_addl_full_previous_contact_df, 
             var = bank_addl_full_previous_contact_df$previous_contacts,
             xmin = 0, xmax = 8, ymax = 5000,
             plot_title = paste0('Distribution of number of contacts during ',
                                 'previous campaigns in\n',
                                 'bank_additional_full dataset (N=5625)*\n'),
             fill_color = '#E1BE6A',
             hist_binwidth = 1, upper_fence_xint = 1,
             arrow_coord_vec = c(2, 4000, 7, 4000),
             label_coord_vec = c(4, 3800),
             xlab = 'Number of contacts during previous campaigns',
             plot_caption = paste0('\n*Excluding 35,563 clients who had no contacts ',
                                   'during previous campaigns.'),
             plot_ratio_vec = c(1,3))

1.3.7. Employee variation rate

This variable does not exist in the bank_full dataset.

In the bank_additional_full dataset, employee variation rate ranges from -3.4 to 1.4, and the median (IQR) is 1.1 (3.2). Although the variable is numeric, there are only 10 different values, the most frequent (39.4%) of which is 1.4 and the least frequent (0.02%) is -0.2.

A boxplot indicates that there are no potential outliers.

bank_additional_full_df %>%
  summarise(min = min(employee_variation_rate), 
            max = max(employee_variation_rate),
            mean = round(mean(employee_variation_rate), 1),
            sd = round(sd(employee_variation_rate), 1),
            median = median(employee_variation_rate),
            IQR = IQR(employee_variation_rate)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
-3.4 1.4 0.1 1.6 1.1 3.2
fct_proportions(bank_additional_full_df$employee_variation_rate) %>%
  kbl() %>%
  kable_styling()
variable Freq
1.4 39.41
-1.8 22.30
1.1 18.85
-0.1 8.94
-2.9 4.04
-3.4 2.60
-1.7 1.88
-1.1 1.54
-3 0.42
-0.2 0.02
upper_fence <- as.numeric(quantile(bank_additional_full_df$employee_variation_rate, 0.75) + 
                            1.5 * IQR(bank_additional_full_df$employee_variation_rate))
n_outliers <- sum(bank_additional_full_df$employee_variation_rate > upper_fence)
sprintf('There are %d potential outliers with rate >%.1f', 
        n_outliers, upper_fence)
## [1] "There are 0 potential outliers with rate >6.2"
# Boxplot
ggplot(bank_additional_full_df, aes(x = employee_variation_rate)) +
  geom_boxplot(fill = '#E1BE6A') +
  xlim(-5, 5) +
  guides(x = guide_axis(minor.ticks = TRUE, cap = 'upper'),
         y = guide_axis(cap = 'upper')) +
  labs(title = paste0('Distribution of employee variation rate in ',
                      'bank_additional_full dataset (N=41,188)\n'),
       x = 'Employee variation rate') +
  theme_classic() +
theme(
  plot.title = element_text(size = 12, face = 'bold'),
  axis.title = element_text(face = 'bold')
)

1.3.8. Consumer price index

This variable does not exist in the bank_full dataset.

In the bank_additional_full dataset, consumer price index ranges from 92.201 to 94.767, and the mean (SD) is 93.576 (0.579). A boxplot indicates that there are no potential outliers.

bank_additional_full_df %>%
  summarise(min = min(consumer_price_index), 
            max = max(consumer_price_index),
            mean = round(mean(consumer_price_index), 3),
            sd = round(sd(consumer_price_index), 3),
            median = median(consumer_price_index),
            IQR = IQR(consumer_price_index)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
92.201 94.767 93.576 0.579 93.749 0.919
upper_fence <- as.numeric(quantile(bank_additional_full_df$consumer_price_index, 0.75) + 
                            1.5 * IQR(bank_additional_full_df$consumer_price_index))
n_outliers <- sum(bank_additional_full_df$consumer_price_index > upper_fence)
sprintf('There are %d potential outliers with consumer price index >%.3f', 
        n_outliers, upper_fence)
## [1] "There are 0 potential outliers with consumer price index >95.373"
plot_boxhist(data = bank_additional_full_df, 
             var = bank_additional_full_df$consumer_price_index,
             xmin = 92, xmax = 95, ymax = 8000,
             plot_title = paste0('Distribution of consumer price index in\n',
                                 'bank_additional_full dataset (N=41,188)'),
             fill_color = '#E1BE6A',
             hist_binwidth = 0.1, upper_fence_xint = 95.373,
             arrow_coord_vec = c(-1, -1, -1, -1),  # no arrow for outliers
             label_coord_vec = c(-1, -1),  # no label for outliers
             xlab = 'Consumer price index',
             plot_caption = '',
             plot_ratio_vec = c(1,4))

1.3.9. Consumer confidence index

This variable does not exist in the bank_full dataset.

In the bank_additional_full dataset, consumer confidence index ranges from -50.8 to -26.9, and the mean (SD) is -40.5 (4.6). A boxplot indicates that 447 values (~1% of the dataset) equal to the maximum are potential outliers.

bank_additional_full_df %>%
  summarise(min = min(consumer_confidence_index), 
            max = max(consumer_confidence_index),
            mean = round(mean(consumer_confidence_index), 1),
            sd = round(sd(consumer_confidence_index), 1),
            median = median(consumer_confidence_index),
            IQR = IQR(consumer_confidence_index)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
-50.8 -26.9 -40.5 4.6 -41.8 6.3

The histogram below shows that the distribution of consumer confidence index In the bank_additional_full dataset is dominated by a few frequent values. A boxplot indicates that 447 values (~1% of the dataset) that are greater than -26.9 are potential outliers.

upper_fence <- as.numeric(quantile(bank_additional_full_df$consumer_confidence_index, 0.75) + 
                            1.5 * IQR(bank_additional_full_df$consumer_confidence_index))
n_outliers <- sum(bank_additional_full_df$consumer_confidence_index > upper_fence)
sprintf('There are %d potential outliers with consumer confidence index >%.1f', 
        n_outliers, upper_fence)
## [1] "There are 447 potential outliers with consumer confidence index >-26.9"
plot_boxhist(data = bank_additional_full_df, 
             var = bank_additional_full_df$consumer_confidence_index,
             xmin = -55, xmax = -20, ymax = 15000,
             plot_title = paste0('Distribution of consumer confidence index in\n',
                                 'bank_additional_full dataset (N=41,188)'),
             fill_color = '#E1BE6A',
             hist_binwidth = 1, upper_fence_xint = -26.9,
             arrow_coord_vec = c(-26.9, 10000, -26.9, 10000),
             label_coord_vec = c(-24, 10000),
             xlab = 'Consumer confidence index',
             plot_caption = '',
             plot_ratio_vec = c(1,3))

1.3.10. 3-month Euro interbank offered rate (eurobor_rate_3m)

This variable does not exist in the bank_full dataset.

In the bank_additional_full dataset, 3-month Euribor rates range from 0.634% to 5.045%. The mean (SD) is 3.6% (1.7%). There are no potential outliers.

bank_additional_full_df %>%
  summarise(min = min(euribor_rate_3m), 
            max = max(euribor_rate_3m),
            mean = round(mean(euribor_rate_3m), 1),
            sd = round(sd(euribor_rate_3m), 1),
            median = median(euribor_rate_3m),
            IQR = IQR(euribor_rate_3m)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
0.634 5.045 3.6 1.7 4.857 3.617

The histogram below shows that 3-month Euribor rates In the bank_additional_full dataset have three large clusters (~0.6% to 1.5%, 4.0% to 4.2%, and 4.9% to 5.0%). The predominance of higher rates (~5%) suggest that interest rates for personal loans and other credit products are also higher in this dataset.

A boxplot indicates that there are no potential outliers.

upper_fence <- as.numeric(quantile(bank_additional_full_df$euribor_rate_3m, 0.75) + 
                            1.5 * IQR(bank_additional_full_df$euribor_rate_3m))
n_outliers <- sum(bank_additional_full_df$euribor_rate_3m > upper_fence)
sprintf('There are %d potential outliers with values >%.1f%%', 
        n_outliers, upper_fence)
## [1] "There are 0 potential outliers with values >10.4%"
plot_boxhist(data = bank_additional_full_df, 
             var = bank_additional_full_df$euribor_rate_3m,
             xmin = 0, xmax = 6, ymax = 15000,
             plot_title = paste0('Distribution of eurobor3m rates in\n',
                                 'bank_additional_full dataset (N=41,188)'),
             fill_color = '#E1BE6A',
             hist_binwidth = 0.1, upper_fence_xint = 10.4,
             arrow_coord_vec = c(-1, -1, -1, -1),  # no arrow for outliers
             label_coord_vec = c(-1, -1),  # no label for outliers
             xlab = '3-month Eurobor rate (%)',
             plot_caption = '',
             plot_ratio_vec = c(1,3))

1.3.11. Number of employees

This variable does not exist in the bank_full dataset.

In the bank_additional_full dataset, the number of employees range from -4963.6 to 5228.1, and the median (IQR) is 5191 (129) employees. Although the variable is numeric, there are only 11 different values, the most frequent (39.4%) of which is 5228.1 and the least frequent (0.02%) is 5176.

It is not clear why some values for this variable. are not integers In practice, the source data for this variable should be verified and corrected as needed. In the absence of verification, values could be rounded to the nearest integer.

There are no potential outliers.

bank_additional_full_df %>%
  summarise(min = min(n_employees), 
            max = max(n_employees),
            mean = round(mean(n_employees), 1),
            sd = round(sd(n_employees), 1),
            median = median(n_employees),
            IQR = IQR(n_employees)) %>%
  kbl() %>%
  kable_styling()
min max mean sd median IQR
4963.6 5228.1 5167 72.3 5191 129
fct_proportions(bank_additional_full_df$n_employees) %>%
  kbl() %>%
  kable_styling()
variable Freq
5228.1 39.41
5099.1 20.72
5191 18.85
5195.8 8.94
5076.2 4.04
5017.5 2.60
4991.6 1.88
5008.7 1.58
4963.6 1.54
5023.5 0.42
5176.3 0.02
upper_fence <- as.numeric(quantile(bank_additional_full_df$n_employees, 0.75) + 
                            1.5 * IQR(bank_additional_full_df$n_employees))
n_outliers <- sum(bank_additional_full_df$n_employees > upper_fence)
sprintf('There are %d potential outliers with >%.1f employees', 
        n_outliers, upper_fence)
## [1] "There are 0 potential outliers with >5421.6 employees"
# Boxplot
ggplot(bank_additional_full_df, aes(x = n_employees)) +
  geom_boxplot(fill = '#E1BE6A') +
  xlim(4800, 5500) +
  guides(x = guide_axis(minor.ticks = TRUE, cap = 'upper'),
         y = guide_axis(cap = 'upper')) +
  labs(title = paste0('Distribution of number of employees in ',
               'bank_additional_full dataset (N=41,188)\n'),
       x = 'Number of employees') +
  theme_classic() +
theme(
  plot.title = element_text(size = 12, face = 'bold'),
  axis.title = element_text(face = 'bold')
)

1.4. Correlation analysis

1.4.1. Categorical variables

In both datasets, Chi-square tests revealed statistically significant associations among categorical variables. Of note, the target variable has_term_deposit is significantly associated with job_type, marital_status, highest_education, communication_type, and previous_contact_outcome (yellow row), which suggests that these features have predictive power for the target variable. However, these features are also significantly associated with each other (red P values), which may indicate redundant information.

Bank_full dataset

# Calculate Chi-square p-values for all pairwise tests of categorical variables
# Adapted from https://stackoverflow.com/questions/72968705/chi-squared-test-of-independence-across-all-variables
bank_catvar_df <- bank_full_df %>%
  select(credit_is_defaulted, has_housing_loan, has_personal_loan, has_term_deposit,
         job_type, marital_status, highest_education, communication_type,
         previous_contact_outcome)

pairwise_chisq <- formatC(outer(bank_catvar_df, bank_catvar_df, 
      Vectorize(\(x, y) chisq.test(table(x, y))$p.value)), format = 'e', digits = 1)
get_upper_triangular_df(pairwise_chisq) %>%
  gt(rownames_to_stub = TRUE) %>%
  # Title
  tab_header(
    title = md('Chi-square test p-values for all pairwise combinations of categorical variables')
  ) %>%
  opt_align_table_header(align = "left") %>%  
  # Highlight cells of interest
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_body(
      columns = c(job_type, marital_status, highest_education, communication_type,
                  previous_contact_outcome),
      rows = 4  # has_term_deposit
    )
  ) %>%
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_stub(rows = 4)
  ) %>%  
  tab_style(
    style = cell_text(color = 'darkred'),
    locations = cells_body(
      columns = c(marital_status, highest_education, communication_type,
                  previous_contact_outcome),
      rows = 5  # job_type
    )
  ) %>%
  tab_style(
    style = cell_text(color = 'darkred'),
    locations = cells_body(
      columns = c(highest_education, communication_type,
                  previous_contact_outcome),
      rows = 6  # marital_status
    )
  ) %>% 
  tab_style(
    style = cell_text(color = 'darkred'),
    locations = cells_body(
      columns = c(communication_type,
                  previous_contact_outcome),
      rows = 7  # highest_education
    )
  ) %>%
  tab_style(
    style = cell_text(color = 'darkred'),
    locations = cells_body(
      columns = c(previous_contact_outcome),
      rows = 8  # communication_type
    )
  ) %>%   
  # boldface column labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_column_labels()
  ) %>%
  # boldface row labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_stub(rows = everything())  
  )
Chi-square test p-values for all pairwise combinations of categorical variables
credit_is_defaulted has_housing_loan has_personal_loan has_term_deposit job_type marital_status highest_education communication_type previous_contact_outcome
credit_is_defaulted NA 2.1e-01 2.9e-60 2.5e-06 8.0e-09 2.3e-04 9.6e-03 1.4e-06 6.5e-16
has_housing_loan NA NA 1.7e-18 2.9e-192 0.0e+00 6.3e-05 3.1e-139 0.0e+00 1.6e-200
has_personal_loan NA NA NA 1.7e-47 5.9e-103 3.3e-27 7.3e-63 2.5e-03 1.0e-29
has_term_deposit NA NA NA NA 3.3e-172 2.1e-43 1.6e-51 1.3e-225 0.0e+00
job_type NA NA NA NA NA 0.0e+00 0.0e+00 0.0e+00 6.1e-97
marital_status NA NA NA NA NA NA 8.2e-286 1.1e-38 1.9e-14
highest_education NA NA NA NA NA NA NA 2.0e-291 2.0e-32
communication_type NA NA NA NA NA NA NA NA 0.0e+00
previous_contact_outcome NA NA NA NA NA NA NA NA NA


Bank_additional_full dataset

bank_catvar_df2 <- bank_additional_full_df %>%
  select(credit_is_defaulted, has_housing_loan, has_personal_loan, has_term_deposit,
         job_type, marital_status, highest_education, communication_type,
         previous_contact_outcome)

pairwise_chisq <- formatC(outer(bank_catvar_df2, bank_catvar_df2, 
      Vectorize(\(x, y) chisq.test(table(x, y))$p.value)), format = 'e', digits = 1)
get_upper_triangular_df(pairwise_chisq) %>%
  gt(rownames_to_stub = TRUE) %>%
  # Title
  tab_header(
    title = md('Chi-square test p-values for all pairwise combinations of categorical variables')
  ) %>%
  opt_align_table_header(align = "left") %>%
  # Highlight cells of interest
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_body(
      columns = c(job_type, marital_status, highest_education, communication_type,
                  previous_contact_outcome),
      rows = 4  # has_term_deposit
    )
  ) %>%
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_stub(rows = 4)
  ) %>%   
  tab_style(
    style = cell_text(color = 'darkred'),
    locations = cells_body(
      columns = c(marital_status, highest_education, communication_type,
                  previous_contact_outcome),
      rows = 5  # job_type
    )
  ) %>%
  tab_style(
    style = cell_text(color = 'darkred'),
    locations = cells_body(
      columns = c(highest_education, communication_type,
                  previous_contact_outcome),
      rows = 6  # marital_status
    )
  ) %>% 
  tab_style(
    style = cell_text(color = 'darkred'),
    locations = cells_body(
      columns = c(communication_type,
                  previous_contact_outcome),
      rows = 7  # highest_education
    )
  ) %>%
  tab_style(
    style = cell_text(color = 'darkred'),
    locations = cells_body(
      columns = c(previous_contact_outcome),
      rows = 8  # communication_type
    )
  ) %>%   
  # boldface column labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_column_labels()
  ) %>%
  # boldface row labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_stub(rows = everything())  
  )
Chi-square test p-values for all pairwise combinations of categorical variables
credit_is_defaulted has_housing_loan has_personal_loan has_term_deposit job_type marital_status highest_education communication_type previous_contact_outcome
credit_is_defaulted NA 9.3e-01 1.0e+00 1.0e+00 6.7e-02 5.8e-01 2.6e-01 4.7e-01 4.1e-01
has_housing_loan NA NA 1.6e-30 1.8e-02 1.0e-02 1.5e-02 7.6e-04 8.4e-66 3.9e-06
has_personal_loan NA NA NA 3.8e-01 1.8e-02 6.2e-01 2.4e-01 7.0e-03 1.0e+00
has_term_deposit NA NA NA NA 4.2e-199 2.1e-26 3.3e-38 1.5e-189 0.0e+00
job_type NA NA NA NA NA 0.0e+00 0.0e+00 1.3e-139 4.0e-163
marital_status NA NA NA NA NA NA 0.0e+00 1.2e-46 8.7e-23
highest_education NA NA NA NA NA NA NA 1.7e-132 4.0e-27
communication_type NA NA NA NA NA NA NA NA 0.0e+00
previous_contact_outcome NA NA NA NA NA NA NA NA NA


1.4.2. Numeric variables

Bank_full dataset

There is a moderate positive correlation (r=0.45) between the number of contacts during previous campaigns (previous_contacts) and the number of days since the last contact (days_since_last_contact).

No other numeric variables are strongly correlated.

bank_full_numeric_df <- bank_full_df %>%
  select(age, avg_yearly_balance, last_contact_duration_sec, campaign_contacts,
         days_since_last_contact, previous_contacts)

cor_mat <- cor(bank_full_numeric_df)
corrplot(cor_mat, type = 'upper', diag = FALSE, tl.col = 'black', tl.srt = 45)

get_upper_triangular_df(cor_mat) %>%
  gt(rownames_to_stub = TRUE) %>%
  # Highlight cell of interest
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_body(
      columns = previous_contacts,
      rows = previous_contacts > 0.4
    )
  ) %>%
  # boldface column labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_column_labels()
  ) %>%
  # boldface row labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_stub(rows = everything())  
  )
age avg_yearly_balance last_contact_duration_sec campaign_contacts days_since_last_contact previous_contacts
age NA 0.09778274 -0.004648428 0.004760312 -0.023758014 0.001288319
avg_yearly_balance NA NA 0.021560380 -0.014578279 0.003435322 0.016673637
last_contact_duration_sec NA NA NA -0.084569503 -0.001564770 0.001203057
campaign_contacts NA NA NA NA -0.088627668 -0.032855290
days_since_last_contact NA NA NA NA NA 0.454819635
previous_contacts NA NA NA NA NA NA


Bank_additional_full dataset

Several variables in this dataset have moderate to strong correlations.

The 3-month Euribor rate is nearly perfectly linearly correlated with employee variation rate (r=0.97) and number of employees (r=0.95). The employee variation rate is also strongly correlated with number of employees (r=0.91). In addition, the consumer price index has strong positive correlations with employee variation rate (r=0.78) and the 3-month Euribor rate (r=0.69).

Similar to the bank_full dataset, the number of previous contacts is moderately correlated with the number of days since last contact; however, the correlation is negative in the bank_additional_full dataset (r=-0.59).

The number of previous contacts is also moderately negatively correlated with the number of employees (r=-0.50), 3-month Euribor rate (r=-0.45), and employee variation rate (r=-0.42).

bank_additional_full_numeric_df <- bank_additional_full_df %>%
  select(age, last_contact_duration_sec, campaign_contacts, days_since_last_contact,
         previous_contacts, employee_variation_rate, consumer_price_index, consumer_confidence_index,
         euribor_rate_3m, n_employees)

cor_mat <- cor(bank_additional_full_numeric_df)
corrplot(cor_mat, type = 'upper', diag = FALSE, tl.col = 'black', tl.srt = 45)

get_upper_triangular_df(cor_mat) %>%
  gt(rownames_to_stub = TRUE) %>%
  # Highlight cells of interest
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_body(
      columns = euribor_rate_3m,
      rows = euribor_rate_3m > 0.9
    )
  ) %>%
  tab_style(
    style = cell_fill(color = 'wheat'),
    locations = cells_body(
      columns = n_employees,
      rows = n_employees > 0.9
    )
  ) %>%
  # Boldface column labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_column_labels()
  ) %>%
  # Boldface row labels
  tab_style(
    style = "font-weight: bold",
    locations = cells_stub(rows = everything())
  )
age last_contact_duration_sec campaign_contacts days_since_last_contact previous_contacts employee_variation_rate consumer_price_index consumer_confidence_index euribor_rate_3m n_employees
age NA -0.000865705 0.00459358 -0.03436895 0.02436474 -0.0003706855 0.000856715 0.129371614 0.01076743 -0.01772513
last_contact_duration_sec NA NA -0.07169923 -0.04757702 0.02064035 -0.0279678845 0.005312268 -0.008172873 -0.03289666 -0.04470322
campaign_contacts NA NA NA 0.05258357 -0.07914147 0.1507538056 0.127835912 -0.013733099 0.13513251 0.14409489
days_since_last_contact NA NA NA NA -0.58751386 0.2710041743 0.078889109 -0.091342354 0.29689911 0.37260474
previous_contacts NA NA NA NA NA -0.4204891094 -0.203129967 -0.050936351 -0.45449365 -0.50133293
employee_variation_rate NA NA NA NA NA NA 0.775334171 0.196041268 0.97224467 0.90697010
consumer_price_index NA NA NA NA NA NA NA 0.058986182 0.68823011 0.52203398
consumer_confidence_index NA NA NA NA NA NA NA NA 0.27768622 0.10051343
euribor_rate_3m NA NA NA NA NA NA NA NA NA 0.94515443
n_employees NA NA NA NA NA NA NA NA NA NA


2. Algorithm selection

Determination of whether a client will subscribe to a term deposit (labeled yes/no) is a supervised binary classification problem. Of the classification algorithms that have been presented so far, I propose using logistic regression and decision trees. These algorithms are simple, efficient, scalable, and explainable, which are important considerations for business applications. In contrast to logistic regression, decision trees are generally robust to outliers, which are present in most numeric variables in the bank marketing datasets. So, if only one algorithm could be chosen, I would select decision trees; however, it may be worth comparing the performance of both classifiers because logistic regression uses a linear decision boundary and decision trees use a nonlinear decision boundary (ie, one may be better than the other for the problem).

The main disadvantage of logistic regression and decision trees is that they may suffer from dimensionality problems in the full datasets with 40,000+ observations (clients) \(\times\) ~20 features. In this case, dimensionality reduction (eg, principal component analysis [PCA]) could be performed before classification. Alternatively, the smaller datasets that have 10% of the observations in the full datasets could be used.

3. Pre-processing

As shown in section 0. Data input and cleaning and also stated on the source data website, these datasets do not have missing data. However, the bank_additional_full dataset has some duplicate rows that need to be removed. In addition, one variable (n_employees) had non-integer values that do not make sense. These data should be verified and corrected as needed before model development. In the absence of verification, values could be rounded to the nearest integer.

Pre-processing that is needed include:

  • Regularization is needed to reduce overfitting with both logistic regression and decision trees.

  • For the full datasets, which have 40,000-45,000 observations (clients), both algorithms may also benefit from dimensionality reduction (eg, PCA).

  • For logistic regression, categorical variables would need to be encoded numerically (eg, by one-hot encoding). Feature scaling is not necessary, but it may improve performance.

  • The target feature (has_term_deposit) is imbalanced in both the bank_full and bank_additional_full datasets (approximately 88% of clients do not have a term deposit; section 1.1.4. Term deposit). Because of this, techniques such as undersampling the majority class would help reduce class imbalance.

  • Feature engineering may also be helpful; however, I don’t have experience in banking (or related fields), so I’m not sure how or what features could be combined to create new features.

4. Summary of findings and implications

Exploratory data analysis of the bank_full and bank_additional_full datasets showed that there are similarities and differences in the number, distribution, and type of variables as well as correlated variables. Of note:

  • The bank_additional_full dataset has 5 numeric variables that are not found in the bank_full dataset, including economic indicators, such as consumer price index and Eurobor interest rates. In contrast, only the bank_full dataset includes clients’ average yearly balance.

  • Compared with the bank_full dataset, the bank_additional_full dataset has a much smaller proportion of clients whose credit is in default, a higher proportion of clients who have a university degree, and much fewer days since last contact (ie, more responsive to outreach). These differences suggest that two different predictive models and/or marketing approaches may be needed.

  • In both datasets, the target variable has_term_deposit is significantly associated with 5 categorical variables (job_type, marital_status, highest_education, communication_type, and previous_contact_outcome), which suggests they may have predictive power. However, these variables are also significantly associated with each other, which suggests that they may be redundant. It is unclear which will have a greater effect on classification performance.

  • All numeric variables in the bank_full dataset have potential outliers. The corresponding variables in the bank_additional_full dataset also have potential outliers. Among the numeric variables that are unique to the latter, only consumer_confidence_index has outliers. Ideally, the classification algorithm will need to be robust to outliers.

  • The bank_additional_full dataset has more correlated numeric variables than the bank_full dataset, mainly among the economic indicator variables (which are not found in the bank_full dataset). Highly correlated variables, especially 3-month Euribor rate + employee variation rate or number of employees, which have Pearson r > 0.95, would be redundant in linear classification models and could be simplified.

Prediction of whether a client will subscribe to a term deposit (labeled yes/no) is a supervised binary classification problem. Of the classification algorithms that have been presented so far, I would choose decision trees, primarily because they are robust to outliers and result in an explainable model. Logistic regression may also be a possibility. Supporting techniques such as dimensionality reduction using PCA, undersampling the majority class (clients who do not have a term deposit), and boosting (for decision trees) may help improve performance.