622 - Exploratory Data Analysis

Author

LFMG

Assignment 1

A Portuguese bank conducted a marketing campaign (phone calls) to get clients to subscribe to a term deposit, and the records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset, predict whether a client will subscribe, and figure out the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign.

Libraries

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.1
Warning: package 'lubridate' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(tibble)
library(readr)
library(janitor)
Warning: package 'janitor' was built under R version 4.5.1

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(skimr)
Warning: package 'skimr' was built under R version 4.5.1
library(naniar)
Warning: package 'naniar' was built under R version 4.5.1

Attaching package: 'naniar'

The following object is masked from 'package:skimr':

    n_complete
library(DataExplorer)
Warning: package 'DataExplorer' was built under R version 4.5.1
library(GGally)
Warning: package 'GGally' was built under R version 4.5.1
library(corrplot)
Warning: package 'corrplot' was built under R version 4.5.1
corrplot 0.95 loaded
library(vip)
Warning: package 'vip' was built under R version 4.5.1

Attaching package: 'vip'

The following object is masked from 'package:utils':

    vi
library(lubridate)
library(knitr)
Warning: package 'knitr' was built under R version 4.5.1
library(kableExtra)
Warning: package 'kableExtra' was built under R version 4.5.1

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

Loading the Data

The dataset has been uploaded to GitHub for easy reproducibility.

url <- "https://raw.githubusercontent.com/Lfirenzeg/msds622/refs/heads/main/bank-additional-full.csv"

The file uses semicolons instead of commas, so we have to account for that. Also, when exploring the CSV some “unknown” values were found, so we’ll turn those into NAs.

uci <- read_delim(url, delim = ";", show_col_types = FALSE, na = c("", "unknown")) |>
  janitor::clean_names()
# entries recorded as "unknown" are read in as NA for easier processing
glimpse(uci)
Rows: 41,188
Columns: 21
$ age            <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
$ job            <chr> "housemaid", "services", "services", "admin.", "service…
$ marital        <chr> "married", "married", "married", "married", "married", …
$ education      <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
$ default        <chr> "no", NA, "no", "no", "no", NA, "no", NA, "no", "no", N…
$ housing        <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
$ loan           <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
$ contact        <chr> "telephone", "telephone", "telephone", "telephone", "te…
$ month          <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
$ day_of_week    <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
$ duration       <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
$ campaign       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
$ previous       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
$ emp_var_rate   <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
$ cons_price_idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
$ cons_conf_idx  <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
$ euribor3m      <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
$ nr_employed    <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
$ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…

The dataset has 41,188 rows and 21 columns, combining client demographics, campaign/contact details, and macroeconomic indicators. The target “y” is currently character and should be treated as a binary factor (no/yes).

Also, some fields need care: pdays uses 999 as a sentinel meaning “no previous contact” (so its distribution is dominated by that value), and duration is only measured after the call happens, so it is informative for EDA but leaky for modeling.
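
A quick check makes both caveats concrete (a small sketch reusing the uci tibble loaded above):

# share of clients with no previous contact (pdays uses 999 as a sentinel)
uci |> summarise(pct_no_prior = mean(pdays == 999))

# duration is only known once the call has happened, and it differs sharply by outcome,
# which is why it is useful for EDA but leaky as a model input
uci |> group_by(y) |> summarise(mean_duration = mean(duration), n = n())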

Exploratory Data Analysis

# For easier processing we'll make y a factor and split numeric vs categorical
uci <- uci |> dplyr::mutate(y = factor(y, levels = c("no", "yes")))
num_vars <- uci |> dplyr::select(where(is.numeric)) |> names()
cat_vars <- setdiff(names(uci), num_vars)

Feature correlations, graph, and insights

# we'll start by computing the correlation matrix among the numeric features
# "pairwise.complete.obs" handles missing values
cor_mat <- uci |>
  dplyr::select(dplyr::all_of(num_vars)) |>
  stats::cor(use = "pairwise.complete.obs")

# for visualization we'll use a heatmap
corrplot::corrplot(cor_mat, method = "color", tl.cex = 0.7, number.cex = 0.6)

# we can clean up the matrix into a table showing the highest correlations
cor_long <- cor_mat |>
  as.data.frame() |>
  rownames_to_column(var = "var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "corr")

# Remove self-pairs and keep each pair once (the upper triangle only)
cor_long_clean <- cor_long |>
  dplyr::filter(var1 != var2) |>
  dplyr::filter(var1 < var2) |>
  dplyr::mutate(abs_corr = abs(corr)) |>
  dplyr::arrange(dplyr::desc(abs_corr))

# Show the top 15 absolute correlations
top15 <- cor_long_clean |> dplyr::slice_head(n = 15)
print(top15)
# A tibble: 15 × 4
   var1           var2           corr abs_corr
   <chr>          <chr>         <dbl>    <dbl>
 1 emp_var_rate   euribor3m     0.972    0.972
 2 euribor3m      nr_employed   0.945    0.945
 3 emp_var_rate   nr_employed   0.907    0.907
 4 cons_price_idx emp_var_rate  0.775    0.775
 5 cons_price_idx euribor3m     0.688    0.688
 6 pdays          previous     -0.588    0.588
 7 cons_price_idx nr_employed   0.522    0.522
 8 nr_employed    previous     -0.501    0.501
 9 euribor3m      previous     -0.454    0.454
10 emp_var_rate   previous     -0.420    0.420
11 nr_employed    pdays         0.373    0.373
12 euribor3m      pdays         0.297    0.297
13 cons_conf_idx  euribor3m     0.278    0.278
14 emp_var_rate   pdays         0.271    0.271
15 cons_price_idx previous     -0.203    0.203

From the heatmap and the table we can see that emp_var_rate, euribor3m, and nr_employed are very highly positively correlated, and cons_price_idx correlates strongly with that same block. We could say these variables co-move with the economic cycle.

On the other hand, pdays vs previous is strongly negative (−0.59), which is expected: pdays sits at the 999 sentinel exactly for clients who were never contacted before (previous = 0).

Everything else seems weakly related: pairs other than the ones mentioned have small correlations, so pairwise linear correlation is modest overall.

Relationship between variables

# We can use a copy of uci to build some relationships to explore.
# To choose them we can think of what would impact the business.
uci2 <- uci |>
  mutate(
    # Age buckets like marketing segments
    age_bucket = cut(age, breaks = c(0, 30, 45, 60, Inf),
                     labels = c("<=30", "31-45", "46-60", "60+"), right = TRUE),

    # Recency flags: pdays == 999 means no prior contact
    had_prior      = if_else(pdays == 999, "no_prior", "prior", missing = "no_prior"),
    recent_contact = if_else(pdays < 7, "recent", "not_recent", missing = "not_recent"),

    # Contact intensity band
    intensity = case_when(campaign <= 2 ~ "low", campaign <= 5 ~ "medium", TRUE ~ "high"),

    # Job × education interaction (socioeconomic proxy)
    edu_job = interaction(education, job, drop = TRUE)
  )
rel_age <- uci2 |> group_by(age_bucket) |> summarise(p_yes = mean(y == "yes"), n = n(), .groups = "drop")
rel_age
# A tibble: 4 × 3
  age_bucket  p_yes     n
  <fct>       <dbl> <int>
1 <=30       0.152   7383
2 31-45      0.0937 21974
3 46-60      0.0956 10921
4 60+        0.455    910

It seems seniors (those above 60) show the highest propensity but are a small group (n = 910), while the under-30 group is also above average. The middle age bands are lowest.

rel_recent <- uci2 |> group_by(recent_contact, intensity) |> summarise(p_yes = mean(y == "yes"), n = n(), .groups = "drop") |> arrange(desc(p_yes))
rel_recent
# A tibble: 6 × 4
  recent_contact intensity  p_yes     n
  <chr>          <chr>      <dbl> <int>
1 recent         low       0.672    912
2 recent         medium    0.627    177
3 recent         high      0.357     28
4 not_recent     low       0.106  27300
5 not_recent     medium    0.0884  9414
6 not_recent     high      0.0524  3357

This clearly shows that recency dominates, while higher contact intensity appears to hurt in either case.
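
A simple bar chart of the same summary makes the interaction easier to see (a sketch reusing the rel_recent table from above):

# subscription rate by recency and contact intensity
rel_recent |>
  mutate(intensity = factor(intensity, levels = c("low", "medium", "high"))) |>
  ggplot(aes(intensity, p_yes, fill = recent_contact)) +
  geom_col(position = "dodge") +
  labs(title = "Subscription rate by recency and contact intensity",
       x = "contact intensity (from campaign)", y = "P(y = yes)")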

rel_edujob <- uci2 |> count(edu_job, name = "n") |> arrange(desc(n)) |> slice_head(n = 15) |>
  left_join(uci2 |> group_by(edu_job) |> summarise(p_yes = mean(y=="yes"), .groups="drop"), by="edu_job") |>
  arrange(desc(p_yes))
rel_edujob
# A tibble: 15 × 3
   edu_job                             n  p_yes
   <fct>                           <int>  <dbl>
 1 basic.4y.retired                  597 0.310 
 2 university.degree.admin.         5753 0.143 
 3 <NA>                             1930 0.140 
 4 university.degree.self-employed   765 0.125 
 5 university.degree.management     2063 0.125 
 6 university.degree.technician     1809 0.124 
 7 high.school.admin.               3329 0.115 
 8 university.degree.entrepreneur    610 0.108 
 9 high.school.blue-collar           878 0.107 
10 professional.course.technician   3320 0.103 
11 high.school.technician            873 0.0974
12 high.school.services             2682 0.0757
13 basic.6y.blue-collar             1426 0.0750
14 basic.9y.blue-collar             3623 0.0662
15 basic.4y.blue-collar             2318 0.0531

Some combos like basic.4y.retired convert at a high rate (31%) but are rare (n = 597), while many professional/degree combos sit around 10–14%.

Feature Distributions

# numeric distributions showing skew and tails
uci |>
  pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value") |>
  ggplot(aes(value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +  # density scale so the overlaid curve is visible
  geom_density() +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Numeric variable distributions")

Duration, campaign, and previous have long right tails: many cases sit at 0–1 and a few values are very large.

The spike in pdays at 999 is the “no prior contact” sentinel, not a real distance in days.

emp_var_rate, euribor3m, nr_employed, cons_price_idx, and cons_conf_idx show clustered regimes instead of smooth bell curves.

Outliers

iqr_tbl <- map_dfr(names(uci |> select(where(is.numeric))), \(v){
  s <- uci[[v]] |> na.omit()
  if (length(s) < 3) return(tibble(variable=v, lower_fence=NA, upper_fence=NA, n_outliers=0, pct_outliers=0))
  q1 <- quantile(s, .25); q3 <- quantile(s, .75); iqr <- IQR(s)
  lo <- q1 - 1.5*iqr; hi <- q3 + 1.5*iqr
  tibble(variable=v,
         lower_fence=lo, upper_fence=hi,
         n_outliers=sum(s<lo | s>hi),
         pct_outliers=round(100*sum(s<lo | s>hi)/length(s), 2))
}) |>
  arrange(desc(pct_outliers))
iqr_tbl
# A tibble: 10 × 5
   variable       lower_fence upper_fence n_outliers pct_outliers
   <chr>                <dbl>       <dbl>      <int>        <dbl>
 1 previous              0            0         5625        13.7 
 2 duration           -224.         644.        2963         7.19
 3 campaign             -2            6         2406         5.84
 4 pdays               999          999         1515         3.68
 5 age                   9.5         69.5        469         1.14
 6 cons_conf_idx       -52.2        -26.9        447         1.09
 7 emp_var_rate         -6.6          6.2          0         0   
 8 cons_price_idx       91.7         95.4          0         0   
 9 euribor3m            -4.08        10.4          0         0   
10 nr_employed        4906.        5422.           0         0   

We see that previous (13.7%) and campaign (5.8%) have many flagged outliers, consistent with the right tails from repeat contacts.

Duration (7.2%) is also flagged, but it is worth remembering that it is measured post-call and should be excluded (or handled carefully) when modeling.

Categorical variables

cat_vars <- uci |> select(where(~ is.character(.x) || is.factor(.x))) |> names()
cat_vars <- setdiff(cat_vars, "y")
for (v in cat_vars) {
  cat("\nRates by", v, "(top 20):\n")
  lv <- uci |>
    count(.data[[v]], name = "n") |>
    arrange(desc(n)) |> slice_head(n = 20) |>
    rename(level = !!sym(v)) |>
    left_join(uci |> group_by(level = .data[[v]]) |> summarise(p_yes = mean(y == "yes"), .groups = "drop"),
              by = "level") |>
    arrange(desc(p_yes))
  print(lv)
}

Rates by job (top 20):
# A tibble: 12 × 4
   level         drop      n  p_yes
   <chr>         <lgl> <int>  <dbl>
 1 student       FALSE   875 0.314 
 2 retired       FALSE  1720 0.252 
 3 unemployed    FALSE  1014 0.142 
 4 admin.        FALSE 10422 0.130 
 5 management    FALSE  2924 0.112 
 6 <NA>          FALSE   330 0.112 
 7 technician    FALSE  6743 0.108 
 8 self-employed FALSE  1421 0.105 
 9 housemaid     FALSE  1060 0.1   
10 entrepreneur  FALSE  1456 0.0852
11 services      FALSE  3969 0.0814
12 blue-collar   FALSE  9254 0.0689

Rates by marital (top 20):
# A tibble: 4 × 4
  level    drop      n p_yes
  <chr>    <lgl> <int> <dbl>
1 <NA>     FALSE    80 0.15 
2 single   FALSE 11568 0.140
3 divorced FALSE  4612 0.103
4 married  FALSE 24928 0.102

Rates by education (top 20):
# A tibble: 8 × 4
  level               drop      n  p_yes
  <chr>               <lgl> <int>  <dbl>
1 illiterate          FALSE    18 0.222 
2 <NA>                FALSE  1731 0.145 
3 university.degree   FALSE 12168 0.137 
4 professional.course FALSE  5243 0.113 
5 high.school         FALSE  9515 0.108 
6 basic.4y            FALSE  4176 0.102 
7 basic.6y            FALSE  2292 0.0820
8 basic.9y            FALSE  6045 0.0782

Rates by default (top 20):
# A tibble: 3 × 4
  level drop      n  p_yes
  <chr> <lgl> <int>  <dbl>
1 no    FALSE 32588 0.129 
2 <NA>  FALSE  8597 0.0515
3 yes   FALSE     3 0     

Rates by housing (top 20):
# A tibble: 3 × 4
  level drop      n p_yes
  <chr> <lgl> <int> <dbl>
1 yes   FALSE 21576 0.116
2 no    FALSE 18622 0.109
3 <NA>  FALSE   990 0.108

Rates by loan (top 20):
# A tibble: 3 × 4
  level drop      n p_yes
  <chr> <lgl> <int> <dbl>
1 no    FALSE 33950 0.113
2 yes   FALSE  6248 0.109
3 <NA>  FALSE   990 0.108

Rates by contact (top 20):
# A tibble: 2 × 4
  level     drop      n  p_yes
  <chr>     <lgl> <int>  <dbl>
1 cellular  FALSE 26144 0.147 
2 telephone FALSE 15044 0.0523

Rates by month (top 20):
# A tibble: 10 × 4
   level drop      n  p_yes
   <chr> <lgl> <int>  <dbl>
 1 mar   FALSE   546 0.505 
 2 dec   FALSE   182 0.489 
 3 sep   FALSE   570 0.449 
 4 oct   FALSE   718 0.439 
 5 apr   FALSE  2632 0.205 
 6 aug   FALSE  6178 0.106 
 7 jun   FALSE  5318 0.105 
 8 nov   FALSE  4101 0.101 
 9 jul   FALSE  7174 0.0905
10 may   FALSE 13769 0.0643

Rates by day_of_week (top 20):
# A tibble: 5 × 4
  level drop      n  p_yes
  <chr> <lgl> <int>  <dbl>
1 thu   FALSE  8623 0.121 
2 tue   FALSE  8090 0.118 
3 wed   FALSE  8134 0.117 
4 fri   FALSE  7827 0.108 
5 mon   FALSE  8514 0.0995

Rates by poutcome (top 20):
# A tibble: 3 × 4
  level       drop      n  p_yes
  <chr>       <lgl> <int>  <dbl>
1 success     FALSE  1373 0.651 
2 failure     FALSE  4252 0.142 
3 nonexistent FALSE 35563 0.0883

Conversion rates vary significantly by job (students and retirees are higher but small groups), marital status (singles are slightly higher), education (a mild gradient favoring higher education), contact channel (cellular converts far better than telephone), calendar effects (the month and day-of-week patterns above), and prior outcome (poutcome = success dominates).

Implications per variable:

  • Job: Target students and retirees with tailored offers.

  • Marital: There’s a modest lift for singles; it can be kept as a simple factor.

  • Education: We should keep education, but perhaps with an ordered encoding or buckets, like basic vs high-school vs higher education (see the sketch after this list).

  • poutcome: This is one of the strongest predictors, so we would use it directly and perhaps also combine it with recency (pdays) and intensity (campaign).
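
One possible encoding for the education bucketing mentioned above (a sketch using forcats; the three bands and their cut-offs are an assumption, not something fixed by the data):

# collapse education into three ordered bands (illustrative grouping)
uci_enc <- uci |>
  mutate(
    edu_band = forcats::fct_collapse(
      factor(education),
      basic     = c("illiterate", "basic.4y", "basic.6y", "basic.9y"),
      secondary = c("high.school", "professional.course"),
      higher    = "university.degree"
    ),
    edu_band = factor(edu_band, levels = c("basic", "secondary", "higher"), ordered = TRUE)
  )

uci_enc |> count(edu_band)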

Central Tendency

num_summary <- uci |> summarise(across(where(is.numeric),
                    list(n = ~sum(!is.na(.x)),
                         mean = ~mean(.x, na.rm = TRUE),
                         median = ~median(.x, na.rm = TRUE),
                         sd = ~sd(.x, na.rm = TRUE),
                         iqr = ~IQR(.x, na.rm = TRUE),
                         min = ~min(.x, na.rm = TRUE),
                         max = ~max(.x, na.rm = TRUE)),
                    # a double-underscore separator avoids clashing with the underscores already in column names
                    .names = "{.col}__{.fn}")) |>
  pivot_longer(everything(), names_to = c("variable", ".value"), names_sep = "__")
Warning: Expected 2 pieces. Additional pieces discarded in 28 rows [36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, ...].
num_summary |> print(n = 50)
# A tibble: 27 × 12
   variable     n    mean median      sd    iqr    min     max      var    price
   <chr>    <int>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl>    <dbl>
 1 age      41188  40.0    38     10.4    15    17       98    NA       NA      
 2 duration 41188 258.    180    259.    217     0     4918    NA       NA      
 3 campaign 41188   2.57    2      2.77    2     1       56    NA       NA      
 4 pdays    41188 962.    999    187.      0     0      999    NA       NA      
 5 previous 41188   0.173   0      0.495   0     0        7    NA       NA      
 6 emp         NA  NA      NA     NA      NA    NA       NA     4.12e+4 NA      
 7 emp         NA  NA      NA     NA      NA    NA       NA     8.19e-2 NA      
 8 emp         NA  NA      NA     NA      NA    NA       NA     1.1 e+0 NA      
 9 emp         NA  NA      NA     NA      NA    NA       NA     1.57e+0 NA      
10 emp         NA  NA      NA     NA      NA    NA       NA     3.2 e+0 NA      
11 emp         NA  NA      NA     NA      NA    NA       NA    -3.4 e+0 NA      
12 emp         NA  NA      NA     NA      NA    NA       NA     1.4 e+0 NA      
13 cons        NA  NA      NA     NA      NA    NA       NA    NA        4.12e+4
14 cons        NA  NA      NA     NA      NA    NA       NA    NA        9.36e+1
15 cons        NA  NA      NA     NA      NA    NA       NA    NA        9.37e+1
16 cons        NA  NA      NA     NA      NA    NA       NA    NA        5.79e-1
17 cons        NA  NA      NA     NA      NA    NA       NA    NA        9.19e-1
18 cons        NA  NA      NA     NA      NA    NA       NA    NA        9.22e+1
19 cons        NA  NA      NA     NA      NA    NA       NA    NA        9.48e+1
20 euribor… 41188   3.62    4.86   1.73    3.62  0.634    5.04 NA       NA      
21 nr          NA  NA      NA     NA      NA    NA       NA    NA       NA      
22 nr          NA  NA      NA     NA      NA    NA       NA    NA       NA      
23 nr          NA  NA      NA     NA      NA    NA       NA    NA       NA      
24 nr          NA  NA      NA     NA      NA    NA       NA    NA       NA      
25 nr          NA  NA      NA     NA      NA    NA       NA    NA       NA      
26 nr          NA  NA      NA     NA      NA    NA       NA    NA       NA      
27 nr          NA  NA      NA     NA      NA    NA       NA    NA       NA      
# ℹ 2 more variables: conf <dbl>, employed <dbl>

Age is concentrated in the working-age range. Duration has a mean of 258s with an IQR of 217s and a strong right skew; combined with its post-call nature, that reinforces excluding it from modeling. Campaign shows the long tail again.

Missing Values

miss_tbl <- uci |>
  summarise(across(everything(), ~sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to="column", values_to="n_missing") |>
  mutate(pct_missing = round(100*n_missing/nrow(uci), 2)) |>
  arrange(desc(pct_missing))
miss_tbl |> print(n = 30)
# A tibble: 21 × 3
   column         n_missing pct_missing
   <chr>              <int>       <dbl>
 1 default             8597       20.9 
 2 education           1731        4.2 
 3 housing              990        2.4 
 4 loan                 990        2.4 
 5 job                  330        0.8 
 6 marital               80        0.19
 7 age                    0        0   
 8 contact                0        0   
 9 month                  0        0   
10 day_of_week            0        0   
11 duration               0        0   
12 campaign               0        0   
13 pdays                  0        0   
14 previous               0        0   
15 poutcome               0        0   
16 emp_var_rate           0        0   
17 cons_price_idx         0        0   
18 cons_conf_idx          0        0   
19 euribor3m              0        0   
20 nr_employed            0        0   
21 y                      0        0   

default has the most missing values (around 21%), then education (around 4%) and housing/loan (around 2% each); everything else has ≤1% missing or none at all.

When working on a model, for categoricals we could either encode a separate “Unknown” level (keeping the signal in case non-response is itself informative) or use simple mode imputation.

For modeling fairness or stability, we’ll report that default is often unreported; keep “Unknown” as a level and monitor its importance.
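
One way to implement the “Unknown” level idea (a sketch applied to all character columns for simplicity; the object name uci_imp is illustrative):

# recode NA in the categorical columns to an explicit "unknown" level
uci_imp <- uci |>
  mutate(across(where(is.character), ~ tidyr::replace_na(.x, "unknown"))) |>
  mutate(across(where(is.character), factor))

uci_imp |> count(default)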

Duplicates

dup_count <- sum(duplicated(uci)); dup_pct <- round(100*dup_count/nrow(uci), 2)
cat("Fully duplicated rows:", dup_count, "(", dup_pct, "%)\n")
Fully duplicated rows: 12 ( 0.03 %)

We find only 12 fully duplicated rows, representing 0.03%, so we can drop them.

Model Selection

Based on the analysis just completed, we have a tabular dataset with mixed categorical and numeric variables, skewed features, moderate class imbalance, some collinearity (the macro variables), and a leakage feature (duration) that we’ll exclude from modeling. The strongest fits are Regularized Logistic Regression and Decision Trees. Among the alternatives, LDA would require care with its assumptions, and Naïve Bayes could also be considered; k-NN is the least attractive here because of scaling and distance-metric issues with mixed, tabular data.

Some Pros and Cons

  • Logistic Regression: It’s fast, stable, produces well-calibrated probabilities, and is very interpretable, which is useful for explaining drivers (such as recency or prior outcome). On the downside, it assumes linear log-odds, needs one-hot encoding and scaling, may miss interactions unless we add them explicitly, and is more sensitive to multicollinearity (although regularization mitigates this).

  • Decision Trees: On the upside, they easily handle nonlinearity and interactions, take factor inputs cleanly, and are easy to explain with rules and feature splits. They are also robust to outliers and monotone transformations. At the same time, a single tree can have high variance and be somewhat less accurate than an ensemble, and it needs cost-complexity pruning and class weights to avoid overfitting.

  • LDA/QDA: These models are simple, fast, and have easily interpretable linear/quadratic boundaries. But LDA assumes normality and equal covariance, which our skewed, mixed data violate, and QDA can overfit with many predictors and imbalanced classes. Both require numeric inputs (categoricals must be encoded) and are sensitive to collinearity.

Having said this, given the nature of the assignment I would suggest using two models: Regularized Logistic Regression as the clear, business-friendly benchmark (easy to explain how recency, prior outcome, channel, etc. affect the odds), and a pruned Decision Tree tuned with class weights to capture the strong interaction patterns (like recent_contact vs intensity) we saw in the EDA. As a pair they balance interpretability (both models are easy to explain) with performance (the tree captures nonlinearity).
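
A minimal sketch of that two-model setup, assuming the glmnet and rpart packages are available and that train_df is a hypothetical cleaned training frame (duration dropped, NAs handled, y as a factor):

library(glmnet)
library(rpart)

# regularized logistic regression: one-hot encode predictors, then cross-validate lambda
x <- model.matrix(y ~ ., data = train_df)[, -1]   # drop the intercept column
cv_logit <- cv.glmnet(x, train_df$y, family = "binomial", alpha = 0.5)  # the elastic-net mix is an assumption

# decision tree: factors can be passed directly; grow, then prune by complexity parameter
tree_fit <- rpart(y ~ ., data = train_df, method = "class",
                  control = rpart.control(cp = 0.001, minsplit = 50))
best_cp  <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)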

Any labels?

Yes. The variable “y” (possible answers no, yes) makes this a supervised binary classification problem. That guided the choice of models that rank leads (logistic, LDA/NB) and a rule-based classifier that can surface actionable splits (the tree). We’ll explicitly exclude duration to avoid leakage.

Model selection against dataset

Both logistic regression and trees support class weights. Trees easily capture interactions, and the logistic model can include the small set of interactions found in the EDA. Trees are robust to heavy tails; for logistic we can cap them (as in campaign/previous) and treat pdays = 999 with a flag plus a cleaned numeric version. To handle multicollinearity, logistic regression uses L1/L2 regularization to stabilize coefficients, while trees are largely unaffected.

Smaller dataset?

With fewer than 1,000 rows I’d prefer Logistic Regression (with stronger regularization) and possibly LDA over trees/k-NN/QDA, since they have lower variance, are more stable in small samples, and remain interpretable.

Pre-Processing

Data Cleaning

  • To handle leakage we’ll drop duration from any modeling pipeline (we keep it for EDA only).
  • Remove the 12 exact duplicates (0.03%).
  • Regarding sentinel values, I’d split pdays into no_prior = (pdays == 999) and pdays_real = if_else(pdays == 999, NA, pdays), so that 999 is never standardized as if it were a real day count.
  • For missing values, we already converted “unknown” entries to NA. For categoricals we can keep an explicit Unknown level; for numerics we can use median imputation given the low rates of missingness. (A consolidated sketch of these steps follows this list.)
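
A consolidated sketch of these cleaning steps, reusing the uci tibble and the Unknown-level recode shown earlier (the pipeline and the object name uci_clean are illustrative):

uci_clean <- uci |>
  distinct() |>                                          # drop the 12 exact duplicates
  mutate(
    no_prior   = pdays == 999,                           # sentinel flag
    pdays_real = if_else(pdays == 999, NA_real_, pdays)  # real day counts only
  ) |>
  select(-duration, -pdays) |>                           # duration is EDA-only; raw pdays is replaced
  mutate(across(where(is.character), ~ tidyr::replace_na(.x, "unknown")),  # explicit Unknown level
         across(where(is.character), factor))
# pdays_real is left as NA here; how to impute or bin it can be decided per model

glimpse(uci_clean)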

Dimensionality Reduction

To address the collinearity among the macro variables (the heatmap showed a tight block of emp_var_rate, euribor3m, and nr_employed, with cons_price_idx close behind):

  • For logistic, mitigate with L2/L1 regularization and optionally drop one of the near duplicates (maybe keep euribor3m and cons_conf_idx, but drop nr_employed, although this can be explored in different scenarios).
  • For trees, no reduction is strictly required (the splits will simply pick among them), but we can still drop a redundant macro variable to simplify the tree.

I’d say no PCA is necessary since the feature count is modest.
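
One way to automate that redundancy check (a sketch using caret::findCorrelation on the cor_mat computed earlier; it assumes the caret package is installed, and the 0.9 cutoff is an arbitrary choice):

# suggest numeric columns to drop when pairwise |correlation| exceeds the cutoff
high_corr <- caret::findCorrelation(cor_mat, cutoff = 0.90, names = TRUE)
high_corr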

Feature Engineering

More could be explored, but some immediate ones we already built during the EDA:

  • Recency and intensity: recent_contact (pdays_real less than 7) and intensity bands from campaign; include their interaction (recency × intensity).
  • Prior outcome: keep poutcome and consider had_prior = (poutcome != “nonexistent”).
  • Age segmentation: Commonly used brackets age_bucket (≤30, 31–45, 46–60, 60+).
  • Socio-economic proxy: simple education vs job interaction (or allow the tree to learn it).
  • Channel: contact retained (cellular vs telephone shows a strong effect).

Sampling Data

With around 41k rows, there should be no scaling issues. We can keep full data for training and use cross-validation.

If we really need faster iteration, we can use a stratified 20–30% subset for model prototyping and validate on the full set.
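
If a prototyping subset is needed, a stratified sample keeps the class mix intact (a sketch; the 25% fraction and the seed are arbitrary):

set.seed(622)
uci_proto <- uci |>
  group_by(y) |>
  slice_sample(prop = 0.25) |>   # same fraction within each class preserves the yes/no ratio
  ungroup()

uci_proto |> count(y) |> mutate(pct = n / sum(n))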

Data Transformation (per model)

  • Categoricals: One-hot encode for logistic (e.g., via model.matrix); for trees, feed factors directly. Perhaps merge ultra-rare levels (like education = illiterate) into an “Other” bin.

  • Numeric scaling: Standardize numerics for logistic (mean 0, sd 1) after winsorizing the tails; no scaling is needed for trees.

  • Heavy tails or outliers: For logistic only, winsorize campaign and previous at, for example, the 99th percentile; keep the raw values for trees (a sketch follows this list).

  • pdays handling: Use the no_prior flag and pdays_real; the point is to never let the 999s influence scaling.
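
A minimal sketch of the winsorize-then-standardize step for the logistic pipeline (the 99th-percentile cap, the helper name, and the choice of columns are assumptions):

# cap a numeric vector at its 99th percentile (illustrative helper)
winsorize_hi <- function(x, p = 0.99) pmin(x, quantile(x, p, na.rm = TRUE))

logit_numeric <- uci |>
  transmute(
    age      = as.numeric(scale(age)),
    campaign = as.numeric(scale(winsorize_hi(campaign))),
    previous = as.numeric(scale(winsorize_hi(previous)))
  )

summary(logit_numeric)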

Imbalanced Data

There is a class imbalance: only 10–12% of the records have y = “yes”.

For this I’d prefer class weights and threshold tuning based on PR-AUC. If sampling is required, we can try stratified undersampling of the majority class or SMOTE applied to the training folds only, but we should always keep an untouched, stratified test set.
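
A sketch of the class-weight idea for both models (weighting each observation inversely to its class frequency; the exact scheme is an assumption, and the commented calls reuse the hypothetical objects from the modeling sketch above):

# weight rows so that "yes" and "no" contribute equally overall
w <- ifelse(uci$y == "yes",
            nrow(uci) / (2 * sum(uci$y == "yes")),
            nrow(uci) / (2 * sum(uci$y == "no")))

# both learners accept per-observation weights, e.g.:
# cv.glmnet(x, train_df$y, family = "binomial", weights = w)
# rpart(y ~ ., data = train_df, method = "class", weights = w)
# after fitting, tune the probability threshold on a validation PR curve rather than defaulting to 0.5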