# Based on information in the "bank-additional-names.txt" file
column_types <- list(
  age = "n", job = "f", marital = "f", education = "f", default = "f",
  housing = "f", loan = "f", contact = "f", month = "f", day_of_week = "f",
  duration = "n", campaign = "n", pdays = "n", previous = "n", poutcome = "f",
  emp.var.rate = "d", cons.price.idx = "d", cons.conf.idx = "d",
  euribor3m = "d", nr.employed = "d"
)
bank_add_full <- read_delim("bank-additional/bank-additional-full.csv", delim = ";", col_types = column_types)
bank_add <- read_delim("bank-additional/bank-additional.csv", delim = ";", col_types = column_types)
head(bank_add_full)
## # A tibble: 6 × 21
## age job marital education default housing loan contact month day_of_week
## <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 56 house… married basic.4y no no no teleph… may mon
## 2 57 servi… married high.sch… unknown no no teleph… may mon
## 3 37 servi… married high.sch… no yes no teleph… may mon
## 4 40 admin. married basic.6y no no no teleph… may mon
## 5 56 servi… married high.sch… no no yes teleph… may mon
## 6 45 servi… married basic.9y unknown no no teleph… may mon
## # ℹ 11 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <fct>, emp.var.rate <dbl>, cons.price.idx <dbl>,
## # cons.conf.idx <dbl>, euribor3m <dbl>, nr.employed <dbl>, y <chr>
Every field is populated, so no data is missing in the technical sense. However, the “bank-additional-names.txt” file notes that several of the categorical columns use an “unknown” level and that the pdays column uses 999 to indicate no previous contact, both of which are effectively missing values.
bank_add_full |>
summarize(across(everything(), ~ sum(is.na(.)))) |>
kable() |>
kable_styling(latex_options = "scale_down")
| age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
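Counting literal NA values shows none, but the pseudo-missing values described in the documentation can be tallied directly. The following is a small sketch using the same tidyverse and kableExtra tools already in use here; the pdays_999 column name is just an illustrative label.
# Sketch: tally "unknown" levels in the factor columns and the 999
# placeholder in pdays (both effectively missing values)
bank_add_full |>
  summarize(across(where(is.factor), ~ sum(. == "unknown")),
            pdays_999 = sum(pdays == 999)) |>
  kable() |>
  kable_styling(latex_options = "scale_down")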
The distributions of most of the categorical variables are largely
unremarkable; most have a majority class, but the imbalance stays
within reasonable bounds. The default variable is problematic, as
there are virtually no “yes” values, which limits its usefulness. This
is slightly offset by a reasonable number of “unknown” values, which
could arguably serve as a proxy for “yes”, but this cannot be proven
and it dilutes the usefulness of the variable overall. The poutcome
variable has a similar issue, with the majority class being
“nonexistent”; however, the relative balance of the other two levels
will likely still prove useful. The relative imbalance of the month
levels is interesting, but unlikely to cause issues.
categorical <- bank_add_full |>
select(where(is.factor))
cat_plots <- lapply(names(categorical), function(col) {
ggplot(categorical, aes(.data[[col]], after_stat(count))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
})
wrap_plots(cat_plots, ncol=2)
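To put numbers to the two problematic variables flagged above, a quick sketch of their class proportions (dplyr only; the prop column name is illustrative):
# Sketch: class counts and proportions for default and poutcome
bank_add_full |>
  count(default) |>
  mutate(prop = n / sum(n))
bank_add_full |>
  count(poutcome) |>
  mutate(prop = n / sum(n))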
Summary statistics for the numeric variables point to a few
problematic variables. For previous and pdays, a single value (0 and
999, respectively) accounts for the vast majority of entries. The
duration, campaign, and age variables also likely contain outliers,
given the gap between their third quartiles and their maximum values.
numeric <- bank_add_full |>
select(where(is.numeric))
numeric |>
summary()
## age duration campaign pdays
## Min. :17.00 Min. : 0.0 Min. : 1.000 Min. : 0.0
## 1st Qu.:32.00 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0
## Median :38.00 Median : 180.0 Median : 2.000 Median :999.0
## Mean :40.02 Mean : 258.3 Mean : 2.568 Mean :962.5
## 3rd Qu.:47.00 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0
## Max. :98.00 Max. :4918.0 Max. :56.000 Max. :999.0
## previous emp.var.rate cons.price.idx cons.conf.idx
## Min. :0.000 Min. :-3.40000 Min. :92.20 Min. :-50.8
## 1st Qu.:0.000 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## Median :0.000 Median : 1.10000 Median :93.75 Median :-41.8
## Mean :0.173 Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.:0.000 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. :7.000 Max. : 1.40000 Max. :94.77 Max. :-26.9
## euribor3m nr.employed
## Min. :0.634 Min. :4964
## 1st Qu.:1.344 1st Qu.:5099
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
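As a rough check on the outlier claim above, we can count the observations beyond the conventional upper fence (third quartile plus 1.5 times the IQR) for the flagged variables. This is a sketch; the 1.5 multiplier is the usual heuristic, not a threshold taken from the data documentation.
# Sketch: counts above the 1.5 * IQR upper fence for the flagged variables
numeric |>
  summarize(across(c(age, duration, campaign),
                   ~ sum(. > quantile(., 0.75) + 1.5 * IQR(.))))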
Looking at the density plots, we can see that almost none of the
numeric variables have standard distributions, with age and duration
being the exceptions with relatively normal distributions. These
distributional issues can easily be addressed via pre-processing when
using models sensitive to this aspect of the data. Both age and
duration have long tails of outliers, which is not unexpected for
either variable. The campaign variable has a similarly long tail,
although one with an interestingly regular pattern. To address
outliers for models sensitive to them, we can apply transformations
such as Box-Cox (or similar) to normalize the distributions or at
least pull in the outliers. The pdays and previous variables show
particularly problematic distributions, with the vast majority of
values falling in an extremely tight range: almost all pdays entries
are the 999 value indicating no prior contact, and almost all previous
values are 0, indicating the same. As both largely track the same
thing and have very little variance, it is likely that one or both
will not be useful for most models and may be worth dropping from the
dataset.
numeric_density_plots <- lapply(names(numeric), function(col) {
ggplot(numeric, aes(.data[[col]])) +
geom_density(fill="lightblue", alpha=0.8) +
labs(y = "")
})
numeric_box_plots <- lapply(names(numeric), function(col) {
ggplot(numeric, aes(.data[[col]])) +
geom_boxplot() +
labs(y = "")
})
fig_list <- c(rbind(numeric_density_plots, numeric_box_plots))
wrap_plots(fig_list, ncol=2, axis_titles = 'collect_x')
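As a sketch of the pre-processing mentioned above, caret's preProcess() can center, scale, and transform the numeric columns in one step. Box-Cox requires strictly positive values and duration contains zeros, so the Yeo-Johnson variant is shown instead; the pp_yj and numeric_transformed names are illustrative.
# Sketch: center, scale, and Yeo-Johnson transform the numeric columns
# (object names here are illustrative)
pp_yj <- preProcess(as.data.frame(numeric),
                    method = c("center", "scale", "YeoJohnson"))
numeric_transformed <- predict(pp_yj, as.data.frame(numeric))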
Variables with zero or near-zero variance can impact the performance
of many model types. In this dataset the only variable exhibiting this
behavior is pdays, providing further justification for the removal of
this variable.
nearZeroVar(bank_add_full, saveMetrics = TRUE) |>
kable() |>
kable_styling() |>
row_spec(13, background = "lightgrey")
| freqRatio | percentUnique | zeroVar | nzv | |
|---|---|---|---|---|
| age | 1.054713 | 0.1893755 | FALSE | FALSE |
| job | 1.126216 | 0.0291347 | FALSE | FALSE |
| marital | 2.154910 | 0.0097116 | FALSE | FALSE |
| education | 1.278823 | 0.0194231 | FALSE | FALSE |
| default | 3.790625 | 0.0072837 | FALSE | FALSE |
| housing | 1.158630 | 0.0072837 | FALSE | FALSE |
| loan | 5.433739 | 0.0072837 | FALSE | FALSE |
| contact | 1.737836 | 0.0048558 | FALSE | FALSE |
| month | 1.919292 | 0.0242789 | FALSE | FALSE |
| day_of_week | 1.012802 | 0.0121395 | FALSE | FALSE |
| duration | 1.000000 | 3.7486647 | FALSE | FALSE |
| campaign | 1.669063 | 0.1019714 | FALSE | FALSE |
| pdays | 90.371298 | 0.0655531 | FALSE | TRUE |
| previous | 7.797194 | 0.0194231 | FALSE | FALSE |
| poutcome | 8.363829 | 0.0072837 | FALSE | FALSE |
| emp.var.rate | 1.767639 | 0.0242789 | FALSE | FALSE |
| cons.price.idx | 1.161256 | 0.0631252 | FALSE | FALSE |
| cons.conf.idx | 1.161256 | 0.0631252 | FALSE | FALSE |
| euribor3m | 1.097589 | 0.7672138 | FALSE | FALSE |
| nr.employed | 1.902273 | 0.0267068 | FALSE | FALSE |
| y | 7.876724 | 0.0048558 | FALSE | FALSE |
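Acting on this, a short sketch that drops whatever nearZeroVar() flags (here only pdays); the bank_add_reduced name is illustrative.
# Sketch: drop the near-zero-variance column(s) identified above
nzv_cols <- nearZeroVar(bank_add_full, names = TRUE)
bank_add_reduced <- bank_add_full |> select(-all_of(nzv_cols))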
Among the strictly numeric variables we see a few issues with
correlation. As expected, the pdays and previous variables are highly
negatively correlated. Given the other issues with pdays discussed
above, it is clear we should remove that variable and potentially keep
previous, depending on model selection. The “supplemental”
socioeconomic variables in the extended dataset exhibit a high degree
of correlation, which makes sense for economic indicators. In
particular, euribor3m, nr.employed, and emp.var.rate are highly
correlated with enough other variables that they are likely worth
excluding from models that suffer from correlated variables.
corrplot(cor(numeric),
method = "color",
type = "lower",
tl.col = "black",
tl.srt = 45,
diag = FALSE,
addCoef.col = "black",
number.cex = 0.7
)
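caret's findCorrelation() offers a programmatic version of this analysis, suggesting which numeric columns to drop above a chosen absolute-correlation cutoff. The 0.8 cutoff below is an illustrative assumption, not a value taken from the analysis above.
# Sketch: flag numeric predictors above an illustrative |r| cutoff of 0.8
high_corr <- findCorrelation(cor(numeric), cutoff = 0.8, names = TRUE)
high_corr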
If we one hot encode the categorical variables we can get an idea of
how they may be correlated with the other variables. There are some
expected correlations between particular values, such as age and
job.retired, age and marital.single, and various education and job
levels, but few that are problematic. The categorical variables with
binary classes, or with multiple classes but a clear majority or
minority class, naturally show high degrees of negative correlation
between their own levels, so for many of the categorical variables it
is worth considering dropping the most highly correlated level when
one hot encoding. Of particular interest is the high correlation
between housing.unknown and loan.unknown; however, these unknown
values are small minority classes and so are not of particular concern
unless using a model requiring numeric variables, in which case one of
the two should be excluded. The contact.cellular level is fairly
correlated with the economic predictors, and since contact has only
two values it is likely worth removing for models that suffer from
high dimensionality. The levels of the month variable are also fairly
correlated with several of the economic indicators, so it may be worth
exploring combining this column with day_of_week, as both deal with
time and the correlation is not high enough to justify eliminating
month entirely. Lastly, previous is highly correlated with the
poutcome levels, and as it has exhibited other issues discussed above
it is likely worth excluding from most models. The correlations in the
lower right are largely the numeric correlations discussed above.
dummy_bank <- bank_add_full |>
select(-y) |>
dummyVars(formula = "~ .", fullRank = FALSE) |>
predict(newdata = bank_add_full |> select(-y))
corrplot(cor(dummy_bank),
method = "color",
type = "lower",
tl.col = "black",
tl.srt = 45,
diag = FALSE)
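If the redundant reference levels are a concern, dummyVars() can build a full-rank encoding directly, which drops one level per factor and removes the perfectly negatively correlated pairs discussed above. The sketch below mirrors the code above; dummy_full_rank is an illustrative name.
# Sketch: full-rank encoding drops one level per factor
dummy_full_rank <- bank_add_full |>
  select(-y) |>
  dummyVars(formula = "~ .", fullRank = TRUE) |>
  predict(newdata = bank_add_full |> select(-y))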
Of the models covered in the course so far, the most suitable algorithms for this data are Logistic Regression and kNN. The problem, identifying whether a client will subscribe to a term deposit, is a classification task with labeled data, so supervised learning methods are the clear choice. Both Logistic Regression and kNN are highly interpretable, which improves our understanding of the problem space and provides actionable, understandable insight into the relationships between the predictors and the outcome. Penalization techniques such as Elastic Net can be applied to Logistic Regression to perform automatic feature selection, offering further insight into the importance of the variables. Both models are quick to train, although kNN may run into issues with prediction speed as the dataset grows. This downside is offset by the fact that kNN can adapt in near real time simply by adding new observations to the training set, and if prediction speed becomes a problem, older data can be culled to keep the model current. It is worth exploring both methods and selecting the one that performs best on this data, as both have the potential to do well on the problem.
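As a sketch of how both candidates could be fit and compared with caret, the "glmnet" method provides penalized (Elastic Net) logistic regression and "knn" provides k-nearest neighbours. The five-fold cross-validation setup and the object names are illustrative assumptions, not choices made above; the outcome y was read as character, so it is converted to a factor first.
# Illustrative cross-validation setup (not specified in the text)
ctrl <- trainControl(method = "cv", number = 5)
# Elastic Net penalized logistic regression via glmnet
glmnet_fit <- train(y ~ ., data = bank_add_full |> mutate(y = factor(y)),
                    method = "glmnet", trControl = ctrl,
                    preProcess = c("center", "scale"))
# k-nearest neighbours with the same preprocessing
knn_fit <- train(y ~ ., data = bank_add_full |> mutate(y = factor(y)),
                 method = "knn", trControl = ctrl,
                 preProcess = c("center", "scale"))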
Logistic Regression and kNN models are subject to many of the same
data-quality issues. In both cases the data needs to be scaled, and
both would likely benefit from centering as well, although kNN is more
resilient to this issue. Categorical variables will need to be one hot
encoded, as was done in the correlation analysis. Highly correlated
variables should be removed, as both methods are impacted by them;
prime candidates are pdays, previous, housing.unknown or loan.unknown,
contact.cellular, euribor3m, nr.employed, and emp.var.rate, although
it may be worth exploring Principal Component Analysis as well. Doing
so may impact model interpretability, which must be considered when
analyzing any resulting improvement in model accuracy; however, if
some variables are found to have little predictive power, kNN
performance could be significantly improved. It should also be noted
that, according to the “bank-additional-names.txt” file included with
the data, duration “should only be included for benchmark purposes and
should be discarded if the intention is to have a realistic predictive
model”, so it should be dropped for our purposes. It is also likely
worth using filter methods such as chi-square tests, or examining the
coefficient weights of an Elastic Net regression, to remove predictors
with low predictive power, as contact, month, and day_of_week seem
unlikely to influence whether an individual will subscribe.
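Pulling these recommendations together, the following sketch shows one way to implement the pre-processing described above: drop duration, pdays, and previous, one hot encode the remaining factors with a full-rank encoding, and center and scale for kNN and penalized Logistic Regression. The exact set of dropped columns is one reasonable reading of the discussion, and the object names are illustrative.
# Drop the variables flagged above and convert the outcome to a factor
# (column choices and object names here are illustrative)
model_data <- bank_add_full |>
  mutate(y = factor(y)) |>
  select(-duration, -pdays, -previous)
# One hot encode the remaining factors (full rank avoids redundant levels)
predictors <- dummyVars(~ ., data = model_data |> select(-y), fullRank = TRUE) |>
  predict(newdata = model_data |> select(-y)) |>
  as.data.frame()
# Center and scale for distance- and penalty-based models
pp_cs <- preProcess(predictors, method = c("center", "scale"))
model_ready <- predict(pp_cs, predictors) |> mutate(y = model_data$y)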