# Based on information in the "bank-additional-names.txt" file
column_types <- list(
  age = "n", job = "f", marital = "f", education = "f", default = "f",
  housing = "f", loan = "f", contact = "f", month = "f", day_of_week = "f",
  duration = "n", campaign = "n", pdays = "n", previous = "n", poutcome = "f",
  emp.var.rate = "d", cons.price.idx = "d", cons.conf.idx = "d",
  euribor3m = "d", nr.employed = "d"
)
bank_add_full <- read_delim("bank-additional/bank-additional-full.csv", delim = ";", col_types = column_types)
bank_add <- read_delim("bank-additional/bank-additional.csv", delim = ";", col_types = column_types)
head(bank_add_full)
## # A tibble: 6 × 21
## age job marital education default housing loan contact month day_of_week
## <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 56 house… married basic.4y no no no teleph… may mon
## 2 57 servi… married high.sch… unknown no no teleph… may mon
## 3 37 servi… married high.sch… no yes no teleph… may mon
## 4 40 admin. married basic.6y no no no teleph… may mon
## 5 56 servi… married high.sch… no no yes teleph… may mon
## 6 45 servi… married basic.9y unknown no no teleph… may mon
## # ℹ 11 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <fct>, emp.var.rate <dbl>, cons.price.idx <dbl>,
## # cons.conf.idx <dbl>, euribor3m <dbl>, nr.employed <dbl>, y <chr>
Every field is populated, so no data is missing in the technical sense. However, the “bank-additional-names.txt” file notes that several of the categorical columns use an “unknown” level and that the pdays column uses 999 to indicate no previous contact, both of which are effectively missing values.
bank_add_full |>
summarize(across(everything(), ~ sum(is.na(.)))) |>
kable() |>
kable_styling(latex_options = "scale_down")
| age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
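Counting literal NA values shows none, but the pseudo-missing values described in the documentation can be tallied directly. The following is a small sketch using the same tidyverse and kableExtra tools already in use here; the pdays_999 column name is just an illustrative label.
# Sketch: tally "unknown" levels in the factor columns and the 999
# placeholder in pdays (both effectively missing values)
bank_add_full |>
  summarize(across(where(is.factor), ~ sum(. == "unknown")),
            pdays_999 = sum(pdays == 999)) |>
  kable() |>
  kable_styling(latex_options = "scale_down")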
The distributions of most of the categorical variables are largely
unremarkable; most have a majority class, but the imbalance stays
within reasonable bounds. The default variable is problematic, as
there are virtually no “yes” values, which limits its usefulness. This
is slightly offset by a reasonable number of “unknown” values, which
could arguably serve as a proxy for “yes”, but this cannot be proven
and it dilutes the usefulness of the variable overall. The poutcome
variable has a similar issue, with the majority class being
“nonexistent”; however, the relative balance of the other two levels
will likely still prove useful. The relative imbalance of the month
levels is interesting, but unlikely to cause issues.
categorical <- bank_add_full |>
select(where(is.factor))
cat_plots <- lapply(names(categorical), function(col) {
ggplot(categorical, aes(.data[[col]], after_stat(count))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
})
wrap_plots(cat_plots, ncol=2)
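To put numbers to the two problematic variables flagged above, a quick sketch of their class proportions (dplyr only; the prop column name is illustrative):
# Sketch: class counts and proportions for default and poutcome
bank_add_full |>
  count(default) |>
  mutate(prop = n / sum(n))
bank_add_full |>
  count(poutcome) |>
  mutate(prop = n / sum(n))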
Summary statistics for the numeric variables point to a few
problematic variables. For previous and pdays, a single value (0 and
999, respectively) accounts for the vast majority of entries. The
duration, campaign, and age variables also likely contain outliers,
given the gap between their third quartiles and their maximum values.
numeric <- bank_add_full |>
select(where(is.numeric))
numeric |>
summary()
## age duration campaign pdays
## Min. :17.00 Min. : 0.0 Min. : 1.000 Min. : 0.0
## 1st Qu.:32.00 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0
## Median :38.00 Median : 180.0 Median : 2.000 Median :999.0
## Mean :40.02 Mean : 258.3 Mean : 2.568 Mean :962.5
## 3rd Qu.:47.00 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0
## Max. :98.00 Max. :4918.0 Max. :56.000 Max. :999.0
## previous emp.var.rate cons.price.idx cons.conf.idx
## Min. :0.000 Min. :-3.40000 Min. :92.20 Min. :-50.8
## 1st Qu.:0.000 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## Median :0.000 Median : 1.10000 Median :93.75 Median :-41.8
## Mean :0.173 Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.:0.000 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. :7.000 Max. : 1.40000 Max. :94.77 Max. :-26.9
## euribor3m nr.employed
## Min. :0.634 Min. :4964
## 1st Qu.:1.344 1st Qu.:5099
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
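As a rough check on the outlier claim above, we can count the observations beyond the conventional upper fence (third quartile plus 1.5 times the IQR) for the flagged variables. This is a sketch; the 1.5 multiplier is the usual heuristic, not a threshold taken from the data documentation.
# Sketch: counts above the 1.5 * IQR upper fence for the flagged variables
numeric |>
  summarize(across(c(age, duration, campaign),
                   ~ sum(. > quantile(., 0.75) + 1.5 * IQR(.))))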
Looking at the density plots, we can see that almost none of the
numeric variables have standard distributions, with age and duration
being the exceptions with relatively normal distributions. These
distributional issues can easily be addressed via pre-processing when
using models sensitive to this aspect of the data. Both age and
duration have long tails of outliers, which is not unexpected for
either variable. The campaign variable has a similarly long tail,
although one with an interestingly regular pattern. To address
outliers for models sensitive to them, we can apply transformations
such as Box-Cox (or similar) to normalize the distributions or at
least pull in the outliers. The pdays and previous variables show
particularly problematic distributions, with the vast majority of
values falling in an extremely tight range: almost all pdays entries
are the 999 value indicating no prior contact, and almost all previous
values are 0, indicating the same. As both largely track the same
thing and have very little variance, it is likely that one or both
will not be useful for most models and may be worth dropping from the
dataset.
numeric_density_plots <- lapply(names(numeric), function(col) {
ggplot(numeric, aes(.data[[col]])) +
geom_density(fill="lightblue", alpha=0.8) +
labs(y = "")
})
numeric_box_plots <- lapply(names(numeric), function(col) {
ggplot(numeric, aes(.data[[col]])) +
geom_boxplot() +
labs(y = "")
})
fig_list <- c(rbind(numeric_density_plots, numeric_box_plots))
wrap_plots(fig_list, ncol=2, axis_titles = 'collect_x')
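As a sketch of the pre-processing mentioned above, caret's preProcess() can center, scale, and transform the numeric columns in one step. Box-Cox requires strictly positive values and duration contains zeros, so the Yeo-Johnson variant is shown instead; the pp_yj and numeric_transformed names are illustrative.
# Sketch: center, scale, and Yeo-Johnson transform the numeric columns
# (object names here are illustrative)
pp_yj <- preProcess(as.data.frame(numeric),
                    method = c("center", "scale", "YeoJohnson"))
numeric_transformed <- predict(pp_yj, as.data.frame(numeric))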
Variables with zero or near-zero variance can impact the performance
of many model types. In this dataset the only variable exhibiting this
behavior is pdays, providing further justification for the removal of
this variable.
nearZeroVar(bank_add_full, saveMetrics = TRUE) |>
kable() |>
kable_styling() |>
row_spec(13, background = "lightgrey")
| freqRatio | percentUnique | zeroVar | nzv | |
|---|---|---|---|---|
| age | 1.054713 | 0.1893755 | FALSE | FALSE |
| job | 1.126216 | 0.0291347 | FALSE | FALSE |
| marital | 2.154910 | 0.0097116 | FALSE | FALSE |
| education | 1.278823 | 0.0194231 | FALSE | FALSE |
| default | 3.790625 | 0.0072837 | FALSE | FALSE |
| housing | 1.158630 | 0.0072837 | FALSE | FALSE |
| loan | 5.433739 | 0.0072837 | FALSE | FALSE |
| contact | 1.737836 | 0.0048558 | FALSE | FALSE |
| month | 1.919292 | 0.0242789 | FALSE | FALSE |
| day_of_week | 1.012802 | 0.0121395 | FALSE | FALSE |
| duration | 1.000000 | 3.7486647 | FALSE | FALSE |
| campaign | 1.669063 | 0.1019714 | FALSE | FALSE |
| pdays | 90.371298 | 0.0655531 | FALSE | TRUE |
| previous | 7.797194 | 0.0194231 | FALSE | FALSE |
| poutcome | 8.363829 | 0.0072837 | FALSE | FALSE |
| emp.var.rate | 1.767639 | 0.0242789 | FALSE | FALSE |
| cons.price.idx | 1.161256 | 0.0631252 | FALSE | FALSE |
| cons.conf.idx | 1.161256 | 0.0631252 | FALSE | FALSE |
| euribor3m | 1.097589 | 0.7672138 | FALSE | FALSE |
| nr.employed | 1.902273 | 0.0267068 | FALSE | FALSE |
| y | 7.876724 | 0.0048558 | FALSE | FALSE |
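Acting on this, a short sketch that drops whatever nearZeroVar() flags (here only pdays); the bank_add_reduced name is illustrative.
# Sketch: drop the near-zero-variance column(s) identified above
nzv_cols <- nearZeroVar(bank_add_full, names = TRUE)
bank_add_reduced <- bank_add_full |> select(-all_of(nzv_cols))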
Among the strictly numeric variables we see a few issues with
correlation. As expected, the pdays and previous variables are highly
negatively correlated. Given the other issues with pdays discussed
above, it is clear we should remove that variable and potentially keep
previous, depending on model selection. The “supplemental”
socioeconomic variables in the extended dataset exhibit a high degree
of correlation, which makes sense for economic indicators. In
particular, euribor3m, nr.employed, and emp.var.rate are highly
correlated with enough other variables that they are likely worth
excluding from models that suffer from correlated variables.
corrplot(cor(numeric),
method = "color",
type = "lower",
tl.col = "black",
tl.srt = 45,
diag = FALSE,
addCoef.col = "black",
number.cex = 0.7
)
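caret's findCorrelation() offers a programmatic version of this analysis, suggesting which numeric columns to drop above a chosen absolute-correlation cutoff. The 0.8 cutoff below is an illustrative assumption, not a value taken from the analysis above.
# Sketch: flag numeric predictors above an illustrative |r| cutoff of 0.8
high_corr <- findCorrelation(cor(numeric), cutoff = 0.8, names = TRUE)
high_corr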
If we one hot encode the categorical variables we can get an idea of
how they may be correlated with the other variables. There are some
expected correlations between particular values, such as age and
job.retired, age and marital.single, and various education and job
levels, but few that are problematic. The categorical variables with
binary classes, or with multiple classes but a clear majority or
minority class, naturally show high degrees of negative correlation
between their own levels, so for many of the categorical variables it
is worth considering dropping the most highly correlated level when
one hot encoding. Of particular interest is the high correlation
between housing.unknown and loan.unknown; however, these unknown
values are small minority classes and so are not of particular concern
unless using a model requiring numeric variables, in which case one of
the two should be excluded. The contact.cellular level is fairly
correlated with the economic predictors, and since contact has only
two values it is likely worth removing for models that suffer from
high dimensionality. The levels of the month variable are also fairly
correlated with several of the economic indicators, so it may be worth
exploring combining this column with day_of_week, as both deal with
time and the correlation is not high enough to justify eliminating
month entirely. Lastly, previous is highly correlated with the
poutcome levels, and as it has exhibited other issues discussed above
it is likely worth excluding from most models. The correlations in the
lower right are largely the numeric correlations discussed above.
dummy_bank <- bank_add_full |>
select(-y) |>
dummyVars(formula = "~ .", fullRank = FALSE) |>
predict(newdata = bank_add_full |> select(-y))
corrplot(cor(dummy_bank),
method = "color",
type = "lower",
tl.col = "black",
tl.srt = 45,
diag = FALSE)
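If the redundant reference levels are a concern, dummyVars() can build a full-rank encoding directly, which drops one level per factor and removes the perfectly negatively correlated pairs discussed above. The sketch below mirrors the code above; dummy_full_rank is an illustrative name.
# Sketch: full-rank encoding drops one level per factor
dummy_full_rank <- bank_add_full |>
  select(-y) |>
  dummyVars(formula = "~ .", fullRank = TRUE) |>
  predict(newdata = bank_add_full |> select(-y))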
Of the models covered in the course so far, the most suitable algorithms for this data are Logistic Regression and kNN. The problem, identifying whether a client will subscribe to a term deposit, is a classification task with labeled data, so supervised learning methods are the clear choice. Both Logistic Regression and kNN are highly interpretable, which improves our understanding of the problem space and provides actionable, understandable insight into the relationships between the predictors and the outcome. Penalization techniques such as Elastic Net can be applied to Logistic Regression to perform automatic feature selection, offering further insight into the importance of the variables. Both models are quick to train, although kNN may run into issues with prediction speed as the dataset grows. This downside is offset by the fact that kNN can adapt in near real time simply by adding new observations to the training set, and if prediction speed becomes a problem, older data can be culled to keep the model current. It is worth exploring both methods and selecting the one that performs best on this data, as both have the potential to do well on the problem.
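As a sketch of how both candidates could be fit and compared with caret, the "glmnet" method provides penalized (Elastic Net) logistic regression and "knn" provides k-nearest neighbours. The five-fold cross-validation setup and the object names are illustrative assumptions, not choices made above; the outcome y was read as character, so it is converted to a factor first.
# Illustrative cross-validation setup (not specified in the text)
ctrl <- trainControl(method = "cv", number = 5)
# Elastic Net penalized logistic regression via glmnet
glmnet_fit <- train(y ~ ., data = bank_add_full |> mutate(y = factor(y)),
                    method = "glmnet", trControl = ctrl,
                    preProcess = c("center", "scale"))
# k-nearest neighbours with the same preprocessing
knn_fit <- train(y ~ ., data = bank_add_full |> mutate(y = factor(y)),
                 method = "knn", trControl = ctrl,
                 preProcess = c("center", "scale"))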
Logistic Regression and kNN models are subject to many of the same
data-quality issues. In both cases the data needs to be scaled, and
both would likely benefit from centering as well, although kNN is more
resilient to this issue. Categorical variables will need to be one hot
encoded, as was done in the correlation analysis. Highly correlated
variables should be removed, as both methods are impacted by them;
prime candidates are pdays, previous, housing.unknown or loan.unknown,
contact.cellular, euribor3m, nr.employed, and emp.var.rate, although
it may be worth exploring Principal Component Analysis as well. Doing
so may impact model interpretability, which must be considered when
analyzing any resulting improvement in model accuracy; however, if
some variables are found to have little predictive power, kNN
performance could be significantly improved. It should also be noted
that, according to the “bank-additional-names.txt” file included with
the data, duration “should only be included for benchmark purposes and
should be discarded if the intention is to have a realistic predictive
model”, so it should be dropped for our purposes. It is also likely
worth using filter methods such as chi-square tests, or examining the
coefficient weights of an Elastic Net regression, to remove predictors
with low predictive power, as contact, month, and day_of_week seem
unlikely to influence whether an individual will subscribe.
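Pulling these recommendations together, the following sketch shows one way to implement the pre-processing described above: drop duration, pdays, and previous, one hot encode the remaining factors with a full-rank encoding, and center and scale for kNN and penalized Logistic Regression. The exact set of dropped columns is one reasonable reading of the discussion, and the object names are illustrative.
# Drop the variables flagged above and convert the outcome to a factor
# (column choices and object names here are illustrative)
model_data <- bank_add_full |>
  mutate(y = factor(y)) |>
  select(-duration, -pdays, -previous)
# One hot encode the remaining factors (full rank avoids redundant levels)
predictors <- dummyVars(~ ., data = model_data |> select(-y), fullRank = TRUE) |>
  predict(newdata = model_data |> select(-y)) |>
  as.data.frame()
# Center and scale for distance- and penalty-based models
pp_cs <- preProcess(predictors, method = c("center", "scale"))
model_ready <- predict(pp_cs, predictors) |> mutate(y = model_data$y)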