library(tidyverse)  # includes dplyr, ggplot2, and tidyr
library(corrplot)
After importing the data and converting the character vectors to factors, here is a glimpse of the contents.
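The import step itself is not echoed in this document; a minimal sketch, assuming the semicolon-delimited UCI bank-full.csv file (the file name and delimiter are assumptions):
# Assumed import step: semicolon-delimited UCI bank marketing file
bank <- read.csv("bank-full.csv", sep = ";") %>%
  mutate(across(where(is.character), as.factor))  # character -> factor
glimpse(bank)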
## Rows: 45,211
## Columns: 17
## $ age <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
Taking a field-datatype approach, the numeric variables are reviewed first:
bank %>%
select_if(is.numeric) %>%
summary()
## age balance day duration
## Min. :18.00 Min. : -8019 Min. : 1.00 Min. : 0.0
## 1st Qu.:33.00 1st Qu.: 72 1st Qu.: 8.00 1st Qu.: 103.0
## Median :39.00 Median : 448 Median :16.00 Median : 180.0
## Mean :40.94 Mean : 1362 Mean :15.81 Mean : 258.2
## 3rd Qu.:48.00 3rd Qu.: 1428 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :95.00 Max. :102127 Max. :31.00 Max. :4918.0
## campaign pdays previous
## Min. : 1.000 Min. : -1.0 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
## Median : 2.000 Median : -1.0 Median : 0.0000
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
A graphical look at the distribution of all the numeric variables:
df_long <- bank %>%
select_if(is.numeric) %>% # Select only numeric columns
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(df_long, aes(x = Value)) +
geom_histogram(bins = 10, fill = "steelblue", color = "black") +
facet_wrap(~ Variable, scales = "free") + # Facet by variable
theme_minimal() +
labs(title = "Histograms for Numeric Variables - Untreated")
Age has a strong central tendency between 30 and 50. day is nearly uniform across its 1-31 range, meaning that clients were contacted throughout the month. The remaining numeric variables balance, duration, and campaign are right-skewed. This is not surprising for balance, as account balances typically have a natural floor of 0 euro and are nearly unbounded at the top end. What is surprising is how low the balances are across the interquartile range, topping out at 1,428 at the 75th percentile. The right tail of these right-skewed variables is analyzed further in the outliers portion of the analysis.
As mentioned in my opening remarks, there is concern with the features related to the previous marketing campaign (pdays, previous): they could exhibit sparsity, since their values depend on the client's inclusion in the previous campaign. The distributions indicate just that: for at least 75% of observations, pdays and previous carry the sentinel values (-1 and 0) marking clients who were not part of the previous campaign. These two fields are good candidates for removal, merging, or casting as categorical.
The interaction between these variables and the response variable:
bank %>%
filter(pdays>0) %>%
ggplot(aes(x=y, y = pdays)) +
geom_violin()
bank %>%
filter(previous>0) %>%
ggplot(aes(x=y, y = previous)) +
geom_violin()
previous above appears to carry the same information as campaign: an inverse relationship between the success rate (y = 'yes') and the number of contacts. I will drop previous but keep pdays, because pdays shows greater density differences between the two response classes. I will convert pdays to a categorical variable with five levels ("new" for pdays = -1, i.e. no prior contact, then 1-125, 126-250, 251-375, and >375), because these are the significant density boundaries I see on the violin plot.
Proportion of observations outside 1.5 x the IQR, by variable:
# Share of values beyond the 1.5 * IQR fences on either side
prop_outside <- function(vec) {
  prop_under <- sum(vec < (quantile(vec, .25) - 1.5 * stats::IQR(vec))) / length(vec)
  prop_over  <- sum(vec > (quantile(vec, .75) + 1.5 * stats::IQR(vec))) / length(vec)
  prop_under + prop_over
}
#apply across all numeric variables and print the output
bank %>%
select_if(is.numeric) %>%
pivot_longer(cols=everything(), names_to='variables', values_to='value') %>%
group_by(variables) %>%
summarize(proportion_outliers = prop_outside(value))
## # A tibble: 7 × 2
## variables proportion_outliers
## <chr> <dbl>
## 1 age 0.0108
## 2 balance 0.105
## 3 campaign 0.0678
## 4 day 0
## 5 duration 0.0716
## 6 pdays 0.183
## 7 previous 0.183
Outliers are material for balance at roughly 10%, modest for campaign and duration (about 7% each), and substantial for pdays and previous (about 18% each, driven by their sentinel values).
The campaign seems to have targeted people in early to middle adulthood with low to mid-range balances. Most clients were contacted 1-3 times. One thing to note: duration, day, and month (a categorical variable) record only the last communication, which is plausibly the most indicative and influential one, especially for successful purchases. I'd suspect that duration correlates strongly with successful purchases.
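One quick way to probe that hunch is to compare duration across the response classes (a sketch; output not shown):
# Compare last-contact duration between the two response classes
bank %>%
  group_by(y) %>%
  summarize(mean_duration = mean(duration),
            median_duration = median(duration))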
Collinearity is mostly a problem between pdays and previous, and to a lesser degree between pdays and campaign. This further supports the decision to remove previous altogether and bin pdays; these transformations will reduce multicollinearity.
bank %>%
keep(is.numeric) %>%
cor() %>%
corrplot(type = "upper")
Lastly, there are nine categorical predictors plus the factor response y. Their ranges and spreads are as follows:
bank %>%
select_if(is.factor) %>%
summary()
## job marital education default housing
## blue-collar:9732 married :27214 tertiary :13301 no :44396 yes:25130
## management :9458 single :12790 secondary:23202 yes: 815 no :20081
## technician :7597 divorced: 5207 unknown : 1857
## admin. :5171 primary : 6851
## services :4154
## retired :2264
## (Other) :6835
## loan contact month poutcome y
## no :37967 unknown :13020 may :13766 unknown:36959 no :39922
## yes: 7244 cellular :29285 jul : 6895 failure: 4901 yes: 5289
## telephone: 2906 aug : 6247 other : 1840
## jun : 5341 success: 1511
## nov : 3970
## apr : 2932
## (Other): 6060
job shows some concentration among blue-collar, management, and technician roles. There could be some interaction there with other categorical variables like education, housing, and default.
Heatmaps can surface density around combinations of the categorical variables education, job, and housing:
bank %>%
  group_by(job, education) %>%
  summarize(group_size = n(), .groups = "drop") %>%
  ggplot(aes(x = job, y = education, fill = group_size)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bank %>%
  group_by(housing, job) %>%
  summarize(group_size = n(), .groups = "drop") %>%
  ggplot(aes(x = housing, y = job, fill = group_size)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The campaign over-indexed on management with tertiary education, on secondary education across industries, and slightly toward homeowners, which makes sense. It appears that the campaign targeted a somewhat mature and established cohort.
The "unknown" value for education and job is not concentrated in any intersection of education, job, or housing. It is well distributed, so it shouldn't exert much leverage on the coefficients. I'll leave it as-is.
Given this dataset, kNN will require the following pre-processing:
- normalization
- treatment of missing/extreme values
- conversion of categorical variables to dummy variables
I handle these steps in order below.
Convert pdays to categorical
bank2 <- bank %>%
  mutate(pdays_f = as.factor(case_when(
    pdays == -1               ~ "new",
    pdays > -1 & pdays <= 125 ~ "min",
    pdays > 125 & pdays <= 250 ~ "med1",
    pdays > 250 & pdays <= 375 ~ "med2",
    pdays > 375               ~ "max"
  )))
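A quick sanity check of the engineered factor's level counts (output not shown):
# Tabulate the new pdays_f levels
table(bank2$pdays_f)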
The following fields take the value "unknown": job, education, contact, and poutcome. I would remove records with pervasive "unknown" values across the categoricals, but none register "unknown" in even two fields at once in the spot check below, so removing records is not a consideration.
# Returns zero: no record is "unknown" in both fields
# (note: "(Other)" in the summary() printout is a display bucket, not a factor level)
bank %>%
  filter(job == "unknown", month == "unknown") %>%
  nrow()
## [1] 0
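For reference, the per-field "unknown" counts can be tallied directly (a sketch; output not shown):
# Count "unknown" entries in each factor column
bank %>%
  summarize(across(where(is.factor), ~ sum(.x == "unknown")))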
For balance, I'll replace outliers (values beyond 1.5 x IQR above Q3) with the 80th percentile of balance within the intersection of education, job, and housing.
# Group-wise 80th percentile used as the replacement value
custom_funct <- function(x) quantile(x, 0.8, na.rm = TRUE)
bank2 <- bank2 %>%
  group_by(education, job, housing) %>%
  mutate(balance2 = ifelse(balance > (quantile(balance, .75) + 1.5 * stats::IQR(balance)),
                           custom_funct(balance),
                           balance)) %>%
  ungroup()  # ungroup after the mutate
# Outlier count before treatment (on the original bank)
bank %>%
filter(balance > quantile(balance, .75) + 1.5*stats::IQR(balance)) %>%
nrow()
## [1] 4712
# After treatment there are 996 outliers, down from 4,712
bank2 %>%
filter(balance2 > quantile(balance2, .75) + 1.5*stats::IQR(balance2)) %>%
nrow()
## [1] 996
summary(bank2$balance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8019 72 448 1362 1428 102127
summary(bank2$balance2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8019.0 72.0 448.0 788.3 1345.0 8971.0
For duration, I'll cap outliers (values beyond 1.5 x IQR above Q3) at the mean plus two standard deviations, on the assumption that values beyond this cap offer no additional information.
# Cap value: mean plus two standard deviations
custom_funct <- function(x) mean(x) + 2 * sd(x)
bank2 <- bank2 %>%
  mutate(duration2 = ifelse(duration > (quantile(duration, .75) + 1.5 * stats::IQR(duration)),
                            custom_funct(duration),
                            duration))
# Outlier counts before and after the cap
bank2 %>%
filter(duration > quantile(duration, .75) + 1.5*stats::IQR(duration)) %>%
nrow()
## [1] 3235
bank2 %>%
filter(duration2 > quantile(duration2, .75) + 1.5*stats::IQR(duration2)) %>%
nrow()
## [1] 3235
summary(bank2$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 180.0 258.2 319.0 4918.0
summary(bank2$duration2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 180.0 244.3 319.0 773.2
The outlier count is unchanged because the cap value (about 773) still sits above the 1.5 x IQR fence, but the maximum is now far more reasonable, falling from 4,918 to 773.2, without much change to the underlying structure of the vector.
Normalization for kNN
# Min-max scale a vector to [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
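One caveat: this scaler divides by zero on a constant column. None of the columns scaled below are constant, so it is safe here, but a guarded variant (hypothetical, not part of the original pipeline) could look like:
# Min-max scaler that maps constant columns to 0 instead of NaN
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(0, length(x)))  # constant-column guard
  (x - rng[1]) / diff(rng)
}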
bank2 %>%
select_if(is.numeric) %>%
glimpse()
## Rows: 45,211
## Columns: 9
## $ age <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ balance <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ day <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ duration <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ balance2 <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ duration2 <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
bank2 <- bank2 %>%
  mutate(across(c(age, balance2, duration2, day, campaign),
                normalize, .names = "{.col}_norm"))
bank2 %>%
select(ends_with('norm')) %>%
summary()
## age_norm balance2_norm duration2_norm day_norm
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1948 1st Qu.:0.4762 1st Qu.:0.1332 1st Qu.:0.2333
## Median :0.2727 Median :0.4984 Median :0.2328 Median :0.5000
## Mean :0.2979 Mean :0.5184 Mean :0.3159 Mean :0.4935
## 3rd Qu.:0.3896 3rd Qu.:0.5511 3rd Qu.:0.4126 3rd Qu.:0.6667
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## campaign_norm
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.01613
## Mean :0.02845
## 3rd Qu.:0.03226
## Max. :1.00000
The working dataset thus far: age_norm, balance2_norm, duration2_norm, day_norm, campaign_norm, pdays_f, marital, education, job, default, poutcome, month, housing, and y.
Convert the categorical variables to dummy variables
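A minimal sketch of that conversion using base R's model.matrix (one reasonable approach, not necessarily the one used downstream; bank_knn and y_labels are hypothetical names):
# Keep the treated/normalized numerics, the factors, and the response
bank_knn <- bank2 %>%
  select(age_norm, balance2_norm, duration2_norm, day_norm, campaign_norm,
         pdays_f, marital, education, job, default, poutcome, month, housing, y)
# Treatment-coded dummies (k - 1 columns per factor); drop the intercept column
X <- model.matrix(y ~ ., data = bank_knn)[, -1]
y_labels <- bank_knn$y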