library(tidyverse)  # includes dplyr, ggplot2, and tidyr
library(corrplot)
After importing the data and converting the character vectors to factors, here is a glimpse of the contents.
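The import step itself is not echoed in this document; a minimal sketch, assuming the semicolon-delimited UCI bank-full.csv file (the file name and delimiter are assumptions):
# Assumed import step: semicolon-delimited UCI bank marketing file
bank <- read.csv("bank-full.csv", sep = ";") %>%
  mutate(across(where(is.character), as.factor))  # character -> factor
glimpse(bank)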
## Rows: 45,211
## Columns: 17
## $ age <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
Taking a field-datatype approach, the numeric variables are reviewed first:
bank %>%
select_if(is.numeric) %>%
summary()
## age balance day duration
## Min. :18.00 Min. : -8019 Min. : 1.00 Min. : 0.0
## 1st Qu.:33.00 1st Qu.: 72 1st Qu.: 8.00 1st Qu.: 103.0
## Median :39.00 Median : 448 Median :16.00 Median : 180.0
## Mean :40.94 Mean : 1362 Mean :15.81 Mean : 258.2
## 3rd Qu.:48.00 3rd Qu.: 1428 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :95.00 Max. :102127 Max. :31.00 Max. :4918.0
## campaign pdays previous
## Min. : 1.000 Min. : -1.0 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
## Median : 2.000 Median : -1.0 Median : 0.0000
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
A graphical look at the distribution of all the numeric variables:
df_long <- bank %>%
select_if(is.numeric) %>% # Select only numeric columns
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(df_long, aes(x = Value)) +
geom_histogram(bins = 10, fill = "steelblue", color = "black") +
facet_wrap(~ Variable, scales = "free") + # Facet by variable
theme_minimal() +
labs(title = "Histograms for Numeric Variables - Untreated")
Age has a strong central tendency between 30 and 50. day is nearly uniform across its 1-31 range, meaning that clients were contacted throughout the month. The remaining numeric variables balance, duration, and campaign are right-skewed. This is not surprising for balance, as account balances typically have a natural floor of 0 euro and are nearly unbounded at the top end. What is surprising is how low the balances are across the interquartile range, topping out at 1,428 at the 75th percentile. The right tail of these right-skewed variables is analyzed further in the outliers portion of the analysis.
As mentioned in my opening remarks, there is concern with the features related to the previous marketing campaign (pdays, previous): they could exhibit sparsity, since their values depend on the client's inclusion in the previous campaign. The distributions indicate just that: for at least 75% of observations, pdays and previous carry the sentinel values (-1 and 0) marking clients who were not part of the previous campaign. These two fields are good candidates for removal, merging, or casting as categorical.
The interaction between these variables and the response variable:
bank %>%
filter(pdays>0) %>%
ggplot(aes(x=y, y = pdays)) +
geom_violin()
bank %>%
filter(previous>0) %>%
ggplot(aes(x=y, y = previous)) +
geom_violin()
previous above appears to carry the same information as campaign: an inverse relationship between the success rate (y = 'yes') and the number of contacts. I will drop previous but keep pdays, because pdays shows greater density differences between the two response classes. I will convert pdays to a categorical variable with five levels ("new" for pdays = -1, i.e. no prior contact, then 1-125, 126-250, 251-375, and >375), because these are the significant density boundaries I see on the violin plot.
Proportion of observations outside 1.5 x the IQR, by variable:
# Share of values beyond the 1.5 * IQR fences on either side
prop_outside <- function(vec) {
  prop_under <- sum(vec < (quantile(vec, .25) - 1.5 * stats::IQR(vec))) / length(vec)
  prop_over  <- sum(vec > (quantile(vec, .75) + 1.5 * stats::IQR(vec))) / length(vec)
  prop_under + prop_over
}
#apply across all numeric variables and print the output
bank %>%
select_if(is.numeric) %>%
pivot_longer(cols=everything(), names_to='variables', values_to='value') %>%
group_by(variables) %>%
summarize(proportion_outliers = prop_outside(value))
## # A tibble: 7 × 2
## variables proportion_outliers
## <chr> <dbl>
## 1 age 0.0108
## 2 balance 0.105
## 3 campaign 0.0678
## 4 day 0
## 5 duration 0.0716
## 6 pdays 0.183
## 7 previous 0.183
Outliers are material for balance at roughly 10%, modest for campaign and duration (about 7% each), and substantial for pdays and previous (about 18% each, driven by their sentinel values).
The campaign seems to have targeted people in early to middle adulthood with low to mid-range balances. Most clients were contacted 1-3 times. One thing to note: duration, day, and month (a categorical variable) record only the last communication, which is plausibly the most indicative and influential one, especially for successful purchases. I'd suspect that duration correlates strongly with successful purchases.
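One quick way to probe that hunch is to compare duration across the response classes (a sketch; output not shown):
# Compare last-contact duration between the two response classes
bank %>%
  group_by(y) %>%
  summarize(mean_duration = mean(duration),
            median_duration = median(duration))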
Collinearity is mostly a problem between pdays and previous, and to a lesser degree between pdays and campaign. This further supports the decision to remove previous altogether and bin pdays; these transformations will reduce multicollinearity.
bank %>%
keep(is.numeric) %>%
cor() %>%
corrplot(type = "upper")
Lastly, there are nine categorical predictors plus the factor response y. Their ranges and spreads are as follows:
bank %>%
select_if(is.factor) %>%
summary()
## job marital education default housing
## blue-collar:9732 married :27214 tertiary :13301 no :44396 yes:25130
## management :9458 single :12790 secondary:23202 yes: 815 no :20081
## technician :7597 divorced: 5207 unknown : 1857
## admin. :5171 primary : 6851
## services :4154
## retired :2264
## (Other) :6835
## loan contact month poutcome y
## no :37967 unknown :13020 may :13766 unknown:36959 no :39922
## yes: 7244 cellular :29285 jul : 6895 failure: 4901 yes: 5289
## telephone: 2906 aug : 6247 other : 1840
## jun : 5341 success: 1511
## nov : 3970
## apr : 2932
## (Other): 6060
job shows some concentration among blue-collar, management, and technician roles. There could be some interaction there with other categorical variables like education, housing, and default.
Heatmaps can surface density around combinations of the categorical variables education, job, and housing:
bank %>%
  group_by(job, education) %>%
  summarize(group_size = n(), .groups = "drop") %>%
  ggplot(aes(x = job, y = education, fill = group_size)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bank %>%
  group_by(housing, job) %>%
  summarize(group_size = n(), .groups = "drop") %>%
  ggplot(aes(x = housing, y = job, fill = group_size)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The campaign over-indexed on management with tertiary education, on secondary education across industries, and slightly toward homeowners, which makes sense. It appears that the campaign targeted a somewhat mature and established cohort.
The "unknown" value for education and job is not concentrated in any intersection of education, job, or housing. It is well distributed, so it shouldn't exert much leverage on the coefficients. I'll leave it as-is.
Given this dataset, kNN will require the following pre-processing:
- normalization
- treatment of missing/extreme values
- conversion of categorical variables to dummy variables
I handle these steps in order below.
Convert pdays to categorical
bank2 <- bank %>%
  mutate(pdays_f = as.factor(case_when(
    pdays == -1               ~ "new",
    pdays > -1 & pdays <= 125 ~ "min",
    pdays > 125 & pdays <= 250 ~ "med1",
    pdays > 250 & pdays <= 375 ~ "med2",
    pdays > 375               ~ "max"
  )))
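A quick sanity check of the engineered factor's level counts (output not shown):
# Tabulate the new pdays_f levels
table(bank2$pdays_f)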
The following fields take the value "unknown": job, education, contact, and poutcome. I would remove records with pervasive "unknown" values across the categoricals, but none register "unknown" in even two fields at once in the spot check below, so removing records is not a consideration.
# Returns zero: no record is "unknown" in both fields
# (note: "(Other)" in the summary() printout is a display bucket, not a factor level)
bank %>%
  filter(job == "unknown", month == "unknown") %>%
  nrow()
## [1] 0
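For reference, the per-field "unknown" counts can be tallied directly (a sketch; output not shown):
# Count "unknown" entries in each factor column
bank %>%
  summarize(across(where(is.factor), ~ sum(.x == "unknown")))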
For balance, I'll replace outliers (values beyond 1.5 x IQR above Q3) with the 80th percentile of balance within the intersection of education, job, and housing.
# Group-wise 80th percentile used as the replacement value
custom_funct <- function(x) quantile(x, 0.8, na.rm = TRUE)
bank2 <- bank2 %>%
  group_by(education, job, housing) %>%
  mutate(balance2 = ifelse(balance > (quantile(balance, .75) + 1.5 * stats::IQR(balance)),
                           custom_funct(balance),
                           balance)) %>%
  ungroup()  # ungroup after the mutate
# Outlier count before treatment (on the original bank)
bank %>%
filter(balance > quantile(balance, .75) + 1.5*stats::IQR(balance)) %>%
nrow()
## [1] 4712
# After treatment there are 996 outliers, down from 4,712
bank2 %>%
filter(balance2 > quantile(balance2, .75) + 1.5*stats::IQR(balance2)) %>%
nrow()
## [1] 996
summary(bank2$balance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8019 72 448 1362 1428 102127
summary(bank2$balance2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8019.0 72.0 448.0 788.3 1345.0 8971.0
For duration, I'll cap outliers (values beyond 1.5 x IQR above Q3) at the mean plus two standard deviations, on the assumption that values beyond this cap offer no additional information.
# Cap value: mean plus two standard deviations
custom_funct <- function(x) mean(x) + 2 * sd(x)
bank2 <- bank2 %>%
  mutate(duration2 = ifelse(duration > (quantile(duration, .75) + 1.5 * stats::IQR(duration)),
                            custom_funct(duration),
                            duration))
# Outlier counts before and after the cap
bank2 %>%
filter(duration > quantile(duration, .75) + 1.5*stats::IQR(duration)) %>%
nrow()
## [1] 3235
bank2 %>%
filter(duration2 > quantile(duration2, .75) + 1.5*stats::IQR(duration2)) %>%
nrow()
## [1] 3235
summary(bank2$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 180.0 258.2 319.0 4918.0
summary(bank2$duration2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 180.0 244.3 319.0 773.2
The outlier count is unchanged because the cap value (about 773) still sits above the 1.5 x IQR fence, but the maximum is now far more reasonable, falling from 4,918 to 773.2, without much change to the underlying structure of the vector.
Normalization for kNN
# Min-max scale a vector to [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
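One caveat: this scaler divides by zero on a constant column. None of the columns scaled below are constant, so it is safe here, but a guarded variant (hypothetical, not part of the original pipeline) could look like:
# Min-max scaler that maps constant columns to 0 instead of NaN
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(0, length(x)))  # constant-column guard
  (x - rng[1]) / diff(rng)
}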
bank2 %>%
select_if(is.numeric) %>%
glimpse()
## Rows: 45,211
## Columns: 9
## $ age <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ balance <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ day <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ duration <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ balance2 <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ duration2 <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
bank2 <- bank2 %>%
  mutate(across(c(age, balance2, duration2, day, campaign),
                normalize, .names = "{.col}_norm"))
bank2 %>%
select(ends_with('norm')) %>%
summary()
## age_norm balance2_norm duration2_norm day_norm
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1948 1st Qu.:0.4762 1st Qu.:0.1332 1st Qu.:0.2333
## Median :0.2727 Median :0.4984 Median :0.2328 Median :0.5000
## Mean :0.2979 Mean :0.5184 Mean :0.3159 Mean :0.4935
## 3rd Qu.:0.3896 3rd Qu.:0.5511 3rd Qu.:0.4126 3rd Qu.:0.6667
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## campaign_norm
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.01613
## Mean :0.02845
## 3rd Qu.:0.03226
## Max. :1.00000
The working dataset thus far: age_norm, balance2_norm, duration2_norm, day_norm, campaign_norm, pdays_f, marital, education, job, default, poutcome, month, housing, and y.
Convert the categorical variables to dummy variables
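A minimal sketch of that conversion using base R's model.matrix (one reasonable approach, not necessarily the one used downstream; bank_knn and y_labels are hypothetical names):
# Keep the treated/normalized numerics, the factors, and the response
bank_knn <- bank2 %>%
  select(age_norm, balance2_norm, duration2_norm, day_norm, campaign_norm,
         pdays_f, marital, education, job, default, poutcome, month, housing, y)
# Treatment-coded dummies (k - 1 columns per factor); drop the intercept column
X <- model.matrix(y ~ ., data = bank_knn)[, -1]
y_labels <- bank_knn$y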