library(tidyverse)  # data wrangling and ggplot2
library(corrplot)   # correlation matrix plots
library(cowplot)    # plot composition helpers
library(naniar)     # missing-data tools
library(skimr)      # quick summary statistics

Pull in Dataset

bank <- read.csv("C:\\Users\\jashb\\OneDrive\\Documents\\Masters Data Science\\Spring 2025\\DATA 622\\Assignment 1\\DATA\\bank-additional\\bank-additional-full.csv", sep = ';')
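For portability, the same file could be read from a relative path instead of a hard-coded absolute one (a sketch, assuming the CSV sits in a data/ folder inside the project directory):

# Hypothetical relative path; adjust to wherever the CSV actually lives
bank <- read.csv("data/bank-additional-full.csv", sep = ';')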

EDA

Input variables (from the UCI documentation):

Bank client data:

1 - age (numeric)

2 - job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)

3 - marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)

4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)

5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)

6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)

7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)

Related with the last contact of the current campaign:

8 - contact: contact communication type (categorical: “cellular”,“telephone”)

9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model (see the sketch after this list).

other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)

Social and economic context attributes:

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

21 - y - has the client subscribed to a term deposit? (binary: “yes”,“no”)
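Given the note on duration above, a realistic model would drop that column up front (a minimal sketch):

# Drop duration to avoid target leakage in a realistic predictive model
bank_model <- bank %>% select(-duration)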

str() and skim() give an overall view of missingness, uniqueness, means, and standard deviations.

The independent variables in the dataset consist of integer and character variables. The dependent variable is a simple “yes” or “no”.

str(bank)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...
skim(bank)
Data summary
Name bank
Number of rows 41188
Number of columns 21
_______________________
Column type frequency:
character 11
numeric 10
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
job                    0              1    6   13      0        12           0
marital                0              1    6    8      0         4           0
education              0              1    7   19      0         8           0
default                0              1    2    7      0         3           0
housing                0              1    2    7      0         3           0
loan                   0              1    2    7      0         3           0
contact                0              1    8    9      0         2           0
month                  0              1    3    3      0        10           0
day_of_week            0              1    3    3      0         5           0
poutcome               0              1    7   11      0         3           0
y                      0              1    2    3      0         2           0

Variable type: numeric

skim_variable   n_missing  complete_rate     mean      sd       p0      p25      p50      p75     p100  hist
age                     0              1    40.02   10.42    17.00    32.00    38.00    47.00    98.00  ▅▇▃▁▁
duration                0              1   258.29  259.28     0.00   102.00   180.00   319.00  4918.00  ▇▁▁▁▁
campaign                0              1     2.57    2.77     1.00     1.00     2.00     3.00    56.00  ▇▁▁▁▁
pdays                   0              1   962.48  186.91     0.00   999.00   999.00   999.00   999.00  ▁▁▁▁▇
previous                0              1     0.17    0.49     0.00     0.00     0.00     0.00     7.00  ▇▁▁▁▁
emp.var.rate            0              1     0.08    1.57    -3.40    -1.80     1.10     1.40     1.40  ▁▃▁▁▇
cons.price.idx          0              1    93.58    0.58    92.20    93.08    93.75    93.99    94.77  ▁▆▃▇▂
cons.conf.idx           0              1   -40.50    4.63   -50.80   -42.70   -41.80   -36.40   -26.90  ▅▇▁▇▁
euribor3m               0              1     3.62    1.73     0.63     1.34     4.86     4.96     5.04  ▅▁▁▁▇
nr.employed             0              1  5167.04   72.25  4963.60  5099.10  5191.00  5228.10  5228.10  ▁▁▃▁▇

EDA - Question 1: Are the features (columns) of your data correlated?

The correlation analysis revealed that most variable pairs are weakly correlated, but a few highly correlated pairs stand out. euribor3m and nr.employed have a 0.95 correlation, and nr.employed and emp.var.rate have a 0.91 correlation. The nr.employed and emp.var.rate relationship makes sense because both measure some aspect of employment. The euribor3m and nr.employed link may reflect that rising employment accompanies a growing economy (and rising rates), or vice versa.

# Correlation matrix of the numeric columns only
cor_mat <- cor(bank[, sapply(bank, is.numeric)])
corrplot(cor_mat, type = 'upper',
         method = "color", 
         addCoef.col = "black", 
         number.cex = 0.8, 
         tl.cex = 0.8, 
         tl.col = "black", 
         tl.srt = 45, 
         col = colorRampPalette(c("blue", "white", "grey"))(100))
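To list the specific pairs behind those numbers programmatically, the upper triangle of the matrix can be filtered (a base R sketch; the 0.9 cutoff is an arbitrary choice):

# Upper-triangle pairs with |r| above an arbitrary 0.9 cutoff
high_idx <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high_idx[, 1]],
           var2 = colnames(cor_mat)[high_idx[, 2]],
           r    = cor_mat[high_idx])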

Are there any missing values and how significant are they?

I will start with missing-data identification, since it can surface potential problems to address down the road.

No NA values are present in this dataset (although, as discussed under Pre-processing, missingness is actually coded as “unknown” in several categorical variables).

print(colSums(is.na(bank)))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0
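NA counts alone understate the problem here: per the UCI notes above, several categorical variables code missing values as “unknown”. A quick cross-check of that placeholder (a sketch using the already-loaded tidyverse):

# Count "unknown" placeholders in each character column
bank %>%
  select(where(is.character)) %>%
  summarise(across(everything(), ~ sum(.x == "unknown")))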

What is the overall distribution of each variable?

# Long-format table of the numeric variables (key = variable name, value = value)
numeric_vars <- bank %>% select_if(is.numeric) %>% gather()

# Row index within each variable, used for the scatter plots later
numeric_vars <- numeric_vars %>%
  group_by(key) %>% 
  mutate(ind = row_number()) %>% 
  ungroup()

# One density plot per numeric variable
for (var in unique(numeric_vars$key)) {
  p <- numeric_vars %>%
    filter(key == var) %>%
    ggplot(aes(value)) +
    geom_density() +
    labs(title = var) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p)
}
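Equivalently, all densities can be drawn in one faceted figure, mirroring the facet_wrap approach used for the scatter plots below:

# Same densities as the loop above, in a single faceted figure
numeric_vars %>%
  ggplot(aes(value)) +
  geom_density() +
  facet_wrap(~key, scales = "free")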

Non Numeric Variable Distributions

How are categorical variables distributed?

# Long-format table of the categorical (character) variables
non_numeric_vars <- bank %>% select_if(negate(is.numeric)) %>% gather()

# One bar chart per categorical variable
for (var in unique(non_numeric_vars$key)) {
  p <- non_numeric_vars %>%
    filter(key == var) %>%
    ggplot(aes(x = value, fill = key)) +
    geom_bar() +
    labs(title = var) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p)
}

Are there any outliers present?

I will use scatter plot visualizations to check whether the numeric variables have any important outliers. The distributions visualized below show that this data is not normally distributed. Age, campaign, duration, and pdays stand out as having serious outlier problems. If an algorithm other than Random Forest is used, this will need to be addressed.

numeric_vars %>% 
  ggplot(aes(x = ind, y = value)) +
  geom_point(color = 'salmon')+
  facet_wrap(~key, scales = "free")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Numeric Variables Scatter Plots", x = "Row_Num", y = "Value")
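To put rough numbers on that impression, candidate outliers can be counted with the usual 1.5 × IQR rule (a sketch; the Tukey threshold is a convention, not something tuned to this data):

# Count values beyond 1.5 * IQR outside the quartiles, per numeric variable
numeric_vars %>%
  group_by(key) %>%
  summarise(n_outliers = sum(value < quantile(value, 0.25) - 1.5 * IQR(value) |
                             value > quantile(value, 0.75) + 1.5 * IQR(value)))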

What is the central tendency and spread of each variable?

To see the central tendency and spread of each variable, a table is created containing the mean, median, mode, standard deviation, and IQR (the mode uses a helper function I sourced online).

The dataset contains varying distributional properties and central tendencies. Pdays has a high concentration of a single value (999); according to the dataset notes, “pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)”. This 999 placeholder could be a problem down the road, and could be grounds for either dropping the variable or replacing 999 with NA and letting Random Forest sort it out. There are also variables with high variability (e.g., duration and campaign) that might need special handling during modeling.

# Statistical mode: most frequent value in a vector
mode_calc <- function(v){
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v,uniqv)))]
}
central_tab <- bank %>% 
  select(where(is.numeric)) %>% 
  summarise_all(list(
    Mean = ~mean(.,na.rm = T),
    Median = ~median(.,na.rm = T),
    Mode = ~mode_calc(.),
    Std= ~sd(.,na.rm = T),
    IQR = ~IQR(.,na.rm = T)
  ))

# Reshape so each row is a variable and each column a summary measure
central_tendency_long <- central_tab %>% 
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  separate(Variable, into = c("Variable", "Measure"), sep = "_") %>%
  pivot_wider(names_from = Measure, values_from = Value)

print(central_tendency_long)
## # A tibble: 10 × 6
##    Variable            Mean  Median    Mode     Std     IQR
##    <chr>              <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 age              40.0      38      31     10.4    15    
##  2 duration        258.      180      85    259.    217    
##  3 campaign          2.57      2       1      2.77    2    
##  4 pdays           962.      999     999    187.      0    
##  5 previous          0.173     0       0      0.495   0    
##  6 emp.var.rate      0.0819    1.1     1.4    1.57    3.2  
##  7 cons.price.idx   93.6      93.7    94.0    0.579   0.919
##  8 cons.conf.idx   -40.5     -41.8   -36.4    4.63    6.30 
##  9 euribor3m         3.62      4.86    4.86   1.73    3.62 
## 10 nr.employed    5167.     5191    5228.    72.3   129

Algorithm Selection:

For this classification problem, two algorithms make sense based on the preliminary view of the data. The first is logistic regression. Its primary benefits are ease of deployment and ease of interpretation, and it can handle larger datasets, making it more computationally efficient than some more complex algorithms. However, it is sensitive to outliers and multicollinearity, two things this data has (especially outliers). The better option for this dataset will likely be Random Forest. Despite the longer run times brought about by its higher complexity, this algorithm is better equipped to handle imbalanced data, as well as a mix of categorical character features and numeric integer features. Additionally, if the unknowns in the data were converted back to NA, Random Forest handles missing data fairly well. As a result, Random Forest is chosen.
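The imbalance is easy to confirm directly (a quick sketch counting the target classes):

# Class balance of the target variable
bank %>%
  count(y) %>%
  mutate(prop = n / sum(n))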

Pre-processing:

As mentioned above, there actually is missing data in the dataset; it is just coded as ‘unknown’. So the first preprocessing step is deciding what to do about this missing data. One option is recoding the values as NA and then running the data through a package like MICE to impute. Another option is letting the Random Forest handle the imputation itself. Regardless, the NAs need to be addressed. While Random Forest does not necessarily need standardization, dropping highly correlated variables that measure the same thing could improve model performance and cut down on run time. For example, nr.employed and emp.var.rate are highly correlated because both measure change in employment over time; the two features could be combined, or only one kept.
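A minimal sketch of those steps, assuming naniar’s replace_with_na_all() for the “unknown” recode; which of the correlated pair to keep, and whether to also drop duration per the data-dictionary note, are judgment calls:

# Recode placeholders to NA and drop redundant/leaky columns
bank_clean <- bank %>%
  replace_with_na_all(condition = ~ .x == "unknown") %>%  # "unknown" -> NA (naniar)
  mutate(pdays = na_if(pdays, 999)) %>%                   # 999 placeholder -> NA
  select(-emp.var.rate,                                   # keep nr.employed of the correlated pair
         -duration)                                       # leaky per the data-dictionary note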