library(tidyverse)
library(corrplot)
library(cowplot)
library(naniar)
library(skimr)
bank <- read.csv("C:\\Users\\jashb\\OneDrive\\Documents\\Masters Data Science\\Spring 2025\\DATA 622\\Assignment 1\\DATA\\bank-additional\\bank-additional-full.csv", sep = ';')
Input variables:
1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3 - marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: “cellular”,“telephone”)
9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
The independent variables in the dataset are a mix of integer, numeric (double), and character variables. The dependent variable, y, is a simple “yes” or “no”.
str(bank)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
skim(bank)
Name | bank |
Number of rows | 41188 |
Number of columns | 21 |
_______________________ | |
Column type frequency: | |
character | 11 |
numeric | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
job | 0 | 1 | 6 | 13 | 0 | 12 | 0 |
marital | 0 | 1 | 6 | 8 | 0 | 4 | 0 |
education | 0 | 1 | 7 | 19 | 0 | 8 | 0 |
default | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
housing | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
loan | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
contact | 0 | 1 | 8 | 9 | 0 | 2 | 0 |
month | 0 | 1 | 3 | 3 | 0 | 10 | 0 |
day_of_week | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
poutcome | 0 | 1 | 7 | 11 | 0 | 3 | 0 |
y | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 40.02 | 10.42 | 17.00 | 32.00 | 38.00 | 47.00 | 98.00 | ▅▇▃▁▁ |
duration | 0 | 1 | 258.29 | 259.28 | 0.00 | 102.00 | 180.00 | 319.00 | 4918.00 | ▇▁▁▁▁ |
campaign | 0 | 1 | 2.57 | 2.77 | 1.00 | 1.00 | 2.00 | 3.00 | 56.00 | ▇▁▁▁▁ |
pdays | 0 | 1 | 962.48 | 186.91 | 0.00 | 999.00 | 999.00 | 999.00 | 999.00 | ▁▁▁▁▇ |
previous | 0 | 1 | 0.17 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
emp.var.rate | 0 | 1 | 0.08 | 1.57 | -3.40 | -1.80 | 1.10 | 1.40 | 1.40 | ▁▃▁▁▇ |
cons.price.idx | 0 | 1 | 93.58 | 0.58 | 92.20 | 93.08 | 93.75 | 93.99 | 94.77 | ▁▆▃▇▂ |
cons.conf.idx | 0 | 1 | -40.50 | 4.63 | -50.80 | -42.70 | -41.80 | -36.40 | -26.90 | ▅▇▁▇▁ |
euribor3m | 0 | 1 | 3.62 | 1.73 | 0.63 | 1.34 | 4.86 | 4.96 | 5.04 | ▅▁▁▁▇ |
nr.employed | 0 | 1 | 5167.04 | 72.25 | 4963.60 | 5099.10 | 5191.00 | 5228.10 | 5228.10 | ▁▁▃▁▇ |
# Reshape the non-numeric columns to long format: one row per (key, value) pair
non_numeric_vars <- bank %>% select_if(negate(is.numeric)) %>% gather()
# Draw one bar chart of level counts per categorical variable
for (var in unique(non_numeric_vars$key)) {
  p <- non_numeric_vars %>%
    filter(key == var) %>%
    ggplot(aes(x = value, fill = key)) +
    geom_bar() +
    labs(title = var) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p)
}
I will use scatter plots of each numeric variable against its row number to check for notable outliers. The plots below show that the data are not normally distributed. Age, campaign, duration, and pdays stand out as having serious outlier problems. If a model other than random forest is used, these outliers will need to be addressed.
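# numeric_vars was not defined in the original; assumed here to be the numeric
# columns in long format, with the row number stored in `ind`
numeric_vars <- bank %>% select_if(is.numeric) %>%
  mutate(ind = row_number()) %>% gather(key, value, -ind)
numeric_vars %>%
  ggplot(aes(x = ind, y = value)) +
  geom_point(color = 'salmon') +
  facet_wrap(~key, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Numeric Variables Scatter Plots", x = "Row_Num", y = "Value")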
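The outlier concern raised for the scatter plots can also be quantified. A minimal sketch, assuming Tukey's 1.5 × IQR rule is an acceptable definition of an outlier here; `flag_iqr_outliers` is a hypothetical helper name and the toy vector is made up for illustration, not taken from the bank data.

```r
# Toy values standing in for a right-skewed column such as `campaign`;
# illustrative only, not taken from the bank dataset.
toy_campaign <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 40)

# Flag values outside [Q1 - k * IQR, Q3 + k * IQR] (Tukey's rule, k = 1.5).
# `flag_iqr_outliers` is a hypothetical helper, not part of any package.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

outlier_flags <- flag_iqr_outliers(toy_campaign)
n_outliers <- sum(outlier_flags, na.rm = TRUE)
print(n_outliers)
```

On the real data the same helper could be applied column by column, for example `sapply(bank[sapply(bank, is.numeric)], function(col) sum(flag_iqr_outliers(col)))`. For a column whose IQR is 0, the rule flags every value that differs from the quartiles, so the counts for such columns should be read with care.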
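To see the central tendency and spread of the numeric variables, I will create a table containing the mean, median, mode, standard deviation, and IQR of each one. Base R has no built-in function for the statistical mode, so the mode is computed with a small helper function that I sourced online.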
The variables differ widely in their distributions and central tendencies. Pdays is heavily concentrated on a single value (999); according to the dataset notes, “pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)”. This 999 placeholder could cause problems down the road, so it may be worth either dropping the variable or replacing 999 with NA and letting the random forest handle it. Some variables, such as duration and campaign, have high variability and might need special handling during modeling.
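One way to handle the 999 placeholder is sketched below. It assumes that keeping the "never contacted" information in a separate flag is useful; the column names `prev_contacted` and `pdays_clean` are hypothetical, and the toy values are made up for illustration.

```r
# Toy values standing in for bank$pdays; 999 is the documented
# "not previously contacted" placeholder.
toy <- data.frame(pdays = c(999, 3, 999, 6, 999))

# Hypothetical new columns: an explicit contact flag, plus a copy of pdays
# in which the placeholder is replaced by NA.
toy$prev_contacted <- toy$pdays != 999
toy$pdays_clean <- ifelse(toy$pdays == 999, NA_real_, toy$pdays)

print(toy)
```

Keeping the flag means the information carried by 999 is not lost when the placeholder becomes NA, and summary statistics on `pdays_clean` are no longer dominated by the placeholder.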
# Return the most frequent value in v (the first one wins on ties)
mode_calc <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
central_tab <- bank %>%
  select(where(is.numeric)) %>%
  summarise_all(list(
    Mean = ~mean(., na.rm = TRUE),
    Median = ~median(., na.rm = TRUE),
    Mode = ~mode_calc(.),
    Std = ~sd(., na.rm = TRUE),
    IQR = ~IQR(., na.rm = TRUE)
  ))
# Convert from wide to long, then spread the measures back into columns
central_tendency_long <- central_tab %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  separate(Variable, into = c("Variable", "Measure"), sep = "_") %>%
  pivot_wider(names_from = Measure, values_from = Value)
print(central_tendency_long)
## # A tibble: 10 × 6
## Variable Mean Median Mode Std IQR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 40.0 38 31 10.4 15
## 2 duration 258. 180 85 259. 217
## 3 campaign 2.57 2 1 2.77 2
## 4 pdays 962. 999 999 187. 0
## 5 previous 0.173 0 0 0.495 0
## 6 emp.var.rate 0.0819 1.1 1.4 1.57 3.2
## 7 cons.price.idx 93.6 93.7 94.0 0.579 0.919
## 8 cons.conf.idx -40.5 -41.8 -36.4 4.63 6.30
## 9 euribor3m 3.62 4.86 4.86 1.73 3.62
## 10 nr.employed 5167. 5191 5228. 72.3 129
For this classification problem, two algorithms make sense based on the preliminary view of the data. The first is logistic regression. Its primary benefits are that it is easy to deploy and easy to interpret. Logistic regression also scales well to larger datasets and is computationally cheaper than many more complex algorithms. However, it is sensitive to outliers and multicollinearity, both of which are present in this data (especially outliers).
The best option for this dataset will likely be random forest. Despite the longer run times that come with its higher complexity, this algorithm is better equipped to handle imbalanced data, as well as a mix of categorical character features and numeric features. Additionally, if the unknowns in the data were converted back to NA, random forest can cope with missing data fairly well, depending on the implementation. As a result, random forest will be chosen.
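The imbalance and multicollinearity concerns above can be checked directly before committing to a model. A minimal sketch on a small made-up data frame follows; the values are illustrative only and do not reproduce the bank data, and the object names `class_share` and `r_emp` are hypothetical.

```r
# Made-up rows that mimic the column names of the bank data;
# the numbers are illustrative, not real observations.
toy <- data.frame(
  y = c("no", "no", "no", "no", "no", "no", "no", "no", "yes", "yes"),
  emp.var.rate = c(1.1, 1.1, 1.4, 1.4, -1.8, -1.8, -0.1, 1.4, -2.9, -3.4),
  nr.employed = c(5191, 5191, 5228.1, 5228.1, 5099.1, 5099.1,
                  5195.8, 5228.1, 5076.2, 5017.5)
)

# Share of each class in the target: a quick view of class imbalance.
class_share <- prop.table(table(toy$y))

# Pearson correlation between two economic indicators: a quick
# multicollinearity check for one pair of features.
r_emp <- cor(toy$emp.var.rate, toy$nr.employed)

print(class_share)
print(r_emp)
```

On the real data, the same two calls would be `prop.table(table(bank$y))` and `cor(bank$emp.var.rate, bank$nr.employed)`. Since corrplot is already loaded, the full correlation matrix of the numeric columns could also be plotted.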
As mentioned above, the dataset does contain missing data; it is just coded as ‘unknown’. The first preprocessing step is therefore deciding how to handle it. One option is to recode these values as NA and then impute them with a package such as mice. Another option is to let the random forest implementation do its own imputation. Either way, the missing values need to be addressed. Although random forest does not necessarily need standardization, dropping highly correlated variables that measure the same thing could improve model performance and cut down on run time. For example, nr.employed and emp.var.rate are highly correlated because both measure a change in employment over time; these features could either be combined or reduced to just one.
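The recoding step described above might look like the following. This is a minimal base R sketch on a made-up data frame, assuming that every literal "unknown" in a character column should be treated as missing; the object names `toy_na` and `na_counts` are hypothetical.

```r
# Made-up rows that mimic a few columns of the bank data.
toy <- data.frame(
  job = c("admin.", "unknown", "services", "technician"),
  default = c("no", "unknown", "unknown", "no"),
  age = c(56, 57, 37, 40),
  stringsAsFactors = FALSE
)

# Recode "unknown" to NA in character columns only; numeric columns are untouched.
is_chr <- vapply(toy, is.character, logical(1))
toy_na <- toy
toy_na[is_chr] <- lapply(toy_na[is_chr], function(col) {
  replace(col, col == "unknown", NA)
})

# Count the missing values per column after recoding.
na_counts <- colSums(is.na(toy_na))
print(na_counts)
```

After the same recoding is applied to `bank`, the missingness summaries in skimr and naniar (both already loaded) would report these values as missing instead of showing `n_missing` as 0.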