A Portuguese bank conducted a marketing campaign (phone calls) to get clients to subscribe to a term deposit; the records of those calls are available as a dataset, and the goal is to predict whether a client will subscribe. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign.
Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
For this assignment, I chose the full dataset,
bank-additional-full.csv, which has all observations and
features. I would prefer to start with all the data and then narrow it
down after performing my EDA and algorithm selection. I wouldn't want to
disregard features or observations before looking at them and knowing
whether they're important.
Input variables:

# bank client data:
1 - age (numeric)
2 - job: type of job (categorical: "admin.", "blue-collar", "entrepreneur", "housemaid", "management", "retired", "self-employed", "services", "student", "technician", "unemployed", "unknown")
3 - marital: marital status (categorical: "divorced", "married", "single", "unknown"; note: "divorced" means divorced or widowed)
4 - education (categorical: "basic.4y", "basic.6y", "basic.9y", "high.school", "illiterate", "professional.course", "university.degree", "unknown")
5 - default: has credit in default? (categorical: "no", "yes", "unknown")
6 - housing: has housing loan? (categorical: "no", "yes", "unknown")
7 - loan: has personal loan? (categorical: "no", "yes", "unknown")

# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: "cellular", "telephone")
9 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")
10 - day_of_week: last contact day of the week (categorical: "mon", "tue", "wed", "thu", "fri")
11 - duration: last contact duration, in seconds (numeric)

# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: "failure", "nonexistent", "success")

# social and economic context attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y: has the client subscribed to a term deposit? (binary: "yes", "no")
#Import Libraries
library(readr) # to use the read_csv function
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071) # For skewness function
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
library(smotefamily)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(gmodels)
library(correlationfunnel)
## ══ Using correlationfunnel? ════════════════════════════════════════════════════
## You might also be interested in applied data science training for business.
## </> Learn more at - www.business-science.io </>
library(DataExplorer)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.3.0 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ caret::lift() masks purrr::lift()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::precision() masks caret::precision()
## ✖ yardstick::recall() masks caret::recall()
## ✖ car::recode() masks dplyr::recode()
## ✖ MASS::select() masks dplyr::select()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ car::some() masks purrr::some()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
## • Use tidymodels_prefer() to resolve common conflicts.
library(themis)
# Full dataset
data_raw <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-608/refs/heads/main/bank-additional-full.csv", sep = ";")
General overview such as summary stats, missing values, and duplicates:
# Structure of the data
str(data_raw)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
# Glimpse of the data
head(data_raw)
## age job marital education default housing loan contact month
## 1 56 housemaid married basic.4y no no no telephone may
## 2 57 services married high.school unknown no no telephone may
## 3 37 services married high.school no yes no telephone may
## 4 40 admin. married basic.6y no no no telephone may
## 5 56 services married high.school no no yes telephone may
## 6 45 services married basic.9y unknown no no telephone may
## day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 mon 261 1 999 0 nonexistent 1.1
## 2 mon 149 1 999 0 nonexistent 1.1
## 3 mon 226 1 999 0 nonexistent 1.1
## 4 mon 151 1 999 0 nonexistent 1.1
## 5 mon 307 1 999 0 nonexistent 1.1
## 6 mon 198 1 999 0 nonexistent 1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 93.994 -36.4 4.857 5191 no
## 2 93.994 -36.4 4.857 5191 no
## 3 93.994 -36.4 4.857 5191 no
## 4 93.994 -36.4 4.857 5191 no
## 5 93.994 -36.4 4.857 5191 no
## 6 93.994 -36.4 4.857 5191 no
# Summary
summary(data_raw)
## age job marital education
## Min. :17.00 Length:41188 Length:41188 Length:41188
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :40.02
## 3rd Qu.:47.00
## Max. :98.00
## default housing loan contact
## Length:41188 Length:41188 Length:41188 Length:41188
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## month day_of_week duration campaign
## Length:41188 Length:41188 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 102.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.3 Mean : 2.568
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :56.000
## pdays previous poutcome emp.var.rate
## Min. : 0.0 Min. :0.000 Length:41188 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.000 Class :character 1st Qu.:-1.80000
## Median :999.0 Median :0.000 Mode :character Median : 1.10000
## Mean :962.5 Mean :0.173 Mean : 0.08189
## 3rd Qu.:999.0 3rd Qu.:0.000 3rd Qu.: 1.40000
## Max. :999.0 Max. :7.000 Max. : 1.40000
## cons.price.idx cons.conf.idx euribor3m nr.employed
## Min. :92.20 Min. :-50.8 Min. :0.634 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344 1st Qu.:5099
## Median :93.75 Median :-41.8 Median :4.857 Median :5191
## Mean :93.58 Mean :-40.5 Mean :3.621 Mean :5167
## 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
## y
## Length:41188
## Class :character
## Mode :character
##
##
##
plot_intro(data_raw)
# Check for duplicates
duplicates <- duplicated(data_raw)
# Print the duplicates
print(data_raw[duplicates, ])
## age job marital education default housing loan
## 1267 39 blue-collar married basic.6y no no no
## 12262 36 retired married unknown no no no
## 14235 27 technician single professional.course no no no
## 16957 47 technician divorced high.school no yes no
## 18466 32 technician single professional.course no yes no
## 20217 55 services married high.school unknown no no
## 20535 41 technician married professional.course no yes no
## 25218 39 admin. married university.degree no no no
## 28478 24 services single high.school no yes no
## 32517 35 admin. married university.degree no yes no
## 36952 45 admin. married university.degree no no no
## 38282 71 retired single university.degree no no no
## contact month day_of_week duration campaign pdays previous poutcome
## 1267 telephone may thu 124 1 999 0 nonexistent
## 12262 telephone jul thu 88 1 999 0 nonexistent
## 14235 cellular jul mon 331 2 999 0 nonexistent
## 16957 cellular jul thu 43 3 999 0 nonexistent
## 18466 cellular jul thu 128 1 999 0 nonexistent
## 20217 cellular aug mon 33 1 999 0 nonexistent
## 20535 cellular aug tue 127 1 999 0 nonexistent
## 25218 cellular nov tue 123 2 999 0 nonexistent
## 28478 cellular apr tue 114 1 999 0 nonexistent
## 32517 cellular may fri 348 4 999 0 nonexistent
## 36952 cellular jul thu 252 1 999 0 nonexistent
## 38282 telephone oct tue 120 1 999 0 nonexistent
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1267 1.1 93.994 -36.4 4.855 5191.0 no
## 12262 1.4 93.918 -42.7 4.966 5228.1 no
## 14235 1.4 93.918 -42.7 4.962 5228.1 no
## 16957 1.4 93.918 -42.7 4.962 5228.1 no
## 18466 1.4 93.918 -42.7 4.968 5228.1 no
## 20217 1.4 93.444 -36.1 4.965 5228.1 no
## 20535 1.4 93.444 -36.1 4.966 5228.1 no
## 25218 -0.1 93.200 -42.0 4.153 5195.8 no
## 28478 -1.8 93.075 -47.1 1.423 5099.1 no
## 32517 -1.8 92.893 -46.2 1.313 5099.1 no
## 36952 -2.9 92.469 -33.6 1.072 5076.2 yes
## 38282 -3.4 92.431 -26.9 0.742 5017.5 no
We can see here that there are 41,188 observations with 20 features and
one target variable, y (subscribed or not subscribed to a
term deposit). The dataset contains both discrete and continuous features.
education and month are ordinal features, while the remaining categorical
features are nominal.
There are no missing values, but there are duplicate rows in this dataset, so we will have to remove them for an accurate EDA.
Some interesting things to note right from the start:

- The average age is 40, so most people being called are in their 30s or 40s.
- campaign (number of contacts performed during this campaign and for this client) has a low average of ~2, while the max value is significantly larger.
- pdays mostly consists of the value 999, which makes sense given that 999 means never contacted.
- previous mostly consists of zeroes, with a low max value.
- emp.var.rate ranges from positive to negative.

Let's see how many observations are "missing" (i.e., have a value of 999) for pdays:
# Check how many obs where `pdays` is not applicable (= 999)
perc_999 <- data_raw %>%
filter(pdays == 999) %>%
summarise(percentage = n() / nrow(data_raw) * 100) %>%
pull(percentage)
print(perc_999)
## [1] 96.32174
For over 96% of the observations, pdays is not even
applicable. This matches our summary stats, so we should probably remove
this variable during data cleaning.
Let's do an initial cleanup of the data to make the EDA easier and valid:
# Rename columns
names(data_raw)[names(data_raw) == "emp.var.rate"] <- "emp_var_rate"
names(data_raw)[names(data_raw) == "cons.price.idx"] <- "cons_price_idx"
names(data_raw)[names(data_raw) == "cons.conf.idx"] <- "cons_conf_idx"
names(data_raw)[names(data_raw) == "nr.employed"] <- "nr_employed"
# Remove duplicates
data_raw <- unique(data_raw)
str(data_raw)
## 'data.frame': 41176 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp_var_rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons_price_idx: num 94 94 94 94 94 ...
## $ cons_conf_idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr_employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
After removing duplicates, we have 41,176 unique observations.
Let’s create new data frames for categorical and numerical features:
# Break up numerical and categorical variables
data_raw_numeric <- data_raw |>
dplyr::select(where(is.numeric))
numerical_predictors <- names(data_raw_numeric)
print(numerical_predictors)
## [1] "age" "duration" "campaign" "pdays"
## [5] "previous" "emp_var_rate" "cons_price_idx" "cons_conf_idx"
## [9] "euribor3m" "nr_employed"
data_raw_categorical <- data_raw |>
dplyr::select(where(is.factor) | where(is.character))
categorical_predictors <- names(data_raw_categorical)[names(data_raw_categorical) != "y"]
print(categorical_predictors)
## [1] "job" "marital" "education" "default" "housing"
## [6] "loan" "contact" "month" "day_of_week" "poutcome"
Correlation Between Predictors and Target Variable:
Let’s create a correlation funnel to visualize the most highly correlated predictors against the target variable.
# Correlation Funnel using plot_correlation_funnel
# This binarizes the features so it includes categorical features
data_raw_binarized <- data_raw %>%
binarize(n_bins = 4, thresh_infreq = 0.01)
data_raw_correlated_table <- data_raw_binarized %>%
correlate(target = y__yes)
data_raw_correlated_table %>%
plot_correlation_funnel(interactive = FALSE)
## Warning: ggrepel: 21 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
We can see from the correlation funnel that the most highly correlated features (by absolute magnitude) are:
# Top 10 highly correlated features
data_raw_correlated_table %>%
filter(feature %in% c("duration", "pdays", "poutcome",
"nr_employed", "euribor3m", "emp_var_rate", "previous", "cons_price_idx", "contact")) %>%
plot_correlation_funnel(interactive = FALSE, limits = c(-0.4, 0.4))
We can see from this filtered funnel which scenarios are most highly correlated with a "yes" subscription, so we now have an idea of which features are important.
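Rather than hand-picking features from the funnel plot, we can also rank them programmatically. A minimal sketch, assuming the tibble returned by correlate() keeps its feature, bin, and correlation columns:
# Rank binarized feature/bin pairs by absolute correlation with y__yes
data_raw_correlated_table %>%
  filter(feature != "y") %>%                 # drop the target's own bins
  mutate(abs_corr = abs(correlation)) %>%
  arrange(desc(abs_corr)) %>%
  head(15)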
Now let’s look at a correlation matrix to see the correlation between each predictor variable.
Correlation Matrix for Numeric Variables:
# Correlation Matrix for Numeric Variables
data_raw_numeric <- data_raw |>
dplyr::select(where(is.numeric))
cor_matrix <- cor(data_raw_numeric, use = "complete.obs")
# Plot
corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.cex = 0.5,
tl.srt = 45) # Rotate text diagonally
par(mar = c(1, 1, 4, 1)) # top margin = 4 lines
# Convert to DF to view pairs
melt_cor <- melt(cor_matrix)
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
head(sorted_cor)
## Var1 Var2 value
## 86 emp_var_rate euribor3m 0.9722438
## 99 euribor3m nr_employed 0.9451459
## 96 emp_var_rate nr_employed 0.9069495
## 66 emp_var_rate cons_price_idx 0.7752934
## 87 cons_price_idx euribor3m 0.6881798
## 44 pdays previous -0.5875077
We can see from these two outputs that the most highly positively correlated predictor pairs are:

- emp_var_rate and euribor3m
- euribor3m and nr_employed
- emp_var_rate and nr_employed
- emp_var_rate and cons_price_idx
- cons_price_idx and euribor3m

These relationships make sense for the most part, since these are all
economic features, and as employment increases, inflation usually also
increases. What is interesting is that emp_var_rate
increases as nr_employed increases. I would expect an inverse
relationship: a higher employee count should mean a more stable job
market and therefore a lower employment variation rate.
The most highly negatively correlated predictor pairs are:

- pdays and previous
- previous and nr_employed
- previous and euribor3m
- previous and emp_var_rate

As the number of days since the last contact increases, the number of
previous contacts decreases. This shouldn't be fully trusted, though,
since we will most likely remove pdays anyway. It's also
interesting that as the number of contacts performed before this
campaign for a client gets smaller, the number of people employed
increases. I would expect this relationship to be positive: more
previous calls should go hand in hand with more employees (the staff
making those calls). I would imagine previous calls might be low due to
low staffing and limited bandwidth.
Categorical features:
Some of the categorical features have many levels, making them hard to graph and visualize. For these plots, let's combine some levels to simplify things:
# Combine Variables (Feature Engineering)
data_raw_combined <- data_raw %>%
mutate(education_combined = case_when(
data_raw$education %in% c('basic.4y','basic.6y','basic.9y') ~ "basic" ,
data_raw$education == "high.school" ~ "high.school" ,
data_raw$education == "illiterate" ~ "illiterate" ,
data_raw$education == "professional.course" ~ "professional.course",
data_raw$education == "university.degree" ~ "university.degree",
data_raw$education == "unknown" ~ "unknown"
)
)
data_raw_combined <- data_raw_combined %>%
mutate(employed_status = case_when(
data_raw$job %in% c('admin.','blue-collar','entrepreneur', 'housemaid', 'management', 'services', 'technician', 'self-employed') ~ "employed",
data_raw$job == "student" ~ "student",
data_raw$job == "retired" ~ "retired",
data_raw$job == "unemployed" ~ "unemployed",
data_raw$job == "unknown" ~ "unknown"
)
)
data_raw_combined <- data_raw_combined %>%
mutate(season = case_when(
data_raw$month %in% c('sep','oct', 'nov') ~ "fall",
data_raw$month %in% c('dec','jan', 'feb') ~ "winter",
data_raw$month %in% c('mar','apr', 'may') ~ "spring",
data_raw$month %in% c('jun','jul', 'aug') ~ "summer"
)
)
data_raw_combined <- data_raw_combined %>%
dplyr::select(-c(education, job, month))
# Creates dummy variables for categorical features
plot_correlation(na.omit(data_raw_combined), maxcat = 10L)
Here we get some insights about the correlations between the features, such as:

- age and marital_married have a positive relationship (as people get older, they are more likely to be married rather than single).
- campaign and nr_employed have a positive relationship (more contacts are made when there are more employees available to make them).
- poutcome_success and poutcome_failure both have a negative relationship with nr_employed, which seems strange but matches our previous chart results.

Numerical feature relationships via scatterplot:
# Numerical relationships
ggpairs(data_raw_numeric, progress = FALSE)
This graph is a bit hard to read, but none of these pairs show
very strong linear relationships. We can visually see that,
generally, as the employment variation rate increases, the number of
employees increases. We can also clearly see that as the number of
employees increases, euribor3m also increases.
Two-way Cross-Tabulations - Categorical Variables:
# Comparing categorical features with the target variable
CrossTable(x = data_raw_combined$y, y = data_raw_combined$employed_status, chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41176
##
##
## | data_raw_combined$employed_status
## data_raw_combined$y | employed | retired | student | unemployed | unknown | Row Total |
## --------------------|------------|------------|------------|------------|------------|------------|
## no | 33490 | 1284 | 600 | 870 | 293 | 36537 |
## | 6.032 | 37.925 | 40.087 | 0.984 | 0.000 | |
## | 0.917 | 0.035 | 0.016 | 0.024 | 0.008 | 0.887 |
## | 0.899 | 0.747 | 0.686 | 0.858 | 0.888 | |
## | 0.813 | 0.031 | 0.015 | 0.021 | 0.007 | |
## --------------------|------------|------------|------------|------------|------------|------------|
## yes | 3749 | 434 | 275 | 144 | 37 | 4639 |
## | 47.507 | 298.696 | 315.724 | 7.753 | 0.001 | |
## | 0.808 | 0.094 | 0.059 | 0.031 | 0.008 | 0.113 |
## | 0.101 | 0.253 | 0.314 | 0.142 | 0.112 | |
## | 0.091 | 0.011 | 0.007 | 0.003 | 0.001 | |
## --------------------|------------|------------|------------|------------|------------|------------|
## Column Total | 37239 | 1718 | 875 | 1014 | 330 | 41176 |
## | 0.904 | 0.042 | 0.021 | 0.025 | 0.008 | |
## --------------------|------------|------------|------------|------------|------------|------------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 754.709 d.f. = 4 p = 4.953712e-162
##
##
##
CrossTable(x = data_raw_combined$y, y = data_raw_combined$marital, chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41176
##
##
## | data_raw_combined$marital
## data_raw_combined$y | divorced | married | single | unknown | Row Total |
## --------------------|-----------|-----------|-----------|-----------|-----------|
## no | 4135 | 22390 | 9944 | 68 | 36537 |
## | 0.462 | 3.461 | 9.804 | 0.126 | |
## | 0.113 | 0.613 | 0.272 | 0.002 | 0.887 |
## | 0.897 | 0.898 | 0.860 | 0.850 | |
## | 0.100 | 0.544 | 0.241 | 0.002 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|
## yes | 476 | 2531 | 1620 | 12 | 4639 |
## | 3.640 | 27.263 | 77.213 | 0.990 | |
## | 0.103 | 0.546 | 0.349 | 0.003 | 0.113 |
## | 0.103 | 0.102 | 0.140 | 0.150 | |
## | 0.012 | 0.061 | 0.039 | 0.000 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 4611 | 24921 | 11564 | 80 | 41176 |
## | 0.112 | 0.605 | 0.281 | 0.002 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 122.9593 d.f. = 3 p = 1.778423e-26
##
##
##
CrossTable(x = data_raw_combined$y, y = data_raw_combined$default, chisq = TRUE)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation may
## be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41176
##
##
## | data_raw_combined$default
## data_raw_combined$y | no | unknown | yes | Row Total |
## --------------------|-----------|-----------|-----------|-----------|
## no | 28381 | 8153 | 3 | 36537 |
## | 9.564 | 36.197 | 0.043 | |
## | 0.777 | 0.223 | 0.000 | 0.887 |
## | 0.871 | 0.948 | 1.000 | |
## | 0.689 | 0.198 | 0.000 | |
## --------------------|-----------|-----------|-----------|-----------|
## yes | 4196 | 443 | 0 | 4639 |
## | 75.323 | 285.091 | 0.338 | |
## | 0.905 | 0.095 | 0.000 | 0.113 |
## | 0.129 | 0.052 | 0.000 | |
## | 0.102 | 0.011 | 0.000 | |
## --------------------|-----------|-----------|-----------|-----------|
## Column Total | 32577 | 8596 | 3 | 41176 |
## | 0.791 | 0.209 | 0.000 | |
## --------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 406.5561 d.f. = 2 p = 5.217541e-89
##
##
##
CrossTable(x = data_raw_combined$y, y = data_raw_combined$housing, chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41176
##
##
## | data_raw_combined$housing
## data_raw_combined$y | no | unknown | yes | Row Total |
## --------------------|-----------|-----------|-----------|-----------|
## no | 16590 | 883 | 19064 | 36537 |
## | 0.316 | 0.023 | 0.308 | |
## | 0.454 | 0.024 | 0.522 | 0.887 |
## | 0.891 | 0.892 | 0.884 | |
## | 0.403 | 0.021 | 0.463 | |
## --------------------|-----------|-----------|-----------|-----------|
## yes | 2025 | 107 | 2507 | 4639 |
## | 2.487 | 0.184 | 2.424 | |
## | 0.437 | 0.023 | 0.540 | 0.113 |
## | 0.109 | 0.108 | 0.116 | |
## | 0.049 | 0.003 | 0.061 | |
## --------------------|-----------|-----------|-----------|-----------|
## Column Total | 18615 | 990 | 21571 | 41176 |
## | 0.452 | 0.024 | 0.524 | |
## --------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 5.742153 d.f. = 2 p = 0.05663793
##
##
##
CrossTable(x = data_raw_combined$y, y = data_raw_combined$loan, chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41176
##
##
## | data_raw_combined$loan
## data_raw_combined$y | no | unknown | yes | Row Total |
## --------------------|-----------|-----------|-----------|-----------|
## no | 30089 | 883 | 5565 | 36537 |
## | 0.022 | 0.023 | 0.079 | |
## | 0.824 | 0.024 | 0.152 | 0.887 |
## | 0.887 | 0.892 | 0.891 | |
## | 0.731 | 0.021 | 0.135 | |
## --------------------|-----------|-----------|-----------|-----------|
## yes | 3849 | 107 | 683 | 4639 |
## | 0.169 | 0.184 | 0.622 | |
## | 0.830 | 0.023 | 0.147 | 0.113 |
## | 0.113 | 0.108 | 0.109 | |
## | 0.093 | 0.003 | 0.017 | |
## --------------------|-----------|-----------|-----------|-----------|
## Column Total | 33938 | 990 | 6248 | 41176 |
## | 0.824 | 0.024 | 0.152 | |
## --------------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1.099295 d.f. = 2 p = 0.5771532
##
##
##
CrossTable(x = data_raw_combined$y, y = data_raw_combined$contact, chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41176
##
##
## | data_raw_combined$contact
## data_raw_combined$y | cellular | telephone | Row Total |
## --------------------|-----------|-----------|-----------|
## no | 22283 | 14254 | 36537 |
## | 35.517 | 61.714 | |
## | 0.610 | 0.390 | 0.887 |
## | 0.853 | 0.948 | |
## | 0.541 | 0.346 | |
## --------------------|-----------|-----------|-----------|
## yes | 3852 | 787 | 4639 |
## | 279.736 | 486.064 | |
## | 0.830 | 0.170 | 0.113 |
## | 0.147 | 0.052 | |
## | 0.094 | 0.019 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 26135 | 15041 | 41176 |
## | 0.635 | 0.365 | |
## --------------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 863.0314 d.f. = 1 p = 1.067912e-189
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 862.0807 d.f. = 1 p = 1.718741e-189
##
##
CrossTable(x = data_raw_combined$y, y = data_raw_combined$season, chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41176
##
##
## | data_raw_combined$season
## data_raw_combined$y | fall | spring | summer | winter | Row Total |
## --------------------|-----------|-----------|-----------|-----------|-----------|
## no | 4400 | 15243 | 16801 | 93 | 36537 |
## | 30.222 | 2.876 | 3.496 | 29.051 | |
## | 0.120 | 0.417 | 0.460 | 0.003 | 0.887 |
## | 0.817 | 0.900 | 0.900 | 0.511 | |
## | 0.107 | 0.370 | 0.408 | 0.002 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|
## yes | 987 | 1701 | 1862 | 89 | 4639 |
## | 238.033 | 22.654 | 27.537 | 228.808 | |
## | 0.213 | 0.367 | 0.401 | 0.019 | 0.113 |
## | 0.183 | 0.100 | 0.100 | 0.489 | |
## | 0.024 | 0.041 | 0.045 | 0.002 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 5387 | 16944 | 18663 | 182 | 41176 |
## | 0.131 | 0.412 | 0.453 | 0.004 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 582.6779 d.f. = 3 p = 5.734418e-126
##
##
##
CrossTable(x = data_raw_combined$y, y = data_raw_combined$day_of_week, chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41176
##
##
## | data_raw_combined$day_of_week
## data_raw_combined$y | fri | mon | thu | tue | wed | Row Total |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## no | 6980 | 7665 | 7574 | 7133 | 7185 | 36537 |
## | 0.184 | 1.660 | 0.698 | 0.246 | 0.147 | |
## | 0.191 | 0.210 | 0.207 | 0.195 | 0.197 | 0.887 |
## | 0.892 | 0.900 | 0.879 | 0.882 | 0.883 | |
## | 0.170 | 0.186 | 0.184 | 0.173 | 0.174 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## yes | 846 | 847 | 1044 | 953 | 949 | 4639 |
## | 1.445 | 13.077 | 5.500 | 1.937 | 1.160 | |
## | 0.182 | 0.183 | 0.225 | 0.205 | 0.205 | 0.113 |
## | 0.108 | 0.100 | 0.121 | 0.118 | 0.117 | |
## | 0.021 | 0.021 | 0.025 | 0.023 | 0.023 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 7826 | 8512 | 8618 | 8086 | 8134 | 41176 |
## | 0.190 | 0.207 | 0.209 | 0.196 | 0.198 | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 26.05424 d.f. = 4 p = 3.085755e-05
##
##
##
Based on the p-values, employed_status,
marital, default, day_of_week,
contact and season are statistically
significant, indicating these variables are associated with the target
variable. Interestingly, housing and loan are
not statistically significant.
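To avoid reading each cross-tabulation by hand, the same chi-square tests can also be run in a loop and the p-values collected in one vector. A minimal sketch against the combined data frame built above:
# Chi-square test of independence between y and each categorical predictor
cat_vars <- c("employed_status", "marital", "default", "housing",
              "loan", "contact", "season", "day_of_week")
chisq_pvalues <- sapply(cat_vars, function(v) {
  chisq.test(table(data_raw_combined$y, data_raw_combined[[v]]))$p.value
})
sort(chisq_pvalues)  # smallest (most significant) first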
# Distributions
# Numeric Variables
data_raw_numeric |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
ggtitle("Histograms of Numerical Features")
# Categorical Variables
data_raw_categorical |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_bar(fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
ggtitle("Count plot of Categorical Features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Distribution of target variable
data_raw |>
dplyr::select(y) |>
ggplot() +
aes(x = y) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of y", y = "Count") +
theme_minimal()
# Count missing values per Variable
data_raw %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 0 × 2
## # ℹ 2 variables: variable <chr>, missing_count <int>
# Categorical Variables (Filtered for "Yes" Subscribed)
data_raw_categorical_yes <- subset(data_raw_categorical, y == "yes")
data_raw_categorical_yes |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_bar(fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
ggtitle("Count plot of Categorical Features (Subscribed = Yes)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
For the numerical variables:

- age is bimodal and right skewed.
- campaign is unimodal and right skewed (which makes sense since the mean is 2.568).
- cons_conf_idx is multimodal.
- cons_price_idx is multimodal.
- duration is unimodal and right skewed.
- emp_var_rate is multimodal and left skewed.
- euribor3m is multimodal and left skewed.
- nr_employed is multimodal and left skewed.
- pdays is unimodal and left skewed: 999 means never contacted, which happens most of the time. This is a variable to consider cleaning up or removing.
- previous is unimodal and right skewed: with a mean of 0.173 this makes sense, since the value is 0 most of the time, with a max of only 7.

Overall, most of these values are skewed and do not follow a normal
distribution. Many of the social and economic features, such as
euribor3m, are multimodal and left skewed, meaning the
values are usually higher but can change depending on the year and the
state of the economy.
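Since e1071 was loaded for its skewness() function, we can also quantify the skew that the histograms suggest. A quick check:
# Skewness of each numeric predictor (positive = right skew, negative = left skew)
round(sort(sapply(data_raw_numeric, e1071::skewness), decreasing = TRUE), 2)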
For the categorical variables:
The target variable y is binary and categorical, which
confirms that we are dealing with a binary classification problem. There
are many more observations with an output of "no" than "yes".
This could affect the algorithm we choose, so we should be mindful that
we get a fair mixture of both "yes" and "no" observations to build the
best model.
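A quick check puts a number on that imbalance:
# Class proportions for the target variable
round(prop.table(table(data_raw$y)), 3)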
Focusing on just the "yes" observations did not give us much more
insight, except that Thursday is now the most frequent day, which might
be a good indication of when to contact. Additionally, for job, admin
is still the most frequent, but technician is now above blue-collar. The
month data is a bit more even, but May is still the highest and December
is still the lowest. Lastly, we can see here that housing
and loan might not have much of an impact.
I should also note that although a few of these predictors contain "unknown" values, I am treating "unknown" as an actual category and not as missing data. This seems intentional and should be kept in the data.
# Boxplots of all numeric predictors
ggplot(stack(data_raw_numeric), aes(x = ind, y = values)) +
geom_boxplot(color = 'skyblue', outlier.color = 'red') +
coord_cartesian(ylim = c(0, 700)) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
panel.background = element_rect(fill = 'grey96')) +
labs(title = "Boxplots of Predictor Variables", x="Predictors")
Many predictor variables have outliers, shown by the red dots.
Variables such as age, duration,
campaign, pdays, previous and
cons_conf_idx contain values well outside the
interquartile range, and all of them except cons_conf_idx
also have a very large spread. This indicates that some observations had
much higher values in these categories. The other variables show a
fairly small spread. These boxplots match what we've already seen so far.
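To put numbers behind the red dots, we can count how many observations fall outside the usual 1.5 x IQR whiskers for each numeric predictor. A minimal sketch:
# Count observations beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR per numeric predictor
sapply(data_raw_numeric, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})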
# Since we have a lot of variables, let's look at our combined dataset
for (col_name in names(data_raw_combined)) {
plot_boxplot(data_raw_combined, by = col_name, geom_boxplot_args = list("outlier.color" = "red"))
}
Some insights from these boxplots again involve campaign and duration, which stand out for their extreme outliers.
Although there are outliers, I would keep these data points in the model for now to avoid removing any important observations. After creating the model, I would check for outliers, or bad leverage points, by calculating Cook's distance and looking at the residuals. If any data points stand out there, I would then circle back to the dataset, view those observations, and consider removing or imputing them (finding out more information before removing anything).
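As a reference for that later check, here is a minimal sketch of flagging influential points with Cook's distance; it assumes a fitted glm object named log_model (built in the modeling step, not in this assignment):
# Hypothetical post-fit check: flag high-influence observations
cooks_d <- cooks.distance(log_model)                  # log_model = eventual fitted glm
influential <- which(cooks_d > 4 / length(cooks_d))   # common rule of thumb
length(influential)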
The two classification algorithms I would consider for this dataset are Logistic Regression and Decision Trees. The major characteristics of this dataset are that it has a binary output (whether the client has subscribed to a term deposit) and that it is labelled, so we would use a supervised learning algorithm. Logistic Regression's primary usage is classification, particularly binary classification, since it estimates the probability that the target outcome occurs. It also gives the impact of each predictor, which is important when determining which tactics drive subscriptions. Additionally, it is a supervised learning algorithm, which fits our labelled dataset. Logistic Regression can also handle categorical and numerical variables (categorical variables need to be encoded as binary or dummy variables).
The pros of Logistic Regression are that it's very interpretable, so it can be easily explained to people who are non-technical. Additionally, L1 and L2 regularization can be applied to Logistic Regression to help with high-dimensional datasets and prevent overfitting (although normalization or standardization is recommended in that case, which can be a con). Logistic Regression is also fast to train, which is a plus. There are also some cons. Logistic Regression requires that the independent variables be linearly related to the log odds, log(p/(1-p)), of the target variable. It creates linear decision boundaries, which can be limiting for complex datasets (linearly separable data doesn't happen very often in the real world). Also, training is fast, but prediction can be slow compared to a Decision Tree (still fast overall). Additionally, there shouldn't be strong multicollinearity between the predictors (that would make the model less reliable). Another assumption of Logistic Regression is the independence of observations. Lastly, the number of observations should be well above the number of features for a Logistic Regression model, otherwise overfitting might occur.
Decision Tree would also be a good option for this dataset. Decision Tree is also a supervised learning algorithm and it’s best used when interpretability is most important. Having decision rules makes it very easily explainable, which can be useful from a business perspective. It can also handle numerical and categorical features, which this dataset has. Similar to Logistic Regression, it is also used a lot for customer churn prediction.
Besides their interpretability, the pros of Decision Trees are that they can handle missing values and do not require feature scaling or normalization. They can do well for many types of problems, making them versatile. Decision Trees also tend to ignore unimportant features, and there are a few split criteria to choose from. Additionally, there are no assumptions about linear boundaries or about the relationships between the independent and dependent variables. Although training can be slow for a Decision Tree, making predictions is usually faster than with Logistic Regression. Decision Trees also remain usable with small datasets, whereas Logistic Regression might not be as good. The cons are that, in general, they are prone to overfitting, so if accuracy is more important, Logistic Regression might be the better option. Decision Trees can also be unstable: small changes in the data can produce a very different tree.
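For comparison, a Decision Tree baseline could be fit with the rpart package (not loaded above, so this is only a sketch under that assumption, using the training split created later in the data-preparation section):
# Hypothetical decision tree baseline (requires the rpart package)
library(rpart)
tree_model <- rpart(y ~ ., data = train, method = "class")
printcp(tree_model)  # complexity table to guide pruning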
Overall, while both algorithms would fit this dataset, I would recommend trying Logistic Regression first. Logistic Regression is specifically meant for a binary categorical target variable, which is exactly our scenario. Additionally, Logistic Regression is fairly easy to interpret, which should be considered part of the business goal. The fact that Logistic Regression can indicate the impact of each predictor and give good predictive accuracy is a huge plus from a business perspective: being able to pinpoint the best marketing scenarios is crucial for figuring out where marketing is effective and which areas could use more work.
Although many of the Logistic Regression assumption checks are made after the model is created with the training data, there are cleaning steps required to create appropriate training and test sets.
To use the Logistic Regression algorithm, let's convert the categorical
variables to the factor data type, which glm can use:
# Convert categorical to factor since glm can use factors
data_raw_clean <- data_raw
data_raw_clean$job <- as.factor(data_raw_clean$job)
data_raw_clean$marital <- as.factor(data_raw_clean$marital)
data_raw_clean$education <- as.factor(data_raw_clean$education)
data_raw_clean$default <- as.factor(data_raw_clean$default)
data_raw_clean$housing <- as.factor(data_raw_clean$housing)
data_raw_clean$loan <- as.factor(data_raw_clean$loan)
data_raw_clean$contact <- as.factor(data_raw_clean$contact)
data_raw_clean$month <- as.factor(data_raw_clean$month)
data_raw_clean$day_of_week <- as.factor(data_raw_clean$day_of_week)
data_raw_clean$poutcome <- as.factor(data_raw_clean$poutcome)
data_raw_clean$y <- as.factor(data_raw_clean$y)
Since over 96% of pdays is inapplicable (the 999
placeholder), the column carries almost no information, so let's remove
it to simplify the model.
Additionally, as the dataset documentation suggests, duration should
be discarded if the intention is to have a realistic predictive model
(the call duration is not known before a call is made, and once the call
ends the outcome is essentially known), so let's remove that as well.
# Remove variable pdays (96% inapplicable) and duration
data_raw_clean2 <- data_raw_clean |>
dplyr::select(-c(pdays, duration))
Address Multicollinearity:
Since Logistic Regression assumes no strong multicollinearity among the
predictors, let's remove the variables involved in the highly correlated
pairs we found in the EDA (absolute correlations well above 0.4):
euribor3m, emp_var_rate, and nr_employed. This will also
help reduce the amount of data in the model.
# Remove `euribor3m`, `emp_var_rate`, and `nr_employed` (absolute value correlations > 0.4)
data_raw_clean3 <- data_raw_clean2 |>
dplyr::select(-c(euribor3m, emp_var_rate, nr_employed))
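The same decision can also be reached programmatically with caret::findCorrelation on the numeric correlation matrix computed during the EDA (which still includes pdays and duration); the 0.8 cutoff here is an illustrative choice:
# Identify numeric predictors to drop based on pairwise correlation
caret::findCorrelation(cor_matrix, cutoff = 0.8, names = TRUE)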
Although I combined variables in the dataset
data_raw_combined for EDA, I would probably start out my
model with just data_raw_clean3. After creating the model,
I would then test data_raw_combined (after applying the
same changes I made to data_raw_clean3 above) to see if it
would improve the model’s performance.
To ensure objective model evaluation and prevent overfitting, let’s split each dataset into training (80%) and testing (20%) sets.
# Split data into train and test
set.seed(123)
split <- initial_split(data_raw_clean3, prop = 0.8, strata = y)
train <- split |>
training()
test <- split |>
testing()
Standardization could be helpful in simple Logistic Regression, but it is not necessary. Since there is a difference in spreads among my predictors (as shown in the EDA), standardizing the data might be useful. This would be especially important if I ever decided to use regularization.
# Standardize the training and test sets (using training standardization)
set.seed(123)
preproc_params <- preProcess(train, method = c("center", "scale"))
# Apply standardization from the train set
train_standardized <- predict(preproc_params, train)
test_standardized <- predict(preproc_params, test)
As the EDA showed, this dataset is heavily imbalanced which can cause bias. To resolve this, let’s use SMOTE (Synthetic Minority Over-sampling Technique):
set.seed(123)
# Only apply SMOTE to training set
smoted_train_standardized <- smotenc(train_standardized, var = "y", over_ratio = 1)
# Distribution of target variable
smoted_train_standardized |>
dplyr::select(y) |>
ggplot() +
aes(x = y) +
geom_bar(fill = "blue", color = "black") +
labs(title = "Distribution of y (After SMOTE)", y = "Count") +
theme_minimal()
We can now see that the distribution is even for our target variable,
y.
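A quick count confirms the balance:
# Class counts after SMOTE
table(smoted_train_standardized$y)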
The final training set to be used in a Logistic Regression model is
smoted_train_standardized and the test set is
test_standardized.
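As a preview of the modeling step (outside the scope of this assignment), a minimal sketch of how the model could be fit on these sets and evaluated; the 0.5 threshold and the use of caret::confusionMatrix are illustrative choices:
# Hypothetical next step: fit and evaluate the logistic regression model
log_model <- glm(y ~ ., data = smoted_train_standardized, family = binomial)

# Odds ratios make the coefficients easier to explain to a business audience
exp(coef(log_model))

# Predict on the held-out test set and summarize performance
test_probs <- predict(log_model, newdata = test_standardized, type = "response")
test_pred <- factor(ifelse(test_probs > 0.5, "yes", "no"),
                    levels = levels(test_standardized$y))
caret::confusionMatrix(test_pred, test_standardized$y, positive = "yes")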
This assignment went through how I would approach EDA and data preparation to train a Logistic Regression model for the Bank Marketing dataset. The EDA gave a glimpse of which variables are important. When training the model, there's a chance I would need to make multiple adjustments to the data, depending on how the Logistic Regression model performed. Some assumption checks I would have to perform after model creation are multicollinearity (via VIF) and linearity of the predictors with the log-odds of the target.
So, for example, if VIF showed variables with a value > 5, I would most likely need to remove them and train the model again. Similarly, if some variables did not meet the log-odds linearity check, I might have to transform those variables.
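A minimal sketch of those two checks, assuming the fitted model from the sketch above (log_model); binning age into deciles is just one illustrative way to eyeball the log-odds linearity:
# 1. Multicollinearity: (generalized) VIF from the car package
car::vif(log_model)

# 2. Linearity of the log-odds, illustrated with age:
#    bin the predictor, compute the empirical log-odds per bin, and look for a roughly straight line
smoted_train_standardized %>%
  mutate(age_bin = ntile(age, 10)) %>%
  group_by(age_bin) %>%
  summarise(mean_age = mean(age),
            log_odds = log(mean(y == "yes") / (1 - mean(y == "yes")))) %>%
  ggplot(aes(x = mean_age, y = log_odds)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)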
Logistic Regression is a great, easy-to-understand predictive model that can also surface the most important variables, which is extremely helpful for a marketing team.