library(readr)
library(stringr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(corrplot)
## corrplot 0.92 loaded
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.3.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.3.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggthemes)
library(purrr)
library(tidyr)
library(readr)
In the following code chunks, I start by importing the bank marketing dataset and cleaning up its column names by removing unwanted characters and spaces. Then, I convert the appropriate columns to factors for categorical data and ensure that numeric columns are properly formatted, setting the stage for subsequent exploratory analysis.
I begin by importing the bank marketing dataset from “bank-additional-full.csv”, specifying the use of semicolons as delimiters and letting R automatically assign data types to each column. Next, I use glimpse to inspect the structure of the dataset—reviewing column names, data types, and a preview of the values—to ensure everything has been read in correctly.
bank_data <- read_delim("bank-additional-full.csv", delim = ";", col_types = cols())
glimpse(bank_data)
## Rows: 41,188
## Columns: 21
## $ age <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job <chr> "housemaid", "services", "services", "admin.", "service…
## $ marital <chr> "married", "married", "married", "married", "married", …
## $ education <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
## $ default <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
## $ housing <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
## $ loan <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
## $ contact <chr> "telephone", "telephone", "telephone", "telephone", "te…
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
## $ day_of_week <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
## $ duration <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
Next, I created a custom function called clean_column_names() to sanitize the dataset's column names.
- Within clean_column_names(), I replaced periods (.) with underscores (_) to standardize the delimiters.
- I also replaced hyphens (-) with underscores to further enhance consistency.
- I removed spaces from the column names to prevent issues during later data manipulation.
- Finally, I applied clean_column_names() to the dataset, ensuring that all column names are clean and uniform for further analysis.
clean_column_names <- function(dataset) {
colnames(dataset) <- str_replace_all(colnames(dataset), "\\.", "_") # Replace '.' with '_'
colnames(dataset) <- str_replace_all(colnames(dataset), "-", "_") # Replace '-' with '_'
colnames(dataset) <- str_replace_all(colnames(dataset), " ", "") # Remove spaces
return(dataset)
}
bank_data <- clean_column_names(bank_data)
- A vector categorical_cols is created, listing the names of columns that hold categorical data.
- The mutate(across(...)) function from dplyr is used to convert each of these columns to the factor data type.
categorical_cols <- c("job", "marital", "education", "default", "housing",
                      "loan", "contact", "month", "day_of_week", "poutcome", "y")
bank_data <- bank_data %>%
mutate(across(all_of(categorical_cols), as.factor))
- A vector numeric_cols is defined to identify columns that should be numeric.
- These columns are converted with mutate(across(..., as.numeric)), ensuring proper arithmetic operations and statistical analysis later.
numeric_cols <- c("age", "duration", "campaign", "pdays", "previous", "emp_var_rate",
                  "cons_price_idx", "cons_conf_idx", "euribor3m", "nr_employed")
bank_data <- bank_data %>%
mutate(across(all_of(numeric_cols), as.numeric))
Now let us get a high-level view of our data using the glimpse() function. This step helps verify that the cleaning steps have been executed as intended.
We can see that our dataset contains 41,188 observations and 21
variables. It includes both categorical features (e.g., job, marital,
education, default, housing, loan, contact, month, day_of_week,
poutcome, and y) and numerical features (e.g., age, duration, campaign,
pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m,
and nr_employed). My next step is to use the summary() and table() functions to get a quick statistical overview of each variable. This helps identify potential data issues, outliers, and the general distribution of the dataset.
glimpse(bank_data)
## Rows: 41,188
## Columns: 21
## $ age <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job <fct> housemaid, services, services, admin., services, servic…
## $ marital <fct> married, married, married, married, married, married, m…
## $ education <fct> basic.4y, high.school, high.school, basic.6y, high.scho…
## $ default <fct> no, unknown, no, no, no, unknown, no, unknown, no, no, …
## $ housing <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, no,…
## $ loan <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes, n…
## $ contact <fct> telephone, telephone, telephone, telephone, telephone, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, …
## $ day_of_week <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, …
## $ duration <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome <fct> nonexistent, nonexistent, nonexistent, nonexistent, non…
## $ emp_var_rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons_price_idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons_conf_idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr_employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
library(purrr)
bank_data %>%
summary()
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing loan
## university.degree :12168 no :32588 no :18622 no :33950
## high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## contact month day_of_week duration
## cellular :26144 may :13769 fri:7827 Min. : 0.0
## telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
## aug : 6178 thu:8623 Median : 180.0
## jun : 5318 tue:8090 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2632 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
## Median : 2.000 Median :999.0 Median :0.000 success : 1373
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## emp_var_rate cons_price_idx cons_conf_idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08189 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr_employed y
## Min. :4964 no :36548
## 1st Qu.:5099 yes: 4640
## Median :5191
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
##
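Before moving on to plots, a quick frequency check with table(), mentioned above alongside summary(), can confirm how the target variable is distributed. This is a small illustrative sketch, not a required step of the workflow:
# Frequency and proportion of the target variable
table(bank_data$y)
round(prop.table(table(bank_data$y)), 3)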
# Load necessary libraries
library(ggplot2)
library(forcats)
# Function to create and display sorted bar plots for categorical features
plot_categorical <- function(data, var) {
data <- data %>%
mutate(!!sym(var) := fct_reorder(!!sym(var), !!sym(var), .fun = length, .desc = TRUE))
print(
ggplot(data, aes(x = !!sym(var), fill = !!sym(var))) +
geom_bar() +
theme_minimal() +
ggtitle(paste("Distribution of", var)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
)
}
# Create the proportion table
bank_data %>%
count(job, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 12 × 3
## job n proportion
## <fct> <int> <dbl>
## 1 admin. 10422 0.253
## 2 blue-collar 9254 0.225
## 3 technician 6743 0.164
## 4 services 3969 0.0964
## 5 management 2924 0.0710
## 6 retired 1720 0.0418
## 7 entrepreneur 1456 0.0354
## 8 self-employed 1421 0.0345
## 9 housemaid 1060 0.0257
## 10 unemployed 1014 0.0246
## 11 student 875 0.0212
## 12 unknown 330 0.00801
# Plot the distribution
plot_categorical(bank_data, "job")
How is the Job Variable Distributed?
Key Considerations
The distribution of jobs may reflect real-world trends in the labor force, where administrative and blue-collar roles are more prevalent. Further analysis could explore whether job categories influence the target variable (y) or other key factors in the dataset.
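As one way to start that exploration (an illustrative sketch only, not part of the main workflow), the subscription rate per job category could be tabulated with dplyr:
# Sketch: subscription rate by job category
bank_data %>%
  group_by(job) %>%
  summarise(subscription_rate = mean(y == "yes"), n = n()) %>%
  arrange(desc(subscription_rate))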
# Create the proportion table
bank_data %>%
count(marital, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 4 × 3
## marital n proportion
## <fct> <int> <dbl>
## 1 married 24928 0.605
## 2 single 11568 0.281
## 3 divorced 4612 0.112
## 4 unknown 80 0.00194
# Plot the distribution
plot_categorical(bank_data, "marital")
How is Marital Status Distributed?
Key Considerations
Further analysis can explore whether marital status significantly impacts the target variable (y) in predicting term deposit subscriptions.
# Create the proportion table
bank_data %>%
count(education, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 8 × 3
## education n proportion
## <fct> <int> <dbl>
## 1 university.degree 12168 0.295
## 2 high.school 9515 0.231
## 3 basic.9y 6045 0.147
## 4 professional.course 5243 0.127
## 5 basic.4y 4176 0.101
## 6 basic.6y 2292 0.0556
## 7 unknown 1731 0.0420
## 8 illiterate 18 0.000437
# Plot the distribution
plot_categorical(bank_data, "education")
How is the Education Variable Distributed?
Key Considerations:
Further analysis can explore whether education level significantly impacts the likelihood of subscribing (y). For example, whether highly educated individuals are more likely to subscribe could be an interesting area for further analysis.
# Create the proportion table
bank_data %>%
count(default, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
## default n proportion
## <fct> <int> <dbl>
## 1 no 32588 0.791
## 2 unknown 8597 0.209
## 3 yes 3 0.0000728
# Plot the distribution
plot_categorical(bank_data, "default")
What is the Distribution of the Default Variable?
Implications for Modeling:
Further exploration could involve testing the impact of excluding this feature to determine if it influences model performance. Analysis could investigate whether individuals with “unknown” default status behave more like those with “no” defaults or those who have actually defaulted.
# Create the proportion table
bank_data %>%
count(housing, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
## housing n proportion
## <fct> <int> <dbl>
## 1 yes 21576 0.524
## 2 no 18622 0.452
## 3 unknown 990 0.0240
# Plot the distribution
plot_categorical(bank_data, "housing")
What is the Distribution of the Housing Variable?
Implications for Modeling:
Further exploration is needed to determine whether customers with or without housing loans are more likely to subscribe (y).
# Create the proportion table
bank_data %>%
count(loan, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
## loan n proportion
## <fct> <int> <dbl>
## 1 no 33950 0.824
## 2 yes 6248 0.152
## 3 unknown 990 0.0240
# Plot the distribution
plot_categorical(bank_data, "loan")
How is the Loan Variable Distributed?
Implications for Predictive Modeling:
Further analysis can explore whether having a personal loan affects the likelihood of subscription (y). Since a housing loan is more common than a personal loan, it could be insightful to examine how both loan types impact client financial behavior.
# Create the proportion table
bank_data %>%
count(contact, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 2 × 3
## contact n proportion
## <fct> <int> <dbl>
## 1 cellular 26144 0.635
## 2 telephone 15044 0.365
# Plot the distribution
plot_categorical(bank_data, "contact")
How is the Contact Variable Distributed?
Implications for Predictive Modeling:
Clients contacted via cellular phones may be more likely to subscribe (y) compared to those reached via landlines; this should be examined directly.
# Create the proportion table
bank_data %>%
count(month, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 10 × 3
## month n proportion
## <fct> <int> <dbl>
## 1 may 13769 0.334
## 2 jul 7174 0.174
## 3 aug 6178 0.150
## 4 jun 5318 0.129
## 5 nov 4101 0.0996
## 6 apr 2632 0.0639
## 7 oct 718 0.0174
## 8 sep 570 0.0138
## 9 mar 546 0.0133
## 10 dec 182 0.00442
# Plot the distribution
plot_categorical(bank_data, "month")
How is the Month Variable Distributed?
Implications for Predictive Modeling:
If subscription rates (y) vary significantly by month, this feature could be a strong predictor. Further analysis is needed to assess whether customer responsiveness changes across seasons.
# Create the proportion table
bank_data %>%
count(day_of_week, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 5 × 3
## day_of_week n proportion
## <fct> <int> <dbl>
## 1 thu 8623 0.209
## 2 mon 8514 0.207
## 3 wed 8134 0.197
## 4 tue 8090 0.196
## 5 fri 7827 0.190
# Plot the distribution
plot_categorical(bank_data, "day_of_week")
How is the Day of the Week Variable Distributed?
Implications for Predictive Modeling:
Further analysis can assess whether the day of contact influences subscription rates (y).
bank_data <- bank_data %>%
mutate(previous = as.factor(previous))
# Create the proportion table
bank_data %>%
count(previous, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 8 × 3
## previous n proportion
## <fct> <int> <dbl>
## 1 0 35563 0.863
## 2 1 4561 0.111
## 3 2 754 0.0183
## 4 3 216 0.00524
## 5 4 70 0.00170
## 6 5 18 0.000437
## 7 6 5 0.000121
## 8 7 1 0.0000243
# Plot the distribution
plot_categorical(bank_data, "previous")
How is the Previous Contacts Variable Distributed?
Implications for Predictive Modeling:
Clients who were contacted in earlier campaigns may respond differently to the current one (y). This variable might also help assess the effectiveness of re-engagement efforts in the campaign.
# Create the proportion table
bank_data %>%
count(y, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 2 × 3
## y n proportion
## <fct> <int> <dbl>
## 1 no 36548 0.887
## 2 yes 4640 0.113
# Plot the distribution
plot_categorical(bank_data, "y")
How is the Target Variable (y) Distributed?
- The target variable (y) represents whether a customer subscribed to a term deposit.
- Only 11.3% of customers subscribed (yes), while 88.7% did not (no).
Implications for Predictive Modeling:
- The dataset is heavily imbalanced toward the majority class (no). Techniques like SMOTE (Synthetic Minority Over-sampling Technique) or class weighting should be considered to handle this issue.
- Because positive cases (yes) are rare, using metrics like precision, recall, and F1-score instead of accuracy is crucial for evaluating model performance.
Since only around 1 in 9 clients agreed to subscribe, this suggests that the campaign had limited effectiveness. Future analysis could explore:
- What factors distinguish clients who subscribed from those who didn't.
- Whether certain demographics, timing, or contact methods had a higher success rate.
- Potential strategies to improve engagement and response rates.
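As a minimal sketch of one simple rebalancing option, downsampling the majority class (SMOTE itself would require an additional package and is not applied in this report; the object names below are illustrative only):
# Sketch: balance classes by downsampling the majority class
set.seed(123)
minority <- bank_data %>% filter(y == "yes")
majority <- bank_data %>% filter(y == "no") %>% slice_sample(n = nrow(minority))
balanced_data <- bind_rows(minority, majority)
table(balanced_data$y)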
# Load necessary libraries
library(ggplot2)
# Function to create and display histograms for numerical features
plot_numerical <- function(data, var, bins = 30) {
print(
ggplot(data, aes(x = !!sym(var))) +
geom_histogram(bins = bins, fill = "steelblue", color = "black", alpha = 0.7) +
theme_minimal() +
ggtitle(paste("Distribution of", var)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
)
}
# Load necessary library
library(ggplot2)
library(dplyr)
# Function to plot numerical variables (Histogram + Boxplot) with adjustable bins
plot_numeric_variable <- function(data, column_name, bins = 30) {
column_data <- data[[column_name]] # Extract column data
# Ensure the column is numeric
if (is.numeric(column_data)) {
# Remove NAs
clean_data <- data[complete.cases(column_data), ]
par(mfrow = c(1, 2)) # Set up 2 plots in a row
# Plot histogram with dynamic bins
hist(clean_data[[column_name]], main = paste("Histogram of", column_name),
xlab = column_name, breaks = bins, col = "steelblue", border = "black")
# Plot boxplot
boxplot(clean_data[[column_name]], main = paste("Boxplot of", column_name),
horizontal = TRUE, col = "orange")
par(mfrow = c(1, 1)) # Reset plot layout
} else {
message("Skipping non-numeric column: ", column_name)
}
}
# Function to convert summary statistics to a data frame
summary_to_dataframe <- function(data, column_name) {
# Get summary statistics
summary_values <- summary(data[[column_name]])
# Convert to data frame
summary_df <- data.frame(
Statistic = names(summary_values),
Value = as.numeric(summary_values)
)
return(summary_df)
}
# Example usage for "age"
age_summary_df <- summary_to_dataframe(bank_data, "age")
# Display the summary as a data frame
print(age_summary_df)
## Statistic Value
## 1 Min. 17.00000
## 2 1st Qu. 32.00000
## 3 Median 38.00000
## 4 Mean 40.02406
## 5 3rd Qu. 47.00000
## 6 Max. 98.00000
# Example usage with bins specified
plot_numeric_variable(bank_data, "age", bins = 50)
How is the Age Variable Distributed?
Interpretation of the Plots:
Implications for Predictive Modeling:
- Since most customers are in their 30s and 40s, age may be a key factor in subscription likelihood.
- The presence of outliers (older customers) could require winsorization or transformation to avoid skewing the model.
- Further analysis should check whether older or younger customers are more likely to subscribe (y), as this may guide marketing strategies.
campaign_summary_df <- summary_to_dataframe(bank_data, "campaign")
# Display the summary as a data frame
print(campaign_summary_df)
## Statistic Value
## 1 Min. 1.000000
## 2 1st Qu. 1.000000
## 3 Median 2.000000
## 4 Mean 2.567593
## 5 3rd Qu. 3.000000
## 6 Max. 56.000000
# Example usage with bins specified
plot_numeric_variable(bank_data, "campaign", bins = 50)
How is the Campaign Variable Distributed?
Interpretation of the Plots:
Implications for Predictive Modeling:
- Repeated contact attempts may not increase subscriptions (y) and may instead lead to customer fatigue.
- Further analysis should compare the number of contacts with y to determine if additional calls increase the likelihood of a subscription.
contact_summary_df <- summary_to_dataframe(bank_data, "duration")
# Display the summary as a data frame
print(contact_summary_df)
## Statistic Value
## 1 Min. 0.000
## 2 1st Qu. 102.000
## 3 Median 180.000
## 4 Mean 258.285
## 5 3rd Qu. 319.000
## 6 Max. 4918.000
# Example usage with bins specified
plot_numeric_variable(bank_data, "duration", bins = 50)
How is the Duration Variable Distributed?
Interpretation of the Plots
Implications for Predictive Modeling
- Duration is likely a strong predictor of the response (y), as longer calls are highly correlated with successful conversions.
- Further analysis should examine duration against y, confirming whether longer calls lead to higher success rates.
pdays_summary_df <- summary_to_dataframe(bank_data, "pdays")
# Display the summary as a data frame
print(pdays_summary_df)
## Statistic Value
## 1 Min. 0.0000
## 2 1st Qu. 999.0000
## 3 Median 999.0000
## 4 Mean 962.4755
## 5 3rd Qu. 999.0000
## 6 Max. 999.0000
plot_numeric_variable(bank_data, "pdays", bins = 50)
How is the pdays Variable Distributed?
Interpretation of the Plots:
Implications for Predictive Modeling:
- Create a binary flag pdays_contacted = ifelse(pdays == 999, 0, 1), indicating whether the client had prior contact.
- Alternatively, group pdays into bins (e.g., recent contact vs. long ago).
- Compare the new feature against y to confirm whether prior contact history impacts subscription rates.
# Convert emp_var_rate to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
mutate(emp_var_rate = as.numeric(as.character(emp_var_rate)))
emp.var.rate_summary_df <- summary_to_dataframe(bank_data, "emp_var_rate")
# Display the summary as a data frame
print(emp.var.rate_summary_df)
## Statistic Value
## 1 Min. -3.4000000
## 2 1st Qu. -1.8000000
## 3 Median 1.1000000
## 4 Mean 0.0818855
## 5 3rd Qu. 1.4000000
## 6 Max. 1.4000000
plot_numeric_variable(bank_data, "emp_var_rate", bins = 50)
How is the emp_var_rate Variable Distributed?
Interpretation of the Plots
- emp_var_rate only takes specific values rather than a continuous range.
Implications for Predictive Modeling
- emp_var_rate is a macro-economic indicator that may influence consumer financial decisions.
- Periods of negative employment variation may be associated with fewer term deposit subscriptions (y), as people save less during recessions.
- Feature Engineering Recommendations: group emp_var_rate into categorical bins (e.g., recession, stable, growth; a small sketch follows below), compare it against y to determine its predictive power, and assess interactions with cons_price_idx.
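As a rough illustration of the binning idea (a sketch only; the cut points below are assumptions and are not used elsewhere in this analysis):
# Sketch: bin emp_var_rate into broad economic regimes (illustrative cut points)
bank_data %>%
  mutate(emp_var_bin = case_when(
    emp_var_rate < -1 ~ "recession",
    emp_var_rate <= 1 ~ "stable",
    TRUE ~ "growth"
  )) %>%
  count(emp_var_bin)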
# Convert cons_price_idx to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
mutate(cons_price_idx = as.numeric(as.character(cons_price_idx)))
cons_price_idx_summary_df <- summary_to_dataframe(bank_data, "cons_price_idx")
# Display the summary as a data frame
print(cons_price_idx_summary_df)
## Statistic Value
## 1 Min. 92.20100
## 2 1st Qu. 93.07500
## 3 Median 93.74900
## 4 Mean 93.57566
## 5 3rd Qu. 93.99400
## 6 Max. 94.76700
plot_numeric_variable(bank_data, "cons_price_idx", bins = 50)
How is the cons_price_idx Variable Distributed?
Interpretation of the Plots:
- cons_price_idx only takes on certain values, likely due to it being recorded at fixed economic reporting periods.
- cons_price_idx varies over time but within a narrow range.
Implications for Predictive Modeling:
- The consumer price index reflects inflation and economic stability, which can impact financial decisions.
- A higher price index might be associated with reduced term deposit subscriptions (y) if inflation lowers consumer purchasing power.
Feature Engineering Recommendations:
- Compare cons_price_idx against y to determine its predictive power.
- Interactions with emp_var_rate may capture macroeconomic trends more effectively.
# Convert euribor3m to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
mutate(euribor3m = as.numeric(as.character(euribor3m)))
euribor3m_summary_df <- summary_to_dataframe(bank_data, "euribor3m")
# Display the summary as a data frame
print(euribor3m_summary_df)
## Statistic Value
## 1 Min. 0.634000
## 2 1st Qu. 1.344000
## 3 Median 4.857000
## 4 Mean 3.621291
## 5 3rd Qu. 4.961000
## 6 Max. 5.045000
plot_numeric_variable(bank_data, "euribor3m", bins = 50)
How is the euribor3m Variable Distributed?
The minimum is 0.63, and the maximum is 5.05. The median (4.86) and mean (3.62) suggest a skewed distribution toward lower values. The IQR (1.34 – 4.96) indicates most values fall within this range.
Interpretation of the Plots:
The histogram shows a multimodal distribution with peaks around 1 and 5, reflecting economic cycles. The boxplot confirms a wide spread without extreme outliers.
Implications for Predictive Modeling:
- Higher Euribor rates may be associated with fewer term deposit subscriptions (y) due to increased borrowing costs.
- Compare euribor3m against y to determine predictive strength.
- Assess interactions with cons_price_idx and emp_var_rate to capture macroeconomic effects.
# Convert nr_employed to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
mutate(nr_employed = as.numeric(as.character(nr_employed)))
nr_employed_summary_df <- summary_to_dataframe(bank_data, "nr_employed")
# Display the summary as a data frame
print(nr_employed_summary_df)
## Statistic Value
## 1 Min. 4963.600
## 2 1st Qu. 5099.100
## 3 Median 5191.000
## 4 Mean 5167.036
## 5 3rd Qu. 5228.100
## 6 Max. 5228.100
plot_numeric_variable(bank_data, "nr_employed", bins = 50)
How is the nr_employed Variable Distributed?
The minimum number of employees recorded is 4963.6, while the maximum is 5228.1. The median (5191.0) and mean (5167.0) suggest a slightly right-skewed distribution. The IQR (5099.1 – 5228.1) indicates that most values are concentrated in this range.
Interpretation of the Plots:
The histogram reveals a discrete, multimodal distribution, reflecting distinct periods of employment levels. The boxplot suggests a steady increase in employment levels, with no extreme outliers.
Implications for Predictive Modeling:
- Employment levels influence economic stability, potentially affecting term deposit subscriptions (y).
- Higher employment may correlate with increased savings and financial security, leading to higher term deposit subscriptions.
- Feature Engineering Recommendations:
- Categorize employment levels into low, moderate, and high employment periods.
- Assess interactions with emp_var_rate and euribor3m to capture macroeconomic trends.
Since nr_employed reflects labor market conditions, it is likely a strong predictor of financial behaviors and term deposit subscriptions.
library(stats)
library(corrplot)
bank_data %>%
keep(is.numeric) %>% # Select only numeric variables for correlation analysis
cor() %>% # Compute correlation matrix
corrplot() # Plot correlation heatmap
The results in the correlation plot show that we have a few features that are highly correlated:
- euribor3m, nr_employed, and emp_var_rate are strongly positively correlated. This suggests that as the Euribor 3-month rate increases, so does the number of employed individuals and the employment variation rate.
- cons_price_idx (consumer price index) is positively correlated with euribor3m, indicating that higher consumer prices are associated with higher Euribor interest rates.
- emp_var_rate and nr_employed have a strong correlation, which makes sense as employment variation often aligns with total employment figures.
We also observe some negative correlations:
- cons_conf_idx (consumer confidence index) is negatively correlated with euribor3m and nr_employed, meaning that when consumer confidence is lower, interest rates and employment figures tend to be higher.
- Similarly, emp_var_rate is negatively correlated with cons_conf_idx, suggesting that employment variation decreases when consumer confidence is lower.
Before deciding what to do with these correlated variables, we should conduct further analysis to determine whether any variables are redundant or if they provide unique information. If two variables are nearly identical in their impact, one could be removed or transformed to reduce multicollinearity in any future modeling.
# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
# Min-Max Normalization Function
min_max_normalize <- function(x) {
return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
}
# Select numeric variables and apply min-max scaling
numeric_data <- bank_data %>%
dplyr::select(where(is.numeric)) %>%
mutate(across(everything(), min_max_normalize))
# Boxplots for standardized numerical variables
numeric_data %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Variable, y = Value, fill = Variable)) +
geom_boxplot(alpha = 0.6) +
coord_flip() +
theme_minimal() +
theme(legend.position = "none") +
ggtitle("Outlier Detection Using Boxplots")
Analysis of the Min-Max Standardized Boxplots (Outlier Detection)
The boxplot displays the distribution of min-max scaled numerical variables, highlighting potential outliers. Here’s an analysis of key observations:
Variables with Many Outliers
- campaign and duration have a large number of outliers.
- previous and pdays (days since last contact) also show several extreme values.
Highly Skewed Variables
- euribor3m (Euribor 3-month rate) and nr_employed (number of employees) have long right tails, meaning they have some very high values.
- emp_var_rate (employment variation rate) shows a wide range but is not heavily skewed.
Variables with a Compact Range
- age and cons_price_idx (consumer price index) appear relatively well-distributed.
- cons_conf_idx (consumer confidence index) is tightly packed but has some outliers.
Interpretation of Outliers
- Campaign-related variables (campaign, previous, pdays) have extreme values, possibly indicating repeated contacts to some clients or first-time contacts for others.
- Economic indicators (euribor3m, nr_employed, emp_var_rate) contain extreme values, suggesting large fluctuations in the financial market over time.
- Call duration (duration) outliers indicate long conversations, which might be linked to successful subscriptions.
# Relationship between call duration and response (y)
ggplot(bank_data, aes(x = duration, fill = y)) +
geom_density(alpha = 0.5) +
theme_minimal() +
ggtitle("Call Duration vs. Response (y)")
This density plot visualizes the relationship between call duration and the response variable (y), which indicates whether a customer subscribed to a term deposit.
Key Observations:
- Customers who subscribed (y = yes) tend to have longer call durations compared to those who did not (y = no).
Implications for Predictive Modeling:
- Call duration appears to be strongly associated with the response (y). Longer calls may indicate a higher chance of convincing the customer.
# Trend in campaign efforts over months
ggplot(bank_data, aes(x = month, fill = y)) +
geom_bar(position = "fill") +
theme_minimal() +
ggtitle("Subscription Rate by Month") +
ylab("Proportion of Subscriptions")
This stacked bar chart illustrates the proportion of term deposit subscriptions (y) by month.
Key Observations:
Implications for Predictive Modeling:
# Order days of the week
bank_data <- bank_data %>%
mutate(day_of_week = factor(day_of_week,
levels = c("mon", "tue", "wed", "thu", "fri")))
# Trend in campaign efforts over days of the week
ggplot(bank_data, aes(x = day_of_week, fill = y)) +
geom_bar(position = "fill") +
theme_minimal() +
ggtitle("Subscription Rate by Day of the Week") +
ylab("Proportion of Subscriptions") +
xlab("Day of the Week")
The subscription rate remains relatively consistent across all weekdays, with no significant variation in the proportion of successful (yes) and unsuccessful (no) term deposit subscriptions. This suggests that the day of the week does not play a major role in influencing customer decisions regarding subscriptions.
We will assess Multiple Linear Regression (MLR), Logistic Regression (LR), k-Nearest Neighbors (kNN), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Naive Bayes (NB) based on scalability, interpretability, assumptions, and suitability for categorical data.
Multiple Linear Regression (MLR)
- Cons: Cannot properly model a binary target such as y.
- Suitability: Not suitable (MLR is meant for regression, not classification).
Logistic Regression (LR)
- Pros: Directly models the probability of the binary outcome (y) and stays efficient and interpretable on a dataset of this size.
- Suitability: Best suited for this dataset due to its efficiency and interpretability.
k-Nearest Neighbors (kNN)
- Cons: Computationally expensive at this scale and requires extensive preprocessing of categorical variables (job, education).
- Suitability: Not practical (computationally inefficient and requires extensive preprocessing).
Linear Discriminant Analysis (LDA)
- Cons: Relies on normality assumptions that may not hold for these predictors.
- Suitability: Potential alternative, but normality assumptions should be checked.
Quadratic Discriminant Analysis (QDA)
- Cons: Estimates many parameters and needs more data per class, raising the risk of overfitting.
- Suitability: Not ideal due to its complexity and data requirements.
Naive Bayes (NB)
- Pros: Fast and handles categorical features (job, education, marital) well.
- Cons: Assumes conditional independence between features.
- Suitability: Good alternative if independence assumption does not significantly impact performance.
Logistic Regression (LR) is the best choice because:
- It handles categorical and numerical variables well.
- It scales efficiently with large datasets (~41,188 records).
- It provides interpretable results and probability scores.
The dataset includes binary labels (y), making classification algorithms appropriate. LR is preferred over MLR (for regression) and kNN (inefficient for large datasets).
Yes. Naive Bayes or kNN could be viable since they perform well on small datasets, while QDA risks overfitting.
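As a minimal illustration of this choice (a sketch only: the predictors are an arbitrary subset, no train/test split or resampling is performed, and lr_sketch is just an illustrative object name):
# Sketch: fit a logistic regression on a few predictors and inspect predicted probabilities
lr_sketch <- glm(y ~ age + duration + campaign + contact + poutcome,
                 data = bank_data, family = binomial)
summary(lr_sketch)
head(predict(lr_sketch, type = "response"))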
Since the unknown values in housing and loan are small (~2.40%), imputing with the mode ensures minimal distortion to the distribution. However, for default, where ~20.87% of values are unknown, replacing them with the most frequent category may introduce bias. Instead, “unknown” is preserved as a separate category to allow the model to capture any predictive patterns associated with it. This approach maintains data integrity while minimizing unnecessary imputation bias.
The code below performs missing data handling for the categorical variables default, housing, and loan in bank_data:
- It calculates the percentage of "unknown" values in each of these columns.
- It replaces "unknown" with "missing" for housing and loan while keeping default unchanged.
- It imputes the missing values in housing and loan using the mode (most frequent value in each column).
# Load necessary library
library(dplyr)
# Function to calculate missing percentages
calculate_missing_percentage <- function(data, cols) {
missing_summary <- data %>%
summarise(across(all_of(cols), ~ mean(. == "unknown") * 100)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Percentage")
return(missing_summary)
}
# Calculate missing percentages before transformation
missing_before <- calculate_missing_percentage(bank_data, c("default", "housing", "loan"))
# Display missing percentages
print(missing_before)
## # A tibble: 3 × 2
## Variable Missing_Percentage
## <chr> <dbl>
## 1 default 20.9
## 2 housing 2.40
## 3 loan 2.40
# Replace "unknown" with "missing" in categorical variables (only for housing and loan)
bank_data <- bank_data %>%
mutate(across(c(housing, loan), ~ ifelse(. == "unknown", "missing", .)))
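# Note: ifelse() on factor columns returns the factors' underlying integer codes, so housing
# and loan now hold "1"/"3" rather than "no"/"yes" (which is why dummy columns such as
# housing_1 and housing_3 appear later); wrapping the column in as.character() first would
# preserve the original labels.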
# Calculate missing percentages after replacing "unknown" with "missing"
missing_after_replace <- calculate_missing_percentage(bank_data, c("housing", "loan"))
# Impute Missing Values for housing and loan with Mode (Most Frequent Value)
for (col in c("housing", "loan")) {
most_frequent <- names(which.max(table(bank_data[[col]])))
bank_data[[col]][bank_data[[col]] == "missing"] <- most_frequent
}
# Calculate missing percentages after imputation
missing_after_imputation <- calculate_missing_percentage(bank_data, c("housing", "loan"))
# Display missing percentages after imputation
print(missing_after_imputation)
## # A tibble: 2 × 2
## Variable Missing_Percentage
## <chr> <dbl>
## 1 housing 0
## 2 loan 0
# Load necessary libraries
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
# Compute correlation matrix for numeric features
cor_matrix <- cor(bank_data %>% select(where(is.numeric)), use = "pairwise.complete.obs")
# Convert correlation matrix to a data frame for readability
cor_df <- as.data.frame(as.table(cor_matrix))
# Filter to show only correlations >|0.8| (excluding self-correlations)
significant_correlations <- cor_df %>%
filter(Var1 != Var2, abs(Freq) > 0.8) %>%
arrange(desc(abs(Freq)))
# Print only significant correlations
print("Significant Correlations (>|0.8|):")
## [1] "Significant Correlations (>|0.8|):"
print(significant_correlations)
## Var1 Var2 Freq
## 1 euribor3m emp_var_rate 0.9722447
## 2 emp_var_rate euribor3m 0.9722447
## 3 nr_employed euribor3m 0.9451544
## 4 euribor3m nr_employed 0.9451544
## 5 nr_employed emp_var_rate 0.9069701
## 6 emp_var_rate nr_employed 0.9069701
# Find highly correlated variables (correlation > 0.85) for removal
highly_correlated_indices <- findCorrelation(cor_matrix, cutoff = 0.85)
# Get variable names of highly correlated features
highly_correlated_vars <- names(bank_data %>% select(where(is.numeric)))[highly_correlated_indices]
# **Explicitly add 'day_of_week' for removal due to lack of predictive power**
removal_candidates <- c(highly_correlated_vars, "day_of_week")
# Ensure only existing variables are removed
removed_vars <- intersect(names(bank_data), removal_candidates)
# Remove highly correlated and non-informative variables from dataset
bank_data <- bank_data %>% select(-all_of(removed_vars))
# Print the variables that were actually removed
print("Variables Removed from the Dataset:")
## [1] "Variables Removed from the Dataset:"
print(removed_vars)
## [1] "day_of_week" "emp_var_rate" "euribor3m"
The analysis reveals that several variables exhibit significant correlations (>|0.85|), indicating strong relationships that could introduce multicollinearity issues and impact model performance. Notably:
- euribor3m and emp_var_rate have a very high correlation (0.972), suggesting a direct relationship between the Euro Interbank Offered Rate (Euribor) and employment variation rate.
- nr_employed is strongly correlated with euribor3m (0.945), indicating that the number of employees is highly dependent on the Euribor rate.
- nr_employed also has a strong correlation with emp_var_rate (0.907), reinforcing the relationship between employment levels and economic indicators.
Due to these high correlations, euribor3m and emp_var_rate were removed from the dataset to reduce redundancy and mitigate potential multicollinearity. This ensures that the predictive model remains interpretable and avoids issues related to highly correlated variables influencing model outcomes.
Additionally, day_of_week was removed because its distribution showed no meaningful variation in subscription rates across different days. Since the response variable (‘y’) remains nearly identical regardless of the day, this feature is unlikely to contribute to model performance and was removed to reduce dimensionality.
To enhance the dataset for predictive modeling while preserving the pattern within each predictor, the following transformations were carried out:
- duration and campaign exhibit extreme right-skewness. Applying a Box-Cox transformation may help normalize these distributions while maintaining meaningful differences.
# Load necessary libraries
library(dplyr)
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:PerformanceAnalytics':
##
## kurtosis, skewness
library(tidyr)
library(ggplot2)
# Function to calculate skewness
calculate_skewness <- function(data, cols) {
data %>%
summarise(across(all_of(cols), ~ skewness(.x, na.rm = TRUE))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness")
}
# Select duration variable
numeric_var_duration <- "duration"
# Calculate skewness before transformation
skewness_before_duration <- calculate_skewness(bank_data, numeric_var_duration)
# Apply Box-Cox Transformation for duration
x_adj_duration <- bank_data$duration + 1 # Adjust for zero values
lambda_duration <- BoxCox.lambda(x_adj_duration, method = "loglik")
bank_data$duration_boxcox <- BoxCox(x_adj_duration, lambda_duration)
# Compute Skewness After Transformation
skewness_after_duration <- calculate_skewness(bank_data, "duration_boxcox")
# Merge Skewness Before and After
skewness_df_duration <- left_join(skewness_before_duration, skewness_after_duration, by = "Variable", suffix = c("_Before", "_After"))
# Visualize Distribution Before and After Transformation
bank_data %>%
dplyr::select(all_of(c("duration", "duration_boxcox"))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 50, fill = "blue", alpha = 0.5) +
facet_wrap(~ Variable, scales = "free") +
theme_minimal() +
ggtitle("Distributions Before and After Box-Cox Transformation for Duration")
The Box-Cox transformation has significantly improved the distribution of the duration variable. Initially, the distribution was highly right-skewed, with a large concentration of values near zero and a long tail extending towards higher values.
After applying the Box-Cox transformation, the distribution appears much more normal, with a symmetrical bell shape centered around its mean. This transformation will likely enhance the performance of models that assume normality, such as LDA and QDA, and may improve the interpretability of logistic regression.
# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(e1071)
library(forecast)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Function to calculate skewness
calculate_skewness <- function(data, cols) {
data %>%
summarise(across(all_of(cols), ~ skewness(.x, na.rm = TRUE))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness")
}
# Select campaign variable
numeric_var_campaign <- "campaign"
# Calculate skewness before transformation
skewness_before_campaign <- calculate_skewness(bank_data, numeric_var_campaign)
# Apply Box-Cox Transformation for campaign
x_adj_campaign <- bank_data$campaign + 1 # Adjust for zero values
lambda_campaign <- BoxCox.lambda(x_adj_campaign, method = "loglik")
bank_data$campaign_boxcox <- BoxCox(x_adj_campaign, lambda_campaign)
# Apply Log1p Transformation
bank_data$campaign_log <- log1p(bank_data$campaign)
# Apply Square Root Transformation
bank_data$campaign_sqrt <- sqrt(bank_data$campaign)
# Apply Reciprocal Transformation (1/x) - Handle zeros separately
bank_data$campaign_reciprocal <- ifelse(bank_data$campaign == 0, 0, 1 / bank_data$campaign)
# Apply Binarization (High vs. Low Engagement)
bank_data$campaign_binary <- ifelse(bank_data$campaign > median(bank_data$campaign, na.rm = TRUE), "High", "Low")
# Compute Skewness After Each Transformation (except binary)
transformed_vars_campaign <- c("campaign_boxcox", "campaign_log", "campaign_sqrt", "campaign_reciprocal")
skewness_after_campaign <- calculate_skewness(bank_data, transformed_vars_campaign)
# Merge Skewness Before and After
skewness_df_campaign <- left_join(skewness_before_campaign, skewness_after_campaign, by = "Variable", suffix = c("_Before", "_After"))
# Visualize Distributions Before and After Transformations
bank_data %>%
dplyr::select(all_of(c("campaign", transformed_vars_campaign))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 50, fill = "blue", alpha = 0.5) +
facet_wrap(~ Variable, scales = "free", nrow = 2, ncol = 3) + # Facet with 2 rows and 3 columns
theme_minimal() +
ggtitle("Distributions Before and After Various Transformations for Campaign")
# Visualize Binarized Campaign Variable
ggplot(bank_data, aes(x = campaign_binary, fill = campaign_binary)) +
geom_bar(alpha = 0.5) +
theme_minimal() +
ggtitle("Binarized Campaign Variable (High vs. Low Engagement)")
# Load necessary libraries
library(dplyr)
library(ggplot2)
# Check for NA values before binning
bank_data <- bank_data %>% mutate(campaign = as.numeric(campaign))
# Define improved bins for 'campaign'
bank_data <- bank_data %>%
mutate(campaign_bin = case_when(
campaign == 1 ~ "1",
campaign == 2 ~ "2",
campaign == 3 ~ "3",
campaign == 4 ~ "4",
campaign == 5 ~ "5",
campaign == 6 ~ "6",
campaign >= 7 & campaign <= 10 ~ "7-10",
campaign > 10 ~ "11+",
TRUE ~ "Missing" # Handling NA values
))
# Convert to ordered factor
bank_data$campaign_bin <- factor(bank_data$campaign_bin,
levels = c("1", "2", "3", "4", "5", "6", "7-10", "11+", "Missing"))
# Plot histogram for binned 'campaign' variable
ggplot(bank_data, aes(x = campaign_bin, fill = campaign_bin)) +
geom_bar(alpha = 0.7) +
theme_minimal() +
ggtitle("Binned Campaign Variable with More Granular Bins") +
xlab("Campaign Bin") +
ylab("Count") +
theme(legend.position = "none")
Binning vs. Binarization for the Campaign Variable:
Binning was considered because alternative transformations (Box-Cox, log, square root, and reciprocal) did not sufficiently resolve the skewness in the campaign variable. Given the heavy concentration of low values and the sharp decline in higher values, binning provides a more structured representation, making the variable more suitable for models like Logistic Regression, LDA, QDA, kNN, and Naive Bayes, which may struggle with highly skewed continuous inputs.
The finer binning approach captures more variability compared to simple binarization. While binarization simplifies interpretation, it may discard important distinctions between different engagement levels. Granular binning retains more information about customer interaction patterns while still reducing skewness, making it potentially more useful for models like kNN, LDA, and QDA, which can leverage multiple discrete categories effectively. However, for Logistic Regression and Naive Bayes, a binary variable might still be preferable to avoid sparse category issues.
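One quick way to judge this trade-off (an illustrative sketch; y is still the original yes/no factor at this point) is to compare subscription rates across the granular bins and the binary split:
# Sketch: subscription rate within each campaign bin vs. the binary split
bank_data %>% group_by(campaign_bin) %>% summarise(subscription_rate = mean(y == "yes"))
bank_data %>% group_by(campaign_binary) %>% summarise(subscription_rate = mean(y == "yes"))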
Best Choice?
- poutcome_bin: Converts poutcome into a binary variable (1 = success, 0 = others).
bank_data <- bank_data %>%
  mutate(poutcome_bin = ifelse(poutcome == "success", 1, 0))
- Binarizing campaign: campaign_binary classifies clients into high vs. low engagement based on the median number of contacts.
if (!"campaign_binary" %in% names(bank_data)) {
  bank_data <- bank_data %>%
    mutate(campaign_binary = ifelse(campaign > median(campaign, na.rm = TRUE), "High", "Low"))
}
- Keeping month as a factor instead of seasonal binning: treating month as a factor allows models to capture variations in client behavior across different months.
bank_data$month <- as.factor(bank_data$month)
- Lumping rare job categories: job levels that make up less than 2% of observations are grouped into an "Other" category.
bank_data$job <- fct_lump(bank_data$job, prop = 0.02)
- Converting pdays into a more useful feature: contacted_before (1 if previously contacted, 0 if never contacted). Since pdays = 999 means never contacted before, this transformation simplifies model learning.
bank_data <- bank_data %>%
  mutate(contacted_before = ifelse(pdays == 999, 0, 1)) %>%
  dplyr::select(-pdays) # Remove `pdays`
- previous_contacts_ratio: Measures prior engagement by normalizing the number of previous contacts relative to campaign attempts.
- loan_housing_combo: Encodes the combination of loan and housing into a single categorical variable.
bank_data <- bank_data %>%
  mutate(
    previous_contacts_ratio = as.numeric(as.character(previous)) / (as.numeric(as.character(campaign)) + 1),
    loan_housing_combo = paste0(loan, "_", housing)
  )
categorical_vars <- c("contact", "campaign_binary", "default", "education", "housing", "job",
"loan", "marital", "month", "loan_housing_combo", "poutcome")
bank_data <- bank_data %>%
mutate(across(all_of(categorical_vars), as.factor))
- One-hot encoding the categorical variables creates a dummy column for each level (e.g., job_admin, job_blue-collar).
if (!requireNamespace("fastDummies", quietly = TRUE)) {
  install.packages("fastDummies")
}
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.3.3
bank_data <- dummy_cols(bank_data, select_columns = categorical_vars, remove_selected_columns = TRUE)
- Encoding the target variable (y): converting y to a binary factor (0 = no, 1 = yes) ensures compatibility with classification models.
bank_data <- bank_data %>%
  mutate(y = factor(ifelse(y == "yes", 1, 0)))
- Moving y to the last column:
bank_data <- bank_data %>%
  dplyr::select(-y, everything(), y)
- Replacing special characters such as - and . in column names ensures compatibility with certain ML libraries that do not support special characters.
colnames(bank_data) <- gsub("-", "_", colnames(bank_data))
colnames(bank_data) <- gsub("\\.", "_", colnames(bank_data))
- Ensuring y remains a factor avoids issues during classification.
bank_data <- bank_data %>%
  mutate(y = factor(y, levels = c(0, 1)))
glimpse(bank_data)
## Rows: 41,188
## Columns: 69
## $ age <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, …
## $ duration <dbl> 261, 149, 226, 151, 307, 198, 139, 217, …
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cons_price_idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, …
## $ cons_conf_idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4…
## $ nr_employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191…
## $ duration_boxcox <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_boxcox <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, …
## $ campaign_log <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_sqrt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_reciprocal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_bin <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ previous_contacts_ratio <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1…
## $ default_unknown <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ default_yes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_6y <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_9y <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0…
## $ education_illiterate <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0…
## $ education_university_degree <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ housing_1 <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0…
## $ housing_3 <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ job_admin_ <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0…
## $ job_entrepreneur <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ job_management <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services <int> 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0…
## $ job_student <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ job_unemployed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1 <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ loan_3 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_divorced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ marital_married <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ marital_single <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0…
## $ marital_unknown <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1 <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ loan_housing_combo_1_3 <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ loan_housing_combo_3_1 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ loan_housing_combo_3_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ y <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
This script standardizes numeric variables using different scaling techniques while ensuring that the target variable (y) remains unchanged. The transformations applied depend on the characteristics of each variable.
Key Implementation Details:
- The transformed variables are given prefixes (z_, minmax_, robust_) to clearly differentiate them from their original versions.
- The original versions of transformed variables are removed to avoid redundancy, except for y, which remains unchanged.
- The order of columns is preserved to maintain consistency.
# Load necessary library
library(dplyr)
# Identify variables for different standardization techniques (excluding 'y')
z_score_vars <- c("age", "duration", "previous_contacts_ratio", "nr_employed")
min_max_vars <- c("campaign_boxcox", "campaign_sqrt", "cons_price_idx")
robust_vars <- c("cons_conf_idx")
# Standardization Functions
z_score_standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
min_max_standardize <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
robust_standardize <- function(x) (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
# Ensure 'y' remains unchanged
target_var <- "y"
# Apply Z-score Standardization
bank_data <- bank_data %>%
mutate(across(all_of(z_score_vars), z_score_standardize, .names = "z_{.col}"))
# Apply Min-Max Scaling
bank_data <- bank_data %>%
mutate(across(all_of(min_max_vars), min_max_standardize, .names = "minmax_{.col}"))
# Apply Robust Scaling
bank_data <- bank_data %>%
mutate(across(all_of(robust_vars), robust_standardize, .names = "robust_{.col}"))
# Drop original unscaled columns (excluding 'y')
bank_data <- bank_data %>%
dplyr::select(all_of(target_var), everything(), -all_of(c(z_score_vars, min_max_vars, robust_vars)))
# Verify the transformation
glimpse(bank_data)
## Rows: 41,188
## Columns: 69
## $ y <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ duration_boxcox <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_log <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_reciprocal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_bin <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1…
## $ default_unknown <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ default_yes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_6y <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_9y <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0…
## $ education_illiterate <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0…
## $ education_university_degree <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ housing_1 <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0…
## $ housing_3 <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ job_admin_ <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0…
## $ job_entrepreneur <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ job_management <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services <int> 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0…
## $ job_student <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ job_unemployed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1 <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ loan_3 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_divorced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ marital_married <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ marital_single <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0…
## $ marital_unknown <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1 <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ loan_housing_combo_1_3 <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ loan_housing_combo_3_1 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ loan_housing_combo_3_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ z_age <dbl> 1.533015677, 1.628973456, -0.290182119, …
## $ z_duration <dbl> 0.01047130, -0.42149540, -0.12451830, -0…
## $ z_previous_contacts_ratio <dbl> -0.3312077, -0.3312077, -0.3312077, -0.3…
## $ z_nr_employed <dbl> 0.3316759, 0.3316759, 0.3316759, 0.33167…
## $ minmax_campaign_boxcox <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ minmax_campaign_sqrt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ minmax_cons_price_idx <dbl> 0.6987529, 0.6987529, 0.6987529, 0.69875…
## $ robust_cons_conf_idx <dbl> 0.8571429, 0.8571429, 0.8571429, 0.85714…
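As a quick sanity check (not part of the original output), we can confirm that each scaling behaved as expected: the z_ columns should have mean close to 0 and standard deviation close to 1, the minmax_ columns should lie in [0, 1], and the robust_ column should have a median close to 0. A minimal sketch using the prefixes defined above:
# Hypothetical verification step: summarize each family of scaled columns by prefix
bank_data %>%
  summarise(
    across(starts_with("z_"), list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE))),
    across(starts_with("minmax_"), list(min = ~min(.x, na.rm = TRUE), max = ~max(.x, na.rm = TRUE))),
    across(starts_with("robust_"), list(median = ~median(.x, na.rm = TRUE)))
  ) %>%
  glimpse()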
For predictive modeling, we can use either simple random sampling or stratified random sampling to create the training and test datasets.
Simple random sampling draws observations without replacement, so no row is selected twice; the unselected rows form the test set, guaranteeing that the two sets do not overlap.
# Set seed for reproducibility
set.seed(1234)
# Define training sample size (e.g., 75% of the data)
sample_size <- round(nrow(bank_data) * 0.75)
# Create sample set
sample_set <- sample(nrow(bank_data), sample_size, replace = FALSE)
# Split data into training and test sets
train_data <- bank_data[sample_set, ]
test_data <- bank_data[-sample_set, ]
# Verify class distribution remains consistent
print(round(prop.table(table(train_data$y)) * 100, 2))
##
## 0 1
## 88.67 11.33
print(round(prop.table(table(test_data$y)) * 100, 2))
##
## 0 1
## 88.92 11.08
Since y is a categorical variable, we should ensure that both training and test sets maintain the same proportion of classes.
# Load caret package
library(caret)
# Stratified sampling with 75% training data
set.seed(1234)
trainIndex <- createDataPartition(bank_data$y, p = 0.75, list = FALSE)
# Split data based on stratified sampling
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]
# Verify class distribution remains consistent
round(prop.table(table(train_data$y)) * 100, 2)
##
## 0 1
## 88.73 11.27
round(prop.table(table(test_data$y)) * 100, 2)
##
## 0 1
## 88.73 11.27
The class distribution in the training dataset closely mirrors that of the original dataset, with approximately 88.73% “no” responses and 11.27% “yes” responses in both partitions. This indicates that the stratified sampling worked as intended, preserving the proportion of classes in the response variable. Maintaining the same distribution matters because it keeps the training data representative of the data the model will ultimately score, and it ensures that test-set performance estimates are not distorted by an artificially different class balance.
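To make this comparison explicit, the class proportions of the full dataset and of both partitions can be tabulated side by side (an optional check, not part of the original workflow):
# Optional check: class proportions (in %) for the full data, training set, and test set
sapply(
  list(full = bank_data$y, train = train_data$y, test = test_data$y),
  function(v) round(prop.table(table(v)) * 100, 2)
)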
# Load necessary libraries
library(themis)
## Warning: package 'themis' was built under R version 4.3.3
## Loading required package: recipes
## Warning: package 'recipes' was built under R version 4.3.3
##
## Attaching package: 'recipes'
## The following object is masked from 'package:stringr':
##
## fixed
## The following object is masked from 'package:stats':
##
## step
library(dplyr)
library(recipes)
# Convert `y` to a factor (required for SMOTE)
train_data <- train_data %>%
mutate(y = as.factor(y))
# Store `y` separately and remove it before conversion
y_train <- train_data$y
train_data <- train_data %>% dplyr::select(-y) # Remove `y` before conversion
# Convert other factor variables to numeric
train_data <- train_data %>%
mutate(across(where(is.factor), as.numeric))
# Add `y` back after conversion
train_data <- train_data %>%
mutate(y = y_train)
# Define the SMOTE recipe
set.seed(1234)
smote_recipe <- recipe(y ~ ., data = train_data) %>%
step_smote(y, over_ratio = 1) %>% # Balance classes
prep()
# Apply SMOTE transformation
train_data_smote <- juice(smote_recipe)
# Check the new class distribution after SMOTE
table(train_data_smote$y)
##
## 0 1
## 27411 27411
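As a quick before/after check (not shown in the original output), the resampled training set should now contain equal counts of both classes, while y_train still reflects the original imbalance; SMOTE is applied to the training data only, so the test set is left untouched.
# Compare class counts before and after SMOTE (training data only)
cbind(
  before_smote = table(y_train),
  after_smote  = table(train_data_smote$y)
)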
glimpse(train_data)
## Rows: 30,891
## Columns: 69
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ duration_boxcox <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_log <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_reciprocal <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, …
## $ campaign_bin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no <int> 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1…
## $ default_unknown <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0…
## $ default_yes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ education_basic_6y <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_9y <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0…
## $ education_illiterate <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
## $ education_university_degree <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0…
## $ housing_1 <int> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0…
## $ housing_3 <int> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1…
## $ job_admin_ <int> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1…
## $ job_entrepreneur <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ job_management <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services <int> 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0…
## $ job_student <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ job_unemployed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ loan_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ marital_divorced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_married <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1…
## $ marital_single <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0…
## $ marital_unknown <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1 <int> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…
## $ loan_housing_combo_1_3 <int> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1…
## $ loan_housing_combo_3_1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_housing_combo_3_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ z_age <dbl> 1.533015677, 1.628973456, -0.290182119, …
## $ z_duration <dbl> 0.01047130, -0.42149540, -0.12451830, -0…
## $ z_previous_contacts_ratio <dbl> -0.3312077, -0.3312077, -0.3312077, -0.3…
## $ z_nr_employed <dbl> 0.3316759, 0.3316759, 0.3316759, 0.33167…
## $ minmax_campaign_boxcox <dbl> 0.0000000, 0.0000000, 0.0000000, 0.00000…
## $ minmax_campaign_sqrt <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00…
## $ minmax_cons_price_idx <dbl> 0.6987529, 0.6987529, 0.6987529, 0.69875…
## $ robust_cons_conf_idx <dbl> 0.8571429, 0.8571429, 0.8571429, 0.85714…
## $ y <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
As the glimpse confirms, the original training data still contains far more no responses than yes responses, which is why SMOTE was applied to produce the balanced training set shown above (27,411 observations of each class).
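For a visual confirmation of the effect of SMOTE, a simple bar chart can contrast the two class distributions (an illustrative sketch, not part of the original analysis):
# Illustrative plot: class balance in the original training set vs. the SMOTE-resampled set
bind_rows(
  data.frame(set = "original train", y = train_data$y),
  data.frame(set = "after SMOTE", y = train_data_smote$y)
) %>%
  ggplot(aes(x = y, fill = set)) +
  geom_bar(position = "dodge") +
  labs(title = "Class balance before vs. after SMOTE",
       x = "y (0 = no, 1 = yes)", y = "count")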