library(readr)
library(stringr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(corrplot)
## corrplot 0.92 loaded
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.3.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.3.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggthemes)
library(purrr)
library(tidyr)
library(readr)
In the following code chunks, I start by importing the bank marketing dataset and cleaning up its column names by removing unwanted characters and spaces. Then, I convert the appropriate columns to factors for categorical data and ensure that numeric columns are properly formatted, setting the stage for subsequent exploratory analysis.
I begin by importing the bank marketing dataset from “bank-additional-full.csv”, specifying the use of semicolons as delimiters and letting R automatically assign data types to each column. Next, I use glimpse to inspect the structure of the dataset—reviewing column names, data types, and a preview of the values—to ensure everything has been read in correctly.
bank_data <- read_delim("bank-additional-full.csv", delim = ";", col_types = cols())
glimpse(bank_data)
## Rows: 41,188
## Columns: 21
## $ age <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job <chr> "housemaid", "services", "services", "admin.", "service…
## $ marital <chr> "married", "married", "married", "married", "married", …
## $ education <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
## $ default <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
## $ housing <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
## $ loan <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
## $ contact <chr> "telephone", "telephone", "telephone", "telephone", "te…
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
## $ day_of_week <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
## $ duration <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
Next, I created a custom function called clean_column_names() to sanitize the dataset's column names.
- Within clean_column_names(), I replaced periods (.) with underscores (_) to standardize the delimiters.
- I also replaced hyphens (-) with underscores to further enhance consistency.
- I removed spaces from the column names to prevent issues during later data manipulation.
- Finally, I applied clean_column_names() to the dataset, ensuring that all column names are clean and uniform for further analysis.
clean_column_names <- function(dataset) {
colnames(dataset) <- str_replace_all(colnames(dataset), "\\.", "_") # Replace '.' with '_'
colnames(dataset) <- str_replace_all(colnames(dataset), "-", "_") # Replace '-' with '_'
colnames(dataset) <- str_replace_all(colnames(dataset), " ", "") # Remove spaces
return(dataset)
}
bank_data <- clean_column_names(bank_data)
A character vector categorical_cols is created, listing the names of columns that hold categorical data. The mutate(across(...)) function from dplyr is then used to convert each of these columns to the factor data type.
categorical_cols <- c("job", "marital", "education", "default", "housing",
"loan", "contact", "month", "day_of_week", "poutcome", "y")
bank_data <- bank_data %>%
mutate(across(all_of(categorical_cols), as.factor))
Similarly, a vector numeric_cols identifies the columns that should be numeric, and mutate(across(..., as.numeric)) converts them, ensuring proper arithmetic operations and statistical analysis later.
numeric_cols <- c("age", "duration", "campaign", "pdays", "previous", "emp_var_rate",
"cons_price_idx", "cons_conf_idx", "euribor3m", "nr_employed")
bank_data <- bank_data %>%
mutate(across(all_of(numeric_cols), as.numeric))
Now let us get a high-level view of our data using the glimpse() function. This step helps verify that the cleaning steps have been executed as intended.
We can see that our dataset contains 41,188 observations and 21
variables. It includes both categorical features (e.g., job, marital,
education, default, housing, loan, contact, month, day_of_week,
poutcome, and y) and numerical features (e.g., age, duration, campaign,
pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m,
and nr_employed). My next step is to use the summary()
and
table()
functions to get a quick statistical overview of
each variable. This helps identify potential data issues, outliers, and
the general distribution of the dataset.
glimpse(bank_data)
## Rows: 41,188
## Columns: 21
## $ age <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job <fct> housemaid, services, services, admin., services, servic…
## $ marital <fct> married, married, married, married, married, married, m…
## $ education <fct> basic.4y, high.school, high.school, basic.6y, high.scho…
## $ default <fct> no, unknown, no, no, no, unknown, no, unknown, no, no, …
## $ housing <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, no,…
## $ loan <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes, n…
## $ contact <fct> telephone, telephone, telephone, telephone, telephone, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, …
## $ day_of_week <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, …
## $ duration <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome <fct> nonexistent, nonexistent, nonexistent, nonexistent, non…
## $ emp_var_rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons_price_idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons_conf_idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr_employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
library(purrr)
bank_data %>%
summary()
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing loan
## university.degree :12168 no :32588 no :18622 no :33950
## high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## contact month day_of_week duration
## cellular :26144 may :13769 fri:7827 Min. : 0.0
## telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
## aug : 6178 thu:8623 Median : 180.0
## jun : 5318 tue:8090 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2632 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
## Median : 2.000 Median :999.0 Median :0.000 success : 1373
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## emp_var_rate cons_price_idx cons_conf_idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08189 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr_employed y
## Min. :4964 no :36548
## 1st Qu.:5099 yes: 4640
## Median :5191
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
##
# Load necessary libraries
library(ggplot2)
library(forcats)
# Function to create and display sorted bar plots for categorical features
plot_categorical <- function(data, var) {
data <- data %>%
mutate(!!sym(var) := fct_reorder(!!sym(var), !!sym(var), .fun = length, .desc = TRUE))
print(
ggplot(data, aes(x = !!sym(var), fill = !!sym(var))) +
geom_bar() +
theme_minimal() +
ggtitle(paste("Distribution of", var)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
)
}
# Create the proportion table
bank_data %>%
count(job, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 12 × 3
## job n proportion
## <fct> <int> <dbl>
## 1 admin. 10422 0.253
## 2 blue-collar 9254 0.225
## 3 technician 6743 0.164
## 4 services 3969 0.0964
## 5 management 2924 0.0710
## 6 retired 1720 0.0418
## 7 entrepreneur 1456 0.0354
## 8 self-employed 1421 0.0345
## 9 housemaid 1060 0.0257
## 10 unemployed 1014 0.0246
## 11 student 875 0.0212
## 12 unknown 330 0.00801
# Plot the distribution
plot_categorical(bank_data, "job")
How is the Job Variable Distributed?
- The dataset includes 12 job categories plus an
“unknown” category.
- The most common occupations are “admin.”
(25.3%), “blue-collar” (22.5%), and
“technician” (16.3%).
- The least frequent categories are “student”
(2.1%) and “unknown” (0.8%).
- The proportion table and bar chart confirm that the
job variable is highly imbalanced, with a few
categories dominating the dataset.
Key Considerations
- Most customers work in administrative and blue-collar
jobs, which could impact the marketing strategy.
- The imbalance in job categories should be considered
when analyzing how occupation affects subscription rates.
- The “unknown” category is minimal but may require
handling in preprocessing (e.g., imputation or exclusion).
- Further analysis can explore whether job type influences the likelihood of subscribing (y).
The distribution of jobs may reflect real-world trends in the labor force, where administrative and blue-collar roles are more prevalent. Further analysis could explore whether job categories influence the target variable (y) or other key factors in the dataset.
# Create the proportion table
bank_data %>%
count(marital, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 4 × 3
## marital n proportion
## <fct> <int> <dbl>
## 1 married 24928 0.605
## 2 single 11568 0.281
## 3 divorced 4612 0.112
## 4 unknown 80 0.00194
# Plot the distribution
plot_categorical(bank_data, "marital")
How is Marital Status Distributed?
- The dataset contains four marital status categories:
Married, Single, Divorced, and Unknown.
- The majority of customers are married
(60.5%), followed by single (28.1%) and
divorced (11.2%).
- The “unknown” category is minimal
(0.2%) and may need special handling in
preprocessing.
Key Considerations
- Most bank customers are married, which could
influence marketing strategies targeting families or dual-income
households.
- The imbalance in marital status should be considered
when analyzing how relationship status affects subscription
likelihood.
- The “unknown” category is negligible but should be
addressed—either through imputation or exclusion.
Further analysis can explore whether marital status significantly impacts the target variable (y) in predicting term deposit subscriptions.
# Create the proportion table
bank_data %>%
count(education, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 8 × 3
## education n proportion
## <fct> <int> <dbl>
## 1 university.degree 12168 0.295
## 2 high.school 9515 0.231
## 3 basic.9y 6045 0.147
## 4 professional.course 5243 0.127
## 5 basic.4y 4176 0.101
## 6 basic.6y 2292 0.0556
## 7 unknown 1731 0.0420
## 8 illiterate 18 0.000437
# Plot the distribution
plot_categorical(bank_data, "education")
How is the Education Variable Distributed?
- The dataset includes eight education levels, ranging
from illiterate to university
degree.
- The most common education levels are
university degree (29.5%), high school
(23.1%), and basic 9 years (14.7%).
- The least frequent categories are illiterate
(0.04%) and unknown (4.2%).
- The proportion table and bar chart confirm an
uneven distribution, with higher education levels more
represented.
Key Considerations:
- Higher education levels dominate the dataset, which
could influence marketing strategies tailored to professionals.
- The imbalance in education categories should be
considered when analyzing how education level affects subscription
rates.
- The “unknown” category (4.2%) is not negligible and
may require imputation or exclusion during
preprocessing.
Further analysis can explore whether education level significantly impacts the likelihood of subscribing (y); for example, whether highly educated individuals are more likely to subscribe could be an interesting area to investigate.
# Create the proportion table
bank_data %>%
count(default, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
## default n proportion
## <fct> <int> <dbl>
## 1 no 32588 0.791
## 2 unknown 8597 0.209
## 3 yes 3 0.0000728
# Plot the distribution
plot_categorical(bank_data, "default")
What is the Distribution of the Default
Variable?
- The dataset includes three categories for default
status: “no,” “yes,” and “unknown.”
- An overwhelming 79.1% of customers have no credit default
history, while only 0.007% have a known
default.
- A significant 20.9% of records fall under “unknown,”
making it unclear whether those customers have credit defaults.
- The distribution is highly imbalanced, with very few
actual defaults reported.
Implications for Modeling:
- Severe class imbalance: With only three cases
of “yes” defaults, this feature is unlikely to contribute
significantly to predictions unless handled properly.
- The large “unknown” category raises concerns—should
these be treated as missing values, or does their presence indicate a
meaningful pattern?
- Given the dominance of “no” defaults, this feature
might not be a strong predictor unless analyzed in
combination with other financial indicators.
Further exploration could involve testing the impact of excluding this feature to determine if it influences model performance. Analysis could investigate whether individuals with “unknown” default status behave more like those with “no” defaults or those who have actually defaulted.
# Create the proportion table
bank_data %>%
count(housing, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
## housing n proportion
## <fct> <int> <dbl>
## 1 yes 21576 0.524
## 2 no 18622 0.452
## 3 unknown 990 0.0240
# Plot the distribution
plot_categorical(bank_data, "housing")
What is the Distribution of the Housing
Variable?
- The dataset records whether a customer has a housing
loan, with three possible values: “yes,” “no,” and
“unknown.”
- Just over 52.4% of customers have a housing loan,
while 45.2% do not.
- The “unknown” category is relatively small (2.4%),
but its impact should be considered in preprocessing.
- The distribution is fairly balanced, though slightly
skewed toward customers with housing loans.
Implications for Modeling:
- Since the difference between “yes” and “no” is not
extreme, this variable might provide some predictive
value in determining subscription likelihood.
- The presence of “unknown” values may introduce
uncertainty—options include imputation, separate categorization,
or exclusion during preprocessing.
- Housing status might be correlated with other financial
variables, such as personal loan status, employment, or
income-related factors, making interaction analysis useful.
Further exploration is needed to determine whether customers with or without housing loans are more likely to subscribe (y).
# Create the proportion table
bank_data %>%
count(loan, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
## loan n proportion
## <fct> <int> <dbl>
## 1 no 33950 0.824
## 2 yes 6248 0.152
## 3 unknown 990 0.0240
# Plot the distribution
plot_categorical(bank_data, "loan")
How is the Loan Variable Distributed?
- The dataset captures whether a customer has a personal
loan, categorized as “yes,” “no,” or
“unknown.”
- A significant 82.4% of customers do not have a personal
loan, while 15.2% do.
- The “unknown” category accounts for 2.4% of cases,
which may require handling in preprocessing.
- The distribution is highly imbalanced, with far more
customers not holding personal loans.
Implications for Predictive Modeling:
- Since most customers do not have a personal loan,
this variable might not be a strong predictor unless it
interacts with other financial features.
- The “unknown” category should be assessed—potential
approaches include treating it as missing data, encoding it as a
separate category, or imputing values based on similar
records.
- Customers with loans may behave differently regarding term deposit subscriptions, warranting further analysis to check if this feature correlates with the target variable (y).
- Given the imbalance in the "yes" class, models may need resampling techniques or weighting adjustments to ensure the loan variable contributes meaningfully.
Since a housing loan is more common than a personal loan, it could be insightful to examine how both loan types impact client financial behavior.
# Create the proportion table
bank_data %>%
count(contact, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 2 × 3
## contact n proportion
## <fct> <int> <dbl>
## 1 cellular 26144 0.635
## 2 telephone 15044 0.365
# Plot the distribution
plot_categorical(bank_data, "contact")
How is the Contact Variable Distributed?
- The dataset records the communication method used in the marketing
campaign, categorized as “cellular” or
“telephone.”
- A majority (63.5%) of contacts were made via cellular
phones, while 36.5% were made via landline
telephones.
- The bar chart confirms that mobile phone outreach was nearly
twice as common as landline calls.
Implications for Predictive Modeling:
- The preference for cellular communication suggests
that mobile outreach may be more effective or that more
customers use cell phones.
- If subscription rates differ between the two groups,
contact type could be a strong predictor for campaign
success.
- Further analysis should explore whether customers reached via cellular are more likely to subscribe (y) compared to those reached via landlines.
- This variable might also interact with time-based features
(month, day of the week), affecting conversion likelihood.
# Create the proportion table
bank_data %>%
count(month, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 10 × 3
## month n proportion
## <fct> <int> <dbl>
## 1 may 13769 0.334
## 2 jul 7174 0.174
## 3 aug 6178 0.150
## 4 jun 5318 0.129
## 5 nov 4101 0.0996
## 6 apr 2632 0.0639
## 7 oct 718 0.0174
## 8 sep 570 0.0138
## 9 mar 546 0.0133
## 10 dec 182 0.00442
# Plot the distribution
plot_categorical(bank_data, "month")
How is the Month Variable Distributed?
- The dataset records the month in which customers were contacted during
the marketing campaign.
- May accounts for the highest proportion (33.4%) of
contacts, followed by July (17.4%), August
(15.0%), and June (12.9%).
- The lowest contact volumes occur in December (0.4%), March
(1.3%), and September (1.4%), indicating limited marketing
efforts in those months.
- The bar chart confirms that most contacts happened between May
and August, with a sharp decline afterward.
Implications for Predictive Modeling:
- Seasonality may influence campaign effectiveness—a
high concentration of contacts in specific months could impact model
generalization.
- If subscription rates (y) vary significantly by month, this feature could be a strong predictor. Further analysis is needed to assess whether customer responsiveness changes across seasons.
- Why is May so dominant? Investigating external
factors (e.g., economic trends, promotions) during this period could
provide business insights.
- Given the uneven distribution, models may need
weighting adjustments or feature engineering (e.g.,
grouping months into “high,” “medium,” and “low” activity periods).
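As a rough sketch of that grouping idea (not applied in the analysis that follows), months could be collapsed by contact volume; the cut-offs and the column name month_activity below are assumptions chosen only for illustration.
# Hypothetical sketch: group months by contact volume (thresholds are assumptions)
bank_data_grouped <- bank_data %>%
  mutate(month_activity = case_when(
    month %in% c("may", "jun", "jul", "aug") ~ "high",
    month %in% c("apr", "nov") ~ "medium",
    TRUE ~ "low"
  ))
table(bank_data_grouped$month_activity)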
# Create the proportion table
bank_data %>%
count(day_of_week, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 5 × 3
## day_of_week n proportion
## <fct> <int> <dbl>
## 1 thu 8623 0.209
## 2 mon 8514 0.207
## 3 wed 8134 0.197
## 4 tue 8090 0.196
## 5 fri 7827 0.190
# Plot the distribution
plot_categorical(bank_data, "day_of_week")
How is the Day of the Week Variable
Distributed?
- The dataset captures the day of the week when
customers were contacted.
- The distribution is fairly even across weekdays, with
Thursday (20.9%) and Monday (20.6%) having slightly higher
contact volumes.
- Friday (19.0%) has the lowest proportion, but the
difference is minimal.
- The bar chart confirms that no single day dominates the
contact strategy, suggesting a balanced approach.
Implications for Predictive Modeling:
- Since contacts are evenly distributed across
weekdays, the day itself may not strongly impact model
predictions unless subscription rates vary by day.
- Further analysis should examine conversion rates per day to determine if certain days yield higher success rates (y).
- If specific weekdays show better performance, future
marketing campaigns could prioritize those days.
- Potential feature engineering could involve grouping days into
“high” and “low” response periods if patterns emerge.
bank_data <- bank_data %>%
mutate(previous = as.factor(previous))
# Create the proportion table
bank_data %>%
count(previous, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 8 × 3
## previous n proportion
## <fct> <int> <dbl>
## 1 0 35563 0.863
## 2 1 4561 0.111
## 3 2 754 0.0183
## 4 3 216 0.00524
## 5 4 70 0.00170
## 6 5 18 0.000437
## 7 6 5 0.000121
## 8 7 1 0.0000243
# Plot the distribution
plot_categorical(bank_data, "previous")
How is the Previous Campaign Outcome
Distributed?
- The “previous” variable represents the number of
times a customer was contacted in past marketing campaigns.
- 86.3% of customers had no prior contact (0), meaning
they are being approached for the first time.
- Only 11.1% had exactly one previous contact, while
higher counts (2–7) occur with much lower frequency.
- The bar chart confirms that most customers were never
contacted before, with a sharp drop-off beyond one previous
contact.
Implications for Predictive Modeling:
- Since the majority of customers have never been contacted before, this variable might be highly predictive of subscription likelihood (y).
- The small proportion of customers with multiple contacts may indicate either persistent follow-ups or targeting of specific customer groups; further analysis is needed to see how their subscription rates compare.
- If higher previous contact counts correlate with more sign-ups, models might benefit from grouping this feature into categories (e.g., "new contact" vs. "previously engaged"); a sketch of this grouping follows below.
- The presence of few customers with 5+ contacts suggests diminishing returns on repeated outreach, which could guide future marketing strategies.
This variable might also help assess the effectiveness of re-engagement efforts in the campaign.
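A minimal sketch of the grouping mentioned above; the group labels and the column name prev_group are illustrative assumptions, not part of the original pipeline.
# Sketch: collapse previous-contact counts into two groups (labels are assumptions)
bank_data_prev <- bank_data %>%
  mutate(prev_group = ifelse(previous == "0", "new contact", "previously engaged"))
bank_data_prev %>% count(prev_group, y)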
# Create the proportion table
bank_data %>%
count(y, sort = TRUE) %>%
mutate(proportion = n / sum(n))
## # A tibble: 2 × 3
## y n proportion
## <fct> <int> <dbl>
## 1 no 36548 0.887
## 2 yes 4640 0.113
# Plot the distribution
plot_categorical(bank_data, "y")
How is the Target Variable (y) Distributed?
- The target variable (y) represents whether a customer subscribed to a term deposit.
- Only 11.3% of customers subscribed (yes), while 88.7% did not (no).
- The bar chart confirms a severe class imbalance, with far more negative outcomes than positive ones.
Implications for Predictive Modeling:
- Severe class imbalance may lead to a biased model favoring the majority class (no). Techniques like SMOTE (Synthetic Minority Over-sampling Technique) or class weighting should be considered to handle this issue; a small class-weighting sketch follows below.
- Since positive cases (yes) are rare, using metrics like precision, recall, and F1-score instead of accuracy is crucial for evaluating model performance.
- Further feature analysis is needed to determine which factors most influence subscription rates, allowing for targeted feature engineering.
- Alternative modeling strategies, such as ensemble methods, may help improve predictions in imbalanced data settings.
Since only around 1 in 9 clients agreed to subscribe, the campaign had limited effectiveness. Future analysis could explore:
- What factors distinguish clients who subscribed from those who didn't.
- Whether certain demographics, timing, or contact methods had a higher success rate.
- Potential strategies to improve engagement and response rates.
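As a hedged illustration of the class-weighting idea mentioned above (the weighting scheme and object names are assumptions, not the modeling approach used later), inverse-frequency weights could be computed and passed to a classifier:
# Sketch: inverse-frequency observation weights for the imbalanced target y
class_freq <- prop.table(table(bank_data$y))  # proportions of "no" and "yes"
obs_weights <- ifelse(bank_data$y == "yes",
                      1 / class_freq["yes"],
                      1 / class_freq["no"])
# These weights could then be supplied to a model, e.g. (not run):
# glm(y ~ age + campaign, data = bank_data, family = binomial, weights = obs_weights)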
# Load necessary libraries
library(ggplot2)
# Function to create and display histograms for numerical features
plot_numerical <- function(data, var, bins = 30) {
print(
ggplot(data, aes(x = !!sym(var))) +
geom_histogram(bins = bins, fill = "steelblue", color = "black", alpha = 0.7) +
theme_minimal() +
ggtitle(paste("Distribution of", var)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
)
}
# Load necessary library
library(ggplot2)
library(dplyr)
# Function to plot numerical variables (Histogram + Boxplot) with adjustable bins
plot_numeric_variable <- function(data, column_name, bins = 30) {
column_data <- data[[column_name]] # Extract column data
# Ensure the column is numeric
if (is.numeric(column_data)) {
# Remove NAs
clean_data <- data[complete.cases(column_data), ]
par(mfrow = c(1, 2)) # Set up 2 plots in a row
# Plot histogram with dynamic bins
hist(clean_data[[column_name]], main = paste("Histogram of", column_name),
xlab = column_name, breaks = bins, col = "steelblue", border = "black")
# Plot boxplot
boxplot(clean_data[[column_name]], main = paste("Boxplot of", column_name),
horizontal = TRUE, col = "orange")
par(mfrow = c(1, 1)) # Reset plot layout
} else {
message("Skipping non-numeric column: ", column_name)
}
}
# Function to convert summary statistics to a data frame
summary_to_dataframe <- function(data, column_name) {
# Get summary statistics
summary_values <- summary(data[[column_name]])
# Convert to data frame
summary_df <- data.frame(
Statistic = names(summary_values),
Value = as.numeric(summary_values)
)
return(summary_df)
}
# Example usage for "age"
age_summary_df <- summary_to_dataframe(bank_data, "age")
# Display the summary as a data frame
print(age_summary_df)
## Statistic Value
## 1 Min. 17.00000
## 2 1st Qu. 32.00000
## 3 Median 38.00000
## 4 Mean 40.02406
## 5 3rd Qu. 47.00000
## 6 Max. 98.00000
# Example usage with bins specified
plot_numeric_variable(bank_data, "age", bins = 50)
How is the Age Variable Distributed?
- The minimum age is 17, while the
maximum is 98.
- The median age is 38, with a mean of
40.02, indicating a slightly right-skewed distribution.
- The interquartile range (IQR) is from 32 to 47,
capturing the middle 50% of observations.
Interpretation of the Plots:
- The histogram shows a concentration of customers between 30
and 50 years old, with a sharp decline in older age
groups.
- The boxplot highlights outliers beyond 60 years old,
indicating a small subset of much older customers.
Implications for Predictive Modeling:
- Since most customers are in their 30s and 40s, age
may be a key factor in subscription likelihood.
- The presence of outliers (older customers) could
require winsorization or transformation to avoid
skewing the model.
- Further analysis should check whether older or younger customers are more likely to subscribe (y), as this may guide marketing strategies.
campaign_summary_df <- summary_to_dataframe(bank_data, "campaign")
# Display the summary as a data frame
print(campaign_summary_df)
## Statistic Value
## 1 Min. 1.000000
## 2 1st Qu. 1.000000
## 3 Median 2.000000
## 4 Mean 2.567593
## 5 3rd Qu. 3.000000
## 6 Max. 56.000000
# Example usage with bins specified
plot_numeric_variable(bank_data, "campaign", bins = 50)
How is the Campaign Variable Distributed?
- The minimum number of contacts per customer is
1, while the maximum is 56.
- The median number of contacts is 2, with an
average of 2.57, indicating most customers were
contacted a few times.
- The interquartile range (IQR) is between 1 and 3
contacts, meaning 50% of customers were contacted at
most three times.
Interpretation of the Plots:
- The histogram shows a strong right-skew, meaning a
majority of customers received very few calls, while a
small group received many follow-ups.
- The boxplot reveals a high number of extreme
outliers, where some customers received over 10-50
calls, which may indicate excessive follow-ups.
Implications for Predictive Modeling:
- Since most customers are contacted only once or twice, marketing strategies should assess whether multiple calls improve subscription rates (y) or lead to customer fatigue.
- The presence of outliers (customers receiving 10+ calls) suggests diminishing returns or possible over-targeting, which could negatively impact campaign effectiveness.
- Feature engineering options:
  - Binning campaign values into categories (Low: 1-3, Medium: 4-6, High: 7+) may help reduce noise in modeling (a sketch follows below).
  - Investigate the relationship between campaign and y to determine if additional calls increase the likelihood of a subscription.
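A minimal sketch of the binning option above, using the suggested Low (1-3), Medium (4-6), High (7+) cut points; the column name campaign_group is an assumption.
# Sketch: bin campaign counts into Low (1-3), Medium (4-6), High (7+)
bank_data_binned <- bank_data %>%
  mutate(campaign_group = cut(campaign,
                              breaks = c(0, 3, 6, Inf),
                              labels = c("Low", "Medium", "High")))
table(bank_data_binned$campaign_group)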
duration_summary_df <- summary_to_dataframe(bank_data, "duration")
# Display the summary as a data frame
print(duration_summary_df)
## Statistic Value
## 1 Min. 0.000
## 2 1st Qu. 102.000
## 3 Median 180.000
## 4 Mean 258.285
## 5 3rd Qu. 319.000
## 6 Max. 4918.000
# Example usage with bins specified
plot_numeric_variable(bank_data, "duration", bins = 50)
How is the Duration Variable Distributed?
- The minimum call duration is 0
seconds, while the maximum is 4918 seconds (~82
minutes).
- The median duration is 180 seconds (3 minutes), with
an average of 258.3 seconds (~4.3 minutes).
- The interquartile range (IQR) spans from 102 to 319
seconds, meaning 50% of calls last between 1.7 and 5.3
minutes.
Interpretation of the Plots
- The histogram shows a right-skewed distribution,
indicating that most calls are short, with a few lasting much
longer.
- The boxplot highlights many extreme outliers, meaning
a small number of calls last significantly longer than the
majority.
Implications for Predictive Modeling
- Duration is the most influential feature for predicting term deposit subscription (y), as longer calls are highly correlated with successful conversions.
- The presence of extreme outliers suggests that duration may need a log transformation or capping to prevent high-leverage points from dominating the model (a sketch follows below).
- Further analysis should assess the relationship between duration and y, confirming whether longer calls lead to higher success rates.
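A small sketch of the two options mentioned (log transformation and capping); the 99th-percentile cap and the new column names are assumptions for illustration only.
# Sketch: two common ways to tame the long right tail of duration
bank_data_dur <- bank_data %>%
  mutate(duration_log = log1p(duration),                               # log(1 + x) handles zero-second calls
         duration_cap = pmin(duration, quantile(duration, 0.99)))      # cap at the 99th percentile
summary(bank_data_dur$duration_log)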
pdays_summary_df <- summary_to_dataframe(bank_data, "pdays")
# Display the summary as a data frame
print(pdays_summary_df)
## Statistic Value
## 1 Min. 0.0000
## 2 1st Qu. 999.0000
## 3 Median 999.0000
## 4 Mean 962.4755
## 5 3rd Qu. 999.0000
## 6 Max. 999.0000
plot_numeric_variable(bank_data, "pdays", bins = 50)
How is the pdays Variable Distributed?
- The minimum value is 0, while the maximum is
999.
- The median and upper quartiles are both 999, meaning
most values are concentrated at this level.
- The mean is 962.48, suggesting an overwhelming
skew towards 999.
Interpretation of the Plots:
- The histogram confirms that the vast majority of values are
999, while a small proportion falls closer to zero.
- The boxplot highlights that values near 999 dominate the
dataset, while a few lower values appear as outliers.
Implications for Predictive Modeling:
- In the dataset, 999 indicates that the customer was never contacted previously.
- The variable is highly imbalanced, meaning it should be treated as a categorical feature (e.g., "Never Contacted" vs. "Previously Contacted").
- Feature Engineering Recommendations:
  - Create a binary variable, pdays_contacted = ifelse(pdays == 999, 0, 1), indicating whether the client had prior contact (a sketch follows below).
  - Alternatively, group pdays into bins (e.g., recent contact vs. long ago).
  - Check the correlation with y to confirm whether prior contact history impacts subscription rates.
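A sketch of the binary recoding described above; pdays_contacted is the hypothetical column name used in the recommendation.
# Sketch: flag whether the client was contacted in a previous campaign
bank_data_flag <- bank_data %>%
  mutate(pdays_contacted = ifelse(pdays == 999, 0, 1))
table(bank_data_flag$pdays_contacted)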
# Convert emp_var_rate to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
mutate(emp_var_rate = as.numeric(as.character(emp_var_rate)))
emp_var_rate_summary_df <- summary_to_dataframe(bank_data, "emp_var_rate")
# Display the summary as a data frame
print(emp_var_rate_summary_df)
## Statistic Value
## 1 Min. -3.4000000
## 2 1st Qu. -1.8000000
## 3 Median 1.1000000
## 4 Mean 0.0818855
## 5 3rd Qu. 1.4000000
## 6 Max. 1.4000000
plot_numeric_variable(bank_data, "emp_var_rate", bins = 50)
How is the emp_var_rate Variable Distributed?
- The minimum employment variation rate is
-3.4, while the maximum is 1.4.
- The median is 1.1, meaning that at least half
of the records correspond to positive employment trends.
- The mean is 0.08, indicating that employment
variation fluctuates around zero but is
slightly positive.
- The interquartile range (IQR) is between -1.8 and
1.4, covering most observations.
Interpretation of the Plots
- The histogram reveals a discrete distribution,
meaning that emp_var_rate
only takes specific
values rather than a continuous range.
- The boxplot shows that most values fall between -3.4 and 1.4,
with no extreme outliers.
- The dataset contains both negative and positive
values, which could reflect economic recessions and
recoveries over different time periods.
Implications for Predictive Modeling
- emp_var_rate is a macro-economic indicator that may influence consumer financial decisions.
- Negative values (economic downturns) could correlate with lower term deposit subscriptions (y), as people save less during recessions.
- Positive values (economic growth) might be linked to higher subscriptions, as customers have more disposable income.
- Feature Engineering Options:
  - Convert emp_var_rate into categorical bins (e.g., recession, stable, growth); a sketch follows below.
  - Check its correlation with y to determine its predictive power.
  - Consider interaction effects with other economic variables like cons_price_idx.
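A possible encoding of the recession / stable / growth bins; the thresholds below are assumptions chosen only to illustrate the idea.
# Sketch: discretize emp_var_rate into economic-phase bins (thresholds assumed)
bank_data_econ <- bank_data %>%
  mutate(emp_var_phase = cut(emp_var_rate,
                             breaks = c(-Inf, -1, 0.5, Inf),
                             labels = c("recession", "stable", "growth")))
table(bank_data_econ$emp_var_phase)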
# Convert cons_price_idx to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
mutate(cons_price_idx = as.numeric(as.character(cons_price_idx)))
cons_price_idx_summary_df <- summary_to_dataframe(bank_data, "cons_price_idx")
# Display the summary as a data frame
print(cons_price_idx_summary_df)
## Statistic Value
## 1 Min. 92.20100
## 2 1st Qu. 93.07500
## 3 Median 93.74900
## 4 Mean 93.57566
## 5 3rd Qu. 93.99400
## 6 Max. 94.76700
plot_numeric_variable(bank_data, "cons_price_idx", bins = 50)
How is the cons_price_idx Variable Distributed?
- The minimum consumer price index is
92.20, while the maximum is
94.77.
- The median is 93.75, meaning half of the data points
fall below this value.
- The mean is 93.58, indicating the values are
evenly distributed around the center.
- The interquartile range (IQR) is between 93.08 and
93.99, suggesting most values are clustered in this
range.
Interpretation of the Plots:
- The histogram shows a discrete distribution, meaning
cons_price_idx
only takes on certain values, likely due to
it being recorded at fixed economic reporting periods.
- The boxplot confirms that the variable is tightly
distributed, with no extreme outliers.
- The presence of several peaks in the histogram
suggests that cons_price_idx
varies over time but
within a narrow range.
Implications for Predictive Modeling:
- The consumer price index reflects inflation and economic stability, which can impact financial decisions.
- A higher price index might be associated with reduced term deposit subscriptions (y) if inflation lowers consumer purchasing power.
- Feature Engineering Recommendations:
  - Group values into economic phases (e.g., low, moderate, high inflation periods).
  - Examine correlation with y to determine its predictive power.
  - Interaction terms with emp_var_rate may capture macroeconomic trends more effectively.
# Convert euribor3m to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
mutate(euribor3m = as.numeric(as.character(euribor3m)))
euribor3m_summary_df <- summary_to_dataframe(bank_data, "euribor3m")
# Display the summary as a data frame
print(euribor3m_summary_df)
## Statistic Value
## 1 Min. 0.634000
## 2 1st Qu. 1.344000
## 3 Median 4.857000
## 4 Mean 3.621291
## 5 3rd Qu. 4.961000
## 6 Max. 5.045000
plot_numeric_variable(bank_data, "euribor3m", bins = 50)
How is the euribor3m Variable Distributed?
The minimum is 0.63, and the
maximum is 5.05. The median
(4.86) and mean (3.62) suggest a
skewed distribution toward lower values. The
IQR (1.34 – 4.96) indicates most values fall within
this range.
Interpretation of the Plots:
The histogram shows a multimodal distribution with
peaks around 1 and 5, reflecting economic cycles. The
boxplot confirms a wide spread without extreme
outliers.
Implications for Predictive Modeling:
- Euribor3m is a key economic indicator influencing financial behavior.
- Higher rates may reduce term deposit subscriptions (y) due to increased borrowing costs.
- Feature Engineering Recommendations:
  - Segment values into interest rate periods (low, moderate, high).
  - Assess correlation with y to determine predictive strength.
  - Use interaction terms with cons_price_idx and emp_var_rate to capture macroeconomic effects.
# Convert nr_employed to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
mutate(nr_employed = as.numeric(as.character(nr_employed)))
nr_employed_summary_df <- summary_to_dataframe(bank_data, "nr_employed")
# Display the summary as a data frame
print(nr_employed_summary_df)
## Statistic Value
## 1 Min. 4963.600
## 2 1st Qu. 5099.100
## 3 Median 5191.000
## 4 Mean 5167.036
## 5 3rd Qu. 5228.100
## 6 Max. 5228.100
plot_numeric_variable(bank_data, "nr_employed", bins = 50)
How is the nr_employed Variable Distributed?
The minimum number of employees recorded is 4963.6, while the maximum is 5228.1. The median (5191.0) is above the mean (5167.0), suggesting a slightly left-skewed distribution. The IQR (5099.1 – 5228.1) indicates that most values are concentrated in this range.
Interpretation of the Plots:
The histogram reveals a discrete, multimodal distribution, reflecting distinct periods of employment levels. The boxplot shows that values cluster toward the upper end of the range, with no extreme outliers.
Implications for Predictive Modeling:
- Employment levels influence economic stability, potentially affecting term deposit subscriptions (y).
- Higher employment may correlate with increased savings and financial security, leading to higher term deposit subscriptions.
- Feature Engineering Recommendations:
  - Categorize employment levels into low, moderate, and high employment periods.
  - Assess interactions with emp_var_rate and euribor3m to capture macroeconomic trends.
Since nr_employed reflects labor market conditions, it is likely a strong predictor of financial behaviors and term deposit subscriptions.
library(stats)
library(corrplot)
bank_data %>%
keep(is.numeric) %>% # Select only numeric variables for correlation analysis
cor() %>% # Compute correlation matrix
corrplot() # Plot correlation heatmap
The results in the correlation plot show that we have a few features that are highly correlated:
- euribor3m, nr_employed, and emp_var_rate move together, suggesting that as the Euribor 3-month rate increases, so do the number of employed individuals and the employment variation rate.
- cons_price_idx (consumer price index) and euribor3m are positively correlated, indicating that higher consumer prices are associated with higher Euribor interest rates.
- emp_var_rate and nr_employed have a strong correlation, which makes sense as employment variation often aligns with total employment figures.
We also observe some negative correlations:
- cons_conf_idx (consumer confidence index) is negatively correlated with euribor3m and nr_employed, meaning that when consumer confidence is lower, interest rates and employment figures tend to be higher.
- Similarly, emp_var_rate is negatively correlated with cons_conf_idx, suggesting that employment variation decreases when consumer confidence is lower.
Before deciding what to do with these correlated variables, we should conduct further analysis to determine whether any variables are redundant or if they provide unique information. If two variables are nearly identical in their impact, one could be removed or transformed to reduce multicollinearity in any future modeling.
# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
# Min-Max Normalization Function
min_max_normalize <- function(x) {
return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
}
# Select numeric variables and apply min-max scaling
numeric_data <- bank_data %>%
dplyr::select(where(is.numeric)) %>%
mutate(across(everything(), min_max_normalize))
# Boxplots for standardized numerical variables
numeric_data %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Variable, y = Value, fill = Variable)) +
geom_boxplot(alpha = 0.6) +
coord_flip() +
theme_minimal() +
theme(legend.position = "none") +
ggtitle("Outlier Detection Using Boxplots")
Analysis of the Min-Max Standardized Boxplots (Outlier Detection)
The boxplot displays the distribution of min-max scaled numerical variables, highlighting potential outliers. Here’s an analysis of key observations:
Variables with Many Outliers
- campaign and duration have a large number of outliers.
- Campaign (number of contacts during the campaign) shows many clients receiving very few calls, but some were contacted multiple times, creating extreme values.
- Duration (call duration) is heavily right-skewed, with many long calls being potential outliers.
- previous and pdays (days since last contact) also show several extreme values.
Highly Skewed Variables
- euribor3m (Euribor 3-month rate) and nr_employed (number of employees) have long right tails, meaning they have some very high values.
- emp_var_rate (employment variation rate) shows a wide range but is not heavily skewed.
Variables with a Compact Range
- age and cons_price_idx (consumer price index) appear relatively well-distributed.
- cons_conf_idx (consumer confidence index) is tightly packed but has some outliers.
Interpretation of Outliers
- Campaign-related variables (campaign, previous, pdays) have extreme values, possibly indicating repeated contacts to some clients or first-time contacts for others.
- Economic indicators (euribor3m, nr_employed, emp_var_rate) contain extreme values, suggesting large fluctuations in the financial market over time.
- Call duration (duration) outliers indicate long conversations, which might be linked to successful subscriptions.
# Relationship between call duration and response (y)
ggplot(bank_data, aes(x = duration, fill = y)) +
geom_density(alpha = 0.5) +
theme_minimal() +
ggtitle("Call Duration vs. Response (y)")
This density plot visualizes the relationship between call duration and the response variable (y), which indicates whether a customer subscribed to a term deposit.
Key Observations:
- The "no" category (red) dominates short call durations, showing a sharp peak at low values.
- The "yes" category (blue) has a wider spread and extends into higher call durations.
- Customers who subscribed (y = yes) tend to have longer call durations compared to those who did not (y = no).
Implications for Predictive Modeling:
- Call duration is a strong predictor of subscription (y); longer calls may indicate a higher chance of convincing the customer.
- The overlap in distributions suggests that some short-duration calls also result in subscriptions, meaning duration alone is not a perfect predictor.
- Potential feature engineering: Consider binning duration into categories or applying transformations (e.g., log transformation) to improve model performance.
# Trend in campaign efforts over months
ggplot(bank_data, aes(x = month, fill = y)) +
geom_bar(position = "fill") +
theme_minimal() +
ggtitle("Subscription Rate by Month") +
ylab("Proportion of Subscriptions")
This stacked bar chart illustrates the proportion of term deposit subscriptions (y) by month.
Key Observations:
- The majority of responses are "no" across all months, indicating a general difficulty in securing term deposit subscriptions.
- December, March, and October show the highest proportions of "yes" responses, suggesting these months might be more favorable for conversions.
- May, June, and July have the lowest subscription rates, potentially indicating seasonal challenges in customer acquisition.
Implications for Predictive Modeling:
- Time-based features: The month of contact may be an important predictor of subscription likelihood.
- Seasonality considerations: Further analysis can determine if external economic factors (e.g., holidays, interest rates) influence customer decisions.
- Marketing strategy alignment: Campaign efforts might be more effective in months with historically higher subscription rates.
# Order days of the week
bank_data <- bank_data %>%
mutate(day_of_week = factor(day_of_week,
levels = c("mon", "tue", "wed", "thu", "fri")))
# Trend in campaign efforts over days of the week
ggplot(bank_data, aes(x = day_of_week, fill = y)) +
geom_bar(position = "fill") +
theme_minimal() +
ggtitle("Subscription Rate by Day of the Week") +
ylab("Proportion of Subscriptions") +
xlab("Day of the Week")
The subscription rate remains relatively consistent across all weekdays, with no significant variation in the proportion of successful (yes) and unsuccessful (no) term deposit subscriptions. This suggests that the day of the week does not play a major role in influencing customer decisions regarding subscriptions.
We will assess Multiple Linear Regression (MLR), Logistic Regression (LR), k-Nearest Neighbors (kNN), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Naive Bayes (NB) based on scalability, interpretability, assumptions, and suitability for categorical data.
Multiple Linear Regression (MLR)
Pros:
- Simple and computationally efficient.
- Easily interpretable coefficients.
- Works well with continuous outcomes.
Cons:
- Not suitable for classification problems like this dataset, as it assumes a continuous response variable.
- Assumes linearity between independent and dependent variables, which does not hold for a categorical target like y.
- Poor at handling class imbalances.
Suitability: Not suitable (MLR is meant for regression, not classification).
Logistic Regression (LR)
Pros:
- Specifically designed for binary classification (ideal for predicting y).
- Interpretable coefficients (e.g., odds of subscription based on predictors).
- Scales well to large datasets like this one (41,188 records).
- Handles categorical variables effectively.
Cons:
- Assumes a linear relationship between the log-odds and predictors, which may not always hold.
- Not effective for non-linearly separable data.
Suitability: Best suited for this dataset due to its efficiency and interpretability.
k-Nearest Neighbors (kNN)
Pros:
- Non-parametric (no strong assumptions about data distribution).
- Can capture non-linear relationships.
- Simple and easy to understand.
Cons:
- Computationally expensive for large datasets (distance calculations for 41,188 records).
- Sensitive to irrelevant features and scaling issues.
- Struggles with categorical data, requiring extensive preprocessing (e.g., encoding job, education).
Suitability: Not practical (computationally inefficient and requires extensive preprocessing).
Linear Discriminant Analysis (LDA)
Pros:
- Works well for classification tasks with normally distributed data.
- Handles multi-class problems (though not needed here).
- More robust than Logistic Regression in some cases.
Cons:
- Assumes a Gaussian distribution of features, which may not hold in this dataset.
- Assumes equal covariance among classes, which is often unrealistic.
Suitability: Potential alternative, but normality assumptions should be checked.
Quadratic Discriminant Analysis (QDA)
Pros:
- More flexible than LDA (allows different covariances for each class).
- Can model non-linear decision boundaries.
Cons:
- Data-intensive: requires more samples per class to estimate covariance matrices.
- Prone to overfitting, especially with small datasets or imbalanced classes.
Suitability: Not ideal due to its complexity and data requirements.
Naive Bayes (NB)
Pros:
- Fast and scalable, even for large datasets.
- Handles categorical data efficiently (e.g., job, education, marital).
- Works well in high-dimensional spaces.
Cons:
- Assumes feature independence, which is unrealistic for real-world data.
- May underperform when predictors are highly correlated.
Suitability: Good alternative if the independence assumption does not significantly impact performance.
Logistic Regression (LR) is the best choice because:
- It handles categorical and numerical variables well.
- It scales efficiently with large datasets (~41,188 records).
- It provides interpretable results and probability scores.
The dataset includes binary labels (y), making classification algorithms appropriate. LR is preferred over MLR (designed for regression) and kNN (inefficient for large datasets).
Yes. Naive Bayes or kNN could be viable since they perform well on small datasets, while QDA risks overfitting.
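As a hedged sketch of how the chosen logistic regression could be fit and inspected (the predictor subset here is illustrative only, not the final model specification):
# Sketch: fit a logistic regression on a few illustrative predictors
lr_sketch <- glm(y ~ age + job + contact + campaign + duration,
                 data = bank_data, family = binomial)
summary(lr_sketch)                            # coefficients on the log-odds scale
head(predict(lr_sketch, type = "response"))   # predicted subscription probabilities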
Since the unknown values in housing and loan are small (~2.40%), imputing with the mode ensures minimal distortion to the distribution. However, for default, where ~20.87% of values are unknown, replacing them with the most frequent category may introduce bias. Instead, “unknown” is preserved as a separate category to allow the model to capture any predictive patterns associated with it. This approach maintains data integrity while minimizing unnecessary imputation bias.
The code below performs missing data handling for the categorical variables default, housing, and loan in bank_data:
- It calculates the percentage of "unknown" values in default, housing, and loan.
- It replaces "unknown" with "missing" in housing and loan while keeping default unchanged.
- It imputes the missing values in housing and loan using the mode (most frequent value in each column).
# Load necessary library
library(dplyr)
# Function to calculate missing percentages
calculate_missing_percentage <- function(data, cols) {
missing_summary <- data %>%
summarise(across(all_of(cols), ~ mean(. == "unknown") * 100)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Percentage")
return(missing_summary)
}
# Calculate missing percentages before transformation
missing_before <- calculate_missing_percentage(bank_data, c("default", "housing", "loan"))
# Display missing percentages
print(missing_before)
## # A tibble: 3 × 2
## Variable Missing_Percentage
## <chr> <dbl>
## 1 default 20.9
## 2 housing 2.40
## 3 loan 2.40
# Replace "unknown" with "missing" in categorical variables (only for housing and loan)
bank_data <- bank_data %>%
mutate(across(c(housing, loan), ~ ifelse(. == "unknown", "missing", .)))
# Calculate missing percentages after replacing "unknown" with "missing"
missing_after_replace <- calculate_missing_percentage(bank_data, c("housing", "loan"))
# Impute Missing Values for housing and loan with Mode (Most Frequent Value)
for (col in c("housing", "loan")) {
most_frequent <- names(which.max(table(bank_data[[col]])))
bank_data[[col]][bank_data[[col]] == "missing"] <- most_frequent
}
# Calculate missing percentages after imputation
missing_after_imputation <- calculate_missing_percentage(bank_data, c("housing", "loan"))
# Display missing percentages after imputation
print(missing_after_imputation)
## # A tibble: 2 × 2
## Variable Missing_Percentage
## <chr> <dbl>
## 1 housing 0
## 2 loan 0
# Load necessary libraries
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
# Compute correlation matrix for numeric features
cor_matrix <- cor(bank_data %>% select(where(is.numeric)), use = "pairwise.complete.obs")
# Convert correlation matrix to a data frame for readability
cor_df <- as.data.frame(as.table(cor_matrix))
# Filter to show only correlations >|0.8| (excluding self-correlations)
significant_correlations <- cor_df %>%
filter(Var1 != Var2, abs(Freq) > 0.8) %>%
arrange(desc(abs(Freq)))
# Print only significant correlations
print("Significant Correlations (>|0.8|):")
## [1] "Significant Correlations (>|0.8|):"
print(significant_correlations)
## Var1 Var2 Freq
## 1 euribor3m emp_var_rate 0.9722447
## 2 emp_var_rate euribor3m 0.9722447
## 3 nr_employed euribor3m 0.9451544
## 4 euribor3m nr_employed 0.9451544
## 5 nr_employed emp_var_rate 0.9069701
## 6 emp_var_rate nr_employed 0.9069701
# Find highly correlated variables (correlation > 0.85) for removal
highly_correlated_indices <- findCorrelation(cor_matrix, cutoff = 0.85)
# Get variable names of highly correlated features
highly_correlated_vars <- names(bank_data %>% select(where(is.numeric)))[highly_correlated_indices]
# **Explicitly add 'day_of_week' for removal due to lack of predictive power**
removal_candidates <- c(highly_correlated_vars, "day_of_week")
# Ensure only existing variables are removed
removed_vars <- intersect(names(bank_data), removal_candidates)
# Remove highly correlated and non-informative variables from dataset
bank_data <- bank_data %>% select(-all_of(removed_vars))
# Print the variables that were actually removed
print("Variables Removed from the Dataset:")
## [1] "Variables Removed from the Dataset:"
print(removed_vars)
## [1] "day_of_week" "emp_var_rate" "euribor3m"
The analysis reveals that several variables exhibit significant correlations (>|0.85|), indicating strong relationships that could introduce multicollinearity issues and impact model performance. Notably: - euribor3m and emp_var_rate have a very high correlation (0.972), suggesting a direct relationship between the Euro Interbank Offered Rate (Euribor) and employment variation rate. - nr_employed is strongly correlated with euribor3m (0.945), indicating that the number of employees is highly dependent on the Euribor rate. - nr_employed also has a strong correlation with emp_var_rate (0.907), reinforcing the relationship between employment levels and economic indicators.
Due to these high correlations, euribor3m and emp_var_rate were removed from the dataset to reduce redundancy and mitigate potential multicollinearity. This ensures that the predictive model remains interpretable and avoids issues related to highly correlated variables influencing model outcomes.
Additionally, day_of_week was removed because its distribution showed no meaningful variation in subscription rates across different days. Since the response variable (‘y’) remains nearly identical regardless of the day, this feature is unlikely to contribute to model performance and was removed to reduce dimensionality.
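As an additional check (a minimal sketch, not part of the original pipeline), the correlation matrix of the remaining numeric columns can be recomputed to confirm that no pair now exceeds the 0.85 cutoff:
# Sketch: verify that no remaining numeric pair exceeds the 0.85 cutoff
remaining_cor <- cor(bank_data %>% dplyr::select(where(is.numeric)), use = "pairwise.complete.obs")
diag(remaining_cor) <- 0 # ignore self-correlations
max(abs(remaining_cor)) # expected to be below 0.85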
To enhance the dataset for predictive modeling while preserving the pattern within each predictor, the following transformations were carried out:
duration and campaign exhibit extreme right-skewness. Applying a Box-Cox transformation may help normalize these distributions while maintaining meaningful differences.
# Load necessary libraries
library(dplyr)
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:PerformanceAnalytics':
##
## kurtosis, skewness
library(tidyr)
library(ggplot2)
# Function to calculate skewness
calculate_skewness <- function(data, cols) {
data %>%
summarise(across(all_of(cols), ~ skewness(.x, na.rm = TRUE))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness")
}
# Select duration variable
numeric_var_duration <- "duration"
# Calculate skewness before transformation
skewness_before_duration <- calculate_skewness(bank_data, numeric_var_duration)
# Apply Box-Cox Transformation for duration
x_adj_duration <- bank_data$duration + 1 # Adjust for zero values
lambda_duration <- BoxCox.lambda(x_adj_duration, method = "loglik")
bank_data$duration_boxcox <- BoxCox(x_adj_duration, lambda_duration)
# Compute Skewness After Transformation
skewness_after_duration <- calculate_skewness(bank_data, "duration_boxcox")
# Merge Skewness Before and After
skewness_df_duration <- left_join(skewness_before_duration, skewness_after_duration, by = "Variable", suffix = c("_Before", "_After"))
# Visualize Distribution Before and After Transformation
bank_data %>%
dplyr::select(all_of(c("duration", "duration_boxcox"))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 50, fill = "blue", alpha = 0.5) +
facet_wrap(~ Variable, scales = "free") +
theme_minimal() +
ggtitle("Distributions Before and After Box-Cox Transformation for Duration")
The Box-Cox transformation has significantly improved the distribution of the duration variable. Initially, the distribution was highly right-skewed, with a large concentration of values near zero and a long tail extending towards higher values.
After applying the Box-Cox transformation, the distribution appears much more normal, with a symmetrical bell shape centered around its mean. This transformation will likely enhance the performance of models that assume normality, such as LDA and QDA, and may improve the interpretability of logistic regression.
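The skewness comparison stored above in skewness_df_duration is not printed in the original chunk; a minimal sketch to display it together with the estimated lambda (assuming the objects from the previous chunk are still in the session):
# Sketch: show the Box-Cox lambda and the skewness before/after the transformation
print(lambda_duration)
print(skewness_df_duration)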
# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(e1071)
library(forecast)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Function to calculate skewness
calculate_skewness <- function(data, cols) {
data %>%
summarise(across(all_of(cols), ~ skewness(.x, na.rm = TRUE))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness")
}
# Select campaign variable
numeric_var_campaign <- "campaign"
# Calculate skewness before transformation
skewness_before_campaign <- calculate_skewness(bank_data, numeric_var_campaign)
# Apply Box-Cox Transformation for campaign
x_adj_campaign <- bank_data$campaign + 1 # Adjust for zero values
lambda_campaign <- BoxCox.lambda(x_adj_campaign, method = "loglik")
bank_data$campaign_boxcox <- BoxCox(x_adj_campaign, lambda_campaign)
# Apply Log1p Transformation
bank_data$campaign_log <- log1p(bank_data$campaign)
# Apply Square Root Transformation
bank_data$campaign_sqrt <- sqrt(bank_data$campaign)
# Apply Reciprocal Transformation (1/x) - Handle zeros separately
bank_data$campaign_reciprocal <- ifelse(bank_data$campaign == 0, 0, 1 / bank_data$campaign)
# Apply Binarization (High vs. Low Engagement)
bank_data$campaign_binary <- ifelse(bank_data$campaign > median(bank_data$campaign, na.rm = TRUE), "High", "Low")
# Compute Skewness After Each Transformation (except binary)
transformed_vars_campaign <- c("campaign_boxcox", "campaign_log", "campaign_sqrt", "campaign_reciprocal")
skewness_after_campaign <- calculate_skewness(bank_data, transformed_vars_campaign)
# Merge Skewness Before and After
skewness_df_campaign <- left_join(skewness_before_campaign, skewness_after_campaign, by = "Variable", suffix = c("_Before", "_After"))
# Visualize Distributions Before and After Transformations
bank_data %>%
dplyr::select(all_of(c("campaign", transformed_vars_campaign))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 50, fill = "blue", alpha = 0.5) +
facet_wrap(~ Variable, scales = "free", nrow = 2, ncol = 3) + # Facet with 2 rows and 3 columns
theme_minimal() +
ggtitle("Distributions Before and After Various Transformations for Campaign")
# Visualize Binarized Campaign Variable
ggplot(bank_data, aes(x = campaign_binary, fill = campaign_binary)) +
geom_bar(alpha = 0.5) +
theme_minimal() +
ggtitle("Binarized Campaign Variable (High vs. Low Engagement)")
# Load necessary libraries
library(dplyr)
library(ggplot2)
# Check for NA values before binning
bank_data <- bank_data %>% mutate(campaign = as.numeric(campaign))
# Define improved bins for 'campaign'
bank_data <- bank_data %>%
mutate(campaign_bin = case_when(
campaign == 1 ~ "1",
campaign == 2 ~ "2",
campaign == 3 ~ "3",
campaign == 4 ~ "4",
campaign == 5 ~ "5",
campaign == 6 ~ "6",
campaign >= 7 & campaign <= 10 ~ "7-10",
campaign > 10 ~ "11+",
TRUE ~ "Missing" # Handling NA values
))
# Convert to ordered factor
bank_data$campaign_bin <- factor(bank_data$campaign_bin,
levels = c("1", "2", "3", "4", "5", "6", "7-10", "11+", "Missing"))
# Plot histogram for binned 'campaign' variable
ggplot(bank_data, aes(x = campaign_bin, fill = campaign_bin)) +
geom_bar(alpha = 0.7) +
theme_minimal() +
ggtitle("Binned Campaign Variable with More Granular Bins") +
xlab("Campaign Bin") +
ylab("Count") +
theme(legend.position = "none")
Binning vs. Binarization for the campaign Variable:
Binning was considered because alternative transformations (Box-Cox, log, square root, and reciprocal) did not sufficiently resolve the skewness in the campaign variable. Given the heavy concentration of low values and the sharp decline in higher values, binning provides a more structured representation, making the variable more suitable for models like Logistic Regression, LDA, QDA, kNN, and Naive Bayes, which may struggle with highly skewed continuous inputs.
The finer binning approach captures more variability compared to simple binarization. While binarization simplifies interpretation, it may discard important distinctions between different engagement levels. Granular binning retains more information about customer interaction patterns while still reducing skewness, making it potentially more useful for models like kNN, LDA, and QDA, which can leverage multiple discrete categories effectively. However, for Logistic Regression and Naive Bayes, a binary variable might still be preferable to avoid sparse category issues.
Best choice?
- Use the binned campaign variable for Logistic Regression and Naive Bayes.
- Use a transformed continuous version for kNN, LDA, and QDA.
A quick cross-tab of subscription rates by bin (see the sketch below) can confirm whether the bins carry useful signal.
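As a rough check (a minimal sketch; it assumes campaign_bin was created in the chunk above and that y is still coded "yes"/"no" at this stage), the subscription rate per bin can be tabulated:
# Sketch: proportion of "yes" responses within each campaign bin
bank_data %>%
count(campaign_bin, y) %>%
group_by(campaign_bin) %>%
mutate(prop = n / sum(n)) %>%
filter(y == "yes")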
poutcome_bin: Converts poutcome into a binary variable (1 = success, 0 = others).
bank_data <- bank_data %>%
mutate(poutcome_bin = ifelse(poutcome == "success", 1, 0))
campaign_binary: Classifies clients into high vs. low engagement based on the median number of contacts (created earlier; recreated here only if missing).
if (!"campaign_binary" %in% names(bank_data)) {
bank_data <- bank_data %>%
mutate(campaign_binary = ifelse(campaign > median(campaign, na.rm = TRUE), "High", "Low"))
}
month as a Factor Instead of Seasonal Binning: Keeping month as a factor allows models to capture variations in client behavior across different months.
bank_data$month <- as.factor(bank_data$month)
job: Lumps infrequent job categories (those below 2% of observations) into an "Other" level.
library(forcats) # fct_lump() comes from forcats
bank_data$job <- fct_lump(bank_data$job, prop = 0.02)
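A quick sketch (not in the original analysis) to confirm which job levels were lumped into "Other":
# Sketch: frequency and proportion of each job level after lumping
forcats::fct_count(bank_data$job, sort = TRUE, prop = TRUE)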
Transforming pdays Into a More Useful Feature: contacted_before (1 if previously contacted, 0 if never contacted). Since pdays = 999 means the client was never contacted before, this transformation simplifies model learning.
bank_data <- bank_data %>%
mutate(contacted_before = ifelse(pdays == 999, 0, 1)) %>%
dplyr::select(-pdays) # Remove `pdays`
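As a small sketch (not part of the original pipeline; it assumes y is still coded "yes"/"no" here), the subscription rate can be compared between clients contacted before and those never contacted:
# Sketch: row-wise subscription rates by contact history
prop.table(table(bank_data$contacted_before, bank_data$y), margin = 1)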
previous_contacts_ratio: Measures prior engagement by normalizing the number of previous contacts relative to campaign attempts.
loan_housing_combo: Encodes the combination of loan and housing into a single categorical variable.
bank_data <- bank_data %>%
mutate(
previous_contacts_ratio = as.numeric(as.character(previous)) / (as.numeric(as.character(campaign)) + 1),
loan_housing_combo = paste0(loan, "_", housing)
)
The remaining categorical variables are then converted to factors:
categorical_vars <- c("contact", "campaign_binary", "default", "education", "housing", "job",
"loan", "marital", "month", "loan_housing_combo", "poutcome")
bank_data <- bank_data %>%
mutate(across(all_of(categorical_vars), as.factor))
One-hot encoding converts each categorical variable into indicator (dummy) columns (e.g., job_admin, job_blue-collar) and drops the original columns.
if (!requireNamespace("fastDummies", quietly = TRUE)) {
install.packages("fastDummies")
}
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.3.3
bank_data <- dummy_cols(bank_data, select_columns = categorical_vars, remove_selected_columns = TRUE)
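A brief sanity check (a minimal sketch, not part of the original pipeline) after dummy encoding:
# Sketch: inspect the expanded dimensionality and remaining character columns
dim(bank_data)
sum(sapply(bank_data, is.character)) # likely only y ("yes"/"no") remains character at this stage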
Encoding the Target Variable (y): Converting y to a binary factor (0 = no, 1 = yes) ensures compatibility with classification models.
bank_data <- bank_data %>%
mutate(y = factor(ifelse(y == "yes", 1, 0)))
Moving y to the Last Column:
bank_data <- bank_data %>%
dplyr::select(-y, everything(), y)
Replacing the characters - and . in column names ensures compatibility with certain ML libraries that do not support special characters.
colnames(bank_data) <- gsub("-", "_", colnames(bank_data))
colnames(bank_data) <- gsub("\\.", "_", colnames(bank_data))
Ensuring y Remains a Factor: Keeping y as a factor with levels 0 and 1 avoids issues during classification.
bank_data <- bank_data %>%
mutate(y = factor(y, levels = c(0, 1)))
glimpse(bank_data)
## Rows: 41,188
## Columns: 69
## $ age <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, …
## $ duration <dbl> 261, 149, 226, 151, 307, 198, 139, 217, …
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cons_price_idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, …
## $ cons_conf_idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4…
## $ nr_employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191…
## $ duration_boxcox <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_boxcox <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, …
## $ campaign_log <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_sqrt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_reciprocal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_bin <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ previous_contacts_ratio <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1…
## $ default_unknown <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ default_yes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_6y <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_9y <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0…
## $ education_illiterate <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0…
## $ education_university_degree <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ housing_1 <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0…
## $ housing_3 <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ job_admin_ <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0…
## $ job_entrepreneur <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ job_management <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services <int> 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0…
## $ job_student <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ job_unemployed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1 <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ loan_3 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_divorced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ marital_married <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ marital_single <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0…
## $ marital_unknown <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1 <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ loan_housing_combo_1_3 <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ loan_housing_combo_3_1 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ loan_housing_combo_3_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ y <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
This script standardizes numeric variables using different scaling techniques while ensuring that the target variable (y) remains unchanged. The transformation applied depends on the characteristics of each variable.
Key implementation details:
- The transformed variables are given prefixes (z_, minmax_, robust_) to clearly differentiate them from their original versions.
- The original versions of the transformed variables are removed to avoid redundancy, except for y, which remains unchanged.
- The relative order of the remaining columns is preserved, with y moved to the front.
# Load necessary library
library(dplyr)
# Identify variables for different standardization techniques (excluding 'y')
z_score_vars <- c("age", "duration", "previous_contacts_ratio", "nr_employed")
min_max_vars <- c("campaign_boxcox", "campaign_sqrt", "cons_price_idx")
robust_vars <- c("cons_conf_idx")
# Standardization Functions
z_score_standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
min_max_standardize <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
robust_standardize <- function(x) (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
# Ensure 'y' remains unchanged
target_var <- "y"
# Apply Z-score Standardization
bank_data <- bank_data %>%
mutate(across(all_of(z_score_vars), z_score_standardize, .names = "z_{.col}"))
# Apply Min-Max Scaling
bank_data <- bank_data %>%
mutate(across(all_of(min_max_vars), min_max_standardize, .names = "minmax_{.col}"))
# Apply Robust Scaling
bank_data <- bank_data %>%
mutate(across(all_of(robust_vars), robust_standardize, .names = "robust_{.col}"))
# Drop original unscaled columns (excluding 'y')
bank_data <- bank_data %>%
dplyr::select(all_of(target_var), everything(), -all_of(c(z_score_vars, min_max_vars, robust_vars)))
# Verify the transformation
glimpse(bank_data)
## Rows: 41,188
## Columns: 69
## $ y <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ duration_boxcox <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_log <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_reciprocal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_bin <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1…
## $ default_unknown <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ default_yes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_6y <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_9y <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0…
## $ education_illiterate <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0…
## $ education_university_degree <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ housing_1 <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0…
## $ housing_3 <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ job_admin_ <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0…
## $ job_entrepreneur <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ job_management <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services <int> 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0…
## $ job_student <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ job_unemployed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1 <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ loan_3 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_divorced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ marital_married <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ marital_single <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0…
## $ marital_unknown <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1 <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ loan_housing_combo_1_3 <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ loan_housing_combo_3_1 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ loan_housing_combo_3_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ z_age <dbl> 1.533015677, 1.628973456, -0.290182119, …
## $ z_duration <dbl> 0.01047130, -0.42149540, -0.12451830, -0…
## $ z_previous_contacts_ratio <dbl> -0.3312077, -0.3312077, -0.3312077, -0.3…
## $ z_nr_employed <dbl> 0.3316759, 0.3316759, 0.3316759, 0.33167…
## $ minmax_campaign_boxcox <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ minmax_campaign_sqrt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ minmax_cons_price_idx <dbl> 0.6987529, 0.6987529, 0.6987529, 0.69875…
## $ robust_cons_conf_idx <dbl> 0.8571429, 0.8571429, 0.8571429, 0.85714…
For predictive modeling, we can use simple random sampling or stratified random sampling to create training and test datasets.
Simple random sampling selects observations without replacement, so no record is drawn twice and the training and test sets do not overlap.
# Set seed for reproducibility
set.seed(1234)
# Define training sample size (e.g., 75% of the data)
sample_size <- round(nrow(bank_data) * 0.75)
# Create sample set
sample_set <- sample(nrow(bank_data), sample_size, replace = FALSE)
# Split data into training and test sets
train_data <- bank_data[sample_set, ]
test_data <- bank_data[-sample_set, ]
# Verify class distribution remains consistent
print(round(prop.table(table(train_data$y)) * 100, 2))
##
## 0 1
## 88.67 11.33
print(round(prop.table(table(test_data$y)) * 100, 2))
##
## 0 1
## 88.92 11.08
Since y is a categorical variable, we should ensure that both training and test sets maintain the same proportion of classes.
# Load caret package
library(caret)
# Stratified sampling with 75% training data
set.seed(1234)
trainIndex <- createDataPartition(bank_data$y, p = 0.75, list = FALSE)
# Split data based on stratified sampling
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]
# Verify class distribution remains consistent
round(prop.table(table(train_data$y)) * 100, 2)
##
## 0 1
## 88.73 11.27
round(prop.table(table(test_data$y)) * 100, 2)
##
## 0 1
## 88.73 11.27
The class distribution in the training dataset closely mirrors that of the original dataset, with approximately 88.73% “no” responses and 11.27% “yes” responses in both cases. This indicates that the sampling process was performed correctly, preserving the proportion of classes in the response variable. Maintaining a similar distribution is crucial because it ensures that the model trained on the sample will generalize well to the full dataset, reducing bias and improving predictive performance.
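A compact way to see this (a minimal sketch, not part of the original analysis) is to stack the class proportions of the full data and both splits:
# Sketch: class proportions (in %) for the full data, training set, and test set
round(rbind(
full = prop.table(table(bank_data$y)),
train = prop.table(table(train_data$y)),
test = prop.table(table(test_data$y))
) * 100, 2)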
# Load necessary libraries
library(themis)
## Warning: package 'themis' was built under R version 4.3.3
## Loading required package: recipes
## Warning: package 'recipes' was built under R version 4.3.3
##
## Attaching package: 'recipes'
## The following object is masked from 'package:stringr':
##
## fixed
## The following object is masked from 'package:stats':
##
## step
library(dplyr)
library(recipes)
# Convert `y` to a factor (required for SMOTE)
train_data <- train_data %>%
mutate(y = as.factor(y))
# Store `y` separately and remove it before conversion
y_train <- train_data$y
train_data <- train_data %>% dplyr::select(-y) # Remove `y` before conversion
# Convert other factor variables to numeric
train_data <- train_data %>%
mutate(across(where(is.factor), as.numeric))
# Add `y` back after conversion
train_data <- train_data %>%
mutate(y = y_train)
# Define the SMOTE recipe
set.seed(1234)
smote_recipe <- recipe(y ~ ., data = train_data) %>%
step_smote(y, over_ratio = 1) %>% # Balance classes
prep()
# Apply SMOTE transformation
train_data_smote <- juice(smote_recipe)
# Check the new class distribution after SMOTE
table(train_data_smote$y)
##
## 0 1
## 27411 27411
glimpse(train_data)
## Rows: 30,891
## Columns: 69
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ duration_boxcox <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_log <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_reciprocal <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, …
## $ campaign_bin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no <int> 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1…
## $ default_unknown <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0…
## $ default_yes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ education_basic_6y <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_9y <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0…
## $ education_illiterate <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
## $ education_university_degree <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0…
## $ housing_1 <int> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0…
## $ housing_3 <int> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1…
## $ job_admin_ <int> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1…
## $ job_entrepreneur <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ job_management <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services <int> 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0…
## $ job_student <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ job_unemployed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ loan_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ marital_divorced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_married <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1…
## $ marital_single <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0…
## $ marital_unknown <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1 <int> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…
## $ loan_housing_combo_1_3 <int> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1…
## $ loan_housing_combo_3_1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_housing_combo_3_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ z_age <dbl> 1.533015677, 1.628973456, -0.290182119, …
## $ z_duration <dbl> 0.01047130, -0.42149540, -0.12451830, -0…
## $ z_previous_contacts_ratio <dbl> -0.3312077, -0.3312077, -0.3312077, -0.3…
## $ z_nr_employed <dbl> 0.3316759, 0.3316759, 0.3316759, 0.33167…
## $ minmax_campaign_boxcox <dbl> 0.0000000, 0.0000000, 0.0000000, 0.00000…
## $ minmax_campaign_sqrt <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00…
## $ minmax_cons_price_idx <dbl> 0.6987529, 0.6987529, 0.6987529, 0.69875…
## $ robust_cons_conf_idx <dbl> 0.8571429, 0.8571429, 0.8571429, 0.85714…
## $ y <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
The training data contains far more no responses than yes responses, which is why SMOTE was applied above to balance the classes before modeling.
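As a final check (a minimal sketch, assuming train_data and train_data_smote from the chunks above are still in memory), the class counts before and after SMOTE can be compared side by side:
# Sketch: class counts before and after SMOTE
rbind(
before_smote = table(train_data$y),
after_smote = table(train_data_smote$y)
)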