A Portuguese bank conducted a direct marketing campaign (phone calls) to determine whether clients would subscribe to a term deposit. The records of these efforts are available as a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics to help the bank persuade more customers to subscribe to a term deposit in its next campaign.
# Load the bank marketing dataset (semicolon-separated; read strings as factors)
df <- read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/refs/heads/main/Data_622/bank-full.csv", sep = ";", stringsAsFactors = T)
age: age of the client
job: type of job
marital: marital status
education: education level
default: has credit in default?
balance: average yearly balance
housing: has housing loan?
loan: has personal loan?
contact: contact communication type (categorical: ‘cellular’, ‘telephone’, ‘unknown’)
day: last contact day of the month (numeric)
month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
duration: last contact duration, in seconds (numeric).
campaign: number of contacts performed during this campaign and for this client
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘other’, ‘success’, ‘unknown’)
y: has the client subscribed a term deposit?
The dataset consists of 45,211 instances and 17 features, most of which are categorical. Among the 10 categorical variables, “contact”, “loan”, “default”, and “y” exhibit highly imbalanced distributions, with more than 80% of instances belonging to a single category. In contrast, “education”, “housing”, “poutcome”, and “marital” contain two or three substantive categories each (after treating “unknown” as missing) and appear fairly balanced. The “job” and “month” variables have 11 and 12 categories, respectively, with observations spread across many levels.
The dataset also includes seven numeric variables, namely “age”, “balance”, “campaign”, “day”, “duration”, “pdays”, and “previous”. The distributions of “age”, “balance”, “campaign”, and “duration” are right-skewed, indicating that most values are concentrated toward the lower end, with a few extreme values pulling the distribution to the right. The “day” variable exhibits a uniform distribution, while “pdays” and “previous” are extremely right-skewed, suggesting that most data points are clustered near the lower end, with a few outliers significantly increasing the range.
Outlier analysis reveals that “pdays” and “previous” contain the highest proportion of outliers (18%), followed by “balance” (10.5%), “duration” (7.2%), and “campaign” (6.8%). The “age” variable has only 1.1% outliers, making it relatively stable for analysis. Additionally, correlation analysis indicates that “pdays” and “previous” exhibit a weak positive correlation, while the remaining numeric variables show little to no correlation with each other.
The dataset also contains missing values, with “unknown” values treated as missing data. The most significant missing data is found in “poutcome”, where 81.7% of values are missing, followed by “contact” (28.8% missing values). The “education” and “job” variables contain a small number of missing values, which may have minimal impact on analysis. Importantly, there are no duplicated values in the dataset.
Additionally, a Near Zero Variance (NZV) check identified “default” and “pdays” as NZV variables, meaning that a single value dominates almost all rows. Such variables provide little to no useful information for modeling and can harm machine learning models: near-constant predictors yield unstable coefficient estimates in regression and can produce degenerate resamples during cross-validation.
Overall, the dataset presents several key characteristics, including imbalanced categorical distributions, right-skewed numeric variables, significant outliers, and missing values. These insights help guide further data preprocessing and model selection to ensure accurate and meaningful analysis.
glimpse(df)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
str(df)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Treat "unknown" as missing: convert each factor to character,
# replace "unknown" with NA, then re-factor
df.clean <- df %>%
mutate(across(where(is.factor), ~ factor(na_if(as.character(.), "unknown"))))
str(df.clean)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 5 10 3 2 NA 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 3 levels "primary","secondary",..: 3 2 2 NA NA 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": NA NA NA NA NA NA NA NA NA NA ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","other",..: NA NA NA NA NA NA NA NA NA NA ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
cor_matrix <- cor(df.clean %>% select(where(is.numeric)), use = "complete.obs")
cor_long <- reshape2::melt(cor_matrix)
# Keep only the lower triangle to avoid duplicated pairs
cor_long <- cor_long[as.numeric(cor_long$Var1) > as.numeric(cor_long$Var2), ]
ggplot(cor_long, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = ifelse(value != 0, round(value, 2), "")),
color = "black", size = 3, fontface = "bold") +
scale_fill_gradient2(low = "coral", high = "lightblue", mid = "white",
midpoint = 0, limit = c(-1, 1),
space = "Lab", name = "Correlation") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10),
axis.text.y = element_text(size = 10),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
) +
ggtitle("Correlation Matrix")
# Category variable distribution
category_percentages <- df.clean |>
summarise(across(where(is.factor), ~ list(prop.table(table(.)) * 100))) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category_Percentage") |>
drop_na() |>
unnest(cols = c(Category_Percentage)) |>
mutate(Category = names(Category_Percentage),
Category_Percentage = as.numeric(Category_Percentage)) |>
arrange(Variable, desc(Category_Percentage)) |>
mutate(Variable = factor(Variable, levels = unique(Variable)))
# Define new variable groups manually
group_1 <- c("contact", "loan", "default", "y")
group_2 <- c("education", "housing", "poutcome", "marital")
group_3 <- c("job", "month")
# Filter data for each chart
category_percentages_1 <- category_percentages |> filter(Variable %in% group_1)
category_percentages_2 <- category_percentages |> filter(Variable %in% group_2)
category_percentages_3 <- category_percentages |> filter(Variable %in% group_3)
# First Plot (Contact, Loan, Default, y)
ggplot(category_percentages_1, aes(x = Category,
y = Category_Percentage, fill = Variable)) +
geom_bar(stat = "identity", color = "black") +
# Percentage on top of the bar
geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")),
vjust = 0.92, size = 3, color = "black", fontface = "bold") +
facet_wrap(~ Variable, scales = "free_x") +
theme_minimal() +
labs(title = "Category Distribution (Part 1: Contact, Loan, Default, y)", x = "Category", y = "Percentage") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))
# Second Plot (Education, Housing, Poutcome, marital)
ggplot(category_percentages_2, aes(x = Category,
y = Category_Percentage, fill = Variable)) +
geom_bar(stat = "identity", color = "black") +
# Percentage on top of the bar
geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")),
vjust = 1.5, size = 3, color = "black", fontface = "bold") +
facet_wrap(~ Variable, scales = "free_x") +
theme_minimal() +
labs(title = "Category Distribution (Part 2: Education, Housing, Poutcome, Marital)", x = "Category", y = "Percentage") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))
# Third Plot (Job & Month)
ggplot(category_percentages_3, aes(x = Category,
y = Category_Percentage, fill = Variable)) +
geom_bar(stat = "identity", color = "black") +
# Percentage on top of the bar
geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")),
vjust = -0.5, size = 2.5, color = "black", fontface = "bold") +
facet_wrap(~ Variable, scales = "free_x") +
theme_minimal() +
labs(title = "Category Distribution (Part 3: Job & Month)", x = "Category", y = "Percentage") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))
# Numeric Variable Distribution
df.clean |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "coral", color = "black") +
facet_wrap(~ Variable, scales = "free", ncol = 4) +
theme(
panel.grid = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")
df.clean |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
ggplot(aes(x = Variable, y = Value)) + # Boxplots need categorical x-axis
geom_boxplot(fill = "lightblue", color = "black") +
facet_wrap(~ Variable, scales = "free", ncol = 3) + # Facet for multiple numeric variables
theme_minimal() +
labs(title = "Boxplots of Numeric Variables", x = "Variable", y = "Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Finding outliers
outliers <- df.clean |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
group_by(Variable) |>
mutate(
Q1 = quantile(Value, 0.25, na.rm = TRUE),
Q3 = quantile(Value, 0.75, na.rm = TRUE),
IQR = Q3 - Q1,
Lower_Bound = Q1 - 1.5 * IQR,
Upper_Bound = Q3 + 1.5 * IQR,
Outlier = Value < Lower_Bound | Value > Upper_Bound
) |>
filter(Outlier) |>
ungroup()
print(outliers)
## # A tibble: 28,029 × 8
## Variable Value Q1 Q3 IQR Lower_Bound Upper_Bound Outlier
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 balance 10635 72 1428 1356 -1962 3462 TRUE
## 2 duration 1666 103 319 216 -221 643 TRUE
## 3 duration 1492 103 319 216 -221 643 TRUE
## 4 duration 787 103 319 216 -221 643 TRUE
## 5 duration 1778 103 319 216 -221 643 TRUE
## 6 duration 812 103 319 216 -221 643 TRUE
## 7 balance 6530 72 1428 1356 -1962 3462 TRUE
## 8 balance 12223 72 1428 1356 -1962 3462 TRUE
## 9 balance 5935 72 1428 1356 -1962 3462 TRUE
## 10 duration 1042 103 319 216 -221 643 TRUE
## # ℹ 28,019 more rows
# Visualize outliers
outliers |>
group_by(Variable) |>
summarise(
Outlier_Count = n(),
Total_Count = nrow(df.clean),
Outlier_Percentage = (Outlier_Count / Total_Count) * 100
) |>
ungroup() |>
ggplot(aes(x = reorder(Variable, -Outlier_Count), y = Outlier_Count, fill = Outlier_Percentage)) +
geom_bar(stat = "identity", color = "black") +
# Display outlier count on top of the bar
geom_text(aes(label = Outlier_Count), vjust = -0.5, size = 3, fontface = "bold") +
# Display percentage inside the bar (centered)
geom_text(aes(label = paste0(round(Outlier_Percentage, 1), "%")),
vjust = 1.5, size = 3, color = "black", fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "coral", name = "Outlier %") +
theme_minimal() +
labs(
title = "Outlier Count and Percentage per Numeric Variable",
x = "Variable",
y = "Number of Outliers"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
df.clean |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
ggplot(aes(x = Variable, y = Value)) +
geom_boxplot(fill = "lightblue", color = "black", outlier.color = "coral", outlier.shape = 16) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Boxplots of Numeric Variables with Outliers", x = "Variable", y = "Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
missing_data <- df.clean |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") |>
mutate(Missing_Percentage = (Missing_Count / nrow(df.clean)) * 100) |>
arrange(desc(Missing_Count))
print(missing_data)
## # A tibble: 17 × 3
## Variable Missing_Count Missing_Percentage
## <chr> <int> <dbl>
## 1 poutcome 36959 81.7
## 2 contact 13020 28.8
## 3 education 1857 4.11
## 4 job 288 0.637
## 5 age 0 0
## 6 marital 0 0
## 7 default 0 0
## 8 balance 0 0
## 9 housing 0 0
## 10 loan 0 0
## 11 day 0 0
## 12 month 0 0
## 13 duration 0 0
## 14 campaign 0 0
## 15 pdays 0 0
## 16 previous 0 0
## 17 y 0 0
ggplot(missing_data, aes(x = reorder(Variable, -Missing_Count), y = Missing_Count, fill = Missing_Percentage)) +
geom_bar(stat = "identity", color = "black") +
geom_text(aes(label = Missing_Count), vjust = -0.5, size = 2.5, fontface = "bold") +
geom_text(aes(label = paste0(round(Missing_Percentage, 1), "%")),
vjust = 1.5, size = 2.3, color = "black", fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "coral", name = "Missing %") +
theme_minimal() +
labs(title = "Missing Values Count and Percentage per Variable",
x = "Variable", y = "Number of Missing Values") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
sum(duplicated(df.clean))
## [1] 0
nzv <- nearZeroVar(df.clean, saveMetrics = TRUE)
print(nzv)
## freqRatio percentUnique zeroVar nzv
## age 1.044589 0.170312535 FALSE FALSE
## job 1.028970 0.024330362 FALSE FALSE
## marital 2.127756 0.006635553 FALSE FALSE
## education 1.744380 0.006635553 FALSE FALSE
## default 54.473620 0.004423702 FALSE TRUE
## balance 18.020513 15.854548672 FALSE FALSE
## housing 1.251432 0.004423702 FALSE FALSE
## loan 5.241165 0.004423702 FALSE FALSE
## contact 10.077426 0.004423702 FALSE FALSE
## day 1.192374 0.068567384 FALSE FALSE
## month 1.996519 0.026542213 FALSE FALSE
## duration 1.021739 3.479241777 FALSE FALSE
## campaign 1.402959 0.106168853 FALSE FALSE
## pdays 221.281437 1.236424764 FALSE TRUE
## previous 13.331169 0.090685895 FALSE FALSE
## poutcome 2.663587 0.006635553 FALSE FALSE
## y 7.548119 0.004423702 FALSE FALSE
For this dataset, the goal is to apply machine learning techniques to predict whether a customer will subscribe to a bank’s term deposit, making it a binary classification problem. The dataset contains a labeled response variable (“y”), so this is a supervised learning problem. Classification algorithms such as Random Forest, Logistic Regression, and KNN are suitable candidates. Binary Logistic Regression is highly interpretable and computationally efficient, making it useful when insights into customer behavior are needed. However, Logistic Regression assumes a linear relationship between the independent variables and the log-odds of the target variable. Highly skewed numeric variables, such as ‘balance’, ‘campaign’, and ‘duration’, strain this assumption and produce influential observations, potentially biasing coefficient estimates and hurting model performance. Additionally, imbalanced categorical variables such as ‘contact’, ‘loan’, and ‘default’ may further challenge the model by skewing predictions toward the majority class. To address these issues, transformations such as log scaling can help normalize right-skewed distributions, while class weighting can give the minority class appropriate importance during model training.
On the other hand, K-Nearest Neighbors (KNN) is a non-parametric model that classifies a data point based on the majority vote of its nearest neighbors. It works well when decision boundaries are non-linear but can be computationally expensive on large datasets like this one. Additionally, KNN is sensitive to irrelevant features and outliers, requiring careful feature selection and preprocessing.
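To make these considerations concrete, below is a minimal KNN sketch (not run in this report), assuming the class package; the predictor subset, seed, split, and k = 15 are illustrative choices rather than tuned ones.
library(class)
# Standardize a few numeric predictors; KNN distances are scale-sensitive
num_vars <- c("age", "balance", "duration", "campaign")
knn_df <- na.omit(df.clean[, c(num_vars, "y")])
knn_X <- scale(knn_df[, num_vars])
# Simple 70/30 holdout split
set.seed(622)
train_idx <- sample(nrow(knn_X), floor(0.7 * nrow(knn_X)))
knn_pred <- knn(train = knn_X[train_idx, ],
                test = knn_X[-train_idx, ],
                cl = knn_df$y[train_idx],
                k = 15)
mean(knn_pred == knn_df$y[-train_idx]) # holdout accuracy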
Among these models, Random Forest (RF) is the most suitable choice due to its ability to handle mixed data types, robustness to outliers, and strong predictive accuracy. Since this dataset contains both categorical and numeric variables, Random Forest can process them efficiently without extensive preprocessing. The dataset also has imbalanced features, which could affect simpler models like Logistic Regression. While Random Forest is less sensitive to class imbalance than Logistic Regression, it may still favor the majority class; class weighting or resampling techniques can mitigate this issue. Unlike models that assume a linear relationship between features and the response, such as Logistic or Linear Regression, Random Forest is a tree-based model that does not rely on feature distributions being normal. As a result, it is not significantly affected by right-skewed numeric variables, making it well suited to this dataset.
The choice of algorithm is also influenced by the size of the dataset, which contains 45,211 instances. Since Random Forest performs well on large datasets and can efficiently handle feature interactions, it is the most practical choice. If the dataset were smaller (fewer than 1,000 records), a simpler and faster model such as Logistic Regression or KNN would be preferred.
In conclusion, Random Forest is the recommended model due to its ability to capture complex relationships, work well with mixed data types, and, with class weighting, mitigate class imbalance. If model interpretability is a priority, Logistic Regression is a good secondary option. If the dataset were significantly smaller, Logistic Regression or KNN would be more suitable due to their efficiency and simplicity. This selection ensures that the bank can effectively analyze customer behavior and optimize marketing strategies to increase subscription rates.
The correlation matrix from the EDA showed that “pdays” and “previous” have a weak positive correlation. In addition, the Near Zero Variance (NZV) check flagged “default” and “pdays”; these variables contribute very little information and are removed to avoid model instability. “poutcome” is missing in 81.7% of rows and is removed as well.
df.clean.final <- df.clean |>
select(-pdays, -default, -poutcome)
According to the data description, when duration equals 0 the response is always “no”, so these rows should be discarded if the intention is to build a realistic predictive model. Three rows have a duration of 0.
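A quick check before filtering (output not shown):
# Count calls with zero duration
sum(df.clean.final$duration == 0)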
df.clean.final <- df.clean.final |> filter(duration != 0)
“contact”, “education”, and “job” have missing values. Here, we use random-forest imputation (the missForest package) to fill in these values.
library(missForest)
# missForest is stochastic (and can be slow on ~45k rows), so fix a seed
set.seed(622)
df.imputed <- missForest(df.clean.final)
df.clean.final <- df.imputed$ximp
missing_data_check <- df.clean.final |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") |>
mutate(Missing_Percentage = (Missing_Count / nrow(df.clean.final)) * 100) |>
arrange(desc(Missing_Count))
print(missing_data_check)
## # A tibble: 14 × 3
## Variable Missing_Count Missing_Percentage
## <chr> <int> <dbl>
## 1 age 0 0
## 2 job 0 0
## 3 marital 0 0
## 4 education 0 0
## 5 balance 0 0
## 6 housing 0 0
## 7 loan 0 0
## 8 contact 0 0
## 9 day 0 0
## 10 month 0 0
## 11 duration 0 0
## 12 campaign 0 0
## 13 previous 0 0
## 14 y 0 0
Age can be binned into categories to capture behavior that differs across age groups.
# Create Age_Group based on defined age bins
df.clean.final <- df.clean.final |>
mutate(Age_Group = case_when(
age < 30 ~ "Young",
age >= 30 & age <= 50 ~ "Middle",
age > 50 ~ "Senior"
))
# Convert Age_Group into a factor for better modeling
df.clean.final$Age_Group <- factor(df.clean.final$Age_Group, levels = c("Young", "Middle", "Senior"))
# Check the distribution of Age_Group
table(df.clean.final$Age_Group)
##
## Young Middle Senior
## 5273 30681 9254
The df.clean.final data is ready for the Random Forest model. Random Forest is less sensitive to class imbalance than Logistic Regression, though it may still favor the majority class. It is also largely unaffected by right-skewed numeric variables: unlike models that assume a linear relationship between features and the response (such as Logistic or Linear Regression), it is a tree-based model and does not rely on feature distributions being normal.
Because “contact”, “loan”, and “y” remain imbalanced (“default” was already dropped as an NZV variable), class imbalance can hurt Logistic Regression’s ability to predict the minority class accurately: it pulls the fitted probabilities toward the majority class and skews the decision boundary. If we prefer not to modify the dataset by resampling, we can instead adjust class weights when training Logistic Regression or Random Forest.
Since “balance”, “campaign”, “duration”, and “previous” are heavily right-skewed, their distributions have long tails with most values concentrated at the lower end. This can hurt models that benefit from roughly symmetric predictors (e.g., Logistic Regression). A log, square-root, or Box-Cox transformation can be applied; the log transform is used below, and a Box-Cox sketch follows it.
# Apply log transformation (add 1 to avoid log(0));
# negative balances are clamped to 0 before the shift
df.clean.final.logistic <- df.clean.final |>
mutate(
balance_log = log(pmax(balance, 0) + 1),
campaign_log = log(campaign + 1),
duration_log = log(duration + 1),
previous_log = log(previous + 1)
)
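Beyond the fixed log transform above, a Box-Cox transformation estimates the power parameter from the data. The following is a minimal sketch for “duration” (an illustration, not part of the modeling pipeline): the +1 shift keeps values strictly positive, and MASS::boxcox is called with a namespace prefix so that attaching MASS does not mask dplyr::select.
# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- MASS::boxcox(lm((duration + 1) ~ 1, data = df.clean.final),
                   lambda = seq(-1, 1, 0.05), plotit = FALSE)
bc$x[which.max(bc$y)] # lambda with the highest profile likelihood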
# Numeric Variable Distribution
df.clean.final.logistic |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "coral", color = "black") +
facet_wrap(~ Variable, scales = "free", ncol = 4) +
theme(
panel.grid = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")
# For Logistic Regression: upweight the minority class. "yes" is roughly 12%
# of rows, so a weight of 5 is a round, illustrative choice (the inverse
# prevalence would be about 7.5).
class_weights <- ifelse(df.clean.final.logistic$y == "yes", 5, 1)
# Train Logistic Regression with weights
log_model <- glm(y ~ . - balance - campaign - duration - previous,
data = df.clean.final.logistic,
family = binomial,
weights = class_weights)
summary(log_model)
##
## Call:
## glm(formula = y ~ . - balance - campaign - duration - previous,
## family = binomial, data = df.clean.final.logistic, weights = class_weights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9250 -0.7979 -0.4189 -0.1846 8.3453
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.848021 0.145046 -74.790 < 2e-16 ***
## age 0.010224 0.002085 4.904 9.40e-07 ***
## jobblue-collar -0.404125 0.042469 -9.516 < 2e-16 ***
## jobentrepreneur -0.530998 0.072868 -7.287 3.17e-13 ***
## jobhousemaid -0.552928 0.078223 -7.069 1.57e-12 ***
## jobmanagement -0.300142 0.045233 -6.635 3.24e-11 ***
## jobretired 0.055782 0.060858 0.917 0.359357
## jobself-employed -0.498571 0.067690 -7.365 1.77e-13 ***
## jobservices -0.331398 0.048859 -6.783 1.18e-11 ***
## jobstudent 0.388591 0.073093 5.316 1.06e-07 ***
## jobtechnician -0.256175 0.041367 -6.193 5.91e-10 ***
## jobunemployed -0.247267 0.068188 -3.626 0.000288 ***
## maritalmarried -0.182502 0.035175 -5.188 2.12e-07 ***
## maritalsingle 0.072562 0.040736 1.781 0.074866 .
## educationsecondary 0.233082 0.036581 6.372 1.87e-10 ***
## educationtertiary 0.621443 0.044378 14.003 < 2e-16 ***
## housingyes -0.863579 0.025517 -33.843 < 2e-16 ***
## loanyes -0.502230 0.033746 -14.883 < 2e-16 ***
## contacttelephone 0.001994 0.046529 0.043 0.965816
## day 0.001285 0.001443 0.890 0.373440
## monthaug -0.637669 0.047686 -13.372 < 2e-16 ***
## monthdec 0.808027 0.128283 6.299 3.00e-10 ***
## monthfeb -0.084278 0.055040 -1.531 0.125719
## monthjan -1.238596 0.070025 -17.688 < 2e-16 ***
## monthjul -0.852180 0.047034 -18.119 < 2e-16 ***
## monthjun -0.589839 0.049366 -11.948 < 2e-16 ***
## monthmar 1.960958 0.091167 21.509 < 2e-16 ***
## monthmay -1.141059 0.042662 -26.746 < 2e-16 ***
## monthnov -0.823088 0.050937 -16.159 < 2e-16 ***
## monthoct 1.215984 0.076602 15.874 < 2e-16 ***
## monthsep 1.053303 0.085133 12.372 < 2e-16 ***
## Age_GroupMiddle -0.613060 0.040741 -15.048 < 2e-16 ***
## Age_GroupSenior -0.636444 0.070616 -9.013 < 2e-16 ***
## balance_log 0.069647 0.004245 16.407 < 2e-16 ***
## campaign_log -0.490387 0.026157 -18.748 < 2e-16 ***
## duration_log 2.033952 0.017727 114.738 < 2e-16 ***
## previous_log 0.836010 0.019019 43.956 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 89245 on 45207 degrees of freedom
## Residual deviance: 53345 on 45171 degrees of freedom
## AIC: 53419
##
## Number of Fisher Scoring iterations: 6
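As a rough in-sample check of the weighted model (a sketch, not a proper holdout evaluation; output not shown), we can classify with a 0.5 cutoff on the predicted probability of “yes”:
# In-sample predicted classes at a 0.5 probability cutoff
log_pred <- ifelse(predict(log_model, type = "response") > 0.5, "yes", "no")
table(Actual = df.clean.final.logistic$y, Predicted = log_pred)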
# Handling imbalanced data: use class weights instead of resampling
# For the Random Forest classifier:
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# 'classwt' is interpreted by randomForest as class priors
rf_model <- randomForest(y ~ ., data = df.clean.final, classwt = c("no" = 1, "yes" = 5))
rf_model
##
## Call:
## randomForest(formula = y ~ ., data = df.clean.final, classwt = c(no = 1, yes = 5))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 9.97%
## Confusion matrix:
## no yes class.error
## no 38483 1436 0.03597285
## yes 3072 2217 0.58082813
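Finally, to connect the model back to the campaign question of which factors matter most, the forest’s variable importance can be inspected (output not shown here):
# Mean decrease in Gini impurity per predictor, with a companion plot
importance(rf_model)
varImpPlot(rf_model, main = "Random Forest Variable Importance")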