A Portuguese bank conducted a marketing campaign (phone calls) to persuade clients to subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign.

df<- read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/refs/heads/main/Data_622/bank-full.csv", sep = ";", stringsAsFactors = T)

Variable Overview

age: Age

job: type of job

marital: marital status

education: Education Level

default: has credit in default?

balance: average yearly balance

housing: has housing loan?

loan: has personal loan?

contact: contact communication type (categorical: ‘cellular’, ‘telephone’, ‘unknown’)

day: last contact day of the month (numeric: 1–31)

month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)

duration: last contact duration, in seconds (numeric).

campaign: number of contacts performed during this campaign and for this client

pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)

previous: number of contacts performed before this campaign and for this client

poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘other’, ‘success’, ‘unknown’)

y: has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)

Exploratory Data Analysis

The dataset consists of 45,211 observations and 17 variables (16 predictors plus the target “y”), most of them categorical. Among the 10 categorical variables, “contact”, “loan”, “default”, and “y” exhibit highly imbalanced distributions, with more than 80% of the instances belonging to a single category. In contrast, “education”, “poutcome”, and “marital” contain three categories each (once “unknown” is treated as missing), “housing” is binary, and their distributions are comparatively balanced. The “job” and “month” variables have 11 and 12 categories, respectively, and show a relatively even distribution.

The dataset also includes seven numeric variables, namely “age”, “balance”, “campaign”, “day”, “duration”, “pdays”, and “previous”. The distributions of “age”, “balance”, “campaign”, and “duration” are right-skewed, indicating that most values are concentrated toward the lower end, with a few extreme values pulling the distribution to the right. The “day” variable exhibits a uniform distribution, while “pdays” and “previous” are extremely right-skewed, suggesting that most data points are clustered near the lower end, with a few outliers significantly increasing the range.

Outlier analysis reveals that “pdays” and “previous” contain the highest proportion of outliers (18% each), followed by “balance” (10.5%), “duration” (7.2%), and “campaign” (6.8%). The “age” variable has only 1.1% outliers, making it relatively stable for analysis. Additionally, correlation analysis indicates that “pdays” and “previous” are moderately positively correlated, while the remaining numeric variables show little to no correlation with each other.

The dataset also contains missing values, with “unknown” values treated as missing data. The most significant missing data is found in “poutcome”, where 81.7% of values are missing, followed by “contact” (28.8% missing values). The “education” and “job” variables contain a small number of missing values, which may have minimal impact on analysis. Importantly, there are no duplicated values in the dataset.

Additionally, a Near Zero Variance (NZV) check identified “default” and “pdays” as NZV variables, meaning that a single value dominates almost all rows. Such variables carry little to no useful information for modeling and can destabilize machine learning models, for example by producing unstable coefficient estimates in regression or near-constant features in resampled training sets.

Overall, the dataset presents several key characteristics, including imbalanced categorical distributions, right-skewed numeric variables, significant outliers, and missing values. These insights help guide further data preprocessing and model selection to ensure accurate and meaningful analysis.

glimpse(df)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital   <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
str(df)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Treat "unknown" entries as missing: convert factors to character,
# replace "unknown" with NA, then restore the factor types
df.clean <- df %>%
  mutate(across(where(is.factor), ~ na_if(as.character(.), "unknown"))) %>%
  mutate(across(c(job, marital, education, default, housing, loan, contact, month, poutcome, y), as.factor))

str(df.clean)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 11 levels "admin.","blue-collar",..: 5 10 3 2 NA 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 3 levels "primary","secondary",..: 3 2 2 NA NA 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 2 levels "cellular","telephone": NA NA NA NA NA NA NA NA NA NA ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 3 levels "failure","other",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Checking Correlation Using a Correlation Matrix

cor_matrix <- cor(df.clean %>% select(where(is.numeric)), use = "complete.obs")

cor_long <- melt(cor_matrix)
cor_long <- cor_long[as.numeric(cor_long$Var1) > as.numeric(cor_long$Var2), ]


ggplot(cor_long, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = ifelse(value != 0, round(value, 2), "")), 
            color = "black", size = 3, fontface = "bold") +  
  scale_fill_gradient2(low = "coral", high = "lightblue", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), 
                       space = "Lab", name = "Correlation") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10),  
    axis.text.y = element_text(size = 10),                                   
    axis.title = element_blank(),                                            
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)         
  ) +
  ggtitle("Correlation Matrix")

Categorical and Numeric Variable Distributions

# Category variable distribution

category_percentages <- df.clean |>
  summarise(across(where(is.factor), ~ list(prop.table(table(.)) * 100))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category_Percentage") |>
  drop_na() |>
  unnest(cols = c(Category_Percentage)) |>
  mutate(Category = names(Category_Percentage),
         Category_Percentage = as.numeric(Category_Percentage)) |>  # strip the table class so ggplot treats it as plain numeric
  arrange(Variable, desc(Category_Percentage)) |>  
  mutate(Variable = factor(Variable, levels = unique(Variable))) 


# Define new variable groups manually
group_1 <- c("contact", "loan", "default", "y")  
group_2 <- c("education", "housing", "poutcome", "marital")  
group_3 <- c("job", "month") 

# Filter data for each chart
category_percentages_1 <- category_percentages |> filter(Variable %in% group_1)
category_percentages_2 <- category_percentages |> filter(Variable %in% group_2)
category_percentages_3 <- category_percentages |> filter(Variable %in% group_3)

# First Plot (Contact, Loan, Default, y)
ggplot(category_percentages_1, aes(x = Category, 
                                   y = Category_Percentage, fill = Variable)) +
  geom_bar(stat = "identity", color = "black") +

  # Percentage on top of the bar
  geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")), 
            vjust = 0.92, size = 3, color = "black", fontface = "bold") +  
  
  facet_wrap(~ Variable, scales = "free_x") +  
  theme_minimal() +
  labs(title = "Category Distribution (Part 1: Contact, Loan, Default, y)", x = "Category", y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +  
  theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))

# Second Plot (Education, Housing, Poutcome, marital)
ggplot(category_percentages_2, aes(x = Category, 
                                   y = Category_Percentage, fill = Variable)) +
  geom_bar(stat = "identity", color = "black") +

  # Percentage on top of the bar
  geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")), 
            vjust = 1.5, size = 3, color = "black", fontface = "bold") +  

  facet_wrap(~ Variable, scales = "free_x") +  
  theme_minimal() +
  labs(title = "Category Distribution (Part 2: Education, Housing, Poutcome)", x = "Category", y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +  
  theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))

# Third Plot (Job & Month)
ggplot(category_percentages_3, aes(x = Category, 
                                   y = Category_Percentage, fill = Variable)) +
  geom_bar(stat = "identity", color = "black") +

  # Percentage on top of the bar
  geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")), 
            vjust = -0.5, size = 2.5, color = "black", fontface = "bold") +  

  facet_wrap(~ Variable, scales = "free_x") +  
  theme_minimal() +
  labs(title = "Category Distribution (Part 3: Job & Month)", x = "Category", y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +  
  theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))

# Numeric Variable Distribution

df.clean |>
  select(where(is.numeric)) |>
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "coral", color = "black") +
  facet_wrap(~ Variable, scales = "free", ncol = 4) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")

Checking for Outliers

df.clean |>
  select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |>  
  ggplot(aes(x = Variable, y = Value)) +  # Boxplots need categorical x-axis
  geom_boxplot(fill = "lightblue", color = "black") + 
  facet_wrap(~ Variable, scales = "free", ncol = 3) +  # Facet for multiple numeric variables
  theme_minimal() +
  labs(title = "Boxplots of Numeric Variables", x = "Variable", y = "Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Finding outliers
outliers <- df.clean |>
  select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |> 
  group_by(Variable) |>
  mutate(
    Q1 = quantile(Value, 0.25, na.rm = TRUE),
    Q3 = quantile(Value, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    Lower_Bound = Q1 - 1.5 * IQR,
    Upper_Bound = Q3 + 1.5 * IQR,
    Outlier = Value < Lower_Bound | Value > Upper_Bound
  ) |>
  filter(Outlier) |>
  ungroup()


print(outliers)
## # A tibble: 28,029 × 8
##    Variable Value    Q1    Q3   IQR Lower_Bound Upper_Bound Outlier
##    <chr>    <int> <dbl> <dbl> <dbl>       <dbl>       <dbl> <lgl>  
##  1 balance  10635    72  1428  1356       -1962        3462 TRUE   
##  2 duration  1666   103   319   216        -221         643 TRUE   
##  3 duration  1492   103   319   216        -221         643 TRUE   
##  4 duration   787   103   319   216        -221         643 TRUE   
##  5 duration  1778   103   319   216        -221         643 TRUE   
##  6 duration   812   103   319   216        -221         643 TRUE   
##  7 balance   6530    72  1428  1356       -1962        3462 TRUE   
##  8 balance  12223    72  1428  1356       -1962        3462 TRUE   
##  9 balance   5935    72  1428  1356       -1962        3462 TRUE   
## 10 duration  1042   103   319   216        -221         643 TRUE   
## # ℹ 28,019 more rows
# Visualize outliers

outliers |>
  group_by(Variable) |>
  summarise(
    Outlier_Count = n(),
    Total_Count = nrow(df.clean),  
    Outlier_Percentage = (Outlier_Count / Total_Count) * 100  
  ) |>
  ungroup() |>
  ggplot(aes(x = reorder(Variable, -Outlier_Count), y = Outlier_Count, fill = Outlier_Percentage)) +
  geom_bar(stat = "identity", color = "black") +
  
  # Display outlier count on top of the bar
  geom_text(aes(label = Outlier_Count), vjust = -0.5, size = 3, fontface = "bold") +  
  
  # Display percentage inside the bar (centered)
  geom_text(aes(label = paste0(round(Outlier_Percentage, 1), "%")), 
            vjust = 1.5, size = 3, color = "black", fontface = "bold") +  

  scale_fill_gradient(low = "lightblue", high = "coral", name = "Outlier %") +  
  theme_minimal() +
  labs(
    title = "Outlier Count and Percentage per Numeric Variable",
    x = "Variable",
    y = "Number of Outliers"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

df.clean |>
  select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |>
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "lightblue", color = "black", outlier.color = "coral", outlier.shape = 16) + 
  facet_wrap(~ Variable, scales = "free", ncol = 3) +  
  theme_minimal() +
  labs(title = "Boxplots of Numeric Variables with Outliers", x = "Variable", y = "Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Missing Values

missing_data <- df.clean |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") |>
  mutate(Missing_Percentage = (Missing_Count / nrow(df.clean)) * 100) |> 
  arrange(desc(Missing_Count))

print(missing_data)
## # A tibble: 17 × 3
##    Variable  Missing_Count Missing_Percentage
##    <chr>             <int>              <dbl>
##  1 poutcome          36959             81.7  
##  2 contact           13020             28.8  
##  3 education          1857              4.11 
##  4 job                 288              0.637
##  5 age                   0              0    
##  6 marital               0              0    
##  7 default               0              0    
##  8 balance               0              0    
##  9 housing               0              0    
## 10 loan                  0              0    
## 11 day                   0              0    
## 12 month                 0              0    
## 13 duration              0              0    
## 14 campaign              0              0    
## 15 pdays                 0              0    
## 16 previous              0              0    
## 17 y                     0              0
ggplot(missing_data, aes(x = reorder(Variable, -Missing_Count), y = Missing_Count, fill = Missing_Percentage)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = Missing_Count), vjust = -0.5, size = 2.5, fontface = "bold") +  
  geom_text(aes(label = paste0(round(Missing_Percentage, 1), "%")), 
            vjust = 1.5, size = 2.3, color = "black", fontface = "bold") +  
  scale_fill_gradient(low = "lightblue", high = "coral", name = "Missing %") +  
  theme_minimal() +
  labs(title = "Missing Values Count and Percentage per Variable",
       x = "Variable", y = "Number of Missing Values") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Check for Duplicated Values

sum(duplicated(df.clean))
## [1] 0

Check for Near Zero Variance

nzv <- nearZeroVar(df.clean, saveMetrics = TRUE)
print(nzv)
##            freqRatio percentUnique zeroVar   nzv
## age         1.044589   0.170312535   FALSE FALSE
## job         1.028970   0.024330362   FALSE FALSE
## marital     2.127756   0.006635553   FALSE FALSE
## education   1.744380   0.006635553   FALSE FALSE
## default    54.473620   0.004423702   FALSE  TRUE
## balance    18.020513  15.854548672   FALSE FALSE
## housing     1.251432   0.004423702   FALSE FALSE
## loan        5.241165   0.004423702   FALSE FALSE
## contact    10.077426   0.004423702   FALSE FALSE
## day         1.192374   0.068567384   FALSE FALSE
## month       1.996519   0.026542213   FALSE FALSE
## duration    1.021739   3.479241777   FALSE FALSE
## campaign    1.402959   0.106168853   FALSE FALSE
## pdays     221.281437   1.236424764   FALSE  TRUE
## previous   13.331169   0.090685895   FALSE FALSE
## poutcome    2.663587   0.006635553   FALSE FALSE
## y           7.548119   0.004423702   FALSE FALSE

Algorithm Selection

For this dataset, the goal is to apply machine learning techniques to predict whether a customer will subscribe to a bank’s term deposit, making it a binary classification problem. The dataset contains a labeled response variable (“y”), so this is a supervised learning problem. Classification algorithms such as Random Forest, Logistic Regression, and KNN are suitable candidates. Binary Logistic Regression is highly interpretable and computationally efficient, making it useful when insights into customer behavior are needed. However, Logistic Regression assumes a linear relationship between the predictors and the log-odds of the target variable. Highly skewed numeric variables, such as “balance”, “campaign”, and “duration”, strain this assumption and give extreme values disproportionate leverage, potentially distorting coefficient estimates and degrading model performance. Additionally, imbalanced categorical variables such as “contact”, “loan”, and “default” may further challenge the model by skewing predictions toward the majority class. To address these issues, transformations such as log scaling can help normalize right-skewed distributions, while class weighting can ensure that the minority class receives appropriate importance during model training; a sketch of frequency-derived weights follows.
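
Class weighting need not be ad hoc. Below is a minimal sketch (not part of the original analysis) of “balanced” inverse-frequency weights derived from the observed class counts; the object names are ours.

# Balanced class weights: weight = n / (k * n_class), so the minority
# class ("yes") counts proportionally more during training
class_counts <- table(df$y)
class_wts <- nrow(df) / (length(class_counts) * class_counts)
class_wts  # roughly 0.57 for "no" and 4.3 for "yes"

# Per-observation weights (e.g., for the weights argument of glm;
# note glm() may warn about non-integer weights in a binomial fit)
obs_wts <- as.numeric(class_wts[as.character(df$y)])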

On the other hand, K-Nearest Neighbors (KNN) is a non-parametric model that classifies a data point based on the majority vote of its nearest neighbors. It works well when decision boundaries are non-linear but can be computationally expensive on large datasets like this one. Additionally, KNN is sensitive to irrelevant features and outliers, requiring careful feature selection and preprocessing.
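
To make the KNN caveats concrete, here is a hedged sketch (assuming the class package; the subsample size and k are illustrative choices, not tuned values). KNN needs z-scored features, since raw distances are otherwise dominated by wide-range variables such as “balance”.

# Minimal KNN sketch on the numeric columns only
library(class)

num_scaled <- scale(df.clean[sapply(df.clean, is.numeric)])  # z-score features
set.seed(622)
idx       <- sample(nrow(num_scaled), 10000)  # subsample: full-data KNN is slow
train_idx <- idx[1:8000]
test_idx  <- idx[8001:10000]

pred <- knn(train = num_scaled[train_idx, ],
            test  = num_scaled[test_idx, ],
            cl    = df.clean$y[train_idx],
            k     = 15)
mean(pred == df.clean$y[test_idx])  # holdout accuracy, inflated by the "no" majority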

Among these models, Random Forest (RF) is the most suitable choice due to its ability to handle mixed data types, its robustness to outliers, and its strong predictive accuracy. Since this dataset contains both categorical and numeric variables, Random Forest can process them efficiently without extensive preprocessing. Additionally, the dataset has some imbalanced features, which could affect simpler models like Logistic Regression. While Random Forest is less sensitive to class imbalance than Logistic Regression, it may still favor the majority class; class weighting or resampling techniques can help mitigate this. Unlike models that assume a linear relationship (such as Logistic or Linear Regression) or that rely on distance calculations easily distorted by skewed, unscaled features (such as KNN), Random Forest is a tree-based model that makes no assumption that features are normally distributed. As a result, it is not significantly affected by right-skewed numeric variables, making it well-suited for this dataset.

The choice of algorithm is also influenced by the size of the dataset, which contains 45,211 instances. Since Random Forest performs well on large datasets and can efficiently handle feature interactions, it is the most practical choice. If the dataset were smaller (fewer than 1,000 records), a simpler and faster model such as Logistic Regression or KNN would be preferred.

In conclusion, Random Forest is the recommended model due to its ability to capture complex relationships, handle imbalanced data, and work well with mixed data types. If model interpretability is a priority, Logistic Regression would be a good secondary option. However, if the dataset were significantly smaller, Logistic Regression or KNN would be more suitable due to their efficiency and simplicity. This selection ensures that the bank can effectively analyze customer behavior and optimize marketing strategies to increase subscription rates.

Pre-processing

Removing Redundant Features

The correlation matrix from the EDA showed that “pdays” and “previous” are positively correlated, so they carry overlapping information. In addition, the Near Zero Variance (NZV) variables “default” and “pdays” contribute very little information and should be removed to avoid model instability. “poutcome” consists largely of missing values (81.7%) and is removed as well.

df.clean.final <- df.clean |> 
                select(-pdays, -default, -poutcome)

Removing Rows

According to the data description, when duration equals 0 the response variable is always “no”, so such rows should be discarded if the intention is to build a realistic predictive model. Three rows have a duration of 0.

df.clean.final <- df.clean.final |> filter(duration != 0)

Imputation for Missing Values

“contact”, “education”, and “job” have missing values. Here, we use Random Forest imputation (the missForest package) to fill them in.

library(missForest)
## Warning: package 'missForest' was built under R version 4.2.3
df.imputed <- missForest(df.clean.final)
df.clean.final <- df.imputed$ximp
missing_data_check <- df.clean.final |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") |>
  mutate(Missing_Percentage = (Missing_Count / nrow(df.clean)) * 100) |> 
  arrange(desc(Missing_Count))

print(missing_data_check)
## # A tibble: 14 × 3
##    Variable  Missing_Count Missing_Percentage
##    <chr>             <int>              <dbl>
##  1 age                   0                  0
##  2 job                   0                  0
##  3 marital               0                  0
##  4 education             0                  0
##  5 balance               0                  0
##  6 housing               0                  0
##  7 loan                  0                  0
##  8 contact               0                  0
##  9 day                   0                  0
## 10 month                 0                  0
## 11 duration              0                  0
## 12 campaign              0                  0
## 13 previous              0                  0
## 14 y                     0                  0

Feature Engineering

Age can be converted into a categorical variable to capture behavioral differences across age groups.

# Create Age_Group based on defined age bins
df.clean.final <- df.clean.final |> 
  mutate(Age_Group = case_when(
    age < 30 ~ "Young",
    age >= 30 & age <= 50 ~ "Middle",
    age > 50 ~ "Senior"
  ))

# Convert Age_Group into a factor for better modeling
df.clean.final$Age_Group <- factor(df.clean.final$Age_Group, levels = c("Young", "Middle", "Senior"))

# Check the distribution of Age_Group
table(df.clean.final$Age_Group)
## 
##  Young Middle Senior 
##   5273  30681   9254
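
As a quick sanity check (not part of the original analysis), we can verify that the bins actually separate subscription behavior:

# Subscription rate by age group
df.clean.final |>
  group_by(Age_Group) |>
  summarise(Subscription_Rate = round(mean(y == "yes") * 100, 1))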

Handling Imbalanced and Skewed Data

The df.clean.final data is ready for the Random Forest model: Random Forest is less sensitive to class imbalance than Logistic Regression, although it may still favor the majority class. It is also not significantly affected by right-skewed numeric variables, because, unlike models that assume a linear relationship (such as Logistic or Linear Regression), it is a tree-based model and does not rely on feature distributions being normal.

Since “contact”, “default”, “loan”, and “y” are imbalanced, Logistic Regression may struggle to predict the minority class accurately, because class imbalance pulls the decision boundary toward the majority class. If we do not want to modify the dataset itself, we can instead adjust class weights when training Logistic Regression or Random Forest.

Since “balance”, “campaign”, “duration”, and “previous” are extremely right-skewed, their distributions have long tails with values concentrated at the lower end. This can hurt models that are sensitive to extreme values or assume roughly symmetric inputs (e.g., Logistic Regression). A log, square root, or Box-Cox transformation can be applied; a Box-Cox sketch follows, and the log transformation is performed below.
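
For illustration, here is a hedged sketch of the Box-Cox alternative using caret::BoxCoxTrans; the helper function is ours, and each variable is shifted first because Box-Cox requires strictly positive input.

# Box-Cox sketch: estimate lambda on a shifted copy, then transform
library(caret)

box_cox <- function(x) {
  x_pos <- x + 1 - min(x, na.rm = TRUE)  # shift so min(x_pos) == 1 > 0
  predict(BoxCoxTrans(x_pos), x_pos)
}

hist(box_cox(df.clean.final$duration), breaks = 30,
     main = "duration after Box-Cox", xlab = "Transformed value")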

Performing Log Transformation on Numeric Variables

# Apply log transformation (add 1 to avoid log(0))
df.clean.final.logistic <- df.clean.final |> 
  mutate(
    balance_log = log(ifelse(balance < 0, 0, balance) + 1),
    campaign_log = log(campaign + 1),
    duration_log = log(duration + 1),
    previous_log = log(previous + 1)
  )
# Numeric Variable Distribution

df.clean.final.logistic |>
  select(where(is.numeric)) |>
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "coral", color = "black") +
  facet_wrap(~ Variable, scales = "free", ncol = 4) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")

Handling imbalanced data: Use Class Weights Instead of Balancing

# For Logistic Regression:
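# Note: the no:yes ratio is about 7.5 (see freqRatio for y in the NZV table),
# so the weight of 5 below is a milder, round-number rebalancing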

class_weights <- ifelse(df.clean.final.logistic$y == "yes", 5, 1)

# Train Logistic Regression with weights
log_model <- glm(y ~ . - balance - campaign - duration - previous, 
                 data = df.clean.final.logistic, 
                 family = binomial, 
                 weights = class_weights)

summary(log_model)
## 
## Call:
## glm(formula = y ~ . - balance - campaign - duration - previous, 
##     family = binomial, data = df.clean.final.logistic, weights = class_weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9250  -0.7979  -0.4189  -0.1846   8.3453  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -10.848021   0.145046 -74.790  < 2e-16 ***
## age                  0.010224   0.002085   4.904 9.40e-07 ***
## jobblue-collar      -0.404125   0.042469  -9.516  < 2e-16 ***
## jobentrepreneur     -0.530998   0.072868  -7.287 3.17e-13 ***
## jobhousemaid        -0.552928   0.078223  -7.069 1.57e-12 ***
## jobmanagement       -0.300142   0.045233  -6.635 3.24e-11 ***
## jobretired           0.055782   0.060858   0.917 0.359357    
## jobself-employed    -0.498571   0.067690  -7.365 1.77e-13 ***
## jobservices         -0.331398   0.048859  -6.783 1.18e-11 ***
## jobstudent           0.388591   0.073093   5.316 1.06e-07 ***
## jobtechnician       -0.256175   0.041367  -6.193 5.91e-10 ***
## jobunemployed       -0.247267   0.068188  -3.626 0.000288 ***
## maritalmarried      -0.182502   0.035175  -5.188 2.12e-07 ***
## maritalsingle        0.072562   0.040736   1.781 0.074866 .  
## educationsecondary   0.233082   0.036581   6.372 1.87e-10 ***
## educationtertiary    0.621443   0.044378  14.003  < 2e-16 ***
## housingyes          -0.863579   0.025517 -33.843  < 2e-16 ***
## loanyes             -0.502230   0.033746 -14.883  < 2e-16 ***
## contacttelephone     0.001994   0.046529   0.043 0.965816    
## day                  0.001285   0.001443   0.890 0.373440    
## monthaug            -0.637669   0.047686 -13.372  < 2e-16 ***
## monthdec             0.808027   0.128283   6.299 3.00e-10 ***
## monthfeb            -0.084278   0.055040  -1.531 0.125719    
## monthjan            -1.238596   0.070025 -17.688  < 2e-16 ***
## monthjul            -0.852180   0.047034 -18.119  < 2e-16 ***
## monthjun            -0.589839   0.049366 -11.948  < 2e-16 ***
## monthmar             1.960958   0.091167  21.509  < 2e-16 ***
## monthmay            -1.141059   0.042662 -26.746  < 2e-16 ***
## monthnov            -0.823088   0.050937 -16.159  < 2e-16 ***
## monthoct             1.215984   0.076602  15.874  < 2e-16 ***
## monthsep             1.053303   0.085133  12.372  < 2e-16 ***
## Age_GroupMiddle     -0.613060   0.040741 -15.048  < 2e-16 ***
## Age_GroupSenior     -0.636444   0.070616  -9.013  < 2e-16 ***
## balance_log          0.069647   0.004245  16.407  < 2e-16 ***
## campaign_log        -0.490387   0.026157 -18.748  < 2e-16 ***
## duration_log         2.033952   0.017727 114.738  < 2e-16 ***
## previous_log         0.836010   0.019019  43.956  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 89245  on 45207  degrees of freedom
## Residual deviance: 53345  on 45171  degrees of freedom
## AIC: 53419
## 
## Number of Fisher Scoring iterations: 6
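
The summary alone does not show classification performance. As a hedged follow-up (an in-sample, resubstitution estimate, so optimistic; a train/test split would be more rigorous), predictions can be tabulated at a 0.5 probability threshold:

# In-sample confusion matrix for the weighted logistic model
pred_prob  <- predict(log_model, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "yes", "no")
table(Predicted = pred_class, Actual = df.clean.final.logistic$y)
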
# Handling imbalanced data: Use Class Weights Instead of Balancing

# For Random Forest classification:
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_model <- randomForest(y ~ ., data = df.clean.final, classwt = c("no" = 1, "yes" = 5))

rf_model
## 
## Call:
##  randomForest(formula = y ~ ., data = df.clean.final, classwt = c(no = 1,      yes = 5)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 9.97%
## Confusion matrix:
##        no  yes class.error
## no  38483 1436  0.03597285
## yes  3072 2217  0.58082813
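
As a possible next step (not shown in the original output), variable importance indicates which predictors drive the forest’s splits:

# Mean decrease in Gini impurity per predictor
varImpPlot(rf_model, main = "Random Forest Variable Importance")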