Dataset: A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset.

Objective: The objective is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign. The Bank Marketing repository actually contains four different datasets. I examined the two larger ones and chose bank-additional-full.csv over bank-full.csv, because it includes five additional macroeconomic indicators such as the employment variation rate, the consumer confidence index, and the 3-month Euribor rate. These indicators may be better predictors than the individual account balance, which is present in bank-full but not in the bank-additional datasets. Let’s read the data.

bank_data <- read.csv("bank-additional-full.csv", sep = ";") 

1. Exploratory Data Analysis (EDA)

Let’s check the structure of the data.

# Check the first and last 6 rows
head(bank_data)
##   age       job marital   education default housing loan   contact month
## 1  56 housemaid married    basic.4y      no      no   no telephone   may
## 2  57  services married high.school unknown      no   no telephone   may
## 3  37  services married high.school      no     yes   no telephone   may
## 4  40    admin. married    basic.6y      no      no   no telephone   may
## 5  56  services married high.school      no      no  yes telephone   may
## 6  45  services married    basic.9y unknown      no   no telephone   may
##   day_of_week duration campaign pdays previous    poutcome emp.var.rate
## 1         mon      261        1   999        0 nonexistent          1.1
## 2         mon      149        1   999        0 nonexistent          1.1
## 3         mon      226        1   999        0 nonexistent          1.1
## 4         mon      151        1   999        0 nonexistent          1.1
## 5         mon      307        1   999        0 nonexistent          1.1
## 6         mon      198        1   999        0 nonexistent          1.1
##   cons.price.idx cons.conf.idx euribor3m nr.employed  y
## 1         93.994         -36.4     4.857        5191 no
## 2         93.994         -36.4     4.857        5191 no
## 3         93.994         -36.4     4.857        5191 no
## 4         93.994         -36.4     4.857        5191 no
## 5         93.994         -36.4     4.857        5191 no
## 6         93.994         -36.4     4.857        5191 no
tail(bank_data)
##       age         job marital           education default housing loan  contact
## 41183  29  unemployed  single            basic.4y      no     yes   no cellular
## 41184  73     retired married professional.course      no     yes   no cellular
## 41185  46 blue-collar married professional.course      no      no   no cellular
## 41186  56     retired married   university.degree      no     yes   no cellular
## 41187  44  technician married professional.course      no      no   no cellular
## 41188  74     retired married professional.course      no     yes   no cellular
##       month day_of_week duration campaign pdays previous    poutcome
## 41183   nov         fri      112        1     9        1     success
## 41184   nov         fri      334        1   999        0 nonexistent
## 41185   nov         fri      383        1   999        0 nonexistent
## 41186   nov         fri      189        2   999        0 nonexistent
## 41187   nov         fri      442        1   999        0 nonexistent
## 41188   nov         fri      239        3   999        1     failure
##       emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed   y
## 41183         -1.1         94.767         -50.8     1.028      4963.6  no
## 41184         -1.1         94.767         -50.8     1.028      4963.6 yes
## 41185         -1.1         94.767         -50.8     1.028      4963.6  no
## 41186         -1.1         94.767         -50.8     1.028      4963.6  no
## 41187         -1.1         94.767         -50.8     1.028      4963.6 yes
## 41188         -1.1         94.767         -50.8     1.028      4963.6  no
library(tidyverse)
glimpse(bank_data) # View dataset structure
## Rows: 41,188
## Columns: 21
## $ age            <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job            <chr> "housemaid", "services", "services", "admin.", "service…
## $ marital        <chr> "married", "married", "married", "married", "married", …
## $ education      <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
## $ default        <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
## $ housing        <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
## $ loan           <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
## $ contact        <chr> "telephone", "telephone", "telephone", "telephone", "te…
## $ month          <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
## $ day_of_week    <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
## $ duration       <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays          <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate   <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons.conf.idx  <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m      <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr.employed    <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…

The dataset contains 41,188 rows (instances) and 21 columns (features + target variable y).

sum(duplicated(bank_data)) # Count the number of duplicate rows
## [1] 12

Since 12 duplicates out of roughly 41,000 records is a negligible share (~0.03%), removing them won’t meaningfully affect the dataset. Let’s remove them.

bank_data <- bank_data[!duplicated(bank_data), ]  # Remove duplicate rows
sum(duplicated(bank_data))  
## [1] 0
colSums(is.na(bank_data)) 
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0

There are no missing values (NA).

table(bank_data$y) # Check the distribution of the target variable (y)
## 
##    no   yes 
## 36537  4639

The dataset is highly imbalanced, with 36,537 “no” responses (88.7%) and 4,639 “yes” responses (11.3%). Since the “yes” class is underrepresented, this could affect model performance, and we may need to apply techniques like class weighting or oversampling (SMOTE) in pre-processing to balance the data.
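
For reference, the exact shares can be computed directly with prop.table(); this is just a quick check on the deduplicated data:

round(prop.table(table(bank_data$y)), 3)  # roughly 0.887 "no" vs 0.113 "yes"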

unknown_counts <- bank_data %>%
  summarise(across(where(is.character), ~ sum(. == "unknown"))) # Count "unknown" in each categorical column
print(unknown_counts)
##   job marital education default housing loan contact month day_of_week poutcome
## 1 330      80      1730    8596     990  990       0     0           0        0
##   y
## 1 0

From our analysis, we observe that several categorical variables contain “unknown” values. Our strategy for handling them is to replace “unknown” with the most frequent category (mode) in housing, loan, and default, and to keep “unknown” as a valid category in job, marital, and education, where it may carry information.

These modifications will be addressed in the Pre-processing step.

Review the structure and content of the data and answer questions such as:

Are the features (columns) of your data correlated?

Check Correlation Between Features: To analyze how different numerical variables relate to each other, let’s create a correlation matrix.

library(ggcorrplot)

numeric_data <- bank_data %>% select_if(is.numeric)  # Select only numeric columns

cor_matrix <- cor(numeric_data) # Compute correlation matrix

ggcorrplot(cor_matrix, method = "square", type = "lower", lab = TRUE) # Correlation heatmap

Corrplot Analysis: The correlation matrix reveals strong relationships between several numerical features. Notably, the 3-month Euribor rate (euribor3m) and the employment variation rate (emp.var.rate) have an extremely high positive correlation of about 0.97, suggesting redundancy. Similarly, nr.employed is highly correlated with both euribor3m (0.95) and emp.var.rate (0.91), indicating that these economic indicators move together and may not all be necessary for modeling. There is also a moderate negative correlation of about -0.59 between pdays and previous, suggesting an inverse relationship between the number of days since the last contact and the number of previous contacts. Since highly correlated variables can cause multicollinearity issues in modeling, we may consider removing or combining some of them during pre-processing.
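
As an exploratory cross-check (not yet applied to the data), caret’s findCorrelation() indicates which columns a simple pairwise-correlation filter would flag for removal at a 0.9 cutoff:

library(caret)  # for findCorrelation()
# Columns a pairwise-correlation filter (cutoff 0.9) would suggest dropping
findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)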

What is the overall distribution of each variable?

Feature Distributions:

library(ggplot2)

# Reshape data for visualization
bank_long <- bank_data %>%
  pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value")

# Histograms for numeric features
ggplot(bank_long, aes(x = Value)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  facet_wrap(~Feature, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numeric Features", x = "Value", y = "Count")

Histogram Analysis: Based on the histogram analysis, the numerical features show varying distributions. Age is right-skewed, with most clients between 25 and 60 years old. Duration and campaign are also highly skewed, with a large concentration of lower values and a few extreme cases. Pdays has a bimodal distribution, where most values are either very low or at 999, indicating a special category. Previous contacts are mostly zero, showing that many clients had no prior interactions. Economic indicators like employment variation rate (emp.var.rate) and euribor3m show multiple peaks, reflecting fluctuations in economic conditions. The distribution of consumer confidence index (cons.conf.idx) and consumer price index (cons.price.idx) appears more uniform. Overall, many variables are skewed, and some contain potential outliers that need further investigation.

Are there any outliers present?

Identify Outliers Using Boxplots:

ggplot(bank_long, aes(x = Value, y = Feature)) +  # Flip x and y
  geom_boxplot(fill = "lightblue", outlier.color = "red") +
  facet_wrap(~Feature, scales = "free", ncol = 2) +  
  theme_minimal() +
  labs(title = "Boxplots of Numeric Features", x = "Value", y = "Feature") +  # Adjust labels
  theme(axis.text.x = element_text(size = 10),  # Show x-axis labels
        axis.text.y = element_text(size = 10),
        strip.text = element_text(size = 12, face = "bold"))

Boxplot Analysis: Duration, campaign, and pdays contain extreme outliers, with several observations far above the upper whiskers. Previous and emp.var.rate also show potential outliers, but with fewer extreme points, while nr.employed and euribor3m have comparatively few extreme values. The presence of these outliers suggests that some clients had very long call durations, were contacted many times during the campaign, or had a long gap (pdays) since their last contact. The boxplot also confirms that the age distribution is right-skewed, with a few elderly customers appearing as outliers; these records may still be valid, but we need to consider whether they could affect model performance later.
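
To put rough numbers on these observations, here is a small helper (my own sketch, not part of the plots above) that counts how many observations fall outside the 1.5 × IQR whiskers for each numeric feature:

# Count observations beyond the 1.5*IQR whiskers for each numeric feature
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(numeric_data, count_outliers)  # numeric_data was created for the correlation matrix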

What are the relationships between different variables?

Relationships Between Different Variables: The correlation matrix helps reveal linear relationships between numeric variables. Based on our correlation heatmap, euribor3m, emp.var.rate, and nr.employed are highly correlated with each other, indicating a strong relationship between economic indicators. The previous campaign outcome (poutcome) and previous contacts (previous, pdays) show some relationship, suggesting that past interactions influence current marketing efforts. However, variables like age and campaign contacts do not show strong correlations with other numeric variables.
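
To make one of these relationships concrete, a quick illustrative cross-tabulation shows how the current response varies with the previous campaign outcome:

# Share of "no"/"yes" within each previous campaign outcome (row proportions)
round(prop.table(table(bank_data$poutcome, bank_data$y), margin = 1), 3)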

How are categorical variables distributed?

Analyzing Categorical Variable Distribution: Now that we’ve successfully identified numerical outliers, let’s analyze the distribution of categorical variables.

categorical_data <- bank_data %>%
  select(where(is.character)) %>%  # Select categorical variables
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") 
glimpse(categorical_data)  # Ensure Feature and Value exist
## Rows: 452,936
## Columns: 2
## $ Feature <chr> "job", "marital", "education", "default", "housing", "loan", "…
## $ Value   <chr> "housemaid", "married", "basic.4y", "no", "no", "no", "telepho…

I tried to plot all the categorical variables in a single figure, but it was very hard to read the individual categories. So, I will plot the categorical variables with five or fewer categories and those with more than five categories separately.

small_categorical_data <- categorical_data %>%
  filter(Feature %in% c("marital", "default", "housing", "loan", "contact", "poutcome", "y"))

# Plot small categorical variables
ggplot(small_categorical_data, aes(y = reorder(Value, table(Value)[Value]), fill = Feature)) + 
  geom_bar() +
  facet_wrap(~Feature, scales = "free", ncol = 2) + 
  theme_minimal() +
  labs(title = "Distribution of Categorical Variables (≤5 Categories)", y = "Category", x = "Count") +
  theme(axis.text.y = element_text(size = 9),  
        axis.text.x = element_text(size = 10),
        strip.text = element_text(size = 12, face = "bold"),
        legend.position = "none",
         panel.spacing.x = unit(2, "lines"))  

large_categorical_data <- categorical_data %>%
  filter(Feature %in% c("job", "education", "month", "day_of_week"))

# Plot large categorical variables with improved x-axis width
ggplot(large_categorical_data, aes(y = reorder(Value, table(Value)[Value]), fill = Feature)) + 
  geom_bar() +
  facet_wrap(~Feature, scales = "free", ncol = 2) +  
  theme_minimal() +
  labs(title = "Distribution of Categorical Variables (>5 Categories)", y = "Category", x = "Count") +
  theme(axis.text.y = element_text(size = 10), 
        axis.text.x = element_text(size = 10), 
        strip.text = element_text(size = 12, face = "bold"),  
        legend.position = "none",
        panel.spacing.x = unit(2, "lines"))  # Add space between facet columns

How are categorical variables distributed?

The categorical variables show clear distributions across different features. The majority of customers were contacted via cellular phones rather than landlines, and most had housing loans while fewer had personal loans. Marital status is predominantly “married,” and a university degree is the most common education level. The campaigns were largely conducted in May, and most previous campaign outcomes were either nonexistent or failures. The job distribution is diverse, with administrative, blue-collar, and technician roles being the most common. The target variable (y) is highly imbalanced, with most customers not subscribing to the term deposit.

Do any patterns or trends emerge in the data?

Several patterns and trends emerge from the data. The peak contact period for marketing campaigns appears to be in May, which suggests a seasonality effect. Additionally, past campaign success seems to be a rare occurrence, indicating that customer retention may be a challenge. The presence of “unknown” values in multiple variables such as education, job, and loan status suggests incomplete information, which could influence predictive modeling.
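
The seasonality observation can be checked directly; this short sketch computes the contact volume and subscription rate per month (illustrative only):

# Contact volume and subscription rate by month of last contact
bank_data %>%
  group_by(month) %>%
  summarise(contacts = n(), subscription_rate = mean(y == "yes")) %>%
  arrange(desc(subscription_rate))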

What is the central tendency and spread of each variable?

The central tendency and spread of variables vary across different attributes. Age shows a right-skewed distribution with most customers being middle-aged, while campaign-related variables such as the number of contacts tend to have a small range, with most people being contacted only a few times. The duration of the last contact exhibits a wide range with extreme values, indicating a need for careful handling.
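
These statements can be backed with a compact numeric summary; a sketch using the tidyverse:

# Mean, median, IQR and standard deviation of each numeric feature
bank_data %>%
  pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value") %>%
  group_by(Feature) %>%
  summarise(mean = mean(Value), median = median(Value), iqr = IQR(Value), sd = sd(Value))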

Are there any missing values and how significant are they?

Missing values in the dataset are mostly labeled as “unknown” rather than stored as actual NA values. These appear most frequently in the job, education, default, and loan-related fields. While their significance depends on the modeling approach, the substantial number of unknowns suggests that either data collection was incomplete or some customers did not disclose the information. These values should be examined further to determine whether they need imputation, removal, or a separate category in modeling.
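
To quantify how significant they are, the counts computed earlier can be expressed as a share of all records (sketch reusing unknown_counts from above):

# Percentage of "unknown" per categorical column
round(100 * unknown_counts / nrow(bank_data), 2)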


2. Algorithm Selection

Now that you have completed the EDA, which algorithms would suit the business purpose of the dataset? Answer questions such as:

Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).

Algorithm Selection for the Bank Marketing Dataset: Since our goal is to predict whether a customer will subscribe to a term deposit (y: yes/no), this is a binary classification problem. Based on the data characteristics observed in the Exploratory Data Analysis (EDA) and the machine learning algorithms presented so far in our class, two suitable algorithms for this dataset are Elastic Net Logistic Regression (L1 + L2) and Naive Bayes. These models were chosen because they balance interpretability, efficiency, and handling of categorical data, making them well suited to understanding customer behavior and predicting term deposit subscriptions.

What are the pros and cons of each algorithm you selected?

Pros of Elastic Net Logistic Regression (L1 + L2): the coefficients are interpretable, so the bank can see which customer characteristics drive subscriptions; the L1 (Lasso) penalty performs automatic feature selection, while the L2 (Ridge) penalty keeps the model stable when predictors are correlated, as our economic indicators are; and it trains efficiently on a dataset of ~41,000 records.

Cons of Elastic Net Logistic Regression (L1 + L2): it assumes a linear relationship between the predictors and the log-odds of the target; it requires one-hot encoding of categorical variables and benefits from scaled features; and it requires tuning of the alpha (L1/L2 mix) and lambda (penalty strength) hyperparameters.

Pros of Naive Bayes: it handles categorical variables naturally without encoding; it is very fast to train and predict, even on large datasets; and it is relatively robust to class imbalance, making it a useful baseline.

Cons of Naive Bayes: it assumes conditional independence between features, which our highly correlated economic indicators violate; its probability estimates can be poorly calibrated; and numeric features are typically modeled as Gaussian within each class, an assumption that skewed variables may break.

Which algorithm would you recommend, and why?

Given our dataset and business goal (helping the bank identify effective strategies for customer subscriptions), Elastic Net Logistic Regression (L1 + L2) is the best primary choice because it provides the best of both worlds—feature selection from Lasso and multicollinearity handling from Ridge. This ensures that we can identify the most important customer characteristics while maintaining model stability.

However, Elastic Net assumes a linear relationship between predictors and the target. Since real-world customer behaviors are often non-linear, we need a second model that can handle categorical data efficiently and offer an alternative classification method. Naive Bayes is the best secondary choice because it naturally handles categorical variables and is less sensitive to class imbalance, providing an efficient alternative approach to predicting customer subscriptions.

By combining Elastic Net Logistic Regression for interpretability and Naive Bayes for handling categorical data efficiently, we can achieve both actionable business insights and strong predictive power.
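
No model is trained at this stage, but for concreteness, the sketch below shows how Elastic Net logistic regression could later be fit with glmnet; the alpha value of 0.5 is an arbitrary illustrative mix of the L1 and L2 penalties, and in practice both alpha and lambda would be tuned:

# Illustrative only: Elastic Net logistic regression with glmnet
library(glmnet)
x <- model.matrix(y ~ ., data = bank_data)[, -1]   # one-hot-encoded predictor matrix
y_target <- factor(bank_data$y)
enet_cv <- cv.glmnet(x, y_target, family = "binomial", alpha = 0.5)  # lambda chosen by cross-validation
coef(enet_cv, s = "lambda.min")  # non-zero coefficients indicate selected features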

Are there labels in your data? Did that impact your choice of algorithm?

Yes, the dataset has a clearly labeled target variable (y), which is binary (yes/no). Since this is a supervised learning classification problem, we needed classification models instead of unsupervised learning approaches like clustering.

Additionally, the imbalance in y (~88.7% “no”, ~11.3% “yes”) impacted our choice. We avoided models like k-NN, which struggles with imbalanced datasets, and instead chose Naive Bayes, which is less sensitive to imbalance, and Elastic Net Logistic Regression, which allows for class weighting or resampling techniques to improve fairness in classification.

How does your choice of algorithm relate to the dataset?

Our dataset contains a mix of numerical and categorical variables, several of which are highly correlated, and the target variable is imbalanced.

These two models complement each other: Elastic Net provides interpretability and handles the correlated numeric predictors, while Naive Bayes handles categorical data efficiently. Together, they help us understand which factors influence customer decisions and improve future marketing strategies.

Would your choice of algorithm change if there were fewer than 1,000 data records, and why?

If the dataset had fewer than 1,000 records, I would likely replace Elastic Net Logistic Regression with standard Logistic Regression or k-NN, while keeping Naive Bayes because of its efficiency on small datasets.
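
As a purely hypothetical illustration of the small-data fallback, standard (unpenalized) logistic regression with glm() needs no hyperparameter tuning; the 1,000-row sub-sample and the handful of predictors below are arbitrary choices for the sketch:

# Hypothetical small-data scenario: plain logistic regression on a 1,000-row sub-sample
set.seed(123)
small_bank <- bank_data[sample(nrow(bank_data), 1000), ]
small_bank$y <- factor(small_bank$y)
glm_fit <- glm(y ~ age + job + contact + month + campaign + euribor3m,
               data = small_bank, family = binomial)
summary(glm_fit)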


3. Pre-processing

Now that you have completed the EDA and selected the algorithms, what pre-processing (if any) is required? The steps below cover data cleaning, dimensionality reduction, feature engineering, sampling, data transformation, and class imbalance.

Answer: Since we already identified key data issues in EDA, let’s now systematically clean and prepare the dataset for Elastic Net Logistic Regression and Naive Bayes.

Data Cleaning – Improve Data Quality & Address Missing Data:

What we need to do: replace “unknown” with the most frequent category (mode) in housing, loan, and default; keep “unknown” as a valid category in job, marital, and education; and make sure every variable has the correct type.

Next Step: Apply these fixes in R.

Replace “unknown” with the mode (most frequent value): For housing, loan, and default, we identify the most frequent category of each variable and replace the “unknown” values with it.

most_common_housing <- names(sort(table(bank_data$housing), decreasing = TRUE))[1]
most_common_loan <- names(sort(table(bank_data$loan), decreasing = TRUE))[1]
most_common_default <- names(sort(table(bank_data$default), decreasing = TRUE))[1]

bank_data$housing[bank_data$housing == "unknown"] <- most_common_housing
bank_data$loan[bank_data$loan == "unknown"] <- most_common_loan
bank_data$default[bank_data$default == "unknown"] <- most_common_default

Keep “unknown” as a category for job, marital, and education: Since “unknown” in these variables may carry meaning, we keep it as a valid level by converting the variables to factors.

bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$education <- factor(bank_data$education)

Ensure All Variables Are Correctly Formatted: We will verify that categorical variables are factors and numeric variables are properly formatted.

# Convert categorical variables to factors 
categorical_cols <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome", "y")
bank_data[categorical_cols] <- lapply(bank_data[categorical_cols], factor)

# Convert numeric variables to numeric 
numeric_cols <- c("age", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed")
bank_data[numeric_cols] <- lapply(bank_data[numeric_cols], as.numeric)

summary(bank_data)
##       age                 job            marital     
##  Min.   :17.00   admin.     :10419   divorced: 4611  
##  1st Qu.:32.00   blue-collar: 9253   married :24921  
##  Median :38.00   technician : 6739   single  :11564  
##  Mean   :40.02   services   : 3967   unknown :   80  
##  3rd Qu.:47.00   management : 2924                   
##  Max.   :98.00   retired    : 1718                   
##                  (Other)    : 6156                   
##                education     default     housing      loan      
##  university.degree  :12164   no :41173   no :18615   no :34928  
##  high.school        : 9512   yes:    3   yes:22561   yes: 6248  
##  basic.9y           : 6045                                      
##  professional.course: 5240                                      
##  basic.4y           : 4176                                      
##  basic.6y           : 2291                                      
##  (Other)            : 1748                                      
##       contact          month       day_of_week    duration     
##  cellular :26135   may    :13767   fri:7826    Min.   :   0.0  
##  telephone:15041   jul    : 7169   mon:8512    1st Qu.: 102.0  
##                    aug    : 6176   thu:8618    Median : 180.0  
##                    jun    : 5318   tue:8086    Mean   : 258.3  
##                    nov    : 4100   wed:8134    3rd Qu.: 319.0  
##                    apr    : 2631               Max.   :4918.0  
##                    (Other): 2015                               
##     campaign          pdays          previous            poutcome    
##  Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
##  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35551  
##  Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
##  Mean   : 2.568   Mean   :962.5   Mean   :0.173                      
##  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000                      
##  Max.   :56.000   Max.   :999.0   Max.   :7.000                      
##                                                                      
##   emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
##  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
##  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
##  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
##  Mean   : 0.08192   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
##  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
##  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
##                                                                    
##   nr.employed     y        
##  Min.   :4964   no :36537  
##  1st Qu.:5099   yes: 4639  
##  Median :5191              
##  Mean   :5167              
##  3rd Qu.:5228              
##  Max.   :5228              
## 

The summary above shows that our data cleaning steps were applied correctly: “unknown” values were handled as planned, categorical variables are factors (job, marital, education, etc.), numeric variables are correctly formatted (age, campaign, pdays, etc.), and there are no missing values (NA).

Dimensionality Reduction - remove correlated/redundant data

What we need to do: remove features that leak information about the outcome and check whether highly correlated numeric features should be dropped.

Next Step: Decide whether to remove correlated features manually or let Elastic Net handle it.

Looking at the boxplot of numeric features, we can see several extreme outliers, especially in duration, campaign, pdays, and previous. We can leave age, emp.var.rate, cons.price.idx, euribor3m, and nr.employed as they are: aside from age, these are economic indicators, so their extreme values are not really “outliers” but actual recorded conditions, and no manual intervention is needed. We first drop duration, which is a data-leakage variable (its value is only known once the call has already happened), and then check the remaining numeric features for high correlation.

library(dplyr)
bank_data <- bank_data %>%
  select(-duration)  # Remove duration since it's data leakage
library(corrr)  
numeric_data <- bank_data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_data)

# Find highly correlated features (above 0.9)
high_corr <- as.data.frame(as.table(cor_matrix)) %>%
  filter(Var1 != Var2, abs(Freq) > 0.9) %>%
  arrange(desc(abs(Freq)))

print(high_corr)
##           Var1         Var2      Freq
## 1    euribor3m emp.var.rate 0.9722438
## 2 emp.var.rate    euribor3m 0.9722438
## 3  nr.employed    euribor3m 0.9451459
## 4    euribor3m  nr.employed 0.9451459
## 5  nr.employed emp.var.rate 0.9069495
## 6 emp.var.rate  nr.employed 0.9069495

From the output above we can see that three features are highly correlated. Elastic Net can handle this, but Naive Bayes does not deal with multicollinearity automatically: since it assumes independence between features, highly correlated variables could hurt its performance. We therefore remove nr.employed, since it is closely tied to both emp.var.rate and euribor3m.

bank_data_nb <- bank_data %>% select(-nr.employed)

This version of the dataset (bank_data_nb) is only for Naive Bayes.

Feature Engineering - use of business knowledge to create new features

Feature engineering involves creating new variables that could improve model performance.

What we need to do: reduce the impact of the extreme outliers identified in EDA, transform the highly skewed features, and turn pdays into a more meaningful feature.

If we look at our boxplot of the numeric features, we can see that campaign, pdays, and previous have severe outliers, causing high skewness. We will use Winsorization, which caps extreme values at the 1st and 99th percentiles; this reduces their impact without removing records and should prevent model instability while preserving most of the data distribution.

# Define Winsorization function
winsorize <- function(x, lower_quantile = 0.01, upper_quantile = 0.99) {
  lower_bound <- quantile(x, lower_quantile, na.rm = TRUE)
  upper_bound <- quantile(x, upper_quantile, na.rm = TRUE)
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}

# Apply Winsorization 
bank_data <- bank_data %>%
  mutate(
    campaign = winsorize(campaign),
    pdays = winsorize(pdays),
    previous = winsorize(previous)
  )

# Check updated boxplots to confirm capping
library(ggplot2)
bank_long <- bank_data %>% pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value")

ggplot(bank_long, aes(y = Value, x = Feature)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red") +
  facet_wrap(~Feature, scales = "free", ncol = 2) +  
  coord_flip() +
  theme_minimal() +
  labs(title = "Boxplots of Numeric Features (After Winsorization, Duration Removed)", x = "Feature", y = "Value") +
  theme(axis.text.x = element_blank(),  
        axis.ticks.x = element_blank(),
        axis.text.y = element_text(size = 10),
        strip.text = element_text(size = 12, face = "bold"))

The boxplots clearly show the effects of Winsorization and removing the duration feature. The extreme outliers have been reduced, making the data more suitable for modeling.

Check Skewness of Numeric Features

library(moments)  # skewness calculation
skewed_features <- bank_data %>%
  select(campaign, pdays, previous) # Select the numeric features 

skewness_values <- apply(skewed_features, 2, skewness) # Compute skewness for each feature
skewness_values
##  campaign     pdays  previous 
##  2.703432 -4.921200  2.725518

Campaign and previous are highly right-skewed, while pdays is highly left-skewed. For the right-skewed features (campaign and previous) we apply a log(x + 1) transformation to avoid log(0); the left skew in pdays comes from the special value 999, which we handle separately below.

bank_data <- bank_data %>%
  mutate(
    campaign_log = log1p(campaign),   # log(x + 1) 
    previous_log = log1p(previous))

# Remove original columns 
bank_data <- bank_data %>%
  select(-campaign, -previous)

# Check skewness after transformation
transformed_skewness <- bank_data %>%
  select(campaign_log, previous_log) %>%
  summarise_all(skewness)

print(transformed_skewness)
##   campaign_log previous_log
## 1     1.162546     2.358403

The transformation reduces the skewness of campaign considerably; previous_log remains skewed because most clients had no prior contacts. Let’s check pdays:

table(bank_data$pdays)
## 
##     3     4     5     6     7     8     9    10    11    12    13    14    15 
##   541   118    46   412    60    18    64    52    28    58    36    20    24 
##    16    17    18    19    20    21    22    25    26    27   999 
##    11     8     7     3     1     2     3     1     1     1 39661

So, 999 is not a regular numerical value; it is a special code indicating that the customer was never contacted before. Treating it as a number doesn’t make sense, so we convert pdays into a categorical variable instead.

library(dplyr)

# Convert pdays into a categorical feature
bank_data <- bank_data %>%
  mutate(
    pdays_cat = case_when(
      pdays == 999 ~ "Never Contacted",
      pdays <= 7   ~ "Contacted Recently (0-7 days)",
      pdays <= 30  ~ "Contacted Last Month (8-30 days)",
      TRUE         ~ "Contacted Earlier (30+ days)"
    )
  ) 

# Convert to factor for modeling
bank_data$pdays_cat <- as.factor(bank_data$pdays_cat)

# Drop original `pdays` column
bank_data <- bank_data %>%
  select(-pdays)
table(bank_data$pdays_cat)
## 
## Contacted Last Month (8-30 days)    Contacted Recently (0-7 days) 
##                              338                             1177 
##                  Never Contacted 
##                            39661

Reference: https://www.datacamp.com/tutorial/winsorized-mean

Sampling Data – Resizing the Dataset for Training

What we need to do: split the data into training and test sets while preserving the class proportions of y.

Next Step: We first create the training and test sets, and then apply encoding separately for Logistic Regression. Since our target variable y is imbalanced, we use stratified sampling to ensure both classes are properly represented in the train/test split.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(123)  # Ensure reproducibility

# Stratified sampling (80% train, 20% test)
trainIndex <- createDataPartition(bank_data$y, p = 0.8, list = FALSE)

# Split dataset into training and test sets
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]

# Check class distribution
table(train_data$y) / nrow(train_data) 
## 
##        no       yes 
## 0.8873171 0.1126829
table(test_data$y) / nrow(test_data)  
## 
##       no      yes 
## 0.887418 0.112582

The class distribution in the training and test sets is consistent with the original dataset (~88.7% “no”, ~11.3% “yes”).

Data Transformation – Encoding & Scaling Features

What we need to do: one-hot encode the categorical variables for Elastic Net Logistic Regression, while keeping them as factors for Naive Bayes.

Next Step: Encoding categorical variables (for Logistic Regression): Since Logistic Regression requires numerical inputs, we need to one-hot encode the categorical variables. For Naive Bayes, we keep the categorical variables as factors.

library(caret)

# One-hot encode categorical variables
dummies <- dummyVars(~ ., data = train_data, fullRank = TRUE)  # Create dummy variables
train_data_encoded <- predict(dummies, newdata = train_data) %>% as.data.frame()
test_data_encoded <- predict(dummies, newdata = test_data) %>% as.data.frame()

# Check the structure of the encoded dataset
str(train_data_encoded)
## 'data.frame':    32942 obs. of  51 variables:
##  $ age                                    : num  56 57 37 40 56 45 59 41 24 25 ...
##  $ job.blue-collar                        : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ job.entrepreneur                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.housemaid                          : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ job.management                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.retired                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.self-employed                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.services                           : num  0 1 1 0 1 1 0 0 0 1 ...
##  $ job.student                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.technician                         : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ job.unemployed                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.unknown                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ marital.married                        : num  1 1 1 1 1 1 1 1 0 0 ...
##  $ marital.single                         : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ marital.unknown                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ education.basic.6y                     : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ education.basic.9y                     : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ education.high.school                  : num  0 1 1 0 1 0 0 0 0 1 ...
##  $ education.illiterate                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ education.professional.course          : num  0 0 0 0 0 0 1 0 1 0 ...
##  $ education.university.degree            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ education.unknown                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ default.yes                            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ housing.yes                            : num  0 0 1 0 0 0 0 0 1 1 ...
##  $ loan.yes                               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ contact.telephone                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ month.aug                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.dec                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.jul                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.jun                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.mar                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.may                              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ month.nov                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.oct                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.sep                              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ day_of_week.mon                        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day_of_week.thu                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ day_of_week.tue                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ day_of_week.wed                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome.nonexistent                   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ poutcome.success                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ emp.var.rate                           : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx                         : num  94 94 94 94 94 ...
##  $ cons.conf.idx                          : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m                              : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed                            : num  5191 5191 5191 5191 5191 ...
##  $ y.yes                                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ campaign_log                           : num  0.693 0.693 0.693 0.693 0.693 ...
##  $ previous_log                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pdays_cat.Contacted Recently (0-7 days): num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pdays_cat.Never Contacted              : num  1 1 1 1 1 1 1 1 1 1 ...

Before encoding, categorical variables like job, education, month, and day_of_week were stored as single columns with multiple categories. After encoding, each category within a categorical variable became a separate binary column (0 or 1). With fullRank = TRUE, one reference level per factor is dropped, which avoids perfectly collinear dummy columns.

Imbalanced Data – Addressing Class Imbalance (~88.7% No, ~11.3% Yes)

What we need to do: rebalance the training data so that the minority “yes” class is properly represented; the test set is left untouched.

Next Step: Apply SMOTE for Logistic Regression.

library(themis)
## Loading required package: recipes
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stringr':
## 
##     fixed
## The following object is masked from 'package:stats':
## 
##     step
library(recipes)

set.seed(123) 

train_data_encoded$y.yes <- as.factor(train_data_encoded$y.yes)# Convert target variable (y.yes) to a factor
smote_recipe <- recipe(y.yes ~ ., data = train_data_encoded) %>% # Create a recipe for SMOTE
  step_smote(y.yes, over_ratio = 1)  # Creates a balanced dataset

smote_prep <- prep(smote_recipe, training = train_data_encoded) # Prep and apply SMOTE
train_data_smote <- bake(smote_prep, new_data = NULL)

# Check class distribution after SMOTE
table(train_data_smote$y.yes) / nrow(train_data_smote)
## 
##   0   1 
## 0.5 0.5

The dataset is now balanced (50% “no”, 50% “yes”) after applying SMOTE.
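
Pre-processing ends here, but as a hedged sketch of the next step, the two prepared datasets would feed the two models roughly as follows (glmnet for Elastic Net; e1071’s naiveBayes is assumed for Naive Bayes, and bank_data_nb would first receive the same cleaning and train/test split; alpha = 0.5 is illustrative only):

# Illustrative next step, not part of pre-processing
library(glmnet)
x_train <- as.matrix(train_data_smote %>% select(-y.yes))   # SMOTE-balanced, one-hot-encoded predictors
enet_fit <- cv.glmnet(x_train, train_data_smote$y.yes, family = "binomial", alpha = 0.5)

library(e1071)  # assumed Naive Bayes implementation
nb_fit <- naiveBayes(y ~ ., data = bank_data_nb)  # factor-based dataset kept for Naive Bayes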