Loading Libraries

library(readr)
library(stringr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(corrplot)
## corrplot 0.92 loaded
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.3.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.3.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggthemes)
library(purrr)
library(tidyr)

Importing the Data

In the following code chunks, I start by importing the bank marketing dataset and cleaning up its column names by removing unwanted characters and spaces. Then, I convert the appropriate columns to factors for categorical data and ensure that numeric columns are properly formatted, setting the stage for subsequent exploratory analysis.

Read the CSV File

I begin by importing the bank marketing dataset from “bank-additional-full.csv”, specifying the use of semicolons as delimiters and letting R automatically assign data types to each column. Next, I use glimpse to inspect the structure of the dataset—reviewing column names, data types, and a preview of the values—to ensure everything has been read in correctly.

bank_data <- read_delim("bank-additional-full.csv", delim = ";", col_types = cols())
glimpse(bank_data)
## Rows: 41,188
## Columns: 21
## $ age            <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job            <chr> "housemaid", "services", "services", "admin.", "service…
## $ marital        <chr> "married", "married", "married", "married", "married", …
## $ education      <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
## $ default        <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
## $ housing        <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
## $ loan           <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
## $ contact        <chr> "telephone", "telephone", "telephone", "telephone", "te…
## $ month          <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
## $ day_of_week    <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
## $ duration       <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate   <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons.conf.idx  <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m      <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr.employed    <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…

Cleaning Column Names

Next, I created a custom function called clean_column_names() to sanitize the dataset’s column names.

  • Within clean_column_names(), I replaced periods (.) with underscores (_) to standardize the delimiters.
  • I also replaced hyphens (-) with underscores to further enhance consistency.
  • I removed spaces from the column names to prevent issues during later data manipulation.
  • Finally, I applied clean_column_names() to the dataset, ensuring that all column names are clean and uniform for further analysis.

clean_column_names <- function(dataset) {
  colnames(dataset) <- str_replace_all(colnames(dataset), "\\.", "_")  # Replace '.' with '_'
  colnames(dataset) <- str_replace_all(colnames(dataset), "-", "_")    # Replace '-' with '_'
  colnames(dataset) <- str_replace_all(colnames(dataset), " ", "")     # Remove spaces
  return(dataset)
}

bank_data <- clean_column_names(bank_data)

Convert Categorical Variables into Factors

  • A vector categorical_cols is created, listing the names of columns that hold categorical data.
  • The mutate(across(...)) function from dplyr is used to convert each of these columns to the factor data type.
categorical_cols <- c("job", "marital", "education", "default", "housing", 
                      "loan", "contact", "month", "day_of_week", "poutcome", "y")

bank_data <- bank_data %>%
  mutate(across(all_of(categorical_cols), as.factor))

Convert Numerical Columns into Numeric Format

  • Similarly, I created a vector numeric_cols to identify columns that should be numeric.
  • These columns are converted to numeric type using mutate(across(..., as.numeric)), ensuring proper arithmetic operations and statistical analysis later.
numeric_cols <- c("age", "duration", "campaign", "pdays", "previous", "emp_var_rate", 
                  "cons_price_idx", "cons_conf_idx", "euribor3m", "nr_employed")

bank_data <- bank_data %>%
  mutate(across(all_of(numeric_cols), as.numeric))

Inspect the Cleaned Data

Now let us get a high-level view of our data using the glimpse() function. This step helps verify that the cleaning steps have been executed as intended.

We can see that our dataset contains 41,188 observations and 21 variables. It includes both categorical features (e.g., job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome, and y) and numerical features (e.g., age, duration, campaign, pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, and nr_employed). My next step is to use the summary() and table() functions to get a quick statistical overview of each variable. This helps identify potential data issues, outliers, and the general distribution of the dataset.

glimpse(bank_data)
## Rows: 41,188
## Columns: 21
## $ age            <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job            <fct> housemaid, services, services, admin., services, servic…
## $ marital        <fct> married, married, married, married, married, married, m…
## $ education      <fct> basic.4y, high.school, high.school, basic.6y, high.scho…
## $ default        <fct> no, unknown, no, no, no, unknown, no, unknown, no, no, …
## $ housing        <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, no,…
## $ loan           <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes, n…
## $ contact        <fct> telephone, telephone, telephone, telephone, telephone, …
## $ month          <fct> may, may, may, may, may, may, may, may, may, may, may, …
## $ day_of_week    <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, …
## $ duration       <dbl> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome       <fct> nonexistent, nonexistent, nonexistent, nonexistent, non…
## $ emp_var_rate   <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons_price_idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons_conf_idx  <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m      <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr_employed    <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y              <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,…

Exploratory Data Analysis

Overall Distribution of Each Variable

library(purrr)
bank_data %>%
  summary()
##       age                 job            marital     
##  Min.   :17.00   admin.     :10422   divorced: 4612  
##  1st Qu.:32.00   blue-collar: 9254   married :24928  
##  Median :38.00   technician : 6743   single  :11568  
##  Mean   :40.02   services   : 3969   unknown :   80  
##  3rd Qu.:47.00   management : 2924                   
##  Max.   :98.00   retired    : 1720                   
##                  (Other)    : 6156                   
##                education        default         housing           loan      
##  university.degree  :12168   no     :32588   no     :18622   no     :33950  
##  high.school        : 9515   unknown: 8597   unknown:  990   unknown:  990  
##  basic.9y           : 6045   yes    :    3   yes    :21576   yes    : 6248  
##  professional.course: 5243                                                  
##  basic.4y           : 4176                                                  
##  basic.6y           : 2292                                                  
##  (Other)            : 1749                                                  
##       contact          month       day_of_week    duration     
##  cellular :26144   may    :13769   fri:7827    Min.   :   0.0  
##  telephone:15044   jul    : 7174   mon:8514    1st Qu.: 102.0  
##                    aug    : 6178   thu:8623    Median : 180.0  
##                    jun    : 5318   tue:8090    Mean   : 258.3  
##                    nov    : 4101   wed:8134    3rd Qu.: 319.0  
##                    apr    : 2632               Max.   :4918.0  
##                    (Other): 2016                               
##     campaign          pdays          previous            poutcome    
##  Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
##  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35563  
##  Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
##  Mean   : 2.568   Mean   :962.5   Mean   :0.173                      
##  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000                      
##  Max.   :56.000   Max.   :999.0   Max.   :7.000                      
##                                                                      
##   emp_var_rate      cons_price_idx  cons_conf_idx     euribor3m    
##  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
##  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
##  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
##  Mean   : 0.08189   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
##  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
##  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
##                                                                    
##   nr_employed     y        
##  Min.   :4964   no :36548  
##  1st Qu.:5099   yes: 4640  
##  Median :5191              
##  Mean   :5167              
##  3rd Qu.:5228              
##  Max.   :5228              
## 

Numerical Variables

  • Age: Ranges from 17 to 98, median 38.
  • Duration: Highly skewed; min 0, max 4918, median 180, mean 258.3.
  • Campaign Contacts: Min 1, max 56, median 2.
  • Pdays (Days since last campaign contact): Highly imbalanced, mostly 999 (never contacted before).
  • Previous Campaign Contacts: Mostly 0, max 7.
  • Employment Variation Rate: Between -3.4 and 1.4, median 1.1.
  • Consumer Price Index: Between 92.20 and 94.77, median 93.75.
  • Consumer Confidence Index: Between -50.8 and -26.9, median -41.8.
  • Euribor 3-month Rate: Between 0.63 and 5.05, median 4.86.
  • Number of Employees: Ranges from about 4,964 to 5,228; median 5,191, mean 5,167.

Categorical Variables

  • Job: Most common jobs are admin, blue-collar, and technician.
  • Marital Status: Majority are married (24,928), followed by single (11,568).
  • Education: University degree (12,168) and high school (9,515) are the most common.
  • Default: Mostly “no” (32,588), very few “yes” (3).
  • Housing Loan: Fairly split, 18,622 “no” vs. 21,576 “yes”.
  • Personal Loan: Mostly “no” (33,950), 6,248 “yes”.
  • Contact Method: Mostly cellular (26,144) over telephone (15,044).
  • Month: Most contacts occurred in May (13,769).
  • Day of Week: Thursday (8,623) and Monday (8,514) had the highest contact volumes, while Friday (7,827) had the lowest.
  • Previous Outcome: Majority had no prior contact (35,563 “nonexistent”), few “success” (1,373).
  • Subscription (Target Variable - y):
    • No: 36,548 (88.7%)
    • Yes: 4,640 (11.3%) (Highly imbalanced towards “no”)

Key Observations

  • The dataset is highly imbalanced, with only about 11.3% of clients subscribing.
  • Pdays and previous contacts show that most clients were not contacted before.
  • May and cellular contact method were the most frequently used.
  • Job, education, and marital status have distinct distributions, potentially important for modeling.

Detailed Distribution of Categorical Variables

# Load necessary libraries
library(ggplot2)
library(forcats)

# Function to create and display sorted bar plots for categorical features
plot_categorical <- function(data, var) {
  data <- data %>%
    mutate(!!sym(var) := fct_reorder(!!sym(var), !!sym(var), .fun = length, .desc = TRUE))
  
  print(
    ggplot(data, aes(x = !!sym(var), fill = !!sym(var))) +
      geom_bar() +
      theme_minimal() +
      ggtitle(paste("Distribution of", var)) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  )
}

Jobs

# Create the proportion table
bank_data %>%
  count(job, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 12 × 3
##    job               n proportion
##    <fct>         <int>      <dbl>
##  1 admin.        10422    0.253  
##  2 blue-collar    9254    0.225  
##  3 technician     6743    0.164  
##  4 services       3969    0.0964 
##  5 management     2924    0.0710 
##  6 retired        1720    0.0418 
##  7 entrepreneur   1456    0.0354 
##  8 self-employed  1421    0.0345 
##  9 housemaid      1060    0.0257 
## 10 unemployed     1014    0.0246 
## 11 student         875    0.0212 
## 12 unknown         330    0.00801
# Plot the distribution
plot_categorical(bank_data, "job")

How is the Job Variable Distributed?

  • The dataset includes 11 job categories plus an “unknown” category.
  • The most common occupations are “admin.” (25.3%), “blue-collar” (22.5%), and “technician” (16.4%).
  • The least frequent categories are “student” (2.1%) and “unknown” (0.8%).
  • The proportion table and bar chart confirm that the job variable is highly imbalanced, with a few categories dominating the dataset.

Key Considerations

  • Most customers work in administrative and blue-collar jobs, which could impact the marketing strategy.
  • The imbalance in job categories should be considered when analyzing how occupation affects subscription rates.
  • The “unknown” category is minimal but may require handling in preprocessing (e.g., imputation or exclusion).
  • Further analysis can explore whether job type influences the likelihood of subscribing (y).

The distribution of jobs may reflect real-world trends in the labor force, where administrative and blue-collar roles are more prevalent. Further analysis could explore whether job categories influence the target variable (y) or other key factors in the dataset.
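
As a quick sketch of that follow-up (assuming bank_data as prepared above; the column name subscription_rate is only illustrative), the subscription rate per job category could be tabulated like this:

# Illustrative sketch: subscription rate by job category
bank_data %>%
  group_by(job) %>%
  summarise(n = n(), subscription_rate = mean(y == "yes")) %>%
  arrange(desc(subscription_rate))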

Marital Status

# Create the proportion table
bank_data %>%
  count(marital, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 4 × 3
##   marital      n proportion
##   <fct>    <int>      <dbl>
## 1 married  24928    0.605  
## 2 single   11568    0.281  
## 3 divorced  4612    0.112  
## 4 unknown     80    0.00194
# Plot the distribution
plot_categorical(bank_data, "marital")

How is Marital Status Distributed?

  • The dataset contains four marital status categories: Married, Single, Divorced, and Unknown.
  • The majority of customers are married (60.5%), followed by single (28.1%) and divorced (11.2%).
  • The “unknown” category is minimal (0.2%) and may need special handling in preprocessing.

Key Considerations

  • Most bank customers are married, which could influence marketing strategies targeting families or dual-income households.
  • The imbalance in marital status should be considered when analyzing how relationship status affects subscription likelihood.
  • The “unknown” category is negligible but should be addressed—either through imputation or exclusion.

Further analysis can explore whether marital status significantly impacts the target variable (y) in predicting term deposit subscriptions.

Education Level

# Create the proportion table
bank_data %>%
  count(education, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 8 × 3
##   education               n proportion
##   <fct>               <int>      <dbl>
## 1 university.degree   12168   0.295   
## 2 high.school          9515   0.231   
## 3 basic.9y             6045   0.147   
## 4 professional.course  5243   0.127   
## 5 basic.4y             4176   0.101   
## 6 basic.6y             2292   0.0556  
## 7 unknown              1731   0.0420  
## 8 illiterate             18   0.000437
# Plot the distribution
plot_categorical(bank_data, "education")

How is the Education Variable Distributed?

  • The dataset includes eight education levels, ranging from illiterate to university degree.
  • The most common education levels are university degree (29.5%), high school (23.1%), and basic 9 years (14.7%).
  • The least frequent categories are illiterate (0.04%) and unknown (4.2%).
  • The proportion table and bar chart confirm an uneven distribution, with higher education levels more represented.

Key Considerations:

  • Higher education levels dominate the dataset, which could influence marketing strategies tailored to professionals.
  • The imbalance in education categories should be considered when analyzing how education level affects subscription rates.
  • The “unknown” category (4.2%) is not negligible and may require imputation or exclusion during preprocessing.

Further analysis can explore whether education level significantly impacts the likelihood of subscribing (y); for example, whether highly educated individuals are more likely to subscribe could be an interesting area for further analysis.

Credit Default

# Create the proportion table
bank_data %>%
  count(default, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
##   default     n proportion
##   <fct>   <int>      <dbl>
## 1 no      32588  0.791    
## 2 unknown  8597  0.209    
## 3 yes         3  0.0000728
# Plot the distribution
plot_categorical(bank_data, "default")

What is the Distribution of the Default Variable?

  • The dataset includes three categories for default status: “no,” “yes,” and “unknown.”
  • An overwhelming 79.1% of customers have no credit default history, while only 0.007% have a known default.
  • A significant 20.9% of records fall under “unknown,” making it unclear whether those customers have credit defaults.
  • The distribution is highly imbalanced, with very few actual defaults reported.

Implications for Modeling:

  • Severe class imbalance: With only three cases of “yes” defaults, this feature is unlikely to contribute significantly to predictions unless handled properly.
  • The large “unknown” category raises concerns—should these be treated as missing values, or does their presence indicate a meaningful pattern?
  • Given the dominance of “no” defaults, this feature might not be a strong predictor unless analyzed in combination with other financial indicators.

Further exploration could involve testing the impact of excluding this feature to determine if it influences model performance. Analysis could investigate whether individuals with “unknown” default status behave more like those with “no” defaults or those who have actually defaulted.
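
As a rough sketch of that comparison (using the bank_data object prepared above), row-wise proportions show how subscription outcomes differ across the three default categories:

# Illustrative sketch: subscription outcome (y) within each default category
round(prop.table(table(bank_data$default, bank_data$y), margin = 1), 3)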

Housing Loan

# Create the proportion table
bank_data %>%
  count(housing, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
##   housing     n proportion
##   <fct>   <int>      <dbl>
## 1 yes     21576     0.524 
## 2 no      18622     0.452 
## 3 unknown   990     0.0240
# Plot the distribution
plot_categorical(bank_data, "housing")

What is the Distribution of the Housing Variable?

  • The dataset records whether a customer has a housing loan, with three possible values: “yes,” “no,” and “unknown.”
  • Just over 52.4% of customers have a housing loan, while 45.2% do not.
  • The “unknown” category is relatively small (2.4%), but its impact should be considered in preprocessing.
  • The distribution is fairly balanced, though slightly skewed toward customers with housing loans.

Implications for Modeling:

  • Since the difference between “yes” and “no” is not extreme, this variable might provide some predictive value in determining subscription likelihood.
  • The presence of “unknown” values may introduce uncertainty—options include imputation, separate categorization, or exclusion during preprocessing.
  • Housing status might be correlated with other financial variables, such as personal loan status, employment, or income-related factors, making interaction analysis useful.

Further exploration is needed to determine whether customers with or without housing loans are more likely to subscribe (y).

Personal Loan

# Create the proportion table
bank_data %>%
  count(loan, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 3 × 3
##   loan        n proportion
##   <fct>   <int>      <dbl>
## 1 no      33950     0.824 
## 2 yes      6248     0.152 
## 3 unknown   990     0.0240
# Plot the distribution
plot_categorical(bank_data, "loan")

How is the Loan Variable Distributed?

  • The dataset captures whether a customer has a personal loan, categorized as “yes,” “no,” or “unknown.”
  • A significant 82.4% of customers do not have a personal loan, while 15.2% do.
  • The “unknown” category accounts for 2.4% of cases, which may require handling in preprocessing.
  • The distribution is highly imbalanced, with far more customers not holding personal loans.

Implications for Predictive Modeling:

  • Since most customers do not have a personal loan, this variable might not be a strong predictor unless it interacts with other financial features.
  • The “unknown” category should be assessed—potential approaches include treating it as missing data, encoding it as a separate category, or imputing values based on similar records.
  • Customers with loans may behave differently regarding term deposit subscriptions, warranting further analysis to check if this feature correlates with the target variable (y).
  • Given the imbalance in the “yes” class, models may need resampling techniques or weighting adjustments to ensure the loan variable contributes meaningfully.

Since a housing loan is more common than a personal loan, it could be insightful to examine how both loan types impact client financial behavior.

Contact Method

# Create the proportion table
bank_data %>%
  count(contact, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 2 × 3
##   contact       n proportion
##   <fct>     <int>      <dbl>
## 1 cellular  26144      0.635
## 2 telephone 15044      0.365
# Plot the distribution
plot_categorical(bank_data, "contact")

How is the Contact Variable Distributed?

  • The dataset records the communication method used in the marketing campaign, categorized as “cellular” or “telephone.”
  • A majority (63.5%) of contacts were made via cellular phones, while 36.5% were made via landline telephones.
  • The bar chart confirms that mobile phone outreach was nearly twice as common as landline calls.

Implications for Predictive Modeling:

  • The preference for cellular communication suggests that mobile outreach may be more effective or that more customers use cell phones.
  • If subscription rates differ between the two groups, contact type could be a strong predictor for campaign success.
  • Further analysis should explore whether customers reached via cellular are more likely to subscribe (y) compared to those reached via landlines.
  • This variable might also interact with time-based features (month, day of the week), affecting conversion likelihood.

Month of Contact

# Create the proportion table
bank_data %>%
  count(month, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 10 × 3
##    month     n proportion
##    <fct> <int>      <dbl>
##  1 may   13769    0.334  
##  2 jul    7174    0.174  
##  3 aug    6178    0.150  
##  4 jun    5318    0.129  
##  5 nov    4101    0.0996 
##  6 apr    2632    0.0639 
##  7 oct     718    0.0174 
##  8 sep     570    0.0138 
##  9 mar     546    0.0133 
## 10 dec     182    0.00442
# Plot the distribution
plot_categorical(bank_data, "month")

How is the Month Variable Distributed?

  • The dataset records the month in which customers were contacted during the marketing campaign.
  • May accounts for the highest proportion (33.4%) of contacts, followed by July (17.4%), August (15.0%), and June (12.9%).
  • The lowest contact volumes occur in December (0.4%), March (1.3%), and September (1.4%), indicating limited marketing efforts in those months.
  • The bar chart confirms that most contacts happened between May and August, with a sharp decline afterward.

Implications for Predictive Modeling:

  • Seasonality may influence campaign effectiveness—a high concentration of contacts in specific months could impact model generalization.
  • If subscription rates (y) vary significantly by month, this feature could be a strong predictor. Further analysis is needed to assess whether customer responsiveness changes across seasons.
  • Why is May so dominant? Investigating external factors (e.g., economic trends, promotions) during this period could provide business insights.
  • Given the uneven distribution, models may need weighting adjustments or feature engineering (e.g., grouping months into “high,” “medium,” and “low” activity periods).
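
A minimal sketch of the grouping idea in the last bullet; the month-to-group mapping and the column name month_activity are illustrative choices based on the contact volumes observed above:

# Illustrative sketch: group months into high/medium/low contact-volume periods
bank_data %>%
  mutate(month_activity = case_when(
    month %in% c("may", "jun", "jul", "aug") ~ "high",
    month %in% c("apr", "nov")               ~ "medium",
    TRUE                                     ~ "low"
  )) %>%
  count(month_activity)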

Day of the Week Contacted

# Create the proportion table
bank_data %>%
  count(day_of_week, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 5 × 3
##   day_of_week     n proportion
##   <fct>       <int>      <dbl>
## 1 thu          8623      0.209
## 2 mon          8514      0.207
## 3 wed          8134      0.197
## 4 tue          8090      0.196
## 5 fri          7827      0.190
# Plot the distribution
plot_categorical(bank_data, "day_of_week")

How is the Day of the Week Variable Distributed?

  • The dataset captures the day of the week when customers were contacted.
  • The distribution is fairly even across weekdays, with Thursday (20.9%) and Monday (20.7%) having slightly higher contact volumes.
  • Friday (19.0%) has the lowest proportion, but the difference is minimal.
  • The bar chart confirms that no single day dominates the contact strategy, suggesting a balanced approach.

Implications for Predictive Modeling:

  • Since contacts are evenly distributed across weekdays, the day itself may not strongly impact model predictions unless subscription rates vary by day.
  • Further analysis should examine conversion rates per day to determine if certain days yield higher success rates (y).
  • If specific weekdays show better performance, future marketing campaigns could prioritize those days.
  • Potential feature engineering could involve grouping days into “high” and “low” response periods if patterns emerge.

Previous (Number of Contacts in Previous Campaigns)

bank_data <- bank_data %>%
  mutate(previous = as.factor(previous))

# Create the proportion table
bank_data %>%
  count(previous, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 8 × 3
##   previous     n proportion
##   <fct>    <int>      <dbl>
## 1 0        35563  0.863    
## 2 1         4561  0.111    
## 3 2          754  0.0183   
## 4 3          216  0.00524  
## 5 4           70  0.00170  
## 6 5           18  0.000437 
## 7 6            5  0.000121 
## 8 7            1  0.0000243
# Plot the distribution
plot_categorical(bank_data, "previous")

How is the previous Variable Distributed?

  • The “previous” variable represents the number of times a customer was contacted in past marketing campaigns.
  • 86.3% of customers had no prior contact (0), meaning they are being approached for the first time.
  • Only 11.1% had exactly one previous contact, while higher counts (2–7) occur with much lower frequency.
  • The bar chart confirms that most customers were never contacted before, with a sharp drop-off beyond one previous contact.

Implications for Predictive Modeling:

  • Since the majority of customers have never been contacted before, this variable might be highly predictive of subscription likelihood (y).
  • The small proportion of customers with multiple contacts may indicate either persistent follow-ups or targeting of specific customer groups—further analysis is needed to see how their subscription rates compare.
  • If higher previous contact counts correlate with more sign-ups, models might benefit from grouping this feature into categories (e.g., “new contact” vs. “previously engaged”); a quick sketch of this grouping appears below.
  • The presence of few customers with 5+ contacts suggests diminishing returns on repeated outreach, which could guide future marketing strategies.

This variable might also help assess the effectiveness of re-engagement efforts in the campaign.
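
A minimal sketch of the “new contact” vs. “previously engaged” grouping mentioned above (note that previous was converted to a factor earlier, so it is compared as a label; the column name prev_engaged is illustrative):

# Illustrative sketch: collapse 'previous' into two engagement groups and compare subscription rates
bank_data %>%
  mutate(prev_engaged = ifelse(previous == "0", "new contact", "previously engaged")) %>%
  group_by(prev_engaged) %>%
  summarise(n = n(), subscription_rate = mean(y == "yes"))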

Subscription Rate (Target Variable)

# Create the proportion table
bank_data %>%
  count(y, sort = TRUE) %>%
  mutate(proportion = n / sum(n))
## # A tibble: 2 × 3
##   y         n proportion
##   <fct> <int>      <dbl>
## 1 no    36548      0.887
## 2 yes    4640      0.113
# Plot the distribution
plot_categorical(bank_data, "y")

How is the Target Variable (y) Distributed?

  • The target variable (y) represents whether a customer subscribed to a term deposit.
  • Only 11.3% of customers subscribed (yes), while 88.7% did not (no).
  • The bar chart confirms a severe class imbalance, with far more negative outcomes than positive ones.

Implications for Predictive Modeling:

  • Severe class imbalance may lead to a biased model favoring the majority class (no). Techniques like SMOTE (Synthetic Minority Over-sampling Technique) or class weighting should be considered to handle this issue (a simple weighting sketch follows this list).
  • Since positive cases (yes) are rare, using metrics like precision, recall, and F1-score instead of accuracy is crucial for evaluating model performance.
  • Further feature analysis is needed to determine which factors most influence subscription rates, allowing for targeted feature engineering.
  • Alternative modeling strategies, such as ensemble methods, may help improve predictions in imbalanced data settings.
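
Class weighting can be as simple as inverse-frequency weights; the base-R sketch below shows one common scheme (how the weights are consumed depends on the modeling function, so this is only illustrative):

# Illustrative sketch: inverse-frequency class weights for the imbalanced target y
class_counts  <- table(bank_data$y)
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)
class_weights  # yields roughly 0.56 for "no" and 4.4 for "yes"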

Since only around 1 in 9 clients agreed to subscribe, this suggests that the campaign had limited effectiveness. Future analysis could explore:

  • What factors distinguish clients who subscribed from those who didn’t.
  • Whether certain demographics, timing, or contact methods had a higher success rate.
  • Potential strategies to improve engagement and response rates.

Detailed Distribution of Numerical Variables

# Load necessary libraries
library(ggplot2)

# Function to create and display histograms for numerical features
plot_numerical <- function(data, var, bins = 30) {
  print(
    ggplot(data, aes(x = !!sym(var))) +
      geom_histogram(bins = bins, fill = "steelblue", color = "black", alpha = 0.7) +
      theme_minimal() +
      ggtitle(paste("Distribution of", var)) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  )
}


# Function to plot numerical variables (Histogram + Boxplot) with adjustable bins
plot_numeric_variable <- function(data, column_name, bins = 30) {
  column_data <- data[[column_name]]  # Extract column data
  
  # Ensure the column is numeric
  if (is.numeric(column_data)) {
    
    # Remove NAs
    clean_data <- data[complete.cases(column_data), ]
    
    par(mfrow = c(1, 2))  # Set up 2 plots in a row
    
    # Plot histogram with dynamic bins
    hist(clean_data[[column_name]], main = paste("Histogram of", column_name),
         xlab = column_name, breaks = bins, col = "steelblue", border = "black")
    
    # Plot boxplot
    boxplot(clean_data[[column_name]], main = paste("Boxplot of", column_name),
            horizontal = TRUE, col = "orange")
    
    par(mfrow = c(1, 1))  # Reset plot layout
  } else {
    message("Skipping non-numeric column: ", column_name)
  }
}

# Function to convert summary statistics to a data frame
summary_to_dataframe <- function(data, column_name) {
  # Get summary statistics
  summary_values <- summary(data[[column_name]])
  
  # Convert to data frame
  summary_df <- data.frame(
    Statistic = names(summary_values),
    Value = as.numeric(summary_values)
  )
  
  return(summary_df)
}

Age

# Example usage for "age"
age_summary_df <- summary_to_dataframe(bank_data, "age")

# Display the summary as a data frame
print(age_summary_df)
##   Statistic    Value
## 1      Min. 17.00000
## 2   1st Qu. 32.00000
## 3    Median 38.00000
## 4      Mean 40.02406
## 5   3rd Qu. 47.00000
## 6      Max. 98.00000
# Example usage with bins specified
plot_numeric_variable(bank_data, "age", bins = 50)

How is the Age Variable Distributed?

  • The minimum age is 17, while the maximum is 98.
  • The median age is 38, with a mean of 40.02, indicating a slightly right-skewed distribution.
  • The interquartile range (IQR) is from 32 to 47, capturing the middle 50% of observations.

Interpretation of the Plots:

  • The histogram shows a concentration of customers between 30 and 50 years old, with a sharp decline in older age groups.
  • The boxplot highlights outliers beyond 60 years old, indicating a small subset of much older customers.

Implications for Predictive Modeling:

  • Since most customers are in their 30s and 40s, age may be a key factor in subscription likelihood.
  • The presence of outliers (older customers) could require winsorization or transformation to avoid skewing the model.
  • Further analysis should check whether older or younger customers are more likely to subscribe (y), as this may guide marketing strategies.

Campaign (Number of Contacts During Campaign)

campaign_summary_df <- summary_to_dataframe(bank_data, "campaign")

# Display the summary as a data frame
print(campaign_summary_df)
##   Statistic     Value
## 1      Min.  1.000000
## 2   1st Qu.  1.000000
## 3    Median  2.000000
## 4      Mean  2.567593
## 5   3rd Qu.  3.000000
## 6      Max. 56.000000
# Example usage with bins specified
plot_numeric_variable(bank_data, "campaign", bins = 50)

How is the Campaign Variable Distributed?

  • The minimum number of contacts per customer is 1, while the maximum is 56.
  • The median number of contacts is 2, with an average of 2.57, indicating most customers were contacted a few times.
  • The interquartile range (IQR) is between 1 and 3 contacts, meaning 50% of customers were contacted at most three times.

Interpretation of the Plots:

  • The histogram shows a strong right-skew, meaning a majority of customers received very few calls, while a small group received many follow-ups.
  • The boxplot reveals a high number of extreme outliers, with some customers receiving more than 10 calls (up to 56), which may indicate excessive follow-ups.

Implications for Predictive Modeling:

  • Since most customers are contacted only once or twice, marketing strategies should assess whether multiple calls improve subscription rates (y) or lead to customer fatigue.
  • The presence of outliers (customers receiving 10+ calls) suggests diminishing returns or possible over-targeting, which could negatively impact campaign effectiveness.
  • Feature engineering options:
    • Binning campaign values into categories (Low: 1-3, Medium: 4-6, High: 7+) may help reduce noise in modeling.
    • Investigate the relationship between campaign and y to determine if additional calls increase the likelihood of a subscription.
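
A minimal sketch of the binning suggested above, using the same Low/Medium/High cut points (the new column name campaign_group is illustrative):

# Illustrative sketch: bin campaign contacts into Low (1-3), Medium (4-6), High (7+)
bank_data %>%
  mutate(campaign_group = cut(campaign,
                              breaks = c(0, 3, 6, Inf),
                              labels = c("Low (1-3)", "Medium (4-6)", "High (7+)"))) %>%
  count(campaign_group)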

Duration (Call Duration in Seconds)

contact_summary_df <- summary_to_dataframe(bank_data, "duration")

# Display the summary as a data frame
print(contact_summary_df)
##   Statistic    Value
## 1      Min.    0.000
## 2   1st Qu.  102.000
## 3    Median  180.000
## 4      Mean  258.285
## 5   3rd Qu.  319.000
## 6      Max. 4918.000
# Example usage with bins specified
plot_numeric_variable(bank_data, "duration", bins = 50)

How is the Duration Variable Distributed?

  • The minimum call duration is 0 seconds, while the maximum is 4918 seconds (~82 minutes).
  • The median duration is 180 seconds (3 minutes), with an average of 258.3 seconds (~4.3 minutes).
  • The interquartile range (IQR) spans from 102 to 319 seconds, meaning 50% of calls last between 1.7 and 5.3 minutes.

Interpretation of the Plots

  • The histogram shows a right-skewed distribution, indicating that most calls are short, with a few lasting much longer.
  • The boxplot highlights many extreme outliers, meaning a small number of calls last significantly longer than the majority.

Implications for Predictive Modeling

  • Duration is widely regarded as the most influential feature for predicting term deposit subscription (y), as longer calls are strongly associated with successful conversions; note, however, that duration is only known after a call ends, so it should be used with caution in a realistic predictive setting.
  • The presence of extreme outliers suggests that duration may need a log transformation or capping to prevent high-leverage points from dominating the model (a sketch follows this list).
  • Further analysis should assess the relationship between duration and y, confirming whether longer calls lead to higher success rates.
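
A minimal sketch of the two options mentioned above (log transformation and capping); the 99th-percentile cap and the new column names are illustrative choices, not part of the original analysis:

# Illustrative sketch: log-transform duration (log1p handles 0-second calls) and cap at the 99th percentile
bank_data %>%
  mutate(duration_log    = log1p(duration),
         duration_capped = pmin(duration, quantile(duration, 0.99))) %>%
  summarise(max_raw = max(duration), max_log = max(duration_log), max_capped = max(duration_capped))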

Pdays (Days Since Last Contact)

pdays_summary_df <- summary_to_dataframe(bank_data, "pdays")

# Display the summary as a data frame
print(pdays_summary_df)
##   Statistic    Value
## 1      Min.   0.0000
## 2   1st Qu. 999.0000
## 3    Median 999.0000
## 4      Mean 962.4755
## 5   3rd Qu. 999.0000
## 6      Max. 999.0000
plot_numeric_variable(bank_data, "pdays", bins = 50)

How is the pdays Variable Distributed?

  • The minimum value is 0, while the maximum is 999.
  • The first quartile, median, and third quartile are all 999, meaning most values are concentrated at this level.
  • The mean is 962.48, suggesting an overwhelming skew towards 999.

Interpretation of the Plots:

  • The histogram confirms that the vast majority of values are 999, while a small proportion falls closer to zero.
  • The boxplot highlights that values near 999 dominate the dataset, while a few lower values appear as outliers.

Implications for Predictive Modeling:

  • In the dataset, 999 indicates that the customer was never contacted previously.
  • The variable is highly imbalanced, meaning it should be treated as a categorical feature (e.g., “Never Contacted” vs. “Previously Contacted”).
  • Feature Engineering Recommendations:
    • Create a binary variable:
      • pdays_contacted = ifelse(pdays == 999, 0, 1), indicating whether the client had prior contact (a sketch follows this list).
    • Alternatively, group pdays into bins (e.g., recent contact vs. long ago).
  • Check the correlation with y to confirm whether prior contact history impacts subscription rates.
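
A minimal sketch of the prior-contact flag described in the recommendations above, together with a quick check of subscription rates in each group (the column name pdays_contacted is illustrative):

# Illustrative sketch: flag whether the client had prior contact (pdays == 999 means never contacted)
bank_data %>%
  mutate(pdays_contacted = ifelse(pdays == 999, "never contacted", "previously contacted")) %>%
  group_by(pdays_contacted) %>%
  summarise(n = n(), subscription_rate = mean(y == "yes"))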

Employment Variation Rate (emp.var.rate, Economic Indicator)

# Convert emp_var_rate to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
  mutate(emp_var_rate = as.numeric(as.character(emp_var_rate)))

emp.var.rate_summary_df <- summary_to_dataframe(bank_data, "emp_var_rate")

# Display the summary as a data frame
print(emp.var.rate_summary_df)
##   Statistic      Value
## 1      Min. -3.4000000
## 2   1st Qu. -1.8000000
## 3    Median  1.1000000
## 4      Mean  0.0818855
## 5   3rd Qu.  1.4000000
## 6      Max.  1.4000000
plot_numeric_variable(bank_data, "emp_var_rate", bins = 50)

How is the emp_var_rate Variable Distributed?

  • The minimum employment variation rate is -3.4, while the maximum is 1.4.
  • The median is 1.1, meaning that at least half of the records correspond to positive employment trends.
  • The mean is 0.08, indicating that employment variation fluctuates around zero but is slightly positive.
  • The interquartile range (IQR) is between -1.8 and 1.4, covering most observations.

Interpretation of the Plots

  • The histogram reveals a discrete distribution, meaning that emp_var_rate only takes specific values rather than a continuous range.
  • The boxplot shows that most values fall between -3.4 and 1.4, with no extreme outliers.
  • The dataset contains both negative and positive values, which could reflect economic recessions and recoveries over different time periods.

Implications for Predictive Modeling

  • emp_var_rate is a macro-economic indicator that may influence consumer financial decisions.
  • Negative values (economic downturns) could correlate with lower term deposit subscriptions (y), as people save less during recessions.
  • Positive values (economic growth) might be linked to higher subscriptions, as customers have more disposable income.
  • Feature Engineering Options
    • Convert emp_var_rate into categorical bins (e.g., recession, stable, growth); a sketch follows this list.
    • Check its correlation with y to determine its predictive power.
    • Consider interaction effects with other economic variables like cons_price_idx.
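
A minimal sketch of the binning option above; the cut points and the column name economy are illustrative and would need to be justified against the economic context:

# Illustrative sketch: bin the employment variation rate into rough economic phases
bank_data %>%
  mutate(economy = cut(emp_var_rate,
                       breaks = c(-Inf, -1, 0.5, Inf),
                       labels = c("recession", "stable", "growth"))) %>%
  count(economy)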

Consumer Price Index (cons.price.idx)

# Convert cons_price_idx to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
  mutate(cons_price_idx = as.numeric(as.character(cons_price_idx)))

cons_price_idx_summary_df <- summary_to_dataframe(bank_data, "cons_price_idx")

# Display the summary as a data frame
print(cons_price_idx_summary_df)
##   Statistic    Value
## 1      Min. 92.20100
## 2   1st Qu. 93.07500
## 3    Median 93.74900
## 4      Mean 93.57566
## 5   3rd Qu. 93.99400
## 6      Max. 94.76700
plot_numeric_variable(bank_data, "cons_price_idx", bins = 50)

How is the cons_price_idx Variable Distributed?

  • The minimum consumer price index is 92.20, while the maximum is 94.77.
  • The median is 93.75, meaning half of the data points fall below this value.
  • The mean is 93.58, close to the median, suggesting the values are fairly symmetric around the center.
  • The interquartile range (IQR) is between 93.08 and 93.99, suggesting most values are clustered in this range.

Interpretation of the Plots:

  • The histogram shows a discrete distribution, meaning cons_price_idx only takes on certain values, likely due to it being recorded at fixed economic reporting periods.
  • The boxplot confirms that the variable is tightly distributed, with no extreme outliers.
  • The presence of several peaks in the histogram suggests that cons_price_idx varies over time but within a narrow range.

Implications for Predictive Modeling:

  • Consumer price index reflects inflation and economic stability, which can impact financial decisions.

  • A higher price index might be associated with reduced term deposit subscriptions (y) if inflation lowers consumer purchasing power.

  • Feature Engineering Recommendations:

    • Group values into economic phases (e.g., low, moderate, high inflation periods).
    • Examine correlation with y to determine its predictive power.
    • Interaction terms with emp_var_rate may capture macroeconomic trends more effectively.

Euribor 3-Month Rate (euribor3m, Euro Interbank Interest Rate)

# Convert euribor3m to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
  mutate(euribor3m = as.numeric(as.character(euribor3m)))

euribor3m_summary_df <- summary_to_dataframe(bank_data, "euribor3m")

# Display the summary as a data frame
print(euribor3m_summary_df)
##   Statistic    Value
## 1      Min. 0.634000
## 2   1st Qu. 1.344000
## 3    Median 4.857000
## 4      Mean 3.621291
## 5   3rd Qu. 4.961000
## 6      Max. 5.045000
plot_numeric_variable(bank_data, "euribor3m", bins = 50)

How is the euribor3m Variable Distributed?

The minimum is 0.63, and the maximum is 5.05. The mean (3.62) sits well below the median (4.86), indicating that a substantial share of observations cluster at low rates. The IQR (1.34 – 4.96) indicates most values fall within this range.

Interpretation of the Plots:

The histogram shows a multimodal distribution with peaks around 1 and 5, reflecting economic cycles. The boxplot confirms a wide spread without extreme outliers.

Implications for Predictive Modeling:

  • Euribor3m is a key economic indicator influencing financial behavior.
  • Higher rates may reduce term deposit subscriptions (y) due to increased borrowing costs.
  • Feature Engineering Recommendations:
    • Segment values into interest rate periods (low, moderate, high).
    • Assess correlation with y to determine predictive strength.
    • Use interaction terms with cons_price_idx and emp_var_rate to capture macroeconomic effects.

Number of Employees (nr.employed, Employment Level Indicator)

# Convert nr_employed to numeric, ensuring coercion warnings are handled
bank_data <- bank_data %>%
  mutate(nr_employed = as.numeric(as.character(nr_employed)))

nr_employed_summary_df <- summary_to_dataframe(bank_data, "nr_employed")

# Display the summary as a data frame
print(nr_employed_summary_df)
##   Statistic    Value
## 1      Min. 4963.600
## 2   1st Qu. 5099.100
## 3    Median 5191.000
## 4      Mean 5167.036
## 5   3rd Qu. 5228.100
## 6      Max. 5228.100
plot_numeric_variable(bank_data, "nr_employed", bins = 50)

How is the nr_employed Variable Distributed?

The minimum number of employees recorded is 4963.6, while the maximum is 5228.1. The mean (5167.0) falling below the median (5191.0) suggests a slightly left-skewed distribution. The IQR (5099.1 – 5228.1) indicates that most values are concentrated in this range.

Interpretation of the Plots:

The histogram reveals a discrete, multimodal distribution, reflecting distinct periods of employment levels. The boxplot shows values concentrated toward the upper end of the range, with no extreme outliers.

Implications for Predictive Modeling:

  • Employment levels influence economic stability, potentially affecting term deposit subscriptions (y).
  • Higher employment may correlate with increased savings and financial security, leading to higher term deposit subscriptions.
  • Feature Engineering Recommendations:
    • Categorize employment levels into low, moderate, and high employment periods.
    • Assess interactions with emp_var_rate and euribor3m to capture macroeconomic trends.

Since nr_employed reflects labor market conditions, it is likely a strong predictor of financial behaviors and term deposit subscriptions.

Check Feature Correlation (Relationships between different variables)

library(stats)
library(corrplot)

bank_data %>%
  keep(is.numeric) %>%  # Select only numeric variables for correlation analysis
  cor() %>%             # Compute correlation matrix
  corrplot()            # Plot correlation heatmap

The results in the correlation plot show that we have a few features that are highly correlated.

  • We see a high positive correlation between euribor3m, nr_employed, and emp_var_rate. This suggests that as the Euribor 3-month rate increases, so does the number of employed individuals and the employment variation rate.
  • There is also a strong positive correlation between cons_price_idx (consumer price index) and euribor3m, indicating that higher consumer prices are associated with higher Euribor interest rates.
  • Additionally, emp_var_rate and nr_employed have a strong correlation, which makes sense as employment variation often aligns with total employment figures.

We also observe some negative correlations:

  • cons_conf_idx (consumer confidence index) is negatively correlated with euribor3m and nr_employed, meaning that when consumer confidence is lower, interest rates and employment figures tend to be higher.
  • Similarly, emp_var_rate is negatively correlated with cons_conf_idx, suggesting that employment variation decreases when consumer confidence is lower.

Before deciding what to do with these correlated variables, we should conduct further analysis to determine whether any variables are redundant or if they provide unique information. If two variables are nearly identical in their impact, one could be removed or transformed to reduce multicollinearity in any future modeling.
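
As a follow-up sketch for that check, the raw correlation values can be inspected directly and pairs above a chosen threshold flagged (0.9 here is an illustrative cutoff, and cor_mat is an illustrative name):

# Illustrative sketch: flag highly correlated numeric pairs (|r| > 0.9) as multicollinearity candidates
cor_mat <- bank_data %>%
  keep(is.numeric) %>%
  cor()
round(cor_mat, 2)
which(abs(cor_mat) > 0.9 & abs(cor_mat) < 1, arr.ind = TRUE)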

Detect Outliers Using Boxplots

# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)

# Min-Max Normalization Function
min_max_normalize <- function(x) {
  return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))
}

# Select numeric variables and apply min-max scaling
numeric_data <- bank_data %>% 
  dplyr::select(where(is.numeric)) %>%
  mutate(across(everything(), min_max_normalize))

# Boxplots for standardized numerical variables
numeric_data %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value, fill = Variable)) +
  geom_boxplot(alpha = 0.6) +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none") +
  ggtitle("Outlier Detection Using Boxplots")

Analysis of the Min-Max Standardized Boxplots (Outlier Detection)

The boxplot displays the distribution of min-max scaled numerical variables, highlighting potential outliers. Here’s an analysis of key observations:

Variables with Many Outliers

  • campaign and duration have a large number of outliers.
    • Campaign (number of contacts during the campaign) shows many clients receiving very few calls, but some were contacted multiple times, creating extreme values.
    • Duration (call duration) is heavily right-skewed, with many long calls being potential outliers.
  • previous and pdays (days since last contact) also show several extreme values.
    • pdays has a strong concentration of values at 999, which likely represents “never contacted before.”

Highly Skewed Variables

  • euribor3m (Euribor 3-month rate) and nr_employed (number of employees) have long right tails, meaning they have some very high values.
  • emp_var_rate (employment variation rate) shows a wide range but is not heavily skewed.

Variables with a Compact Range

  • age and cons_price_idx (consumer price index) appear relatively well-distributed.
  • cons_conf_idx (consumer confidence index) is tightly packed but has some outliers.

Interpretation of Outliers

  • Campaign-related variables (campaign, previous, pdays) have extreme values, possibly indicating repeated contacts to some clients or first-time contacts for others.
  • Economic indicators (euribor3m, nr_employed, emp_var_rate) contain extreme values, suggesting large fluctuations in the financial market over time.
  • Call duration (duration) outliers indicate long conversations, which might be linked to successful subscriptions.

Month of the Year

# Trend in campaign efforts over months
ggplot(bank_data, aes(x = month, fill = y)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  ggtitle("Subscription Rate by Month") +
  ylab("Proportion of Subscriptions")

This stacked bar chart illustrates the proportion of term deposit subscriptions (y) by month.

Key Observations:

  • The majority of responses are “no” across all months, indicating a general difficulty in securing term deposit subscriptions.
  • December, March, and October show the highest proportions of “yes” responses, suggesting these months might be more favorable for conversions.
  • May, June, and July have the lowest subscription rates, potentially indicating seasonal challenges in customer acquisition.

Implications for Predictive Modeling:

  • Time-based features: The month of contact may be an important predictor of subscription likelihood.
  • Seasonality considerations: Further analysis can determine if external economic factors (e.g., holidays, interest rates) influence customer decisions.
  • Marketing strategy alignment: Campaign efforts might be more effective in months with historically higher subscription rates.

Day of Week

# Order days of the week
bank_data <- bank_data %>%
  mutate(day_of_week = factor(day_of_week, 
                              levels = c("mon", "tue", "wed", "thu", "fri")))

# Trend in campaign efforts over days of the week
ggplot(bank_data, aes(x = day_of_week, fill = y)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  ggtitle("Subscription Rate by Day of the Week") +
  ylab("Proportion of Subscriptions") +
  xlab("Day of the Week")

The subscription rate remains relatively consistent across all weekdays, with no significant variation in the proportion of successful (yes) and unsuccessful (no) term deposit subscriptions. This suggests that the day of the week does not play a major role in influencing customer decisions regarding subscriptions.

Algorithm Selection for the Bank Marketing Dataset

Evaluating Algorithm Suitability for the Bank Subscription Dataset

We will assess Multiple Linear Regression (MLR), Logistic Regression (LR), k-Nearest Neighbors (kNN), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Naive Bayes (NB) based on scalability, interpretability, assumptions, and suitability for categorical data.

Multiple Linear Regression (MLR)

Pros

  • Simple and computationally efficient.
  • Easily interpretable coefficients.
  • Works well with continuous outcomes.

Cons

  • Not suitable for classification problems like this dataset, as it assumes a continuous response variable.
  • Assumes linearity between independent and dependent variables, which doesn’t hold for a categorical target like y.
  • Poor at handling class imbalances.

Suitability: Not suitable (MLR is meant for regression, not classification).

Logistic Regression (LR)

Pros

  • Specifically designed for binary classification (ideal for predicting y).
  • Interpretable coefficients (e.g., odds of subscription based on predictors).
  • Scales well to large datasets like this one (41,188 records).
  • Handles categorical variables effectively.

Cons

  • Assumes a linear relationship between log-odds and predictors, which may not always hold.
  • Not effective for non-linearly separable data.

Suitability: Best suited for this dataset due to its efficiency and interpretability.

k-Nearest Neighbors (kNN)

Pros

  • Non-parametric (no strong assumptions about data distribution).
  • Can capture non-linear relationships.
  • Simple and easy to understand.

Cons

  • Computationally expensive for large datasets (distance calculations for 41,188 records).
  • Sensitive to irrelevant features and scaling issues.
  • Struggles with categorical data, requiring extensive preprocessing (e.g., encoding job, education).

Suitability: Not practical (computationally inefficient and requires extensive preprocessing).

Linear Discriminant Analysis (LDA)

Pros

  • Works well for classification tasks with normally distributed data.
  • Handles multi-class problems (though not needed here).
  • Can be more stable than Logistic Regression when the classes are well separated or the sample size is small.

Cons

  • Assumes Gaussian distribution of features, which may not hold in this dataset.
  • Assumes equal covariance among classes, which is often unrealistic.

Suitability: Potential alternative, but normality assumptions should be checked.

Quadratic Discriminant Analysis (QDA)

Pros

  • More flexible than LDA (allows different covariances for each class).
  • Can model non-linear decision boundaries.

Cons

  • Data-intensive – requires more samples per class to estimate covariance matrices.
  • Prone to overfitting, especially with small datasets or imbalanced classes.

Suitability: Not ideal due to its complexity and data requirements.

Naive Bayes (NB)

Pros

  • Fast and scalable, even for large datasets.
  • Handles categorical data efficiently (e.g., job, education, marital).
  • Works well in high-dimensional spaces.

Cons

  • Assumes feature independence, which is unrealistic for real-world data.
  • May underperform when predictors are highly correlated.

Suitability: Good alternative if the independence assumption does not significantly impact performance.

Impact of Labels on Algorithm Choice

The dataset has a binary label (y), which makes classification algorithms appropriate. LR is preferred over MLR (which targets continuous outcomes) and kNN (which is inefficient at this scale).

Dataset Influence on Choice

  • The mix of categorical and numerical features suits LR or Naive Bayes.
  • The large dataset size makes kNN impractical.

Would the Choice Change for <1,000 Records?

Yes. With fewer than 1,000 records, kNN becomes computationally feasible and Naive Bayes remains a strong option because it tolerates small samples well, while QDA would risk overfitting given the limited number of observations per class.

Pre-processing

Missing Value Handling

Since the unknown values in housing and loan are small (~2.40%), imputing with the mode ensures minimal distortion to the distribution. However, for default, where ~20.87% of values are unknown, replacing them with the most frequent category may introduce bias. Instead, “unknown” is preserved as a separate category to allow the model to capture any predictive patterns associated with it. This approach maintains data integrity while minimizing unnecessary imputation bias.

The code below performs missing data handling for the categorical variables default, housing, and loan in bank_data.

  1. Calculates and prints the percentage of “unknown” values (treated as missing) before any transformation.
  2. Replaces “unknown” with “missing” for housing and loan while keeping default unchanged.
  3. Recalculates missing percentages after replacing “unknown” values.
  4. Imputes missing values in housing and loan using the mode (most frequent value in each column).
  5. Recalculates and prints missing percentages after imputation to confirm the changes.
# Load necessary library
library(dplyr)

# Function to calculate the percentage of a given missing-value token
# (defaults to "unknown"; pass token = "missing" after the recoding step)
calculate_missing_percentage <- function(data, cols, token = "unknown") {
  missing_summary <- data %>%
    summarise(across(all_of(cols), ~ mean(. == token) * 100)) %>%
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Percentage")
  
  return(missing_summary)
}

# Calculate missing percentages before transformation
missing_before <- calculate_missing_percentage(bank_data, c("default", "housing", "loan"))

# Display missing percentages
print(missing_before)
## # A tibble: 3 × 2
##   Variable Missing_Percentage
##   <chr>                 <dbl>
## 1 default               20.9 
## 2 housing                2.40
## 3 loan                   2.40
# Replace "unknown" with "missing" in categorical variables (only for housing and loan)
bank_data <- bank_data %>%
  mutate(across(c(housing, loan), ~ ifelse(. == "unknown", "missing", .)))

# Calculate missing percentages after replacing "unknown" with "missing"
missing_after_replace <- calculate_missing_percentage(bank_data, c("housing", "loan"), token = "missing")

# Impute Missing Values for housing and loan with Mode (Most Frequent Value)
for (col in c("housing", "loan")) {
  most_frequent <- names(which.max(table(bank_data[[col]])))
  bank_data[[col]][bank_data[[col]] == "missing"] <- most_frequent
}

# Calculate missing percentages after imputation (checking the "missing" token)
missing_after_imputation <- calculate_missing_percentage(bank_data, c("housing", "loan"), token = "missing")

# Display missing percentages after imputation
print(missing_after_imputation)
## # A tibble: 2 × 2
##   Variable Missing_Percentage
##   <chr>                 <dbl>
## 1 housing                   0
## 2 loan                      0

Dimensionality Reduction

# Load necessary libraries
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr)

# Compute correlation matrix for numeric features
cor_matrix <- cor(bank_data %>% select(where(is.numeric)), use = "pairwise.complete.obs")

# Convert correlation matrix to a data frame for readability
cor_df <- as.data.frame(as.table(cor_matrix))

# Filter to show only correlations >|0.8| (excluding self-correlations)
significant_correlations <- cor_df %>%
  filter(Var1 != Var2, abs(Freq) > 0.8) %>%
  arrange(desc(abs(Freq)))

# Print only significant correlations
print("Significant Correlations (>|0.8|):")
## [1] "Significant Correlations (>|0.8|):"
print(significant_correlations)
##           Var1         Var2      Freq
## 1    euribor3m emp_var_rate 0.9722447
## 2 emp_var_rate    euribor3m 0.9722447
## 3  nr_employed    euribor3m 0.9451544
## 4    euribor3m  nr_employed 0.9451544
## 5  nr_employed emp_var_rate 0.9069701
## 6 emp_var_rate  nr_employed 0.9069701
# Find highly correlated variables (correlation > 0.85) for removal
highly_correlated_indices <- findCorrelation(cor_matrix, cutoff = 0.85)

# Get variable names of highly correlated features
highly_correlated_vars <- names(bank_data %>% select(where(is.numeric)))[highly_correlated_indices]

# Explicitly add 'day_of_week' for removal due to lack of predictive power
removal_candidates <- c(highly_correlated_vars, "day_of_week")

# Ensure only existing variables are removed
removed_vars <- intersect(names(bank_data), removal_candidates)

# Remove highly correlated and non-informative variables from dataset
bank_data <- bank_data %>% select(-all_of(removed_vars))

# Print the variables that were actually removed
print("Variables Removed from the Dataset:")
## [1] "Variables Removed from the Dataset:"
print(removed_vars)
## [1] "day_of_week"  "emp_var_rate" "euribor3m"

Feature Removal Summary:

The analysis reveals that several numeric variables are very highly correlated (well above the 0.85 removal cutoff), indicating strong relationships that could introduce multicollinearity and degrade model performance. Notably:

  • euribor3m and emp_var_rate have a very high correlation (0.972), suggesting a direct relationship between the Euro Interbank Offered Rate (Euribor) and the employment variation rate.
  • nr_employed is strongly correlated with euribor3m (0.945), indicating that the number of employees moves closely with the Euribor rate.
  • nr_employed also has a strong correlation with emp_var_rate (0.907), reinforcing the relationship between employment levels and economic indicators.

Due to these high correlations, euribor3m and emp_var_rate were removed from the dataset to reduce redundancy and mitigate potential multicollinearity. This ensures that the predictive model remains interpretable and avoids issues related to highly correlated variables influencing model outcomes.

Additionally, day_of_week was removed because its distribution showed no meaningful variation in subscription rates across different days. Since the response variable (‘y’) remains nearly identical regardless of the day, this feature is unlikely to contribute to model performance and was removed to reduce dimensionality.

Feature Engineering

To enhance the dataset for predictive modeling while preserving the pattern within each predictor, the following transformations were carried out:

Handling Skewed Numeric Variables

Box-Cox Transformation for Duration

duration and campaign exhibit extreme right-skewness. Applying a Box-Cox transformation may help normalize these distributions while maintaining meaningful differences.

# Load necessary libraries
library(dplyr)
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
## 
## Attaching package: 'e1071'
## The following objects are masked from 'package:PerformanceAnalytics':
## 
##     kurtosis, skewness
library(tidyr)
library(ggplot2)

# Function to calculate skewness
calculate_skewness <- function(data, cols) {
  data %>%
    summarise(across(all_of(cols), ~ skewness(.x, na.rm = TRUE))) %>%
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness")
}

# Select duration variable
numeric_var_duration <- "duration"

# Calculate skewness before transformation
skewness_before_duration <- calculate_skewness(bank_data, numeric_var_duration)

# Apply Box-Cox Transformation for duration
x_adj_duration <- bank_data$duration + 1  # Adjust for zero values
lambda_duration <- BoxCox.lambda(x_adj_duration, method = "loglik")
bank_data$duration_boxcox <- BoxCox(x_adj_duration, lambda_duration)

# Compute Skewness After Transformation
skewness_after_duration <- calculate_skewness(bank_data, "duration_boxcox")

# Combine skewness before and after (stacked rows, since the variable names differ)
skewness_df_duration <- bind_rows(skewness_before_duration, skewness_after_duration)


# Visualize Distribution Before and After Transformation
bank_data %>%
  dplyr::select(all_of(c("duration", "duration_boxcox"))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 50, fill = "blue", alpha = 0.5) +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  ggtitle("Distributions Before and After Box-Cox Transformation for Duration")

The Box-Cox transformation has significantly improved the distribution of the duration variable. Initially, the distribution was highly right-skewed, with a large concentration of values near zero and a long tail extending towards higher values.

After applying the Box-Cox transformation, the distribution appears much more normal, with a symmetrical bell shape centered around its mean. This transformation will likely enhance the performance of models that assume normality, such as LDA and QDA, and may improve the interpretability of logistic regression.
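
To quantify the improvement rather than rely on the histograms alone, the skewness of the raw and transformed variable can be compared directly. This is a quick check (values not reproduced here):

# Skewness before vs. after the Box-Cox transformation (skewness() from e1071)
skewness(bank_data$duration, na.rm = TRUE)
skewness(bank_data$duration_boxcox, na.rm = TRUE)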

Multiple Transformations for Campaign

# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(e1071)
library(forecast)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Function to calculate skewness
calculate_skewness <- function(data, cols) {
  data %>%
    summarise(across(all_of(cols), ~ skewness(.x, na.rm = TRUE))) %>%
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness")
}

# Select campaign variable
numeric_var_campaign <- "campaign"

# Calculate skewness before transformation
skewness_before_campaign <- calculate_skewness(bank_data, numeric_var_campaign)

# Apply Box-Cox Transformation for campaign
x_adj_campaign <- bank_data$campaign + 1  # Adjust for zero values
lambda_campaign <- BoxCox.lambda(x_adj_campaign, method = "loglik")
bank_data$campaign_boxcox <- BoxCox(x_adj_campaign, lambda_campaign)

# Apply Log1p Transformation
bank_data$campaign_log <- log1p(bank_data$campaign)

# Apply Square Root Transformation
bank_data$campaign_sqrt <- sqrt(bank_data$campaign)

# Apply Reciprocal Transformation (1/x) - Handle zeros separately
bank_data$campaign_reciprocal <- ifelse(bank_data$campaign == 0, 0, 1 / bank_data$campaign)

# Apply Binarization (High vs. Low Engagement)
bank_data$campaign_binary <- ifelse(bank_data$campaign > median(bank_data$campaign, na.rm = TRUE), "High", "Low")

# Compute Skewness After Each Transformation (except binary)
transformed_vars_campaign <- c("campaign_boxcox", "campaign_log", "campaign_sqrt", "campaign_reciprocal")
skewness_after_campaign <- calculate_skewness(bank_data, transformed_vars_campaign)

# Combine skewness before and after (stacked rows, since the variable names differ)
skewness_df_campaign <- bind_rows(skewness_before_campaign, skewness_after_campaign)

# Visualize Distributions Before and After Transformations
bank_data %>%
  dplyr::select(all_of(c("campaign", transformed_vars_campaign))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 50, fill = "blue", alpha = 0.5) +
  facet_wrap(~ Variable, scales = "free", nrow = 2, ncol = 3) +  # Facet with 2 rows and 3 columns
  theme_minimal() +
  ggtitle("Distributions Before and After Various Transformations for Campaign")

# Visualize Binarized Campaign Variable
ggplot(bank_data, aes(x = campaign_binary, fill = campaign_binary)) +
  geom_bar(alpha = 0.5) +
  theme_minimal() +
  ggtitle("Binarized Campaign Variable (High vs. Low Engagement)")

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Check for NA values before binning
bank_data <- bank_data %>% mutate(campaign = as.numeric(campaign))

# Define improved bins for 'campaign'
bank_data <- bank_data %>%
  mutate(campaign_bin = case_when(
    campaign == 1 ~ "1",
    campaign == 2 ~ "2",
    campaign == 3 ~ "3",
    campaign == 4 ~ "4",
    campaign == 5 ~ "5",
    campaign == 6 ~ "6",
    campaign >= 7 & campaign <= 10 ~ "7-10",
    campaign > 10 ~ "11+",
    TRUE ~ "Missing"  # Handling NA values
  ))

# Convert to ordered factor
bank_data$campaign_bin <- factor(bank_data$campaign_bin, 
                                 levels = c("1", "2", "3", "4", "5", "6", "7-10", "11+", "Missing"))

# Plot histogram for binned 'campaign' variable
ggplot(bank_data, aes(x = campaign_bin, fill = campaign_bin)) +
  geom_bar(alpha = 0.7) +
  theme_minimal() +
  ggtitle("Binned Campaign Variable with More Granular Bins") +
  xlab("Campaign Bin") +
  ylab("Count") +
  theme(legend.position = "none")

Binning vs. Binarization vs. Keeping Campaign Continuous:

Binning was considered because alternative transformations (Box-Cox, log, square root, and reciprocal) did not sufficiently resolve the skewness in the campaign variable. Given the heavy concentration of low values and the sharp decline in higher values, binning provides a more structured representation, making the variable more suitable for models like Logistic Regression, LDA, QDA, kNN, and Naive Bayes, which may struggle with highly skewed continuous inputs.

The finer binning approach captures more variability compared to simple binarization. While binarization simplifies interpretation, it may discard important distinctions between different engagement levels. Granular binning retains more information about customer interaction patterns while still reducing skewness, making it potentially more useful for models like kNN, LDA, and QDA, which can leverage multiple discrete categories effectively. However, for Logistic Regression and Naive Bayes, a binary variable might still be preferable to avoid sparse category issues.

  • Binning is better for Logistic Regression (LR) and Naive Bayes (NB) as it simplifies the variable and removes skewness.
  • Keeping Continuous (with transformation) works better for kNN, LDA, and QDA, which rely on numerical relationships.
  • LDA/QDA prefer normality, so Box-Cox or log transformation is ideal.
  • Binning loses detail but improves interpretability, while keeping it continuous preserves information for distance-based models.

Best Choice?

  • Use binning for LR/NB (a quick per-bin subscription-rate check is sketched below)
  • Use transformed continuous for kNN/LDA/QDA
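
As a quick check that the granular bins actually carry signal, the subscription rate within each bin can be tabulated. This is a sketch (output not shown), not part of the original pipeline; note that y is still in its original "yes"/"no" form at this point:

# Sketch: subscription rate (%) and count within each campaign bin
bank_data %>%
  group_by(campaign_bin) %>%
  summarise(subscription_rate = round(mean(y == "yes") * 100, 2),
            n = n())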

Data Preprocessing & Feature Engineering Breakdown

  1. Encoding Binary Categorical Variables
    • poutcome_bin: Converts poutcome into a binary variable (1 = success, 0 = others).
    • Why? Machine learning models often perform better with numerical inputs instead of categorical strings.
bank_data <- bank_data %>%
 mutate(poutcome_bin = ifelse(poutcome == "success", 1, 0))
  2. Binarizing campaign
    • campaign_binary: Classifies clients into high vs. low engagement based on the median number of contacts.
    • Why? Some models perform better with simplified categorical representations instead of skewed numerical data.
if (!"campaign_binary" %in% names(bank_data)) {
 bank_data <- bank_data %>%
   mutate(campaign_binary = ifelse(campaign > median(campaign, na.rm = TRUE), "High", "Low"))
}
  3. Preserving month as a Factor Instead of Seasonal Binning
    • Why? Retaining month allows models to capture variations in client behavior across different months.
bank_data$month <- as.factor(bank_data$month)
  4. Handling Low-Frequency Categories in job
    • Why? Reducing sparsity by grouping rare job categories (<2% frequency) into an “Other” category improves model stability.
# forcats is not attached above, so call fct_lump() via its namespace
bank_data$job <- forcats::fct_lump(bank_data$job, prop = 0.02)
  5. Transforming pdays Into a More Useful Feature
    • New feature: contacted_before (1 if previously contacted, 0 if never contacted).
    • Why? Since pdays = 999 means never contacted before, this transformation simplifies model learning.
bank_data <- bank_data %>%
 mutate(contacted_before = ifelse(pdays == 999, 0, 1)) %>%
 dplyr::select(-pdays)  # Remove `pdays`
  6. Adding Interaction Features to Capture Relationships
    • previous_contacts_ratio: Measures prior engagement by normalizing the number of previous contacts relative to campaign attempts.
    • loan_housing_combo: Encodes the combination of loan and housing into a single categorical variable.
    • Why? Interaction terms capture dependencies between variables, helping models uncover hidden patterns.
bank_data <- bank_data %>%
 mutate(
   previous_contacts_ratio = as.numeric(as.character(previous)) / (as.numeric(as.character(campaign)) + 1),
   loan_housing_combo = paste0(loan, "_", housing)
 )
  7. Converting Categorical Variables to Factors Before One-Hot Encoding
    • Why? Ensuring categorical variables are factors prevents errors when encoding.
categorical_vars <- c("contact", "campaign_binary", "default", "education", "housing", "job", 
                     "loan", "marital", "month", "loan_housing_combo", "poutcome")

bank_data <- bank_data %>%
 mutate(across(all_of(categorical_vars), as.factor))
  8. Applying One-Hot Encoding
    • Converts categorical variables into multiple binary columns (e.g., job_admin, job_blue-collar).
    • Why? Many ML models (e.g., logistic regression) require categorical variables to be represented numerically.
if (!requireNamespace("fastDummies", quietly = TRUE)) {
  install.packages("fastDummies")
}

library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.3.3
bank_data <- dummy_cols(bank_data, select_columns = categorical_vars, remove_selected_columns = TRUE)
  9. Encoding the Target Variable (y)
    • Why? Converting y to a binary factor (0 = no, 1 = yes) ensures compatibility with classification models.
bank_data <- bank_data %>%
 mutate(y = factor(ifelse(y == "yes", 1, 0)))
  10. Rearranging Columns: Moving y to the Last Column
    • Why? Placing the target variable last helps with readability and prevents unintended modifications.
bank_data <- bank_data %>%
  dplyr::select(-y, everything(), y)
  11. Cleaning Column Names
    • Why? Replacing - and . ensures compatibility with certain ML libraries that do not support special characters.
colnames(bank_data) <- gsub("-", "_", colnames(bank_data))
colnames(bank_data) <- gsub("\\.", "_", colnames(bank_data))
  12. Ensuring y Remains a Factor
    • Why? Ensuring y remains a factor avoids issues during classification.
bank_data <- bank_data %>%
  mutate(y = factor(y, levels = c(0, 1)))

glimpse(bank_data)
## Rows: 41,188
## Columns: 69
## $ age                           <dbl> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, …
## $ duration                      <dbl> 261, 149, 226, 151, 307, 198, 139, 217, …
## $ campaign                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous                      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ cons_price_idx                <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, …
## $ cons_conf_idx                 <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4…
## $ nr_employed                   <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191…
## $ duration_boxcox               <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_boxcox               <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, …
## $ campaign_log                  <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_sqrt                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_reciprocal           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_bin                  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ previous_contacts_ratio       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no                    <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1…
## $ default_unknown               <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ default_yes                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y            <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_6y            <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_9y            <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school         <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0…
## $ education_illiterate          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0…
## $ education_university_degree   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown             <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ housing_1                     <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0…
## $ housing_3                     <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ job_admin_                    <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar               <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0…
## $ job_entrepreneur              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid                 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ job_management                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services                  <int> 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0…
## $ job_student                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ job_unemployed                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1                        <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ loan_3                        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_divorced              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ marital_married               <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ marital_single                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0…
## $ marital_unknown               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may                     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1        <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ loan_housing_combo_1_3        <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ loan_housing_combo_3_1        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ loan_housing_combo_3_3        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ y                             <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Standardization

This script standardizes numeric variables using different scaling techniques while ensuring that the target variable (y) remains unchanged. The transformations applied depend on the characteristics of each variable:

  1. Z-score Standardization is applied to variables that benefit from centering around a common scale, making them comparable.
  2. Min-Max Scaling is used for variables that should remain within a fixed range (0 to 1), preserving relative differences.
  3. Robust Scaling is applied to variables that contain outliers, reducing their impact.

Key Implementation Details:

  • The transformed variables are given prefixes (z_, minmax_, robust_) to clearly differentiate them from their original versions.
  • The original versions of the transformed variables are removed to avoid redundancy; y itself is never transformed.
  • y is moved to the front, and the remaining columns keep their relative order.

A brief sanity check of the scaled columns is sketched after the output below.

# Load necessary library
library(dplyr)

# Identify variables for different standardization techniques (excluding 'y')
z_score_vars <- c("age", "duration", "previous_contacts_ratio", "nr_employed")
min_max_vars <- c("campaign_boxcox", "campaign_sqrt", "cons_price_idx")
robust_vars <- c("cons_conf_idx")

# Standardization Functions
z_score_standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
min_max_standardize <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
robust_standardize <- function(x) (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)

# Ensure 'y' remains unchanged
target_var <- "y"

# Apply Z-score Standardization
bank_data <- bank_data %>%
  mutate(across(all_of(z_score_vars), z_score_standardize, .names = "z_{.col}"))

# Apply Min-Max Scaling
bank_data <- bank_data %>%
  mutate(across(all_of(min_max_vars), min_max_standardize, .names = "minmax_{.col}"))

# Apply Robust Scaling
bank_data <- bank_data %>%
  mutate(across(all_of(robust_vars), robust_standardize, .names = "robust_{.col}"))

# Drop original unscaled columns (excluding 'y')
bank_data <- bank_data %>%
  dplyr::select(all_of(target_var), everything(), -all_of(c(z_score_vars, min_max_vars, robust_vars)))

# Verify the transformation
glimpse(bank_data)
## Rows: 41,188
## Columns: 69
## $ y                             <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous                      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ duration_boxcox               <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_log                  <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_reciprocal           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_bin                  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no                    <int> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1…
## $ default_unknown               <int> 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ default_yes                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y            <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_6y            <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_9y            <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school         <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0…
## $ education_illiterate          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0…
## $ education_university_degree   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown             <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ housing_1                     <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0…
## $ housing_3                     <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ job_admin_                    <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar               <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0…
## $ job_entrepreneur              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid                 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ job_management                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services                  <int> 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0…
## $ job_student                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ job_unemployed                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1                        <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ loan_3                        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_divorced              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ marital_married               <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ marital_single                <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0…
## $ marital_unknown               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may                     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1        <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ loan_housing_combo_1_3        <int> 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1…
## $ loan_housing_combo_3_1        <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ loan_housing_combo_3_3        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ z_age                         <dbl> 1.533015677, 1.628973456, -0.290182119, …
## $ z_duration                    <dbl> 0.01047130, -0.42149540, -0.12451830, -0…
## $ z_previous_contacts_ratio     <dbl> -0.3312077, -0.3312077, -0.3312077, -0.3…
## $ z_nr_employed                 <dbl> 0.3316759, 0.3316759, 0.3316759, 0.33167…
## $ minmax_campaign_boxcox        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ minmax_campaign_sqrt          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ minmax_cons_price_idx         <dbl> 0.6987529, 0.6987529, 0.6987529, 0.69875…
## $ robust_cons_conf_idx          <dbl> 0.8571429, 0.8571429, 0.8571429, 0.85714…
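
As a quick sanity check on the scaling (not part of the original pipeline), the summary statistics of a z-scored column and the range of a min-max column can be inspected; z_age and minmax_cons_price_idx are used here only as examples, and the output is not reproduced:

# Z-scored columns should have mean ~ 0 and sd ~ 1;
# min-max columns should fall inside [0, 1]
bank_data %>%
  summarise(mean_z_age   = mean(z_age),
            sd_z_age     = sd(z_age),
            range_minmax = paste(range(minmax_cons_price_idx), collapse = " to "))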

Sampling Data

For predictive modeling, we can use simple random sampling or stratified random sampling to create training and test datasets.

Simple Random Sampling (Without Replacement)

This method selects data randomly without replacement to create the training and test datasets, ensuring no duplicates.

# Set seed for reproducibility
set.seed(1234)

# Define training sample size (e.g., 75% of the data)
sample_size <- round(nrow(bank_data) * 0.75)

# Create sample set
sample_set <- sample(nrow(bank_data), sample_size, replace = FALSE)

# Split data into training and test sets
train_data <- bank_data[sample_set, ]
test_data <- bank_data[-sample_set, ]

# Verify class distribution remains consistent
print(round(prop.table(table(train_data$y)) * 100, 2))
## 
##     0     1 
## 88.67 11.33
print(round(prop.table(table(test_data$y)) * 100, 2))
## 
##     0     1 
## 88.92 11.08

Stratified Random Sampling (Maintains Class Distribution)

Since y is a categorical variable, we should ensure that both training and test sets maintain the same proportion of classes.

# Load caret package
library(caret)

# Stratified sampling with 75% training data
set.seed(1234)
trainIndex <- createDataPartition(bank_data$y, p = 0.75, list = FALSE)

# Split data based on stratified sampling
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]

# Verify class distribution remains consistent
round(prop.table(table(train_data$y)) * 100, 2)
## 
##     0     1 
## 88.73 11.27
round(prop.table(table(test_data$y)) * 100, 2)
## 
##     0     1 
## 88.73 11.27

Why Use Stratified Sampling?

  • The dataset is imbalanced, so simple random sampling may produce class distributions that differ slightly between the training and test sets.
  • Stratified sampling ensures that the proportions of each class in the target variable remain consistent in both training and test sets.
  • This is critical for predictive modeling, as the model should be trained on data that accurately represents the real-world distribution.

The class distribution in the training dataset closely mirrors that of the original dataset, with approximately 88.73% “no” responses and 11.27% “yes” responses in both cases. This indicates that the sampling process was performed correctly, preserving the proportion of classes in the response variable. Maintaining a similar distribution is crucial because it ensures that the model trained on the sample will generalize well to the full dataset, reducing bias and improving predictive performance.

Handling Imbalanced Data (SMOTE)

# Load necessary libraries
library(themis)
## Warning: package 'themis' was built under R version 4.3.3
## Loading required package: recipes
## Warning: package 'recipes' was built under R version 4.3.3
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stringr':
## 
##     fixed
## The following object is masked from 'package:stats':
## 
##     step
library(dplyr)
library(recipes)

# Convert `y` to a factor (required for SMOTE)
train_data <- train_data %>%
  mutate(y = as.factor(y))

# Store `y` separately and remove it before conversion
y_train <- train_data$y
train_data <- train_data %>% dplyr::select(-y)  # Remove `y` before conversion

# Convert other factor variables to numeric
train_data <- train_data %>%
  mutate(across(where(is.factor), as.numeric))

# Add `y` back after conversion
train_data <- train_data %>%
  mutate(y = y_train)

# Define the SMOTE recipe
set.seed(1234)
smote_recipe <- recipe(y ~ ., data = train_data) %>%
  step_smote(y, over_ratio = 1) %>%  # Balance classes
  prep()

# Apply SMOTE transformation
train_data_smote <- juice(smote_recipe)

# Check the new class distribution after SMOTE
table(train_data_smote$y)
## 
##     0     1 
## 27411 27411
glimpse(train_data)
## Rows: 30,891
## Columns: 69
## $ campaign                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ previous                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ duration_boxcox               <dbl> 8.702628, 7.469200, 8.375577, 7.497312, …
## $ campaign_log                  <dbl> 0.6931472, 0.6931472, 0.6931472, 0.69314…
## $ campaign_reciprocal           <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, …
## $ campaign_bin                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_bin                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contacted_before              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_cellular              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ contact_telephone             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ campaign_binary_High          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ campaign_binary_Low           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ default_no                    <int> 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1…
## $ default_unknown               <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0…
## $ default_yes                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_basic_4y            <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ education_basic_6y            <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ education_basic_9y            <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_high_school         <int> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0…
## $ education_illiterate          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_professional_course <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
## $ education_university_degree   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ education_unknown             <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0…
## $ housing_1                     <int> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0…
## $ housing_3                     <int> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1…
## $ job_admin_                    <int> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_blue_collar               <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1…
## $ job_entrepreneur              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_housemaid                 <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ job_management                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_retired                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_self_employed             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_services                  <int> 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0…
## $ job_student                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_technician                <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ job_unemployed                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ job_Other                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_1                        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ loan_3                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ marital_divorced              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ marital_married               <int> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1…
## $ marital_single                <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0…
## $ marital_unknown               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_apr                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_aug                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_dec                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jul                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_jun                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_mar                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_may                     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ month_nov                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_oct                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ month_sep                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ loan_housing_combo_1_1        <int> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0…
## $ loan_housing_combo_1_3        <int> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1…
## $ loan_housing_combo_3_1        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_housing_combo_3_3        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_failure              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome_nonexistent          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ poutcome_success              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ z_age                         <dbl> 1.533015677, 1.628973456, -0.290182119, …
## $ z_duration                    <dbl> 0.01047130, -0.42149540, -0.12451830, -0…
## $ z_previous_contacts_ratio     <dbl> -0.3312077, -0.3312077, -0.3312077, -0.3…
## $ z_nr_employed                 <dbl> 0.3316759, 0.3316759, 0.3316759, 0.33167…
## $ minmax_campaign_boxcox        <dbl> 0.0000000, 0.0000000, 0.0000000, 0.00000…
## $ minmax_campaign_sqrt          <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00…
## $ minmax_cons_price_idx         <dbl> 0.6987529, 0.6987529, 0.6987529, 0.69875…
## $ robust_cons_conf_idx          <dbl> 0.8571429, 0.8571429, 0.8571429, 0.85714…
## $ y                             <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Why it was done:

  • The original dataset was imbalanced, with significantly more no responses than yes responses.
  • Without balancing, the model would likely be biased toward predicting the majority class, leading to poor performance in identifying the minority class.
  • SMOTE was applied to generate synthetic examples for the minority class, ensuring both classes had equal representation.
  • A balanced dataset improves the model’s ability to generalize, making predictions more reliable for both classes; a quick before/after balance check is sketched below.
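
For completeness, the class balance before and after SMOTE can be compared side by side; this is a quick sketch (output not reproduced here). Note that SMOTE is applied only to the training data, so the test set retains the natural class imbalance for evaluation.

# Sketch: class balance (%) before vs. after SMOTE on the training data
round(prop.table(table(before_smote = y_train)) * 100, 2)
round(prop.table(table(after_smote = train_data_smote$y)) * 100, 2)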