DATA622- Assignment 1 - EDA: Portuguese Bank Marketing

Background

The assignment is to perform EDA on a dataset “Bank Marketing” UCI dataset , (detailed description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

There are two datasets: 1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010).

  2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.

The classification goal is to predict if the client will subscribe a term deposit (variable y). We will use the bank-full.csv

Input predictors:

# bank client data: 1 - age (numeric)

2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)

3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)

4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)

5 - default: has credit in default? (binary: “yes”,“no”)

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: “yes”,“no”)

8 - loan: has personal loan? (binary: “yes”,“no”)

# related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

12 - duration: last contact duration, in seconds (numeric)

# other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)

Missing values : none

Data Preparation

1. Importing Libraries

#Import needed libraries

library(ggplot2)
library(readr) # to uses read_csv function
library(dplyr) # to use Filter, mutate, arrange function etc
library(tidyr) # to use pivot_longer function

library(e1071)  # For skewness function
library(corrplot)

library(ROSE)
library(smotefamily)

library(caret)

2. Data Ingestion and inspection

The data analysis shows there are 45211 observations and 17 variables. I find some of variables not in the correct type. we need to convert it into correct data type.

bank_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA622/refs/heads/main/bank-full.csv")

head(bank_raw)

##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

dim(bank_raw)

## [1] 45211    17

str(bank_raw)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

3. Change the predictors to correct data type.

bank_transform <- bank_raw
bank_transform$job <- as.factor(bank_raw$job)
bank_transform$marital <- as.factor(bank_raw$marital)
bank_transform$education <- as.factor(bank_raw$education)
bank_transform$default <- as.factor(bank_raw$default)
bank_transform$balance <- as.integer(bank_raw$balance)
bank_transform$housing <- as.factor(bank_raw$housing)
bank_transform$loan <- as.factor(bank_raw$loan)
bank_transform$contact <- as.factor(bank_raw$contact)
bank_transform$month <- as.factor(bank_raw$month)
bank_transform$pdays <- as.integer(bank_raw$pdays)
bank_transform$poutcome <- as.factor(bank_raw$poutcome)
bank_transform$y <- as.factor(bank_raw$y)

str(bank_transform)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

4. Remove day variable

When checking the day predictor, it is not mapped to any year. hence we will not be able to infer in which year 2008 to 2010 it belongs.Hence its better to remove the day predictor.We can still keep month to see which month has the most subscription.

bank_transform <- subset(bank_transform, select = -c(day))

5. Are there any missing values and how significant are they?

There are no missing values. However Job,Education,Contact,poutcome has “unknown” values. It is difficult to impute the unknown values as it may affect the results of the outcome. Hence, I am exploring the data including the unknowns.

print("Missing values in the dataset")

## [1] "Missing values in the dataset"

colSums(is.na(bank_transform))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact     month  duration  campaign     pdays  previous  poutcome         y 
##         0         0         0         0         0         0         0         0

print("Unknown values in the dataset")

## [1] "Unknown values in the dataset"

bank_transform[bank_transform == "unknown"] <- NA
colSums(is.na(bank_transform))

##       age       job   marital education   default   balance   housing      loan 
##         0       288         0      1857         0         0         0         0 
##   contact     month  duration  campaign     pdays  previous  poutcome         y 
##     13020         0         0         0         0         0     36959         0

6. Check for duplicates

No duplicates found.

# Check for duplicates
  duplicates <- sum(duplicated(bank_transform))
  cat(paste("\nNumber of duplicate rows:", duplicates, "\n"))

## 
## Number of duplicate rows: 0

7. Target predictor to a new numeric variable

To perform analysis, it is needed to have the target predictor in numeric format as well.

bank_transform$y_numeric <- ifelse(bank_transform$y == "yes", 1, 0)
bank_transform$y_numeric <- as.integer(bank_transform$y_numeric)

Exploratory Data Analysis (EDA)

1. Distribution analysis of the numeric variables.

Based on the box plot,i can see the spread of each variable. i can observe that there are many points outside the whiskers for each and every element, indicating many outliers present especially for previous, pdays and balance .
Based on the histogram plot, i can observe that all numerical predictors have substantial skewness, except for age which has moderate skewness.
The target variable distribution shows that the subscription rate is 11.7%

bank_transform |>
  select(where(is.numeric))|>
  gather()|>
  ggplot(aes(value)) + 
  geom_histogram(bins = 5) + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Histograms of Numerical Predictors")

bank_transform|>
  select(where(is.numeric))|>
  gather()|>
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")

# Class distribution
  target_counts <- table(bank_transform$y)
  cat("\nTarget variable distribution:\n")

## 
## Target variable distribution:

  print(target_counts)

## 
##    no   yes 
## 39922  5289

  cat(paste("Subscription rate:", round(target_counts["yes"] / sum(target_counts) * 100, 2), "%"))

## Subscription rate: 11.7 %

 numeric_vars <- bank_transform %>% select_if(is.numeric)
for(col in names(numeric_vars)) {
  q1 <- quantile(numeric_vars[[col]], 0.25)
  q3 <- quantile(numeric_vars[[col]], 0.75)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  outliers <- sum(numeric_vars[[col]] < lower | numeric_vars[[col]] > upper)
  if(outliers > 0) {
    cat("   -", col, ":", outliers, "potential outliers (", 
        round(outliers/nrow(bank_transform)*100, 2), "%)\n")
  }
}

##    - age : 487 potential outliers ( 1.08 %)
##    - balance : 4729 potential outliers ( 10.46 %)
##    - duration : 3235 potential outliers ( 7.16 %)
##    - campaign : 3064 potential outliers ( 6.78 %)
##    - pdays : 8257 potential outliers ( 18.26 %)
##    - previous : 8257 potential outliers ( 18.26 %)
##    - y_numeric : 5289 potential outliers ( 11.7 %)

2. Skewness of the numerical variables

Rule of thumb: 0.5 < |skewness| < 1 indicates moderate skewness, |skewness| > 1 indicates substantial skewness.

Based on the skewness values, all numerical predictors have substantial skewness, except for age which has moderate skewness.previous has the highest value of skewness, which is understanable as its the result of the previous campaign.

skew_values <- bank_transform |>
  select(where(is.numeric)) |>
  sapply(skewness, na.rm = TRUE)

# Print skewness values
print(skew_values)

##        age    balance   duration   campaign      pdays   previous  y_numeric 
##  0.6847725  8.3597536  3.1441095  4.8983251  2.6155419 41.8436778  2.3833223

3. Correlation analysis of the numeric predictors.

None of the numerical variables appear to be strongly correlated to each other. This suggests that individual numerical variables alone don’t strongly predict subscription behavior.
The variable pdays and previous are moderately correlated with a coefficient of 0.45.
Closest ones to consider is correlated to the target variable is the duration.
Based on the age distribution, clients who are in the age category 60+ and 20-30 are accepting the term deposit subscription.
Based on the box plot of Average Yearly Balance, i observe that the clients having a balance of below 12000 euros get term deposit subscription which * makes sense as the balance is low.
Based on the box plot of the contact duration ,i can observe that Clients who are interested in term deposit subscription tend to have longer contact duration around 400 seconds.This makes sense as longer conversation times may indicate more interested customers.
Based on the box plot of Past days contacted Distribution On Term Deposit Subscription, the client is contacted 1-100 times, with a upper whisker up to 250.
Based on the box plot of Previous contacts Distribution On Term Deposit Subscription, the client is previous contacted 1 time, with a upper whisker up to 2.

bank_transform|>
  select(is.numeric)|>
  cor()

## Warning: Use of bare predicate functions was deprecated in tidyselect 1.1.0.
## ℹ Please use wrap predicates in `where()` instead.
##   # Was:
##   data %>% select(is.numeric)
## 
##   # Now:
##   data %>% select(where(is.numeric))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

##                    age      balance     duration     campaign        pdays
## age        1.000000000  0.097782739 -0.004648428  0.004760312 -0.023758014
## balance    0.097782739  1.000000000  0.021560380 -0.014578279  0.003435322
## duration  -0.004648428  0.021560380  1.000000000 -0.084569503 -0.001564770
## campaign   0.004760312 -0.014578279 -0.084569503  1.000000000 -0.088627668
## pdays     -0.023758014  0.003435322 -0.001564770 -0.088627668  1.000000000
## previous   0.001288319  0.016673637  0.001203057 -0.032855290  0.454819635
## y_numeric  0.025155017  0.052838410  0.394521016 -0.073172006  0.103621494
##               previous   y_numeric
## age        0.001288319  0.02515502
## balance    0.016673637  0.05283841
## duration   0.001203057  0.39452102
## campaign  -0.032855290 -0.07317201
## pdays      0.454819635  0.10362149
## previous   1.000000000  0.09323577
## y_numeric  0.093235773  1.00000000

#Age Distribution On Term Deposit Subscription
ggplot(bank_transform, aes(x = age, fill = y)) +
    geom_density(alpha = 0.5) +
    labs(title = "Age Distribution by Term Deposit Subscription") +
  scale_x_continuous(breaks = seq(0, 100, 5))

#Balance Distribution On Term Deposit Subscription
ggplot(bank_transform, aes(x = balance, fill = y)) +
    geom_density(alpha = 0.5) +
    labs(title = "Balance Distribution by Term Deposit Subscription") +
  scale_x_continuous(breaks = seq(min(bank_transform$balance), max(bank_transform$balance), by = 10000)) +
  scale_y_continuous(labels = scales::comma)

#Campaign Distribution On Term Deposit Subscription
ggplot(bank_transform, aes(x=campaign)) + 
  geom_histogram(bins=30, fill="purple", color="black", alpha=0.7) +
  labs(title="Distribution of Campaign Contacts", x="Number of Contacts", y="Count") +
  theme_minimal()

#Contact Distribution On Term Deposit Subscription
boxplot(duration ~ y,
        data = bank_transform,
        outline = F,
        main = "Contact Duration Distribution On Term Deposit Subscription",
        xlab = "Term Deposit Subscription",
        ylab = "Duration",
        col = c("indianred", "skyblue4"))

#Past days contacted  Distribution On Term Deposit Subscription
boxplot(pdays ~ y,
        data = bank_transform,
        outline = F,
        main = "Past days contacted  On Term Deposit Subscription",
        xlab = "Term Deposit Subscription",
        ylab = "Passed days since the client was last contacted",
        col = c("indianred", "skyblue4"))

boxplot(previous ~ y,
        data = bank_transform,
        outline = F,
        main = "Previous contacts Distribution On Term Deposit Subscription",
        xlab = "Term Deposit Subscription",
        ylab = "Previous contacts performed before this campaign",
        col = c("indianred", "skyblue4"))

#Previous contacts Distribution On Term Deposit Subscription
# outlier detection   showing a value of 275 
ggplot(bank_transform, aes(x = y, y = previous)) +
      geom_boxplot(aes(fill = y)) +
      labs(title = "Previous contacts Distribution by Term deposit Subscription")+
  scale_y_continuous(breaks = seq(min(bank_transform$pdays), max(bank_transform$pdays), by = 10))

# plotting by removing the outlier
bank_transform |>
  filter(previous < 200) |>
ggplot( aes(x = y, y = previous)) +
      geom_boxplot(aes(fill = y)) +
      labs(title = "Previous contacts Distribution by Term deposit Subscription")+
  scale_y_continuous(breaks = seq(min(bank_transform$pdays), max(bank_transform$pdays), by = 10))

4. Distribution analysis of Categorical predictors and its relationship with the target variable.

A moderate correlation with the target variable is observed in feature: default, Housing loan , loan ,contact, month and previous campaign outcome.
Customer Demographic Insights:
Job type impact: When plotting the job vs target variable by subscription rate i observe that student, retired, unemployed are shown to have the highest subscription rate.This might indicate these groups have more interest in passive income opportunities or are more available during contact hours.
Education impact: When plotting the education vs target variable by subscription rate i observe that tertiary education are shown to have the highest subscription rate.
Marital status influence: When plotting the marital status vs target variable by subscription rate i observe that clients who are single are shown to have the highest subscription rate.
Financial Status Patterns:
Clients who have no default are interested in in term deposit subscription. This greater interest indicates risk-averse financial behavior may correlate with term deposit interest
Clients who have no housing loan or no other loan are interested in in term deposit subscription than the clients who have housing loan.This aligns with financial capacity theory - they may have more disposable income to allocate.
Contact Strategy Insights:
Clients who have been contacted by Cellular are interested in in term deposit subscription than the clients who have been contacted by Telephone.This provides clear direction for contact strategy optimization
Subscription rate is highest in March followed by December.This seasonal pattern can inform capagin timing.
When plotting the previous campaign outcome(poutcome) vs target variable by subscription rate i observe that is the previous outcome was success, it has the highest subscription rate.Probably the bank has earned loyal customers.However it does not harm contacting the failed outcome clients as they also show a subscription rate of 12%.

# Job vs target
  job_subscription <- bank_transform %>%
    group_by(job) %>%
    summarize(subscription_rate = mean(y == "yes") * 100) %>%
    arrange(desc(subscription_rate))
ggplot(job_subscription, aes(x = reorder(job, subscription_rate), y = subscription_rate)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(title = "Subscription Rate by Job Type",
         x = "Job", y = "Subscription Rate (%)")

# marital status
mar_subscription <- bank_transform %>%
    group_by(marital) %>%
    summarize(subscription_rate = mean(y == "yes") * 100) %>%
    arrange(desc(subscription_rate))
ggplot(mar_subscription, aes(x = reorder(marital, subscription_rate), y = subscription_rate)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(title = "Subscription Rate by Marital Status",
         x = "Marital Status ", y = "Subscription Rate (%)")

# Education vs target
  edu_subscription <- bank_transform %>%
    group_by(education) %>%
    summarize(subscription_rate = mean(y == "yes") * 100) %>%
    arrange(desc(subscription_rate))
ggplot(edu_subscription, aes(x = reorder(education, subscription_rate), y = subscription_rate)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(title = "Subscription Rate by Job Type",
         x = "Education", y = "Subscription Rate (%)")

# Default
ggplot(bank_transform, aes(x = default, fill = y)) +
    geom_bar(position = "dodge") +
    labs(title = "Default by Term Deposit Subscription",
         x = "Default")

# Housing
ggplot(bank_transform, aes(x = housing, fill = y)) +
    geom_bar(position = "dodge") +
    labs(title = "Housing loan by Term Deposit Subscription",
         x = "Housing")

# Loan
ggplot(bank_transform, aes(x = loan, fill = y)) +
    geom_bar(position = "dodge") +
    labs(title = "Loan by Term Deposit Subscription",
         x = "Loan")

# Contact
ggplot(bank_transform, aes(x = contact, fill = y)) +
    geom_bar(position = "dodge") +
    labs(title = "Contact by Term Deposit Subscription",
         x = "Contact")

# Month Vs Target

month_order <- c("jan", "feb", "mar", "apr", "may", "jun", 
                   "jul", "aug", "sep", "oct", "nov", "dec")
  
  month_subscription <- bank_transform %>%
    group_by(month) %>%
    summarize(subscription_rate = mean(y == "yes") * 100)
  
  month_subscription$month <- factor(month_subscription$month, levels = month_order)
  month_subscription <- month_subscription[order(month_subscription$month), ]
  
  ggplot(month_subscription, aes(x = month, y = subscription_rate)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    labs(title = "Subscription Rate by Month",
         x = "Month", y = "Subscription Rate (%)")

  #Previous outcome vs target
  pout_subscription <- bank_transform %>%
    group_by(poutcome) %>%
    summarize(subscription_rate = mean(y == "yes") * 100) %>%
    arrange(desc(subscription_rate))
ggplot(pout_subscription, aes(x = reorder(poutcome, subscription_rate), y = subscription_rate)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(title = "Subscription Rate by Job Type",
         x = "Previous campaign outcome", y = "Subscription Rate (%)")

5.Correlation Analysis of categorical variables.

The Chi-square test for each categorical variable shows a p value = 0 . Despite low correlation coefficients, the statistical significance (p-value = 0) for categorical variables indicates that these relationships aren’t occurring by chance. The significant variables which can be considered are:default, housing loan, loan, month.

categorical_vars <- bank_transform %>% select_if(is.factor)
cat("\nRelationship between target variable (y) and categorical variables:\n")

## 
## Relationship between target variable (y) and categorical variables:

for(col in names(categorical_vars)) {
  if(col != "y") {
    cat("\nContingency table for y vs", col, ":\n")
    cont_table <- table(categorical_vars$y, categorical_vars[[col]])
    print(cont_table)
    cat("Chi-square test result:\n")
    print(chisq.test(cont_table))
  }
}

## 
## Contingency table for y vs job :
##      
##       admin. blue-collar entrepreneur housemaid management retired
##   no    4540        9024         1364      1131       8157    1748
##   yes    631         708          123       109       1301     516
##      
##       self-employed services student technician unemployed unknown
##   no           1392     3785     669       6757       1101       0
##   yes           187      369     269        840        202       0
## Chi-square test result:

## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = NaN, df = 11, p-value = NA
## 
## 
## Contingency table for y vs marital :
##      
##       divorced married single
##   no      4585   24459  10878
##   yes      622    2755   1912
## Chi-square test result:
## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 196.5, df = 2, p-value < 2.2e-16
## 
## 
## Contingency table for y vs education :
##      
##       primary secondary tertiary unknown
##   no     6260     20752    11305       0
##   yes     591      2450     1996       0
## Chi-square test result:

## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = NaN, df = 3, p-value = NA
## 
## 
## Contingency table for y vs default :
##      
##          no   yes
##   no  39159   763
##   yes  5237    52
## Chi-square test result:
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont_table
## X-squared = 22.202, df = 1, p-value = 2.454e-06
## 
## 
## Contingency table for y vs housing :
##      
##          no   yes
##   no  16727 23195
##   yes  3354  1935
## Chi-square test result:
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont_table
## X-squared = 874.82, df = 1, p-value < 2.2e-16
## 
## 
## Contingency table for y vs loan :
##      
##          no   yes
##   no  33162  6760
##   yes  4805   484
## Chi-square test result:
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cont_table
## X-squared = 209.62, df = 1, p-value < 2.2e-16
## 
## 
## Contingency table for y vs contact :
##      
##       cellular telephone unknown
##   no     24916      2516       0
##   yes     4369       390       0
## Chi-square test result:

## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = NaN, df = 2, p-value = NA
## 
## 
## Contingency table for y vs month :
##      
##         apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep
##   no   2355  5559   114  2208  1261  6268  4795   229 12841  3567   415   310
##   yes   577   688   100   441   142   627   546   248   925   403   323   269
## Chi-square test result:
## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 3061.8, df = 11, p-value < 2.2e-16
## 
## 
## Contingency table for y vs poutcome :
##      
##       failure other success unknown
##   no     4283  1533     533       0
##   yes     618   307     978       0
## Chi-square test result:

## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = NaN, df = 3, p-value = NA

p_values <- sapply(categorical_vars, function(x) chisq.test(table(x, bank_transform$y))$p.value)

## Warning in chisq.test(table(x, bank_transform$y)): Chi-squared approximation
## may be incorrect

## Warning in chisq.test(table(x, bank_transform$y)): Chi-squared approximation
## may be incorrect
## Warning in chisq.test(table(x, bank_transform$y)): Chi-squared approximation
## may be incorrect
## Warning in chisq.test(table(x, bank_transform$y)): Chi-squared approximation
## may be incorrect

significant_vars <- names(p_values[p_values < 0.05])  # Keep only significant ones


print(significant_vars)

##  [1] NA        "marital" NA        "default" "housing" "loan"    NA       
##  [8] "month"   NA        "y"

Machine Learning Algorithm Recommendations for Bank Term Deposit Subscription

Based on the EDA findings, as its a supervised learning usecasse with classification technique,here are my algorithm recommendations for predicting term deposit subscriptions:

Logistic Regression

Pros:

Provides probability scores and easily interpretable coefficients
Works well with the binary target variable (yes/no subscription)
Can handle both categorical and numerical features after proper encoding
Relatively fast to train with 45,211 records
Offers good explainability for business stakeholders
Can identify which variables most strongly influence subscription decisions

Cons:

May not capture complex non-linear relationships
Assumes independent features (though some correlation was observed)
Sensitive to outliers (which were observed in balance and duration)
May require careful feature scaling for optimal performance

Linear Discriminant Analysis (LDA)

Pros:

Works well with a categorical target and mixed feature types
Can handle multicollinearity better than logistic regression
Often performs well when classes are well-separated
Computationally efficient with large datasets
Provides probability estimates and interpretable results

Cons:

Assumes normally distributed features within each class
Assumes homoscedasticity (equal variance across classes)
Less flexible with highly non-linear decision boundaries
May not perform optimally with the observed categorical variables unless properly encoded

Quadratic Discriminant Analysis (QDA)

Pros:

Can model non-linear decision boundaries
More flexible than LDA for capturing complex relationships
Doesn’t assume equal covariance matrices across classes
May better capture the distinct characteristics of subscribers vs. non-subscribers

Cons:

Requires more parameters to estimate
May overfit with the number of features present
Assumes normal distribution of features within each class
Needs more data for reliable estimates

k-Nearest Neighbors (kNN)

Pros:

Non-parametric approach that doesn’t assume data distribution
Can capture complex local patterns in customer behavior
Simple to understand conceptually
Flexible to different types of decision boundaries

Cons:

Computationally intensive with 45,211 records
Sensitive to irrelevant features and the curse of dimensionality
Requires careful feature scaling
Performance highly dependent on choice of k and distance metric
Less interpretable than logistic regression or LDA

Naive Bayes

Pros:

Works well with categorical data (job, education, marital status)
Fast training and prediction times
Often performs surprisingly well despite simplistic assumptions
Robust to irrelevant features
Less prone to overfitting

Cons:

Assumes feature independence which is violated in this dataset (e.g., age and job may be related)
May not capture complex relationships between features
Probability estimates can be less calibrated
Performance can degrade with highly correlated features

Recommendation:

I would recommend starting with Logistic Regression for the following reasons:

Business Interpretability: Given the banking context, stakeholders likely need to understand why customers subscribe - logistic regression provides clear coefficients that explain the impact of each variable.
Dataset Characteristics: The dataset shows clear relationships between certain variables (duration, job type, education) and the target, which logistic regression can effectively capture.
Implementation Simplicity: It’s faster to implement, tune, and deploy, allowing quicker time-to-value for the business. Regulatory Considerations: Banking often requires transparent decision-making models that can be easily explained to regulators.

Additional Considerations:

Labels: Yes, the dataset has a clear binary label (“y” - whether the client subscribed to a term deposit). This makes this a supervised learning classification problem, which guided my algorithm choices.
Relationship to Dataset: Logistic regression aligns well with the observed patterns in the EDA, particularly the linear relationship between duration and subscription likelihood, and the categorical influence of job, education, and marital status.
Validation Strategy: Given the class imbalance observed (more “no” than “yes” responses), stratified cross-validation would be important regardless of algorithm choice.

As a secondary approach, I would suggest LDA to compare performance, especially if the assumptions of normally distributed features are reasonably met after transformation.

Pre- processing

1. Data Cleaning

There are no missing values. There are no duplicates as shown in above code.However there are unknown values which are converted to NA.

# Count missing values per column
colSums(is.na(bank_transform ))

##       age       job   marital education   default   balance   housing      loan 
##         0       288         0      1857         0         0         0         0 
##   contact     month  duration  campaign     pdays  previous  poutcome         y 
##     13020         0         0         0         0         0     36959         0 
## y_numeric 
##         0

# Replace missing numerical values with median
bank_final <- bank_transform %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Replace missing categorical values with the most common value (mode)
bank_final <- bank_final %>%
  mutate(across(where(is.factor), ~ ifelse(is.na(.), as.character(names(sort(table(.), decreasing = TRUE)[1])), .)))

2. Dimensionality Reduction

i am removing the categorical variables which do show a correlation.

bank_final <- subset(bank_transform, select = -c(job, marital,education))

3. Feature Engineering

# Create Age Groups
bank_final <- bank_final %>%
  mutate(age_group = case_when(
    age <= 30 ~ "18-30",
    age > 30 & age <= 40 ~ "30-40",
    age > 40 & age <= 50 ~ "41-50",
    age > 50 & age <= 60 ~ "51-60",
    age > 60 ~ "60+"
  ))

# Categorize Balance Levels
bank_final$balance_group <- cut(bank_final$balance, 
                       breaks = quantile(bank_final$balance, probs = seq(0, 1, 0.2)),
                       labels = c("Very Low", "Low", "Medium", "High", "Very High"))

# Categorize Contact Duration
bank_final <- bank_final %>%
  mutate(duration_category = case_when(
    duration < 100 ~ "Short",
    duration >= 100 & duration <= 300 ~ "Medium",
    duration > 300 ~ "Long"
  ))

# Convert categorical variables to factors
bank_final <- bank_final %>% mutate(across(where(is.character), as.factor))

4. Imbalanced Data

# Check imbalance
table(bank_final$y)

## 
##    no   yes 
## 39922  5289

# Oversampling using ROSE -Random Over-Sampling Examples
bank_final_balanced <- ROSE(y ~ ., data = bank_final, seed = 123)$data
table(bank_final_balanced$y)

## 
##   no  yes 
## 4135 4047

# Alternative: SMOTE - Synthetic Minority Over-sampling Technique
# I tried SMOTE,but as it needs all values without any NAs, i was not able to use it.

5. Data Transformation

bank_final_balanced <- subset(bank_final_balanced , select = -c(age, duration, balance))

# Normalize numerical variables
num_vars <- bank_final_balanced %>% select(where(is.numeric)) %>% names()

bank_final_balanced [num_vars] <- scale(bank_final_balanced [num_vars])

# One-hot encoding categorical variables (for Logistic Regression)
bank_final_balanced  <- dummyVars(y ~ ., data = bank_final_balanced ) %>% predict(bank_final_balanced ) %>% as.data.frame()

# Convert target variable to factor for classification
bank_final_balanced $y <- as.factor(bank_final_balanced $y)

Conclusions:

The EDA on “Bank Marketing” dataset shows there are 45211 observations and 17 variables. The data cleanup activity included changing the variables to correct data type, removing day variable, observing the unknown values, checking for duplicates.

Numerical Variables:

Based on the box plot,I see the spread of each variable. I observe that there are many points outside the whiskers for each and every element, indicating many outliers present especially for previous, pdays and balance .

Based on the histogram plot, I observe that all numerical predictors have substantial skewness, except for age which has moderate skewness.

The target variable distribution shows that the subscription rate is 11.7%.

None of the numerical variables appear to be strongly correlated to each other. This suggests that individual numerical variables alone don’t strongly predict subscription behavior.

The variable pdays and previous are moderately correlated with a coefficient of 0.45.

Closest ones to consider is correlated to the target variable is the duration.

Based on the age distribution, clients who are in the age category 60+ and 20-30 are accepting the term deposit subscription.

Based on the box plot of Average Yearly Balance, I observe that the clients having a balance of below 12000 euros get term deposit subscription which makes sense as the balance is low.

Based on the box plot of the contact duration ,I observe that Clients who are interested in term deposit subscription tend to have longer contact duration around 400 seconds.This makes sense as longer conversation times may indicate more interested customers.

Based on the box plot of Past days contacted Distribution on Term Deposit Subscription, the client is contacted 1-100 times, with an upper whisker up to 250.

Based on the box plot of Previous contacts Distribution on Term Deposit Subscription, the client is previous contacted 1 time, with an upper whisker up to 2.

Categorical Variables:

A moderate correlation with the target variable is observed in feature: default, Housing loan , loan ,contact, month and previous campaign outcome.

Customer Demographic Insights:

Job type impact: When plotting the job vs target variable by subscription rate I observe that student, retired, unemployed are shown to have the highest subscription rate.This might indicate these groups have more interest in passive income opportunities or are more available during contact hours.

Education impact: When plotting the education vs target variable by subscription rate I observe that tertiary education is shown to have the highest subscription rate.

Marital status influence: When plotting the marital status vs target variable by subscription rate I observe that clients who are single are shown to have the highest subscription rate.

Financial Status Patterns:

Clients who have no default are interested in in term deposit subscription. This greater interest indicates risk-averse financial behavior may correlate with term deposit interest

Clients who have no housing loan or no other loan are interested in in term deposit subscription than the clients who have housing loan.This aligns with financial capacity theory - they may have more disposable income to allocate.

Contact Strategy Insights:

Clients who have been contacted by Cellular are interested in in term deposit subscription than the clients who have been contacted by Telephone.This provides clear direction for contact strategy optimization

Subscription rate is highest in March followed by December.This seasonal pattern can inform capagin timing.

When plotting the previous campaign outcome(poutcome) vs target variable by subscription rate i observe that is the previous outcome was success, it has the highest subscription rate.Probably the bank has earned loyal customers.However it does not harm contacting the failed outcome clients as they also show a subscription rate of 12%.

The Chi-square test for each categorical variable shows a p value = 0 . Despite low correlation coefficients, the statistical significance (p-value = 0) for categorical variables indicates that these relationships aren’t occurring by chance. The significant variables which can be considered are:default, housing loan, loan, month.

ML model:

Based on the EDA, I would recommend starting with Logistic Regression . I would suggest LDA to compare performance.

Pre Processing:

Based on above two models, i need to Replace missing numerical values with median, Replace missing categorical values with the most common value (mode). I need to remove the categorical variables job, marital,education which do show a correlation. i need to feature engineer to add Age Groups, Balance groups and duration category. finally perform One-hot encoding categorical variables (for Logistic Regression) and Convert target variable to factor for classification.

DATA622- Assignment 1 - EDA: Portuguese Bank Marketing

Dhanya Nair

2025-03-01

Background

Data Preparation

1. Importing Libraries

2. Data Ingestion and inspection

3. Change the predictors to correct data type.

4. Remove day variable

5. Are there any missing values and how significant are they?

6. Check for duplicates

7. Target predictor to a new numeric variable

Exploratory Data Analysis (EDA)

1. Distribution analysis of the numeric variables.

2. Skewness of the numerical variables

3. Correlation analysis of the numeric predictors.

4. Distribution analysis of Categorical predictors and its relationship with the target variable.

5.Correlation Analysis of categorical variables.

Machine Learning Algorithm Recommendations for Bank Term Deposit Subscription

Pre- processing

1. Data Cleaning

2. Dimensionality Reduction

3. Feature Engineering

4. Imbalanced Data

5. Data Transformation

Conclusions: