For my analysis I chose bank-additional-full.csv because it contains all available examples (41,188 rows), has 20 input variables, and is ordered by date, which might help in identifying time-based trends if needed. Using a reduced dataset (like bank-full.csv, which was my second option) might miss key patterns.

Loading the libraries

library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
library(tidyverse)  
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.3.3
library(corrplot) 
## Warning: package 'corrplot' was built under R version 4.3.3
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.3.3
library(ggplot2)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(summarytools)  
## Warning: package 'summarytools' was built under R version 4.3.3
library(scales)
## Warning: package 'scales' was built under R version 4.3.3
library(moments)
library(dplyr)
library(MASS)  
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Warning: package 'carData' was built under R version 4.3.3
library(readr)

Load the dataset

# I uploaded the data to GitHub for reproducibility purposes
url <- "https://raw.githubusercontent.com/NikoletaEm/datasps/refs/heads/main/bank-additional-full.csv"
bank <- read.csv(url, sep = ";")
head(bank)
##   age       job marital   education default housing loan   contact month
## 1  56 housemaid married    basic.4y      no      no   no telephone   may
## 2  57  services married high.school unknown      no   no telephone   may
## 3  37  services married high.school      no     yes   no telephone   may
## 4  40    admin. married    basic.6y      no      no   no telephone   may
## 5  56  services married high.school      no      no  yes telephone   may
## 6  45  services married    basic.9y unknown      no   no telephone   may
##   day_of_week duration campaign pdays previous    poutcome emp.var.rate
## 1         mon      261        1   999        0 nonexistent          1.1
## 2         mon      149        1   999        0 nonexistent          1.1
## 3         mon      226        1   999        0 nonexistent          1.1
## 4         mon      151        1   999        0 nonexistent          1.1
## 5         mon      307        1   999        0 nonexistent          1.1
## 6         mon      198        1   999        0 nonexistent          1.1
##   cons.price.idx cons.conf.idx euribor3m nr.employed  y
## 1         93.994         -36.4     4.857        5191 no
## 2         93.994         -36.4     4.857        5191 no
## 3         93.994         -36.4     4.857        5191 no
## 4         93.994         -36.4     4.857        5191 no
## 5         93.994         -36.4     4.857        5191 no
## 6         93.994         -36.4     4.857        5191 no

1. Exploratory Data Analysis

Initial Observations

str(bank)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...

Age (numeric) – Age of the client.

Job (categorical) – Type of job (e.g., admin., blue-collar, management, student, etc.).

Marital (categorical) – Marital status (divorced, married, single, unknown).

Education (categorical) – Highest education level attained (basic.4y, high school, university degree, etc.).

Default (categorical) – Whether the client has credit in default (yes, no, unknown).

Housing (categorical) – Whether the client has a housing loan (yes, no, unknown).

Loan (categorical) – Whether the client has a personal loan (yes, no, unknown).

Contact (categorical) – Communication type (cellular or telephone).

Month (categorical) – Last contact month (e.g., Jan, Feb, Mar, …, Dec).

Day of Week (categorical) – Last contact day (Monday to Friday).

Duration (numeric) – Last contact duration in seconds (Important: This should be excluded for predictive modeling as it leaks information about the outcome).

Campaign (numeric) – Number of contacts performed during this campaign (including last contact).

Pdays (numeric) – Number of days since the client was last contacted (999 means never contacted before).

Previous (numeric) – Number of times the client was contacted before this campaign.

Poutcome (categorical) – Outcome of the previous marketing campaign (failure, nonexistent, success).

Employment Variation Rate (emp.var.rate) (numeric) – Quarterly employment variation rate.

Consumer Price Index (cons.price.idx) (numeric) – Monthly consumer price index.

Consumer Confidence Index (cons.conf.idx) (numeric) – Monthly consumer confidence index.

Euribor 3-month Rate (euribor3m) (numeric) – Daily Euribor 3-month interest rate.

Number of Employees (nr.employed) (numeric) – Quarterly number of employees.

y (binary) – Whether the client subscribed to a term deposit (yes/no).

The dataset consists of 41,188 observations and 21 variables, with a mix of numerical and categorical features: numerical variables (age, duration, campaign, pdays, previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed) and categorical variables (job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome, y). Our target variable is y, which indicates whether a client subscribed to the term deposit; it is binary with two levels: “no” (0) and “yes” (1). It’s important to note that some categorical variables (e.g., default, housing, loan) have “unknown” as a category, which might represent missing values.

Checking for missing values

colSums(is.na(bank))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0

No missing values (NAs) were detected!

summary(bank)
##       age            job              marital           education        
##  Min.   :17.00   Length:41188       Length:41188       Length:41188      
##  1st Qu.:32.00   Class :character   Class :character   Class :character  
##  Median :38.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.02                                                           
##  3rd Qu.:47.00                                                           
##  Max.   :98.00                                                           
##    default            housing              loan             contact         
##  Length:41188       Length:41188       Length:41188       Length:41188      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     month           day_of_week           duration         campaign     
##  Length:41188       Length:41188       Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 102.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 180.0   Median : 2.000  
##                                        Mean   : 258.3   Mean   : 2.568  
##                                        3rd Qu.: 319.0   3rd Qu.: 3.000  
##                                        Max.   :4918.0   Max.   :56.000  
##      pdays          previous       poutcome          emp.var.rate     
##  Min.   :  0.0   Min.   :0.000   Length:41188       Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.000   Class :character   1st Qu.:-1.80000  
##  Median :999.0   Median :0.000   Mode  :character   Median : 1.10000  
##  Mean   :962.5   Mean   :0.173                      Mean   : 0.08189  
##  3rd Qu.:999.0   3rd Qu.:0.000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :7.000                      Max.   : 1.40000  
##  cons.price.idx  cons.conf.idx     euribor3m      nr.employed  
##  Min.   :92.20   Min.   :-50.8   Min.   :0.634   Min.   :4964  
##  1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344   1st Qu.:5099  
##  Median :93.75   Median :-41.8   Median :4.857   Median :5191  
##  Mean   :93.58   Mean   :-40.5   Mean   :3.621   Mean   :5167  
##  3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228  
##       y            
##  Length:41188      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Looking at the summary of our data we notice quite a few “unknown” values. More specifically there are 8,597 in default (~20% of the data), 990 in housing (~2.4%), 990 in loan (~2.4%) and 80 in marital (<1%). For these values there are three approaches we can follow: if “unknown” is meaningful (e.g., customers refusing to disclose information), we keep it as its own category; if “unknown” is missing completely at random, imputation or removal might be better; and for columns with a high share of “unknown” (like default at ~20%), we should consider dropping the column if it is not critical. pdays uses the value 999 as a flag for clients who were never contacted before; we will count how many such rows there are.
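
These counts can be verified directly; a quick check (not part of the original output) that tallies “unknown” entries in each character column:

# Tally "unknown" entries per character column to confirm the figures above
sapply(bank[sapply(bank, is.character)], function(x) sum(x == "unknown"))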

sum(bank$pdays == 999)
## [1] 39673
sum(bank$pdays != 999)
## [1] 1515

This suggests that the vast majority of clients in our dataset had no prior contact. We need to investigate this further.

table_pdays_y <- table(bank$pdays == 999, bank$y)
chisq.test(table_pdays_y)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_pdays_y
## X-squared = 4341.7, df = 1, p-value < 2.2e-16

Since the p-value is far below 0.05, we reject the null hypothesis that pdays and the target variable (y) are independent. This means that pdays does influence the response variable, so we can’t simply drop it. A good idea would be to transform it into a binary variable (previously_contacted): “yes” if pdays ≠ 999 (the client was contacted before) and “no” if pdays = 999 (first-time contact).

Central tendency (mean, median) and spread (range, IQR, and SD) of each variable

numeric_data <- bank %>% dplyr::select(where(is.numeric))

summary_stats <- data.frame(
  Mean = sapply(numeric_data, mean, na.rm = TRUE),
  Median = sapply(numeric_data, median, na.rm = TRUE),
  Range = sapply(numeric_data, function(x) diff(range(x, na.rm = TRUE))),
  IQR = sapply(numeric_data, function(x) IQR(x, na.rm = TRUE)),
  SD = sapply(numeric_data, sd, na.rm = TRUE)
)

print(summary_stats)
##                        Mean   Median    Range     IQR          SD
## age              40.0240604   38.000   81.000  15.000  10.4212500
## duration        258.2850102  180.000 4918.000 217.000 259.2792488
## campaign          2.5675925    2.000   55.000   2.000   2.7700135
## pdays           962.4754540  999.000  999.000   0.000 186.9109073
## previous          0.1729630    0.000    7.000   0.000   0.4949011
## emp.var.rate      0.0818855    1.100    4.800   3.200   1.5709597
## cons.price.idx   93.5756644   93.749    2.566   0.919   0.5788400
## cons.conf.idx   -40.5026003  -41.800   23.900   6.300   4.6281979
## euribor3m         3.6212908    4.857    4.411   3.617   1.7344474
## nr.employed    5167.0359109 5191.000  264.500 129.000  72.2515277

Correlation analysis

cor_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
ggcorrplot(cor_matrix, lab = TRUE, hc.order = TRUE, type = "lower")

We have some highly correlated features: emp.var.rate and nr.employed have a very strong positive correlation (0.91), euribor3m is highly correlated with emp.var.rate (0.97) and nr.employed (0.95), and cons.price.idx has a strong correlation with emp.var.rate (0.78) and nr.employed (0.69). On the other hand, we also have negatively correlated features: previous has a strong negative correlation with euribor3m (-0.42) and emp.var.rate (-0.42), and a negative correlation with nr.employed (-0.5). The strong correlation between the employment-related variables (emp.var.rate, nr.employed, euribor3m) suggests that they capture similar economic trends. Creating an interaction feature like age * campaign could reveal whether younger or older clients are more likely to be influenced by repeated calls.
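
As a rough sketch of that interaction idea (illustrative only; the age_campaign feature is hypothetical and not used later in this analysis):

# Hypothetical interaction feature: age * campaign, checked against the outcome
interaction_check <- bank %>%
  mutate(age_campaign = age * campaign)
cor(interaction_check$age_campaign, as.numeric(interaction_check$y == "yes"))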

Feature Distribution

numeric_columns <- sapply(bank, is.numeric)


par(mfrow = c(2, 3))


for (col in names(bank)[numeric_columns]) {
  hist(bank[[col]], 
       main = paste("Histogram of", col), 
       xlab = col, 
       col = "maroon", 
       border = "black")
}

par(mfrow = c(1, 1))  

The age variable follows a distribution reasonably close to normal, while duration and campaign are highly skewed, indicating that most calls were short and few customers were contacted many times. pdays and previous show large spikes, suggesting many clients were either never contacted before or contacted after a long gap. Economic indicators like emp.var.rate, euribor3m, cons.price.idx, and nr.employed exhibit multimodal distributions, likely reflecting different economic conditions over time.

To better understand our data’s distribution, I will compute the actual values for skewness and kurtosis.

# To remember
# Skewness close to 0 → Data is approximately symmetrical
# Positive skewness (> 0) -> Right-skewed 
# Negative skewness (< 0) -> Left-skewed 
# Kurtosis close to 3 -> Data follows a normal distribution
# Kurtosis > 3 ->  Heavy-tailed aka more outliers
# Kurtosis < 3 -> Light-tailed aka fewer extreme values
skewness_values <- apply(numeric_data, 2, skewness)
kurtosis_values <- apply(numeric_data, 2, kurtosis)

skew_kurt_summary <- data.frame(
  Skewness = skewness_values,
  Kurtosis = kurtosis_values
)
print(skew_kurt_summary)
##                  Skewness  Kurtosis
## age             0.7846682  3.791070
## duration        3.2630224 23.245334
## campaign        4.7623333 39.975160
## pdays          -4.9220107 25.226619
## previous        3.8319027 23.106230
## emp.var.rate   -0.7240692  1.937352
## cons.price.idx -0.2308792  2.170146
## cons.conf.idx   0.3031688  2.641340
## euribor3m      -0.7091621  1.593222
## nr.employed    -1.0442244  2.996094

Skewness results: age (0.78) and cons.conf.idx (0.30) are slightly right-skewed, while cons.price.idx (-0.23) is close to symmetric; the slight skew in age reflects somewhat more younger customers than older ones. Three variables are strongly right-skewed: duration (3.26), campaign (4.76) and previous (3.83); four are left-skewed: pdays (-4.92), nr.employed (-1.04), emp.var.rate (-0.72) and euribor3m (-0.71). Kurtosis results: high kurtosis was observed in duration (23.24), campaign (39.97), pdays (25.23) and previous (23.11), meaning these variables contain extreme outliers that indicate rare but extreme cases, such as very long calls. Low kurtosis was observed in emp.var.rate (1.93), cons.price.idx (2.17), cons.conf.idx (2.64) and euribor3m (1.59); these variables have fewer extreme values. As a next step, the highly skewed variables (duration, campaign, pdays, previous) may need a log transformation.

Outliers

We will view the boxplots to verify the results stated above.

par(mfrow = c(2, 3))

for (col in names(bank)[numeric_columns]) {
  boxplot(bank[[col]], 
          main = paste("Boxplot of", col), 
          col = "purple", 
          border = "black", 
          horizontal = TRUE)
}

par(mfrow = c(1, 1))  

The boxplots reveal the presence of outliers in multiple variables, especially in duration, campaign, pdays, and previous, which could potentially affect our model’s performance. To handle these outliers, I’d consider a log transformation to reduce skewness, as mentioned before, or removing extreme cases. The appropriate approach depends on whether these outliers carry meaningful information or result from data entry errors. Outliers in duration indicate lengthy calls, possibly from engaged customers who are more likely to subscribe, which is valuable information rather than noise. Likewise, outliers in campaign (number of contacts) could reflect persistent marketing efforts, while extreme values in pdays (days since last contact) might indicate a long gap between interactions.

Categorical Variables

# Identify categorical columns excluding numerical ones
categorical_columns <- names(bank)[!sapply(bank, is.numeric)]

par(mfrow = c(2, 2))  

# Loop through categorical variables and create bar plots
for (col in categorical_columns) {
  barplot(table(bank[[col]]), 
          main = paste("Distribution of", col), 
          col = "steelblue", 
          border = "black", 
          las = 2, 
          cex.names = 0.8) 
}

par(mfrow = c(1, 1))

Patterns/Trends

ggplot(bank, aes(x = y, y = duration)) + 
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Call Duration vs Subscription")

The insights that we get from the plot above reveal that clients who subscribed generally had longer call durations compared to those who did not. The median call duration for subscribers is noticeably higher, suggesting that longer conversations tend to increase the chances of subscription.

ggplot(bank, aes(x = y, y = campaign)) + 
  geom_boxplot(fill = "blue") +
  labs(title = "Number of Contacts vs Subscription")

The insights drawn from the plot above are that a few individuals were contacted more than 40 times but still did not subscribe, indicating that excessive follow-ups may not be effective, and that most subscriptions happened within a low number of calls, suggesting that persistent calling beyond a certain threshold is not very useful. That tells us the bank should prioritize call quality over quantity, ensuring that calls are engaging and persuasive rather than simply frequent.

ggplot(bank, aes(x = campaign, y = duration, color = y)) +
  geom_point(alpha = 0.5) +
  labs(title = "Call Duration vs Number of Contacts by Subscription Outcome")

Looking at the scatter plot above, most successful subscriptions (shown in light blue) occur at relatively low numbers of campaign contacts but longer call durations, while a large cluster of non-subscriptions (“no”, in red) occurs at low call durations even when multiple calls were made. In conclusion, the scatter plot suggests that long call duration is a stronger predictor of success than the number of contacts.

Seasonality

monthly_counts <- bank %>%
  count(month, y) %>%
  group_by(month) %>%
  mutate(percentage = n / sum(n) * 100)

# Plot
ggplot(monthly_counts, aes(x = month, y = percentage, group = y, color = y)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "Subscription Rate by Month", x = "Month", y = "Percentage (%)") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The plot above makes it apparent that the likelihood of a client subscribing is highest in March, December and October, whereas the probability of a non-subscription (“no”) is highest in July, May and August. Companies can leverage the high-conversion months (March, December, October) by running intensive marketing campaigns, improve performance in the low-conversion months (July, May, August) by adjusting messaging or customer engagement strategies, and investigate the causes of seasonal variation by asking whether external factors (e.g., holidays, financial cycles) affect client decisions.

create_plot <- function(data, x_var, title) {
  ggplot(data, aes_string(x = x_var, fill = "y")) +
    geom_bar(position = "fill", width = 0.7) +  
    geom_text(stat = "count", aes(label = scales::percent(..count.. / tapply(..count.., ..x.., sum)[..x..], accuracy = 0.1)), 
              position = position_fill(vjust = 0.5), size = 2.9, color = "black",fontface = "bold" ) +
    labs(title = title, x = x_var, y = "Proportion") +
    theme_minimal(base_size = 14) +  # Increase text size
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Plot each variable individually
plot1 <- create_plot(bank, "job", "Subscription Rate by Job Type")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(plot1)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plot2 <- create_plot(bank, "marital", "Subscription Rate by Marital Status")
print(plot2)

plot3 <- create_plot(bank, "education", "Subscription Rate by Education Level")
print(plot3)

plot4 <- create_plot(bank, "contact", "Subscription Rate by Contact Type")
print(plot4)

plot5 <- create_plot(bank, "day_of_week", "Subscription Rate by Day of the Week")
print(plot5)

Key insights: Retired individuals and students have the highest subscription rates (25.2% and 31.4% respectively), indicating they are more receptive to the offering. On the other hand, blue-collar and entrepreneur job types show the lowest subscription rates (6.9% and 8.5% respectively), suggesting these groups may be less engaged or less financially inclined to subscribe. The highest subscription rates occur on Thursdays (12.1%) and Tuesdays (11.8%), while Mondays and Fridays, the start and end of the week, see the lowest engagement; marketing campaigns should therefore be optimized for midweek outreach, when potential customers seem more receptive. Cellular contacts have a higher conversion rate (14.7%) than telephone (5.2%), indicating that mobile communication is more effective; traditional telephone calls may yield lower returns and could be de-prioritized in favour of mobile-based outreach such as SMS and email marketing. Single customers have a high subscription rate (14%), and the small “unknown” marital category is slightly higher (15%); single customers may be more open to subscribing because they tend to have fewer financial commitments (such as mortgages or dependents), so marketing strategies could target young professionals with customized financial products like savings plans or investment options. Higher education and professional training correlate with higher subscription rates, likely because these clients better understand financial products, and the high subscription rate among illiterate customers suggests they may be more responsive to direct marketing, such as in-person banking or phone calls, rather than online promotions.

Duplicate/Inconsistent Values

If duplicates exist, we will remove them to avoid bias.

sum(duplicated(bank))
## [1] 12

We have 12 duplicate rows.

# We will view them and then remove them in preprocessing
bank[duplicated(bank), ]
##       age         job  marital           education default housing loan
## 1267   39 blue-collar  married            basic.6y      no      no   no
## 12262  36     retired  married             unknown      no      no   no
## 14235  27  technician   single professional.course      no      no   no
## 16957  47  technician divorced         high.school      no     yes   no
## 18466  32  technician   single professional.course      no     yes   no
## 20217  55    services  married         high.school unknown      no   no
## 20535  41  technician  married professional.course      no     yes   no
## 25218  39      admin.  married   university.degree      no      no   no
## 28478  24    services   single         high.school      no     yes   no
## 32517  35      admin.  married   university.degree      no     yes   no
## 36952  45      admin.  married   university.degree      no      no   no
## 38282  71     retired   single   university.degree      no      no   no
##         contact month day_of_week duration campaign pdays previous    poutcome
## 1267  telephone   may         thu      124        1   999        0 nonexistent
## 12262 telephone   jul         thu       88        1   999        0 nonexistent
## 14235  cellular   jul         mon      331        2   999        0 nonexistent
## 16957  cellular   jul         thu       43        3   999        0 nonexistent
## 18466  cellular   jul         thu      128        1   999        0 nonexistent
## 20217  cellular   aug         mon       33        1   999        0 nonexistent
## 20535  cellular   aug         tue      127        1   999        0 nonexistent
## 25218  cellular   nov         tue      123        2   999        0 nonexistent
## 28478  cellular   apr         tue      114        1   999        0 nonexistent
## 32517  cellular   may         fri      348        4   999        0 nonexistent
## 36952  cellular   jul         thu      252        1   999        0 nonexistent
## 38282 telephone   oct         tue      120        1   999        0 nonexistent
##       emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed   y
## 1267           1.1         93.994         -36.4     4.855      5191.0  no
## 12262          1.4         93.918         -42.7     4.966      5228.1  no
## 14235          1.4         93.918         -42.7     4.962      5228.1  no
## 16957          1.4         93.918         -42.7     4.962      5228.1  no
## 18466          1.4         93.918         -42.7     4.968      5228.1  no
## 20217          1.4         93.444         -36.1     4.965      5228.1  no
## 20535          1.4         93.444         -36.1     4.966      5228.1  no
## 25218         -0.1         93.200         -42.0     4.153      5195.8  no
## 28478         -1.8         93.075         -47.1     1.423      5099.1  no
## 32517         -1.8         92.893         -46.2     1.313      5099.1  no
## 36952         -2.9         92.469         -33.6     1.072      5076.2 yes
## 38282         -3.4         92.431         -26.9     0.742      5017.5  no

Algorithm Selection

Since the goal is to predict whether a client will subscribe to a term deposit, which is a binary classification problem, the best models from the ones presented so far are: Logistic Regression, Linear Discriminant Analysis (LDA), k-Nearest Neighbors (KNN) (for smaller datasets!). Logistic Regression is valued for its simplicity, interpretability and computational efficiency, making it ideal for large datasets like ours. It provides probability estimates and handles both categorical and numerical data effectively. However, it assumes linear relationships between predictors and log-odds outcomes, struggles with complex non-linear patterns, and can be compromised by multicollinearity. Linear Discriminant Analysis (LDA) performs well with normally distributed data and offers dimensionality reduction that can improve performance. Like logistic regression, it’s computationally efficient. Its main limitations include assuming normality of predictor variables which may not always be true and reduced effectiveness when relationships between variables and targets are non-linear. K-Nearest Neighbors (KNN) stands out as a non-parametric approach that doesn’t assume any specific distribution, allowing it to capture complex patterns in data. It’s particularly effective with smaller datasets. However, KNN becomes computationally expensive with large datasets (like our 41K records), requires careful selection of the K parameter to avoid overfitting or underfitting, and tends to perform poorly with imbalanced data.

I recommend Logistic Regression for this customer subscription prediction task. It’s the best choice because it handles large datasets efficiently (important for our 41,188 records), provides clear interpretability so business stakeholders can understand which factors influence customer decisions and generates probability estimates that help prioritize potential subscribers. While Linear Discriminant Analysis could be a good alternative if our data is normally distributed and KNN should be avoided due to its high computational cost with large datasets, Logistic Regression offers the best balance of performance, speed and business value for our specific needs.

Looking at the dataset structure, there is a clear target variable y with binary values “no” and “yes” indicating whether clients subscribed. This confirms that we have labeled data, which directly shaped the algorithm selection: since this is a supervised classification problem, Logistic Regression and LDA are appropriate choices, as they learn from these labeled examples to predict outcomes for new data. If the dataset lacked the y column, we would have had to use unsupervised methods such as clustering, which would group customers by similarity but would not directly predict subscription likelihood without additional interpretation.

Looking at our EDA findings, the dataset contains both numerical variables (age, duration, economic indicators) and categorical variables (job, marital status, education), which both Logistic Regression and LDA can handle effectively. However, several variables such as duration and campaign are skewed, which Logistic Regression tolerates better than LDA, since LDA assumes normally distributed predictors. Our correlation analysis revealed highly correlated economic variables, particularly emp.var.rate and euribor3m; this multicollinearity could affect the stability and interpretability of Logistic Regression coefficients, but it can be mitigated through careful feature selection or regularization. Additionally, the class imbalance in the target variable (4,639 “yes” versus 36,537 “no”) is better handled by Logistic Regression, which can be weighted to account for imbalanced classes. The flexibility of Logistic Regression to handle these dataset characteristics reinforces the algorithm selection, though we should remain mindful of its sensitivity to multicollinearity when interpreting results.

The choice of algorithm would change if there were fewer than 1,000 records. With a smaller dataset, K-Nearest Neighbors would become viable since its computational cost would be much lower; LDA would still be useful, but a small sample could affect the stability of its covariance estimates. Finally, Logistic Regression’s interpretability helps the bank understand the key factors influencing subscriptions, allowing better-targeted marketing strategies; its probability estimates enable prioritization of high-potential leads; and since it scales well with large datasets, the bank can quickly update predictions and refine outreach. This aligns with business goals by maximizing conversions while keeping costs low and focusing on the most promising customer segments.
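
To illustrate the class-weighting point, a minimal sketch of a weighted logistic regression; the predictors and weighting scheme here are placeholders, not the final model:

# Weight each "yes" case by the no/yes ratio so both classes contribute roughly equally
class_weights <- ifelse(bank$y == "yes", sum(bank$y == "no") / sum(bank$y == "yes"), 1)
weighted_fit <- glm(as.factor(y) ~ age + campaign + previous + cons.conf.idx,
                    data = bank, family = binomial, weights = class_weights)
# Note: glm may warn about non-integer successes with fractional weights; expected for this sketch
summary(weighted_fit)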

Pre-Processing

Data Cleaning

We will first handle the “unknown” values found in EDA. For the variable default, with ~20% of its data being unknown, I think the best approach is to keep “unknown” as its own category and let the model decide whether it is meaningful; since 20% of the data would be affected, dropping these rows would mean significant data loss. For the variables housing and loan, only 2.4% of the data is unknown, so dropping those rows is an option, but it is better to impute the values; since both variables are binary, I’ll fill the unknowns with the most common category (the mode).

bank$housing[bank$housing == "unknown"] <- names(which.max(table(bank$housing)))
bank$loan[bank$loan == "unknown"] <- names(which.max(table(bank$loan)))

For the variable marital, since missing values make up less than 1% of the data, they won’t significantly impact the model, so I’ll keep “unknown” as a separate category.

For handling outliers/noise we will use winsorization, a technique that replaces outliers with less extreme values to reduce their impact on the analysis. For duration and campaign the values are capped at the 5th and 95th percentiles; for previous a stronger winsorization (capping at the 1st and 99th percentiles) is used because of its extreme outliers. For the variables nr.employed, emp.var.rate, and euribor3m I chose to just standardize them, since their skewness is relatively mild; standardizing these variables should work well for our logistic regression model. For the variables campaign and duration I used a log transformation, which turned out well. For previous I had to apply the log transformation repeatedly; the skewness is still relatively high, but it may be acceptable depending on the model’s performance.

winsorize <- function(x) {
  lower_bound <- quantile(x, 0.05, na.rm = TRUE)
  upper_bound <- quantile(x, 0.95, na.rm = TRUE)
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}

winsorize_strong <- function(x) {
  lower_bound <- quantile(x, 0.01, na.rm = TRUE)
  upper_bound <- quantile(x, 0.99, na.rm = TRUE)
  x[x < lower_bound] <- lower_bound
  x[x > upper_bound] <- upper_bound
  return(x)
}

# Apply winsorization (5th/95th percentiles for duration and campaign, 1st/99th for previous)
duration_winsorized <- winsorize(bank$duration)
campaign_winsorized <- winsorize(bank$campaign)
previous_winsorized <- winsorize_strong(bank$previous)

bank <- bank %>%
  mutate(
    # Apply transformations directly to the winsorized variables
    duration_transformed = log1p(duration_winsorized),
    campaign_transformed = log1p(campaign_winsorized),
    previous_transformed = log1p(log1p(log1p(log1p(log1p(previous_winsorized))))), 
    
    # Standardize other variables
    emp_var_rate_standardized = scale(emp.var.rate),
    euribor3m_standardized = scale(euribor3m),
    nr_employed_standardized = scale(nr.employed)
  )

# Check skewness after transformations
skewness_after <- data.frame(
  Variable = c("duration_transformed", "campaign_transformed", "previous_transformed", 
               "nr_employed_standardized", "emp_var_rate_standardized", "euribor3m_standardized"),
  Skewness = c(
    skewness(bank$duration_transformed, na.rm = TRUE),
    skewness(bank$campaign_transformed, na.rm = TRUE),
    skewness(bank$previous_transformed, na.rm = TRUE),
    skewness(bank$nr_employed_standardized, na.rm = TRUE),
    skewness(bank$emp_var_rate_standardized, na.rm = TRUE),
    skewness(bank$euribor3m_standardized, na.rm = TRUE)
  )
)

print(skewness_after)
##                    Variable    Skewness
## 1      duration_transformed -0.09216888
## 2      campaign_transformed  0.78735043
## 3      previous_transformed  2.15047130
## 4  nr_employed_standardized -1.04422438
## 5 emp_var_rate_standardized -0.72406918
## 6    euribor3m_standardized -0.70916213
# Duration Before & After
boxplot_duration_before <- ggplot(bank, aes(y = duration)) + 
  geom_boxplot(fill = "lightblue", outlier.color = "red") + 
  ggtitle("Duration (Before Winsorization)")

boxplot_duration_after <- ggplot(bank, aes(y = duration_winsorized)) + 
  geom_boxplot(fill = "lightgreen", outlier.color = "red") + 
  ggtitle("Duration (After Winsorization)")

# Campaign Before & After
boxplot_campaign_before <- ggplot(bank, aes(y = campaign)) + 
  geom_boxplot(fill = "lightblue", outlier.color = "red") + 
  ggtitle("Campaign (Before Winsorization)")

boxplot_campaign_after <- ggplot(bank, aes(y = campaign_winsorized)) + 
  geom_boxplot(fill = "lightgreen", outlier.color = "red") + 
  ggtitle("Campaign (After Winsorization)")

# Previous Before & After
boxplot_previous_before <- ggplot(bank, aes(y = previous)) + 
  geom_boxplot(fill = "lightblue", outlier.color = "red") + 
  ggtitle("Previous (Before Winsorization)")

boxplot_previous_after <- ggplot(bank, aes(y = previous_winsorized)) + 
  geom_boxplot(fill = "lightgreen", outlier.color = "red") + 
  ggtitle("Previous (After Winsorization)")

# Combine the boxplots side by side
(boxplot_duration_before | boxplot_duration_after) /
(boxplot_campaign_before | boxplot_campaign_after) /
(boxplot_previous_before | boxplot_previous_after)

We identified some duplicates in EDA that we will remove now.

bank <- bank[!duplicated(bank), ] # remove all duplicate rows except the first occurrence

We will do a test to verify that they are removed!

sum(duplicated(bank)) 
## [1] 0

Dimensionality Reduction

Since we’re using logistic regression, it’s important to avoid multicollinearity, which can lead to unstable model results. The variables emp.var.rate, nr.employed, and euribor3m are highly correlated, so we can combine them into a single feature, economic_indicators, which represents their average. This preserves the information while reducing redundancy and multicollinearity. Another thing I considered is dropping the previous variable, since it is still quite skewed and appears very similar to the campaign variable.

bank <- bank %>%
  mutate(
    economic_indicators = (emp_var_rate_standardized + nr_employed_standardized + euribor3m_standardized) / 3
  ) %>%
  dplyr::select(-emp.var.rate, -nr.employed, -euribor3m, 
                -emp_var_rate_standardized, -nr_employed_standardized, -euribor3m_standardized)

head(bank)
##   age       job marital   education default housing loan   contact month
## 1  56 housemaid married    basic.4y      no      no   no telephone   may
## 2  57  services married high.school unknown      no   no telephone   may
## 3  37  services married high.school      no     yes   no telephone   may
## 4  40    admin. married    basic.6y      no      no   no telephone   may
## 5  56  services married high.school      no      no  yes telephone   may
## 6  45  services married    basic.9y unknown      no   no telephone   may
##   day_of_week duration campaign pdays previous    poutcome cons.price.idx
## 1         mon      261        1   999        0 nonexistent         93.994
## 2         mon      149        1   999        0 nonexistent         93.994
## 3         mon      226        1   999        0 nonexistent         93.994
## 4         mon      151        1   999        0 nonexistent         93.994
## 5         mon      307        1   999        0 nonexistent         93.994
## 6         mon      198        1   999        0 nonexistent         93.994
##   cons.conf.idx  y duration_transformed campaign_transformed
## 1         -36.4 no             5.568345            0.6931472
## 2         -36.4 no             5.010635            0.6931472
## 3         -36.4 no             5.424950            0.6931472
## 4         -36.4 no             5.023881            0.6931472
## 5         -36.4 no             5.730100            0.6931472
## 6         -36.4 no             5.293305            0.6931472
##   previous_transformed economic_indicators
## 1                    0           0.5640705
## 2                    0           0.5640705
## 3                    0           0.5640705
## 4                    0           0.5640705
## 5                    0           0.5640705
## 6                    0           0.5640705
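
As a sanity check that combining the three indicators actually reduced the redundancy, one option is a quick VIF check with the already-loaded car package (diagnostic only; this throwaway model is not the final one):

# Fit a small logistic model on numeric predictors and inspect variance inflation factors
vif_check <- glm(as.factor(y) ~ age + duration_transformed + campaign_transformed +
                   previous_transformed + cons.price.idx + cons.conf.idx + economic_indicators,
                 data = bank, family = binomial)
car::vif(vif_check)  # values close to 1 indicate little remaining multicollinearity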

Feature Engineering

I’ll transform pdays into a binary variable (previously_contacted)

bank$previously_contacted <- ifelse(bank$pdays == 999, 0, 1) # 999 means never contacted
bank <- bank %>%
  dplyr::select(-pdays)
table(bank$previously_contacted, bank$y)
##    
##        no   yes
##   0 35989  3672
##   1   548   967
prop.table(table(bank$previously_contacted, bank$y), margin = 1) * 100
##    
##            no       yes
##   0 90.741535  9.258465
##   1 36.171617 63.828383

Key insights from the code above: about 36,000 clients (~91%) who were not contacted before did not subscribe. But among those previously contacted, 967 (~64%) subscribed, which is far higher than the ~9% rate for first-time contacts. This suggests that relationship-building and follow-ups significantly improve conversion rates.

We can introduce a new variable, call_frequency, to track how intensively each customer has been contacted in this campaign relative to previous contacts; the idea is that repeated calls may influence their likelihood of subscribing (since previous is 0 for most clients, the ratio needs a small guard against division by zero). Our findings show that while previous contact increases subscription chances, too many calls can have the opposite effect. call_frequency can help find the right number of outreach attempts, the “golden ratio”, for effective outreach.

bank <- bank %>%
  mutate(call_frequency = campaign / (previous + 1))  # +1 guards against division by zero when previous = 0

Another variable we could create instead of economic_indicators is an economic stability score, built from the Consumer Price Index (cons.price.idx), the Employment Variation Rate (emp.var.rate), and the Euribor Rate (euribor3m), as they each capture a different element of economic stability: CPI captures inflation and the cost of living, emp.var.rate gives an idea of labour market stability, and euribor3m is the average interest rate at which European banks are prepared to lend, indicating overall economic conditions. Why is this important? Because economic stability is a crucial factor influencing customers’ ability to subscribe to products or services; high economic stability could lead to more people subscribing.
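
A sketch of what such a feature could look like (hypothetical: it assumes the raw economic columns are still present, so it is computed here on a fresh copy of the data rather than on the processed bank object, and the equal weighting of the three indicators is an arbitrary placeholder):

# Hypothetical economic_stability composite, built from a fresh copy of the raw data
bank_raw <- read.csv(url, sep = ";")
bank_raw <- bank_raw %>%
  mutate(economic_stability = (scale(cons.price.idx) + scale(emp.var.rate) + scale(euribor3m)) / 3)
summary(as.numeric(bank_raw$economic_stability))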

We have class imbalance that needs to be handled. The ROSE package loaded earlier offers resampling utilities, but here I simply undersample the majority class (“no”) down to the size of the minority class (“yes”).

table(bank$y)
## 
##    no   yes 
## 36537  4639
bank_yes <- bank[bank$y == "yes", ]
bank_no <- bank[bank$y == "no", ]

n_minority <- nrow(bank_yes)

bank_no_sampled <- bank_no[sample(nrow(bank_no), n_minority, replace = FALSE), ]


bank <- rbind(bank_yes, bank_no_sampled)


bank <- bank[sample(nrow(bank)), ]

table(bank$y)
## 
##   no  yes 
## 4639 4639

Data Transformation

Categorical Variables

I’ll separate the categorical variables into two groups: nominal (no inherent order) and ordinal (ordered categories). For the nominal variables — job (12 levels, e.g., “admin.”, “blue-collar”), default (3 levels: “no”, “unknown”, “yes”), housing (2 levels after the earlier imputation: “no”, “yes”), loan (2 levels after imputation: “no”, “yes”), contact (2 levels: “cellular”, “telephone”), month (10 levels, e.g., “apr”, “aug”), day_of_week (5 levels, e.g., “fri”, “mon”), poutcome (3 levels: “failure”, “nonexistent”, “success”), and y (2 levels: “no”, “yes”, the target variable) — I’ll use one-hot encoding. For the ordinal variables — marital (4 levels: “divorced”, “married”, “single”, “unknown”) and education (8 levels, e.g., “basic.4y”, “basic.6y”, “high.school”, “university.degree”, ordered basic < high school < professional course < university) — I’ll use label encoding. For example, marital status could be encoded as married = 1, single = 2, divorced = 3, unknown = 4.
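
A small sketch of that encoding plan (illustrative; the ordering chosen for the education levels, including where “illiterate” and “unknown” sit, is an assumption):

# One-hot encode a nominal variable with model.matrix (one indicator column per level)
job_onehot <- model.matrix(~ job - 1, data = bank)
head(job_onehot[, 1:4])

# Label-encode an ordinal variable by mapping its levels to integers (ordering assumed)
education_order <- c("illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school",
                     "professional.course", "university.degree", "unknown")
education_encoded <- as.integer(factor(bank$education, levels = education_order))
table(education_encoded, useNA = "ifany")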

Standardization

Continuous variables such as age, duration (transformed), campaign (transformed) and previous should be considered for standardization based on their scale and distribution. Age and duration should be standardized to ensure comparability.
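
A brief sketch of how that standardization could be applied (the column selection is indicative, not final):

# Standardize age and the transformed duration so they are on comparable scales
bank_scaled <- bank %>%
  mutate(
    age_standardized = as.numeric(scale(age)),
    duration_standardized = as.numeric(scale(duration_transformed))
  )
summary(bank_scaled$age_standardized)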

Feature Selection

For feature selection I’d choose the Recursive Feature Elimination (RFE) technique, which selects the most important features by systematically removing the least significant ones. It starts with all features, trains a model, ranks features by importance, then removes the weakest ones and retrains. This process repeats until it finds the optimal feature set (testing subsets of 5, 10, and 15 features) through cross-validation. RFE helps build a simpler, more effective model by keeping only the predictors that truly drive customer subscription behavior.
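
A sketch of how RFE could be set up with the caret package (an assumption: caret is not loaded above; only numeric predictors are used for illustration, so the subset sizes are scaled down — with the fully encoded feature set one would test 5, 10 and 15 features as described):

# Recursive Feature Elimination with 5-fold cross-validation on a few numeric predictors
library(caret)
set.seed(123)
rfe_predictors <- bank %>%
  dplyr::select(age, duration_transformed, campaign_transformed, previous_transformed,
                cons.price.idx, cons.conf.idx, previously_contacted)
rfe_ctrl <- rfeControl(functions = lrFuncs, method = "cv", number = 5)
rfe_result <- rfe(x = as.data.frame(rfe_predictors), y = as.factor(bank$y),
                  sizes = c(3, 5, 7), rfeControl = rfe_ctrl)
rfe_result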