DATA 622 Assignment 1

# Load necessary libraries
library(highcharter)
library(dplyr)
library(ggplot2)
library(corrplot)
library(caret)
library(tidyverse)
library(highcharter)
library(knitr)
library(kableExtra)

A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out most effective tactics that will help the bank in next campaign to persuade more customers to subscribe to the bank’s term deposit. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing

The first step is to load the dataset and review its structure to provide a summary of variable types and the first few records.

df <- read.csv("https://raw.githubusercontent.com/uplotnik/DATA-622/refs/heads/main/bank-full.csv",sep=";")

Exploratory Data Analysis

Quick look at the data

head(df,7)

##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
## 7  28   management  single  tertiary      no     447     yes  yes unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no
## 7   may      217        1    -1        0  unknown no

Inspect dataset structure

str(df)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

summary(df)

##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

dim(df)

## [1] 45211    17

desc_table <- data.frame(

  Var = c("age", "job", "marital", "education", "default", "balance", 

          "housing", "loan", "contact", "day", "month", "duration", 

          "campaign", "pdays", "previous", "poutcome", "y"),

  Desc = c("Age of the client", 

           "Occupation type", 

           "Marriage status", 

           "Highest education level of the client", 

           "Indicates if there is a credit default", 

           "Yearly average balance in euros", 

           "Possession of a housing loan", 

           "Possession of a personal loan", 

           "Type of communication contact", 

           "Day of the last contact", 

           "Month of the last contact", 

           "Duration of the last contact in seconds", 

           "Total number of contacts made during this campaign for the client", 

           "Days elapsed since the client was last contacted in a previous campaign (-1 means no previous contact)", 

           "Number of contacts before the current campaign for the client", 

           "Result of the previous marketing campaign", 

           "Indicates if the client has subscribed to a term deposit")

)


kable(desc_table, align = "ll", caption = "Description of Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
full_width = FALSE) %>%
column_spec(1, width = "2in") %>%
column_spec(2, width = "5in")

Description of Variables
Var	Desc
age	Age of the client
job	Occupation type
marital	Marriage status
education	Highest education level of the client
default	Indicates if there is a credit default
balance	Yearly average balance in euros
housing	Possession of a housing loan
loan	Possession of a personal loan
contact	Type of communication contact
day	Day of the last contact
month	Month of the last contact
duration	Duration of the last contact in seconds
campaign	Total number of contacts made during this campaign for the client
pdays	Days elapsed since the client was last contacted in a previous campaign (-1 means no previous contact)
previous	Number of contacts before the current campaign for the client
poutcome	Result of the previous marketing campaign
y	Indicates if the client has subscribed to a term deposit

The dataset comprises 45,211 instances with 17 features, including both numerical and categorical variables. Key variables include age, job type, marital status, education level, and previous campaign outcomes.

Features (Columns) Correlation, Graph & Insights Drawn

numeric_vars <- df %>% select_if(is.numeric)
corr_matrix <- cor(numeric_vars, use = "complete.obs")

hchart(corr_matrix, type = "heatmap", color = "RdYlGn") %>%
hc_title(text = "Correlation Matrix") %>%hc_xAxis(title = list(text = "Variables")) %>%hc_yAxis(title = list(text = "Variables"))

From the matrix, we can see a strong positive correlation betweenprevious and pdays. The correlation coefficient is approximately 0.45, indicating a moderate positive relationship.

The correlation between ‘age’ and ‘balance’ is 0.10 and correlation between campaignand day is 0.16. While this correlation is considered weak, it still suggests a potential relationship that could be explored further in a combined analysis. Also, even there is a weak correlation between age and balance, we can create a new variable age_balance in the data pre-processing step.

Other variables show weaker correlations and may not be suitable for combining without further investigation.

Explore relationship between selected variables

hc_scatter <- highchart() %>%
 hc_chart(type = "scatter") %>%
 hc_title(text = "Age vs Balance") %>%
hc_xAxis(title = list(text = "Age")) %>%
hc_yAxis(title = list(text = "Balance")) %>%
hc_add_series(data = df %>% select(age, balance) %>% 
mutate(age = as.numeric(age), campaign = as.numeric(balance)) %>% 
list_parse2(), name = "Age vs Balance")
hc_scatter

It looks like younger individuals are generally associated with lower bank balances, whereas older individuals tend to have higher balances. However, we notice substantial variability. A big portion of older individuals displays lower balances, and there are instances of younger individuals with higher balances. The wide distribution of the data points indicates a weak correlation between age and bank balance. Furthermore, the existence of outliers—data points that diverge markedly from the established trend—could be influenced by numerous factors that the chart does not account for.

hc_scatter <- highchart() %>%
hc_chart(type = "scatter") %>%
hc_title(text = "Day vs Campaign Contacts") %>%
hc_xAxis(title = list(text = "Day")) %>%
hc_yAxis(title = list(text = "Campaign Contacts")) %>%
hc_add_series(data = df %>% select(day, campaign) %>% 
mutate(age = as.numeric(day), campaign = as.numeric(campaign)) %>% 
list_parse2(), name = "Contacts")
hc_scatter

The chart shows that the number of contacts varies significantly from day to day. Some days have a high number of contacts, while others have a lower number. There is no clear trend or pattern in the data. The distribution of contacts appears somewhat random, with no obvious peaks or valleys.

df$pdays <- as.numeric(df$pdays)
df$previous <- as.numeric(df$previous)
df_grouped <- df %>%
group_by(pdays) %>%
summarise(mean_previous = mean(previous, na.rm = TRUE), .groups = 'drop')
hc_line <- highchart() %>%
hc_chart(type = "line") %>%
hc_title(text = "Average Previous vs. Pdays") %>%
hc_xAxis(title = list(text = "Pdays")) %>%
hc_yAxis(title = list(text = "Average Previous")) %>%
hc_add_series(data = list_parse2(df_grouped),
x = df_grouped$pdays,
y = df_grouped$mean_previous)
hc_line

The line chart provided illustrates the connection between the number of previous contacts a client has had and the number of days that have passed since their last contact in a prior campaign. The chart show us the generally inverse relationship; as the average number of previous contacts increases, the days since the last contact tend to decrease. This indicates that clients with a higher number of previous contacts are contacted more frequently.

highchart() %>%
hc_chart(type = "scatter") %>%
hc_title(text = "Scatter Plot: Age vs Duration") %>%
hc_xAxis(title = list(text = "Age")) %>%
hc_yAxis(title = list(text = "Duration")) %>%
hc_add_series(data = df %>% 
select(age, duration) %>% 
na.omit() %>% 
as.data.frame() %>% 
list_parse2(),
name = "Age-Duration")

The scatter plot illustrates the association between an individual’s age and the length of their most recent contact, quantified in seconds. The analysis indicates a negative correlation; as age rises, the length of the last contact generally diminishes. This implies that older individuals typically experience shorter durations of last contact in comparison to their younger counterparts. Overall, the plot illustrates a general trend of decreasing contact duration with increasing age.

Feature Distribution & Insights Drawn:

hc_hist <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Distribution of Age") %>%
hc_xAxis(title = list(text = "Age Groups")) %>%
hc_yAxis(title = list(text = "Frequency"))

age_breaks <- hist(df$age, plot = FALSE, breaks = 20)
hc_hist <- hc_hist %>% 
hc_add_series(name = "Age Distribution",
data = list_parse2(data.frame(x = age_breaks$mids, y = age_breaks$counts)))
hc_hist

It seems like significant portion of the population that is targeted for marketing campaign falls within 30-40 age group,. The frequency decreases as age increases, suggesting a right-skewed distribution. There are relatively few persons in the younger (20-30) and older (70+) age groups. The majority of the population is concentrated in the middle age range (30-60).

#Analyse Outliers
hc_box <- highchart() %>%
hc_chart(type = "boxplot") %>%
hc_title(text = "Boxplot for Age") %>%
hc_xAxis(categories = c("Age")) %>%
hc_add_series(data = list(unname(boxplot.stats(df$age)$stats)))

hc_box

The key statistics are:

Minimum: 18 Lower quartile: 33 Median: 39 Upper quartile: 48 Maximum: 70

Based on the boxplot, there are no outliers that significantly affect the mean. Robust statistics or data transformation would not be necessary in this case.

age_central_tendency <- df %>% 
summarise(mean_age = mean(age, na.rm = TRUE),
 median_age = median(age, na.rm = TRUE),
sd_age = sd(age, na.rm = TRUE))
print(age_central_tendency)

##   mean_age median_age   sd_age
## 1 40.93621         39 10.61876

The mean age is approximately 41 years, and the median age is 39 years. The standard deviation is 10.6 years. These values suggest a roughly normal distribution of age, with some skewness towards older individuals, given that the mean is slightly higher than the median. This age range is plausible for a customer base that a bank might target for term deposits, as it includes adults who are likely to have some financial stability.

hc_hist <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Distribution of Balance") %>%
hc_xAxis(title = list(text = "Balance")) %>%
hc_yAxis(title = list(text = "Frequency"))

balance_breaks <- hist(df$balance, plot = FALSE, breaks = 20)
hc_hist <- hc_hist %>% 
hc_add_series(name = "Balance Distribution",
data = list_parse2(data.frame(x = balance_breaks$mids, y = balance_breaks$counts)))
hc_hist

The histogram displays that the majority of balances fall within the 0-15k range. The distribution appears to be right-skewed, indicating a concentration of lower balances and a smaller number of significantly higher balances. The insight is that a large portion of the balances are relatively low, while a smaller number of accounts hold substantially larger balances. This suggests potential areas for further analysis, such as investigating the factors contributing to the higher balances or the implications of the skewed distribution for risk management or customer segmentation.

The histogram shows a right-skewed distribution of balances, with most balances concentrated in the lower ranges (0-20k) and a smaller number of significantly higher balances.

#Analyse Outliers
hc_box <- highchart() %>%
hc_chart(type = "boxplot") %>%
hc_title(text = "Boxplot for Balance") %>%
hc_xAxis(categories = c("Balance")) %>%
hc_add_series(data = list(unname(boxplot.stats(df$balance)$stats)))
hc_box

The boxplot displays the following key statistics for the “Balance” variable:

Maximum: 3462 Upper quartile: 1428 Median: 448 Lower quartile: 72 Minimum: -1944

The significant difference between the maximum value and the upper quartile, coupled with the minimum value being substantially lower than the lower quartile, indicates the presence of outliers. These outliers skew the mean, making it less representative of the central tendency of the data. Therefore, using robust measures of central tendency (like the median) or applying a transformation (like a logarithmic transformation) to reduce the influence of outliers is recommended for a more accurate analysis.

balance_central_tendency <- df %>% 

  summarise(mean_balance = mean(balance, na.rm = TRUE),

            median_balance = median(balance, na.rm = TRUE),

            sd_balance = sd(balance, na.rm = TRUE))

print(balance_central_tendency)

##   mean_balance median_balance sd_balance
## 1     1362.272            448   3044.766

The mean balance is 1362.272, while the median balance is 448. The standard deviation is 3044.766. The large difference between the mean and the median, coupled with a high standard deviation, indicates that the balance data is heavily skewed. This is common in financial datasets, where a small number of individuals hold a large proportion of the total wealth. A few customers with very high balances significantly raise the average, while most customers have balances closer to the median. This skewness is a typical characteristic of bank balance data.

Overall, the statistics provided appear reasonable and align with what might be expected in a bank marketing dataset. The age distribution seems centered around a plausible target demographic, and the balance data exhibits a common skewness observed in financial data.

Categorical Variables:

job_counts <- df %>% count(job) %>% arrange(desc(n))
hc_bar <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Distribution of Job Types") %>%
hc_xAxis(categories = job_counts$job, title = list(text = "Job Type")) %>%
hc_yAxis(title = list(text = "Count")) %>%
hc_add_series(name = "Count", data = job_counts$n)

hc_bar

The most prevalent job type is “blue-collar” (9732) followed by “management”(9458) and “technician” (7597). “Admin,” “retired,” and “services” represent a moderate number of persons. The remaining categories (“entrepreneur,” “housemaid,” “self-employed,” “student,” “unemployed,” and “unknown”) show significantly lower counts. The chart shows the distribution of job types, with blue-collar jobs being the most common, followed by management and technician roles. Other job types have significantly lower representation.

loan_counts <- df %>%
group_by(loan) %>%
summarise(count = n())

hc_loan <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Distribution of Personal Loan Status") %>%
hc_xAxis(title = list(text = "Personal Loan Status"),
categories = loan_counts$loan) %>%
hc_yAxis(title = list(text = "Number of Occurrences")) %>%
hc_add_series(data = loan_counts$count,
name = "Loan Status")
hc_loan

The majority of the 37,967 loans are in the “no” status, indicating a significantly larger number of loans that are not yet approved or completed compared to those in the “yes” status. The “yes” status represents a considerably smaller portion of the total number of loans. This suggests a potential imbalance or bottleneck in the loan approval process.

housing_counts <- df %>%
group_by(housing) %>%
summarise(count = n())
hc_housing <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Distribution of Housing Loan Status") %>%
hc_xAxis(title = list(text = "Housing Loan Status"),
categories = housing_counts$housing) %>%
hc_yAxis(title = list(text = "Number of Occurrences")) %>%
hc_add_series(data = housing_counts$count,
name = "Housing Status")
hc_housing

The chart shows that approximately 20K persons do not have a housing loan, while approximately 25K persons have a housing loan.

subscription_counts <- as.data.frame(table(df$y))
colnames(subscription_counts) <- c("Subscription", "Count")

subscription_counts <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Term Deposit Subscription Distribution") %>%
hc_xAxis(title = list(text = "Term Deposit Subscription Distribution"),
categories = subscription_counts$Subscription) %>%
hc_yAxis(title = list(text = "Number of Clients")) %>%
hc_add_series(data = subscription_counts$Count,
name = "Contact Status")
subscription_counts

Most clients have not subscribed to a term deposit, indicating an imbalanced dataset

# Count the occurrences of -1 in pdays

uncontacted_clients <- sum(df$pdays == -1)

# Check the total number of clients in the dataset

total_clients <- nrow(df)

percentage_uncontacted <- (uncontacted_clients / total_clients) * 100

# Prepare data for visualization

data_viz <- data.frame(
Status = c("Contacted", "Not Contacted"),
Count = c(total_clients - uncontacted_clients, uncontacted_clients)
)

data_viz <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Client Contact Status") %>%
hc_xAxis(title = list(text = "Client Contact Status"),
categories = data_viz$Status) %>%
hc_yAxis(title = list(text = "Number of Clients")) %>%
hc_add_series(data = data_viz$Count,
name = "Contact Status")
data_viz

A considerable portion of clients (almost 82%) shows a value of -1 for ‘pdays’, suggesting they were not reached out to prior to this campaign.

# Prepare the data
data_prepared <- df %>%
  select(duration, y) %>%
  group_by(y) %>%
  summarize(average_duration = mean(duration, na.rm = TRUE), count = n())

# Convert y to a factor for better visualization
data_prepared$y <- factor(data_prepared$y, levels = c("no", "yes"), labels = c("Not Subscribed", "Subscribed"))

# Create a bar plot to visualize the relationship
hchart(data_prepared, "column", hcaes(x = y, y = average_duration)) %>%
  hc_title(text = "Average Call Duration by Term Subscription Outcome") %>%
  hc_xAxis(title = list(text = "Subscription Outcome")) %>%
  hc_yAxis(title = list(text = "Average Call Duration (seconds)")) %>%
  hc_tooltip(pointFormat = "Average Duration: {point.y} seconds")

The chart suggests that calls resulting in longer conversations are more likely to lead to a positive outcome.

Patterns/Trends in Data

bank.loan <- df %>% filter(loan == "yes" | housing == "yes")
bank.no.loan <- df %>% filter(loan == "no" & housing == "no")

hist_loan <- hist(bank.loan$age, plot = FALSE)
hist_no_loan <- hist(bank.no.loan$age, plot = FALSE)

categories <- paste(hist_loan$breaks[-length(hist_loan$breaks)], 
                    "-", 
                    hist_loan$breaks[-1])
hc <- highchart() %>%
  hc_title(text = "Age vs Loan") %>%
  hc_xAxis(title = list(text = "Age"), categories = categories) %>%
  hc_yAxis(title = list(text = "Frequency")) %>%
  hc_add_series(name = "Loan", data = hist_loan$counts, type = "column") %>%
  hc_add_series(name = "No Loan", data = hist_no_loan$counts, type = "column") %>%
  hc_legend(enabled = TRUE)

hc

The chart reveals a trend where the number of people with loans is generally higher in the younger age groups (20-55). As age increases, the number of people with loans decreases. Conversely, the number of people without loans increases with age. This suggests a correlation between age and loan acquisition, with younger people being more likely to have loans than older people. The data suggests a possible relationship between age and financial needs or borrowing habits.

# Create subsets for clients with loans (either personal or housing loan)
bank.loan <- df %>% filter(loan == "yes" | housing == "yes")

# Create subsets for clients without loans (neither personal nor housing loan)
bank.no.loan <- df %>% filter(loan == "no" & housing == "no")

# Calculate job type counts and percentages for clients with loans
job_counts_loan <- bank.loan %>% 
  group_by(job) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Calculate job type counts and percentages for clients without loans
job_counts_no_loan <- bank.no.loan %>%
  group_by(job) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Merge the two dataframes for combined plotting
job_counts_loan$LoanStatus <- "With Loan"
job_counts_no_loan$LoanStatus <- "Without Loan"

job_counts_combined <- bind_rows(job_counts_loan, job_counts_no_loan)

# Create a highcharter column chart for combined data
highchart() %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Job Distribution by Loan Status") %>%
  hc_xAxis(categories = unique(job_counts_combined$job), title = list(text = "Job Type")) %>%
  hc_yAxis(title = list(text = "Percentage"), min = 0) %>%
  hc_plotOptions(column = list(grouping = FALSE, shadow = FALSE, borderWidth = 0.5, borderColor = 'grey')) %>%
  hc_tooltip(shared = FALSE, formatter = JS("function () { return '<b>' + this.point.category + '</b><br>' + this.series.name + ': ' + Highcharts.numberFormat(this.point.y, 2) + '%';}")) %>%
  hc_add_series(name = "With Loan", data = job_counts_combined$Percentage[job_counts_combined$LoanStatus == "With Loan"],
                dataLabels = list(enabled = FALSE),
                pointPlacement = -0.2) %>%
  hc_add_series(name = "Without Loan", data = job_counts_combined$Percentage[job_counts_combined$LoanStatus == "Without Loan"],
                dataLabels = list(enabled = FALSE),
                pointPlacement = 0.2)

The trend shows that a higher percentage of people in management, blue-collar, and admin. jobs have loans compared to other job types. Conversely, a larger percentage of people in jobs like housemaid, self-employed, and unemployed do not have loans. There’s a notable difference in loan distribution across various job categories, suggesting a correlation between job type and loan acquisition. The data suggests that those in higher-paying or more stable jobs are more likely to have loans.

# Calculate marital status counts and percentages for clients with loans
marital_counts_loan <- bank.loan %>%
  group_by(marital) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Calculate marital status counts and percentages for clients without loans
marital_counts_no_loan <- bank.no.loan %>%
  group_by(marital) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Merge the two dataframes for combined plotting
marital_counts_loan$LoanStatus <- "With Loan"
marital_counts_no_loan$LoanStatus <- "Without Loan"

marital_counts_combined <- bind_rows(marital_counts_loan, marital_counts_no_loan)

# Create a highcharter column chart for combined data
highchart() %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Marital Status by Loan Status Distribution") %>%
  hc_xAxis(categories = unique(marital_counts_combined$marital), title = list(text = "Marital Status")) %>%
  hc_yAxis(title = list(text = "Percentage"), min = 0) %>%
  hc_plotOptions(column = list(grouping = FALSE, shadow = FALSE, borderWidth = 0.5, borderColor = 'gray')) %>%
  hc_tooltip(shared = FALSE, formatter = JS("function () { return '<b>' + this.point.category + '</b><br>' + this.series.name + ': ' + Highcharts.numberFormat(this.point.y, 2) + '%';}")) %>%
  hc_add_series(name = "With Loan", data = marital_counts_combined$Percentage[marital_counts_combined$LoanStatus == "With Loan"],
                dataLabels = list(enabled = FALSE),
                pointPlacement = -0.2) %>%
  hc_add_series(name = "Without Loan", data = marital_counts_combined$Percentage[marital_counts_combined$LoanStatus == "Without Loan"],
                dataLabels = list(enabled = FALSE),
                pointPlacement = 0.2)

The data indicates that a greater proportion of married individuals hold loans in comparison to those who are single or divorced. A notable number of single individuals do not have any loans, and a significant percentage of divorced individuals also remain loan-free. The rate of married individuals with loans is markedly higher than that of other marital categories.

Let’s see if amount of balance matters on the loan borrow.

bank.loan <- df %>% filter(loan == "yes" | housing == "yes")

# Create subsets for clients without loans (neither personal nor housing loan)
bank.no.loan <- df %>% filter(loan == "no" & housing == "no")
density_loan <- density(bank.loan$balance)

# Calculate density for bank.no.loan$balance


density_no_loan <- density(bank.no.loan$balance)

# Create a highcharter line chart
highchart() %>%
  hc_title(text = "Distribution of Balance") %>%
  hc_xAxis(title = list(text = "Average Yearly Balance (Euros)")) %>%
  hc_yAxis(title = list(text = "Density")) %>%
  hc_add_series(data = data.frame(x = density_loan$x, y = density_loan$y),
                type = "line",
                name = "Loan",
                color = "lightgreen") %>%
  hc_add_series(data = data.frame(x = density_no_loan$x, y = density_no_loan$y),
                type = "line",
                name = "No Loan",
                color = "grey") %>%
  hc_legend(enabled = TRUE,
            align = "right",
            verticalAlign = "top",
            layout = "vertical")

The chart illustrates the distribution of average annual balances in Euros for two distinct categories: individuals with loans and those without.

It effectively highlights a disparity in the distribution of average yearly balances between loan holders and non-loan holders, with the loan holder group demonstrating higher average balances.

edu_y <- df %>% 
group_by(education, y) %>% 
summarise(count = n(), .groups = "drop")
hc_grouped <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Term Deposit Subscription by Education Level") %>%
hc_xAxis(categories = unique(edu_y$education), title = list(text = "Education"))

levels_y <- unique(edu_y$y)
for(lev in levels_y){
data_series <- edu_y %>% filter(y == lev) %>% arrange(education) %>% pull(count)
hc_grouped <- hc_grouped %>% hc_add_series(name = as.character(lev), data = data_series)

}
hc_grouped

The bar chart displays the number of term deposit subscriptions categorized by education level and whether the person has subscribed (yes/no).

The trend shows that the number of subscriptions is highest among those with secondary education, followed by tertiary education, then primary education, and lastly, those with unknown education levels. Within each education level, the number of “yes” responses (subscriptions) significantly outweighs the number of “no” responses.

In summary, the trend indicates a positive correlation between higher education levels and term deposit subscriptions. More people with secondary and tertiary education levels subscribed compared to those with primary education or unknown education levels.

married <- df %>%
select(marital, y) %>%                  
filter(!is.na(marital), !is.na(y))       
summary_data <- married %>%
group_by(marital, y) %>%
summarize(count = n()) %>%
ungroup()

## `summarise()` has grouped output by 'marital'. You can override using the
## `.groups` argument.

# Create a bar plot visualizing the subscription rates by marital status

highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Term Deposit Subscription by Marital Status") %>%
hc_yAxis(title = list(text = "Number of Subscriptions")) %>%
hc_add_series(name = "Subscribed", 
data = summary_data$count[summary_data$y == "yes"]) %>%
hc_add_series(name = "Not Subscribed", 
data = summary_data$count[summary_data$y == "no"]) %>%
hc_tooltip(table = TRUE)

The chart shows the distribution of term deposit subscriptions across different marital statuses, with married individuals showing the highest overall numbers, both subscribed and not subscribed.Single individuals show the lowest number of subscriptions.

job_y <- df %>% 
group_by(job, y) %>% 
summarise(count = n(), .groups = "drop")
hc_grouped <- highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Term Deposit Subscription by Job Title") %>%
hc_xAxis(categories = unique(job_y$job), title = list(text = "Term Deposit Subscription"))

levels_y <- unique(job_y$y)
for(lev in levels_y){
data_series <- job_y %>% filter(y == lev) %>% arrange(job) %>% pull(count)
hc_grouped <- hc_grouped %>% hc_add_series(name = as.character(lev), data = data_series)

}
hc_grouped

The majority of people in each job category do not have a term deposit subscription.There is no substantial evidence of relationships between job categories and subscription.

# Summarize the data
summary_data <- df %>%
  group_by(poutcome, y) %>% 
  summarise(count = n()) %>%
  ungroup() %>%
  mutate(
    subscription_status = ifelse(y == "yes", "Subscribed", "Not Subscribed")
  )

## `summarise()` has grouped output by 'poutcome'. You can override using the
## `.groups` argument.

# Create a highcharter plot

highcharter_plot <- highchart() %>%

  hc_title(text = "Relationship Between Previous Campaign Outcomes and Term Subscription") %>%

  hc_xAxis(categories = unique(summary_data$poutcome), title = list(text = "Previous Campaign Outcome")) %>%

  hc_yAxis(title = list(text = "Number of Clients")) %>%

  hc_add_series(name = "Subscribed", data = summary_data$count[summary_data$subscription_status == "Subscribed"], type = "column") %>%

  hc_add_series(name = "Not Subscribed", data = summary_data$count[summary_data$subscription_status == "Not Subscribed"], type = "column") %>%

  hc_plotOptions(column = list(dataLabels = list(enabled = TRUE))) %>%

  hc_tooltip(shared = TRUE, valueSuffix = " clients") %>%

  hc_legend(enabled = TRUE)

# Render the plot

highcharter_plot

The chart shows that a significant number of clients who had an unknown previous campaign outcome subscribed. A substantial number of clients with successful previous campaigns also subscribed. Conversely, a smaller number of clients with unsuccessful campaigns subscribed. The number of clients with “other” previous campaign outcomes who subscribed is also relatively small.

campaign_trend <- df%>%group_by(month) %>%

      summarise(contacts = n())

hchart(campaign_trend, "line", hcaes(x = month, y = contacts),name = "Contacts") %>%
hc_title(text = "Campaign Trend by Month") %>%
hc_xAxis(title = list(text = "Month")) %>%
hc_yAxis(title = list(text = "Number of Contacts"))

The graph shows fluctuations in the number of contacts throughout the year. There is a peak in the number of contacts around July, followed by a decline and then a rise again in May, reaching 13,766 per month. The lowest point seems to be in September or October.

Central Tendency and Spread

numeric_summary <- numeric_vars %>% summarise_all(list(mean = ~mean(., na.rm = TRUE),

                                                            median = ~median(., na.rm = TRUE),

                                                            sd = ~sd(., na.rm = TRUE),

                                                            IQR = ~IQR(., na.rm = TRUE)))

    print(numeric_summary)

##   age_mean balance_mean day_mean duration_mean campaign_mean pdays_mean
## 1 40.93621     1362.272 15.80642      258.1631      2.763841   40.19783
##   previous_mean age_median balance_median day_median duration_median
## 1     0.5803234         39            448         16             180
##   campaign_median pdays_median previous_median   age_sd balance_sd   day_sd
## 1               2           -1               0 10.61876   3044.766 8.322476
##   duration_sd campaign_sd pdays_sd previous_sd age_IQR balance_IQR day_IQR
## 1    257.5278    3.098021 100.1287    2.303441      15        1356      13
##   duration_IQR campaign_IQR pdays_IQR previous_IQR
## 1          216            2         0            0

Missing Values & Significance

The identification and significance of missing values is critical for assessing data quality. The is.na() function can be utilized to detect any missing values across variables, and the extent of missing data is quantified.

missing_values <- sapply(df, function(x) sum(is.na(x)))
print(missing_values)

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

duplicate_rows <- sum(duplicated(df))
print(paste("Number of duplicate rows:", duplicate_rows))

## [1] "Number of duplicate rows: 0"

There is no missing values in any column of the dataset. However, I will review and count “unknown” values.

df %>% 
  summarise_all(list(~sum(. == "unknown"))) %>% 
  gather(key = "variable", value = "nr_unknown") %>% 
  arrange(-nr_unknown)

##     variable nr_unknown
## 1   poutcome      36959
## 2    contact      13020
## 3  education       1857
## 4        job        288
## 5        age          0
## 6    marital          0
## 7    default          0
## 8    balance          0
## 9    housing          0
## 10      loan          0
## 11       day          0
## 12     month          0
## 13  duration          0
## 14  campaign          0
## 15     pdays          0
## 16  previous          0
## 17         y          0

There are 4 variables that have missing values in the dataset.

Upon reviewing the contact feature, I conclude that the “unknown” values are not particularly significant. The choice of communication channels for engaging individuals during a marketing campaign is less important than our ability to connect with and involve potential customers. Conversely, the poutcome feature carries considerable weight, as it reveals the effectiveness of prior marketing strategies. The high incidence of “unknown” values in this area is concerning.

Algorithm Selection

The dataset is designed to forecast whether a customer will opt for a term deposit, categorizing it as a supervised learning problem (classification) with binary outcomes. For selecting algorithms, I would evaluate Logistic Regression and Random Forest as potential options.

Pros and Cons of each algorithm:

Logistic Regression:

Pros:

Easy to Understand: The simple coefficients make it clear how different factors affect the chances of signing up.
Fast Training: It trains quickly, which is great for handling large datasets.
Good Starting Point: It’s often a solid first choice for binary classification problems.

Cons:

Limited to Linear Relationships: It struggles with data that has complex, non-linear patterns.
Sensitive to Outliers & Multicollinearity: It needs some preprocessing to keep performance steady.

Random Forest:

Pros:

Handles Non-linear Relationships: It captures complex interactions between features using multiple decision trees.
Strong Against Outliers: Generally, it’s robust to noise and outliers.
Feature Importance Insights: It helps identify which features are most significant, which can be useful for further analysis and marketing.

Cons:

Less Interpretable: It’s harder to interpret than logistic regression because of its ensemble nature.
More Resource Demanding: It needs more computational power, but that’s usually manageable with modern hardware, especially for medium to large datasets.
Risk of Overfitting: While it’s usually strong, it might need some tuning to avoid overfitting in certain cases.

Given that the dataset is of moderate to large size, computational inefficiency is less of a concern. Logistic regression can quickly establish a baseline model and shed light on feature impacts. Meanwhile, Random Forest’s ability to model non-linear relationships can uncover complex patterns that are valuable considering the variety of marketing features.

Should the dataset be much smaller, the recommendation could be altered. In cases where we have less data, Logistic Regression is generally preferred since it can reduced risk of overfitting. On the other hand, the complexity of Random Forest models can lead to overfitting when data is limited. Thus, a smaller dataset would be better suited for algorithms that utilize fewer parameters or strong regularization.

To summarize, I recommend Random Forest for its strength in capturing nonlinear interactions and complex relationships in the dataset. For datasets under 1,000 records, Logistic Regression is likely the more suitable option.

Data pre-processing

Data Cleaning

To elevate the quality of the dataset let’s convert “unknown” to NA where applicable, and remove duplicates if any.

clean_data<-df

# Check for missing values across the dataset

sapply(clean_data, function(x) sum(is.na(x)))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

# Convert "unknown" to NA where applicable

clean_data[clean_data == "unknown"] <- NA

# Remove duplicates if any

clean_data <- unique(clean_data)

Dimensionality Reduction

Check correlation matrix and look for high correlations if any.

# Select only numeric columns for correlation analysis

numeric_vars <- sapply(clean_data, is.numeric)

cor_matrix <- cor(clean_data[, numeric_vars], use="complete.obs")

cor_matrix

##                   age      balance          day     duration     campaign
## age       1.000000000  0.097782739 -0.009120046 -0.004648428  0.004760312
## balance   0.097782739  1.000000000  0.004502585  0.021560380 -0.014578279
## day      -0.009120046  0.004502585  1.000000000 -0.030206341  0.162490216
## duration -0.004648428  0.021560380 -0.030206341  1.000000000 -0.084569503
## campaign  0.004760312 -0.014578279  0.162490216 -0.084569503  1.000000000
## pdays    -0.023758014  0.003435322 -0.093044074 -0.001564770 -0.088627668
## previous  0.001288319  0.016673637 -0.051710497  0.001203057 -0.032855290
##                 pdays     previous
## age      -0.023758014  0.001288319
## balance   0.003435322  0.016673637
## day      -0.093044074 -0.051710497
## duration -0.001564770  0.001203057
## campaign -0.088627668 -0.032855290
## pdays     1.000000000  0.454819635
## previous  0.454819635  1.000000000

In our dataset I don’t see any highly correlated variables, so there is nothing to remove based on variability feature.

Feature Engineering

Bin the ‘age’ feature into categories like Young, middle-aged, and oldto capture non-linear patterns
Create interaction terms by multiplying age by balance
Create new feature long_call based on call duration
Corrected marital status inconsistency
Changed -1 to NA in pdays feature

# Create interaction term: age * balance

clean_data$age_balance <- clean_data$age * clean_data$balance

# Bin age into categories

clean_data$age_category <- cut(clean_data$age, breaks = c(0, 30, 50, 100), labels = c("Young", "Middle-Aged", "Old"))

#Create a new feature based on duration 
clean_data <- clean_data %>% mutate(long_call = if_else(duration > median(duration, na.rm = TRUE), "yes", "no"))

clean_data$long_call <- as.factor(clean_data$long_call)

# Correct marital status inconsistency

df$marital[df$marital == "divorced"] <- "divorced/widowed"

# Handle pdays (-1 means client was not previously contacted)

df$pdays[df$pdays == -1] <- NA



clean_data<-df
head(clean_data,10)

##    age          job          marital education default balance housing loan
## 1   58   management          married  tertiary      no    2143     yes   no
## 2   44   technician           single secondary      no      29     yes   no
## 3   33 entrepreneur          married secondary      no       2     yes  yes
## 4   47  blue-collar          married   unknown      no    1506     yes   no
## 5   33      unknown           single   unknown      no       1      no   no
## 6   35   management          married  tertiary      no     231     yes   no
## 7   28   management           single  tertiary      no     447     yes  yes
## 8   42 entrepreneur divorced/widowed  tertiary     yes       2     yes   no
## 9   58      retired          married   primary      no     121     yes   no
## 10  43   technician           single secondary      no     593     yes   no
##    contact day month duration campaign pdays previous poutcome  y
## 1  unknown   5   may      261        1    NA        0  unknown no
## 2  unknown   5   may      151        1    NA        0  unknown no
## 3  unknown   5   may       76        1    NA        0  unknown no
## 4  unknown   5   may       92        1    NA        0  unknown no
## 5  unknown   5   may      198        1    NA        0  unknown no
## 6  unknown   5   may      139        1    NA        0  unknown no
## 7  unknown   5   may      217        1    NA        0  unknown no
## 8  unknown   5   may      380        1    NA        0  unknown no
## 9  unknown   5   may       50        1    NA        0  unknown no
## 10 unknown   5   may       55        1    NA        0  unknown no

Sampling Data

I will resize the dataset using Simple Random Sampling method (80% of the data)

set.seed(123) 

sample_index <- sample(1:nrow(clean_data), 0.8 * nrow(clean_data))

train_data <- clean_data[sample_index, ]

test_data <- clean_data[-sample_index, ]

# Display the sampled data

dim(train_data)

## [1] 36168    17

dim(test_data)

## [1] 9043   17

Data Transformation

I will handle regularization, normalization, and encoding of categorical variables. For normalization, I am going to scale numeric columns and for categorical variables create dummy variables using caret’s dummyVars.

# Normalize numeric features (e.g., 'age', 'balance', 'duration', etc.)

num_vars <- names(clean_data)[sapply(clean_data, is.numeric)]

data_scaled <- clean_data

data_scaled[num_vars] <- scale(clean_data[num_vars])



dummies <- dummyVars(~ ., data = data_scaled, fullRank = TRUE)

data_transformed <- as.data.frame(predict(dummies, newdata = data_scaled))

table(clean_data$y)

## 
##    no   yes 
## 39922  5289

# Convert proper columns to factors

factor_columns <- c("job", "marital", "education", "default", "housing", 

                    "loan", "contact", "month", "poutcome", "y")

clean_data[factor_columns] <- lapply(clean_data[factor_columns], as.factor)

# Verify structure

str(clean_data)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced/widowed",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ previous : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Handle Imbalance Data

library(ROSE)

## Warning: package 'ROSE' was built under R version 4.3.3

## Loaded ROSE 0.0-4

set.seed(123)

bank_balanced <- ROSE(y ~ ., data = clean_data, seed = 123)$data

table(bank_balanced$y)

## 
##   no  yes 
## 4179 4078

Essay

1. Exploratory data analysis

The Bank Marketing dataset from the UCI Machine Learning Repository provides a rich source of information for analysis. The dataset comprises 45,211 instances with 17 features, including both numerical and categorical variables. Key variables include age, job type, marital status, education level, and previous campaign outcomes. A preliminary examination of the dataset reveals no missing values, which simplifies initial data handling. However, some categorical attributes have “unknown” labels, which could be considered as missing values in a broader sense.

The dataset’s general information indicates that the clients’ ages range from 19 to 87, with an average age of approximately 41 years. Significant portion of the population that is targeted for marketing campaign falls within the range of 30-40 age group. The dataset reveals that people with loans is generally higher in the younger age groups (20-55).Younger people being more likely to have loans than older people. The trend shows that a higher percentage of people in management, blue-collar, and administrative jobs have loans compared to other job types. Also, higher percentage of married persons have loans compared to single or divorced persons.

More people with secondary and tertiary education levels subscribed to term deposit compared to those with primary education or unknown education levels. However when we analyse job title and term deposit subscription there is no substantial evidence of relationships between job categories and subscription.

A considerable portion of clients shows a value of -1 for ‘pdays’, suggesting they were not reached out to prior to this campaign. Most clients have not subscribed to a term deposit, indicating an imbalanced dataset. Exploration reveals several critical issues and findings. The target variable ‘y’ is imbalanced, with a significantly higher number of clients not subscribing to a term deposit compared to those who do. This imbalance needs to be addressed during pre-processing to prevent biased model outcomes.

The distribution of contacts across months is uneven, with May having the highest number of last contacts. Also, the analysis shows that clients with a higher number of previous contacts are contacted more frequently.The older individuals typically experience shorter durations of last contact in comparison to their younger counterparts.

In conclusion, the EDA provides insights into potentially useful features, but also reveals that the dataset has several issues that need to be addressed during pre-processing, including class imbalance, uneven distribution of contacts across months, and potential outliers in numerical features.

2. Algorithm selection

Based on the attributes of the bank marketing dataset and the findings from the exploratory data analysis, a range of machine learning algorithms can be employed to accurately predict term deposit subscriptions. This study focuses on binary classification to determine client subscription decisions (‘yes’ or ‘no’), making Logistic Regression and Random Forest appropriate choices for this objective.

The chosen algorithms reflect sound reasoning based on the findings from the exploratory data analysis (EDA). The exploratory data analysis showcased how the duration of the last contact along with different demographic features strongly affect subscription decisions.

Logistic Regression operates as a linear model which calculates binary outcome probabilities while delivering clear results with efficient performance.

It generates coefficients that indicate the importance and direction of each feature’s influence on the result.

On the other hand, Random Forest is an ensemble learning technique that merges multiple decision trees to enhance prediction accuracy and minimize overfitting risks. This method is effective with both categorical and numerical data. Random Forest can manage imbalanced datasets well and less prone to overfitting. Based on the exploratory data analysis, Random Forests help identify feature importance, emphasizing the variables that most affect the likelihood of subscription. This insight allows marketing teams to concentrate on the key elements, improving their targeting strategies for campaigns.To summarize, I. would use Random Forest for its strength in capturing nonlinear interactions and complex relationships in the dataset.

3. Data Pre-processing

It is very important to address the concerns that we identified during the Exploratory Data Analysis (EDA) in order to enhance the model’s effectiveness.

The first step was to address the ‘unknown’ values issue and make sure there is no duplicates in the data. In order to enhance modeling process and interpretability, some variables inconsistency was corrected ( e.g. marital status, pdays) and new features has been created (e.g age_balance, long_call).

To get the data ready for modeling, numeric variables has been scaled and the categorical variables changed into a numerical format. Several columns turned into factors, which is super important for prepping the data for training the model. The other step in data preparation included data sampling, by applying Simple Random Sampling method.

On top of that, I tackled the class imbalance issue in the dataset. By using the ROSE package to balance the classes, the model can learn from both groups more effectively, which helps it make more accurate and reliable predictions. This step ensures that the model doesn’t lean too much towards the majority class, making it better at handling new data.

4. Business insight & recommendations

Customer segmentation in banking heavily relies on demographic data as its core component. Analysis of this customer dataset demonstrates that demographics including age level, educational attainment and marital status significantly impact customer behavior patterns toward term deposit subscriptions. According to the study results married clients show better participation in term deposits which could be attributed to their more secure financial situations or shared financial goals. Banks can leverage this development to create specialized marketing programs aimed at married couples which address their unique financial needs while highlighting joint investment opportunities.

Data analysis reveals that people in high-level management positions show a greater tendency to subscribe to term deposits compared to individuals in lower-paying positions. Banks have the potential to create specialized financial products and services for wealthy professionals by providing distinctive investment options and customized financial counsel that match their lifestyle ambitions.

Banks need to create marketing campaigns that reach people aged 30-40 through targeted strategies designed exclusively for this demographic. The method may require banks to use popular social media platforms as well as other favored channels among this age group while developing personalized messages that match their financial goals.

Those who complete secondary or tertiary educational programs display higher subscription rates. Marketing strategies should focus on promoting financial literacy as well as the benefits of investment and educational programs to boost term deposit subscriptions.

Data analysis demonstrates that the duration of the final interaction serves as an essential predictor of future subscription rates. Extended conversations tend to lead to better outcomes. Subscribed individuals experience calls that last about 537 seconds on average while non-subscribed individuals experience calls that last approximately 200 seconds. Therefore marketing strategies should focus on training call center staff in effective customer engagement methods so that they can maintain client interest during the entire discussion. Personalized conversations that address individual client needs achieve effective results.

The results from the prior marketing effort possess significant value. The likelihood of clients renewing their subscriptions increases if they have responded positively before which means marketing strategies should focus on proven successful campaigns to boost conversion rates. From the 33,573 clients who remained uncertain after earlier marketing efforts 3,386 customers eventually subscribed. Out of 1,511 clients who had experienced success in previous campaigns, 978 chose to subscribe.