Data 622 Assignment 1

Introduction

In this project, we explore the Bank Marketing dataset from the UCI Machine Learning Repository to analyze customer behavior and determine which factors influence the likelihood of subscribing to a term deposit. Through Exploratory Data Analysis (EDA), we examine key characteristics such as feature correlations, data distribution, missing values, and potential outliers. We then apply Machine Learning models, including Random Forest and Logistic Regression, to understand the most influential features driving customer subscription. Our findings aim to provide actionable insights to help financial institutions optimize their marketing strategies by targeting the right customers at the right time.

# Needed packages
library(reticulate)
library(dplyr)
library(ggplot2)
library(corrplot)
library(randomForest)
library(caret)

# Now install the Python package 'ucimlrepo' via reticulate:
py_install("ucimlrepo")

## Using virtual environment "C:/Users/PC/Documents/.virtualenvs/r-reticulate" ...

## + "C:/Users/PC/Documents/.virtualenvs/r-reticulate/Scripts/python.exe" -m pip install --upgrade --no-user ucimlrepo

This section installs the ucimlrepo package in Python using the reticulate package in R. The ucimlrepo package provides access to machine learning datasets from the UCI ML Repository. Since reticulate allows seamless integration between R and Python, this step ensures that we can fetch datasets directly from Python into R.
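Because py_install() reinstalls or upgrades the package on every knit, it can be worth guarding the call. The sketch below is one way to do that; ensure_py_module is a hypothetical helper name (not part of reticulate), and it assumes reticulate::py_module_available() and reticulate::py_install() are the checks we want. The two function arguments exist only so the logic can be exercised without a Python installation.

```r
# Hypothetical helper: install a Python module through reticulate only when
# it is not already importable. The defaults are evaluated lazily, so
# reticulate is only touched when the caller does not supply replacements.
ensure_py_module <- function(module,
                             is_available = reticulate::py_module_available,
                             install = reticulate::py_install) {
  if (is_available(module)) {
    return(invisible(FALSE))  # already present, nothing was installed
  }
  install(module)
  invisible(TRUE)             # an install was triggered
}

# Intended use in this report (needs a working Python environment):
# ensure_py_module("ucimlrepo")
```

With this guard, repeated runs of the notebook skip the pip step once the module is present.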

In the following section we run Python commands from within R using py_run_string(). The script first imports the function fetch_ucirepo, which retrieves datasets from the UCI Machine Learning Repository. The Bank Marketing dataset is fetched using its unique dataset ID (222). The dataset is split into X (features) and y (target variable), with additional metadata stored separately.

# Fetch the dataset from the UCI repository
py_run_string("
from ucimlrepo import fetch_ucirepo
bank_marketing = fetch_ucirepo(id=222)
features = bank_marketing.data.features
target = bank_marketing.data.targets
metadata = bank_marketing.metadata
variables = bank_marketing.variables
")

# Assign more meaningful names
bank_features <- py$features   # Independent variables
bank_target <- py$target       # Dependent variable (Target)
metadata <- py$metadata        # Metadata about the dataset
variables <- py$variables      # Variable descriptions

Data overview

# Combine features and target into a single dataset
bank_data <- dplyr::bind_cols(bank_features, target = bank_target)
bank_data$target <- as.factor(bank_data$y)

# Check structure and summary
str(bank_data)

## 'data.frame':    45211 obs. of  18 variables:
##  $ age        : num  58 44 33 47 33 35 28 42 58 43 ...
##  $ job        : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital    : chr  "married" "single" "married" "married" ...
##  $ education  : chr  "tertiary" "secondary" "secondary" NA ...
##  $ default    : chr  "no" "no" "no" "no" ...
##  $ balance    : num  2143 29 2 1506 1 ...
##  $ housing    : chr  "yes" "yes" "yes" "yes" ...
##  $ loan       : chr  "no" "no" "yes" "no" ...
##  $ contact    : chr  NA NA NA NA ...
##  $ day_of_week: num  5 5 5 5 5 5 5 5 5 5 ...
##  $ month      : chr  "may" "may" "may" "may" ...
##  $ duration   : num  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays      : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome   : chr  NA NA NA NA ...
##  $ y          : chr  "no" "no" "no" "no" ...
##  $ target     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "pandas.index")=RangeIndex(start=0, stop=45211, step=1)

The dataset consists of 45,211 observations and 16 predictor variables, along with a target variable y, which indicates whether a client subscribed to a term deposit ("yes" or "no"). The dataset captures various demographic, financial, and contact-related attributes of customers who were part of a bank marketing campaign. Below is a brief explanation of the key variables:

Demographic Information

- age: Numeric variable representing the client’s age.
- job: Categorical variable describing the client’s profession (e.g., “management,” “technician,” “blue-collar”).
- marital: Categorical variable indicating the client’s marital status ("married", "single", "divorced").
- education: Categorical variable capturing the education level ("tertiary", "secondary", "primary", or NA if missing).

Financial Information

- balance: Numeric variable showing the customer’s average yearly account balance (in euros).
- default: Categorical variable indicating whether the customer has credit in default ("yes" or "no").
- housing: Categorical variable indicating whether the client has a housing loan ("yes" or "no").
- loan: Categorical variable indicating whether the client has a personal loan ("yes" or "no").

Campaign-Related Information

- contact: Categorical variable specifying the communication type ("cellular" or "telephone"); it contains NA values where the contact type was missing or unrecorded.
- day_of_week: Numeric variable representing the last contact day. Despite the column name, the original bank-full data documents this field as the day of the month (1-31), so it should not be read as a weekday.
- month: Categorical variable indicating the last contact month ("may", "jun", etc.).
- duration: Numeric variable recording the last contact duration (in seconds). Longer call durations typically indicate a higher likelihood of subscription.
- campaign: Numeric variable showing the number of times a client was contacted during the current campaign.

Past Campaign Information

- pdays: Numeric variable indicating the number of days since the client was last contacted in a previous campaign (-1 means they were never contacted before).
- previous: Numeric variable representing the number of contacts made before this campaign.
- poutcome: Categorical variable capturing the outcome of the previous campaign ("success", "failure", "other", or NA if not contacted previously).

Target Variable

- y: Categorical variable ("yes" or "no") indicating whether the client subscribed to a term deposit.

The final goal is to predict whether a given client is likely to subscribe to a term deposit, allowing the bank to optimize future marketing strategies.

Missing Data

First, we look at missing data and find that all of the missing values are in categorical variables.

# Calculate the number of missing values per column
missing_values <- colSums(is.na(bank_data))

# Print the result
print(missing_values)

##         age         job     marital   education     default     balance 
##           0         288           0        1857           0           0 
##     housing        loan     contact day_of_week       month    duration 
##           0           0       13020           0           0           0 
##    campaign       pdays    previous    poutcome           y      target 
##           0           0           0       36959           0           0

Summary of Missing Values

Column	Missing Values
job	288
education	1,857
contact	13,020
poutcome	36,959
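Raw counts are easier to judge as a share of the 45,211 rows: job is about 0.6% missing, education about 4.1%, contact about 28.8%, and poutcome about 81.7%. A minimal sketch of how such a table can be produced is shown below; the toy data frame is made up for illustration and stands in for bank_data.

```r
# Toy data standing in for bank_data (values are invented for illustration)
toy <- data.frame(
  job = c("admin.", NA, "technician", "services"),
  education = c("secondary", "tertiary", NA, NA),
  age = c(30, 41, 52, 28),
  stringsAsFactors = FALSE
)

# Count and percentage of missing values per column
missing_summary <- data.frame(
  column = names(toy),
  n_missing = colSums(is.na(toy)),
  pct_missing = round(100 * colMeans(is.na(toy)), 1),
  row.names = NULL
)

# Show the columns with the most missing values first
missing_summary <- missing_summary[order(-missing_summary$n_missing), ]
print(missing_summary)
```

Replacing toy with bank_data should give the same counts as the table above, with the percentages alongside.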

most_common_job <- names(sort(table(bank_data$job), decreasing = TRUE))[1]
most_common_job

## [1] "blue-collar"

The most common job was “blue-collar” and the most common education level was “secondary”. With only 288 and 1,857 missing values respectively out of 45,211 observations, we can safely impute both columns using the mode method: filling the gaps with the most common value introduces minimal bias and preserves the rest of each record. Another option is to drop these cases completely, since they are a minor percentage. I chose to impute, under the assumption that a client with an unrecorded job is most likely to fall in the most common category, “blue-collar”.

In my opinion this is the best method to handle these missing values, because it retains all valuable customer information without dropping entire rows.

# Fill missing job values with the most common job
bank_data$job[is.na(bank_data$job)] <- most_common_job

# Fill missing education values with the most common level
most_common_education <- names(sort(table(bank_data$education), decreasing = TRUE))[1]
bank_data$education[is.na(bank_data$education)] <- most_common_education
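The same mode imputation can be wrapped in a small reusable function. This is only a sketch: impute_mode is a hypothetical helper name, it assumes a character vector, and on ties it takes the first value in table() order rather than anything more principled.

```r
# Hypothetical helper: replace NA in a character vector with its most
# frequent non-missing value (the mode).
impute_mode <- function(x) {
  counts <- table(x)                              # table() ignores NA by default
  mode_value <- names(counts)[which.max(counts)]  # first maximum wins on ties
  x[is.na(x)] <- mode_value
  x
}

# Toy example with invented values
toy_job <- c("blue-collar", NA, "management", "blue-collar", NA)
imputed_job <- impute_mode(toy_job)
print(imputed_job)
```

Applied here, the calls would be impute_mode(bank_data$job) and impute_mode(bank_data$education).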

For contact (the contact method) and poutcome (the outcome of the previous campaign), we keep the missing values as their own category instead of imputing the mode. For contact, we recode NA as "Unknown", since we cannot tell whether the client was reached by email or another channel that was not recorded correctly. For poutcome, we assume that NA means the client had no previous contact and recode it as "No Previous Contact", which preserves the business meaning.

unique(bank_data$contact)

## [1] NA          "cellular"  "telephone"

unique(bank_data$poutcome)

## [1] NA        "failure" "other"   "success"

bank_data$contact[is.na(bank_data$contact)] <- "Unknown"
bank_data$poutcome[is.na(bank_data$poutcome)] <- "No Previous Contact"

colSums(is.na(bank_data))

##         age         job     marital   education     default     balance 
##           0           0           0           0           0           0 
##     housing        loan     contact day_of_week       month    duration 
##           0           0           0           0           0           0 
##    campaign       pdays    previous    poutcome           y      target 
##           0           0           0           0           0           0

Exploratory Data Analysis

Distribution of ‘age’

In the plot below, titled “Age Group Distribution”, we can see that the largest numbers of clients contacted belong to the 25-35 (13,702 clients) and 35-45 (12,555 clients) age groups. This suggests that the bank’s marketing strategy focuses heavily on younger and mid-career individuals.

Conversion Rates by Age Group

While most contacts did not subscribe (gray bars), some age groups showed higher engagement:

- 25-35: 1,869 subscribed (~13.6% conversion rate).
- 35-45: 1,301 subscribed (~10.4% conversion rate).
- 45-55: 893 subscribed (~10.3% conversion rate).
- 55-65: 586 subscribed (~14.1% conversion rate).
- 65+ groups (seniors and retirees): very few subscriptions in absolute terms, with counts dropping drastically.

bank_data_1 <- bank_data %>%
  mutate(age_group = cut(age, 
                         breaks = seq(15, 95, by = 10), 
                         labels = c("15-25", "25-35", "35-45", "45-55", 
                                    "55-65", "65-75", "75-85", "85-95"),
                         include.lowest = TRUE))

# Plot histogram with subscription status and age groups
ggplot(bank_data_1, aes(x = age_group, fill = target)) +
  geom_bar(position = "dodge", color = "black") +  # Side-by-side bars for 'yes' and 'no'
  geom_text(stat = "count", aes(label = after_stat(count)), 
            vjust = -0.5, position = position_dodge(0.9), size = 3) +
  ggtitle("Age Group Distribution") +
  xlab("Age Group (Marketing Segments)") +
  ylab("Number of Clients") +
  scale_fill_manual(values = c("no" = "gray", "yes" = "blue"), 
                    name = "Subscription Status") +  # Custom colors
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Tilt labels for readability

The analysis of age group distributions in the bank’s marketing campaign highlights critical insights for optimizing future outreach strategies. The data reveals that 25-35 year-olds were the most targeted group, with the highest number of contacts and conversions. However, despite being heavily marketed, 35-55 year-olds had lower conversion rates, suggesting a potential mismatch in messaging or engagement strategies for this demographic. Interestingly, the 55-65 age group demonstrated a relatively higher subscription rate, even though fewer clients were contacted, indicating an opportunity to expand outreach efforts to this segment.
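Since the bar chart shows counts, conversion rates are easier to compare when computed directly as subscribed / (subscribed + not subscribed) per age group. Dividing by the non-subscriber count alone would give odds rather than a rate. The sketch below uses a small invented data set and coarser age breaks than the plot, purely for illustration.

```r
# Toy data (invented): age and subscription outcome for 12 clients
toy <- data.frame(
  age = c(22, 27, 31, 34, 38, 41, 44, 50, 58, 61, 66, 70),
  target = c("no", "yes", "no", "no", "no", "yes", "no", "no",
             "yes", "no", "no", "yes")
)

# Bin ages; these breaks are an assumption chosen to keep the example small
toy$age_group <- cut(toy$age,
                     breaks = c(15, 35, 55, 95),
                     labels = c("15-35", "35-55", "55-95"),
                     include.lowest = TRUE)

# Cross-tabulate: rows are age groups, columns are "no" / "yes"
counts <- table(toy$age_group, toy$target)

conversion <- data.frame(
  age_group = rownames(counts),
  contacted = as.vector(rowSums(counts)),
  subscribed = as.vector(counts[, "yes"]),
  rate = as.vector(counts[, "yes"] / rowSums(counts))
)
print(conversion)
```

Running the same steps on bank_data_1 with the original ten-year bins would give a rate for every age group, including the small 65+ groups.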

Distribution of ‘job’

ggplot(bank_data, aes(x = job, fill = target)) +
  geom_bar(position = "fill") +
  ggtitle("Subscription Rate by Job Type") +
  xlab("Job Category") +
  ylab("Proportion Subscribed") +
  scale_fill_manual(values = c("no" = "gray", "yes" = "blue")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From this visualization, we can observe that blue-collar, management, and technician roles have the highest representation in the dataset, but the proportion of customers subscribing remains relatively low across all job categories. Students and retired individuals show a slightly higher proportion of subscribers compared to other job types, suggesting that these groups may be more receptive to banking offers. Meanwhile, blue-collar workers, services, and entrepreneurs appear to have lower subscription rates despite high representation in the dataset.

Relationships & Correlation

Correlation between Call Duration and Subscription

ggplot(bank_data, aes(x = target, y = duration, fill = target)) +
  geom_boxplot() +
  ggtitle("Call Duration vs. Subscription Status") +
  xlab("Subscription Status") +
  ylab("Call Duration (Seconds)") +
  scale_fill_manual(values = c("no" = "gray", "yes" = "blue")) +
  theme_minimal()

It is evident that customers who subscribed (yes) tend to have significantly longer call durations than those who did not. The median call duration for non-subscribers is much lower, indicating that unsuccessful marketing calls tend to be shorter. In contrast, successful calls tend to be longer, with many exceeding several hundred seconds. Note that duration is only known once a call has ended, so it describes how a call went rather than something available when deciding whom to contact.

# Select numeric columns
numeric_cols <- bank_data %>% select(where(is.numeric))

# Compute correlation matrix
cor_matrix <- cor(numeric_cols, use = "complete.obs")

# Visualize the correlation matrix
corrplot(cor_matrix, method = "circle", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)

The correlation matrix visualized in the plot above displays the relationships between the numerical variables in the dataset. The size and color intensity of each circle indicate the strength and direction of the correlation. Darker blue circles represent strong positive correlations, while darker red circles represent strong negative correlations. Lighter colors indicate weaker relationships.
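A plot is convenient for scanning, but a ranked list of variable pairs makes the strongest relationships explicit. The following sketch shows one way to extract such a list from a correlation matrix; it uses simulated columns a, b, and c (not the bank data), where b is built to track a closely.

```r
set.seed(1)
n <- 200
x <- rnorm(n)

# Simulated numeric data: b is a noisy copy of a, c is independent
toy <- data.frame(a = x, b = x + rnorm(n, sd = 0.1), c = rnorm(n))

cor_matrix <- cor(toy)

# Take each pair once by indexing the upper triangle of the matrix
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r = cor_matrix[idx]
)

# Strongest relationships (by absolute value) first
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

Substituting cor_matrix from the chunk above would list the numeric bank variables in the same way.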

# Count subscribers and non-subscribers per month
month_subscriptions <- bank_data %>%
  group_by(month, target) %>%
  summarise(count = n(), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = target, values_from = count, values_fill = 0) %>%  # Convert to wide format
  mutate(subscription_rate = yes / (yes + no))  # Compute subscription rate

# Define month order for proper sorting
month_order <- c("jan", "feb", "mar", "apr", "may", "jun", 
                 "jul", "aug", "sep", "oct", "nov", "dec")

# Convert month to factor with specified order
month_subscriptions$month <- factor(month_subscriptions$month, levels = month_order)

# Plot subscription rate by month
ggplot(month_subscriptions, aes(x = month, y = subscription_rate, fill = subscription_rate)) +
  geom_col() +  # Bar height is the precomputed rate; fill shows the same rate as a color gradient
  ggtitle("Subscription Rate by Month") +
  xlab("Month") +
  ylab("Subscription Rate") +
  theme_minimal()

From the plot, March (mar), September (sep), October (oct), and December (dec) show the highest subscription rates, at roughly 40-50%. This suggests that these months might be optimal for running marketing campaigns, possibly due to financial planning cycles, bonuses, or seasonal promotions that make customers more inclined to invest in term deposits.
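A monthly rate is only as reliable as the number of contacts behind it, so it helps to report the contact count and an interval next to each rate. The sketch below uses exact binomial confidence intervals from binom.test(); the two rows of counts are invented and are not the figures from this dataset.

```r
# Invented counts: one low-volume month and one high-volume month
monthly <- data.frame(
  month = c("mar", "may"),
  contacted = c(40, 4000),
  subscribed = c(20, 280)
)

monthly$rate <- monthly$subscribed / monthly$contacted

# Exact (Clopper-Pearson) 95% interval for each month's rate
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int,
               monthly$subscribed, monthly$contacted))
monthly$ci_low <- ci[, 1]
monthly$ci_high <- ci[, 2]

print(monthly)
```

In this toy example the low-volume month has the higher rate but also a much wider interval, which is the pattern to check for before scheduling campaigns around a particular month.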

ML models

Below we train Random Forest and Logistic Regression models to determine which factors influence whether a customer subscribes to a term deposit. We start with the Random Forest. This model builds many decision trees and combines their votes into a single prediction; a feature's importance score reflects how much the forest's performance depends on that feature. From the Random Forest feature importance plot, we observed that duration (call length) was the most critical predictor, followed by previous (number of times the client was contacted before), campaign (number of contacts in the current campaign), and month (the month of the last contact). These results suggest that longer calls, prior contact history, and timing of outreach are strongly associated with customer subscription.

# Convert categorical variables into factors and drop the y column
# (target already holds the same information as a factor)
bank_data <- bank_data %>%
  mutate(across(where(is.character), as.factor))
bank_data <- bank_data %>% select(-y)
# Split data into training (80%) and testing (20%)
set.seed(123)
trainIndex <- createDataPartition(bank_data$target, p = 0.8, list = FALSE)
train_data <- bank_data[trainIndex, ]
test_data  <- bank_data[-trainIndex, ]

# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(target ~ ., data = train_data, importance = TRUE, ntree = 100)

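# Extract feature importance
# Note: with importance = TRUE on a classification forest, column 1 of
# importance() is the class-specific score for the first level ("no");
# feature_importance[, "MeanDecreaseAccuracy"] is the overall measure.
feature_importance <- importance(rf_model)
feature_importance_df <- data.frame(Feature = rownames(feature_importance), Importance = feature_importance[, 1])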

# Sort by importance
feature_importance_df <- feature_importance_df %>% arrange(desc(Importance))

# Plot feature importance
ggplot(feature_importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +  # Flip for better readability
  ggtitle("Feature Importance (Random Forest)") +
  xlab("Features") +
  ylab("Importance Score") +
  theme_minimal()
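The split above creates test_data, but none of the steps shown here score a model on it. As a minimal sketch of what a held-out check could look like, the example below uses simulated data and a base R logistic regression standing in for the Random Forest, so that it runs without extra packages. The variable names mirror the report, but the data and coefficients are invented.

```r
set.seed(123)
n <- 1000

# Simulated data: longer calls make "yes" more likely (invented relationship)
toy <- data.frame(duration = rexp(n, rate = 1 / 250))
p <- plogis(-3 + 0.006 * toy$duration)
toy$target <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, , drop = FALSE]
test <- toy[-train_idx, , drop = FALSE]

# Fit on the training rows only, then predict the held-out rows
fit <- glm(target ~ duration, data = train, family = binomial)
prob_yes <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob_yes > 0.5, "yes", "no"), levels = c("no", "yes"))

# Confusion matrix, overall accuracy, and recall for the "yes" class
confusion <- table(predicted = pred, actual = test$target)
accuracy <- sum(diag(confusion)) / sum(confusion)
recall_yes <- confusion["yes", "yes"] / sum(confusion[, "yes"])

print(confusion)
print(c(accuracy = accuracy, recall_yes = recall_yes))
```

Because subscribers are the minority class, accuracy alone can look good even when few subscribers are found, which is why recall for "yes" is reported alongside it. For the forest itself, predict(rf_model, newdata = test_data) would supply the predictions.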

Next, we trained a Logistic Regression model, which helps interpret the direction and magnitude of feature effects on subscription. Unlike the Random Forest, which reports one importance score per feature, Logistic Regression estimates a separate coefficient for each level of a categorical variable, relative to a reference level. This is why we saw separate values for different months (e.g., monthmar, monthmay, etc.), showing that some months are more effective for marketing campaigns than others. We also found that clients with a successful previous campaign (poutcomesuccess) were far more likely to subscribe again, reinforcing the importance of follow-up efforts.

# Convert categorical variables into factors
bank_data <- bank_data %>%
  mutate(across(where(is.character), as.factor))

# Train Logistic Regression Model
log_model <- glm(target ~ ., data = bank_data, family = binomial)

# Extract feature importance (absolute coefficients)
log_importance <- abs(coef(log_model)[-1])  # Remove intercept
log_importance_df <- data.frame(Feature = names(log_importance), Importance = log_importance)

# Sort by importance
log_importance_df <- log_importance_df %>% arrange(desc(Importance))

# Remove the two largest coefficients.
# They belonged to contact and poutcome; we exclude them for data integrity
# because those columns had too many missing values.
log_importance_filtered <- log_importance_df %>% slice(-c(1,2))

# Plot without top two features
ggplot(log_importance_filtered, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "red") +
  coord_flip() +
  ggtitle("Feature Importance from Logistic Regression") +
  xlab("Features") +
  ylab("Absolute Coefficient Value") +
  theme_minimal()
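One caveat when ranking absolute coefficients: a raw coefficient depends on the units of its predictor, so a variable measured in seconds can look unimportant next to one measured in single-digit counts. A hedged sketch of one common remedy, standardizing numeric predictors before fitting, is shown below on simulated data. The column names echo the report, but the data and effect sizes are invented.

```r
set.seed(42)
n <- 2000

# Simulated predictors on very different scales
toy <- data.frame(
  duration = rexp(n, rate = 1 / 250),     # seconds (large scale)
  campaign = rpois(n, lambda = 2) + 1     # number of contacts (small scale)
)
p <- plogis(-1.5 + 0.004 * toy$duration - 0.4 * toy$campaign)
toy$target <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# Fit on raw units
raw_fit <- glm(target ~ duration + campaign, data = toy, family = binomial)

# Fit again after centering and scaling each predictor to unit variance
scaled <- toy
scaled$duration <- as.numeric(scale(toy$duration))
scaled$campaign <- as.numeric(scale(toy$campaign))
std_fit <- glm(target ~ duration + campaign, data = scaled, family = binomial)

comparison <- data.frame(
  feature = c("duration", "campaign"),
  raw_abs_coef = abs(coef(raw_fit)[-1]),
  std_abs_coef = abs(coef(std_fit)[-1]),
  odds_ratio_per_sd = exp(coef(std_fit)[-1]),
  row.names = NULL
)
print(comparison)
```

In this simulation the raw coefficients rank campaign above duration, while the standardized coefficients reverse the order. The odds_ratio_per_sd column expresses each effect as the multiplicative change in the odds of subscribing for a one standard deviation increase.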

Conclusion

After analyzing the Bank Marketing dataset, I was able to pinpoint key factors that influence whether a customer subscribes to a term deposit. Through Exploratory Data Analysis (EDA), it became clear that call duration, past campaign success, and the timing of outreach (specific months) play a major role in subscription rates. On the other hand, demographic details like age and account balance had minimal impact, reinforcing that customer engagement and behavioral factors are more critical when designing marketing strategies.

To dive deeper, I used Machine Learning models to determine which factors drive customer decisions. Random Forest provided a global ranking of feature importance, highlighting which variables had the strongest predictive power. Meanwhile, Logistic Regression allowed for a more detailed breakdown of how specific categorical factors, such as job type and the month of outreach, influenced subscription likelihood. The key takeaway? Strategic timing, effective follow-ups, and past engagement history significantly increase the chances of conversion.

From a business perspective, these insights can be leveraged to refine marketing campaigns by targeting customers who have shown prior interest, focusing outreach during high-conversion months, and enhancing call strategies to improve engagement. Given that call duration is a strong predictor, banks could also benefit from training agents to maintain meaningful conversations that foster trust and commitment.

Ultimately, this data-driven approach allows banks to optimize marketing efforts, lower acquisition costs, and improve overall campaign efficiency, leading to higher customer engagement and increased financial performance.