library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.4.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
bank_full_url <- "https://raw.githubusercontent.com/zachrose97/Assignment1Data622/main/bank-full.csv"
bank_url <- "https://raw.githubusercontent.com/zachrose97/Assignment1Data622/main/bank.csv"
bank_additional_full_url <- "https://raw.githubusercontent.com/zachrose97/Assignment1Data622/main/bank-additional-full.csv"
bank_additional_url <- "https://raw.githubusercontent.com/zachrose97/Assignment1Data622/main/bank-additional.csv"
bank_full <- read.csv(bank_full_url, sep = ";")
bank <- read.csv(bank_url, sep = ";")
bank_additional_full <- read.csv(bank_additional_full_url, sep = ";")
bank_additional <- read.csv(bank_additional_url, sep = ";")
str(bank_full)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
summary(bank_full)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
colSums(is.na(bank_full))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
sum(duplicated(bank_full))
## [1] 0
numeric_bank <- bank_full %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.factor, as.integer)
cor_matrix <- cor(numeric_bank, use = "complete.obs")
corrplot(cor_matrix, method = "color", tl.cex = 0.7)
# Plot histograms
library(DataExplorer)
plot_histogram(bank_full)
# Bar plot
ggplot(bank_full, aes(x = job)) +
geom_bar(fill = "steelblue") +
theme_minimal() +
labs(title = "Distribution of Job Types", x = "Job Type", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate labels
bank_full <- bank_full %>%
mutate(balance = ifelse(balance > quantile(balance, 0.99), quantile(balance, 0.99), balance),
duration = ifelse(duration > quantile(duration, 0.99), quantile(duration, 0.99), duration),
previous = ifelse(previous > quantile(previous, 0.99), quantile(previous, 0.99), previous))
summary(bank_full)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. :-8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1274
## 3rd Qu.: 1428
## Max. :13165
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 254.3
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :1269.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. :0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.:0.0000 Class :character
## Median : 2.000 Median : -1.0 Median :0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean :0.5247
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.:0.0000
## Max. :63.000 Max. :871.0 Max. :8.9000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
bank_full <- bank_full %>%
mutate_if(is.character, as.factor)
bank_full <- bank_full %>%
mutate_if(is.numeric, scale)
table(bank_full$y)
##
## no yes
## 39922 5289
# Convert y to binary (1 = "yes", 0 = "no") for Logistic Regression
bank_full$y <- ifelse(bank_full$y == "yes", 1, 0)
bank_full$y <- as.factor(bank_full$y)
log_model <- glm(y ~ ., data = bank_full, family = binomial)
summary(log_model)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = bank_full)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.674387 0.143382 -11.678 < 2e-16 ***
## age -0.003742 0.023711 -0.158 0.874606
## jobblue-collar -0.327588 0.073518 -4.456 8.35e-06 ***
## jobentrepreneur -0.380416 0.127672 -2.980 0.002886 **
## jobhousemaid -0.529584 0.138523 -3.823 0.000132 ***
## jobmanagement -0.177888 0.074304 -2.394 0.016663 *
## jobretired 0.245041 0.098598 2.485 0.012946 *
## jobself-employed -0.317009 0.113295 -2.798 0.005141 **
## jobservices -0.228423 0.084996 -2.687 0.007200 **
## jobstudent 0.391102 0.110487 3.540 0.000400 ***
## jobtechnician -0.183596 0.069801 -2.630 0.008532 **
## jobunemployed -0.199633 0.112892 -1.768 0.077002 .
## jobunknown -0.341358 0.237005 -1.440 0.149783
## maritalmarried -0.188918 0.059629 -3.168 0.001534 **
## maritalsingle 0.085460 0.068065 1.256 0.209274
## educationsecondary 0.182252 0.065396 2.787 0.005322 **
## educationtertiary 0.383599 0.076173 5.036 4.76e-07 ***
## educationunknown 0.247570 0.105202 2.353 0.018608 *
## defaultyes -0.008104 0.165546 -0.049 0.960955
## balance 0.058937 0.017039 3.459 0.000542 ***
## housingyes -0.693472 0.044455 -15.600 < 2e-16 ***
## loanyes -0.425325 0.060625 -7.016 2.29e-12 ***
## contacttelephone -0.150437 0.076152 -1.975 0.048213 *
## contactunknown -1.625826 0.073322 -22.174 < 2e-16 ***
## day 0.085478 0.021027 4.065 4.80e-05 ***
## monthaug -0.684442 0.079740 -8.583 < 2e-16 ***
## monthdec 0.711000 0.178868 3.975 7.04e-05 ***
## monthfeb -0.121506 0.090721 -1.339 0.180461
## monthjan -1.264332 0.123008 -10.278 < 2e-16 ***
## monthjul -0.850344 0.078749 -10.798 < 2e-16 ***
## monthjun 0.461342 0.094776 4.868 1.13e-06 ***
## monthmar 1.645125 0.121295 13.563 < 2e-16 ***
## monthmay -0.401301 0.073491 -5.461 4.75e-08 ***
## monthnov -0.889958 0.085750 -10.379 < 2e-16 ***
## monthoct 0.911923 0.109487 8.329 < 2e-16 ***
## monthsep 0.895191 0.121228 7.384 1.53e-13 ***
## duration 1.093074 0.015682 69.704 < 2e-16 ***
## campaign -0.287709 0.031851 -9.033 < 2e-16 ***
## pdays -0.003027 0.031166 -0.097 0.922623
## previous 0.079854 0.022736 3.512 0.000444 ***
## poutcomeother 0.180027 0.091442 1.969 0.048980 *
## poutcomesuccess 2.318524 0.083467 27.778 < 2e-16 ***
## poutcomeunknown 0.034641 0.105144 0.329 0.741808
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32631 on 45210 degrees of freedom
## Residual deviance: 21020 on 45168 degrees of freedom
## AIC: 21106
##
## Number of Fisher Scoring iterations: 6
library(tidyverse)
library(recipes)
## Warning: package 'recipes' was built under R version 4.4.2
##
## Attaching package: 'recipes'
## The following object is masked from 'package:stringr':
##
## fixed
## The following object is masked from 'package:stats':
##
## step
library(themis)
## Warning: package 'themis' was built under R version 4.4.2
# Convert categorical variables to numeric encoding, but keep y as a factor
bank_full_encoded <- bank_full %>%
mutate_if(is.character, as.factor) %>% # Convert characters to factors
mutate_if(is.factor, as.integer) %>% # Convert all factors (except y) to integers
mutate(y = as.factor(bank_full$y)) # Ensure y remains a factor
# Create a recipe to apply SMOTE with a lower oversampling ratio
smote_recipe <- recipe(y ~ ., data = bank_full_encoded) %>%
step_smote(y, over_ratio = 0.5) # Prevent excessive oversampling
# apply SMOTE
bank_balanced <- prep(smote_recipe, training = bank_full_encoded) %>%
bake(new_data = NULL)
# Check class distribution after balancing
table(bank_balanced$y)
##
## 0 1
## 39922 19961
bank_balanced$y <- as.factor(bank_balanced$y)
rf_model_balanced <- randomForest(y ~ ., data = bank_balanced, ntree = 100)
print(rf_model_balanced)
##
## Call:
## randomForest(formula = y ~ ., data = bank_balanced, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 6.87%
## Confusion matrix:
## 0 1 class.error
## 0 37811 2111 0.05287811
## 1 2003 17958 0.10034567
importance(rf_model_balanced)
## MeanDecreaseGini
## age 1615.2433
## job 790.7036
## marital 774.1067
## education 731.7040
## default 43.7810
## balance 1666.6408
## housing 2297.3724
## loan 556.0717
## contact 1094.8041
## day 1449.7013
## month 1950.0629
## duration 8298.5010
## campaign 2039.6012
## pdays 982.0890
## previous 593.6566
## poutcome 1162.2022
varImpPlot(rf_model_balanced)
log_preds <- predict(log_model, bank_full, type = "response")
log_preds_class <- ifelse(log_preds > 0.5, 1, 0)
rf_balanced_preds <- predict(rf_model_balanced, bank_balanced, type = "class")
log_conf_matrix <- confusionMatrix(as.factor(log_preds_class), as.factor(bank_full$y))
rf_balanced_conf_matrix <- confusionMatrix(as.factor(rf_balanced_preds), as.factor(bank_balanced$y))
print(log_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38861 3325
## 1 1061 1964
##
## Accuracy : 0.903
## 95% CI : (0.9002, 0.9057)
## No Information Rate : 0.883
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4234
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9734
## Specificity : 0.3713
## Pos Pred Value : 0.9212
## Neg Pred Value : 0.6493
## Prevalence : 0.8830
## Detection Rate : 0.8595
## Detection Prevalence : 0.9331
## Balanced Accuracy : 0.6724
##
## 'Positive' Class : 0
##
print(rf_balanced_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39920 22
## 1 2 19939
##
## Accuracy : 0.9996
## 95% CI : (0.9994, 0.9997)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9991
##
## Mcnemar's Test P-Value : 0.0001052
##
## Sensitivity : 0.9999
## Specificity : 0.9989
## Pos Pred Value : 0.9994
## Neg Pred Value : 0.9999
## Prevalence : 0.6667
## Detection Rate : 0.6666
## Detection Prevalence : 0.6670
## Balanced Accuracy : 0.9994
##
## 'Positive' Class : 0
##
The Bank Marketing Dataset contains customer interaction data from a telemarketing campaign designed to persuade clients to subscribe to a term deposit. The dataset includes customer demographics, economic factors, and previous marketing interactions. The analysis above applies Exploratory Data Analysis and machine learning to identify key factors influencing a customer’s decision to subscribe. By understanding correlations, data distributions, and feature importance, we can help the bank improve its marketing strategies to increase term deposit subscription rates.
To begin, a correlation matrix was created to examine feature relationships. The analysis revealed that the strongest predictor of subscription was call duration, longer calls tended to increase the likelihood of a subscription. Another key predictor was previous successful marketing outreach, indicating that customers who had subscribed in past outreach campaigns were more likely to do so again. A histogram analysis showed that age is normally distributed, with most clients between 30 and 50 years old, whereas balance, duration, and pdays are highly right skewed, indicating potential outliers. The campaign variable revealed that most individuals were contacted only once or twice. Outliers were detected in balance and previous contacts, where some clients had extremely high savings or were contacted an unusually high number of times. Most clients belonged to the blue collar, management, or technician job categories, but students and unemployed individuals had higher conversion rates. Additionally, clients over 50 years old were more likely to subscribe.
Since this is a binary classification problem, we used Logistic Regression and Random Forest models. Logistic Regression is a simple and interpretable statistical model that estimates the probability of a binary outcome. It is advantageous due to its efficiency and explainability, however, it assumes a linear relationship between features and the target variable. The results from the logistic regression showed that call duration, past campaign success, balance, and the month of contact were significant predictors. Certain factors, such as having a housing loan and being contacted in May or November, were negatively associated with subscription likelihood. Random Forest, an ensemble method that constructs multiple decision trees, was chosen because of its ability to handle non linear relationships and complex feature interactions. Initially, the Random Forest model struggled with class imbalance, as only 11.7% of clients subscribed (y = 1), while 88.3% did not (y = 0). This caused the model to misclassify many actual subscribers (y = 1) as non-subscribers (y = 0), leading to a bias towards predicting ‘No.’. To address class imbalance, Synthetic Minority Oversampling (SMOTE) was applied. SMOTE was applied with an oversampling ratio of 0.5, rebalancing the dataset to 39,922 ‘No’ cases (y = 0) and 19,961 ‘Yes’ cases (y = 1). This helped the model learn more effectively from actual subscribers. After SMOTE balancing, the Random Forest model achieved an accuracy of 99.96%, with high sensitivity (99.86%) and specificity (99.88%), significantly improving its ability to classify both subscribers (y = 1) and non-subscribers (y = 0). This significantly reduced the misclassification rate for subscribers (y = 1), improving the model’s overall reliability. Feature importance analysis from Random Forest revealed that the most influential features for predicting a subscription were call duration, loan status, number of contacts, month of contact, and customer balance.
Both models had strengths and weaknesses. Logistic Regression provided clear interpretability, helping to understand how individual factors influenced term deposit subscriptions. Logistic Regression had a predictive accuracy of 90.3%, with a high sensitivity (97.34%) but a low specificity (37.13%). This means the model correctly identified most subscribers (y = 1), but frequently misclassified non-subscribers (y = 0) as subscribers. Random Forest performed better overall, especially after SMOTE balancing, but is more difficult to interpret than Logistic Regression. In conclusion, the most important factors for term deposit subscription were call duration, past campaign success, and balance. These insights suggest that longer and more engaged customer interactions could lead to higher conversion rates. To further optimize marketing efforts, my recommendation to the bank would be to focus on clients with previous successful engagements and implement call duration strategies. By incorporating these findings, the bank can enhance its telemarketing approach, improve customer engagement, and drive term deposit subscription rates.