Objective: Apply machine learning techniques to this data set to identify the most effective tactics for persuading more customers to subscribe to the bank's term deposit in the next campaign. The focus is on exploratory data analysis (EDA) to build a reliable training model, since “better data beats better algorithms”: EDA identifies gaps and class imbalances, improves data quality, enables better feature engineering, and builds a deep understanding of the data before model training.
# Structure of the data
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job <chr> "unemployed", "services", "management", "management", "blue-…
## $ marital <chr> "married", "married", "single", "married", "married", "singl…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
## $ loan <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
## $ contact <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
## $ day <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
## $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
# Convert categorical variables to factors; give the target explicit levels
bank <- bank %>%
  mutate(y = factor(y, levels = c("no", "yes")),
         across(c(job, marital, education, default, housing,
                  loan, contact, month, poutcome), as.factor))
#Summary statistics
summary(bank)
## age job marital education default
## Min. :19.00 management :969 divorced: 528 primary : 678 no :4445
## 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306 yes: 76
## Median :39.00 technician :768 single :1196 tertiary :1350
## Mean :41.17 admin. :478 unknown : 187
## 3rd Qu.:49.00 services :417
## Max. :87.00 retired :230
## (Other) :713
## balance housing loan contact day
## Min. :-3313 no :1962 no :3830 cellular :2896 Min. : 1.00
## 1st Qu.: 69 yes:2559 yes: 691 telephone: 301 1st Qu.: 9.00
## Median : 444 unknown :1324 Median :16.00
## Mean : 1423 Mean :15.92
## 3rd Qu.: 1480 3rd Qu.:21.00
## Max. :71188 Max. :31.00
##
## month duration campaign pdays
## may :1398 Min. : 4 Min. : 1.000 Min. : -1.00
## jul : 706 1st Qu.: 104 1st Qu.: 1.000 1st Qu.: -1.00
## aug : 633 Median : 185 Median : 2.000 Median : -1.00
## jun : 531 Mean : 264 Mean : 2.794 Mean : 39.77
## nov : 389 3rd Qu.: 329 3rd Qu.: 3.000 3rd Qu.: -1.00
## apr : 293 Max. :3025 Max. :50.000 Max. :871.00
## (Other): 571
## previous poutcome y
## Min. : 0.0000 failure: 490 no :4000
## 1st Qu.: 0.0000 other : 197 yes: 521
## Median : 0.0000 success: 129
## Mean : 0.5426 unknown:3705
## 3rd Qu.: 0.0000
## Max. :25.0000
##
n_bank <- bank %>%
  select(where(is.numeric))
summary(n_bank)
## age balance day duration
## Min. :19.00 Min. :-3313 Min. : 1.00 Min. : 4
## 1st Qu.:33.00 1st Qu.: 69 1st Qu.: 9.00 1st Qu.: 104
## Median :39.00 Median : 444 Median :16.00 Median : 185
## Mean :41.17 Mean : 1423 Mean :15.92 Mean : 264
## 3rd Qu.:49.00 3rd Qu.: 1480 3rd Qu.:21.00 3rd Qu.: 329
## Max. :87.00 Max. :71188 Max. :31.00 Max. :3025
## campaign pdays previous
## Min. : 1.000 Min. : -1.00 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000
## Median : 2.000 Median : -1.00 Median : 0.0000
## Mean : 2.794 Mean : 39.77 Mean : 0.5426
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :50.000 Max. :871.00 Max. :25.0000
The summary statistics show that client ages range from 19 to 87 with a median of 39. Most clients have modest account balances (median $444), though the distribution is heavily right-skewed (mean $1,423, maximum $71,188). Contact duration varies widely, and the number of campaign contacts is generally low. Many clients had no prior contact before this campaign: a large share of “pdays” values are -1, the code for no previous contact.
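As a quick check on that last point, the share of never-before-contacted clients can be computed directly (a minimal sketch; pdays == -1 is this dataset's code for no prior contact):

# Proportion of clients with no previous contact
mean(bank$pdays == -1)
# Cross-check: clients with previous == 0 should be the same group
mean(bank$previous == 0)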
c_bank <- bank %>% select(-where(is.numeric))
summary(c_bank)
## job marital education default housing
## management :969 divorced: 528 primary : 678 no :4445 no :1962
## blue-collar:946 married :2797 secondary:2306 yes: 76 yes:2559
## technician :768 single :1196 tertiary :1350
## admin. :478 unknown : 187
## services :417
## retired :230
## (Other) :713
## loan contact month poutcome y
## no :3830 cellular :2896 may :1398 failure: 490 no :4000
## yes: 691 telephone: 301 jul : 706 other : 197 yes: 521
## unknown :1324 aug : 633 success: 129
## jun : 531 unknown:3705
## nov : 389
## apr : 293
## (Other): 571
# Check for NAs
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
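colSums(is.na()) shows no explicit NAs, but the factor summaries above reveal that several categorical variables encode missingness as an “unknown” level (e.g., 187 in education, 1,324 in contact, 3,705 in poutcome). A small sketch to tally these implicit missing values:

# 'unknown' levels act as implicit missingness in the categorical columns
sapply(c_bank, function(x) sum(x == "unknown"))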
# Numeric distributions
n_bank %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numeric Variables")
# Categorical distributions
c_bank %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar(fill = "blue", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Categorical Variables") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Add the target back to the numeric subset, then reshape for grouped boxplots
n_bank <- n_bank %>%
  mutate(y = factor(bank$y))
n_bank_long <- n_bank %>%
  pivot_longer(cols = -y, names_to = "variable", values_to = "value")
ggplot(n_bank_long, aes(x = y, y = value, fill = y)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Distribution of Numeric Variables by Yes/No in Y",
       x = "Y (Outcome)", y = "Value") +
  theme_minimal()
Age and balance show no strong differentiation between subscribers and non-subscribers of term deposits, although high-balance outliers exist. The number of campaign contacts is right-skewed, with non-subscribers receiving more contacts, including higher outliers. Subscription likelihood increases with longer call durations, emphasizing the importance of engagement. ‘pdays’ is heavily skewed, with most clients not contacted recently, though lower values slightly favor subscriptions. ‘previous’ is generally higher for subscribers, indicating that repeated engagement may improve success rates.
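A quick numeric companion to the boxplots (a minimal sketch): comparing group medians makes the ‘duration’ and ‘previous’ gaps concrete.

# Median of each numeric feature by outcome
bank %>%
  group_by(y) %>%
  summarise(across(c(age, balance, duration, campaign, previous), median))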
c_bank <- c_bank %>%
  mutate(y = factor(bank$y))
c_bank_long <- c_bank %>%
  pivot_longer(cols = -y, names_to = "variable", values_to = "value") %>%
  count(variable, value, y)
ggplot(c_bank_long, aes(x = value, y = n, fill = y)) +
  geom_col(position = "dodge", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Relationship Between Categorical Variables and Y (Yes/No)",
       x = "Category", y = "Count") +
  scale_fill_manual(values = c("yes" = "blue", "no" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The variables ‘poutcome’ and ‘contact’ show a strong influence on subscription likelihood. Clients with a ‘success’ outcome in a previous campaign are more likely to subscribe, while those contacted via ‘cellular’ also show a higher tendency to say ‘yes.’ In contrast, housing and personal loans exhibit a more balanced distribution between ‘yes’ and ‘no,’ indicating a weaker impact on the subscription decision.
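Raw counts can mislead when group sizes differ, so a within-group rate view is a useful companion (a small sketch): the subscription rate by ‘poutcome’ makes the effect of a prior success explicit.

# Subscription rate within each previous-campaign outcome
bank %>%
  group_by(poutcome) %>%
  summarise(n = n(), subscribe_rate = mean(y == "yes")) %>%
  arrange(desc(subscribe_rate))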
# Recode y to 0/1 so it can enter the correlation matrix
n_bank <- n_bank %>%
  mutate(y = as.numeric(y) - 1)
cor_matrix <- cor(n_bank, use = "complete.obs")
ggcorrplot(cor_matrix,
           method = "circle",
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           colors = c("blue", "white", "red"),
           title = "Correlation Heatmap",
           ggtheme = theme_minimal())
The strongest correlation is between ‘pdays’ and ‘previous’, indicating that clients who had prior interactions were also contacted more recently. ‘duration’ has weak correlations with the other features, suggesting that call length is largely independent of prior contact history; however, it shows the highest correlation with ‘y’, making it a significant predictor for a machine learning model (with the caveat that duration is only known after a call takes place, so it cannot inform targeting before the call). Most other correlations are low, indicating weak linear relationships among the numerical features.
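To read the heatmap's target column directly (a minimal sketch), the correlations with ‘y’ can be extracted and ranked:

# Correlation of each numeric feature with the 0/1 target, strongest first
cor_y <- cor_matrix[, "y"]
sort(cor_y[names(cor_y) != "y"], decreasing = TRUE)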
bank$y <- factor(bank$y, levels = c("no", "yes"))
log_model <- glm(y ~ age + job + marital + education + default + housing +
                   loan + contact + month + duration + campaign + previous +
                   poutcome,
                 data = bank,
                 family = binomial)
summary(log_model)
##
## Call:
## glm(formula = y ~ age + job + marital + education + default +
## housing + loan + contact + month + duration + campaign +
## previous + poutcome, family = binomial, data = bank)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.186721 0.523917 -4.174 3.00e-05 ***
## age -0.004354 0.007104 -0.613 0.539975
## jobblue-collar -0.404873 0.241543 -1.676 0.093700 .
## jobentrepreneur -0.277862 0.379569 -0.732 0.464140
## jobhousemaid -0.365479 0.417053 -0.876 0.380847
## jobmanagement -0.081810 0.240409 -0.340 0.733635
## jobretired 0.603314 0.310290 1.944 0.051852 .
## jobself-employed -0.179264 0.351770 -0.510 0.610327
## jobservices -0.159256 0.272822 -0.584 0.559397
## jobstudent 0.388569 0.372853 1.042 0.297342
## jobtechnician -0.198308 0.229813 -0.863 0.388185
## jobunemployed -0.664215 0.421633 -1.575 0.115178
## jobunknown 0.482495 0.584645 0.825 0.409214
## maritalmarried -0.470554 0.173912 -2.706 0.006816 **
## maritalsingle -0.302222 0.203251 -1.487 0.137030
## educationsecondary 0.074675 0.201876 0.370 0.711452
## educationtertiary 0.325503 0.232866 1.398 0.162169
## educationunknown -0.421668 0.357112 -1.181 0.237693
## defaultyes 0.550208 0.430839 1.277 0.201580
## housingyes -0.281237 0.136920 -2.054 0.039973 *
## loanyes -0.637709 0.199550 -3.196 0.001395 **
## contacttelephone -0.067554 0.233029 -0.290 0.771896
## contactunknown -1.364324 0.225196 -6.058 1.38e-09 ***
## monthaug -0.368791 0.246289 -1.497 0.134292
## monthdec 0.078006 0.653733 0.119 0.905019
## monthfeb 0.024297 0.277512 0.088 0.930232
## monthjan -0.973771 0.375581 -2.593 0.009523 **
## monthjul -0.750296 0.248382 -3.021 0.002522 **
## monthjun 0.398066 0.288706 1.379 0.167958
## monthmar 1.414981 0.386185 3.664 0.000248 ***
## monthmay -0.561520 0.229666 -2.445 0.014488 *
## monthnov -0.838923 0.271316 -3.092 0.001988 **
## monthoct 1.366844 0.328805 4.157 3.22e-05 ***
## monthsep 0.574524 0.407872 1.409 0.158957
## duration 0.004219 0.000202 20.886 < 2e-16 ***
## campaign -0.061363 0.027575 -2.225 0.026061 *
## previous -0.007144 0.037943 -0.188 0.850654
## poutcomeother 0.497650 0.267911 1.858 0.063237 .
## poutcomesuccess 2.427498 0.270028 8.990 < 2e-16 ***
## poutcomeunknown -0.093078 0.217016 -0.429 0.667998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3231.0 on 4520 degrees of freedom
## Residual deviance: 2177.8 on 4481 degrees of freedom
## AIC: 2257.8
##
## Number of Fisher Scoring iterations: 6
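Because logistic coefficients are on the log-odds scale, exponentiating them yields more intuitive odds ratios (a quick sketch using coefficient names from the summary above); for example, a prior ‘success’ multiplies the odds of subscribing by roughly exp(2.43) ≈ 11.

# Odds ratios for some of the strongest effects
exp(coef(log_model)[c("poutcomesuccess", "monthmar", "monthoct", "contactunknown")])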
# In-sample predicted probabilities, classified at a 0.5 cutoff
log_model_predictions <- predict(log_model, type = "response")
log_model_class <- ifelse(log_model_predictions > 0.5, "yes", "no")
confusionMatrix(as.factor(log_model_class), bank$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3913 337
## yes 87 184
##
## Accuracy : 0.9062
## 95% CI : (0.8973, 0.9146)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 1.919e-06
##
## Kappa : 0.4188
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9782
## Specificity : 0.3532
## Pos Pred Value : 0.9207
## Neg Pred Value : 0.6790
## Prevalence : 0.8848
## Detection Rate : 0.8655
## Detection Prevalence : 0.9401
## Balanced Accuracy : 0.6657
##
## 'Positive' Class : no
##
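Note that caret treats “no” as the positive class here, so specificity (0.3532) is the recall on actual subscribers: only about a third of “yes” clients are identified at the 0.5 cutoff. Lowering the threshold trades overall accuracy for catching more subscribers (a hedged sketch; 0.3 is an arbitrary illustrative value, not a tuned choice):

# A lower cutoff flags more potential subscribers at the cost of false positives
log_model_class_030 <- ifelse(log_model_predictions > 0.3, "yes", "no")
confusionMatrix(factor(log_model_class_030, levels = c("no", "yes")), bank$y)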
set.seed(123)
# Numeric features only; note they are left unscaled here, and KNN is
# distance-based, so wide-range variables like balance dominate the metric
X <- n_bank[, c("age", "balance", "duration", "campaign", "previous")]
y <- as.factor(n_bank$y) # y was recoded to 0/1 above, hence factor levels "0"/"1"
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
knn_predictions <- knn(train = X_train, test = X_test, cl = y_train, k = 5)
confusionMatrix(knn_predictions, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1149 131
## 1 51 25
##
## Accuracy : 0.8658
## 95% CI : (0.8465, 0.8835)
## No Information Rate : 0.885
## P-Value [Acc > NIR] : 0.9866
##
## Kappa : 0.1516
##
## Mcnemar's Test P-Value : 4.745e-09
##
## Sensitivity : 0.9575
## Specificity : 0.1603
## Pos Pred Value : 0.8977
## Neg Pred Value : 0.3289
## Prevalence : 0.8850
## Detection Rate : 0.8473
## Detection Prevalence : 0.9440
## Balanced Accuracy : 0.5589
##
## 'Positive' Class : 0
##
Logistic regression and KNN are two appropriate machine learning algorithms for this dataset. Logistic regression provides easily interpretable analysis, computational efficiency, and probability scores, making it ideal for understanding the influence of features like call duration and previous campaign outcomes. However, it assumes a linear relationship between the predictors and the log-odds of the outcome, which may not always hold. In contrast, KNN, a non-parametric method, can capture more complex relationships in the data. Its disadvantages include sensitivity to irrelevant and unscaled features, higher computational costs on large datasets, and poor performance when the data are imbalanced. The dataset contains labeled data; therefore, supervised learning algorithms like logistic regression and KNN are appropriate. The dataset's structure, including categorical and numerical features, aligns well with logistic regression, which performed better when comparing specificity. If the dataset had fewer than 1,000 records, KNN might be more viable, as its performance is less affected by small sample sizes, whereas logistic regression might suffer from insufficient data to estimate coefficients reliably. Based on current performance, I would recommend logistic regression due to its interpretability and higher specificity.
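One caveat worth demonstrating: the KNN run above used raw features, so ‘balance’ (range -3,313 to 71,188) dwarfs the other variables in the distance calculation. A minimal sketch of standardizing with training-set statistics before re-running (same variables and k as above):

# Standardize using training-set means/SDs to avoid test-set leakage
X_train_sc <- scale(X_train)
X_test_sc  <- scale(X_test,
                    center = attr(X_train_sc, "scaled:center"),
                    scale  = attr(X_train_sc, "scaled:scale"))
knn_scaled <- knn(train = X_train_sc, test = X_test_sc, cl = y_train, k = 5)
confusionMatrix(knn_scaled, y_test)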
For this dataset, pre-processing is needed to improve model performance. Dimensionality reduction can remove redundant features, such as the highly correlated pair ‘pdays’ and ‘previous’, to prevent multicollinearity. Feature engineering could involve creating new variables or transforming integer variables into categorical groups based on frequency. Data transformation is needed to normalize numerical features like balance and duration. The dataset is also imbalanced, with far fewer “yes” responses (521 of 4,521), so oversampling or adjusting class weights in the models will be important.
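As one illustration of the class-weight idea (a hedged sketch, not a tuned solution, using an illustrative subset of predictors): glm() accepts per-observation weights, so the minority “yes” class can be upweighted in proportion to the imbalance.

# Upweight the minority class roughly by the inverse class frequency
# (glm warns about non-integer weights with a binomial family; the fit still runs)
w <- ifelse(bank$y == "yes", sum(bank$y == "no") / sum(bank$y == "yes"), 1)
log_model_w <- glm(y ~ duration + campaign + poutcome + contact + month,
                   data = bank, family = binomial, weights = w)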
Considering the performance trade-offs, logistic regression is preferable due to its interpretability, scalability, and better handling of the dataset's characteristics. Since labels exist in the data, supervised learning models were selected, and that choice aligns with the goal of binary classification. If the dataset were smaller, KNN could become a more viable option given its effectiveness on small samples. To optimize model performance, several pre-processing steps are essential: highly correlated variables should be removed to prevent redundancy and multicollinearity; feature engineering can enhance interpretability, such as transforming duration into categorical bins (see the sketch below); and class imbalance should be addressed with oversampling techniques like SMOTE or class weights in logistic regression to improve predictive performance and fairness. The analysis highlights that duration, previous contacts, and specific months are strong indicators of customer response. While logistic regression performed best overall, better handling of class imbalance and further feature engineering could improve the model. Pre-processing plays a crucial role in predictive accuracy, especially by addressing data imbalance and ensuring proper feature selection.
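For the binning idea above, a minimal sketch (the breakpoints are illustrative assumptions, not tuned values):

# Illustrative duration bins; cut() uses right-closed intervals by default
bank <- bank %>%
  mutate(duration_bin = cut(duration,
                            breaks = c(0, 120, 300, 600, Inf),
                            labels = c("short", "medium", "long", "very_long")))
table(bank$duration_bin, bank$y)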