Objective: To apply machine learning techniques to this data set and identify the most effective tactics for persuading more customers to subscribe to the bank's term deposit in the next campaign. The focus is on exploratory data analysis (EDA) to build a reliable training model, since "better data beats better algorithms." EDA helps identify gaps and class imbalances, improve data quality, engineer better features, and develop a deep understanding of the data before model training.

1. Exploratory Data Analysis

library(dplyr)      # data manipulation, glimpse
library(tidyr)      # pivot_longer
library(ggplot2)    # plotting
library(ggcorrplot) # correlation heatmap
library(caret)      # createDataPartition, confusionMatrix
library(class)      # knn

# Assumes the data has already been read into `bank`
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age       <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job       <chr> "unemployed", "services", "management", "management", "blue-…
## $ marital   <chr> "married", "married", "single", "married", "married", "singl…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
## $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
## $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
## $ day       <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
## $ duration  <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign  <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays     <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous  <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

# Convert categorical variables to factors

bank <- bank %>%
  mutate(y = factor(y, levels = c("no", "yes")),  
         job = as.factor(job), 
         marital = as.factor(marital),
         education = as.factor(education),
         default = as.factor(default),
         housing = as.factor(housing),
         loan = as.factor(loan),
         contact = as.factor(contact),
         month = as.factor(month),
         poutcome = as.factor(poutcome))

# Summary statistics

summary(bank)
##       age                 job          marital         education    default   
##  Min.   :19.00   management :969   divorced: 528   primary  : 678   no :4445  
##  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306   yes:  76  
##  Median :39.00   technician :768   single  :1196   tertiary :1350             
##  Mean   :41.17   admin.     :478                   unknown  : 187             
##  3rd Qu.:49.00   services   :417                                              
##  Max.   :87.00   retired    :230                                              
##                  (Other)    :713                                              
##     balance      housing     loan           contact          day       
##  Min.   :-3313   no :1962   no :3830   cellular :2896   Min.   : 1.00  
##  1st Qu.:   69   yes:2559   yes: 691   telephone: 301   1st Qu.: 9.00  
##  Median :  444                         unknown  :1324   Median :16.00  
##  Mean   : 1423                                          Mean   :15.92  
##  3rd Qu.: 1480                                          3rd Qu.:21.00  
##  Max.   :71188                                          Max.   :31.00  
##                                                                        
##      month         duration       campaign          pdays       
##  may    :1398   Min.   :   4   Min.   : 1.000   Min.   : -1.00  
##  jul    : 706   1st Qu.: 104   1st Qu.: 1.000   1st Qu.: -1.00  
##  aug    : 633   Median : 185   Median : 2.000   Median : -1.00  
##  jun    : 531   Mean   : 264   Mean   : 2.794   Mean   : 39.77  
##  nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000   3rd Qu.: -1.00  
##  apr    : 293   Max.   :3025   Max.   :50.000   Max.   :871.00  
##  (Other): 571                                                   
##     previous          poutcome      y       
##  Min.   : 0.0000   failure: 490   no :4000  
##  1st Qu.: 0.0000   other  : 197   yes: 521  
##  Median : 0.0000   success: 129             
##  Mean   : 0.5426   unknown:3705             
##  3rd Qu.: 0.0000                            
##  Max.   :25.0000                            
## 
# Numeric columns only
n_bank <- bank %>%
  select(where(is.numeric))
  
summary(n_bank)
##       age           balance           day           duration   
##  Min.   :19.00   Min.   :-3313   Min.   : 1.00   Min.   :   4  
##  1st Qu.:33.00   1st Qu.:   69   1st Qu.: 9.00   1st Qu.: 104  
##  Median :39.00   Median :  444   Median :16.00   Median : 185  
##  Mean   :41.17   Mean   : 1423   Mean   :15.92   Mean   : 264  
##  3rd Qu.:49.00   3rd Qu.: 1480   3rd Qu.:21.00   3rd Qu.: 329  
##  Max.   :87.00   Max.   :71188   Max.   :31.00   Max.   :3025  
##     campaign          pdays           previous      
##  Min.   : 1.000   Min.   : -1.00   Min.   : 0.0000  
##  1st Qu.: 1.000   1st Qu.: -1.00   1st Qu.: 0.0000  
##  Median : 2.000   Median : -1.00   Median : 0.0000  
##  Mean   : 2.794   Mean   : 39.77   Mean   : 0.5426  
##  3rd Qu.: 3.000   3rd Qu.: -1.00   3rd Qu.: 0.0000  
##  Max.   :50.000   Max.   :871.00   Max.   :25.0000

The summary statistics show that client ages range from 19 to 87 with a median of 39. Most clients have low account balances, with a median of $444. Contact duration varies widely, and the number of campaign contacts is generally low. Many clients had no prior contact before this campaign: the median pdays value of -1 indicates no previous contact.
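
To quantify that last point, a quick check (base R, on the same `bank` data) of the share of never-contacted clients and of the overall class balance:

# Share of clients with no previous contact, and the outcome distribution
mean(bank$pdays == -1)
prop.table(table(bank$y))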

# Categorical (non-numeric) columns only
c_bank <- bank %>% select(-where(is.numeric))
summary(c_bank)
##           job          marital         education    default    housing   
##  management :969   divorced: 528   primary  : 678   no :4445   no :1962  
##  blue-collar:946   married :2797   secondary:2306   yes:  76   yes:2559  
##  technician :768   single  :1196   tertiary :1350                        
##  admin.     :478                   unknown  : 187                        
##  services   :417                                                         
##  retired    :230                                                         
##  (Other)    :713                                                         
##   loan           contact         month         poutcome      y       
##  no :3830   cellular :2896   may    :1398   failure: 490   no :4000  
##  yes: 691   telephone: 301   jul    : 706   other  : 197   yes: 521  
##             unknown  :1324   aug    : 633   success: 129             
##                              jun    : 531   unknown:3705             
##                              nov    : 389                            
##                              apr    : 293                            
##                              (Other): 571

# Check for NAs

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
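
Although there are no NAs, several factors encode missingness as the literal level "unknown" (education, contact, poutcome, and job). A quick count of these implicit missing values:

# "unknown" acts as an implicit missing-value code in several factors
bank %>%
  summarise(across(where(is.factor), ~ sum(. == "unknown")))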

# Numeric distributions

n_bank %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numeric Variables")

# Categorical distributions

c_bank %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar(fill = "blue", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Categorical Variables") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Reattach the outcome (already a factor) to the numeric subset for group comparisons
n_bank <- n_bank %>%
  mutate(y = bank$y)


n_bank_long <- n_bank %>%
  pivot_longer(cols = -y, names_to = "variable", values_to = "value")


ggplot(n_bank_long, aes(x = y, y = value, fill = y)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Distribution of Numeric Variables by Yes/No in Y",
       x = "Y (Outcome)", y = "Value") +
  theme_minimal()

Age and balance show no strong separation between subscribers and non-subscribers of term deposits, although high-balance outliers exist. The number of campaign contacts is skewed, with non-subscribers receiving more contacts, including higher outliers. Subscription likelihood increases with longer call durations, emphasizing the importance of engagement. Pdays is heavily skewed, with most clients not contacted recently, though lower values slightly favor subscriptions. The ‘previous’ variable is generally higher for subscribers, indicating that repeated engagement may improve success rates.
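
These visual impressions can be cross-checked numerically, for example by comparing call duration and prior contacts across the two outcome groups:

# Median duration and mean prior contacts by outcome
bank %>%
  group_by(y) %>%
  summarise(median_duration = median(duration),
            mean_previous = mean(previous))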

# Reattach the outcome (already a factor) to the categorical subset
c_bank <- c_bank %>%
  mutate(y = bank$y)


c_bank_long <- c_bank %>%
  pivot_longer(cols = -y, names_to = "variable", values_to = "value") %>%
  count(variable, value, y)  


ggplot(c_bank_long, aes(x = value, y = n, fill = y)) +  
  geom_bar(stat = "identity", position = "dodge", alpha = 0.7) + 
  facet_wrap(~variable, scales = "free") +  
  labs(title = "Relationship Between Categorical Variables and Y (Yes/No)",
       x = "Category", y = "Count") +
  scale_fill_manual(values = c("yes" = "blue", "no" = "red")) + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The variables ‘poutcome’ and ‘contact’ show a strong influence on subscription likelihood. Clients with a ‘success’ outcome in a previous campaign are more likely to subscribe, and those contacted via ‘cellular’ also show a higher tendency to say ‘yes.’ In contrast, housing and personal loans exhibit a more balanced distribution between ‘yes’ and ‘no,’ indicating a weaker impact on the subscription decision.
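
The same effect can be expressed as subscription rates within each category (a quick sketch for the two strongest variables):

# Subscription rate by previous-campaign outcome and by contact type
bank %>% group_by(poutcome) %>% summarise(rate = mean(y == "yes"), n = n())
bank %>% group_by(contact) %>% summarise(rate = mean(y == "yes"), n = n())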

# Encode the outcome as 0/1 ("no" = 0, "yes" = 1) so it can enter the correlation matrix
n_bank <- n_bank %>%
  mutate(y = as.numeric(y) - 1)


cor_matrix <- cor(n_bank, use = "complete.obs")


ggcorrplot(cor_matrix, 
           method = "circle",  
           type = "lower",      
           lab = TRUE,       
           lab_size = 3,        
           colors = c("blue", "white", "red"), 
           title = "Correlation Heatmap",
           ggtheme = theme_minimal())  

The strongest correlation is between ‘pdays’ and ‘previous,’ indicating that clients who had prior interactions were also contacted more recently. ‘Duration’ has weak correlations with other variables, suggesting that call length is relatively independent of prior contacts. However, ‘duration’ shows the highest correlation with ‘y’, making it a significant predictor in a machine learning model. Most other correlations are low, indicating weak linear relationships among the numerical features.
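
For reference, the correlations of each numeric feature with the encoded outcome can be ranked directly from the same matrix:

# Correlations with y, in descending order (includes y's self-correlation of 1)
sort(cor_matrix[, "y"], decreasing = TRUE)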

2. Algorithm Selection

bank$y <- factor(bank$y, levels = c("no", "yes"))
log_model <- glm(y ~ age + job + marital + education + default + housing + loan + contact +  month + duration + campaign +  previous + poutcome, 
                 data = bank, 
                 family = binomial)


summary(log_model)
## 
## Call:
## glm(formula = y ~ age + job + marital + education + default + 
##     housing + loan + contact + month + duration + campaign + 
##     previous + poutcome, family = binomial, data = bank)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.186721   0.523917  -4.174 3.00e-05 ***
## age                -0.004354   0.007104  -0.613 0.539975    
## jobblue-collar     -0.404873   0.241543  -1.676 0.093700 .  
## jobentrepreneur    -0.277862   0.379569  -0.732 0.464140    
## jobhousemaid       -0.365479   0.417053  -0.876 0.380847    
## jobmanagement      -0.081810   0.240409  -0.340 0.733635    
## jobretired          0.603314   0.310290   1.944 0.051852 .  
## jobself-employed   -0.179264   0.351770  -0.510 0.610327    
## jobservices        -0.159256   0.272822  -0.584 0.559397    
## jobstudent          0.388569   0.372853   1.042 0.297342    
## jobtechnician      -0.198308   0.229813  -0.863 0.388185    
## jobunemployed      -0.664215   0.421633  -1.575 0.115178    
## jobunknown          0.482495   0.584645   0.825 0.409214    
## maritalmarried     -0.470554   0.173912  -2.706 0.006816 ** 
## maritalsingle      -0.302222   0.203251  -1.487 0.137030    
## educationsecondary  0.074675   0.201876   0.370 0.711452    
## educationtertiary   0.325503   0.232866   1.398 0.162169    
## educationunknown   -0.421668   0.357112  -1.181 0.237693    
## defaultyes          0.550208   0.430839   1.277 0.201580    
## housingyes         -0.281237   0.136920  -2.054 0.039973 *  
## loanyes            -0.637709   0.199550  -3.196 0.001395 ** 
## contacttelephone   -0.067554   0.233029  -0.290 0.771896    
## contactunknown     -1.364324   0.225196  -6.058 1.38e-09 ***
## monthaug           -0.368791   0.246289  -1.497 0.134292    
## monthdec            0.078006   0.653733   0.119 0.905019    
## monthfeb            0.024297   0.277512   0.088 0.930232    
## monthjan           -0.973771   0.375581  -2.593 0.009523 ** 
## monthjul           -0.750296   0.248382  -3.021 0.002522 ** 
## monthjun            0.398066   0.288706   1.379 0.167958    
## monthmar            1.414981   0.386185   3.664 0.000248 ***
## monthmay           -0.561520   0.229666  -2.445 0.014488 *  
## monthnov           -0.838923   0.271316  -3.092 0.001988 ** 
## monthoct            1.366844   0.328805   4.157 3.22e-05 ***
## monthsep            0.574524   0.407872   1.409 0.158957    
## duration            0.004219   0.000202  20.886  < 2e-16 ***
## campaign           -0.061363   0.027575  -2.225 0.026061 *  
## previous           -0.007144   0.037943  -0.188 0.850654    
## poutcomeother       0.497650   0.267911   1.858 0.063237 .  
## poutcomesuccess     2.427498   0.270028   8.990  < 2e-16 ***
## poutcomeunknown    -0.093078   0.217016  -0.429 0.667998    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3231.0  on 4520  degrees of freedom
## Residual deviance: 2177.8  on 4481  degrees of freedom
## AIC: 2257.8
## 
## Number of Fisher Scoring iterations: 6
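
Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to read (for example, poutcomesuccess corresponds to roughly exp(2.43) ≈ 11 times higher odds of subscribing):

# Odds ratios for the fitted coefficients
exp(coef(log_model))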
log_model_predictions <- predict(log_model, type = "response")

log_model_class <- ifelse(log_model_predictions > 0.5, "yes", "no")


confusionMatrix(as.factor(log_model_class), bank$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3913  337
##        yes   87  184
##                                           
##                Accuracy : 0.9062          
##                  95% CI : (0.8973, 0.9146)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 1.919e-06       
##                                           
##                   Kappa : 0.4188          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9782          
##             Specificity : 0.3532          
##          Pos Pred Value : 0.9207          
##          Neg Pred Value : 0.6790          
##              Prevalence : 0.8848          
##          Detection Rate : 0.8655          
##    Detection Prevalence : 0.9401          
##       Balanced Accuracy : 0.6657          
##                                           
##        'Positive' Class : no              
## 
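Given the class imbalance, the default 0.5 cutoff favors the majority ‘no’ class. Lowering the threshold trades some accuracy for better detection of ‘yes’ responders; the 0.3 below is illustrative rather than tuned:

# Illustrative lower cutoff; a proper choice would be tuned, e.g., on an ROC curve
log_model_class_30 <- ifelse(log_model_predictions > 0.3, "yes", "no")
confusionMatrix(as.factor(log_model_class_30), bank$y)
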
set.seed(123)


X <- n_bank[, c("age", "balance", "duration", "campaign", "previous")]
y <- as.factor(n_bank$y)  # Convert target variable to factor


train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]


knn_predictions <- knn(train = X_train, test = X_test, cl = y_train, k = 5)


confusionMatrix(knn_predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1149  131
##          1   51   25
##                                           
##                Accuracy : 0.8658          
##                  95% CI : (0.8465, 0.8835)
##     No Information Rate : 0.885           
##     P-Value [Acc > NIR] : 0.9866          
##                                           
##                   Kappa : 0.1516          
##                                           
##  Mcnemar's Test P-Value : 4.745e-09       
##                                           
##             Sensitivity : 0.9575          
##             Specificity : 0.1603          
##          Pos Pred Value : 0.8977          
##          Neg Pred Value : 0.3289          
##              Prevalence : 0.8850          
##          Detection Rate : 0.8473          
##    Detection Prevalence : 0.9440          
##       Balanced Accuracy : 0.5589          
##                                           
##        'Positive' Class : 0               
## 

Logistic regression and KNN are two acceptable machine learning algorithms for this dataset. Logistic regression provides easily interpretable results, computational efficiency, and probability scores, making it well suited to understanding the influence of features like call duration and previous campaign outcomes. However, it assumes a linear relationship between the predictors and the log-odds of the outcome, which may not always hold. In contrast, KNN, a non-parametric method, can capture more complex relationships in the data. Its disadvantages include sensitivity to irrelevant features, higher computational cost on large datasets, and poor performance on imbalanced data. Because the dataset is labeled, supervised learning algorithms like logistic regression and KNN are appropriate. The dataset’s structure, with both categorical and numerical features, aligns well with logistic regression, which performed better when comparing specificity (0.35 versus 0.16); note, though, that the logistic model was evaluated on its own training data while KNN was evaluated on a held-out test set, so the comparison should be read with some caution. If the dataset had fewer than 1,000 records, KNN might be more viable, as its performance is less affected by small sample sizes, whereas logistic regression might lack sufficient data to estimate coefficients reliably. Based on current performance, I would recommend logistic regression due to its interpretability and higher specificity.
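
One caveat on the KNN run above: knn() computes Euclidean distances on the raw features, so balance (which ranges into the tens of thousands) dominates age and campaign. A minimal sketch of rerunning it with standardized predictors, which often improves KNN on data like this:

# Standardize predictors; apply the training set's center/scale to the test set
X_train_sc <- scale(X_train)
X_test_sc  <- scale(X_test,
                    center = attr(X_train_sc, "scaled:center"),
                    scale  = attr(X_train_sc, "scaled:scale"))
knn_scaled <- knn(train = X_train_sc, test = X_test_sc, cl = y_train, k = 5)
confusionMatrix(knn_scaled, y_test)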

3. Preprocessing

For this dataset, pre-processing is needed to improve model performance. Dimensionality reduction can remove redundant features, such as the highly correlated pair pdays and previous, to prevent multicollinearity. Feature engineering could involve creating new variables or binning integer variables into categorical groups based on frequency. Data transformation is needed to normalize numerical features like balance and duration. Finally, the dataset is imbalanced, with far fewer "yes" responses, so oversampling or adjusting class weights in the models will be important.
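
A hedged sketch of two of these steps using caret (already loaded above for createDataPartition and confusionMatrix): center and scale the numeric columns, then balance the classes by random oversampling. SMOTE itself would require an additional package such as themis or smotefamily.

# Center/scale numeric columns (preProcess leaves factors untouched),
# then randomly oversample the minority class
pp <- preProcess(bank, method = c("center", "scale"))
bank_scaled <- predict(pp, bank)
bank_balanced <- upSample(x = bank_scaled %>% select(-y),
                          y = bank_scaled$y, yname = "y")
table(bank_balanced$y)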

4. Conclusion

Considering the performance trade-offs, logistic regression is preferable due to its interpretability, scalability, and better handling of this dataset’s characteristics. Since the data is labeled, supervised learning models were selected, and the choice aligns with the binary classification goal. If the dataset were smaller, KNN could become a more viable option because of its effectiveness on small samples. To optimize model performance, several pre-processing steps are essential: highly correlated variables should be removed to prevent redundancy and multicollinearity; feature engineering, such as transforming duration into categorical bins, can enhance interpretability; and class imbalance should be addressed with oversampling techniques like SMOTE or with class weights in logistic regression to improve predictive performance and fairness. The analysis highlights that duration, previous contacts, and specific months are strong indicators of customer response. While logistic regression performed best overall, better handling of class imbalance and further feature engineering could improve the model, making pre-processing a crucial lever for predictive accuracy.
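
As one concrete example of the class-weighting idea, glm() accepts a weights argument; the sketch below upweights ‘yes’ cases by the inverse class ratio and uses a few of the strongest predictors from the earlier fit for brevity (the weights are illustrative, and glm will warn about non-integer weights with family = binomial):

# Upweight the minority class by roughly the no/yes ratio (4000/521 ≈ 7.7)
w <- ifelse(bank$y == "yes", 4000 / 521, 1)
log_model_w <- glm(y ~ duration + poutcome + contact + month + campaign,
                   data = bank, family = binomial, weights = w)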