Overview

This is a banking and marketing dataset that was created by Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL).It is a binary classification problem with the goal of predicting if the client will subscribe a bank term deposit (variable y).

## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## As of tidytable v0.9.0 dotless versions of functions are exported.
## You can now use `arrange()`/`mutate()`/etc. directly.
## Warning: tidytable was loaded after dplyr.
## This can lead to most dplyr functions being overwritten by tidytable functions.
## 
## Attaching package: 'tidytable'
## The following objects are masked from 'package:dplyr':
## 
##     across, add_count, add_tally, anti_join, arrange, between,
##     bind_cols, bind_rows, c_across, case_when, coalesce, count,
##     cume_dist, cur_column, cur_data, cur_group_id, cur_group_rows,
##     dense_rank, desc, distinct, filter, first, full_join, group_by,
##     group_cols, group_split, group_vars, if_all, if_any, if_else,
##     inner_join, is_grouped_df, lag, last, lead, left_join, min_rank,
##     mutate, n, n_distinct, na_if, nest_by, nest_join, nth,
##     percent_rank, pull, recode, relocate, rename, rename_with,
##     right_join, row_number, rowwise, select, semi_join, slice,
##     slice_head, slice_max, slice_min, slice_sample, slice_tail,
##     summarise, summarize, tally, top_n, transmute, ungroup
## The following objects are masked from 'package:stats':
## 
##     dt, filter, lag
## The following object is masked from 'package:base':
## 
##     %in%
## corrplot 0.92 loaded
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Loading required package: lattice
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:tidytable':
## 
##     slice
## The following object is masked from 'package:dplyr':
## 
##     slice

Data Exploration

This dataset consist of 11 character variables and 10 numeric variables. There are no missing values.

skim_variable n_missing complete_rate numeric.mean numeric.sd
job 0 1 NA NA
marital 0 1 NA NA
education 0 1 NA NA
default 0 1 NA NA
housing 0 1 NA NA
loan 0 1 NA NA
contact 0 1 NA NA
month 0 1 NA NA
day_of_week 0 1 NA NA
poutcome 0 1 NA NA
y 0 1 NA NA
age 0 1 40.0274190 10.4320921
duration 0 1 259.0187757 260.5061542
campaign 0 1 2.5591273 2.7619762
pdays 0 1 962.1262180 187.7785545
previous 0 1 0.1741931 0.4988711
emp.var.rate 0 1 0.0836878 1.5702098
cons.price.idx 0 1 93.5779597 0.5793179
cons.conf.idx 0 1 -40.5246998 4.6162954
euribor3m 0 1 3.6237152 1.7332399
nr.employed 0 1 5167.1613577 72.1747491

Balance

The difference between those not enrolling into the services and those that due is heavily imbalanced. This will make predicting those that enroll more difficult.

Unique Values in Numeric Fields

From the unique values in the numeric fields we can see that several variables can be grouped. Such as cons.price.idx and cons.conf.idx. The previous variable looks like it should either be grouped as 0’s and 1’s signifying previous conversations or simply factored into groups of the number of conversations.

Correlations

From the correlation plot we can see that the emp.var.rate variable is highly correlated to three other variables, cons.price.idx, euribor3m and nr.employed. We also see that euribor3m and nr.employed are highly correlated with each other.

Due to this correlation both the emp.var.rate and euribor3m variables should be removed to limit the risk of multicollinearity.

Data Preparation

Remove correlated variables

og_training <- subset(og_training, select=-c(emp.var.rate, euribor3m))

og_testing <- subset(og_testing, select=-c(emp.var.rate, euribor3m))

Turn y from ‘no’/‘yes’ to 0,1

og_training$y <- ifelse(og_training$y == "no",0,1)

og_testing$y <- ifelse(og_testing$y == "no",0,1)

Grouping

The cons.price.idx variable has a range from 92.2 to 94.77. The histogram shows us that groups from 92 to 93, 93 to 94 and everything greater than 94, should be a relatively normal looking distribution of variables.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   92.20   93.08   93.80   93.58   93.99   94.77

changed_training$cons_price_group <- ifelse(as.numeric(changed_training$cons.price.idx) >= 94,2,ifelse(as.numeric(changed_training$cons.price.idx) <94 & as.numeric(changed_training$cons.price.idx) >=93,1,0))
changed_training$cons_price_group <- as.character(changed_training$cons_price_group)



changed_testing <- og_testing
changed_testing$cons_price_group <- ifelse(as.numeric(changed_testing$cons.price.idx) >= 94,2,ifelse(as.numeric(changed_testing$cons.price.idx) <94 & as.numeric(changed_testing$cons.price.idx) >=93,1,0))
changed_testing$cons_price_group <- as.character(changed_testing$cons_price_group)

The cons.conf.idx variable has a range between -50.8 to 26.9. It can be broken down into groups of five.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -50.80  -42.70  -41.80  -40.52  -36.40  -26.90
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

changed_training$cons_confs_group <- ifelse(as.numeric(changed_training$cons.conf.idx) <= (-45),3,ifelse(as.numeric(changed_training$cons.conf.idx) >= -45 & as.numeric(changed_training$cons.conf.idx)<=(-40),2, ifelse(as.numeric(changed_training$cons.conf.idx) >= -40 & as.numeric(changed_training$cons.conf.idx)<=(-35),1,0)))
changed_training$cons_confs_group <- as.character(changed_training$cons_confs_group)

changed_testing$cons_confs_group <- ifelse(as.numeric(changed_testing$cons.conf.idx) <= (-45),3,ifelse(as.numeric(changed_testing$cons.conf.idx) >= -45 & as.numeric(changed_testing$cons.conf.idx)<=(-40),2, ifelse(as.numeric(changed_testing$cons.conf.idx) >= -40 & as.numeric(changed_testing$cons.conf.idx)<=(-35),1,0)))
changed_testing$cons_confs_group <- as.character(changed_testing$cons_confs_group)

Factoring

All the character variables will be turned into dummy variables for each respective level in the group.

library(tidytable)
dummy <- get_dummies(changed_training)
dummy <- subset(dummy,select=-c(job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome,previous,cons.conf.idx,cons.price.idx,cons_confs_group,cons_price_group))
remove(changed_training)

dummy_testing <- get_dummies.(changed_testing)
dummy_testing <- subset(dummy_testing,select=-c(job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome,previous,cons.conf.idx,cons.price.idx,cons_confs_group,cons_price_group))
remove(changed_testing)

There have been problems with the variables not matching between the training and testing data. With identical I can test to make sure that all the variables are the same in both datasets.

identical(names(dummy),names(dummy_testing))
## [1] FALSE

Unbalance

ggplot(dummy_smote$data,aes(x=y))+
  geom_histogram(stat='count',fill='blue')

Training and Testing sets

set.seed(93384)
trainIndex <- createDataPartition(dummy_smote$data$y, p = .65,
                                  list = FALSE,
                                  times = 1,
                                  
                                  )

ftrain <-dummy_smote$data[trainIndex,]
ftrain <- subset(ftrain, select=-c(class))
ftest <- dummy_smote$data[-trainIndex,]
ftest <- subset(ftest, select=-c(class))


train <- xgb.DMatrix(data = data.matrix(ftrain[,-c(6)]), label = ftrain$y)

validate <- xgb.DMatrix(data = data.matrix(ftest[,-c(6)]), label = ftest$y)

t_label <- dummy_testing$y
true_test <- xgb.DMatrix(data = data.matrix(dummy_testing[,-c(6)]), label = t_label)
identical(names(ftrain),names(dummy_testing))
## [1] FALSE

Building Models

Model 1 GLM basic

glm_ini <- glm(y ~ ., ftrain,family='binomial')
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_ini)
## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = ftrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.3793  -0.1460   0.4776   3.0082  
## 
## Coefficients: (12 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    1.197e+11  1.542e+12   0.078 0.938139    
## age                           -1.551e-03  2.190e-03  -0.708 0.478918    
## duration                       6.959e-03  8.814e-05  78.950  < 2e-16 ***
## campaign                      -7.011e-02  1.111e-02  -6.313 2.73e-10 ***
## pdays                         -8.546e-04  2.305e-04  -3.707 0.000210 ***
## nr.employed                   -1.607e-02  5.020e-04 -32.017  < 2e-16 ***
## job_admin.                    -5.477e-01  2.333e-01  -2.348 0.018894 *  
## `job_blue-collar`             -8.227e-01  2.357e-01  -3.491 0.000482 ***
## job_entrepreneur              -7.674e-01  2.518e-01  -3.048 0.002307 ** 
## job_housemaid                 -4.787e-01  2.600e-01  -1.841 0.065619 .  
## job_management                -7.082e-01  2.405e-01  -2.945 0.003234 ** 
## job_retired                   -1.097e-01  2.441e-01  -0.449 0.653195    
## `job_self-employed`           -9.077e-01  2.529e-01  -3.590 0.000331 ***
## job_services                  -6.134e-01  2.393e-01  -2.563 0.010362 *  
## job_student                   -3.373e-01  2.530e-01  -1.333 0.182462    
## job_technician                -6.127e-01  2.362e-01  -2.595 0.009471 ** 
## job_unemployed                -3.042e-01  2.548e-01  -1.194 0.232463    
## job_unknown                           NA         NA      NA       NA    
## marital_divorced              -3.489e-01  3.947e-01  -0.884 0.376685    
## marital_married               -2.479e-01  3.910e-01  -0.634 0.526035    
## marital_single                -2.079e-01  3.917e-01  -0.531 0.595586    
## marital_unknown                       NA         NA      NA       NA    
## education_basic.4y            -3.092e-01  1.168e-01  -2.646 0.008139 ** 
## education_basic.6y            -8.987e-02  1.310e-01  -0.686 0.492742    
## education_basic.9y            -2.131e-01  1.104e-01  -1.931 0.053468 .  
## education_high.school         -1.775e-01  1.025e-01  -1.731 0.083481 .  
## education_illiterate           2.943e-01  8.039e-01   0.366 0.714247    
## education_professional.course  7.943e-02  1.111e-01   0.715 0.474556    
## education_university.degree    1.180e-01  1.018e-01   1.159 0.246486    
## education_unknown                     NA         NA      NA       NA    
## default_no                    -1.197e+11  1.542e+12  -0.078 0.938139    
## default_unknown               -1.197e+11  1.542e+12  -0.078 0.938139    
## default_yes                   -1.197e+11  1.542e+12  -0.078 0.938139    
## housing_no                    -9.701e-02  3.874e-02  -2.504 0.012275 *  
## housing_unknown                1.740e-01  1.349e-01   1.290 0.197095    
## housing_yes                           NA         NA      NA       NA    
## loan_no                        1.677e-01  5.362e-02   3.127 0.001764 ** 
## loan_unknown                          NA         NA      NA       NA    
## loan_yes                              NA         NA      NA       NA    
## contact_cellular              -1.739e-01  6.816e-02  -2.551 0.010750 *  
## contact_telephone                     NA         NA      NA       NA    
## month_apr                      2.166e+00  2.393e-01   9.051  < 2e-16 ***
## month_aug                      8.318e-01  1.295e-01   6.423 1.34e-10 ***
## month_dec                     -1.543e-01  2.365e-01  -0.653 0.514064    
## month_jul                      7.075e-02  1.414e-01   0.500 0.616903    
## month_jun                     -3.639e-01  1.585e-01  -2.295 0.021722 *  
## month_mar                      2.789e+00  1.932e-01  14.436  < 2e-16 ***
## month_may                     -8.181e-01  1.439e-01  -5.686 1.30e-08 ***
## month_nov                     -1.778e-02  1.407e-01  -0.126 0.899490    
## month_oct                      1.616e+00  1.635e-01   9.887  < 2e-16 ***
## month_sep                             NA         NA      NA       NA    
## day_of_week_fri               -1.168e-01  6.089e-02  -1.918 0.055161 .  
## day_of_week_mon               -2.461e-01  5.970e-02  -4.123 3.74e-05 ***
## day_of_week_thu               -2.416e-01  5.892e-02  -4.101 4.11e-05 ***
## day_of_week_tue               -8.942e-02  5.938e-02  -1.506 0.132124    
## day_of_week_wed                       NA         NA      NA       NA    
## poutcome_failure              -1.199e+00  2.394e-01  -5.008 5.50e-07 ***
## poutcome_nonexistent          -6.285e-01  2.461e-01  -2.554 0.010647 *  
## poutcome_success                      NA         NA      NA       NA    
## cons_price_group_0             1.706e+00  9.806e-02  17.399  < 2e-16 ***
## cons_price_group_1             7.952e-02  1.441e-01   0.552 0.581005    
## cons_price_group_2                    NA         NA      NA       NA    
## cons_confs_group_0             9.579e-02  1.067e-01   0.898 0.369105    
## cons_confs_group_1             1.020e+00  1.585e-01   6.435 1.23e-10 ***
## cons_confs_group_2             1.969e+00  1.642e-01  11.992  < 2e-16 ***
## cons_confs_group_3                    NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 46544  on 33651  degrees of freedom
## Residual deviance: 21797  on 33598  degrees of freedom
## AIC: 21905
## 
## Number of Fisher Scoring iterations: 25
pred_glm <- predict(glm_ini, ftest,type='response')

Ideal Binomial Threshold function

Since the binomial regression produces an estimation of probability, it is important to find the threshold that produces the best overall results.

There is much debate about which statistics from the confusion matrix is the best at determining model quality. In my opinion the best model would predict the ratio of true positives and ratio of true negatives equally well. As it limits the cost of being wrong.

The purpose of the ideal_binomial_threshold function is to find the threshold value that returns the smallest difference between the sensitivity and the specificity. This difference signifies the threshold value that balances the ratios of true positive predictions and the true negative predictions. Thus informing us of the threshold that determines the most balanced predictive power of the model.

ideal_binomial_threshold <- function(predictions,y){
  
  thresholds <- data.frame(matrix(ncol=8,nrow=501))
  colnames(thresholds) <- c('Threshold','Balanced_Accuracy','Sensitivity','Specificity',
                            'Sen_Spec_Diff')
  index=1
  
  for(i in seq(.25,.75,.001)){
      
      binom <- ifelse(predictions>i,1,0)
      cM <- confusionMatrix(as.factor(binom),as.factor(y))
      
      thresholds$Threshold[index] <- i
      thresholds$Balanced_Accuracy[index] <- as.numeric(cM$byClass['Balanced Accuracy'])
      thresholds$Sensitivity[index] <-as.numeric(cM$byClass['Sensitivity'])
      thresholds$Specificity[index] <-as.numeric(cM$byClass['Specificity'])
      sen<- as.numeric(cM$byClass['Sensitivity'])
      spec <- as.numeric(cM$byClass['Specificity'])
      sen_spec_diff <- ifelse(sen>= spec,(sen-spec), (spec-sen))
      thresholds$Sen_Spec_Diff[index] <- sen_spec_diff
      index <- index+1
  }

  min_thresh <- thresholds[which.min(thresholds$Sen_Spec_Diff),]
  return(min_thresh)
}

thresh <- ideal_binomial_threshold(pred_glm,ftest$y)
thresh
##     Threshold Balanced_Accuracy Sensitivity Specificity Sen_Spec_Diff NA NA NA
## 255     0.504         0.8733997   0.8732877   0.8735117  0.0002240581 NA NA NA
pred_glm_ideal <- ifelse(pred_glm >thresh$Threshold[1],1,0)
cM <- confusionMatrix(factor(pred_glm_ideal),factor(ftest$y))
cM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8415 1073
##          1 1221 7410
##                                           
##                Accuracy : 0.8734          
##                  95% CI : (0.8685, 0.8782)
##     No Information Rate : 0.5318          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.746           
##                                           
##  Mcnemar's Test P-Value : 0.002147        
##                                           
##             Sensitivity : 0.8733          
##             Specificity : 0.8735          
##          Pos Pred Value : 0.8869          
##          Neg Pred Value : 0.8585          
##              Prevalence : 0.5318          
##          Detection Rate : 0.4644          
##    Detection Prevalence : 0.5236          
##       Balanced Accuracy : 0.8734          
##                                           
##        'Positive' Class : 0               
## 

GLM Model 1 Analysis

With a threshold of 0.504 the model has a balanced accuracy score of 0.8733997. The AIC score is 2.1905262^{4}.

XGBoosting Models

Tuning the XGBoost Parameters

In order for the xgboost model to work at its maximum ability several parameters need to be selected. To select the ideal parameters caret has a package that tunes the parameters, which gives the best parameter set up.

Initial Parameters

To get a rough idea of the nrounds needed to be used the xgb.cv function will be used with a basic parameters set up.We will test a maximum of 1500 rounds.

params <- list(
  
  booster            = "gbtree",   
  
  eta                = 0.1,              

  max_depth          = 5,
  
  gamma              = 0,

  subsample          = 1,                 

  colsample_bytree   = 1,                

  colsample_bylevel  = 1,     
  
  min_child_weight   = 1,
  
  objective          ="binary:logistic",   #

  eval_metric        = "error"
  
)

xgbcv <- xgb.cv( params = params, data = train, nrounds = 1500, nfold=5, print_every_n = 100, early_stop_rounds = 1000,verbose=0)
min_error = min(xgbcv$evaluation_log[, test_error_mean])
min_error_index = which.min(xgbcv$evaluation_log[, test_error_mean])
print(min_error)
## [1] 0.05170574
print(min_error_index)
## [1] 137

The basic parameters shows us that the number of nrounds should be around 137

In depth Tuning

With the caret package you can tune all the parameters at the same time, but for each choice given, the time it takes to return a tuning is increased.

For this model I will have three choices for each parameter. Max depth will test depths of 4,7 and 10. Eta will test values of .05,.1 and .3. Colsample_bytree will test .5,.75 and .9. Subsample will test .5,.75 and .9. Gamma will test 0,1 and 5.

tune_grid <- expand.grid(
  nrounds = as.numeric(min_error_index),
  max_depth = c(4,7,10),
  eta = c(.05,.1,0.3),
  colsample_bytree = .75,
  subsample = .9,
  gamma = c(0,1,5),
  min_child_weight =1
)
xgb_tune <- train(
  x = ftrain[,-c(6)],
  y = factor(ftrain$y),
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = 0
)
plot(xgb_tune)

Best Tune

The bestTune variable of the xgb_tune will give us the best tuned parameters.

xgb_tune$bestTune
##   nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 7     137        10 0.05     0             0.75                1       0.9

With the tuned parameters I will retest the number of rounds needed and retest.

tuned_params<- list(
  
  booster            = "gbtree",   
  
  eta                = xgb_tune$bestTune$eta,              

  max_depth          = xgb_tune$bestTune$max_depth,                

  gamma             = xgb_tune$bestTune$gamma,
  
  subsample          = xgb_tune$bestTune$subsample,                 

  colsample_bytree   = xgb_tune$bestTune$colsample_bytree,                

  
  min_child_weight   = xgb_tune$bestTune$min_child_weight,
  

  objective          ="binary:logistic",   #

  eval_metric        = "error"
  
)

xgbcv2 <- xgb.cv( params = params, data = train, nrounds = as.numeric(min_error_index), nfold=5, print_every_n = 100, early_stop_rounds = 1000)
## [1]  train-error:0.103798+0.001789   test-error:0.106650+0.002207 
## [101]    train-error:0.040495+0.000497   test-error:0.051914+0.001609 
## [137]    train-error:0.037130+0.000317   test-error:0.052300+0.001034
min_error2 = min(xgbcv2$evaluation_log[, test_error_mean])
min_error_index2 = which.min(xgbcv2$evaluation_log[, test_error_mean])
print(min_error)
## [1] 0.05170574
print(min_error_index)
## [1] 137
xgb_model <- xgb.train(tuned_params, train, nrounds = as.numeric(min_error_index))
xgb_pred <- predict(xgb_model, validate)

thresh2 <- ideal_binomial_threshold(xgb_pred,ftest$y)
xgb_pred_prob <- ifelse(xgb_pred >=thresh2$Threshold[1],1,0 )
cM2<-confusionMatrix(as.factor(xgb_pred_prob),as.factor(ftest$y))
cM2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9170  408
##          1  466 8075
##                                           
##                Accuracy : 0.9518          
##                  95% CI : (0.9485, 0.9548)
##     No Information Rate : 0.5318          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9032          
##                                           
##  Mcnemar's Test P-Value : 0.05385         
##                                           
##             Sensitivity : 0.9516          
##             Specificity : 0.9519          
##          Pos Pred Value : 0.9574          
##          Neg Pred Value : 0.9454          
##              Prevalence : 0.5318          
##          Detection Rate : 0.5061          
##    Detection Prevalence : 0.5286          
##       Balanced Accuracy : 0.9518          
##                                           
##        'Positive' Class : 0               
## 

XGB Model Analysis

The xgboost model performed ~8% better than the glm models with a ~95% prediction rate. The sensitivity and specificity scores are both ~95%. All these scores are .43 points better than the no information rate. Both positive and negative prediction rates are ~95%.

library(ggplot2)
library(scales)
og_training1=og_training
var_used<-og_training1[c("age","job","marital","education","housing")] #selecting the data for segmentation

customer1<-data.matrix(var_used[c("job","marital","education","housing")])

var_used<-data.frame(var_used,customer1) #Data Coversion
str(var_used)
## 'data.frame':    30891 obs. of  9 variables:
##  $ age        : int  57 37 40 45 41 24 25 41 25 29 ...
##  $ job        : chr  "services" "services" "admin." "services" ...
##  $ marital    : chr  "married" "married" "married" "married" ...
##  $ education  : chr  "high.school" "high.school" "basic.6y" "basic.9y" ...
##  $ housing    : chr  "no" "yes" "no" "no" ...
##  $ job.1      : int  8 8 1 8 2 10 8 2 8 2 ...
##  $ marital.1  : int  2 2 2 2 2 3 3 2 3 3 ...
##  $ education.1: int  4 4 2 3 8 6 4 8 4 4 ...
##  $ housing.1  : int  1 3 1 1 1 3 3 1 3 1 ...
job<-unique(var_used[c("job","job.1")])

marital<-unique(var_used[c("marital","marital.1")])

education<-unique(var_used[c("education","education.1")])

housing<-unique(var_used[c("housing","housing.1")])
x<-c("age","job.1","marital.1","education.1","housing.1")
x
## [1] "age"         "job.1"       "marital.1"   "education.1" "housing.1"
# Clustering Using K-Means

set.seed(100)

# K-meand Function to create 5 Cluster with 25 random scenario 
segmentation<-kmeans(x=var_used[x], center=8, nstart=25)

# Analyze Cluster Size Output
og_training$cluster<-segmentation$cluster
pp <- ggplot(data=og_training, aes(x=cluster,fill=y)) + 
      geom_bar(aes(y = (..count..)/sum(..count..)))
      
pp +  labs(x="Cluster", y = "Percent")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

var_used %>%
  as_tibble() %>%
  mutate(cluster = segmentation$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(marital, education, color = factor(cluster), label = job)) +
  geom_text()

datacharacteristics <-og_training[which(og_training$cluster == 4),]
a <- ggplot(data=datacharacteristics, aes(x=job)) + 
      geom_bar(aes(y = (..count..)/sum(..count..)))
      
a +  labs(x="Job", y = "Percent") +theme(axis.text.x = element_text(angle=90,hjust = 0))

b <- ggplot(data=datacharacteristics, aes(x=education)) + 
      geom_bar(aes(y = (..count..)/sum(..count..)))

b +  labs(x="Education", y = "Percent") +theme(axis.text.x = element_text(angle=90,hjust = 0))

# Marital Status

c <- ggplot(data=datacharacteristics, aes(x=marital)) + 
      geom_bar(aes(y = (..count..)/sum(..count..)))
      
c +  labs(x="Marital", y = "Percent") +theme(axis.text.x = element_text(angle=90,hjust = 0))

# Age

d <- ggplot(data=datacharacteristics, aes(x=age)) + 
      geom_bar(aes(y = (..count..)/sum(..count..)))
      
d +  labs(x="Age", y = "Percent") +theme(axis.text.x = element_text(angle=90,hjust = 0))

The most profitable customer comes from Cluster Number 4, This cluster consists on profession as admin and blue-collar, university degree for last education, who are majority married, if the telemarketer would to focus on these attributes they would have a higher success rate in getting the “Yes” feedback

Citations

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.001

Appendix