This is a banking and marketing dataset that was created by Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL).It is a binary classification problem with the goal of predicting if the client will subscribe a bank term deposit (variable y).
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## As of tidytable v0.9.0 dotless versions of functions are exported.
## You can now use `arrange()`/`mutate()`/etc. directly.
## Warning: tidytable was loaded after dplyr.
## This can lead to most dplyr functions being overwritten by tidytable functions.
##
## Attaching package: 'tidytable'
## The following objects are masked from 'package:dplyr':
##
## across, add_count, add_tally, anti_join, arrange, between,
## bind_cols, bind_rows, c_across, case_when, coalesce, count,
## cume_dist, cur_column, cur_data, cur_group_id, cur_group_rows,
## dense_rank, desc, distinct, filter, first, full_join, group_by,
## group_cols, group_split, group_vars, if_all, if_any, if_else,
## inner_join, is_grouped_df, lag, last, lead, left_join, min_rank,
## mutate, n, n_distinct, na_if, nest_by, nest_join, nth,
## percent_rank, pull, recode, relocate, rename, rename_with,
## right_join, row_number, rowwise, select, semi_join, slice,
## slice_head, slice_max, slice_min, slice_sample, slice_tail,
## summarise, summarize, tally, top_n, transmute, ungroup
## The following objects are masked from 'package:stats':
##
## dt, filter, lag
## The following object is masked from 'package:base':
##
## %in%
## corrplot 0.92 loaded
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Loading required package: lattice
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:tidytable':
##
## slice
## The following object is masked from 'package:dplyr':
##
## slice
This dataset consist of 11 character variables and 10 numeric variables. There are no missing values.
| skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd |
|---|---|---|---|---|
| job | 0 | 1 | NA | NA |
| marital | 0 | 1 | NA | NA |
| education | 0 | 1 | NA | NA |
| default | 0 | 1 | NA | NA |
| housing | 0 | 1 | NA | NA |
| loan | 0 | 1 | NA | NA |
| contact | 0 | 1 | NA | NA |
| month | 0 | 1 | NA | NA |
| day_of_week | 0 | 1 | NA | NA |
| poutcome | 0 | 1 | NA | NA |
| y | 0 | 1 | NA | NA |
| age | 0 | 1 | 40.0274190 | 10.4320921 |
| duration | 0 | 1 | 259.0187757 | 260.5061542 |
| campaign | 0 | 1 | 2.5591273 | 2.7619762 |
| pdays | 0 | 1 | 962.1262180 | 187.7785545 |
| previous | 0 | 1 | 0.1741931 | 0.4988711 |
| emp.var.rate | 0 | 1 | 0.0836878 | 1.5702098 |
| cons.price.idx | 0 | 1 | 93.5779597 | 0.5793179 |
| cons.conf.idx | 0 | 1 | -40.5246998 | 4.6162954 |
| euribor3m | 0 | 1 | 3.6237152 | 1.7332399 |
| nr.employed | 0 | 1 | 5167.1613577 | 72.1747491 |
The difference between those not enrolling into the services and those that due is heavily imbalanced. This will make predicting those that enroll more difficult.
From the unique values in the numeric fields we can see that several variables can be grouped. Such as cons.price.idx and cons.conf.idx. The previous variable looks like it should either be grouped as 0’s and 1’s signifying previous conversations or simply factored into groups of the number of conversations.
From the correlation plot we can see that the emp.var.rate variable is highly correlated to three other variables, cons.price.idx, euribor3m and nr.employed. We also see that euribor3m and nr.employed are highly correlated with each other.
Due to this correlation both the emp.var.rate and euribor3m variables should be removed to limit the risk of multicollinearity.
og_training$y <- ifelse(og_training$y == "no",0,1)
og_testing$y <- ifelse(og_testing$y == "no",0,1)
The cons.price.idx variable has a range from 92.2 to 94.77. The histogram shows us that groups from 92 to 93, 93 to 94 and everything greater than 94, should be a relatively normal looking distribution of variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 92.20 93.08 93.80 93.58 93.99 94.77
changed_training$cons_price_group <- ifelse(as.numeric(changed_training$cons.price.idx) >= 94,2,ifelse(as.numeric(changed_training$cons.price.idx) <94 & as.numeric(changed_training$cons.price.idx) >=93,1,0))
changed_training$cons_price_group <- as.character(changed_training$cons_price_group)
changed_testing <- og_testing
changed_testing$cons_price_group <- ifelse(as.numeric(changed_testing$cons.price.idx) >= 94,2,ifelse(as.numeric(changed_testing$cons.price.idx) <94 & as.numeric(changed_testing$cons.price.idx) >=93,1,0))
changed_testing$cons_price_group <- as.character(changed_testing$cons_price_group)
The cons.conf.idx variable has a range between -50.8 to 26.9. It can be broken down into groups of five.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -50.80 -42.70 -41.80 -40.52 -36.40 -26.90
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
changed_training$cons_confs_group <- ifelse(as.numeric(changed_training$cons.conf.idx) <= (-45),3,ifelse(as.numeric(changed_training$cons.conf.idx) >= -45 & as.numeric(changed_training$cons.conf.idx)<=(-40),2, ifelse(as.numeric(changed_training$cons.conf.idx) >= -40 & as.numeric(changed_training$cons.conf.idx)<=(-35),1,0)))
changed_training$cons_confs_group <- as.character(changed_training$cons_confs_group)
changed_testing$cons_confs_group <- ifelse(as.numeric(changed_testing$cons.conf.idx) <= (-45),3,ifelse(as.numeric(changed_testing$cons.conf.idx) >= -45 & as.numeric(changed_testing$cons.conf.idx)<=(-40),2, ifelse(as.numeric(changed_testing$cons.conf.idx) >= -40 & as.numeric(changed_testing$cons.conf.idx)<=(-35),1,0)))
changed_testing$cons_confs_group <- as.character(changed_testing$cons_confs_group)
All the character variables will be turned into dummy variables for each respective level in the group.
library(tidytable)
dummy <- get_dummies(changed_training)
dummy <- subset(dummy,select=-c(job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome,previous,cons.conf.idx,cons.price.idx,cons_confs_group,cons_price_group))
remove(changed_training)
dummy_testing <- get_dummies.(changed_testing)
dummy_testing <- subset(dummy_testing,select=-c(job,marital,education,default,housing,loan,contact,month,day_of_week,poutcome,previous,cons.conf.idx,cons.price.idx,cons_confs_group,cons_price_group))
remove(changed_testing)
There have been problems with the variables not matching between the training and testing data. With identical I can test to make sure that all the variables are the same in both datasets.
identical(names(dummy),names(dummy_testing))
## [1] FALSE
ggplot(dummy_smote$data,aes(x=y))+
geom_histogram(stat='count',fill='blue')
set.seed(93384)
trainIndex <- createDataPartition(dummy_smote$data$y, p = .65,
list = FALSE,
times = 1,
)
ftrain <-dummy_smote$data[trainIndex,]
ftrain <- subset(ftrain, select=-c(class))
ftest <- dummy_smote$data[-trainIndex,]
ftest <- subset(ftest, select=-c(class))
train <- xgb.DMatrix(data = data.matrix(ftrain[,-c(6)]), label = ftrain$y)
validate <- xgb.DMatrix(data = data.matrix(ftest[,-c(6)]), label = ftest$y)
t_label <- dummy_testing$y
true_test <- xgb.DMatrix(data = data.matrix(dummy_testing[,-c(6)]), label = t_label)
identical(names(ftrain),names(dummy_testing))
## [1] FALSE
glm_ini <- glm(y ~ ., ftrain,family='binomial')
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_ini)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = ftrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.3793 -0.1460 0.4776 3.0082
##
## Coefficients: (12 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.197e+11 1.542e+12 0.078 0.938139
## age -1.551e-03 2.190e-03 -0.708 0.478918
## duration 6.959e-03 8.814e-05 78.950 < 2e-16 ***
## campaign -7.011e-02 1.111e-02 -6.313 2.73e-10 ***
## pdays -8.546e-04 2.305e-04 -3.707 0.000210 ***
## nr.employed -1.607e-02 5.020e-04 -32.017 < 2e-16 ***
## job_admin. -5.477e-01 2.333e-01 -2.348 0.018894 *
## `job_blue-collar` -8.227e-01 2.357e-01 -3.491 0.000482 ***
## job_entrepreneur -7.674e-01 2.518e-01 -3.048 0.002307 **
## job_housemaid -4.787e-01 2.600e-01 -1.841 0.065619 .
## job_management -7.082e-01 2.405e-01 -2.945 0.003234 **
## job_retired -1.097e-01 2.441e-01 -0.449 0.653195
## `job_self-employed` -9.077e-01 2.529e-01 -3.590 0.000331 ***
## job_services -6.134e-01 2.393e-01 -2.563 0.010362 *
## job_student -3.373e-01 2.530e-01 -1.333 0.182462
## job_technician -6.127e-01 2.362e-01 -2.595 0.009471 **
## job_unemployed -3.042e-01 2.548e-01 -1.194 0.232463
## job_unknown NA NA NA NA
## marital_divorced -3.489e-01 3.947e-01 -0.884 0.376685
## marital_married -2.479e-01 3.910e-01 -0.634 0.526035
## marital_single -2.079e-01 3.917e-01 -0.531 0.595586
## marital_unknown NA NA NA NA
## education_basic.4y -3.092e-01 1.168e-01 -2.646 0.008139 **
## education_basic.6y -8.987e-02 1.310e-01 -0.686 0.492742
## education_basic.9y -2.131e-01 1.104e-01 -1.931 0.053468 .
## education_high.school -1.775e-01 1.025e-01 -1.731 0.083481 .
## education_illiterate 2.943e-01 8.039e-01 0.366 0.714247
## education_professional.course 7.943e-02 1.111e-01 0.715 0.474556
## education_university.degree 1.180e-01 1.018e-01 1.159 0.246486
## education_unknown NA NA NA NA
## default_no -1.197e+11 1.542e+12 -0.078 0.938139
## default_unknown -1.197e+11 1.542e+12 -0.078 0.938139
## default_yes -1.197e+11 1.542e+12 -0.078 0.938139
## housing_no -9.701e-02 3.874e-02 -2.504 0.012275 *
## housing_unknown 1.740e-01 1.349e-01 1.290 0.197095
## housing_yes NA NA NA NA
## loan_no 1.677e-01 5.362e-02 3.127 0.001764 **
## loan_unknown NA NA NA NA
## loan_yes NA NA NA NA
## contact_cellular -1.739e-01 6.816e-02 -2.551 0.010750 *
## contact_telephone NA NA NA NA
## month_apr 2.166e+00 2.393e-01 9.051 < 2e-16 ***
## month_aug 8.318e-01 1.295e-01 6.423 1.34e-10 ***
## month_dec -1.543e-01 2.365e-01 -0.653 0.514064
## month_jul 7.075e-02 1.414e-01 0.500 0.616903
## month_jun -3.639e-01 1.585e-01 -2.295 0.021722 *
## month_mar 2.789e+00 1.932e-01 14.436 < 2e-16 ***
## month_may -8.181e-01 1.439e-01 -5.686 1.30e-08 ***
## month_nov -1.778e-02 1.407e-01 -0.126 0.899490
## month_oct 1.616e+00 1.635e-01 9.887 < 2e-16 ***
## month_sep NA NA NA NA
## day_of_week_fri -1.168e-01 6.089e-02 -1.918 0.055161 .
## day_of_week_mon -2.461e-01 5.970e-02 -4.123 3.74e-05 ***
## day_of_week_thu -2.416e-01 5.892e-02 -4.101 4.11e-05 ***
## day_of_week_tue -8.942e-02 5.938e-02 -1.506 0.132124
## day_of_week_wed NA NA NA NA
## poutcome_failure -1.199e+00 2.394e-01 -5.008 5.50e-07 ***
## poutcome_nonexistent -6.285e-01 2.461e-01 -2.554 0.010647 *
## poutcome_success NA NA NA NA
## cons_price_group_0 1.706e+00 9.806e-02 17.399 < 2e-16 ***
## cons_price_group_1 7.952e-02 1.441e-01 0.552 0.581005
## cons_price_group_2 NA NA NA NA
## cons_confs_group_0 9.579e-02 1.067e-01 0.898 0.369105
## cons_confs_group_1 1.020e+00 1.585e-01 6.435 1.23e-10 ***
## cons_confs_group_2 1.969e+00 1.642e-01 11.992 < 2e-16 ***
## cons_confs_group_3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 46544 on 33651 degrees of freedom
## Residual deviance: 21797 on 33598 degrees of freedom
## AIC: 21905
##
## Number of Fisher Scoring iterations: 25
pred_glm <- predict(glm_ini, ftest,type='response')
Since the binomial regression produces an estimation of probability, it is important to find the threshold that produces the best overall results.
There is much debate about which statistics from the confusion matrix is the best at determining model quality. In my opinion the best model would predict the ratio of true positives and ratio of true negatives equally well. As it limits the cost of being wrong.
The purpose of the ideal_binomial_threshold function is to find the threshold value that returns the smallest difference between the sensitivity and the specificity. This difference signifies the threshold value that balances the ratios of true positive predictions and the true negative predictions. Thus informing us of the threshold that determines the most balanced predictive power of the model.
ideal_binomial_threshold <- function(predictions,y){
thresholds <- data.frame(matrix(ncol=8,nrow=501))
colnames(thresholds) <- c('Threshold','Balanced_Accuracy','Sensitivity','Specificity',
'Sen_Spec_Diff')
index=1
for(i in seq(.25,.75,.001)){
binom <- ifelse(predictions>i,1,0)
cM <- confusionMatrix(as.factor(binom),as.factor(y))
thresholds$Threshold[index] <- i
thresholds$Balanced_Accuracy[index] <- as.numeric(cM$byClass['Balanced Accuracy'])
thresholds$Sensitivity[index] <-as.numeric(cM$byClass['Sensitivity'])
thresholds$Specificity[index] <-as.numeric(cM$byClass['Specificity'])
sen<- as.numeric(cM$byClass['Sensitivity'])
spec <- as.numeric(cM$byClass['Specificity'])
sen_spec_diff <- ifelse(sen>= spec,(sen-spec), (spec-sen))
thresholds$Sen_Spec_Diff[index] <- sen_spec_diff
index <- index+1
}
min_thresh <- thresholds[which.min(thresholds$Sen_Spec_Diff),]
return(min_thresh)
}
thresh <- ideal_binomial_threshold(pred_glm,ftest$y)
thresh
## Threshold Balanced_Accuracy Sensitivity Specificity Sen_Spec_Diff NA NA NA
## 255 0.504 0.8733997 0.8732877 0.8735117 0.0002240581 NA NA NA
pred_glm_ideal <- ifelse(pred_glm >thresh$Threshold[1],1,0)
cM <- confusionMatrix(factor(pred_glm_ideal),factor(ftest$y))
cM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8415 1073
## 1 1221 7410
##
## Accuracy : 0.8734
## 95% CI : (0.8685, 0.8782)
## No Information Rate : 0.5318
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.746
##
## Mcnemar's Test P-Value : 0.002147
##
## Sensitivity : 0.8733
## Specificity : 0.8735
## Pos Pred Value : 0.8869
## Neg Pred Value : 0.8585
## Prevalence : 0.5318
## Detection Rate : 0.4644
## Detection Prevalence : 0.5236
## Balanced Accuracy : 0.8734
##
## 'Positive' Class : 0
##
With a threshold of 0.504 the model has a balanced accuracy score of 0.8733997. The AIC score is 2.1905262^{4}.
In order for the xgboost model to work at its maximum ability several parameters need to be selected. To select the ideal parameters caret has a package that tunes the parameters, which gives the best parameter set up.
To get a rough idea of the nrounds needed to be used the xgb.cv function will be used with a basic parameters set up.We will test a maximum of 1500 rounds.
params <- list(
booster = "gbtree",
eta = 0.1,
max_depth = 5,
gamma = 0,
subsample = 1,
colsample_bytree = 1,
colsample_bylevel = 1,
min_child_weight = 1,
objective ="binary:logistic", #
eval_metric = "error"
)
xgbcv <- xgb.cv( params = params, data = train, nrounds = 1500, nfold=5, print_every_n = 100, early_stop_rounds = 1000,verbose=0)
min_error = min(xgbcv$evaluation_log[, test_error_mean])
min_error_index = which.min(xgbcv$evaluation_log[, test_error_mean])
print(min_error)
## [1] 0.05170574
print(min_error_index)
## [1] 137
The basic parameters shows us that the number of nrounds should be around 137
With the caret package you can tune all the parameters at the same time, but for each choice given, the time it takes to return a tuning is increased.
For this model I will have three choices for each parameter. Max depth will test depths of 4,7 and 10. Eta will test values of .05,.1 and .3. Colsample_bytree will test .5,.75 and .9. Subsample will test .5,.75 and .9. Gamma will test 0,1 and 5.
tune_grid <- expand.grid(
nrounds = as.numeric(min_error_index),
max_depth = c(4,7,10),
eta = c(.05,.1,0.3),
colsample_bytree = .75,
subsample = .9,
gamma = c(0,1,5),
min_child_weight =1
)
xgb_tune <- train(
x = ftrain[,-c(6)],
y = factor(ftrain$y),
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = 0
)
plot(xgb_tune)
The bestTune variable of the xgb_tune will give us the best tuned parameters.
xgb_tune$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 7 137 10 0.05 0 0.75 1 0.9
With the tuned parameters I will retest the number of rounds needed and retest.
tuned_params<- list(
booster = "gbtree",
eta = xgb_tune$bestTune$eta,
max_depth = xgb_tune$bestTune$max_depth,
gamma = xgb_tune$bestTune$gamma,
subsample = xgb_tune$bestTune$subsample,
colsample_bytree = xgb_tune$bestTune$colsample_bytree,
min_child_weight = xgb_tune$bestTune$min_child_weight,
objective ="binary:logistic", #
eval_metric = "error"
)
xgbcv2 <- xgb.cv( params = params, data = train, nrounds = as.numeric(min_error_index), nfold=5, print_every_n = 100, early_stop_rounds = 1000)
## [1] train-error:0.103798+0.001789 test-error:0.106650+0.002207
## [101] train-error:0.040495+0.000497 test-error:0.051914+0.001609
## [137] train-error:0.037130+0.000317 test-error:0.052300+0.001034
min_error2 = min(xgbcv2$evaluation_log[, test_error_mean])
min_error_index2 = which.min(xgbcv2$evaluation_log[, test_error_mean])
print(min_error)
## [1] 0.05170574
print(min_error_index)
## [1] 137
xgb_model <- xgb.train(tuned_params, train, nrounds = as.numeric(min_error_index))
xgb_pred <- predict(xgb_model, validate)
thresh2 <- ideal_binomial_threshold(xgb_pred,ftest$y)
xgb_pred_prob <- ifelse(xgb_pred >=thresh2$Threshold[1],1,0 )
cM2<-confusionMatrix(as.factor(xgb_pred_prob),as.factor(ftest$y))
cM2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9170 408
## 1 466 8075
##
## Accuracy : 0.9518
## 95% CI : (0.9485, 0.9548)
## No Information Rate : 0.5318
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9032
##
## Mcnemar's Test P-Value : 0.05385
##
## Sensitivity : 0.9516
## Specificity : 0.9519
## Pos Pred Value : 0.9574
## Neg Pred Value : 0.9454
## Prevalence : 0.5318
## Detection Rate : 0.5061
## Detection Prevalence : 0.5286
## Balanced Accuracy : 0.9518
##
## 'Positive' Class : 0
##
The xgboost model performed ~8% better than the glm models with a ~95% prediction rate. The sensitivity and specificity scores are both ~95%. All these scores are .43 points better than the no information rate. Both positive and negative prediction rates are ~95%.
library(ggplot2)
library(scales)
og_training1=og_training
var_used<-og_training1[c("age","job","marital","education","housing")] #selecting the data for segmentation
customer1<-data.matrix(var_used[c("job","marital","education","housing")])
var_used<-data.frame(var_used,customer1) #Data Coversion
str(var_used)
## 'data.frame': 30891 obs. of 9 variables:
## $ age : int 57 37 40 45 41 24 25 41 25 29 ...
## $ job : chr "services" "services" "admin." "services" ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "high.school" "high.school" "basic.6y" "basic.9y" ...
## $ housing : chr "no" "yes" "no" "no" ...
## $ job.1 : int 8 8 1 8 2 10 8 2 8 2 ...
## $ marital.1 : int 2 2 2 2 2 3 3 2 3 3 ...
## $ education.1: int 4 4 2 3 8 6 4 8 4 4 ...
## $ housing.1 : int 1 3 1 1 1 3 3 1 3 1 ...
job<-unique(var_used[c("job","job.1")])
marital<-unique(var_used[c("marital","marital.1")])
education<-unique(var_used[c("education","education.1")])
housing<-unique(var_used[c("housing","housing.1")])
x<-c("age","job.1","marital.1","education.1","housing.1")
x
## [1] "age" "job.1" "marital.1" "education.1" "housing.1"
# Clustering Using K-Means
set.seed(100)
# K-meand Function to create 5 Cluster with 25 random scenario
segmentation<-kmeans(x=var_used[x], center=8, nstart=25)
# Analyze Cluster Size Output
og_training$cluster<-segmentation$cluster
pp <- ggplot(data=og_training, aes(x=cluster,fill=y)) +
geom_bar(aes(y = (..count..)/sum(..count..)))
pp + labs(x="Cluster", y = "Percent")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
var_used %>%
as_tibble() %>%
mutate(cluster = segmentation$cluster,
state = row.names(USArrests)) %>%
ggplot(aes(marital, education, color = factor(cluster), label = job)) +
geom_text()
datacharacteristics <-og_training[which(og_training$cluster == 4),]
a <- ggplot(data=datacharacteristics, aes(x=job)) +
geom_bar(aes(y = (..count..)/sum(..count..)))
a + labs(x="Job", y = "Percent") +theme(axis.text.x = element_text(angle=90,hjust = 0))
b <- ggplot(data=datacharacteristics, aes(x=education)) +
geom_bar(aes(y = (..count..)/sum(..count..)))
b + labs(x="Education", y = "Percent") +theme(axis.text.x = element_text(angle=90,hjust = 0))
# Marital Status
c <- ggplot(data=datacharacteristics, aes(x=marital)) +
geom_bar(aes(y = (..count..)/sum(..count..)))
c + labs(x="Marital", y = "Percent") +theme(axis.text.x = element_text(angle=90,hjust = 0))
# Age
d <- ggplot(data=datacharacteristics, aes(x=age)) +
geom_bar(aes(y = (..count..)/sum(..count..)))
d + labs(x="Age", y = "Percent") +theme(axis.text.x = element_text(angle=90,hjust = 0))
The most profitable customer comes from Cluster Number 4, This cluster
consists on profession as admin and blue-collar, university degree for
last education, who are majority married, if the telemarketer would to
focus on these attributes they would have a higher success rate in
getting the “Yes” feedback
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, In press, http://dx.doi.org/10.1016/j.dss.2014.03.001