Will the client subscribe to a term deposit offered by a Portuguese banking institution?
The data relate to the direct marketing campaigns of a Portuguese banking institution for a bank term deposit. The campaigns were conducted over phone calls, and often more than one contact with the same client was required in order to assess whether the product (i.e., the bank term deposit) would be subscribed or not.
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no |
| 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no |
| 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no |
| 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no |
| 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no |
| 41 | admin. | divorced | secondary | no | 270 | yes | no | unknown | 5 | may | 222 | 1 | -1 | 0 | unknown | no |
| 29 | admin. | single | secondary | no | 390 | yes | no | unknown | 5 | may | 137 | 1 | -1 | 0 | unknown | no |
# bank client data:
age : age of the client (numeric)
job : type of job (categorical)(“admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”,“blue-collar”,“self-employed”,“retired”,“technician”,“services”)
marital : marital status (categorical)(“married”,“divorced”,“single”)
education : (categorical)(“unknown”,“secondary”,“primary”,“tertiary”)
default: has credit in default? (binary: “yes”,“no”)
balance: average yearly balance in euros (numeric)
housing: has housing loan? (binary: “yes”,“no”)
loan: has personal loan? (binary: “yes”,“no”)
# related with the last contact of the current campaign:
contact: contact communication type (categorical)(“unknown”,“telephone”,“cellular”)
day: last contact day of the month (numeric)
month: last contact month of year (categorical)(“jan”, “feb”, “mar”, …, “nov”, “dec”)
duration: last contact duration, in seconds (numeric)
# other attributes:
campaign: number of contacts performed during this campaign and for this client (numeric); includes the last contact
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric)(-1 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical)(“unknown”,“other”,“failure”,“success”)
y : has the client subscribed a term deposit? (binary: “yes”,“no”)
Here we have 45,211 observations on 17 variables, of which y is the binary outcome variable and the remaining 16 variables are the predictors.
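For reference, a minimal sketch of how the data could be loaded into the bmarketing data frame used below (this assumes the semicolon-separated UCI file 'bank-full.csv' sits in the working directory; the file name and options are assumptions, not part of the original analysis):
# read the semicolon-separated UCI bank marketing file; converting strings to
# factors makes the categorical predictors ready for modelling
bmarketing <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = TRUE)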
summary(bmarketing)
age job marital education
Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
1st Qu.:33.00 management :9458 married :27214 secondary:23202
Median :39.00 technician :7597 single :12790 tertiary :13301
Mean :40.94 admin. :5171 unknown : 1857
3rd Qu.:48.00 services :4154
Max. :95.00 retired :2264
(Other) :6835
default balance housing loan contact
no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
Median : 448 unknown :13020
Mean : 1362
3rd Qu.: 1428
Max. :102127
day month duration campaign
Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
(Other): 6060
pdays previous poutcome y
Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
Median : -1.0 Median : 0.0000 success: 1511
Mean : 40.2 Mean : 0.5803 unknown:36959
3rd Qu.: -1.0 3rd Qu.: 0.0000
Max. :871.0 Max. :275.0000
bmarketing$age<-as.numeric(bmarketing$age)
bmarketing$balance<-as.numeric(bmarketing$balance)
bmarketing$day<-as.numeric(bmarketing$day)
bmarketing$duration<-as.numeric(bmarketing$duration)
bmarketing$campaign<-as.numeric(bmarketing$campaign)
bmarketing$pdays<-as.numeric(bmarketing$pdays)
bmarketing$previous<-as.numeric(bmarketing$previous)
str(bmarketing)
'data.frame': 45211 obs. of 17 variables:
$ age : num 58 44 33 47 33 35 28 42 58 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
$ balance : num 2143 29 2 1506 1 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
$ day : num 5 5 5 5 5 5 5 5 5 5 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
$ duration : num 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
sum(is.na(bmarketing))
[1] 0
There are no missing values anywhere in the dataset.
boxplot(bmarketing)
The boxplot above shows outliers in several variables; in particular, 'balance', 'duration' and 'previous' contain a few extreme outliers, which are removed below because they can distort model performance.
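Before dropping rows, it can help to look at the most extreme values so that the ad-hoc cutoffs used below are not arbitrary; a brief sketch (the exact thresholds remain a judgment call):
# inspect the tails of the variables flagged in the boxplot
head(sort(bmarketing$balance), 3)   # most negative balances
tail(sort(bmarketing$balance), 3)   # largest balances
tail(sort(bmarketing$duration), 3)  # longest calls, in seconds
tail(sort(bmarketing$previous), 3)  # largest numbers of previous contacts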
dim(bmarketing)
[1] 45211 17
bmarketing<-bmarketing[!bmarketing$balance>5e+04,]
bmarketing<-bmarketing[!bmarketing$balance==-8019,]
bmarketing<-bmarketing[!bmarketing$balance==-6847,]
bmarketing<-bmarketing[!bmarketing$duration>3500,]
bmarketing<-bmarketing[!bmarketing$previous>50,]
boxplot(bmarketing)
dim(bmarketing)
[1] 45184 17
After omitting these extreme outliers the data are reduced to 45,184 observations, which are then split further into train, test1 and test2 sets.
library(caret)
Loading required package: lattice
Loading required package: ggplot2
split<-createDataPartition(bmarketing$y,p=0.6,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)
[1] 27112 17
dim(test1)
[1] 9036 17
dim(test2)
[1] 9036 17
The 45,184 observations are split into 27,112 observations for the train set and 9,036 observations each for the test1 and test2 sets.
Now let us carry out the model search: each model is built on the train data, the best model is selected, and that model is then tested on the test1 and test2 data.
library(caret)
kfoldrepeated<-trainControl(method = "repeatedcv",number = 5,repeats = 3)
model_bin<-train(y ~ .,data=train,method="glm",trControl=kfoldrepeated)
model_bin
Generalized Linear Model
27112 samples
16 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 21689, 21690, 21689, 21690, 21690, 21689, ...
Resampling results:
Accuracy Kappa
0.8991959 0.3816199
Here the model is fitted with 5-fold cross-validation repeated 3 times.
pred_train<-predict(model_bin,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 23351 591
yes 2134 1036
Accuracy : 0.8995
95% CI : (0.8959, 0.903)
No Information Rate : 0.94
P-Value [Acc > NIR] : 1
Kappa : 0.383
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9163
Specificity : 0.6368
Pos Pred Value : 0.9753
Neg Pred Value : 0.3268
Prevalence : 0.9400
Detection Rate : 0.8613
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7765
'Positive' Class : no
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 27112
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 23351 | 591 | 23942 |
| 0.861 | 0.022 | |
---------------------------------|-----------|-----------|-----------|
yes | 2134 | 1036 | 3170 |
| 0.079 | 0.038 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 25485 | 1627 | 27112 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
Loading required package: gplots
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6510646
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
The model performed reasonably well with 90% accuracy (correctly predicting 86% of the non-subscribers and about 4% of the subscribers, as shares of all training observations), but the inter-rater reliability (kappa) is low, specificity is only moderate, and the area under the curve is about 65%.
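The AUC above is computed from the hard class labels produced by predict(). As a sketch (not part of the original analysis), the predicted class probabilities from the caret glm model could be used instead, which usually yields a smoother ROC curve and a more informative AUC:
# probability of the "yes" class for each training row
prob_train <- predict(model_bin, train, type = "prob")[, "yes"]
pr_prob <- prediction(prob_train, train$y)                       # ROCR object built from probabilities
performance(pr_prob, measure = "auc")@y.values[[1]]              # probability-based AUC
plot(performance(pr_prob, measure = "tpr", x.measure = "fpr"))   # ROC curve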
library(naivebayes)
model_nb<-naive_bayes(y ~ ., data=train)
model_nb$prior
no yes
0.8830776 0.1169224
The model starts from the prior probabilities, about 88% not subscribing and 12% subscribing, and, being a probabilistic classifier, combines them with the class-conditional likelihoods of the predictors to compute posterior probabilities for each client.
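Because naive_bayes is a probabilistic model, the posterior probabilities behind each class prediction can be inspected directly; a small sketch (not part of the original flow):
# posterior probability of each class for the first few training rows;
# the predicted class is simply the one with the larger posterior
post <- predict(model_nb, train, type = "prob")
head(round(post, 3))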
library(caret)
pred_train<-predict(model_nb,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 22007 1935
yes 1525 1645
Accuracy : 0.8724
95% CI : (0.8684, 0.8763)
No Information Rate : 0.868
P-Value [Acc > NIR] : 0.01571
Kappa : 0.4148
Mcnemar's Test P-Value : 3.571e-12
Sensitivity : 0.9352
Specificity : 0.4595
Pos Pred Value : 0.9192
Neg Pred Value : 0.5189
Prevalence : 0.8680
Detection Rate : 0.8117
Detection Prevalence : 0.8831
Balanced Accuracy : 0.6973
'Positive' Class : no
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 27112
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 22007 | 1935 | 23942 |
| 0.812 | 0.071 | |
---------------------------------|-----------|-----------|-----------|
yes | 1525 | 1645 | 3170 |
| 0.056 | 0.061 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 23532 | 3580 | 27112 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7190536
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
The model has 87% accuracy, but the inter-rater reliability (kappa) and the specificity are quite low.
library(rpart)
model_cart<-rpart(y ~ ., data=train,control = rpart.control(minbucket = 10))
model_cart$cptable
CP nsplit rel error xerror xstd
1 0.03417455 0 1.0000000 1.0000000 0.01669052
2 0.02839117 3 0.8974763 0.9176656 0.01607557
3 0.02050473 4 0.8690852 0.8772871 0.01575943
4 0.01000000 5 0.8485804 0.8577287 0.01560261
library(rpart.plot)
rpart.plot(model_cart)
Here the model is grown over a range of complexity parameter (cp) values, the cross-validated error of each is computed, and the cp with the lowest error is selected (cp = 0.01).
From the cross table below we can read that about 86% of non-subscribers and 4% of subscribers are predicted correctly.
In the plotted tree, the variable 'duration' is chosen as the root node because its split gives the largest reduction in Gini impurity.
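Given the cptable above, the tree could also be pruned explicitly at the cp with the lowest cross-validated error; a brief sketch (the analysis below keeps the tree as grown):
# pick the cp with the smallest cross-validated error and prune the tree to it
best_cp <- model_cart$cptable[which.min(model_cart$cptable[, "xerror"]), "CP"]
model_cart_pruned <- prune(model_cart, cp = best_cp)
rpart.plot(model_cart_pruned)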
library(caret)
pred_train<-predict(model_cart,train,type="class")
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 23327 615
yes 2075 1095
Accuracy : 0.9008
95% CI : (0.8972, 0.9043)
No Information Rate : 0.9369
P-Value [Acc > NIR] : 1
Kappa : 0.3996
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9183
Specificity : 0.6404
Pos Pred Value : 0.9743
Neg Pred Value : 0.3454
Prevalence : 0.9369
Detection Rate : 0.8604
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7793
'Positive' Class : no
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 27112
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 23327 | 615 | 23942 |
| 0.860 | 0.023 | |
---------------------------------|-----------|-----------|-----------|
yes | 2075 | 1095 | 3170 |
| 0.077 | 0.040 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 25402 | 1710 | 27112 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6598694
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
The CART decision tree performs well with 90% accuracy, but kappa is low and specificity is somewhat low; the area under the curve is about 66%.
library(C50)
model_c50<-C5.0(y ~ ., data=train)
library(caret)
pred_train<-predict(model_c50,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 23405 537
yes 1311 1859
Accuracy : 0.9318
95% CI : (0.9288, 0.9348)
No Information Rate : 0.9116
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6308
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9470
Specificity : 0.7759
Pos Pred Value : 0.9776
Neg Pred Value : 0.5864
Prevalence : 0.9116
Detection Rate : 0.8633
Detection Prevalence : 0.8831
Balanced Accuracy : 0.8614
'Positive' Class : no
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 27112
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 23405 | 537 | 23942 |
| 0.863 | 0.020 | |
---------------------------------|-----------|-----------|-----------|
yes | 1311 | 1859 | 3170 |
| 0.048 | 0.069 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 24716 | 2396 | 27112 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7820031
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
The model is 93% accurate, kappa is also good, and the remaining metrics perform well; overall it is a good model.
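C5.0 also supports adaptive boosting through its trials argument; a hedged sketch of how the same model could be boosted (the number of trials here is an assumption and has not been tuned):
# boosted C5.0 with 10 boosting iterations; predictions are obtained the same way
model_c50_boost <- C5.0(y ~ ., data = train, trials = 10)
pred_boost <- predict(model_c50_boost, train)
confusionMatrix(as.factor(train$y), as.factor(pred_boost))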
library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':
margin
model_rf<-randomForest(y ~ ., data=train,mtry=3)
model_rf
Call:
randomForest(formula = y ~ ., data = train, mtry = 3)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 9.56%
Confusion matrix:
no yes class.error
no 23186 756 0.03157631
yes 1835 1335 0.57886435
plot(model_rf)
varImpPlot(model_rf)
The random forest model has an out-of-bag (OOB) error of about 9.6%: each tree is grown on a bootstrap sample (roughly two-thirds of the train observations end up in-bag) and is evaluated only on the observations it did not see, so the OOB error acts as an internal test-set estimate.
From the model plot we can see that the error rate decreases as the number of trees increases, and the variable importance plot ranks the variables by their mean decrease in Gini impurity: the larger the decrease a variable produces, the more important it is.
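The value mtry = 3 above was fixed by hand; as a sketch, randomForest's tuneRF() could search for an mtry value with a lower OOB error (the step factor, improvement threshold and number of trees below are assumptions):
# search over mtry values, doubling at each step, and report the OOB error of each;
# the response column y is excluded from the predictor matrix
set.seed(1)
tuned <- tuneRF(train[, setdiff(names(train), "y")], train$y,
                ntreeTry = 200, stepFactor = 2, improve = 0.01, trace = TRUE)
tuned  # matrix of mtry values and their OOB errors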
pred_train<-predict(model_rf,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 23942 0
yes 128 3042
Accuracy : 0.9953
95% CI : (0.9944, 0.9961)
No Information Rate : 0.8878
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9767
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9947
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9596
Prevalence : 0.8878
Detection Rate : 0.8831
Detection Prevalence : 0.8831
Balanced Accuracy : 0.9973
'Positive' Class : no
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 27112
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 23942 | 0 | 23942 |
| 0.883 | 0.000 | |
---------------------------------|-----------|-----------|-----------|
yes | 128 | 3042 | 3170 |
| 0.005 | 0.112 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 24070 | 3042 | 27112 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.9798107
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
The model performs exceptionally well on the train data with 99% accuracy and the other metrics look excellent, but such near-perfect training performance often signals overfitting, so we need to check whether these metrics hold up on other splits of the bmarketing dataset.
library(ada)
model_ada<-ada(y ~ ., data=train,loss='exponential',type='discrete',iter=100)
model_ada
Call:
ada(y ~ ., data = train, loss = "exponential", type = "discrete",
iter = 100)
Loss: exponential Method: discrete Iteration: 100
Final Confusion Matrix for Data:
Final Prediction
True value no yes
no 23281 661
yes 1792 1378
Train Error: 0.09
Out-Of-Bag Error: 0.091 iteration= 88
Additional Estimates of number of iterations:
train.err1 train.kap1
95 95
plot(model_ada)
Here discrete AdaBoost with exponential loss is fitted for 100 iterations; the training error is about 9% (with an out-of-bag error of about 9.1%), and the plot shows the error rate decreasing as the number of iterations increases.
pred_train<-predict(model_ada,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 23281 661
yes 1792 1378
Accuracy : 0.9095
95% CI : (0.906, 0.9129)
No Information Rate : 0.9248
P-Value [Acc > NIR] : 1
Kappa : 0.4816
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9285
Specificity : 0.6758
Pos Pred Value : 0.9724
Neg Pred Value : 0.4347
Prevalence : 0.9248
Detection Rate : 0.8587
Detection Prevalence : 0.8831
Balanced Accuracy : 0.8022
'Positive' Class : no
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 27112
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 23281 | 661 | 23942 |
| 0.859 | 0.024 | |
---------------------------------|-----------|-----------|-----------|
yes | 1792 | 1378 | 3170 |
| 0.066 | 0.051 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 25073 | 2039 | 27112 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.703546
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
The model performed well with 91% accuracy, but kappa is low at about 48%; the remaining metrics are satisfactory.
Now let us summarize the performance of all the models built and evaluated on the train data.
| MODEL | ACCURACY | KAPPA | SENSITIVITY | SPECIFICITY | BALANCED_ACCURACY | AREA_UNDER_CURVE | NO_OF_VALUES_PREDICTED_CORRECTLY (out of 27112) |
|---|---|---|---|---|---|---|---|
| BINARY_LOGISTIC_REGRESSION | 90 | 40.5 | 91.8 | 65 | 78 | 66 | 24452 |
| NAIVE_BAYES | 87 | 41.8 | 93.7 | 45.5 | 69.6 | 72.5 | 23607 |
| CART_DECISION_TREE | 90 | 43.4 | 92.4 | 62.4 | 77.4 | 68.3 | 24444 |
| C50_DECISION_TREE | 93 | 62.6 | 94.6 | 76.9 | 85.7 | 78 | 25235 |
| RANDOM_FOREST | 99 | 97 | 99 | 100 | 99 | 97.7 | 26970 |
| ADAPTIVE_BOOSTING | 91 | 49 | 93 | 67 | 80 | 71 | 24690 |
From the summary we can infer that the random forest performed best of all the models, followed by the C5.0 decision tree; the remaining models are only satisfactory, as their kappa values are rather low. Let us therefore shortlist the random forest and C5.0 decision tree models and check whether their performance is consistent on the test1 and test2 data, keeping the bias-variance trade-off in mind.
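The figures in the table above were collected by hand from each model's confusion matrix and AUC output; a small helper along these lines could compute them programmatically (a sketch, assuming the fitted models and the train data are still in memory; extra predict() arguments such as type = "class" for rpart can be passed through the dots):
# accuracy, kappa, sensitivity, specificity and AUC for one model on one data set,
# using the same conventions as the analysis above
summarise_model <- function(model, data, ...) {
  pred <- predict(model, data, ...)
  cm <- confusionMatrix(as.factor(data$y), as.factor(pred))
  auc <- performance(prediction(as.numeric(pred), as.numeric(data$y)),
                     measure = "auc")@y.values[[1]]
  round(c(accuracy = unname(cm$overall["Accuracy"]),
          kappa = unname(cm$overall["Kappa"]),
          sensitivity = unname(cm$byClass["Sensitivity"]),
          specificity = unname(cm$byClass["Specificity"]),
          auc = auc), 3)
}
summarise_model(model_rf, train)   # random forest on the train data
summarise_model(model_c50, train)  # C5.0 on the train data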
pred_test1<-predict(model_rf,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7746 234
yes 617 439
Accuracy : 0.9058
95% CI : (0.8996, 0.9118)
No Information Rate : 0.9255
P-Value [Acc > NIR] : 1
Kappa : 0.4585
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9262
Specificity : 0.6523
Pos Pred Value : 0.9707
Neg Pred Value : 0.4157
Prevalence : 0.9255
Detection Rate : 0.8572
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7893
'Positive' Class : no
library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 9036
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 7746 | 234 | 7980 |
| 0.857 | 0.026 | |
---------------------------------|-----------|-----------|-----------|
yes | 617 | 439 | 1056 |
| 0.068 | 0.049 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 8363 | 673 | 9036 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.6931982
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred_test2<-predict(model_rf,test2)
confusionMatrix(as.factor(test2$y),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7774 206
yes 552 504
Accuracy : 0.9161
95% CI : (0.9102, 0.9217)
No Information Rate : 0.9214
P-Value [Acc > NIR] : 0.9701
Kappa : 0.5263
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9337
Specificity : 0.7099
Pos Pred Value : 0.9742
Neg Pred Value : 0.4773
Prevalence : 0.9214
Detection Rate : 0.8603
Detection Prevalence : 0.8831
Balanced Accuracy : 0.8218
'Positive' Class : no
library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 9036
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 7774 | 206 | 7980 |
| 0.860 | 0.023 | |
---------------------------------|-----------|-----------|-----------|
yes | 552 | 504 | 1056 |
| 0.061 | 0.056 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 8326 | 710 | 9036 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7257291
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7677 303
yes 576 480
Accuracy : 0.9027
95% CI : (0.8964, 0.9088)
No Information Rate : 0.9133
P-Value [Acc > NIR] : 0.9998
Kappa : 0.4692
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9302
Specificity : 0.6130
Pos Pred Value : 0.9620
Neg Pred Value : 0.4545
Prevalence : 0.9133
Detection Rate : 0.8496
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7716
'Positive' Class : no
library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 9036
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 7677 | 303 | 7980 |
| 0.850 | 0.034 | |
---------------------------------|-----------|-----------|-----------|
yes | 576 | 480 | 1056 |
| 0.064 | 0.053 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 8253 | 783 | 9036 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7082878
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$y),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 7685 295
yes 553 503
Accuracy : 0.9062
95% CI : (0.9, 0.9121)
No Information Rate : 0.9117
P-Value [Acc > NIR] : 0.9686
Kappa : 0.4914
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9329
Specificity : 0.6303
Pos Pred Value : 0.9630
Neg Pred Value : 0.4763
Prevalence : 0.9117
Detection Rate : 0.8505
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7816
'Positive' Class : no
library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 9036
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 7685 | 295 | 7980 |
| 0.850 | 0.033 | |
---------------------------------|-----------|-----------|-----------|
yes | 553 | 503 | 1056 |
| 0.061 | 0.056 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 8238 | 798 | 9036 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7196792
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
| MODEL | ACCURACY | KAPPA | SENSITIVITY | SPECIFICITY | BALANCED_ACCURACY | AREA_UNDER_CURVE | NO_OF_VALUES_PREDICTED_CORRECTLY |
|---|---|---|---|---|---|---|---|
| RANDOM_FOREST_TRAIN | 99.0 | 97.0 | 99.0 | 100.0 | 99.0 | 97.7 | 26970 |
| TEST1 | 91.0 | 48.0 | 93.0 | 67.0 | 80.0 | 70.7 | 8219 |
| TEST2 | 91.0 | 47.0 | 92.7 | 66.9 | 79.8 | 69.7 | 8205 |
| C50_TRAIN | 93.0 | 62.6 | 94.6 | 76.9 | 85.7 | 78.0 | 25235 |
| TEST1 | 90.5 | 48.0 | 93.0 | 63.0 | 78.0 | 71.0 | 8179 |
| TEST2 | 90.0 | 45.0 | 92.8 | 60.0 | 76.4 | 70.0 | 8134 |
NOTE: The number of values predicted correctly is out of 27,112 for the train data and out of 9,036 each for the test1 and test2 data.
From the above summary we can infer that the random forest model shows a much larger gap between train and test performance than the C5.0 decision tree: apart from sensitivity, its metrics are not consistent across the splits, whereas the C5.0 tree performs consistently on all of the data. Let us therefore proceed with the C5.0 decision tree model and check its performance on differently shuffled splits of the data.
set.seed(123)
library(caret)
split<-createDataPartition(bmarketing$y,p=0.5,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)
[1] 22592 17
dim(test1)
[1] 11297 17
dim(test2)
[1] 11295 17
library(C50)
model_c50<-C5.0(y ~ ., data=train)
library(caret)
pred_train<-predict(model_c50,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 19496 455
yes 1042 1599
Accuracy : 0.9337
95% CI : (0.9304, 0.9369)
No Information Rate : 0.9091
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6448
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9493
Specificity : 0.7785
Pos Pred Value : 0.9772
Neg Pred Value : 0.6055
Prevalence : 0.9091
Detection Rate : 0.8630
Detection Prevalence : 0.8831
Balanced Accuracy : 0.8639
'Positive' Class : no
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 22592
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 19496 | 455 | 19951 |
| 0.863 | 0.020 | |
---------------------------------|-----------|-----------|-----------|
yes | 1042 | 1599 | 2641 |
| 0.046 | 0.071 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 20538 | 2054 | 22592 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7913233
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 9575 401
yes 722 599
Accuracy : 0.9006
95% CI : (0.8949, 0.9061)
No Information Rate : 0.9115
P-Value [Acc > NIR] : 1
Kappa : 0.4619
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9299
Specificity : 0.5990
Pos Pred Value : 0.9598
Neg Pred Value : 0.4534
Prevalence : 0.9115
Detection Rate : 0.8476
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7644
'Positive' Class : no
library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 11297
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 9575 | 401 | 9976 |
| 0.848 | 0.035 | |
---------------------------------|-----------|-----------|-----------|
yes | 722 | 599 | 1321 |
| 0.064 | 0.053 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 10297 | 1000 | 11297 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7066239
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$y),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 9593 382
yes 711 609
Accuracy : 0.9032
95% CI : (0.8976, 0.9086)
No Information Rate : 0.9123
P-Value [Acc > NIR] : 0.9996
Kappa : 0.4744
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9310
Specificity : 0.6145
Pos Pred Value : 0.9617
Neg Pred Value : 0.4614
Prevalence : 0.9123
Detection Rate : 0.8493
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7728
'Positive' Class : no
library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 11295
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 9593 | 382 | 9975 |
| 0.849 | 0.034 | |
---------------------------------|-----------|-----------|-----------|
yes | 711 | 609 | 1320 |
| 0.063 | 0.054 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 10304 | 991 | 11295 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7115339
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
set.seed(1234)
library(caret)
split<-createDataPartition(bmarketing$y,p=0.7,list = FALSE)
train<-bmarketing[split,]
test<-bmarketing[-split,]
split<-createDataPartition(test$y,p=0.5,list=FALSE)
test1<-test[split,]
test2<-test[-split,]
dim(train)
[1] 31630 17
dim(test1)
[1] 6777 17
dim(test2)
[1] 6777 17
library(C50)
model_c50<-C5.0(y ~ ., data=train)
library(caret)
pred_train<-predict(model_c50,train)
confusionMatrix(as.factor(train$y),as.factor(pred_train))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 27208 724
yes 1450 2248
Accuracy : 0.9313
95% CI : (0.9284, 0.934)
No Information Rate : 0.906
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6362
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9494
Specificity : 0.7564
Pos Pred Value : 0.9741
Neg Pred Value : 0.6079
Prevalence : 0.9060
Detection Rate : 0.8602
Detection Prevalence : 0.8831
Balanced Accuracy : 0.8529
'Positive' Class : no
library(gmodels)
CrossTable(train$y, pred_train,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 31630
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 27208 | 724 | 27932 |
| 0.860 | 0.023 | |
---------------------------------|-----------|-----------|-----------|
yes | 1450 | 2248 | 3698 |
| 0.046 | 0.071 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 28658 | 2972 | 31630 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_train),as.numeric(train$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.790988
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred_test1<-predict(model_c50,test1)
confusionMatrix(as.factor(test1$y),as.factor(pred_test1))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 5729 256
yes 400 392
Accuracy : 0.9032
95% CI : (0.8959, 0.9101)
No Information Rate : 0.9044
P-Value [Acc > NIR] : 0.6391
Kappa : 0.4909
Mcnemar's Test P-Value : 2.361e-08
Sensitivity : 0.9347
Specificity : 0.6049
Pos Pred Value : 0.9572
Neg Pred Value : 0.4949
Prevalence : 0.9044
Detection Rate : 0.8454
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7698
'Positive' Class : no
library(gmodels)
CrossTable(test1$y, pred_test1,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 6777
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 5729 | 256 | 5985 |
| 0.845 | 0.038 | |
---------------------------------|-----------|-----------|-----------|
yes | 400 | 392 | 792 |
| 0.059 | 0.058 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 6129 | 648 | 6777 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_test1),as.numeric(test1$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7260879
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
pred_test2<-predict(model_c50,test2)
confusionMatrix(as.factor(test2$y),as.factor(pred_test2))
Confusion Matrix and Statistics
Reference
Prediction no yes
no 5732 253
yes 399 393
Accuracy : 0.9038
95% CI : (0.8965, 0.9107)
No Information Rate : 0.9047
P-Value [Acc > NIR] : 0.608
Kappa : 0.4934
Mcnemar's Test P-Value : 1.358e-08
Sensitivity : 0.9349
Specificity : 0.6084
Pos Pred Value : 0.9577
Neg Pred Value : 0.4962
Prevalence : 0.9047
Detection Rate : 0.8458
Detection Prevalence : 0.8831
Balanced Accuracy : 0.7716
'Positive' Class : no
library(gmodels)
CrossTable(test2$y, pred_test2,prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,dnn = c('actual term deposit subscription', 'predicted term deposit subscription'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 6777
| predicted term deposit subscription
actual term deposit subscription | no | yes | Row Total |
---------------------------------|-----------|-----------|-----------|
no | 5732 | 253 | 5985 |
| 0.846 | 0.037 | |
---------------------------------|-----------|-----------|-----------|
yes | 399 | 393 | 792 |
| 0.059 | 0.058 | |
---------------------------------|-----------|-----------|-----------|
Column Total | 6131 | 646 | 6777 |
---------------------------------|-----------|-----------|-----------|
library(ROCR)
pr<-prediction(as.numeric(pred_test2),as.numeric(test2$y))
auc<-performance(pr,measure = "auc")
auc<-auc@y.values
auc
[[1]]
[1] 0.7269699
prf<-performance(pr,measure = "tpr",x.measure = "fpr")
plot(prf)
Let us summarize the performance of the C5.0 decision tree model on the different shuffled splits of the data.
| PARAMETERS | TRAIN (60% split) | TEST1 | TEST2 | TRAIN (50% split) | TEST1 | TEST2 | TRAIN (70% split) | TEST1 | TEST2 |
|---|---|---|---|---|---|---|---|---|---|
| ACCURACY | 93.0 | 90.5 | 90.0 | 93.3 | 90.0 | 90 | 93.0 | 90.0 | 90.4 |
| KAPPA | 62.6 | 48.0 | 45.0 | 64.5 | 46.0 | 47 | 63.6 | 49.0 | 49.0 |
| SENSITIVITY | 94.6 | 93.0 | 92.8 | 95.0 | 93.0 | 93 | 95.0 | 93.5 | 93.5 |
| SPECIFICITY | 76.9 | 63.0 | 60.0 | 73.8 | 60.0 | 61 | 75.6 | 60.5 | 60.8 |
| BALANCED_ACCURACY | 85.7 | 78.0 | 76.4 | 86.0 | 76.0 | 77 | 85.0 | 77.0 | 77.0 |
| AREA_UNDER_CURVE | 78.0 | 71.0 | 70.0 | 79.0 | 70.7 | 71 | 79.0 | 72.6 | 72.7 |
From the above summary we can infer that the model performed consistently well on the different splits of the marketing dataset, with little evidence of overfitting and a stable error rate. With an average accuracy of about 90% (roughly 10% error), it can efficiently predict whether a customer will subscribe to the bank term deposit. We can therefore conclude that the C5.0 decision tree model has good potential to predict well on future data too; it is the best model and is ready to be deployed.
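As a final step, the chosen model could be persisted and reloaded for scoring new clients; a minimal sketch (the file name and the new_clients object are assumptions for illustration):
# save the fitted C5.0 model to disk and reload it in the scoring environment
saveRDS(model_c50, "model_c50_term_deposit.rds")
deployed_model <- readRDS("model_c50_term_deposit.rds")
# new_clients must contain the same 16 predictor columns (and factor levels) as train:
# predict(deployed_model, new_clients, type = "class")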
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:randomForest':
combine
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Finally, the chosen model's predictions are placed alongside the observed outcomes for inspection; the first rows of the data are shown below with the observed y and the expected (predicted) y.
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | observed_y | expected_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no | no |
| 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no | no |
| 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no | no |
| 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no | no |
| 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no | no |
| 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no | no |
| 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no | no |
| 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no | no |
| 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no | no |
| 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no | no |
| 41 | admin. | divorced | secondary | no | 270 | yes | no | unknown | 5 | may | 222 | 1 | -1 | 0 | unknown | no | no |
| 29 | admin. | single | secondary | no | 390 | yes | no | unknown | 5 | may | 137 | 1 | -1 | 0 | unknown | no | no |
| 53 | technician | married | secondary | no | 6 | yes | no | unknown | 5 | may | 517 | 1 | -1 | 0 | unknown | no | no |
| 58 | technician | married | unknown | no | 71 | yes | no | unknown | 5 | may | 71 | 1 | -1 | 0 | unknown | no | no |
| 57 | services | married | secondary | no | 162 | yes | no | unknown | 5 | may | 174 | 1 | -1 | 0 | unknown | no | no |