The data describe direct marketing campaigns (phone calls) run by a Portuguese banking institution from May 2008 to November 2010. Clients could be contacted more than once in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).
## [1] TRUE
## Age JobType Marital Education CreditDefault AvgYearlyBalance
## 1 58 management married tertiary no 2143
## 2 44 technician single secondary no 29
## 3 33 entrepreneur married secondary no 2
## 5 33 unknown single unknown no 1
## 7 28 management single tertiary no 447
## 8 42 entrepreneur divorced tertiary yes 2
## HousingLoan PersonalLoan ContactType Day Month Duration NumOfContacts
## 1 yes no unknown 5 may 261 1
## 2 yes no unknown 5 may 151 1
## 3 yes yes unknown 5 may 76 1
## 5 no no unknown 5 may 198 1
## 7 yes yes unknown 5 may 217 1
## 8 yes no unknown 5 may 380 1
## PassedDays Previous PreviousOutcome Subscription
## 1 -1 0 unknown no
## 2 -1 0 unknown no
## 3 -1 0 unknown no
## 5 -1 0 unknown no
## 7 -1 0 unknown no
## 8 -1 0 unknown no
## Age JobType Marital Education CreditDefault AvgYearlyBalance
## 45203 34 admin. single secondary no 557
## 45204 23 student single tertiary no 113
## 45207 51 technician married tertiary no 825
## 45208 71 retired divorced primary no 1729
## 45209 72 retired married secondary no 5715
## 45210 57 blue-collar married secondary no 668
## HousingLoan PersonalLoan ContactType Day Month Duration
## 45203 no no cellular 17 nov 224
## 45204 no no cellular 17 nov 266
## 45207 no no cellular 17 nov 977
## 45208 no no cellular 17 nov 456
## 45209 no no cellular 17 nov 1127
## 45210 no no telephone 17 nov 508
## NumOfContacts PassedDays Previous PreviousOutcome Subscription
## 45203 1 -1 0 unknown yes
## 45204 1 -1 0 unknown yes
## 45207 3 -1 0 unknown yes
## 45208 2 -1 0 unknown yes
## 45209 5 184 3 success yes
## 45210 4 -1 0 unknown no
## 'data.frame': 36170 obs. of 17 variables:
## $ Age : int 58 44 33 33 28 42 58 41 29 53 ...
## $ JobType : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 12 5 3 6 1 1 10 ...
## $ Marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 3 3 1 2 1 3 2 ...
## $ Education : Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 3 3 1 2 2 2 ...
## $ CreditDefault : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ AvgYearlyBalance: int 2143 29 2 1 447 2 121 270 390 6 ...
## $ HousingLoan : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
## $ PersonalLoan : Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
## $ ContactType : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ Duration : int 261 151 76 198 217 380 50 222 137 517 ...
## $ NumOfContacts : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PassedDays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ Previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PreviousOutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Subscription : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  Age             JobType            Marital          Education
##  Min.   :18.00   blue-collar:7766   divorced: 4215   primary  : 5454
##  1st Qu.:33.00   management :7616   married :21726   secondary:18541
##  Median :39.00   technician :6068   single  :10229   tertiary :10672
##  Mean   :40.91   admin.     :4167                    unknown  : 1503
##  3rd Qu.:48.00   services   :3297
##  Max.   :95.00   retired    :1786
##                  (Other)    :5470
##  CreditDefault   AvgYearlyBalance   HousingLoan   PersonalLoan   ContactType
##  no :35513       Min.   : -8019     no :16071     no :30320      cellular :23365
##  yes:  657       1st Qu.:    72     yes:20099     yes: 5850      telephone: 2319
##                  Median :   450                                  unknown  :10486
##                  Mean   :  1355
##                  3rd Qu.:  1423
##                  Max.   :102127
##
##  Day            Month           Duration         NumOfContacts
##  Min.   : 1.0   may    :10969   Min.   :   0.0   Min.   : 1.000
##  1st Qu.: 8.0   jul    : 5535   1st Qu.: 103.0   1st Qu.: 1.000
##  Median :16.0   aug    : 4999   Median : 180.0   Median : 2.000
##  Mean   :15.8   jun    : 4317   Mean   : 258.6   Mean   : 2.767
##  3rd Qu.:21.0   nov    : 3157   3rd Qu.: 318.0   3rd Qu.: 3.000
##  Max.   :31.0   apr    : 2334   Max.   :4918.0   Max.   :63.000
##                 (Other): 4859
##  PassedDays       Previous           PreviousOutcome   Subscription
##  Min.   : -1.00   Min.   :  0.0000   failure: 3940     no :31938
##  1st Qu.: -1.00   1st Qu.:  0.0000   other  : 1457     yes: 4232
##  Median : -1.00   Median :  0.0000   success: 1194
##  Mean   : 40.15   Mean   :  0.5783   unknown:29579
##  3rd Qu.: -1.00   3rd Qu.:  0.0000
##  Max.   :871.00   Max.   :275.0000
##
What does “unknown” mean in the PreviousOutcome column?
## [1] 29574
## [1] 29579
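The code behind the two counts above is not shown. The second count (29579) equals the number of "unknown" values in the PreviousOutcome summary; the first (29574) is presumably the number of clients with no previous contact. The following is a minimal sketch of such a comparison on a made-up data frame, assuming that "no previous contact" is coded as `PassedDays == -1`. The object `toy` and its values are hypothetical.

```r
# Toy data frame (hypothetical values, not the bank data).
toy <- data.frame(
  PassedDays      = c(-1, -1, -1, 184, 90, 12),
  Previous        = c(0, 0, 0, 3, 1, 2),
  PreviousOutcome = c("unknown", "unknown", "unknown",
                      "success", "failure", "unknown")
)

# Clients never contacted before (assumption: coded as PassedDays == -1).
n_never_contacted <- sum(toy$PassedDays == -1)

# Clients whose previous outcome is recorded as "unknown".
n_unknown <- sum(toy$PreviousOutcome == "unknown")

# Cross-tabulate the two definitions to see where they disagree.
overlap <- table(NeverContacted = toy$PassedDays == -1,
                 UnknownOutcome = toy$PreviousOutcome == "unknown")

print(n_never_contacted)
print(n_unknown)
print(overlap)
```

If the two counts differ, as they do by five in the output above, the off-diagonal cell of the cross-table identifies clients who were contacted before but whose outcome is still recorded as "unknown".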
The number of contacts in the current marketing campaign falls into three distinct periods:
A new categorical variable will be created to address these three periods:
## [1] TRUE
## Raw_Training_Season
## high low medium
## 25820 2713 7637
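The month-to-season mapping itself is not shown. The "high" count of 25820 equals the sum of the may, jul, aug and jun counts in the summary (10969 + 5535 + 4999 + 4317), so those four months appear to form the high season. The sketch below shows one way to derive such a variable with a lookup vector; the split of the remaining months into "medium" and "low" is an assumption made purely for illustration, and `month`, `season_of` and `season` are hypothetical names.

```r
# Toy month column (hypothetical values, not the bank data).
month <- factor(c("may", "may", "jul", "aug", "jun",
                  "nov", "apr", "dec", "mar", "may"))

# Lookup from month to season. The "high" group is consistent with the
# counts in the summary; the medium/low split is an assumption.
season_of <- c(
  may = "high",   jun = "high",   jul = "high",   aug = "high",
  nov = "medium", apr = "medium", feb = "medium", jan = "medium",
  oct = "low",    sep = "low",    mar = "low",    dec = "low"
)

# Index the lookup by month name and store the result as a factor.
season <- factor(unname(season_of[as.character(month)]),
                 levels = c("high", "low", "medium"))

print(table(season))
```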
The number of contacts in the current marketing campaign also separates into two distinct age groups:
## [1] 935
## no yes
## 535 400
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 3.171 4.000 275.000
## [1] 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 3.13 4.00 58.00
##
## no yes
## 31938 4232
Our visualized ad-hoc A/B test suggests that job type is an important variable for subscription preference, and the chi-squared test below confirms that the association is statistically significant. In the plot, management, retired, unemployed and student clients subscribe to the term deposit at a rate above our reference line, so these categories may become important factors in the classification model in a later section. Although blue-collar clients are the most numerous, their subscription percentage is the lowest. This may offer insight into how to target potential clients by job type.
##
## Pearson's Chi-squared test
##
## data: TableForTest
## X-squared = 667, df = 11, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
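The per-category percentages, the reference line and the test reported above can be computed in a few lines of base R. The following is a minimal sketch on made-up data, not the original analysis code: only the name `TableForTest` is taken from the output above, the vectors `job` and `sub` are hypothetical, and the reference line is assumed to be the overall subscription rate.

```r
# Toy data (hypothetical values, not the bank data).
job <- factor(rep(c("blue-collar", "retired", "student"), times = c(40, 30, 30)))
sub <- factor(c(rep(c("no", "yes"), times = c(36, 4)),    # blue-collar
                rep(c("no", "yes"), times = c(18, 12)),   # retired
                rep(c("no", "yes"), times = c(20, 10))))  # student

# Contingency table of category by outcome.
TableForTest <- table(job, sub)

# Subscription percentage within each category, and the overall rate
# (assumed here to be the reference line drawn in the plots).
pct_by_job    <- 100 * prop.table(TableForTest, margin = 1)[, "yes"]
reference_pct <- 100 * mean(sub == "yes")

# Pearson's chi-squared test of independence.
test <- chisq.test(TableForTest)

print(round(pct_by_job, 1))
print(reference_pct)
print(test$p.value < 0.05)
```

With a sample as large as the one analysed here, a very small p-value is expected even for modest differences, so the size of each category's gap to the reference line is arguably the more informative quantity.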
Our visualized ad-hoc A/B test suggests that marital status is an important variable for subscription preference. The chi-squared test below is statistically significant, which confirms the association. In the plot, single clients have the highest subscription percentage, which may become an important factor in the classification model in a later section.
##
## Pearson's Chi-squared test
##
## data: TableForTest
## X-squared = 158.29, df = 2, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Our visualized ad-hoc A/B test suggests that education level is an important variable for subscription preference. The chi-squared test below is statistically significant, which confirms that education level matters. In the plot, clients with tertiary or unknown education have a higher subscription percentage, which may become an important factor in the classification model in a later section.
##
## Pearson's Chi-squared test
##
## data: TableForTest
## X-squared = 210.87, df = 3, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Our visualized ad-hoc A/B test suggests that credit default status is an important variable for subscription preference. The chi-squared test below is statistically significant, which confirms that credit default status matters. Most clients have no credit default, and their subscription percentage is close to the reference level.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TableForTest
## X-squared = 17.727, df = 1, p-value = 2.55e-05
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Our visualized ad-hoc A/B test suggests that housing loan status is an important variable for subscription preference, and the chi-squared test below confirms that the association is statistically significant. Clients with no housing loan tend to have a higher subscription percentage, which is above the reference level.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TableForTest
## X-squared = 692.19, df = 1, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Our visualized ad-hoc A/B test suggests that personal loan status is an important variable for subscription preference, and the chi-squared test below confirms that the association is statistically significant. Clients with no personal loan tend to have a higher subscription percentage than those with one, sitting at about the reference level. Taken together, the housing loan and personal loan plots suggest that a stable financial status may be an important factor in the classification model in a later section.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TableForTest
## X-squared = 161.41, df = 1, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Our visualized ad-hoc A/B test suggests that the contact type is an important variable for subscription preference, and the chi-squared test below confirms that the association is statistically significant. In the plot, clients contacted by telephone or cellular subscribe to the term deposit at a rate above our reference line. Contact type may therefore become an important factor in the classification model in a later section.
##
## Pearson's Chi-squared test
##
## data: TableForTest
## X-squared = 844.67, df = 2, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Our visualized ad-hoc A/B test suggests that the outcome of the previous marketing campaign is an important variable for subscription preference, and the chi-squared test below confirms that the association is statistically significant. In the plot, clients who subscribed previously have by far the highest percentage of subscribing to a term deposit again, well above our reference line, so this may become an important factor in the classification model in a later section. That these clients subscribe again suggests they have reasons to do so, and targeting them may benefit the next marketing campaign.
##
## Pearson's Chi-squared test
##
## data: TableForTest
## X-squared = 3629.7, df = 3, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Our visualized ad-hoc A/B test suggests that whether a client was contacted in a previous campaign is an important variable for subscription preference, and the chi-squared test below confirms that the association is statistically significant. Clients who were contacted previously subscribe to the term deposit at a rate above our reference line. Previous contact may therefore become an important factor in the classification model in a later section.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TableForTest
## X-squared = 1030.5, df = 1, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
These are the follow-up plots to the 1.7 Basic Exploratory Data Analysis section. Our visualized ad-hoc A/B test suggests that being older than 60 is an important variable for subscription preference, and the chi-squared test below confirms that the association is statistically significant. In the plot, clients in this age group subscribe to the term deposit at a rate above our reference line. Age group may therefore become an important factor in the classification model in a later section.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TableForTest
## X-squared = 894.35, df = 1, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Our visualized ad-hoc A/B test suggests that the season of contact is an important variable for subscription preference, and the chi-squared test below confirms that the association is statistically significant. In the plot, the low season has the highest subscription percentage, above our reference line, so season may become an important factor in the classification model in a later section. Although most contacts occur in the high season, its subscription rate is the lowest.
##
## Pearson's Chi-squared test
##
## data: TableForTest
## X-squared = 1417.8, df = 2, p-value < 2.2e-16
##
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
The histogram of the age distribution and its corresponding ad-hoc A/B test echo the results of the earlier age-group plots.
The distribution of average yearly balance is strongly right-skewed. Most clients have an average yearly balance between -2500 and 2500 euros. Our ad-hoc A/B test shows that this variable may play an important role in whether clients subscribe.
The distribution of the number of contacts is also strongly right-skewed. As the number of contacts increases, the subscription percentage decreases, and this is more obvious when zooming in on the range of 1 to 10 contacts. Our ad-hoc A/B test shows that clients contacted only once have a higher subscription percentage.
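The relationship between the number of contacts and the subscription percentage can be tabulated directly. The following is a minimal sketch on made-up data, assuming the zoomed-in view simply filters to 1 through 10 contacts; `toy`, `zoomed` and `pct_yes` are hypothetical names.

```r
# Toy data (hypothetical values, not the bank data).
toy <- data.frame(
  NumOfContacts = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 12),
  Subscription  = c("yes", "yes", "no", "no", "yes",
                    "no", "no", "no", "no", "no")
)

# Restrict to 1-10 contacts, mirroring the zoomed-in plot.
zoomed <- toy[toy$NumOfContacts >= 1 & toy$NumOfContacts <= 10, ]

# Percentage of "yes" within each contact count.
pct_yes <- 100 * tapply(zoomed$Subscription == "yes",
                        zoomed$NumOfContacts, mean)

print(round(pct_yes, 1))
```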
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.11495 0.98535 -2.14640 0.03184
## Age 0.00353 0.00216 1.63582 0.10188
## AvgYearlyBalance 0.00002 0.00000 4.04693 0.00005
## NumOfContacts -0.10902 0.00953 -11.44489 0.00000
## Previous 0.02550 0.00970 2.63038 0.00853
## JobType.admin. 0.09304 0.22537 0.41282 0.67974
## `JobType.blue-collar` -0.09265 0.22418 -0.41327 0.67941
## JobType.entrepreneur -0.24304 0.24538 -0.99045 0.32195
## JobType.housemaid -0.18998 0.24732 -0.76812 0.44241
## JobType.management -0.06958 0.22357 -0.31124 0.75562
## JobType.retired 0.54637 0.22912 2.38471 0.01709
## `JobType.self-employed` -0.09679 0.23831 -0.40614 0.68464
## JobType.services -0.14359 0.22950 -0.62567 0.53153
## JobType.student 0.53497 0.23770 2.25062 0.02441
## JobType.technician -0.08673 0.22355 -0.38797 0.69804
## JobType.unemployed 0.18879 0.23666 0.79773 0.42503
## Marital.divorced -0.09292 0.06431 -1.44485 0.14850
## Marital.married -0.34583 0.04493 -7.69730 0.00000
## Education.primary -0.26642 0.10083 -2.64226 0.00824
## Education.secondary -0.11397 0.08885 -1.28268 0.19961
## Education.tertiary 0.11091 0.09342 1.18720 0.23515
## CreditDefault.no 0.23357 0.16401 1.42414 0.15441
## HousingLoan.no 0.56162 0.03881 14.47057 0.00000
## PersonalLoan.no 0.46298 0.05749 8.05351 0.00000
## ContactType.cellular 1.00382 0.05622 17.85604 0.00000
## ContactType.telephone 0.80552 0.08610 9.35614 0.00000
## PreviousOutcome.failure -1.41149 0.94067 -1.50052 0.13348
## PreviousOutcome.other -1.12527 0.94220 -1.19430 0.23236
## PreviousOutcome.success 0.87942 0.94148 0.93409 0.35026
## HasPreviousContact.no -1.47237 0.94023 -1.56596 0.11736
## [1] "JobType.management"
## [1] 0.7556205
## [1] "`JobType.self-employed`"
## [1] 0.7688976
## [1] "JobType.technician"
## [1] 0.7992887
## [1] "`JobType.blue-collar`"
## [1] 0.8300147
## [1] "JobType.services"
## [1] 0.4165803
## [1] "JobType.housemaid"
## [1] 0.4024666
## [1] "PreviousOutcome.success"
## [1] 0.354393
## [1] "Education.tertiary"
## [1] 0.1815268
## [1] "JobType.entrepreneur"
## [1] 0.1765045
## [1] "CreditDefault.no"
## [1] 0.1461899
## [1] "Marital.divorced"
## [1] 0.1379186
## [1] "Age"
## [1] 0.2966937
## [1] "Previous"
## [1] 0.007773817
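The name and p-value pairs above look like the trace of a backward elimination: at each step the term with the largest p-value is printed and dropped, and the model is refitted. The last pair printed (Previous, 0.0078) is below 0.05 and Previous remains in the reduced model, which suggests the loop stops once the largest p-value falls under that threshold. The original code is not shown, so the following is only a sketch of that idea on simulated data; the 0.05 cut-off, the data frame `toy` and the predictors `x1`, `x2`, `x3` are assumptions.

```r
# Simulated data (hypothetical): x1 drives the outcome, x2 and x3 are noise.
set.seed(1)
n   <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x1))

kept <- c("x1", "x2", "x3")
repeat {
  # Refit the logistic regression on the terms still in the model.
  fit      <- glm(reformulate(kept, response = "y"),
                  data = toy, family = binomial)
  ct       <- summary(fit)$coefficients
  terms_in <- setdiff(rownames(ct), "(Intercept)")
  pvals    <- setNames(ct[terms_in, "Pr(>|z|)"], terms_in)

  # Report the term with the largest p-value at this step.
  worst <- names(pvals)[which.max(pvals)]
  print(worst)
  print(pvals[[worst]])

  # Stop once every remaining term is significant (or one term is left).
  if (pvals[[worst]] < 0.05 || length(kept) == 1) break
  kept <- setdiff(kept, worst)
}

print(kept)
```

Dropping one term at a time and refitting matters because p-values change as correlated terms leave the model, which is visible in the trace above where the same job types show different p-values from the first coefficient table.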
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.90937 0.11097 -8.19467 0.00000
## AvgYearlyBalance 0.00002 0.00000 4.31809 0.00002
## NumOfContacts -0.10887 0.00951 -11.44333 0.00000
## Previous 0.02577 0.00968 2.66174 0.00777
## JobType.admin. 0.19142 0.05663 3.38038 0.00072
## JobType.retired 0.69992 0.06887 10.16262 0.00000
## JobType.student 0.59598 0.09464 6.29742 0.00000
## JobType.unemployed 0.28906 0.09326 3.09940 0.00194
## Marital.married -0.30238 0.03670 -8.24021 0.00000
## Education.primary -0.37632 0.06083 -6.18607 0.00000
## Education.secondary -0.22922 0.03918 -5.85043 0.00000
## HousingLoan.no 0.57048 0.03804 14.99711 0.00000
## PersonalLoan.no 0.47664 0.05722 8.32998 0.00000
## ContactType.cellular 1.01013 0.05596 18.05172 0.00000
## ContactType.telephone 0.82504 0.08549 9.65024 0.00000
## PreviousOutcome.failure -2.29244 0.07990 -28.69294 0.00000
## PreviousOutcome.other -2.00556 0.09677 -20.72563 0.00000
## HasPreviousContact.no -2.35688 0.07319 -32.20307 0.00000
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.8913 0.2260 0.8847 0.8976 0.8831
## AccuracyPValue McnemarPValue
## 0.0076 0.0000
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.8912 0.2275 0.8846 0.8975 0.8831
## AccuracyPValue McnemarPValue
## 0.0084 0.0000
Our model predicts whether a client will subscribe to the term deposit with 89.12% accuracy. This is above the no-information rate of 88.31% (AccuracyNull), and the difference is statistically significant (AccuracyPValue = 0.0084), although the Kappa of 0.2275 indicates only fair agreement beyond chance. To examine the result further, a cumulative accuracy profile (CAP) curve is constructed.
Although the accuracy drops from 89.13% to 89.12% after the non-significant variables are removed, the decrease is very slight. In exchange, the model becomes easier to read. Since every variable in the final model has a p-value below 5%, i.e. is statistically significant, the signs of the coefficients can be used to interpret the model.
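The column names in the accuracy output (Accuracy, Kappa, AccuracyNull and so on) resemble the overall statistics reported by `confusionMatrix` in the caret package, though the source does not say which function was used. To make the three headline quantities concrete, the following sketch computes them by hand from a made-up 2 x 2 confusion matrix; the counts in `cm` are hypothetical and unrelated to the bank data.

```r
# Toy confusion matrix (hypothetical counts). matrix() fills column by
# column, so the first column holds the clients whose true class is "no".
cm <- matrix(c(850, 30,
                80, 40),
             nrow = 2,
             dimnames = list(Predicted = c("no", "yes"),
                             Actual    = c("no", "yes")))
n <- sum(cm)

# Accuracy: share of clients on the diagonal.
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy of always predicting the majority class.
no_info_rate <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the marginals alone would produce.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(c(accuracy = accuracy, no_info_rate = no_info_rate,
        kappa = cohen_kappa))
```

When one class dominates, as "no" does here, accuracy alone can look high while kappa stays low, which is why the two are worth reading together.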
The following variables have negative coefficients in our model:
## [1] "(Intercept)" "NumOfContacts"
## [3] "Marital.married" "Education.primary"
## [5] "Education.secondary" "PreviousOutcome.failure"
## [7] "PreviousOutcome.other" "HasPreviousContact.no"
The variables above match the results of the ad-hoc A/B tests in the previous section: clients with these characteristics tend NOT to subscribe to a bank’s term deposit. Among them, no previous contact, failed previous outcome and other previous outcome have the largest negative coefficients, at -2.36, -2.29 and -2.01 respectively, so they have the strongest per-unit association with subscription preference. The remaining coefficients lie between 0 and -1: number of contacts, married marital status, primary education and secondary education.
The following variables have positive coefficients in our model:
## [1] "AvgYearlyBalance" "Previous"
## [3] "JobType.admin." "JobType.retired"
## [5] "JobType.student" "JobType.unemployed"
## [7] "HousingLoan.no" "PersonalLoan.no"
## [9] "ContactType.cellular" "ContactType.telephone"
The variables above match the results of the ad-hoc A/B tests in the previous section: clients with these characteristics tend to subscribe to a bank’s term deposit. Ordered by coefficient from largest (1.01) to smallest (0.00002), they are: contact by cellular, contact by telephone, retired, student, no housing loan, no personal loan, unemployed, administrative job, number of previous contacts, and average yearly balance. The per-unit association with subscription preference decreases in that order.
The value X, read off the CAP curve at the 50% mark, is 78% in the upper plot and 79% in the lower plot, indicating that our model is good. The vertical line marks 50% on the x-axis, and its intersection with the curve shows what share of all subscribers is reached when half of the clients, ranked by predicted probability, are contacted. In the case above, contacting the top half of the clients captures 78% of those who will subscribe to the product.
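The value read off at the 50% mark can be computed without a plot. The following is a minimal sketch on made-up scores, not the original plotting code: `prob` stands for the model's predicted probabilities and `actual` for the observed outcomes (1 = subscribed), both hypothetical.

```r
# Toy scores and outcomes (hypothetical values, not the bank data).
prob   <- c(0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05)
actual <- c(1,    1,    0,    1,    0,    0,    1,    0,    0,    0)

# Rank clients from highest to lowest predicted probability.
ord <- order(prob, decreasing = TRUE)

# Share of clients contacted so far, and share of all subscribers captured.
share_contacted <- seq_along(ord) / length(ord)
share_captured  <- cumsum(actual[ord]) / sum(actual)

# CAP value at the 50% mark: first point where half the clients are reached.
cap_at_50 <- share_captured[which(share_contacted >= 0.5)[1]]

print(cap_at_50)
```

A model with no predictive power would capture about 50% of subscribers at this point, so the further the value lies above 0.5, the better the ranking.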
## Estimate Std. Error z value Pr(>|z|) OddRatio
## (Intercept) -0.90937 0.11097 -8.19467 0.00000 0.403
## AvgYearlyBalance 0.00002 0.00000 4.31809 0.00002 1.000
## NumOfContacts -0.10887 0.00951 -11.44333 0.00000 0.897
## Previous 0.02577 0.00968 2.66174 0.00777 1.026
## JobType.admin. 0.19142 0.05663 3.38038 0.00072 1.211
## JobType.retired 0.69992 0.06887 10.16262 0.00000 2.014
## JobType.student 0.59598 0.09464 6.29742 0.00000 1.815
## JobType.unemployed 0.28906 0.09326 3.09940 0.00194 1.335
## Marital.married -0.30238 0.03670 -8.24021 0.00000 0.739
## Education.primary -0.37632 0.06083 -6.18607 0.00000 0.686
## Education.secondary -0.22922 0.03918 -5.85043 0.00000 0.795
## HousingLoan.no 0.57048 0.03804 14.99711 0.00000 1.769
## PersonalLoan.no 0.47664 0.05722 8.32998 0.00000 1.611
## ContactType.cellular 1.01013 0.05596 18.05172 0.00000 2.746
## ContactType.telephone 0.82504 0.08549 9.65024 0.00000 2.282
## PreviousOutcome.failure -2.29244 0.07990 -28.69294 0.00000 0.101
## PreviousOutcome.other -2.00556 0.09677 -20.72563 0.00000 0.135
## HasPreviousContact.no -2.35688 0.07319 -32.20307 0.00000 0.095
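The OddRatio column is the exponential of the Estimate column, since a logistic regression coefficient is a change in log-odds. As a worked check, the sketch below recomputes four of the values from the estimates in the table above; the vector name `estimate` is hypothetical.

```r
# Estimates copied from the table above.
estimate <- c(NumOfContacts         = -0.10887,
              JobType.retired       =  0.69992,
              ContactType.cellular  =  1.01013,
              HasPreviousContact.no = -2.35688)

# Odds ratio = exp(log-odds coefficient), rounded as in the table.
odds_ratio <- round(exp(estimate), 3)

print(odds_ratio)
```

The results agree with the table (0.897, 2.014, 2.746 and 0.095). Read this way, each additional contact in the current campaign multiplies the odds of subscribing by about 0.897, roughly a 10% reduction, with the other variables held fixed.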