Content

1.Background & Data Preparation
1.1 Background of This Project
1.2 Data Source
1.3 Load the Dataset
1.4 Load Required Packages
1.5 Split the Data into Training Dataset and Testing Dataset
1.6 Functions Created for this Project
1.7 Basic Exploratory Data Analysis
1.8 Data Preparation
1.9 Basic Data Transformation
2.Exploratory Data Analysis
2.1 Bar Charts of Categorical Variables with Ad-hoc A/B Test and Chi-Squared Test
2.2 Histograms of Continuous Variables with Ad-hoc A/B Test
3.Modelling
3.1 Finalize the Dataset for Training and Testing
3.2 Use Logistic Regression to Fit the Model - Initial Model
3.3 Eliminate Variables & Re-Fit the Model - Improved Model
3.4 Assess and Interpret the Model
3.5 Assess the Model with Cumulative Accuracy Profile(CAP) Curve
4.Insights & Conclusion
4.1 Insights from the Logistic Regression Model
4.2 Insights from the CAP Curve
5.Source of Reference

1.Background & Data Preparation

1.1 Background of This Project

The data come from direct marketing campaigns (phone calls) run by a Portuguese banking institution from May 2008 to November 2010. Clients were often contacted more than once in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).

The goal of this project:
  • to classify and predict whether the client will subscribe (yes/no) to a term deposit
  • to improve the strategy for the next marketing campaign
  • to draw business insights from the data
Variable description:
Bank client data:
  • Age (numeric)
  • JobType : type of job
  • Marital : marital status
  • Education
  • CreditDefault: has credit in default? (“yes”,“no”)
  • AvgYearlyBalance: average yearly balance, in euros
  • HousingLoan: has housing loan? (“yes”,“no”)
  • PersonalLoan: has personal loan? (“yes”,“no”)
Other variables:
  • NumOfContacts: number of contacts performed during this campaign and for this client
  • PassedDays: number of days that passed by after the client was last contacted from a previous campaign (-1 means client was not previously contacted)
  • Previous: number of contacts performed before this campaign and for this client
  • PreviousOutcome: outcome of the previous marketing campaign
Output variable (desired target):
  • Subscription: has the client subscribed to a term deposit? (“yes”,“no”)

1.2 Data Source

  1. Link: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
  2. From the ‘Data Folder’, download the ‘bank.zip’
  3. Unzip ‘bank.zip’
  4. Get the ‘bank-full.csv’ for this project

1.3 Load the Dataset

1.4 Load Required Packages

1.5 Split the Data into Training Dataset and Testing Dataset

  • The original dataset will be split.
  • 80% of it will be the training set.
  • 20% of it will be the testing set for assessing the model performance.
  • Throughout this project, the training set will be used for exploratory data analysis, model building and fitting, and drawing insights.
## [1] TRUE
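
The report does not show the code behind this step. As a rough illustration only, an 80/20 split can be sketched in Python as below. The function name `split_train_test`, the seed, and the ten-row stand-in data are assumptions, and a plain random shuffle is used here, which may differ from the partitioning method of the original analysis.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=42):
    """Shuffle row indices reproducibly, then split rows into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    # Sorting the chosen indices keeps each subset in its original row order.
    train = [rows[i] for i in sorted(indices[:cut])]
    test = [rows[i] for i in sorted(indices[cut:])]
    return train, test


# Hypothetical stand-in data: ten row identifiers instead of the real dataset.
rows = list(range(1, 11))
train, test = split_train_test(rows)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, so the training and testing sets stay the same between runs.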

1.6 Functions Created for this Project

1.7 Basic Exploratory Data Analysis

##   Age      JobType  Marital Education CreditDefault AvgYearlyBalance
## 1  58   management  married  tertiary            no             2143
## 2  44   technician   single secondary            no               29
## 3  33 entrepreneur  married secondary            no                2
## 5  33      unknown   single   unknown            no                1
## 7  28   management   single  tertiary            no              447
## 8  42 entrepreneur divorced  tertiary           yes                2
##   HousingLoan PersonalLoan ContactType Day Month Duration NumOfContacts
## 1         yes           no     unknown   5   may      261             1
## 2         yes           no     unknown   5   may      151             1
## 3         yes          yes     unknown   5   may       76             1
## 5          no           no     unknown   5   may      198             1
## 7         yes          yes     unknown   5   may      217             1
## 8         yes           no     unknown   5   may      380             1
##   PassedDays Previous PreviousOutcome Subscription
## 1         -1        0         unknown           no
## 2         -1        0         unknown           no
## 3         -1        0         unknown           no
## 5         -1        0         unknown           no
## 7         -1        0         unknown           no
## 8         -1        0         unknown           no
##       Age     JobType  Marital Education CreditDefault AvgYearlyBalance
## 45203  34      admin.   single secondary            no              557
## 45204  23     student   single  tertiary            no              113
## 45207  51  technician  married  tertiary            no              825
## 45208  71     retired divorced   primary            no             1729
## 45209  72     retired  married secondary            no             5715
## 45210  57 blue-collar  married secondary            no              668
##       HousingLoan PersonalLoan ContactType Day Month Duration
## 45203          no           no    cellular  17   nov      224
## 45204          no           no    cellular  17   nov      266
## 45207          no           no    cellular  17   nov      977
## 45208          no           no    cellular  17   nov      456
## 45209          no           no    cellular  17   nov     1127
## 45210          no           no   telephone  17   nov      508
##       NumOfContacts PassedDays Previous PreviousOutcome Subscription
## 45203             1         -1        0         unknown          yes
## 45204             1         -1        0         unknown          yes
## 45207             3         -1        0         unknown          yes
## 45208             2         -1        0         unknown          yes
## 45209             5        184        3         success          yes
## 45210             4         -1        0         unknown           no
## 'data.frame':    36170 obs. of  17 variables:
##  $ Age             : int  58 44 33 33 28 42 58 41 29 53 ...
##  $ JobType         : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 12 5 3 6 1 1 10 ...
##  $ Marital         : Factor w/ 3 levels "divorced","married",..: 2 3 2 3 3 1 2 1 3 2 ...
##  $ Education       : Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 3 3 1 2 2 2 ...
##  $ CreditDefault   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ AvgYearlyBalance: int  2143 29 2 1 447 2 121 270 390 6 ...
##  $ HousingLoan     : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ PersonalLoan    : Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
##  $ ContactType     : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Day             : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Month           : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ Duration        : int  261 151 76 198 217 380 50 222 137 517 ...
##  $ NumOfContacts   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PassedDays      : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ Previous        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PreviousOutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Subscription    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##       Age               JobType         Marital          Education    
##  Min.   :18.00   blue-collar:7766   divorced: 4215   primary  : 5454  
##  1st Qu.:33.00   management :7616   married :21726   secondary:18541  
##  Median :39.00   technician :6068   single  :10229   tertiary :10672  
##  Mean   :40.91   admin.     :4167                    unknown  : 1503  
##  3rd Qu.:48.00   services   :3297                                     
##  Max.   :95.00   retired    :1786                                     
##                  (Other)    :5470                                     
##  CreditDefault AvgYearlyBalance HousingLoan PersonalLoan    ContactType   
##  no :35513     Min.   : -8019   no :16071   no :30320    cellular :23365  
##  yes:  657     1st Qu.:    72   yes:20099   yes: 5850    telephone: 2319  
##                Median :   450                            unknown  :10486  
##                Mean   :  1355                                             
##                3rd Qu.:  1423                                             
##                Max.   :102127                                             
##                                                                           
##       Day           Month          Duration      NumOfContacts   
##  Min.   : 1.0   may    :10969   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.0   jul    : 5535   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.0   aug    : 4999   Median : 180.0   Median : 2.000  
##  Mean   :15.8   jun    : 4317   Mean   : 258.6   Mean   : 2.767  
##  3rd Qu.:21.0   nov    : 3157   3rd Qu.: 318.0   3rd Qu.: 3.000  
##  Max.   :31.0   apr    : 2334   Max.   :4918.0   Max.   :63.000  
##                 (Other): 4859                                    
##    PassedDays        Previous        PreviousOutcome Subscription
##  Min.   : -1.00   Min.   :  0.0000   failure: 3940   no :31938   
##  1st Qu.: -1.00   1st Qu.:  0.0000   other  : 1457   yes: 4232   
##  Median : -1.00   Median :  0.0000   success: 1194               
##  Mean   : 40.15   Mean   :  0.5783   unknown:29579               
##  3rd Qu.: -1.00   3rd Qu.:  0.0000                               
##  Max.   :871.00   Max.   :275.0000                               
## 

What does “unknown” mean in the column of PreviousOutcome?

## [1] 29574
## [1] 29579

The number of contacts in the current marketing campaign falls into three distinct periods:

  • High Season - Period with high contact number: May to Aug
  • Medium Season - Period with medium contact number: Feb, Apr, Nov
  • Low Season - Period with low contact number: Jan, Mar, Sept, Oct, Dec

A new categorical variable will be created to address these three periods:

## [1] TRUE
## Raw_Training_Season
##   high    low medium 
##  25820   2713   7637
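
As an illustration of how such a season variable could be derived, the Python sketch below maps each month label to one of the three periods listed above. The names `SEASON_BY_MONTH` and `season_of` are hypothetical, and the three-letter lowercase month labels (including "sep") are assumed from the summary output.

```python
# Month-to-season mapping taken from the three periods described in the text.
SEASON_BY_MONTH = {
    "may": "high", "jun": "high", "jul": "high", "aug": "high",
    "feb": "medium", "apr": "medium", "nov": "medium",
    "jan": "low", "mar": "low", "sep": "low", "oct": "low", "dec": "low",
}


def season_of(month):
    """Return 'high', 'medium' or 'low' for a three-letter month label."""
    return SEASON_BY_MONTH[month.strip().lower()]


# Month counts visible in the summary output (the remaining months are
# hidden under "(Other)" there, so only the high season can be totalled).
visible_counts = {"may": 10969, "jul": 5535, "aug": 4999,
                  "jun": 4317, "nov": 3157, "apr": 2334}
high_total = sum(n for m, n in visible_counts.items() if season_of(m) == "high")
print(high_total)  # 25820, the "high" count shown in the table
```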

The number of contacts in the current marketing campaign differs between two age groups:

  • The number of contacts drops dramatically at around 61 years old
  • A new categorical variable will be created to address these two groups

## [1] 935
##  no yes 
## 535 400

1.8 Data Preparation

Manually corrected data points
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   3.171   4.000 275.000
## [1] 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    3.13    4.00   58.00
Construct new variables in training dataset
Construct new variables in testing dataset

1.9 Basic Data Transformation

## 
##    no   yes 
## 31938  4232
  • The overall subscription rate in the training set (4232 of 36170 clients, about 11.7%) is used as the reference line added to the plots.

2.Exploratory Data Analysis

2.1 Bar Charts of Categorical Variables with Ad-hoc A/B Test and Chi-Squared Test

Job Types

Our visual ad-hoc A/B test suggests that job type is an important variable for subscription preference, and the chi-squared test below confirms this. In the plot, clients who are in management, retired, unemployed, or students subscribe to a term deposit at a higher rate than our reference line. These job types may be important factors in our classification model in a later section. Although blue-collar clients are the largest group by count, their subscription rate is the lowest. This may provide insight into how to target potential clients by job type.

## 
##  Pearson's Chi-squared test
## 
## data:  TableForTest
## X-squared = 667, df = 11, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
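
For readers who want to see what the test computes, the Python sketch below derives the Pearson chi-squared statistic and its degrees of freedom from a contingency table. The counts are made up for illustration and are not taken from the bank data, and the function name `chi_squared` is hypothetical. Unlike the 2 x 2 tests reported later, this sketch applies no Yates' continuity correction.

```python
def chi_squared(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical counts: rows are three categories, columns are "no" and "yes".
toy_table = [[90, 10], [80, 20], [70, 30]]
statistic, df = chi_squared(toy_table)
print(statistic, df)
```

With 12 job types and 2 outcomes, the same formula gives the 11 degrees of freedom shown in the output above.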
Marital Status

Our visual ad-hoc A/B test suggests that marital status is an important variable for subscription preference. The statistically significant result of the chi-squared test below confirms this. In the plot, single clients have the highest subscription rate, which may make marital status an important factor in our classification model in a later section.

## 
##  Pearson's Chi-squared test
## 
## data:  TableForTest
## X-squared = 158.29, df = 2, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Education Level

Our visual ad-hoc A/B test suggests that education level is an important variable for subscription preference. The statistically significant result of the chi-squared test below confirms this. In the plot, clients with a tertiary or unknown education level have a higher subscription rate, which may make education level an important factor in our classification model in a later section.

## 
##  Pearson's Chi-squared test
## 
## data:  TableForTest
## X-squared = 210.87, df = 3, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Credit Default Status

Our visual ad-hoc A/B test suggests that credit default status is an important variable for subscription preference. The statistically significant result of the chi-squared test below confirms this. Most clients have no credit default, and their subscription rate is close to the reference level.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TableForTest
## X-squared = 17.727, df = 1, p-value = 2.55e-05
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Housing Loan Status

Our visual ad-hoc A/B test suggests that housing loan status is an important variable for subscription preference, and the chi-squared test below confirms this. Clients with no housing loan tend to have a higher subscription rate, which is above the reference level.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TableForTest
## X-squared = 692.19, df = 1, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Personal Loan Status

Our visual ad-hoc A/B test suggests that personal loan status is an important variable for subscription preference, and the chi-squared test below confirms this. Clients with no personal loan tend to have a higher subscription rate, which is at about the reference level. Taken together, the housing loan and personal loan plots suggest that a stable financial status may be an important factor in our classification model in a later section.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TableForTest
## X-squared = 161.41, df = 1, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Contact Type of Communication

Our visual ad-hoc A/B test suggests that the contact type of communication is an important variable for subscription preference, and the chi-squared test below confirms this. In the plot, clients contacted by telephone or cellular subscribe to a term deposit at a higher rate than our reference line. Contact type may be an important factor in our classification model in a later section.

## 
##  Pearson's Chi-squared test
## 
## data:  TableForTest
## X-squared = 844.67, df = 2, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Outcomes of the Previous Campaign

Our visual ad-hoc A/B test suggests that the outcome of the previous marketing campaign is an important variable for subscription preference, and the chi-squared test below confirms this. In the plot, clients who subscribed previously have the highest rate of subscribing to a term deposit again, far above our reference line. This may be an important factor in our classification model in a later section. It also indicates that these clients have reasons to subscribe again, so targeting them may be beneficial for the next marketing campaign.

## 
##  Pearson's Chi-squared test
## 
## data:  TableForTest
## X-squared = 3629.7, df = 3, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Previous Contact with the Clients

Our visual ad-hoc A/B test suggests that whether a client was contacted in a previous campaign is an important variable for subscription preference, and the chi-squared test below confirms this. Clients who were contacted previously subscribe to a term deposit at a higher rate than our reference line. Previous contact may be an important factor in our classification model in a later section.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TableForTest
## X-squared = 1030.5, df = 1, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Age Groups

These are the follow-up plots to section 1.7 Basic Exploratory Data Analysis. Our visual ad-hoc A/B test suggests that being older than 60 is an important variable for subscription preference, and the chi-squared test below confirms this. In the plot, these clients subscribe to a term deposit at a higher rate than our reference line. Age group may be an important factor in our classification model in a later section.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TableForTest
## X-squared = 894.35, df = 1, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"
Seasonal Spread

Our visual ad-hoc A/B test suggests that the season of contact is an important variable for subscription preference, and the chi-squared test below confirms this. In the plot, the low season has the highest subscription rate, well above our reference line. Season may be an important factor in our classification model in a later section. Although most contacts are made in the high season, its subscription rate is the lowest.

## 
##  Pearson's Chi-squared test
## 
## data:  TableForTest
## X-squared = 1417.8, df = 2, p-value < 2.2e-16
## 
## [1] "This is a statistical significant result. This variable plays an important role in the decision of clients to subscribe or not to subscribe the bank's term deposit"

2.2 Histograms of Continuous Variables with Ad-hoc A/B Test

Age

The histogram of the age distribution and its corresponding ad-hoc A/B test echo the results of the earlier age group plots.

Average Yearly Balance

The distribution of average yearly balance is strongly skewed to the right. Most clients have an average yearly balance between -2500 and 2500 euros. Our ad-hoc A/B test suggests that this variable may play an important role in whether clients subscribe or not.

Number of Contacts of each Client

The distribution of the number of contacts is highly skewed to the right. As the number of contacts increases, the subscription percentage decreases. Zooming in on the range from 1 to 10 contacts makes this pattern more obvious. Our ad-hoc A/B test shows that clients contacted only once have a higher subscription percentage.

3.Modelling

3.1 Finalize the Dataset for Training and Testing

Add Dummy Variables to Replace Factor Variables in Training Dataset

Assumptions:

  • The variable Duration strongly affects the output target (e.g., if duration=0 then y=‘no’). Yet the duration is not known before a call is performed, and once the call ends y is obviously known. Thus, this input is discarded in order to have a realistic predictive model.
  • Date variables are excluded from the model. Generally speaking, a client’s decision to subscribe to a bank’s term deposit is assumed not to depend on the date. Thus, the Day, Month, and Season variables in the dataset are excluded.
  • The variable AgeGreaterThan60 is excluded from modelling in order to avoid collinearity with Age, because the former is generated from the latter.
  • The variable PassedDays is excluded since we have created another variable, HasPreviousContact, to capture its key information and replace it.

Add Dummy Variables to Replace Factor Variables in Testing Dataset
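
The dummy-variable step applied to both datasets can be sketched in Python as follows. This is only an illustration: the function name `add_dummies` is hypothetical, the column naming `<column>.<level>` mimics the coefficient names in the model output, and dropping one reference level per factor is an assumption based on which levels are missing from that output.

```python
def add_dummies(rows, column, reference):
    """Replace a categorical column with 0/1 columns named '<column>.<level>'.

    The reference level gets no column of its own: a row at the reference
    level has all of the generated dummies equal to 0.
    """
    levels = sorted({row[column] for row in rows} - {reference})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}.{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Three clients with the Age and Marital values of rows 1, 2 and 8 above.
clients = [
    {"Age": 58, "Marital": "married"},
    {"Age": 44, "Marital": "single"},
    {"Age": 42, "Marital": "divorced"},
]
for row in add_dummies(clients, "Marital", reference="single"):
    print(row)
```

In practice the set of levels should be fixed from the training data and reused for the testing data, so that both datasets end up with identical columns.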

3.2 Use Logistic Regression to Fit the Model - Initial Model

##                         Estimate Std. Error   z value Pr(>|z|)
## (Intercept)             -2.11495    0.98535  -2.14640  0.03184
## Age                      0.00353    0.00216   1.63582  0.10188
## AvgYearlyBalance         0.00002    0.00000   4.04693  0.00005
## NumOfContacts           -0.10902    0.00953 -11.44489  0.00000
## Previous                 0.02550    0.00970   2.63038  0.00853
## JobType.admin.           0.09304    0.22537   0.41282  0.67974
## `JobType.blue-collar`   -0.09265    0.22418  -0.41327  0.67941
## JobType.entrepreneur    -0.24304    0.24538  -0.99045  0.32195
## JobType.housemaid       -0.18998    0.24732  -0.76812  0.44241
## JobType.management      -0.06958    0.22357  -0.31124  0.75562
## JobType.retired          0.54637    0.22912   2.38471  0.01709
## `JobType.self-employed` -0.09679    0.23831  -0.40614  0.68464
## JobType.services        -0.14359    0.22950  -0.62567  0.53153
## JobType.student          0.53497    0.23770   2.25062  0.02441
## JobType.technician      -0.08673    0.22355  -0.38797  0.69804
## JobType.unemployed       0.18879    0.23666   0.79773  0.42503
## Marital.divorced        -0.09292    0.06431  -1.44485  0.14850
## Marital.married         -0.34583    0.04493  -7.69730  0.00000
## Education.primary       -0.26642    0.10083  -2.64226  0.00824
## Education.secondary     -0.11397    0.08885  -1.28268  0.19961
## Education.tertiary       0.11091    0.09342   1.18720  0.23515
## CreditDefault.no         0.23357    0.16401   1.42414  0.15441
## HousingLoan.no           0.56162    0.03881  14.47057  0.00000
## PersonalLoan.no          0.46298    0.05749   8.05351  0.00000
## ContactType.cellular     1.00382    0.05622  17.85604  0.00000
## ContactType.telephone    0.80552    0.08610   9.35614  0.00000
## PreviousOutcome.failure -1.41149    0.94067  -1.50052  0.13348
## PreviousOutcome.other   -1.12527    0.94220  -1.19430  0.23236
## PreviousOutcome.success  0.87942    0.94148   0.93409  0.35026
## HasPreviousContact.no   -1.47237    0.94023  -1.56596  0.11736

3.3 Eliminate Variables & Re-Fit the Model - Improved Model

## [1] "JobType.management"
## [1] 0.7556205
## [1] "`JobType.self-employed`"
## [1] 0.7688976
## [1] "JobType.technician"
## [1] 0.7992887
## [1] "`JobType.blue-collar`"
## [1] 0.8300147
## [1] "JobType.services"
## [1] 0.4165803
## [1] "JobType.housemaid"
## [1] 0.4024666
## [1] "PreviousOutcome.success"
## [1] 0.354393
## [1] "Education.tertiary"
## [1] 0.1815268
## [1] "JobType.entrepreneur"
## [1] 0.1765045
## [1] "CreditDefault.no"
## [1] 0.1461899
## [1] "Marital.divorced"
## [1] 0.1379186
## [1] "Age"
## [1] 0.2966937
## [1] "Previous"
## [1] 0.007773817
##                         Estimate Std. Error   z value Pr(>|z|)
## (Intercept)             -0.90937    0.11097  -8.19467  0.00000
## AvgYearlyBalance         0.00002    0.00000   4.31809  0.00002
## NumOfContacts           -0.10887    0.00951 -11.44333  0.00000
## Previous                 0.02577    0.00968   2.66174  0.00777
## JobType.admin.           0.19142    0.05663   3.38038  0.00072
## JobType.retired          0.69992    0.06887  10.16262  0.00000
## JobType.student          0.59598    0.09464   6.29742  0.00000
## JobType.unemployed       0.28906    0.09326   3.09940  0.00194
## Marital.married         -0.30238    0.03670  -8.24021  0.00000
## Education.primary       -0.37632    0.06083  -6.18607  0.00000
## Education.secondary     -0.22922    0.03918  -5.85043  0.00000
## HousingLoan.no           0.57048    0.03804  14.99711  0.00000
## PersonalLoan.no          0.47664    0.05722   8.32998  0.00000
## ContactType.cellular     1.01013    0.05596  18.05172  0.00000
## ContactType.telephone    0.82504    0.08549   9.65024  0.00000
## PreviousOutcome.failure -2.29244    0.07990 -28.69294  0.00000
## PreviousOutcome.other   -2.00556    0.09677 -20.72563  0.00000
## HasPreviousContact.no   -2.35688    0.07319 -32.20307  0.00000
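
The elimination loop that produces output like the above can be sketched in Python as below. This is an assumed reading of the procedure, not the report's own code: at each step the model is refitted, the variable with the largest p-value is found, and it is dropped unless its p-value is already below 5%. The function `backward_eliminate` and the stand-in `toy_fit` with its made-up p-values are hypothetical; a real `fit_p_values` would refit the logistic regression each time.

```python
def backward_eliminate(features, fit_p_values, threshold=0.05):
    """Drop the least significant feature until all p-values are below threshold.

    `fit_p_values(features)` must refit the model on `features` and return
    a dict mapping each feature name to its p-value.
    """
    kept = list(features)
    while kept:
        p_values = fit_p_values(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] < threshold:
            break  # every remaining variable is significant
        kept.remove(worst)
    return kept


# Stand-in for a real model fit: fixed, made-up p-values. In a real fit the
# p-values change after each refit, as the printed output above shows.
TOY_P_VALUES = {"A": 0.001, "B": 0.76, "C": 0.30, "D": 0.02}


def toy_fit(features):
    return {name: TOY_P_VALUES[name] for name in features}


print(backward_eliminate(["A", "B", "C", "D"], toy_fit))  # ['A', 'D']
```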

3.4 Assess and Interpret the Model

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##         0.8913         0.2260         0.8847         0.8976         0.8831 
## AccuracyPValue  McnemarPValue 
##         0.0076         0.0000
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##         0.8912         0.2275         0.8846         0.8975         0.8831 
## AccuracyPValue  McnemarPValue 
##         0.0084         0.0000

Our model predicts whether a client will subscribe to the term deposit with 89.12% accuracy. This figure should be read against the no-information rate of 88.31% (AccuracyNull): the gain is modest but statistically significant (AccuracyPValue = 0.0084), and the Kappa of 0.2275 indicates only modest agreement beyond chance. To examine the model further, a cumulative accuracy profile (CAP) curve will be constructed.

Although the accuracy drops from 89.13% to 89.12% after variable elimination, the decrease is very slight. In exchange, the model becomes easier to read. Since all the variables in our final model have p-values below 5%, meaning they are statistically significant, their positive or negative signs provide evidence for interpreting the model.
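
To make the reported metrics concrete, the Python sketch below computes accuracy, Cohen's Kappa, and the no-information rate from the four cells of a confusion matrix. The counts are hypothetical and are not the report's confusion matrix, which is not shown; the function name `accuracy_and_kappa` is likewise an assumption.

```python
def accuracy_and_kappa(tn, fp, fn, tp):
    """Return (accuracy, Cohen's kappa, no-information rate) for a 2x2 table."""
    n = tn + fp + fn + tp
    accuracy = (tn + tp) / n
    # Agreement expected by chance, from predicted and actual class shares.
    chance_yes = ((tp + fp) / n) * ((tp + fn) / n)
    chance_no = ((tn + fn) / n) * ((tn + fp) / n)
    expected = chance_yes + chance_no
    kappa = (accuracy - expected) / (1 - expected)
    # Accuracy of always predicting the more frequent actual class.
    no_information_rate = max(tn + fp, fn + tp) / n
    return accuracy, kappa, no_information_rate


# Hypothetical counts for illustration only.
accuracy, kappa, nir = accuracy_and_kappa(tn=50, fp=10, fn=5, tp=35)
print(round(accuracy, 4), round(kappa, 4), round(nir, 4))
```

When one class dominates, as "no" does here, accuracy can be high while Kappa stays low, which is why both are worth reporting.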

The following variables detract from subscription (negative coefficients):

## [1] "(Intercept)"             "NumOfContacts"          
## [3] "Marital.married"         "Education.primary"      
## [5] "Education.secondary"     "PreviousOutcome.failure"
## [7] "PreviousOutcome.other"   "HasPreviousContact.no"

The above variables match the results of the ad-hoc A/B tests in the previous section, meaning that clients with these characteristics tend NOT to subscribe to a bank’s term deposit. Among these variables, no previous contact, failed previous outcome, and other previous outcome have the largest negative coefficients, which are -2.36, -2.29, and -2.01 respectively. They have a stronger per-unit association with subscription preference. The other coefficients lie between 0 and -1: number of contacts, married marital status, primary education, and secondary education.

The following variables contribute to subscription (positive coefficients):

##  [1] "AvgYearlyBalance"      "Previous"             
##  [3] "JobType.admin."        "JobType.retired"      
##  [5] "JobType.student"       "JobType.unemployed"   
##  [7] "HousingLoan.no"        "PersonalLoan.no"      
##  [9] "ContactType.cellular"  "ContactType.telephone"

The above variables match the results of the ad-hoc A/B tests in the previous section, meaning that clients with these characteristics tend to subscribe to a bank’s term deposit. Ordered by coefficient from largest (1.01) to smallest (0.00002), they are: contact by cellular, contact by telephone, retired, student, no housing loan, no personal loan, unemployed, job in administration, number of previous contacts, and average yearly balance. The per-unit association with subscription preference decreases in that order. Note that the units differ: average yearly balance is measured in euros, so its small per-unit coefficient does not by itself imply a small overall effect.

3.5 Assess the Model with Cumulative Accuracy Profile(CAP) Curve

## `geom_smooth()` using method = 'gam'

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

Interpretation of CAP Curve:

  • The blue curve is our model’s performance
  • The green line is a random model’s performance
  • Let X be the height of the horizontal line in the graphs, i.e. the percentage of subscribers reached when 50% of the clients are contacted
  • X < 60% : The model is rubbish
  • 60% < X < 70% : The model is poor
  • 70% < X < 80% : The model is good
  • 80% < X < 90% : The model is very good
  • 90% < X < 100% : The model is too good (possible overfitting)

X is 78% in the upper plot and 79% in the lower plot, indicating that our model is good. The vertical line marks 50% on the x-axis. Its intersection with the curve shows what percentage of all subscribing clients is reached when half of the clients are contacted in descending order of predicted probability. In the above case, contacting the half of the clients that our model ranks highest reaches 78% of those who will subscribe to the product.
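
The construction of a CAP curve can be sketched in Python as follows. The scores and labels are invented for illustration, and the function names `cap_curve` and `share_reached` are hypothetical; the sketch only shows the ranking logic and leaves out the smoothing used in the plots.

```python
def cap_curve(scores, labels):
    """Return CAP points: (share of clients contacted, share of subscribers reached)."""
    # Rank clients from the highest to the lowest predicted probability.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_subscribers = sum(labels)
    points = [(0.0, 0.0)]
    reached = 0
    for i, (_, label) in enumerate(ranked, start=1):
        reached += label
        points.append((i / len(ranked), reached / total_subscribers))
    return points


def share_reached(scores, labels, contacted_fraction=0.5):
    """Share of all subscribers found among the top-ranked fraction of clients."""
    k = int(contacted_fraction * len(scores))
    return cap_curve(scores, labels)[k][1]


# Hypothetical predicted probabilities and true outcomes (1 = subscribed).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(share_reached(scores, labels))  # 0.75
```

A random ranking would reach about 50% of subscribers at the same point, which is the gap between the model's curve and the random model's line.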

4.Insights & Conclusion

4.1 Insights from the Logistic Regression Model

4.1.1 Change the Coefficients in our Model into Odds to Facilitate Interpretation

##                         Estimate Std..Error   z.value Pr...z.. OddRatio
## (Intercept)             -0.90937    0.11097  -8.19467  0.00000    0.403
## AvgYearlyBalance         0.00002    0.00000   4.31809  0.00002    1.000
## NumOfContacts           -0.10887    0.00951 -11.44333  0.00000    0.897
## Previous                 0.02577    0.00968   2.66174  0.00777    1.026
## JobType.admin.           0.19142    0.05663   3.38038  0.00072    1.211
## JobType.retired          0.69992    0.06887  10.16262  0.00000    2.014
## JobType.student          0.59598    0.09464   6.29742  0.00000    1.815
## JobType.unemployed       0.28906    0.09326   3.09940  0.00194    1.335
## Marital.married         -0.30238    0.03670  -8.24021  0.00000    0.739
## Education.primary       -0.37632    0.06083  -6.18607  0.00000    0.686
## Education.secondary     -0.22922    0.03918  -5.85043  0.00000    0.795
## HousingLoan.no           0.57048    0.03804  14.99711  0.00000    1.769
## PersonalLoan.no          0.47664    0.05722   8.32998  0.00000    1.611
## ContactType.cellular     1.01013    0.05596  18.05172  0.00000    2.746
## ContactType.telephone    0.82504    0.08549   9.65024  0.00000    2.282
## PreviousOutcome.failure -2.29244    0.07990 -28.69294  0.00000    0.101
## PreviousOutcome.other   -2.00556    0.09677 -20.72563  0.00000    0.135
## HasPreviousContact.no   -2.35688    0.07319 -32.20307  0.00000    0.095

4.1.2 Interpret Odds Ratios

  • Holding everything else constant, increasing an independent variable by 1 unit multiplies the odds of subscription by its corresponding odds ratio.
  • For instance, each additional contact multiplies the odds by 0.897, i.e. decreases them by about 10%, so the probability of subscription decreases.
  • For instance, being a student (the dummy variable changes from 0 to 1) multiplies the odds by 1.815, and thus increases the probability of subscription.
  • The bank should be aware of the detracting variables (odds ratio below 1) when planning marketing strategies for different clients.
  • Variables with an odds ratio greater than 1 have a positive effect on subscription. They should be weighed carefully in the marketing campaign.
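
The conversion from coefficients to odds ratios, and its effect on a probability, can be checked with the short Python sketch below. The three coefficients are copied from the table in 4.1.1; the helper `apply_odds_ratio` and the 10% baseline probability are hypothetical and chosen only for illustration.

```python
import math

# Coefficients copied from the final model table.
coefficients = {
    "NumOfContacts": -0.10887,
    "JobType.student": 0.59598,
    "ContactType.cellular": 1.01013,
}

# An odds ratio is the exponential of the logistic regression coefficient.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))  # 0.897, 1.815 and 2.746, as in the table


def apply_odds_ratio(probability, odds_ratio):
    """Convert a probability to odds, scale the odds, and convert back."""
    odds = probability / (1 - probability)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


# Hypothetical baseline: a client with a 10% subscription probability.
print(round(apply_odds_ratio(0.10, odds_ratios["JobType.student"]), 3))
```

The example shows that an odds ratio of 1.815 does not multiply the probability itself by 1.815; the change in probability depends on the baseline.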

4.2 Insights from the CAP Curve

4.2.1 Segment Clients by Likelihood Score

  • The subscription probability of each client has been sorted in descending order.
  • The bank is able to segment the clients into different percentiles.
  • The bank is able to plan different marketing strategies to address different segments.

4.2.2 Budget Constraints

  • The CAP curve of our model is better than that of the random model.
  • The bank is able to decide which clients to target when the budget is limited. In our project, for example, when the budget can only support calling 50% of the clients, our model can reach 78% of those who will subscribe to a term deposit.

4.2.3 Efficiency

  • The bank can save costs by approaching only a certain percentage of the total clients, namely those who are more likely to subscribe to a term deposit.
  • In this project, by reaching 63% of the total clients, the bank has already approached 88% of the clients who subscribe.

5.Source of Reference

  1. UC Irvine Machine Learning Repository
  2. [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014