Exam 2: PREDICTING BANK TELEMARKETING SUCCESS

The success of marketing campaigns can be highly specific to the product, the target audience, and the campaign methods. In this problem, we examine data from direct marketing campaigns of a Portuguese banking institution between May 2008 and November 2010. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be or not subscribed.

In this analysis, the goal would be predicting the dependent variable y, which takes value 1 if the the client subscribed to a term deposit, and 0 otherwise. The data we will be using bank.csv is a subset of the original data, containing 5000 examples and 20 input variables. The variable information is as follows:

age
job - type of job
marital - marital status
education - Shows the level of education of each customer
default - Whether a customer has credit in default
housing - Does the customer have a housing loan?
loan - Does the customer have a personal loan?
contact - The contact communication type
month - Last contact month of year
day_of_week - Last contact day of Week
duration - Last contact duration in seconds (Note: this variable is not known before making the call)
campaign - Number of contact performed for the client during the campaign
pdays - number of days that passed by after the client was last contacted from a previous campaign (value of 999 means the client was not previously contacted)
previous - number of contacts performed before this campaign and for this client
poutcome - outcome of the previous marketing campaign
emp.var.rate - employment variation rate - quarterly indicator
cons.price.idx - consumer price index - monthly indicator
cons.conf.idx - consumer confidence index - monthly indicator
euribor3m - euribor 3 month rate - daily indicator
nr.employed - number of employees - quarterly indicator

Loading the data

Use the read.csv function to load the contents of bank.csv into a data frame called bank. What is the average age in the data set?

bank <- read.csv("bank.csv")
mean(bank$age)
## [1] 39.5814
summary(bank)
##       age            job              marital           education        
##  Min.   :17.00   Length:5000        Length:5000        Length:5000       
##  1st Qu.:32.00   Class :character   Class :character   Class :character  
##  Median :38.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :39.58                                                           
##  3rd Qu.:46.00                                                           
##  Max.   :92.00                                                           
##    default            housing              loan             contact         
##  Length:5000        Length:5000        Length:5000        Length:5000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     month           day_of_week           duration         campaign     
##  Length:5000        Length:5000        Min.   :   1.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 105.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 181.0   Median : 2.000  
##                                        Mean   : 265.2   Mean   : 2.538  
##                                        3rd Qu.: 329.0   3rd Qu.: 3.000  
##                                        Max.   :3284.0   Max.   :56.000  
##      pdays          previous        poutcome          emp.var.rate     
##  Min.   :  0.0   Min.   :0.0000   Length:5000        Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.0000   Class :character   1st Qu.:-1.80000  
##  Median :999.0   Median :0.0000   Mode  :character   Median : 1.10000  
##  Mean   :958.7   Mean   :0.1778                      Mean   : 0.05834  
##  3rd Qu.:999.0   3rd Qu.:0.0000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :6.0000                      Max.   : 1.40000  
##  cons.price.idx  cons.conf.idx      euribor3m      nr.employed  
##  Min.   :92.20   Min.   :-50.80   Min.   :0.634   Min.   :4964  
##  1st Qu.:93.08   1st Qu.:-42.70   1st Qu.:1.334   1st Qu.:5099  
##  Median :93.44   Median :-41.80   Median :4.857   Median :5191  
##  Mean   :93.57   Mean   :-40.54   Mean   :3.597   Mean   :5166  
##  3rd Qu.:93.99   3rd Qu.:-36.40   3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :94.77   Max.   :-26.90   Max.   :5.045   Max.   :5228  
##        y         
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1182  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

Call dureations by job

Build a boxplot that shows the call duration distributions over different jobs. Which three jobs have the longest average call durations? (if it’s hard to see from the boxplot, use tapply function.)

str(bank)
## 'data.frame':    5000 obs. of  21 variables:
##  $ age           : int  52 49 25 27 44 31 51 41 59 34 ...
##  $ job           : chr  "admin." "blue-collar" "blue-collar" "admin." ...
##  $ marital       : chr  "single" "divorced" "single" "single" ...
##  $ education     : chr  "university.degree" "high.school" "basic.9y" "university.degree" ...
##  $ default       : chr  "unknown" "no" "no" "no" ...
##  $ housing       : chr  "unknown" "no" "yes" "no" ...
##  $ loan          : chr  "unknown" "yes" "yes" "no" ...
##  $ contact       : chr  "cellular" "telephone" "cellular" "telephone" ...
##  $ month         : chr  "aug" "may" "jul" "oct" ...
##  $ day_of_week   : chr  "wed" "mon" "wed" "tue" ...
##  $ duration      : int  138 742 322 540 113 317 61 592 160 110 ...
##  $ campaign      : int  3 2 2 1 1 2 1 1 3 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.4 1.1 1.4 -0.1 -2.9 1.4 1.4 1.1 1.4 -0.1 ...
##  $ cons.price.idx: num  93.4 94 93.9 93.8 92.2 ...
##  $ cons.conf.idx : num  -36.1 -36.4 -42.7 -40.4 -31.4 -42.7 -42.7 -36.4 -36.1 -42 ...
##  $ euribor3m     : num  4.964 4.857 4.963 4.86 0.879 ...
##  $ nr.employed   : num  5228 5191 5228 5196 5076 ...
##  $ y             : int  0 0 0 1 0 0 0 0 0 0 ...
bank$job <- as.factor(bank$job)

boxplot(duration ~ job, data = bank)

tapply(bank$duration, bank$job, FUN = mean, na.rm = TRUE)
##        admin.   blue-collar  entrepreneur     housemaid    management 
##      256.5754      284.4000      279.7680      303.2542      261.0000 
##       retired self-employed      services       student    technician 
##      285.8770      287.3608      269.6022      255.7519      239.9052 
##    unemployed       unknown 
##      232.3053      248.7222

Problem of Multicolinearity

As good practice, it is always helpful to first check for multicolinearity before running models, especially since this dataset contains macroeconomic indicators. Examine the correlation between the following variables: emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, and nr.employed. Which of the following statements are correct (limited to just these selected variables)?

# select variables 
vars <- c("emp.var.rate", "cons.price.idx", 
            "cons.conf.idx", "euribor3m", "nr.employed")
bank_vars <- bank[vars]

cor(bank_vars)
##                emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## emp.var.rate      1.0000000      0.7808786    0.20449051 0.9732989  0.91412988
## cons.price.idx    0.7808786      1.0000000    0.07921100 0.7000698  0.54522242
## cons.conf.idx     0.2044905      0.0792110    1.00000000 0.2822054  0.09970857
## euribor3m         0.9732989      0.7000698    0.28220537 1.0000000  0.94798135
## nr.employed       0.9141299      0.5452224    0.09970857 0.9479813  1.00000000

Splitting into a training and testing set

Splitting into a Training and Testing Set 0.0/5.0 points (graded)

Obtain a random training/testing set split with:

set.seed(201)

library(caTools)

spl = sample.split(bank$y, 0.7)

Split months into a training data frame called “training” using the observations for which spl is TRUE and a testing data frame called “testing” using the observations for which spl is FALSE.

Why do we use the sample.split() function to split into a training and testing set?

set.seed(201)
library(caTools)

spl <-  sample.split(bank$y, 0.7)

bank_training <- subset(bank, spl == TRUE)
bank_testing <- subset(bank, spl == FALSE)

Training a logistic Regression Model

Train a logistic regression model using independent variables age, job, marital, education, default, housing, loan, contact, month, day_of_week, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, and cons.conf.idx, using the training set to obtain the model. Notice that we have removed duration (since it’s not available before the call, so shouldn’t be used in a strictly predictive model), euribor3m and nr.employed (due to multicolinearity issue).

Which of the following characteristics are statistically significantly POSITIVELY (at 0.05 level) associated with an increased chance of subscribing to the product?

logY <- glm(y ~ . -duration -euribor3m -nr.employed, data = bank_training, family = binomial)
summary(logY)
## 
## Call:
## glm(formula = y ~ . - duration - euribor3m - nr.employed, family = binomial, 
##     data = bank_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0070  -0.4412  -0.3237  -0.2409   2.7433  
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.142e+02  1.847e+01  -6.183 6.28e-10 ***
## age                           1.567e-02  7.227e-03   2.169 0.030118 *  
## jobblue-collar                1.579e-01  2.193e-01   0.720 0.471487    
## jobentrepreneur              -5.453e-01  4.039e-01  -1.350 0.176968    
## jobhousemaid                 -1.103e-01  4.798e-01  -0.230 0.818152    
## jobmanagement                -2.475e-02  2.469e-01  -0.100 0.920141    
## jobretired                   -3.763e-02  3.272e-01  -0.115 0.908424    
## jobself-employed              3.930e-01  3.177e-01   1.237 0.216089    
## jobservices                   2.249e-03  2.553e-01   0.009 0.992971    
## jobstudent                    4.922e-01  3.297e-01   1.493 0.135466    
## jobtechnician                -7.562e-02  2.188e-01  -0.346 0.729563    
## jobunemployed                -6.709e-01  4.378e-01  -1.533 0.125371    
## jobunknown                   -1.076e-01  7.446e-01  -0.145 0.885084    
## maritalmarried                3.128e-01  2.224e-01   1.406 0.159601    
## maritalsingle                 3.625e-01  2.457e-01   1.475 0.140111    
## maritalunknown                5.848e-01  1.205e+00   0.485 0.627556    
## educationbasic.6y             7.062e-03  3.202e-01   0.022 0.982405    
## educationbasic.9y             9.914e-03  2.530e-01   0.039 0.968744    
## educationhigh.school         -5.576e-02  2.588e-01  -0.215 0.829383    
## educationilliterate           1.458e+01  3.247e+02   0.045 0.964179    
## educationprofessional.course  5.903e-02  2.879e-01   0.205 0.837523    
## educationuniversity.degree    1.464e-01  2.589e-01   0.566 0.571647    
## educationunknown             -6.803e-01  3.950e-01  -1.722 0.085020 .  
## defaultunknown               -4.383e-01  1.973e-01  -2.221 0.026339 *  
## housingunknown                1.361e-01  3.763e-01   0.362 0.717616    
## housingyes                    4.583e-02  1.206e-01   0.380 0.703955    
## loanunknown                          NA         NA      NA       NA    
## loanyes                      -2.102e-01  1.740e-01  -1.208 0.226947    
## contacttelephone             -5.525e-01  1.928e-01  -2.866 0.004163 ** 
## monthaug                      8.172e-01  3.156e-01   2.590 0.009602 ** 
## monthdec                      7.238e-01  6.343e-01   1.141 0.253835    
## monthjul                      2.274e-01  2.784e-01   0.817 0.413883    
## monthjun                     -4.160e-01  2.733e-01  -1.522 0.128027    
## monthmar                      1.286e+00  3.396e-01   3.786 0.000153 ***
## monthmay                     -2.559e-01  2.138e-01  -1.197 0.231302    
## monthnov                     -4.158e-03  2.838e-01  -0.015 0.988310    
## monthoct                      3.807e-01  4.121e-01   0.924 0.355577    
## monthsep                      1.329e-01  4.356e-01   0.305 0.760293    
## day_of_weekmon               -4.157e-01  1.986e-01  -2.093 0.036335 *  
## day_of_weekthu               -1.294e-02  1.884e-01  -0.069 0.945251    
## day_of_weektue                1.172e-01  1.905e-01   0.615 0.538331    
## day_of_weekwed                1.773e-01  1.863e-01   0.952 0.341066    
## campaign                     -5.324e-02  3.203e-02  -1.662 0.096460 .  
## pdays                        -1.469e-03  8.815e-04  -1.667 0.095581 .  
## previous                     -6.776e-02  2.017e-01  -0.336 0.736952    
## poutcomenonexistent           8.804e-01  2.999e-01   2.936 0.003325 ** 
## poutcomesuccess               4.368e-01  8.555e-01   0.511 0.609598    
## emp.var.rate                 -8.078e-01  7.625e-02 -10.594  < 2e-16 ***
## cons.price.idx                1.196e+00  1.985e-01   6.024 1.70e-09 ***
## cons.conf.idx                -5.433e-04  1.549e-02  -0.035 0.972028    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2544.5  on 3499  degrees of freedom
## Residual deviance: 2031.3  on 3451  degrees of freedom
## AIC: 2129.3
## 
## Number of Fisher Scoring iterations: 11
# Interpreting monthmar
exp(1.286) - 1
## [1] 2.618284

Obtaining test set predictions

Using your logistic regression model, obtain predictions on the test set. Then, using a probability threshold of 0.5, create a confusion matrix for the test set.

We would like to compare the predictions obtained by the logistic regression model and those obtained by a naive baseline model. Remember that the naive baseline model we use in this class always predicts the most frequent outcome in the training set for all observations in the test set.

What is the number of test set observations where the prediction from the logistic regression model is different than the prediction from the baseline model?

pred_Y <- predict(logY, newdata = bank_testing, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# Confusion matrix
table(bank_testing$y, pred_Y > 0.5)
##    
##     FALSE TRUE
##   0  1273   50
##   1   133   44
table(pred_Y<0.5)
## 
## FALSE  TRUE 
##    94  1406
table(bank_testing$y)
## 
##    0    1 
## 1323  177

Computing test-set AUC

What is the test-set AUC of the logistic regression model?

library(ROCR)

ROCRPred <- prediction(pred_Y, bank_testing$y)
as.numeric(performance(ROCRPred, "auc")@y.values)
## [1] 0.7507334
# plotting purpose
perf1 <- performance(ROCRPred, "tpr", "fpr")
plot(perf1, colorize = TRUE)

Cross-validation to select parameters

Which of the following best describes how 10-fold cross-validation works when selecting between 4 different parameter values?

Ans: 40 models are trained on subsets of the training set and evaluated on a portion of the training set correct

Cross-Validation for a CART Model

Set the random seed to 201 (even though you have already done so earlier in the problem). Then use the caret package and the train function to perform 10-fold cross validation with the training data set to select the best cp value for a CART model that predicts the dependent variable y using the same set of independent variables as in the logistic regression (Problem 5). Select the cp value from a grid consisting of the 50 values 0.001, 0.002, …, 0.05.

What cp value maximizes the cross-validation accuracy?

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(201)

numFolds <- trainControl(method = "cv", number = 10)
cartGrid <- expand.grid( .cp = seq(0.001,0.05,0.001)) 

bank_training$y <- as.factor(bank_training$y)
str(bank_training)
## 'data.frame':    3500 obs. of  21 variables:
##  $ age           : int  52 49 25 27 44 31 51 41 59 34 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 1 2 2 1 1 2 2 2 6 10 ...
##  $ marital       : chr  "single" "divorced" "single" "single" ...
##  $ education     : chr  "university.degree" "high.school" "basic.9y" "university.degree" ...
##  $ default       : chr  "unknown" "no" "no" "no" ...
##  $ housing       : chr  "unknown" "no" "yes" "no" ...
##  $ loan          : chr  "unknown" "yes" "yes" "no" ...
##  $ contact       : chr  "cellular" "telephone" "cellular" "telephone" ...
##  $ month         : chr  "aug" "may" "jul" "oct" ...
##  $ day_of_week   : chr  "wed" "mon" "wed" "tue" ...
##  $ duration      : int  138 742 322 540 113 317 61 592 160 110 ...
##  $ campaign      : int  3 2 2 1 1 2 1 1 3 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.4 1.1 1.4 -0.1 -2.9 1.4 1.4 1.1 1.4 -0.1 ...
##  $ cons.price.idx: num  93.4 94 93.9 93.8 92.2 ...
##  $ cons.conf.idx : num  -36.1 -36.4 -42.7 -40.4 -31.4 -42.7 -42.7 -36.4 -36.1 -42 ...
##  $ euribor3m     : num  4.964 4.857 4.963 4.86 0.879 ...
##  $ nr.employed   : num  5228 5191 5228 5196 5076 ...
##  $ y             : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
train(y ~ . -duration -euribor3m -nr.employed, data = bank_training, method = "rpart", trControl = numFolds, tuneGrid = cartGrid)
## CART 
## 
## 3500 samples
##   20 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3151, 3150, 3149, 3151, 3150, 3151, ... 
## Resampling results across tuning parameters:
## 
##   cp     Accuracy   Kappa     
##   0.001  0.8762802  0.24740131
##   0.002  0.8791349  0.24593557
##   0.003  0.8808533  0.24421052
##   0.004  0.8799953  0.22828555
##   0.005  0.8794223  0.20012039
##   0.006  0.8814256  0.16375868
##   0.007  0.8834223  0.17061874
##   0.008  0.8857097  0.16884302
##   0.009  0.8845643  0.17107583
##   0.010  0.8851358  0.17388815
##   0.011  0.8857096  0.18898483
##   0.012  0.8825757  0.14872787
##   0.013  0.8822908  0.15070155
##   0.014  0.8822908  0.15070155
##   0.015  0.8802851  0.12170442
##   0.016  0.8802851  0.12170442
##   0.017  0.8791423  0.09996644
##   0.018  0.8791423  0.09996644
##   0.019  0.8791423  0.09996644
##   0.020  0.8791423  0.09996644
##   0.021  0.8791423  0.09996644
##   0.022  0.8791423  0.09996644
##   0.023  0.8791423  0.09996644
##   0.024  0.8791423  0.09996644
##   0.025  0.8791423  0.09996644
##   0.026  0.8791423  0.09996644
##   0.027  0.8791423  0.09996644
##   0.028  0.8791423  0.09996644
##   0.029  0.8785708  0.07595429
##   0.030  0.8785708  0.07595429
##   0.031  0.8785708  0.06329876
##   0.032  0.8785708  0.06329876
##   0.033  0.8788565  0.04332053
##   0.034  0.8788565  0.04332053
##   0.035  0.8788565  0.04332053
##   0.036  0.8788565  0.04332053
##   0.037  0.8788565  0.04332053
##   0.038  0.8788565  0.04332053
##   0.039  0.8788565  0.04332053
##   0.040  0.8788565  0.04332053
##   0.041  0.8788565  0.04332053
##   0.042  0.8788565  0.04332053
##   0.043  0.8788565  0.04332053
##   0.044  0.8788565  0.04332053
##   0.045  0.8788565  0.04332053
##   0.046  0.8788565  0.04332053
##   0.047  0.8788565  0.04332053
##   0.048  0.8788565  0.04332053
##   0.049  0.8788565  0.04332053
##   0.050  0.8788565  0.04332053
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.008.

Train CART Model

Build and plot the CART model trained with the parameter identified in Problem 13, again predicting the dependent variable using the same set of independent variables. What variable is used as the first (upper-most) split in the tree?

library(rpart)
library(rpart.plot)

bankTree <- rpart(y ~ . -duration -euribor3m -nr.employed, data = bank_training, method = "class", cp = 0.014)
prp(bankTree)

Test-set Accuracy for CART Model

Using the CART model you created in Problem 14, obtain predictions on the test set (using the parameter type=“class” with the predict function). Then, create a confusion matrix for the test set.

What is the accuracy of your CART model?

pred_bankTree <- predict(bankTree, newdata = bank_testing, type = "class")

# conf matrix
table(bank_testing$y, pred_bankTree)
##    pred_bankTree
##        0    1
##   0 1303   20
##   1  149   28
# accuracy
(1303+28)/nrow(bank_testing)
## [1] 0.8873333
## DONE PART 2 =)