The success of marketing campaigns can be highly specific to the product, the target audience, and the campaign methods. In this problem, we examine data from direct marketing campaigns of a Portuguese banking institution between May 2008 and November 2010. The marketing campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether or not the product (a bank term deposit) would be subscribed.
In this analysis, the goal is to predict the dependent variable y, which takes the value 1 if the client subscribed to a term deposit and 0 otherwise. The data we will be using, bank.csv, is a subset of the original data, containing 5000 examples and 20 input variables. The variable information is as follows:
age - age of the client
job - type of job
marital - marital status
education - level of education of the customer
default - whether the customer has credit in default
housing - whether the customer has a housing loan
loan - whether the customer has a personal loan
contact - The contact communication type
month - Last contact month of year
day_of_week - Last contact day of Week
duration - Last contact duration in seconds (Note: this variable is not known before making the call)
campaign - number of contacts performed for this client during this campaign
pdays - number of days that passed by after the client was last contacted from a previous campaign (value of 999 means the client was not previously contacted)
previous - number of contacts performed before this campaign and for this client
poutcome - outcome of the previous marketing campaign
emp.var.rate - employment variation rate - quarterly indicator
cons.price.idx - consumer price index - monthly indicator
cons.conf.idx - consumer confidence index - monthly indicator
euribor3m - euribor 3 month rate - daily indicator
nr.employed - number of employees - quarterly indicator
Use the read.csv function to load the contents of bank.csv into a data frame called bank. What is the average age in the data set?
bank <- read.csv("bank.csv")
mean(bank$age)
## [1] 39.5814
summary(bank)
## age job marital education
## Min. :17.00 Length:5000 Length:5000 Length:5000
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :39.58
## 3rd Qu.:46.00
## Max. :92.00
## default housing loan contact
## Length:5000 Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## month day_of_week duration campaign
## Length:5000 Length:5000 Min. : 1.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 105.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 181.0 Median : 2.000
## Mean : 265.2 Mean : 2.538
## 3rd Qu.: 329.0 3rd Qu.: 3.000
## Max. :3284.0 Max. :56.000
## pdays previous poutcome emp.var.rate
## Min. : 0.0 Min. :0.0000 Length:5000 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.0000 Class :character 1st Qu.:-1.80000
## Median :999.0 Median :0.0000 Mode :character Median : 1.10000
## Mean :958.7 Mean :0.1778 Mean : 0.05834
## 3rd Qu.:999.0 3rd Qu.:0.0000 3rd Qu.: 1.40000
## Max. :999.0 Max. :6.0000 Max. : 1.40000
## cons.price.idx cons.conf.idx euribor3m nr.employed
## Min. :92.20 Min. :-50.80 Min. :0.634 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.70 1st Qu.:1.334 1st Qu.:5099
## Median :93.44 Median :-41.80 Median :4.857 Median :5191
## Mean :93.57 Mean :-40.54 Mean :3.597 Mean :5166
## 3rd Qu.:93.99 3rd Qu.:-36.40 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.90 Max. :5.045 Max. :5228
## y
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1182
## 3rd Qu.:0.0000
## Max. :1.0000
Build a boxplot that shows the call duration distributions over the different jobs. Which three jobs have the longest average call durations? (If it is hard to tell from the boxplot, use the tapply function.)
str(bank)
## 'data.frame': 5000 obs. of 21 variables:
## $ age : int 52 49 25 27 44 31 51 41 59 34 ...
## $ job : chr "admin." "blue-collar" "blue-collar" "admin." ...
## $ marital : chr "single" "divorced" "single" "single" ...
## $ education : chr "university.degree" "high.school" "basic.9y" "university.degree" ...
## $ default : chr "unknown" "no" "no" "no" ...
## $ housing : chr "unknown" "no" "yes" "no" ...
## $ loan : chr "unknown" "yes" "yes" "no" ...
## $ contact : chr "cellular" "telephone" "cellular" "telephone" ...
## $ month : chr "aug" "may" "jul" "oct" ...
## $ day_of_week : chr "wed" "mon" "wed" "tue" ...
## $ duration : int 138 742 322 540 113 317 61 592 160 110 ...
## $ campaign : int 3 2 2 1 1 2 1 1 3 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.4 1.1 1.4 -0.1 -2.9 1.4 1.4 1.1 1.4 -0.1 ...
## $ cons.price.idx: num 93.4 94 93.9 93.8 92.2 ...
## $ cons.conf.idx : num -36.1 -36.4 -42.7 -40.4 -31.4 -42.7 -42.7 -36.4 -36.1 -42 ...
## $ euribor3m : num 4.964 4.857 4.963 4.86 0.879 ...
## $ nr.employed : num 5228 5191 5228 5196 5076 ...
## $ y : int 0 0 0 1 0 0 0 0 0 0 ...
bank$job <- as.factor(bank$job)
boxplot(duration ~ job, data = bank)
tapply(bank$duration, bank$job, FUN = mean, na.rm = TRUE)
## admin. blue-collar entrepreneur housemaid management
## 256.5754 284.4000 279.7680 303.2542 261.0000
## retired self-employed services student technician
## 285.8770 287.3608 269.6022 255.7519 239.9052
## unemployed unknown
## 232.3053 248.7222
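Reading off the printed means above, the three longest average call durations belong to housemaid, self-employed, and retired. Sorting the same tapply result (a small convenience step, not part of the original output) makes this easier to see:
# sort the per-job mean durations and keep the top three
sort(tapply(bank$duration, bank$job, mean, na.rm = TRUE), decreasing = TRUE)[1:3]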
As good practice, it is always helpful to check for multicollinearity before running models, especially since this dataset contains macroeconomic indicators. Examine the correlation between the following variables: emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, and nr.employed. Which of the following statements are correct (limited to just these selected variables)?
# select variables
vars <- c("emp.var.rate", "cons.price.idx",
"cons.conf.idx", "euribor3m", "nr.employed")
bank_vars <- bank[vars]
cor(bank_vars)
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## emp.var.rate 1.0000000 0.7808786 0.20449051 0.9732989 0.91412988
## cons.price.idx 0.7808786 1.0000000 0.07921100 0.7000698 0.54522242
## cons.conf.idx 0.2044905 0.0792110 1.00000000 0.2822054 0.09970857
## euribor3m 0.9732989 0.7000698 0.28220537 1.0000000 0.94798135
## nr.employed 0.9141299 0.5452224 0.09970857 0.9479813 1.00000000
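From the matrix above, three pairs have correlations above 0.9: emp.var.rate with euribor3m (0.97), emp.var.rate with nr.employed (0.91), and euribor3m with nr.employed (0.95). A sketch for flagging such pairs programmatically (cor_mat and high_cor are new names introduced here for illustration):
# flag pairs of indicators with correlation above 0.9
cor_mat <- cor(bank_vars)
high_cor <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high_cor[, 1]],
           var2 = colnames(cor_mat)[high_cor[, 2]],
           r    = cor_mat[high_cor])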
Splitting into a Training and Testing Set
Obtain a random training/testing set split with:
set.seed(201)
library(caTools)
spl = sample.split(bank$y, 0.7)
Split the data into a training data frame (called bank_training below) using the observations for which spl is TRUE and a testing data frame (called bank_testing) using the observations for which spl is FALSE.
Why do we use the sample.split() function to split into a training and testing set?
set.seed(201)
library(caTools)
spl <- sample.split(bank$y, 0.7)
bank_training <- subset(bank, spl == TRUE)
bank_testing <- subset(bank, spl == FALSE)
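The reason for using sample.split() rather than a plain random sample is that it stratifies on the outcome, so the subscription rate is preserved in both subsets. A quick sanity check (not part of the original output):
# the proportion of y = 1 should be roughly 11.8% in both subsets
prop.table(table(bank_training$y))
prop.table(table(bank_testing$y))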
Train a logistic regression model using the independent variables age, job, marital, education, default, housing, loan, contact, month, day_of_week, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, and cons.conf.idx, using the training set to obtain the model. Notice that we have removed duration (since it is not available before the call, it should not be used in a strictly predictive model), as well as euribor3m and nr.employed (due to the multicollinearity identified above).
Which of the following characteristics are statistically significantly POSITIVELY (at 0.05 level) associated with an increased chance of subscribing to the product?
logY <- glm(y ~ . -duration -euribor3m -nr.employed, data = bank_training, family = binomial)
summary(logY)
##
## Call:
## glm(formula = y ~ . - duration - euribor3m - nr.employed, family = binomial,
## data = bank_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0070 -0.4412 -0.3237 -0.2409 2.7433
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.142e+02 1.847e+01 -6.183 6.28e-10 ***
## age 1.567e-02 7.227e-03 2.169 0.030118 *
## jobblue-collar 1.579e-01 2.193e-01 0.720 0.471487
## jobentrepreneur -5.453e-01 4.039e-01 -1.350 0.176968
## jobhousemaid -1.103e-01 4.798e-01 -0.230 0.818152
## jobmanagement -2.475e-02 2.469e-01 -0.100 0.920141
## jobretired -3.763e-02 3.272e-01 -0.115 0.908424
## jobself-employed 3.930e-01 3.177e-01 1.237 0.216089
## jobservices 2.249e-03 2.553e-01 0.009 0.992971
## jobstudent 4.922e-01 3.297e-01 1.493 0.135466
## jobtechnician -7.562e-02 2.188e-01 -0.346 0.729563
## jobunemployed -6.709e-01 4.378e-01 -1.533 0.125371
## jobunknown -1.076e-01 7.446e-01 -0.145 0.885084
## maritalmarried 3.128e-01 2.224e-01 1.406 0.159601
## maritalsingle 3.625e-01 2.457e-01 1.475 0.140111
## maritalunknown 5.848e-01 1.205e+00 0.485 0.627556
## educationbasic.6y 7.062e-03 3.202e-01 0.022 0.982405
## educationbasic.9y 9.914e-03 2.530e-01 0.039 0.968744
## educationhigh.school -5.576e-02 2.588e-01 -0.215 0.829383
## educationilliterate 1.458e+01 3.247e+02 0.045 0.964179
## educationprofessional.course 5.903e-02 2.879e-01 0.205 0.837523
## educationuniversity.degree 1.464e-01 2.589e-01 0.566 0.571647
## educationunknown -6.803e-01 3.950e-01 -1.722 0.085020 .
## defaultunknown -4.383e-01 1.973e-01 -2.221 0.026339 *
## housingunknown 1.361e-01 3.763e-01 0.362 0.717616
## housingyes 4.583e-02 1.206e-01 0.380 0.703955
## loanunknown NA NA NA NA
## loanyes -2.102e-01 1.740e-01 -1.208 0.226947
## contacttelephone -5.525e-01 1.928e-01 -2.866 0.004163 **
## monthaug 8.172e-01 3.156e-01 2.590 0.009602 **
## monthdec 7.238e-01 6.343e-01 1.141 0.253835
## monthjul 2.274e-01 2.784e-01 0.817 0.413883
## monthjun -4.160e-01 2.733e-01 -1.522 0.128027
## monthmar 1.286e+00 3.396e-01 3.786 0.000153 ***
## monthmay -2.559e-01 2.138e-01 -1.197 0.231302
## monthnov -4.158e-03 2.838e-01 -0.015 0.988310
## monthoct 3.807e-01 4.121e-01 0.924 0.355577
## monthsep 1.329e-01 4.356e-01 0.305 0.760293
## day_of_weekmon -4.157e-01 1.986e-01 -2.093 0.036335 *
## day_of_weekthu -1.294e-02 1.884e-01 -0.069 0.945251
## day_of_weektue 1.172e-01 1.905e-01 0.615 0.538331
## day_of_weekwed 1.773e-01 1.863e-01 0.952 0.341066
## campaign -5.324e-02 3.203e-02 -1.662 0.096460 .
## pdays -1.469e-03 8.815e-04 -1.667 0.095581 .
## previous -6.776e-02 2.017e-01 -0.336 0.736952
## poutcomenonexistent 8.804e-01 2.999e-01 2.936 0.003325 **
## poutcomesuccess 4.368e-01 8.555e-01 0.511 0.609598
## emp.var.rate -8.078e-01 7.625e-02 -10.594 < 2e-16 ***
## cons.price.idx 1.196e+00 1.985e-01 6.024 1.70e-09 ***
## cons.conf.idx -5.433e-04 1.549e-02 -0.035 0.972028
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2544.5 on 3499 degrees of freedom
## Residual deviance: 2031.3 on 3451 degrees of freedom
## AIC: 2129.3
##
## Number of Fisher Scoring iterations: 11
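Rather than scanning the table by eye, the positively and significantly associated characteristics can be filtered out of the coefficient matrix; per the summary above this picks up age, monthaug, monthmar, poutcomenonexistent, and cons.price.idx (coefs is a new name introduced here for illustration):
# coefficients that are positive and significant at the 0.05 level
coefs <- summary(logY)$coefficients
coefs[coefs[, "Estimate"] > 0 & coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]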
# Interpreting the monthmar coefficient: exp(coef) - 1 gives the proportional
# increase in the odds of subscribing for a March contact, relative to the
# reference month (April)
exp(1.286) - 1
## [1] 2.618284
Using your logistic regression model, obtain predictions on the test set. Then, using a probability threshold of 0.5, create a confusion matrix for the test set.
We would like to compare the predictions obtained by the logistic regression model and those obtained by a naive baseline model. Remember that the naive baseline model we use in this class always predicts the most frequent outcome in the training set for all observations in the test set.
What is the number of test set observations where the prediction from the logistic regression model is different than the prediction from the baseline model?
pred_Y <- predict(logY, newdata = bank_testing, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# Confusion matrix (the rank-deficiency warning above reflects the aliased
# loanunknown coefficient reported as NA in the model summary)
table(bank_testing$y, pred_Y > 0.5)
##
## FALSE TRUE
## 0 1273 50
## 1 133 44
# The baseline model predicts 0 (the most frequent training-set outcome) for
# every test observation, so the logistic model disagrees wherever the
# predicted probability is 0.5 or above
table(pred_Y < 0.5)
##
## FALSE TRUE
## 94 1406
table(bank_testing$y)
##
## 0 1
## 1323 177
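For reference, the test-set accuracies implied by the tables above (simple hand calculations, in the same style as the accuracy computation used later for the CART model):
# logistic regression accuracy at the 0.5 threshold
(1273 + 44) / nrow(bank_testing)   # = 0.878
# naive baseline (always predict 0, the most frequent training outcome)
1323 / nrow(bank_testing)          # = 0.882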
What is the test-set AUC of the logistic regression model?
library(ROCR)
ROCRPred <- prediction(pred_Y, bank_testing$y)
as.numeric(performance(ROCRPred, "auc")@y.values)
## [1] 0.7507334
# plotting purpose
perf1 <- performance(ROCRPred, "tpr", "fpr")
plot(perf1, colorize = TRUE)
Which of the following best describes how 10-fold cross-validation works when selecting between 4 different parameter values?
Ans: 40 models are trained on subsets of the training set and evaluated on a portion of the training set.
Set the random seed to 201 (even though you have already done so earlier in the problem). Then use the caret package and the train function to perform 10-fold cross-validation with the training data set to select the best cp value for a CART model that predicts the dependent variable y using the same set of independent variables as in the logistic regression (Problem 5). Select the cp value from a grid consisting of the 50 values 0.001, 0.002, …, 0.05.
What cp value maximizes the cross-validation accuracy?
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(201)
numFolds <- trainControl(method = "cv", number = 10)
cartGrid <- expand.grid(.cp = seq(0.001, 0.05, by = 0.001))  # 50 candidate cp values
bank_training$y <- as.factor(bank_training$y)
str(bank_training)
## 'data.frame': 3500 obs. of 21 variables:
## $ age : int 52 49 25 27 44 31 51 41 59 34 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 1 2 2 1 1 2 2 2 6 10 ...
## $ marital : chr "single" "divorced" "single" "single" ...
## $ education : chr "university.degree" "high.school" "basic.9y" "university.degree" ...
## $ default : chr "unknown" "no" "no" "no" ...
## $ housing : chr "unknown" "no" "yes" "no" ...
## $ loan : chr "unknown" "yes" "yes" "no" ...
## $ contact : chr "cellular" "telephone" "cellular" "telephone" ...
## $ month : chr "aug" "may" "jul" "oct" ...
## $ day_of_week : chr "wed" "mon" "wed" "tue" ...
## $ duration : int 138 742 322 540 113 317 61 592 160 110 ...
## $ campaign : int 3 2 2 1 1 2 1 1 3 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.4 1.1 1.4 -0.1 -2.9 1.4 1.4 1.1 1.4 -0.1 ...
## $ cons.price.idx: num 93.4 94 93.9 93.8 92.2 ...
## $ cons.conf.idx : num -36.1 -36.4 -42.7 -40.4 -31.4 -42.7 -42.7 -36.4 -36.1 -42 ...
## $ euribor3m : num 4.964 4.857 4.963 4.86 0.879 ...
## $ nr.employed : num 5228 5191 5228 5196 5076 ...
## $ y : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
train(y ~ . -duration -euribor3m -nr.employed, data = bank_training, method = "rpart", trControl = numFolds, tuneGrid = cartGrid)
## CART
##
## 3500 samples
## 20 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3151, 3150, 3149, 3151, 3150, 3151, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.8762802 0.24740131
## 0.002 0.8791349 0.24593557
## 0.003 0.8808533 0.24421052
## 0.004 0.8799953 0.22828555
## 0.005 0.8794223 0.20012039
## 0.006 0.8814256 0.16375868
## 0.007 0.8834223 0.17061874
## 0.008 0.8857097 0.16884302
## 0.009 0.8845643 0.17107583
## 0.010 0.8851358 0.17388815
## 0.011 0.8857096 0.18898483
## 0.012 0.8825757 0.14872787
## 0.013 0.8822908 0.15070155
## 0.014 0.8822908 0.15070155
## 0.015 0.8802851 0.12170442
## 0.016 0.8802851 0.12170442
## 0.017 0.8791423 0.09996644
## 0.018 0.8791423 0.09996644
## 0.019 0.8791423 0.09996644
## 0.020 0.8791423 0.09996644
## 0.021 0.8791423 0.09996644
## 0.022 0.8791423 0.09996644
## 0.023 0.8791423 0.09996644
## 0.024 0.8791423 0.09996644
## 0.025 0.8791423 0.09996644
## 0.026 0.8791423 0.09996644
## 0.027 0.8791423 0.09996644
## 0.028 0.8791423 0.09996644
## 0.029 0.8785708 0.07595429
## 0.030 0.8785708 0.07595429
## 0.031 0.8785708 0.06329876
## 0.032 0.8785708 0.06329876
## 0.033 0.8788565 0.04332053
## 0.034 0.8788565 0.04332053
## 0.035 0.8788565 0.04332053
## 0.036 0.8788565 0.04332053
## 0.037 0.8788565 0.04332053
## 0.038 0.8788565 0.04332053
## 0.039 0.8788565 0.04332053
## 0.040 0.8788565 0.04332053
## 0.041 0.8788565 0.04332053
## 0.042 0.8788565 0.04332053
## 0.043 0.8788565 0.04332053
## 0.044 0.8788565 0.04332053
## 0.045 0.8788565 0.04332053
## 0.046 0.8788565 0.04332053
## 0.047 0.8788565 0.04332053
## 0.048 0.8788565 0.04332053
## 0.049 0.8788565 0.04332053
## 0.050 0.8788565 0.04332053
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.008.
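If the train() call is stored in an object (cartCV is a hypothetical name introduced here), the selected cp can be read off programmatically; resetting the seed first should reproduce the folds, and hence the selection, shown above:
set.seed(201)  # reset the seed so the CV folds match the run above
cartCV <- train(y ~ . -duration -euribor3m -nr.employed, data = bank_training,
                method = "rpart", trControl = numFolds, tuneGrid = cartGrid)
cartCV$bestTune  # should report cp = 0.008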
Build and plot the CART model trained with the parameter identified in Problem 13, again predicting the dependent variable using the same set of independent variables. What variable is used as the first (upper-most) split in the tree?
library(rpart)
library(rpart.plot)
# cp = 0.008, as selected by the cross-validation in the previous step
bankTree <- rpart(y ~ . -duration -euribor3m -nr.employed, data = bank_training, method = "class", cp = 0.008)
prp(bankTree)
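If the plotted tree is hard to read, the variable used at the first (upper-most) split can also be pulled directly from the fitted rpart object:
# the first row of the frame is the root node; var holds its splitting variable
as.character(bankTree$frame$var[1])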
Using the CART model you created in Problem 14, obtain predictions on the test set (using the parameter type=“class” with the predict function). Then, create a confusion matrix for the test set.
What is the accuracy of your CART model?
pred_bankTree <- predict(bankTree, newdata = bank_testing, type = "class")
# conf matrix
table(bank_testing$y, pred_bankTree)
## pred_bankTree
## 0 1
## 0 1303 20
## 1 149 28
# accuracy
(1303+28)/nrow(bank_testing)
## [1] 0.8873333
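Beyond overall accuracy, the same confusion matrix gives the model's sensitivity and specificity (conf_cart is a new name introduced here; computed from the table object rather than hard-coded counts):
conf_cart <- table(bank_testing$y, pred_bankTree)
conf_cart[2, 2] / sum(conf_cart[2, ])  # sensitivity (true positive rate)
conf_cart[1, 1] / sum(conf_cart[1, ])  # specificity (true negative rate)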
## DONE PART 2 =)