Group Members : Utsav Raj, Rohit Menon, Satarupa Saha, Mounish Sunkara
Description:
Customer conversion in a direct marketing campaign is one of the most important metrics of success when evaluating the campaign itself. In a world of limited resources, it’s often difficult to make the best use of a marketer’s time, we can improve on this by predicting whether a prospective customer will respond to a marketing campaign. In order to do this, we will:
• Understand which of the observed variables are most associated with the chance of subscribing to a term deposit.
• How the important variables relate to the predicted probability that a client will subscribe to a term deposit
• Build a model that could be useful in determining which future clients are likely to respond a term deposit, if contacted.
We have 9 duplicate records. Deleted all the duplicates.
Plotting the Graphs
Code
par(mfrow =c(3,2))hist(bank_trn_data_fnl$age, main ="Distribution of Age", xlab ="Age")hist(bank_trn_data_fnl$duration, main ="Distribution of Duration", xlab ="Duration")barplot(table(bank_trn_data_fnl$default), main ="Frequency of Default Status")barplot(table(bank_trn_data_fnl$marital), main ="Marital Status")barplot(table(bank_trn_data_fnl$loan), main ="Frequency by Loan Status")barplot(table(bank_trn_data_fnl$contact), main ="Distribution by Contact Status")
Call duration variable seems to be right skewed which suggests most of the calls where less than 10 secs.
Around 79% people have not defaulted on loan payment.
50% of the people are married and don’t have any loan.
Code
par(mfrow =c(2,2))barplot(table(bank_trn_data_fnl$pdays), main ="Distribution by PDays")barplot(table(bank_trn_data_fnl$previous), main ="Distribution by Previous")#bank_trn_data_fnl %>% group_by(pdays) %>% count()hist(bank_trn_data_fnl$nr_employed, main ="Distribution of Number Employed", xlab ="Number of Employees")barplot(table(bank_trn_data_fnl$subscribed), main ="Frequency by Subscribed Status")
Majority of the records have PDays as 999 which means they were contacted a long time back.
Pdays and previous are correlated.
No of employees does not have a clear distribution
88% of the data set have subscribed
Checking correlation on all numeric variables
Code
# Look at correlations between numeric featuresnum <-sapply(bank_trn_data_fnl, FUN = is.numeric) corx <-cor(bank_trn_data_fnl[, num], use="pairwise", method="spearman") # Visualize correlationscorrplot(corx, method ="color", order ="FPC", type ="lower",addCoef.col ="black",number.cex =0.65, diag =TRUE)
High Correlation between emp_var_rate with nr_emolpyed and euribor3m and b/w euribor3m and nr_emolpyed.
Backward model is same as the model we got with both ways process and it differs from our initial model.
2 Way Interaction
As discussed we are not going ahead with the 2 way interaction.
Model Accuracy
Code
bank_copy<- bank_trn_data_fnlbank_copy$pred_null<-as.factor(ifelse(predict(model_null, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))bank_copy$pred_init<-as.factor(ifelse(predict(model_var, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))bank_copy$pred_step_AIC_both <-as.factor(ifelse(predict(step, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))bank_copy$pred_step_AIC_fwd <-as.factor(ifelse(predict(step_for, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))bank_copy$pred_step_AIC_bck <-as.factor(ifelse(predict(step_back, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))df <-data.frame(matrix(ncol =4, nrow =1))colnames(df) <-c('pred_init', 'pred_step_AIC_both', 'pred_step_AIC_fwd','pred_step_AIC_bck')#Accuracy of initial model df$pred_init<-sum(bank_copy$subscribed==bank_copy$pred_init)/nrow(bank_copy)#Accuracy of AIC both way model df$pred_step_AIC_both<-sum(bank_copy$subscribed==bank_copy$pred_step_AIC_both)/nrow(bank_copy)#Accuracy of AIC forward model df$pred_step_AIC_fwd<-sum(bank_copy$subscribed==bank_copy$pred_step_AIC_fwd)/nrow(bank_copy)#Accuracy of AIC Backward model df$pred_step_AIC_bck<-sum(bank_copy$subscribed==bank_copy$pred_step_AIC_bck)/nrow(bank_copy)knitr::kable(df, align ="lccrr", caption ="Accuracy of all the models")
Accuracy of all the models
pred_init
pred_step_AIC_both
pred_step_AIC_fwd
pred_step_AIC_bck
0.9042929
0.9112544
0.9114973
0.9112544
Based on accuracy percentage all AIC models are performing better than the initial model.
4.Model Selction
Based on the accuracy and AIC value the backward and both ways model are best but we need to do some more analysis to lessen the number of predictors. We will use LOCO method in the next step.
5.Model Predictor Selection
LOCO of Both and backward direction model
Code
xnames <-c("duration", "month","poutcome", "emp_var_rate", "cons_price_idx", "contact", "euribor3m", "job", "default", "pdays","day_of_week","campaign", "cons_conf_idx") # predictors onlyvi.scores <-numeric(length(xnames))names(vi.scores) <- xnames(baseline <-deviance(step)) # smaller is better; could also use AIC, Brier score, etc.
[1] 15400.73
Code
for (xname in xnames) {# col_list <- c(attr(step$terms, "term.labels"), "sudata.copy <- bank_trn_data_fnl %>% dplyr::select(c(attr(step$terms, "term.labels"), "subscribed"))data.copy[[xname]] <-NULLfit.new <-glm(subscribed ~ ., data = data.copy, family =binomial(link ="logit"))vi.scores[xname] <-deviance(fit.new) - baseline # measure drop in performance}sort(vi.scores, decreasing =TRUE)
Based on the LOCO test we can take below 3 predictors.
Duration
Month
Cons_price_idx
6.Leakage.
Duration may be leakage variable, since if we know the duration of the last call, it is highly likely that we know the outcome whether the customer subscribed or not.
7.Deployment
Taking the output of both direction step AIC but removing duration (As it is a potential leakage variable) from predictor as the final model.
bank_new_copy<-bank_new_dataprob <-predict(final_model,type ="response", newdata = bank_new_data)bank_new_copy$prob<- proby <-na.omit(bank_new_data)$subscribed # Function to compute lift and cumulative gain chartslift <-function(prob, y, pos.class =NULL, cumulative =TRUE) {if (!all(sort(unique(y)) ==c(0, 1))) {if (is.null(pos.class)) {stop("A value for `pos.class` is required whenever `y` is not a 0/1 ","outcome.", call. =FALSE) } y <-ifelse(y == pos.class, 1, 0) } ord <-order(prob, decreasing =TRUE) prob <- prob[ord] y <- y[ord] prop <-seq_along(y) /length(y) lift <-if (isTRUE(cumulative)) {cumsum(y) } else { (cumsum(y) /seq_along(y)) /mean(y) }structure(list("lift"= lift, "prop"= prop, "cumulative"= cumulative,"y"= y), class ="lift")}# Cumulative gain chart; what does this plot tell us?l <-lift(prob, y = y, pos.class ="yes")plot(l[["prop"]], l[["lift"]], type ="l", xlab ="Proportion of sample", ylab ="Cumulative lift", las =1, lwd =2, col =2)abline(0, sum(y =="yes"), lty =2)
There are 464 clients who subscribed in the bank_new data. The calibration chart looks well calibrated, as it shows at first top 20% of the sample, its almost captured the subscribed clients.
Out of 500 targeted 233 people actually subscribed which is around 47% of target audience.
Data Dictionry
Client attributes
• age (numeric) • job : type of job • marital : marital status • education • default: has credit in default? • loan: has personal loan?
Attributes related with the last contact of the current campaign
• contact: contact communication type • month: last contact month of year • day_of_week: last contact day of the week • duration: last contact duration, in seconds (numeric).
• campaign: number of contacts performed during this campaign and for this client • pdays: number of days that passed by after the client was last contacted from a previous campaign • previous: number of contacts performed before this campaign and for this client (numeric) • poutcome: outcome of the previous marketing campaign
Social and economic context attributes
• emp.var.rate: employment variation rate - quarterly indicator (numeric) • cons.price.idx: consumer price index - monthly indicator (numeric) • cons.conf.idx: consumer confidence index - monthly indicator (numeric) • euribor3m: euribor 3 month rate - daily indicator (numeric) • nr.employed: number of employees - quarterly indicator (numeric)