A Portuguese banking institution has provided data related to its direct marketing campaigns. The campaigns were based on phone calls to customers in order to offer term deposit subscriptions. In this project, we will build a model that predicts which clients are more likely to subscribe to a term deposit, and we will identify the main factors that affect the clients’ decision.
Build a classification model to predict whether the client will subscribe to a term deposit
Identify the main factors that affect the customers’ decision
The data is available on UCI Machine Learning repository (https://archive.ics.uci.edu/). The data set contains 4119 clients and 20 attributes. The outcome is whether the client subscribed to a term deposit. 89% of the clients did not subscribe to a term deposit in the data set.
Because the data was collected through phone call interviews, many clients refused to provide some information. These values are not missing at random, so the missingness should be modeled rather than ignored. The missing values are marked as “unknown” in the data set. From Figure 1 we can see that 75% of the clients provided a full response to the questions. There are 19.5% missing values in default, 4% missing values in education, and so on. We can also look at the histogram, which clearly depicts how the missing values are spread across the attributes.
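As a rough illustration of how the shares of missing values quoted above can be tallied, here is a minimal sketch in plain Python. The helper names `unknown_share` and `complete_share` and the toy rows are assumptions made up for this example. Only the “unknown” marker and the column names come from the data set description.

```python
from collections import Counter

def unknown_share(rows, marker="unknown"):
    """Return {column: fraction of rows whose value equals the marker}."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if value == marker:
                counts[col] += 1
    n = len(rows)
    columns = rows[0].keys() if rows else []
    return {col: counts[col] / n for col in columns}

def complete_share(rows, marker="unknown"):
    """Fraction of rows that contain no marker value in any column."""
    complete = sum(1 for row in rows if marker not in row.values())
    return complete / len(rows)

# Toy rows invented for illustration only (not taken from the data set).
toy_rows = [
    {"job": "admin.", "education": "unknown", "default": "no"},
    {"job": "technician", "education": "basic.9y", "default": "unknown"},
    {"job": "services", "education": "high.school", "default": "no"},
    {"job": "retired", "education": "basic.4y", "default": "unknown"},
]

shares = unknown_share(toy_rows)   # per-attribute share of "unknown"
full = complete_share(toy_rows)    # share of fully answered rows
print(shares, full)
```

On the real data, the same two numbers correspond to the per-attribute percentages and the share of full responses read off Figure 1.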
There are a number of different methods to handle missing data. Since the attributes with missing values are categorical with more than two levels, we will use Multivariate Imputation by Chained Equations (MICE) with polytomous regression. In Figure 2 we compare the distribution of the original data with that of the imputed data. The bar plots are quite similar, which suggests that the MICE imputation preserves the distribution of each attribute.
The figure below displays histograms and bar plots of numeric attributes. The side-by-side boxplots display the attributes by the outcome (Yes/No subscription to a term deposit).
Summary key points
In order to understand which attributes are related to the outcome, we conducted chi-square tests of independence for the categorical attributes and Mann-Whitney test for the numeric attributes. The results are shown in the following table.
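To make the two tests more concrete, the sketch below computes the Pearson chi-square statistic for a contingency table and the Mann-Whitney U statistic for two samples, using only the standard library. The function names and the toy numbers are assumptions for illustration. P-values are deliberately left out, since in practice a statistics library (for example `scipy.stats.chi2_contingency` and `scipy.stats.mannwhitneyu`) would supply them.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts, e.g. one row per
    category of an attribute and one column per outcome (no / yes).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties count 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Toy inputs invented for illustration only.
toy_table = [[30, 10], [20, 40]]          # rows: two categories, cols: no / yes
stat, dof = chi_square_statistic(toy_table)
u = mann_whitney_u([1, 3, 5], [2, 3, 4])  # e.g. a numeric attribute split by outcome
print(stat, dof, u)
```

A large chi-square statistic relative to its degrees of freedom, or a U far from half the number of pairs, is what points to an association with the outcome.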
Based on the exploratory data analysis and the inferential analysis, we identified the attributes more likely to affect the client’s subscription. These attributes will be called “predictors” and they are: job, marital, education, contact, month, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed, duration_min.
Our problem involves predicting whether the client will subscribe to a term deposit or not. This is known as a classification problem with binary response (Yes/No).
We built six different algorithms to predict the term deposit subscription:
Based on the graphical results (Figure 3), we know that some of the attributes have a skewed distribution, others an exponential one, and some of them multimodal distributions. So we executed a power transform to correct each attribute.
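The write-up does not say which power transform was applied, so the following is only a sketch of one common choice, the Yeo-Johnson transform, which accepts zero and negative inputs. The exponent `lam` is fixed by hand here as an assumption. In practice it is usually estimated per attribute, for example by maximum likelihood.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of a single value for a given lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)                      # lambda == 0 branch
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)                        # lambda == 2 branch
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)

# Toy right-skewed values invented for illustration only.
skewed = [0.0, 1.0, 2.0, 4.0, 50.0]
transformed = [yeo_johnson(v, 0.0) for v in skewed]   # lambda = 0 acts like log(1 + y)
print(transformed)
```

With `lam = 1.0` the transform leaves the values unchanged, while smaller exponents pull in a long right tail, which is the effect wanted for the skewed attributes.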
We also used resampling methods to split the data set into training and testing sets. To assess the prediction capability of each algorithm on unseen data, we computed the test accuracy and the area under the Receiver Operating Characteristic (ROC) curve.
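For readers who want to see what these two measurements compute, here is a minimal sketch in plain Python. The function names, the 0.5 threshold, and the toy labels and scores are assumptions for illustration. The AUC is computed through its rank interpretation: the probability that a randomly chosen “yes” client receives a higher score than a randomly chosen “no” client.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores, positive="yes"):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy test-set labels and predicted "yes" probabilities, invented for illustration.
y_true = ["no", "no", "yes", "no", "yes", "no"]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
y_pred = ["yes" if s >= 0.5 else "no" for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(acc, auc)
```

Because about 89% of the clients in the data set did not subscribe, a model that always answers “no” would already reach roughly 89% accuracy, which is why the AUC is a useful complement to test accuracy here.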
Table 2 below lists the performance measurements for the six algorithms developed.
Random forest ranks first in Area Under the Curve (AUC) and has a test accuracy similar to that of GLM, so we select it as the best model. Figure 4 shows the ROC curve and the trade-off between specificity and sensitivity for each algorithm. We can see that random forest has the largest area under the curve.
In Figure 5 the green curve is the error rate for “yes” subscriptions, the red curve is the error rate for “no” subscriptions, while the black curve is the Out-of-Bag error rate (OOB error). The OOB error falls below 9%.
The model predicts “no” subscriptions better than “yes” subscriptions. This is because far more clients in the data set did not subscribe to a term deposit. The model is highly sensitive (96.59%), which means a high true positive rate.
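Sensitivity depends on which class is treated as “positive”, and the write-up does not state this explicitly. The sketch below, with a hypothetical helper and invented labels, shows how the same predictions give different sensitivity and specificity when the positive class is swapped. One possible reading of the figure above is that “no” was the positive class, but that is an assumption.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity) for the chosen positive class."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Invented labels: 8 true "no" (7 predicted correctly), 2 true "yes" (1 correct).
y_true = ["no"] * 8 + ["yes"] * 2
y_pred = ["no"] * 7 + ["yes"] + ["yes", "no"]

sens_no, spec_no = sensitivity_specificity(y_true, y_pred, positive="no")
sens_yes, spec_yes = sensitivity_specificity(y_true, y_pred, positive="yes")
print(sens_no, spec_no, sens_yes, spec_yes)
```

Swapping the positive class simply exchanges the two numbers, so a high sensitivity for the majority class can coexist with a much weaker detection rate for the minority class.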
Now that we have built the model with the strongest predictive ability, we will identify which attributes most affect the clients’ decision and, in other words, the success of the marketing campaign.
Figure 6 below shows the relative variable importance in the term deposit subscription by plotting the mean decrease in Gini Index, calculated across all trees.
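To give a feel for the quantity behind this plot, the sketch below computes the Gini impurity of a node and the decrease obtained by one split. The function names and the toy labels are assumptions for illustration. A random forest importance score of this kind is typically built by accumulating such decreases over every split that uses a given variable and averaging over the trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Toy node invented for illustration: 6 "no" and 4 "yes", split into two children.
parent = ["no"] * 6 + ["yes"] * 4
left = ["no"] * 5 + ["yes"] * 1
right = ["no"] * 1 + ["yes"] * 3

decrease = gini_decrease(parent, left, right)
print(gini(parent), decrease)
```

A split that separates “yes” from “no” clients well produces a large decrease, so variables that are often used for such splits end up at the top of the ranking.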
We can see that the duration of the last contact has the highest relative importance, followed by the Euribor 3 month rate, job, and number of employees. To analyse the effect of these predictors on the outcome, we will examine their coefficients in the logistic regression model that we fitted earlier.
Table 3 was created according to the signs of predictor coefficients in logistic regression.
Duration is statistically significant, suggesting a strong association between the duration of the last contact and the chance of subscribing to a term deposit.
The positive coefficient of Euribor3m suggests that, all other variables being equal, the Euribor 3 month rate has a positive effect on clients’ subscription.
The negative coefficients of job categories such as blue-collar, entrepreneur, housemaid, and so on suggest that clients in these jobs are less likely to subscribe to a term deposit.
The number of employees has a positive effect on clients’ subscription.
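The sign of a logistic regression coefficient can be turned into a more tangible quantity, the odds ratio, by exponentiating it. The sketch below illustrates this with made-up coefficient values. None of the numbers are the fitted coefficients from this project, and the helper names are assumptions.

```python
import math

def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds when a predictor increases by `delta`."""
    return math.exp(beta * delta)

def predicted_probability(intercept, coefs, x):
    """Logistic model: probability of subscription for one client."""
    z = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, invented for illustration only.
coefs = {"duration_min": 0.25, "job_blue_collar": -0.40}

or_duration = odds_ratio(coefs["duration_min"])   # above 1: odds rise with longer calls
or_blue = odds_ratio(coefs["job_blue_collar"])    # below 1: odds fall for this category

client = {"duration_min": 8.0, "job_blue_collar": 0.0}
p = predicted_probability(-3.0, coefs, client)
print(or_duration, or_blue, p)
```

A positive coefficient gives an odds ratio above 1 and a negative one gives an odds ratio below 1, which is the reading applied to the signs in Table 3.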
In this project we aimed to build a model to predict whether the client will subscribe to a term deposit and identify the main attributes that affect the subscription.
In order to identify the predictors, we conducted several hypothesis tests and ended up with 15 predictors.
We developed a set of linear, nonlinear, bagging and boosting algorithms. Then, we assessed them using test accuracy and AUC. Random forest was the most powerful classifier, with 91.58% accuracy and 93.29% AUC.
We ranked the relative variable importance according to the mean decrease in Gini Index. This ranking identified the duration of the last contact as the most important factor, the Euribor 3 month rate as the second, job as the third, and number of employees as the fourth.
In summary, firstly, it is more likely that clients would subscribe to a term deposit if the conversation on the phone is long (at least 7.63 minutes).
Secondly, the higher the Euribor 3 month rate, the higher the chance of subscription. In general, when the Euribor interest rates rise or fall there is a high likelihood that the interest rates on banking products such as mortgages, savings accounts and loans will also be adjusted.
Thirdly, technicians, retirees and clients working in services are more likely to subscribe to a term deposit.
Finally, the number of employees in the bank positively affects the subscription. Usually, the size of the bank is associated with financial strengths of the institution.