Julian Hatwell
Aug 2016
This short case study, using data from a Caravan Owners Insurance Policy CRM database, demonstrates how data science can add value and increase ROI for the insurance sales business.
As a proof of concept, we'll go for a quick win.
The conversion rate in the caravan insurance policy sales department is considered low: we spend a long time chasing leads that go nowhere. How can we improve our conversion?
We can use a model to predict the probability that each lead will convert, based on all the data we keep. We then prioritise the more probable leads. With such a large proportion of poor leads, the risk of rejecting good leads is low.
For this PoC, a relatively simple technique, logistic regression, is used to predict the probability of a purchase.
Once the data is prepared, the model takes seconds to produce, but is it any good at predicting?
To ascertain this, it must be tested on new data. For this simulation, a set of CRM records has been kept aside so the model can be run on previously unseen data.
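As a minimal sketch of this step, assuming the CRM extract is a CSV with a binary "Purchase" target column (the file and column names below are illustrative stand-ins, not details from the original study):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

crm = pd.read_csv("caravan_crm.csv")              # hypothetical file name
X = crm.drop(columns="Purchase")
y = crm["Purchase"]

# Keep a stratified test set aside so the model is scored on unseen records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Default behaviour: predict Yes only when P(purchase) > 0.5.
print(confusion_matrix(y_test, model.predict(X_test)))
```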
Confusion Matrix of Predicted vs Actual Purchase on new data
             purchased
prediction     No    Yes
  No         3756    235
  Yes           6      3
On the surface, these results look terrible. However, the low conversion rate we started with causes a well-understood issue called the "class imbalance problem." With such a high proportion of actual No purchases, the model has a strong tendency to predict No for nearly every case.
This is easily addressed by lowering the decision threshold: we instruct the model to predict Yes at a much lower probability.
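Continuing the sketch above, this is a one-line change. The 0.08 here is purely illustrative; choosing the threshold properly is covered next.

```python
# P(purchase) for each lead in the test set, from the model above.
proba = model.predict_proba(X_test)[:, 1]

# Predict Yes whenever the probability clears a much lower bar.
y_pred = (proba > 0.08).astype(int)   # 0.08 is illustrative, not tuned
```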
What should this threshold be? Again we can use data science techniques to compute the best threshold directly from our data.
We assess the model at thousands of candidate threshold values, plotting the true positive rate against the false positive rate at each one.
We can also apply a cost function to each type of error, because we know a false negative (a lost sale) is vastly more costly than a false positive (chasing a low-quality lead).
The best cost-based balance of true positive rate and false positive rate can be decided from this ROC curve.
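A sketch of that search, reusing the probabilities from the previous snippet: sweep the candidate thresholds returned by the ROC computation, weight the two error types by assumed business costs, and pick the cheapest cut-off. The 10:1 cost ratio is an assumption standing in for figures the sales team would supply.

```python
from sklearn.metrics import confusion_matrix, roc_curve

fpr, tpr, thresholds = roc_curve(y_test, proba)

fn_cost, fp_cost = 10.0, 1.0          # assumed: a lost sale hurts ~10x more
n_pos, n_neg = (y_test == 1).sum(), (y_test == 0).sum()

# Expected cost at each threshold: false negatives plus false positives,
# each weighted by its business cost.
total_cost = fn_cost * (1 - tpr) * n_pos + fp_cost * fpr * n_neg

best = total_cost.argmin()
print(f"best threshold: {thresholds[best]:.3f}")
print(confusion_matrix(y_test, (proba > thresholds[best]).astype(int)))
```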
Updated Confusion Matrix of Predicted vs Actual Purchase on new data
             purchased
prediction     No    Yes
  No         3004    112
  Yes         758    126
Following this new model's predictions, we would immediately move 3004 + 112 = 3116 leads to the lowest priority. We would lose 112 potential sales, but our attention would be on only 758 + 126 = 884 leads, giving a conversion rate of 126/884 = 14.25%. That leaves much more time to nurture and convert further incoming leads.
Of course, we can't be satisfied with these results. So, what next?
The previous slides show the results of just a few hours' work, using the crudest methods for a quick win.
The demo gave us a minimum expectation of the benefits we could achieve.
There are far more sensitive prediction methods we could try. There are also methods for determining the strongest influencing factors in a customer's decision to buy. The latter is what we'll look at next.
The CRM data used in the first demo has 85 fields for each customer. It's very rich information, but analysing all these variables one by one would be overwhelming.
Feature selection can automate and compress the search for the most influential variables.
For simplicity we'll stick with logistic regression and add a second technique to discover the most important features driving the predictions.
The Lasso is a method that adds a penalty for each variable included in our logistic regression, favouring a simpler model.
This chart shows the effect of increasing the penalty, Lambda, on the regression model. The bigger Lambda gets, the more the coefficients are "squeezed" by the penalty, and the smaller ones are excluded altogether.
The exact level of penalty which gives the best predictions is determined by another technique called cross-validation. In simple terms, we run the routine multiple times over subsets of the data and compare all the results for a better outcome.
The previous chart indicates that a pair of useful values is returned: the penalty (Lambda) that achieved the best results, and a larger value whose results were within one standard error of the best. Statistically, the latter might perform just as well while giving a clearer answer to our business problem, because it further reduces the number of variables in the model.
In this case we keep the lower penalty that yielded the most accurate predictions.
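A sketch of this step using scikit-learn's cross-validated, L1-penalised logistic regression, reusing the training data from the first snippet. Note two assumptions in the translation: scikit-learn parameterises the penalty as C, the inverse of the Lambda discussed above, and it keeps only the best-scoring value (there is no built-in one-standard-error rule).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

lasso = LogisticRegressionCV(
    Cs=np.logspace(-4, 2, 50),    # grid of candidate penalty strengths
    cv=10,                        # 10-fold cross-validation
    penalty="l1",
    solver="liblinear",           # liblinear supports the L1 penalty
    scoring="roc_auc",
)
lasso.fit(X_train, y_train)
print("chosen C (inverse penalty):", lasso.C_[0])
```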
Using the Lasso and techniques from the first part of the demo, we've also recalculated the threshold for a Yes prediction. This is used to create a new confusion matrix.
Updated Confusion Matrix of Predicted vs Actual Purchase on new data
             purchased
prediction     No    Yes
  No         2305     78
  Yes        1457    160
Clearly the predictions are less accurate: the false positive rate is up. But this is OK. While we're working through roughly twice as many leads, we're also finding a lot more sales.
The variables used in the model are: (Intercept), V47, V82.
The real benefit of the Lasso is that instead of using all 85 fields from the data, the model requires only 2 (not including the Intercept). It still gives useful predictions but it's also much easier to interpret and explain.
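The Lasso shrinks the coefficients of excluded fields exactly to zero, so the surviving variables can be read straight off the model fitted in the previous sketch:

```python
# Keep only the columns whose coefficient was not shrunk to zero.
coefs = lasso.coef_.ravel()
print("variables retained:", list(X_train.columns[coefs != 0]))
print("intercept:", lasso.intercept_[0])
```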
This gives the marketing team a distinct focus for their next campaign, and the product development team may have some very clear ideas about creating a new strategy.
This demonstration has barely scratched the surface of what is possible. Nevertheless, with just a few hours' work we were able to produce new, actionable insights from the available data:
A simple but operational prediction model which gives reasonable results
A clear indication of two key drivers of customer purchasing decisions
This was achieved entirely through machine learning, without any time-consuming manual labour (goodbye, spreadsheets!). It's repeatable at the click of a mouse, and we can easily build on what we've discovered.