Julian Hatwell
Aug 2016
This short case study, using data from a Caravan Owners Insurance Policy CRM database, demonstrates how data science can add value and increase ROI for the insurance sales business.
As a proof of concept, we'll go for a quick win.
The conversion rate in the caravan insurance policy sales department is considered low: we spend a long time chasing leads that go nowhere. How can we improve our conversion?
We can use a model to predict the probability that each lead will convert, based on all the data we keep. We then prioritise the more probable leads. With such a large proportion of poor leads, the risk of rejecting good leads is low.
For this PoC, a relatively simple technique, logistic regression, is used to predict the probability of a purchase.
Once the data is prepared, the model takes seconds to produce, but is it any good at predicting?
To ascertain this, it must be tested on new data. For this simulation, a set of CRM records has been kept aside so the model can be run on previously unseen data.
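As a minimal sketch of this step, assuming the CRM extract is a CSV with a binary "Purchase" target column (the file and column names below are illustrative stand-ins, not details from the original study):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

crm = pd.read_csv("caravan_crm.csv")              # hypothetical file name
X = crm.drop(columns="Purchase")
y = crm["Purchase"]

# Keep a stratified test set aside so the model is scored on unseen records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Default behaviour: predict Yes only when P(purchase) > 0.5.
print(confusion_matrix(y_test, model.predict(X_test)))
```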
Confusion Matrix of Predicted vs Actual Purchase on new data
             purchased
prediction     No    Yes
  No         3756    235
  Yes           6      3
On the surface, these results look terrible. However, the low conversion rate we started with causes a well-understood issue called the "class imbalance problem." With such a high proportion of actual No purchases, the model has a strong tendency to predict No for nearly every case.
This is easily addressed by lowering the decision threshold: we instruct the model to predict Yes at a much lower probability.
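Continuing the sketch above, this is a one-line change. The 0.08 here is purely illustrative; choosing the threshold properly is covered next.

```python
# P(purchase) for each lead in the test set, from the model above.
proba = model.predict_proba(X_test)[:, 1]

# Predict Yes whenever the probability clears a much lower bar.
y_pred = (proba > 0.08).astype(int)   # 0.08 is illustrative, not tuned
```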
What should this threshold be? Again we can use data science techniques to compute the best threshold directly from our data.
We assess the model at thousands of candidate threshold values, plotting the true positive rate against the false positive rate at each one.
We can also apply a cost function to each type of error, because we know a false negative (a lost sale) is vastly more costly than a false positive (chasing a low-quality lead).
The best cost-based balance of true positive rate and false positive rate can be decided from this ROC curve.
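A sketch of that search, reusing the probabilities from the previous snippet: sweep the candidate thresholds returned by the ROC computation, weight the two error types by assumed business costs, and pick the cheapest cut-off. The 10:1 cost ratio is an assumption standing in for figures the sales team would supply.

```python
from sklearn.metrics import confusion_matrix, roc_curve

fpr, tpr, thresholds = roc_curve(y_test, proba)

fn_cost, fp_cost = 10.0, 1.0          # assumed: a lost sale hurts ~10x more
n_pos, n_neg = (y_test == 1).sum(), (y_test == 0).sum()

# Expected cost at each threshold: false negatives plus false positives,
# each weighted by its business cost.
total_cost = fn_cost * (1 - tpr) * n_pos + fp_cost * fpr * n_neg

best = total_cost.argmin()
print(f"best threshold: {thresholds[best]:.3f}")
print(confusion_matrix(y_test, (proba > thresholds[best]).astype(int)))
```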
Updated Confusion Matrix of Predicted vs Actual Purchase on new data
             purchased
prediction     No    Yes
  No         3004    112
  Yes         758    126
Following this new model's predictions, we would immediately move 3004 + 112 = 3116 leads to the lowest priority. We would lose 112 potential sales, but our attention would be on only 758 + 126 = 884 leads, giving a conversion rate of 126/884 = 14.25%. That leaves much more time to nurture and convert further incoming leads.
Of course, we can't be satisfied with these results. So, what next?
The previous slides show the results of just a few hours' work, using the crudest methods for a quick win.
The demo gave us a minimum expectation of the benefits we could achieve.
There are far more sensitive prediction methods we could try. There are also methods for determining the strongest influencing factors in a customer's decision to buy. The latter is what we'll look at next.
The CRM data used in the first demo has 85 fields for each customer. It's very rich information, but analysing all these variables one by one would be overwhelming.
Feature selection can automate and compress the search for the most influential variables.
For simplicity we'll stick with logistic regression and add a second technique to discover the most important features driving the predictions.
The Lasso is a method that adds a penalty for each variable included in our logistic regression, favouring a simpler model.
This chart shows the effect of increasing the penalty, Lambda, on the regression model. The bigger Lambda gets, the more the coefficients are "squeezed" by the penalty, and the smaller ones are excluded altogether.
The exact level of penalty which gives the best predictions is determined by another technique called cross-validation. In simple terms, we run the routine multiple times over subsets of the data and compare all the results for a better outcome.
The previous chart indicates that a pair of useful values is returned: the penalty (Lambda) that achieved the best results, and a larger value whose results were within one standard error of the best. Statistically, the latter might perform just as well while giving a clearer answer to our business problem, because it further reduces the number of variables in the model.
In this case we keep the lower penalty that yielded the most accurate predictions.
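A sketch of this step using scikit-learn's cross-validated, L1-penalised logistic regression, reusing the training data from the first snippet. Note two assumptions in the translation: scikit-learn parameterises the penalty as C, the inverse of the Lambda discussed above, and it keeps only the best-scoring value (there is no built-in one-standard-error rule).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

lasso = LogisticRegressionCV(
    Cs=np.logspace(-4, 2, 50),    # grid of candidate penalty strengths
    cv=10,                        # 10-fold cross-validation
    penalty="l1",
    solver="liblinear",           # liblinear supports the L1 penalty
    scoring="roc_auc",
)
lasso.fit(X_train, y_train)
print("chosen C (inverse penalty):", lasso.C_[0])
```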
Using the Lasso and techniques from the first part of the demo, we've also recalculated the threshold for a Yes prediction. This is used to create a new confusion matrix.
Updated Confusion Matrix of Predicted vs Actual Purchase on new data
             purchased
prediction     No    Yes
  No         2305     78
  Yes        1457    160
Clearly the predictions are less accurate: the false positive rate is up. But this is OK. While we're working through roughly twice as many leads, we're also finding a lot more sales.
The variables used in the model are: (Intercept), V47, V82.
The real benefit of the Lasso is that instead of using all 85 fields from the data, the model requires only 2 (not including the Intercept). It still gives useful predictions but it's also much easier to interpret and explain.
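The Lasso shrinks the coefficients of excluded fields exactly to zero, so the surviving variables can be read straight off the model fitted in the previous sketch:

```python
# Keep only the columns whose coefficient was not shrunk to zero.
coefs = lasso.coef_.ravel()
print("variables retained:", list(X_train.columns[coefs != 0]))
print("intercept:", lasso.intercept_[0])
```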
This gives the marketing team a distinct focus for their next campaign, and the product development team may have some very clear ideas about creating a new strategy.
This demonstration has barely scratched the surface of what is possible. Nevertheless, with just a few hours' work we were able to produce new, actionable insights from the available data:
A simple but operational prediction model which gives reasonable results
A clear indication of two key drivers of customer purchasing decisions
This was achieved entirely through machine learning, without any time-consuming manual labour (goodbye, spreadsheets!). It's repeatable at the click of a mouse, and we can easily build on what we've discovered.