class: center, middle, inverse, title-slide

# Ensuring a Safer Future
## Exploratory analysis and customer segmentation
### Peer Christensen
### 2019-09-06

---

# Tasks

1. Analyse & summarise the main characteristics of the travel insurance data
2. Identify and value the most common ‘clusters’ of customers purchasing the insurance product
3. Identify the ‘riskiest’ group of customers. What are the characteristics of this group?
4. Explain the business impact of the analysis

---

# The Data Set

Variables: 11

- Agency
- Agency Type
- Distribution Channel
- Product Name
- Claim
- Duration
- Destination
- Net Sales
- Commision (in value)
- Gender
- Age

Rows: 63326

From a business perspective, we are particularly interested in the variables Claim and Net Sales. In the following slides, we will explore the relationships between these target variables and the other variables.

---

class: inverse, middle, center

# The Claim variable

---

# Claim ~ numeric variables

<!-- -->

---

# Claim ~ categorical variables

<!-- -->

---

# Product Name

<!-- -->

---

# Countries... or rather continents!

There are 149 different destinations (countries) in the data, and some are very infrequent. To get a better overview of destinations, we can create a new variable that groups destinations by continent.

<!-- -->

---

# Key findings

- Looking at the previous plots, we see right away that the Claim variable is highly imbalanced, with only 927 claims made. That's about 1.5% of the 63326 rows.
- This imbalance makes it difficult to visually assess how Claim relates to other variables. Further exploration of the underlying numbers and proportions did not reveal any clear relationships.
- The majority of values in the Gender variable are missing.
- Outliers and improbable values are also present, but these are more clearly exposed when we explore the Net Sales variable.
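
As a quick check of the imbalance figure, the arithmetic can be reproduced from the two counts reported in these slides. The comparison with a trivial "always No" baseline is an added illustration, not part of the original analysis; it shows why plain accuracy is a poor yardstick for this target.

```python
# Counts taken from the slides: 927 claims among 63326 rows.
n_rows = 63326
n_claims = 927

# Share of rows with a claim.
claim_rate = n_claims / n_rows

# Accuracy of a trivial baseline that always predicts "no claim".
baseline_accuracy = 1 - claim_rate

print(f"Claim rate: {claim_rate:.1%}")
print(f"Always-'No' baseline accuracy: {baseline_accuracy:.1%}")
```

A model therefore has to be judged on how well it ranks the rare claims, not on raw accuracy.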
---

class: inverse, middle, center

# The Net Sales variable

---

# Net Sales ~ numeric variables

<!-- -->

---

# Net Sales ~ categorical variables I

<!-- -->

---

# Net Sales ~ categorical variables II

<!-- -->

---

# Key findings

- It is clear from the previous slide that clusters of unlikely values are present in the Age and Duration variables.
- We see now that some Net Sales values are actually negative. How can this be?
- Different trends emerge when Commissions and Net Sales are plotted together.
- When we colour the data points according to the Claim variable, it looks like claims might be predicted by commissions that are high relative to Net Sales.
- Claims may also be more likely for higher Net Sales values.

---

class: inverse, middle, center

# Data cleaning and feature engineering

---

Data cleaning

- As we've seen, the Age and Duration variables contain outliers. These will be removed.
- The Duration and Net Sales variables contain negative values. This seems very odd, and we'll remove them in order to create meaningful new features.
- Most values of the Gender variable are missing. Removing these rows is not a viable option, so we will instead drop the variable altogether. (Fitting linear models to the data suggests that gender is a weak predictor of both claims and net sales.)
- Some continuous variables have skewed distributions. Log transformations will be applied.
- Rows with NAs will be dropped.

Feature engineering

- We've already created a variable called Continent. In addition, we'll create one for Gross Sales, which we assume to be the actual price that customers pay.

---

# So which variables help identify "risky" customers?

We use stepwise regression to get a quick overview of the statistical relationships retained in the final model of claims. Note that some categorical variables are excluded because they have many infrequent levels.
Looking at the estimated coefficients, we find that commissions and sales are the most important variables, and that riskier customers spend more on their insurance plans.

<!-- -->

---

class: inverse, middle, center

# Predictive Modelling

---

# Predictive Modelling

We now have a good idea about the traits that are related to net sales and claims.

For insurance companies, being able to predict claims, or "risky" customers, is very useful: it allows them to offset some of the risk by adjusting price points.

We train classification models using different algorithms, correct for class imbalance, and then select the best model and visualise its performance in predicting claims.

---

# Performance - AUC

After training and tuning several models, the best model is a good old GLM with an AUC of 0.78, but there is very little difference between the best ones.

Below we see ROC curves for the five best models.

<img src="roc_glm_claim.png" width="500px" />

---

# Performance - Lift chart

To get a better understanding of what this means for decision-makers, we can ask:

*If we apply the model and select n % of observations, how much better is our model at predicting claims than random selection?*

We can illustrate this with a lift chart of the best model.

<img src="lift.png" width="400px" />

---

# Performance - Gains chart

We may also ask:

*If we apply the model and select n % of observations, what % of claim observations can we expect to hit?*

We can illustrate this with a gains chart of the best model.

<img src="gains.png" width="400px" />

---

# Variable importance

Lastly, inspecting the variables with the highest importance, we see that product names, sales variables and destinations dominate the top 25.
<img src="var_imp_glm_claim.png" width="550px" />

---

class: inverse, middle, center

# Customer segments

---

# Customer segments

- A popular approach is to cluster customers by their lifetime value (CLV): clusters are created from customers' purchase histories (i.e. recency, frequency and monetary value, RFM) and determined using k-means clustering.
- In our case, we have a mix of numeric and categorical variables, which is better handled by the PAM algorithm (partitioning around medoids).
- Applying cluster selection methods, the optimal number of clusters was found to be two.

---

# PAM clusters

We can then reduce the dimensions with t-SNE and plot the clusters in two dimensions.

<img src="pam_cluster.png" width="500px" />

---

# What characterises the clusters?

Below we see the medoids, i.e. the most representative observation of each cluster. Note that the monetary value of cluster two members is higher than that of cluster one members.

```
##                        X1           X2
## Agency Type            Airlines     Travel Agency
## Distribution Channel   Online       Online
## Product Name           Basic Plan   Rental Vehicle Excess Insurance
## Claim                  No           No
## Duration               2.385227     2.419607
## Net Sales              2.448822     2.543048
## Commision (in value)   2.165733     2.427437
## Age                    40           40
## Continent              Asia         Asia
```

A more detailed summary with distributions can be obtained from the PAM results.

---

# The monetary value of cluster members

- We now know how the clusters differ.
- As a final step, we compare the monetary value of customers grouped by cluster.

<img src="clusters_value_combined2.png" width="750px" />

---

class: inverse, middle, center

# Business impact

---

# Business impact

In this analysis, we sought to understand insurance plan buyers better, group them by their value and predict whether or not they make insurance claims.

Insurance companies make a profit by determining the average cost of insuring travelers and charging slightly more.
Risk assessment is therefore hugely important: understanding why claims are made and predicting who will make them allows insurance companies to make better price estimates and increase profits.

The usefulness of our predictive model was assessed using lift and gains charts. However, we need more information to estimate its potential financial impact.

---

# Final remarks

Insurance companies often have a good idea of the average cost of claims and the average value of plan buyers. With more information about e.g. the cost of claims, we would be able to estimate the expected gain from applying the predictive model and adjust the classification threshold for claims to maximize revenue.

Feature engineering can often make the difference between a good model and a great model. Because the Destination variable has so many infrequent levels, it was difficult to assess its explanatory power. Alternatively, we could add new variables for each country with relevant information. For instance, I imagine that we could join our data with information about destination safety and standard of living from indicators such as country GDP, the Global Peace Index and the Multidimensional Poverty Index.
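
The threshold adjustment mentioned in the first remark can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis behind these slides: the scores and labels are toy data, and both monetary amounts (`saving_per_hit`, `cost_per_false_alarm`) are hypothetical placeholders for figures an insurer would have to supply.

```python
def expected_gain(scores, labels, threshold, saving_per_hit, cost_per_false_alarm):
    """Total gain when every customer scoring at or above the threshold is flagged."""
    gain = 0.0
    for score, is_claim in zip(scores, labels):
        if score >= threshold:
            # A flagged claimant yields a saving; a flagged non-claimant costs money.
            gain += saving_per_hit if is_claim else -cost_per_false_alarm
    return gain


def best_threshold(scores, labels, saving_per_hit, cost_per_false_alarm):
    """Try each observed score as a threshold and return the most profitable one."""
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda t: expected_gain(
            scores, labels, t, saving_per_hit, cost_per_false_alarm
        ),
    )


# Toy predicted claim probabilities and true outcomes (1 = claim). Hypothetical.
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80]
labels = [0, 0, 0, 1, 0, 1]

# Hypothetical amounts: value of catching a claim vs. cost of a false alarm.
threshold = best_threshold(scores, labels, saving_per_hit=100.0, cost_per_false_alarm=20.0)
gain = expected_gain(scores, labels, threshold, 100.0, 20.0)
print(threshold, gain)
```

With real cost figures, the same search could be run over the model's predicted probabilities on held-out data instead of the toy lists above.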