Introduction

In this project I analyze Sbermarket customer data. Using the data on customers and their behaviour, I segment the customers, measure and predict the level of customer churn, and propose a program of measures to improve these indicators.

Data

The data contains 5326 observations (customers) and 18 variables. The variables are the following:

Variable         Description
user_id          Unique identifier for each customer
gender           Gender (male or female)
city             Destination city
age_group        Age group (categorical)
mean_rate        Average rating the customer gives the service
satisfaction     Categorical variable describing the customer’s satisfaction
timediff_order   Number of days since the last interaction with the customer
dw_kind          Preferred type of delivery
platform         Platform the customer uses most often
os               Operating system the customer uses most often
savings          Total amount of money the customer saved
spendings        Total amount of money the customer spent
num_orders       Number of orders made by the customer
churn            Churn indicator (whether the customer churned)

Exploratory analysis

This graph shows the distribution of some of our variables. From these distributions we can see which groups use our service most and which groups we could try to attract additionally.

It follows from this graph that the majority of users are women. Users between the ages of 25 and 34 are the most frequent customers of our service. The preferred platform is the application, and the most used operating system is Android.

Next, I will analyze the current situation of customer churn.

As we can see, the number of lost customers is relatively small. It is still important to understand which factors influence churn, so we explore them here.

  • Gender. Taking into account the distribution of our data examined earlier, gender does not appear to matter for customer churn.

  • Age. Although the 75+ age group is the smallest, it is the most prone to churn. The lowest churn is observed in the 55-64 group, which is a good sign even though this group is also small.

  • Satisfaction. This relationship is as expected: passives and detractors churn more often than promoters, who are fully satisfied with the service.

Hypothesis

Based on the graphs presented and the conclusion from the work of Chen, K., Hu, YH. & Hsieh, YC. “Predicting customer churn from valuable B2B customers in the logistics industry: a case study” I hypothesize that the churn rate is most influenced by financial indicators (for example, money spent), as well as the number of customer orders.

This graph partly supports the hypothesis:

Customer Segmentation

Summary: We divided our customers into three clusters based on their characteristics and identified their typical purchasing behaviour, which can help us target them differently.

Data preparation

For the analysis that follows we remove the demographic variables, because most of them are categorical and the algorithms we are going to use do not handle them directly. All ID variables are removed as well. The remaining variables are then normalized so that each one ranges from 0 to 1.
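The normalization step can be sketched as min-max scaling. This is a minimal illustration rather than the code behind the report: the function name `min_max_scale` and the toy values are assumptions, and only the column names `num_orders` and `spendings` come from the variable list.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy columns; the real data has 5326 customers.
customers = {
    "num_orders": [1, 4, 10, 25],
    "spendings": [500.0, 2300.0, 8000.0, 15500.0],
}
scaled = {name: min_max_scale(column) for name, column in customers.items()}
print(scaled["num_orders"])
```

Scaling matters for k-means because the method relies on distances: without it, a variable measured in large units such as total spending would dominate a variable such as the number of orders.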

Analysis

To segment the customers we use the k-means method. First, we need to identify a reasonable number of clusters, and for that we use the elbow graph.

I decided to divide our customers into 3 clusters. Let’s look at them more closely.
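The elbow heuristic can be sketched as follows: run k-means for several values of k, record the within-cluster sum of squares (WCSS), and look for the k after which the curve stops dropping sharply. The code is a plain standard-library version of Lloyd's algorithm on hypothetical toy points, not the implementation used for the report. The names `k_means` and `elbow`, the number of restarts and the data are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels, wcss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move every centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[j])  # empty cluster: keep the old centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wcss


# Hypothetical normalized toy data with three visibly separate groups.
points = [
    (0.10, 0.10), (0.12, 0.15), (0.15, 0.10), (0.08, 0.12),
    (0.50, 0.90), (0.55, 0.85), (0.52, 0.95), (0.48, 0.88),
    (0.90, 0.20), (0.95, 0.25), (0.88, 0.15), (0.92, 0.22),
]

# For each k keep the best of several random restarts, since k-means can
# stop in a local optimum.
elbow = {
    k: min(k_means(points, k, seed=s)[2] for s in range(10))
    for k in range(1, 6)
}
for k, wcss in elbow.items():
    print(k, round(wcss, 4))
```

Plotting `elbow` against k gives the elbow graph. The choice of k remains a judgment call, which is why the decision for 3 clusters is stated separately.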

Clusters

Now let us create those clusters and calculate the number of customers in each of them. The clusters differ in size, with cluster 2 being the biggest.

Next, we visualize the clusters that we received.

It is difficult to draw concrete conclusions from the visualization alone, so let’s look at the mean value of each variable in each cluster.

  1. These customers give a low rating (about 2). Their last order was made relatively recently. On average, they have a small number of orders and low spending.

  2. These customers give a high rating (about 8). They have the shortest time since the last interaction and a relatively small number of orders and low spending.

  3. These customers give an average rating (about 6-7). Their average time since the last interaction with the service is the longest, but they also have the largest number of orders and the highest spending.
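Cluster profiles like the ones above come from averaging each variable within each cluster. The sketch below assumes that every customer is a dictionary of numeric values and that cluster labels are already available. The helper `cluster_means` and the toy rows are hypothetical and do not reproduce the report's figures.

```python
from collections import defaultdict


def cluster_means(rows, labels):
    """Mean of every numeric variable within each cluster."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        counts[label] += 1
        for name, value in row.items():
            sums[label][name] += value
    return {
        label: {name: total / counts[label] for name, total in totals.items()}
        for label, totals in sums.items()
    }


# Hypothetical toy customers, not the real data.
rows = [
    {"mean_rate": 2.0, "num_orders": 3, "churn": 0},
    {"mean_rate": 3.0, "num_orders": 5, "churn": 0},
    {"mean_rate": 8.0, "num_orders": 4, "churn": 0},
    {"mean_rate": 6.0, "num_orders": 20, "churn": 1},
    {"mean_rate": 7.0, "num_orders": 30, "churn": 0},
]
labels = [1, 1, 2, 3, 3]

means = cluster_means(rows, labels)
for label in sorted(means):
    print(label, means[label])
```

A 0/1 column such as a churn flag can be averaged in the same way, in which case its mean is the share of churned customers in the cluster.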

Let’s look at the churn distribution in each of the clusters.

In the first cluster there is a small percentage of churn (9.7%). It can be assumed that customers do not like some details in the service (they give low ratings), but they continue to use it because they do not find alternatives for buying certain goods.

In the third cluster the churn is 59.6%, which is more than half. We can conclude that these buyers made many orders within one period of time and spent a lot, but then ran into something they did not like, perhaps the delivery or the quality of the goods. The more orders there are, the more likely it is that the buyer will stumble upon some drawback that leaves a bad impression and outweighs all the good.

There is no churn in cluster 2. Based on the analysis above, these are moderately satisfied customers who do not order too often and spend small amounts. Perhaps these are new customers who have just started using the service and are still satisfied.

Churn prediction

Summary: The churn prediction model identified the number of orders as the most important variable associated with churn. Another model suggested that the average check and the total amount of spending are also important for churn.

Now let us build a predictive model of customer churn. We will review 3 different predictive models, evaluate them and decide which one is best suited.

First let’s look at the correlation between the variables that we are going to use.

Time since the last order correlates strongly with the churn variable, because churn was derived from it, so we won’t use it in the analysis.
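The leakage problem can be illustrated with a small Pearson correlation sketch. The report does not say how churn was derived from `timediff_order`, so the 30-day cutoff below is purely an assumption for illustration, as are the toy values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical days since the last order for six customers.
timediff_order = [2, 5, 10, 40, 60, 90]

# ASSUMPTION: churn is flagged when more than 30 days have passed.
churn = [1 if days > 30 else 0 for days in timediff_order]

r = pearson(timediff_order, churn)
print(round(r, 2))
```

A predictor that the target was computed from would let any model score well without learning anything about customer behaviour, which is the reason for leaving it out.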

Next, we train our models on the training data and evaluate them on the testing data.
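A random split into training and testing data can be sketched as follows. The split proportion used in the report is not stated, so the 80/20 share, the seed and the helper name `train_test_split` are assumptions.

```python
import random


def train_test_split(rows, test_share=0.2, seed=42):
    """Randomly assign rows to a training and a testing set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_share)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Hypothetical stand-in for customer rows.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

With an imbalanced target such as churn, a stratified split that keeps the share of churners equal in both parts is a common refinement. It is not shown here.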

Logistic regression

The first model we’ll use is logistic regression for binary classification.

Predictors

These are the variables that our model considered significant in churn prediction.

As we can see, the number of orders and the average check are positively associated with churn, which is quite surprising. The total amount of spending, however, is negatively associated with churn.

Model evaluation

The accuracy of logistic regression is 81%, which is better than random guessing. However, because the classes of our target variable are imbalanced, model recall is much better than model precision. This can be seen in the plotted table: the model is good at predicting the customers who stayed, but not as good at identifying the customers who actually churned.
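The three metrics can be computed directly from the counts of the confusion table. In the sketch below churn = 1 is treated as the positive class. That is an assumption, since the report does not state which class its table treats as positive, and the toy labels are hypothetical.

```python
def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision and recall for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical imbalanced sample: 8 customers stayed (0), 2 churned (1).
actual = [0] * 8 + [1] * 2

# A "model" that predicts that nobody churns.
always_stay = classification_metrics(actual, [0] * 10)

# A model that flags two customers and is right about one of them.
some_model = classification_metrics(actual, [0] * 7 + [1] + [1, 0])

print(always_stay)
print(some_model)
```

Both toy predictions reach the same accuracy, yet the first one never finds a churner. This is why accuracy alone says little when one class is rare, and why precision and recall are reported alongside it.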

Lasso

The next model we’ll use is the lasso, a penalized classification model that is a special case of the elastic net with only the L1 penalty.

First we will tune the model to select the best parameters for it.

Model tuning

Next we’ll identify which features the model selected as significant:

Predictors

Interestingly enough, the lasso kept only one of our predictors, the number of orders. Let’s see how the performance of the model has changed.
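The reason the lasso can drop predictors entirely is the L1 penalty, which shrinks small coefficients to exactly zero. The sketch below fits an L1-penalized logistic regression by proximal gradient descent on hypothetical toy data in which the second column is constructed to be unrelated to churn. It is a from-scratch illustration, not the tuned model from the report, and the penalty value, the learning rate and all names are assumptions.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, penalty, lr=0.1, n_iter=500):
    """Proximal gradient descent for logistic regression with an L1 penalty."""
    n, n_features = len(X), len(X[0])
    intercept, weights = 0.0, [0.0] * n_features
    for _ in range(n_iter):
        probs = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row)))
            for row in X
        ]
        errors = [p - target for p, target in zip(probs, y)]
        # The intercept is not penalized: plain gradient step.
        intercept -= lr * sum(errors) / n
        for j in range(n_features):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n
            # Gradient step followed by shrinkage; the shrinkage is what
            # sets weak predictors to exactly zero.
            weights[j] = soft_threshold(weights[j] - lr * grad, lr * penalty)
    return intercept, weights


# Hypothetical centered toy data: column 0 drives churn, column 1 is noise.
X = [(-2, 1), (-2, -1), (-1, 1), (-1, -1), (-0.5, 1), (-0.5, -1),
     (0.5, 1), (0.5, -1), (1, 1), (1, -1), (2, 1), (2, -1)]
y = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]

intercept, weights = fit_l1_logistic(X, y, penalty=0.1)
print(weights)
```

In this toy setting the informative column keeps a positive weight while the noise column stays at zero. A larger penalty removes more predictors, which is why the penalty strength is normally chosen by tuning.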

Model evaluation

The model accuracy equals 81%, exactly the same as for logistic regression. We still face the same issue: churners are hard to predict.

Random forest

The last model we’ll cover is random forest. Again, we start with the tuning process to select the best parameters, and the model is then fitted with those parameters to achieve the best accuracy.

Model tuning

Model evaluation

With the random forest model our accuracy is slightly worse than before: 80%.

Model comparison

Now we will compare the three models in terms of their accuracy, precision and recall.

As we can see from the graph, lasso and logistic regression have slightly better accuracy than random forest. Random forest beats the other two models in terms of precision but, again, fails at recall. On this basis we believe that lasso is the best model, as it maximizes both accuracy and recall, and in terms of precision the difference from the other models is not that big.

Policies

We see that the customers who are leaving us are those we could call “too good to be true”. They buy a lot, buy often, spend large sums and rate the application well. However, because the service is not unique and offers nothing to hook buyers who can choose from a variety of services, we lose them. What needs to be done?

  1. Creating a loyalty card. We give a discount to those who have bought for a certain amount, so that staying with us is the most profitable option for them. The discount increases with each level, which will encourage the best buyers to stay with us.

  2. Introducing express delivery that also covers new and unique products. This will let us make a distinctive offer: an accessible service that can bring not only food within an hour but also, for example, a new phone. It will also encourage impulse purchases, since a purchase with fast delivery needs less deliberation than one delivered in a few days.