Project 2: Analysis of the Sbermarket

library(readr)
library(dplyr)
library(ggplot2)
library(rstatix)
library(plotly)
options(scipen = 99999)
scale_01 <- function(x){(x-min(x))/(max(x)-min(x))}

orders <- read_csv("OrdersFinance.csv")
customers <- read_csv("CustomerInfo.csv")
experience <- read_csv("CustomerExperience.csv")
delivery <- read_csv("CustomerDelivery.csv")

cust <- subset(customers,!duplicated(customers$user_id))

sber <- cust  %>% inner_join(experience, by="user_id") %>% inner_join(delivery, by="user_id") %>% inner_join(orders, by="user_id") %>% select(-X1.x, -X1.y, -X1.x.x, -X1.y.y)

#factor variables
sber$gender <- as.factor(sber$gender)
levels(sber$gender) <- c("Female", "Male")
sber$age_group <- as.factor(sber$age_group)
sber$city <- as.factor(sber$city)
sber$satisfaction <- as.factor(sber$satisfaction)
sber$dw_kind <- as.factor(sber$dw_kind)
sber$platform <- as.factor(sber$platform)
sber$os <- as.factor(sber$os)

Summary

In this project we did exploratory analysis demonstrating important business metrics and identifying troubles of the service. We build logistic regression to predict churn and two models for customer satisfaction prediction. We identified customer clusters and analyzed them, as well. Lastly, we came up with certain churn prediction strategies and recommendations.

The most important notices:

Churned customers are identified by bigger number of orders, application as a platform used, low satisfaction and lower spendings;
Customer’s dissatisfaction is characterized by web platform used, longer time passed since last interaction and male gender;
Two problematic clusters were identified, but significant differences among their characteristics were found only in terms of age: 4th is characterized by prevailing number of young people (<45).

Important Metrics

Customer satisfaction (measured with NPS - Net Promoter Score);
Churned customers (customers that stayed silent for more than 5 months);
Socio-demographic characteristics of customers (gender, age groups);
Consumption patterns (os, platform);
Number of orders may be considered as frequency of orders.

For better understanding of who are promoters, passives and detractors look at the pictures [1].

NPS _________________________________________________________________________________________________

Overview of the current situation

Customer satisfaction

Firstly, we would like to draw your attention to customer satisfaction. The metric that has four categories: Detractors, NotDefined, Passives and Promoters.

There is a prevailing share of detractors - 2448 customers (39%)

The graph signals that the problem is serious. The detractors greatly outweigh others.

pcts  <- sber %>% group_by(satisfaction) %>% 
 summarize(count = n()) %>% 
 mutate(pct = count/sum(count)) %>% na.omit()

g <- ggplot(pcts, aes(x = satisfaction, y = pct, fill = satisfaction, text = count)) + 
         geom_bar(stat='identity', show.legend = F) + 
         scale_fill_brewer(palette = "Purples") +
  geom_bar(data = dplyr::filter(pcts, satisfaction == 'Detractors'), 
           aes(x = satisfaction,  y = pct), stat = 'identity',fill = "#d64747") +
  geom_text(aes(label=scales::percent(pct)), position = position_stack(vjust = .5)) +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Customer's fulfilment") +
  xlab("Satisfaction categories") +
  ylab("Percent of customers") +
  theme_minimal()

hide_legend(ggplotly(g, tooltip= c("x", "count")))

Click on the graph to make it interactive

Patterns of detractors and promoters

Next we will show certain patterns of Detractors and Promoters that could be traced in the dataset.

More male detractors and female promoters

The difference is noticeable and statistically significant.

sber %>% 
  filter(!is.na(satisfaction)) %>% 
  ggplot() +
  geom_bar(aes(x = gender, fill = satisfaction), position = "fill") +
  scale_fill_brewer(name= "Satisfaction", palette = "Purples") +
  ggtitle("Satisfaction by gender") +
  xlab("Gender") +
  ylab("Share of customers") +
  theme_minimal()

More detractors - 15-24 y.o., more promoters - 35-54 y.o.

The difference is noticeable and statistically significant.

sber %>% 
  filter(!is.na(satisfaction)) %>% 
  filter(!is.na(age_group)) %>% 
  ggplot() +
  geom_bar(aes(x = age_group, fill = satisfaction), position = "fill") +
  scale_fill_brewer(name="Satisfaction", palette = "Purples") +
  ggtitle("Satisfaction by age groups") +
  xlab("Age group") +
  ylab("Share of customers") +
  theme_minimal()

More detractors - web platform, more promoters - the app

The difference is noticeable and statistically significant.

sber %>%
  filter(!is.na(satisfaction)) %>% 
  filter(!is.na(platform)) %>% 
  ggplot() +
  geom_bar(aes(x = platform, fill = satisfaction), position = "fill") +
  scale_fill_brewer(name= "Satisfaction", palette = "Purples") +
  ggtitle("Satisfaction by platform") +
  xlab("Platform type") +
  ylab("Share of customers") +
  theme_minimal()

Promoters rate high, dectractors - low, but not defined have many zeros

So those who satisfied more, they rate better. This seems reasonable.

sber %>%
  filter(!is.na(satisfaction)) %>% 
  ggplot() +
  geom_boxplot(aes(x = satisfaction, y = mean_rate, fill = satisfaction), show.legend = F) +
  scale_fill_brewer(palette = "Purples") +
  ggtitle("Mean ratings in satisfaction categories") +
  xlab("Satisfaction") +
  ylab("Mean rate") +
  theme_minimal()

NotDefined category gave only zero ratings (550), so mean rating variable becomes skewed and not usable in further analysis. Perhaps, some problems arose during data collection or storage which resulted in many zeros in one category.

Promoters have wider range of number of orders, detractors have smaller number of orders.

The difference is noticeable and statistically significant.

sber %>%
  filter(!is.na(satisfaction)) %>% 
  ggplot() +
  geom_boxplot(aes(x = satisfaction, y = num_orders, fill = satisfaction), show.legend = F) +
  scale_fill_brewer(palette = "Purples") +
  ggtitle("Number of orders in satisfaction categories") +
  xlab("Satisfaction") +
  ylab("Number of orders") +
  scale_y_log10() +
  theme_minimal()

Promoters with less inactivity (in days), detractors - with more

The difference is noticeable and statistically significant.

sber %>%
  filter(!is.na(satisfaction)) %>% 
  ggplot() +
  geom_boxplot(aes(x = satisfaction, y = timediff_order, fill = satisfaction),  show.legend = F) +
  scale_fill_brewer(palette = "Purples") +
  ggtitle("Time since last order by satisfaction categories") +
  xlab("Satisfaction") +
  ylab("Days since last order") +
  theme_minimal()

Promoters with higher savings, detractors - with lower

The difference is noticeable and statistically significant.

sber %>%
  filter(!is.na(satisfaction)) %>% 
  ggplot() +
  geom_boxplot(aes(x = satisfaction, y = savings, fill = satisfaction), show.legend = F) +
  scale_fill_brewer(palette = "Purples") +
  ggtitle("Savings by satisfaction categories") +
  xlab("Satisfaction") +
  ylab("Savings amount") +
  scale_y_log10() +
  theme_minimal()

Churn

Secondly, we would like to draw your attention to churn. The metric was created, assuming that customers who were inactive for more than 150 (~ 5 months) are churners.

sber <- sber %>% mutate(churned = case_when(timediff_order > 150 ~ "yes", TRUE ~ "no"))
sber$churned = as.factor(sber$churned)

The churn rate is 20%

churn  <- sber %>% 
  group_by(churned) %>% 
 summarize(count = n()) %>% 
 mutate(pct = count/sum(count))  

c <- ggplot(churn, aes(churned, pct, fill = churned, text = count)) + 
  scale_fill_manual(values = c("#83B44B", "#d64747")) +
  geom_bar(stat='identity', show.legend = FALSE) + 
  geom_text(aes(label=scales::percent(pct)), position = position_stack(vjust = .5))+
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Number of customers who churned/stayed in %") +
  xlab("Churned") +
  ylab("Percent of customers") + 
  theme_minimal()

hide_legend(ggplotly(c, tooltip= c("x", "count")))

Click on the graph to make it interactive

Patterns of churners

Next we will show certain patterns of churners that could be traced in the dataset.

More churners among 45-54 and 55-64 age groups

The difference is noticeable and statistically significant.

sber %>%
  filter(!is.na(age_group)) %>% 
  ggplot() +
  geom_bar(aes(x = age_group, fill = churned), position = "fill") +
  scale_fill_manual(name="Churned", values = c("#83B44B", "#d64747")) +
  ggtitle("Churn by age groups") +
  xlab("Age group") +
  ylab("Share of customers") +
  theme_minimal()

More churners among app users

The difference is noticeable and statistically significant.

sber %>%
  filter(!is.na(platform)) %>% 
  ggplot() +
  geom_bar(aes(x = platform, fill = churned), position = "fill") +
  scale_fill_manual(name="Churned", values = c("#83B44B", "#d64747")) +
  ggtitle("Churn by platform") +
  xlab("Platform type") +
  ylab("Share of customers") +
  theme_minimal()

Passives and detractors churned the most, promoters - less likely to churn

The difference is noticeable and statistically significant.

sber %>%
  filter(!is.na(satisfaction)) %>% 
  ggplot() +
  geom_bar(aes(x = satisfaction, fill = churned), position = "fill") +
  scale_fill_manual(name="Churned", values = c("#83B44B", "#d64747")) +
  ggtitle("Churn by satisfaction") +
  xlab("Satisfaction") +
  ylab("Share of customers") +
  theme_minimal()

Churners have a bit larger spendings

The difference is noticeable and statistically significant.

sber %>%
  ggplot() +
  geom_boxplot(aes(x = churned, y = spendings, fill = churned), show.legend = F) +
   scale_fill_manual(values = c("#83B44B", "#d64747")) +
  ggtitle("Spendings amount among churners/stayers") +
  xlab("Churned") +
  ylab("Spendings") +
  scale_y_log10() +
  theme_minimal()

Churners made more orders

The difference is noticeable and statistically significant.

sber %>%
  ggplot() +
  geom_boxplot(aes(x = churned, y = num_orders, fill = churned), show.legend = F) +
   scale_fill_manual(values = c("#83B44B", "#d64747")) +
  ggtitle("Number of orders among churners/stayers") +
  xlab("Churned") +
  ylab("Number of orders") +
  scale_y_log10() +
  theme_minimal()

Predictive model of churn

In this section we would make a model for predicting customer churn.

To predict churn we used number of orders made, spendings, platform used and customer’s belonging to a certain category of satisfaction variable.

library(tidymodels)

set.seed(555)
split = initial_split(sber_churn, prop = 0.8, strata = churned)
train = training(split)
test = testing(split)

set.seed(555)
model1  <- logistic_reg() %>%
  fit(churned~., data = train)

test_pred_class = predict(model1, test)

#summary(model1$fit)

test %>%
  cbind(test_pred_class) %>%
  conf_mat(churned, .pred_class) %>% summary()

#caret::confusionMatrix(test_pred_class$.pred_class, test$churned)

Logistic regression model makes prediction with 0.8 accuracy. The most important variables that help to detect churners are number of orders, spendings, satisfaction and platform used.

Customers may have made a considerable number of purchases and then stayed silent for more than 5 months. This implies that company is losing valuable customers who ordered a lot, but for some reason stopped.
App-users, Detractors and Passives are more likely to be churners, as well.

Model works better in predicting non-churners due to inadequate and inconsistent information about silent customers. However, it stil shows meaningful results

Predictive model of customer satisfaction

It was decided to use only 2 categories in satisfaction prediction: Detractors (as the most troublesome) and Promoters (their opposite).

Several different models were build to predict customer’s satisfaction and the best one was chosen.

Model 1. Logistic regression

The model is performed by logistic regression (as we did with churn).

To predict satisfaction we used number of days passed since last interaction, savings, platform used and gender.

polar = sber %>% filter(satisfaction=="Promoters" | satisfaction =="Detractors")
polar$satisfaction <- ifelse(polar$satisfaction == "Promoters", 0, 1) #Detractors = 1 
polar$satisfaction = as.factor(polar$satisfaction)

polar=polar%>%
  droplevels(polar$satisfaction) %>% na.omit(polar$satisfaction) %>% 
  select(-churned, -user_id, -CE_id, -phone_id, -dw_id, -city, -dw_kind, -os, -mean_rate, -spendings, -avg_check, -age_group)

#dummy variables
polar = dummy_cols(polar,
select_columns = c("gender", "platform")) 

polar <- polar %>% select(-platform, -gender)

polar <- polar %>% mutate_if(sapply(polar, is.numeric), scale_01)

set.seed(1234)
split = initial_split(polar, prop = 0.8, strata = satisfaction)
train1 = training(split)
test1 = testing(split)

set.seed(1234)
model2  <- logistic_reg() %>%
  fit(satisfaction~., data = train1)

test_pred_class1 = predict(model2, test1)

#summary(model2$fit)

test1 %>%
  cbind(test_pred_class1) %>%
  conf_mat(satisfaction, .pred_class) %>% summary()

#caret::confusionMatrix(test_pred_class1$.pred_class, test1$satisfaction)

The model makes prediction with 0.64 accuracy. Specificity (True Negative Rate) equals to 0.75;
The model shows that if customers uses web version, they are more likely to be unsatisfied;
The more days passed from the last interaction with the customer, the more likely their dissatisfaction;
A dissatisfied customer is more likely to be male;
One unit change in number of orders decreases number of dissatisfied customers.

Model 2. Random Forest

Second, a more complex model was build in order to compare it with the previous one, and try to improve performance. Variables used were the same as in the previous model.

set.seed(1234)
rf_mod = rand_forest(mode = "classification", trees = 50, mtry = 4) %>% 
  set_engine('randomForest', importance = TRUE)

wf_rf_mod = workflow() %>% 
  add_model(rf_mod) %>% 
  add_formula(satisfaction~.) %>% 
  fit(train1)

predtest.rf = predict(wf_rf_mod, test1)

# тестовая
accuracyTest.rf_mod = accuracy_vec(test1$satisfaction, predtest.rf$.pred_class)

test1 %>% 
    cbind(predtest.rf) %>% 
    conf_mat(truth = satisfaction, estimate = .pred_class) %>% summary()

#caret::confusionMatrix(predtest.rf$.pred_class, test1$satisfaction)

In terms of accuracy, random forest is a bit better than logistic regression (0.64 vs 0.66), but sensitivity and specificity are more balanced.

However, both satisfaction models agree that dissatisfied customers are more likely:

to use web version;
to be male;
to be churners (time passed since last interaction with them).

Model Inconsistencies:

Churn model showed us that app-users are more likely to stay silent for more than 5 months, but satisfaction model says that customers dislike web version.

Segments

In this section we tried to analyze the churners and customer satisfaction using clusterization.

sber_clean = na.omit(sber)
# variables management
sber_clusters = sber_clean %>%
# deleting id-s and useless variables
select(-user_id, -CE_id, -phone_id, -dw_id, -dw_kind, -timediff_order, -city) %>%
# making gender and churned numerical
mutate(gender = ifelse(gender == "Male", 1, 0)) %>% mutate(churned = ifelse(churned == "yes", 1, 0))

# making factor and character variables numeric
sber_clusters = dummy_cols(sber_clusters,
select_columns = c(
"age_group", "satisfaction", "platform", "os"))

sber_clusters = sber_clusters %>%
# deleting all factor and character variables
select(-satisfaction, -platform, -os, -age_group)

sber_clusters[1:26] = lapply(sber_clusters[1:26], scale_01)
#sber_clusters = sber_clusters %>%
# NaN values
#select(-city_Нальчик)

Now we can think about some variables, that can be used in cluster analysis due to their ability to somehow explain churn/customer satisfaction and can have different values among clusters. After consideration we can define these variables:

gender

Gender often determines a person’s circle of needs, values. Accordingly, the lack of representation of products that are mainly interested in a particular gender, the difference in the range of interests and standards of living (which products to choose, which services to use) can affect customer expectations towards Sbermarket & satisfaction with the service.

age_group

Significant gap in knowledge and familiarity of the Internet and online services, different interests and therefore can create different satisfaction level among different age groups.

Using different os implies that Sbermarket app/website are a little bit different for Android/Mac user. Each app can have its unique features(Apple Pay for Apple users), comfortability of usage and other technical things that can affect customer satisfaction based on different operational systems.

platform

Using mobile app or website can also lead to different satisfaction level. Ability to use Apple/Google Pay in Sbermarket App, a lot more features in mobile app rather than in the official Sbermarket website(section «Favourites», etc.) and other moments, that are more connected with the organisation and technical support of the app/website can lead to different levels of customer satisfaction.

num_orders

Represent how many orders were made by a customer and can be connected with shopping frequency of customers. This can lead to different levels of satisfaction and trust to the service.

Six clusters are optimal

tot_withinss = map_dbl(1:10, function(k){
set.seed(1000)
km = kmeans(x = sber_clusters, centers = k)
km$tot.withinss
})

elbow_bank = data.frame(k = 1:10, tot_withinss = tot_withinss)

ggplot(elbow_bank) +
geom_line(aes(x = k, y = tot_withinss)) +
geom_vline(xintercept = 6, color = "red", alpha = 0.2) +
scale_x_continuous(breaks = 1:10) +
ylab("Total within-cluster sum of squares") +
xlab("K") +
ggtitle("Total within-cluster sum of squares for different k") +
theme_minimal()

This graphs showed that the optimal number of clusters is six.

Next we perform clustering and see what conclusions could be made.

k = 6
set.seed(801)
cluster_set = kmeans(sber_clusters, centers = k)
clusters = cluster_set$cluster
sber_clean = mutate(sber_clean, cluster = clusters)

Detection of problematic clusters

Take a look at what clusters have problems with satisfaction.

ggplot(sber_clean) +
geom_bar(aes(fill = satisfaction, x = as.factor(cluster)), position = "fill") +
scale_fill_brewer(name="Satisfaction", palette = "Purples") +
ylab("Customers share") +
xlab("Cluster") +
ggtitle("Share of satisfaction categories") +
theme_minimal()

According to the graph, cluster 2 and cluster 4 are the most unsatisfied: they has the biggest share of Detractors.

Tests showed that observed differences are statistically significant in relation to each category of interest.

The second and fourth clusters have more observed values for Detractors than expected.

Clusters with churn problems

ggplot(sber_clean) +
geom_bar(aes(fill = churned, x = as.factor(cluster)), position = "fill") +
scale_fill_manual(name = "Churned", values = c("#83B44B", "#d64747")) +
ylab("Customers share") +
xlab("Cluster") +
ggtitle("Share of churners/non-churners") +
theme_minimal()

Graphs show that there is no cluster which churn situation is expressively critical.

Description of problematic clusters

In the previous part we identified 2 problematic clusters. But the only thing we know about it is that it is problematic. Let’s describe by features we selected at the beginning of this part.

After all preparation, we saw that gender, platform, os, savings, spendings variables did not give us any valuable results: our problematic clusters 2 and 4 were not show prevalence of people based on these variables. But one variable seems crusial at this point.

Age Group

ggplot(sber_clean) +
geom_bar(aes(fill = age_group, x = as.factor(cluster)), position = "fill") +
ylab("Customers share") +
xlab("Cluster") +
ggtitle("Share of age groups") +
theme_minimal()

There is only one point that can be observed: cluster 4 has the second lowest share of people who is 45+.Or it can be said in reversed: cluster 4 has the biggest share of people who is younger than 45.

Checking if these difference are statistically significant:

sber_test = sber_clean %>%
mutate(
age = case_when(
age_group == "45-54" ~ ">44",
age_group == "55-64" ~ ">44",
age_group == "65-74" ~ ">44",
age_group == "75 и старше" ~ ">44", TRUE ~ "<45"))

chisq.test(sber_test$age, sber_test$cluster)

## 
##  Pearson's Chi-squared test
## 
## data:  sber_test$age and sber_test$cluster
## X-squared = 131.79, df = 5, p-value < 0.00000000000000022

chisq.test(sber_test$age, sber_test$cluster)$res

##              sber_test$cluster
## sber_test$age          1          2          3          4          5          6
##           <45  2.8681997 -1.3158969 -0.9771127  1.4619151 -1.4362731 -1.2422866
##           >44 -7.5407958  3.4596300  2.5689312 -3.8435271  3.7761117  3.2661009

Everything is fine: test showed that differences in age groups shares are statistically significant.

So, our main finding here is that more unsatisfied clusters are ones with the lowest share of customers 45+. Based on this information we created recommendations for improving customer satisfaction of younger segments that you already saw in the beginning of the report.

Recommendations

As a result of the analysis, we came up with several recommendations that might improve the service and customer satisfaction.

Churn prediction and satisfaction improvement policies:

To reduce the number of dissatisfied customers, the company should improve the web platform, make it more user-friendly. The problems with the web site might be received by directly asking the users (questionnaire). Then improvements in the design should be checked by A/B testing, to find out whether customers like them or not.
However, more churners were identified for the app users. So the customer evaluation of the app also might be needed for making changes.
As churners make many orders, company should offer more loyalty programs for its frequent buyers (bonuses, special offers). For example, this might be increasing of discounts or gifts after the customer makes certain amount of orders. Such as, for making 100 orders the customer will get 20% discount or some product for free. This will stimulate customers use the service, and not to churn.
To decrease the number of young churners, we can provide them with special offers. It might be discounts or gifts from other services that Sbermarket can collaborate with. For example, young people use streaming platforms a lot (Kinopoisk, Okko, etc.), collaborating with them would be efficient. By offering these we can make young people to stay with us and not to move to another competitor service.

References:

Qualtrics - https://www.qualtrics.com/uk/experience-management/customer/net-promoter-score/?rid=ip&prevsite=en&newsite=uk&geo=RU&geomatch=uk
Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. " O’Reilly Media, Inc." (19-42)
Link to the Notion page where we organasied our team work: https://www.notion.so/katestudy/Final-project-BA-a2281ff0df59476aad163f5ccb4783de

Project 2: Analysis of the Sbermarket

The Three: Maria Starodubtseva, Ekaterina Teterina, Xenia Gorochnaya

Summary

Important Metrics

Overview of the current situation

Customer satisfaction

There is a prevailing share of detractors - 2448 customers (39%)

Patterns of detractors and promoters

More male detractors and female promoters

More detractors - 15-24 y.o., more promoters - 35-54 y.o.

More detractors - web platform, more promoters - the app

Promoters rate high, dectractors - low, but not defined have many zeros

Promoters have wider range of number of orders, detractors have smaller number of orders.

Promoters with less inactivity (in days), detractors - with more

Promoters with higher savings, detractors - with lower

Churn

The churn rate is 20%

Patterns of churners

More churners among 45-54 and 55-64 age groups

More churners among app users

Passives and detractors churned the most, promoters - less likely to churn

Churners have a bit larger spendings

Churners made more orders

Predictive model of churn

Predictive model of customer satisfaction

Model 1. Logistic regression

Model 2. Random Forest

Segments

Six clusters are optimal

Detection of problematic clusters

Clusters with churn problems

Description of problematic clusters

Age Group

Recommendations

References: