library(readr)
library(dplyr)
library(ggplot2)
library(rstatix)
library(plotly)
options(scipen = 99999)
scale_01 <- function(x){(x-min(x))/(max(x)-min(x))}
orders <- read_csv("OrdersFinance.csv")
customers <- read_csv("CustomerInfo.csv")
experience <- read_csv("CustomerExperience.csv")
delivery <- read_csv("CustomerDelivery.csv")
cust <- subset(customers,!duplicated(customers$user_id))
sber <- cust %>% inner_join(experience, by="user_id") %>% inner_join(delivery, by="user_id") %>% inner_join(orders, by="user_id") %>% select(-X1.x, -X1.y, -X1.x.x, -X1.y.y)
#factor variables
sber$gender <- as.factor(sber$gender)
levels(sber$gender) <- c("Female", "Male")
sber$age_group <- as.factor(sber$age_group)
sber$city <- as.factor(sber$city)
sber$satisfaction <- as.factor(sber$satisfaction)
sber$dw_kind <- as.factor(sber$dw_kind)
sber$platform <- as.factor(sber$platform)
sber$os <- as.factor(sber$os)
In this project we did exploratory analysis demonstrating important business metrics and identifying troubles of the service. We build logistic regression to predict churn and two models for customer satisfaction prediction. We identified customer clusters and analyzed them, as well. Lastly, we came up with certain churn prediction strategies and recommendations.
The most important notices:
Churned customers are identified by bigger number of orders, application as a platform used, low satisfaction and lower spendings;
Customer’s dissatisfaction is characterized by web platform used, longer time passed since last interaction and male gender;
Two problematic clusters were identified, but significant differences among their characteristics were found only in terms of age: 4th is characterized by prevailing number of young people (<45).
Customer satisfaction (measured with NPS - Net Promoter Score);
Churned customers (customers that stayed silent for more than 5 months);
Socio-demographic characteristics of customers (gender, age groups);
Consumption patterns (os, platform);
Number of orders may be considered as frequency of orders.
For better understanding of who are promoters, passives and detractors look at the pictures [1].
_________________________________________________________________________________________________
Firstly, we would like to draw your attention to customer satisfaction. The metric that has four categories: Detractors, NotDefined, Passives and Promoters.
Next we will show certain patterns of Detractors and Promoters that could be traced in the dataset.
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(satisfaction)) %>%
ggplot() +
geom_bar(aes(x = gender, fill = satisfaction), position = "fill") +
scale_fill_brewer(name= "Satisfaction", palette = "Purples") +
ggtitle("Satisfaction by gender") +
xlab("Gender") +
ylab("Share of customers") +
theme_minimal()
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(satisfaction)) %>%
filter(!is.na(age_group)) %>%
ggplot() +
geom_bar(aes(x = age_group, fill = satisfaction), position = "fill") +
scale_fill_brewer(name="Satisfaction", palette = "Purples") +
ggtitle("Satisfaction by age groups") +
xlab("Age group") +
ylab("Share of customers") +
theme_minimal()
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(satisfaction)) %>%
filter(!is.na(platform)) %>%
ggplot() +
geom_bar(aes(x = platform, fill = satisfaction), position = "fill") +
scale_fill_brewer(name= "Satisfaction", palette = "Purples") +
ggtitle("Satisfaction by platform") +
xlab("Platform type") +
ylab("Share of customers") +
theme_minimal()
So those who satisfied more, they rate better. This seems reasonable.
sber %>%
filter(!is.na(satisfaction)) %>%
ggplot() +
geom_boxplot(aes(x = satisfaction, y = mean_rate, fill = satisfaction), show.legend = F) +
scale_fill_brewer(palette = "Purples") +
ggtitle("Mean ratings in satisfaction categories") +
xlab("Satisfaction") +
ylab("Mean rate") +
theme_minimal()
NotDefined category gave only zero ratings (550), so mean rating variable becomes skewed and not usable in further analysis. Perhaps, some problems arose during data collection or storage which resulted in many zeros in one category.
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(satisfaction)) %>%
ggplot() +
geom_boxplot(aes(x = satisfaction, y = num_orders, fill = satisfaction), show.legend = F) +
scale_fill_brewer(palette = "Purples") +
ggtitle("Number of orders in satisfaction categories") +
xlab("Satisfaction") +
ylab("Number of orders") +
scale_y_log10() +
theme_minimal()
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(satisfaction)) %>%
ggplot() +
geom_boxplot(aes(x = satisfaction, y = timediff_order, fill = satisfaction), show.legend = F) +
scale_fill_brewer(palette = "Purples") +
ggtitle("Time since last order by satisfaction categories") +
xlab("Satisfaction") +
ylab("Days since last order") +
theme_minimal()
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(satisfaction)) %>%
ggplot() +
geom_boxplot(aes(x = satisfaction, y = savings, fill = satisfaction), show.legend = F) +
scale_fill_brewer(palette = "Purples") +
ggtitle("Savings by satisfaction categories") +
xlab("Satisfaction") +
ylab("Savings amount") +
scale_y_log10() +
theme_minimal()
Secondly, we would like to draw your attention to churn. The metric was created, assuming that customers who were inactive for more than 150 (~ 5 months) are churners.
sber <- sber %>% mutate(churned = case_when(timediff_order > 150 ~ "yes", TRUE ~ "no"))
sber$churned = as.factor(sber$churned)
churn <- sber %>%
group_by(churned) %>%
summarize(count = n()) %>%
mutate(pct = count/sum(count))
c <- ggplot(churn, aes(churned, pct, fill = churned, text = count)) +
scale_fill_manual(values = c("#83B44B", "#d64747")) +
geom_bar(stat='identity', show.legend = FALSE) +
geom_text(aes(label=scales::percent(pct)), position = position_stack(vjust = .5))+
scale_y_continuous(labels = scales::percent) +
ggtitle("Number of customers who churned/stayed in %") +
xlab("Churned") +
ylab("Percent of customers") +
theme_minimal()
hide_legend(ggplotly(c, tooltip= c("x", "count")))
Click on the graph to make it interactive
Next we will show certain patterns of churners that could be traced in the dataset.
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(age_group)) %>%
ggplot() +
geom_bar(aes(x = age_group, fill = churned), position = "fill") +
scale_fill_manual(name="Churned", values = c("#83B44B", "#d64747")) +
ggtitle("Churn by age groups") +
xlab("Age group") +
ylab("Share of customers") +
theme_minimal()
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(platform)) %>%
ggplot() +
geom_bar(aes(x = platform, fill = churned), position = "fill") +
scale_fill_manual(name="Churned", values = c("#83B44B", "#d64747")) +
ggtitle("Churn by platform") +
xlab("Platform type") +
ylab("Share of customers") +
theme_minimal()
The difference is noticeable and statistically significant.
sber %>%
filter(!is.na(satisfaction)) %>%
ggplot() +
geom_bar(aes(x = satisfaction, fill = churned), position = "fill") +
scale_fill_manual(name="Churned", values = c("#83B44B", "#d64747")) +
ggtitle("Churn by satisfaction") +
xlab("Satisfaction") +
ylab("Share of customers") +
theme_minimal()
The difference is noticeable and statistically significant.
sber %>%
ggplot() +
geom_boxplot(aes(x = churned, y = spendings, fill = churned), show.legend = F) +
scale_fill_manual(values = c("#83B44B", "#d64747")) +
ggtitle("Spendings amount among churners/stayers") +
xlab("Churned") +
ylab("Spendings") +
scale_y_log10() +
theme_minimal()
The difference is noticeable and statistically significant.
sber %>%
ggplot() +
geom_boxplot(aes(x = churned, y = num_orders, fill = churned), show.legend = F) +
scale_fill_manual(values = c("#83B44B", "#d64747")) +
ggtitle("Number of orders among churners/stayers") +
xlab("Churned") +
ylab("Number of orders") +
scale_y_log10() +
theme_minimal()
In this section we would make a model for predicting customer churn.
To predict churn we used number of orders made, spendings, platform used and customer’s belonging to a certain category of satisfaction variable.
library(tidymodels)
set.seed(555)
split = initial_split(sber_churn, prop = 0.8, strata = churned)
train = training(split)
test = testing(split)
set.seed(555)
model1 <- logistic_reg() %>%
fit(churned~., data = train)
test_pred_class = predict(model1, test)
#summary(model1$fit)
test %>%
cbind(test_pred_class) %>%
conf_mat(churned, .pred_class) %>% summary()
#caret::confusionMatrix(test_pred_class$.pred_class, test$churned)
Logistic regression model makes prediction with 0.8 accuracy. The most important variables that help to detect churners are number of orders, spendings, satisfaction and platform used.
Customers may have made a considerable number of purchases and then stayed silent for more than 5 months. This implies that company is losing valuable customers who ordered a lot, but for some reason stopped.
App-users, Detractors and Passives are more likely to be churners, as well.
Model works better in predicting non-churners due to inadequate and inconsistent information about silent customers. However, it stil shows meaningful results
It was decided to use only 2 categories in satisfaction prediction: Detractors (as the most troublesome) and Promoters (their opposite).
Several different models were build to predict customer’s satisfaction and the best one was chosen.
The model is performed by logistic regression (as we did with churn).
To predict satisfaction we used number of days passed since last interaction, savings, platform used and gender.
polar = sber %>% filter(satisfaction=="Promoters" | satisfaction =="Detractors")
polar$satisfaction <- ifelse(polar$satisfaction == "Promoters", 0, 1) #Detractors = 1
polar$satisfaction = as.factor(polar$satisfaction)
polar=polar%>%
droplevels(polar$satisfaction) %>% na.omit(polar$satisfaction) %>%
select(-churned, -user_id, -CE_id, -phone_id, -dw_id, -city, -dw_kind, -os, -mean_rate, -spendings, -avg_check, -age_group)
#dummy variables
polar = dummy_cols(polar,
select_columns = c("gender", "platform"))
polar <- polar %>% select(-platform, -gender)
polar <- polar %>% mutate_if(sapply(polar, is.numeric), scale_01)
set.seed(1234)
split = initial_split(polar, prop = 0.8, strata = satisfaction)
train1 = training(split)
test1 = testing(split)
set.seed(1234)
model2 <- logistic_reg() %>%
fit(satisfaction~., data = train1)
test_pred_class1 = predict(model2, test1)
#summary(model2$fit)
test1 %>%
cbind(test_pred_class1) %>%
conf_mat(satisfaction, .pred_class) %>% summary()
#caret::confusionMatrix(test_pred_class1$.pred_class, test1$satisfaction)
The model makes prediction with 0.64 accuracy. Specificity (True Negative Rate) equals to 0.75;
The model shows that if customers uses web version, they are more likely to be unsatisfied;
The more days passed from the last interaction with the customer, the more likely their dissatisfaction;
A dissatisfied customer is more likely to be male;
One unit change in number of orders decreases number of dissatisfied customers.
Second, a more complex model was build in order to compare it with the previous one, and try to improve performance. Variables used were the same as in the previous model.
set.seed(1234)
rf_mod = rand_forest(mode = "classification", trees = 50, mtry = 4) %>%
set_engine('randomForest', importance = TRUE)
wf_rf_mod = workflow() %>%
add_model(rf_mod) %>%
add_formula(satisfaction~.) %>%
fit(train1)
predtest.rf = predict(wf_rf_mod, test1)
# тестовая
accuracyTest.rf_mod = accuracy_vec(test1$satisfaction, predtest.rf$.pred_class)
test1 %>%
cbind(predtest.rf) %>%
conf_mat(truth = satisfaction, estimate = .pred_class) %>% summary()
#caret::confusionMatrix(predtest.rf$.pred_class, test1$satisfaction)
In terms of accuracy, random forest is a bit better than logistic regression (0.64 vs 0.66), but sensitivity and specificity are more balanced.
However, both satisfaction models agree that dissatisfied customers are more likely:
to use web version;
to be male;
to be churners (time passed since last interaction with them).
Model Inconsistencies:
Churn model showed us that app-users are more likely to stay silent for more than 5 months, but satisfaction model says that customers dislike web version.
In this section we tried to analyze the churners and customer satisfaction using clusterization.
sber_clean = na.omit(sber)
# variables management
sber_clusters = sber_clean %>%
# deleting id-s and useless variables
select(-user_id, -CE_id, -phone_id, -dw_id, -dw_kind, -timediff_order, -city) %>%
# making gender and churned numerical
mutate(gender = ifelse(gender == "Male", 1, 0)) %>% mutate(churned = ifelse(churned == "yes", 1, 0))
# making factor and character variables numeric
sber_clusters = dummy_cols(sber_clusters,
select_columns = c(
"age_group", "satisfaction", "platform", "os"))
sber_clusters = sber_clusters %>%
# deleting all factor and character variables
select(-satisfaction, -platform, -os, -age_group)
sber_clusters[1:26] = lapply(sber_clusters[1:26], scale_01)
#sber_clusters = sber_clusters %>%
# NaN values
#select(-city_Нальчик)
Now we can think about some variables, that can be used in cluster analysis due to their ability to somehow explain churn/customer satisfaction and can have different values among clusters. After consideration we can define these variables:
gender
Gender often determines a person’s circle of needs, values. Accordingly, the lack of representation of products that are mainly interested in a particular gender, the difference in the range of interests and standards of living (which products to choose, which services to use) can affect customer expectations towards Sbermarket & satisfaction with the service.
age_group
Significant gap in knowledge and familiarity of the Internet and online services, different interests and therefore can create different satisfaction level among different age groups.
os
Using different os implies that Sbermarket app/website are a little bit different for Android/Mac user. Each app can have its unique features(Apple Pay for Apple users), comfortability of usage and other technical things that can affect customer satisfaction based on different operational systems.
platform
Using mobile app or website can also lead to different satisfaction level. Ability to use Apple/Google Pay in Sbermarket App, a lot more features in mobile app rather than in the official Sbermarket website(section «Favourites», etc.) and other moments, that are more connected with the organisation and technical support of the app/website can lead to different levels of customer satisfaction.
num_orders
Represent how many orders were made by a customer and can be connected with shopping frequency of customers. This can lead to different levels of satisfaction and trust to the service.
tot_withinss = map_dbl(1:10, function(k){
set.seed(1000)
km = kmeans(x = sber_clusters, centers = k)
km$tot.withinss
})
elbow_bank = data.frame(k = 1:10, tot_withinss = tot_withinss)
ggplot(elbow_bank) +
geom_line(aes(x = k, y = tot_withinss)) +
geom_vline(xintercept = 6, color = "red", alpha = 0.2) +
scale_x_continuous(breaks = 1:10) +
ylab("Total within-cluster sum of squares") +
xlab("K") +
ggtitle("Total within-cluster sum of squares for different k") +
theme_minimal()
This graphs showed that the optimal number of clusters is six.
Next we perform clustering and see what conclusions could be made.
k = 6
set.seed(801)
cluster_set = kmeans(sber_clusters, centers = k)
clusters = cluster_set$cluster
sber_clean = mutate(sber_clean, cluster = clusters)
Take a look at what clusters have problems with satisfaction.
ggplot(sber_clean) +
geom_bar(aes(fill = satisfaction, x = as.factor(cluster)), position = "fill") +
scale_fill_brewer(name="Satisfaction", palette = "Purples") +
ylab("Customers share") +
xlab("Cluster") +
ggtitle("Share of satisfaction categories") +
theme_minimal()
According to the graph, cluster 2 and cluster 4 are the most unsatisfied: they has the biggest share of Detractors.
Tests showed that observed differences are statistically significant in relation to each category of interest.
The second and fourth clusters have more observed values for Detractors than expected.
ggplot(sber_clean) +
geom_bar(aes(fill = churned, x = as.factor(cluster)), position = "fill") +
scale_fill_manual(name = "Churned", values = c("#83B44B", "#d64747")) +
ylab("Customers share") +
xlab("Cluster") +
ggtitle("Share of churners/non-churners") +
theme_minimal()
Graphs show that there is no cluster which churn situation is expressively critical.
In the previous part we identified 2 problematic clusters. But the only thing we know about it is that it is problematic. Let’s describe by features we selected at the beginning of this part.
After all preparation, we saw that gender, platform, os, savings, spendings variables did not give us any valuable results: our problematic clusters 2 and 4 were not show prevalence of people based on these variables. But one variable seems crusial at this point.
ggplot(sber_clean) +
geom_bar(aes(fill = age_group, x = as.factor(cluster)), position = "fill") +
ylab("Customers share") +
xlab("Cluster") +
ggtitle("Share of age groups") +
theme_minimal()
There is only one point that can be observed: cluster 4 has the second lowest share of people who is 45+.Or it can be said in reversed: cluster 4 has the biggest share of people who is younger than 45.
Checking if these difference are statistically significant:
sber_test = sber_clean %>%
mutate(
age = case_when(
age_group == "45-54" ~ ">44",
age_group == "55-64" ~ ">44",
age_group == "65-74" ~ ">44",
age_group == "75 и старше" ~ ">44", TRUE ~ "<45"))
chisq.test(sber_test$age, sber_test$cluster)
##
## Pearson's Chi-squared test
##
## data: sber_test$age and sber_test$cluster
## X-squared = 131.79, df = 5, p-value < 0.00000000000000022
chisq.test(sber_test$age, sber_test$cluster)$res
## sber_test$cluster
## sber_test$age 1 2 3 4 5 6
## <45 2.8681997 -1.3158969 -0.9771127 1.4619151 -1.4362731 -1.2422866
## >44 -7.5407958 3.4596300 2.5689312 -3.8435271 3.7761117 3.2661009
Everything is fine: test showed that differences in age groups shares are statistically significant.
So, our main finding here is that more unsatisfied clusters are ones with the lowest share of customers 45+. Based on this information we created recommendations for improving customer satisfaction of younger segments that you already saw in the beginning of the report.
As a result of the analysis, we came up with several recommendations that might improve the service and customer satisfaction.
Churn prediction and satisfaction improvement policies:
To reduce the number of dissatisfied customers, the company should improve the web platform, make it more user-friendly. The problems with the web site might be received by directly asking the users (questionnaire). Then improvements in the design should be checked by A/B testing, to find out whether customers like them or not.
However, more churners were identified for the app users. So the customer evaluation of the app also might be needed for making changes.
As churners make many orders, company should offer more loyalty programs for its frequent buyers (bonuses, special offers). For example, this might be increasing of discounts or gifts after the customer makes certain amount of orders. Such as, for making 100 orders the customer will get 20% discount or some product for free. This will stimulate customers use the service, and not to churn.
To decrease the number of young churners, we can provide them with special offers. It might be discounts or gifts from other services that Sbermarket can collaborate with. For example, young people use streaming platforms a lot (Kinopoisk, Okko, etc.), collaborating with them would be efficient. By offering these we can make young people to stay with us and not to move to another competitor service.
Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. " O’Reilly Media, Inc." (19-42)
Link to the Notion page where we organasied our team work: https://www.notion.so/katestudy/Final-project-BA-a2281ff0df59476aad163f5ccb4783de