Welcome! This report was created by several analysts - Yana Dolgolenok, Denis Kapitonov, Ekaterina Tulubenskaya, and Arthur Pecherskikh. We are 3rd year students of the bachelor program on “Sociology & Social Informatics”. For the last 4 months we attended the “Business Analytics” course and our final task is here.
We were asked to work on the data from the unknown Russian airline and to come up with some ideas of improving the loyalty metric. As there are 3 distinctive parts of the project (+report styling), the specification on the authors are presented in the table below:
(сделаем тут потом табличку!)
library(readr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(wesanderson)
library(cowplot)
library(forcats)
airdata <- read.csv("airdata2.csv")
summary(airdata)
## Loyalty_card Gender Age Home_city
## Length:62072 Length:62072 Min. : 7.00 Length:62072
## Class :character Class :character 1st Qu.:23.00 Class :character
## Mode :character Mode :character Median :33.00 Mode :character
## Mean :35.45
## 3rd Qu.:48.00
## Max. :85.00
## days_customer LTV Class_modal Satisfaction_bin
## Min. : 1 Min. : 34336 Length:62072 Length:62072
## 1st Qu.: 412 1st Qu.: 89928 Class :character Class :character
## Median : 918 Median : 171989 Mode :character Mode :character
## Mean :1134 Mean : 290344
## 3rd Qu.:1772 3rd Qu.: 267434
## Max. :3431 Max. :1439592
## sat_Seat_comfort sat_Time_convenient sat_Food sat_Gate_location
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.793 Mean :3.019 Mean :2.732 Mean :2.967
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## sat_Wifi sat_Entertainment sat_online_support
## Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :4.000
## Mean :3.151 Mean :3.019 Mean :3.287
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
## sat_ease_of_online_booking sat_onboard_service sat_legroom
## Min. :1.000 Min. :1.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :4.000 Median :4.000
## Mean :3.291 Mean :3.419 Mean :3.406
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
## sat_baggage_handling sat_checkin sat_clean_salon sat_boarding
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :3.000 Median :4.000 Median :3.000
## Mean :3.743 Mean :3.362 Mean :3.763 Mean :3.236
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
With no exceptions, EDA is the basic part of any analysis that gives one the general impression about the data and allows to identify the variables relevant for the further work - it is exactly what I tried to do in a relation to the clients’ loyalty. To start with, let’s look at the current situation with the loyalty:
airdata %>% count(Loyalty_card) %>% ggplot(aes(Loyalty_card, n)) + geom_bar(stat = "identity", fill = c("#E6A0C4", "#C6CDF7")) + theme_classic() + labs(title = "The loyality distribution among the clients", x = "", y = "", subtitle = "") + geom_text(aes(Loyalty_card, n, label = n, vjust = -0.5))
The reality is cruel: approximately 0.64 of the clients are currently disloyal. Running a bit further than necessary, I want to extend the understanding of the company’s loses by looking at the clients’ lifetime values (LTV):
ggplot(airdata, aes(Loyalty_card, LTV)) + geom_boxplot(fill = c("#E6A0C4", "#C6CDF7")) + stat_summary(fun = mean, geom = "point", shape = 4, size = 4) + theme_classic() + labs(title = "Loyalty and LTV: how much do we lose?", subtitle = "", x = "")
disloyal_ltv <- airdata %>% filter(Loyalty_card == "Disloyal")
The boxplot presents the distributions inside loyalty categories: the disloyal one has many outliers (meaning it has large dispersion), while the loyal one is more homogeneous. Despite the median values are close to each other, there is still a difference between them: 0.46 of the total LTV is brought by disloyal clients.
An important variable for our considerations is Satisfaction - it is potentially a strong predictor of disloyalty as it captures directly what clients think of the company’s services.
airdata %>% count(Loyalty_card, Satisfaction_bin) %>% ggplot(aes(Loyalty_card, fill = Satisfaction_bin, n)) + geom_bar(stat = "identity", position = "stack") + theme_classic() + labs(title = "Loyalty & Satisfaction", subtitle = "", x = "", y = "", fill = "Clients' state") + scale_fill_manual(values = wes_palette("GrandBudapest1"))
The result is interesting: the loyalty is definitely more associated with satisfaction than dissatisfaction but it is not right for the disloyalty where the shares of satisfaction are approximately of the same size. To tell the truth, there were some debates among my teammates about what of these variables is more applicable for our purpose. So, during the discussions, I explored satisfaction separately as well - and I want to share one of the distinctive parameter with you:
p1 <- airdata %>% count(Gender, Loyalty_card) %>% ggplot(aes(Gender, fill = Loyalty_card, n)) + geom_bar(stat = "identity", position = "stack") + theme_classic() + labs(title = "Gender & loyalty", subtitle = "", x = "", y = "", fill = "Loyalty") + scale_fill_manual(values = wes_palette("GrandBudapest2"))
p2 <- airdata %>% count(Satisfaction_bin, Gender) %>% ggplot(aes(Gender, fill = Satisfaction_bin, n)) + geom_bar(stat = "identity", position = "stack") + theme_classic() + labs(title = "Gender & Satisfaction", subtitle = "", x = "", y = "", fill = "Clients' state") + scale_fill_manual(values = wes_palette("GrandBudapest1"))
plot_grid(p1, p2) + theme_classic()
It is hard to formulate the distinction extensively, so I just summarize the graphs: there is no significant gender effect on loyalty, while the dramatic difference appears in the satisfaction - males are more comfortable with the services (0.85 of males = 25696 clients are satisfied!). And as gender has these effect on satisfaction, we may refer to the last as to the (partial) mediator.
Another variable that should be considered is the class of each flight.
ggplot(airdata, aes(Class_modal, fill = Loyalty_card)) + geom_bar() + theme_classic() + labs(title = "The loyalty categories & flight class types", subtitle = "", x = "", y = "", fill = "Loyalty") + scale_fill_manual(values = wes_palette("GrandBudapest2"))
loyal_class <- airdata %>% filter(Class_modal == "Business") %>% count(Loyalty_card)
disloyal_class <- airdata %>% filter(Class_modal == "Eco Plus") %>% count(Loyalty_card)
To note, the graph below is not only about the classes - it shows their shares in services as well. The “economic” class is the largest one and the “economic plus” class is the smallest. Returning to the loyalty, “business” class shows the highest loyalty (0.76 of these clients are loyal). Different narrative happens with “Eco plus” class - only 0.17 clients are loyal here.
Next, I aggregated the observations by city:
airdata %>% count(Home_city, Loyalty_card) %>% mutate(Home_city = fct_reorder(Home_city, n)) %>% ggplot(aes(Home_city, n, fill = Loyalty_card)) + geom_bar(stat = "identity") + coord_flip() + theme_classic() + labs(title = "Home cities of the clients & loyalty,", subtitle = "15 cities in total
", x = "", y = "", fill = "Loyalty") + scale_fill_manual(values = wes_palette("GrandBudapest2")) + geom_text(aes(label = n), check_overlap = TRUE, position = position_stack(0.5))
Obviously, the bigger the city is the more clients come from there. On the other hand, there are “missings” like Ufa - its population is over the million, though it is not covered by the company. But it is not our topic. The deal is that the proportions of loyal clients are in a range between 0.35 and 0.39 (0.37 in fact, as the value 0.39 came from Khabarovsk is an outlier) in all cities:
split <- airdata %>% count(Home_city, Loyalty_card) %>% arrange(Loyalty_card)
loyalcity <- split[16:30,]
disloyalcity <- split[1:15,]
disloyalcity$n_dis <- disloyalcity$n
disloyalcity <- disloyalcity %>% select(Home_city, n_dis)
loyalcity <- loyalcity %>% select(Home_city, n)
split <- loyalcity %>% full_join(disloyalcity)
split$ratio <- round(split$n / (split$n+split$n_dis),2)
split <- arrange(split, ratio)
ggplot(split, aes(ratio)) + geom_boxplot( fill = "pink") + theme_classic() + labs(title = "The homogeneity of loyality among the cities,", subtitle = "15 cities in total
", x = "The share of loyal clients")
So, city is a poor metric for the further work. Sadly, the company performs equally everywhere, despite the variability of the cities (in terms of region, population, and so on.
The last demographic variable we have is Age. I decided to observe the distribution as it is without the modification to the categorical variable:
ggplot(airdata, aes(Age, fill = Loyalty_card)) +
geom_histogram(binwidth = 3, col = "black") + theme_classic() + labs(title = "Clients' age distibution & loyalty", subtitle = "", x = "", y = "", fill = "Loyalty", linetype = "Central tendency measures*", caption = "*with the mean age = 35.45 and the median age = 33") + scale_fill_manual(values = wes_palette("GrandBudapest2")) + geom_vline(aes(xintercept = mean(Age), linetype = "mean"), color = c("black")) + geom_vline(aes(xintercept = median(Age), linetype = "median"), color = c("brown")) + theme(plot.caption = element_text(hjust = 2.6))
It is seen from the graph that the middle-aged clients are more likely to be loyal than the representatives of the other ages. As one of the earlier graphs had shown, the loyalty among the clients who flew in “business” class is relatively high - so, the riddle behind high loyalty among middle-aged people is solved. The next chart provides a visual proof:
ggplot(airdata, aes(Age, fill = Class_modal)) +
geom_histogram(binwidth = 3, col = "black") + theme_classic() + labs(title = "Clients' age distibution & loyalty", subtitle = "", x = "", y = "", fill = "Loyalty", linetype = "Central tendency measures*", caption = "*with the mean age = 35.45 and the median age = 33") + scale_fill_manual(values = wes_palette("Royal2")) + geom_vline(aes(xintercept = mean(Age), linetype = "mean"), color = c("black")) + geom_vline(aes(xintercept = median(Age), linetype = "median"), color = c("brown")) + theme(plot.caption = element_text(hjust = 2.6))
Finally, we need to explore a duration of the relationships between the company and the customers. The days_customer variable was modified (here it shows the year: the “cruel” division into 365 days produces a little bias but it is unnoticeable) for this plot to make the y-scale easier to perceive:
ggplot(airdata, aes(Loyalty_card, days_customer/365)) + geom_boxplot(fill = c("#E6A0C4", "#C6CDF7")) + stat_summary(fun = mean, geom = "point", shape = 4, size = 4) + theme_classic() + labs(title = "Loyalty & time clients stay with us,", subtitle = "in years
", x = "", y = "time")
distime <- airdata %>% filter(Loyalty_card == "Disloyal")
loyaltime <- airdata %>% filter(Loyalty_card == "Loyal")
mean1 <- mean(distime$days_customer / 365)
mean2 <- mean(loyaltime$days_customer / 365)
This result is thought-provoking: the more clients use company’s services, the more they are likely to be disloyal. The mean difference (1.26 year) is about a year and 3 months. That is a pity!
According to the presented analysis, there are several metrics that are important for the loyalty monitoring:
As for the hypotheses, I would briefly formulate them as well:
That is the end of the EDA part! Thanks!