BA final project

Intro: who we are

Welcome! This report was created by several analysts - Yana Dolgolenok, Denis Kapitonov, Ekaterina Tulubenskaya, and Arthur Pecherskikh. We are 3rd year students of the bachelor program on “Sociology & Social Informatics”. For the last 4 months we attended the “Business Analytics” course and our final task is here.

We were asked to work on the data from the unknown Russian airline and to come up with some ideas of improving the loyalty metric. As there are 3 distinctive parts of the project (+report styling), the specification on the authors are presented in the table below:

(сделаем тут потом табличку!)

Part 0 - libraries & data

library(readr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(wesanderson)
library(cowplot)
library(forcats)

airdata <- read.csv("airdata2.csv")
summary(airdata)
##  Loyalty_card          Gender               Age         Home_city        
##  Length:62072       Length:62072       Min.   : 7.00   Length:62072      
##  Class :character   Class :character   1st Qu.:23.00   Class :character  
##  Mode  :character   Mode  :character   Median :33.00   Mode  :character  
##                                        Mean   :35.45                     
##                                        3rd Qu.:48.00                     
##                                        Max.   :85.00                     
##  days_customer       LTV          Class_modal        Satisfaction_bin  
##  Min.   :   1   Min.   :  34336   Length:62072       Length:62072      
##  1st Qu.: 412   1st Qu.:  89928   Class :character   Class :character  
##  Median : 918   Median : 171989   Mode  :character   Mode  :character  
##  Mean   :1134   Mean   : 290344                                        
##  3rd Qu.:1772   3rd Qu.: 267434                                        
##  Max.   :3431   Max.   :1439592                                        
##  sat_Seat_comfort sat_Time_convenient    sat_Food     sat_Gate_location
##  Min.   :0.000    Min.   :0.000       Min.   :0.000   Min.   :0.000    
##  1st Qu.:2.000    1st Qu.:2.000       1st Qu.:2.000   1st Qu.:2.000    
##  Median :3.000    Median :3.000       Median :3.000   Median :3.000    
##  Mean   :2.793    Mean   :3.019       Mean   :2.732   Mean   :2.967    
##  3rd Qu.:4.000    3rd Qu.:4.000       3rd Qu.:4.000   3rd Qu.:4.000    
##  Max.   :5.000    Max.   :5.000       Max.   :5.000   Max.   :5.000    
##     sat_Wifi     sat_Entertainment sat_online_support
##  Min.   :0.000   Min.   :0.000     Min.   :1.000     
##  1st Qu.:2.000   1st Qu.:2.000     1st Qu.:2.000     
##  Median :3.000   Median :3.000     Median :4.000     
##  Mean   :3.151   Mean   :3.019     Mean   :3.287     
##  3rd Qu.:4.000   3rd Qu.:4.000     3rd Qu.:4.000     
##  Max.   :5.000   Max.   :5.000     Max.   :5.000     
##  sat_ease_of_online_booking sat_onboard_service  sat_legroom   
##  Min.   :1.000              Min.   :1.000       Min.   :0.000  
##  1st Qu.:2.000              1st Qu.:3.000       1st Qu.:2.000  
##  Median :4.000              Median :4.000       Median :4.000  
##  Mean   :3.291              Mean   :3.419       Mean   :3.406  
##  3rd Qu.:4.000              3rd Qu.:4.000       3rd Qu.:5.000  
##  Max.   :5.000              Max.   :5.000       Max.   :5.000  
##  sat_baggage_handling  sat_checkin    sat_clean_salon  sat_boarding  
##  Min.   :1.000        Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000        1st Qu.:3.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :4.000        Median :3.000   Median :4.000   Median :3.000  
##  Mean   :3.743        Mean   :3.362   Mean   :3.763   Mean   :3.236  
##  3rd Qu.:5.000        3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000  
##  Max.   :5.000        Max.   :5.000   Max.   :5.000   Max.   :5.000

Part 1 - exploratory data analysis (EDA)

With no exceptions, EDA is the basic part of any analysis that gives one the general impression about the data and allows to identify the variables relevant for the further work - it is exactly what I tried to do in a relation to the clients’ loyalty. To start with, let’s look at the current situation with the loyalty:

airdata %>% count(Loyalty_card) %>% ggplot(aes(Loyalty_card, n)) + geom_bar(stat = "identity", fill = c("#E6A0C4", "#C6CDF7")) + theme_classic() + labs(title = "The loyality distribution among the clients", x = "", y = "", subtitle = "") + geom_text(aes(Loyalty_card, n, label = n, vjust = -0.5))

The reality is cruel: approximately 0.64 of the clients are currently disloyal. Running a bit further than necessary, I want to extend the understanding of the company’s loses by looking at the clients’ lifetime values (LTV):

ggplot(airdata, aes(Loyalty_card, LTV)) + geom_boxplot(fill = c("#E6A0C4", "#C6CDF7")) + stat_summary(fun = mean, geom = "point", shape = 4, size = 4) + theme_classic() + labs(title = "Loyalty and LTV: how much do we lose?", subtitle = "", x = "")

disloyal_ltv <- airdata %>% filter(Loyalty_card == "Disloyal")

The boxplot presents the distributions inside loyalty categories: the disloyal one has many outliers (meaning it has large dispersion), while the loyal one is more homogeneous. Despite the median values are close to each other, there is still a difference between them: 0.46 of the total LTV is brought by disloyal clients.

An important variable for our considerations is Satisfaction - it is potentially a strong predictor of disloyalty as it captures directly what clients think of the company’s services.

airdata %>% count(Loyalty_card, Satisfaction_bin) %>% ggplot(aes(Loyalty_card, fill = Satisfaction_bin, n)) + geom_bar(stat = "identity", position = "stack") + theme_classic() + labs(title = "Loyalty & Satisfaction", subtitle = "", x = "", y = "", fill = "Clients' state") + scale_fill_manual(values = wes_palette("GrandBudapest1"))

The result is interesting: the loyalty is definitely more associated with satisfaction than dissatisfaction but it is not right for the disloyalty where the shares of satisfaction are approximately of the same size. To tell the truth, there were some debates among my teammates about what of these variables is more applicable for our purpose. So, during the discussions, I explored satisfaction separately as well - and I want to share one of the distinctive parameter with you:

p1 <- airdata %>% count(Gender, Loyalty_card) %>% ggplot(aes(Gender, fill = Loyalty_card, n)) + geom_bar(stat = "identity", position = "stack") + theme_classic() + labs(title = "Gender & loyalty", subtitle = "", x = "", y = "", fill = "Loyalty") + scale_fill_manual(values = wes_palette("GrandBudapest2"))

p2 <- airdata %>% count(Satisfaction_bin, Gender) %>% ggplot(aes(Gender, fill = Satisfaction_bin, n)) + geom_bar(stat = "identity", position = "stack") + theme_classic() + labs(title = "Gender & Satisfaction", subtitle = "", x = "", y = "", fill = "Clients' state") + scale_fill_manual(values = wes_palette("GrandBudapest1"))

plot_grid(p1, p2) + theme_classic()

It is hard to formulate the distinction extensively, so I just summarize the graphs: there is no significant gender effect on loyalty, while the dramatic difference appears in the satisfaction - males are more comfortable with the services (0.85 of males = 25696 clients are satisfied!). And as gender has these effect on satisfaction, we may refer to the last as to the (partial) mediator.

Another variable that should be considered is the class of each flight.

ggplot(airdata, aes(Class_modal, fill = Loyalty_card)) + geom_bar() + theme_classic() + labs(title = "The loyalty categories & flight class types", subtitle = "", x = "", y = "", fill = "Loyalty") + scale_fill_manual(values = wes_palette("GrandBudapest2"))

loyal_class <- airdata %>% filter(Class_modal == "Business") %>% count(Loyalty_card)
disloyal_class <- airdata %>% filter(Class_modal == "Eco Plus") %>% count(Loyalty_card)

To note, the graph below is not only about the classes - it shows their shares in services as well. The “economic” class is the largest one and the “economic plus” class is the smallest. Returning to the loyalty, “business” class shows the highest loyalty (0.76 of these clients are loyal). Different narrative happens with “Eco plus” class - only 0.17 clients are loyal here.

Next, I aggregated the observations by city:

airdata %>% count(Home_city, Loyalty_card) %>%  mutate(Home_city = fct_reorder(Home_city, n)) %>% ggplot(aes(Home_city, n, fill = Loyalty_card)) + geom_bar(stat = "identity") + coord_flip() + theme_classic() + labs(title = "Home cities of the clients & loyalty,", subtitle = "15 cities in total
                                                                                                                                                                                   ", x = "", y = "", fill = "Loyalty") + scale_fill_manual(values = wes_palette("GrandBudapest2")) + geom_text(aes(label = n), check_overlap = TRUE, position = position_stack(0.5))

Obviously, the bigger the city is the more clients come from there. On the other hand, there are “missings” like Ufa - its population is over the million, though it is not covered by the company. But it is not our topic. The deal is that the proportions of loyal clients are in a range between 0.35 and 0.39 (0.37 in fact, as the value 0.39 came from Khabarovsk is an outlier) in all cities:

split <- airdata %>% count(Home_city, Loyalty_card) %>% arrange(Loyalty_card)
loyalcity <- split[16:30,]
disloyalcity <- split[1:15,]
disloyalcity$n_dis <- disloyalcity$n
disloyalcity <- disloyalcity %>% select(Home_city, n_dis)
loyalcity <- loyalcity %>% select(Home_city, n)
split <- loyalcity %>% full_join(disloyalcity)
split$ratio <- round(split$n / (split$n+split$n_dis),2)
split <- arrange(split, ratio)

ggplot(split, aes(ratio)) + geom_boxplot( fill = "pink") + theme_classic() + labs(title = "The homogeneity of loyality among the cities,", subtitle = "15 cities in total
                                                                                  ", x = "The share of loyal clients")

So, city is a poor metric for the further work. Sadly, the company performs equally everywhere, despite the variability of the cities (in terms of region, population, and so on.

The last demographic variable we have is Age. I decided to observe the distribution as it is without the modification to the categorical variable:

ggplot(airdata, aes(Age, fill = Loyalty_card)) +
  geom_histogram(binwidth = 3, col = "black") + theme_classic() + labs(title = "Clients' age distibution & loyalty", subtitle = "", x = "", y = "", fill = "Loyalty", linetype = "Central tendency measures*", caption = "*with the mean age = 35.45 and the median age = 33") + scale_fill_manual(values = wes_palette("GrandBudapest2")) + geom_vline(aes(xintercept = mean(Age), linetype = "mean"), color = c("black")) + geom_vline(aes(xintercept = median(Age), linetype = "median"), color = c("brown")) + theme(plot.caption = element_text(hjust = 2.6))

It is seen from the graph that the middle-aged clients are more likely to be loyal than the representatives of the other ages. As one of the earlier graphs had shown, the loyalty among the clients who flew in “business” class is relatively high - so, the riddle behind high loyalty among middle-aged people is solved. The next chart provides a visual proof:

ggplot(airdata, aes(Age, fill = Class_modal)) +
  geom_histogram(binwidth = 3, col = "black") + theme_classic() + labs(title = "Clients' age distibution & loyalty", subtitle = "", x = "", y = "", fill = "Loyalty", linetype = "Central tendency measures*", caption = "*with the mean age = 35.45 and the median age = 33") + scale_fill_manual(values = wes_palette("Royal2")) + geom_vline(aes(xintercept = mean(Age), linetype = "mean"), color = c("black")) + geom_vline(aes(xintercept = median(Age), linetype = "median"), color = c("brown")) + theme(plot.caption = element_text(hjust = 2.6))

Finally, we need to explore a duration of the relationships between the company and the customers. The days_customer variable was modified (here it shows the year: the “cruel” division into 365 days produces a little bias but it is unnoticeable) for this plot to make the y-scale easier to perceive:

ggplot(airdata, aes(Loyalty_card, days_customer/365)) + geom_boxplot(fill = c("#E6A0C4", "#C6CDF7")) + stat_summary(fun = mean, geom = "point", shape = 4, size = 4) + theme_classic() + labs(title = "Loyalty & time clients stay with us,", subtitle = "in years
                                                                                                                                                                                          ", x = "", y = "time") 

distime <- airdata %>% filter(Loyalty_card == "Disloyal") 
loyaltime <- airdata %>% filter(Loyalty_card == "Loyal")
mean1 <- mean(distime$days_customer / 365)
mean2 <- mean(loyaltime$days_customer / 365)

This result is thought-provoking: the more clients use company’s services, the more they are likely to be disloyal. The mean difference (1.26 year) is about a year and 3 months. That is a pity!

Summary & metrics

According to the presented analysis, there are several metrics that are important for the loyalty monitoring:

  1. Satisfaction, as this binary variable matches the loyalty pretty well. In addition, gender affects satisfaction thus affecting the loyalty but it is not necessary to track the gender proportions separately. On the other hand, to suggest some special offers basing on high disloyalty among women might be a possible solution.
  2. The share of disloyal & loyal clients in the total LTV. In fact, the company gets profit from the disloyal clients because they are not disloyal from the beginning. So, the metric purpose is to show how “healthy” our business is.
  3. The shares of loyalty among the class groups - at least, these values should not become smaller. For example, if the proportion of loyal people who used “business” class goes down, the company fails in its “safe segment”. Note: it seems that Age input might be largely influenced by class, so it should not be used.
  4. Finally, the time gap between disloyal & loyal categories is needed to be tracked: it shows the longitudinal side of the company-client relation. The larger it is, the worse: a company cannot be developed if it relies on the “fresh” audience all the time.

As for the hypotheses, I would briefly formulate them as well:

  1. LTV- and time- related variables are the most significant predictors of loyalty.
  2. The variables of Age, gender, and Home_city are not significant for the analysis.

That is the end of the EDA part! Thanks!