Customer Satisfaction Analysis

Introduction

This dataset contains information about airline customer satisfaction from Kaggle. We are going to analyze the data, build models, and predict customer satisfaction using Logistic Regression and K-Nearest Neighbor.

First, import the libraries we need.

library(dplyr)
library(tidyverse)
library(MLmetrics)
library(lmtest)
library(rsample)
library(class)
library(caret)
library(car)

Data Preparation

Input Data

Read the data and store it in the airline object. We set stringsAsFactors = TRUE so that all character columns are automatically stored as factors.

airline <- read.csv("Invistico_Airline.csv", stringsAsFactors = TRUE)

Preview the data:

head(airline)

Data Structure

Check the number of columns and rows.

dim(airline)
## [1] 129880     23

The data contains 129,880 rows and 23 columns.

View all columns and their data types.

glimpse(airline)
## Rows: 129,880
## Columns: 23
## $ satisfaction                      <fct> satisfied, satisfied, satisfied, sat~
## $ Gender                            <fct> Female, Male, Female, Female, Female~
## $ Customer.Type                     <fct> Loyal Customer, Loyal Customer, Loya~
## $ Age                               <int> 65, 47, 15, 60, 70, 30, 66, 10, 56, ~
## $ Type.of.Travel                    <fct> Personal Travel, Personal Travel, Pe~
## $ Class                             <fct> Eco, Business, Eco, Eco, Eco, Eco, E~
## $ Flight.Distance                   <int> 265, 2464, 2138, 623, 354, 1894, 227~
## $ Seat.comfort                      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Departure.Arrival.time.convenient <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Food.and.drink                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Gate.location                     <int> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, ~
## $ Inflight.wifi.service             <int> 2, 0, 2, 3, 4, 2, 2, 2, 5, 2, 3, 2, ~
## $ Inflight.entertainment            <int> 4, 2, 0, 4, 3, 0, 5, 0, 3, 0, 3, 0, ~
## $ Online.support                    <int> 2, 2, 2, 3, 4, 2, 5, 2, 5, 2, 3, 2, ~
## $ Ease.of.Online.booking            <int> 3, 3, 2, 1, 2, 2, 5, 2, 4, 2, 3, 2, ~
## $ On.board.service                  <int> 3, 4, 3, 1, 2, 5, 5, 3, 4, 2, 3, 3, ~
## $ Leg.room.service                  <int> 0, 4, 3, 0, 0, 4, 0, 3, 0, 4, 0, 2, ~
## $ Baggage.handling                  <int> 3, 4, 4, 1, 2, 5, 5, 4, 1, 5, 1, 5, ~
## $ Checkin.service                   <int> 5, 2, 4, 4, 4, 5, 5, 5, 5, 3, 2, 2, ~
## $ Cleanliness                       <int> 3, 3, 4, 1, 2, 4, 5, 4, 4, 4, 3, 5, ~
## $ Online.boarding                   <int> 2, 2, 2, 3, 5, 2, 3, 2, 4, 2, 5, 2, ~
## $ Departure.Delay.in.Minutes        <int> 0, 310, 0, 0, 0, 0, 17, 0, 0, 30, 47~
## $ Arrival.Delay.in.Minutes          <int> 0, 305, 0, 0, 0, 0, 15, 0, 0, 26, 48~

The data types of all columns are correct.

Pre-processing Data

Check for missing values.

colSums(is.na(airline))
##                      satisfaction                            Gender 
##                                 0                                 0 
##                     Customer.Type                               Age 
##                                 0                                 0 
##                    Type.of.Travel                             Class 
##                                 0                                 0 
##                   Flight.Distance                      Seat.comfort 
##                                 0                                 0 
## Departure.Arrival.time.convenient                    Food.and.drink 
##                                 0                                 0 
##                     Gate.location             Inflight.wifi.service 
##                                 0                                 0 
##            Inflight.entertainment                    Online.support 
##                                 0                                 0 
##            Ease.of.Online.booking                  On.board.service 
##                                 0                                 0 
##                  Leg.room.service                  Baggage.handling 
##                                 0                                 0 
##                   Checkin.service                       Cleanliness 
##                                 0                                 0 
##                   Online.boarding        Departure.Delay.in.Minutes 
##                                 0                                 0 
##          Arrival.Delay.in.Minutes 
##                               393

We can see that Arrival.Delay.in.Minutes has 393 missing values. We handle them by setting them to zero (no delay).

airline <- airline %>% 
  mutate(Arrival.Delay.in.Minutes = replace(Arrival.Delay.in.Minutes, is.na(Arrival.Delay.in.Minutes), 0),
         Arrival.Delay.in.Minutes = as.numeric(Arrival.Delay.in.Minutes)) 

colSums(is.na(airline))
##                      satisfaction                            Gender 
##                                 0                                 0 
##                     Customer.Type                               Age 
##                                 0                                 0 
##                    Type.of.Travel                             Class 
##                                 0                                 0 
##                   Flight.Distance                      Seat.comfort 
##                                 0                                 0 
## Departure.Arrival.time.convenient                    Food.and.drink 
##                                 0                                 0 
##                     Gate.location             Inflight.wifi.service 
##                                 0                                 0 
##            Inflight.entertainment                    Online.support 
##                                 0                                 0 
##            Ease.of.Online.booking                  On.board.service 
##                                 0                                 0 
##                  Leg.room.service                  Baggage.handling 
##                                 0                                 0 
##                   Checkin.service                       Cleanliness 
##                                 0                                 0 
##                   Online.boarding        Departure.Delay.in.Minutes 
##                                 0                                 0 
##          Arrival.Delay.in.Minutes 
##                                 0

No missing values remain. The data is now ready to explore.
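Filling with zero assumes that a missing arrival delay means no delay. As a rough sketch of how that choice compares with one alternative, the toy data below (hypothetical values, not rows from the airline file) fills the gaps either with zero or with the departure delay of the same flight. The second option is only an assumption for illustration, not something this analysis uses.

```r
# Toy data (hypothetical values, not taken from the airline file)
delays <- data.frame(
  Departure.Delay.in.Minutes = c(0, 310, 17, 30, 47),
  Arrival.Delay.in.Minutes   = c(0, 305, NA, 26, NA)
)

missing_rows <- is.na(delays$Arrival.Delay.in.Minutes)

# Option 1 (used in this report): treat a missing arrival delay as no delay
zero_filled <- replace(delays$Arrival.Delay.in.Minutes, missing_rows, 0)

# Option 2 (assumption): borrow the departure delay of the same flight
departure_filled <- ifelse(missing_rows,
                           delays$Departure.Delay.in.Minutes,
                           delays$Arrival.Delay.in.Minutes)

zero_filled
departure_filled
```

With only 393 of 129,880 rows affected, either choice is unlikely to change the models much.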

Exploratory Data Analysis

Let’s see the summary of all columns.

summary(airline)
##        satisfaction      Gender                Customer.Type         Age       
##  dissatisfied:58793   Female:65899   disloyal Customer: 23780   Min.   : 7.00  
##  satisfied   :71087   Male  :63981   Loyal Customer   :106100   1st Qu.:27.00  
##                                                                 Median :40.00  
##                                                                 Mean   :39.43  
##                                                                 3rd Qu.:51.00  
##                                                                 Max.   :85.00  
##          Type.of.Travel       Class       Flight.Distance  Seat.comfort  
##  Business travel:89693   Business:62160   Min.   :  50    Min.   :0.000  
##  Personal Travel:40187   Eco     :58309   1st Qu.:1359    1st Qu.:2.000  
##                          Eco Plus: 9411   Median :1925    Median :3.000  
##                                           Mean   :1981    Mean   :2.839  
##                                           3rd Qu.:2544    3rd Qu.:4.000  
##                                           Max.   :6951    Max.   :5.000  
##  Departure.Arrival.time.convenient Food.and.drink  Gate.location 
##  Min.   :0.000                     Min.   :0.000   Min.   :0.00  
##  1st Qu.:2.000                     1st Qu.:2.000   1st Qu.:2.00  
##  Median :3.000                     Median :3.000   Median :3.00  
##  Mean   :2.991                     Mean   :2.852   Mean   :2.99  
##  3rd Qu.:4.000                     3rd Qu.:4.000   3rd Qu.:4.00  
##  Max.   :5.000                     Max.   :5.000   Max.   :5.00  
##  Inflight.wifi.service Inflight.entertainment Online.support
##  Min.   :0.000         Min.   :0.000          Min.   :0.00  
##  1st Qu.:2.000         1st Qu.:2.000          1st Qu.:3.00  
##  Median :3.000         Median :4.000          Median :4.00  
##  Mean   :3.249         Mean   :3.383          Mean   :3.52  
##  3rd Qu.:4.000         3rd Qu.:4.000          3rd Qu.:5.00  
##  Max.   :5.000         Max.   :5.000          Max.   :5.00  
##  Ease.of.Online.booking On.board.service Leg.room.service Baggage.handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000   
##  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000    1st Qu.:3.000   
##  Median :4.000          Median :4.000    Median :4.000    Median :4.000   
##  Mean   :3.472          Mean   :3.465    Mean   :3.486    Mean   :3.696   
##  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.000   
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000   
##  Checkin.service  Cleanliness    Online.boarding Departure.Delay.in.Minutes
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :   0.00           
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:   0.00           
##  Median :3.000   Median :4.000   Median :4.000   Median :   0.00           
##  Mean   :3.341   Mean   :3.706   Mean   :3.353   Mean   :  14.71           
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:  12.00           
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :1592.00           
##  Arrival.Delay.in.Minutes
##  Min.   :   0.00         
##  1st Qu.:   0.00         
##  Median :   0.00         
##  Mean   :  15.05         
##  3rd Qu.:  13.00         
##  Max.   :1584.00

Before doing the analysis, we inspect the distribution of all variables. Categorical variables:

# geom_bar() counts each category, so it takes no bins argument
ggplot(gather(airline %>% select_if(is.factor)), aes(value)) + 
  geom_bar(fill = "maroon") + 
  facet_wrap(~key, scales = 'free_x') +
  theme_minimal()

Numerical variables:

ggplot(gather(airline %>% select_if(is.numeric)), aes(value)) + 
  geom_histogram(bins = 10, fill = "maroon") + 
  facet_wrap(~key, scales = 'free_x') +
  theme_minimal()

Logistic Regression

Logistic regression is a statistical method for predicting a categorical outcome. When the target variable has two values, such as “yes” and “no”, we build a binomial logistic regression model; when it has more than two values, the model is a multinomial logistic regression. For this project, we build a binomial logistic regression model because the target variable, satisfaction, has two values: “satisfied” and “dissatisfied”.
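A binomial logistic regression works on the log-odds scale and maps the result to a probability between 0 and 1. The short sketch below, with arbitrary example log-odds values, shows how log-odds, odds, and probability relate; plogis() is the logistic function from base R.

```r
# Log-odds (the scale of glm coefficients) mapped to probabilities
log_odds <- c(-2, 0, 2)                  # arbitrary example values
probability <- plogis(log_odds)          # 1 / (1 + exp(-log_odds))
odds <- probability / (1 - probability)  # equals exp(log_odds)

data.frame(log_odds, odds, probability)
```

A log-odds of 0 corresponds to odds of 1 and a probability of 0.5, which is why 0.5 is a natural default cutoff for turning probabilities into labels.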

Cross Validation

Check the proportion of our target variable.

prop.table(table(airline$satisfaction))
## 
## dissatisfied    satisfied 
##    0.4526717    0.5473283

The target classes are reasonably balanced (about 45% dissatisfied and 55% satisfied). Next, split the data into training and test sets.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- sample(nrow(airline), nrow(airline)*0.8)
airline_train <- airline[index,]
airline_test <- airline[-index,]
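The split above is a simple random sample, which keeps the class balance only approximately. A stratified split samples 80% inside each class so the proportions match exactly. The sketch below shows the idea in base R on a toy data frame; the toy object and its columns are hypothetical stand-ins, not part of this analysis.

```r
set.seed(100)

# Toy target with a class balance similar to the report (hypothetical data)
toy <- data.frame(
  id = 1:1000,
  satisfaction = factor(rep(c("dissatisfied", "satisfied"), times = c(450, 550)))
)

# Sample 80% of the row indices inside each class, then combine them
rows_by_class <- split(seq_len(nrow(toy)), toy$satisfaction)
train_rows <- unname(unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = floor(0.8 * length(rows)))]
})))

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

prop.table(table(toy_train$satisfaction))
prop.table(table(toy_test$satisfaction))
```

With a dataset as large and as balanced as this one, the simple random split is usually close enough, so the stratified version mainly matters for smaller or more skewed data.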

After splitting the data, re-check the class balance of the training data.

prop.table(table(airline_train$satisfaction))
## 
## dissatisfied    satisfied 
##     0.453303     0.546697

It is still balanced. Next, we can build the model.

Build Model

model_logistic <- glm(satisfaction~., data = airline, family = "binomial")
summary(model_logistic)
## 
## Call:
## glm(formula = satisfaction ~ ., family = "binomial", data = airline)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9805  -0.5776   0.1930   0.5192   3.6393  
## 
## Coefficients:
##                                       Estimate   Std. Error  z value
## (Intercept)                       -6.798554108  0.065301226 -104.111
## GenderMale                        -0.965355063  0.016444726  -58.703
## Customer.TypeLoyal Customer        1.980323291  0.024991761   79.239
## Age                               -0.007754971  0.000571093  -13.579
## Type.of.TravelPersonal Travel     -0.780419597  0.023400688  -33.350
## ClassEco                          -0.729101632  0.021200175  -34.391
## ClassEco Plus                     -0.806549152  0.032585445  -24.752
## Flight.Distance                   -0.000110760  0.000008596  -12.884
## Seat.comfort                       0.289617833  0.009196862   31.491
## Departure.Arrival.time.convenient -0.196737890  0.006770793  -29.057
## Food.and.drink                    -0.219463677  0.009344082  -23.487
## Gate.location                      0.114537166  0.007638146   14.995
## Inflight.wifi.service             -0.071701102  0.008874769   -8.079
## Inflight.entertainment             0.686584065  0.008294491   82.776
## Online.support                     0.091800763  0.009015167   10.183
## Ease.of.Online.booking             0.221709025  0.011614610   19.089
## On.board.service                   0.308841027  0.008242354   37.470
## Leg.room.service                   0.223597185  0.007009861   31.898
## Baggage.handling                   0.107907249  0.009282240   11.625
## Checkin.service                    0.296327433  0.006931315   42.752
## Cleanliness                        0.079310433  0.009654015    8.215
## Online.boarding                    0.171242789  0.009945911   17.217
## Departure.Delay.in.Minutes         0.002551590  0.000741308    3.442
## Arrival.Delay.in.Minutes          -0.007843130  0.000734262  -10.682
##                                               Pr(>|z|)    
## (Intercept)                       < 0.0000000000000002 ***
## GenderMale                        < 0.0000000000000002 ***
## Customer.TypeLoyal Customer       < 0.0000000000000002 ***
## Age                               < 0.0000000000000002 ***
## Type.of.TravelPersonal Travel     < 0.0000000000000002 ***
## ClassEco                          < 0.0000000000000002 ***
## ClassEco Plus                     < 0.0000000000000002 ***
## Flight.Distance                   < 0.0000000000000002 ***
## Seat.comfort                      < 0.0000000000000002 ***
## Departure.Arrival.time.convenient < 0.0000000000000002 ***
## Food.and.drink                    < 0.0000000000000002 ***
## Gate.location                     < 0.0000000000000002 ***
## Inflight.wifi.service             0.000000000000000652 ***
## Inflight.entertainment            < 0.0000000000000002 ***
## Online.support                    < 0.0000000000000002 ***
## Ease.of.Online.booking            < 0.0000000000000002 ***
## On.board.service                  < 0.0000000000000002 ***
## Leg.room.service                  < 0.0000000000000002 ***
## Baggage.handling                  < 0.0000000000000002 ***
## Checkin.service                   < 0.0000000000000002 ***
## Cleanliness                       < 0.0000000000000002 ***
## Online.boarding                   < 0.0000000000000002 ***
## Departure.Delay.in.Minutes                    0.000577 ***
## Arrival.Delay.in.Minutes          < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178886  on 129879  degrees of freedom
## Residual deviance:  99982  on 129856  degrees of freedom
## AIC: 100030
## 
## Number of Fisher Scoring iterations: 5

From the logistic regression model above, we can see that all predictor variables are statistically significant for the target variable (satisfaction). To interpret each variable, we take the exponential of each coefficient, which gives an odds ratio.

exp(model_logistic$coefficients)
##                       (Intercept)                        GenderMale 
##                       0.001115387                       0.380847951 
##       Customer.TypeLoyal Customer                               Age 
##                       7.245084880                       0.992275021 
##     Type.of.TravelPersonal Travel                          ClassEco 
##                       0.458213706                       0.482342116 
##                     ClassEco Plus                   Flight.Distance 
##                       0.446395856                       0.999889246 
##                      Seat.comfort Departure.Arrival.time.convenient 
##                       1.335916847                       0.821405903 
##                    Food.and.drink                     Gate.location 
##                       0.802949323                       1.121354316 
##             Inflight.wifi.service            Inflight.entertainment 
##                       0.930809071                       1.986916749 
##                    Online.support            Ease.of.Online.booking 
##                       1.096146407                       1.248208128 
##                  On.board.service                  Leg.room.service 
##                       1.361845857                       1.250567171 
##                  Baggage.handling                   Checkin.service 
##                       1.113944421                       1.344910453 
##                       Cleanliness                   Online.boarding 
##                       1.082540326                       1.186778851 
##        Departure.Delay.in.Minutes          Arrival.Delay.in.Minutes 
##                       1.002554848                       0.992187547

Interpretation of the model (examples):

  1. Gender: The odds of a male customer being satisfied are 0.38 times the odds of a female customer.

  2. Customer Type: The odds of a loyal customer being satisfied are 7.25 times the odds of a disloyal customer.

  3. Age: Each additional year of age multiplies the odds of being satisfied by 0.99.

  4. Class: The odds of an Eco class customer being satisfied are 0.48 times the odds of a Business class customer; for an Eco Plus customer the ratio is 0.45.

  5. Flight distance: Each additional unit of flight distance multiplies the odds of being satisfied by 0.9999.

  6. Seat comfort: Each 1-point increase in the seat comfort rating multiplies the odds of being satisfied by 1.34.

Each interpretation above assumes that the values of all other predictors are held constant.
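Exponentiated coefficients are odds ratios, and it often helps to report them with an interval. The sketch below fits a small model on simulated data (the loyal and age columns and their effect sizes are hypothetical) and puts the odds ratios next to 95% Wald intervals from confint.default().

```r
set.seed(100)

# Simulated data (hypothetical): one binary and one numeric predictor
n <- 2000
sim <- data.frame(loyal = factor(sample(c("no", "yes"), n, replace = TRUE)),
                  age = sample(18:80, n, replace = TRUE))
true_log_odds <- -0.5 + 1.2 * (sim$loyal == "yes") - 0.01 * sim$age
sim$satisfied <- rbinom(n, 1, plogis(true_log_odds))

fit <- glm(satisfied ~ loyal + age, data = sim, family = "binomial")

# Odds ratios with 95% Wald intervals, all on the odds scale
odds_ratios <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(odds_ratios, 3)

# Percent change in the odds per one-unit increase of each predictor
round((odds_ratios[, "OR"] - 1) * 100, 1)
```

Read this way, an odds ratio of 0.38 means 62% lower odds, and an odds ratio of 7.25 means the odds are 625% higher; neither number is a statement about probabilities directly.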

Prediction

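Predict on the test data with the logistic regression model and compare the predictions with the actual labels. Note that model_logistic was fitted on the full airline data rather than on airline_train, so the test rows were already seen during fitting and the evaluation below is likely optimistic.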

airline_test$pred_satisfaction <- predict(object = model_logistic, newdata = airline_test, type="response")

airline_test$pred_satisfaction_label <- ifelse(airline_test$pred_satisfaction>0.5, "satisfied", "dissatisfied") %>% as.factor()

airline_test %>% select(pred_satisfaction, pred_satisfaction_label, satisfaction)
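For comparison, a holdout workflow fits the model on the training rows only and scores it on rows it has not seen. The sketch below uses simulated data; the column names seat_comfort and delay and their effect sizes are hypothetical, not values from the airline data.

```r
set.seed(100)

# Simulated stand-in for the airline data (hypothetical columns and effects)
n <- 2000
sim <- data.frame(seat_comfort = sample(0:5, n, replace = TRUE),
                  delay = rexp(n, rate = 1 / 15))
p_true <- plogis(-1 + 0.6 * sim$seat_comfort - 0.02 * sim$delay)
sim$satisfaction <- factor(ifelse(runif(n) < p_true, "satisfied", "dissatisfied"))

idx <- sample(nrow(sim), nrow(sim) * 0.8)
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

# Fit on the training rows only, so the test rows stay unseen
fit <- glm(satisfaction ~ ., data = sim_train, family = "binomial")

# Predicted probability of the second factor level ("satisfied")
prob <- predict(fit, newdata = sim_test, type = "response")
label <- factor(ifelse(prob > 0.5, "satisfied", "dissatisfied"),
                levels = levels(sim$satisfaction))

mean(label == sim_test$satisfaction)  # holdout accuracy
```

Applied to this report, the same idea would mean passing data = airline_train to glm(); the coefficients and metrics would then differ slightly from the ones printed here.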

Model Evaluation

We use confusionMatrix() to evaluate our model. A confusion matrix is a table that shows:

  • TP (True Positive) = We predict the positive class, and the prediction is correct.
  • TN (True Negative) = We predict the negative class, and the prediction is correct.
  • FP (False Positive) = We predict the positive class, but the actual class is negative.
  • FN (False Negative) = We predict the negative class, but the actual class is positive.

We will get information about:

  • Accuracy: The share of all predictions that are correct, (TP + TN) / (TP + TN + FP + FN).
  • Sensitivity/ Recall: The share of actual positives that the model predicts as positive, TP / (TP + FN).
  • Specificity: The share of actual negatives that the model predicts as negative, TN / (TN + FP).
  • Pos Pred Value/Precision: The share of positive predictions that are actually positive, TP / (TP + FP).

confusionMatrix(data=airline_test$pred_satisfaction_label, reference=airline_test$satisfaction, positive="satisfied")
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     dissatisfied satisfied
##   dissatisfied         9564      2103
##   satisfied            2129     12180
##                                              
##                Accuracy : 0.8371             
##                  95% CI : (0.8325, 0.8416)   
##     No Information Rate : 0.5499             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6708             
##                                              
##  Mcnemar's Test P-Value : 0.7008             
##                                              
##             Sensitivity : 0.8528             
##             Specificity : 0.8179             
##          Pos Pred Value : 0.8512             
##          Neg Pred Value : 0.8197             
##              Prevalence : 0.5499             
##          Detection Rate : 0.4689             
##    Detection Prevalence : 0.5509             
##       Balanced Accuracy : 0.8353             
##                                              
##        'Positive' Class : satisfied          
## 

The evaluation result of logistic regression model:

  • Accuracy = 83.71%
  • Sensitivity/ Recall = 85.28%
  • Specificity = 81.79%
  • Pos Pred Value/Precision = 85.12%
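These four numbers can be reproduced by hand from the counts in the confusion matrix above, which is a useful check on which cell plays which role. The sketch below copies the four counts from the printed table, with satisfied as the positive class.

```r
# Counts copied from the confusion matrix above (positive class: satisfied)
TP <- 12180  # predicted satisfied, actually satisfied
TN <- 9564   # predicted dissatisfied, actually dissatisfied
FP <- 2129   # predicted satisfied, actually dissatisfied
FN <- 2103   # predicted dissatisfied, actually satisfied

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```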

K-Nearest Neighbor

K-Nearest Neighbor (K-NN) classifies new data (the test data) by comparing its characteristics with those of existing data (the training data). Closeness is measured by Euclidean distance, and the class is decided by majority vote among the k nearest training observations. Because it classifies by distance, the method suits numerical predictors but not categorical ones, so only the numeric columns are used below.
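To make the two steps concrete, the sketch below classifies a single new point by hand: it computes the Euclidean distance to every training row and then takes a majority vote among the k nearest. The tiny training set and the new point are hypothetical.

```r
# Tiny training set (hypothetical, already on comparable scales)
train <- matrix(c(1, 1,
                  1, 2,
                  2, 1,
                  5, 5,
                  6, 5), ncol = 2, byrow = TRUE)
labels <- factor(c("dissatisfied", "dissatisfied", "dissatisfied",
                   "satisfied", "satisfied"))
new_point <- c(2, 2)
k <- 3

# Euclidean distance from the new point to every training row
distances <- sqrt(rowSums(sweep(train, 2, new_point)^2))

# Majority vote among the k nearest neighbours
nearest <- order(distances)[seq_len(k)]
votes <- table(labels[nearest])
prediction <- names(votes)[which.max(votes)]
prediction
```

The three closest rows are all dissatisfied, so the new point is labelled dissatisfied. The knn() function from the class package performs the same steps for every test row at once.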

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)

index2 <- sample(nrow(airline), nrow(airline)*0.8)
airline_train_knn <- airline[index2,]
airline_test_knn <- airline[-index2,]

# training predictors
train_x <- airline_train_knn %>% select_if(is.numeric) # keep all numeric columns because they will be scaled

# training target
train_y <- airline_train_knn %>% select(satisfaction) # kept separately as the target class

# test predictors
test_x <- airline_test_knn %>% select_if(is.numeric)

# test target
test_y <-  airline_test_knn %>% select(satisfaction)

Scaling

Based on the summary above, the variables have different ranges, so we rescale the features with z-score standardization using scale(). The test data is scaled with the mean and standard deviation of the training data.

train_x <- scale(train_x)
test_x <- scale(test_x, 
                center = attr(train_x, "scaled:center"), # mean of the training data
                scale = attr(train_x, "scaled:scale")) # standard deviation of the training data
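Reusing the training mean and standard deviation keeps both sets on the same scale and avoids leaking information from the test rows into preprocessing. The sketch below shows the mechanics on two small hypothetical matrices and checks the result against the z-score formula written out by hand.

```r
# Hypothetical training and test predictors with very different ranges
train_m <- matrix(c(10, 20, 30, 40,
                    1,  2,  3,  4), ncol = 2,
                  dimnames = list(NULL, c("distance", "rating")))
test_m <- matrix(c(25, 2.5), ncol = 2,
                 dimnames = list(NULL, c("distance", "rating")))

train_scaled <- scale(train_m)
center <- attr(train_scaled, "scaled:center")  # column means of the training data
spread <- attr(train_scaled, "scaled:scale")   # column sds of the training data

# Reuse the training mean and sd so test rows are measured on the same scale
test_scaled <- scale(test_m, center = center, scale = spread)

# Same calculation written out by hand: (x - mean) / sd
manual <- sweep(sweep(test_m, 2, center, "-"), 2, spread, "/")

test_scaled
```

The test row sits exactly at the training means, so both of its scaled values are 0.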

Prediction

Select the value of k as the square root of the number of training observations.

sqrt(nrow(train_x))
## [1] 322.3414

Rounding down gives k = 322. We then predict using the K-NN method.

airline_pred_knn <- knn(train = train_x,
    test = test_x,
    cl = train_y$satisfaction,
    k=322)
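The square-root rule is a heuristic. Another common approach, sketched here on simulated data, compares a few candidate values of k on a held-out validation set and keeps the most accurate one. The data, the candidate values, and the 80/20 split are all assumptions for illustration; in practice the validation set should be separate from the final test set.

```r
library(class)  # provides knn(); also loaded at the top of this report

set.seed(100)

# Two simulated numeric predictors (hypothetical data), classes shifted apart
n <- 600
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
           matrix(rnorm(n, mean = 1.5), ncol = 2))
y <- factor(rep(c("dissatisfied", "satisfied"), each = n / 2))

idx <- sample(nrow(x), 0.8 * nrow(x))
candidate_k <- c(1, 5, 15, 21)  # odd values avoid tied votes with two classes

# Validation accuracy for each candidate k
accuracy <- sapply(candidate_k, function(k) {
  pred <- knn(train = x[idx, ], test = x[-idx, ], cl = y[idx], k = k)
  mean(pred == y[-idx])
})
names(accuracy) <- candidate_k
accuracy

best_k <- candidate_k[which.max(accuracy)]
best_k
```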

Model Evaluation

confusionMatrix(airline_pred_knn, reference = test_y$satisfaction, positive="satisfied")
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     dissatisfied satisfied
##   dissatisfied        10527      1989
##   satisfied            1166     12294
##                                                
##                Accuracy : 0.8785               
##                  95% CI : (0.8745, 0.8825)     
##     No Information Rate : 0.5499               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.7562               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.8607               
##             Specificity : 0.9003               
##          Pos Pred Value : 0.9134               
##          Neg Pred Value : 0.8411               
##              Prevalence : 0.5499               
##          Detection Rate : 0.4733               
##    Detection Prevalence : 0.5182               
##       Balanced Accuracy : 0.8805               
##                                                
##        'Positive' Class : satisfied            
## 

The evaluation of K-NN:

  • Accuracy = 87.85%
  • Sensitivity/ Recall = 86.07%
  • Specificity = 90.03%
  • Pos Pred Value/Precision = 91.34%

Conclusion

  1. Loyal customers and customers flying in Business class are more likely to be satisfied.
  2. In addition, higher ratings for Seat comfort, Gate location, Inflight entertainment, Online support, Ease of Online booking, On board service, Leg room service, Baggage handling, Checkin service, Cleanliness, and Online boarding are associated with a higher likelihood of satisfaction.
  3. Based on the model evaluation, both models perform well. On Accuracy, Sensitivity, Specificity, and Pos Pred Value, the K-NN model scores higher than the logistic regression model.