1 Introduction

Hello Readers! For this project, we use a Kaggle dataset on airline passenger satisfaction that records a range of factors for each passenger. In this report we fit Logistic Regression and K-Nearest Neighbors models to predict whether a passenger is satisfied.

Dataset source: https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction

2 Data Preparation

2.1 Library and Setup

Before we start the analysis, we need to load the required packages.

library(tidyverse)
library(ggplot2)
library(class)
library(caret) 
library(ggmosaic)
library(kableExtra)
library(lmtest)
library(car)

2.2 Import Data

To do the analysis, we first load the dataset.

airlines <- read.csv("train.csv")

head(airlines)

2.3 Data Description

To get to know our dataset better, here is an explanation of each variable:

Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)
Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Minutes of delay at departure
Arrival Delay in Minutes: Minutes of delay at arrival
Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)

3 Exploratory Data Analysis

3.1 Check Data Type

Let us check each column’s data type.

glimpse(airlines)

#> Rows: 103,904
#> Columns: 25
#> $ X                                 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11~
#> $ id                                <int> 70172, 5047, 110028, 24026, 119299, ~
#> $ Gender                            <chr> "Male", "Male", "Female", "Female", ~
#> $ Customer.Type                     <chr> "Loyal Customer", "disloyal Customer~
#> $ Age                               <int> 13, 25, 26, 25, 61, 26, 47, 52, 41, ~
#> $ Type.of.Travel                    <chr> "Personal Travel", "Business travel"~
#> $ Class                             <chr> "Eco Plus", "Business", "Business", ~
#> $ Flight.Distance                   <int> 460, 235, 1142, 562, 214, 1180, 1276~
#> $ Inflight.wifi.service             <int> 3, 3, 2, 2, 3, 3, 2, 4, 1, 3, 4, 2, ~
#> $ Departure.Arrival.time.convenient <int> 4, 2, 2, 5, 3, 4, 4, 3, 2, 3, 5, 4, ~
#> $ Ease.of.Online.booking            <int> 3, 3, 2, 5, 3, 2, 2, 4, 2, 3, 5, 2, ~
#> $ Gate.location                     <int> 1, 3, 2, 5, 3, 1, 3, 4, 2, 4, 4, 2, ~
#> $ Food.and.drink                    <int> 5, 1, 5, 2, 4, 1, 2, 5, 4, 2, 2, 1, ~
#> $ Online.boarding                   <int> 3, 3, 5, 2, 5, 2, 2, 5, 3, 3, 5, 2, ~
#> $ Seat.comfort                      <int> 5, 1, 5, 2, 5, 1, 2, 5, 3, 3, 2, 1, ~
#> $ Inflight.entertainment            <int> 5, 1, 5, 2, 3, 1, 2, 5, 1, 2, 2, 1, ~
#> $ On.board.service                  <int> 4, 1, 4, 2, 3, 3, 3, 5, 1, 2, 3, 1, ~
#> $ Leg.room.service                  <int> 3, 5, 3, 5, 4, 4, 3, 5, 2, 3, 3, 2, ~
#> $ Baggage.handling                  <int> 4, 3, 4, 3, 4, 4, 4, 5, 1, 4, 5, 5, ~
#> $ Checkin.service                   <int> 4, 1, 4, 1, 3, 4, 3, 4, 4, 4, 3, 5, ~
#> $ Inflight.service                  <int> 5, 4, 4, 4, 3, 4, 5, 5, 1, 3, 5, 5, ~
#> $ Cleanliness                       <int> 5, 1, 5, 2, 3, 1, 2, 4, 2, 2, 2, 1, ~
#> $ Departure.Delay.in.Minutes        <int> 25, 1, 0, 11, 0, 0, 9, 4, 0, 0, 0, 0~
#> $ Arrival.Delay.in.Minutes          <dbl> 18, 6, 0, 9, 0, 0, 23, 0, 0, 0, 0, 0~
#> $ satisfaction                      <chr> "neutral or dissatisfied", "neutral ~

After checking the data type of each column, we found that some columns do not have the right type, so we convert them to make the analysis easier. We also drop the X and id columns because they carry no useful information for the analysis. To simplify the modeling, we recode satisfaction so that "neutral or dissatisfied" becomes 0 and "satisfied" becomes 1, and Customer.Type so that "disloyal Customer" becomes 0 and "Loyal Customer" becomes 1.

df_airlines <- airlines %>% 
  select(-X,-id) %>% 
  mutate_if(is.character,as.factor) %>% 
  mutate(satisfaction = 
           factor(satisfaction, 
                 levels = c("neutral or dissatisfied","satisfied"), 
                 labels = c(0, 1)),
         Customer.Type =
           factor(Customer.Type, 
                 levels = c("disloyal Customer","Loyal Customer"), 
                 labels = c(0, 1)))

glimpse(df_airlines)

#> Rows: 103,904
#> Columns: 23
#> $ Gender                            <fct> Male, Male, Female, Female, Male, Fe~
#> $ Customer.Type                     <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, ~
#> $ Age                               <int> 13, 25, 26, 25, 61, 26, 47, 52, 41, ~
#> $ Type.of.Travel                    <fct> Personal Travel, Business travel, Bu~
#> $ Class                             <fct> Eco Plus, Business, Business, Busine~
#> $ Flight.Distance                   <int> 460, 235, 1142, 562, 214, 1180, 1276~
#> $ Inflight.wifi.service             <int> 3, 3, 2, 2, 3, 3, 2, 4, 1, 3, 4, 2, ~
#> $ Departure.Arrival.time.convenient <int> 4, 2, 2, 5, 3, 4, 4, 3, 2, 3, 5, 4, ~
#> $ Ease.of.Online.booking            <int> 3, 3, 2, 5, 3, 2, 2, 4, 2, 3, 5, 2, ~
#> $ Gate.location                     <int> 1, 3, 2, 5, 3, 1, 3, 4, 2, 4, 4, 2, ~
#> $ Food.and.drink                    <int> 5, 1, 5, 2, 4, 1, 2, 5, 4, 2, 2, 1, ~
#> $ Online.boarding                   <int> 3, 3, 5, 2, 5, 2, 2, 5, 3, 3, 5, 2, ~
#> $ Seat.comfort                      <int> 5, 1, 5, 2, 5, 1, 2, 5, 3, 3, 2, 1, ~
#> $ Inflight.entertainment            <int> 5, 1, 5, 2, 3, 1, 2, 5, 1, 2, 2, 1, ~
#> $ On.board.service                  <int> 4, 1, 4, 2, 3, 3, 3, 5, 1, 2, 3, 1, ~
#> $ Leg.room.service                  <int> 3, 5, 3, 5, 4, 4, 3, 5, 2, 3, 3, 2, ~
#> $ Baggage.handling                  <int> 4, 3, 4, 3, 4, 4, 4, 5, 1, 4, 5, 5, ~
#> $ Checkin.service                   <int> 4, 1, 4, 1, 3, 4, 3, 4, 4, 4, 3, 5, ~
#> $ Inflight.service                  <int> 5, 4, 4, 4, 3, 4, 5, 5, 1, 3, 5, 5, ~
#> $ Cleanliness                       <int> 5, 1, 5, 2, 3, 1, 2, 4, 2, 2, 2, 1, ~
#> $ Departure.Delay.in.Minutes        <int> 25, 1, 0, 11, 0, 0, 9, 4, 0, 0, 0, 0~
#> $ Arrival.Delay.in.Minutes          <dbl> 18, 6, 0, 9, 0, 0, 23, 0, 0, 0, 0, 0~
#> $ satisfaction                      <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, ~

All the data types are now correct, so we are ready for the next step.
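As a quick sanity check on the recoding above, the sketch below shows how factor() with levels and labels maps the original strings to 0 and 1. The vector toy_satisfaction is hypothetical and only mirrors the two categories found in the dataset.

```r
# Hypothetical toy vector mirroring the two satisfaction categories
toy_satisfaction <- c("satisfied", "neutral or dissatisfied", "satisfied")

# levels fixes the order of the categories; labels renames them in that order
toy_recoded <- factor(toy_satisfaction,
                      levels = c("neutral or dissatisfied", "satisfied"),
                      labels = c(0, 1))

toy_recoded
levels(toy_recoded)

# A factor stores its labels as text, so go through as.character()
# to recover the numbers 0 and 1 rather than the internal codes 1 and 2
toy_numeric <- as.numeric(as.character(toy_recoded))
toy_numeric
```

Note that calling as.numeric() directly on the factor would return the internal codes (1 and 2) instead of 0 and 1, which is an easy mistake to make later on.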

3.2 Check Missing Value

Next, we check whether there are any missing values in our dataset.

colSums(is.na(df_airlines))

#>                            Gender                     Customer.Type 
#>                                 0                                 0 
#>                               Age                    Type.of.Travel 
#>                                 0                                 0 
#>                             Class                   Flight.Distance 
#>                                 0                                 0 
#>             Inflight.wifi.service Departure.Arrival.time.convenient 
#>                                 0                                 0 
#>            Ease.of.Online.booking                     Gate.location 
#>                                 0                                 0 
#>                    Food.and.drink                   Online.boarding 
#>                                 0                                 0 
#>                      Seat.comfort            Inflight.entertainment 
#>                                 0                                 0 
#>                  On.board.service                  Leg.room.service 
#>                                 0                                 0 
#>                  Baggage.handling                   Checkin.service 
#>                                 0                                 0 
#>                  Inflight.service                       Cleanliness 
#>                                 0                                 0 
#>        Departure.Delay.in.Minutes          Arrival.Delay.in.Minutes 
#>                                 0                               310 
#>                      satisfaction 
#>                                 0

The check shows 310 missing values (NA) in the Arrival.Delay.in.Minutes column. Here, we assume that these missing values are 0.

# Replace missing arrival delays with a numeric 0
# (using 0 instead of '0' keeps the column numeric, so no conversion is needed)
df_airlines$Arrival.Delay.in.Minutes <- 
  ifelse(is.na(df_airlines$Arrival.Delay.in.Minutes),
         0, df_airlines$Arrival.Delay.in.Minutes)

colSums(is.na(df_airlines))

#>                            Gender                     Customer.Type 
#>                                 0                                 0 
#>                               Age                    Type.of.Travel 
#>                                 0                                 0 
#>                             Class                   Flight.Distance 
#>                                 0                                 0 
#>             Inflight.wifi.service Departure.Arrival.time.convenient 
#>                                 0                                 0 
#>            Ease.of.Online.booking                     Gate.location 
#>                                 0                                 0 
#>                    Food.and.drink                   Online.boarding 
#>                                 0                                 0 
#>                      Seat.comfort            Inflight.entertainment 
#>                                 0                                 0 
#>                  On.board.service                  Leg.room.service 
#>                                 0                                 0 
#>                  Baggage.handling                   Checkin.service 
#>                                 0                                 0 
#>                  Inflight.service                       Cleanliness 
#>                                 0                                 0 
#>        Departure.Delay.in.Minutes          Arrival.Delay.in.Minutes 
#>                                 0                                 0 
#>                      satisfaction 
#>                                 0

There are no more NA values in the data, so let us go to the next step.
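Filling missing delays with 0 is an assumption, so it is worth seeing what it does to the data. The sketch below uses a hypothetical vector toy_delay (not taken from the dataset) to show the fill and how it pulls the mean down compared with simply ignoring the missing values.

```r
# Hypothetical delays in minutes, with two missing values
toy_delay <- c(18, 6, NA, 9, 0, NA, 23)

# Count the missing values before filling
n_missing <- sum(is.na(toy_delay))
n_missing

# replace() swaps the NA positions for 0 and leaves the rest untouched
toy_filled <- replace(toy_delay, is.na(toy_delay), 0)
toy_filled

# Mean when NA is ignored versus mean after the zero fill
mean_ignoring_na <- mean(toy_delay, na.rm = TRUE)
mean_after_fill  <- mean(toy_filled)
c(ignoring_na = mean_ignoring_na, after_fill = mean_after_fill)
```

With only 310 missing values out of 103,904 rows, the effect on the real column should be small, but the same comparison can be run on df_airlines before and after the fill to confirm it.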

3.3 Analysis

To get to know more about our data, let us check the summary.

summary(df_airlines)

#>     Gender      Customer.Type      Age                Type.of.Travel 
#>  Female:52727   0:18981       Min.   : 7.00   Business travel:71655  
#>  Male  :51177   1:84923       1st Qu.:27.00   Personal Travel:32249  
#>                               Median :40.00                          
#>                               Mean   :39.38                          
#>                               3rd Qu.:51.00                          
#>                               Max.   :85.00                          
#>       Class       Flight.Distance Inflight.wifi.service
#>  Business:49665   Min.   :  31    Min.   :0.00         
#>  Eco     :46745   1st Qu.: 414    1st Qu.:2.00         
#>  Eco Plus: 7494   Median : 843    Median :3.00         
#>                   Mean   :1189    Mean   :2.73         
#>                   3rd Qu.:1743    3rd Qu.:4.00         
#>                   Max.   :4983    Max.   :5.00         
#>  Departure.Arrival.time.convenient Ease.of.Online.booking Gate.location  
#>  Min.   :0.00                      Min.   :0.000          Min.   :0.000  
#>  1st Qu.:2.00                      1st Qu.:2.000          1st Qu.:2.000  
#>  Median :3.00                      Median :3.000          Median :3.000  
#>  Mean   :3.06                      Mean   :2.757          Mean   :2.977  
#>  3rd Qu.:4.00                      3rd Qu.:4.000          3rd Qu.:4.000  
#>  Max.   :5.00                      Max.   :5.000          Max.   :5.000  
#>  Food.and.drink  Online.boarding  Seat.comfort   Inflight.entertainment
#>  Min.   :0.000   Min.   :0.00    Min.   :0.000   Min.   :0.000         
#>  1st Qu.:2.000   1st Qu.:2.00    1st Qu.:2.000   1st Qu.:2.000         
#>  Median :3.000   Median :3.00    Median :4.000   Median :4.000         
#>  Mean   :3.202   Mean   :3.25    Mean   :3.439   Mean   :3.358         
#>  3rd Qu.:4.000   3rd Qu.:4.00    3rd Qu.:5.000   3rd Qu.:4.000         
#>  Max.   :5.000   Max.   :5.00    Max.   :5.000   Max.   :5.000         
#>  On.board.service Leg.room.service Baggage.handling Checkin.service
#>  Min.   :0.000    Min.   :0.000    Min.   :1.000    Min.   :0.000  
#>  1st Qu.:2.000    1st Qu.:2.000    1st Qu.:3.000    1st Qu.:3.000  
#>  Median :4.000    Median :4.000    Median :4.000    Median :3.000  
#>  Mean   :3.382    Mean   :3.351    Mean   :3.632    Mean   :3.304  
#>  3rd Qu.:4.000    3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:4.000  
#>  Max.   :5.000    Max.   :5.000    Max.   :5.000    Max.   :5.000  
#>  Inflight.service  Cleanliness    Departure.Delay.in.Minutes
#>  Min.   :0.00     Min.   :0.000   Min.   :   0.00           
#>  1st Qu.:3.00     1st Qu.:2.000   1st Qu.:   0.00           
#>  Median :4.00     Median :3.000   Median :   0.00           
#>  Mean   :3.64     Mean   :3.286   Mean   :  14.82           
#>  3rd Qu.:5.00     3rd Qu.:4.000   3rd Qu.:  12.00           
#>  Max.   :5.00     Max.   :5.000   Max.   :1592.00           
#>  Arrival.Delay.in.Minutes satisfaction
#>  Min.   :   0.00          0:58879     
#>  1st Qu.:   0.00          1:45025     
#>  Median :   0.00                      
#>  Mean   :  15.13                      
#>  3rd Qu.:  13.00                      
#>  Max.   :1584.00

The histograms below show the frequency distribution of each numerical variable.

ggplot(gather(df_airlines %>% select_if(is.numeric)), aes(value)) + 
    geom_histogram(bins = 10, fill="blue") + 
    facet_wrap(~key, scales = 'free_x')

The bar charts below show the frequency of each level of the categorical variables.

# geom_bar() counts the rows per category, so it takes no bins argument
ggplot(gather(df_airlines %>% select_if(is.factor)), aes(value)) + 
    geom_bar(fill = "firebrick") + 
    facet_wrap(~key, scales = 'free_x') + 
    labs(x = "Category", y = "Count")

Some insights from the summary and the plots:

Arrival.Delay.in.Minutes, Departure.Delay.in.Minutes, and Flight.Distance are strongly right-skewed rather than normally distributed, which suggests that outliers are present. We will decide later whether to drop these outliers or the variables themselves.
Gender is balanced: the numbers of male and female passengers are nearly equal.
For customer type, about 82% are loyal customers (1).
For type of travel, only about 31% is personal travel; the majority of passengers are business travelers.
For class, Business and Eco are roughly balanced, and only a small share of passengers (about 7%) fly Eco Plus.
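The shares quoted in these insights can be recomputed from the counts in the summary() output. The sketch below copies those counts into small named vectors (the vector names are only for illustration) and converts them to percentages with prop.table().

```r
# Counts copied from the summary(df_airlines) output above
customer_type <- c(disloyal = 18981, loyal = 84923)
travel_type   <- c(business = 71655, personal = 32249)
travel_class  <- c(business = 49665, eco = 46745, eco_plus = 7494)

# prop.table() divides each count by the total; multiply by 100 for percent
pct_customer <- round(100 * prop.table(customer_type), 1)
pct_travel   <- round(100 * prop.table(travel_type), 1)
pct_class    <- round(100 * prop.table(travel_class), 1)

pct_customer
pct_travel
pct_class
```

On the full data frame, the same numbers come from prop.table(table(df_airlines$Customer.Type)) and its equivalents for the other factors.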

4 Model Fitting

4.1 Logistic Regression

Logistic regression is a classification algorithm that fits a regression curve, y = f(x), where y is a categorical variable. When the dependent variable has exactly two possible values, the model is called binomial logistic regression; when it has more than two, the model is called multinomial logistic regression.

Our case is binomial because the target variable takes the values 1 and 0, where 1 means satisfied and 0 means neutral or dissatisfied.
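To make the curve y = f(x) concrete: a binomial logistic regression models the log-odds of the positive class as a linear function of the predictors, and the logistic function turns those log-odds back into a probability. The following is a minimal sketch; the helper name logistic and the example values are only illustrative.

```r
# Logistic (sigmoid) function: maps a log-odds value to a probability in (0, 1)
logistic <- function(log_odds) 1 / (1 + exp(-log_odds))

# Example log-odds: negative, zero and positive
log_odds <- c(-2, 0, 2)
probs <- logistic(log_odds)
probs

# plogis() is the built-in equivalent in base R
all.equal(probs, plogis(log_odds))

# qlogis() goes the other way: probability -> log-odds
qlogis(0.5)
```

A log-odds of 0 corresponds to a probability of 0.5, which is why 0.5 is a natural default cut-off when turning predicted probabilities into class labels.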

4.1.1 Cross Validation

Now let us check the proportion of our target variable

prop.table(table(df_airlines$satisfaction))

#> 
#>         0         1 
#> 0.5666673 0.4333327

With about 57% in class 0 and 43% in class 1, the target variable is reasonably balanced.

Now we split the data into train and test sets, with 80% of the rows going to the train set.

RNGkind(sample.kind = "Rounding") 
set.seed(901)

index <- sample(nrow(df_airlines), 
                nrow(df_airlines) *0.8) 

airlines_train <- df_airlines[index, ] 
airlines_test <- df_airlines[-index, ]

Now let us recheck the class balance in the train and test sets.

prop.table(table(airlines_train$satisfaction))

#> 
#>        0        1 
#> 0.565752 0.434248

prop.table(table(airlines_test$satisfaction))

#> 
#>         0         1 
#> 0.5703287 0.4296713

The target variable is reasonably balanced in both the train and test sets.
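The random split above happens to preserve the class proportions well because the dataset is large. When that cannot be taken for granted, a stratified split samples within each class separately. The sketch below is an alternative to the split used in this report, not the method applied above; it uses base R only and a hypothetical data frame toy with a 57/43 class mix. The caret package offers createDataPartition() for the same purpose.

```r
set.seed(901)

# Hypothetical data: 100 rows with 57 in class 0 and 43 in class 1
toy <- data.frame(
  id = 1:100,
  satisfaction = factor(rep(c(0, 1), times = c(57, 43)))
)

# Split the row numbers by class, then sample 80% of the rows in each class
rows_by_class <- split(seq_len(nrow(toy)), toy$satisfaction)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = floor(0.8 * length(rows)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions stay close to the original mix in both subsets
prop.table(table(toy_train$satisfaction))
prop.table(table(toy_test$satisfaction))
```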

4.1.2 Model

Now let us make the logistic regression model

model_reg1 <- glm(satisfaction ~ ., data = airlines_train, family = "binomial")
summary(model_reg1)

#> 
#> Call:
#> glm(formula = satisfaction ~ ., family = "binomial", data = airlines_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8491  -0.4920  -0.1761   0.3876   4.0093  
#> 
#> Coefficients:
#>                                      Estimate  Std. Error z value
#> (Intercept)                       -7.83464868  0.08774368 -89.290
#> GenderMale                         0.04581272  0.02174589   2.107
#> Customer.Type1                     2.01293177  0.03336690  60.327
#> Age                               -0.00778677  0.00079326  -9.816
#> Type.of.TravelPersonal Travel     -2.71347682  0.03513671 -77.226
#> ClassEco                          -0.74002829  0.02867838 -25.804
#> ClassEco Plus                     -0.85652829  0.04636535 -18.473
#> Flight.Distance                   -0.00001895  0.00001262  -1.502
#> Inflight.wifi.service              0.38996438  0.01279222  30.484
#> Departure.Arrival.time.convenient -0.12084981  0.00917830 -13.167
#> Ease.of.Online.booking            -0.14870927  0.01269358 -11.715
#> Gate.location                      0.03313043  0.01024700   3.233
#> Food.and.drink                    -0.03359285  0.01193826  -2.814
#> Online.boarding                    0.60609741  0.01142894  53.032
#> Seat.comfort                       0.06019415  0.01250315   4.814
#> Inflight.entertainment             0.07843071  0.01593490   4.922
#> On.board.service                   0.29769215  0.01137631  26.168
#> Leg.room.service                   0.25346328  0.00951696  26.633
#> Baggage.handling                   0.13561299  0.01273426  10.649
#> Checkin.service                    0.32532543  0.00955669  34.042
#> Inflight.service                   0.11912393  0.01341064   8.883
#> Cleanliness                        0.22274202  0.01353198  16.460
#> Departure.Delay.in.Minutes         0.00398071  0.00101737   3.913
#> Arrival.Delay.in.Minutes          -0.00864276  0.00100776  -8.576
#>                                               Pr(>|z|)    
#> (Intercept)                       < 0.0000000000000002 ***
#> GenderMale                                     0.03514 *  
#> Customer.Type1                    < 0.0000000000000002 ***
#> Age                               < 0.0000000000000002 ***
#> Type.of.TravelPersonal Travel     < 0.0000000000000002 ***
#> ClassEco                          < 0.0000000000000002 ***
#> ClassEco Plus                     < 0.0000000000000002 ***
#> Flight.Distance                                0.13320    
#> Inflight.wifi.service             < 0.0000000000000002 ***
#> Departure.Arrival.time.convenient < 0.0000000000000002 ***
#> Ease.of.Online.booking            < 0.0000000000000002 ***
#> Gate.location                                  0.00122 ** 
#> Food.and.drink                                 0.00489 ** 
#> Online.boarding                   < 0.0000000000000002 ***
#> Seat.comfort                               0.000001477 ***
#> Inflight.entertainment                     0.000000857 ***
#> On.board.service                  < 0.0000000000000002 ***
#> Leg.room.service                  < 0.0000000000000002 ***
#> Baggage.handling                  < 0.0000000000000002 ***
#> Checkin.service                   < 0.0000000000000002 ***
#> Inflight.service                  < 0.0000000000000002 ***
#> Cleanliness                       < 0.0000000000000002 ***
#> Departure.Delay.in.Minutes                 0.000091248 ***
#> Arrival.Delay.in.Minutes          < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 113791  on 83122  degrees of freedom
#> Residual deviance:  55582  on 83099  degrees of freedom
#> AIC: 55630
#> 
#> Number of Fisher Scoring iterations: 6

One variable, Flight.Distance, is not significant (p = 0.133). We could try dropping this variable and checking the AIC.

First, however, let us check this model for multicollinearity. If some variables are collinear, we can drop those variables as well.

vif(model_reg1)

#>                                        GVIF Df GVIF^(1/(2*Df))
#> Gender                             1.007015  1        1.003502
#> Customer.Type                      1.606263  1        1.267384
#> Age                                1.179469  1        1.086034
#> Type.of.Travel                     1.868529  1        1.366941
#> Class                              1.677552  2        1.138070
#> Flight.Distance                    1.358093  1        1.165372
#> Inflight.wifi.service              2.218141  1        1.489343
#> Departure.Arrival.time.convenient  1.725875  1        1.313726
#> Ease.of.Online.booking             2.600178  1        1.612507
#> Gate.location                      1.522327  1        1.233826
#> Food.and.drink                     2.015123  1        1.419550
#> Online.boarding                    1.486299  1        1.219138
#> Seat.comfort                       2.040074  1        1.428312
#> Inflight.entertainment             3.251724  1        1.803254
#> On.board.service                   1.638945  1        1.280213
#> Leg.room.service                   1.213795  1        1.101723
#> Baggage.handling                   1.821675  1        1.349694
#> Checkin.service                    1.206278  1        1.098307
#> Inflight.service                   2.003023  1        1.415282
#> Cleanliness                        2.466384  1        1.570473
#> Departure.Delay.in.Minutes        12.312295  1        3.508888
#> Arrival.Delay.in.Minutes          12.342944  1        3.513253

Two variables, Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes, are collinear (VIF > 10).
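To see where a large VIF comes from, it helps to compute one by hand. For a linear model, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below uses simulated data (every variable here is made up) in which the arrival delay is almost a copy of the departure delay. The values reported by vif() for a glm are computed from the fitted model and need not match this formula exactly, so treat this as an illustration of the idea only.

```r
set.seed(1)
n <- 500

# Simulated predictors: arrival delay closely tracks departure delay
departure_delay <- rexp(n, rate = 1 / 15)
arrival_delay   <- departure_delay + rnorm(n, sd = 3)
age             <- rnorm(n, mean = 40, sd = 15)

# VIF of arrival_delay: regress it on the other predictors
r2_arrival  <- summary(lm(arrival_delay ~ departure_delay + age))$r.squared
vif_arrival <- 1 / (1 - r2_arrival)

# VIF of age: it is unrelated to the delays, so its VIF stays near 1
r2_age  <- summary(lm(age ~ departure_delay + arrival_delay))$r.squared
vif_age <- 1 / (1 - r2_age)

c(arrival_delay = vif_arrival, age = vif_age)

# The pairwise correlation tells the same story
cor(departure_delay, arrival_delay)
```

On the real data, cor(airlines_train$Departure.Delay.in.Minutes, airlines_train$Arrival.Delay.in.Minutes) is a quick way to confirm that the two delay columns carry nearly the same information.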

Now let us check the linearity assumption for model_reg1.

#linearity check

data.frame(prediction=model_reg1$fitted.values,
     error=model_reg1$residuals) %>% 
  ggplot(aes(prediction,error)) +
  geom_hline(yintercept=0) +
  geom_point() +
  geom_smooth() +
  theme_bw()

The plot above shows little to no discernible pattern in the residuals, so we conclude that the linearity assumption holds.

Since model_reg1 does not pass the multicollinearity check, let us drop Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes. Note that the code below keeps Flight.Distance, so it still appears in model_reg2 even though it is not significant.

df_airlines2 = airlines_train %>% 
  select(-Departure.Delay.in.Minutes,-Arrival.Delay.in.Minutes)

model_reg2 <- glm(satisfaction ~ ., data = df_airlines2, family = "binomial")

summary(model_reg2)

#> 
#> Call:
#> glm(formula = satisfaction ~ ., family = "binomial", data = df_airlines2)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8289  -0.4936  -0.1780   0.3933   4.0203  
#> 
#> Coefficients:
#>                                      Estimate  Std. Error z value
#> (Intercept)                       -7.88356309  0.08744342 -90.156
#> GenderMale                         0.04561210  0.02167814   2.104
#> Customer.Type1                     1.99598203  0.03324612  60.037
#> Age                               -0.00750104  0.00079017  -9.493
#> Type.of.TravelPersonal Travel     -2.69215847  0.03502039 -76.874
#> ClassEco                          -0.74621577  0.02859797 -26.093
#> ClassEco Plus                     -0.86572867  0.04624346 -18.721
#> Flight.Distance                   -0.00001902  0.00001257  -1.514
#> Inflight.wifi.service              0.39152928  0.01274685  30.716
#> Departure.Arrival.time.convenient -0.12068981  0.00916195 -13.173
#> Ease.of.Online.booking            -0.14382242  0.01264437 -11.374
#> Gate.location                      0.02984078  0.01022146   2.919
#> Food.and.drink                    -0.02484845  0.01181987  -2.102
#> Online.boarding                    0.60205710  0.01135460  53.023
#> Seat.comfort                       0.05835508  0.01246703   4.681
#> Inflight.entertainment             0.08305960  0.01584349   5.243
#> On.board.service                   0.29867436  0.01134616  26.324
#> Leg.room.service                   0.24545388  0.00949448  25.852
#> Baggage.handling                   0.12608496  0.01270955   9.920
#> Checkin.service                    0.32031468  0.00952425  33.631
#> Inflight.service                   0.13363797  0.01334041  10.018
#> Cleanliness                        0.21200683  0.01343995  15.774
#>                                               Pr(>|z|)    
#> (Intercept)                       < 0.0000000000000002 ***
#> GenderMale                                     0.03537 *  
#> Customer.Type1                    < 0.0000000000000002 ***
#> Age                               < 0.0000000000000002 ***
#> Type.of.TravelPersonal Travel     < 0.0000000000000002 ***
#> ClassEco                          < 0.0000000000000002 ***
#> ClassEco Plus                     < 0.0000000000000002 ***
#> Flight.Distance                                0.13015    
#> Inflight.wifi.service             < 0.0000000000000002 ***
#> Departure.Arrival.time.convenient < 0.0000000000000002 ***
#> Ease.of.Online.booking            < 0.0000000000000002 ***
#> Gate.location                                  0.00351 ** 
#> Food.and.drink                                 0.03553 *  
#> Online.boarding                   < 0.0000000000000002 ***
#> Seat.comfort                               0.000002858 ***
#> Inflight.entertainment                     0.000000158 ***
#> On.board.service                  < 0.0000000000000002 ***
#> Leg.room.service                  < 0.0000000000000002 ***
#> Baggage.handling                  < 0.0000000000000002 ***
#> Checkin.service                   < 0.0000000000000002 ***
#> Inflight.service                  < 0.0000000000000002 ***
#> Cleanliness                       < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 113791  on 83122  degrees of freedom
#> Residual deviance:  55899  on 83101  degrees of freedom
#> AIC: 55943
#> 
#> Number of Fisher Scoring iterations: 5

Now let us check this model for multicollinearity.

vif(model_reg2)

#>                                       GVIF Df GVIF^(1/(2*Df))
#> Gender                            1.006973  1        1.003480
#> Customer.Type                     1.600231  1        1.265003
#> Age                               1.177436  1        1.085097
#> Type.of.Travel                    1.860462  1        1.363987
#> Class                             1.677943  2        1.138136
#> Flight.Distance                   1.358874  1        1.165708
#> Inflight.wifi.service             2.216832  1        1.488903
#> Departure.Arrival.time.convenient 1.727893  1        1.314493
#> Ease.of.Online.booking            2.597454  1        1.611662
#> Gate.location                     1.524117  1        1.234551
#> Food.and.drink                    1.986777  1        1.409531
#> Online.boarding                   1.474444  1        1.214267
#> Seat.comfort                      2.039703  1        1.428182
#> Inflight.entertainment            3.222060  1        1.795010
#> On.board.service                  1.638434  1        1.280013
#> Leg.room.service                  1.211370  1        1.100623
#> Baggage.handling                  1.818625  1        1.348564
#> Checkin.service                   1.205342  1        1.097881
#> Inflight.service                  1.991087  1        1.411059
#> Cleanliness                       2.445290  1        1.563742

All VIF values are now well below 10, so multicollinearity is no longer a concern.

Now let us check the linearity assumption for model_reg2.

#linearity check

data.frame(prediction=model_reg2$fitted.values,
     error=model_reg2$residuals) %>% 
  ggplot(aes(prediction,error)) +
  geom_hline(yintercept=0) +
  geom_point() +
  geom_smooth() +
  theme_bw()

From the plot above, we again conclude that the linearity assumption holds.

Summary of the two models:

Model <- c("Model_Reg1", "Model_Reg2")
AIC <- c(55630,55943)
MultiColinearity <- c("yes","no" )
Linearity <-c("yes","yes" )


df <- data.frame(Model, AIC, MultiColinearity,Linearity )

print (df)

#>        Model   AIC MultiColinearity Linearity
#> 1 Model_Reg1 55630              yes       yes
#> 2 Model_Reg2 55943               no       yes

Now we can safely pick model_reg2 for further analysis. Although its AIC is slightly higher (lower is better), this model passes the multicollinearity check.
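The AIC values in the table can be reproduced from the summary() outputs. For a binary 0/1 response the residual deviance equals -2 times the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below redoes that arithmetic with numbers copied from the two summaries; the vector names are only for illustration.

```r
# Residual deviances copied from the two summary() outputs above
resid_dev <- c(model_reg1 = 55582, model_reg2 = 55899)

# Number of estimated coefficients (intercept plus slopes) in each model
n_coef <- c(model_reg1 = 24, model_reg2 = 22)

# AIC = residual deviance + 2 * number of coefficients
aic <- resid_dev + 2 * n_coef
aic

# Difference in AIC: positive means model_reg2 fits slightly worse
aic_diff <- aic[["model_reg2"]] - aic[["model_reg1"]]
aic_diff
```

With the fitted models in memory, AIC(model_reg1, model_reg2) returns the same comparison directly.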

Let us interpret one of the coefficient estimates to make the model_reg2 summary above easier to read.

customer_type1 <- 2.01662055
exp(customer_type1)

#> [1] 7.512893

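Interpretation: exponentiating a logistic regression coefficient gives an odds ratio, not a probability. Holding the other variables constant, the odds of being satisfied are about 7.5 times higher for a Loyal Customer (Customer.Type 1) than for a disloyal Customer. Note that the value 2.01662055 used above differs slightly from the Customer.Type1 estimate in the model_reg2 summary (1.99598203), which gives an odds ratio of about 7.4.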
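Because an odds ratio is easy to misread as a probability ratio, the sketch below shows how the same odds ratio moves the probability by different amounts depending on where a passenger starts. The coefficient is copied from the model_reg2 summary; the baseline probabilities are hypothetical and chosen only for illustration.

```r
# Customer.Type1 estimate copied from the model_reg2 summary above
beta_loyal <- 1.99598203
odds_ratio <- exp(beta_loyal)
round(odds_ratio, 2)

# Hypothetical baseline probabilities of satisfaction for a disloyal customer
baseline_prob <- c(0.05, 0.30, 0.60)

# Convert to odds, apply the odds ratio, convert back to probability
baseline_odds <- baseline_prob / (1 - baseline_prob)
loyal_odds    <- baseline_odds * odds_ratio
loyal_prob    <- loyal_odds / (1 + loyal_odds)

# Probability does not scale by the odds ratio; it only moves toward 1
round(data.frame(baseline_prob, loyal_prob), 3)
```

With the fitted model in memory, exp(coef(model_reg2)) returns the odds ratios for all predictors at once.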

4.1.3 Prediction

Now let us predict the probability of satisfaction for the test data with model_reg2 and save it in a new column named pred_result.

airlines_test$pred_result <- predict(object = model_reg2, 
        newdata = airlines_test, 
        type = "response")

Now we classify each row of airlines_test based on pred_result, using 0.5 as the threshold, and save the result in a new column named pred_label.

airlines_test$pred_label <- ifelse(airlines_test$pred_result < 0.5 ,0, 1)
airlines_test$pred_label <- as.factor(airlines_test$pred_label)
head(airlines_test)

str(airlines_test)

#> 'data.frame':    20781 obs. of  25 variables:
#>  $ Gender                           : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 1 1 1 2 1 ...
#>  $ Customer.Type                    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Age                              : int  13 61 26 47 12 17 43 36 22 35 ...
#>  $ Type.of.Travel                   : Factor w/ 2 levels "Business travel",..: 2 1 2 2 2 2 2 1 2 1 ...
#>  $ Class                            : Factor w/ 3 levels "Business","Eco",..: 3 1 2 2 3 2 2 1 2 1 ...
#>  $ Flight.Distance                  : int  460 214 1180 1276 308 208 752 3347 2342 2611 ...
#>  $ Inflight.wifi.service            : int  3 3 3 2 2 3 3 3 3 4 ...
#>  $ Departure.Arrival.time.convenient: int  4 3 4 4 4 1 5 1 2 5 ...
#>  $ Ease.of.Online.booking           : int  3 3 2 2 2 3 3 1 3 4 ...
#>  $ Gate.location                    : int  1 3 1 3 2 3 5 1 3 4 ...
#>  $ Food.and.drink                   : int  5 4 1 2 1 5 5 1 3 4 ...
#>  $ Online.boarding                  : int  3 5 2 2 2 3 4 2 3 4 ...
#>  $ Seat.comfort                     : int  5 5 1 2 1 5 5 1 1 4 ...
#>  $ Inflight.entertainment           : int  5 3 1 2 1 5 3 3 3 3 ...
#>  $ On.board.service                 : int  4 3 3 3 1 2 3 3 2 3 ...
#>  $ Leg.room.service                 : int  3 4 4 3 2 5 3 3 4 4 ...
#>  $ Baggage.handling                 : int  4 4 4 4 5 3 5 3 3 5 ...
#>  $ Checkin.service                  : int  4 3 4 3 5 3 3 2 4 4 ...
#>  $ Inflight.service                 : int  5 3 4 5 5 4 3 3 2 3 ...
#>  $ Cleanliness                      : int  5 3 1 2 1 5 4 2 3 4 ...
#>  $ Departure.Delay.in.Minutes       : int  25 0 0 9 0 0 52 18 19 109 ...
#>  $ Arrival.Delay.in.Minutes         : num  18 0 0 23 0 0 29 12 0 120 ...
#>  $ satisfaction                     : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 2 ...
#>  $ pred_result                      : num  0.2007 0.8807 0.033 0.0193 0.0141 ...
#>  $ pred_label                       : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 2 ...

The plot below shows the distribution of the predicted probabilities.

ggplot(data = airlines_test, mapping = aes(x = pred_result)) +
  geom_density(fill = "green", col = "black") +
  geom_vline(xintercept = 0.5, linetype = "dashed") +
  theme_bw()

The plot shows a higher density below 0.5, which means the model predicts more customers as not satisfied than as satisfied.
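
The reading of the plot can also be checked numerically by tabulating the predicted labels. Below is a minimal sketch on a made-up probability vector; the values in `pred_result_demo` are hypothetical and not taken from the data.

```r
# Hypothetical predicted probabilities standing in for airlines_test$pred_result
pred_result_demo <- c(0.20, 0.88, 0.03, 0.02, 0.01, 0.61, 0.47, 0.95)

# Same 0.5 cut-off as in the report
pred_label_demo <- factor(ifelse(pred_result_demo < 0.5, 0, 1), levels = c(0, 1))

# Counts and shares of each predicted class
table(pred_label_demo)
prop.table(table(pred_label_demo))
```

Running the same two calls on airlines_test$pred_label gives the actual split between predicted classes.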

4.2 K-Nearest Neighbor

K-NN is distance-based and works best with numeric predictors. In this pre-processing step we therefore drop the categorical variables, along with the delay and flight-distance columns, before building the train and test predictor sets.

# predictor
airlines_train_x <-
  airlines_train %>% select(
    -c(
      satisfaction,
      Gender,
      Customer.Type,
      Type.of.Travel,
      Class,
      Arrival.Delay.in.Minutes,
      Departure.Delay.in.Minutes,
      Flight.Distance
    )
  )
airlines_test_x <-
  airlines_test %>% select(
    -c(
      satisfaction,
      Gender,
      Customer.Type,
      Type.of.Travel,
      Class,
      Arrival.Delay.in.Minutes,
      Departure.Delay.in.Minutes,
      Flight.Distance,
      pred_label,
      pred_result
    )
  )

# target
airlines_train_y <- airlines_train %>% select(satisfaction)
airlines_test_y <- airlines_test %>% select(satisfaction)

4.2.1 Scaling

Because K-NN compares observations by distance, we standardize the predictors. The test set is scaled with the center and scale computed on the training set, so both sets share the same scale.

airlines_train_xs <- scale(airlines_train_x)

airlines_test_xs <- scale(
  airlines_test_x,
  center = attr(airlines_train_xs, "scaled:center"),
  scale = attr(airlines_train_xs, "scaled:scale")
)

head(airlines_test_xs)

#>           Age Inflight.wifi.service Departure.Arrival.time.convenient
#> 1  -1.7438403             0.2037476                        0.61566676
#> 5   1.4300214             0.2037476                       -0.04034416
#> 6  -0.8842528             0.2037476                        0.61566676
#> 7   0.5043117            -0.5488682                        0.61566676
#> 12 -1.8099624            -0.5488682                        0.61566676
#> 22 -1.4793518             0.2037476                       -1.35236600
#>    Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
#> 1               0.1755811   -1.54772909      1.3537888      -0.1854321
#> 5               0.1755811    0.01852225      0.6010933       1.2961529
#> 6              -0.5399920   -1.54772909     -1.6569932      -0.9262246
#> 7              -0.5399920    0.01852225     -0.9042977      -0.9262246
#> 12             -0.5399920   -0.76460342     -1.6569932      -0.9262246
#> 22              0.1755811    0.01852225      1.3537888      -0.1854321
#>    Seat.comfort Inflight.entertainment On.board.service Leg.room.service
#> 1      1.185323              1.2327161        0.4785345       -0.2652092
#> 5      1.185323             -0.2691719       -0.2985313        0.4951201
#> 6     -1.849844             -1.7710600       -0.2985313        0.4951201
#> 7     -1.091052             -1.0201160       -0.2985313       -0.2652092
#> 12    -1.849844             -1.7710600       -1.8526629       -1.0255385
#> 22     1.185323              1.2327161       -1.0755971        1.2554494
#>    Baggage.handling Checkin.service Inflight.service Cleanliness
#> 1         0.3149261       0.5497154        1.1555694   1.3073159
#> 5         0.3149261      -0.2407287       -0.5453061  -0.2173036
#> 6         0.3149261       0.5497154        0.3051316  -1.7419232
#> 7         0.3149261      -0.2407287        1.1555694  -0.9796134
#> 12        1.1610845       1.3401596        1.1555694  -1.7419232
#> 22       -0.5312323      -0.2407287        0.3051316   1.3073159

4.2.2 Prediction

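Now let’s choose K. A common rule of thumb is to start from the square root of the number of training rows. This gives a starting point rather than a tuned optimum. With two classes an even K can produce tied votes, which knn() breaks at random.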

round(sqrt(nrow(airlines_train_xs)))

#> [1] 288

airlines_knn <- knn(train = airlines_train_xs, test = airlines_test_xs, cl = airlines_train_y$satisfaction, k = 288)

head(airlines_knn)

#> [1] 0 1 0 0 0 0
#> Levels: 0 1
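
The square-root rule is a heuristic, so it can be worth comparing a few candidate values of K on a held-out split. The sketch below does this on the built-in iris data, which is only a stand-in for the airline predictors; the seed, the split size and the grid of K values are all assumptions made for illustration.

```r
library(class)  # provides knn()

set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Toy data: iris stands in for the airline predictors
idx     <- sample(nrow(iris), size = 100)
train_x <- scale(iris[idx, 1:4])
valid_x <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[idx]
valid_y <- iris$Species[-idx]

# Candidate values of K
k_grid <- seq(1, 15, by = 2)

# Validation accuracy for each candidate K
acc <- sapply(k_grid, function(k) {
  pred <- knn(train = train_x, test = valid_x, cl = train_y, k = k)
  mean(pred == valid_y)
})

data.frame(k = k_grid, accuracy = acc)
k_grid[which.max(acc)]  # K with the highest validation accuracy
```

On the airline data the same loop could be run on a validation split carved out of airlines_train, keeping airlines_test untouched for the final evaluation.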

5 Model Evaluation

To evaluate our models, we use the confusionMatrix() function from the caret package. A confusion matrix is a table of four categories: True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). From it we take four metrics to evaluate each model: Accuracy, Sensitivity (Recall), Specificity and Precision (reported by caret as Pos Pred Value).

5.1 Logistic Regression

conf_log <- confusionMatrix(data = airlines_test$pred_label, reference = airlines_test$satisfaction, positive = "1")
conf_log

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 10698  1478
#>          1  1154  7451
#>                                                
#>                Accuracy : 0.8733               
#>                  95% CI : (0.8687, 0.8778)     
#>     No Information Rate : 0.5703               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7404               
#>                                                
#>  Mcnemar's Test P-Value : 0.0000000003056      
#>                                                
#>             Sensitivity : 0.8345               
#>             Specificity : 0.9026               
#>          Pos Pred Value : 0.8659               
#>          Neg Pred Value : 0.8786               
#>              Prevalence : 0.4297               
#>          Detection Rate : 0.3585               
#>    Detection Prevalence : 0.4141               
#>       Balanced Accuracy : 0.8686               
#>                                                
#>        'Positive' Class : 1                    
#>
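
The four metrics can be reproduced by hand from the counts in the confusion matrix above, which shows what each one measures. This is a minimal sketch in base R; the variable names are chosen here for illustration.

```r
# Counts read off the logistic regression confusion matrix (positive class = "1")
TP <- 7451   # predicted 1, actually 1
TN <- 10698  # predicted 0, actually 0
FP <- 1154   # predicted 1, actually 0
FN <- 1478   # predicted 0, actually 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)  # also called recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)  # caret reports this as Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The rounded values match the Accuracy, Sensitivity, Specificity and Pos Pred Value lines of the caret output.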

5.2 K-Nearest Neighbor

conf_knn <- confusionMatrix(data = airlines_knn, reference = as.factor(airlines_test_y$satisfaction), positive = "1")
conf_knn

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 10973  1560
#>          1   879  7369
#>                                                
#>                Accuracy : 0.8826               
#>                  95% CI : (0.8782, 0.887)      
#>     No Information Rate : 0.5703               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7583               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8253               
#>             Specificity : 0.9258               
#>          Pos Pred Value : 0.8934               
#>          Neg Pred Value : 0.8755               
#>              Prevalence : 0.4297               
#>          Detection Rate : 0.3546               
#>    Detection Prevalence : 0.3969               
#>       Balanced Accuracy : 0.8756               
#>                                                
#>        'Positive' Class : 1                    
#>

Summary of model evaluation:

Model <- c("Logistic Regression", "K-Nearest Neighbor")
Accuracy <- c(0.8733, 0.8826)
Recall <- c(0.8345, 0.8253)
Specificity <- c(0.9026, 0.9258)
Precision <- c(0.8659, 0.8934)

df <- data.frame(Model, Accuracy, Recall, Specificity, Precision)

print(df)

#>                 Model Accuracy Recall Specificity Precision
#> 1 Logistic Regression   0.8733 0.8345      0.9026    0.8659
#> 2  K-Nearest Neighbor   0.8826 0.8253      0.9258    0.8934
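
Typing the metrics by hand is easy to get wrong, so one alternative is to pull them straight from the confusionMatrix objects. The sketch below is self-contained: it rebuilds the two objects from the counts printed earlier, and the helper names `cm_from_counts` and `get_metrics` are hypothetical. In the report itself, conf_log and conf_knn could be passed to `get_metrics` directly.

```r
library(caret)  # confusionMatrix()

# Hypothetical helper: rebuild a caret confusion matrix from four counts
cm_from_counts <- function(tn, fp, fn, tp) {
  tab <- as.table(matrix(c(tn, fp, fn, tp), nrow = 2,
                         dimnames = list(Prediction = c("0", "1"),
                                         Reference  = c("0", "1"))))
  confusionMatrix(tab, positive = "1")
}

conf_log_demo <- cm_from_counts(tn = 10698, fp = 1154, fn = 1478, tp = 7451)
conf_knn_demo <- cm_from_counts(tn = 10973, fp = 879,  fn = 1560, tp = 7369)

# Hypothetical helper: extract the four metrics from a confusionMatrix object
get_metrics <- function(cm) {
  c(Accuracy    = unname(cm$overall["Accuracy"]),
    Recall      = unname(cm$byClass["Sensitivity"]),
    Specificity = unname(cm$byClass["Specificity"]),
    Precision   = unname(cm$byClass["Pos Pred Value"]))
}

summary_df <- data.frame(
  Model = c("Logistic Regression", "K-Nearest Neighbor"),
  rbind(get_metrics(conf_log_demo), get_metrics(conf_knn_demo))
)
summary_df[-1] <- round(summary_df[-1], 4)
summary_df
```

Built this way, the table stays in sync with the models if the data or the cut-off changes.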

6 Conclusion

Here are some conclusions we can get from both models:

Looking at the summary, both models perform well on all four metrics: Accuracy, Recall, Specificity and Precision. K-Nearest Neighbor scores higher on Accuracy, Specificity and Precision, while Logistic Regression has a slightly higher Recall (0.8345 versus 0.8253).
Which model to prefer depends on what we want to achieve. If we focus only on the positive class (“satisfied”), we may prioritize the model with the higher Precision. That means concentrating on satisfied customers in order to retain their satisfaction and make them more loyal, for example through better loyalty programs, services, promotions or vouchers.
If, on the other hand, we care about correct predictions for both the positive and the negative class, we might prioritize the model with the higher Accuracy. In that case we look not only at the True Positive “satisfied” customers but also at the customers behind the False Positives and False Negatives, and try to regain their satisfaction with discounts, vouchers, promotions or better services.
In this case K-NN is the stronger choice on either criterion, since it has both the higher Precision and the higher Accuracy. Logistic Regression is ahead only on Recall.

Airlines Passengers Satisfaction Prediction using Logistic Regression and K-Nearest Neighbor

By : Syabaruddin Malik