Predicting Airlines Passengers Satisfaction using Logistic Regression and K-Nearest Neighbor

Gissella Nadya

1 Business Case

Welcome back again, readers! It’s yours truly, Gissella Nadya, and this is my Classification 1 Learning By Building Project for Algoritma Academy. For this project, we are using a dataset from Kaggle (here or here) on airline customer satisfaction, which covers various factors of the flight experience. We are going to build a Logistic Regression model and a K-Nearest Neighbors model for this report. Let’s go!

2 Getting Started

2.1 Importing Libraries

library(tidyverse)
library(ggplot2)
library(class) # knn()
library(caret) # confusionMatrix()
library(ggmosaic)
library(kableExtra)

2.2 Reading the Dataset

airline <- read_csv("airline.csv") %>% janitor::clean_names()
head(airline)

3 Data Wrangling

glimpse(airline)
#> Rows: 129,880
#> Columns: 23
#> $ satisfaction                      <chr> "satisfied", "satisfied", "satisfie…
#> $ gender                            <chr> "Female", "Male", "Female", "Female…
#> $ customer_type                     <chr> "Loyal Customer", "Loyal Customer",…
#> $ age                               <dbl> 65, 47, 15, 60, 70, 30, 66, 10, 56,…
#> $ type_of_travel                    <chr> "Personal Travel", "Personal Travel…
#> $ class                             <chr> "Eco", "Business", "Eco", "Eco", "E…
#> $ flight_distance                   <dbl> 265, 2464, 2138, 623, 354, 1894, 22…
#> $ seat_comfort                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ departure_arrival_time_convenient <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ food_and_drink                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ gate_location                     <dbl> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4,…
#> $ inflight_wifi_service             <dbl> 2, 0, 2, 3, 4, 2, 2, 2, 5, 2, 3, 2,…
#> $ inflight_entertainment            <dbl> 4, 2, 0, 4, 3, 0, 5, 0, 3, 0, 3, 0,…
#> $ online_support                    <dbl> 2, 2, 2, 3, 4, 2, 5, 2, 5, 2, 3, 2,…
#> $ ease_of_online_booking            <dbl> 3, 3, 2, 1, 2, 2, 5, 2, 4, 2, 3, 2,…
#> $ on_board_service                  <dbl> 3, 4, 3, 1, 2, 5, 5, 3, 4, 2, 3, 3,…
#> $ leg_room_service                  <dbl> 0, 4, 3, 0, 0, 4, 0, 3, 0, 4, 0, 2,…
#> $ baggage_handling                  <dbl> 3, 4, 4, 1, 2, 5, 5, 4, 1, 5, 1, 5,…
#> $ checkin_service                   <dbl> 5, 2, 4, 4, 4, 5, 5, 5, 5, 3, 2, 2,…
#> $ cleanliness                       <dbl> 3, 3, 4, 1, 2, 4, 5, 4, 4, 4, 3, 5,…
#> $ online_boarding                   <dbl> 2, 2, 2, 3, 5, 2, 3, 2, 4, 2, 5, 2,…
#> $ departure_delay_in_minutes        <dbl> 0, 310, 0, 0, 0, 0, 17, 0, 0, 30, 4…
#> $ arrival_delay_in_minutes          <dbl> 0, 305, 0, 0, 0, 0, 15, 0, 0, 26, 4…

To get to know our dataset better, here is an explanation of each variable:

  • Satisfaction: Airline satisfaction level (satisfied or dissatisfied)
  • Gender: Gender of the passengers (Female, Male)
  • Customer Type: The customer type (Loyal customer, disloyal customer)
  • Age: The actual age of the passengers
  • Type of Travel: Purpose of the flight of the passengers (Personal Travel, or Business Travel)
  • Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
  • Flight distance: The flight distance of this journey
  • Seat comfort: Satisfaction level of seat comfort (0: Not Applicable; 1-5)
  • Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient
  • Food and drink: Satisfaction level of Food and drink
  • Gate location: Satisfaction level of Gate location
  • Inflight wifi service: Satisfaction level of the inflight wifi service
  • Inflight entertainment: Satisfaction level of inflight entertainment
  • Online support: Satisfaction level of Online Support
  • Ease of Online booking: Satisfaction level of online booking
  • On-board service: Satisfaction level of On-board service
  • Leg room service: Satisfaction level of Leg room service
  • Baggage handling: Satisfaction level of baggage handling
  • Check-in service: Satisfaction level of Check-in service
  • Cleanliness: Satisfaction level of Cleanliness
  • Online boarding: Satisfaction level of online boarding
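  • Departure Delay in Minutes: Minutes of delay at departure
  • Arrival Delay in Minutes: Minutes of delay on arrival

Next, we convert the character columns to factors and make sure the age, distance, and delay columns are stored as numeric:

clean_airline <- airline %>% 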
  mutate_if(~is.character(.), ~as.factor(.)) %>% 
  mutate(age = as.numeric(age), 
         flight_distance = as.numeric(flight_distance),
         departure_delay_in_minutes = as.numeric(departure_delay_in_minutes),
         arrival_delay_in_minutes = as.numeric(arrival_delay_in_minutes))

Checking NA Values

colSums(is.na(clean_airline))
#>                      satisfaction                            gender 
#>                                 0                                 0 
#>                     customer_type                               age 
#>                                 0                                 0 
#>                    type_of_travel                             class 
#>                                 0                                 0 
#>                   flight_distance                      seat_comfort 
#>                                 0                                 0 
#> departure_arrival_time_convenient                    food_and_drink 
#>                                 0                                 0 
#>                     gate_location             inflight_wifi_service 
#>                                 0                                 0 
#>            inflight_entertainment                    online_support 
#>                                 0                                 0 
#>            ease_of_online_booking                  on_board_service 
#>                                 0                                 0 
#>                  leg_room_service                  baggage_handling 
#>                                 0                                 0 
#>                   checkin_service                       cleanliness 
#>                                 0                                 0 
#>                   online_boarding        departure_delay_in_minutes 
#>                                 0                                 0 
#>          arrival_delay_in_minutes 
#>                               393

After checking for NA values, we found that 393 observations are NA in the arrival_delay_in_minutes column. Here, we are going to assume that these NA values are 0.

# Replace NA with a numeric 0 so the column stays numeric
clean_airline$arrival_delay_in_minutes <- ifelse(is.na(clean_airline$arrival_delay_in_minutes), 0, clean_airline$arrival_delay_in_minutes)
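As a quick sanity check, the same replacement can be sketched on a small made-up vector. The `delay` values below are hypothetical and only illustrate the idea of filling NA with 0:

```r
# Hypothetical delay values, not taken from the airline data
delay <- c(0, 15, NA, 42, NA)

# Overwrite every NA with 0; the vector stays numeric
delay_filled <- delay
delay_filled[is.na(delay_filled)] <- 0

# Count the NA values that remain after filling
sum(is.na(delay_filled))
```

Filling with 0 is an assumption, so it is worth counting how many rows it touches before modeling.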

To simplify our modeling, we are going to recode the dissatisfied category as 0 and the satisfied category as 1. We also recode customer_type, with disloyal customers as 0 and loyal customers as 1.

clean_airline <- clean_airline %>% 
  mutate(satisfaction = as.factor(case_when(satisfaction == "dissatisfied" ~ "0", TRUE ~ "1")),
         customer_type = as.factor(case_when(customer_type == "disloyal Customer" ~ "0",
                                             TRUE ~ "1")))

4 EDA

To get to know more about our data, we are going to do an Exploratory Data Analysis. We are not going to be thorough in this part as our focus is on the Log Reg and K-NN models.

summary(clean_airline)
#>  satisfaction    gender      customer_type      age       
#>  0:58793      Female:65899   0: 23780      Min.   : 7.00  
#>  1:71087      Male  :63981   1:106100      1st Qu.:27.00  
#>                                            Median :40.00  
#>                                            Mean   :39.43  
#>                                            3rd Qu.:51.00  
#>                                            Max.   :85.00  
#>          type_of_travel       class       flight_distance  seat_comfort  
#>  Business travel:89693   Business:62160   Min.   :  50    Min.   :0.000  
#>  Personal Travel:40187   Eco     :58309   1st Qu.:1359    1st Qu.:2.000  
#>                          Eco Plus: 9411   Median :1925    Median :3.000  
#>                                           Mean   :1981    Mean   :2.839  
#>                                           3rd Qu.:2544    3rd Qu.:4.000  
#>                                           Max.   :6951    Max.   :5.000  
#>  departure_arrival_time_convenient food_and_drink  gate_location 
#>  Min.   :0.000                     Min.   :0.000   Min.   :0.00  
#>  1st Qu.:2.000                     1st Qu.:2.000   1st Qu.:2.00  
#>  Median :3.000                     Median :3.000   Median :3.00  
#>  Mean   :2.991                     Mean   :2.852   Mean   :2.99  
#>  3rd Qu.:4.000                     3rd Qu.:4.000   3rd Qu.:4.00  
#>  Max.   :5.000                     Max.   :5.000   Max.   :5.00  
#>  inflight_wifi_service inflight_entertainment online_support
#>  Min.   :0.000         Min.   :0.000          Min.   :0.00  
#>  1st Qu.:2.000         1st Qu.:2.000          1st Qu.:3.00  
#>  Median :3.000         Median :4.000          Median :4.00  
#>  Mean   :3.249         Mean   :3.383          Mean   :3.52  
#>  3rd Qu.:4.000         3rd Qu.:4.000          3rd Qu.:5.00  
#>  Max.   :5.000         Max.   :5.000          Max.   :5.00  
#>  ease_of_online_booking on_board_service leg_room_service baggage_handling
#>  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000   
#>  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000    1st Qu.:3.000   
#>  Median :4.000          Median :4.000    Median :4.000    Median :4.000   
#>  Mean   :3.472          Mean   :3.465    Mean   :3.486    Mean   :3.696   
#>  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.000   
#>  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000   
#>  checkin_service  cleanliness    online_boarding departure_delay_in_minutes
#>  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :   0.00           
#>  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:   0.00           
#>  Median :3.000   Median :4.000   Median :4.000   Median :   0.00           
#>  Mean   :3.341   Mean   :3.706   Mean   :3.353   Mean   :  14.71           
#>  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:  12.00           
#>  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :1592.00           
#>  arrival_delay_in_minutes
#>  Min.   :   0.00         
#>  1st Qu.:   0.00         
#>  Median :   0.00         
#>  Mean   :  15.05         
#>  3rd Qu.:  13.00         
#>  Max.   :1584.00

We can also look at the distribution of each numeric variable:

ggplot(gather(clean_airline %>% select_if(is.numeric)), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

A fun insight indeed: compared to female passengers, more male passengers are dissatisfied with the airline's service.
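One way to check a claim like this is a cross-tabulation of gender against satisfaction. The sketch below uses a tiny hypothetical data frame called `toy`; with the real data, the same two lines could be run on clean_airline instead:

```r
# Hypothetical toy data; the real analysis would use clean_airline instead
toy <- data.frame(
  gender       = c("Female", "Female", "Female", "Male", "Male", "Male"),
  satisfaction = c("1", "1", "0", "0", "0", "1")
)

# Row-wise proportions: share of dissatisfied (0) and satisfied (1) within each gender
gender_tab <- prop.table(table(toy$gender, toy$satisfaction), margin = 1)
round(gender_tab, 2)
```

Using margin = 1 makes each row sum to 1, so the two genders can be compared even if their group sizes differ.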

5 Building the Models

5.1 Logistic Regression

Logistic regression is a classification algorithm that fits a regression curve, y = f(x), where y is a categorical variable. When the dependent variable is binary (for example, 1 for spam and 0 for not-spam), the model is called binomial logistic regression; when the dependent variable has more than two categories, it is called multinomial logistic regression.

Our case is binomial because the target variable takes only the values 1 and 0, where 1 is satisfied and 0 is dissatisfied.
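To make the "regression curve" a bit more concrete, here is a minimal sketch of the logistic function that turns log-odds into a probability between 0 and 1. The helper name `logistic` and the input values are made up for illustration; base R offers the same transformation as plogis():

```r
# Logistic (sigmoid) function: converts log-odds into a probability
logistic <- function(log_odds) 1 / (1 + exp(-log_odds))

# Hypothetical log-odds values, chosen only for illustration
log_odds <- c(-2, 0, 2)
probs <- logistic(log_odds)
probs

# Classify as satisfied ("1") when the probability exceeds 0.5
ifelse(probs > 0.5, "1", "0")
```

Log-odds of 0 map to a probability of exactly 0.5, which is why 0.5 is the natural default cut-off.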

5.1.1 Cross Validation (Train-Test Split)

prop.table(table(clean_airline$satisfaction))
#> 
#>         0         1 
#> 0.4526717 0.5473283

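A roughly 45:55 split between dissatisfied and satisfied passengers means our target variable is reasonably balanced. Next, we randomly assign 80% of the observations to the train set and keep the remaining 20% as the test set.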

RNGkind(sample.kind = "Rounding") 
set.seed(598)

index <- sample(nrow(clean_airline), 
                nrow(clean_airline) *0.8) 

airline_train <- clean_airline[index, ] 
airline_test <- clean_airline[-index, ] 
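The same splitting pattern can be tried on a small example to confirm that the two sets do not overlap. The data frame `toy` below is hypothetical and only stands in for clean_airline:

```r
set.seed(598)

# Hypothetical data frame with 100 rows standing in for clean_airline
toy <- data.frame(
  id = 1:100,
  satisfaction = factor(rep(c("0", "1"), times = c(45, 55)))
)

# Sample 80% of the row indices for training; the rest become the test set
index <- sample(nrow(toy), nrow(toy) * 0.8)
toy_train <- toy[index, ]
toy_test  <- toy[-index, ]

# Sizes of the two sets
nrow(toy_train)
nrow(toy_test)

# Class proportions in the train set, to compare with the full data
prop.table(table(toy_train$satisfaction))
```

Because the split is random rather than stratified, the class proportions in the train set can drift slightly from the original ones, so a quick prop.table() check is a cheap safeguard.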

5.1.2 Model

model_logreg <-  glm(formula = satisfaction ~ ., 
                     data = airline_train,
                     family = binomial("logit"))
summary(model_logreg)
#> 
#> Call:
#> glm(formula = satisfaction ~ ., family = binomial("logit"), data = airline_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.9873  -0.5741   0.1912   0.5171   3.5917  
#> 
#> Coefficients:
#>                                       Estimate   Std. Error z value
#> (Intercept)                       -6.822591352  0.073216843 -93.183
#> genderMale                        -0.956407411  0.018407073 -51.959
#> customer_type1                     1.988820649  0.028014059  70.994
#> age                               -0.007668534  0.000641260 -11.959
#> type_of_travelPersonal Travel     -0.780669721  0.026200679 -29.796
#> classEco                          -0.748964211  0.023703356 -31.597
#> classEco Plus                     -0.783218049  0.036438417 -21.494
#> flight_distance                   -0.000114023  0.000009641 -11.827
#> seat_comfort                       0.289408040  0.010300741  28.096
#> departure_arrival_time_convenient -0.201601106  0.007594677 -26.545
#> food_and_drink                    -0.215013110  0.010467912 -20.540
#> gate_location                      0.115611700  0.008578729  13.477
#> inflight_wifi_service             -0.078420308  0.009959073  -7.874
#> inflight_entertainment             0.689561404  0.009298993  74.154
#> online_support                     0.094209823  0.010102754   9.325
#> ease_of_online_booking             0.223764332  0.013017619  17.189
#> on_board_service                   0.307225318  0.009235947  33.264
#> leg_room_service                   0.218333969  0.007850168  27.813
#> baggage_handling                   0.110937290  0.010390525  10.677
#> checkin_service                    0.298166736  0.007761246  38.417
#> cleanliness                        0.084748768  0.010815271   7.836
#> online_boarding                    0.172143142  0.011150624  15.438
#> departure_delay_in_minutes         0.002727487  0.000825623   3.304
#> arrival_delay_in_minutes          -0.007837885  0.000817985  -9.582
#>                                               Pr(>|z|)    
#> (Intercept)                       < 0.0000000000000002 ***
#> genderMale                        < 0.0000000000000002 ***
#> customer_type1                    < 0.0000000000000002 ***
#> age                               < 0.0000000000000002 ***
#> type_of_travelPersonal Travel     < 0.0000000000000002 ***
#> classEco                          < 0.0000000000000002 ***
#> classEco Plus                     < 0.0000000000000002 ***
#> flight_distance                   < 0.0000000000000002 ***
#> seat_comfort                      < 0.0000000000000002 ***
#> departure_arrival_time_convenient < 0.0000000000000002 ***
#> food_and_drink                    < 0.0000000000000002 ***
#> gate_location                     < 0.0000000000000002 ***
#> inflight_wifi_service              0.00000000000000343 ***
#> inflight_entertainment            < 0.0000000000000002 ***
#> online_support                    < 0.0000000000000002 ***
#> ease_of_online_booking            < 0.0000000000000002 ***
#> on_board_service                  < 0.0000000000000002 ***
#> leg_room_service                  < 0.0000000000000002 ***
#> baggage_handling                  < 0.0000000000000002 ***
#> checkin_service                   < 0.0000000000000002 ***
#> cleanliness                        0.00000000000000465 ***
#> online_boarding                   < 0.0000000000000002 ***
#> departure_delay_in_minutes                    0.000955 ***
#> arrival_delay_in_minutes          < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 143114  on 103903  degrees of freedom
#> Residual deviance:  79746  on 103880  degrees of freedom
#> AIC: 79794
#> 
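#> Number of Fisher Scoring iterations: 5

To interpret a coefficient, we exponentiate it to turn the log-odds into an odds ratio:

customer_type1 <- 1.988820649
exp(customer_type1)
#> [1] 7.306911

Example interpretation: holding the other predictors constant, the odds of being satisfied for a loyal customer (customer_type = 1) are about 7.31 times the odds for a disloyal customer.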
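Note that a logistic regression coefficient acts on the odds, not directly on the probability. The sketch below shows the difference, assuming a made-up baseline in which a disloyal customer has a 30% chance of being satisfied (`p_disloyal` is hypothetical, the coefficient is copied from the summary above):

```r
# Coefficient for customer_type1, copied from the model summary
beta_loyal <- 1.988820649

# Hypothetical baseline: a disloyal customer with a 30% chance of being satisfied
p_disloyal <- 0.30
odds_disloyal <- p_disloyal / (1 - p_disloyal)

# The odds ratio multiplies the odds, not the probability
odds_loyal <- odds_disloyal * exp(beta_loyal)

# Convert the new odds back into a probability
p_loyal <- odds_loyal / (1 + odds_loyal)
round(p_loyal, 3)
```

Under this assumed baseline the probability rises to roughly 0.76. It is not 0.30 multiplied by 7.31, which would not even be a valid probability.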

5.1.3 Prediction

log_prob <-  predict(model_logreg,
                     newdata = airline_test,
                     type = "response")

log_label <-  as.factor(ifelse(log_prob > 0.5,
                               yes = "1",
                               no = "0"))

head(log_label)
#> 1 2 3 4 5 6 
#> 0 0 0 1 1 0 
#> Levels: 0 1
class(log_label)
#> [1] "factor"

5.2 K-Nearest Neighbor

K-NN relies on distances between observations, so it works best with numeric predictors. In this pre-processing step, we therefore drop the categorical variables (gender, customer_type, type_of_travel, and class) from the predictors and separate the predictors from the target for both the train and test sets.

# predictor
airline_train_x <- airline_train %>% select(-c(satisfaction, gender, customer_type, type_of_travel, class))
airline_test_x <- airline_test %>% select(-c(satisfaction, gender, customer_type, type_of_travel, class))

# target
airline_train_y <- airline_train %>% select(satisfaction)
airline_test_y <- airline_test %>% select(satisfaction)
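To see why numeric predictors matter, recall that knn() from the class package compares observations by Euclidean distance. Here is a minimal sketch with two hypothetical passengers (`passenger_a` and `passenger_b` are made up):

```r
# Two hypothetical passengers described by numeric ratings only
passenger_a <- c(seat_comfort = 4, cleanliness = 5, online_boarding = 1)
passenger_b <- c(seat_comfort = 1, cleanliness = 5, online_boarding = 5)

# Euclidean distance: square root of the summed squared differences
euclid <- sqrt(sum((passenger_a - passenger_b)^2))
euclid
#> [1] 5
```

A label such as "Eco" or "Business" has no natural place in this subtraction, which is why the categorical columns are left out here.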

5.2.1 Scaling

airline_train_xs <- scale(x = airline_train_x)
airline_test_xs <- scale(x = airline_test_x, 
                         center = attr(airline_train_xs, "scaled:center"), 
                         scale = attr(airline_train_xs, "scaled:scale"))
head(airline_test_xs)
#>             age flight_distance seat_comfort departure_arrival_time_convenient
#> [1,]  0.5001273       0.4693603    -2.037774                         -1.958847
#> [2,]  2.0248780      -1.5856046    -2.037774                         -1.958847
#> [3,] -0.6268623      -0.0857724    -2.037774                         -1.958847
#> [4,]  1.7597040      -1.7092920    -2.037774                         -1.958847
#> [5,]  1.0967689      -1.8592752    -2.037774                         -1.958847
#> [6,]  1.2293559      -1.8290838    -2.037774                         -1.958847
#>      food_and_drink gate_location inflight_wifi_service inflight_entertainment
#> [1,]      -1.977142   0.007046417            -2.4659491             -1.0284106
#> [2,]      -1.977142   0.007046417             0.5675381             -0.2859376
#> [3,]      -1.977142   0.007046417            -0.9492055             -2.5133567
#> [4,]      -1.977142   0.007046417            -0.9492055              1.1990085
#> [5,]      -1.977142   0.007046417             1.3259099             -0.2859376
#> [6,]      -1.977142   0.007046417            -0.1908337             -0.2859376
#>      online_support ease_of_online_booking on_board_service leg_room_service
#> [1,]     -1.1638199             -0.3594178        0.4231493         0.399446
#> [2,]      0.3670479             -1.1250754       -1.1504694        -2.696657
#> [3,]     -1.1638199             -1.1250754        1.2099587         0.399446
#> [4,]      1.1324818              1.1718973        1.2099587        -2.696657
#> [5,]      1.1324818              0.4062398        0.4231493        -2.696657
#> [6,]     -0.3983860             -0.3594178       -0.3636601        -2.696657
#>      baggage_handling checkin_service cleanliness online_boarding
#> [1,]        0.2654027      -1.0643002  -0.6111931      -1.0418249
#> [2,]       -1.4625503       0.5217398  -1.4794651       1.2679964
#> [3,]        1.1293792       1.3147598   0.2570788      -1.0418249
#> [4,]        1.1293792       1.3147598   1.1253508      -0.2718845
#> [5,]       -2.3265268       1.3147598   0.2570788       0.4980560
#> [6,]       -2.3265268      -1.0643002  -0.6111931       1.2679964
#>      departure_delay_in_minutes arrival_delay_in_minutes
#> [1,]                 7.73134704             7.5236118680
#> [2,]                -0.38470109            -0.3898527594
#> [3,]                -0.38470109            -0.3898527594
#> [4,]                 0.06037252            -0.0006659745
#> [5,]                -0.38470109            -0.3898527594
#> [6,]                 0.84579653             0.8555449524
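The important detail above is that the test set is scaled with the mean and standard deviation of the train set, not its own. A small sketch with hypothetical ages (`train_x` and `test_x` are made up) shows that this matches a manual z-score:

```r
# Hypothetical train and test predictors (ages in years)
train_x <- matrix(c(20, 30, 40, 50, 60), ncol = 1, dimnames = list(NULL, "age"))
test_x  <- matrix(c(40, 70), ncol = 1, dimnames = list(NULL, "age"))

# Scale the train set; scale() stores the mean and sd as attributes
train_xs <- scale(train_x)

# Reuse the training mean and standard deviation for the test set
test_xs <- scale(test_x,
                 center = attr(train_xs, "scaled:center"),
                 scale  = attr(train_xs, "scaled:scale"))

# Equivalent manual z-score: (x - train mean) / train sd
manual <- (test_x[, "age"] - mean(train_x)) / sd(train_x)
all.equal(as.numeric(test_xs), manual)
```

Scaling the test set with its own statistics would leak information about the test data and put the two sets on slightly different scales.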

5.2.2 Prediction

round(sqrt(nrow(airline_train)))
#> [1] 322

Following the square-root rule of thumb, we use k = 322.
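One thing to keep in mind: with two classes, an even k can in principle end in a tied vote, and knn() breaks ties at random. A common adjustment, which is not applied in this report, is to move to the nearest odd number. A sketch, where `k_odd` is a hypothetical name:

```r
# Number of rows in the train set (80% of 129,880)
n_train <- 103904

# Rule of thumb: k is about the square root of the training size
k <- round(sqrt(n_train))
k

# An odd k rules out tied votes in a two-class problem
k_odd <- if (k %% 2 == 0) k + 1 else k
k_odd
```

Whether 322 or 323 gives better results would have to be checked by rerunning knn(); the sketch only shows how the value could be derived.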

knn_label <- knn(train = airline_train_xs,
                 test = airline_test_xs,
                 cl = airline_train_y$satisfaction, 
                 k = 322)
head(knn_label)
#> [1] 0 1 0 1 1 1
#> Levels: 0 1
class(knn_label)
#> [1] "factor"

6 Model Evaluation

To evaluate our models, we can use the confusionMatrix() function from the caret library. A confusion matrix is a table that shows four categories: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). From these, we can compute four metrics to evaluate each model: Accuracy, Sensitivity (also called Recall), Specificity, and Precision.

Confusion Matrix Categories

                 Actual No         Actual Yes
Predicted No     True Negative     False Negative
Predicted Yes    False Positive    True Positive

\[ Accuracy = \frac{TP + TN} {TP + TN + FP + FN } \]

\[ Sensitivity = \frac{TP} {TP + FN}\]

\[ Specificity = \frac{TN}{TN+FP}\]

\[ Precision = \frac{TP} {TP + FP}\]

6.1 Log Reg

cm_log <- confusionMatrix(data = log_label,
                           reference = airline_test$satisfaction,
                           positive = "1")
cm_log
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0  9587  2177
#>          1  2158 12054
#>                                              
#>                Accuracy : 0.8331             
#>                  95% CI : (0.8285, 0.8376)   
#>     No Information Rate : 0.5479             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.6632             
#>                                              
#>  Mcnemar's Test P-Value : 0.7846             
#>                                              
#>             Sensitivity : 0.8470             
#>             Specificity : 0.8163             
#>          Pos Pred Value : 0.8482             
#>          Neg Pred Value : 0.8149             
#>              Prevalence : 0.5479             
#>          Detection Rate : 0.4640             
#>    Detection Prevalence : 0.5471             
#>       Balanced Accuracy : 0.8316             
#>                                              
#>        'Positive' Class : 1                  
#> 
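The four formulas can be checked by hand against the logistic regression confusion matrix. The counts below are copied from the output above, with 1 as the positive class:

```r
# Counts copied from the logistic regression confusion matrix (positive class = 1)
TP <- 12054  # predicted 1, actual 1
TN <- 9587   # predicted 0, actual 0
FP <- 2158   # predicted 1, actual 0
FN <- 2177   # predicted 0, actual 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

The results agree with the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines reported by confusionMatrix(), where Pos Pred Value is caret's name for precision.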

6.2 K-NN

cm_knn <- confusionMatrix(data = knn_label,
                          reference = airline_test$satisfaction,
                          positive = "1")
cm_knn
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 10595  1980
#>          1  1150 12251
#>                                                
#>                Accuracy : 0.8795               
#>                  95% CI : (0.8755, 0.8834)     
#>     No Information Rate : 0.5479               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7583               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8609               
#>             Specificity : 0.9021               
#>          Pos Pred Value : 0.9142               
#>          Neg Pred Value : 0.8425               
#>              Prevalence : 0.5479               
#>          Detection Rate : 0.4716               
#>    Detection Prevalence : 0.5159               
#>       Balanced Accuracy : 0.8815               
#>                                                
#>        'Positive' Class : 1                    
#> 

7 Conclusion

As we can see in the Model Evaluation, both of our models perform well on all four metrics: Accuracy, Sensitivity (Recall), Specificity, and Precision. More precisely, our K-Nearest Neighbor model performs better on every one of them. In general, the choice depends on what we want to achieve: if we only care about the positive, or “satisfied”, class, we may prioritize the model with the higher precision; if we want to pay attention to both correct positive and correct negative outcomes, we may prioritize the model with the higher accuracy. In this case both criteria point the same way, so we should go with the K-NN model.
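To put the two models side by side, the reported metrics can be collected in a small table. The values below are copied from the two confusionMatrix() outputs, and the data frame name `comparison` is just an illustration:

```r
# Metrics copied from the two confusionMatrix() outputs
comparison <- data.frame(
  metric = c("Accuracy", "Sensitivity", "Specificity", "Precision"),
  logreg = c(0.8331, 0.8470, 0.8163, 0.8482),
  knn    = c(0.8795, 0.8609, 0.9021, 0.9142)
)

# Positive values mean K-NN scores higher on that metric
comparison$difference <- comparison$knn - comparison$logreg
comparison

# TRUE if K-NN is ahead on every metric
all(comparison$difference > 0)
```

With the fitted objects at hand, the same numbers could presumably be pulled from cm_log and cm_knn through their overall and byClass elements instead of being typed in.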

Thank you for taking the time to read my report. Feel free to comment or give me feedback. Get in touch with me here!