Hello Readers! For this project, we use a Kaggle dataset on airline customer satisfaction, which records a range of factors for each passenger. We will build a Logistic Regression model and a K-Nearest Neighbors model to predict airline passenger satisfaction.
Dataset source: https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction
Before starting the analysis, we need to load the required packages.
library(tidyverse)
library(ggplot2)
library(class)
library(caret)
library(ggmosaic)
library(kableExtra)
library(lmtest)
library(car)
We need data for the analysis, so next we load the dataset.
airlines <- read.csv("train.csv")
head(airlines)
To get to know the dataset, here is an explanation of each variable:
Gender: Gender of the passengers (Female, Male)
Customer Type: The customer type (Loyal customer, disloyal customer)
Age: The actual age of the passengers
Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
Flight distance: The flight distance of this journey
Inflight wifi service: Satisfaction level of the inflight wifi service (0:Not Applicable;1-5)
Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient
Ease of Online booking: Satisfaction level of online booking
Gate location: Satisfaction level of Gate location
Food and drink: Satisfaction level of Food and drink
Online boarding: Satisfaction level of online boarding
Seat comfort: Satisfaction level of Seat comfort
Inflight entertainment: Satisfaction level of inflight entertainment
On-board service: Satisfaction level of On-board service
Leg room service: Satisfaction level of Leg room service
Baggage handling: Satisfaction level of baggage handling
Check-in service: Satisfaction level of Check-in service
Inflight service: Satisfaction level of inflight service
Cleanliness: Satisfaction level of Cleanliness
Departure Delay in Minutes: Departure delay, in minutes
Arrival Delay in Minutes: Arrival delay, in minutes
Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)
Let us check each column’s data type.
glimpse(airlines)
#> Rows: 103,904
#> Columns: 25
#> $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11~
#> $ id <int> 70172, 5047, 110028, 24026, 119299, ~
#> $ Gender <chr> "Male", "Male", "Female", "Female", ~
#> $ Customer.Type <chr> "Loyal Customer", "disloyal Customer~
#> $ Age <int> 13, 25, 26, 25, 61, 26, 47, 52, 41, ~
#> $ Type.of.Travel <chr> "Personal Travel", "Business travel"~
#> $ Class <chr> "Eco Plus", "Business", "Business", ~
#> $ Flight.Distance <int> 460, 235, 1142, 562, 214, 1180, 1276~
#> $ Inflight.wifi.service <int> 3, 3, 2, 2, 3, 3, 2, 4, 1, 3, 4, 2, ~
#> $ Departure.Arrival.time.convenient <int> 4, 2, 2, 5, 3, 4, 4, 3, 2, 3, 5, 4, ~
#> $ Ease.of.Online.booking <int> 3, 3, 2, 5, 3, 2, 2, 4, 2, 3, 5, 2, ~
#> $ Gate.location <int> 1, 3, 2, 5, 3, 1, 3, 4, 2, 4, 4, 2, ~
#> $ Food.and.drink <int> 5, 1, 5, 2, 4, 1, 2, 5, 4, 2, 2, 1, ~
#> $ Online.boarding <int> 3, 3, 5, 2, 5, 2, 2, 5, 3, 3, 5, 2, ~
#> $ Seat.comfort <int> 5, 1, 5, 2, 5, 1, 2, 5, 3, 3, 2, 1, ~
#> $ Inflight.entertainment <int> 5, 1, 5, 2, 3, 1, 2, 5, 1, 2, 2, 1, ~
#> $ On.board.service <int> 4, 1, 4, 2, 3, 3, 3, 5, 1, 2, 3, 1, ~
#> $ Leg.room.service <int> 3, 5, 3, 5, 4, 4, 3, 5, 2, 3, 3, 2, ~
#> $ Baggage.handling <int> 4, 3, 4, 3, 4, 4, 4, 5, 1, 4, 5, 5, ~
#> $ Checkin.service <int> 4, 1, 4, 1, 3, 4, 3, 4, 4, 4, 3, 5, ~
#> $ Inflight.service <int> 5, 4, 4, 4, 3, 4, 5, 5, 1, 3, 5, 5, ~
#> $ Cleanliness <int> 5, 1, 5, 2, 3, 1, 2, 4, 2, 2, 2, 1, ~
#> $ Departure.Delay.in.Minutes <int> 25, 1, 0, 11, 0, 0, 9, 4, 0, 0, 0, 0~
#> $ Arrival.Delay.in.Minutes <dbl> 18, 6, 0, 9, 0, 0, 23, 0, 0, 0, 0, 0~
#> $ satisfaction <chr> "neutral or dissatisfied", "neutral ~
After checking the data type of each column, we found that some columns do not have the required data type. We need to convert these columns to make the analysis easier. We also found that some columns (X and id) should be dropped because they carry no useful information for the analysis. To simplify our modeling, we recode the "neutral or dissatisfied" category as 0 and the "satisfied" category as 1.
df_airlines <- airlines %>%
  select(-X, -id) %>%
  mutate_if(is.character, as.factor) %>%
  mutate(satisfaction =
           factor(satisfaction,
                  levels = c("neutral or dissatisfied", "satisfied"),
                  labels = c(0, 1)),
         Customer.Type =
           factor(Customer.Type,
                  levels = c("disloyal Customer", "Loyal Customer"),
                  labels = c(0, 1)))
glimpse(df_airlines)
#> Rows: 103,904
#> Columns: 23
#> $ Gender <fct> Male, Male, Female, Female, Male, Fe~
#> $ Customer.Type <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, ~
#> $ Age <int> 13, 25, 26, 25, 61, 26, 47, 52, 41, ~
#> $ Type.of.Travel <fct> Personal Travel, Business travel, Bu~
#> $ Class <fct> Eco Plus, Business, Business, Busine~
#> $ Flight.Distance <int> 460, 235, 1142, 562, 214, 1180, 1276~
#> $ Inflight.wifi.service <int> 3, 3, 2, 2, 3, 3, 2, 4, 1, 3, 4, 2, ~
#> $ Departure.Arrival.time.convenient <int> 4, 2, 2, 5, 3, 4, 4, 3, 2, 3, 5, 4, ~
#> $ Ease.of.Online.booking <int> 3, 3, 2, 5, 3, 2, 2, 4, 2, 3, 5, 2, ~
#> $ Gate.location <int> 1, 3, 2, 5, 3, 1, 3, 4, 2, 4, 4, 2, ~
#> $ Food.and.drink <int> 5, 1, 5, 2, 4, 1, 2, 5, 4, 2, 2, 1, ~
#> $ Online.boarding <int> 3, 3, 5, 2, 5, 2, 2, 5, 3, 3, 5, 2, ~
#> $ Seat.comfort <int> 5, 1, 5, 2, 5, 1, 2, 5, 3, 3, 2, 1, ~
#> $ Inflight.entertainment <int> 5, 1, 5, 2, 3, 1, 2, 5, 1, 2, 2, 1, ~
#> $ On.board.service <int> 4, 1, 4, 2, 3, 3, 3, 5, 1, 2, 3, 1, ~
#> $ Leg.room.service <int> 3, 5, 3, 5, 4, 4, 3, 5, 2, 3, 3, 2, ~
#> $ Baggage.handling <int> 4, 3, 4, 3, 4, 4, 4, 5, 1, 4, 5, 5, ~
#> $ Checkin.service <int> 4, 1, 4, 1, 3, 4, 3, 4, 4, 4, 3, 5, ~
#> $ Inflight.service <int> 5, 4, 4, 4, 3, 4, 5, 5, 1, 3, 5, 5, ~
#> $ Cleanliness <int> 5, 1, 5, 2, 3, 1, 2, 4, 2, 2, 2, 1, ~
#> $ Departure.Delay.in.Minutes <int> 25, 1, 0, 11, 0, 0, 9, 4, 0, 0, 0, 0~
#> $ Arrival.Delay.in.Minutes <dbl> 18, 6, 0, 9, 0, 0, 23, 0, 0, 0, 0, 0~
#> $ satisfaction <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, ~
All the data types are now correct, so we are ready for the next step.
Next, we check whether there are any missing values in the dataset.
colSums(is.na(df_airlines))
#> Gender Customer.Type
#> 0 0
#> Age Type.of.Travel
#> 0 0
#> Class Flight.Distance
#> 0 0
#> Inflight.wifi.service Departure.Arrival.time.convenient
#> 0 0
#> Ease.of.Online.booking Gate.location
#> 0 0
#> Food.and.drink Online.boarding
#> 0 0
#> Seat.comfort Inflight.entertainment
#> 0 0
#> On.board.service Leg.room.service
#> 0 0
#> Baggage.handling Checkin.service
#> 0 0
#> Inflight.service Cleanliness
#> 0 0
#> Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
#> 0 310
#> satisfaction
#> 0
After checking for NA values, we found that 310 observations are NA in the Arrival.Delay.in.Minutes column. Here, we assume that these missing values are 0.
# Replace missing arrival delays with a numeric 0, so no type conversion is needed
df_airlines$Arrival.Delay.in.Minutes <-
  ifelse(is.na(df_airlines$Arrival.Delay.in.Minutes),
         0, df_airlines$Arrival.Delay.in.Minutes)
colSums(is.na(df_airlines))
#> Gender Customer.Type
#> 0 0
#> Age Type.of.Travel
#> 0 0
#> Class Flight.Distance
#> 0 0
#> Inflight.wifi.service Departure.Arrival.time.convenient
#> 0 0
#> Ease.of.Online.booking Gate.location
#> 0 0
#> Food.and.drink Online.boarding
#> 0 0
#> Seat.comfort Inflight.entertainment
#> 0 0
#> On.board.service Leg.room.service
#> 0 0
#> Baggage.handling Checkin.service
#> 0 0
#> Inflight.service Cleanliness
#> 0 0
#> Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
#> 0 0
#> satisfaction
#> 0
Now there are no NA values in the data, so let us go to the next step.
To get to know our data better, let us check the summary.
summary(df_airlines)
#> Gender Customer.Type Age Type.of.Travel
#> Female:52727 0:18981 Min. : 7.00 Business travel:71655
#> Male :51177 1:84923 1st Qu.:27.00 Personal Travel:32249
#> Median :40.00
#> Mean :39.38
#> 3rd Qu.:51.00
#> Max. :85.00
#> Class Flight.Distance Inflight.wifi.service
#> Business:49665 Min. : 31 Min. :0.00
#> Eco :46745 1st Qu.: 414 1st Qu.:2.00
#> Eco Plus: 7494 Median : 843 Median :3.00
#> Mean :1189 Mean :2.73
#> 3rd Qu.:1743 3rd Qu.:4.00
#> Max. :4983 Max. :5.00
#> Departure.Arrival.time.convenient Ease.of.Online.booking Gate.location
#> Min. :0.00 Min. :0.000 Min. :0.000
#> 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.000
#> Median :3.00 Median :3.000 Median :3.000
#> Mean :3.06 Mean :2.757 Mean :2.977
#> 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :5.00 Max. :5.000 Max. :5.000
#> Food.and.drink Online.boarding Seat.comfort Inflight.entertainment
#> Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.000
#> 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.000
#> Median :3.000 Median :3.00 Median :4.000 Median :4.000
#> Mean :3.202 Mean :3.25 Mean :3.439 Mean :3.358
#> 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:5.000 3rd Qu.:4.000
#> Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000
#> On.board.service Leg.room.service Baggage.handling Checkin.service
#> Min. :0.000 Min. :0.000 Min. :1.000 Min. :0.000
#> 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000
#> Median :4.000 Median :4.000 Median :4.000 Median :3.000
#> Mean :3.382 Mean :3.351 Mean :3.632 Mean :3.304
#> 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
#> Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
#> Inflight.service Cleanliness Departure.Delay.in.Minutes
#> Min. :0.00 Min. :0.000 Min. : 0.00
#> 1st Qu.:3.00 1st Qu.:2.000 1st Qu.: 0.00
#> Median :4.00 Median :3.000 Median : 0.00
#> Mean :3.64 Mean :3.286 Mean : 14.82
#> 3rd Qu.:5.00 3rd Qu.:4.000 3rd Qu.: 12.00
#> Max. :5.00 Max. :5.000 Max. :1592.00
#> Arrival.Delay.in.Minutes satisfaction
#> Min. : 0.00 0:58879
#> 1st Qu.: 0.00 1:45025
#> Median : 0.00
#> Mean : 15.13
#> 3rd Qu.: 13.00
#> Max. :1584.00
The histograms below show the frequency distribution of each numerical variable.
ggplot(gather(df_airlines %>% select_if(is.numeric)), aes(value)) +
  geom_histogram(bins = 10, fill = "blue") +
  facet_wrap(~key, scales = 'free_x')
The bar charts below show the frequency of each level of the categorical variables.
# geom_bar() counts observations per level; it has no `bins` argument
ggplot(gather(df_airlines %>% select_if(is.factor)), aes(value)) +
  geom_bar(fill = "firebrick") +
  facet_wrap(~key, scales = 'free_x') +
  labs(x = "Category", y = "Count")
Some insights from the summary data :
Logistic regression is a classification algorithm that fits a regression curve, y = f(x), where y is a categorical variable. When the dependent variable has exactly two categories, the model is called binomial logistic regression; when it has more than two, it is called multinomial logistic regression.
For our dataset, the model is binomial because the target variable takes the values 1 and 0, where 1 is satisfied and 0 is neutral or dissatisfied.
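To make the link between the regression curve and a predicted probability concrete, here is a minimal sketch in base R. The function name `logistic` and the example values are illustrative assumptions and are not taken from the airline data.

```r
# Logistic (inverse-logit) function: maps a linear predictor to a probability.
logistic <- function(eta) 1 / (1 + exp(-eta))

# Illustrative linear-predictor values (hypothetical, not from the dataset).
eta <- c(-2, 0, 2)
p <- logistic(eta)

# Taking the log-odds of p recovers the linear predictor.
log_odds <- log(p / (1 - p))

print(round(p, 3))
print(round(log_odds, 3))
```

Base R's `plogis()` computes the same function; it is written out here only to show the formula.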
Now let us check the proportion of our target variable.
prop.table(table(df_airlines$satisfaction))
#>
#> 0 1
#> 0.5666673 0.4333327
With about 57% of observations in class 0 and 43% in class 1, the target variable is reasonably balanced.
Now we split the data into training and test sets, with 80% of the rows used for training.
RNGkind(sample.kind = "Rounding")
set.seed(901)
index <- sample(nrow(df_airlines),
                nrow(df_airlines) * 0.8)
airlines_train <- df_airlines[index, ]
airlines_test <- df_airlines[-index, ]
Now let us recheck the class proportions in the training and test data.
prop.table(table(airlines_train$satisfaction))
#>
#> 0 1
#> 0.565752 0.434248
prop.table(table(airlines_test$satisfaction))
#>
#> 0 1
#> 0.5703287 0.4296713
The class proportions in both the training and test data are close to those of the full dataset, so both remain reasonably balanced.
Now let us fit the logistic regression model.
model_reg1 <- glm(satisfaction ~ ., data = airlines_train, family = "binomial")
summary(model_reg1)
#>
#> Call:
#> glm(formula = satisfaction ~ ., family = "binomial", data = airlines_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8491 -0.4920 -0.1761 0.3876 4.0093
#>
#> Coefficients:
#> Estimate Std. Error z value
#> (Intercept) -7.83464868 0.08774368 -89.290
#> GenderMale 0.04581272 0.02174589 2.107
#> Customer.Type1 2.01293177 0.03336690 60.327
#> Age -0.00778677 0.00079326 -9.816
#> Type.of.TravelPersonal Travel -2.71347682 0.03513671 -77.226
#> ClassEco -0.74002829 0.02867838 -25.804
#> ClassEco Plus -0.85652829 0.04636535 -18.473
#> Flight.Distance -0.00001895 0.00001262 -1.502
#> Inflight.wifi.service 0.38996438 0.01279222 30.484
#> Departure.Arrival.time.convenient -0.12084981 0.00917830 -13.167
#> Ease.of.Online.booking -0.14870927 0.01269358 -11.715
#> Gate.location 0.03313043 0.01024700 3.233
#> Food.and.drink -0.03359285 0.01193826 -2.814
#> Online.boarding 0.60609741 0.01142894 53.032
#> Seat.comfort 0.06019415 0.01250315 4.814
#> Inflight.entertainment 0.07843071 0.01593490 4.922
#> On.board.service 0.29769215 0.01137631 26.168
#> Leg.room.service 0.25346328 0.00951696 26.633
#> Baggage.handling 0.13561299 0.01273426 10.649
#> Checkin.service 0.32532543 0.00955669 34.042
#> Inflight.service 0.11912393 0.01341064 8.883
#> Cleanliness 0.22274202 0.01353198 16.460
#> Departure.Delay.in.Minutes 0.00398071 0.00101737 3.913
#> Arrival.Delay.in.Minutes -0.00864276 0.00100776 -8.576
#> Pr(>|z|)
#> (Intercept) < 0.0000000000000002 ***
#> GenderMale 0.03514 *
#> Customer.Type1 < 0.0000000000000002 ***
#> Age < 0.0000000000000002 ***
#> Type.of.TravelPersonal Travel < 0.0000000000000002 ***
#> ClassEco < 0.0000000000000002 ***
#> ClassEco Plus < 0.0000000000000002 ***
#> Flight.Distance 0.13320
#> Inflight.wifi.service < 0.0000000000000002 ***
#> Departure.Arrival.time.convenient < 0.0000000000000002 ***
#> Ease.of.Online.booking < 0.0000000000000002 ***
#> Gate.location 0.00122 **
#> Food.and.drink 0.00489 **
#> Online.boarding < 0.0000000000000002 ***
#> Seat.comfort 0.000001477 ***
#> Inflight.entertainment 0.000000857 ***
#> On.board.service < 0.0000000000000002 ***
#> Leg.room.service < 0.0000000000000002 ***
#> Baggage.handling < 0.0000000000000002 ***
#> Checkin.service < 0.0000000000000002 ***
#> Inflight.service < 0.0000000000000002 ***
#> Cleanliness < 0.0000000000000002 ***
#> Departure.Delay.in.Minutes 0.000091248 ***
#> Arrival.Delay.in.Minutes < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 113791 on 83122 degrees of freedom
#> Residual deviance: 55582 on 83099 degrees of freedom
#> AIC: 55630
#>
#> Number of Fisher Scoring iterations: 6
We found that one variable, Flight.Distance, is not significant (p = 0.133). We could try dropping this variable and checking the AIC.
First, however, let us check this model for multicollinearity. If some variables are collinear, we can drop those variables as well.
vif(model_reg1)
#> GVIF Df GVIF^(1/(2*Df))
#> Gender 1.007015 1 1.003502
#> Customer.Type 1.606263 1 1.267384
#> Age 1.179469 1 1.086034
#> Type.of.Travel 1.868529 1 1.366941
#> Class 1.677552 2 1.138070
#> Flight.Distance 1.358093 1 1.165372
#> Inflight.wifi.service 2.218141 1 1.489343
#> Departure.Arrival.time.convenient 1.725875 1 1.313726
#> Ease.of.Online.booking 2.600178 1 1.612507
#> Gate.location 1.522327 1 1.233826
#> Food.and.drink 2.015123 1 1.419550
#> Online.boarding 1.486299 1 1.219138
#> Seat.comfort 2.040074 1 1.428312
#> Inflight.entertainment 3.251724 1 1.803254
#> On.board.service 1.638945 1 1.280213
#> Leg.room.service 1.213795 1 1.101723
#> Baggage.handling 1.821675 1 1.349694
#> Checkin.service 1.206278 1 1.098307
#> Inflight.service 2.003023 1 1.415282
#> Cleanliness 2.466384 1 1.570473
#> Departure.Delay.in.Minutes 12.312295 1 3.508888
#> Arrival.Delay.in.Minutes 12.342944 1 3.513253
We found that two variables, Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes, are collinear (VIF > 10).
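For intuition about what a VIF above 10 means, here is a minimal base-R sketch of the textbook formula VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The data are simulated and the variable names `dep_delay`, `arr_delay`, and `age` are hypothetical stand-ins. Note that `car::vif()` works from the fitted model itself, so its values for a `glm` need not match this formula exactly.

```r
# Simulated predictors (an assumption for illustration, not the airline data):
# arr_delay is built to track dep_delay closely, age is independent.
set.seed(1)
n <- 500
dep_delay <- rexp(n, rate = 1 / 15)
arr_delay <- dep_delay + rnorm(n, sd = 3)
age <- rnorm(n, mean = 40, sd = 12)
X <- data.frame(dep_delay, arr_delay, age)

# VIF of each predictor: regress it on the remaining predictors,
# then apply 1 / (1 - R^2).
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  fit <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
```

The two delay variables explain each other almost completely, so their VIFs are large, while the independent variable stays close to 1.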
Now let us check the linearity assumption for model_reg1.
# linearity check
data.frame(prediction = model_reg1$fitted.values,
           error = model_reg1$residuals) %>%
  ggplot(aes(prediction, error)) +
  geom_hline(yintercept = 0) +
  geom_point() +
  geom_smooth() +
  theme_bw()
In the plot above, there is little to no discernible pattern in the residuals, so we conclude that the linearity assumption holds.
Since model_reg1 does not pass the multicollinearity check, let us drop the Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes variables. Note that the code below keeps Flight.Distance, so it still appears in model_reg2 even though it is not significant.
df_airlines2 <- airlines_train %>%
  select(-Departure.Delay.in.Minutes, -Arrival.Delay.in.Minutes)
model_reg2 <- glm(satisfaction ~ ., data = df_airlines2, family = "binomial")
summary(model_reg2)
#>
#> Call:
#> glm(formula = satisfaction ~ ., family = "binomial", data = df_airlines2)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.8289 -0.4936 -0.1780 0.3933 4.0203
#>
#> Coefficients:
#> Estimate Std. Error z value
#> (Intercept) -7.88356309 0.08744342 -90.156
#> GenderMale 0.04561210 0.02167814 2.104
#> Customer.Type1 1.99598203 0.03324612 60.037
#> Age -0.00750104 0.00079017 -9.493
#> Type.of.TravelPersonal Travel -2.69215847 0.03502039 -76.874
#> ClassEco -0.74621577 0.02859797 -26.093
#> ClassEco Plus -0.86572867 0.04624346 -18.721
#> Flight.Distance -0.00001902 0.00001257 -1.514
#> Inflight.wifi.service 0.39152928 0.01274685 30.716
#> Departure.Arrival.time.convenient -0.12068981 0.00916195 -13.173
#> Ease.of.Online.booking -0.14382242 0.01264437 -11.374
#> Gate.location 0.02984078 0.01022146 2.919
#> Food.and.drink -0.02484845 0.01181987 -2.102
#> Online.boarding 0.60205710 0.01135460 53.023
#> Seat.comfort 0.05835508 0.01246703 4.681
#> Inflight.entertainment 0.08305960 0.01584349 5.243
#> On.board.service 0.29867436 0.01134616 26.324
#> Leg.room.service 0.24545388 0.00949448 25.852
#> Baggage.handling 0.12608496 0.01270955 9.920
#> Checkin.service 0.32031468 0.00952425 33.631
#> Inflight.service 0.13363797 0.01334041 10.018
#> Cleanliness 0.21200683 0.01343995 15.774
#> Pr(>|z|)
#> (Intercept) < 0.0000000000000002 ***
#> GenderMale 0.03537 *
#> Customer.Type1 < 0.0000000000000002 ***
#> Age < 0.0000000000000002 ***
#> Type.of.TravelPersonal Travel < 0.0000000000000002 ***
#> ClassEco < 0.0000000000000002 ***
#> ClassEco Plus < 0.0000000000000002 ***
#> Flight.Distance 0.13015
#> Inflight.wifi.service < 0.0000000000000002 ***
#> Departure.Arrival.time.convenient < 0.0000000000000002 ***
#> Ease.of.Online.booking < 0.0000000000000002 ***
#> Gate.location 0.00351 **
#> Food.and.drink 0.03553 *
#> Online.boarding < 0.0000000000000002 ***
#> Seat.comfort 0.000002858 ***
#> Inflight.entertainment 0.000000158 ***
#> On.board.service < 0.0000000000000002 ***
#> Leg.room.service < 0.0000000000000002 ***
#> Baggage.handling < 0.0000000000000002 ***
#> Checkin.service < 0.0000000000000002 ***
#> Inflight.service < 0.0000000000000002 ***
#> Cleanliness < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 113791 on 83122 degrees of freedom
#> Residual deviance: 55899 on 83101 degrees of freedom
#> AIC: 55943
#>
#> Number of Fisher Scoring iterations: 5
Now let us check this model for multicollinearity.
vif(model_reg2)
#> GVIF Df GVIF^(1/(2*Df))
#> Gender 1.006973 1 1.003480
#> Customer.Type 1.600231 1 1.265003
#> Age 1.177436 1 1.085097
#> Type.of.Travel 1.860462 1 1.363987
#> Class 1.677943 2 1.138136
#> Flight.Distance 1.358874 1 1.165708
#> Inflight.wifi.service 2.216832 1 1.488903
#> Departure.Arrival.time.convenient 1.727893 1 1.314493
#> Ease.of.Online.booking 2.597454 1 1.611662
#> Gate.location 1.524117 1 1.234551
#> Food.and.drink 1.986777 1 1.409531
#> Online.boarding 1.474444 1 1.214267
#> Seat.comfort 2.039703 1 1.428182
#> Inflight.entertainment 3.222060 1 1.795010
#> On.board.service 1.638434 1 1.280013
#> Leg.room.service 1.211370 1 1.100623
#> Baggage.handling 1.818625 1 1.348564
#> Checkin.service 1.205342 1 1.097881
#> Inflight.service 1.991087 1 1.411059
#> Cleanliness 2.445290 1 1.563742
Now let us check the linearity assumption for model_reg2.
# linearity check
data.frame(prediction = model_reg2$fitted.values,
           error = model_reg2$residuals) %>%
  ggplot(aes(prediction, error)) +
  geom_hline(yintercept = 0) +
  geom_point() +
  geom_smooth() +
  theme_bw()
From the plot above, we can conclude that the linearity assumption also holds for model_reg2.
Summary of the two models:
Model <- c("Model_Reg1", "Model_Reg2")
AIC <- c(55630, 55943)
MultiColinearity <- c("yes", "no")
Linearity <- c("yes", "yes")
df <- data.frame(Model, AIC, MultiColinearity, Linearity)
print(df)
#> Model AIC MultiColinearity Linearity
#> 1 Model_Reg1 55630 yes yes
#> 2 Model_Reg2 55943 no yes
We can now safely pick model_reg2 for further analysis. Although its AIC is slightly higher (55943 versus 55630), and therefore slightly worse, this model passes the multicollinearity check.
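As a quick sanity check on the two AIC values, the arithmetic can be reproduced by hand. For a logistic regression on a 0/1 response, the AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below uses the rounded deviances printed in the two summaries, and the variable names are illustrative.

```r
# Residual deviances and coefficient counts read from the two model summaries.
deviance_reg1 <- 55582
k_reg1 <- 24  # intercept + 23 slope coefficients
deviance_reg2 <- 55899
k_reg2 <- 22  # intercept + 21 slope coefficients

# AIC = residual deviance + 2 * number of coefficients (binary response).
aic_reg1 <- deviance_reg1 + 2 * k_reg1
aic_reg2 <- deviance_reg2 + 2 * k_reg2

print(c(model_reg1 = aic_reg1, model_reg2 = aic_reg2))
```

Dropping the two delay variables saves only 4 AIC points in penalty while the deviance rises by more than 300, which is why model_reg2 ends up with the higher AIC.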
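Let us interpret one of the coefficient estimates to make the model_reg2 summary above easier to read.
customer_type1 <- 2.01662055
exp(customer_type1)
#> [1] 7.512893
Interpretation: holding the other variables constant, the odds of being satisfied for customer type 1 (Loyal Customer) are about 7.5 times the odds for a disloyal customer. This is a ratio of odds, not of probabilities. (The value exponentiated here differs slightly from the Customer.Type1 estimate of 1.996 printed in the model_reg2 summary; the interpretation is the same.)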
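Because an odds ratio multiplies odds rather than probabilities, its effect on the probability depends on the starting point. The sketch below applies the odds ratio from the text to a few baseline probabilities; the baselines in `p_base` are hypothetical values chosen for illustration.

```r
# Odds ratio for loyal customers, using the coefficient value from the text.
odds_ratio <- exp(2.01662055)

# Hypothetical baseline probabilities of satisfaction for a disloyal customer.
p_base <- c(0.05, 0.20, 0.50)

# Convert to odds, apply the odds ratio, and convert back to probabilities.
odds_base <- p_base / (1 - p_base)
odds_loyal <- odds_base * odds_ratio
p_loyal <- odds_loyal / (1 + odds_loyal)

print(data.frame(p_base,
                 p_loyal = round(p_loyal, 3),
                 prob_ratio = round(p_loyal / p_base, 2)))
```

The ratio of probabilities is always smaller than the odds ratio and shrinks as the baseline probability grows.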
Now let us predict the probability of satisfaction on the test data with model_reg2 and save it in a new column named pred_result.
airlines_test$pred_result <- predict(object = model_reg2,
                                     newdata = airlines_test,
                                     type = "response")
Now we classify each row of airlines_test based on pred_result and save the label in a new column named pred_label.
airlines_test$pred_label <- ifelse(airlines_test$pred_result < 0.5, 0, 1)
airlines_test$pred_label <- as.factor(airlines_test$pred_label)
head(airlines_test)
str(airlines_test)
#> 'data.frame': 20781 obs. of 25 variables:
#> $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 1 1 1 2 1 ...
#> $ Customer.Type : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#> $ Age : int 13 61 26 47 12 17 43 36 22 35 ...
#> $ Type.of.Travel : Factor w/ 2 levels "Business travel",..: 2 1 2 2 2 2 2 1 2 1 ...
#> $ Class : Factor w/ 3 levels "Business","Eco",..: 3 1 2 2 3 2 2 1 2 1 ...
#> $ Flight.Distance : int 460 214 1180 1276 308 208 752 3347 2342 2611 ...
#> $ Inflight.wifi.service : int 3 3 3 2 2 3 3 3 3 4 ...
#> $ Departure.Arrival.time.convenient: int 4 3 4 4 4 1 5 1 2 5 ...
#> $ Ease.of.Online.booking : int 3 3 2 2 2 3 3 1 3 4 ...
#> $ Gate.location : int 1 3 1 3 2 3 5 1 3 4 ...
#> $ Food.and.drink : int 5 4 1 2 1 5 5 1 3 4 ...
#> $ Online.boarding : int 3 5 2 2 2 3 4 2 3 4 ...
#> $ Seat.comfort : int 5 5 1 2 1 5 5 1 1 4 ...
#> $ Inflight.entertainment : int 5 3 1 2 1 5 3 3 3 3 ...
#> $ On.board.service : int 4 3 3 3 1 2 3 3 2 3 ...
#> $ Leg.room.service : int 3 4 4 3 2 5 3 3 4 4 ...
#> $ Baggage.handling : int 4 4 4 4 5 3 5 3 3 5 ...
#> $ Checkin.service : int 4 3 4 3 5 3 3 2 4 4 ...
#> $ Inflight.service : int 5 3 4 5 5 4 3 3 2 3 ...
#> $ Cleanliness : int 5 3 1 2 1 5 4 2 3 4 ...
#> $ Departure.Delay.in.Minutes : int 25 0 0 9 0 0 52 18 19 109 ...
#> $ Arrival.Delay.in.Minutes : num 18 0 0 23 0 0 29 12 0 120 ...
#> $ satisfaction : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 2 ...
#> $ pred_result : num 0.2007 0.8807 0.033 0.0193 0.0141 ...
#> $ pred_label : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 2 ...
The plot below shows the distribution of the predicted probabilities.
ggplot(data = airlines_test, mapping = aes(x = pred_result)) +
  geom_density(fill = "green", col = "black") +
  geom_vline(xintercept = 0.5, linetype = "dashed") +
  theme_bw()
In the plot above, the density is higher below 0.5, which means the model predicts more passengers as not satisfied than as satisfied.
K-NN works best with numeric predictors. Therefore, in this pre-processing step we drop the categorical variables when building the training and test predictor sets. The code below also drops the two delay variables and Flight.Distance.
# predictor
airlines_train_x <-
  airlines_train %>% select(
    -c(
      satisfaction,
      Gender,
      Customer.Type,
      Type.of.Travel,
      Class,
      Arrival.Delay.in.Minutes,
      Departure.Delay.in.Minutes,
      Flight.Distance
    )
  )
airlines_test_x <-
  airlines_test %>% select(
    -c(
      satisfaction,
      Gender,
      Customer.Type,
      Type.of.Travel,
      Class,
      Arrival.Delay.in.Minutes,
      Departure.Delay.in.Minutes,
      Flight.Distance,
      pred_label,
      pred_result
    )
  )
# target
airlines_train_y <- airlines_train %>% select(satisfaction)
airlines_test_y <- airlines_test %>% select(satisfaction)
# scale the predictors; the test set reuses the training set's center and scale
airlines_train_xs <- scale(airlines_train_x)
airlines_test_xs <- scale(
  airlines_test_x,
  center = attr(airlines_train_xs, "scaled:center"),
  scale = attr(airlines_train_xs, "scaled:scale")
)
head(airlines_test_xs)
#> Age Inflight.wifi.service Departure.Arrival.time.convenient
#> 1 -1.7438403 0.2037476 0.61566676
#> 5 1.4300214 0.2037476 -0.04034416
#> 6 -0.8842528 0.2037476 0.61566676
#> 7 0.5043117 -0.5488682 0.61566676
#> 12 -1.8099624 -0.5488682 0.61566676
#> 22 -1.4793518 0.2037476 -1.35236600
#> Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
#> 1 0.1755811 -1.54772909 1.3537888 -0.1854321
#> 5 0.1755811 0.01852225 0.6010933 1.2961529
#> 6 -0.5399920 -1.54772909 -1.6569932 -0.9262246
#> 7 -0.5399920 0.01852225 -0.9042977 -0.9262246
#> 12 -0.5399920 -0.76460342 -1.6569932 -0.9262246
#> 22 0.1755811 0.01852225 1.3537888 -0.1854321
#> Seat.comfort Inflight.entertainment On.board.service Leg.room.service
#> 1 1.185323 1.2327161 0.4785345 -0.2652092
#> 5 1.185323 -0.2691719 -0.2985313 0.4951201
#> 6 -1.849844 -1.7710600 -0.2985313 0.4951201
#> 7 -1.091052 -1.0201160 -0.2985313 -0.2652092
#> 12 -1.849844 -1.7710600 -1.8526629 -1.0255385
#> 22 1.185323 1.2327161 -1.0755971 1.2554494
#> Baggage.handling Checkin.service Inflight.service Cleanliness
#> 1 0.3149261 0.5497154 1.1555694 1.3073159
#> 5 0.3149261 -0.2407287 -0.5453061 -0.2173036
#> 6 0.3149261 0.5497154 0.3051316 -1.7419232
#> 7 0.3149261 -0.2407287 1.1555694 -0.9796134
#> 12 1.1610845 1.3401596 1.1555694 -1.7419232
#> 22 -0.5312323 -0.2407287 0.3051316 1.3073159
Now let us choose K for further analysis. A common rule of thumb is the square root of the number of training observations.
round(sqrt(nrow(airlines_train_xs)))
#> [1] 288
airlines_knn <- knn(train = airlines_train_xs, test = airlines_test_xs, cl = airlines_train_y$satisfaction, k = 288)
head(airlines_knn)
#> [1] 0 1 0 0 0 0
#> Levels: 0 1
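The square-root rule gives a reasonable starting value rather than a tuned one. One way to compare candidate values of K is leave-one-out cross-validation with `knn.cv()` from the class package. The sketch below is only an illustration: it uses the built-in iris data as a stand-in for the scaled airline predictors, and the grid of K values is an arbitrary assumption.

```r
library(class)

# iris stands in for the scaled airline predictors (assumption for illustration).
x <- scale(iris[, 1:4])
y <- iris$Species

set.seed(901)  # knn.cv() breaks ties at random
k_grid <- c(1, 3, 5, 7, 9, 11, 15)  # hypothetical candidate values of K

# Leave-one-out cross-validated accuracy for each candidate K.
cv_accuracy <- sapply(k_grid, function(k) {
  pred <- knn.cv(train = x, cl = y, k = k)
  mean(pred == y)
})

print(data.frame(k = k_grid, accuracy = round(cv_accuracy, 3)))
best_k <- k_grid[which.max(cv_accuracy)]
print(best_k)
```

On a training set as large as the airline data, leave-one-out over many K values would be slow, so a smaller grid or a random subsample would be a sensible adaptation.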
To evaluate our models, we can use the confusionMatrix() function from the caret package. A confusion matrix is a table that shows four categories: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). From it we can compute four metrics to evaluate each model: Accuracy, Sensitivity (Recall), Specificity, and Precision.
conf_log <- confusionMatrix(data = airlines_test$pred_label, reference = airlines_test$satisfaction, positive = "1")
conf_log
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 10698 1478
#> 1 1154 7451
#>
#> Accuracy : 0.8733
#> 95% CI : (0.8687, 0.8778)
#> No Information Rate : 0.5703
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.7404
#>
#> Mcnemar's Test P-Value : 0.0000000003056
#>
#> Sensitivity : 0.8345
#> Specificity : 0.9026
#> Pos Pred Value : 0.8659
#> Neg Pred Value : 0.8786
#> Prevalence : 0.4297
#> Detection Rate : 0.3585
#> Detection Prevalence : 0.4141
#> Balanced Accuracy : 0.8686
#>
#> 'Positive' Class : 1
#>
conf_knn <- confusionMatrix(data = airlines_knn, reference = as.factor(airlines_test_y$satisfaction), positive = "1")
conf_knn
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 10973 1560
#> 1 879 7369
#>
#> Accuracy : 0.8826
#> 95% CI : (0.8782, 0.887)
#> No Information Rate : 0.5703
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.7583
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8253
#> Specificity : 0.9258
#> Pos Pred Value : 0.8934
#> Neg Pred Value : 0.8755
#> Prevalence : 0.4297
#> Detection Rate : 0.3546
#> Detection Prevalence : 0.3969
#> Balanced Accuracy : 0.8756
#>
#> 'Positive' Class : 1
#>
Summary Model Evaluation:
Model <- c("Logistic Regression", "K-Nearest Neighbor")
Accuracy <- c(0.8733,0.8826)
Recall <- c(0.8345,0.8253 )
specificity <-c(0.9026,0.9258 )
Precision <- c(0.8659,0.8934)
df <- data.frame(Model, Accuracy, Recall, specificity, Precision)
print(df)
#> Model Accuracy Recall specificity Precision
#> 1 Logistic Regression 0.8733 0.8345 0.9026 0.8659
#> 2 K-Nearest Neighbor 0.8826 0.8253 0.9258 0.8934
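The logistic regression row of this table can be recomputed by hand from the four counts in its confusion matrix, which shows how each metric is defined. The counts below are read from the conf_log output, with class 1 ("satisfied") as the positive class; the variable names are illustrative.

```r
# Counts from the logistic regression confusion matrix (positive class = 1).
TP <- 7451   # predicted 1, actual 1
TN <- 10698  # predicted 0, actual 0
FP <- 1154   # predicted 1, actual 0
FN <- 1478   # predicted 0, actual 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)  # also called recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)  # caret reports this as Pos Pred Value

print(round(c(accuracy = accuracy, recall = sensitivity,
              specificity = specificity, precision = precision), 4))
```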
Here are some conclusions we can get from both models:
If we look at the model evaluation summary, both models perform well on all four metrics: Accuracy, Recall, Specificity, and Precision. The K-Nearest Neighbor model is better on Accuracy, Specificity, and Precision, while the Logistic Regression model has slightly higher Recall (0.8345 versus 0.8253).
Therefore, the choice depends on what we want to achieve. For example, if we focus only on the positive or “satisfied” class, we may prioritize the model with the higher precision. This means focusing on our satisfied customers to retain their satisfaction and make them more loyal, for instance with better loyalty programs, services, promotions, or vouchers.
On the other hand, if we want to pay attention to both correct positive and correct negative outcomes, we might prioritize the model with the higher accuracy. With high accuracy, we can focus not only on our True Positive “satisfied” customers but also on the customers behind the “False Positive” and “False Negative” counts. We should focus on these customers to regain their satisfaction by offering discounts, vouchers, promotions, or better services.
In this case, as mentioned above, we would choose the K-NN model, since it is better on three of the four metrics.