Content
Introduction
image source: https://images.pexels.com/photos/1309644/pexels-photo-1309644.jpeg?cs=srgb&dl=pexels-sourav-mishra-1309644.jpg&fm=jpg
Welcome Aboard
According to Bureau of Transportation Statistics, the risk
brought out by COVID-19 barely could stop people flying high in the
United States, more than 640 million scheduled passengers were serviced
by in the U.S. This number illustrates a moderate slope of 27% when
comparing to year 2019. While over 50 domestic airline companies can be
selected by the customers, considerations of price and comfort might
appear in their minds at the first second. However, are they satisfied
with their trip whether the ticker cost $50 or $500?
Study Goals:
- Does basic information, like gender or flight class has significant influence on the overall rating? What male passengers and female passenger value most? What about business class passengers and economy class passengers?
- What are key factors that contribute to “satisfied” feedback? Are there specific relationships among all variables?
- What advice can be provided to airline companies to improve their service?
Data Source:https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction?select=data_dictionary.csv
Data Prepartion
Variables Summary
| Variable Name | Type | Description |
|---|---|---|
| ID | int | Unique passenger identifier |
| Gender | chr | Gender of the passenger (Female/Male) |
| Age | int | Age of the passenger |
| Customer_Type | chr | Type of airline customer (First-time/Returning) |
| Travel_Type | chr | Purpose of the flight (Business/Personal) |
| Class | chr | Travel class in the airplane for the passenger seat |
| Distance | chr | Flight distance in miles |
| Departure_Delay | int | Flight departure delay in minutes |
| Arrival_Delay | int | Flight arrival delay in minutes |
| D_A_Time_Convenience | int | Satisfaction level with the convenience of the flight departure and arrival times from 1 (lowest) to 5 (highest) |
| Online_Book_Ease | int | Satisfaction level with the online booking experience from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Check_in_Service | int | Satisfaction level with the check-in service from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Online_Boarding | int | Satisfaction level with the online boarding experience from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Gate_Location | int | Satisfaction level with the gate location in the airport from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| On_board_Service | int | Satisfaction level with the on-boarding experience from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Seat_Comfort | int | Satisfaction level with the comfort of the airplane seat from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Leg_Room | int | Satisfaction level with the leg room of the airplane seat from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Cleanliness | int | Satisfaction level with the cleanliness of the airplane from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Food_Drink | int | Satisfaction level with the food and drinks on the airplane from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| In_flight_Service | int | Satisfaction level with the in-flight service from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| In_flight_Wifi | int | Satisfaction level with the in-flight WiFi from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| In_flight_Entertainment | int | Satisfaction level with the in-flight entertainment from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Baggage_Handling | int | Satisfaction level with the baggage handling from the airline from 1 (lowest) to 5 (highest) - 0 means “not applicable” |
| Satisfaction | chr | Overall satisfaction level with the airline (Satisfied/Neutral or unsatisfied) |
Required Packages
library(ggplot2)
library(dplyr)
library(kableExtra)
library(corrplot)
library(fastDummies)
library(gridExtra)
library(ggcorrplot)
library(olsrr)
library(MASS)
library(glmnet)
library(plotmo)
library(ROCR)
library(rpart)
library(rpart.plot)
library(olsrr)
Data Clean
In this section, I checked whether missing values or duplication exist in the data, to make further work more convinent, I shortend some column names and finally present a summary table. There are 22 variables except ID and result of whether satisfied in this database, including 129880 observations with no missing values or duplication rows in the original table. Through observing the summary table, no outliers are detected, max = 5 and min = 0 or min = 1 for all rating metrics. No negative or unusual values for “Departure Delay”, “Arrival Delay”, or “Distance”.
# import data and take a quick look at the basic information
airline <- read.csv("airline.csv")
str(airline)
## 'data.frame': 129880 obs. of 24 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Female" "Male" "Male" ...
## $ Age : int 48 35 41 50 49 43 43 60 50 38 ...
## $ Customer_Type : chr "First-time" "Returning" "Returning" "Returning" ...
## $ Travel_Type : chr "Business" "Business" "Business" "Business" ...
## $ Class : chr "Business" "Business" "Business" "Business" ...
## $ Distance : int 821 821 853 1905 3470 3788 1963 853 2607 2822 ...
## $ Departure_Delay : int 2 26 0 0 0 0 0 0 0 13 ...
## $ Arrival_Delay : int 5 39 0 0 1 0 0 3 0 0 ...
## $ D_A_Time_Convenience : int 3 2 4 2 3 4 3 3 1 2 ...
## $ Online_Book_Ease : int 3 2 4 2 3 4 3 4 1 5 ...
## $ Check_in_Service : int 4 3 4 3 3 3 4 3 3 3 ...
## $ Online_Boarding : int 3 5 5 4 5 5 4 4 2 5 ...
## $ Gate_Location : int 3 2 4 2 3 4 3 4 1 2 ...
## $ On_board_Service : int 3 5 3 5 3 4 5 3 4 5 ...
## $ Seat_Comfort : int 5 4 5 5 4 4 5 4 3 4 ...
## $ Leg_Room : int 2 5 3 5 4 4 5 4 4 5 ...
## $ Cleanliness : int 5 5 5 4 5 3 4 4 3 4 ...
## $ Food_Drink : int 5 3 5 4 4 3 5 4 3 2 ...
## $ In_flight_Service : int 5 5 3 5 3 4 5 3 4 5 ...
## $ In_flight_Wifi : int 3 2 4 2 3 4 3 4 4 2 ...
## $ In_flight_Entertainment: int 5 5 3 5 3 4 5 3 4 5 ...
## $ Baggage_Handling : int 5 5 3 5 3 4 5 3 4 5 ...
## $ Satisfaction : chr "Neutral or Dissatisfied" "Satisfied" "Satisfied" "Satisfied" ...
dim(airline)
## [1] 129880 24
head(airline, 10)
## ID Gender Age Customer_Type Travel_Type Class Distance Departure_Delay
## 1 1 Male 48 First-time Business Business 821 2
## 2 2 Female 35 Returning Business Business 821 26
## 3 3 Male 41 Returning Business Business 853 0
## 4 4 Male 50 Returning Business Business 1905 0
## 5 5 Female 49 Returning Business Business 3470 0
## 6 6 Male 43 Returning Business Business 3788 0
## 7 7 Male 43 Returning Business Business 1963 0
## 8 8 Female 60 Returning Business Business 853 0
## 9 9 Male 50 Returning Business Business 2607 0
## 10 10 Female 38 Returning Business Business 2822 13
## Arrival_Delay D_A_Time_Convenience Online_Book_Ease Check_in_Service
## 1 5 3 3 4
## 2 39 2 2 3
## 3 0 4 4 4
## 4 0 2 2 3
## 5 1 3 3 3
## 6 0 4 4 3
## 7 0 3 3 4
## 8 3 3 4 3
## 9 0 1 1 3
## 10 0 2 5 3
## Online_Boarding Gate_Location On_board_Service Seat_Comfort Leg_Room
## 1 3 3 3 5 2
## 2 5 2 5 4 5
## 3 5 4 3 5 3
## 4 4 2 5 5 5
## 5 5 3 3 4 4
## 6 5 4 4 4 4
## 7 4 3 5 5 5
## 8 4 4 3 4 4
## 9 2 1 4 3 4
## 10 5 2 5 4 5
## Cleanliness Food_Drink In_flight_Service In_flight_Wifi
## 1 5 5 5 3
## 2 5 3 5 2
## 3 5 5 3 4
## 4 4 4 5 2
## 5 5 4 3 3
## 6 3 3 4 4
## 7 4 5 5 3
## 8 4 4 3 4
## 9 3 3 4 4
## 10 4 2 5 2
## In_flight_Entertainment Baggage_Handling Satisfaction
## 1 5 5 Neutral or Dissatisfied
## 2 5 5 Satisfied
## 3 3 3 Satisfied
## 4 5 5 Satisfied
## 5 3 3 Satisfied
## 6 4 4 Satisfied
## 7 5 5 Satisfied
## 8 3 3 Satisfied
## 9 4 4 Neutral or Dissatisfied
## 10 5 5 Satisfied
# check missing values and duplication
sum(is.na(airline))
## [1] 0
colSums(is.na(airline))
## ID Gender Age
## 0 0 0
## Customer_Type Travel_Type Class
## 0 0 0
## Distance Departure_Delay Arrival_Delay
## 0 0 0
## D_A_Time_Convenience Online_Book_Ease Check_in_Service
## 0 0 0
## Online_Boarding Gate_Location On_board_Service
## 0 0 0
## Seat_Comfort Leg_Room Cleanliness
## 0 0 0
## Food_Drink In_flight_Service In_flight_Wifi
## 0 0 0
## In_flight_Entertainment Baggage_Handling Satisfaction
## 0 0 0
n_occur <- data.frame(table(airline$ID))
count(n_occur[n_occur$Freq > 1, ])
## n
## 1 0
# change variable names
airline <- rename(airline, dep_delay=Departure_Delay)
airline <- rename(airline, arr_delay=Arrival_Delay)
airline <- rename(airline, time_cov=D_A_Time_Convenience)
airline <- rename(airline, online_book=Online_Book_Ease)
airline <- rename(airline, check_in=Check_in_Service)
airline <- rename(airline, online_bo=Online_Boarding)
airline <- rename(airline, gate=Gate_Location)
airline <- rename(airline, on_board=On_board_Service)
airline <- rename(airline, seat=Seat_Comfort)
airline <- rename(airline, legroom=Leg_Room)
airline <- rename(airline, clean=Cleanliness)
airline <- rename(airline, fnd=Food_Drink)
airline <- rename(airline, in_service=In_flight_Service)
airline <- rename(airline, in_wifi=In_flight_Wifi)
airline <- rename(airline, in_ent=In_flight_Entertainment)
airline <- rename(airline, baggage=Baggage_Handling)
airline <- rename(airline, overall=Satisfaction)
airlinesum <- summary(airline %>% dplyr::select(c("Distance", "dep_delay", "arr_delay", "time_cov", "online_book", "check_in", "online_bo", "gate", "on_board", "seat", "legroom", "clean", "fnd", "in_service", "in_wifi", "in_ent", "baggage")), digits = 2) %>%
knitr::kable(caption = "Summary statistics: Fly Distance and Rating Metrics")
airlinesum
| Distance | dep_delay | arr_delay | time_cov | online_book | check_in | online_bo | gate | on_board | seat | legroom | clean | fnd | in_service | in_wifi | in_ent | baggage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 31 | Min. : 0 | Min. : 0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. :0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. :0.0 | Min. :1.0 | |
| 1st Qu.: 414 | 1st Qu.: 0 | 1st Qu.: 0 | 1st Qu.:2.0 | 1st Qu.:2.0 | 1st Qu.:3.0 | 1st Qu.:2.0 | 1st Qu.:2 | 1st Qu.:2.0 | 1st Qu.:2.0 | 1st Qu.:2.0 | 1st Qu.:2.0 | 1st Qu.:2.0 | 1st Qu.:3.0 | 1st Qu.:2.0 | 1st Qu.:2.0 | 1st Qu.:3.0 | |
| Median : 844 | Median : 0 | Median : 0 | Median :3.0 | Median :3.0 | Median :3.0 | Median :3.0 | Median :3 | Median :4.0 | Median :4.0 | Median :4.0 | Median :3.0 | Median :3.0 | Median :4.0 | Median :3.0 | Median :4.0 | Median :4.0 | |
| Mean :1190 | Mean : 15 | Mean : 15 | Mean :3.1 | Mean :2.8 | Mean :3.3 | Mean :3.3 | Mean :3 | Mean :3.4 | Mean :3.4 | Mean :3.4 | Mean :3.3 | Mean :3.2 | Mean :3.6 | Mean :2.7 | Mean :3.4 | Mean :3.6 | |
| 3rd Qu.:1744 | 3rd Qu.: 12 | 3rd Qu.: 13 | 3rd Qu.:4.0 | 3rd Qu.:4.0 | 3rd Qu.:4.0 | 3rd Qu.:4.0 | 3rd Qu.:4 | 3rd Qu.:4.0 | 3rd Qu.:5.0 | 3rd Qu.:4.0 | 3rd Qu.:4.0 | 3rd Qu.:4.0 | 3rd Qu.:5.0 | 3rd Qu.:4.0 | 3rd Qu.:4.0 | 3rd Qu.:5.0 | |
| Max. :4983 | Max. :1592 | Max. :1584 | Max. :5.0 | Max. :5.0 | Max. :5.0 | Max. :5.0 | Max. :5 | Max. :5.0 | Max. :5.0 | Max. :5.0 | Max. :5.0 | Max. :5.0 | Max. :5.0 | Max. :5.0 | Max. :5.0 | Max. :5.0 |
Exploratory Data Analysis
Distributions of Grading Metrics and Satisfaction
To better create models and process analysis, I converted columns of change data type of “Gender”, “Customer_Type”, “Travel_Type”, “Class”, and ““Satisfaction” from character to dummy variables. I also removed first column “ID” to prevent this column influences the following variable selection part. Then draw correlation plot based on the significance level among variables.
# delete 1st Column ID
airline1 <- airline %>%
dplyr::select(-ID)
# create dummy dependent variable
airline.new <- dummy_cols(airline1, select_columns = "overall",
remove_first_dummy = TRUE,
remove_selected_columns = TRUE)
airline.new$Gender <- as.factor(airline.new$Gender)
airline.new$Customer_Type <- as.factor(airline$Customer_Type)
airline.new$Travel_Type <- as.factor(airline.new$Travel_Type)
airline.new$Class <- as.factor(airline.new$Class)
airline.new <- rename(airline.new, overall=overall_Satisfied)
Correlation Matrix and Multicollinearity
According to the corrlation plot, I found out the value between dep_delay and arr_delay is too high (0.96), this might indicate multicollinearity and requires more investigation. Before using Tolearnce and Variance Inflation Factors (VIF) to double-test my assumption, I realized arrival delay has a significant postivie relationship with depart delay from the following grpah. To solve the multicollinearity problem but invovle delay time as a crucial metric, I deleted arr_delay column and kept dep_delay.
# draw corrplot
corplot <- model.matrix(~0+., data=airline.new) %>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag = F, type = "lower", lab=TRUE, lab_size=1, insig="blank", ggtheme = theme_minimal(), hc.order = TRUE, outline.color = "white")
lm1 <- lm(overall ~ ., family = binomial, data = airline.new)
ols_vif_tol(lm1)
## Variables Tolerance VIF
## 1 GenderMale 0.98894305 1.011181
## 2 Age 0.86095629 1.161499
## 3 Customer_TypeReturning 0.63402096 1.577235
## 4 Travel_TypePersonal 0.47985750 2.083952
## 5 ClassEconomy 0.46731187 2.139899
## 6 ClassEconomy Plus 0.76644309 1.304728
## 7 Distance 0.73206224 1.366004
## 8 dep_delay 0.07953135 12.573658
## 9 arr_delay 0.07946686 12.583862
## 10 time_cov 0.59919050 1.668918
## 11 online_book 0.37055209 2.698676
## 12 check_in 0.81505179 1.226916
## 13 online_bo 0.49916804 2.003333
## 14 gate 0.66317468 1.507898
## 15 on_board 0.56362772 1.774221
## 16 seat 0.41708446 2.397596
## 17 legroom 0.75884974 1.317784
## 18 clean 0.34845226 2.869834
## 19 fnd 0.45969255 2.175367
## 20 in_service 0.48031704 2.081958
## 21 in_wifi 0.40773487 2.452574
## 22 in_ent 0.25653253 3.898141
## 23 baggage 0.52318769 1.911360
# arr_delay and depart_delay
plot(airline.new$arr_delay, airline.new$dep_delay, col="#9BA3EB")
# delete one column
a1 <- airline.new %>%
dplyr::select(-arr_delay)
Logistic Regression
Train and Test Data
I splited data into 80% train set and 20% test set, applying Step AIC(forward & backward), Step BIC, and LASSO respectively to select the best variable combination as reported AIC score, BIC score, and the most important in-sample mean resiudal deviance.
# split into train and test data
set.seed(1000)
index <- sample(nrow(a1), nrow(a1)*0.8)
train = a1[index,]
test = a1[-index,]
Variable Selection
# full data GLM
airline.glm <- glm(overall ~ ., family = binomial, data = a1)
summary(airline.glm)
##
## Call:
## glm(formula = overall ~ ., family = binomial, data = a1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8807 -0.4915 -0.1758 0.3891 4.0206
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.831e+00 7.016e-02 -111.604 < 2e-16 ***
## GenderMale 6.302e-02 1.737e-02 3.628 0.000286 ***
## Age -8.373e-03 6.346e-04 -13.193 < 2e-16 ***
## Customer_TypeReturning 2.032e+00 2.660e-02 76.378 < 2e-16 ***
## Travel_TypePersonal -2.729e+00 2.807e-02 -97.219 < 2e-16 ***
## ClassEconomy -7.164e-01 2.288e-02 -31.314 < 2e-16 ***
## ClassEconomy Plus -8.161e-01 3.699e-02 -22.063 < 2e-16 ***
## Distance -1.620e-05 1.008e-05 -1.607 0.108118
## dep_delay -4.633e-03 2.365e-04 -19.593 < 2e-16 ***
## time_cov -1.327e-01 7.303e-03 -18.175 < 2e-16 ***
## online_book -1.498e-01 1.008e-02 -14.856 < 2e-16 ***
## check_in 3.268e-01 7.633e-03 42.809 < 2e-16 ***
## online_bo 6.076e-01 9.128e-03 66.566 < 2e-16 ***
## gate 2.475e-02 8.169e-03 3.029 0.002451 **
## on_board 3.005e-01 9.095e-03 33.039 < 2e-16 ***
## seat 6.498e-02 9.991e-03 6.504 7.82e-11 ***
## legroom 2.495e-01 7.607e-03 32.800 < 2e-16 ***
## clean 2.264e-01 1.080e-02 20.964 < 2e-16 ***
## fnd -2.634e-02 9.550e-03 -2.758 0.005813 **
## in_service 1.254e-01 1.073e-02 11.687 < 2e-16 ***
## in_wifi 4.021e-01 1.022e-02 39.331 < 2e-16 ***
## in_ent 5.826e-02 1.273e-02 4.577 4.71e-06 ***
## baggage 1.335e-01 1.020e-02 13.090 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177814 on 129879 degrees of freedom
## Residual deviance: 87038 on 129857 degrees of freedom
## AIC: 87084
##
## Number of Fisher Scoring iterations: 5
airline.glm$dev/airline.glm$df.residual # = 0.6703
## [1] 0.6702621
# null model
airline.null <- glm(overall ~ 1, family = binomial, data = a1)
airline.null$dev/airline.null$df.residual # = 1.37
## [1] 1.369075
# step AIC
aic.back <- step(airline.glm, direction = "backward")
## Start: AIC=87084.22
## overall ~ Gender + Age + Customer_Type + Travel_Type + Class +
## Distance + dep_delay + time_cov + online_book + check_in +
## online_bo + gate + on_board + seat + legroom + clean + fnd +
## in_service + in_wifi + in_ent + baggage
##
## Df Deviance AIC
## <none> 87038 87084
## - Distance 1 87041 87085
## - fnd 1 87046 87090
## - gate 1 87047 87091
## - Gender 1 87051 87095
## - in_ent 1 87059 87103
## - seat 1 87080 87124
## - in_service 1 87176 87220
## - baggage 1 87210 87254
## - Age 1 87213 87257
## - online_book 1 87260 87304
## - time_cov 1 87367 87411
## - dep_delay 1 87446 87490
## - clean 1 87480 87524
## - legroom 1 88121 88165
## - Class 2 88145 88187
## - on_board 1 88150 88194
## - in_wifi 1 88639 88683
## - check_in 1 88933 88977
## - online_bo 1 91744 91788
## - Customer_Type 1 93446 93490
## - Travel_Type 1 98160 98204
aic.for <- step(airline.null, direction = "forward",
scope = list(lower = airline.null, upper = airline.glm))
## Start: AIC=177816.1
## overall ~ 1
##
## Df Deviance AIC
## + online_bo 1 140237 140241
## + Class 2 143336 143342
## + Travel_Type 1 148176 148180
## + in_ent 1 155663 155667
## + seat 1 160973 160977
## + on_board 1 163597 163601
## + legroom 1 164546 164550
## + clean 1 165039 165043
## + Distance 1 166034 166038
## + in_wifi 1 167068 167072
## + baggage 1 169418 169422
## + in_service 1 169674 169678
## + check_in 1 170284 170288
## + fnd 1 171911 171915
## + Customer_Type 1 173075 173079
## + online_book 1 174071 174075
## + Age 1 175462 175466
## + time_cov 1 177432 177436
## + dep_delay 1 177458 177462
## + Gender 1 177798 177802
## <none> 177814 177816
## + gate 1 177813 177817
##
## Step: AIC=140240.7
## overall ~ online_bo
##
## Df Deviance AIC
## + Travel_Type 1 117574 117580
## + Class 2 119878 119886
## + in_ent 1 128193 128199
## + legroom 1 129411 129417
## + on_board 1 130021 130027
## + in_service 1 133183 133189
## + baggage 1 133232 133238
## + Distance 1 133952 133958
## + seat 1 136708 136714
## + clean 1 136783 136789
## + check_in 1 137384 137390
## + Customer_Type 1 138700 138706
## + fnd 1 138801 138807
## + time_cov 1 138833 138839
## + in_wifi 1 139374 139380
## + dep_delay 1 139950 139956
## + Gender 1 140036 140042
## + Age 1 140077 140083
## + online_book 1 140178 140184
## <none> 140237 140241
## + gate 1 140235 140241
##
## Step: AIC=117580.2
## overall ~ online_bo + Travel_Type
##
## Df Deviance AIC
## + on_board 1 107136 107144
## + in_ent 1 107946 107954
## + legroom 1 109347 109355
## + in_service 1 109905 109913
## + baggage 1 110193 110201
## + Customer_Type 1 110244 110252
## + Class 2 111700 111710
## + check_in 1 113138 113146
## + clean 1 114315 114323
## + seat 1 114767 114775
## + Distance 1 115505 115513
## + fnd 1 116518 116526
## + in_wifi 1 116613 116621
## + dep_delay 1 117196 117204
## + Age 1 117298 117306
## + Gender 1 117484 117492
## + online_book 1 117496 117504
## + gate 1 117542 117550
## <none> 117574 117580
## + time_cov 1 117573 117581
##
## Step: AIC=107143.6
## overall ~ online_bo + Travel_Type + on_board
##
## Df Deviance AIC
## + Customer_Type 1 99348 99358
## + in_ent 1 103485 103495
## + Class 2 103680 103692
## + legroom 1 103845 103855
## + clean 1 104549 104559
## + check_in 1 104896 104906
## + seat 1 104905 104915
## + Distance 1 105588 105598
## + baggage 1 105694 105704
## + in_service 1 105773 105783
## + fnd 1 106055 106065
## + in_wifi 1 106604 106614
## + dep_delay 1 106847 106857
## + Age 1 106970 106980
## + Gender 1 107035 107045
## + online_book 1 107056 107066
## + gate 1 107122 107132
## + time_cov 1 107123 107133
## <none> 107136 107144
##
## Step: AIC=99358.1
## overall ~ online_bo + Travel_Type + on_board + Customer_Type
##
## Df Deviance AIC
## + check_in 1 96241 96253
## + legroom 1 96390 96402
## + in_ent 1 96693 96705
## + clean 1 96861 96873
## + baggage 1 97012 97024
## + in_service 1 97102 97114
## + Class 2 97924 97938
## + seat 1 98001 98013
## + in_wifi 1 98182 98194
## + fnd 1 98388 98400
## + Age 1 99002 99014
## + dep_delay 1 99015 99027
## + time_cov 1 99194 99206
## + Distance 1 99216 99228
## + Gender 1 99284 99296
## + gate 1 99325 99337
## + online_book 1 99337 99349
## <none> 99348 99358
##
## Step: AIC=96252.99
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in
##
## Df Deviance AIC
## + in_ent 1 93431 93445
## + legroom 1 93622 93636
## + clean 1 94259 94273
## + baggage 1 94435 94449
## + in_service 1 94502 94516
## + in_wifi 1 94757 94771
## + seat 1 95275 95289
## + Class 2 95324 95340
## + fnd 1 95377 95391
## + dep_delay 1 95856 95870
## + Age 1 95887 95901
## + time_cov 1 96051 96065
## + Distance 1 96171 96185
## + Gender 1 96189 96203
## + gate 1 96228 96242
## <none> 96241 96253
## + online_book 1 96241 96255
##
## Step: AIC=93444.52
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent
##
## Df Deviance AIC
## + legroom 1 91582 91598
## + Class 2 92299 92317
## + in_wifi 1 92436 92452
## + baggage 1 92495 92511
## + in_service 1 92641 92657
## + Age 1 93063 93079
## + dep_delay 1 93084 93100
## + clean 1 93268 93284
## + time_cov 1 93273 93289
## + Distance 1 93339 93355
## + Gender 1 93370 93386
## + gate 1 93414 93430
## + fnd 1 93418 93434
## + online_book 1 93419 93435
## + seat 1 93426 93442
## <none> 93431 93445
##
## Step: AIC=91597.58
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom
##
## Df Deviance AIC
## + Class 2 90568 90588
## + in_wifi 1 90922 90940
## + baggage 1 91096 91114
## + dep_delay 1 91171 91189
## + in_service 1 91178 91196
## + Age 1 91276 91294
## + clean 1 91287 91305
## + time_cov 1 91392 91410
## + Distance 1 91514 91532
## + seat 1 91541 91559
## + Gender 1 91541 91559
## + gate 1 91566 91584
## + online_book 1 91577 91595
## + fnd 1 91578 91596
## <none> 91582 91598
##
## Step: AIC=90587.55
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class
##
## Df Deviance AIC
## + in_wifi 1 89558 89580
## + dep_delay 1 90175 90197
## + baggage 1 90191 90213
## + Age 1 90243 90265
## + in_service 1 90270 90292
## + clean 1 90296 90318
## + time_cov 1 90383 90405
## + Gender 1 90533 90555
## + gate 1 90553 90575
## + seat 1 90554 90576
## + Distance 1 90556 90578
## <none> 90568 90588
## + fnd 1 90566 90588
## + online_book 1 90567 90589
##
## Step: AIC=89580.35
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi
##
## Df Deviance AIC
## + time_cov 1 88869 88893
## + online_book 1 88892 88916
## + clean 1 89144 89168
## + dep_delay 1 89197 89221
## + baggage 1 89296 89320
## + Age 1 89314 89338
## + in_service 1 89343 89367
## + gate 1 89360 89384
## + seat 1 89481 89505
## + Gender 1 89536 89560
## + fnd 1 89554 89578
## + Distance 1 89555 89579
## <none> 89558 89580
##
## Step: AIC=88892.63
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov
##
## Df Deviance AIC
## + clean 1 88431 88457
## + dep_delay 1 88513 88539
## + baggage 1 88598 88624
## + online_book 1 88600 88626
## + in_service 1 88633 88659
## + Age 1 88650 88676
## + seat 1 88780 88806
## + Gender 1 88848 88874
## + gate 1 88862 88888
## + fnd 1 88864 88890
## <none> 88869 88893
## + Distance 1 88867 88893
##
## Step: AIC=88456.67
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean
##
## Df Deviance AIC
## + dep_delay 1 88038 88066
## + in_service 1 88040 88068
## + baggage 1 88045 88073
## + online_book 1 88200 88228
## + Age 1 88239 88267
## + fnd 1 88413 88441
## + Gender 1 88417 88445
## + gate 1 88423 88451
## + seat 1 88424 88452
## + Distance 1 88428 88456
## <none> 88431 88457
##
## Step: AIC=88065.73
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay
##
## Df Deviance AIC
## + baggage 1 87628 87658
## + in_service 1 87677 87707
## + online_book 1 87800 87830
## + Age 1 87839 87869
## + fnd 1 88016 88046
## + Gender 1 88023 88053
## + seat 1 88031 88061
## + gate 1 88031 88061
## + Distance 1 88035 88065
## <none> 88038 88066
##
## Step: AIC=87658.27
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage
##
## Df Deviance AIC
## + online_book 1 87417 87449
## + Age 1 87470 87502
## + in_service 1 87474 87506
## + seat 1 87606 87638
## + Gender 1 87617 87649
## + fnd 1 87622 87654
## + gate 1 87625 87657
## <none> 87628 87658
## + Distance 1 87627 87659
##
## Step: AIC=87448.64
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage + online_book
##
## Df Deviance AIC
## + Age 1 87240 87274
## + in_service 1 87266 87300
## + seat 1 87401 87435
## + Gender 1 87404 87438
## + gate 1 87409 87443
## + fnd 1 87411 87445
## <none> 87417 87449
## + Distance 1 87416 87450
##
## Step: AIC=87274.24
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage + online_book + Age
##
## Df Deviance AIC
## + in_service 1 87109 87145
## + seat 1 87214 87250
## + Gender 1 87228 87264
## + gate 1 87231 87267
## + fnd 1 87232 87268
## + Distance 1 87237 87273
## <none> 87240 87274
##
## Step: AIC=87145.4
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage + online_book + Age + in_service
##
## Df Deviance AIC
## + seat 1 87070 87108
## + Gender 1 87098 87136
## + gate 1 87098 87136
## + fnd 1 87106 87144
## + Distance 1 87107 87145
## <none> 87109 87145
##
## Step: AIC=87108.34
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage + online_book + Age + in_service +
## seat
##
## Df Deviance AIC
## + Gender 1 87058 87098
## + gate 1 87061 87101
## + fnd 1 87063 87103
## + Distance 1 87068 87108
## <none> 87070 87108
##
## Step: AIC=87097.96
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage + online_book + Age + in_service +
## seat + Gender
##
## Df Deviance AIC
## + gate 1 87048 87090
## + fnd 1 87050 87092
## + Distance 1 87055 87097
## <none> 87058 87098
##
## Step: AIC=87090.27
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage + online_book + Age + in_service +
## seat + Gender + gate
##
## Df Deviance AIC
## + fnd 1 87041 87085
## + Distance 1 87046 87090
## <none> 87048 87090
##
## Step: AIC=87084.8
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage + online_book + Age + in_service +
## seat + Gender + gate + fnd
##
## Df Deviance AIC
## + Distance 1 87038 87084
## <none> 87041 87085
##
## Step: AIC=87084.22
## overall ~ online_bo + Travel_Type + on_board + Customer_Type +
## check_in + in_ent + legroom + Class + in_wifi + time_cov +
## clean + dep_delay + baggage + online_book + Age + in_service +
## seat + Gender + gate + fnd + Distance
# step BIC
airline.bic <- step(airline.glm, k = log(129880))
## Start: AIC=87309.03
## overall ~ Gender + Age + Customer_Type + Travel_Type + Class +
## Distance + dep_delay + time_cov + online_book + check_in +
## online_bo + gate + on_board + seat + legroom + clean + fnd +
## in_service + in_wifi + in_ent + baggage
##
## Df Deviance AIC
## - Distance 1 87041 87300
## - fnd 1 87046 87305
## - gate 1 87047 87306
## <none> 87038 87309
## - Gender 1 87051 87310
## - in_ent 1 87059 87318
## - seat 1 87080 87340
## - in_service 1 87176 87435
## - baggage 1 87210 87469
## - Age 1 87213 87472
## - online_book 1 87260 87519
## - time_cov 1 87367 87626
## - dep_delay 1 87446 87705
## - clean 1 87480 87739
## - legroom 1 88121 88380
## - Class 2 88145 88392
## - on_board 1 88150 88409
## - in_wifi 1 88639 88898
## - check_in 1 88933 89192
## - online_bo 1 91744 92003
## - Customer_Type 1 93446 93705
## - Travel_Type 1 98160 98419
##
## Step: AIC=87299.84
## overall ~ Gender + Age + Customer_Type + Travel_Type + Class +
## dep_delay + time_cov + online_book + check_in + online_bo +
## gate + on_board + seat + legroom + clean + fnd + in_service +
## in_wifi + in_ent + baggage
##
## Df Deviance AIC
## - fnd 1 87048 87296
## - gate 1 87050 87297
## <none> 87041 87300
## - Gender 1 87054 87301
## - in_ent 1 87062 87309
## - seat 1 87083 87330
## - in_service 1 87179 87426
## - baggage 1 87214 87461
## - Age 1 87214 87461
## - online_book 1 87264 87511
## - time_cov 1 87371 87618
## - dep_delay 1 87448 87696
## - clean 1 87482 87730
## - legroom 1 88121 88368
## - on_board 1 88152 88399
## - Class 2 88248 88484
## - in_wifi 1 88652 88900
## - check_in 1 88935 89182
## - online_bo 1 91744 91992
## - Customer_Type 1 93851 94098
## - Travel_Type 1 98305 98552
##
## Step: AIC=87295.53
## overall ~ Gender + Age + Customer_Type + Travel_Type + Class +
## dep_delay + time_cov + online_book + check_in + online_bo +
## gate + on_board + seat + legroom + clean + in_service + in_wifi +
## in_ent + baggage
##
## Df Deviance AIC
## - gate 1 87058 87293
## <none> 87048 87296
## - Gender 1 87061 87296
## - in_ent 1 87063 87298
## - seat 1 87086 87322
## - in_service 1 87193 87428
## - Age 1 87218 87454
## - baggage 1 87226 87462
## - online_book 1 87272 87507
## - time_cov 1 87379 87615
## - dep_delay 1 87453 87688
## - clean 1 87486 87721
## - legroom 1 88143 88379
## - on_board 1 88187 88423
## - Class 2 88252 88476
## - in_wifi 1 88657 88892
## - check_in 1 88937 89172
## - online_bo 1 91760 91995
## - Customer_Type 1 93892 94128
## - Travel_Type 1 98334 98570
##
## Step: AIC=87293.44
## overall ~ Gender + Age + Customer_Type + Travel_Type + Class +
## dep_delay + time_cov + online_book + check_in + online_bo +
## on_board + seat + legroom + clean + in_service + in_wifi +
## in_ent + baggage
##
## Df Deviance AIC
## <none> 87058 87293
## - Gender 1 87070 87294
## - in_ent 1 87073 87297
## - seat 1 87098 87322
## - in_service 1 87201 87425
## - Age 1 87226 87450
## - baggage 1 87235 87459
## - online_book 1 87274 87498
## - time_cov 1 87387 87611
## - dep_delay 1 87461 87684
## - clean 1 87497 87721
## - legroom 1 88145 88369
## - on_board 1 88192 88416
## - Class 2 88261 88473
## - in_wifi 1 88660 88884
## - check_in 1 88942 89166
## - online_bo 1 91876 92100
## - Customer_Type 1 93893 94117
## - Travel_Type 1 98453 98677
summary(aic.back)
##
## Call:
## glm(formula = overall ~ Gender + Age + Customer_Type + Travel_Type +
## Class + Distance + dep_delay + time_cov + online_book + check_in +
## online_bo + gate + on_board + seat + legroom + clean + fnd +
## in_service + in_wifi + in_ent + baggage, family = binomial,
## data = a1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8807 -0.4915 -0.1758 0.3891 4.0206
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.831e+00 7.016e-02 -111.604 < 2e-16 ***
## GenderMale 6.302e-02 1.737e-02 3.628 0.000286 ***
## Age -8.373e-03 6.346e-04 -13.193 < 2e-16 ***
## Customer_TypeReturning 2.032e+00 2.660e-02 76.378 < 2e-16 ***
## Travel_TypePersonal -2.729e+00 2.807e-02 -97.219 < 2e-16 ***
## ClassEconomy -7.164e-01 2.288e-02 -31.314 < 2e-16 ***
## ClassEconomy Plus -8.161e-01 3.699e-02 -22.063 < 2e-16 ***
## Distance -1.620e-05 1.008e-05 -1.607 0.108118
## dep_delay -4.633e-03 2.365e-04 -19.593 < 2e-16 ***
## time_cov -1.327e-01 7.303e-03 -18.175 < 2e-16 ***
## online_book -1.498e-01 1.008e-02 -14.856 < 2e-16 ***
## check_in 3.268e-01 7.633e-03 42.809 < 2e-16 ***
## online_bo 6.076e-01 9.128e-03 66.566 < 2e-16 ***
## gate 2.475e-02 8.169e-03 3.029 0.002451 **
## on_board 3.005e-01 9.095e-03 33.039 < 2e-16 ***
## seat 6.498e-02 9.991e-03 6.504 7.82e-11 ***
## legroom 2.495e-01 7.607e-03 32.800 < 2e-16 ***
## clean 2.264e-01 1.080e-02 20.964 < 2e-16 ***
## fnd -2.634e-02 9.550e-03 -2.758 0.005813 **
## in_service 1.254e-01 1.073e-02 11.687 < 2e-16 ***
## in_wifi 4.021e-01 1.022e-02 39.331 < 2e-16 ***
## in_ent 5.826e-02 1.273e-02 4.577 4.71e-06 ***
## baggage 1.335e-01 1.020e-02 13.090 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177814 on 129879 degrees of freedom
## Residual deviance: 87038 on 129857 degrees of freedom
## AIC: 87084
##
## Number of Fisher Scoring iterations: 5
aic.back$deviance/aic.back$df.residual # 0.6703
## [1] 0.6702621
AIC(aic.back) # 87084.22
## [1] 87084.22
BIC(aic.back) # 87309.03
## [1] 87309.03
AIC(aic.for) # 87084.22
## [1] 87084.22
BIC(aic.for) # 87309.03
## [1] 87309.03
summary(airline.bic) # 87097.96
##
## Call:
## glm(formula = overall ~ Gender + Age + Customer_Type + Travel_Type +
## Class + dep_delay + time_cov + online_book + check_in + online_bo +
## on_board + seat + legroom + clean + in_service + in_wifi +
## in_ent + baggage, family = binomial, data = a1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9068 -0.4916 -0.1751 0.3902 4.0254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.8414635 0.0657092 -119.336 < 2e-16 ***
## GenderMale 0.0610962 0.0173610 3.519 0.000433 ***
## Age -0.0081828 0.0006324 -12.940 < 2e-16 ***
## Customer_TypeReturning 2.0235239 0.0256339 78.939 < 2e-16 ***
## Travel_TypePersonal -2.7333811 0.0277943 -98.343 < 2e-16 ***
## ClassEconomy -0.7031667 0.0217357 -32.351 < 2e-16 ***
## ClassEconomy Plus -0.8012297 0.0359673 -22.277 < 2e-16 ***
## dep_delay -0.0046142 0.0002368 -19.483 < 2e-16 ***
## time_cov -0.1260518 0.0069388 -18.166 < 2e-16 ***
## online_book -0.1409153 0.0096224 -14.644 < 2e-16 ***
## check_in 0.3256407 0.0076267 42.697 < 2e-16 ***
## online_bo 0.6020942 0.0089530 67.251 < 2e-16 ***
## on_board 0.3022302 0.0090505 33.394 < 2e-16 ***
## seat 0.0624964 0.0098802 6.325 2.53e-10 ***
## legroom 0.2492942 0.0075815 32.882 < 2e-16 ***
## clean 0.2201422 0.0105258 20.915 < 2e-16 ***
## in_service 0.1275715 0.0106880 11.936 < 2e-16 ***
## in_wifi 0.4018987 0.0102135 39.350 < 2e-16 ***
## in_ent 0.0461207 0.0118545 3.891 0.000100 ***
## baggage 0.1350570 0.0101748 13.274 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177814 on 129879 degrees of freedom
## Residual deviance: 87058 on 129860 degrees of freedom
## AIC: 87098
##
## Number of Fisher Scoring iterations: 5
airline.bic$deviance/airline.bic$df.residual # 0.6704
## [1] 0.6703986
AIC(airline.bic) # 87097.96
## [1] 87097.96
BIC(airline.bic) # 87293.44
## [1] 87293.44
# LASSO
new <- model.matrix(~ ., data = a1)
new1 <- data.frame(new[, -1])
train1 <- as.matrix(
dplyr::select(new1, -overall)[index,])
test1 <- as.matrix(
dplyr::select(new1, -overall)[-index,])
train2 <- new1[index, "overall"]
test2 <- new1[-index, "overall"]
a.lasso <- glmnet(x=train1, y=train2, family = "binomial")
a.lascv <- cv.glmnet(x=train1, y=train2, family = "binomial", typle.meausre = "class",
alpha=1, nlambda = 100)
plot(a.lascv)
fit.1se = glmnet(x=train1, y=train2, family = "binomial", alpha=1, lambda = a.lascv$lambda.1se)
fit.1se$beta[,1]
## GenderMale Age Customer_TypeReturning
## 0.006432198 -0.003600187 1.768684640
## Travel_TypePersonal ClassEconomy ClassEconomy.Plus
## -2.559936475 -0.656739515 -0.612520755
## Distance dep_delay time_cov
## 0.000000000 -0.003420403 -0.109353006
## online_book check_in online_bo
## -0.057722975 0.297412727 0.584131850
## gate on_board seat
## 0.000000000 0.268614117 0.042731361
## legroom clean fnd
## 0.226913567 0.178765876 0.000000000
## in_service in_wifi in_ent
## 0.114069504 0.301480385 0.092786433
## baggage
## 0.118632117
fit.min = glmnet(x=train1, y=train2, family = "binomial", alpha=1, lambda = a.lascv$lambda.min)
fit.min$beta[,1]
## GenderMale Age Customer_TypeReturning
## 6.711187e-02 -7.663790e-03 1.977415e+00
## Travel_TypePersonal ClassEconomy ClassEconomy.Plus
## -2.713571e+00 -7.038392e-01 -7.617881e-01
## Distance dep_delay time_cov
## -2.330595e-06 -4.549073e-03 -1.274325e-01
## online_book check_in online_bo
## -1.289160e-01 3.282890e-01 6.112944e-01
## gate on_board seat
## 1.279580e-02 2.898013e-01 5.960706e-02
## legroom clean fnd
## 2.448732e-01 2.105089e-01 -2.488579e-02
## in_service in_wifi in_ent
## 1.290537e-01 3.836345e-01 7.507715e-02
## baggage
## 1.304634e-01
a1se <- glm(overall ~ Gender + Age + Customer_Type + Travel_Type + Class + dep_delay +
time_cov + online_book + check_in + online_bo + on_board + seat + legroom +
clean + in_service + in_wifi + in_ent + baggage, family = "binomial", data = train)
a1se$deviance/a1se$df.residual # 0.6679
## [1] 0.6678561
AIC(a1se) # 69419.57
## [1] 69419.57
BIC(a1se) # 69610.59
## [1] 69610.59
coef(a1se)
## (Intercept) GenderMale Age
## -7.878834520 0.073965155 -0.008142632
## Customer_TypeReturning Travel_TypePersonal ClassEconomy
## 2.011285062 -2.744288718 -0.707093669
## ClassEconomy Plus dep_delay time_cov
## -0.779985593 -0.004701768 -0.126086981
## online_book check_in online_bo
## -0.134060138 0.331906974 0.612377346
## on_board seat legroom
## 0.295195445 0.058675938 0.248253465
## clean in_service in_wifi
## 0.208558406 0.133887850 0.395153389
## in_ent baggage
## 0.059036062 0.134109699
# in lambda.min, all variables are selected but it has the lowest mean residual deviance
amin <- glm(overall ~ ., family = "binomial", data = train)
amin$deviance/amin$df.residual # 0.6677
## [1] 0.6676886
AIC(amin) # 69406.16
## [1] 69406.16
BIC(amin) #69625.84
## [1] 69625.84
summary(a1se)
##
## Call:
## glm(formula = overall ~ Gender + Age + Customer_Type + Travel_Type +
## Class + dep_delay + time_cov + online_book + check_in + online_bo +
## on_board + seat + legroom + clean + in_service + in_wifi +
## in_ent + baggage, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9029 -0.4891 -0.1729 0.3878 4.0333
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.8788345 0.0737610 -106.816 < 2e-16 ***
## GenderMale 0.0739652 0.0194494 3.803 0.000143 ***
## Age -0.0081426 0.0007100 -11.469 < 2e-16 ***
## Customer_TypeReturning 2.0112851 0.0286999 70.080 < 2e-16 ***
## Travel_TypePersonal -2.7442887 0.0311756 -88.027 < 2e-16 ***
## ClassEconomy -0.7070937 0.0243270 -29.066 < 2e-16 ***
## ClassEconomy Plus -0.7799856 0.0402883 -19.360 < 2e-16 ***
## dep_delay -0.0047018 0.0002649 -17.749 < 2e-16 ***
## time_cov -0.1260870 0.0077958 -16.174 < 2e-16 ***
## online_book -0.1340601 0.0107895 -12.425 < 2e-16 ***
## check_in 0.3319070 0.0085460 38.838 < 2e-16 ***
## online_bo 0.6123773 0.0100583 60.883 < 2e-16 ***
## on_board 0.2951954 0.0101327 29.133 < 2e-16 ***
## seat 0.0586759 0.0110685 5.301 1.15e-07 ***
## legroom 0.2482535 0.0084920 29.234 < 2e-16 ***
## clean 0.2085584 0.0117740 17.713 < 2e-16 ***
## in_service 0.1338879 0.0119754 11.180 < 2e-16 ***
## in_wifi 0.3951534 0.0114354 34.555 < 2e-16 ***
## in_ent 0.0590361 0.0132361 4.460 8.19e-06 ***
## baggage 0.1341097 0.0114171 11.746 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 142238 on 103903 degrees of freedom
## Residual deviance: 69380 on 103884 degrees of freedom
## AIC: 69420
##
## Number of Fisher Scoring iterations: 6
ROC, AUC and Asymmetric Cost
pred_reg <- predict(a1se, type = 'response')
hist(pred_reg)
# confusion matrix with different cut-off probability
table(train$overall, (pred_reg>0.5)*1, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 53140 5647
## 1 7407 37710
table(train$overall, (pred_reg>1/6)*1, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 38708 20079
## 1 3188 41929
table(train$overall, (pred_reg>0.0001)*1, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 3 58784
## 1 0 45117
# full-data classification
pre_full <- predict(airline.glm, data = a1, method = "class")
table(a1$overall, (pre_full>1/6)*1, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 67441 6011
## 1 10141 46287
#in-sample ROC
pred <- prediction(pred_reg, train$overall)
perf <- performance(pred, "tpr", "fpr")
in.plot <- plot(perf, colorize = TRUE)
title(main = "In-Sample ROC")
unlist(slot(performance(pred, "auc"), "y.values")) # 0.9271
## [1] 0.9270789
# asy cost
pcut1 <- 1/2
cost1 <- function(r, pi){
mean(((r==0)&(pi>pcut1)) | ((r==1)&(pi<pcut1)))
}
pcut2 <- 1/(5+1)
cost2 <- function(r, pi){
weight1 <- 5
weight0 <- 1
c1 <- (r==1)&(pi<pcut2)
c0 <- (r==0)&(pi>pcut2)
return(mean(weight1*c1+weight0*c0))
}
cost1(r = train$overall, pi = pred_reg) # 0.1256
## [1] 0.1256352
cost2(r = train$overall, pi = pred_reg) # 0.3467
## [1] 0.3466565
cost2(r = train$overall, pi = pre_full ) # full data set asy.cost, 1.5194
## [1] 1.519356
Out-of-Sample Test
glm.predict <- predict(a1se, test, type = "response")
out.pred <- prediction(glm.predict, test$overall)
out.perf <- performance(out.pred, "tpr", "fpr")
plot(out.perf, colorize = TRUE)
unlist(slot(performance(out.pred, "auc"), "y.values")) # 0.9243
## [1] 0.9242501
Classification Tree
In-Sample Training Data
a.rpart <- rpart(formula = overall ~ ., data = train, method = "class")
a.rpart1 <- rpart(formula = overall ~., data = train, method = "class",
parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
rpart.pred <- predict(a.rpart, type = "class")
table(train$overall, rpart.pred, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 50965 7822
## 1 4194 40923
# plot tree - nbr and percentage of obs
a.rpart1
## n= 103904
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 103904 58787 1 (0.565781876 0.434218124)
## 2) online_bo< 3.5 52287 39155 0 (0.850230459 0.149769541)
## 4) in_wifi< 3.5 47524 23840 0 (0.899671745 0.100328255)
## 8) in_wifi>=0.5 45732 14925 0 (0.934728418 0.065271582)
## 16) Class=Economy,Economy Plus 32896 2775 0 (0.983128648 0.016871352) *
## 17) Class=Business 12836 10406 1 (0.810688688 0.189311312)
## 34) in_ent< 3.5 8961 3275 0 (0.926905479 0.073094521) *
## 35) in_ent>=3.5 3875 2100 1 (0.541935484 0.458064516)
## 70) Customer_Type=First-time 1582 295 0 (0.962705436 0.037294564) *
## 71) Customer_Type=Returning 2293 577 1 (0.251635412 0.748364588) *
## 9) in_wifi< 0.5 1792 9 1 (0.005022321 0.994977679) *
## 5) in_wifi>=3.5 4763 1700 1 (0.356917909 0.643082091) *
## 3) online_bo>=3.5 51617 14331 1 (0.277641087 0.722358913)
## 6) Travel_Type=Personal 10538 8218 1 (0.779844373 0.220155627)
## 12) in_wifi< 3.5 5194 605 0 (0.976703889 0.023296111)
## 24) in_wifi>=0.5 5073 0 0 (1.000000000 0.000000000) *
## 25) in_wifi< 0.5 121 0 1 (0.000000000 1.000000000) *
## 13) in_wifi>=3.5 5344 3145 1 (0.588510479 0.411489521) *
## 7) Travel_Type=Business 41079 6113 1 (0.148810828 0.851189172) *
prp(a.rpart1, extra = 101, split.col = "#9BA3EB", fallen.leaves = TRUE, box.col ="#9BA3EB")
# in-sample pred
train.pred.tree <- predict(a.rpart1, train, type = "class")
table(train$overall, train.pred.tree, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 47243 11544
## 1 1269 43848
# in-sample asy. cost
cost <- function(r, phat){
weight1 <- 5
weight0 <- 1
pcut <- weight0/(weight1+weight0)
c1 <- (r==1)&(phat<pcut) #logical vector - true if actual 1 but predict 0
c0 <-(r==0)&(phat>pcut) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
cost(train$overall, predict(a.rpart1, train, type="prob")) # = 1.2266
## [1] 1.226632
# cp value and xerror
printcp(a.rpart1)
##
## Classification tree:
## rpart(formula = overall ~ ., data = train, method = "class",
## parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))
##
## Variables actually used in tree construction:
## [1] Class Customer_Type in_ent in_wifi online_bo
## [6] Travel_Type
##
## Root node error: 58787/103904 = 0.56578
##
## n= 103904
##
## CP nsplit rel error xerror xstd
## 1 0.160886 0 1.00000 5.0000 0.0135889
## 2 0.151496 2 0.67823 1.4446 0.0098622
## 3 0.057623 3 0.52673 1.4150 0.0098806
## 4 0.038002 5 0.41149 1.5634 0.0103947
## 5 0.020889 7 0.33548 1.1340 0.0090849
## 6 0.010291 8 0.31459 1.0055 0.0086128
## 7 0.010000 9 0.30430 1.0045 0.0086130
a.rpart2 <- rpart(formula = overall ~., data = train, method = "class", cp = 0.001,
parms = list(loss=matrix(c(0,5,1,0), nrow = 2))) # new cp = 0.001
printcp(a.rpart2)
##
## Classification tree:
## rpart(formula = overall ~ ., data = train, method = "class",
## parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)), cp = 0.001)
##
## Variables actually used in tree construction:
## [1] Age baggage check_in Class clean
## [6] Customer_Type dep_delay gate in_ent in_service
## [11] in_wifi on_board online_bo seat Travel_Type
##
## Root node error: 58787/103904 = 0.56578
##
## n= 103904
##
## CP nsplit rel error xerror xstd
## 1 0.1608859 0 1.00000 5.00000 0.0135889
## 2 0.1514961 2 0.67823 1.44459 0.0098622
## 3 0.0576233 3 0.52673 1.41502 0.0098806
## 4 0.0380016 5 0.41149 1.56344 0.0103947
## 5 0.0208890 7 0.33548 1.13403 0.0090849
## 6 0.0102914 8 0.31459 1.00549 0.0086128
## 7 0.0063790 9 0.30430 1.00344 0.0086131
## 8 0.0043377 10 0.29792 0.97154 0.0084900
## 9 0.0039124 14 0.27681 0.97632 0.0085259
## 10 0.0034021 15 0.27290 0.97954 0.0085414
## 11 0.0032547 16 0.26950 0.97874 0.0085456
## 12 0.0029598 21 0.25101 0.97690 0.0085468
## 13 0.0023304 25 0.23917 0.90467 0.0082571
## 14 0.0022454 26 0.23684 0.90199 0.0082484
## 15 0.0017351 27 0.23459 0.87856 0.0081508
## 16 0.0016500 28 0.23286 0.87101 0.0081216
## 17 0.0014204 29 0.23121 0.87273 0.0081301
## 18 0.0013302 32 0.22653 0.84958 0.0080321
## 19 0.0012418 37 0.21988 0.82699 0.0079334
## 20 0.0012163 38 0.21864 0.80389 0.0078308
## 21 0.0010000 40 0.21620 0.78329 0.0077367
Out-of-Sample Testing Data
# actual cost
cost(test$overall, predict(a.rpart1, test, type="prob")) # = 1.2275
## [1] 1.227537
cost(a1$overall, predict(a.rpart1, a1, type="prob")) # full data = 1.2268
## [1] 1.226813
# compare with logistic regression
test.pred.glm <- predict(amin, test, type="response")
cost(test$overall, test.pred.glm) # = 0.3531
## [1] 0.3530567
table(test$overall, as.numeric(test.pred.glm>1/6), dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 9679 4986
## 1 837 10474
# ROC for tree
test_rpart <- predict(a.rpart1, test, type="prob")
pred2 = prediction(test_rpart[,2], test$overall)
perf2 = performance(pred2, "tpr", "fpr")
plot(perf2, colorize=TRUE)
title(main = "ROC curve of predicted probability of classification tree")
slot(performance(pred2, "auc"), "y.values")[[1]]
## [1] 0.9263265