This report analyzes airline passenger satisfaction using several classification algorithms. The dataset used for modeling is one company's flight data, which can be viewed and retrieved from the kaggle.com site.
The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Import required libraries
rm(list = ls())
library(ggplot2)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(sandwich)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
library(e1071)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
Read the flight data set and see how it’s structured
# Read Data
data = read.csv("train.csv")
test = read.csv("test.csv")
# Structure of the data
str(data)
## 'data.frame': 103904 obs. of 25 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 70172 5047 110028 24026 119299 111157 82113 96462 79485 65725 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Customer.Type : chr "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 13 25 26 25 61 26 47 52 41 20 ...
## $ Type.of.Travel : chr "Personal Travel" "Business travel" "Business travel" "Business travel" ...
## $ Class : chr "Eco Plus" "Business" "Business" "Business" ...
## $ Flight.Distance : int 460 235 1142 562 214 1180 1276 2035 853 1061 ...
## $ Inflight.wifi.service : int 3 3 2 2 3 3 2 4 1 3 ...
## $ Departure.Arrival.time.convenient: int 4 2 2 5 3 4 4 3 2 3 ...
## $ Ease.of.Online.booking : int 3 3 2 5 3 2 2 4 2 3 ...
## $ Gate.location : int 1 3 2 5 3 1 3 4 2 4 ...
## $ Food.and.drink : int 5 1 5 2 4 1 2 5 4 2 ...
## $ Online.boarding : int 3 3 5 2 5 2 2 5 3 3 ...
## $ Seat.comfort : int 5 1 5 2 5 1 2 5 3 3 ...
## $ Inflight.entertainment : int 5 1 5 2 3 1 2 5 1 2 ...
## $ On.board.service : int 4 1 4 2 3 3 3 5 1 2 ...
## $ Leg.room.service : int 3 5 3 5 4 4 3 5 2 3 ...
## $ Baggage.handling : int 4 3 4 3 4 4 4 5 1 4 ...
## $ Checkin.service : int 4 1 4 1 3 4 3 4 4 4 ...
## $ Inflight.service : int 5 4 4 4 3 4 5 5 1 3 ...
## $ Cleanliness : int 5 1 5 2 3 1 2 4 2 2 ...
## $ Departure.Delay.in.Minutes : int 25 1 0 11 0 0 9 4 0 0 ...
## $ Arrival.Delay.in.Minutes : num 18 6 0 9 0 0 23 0 0 0 ...
## $ satisfaction : chr "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
The dataset contains 103,904 observations and 25 variables. The target variable is satisfaction. X and id appear to be a row index and a passenger identifier, and the remaining 22 columns are features.
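To make the split between target, identifiers, and features explicit, the column names can be partitioned in code. The following is a minimal sketch on a toy data frame (`toy`, `target`, `id_cols`, and `feature_cols` are names introduced here for illustration); the values are the first three rows shown by `str(data)`, and treating `X` and `id` as identifiers is an assumption.

```r
# Toy stand-in for `data`, using the first three rows shown by str(data)
toy <- data.frame(
  X = 0:2,
  id = c(70172, 5047, 110028),
  Age = c(13, 25, 26),
  Flight.Distance = c(460, 235, 1142),
  satisfaction = c("neutral or dissatisfied", "neutral or dissatisfied", "satisfied")
)

target <- "satisfaction"
id_cols <- c("X", "id")  # assumed to be identifiers, not predictors

# Everything that is neither the target nor an identifier is a feature
feature_cols <- setdiff(names(toy), c(target, id_cols))
feature_cols
```

Applied to the full data set, the same `setdiff()` call would leave the 22 feature columns.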
Extract summarized statistics from each variable
# Data Dimension
d= dim(data)
m= d[1]  # number of observations (rows)
n= d[2]  # number of variables (columns)
# Statistical summary
summary(data)
## X id Gender Customer.Type
## Min. : 0 Min. : 1 Length:103904 Length:103904
## 1st Qu.: 25976 1st Qu.: 32534 Class :character Class :character
## Median : 51952 Median : 64857 Mode :character Mode :character
## Mean : 51952 Mean : 64924
## 3rd Qu.: 77927 3rd Qu.: 97368
## Max. :103903 Max. :129880
##
## Age Type.of.Travel Class Flight.Distance
## Min. : 7.00 Length:103904 Length:103904 Min. : 31
## 1st Qu.:27.00 Class :character Class :character 1st Qu.: 414
## Median :40.00 Mode :character Mode :character Median : 843
## Mean :39.38 Mean :1189
## 3rd Qu.:51.00 3rd Qu.:1743
## Max. :85.00 Max. :4983
##
## Inflight.wifi.service Departure.Arrival.time.convenient Ease.of.Online.booking
## Min. :0.00 Min. :0.00 Min. :0.000
## 1st Qu.:2.00 1st Qu.:2.00 1st Qu.:2.000
## Median :3.00 Median :3.00 Median :3.000
## Mean :2.73 Mean :3.06 Mean :2.757
## 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.00 Max. :5.00 Max. :5.000
##
## Gate.location Food.and.drink Online.boarding Seat.comfort
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.00 Median :4.000
## Mean :2.977 Mean :3.202 Mean :3.25 Mean :3.439
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
##
## Inflight.entertainment On.board.service Leg.room.service Baggage.handling
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :4.000
## Mean :3.358 Mean :3.382 Mean :3.351 Mean :3.632
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## Checkin.service Inflight.service Cleanliness Departure.Delay.in.Minutes
## Min. :0.000 Min. :0.00 Min. :0.000 Min. : 0.00
## 1st Qu.:3.000 1st Qu.:3.00 1st Qu.:2.000 1st Qu.: 0.00
## Median :3.000 Median :4.00 Median :3.000 Median : 0.00
## Mean :3.304 Mean :3.64 Mean :3.286 Mean : 14.82
## 3rd Qu.:4.000 3rd Qu.:5.00 3rd Qu.:4.000 3rd Qu.: 12.00
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :1592.00
##
## Arrival.Delay.in.Minutes satisfaction
## Min. : 0.00 Length:103904
## 1st Qu.: 0.00 Class :character
## Median : 0.00 Mode :character
## Mean : 15.18
## 3rd Qu.: 13.00
## Max. :1584.00
## NA's :310
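The summary shows the minimum, quartiles, median, mean, and maximum of each numeric variable. Arrival.Delay.in.Minutes is the only column with missing values (310). Plotting the target variable shows that neutral or dissatisfied passengers outnumber satisfied ones.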
ggplot(data = data, aes(x= satisfaction))+
geom_bar()+
theme_bw()+
labs(y= "Passenger Count",
title= "Passenger Satisfaction Rates")
Based on the bar chart above, the neutral or dissatisfied class contains almost 60,000 passengers, more than the satisfied class.
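The class balance read off the bar chart can also be computed directly. This is a minimal sketch on a made-up vector (`satisfaction` here is a hypothetical stand-in for `data$satisfaction`, not the real column), using only base R.

```r
# Hypothetical target vector standing in for data$satisfaction
satisfaction <- factor(c("neutral or dissatisfied", "neutral or dissatisfied",
                         "neutral or dissatisfied", "satisfied", "satisfied"))

counts <- table(satisfaction)   # absolute count per class
shares <- prop.table(counts)    # relative frequencies, summing to 1

counts
round(shares, 3)
```

On the real data, `table(data$satisfaction)` would give the exact counts behind the two bars.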
## Cast character columns to factor
data$Customer.Type=factor(data$Customer.Type)
data$Gender=factor(data$Gender)
data$Type.of.Travel=factor(data$Type.of.Travel)
data$Class=factor(data$Class)
data$satisfaction=factor(data$satisfaction)
test$Customer.Type=factor(test$Customer.Type)
test$Gender=factor(test$Gender)
test$Type.of.Travel=factor(test$Type.of.Travel)
test$Class=factor(test$Class)
test$satisfaction=factor(test$satisfaction)
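The column-by-column casts above can be written more compactly by converting a named set of columns in one step. The sketch below assumes a toy data frame (`toy` and `cat_cols` are illustrative names, and the values are made up); it is one possible idiom, not the method used in this report.

```r
# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  Gender = c("Male", "Female", "Female"),
  Class = c("Eco Plus", "Business", "Business"),
  Age = c(13, 25, 26),
  stringsAsFactors = FALSE
)

# Names of the categorical columns to convert
cat_cols <- c("Gender", "Class")

# Convert all listed columns to factor in one step
toy[cat_cols] <- lapply(toy[cat_cols], factor)

str(toy)
```

When the training and test sets are cast separately like this, their factor levels agree only if both sets contain the same categories, which is worth checking before prediction.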
## Passenger satisfaction by age distribution
ggplot(data = data, aes(x= Age, fill= satisfaction))+
theme_bw()+
geom_histogram(color= "purple", bins = 10)+
labs(y= "Passenger Count",
title= "Passenger Satisfaction by Age Distribution")
Based on the histogram above, satisfied passengers are most common in the 40-49 age range, followed by passengers in their 50s.
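The histogram shows counts, so a numeric check of the satisfaction rate per age band can help confirm the reading. The following sketch uses made-up ages and outcomes (`age`, `satisfied`, and the ten-year bands are assumptions for illustration) and base R's `cut()` and `tapply()`.

```r
# Hypothetical ages and outcomes standing in for data$Age and data$satisfaction
age <- c(13, 25, 41, 45, 47, 52, 61, 20)
satisfied <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

# Ten-year age bands; right = FALSE makes each band include its lower bound
age_band <- cut(age, breaks = seq(0, 90, by = 10), right = FALSE)

# Share of satisfied passengers within each band (NA for empty bands)
rate_by_band <- tapply(satisfied, age_band, mean)
rate_by_band
```

A rate per band separates "many passengers in this age range" from "a high share of them are satisfied", which a stacked count histogram mixes together.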
ggplot(data = data, aes(x= Age, fill= satisfaction))+
theme_bw()+
facet_wrap(Gender~Class)+
geom_histogram(bins = 15)+
labs(x= "AGE",
y= "Passenger Count",
title = "Passenger Satisfaction from Age, Gender & Class")
Judging from the histogram above, business class has more satisfied men than satisfied women. In Eco class, most passengers are neutral or dissatisfied. In Eco Plus, neither the satisfied nor the neutral or dissatisfied count reaches 1,000 passengers for men or women.
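The comparison across gender and class can be backed by a contingency table rather than read off the facets. This is a small sketch on invented rows (`toy` is hypothetical) using base R's `xtabs()` and `ftable()`.

```r
# Toy stand-in for `data` (hypothetical values)
toy <- data.frame(
  Gender = c("Male", "Male", "Female", "Female", "Male", "Female"),
  Class = c("Business", "Eco", "Business", "Eco", "Business", "Eco Plus"),
  satisfaction = c("satisfied", "neutral or dissatisfied", "satisfied",
                   "neutral or dissatisfied", "satisfied", "neutral or dissatisfied")
)

# Three-way contingency table of passenger counts
counts <- xtabs(~ Gender + Class + satisfaction, data = toy)

# Flatten to a readable two-dimensional layout
ftable(counts)
```

Running the same two calls on `data` would give the exact counts for each Gender and Class combination.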
data_preprocessing <- function(file){
file= file[3:25]  # drop the first two columns (X and id), keep the other 23
file$Gender= factor(file$Gender)
file$satisfaction= factor(file$satisfaction)
file$Customer.Type= factor(file$Customer.Type)
file$Type.of.Travel= factor(file$Type.of.Travel)
file$Class= factor(file$Class)
file= na.omit(file)
return(file)
}
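The function above selects columns by position and is not called in the code shown in this report. As an illustration only, a name-based variant could look like the sketch below; `prep_by_name` and `toy` are hypothetical names, and the toy values are partly invented (one delay is set to `NA` to show the effect of `na.omit()`).

```r
# Hypothetical variant of data_preprocessing that selects columns by name
prep_by_name <- function(df, id_cols = c("X", "id")) {
  df <- df[, setdiff(names(df), id_cols), drop = FALSE]  # drop identifier columns
  is_chr <- vapply(df, is.character, logical(1))          # find character columns
  df[is_chr] <- lapply(df[is_chr], factor)                # cast them to factor
  na.omit(df)                                             # drop rows with any NA
}

# Toy stand-in for `data` with one missing delay value
toy <- data.frame(
  X = 0:3,
  id = c(70172, 5047, 110028, 24026),
  Gender = c("Male", "Male", "Female", "Female"),
  Arrival.Delay.in.Minutes = c(18, 6, NA, 9),
  satisfaction = c("neutral or dissatisfied", "neutral or dissatisfied",
                   "satisfied", "neutral or dissatisfied"),
  stringsAsFactors = FALSE
)

clean <- prep_by_name(toy)
dim(clean)
```

Selecting by name keeps the function working if the column order of the input file changes, which positional indexing such as `file[3:25]` does not guarantee.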
ggplot(data=data, aes(x=Customer.Type, y=Flight.Distance, color=Gender)) +
geom_boxplot()
ggplot(data=data, aes(x=satisfaction)) + geom_bar()
ggplot(data=data, aes(fill=Class, x=satisfaction)) + geom_bar()
# Fit a logistic regression model.
# Note: `data` still contains X and id here, so they enter the model as predictors.
logit <- glm(formula= satisfaction~.,
data=data,
family = binomial)
summary(logit)
##
## Call:
## glm(formula = satisfaction ~ ., family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9156 -0.4877 -0.1716 0.3880 3.9929
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.629e+00 8.106e-02 -94.111 < 2e-16 ***
## X -7.532e-07 3.247e-07 -2.319 0.02037 *
## id -4.564e-06 2.694e-07 -16.940 < 2e-16 ***
## GenderMale 4.256e-02 1.954e-02 2.178 0.02942 *
## Customer.TypeLoyal Customer 2.012e+00 2.999e-02 67.084 < 2e-16 ***
## Age -8.146e-03 7.138e-04 -11.412 < 2e-16 ***
## Type.of.TravelPersonal Travel -2.708e+00 3.165e-02 -85.568 < 2e-16 ***
## ClassEco -7.644e-01 2.581e-02 -29.621 < 2e-16 ***
## ClassEco Plus -8.861e-01 4.182e-02 -21.187 < 2e-16 ***
## Flight.Distance -7.498e-06 1.138e-05 -0.659 0.50983
## Inflight.wifi.service 3.910e-01 1.151e-02 33.960 < 2e-16 ***
## Departure.Arrival.time.convenient -1.275e-01 8.233e-03 -15.485 < 2e-16 ***
## Ease.of.Online.booking -1.413e-01 1.138e-02 -12.411 < 2e-16 ***
## Gate.location 2.938e-02 9.185e-03 3.199 0.00138 **
## Food.and.drink -2.566e-02 1.072e-02 -2.394 0.01665 *
## Online.boarding 6.191e-01 1.030e-02 60.110 < 2e-16 ***
## Seat.comfort 7.508e-02 1.125e-02 6.672 2.53e-11 ***
## Inflight.entertainment 4.252e-02 1.440e-02 2.952 0.00316 **
## On.board.service 3.041e-01 1.024e-02 29.706 < 2e-16 ***
## Leg.room.service 2.553e-01 8.572e-03 29.785 < 2e-16 ***
## Baggage.handling 1.409e-01 1.151e-02 12.248 < 2e-16 ***
## Checkin.service 3.290e-01 8.615e-03 38.187 < 2e-16 ***
## Inflight.service 1.332e-01 1.214e-02 10.974 < 2e-16 ***
## Cleanliness 2.307e-01 1.217e-02 18.957 < 2e-16 ***
## Departure.Delay.in.Minutes 5.837e-03 9.950e-04 5.866 4.46e-09 ***
## Arrival.Delay.in.Minutes -1.060e-02 9.820e-04 -10.790 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 141768 on 103593 degrees of freedom
## Residual deviance: 68874 on 103568 degrees of freedom
## (310 observations deleted due to missingness)
## AIC: 68926
##
## Number of Fisher Scoring iterations: 6
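The estimates above are on the log-odds scale. With a factor response, `glm(..., family = binomial)` treats the first level as failure, so these coefficients describe the odds of "satisfied". Exponentiating a coefficient gives the multiplicative change in those odds, holding the other predictors fixed. The sketch below copies three estimates from the summary output; `est` and `odds_ratio` are names introduced here.

```r
# Selected estimates copied from the summary(logit) output above
est <- c(
  "Customer.TypeLoyal Customer"   =  2.012,
  "Type.of.TravelPersonal Travel" = -2.708,
  "Online.boarding"               =  0.6191
)

# exp(coefficient) is the multiplicative change in the odds of "satisfied"
odds_ratio <- exp(est)
round(odds_ratio, 3)
```

A value above 1 means the odds of satisfaction rise with that predictor, and a value below 1 means they fall. On the fitted object, `exp(coef(logit))` would produce the full set.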
actual <- test$satisfaction
pred.prob <- predict(logit, test, type="response")
pred.logit <- factor(pred.prob > .5,
levels = c(TRUE, FALSE),
labels = c("satisfied", "neutral or dissatisfied"))
cm.logit <- table(actual, pred.logit,
dnn = c("Actual","Predicted"))
cm.logit
## Predicted
## Actual satisfied neutral or dissatisfied
## neutral or dissatisfied 1478 13050
## satisfied 9518 1847
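# "neutral or dissatisfied" is the positive class; note the swapped column order of cm.logit
TP <- cm.logit[1, 2]  # actual and predicted "neutral or dissatisfied"
TN <- cm.logit[2, 1]  # actual and predicted "satisfied"
FP <- cm.logit[2, 2]  # actual "satisfied", predicted "neutral or dissatisfied"
FN <- cm.logit[1, 1]  # actual "neutral or dissatisfied", predicted "satisfied"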
accuracy <- (TP+TN) / (TP+TN+FP+FN)
precision <- TP / (TP+FP)
recall <- TP / (TP+FN)
f1_score <- 2*precision*recall/(precision+recall)
accuracy
## [1] 0.8715869
recall
## [1] 0.8982654
precision
## [1] 0.8760153
Logistic regression reaches about 87.2% accuracy, 87.6% precision, and 89.8% recall on the test set, with "neutral or dissatisfied" treated as the positive class.
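Because the columns of `cm.logit` are in a different order from its rows, indexing cells by position is easy to get wrong. One way to avoid this, sketched below, is to look cells up by class label. The helper `classification_metrics` is a hypothetical name introduced here, and the matrix is rebuilt by hand from the confusion matrix printed above.

```r
# Compute accuracy, precision, recall and F1 from a confusion matrix,
# looking cells up by label so that the column order does not matter.
# `cm` has actual classes in rows and predicted classes in columns.
classification_metrics <- function(cm, positive) {
  negative <- setdiff(rownames(cm), positive)
  TP <- cm[positive, positive]
  TN <- cm[negative, negative]
  FP <- cm[negative, positive]  # actual negative, predicted positive
  FN <- cm[positive, negative]  # actual positive, predicted negative
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  c(accuracy = (TP + TN) / sum(cm),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

# Logistic regression confusion matrix as printed above (filled column by column)
cm.logit <- matrix(
  c(1478, 9518, 13050, 1847), nrow = 2,
  dimnames = list(Actual = c("neutral or dissatisfied", "satisfied"),
                  Predicted = c("satisfied", "neutral or dissatisfied"))
)

m <- classification_metrics(cm.logit, positive = "neutral or dissatisfied")
round(m, 4)
```

The accuracy, precision, and recall returned here match the values printed by the report's own code, and the helper also returns the F1 score, which the report computes but does not print.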
# Create Decision Tree
dt <- ctree(formula= satisfaction~.,
data = data)
pred.dt <- predict(dt, test)
actual <- test$satisfaction
cm.dt <- table(actual, pred.dt,
dnn = c("Actual","Predicted"))
cm.dt
## Predicted
## Actual neutral or dissatisfied satisfied
## neutral or dissatisfied 14136 437
## satisfied 733 10670
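# "neutral or dissatisfied" is the positive class
TP <- cm.dt[1, 1]  # actual and predicted "neutral or dissatisfied"
TN <- cm.dt[2, 2]  # actual and predicted "satisfied"
FP <- cm.dt[2, 1]  # actual "satisfied", predicted "neutral or dissatisfied"
FN <- cm.dt[1, 2]  # actual "neutral or dissatisfied", predicted "satisfied"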
accuracy <- (TP+TN) / (TP+TN+FP+FN)
precision <- TP / (TP+FP)
recall <- TP / (TP+FN)
f1_score <- 2*precision*recall/(precision+recall)
accuracy
## [1] 0.9549584
recall
## [1] 0.970013
precision
## [1] 0.9507028
# Create Random Forest
rf <- randomForest(formula=satisfaction~.,
data = na.omit(data))
pred.rf <- predict(rf, test)
actual <- test$satisfaction
cm.rf <- table(actual, pred.rf,
dnn = c("Actual","Predicted"))
cm.rf
## Predicted
## Actual neutral or dissatisfied satisfied
## neutral or dissatisfied 14239 289
## satisfied 628 10737
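# "neutral or dissatisfied" is the positive class
TP <- cm.rf[1, 1]  # actual and predicted "neutral or dissatisfied"
TN <- cm.rf[2, 2]  # actual and predicted "satisfied"
FP <- cm.rf[2, 1]  # actual "satisfied", predicted "neutral or dissatisfied"
FN <- cm.rf[1, 2]  # actual "neutral or dissatisfied", predicted "satisfied"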
accuracy <- (TP+TN) / (TP+TN+FP+FN)
precision <- TP / (TP+FP)
recall <- TP / (TP+FN)
f1_score <- 2*precision*recall/(precision+recall)
accuracy
## [1] 0.964585
recall
## [1] 0.9801074
precision
## [1] 0.9577588
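To compare the three models side by side, the metrics printed in this report can be collected into one table. The sketch below copies those printed values by hand (`results` is a name introduced here) and adds the F1 score, which the code above computes for each model but never prints.

```r
# Accuracy, precision and recall as printed in this report
results <- data.frame(
  model = c("logistic regression", "decision tree", "random forest"),
  accuracy = c(0.8715869, 0.9549584, 0.964585),
  precision = c(0.8760153, 0.9507028, 0.9577588),
  recall = c(0.8982654, 0.970013, 0.9801074)
)

# F1 is the harmonic mean of precision and recall
results$f1 <- with(results, 2 * precision * recall / (precision + recall))

# Sort from highest to lowest F1
results[order(-results$f1), ]
```

On these figures the random forest ranks first on every metric, followed by the decision tree and then logistic regression. The decision tree's confusion matrix sums to 25,976 test rows while the other two sum to 25,893, so the models are not scored on exactly the same rows. A likely cause, stated here as an assumption, is that test rows with a missing Arrival.Delay.in.Minutes receive NA predictions from `glm` and `randomForest` but not from `ctree`.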