This report analyzes flight passenger satisfaction using several classification algorithms. The dataset used for modeling is one company's flight data, which can be viewed and downloaded from kaggle.com.

The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

Import required libraries

rm(list = ls())
library(ggplot2)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(sandwich)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
library(e1071)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin

Read the flight data set and see how it’s structured

# Read Data
data = read.csv("train.csv")
test = read.csv("test.csv")
# Structure of the data
str(data)
## 'data.frame':    103904 obs. of  25 variables:
##  $ X                                : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ id                               : int  70172 5047 110028 24026 119299 111157 82113 96462 79485 65725 ...
##  $ Gender                           : chr  "Male" "Male" "Female" "Female" ...
##  $ Customer.Type                    : chr  "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
##  $ Age                              : int  13 25 26 25 61 26 47 52 41 20 ...
##  $ Type.of.Travel                   : chr  "Personal Travel" "Business travel" "Business travel" "Business travel" ...
##  $ Class                            : chr  "Eco Plus" "Business" "Business" "Business" ...
##  $ Flight.Distance                  : int  460 235 1142 562 214 1180 1276 2035 853 1061 ...
##  $ Inflight.wifi.service            : int  3 3 2 2 3 3 2 4 1 3 ...
##  $ Departure.Arrival.time.convenient: int  4 2 2 5 3 4 4 3 2 3 ...
##  $ Ease.of.Online.booking           : int  3 3 2 5 3 2 2 4 2 3 ...
##  $ Gate.location                    : int  1 3 2 5 3 1 3 4 2 4 ...
##  $ Food.and.drink                   : int  5 1 5 2 4 1 2 5 4 2 ...
##  $ Online.boarding                  : int  3 3 5 2 5 2 2 5 3 3 ...
##  $ Seat.comfort                     : int  5 1 5 2 5 1 2 5 3 3 ...
##  $ Inflight.entertainment           : int  5 1 5 2 3 1 2 5 1 2 ...
##  $ On.board.service                 : int  4 1 4 2 3 3 3 5 1 2 ...
##  $ Leg.room.service                 : int  3 5 3 5 4 4 3 5 2 3 ...
##  $ Baggage.handling                 : int  4 3 4 3 4 4 4 5 1 4 ...
##  $ Checkin.service                  : int  4 1 4 1 3 4 3 4 4 4 ...
##  $ Inflight.service                 : int  5 4 4 4 3 4 5 5 1 3 ...
##  $ Cleanliness                      : int  5 1 5 2 3 1 2 4 2 2 ...
##  $ Departure.Delay.in.Minutes       : int  25 1 0 11 0 0 9 4 0 0 ...
##  $ Arrival.Delay.in.Minutes         : num  18 6 0 9 0 0 23 0 0 0 ...
##  $ satisfaction                     : chr  "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...

The dataset contains 103,904 observations and 25 variables. The target variable is satisfaction. X is a row index and id is an identifier; the remaining 22 columns are features.

Extract summarized statistics from each variable

# Data Dimension
d= dim(data)
m= d[1] 
n= d[2]

# Statistical summary
summary(data)
##        X                id            Gender          Customer.Type     
##  Min.   :     0   Min.   :     1   Length:103904      Length:103904     
##  1st Qu.: 25976   1st Qu.: 32534   Class :character   Class :character  
##  Median : 51952   Median : 64857   Mode  :character   Mode  :character  
##  Mean   : 51952   Mean   : 64924                                        
##  3rd Qu.: 77927   3rd Qu.: 97368                                        
##  Max.   :103903   Max.   :129880                                        
##                                                                         
##       Age        Type.of.Travel        Class           Flight.Distance
##  Min.   : 7.00   Length:103904      Length:103904      Min.   :  31   
##  1st Qu.:27.00   Class :character   Class :character   1st Qu.: 414   
##  Median :40.00   Mode  :character   Mode  :character   Median : 843   
##  Mean   :39.38                                         Mean   :1189   
##  3rd Qu.:51.00                                         3rd Qu.:1743   
##  Max.   :85.00                                         Max.   :4983   
##                                                                       
##  Inflight.wifi.service Departure.Arrival.time.convenient Ease.of.Online.booking
##  Min.   :0.00          Min.   :0.00                      Min.   :0.000         
##  1st Qu.:2.00          1st Qu.:2.00                      1st Qu.:2.000         
##  Median :3.00          Median :3.00                      Median :3.000         
##  Mean   :2.73          Mean   :3.06                      Mean   :2.757         
##  3rd Qu.:4.00          3rd Qu.:4.00                      3rd Qu.:4.000         
##  Max.   :5.00          Max.   :5.00                      Max.   :5.000         
##                                                                                
##  Gate.location   Food.and.drink  Online.boarding  Seat.comfort  
##  Min.   :0.000   Min.   :0.000   Min.   :0.00    Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00    1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.00    Median :4.000  
##  Mean   :2.977   Mean   :3.202   Mean   :3.25    Mean   :3.439  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00    3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00    Max.   :5.000  
##                                                                 
##  Inflight.entertainment On.board.service Leg.room.service Baggage.handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000   
##  1st Qu.:2.000          1st Qu.:2.000    1st Qu.:2.000    1st Qu.:3.000   
##  Median :4.000          Median :4.000    Median :4.000    Median :4.000   
##  Mean   :3.358          Mean   :3.382    Mean   :3.351    Mean   :3.632   
##  3rd Qu.:4.000          3rd Qu.:4.000    3rd Qu.:4.000    3rd Qu.:5.000   
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000   
##                                                                           
##  Checkin.service Inflight.service  Cleanliness    Departure.Delay.in.Minutes
##  Min.   :0.000   Min.   :0.00     Min.   :0.000   Min.   :   0.00           
##  1st Qu.:3.000   1st Qu.:3.00     1st Qu.:2.000   1st Qu.:   0.00           
##  Median :3.000   Median :4.00     Median :3.000   Median :   0.00           
##  Mean   :3.304   Mean   :3.64     Mean   :3.286   Mean   :  14.82           
##  3rd Qu.:4.000   3rd Qu.:5.00     3rd Qu.:4.000   3rd Qu.:  12.00           
##  Max.   :5.000   Max.   :5.00     Max.   :5.000   Max.   :1592.00           
##                                                                             
##  Arrival.Delay.in.Minutes satisfaction      
##  Min.   :   0.00          Length:103904     
##  1st Qu.:   0.00          Class :character  
##  Median :   0.00          Mode  :character  
##  Mean   :  15.18                            
##  3rd Qu.:  13.00                            
##  Max.   :1584.00                            
##  NA's   :310

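The summary shows the minimum, quartiles, median, mean, and maximum of each numeric variable. The service ratings range from 0 to 5 (Baggage.handling from 1 to 5), both delay variables are heavily right-skewed with a median of 0 minutes, and Arrival.Delay.in.Minutes has 310 missing values. Character columns such as satisfaction report only their length here; the bar chart in the next section shows that there are more neutral or dissatisfied passengers than satisfied ones.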
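The missing values reported by summary() can also be counted directly. The following is a minimal sketch on a hypothetical toy data frame (the values are made up and only stand in for a few columns of train.csv); the same two calls should work on the full data.

```r
# Hypothetical toy frame standing in for a few columns of train.csv
toy <- data.frame(
  Age = c(13, 25, 26, 25, 61),
  Departure.Delay.in.Minutes = c(25, 1, 0, 11, 0),
  Arrival.Delay.in.Minutes = c(18, 6, NA, 9, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of rows that na.omit() would drop
rows_lost <- mean(!complete.cases(toy))
print(rows_lost)
```

Checking the share of incomplete rows before calling na.omit() helps decide whether dropping them is acceptable or whether imputation is worth considering.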

2. Exploratory Data Analysis

2.1 Univariate Analysis

ggplot(data = data, aes(x= satisfaction))+
  geom_bar()+
  theme_bw()+
  labs(y= "Passenger Count",
       title= "Passenger Satisfaction Rates")

Based on the bar chart above, the neutral or dissatisfied group numbers almost 60,000 passengers, clearly more than the satisfied group.
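The class balance read off the bar chart can be confirmed numerically with table() and prop.table(). This is a small sketch on a hypothetical toy vector (the 6/4 split is invented for illustration and is not the real distribution); on the real data the vector would be data$satisfaction.

```r
# Hypothetical toy target vector; on the real data use data$satisfaction
toy_satisfaction <- factor(c(rep("neutral or dissatisfied", 6),
                             rep("satisfied", 4)))

# Absolute counts per class
counts <- table(toy_satisfaction)
print(counts)

# Relative frequencies per class
shares <- prop.table(counts)
print(round(shares, 2))
```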

2.2 Bivariate Analysis

## Cast character columns to factor
data$Customer.Type=factor(data$Customer.Type)
data$Gender=factor(data$Gender)
data$Type.of.Travel=factor(data$Type.of.Travel)
data$Class=factor(data$Class)
data$satisfaction=factor(data$satisfaction)

test$Customer.Type=factor(test$Customer.Type)
test$Gender=factor(test$Gender)
test$Type.of.Travel=factor(test$Type.of.Travel)
test$Class=factor(test$Class)
test$satisfaction=factor(test$satisfaction)

## Passenger satisfaction by age distribution
ggplot(data = data, aes(x= Age, fill= satisfaction))+
  theme_bw()+
  geom_histogram(color= "purple", bins = 10)+
  labs(y= "Passenger Count",
       title= "Passenger Satisfaction by Age Distribution")

Based on the histogram above, satisfied passengers are most common in the 40-49 age range, followed by passengers in their 50s.
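A histogram shows counts, so a tall bar can reflect either many passengers or a high satisfaction rate. One way to separate the two is to compute the share of satisfied passengers per age band. The sketch below uses a hypothetical toy data frame and an assumed band width of 10 years; neither comes from the report.

```r
# Hypothetical toy data; on the real data use data$Age and data$satisfaction
toy <- data.frame(
  Age = c(13, 25, 26, 41, 47, 52, 61, 45),
  satisfaction = c("neutral or dissatisfied", "neutral or dissatisfied",
                   "satisfied", "satisfied", "satisfied", "satisfied",
                   "neutral or dissatisfied", "satisfied")
)

# Bin ages into 10-year bands: [0,10), [10,20), ...
toy$age_band <- cut(toy$Age, breaks = seq(0, 90, by = 10), right = FALSE)

# Share of satisfied passengers in each band (NA for empty bands)
satisfied_share <- tapply(toy$satisfaction == "satisfied", toy$age_band, mean)
print(satisfied_share)
```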

2.3 Multivariate Analysis

ggplot(data = data, aes(x= Age, fill= satisfaction))+
  theme_bw()+
  facet_wrap(Gender~Class)+
  geom_histogram(bins = 15)+
  labs(x= "AGE",
       y= "Passenger Count",
       title = "Passenger Satisfaction from Age, Gender & Class")

Judging from the histograms above, in Business class there are more satisfied men than satisfied women. In Eco class most passengers are neutral or dissatisfied, while in Eco Plus the counts of satisfied and of neutral or dissatisfied passengers stay below 1,000 for both men and women.
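The comparison across gender and class can be backed by exact counts with a three-way contingency table. This is a sketch on a hypothetical six-row toy data frame; the column names match the dataset, but the rows are invented.

```r
# Hypothetical toy data with the same column names as the flight dataset
toy <- data.frame(
  Gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Class = c("Business", "Business", "Eco", "Eco", "Eco Plus", "Eco Plus"),
  satisfaction = c("satisfied", "satisfied",
                   "neutral or dissatisfied", "neutral or dissatisfied",
                   "satisfied", "neutral or dissatisfied")
)

# Three-way table of counts: Gender x Class x satisfaction
counts <- xtabs(~ Gender + Class + satisfaction, data = toy)

# ftable() flattens the three dimensions into one readable table
print(ftable(counts))
```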

3. Data Preparation

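# Prepare a raw data frame (train or test) for modeling
data_preprocessing <- function(file){
  # Keep columns 3 to 25, dropping the row index X and the id column
  file <- file[3:25]
  # Convert the categorical columns to factors
  file$Gender <- factor(file$Gender)
  file$satisfaction <- factor(file$satisfaction)
  file$Customer.Type <- factor(file$Customer.Type)
  file$Type.of.Travel <- factor(file$Type.of.Travel)
  file$Class <- factor(file$Class)
  # Drop rows with missing values (Arrival.Delay.in.Minutes contains NAs)
  file <- na.omit(file)

  return(file)
}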
ggplot(data=data, aes(x=Customer.Type, y=Flight.Distance, color=Gender)) + 
  geom_boxplot() 

ggplot(data=data, aes(x=satisfaction)) + geom_bar()

ggplot(data=data, aes(fill=Class, x=satisfaction)) + geom_bar()
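The data_preprocessing function selects columns by position, and the models in the next section are fit on data as read, which is why X and id appear among the logistic regression coefficients. A possible variant drops those columns by name and converts every character column to a factor. The sketch below is an assumption about how this could be done, not the report's method; prepare_by_name and the toy data frame are hypothetical.

```r
# Hypothetical toy frame mimicking a few columns of train.csv
toy <- data.frame(
  X = 0:3,
  id = c(70172, 5047, 110028, 24026),
  Gender = c("Male", "Male", "Female", "Female"),
  Arrival.Delay.in.Minutes = c(18, NA, 0, 9),
  satisfaction = c("neutral or dissatisfied", "neutral or dissatisfied",
                   "satisfied", "neutral or dissatisfied"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop identifier columns by name, cast character
# columns to factor, and remove rows with missing values
prepare_by_name <- function(df, drop = c("X", "id")) {
  df <- df[, !(names(df) %in% drop), drop = FALSE]
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], factor)
  na.omit(df)
}

toy_clean <- prepare_by_name(toy)
str(toy_clean)
```

Selecting by name keeps the helper working if the column order changes. It would be applied to both the training and the test set before fitting.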

4. Modeling & Evaluation

Using Logistic Regression

# Create Regression Model.
logit <- glm(formula= satisfaction~.,
             data=data,
             family = binomial)

summary(logit)
## 
## Call:
## glm(formula = satisfaction ~ ., family = binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9156  -0.4877  -0.1716   0.3880   3.9929  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -7.629e+00  8.106e-02 -94.111  < 2e-16 ***
## X                                 -7.532e-07  3.247e-07  -2.319  0.02037 *  
## id                                -4.564e-06  2.694e-07 -16.940  < 2e-16 ***
## GenderMale                         4.256e-02  1.954e-02   2.178  0.02942 *  
## Customer.TypeLoyal Customer        2.012e+00  2.999e-02  67.084  < 2e-16 ***
## Age                               -8.146e-03  7.138e-04 -11.412  < 2e-16 ***
## Type.of.TravelPersonal Travel     -2.708e+00  3.165e-02 -85.568  < 2e-16 ***
## ClassEco                          -7.644e-01  2.581e-02 -29.621  < 2e-16 ***
## ClassEco Plus                     -8.861e-01  4.182e-02 -21.187  < 2e-16 ***
## Flight.Distance                   -7.498e-06  1.138e-05  -0.659  0.50983    
## Inflight.wifi.service              3.910e-01  1.151e-02  33.960  < 2e-16 ***
## Departure.Arrival.time.convenient -1.275e-01  8.233e-03 -15.485  < 2e-16 ***
## Ease.of.Online.booking            -1.413e-01  1.138e-02 -12.411  < 2e-16 ***
## Gate.location                      2.938e-02  9.185e-03   3.199  0.00138 ** 
## Food.and.drink                    -2.566e-02  1.072e-02  -2.394  0.01665 *  
## Online.boarding                    6.191e-01  1.030e-02  60.110  < 2e-16 ***
## Seat.comfort                       7.508e-02  1.125e-02   6.672 2.53e-11 ***
## Inflight.entertainment             4.252e-02  1.440e-02   2.952  0.00316 ** 
## On.board.service                   3.041e-01  1.024e-02  29.706  < 2e-16 ***
## Leg.room.service                   2.553e-01  8.572e-03  29.785  < 2e-16 ***
## Baggage.handling                   1.409e-01  1.151e-02  12.248  < 2e-16 ***
## Checkin.service                    3.290e-01  8.615e-03  38.187  < 2e-16 ***
## Inflight.service                   1.332e-01  1.214e-02  10.974  < 2e-16 ***
## Cleanliness                        2.307e-01  1.217e-02  18.957  < 2e-16 ***
## Departure.Delay.in.Minutes         5.837e-03  9.950e-04   5.866 4.46e-09 ***
## Arrival.Delay.in.Minutes          -1.060e-02  9.820e-04 -10.790  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 141768  on 103593  degrees of freedom
## Residual deviance:  68874  on 103568  degrees of freedom
##   (310 observations deleted due to missingness)
## AIC: 68926
## 
## Number of Fisher Scoring iterations: 6
actual <- test$satisfaction

pred.prob <- predict(logit, test, type="response")

pred.logit <- factor(pred.prob > .5,
                     levels = c(TRUE, FALSE),
                     labels = c("satisfied", "neutral or dissatisfied"))

cm.logit <- table(actual, pred.logit,
                  dnn = c("Actual","Predicted"))

cm.logit
##                          Predicted
## Actual                    satisfied neutral or dissatisfied
##   neutral or dissatisfied      1478                   13050
##   satisfied                    9518                    1847
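# "neutral or dissatisfied" is treated as the positive class.
# In this table the predicted columns are ordered (satisfied, neutral or dissatisfied).
TP <- cm.logit[1, 2]  # actual neutral or dissatisfied, predicted the same
TN <- cm.logit[2, 1]  # actual satisfied, predicted satisfied
FP <- cm.logit[2, 2]  # actual satisfied, predicted neutral or dissatisfied
FN <- cm.logit[1, 1]  # actual neutral or dissatisfied, predicted satisfied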

accuracy <- (TP+TN) / (TP+TN+FP+FN)
precision <- TP / (TP+FP)
recall <- TP / (TP+FN)
f1_score <- 2*precision*recall/(precision+recall)

accuracy
## [1] 0.8715869
recall
## [1] 0.8982654
precision
## [1] 0.8760153

With logistic regression we obtain about 87.2% accuracy, 87.6% precision, and 89.8% recall on the test set, with "neutral or dissatisfied" treated as the positive class.
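The column order of the confusion matrix differs between the logistic regression and the tree-based models, so indexing cells by position is easy to get wrong. A minimal sketch of a helper that indexes by class label instead is shown here. The name class_metrics is hypothetical, and the matrix is rebuilt by hand from the counts printed above.

```r
# Confusion matrix rebuilt from the logistic regression output above
cm.example <- matrix(
  c(1478, 9518, 13050, 1847),
  nrow = 2,
  dimnames = list(
    Actual = c("neutral or dissatisfied", "satisfied"),
    Predicted = c("satisfied", "neutral or dissatisfied")
  )
)

# Hypothetical helper: compute metrics by class label, not by position
class_metrics <- function(cm, positive) {
  negative <- setdiff(rownames(cm), positive)
  stopifnot(length(negative) == 1, positive %in% colnames(cm))
  TP <- cm[positive, positive]
  TN <- cm[negative, negative]
  FP <- cm[negative, positive]  # actual negative, predicted positive
  FN <- cm[positive, negative]  # actual positive, predicted negative
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  c(accuracy = (TP + TN) / sum(cm),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

m <- class_metrics(cm.example, positive = "neutral or dissatisfied")
print(round(m, 4))
```

Passing positive = "satisfied" instead would report precision and recall from the point of view of the satisfied class.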

Using Decision Tree

# Create Decision Tree
dt <- ctree(formula= satisfaction~.,
            data = data)

pred.dt <- predict(dt, test)

actual <- test$satisfaction
cm.dt <- table(actual, pred.dt,
                  dnn = c("Actual","Predicted"))

cm.dt
##                          Predicted
## Actual                    neutral or dissatisfied satisfied
##   neutral or dissatisfied                   14136       437
##   satisfied                                   733     10670
TP <- cm.dt[1, 1]
TN <- cm.dt[2, 2]
FP <- cm.dt[2, 1]
FN <- cm.dt[1, 2]

accuracy <- (TP+TN) / (TP+TN+FP+FN)
precision <- TP / (TP+FP)
recall <- TP / (TP+FN)
f1_score <- 2*precision*recall/(precision+recall)

accuracy
## [1] 0.9549584
recall
## [1] 0.970013
precision
## [1] 0.9507028

Using Random Forest

# Create Random Forest
rf <- randomForest(formula=satisfaction~.,
            data = na.omit(data))


pred.rf <- predict(rf, test)

actual <- test$satisfaction
cm.rf <- table(actual, pred.rf,
               dnn = c("Actual","Predicted"))

cm.rf
##                          Predicted
## Actual                    neutral or dissatisfied satisfied
##   neutral or dissatisfied                   14239       289
##   satisfied                                   628     10737
TP <- cm.rf[1, 1]
TN <- cm.rf[2, 2]
FP <- cm.rf[2, 1]
FN <- cm.rf[1, 2]

accuracy <- (TP+TN) / (TP+TN+FP+FN)
precision <- TP / (TP+FP)
recall <- TP / (TP+FN)
f1_score <- 2*precision*recall/(precision+recall)

accuracy
## [1] 0.964585
recall
## [1] 0.9801074
precision
## [1] 0.9577588
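The three models can be compared side by side from the confusion matrices printed above. The sketch below copies those counts by hand, with "neutral or dissatisfied" as the positive class as in the code above, and recomputes the metrics in one table. Note that the decision tree matrix sums to 25,976 test rows while the other two sum to 25,893, so the comparison is not made on exactly the same rows.

```r
# Counts copied from the confusion matrices reported above
# (positive class: "neutral or dissatisfied")
results <- data.frame(
  model = c("logistic regression", "decision tree", "random forest"),
  TP = c(13050, 14136, 14239),
  FN = c(1478, 437, 289),
  FP = c(1847, 733, 628),
  TN = c(9518, 10670, 10737)
)

# Number of test rows each model was scored on
results$n <- with(results, TP + FN + FP + TN)

# Standard metrics derived from the four cells
results$accuracy <- with(results, (TP + TN) / n)
results$precision <- with(results, TP / (TP + FP))
results$recall <- with(results, TP / (TP + FN))
results$f1 <- with(results, 2 * precision * recall / (precision + recall))

# Sort from most to least accurate
print(results[order(-results$accuracy), ], digits = 4)
```

On these numbers the random forest has the highest accuracy, followed by the decision tree and then logistic regression.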