Predicting Airline Passenger Satisfaction

Team members:

  1. Ilham Pambudi (25062528)
  2. Eau Pei Ling (24202744)
  3. Alya Nadrah binti Abdul Rahman (25096247)
  4. Liew Jia Yi (25075973)
  5. Sakinah Al Izzah Binti Mohd Asri (25096564)

Introduction

The airline industry operates in a highly competitive environment where passenger satisfaction plays a critical role in customer retention, brand loyalty, and long-term business performance. Understanding what drives a passenger to leave a flight feeling satisfied or dissatisfied is therefore of significant strategic value to any airline operator.

Project Objectives:

  1. To identify the most significant features that impact passenger satisfaction
  2. To evaluate machine learning models that can accurately predict passenger satisfaction based on selected features.
  3. To compare multiple machine learning models and select the best performing model for predicting passenger satisfaction.
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(corrplot)
library(stringr)

Data Understanding

# Load dataset
df<-read_csv("Airplane Dataset.csv")
## New names:
## Rows: 25976 Columns: 25
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): Gender, Customer Type, Type of Travel, Class, satisfaction dbl (20): ...1,
## id, Age, Flight Distance, Inflight wifi service, Departure/A...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Check on data types
glimpse(df)
## Rows: 25,976
## Columns: 25
## $ ...1                                <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ id                                  <dbl> 19556, 90035, 12360, 77959, 36875,…
## $ Gender                              <chr> "Female", "Female", "Male", "Male"…
## $ `Customer Type`                     <chr> "Loyal Customer", "Loyal Customer"…
## $ Age                                 <dbl> 52, 36, 20, 44, 49, 16, 77, 43, 47…
## $ `Type of Travel`                    <chr> "Business travel", "Business trave…
## $ Class                               <chr> "Eco", "Business", "Eco", "Busines…
## $ `Flight Distance`                   <dbl> 160, 2863, 192, 3377, 1182, 311, 3…
## $ `Inflight wifi service`             <dbl> 5, 1, 2, 0, 2, 3, 5, 2, 5, 2, 4, 2…
## $ `Departure/Arrival time convenient` <dbl> 4, 1, 0, 0, 3, 3, 5, 2, 2, 2, 1, 5…
## $ `Ease of Online booking`            <dbl> 3, 3, 2, 0, 4, 3, 5, 2, 2, 2, 1, 5…
## $ `Gate location`                     <dbl> 4, 1, 4, 2, 3, 3, 5, 2, 2, 2, 1, 5…
## $ `Food and drink`                    <dbl> 3, 5, 2, 3, 4, 5, 3, 4, 5, 3, 5, 1…
## $ `Online boarding`                   <dbl> 4, 4, 2, 4, 1, 5, 5, 4, 5, 4, 1, 3…
## $ `Seat comfort`                      <dbl> 3, 5, 2, 4, 2, 3, 5, 5, 5, 4, 5, 4…
## $ `Inflight entertainment`            <dbl> 5, 4, 2, 1, 2, 5, 5, 4, 5, 4, 3, 2…
## $ `On-board service`                  <dbl> 5, 4, 4, 1, 2, 4, 5, 4, 2, 4, 3, 2…
## $ `Leg room service`                  <dbl> 5, 4, 1, 1, 2, 3, 5, 4, 2, 4, 4, 2…
## $ `Baggage handling`                  <dbl> 5, 4, 3, 1, 2, 1, 5, 4, 5, 4, 3, 2…
## $ `Checkin service`                   <dbl> 2, 3, 2, 3, 4, 1, 4, 5, 3, 5, 1, 3…
## $ `Inflight service`                  <dbl> 5, 4, 2, 1, 2, 2, 5, 4, 3, 4, 3, 2…
## $ Cleanliness                         <dbl> 5, 5, 2, 4, 4, 5, 3, 3, 5, 4, 4, 4…
## $ `Departure Delay in Minutes`        <dbl> 50, 0, 0, 0, 0, 0, 0, 77, 1, 28, 2…
## $ `Arrival Delay in Minutes`          <dbl> 44, 0, 0, 6, 20, 0, 0, 65, 0, 14, …
## $ satisfaction                        <chr> "satisfied", "satisfied", "neutral…
# Summary statistics
summary(df)
##       ...1             id            Gender          Customer Type     
##  Min.   :    0   Min.   :    17   Length:25976       Length:25976      
##  1st Qu.: 6494   1st Qu.: 32170   Class :character   Class :character  
##  Median :12988   Median : 65320   Mode  :character   Mode  :character  
##  Mean   :12988   Mean   : 65006                                        
##  3rd Qu.:19481   3rd Qu.: 97584                                        
##  Max.   :25975   Max.   :129877                                        
##                                                                        
##       Age        Type of Travel        Class           Flight Distance
##  Min.   : 7.00   Length:25976       Length:25976       Min.   :  31   
##  1st Qu.:27.00   Class :character   Class :character   1st Qu.: 414   
##  Median :40.00   Mode  :character   Mode  :character   Median : 849   
##  Mean   :39.62                                         Mean   :1194   
##  3rd Qu.:51.00                                         3rd Qu.:1744   
##  Max.   :85.00                                         Max.   :4983   
##                                                                       
##  Inflight wifi service Departure/Arrival time convenient Ease of Online booking
##  Min.   :0.000         Min.   :0.000                     Min.   :0.000         
##  1st Qu.:2.000         1st Qu.:2.000                     1st Qu.:2.000         
##  Median :3.000         Median :3.000                     Median :3.000         
##  Mean   :2.725         Mean   :3.047                     Mean   :2.757         
##  3rd Qu.:4.000         3rd Qu.:4.000                     3rd Qu.:4.000         
##  Max.   :5.000         Max.   :5.000                     Max.   :5.000         
##                                                                                
##  Gate location   Food and drink  Online boarding  Seat comfort  
##  Min.   :1.000   Min.   :0.000   Min.   :0.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :4.000   Median :4.000  
##  Mean   :2.977   Mean   :3.215   Mean   :3.262   Mean   :3.449  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##  Inflight entertainment On-board service Leg room service Baggage handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.00     Min.   :1.000   
##  1st Qu.:2.000          1st Qu.:2.000    1st Qu.:2.00     1st Qu.:3.000   
##  Median :4.000          Median :4.000    Median :4.00     Median :4.000   
##  Mean   :3.358          Mean   :3.386    Mean   :3.35     Mean   :3.633   
##  3rd Qu.:4.000          3rd Qu.:4.000    3rd Qu.:4.00     3rd Qu.:5.000   
##  Max.   :5.000          Max.   :5.000    Max.   :5.00     Max.   :5.000   
##                                                                           
##  Checkin service Inflight service  Cleanliness    Departure Delay in Minutes
##  Min.   :1.000   Min.   :0.000    Min.   :0.000   Min.   :   0.00           
##  1st Qu.:3.000   1st Qu.:3.000    1st Qu.:2.000   1st Qu.:   0.00           
##  Median :3.000   Median :4.000    Median :3.000   Median :   0.00           
##  Mean   :3.314   Mean   :3.649    Mean   :3.286   Mean   :  14.31           
##  3rd Qu.:4.000   3rd Qu.:5.000    3rd Qu.:4.000   3rd Qu.:  12.00           
##  Max.   :5.000   Max.   :5.000    Max.   :5.000   Max.   :1128.00           
##                                                                             
##  Arrival Delay in Minutes satisfaction      
##  Min.   :   0.00          Length:25976      
##  1st Qu.:   0.00          Class :character  
##  Median :   0.00          Mode  :character  
##  Mean   :  14.74                            
##  3rd Qu.:  13.00                            
##  Max.   :1115.00                            
##  NA's   :83
# Create dataframe of column types
data.frame(column = names(df), dtype = sapply(df, class) )
##                                                              column     dtype
## ...1                                                           ...1   numeric
## id                                                               id   numeric
## Gender                                                       Gender character
## Customer Type                                         Customer Type character
## Age                                                             Age   numeric
## Type of Travel                                       Type of Travel character
## Class                                                         Class character
## Flight Distance                                     Flight Distance   numeric
## Inflight wifi service                         Inflight wifi service   numeric
## Departure/Arrival time convenient Departure/Arrival time convenient   numeric
## Ease of Online booking                       Ease of Online booking   numeric
## Gate location                                         Gate location   numeric
## Food and drink                                       Food and drink   numeric
## Online boarding                                     Online boarding   numeric
## Seat comfort                                           Seat comfort   numeric
## Inflight entertainment                       Inflight entertainment   numeric
## On-board service                                   On-board service   numeric
## Leg room service                                   Leg room service   numeric
## Baggage handling                                   Baggage handling   numeric
## Checkin service                                     Checkin service   numeric
## Inflight service                                   Inflight service   numeric
## Cleanliness                                             Cleanliness   numeric
## Departure Delay in Minutes               Departure Delay in Minutes   numeric
## Arrival Delay in Minutes                   Arrival Delay in Minutes   numeric
## satisfaction                                           satisfaction character
# Total memory
print(object.size(df), units = "MB")
## 5 Mb
# Count numeric columns
sum(sapply(df, is.numeric))
## [1] 20
# Count categorical columns
sum(sapply(df, is.character))
## [1] 5

Objective: * Understand dataset structure * Identify variable types * Detect potential data quality issues

Key Observations: * The dataset contains 25,976 rows and 25 columns, occupies 5 MB of memory. * 20 numeric columns and 5 character (categorical) columns. * Character type columns: Gender, Customer Type, Type of Travel, Class, and satisfaction. * Numeric type: Age, Flight Distance, and various service ratings. * An unnamed index column, to be removed in data cleaning process.

Data Cleaning

# Remove reduncant column
df <- df%>%
  select(-id)

# Check any missing value
colSums(is.na(df))
##                              ...1                            Gender 
##                                 0                                 0 
##                     Customer Type                               Age 
##                                 0                                 0 
##                    Type of Travel                             Class 
##                                 0                                 0 
##                   Flight Distance             Inflight wifi service 
##                                 0                                 0 
## Departure/Arrival time convenient            Ease of Online booking 
##                                 0                                 0 
##                     Gate location                    Food and drink 
##                                 0                                 0 
##                   Online boarding                      Seat comfort 
##                                 0                                 0 
##            Inflight entertainment                  On-board service 
##                                 0                                 0 
##                  Leg room service                  Baggage handling 
##                                 0                                 0 
##                   Checkin service                  Inflight service 
##                                 0                                 0 
##                       Cleanliness        Departure Delay in Minutes 
##                                 0                                 0 
##          Arrival Delay in Minutes                      satisfaction 
##                                83                                 0
# Check missing data 
colSums(df == 0, na.rm = TRUE)
##                              ...1                            Gender 
##                                 1                                 0 
##                     Customer Type                               Age 
##                                 0                                 0 
##                    Type of Travel                             Class 
##                                 0                                 0 
##                   Flight Distance             Inflight wifi service 
##                                 0                               813 
## Departure/Arrival time convenient            Ease of Online booking 
##                              1381                              1195 
##                     Gate location                    Food and drink 
##                                 0                                25 
##                   Online boarding                      Seat comfort 
##                               652                                 0 
##            Inflight entertainment                  On-board service 
##                                 4                                 2 
##                  Leg room service                  Baggage handling 
##                               126                                 0 
##                   Checkin service                  Inflight service 
##                                 0                                 2 
##                       Cleanliness        Departure Delay in Minutes 
##                                 2                             14688 
##          Arrival Delay in Minutes                      satisfaction 
##                             14594                                 0
# For rating columns, replace 0 with NA
df <- df %>% 
  mutate(across(
    c(`Inflight wifi service`, `Food and drink`, `Seat comfort`, `Inflight entertainment`, `On-board service`), 
    ~na_if(., 0)
  ))

# Capitalization fix 
df <- df %>% 
  mutate(`Customer Type` = recode(`Customer Type`, "disloyal Customer" = "Disloyal Customer"))

# Convert categorical variables 
df <- df %>% 
  mutate(
    Gender = factor(Gender), 
    `Customer Type` = factor(`Customer Type`), 
    `Type of Travel` = factor(`Type of Travel`), 
    Class = factor(Class), 
    satisfaction = factor(satisfaction)
  )

# Impute row with missing value (use median)
df_clean <- df %>% 
  mutate(across(where(is.numeric), ~ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Check for duplicated data
sum(duplicated(df_clean))
## [1] 0
# Check outliers for Arrival Delay
summary(df_clean$`Arrival Delay in Minutes`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   14.69   13.00 1115.00
# Check outliers with box plot
boxplot(df_clean$`Arrival Delay in Minutes`, main = "Arrival Delay Outliers")

boxplot(df_clean$Age, main = "Age Distribution")

# Check consistency 
unique(df_clean$`Customer Type`)
## [1] Loyal Customer    Disloyal Customer
## Levels: Disloyal Customer Loyal Customer
unique(df_clean$Class) 
## [1] Eco      Business Eco Plus
## Levels: Business Eco Eco Plus
unique(df_clean$satisfaction)
## [1] satisfied               neutral or dissatisfied
## Levels: neutral or dissatisfied satisfied
# Check value ranges
summary(df_clean$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   27.00   40.00   39.62   51.00   85.00
summary(df_clean$`Flight Distance`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      31     414     849    1194    1744    4983

Exploratory Data Analysis (EDA)

# Average age
mean(df_clean$Age)
## [1] 39.62096
# Check Satisfaction distribution
table(df_clean$satisfaction)
## 
## neutral or dissatisfied               satisfied 
##                   14573                   11403
prop.table(table(df_clean$satisfaction))
## 
## neutral or dissatisfied               satisfied 
##               0.5610179               0.4389821
ggplot(df_clean, aes(x = satisfaction, fill = satisfaction)) + 
  geom_bar() +
  theme_minimal()

# Check gender distribution
table(df_clean$Gender)
## 
## Female   Male 
##  13172  12804
prop.table(table(df_clean$Gender)) 
## 
##    Female      Male 
## 0.5070835 0.4929165
table(df_clean$Gender, df_clean$satisfaction) 
##         
##          neutral or dissatisfied satisfied
##   Female                    7437      5735
##   Male                      7136      5668
prop.table(table(df_clean$Gender, df_clean$satisfaction), 1)
##         
##          neutral or dissatisfied satisfied
##   Female               0.5646067 0.4353933
##   Male                 0.5573258 0.4426742
# Plot histogram by Age
ggplot(df_clean, aes(x = Age)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "white"
  ) +
  labs(
    title = "Distribution of Passenger Age",
    x = "Age",
    y = "Frequency"
  )

# Plot histogram by Flight Distance
ggplot(df_clean, aes(x = `Flight Distance`)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "white"
  ) +
  labs(
    title = "Distribution of Flight Distance",
    x = "Flight Distance",
    y = "Frequency"
  )

# Plot histogram by Departure Delay
ggplot(df_clean, aes(x = `Departure Delay in Minutes`)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "white"
  ) +
  labs(
    title = "Distribution of Departure Delay",
    x = "Departure Delay (Minutes)",
    y = "Frequency"
  ) 

# Plot satisfaction by Gender
ggplot(df_clean, aes(x = Gender, fill = satisfaction)) + 
  geom_bar() + 
  labs(title = "Satisfaction by Gender") +
  theme_minimal()

# Plot satisfaction by Age
ggplot(df_clean, aes(x = satisfaction, y = Age)) +
  geom_boxplot() + 
  labs(title = "Satisfaction by Age")

# Plot satisfaction by Class
ggplot(df_clean, aes(x = Class, fill = satisfaction)) +
  geom_bar() + 
  labs(title = "Satisfaction by Class")

# Plot satisfaction by Class, split side-by-side by Gender
ggplot(df_clean, aes(x = Class, fill = satisfaction)) + 
  geom_bar(position = "fill") + 
  facet_wrap(~Gender)

# Plot satisfaction by Travel Type
ggplot(df_clean, aes(x = `Type of Travel`, fill = satisfaction)) +
  geom_bar(position = "fill")  + 
  labs(title = "Satisfaction by Travel Type")

# Check for correlations 
cor_matrix <- cor(df_clean %>% select(where(is.numeric))) 
corrplot(cor_matrix)

# Check on p-value on the gender (p > 0.05, no strong evidence of bias)
chisq.test(table(df_clean$Gender, df_clean$satisfaction))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(df_clean$Gender, df_clean$satisfaction)
## X-squared = 1.3682, df = 1, p-value = 0.2421

Feature Engineering

# Create total delay column 
df_clean <- df_clean %>% 
  mutate(total_delay = `Departure Delay in Minutes` + `Arrival Delay in Minutes`)

# Create Delay Category
df_clean <- df_clean %>% 
  mutate(delay_category = case_when(
    total_delay == 0 ~ "No Delay",
    total_delay <= 30 ~ "Minor Delay",
    total_delay <= 120 ~ "Moderate Delay",
    TRUE ~ "Severe Delay"
  ))

# Create Age Group
df_clean <- df_clean %>%
  mutate(
    age_group = case_when(
      Age < 25 ~ "Young",
      Age < 45 ~ "Adult",
      Age < 60 ~ "Middle Age",
      TRUE ~ "Senior"
    )
  )

# Create Overall Service Score
df_clean <- df_clean %>%
  mutate(
    service_score =
      (`Inflight wifi service` +
       `Food and drink` +
       `Seat comfort` +
       `Inflight entertainment` +
       `On-board service` +
        `Leg room service` +
         `Baggage handling` +
         `Checkin service` +
         `Inflight service` +
         `Cleanliness`) / 11)

# Create Flight distance category
df_clean <- df_clean %>%
  mutate(
    flight_type = case_when(
      `Flight Distance` < 1000 ~ "Short",
      `Flight Distance` < 3000 ~ "Medium",
      TRUE ~ "Long"
    )
  )

# Create service delay ratio 
df_clean <- df_clean %>%
 mutate(service_delay_ratio = service_score/(total_delay+1))

# Check new columns
df_clean %>% 
  select(total_delay, delay_category, age_group, service_score, flight_type, service_delay_ratio) %>% 
  head(n = 10)
## # A tibble: 10 × 6
##    total_delay delay_category age_group  service_score flight_type
##          <dbl> <chr>          <chr>              <dbl> <chr>      
##  1          94 Moderate Delay Middle Age          3.91 Short      
##  2           0 No Delay       Adult               3.55 Medium     
##  3           0 No Delay       Young               2    Short      
##  4           6 Minor Delay    Adult               2    Long       
##  5          20 Minor Delay    Middle Age          2.36 Medium     
##  6           0 No Delay       Young               2.91 Short      
##  7           0 No Delay       Senior              4.09 Long       
##  8         142 Severe Delay   Adult               3.55 Medium     
##  9           1 Minor Delay    Middle Age          3.64 Short      
## 10          42 Moderate Delay Middle Age          3.45 Medium     
## # ℹ 1 more variable: service_delay_ratio <dbl>
# Convert categorical variables 
df_clean <- df_clean %>%
  mutate(
    delay_category = factor(delay_category),
    age_group = factor(age_group),
    flight_type = factor(flight_type)
  )
# Save the final dataset to a new CSV file 
write_csv(df_clean, "airplane_clean.csv")

Feature Engineering Summary

EDA on Engineered Features

# Check relationships
df_clean %>%
  group_by(satisfaction) %>%
  summarise(
    avg_service = mean(service_score),
    avg_delay = mean(total_delay)
  )
## # A tibble: 2 × 3
##   satisfaction            avg_service avg_delay
##   <fct>                         <dbl>     <dbl>
## 1 neutral or dissatisfied        2.73      32.7
## 2 satisfied                      3.44      24.2
# Plot distribution by total_delay
ggplot(df_clean, aes(x = total_delay)) +
  geom_histogram(bins = 30) + 
  labs(title = "Total Delay Distribution")

# Plot satisfaction by total_delay
ggplot(df_clean, aes(x = satisfaction, y = total_delay)) +
  geom_boxplot() + 
  labs(title = "Satisfaction by Total Delay")

# Plot Overall Service Score vs. Flight Type vs. satisfaction
ggplot(df_clean, aes(x = flight_type, y = service_score, fill = satisfaction)) +
  geom_boxplot() +
  labs(
    title = "Service Scores by Flight Distance Category",
    x = "Flight Distance Type",
    y = "Average Service Score (1-5)"
  )

# Plot Delay Categories vs. Age Groups
ggplot(df_clean, aes(x = age_group, fill = delay_category)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Delay Categories Faced Across Age Groups",
    x = "Age Group",
    y = "Proportion of Flights"
  ) 

# Plot satisfaction by Delay Categories
ggplot(df_clean, aes(x = delay_category, fill = satisfaction)) +
  geom_bar(position = "dodge") + 
  labs(
    title = "Passenger Satisfaction Levels by Delay Severity",
    x = "Delay Category",
    y = "Number of Passengers"
  )

# Plot satisfaction by flight_type
ggplot(df_clean, aes(x = flight_type, fill = satisfaction)) +
  geom_bar(position = "dodge") + 
  labs(
    title = "Passenger Satisfaction Levels by flight_type",
    x = "flight_type",
    y = "Number of Passengers"
  )

EDA Observations

Airline Passenger Satisfaction observation by:

Summary

Based on the EDA and Feature Engineering findings,

the strongest candidate predictors are: - service_score, Class, Type of Travel, total_delay, delay_category

Variables may provide additional predictive values: - flight_type, age_group, ervice_delay_ratio

Feature Selection

# Correlation analysis
cor_matrix <- cor(
  df_clean %>% select(where(is.numeric))
)

# Chi-square tests
chisq.test(table(df_clean$Class,
                 df_clean$satisfaction))
## 
##  Pearson's Chi-squared test
## 
## data:  table(df_clean$Class, df_clean$satisfaction)
## X-squared = 6435, df = 2, p-value < 2.2e-16
chisq.test(table(df_clean$`Type of Travel`,
                 df_clean$satisfaction))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(df_clean$`Type of Travel`, df_clean$satisfaction)
## X-squared = 5334.8, df = 1, p-value < 2.2e-16
chisq.test(table(df_clean$delay_category,
                 df_clean$satisfaction))
## 
##  Pearson's Chi-squared test
## 
## data:  table(df_clean$delay_category, df_clean$satisfaction)
## X-squared = 208.15, df = 3, p-value < 2.2e-16

#Modelling: Classification & Regression #Classification Modelling

# Load modelling packages
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
## ✔ broom        1.0.13     ✔ rsample      1.3.2 
## ✔ dials        1.4.3      ✔ tailor       0.1.0 
## ✔ infer        1.1.0      ✔ tune         2.1.0 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.6.0      ✔ workflowsets 1.1.1 
## ✔ purrr        1.2.2      ✔ yardstick    1.4.0 
## ✔ recipes      1.3.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()  masks scales::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(ranger)
library(xgboost)
library(vip)
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
# Set seed so that train-test split and model results are reproducible
set.seed(7004)

# Prepare modelling dataset
model_df <- df_clean %>%
  select(-any_of(c("X", "...1", "id"))) %>%
  mutate(
    satisfaction = factor(
      satisfaction,
      levels = c("satisfied", "neutral or dissatisfied")
    )
  )

# Confirm no missing values remain after cleaning and feature engineering
sum(is.na(model_df))
## [1] 0
# Check final target distribution
table(model_df$satisfaction)
## 
##               satisfied neutral or dissatisfied 
##                   11403                   14573
prop.table(table(model_df$satisfaction))
## 
##               satisfied neutral or dissatisfied 
##               0.4389821               0.5610179
# Create stratified train-test split
data_split <- initial_split(model_df, prop = 0.80, strata = satisfaction)
train_data <- training(data_split)
test_data <- testing(data_split)

# Confirm class balance is similar in train and test sets
prop.table(table(train_data$satisfaction))
## 
##               satisfied neutral or dissatisfied 
##               0.4389798               0.5610202
prop.table(table(test_data$satisfaction))
## 
##               satisfied neutral or dissatisfied 
##               0.4389915               0.5610085
# Recipe for logistic regression: dummy encode categorical variables and normalize numeric variables
log_recipe <- recipe(satisfaction ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Recipe for tree-based models: dummy encode categorical variables, no normalization needed
tree_recipe <- recipe(satisfaction ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

# Model 1: Logistic Regression
log_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

log_workflow <- workflow() %>%
  add_recipe(log_recipe) %>%
  add_model(log_model)

log_fit <- fit(log_workflow, data = train_data)

# Model 2: Random Forest
rf_model <- rand_forest(
  trees = 500,
  mtry = 6,
  min_n = 5
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_workflow <- workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(rf_model)

rf_fit <- fit(rf_workflow, data = train_data)

# Model 3: XGBoost
xgb_model <- boost_tree(
  trees = 300,
  tree_depth = 4,
  learn_rate = 0.05,
  loss_reduction = 0,
  min_n = 10
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_workflow <- workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(xgb_model)

xgb_fit <- fit(xgb_workflow, data = train_data)

# Function to create class predictions, probabilities, and metrics
classification_metrics <- metric_set(
  accuracy,
  precision,
  recall,
  f_meas,
  roc_auc
)

evaluate_model <- function(fitted_model, model_name) {
  predictions <- predict(fitted_model, test_data, type = "class") %>%
    bind_cols(predict(fitted_model, test_data, type = "prob")) %>%
    bind_cols(test_data %>% select(satisfaction))

  metrics <- predictions %>%
    classification_metrics(
      truth = satisfaction,
      estimate = .pred_class,
      .pred_satisfied,
      event_level = "first"
    ) %>%
    mutate(model = model_name)

  list(predictions = predictions, metrics = metrics)
}

log_results <- evaluate_model(log_fit, "Logistic Regression")
rf_results <- evaluate_model(rf_fit, "Random Forest")
xgb_results <- evaluate_model(xgb_fit, "XGBoost")

# Compare model performance
model_comparison <- bind_rows(
  log_results$metrics,
  rf_results$metrics,
  xgb_results$metrics
) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  arrange(desc(roc_auc))

model_comparison
## # A tibble: 3 × 6
##   model               accuracy precision recall f_meas roc_auc
##   <chr>                  <dbl>     <dbl>  <dbl>  <dbl>   <dbl>
## 1 XGBoost                0.952     0.959  0.930  0.944   0.991
## 2 Random Forest          0.952     0.960  0.929  0.944   0.990
## 3 Logistic Regression    0.888     0.880  0.862  0.871   0.949
# Confusion matrices
conf_mat(log_results$predictions, truth = satisfaction, estimate = .pred_class)
##                          Truth
## Prediction                satisfied neutral or dissatisfied
##   satisfied                    1966                     267
##   neutral or dissatisfied       315                    2648
conf_mat(rf_results$predictions, truth = satisfaction, estimate = .pred_class)
##                          Truth
## Prediction                satisfied neutral or dissatisfied
##   satisfied                    2119                      89
##   neutral or dissatisfied       162                    2826
conf_mat(xgb_results$predictions, truth = satisfaction, estimate = .pred_class)
##                          Truth
## Prediction                satisfied neutral or dissatisfied
##   satisfied                    2121                      90
##   neutral or dissatisfied       160                    2825
# ROC curves for visual comparison
bind_rows(
  log_results$predictions %>% mutate(model = "Logistic Regression"),
  rf_results$predictions %>% mutate(model = "Random Forest"),
  xgb_results$predictions %>% mutate(model = "XGBoost")
) %>%
  group_by(model) %>%
  roc_curve(truth = satisfaction, .pred_satisfied, event_level = "first") %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_path(linewidth = 1) +
  geom_abline(linetype = "dashed", color = "gray50") +
  labs(
    title = "ROC Curve Comparison",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Model"
  ) +
  theme_minimal()

write_csv(model_comparison, "model_comparison_results.csv")

# Variable importance for Random Forest
rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 15) +
  labs(title = "Top 15 Important Predictors - Random Forest")

# Variable importance for XGBoost
xgb_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 15) +
  labs(title = "Top 15 Important Predictors - XGBoost")

Regression Modelling

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:rsample':
## 
##     calibration
## The following object is masked from 'package:purrr':
## 
##     lift
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ scales::col_factor() masks readr::col_factor()
## ✖ purrr::discard()     masks scales::discard()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ recipes::fixed()     masks stringr::fixed()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ caret::lift()        masks purrr::lift()
## ✖ yardstick::spec()    masks readr::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 5.0
library(rpart)
## 
## Attaching package: 'rpart'
## 
## The following object is masked from 'package:dials':
## 
##     prune
library(ggplot2)

df <- read.csv("airplane_clean.csv")
names(df) <- make.names(names(df), unique = TRUE)

summary(df$Arrival.Delay.in.Minutes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   14.69   13.00 1115.00
hist(df$Arrival.Delay.in.Minutes,
     main = "Distribution of Arrival Delay",
     xlab = "Arrival Delay in Minutes")

drop_cols <- c(
  "X...1",
  "Arrival.Delay.in.Minutes",
  "total_delay",
  "delay_category",
  "service_delay_ratio",
  "satisfaction"
)

X <- df %>%
  select(-any_of(drop_cols))

y <- df$Arrival.Delay.in.Minutes

model_data <- X %>%
  mutate(Arrival.Delay.in.Minutes = y)

model_data <- model_data %>%
  mutate_if(is.character, as.factor)

#split data into training and testing
set.seed(42)
train_index <- createDataPartition(
  model_data$Arrival.Delay.in.Minutes,
  p = 0.8,
  list = FALSE
)

train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]

#set cross validation
control <- trainControl(
  method = "cv",
  number = 3
)

#train linreg
set.seed(42)
model_lm <- train(
  Arrival.Delay.in.Minutes ~ .,
  data = train_data,
  method = "lm",
  trControl = control
)

#predict
pred_lm <- predict(model_lm, newdata = test_data)

#evaluate
eval_lm <- postResample(
  pred = pred_lm,
  obs = test_data$Arrival.Delay.in.Minutes
)
eval_lm
##       RMSE   Rsquared        MAE 
## 11.4924016  0.9014878  5.4419250
#train ridgereg
set.seed(42)
model_ridge <- train(
  Arrival.Delay.in.Minutes ~ .,
  data = train_data,
  method = "glmnet",
  trControl = control,
  tuneGrid = expand.grid(
    alpha = 0,
    lambda = seq(0.001, 1, length = 20)
  )
)

#predict
pred_ridge <- predict(model_ridge, newdata = test_data)

#evaluate
eval_ridge <- postResample(
  pred = pred_ridge,
  obs = test_data$Arrival.Delay.in.Minutes
)
eval_ridge
##      RMSE  Rsquared       MAE 
## 11.704873  0.901501  6.082836
#train decision tree
set.seed(42)
model_tree <- train(
  Arrival.Delay.in.Minutes ~ .,
  data = train_data,
  method = "rpart",
  trControl = control
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
#predict
pred_tree <- predict(model_tree, newdata = test_data)

#evaluate
eval_tree <- postResample(
  pred = pred_tree,
  obs = test_data$Arrival.Delay.in.Minutes
)
eval_tree
##       RMSE   Rsquared        MAE 
## 21.8542926  0.6427662 13.1868888
#compare all results
results <- data.frame(
  Model = c(
    "Linear Regression",
    "Ridge Regression",
    "Decision Tree"
  ),
  
  RMSE = c(
    RMSE(pred_lm, test_data$Arrival.Delay.in.Minutes),
    RMSE(pred_ridge, test_data$Arrival.Delay.in.Minutes),
    RMSE(pred_tree, test_data$Arrival.Delay.in.Minutes)
  ),
  
  MAE = c(
    MAE(pred_lm, test_data$Arrival.Delay.in.Minutes),
    MAE(pred_ridge, test_data$Arrival.Delay.in.Minutes),
    MAE(pred_tree, test_data$Arrival.Delay.in.Minutes)
  ),
  
  R2 = c(
    R2(pred_lm, test_data$Arrival.Delay.in.Minutes),
    R2(pred_ridge, test_data$Arrival.Delay.in.Minutes),
    R2(pred_tree, test_data$Arrival.Delay.in.Minutes)
  )
)

results <- results %>%
  arrange(RMSE)
results
##               Model     RMSE       MAE        R2
## 1 Linear Regression 11.49240  5.441925 0.9014878
## 2  Ridge Regression 11.70487  6.082836 0.9015010
## 3     Decision Tree 21.85429 13.186889 0.6427662
#bestmodel
best_model <- model_lm
best_pred <- pred_lm

#actual vs predicted
comparison <- data.frame(
  Actual = test_data$Arrival.Delay.in.Minutes,
  Predicted = best_pred
)

ggplot(comparison, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.4) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(
    title = "Actual vs Predicted Arrival Delay",
    x = "Actual Arrival Delay in Minutes",
    y = "Predicted Arrival Delay in Minutes"
  )

#residual plot
comparison <- comparison %>%
  mutate(Residual = Actual - Predicted)

ggplot(comparison, aes(x = Predicted, y = Residual)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residual Plot",
    x = "Predicted Arrival Delay",
    y = "Residual"
  )

# Discussion

Classification: Predicting Passenger Satisfaction

Three classification models were trained and evaluated to predict airline passenger satisfaction: Logistic Regression, Random Forest, and XGBoost. The dataset contained a mild class imbalance, with approximately 56.1% of passengers classified as “neutral or dissatisfied” and 43.9% as “satisfied”, which was preserved consistently across the training and test splits through stratified sampling.

The model performance results are summarised in the table below:

Model Accuracy Precision Recall F1-Score ROC-AUC
XGBoost 0.9519 0.9593 0.9299 0.9443 0.9911
Random Forest 0.9517 0.9597 0.9290 0.9441 0.9902
Logistic Regression 0.8880 0.8804 0.8619 0.8711 0.9491

Both XGBoost and Random Forest achieved near-identical and excellent results, each exceeding 95% accuracy and 0.99 ROC-AUC. In contrast, Logistic Regression performed noticeably lower at 88.8% accuracy and 0.949 ROC-AUC. This performance gap highlights that passenger satisfaction is driven by non-linear relationships among features that a simple linear decision boundary cannot adequately capture.

Confusion Matrix Analysis

The confusion matrices revealed distinct error patterns across the three models:

  • Logistic Regression produced the highest number of misclassifications, with 315 false negatives (satisfied passengers incorrectly predicted as dissatisfied) and 267 false positives (dissatisfied passengers predicted as satisfied), reflecting its limited capacity to model complex feature interactions.

  • Random Forest significantly reduced errors, with only 162 false negatives and 89 false positives. The low false positive rate (89) is particularly valuable in an airline context, as it means fewer genuinely dissatisfied passengers would be missed.

  • XGBoost performed marginally better than Random Forest, recording 160 false negatives and 90 false positives — the best overall error balance. The near-equivalence of these two ensemble models suggests the dataset’s predictive signal is largely captured by tree-based methods, and further gains would require additional data or feature engineering.

ROC Curve Comparison

The ROC curve comparison visually confirms the ranking of models. XGBoost and Random Forest curves nearly overlap and rise steeply toward the top-left corner, indicating excellent discrimination between satisfied and dissatisfied passengers across all classification thresholds. The Logistic Regression curve, while well above the diagonal baseline, curves more gradually — particularly in the low false-positive-rate range — indicating reduced sensitivity when strict precision is required.

The XGBoost ROC-AUC of 0.9911 means that in approximately 99.1% of randomly chosen satisfied-vs-dissatisfied passenger pairs, the model correctly ranks the satisfied passenger higher in predicted probability. This represents a strong and practically deployable classifier.

Feature Importance

The variable importance plots from both Random Forest and XGBoost consistently identified the same top predictors, lending strong confidence to these findings:

  1. Online Boarding emerged as the single most important predictor in both models. This highlights that the digital pre-flight experience — particularly the ease and quality of online check-in and boarding — is the most decisive factor in shaping overall passenger satisfaction. Airlines should treat their digital boarding experience as a primary service priority.

  2. Inflight WiFi Service and Type of Travel (Personal Travel) ranked second and third respectively. High-speed, reliable in-flight connectivity is no longer a luxury but a core satisfaction driver, especially for business travellers. The prominence of travel type confirms that business and personal travellers have fundamentally different expectations and satisfaction thresholds.

  3. Service Score — the engineered composite feature aggregating 10 individual service ratings — ranked fourth, validating the feature engineering approach taken. Its high importance confirms that overall service quality holistically influences satisfaction, beyond any single service dimension.

  4. Class (Economy) was the fifth most important feature, consistent with the EDA finding that Economy class passengers report markedly lower satisfaction rates. The negative direction of this effect (Economy class → lower satisfaction) suggests that while class upgrades are not feasible for all passengers, Economy-specific service improvements could yield meaningful satisfaction gains.

  5. Mid-tier predictors including Ease of Online Booking, Inflight Entertainment, Seat Comfort, and Flight Distance confirm that both physical comfort and digital touchpoints contribute meaningfully to satisfaction beyond the top predictors.

Research Question: What factors lead to customer satisfaction and prediction of air passenger satisfaction?

Answer:

Based on the feature importance results from both ensemble models, the most significant predictors of passenger satisfaction are Online Boarding experience, Inflight WiFi quality, Type of Travel, overall service score, and cabin class. Digital service touchpoints (online boarding and WiFi) outweigh physical comfort factors, suggesting that modern airline passengers place increasing value on seamless technology integration throughout their journey. Loyal customers and business travellers are more likely to report satisfaction, while personal travellers in Economy class represent the highest-risk dissatisfaction group.

Regression: Predicting Arrival Delay

Three regression models were trained to predict arrival delay in minutes: Linear Regression, Ridge Regression, and Decision Tree Regression. Arrival delay was selected as the regression target as it represents a measurable operational outcome with direct impact on passenger experience. The arrival delay distribution (mean = 14.69 minutes, median = 0, max = 1,115 minutes) is heavily right-skewed, with the vast majority of flights experiencing zero or minimal delay, and a small proportion experiencing extreme delays exceeding 200 minutes.

Model RMSE MAE
Linear Regression 11.4924 5.4419 0.9015
Ridge Regression 11.7049 6.0828 0.9015
Decision Tree 21.8543 13.1869 0.6428

Model Comparison

Linear Regression achieved the best overall performance with RMSE of 11.49 minutes, MAE of 5.44 minutes, and R² of 0.9015. The high R² of approximately 0.90 is a notably strong result, indicating that 90% of the variance in arrival delay can be explained by the passenger and flight features in this dataset. This is a surprisingly strong result, likely attributable to the strong linear relationship between departure delay (a feature in the dataset) and arrival delay — delays that occur at departure tend to propagate directly through to arrival.

Ridge Regression produced near-identical results (RMSE = 11.70, R² = 0.9015), confirming that multicollinearity among predictors is not a meaningful concern in this dataset. The L2 regularisation penalty provided no practical benefit over standard linear regression, as the features are not excessively correlated with one another to a degree that would destabilise the coefficient estimates.

Decision Tree Regression performed substantially worse, with RMSE of 21.85 minutes, MAE of 13.19 minutes, and R² of 0.643. Despite its ability to model non-linear relationships, the decision tree overfitted during training and failed to generalise as well to the test set. The much larger RMSE indicates it was particularly poor at handling the extreme delay outliers present in the dataset — a common weakness of shallow decision trees on skewed targets. A warning about missing values in resampled performance during cross-validation further suggests instability in this model during training.

Residual Analysis

The actual vs. predicted plot for the best model (Linear Regression) shows strong alignment with the diagonal line across the full range of delay values, from 0 to over 800 minutes. This confirms that the model generalises well across both low-delay and high-delay cases. However, a dense cluster of points at the lower-left corner (0–50 minutes) reflects the heavily right-skewed target distribution — the model sees far more zero-delay examples during training than extreme delay cases.

The residual plot reveals a fan-shaped pattern: residuals are tightly concentrated near zero for low predicted delays but spread outward as predicted delay increases. This heteroscedasticity is expected given the skewed delay distribution, where extreme delays are rare and driven by exceptional operational events. Several isolated large negative residuals (as low as −330 minutes) represent cases where the model over-predicted delays that did not materialise — likely flights where scheduled recovery actions reduced delays more than expected. These outlier residuals do not undermine the overall model, but they highlight that extreme delay scenarios remain inherently difficult to predict precisely.

Research Question:Can passenger and flight attributes predict the arrival delay experienced?

Answer:

Yes, with strong predictive accuracy. Linear Regression achieved an R² of 0.90, meaning the available features explain approximately 90% of the variance in arrival delay. The primary driver of this strong performance is the inclusion of Departure Delay in Minutes as a predictor, which shares a direct causal relationship with arrival delay. The model’s MAE of 5.44 minutes means predictions are typically within about 5 minutes of the actual arrival delay, which is operationally useful for passenger communication and crew scheduling. Linear Regression is recommended as the production model for this task, given its superior accuracy, simpler interpretability, and near-identical performance to the more complex Ridge Regression.


Recommendations

Based on the findings from both the classification and regression analyses, the following recommendations are proposed for airline stakeholders:

  1. Prioritise digital service improvements: Online boarding and inflight WiFi are the strongest predictors of satisfaction. Investment in seamless mobile boarding technology and reliable high-speed in-flight connectivity offers the highest return in satisfaction improvement.

  2. Differentiate service strategy by travel type: Personal travellers are significantly less satisfied than business travellers. Tailored service offerings — such as family-friendly amenities or leisure-focused entertainment — could address this gap.

  3. Focus Economy class service quality: Economy class passengers report the lowest satisfaction levels. Targeted improvements in seat comfort, cleanliness, and on-board service within Economy could meaningfully shift satisfaction rates without requiring full product overhauls.

  4. Deploy the classifier for proactive service recovery: The XGBoost model, with 95.2% accuracy and 0.991 ROC-AUC, is sufficiently accurate to be deployed at check-in or boarding to flag passengers predicted to be at risk of dissatisfaction. Front-line staff can then prioritise proactive service recovery for these passengers.

  5. Deploy the Linear Regression delay model for operational use: With an R² of 0.90 and MAE of approximately 5 minutes, the Linear Regression model is sufficiently accurate for real-time arrival delay estimation. It can be integrated into passenger notification systems to provide early, data-driven delay alerts. Future work may further improve accuracy by incorporating operational variables such as weather conditions, aircraft type, and historical route-level delay patterns.

Conclusion

This project successfully analysed airline passenger satisfaction using a complete data science workflow, including data preparation, feature engineering, exploratory analysis, and machine learning modelling. All project objectives were achieved: identifying the most important satisfaction factors, developing effective classification and regression models, and comparing model performance to select the best approach. The findings of this project carry practical implications for airline service management. Airlines seeking to improve passenger satisfaction should prioritise investment in digital service infrastructure, particularly seamless online boarding systems and reliable in-flight WiFi as these features exert the greatest influence on how passengers rate their overall experience. Economy class passengers and personal travellers represent the highest-risk dissatisfaction segments, and targeted service improvements for these groups could yield meaningful gains in overall satisfaction rates. The XGBoost classifier, with near 95% accuracy, is sufficiently robust to be considered for real-world deployment in proactive passenger satisfaction management systems. Future work could extend this analysis by incorporating larger and more diverse airline datasets to improve generalisability, applying hyperparameter tuning to further optimise model performance, and supplementing the regression model with operational variables such as weather conditions, airport congestion, and aircraft type to build a more complete arrival delay prediction system.