Airline Passenger Satisfaction Data Set

Data Source: Kaggle.com (2025)

Link: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data

1) Raw Data Importation

#train.csv is used because it contains 80% of the full dataset

raw_cloudrate <- read.table("./train.csv", header = TRUE, sep = ",", dec = ".")
head(raw_cloudrate)
##   X     id Gender     Customer.Type Age  Type.of.Travel    Class
## 1 0  70172   Male    Loyal Customer  13 Personal Travel Eco Plus
## 2 1   5047   Male disloyal Customer  25 Business travel Business
## 3 2 110028 Female    Loyal Customer  26 Business travel Business
## 4 3  24026 Female    Loyal Customer  25 Business travel Business
## 5 4 119299   Male    Loyal Customer  61 Business travel Business
## 6 5 111157 Female    Loyal Customer  26 Personal Travel      Eco
##   Flight.Distance Inflight.wifi.service Departure.Arrival.time.convenient
## 1             460                     3                                 4
## 2             235                     3                                 2
## 3            1142                     2                                 2
## 4             562                     2                                 5
## 5             214                     3                                 3
## 6            1180                     3                                 4
##   Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1                      3             1              5               3
## 2                      3             3              1               3
## 3                      2             2              5               5
## 4                      5             5              2               2
## 5                      3             3              4               5
## 6                      2             1              1               2
##   Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1            5                      5                4                3
## 2            1                      1                1                5
## 3            5                      5                4                3
## 4            2                      2                2                5
## 5            5                      3                3                4
## 6            1                      1                3                4
##   Baggage.handling Checkin.service Inflight.service Cleanliness
## 1                4               4                5           5
## 2                3               1                4           1
## 3                4               4                4           5
## 4                3               1                4           2
## 5                4               3                3           3
## 6                4               4                4           1
##   Departure.Delay.in.Minutes Arrival.Delay.in.Minutes            satisfaction
## 1                         25                       18 neutral or dissatisfied
## 2                          1                        6 neutral or dissatisfied
## 3                          0                        0               satisfied
## 4                         11                        9 neutral or dissatisfied
## 5                          0                        0               satisfied
## 6                          0                        0 neutral or dissatisfied
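
A quick structural check after import helps confirm that the file was read as intended. The sketch below is self-contained: it writes a tiny stand-in CSV (hypothetical rows and only a few columns, not the Kaggle file) to a temporary path, reads it with the same read.table() arguments as above, and inspects dimensions, column types, and missing values. The object names toy_path and toy are illustrative.

```r
# Toy stand-in for train.csv (hypothetical rows, only a few columns)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",id,Gender,Age,Arrival Delay in Minutes",
  "0,70172,Male,13,18",
  "1,5047,Male,25,",
  "2,110028,Female,26,0"
), toy_path)

# Same arguments as the import above
toy <- read.table(toy_path, header = TRUE, sep = ",", dec = ".")

dim(toy)             # rows and columns actually read
str(toy)             # column types chosen by read.table
colSums(is.na(toy))  # missing values per column
```

The unnamed first column is read as X, and spaces in column names become dots, which matches the column names seen in the head() output.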

Explanation of Raw Data

Target population: All airline passengers

Sample size: 103,904 passengers who took part in the airline passenger satisfaction survey

Unit of observation: One passenger who participated in the survey (one row per passenger)

Number of variables: 25 (23 after cleaning)

Definition & unit of measurement of all variables

  • id (categorical - nominal): Customer unique identification number

  • Gender (categorical - nominal): Gender of the passengers - “Female” or “Male”

  • Customer.Type (categorical - nominal): Customer type - “Loyal Customer” or “disloyal Customer”

  • Age (numerical - ratio): The actual age of the passengers

  • Type.of.Travel (categorical - nominal): Purpose of the passenger's flight - “Personal Travel” or “Business travel”

  • Class (categorical - ordinal): Travel class in the plane of the passengers - “Business”, “Eco”, “Eco Plus”

  • Flight.Distance (numerical - ratio): The distance flown on this journey

  • Inflight.wifi.service (numerical - interval): Satisfaction level of the inflight wifi service (0:Not Applicable; 1-5: Satisfaction level)

  • Departure/Arrival time convenient (numerical - interval): Satisfaction level of Departure/Arrival time convenient

  • Ease of Online booking (numerical - interval): Satisfaction level of online booking

  • Gate location (numerical - interval): Satisfaction level of Gate location

  • Food and drink (numerical - interval): Satisfaction level of Food and drink

  • Online boarding (numerical - interval): Satisfaction level of online boarding

  • Seat comfort (numerical - interval): Satisfaction level of Seat comfort

  • Inflight entertainment (numerical - interval): Satisfaction level of inflight entertainment

  • On-board service (numerical - interval): Satisfaction level of On-board service

  • Leg room service (numerical - interval): Satisfaction level of Leg room service

  • Baggage handling (numerical - interval): Satisfaction level of baggage handling

  • Check-in service (numerical - interval): Satisfaction level of Check-in service

  • Inflight service (numerical - interval): Satisfaction level of inflight service

  • Cleanliness (numerical - interval): Satisfaction level of Cleanliness

  • Departure Delay in Minutes (numerical - ratio): Departure delay, in minutes

  • Arrival Delay in Minutes (numerical - ratio): Arrival delay, in minutes

  • Satisfaction (categorical - ordinal): Airline satisfaction level - “satisfied” or “neutral or dissatisfied”

2) Data Manipulation / Cleaning

#install.packages("dplyr")
#install.packages("tidyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
  • Initial cleaning of raw data
raw_cloudrate <- raw_cloudrate %>% 
                 select(-1, -Customer.Type) #Remove the first column (serial number) and Customer.Type column
  • Cleaning observations
raw_cloudrate <- raw_cloudrate %>% drop_na() #Drop all observations with NA values in their records
raw_cloudrate$id <- sprintf("%06d", raw_cloudrate$id) #Zero-pad ID values to a consistent 6 digits
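
Before dropping rows, it can be useful to record how many rows are affected and which columns contain the gaps. The sketch below uses a toy data frame with hypothetical values and base R equivalents of the steps above (complete.cases() in place of tidyr::drop_na()); the names toy, na_per_column, rows_with_na, and cleaned are illustrative.

```r
# Toy data frame with hypothetical values; NA marks a missing arrival delay
toy <- data.frame(
  id        = c(70172L, 5047L, 110028L, 24026L),
  dep_delay = c(25L, 1L, 0L, 11L),
  arr_delay = c(18, NA, 0, 9)
)

na_per_column <- colSums(is.na(toy))        # which columns contain gaps
rows_with_na  <- sum(!complete.cases(toy))  # how many rows would be dropped

cleaned <- toy[complete.cases(toy), ]       # base R equivalent of drop_na()
cleaned$id <- sprintf("%06d", cleaned$id)   # zero-pad IDs to 6 characters

na_per_column
rows_with_na
cleaned$id
```
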
  • Rename variables
raw_cloudrate <- raw_cloudrate %>%
                 rename (cust_id = id, 
                         gender = Gender,
                         age = Age,
                         travel_type = Type.of.Travel,
                         class = Class,
                         flight_dist = Flight.Distance,
                         plane_wifi = Inflight.wifi.service,
                         dep_arr_conv = Departure.Arrival.time.convenient,
                         online_book = Ease.of.Online.booking,
                         gate_loc = Gate.location,
                         food_drink = Food.and.drink,
                         online_board = Online.boarding,
                         seat_comf = Seat.comfort,
                         plane_ent = Inflight.entertainment,
                         onboard_srv = On.board.service,
                         legroom = Leg.room.service,
                         baggage = Baggage.handling,
                         checkin_srv = Checkin.service,
                         plane_srv = Inflight.service,
                         clean = Cleanliness,
                         dep_delay = Departure.Delay.in.Minutes,
                         arr_delay = Arrival.Delay.in.Minutes,
                         overall_sat = satisfaction)
  • Converting categorical variables into factor variables
raw_cloudrate <- raw_cloudrate %>%
  mutate(
    gender = factor(gender),
    travel_type = factor(travel_type),
    class = factor(class),
    overall_sat = factor(overall_sat, levels = c("neutral or dissatisfied", "satisfied"))
  )
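
Because Class is described above as ordinal, one option is to store it as an ordered factor so that tables and plots follow the cabin tiers rather than alphabetical order. The sketch below uses hypothetical values; the ordering Eco < Eco Plus < Business is an assumption about cabin tiers, not something the dataset states.

```r
# Hypothetical values using the three class labels from the data
class_raw <- c("Eco Plus", "Business", "Eco", "Business")

# Assumed ordering of cabin tiers: Eco < Eco Plus < Business
class_ord <- factor(class_raw,
                    levels  = c("Eco", "Eco Plus", "Business"),
                    ordered = TRUE)

levels(class_ord)        # level order used for tables and plots
class_ord >= "Eco Plus"  # comparisons are defined for ordered factors
table(class_ord)
```
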
  • Create new data.frame based on conditions
#By travel class
business <- raw_cloudrate %>% filter(class == "Business")
economy_combined <- raw_cloudrate %>% filter(class %in% c("Eco", "Eco Plus"))

#By delays
delayed_flights <- raw_cloudrate %>% filter(dep_delay > 0 | arr_delay > 0)
ontime_flights <- raw_cloudrate %>% filter(dep_delay == 0 & arr_delay == 0)

#By long or short-haul flights
long_flights <- raw_cloudrate %>% filter(flight_dist > 1500)
short_flights <- raw_cloudrate %>% filter(flight_dist <= 1500)

#By age group (aged 50 and over; aged 30 and under)
above_50 <- raw_cloudrate %>% filter(age >= 50)
below_30 <- raw_cloudrate %>% filter(age <= 30)

#By inflight services
inflight_srv_ratings <- raw_cloudrate %>% select(cust_id, starts_with("plane"), onboard_srv, seat_comf, legroom)
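
The two age subsets above leave out passengers aged 31 to 49. If a grouping that covers every passenger is wanted, a single age-band column is one alternative to separate data frames. The sketch below uses hypothetical ages; the band boundaries and labels are illustrative assumptions chosen to line up with the 30 and 50 cutoffs.

```r
# Hypothetical ages spanning the observed range (7 to 85)
age <- c(7, 25, 30, 31, 49, 50, 85)

# Assumed bands: 30 and under, 31 to 49, 50 and over
# cut() uses right-closed intervals by default: (-Inf, 30], (30, 49], (49, Inf)
age_group <- cut(age,
                 breaks = c(-Inf, 30, 49, Inf),
                 labels = c("30_or_below", "31_to_49", "50_or_above"))

table(age_group)
```
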
  • Create new variables/columns
# Mean rating for inflight services
inflight_srv_ratings <- inflight_srv_ratings %>%
  mutate(avg_score = round(rowMeans(select(., -cust_id), na.rm = TRUE), 2))
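
The variable definitions note that a rating of 0 means "Not Applicable" rather than a low score. If such zeros should not pull the average down, one option is to recode them as NA before taking the row mean. The sketch below uses toy ratings (hypothetical values) and base R; whether to exclude zeros is an analysis choice, not something the dataset prescribes.

```r
# Toy ratings (hypothetical); 0 stands for "Not Applicable"
ratings <- data.frame(
  plane_wifi = c(3, 0, 5),
  plane_ent  = c(5, 4, 5),
  plane_srv  = c(4, 4, 0)
)

ratings_na <- ratings
ratings_na[ratings_na == 0] <- NA  # recode "Not Applicable" as missing

avg_with_zeros    <- round(rowMeans(ratings), 2)
avg_without_zeros <- round(rowMeans(ratings_na, na.rm = TRUE), 2)

data.frame(avg_with_zeros, avg_without_zeros)
```
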

3) Descriptive Statistics

#install.packages("psych")
library(psych)

#Overall summary of the raw database (after data manipulation)
summary(raw_cloudrate)
##    cust_id             gender           age                 travel_type   
##  Length:103594      Female:52576   Min.   : 7.00   Business travel:71465  
##  Class :character   Male  :51018   1st Qu.:27.00   Personal Travel:32129  
##  Mode  :character                  Median :40.00                          
##                                    Mean   :39.38                          
##                                    3rd Qu.:51.00                          
##                                    Max.   :85.00                          
##       class        flight_dist     plane_wifi    dep_arr_conv   online_book   
##  Business:49533   Min.   :  31   Min.   :0.00   Min.   :0.00   Min.   :0.000  
##  Eco     :46593   1st Qu.: 414   1st Qu.:2.00   1st Qu.:2.00   1st Qu.:2.000  
##  Eco Plus: 7468   Median : 842   Median :3.00   Median :3.00   Median :3.000  
##                   Mean   :1189   Mean   :2.73   Mean   :3.06   Mean   :2.757  
##                   3rd Qu.:1743   3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:4.000  
##                   Max.   :4983   Max.   :5.00   Max.   :5.00   Max.   :5.000  
##     gate_loc       food_drink     online_board    seat_comf      plane_ent    
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :0.00   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.00   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.00   Median :4.00   Median :4.000  
##  Mean   :2.977   Mean   :3.202   Mean   :3.25   Mean   :3.44   Mean   :3.358  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.00   3rd Qu.:5.00   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.00   Max.   :5.000  
##   onboard_srv       legroom         baggage       checkin_srv   
##  Min.   :0.000   Min.   :0.000   Min.   :1.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :4.000   Median :4.000   Median :4.000   Median :3.000  
##  Mean   :3.383   Mean   :3.351   Mean   :3.632   Mean   :3.304  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##    plane_srv         clean         dep_delay         arr_delay      
##  Min.   :0.000   Min.   :0.000   Min.   :   0.00   Min.   :   0.00  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:   0.00   1st Qu.:   0.00  
##  Median :4.000   Median :3.000   Median :   0.00   Median :   0.00  
##  Mean   :3.641   Mean   :3.286   Mean   :  14.75   Mean   :  15.18  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:  12.00   3rd Qu.:  13.00  
##  Max.   :5.000   Max.   :5.000   Max.   :1592.00   Max.   :1584.00  
##                   overall_sat   
##  neutral or dissatisfied:58697  
##  satisfied              :44897  
##                                 
##                                 
##                                 
## 
#Selected numerical variables
raw_cloudrate %>%
  select(age, flight_dist, dep_delay, arr_delay, plane_wifi, food_drink, seat_comf, checkin_srv) %>%
  describe() #describe() is used because no grouping variable is needed
##             vars      n    mean     sd median trimmed    mad min  max range
## age            1 103594   39.38  15.11     40   39.40  17.79   7   85    78
## flight_dist    2 103594 1189.33 997.30    842 1042.61 766.50  31 4983  4952
## dep_delay      3 103594   14.75  38.12      0    5.79   0.00   0 1592  1592
## arr_delay      4 103594   15.18  38.70      0    6.09   0.00   0 1584  1584
## plane_wifi     5 103594    2.73   1.33      3    2.70   1.48   0    5     5
## food_drink     6 103594    3.20   1.33      3    3.25   1.48   0    5     5
## seat_comf      7 103594    3.44   1.32      4    3.55   1.48   0    5     5
## checkin_srv    8 103594    3.30   1.27      3    3.38   1.48   0    5     5
##              skew kurtosis   se
## age          0.00    -0.72 0.05
## flight_dist  1.11     0.27 3.10
## dep_delay    6.77   101.46 0.12
## arr_delay    6.60    94.53 0.12
## plane_wifi   0.04    -0.85 0.00
## food_drink  -0.15    -1.15 0.00
## seat_comf   -0.48    -0.92 0.00
## checkin_srv -0.37    -0.83 0.00

Explanation of Some Numerical Variables:

  • age (median): Half of the sampled passengers are aged 40 or below, and the other half are aged 40 or above.
  • dep_delay (mean): On average, passengers experienced a departure delay of 14.75 minutes.
  • arr_delay (mean): On average, passengers arrived 15.18 minutes later than scheduled.
  • flight_dist (sd): Flight distances vary widely, with a standard deviation of 997.30 (in the dataset's distance units). The median (842) is lower than the mean (1189.33), which indicates a right-skewed distribution: many flights are short, while a smaller number of long-distance flights pull the mean upward.
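
The mean-versus-median reasoning used for flight_dist can be checked directly. The toy vector below (hypothetical delays, not taken from the dataset) shows how a few large values pull the mean above the median, which is also why the median and IQR are often reported next to the mean for delay variables. The skewness formula is a simple moment-based version and may differ slightly from the one psych uses.

```r
# Hypothetical delays in minutes: mostly zero, with a few long delays
delay <- c(0, 0, 0, 0, 0, 5, 10, 15, 60, 300)

mean(delay)    # pulled upward by the two large values
median(delay)  # unaffected by how large the extremes are
IQR(delay)     # spread of the middle half of the data

# Simple moment-based skewness (positive means a longer right tail)
skew <- mean((delay - mean(delay))^3) / sd(delay)^3
skew
```
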
# Categorical data
prop.table(table(raw_cloudrate$gender, raw_cloudrate$overall_sat), margin = 1)
##         
##          neutral or dissatisfied satisfied
##   Female               0.5726377 0.4273623
##   Male                 0.5603905 0.4396095

Explanation of Cross-Tabulation Table

This table shows the share of satisfied and neutral-or-dissatisfied passengers within each gender. The shares are very similar (about 42.7% of female and 44.0% of male passengers are satisfied), which suggests that gender is not a major differentiator in passenger satisfaction.
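
To put a number on how small the gender gap is, the counts behind the proportions can be tested. The counts below are reconstructed from the row proportions above and the gender totals in the summary output (for example, 0.4273623 × 52,576 ≈ 22,469 satisfied female passengers), so treat them as derived values. Cramér's V is used here as one possible effect-size measure; it is not part of the original analysis.

```r
# Counts reconstructed from the row proportions and gender totals (derived, rounded)
tab <- matrix(c(30107, 22469,
                28590, 22428),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Female", "Male"),
                              overall_sat = c("neutral or dissatisfied", "satisfied")))

row_prop <- prop.table(tab, margin = 1)
gap <- row_prop["Male", "satisfied"] - row_prop["Female", "satisfied"]

test <- chisq.test(tab)                               # test of independence
cramers_v <- sqrt(unname(test$statistic) / sum(tab))  # effect size for a 2x2 table

gap           # difference in satisfied share, Male minus Female
test$p.value
cramers_v
```

With more than 100,000 rows, even a gap of about 1.2 percentage points can be statistically detectable, so the effect size is likely the more informative figure for judging whether gender matters in practice.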

4) Distribution Graphs

# install.packages("ggplot2")
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

Histogram

#For age
ggplot(raw_cloudrate, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "black") +
  labs(title = "Age Distribution", x = "Age", y = "Frequency")

#For flight distance
ggplot(raw_cloudrate, aes(x = flight_dist)) +
  geom_histogram(binwidth = 100, fill = "lightgreen", color = "black") +
  labs(title = "Flight Distance Distribution", x = "Flight Distance", y = "Frequency")

Explanation of Histogram

  • Age Distribution: The age distribution of passengers is roughly symmetric, consistent with the skew of 0.00 reported in the descriptive statistics, although it is flatter than a normal distribution (kurtosis of -0.72). Most passengers are young to middle-aged adults, from their 20s to their 50s. This aligns with common airline passenger demographics, where working-age individuals and young adults travel more frequently than children, teenagers, and the elderly.

  • Flight Distance Distribution: The distribution is right-skewed, so most passengers in the sample flew short-haul. The long right tail comes from a smaller number of long-haul flights, which stretch the range up to 4,983.
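
A statement such as "most flights are short" can be quantified alongside the histogram. The sketch below uses hypothetical distances rather than the dataset; the 1,500 cutoff mirrors the long/short split used in section 2, and share_short is an illustrative name.

```r
# Hypothetical flight distances (same units as flight_dist)
flight_dist <- c(200, 350, 420, 600, 840, 900, 1200, 1750, 2500, 4000)

# Quantiles show where the bulk of the flights lies
quantile(flight_dist, probs = c(0.25, 0.5, 0.75, 0.9))

# Proportion of flights at or below the short-haul cutoff
share_short <- mean(flight_dist <= 1500)
share_short
```
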

Boxplot

ggplot(raw_cloudrate, aes(x = overall_sat, y = age, fill = overall_sat)) +
  geom_boxplot() +
  labs(title = "Distribution of Overall Satisfaction by Age", x = "Overall Satisfaction", y = "Age") +
  theme_minimal() +
  scale_fill_manual(values = c("pink", "lightblue"))

Explanation of Boxplot:

This boxplot compares the age distributions of satisfied and neutral/dissatisfied passengers. The median age of satisfied passengers is higher than that of neutral or dissatisfied passengers, suggesting that older passengers tend to report higher satisfaction. The interquartile range (IQR) for neutral/dissatisfied passengers is slightly larger, which means their ages are more spread out, while the ages of satisfied passengers are more concentrated.
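
The statements about medians and IQRs can also be read off numerically rather than from the plot alone. The sketch below uses toy data (hypothetical ages and labels); on the real data frame the same calls would take raw_cloudrate$age and raw_cloudrate$overall_sat as inputs.

```r
# Toy data (hypothetical ages and satisfaction labels)
toy <- data.frame(
  age = c(22, 35, 41, 58, 63, 30, 44, 47, 52, 39),
  overall_sat = factor(rep(c("neutral or dissatisfied", "satisfied"), each = 5))
)

med_by_group <- tapply(toy$age, toy$overall_sat, median)  # centre line of each box
iqr_by_group <- tapply(toy$age, toy$overall_sat, IQR)     # height of each box

med_by_group
iqr_by_group
```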