Data Source: Kaggle.com (2025)
Link: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data
#train.csv will be used as it is 80% of the full dataset
raw_cloudrate <- read.table("./train.csv", header = TRUE, sep = ",", dec = ".")
head(raw_cloudrate)
## X id Gender Customer.Type Age Type.of.Travel Class
## 1 0 70172 Male Loyal Customer 13 Personal Travel Eco Plus
## 2 1 5047 Male disloyal Customer 25 Business travel Business
## 3 2 110028 Female Loyal Customer 26 Business travel Business
## 4 3 24026 Female Loyal Customer 25 Business travel Business
## 5 4 119299 Male Loyal Customer 61 Business travel Business
## 6 5 111157 Female Loyal Customer 26 Personal Travel Eco
## Flight.Distance Inflight.wifi.service Departure.Arrival.time.convenient
## 1 460 3 4
## 2 235 3 2
## 3 1142 2 2
## 4 562 2 5
## 5 214 3 3
## 6 1180 3 4
## Ease.of.Online.booking Gate.location Food.and.drink Online.boarding
## 1 3 1 5 3
## 2 3 3 1 3
## 3 2 2 5 5
## 4 5 5 2 2
## 5 3 3 4 5
## 6 2 1 1 2
## Seat.comfort Inflight.entertainment On.board.service Leg.room.service
## 1 5 5 4 3
## 2 1 1 1 5
## 3 5 5 4 3
## 4 2 2 2 5
## 5 5 3 3 4
## 6 1 1 3 4
## Baggage.handling Checkin.service Inflight.service Cleanliness
## 1 4 4 5 5
## 2 3 1 4 1
## 3 4 4 4 5
## 4 3 1 4 2
## 5 4 3 3 3
## 6 4 4 4 1
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes satisfaction
## 1 25 18 neutral or dissatisfied
## 2 1 6 neutral or dissatisfied
## 3 0 0 satisfied
## 4 11 9 neutral or dissatisfied
## 5 0 0 satisfied
## 6 0 0 neutral or dissatisfied
Explanation of Raw Data
Target population: All airline passengers
Sample: Passengers who took part in the airline passenger satisfaction survey (n = 103,904 observations)
Unit of observation: One passenger who participated in the survey
Number of variables: 25 (after cleaning = 23 variables)
Definition & unit of measurement of all variables
id (categorical - nominal): Customer unique identification number
Gender (categorical - nominal): Gender of the passengers - “Female” or “Male”
Customer.Type (categorical - nominal): Customer type - “Loyal” or “disloyal” customer
Age (numerical - ratio): The actual age of the passengers
Type.of.Travel (categorical - nominal): Purpose of the passenger's flight - “Personal Travel” or “Business travel”
Class (categorical - ordinal): Travel class in the plane of the passengers - “Business”, “Eco”, “Eco Plus”
Flight.Distance (numerical - ratio): The flight distance of this journey
Inflight.wifi.service (numerical - interval): Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5: Satisfaction level)
Departure.Arrival.time.convenient (numerical - interval): Satisfaction level of the convenience of departure/arrival time
Ease.of.Online.booking (numerical - interval): Satisfaction level of online booking
Gate.location (numerical - interval): Satisfaction level of gate location
Food.and.drink (numerical - interval): Satisfaction level of food and drink
Online.boarding (numerical - interval): Satisfaction level of online boarding
Seat.comfort (numerical - interval): Satisfaction level of seat comfort
Inflight.entertainment (numerical - interval): Satisfaction level of inflight entertainment
On.board.service (numerical - interval): Satisfaction level of on-board service
Leg.room.service (numerical - interval): Satisfaction level of leg room service
Baggage.handling (numerical - interval): Satisfaction level of baggage handling
Checkin.service (numerical - interval): Satisfaction level of check-in service
Inflight.service (numerical - interval): Satisfaction level of inflight service
Cleanliness (numerical - interval): Satisfaction level of cleanliness
Departure.Delay.in.Minutes (numerical - ratio): Departure delay in minutes
Arrival.Delay.in.Minutes (numerical - ratio): Arrival delay in minutes
satisfaction (categorical - ordinal): Overall airline satisfaction level - “satisfied” or “neutral or dissatisfied”
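The measurement levels listed above can be mirrored in R's data types. The following is a minimal sketch on a toy data frame (the rows are made up, not taken from the survey). The level order Eco < Eco Plus < Business and the recoding of 0 ("Not Applicable") to NA are assumptions for illustration, not steps taken in this analysis.

```r
# Toy values that mimic the encodings described above (hypothetical rows)
toy <- data.frame(
  Class = c("Eco Plus", "Business", "Eco", "Business"),
  Inflight.wifi.service = c(3, 0, 5, 2)
)

# Ordinal variable: store as an ordered factor (level order is an assumption)
toy$Class <- factor(toy$Class,
                    levels = c("Eco", "Eco Plus", "Business"),
                    ordered = TRUE)

# Rating where 0 means "Not Applicable": one option is to recode 0 as NA
toy$wifi_rating <- ifelse(toy$Inflight.wifi.service == 0,
                          NA_real_,
                          toy$Inflight.wifi.service)

str(toy)
mean(toy$wifi_rating, na.rm = TRUE) # mean over applicable ratings only
```

With an ordered factor, comparisons such as `toy$Class >= "Eco Plus"` become meaningful, which is not the case for a nominal factor.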
#install.packages("dplyr")
#install.packages("tidyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
raw_cloudrate <- raw_cloudrate %>%
  select(-1, -Customer.Type) #Remove the first column (serial number) and Customer.Type column
raw_cloudrate <- raw_cloudrate %>% drop_na() #Drop all observations with NA values in their records
raw_cloudrate$id <- sprintf("%06d", raw_cloudrate$id) #Make ID values to be consistent at 6 digits
raw_cloudrate <- raw_cloudrate %>%
  rename(cust_id = id,
         gender = Gender,
         age = Age,
         travel_type = Type.of.Travel,
         class = Class,
         flight_dist = Flight.Distance,
         plane_wifi = Inflight.wifi.service,
         dep_arr_conv = Departure.Arrival.time.convenient,
         online_book = Ease.of.Online.booking,
         gate_loc = Gate.location,
         food_drink = Food.and.drink,
         online_board = Online.boarding,
         seat_comf = Seat.comfort,
         plane_ent = Inflight.entertainment,
         onboard_srv = On.board.service,
         legroom = Leg.room.service,
         baggage = Baggage.handling,
         checkin_srv = Checkin.service,
         plane_srv = Inflight.service,
         clean = Cleanliness,
         dep_delay = Departure.Delay.in.Minutes,
         arr_delay = Arrival.Delay.in.Minutes,
         overall_sat = satisfaction)
raw_cloudrate <- raw_cloudrate %>%
  mutate(
    gender = factor(gender),
    travel_type = factor(travel_type),
    class = factor(class),
    overall_sat = factor(overall_sat, levels = c("neutral or dissatisfied", "satisfied"))
  )
#By travel class
business <- raw_cloudrate %>% filter(class == "Business")
economy_combined <- raw_cloudrate %>% filter(class %in% c("Eco", "Eco Plus"))
#By delays
delayed_flights <- raw_cloudrate %>% filter(dep_delay > 0 | arr_delay > 0)
ontime_flights <- raw_cloudrate %>% filter(dep_delay == 0 & arr_delay == 0)
#By long or short-haul flights
long_flights <- raw_cloudrate %>% filter(flight_dist > 1500)
short_flights <- raw_cloudrate %>% filter(flight_dist <= 1500)
#By age group (aged 50 and over; aged 30 and under)
above_50 <- raw_cloudrate %>% filter(age >= 50)
below_30 <- raw_cloudrate %>% filter(age <= 30)
#By inflight services
inflight_srv_ratings <- raw_cloudrate %>% select(cust_id, starts_with("plane"), onboard_srv, seat_comf, legroom)
# Mean rating for inflight services
inflight_srv_ratings <- inflight_srv_ratings %>%
  mutate(avg_score = round(rowMeans(select(., -cust_id), na.rm = TRUE), 2))
#install.packages("psych")
library(psych)
#Overall summary of the raw database (after data manipulation)
summary(raw_cloudrate)
## cust_id gender age travel_type
## Length:103594 Female:52576 Min. : 7.00 Business travel:71465
## Class :character Male :51018 1st Qu.:27.00 Personal Travel:32129
## Mode :character Median :40.00
## Mean :39.38
## 3rd Qu.:51.00
## Max. :85.00
## class flight_dist plane_wifi dep_arr_conv online_book
## Business:49533 Min. : 31 Min. :0.00 Min. :0.00 Min. :0.000
## Eco :46593 1st Qu.: 414 1st Qu.:2.00 1st Qu.:2.00 1st Qu.:2.000
## Eco Plus: 7468 Median : 842 Median :3.00 Median :3.00 Median :3.000
## Mean :1189 Mean :2.73 Mean :3.06 Mean :2.757
## 3rd Qu.:1743 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :4983 Max. :5.00 Max. :5.00 Max. :5.000
## gate_loc food_drink online_board seat_comf plane_ent
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :0.00 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.00 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.00 Median :4.00 Median :4.000
## Mean :2.977 Mean :3.202 Mean :3.25 Mean :3.44 Mean :3.358
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:5.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.00 Max. :5.000
## onboard_srv legroom baggage checkin_srv
## Min. :0.000 Min. :0.000 Min. :1.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :3.000
## Mean :3.383 Mean :3.351 Mean :3.632 Mean :3.304
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## plane_srv clean dep_delay arr_delay
## Min. :0.000 Min. :0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 0.00 1st Qu.: 0.00
## Median :4.000 Median :3.000 Median : 0.00 Median : 0.00
## Mean :3.641 Mean :3.286 Mean : 14.75 Mean : 15.18
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.: 12.00 3rd Qu.: 13.00
## Max. :5.000 Max. :5.000 Max. :1592.00 Max. :1584.00
## overall_sat
## neutral or dissatisfied:58697
## satisfied :44897
##
##
##
##
#Selected numerical variables
raw_cloudrate %>%
  select(age, flight_dist, dep_delay, arr_delay, plane_wifi, food_drink, seat_comf, checkin_srv) %>%
  describe() #No grouping variable is needed, so describe() is used instead of describeBy()
## vars n mean sd median trimmed mad min max range
## age 1 103594 39.38 15.11 40 39.40 17.79 7 85 78
## flight_dist 2 103594 1189.33 997.30 842 1042.61 766.50 31 4983 4952
## dep_delay 3 103594 14.75 38.12 0 5.79 0.00 0 1592 1592
## arr_delay 4 103594 15.18 38.70 0 6.09 0.00 0 1584 1584
## plane_wifi 5 103594 2.73 1.33 3 2.70 1.48 0 5 5
## food_drink 6 103594 3.20 1.33 3 3.25 1.48 0 5 5
## seat_comf 7 103594 3.44 1.32 4 3.55 1.48 0 5 5
## checkin_srv 8 103594 3.30 1.27 3 3.38 1.48 0 5 5
## skew kurtosis se
## age 0.00 -0.72 0.05
## flight_dist 1.11 0.27 3.10
## dep_delay 6.77 101.46 0.12
## arr_delay 6.60 94.53 0.12
## plane_wifi 0.04 -0.85 0.00
## food_drink -0.15 -1.15 0.00
## seat_comf -0.48 -0.92 0.00
## checkin_srv -0.37 -0.83 0.00
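Explanation of Some Numerical Variables:
age: The mean (39.38) and median (40) are nearly identical and the skewness is 0.00, so passenger age is roughly symmetric around 40, with a range from 7 to 85.
flight_dist: The mean (1189.33) is well above the median (842) and the skewness is 1.11, indicating a right-skewed distribution in which most flights are short and a smaller number are very long (up to 4983).
dep_delay and arr_delay: Both have a median of 0 but a mean of about 15 minutes and very high skewness (6.77 and 6.60), meaning that at least half of the flights had no delay while a few extreme delays (up to 1592 and 1584 minutes) pull the mean upward.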
# Categorical data
prop.table(table(raw_cloudrate$gender, raw_cloudrate$overall_sat), margin = 1)
##
## neutral or dissatisfied satisfied
## Female 0.5726377 0.4273623
## Male 0.5603905 0.4396095
Explanation of Cross-Tabulation Table
This table shows the proportion of satisfied and neutral-or-dissatisfied passengers within each gender (each row sums to 1). The proportions are very similar for both genders (about 42.7% of female and 44.0% of male passengers are satisfied), which suggests that gender is not a major differentiator in passenger satisfaction.
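To make the role of `margin = 1` in the cross-tabulation concrete, here is a small sketch using made-up counts (the toy data frame is hypothetical and unrelated to the survey). It shows that row-wise proportions sum to 1 within each gender, whereas `margin = 2` would instead normalise within each satisfaction level.

```r
# Hypothetical toy data: 4 female and 4 male passengers
toy <- data.frame(
  gender = rep(c("Female", "Male"), each = 4),
  overall_sat = c("satisfied", rep("neutral or dissatisfied", 3),
                  rep("satisfied", 2), rep("neutral or dissatisfied", 2))
)

counts <- table(toy$gender, toy$overall_sat)

# margin = 1: proportions within each row (gender)
row_props <- prop.table(counts, margin = 1)

# margin = 2: proportions within each column (satisfaction level)
col_props <- prop.table(counts, margin = 2)

print(row_props)
print(rowSums(row_props)) # each gender's proportions add up to 1
```

Reading the table row by row therefore answers "what share of each gender is satisfied?", which is the comparison made in the explanation above.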
# install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
Histogram
#For age
ggplot(raw_cloudrate, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "black") +
  labs(title = "Age Distribution", x = "Age", y = "Frequency")
#For flight distance
ggplot(raw_cloudrate, aes(x = flight_dist)) +
  geom_histogram(binwidth = 100, fill = "lightgreen", color = "black") +
  labs(title = "Flight Distance Distribution", x = "Flight Distance", y = "Frequency")
Explanation of Histogram
Age Distribution: The age distribution of passengers is roughly symmetric and approximately bell-shaped, with no notable skew (the summary statistics above report a skewness of 0.00). The majority of passengers are young to middle-aged adults in their 20s to 50s. This aligns with common airline passenger demographics, where working-age individuals and young adults travel more frequently than children, teenagers, and the elderly.
Flight Distance Distribution: The distribution is right-skewed and dominated by short-haul flights, showing that most passengers in the sample flew short distances. The long right tail comes from the comparatively small number of long-haul flights that airlines offer.
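A quick numerical check of right skew is to compare the mean with the median and to compute a skewness coefficient. The sketch below uses a hypothetical vector of distances (not the survey data) and a simple moment-based skewness formula, which may differ slightly from the estimator used by the psych package.

```r
# Hypothetical flight distances: mostly short, one long-haul value
toy_dist <- c(200, 300, 400, 500, 800, 1200, 4000)

# Moment-based skewness: third central moment divided by variance^(3/2)
dev <- toy_dist - mean(toy_dist)
toy_skew <- mean(dev^3) / mean(dev^2)^1.5

cat("mean:", round(mean(toy_dist), 1), " median:", median(toy_dist), "\n")
cat("skewness:", round(toy_skew, 2), "\n")
```

A mean above the median together with a positive skewness coefficient is the same pattern that flight_dist shows in the summary statistics (mean 1189.33, median 842, skew 1.11).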
Boxplot
ggplot(raw_cloudrate, aes(x = overall_sat, y = age, fill = overall_sat)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Overall Satisfaction", x = "Overall Satisfaction", y = "Age") +
  theme_minimal() +
  scale_fill_manual(values = c("pink", "lightblue"))
Explanation of Boxplot:
This boxplot compares the age distributions of satisfied and neutral/dissatisfied passengers. The median age of satisfied passengers is higher than that of neutral or dissatisfied passengers, suggesting that older passengers tend to report higher satisfaction. The interquartile range (IQR) for neutral/dissatisfied passengers is slightly larger, indicating that their ages are more spread out, whereas the ages of satisfied passengers are more concentrated.
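The median and IQR that a boxplot displays can also be computed directly per group. The following sketch uses base R on a made-up data frame (the ages are hypothetical); applying the same calls to `raw_cloudrate$age` and `raw_cloudrate$overall_sat` would be one way to back the visual reading with numbers.

```r
# Hypothetical ages for the two satisfaction groups
toy <- data.frame(
  overall_sat = rep(c("satisfied", "neutral or dissatisfied"), each = 5),
  age = c(35, 42, 45, 50, 58,
          20, 28, 37, 49, 62)
)

# Median age per group (the thick line inside each box)
med_age <- tapply(toy$age, toy$overall_sat, median)

# Interquartile range per group (the height of each box)
iqr_age <- tapply(toy$age, toy$overall_sat, IQR)

print(med_age)
print(iqr_age)
```

In this toy example the satisfied group has the higher median and the smaller IQR, mirroring the pattern described for the actual boxplot.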