Abstract

This dataset was provided by an airline. The company’s actual name is withheld. The dataset contains details of customers who have already flown with the airline, consolidating their feedback on various aspects of the service together with their flight data.

The main purpose of this dataset is to predict whether a future customer will be satisfied with the service, given the values of the other variables. The airline also needs to know which aspects of its service to emphasize in order to generate more satisfied customers.

Variable Breakdown

satisfaction: (Categorical) Indicates whether the customer was satisfied or dissatisfied with their flight experience.
Gender: (Categorical) Gender of the customer (Male/Female).
Customer.Type: (Categorical) Indicates whether the customer is a Loyal Customer or a Disloyal Customer.
Age: (Numeric) Age of the customer in years.
Type.of.Travel: (Categorical) Purpose of the travel (Business travel/Personal Travel).
Class: (Categorical) Travel class (Business/Eco/Eco Plus).
Flight.Distance: (Numeric) Distance of the flight in miles.
Seat.comfort: (Numeric) Rating of seat comfort on a scale (1 to 5).
Departure.Arrival.time.convenient: (Numeric) Rating of convenience of departure/arrival time (1 to 5).
Food.and.drink: (Numeric) Rating of food and drink quality on the flight (1 to 5).
Gate.location: (Numeric) Rating of the gate location’s convenience (1 to 5).
Inflight.wifi.service: (Numeric) Rating of the inflight Wi-Fi service quality (1 to 5).
Inflight.entertainment: (Numeric) Rating of inflight entertainment options (1 to 5).
Online.support: (Numeric) Rating of the airline’s online support (1 to 5).
Ease.of.Online.booking: (Numeric) Rating of the ease of online booking process (1 to 5).
On.board.service: (Numeric) Rating of onboard service quality (1 to 5).
Leg.room.service: (Numeric) Rating of leg room space (1 to 5).
Baggage.handling: (Numeric) Rating of baggage handling efficiency (1 to 5).
Checkin.service: (Numeric) Rating of the check-in service (1 to 5).
Cleanliness: (Numeric) Rating of the cleanliness of the airplane (1 to 5).
Online.boarding: (Numeric) Rating of the online boarding process (1 to 5).
Departure.Delay.in.Minutes: (Numeric) Duration of departure delay in minutes.
Arrival.Delay.in.Minutes: (Numeric) Duration of arrival delay in minutes.

Data Cleaning and Feature Engineering

# Load necessary libraries
library(tidyverse)
library(FactoMineR)
library(factoextra)
library(cluster)
library(caret)
library(readr)
library(ggplot2)
library(ggpubr)
library(corrplot)
library(psych)
library(dplyr)
library(gridExtra)
library(tidyr)
library(reshape2)  

# dataset
airline_data = read.csv("airline_customer_satisfaction (1).csv")
head(airline_data)

##   satisfaction Gender  Customer.Type Age  Type.of.Travel    Class
## 1 dissatisfied Female Loyal Customer  41 Business travel Business
## 2    satisfied   Male Loyal Customer  60 Business travel      Eco
## 3    satisfied Female Loyal Customer  33 Business travel Business
## 4    satisfied   Male Loyal Customer  38 Business travel Eco Plus
## 5    satisfied Female Loyal Customer  47 Business travel Business
## 6 dissatisfied Female Loyal Customer  46 Personal Travel      Eco
##   Flight.Distance Seat.comfort Departure.Arrival.time.convenient Food.and.drink
## 1            1909            3                                 5              5
## 2            2398            2                                 5              5
## 3             631            3                                 3              3
## 4            2540            4                                 3              3
## 5            1149            4                                 4              4
## 6             974            4                                 4              4
##   Gate.location Inflight.wifi.service Inflight.entertainment Online.support
## 1             5                     2                      4              3
## 2             5                     2                      2              2
## 3             3                     5                      5              5
## 4             3                     4                      4              4
## 5             4                     5                      4              4
## 6             2                     4                      5              5
##   Ease.of.Online.booking On.board.service Leg.room.service Baggage.handling
## 1                      3                3                3                3
## 2                      2                1                5                1
## 3                      3                3                3                3
## 4                      4                4                1                5
## 5                      4                4                5                4
## 6                      3                3                4                3
##   Checkin.service Cleanliness Online.boarding Departure.Delay.in.Minutes
## 1               4           3               3                          0
## 2               5           3               2                         10
## 3               4           3               4                          0
## 4               1           1               4                          9
## 5               3           4               3                          8
## 6               4           3               4                          0
##   Arrival.Delay.in.Minutes
## 1                        0
## 2                       10
## 3                        0
## 4                       20
## 5                        1
## 6                        0

# Count the number of NA values in each column
na_counts <- colSums(is.na(airline_data))

# Create a data frame for visualization
na_data <- data.frame(
  Column = names(na_counts),
  NA_Count = as.numeric(na_counts)
)

na_data

##                               Column NA_Count
## 1                       satisfaction        0
## 2                             Gender        0
## 3                      Customer.Type        0
## 4                                Age        0
## 5                     Type.of.Travel        0
## 6                              Class        0
## 7                    Flight.Distance        0
## 8                       Seat.comfort        0
## 9  Departure.Arrival.time.convenient        0
## 10                    Food.and.drink        0
## 11                     Gate.location        0
## 12             Inflight.wifi.service        0
## 13            Inflight.entertainment        0
## 14                    Online.support        0
## 15            Ease.of.Online.booking        0
## 16                  On.board.service        0
## 17                  Leg.room.service        0
## 18                  Baggage.handling        0
## 19                   Checkin.service        0
## 20                       Cleanliness        0
## 21                   Online.boarding        0
## 22        Departure.Delay.in.Minutes        0
## 23          Arrival.Delay.in.Minutes        0

airline_data$satisfaction <- factor(
  airline_data$satisfaction,
  levels = c("satisfied", "dissatisfied") # Specify the desired order
)

There are no missing values in our dataset, so no entries need to be imputed or removed. We have also converted the satisfaction column to a factor with an explicit level order ("satisfied" first), so that plots and summaries list the two groups consistently in later comparisons.
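To illustrate what the factor conversion does, here is a minimal sketch on a made-up vector (the values below are invented for illustration and are not taken from the dataset). It shows the level order, the underlying integer codes, and an explicit 0/1 indicator, which is usually the clearer choice for numeric summaries.

```r
# Toy vector standing in for airline_data$satisfaction (values are made up)
sat <- c("dissatisfied", "satisfied", "satisfied", "dissatisfied", "satisfied")

# Fix the level order so "satisfied" is always the first level
sat_factor <- factor(sat, levels = c("satisfied", "dissatisfied"))

levels(sat_factor)      # level order used by plots and tables
as.integer(sat_factor)  # underlying integer codes: 1 = satisfied, 2 = dissatisfied

# An explicit 0/1 indicator is clearer for numeric summaries
sat_binary <- as.integer(sat_factor == "satisfied")
mean(sat_binary)        # share of satisfied customers in the toy vector
```

The integer codes follow the level order, so they should not be read as a 0/1 outcome; a separate indicator column avoids that ambiguity.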

With our data loaded and ready, we can start our analysis with some descriptive analytics.

1) Descriptive & Comparative Analytics

Descriptive analytics aims to summarize and understand the dataset by identifying patterns, trends, and key insights into customer feedback and flight data. We will answer a series of questions that give us insight into the overall spread and distribution of our dataset with respect to satisfaction levels.

1.1) What percentage of customers are classified as satisfied or dissatisfied?

satisfaction_counts <- airline_data %>%
  group_by(satisfaction) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)



# Plot with percentage labels
ggplot(satisfaction_counts, aes(x = factor(satisfaction), y = Count, fill = factor(satisfaction))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) + # Add percentage labels
  labs(
    x = "Satisfaction",
    y = "Count",
    fill = "Satisfaction",
    title = "Customer Satisfaction Distribution"
  ) +
  scale_x_discrete(labels = c("Satisfied", "Not Satisfied")) +
  scale_fill_manual(
    values = c("satisfied" = "steelblue", "dissatisfied" = "maroon") # Set custom colors
  ) +
  theme_minimal()

The bar chart shows the distribution of customer satisfaction, with 54.6% of customers being satisfied and 45.4% dissatisfied. While the majority of customers are satisfied, the high proportion of dissatisfied customers (nearly half) is significant and indicates room for improvement. This suggests that the airline’s services may not consistently meet customer expectations. Focusing on the factors contributing to dissatisfaction—such as delays, comfort, or service quality—can help the airline reduce dissatisfaction rates and enhance overall customer satisfaction. Addressing these gaps is crucial for improving customer retention and loyalty.

That comparison gives a surface-level view of satisfaction levels. In the rest of this section, we explore the differences between satisfied and dissatisfied customers in more detail, using visualizations of the spread of our data and more precise measures.

1.2) How is satisfaction distributed by age and flight distance?

c1 = ggplot(airline_data, aes(Age, fill = satisfaction)) +
  geom_density(alpha = 0.5) +
  labs(title = "Satisfaction Frequency by Age",
       x = "Age")+
  scale_fill_manual(
    values = c("satisfied" = "steelblue", "dissatisfied" = "maroon") # Custom colors
  ) +
  theme_minimal()

c2 = ggplot(airline_data, aes(Flight.Distance, fill = satisfaction)) +
  geom_density(alpha = 0.5) +
  labs(title = "Satisfaction Frequency by Flight Distance",
       x = "Flight Distance")+
  scale_fill_manual(
    values = c("satisfied" = "steelblue", "dissatisfied" = "maroon") # Custom colors
  ) +
  theme_minimal()

grid.arrange(c1, c2, nrow = 2)

The age density plot shows that dissatisfied customers are more evenly distributed across ages, with peaks among younger and middle-aged groups (20–40 years). Satisfied customers are concentrated in older age groups (40–60 years), suggesting that older customers may have higher satisfaction levels, possibly due to differing expectations or experiences. This points to younger customers as a group to focus on when improving satisfaction.

The flight distance density plot shows that dissatisfied customers are more concentrated in mid-range flight distances (1,500–3,000 miles), while satisfied customers are slightly more prevalent on shorter flights (500–1,500 miles).

There is significant overlap between the two groups, so flight distance alone does not determine satisfaction. Because dissatisfaction is somewhat more concentrated at mid-range distances, the airline could prioritize service quality and pain points on those flights while maintaining comfort and reliability on shorter and longer routes.

1.3) How does satisfaction vary by gender, travel class, or customer type?

p1 = ggplot(airline_data, aes(x = satisfaction, fill = satisfaction)) +
  geom_bar() + 
  labs( x= "", y = "Count") + 
  scale_x_discrete(labels = c("Satisfied", "Not Satisfied")) + 
  scale_fill_manual(
    values = c("satisfied" = "steelblue", "dissatisfied" = "maroon") # Set custom colors
  ) +
  ggtitle("Customer Satisfaction By Gender") + 
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -.5) + # after_stat() replaces the deprecated ..count..
  theme_minimal() +
  theme(legend.position = "none")+
  facet_wrap(~ Gender)+
  ylim(0,1250)

p2 = ggplot(airline_data, aes(x = factor(satisfaction), fill = factor(satisfaction))) +
  geom_bar() + 
  labs(x= "", y = "Count", fill = "Satisfaction") + 
  scale_x_discrete(labels = c("Satisfied", "Not Satisfied")) + 
  scale_fill_manual(
    values = c("satisfied" = "steelblue", "dissatisfied" = "maroon") # Set custom colors
  ) +
  ggtitle("Customer Satisfaction By Travel Type") + 
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -.5) +
  theme_minimal() +
  theme(legend.position = "none")+
  facet_wrap(~ Type.of.Travel)+
  ylim(0,1500)

p3 = ggplot(airline_data, aes(x = factor(satisfaction), fill = factor(satisfaction))) +
  geom_bar() + 
  labs(x = "Satisfaction", y = "Count", fill = "Satisfaction") + 
  scale_x_discrete(labels = c("Satisfied", "Not Satisfied")) + 
  scale_fill_manual(
    values = c("satisfied" = "steelblue", "dissatisfied" = "maroon") # Set custom colors
  ) +
  ggtitle("Customer Satisfaction By Customer Type") + 
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -.5) +
  theme_minimal() +
  theme(legend.position = "none")+
  facet_wrap(~ Customer.Type)+
  ylim(0,1750)

grid.arrange(p1, p2, p3,  nrow = 3)

The bar charts reveal key trends in customer satisfaction segmented by gender, travel type, and customer type. Among genders, females have a higher proportion of satisfaction compared to males, indicating possible differences in service expectations or experiences between genders.

For travel type, business travelers show significantly higher satisfaction compared to personal travelers, suggesting that services tailored for business needs (e.g., convenience, efficiency) are meeting expectations, while personal travelers may find the experience less satisfactory, possibly due to different priorities like affordability or leisure comfort.

In terms of customer type, loyal customers are overwhelmingly more satisfied compared to disloyal customers. This highlights the value of cultivating loyalty through targeted programs and enhanced experiences. The stark dissatisfaction among disloyal customers signals potential gaps in meeting their expectations or retaining their business. These insights suggest the airline should focus efforts on improving satisfaction for male, personal, and disloyal customers, as these groups represent the largest areas for improvement to boost overall satisfaction rates.

1.4) Are there significant differences in satisfaction between men and women?

airline_data$satisfaction_binary = ifelse(airline_data$satisfaction == "satisfied", 1, 0)

male_satisfaction = airline_data %>%
  dplyr::select(Gender, satisfaction_binary)%>%
  dplyr::filter(Gender == 'Male')

female_satisfaction = airline_data %>%
  dplyr::select(Gender, satisfaction_binary)%>%
  dplyr::filter(Gender == 'Female')

t.test(male_satisfaction$satisfaction_binary, female_satisfaction$satisfaction_binary, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  male_satisfaction$satisfaction_binary and female_satisfaction$satisfaction_binary
## t = -13.717, df = 2998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2766024 -0.2074174
## sample estimates:
## mean of x mean of y 
## 0.4237743 0.6657842

The two-sample t-test compares the mean satisfaction scores (binary: 1 = satisfied, 0 = dissatisfied) between male and female customers. The results show:

t-value (-13.717): Indicates a significant difference between the two groups.
p-value (< 2.2e-16): The p-value is extremely small, much lower than the typical threshold of 0.05, meaning the difference in satisfaction between males and females is statistically significant.

Insight: Female customers are significantly more satisfied with their flight experience than male customers (66.6% versus 42.4% satisfied in this sample). This indicates potential differences in service expectations or experiences based on gender. To improve overall satisfaction, the airline should investigate the specific factors contributing to lower satisfaction among male customers.

This statistical test supports the difference in satisfaction between men and women that we observed in the descriptive bar charts.
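Because the outcome here is a 0/1 indicator, the comparison is really one of two proportions. As a cross-check, a two-proportion test can be run alongside the t-test. The sketch below uses simulated data (the group sizes and satisfaction rates are invented, not taken from the dataset) and base R's `prop.test()`; it is an illustration of the alternative test, not a re-analysis of the airline data.

```r
set.seed(42)
# Simulated 0/1 satisfaction indicators; group sizes and rates are invented
male   <- rbinom(500, size = 1, prob = 0.45)
female <- rbinom(500, size = 1, prob = 0.65)

# Pooled-variance t-test on the 0/1 indicator, as in the analysis above
t_res <- t.test(male, female, var.equal = TRUE)

# Two-proportion test on the same counts (no continuity correction)
p_res <- prop.test(
  x = c(sum(male), sum(female)),
  n = c(length(male), length(female)),
  correct = FALSE
)

# Compare the p-values from the two approaches
c(t_test = t_res$p.value, prop_test = p_res$p.value)
```

With samples of this size the two tests typically lead to the same conclusion; the proportion test simply states the hypothesis in terms of the quantity actually being compared.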

1.5) Insight into metrics divided by Cabin Class

One of the most distinct differences in customers’ overall flying experience is seating class, so we examine several metrics by class.

d1 = ggplot(airline_data, aes(x = Class, y = Flight.Distance))+
  geom_boxplot(aes( fill = Class))+
  labs(title = "Flight distance distribution by Class",
       x = "Cabin Class",
       y = "Flight Distance")+
  theme_minimal()

d2 = ggplot(airline_data, aes(x = Class, fill = satisfaction)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("steelblue", "maroon")) +
  labs(x = "Class", y = "Proportion of Satisfaction", fill = "Satisfaction") +
  ggtitle("Satisfaction by Travel Class") +
  theme_minimal()

grid.arrange(d1, d2, nrow = 2)

Although we noted earlier that flight distance alone does not clearly separate satisfied from dissatisfied customers, we can note a distinct difference in the spread of flight distances for each cabin class. The cabin classes also differ in their proportion of satisfied customers.

1.6) Insight on Delays

# Calculate the average flight delay for departure
average_departure_delay <- mean(airline_data$Departure.Delay.in.Minutes, na.rm = TRUE)
cat("Average Departure Delay:", average_departure_delay, "minutes\n")

## Average Departure Delay: 15.31033 minutes

# Calculate the average flight delay for arrival
average_arrival_delay <- mean(airline_data$Arrival.Delay.in.Minutes, na.rm = TRUE)
cat("Average Arrival Delay:", average_arrival_delay, "minutes\n")

## Average Arrival Delay: 15.60567 minutes

# Create a data frame for plotting
delays <- data.frame(
  Type = c("Departure", "Arrival"),
  Average_Delay = c(average_departure_delay, average_arrival_delay)
)


# Plot the graph
ggplot(delays, aes(x = Type, y = Average_Delay, fill = Type)) +
  geom_bar(stat = "identity", width = 0.5) +
  scale_fill_manual(values = c("darkgreen", "pink")) +
  labs(
    title = "Average Departure and Arrival Delays",
    x = "Delay Type",
    y = "Average Delay (Minutes)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Within our dataset there is no notable difference between the average departure and arrival delays (15.3 and 15.6 minutes), but we will look into how the spread of these delays relates to satisfaction.

For the delay plots, we filter the dataset to delays under 200 minutes so the visualizations remain readable; the excluded outliers are still used in later analysis.

delay_data <- airline_data%>%
  filter(Departure.Delay.in.Minutes < 200, Arrival.Delay.in.Minutes <200)

g1 = ggplot(delay_data, aes(x = satisfaction, y = Departure.Delay.in.Minutes))+
  geom_boxplot(aes( fill = satisfaction))+
  scale_fill_manual(values = c("steelblue", "maroon")) +
  labs(title = "Departure Delay distribution by Satisfaction Rating",
       x = "Satisfaction Category",
       y = "Departure Delay")+
  theme_minimal()+
  theme(legend.position = "none")

g2 = ggplot(delay_data, aes(x = satisfaction, y = Arrival.Delay.in.Minutes))+
  geom_boxplot(aes( fill = satisfaction))+
  scale_fill_manual(values = c("steelblue", "maroon")) +
  labs(title = "Arrival Delay distribution by Satisfaction Rating",
       x = "Satisfaction Category",
       y = "Arrival Delay")+
  theme_minimal()+
  theme(legend.position = "none")

grid.arrange(g1,g2, ncol = 2)

Delays, both departure and arrival, seem to have a noticeable impact on customer satisfaction, as dissatisfied customers consistently experience longer and more variable delays. However, there is some overlap between satisfied and dissatisfied customers, suggesting that delays are not the sole determinant of satisfaction.
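Delay times are strongly right-skewed, so a rank-based comparison can complement the boxplots without requiring any outlier filtering. The following is a minimal sketch on simulated delays (both distributions are invented for illustration); it reports robust summaries per group and runs a Wilcoxon rank-sum test from base R.

```r
set.seed(1)
# Simulated right-skewed delays in minutes; both distributions are invented
delays <- data.frame(
  satisfaction = rep(c("satisfied", "dissatisfied"), each = 300),
  delay = c(rexp(300, rate = 1 / 12), rexp(300, rate = 1 / 18))
)

# Medians and interquartile ranges are robust to extreme delays
delay_summary <- aggregate(
  delay ~ satisfaction, data = delays,
  FUN = function(x) c(median = median(x), IQR = IQR(x))
)
delay_summary

# Wilcoxon rank-sum test: compares the groups without assuming normality
w <- wilcox.test(delay ~ satisfaction, data = delays)
w$p.value
```

Applied to the real data, the same two calls could be run on `Departure.Delay.in.Minutes` and `Arrival.Delay.in.Minutes` using the full, unfiltered dataset.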

2) Service Specific Analysis

So far we have taken a holistic view of the dataset with respect to satisfaction levels. We now step away from this approach and analyze the airline’s individual services, since they play a large role in determining a customer’s experience with their trip.

2.1) Which services have the highest and lowest ratings?

service_ratings = colMeans(airline_data[, c("Inflight.wifi.service", "Inflight.entertainment", "Online.support", "On.board.service", "Leg.room.service", "Checkin.service", "Baggage.handling")], na.rm = TRUE)

avg_data <- data.frame(
  Column = names(service_ratings),
  Average = as.numeric(service_ratings)
)

ggplot(avg_data, aes(x = reorder(Column, -Average), y = Average, fill = Column)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average Values of Service Columns",
    x = "Service",
    y = "Average"
  ) +
  geom_text(aes(label = round(Average, 2)), vjust = -0.5, size = 4) +
  ylim(0, 4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The highest-rated service is Baggage Handling, with an average score of 3.7, and the lowest is Inflight Wi-Fi, with 3.2. With a spread of only 0.5 between the highest and lowest averages, the airline’s service ratings are fairly consistent.

2.2) Which service has the most significant impact on overall satisfaction?

# Correlation matrix
service = airline_data%>%
  dplyr::select(Inflight.wifi.service, Inflight.entertainment, Online.support, On.board.service, Leg.room.service, Checkin.service, Baggage.handling, satisfaction_binary)


cor_matrix = cor(service)

# Correlation visualization
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = .4)

According to the correlation heatmap, the service most positively associated with satisfaction is Inflight Entertainment. This makes intuitive sense, as it is the service customers interact with the most during a flight, so it plausibly carries the most weight.
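Rather than reading the ranking off the heatmap, the correlations with the 0/1 satisfaction indicator can be extracted and sorted directly. The sketch below uses simulated ratings (column names mirror the dataset, but the values and the rule generating the outcome are invented). The Pearson correlation between a rating and a 0/1 variable is the point-biserial correlation.

```r
set.seed(7)
n <- 200
# Simulated 1-5 ratings; column names mirror the dataset, values are invented
ratings <- data.frame(
  Inflight.entertainment = sample(1:5, n, replace = TRUE),
  Online.support         = sample(1:5, n, replace = TRUE),
  Baggage.handling       = sample(1:5, n, replace = TRUE)
)

# Hypothetical outcome that depends mostly on entertainment
score <- 0.9 * ratings$Inflight.entertainment +
  0.3 * ratings$Online.support + rnorm(n)
satisfaction_binary <- as.integer(score > median(score))

# Pearson correlation with a 0/1 variable is the point-biserial correlation
service_cor <- sapply(ratings, function(x) cor(x, satisfaction_binary))
sort(service_cor, decreasing = TRUE)
```

Correlation measures association only, so a high value identifies a candidate driver of satisfaction rather than establishing that improving the service would raise satisfaction.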

3) Exploratory and Predictive Analysis with PCA

Exploratory analysis with PCA will aim to uncover underlying trends and groupings among features such as flight delays, customer satisfaction, engagement levels, and other performance metrics. For example, PCA can help identify how factors like flight distance, delays, and customer demographics collectively influence satisfaction levels.

For predictive analysis, PCA will be used to reduce the dimensionality of the dataset and address multicollinearity issues, creating a set of principal components that can serve as inputs for predictive models. These models will aim to predict key outcomes, such as whether a customer will be satisfied or dissatisfied based on features like flight delays, class of travel, and engagement survey scores. By leveraging PCA, the analysis will ensure the predictive models are computationally efficient and robust, focusing on the most informative features.

Initializing PCA

numeric_cols = airline_data[, sapply(airline_data, is.numeric)]
numeric_cols = numeric_cols[, colSums(is.na(numeric_cols)) == 0]


# Correlation matrix
cor_matrix = cor(numeric_cols)

# Correlation visualization
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = .4)

This correlation matrix heatmap visualizes the relationships between variables in the dataset, with the color gradient and numerical values indicating the strength and direction of correlations. Dark blue represents strong positive correlations, white indicates weak or no correlation, and dark red shows strong negative correlations. Notable positive correlations include Seat Comfort and Food and Drink (0.73) and Inflight Wi-Fi Service and Inflight Entertainment (0.60), suggesting that these factors are closely related in influencing customer perceptions. Moderate positive correlations, such as Ease of Online Booking (0.45) and Inflight Entertainment (0.50) with Satisfaction Binary, highlight their contribution to customer satisfaction. Negative correlations, such as Arrival Delay in Minutes and Satisfaction Binary (-0.10), indicate that delays slightly reduce satisfaction. Variables like Age show near-zero correlations with service metrics, suggesting minimal influence. Overall, this heatmap helps identify significant relationships, multicollinearity, and features relevant to customer satisfaction.

# Perform PCA
numeric_cols = numeric_cols%>%
  dplyr::select(-satisfaction_binary)

standardized_data = scale(numeric_cols)


pca = PCA(standardized_data, graph = FALSE)

fviz_eig(pca, addlabels = TRUE)

The first dimension alone explains 21.7% of the total variance, followed by the second and third dimensions at 14.3% and 11.9%, respectively. Together the first three components account for about 47.9% of the variance, so they capture the most significant variation in the data without representing a majority of it. As the dimensions increase, the variance explained diminishes, indicating that later components capture more subtle or specific variations. Using the first few components, we can analyze the dominant patterns in the dataset efficiently, keeping in mind that a substantial share of the variation lies outside them.
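The percentages in a scree plot come from the eigenvalues of the correlation matrix: each component's share is its eigenvalue divided by the sum of all eigenvalues. A minimal sketch of that calculation follows, using base R's `prcomp()` on simulated data instead of `FactoMineR::PCA()` so that it runs without extra packages (the data are invented for illustration).

```r
set.seed(3)
# Simulated data: 100 rows, 5 columns (values are invented)
x <- matrix(rnorm(500), ncol = 5)
x[, 2] <- x[, 1] + rnorm(100, sd = 0.5)  # make two columns correlated

# PCA on centered, unit-variance columns
pca_fit <- prcomp(x, center = TRUE, scale. = TRUE)

# Each component's share of variance is its eigenvalue over the eigenvalue total
eigenvalues   <- pca_fit$sdev^2
var_explained <- eigenvalues / sum(eigenvalues) * 100
cum_explained <- cumsum(var_explained)

round(data.frame(PC = seq_along(eigenvalues),
                 eigenvalues, var_explained, cum_explained), 2)
```

The cumulative column is a convenient way to decide how many components to keep, for example by retaining components until a chosen share of the variance is reached.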

# Top contributors to PC1 and PC2
m1 = fviz_contrib(pca, choice = "var", axes = 1, top = 10) + ggtitle("Top Contributors to PC1")
m2 = fviz_contrib(pca, choice = "var", axes = 2, top = 10) + ggtitle("Top Contributors to PC2")

grid.arrange(m1, m2)

PC1 (Online and Customer Experience Dimension): The primary contributors to PC1 are Ease of Online Booking, Online Support, and Online Boarding, followed by factors like Inflight Entertainment and On-Board Service. These variables suggest that PC1 represents a dimension focused on the overall customer experience with digital and service-related aspects. A higher PC1 score may indicate a customer who rated online processes and in-flight amenities favorably.

PC2 (Flight Operations and Service Convenience Dimension): PC2 is dominated by Food and Drink, Gate Location, Seat Comfort, and Departure/Arrival Time Convenience, with smaller contributions from Cleanliness and Ease of Online Booking. This suggests that PC2 reflects operational and physical service factors that influence customer satisfaction. A higher PC2 score may indicate customers who rated physical comfort, time convenience, and accessibility favorably.

# PCA variable contributions
var = get_pca_var(pca)

# Round values to 2 decimal places
rounded_coord <- round(var$coord, 2)
rounded_contrib <- round(var$contrib, 2)
rounded_cos2 <- round(var$cos2, 2)

# Correlation Plot of cos2 values
corrplot(var$cos2, is.corr = FALSE)

With this heatmap we can also dive into the meaning behind the other dimensions (principal components):

Dimension 3: Heavily influenced by Departure/Arrival Time Convenience, implying this dimension captures the time management and punctuality aspects of customer experience.

Dimension 4: Strongly tied to Age, which suggests this dimension may reflect demographic differences influencing satisfaction or preferences.

Dimension 5: Influenced by Flight Distance, implying this dimension relates to the length and type of flight (short-haul vs. long-haul).

Moving forward, we will use the first two principal components to draw insights for our analysis.

3.1) Which variables are most correlated with overall flight experience?

fviz_pca_var(pca, col.var = "contrib", gradient.cols = c("blue", "green", "red"), repel = TRUE)

This PCA variable plot highlights the contributions of variables to the first two dimensions, which explain 21.7% and 14.3% of the variance, respectively. Dim1 is driven by factors like Ease of Online Booking, Inflight Entertainment, and Online Support, reflecting the digital and service experience, while Dim2 focuses on physical comfort and operational factors like Food and Drink, Seat Comfort, and Time Convenience. Variables like Delays and Flight Distance point in similar directions, suggesting a relationship between longer flights and delays.

The three high-contributing areas highlighted in red (online interactions, inflight services, and seat comfort) are the most promising places to prioritize when trying to improve customer satisfaction.

# Biplot of individuals and variables
fviz_pca_biplot(pca, geom = "point", repel = TRUE, col.var = "black", col.ind = "lightblue")

Above, each of the 3,000 customers in our dataset is plotted on the first two principal components. We can color these individuals by other variables in the dataset to gain further insight.

# Scatter Plot of PCA Results with Satisfaction

pca_data = as.data.frame(pca$ind$coord)

pca_data$satisfaction = airline_data$satisfaction

ggplot(pca_data, aes(x = Dim.1, y = Dim.2, color = factor(satisfaction))) +
  geom_point() +
  scale_color_manual(values = c("green", "red"))+
  ggtitle("PCA Results by Satisfaction Level") +
  labs(x = "Principal Component 1", y = "Principal Component 2", color = "Satisfaction") +
  theme_minimal()

The plot separates satisfied and dissatisfied customers fairly well. Next, we will see whether the same holds for other variables.
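One way to put a number on this visual separation is to fit a simple classifier on the two component scores. The sketch below fits a logistic regression with base R's `glm()` on simulated scores (the `Dim.1` and `Dim.2` values and the rule generating the outcome are invented; they stand in for `pca_data`). It is a rough illustration of using principal components as model inputs, not a tuned predictive model.

```r
set.seed(11)
n <- 300
# Simulated stand-ins for the Dim.1 / Dim.2 scores; values are invented
scores <- data.frame(Dim.1 = rnorm(n), Dim.2 = rnorm(n))

# Hypothetical outcome driven mainly by Dim.1
scores$satisfied <- rbinom(n, 1, plogis(1.5 * scores$Dim.1 + 0.3 * scores$Dim.2))

# Logistic regression of satisfaction on the two component scores
fit <- glm(satisfied ~ Dim.1 + Dim.2, data = scores, family = binomial)

# In-sample accuracy at a 0.5 cutoff (optimistic; a held-out split would be fairer)
predicted <- as.integer(fitted(fit) > 0.5)
accuracy  <- mean(predicted == scores$satisfied)
accuracy
```

On the real data, the same model could be fitted on a training split of `pca_data` and evaluated on the remaining rows to get an honest estimate of how well the first two components predict satisfaction.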

3.2) How does type of travel influence flight perception?

pca_data$Type.of.Travel = airline_data$Type.of.Travel
ggplot(pca_data, aes(x = Dim.1, y = Dim.2, color = Type.of.Travel)) + geom_point() + ggtitle("PCA Clusters by Travel Type")

This scatter plot visualizes PCA clusters based on travel type (business vs. personal) along the first two principal components (Dim1 and Dim2). The data points show significant overlap between business travel (red) and personal travel (blue), indicating that the first two dimensions do not distinctly separate these two travel types. This suggests that while business and personal travelers may share similar characteristics, additional dimensions or variables might be needed to better differentiate these groups.

3.3) How does travel class influence perceptions of flights?

pca_data$Class = airline_data$Class
ggplot(pca_data, aes(x = Dim.1, y = Dim.2, color = Class)) + geom_point() + ggtitle("PCA Clusters by Travel Class") + labs(x = "Principal Component 1", y = "Principal Component 2") + theme_minimal()

This scatter plot visualizes PCA clusters based on travel class (Business, Eco, and Eco Plus) along the first two principal components. The points show a high degree of overlap across the three classes, indicating that the first two principal components do not strongly differentiate between travel classes. However, Business class (red) appears slightly more concentrated toward the center, while Eco (green) is more dispersed, suggesting greater variability in customer experiences or behaviors within this class. Eco Plus (blue) shows minimal representation and overlaps closely with Eco, indicating similar patterns in this PCA space. Further exploration using additional components or variables could help better distinguish between these travel classes.

4) Cluster Analysis

4.1) What groups of customers with similar characteristics can be identified?

This question is best answered with a cluster analysis. Since we have already prepared the data and computed the principal component scores, we can reuse them here and cluster customers on the first two components.

# Creating and visualizing our clusters

kmeans_result = kmeans(pca_data[, c("Dim.1", "Dim.2")], centers = 3, nstart = 25)
pca_data$Cluster = as.factor(kmeans_result$cluster)
ggplot(pca_data, aes(x = Dim.1, y = Dim.2, color = Cluster)) + geom_point() + ggtitle("Customer Clusters")

# Add Cluster information to the original dataset
airline_data$Cluster <- pca_data$Cluster

# Summarizing the makeup of clusters
cluster_summary <- airline_data %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

print(cluster_summary)

## # A tibble: 3 × 20
##   Cluster   Age Flight.Distance Seat.comfort Departure.Arrival.time.convenient
##   <fct>   <dbl>           <dbl>        <dbl>                             <dbl>
## 1 1        40.8           1916.         3.99                              3.90
## 2 2        37.9           2017.         2.43                              2.81
## 3 3        40.2           2095.         1.79                              1.82
## # ℹ 15 more variables: Food.and.drink <dbl>, Gate.location <dbl>,
## #   Inflight.wifi.service <dbl>, Inflight.entertainment <dbl>,
## #   Online.support <dbl>, Ease.of.Online.booking <dbl>, On.board.service <dbl>,
## #   Leg.room.service <dbl>, Baggage.handling <dbl>, Checkin.service <dbl>,
## #   Cleanliness <dbl>, Online.boarding <dbl>, Departure.Delay.in.Minutes <dbl>,
## #   Arrival.Delay.in.Minutes <dbl>, satisfaction_binary <dbl>

Here we can see the three clusters that our data is split into, along with the average of each numeric variable within each cluster, which helps us understand the group characteristics.

Going by the summary printed above, our third cluster contains slightly longer distance flight customers with low seat comfort, flight departure/arrival time convenience, and food ratings. The second cluster is marked by low in-flight entertainment, online support, inflight wifi service, online booking ease, onboard service and legroom service ratings, among others. Most notably, this group has a very low satisfaction rating on average, which aligns with the other low ratings across the board. The first cluster is more similar to the third, except with higher inflight entertainment, seat comfort, food and drink, and departure/arrival time convenience ratings.
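One caveat when reading cluster numbers: k-means assigns its labels arbitrarily, so "cluster 1" in one run can be "cluster 3" in the next unless the random seed is fixed. The sketch below shows one possible way to make the numbering stable, using synthetic scores in place of Dim.1 and Dim.2. The seed value, the toy data, and the choice to order clusters by their Dim.1 centre are all assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, chosen only for this illustration

# Synthetic two-dimensional scores standing in for Dim.1 and Dim.2 (hypothetical data)
scores <- rbind(
  cbind(rnorm(100, mean = -3), rnorm(100, mean = 0)),
  cbind(rnorm(100, mean =  0), rnorm(100, mean = 3)),
  cbind(rnorm(100, mean =  3), rnorm(100, mean = 0))
)
colnames(scores) <- c("Dim.1", "Dim.2")

km <- kmeans(scores, centers = 3, nstart = 25)

# Renumber clusters by their centre on Dim.1, so that cluster 1 is always
# the left-most group regardless of the labels kmeans() happened to assign
label_order    <- order(km$centers[, "Dim.1"])
stable_cluster <- factor(match(km$cluster, label_order))

print(table(stable_cluster))
```

With the labels tied to a property of the clusters themselves, written interpretations such as "the second cluster" keep pointing at the same group when the report is re-knitted.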

4.2) What are the main differences between satisfied and dissatisfied customer groups?

Upon looking at our cluster summary, it was clear that the second cluster was far less satisfied with their airline experience than the other two clusters. Let’s look at some of the key variables that underlie this satisfaction difference.

# Selecting the relevant service-rating and delay features from the dataset
features_for_comparison <- airline_data %>%
  select(Inflight.wifi.service, Inflight.entertainment, Online.support,
         Ease.of.Online.booking, On.board.service, Food.and.drink, Gate.location, Leg.room.service,
         Baggage.handling, Checkin.service, Cleanliness, Online.boarding,
         Departure.Delay.in.Minutes, Arrival.Delay.in.Minutes, Cluster)

# Reshaping the data into a long format
features_long <- features_for_comparison %>%
  pivot_longer(cols = -Cluster, names_to = "Feature", values_to = "Rating")

# Computing the average rating for each feature and cluster once again
avg_features_by_cluster <- features_long %>%
  group_by(Cluster, Feature) %>%
  summarise(Avg_Rating = mean(Rating), .groups = "drop")

# Creating our parallel coordinate plot
ggplot(avg_features_by_cluster, aes(x = Feature, y = Avg_Rating, group = Cluster, color = factor(Cluster))) +
  geom_line(alpha = 0.7, linewidth = 1) +  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  labs(
    title = "Average Feature Ratings Across Clusters",
    x = "Features",
    y = "Average Rating",
    color = "Cluster"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability

As we can now see visually, the second cluster consistently has much lower scores in almost all of the major in-flight accommodation ratings, such as leg room and entertainment, as well as in the online experience. It is also worth noting that these customers experienced longer departure and arrival delays on average than our satisfied customer clusters.
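Note that this plot places 1-to-5 ratings and delays measured in minutes on the same axis, which can flatten the differences between ratings. One possible remedy is to standardize each feature before averaging by cluster. The following is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical), using base R rather than the dplyr pipeline above.

```r
set.seed(1)

# Hypothetical toy data: one rating on a 1-to-5 scale and one delay in minutes
toy <- data.frame(
  Cluster                    = factor(rep(1:3, each = 50)),
  Inflight.entertainment     = sample(1:5, 150, replace = TRUE),
  Departure.Delay.in.Minutes = rexp(150, rate = 1 / 15)
)

feature_cols <- c("Inflight.entertainment", "Departure.Delay.in.Minutes")

# Standardize each feature to mean 0 and standard deviation 1 across all customers
toy_scaled <- toy
toy_scaled[feature_cols] <- as.data.frame(scale(toy[feature_cols]))

# Cluster averages are now expressed in standard deviations from the overall mean
scaled_means <- aggregate(toy_scaled[feature_cols],
                          by  = list(Cluster = toy_scaled$Cluster),
                          FUN = mean)
print(scaled_means)
```

On the standardized scale, a value of -0.5 means the cluster sits half a standard deviation below the overall average for that feature, so ratings and delays can be compared on one axis.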

4.3) How does customer loyalty vary across clusters?

I’m curious to see the customer loyalty makeup of our clusters. I imagine the second, mostly dissatisfied cluster will have a lower proportion of loyal customers, but I also want to see whether our first and third clusters differ in any meaningful way and what this may say about the groups in our data.

# Stacked bar chart of Customer Loyalty by Cluster
ggplot(airline_data, aes(x = factor(Cluster), fill = Customer.Type)) +
  geom_bar(position = "fill") +  # "fill" normalizes to proportions
  labs(
    title = "Proportion of Loyal and Disloyal Customers by Cluster",
    x = "Cluster",
    y = "Proportion",
    fill = "Customer Type"
  ) +
  scale_fill_manual(values = c("Loyal Customer" = "blue", "disloyal Customer" = "red")) + 
  theme_classic()

As we can observe here, our second cluster of mostly dissatisfied individuals is made up of a smaller proportion of loyal customers than our other two clusters. This was expected and aligns with our earlier findings. Our first and third clusters have essentially the same proportion of loyal and disloyal customers, meaning that no group combines high satisfaction with low customer loyalty, which underscores the importance of this variable.
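The proportions read off the bar chart can also be tabulated directly, and a chi-squared test gives a rough check on whether loyalty and cluster membership are related. This is only a sketch on simulated data: the loyalty rates of 0.85 and 0.60 are invented for the example and are not taken from the airline dataset.

```r
set.seed(7)

# Hypothetical toy data: clusters 1 and 3 share a loyalty rate, cluster 2 is lower
cluster       <- factor(rep(1:3, each = 300))
p_loyal       <- c(0.85, 0.60, 0.85)[as.integer(cluster)]
customer_type <- ifelse(runif(900) < p_loyal, "Loyal Customer", "disloyal Customer")

# Row-wise proportions: share of each customer type within each cluster
loyalty_table <- table(Cluster = cluster, Customer.Type = customer_type)
loyalty_props <- prop.table(loyalty_table, margin = 1)
print(round(loyalty_props, 3))

# Chi-squared test of independence between cluster membership and customer type
loyalty_test <- chisq.test(loyalty_table)
print(loyalty_test$p.value)
```

With a dataset as large as this one, even small differences can come out statistically significant, so the size of the gap in `loyalty_props` matters more than the p-value alone.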

5) Recommendation Analysis

5.1) Which customer groups should be the primary focus?

As we’ve seen already, loyal customers are far more satisfied than non-loyal customers on average, so we will need to convert these non-loyal customers by creating a better experience for them. We also know from our previous analysis that male, personal travelers tend to be highly dissatisfied with their travel experience. But would it be more profitable for the company to cater to the needs of these demographics, or should the company continue to cater more to business travelers?

# Calculating the count of travelers and the count of loyal customers
combined_data <- airline_data %>%
  group_by(Type.of.Travel, Customer.Type) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  # Regroup so that sum(Count) is the total within each travel type, not the overall total
  group_by(Type.of.Travel) %>%
  mutate(Proportion = ifelse(Customer.Type == "Loyal Customer", Count / sum(Count), 0)) %>%
  ungroup()

# Plotting total counts for business/personal travelers with proportion of loyal customers
ggplot(combined_data, aes(x = Type.of.Travel, y = Count, fill = Customer.Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Travelers by Type of Travel with Loyal Customers Proportion",
       x = "Type of Travel", y = "Count") +
  scale_fill_manual(values = c("Loyal Customer" = "skyblue", "disloyal Customer" = "lightgrey")) +
  theme_minimal() +
  theme(legend.title = element_blank())  # Remove legend title for better readability

From this graph, we can see that a staggering number of personal travelers already happen to be loyal customers, which suggests that they are likely to keep returning to the airline despite the low satisfaction rating. Instead, it would be more beneficial to use resources to convert business travelers into loyal customers by aiming to improve their satisfaction rating. Business travelers also make up a much larger share of the total customer base, which is another reason to prioritize them.
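To put numbers on what the stacked bars show, the loyal share has to be computed within each travel type rather than against the overall total. Here is a small worked example in base R. The counts are invented purely to illustrate the arithmetic and do not come from the airline dataset.

```r
# Hypothetical counts, invented purely to illustrate the calculation
counts <- data.frame(
  Type.of.Travel = c("Business travel", "Business travel",
                     "Personal travel", "Personal travel"),
  Customer.Type  = c("Loyal Customer", "disloyal Customer",
                     "Loyal Customer", "disloyal Customer"),
  Count          = c(700, 300, 380, 20)
)

# ave() applies sum() within each travel type and returns one value per row
counts$Group.Total <- ave(counts$Count, counts$Type.of.Travel, FUN = sum)

# Share of each customer type within its own travel type
counts$Proportion <- counts$Count / counts$Group.Total
print(counts)
```

In this made-up case the loyal share is 700 / 1000 = 0.70 for business travel and 380 / 400 = 0.95 for personal travel, even though business travel has far more loyal customers in absolute terms.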

5.2) How should the airline improve satisfaction ratings for business travelers?

Now that we have our target market identified, let’s look at which variables correlate the strongest with satisfaction specifically for business travelers.

# Filtering for business travelers
business_travelers <- airline_data %>%
  filter(Type.of.Travel == "Business travel")

# Select numeric columns, setting satisfaction aside so it can be re-attached under a clearer name
numeric_data <- business_travelers %>%
  select(where(is.numeric)) %>%
  select(-satisfaction_binary)

# Calculating the correlation matrix with Satisfaction Rating
correlation_matrix <- cor(cbind(numeric_data, Satisfaction = business_travelers$satisfaction_binary))

# Reshape the correlation matrix to plot
correlation_data <- melt(correlation_matrix)

# Heatmap of correlations between numeric variables for business travelers
ggplot(correlation_data, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1)) +
  theme_minimal() +
  labs(title = "Heatmap of Numeric Variables for Business Travelers",
       x = "Variables", y = "Variables") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As can be observed from this heatmap, the variables that have the highest correlation with satisfaction for business travelers are online boarding, cleanliness, check-in service, baggage handling, legroom, onboard service, ease of online booking, online support, inflight entertainment, inflight wifi service, and seat comfort.

Specifically, ease of online booking and inflight entertainment appear to be the most important for this type of traveler. I suggest that the airline increase the number of movies and shows available to watch on board and simplify its website to make booking flights easier. Online support is also quite high in importance, so improving the overall online experience will be very beneficial for the company.
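A heatmap makes it hard to rank variables by eye, so it can help to pull out just the satisfaction column of the correlation matrix and sort it. The sketch below does this on simulated data. The `toy` data frame and the weights used to generate `Satisfaction` are assumptions for the example, not estimates from the airline dataset.

```r
set.seed(99)
n <- 400

# Hypothetical toy data: two ratings drive satisfaction, one does not
toy <- data.frame(
  Inflight.entertainment = sample(1:5, n, replace = TRUE),
  Ease.of.Online.booking = sample(1:5, n, replace = TRUE),
  Gate.location          = sample(1:5, n, replace = TRUE)
)
score <- 0.8 * toy$Inflight.entertainment +
         0.5 * toy$Ease.of.Online.booking + rnorm(n)
toy$Satisfaction <- as.integer(score > median(score))  # binary, like satisfaction_binary

cor_mat <- cor(toy)

# Keep only the Satisfaction column, drop its self-correlation, sort strongest first
sat_cor    <- cor_mat[, "Satisfaction"]
sat_cor    <- sat_cor[names(sat_cor) != "Satisfaction"]
sat_ranked <- sort(sat_cor, decreasing = TRUE)
print(round(sat_ranked, 3))
```

Sorting by the raw value puts negative correlations (for example, delays) at the bottom; sorting by `abs(sat_cor)` instead would rank variables by strength regardless of direction.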

5.3) How should the airline seek to improve the experience for younger people?

As we saw earlier, younger people are overall less satisfied with their inflight experience than older people are. It is crucial to target a younger audience, given that they are likely not yet loyal customers of any airline, and securing them would mean greater future business for the airline. I’m going to create another correlation matrix for individuals 30 and under to see which elements are most important to them.

# Filtering for travelers aged 30 and under
young_travelers <- airline_data %>%
  filter(Age <= 30)  # Age is the customer's age in years

# Select numeric columns and satisfaction
numeric_data1 <- young_travelers %>%
  select(where(is.numeric)) %>%
  select(-satisfaction_binary)

# Correlation matrix
correlation_matrix <- cor(cbind(numeric_data1, Satisfaction = young_travelers$satisfaction_binary))

# Reshaping correlation matrix for plotting
correlation_data <- melt(correlation_matrix)

# Heatmap of correlations for travelers 30 and under
ggplot(correlation_data, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1)) +
  theme_minimal() +
  labs(title = "Heatmap for Ages 30 and Under",
       x = "Variables", y = "Variables") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Once again, inflight entertainment is the most important to this demographic, meaning that the company should immediately seek to improve this aspect of the in-flight experience. The online experience is also once again very important, so this should be refined as soon as possible as well.

Compared to our previous correlation matrix, travelers 30 and under care a bit more about inflight wifi, food and drink, and seat comfort, so in order to specifically target this audience, these should all be looked at and adjusted as well.
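Comparing two heatmaps by eye is imprecise, so one option is to compute the correlation of each rating with satisfaction in two subgroups and look at the difference directly. The sketch below is illustrative only: the data are simulated, the helper name `sat_correlations` is hypothetical, and it compares travelers aged 30 and under against older travelers rather than against the business traveler group used above.

```r
set.seed(2024)
n <- 600

# Hypothetical toy data, not the airline dataset
toy <- data.frame(
  Age                   = sample(18:70, n, replace = TRUE),
  Inflight.wifi.service = sample(1:5, n, replace = TRUE),
  Seat.comfort          = sample(1:5, n, replace = TRUE)
)

# In this toy setup, wifi matters more to satisfaction for travelers aged 30 and under
wifi_weight <- ifelse(toy$Age <= 30, 1.0, 0.3)
score <- wifi_weight * toy$Inflight.wifi.service + 0.5 * toy$Seat.comfort + rnorm(n)
toy$Satisfaction <- as.integer(score > median(score))

# Helper (hypothetical name): correlation of each rating with Satisfaction in a subset
sat_correlations <- function(df, rating_cols) {
  sapply(rating_cols, function(col) cor(df[[col]], df$Satisfaction))
}

rating_cols <- c("Inflight.wifi.service", "Seat.comfort")
young <- sat_correlations(toy[toy$Age <= 30, ], rating_cols)
older <- sat_correlations(toy[toy$Age > 30, ], rating_cols)

# Positive differences mark ratings more strongly tied to satisfaction for the young group
comparison <- data.frame(Young = young, Older = older, Difference = young - older)
print(round(comparison, 3))
```

A table like this makes "care a bit more about" concrete, although small differences in correlation should be read cautiously, since they can arise from sampling noise alone.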

Airline Satisfaction Report

Jeffrey Fernandez

2024-12-15