In this section, we analyze the relationships between various service dimensions and overall customer satisfaction. The goal is to identify which service factors have the strongest influence on overall ratings, enabling targeted service improvements and resource allocation.
# Load the cleaned dataset
ryanair_clean <- read.csv("ryanair_reviews_cleaned.csv")
# Select numeric rating variables for correlation analysis
rating_vars <- c("Overall.Rating", "Seat.Comfort", "Cabin.Staff.Service",
"Food...Beverages", "Ground.Service", "Wifi...Connectivity",
"Inflight.Entertainment", "Value.For.Money")
# Create correlation dataset
corr_data <- ryanair_clean[rating_vars]
# Check data structure
cat("### Dataset for Correlation Analysis\n")
cat("Variables included:", paste(rating_vars, collapse = ", "), "\n")
Variables included: Overall.Rating, Seat.Comfort, Cabin.Staff.Service, Food…Beverages, Ground.Service, Wifi…Connectivity, Inflight.Entertainment, Value.For.Money
cat("Total observations:", nrow(corr_data), "\n\n")
Total observations: 2119
# Ryanair colors
ryanair_blue <- "#073590"
ryanair_yellow <- "#FFD200"
# Calculate count of ratings and average rating by year
rating_trends <- ryanair_clean %>%
mutate(Review_Date = as.Date(Date.Published),
Year = floor_date(Review_Date, "year")) %>%
group_by(Year) %>%
summarize(
Rating_Count = n(),
Avg_Rating = mean(Overall.Rating, na.rm = TRUE)
)
# Create the combined plot
ggplot(rating_trends, aes(x = Year)) +
# Bar plot for count of ratings
geom_bar(aes(y = Rating_Count), stat = "identity",
fill = ryanair_blue, alpha = 0.7, width = 300) +
# Line plot for average rating (on secondary axis) - CORRECTED for 1-10 scale
geom_line(aes(y = Avg_Rating * (max(Rating_Count) / 10)),
color = ryanair_yellow, size = 1.5, group = 1) +
geom_point(aes(y = Avg_Rating * (max(Rating_Count) / 10)),
color = ryanair_yellow, size = 3) +
# Dual y-axes - CORRECTED for 1-10 scale
scale_y_continuous(
name = "Number of Reviews",
sec.axis = sec_axis(~ . / (max(rating_trends$Rating_Count) / 10),
name = "Average Overall Rating",
breaks = 1:10) # Changed to 1:10
) +
# Yearly date axis
scale_x_date(
date_breaks = "1 year",
date_labels = "%Y",
expand = expansion(mult = c(0.05, 0.05))
) +
labs(
title = "Review Volume and Average Rating Over Time",
subtitle = "Bars show number of reviews per year, line shows average rating (1-10 scale)", # Updated subtitle
x = "Year"
) +
theme_minimal() +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
plot.subtitle = element_text(color = "darkgray", size = 12),
axis.title.y.left = element_text(color = ryanair_blue),
axis.title.y.right = element_text(color = ryanair_yellow),
axis.text.y.left = element_text(color = ryanair_blue),
axis.text.y.right = element_text(color = ryanair_yellow),
axis.text.x = element_text(size = 10),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white", color = "lightgray", linewidth = 0.5)
)
# Calculate average overall rating by year - table only
yearly_ratings <- ryanair_clean %>%
mutate(Review_Date = as.Date(Date.Published),
Year = year(Review_Date)) %>%
group_by(Year) %>%
summarize(
Avg_Overall_Rating = round(mean(Overall.Rating, na.rm = TRUE), 2),
Review_Count = n(),
.groups = 'drop'
)
cat("### Average Overall Rating by Year (1-10 Scale)\n")
yearly_ratings %>% knitr::kable()
| Year | Avg_Overall_Rating | Review_Count |
|---|---|---|
| 2012 | 4.15 | 20 |
| 2013 | 6.56 | 107 |
| 2014 | 7.01 | 261 |
| 2015 | 5.39 | 311 |
| 2016 | 6.13 | 174 |
| 2017 | 4.03 | 173 |
| 2018 | 3.05 | 286 |
| 2019 | 2.80 | 350 |
| 2020 | 3.11 | 110 |
| 2021 | 2.62 | 64 |
| 2022 | 2.88 | 126 |
| 2023 | 3.50 | 129 |
| 2024 | 6.62 | 8 |
We calculate the correlation matrix where we exclude non-usage cases (ratings = 0) only for WiFi and Inflight Entertainment, while keeping all data for other services. This approach provides the most accurate picture of true service quality relationships.
# Create a copy of the correlation data
corr_data_selective <- ryanair_clean[rating_vars]
# Replace zeros with NA only for WiFi and Inflight Entertainment
wifi_entertainment_services <- c("Wifi...Connectivity", "Inflight.Entertainment")
for(service in wifi_entertainment_services) {
corr_data_selective[[service]][corr_data_selective[[service]] == 0] <- NA
}
# Calculate correlation matrix using pairwise complete observations
correlation_matrix_selective <- cor(corr_data_selective, use = "pairwise.complete.obs")
# Display correlation matrix
cat("### Correlation Matrix: Service Ratings vs Overall Rating (Selective Non-Usage Exclusion)\n")
cat("**Note:** Excludes zero ratings only for WiFi and Inflight Entertainment. All other services use complete data.\n\n")
Note: Excludes zero ratings only for WiFi and Inflight Entertainment. All other services use complete data.
kable(round(correlation_matrix_selective, 3), caption = "Correlation Between Service Dimensions and Overall Rating (Selective Non-Usage Exclusion)")
| Overall.Rating | Seat.Comfort | Cabin.Staff.Service | Food…Beverages | Ground.Service | Wifi…Connectivity | Inflight.Entertainment | Value.For.Money | |
|---|---|---|---|---|---|---|---|---|
| Overall.Rating | 1.000 | 0.734 | 0.757 | 0.413 | 0.539 | 0.339 | 0.429 | 0.880 |
| Seat.Comfort | 0.734 | 1.000 | 0.713 | 0.351 | 0.407 | 0.404 | 0.444 | 0.699 |
| Cabin.Staff.Service | 0.757 | 0.713 | 1.000 | 0.454 | 0.487 | 0.292 | 0.366 | 0.706 |
| Food…Beverages | 0.413 | 0.351 | 0.454 | 1.000 | 0.307 | 0.420 | 0.530 | 0.361 |
| Ground.Service | 0.539 | 0.407 | 0.487 | 0.307 | 1.000 | 0.368 | 0.307 | 0.535 |
| Wifi…Connectivity | 0.339 | 0.404 | 0.292 | 0.420 | 0.368 | 1.000 | 0.794 | 0.333 |
| Inflight.Entertainment | 0.429 | 0.444 | 0.366 | 0.530 | 0.307 | 0.794 | 1.000 | 0.365 |
| Value.For.Money | 0.880 | 0.699 | 0.706 | 0.361 | 0.535 | 0.333 | 0.365 | 1.000 |
# Create visualization
corrplot(correlation_matrix_selective, method = "color", type = "upper",
order = "hclust", tl.cex = 0.8, tl.col = "black",
title = "Correlation Matrix: Service Ratings vs Overall Satisfaction\n(Excluding Non-Usage for WiFi/Entertainment Only)",
mar = c(0,0,2,0))
# Extract correlations with Overall.Rating
overall_correlations_selective <- correlation_matrix_selective["Overall.Rating", ]
overall_correlations_selective <- overall_correlations_selective[names(overall_correlations_selective) != "Overall.Rating"]
# Create ranked correlation table
correlation_rank_selective <- data.frame(
Service = names(overall_correlations_selective),
Correlation = round(overall_correlations_selective, 3),
Absolute_Correlation = round(abs(overall_correlations_selective), 3),
Rank = rank(-abs(overall_correlations_selective))
) %>%
arrange(desc(Absolute_Correlation))
cat("### Service Factors Ranked by Correlation with Overall Rating (Selective Non-Usage Exclusion)\n")
kable(correlation_rank_selective, caption = "Top Drivers of Overall Customer Satisfaction (Selective Non-Usage Exclusion)")
| Service | Correlation | Absolute_Correlation | Rank | |
|---|---|---|---|---|
| Value.For.Money | Value.For.Money | 0.880 | 0.880 | 1 |
| Cabin.Staff.Service | Cabin.Staff.Service | 0.757 | 0.757 | 2 |
| Seat.Comfort | Seat.Comfort | 0.734 | 0.734 | 3 |
| Ground.Service | Ground.Service | 0.539 | 0.539 | 4 |
| Inflight.Entertainment | Inflight.Entertainment | 0.429 | 0.429 | 5 |
| Food…Beverages | Food…Beverages | 0.413 | 0.413 | 6 |
| Wifi…Connectivity | Wifi…Connectivity | 0.339 | 0.339 | 7 |
# Ryanair colors
ryanair_blue <- "#073590"
ryanair_yellow <- "#FFD200"
# Prepare data for visualization
cor_viz <- correlation_rank_selective %>%
mutate(
Service = factor(Service, levels = Service[order(Absolute_Correlation)]),
Direction = ifelse(Correlation > 0, "Positive", "Negative")
)
ggplot(cor_viz, aes(x = Service, y = Absolute_Correlation, fill = Direction)) +
geom_bar(stat = "identity", alpha = 0.9) +
scale_fill_manual(values = c("Negative" = ryanair_yellow, "Positive" = ryanair_blue)) +
geom_text(aes(label = paste0(round(Correlation, 3))),
hjust = -0.2, size = 4, color = ryanair_blue) +
coord_flip() +
labs(title = "Key Drivers of RyanAir Customer Satisfaction",
subtitle = "Service factors ranked by correlation strength with Overall Rating (Selective Non-Usage Exclusion)",
x = "Service Dimension",
y = "Absolute Correlation Coefficient",
fill = "Correlation Direction") +
theme_minimal() +
scale_y_continuous(limits = c(0, 1), expand = expansion(mult = c(0, 0.1))) +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
plot.subtitle = element_text(color = "darkgray", size = 12),
legend.position = "bottom"
)
While correlation identifies relationships, linear regression provides statistical significance testing and quantifies the exact impact of each service on overall satisfaction. This allows us to answer critical business questions about the magnitude and reliability of each service’s influence.
cat("### Linear Regression Analysis: Quantifying Service Impact\n\n")
# Individual regression models for each service
regression_results <- data.frame()
for(service in correlation_rank_selective$Service) {
# Create formula for individual regression
formula <- as.formula(paste("Overall.Rating ~", service))
# Fit linear model
model <- lm(formula, data = corr_data_selective)
# Extract model summary
model_summary <- summary(model)
coefficients <- tidy(model)
# Extract service coefficient (excluding intercept)
service_coef <- coefficients[coefficients$term == service, ]
regression_results <- rbind(regression_results, data.frame(
Service = service,
Coefficient = round(service_coef$estimate, 3),
Std_Error = round(service_coef$std.error, 3),
T_Value = round(service_coef$statistic, 3),
P_Value = format.pval(service_coef$p.value, digits = 3),
R_Squared = round(model_summary$r.squared, 3),
Adj_R_Squared = round(model_summary$adj.r.squared, 3),
Significant = ifelse(service_coef$p.value < 0.001, "***",
ifelse(service_coef$p.value < 0.01, "**",
ifelse(service_coef$p.value < 0.05, "*", "No")))
))
}
cat("**Individual Regression Models: Impact of Each Service on Overall Rating**\n")
Individual Regression Models: Impact of Each Service on Overall Rating
kable(regression_results, caption = "Linear Regression Results for Each Service Dimension")
| Service | Coefficient | Std_Error | T_Value | P_Value | R_Squared | Adj_R_Squared | Significant |
|---|---|---|---|---|---|---|---|
| Value.For.Money | 1.854 | 0.022 | 85.371 | <2e-16 | 0.775 | 0.775 | *** |
| Cabin.Staff.Service | 1.812 | 0.034 | 53.336 | <2e-16 | 0.573 | 0.573 | *** |
| Seat.Comfort | 2.069 | 0.042 | 49.721 | <2e-16 | 0.539 | 0.538 | *** |
| Ground.Service | 1.386 | 0.047 | 29.432 | <2e-16 | 0.290 | 0.290 | *** |
| Inflight.Entertainment | 1.696 | 0.197 | 8.606 | 3.16e-16 | 0.184 | 0.181 | *** |
| Food…Beverages | 1.321 | 0.063 | 20.850 | <2e-16 | 0.170 | 0.170 | *** |
| Wifi…Connectivity | 1.275 | 0.217 | 5.878 | 1.24e-08 | 0.115 | 0.112 | *** |
A multiple regression model allows us to understand the unique contribution of each service while controlling for all other services. This identifies which services have independent predictive power beyond their correlations with other services.
cat("### Multiple Regression Analysis: Independent Contributions\n\n")
# Prepare data for multiple regression (remove rows with any NA)
multiple_reg_data <- corr_data_selective[complete.cases(corr_data_selective), ]
# Fit multiple regression model
multiple_formula <- as.formula(paste("Overall.Rating ~",
paste(correlation_rank_selective$Service, collapse = " + ")))
multiple_model <- lm(multiple_formula, data = multiple_reg_data)
# Display multiple regression results
multiple_summary <- summary(multiple_model)
multiple_coef <- tidy(multiple_model)
cat("**Multiple Regression Model Summary**\n")
Multiple Regression Model Summary
cat("- R-squared:", round(multiple_summary$r.squared, 3), "\n")
cat("- Adjusted R-squared:", round(multiple_summary$adj.r.squared, 3), "\n")
cat("- F-statistic:", round(multiple_summary$fstatistic[1], 1), "\n")
cat("- P-value:", format.pval(pf(multiple_summary$fstatistic[1],
multiple_summary$fstatistic[2],
multiple_summary$fstatistic[3],
lower.tail = FALSE), digits = 3), "\n\n")
# Display coefficients (excluding intercept)
multiple_coef_display <- multiple_coef[-1, ] %>% # Remove intercept
mutate(
term = gsub("`", "", term),
estimate = round(estimate, 3),
std.error = round(std.error, 3),
statistic = round(statistic, 3),
p.value = format.pval(p.value, digits = 3),
Significance = ifelse(p.value < 0.001, "***",
ifelse(p.value < 0.01, "**",
ifelse(p.value < 0.05, "*", "No")))
) %>%
rename(Service = term, Coefficient = estimate, Std_Error = std.error,
T_Value = statistic, P_Value = p.value)
cat("**Multiple Regression Coefficients**\n")
Multiple Regression Coefficients
kable(multiple_coef_display, caption = "Independent Contributions of Each Service (Controlling for All Others)")
| Service | Coefficient | Std_Error | T_Value | P_Value | Significance |
|---|---|---|---|---|---|
| Value.For.Money | 0.673 | 0.093 | 7.219 | 6.60e-12 | No |
| Cabin.Staff.Service | 0.065 | 0.093 | 0.700 | 0.484 | No |
| Seat.Comfort | 0.115 | 0.115 | 0.999 | 0.319 | No |
| Ground.Service | 0.900 | 0.121 | 7.443 | 1.68e-12 | No |
| Inflight.Entertainment | -0.139 | 0.275 | -0.507 | 0.613 | No |
| Food…Beverages | 0.070 | 0.131 | 0.535 | 0.593 | No |
| Wifi…Connectivity | 0.060 | 0.273 | 0.220 | 0.826 | No |
The discrepancy between individual and multiple regression results requires investigation. We check Variance Inflation Factor (VIF) to determine if multicollinearity is affecting our model.
# Check Variance Inflation Factor (VIF) to confirm multicollinearity
vif_values <- vif(multiple_model)
cat("\n**Variance Inflation Factor (VIF) for Multicollinearity Check**\n")
Variance Inflation Factor (VIF) for Multicollinearity Check
vif_df <- data.frame(Service = names(vif_values), VIF = round(vif_values, 2))
kable(vif_df, caption = "VIF Values for Each Service Dimension")
| Service | VIF | |
|---|---|---|
| Value.For.Money | Value.For.Money | 2.25 |
| Cabin.Staff.Service | Cabin.Staff.Service | 2.03 |
| Seat.Comfort | Seat.Comfort | 1.98 |
| Ground.Service | Ground.Service | 2.50 |
| Inflight.Entertainment | Inflight.Entertainment | 2.94 |
| Food…Beverages | Food…Beverages | 1.87 |
| Wifi…Connectivity | Wifi…Connectivity | 3.03 |
# Interpret VIF results
cat("\n**VIF Interpretation:**\n")
VIF Interpretation:
cat("- VIF < 5: Moderate multicollinearity\n")
cat("- VIF > 5-10: High multicollinearity\n")
cat("- VIF > 10: Severe multicollinearity (model coefficients unreliable)\n")
# Check which services have problematic VIF
high_vif <- vif_df[vif_df$VIF > 5, ]
if(nrow(high_vif) > 0) {
cat("\n**Services with High Multicollinearity (VIF > 5):**\n")
for(i in 1:nrow(high_vif)) {
cat("- ", high_vif$Service[i], ": VIF = ", high_vif$VIF[i], "\n", sep = "")
}
cat("\n**Conclusion:** The high VIF values confirm severe multicollinearity, explaining why all coefficients became non-significant in the multiple regression.\n")
} else {
cat("\n**No severe multicollinearity detected (all VIF < 5)**\n")
cat("- This means the multiple regression results are statistically reliable\n")
cat("- The non-significance in multiple regression indicates that when controlling for all services, no single service has independent predictive power\n")
cat("- This suggests customers evaluate services holistically rather than independently\n")
}
No severe multicollinearity detected (all VIF < 5) - This means the multiple regression results are statistically reliable - The non-significance in multiple regression indicates that when controlling for all services, no single service has independent predictive power - This suggests customers evaluate services holistically rather than independently
# Get top 3 services
top_3_services <- head(correlation_rank_selective$Service, 3)
# Prepare data for combined plot
combined_data_wide <- corr_data_selective %>%
select(Overall.Rating, all_of(top_3_services))
# Create combined scatter plot with all three regression lines
ggplot(combined_data_wide, aes(x = Overall.Rating, y = Overall.Rating)) +
# Value.For.Money points and line
geom_point(aes(y = Value.For.Money), alpha = 0.3, color = ryanair_blue, size = 1) +
geom_smooth(aes(y = Value.For.Money), method = "lm", se = FALSE,
color = ryanair_blue, size = 1.2, linetype = "solid") +
# Cabin.Staff.Service points and line
geom_point(aes(y = Cabin.Staff.Service), alpha = 0.3, color = "#FF6B35", size = 1) +
geom_smooth(aes(y = Cabin.Staff.Service), method = "lm", se = FALSE,
color = "#FF6B35", size = 1.2, linetype = "solid") +
# Seat.Comfort points and line
geom_point(aes(y = Seat.Comfort), alpha = 0.3, color = "#4ECDC4", size = 1) +
geom_smooth(aes(y = Seat.Comfort), method = "lm", se = FALSE,
color = "#4ECDC4", size = 1.2, linetype = "solid") +
labs(title = "Top 3 Service Factors vs Overall Satisfaction",
subtitle = "All three key service dimensions plotted against overall customer ratings",
x = "Overall Rating (1-10 scale)",
y = "Service Rating (1-5 scale)") +
theme_minimal() +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
plot.subtitle = element_text(color = "darkgray", size = 12),
legend.position = "none"
) +
# Add legend manually
annotate("text", x = Inf, y = Inf,
label = paste("Value.For.Money: R_Squared =",
round(regression_results$R_Squared[regression_results$Service == "Value.For.Money"], 3)),
hjust = 1.1, vjust = 2, color = ryanair_blue, size = 4) +
annotate("text", x = Inf, y = Inf,
label = paste("Cabin.Staff.Service: R_Squared =",
round(regression_results$R_Squared[regression_results$Service == "Cabin.Staff.Service"], 3)),
hjust = 1.1, vjust = 3.5, color = "#FF6B35", size = 4) +
annotate("text", x = Inf, y = Inf,
label = paste("Seat.Comfort: R_Squared =",
round(regression_results$R_Squared[regression_results$Service == "Seat.Comfort"], 3)),
hjust = 1.1, vjust = 5, color = "#4ECDC4", size = 4)
The initial bad review analysis included all ratings, but this can be misleading for services with high non-usage rates. To understand true service failures, we exclude non-usage cases by only considering ratings greater than 0. We define a poor rating as 2 or below on the 5-point scale, representing clear service dissatisfaction when the service was actually used.
# Define rating thresholds for analysis
POOR_RATING_THRESHOLD <- 2 # 2/5 or below indicates service failure
BAD_REVIEW_THRESHOLD <- 5 # 5/10 or below indicates bad overall experience
# Identify bad reviews
bad_reviews <- corr_data_selective[corr_data_selective$Overall.Rating <= BAD_REVIEW_THRESHOLD, ]
corrected_bad_analysis <- data.frame(
Service = rating_vars[rating_vars != "Overall.Rating"],
Poor_Rating_Count = sapply(rating_vars[rating_vars != "Overall.Rating"],
function(x) sum(corr_data_selective[[x]] <= POOR_RATING_THRESHOLD &
corr_data_selective$Overall.Rating <= BAD_REVIEW_THRESHOLD &
corr_data_selective[[x]] > 0, na.rm = TRUE)),
Actually_Used_Count = sapply(rating_vars[rating_vars != "Overall.Rating"],
function(x) sum(corr_data_selective[[x]] > 0 &
corr_data_selective$Overall.Rating <= BAD_REVIEW_THRESHOLD, na.rm = TRUE)),
Total_Bad_Reviews = nrow(bad_reviews)
) %>%
mutate(
Actual_Failure_Rate = round(Poor_Rating_Count / Actually_Used_Count * 100, 1),
Service_Usage_Rate = round(Actually_Used_Count / Total_Bad_Reviews * 100, 1),
Severity_Rank = rank(-Actual_Failure_Rate)
) %>%
arrange(desc(Actual_Failure_Rate))
cat("### Corrected Analysis: Actual Service Failure Rates (Excluding Non-Usage)\n\n")
cat("**Service Failure Rates When Services Are Actually Used**\n")
Service Failure Rates When Services Are Actually Used
kable(corrected_bad_analysis, caption = "True Service Failure Rates Excluding Non-Usage Cases")
| Service | Poor_Rating_Count | Actually_Used_Count | Total_Bad_Reviews | Actual_Failure_Rate | Service_Usage_Rate | Severity_Rank | |
|---|---|---|---|---|---|---|---|
| Inflight.Entertainment | Inflight.Entertainment | 288 | 294 | 1284 | 98.0 | 22.9 | 1 |
| Wifi…Connectivity | Wifi…Connectivity | 238 | 245 | 1284 | 97.1 | 19.1 | 2 |
| Food…Beverages | Food…Beverages | 1149 | 1222 | 1284 | 94.0 | 95.2 | 3 |
| Ground.Service | Ground.Service | 1189 | 1284 | 1284 | 92.6 | 100.0 | 4 |
| Seat.Comfort | Seat.Comfort | 1042 | 1281 | 1284 | 81.3 | 99.8 | 5 |
| Value.For.Money | Value.For.Money | 1019 | 1281 | 1284 | 79.5 | 99.8 | 6 |
| Cabin.Staff.Service | Cabin.Staff.Service | 832 | 1278 | 1284 | 65.1 | 99.5 | 7 |
We now perform regression analysis specifically on bad reviews to understand which service failures have the strongest impact on driving overall dissatisfaction.
cat("### Linear Regression Analysis: Drivers of Bad Reviews\n\n")
# Individual regression models for each service on bad reviews only
bad_review_regression_results <- data.frame()
for(service in correlation_rank_selective$Service) {
# Create formula for individual regression on bad reviews
formula <- as.formula(paste("Overall.Rating ~", service))
# Fit linear model only on bad reviews
model <- lm(formula, data = bad_reviews)
# Extract model summary
model_summary <- summary(model)
coefficients <- tidy(model)
# Extract service coefficient (excluding intercept)
service_coef <- coefficients[coefficients$term == service, ]
bad_review_regression_results <- rbind(bad_review_regression_results, data.frame(
Service = service,
Coefficient = round(service_coef$estimate, 3),
Std_Error = round(service_coef$std.error, 3),
T_Value = round(service_coef$statistic, 3),
P_Value = format.pval(service_coef$p.value, digits = 3),
R_Squared = round(model_summary$r.squared, 3),
Adj_R_Squared = round(model_summary$adj.r.squared, 3),
Significant = ifelse(service_coef$p.value < 0.001, "***",
ifelse(service_coef$p.value < 0.01, "**",
ifelse(service_coef$p.value < 0.05, "*", "No")))
))
}
cat("**Individual Regression Models: Impact of Service Failures on Bad Reviews**\n")
Individual Regression Models: Impact of Service Failures on Bad Reviews
kable(bad_review_regression_results, caption = "Linear Regression Results for Bad Reviews Only")
| Service | Coefficient | Std_Error | T_Value | P_Value | R_Squared | Adj_R_Squared | Significant |
|---|---|---|---|---|---|---|---|
| Value.For.Money | 0.687 | 0.025 | 27.720 | <2e-16 | 0.375 | 0.374 | *** |
| Cabin.Staff.Service | 0.343 | 0.029 | 12.038 | <2e-16 | 0.102 | 0.101 | *** |
| Seat.Comfort | 0.478 | 0.035 | 13.642 | <2e-16 | 0.127 | 0.126 | *** |
| Ground.Service | 0.510 | 0.045 | 11.443 | <2e-16 | 0.093 | 0.092 | *** |
| Inflight.Entertainment | 0.343 | 0.155 | 2.217 | 0.0274 | 0.017 | 0.013 | * |
| Food…Beverages | 0.100 | 0.048 | 2.082 | 0.0376 | 0.003 | 0.003 | * |
| Wifi…Connectivity | 0.386 | 0.141 | 2.729 | 0.00682 | 0.030 | 0.026 | ** |
Key Insights from Bad Review Regression:
In bad reviews: 37.5% variance explained (vs 77.5% overall)
Coefficient: 0.687 (vs 1.854 overall)
Interpretation: Once customers are already dissatisfied, value perception becomes less dominant
Seat.Comfort: Strong impact (0.478 coefficient, 12.7% variance)
Ground.Service: Surprisingly strong (0.510 coefficient, 9.3% variance)
Interpretation: For dissatisfied customers, physical discomfort and ground experience become major pain points
WiFi: 0.386 coefficient, statistically significant (**)
Entertainment: 0.343 coefficient, statistically significant (*)
Interpretation: These services actively contribute to pushing customers from neutral to negative
Business Insight:
The “Last Straw” Effect: While WiFi and Entertainment don’t drive overall satisfaction broadly, they serve as critical failure points that push already-unhappy customers over the edge. Customers experiencing core service failures (value, staff, comfort) then encounter broken paid services, creating a compounding dissatisfaction effect.
Strategic Implication: WiFi and Entertainment aren’t just “nice-to-haves” - they’re critical for service recovery. When core services fail, functioning WiFi/Entertainment could prevent the situation from escalating to a bad review. Their current catastrophic failure rates mean they’re actively making bad situations worse.
Updated Strategy:
Prevention: Fix core services (Value, Staff, Seats)
Containment: Ensure WiFi/Entertainment work reliably as backup options
Recovery: Use functioning digital services to mitigate other failures
Regression reveals the causal drivers while failure rates show symptom frequency
# Create comparative visualization: Failure Rates vs Regression Impact
comparison_data <- data.frame(
Service = corrected_bad_analysis$Service,
Failure_Rate = corrected_bad_analysis$Actual_Failure_Rate,
Service_Usage_Rate = corrected_bad_analysis$Service_Usage_Rate,
Business_Impact = bad_review_regression_results$Coefficient[match(corrected_bad_analysis$Service, bad_review_regression_results$Service)],
Variance_Explained = bad_review_regression_results$R_Squared[match(corrected_bad_analysis$Service, bad_review_regression_results$Service)] * 100
)
# Create left plot with usage information (wider)
p1 <- ggplot(comparison_data, aes(x = reorder(Service, Failure_Rate), y = Failure_Rate,
fill = ifelse(Service_Usage_Rate > 50, "High Usage (>50%)", "Low Usage (<=50%)"))) +
geom_bar(stat = "identity", alpha = 0.9) +
scale_fill_manual(values = c("High Usage (>50%)" = ryanair_blue, "Low Usage (<=50%)" = ryanair_yellow),
name = "Service Usage Level") +
geom_text(aes(label = paste0(Failure_Rate, "%")), hjust = -0.2, size = 3.5, color = ryanair_blue) +
geom_text(aes(label = paste0("(Used by ", Service_Usage_Rate, ")")),
hjust = -0.2, size = 3, color = "darkgray", vjust = 2) +
coord_flip() +
labs(title = "Service Failure Rates in Bad Reviews",
subtitle = "Percentage of poor ratings (<=2/5) when service is actually used",
x = "Service Dimension",
y = "Failure Rate (%)") +
theme_minimal() +
scale_y_continuous(limits = c(0, 105), expand = expansion(mult = c(0, 0.1))) +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold", size = 14),
plot.subtitle = element_text(color = "darkgray", size = 11),
legend.position = "bottom",
legend.box = "horizontal",
legend.text = element_text(size = 10)
)
# Create right plot with business impact and legend
p2 <- ggplot(comparison_data, aes(x = reorder(Service, Business_Impact), y = Business_Impact)) +
geom_bar(stat = "identity", fill = ryanair_blue, alpha = 0.8) +
geom_text(aes(label = round(Business_Impact, 2)), hjust = -0.2, size = 3.5, color = ryanair_blue) +
geom_text(aes(label = paste0("(", round(Variance_Explained, 1), "%)")),
hjust = -0.2, size = 3, color = "darkgray", vjust = 2) +
coord_flip() +
labs(title = "Business Impact on Bad Reviews",
subtitle = "Regression analysis: Coefficient (points) and R_Squared (%) shown",
x = "Service Dimension",
y = "Impact Coefficient",
caption = "Numbers represent: Coefficient, (R_Squared %)\n- Coefficient: Points increase in Overall Rating per 1-point service improvement\n- R_Squared: Percentage of bad review variance explained") +
theme_minimal() +
scale_y_continuous(limits = c(0, 0.8), expand = expansion(mult = c(0, 0.1))) +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold", size = 14),
plot.subtitle = element_text(color = "darkgray", size = 11),
plot.caption = element_text(size = 9, color = "darkgray", hjust = 0)
)
# Arrange plots with different widths (left wider, right narrower)
gridExtra::grid.arrange(p1, p2, ncol = 2, widths = c(0.55, 0.45),
top = "Critical Insight: Services with Highest Failure Rates vs Most Impactful on Satisfaction")
Key Comparative Insight
The side-by-side analysis reveals a crucial strategic distinction between service failure frequency and business impact. While Inflight Entertainment (98% failure rate) and WiFi Connectivity (97% failure rate) demonstrate near-universal service breakdown, their impact on driving overall dissatisfaction is relatively modest (3.0% and 1.7% variance explained respectively).
Conversely, Value.For.Money, despite a lower failure rate (79.5%), exerts the strongest influence on bad reviews, explaining 37.5% of dissatisfaction variance. Similarly, Seat Comfort and Cabin Staff Service show moderate failure rates (81.3% and 65.1%) but substantial business impact (12.7% and 10.2% variance explained).
This divergence highlights a critical data mining insight: Ryanair should prioritize resources based on business impact rather than failure frequency. The catastrophic WiFi and Entertainment failures, while alarming, represent containment issues rather than primary drivers of customer dissatisfaction. The strategic focus should remain on Value Perception, Seat Comfort, and Cabin Staff improvements, which actually determine whether core service experiences escalate into negative reviews.
# Create a combined ranking table
cat("**Strategic Priority Ranking: Failure Rates vs Business Impact**\n")
Strategic Priority Ranking: Failure Rates vs Business Impact
priority_ranking <- comparison_data %>%
mutate(
Failure_Rank = rank(-Failure_Rate),
Impact_Rank = rank(-Business_Impact),
Strategic_Priority = ifelse(Impact_Rank <= 3, "HIGH",
ifelse(Failure_Rank <= 3, "MEDIUM", "LOW"))
) %>%
select(Service, Failure_Rate, Business_Impact, Failure_Rank, Impact_Rank, Strategic_Priority) %>%
arrange(Impact_Rank)
kable(priority_ranking, caption = "Service Prioritization Based on Combined Failure and Impact Analysis")
| Service | Failure_Rate | Business_Impact | Failure_Rank | Impact_Rank | Strategic_Priority |
|---|---|---|---|---|---|
| Value.For.Money | 79.5 | 0.687 | 6 | 1.0 | HIGH |
| Ground.Service | 92.6 | 0.510 | 4 | 2.0 | HIGH |
| Seat.Comfort | 81.3 | 0.478 | 5 | 3.0 | HIGH |
| Wifi…Connectivity | 97.1 | 0.386 | 2 | 4.0 | MEDIUM |
| Inflight.Entertainment | 98.0 | 0.343 | 1 | 5.5 | MEDIUM |
| Cabin.Staff.Service | 65.1 | 0.343 | 7 | 5.5 | LOW |
| Food…Beverages | 94.0 | 0.100 | 3 | 7.0 | MEDIUM |
Ranking Logic:
HIGH Priority: Services ranked in top 3 by Business Impact (Impact_Rank ≤ 3)
These services actually drive bad reviews the most
Fixing them will have the biggest reduction in dissatisfaction
MEDIUM Priority: Services ranked in top 3 by Failure Rate but NOT top 3 by Impact (Failure_Rank ≤ 3 AND Impact_Rank > 3)
These fail frequently but don’t drive bad reviews as much
Still important to fix, but less urgent than HIGH priority
LOW Priority: Everything else
Neither high failure rates nor high business impact
# Analysis of Extra Services Extreme Dissatisfaction (Rating = 1)
extra_services <- c("Wifi...Connectivity", "Inflight.Entertainment", "Food...Beverages")
# Calculate metrics for extra services in bad reviews
extra_services_analysis <- data.frame(
Service = extra_services,
Extreme_Dissat_Count = sapply(extra_services,
function(x) sum(bad_reviews[[x]] == 1, na.rm = TRUE)),
Moderate_Dissat_Count = sapply(extra_services,
function(x) sum(bad_reviews[[x]] == 2, na.rm = TRUE)),
Neutral_Positive_Count = sapply(extra_services,
function(x) sum(bad_reviews[[x]] >= 3, na.rm = TRUE)),
Actually_Used_Count = sapply(extra_services,
function(x) sum(bad_reviews[[x]] > 0, na.rm = TRUE)),
Total_Bad_Reviews = nrow(bad_reviews)
) %>%
mutate(
Extreme_Dissat_Rate = round(Extreme_Dissat_Count / Actually_Used_Count * 100, 1),
Moderate_Dissat_Rate = round(Moderate_Dissat_Count / Actually_Used_Count * 100, 1),
Service_Usage_Rate = round(Actually_Used_Count / Total_Bad_Reviews * 100, 1),
Service_Label = case_when(
Service == "Wifi...Connectivity" ~ "WiFi & Connectivity",
Service == "Inflight.Entertainment" ~ "Inflight Entertainment",
Service == "Food...Beverages" ~ "Food & Beverages"
)
)
# Prepare data for stacked bar plot
stacked_data <- extra_services_analysis %>%
select(Service_Label, Extreme_Dissat_Rate, Moderate_Dissat_Rate, Service_Usage_Rate) %>%
pivot_longer(cols = c(Extreme_Dissat_Rate, Moderate_Dissat_Rate),
names_to = "Dissatisfaction_Level",
values_to = "Percentage") %>%
mutate(
Dissatisfaction_Level = factor(Dissatisfaction_Level,
levels = c("Moderate_Dissat_Rate", "Extreme_Dissat_Rate"),
labels = c("Moderate Dissatisfaction (Rating = 2)",
"Extreme Dissatisfaction (Rating = 1)")),
Service_Label = factor(Service_Label,
levels = c("WiFi & Connectivity", "Inflight Entertainment", "Food & Beverages"))
)
# Create stacked bar plot with Ryanair colors
ggplot(stacked_data, aes(x = Service_Label, y = Percentage, fill = Dissatisfaction_Level)) +
geom_bar(stat = "identity", alpha = 0.9) +
scale_fill_manual(values = c("Extreme Dissatisfaction (Rating = 1)" = ryanair_yellow,
"Moderate Dissatisfaction (Rating = 2)" = ryanair_blue)) +
# Add service usage rate annotations
geom_text(data = extra_services_analysis,
aes(x = Service_Label, y = 105,
label = paste0("Used by ", Service_Usage_Rate, "% of bad reviews")),
color = "darkgray", size = 3.5, fontface = "italic", inherit.aes = FALSE) +
# Add percentage labels on bars - using the main data that has Dissatisfaction_Level
geom_text(aes(label = paste0(Percentage, "%")),
position = position_stack(vjust = 0.5),
color = "white", size = 4, fontface = "bold") +
labs(title = "Extra Services: Catastrophic Failure Analysis in Bad Reviews",
subtitle = "Distribution of dissatisfaction levels when services are actually used by dissatisfied customers",
x = "Extra Service Category",
y = "Percentage of Service Users (%)",
fill = "Dissatisfaction Level",
caption = "Analysis limited to bad reviews (Overall Rating <= 5/10)\nBars show percentage of users who rated each service 1 or 2 (out of 5)") +
theme_minimal() +
scale_y_continuous(limits = c(0, 115), expand = expansion(mult = c(0, 0.05))) +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
plot.subtitle = element_text(color = "darkgray", size = 12),
plot.caption = element_text(size = 10, color = "darkgray", hjust = 0),
axis.text.x = element_text(size = 11, color = "black"),
axis.text.y = element_text(size = 10, color = "black"),
legend.position = "bottom",
legend.title = element_text(face = "bold", size = 10),
legend.text = element_text(size = 9),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white", color = "lightgray", linewidth = 0.5)
) +
guides(fill = guide_legend(reverse = TRUE))
# Analysis of Extra Services Catastrophic Failure (Rating = 1 only)
extra_services <- c("Wifi...Connectivity", "Inflight.Entertainment", "Food...Beverages")
# Calculate metrics for catastrophic failures only (rating = 1)
catastrophic_analysis <- data.frame(
Service = extra_services,
Catastrophic_Failure_Count = sapply(extra_services,
function(x) sum(bad_reviews[[x]] == 1, na.rm = TRUE)),
Actually_Used_Count = sapply(extra_services,
function(x) sum(bad_reviews[[x]] > 0, na.rm = TRUE)),
Total_Bad_Reviews = nrow(bad_reviews)
) %>%
mutate(
Catastrophic_Failure_Rate = round(Catastrophic_Failure_Count / Actually_Used_Count * 100, 1),
Service_Usage_Rate = round(Actually_Used_Count / Total_Bad_Reviews * 100, 1),
Service_Label = case_when(
Service == "Wifi...Connectivity" ~ "WiFi & Connectivity",
Service == "Inflight.Entertainment" ~ "Inflight Entertainment",
Service == "Food...Beverages" ~ "Food & Beverages"
),
Service_Label = factor(Service_Label,
levels = c("WiFi & Connectivity", "Inflight Entertainment", "Food & Beverages"))
) %>%
arrange(desc(Catastrophic_Failure_Rate))
# Create HORIZONTAL bar plot showing only catastrophic failures (rating = 1)
ggplot(catastrophic_analysis, aes(x = Catastrophic_Failure_Rate, y = reorder(Service_Label, Catastrophic_Failure_Rate))) +
geom_bar(stat = "identity", fill = ryanair_yellow, alpha = 0.9, width = 0.7) +
# Add catastrophic failure rate labels
geom_text(aes(label = paste0(Catastrophic_Failure_Rate, "%")),
hjust = -0.2, size = 5, fontface = "bold", color = ryanair_blue) +
# Add absolute count annotations showing rating=1 vs total users
geom_text(aes(x = Catastrophic_Failure_Rate / 2,
label = paste0(Catastrophic_Failure_Count, " / ", Actually_Used_Count, " users")),
color = "white", size = 3.5, fontface = "bold") +
labs(title = "Extra Services: Catastrophic Failure Analysis (Rating = 1 Only)",
subtitle = "Percentage of service users who gave worst possible rating (1/5) in bad reviews",
x = "Catastrophic Failure Rate (%)",
y = "Extra Service Category",
caption = "Analysis limited to bad reviews (Overall Rating <= 5/10)\nNumbers show: Rating=1 count / Total users of service") +
theme_minimal() +
scale_x_continuous(limits = c(0, max(catastrophic_analysis$Catastrophic_Failure_Rate) * 1.15),
expand = expansion(mult = c(0, 0.05))) +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
plot.subtitle = element_text(color = "darkgray", size = 12),
plot.caption = element_text(size = 10, color = "darkgray", hjust = 0),
axis.text.y = element_text(size = 11, color = "black"),
axis.text.x = element_text(size = 10, color = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white", color = "lightgray", linewidth = 0.5)
)
# Recreate necessary objects for this section
top_driver <- head(regression_results[order(-regression_results$R_Squared), ], 1)
secondary_drivers <- regression_results %>%
arrange(desc(R_Squared)) %>%
slice(2:3)
top_bad_review_issues <- head(corrected_bad_analysis, 3)
cat("### Statistical Confidence in Service Drivers\n")
While correlation identifies relationships, linear regression provides statistical significance testing and quantifies the exact impact of each service on overall satisfaction. This allows us to answer critical business questions about the magnitude and reliability of each service’s influence.
Primary Driver of Customer Satisfaction
Value.For.Money is the primary driver of customer satisfaction, explaining 77.5% of overall satisfaction variance. Each 1-point improvement in Value.For.Money increases Overall Rating by 1.854 points, with extremely high statistical significance (p < 0.001).
Secondary Critical Drivers
Two additional services show strong predictive power for overall satisfaction: - Cabin.Staff.Service: Explains 57.3% variance, +1.812 points per 1-point improvement - Seat.Comfort: Explains 53.8% variance, +2.069 points per 1-point improvement
cat("### Insights from Bad Review Analysis\n")
cat("**Executive Summary:**\n")
Executive Summary:
cat("While WiFi and Entertainment show near-universal failure (98%), they have limited impact on driving bad reviews (3% variance). The real drivers are Value Perception (38%), Seat Comfort (13%), and Cabin Staff (10%).\n\n")
While WiFi and Entertainment show near-universal failure (98%), they have limited impact on driving bad reviews (3% variance). The real drivers are Value Perception (38%), Seat Comfort (13%), and Cabin Staff (10%).
top_bad_review_issues <- head(corrected_bad_analysis, 3)
cat("**Services with Highest Actual Failure Rates in Bad Reviews:**\n")
Services with Highest Actual Failure Rates in Bad Reviews:
for(i in 1:nrow(top_bad_review_issues)) {
service <- top_bad_review_issues$Service[i]
failure_rate <- top_bad_review_issues$Actual_Failure_Rate[i]
usage_rate <- top_bad_review_issues$Service_Usage_Rate[i]
cat(i, ". **", service, "**: ", failure_rate, "% failure rate (used by ", usage_rate, "% of dissatisfied customers)\n", sep = "")
}
Note: This is to understand extreme failure patterns observed in Ryanair’s paid services. As some addtional insight into the nature of these failures. Not related to overall satisfaction drivers, but could be useful for service strategy.
# Analyze the catastrophic failure patterns for WiFi and Entertainment
cat("### WiFi and Entertainment Service Failure Analysis\n\n")
# Calculate failure statistics
wifi_failure_rate <- round(sum(ryanair_clean$Wifi...Connectivity == 1, na.rm = TRUE) /
sum(ryanair_clean$Wifi...Connectivity > 0, na.rm = TRUE) * 100, 1)
entertainment_failure_rate <- round(sum(ryanair_clean$Inflight.Entertainment == 1, na.rm = TRUE) /
sum(ryanair_clean$Inflight.Entertainment > 0, na.rm = TRUE) * 100, 1)
# Calculate usage patterns in bad reviews vs all reviews
bad_reviews <- ryanair_clean[ryanair_clean$Overall.Rating <= 5, ]
wifi_usage_bad <- round(sum(bad_reviews$Wifi...Connectivity > 0, na.rm = TRUE) / nrow(bad_reviews) * 100, 1)
wifi_usage_all <- round(sum(ryanair_clean$Wifi...Connectivity > 0, na.rm = TRUE) / nrow(ryanair_clean) * 100, 1)
entertainment_usage_bad <- round(sum(bad_reviews$Inflight.Entertainment > 0, na.rm = TRUE) / nrow(bad_reviews) * 100, 1)
entertainment_usage_all <- round(sum(ryanair_clean$Inflight.Entertainment > 0, na.rm = TRUE) / nrow(ryanair_clean) * 100, 1)
wifi_ratio <- round(wifi_usage_bad / wifi_usage_all, 1)
entertainment_ratio <- round(entertainment_usage_bad / entertainment_usage_all, 1)
# Display the usage pattern findings
cat("**Usage Pattern Analysis: Bad Reviews vs All Reviews**\n")
Usage Pattern Analysis: Bad Reviews vs All Reviews
cat("WiFi Connectivity:\n")
WiFi Connectivity:
cat("- Usage in bad reviews: ", wifi_usage_bad, "%\n", sep = "")
cat("- Usage in all reviews: ", wifi_usage_all, "%\n", sep = "")
cat("- Ratio: ", wifi_ratio, "x more likely in bad reviews\n\n", sep = "")
cat("Inflight Entertainment:\n")
Inflight Entertainment:
cat("- Usage in bad reviews: ", entertainment_usage_bad, "%\n", sep = "")
cat("- Usage in all reviews: ", entertainment_usage_all, "%\n", sep = "")
cat("- Ratio: ", entertainment_ratio, "x more likely in bad reviews\n\n", sep = "")
# Create visualization of rating distributions
wifi_entertainment_data <- data.frame(
Service = rep(c("WiFi Connectivity", "Inflight Entertainment"), each = 5),
Rating = rep(1:5, 2),
Count = c(251, 7, 6, 2, 2, 299, 20, 7, 2, 3)
)
ggplot(wifi_entertainment_data, aes(x = factor(Rating), y = Count, fill = Service)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("WiFi Connectivity" = ryanair_blue, "Inflight Entertainment" = ryanair_yellow)) +
labs(title = "Catastrophic Failure Patterns: WiFi and Entertainment Services",
subtitle = "Near-universal lowest ratings indicate fundamental service breakdown",
x = "Rating (1-5 scale)",
y = "Number of Ratings",
fill = "Service") +
theme_minimal() +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
plot.subtitle = element_text(color = "darkgray", size = 12),
legend.position = "bottom"
) +
geom_text(aes(label = Count), position = position_dodge(width = 0.9),
vjust = -0.5, size = 3.5)
Insights: WiFi and Entertainment Service Failure
The quantitative analysis reveals a service quality crisis for Ryanair’s paid offerings. WiFi Connectivity shows a catastrophic failure pattern with 94% of users (251 out of 268) giving the lowest possible rating of 1 out of 5. Similarly, Inflight Entertainment demonstrates near-identical failure rates with 90% of users (299 out of 331) assigning the minimum rating. This extreme rating concentration—where virtually all users agree on failure—signals fundamental service breakdown rather than random quality variation.
The usage patterns reveal an alarming trend: dissatisfied customers are 1.5 times more likely to use WiFi services (19.1% usage in bad reviews vs 12.6% overall) and 1.5 times more likely to use entertainment services (22.9% usage in bad reviews vs 15.6% overall). This suggests that customers experiencing poor core services use paid offerings, and encounter further disappointment.
The strategic implication is clear: these paid services are not merely underperforming but are actively damaging customer satisfaction. With dissatisfied customers being more likely to use these services and then experiencing near-universal disappointment, Ryanair faces a critical decision point. Either substantial investment is required to completely overhaul these services to reliable standards, or they should think up other ways of offerings to prevent further reputation damage from services that consistently fail to meet basic customer expectations.
# Additional statistical summary
cat("**Statistical Summary of Service Failures**\n")
Statistical Summary of Service Failures
failure_summary <- data.frame(
Service = c("WiFi Connectivity", "Inflight Entertainment"),
Total_Users = c(268, 331),
Lowest_Rating_Count = c(251, 299),
Failure_Rate = c(paste0(wifi_failure_rate, "%"), paste0(entertainment_failure_rate, "%")),
Average_Rating = c(
round(mean(ryanair_clean$Wifi...Connectivity[ryanair_clean$Wifi...Connectivity > 0], na.rm = TRUE), 2),
round(mean(ryanair_clean$Inflight.Entertainment[ryanair_clean$Inflight.Entertainment > 0], na.rm = TRUE), 2)
)
)
kable(failure_summary, caption = "Quantitative Service Failure Analysis")
| Service | Total_Users | Lowest_Rating_Count | Failure_Rate | Average_Rating |
|---|---|---|---|---|
| WiFi Connectivity | 268 | 251 | 93.7% | 1.12 |
| Inflight Entertainment | 331 | 299 | 90.3% | 1.16 |
library(tidyverse)
library(cluster)
library(factoextra)
library(gridExtra)
library(fmsb)
library(reshape2)
# Load the cleaned dataset
ryanair_clean <- read.csv("ryanair_reviews_cleaned.csv")
# Display basic dataset info
cat("## Dataset Overview\n")
cat("Total reviews:", nrow(ryanair_clean), "\n")
Total reviews: 2119
# Check for data issues in categorical variables
cat("### Data Quality Check\n")
cat("Type.Of.Traveller unique values:\n")
Type.Of.Traveller unique values:
print(unique(ryanair_clean$Type.Of.Traveller))
[1] “Family Leisure” “Couple Leisure” “Solo Leisure” “Business”
[5] “Unknown”
# Create cleaned traveler type variable with "Unknown" for missing values
ryanair_clean <- ryanair_clean %>%
mutate(
cleaned_traveller_type = case_when(
grepl("business|Business", Type.Of.Traveller) ~ "Business",
grepl("family|Family", Type.Of.Traveller) ~ "Family",
grepl("solo|Solo", Type.Of.Traveller) ~ "Solo Leisure",
grepl("couple|Couple", Type.Of.Traveller) ~ "Couples Leisure",
is.na(Type.Of.Traveller) | Type.Of.Traveller == "" ~ "Unknown",
TRUE ~ "Unknown"
)
)
# Select service dimension variables (excluding Overall.Rating to avoid dominance)
service_vars <- c("Seat.Comfort", "Cabin.Staff.Service", "Food...Beverages",
"Ground.Service", "Value.For.Money")
# Prepare data for clustering - SCALE the data for distance-based algorithms
clustering_data <- ryanair_clean %>%
select(all_of(service_vars)) %>%
scale() %>% # Standardize variables
as.data.frame()
cat("### Clustering Dataset Summary\n")
cat("Service Variables:", paste(service_vars, collapse = ", "), "\n")
Service Variables: Seat.Comfort, Cabin.Staff.Service, Food…Beverages, Ground.Service, Value.For.Money
cat("Observations:", nrow(clustering_data), "\n")
Observations: 2119
# Calculate WSS first for annotation
wss <- map_dbl(1:10, function(k) {
kmeans(clustering_data, centers = k, nstart = 25)$tot.withinss
})
# Elbow method with clear indication for k=3
p1 <- fviz_nbclust(clustering_data, kmeans, method = "wss") +
geom_vline(xintercept = 3, linetype = "dashed", color = "red") +
annotate("text", x = 3.2, y = max(wss)*0.9, label = "Elbow at k=3",
color = "red", hjust = 0) +
labs(title = "K-means: Optimal Number of Clusters (Elbow Method)") +
theme_minimal()
# Silhouette method
p2 <- fviz_nbclust(clustering_data, kmeans, method = "silhouette") +
labs(title = "K-means: Optimal Number of Clusters (Silhouette Method)") +
theme_minimal()
# Display both plots
grid.arrange(p1, p2, ncol = 2)
set.seed(123)
kmeans_result <- kmeans(clustering_data, centers = 3, nstart = 25)
# Add cluster assignments
ryanair_segmented <- ryanair_clean %>%
mutate(kmeans_cluster = as.factor(kmeans_result$cluster))
cat("### K-means Clustering Results\n")
cat("Cluster sizes:\n")
Cluster sizes:
table(kmeans_result$cluster) %>% knitr::kable()
| Var1 | Freq |
|---|---|
| 1 | 487 |
| 2 | 1155 |
| 3 | 477 |
# Calculate distance matrix and perform clustering
dist_matrix <- dist(clustering_data, method = "euclidean")
hclust_result <- hclust(dist_matrix, method = "ward.D2")
# Dendrogram with reduced label overlap
fviz_dend(hclust_result, k = 3,
k_colors = c("#073590", "#FFD200", "#2E8B57"),
color_labels_by_k = TRUE,
rect = TRUE,
show_labels = FALSE, # Remove overlapping bottom labels
main = "Hierarchical Clustering Dendrogram",
xlab = "Observations", ylab = "Height")
# Cut tree and add assignments
hierarchical_clusters <- cutree(hclust_result, k = 3)
ryanair_segmented <- ryanair_segmented %>%
mutate(hierarchical_cluster = as.factor(hierarchical_clusters))
cluster_profiles <- ryanair_segmented %>%
group_by(kmeans_cluster) %>%
summarize(
n_customers = n(),
percent_total = round(n() / nrow(ryanair_segmented) * 100, 1),
# Service ratings (what we clustered on)
avg_seat_comfort = round(mean(Seat.Comfort, na.rm = TRUE), 2),
avg_cabin_staff = round(mean(Cabin.Staff.Service, na.rm = TRUE), 2),
avg_food_beverages = round(mean(Food...Beverages, na.rm = TRUE), 2),
avg_ground_service = round(mean(Ground.Service, na.rm = TRUE), 2),
avg_value_money = round(mean(Value.For.Money, na.rm = TRUE), 2),
# Overall rating (outcome for validation)
avg_overall_rating = round(mean(Overall.Rating, na.rm = TRUE), 2),
# Traveler types (using cleaned data with "Unknown")
business_pct = round(sum(cleaned_traveller_type == "Business") / n() * 100, 1),
family_pct = round(sum(cleaned_traveller_type == "Family") / n() * 100, 1),
solo_pct = round(sum(cleaned_traveller_type == "Solo Leisure") / n() * 100, 1),
couples_pct = round(sum(cleaned_traveller_type == "Couples Leisure") / n() * 100, 1),
unknown_pct = round(sum(cleaned_traveller_type == "Unknown") / n() * 100, 1),
# CORRECTED Seat type distribution - all 4 categories
first_class_pct = round(sum(grepl("First Class", Seat.Type)) / n() * 100, 1),
business_class_pct = round(sum(grepl("Business Class", Seat.Type)) / n() * 100, 1),
premium_economy_pct = round(sum(grepl("Premium Economy", Seat.Type)) / n() * 100, 1),
economy_class_pct = round(sum(grepl("Economy Class", Seat.Type)) / n() * 100, 1),
total_premium_pct = round(sum(grepl("First Class|Business Class|Premium Economy", Seat.Type)) / n() * 100, 1),
# Service usage patterns
wifi_usage_pct = round(mean(1 - Wifi_Connectivity_NotRated) * 100, 1),
entertainment_usage_pct = round(mean(1 - Inflight_Entertainment_NotRated) * 100, 1)
) %>%
mutate(
# Name segments based on actual patterns
segment_name = case_when(
kmeans_cluster == 1 ~ "Satisfied Service-Experienced",
kmeans_cluster == 2 ~ "Highly Dissatisfied Customers",
kmeans_cluster == 3 ~ "Value-Satisfied but Ground Service Critics"
)
)
cat("### Customer Segment Profiles\n")
cluster_profiles %>% knitr::kable()
| kmeans_cluster | n_customers | percent_total | avg_seat_comfort | avg_cabin_staff | avg_food_beverages | avg_ground_service | avg_value_money | avg_overall_rating | business_pct | family_pct | solo_pct | couples_pct | unknown_pct | first_class_pct | business_class_pct | premium_economy_pct | economy_class_pct | total_premium_pct | wifi_usage_pct | entertainment_usage_pct | segment_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 487 | 23.0 | 3.39 | 4.20 | 2.67 | 4.17 | 4.57 | 8.06 | 8.6 | 21.8 | 39.4 | 30.2 | 0.0 | 0.4 | 0.2 | 0.4 | 99.0 | 1.0 | 4.5 | 6.8 | Satisfied Service-Experienced |
| 2 | 1155 | 54.5 | 1.53 | 1.77 | 1.60 | 1.19 | 1.51 | 1.63 | 6.9 | 20.9 | 26.5 | 34.6 | 11.1 | 0.0 | 0.1 | 1.1 | 98.8 | 1.2 | 20.6 | 24.4 | Highly Dissatisfied Customers |
| 3 | 477 | 22.5 | 3.45 | 4.02 | 2.39 | 1.12 | 4.23 | 7.29 | 0.8 | 5.7 | 10.5 | 8.4 | 74.6 | 0.0 | 0.2 | 0.0 | 99.8 | 0.2 | 1.7 | 3.4 | Value-Satisfied but Ground Service Critics |
This table is used for the plots and interpretation.
We’ve identified three distinct customer segments based on their service experience patterns. Here’s what each segment tells us:
🔴 Cluster 2: Highly Dissatisfied Customers (54.5% - MAJOR CONCERN)
Largest segment representing over half of our customer base
Extremely poor experiences across all services (ratings ~1.5/5)
Critical dissatisfaction with overall experience (1.63/10)
Heavy users of premium services - highest WiFi (20.6%) and entertainment (24.4%) usage
Most concerning insight: Customers paying for premium services are our most dissatisfied segment
🟢 Cluster 1: Satisfied Service-Experienced (23%)
High satisfaction across all service dimensions (ratings 3.4-4.6/5)
Excellent overall experience (8.06/10)
Balanced traveler mix across business, solo, couples, and family types
Moderate premium service usage - represents our core satisfied customer base
🟡 Cluster 3: Value-Satisfied but Ground Service Critics (22.5%)
Good value perception and comfort ratings, but disastrous ground service (1.12/5)
Solid overall satisfaction (7.29/10) despite ground service issues
Minimal premium service usage - true budget-conscious flyers
Critical opportunity: Fixing ground service could significantly boost this segment’s satisfaction
Priority #1: Address the root causes of Cluster 2’s widespread dissatisfaction, especially given their high premium service usage
Quick Win: Improve ground service operations to convert Cluster 3 into fully satisfied customers
Retention Focus: Maintain the excellent service experience for Cluster 1 while exploring upsell opportunities
# 1. Cluster Centroids Profiling Plot
centers <- as.data.frame(kmeans_result$centers)
centers$cluster <- 1:3
centers.melt <- melt(centers, id.vars = "cluster")
profile_plot <- ggplot(centers.melt, aes(x = variable, y = value, group = cluster, color = as.factor(cluster))) +
geom_line(size = 1.2) +
geom_point(size = 3) +
scale_color_manual(values = c("#073590", "#FFD200", "#2E8B57"),
name = "Cluster",
labels = cluster_profiles$segment_name) +
labs(title = "Cluster Service Profiles: What Defines Each Segment",
subtitle = "Standardized centroid values for each service dimension",
x = "Service Dimension", y = "Standardized Centroid Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# 2. Overall Rating Distribution (Validation)
overall_plot <- ggplot(ryanair_segmented, aes(x = Overall.Rating, fill = kmeans_cluster)) +
geom_density(alpha = 0.6) +
scale_fill_manual(values = c("#073590", "#FFD200", "#2E8B57"),
name = "Segment",
labels = cluster_profiles$segment_name) +
labs(title = "Overall Rating Distribution by Segment",
subtitle = "Validates that service patterns correlate with overall satisfaction",
x = "Overall Rating", y = "Density") +
theme_minimal()
# 3. Traveler Type Composition
traveller_plot <- ryanair_segmented %>%
count(kmeans_cluster, cleaned_traveller_type) %>%
group_by(kmeans_cluster) %>%
mutate(percent = n / sum(n) * 100) %>%
ggplot(aes(x = kmeans_cluster, y = percent, fill = cleaned_traveller_type)) +
geom_bar(stat = "identity", position = "stack") +
scale_fill_manual(values = c("#073590", "#1E90FF", "#FFD200", "#32CD32", "#808080"),
name = "Traveler Type") +
labs(title = "Traveler Type Composition by Cluster",
x = "Cluster", y = "Percentage") +
theme_minimal() +
theme(plot.margin = margin(1, 1, 1, 1, "cm"))
# 4. Seat Type Distribution - ADD width only here
seat_plot <- ryanair_segmented %>%
mutate(
seat_category = case_when(
grepl("First Class", Seat.Type) ~ "First Class",
grepl("Business Class", Seat.Type) ~ "Business Class",
grepl("Premium Economy", Seat.Type) ~ "Premium Economy",
grepl("Economy Class", Seat.Type) ~ "Economy Class",
TRUE ~ "Unknown"
)
) %>%
count(kmeans_cluster, seat_category) %>%
group_by(kmeans_cluster) %>%
mutate(percent = n / sum(n) * 100) %>%
ggplot(aes(x = kmeans_cluster, y = percent, fill = seat_category)) +
geom_bar(stat = "identity", position = "stack", width = 0.8) + # ADD width HERE only
scale_fill_manual(values = c("#073590", "#1E90FF", "#FFD200", "#32CD32", "#808080"),
name = "Seat Type") +
labs(title = "Seat Type Distribution by Cluster - All 4 Categories",
x = "Cluster", y = "Percentage") +
theme_minimal() +
theme(plot.margin = margin(1, 1, 1, 1, "cm"),
plot.title = element_text(margin = margin(b = 10)))
# 5. Service Usage Patterns with Corrected Insight
usage_plot <- ryanair_segmented %>%
group_by(kmeans_cluster) %>%
summarize(
wifi_usage = mean(1 - Wifi_Connectivity_NotRated) * 100,
entertainment_usage = mean(1 - Inflight_Entertainment_NotRated) * 100
) %>%
pivot_longer(cols = c(wifi_usage, entertainment_usage),
names_to = "service", values_to = "usage_pct") %>%
ggplot(aes(x = kmeans_cluster, y = usage_pct, fill = service)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#073590", "#FFD200"),
name = "Service",
labels = c("WiFi", "Entertainment")) +
labs(title = "Premium Service Usage by Cluster - Key Insight",
subtitle = "Dissatisfied customers (Cluster 2) use WiFi/entertainment most\npotential disappointment with paid services",
x = "Cluster", y = "Usage Percentage") +
theme_minimal() +
theme(plot.margin = margin(1, 1, 1, 1, "cm"),
plot.title = element_text(margin = margin(b = 10)),
plot.subtitle = element_text(margin = margin(b = 10),
size = 10)) # Smaller font for subtitle
# Or use a shorter subtitle:
# subtitle = "Cluster 2 (dissatisfied) uses premium services most"
# Display all plots with adjusted layout
profile_plot
overall_plot
grid.arrange(traveller_plot, seat_plot, ncol = 2,
widths = c(1.2, 1),
top = "Demographic Analysis: Traveler Types and Seat Preferences") # Add overall title
usage_plot
Most balanced distribution across all traveler types
Highest business traveler percentage (8.6%) - corporate travelers appreciate the service quality
Strong family representation (21.8%) - families are having good experiences
Well-distributed leisure travel - appeals to all customer types
Highest couples leisure percentage (34.6%) - romantic trips are being ruined
Strong family presence (20.9%) - family vacations are disappointing
Moderate business travelers (6.9%) - even corporate travelers are unhappy
This is a universal problem affecting all traveler types
Massive data gap (74.6% Unknown) - we don’t know who these customers are
Very low business travelers (0.8%) - almost no corporate customers
Low family representation (5.7%) - families are avoiding this experience
Primarily leisure travelers who tolerate poor ground service for value
Critical Business Implications:
Couples & Families Are Most Vulnerable - romantic trips and family vacations are being severely impacted
Business Travelers Prefer Cluster 1 Experience - corporate customers gravitate toward good service
Cluster 3 Has Identity Crisis - we’re failing to capture who these budget customers are
No Traveler Type is Immune - dissatisfaction affects all customer segments
Actionable Recommendations:
Protect couples/family market - these are high-value leisure segments
Understand Cluster 3 demographics - fix data collection for budget segment
Leverage business traveler satisfaction - use as reference for service standards
Universal service overhaul needed - no traveler type is being adequately served
Bottom Line: Service quality issues are affecting all traveler types, but couples and families are disproportionately represented in the dissatisfied segment, threatening Ryanair’s leisure travel market position.
Note: The seat type is not insightful since around 99% are economy across all clusters.
cluster_comparison <- table(Kmeans = ryanair_segmented$kmeans_cluster,
Hierarchical = ryanair_segmented$hierarchical_cluster)
cat("### Cluster Assignment Comparison: K-means vs Hierarchical\n")
cluster_comparison %>% knitr::kable()
| 1 | 2 | 3 |
|---|---|---|
| 468 | 19 | 0 |
| 5 | 1129 | 21 |
| 44 | 53 | 380 |
agreement_rate <- sum(diag(cluster_comparison)) / sum(cluster_comparison)
cat("\n**Agreement rate between methods:**", round(agreement_rate * 100, 1), "%\n")
Agreement rate between methods: 93.3 %
final_recommendations <- cluster_profiles %>%
select(segment_name, n_customers, percent_total,
avg_overall_rating, avg_value_money, business_pct,
wifi_usage_pct, total_premium_pct, unknown_pct) %>% # CHANGED: premium_pct → total_premium_pct
mutate(
priority_level = case_when(
segment_name == "Highly Dissatisfied Customers" ~ "CRITICAL",
segment_name == "Value-Satisfied but Ground Service Critics" ~ "HIGH",
segment_name == "Satisfied Service-Experienced" ~ "MEDIUM"
),
key_actions = case_when(
segment_name == "Highly Dissatisfied Customers" ~ "Overhaul core service delivery, fix value perception, staff training",
segment_name == "Value-Satisfied but Ground Service Critics" ~ "Improve ground operations, maintain value proposition",
segment_name == "Satisfied Service-Experienced" ~ "Retention focus, loyalty program enhancement"
),
segment_characteristics = case_when(
segment_name == "Satisfied Service-Experienced" ~ "High ratings across services, balanced traveler types, moderate WiFi usage",
segment_name == "Highly Dissatisfied Customers" ~ "Extremely low ratings on all core services, highest WiFi usage",
segment_name == "Value-Satisfied but Ground Service Critics" ~ "Good value/comfort but terrible ground service, lowest WiFi usage"
)
) %>%
select(segment_name, n_customers, percent_total, priority_level, key_actions, segment_characteristics, everything())
cat("### Final Segment Definitions and Business Recommendations\n")
final_recommendations %>% knitr::kable()
| segment_name | n_customers | percent_total | priority_level | key_actions | segment_characteristics | avg_overall_rating | avg_value_money | business_pct | wifi_usage_pct | total_premium_pct | unknown_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Satisfied Service-Experienced | 487 | 23.0 | MEDIUM | Retention focus, loyalty program enhancement | High ratings across services, balanced traveler types, moderate WiFi usage | 8.06 | 4.57 | 8.6 | 4.5 | 1.0 | 0.0 |
| Highly Dissatisfied Customers | 1155 | 54.5 | CRITICAL | Overhaul core service delivery, fix value perception, staff training | Extremely low ratings on all core services, highest WiFi usage | 1.63 | 1.51 | 6.9 | 20.6 | 1.2 | 11.1 |
| Value-Satisfied but Ground Service Critics | 477 | 22.5 | HIGH | Improve ground operations, maintain value proposition | Good value/comfort but terrible ground service, lowest WiFi usage | 7.29 | 4.23 | 0.8 | 1.7 | 0.2 | 74.6 |
Key Insight:
The analysis reveals three distinct customer segments with clear service experience patterns, validated by demographic and behavioral characteristics. The near-universal economy seat distribution confirms Ryanair’s position as a budget airline across all customer types.
The Real Problem Isn’t What You’d Expect We discovered a critical paradox: Our most dissatisfied customers (54.5% of base) are NOT primarily upset about WiFi/entertainment quality issues - they’re experiencing fundamental failures across ALL core services.
The Core Issue:
-Cluster 2 customers hate EVERYTHING - seat comfort (1.53/5), cabin staff (1.77/5), food (1.60/5), ground service (1.19/5), and value (1.51/5)
-This is a systemic service breakdown, not isolated premium service problems
-WiFi/entertainment usage is a symptom, not the cause - these customers try more services but find everything disappointing
Strategic Implications:
-Stop focusing on premium services as the main problem - they’re only 3% of the issue
-Fix the core service engine first: Value perception, seat comfort, and cabin staff service drive 61% of dissatisfaction
-Cluster 2 represents a broken customer experience - we’re fundamentally failing over half our customer base
Action Priority:
-CRITICAL: Overhaul core service delivery for Cluster 2 (54.5% of customers)
-HIGH: Fix ground service for Cluster 3 (quick satisfaction win)
-MEDIUM: Maintain excellence for Cluster 1 (our reference customers)
Bottom Line: We have a systemic service quality crisis affecting the majority of customers. Premium service fixes won’t solve this - we need to rebuild the core Ryanair experience from the ground up.