RyanAir Reviews: Correlation and Regression Analysis

1. Correlation Analysis: Identifying Key Drivers of Customer Satisfaction

1.1 Data Preparation for Correlation Analysis

In this section, we analyze the relationships between various service dimensions and overall customer satisfaction. The goal is to identify which service factors have the strongest influence on overall ratings, enabling targeted service improvements and resource allocation.

# Load the cleaned dataset
ryanair_clean <- read.csv("ryanair_reviews_cleaned.csv")

# Select numeric rating variables for correlation analysis
rating_vars <- c("Overall.Rating", "Seat.Comfort", "Cabin.Staff.Service", 
                "Food...Beverages", "Ground.Service", "Wifi...Connectivity", 
                "Inflight.Entertainment", "Value.For.Money")

# Create correlation dataset
corr_data <- ryanair_clean[rating_vars]

# Check data structure
cat("### Dataset for Correlation Analysis\n")

Dataset for Correlation Analysis

cat("Variables included:", paste(rating_vars, collapse = ", "), "\n")

Variables included: Overall.Rating, Seat.Comfort, Cabin.Staff.Service, Food…Beverages, Ground.Service, Wifi…Connectivity, Inflight.Entertainment, Value.For.Money

cat("Total observations:", nrow(corr_data), "\n\n")

Total observations: 2119

EDA: Overview of overall ratings over time

# Ryanair colors
ryanair_blue <- "#073590"
ryanair_yellow <- "#FFD200"

# Calculate count of ratings and average rating by year
rating_trends <- ryanair_clean %>%
  mutate(Review_Date = as.Date(Date.Published),
         Year = floor_date(Review_Date, "year")) %>%
  group_by(Year) %>%
  summarize(
    Rating_Count = n(),
    Avg_Rating = mean(Overall.Rating, na.rm = TRUE)
  )

# Create the combined plot
ggplot(rating_trends, aes(x = Year)) +
  # Bar plot for count of ratings
  geom_bar(aes(y = Rating_Count), stat = "identity", 
           fill = ryanair_blue, alpha = 0.7, width = 300) +
  # Line plot for average rating (on secondary axis) - CORRECTED for 1-10 scale
  geom_line(aes(y = Avg_Rating * (max(Rating_Count) / 10)), 
            color = ryanair_yellow, size = 1.5, group = 1) +
  geom_point(aes(y = Avg_Rating * (max(Rating_Count) / 10)), 
             color = ryanair_yellow, size = 3) +
  # Dual y-axes - CORRECTED for 1-10 scale
  scale_y_continuous(
    name = "Number of Reviews",
    sec.axis = sec_axis(~ . / (max(rating_trends$Rating_Count) / 10), 
                       name = "Average Overall Rating",
                       breaks = 1:10)  # Changed to 1:10
  ) +
  # Yearly date axis
  scale_x_date(
    date_breaks = "1 year",
    date_labels = "%Y",
    expand = expansion(mult = c(0.05, 0.05))
  ) +
  labs(
    title = "Review Volume and Average Rating Over Time",
    subtitle = "Bars show number of reviews per year, line shows average rating (1-10 scale)",  # Updated subtitle
    x = "Year"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
    plot.subtitle = element_text(color = "darkgray", size = 12),
    axis.title.y.left = element_text(color = ryanair_blue),
    axis.title.y.right = element_text(color = ryanair_yellow),
    axis.text.y.left = element_text(color = ryanair_blue),
    axis.text.y.right = element_text(color = ryanair_yellow),
    axis.text.x = element_text(size = 10),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "lightgray", linewidth = 0.5)
  )

# Calculate average overall rating by year - table only
yearly_ratings <- ryanair_clean %>%
  mutate(Review_Date = as.Date(Date.Published),
         Year = year(Review_Date)) %>%
  group_by(Year) %>%
  summarize(
    Avg_Overall_Rating = round(mean(Overall.Rating, na.rm = TRUE), 2),
    Review_Count = n(),
    .groups = 'drop'
  )

cat("### Average Overall Rating by Year (1-10 Scale)\n")

Average Overall Rating by Year (1-10 Scale)

yearly_ratings %>% knitr::kable()
Year Avg_Overall_Rating Review_Count
2012 4.15 20
2013 6.56 107
2014 7.01 261
2015 5.39 311
2016 6.13 174
2017 4.03 173
2018 3.05 286
2019 2.80 350
2020 3.11 110
2021 2.62 64
2022 2.88 126
2023 3.50 129
2024 6.62 8

1.2 Correlation Matrix (Selective Non-Usage Exclusion)

We calculate the correlation matrix where we exclude non-usage cases (ratings = 0) only for WiFi and Inflight Entertainment, while keeping all data for other services. This approach provides the most accurate picture of true service quality relationships.

# Create a copy of the correlation data
corr_data_selective <- ryanair_clean[rating_vars]

# Replace zeros with NA only for WiFi and Inflight Entertainment
wifi_entertainment_services <- c("Wifi...Connectivity", "Inflight.Entertainment")
for(service in wifi_entertainment_services) {
  corr_data_selective[[service]][corr_data_selective[[service]] == 0] <- NA
}

# Calculate correlation matrix using pairwise complete observations
correlation_matrix_selective <- cor(corr_data_selective, use = "pairwise.complete.obs")

# Display correlation matrix
cat("### Correlation Matrix: Service Ratings vs Overall Rating (Selective Non-Usage Exclusion)\n")

Correlation Matrix: Service Ratings vs Overall Rating (Selective Non-Usage Exclusion)

cat("**Note:** Excludes zero ratings only for WiFi and Inflight Entertainment. All other services use complete data.\n\n")

Note: Excludes zero ratings only for WiFi and Inflight Entertainment. All other services use complete data.

kable(round(correlation_matrix_selective, 3), caption = "Correlation Between Service Dimensions and Overall Rating (Selective Non-Usage Exclusion)")
Correlation Between Service Dimensions and Overall Rating (Selective Non-Usage Exclusion)
Overall.Rating Seat.Comfort Cabin.Staff.Service Food…Beverages Ground.Service Wifi…Connectivity Inflight.Entertainment Value.For.Money
Overall.Rating 1.000 0.734 0.757 0.413 0.539 0.339 0.429 0.880
Seat.Comfort 0.734 1.000 0.713 0.351 0.407 0.404 0.444 0.699
Cabin.Staff.Service 0.757 0.713 1.000 0.454 0.487 0.292 0.366 0.706
Food…Beverages 0.413 0.351 0.454 1.000 0.307 0.420 0.530 0.361
Ground.Service 0.539 0.407 0.487 0.307 1.000 0.368 0.307 0.535
Wifi…Connectivity 0.339 0.404 0.292 0.420 0.368 1.000 0.794 0.333
Inflight.Entertainment 0.429 0.444 0.366 0.530 0.307 0.794 1.000 0.365
Value.For.Money 0.880 0.699 0.706 0.361 0.535 0.333 0.365 1.000
# Create visualization
corrplot(correlation_matrix_selective, method = "color", type = "upper", 
         order = "hclust", tl.cex = 0.8, tl.col = "black",
         title = "Correlation Matrix: Service Ratings vs Overall Satisfaction\n(Excluding Non-Usage for WiFi/Entertainment Only)",
         mar = c(0,0,2,0))

1.3 Top Factors Correlating with Overall Rating (Selective Non-Usage Exclusion)

# Extract correlations with Overall.Rating
overall_correlations_selective <- correlation_matrix_selective["Overall.Rating", ]
overall_correlations_selective <- overall_correlations_selective[names(overall_correlations_selective) != "Overall.Rating"]

# Create ranked correlation table
correlation_rank_selective <- data.frame(
  Service = names(overall_correlations_selective),
  Correlation = round(overall_correlations_selective, 3),
  Absolute_Correlation = round(abs(overall_correlations_selective), 3),
  Rank = rank(-abs(overall_correlations_selective))
) %>%
  arrange(desc(Absolute_Correlation))

cat("### Service Factors Ranked by Correlation with Overall Rating (Selective Non-Usage Exclusion)\n")

Service Factors Ranked by Correlation with Overall Rating (Selective Non-Usage Exclusion)

kable(correlation_rank_selective, caption = "Top Drivers of Overall Customer Satisfaction (Selective Non-Usage Exclusion)")
Top Drivers of Overall Customer Satisfaction (Selective Non-Usage Exclusion)
Service Correlation Absolute_Correlation Rank
Value.For.Money Value.For.Money 0.880 0.880 1
Cabin.Staff.Service Cabin.Staff.Service 0.757 0.757 2
Seat.Comfort Seat.Comfort 0.734 0.734 3
Ground.Service Ground.Service 0.539 0.539 4
Inflight.Entertainment Inflight.Entertainment 0.429 0.429 5
Food…Beverages Food…Beverages 0.413 0.413 6
Wifi…Connectivity Wifi…Connectivity 0.339 0.339 7

1.4 Visualization: Correlation Strength

# Ryanair colors
ryanair_blue <- "#073590"
ryanair_yellow <- "#FFD200"

# Prepare data for visualization
cor_viz <- correlation_rank_selective %>%
  mutate(
    Service = factor(Service, levels = Service[order(Absolute_Correlation)]),
    Direction = ifelse(Correlation > 0, "Positive", "Negative")
  )

ggplot(cor_viz, aes(x = Service, y = Absolute_Correlation, fill = Direction)) +
  geom_bar(stat = "identity", alpha = 0.9) +
  scale_fill_manual(values = c("Negative" = ryanair_yellow, "Positive" = ryanair_blue)) +
  geom_text(aes(label = paste0(round(Correlation, 3))), 
            hjust = -0.2, size = 4, color = ryanair_blue) +
  coord_flip() +
  labs(title = "Key Drivers of RyanAir Customer Satisfaction",
       subtitle = "Service factors ranked by correlation strength with Overall Rating (Selective Non-Usage Exclusion)",
       x = "Service Dimension",
       y = "Absolute Correlation Coefficient",
       fill = "Correlation Direction") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 1), expand = expansion(mult = c(0, 0.1))) +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
    plot.subtitle = element_text(color = "darkgray", size = 12),
    legend.position = "bottom"
  )

1.5 Linear Regression Analysis: Statistical Significance and Effect Sizes

While correlation identifies relationships, linear regression provides statistical significance testing and quantifies the exact impact of each service on overall satisfaction. This allows us to answer critical business questions about the magnitude and reliability of each service’s influence.

cat("### Linear Regression Analysis: Quantifying Service Impact\n\n")

Linear Regression Analysis: Quantifying Service Impact

# Individual regression models for each service
regression_results <- data.frame()

for(service in correlation_rank_selective$Service) {
  # Create formula for individual regression
  formula <- as.formula(paste("Overall.Rating ~", service))
  
  # Fit linear model
  model <- lm(formula, data = corr_data_selective)
  
  # Extract model summary
  model_summary <- summary(model)
  coefficients <- tidy(model)
  
  # Extract service coefficient (excluding intercept)
  service_coef <- coefficients[coefficients$term == service, ]
  
  regression_results <- rbind(regression_results, data.frame(
    Service = service,
    Coefficient = round(service_coef$estimate, 3),
    Std_Error = round(service_coef$std.error, 3),
    T_Value = round(service_coef$statistic, 3),
    P_Value = format.pval(service_coef$p.value, digits = 3),
    R_Squared = round(model_summary$r.squared, 3),
    Adj_R_Squared = round(model_summary$adj.r.squared, 3),
    Significant = ifelse(service_coef$p.value < 0.001, "***",
                        ifelse(service_coef$p.value < 0.01, "**",
                              ifelse(service_coef$p.value < 0.05, "*", "No")))
  ))
}

cat("**Individual Regression Models: Impact of Each Service on Overall Rating**\n")

Individual Regression Models: Impact of Each Service on Overall Rating

kable(regression_results, caption = "Linear Regression Results for Each Service Dimension")
Linear Regression Results for Each Service Dimension
Service Coefficient Std_Error T_Value P_Value R_Squared Adj_R_Squared Significant
Value.For.Money 1.854 0.022 85.371 <2e-16 0.775 0.775 ***
Cabin.Staff.Service 1.812 0.034 53.336 <2e-16 0.573 0.573 ***
Seat.Comfort 2.069 0.042 49.721 <2e-16 0.539 0.538 ***
Ground.Service 1.386 0.047 29.432 <2e-16 0.290 0.290 ***
Inflight.Entertainment 1.696 0.197 8.606 3.16e-16 0.184 0.181 ***
Food…Beverages 1.321 0.063 20.850 <2e-16 0.170 0.170 ***
Wifi…Connectivity 1.275 0.217 5.878 1.24e-08 0.115 0.112 ***

1.6 Multiple Regression: Controlling for All Services

A multiple regression model allows us to understand the unique contribution of each service while controlling for all other services. This identifies which services have independent predictive power beyond their correlations with other services.

cat("### Multiple Regression Analysis: Independent Contributions\n\n")

Multiple Regression Analysis: Independent Contributions

# Prepare data for multiple regression (remove rows with any NA)
multiple_reg_data <- corr_data_selective[complete.cases(corr_data_selective), ]

# Fit multiple regression model
multiple_formula <- as.formula(paste("Overall.Rating ~", 
                                    paste(correlation_rank_selective$Service, collapse = " + ")))
multiple_model <- lm(multiple_formula, data = multiple_reg_data)

# Display multiple regression results
multiple_summary <- summary(multiple_model)
multiple_coef <- tidy(multiple_model)

cat("**Multiple Regression Model Summary**\n")

Multiple Regression Model Summary

cat("- R-squared:", round(multiple_summary$r.squared, 3), "\n")
  • R-squared: 0.677
cat("- Adjusted R-squared:", round(multiple_summary$adj.r.squared, 3), "\n")
  • Adjusted R-squared: 0.668
cat("- F-statistic:", round(multiple_summary$fstatistic[1], 1), "\n")
  • F-statistic: 73.2
cat("- P-value:", format.pval(pf(multiple_summary$fstatistic[1], 
                                multiple_summary$fstatistic[2], 
                                multiple_summary$fstatistic[3], 
                                lower.tail = FALSE), digits = 3), "\n\n")
  • P-value: <2e-16
# Display coefficients (excluding intercept)
multiple_coef_display <- multiple_coef[-1, ] %>%  # Remove intercept
  mutate(
    term = gsub("`", "", term),
    estimate = round(estimate, 3),
    std.error = round(std.error, 3),
    statistic = round(statistic, 3),
    p.value = format.pval(p.value, digits = 3),
    Significance = ifelse(p.value < 0.001, "***",
                         ifelse(p.value < 0.01, "**",
                               ifelse(p.value < 0.05, "*", "No")))
  ) %>%
  rename(Service = term, Coefficient = estimate, Std_Error = std.error, 
         T_Value = statistic, P_Value = p.value)

cat("**Multiple Regression Coefficients**\n")

Multiple Regression Coefficients

kable(multiple_coef_display, caption = "Independent Contributions of Each Service (Controlling for All Others)")
Independent Contributions of Each Service (Controlling for All Others)
Service Coefficient Std_Error T_Value P_Value Significance
Value.For.Money 0.673 0.093 7.219 6.60e-12 No
Cabin.Staff.Service 0.065 0.093 0.700 0.484 No
Seat.Comfort 0.115 0.115 0.999 0.319 No
Ground.Service 0.900 0.121 7.443 1.68e-12 No
Inflight.Entertainment -0.139 0.275 -0.507 0.613 No
Food…Beverages 0.070 0.131 0.535 0.593 No
Wifi…Connectivity 0.060 0.273 0.220 0.826 No

1.7 Multicollinearity Analysis

The discrepancy between individual and multiple regression results requires investigation. We check Variance Inflation Factor (VIF) to determine if multicollinearity is affecting our model.

# Check Variance Inflation Factor (VIF) to confirm multicollinearity
vif_values <- vif(multiple_model)
cat("\n**Variance Inflation Factor (VIF) for Multicollinearity Check**\n")

Variance Inflation Factor (VIF) for Multicollinearity Check

vif_df <- data.frame(Service = names(vif_values), VIF = round(vif_values, 2))
kable(vif_df, caption = "VIF Values for Each Service Dimension")
VIF Values for Each Service Dimension
Service VIF
Value.For.Money Value.For.Money 2.25
Cabin.Staff.Service Cabin.Staff.Service 2.03
Seat.Comfort Seat.Comfort 1.98
Ground.Service Ground.Service 2.50
Inflight.Entertainment Inflight.Entertainment 2.94
Food…Beverages Food…Beverages 1.87
Wifi…Connectivity Wifi…Connectivity 3.03
# Interpret VIF results
cat("\n**VIF Interpretation:**\n")

VIF Interpretation:

cat("- VIF < 5: Moderate multicollinearity\n")
  • VIF < 5: Moderate multicollinearity
cat("- VIF > 5-10: High multicollinearity\n") 
  • VIF > 5-10: High multicollinearity
cat("- VIF > 10: Severe multicollinearity (model coefficients unreliable)\n")
  • VIF > 10: Severe multicollinearity (model coefficients unreliable)
# Check which services have problematic VIF
high_vif <- vif_df[vif_df$VIF > 5, ]
if(nrow(high_vif) > 0) {
  cat("\n**Services with High Multicollinearity (VIF > 5):**\n")
  for(i in 1:nrow(high_vif)) {
    cat("- ", high_vif$Service[i], ": VIF = ", high_vif$VIF[i], "\n", sep = "")
  }
  cat("\n**Conclusion:** The high VIF values confirm severe multicollinearity, explaining why all coefficients became non-significant in the multiple regression.\n")
} else {
  cat("\n**No severe multicollinearity detected (all VIF < 5)**\n")
  cat("- This means the multiple regression results are statistically reliable\n")
  cat("- The non-significance in multiple regression indicates that when controlling for all services, no single service has independent predictive power\n")
  cat("- This suggests customers evaluate services holistically rather than independently\n")
}

No severe multicollinearity detected (all VIF < 5) - This means the multiple regression results are statistically reliable - The non-significance in multiple regression indicates that when controlling for all services, no single service has independent predictive power - This suggests customers evaluate services holistically rather than independently

1.8 Scatter Plots with Combined Regression Lines

# Get top 3 services
top_3_services <- head(correlation_rank_selective$Service, 3)

# Prepare data for combined plot
combined_data_wide <- corr_data_selective %>%
  select(Overall.Rating, all_of(top_3_services))

# Create combined scatter plot with all three regression lines
ggplot(combined_data_wide, aes(x = Overall.Rating, y = Overall.Rating)) +
  # Value.For.Money points and line
  geom_point(aes(y = Value.For.Money), alpha = 0.3, color = ryanair_blue, size = 1) +
  geom_smooth(aes(y = Value.For.Money), method = "lm", se = FALSE, 
              color = ryanair_blue, size = 1.2, linetype = "solid") +
  
  # Cabin.Staff.Service points and line  
  geom_point(aes(y = Cabin.Staff.Service), alpha = 0.3, color = "#FF6B35", size = 1) +
  geom_smooth(aes(y = Cabin.Staff.Service), method = "lm", se = FALSE,
              color = "#FF6B35", size = 1.2, linetype = "solid") +
  
  # Seat.Comfort points and line
  geom_point(aes(y = Seat.Comfort), alpha = 0.3, color = "#4ECDC4", size = 1) +
  geom_smooth(aes(y = Seat.Comfort), method = "lm", se = FALSE,
              color = "#4ECDC4", size = 1.2, linetype = "solid") +
  
  labs(title = "Top 3 Service Factors vs Overall Satisfaction",
       subtitle = "All three key service dimensions plotted against overall customer ratings",
       x = "Overall Rating (1-10 scale)",
       y = "Service Rating (1-5 scale)") +
  theme_minimal() +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
    plot.subtitle = element_text(color = "darkgray", size = 12),
    legend.position = "none"
  ) +
  # Add legend manually
  annotate("text", x = Inf, y = Inf, 
           label = paste("Value.For.Money: R_Squared =", 
                        round(regression_results$R_Squared[regression_results$Service == "Value.For.Money"], 3)),
           hjust = 1.1, vjust = 2, color = ryanair_blue, size = 4) +
  annotate("text", x = Inf, y = Inf, 
           label = paste("Cabin.Staff.Service: R_Squared =", 
                        round(regression_results$R_Squared[regression_results$Service == "Cabin.Staff.Service"], 3)),
           hjust = 1.1, vjust = 3.5, color = "#FF6B35", size = 4) +
  annotate("text", x = Inf, y = Inf, 
           label = paste("Seat.Comfort: R_Squared =", 
                        round(regression_results$R_Squared[regression_results$Service == "Seat.Comfort"], 3)),
           hjust = 1.1, vjust = 5, color = "#4ECDC4", size = 4)

2. Bad Review Analysis: Understanding Service Failures

2.1 Corrected Bad Review Analysis Accounting for Service Usage

The initial bad review analysis included all ratings, but this can be misleading for services with high non-usage rates. To understand true service failures, we exclude non-usage cases by only considering ratings greater than 0. We define a poor rating as 2 or below on the 5-point scale, representing clear service dissatisfaction when the service was actually used.

# Define rating thresholds for analysis
POOR_RATING_THRESHOLD <- 2  # 2/5 or below indicates service failure
BAD_REVIEW_THRESHOLD <- 5   # 5/10 or below indicates bad overall experience

# Identify bad reviews
bad_reviews <- corr_data_selective[corr_data_selective$Overall.Rating <= BAD_REVIEW_THRESHOLD, ]

corrected_bad_analysis <- data.frame(
  Service = rating_vars[rating_vars != "Overall.Rating"],
  Poor_Rating_Count = sapply(rating_vars[rating_vars != "Overall.Rating"], 
                            function(x) sum(corr_data_selective[[x]] <= POOR_RATING_THRESHOLD & 
                                           corr_data_selective$Overall.Rating <= BAD_REVIEW_THRESHOLD & 
                                           corr_data_selective[[x]] > 0, na.rm = TRUE)),
  Actually_Used_Count = sapply(rating_vars[rating_vars != "Overall.Rating"],
                              function(x) sum(corr_data_selective[[x]] > 0 & 
                                             corr_data_selective$Overall.Rating <= BAD_REVIEW_THRESHOLD, na.rm = TRUE)),
  Total_Bad_Reviews = nrow(bad_reviews)
) %>%
  mutate(
    Actual_Failure_Rate = round(Poor_Rating_Count / Actually_Used_Count * 100, 1),
    Service_Usage_Rate = round(Actually_Used_Count / Total_Bad_Reviews * 100, 1),
    Severity_Rank = rank(-Actual_Failure_Rate)
  ) %>%
  arrange(desc(Actual_Failure_Rate))

cat("### Corrected Analysis: Actual Service Failure Rates (Excluding Non-Usage)\n\n")

Corrected Analysis: Actual Service Failure Rates (Excluding Non-Usage)

cat("**Service Failure Rates When Services Are Actually Used**\n")

Service Failure Rates When Services Are Actually Used

kable(corrected_bad_analysis, caption = "True Service Failure Rates Excluding Non-Usage Cases")
True Service Failure Rates Excluding Non-Usage Cases
Service Poor_Rating_Count Actually_Used_Count Total_Bad_Reviews Actual_Failure_Rate Service_Usage_Rate Severity_Rank
Inflight.Entertainment Inflight.Entertainment 288 294 1284 98.0 22.9 1
Wifi…Connectivity Wifi…Connectivity 238 245 1284 97.1 19.1 2
Food…Beverages Food…Beverages 1149 1222 1284 94.0 95.2 3
Ground.Service Ground.Service 1189 1284 1284 92.6 100.0 4
Seat.Comfort Seat.Comfort 1042 1281 1284 81.3 99.8 5
Value.For.Money Value.For.Money 1019 1281 1284 79.5 99.8 6
Cabin.Staff.Service Cabin.Staff.Service 832 1278 1284 65.1 99.5 7

2.2 Linear Regression for Bad Reviews

We now perform regression analysis specifically on bad reviews to understand which service failures have the strongest impact on driving overall dissatisfaction.

cat("### Linear Regression Analysis: Drivers of Bad Reviews\n\n")

Linear Regression Analysis: Drivers of Bad Reviews

# Individual regression models for each service on bad reviews only
bad_review_regression_results <- data.frame()

for(service in correlation_rank_selective$Service) {
  # Create formula for individual regression on bad reviews
  formula <- as.formula(paste("Overall.Rating ~", service))
  
  # Fit linear model only on bad reviews
  model <- lm(formula, data = bad_reviews)
  
  # Extract model summary
  model_summary <- summary(model)
  coefficients <- tidy(model)
  
  # Extract service coefficient (excluding intercept)
  service_coef <- coefficients[coefficients$term == service, ]
  
  bad_review_regression_results <- rbind(bad_review_regression_results, data.frame(
    Service = service,
    Coefficient = round(service_coef$estimate, 3),
    Std_Error = round(service_coef$std.error, 3),
    T_Value = round(service_coef$statistic, 3),
    P_Value = format.pval(service_coef$p.value, digits = 3),
    R_Squared = round(model_summary$r.squared, 3),
    Adj_R_Squared = round(model_summary$adj.r.squared, 3),
    Significant = ifelse(service_coef$p.value < 0.001, "***",
                        ifelse(service_coef$p.value < 0.01, "**",
                              ifelse(service_coef$p.value < 0.05, "*", "No")))
  ))
}

cat("**Individual Regression Models: Impact of Service Failures on Bad Reviews**\n")

Individual Regression Models: Impact of Service Failures on Bad Reviews

kable(bad_review_regression_results, caption = "Linear Regression Results for Bad Reviews Only")
Linear Regression Results for Bad Reviews Only
Service Coefficient Std_Error T_Value P_Value R_Squared Adj_R_Squared Significant
Value.For.Money 0.687 0.025 27.720 <2e-16 0.375 0.374 ***
Cabin.Staff.Service 0.343 0.029 12.038 <2e-16 0.102 0.101 ***
Seat.Comfort 0.478 0.035 13.642 <2e-16 0.127 0.126 ***
Ground.Service 0.510 0.045 11.443 <2e-16 0.093 0.092 ***
Inflight.Entertainment 0.343 0.155 2.217 0.0274 0.017 0.013 *
Food…Beverages 0.100 0.048 2.082 0.0376 0.003 0.003 *
Wifi…Connectivity 0.386 0.141 2.729 0.00682 0.030 0.026 **

Key Insights from Bad Review Regression:

  1. Value.For.Money Still Dominates but Less So

In bad reviews: 37.5% variance explained (vs 77.5% overall)

Coefficient: 0.687 (vs 1.854 overall)

Interpretation: Once customers are already dissatisfied, value perception becomes less dominant

  1. Physical Comfort Emerges as Critical

Seat.Comfort: Strong impact (0.478 coefficient, 12.7% variance)

Ground.Service: Surprisingly strong (0.510 coefficient, 9.3% variance)

Interpretation: For dissatisfied customers, physical discomfort and ground experience become major pain points

  1. WiFi and Entertainment Have REAL Impact in Bad Reviews

WiFi: 0.386 coefficient, statistically significant (**)

Entertainment: 0.343 coefficient, statistically significant (*)

Interpretation: These services actively contribute to pushing customers from neutral to negative

Business Insight:

The “Last Straw” Effect: While WiFi and Entertainment don’t drive overall satisfaction broadly, they serve as critical failure points that push already-unhappy customers over the edge. Customers experiencing core service failures (value, staff, comfort) then encounter broken paid services, creating a compounding dissatisfaction effect.

Strategic Implication: WiFi and Entertainment aren’t just “nice-to-haves” - they’re critical for service recovery. When core services fail, functioning WiFi/Entertainment could prevent the situation from escalating to a bad review. Their current catastrophic failure rates mean they’re actively making bad situations worse.

Updated Strategy:

Prevention: Fix core services (Value, Staff, Seats)

Containment: Ensure WiFi/Entertainment work reliably as backup options

Recovery: Use functioning digital services to mitigate other failures

Comparative Analysis: Failure Rates vs Business Impact

Regression reveals the causal drivers while failure rates show symptom frequency

# Create comparative visualization: Failure Rates vs Regression Impact
comparison_data <- data.frame(
  Service = corrected_bad_analysis$Service,
  Failure_Rate = corrected_bad_analysis$Actual_Failure_Rate,
  Service_Usage_Rate = corrected_bad_analysis$Service_Usage_Rate,
  Business_Impact = bad_review_regression_results$Coefficient[match(corrected_bad_analysis$Service, bad_review_regression_results$Service)],
  Variance_Explained = bad_review_regression_results$R_Squared[match(corrected_bad_analysis$Service, bad_review_regression_results$Service)] * 100
)

# Create left plot with usage information (wider)
p1 <- ggplot(comparison_data, aes(x = reorder(Service, Failure_Rate), y = Failure_Rate, 
                                 fill = ifelse(Service_Usage_Rate > 50, "High Usage (>50%)", "Low Usage (<=50%)"))) +
  geom_bar(stat = "identity", alpha = 0.9) +
  scale_fill_manual(values = c("High Usage (>50%)" = ryanair_blue, "Low Usage (<=50%)" = ryanair_yellow),
                    name = "Service Usage Level") +
  geom_text(aes(label = paste0(Failure_Rate, "%")), hjust = -0.2, size = 3.5, color = ryanair_blue) +
  geom_text(aes(label = paste0("(Used by ", Service_Usage_Rate, ")")), 
            hjust = -0.2, size = 3, color = "darkgray", vjust = 2) +
  coord_flip() +
  labs(title = "Service Failure Rates in Bad Reviews",
       subtitle = "Percentage of poor ratings (<=2/5) when service is actually used",
       x = "Service Dimension",
       y = "Failure Rate (%)") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 105), expand = expansion(mult = c(0, 0.1))) +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold", size = 14),
    plot.subtitle = element_text(color = "darkgray", size = 11),
    legend.position = "bottom",
    legend.box = "horizontal",
    legend.text = element_text(size = 10)
  )

# Create right plot with business impact and legend
p2 <- ggplot(comparison_data, aes(x = reorder(Service, Business_Impact), y = Business_Impact)) +
  geom_bar(stat = "identity", fill = ryanair_blue, alpha = 0.8) +
  geom_text(aes(label = round(Business_Impact, 2)), hjust = -0.2, size = 3.5, color = ryanair_blue) +
  geom_text(aes(label = paste0("(", round(Variance_Explained, 1), "%)")), 
            hjust = -0.2, size = 3, color = "darkgray", vjust = 2) +
  coord_flip() +
  labs(title = "Business Impact on Bad Reviews",
       subtitle = "Regression analysis: Coefficient (points) and R_Squared (%) shown",
       x = "Service Dimension",
       y = "Impact Coefficient",
       caption = "Numbers represent: Coefficient, (R_Squared %)\n- Coefficient: Points increase in Overall Rating per 1-point service improvement\n- R_Squared: Percentage of bad review variance explained") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 0.8), expand = expansion(mult = c(0, 0.1))) +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold", size = 14),
    plot.subtitle = element_text(color = "darkgray", size = 11),
    plot.caption = element_text(size = 9, color = "darkgray", hjust = 0)
  )

# Arrange plots with different widths (left wider, right narrower)
gridExtra::grid.arrange(p1, p2, ncol = 2, widths = c(0.55, 0.45),
                       top = "Critical Insight: Services with Highest Failure Rates vs Most Impactful on Satisfaction")

Key Comparative Insight

The side-by-side analysis reveals a crucial strategic distinction between service failure frequency and business impact. While Inflight Entertainment (98% failure rate) and WiFi Connectivity (97% failure rate) demonstrate near-universal service breakdown, their impact on driving overall dissatisfaction is relatively modest (3.0% and 1.7% variance explained respectively).

Conversely, Value.For.Money, despite a lower failure rate (79.5%), exerts the strongest influence on bad reviews, explaining 37.5% of dissatisfaction variance. Similarly, Seat Comfort and Cabin Staff Service show moderate failure rates (81.3% and 65.1%) but substantial business impact (12.7% and 10.2% variance explained).

This divergence highlights a critical data mining insight: Ryanair should prioritize resources based on business impact rather than failure frequency. The catastrophic WiFi and Entertainment failures, while alarming, represent containment issues rather than primary drivers of customer dissatisfaction. The strategic focus should remain on Value Perception, Seat Comfort, and Cabin Staff improvements, which actually determine whether core service experiences escalate into negative reviews.

# Create a combined ranking table
cat("**Strategic Priority Ranking: Failure Rates vs Business Impact**\n")

Strategic Priority Ranking: Failure Rates vs Business Impact

priority_ranking <- comparison_data %>%
  mutate(
    Failure_Rank = rank(-Failure_Rate),
    Impact_Rank = rank(-Business_Impact),
    Strategic_Priority = ifelse(Impact_Rank <= 3, "HIGH", 
                               ifelse(Failure_Rank <= 3, "MEDIUM", "LOW"))
  ) %>%
  select(Service, Failure_Rate, Business_Impact, Failure_Rank, Impact_Rank, Strategic_Priority) %>%
  arrange(Impact_Rank)

kable(priority_ranking, caption = "Service Prioritization Based on Combined Failure and Impact Analysis")
Service Prioritization Based on Combined Failure and Impact Analysis
Service Failure_Rate Business_Impact Failure_Rank Impact_Rank Strategic_Priority
Value.For.Money 79.5 0.687 6 1.0 HIGH
Ground.Service 92.6 0.510 4 2.0 HIGH
Seat.Comfort 81.3 0.478 5 3.0 HIGH
Wifi…Connectivity 97.1 0.386 2 4.0 MEDIUM
Inflight.Entertainment 98.0 0.343 1 5.5 MEDIUM
Cabin.Staff.Service 65.1 0.343 7 5.5 LOW
Food…Beverages 94.0 0.100 3 7.0 MEDIUM

Ranking Logic:

HIGH Priority: Services ranked in top 3 by Business Impact (Impact_Rank ≤ 3)

These services actually drive bad reviews the most

Fixing them will have the biggest reduction in dissatisfaction

MEDIUM Priority: Services ranked in top 3 by Failure Rate but NOT top 3 by Impact (Failure_Rank ≤ 3 AND Impact_Rank > 3)

These fail frequently but don’t drive bad reviews as much

Still important to fix, but less urgent than HIGH priority

LOW Priority: Everything else

Neither high failure rates nor high business impact

2.3 Visualization: Extra Services Catastrophic Failure Analysis

# Analysis of Extra Services Extreme Dissatisfaction (Rating = 1)
extra_services <- c("Wifi...Connectivity", "Inflight.Entertainment", "Food...Beverages")

# Calculate metrics for extra services in bad reviews
extra_services_analysis <- data.frame(
  Service = extra_services,
  Extreme_Dissat_Count = sapply(extra_services, 
                               function(x) sum(bad_reviews[[x]] == 1, na.rm = TRUE)),
  Moderate_Dissat_Count = sapply(extra_services,
                                function(x) sum(bad_reviews[[x]] == 2, na.rm = TRUE)),
  Neutral_Positive_Count = sapply(extra_services,
                                 function(x) sum(bad_reviews[[x]] >= 3, na.rm = TRUE)),
  Actually_Used_Count = sapply(extra_services,
                              function(x) sum(bad_reviews[[x]] > 0, na.rm = TRUE)),
  Total_Bad_Reviews = nrow(bad_reviews)
) %>%
  mutate(
    Extreme_Dissat_Rate = round(Extreme_Dissat_Count / Actually_Used_Count * 100, 1),
    Moderate_Dissat_Rate = round(Moderate_Dissat_Count / Actually_Used_Count * 100, 1),
    Service_Usage_Rate = round(Actually_Used_Count / Total_Bad_Reviews * 100, 1),
    Service_Label = case_when(
      Service == "Wifi...Connectivity" ~ "WiFi & Connectivity",
      Service == "Inflight.Entertainment" ~ "Inflight Entertainment", 
      Service == "Food...Beverages" ~ "Food & Beverages"
    )
  )

# Prepare data for stacked bar plot
stacked_data <- extra_services_analysis %>%
  select(Service_Label, Extreme_Dissat_Rate, Moderate_Dissat_Rate, Service_Usage_Rate) %>%
  pivot_longer(cols = c(Extreme_Dissat_Rate, Moderate_Dissat_Rate),
               names_to = "Dissatisfaction_Level",
               values_to = "Percentage") %>%
  mutate(
    Dissatisfaction_Level = factor(Dissatisfaction_Level,
                                  levels = c("Moderate_Dissat_Rate", "Extreme_Dissat_Rate"),
                                  labels = c("Moderate Dissatisfaction (Rating = 2)", 
                                           "Extreme Dissatisfaction (Rating = 1)")),
    Service_Label = factor(Service_Label, 
                          levels = c("WiFi & Connectivity", "Inflight Entertainment", "Food & Beverages"))
  )

# Create stacked bar plot with Ryanair colors
ggplot(stacked_data, aes(x = Service_Label, y = Percentage, fill = Dissatisfaction_Level)) +
  geom_bar(stat = "identity", alpha = 0.9) +
  scale_fill_manual(values = c("Extreme Dissatisfaction (Rating = 1)" = ryanair_yellow,
                              "Moderate Dissatisfaction (Rating = 2)" = ryanair_blue)) +
  # Add service usage rate annotations
  geom_text(data = extra_services_analysis,
            aes(x = Service_Label, y = 105, 
                label = paste0("Used by ", Service_Usage_Rate, "% of bad reviews")),
            color = "darkgray", size = 3.5, fontface = "italic", inherit.aes = FALSE) +
  # Add percentage labels on bars - using the main data that has Dissatisfaction_Level
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  labs(title = "Extra Services: Catastrophic Failure Analysis in Bad Reviews",
       subtitle = "Distribution of dissatisfaction levels when services are actually used by dissatisfied customers",
       x = "Extra Service Category",
       y = "Percentage of Service Users (%)",
       fill = "Dissatisfaction Level",
       caption = "Analysis limited to bad reviews (Overall Rating <= 5/10)\nBars show percentage of users who rated each service 1 or 2 (out of 5)") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 115), expand = expansion(mult = c(0, 0.05))) +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
    plot.subtitle = element_text(color = "darkgray", size = 12),
    plot.caption = element_text(size = 10, color = "darkgray", hjust = 0),
    axis.text.x = element_text(size = 11, color = "black"),
    axis.text.y = element_text(size = 10, color = "black"),
    legend.position = "bottom",
    legend.title = element_text(face = "bold", size = 10),
    legend.text = element_text(size = 9),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "lightgray", linewidth = 0.5)
  ) +
  guides(fill = guide_legend(reverse = TRUE))

# Analysis of Extra Services Catastrophic Failure (Rating = 1 only)
extra_services <- c("Wifi...Connectivity", "Inflight.Entertainment", "Food...Beverages")

# Calculate metrics for catastrophic failures only (rating = 1)
catastrophic_analysis <- data.frame(
  Service = extra_services,
  Catastrophic_Failure_Count = sapply(extra_services, 
                                     function(x) sum(bad_reviews[[x]] == 1, na.rm = TRUE)),
  Actually_Used_Count = sapply(extra_services,
                              function(x) sum(bad_reviews[[x]] > 0, na.rm = TRUE)),
  Total_Bad_Reviews = nrow(bad_reviews)
) %>%
  mutate(
    Catastrophic_Failure_Rate = round(Catastrophic_Failure_Count / Actually_Used_Count * 100, 1),
    Service_Usage_Rate = round(Actually_Used_Count / Total_Bad_Reviews * 100, 1),
    Service_Label = case_when(
      Service == "Wifi...Connectivity" ~ "WiFi & Connectivity",
      Service == "Inflight.Entertainment" ~ "Inflight Entertainment", 
      Service == "Food...Beverages" ~ "Food & Beverages"
    ),
    Service_Label = factor(Service_Label, 
                          levels = c("WiFi & Connectivity", "Inflight Entertainment", "Food & Beverages"))
  ) %>%
  arrange(desc(Catastrophic_Failure_Rate))
# Create HORIZONTAL bar plot showing only catastrophic failures (rating = 1)
ggplot(catastrophic_analysis, aes(x = Catastrophic_Failure_Rate, y = reorder(Service_Label, Catastrophic_Failure_Rate))) +
  geom_bar(stat = "identity", fill = ryanair_yellow, alpha = 0.9, width = 0.7) +
  # Add catastrophic failure rate labels
  geom_text(aes(label = paste0(Catastrophic_Failure_Rate, "%")), 
            hjust = -0.2, size = 5, fontface = "bold", color = ryanair_blue) +
  # Add absolute count annotations showing rating=1 vs total users
  geom_text(aes(x = Catastrophic_Failure_Rate / 2, 
                label = paste0(Catastrophic_Failure_Count, " / ", Actually_Used_Count, " users")),
            color = "white", size = 3.5, fontface = "bold") +
  labs(title = "Extra Services: Catastrophic Failure Analysis (Rating = 1 Only)",
       subtitle = "Percentage of service users who gave worst possible rating (1/5) in bad reviews",
       x = "Catastrophic Failure Rate (%)",
       y = "Extra Service Category",
       caption = "Analysis limited to bad reviews (Overall Rating <= 5/10)\nNumbers show: Rating=1 count / Total users of service") +
  theme_minimal() +
  scale_x_continuous(limits = c(0, max(catastrophic_analysis$Catastrophic_Failure_Rate) * 1.15), 
                     expand = expansion(mult = c(0, 0.05))) +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
    plot.subtitle = element_text(color = "darkgray", size = 12),
    plot.caption = element_text(size = 10, color = "darkgray", hjust = 0),
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10, color = "black"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "lightgray", linewidth = 0.5)
  )

3. Summary: Business Insights and Strategic Recommendations

# Recreate necessary objects for this section
top_driver <- head(regression_results[order(-regression_results$R_Squared), ], 1)
secondary_drivers <- regression_results %>%
  arrange(desc(R_Squared)) %>%
  slice(2:3)
top_bad_review_issues <- head(corrected_bad_analysis, 3)

cat("### Statistical Confidence in Service Drivers\n")

Statistical Confidence in Service Drivers

While correlation identifies relationships, linear regression provides statistical significance testing and quantifies the exact impact of each service on overall satisfaction. This allows us to answer critical business questions about the magnitude and reliability of each service’s influence.

Primary Driver of Customer Satisfaction

Value.For.Money is the primary driver of customer satisfaction, explaining 77.5% of overall satisfaction variance. Each 1-point improvement in Value.For.Money increases Overall Rating by 1.854 points, with extremely high statistical significance (p < 0.001).

Secondary Critical Drivers

Two additional services show strong predictive power for overall satisfaction: - Cabin.Staff.Service: Explains 57.3% variance, +1.812 points per 1-point improvement - Seat.Comfort: Explains 53.8% variance, +2.069 points per 1-point improvement

cat("### Insights from Bad Review Analysis\n")

Insights from Bad Review Analysis

cat("**Executive Summary:**\n")

Executive Summary:

cat("While WiFi and Entertainment show near-universal failure (98%), they have limited impact on driving bad reviews (3% variance). The real drivers are Value Perception (38%), Seat Comfort (13%), and Cabin Staff (10%).\n\n")

While WiFi and Entertainment show near-universal failure (98%), they have limited impact on driving bad reviews (3% variance). The real drivers are Value Perception (38%), Seat Comfort (13%), and Cabin Staff (10%).

top_bad_review_issues <- head(corrected_bad_analysis, 3)
cat("**Services with Highest Actual Failure Rates in Bad Reviews:**\n")

Services with Highest Actual Failure Rates in Bad Reviews:

for(i in 1:nrow(top_bad_review_issues)) {
  service <- top_bad_review_issues$Service[i]
  failure_rate <- top_bad_review_issues$Actual_Failure_Rate[i]
  usage_rate <- top_bad_review_issues$Service_Usage_Rate[i]
  cat(i, ". **", service, "**: ", failure_rate, "% failure rate (used by ", usage_rate, "% of dissatisfied customers)\n", sep = "")
}
  1. Inflight.Entertainment: 98% failure rate (used by 22.9% of dissatisfied customers)
  2. Wifi…Connectivity: 97.1% failure rate (used by 19.1% of dissatisfied customers)
  3. Food…Beverages: 94% failure rate (used by 95.2% of dissatisfied customers)

4. Deep Dive: Catastrophic Failure Patterns in WiFi and Entertainment Services

Note: This is to understand extreme failure patterns observed in Ryanair’s paid services. As some addtional insight into the nature of these failures. Not related to overall satisfaction drivers, but could be useful for service strategy.

# Analyze the catastrophic failure patterns for WiFi and Entertainment
cat("### WiFi and Entertainment Service Failure Analysis\n\n")

WiFi and Entertainment Service Failure Analysis

# Calculate failure statistics
wifi_failure_rate <- round(sum(ryanair_clean$Wifi...Connectivity == 1, na.rm = TRUE) / 
                           sum(ryanair_clean$Wifi...Connectivity > 0, na.rm = TRUE) * 100, 1)

entertainment_failure_rate <- round(sum(ryanair_clean$Inflight.Entertainment == 1, na.rm = TRUE) / 
                                   sum(ryanair_clean$Inflight.Entertainment > 0, na.rm = TRUE) * 100, 1)

# Calculate usage patterns in bad reviews vs all reviews
bad_reviews <- ryanair_clean[ryanair_clean$Overall.Rating <= 5, ]
wifi_usage_bad <- round(sum(bad_reviews$Wifi...Connectivity > 0, na.rm = TRUE) / nrow(bad_reviews) * 100, 1)
wifi_usage_all <- round(sum(ryanair_clean$Wifi...Connectivity > 0, na.rm = TRUE) / nrow(ryanair_clean) * 100, 1)
entertainment_usage_bad <- round(sum(bad_reviews$Inflight.Entertainment > 0, na.rm = TRUE) / nrow(bad_reviews) * 100, 1)
entertainment_usage_all <- round(sum(ryanair_clean$Inflight.Entertainment > 0, na.rm = TRUE) / nrow(ryanair_clean) * 100, 1)

wifi_ratio <- round(wifi_usage_bad / wifi_usage_all, 1)
entertainment_ratio <- round(entertainment_usage_bad / entertainment_usage_all, 1)

# Display the usage pattern findings
cat("**Usage Pattern Analysis: Bad Reviews vs All Reviews**\n")

Usage Pattern Analysis: Bad Reviews vs All Reviews

cat("WiFi Connectivity:\n")

WiFi Connectivity:

cat("- Usage in bad reviews: ", wifi_usage_bad, "%\n", sep = "")
  • Usage in bad reviews: 19.1%
cat("- Usage in all reviews: ", wifi_usage_all, "%\n", sep = "")
  • Usage in all reviews: 12.6%
cat("- Ratio: ", wifi_ratio, "x more likely in bad reviews\n\n", sep = "")
  • Ratio: 1.5x more likely in bad reviews
cat("Inflight Entertainment:\n")

Inflight Entertainment:

cat("- Usage in bad reviews: ", entertainment_usage_bad, "%\n", sep = "")
  • Usage in bad reviews: 22.9%
cat("- Usage in all reviews: ", entertainment_usage_all, "%\n", sep = "")
  • Usage in all reviews: 15.6%
cat("- Ratio: ", entertainment_ratio, "x more likely in bad reviews\n\n", sep = "")
  • Ratio: 1.5x more likely in bad reviews
# Create visualization of rating distributions
wifi_entertainment_data <- data.frame(
  Service = rep(c("WiFi Connectivity", "Inflight Entertainment"), each = 5),
  Rating = rep(1:5, 2),
  Count = c(251, 7, 6, 2, 2, 299, 20, 7, 2, 3)
)

ggplot(wifi_entertainment_data, aes(x = factor(Rating), y = Count, fill = Service)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  scale_fill_manual(values = c("WiFi Connectivity" = ryanair_blue, "Inflight Entertainment" = ryanair_yellow)) +
  labs(title = "Catastrophic Failure Patterns: WiFi and Entertainment Services",
       subtitle = "Near-universal lowest ratings indicate fundamental service breakdown",
       x = "Rating (1-5 scale)",
       y = "Number of Ratings",
       fill = "Service") +
  theme_minimal() +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold", size = 16),
    plot.subtitle = element_text(color = "darkgray", size = 12),
    legend.position = "bottom"
  ) +
  geom_text(aes(label = Count), position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 3.5)

Insights: WiFi and Entertainment Service Failure

The quantitative analysis reveals a service quality crisis for Ryanair’s paid offerings. WiFi Connectivity shows a catastrophic failure pattern with 94% of users (251 out of 268) giving the lowest possible rating of 1 out of 5. Similarly, Inflight Entertainment demonstrates near-identical failure rates with 90% of users (299 out of 331) assigning the minimum rating. This extreme rating concentration—where virtually all users agree on failure—signals fundamental service breakdown rather than random quality variation.

The usage patterns reveal an alarming trend: dissatisfied customers are 1.5 times more likely to use WiFi services (19.1% usage in bad reviews vs 12.6% overall) and 1.5 times more likely to use entertainment services (22.9% usage in bad reviews vs 15.6% overall). This suggests that customers experiencing poor core services use paid offerings, and encounter further disappointment.

The strategic implication is clear: these paid services are not merely underperforming but are actively damaging customer satisfaction. With dissatisfied customers being more likely to use these services and then experiencing near-universal disappointment, Ryanair faces a critical decision point. Either substantial investment is required to completely overhaul these services to reliable standards, or they should think up other ways of offerings to prevent further reputation damage from services that consistently fail to meet basic customer expectations.

# Additional statistical summary
cat("**Statistical Summary of Service Failures**\n")

Statistical Summary of Service Failures

failure_summary <- data.frame(
  Service = c("WiFi Connectivity", "Inflight Entertainment"),
  Total_Users = c(268, 331),
  Lowest_Rating_Count = c(251, 299),
  Failure_Rate = c(paste0(wifi_failure_rate, "%"), paste0(entertainment_failure_rate, "%")),
  Average_Rating = c(
    round(mean(ryanair_clean$Wifi...Connectivity[ryanair_clean$Wifi...Connectivity > 0], na.rm = TRUE), 2),
    round(mean(ryanair_clean$Inflight.Entertainment[ryanair_clean$Inflight.Entertainment > 0], na.rm = TRUE), 2)
  )
)

kable(failure_summary, caption = "Quantitative Service Failure Analysis")
Quantitative Service Failure Analysis
Service Total_Users Lowest_Rating_Count Failure_Rate Average_Rating
WiFi Connectivity 268 251 93.7% 1.12
Inflight Entertainment 331 299 90.3% 1.16

Customer Segmentation Analysis

Setup and Data Loading

library(tidyverse)
library(cluster)
library(factoextra)
library(gridExtra)
library(fmsb)
library(reshape2)
# Load the cleaned dataset
ryanair_clean <- read.csv("ryanair_reviews_cleaned.csv")

# Display basic dataset info
cat("## Dataset Overview\n")

Dataset Overview

cat("Total reviews:", nrow(ryanair_clean), "\n")

Total reviews: 2119

Data Validation and Preparation

# Check for data issues in categorical variables
cat("### Data Quality Check\n")

Data Quality Check

cat("Type.Of.Traveller unique values:\n")

Type.Of.Traveller unique values:

print(unique(ryanair_clean$Type.Of.Traveller))

[1] “Family Leisure” “Couple Leisure” “Solo Leisure” “Business”
[5] “Unknown”

# Create cleaned traveler type variable with "Unknown" for missing values
ryanair_clean <- ryanair_clean %>%
  mutate(
    cleaned_traveller_type = case_when(
      grepl("business|Business", Type.Of.Traveller) ~ "Business",
      grepl("family|Family", Type.Of.Traveller) ~ "Family",
      grepl("solo|Solo", Type.Of.Traveller) ~ "Solo Leisure",
      grepl("couple|Couple", Type.Of.Traveller) ~ "Couples Leisure",
      is.na(Type.Of.Traveller) | Type.Of.Traveller == "" ~ "Unknown",
      TRUE ~ "Unknown"
    )
  )
# Select service dimension variables (excluding Overall.Rating to avoid dominance)
service_vars <- c("Seat.Comfort", "Cabin.Staff.Service", "Food...Beverages", 
                  "Ground.Service", "Value.For.Money")

# Prepare data for clustering - SCALE the data for distance-based algorithms
clustering_data <- ryanair_clean %>%
  select(all_of(service_vars)) %>%
  scale() %>%  # Standardize variables
  as.data.frame()

cat("### Clustering Dataset Summary\n")

Clustering Dataset Summary

cat("Service Variables:", paste(service_vars, collapse = ", "), "\n")

Service Variables: Seat.Comfort, Cabin.Staff.Service, Food…Beverages, Ground.Service, Value.For.Money

cat("Observations:", nrow(clustering_data), "\n")

Observations: 2119

Determine Optimal Number of Clusters

# Calculate WSS first for annotation
wss <- map_dbl(1:10, function(k) {
  kmeans(clustering_data, centers = k, nstart = 25)$tot.withinss
})

# Elbow method with clear indication for k=3
p1 <- fviz_nbclust(clustering_data, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = "dashed", color = "red") +
  annotate("text", x = 3.2, y = max(wss)*0.9, label = "Elbow at k=3", 
           color = "red", hjust = 0) +
  labs(title = "K-means: Optimal Number of Clusters (Elbow Method)") +
  theme_minimal()

# Silhouette method  
p2 <- fviz_nbclust(clustering_data, kmeans, method = "silhouette") +
  labs(title = "K-means: Optimal Number of Clusters (Silhouette Method)") +
  theme_minimal()

# Display both plots
grid.arrange(p1, p2, ncol = 2)

K-means Clustering

set.seed(123)
kmeans_result <- kmeans(clustering_data, centers = 3, nstart = 25)

# Add cluster assignments
ryanair_segmented <- ryanair_clean %>%
  mutate(kmeans_cluster = as.factor(kmeans_result$cluster))

cat("### K-means Clustering Results\n")

K-means Clustering Results

cat("Cluster sizes:\n")

Cluster sizes:

table(kmeans_result$cluster) %>% knitr::kable()
Var1 Freq
1 487
2 1155
3 477

Hierarchical Clustering

# Calculate distance matrix and perform clustering
dist_matrix <- dist(clustering_data, method = "euclidean")
hclust_result <- hclust(dist_matrix, method = "ward.D2")

# Dendrogram with reduced label overlap
fviz_dend(hclust_result, k = 3,
          k_colors = c("#073590", "#FFD200", "#2E8B57"),
          color_labels_by_k = TRUE,
          rect = TRUE,
          show_labels = FALSE,  # Remove overlapping bottom labels
          main = "Hierarchical Clustering Dendrogram",
          xlab = "Observations", ylab = "Height")

# Cut tree and add assignments
hierarchical_clusters <- cutree(hclust_result, k = 3)
ryanair_segmented <- ryanair_segmented %>%
  mutate(hierarchical_cluster = as.factor(hierarchical_clusters))

Cluster Profiling

cluster_profiles <- ryanair_segmented %>%
  group_by(kmeans_cluster) %>%
  summarize(
    n_customers = n(),
    percent_total = round(n() / nrow(ryanair_segmented) * 100, 1),
    
    # Service ratings (what we clustered on)
    avg_seat_comfort = round(mean(Seat.Comfort, na.rm = TRUE), 2),
    avg_cabin_staff = round(mean(Cabin.Staff.Service, na.rm = TRUE), 2),
    avg_food_beverages = round(mean(Food...Beverages, na.rm = TRUE), 2),
    avg_ground_service = round(mean(Ground.Service, na.rm = TRUE), 2),
    avg_value_money = round(mean(Value.For.Money, na.rm = TRUE), 2),
    
    # Overall rating (outcome for validation)
    avg_overall_rating = round(mean(Overall.Rating, na.rm = TRUE), 2),
    
    # Traveler types (using cleaned data with "Unknown")
    business_pct = round(sum(cleaned_traveller_type == "Business") / n() * 100, 1),
    family_pct = round(sum(cleaned_traveller_type == "Family") / n() * 100, 1),
    solo_pct = round(sum(cleaned_traveller_type == "Solo Leisure") / n() * 100, 1),
    couples_pct = round(sum(cleaned_traveller_type == "Couples Leisure") / n() * 100, 1),
    unknown_pct = round(sum(cleaned_traveller_type == "Unknown") / n() * 100, 1),
    
 # CORRECTED Seat type distribution - all 4 categories
    first_class_pct = round(sum(grepl("First Class", Seat.Type)) / n() * 100, 1),
    business_class_pct = round(sum(grepl("Business Class", Seat.Type)) / n() * 100, 1),
    premium_economy_pct = round(sum(grepl("Premium Economy", Seat.Type)) / n() * 100, 1),
    economy_class_pct = round(sum(grepl("Economy Class", Seat.Type)) / n() * 100, 1),
    total_premium_pct = round(sum(grepl("First Class|Business Class|Premium Economy", Seat.Type)) / n() * 100, 1),
    
    # Service usage patterns
    wifi_usage_pct = round(mean(1 - Wifi_Connectivity_NotRated) * 100, 1),
    entertainment_usage_pct = round(mean(1 - Inflight_Entertainment_NotRated) * 100, 1)
  ) %>%
  mutate(
    # Name segments based on actual patterns
    segment_name = case_when(
      kmeans_cluster == 1 ~ "Satisfied Service-Experienced",
      kmeans_cluster == 2 ~ "Highly Dissatisfied Customers", 
      kmeans_cluster == 3 ~ "Value-Satisfied but Ground Service Critics"
    )
  )

cat("### Customer Segment Profiles\n")

Customer Segment Profiles

cluster_profiles %>% knitr::kable()
kmeans_cluster n_customers percent_total avg_seat_comfort avg_cabin_staff avg_food_beverages avg_ground_service avg_value_money avg_overall_rating business_pct family_pct solo_pct couples_pct unknown_pct first_class_pct business_class_pct premium_economy_pct economy_class_pct total_premium_pct wifi_usage_pct entertainment_usage_pct segment_name
1 487 23.0 3.39 4.20 2.67 4.17 4.57 8.06 8.6 21.8 39.4 30.2 0.0 0.4 0.2 0.4 99.0 1.0 4.5 6.8 Satisfied Service-Experienced
2 1155 54.5 1.53 1.77 1.60 1.19 1.51 1.63 6.9 20.9 26.5 34.6 11.1 0.0 0.1 1.1 98.8 1.2 20.6 24.4 Highly Dissatisfied Customers
3 477 22.5 3.45 4.02 2.39 1.12 4.23 7.29 0.8 5.7 10.5 8.4 74.6 0.0 0.2 0.0 99.8 0.2 1.7 3.4 Value-Satisfied but Ground Service Critics

This table is used for the plots and interpretation.

Key Insights for the Team

Overview:

We’ve identified three distinct customer segments based on their service experience patterns. Here’s what each segment tells us:

🔴 Cluster 2: Highly Dissatisfied Customers (54.5% - MAJOR CONCERN)

Largest segment representing over half of our customer base

Extremely poor experiences across all services (ratings ~1.5/5)

Critical dissatisfaction with overall experience (1.63/10)

Heavy users of premium services - highest WiFi (20.6%) and entertainment (24.4%) usage

Most concerning insight: Customers paying for premium services are our most dissatisfied segment

🟢 Cluster 1: Satisfied Service-Experienced (23%)

High satisfaction across all service dimensions (ratings 3.4-4.6/5)

Excellent overall experience (8.06/10)

Balanced traveler mix across business, solo, couples, and family types

Moderate premium service usage - represents our core satisfied customer base

🟡 Cluster 3: Value-Satisfied but Ground Service Critics (22.5%)

Good value perception and comfort ratings, but disastrous ground service (1.12/5)

Solid overall satisfaction (7.29/10) despite ground service issues

Minimal premium service usage - true budget-conscious flyers

Critical opportunity: Fixing ground service could significantly boost this segment’s satisfaction

Key Business Implications:

Priority #1: Address the root causes of Cluster 2’s widespread dissatisfaction, especially given their high premium service usage

Quick Win: Improve ground service operations to convert Cluster 3 into fully satisfied customers

Retention Focus: Maintain the excellent service experience for Cluster 1 while exploring upsell opportunities

Enhanced Visualization of Results

# 1. Cluster Centroids Profiling Plot
centers <- as.data.frame(kmeans_result$centers)
centers$cluster <- 1:3
centers.melt <- melt(centers, id.vars = "cluster")

profile_plot <- ggplot(centers.melt, aes(x = variable, y = value, group = cluster, color = as.factor(cluster))) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#073590", "#FFD200", "#2E8B57"), 
                     name = "Cluster",
                     labels = cluster_profiles$segment_name) +
  labs(title = "Cluster Service Profiles: What Defines Each Segment",
       subtitle = "Standardized centroid values for each service dimension",
       x = "Service Dimension", y = "Standardized Centroid Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# 2. Overall Rating Distribution (Validation)
overall_plot <- ggplot(ryanair_segmented, aes(x = Overall.Rating, fill = kmeans_cluster)) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = c("#073590", "#FFD200", "#2E8B57"),
                    name = "Segment",
                    labels = cluster_profiles$segment_name) +
  labs(title = "Overall Rating Distribution by Segment",
       subtitle = "Validates that service patterns correlate with overall satisfaction",
       x = "Overall Rating", y = "Density") +
  theme_minimal()

# 3. Traveler Type Composition
traveller_plot <- ryanair_segmented %>%
  count(kmeans_cluster, cleaned_traveller_type) %>%
  group_by(kmeans_cluster) %>%
  mutate(percent = n / sum(n) * 100) %>%
  ggplot(aes(x = kmeans_cluster, y = percent, fill = cleaned_traveller_type)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#073590", "#1E90FF", "#FFD200", "#32CD32", "#808080"),
                    name = "Traveler Type") +
  labs(title = "Traveler Type Composition by Cluster",
       x = "Cluster", y = "Percentage") +
  theme_minimal() +
  theme(plot.margin = margin(1, 1, 1, 1, "cm"))

# 4. Seat Type Distribution - ADD width only here
seat_plot <- ryanair_segmented %>%
  mutate(
    seat_category = case_when(
      grepl("First Class", Seat.Type) ~ "First Class",
      grepl("Business Class", Seat.Type) ~ "Business Class", 
      grepl("Premium Economy", Seat.Type) ~ "Premium Economy",
      grepl("Economy Class", Seat.Type) ~ "Economy Class",
      TRUE ~ "Unknown"
    )
  ) %>%
  count(kmeans_cluster, seat_category) %>%
  group_by(kmeans_cluster) %>%
  mutate(percent = n / sum(n) * 100) %>%
  ggplot(aes(x = kmeans_cluster, y = percent, fill = seat_category)) +
  geom_bar(stat = "identity", position = "stack", width = 0.8) +  # ADD width HERE only
  scale_fill_manual(values = c("#073590", "#1E90FF", "#FFD200", "#32CD32", "#808080"),
                    name = "Seat Type") +
  labs(title = "Seat Type Distribution by Cluster - All 4 Categories",
       x = "Cluster", y = "Percentage") +
  theme_minimal() +
  theme(plot.margin = margin(1, 1, 1, 1, "cm"),
        plot.title = element_text(margin = margin(b = 10)))

# 5. Service Usage Patterns with Corrected Insight
usage_plot <- ryanair_segmented %>%
  group_by(kmeans_cluster) %>%
  summarize(
    wifi_usage = mean(1 - Wifi_Connectivity_NotRated) * 100,
    entertainment_usage = mean(1 - Inflight_Entertainment_NotRated) * 100
  ) %>%
  pivot_longer(cols = c(wifi_usage, entertainment_usage), 
               names_to = "service", values_to = "usage_pct") %>%
  ggplot(aes(x = kmeans_cluster, y = usage_pct, fill = service)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#073590", "#FFD200"), 
                    name = "Service",
                    labels = c("WiFi", "Entertainment")) +
  labs(title = "Premium Service Usage by Cluster - Key Insight",
       subtitle = "Dissatisfied customers (Cluster 2) use WiFi/entertainment most\npotential disappointment with paid services",
       x = "Cluster", y = "Usage Percentage") +
  theme_minimal() +
  theme(plot.margin = margin(1, 1, 1, 1, "cm"),
        plot.title = element_text(margin = margin(b = 10)),
        plot.subtitle = element_text(margin = margin(b = 10),
                                    size = 10))  # Smaller font for subtitle

# Or use a shorter subtitle:
# subtitle = "Cluster 2 (dissatisfied) uses premium services most"

# Display all plots with adjusted layout
profile_plot

overall_plot

grid.arrange(traveller_plot, seat_plot, ncol = 2, 
             widths = c(1.2, 1),
             top = "Demographic Analysis: Traveler Types and Seat Preferences")  # Add overall title

usage_plot

Traveler Type vs Clusters Insights:

  1. Cluster 1 - Satisfied Service-Experienced:

Most balanced distribution across all traveler types

Highest business traveler percentage (8.6%) - corporate travelers appreciate the service quality

Strong family representation (21.8%) - families are having good experiences

Well-distributed leisure travel - appeals to all customer types

  1. Cluster 2 - Highly Dissatisfied Customers:

Highest couples leisure percentage (34.6%) - romantic trips are being ruined

Strong family presence (20.9%) - family vacations are disappointing

Moderate business travelers (6.9%) - even corporate travelers are unhappy

This is a universal problem affecting all traveler types

  1. Cluster 3 - Value-Satisfied but Ground Service Critics:

Massive data gap (74.6% Unknown) - we don’t know who these customers are

Very low business travelers (0.8%) - almost no corporate customers

Low family representation (5.7%) - families are avoiding this experience

Primarily leisure travelers who tolerate poor ground service for value

Critical Business Implications:

Couples & Families Are Most Vulnerable - romantic trips and family vacations are being severely impacted

Business Travelers Prefer Cluster 1 Experience - corporate customers gravitate toward good service

Cluster 3 Has Identity Crisis - we’re failing to capture who these budget customers are

No Traveler Type is Immune - dissatisfaction affects all customer segments

Actionable Recommendations:

Protect couples/family market - these are high-value leisure segments

Understand Cluster 3 demographics - fix data collection for budget segment

Leverage business traveler satisfaction - use as reference for service standards

Universal service overhaul needed - no traveler type is being adequately served

Bottom Line: Service quality issues are affecting all traveler types, but couples and families are disproportionately represented in the dissatisfied segment, threatening Ryanair’s leisure travel market position.

Note: The seat type is not insightful since around 99% are economy across all clusters.

Method Comparison

cluster_comparison <- table(Kmeans = ryanair_segmented$kmeans_cluster,
                           Hierarchical = ryanair_segmented$hierarchical_cluster)

cat("### Cluster Assignment Comparison: K-means vs Hierarchical\n")

Cluster Assignment Comparison: K-means vs Hierarchical

cluster_comparison %>% knitr::kable()
1 2 3
468 19 0
5 1129 21
44 53 380
agreement_rate <- sum(diag(cluster_comparison)) / sum(cluster_comparison)
cat("\n**Agreement rate between methods:**", round(agreement_rate * 100, 1), "%\n")

Agreement rate between methods: 93.3 %

Final Business Recommendations

final_recommendations <- cluster_profiles %>%
  select(segment_name, n_customers, percent_total,
         avg_overall_rating, avg_value_money, business_pct,
         wifi_usage_pct, total_premium_pct, unknown_pct) %>%  # CHANGED: premium_pct → total_premium_pct
  mutate(
    priority_level = case_when(
      segment_name == "Highly Dissatisfied Customers" ~ "CRITICAL",
      segment_name == "Value-Satisfied but Ground Service Critics" ~ "HIGH", 
      segment_name == "Satisfied Service-Experienced" ~ "MEDIUM"
    ),
    key_actions = case_when(
      segment_name == "Highly Dissatisfied Customers" ~ "Overhaul core service delivery, fix value perception, staff training",
      segment_name == "Value-Satisfied but Ground Service Critics" ~ "Improve ground operations, maintain value proposition", 
      segment_name == "Satisfied Service-Experienced" ~ "Retention focus, loyalty program enhancement"
    ),
    segment_characteristics = case_when(
      segment_name == "Satisfied Service-Experienced" ~ "High ratings across services, balanced traveler types, moderate WiFi usage",
      segment_name == "Highly Dissatisfied Customers" ~ "Extremely low ratings on all core services, highest WiFi usage", 
      segment_name == "Value-Satisfied but Ground Service Critics" ~ "Good value/comfort but terrible ground service, lowest WiFi usage"
    )
  ) %>%
  select(segment_name, n_customers, percent_total, priority_level, key_actions, segment_characteristics, everything())

cat("### Final Segment Definitions and Business Recommendations\n")

Final Segment Definitions and Business Recommendations

final_recommendations %>% knitr::kable()
segment_name n_customers percent_total priority_level key_actions segment_characteristics avg_overall_rating avg_value_money business_pct wifi_usage_pct total_premium_pct unknown_pct
Satisfied Service-Experienced 487 23.0 MEDIUM Retention focus, loyalty program enhancement High ratings across services, balanced traveler types, moderate WiFi usage 8.06 4.57 8.6 4.5 1.0 0.0
Highly Dissatisfied Customers 1155 54.5 CRITICAL Overhaul core service delivery, fix value perception, staff training Extremely low ratings on all core services, highest WiFi usage 1.63 1.51 6.9 20.6 1.2 11.1
Value-Satisfied but Ground Service Critics 477 22.5 HIGH Improve ground operations, maintain value proposition Good value/comfort but terrible ground service, lowest WiFi usage 7.29 4.23 0.8 1.7 0.2 74.6

Summary

Key Insight:

The analysis reveals three distinct customer segments with clear service experience patterns, validated by demographic and behavioral characteristics. The near-universal economy seat distribution confirms Ryanair’s position as a budget airline across all customer types.

Key Takeaway from This Customer Segmentation Analysis:

The Real Problem Isn’t What You’d Expect We discovered a critical paradox: Our most dissatisfied customers (54.5% of base) are NOT primarily upset about WiFi/entertainment quality issues - they’re experiencing fundamental failures across ALL core services.

The Core Issue:

-Cluster 2 customers hate EVERYTHING - seat comfort (1.53/5), cabin staff (1.77/5), food (1.60/5), ground service (1.19/5), and value (1.51/5)

-This is a systemic service breakdown, not isolated premium service problems

-WiFi/entertainment usage is a symptom, not the cause - these customers try more services but find everything disappointing

Strategic Implications:

-Stop focusing on premium services as the main problem - they’re only 3% of the issue

-Fix the core service engine first: Value perception, seat comfort, and cabin staff service drive 61% of dissatisfaction

-Cluster 2 represents a broken customer experience - we’re fundamentally failing over half our customer base

Action Priority:

-CRITICAL: Overhaul core service delivery for Cluster 2 (54.5% of customers)

-HIGH: Fix ground service for Cluster 3 (quick satisfaction win)

-MEDIUM: Maintain excellence for Cluster 1 (our reference customers)

Bottom Line: We have a systemic service quality crisis affecting the majority of customers. Premium service fixes won’t solve this - we need to rebuild the core Ryanair experience from the ground up.