Ryanair Reviews: Data Cleaning and Quality Analysis

1. Data Preprocessing and Initial Exploration

1.1 Data Loading and Structure Analysis

In this section, we perform essential data preprocessing steps to ensure the dataset is ready for analysis. This includes examining the data structure, handling missing values, and preparing the data for statistical analysis and visualization.

# Load libraries
library(tidyverse)
library(ggplot2)
library(knitr)

# Load data
ryanair <- read.csv("ryanair_reviews.csv", stringsAsFactors = FALSE)

# Display dataset dimensions and basic information
dataset_info <- data.frame(
  Metric = c("Total Rows", "Total Columns", "Date Range", "Rating Scale"),
  Value = c(
    nrow(ryanair),
    ncol(ryanair),
    paste(range(ryanair$Date.Published, na.rm = TRUE), collapse = " to "),
    "1-10 (Overall), 1-5 (Service Ratings)"
  )
)

kable(dataset_info, caption = "Basic Dataset Information")

Basic Dataset Information

| Metric | Value |
|:---|:---|
| Total Rows | 2249 |
| Total Columns | 21 |
| Date Range | 2012-08-23 to 2024-02-03 |
| Rating Scale | 1-10 (Overall), 1-5 (Service Ratings) |

# Check for duplicate records
duplicate_count <- sum(duplicated(ryanair))
cat("### Duplicate Records Check\n")

Duplicate Records Check

cat("Duplicate records found:", duplicate_count, "\n\n")

Duplicate records found: 0

if(duplicate_count == 0) {
  cat("No duplicates found - proceeding with analysis\n\n")
}

No duplicates found - proceeding with analysis

Code Explanation: We load the necessary libraries and the dataset, display basic information about its structure (dimensions and date range), and confirm that it contains no fully duplicated rows.
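
One caveat on the duplicate check: duplicated() compares entire rows, so two reviews that agree on everything except one field are not flagged. The sketch below uses a small hypothetical data frame (toy_reviews and its values are illustrative, not taken from the real file) to contrast the full-row check with a check restricted to a few key columns. Which columns make a sensible key for this dataset is an assumption that would need to be verified.

```r
# Hypothetical toy data; column names only mimic the real dataset
toy_reviews <- data.frame(
  Date.Published = c("2024-01-01", "2024-01-01", "2024-01-02"),
  Comment.title  = c("Late again", "Late again", "Fine flight"),
  Overall.Rating = c(2, 3, 8),
  stringsAsFactors = FALSE
)

# Full-row check: rows 1 and 2 differ in Overall.Rating, so nothing is flagged
full_dupes <- sum(duplicated(toy_reviews))

# Key-based check: the same date and title counts as a repeat
key_cols  <- c("Date.Published", "Comment.title")
key_dupes <- sum(duplicated(toy_reviews[key_cols]))

cat("Full-row duplicates:", full_dupes, "\n")
cat("Key-based duplicates:", key_dupes, "\n")
```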

1.2 Missing Values Analysis

Now we’ll address missing values in the dataset. Handling missing values is essential to ensure data integrity and prevent errors during analysis. First, let’s identify which columns have missing values and understand the patterns of missingness.

# Calculate missing values statistics
missing_summary <- data.frame(
  Column = names(ryanair),
  Missing_Count = colSums(is.na(ryanair)),
  Missing_Percentage = round(colSums(is.na(ryanair)) / nrow(ryanair) * 100, 1)
) %>%
  filter(Missing_Count > 0) %>%
  arrange(desc(Missing_Percentage))

kable(missing_summary, row.names = FALSE, caption = "Columns with Missing Values")

Columns with Missing Values

| Column | Missing_Count | Missing_Percentage |
|:---|---:|---:|
| Wifi...Connectivity | 1981 | 88.1 |
| Inflight.Entertainment | 1918 | 85.3 |
| Food...Beverages | 937 | 41.7 |
| Ground.Service | 671 | 29.8 |
| Overall.Rating | 130 | 5.8 |
| Cabin.Staff.Service | 121 | 5.4 |
| Seat.Comfort | 112 | 5.0 |
| Value.For.Money | 1 | 0.0 |

Interpretation of Missingness Patterns:

  • Extreme Missingness (85-88%): Inflight Entertainment and WiFi Connectivity - suggests most passengers didn’t use these paid services
  • Moderate Missingness (30-42%): Food & Beverages and Ground Service - mixed reasons (some passengers skipped the rating, others didn’t use the service)
  • Low Missingness (5-6%): Seat Comfort and Cabin Staff Service, plus a single missing Value for Money rating - likely random missingness
  • Critical Missingness (5.8%): Overall Rating - our target variable requires special attention
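
The tiers above can be assigned programmatically rather than by eye. The following is a minimal sketch, assuming cut points of 10% and 50% (these thresholds are an illustrative choice, not something the analysis fixes) and using the tier names that Section 3 uses for its cleaning steps. The percentages are copied from the table above.

```r
# Missing percentages copied from the "Columns with Missing Values" table
missing_pct <- c(Wifi = 88.1, Entertainment = 85.3, Food = 41.7,
                 Ground = 29.8, Overall = 5.8, Staff = 5.4, Seat = 5.0)

# Hypothetical cut points that separate the three observed clusters:
# (0, 10] = Low, (10, 50] = Moderate, (50, 100] = Extreme
tier <- cut(missing_pct,
            breaks = c(0, 10, 50, 100),
            labels = c("Low", "Moderate", "Extreme"))

tier_table <- data.frame(Column = names(missing_pct),
                         Missing_Percentage = missing_pct,
                         Tier = tier,
                         row.names = NULL)
print(tier_table)
```

Keeping the thresholds in one place makes it easy to see how sensitive the downstream imputation choices are to where the boundaries are drawn.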

1.3 Categorical Variables Analysis

Why We Analyze Categorical Variables: Understanding the distribution and missingness in categorical variables helps us identify data quality issues and potential biases in our dataset.

# Output categorical variables
cat_vars <- sapply(ryanair, is.character)
cat_vars_names <- names(ryanair)[cat_vars]
cat_vars_names

 [1] "Date.Published"    "Passenger.Country" "Trip_verified"
 [4] "Comment.title"     "Comment"           "Aircraft"
 [7] "Type.Of.Traveller" "Seat.Type"         "Origin"
[10] "Destination"       "Date.Flown"        "Recommended"

# Convert empty strings or "null" to NA
ryanair[cat_vars_names] <- lapply(ryanair[cat_vars_names], function(x) {
  x[x == "" | x == " " | x == "null"] <- NA
  return(x)
})

# Check missing values after conversion
cat("Missing values in categorical variables after conversion:\n")

Missing values in categorical variables after conversion:

cat_missing_after <- sapply(ryanair[cat_vars_names], function(x) sum(is.na(x)))
cat_missing_after_df <- data.frame(
  Variable = names(cat_missing_after),
  Missing_Count = cat_missing_after
)
kable(cat_missing_after_df, row.names = FALSE, caption = "Missing Values in Categorical Variables After Conversion")

Missing Values in Categorical Variables After Conversion

| Variable | Missing_Count |
|:---|---:|
| Date.Published | 0 |
| Passenger.Country | 0 |
| Trip_verified | 944 |
| Comment.title | 0 |
| Comment | 0 |
| Aircraft | 1697 |
| Type.Of.Traveller | 614 |
| Seat.Type | 0 |
| Origin | 615 |
| Destination | 615 |
| Date.Flown | 618 |
| Recommended | 0 |

# Handle missing categorical fields
for (var in cat_vars_names) {
  ryanair[[var]][is.na(ryanair[[var]])] <- "Unknown"
}

# Verify no missing categorical values
cat("\nFinal verification - missing values after handling:\n")

Final verification - missing values after handling:

final_missing_cat <- sapply(ryanair[cat_vars_names], function(x) sum(is.na(x)))
final_missing_df <- data.frame(
  Variable = names(final_missing_cat),
  Missing_Count = final_missing_cat
)
kable(final_missing_df, row.names = FALSE, caption = "Final Missing Values in Categorical Variables")

Final Missing Values in Categorical Variables

| Variable | Missing_Count |
|:---|---:|
| Date.Published | 0 |
| Passenger.Country | 0 |
| Trip_verified | 0 |
| Comment.title | 0 |
| Comment | 0 |
| Aircraft | 0 |
| Type.Of.Traveller | 0 |
| Seat.Type | 0 |
| Origin | 0 |
| Destination | 0 |
| Date.Flown | 0 |
| Recommended | 0 |

cat("\n✓ All categorical variables now have 0 missing values\n")

✓ All categorical variables now have 0 missing values

Rationale for Categorical Variables Analysis:

  • Data Quality Check: Ensure categorical variables don’t have missing values that could affect analysis
  • Feature Engineering: Identify categorical variables that can be used for segmentation and modeling
  • Business Understanding: Understand the types of categorical data available for customer profiling
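
The conversion of placeholder strings to NA can be made a little more forgiving. The sketch below is one possible variant, not the method used above: normalize_missing() is a hypothetical helper that trims surrounding whitespace and matches "null" case-insensitively, so values such as "   " or "NULL" are also caught. Note that it trims valid values as a side effect.

```r
# Hypothetical helper: treat blank, whitespace-only, and "null" strings as missing
normalize_missing <- function(x, tokens = c("", "null")) {
  cleaned <- trimws(x)                         # drop leading/trailing whitespace
  cleaned[tolower(cleaned) %in% tokens] <- NA  # %in% never returns NA, so existing NAs are safe
  cleaned
}

# Toy input covering the cases of interest
aircraft <- c("Boeing 737", "", "   ", "NULL", NA, " Boeing 737-800 ")
aircraft_clean <- normalize_missing(aircraft)
print(aircraft_clean)
```

It could be applied column-wise in the same way as the lapply() call above, for example `ryanair[cat_vars_names] <- lapply(ryanair[cat_vars_names], normalize_missing)`.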

2. Exploratory Finding: The “Silent Complainer” Phenomenon

2.1 Systematic Analysis of Missing Target Variable

Background and Methodology: Before proceeding with data cleaning, we conduct a systematic analysis to understand if missing values occur randomly or follow specific patterns. This investigation is crucial because non-random missingness can indicate underlying business insights and influence our data cleaning strategy.

We focus particularly on the Overall.Rating column (our target variable) to determine if passengers who skip providing an overall rating differ systematically from those who complete it.

# Ryanair brand colors
ryanair_blue <- "#073590"
ryanair_yellow <- "#FFD200" 

# Define the groups for comparison
missing_target_rows <- ryanair[is.na(ryanair$Overall.Rating), ]
complete_rows <- ryanair[!is.na(ryanair$Overall.Rating), ]

# Analyze ALL service ratings for comprehensive comparison
all_service_ratings <- c("Seat.Comfort", "Cabin.Staff.Service", "Food...Beverages", 
                         "Ground.Service", "Value.For.Money", "Inflight.Entertainment", 
                         "Wifi...Connectivity")

# Calculate means for both groups
missing_services <- colMeans(missing_target_rows[all_service_ratings], na.rm = TRUE)
complete_services <- colMeans(complete_rows[all_service_ratings], na.rm = TRUE)

# Create comprehensive comparison
comprehensive_comparison <- data.frame(
  Service = c("Seat Comfort", "Cabin Staff", "Food & Beverages", "Ground Service",
              "Value for Money", "Inflight Entertainment", "WiFi Connectivity"),
  Missing_Target = round(missing_services, 2),
  Complete_Target = round(complete_services, 2),
  Difference = round(complete_services - missing_services, 2)
) %>%
  mutate(
    Percent_Difference = round((Difference / Complete_Target) * 100, 1)
  )

kable(comprehensive_comparison, row.names = FALSE, caption = "Comprehensive Service Rating Comparison: Missing vs Complete Overall Ratings")

Comprehensive Service Rating Comparison: Missing vs Complete Overall Ratings

| Service | Missing_Target | Complete_Target | Difference | Percent_Difference |
|:---|---:|---:|---:|---:|
| Seat Comfort | 1.75 | 2.41 | 0.66 | 27.4 |
| Cabin Staff | 1.56 | 2.82 | 1.26 | 44.7 |
| Food & Beverages | 0.78 | 2.05 | 1.26 | 61.5 |
| Ground Service | NaN | 2.16 | NaN | NaN |
| Value for Money | 1.23 | 2.82 | 1.59 | 56.4 |
| Inflight Entertainment | NaN | 1.16 | NaN | NaN |
| WiFi Connectivity | NaN | 1.12 | NaN | NaN |
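
The NaN entries in the comparison table are what R returns when a group has no non-missing values for a column: with na.rm = TRUE nothing is left to average. The toy vector below (not the real data) reproduces this, which suggests that none of the passengers with a missing overall rating rated those three services.

```r
# Toy group in which nobody rated the service
ground_service_missing_group <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE there are zero values left, so the mean is 0/0 = NaN
group_mean <- mean(ground_service_missing_group, na.rm = TRUE)
print(group_mean)

# Counting the non-missing values per group makes the cause explicit
n_rated <- sum(!is.na(ground_service_missing_group))
print(n_rated)
```

Reporting the number of non-missing ratings next to each mean would make such rows easier to interpret.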

2.2 Visualization: Systematic Rating Differences

# Prepare data for comprehensive visualization
comprehensive_viz <- comprehensive_comparison %>%
  pivot_longer(cols = c(Missing_Target, Complete_Target), 
               names_to = "Group", 
               values_to = "Average_Rating") %>%
  mutate(
    Group = factor(Group, levels = c("Missing_Target", "Complete_Target"),
                   labels = c("Silent Complainers\n(Missing Overall Rating)", 
                              "Typical Customers\n(Complete Data)")),
    Service = factor(Service, levels = comprehensive_comparison$Service[order(comprehensive_comparison$Difference)])
  )

ggplot(comprehensive_viz, aes(x = Service, y = Average_Rating, fill = Group)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = Average_Rating), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_fill_manual(values = c(ryanair_yellow, ryanair_blue)) +
  labs(title = "Comprehensive Service Rating Analysis: Silent Complainers vs Typical Customers",
       subtitle = "Passengers with missing Overall Ratings show consistent dissatisfaction across major service types",
       y = "Average Rating (1-5 scale)",
       x = "Service Dimension") +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(color = ryanair_blue, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))

2.3 Statistical Significance Testing

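Why We Use Statistical Testing: To determine whether the observed differences are statistically significant (unlikely to arise from random chance alone), we run two-sample t-tests. R’s t.test() performs Welch’s test by default, which does not assume equal variances in the two groups. This helps validate whether the “silent complainer” pattern is a real phenomenon worthy of business attention.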

# Perform statistical tests for services showing differences
t_test_seat <- t.test(complete_rows$Seat.Comfort, missing_target_rows$Seat.Comfort)
t_test_staff <- t.test(complete_rows$Cabin.Staff.Service, missing_target_rows$Cabin.Staff.Service)
t_test_value <- t.test(complete_rows$Value.For.Money, missing_target_rows$Value.For.Money)
# t.test() drops NA values itself; it has no na.rm argument
t_test_food <- t.test(complete_rows$Food...Beverages, missing_target_rows$Food...Beverages)

significance_results <- data.frame(
  Service_Dimension = c("Seat Comfort", "Cabin Staff Service", "Value for Money", "Food & Beverages"),
  P_Value = c(t_test_seat$p.value, t_test_staff$p.value, t_test_value$p.value, t_test_food$p.value),
  Mean_Difference = c(
    mean(complete_rows$Seat.Comfort, na.rm = TRUE) - mean(missing_target_rows$Seat.Comfort, na.rm = TRUE),
    mean(complete_rows$Cabin.Staff.Service, na.rm = TRUE) - mean(missing_target_rows$Cabin.Staff.Service, na.rm = TRUE),
    mean(complete_rows$Value.For.Money, na.rm = TRUE) - mean(missing_target_rows$Value.For.Money, na.rm = TRUE),
    mean(complete_rows$Food...Beverages, na.rm = TRUE) - mean(missing_target_rows$Food...Beverages, na.rm = TRUE)
  )
) %>%
  mutate(
    Significance = ifelse(P_Value < 0.001, "***", 
                         ifelse(P_Value < 0.01, "**",
                               ifelse(P_Value < 0.05, "*", "ns")))
  )

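# Format only the displayed copy; significance_results keeps numeric p-values for later use
significance_results %>%
  mutate(P_Value = format.pval(P_Value, digits = 3, eps = 0.001),
         Mean_Difference = round(Mean_Difference, 2)) %>%
  kable(caption = "T-Test Results for Service Rating Differences")

T-Test Results for Service Rating Differences

| Service_Dimension | P_Value | Mean_Difference | Significance |
|:---|:---|---:|:---|
| Seat Comfort | <0.001 | 0.66 | *** |
| Cabin Staff Service | <0.001 | 1.26 | *** |
| Value for Money | <0.001 | 1.59 | *** |
| Food & Beverages | <0.001 | 1.26 | *** |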
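
Because ratings are ordinal, a rank-based test is a reasonable cross-check on the t-test results. The sketch below runs both tests on small made-up vectors (complete_group and missing_group are toy data, not the real ratings). Using the Wilcoxon rank-sum test here is a suggestion for a robustness check, not part of the original analysis.

```r
# Toy ratings, not the real data
complete_group <- c(3, 4, 2, 5, 3, 4, 3, 2, 4, 3)
missing_group  <- c(1, 1, 2, 1, 1, 2, 1, 1)

# Welch two-sample t-test (t.test default, var.equal = FALSE)
welch <- t.test(complete_group, missing_group)

# Wilcoxon rank-sum test as a rank-based cross-check;
# exact = FALSE because the ratings contain ties
ranksum <- wilcox.test(complete_group, missing_group, exact = FALSE)

test_summary <- data.frame(
  Test    = c("Welch t-test", "Wilcoxon rank-sum"),
  P_Value = format.pval(c(welch$p.value, ranksum$p.value), digits = 3)
)
print(test_summary)
```

If both tests point the same way on the real data, the conclusion does not hinge on treating a 1-5 rating as an interval-scaled variable.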

2.4 Visualization: Statistical Significance

Why We Use -log₁₀(p-value): We transform p-values using -log₁₀ for visualization because:

  • Raw p-values like 6.68e-10 are extremely small and hard to compare visually
  • On the -log₁₀ scale, each additional unit corresponds to a tenfold smaller p-value, so longer bars indicate stronger evidence
  • Extremely significant results, which would be microscopic on a raw p-value scale, become clearly visible
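
To make the transformation concrete, the short example below applies it to three p-values: the conventional 0.05 threshold, 0.001, and the 6.68e-10 value mentioned above. The vector name p_values is illustrative.

```r
# Three reference p-values: 0.05 threshold, 0.001, and a very small value
p_values <- c(0.05, 0.001, 6.68e-10)

# Each tenfold decrease in p adds one unit on the transformed scale
neg_log_p <- -log10(p_values)

# approximately 1.30, 3.00, 9.18
print(round(neg_log_p, 2))
```

The 0.05 threshold therefore sits at about 1.3 on the transformed axis, which is where the dashed reference line is drawn in the plot that follows.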

# Prepare data for visualization
viz_data <- significance_results %>%
  mutate(
    Neg_Log_Pvalue = -log10(P_Value),
    Service_Dimension = factor(Service_Dimension, 
                              levels = c("Value for Money", "Cabin Staff Service", "Seat Comfort", "Food & Beverages"))
  )

ggplot(viz_data, aes(x = Service_Dimension, y = Neg_Log_Pvalue, fill = Mean_Difference)) +
  geom_bar(stat = "identity", alpha = 0.9) +
  scale_fill_gradient2(low = ryanair_yellow, high = ryanair_blue, mid = "white",
                       midpoint = 1, name = "Rating Difference") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "red", alpha = 0.7) +
  geom_text(aes(label = paste0("p = ", format.pval(P_Value, digits = 3))), 
            hjust = -0.1, size = 3.5, color = ryanair_blue) +
  geom_text(aes(label = paste0(round(Mean_Difference, 2), " points")), 
            hjust = 1.2, size = 4, fontface = "bold", color = "white") +
  labs(title = "Statistical Significance of 'Silent Complainer' Differences",
       subtitle = "Passengers with missing Overall Ratings are significantly more dissatisfied\n(-log10 transformation makes extreme significance visually clear)",
       y = "-log10(p-value)\nHigher = More Statistically Significant",
       x = "Service Dimension") +
  theme_minimal() +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold"),
    plot.subtitle = element_text(color = "darkgray")
  )

Key Business Insight: The ‘Silent Complainer’ Phenomenon

Discovery: Our analysis reveals that missing data in the Overall.Rating column is not random. Passengers who skip the overall rating give systematically lower scores on every service dimension they did rate. (Ground Service, Inflight Entertainment and WiFi Connectivity could not be compared because this group left them unrated.)

Statistical Evidence:

  • 4 service dimensions show statistically significant differences (p < 0.0001)
  • Value for Money shows the largest difference: 1.59 points lower ratings from silent complainers
  • Food & Beverages and Cabin Staff Service: about 1.26 points lower ratings each
  • Seat Comfort: 0.66 points lower ratings

Strategic Implications:

  1. Hidden Churn Risk: These 130 customers represent a segment at high risk of churn
  2. Lost Feedback: Traditional analysis would miss this extreme dissatisfaction
  3. Proactive Engagement Needed: Ryanair should implement targeted recovery strategies
  4. Data Quality as Insight: Missing data patterns reveal hidden customer sentiment

3. Data Cleaning Strategy with Detailed Rationale

3.1 Step 1: Remove Missing Target Variable Rows

Why We Do This Step: Overall.Rating is our target variable, and supervised learning algorithms cannot train on rows with a missing target. Because our analysis showed that these rows are not missing at random (they form the “silent complainer” segment), removing them means the retained sample likely understates dissatisfaction. This should be kept in mind when interpreting later results.

ryanair_clean <- ryanair[!is.na(ryanair$Overall.Rating), ]

cleaning_step1 <- data.frame(
  Action = "Remove rows with missing Overall.Rating",
  Rationale = "Target variable cannot be missing for supervised learning; removed segment shows systematic dissatisfaction",
  Original_Rows = nrow(ryanair),
  Final_Rows = nrow(ryanair_clean),
  Rows_Removed = nrow(ryanair) - nrow(ryanair_clean),
  Percentage_Removed = round((nrow(ryanair) - nrow(ryanair_clean)) / nrow(ryanair) * 100, 1),
  Impact = "Enables supervised learning; retained sample may understate dissatisfaction"
)

kable(cleaning_step1, caption = "Target Variable Cleaning Strategy")

Target Variable Cleaning Strategy

| Action | Rationale | Original_Rows | Final_Rows | Rows_Removed | Percentage_Removed | Impact |
|:---|:---|---:|---:|---:|---:|:---|
| Remove rows with missing Overall.Rating | Target variable cannot be missing for supervised learning; removed segment shows systematic dissatisfaction | 2249 | 2119 | 130 | 5.8 | Enables supervised learning; retained sample may understate dissatisfaction |

3.2 Step 2: Handle High Missingness Service Columns

Why We Impute with 0 and Create Flags: For services with extreme missingness (85-88%), we impute 0 and add flags because:

  • Such high missingness indicates most passengers didn’t use these paid services
  • Imputing with median/mean would falsely suggest service usage
  • 0 clearly indicates “service not used/not rated”
  • Flags preserve the crucial distinction between “didn’t use” and “used but rated poorly”, so later analysis can separate service non-usage from low ratings

# Create NotRated flags for high missingness services before imputation
ryanair_clean <- ryanair_clean %>%
  mutate(
    Inflight_Entertainment_NotRated = as.integer(is.na(Inflight.Entertainment)),
    Wifi_Connectivity_NotRated = as.integer(is.na(Wifi...Connectivity))
  )

# Impute high missingness services with 0
ryanair_clean$Inflight.Entertainment[is.na(ryanair_clean$Inflight.Entertainment)] <- 0
ryanair_clean$Wifi...Connectivity[is.na(ryanair_clean$Wifi...Connectivity)] <- 0

service_cleaning <- data.frame(
  Service = c("Inflight Entertainment", "WiFi Connectivity"),
  Missingness_Before = c(85.3, 88.1),
  Imputation_Strategy = "Impute with 0 + Create NotRated flags",
  Rationale = "Extreme missingness indicates service non-usage; 0 preserves business meaning, flags track data quality",
  Impact = "Creates zero-inflated distributions with quality tracking; enables service adoption and satisfaction analysis"
)

kable(service_cleaning, caption = "Strategy for Services with Extreme Missingness")

Strategy for Services with Extreme Missingness

| Service | Missingness_Before | Imputation_Strategy | Rationale | Impact |
|:---|---:|:---|:---|:---|
| Inflight Entertainment | 85.3 | Impute with 0 + Create NotRated flags | Extreme missingness indicates service non-usage; 0 preserves business meaning, flags track data quality | Creates zero-inflated distributions with quality tracking; enables service adoption and satisfaction analysis |
| WiFi Connectivity | 88.1 | Impute with 0 + Create NotRated flags | Extreme missingness indicates service non-usage; 0 preserves business meaning, flags track data quality | Creates zero-inflated distributions with quality tracking; enables service adoption and satisfaction analysis |
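
To illustrate why the flags matter once zeros have been imputed, the sketch below uses a toy WiFi rating vector (made-up values, with hypothetical variable names). It shows how the flag recovers the adoption rate and the satisfaction among actual raters, whereas a naive mean over the imputed column mixes the two.

```r
# Toy WiFi ratings: NA means the passenger left the field blank
wifi <- c(NA, NA, 4, NA, 2, NA, NA, 3)

wifi_not_rated <- as.integer(is.na(wifi))    # 1 = originally missing
wifi_imputed   <- ifelse(is.na(wifi), 0, wifi)

# Share of passengers who rated the service at all
rating_rate <- mean(wifi_not_rated == 0)

# Satisfaction among raters only; imputed zeros are excluded via the flag
mean_among_raters <- mean(wifi_imputed[wifi_not_rated == 0])

# Naive mean over all rows mixes non-usage with dissatisfaction
naive_mean <- mean(wifi_imputed)

cat("Rating rate:", rating_rate, "\n")
cat("Mean among raters:", mean_among_raters, "\n")
cat("Naive mean incl. zeros:", naive_mean, "\n")
```

Any summary or model that uses the imputed columns without the flag will read the zeros as very low ratings, so the two should always travel together.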

3.3 Step 3: Median Imputation with Flags for Moderate Missingness

Why We Use Median with Flags: For moderate missingness (30-42%), we use median imputation with flags because:

  • Median is robust to outliers in rating data (1-5 scale)
  • Flags preserve information about original data quality
  • Allows controlling for imputation effects in statistical models
  • Better than mean for potentially skewed rating distributions

# Create flags before imputation
ryanair_clean <- ryanair_clean %>%
  mutate(
    Food_Beverages_NotRated = as.integer(is.na(Food...Beverages)),
    Ground_Service_NotRated = as.integer(is.na(Ground.Service))
  )

# Perform median imputation
food_median <- median(ryanair_clean$Food...Beverages, na.rm = TRUE)
ground_median <- median(ryanair_clean$Ground.Service, na.rm = TRUE)

ryanair_clean$Food...Beverages[is.na(ryanair_clean$Food...Beverages)] <- food_median
ryanair_clean$Ground.Service[is.na(ryanair_clean$Ground.Service)] <- ground_median

median_cleaning <- data.frame(
  Service = c("Food & Beverages", "Ground Service"),
  Missingness_Percentage = c(41.7, 29.8),
  Imputation_Value = c(food_median, ground_median),
  Strategy = "Median imputation with flags",
  Rationale = "Moderate missingness with mixed causes; flags preserve data quality information",
  Impact = "Complete dataset with quality controls; enables sensitivity analysis"
)

kable(median_cleaning, caption = "Median Imputation Strategy for Moderate Missingness")

Median Imputation Strategy for Moderate Missingness

| Service | Missingness_Percentage | Imputation_Value | Strategy | Rationale | Impact |
|:---|---:|---:|:---|:---|:---|
| Food & Beverages | 41.7 | 2 | Median imputation with flags | Moderate missingness with mixed causes; flags preserve data quality information | Complete dataset with quality controls; enables sensitivity analysis |
| Ground Service | 29.8 | 1 | Median imputation with flags | Moderate missingness with mixed causes; flags preserve data quality information | Complete dataset with quality controls; enables sensitivity analysis |
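
One way the flags allow “controlling for imputation effects” is to include them as covariates alongside the imputed rating. The sketch below fits a linear model on a small made-up data frame (toy, overall, food and food_not_rated are illustrative names and values). The linear model is an assumption for demonstration only, not the model used later in the project.

```r
# Toy data, not the real reviews
toy <- data.frame(
  overall = c(2, 8, 5, 1, 9, 3, 7, 4, 6, 2),
  food    = c(1, 4, NA, 1, 5, NA, 4, 2, NA, 1)
)

# Flag first, then impute the median of the observed ratings
toy$food_not_rated <- as.integer(is.na(toy$food))
toy$food[is.na(toy$food)] <- median(toy$food, na.rm = TRUE)

# The flag coefficient absorbs the average shift for imputed rows, so the
# food slope is determined by passengers who actually rated the service
fit <- lm(overall ~ food + food_not_rated, data = toy)
print(coef(fit))

# Cross-check: slope from raters only
fit_raters <- lm(overall ~ food, data = toy[toy$food_not_rated == 0, ])
print(coef(fit_raters))
```

Because every imputed row carries the same rating value, the flag lets those rows be fitted by their own mean, and the food slope matches the one estimated from raters alone.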

3.4 Step 4: Simple Median Imputation for Low Missingness

Why Simple Median Suffices: For low missingness (<6%), simple median imputation is appropriate because:

  • Minimal impact on variable distributions
  • Preserves overall data patterns and relationships
  • Standard practice for low-level missing data
  • Simple and computationally efficient
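
The claim of minimal distributional impact can be checked directly. The sketch below uses a toy rating vector with an exaggerated share of missing values (30%, far more than the real columns) and a hypothetical helper impute_median() to show what median imputation preserves and what it changes.

```r
# Toy ratings with an exaggerated share of missing values
seat_comfort <- c(1, 1, 2, 2, 3, 4, 5, NA, NA, NA)

# Hypothetical helper, equivalent in spirit to a simple median fill
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

seat_imputed <- impute_median(seat_comfort)

# The median is unchanged by adding copies of itself ...
median_before <- median(seat_comfort, na.rm = TRUE)
median_after  <- median(seat_imputed)

# ... but the spread shrinks, because the filled values add no variation
sd_before <- sd(seat_comfort, na.rm = TRUE)
sd_after  <- sd(seat_imputed)

cat("Median before/after:", median_before, median_after, "\n")
cat("SD before/after:", round(sd_before, 2), round(sd_after, 2), "\n")
```

With only about 5% of values filled, the shrinkage in spread is much smaller than in this toy case, which is why the simple approach is acceptable here.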

# Simple median imputation for low missingness
simple_impute <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  return(x)
}

ryanair_clean$Seat.Comfort <- simple_impute(ryanair_clean$Seat.Comfort)
ryanair_clean$Cabin.Staff.Service <- simple_impute(ryanair_clean$Cabin.Staff.Service)
ryanair_clean$Value.For.Money <- simple_impute(ryanair_clean$Value.For.Money)

final_missing_check <- data.frame(
  Column = names(ryanair_clean),
  Missing_After_Cleaning = colSums(is.na(ryanair_clean))
) %>%
  filter(Missing_After_Cleaning > 0)

if(nrow(final_missing_check) == 0) {
  cat("SUCCESS: All missing values have been handled. Dataset is now complete for analysis.\n\n")
} else {
  kable(final_missing_check, caption = "Remaining Missing Values After Cleaning")
}

SUCCESS: All missing values have been handled. Dataset is now complete for analysis.

Cleaning Completion Summary:

  • Target variable: 100% complete
  • Service ratings: All missing values handled with appropriate strategies
  • New analytical flags: Created for data quality tracking
  • Dataset ready for: Machine learning, statistical analysis, business intelligence

4. Analysis: Service Rating Patterns

4.1 Service Rating Completion Analysis

Why We Analyze Service Rating Patterns: Understanding which services passengers consistently rate provides insights into customer engagement patterns and service usage. This helps identify which services are core to the passenger experience versus optional/add-on services.

# Calculate service rating completion rates using ORIGINAL data (before imputation)
service_ratings <- data.frame(
  Service = c("Seat Comfort", "Cabin Staff", "Food & Beverages", "Ground Service", 
              "Inflight Entertainment", "WiFi Connectivity"),
  Rated_Count = c(
    sum(!is.na(ryanair$Seat.Comfort[!is.na(ryanair$Overall.Rating)])),
    sum(!is.na(ryanair$Cabin.Staff.Service[!is.na(ryanair$Overall.Rating)])),
    sum(!is.na(ryanair$Food...Beverages[!is.na(ryanair$Overall.Rating)])),
    sum(!is.na(ryanair$Ground.Service[!is.na(ryanair$Overall.Rating)])),
    sum(!is.na(ryanair$Inflight.Entertainment[!is.na(ryanair$Overall.Rating)])),
    sum(!is.na(ryanair$Wifi...Connectivity[!is.na(ryanair$Overall.Rating)]))
  ),
  Total_Passengers = nrow(ryanair_clean)
) %>%
  mutate(
    Rating_Rate = round(Rated_Count / Total_Passengers * 100, 1),
    Service_Type = ifelse(Service %in% c("Seat Comfort", "Cabin Staff", "Ground Service"), 
                         "Core Experience", "Additional Services")
  )

# Visualization with Ryanair colors
ggplot(service_ratings, aes(x = reorder(Service, Rating_Rate), 
                             y = Rating_Rate, 
                             fill = Service_Type)) +
  geom_bar(stat = "identity", alpha = 0.9) +
  scale_fill_manual(values = c("Core Experience" = ryanair_blue, "Additional Services" = ryanair_yellow)) +
  geom_text(aes(label = paste0(Rating_Rate, "%")), hjust = -0.2, size = 3.5, color = ryanair_blue) +
  labs(title = "Ryanair Service Rating Patterns",
       subtitle = "Core experience services show high rating completion, while additional services have lower engagement",
       x = "Service Type",
       y = "Percentage of Passengers Who Rated (%)",
       fill = "Service Category") +
  theme_minimal() +
  coord_flip() +
  scale_y_continuous(limits = c(0, 100), expand = expansion(mult = c(0, 0.1))) +
  theme(
    plot.title = element_text(color = ryanair_blue, face = "bold"),
    plot.subtitle = element_text(color = "darkgray"),
    legend.position = "bottom"
  )

Interpretation of Service Rating Patterns:

Core Experience Services (High Rating Rates):

  • Seat Comfort (95%) and Cabin Staff (94%): Nearly all passengers rate these core services, indicating they are fundamental to the flight experience
  • Ground Service (75%): Also a core service with substantial engagement

Additional Services (Variable Rating/Usage):

  • Food & Beverages (56%): Moderate rating completion suggests these are used by a subset of passengers
  • Inflight Entertainment (16%) and WiFi Connectivity (13%): Very low rating rates suggest these are niche services with limited adoption
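
The completion rates above can also be computed in one vectorised step. The sketch below is a compact alternative shown on a toy data frame (values are made up; only the column names mirror the dataset), assuming the same restriction to rows with a non-missing overall rating.

```r
# Toy data frame; values are illustrative only
toy <- data.frame(
  Overall.Rating      = c(3, NA, 7, 1, 9),
  Seat.Comfort        = c(2, 1, NA, 1, 5),
  Wifi...Connectivity = c(NA, NA, 3, NA, NA)
)

# Keep only passengers who gave an overall rating
rated <- toy[!is.na(toy$Overall.Rating), ]

# The column mean of a logical "is rated" matrix is the share of rated entries
service_cols <- c("Seat.Comfort", "Wifi...Connectivity")
rating_rate <- round(colMeans(!is.na(rated[service_cols])) * 100, 1)
print(rating_rate)
```

This avoids repeating the subsetting expression once per service and scales to any number of rating columns.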

Business Implications:

  • Focus on maintaining excellence in core experience services
  • Opportunity to improve promotion and adoption of additional services
  • Consider bundling or pricing strategies for underutilized services

5. Final Dataset Quality Assessment

final_metrics <- data.frame(
  Metric = c("Original Dataset Size", "Final Dataset Size", "Data Retention Rate", 
             "Target Variable Completeness", "Service Variables Completeness",
             "New Analytical Flags Created", "Ready for Machine Learning"),
  Value = c(
    paste(nrow(ryanair), "rows"),
    paste(nrow(ryanair_clean), "rows"),
    paste(round(nrow(ryanair_clean)/nrow(ryanair)*100, 1), "%"),
    "100%",
    "100%",
    "4 flags (NotRated indicators)",
    "Yes"
  )
)

kable(final_metrics, caption = "Final Dataset Quality Metrics")

Final Dataset Quality Metrics

| Metric | Value |
|:---|:---|
| Original Dataset Size | 2249 rows |
| Final Dataset Size | 2119 rows |
| Data Retention Rate | 94.2 % |
| Target Variable Completeness | 100% |
| Service Variables Completeness | 100% |
| New Analytical Flags Created | 4 flags (NotRated indicators) |
| Ready for Machine Learning | Yes |

Key Business Insights Extracted from Data Quality Analysis

  1. ‘Silent Complainer’ Segment Identified: 130 extremely dissatisfied customers who skip overall ratings
  2. Service Adoption Patterns: Clear distinction between core experience services and additional services
  3. Revenue Opportunity: Significant potential in promoting underutilized services like WiFi and Inflight Entertainment
  4. Customer Experience Focus: Cabin Staff and Seat Comfort are critical to passenger satisfaction
  5. Data Quality Transformation: Successfully converted data quality issues into actionable business intelligence
# Save final dataset
write.csv(ryanair_clean, "ryanair_reviews_cleaned.csv", row.names = FALSE)
cat("Final cleaned dataset saved as: 'ryanair_reviews_cleaned.csv'\n")

Final cleaned dataset saved as: 'ryanair_reviews_cleaned.csv'

6. Conclusion: From Data Quality Challenges to Strategic Assets

This comprehensive analysis demonstrates how methodological rigor in data cleaning can transform data quality issues into valuable business intelligence. Rather than treating missing data as a problem to be eliminated, we:

  • Uncovered hidden customer segments through systematic missingness analysis
  • Preserved business meaning through strategic imputation methods
  • Created analytical infrastructure with quality control flags
  • Extracted actionable insights about service adoption and customer satisfaction

The cleaned dataset is now optimized for advanced analytics including predictive modeling, customer segmentation, and service quality optimization, providing Ryanair with data-driven insights for strategic decision-making.