In this section, we perform essential data preprocessing steps to ensure the dataset is ready for analysis. This includes examining the data structure, handling missing values, and preparing the data for statistical analysis and visualization.
# Load libraries
library(tidyverse)
library(ggplot2)
library(knitr)
# Load data
ryanair <- read.csv("ryanair_reviews.csv", stringsAsFactors = FALSE)
# Display dataset dimensions and basic information
dataset_info <- data.frame(
Metric = c("Total Rows", "Total Columns", "Date Range", "Rating Scale"),
Value = c(
nrow(ryanair),
ncol(ryanair),
paste(range(ryanair$Date.Published, na.rm = TRUE), collapse = " to "),
"1-10 (Overall), 1-5 (Service Ratings)"
)
)
kable(dataset_info, caption = "Basic Dataset Information")
| Metric | Value |
|---|---|
| Total Rows | 2249 |
| Total Columns | 21 |
| Date Range | 2012-08-23 to 2024-02-03 |
| Rating Scale | 1-10 (Overall), 1-5 (Service Ratings) |
# Check for duplicate records
duplicate_count <- sum(duplicated(ryanair))
cat("### Duplicate Records Check\n")
cat("Duplicate records found:", duplicate_count, "\n\n")
Duplicate records found: 0
if(duplicate_count == 0) {
cat("No duplicates found - proceeding with analysis\n\n")
}
No duplicates found - proceeding with analysis
Code Explanation: We load the necessary libraries and dataset, then display basic information about the dataset structure including dimensions and date range.
Now we’ll address missing values in the dataset. Handling missing values is essential to ensure data integrity and prevent errors during analysis. First, let’s identify which columns have missing values and understand the patterns of missingness.
# Calculate missing values statistics
missing_summary <- data.frame(
Column = names(ryanair),
Missing_Count = colSums(is.na(ryanair)),
Missing_Percentage = round(colSums(is.na(ryanair)) / nrow(ryanair) * 100, 1)
) %>%
filter(Missing_Count > 0) %>%
arrange(desc(Missing_Percentage))
kable(missing_summary, caption = "Columns with Missing Values")
|  | Column | Missing_Count | Missing_Percentage |
|---|---|---|---|
| Wifi...Connectivity | Wifi...Connectivity | 1981 | 88.1 |
| Inflight.Entertainment | Inflight.Entertainment | 1918 | 85.3 |
| Food...Beverages | Food...Beverages | 937 | 41.7 |
| Ground.Service | Ground.Service | 671 | 29.8 |
| Overall.Rating | Overall.Rating | 130 | 5.8 |
| Cabin.Staff.Service | Cabin.Staff.Service | 121 | 5.4 |
| Seat.Comfort | Seat.Comfort | 112 | 5.0 |
| Value.For.Money | Value.For.Money | 1 | 0.0 |
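The table above counts missing values column by column. To see whether columns tend to be missing together, one option is to tabulate row-level missingness patterns. The following is a minimal sketch on a small made-up data frame (the `toy` values are illustrative assumptions, not taken from the Ryanair data):

```r
# Toy data standing in for a few rating columns (values are illustrative only)
toy <- data.frame(
  Overall.Rating = c(8, NA, 3, NA, 6),
  Ground.Service = c(4, NA, NA, NA, 2),
  Seat.Comfort   = c(3, 1, 2, NA, 4)
)

# One pattern string per row: "1" marks a missing cell, "0" an observed one
na_matrix <- is.na(toy)
pattern <- apply(na_matrix, 1, function(row) paste(as.integer(row), collapse = ""))

# Count how often each combination of missing columns occurs
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)
```

Applied to the real columns, frequent patterns with several "1"s in the same positions would suggest that those fields are skipped together rather than independently.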
Why We Analyze Categorical Variables: Understanding the distribution and missingness in categorical variables helps us identify data quality issues and potential biases in our dataset.
# Output categorical variables
cat_vars <- sapply(ryanair, is.character)
cat_vars_names <- names(ryanair)[cat_vars]
cat_vars_names
[1] "Date.Published"    "Passenger.Country" "Trip_verified"
[4] "Comment.title"     "Comment"           "Aircraft"
[7] "Type.Of.Traveller" "Seat.Type"         "Origin"
[10] "Destination"      "Date.Flown"        "Recommended"
# Convert empty strings or "null" to NA
ryanair[cat_vars_names] <- lapply(ryanair[cat_vars_names], function(x) {
x[x == "" | x == " " | x == "null"] <- NA
return(x)
})
# Check missing values after conversion
cat("Missing values in categorical variables after conversion:\n")
Missing values in categorical variables after conversion:
cat_missing_after <- sapply(ryanair[cat_vars_names], function(x) sum(is.na(x)))
cat_missing_after_df <- data.frame(
Variable = names(cat_missing_after),
Missing_Count = cat_missing_after
)
kable(cat_missing_after_df, caption = "Missing Values in Categorical Variables After Conversion")
|  | Variable | Missing_Count |
|---|---|---|
| Date.Published | Date.Published | 0 |
| Passenger.Country | Passenger.Country | 0 |
| Trip_verified | Trip_verified | 944 |
| Comment.title | Comment.title | 0 |
| Comment | Comment | 0 |
| Aircraft | Aircraft | 1697 |
| Type.Of.Traveller | Type.Of.Traveller | 614 |
| Seat.Type | Seat.Type | 0 |
| Origin | Origin | 615 |
| Destination | Destination | 615 |
| Date.Flown | Date.Flown | 618 |
| Recommended | Recommended | 0 |
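The conversion step matches `""`, `" "`, and `"null"` exactly. If the raw file might also contain longer runs of whitespace or differently cased variants such as `"NULL"` (an assumption, not something observed in this dataset), a slightly more tolerant helper could look like the sketch below. The function name `blank_to_na` is hypothetical.

```r
# Hypothetical helper: treat blank, whitespace-only, and "null" strings
# (in any letter case) as NA, leaving all other values untouched
blank_to_na <- function(x) {
  cleaned <- trimws(x)
  is_blank <- !is.na(cleaned) & (cleaned == "" | tolower(cleaned) == "null")
  x[is_blank] <- NA
  x
}

example <- c("Boeing 737", "", "   ", "NULL", "null", NA, " Solo Leisure ")
result <- blank_to_na(example)
print(result)
```

Only the matching entries are replaced; values with surrounding spaces, such as `" Solo Leisure "`, are kept as they are.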
# Handle missing categorical fields
for (var in cat_vars_names) {
ryanair[[var]][is.na(ryanair[[var]])] <- "Unknown"
}
# Verify no missing categorical values
cat("\nFinal verification - missing values after handling:\n")
Final verification - missing values after handling:
final_missing_cat <- sapply(ryanair[cat_vars_names], function(x) sum(is.na(x)))
final_missing_df <- data.frame(
Variable = names(final_missing_cat),
Missing_Count = final_missing_cat
)
kable(final_missing_df, caption = "Final Missing Values in Categorical Variables")
|  | Variable | Missing_Count |
|---|---|---|
| Date.Published | Date.Published | 0 |
| Passenger.Country | Passenger.Country | 0 |
| Trip_verified | Trip_verified | 0 |
| Comment.title | Comment.title | 0 |
| Comment | Comment | 0 |
| Aircraft | Aircraft | 0 |
| Type.Of.Traveller | Type.Of.Traveller | 0 |
| Seat.Type | Seat.Type | 0 |
| Origin | Origin | 0 |
| Destination | Destination | 0 |
| Date.Flown | Date.Flown | 0 |
| Recommended | Recommended | 0 |
cat("\n✓ All categorical variables now have 0 missing values\n")
✓ All categorical variables now have 0 missing values
Background and Methodology: Before proceeding with data cleaning, we conduct a systematic analysis to understand if missing values occur randomly or follow specific patterns. This investigation is crucial because non-random missingness can indicate underlying business insights and influence our data cleaning strategy.
We focus particularly on the Overall.Rating column (our target variable) to determine if passengers who skip providing an overall rating differ systematically from those who complete it.
# Ryanair brand colors
ryanair_blue <- "#073590"
ryanair_yellow <- "#FFD200"
# Define the groups for comparison
missing_target_rows <- ryanair[is.na(ryanair$Overall.Rating), ]
complete_rows <- ryanair[!is.na(ryanair$Overall.Rating), ]
# Analyze ALL service ratings for comprehensive comparison
all_service_ratings <- c("Seat.Comfort", "Cabin.Staff.Service", "Food...Beverages",
"Ground.Service", "Value.For.Money", "Inflight.Entertainment",
"Wifi...Connectivity")
# Calculate means for both groups
missing_services <- colMeans(missing_target_rows[all_service_ratings], na.rm = TRUE)
complete_services <- colMeans(complete_rows[all_service_ratings], na.rm = TRUE)
# Create comprehensive comparison
comprehensive_comparison <- data.frame(
Service = c("Seat Comfort", "Cabin Staff", "Food & Beverages", "Ground Service",
"Value for Money", "Inflight Entertainment", "WiFi Connectivity"),
Missing_Target = round(missing_services, 2),
Complete_Target = round(complete_services, 2),
Difference = round(complete_services - missing_services, 2)
) %>%
mutate(
Percent_Difference = round((Difference / Complete_Target) * 100, 1)
)
kable(comprehensive_comparison, caption = "Comprehensive Service Rating Comparison: Missing vs Complete Overall Ratings")
|  | Service | Missing_Target | Complete_Target | Difference | Percent_Difference |
|---|---|---|---|---|---|
| Seat.Comfort | Seat Comfort | 1.75 | 2.41 | 0.66 | 27.4 |
| Cabin.Staff.Service | Cabin Staff | 1.56 | 2.82 | 1.26 | 44.7 |
| Food...Beverages | Food & Beverages | 0.78 | 2.05 | 1.26 | 61.5 |
| Ground.Service | Ground Service | NaN | 2.16 | NaN | NaN |
| Value.For.Money | Value for Money | 1.23 | 2.82 | 1.59 | 56.4 |
| Inflight.Entertainment | Inflight Entertainment | NaN | 1.16 | NaN | NaN |
| Wifi...Connectivity | WiFi Connectivity | NaN | 1.12 | NaN | NaN |
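The `NaN` entries in the Missing_Target column arise because `colMeans(..., na.rm = TRUE)` has nothing left to average when every value in a column is `NA` for that group. A small sketch with made-up values shows the behavior:

```r
# Toy group in which one service was never rated (values are illustrative)
toy_group <- data.frame(
  Ground.Service = c(NA_real_, NA_real_, NA_real_),
  Seat.Comfort   = c(2, NA, 4)
)

# With na.rm = TRUE, an all-NA column yields NaN rather than a number
group_means <- colMeans(toy_group, na.rm = TRUE)
print(group_means)

# Counting observed values per column makes the cause explicit
observed_counts <- colSums(!is.na(toy_group))
print(observed_counts)
```

A `NaN` in the comparison table therefore means "no ratings available in this group", not a rating of zero.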
# Prepare data for comprehensive visualization
comprehensive_viz <- comprehensive_comparison %>%
pivot_longer(cols = c(Missing_Target, Complete_Target),
names_to = "Group",
values_to = "Average_Rating") %>%
mutate(
Group = factor(Group, levels = c("Missing_Target", "Complete_Target"),
labels = c("Silent Complainers\n(Missing Overall Rating)",
"Typical Customers\n(Complete Data)")),
Service = factor(Service, levels = comprehensive_comparison$Service[order(comprehensive_comparison$Difference)])
)
ggplot(comprehensive_viz, aes(x = Service, y = Average_Rating, fill = Group)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
geom_text(aes(label = Average_Rating),
position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
scale_fill_manual(values = c(ryanair_yellow, ryanair_blue)) +
labs(title = "Comprehensive Service Rating Analysis: Silent Complainers vs Typical Customers",
subtitle = "Passengers with missing Overall Ratings show consistent dissatisfaction across major service types",
y = "Average Rating (1-5 scale)",
x = "Service Dimension") +
theme_minimal() +
theme(legend.position = "bottom",
plot.title = element_text(color = ryanair_blue, face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1))
Why We Use Statistical Testing: To determine if the observed differences are statistically significant (not due to random chance), we conduct Welch two-sample t-tests, which is what t.test() runs by default and which does not assume equal variances in the two groups. This helps validate whether the “silent complainer” pattern is a real phenomenon worthy of business attention.
# Perform statistical tests for services showing differences
t_test_seat <- t.test(complete_rows$Seat.Comfort, missing_target_rows$Seat.Comfort)
t_test_staff <- t.test(complete_rows$Cabin.Staff.Service, missing_target_rows$Cabin.Staff.Service)
t_test_value <- t.test(complete_rows$Value.For.Money, missing_target_rows$Value.For.Money)
# t.test() drops NA values itself, so no na.rm argument is needed
t_test_food <- t.test(complete_rows$Food...Beverages, missing_target_rows$Food...Beverages)
significance_results <- data.frame(
Service_Dimension = c("Seat Comfort", "Cabin Staff Service", "Value for Money", "Food & Beverages"),
P_Value = c(t_test_seat$p.value, t_test_staff$p.value, t_test_value$p.value, t_test_food$p.value),
Mean_Difference = c(
mean(complete_rows$Seat.Comfort, na.rm = TRUE) - mean(missing_target_rows$Seat.Comfort, na.rm = TRUE),
mean(complete_rows$Cabin.Staff.Service, na.rm = TRUE) - mean(missing_target_rows$Cabin.Staff.Service, na.rm = TRUE),
mean(complete_rows$Value.For.Money, na.rm = TRUE) - mean(missing_target_rows$Value.For.Money, na.rm = TRUE),
mean(complete_rows$Food...Beverages, na.rm = TRUE) - mean(missing_target_rows$Food...Beverages, na.rm = TRUE)
)
) %>%
mutate(
Significance = ifelse(P_Value < 0.001, "***",
ifelse(P_Value < 0.01, "**",
ifelse(P_Value < 0.05, "*", "ns")))
)
kable(significance_results, caption = "T-Test Results for Service Rating Differences")
| Service_Dimension | P_Value | Mean_Difference | Significance |
|---|---|---|---|
| Seat Comfort | 0 | 0.6597064 | *** |
| Cabin Staff Service | 0 | 1.2627859 | *** |
| Value for Money | 0 | 1.5931212 | *** |
| Food & Beverages | 0 | 1.2619159 | *** |
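Because the service ratings are ordinal scores on a short scale, a rank-based test can serve as a robustness check alongside the t-tests. The sketch below uses the Wilcoxon rank-sum test from base R on simulated ratings. This is a swapped-in technique rather than part of the original analysis, and the two toy vectors and their probabilities are assumptions chosen only for illustration.

```r
set.seed(42)

# Simulated 1-5 ratings for two groups (illustrative, not the Ryanair data)
complete_toy <- sample(1:5, 200, replace = TRUE,
                       prob = c(0.25, 0.20, 0.20, 0.20, 0.15))
missing_toy  <- sample(1:5, 40, replace = TRUE,
                       prob = c(0.60, 0.20, 0.10, 0.05, 0.05))

# Rank-sum test; exact = FALSE requests the normal approximation,
# which is the appropriate choice when the data contain many ties
rank_test <- wilcox.test(complete_toy, missing_toy, exact = FALSE)
print(rank_test$p.value)
```

If the rank-based test and the t-test agree on the real columns, the conclusion does not hinge on treating the ratings as interval data.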
Why We Use -log₁₀(p-value): We transform p-values using -log₁₀ for visualization because:
- Raw p-values like 6.68e-10 are extremely small and hard to interpret visually
- The -log₁₀ transformation puts p-values on a logarithmic scale: each tenfold decrease in the p-value adds one unit of bar height, so higher bars indicate greater significance
- This allows clear visualization of extremely significant results that would otherwise be microscopic on a raw p-value scale
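As a quick worked example of the transformation, the sketch below converts three p-values, including the 0.05 threshold used for the dashed reference line. It also shows an edge case worth keeping in mind: a p-value stored as exactly 0 maps to `Inf`.

```r
# Example p-values: the 0.05 threshold, a 0.001 result, and a very small one
p_values <- c(0.05, 0.001, 6.68e-10)

# Each tenfold drop in the p-value adds 1 to the transformed value
neg_log_p <- -log10(p_values)
print(round(neg_log_p, 2))  # 1.30 3.00 9.18

# Edge case: a p-value of exactly 0 cannot be drawn as a finite bar
zero_case <- -log10(0)
print(zero_case)            # Inf

# format.pval() keeps tiny p-values readable instead of rounding them to 0
print(format.pval(p_values, digits = 3))
```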
# Prepare data for visualization
viz_data <- significance_results %>%
mutate(
Neg_Log_Pvalue = -log10(P_Value),
Service_Dimension = factor(Service_Dimension,
levels = c("Value for Money", "Cabin Staff Service", "Seat Comfort", "Food & Beverages"))
)
ggplot(viz_data, aes(x = Service_Dimension, y = Neg_Log_Pvalue, fill = Mean_Difference)) +
geom_bar(stat = "identity", alpha = 0.9) +
scale_fill_gradient2(low = ryanair_yellow, high = ryanair_blue, mid = "white",
midpoint = 1, name = "Rating Difference") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "red", alpha = 0.7) +
geom_text(aes(label = paste0("p = ", format.pval(P_Value, digits = 3))),
hjust = -0.1, size = 3.5, color = ryanair_blue) +
geom_text(aes(label = paste0(round(Mean_Difference, 2), " points")),
hjust = 1.2, size = 4, fontface = "bold", color = "white") +
labs(title = "Statistical Significance of 'Silent Complainer' Differences",
subtitle = "Passengers with missing Overall Ratings are significantly more dissatisfied\n(-log10 transformation makes extreme significance visually clear)",
y = "-log10(p-value)\nHigher = More Statistically Significant",
x = "Service Dimension") +
theme_minimal() +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold"),
plot.subtitle = element_text(color = "darkgray")
)
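Discovery: Our analysis reveals that missing data in the Overall.Rating column is not random. Passengers who skip the overall rating give systematically lower scores on all four service dimensions they did rate (Seat Comfort, Cabin Staff Service, Food & Beverages, and Value for Money). None of them rated Ground Service, Inflight Entertainment, or WiFi Connectivity, so no comparison is possible for those three.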
Statistical Evidence:
- Four service dimensions show statistically significant differences (p < 0.0001)
- Value for Money shows the largest difference: silent complainers rate it 1.59 points lower
- Food & Beverages and Cabin Staff Service: about 1.26 points lower each
- Seat Comfort: 0.66 points lower
Strategic Implications:
1. Hidden Churn Risk: These 130 customers represent a segment at high risk of churn
2. Lost Feedback: Traditional analysis would miss this extreme dissatisfaction
3. Proactive Engagement Needed: Ryanair should implement targeted recovery strategies
4. Data Quality as Insight: Missing data patterns reveal hidden customer sentiment
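Why We Do This Step: The Overall.Rating is our primary variable for analysis. Supervised learning algorithms cannot train on rows with a missing target value, so these rows have to be removed before modeling. Because our analysis showed that the missing cases form a systematic “silent complainer” segment, the remaining data will look somewhat more favorable than the full set of reviews, which should be kept in mind when interpreting later results.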
ryanair_clean <- ryanair[!is.na(ryanair$Overall.Rating), ]
cleaning_step1 <- data.frame(
Action = "Remove rows with missing Overall.Rating",
Rationale = "Target variable cannot be missing for supervised learning; removed segment shows systematic dissatisfaction bias",
Original_Rows = nrow(ryanair),
Final_Rows = nrow(ryanair_clean),
Rows_Removed = nrow(ryanair) - nrow(ryanair_clean),
Percentage_Removed = round((nrow(ryanair) - nrow(ryanair_clean)) / nrow(ryanair) * 100, 1),
Impact = "Enables all supervised learning; prevents systematic bias"
)
kable(cleaning_step1, caption = "Target Variable Cleaning Strategy")
| Action | Rationale | Original_Rows | Final_Rows | Rows_Removed | Percentage_Removed | Impact |
|---|---|---|---|---|---|---|
| Remove rows with missing Overall.Rating | Target variable cannot be missing for supervised learning; removed segment shows systematic dissatisfaction bias | 2249 | 2119 | 130 | 5.8 | Enables all supervised learning; prevents systematic bias |
Why We Impute with 0 and Create Flags: For services with extreme missingness (85-88%), we use 0 imputation with flags because:
- Such high missingness indicates most passengers didn’t use these paid services
- Imputing with median/mean would falsely suggest service usage
- 0 clearly indicates “service not used/not rated”
- Flags preserve the crucial distinction between “didn’t use” vs “used but rated poorly”
- Creates analytical capability to distinguish between service non-usage and low ratings
### 3.2 Step 2: Handle High Missingness Service Columns
# Create NotRated flags for high missingness services before imputation
ryanair_clean <- ryanair_clean %>%
mutate(
Inflight_Entertainment_NotRated = as.integer(is.na(Inflight.Entertainment)),
Wifi_Connectivity_NotRated = as.integer(is.na(Wifi...Connectivity))
)
# Impute high missingness services with 0
ryanair_clean$Inflight.Entertainment[is.na(ryanair_clean$Inflight.Entertainment)] <- 0
ryanair_clean$Wifi...Connectivity[is.na(ryanair_clean$Wifi...Connectivity)] <- 0
service_cleaning <- data.frame(
Service = c("Inflight Entertainment", "WiFi Connectivity"),
Missingness_Before = c(85.3, 88.1),
Imputation_Strategy = "Impute with 0 + Create NotRated flags",
Rationale = "Extreme missingness indicates service non-usage; 0 preserves business meaning, flags track data quality",
Impact = "Creates zero-inflated distributions with quality tracking; enables service adoption and satisfaction analysis"
)
kable(service_cleaning, caption = "Strategy for Services with Extreme Missingness")
| Service | Missingness_Before | Imputation_Strategy | Rationale | Impact |
|---|---|---|---|---|
| Inflight Entertainment | 85.3 | Impute with 0 + Create NotRated flags | Extreme missingness indicates service non-usage; 0 preserves business meaning, flags track data quality | Creates zero-inflated distributions with quality tracking; enables service adoption and satisfaction analysis |
| WiFi Connectivity | 88.1 | Impute with 0 + Create NotRated flags | Extreme missingness indicates service non-usage; 0 preserves business meaning, flags track data quality | Creates zero-inflated distributions with quality tracking; enables service adoption and satisfaction analysis |
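To illustrate why the flags matter after zero imputation, the sketch below repeats the flag-then-impute pattern on a tiny made-up column and compares the naive mean with the mean among passengers who actually rated the service. The `toy` data frame and its values are assumptions for illustration only.

```r
# Toy WiFi ratings: half of the passengers did not rate the service
toy <- data.frame(Wifi = c(NA, 1, NA, 3, NA, 2))

# Create the flag first, then impute, in the same order as the cleaning code
toy$Wifi_NotRated <- as.integer(is.na(toy$Wifi))
toy$Wifi[is.na(toy$Wifi)] <- 0

# The naive mean mixes "not rated" zeros with real ratings
mean_all <- mean(toy$Wifi)                           # 1

# The flag recovers the mean among passengers who rated the service
mean_rated <- mean(toy$Wifi[toy$Wifi_NotRated == 0]) # 2

# The flag also gives the share of passengers who rated the service
rating_rate <- mean(toy$Wifi_NotRated == 0)          # 0.5

print(c(mean_all = mean_all, mean_rated = mean_rated, rating_rate = rating_rate))
```

Without the flag, the imputed zeros would be indistinguishable from genuinely poor scores and would pull the apparent satisfaction level down.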
Why We Use Median with Flags: For moderate missingness (30-42%), we use median imputation with flags because:
- Median is robust to outliers in rating data (1-5 scale)
- Flags preserve information about original data quality
- Allows controlling for imputation effects in statistical models
- Better than mean for potentially skewed rating distributions
# Create flags before imputation
ryanair_clean <- ryanair_clean %>%
mutate(
Food_Beverages_NotRated = as.integer(is.na(Food...Beverages)),
Ground_Service_NotRated = as.integer(is.na(Ground.Service))
)
# Perform median imputation
food_median <- median(ryanair_clean$Food...Beverages, na.rm = TRUE)
ground_median <- median(ryanair_clean$Ground.Service, na.rm = TRUE)
ryanair_clean$Food...Beverages[is.na(ryanair_clean$Food...Beverages)] <- food_median
ryanair_clean$Ground.Service[is.na(ryanair_clean$Ground.Service)] <- ground_median
median_cleaning <- data.frame(
Service = c("Food & Beverages", "Ground Service"),
Missingness_Percentage = c(41.7, 29.8),
Imputation_Value = c(food_median, ground_median),
Strategy = "Median imputation with flags",
Rationale = "Moderate missingness with mixed causes; flags preserve data quality information",
Impact = "Complete dataset with quality controls; enables sensitivity analysis"
)
kable(median_cleaning, caption = "Median Imputation Strategy for Moderate Missingness")
| Service | Missingness_Percentage | Imputation_Value | Strategy | Rationale | Impact |
|---|---|---|---|---|---|
| Food & Beverages | 41.7 | 2 | Median imputation with flags | Moderate missingness with mixed causes; flags preserve data quality information | Complete dataset with quality controls; enables sensitivity analysis |
| Ground Service | 29.8 | 1 | Median imputation with flags | Moderate missingness with mixed causes; flags preserve data quality information | Complete dataset with quality controls; enables sensitivity analysis |
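One way the flags can be used to control for imputation effects is to include them as covariates next to the imputed rating. The following is a minimal sketch on simulated data; the variable names, the simulated relationship, and the 40% missingness rate are all assumptions made for illustration, not results from the Ryanair dataset.

```r
set.seed(1)
n <- 200

# Simulate a service rating and an overall score that depends on it
true_food <- sample(1:5, n, replace = TRUE)
overall <- 2 * true_food + rnorm(n)

# Remove 40% of the service ratings at random to mimic moderate missingness
food <- true_food
food[sample(n, 80)] <- NA
toy <- data.frame(overall = overall, food = food)

# Flag first, then impute with the median of the observed ratings
toy$food_not_rated <- as.integer(is.na(toy$food))
toy$food[is.na(toy$food)] <- median(toy$food, na.rm = TRUE)

# The flag enters the model as its own term, so the imputed rows
# can shift the intercept instead of distorting the rating slope
fit <- lm(overall ~ food + food_not_rated, data = toy)
print(coef(summary(fit)))
```

Comparing this fit with one that omits `food_not_rated` is a simple form of the sensitivity analysis mentioned in the table.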
Why Simple Median Suffices: For low missingness (<6%), simple median imputation is appropriate because:
- Minimal impact on variable distributions
- Preserves overall data patterns and relationships
- Standard practice for low-level missing data
- Simple and computationally efficient
# Simple median imputation for low missingness
simple_impute <- function(x) {
x[is.na(x)] <- median(x, na.rm = TRUE)
return(x)
}
ryanair_clean$Seat.Comfort <- simple_impute(ryanair_clean$Seat.Comfort)
ryanair_clean$Cabin.Staff.Service <- simple_impute(ryanair_clean$Cabin.Staff.Service)
ryanair_clean$Value.For.Money <- simple_impute(ryanair_clean$Value.For.Money)
final_missing_check <- data.frame(
Column = names(ryanair_clean),
Missing_After_Cleaning = colSums(is.na(ryanair_clean))
) %>%
filter(Missing_After_Cleaning > 0)
if(nrow(final_missing_check) == 0) {
cat("SUCCESS: All missing values have been handled. Dataset is now complete for analysis.\n\n")
} else {
kable(final_missing_check, caption = "Remaining Missing Values After Cleaning")
}
SUCCESS: All missing values have been handled. Dataset is now complete for analysis.
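A quick way to check the "minimal impact" claim for any imputed column is to compare summary statistics before and after imputation. The sketch below does this on a small made-up vector (the `ratings` values are illustrative assumptions): the median is unchanged, while the spread shrinks slightly because the filled-in values all sit at the center.

```r
# Toy ratings with two missing values (illustrative only)
ratings <- c(1, 1, 2, 3, 3, 4, 5, NA, NA, 5)

# Median imputation, mirroring the simple_impute() logic
imputed <- ratings
imputed[is.na(imputed)] <- median(ratings, na.rm = TRUE)

# Compare center and spread before and after imputation
before <- c(median = median(ratings, na.rm = TRUE), sd = sd(ratings, na.rm = TRUE))
after  <- c(median = median(imputed), sd = sd(imputed))
print(round(rbind(before, after), 3))
```

With only a few percent of values missing, the reduction in spread is small, but the same comparison on the real columns makes that easy to confirm.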
Why We Analyze Service Rating Patterns: Understanding which services passengers consistently rate provides insights into customer engagement patterns and service usage. This helps identify which services are core to the passenger experience versus optional/add-on services.
# Calculate service rating completion rates using ORIGINAL data (before imputation)
service_ratings <- data.frame(
Service = c("Seat Comfort", "Cabin Staff", "Food & Beverages", "Ground Service",
"Inflight Entertainment", "WiFi Connectivity"),
Rated_Count = c(
sum(!is.na(ryanair$Seat.Comfort[!is.na(ryanair$Overall.Rating)])),
sum(!is.na(ryanair$Cabin.Staff.Service[!is.na(ryanair$Overall.Rating)])),
sum(!is.na(ryanair$Food...Beverages[!is.na(ryanair$Overall.Rating)])),
sum(!is.na(ryanair$Ground.Service[!is.na(ryanair$Overall.Rating)])),
sum(!is.na(ryanair$Inflight.Entertainment[!is.na(ryanair$Overall.Rating)])),
sum(!is.na(ryanair$Wifi...Connectivity[!is.na(ryanair$Overall.Rating)]))
),
Total_Passengers = nrow(ryanair_clean)
) %>%
mutate(
Rating_Rate = round(Rated_Count / Total_Passengers * 100, 1),
Service_Type = ifelse(Service %in% c("Seat Comfort", "Cabin Staff", "Ground Service"),
"Core Experience", "Additional Services")
)
# Visualization with Ryanair colors
ggplot(service_ratings, aes(x = reorder(Service, Rating_Rate),
y = Rating_Rate,
fill = Service_Type)) +
geom_bar(stat = "identity", alpha = 0.9) +
scale_fill_manual(values = c("Core Experience" = ryanair_blue, "Additional Services" = ryanair_yellow)) +
geom_text(aes(label = paste0(Rating_Rate, "%")), hjust = -0.2, size = 3.5, color = ryanair_blue) +
labs(title = "Ryanair Service Rating Patterns",
subtitle = "Core experience services show high rating completion, while additional services have lower engagement",
x = "Service Type",
y = "Percentage of Passengers Who Rated (%)",
fill = "Service Category") +
theme_minimal() +
coord_flip() +
scale_y_continuous(limits = c(0, 100), expand = expansion(mult = c(0, 0.1))) +
theme(
plot.title = element_text(color = ryanair_blue, face = "bold"),
plot.subtitle = element_text(color = "darkgray"),
legend.position = "bottom"
)
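The rating rates computed above can also be obtained more compactly with `colMeans()` on a logical matrix. The sketch below shows the idea on a small made-up data frame; the `toy` values are assumptions, and only the column names mirror the dataset.

```r
# Toy data: one row lacks an overall rating and is excluded first
toy <- data.frame(
  Overall.Rating      = c(8, NA, 3, 6),
  Seat.Comfort        = c(3, 1, NA, 4),
  Wifi...Connectivity = c(NA, NA, NA, 2)
)
rated_rows <- toy[!is.na(toy$Overall.Rating), ]

# Share of non-missing values per service column, as a percentage
service_cols <- c("Seat.Comfort", "Wifi...Connectivity")
rating_rate <- round(colMeans(!is.na(rated_rows[service_cols])) * 100, 1)
print(rating_rate)
```

This avoids repeating the same `sum(!is.na(...))` expression for each service and keeps the column list in one place.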
Core Experience Services (High Rating Rates):
- Seat Comfort (95%) and Cabin Staff (94%): Nearly all passengers rate these core services, indicating they are fundamental to the flight experience
- Ground Service (75%): Also a core service with substantial engagement

Additional Services (Variable Rating/Usage):
- Food & Beverages (56%): Moderate rating completion suggests these are used by a subset of passengers
- Inflight Entertainment (16%) and WiFi Connectivity (13%): Very low rating rates suggest these are niche services with limited adoption
Business Implications:
- Focus on maintaining excellence in core experience services
- Opportunity to improve promotion and adoption of additional services
- Consider bundling or pricing strategies for underutilized services
final_metrics <- data.frame(
Metric = c("Original Dataset Size", "Final Dataset Size", "Data Retention Rate",
"Target Variable Completeness", "Service Variables Completeness",
"New Analytical Flags Created", "Ready for Machine Learning"),
Value = c(
paste(nrow(ryanair), "rows"),
paste(nrow(ryanair_clean), "rows"),
paste(round(nrow(ryanair_clean)/nrow(ryanair)*100, 1), "%"),
"100%",
"100%",
"4 flags (NotRated indicators)",
"Yes"
)
)
kable(final_metrics, caption = "Final Dataset Quality Metrics")
| Metric | Value |
|---|---|
| Original Dataset Size | 2249 rows |
| Final Dataset Size | 2119 rows |
| Data Retention Rate | 94.2 % |
| Target Variable Completeness | 100% |
| Service Variables Completeness | 100% |
| New Analytical Flags Created | 4 flags (NotRated indicators) |
| Ready for Machine Learning | Yes |
# Save final dataset
write.csv(ryanair_clean, "ryanair_reviews_cleaned.csv", row.names = FALSE)
cat("Final cleaned dataset saved as: 'ryanair_reviews_cleaned.csv'\n")
Final cleaned dataset saved as: 'ryanair_reviews_cleaned.csv'
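This comprehensive analysis demonstrates how methodological rigor in data cleaning can transform data quality issues into valuable business intelligence. Rather than treating missing data only as a problem to be eliminated, we tested whether missingness in Overall.Rating was random, flagged unrated services before imputing them, and matched the imputation strategy to each column's level of missingness.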
The cleaned dataset is now optimized for advanced analytics including predictive modeling, customer segmentation, and service quality optimization, providing Ryanair with data-driven insights for strategic decision-making.