Introduction

This analysis uses a richly expanded dataset of arrival delays, cancellations, and diversions for four major airlines across five West Coast cities. We will walk step-by-step through advanced data tidying, handling missing values, and providing meaningful visual and tabular comparisons both overall and by city. Please note that additinal features were created in the csv to further showcase the process of analysis.


Data Source

The dataset below was created for this project and contains intentionally missing values and a richer set of airlines and statuses.
Key features:
- Four airlines: ALASKA, AM WEST, UNITED, DELTA
- Four statuses: on time, delayed, canceled, diverted
- Five cities
- Some values are missing (blank cells)

Airline,Status,Los Angeles,Phoenix,San Diego,San Francisco,Seattle
ALASKA,on time,497,221,212,503,1841
ALASKA,delayed,62,12,20,102,305
ALASKA,canceled,2,,1,,3
ALASKA,diverted,0,0,,1,
AM WEST,on time,694,4840,383,320,201
AM WEST,delayed,117,415,65,129,61
AM WEST,canceled,3,12,0,1,
AM WEST,diverted,1,,2,0,0
UNITED,on time,320,355,190,400,1800
UNITED,delayed,45,33,25,80,260
UNITED,canceled,2,1,2,1,0
UNITED,diverted,0,0,0,2,
DELTA,on time,610,210,200,355,1655
DELTA,delayed,55,18,19,70,198
DELTA,canceled,1,,3,1,
DELTA,diverted,0,1,0,0,0

Step 1: Reading the Data

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)

# This chunk loads the required libraries for data manipulation and visualization.
# - readr: for reading in CSV files
# - dplyr: for data wrangling and transformation
# - tidyr: for data tidying (wide to long format, handling missing values)
# - ggplot2: for creating visualizations
# - knitr: for rendering tables in the RMarkdown output

# Read the expanded data
delays_wide <- read_csv("arrival_delays.csv")

# Display the original data as a table to confirm its structure and content.
kable(delays_wide, caption = "Raw Expanded Data (Wide Format)")
Raw Expanded Data (Wide Format)
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
ALASKA canceled 2 NA 1 NA 3
ALASKA diverted 0 0 NA 1 NA
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61
AM WEST canceled 3 12 0 1 NA
AM WEST diverted 1 NA 2 0 0
UNITED on time 320 355 190 400 1800
UNITED delayed 45 33 25 80 260
UNITED canceled 2 1 2 1 0
UNITED diverted 0 0 0 2 NA
DELTA on time 610 210 200 355 1655
DELTA delayed 55 18 19 70 198
DELTA canceled 1 NA 3 1 NA
DELTA diverted 0 1 0 0 0

Description:
This code loads all necessary libraries and reads the expanded CSV file into R. We also preview the data to confirm its structure and presence of missing values.


Step 2: Handling and Filling Missing Data

Identifying Missing Data

# This code chunk calculates the number of missing (NA) values in each column of our data.
# This helps us understand where data might be missing and guides how we should handle it.

missing_summary <- sapply(delays_wide, function(x) sum(is.na(x)))
kable(as.data.frame(t(missing_summary)), caption = "Number of Missing Values per Column")
Number of Missing Values per Column
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
0 0 0 3 1 1 4

Description:
We use sapply to count missing values in every column. This diagnostic step is essential for documenting data quality and planning cleaning strategies.


Filling Missing Data

# This code chunk fills all missing numeric values with 0.
# The assumption is that a missing cell represents 0 flights of that status for that airline/city.
# We use dplyr's mutate and across to apply replace_na across all numeric columns.

delays_filled <- delays_wide %>%
  mutate(across(where(is.numeric), ~replace_na(., 0)))
kable(delays_filled, caption = "Missing Data Filled with 0")
Missing Data Filled with 0
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
ALASKA canceled 2 0 1 0 3
ALASKA diverted 0 0 0 1 0
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61
AM WEST canceled 3 12 0 1 0
AM WEST diverted 1 0 2 0 0
UNITED on time 320 355 190 400 1800
UNITED delayed 45 33 25 80 260
UNITED canceled 2 1 2 1 0
UNITED diverted 0 0 0 2 0
DELTA on time 610 210 200 355 1655
DELTA delayed 55 18 19 70 198
DELTA canceled 1 0 3 1 0
DELTA diverted 0 1 0 0 0

Description:
Here, missing numeric values (NA) are replaced with zeros using mutate(across(...)). This is appropriate for summary tables where a blank cell implies “zero flights” for that combination.


Step 3: Tidying Data (Wide to Long Format)

# This chunk converts the data from wide format (one column per city) to long (one row per observation).
# We use tidyr's pivot_longer to gather all city columns into two columns: City and Count.
# This tidy format is essential for flexible analysis and visualization.

delays_long <- delays_filled %>%
  pivot_longer(
    cols = c(`Los Angeles`, Phoenix, `San Diego`, `San Francisco`, Seattle),
    names_to = "City",
    values_to = "Count"
  )
kable(head(delays_long, 12), caption = "First 12 Rows of Tidy Data")
First 12 Rows of Tidy Data
Airline Status City Count
ALASKA on time Los Angeles 497
ALASKA on time Phoenix 221
ALASKA on time San Diego 212
ALASKA on time San Francisco 503
ALASKA on time Seattle 1841
ALASKA delayed Los Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San Diego 20
ALASKA delayed San Francisco 102
ALASKA delayed Seattle 305
ALASKA canceled Los Angeles 2
ALASKA canceled Phoenix 0

Description:
pivot_longer() is used to create a tidy dataset, where each row represents a single airline/status/city/count observation. This structure is ideal for grouped analysis and plotting.


Step 4: Overall Analysis (Counts and Percentages)

Overall Totals and Percentages

# This chunk computes the total and percentage for each Airline/Status combination, across all cities.
# The pipeline groups by airline and status, then computes totals.
# Percentages are calculated within each airline, so each airline's statuses sum to 100%.

overall_summary <- delays_long %>%
  group_by(Airline, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  group_by(Airline) %>%
  mutate(Percent = round(100 * Total / sum(Total), 1))
kable(overall_summary, caption = "Overall Counts and Percentages by Airline and Status")
Overall Counts and Percentages by Airline and Status
Airline Status Total Percent
ALASKA canceled 6 0.2
ALASKA delayed 501 13.2
ALASKA diverted 1 0.0
ALASKA on time 3274 86.6
AM WEST canceled 16 0.2
AM WEST delayed 787 10.9
AM WEST diverted 3 0.0
AM WEST on time 6438 88.9
DELTA canceled 5 0.1
DELTA delayed 360 10.6
DELTA diverted 1 0.0
DELTA on time 3030 89.2
UNITED canceled 6 0.2
UNITED delayed 443 12.6
UNITED diverted 2 0.1
UNITED on time 3065 87.2

Description:
This code produces both total counts and percentages of each status per airline, offering a quick way to compare overall performance between airlines.


Visualization: Overall Percentages

# This chunk creates a bar plot of the percentage of each status (on time, delayed, canceled, diverted) by airline.
# We use ggplot2 for a clear, color-coded presentation. Each airline's statuses are side-by-side for easy comparison.

ggplot(overall_summary, aes(x = Airline, y = Percent, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  labs(title = "Overall Percentages by Airline and Status",
       y = "Percentage (%)", x = "Airline") +
  scale_fill_brewer(palette = "Set1")

Description:
This visualization provides an at-a-glance comparison of how each airline performs overall in terms of on-time, delayed, canceled, and diverted flights, highlighting both strengths and weaknesses.


Step 5: City-by-City Analysis (Counts and Percentages)

# This chunk calculates totals and percentages for each Airline/City/Status combination.
# By grouping by airline and city, we can spot city-specific issues or patterns.

city_summary <- delays_long %>%
  group_by(Airline, City, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  group_by(Airline, City) %>%
  mutate(Percent = round(100 * Total / sum(Total), 1))
kable(head(city_summary, 15), caption = "Sample City-by-City Counts and Percentages")
Sample City-by-City Counts and Percentages
Airline City Status Total Percent
ALASKA Los Angeles canceled 2 0.4
ALASKA Los Angeles delayed 62 11.1
ALASKA Los Angeles diverted 0 0.0
ALASKA Los Angeles on time 497 88.6
ALASKA Phoenix canceled 0 0.0
ALASKA Phoenix delayed 12 5.2
ALASKA Phoenix diverted 0 0.0
ALASKA Phoenix on time 221 94.8
ALASKA San Diego canceled 1 0.4
ALASKA San Diego delayed 20 8.6
ALASKA San Diego diverted 0 0.0
ALASKA San Diego on time 212 91.0
ALASKA San Francisco canceled 0 0.0
ALASKA San Francisco delayed 102 16.8
ALASKA San Francisco diverted 1 0.2

Description:
This table breaks down the data to the city level, allowing detailed performance analysis for every airline in every city.


Visualization: City-by-City Percentages

# This chunk creates a faceted bar plot for status percentages in each city, separated by airline.
# Each subplot represents an airline, with bars for each status in each city.

ggplot(city_summary, aes(x = City, y = Percent, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~Airline) +
  labs(title = "Arrival Status Percentages by Airline and City",
       y = "Percentage (%)", x = "City") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Description:
Faceted bar plots provide a powerful way to spot city-specific strengths and weaknesses for each airline, making disparities more visible than in overall summaries.


Step 6: Analysis & Interpretation

Narrative

  • Overall:
    Each airline’s on-time, delayed, canceled, and diverted rates are now visible in both count and percent, helping us see which airline is most reliable overall.
  • By City:
    City-level analysis can reveal, for example, that an airline may be excellent overall but have issues in a particular city.
  • Missing Data:
    By filling missing values with zero, we ensure our analysis is comprehensive and not biased by blank cells. This is a standard approach for summary tables.
  • Discrepancy Between Overall and City-by-City:
    Sometimes, an airline with strong performance in one high-volume city can appear better overall, masking weaknesses elsewhere. City-by-city analysis reveals these local effects.
  • Explanation:
    This is related to Simpson’s Paradox; aggregated data can mask or reverse trends seen in subgroups. Thus, both levels of analysis are needed for a fair comparison.

Step 7: Conclusion

Key points: - The dataset was expanded for realism, including multiple airlines, more statuses, and missing data. - Missing data was detected, filled, and justified. - Data was tidied from wide to long for analysis flexibility. - Both overall and city-by-city analyses were performed, with visual and tabular summaries. - Discrepancy between overall and detailed analysis was discussed and explained.


Submission Details