Assignment 5A – Airline Delays: Approach
Introduction
This assignment examines airline on-time performance data for two airlines (Alaska and Amwest) across five cities. The source data is provided in a wide, non-tidy format that separates counts of “on time” and “delayed” flights by airline and destination.
The primary objective is to recreate the dataset, transform it into tidy format, and compare airline performance both overall and city-by-city using percentages rather than raw counts. A key goal is to analyze and explain any discrepancy between overall performance comparisons and city-level comparisons.
Planned Approach
1. Recreate the Dataset in Wide Format
I will first recreate the dataset programmatically in R in wide format to match the structure of the original file, including any blank header cells that appear in the source. After constructing the dataframe, I will export it to a CSV file using write.csv() and upload it to a publicly accessible GitHub repository. The analysis will then read the dataset from the GitHub raw link to ensure full reproducibility.
The dataset will include:
- Airline (Alaska, Amwest)
- Flight Status (On Time, Delayed)
- Five destination cities:
- Los Angeles
- Phoenix
- San Diego
- San Francisco
- Seattle
2. Load Data and Address Missing Structure
After loading the CSV into R, I will:
- Inspect column names and structure
- Programmatically handle any blank or missing values
- Ensure counts are numeric
- Verify totals per airline and per city
If necessary, I will write code to populate or correct missing structural values to ensure the dataset is complete and consistent before transformation.
3. Transform Data from Wide to Long Format
Using tidyr and dplyr, I will convert the dataset into tidy (long) format so that:
- Each row represents one airline city status combination
- Variables include:
- airline
- city
- status (on_time or delayed)
- count
This transformation will make percentage calculations and comparisons straightforward and reproducible.
4. Compute Overall Airline Performance (Percent-Based)
To compare airlines overall, I will:
- Calculate total flights per airline
- Compute percentage delayed (and/or percentage on time)
- Present results using a summary table and/or visualization
The focus will be on percentages rather than counts, since airlines may operate different numbers of flights.
5. Compare Airline Performance by City
Next, I will use percentage delayed as the primary performance metric for both overall and city-level comparisons.
This comparison will be presented using:
- Grouped bar charts to visualize city-level percentage differences
- Narrative interpretation summarizing patterns
This step allows for a more granular evaluation of airline performance.
6. Investigate and Explain the Discrepancy
If the overall comparison produces a different conclusion than the city-by-city comparison, I will explain the discrepancy.
This phenomenon may arise due to differences in flight volume distribution across cities. Since overall percentages are weighted averages, differences in how many flights each airline operates in each city can change the aggregated result.
This analysis will highlight how aggregation can sometimes produce misleading conclusions when underlying group distributions differ.
Anticipated Data Challenges
- The original dataset is not tidy and requires restructuring.
- Blank header cells may require cleaning before transformation.
- Percent calculations must use correct denominators.
- Care must be taken to avoid comparing raw counts instead of rates.
- Interpretation must clearly distinguish between aggregated and stratified comparisons.
Reproducibility Plan
- The CSV file will be stored in a public GitHub repository.
- All transformations will be fully reproducible using
tidyversefunctions. - No local file paths will be used.
- All code will run in a clean R environment.
This structured approach ensures that the dataset is recreated faithfully, transformed properly, analyzed rigorously, and interpreted clearly.