Assignment 5A – Airline Delays: Approach

Author

Muhammad Suffyan Khan

Published

February 26, 2026

Introduction

This assignment examines airline on-time performance data for two airlines (Alaska and Amwest) across five cities. The source data is provided in a wide, non-tidy format that separates counts of “on time” and “delayed” flights by airline and destination.

The primary objective is to recreate the dataset, transform it into tidy format, and compare airline performance both overall and city-by-city using percentages rather than raw counts. A key goal is to analyze and explain any discrepancy between overall performance comparisons and city-level comparisons.


Planned Approach

1. Recreate the Dataset in Wide Format

I will first recreate the dataset programmatically in R in wide format to match the structure of the original file, including any blank header cells that appear in the source. After constructing the dataframe, I will export it to a CSV file using write.csv() and upload it to a publicly accessible GitHub repository. The analysis will then read the dataset from the GitHub raw link to ensure full reproducibility.

The dataset will include:

  • Airline (Alaska, Amwest)
  • Flight Status (On Time, Delayed)
  • Five destination cities:
    • Los Angeles
    • Phoenix
    • San Diego
    • San Francisco
    • Seattle

2. Load Data and Address Missing Structure

After loading the CSV into R, I will:

  • Inspect column names and structure
  • Programmatically handle any blank or missing values
  • Ensure counts are numeric
  • Verify totals per airline and per city

If necessary, I will write code to populate or correct missing structural values to ensure the dataset is complete and consistent before transformation.


3. Transform Data from Wide to Long Format

Using tidyr and dplyr, I will convert the dataset into tidy (long) format so that:

  • Each row represents one airline city status combination
  • Variables include:
    • airline
    • city
    • status (on_time or delayed)
    • count

This transformation will make percentage calculations and comparisons straightforward and reproducible.


4. Compute Overall Airline Performance (Percent-Based)

To compare airlines overall, I will:

  • Calculate total flights per airline
  • Compute percentage delayed (and/or percentage on time)
  • Present results using a summary table and/or visualization

The focus will be on percentages rather than counts, since airlines may operate different numbers of flights.


5. Compare Airline Performance by City

Next, I will use percentage delayed as the primary performance metric for both overall and city-level comparisons.

This comparison will be presented using:

  • Grouped bar charts to visualize city-level percentage differences
  • Narrative interpretation summarizing patterns

This step allows for a more granular evaluation of airline performance.


6. Investigate and Explain the Discrepancy

If the overall comparison produces a different conclusion than the city-by-city comparison, I will explain the discrepancy.

This phenomenon may arise due to differences in flight volume distribution across cities. Since overall percentages are weighted averages, differences in how many flights each airline operates in each city can change the aggregated result.

This analysis will highlight how aggregation can sometimes produce misleading conclusions when underlying group distributions differ.


Anticipated Data Challenges

  • The original dataset is not tidy and requires restructuring.
  • Blank header cells may require cleaning before transformation.
  • Percent calculations must use correct denominators.
  • Care must be taken to avoid comparing raw counts instead of rates.
  • Interpretation must clearly distinguish between aggregated and stratified comparisons.

Reproducibility Plan

  • The CSV file will be stored in a public GitHub repository.
  • All transformations will be fully reproducible using tidyverse functions.
  • No local file paths will be used.
  • All code will run in a clean R environment.

This structured approach ensures that the dataset is recreated faithfully, transformed properly, analyzed rigorously, and interpreted clearly.