This assignment examines airline on-time performance data for two airlines (Alaska and Amwest) across five cities. The source data is provided in a wide, non-tidy format that separates counts of “on time” and “delayed” flights by airline and destination.
The primary objective is to recreate the dataset, transform it into tidy format, and compare airline performance both overall and city-by-city using percentages rather than raw counts. A key goal is to analyze and explain any discrepancy between overall performance comparisons and city-level comparisons.
Planned Approach
1. Recreate the Dataset in Wide Format
I will first recreate the dataset programmatically in R in wide format to match the structure of the original file, including any blank header cells that appear in the source. After constructing the dataframe, I will export it to a CSV file using write.csv() and upload it to a publicly accessible GitHub repository. The analysis will then read the dataset from the GitHub raw link to ensure full reproducibility.
The dataset will include:
Airline (Alaska, Amwest)
Flight Status (On Time, Delayed)
Five destination cities:
Los Angeles
Phoenix
San Diego
San Francisco
Seattle
2. Load Data and Address Missing Structure
After loading the CSV into R, I will:
Inspect column names and structure
Programmatically handle any blank or missing values
Ensure counts are numeric
Verify totals per airline and per city
If necessary, I will write code to populate or correct missing structural values to ensure the dataset is complete and consistent before transformation.
3. Transform Data from Wide to Long Format
Using tidyr and dplyr, I will convert the dataset into tidy (long) format so that:
Each row represents one airline city status combination
Variables include:
airline
city
status (on_time or delayed)
count
This transformation will make percentage calculations and comparisons straightforward and reproducible.
Compute percentage delayed (and/or percentage on time)
Present results using a summary table and/or visualization
The focus will be on percentages rather than counts, since airlines may operate different numbers of flights.
5. Compare Airline Performance by City
Next, I will use percentage delayed as the primary performance metric for both overall and city-level comparisons.
This comparison will be presented using:
Grouped bar charts to visualize city-level percentage differences
Narrative interpretation summarizing patterns
This step allows for a more granular evaluation of airline performance.
6. Investigate and Explain the Discrepancy
If the overall comparison produces a different conclusion than the city-by-city comparison, I will explain the discrepancy.
This phenomenon may arise due to differences in flight volume distribution across cities. Since overall percentages are weighted averages, differences in how many flights each airline operates in each city can change the aggregated result.
This analysis will highlight how aggregation can sometimes produce misleading conclusions when underlying group distributions differ.
Anticipated Data Challenges
The original dataset is not tidy and requires restructuring.
Blank header cells may require cleaning before transformation.
Percent calculations must use correct denominators.
Care must be taken to avoid comparing raw counts instead of rates.
Interpretation must clearly distinguish between aggregated and stratified comparisons.
Reproducibility Plan
The CSV file will be stored in a public GitHub repository.
All transformations will be fully reproducible using tidyverse functions.
No local file paths will be used.
All code will run in a clean R environment.
This structured approach ensures that the dataset is recreated faithfully, transformed properly, analyzed rigorously, and interpreted clearly.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
1) Recreate the dataset in wide format (and intentionally include missing structural values)
Below I create the dataset in wide format to match the original structure (cities as columns). To satisfy the “populate missing data” requirement, I intentionally leave the airline name as NA on the second row of each airline block and later fill it in programmatically.
# A tibble: 20 × 4
Airline Status City Count
<fct> <fct> <fct> <int>
1 ALASKA delayed Los Angeles 62
2 ALASKA on_time Los Angeles 497
3 ALASKA delayed Phoenix 12
4 ALASKA on_time Phoenix 221
5 ALASKA delayed San Diego 20
6 ALASKA on_time San Diego 212
7 ALASKA delayed San Francisco 102
8 ALASKA on_time San Francisco 503
9 ALASKA delayed Seattle 305
10 ALASKA on_time Seattle 1841
11 AMWEST delayed Los Angeles 117
12 AMWEST on_time Los Angeles 694
13 AMWEST delayed Phoenix 415
14 AMWEST on_time Phoenix 4840
15 AMWEST delayed San Diego 65
16 AMWEST on_time San Diego 383
17 AMWEST delayed San Francisco 129
18 AMWEST on_time San Francisco 320
19 AMWEST delayed Seattle 61
20 AMWEST on_time Seattle 201
5) Count analysis (basic checks)
# Total flights by airlineairline_long %>%group_by(Airline) %>%summarise(total_flights =sum(Count), .groups ="drop")
# Total flights by airline and cityairline_long %>%group_by(Airline, City) %>%summarise(total_flights =sum(Count), .groups ="drop") %>%arrange(Airline, desc(total_flights))
# A tibble: 10 × 3
Airline City total_flights
<fct> <fct> <int>
1 ALASKA Seattle 2146
2 ALASKA San Francisco 605
3 ALASKA Los Angeles 559
4 ALASKA Phoenix 233
5 ALASKA San Diego 232
6 AMWEST Phoenix 5255
7 AMWEST Los Angeles 811
8 AMWEST San Francisco 449
9 AMWEST San Diego 448
10 AMWEST Seattle 262
Percentage comparisons
6) Overall comparison (percent delayed by airline)
overall_perf <- airline_long %>%group_by(Airline, Status) %>%summarise(n =sum(Count), .groups ="drop") %>%group_by(Airline) %>%mutate(total =sum(n),pct = n / total ) %>%ungroup()overall_perf
# A tibble: 4 × 5
Airline Status n total pct
<fct> <fct> <int> <int> <dbl>
1 ALASKA delayed 501 3775 0.133
2 ALASKA on_time 3274 3775 0.867
3 AMWEST delayed 787 7225 0.109
4 AMWEST on_time 6438 7225 0.891
# A tibble: 5 × 3
City Airline pct_delayed
<fct> <fct> <dbl>
1 Los Angeles ALASKA 0.111
2 Phoenix ALASKA 0.0515
3 San Diego ALASKA 0.0862
4 San Francisco ALASKA 0.169
5 Seattle ALASKA 0.142
From the overall comparison, AMWEST appears to perform better overall, with a lower delayed rate (~10.9%) than ALASKA (~13.3%). However, the city-by-city results show the opposite pattern: ALASKA has a lower delayed percentage in every one of the five cities. This difference between the aggregated (overall) result and the stratified (city-level) result is the key discrepancy explored next.
Explaining the discrepancy
9) Why overall and city-by-city comparisons can disagree
Overall rates are weighted by how many flights each airline has in each city. To show this, I compute each airline’s flight volume by city (weights).
# A tibble: 10 × 5
Airline City city_total airline_total weight
<fct> <fct> <int> <int> <dbl>
1 ALASKA Seattle 2146 3775 0.568
2 ALASKA San Francisco 605 3775 0.160
3 ALASKA Los Angeles 559 3775 0.148
4 ALASKA Phoenix 233 3775 0.0617
5 ALASKA San Diego 232 3775 0.0615
6 AMWEST Phoenix 5255 7225 0.727
7 AMWEST Los Angeles 811 7225 0.112
8 AMWEST San Francisco 449 7225 0.0621
9 AMWEST San Diego 448 7225 0.0620
10 AMWEST Seattle 262 7225 0.0363
city_delayed %>%left_join(weights, by =c("Airline", "City")) %>%select(Airline, City, pct_delayed, city_total, weight) %>%arrange(City, Airline)
# A tibble: 10 × 5
Airline City pct_delayed city_total weight
<fct> <fct> <dbl> <int> <dbl>
1 ALASKA Los Angeles 0.111 559 0.148
2 AMWEST Los Angeles 0.144 811 0.112
3 ALASKA Phoenix 0.0515 233 0.0617
4 AMWEST Phoenix 0.0790 5255 0.727
5 ALASKA San Diego 0.0862 232 0.0615
6 AMWEST San Diego 0.145 448 0.0620
7 ALASKA San Francisco 0.169 605 0.160
8 AMWEST San Francisco 0.287 449 0.0621
9 ALASKA Seattle 0.142 2146 0.568
10 AMWEST Seattle 0.233 262 0.0363
Conclusions
The overall comparison shows that AMWEST appears to perform better, with a lower overall percentage of delayed flights (approximately 10.9%) compared to ALASKA (approximately 13.3%). If we only considered the aggregated totals, we would conclude that AMWEST is the more reliable airline.
However, when we examine the results city-by-city, a different pattern emerges. Within each of the five cities (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle), ALASKA has a lower percentage of delayed flights than AMWEST. This means that ALASKA performs better in every individual city, even though it appears worse overall.
This discrepancy occurs because the overall delay rate is a weighted average across cities. AMWEST operates a much larger share of its flights in Phoenix, where both airlines have relatively lower delay rates. In contrast, ALASKA has more flights in cities with higher delay rates, such as Seattle and San Francisco. As a result, the aggregated comparison is heavily influenced by the distribution of flights across cities.
This example demonstrates how aggregation can produce misleading conclusions when underlying group distributions differ. To fairly compare airline performance, it is essential to examine stratified (city-level) results in addition to overall totals.