For this assignment, I am working with airline delay data.The table shows two airlines (Alaska and Am West) and their flight results across five cities. For each city, it shows how many flights were on time and how many were delayed.
The goal of this assignment is to: • recreate this table as a CSV file, • read it into R, • clean and tidy the data, • compare the delay percentages, • and explain why overall results might look different from city by city results.
2. Creating the Data File
First, I will recreate the table as it in the image. I create a CSV file in wide format, meaning: • One column for Airline • One column for Status (on time or delayed) • One column for each city
I also keep the blank airline cells like the original image. For example, the second row under Alaska does not repeat the airline name. I keep that blank because the assignment requires recreating the file in the same format. After creating the CSV file, I upload it to my public GitHub repository so it can be accessed.
3. Reading and Cleaning the Data
After reading the CSV file into R, I notice that some airline cells are empty.Those empty cells actually belong to the airline written above them. So I will fill in the missing airline names to make the dataset complete. This step makes the data easier to work with and prevents problems during analysis.
4. Transforming the Data (Wide to Long)
The original data is in wide format because each city is a separate column. For analysis, I will convert the dataset to long format. In the long format: • Each row represents one airline, one city, one status, and one count.
5. Analysis Plan
After cleaning and transforming the data, I will: 1. Compare overall delay percentages between the two airlines. 2. Compare delay percentages across the five cities. 3. Summarize what I observe in both comparisons. 4. Explain why the overall comparison might tell a different story than the city by city comparison.
The main focus is on percentages, not counts, because percentages give a more fair comparison.
6. Expected Approach
It is possible that one airline looks better overall, but not better in every city. This can happen because the airlines may operate different numbers of flights in each city. When totals are combined, the results can look different than when each city is examined separately.
Code Base
1. Load Libraries
I used tidyverse because it includes dplyr and tidyr, which are helpful for cleaning and transforming data.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
2. Read CSV from GitHub
I read the dataset directly from my public GitHub repository using the raw file link. : https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/Week5/Week5A/airline_delays_wide.csv
Rows: 4 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Airline, Status
dbl (5): Los Angeles, Phoenix, San Diego, San Francisco, Seattle
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airline_data
# A tibble: 4 × 7
Airline Status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALASKA on time 497 221 212 503 1841
2 <NA> delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 <NA> delayed 117 415 65 129 61
33. Fill Missing Airline Names
The blank airline cells belong to the airline above them. I used fill() to copy the airline name downward.
# A tibble: 20 × 4
Airline Status City Count
<chr> <chr> <chr> <dbl>
1 ALASKA on time Los Angeles 497
2 ALASKA on time Phoenix 221
3 ALASKA on time San Diego 212
4 ALASKA on time San Francisco 503
5 ALASKA on time Seattle 1841
6 ALASKA delayed Los Angeles 62
7 ALASKA delayed Phoenix 12
8 ALASKA delayed San Diego 20
9 ALASKA delayed San Francisco 102
10 ALASKA delayed Seattle 305
11 AM WEST on time Los Angeles 694
12 AM WEST on time Phoenix 4840
13 AM WEST on time San Diego 383
14 AM WEST on time San Francisco 320
15 AM WEST on time Seattle 201
16 AM WEST delayed Los Angeles 117
17 AM WEST delayed Phoenix 415
18 AM WEST delayed San Diego 65
19 AM WEST delayed San Francisco 129
20 AM WEST delayed Seattle 61
# A tibble: 10 × 6
Airline City delayed `on time` Total_Flights Delay_Percentage
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ALASKA Los Angeles 62 497 559 11.1
2 ALASKA Phoenix 12 221 233 5.15
3 ALASKA San Diego 20 212 232 8.62
4 ALASKA San Francisco 102 503 605 16.9
5 ALASKA Seattle 305 1841 2146 14.2
6 AM WEST Los Angeles 117 694 811 14.4
7 AM WEST Phoenix 415 4840 5255 7.90
8 AM WEST San Diego 65 383 448 14.5
9 AM WEST San Francisco 129 320 449 28.7
10 AM WEST Seattle 61 201 262 23.3
7. Analysis
Overall, AM WEST shows a lower overall delay percentage compared to ALASKA. This means that when combining all flights across all cities, that airline appears to perform better in terms of on time arrivals.
However, when examining the delay percentages city by city, the results are different. In some cities, the airline that performed worse overall actually has a lower delay percentage.
The difference between overall results and city by city results happens because the airlines operate different numbers of flights in each city. If one airline flies more often to cities with naturally higher delay counts, it can increase its overall delay percentage. This shows how combining data can sometimes give a different conclusion than looking at each group separately.