Week5A

Author

Sinem K Moschos

Week 5A Approach

1. Introduction

For this assignment, I am working with airline delay data.The table shows two airlines (Alaska and Am West) and their flight results across five cities. For each city, it shows how many flights were on time and how many were delayed.

The goal of this assignment is to: • recreate this table as a CSV file, • read it into R, • clean and tidy the data, • compare the delay percentages, • and explain why overall results might look different from city by city results.

2. Creating the Data File

First, I will recreate the table as it in the image. I create a CSV file in wide format, meaning: • One column for Airline • One column for Status (on time or delayed) • One column for each city

I also keep the blank airline cells like the original image. For example, the second row under Alaska does not repeat the airline name. I keep that blank because the assignment requires recreating the file in the same format. After creating the CSV file, I upload it to my public GitHub repository so it can be accessed.

3. Reading and Cleaning the Data

After reading the CSV file into R, I notice that some airline cells are empty.Those empty cells actually belong to the airline written above them. So I will fill in the missing airline names to make the dataset complete. This step makes the data easier to work with and prevents problems during analysis.

4. Transforming the Data (Wide to Long)

The original data is in wide format because each city is a separate column. For analysis, I will convert the dataset to long format. In the long format: • Each row represents one airline, one city, one status, and one count.

5. Analysis Plan

After cleaning and transforming the data, I will: 1. Compare overall delay percentages between the two airlines. 2. Compare delay percentages across the five cities. 3. Summarize what I observe in both comparisons. 4. Explain why the overall comparison might tell a different story than the city by city comparison.

The main focus is on percentages, not counts, because percentages give a more fair comparison.

6. Expected Approach

It is possible that one airline looks better overall, but not better in every city. This can happen because the airlines may operate different numbers of flights in each city. When totals are combined, the results can look different than when each city is examined separately.

Code Base

1. Load Libraries

I used tidyverse because it includes dplyr and tidyr, which are helpful for cleaning and transforming data.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2. Read CSV from GitHub

I read the dataset directly from my public GitHub repository using the raw file link. : https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/Week5/Week5A/airline_delays_wide.csv

file_url <- "https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/Week5/Week5A/airline_delays_wide.csv"
airline_data <- read_csv(file_url)

Rows: 4 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Airline, Status
dbl (5): Los Angeles, Phoenix, San Diego, San Francisco, Seattle

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

airline_data

# A tibble: 4 × 7
  Airline Status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
  <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
1 ALASKA  on time           497     221         212             503    1841
2 <NA>    delayed            62      12          20             102     305
3 AM WEST on time           694    4840         383             320     201
4 <NA>    delayed           117     415          65             129      61

33. Fill Missing Airline Names

The blank airline cells belong to the airline above them. I used fill() to copy the airline name downward.

airline_data <- airline_data %>%
  fill(Airline)
airline_data

# A tibble: 4 × 7
  Airline Status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
  <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
1 ALASKA  on time           497     221         212             503    1841
2 ALASKA  delayed            62      12          20             102     305
3 AM WEST on time           694    4840         383             320     201
4 AM WEST delayed           117     415          65             129      61

4. Transform Wide -> Long Format

Right now, each city is a column. I converted it to long format so each row represents: Airline + Status + City + Count

airline_long <- airline_data %>%
  pivot_longer(
    cols = c(`Los Angeles`, Phoenix, `San Diego`, `San Francisco`, Seattle),
    names_to = "City",
    values_to = "Count"
  )
airline_long

# A tibble: 20 × 4
   Airline Status  City          Count
   <chr>   <chr>   <chr>         <dbl>
 1 ALASKA  on time Los Angeles     497
 2 ALASKA  on time Phoenix         221
 3 ALASKA  on time San Diego       212
 4 ALASKA  on time San Francisco   503
 5 ALASKA  on time Seattle        1841
 6 ALASKA  delayed Los Angeles      62
 7 ALASKA  delayed Phoenix          12
 8 ALASKA  delayed San Diego        20
 9 ALASKA  delayed San Francisco   102
10 ALASKA  delayed Seattle         305
11 AM WEST on time Los Angeles     694
12 AM WEST on time Phoenix        4840
13 AM WEST on time San Diego       383
14 AM WEST on time San Francisco   320
15 AM WEST on time Seattle         201
16 AM WEST delayed Los Angeles     117
17 AM WEST delayed Phoenix         415
18 AM WEST delayed San Diego        65
19 AM WEST delayed San Francisco   129
20 AM WEST delayed Seattle          61

5. Overall Delay Percentage Comparison

First, calculate totals per airline.

overall_summary <- airline_long %>%
  group_by(Airline, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = Status, values_from = Total) %>%
  mutate(
    Total_Flights = `on time` + delayed,
    Delay_Percentage = delayed / Total_Flights * 100
  )
overall_summary

# A tibble: 2 × 5
  Airline delayed `on time` Total_Flights Delay_Percentage
  <chr>     <dbl>     <dbl>         <dbl>            <dbl>
1 ALASKA      501      3274          3775             13.3
2 AM WEST     787      6438          7225             10.9

6. City-by-City Delay Percentage Comparison

Now calculate delay percentage per airline per city.

city_summary <- airline_long %>%
  group_by(Airline, City, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = Status, values_from = Total) %>%
  mutate(
    Total_Flights = `on time` + delayed,
    Delay_Percentage = delayed / Total_Flights * 100
  )
city_summary

# A tibble: 10 × 6
   Airline City          delayed `on time` Total_Flights Delay_Percentage
   <chr>   <chr>           <dbl>     <dbl>         <dbl>            <dbl>
 1 ALASKA  Los Angeles        62       497           559            11.1 
 2 ALASKA  Phoenix            12       221           233             5.15
 3 ALASKA  San Diego          20       212           232             8.62
 4 ALASKA  San Francisco     102       503           605            16.9 
 5 ALASKA  Seattle           305      1841          2146            14.2 
 6 AM WEST Los Angeles       117       694           811            14.4 
 7 AM WEST Phoenix           415      4840          5255             7.90
 8 AM WEST San Diego          65       383           448            14.5 
 9 AM WEST San Francisco     129       320           449            28.7 
10 AM WEST Seattle            61       201           262            23.3

7. Analysis

Overall, AM WEST shows a lower overall delay percentage compared to ALASKA. This means that when combining all flights across all cities, that airline appears to perform better in terms of on time arrivals.

However, when examining the delay percentages city by city, the results are different. In some cities, the airline that performed worse overall actually has a lower delay percentage.

The difference between overall results and city by city results happens because the airlines operate different numbers of flights in each city. If one airline flies more often to cities with naturally higher delay counts, it can increase its overall delay percentage. This shows how combining data can sometimes give a different conclusion than looking at each group separately.