arr_delay <- read_csv("Tidying and Transforming - Arrival Delays.csv", show_col_types = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
print(arr_delay)
## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  On Time           497     221         212             503    1841
## 2 <NA>    Delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST On Time           694    4840         383             320     201
## 5 <NA>    Delayed           117     415          65             129      61

Our dataset contains some inconsistencies, including missing values and blank cells used for spacing within the sheet. To begin cleaning, we’ll rename the first two variables, which represent the airline and flight status. Next, we’ll reorganize the remaining variables and group them by city to ensure the values accurately reflect the observations.

tidy_arr_delay <- arr_delay %>% 
  fill(`...1`, `...2`) %>% 
  filter(!is.na(`...1`) | !is.na(`...2`)) %>%
  pivot_longer(
    cols = `Los Angeles`:`Seattle`,
    names_to = "City",
    values_to = "Num_Flights"
  ) %>%
  rename(Airline = ...1, Status = ...2) %>% 
  drop_na(Num_Flights, Airline)
print(tidy_arr_delay)
## # A tibble: 20 × 4
##    Airline Status  City          Num_Flights
##    <chr>   <chr>   <chr>               <dbl>
##  1 ALASKA  On Time Los Angeles           497
##  2 ALASKA  On Time Phoenix               221
##  3 ALASKA  On Time San Diego             212
##  4 ALASKA  On Time San Francisco         503
##  5 ALASKA  On Time Seattle              1841
##  6 ALASKA  Delayed Los Angeles            62
##  7 ALASKA  Delayed Phoenix                12
##  8 ALASKA  Delayed San Diego              20
##  9 ALASKA  Delayed San Francisco         102
## 10 ALASKA  Delayed Seattle               305
## 11 AM WEST On Time Los Angeles           694
## 12 AM WEST On Time Phoenix              4840
## 13 AM WEST On Time San Diego             383
## 14 AM WEST On Time San Francisco         320
## 15 AM WEST On Time Seattle               201
## 16 AM WEST Delayed Los Angeles           117
## 17 AM WEST Delayed Phoenix               415
## 18 AM WEST Delayed San Diego              65
## 19 AM WEST Delayed San Francisco         129
## 20 AM WEST Delayed Seattle                61

Below, we aggregate the data. Initially, I intended to do this but noticed that the numbers weren’t summing correctly. With the following code, the flight totals are accurately calculated.

flights_sum <- tidy_arr_delay %>%
  filter(Airline %in% c("ALASKA", "AM WEST")) %>%
  group_by(Airline, Status) %>%
  summarise(Total_Flights = sum(Num_Flights)) %>%
  ungroup()
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
view(flights_sum)

Now, we’re calculating the delay percentage for each airline to better compare their performance. Alaska airlines’ delay percentage comes out to 13%, while AM West’s delay percentage is about 11%.

flight_percent <- tidy_arr_delay %>%
  mutate(
    Total_Flights = sum(Num_Flights),
    Delayed_Flights = sum(Num_Flights[Status == "Delayed"])
  )
print(flight_percent)
## # A tibble: 20 × 6
##    Airline Status  City          Num_Flights Total_Flights Delayed_Flights
##    <chr>   <chr>   <chr>               <dbl>         <dbl>           <dbl>
##  1 ALASKA  On Time Los Angeles           497         11000            1288
##  2 ALASKA  On Time Phoenix               221         11000            1288
##  3 ALASKA  On Time San Diego             212         11000            1288
##  4 ALASKA  On Time San Francisco         503         11000            1288
##  5 ALASKA  On Time Seattle              1841         11000            1288
##  6 ALASKA  Delayed Los Angeles            62         11000            1288
##  7 ALASKA  Delayed Phoenix                12         11000            1288
##  8 ALASKA  Delayed San Diego              20         11000            1288
##  9 ALASKA  Delayed San Francisco         102         11000            1288
## 10 ALASKA  Delayed Seattle               305         11000            1288
## 11 AM WEST On Time Los Angeles           694         11000            1288
## 12 AM WEST On Time Phoenix              4840         11000            1288
## 13 AM WEST On Time San Diego             383         11000            1288
## 14 AM WEST On Time San Francisco         320         11000            1288
## 15 AM WEST On Time Seattle               201         11000            1288
## 16 AM WEST Delayed Los Angeles           117         11000            1288
## 17 AM WEST Delayed Phoenix               415         11000            1288
## 18 AM WEST Delayed San Diego              65         11000            1288
## 19 AM WEST Delayed San Francisco         129         11000            1288
## 20 AM WEST Delayed Seattle                61         11000            1288
flight_delayed_percentage <- flight_percent %>%
  mutate(Delay_Percentage = (Delayed_Flights / Total_Flights) * 100)
print(flight_delayed_percentage)
## # A tibble: 20 × 7
##    Airline Status  City          Num_Flights Total_Flights Delayed_Flights
##    <chr>   <chr>   <chr>               <dbl>         <dbl>           <dbl>
##  1 ALASKA  On Time Los Angeles           497         11000            1288
##  2 ALASKA  On Time Phoenix               221         11000            1288
##  3 ALASKA  On Time San Diego             212         11000            1288
##  4 ALASKA  On Time San Francisco         503         11000            1288
##  5 ALASKA  On Time Seattle              1841         11000            1288
##  6 ALASKA  Delayed Los Angeles            62         11000            1288
##  7 ALASKA  Delayed Phoenix                12         11000            1288
##  8 ALASKA  Delayed San Diego              20         11000            1288
##  9 ALASKA  Delayed San Francisco         102         11000            1288
## 10 ALASKA  Delayed Seattle               305         11000            1288
## 11 AM WEST On Time Los Angeles           694         11000            1288
## 12 AM WEST On Time Phoenix              4840         11000            1288
## 13 AM WEST On Time San Diego             383         11000            1288
## 14 AM WEST On Time San Francisco         320         11000            1288
## 15 AM WEST On Time Seattle               201         11000            1288
## 16 AM WEST Delayed Los Angeles           117         11000            1288
## 17 AM WEST Delayed Phoenix               415         11000            1288
## 18 AM WEST Delayed San Diego              65         11000            1288
## 19 AM WEST Delayed San Francisco         129         11000            1288
## 20 AM WEST Delayed Seattle                61         11000            1288
## # ℹ 1 more variable: Delay_Percentage <dbl>
ggplot(flight_delayed_percentage, aes(x = Airline, y = Delay_Percentage)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ City) +
  labs(title = "Flight Status by Percentage for Each Airline",
       x = "Airline", 
       y = "Percentage (%)") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + 
  theme_minimal()

The bar graph depicts that AM West has a lot more flights compared to Alaska Airlines. Most of AM West flights run on time. As for delayed flights, Alaska Airlines and AM West have nearly the same amount of delayed flights, which is significant because AM West has more than 3,000 flights on Alaska Airlines. It’s important to note that Alaska Airlines primarily services those who live in Alaska, compared to a more popular airline such as AM West.

ggplot(flights_sum, aes(x = Airline, y = Total_Flights, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "On Time vs Delayed Flights for Alaska and AM West", 
       x = "Airline", 
       y = "Total Flights") +
  theme_minimal()

I also wanted to compare flight status by city. For AM West, the majority of on-time flights were to Phoenix, which also had the highest number of delays. Alaska Airlines saw most of its on-time flights landing in Seattle, which, like Phoenix, had the greatest number of delays. This is likely because these cities handle a higher volume of flights overall.

ggplot(tidy_arr_delay, aes(x = Airline, y = Num_Flights, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ City)