arr_delay <- read_csv("Tidying and Transforming - Arrival Delays.csv", show_col_types = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
print(arr_delay)
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA On Time 497 221 212 503 1841
## 2 <NA> Delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST On Time 694 4840 383 320 201
## 5 <NA> Delayed 117 415 65 129 61
Our dataset contains some inconsistencies, including missing values and blank cells used for spacing within the sheet. To begin cleaning, we’ll rename the first two variables, which represent the airline and flight status. Next, we’ll reorganize the remaining variables and group them by city to ensure the values accurately reflect the observations.
tidy_arr_delay <- arr_delay %>%
fill(`...1`, `...2`) %>%
filter(!is.na(`...1`) | !is.na(`...2`)) %>%
pivot_longer(
cols = `Los Angeles`:`Seattle`,
names_to = "City",
values_to = "Num_Flights"
) %>%
rename(Airline = ...1, Status = ...2) %>%
drop_na(Num_Flights, Airline)
print(tidy_arr_delay)
## # A tibble: 20 × 4
## Airline Status City Num_Flights
## <chr> <chr> <chr> <dbl>
## 1 ALASKA On Time Los Angeles 497
## 2 ALASKA On Time Phoenix 221
## 3 ALASKA On Time San Diego 212
## 4 ALASKA On Time San Francisco 503
## 5 ALASKA On Time Seattle 1841
## 6 ALASKA Delayed Los Angeles 62
## 7 ALASKA Delayed Phoenix 12
## 8 ALASKA Delayed San Diego 20
## 9 ALASKA Delayed San Francisco 102
## 10 ALASKA Delayed Seattle 305
## 11 AM WEST On Time Los Angeles 694
## 12 AM WEST On Time Phoenix 4840
## 13 AM WEST On Time San Diego 383
## 14 AM WEST On Time San Francisco 320
## 15 AM WEST On Time Seattle 201
## 16 AM WEST Delayed Los Angeles 117
## 17 AM WEST Delayed Phoenix 415
## 18 AM WEST Delayed San Diego 65
## 19 AM WEST Delayed San Francisco 129
## 20 AM WEST Delayed Seattle 61
Below, we aggregate the data. Initially, I intended to do this but noticed that the numbers weren’t summing correctly. With the following code, the flight totals are accurately calculated.
flights_sum <- tidy_arr_delay %>%
filter(Airline %in% c("ALASKA", "AM WEST")) %>%
group_by(Airline, Status) %>%
summarise(Total_Flights = sum(Num_Flights)) %>%
ungroup()
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
view(flights_sum)
Now, we’re calculating the delay percentage for each airline to better compare their performance. Alaska airlines’ delay percentage comes out to 13%, while AM West’s delay percentage is about 11%.
flight_percent <- tidy_arr_delay %>%
mutate(
Total_Flights = sum(Num_Flights),
Delayed_Flights = sum(Num_Flights[Status == "Delayed"])
)
print(flight_percent)
## # A tibble: 20 × 6
## Airline Status City Num_Flights Total_Flights Delayed_Flights
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA On Time Los Angeles 497 11000 1288
## 2 ALASKA On Time Phoenix 221 11000 1288
## 3 ALASKA On Time San Diego 212 11000 1288
## 4 ALASKA On Time San Francisco 503 11000 1288
## 5 ALASKA On Time Seattle 1841 11000 1288
## 6 ALASKA Delayed Los Angeles 62 11000 1288
## 7 ALASKA Delayed Phoenix 12 11000 1288
## 8 ALASKA Delayed San Diego 20 11000 1288
## 9 ALASKA Delayed San Francisco 102 11000 1288
## 10 ALASKA Delayed Seattle 305 11000 1288
## 11 AM WEST On Time Los Angeles 694 11000 1288
## 12 AM WEST On Time Phoenix 4840 11000 1288
## 13 AM WEST On Time San Diego 383 11000 1288
## 14 AM WEST On Time San Francisco 320 11000 1288
## 15 AM WEST On Time Seattle 201 11000 1288
## 16 AM WEST Delayed Los Angeles 117 11000 1288
## 17 AM WEST Delayed Phoenix 415 11000 1288
## 18 AM WEST Delayed San Diego 65 11000 1288
## 19 AM WEST Delayed San Francisco 129 11000 1288
## 20 AM WEST Delayed Seattle 61 11000 1288
flight_delayed_percentage <- flight_percent %>%
mutate(Delay_Percentage = (Delayed_Flights / Total_Flights) * 100)
print(flight_delayed_percentage)
## # A tibble: 20 × 7
## Airline Status City Num_Flights Total_Flights Delayed_Flights
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA On Time Los Angeles 497 11000 1288
## 2 ALASKA On Time Phoenix 221 11000 1288
## 3 ALASKA On Time San Diego 212 11000 1288
## 4 ALASKA On Time San Francisco 503 11000 1288
## 5 ALASKA On Time Seattle 1841 11000 1288
## 6 ALASKA Delayed Los Angeles 62 11000 1288
## 7 ALASKA Delayed Phoenix 12 11000 1288
## 8 ALASKA Delayed San Diego 20 11000 1288
## 9 ALASKA Delayed San Francisco 102 11000 1288
## 10 ALASKA Delayed Seattle 305 11000 1288
## 11 AM WEST On Time Los Angeles 694 11000 1288
## 12 AM WEST On Time Phoenix 4840 11000 1288
## 13 AM WEST On Time San Diego 383 11000 1288
## 14 AM WEST On Time San Francisco 320 11000 1288
## 15 AM WEST On Time Seattle 201 11000 1288
## 16 AM WEST Delayed Los Angeles 117 11000 1288
## 17 AM WEST Delayed Phoenix 415 11000 1288
## 18 AM WEST Delayed San Diego 65 11000 1288
## 19 AM WEST Delayed San Francisco 129 11000 1288
## 20 AM WEST Delayed Seattle 61 11000 1288
## # ℹ 1 more variable: Delay_Percentage <dbl>
ggplot(flight_delayed_percentage, aes(x = Airline, y = Delay_Percentage)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ City) +
labs(title = "Flight Status by Percentage for Each Airline",
x = "Airline",
y = "Percentage (%)") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
theme_minimal()
The bar graph depicts that AM West has a lot more flights compared to Alaska Airlines. Most of AM West flights run on time. As for delayed flights, Alaska Airlines and AM West have nearly the same amount of delayed flights, which is significant because AM West has more than 3,000 flights on Alaska Airlines. It’s important to note that Alaska Airlines primarily services those who live in Alaska, compared to a more popular airline such as AM West.
ggplot(flights_sum, aes(x = Airline, y = Total_Flights, fill = Status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "On Time vs Delayed Flights for Alaska and AM West",
x = "Airline",
y = "Total Flights") +
theme_minimal()
I also wanted to compare flight status by city. For AM West, the majority of on-time flights were to Phoenix, which also had the highest number of delays. Alaska Airlines saw most of its on-time flights landing in Seattle, which, like Phoenix, had the greatest number of delays. This is likely because these cities handle a higher volume of flights overall.
ggplot(tidy_arr_delay, aes(x = Airline, y = Num_Flights, fill = Status)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ City)