In this assignment, I create a CSV file of airline flights, cities, and delays, and tidied and transformed the wide data.
Part 1: Read data
I created the .CSV file that includes all of the flight information as specified in the assignment instructions and uploaded it to my github repo for easy access.
Below, I read the information from my .CSV file into R, and use tidyr and dplyr to tidy and transform my data.
New names:
Rows: 5 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(2): ...1, ...2 dbl (3): Los Angeles, San Diego, San Francisco num (2):
Phoenix, Seattle
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
• `` -> `...2`
See my tidied, long data set below:
head(flights)
# A tibble: 6 × 4
airline arrival_delay city count
<fct> <fct> <fct> <dbl>
1 ALASKA on time Los Angeles 497
2 ALASKA on time Phoenix 221
3 ALASKA on time San Diego 212
4 ALASKA on time San Francisco 503
5 ALASKA on time Seattle 1841
6 ALASKA delayed Los Angeles 62
Part 2: Analyze
I perform analysis below to compare the arrival delays for the two airlines.
Alaska is delayed 13% of the time, while America West Airlines is only delayed 11% of the time. Based on this, I’d rather fly America West, regardless of the city I’m traveling to.
# A tibble: 4 × 4
# Groups: airline [2]
airline arrival_delay sum perc
<fct> <fct> <dbl> <dbl>
1 ALASKA delayed 501 0.133
2 ALASKA on time 3274 0.867
3 AM WEST delayed 787 0.109
4 AM WEST on time 6438 0.891
I illustrate the slight difference in airline delay rate by comparing the shaded portions of the bars below:
Next, I compare these same rates by city. Below, we can compare the proportion of delays by airline and by city with these side-by-side bar plots. For any given city, Alaska Airlines flights are less likely to have an arrival delay there than America West Airlines. In this case, if I already have a city in mind, I’d rather fly Alaskan, as my chances of delay would be lower.
# Calculate percentages of delayed vs. on time flighty by airline and city.flights_by_airline_city <- flights |>group_by(airline, city) |>mutate(perc_airline_city = count /sum(count) *100)# Bar plot comparing delay rate of airlines by cityflights_by_airline_city |>ggplot(aes(x = city,y = count,fill = arrival_delay,label =paste0(round(perc_airline_city), "%") )) +geom_bar(stat ="identity", position ="stack") +geom_text(position =position_stack(vjust = .5), size =2.4) +labs(title ="Airline Performance",x ="City",y ="Count",fill ="Status") +theme(axis.text.x =element_text(angle =45, hjust =1)) +facet_wrap(flights$airline)
Conclusion
In conclusion, American West performs better in overall percentage of on-time flights, but Alaskan Airlines performs better in each city-specific percentage of on-time flights–the choice of airline depends on whether you prioritize overall delay rates or delay rates for a specific city. Since people typically pick their destination first and airline second, I think Alaskan Airlines is the better choice. For further research, I’d be curious to have data on more cities to check the rate of delays for further analysis. I’d also be curious about the duration of these delays—not just count. For example, Alaska Airlines may generally have much longer delays than American West, but we wouldn’t know this from the current data.