I’ll start by creating the table from the assignment as a dataframe.
airline <- c('ALASKA', 'ALASKA', 'AM WEST', 'AM WEST')
arrival_status <- c('on time', 'delayed', 'on time', 'delayed')
los_angeles <- c(497, 62, 694, 117)
phoenix <- c(221, 12, 4840, 415)
san_diego <- c(212, 20, 383, 65)
san_fran <- c(503, 102, 320, 129)
seattle <- c(1841, 305, 201, 61)
arrival_delays_df <- data.frame(airline, arrival_status, los_angeles, phoenix, san_diego, san_fran, seattle)
head(arrival_delays_df)
## airline arrival_status los_angeles phoenix san_diego san_fran seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
The following line of code converts to .csv.
write.csv(arrival_delays_df, file = "C:/Users/Thomas/Documents/GitHub/DATA607_HW4/DATA607_HW4.csv", row.names = FALSE)
read.csv("C:/Users/Thomas/Documents/GitHub/DATA607_HW4/DATA607_HW4.csv")
## airline arrival_status los_angeles phoenix san_diego san_fran seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
As mentioned in the instructions, the data is in ‘wide’ format currently. I’ll use tidyr to make this table in a ‘long’ format.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
arrival_delays_long <- arrival_delays_df %>%
gather(airport, n, los_angeles:seattle) %>%
arrange(airline)
arrival_delays_long
## airline arrival_status airport n
## 1 ALASKA on time los_angeles 497
## 2 ALASKA delayed los_angeles 62
## 3 ALASKA on time phoenix 221
## 4 ALASKA delayed phoenix 12
## 5 ALASKA on time san_diego 212
## 6 ALASKA delayed san_diego 20
## 7 ALASKA on time san_fran 503
## 8 ALASKA delayed san_fran 102
## 9 ALASKA on time seattle 1841
## 10 ALASKA delayed seattle 305
## 11 AM WEST on time los_angeles 694
## 12 AM WEST delayed los_angeles 117
## 13 AM WEST on time phoenix 4840
## 14 AM WEST delayed phoenix 415
## 15 AM WEST on time san_diego 383
## 16 AM WEST delayed san_diego 65
## 17 AM WEST on time san_fran 320
## 18 AM WEST delayed san_fran 129
## 19 AM WEST on time seattle 201
## 20 AM WEST delayed seattle 61
Now that all of the information is in a long format, it will be easier to use summarize and mutate to find the percent of flights that were delayed for each variable. I’ll start by finding the total flights at each airport and then calculating the percent that were delayed. Finally, I’ll arrange the newly calculated information by airport and airline to facilitate comparison between their percent delays.
flights_pct_delay <- arrival_delays_long %>%
group_by(airport, airline) %>%
mutate(total_flights = sum(n)) %>%
ungroup() %>%
filter(arrival_status == "delayed") %>%
mutate(pct_airline_delay = n / total_flights) %>%
arrange(airport, airline)
flights_pct_delay
## # A tibble: 10 x 6
## airline arrival_status airport n total_flights pct_airline_delay
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA delayed los_angeles 62 559 0.111
## 2 AM WEST delayed los_angeles 117 811 0.144
## 3 ALASKA delayed phoenix 12 233 0.0515
## 4 AM WEST delayed phoenix 415 5255 0.0790
## 5 ALASKA delayed san_diego 20 232 0.0862
## 6 AM WEST delayed san_diego 65 448 0.145
## 7 ALASKA delayed san_fran 102 605 0.169
## 8 AM WEST delayed san_fran 129 449 0.287
## 9 ALASKA delayed seattle 305 2146 0.142
## 10 AM WEST delayed seattle 61 262 0.233
From the new table, a few observations jump out. The total flights for AM WEST are highest in Phoenix, while ALASKA likely has Seattle as a hub. Comparing each airline, delay times vary between 5 and 30%. Overall, it also appears that ALASKA has lower rates of delay at each airport. I can now use the spread function to show the airline as elements in a wide dataframe.
flights_pct_delay_wide <- flights_pct_delay %>%
select(airline, airport, pct_airline_delay) %>%
spread(airport, pct_airline_delay) %>%
group_by(airline)
flights_pct_delay_wide
## # A tibble: 2 x 6
## # Groups: airline [2]
## airline los_angeles phoenix san_diego san_fran seattle
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 0.111 0.0515 0.0862 0.169 0.142
## 2 AM WEST 0.144 0.0790 0.145 0.287 0.233