The data file is hosted in github.
Airline
url <- getURL("https://raw.githubusercontent.com/binishkurian/DATA-607/master/week-04/flights.csv")
df <- read.csv(text = url, header = TRUE, check.names = FALSE)
flight_schedule <- as.tibble(df)
flight_schedule
## # A tibble: 4 x 7
## carrier arrival `Los Angeles` Phoenix `San Diego` `San Francisco` Seatt…
## <fctr> <fctr> <int> <int> <int> <int> <int>
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
flight_schedule_tidy <-flight_schedule %>%
gather('Los Angeles':Seattle, key = "destination", value = "flight_count") %>%
spread(key = arrival, value = flight_count) %>%
rename(on_time = `on time`)
flight_schedule_tidy
## # A tibble: 10 x 4
## carrier destination delayed on_time
## <fctr> <chr> <int> <int>
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
flight_schedule_tidy %>%
mutate(
perc_delayed = delayed / (delayed + on_time),
delay_score = min_rank(perc_delayed)
) %>%
arrange(delay_score)
## # A tibble: 10 x 6
## carrier destination delayed on_time perc_delayed delay_score
## <fctr> <chr> <int> <int> <dbl> <int>
## 1 ALASKA Phoenix 12 221 0.0515 1
## 2 AM WEST Phoenix 415 4840 0.0790 2
## 3 ALASKA San Diego 20 212 0.0862 3
## 4 ALASKA Los Angeles 62 497 0.111 4
## 5 ALASKA Seattle 305 1841 0.142 5
## 6 AM WEST Los Angeles 117 694 0.144 6
## 7 AM WEST San Diego 65 383 0.145 7
## 8 ALASKA San Francisco 102 503 0.169 8
## 9 AM WEST Seattle 61 201 0.233 9
## 10 AM WEST San Francisco 129 320 0.287 10
AM WEST's service to San Francisco has most delayed flights.
boxplot(flight_schedule_tidy[flight_schedule_tidy$carrier == "ALASKA", ]$delayed)
boxplot(flight_schedule_tidy[flight_schedule_tidy$carrier == "AM WEST", ]$delayed)
flight_schedule_tidy %>%
group_by(carrier) %>%
summarise(
count = n(),
mean = mean(delayed, na.rm = TRUE),
sd = sd(delayed, na.rm = TRUE),
median = median(delayed, na.rm = TRUE),
iqr = IQR(delayed, na.rm = TRUE),
abs_dev = mad(delayed, na.rm = TRUE),
min = min(delayed, na.rm = TRUE),
max = max(delayed, na.rm = TRUE)
)
## # A tibble: 2 x 9
## carrier count mean sd median iqr abs_dev min max
## <fctr> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 5 100 120 62 82.0 62.3 12.0 305
## 2 AM WEST 5 157 147 117 64.0 77.1 61.0 415
Since both airline data have a outlier, median and IQR are more robust than
mean and sd. Following the table above, ALASKA has median of 62 flights and
IQR of 82 and AM WEST has median of 117 flights and IQR of 64. This makes me to
conclude that ALASKA has better control on their flight delays.