library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
untidy <- read.csv(url("https://raw.githubusercontent.com/agCS/DATA607/master/flights%20wk5.csv"), stringsAsFactors = FALSE, header = TRUE)
# remove blank row
untidy <- untidy[-3,]
untidy
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 4 AM West on time 694 4840 383 320 201
## 5 AM West delayed 117 415 65 129 61
The city represents destination variable’s observations so I need to move it to rows by using gather(). ‘delayed’ and ‘on time’ are variables as well that need to be spread() into columns.
tidy1 <- untidy %>%
gather(Destination, Count, 3:7)
tidy2 <- tidy1 %>%
spread('X.1',Count)
# give column name to carrier
tidyFinal <- rename(tidy2, Carrier=X)
# display tidy data
tidyFinal
## Carrier Destination delayed on time
## 1 Alaska Los.Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San.Diego 20 212
## 4 Alaska San.Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM West Los.Angeles 117 694
## 7 AM West Phoenix 415 4840
## 8 AM West San.Diego 65 383
## 9 AM West San.Francisco 129 320
## 10 AM West Seattle 61 201
I’d like to quickly see how total delays compare by two carriers:
ggplot(data = tidyFinal) +
geom_col(mapping = aes(x = Carrier, y = delayed))
I see AM West has higher number of delays overall. Looking closer at averages, AM West has on average 57 more delayed flights.
(avg <- tidyFinal %>%
group_by(Carrier) %>%
summarize(avg = mean(delayed)))
## # A tibble: 2 x 2
## Carrier avg
## <chr> <dbl>
## 1 Alaska 100.2
## 2 AM West 157.4
When comparing delays per destination, however, we see that although AM West has higher total delays, Alaska airline has much higher number of arrival delays in Seattle. Does this make sense? Perhaps, this could be explained by availability of routes by each carrier, for example, it’s possible that AM West flies to Seattle more frequently than AM West. In my case, however, i suspect the graphs are misleading as the numbers are not comparable in proportion to all flights by the carrier.
ggplot(data = tidyFinal) +
geom_col(
mapping = aes(x = Destination, y = delayed, fill = Carrier),
position = "dodge")
So, let’s instead calculate proportions of delayed flights for each destination by carrier and average for delay in proportion to all flights. I will use mutate() function from dplyr to calculate and store the value in a new column.
tidyFinal <- mutate(tidyFinal,
delayProp = tidyFinal$delayed / (tidyFinal$delayed + tidyFinal$'on time'))
ggplot(data = tidyFinal) +
geom_col(
mapping = aes(x = Destination, y = delayProp, fill = Carrier),
position = "dodge")
(avg_prop <- tidyFinal %>%
group_by(Carrier) %>%
summarize(avg_prop = mean(delayProp)))
## # A tibble: 2 x 2
## Carrier avg_prop
## <chr> <dbl>
## 1 Alaska 0.1118683
## 2 AM West 0.1776915
Plotting proportions allows us to see that AM West has higher arrival delays at each destniations. Overall, AM West on average has ~7% more delays.