suppressPackageStartupMessages(library(nycflights13))
package 㤼㸱nycflights13㤼㸲 was built under R version 3.6.3
suppressPackageStartupMessages(library(tidyverse))
package 㤼㸱tidyverse㤼㸲 was built under R version 3.6.3

1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

  • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
  • A flight is always 10 minutes late.
  • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
  • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
  • Which is more important: arrival delay or departure delay?

What this question gets at is a fundamental question of data analysis: the cost function. As analysts, the reason we are interested in flight delay because it is costly to passengers. But it is worth thinking carefully about how it is costly and use that information in ranking and measuring these scenarios.

In many scenarios, arrival delay is more important. In most cases, arriving late is more costly to the passenger since it could disrupt the next stages of their travel, such as connecting flights or scheduled meetings.

If a departure is delayed without affecting the arrival time, this delay will not have those affects plans nor does it affect the total time spent traveling. This delay could be beneficial, if less time is spent in the cramped confines of the airplane itself, or a negative, if that delayed time is still spent in the cramped confines of the airplane on the runway.

Variation in arrival time is worse than consistency. If a flight is always 30 minutes late and that delay is known, then it is as if the arrival time is that delayed time. The traveler could easily plan for this. But higher variation in flight times makes it harder to plan.

2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

The first expression is the following.

not_cancelled %>%
  count(dest)

The count() function counts the number of instances within each group of variables. Instead of using the count() function,we can combine the group_by() and summarise() verbs.

not_cancelled %>%
  group_by(dest) %>%
  summarise(n = length(dest))

An alternative method for getting the number of observations in a data frame is the function n().

not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())

Another alternative to count() is to use the combination of the group_by() and tally() verbs. In fact, count() is effectively a short-cut for group_by() followed by tally().

not_cancelled %>%
  group_by(tailnum) %>%
  tally()

The second expression also uses the count() function, but adds a wt argument.

not_cancelled %>%
  count(tailnum, wt = distance)

As before, we can replicate count() by combining the group_by() and summarise() verbs. But this time instead of using length(), we will use sum() with the weighting variable.

not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

Like the previous example, we can also use the combination group_by() and tally(). Any arguments to tally() are summed.

not_cancelled %>%
  group_by(tailnum) %>%
  tally(distance)

Exercise 3. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?

If a flight never departs, then it won’t arrive. A flight could also depart and not arrive if it crashes, or if it is redirected and lands in an airport other than its intended destination. So the most important column is arr_delay, which indicates the amount of delay in arrival.

filter(flights, !is.na(dep_delay), is.na(arr_delay)) %>%
  select(dep_time, arr_time, sched_arr_time, dep_delay, arr_delay)

Okay, I’m not sure what’s going on in this data. dep_time can be non-missing and arr_delay missing but arr_time not missing. They may be combining different flights?

5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))

What airline corresponds to the “F9” carrier code?

filter(airlines, carrier == "F9")

You can get part of the way to disentangling the effects of airports versus bad carriers by comparing the average delay of each carrier to the average delay of flights within a route (flights from the same origin to the same destination). Comparing delays between carriers and within each route disentangles the effect of carriers and airports. A better analysis would compare the average delay of a carrier’s flights to the average delay of all other carrier’s flights within a route.

flights %>%
  filter(!is.na(arr_delay)) %>%
  # Total delay by carrier within each origin, dest
  group_by(origin, dest, carrier) %>%
  summarise(
    arr_delay = sum(arr_delay),
    flights = n()
  ) %>%
  # Total delay within each origin dest
  group_by(origin, dest) %>%
  mutate(
    arr_delay_total = sum(arr_delay),
    flights_total = sum(flights)
  ) %>%
  # average delay of each carrier - average delay of other carriers
  ungroup() %>%
  mutate(
    arr_delay_others = (arr_delay_total - arr_delay) /
      (flights_total - flights),
    arr_delay_mean = arr_delay / flights,
    arr_delay_diff = arr_delay_mean - arr_delay_others
  ) %>%
  # remove NaN values (when there is only one carrier)
  filter(is.finite(arr_delay_diff)) %>%
  # average over all airports it flies to
  group_by(carrier) %>%
  summarise(arr_delay_diff = mean(arr_delay_diff)) %>%
  arrange(desc(arr_delay_diff))

There are more sophisticated ways to do this analysis, however comparing the delay of flights within each route goes a long ways toward disentangling airport and carrier effects. To see a more complete example of this analysis, see this FiveThirtyEight piece.

6. What does the sort argument to count() do? When might you use it?

The sort argument to count() sorts the results in order of n. You could use this anytime you would run count() followed by arrange().

---
title: "Grouped summaries with summarise()"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

```{r loadlibrary}
suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(tidyverse))
```

### 1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

 - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
 - A flight is always 10 minutes late.
 - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
 - 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
 - Which is more important: arrival delay or departure delay?

What this question gets at is a fundamental question of data analysis: the cost function. As analysts, the reason we are interested in flight delay because it is costly to passengers. But it is worth thinking carefully about how it is costly and use that information in ranking and measuring these scenarios.

In many scenarios, arrival delay is more important. In most cases, arriving late is more costly to the passenger since it could disrupt the next stages of their travel, such as connecting flights or scheduled meetings.

If a departure is delayed without affecting the arrival time, this delay will not have those affects plans nor does it affect the total time spent traveling. This delay could be beneficial, if less time is spent in the cramped confines of the airplane itself, or a negative, if that delayed time is still spent in the cramped confines of the airplane on the runway.

Variation in arrival time is worse than consistency. If a flight is always 30 minutes late and that delay is known, then it is as if the arrival time is that delayed time. The traveler could easily plan for this. But higher variation in flight times makes it harder to plan.

### 2. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).

```{r}
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
```

The first expression is the following.

```{r}
not_cancelled %>%
  count(dest)
```

The `count()` function counts the number of instances within each group of variables. Instead of using the `count()` function,we can combine the `group_by()` and `summarise()` verbs.

```{r}
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = length(dest))
```

An alternative method for getting the number of observations in a data frame is the function `n()`.

```{r}
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())
```

Another alternative to `count()` is to use the combination of the `group_by()` and `tally()` verbs. In fact, `count()` is effectively a short-cut for `group_by()` followed by `tally()`.

```{r}
not_cancelled %>%
  group_by(tailnum) %>%
  tally()
```

The second expression also uses the `count()` function, but adds a `wt` argument.

```{r}
not_cancelled %>%
  count(tailnum, wt = distance)
```

As before, we can replicate `count()` by combining the `group_by()` and `summarise()` verbs. But this time instead of using `length()`, we will use `sum()` with the weighting variable.

```{r}
not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
```

Like the previous example, we can also use the combination `group_by()` and `tally()`. Any arguments to `tally()` are summed.

```{r}
not_cancelled %>%
  group_by(tailnum) %>%
  tally(distance)
```

### Exercise 3. Our definition of cancelled flights `(is.na(dep_delay) | is.na(arr_delay))` is slightly suboptimal. Why? Which is the most important column?

If a flight never departs, then it won’t arrive. A flight could also depart and not arrive if it crashes, or if it is redirected and lands in an airport other than its intended destination. So the most important column is `arr_delay`, which indicates the amount of delay in arrival.

```{r}
filter(flights, !is.na(dep_delay), is.na(arr_delay)) %>%
  select(dep_time, arr_time, sched_arr_time, dep_delay, arr_delay)
```

Okay, I’m not sure what’s going on in this data. `dep_time` can be non-missing and `arr_delay` missing but `arr_time` not missing. They may be combining different flights?

### 4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

One pattern in cancelled flights per day is that the number of cancelled flights increases with the total number of flights per day. The proportion of cancelled flights increases with the average delay of flights.

To answer these questions, I'll use `!(is.na(arr_delay) & is.na(dep_delay))` as the definition of cancelled.

The first part of the question asks for any pattern in the number of cancelled flights per day. I’ll look at the relationship between the number of cancelled flights per day and the total number of flights in a day. There should be an increasing relationship for two reasons. First, if all flights are equally likely to be cancelled, then days with more flights should have a higher number of cancellations. Second, it is likely that days with more flights would have a higher probability of cancellations because congestion itself can cause delays and any delay would affect more flights, and large delays can lead to cancellations.

```{r}
cancelled_per_day <-
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_num = sum(cancelled),
    flights_num = n(),
  )
```

Plotting `flights_num` against `cancelled_num` shows that the number of flights cancelled increases with the total number of flights.

```{r}
ggplot(cancelled_per_day) +
  geom_point(aes(x = flights_num, y = cancelled_num))
```

The second part of the question asks whether there is a relationship between the proportion of flights cancelled and the average departure delay. I implied this in my answer to the first part of the question, when I noted that increasing delays could result in increased cancellations. The question does not specify which delay, so I will show the relationship for both.

```{r}
cancelled_and_delays <-
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_prop = mean(cancelled),
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  ungroup()
```
There is a strong increasing relationship between both average departure delay and and average arrival delay and the proportion of cancelled flights.

```{r}
ggplot(cancelled_and_delays) +
  geom_point(aes(x = avg_dep_delay, y = cancelled_prop))

ggplot(cancelled_and_delays) +
  geom_point(aes(x = avg_arr_delay, y = cancelled_prop))
```

### 5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)

```{r}
flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))
```

What airline corresponds to the "F9" carrier code?

```{r}
filter(airlines, carrier == "F9")
```

You can get part of the way to disentangling the effects of airports versus bad carriers by comparing the average delay of each carrier to the average delay of flights within a route (flights from the same origin to the same destination). Comparing delays between carriers and within each route disentangles the effect of carriers and airports. A better analysis would compare the average delay of a carrier’s flights to the average delay of all other carrier’s flights within a route.

```{r}
flights %>%
  filter(!is.na(arr_delay)) %>%
  # Total delay by carrier within each origin, dest
  group_by(origin, dest, carrier) %>%
  summarise(
    arr_delay = sum(arr_delay),
    flights = n()
  ) %>%
  # Total delay within each origin dest
  group_by(origin, dest) %>%
  mutate(
    arr_delay_total = sum(arr_delay),
    flights_total = sum(flights)
  ) %>%
  # average delay of each carrier - average delay of other carriers
  ungroup() %>%
  mutate(
    arr_delay_others = (arr_delay_total - arr_delay) /
      (flights_total - flights),
    arr_delay_mean = arr_delay / flights,
    arr_delay_diff = arr_delay_mean - arr_delay_others
  ) %>%
  # remove NaN values (when there is only one carrier)
  filter(is.finite(arr_delay_diff)) %>%
  # average over all airports it flies to
  group_by(carrier) %>%
  summarise(arr_delay_diff = mean(arr_delay_diff)) %>%
  arrange(desc(arr_delay_diff))
```

There are more sophisticated ways to do this analysis, however comparing the delay of flights within each route goes a long ways toward disentangling airport and carrier effects. To see a more complete example of this analysis, see [this FiveThirtyEight piece](https://fivethirtyeight.com/features/the-best-and-worst-airlines-airports-and-flights-summer-2015-update/).

### 6. What does the sort argument to `count()` do? When might you use it?

The sort argument to `count()` sorts the results in order of n. You could use this anytime you would run `count()` followed by `arrange()`.