1. Find all flights that
(a) Had an arrival delay of two or more hours
Since the arr_delay
variable is measured in minutes, find flights with an arrival delay of 120 or more minutes.
filter(flights, arr_delay >= 120)
(b) Flew to Houston (IAH or HOU)
The flights that flew to Houston are those flights where the destination (dest
) is either “IAH” or “HOU”.
filter(flights, dest == "IAH" | dest == "HOU")
However, using %in%
is more compact and would scale to cases where there were more than two airports we were interested in.
filter(flights, dest %in% c("IAH", "HOU"))
(c) Were operated by United, American, or Delta
In the flights
dataset, the column carrier
indicates the airline, but it uses two-character carrier codes. We can find the carrier codes for the airlines in the airlines dataset. Since the carrier code dataset only has 16 rows, and the names of the airlines in that dataset are not exactly “United”, “American”, or “Delta”, it is easiest to manually look up their carrier codes in that data.
airlines
The carrier code for Delta is "DL"
, for American is "AA"
, and for United is "UA"
. Using these carriers codes, we check whether carrier
is one of those.
filter(flights, carrier %in% c("AA", "DL", "UA"))
(d) Departed in summer (July, August, and September)
The variable month
has the month, and it is numeric. So, the summer flights are those that departed in months 7 (July), 8 (August), and 9 (September).
filter(flights, month >= 7, month <= 9)
The %in%
operator is an alternative. If the :
operator is used to specify the integer range, the expression is readable and compact.
filter(flights, month %in% 7:9)
We could also use the |
operator. However, the |
does not scale to many choices. Even with only three choices, it is quite verbose.
filter(flights, month == 7 | month == 8 | month == 9)
(e) Arrived more than two hours late, but didn’t leave late
Flights that arrived more than two hours late, but didn’t leave late will have an arrival delay of more than 120 minutes (arr_delay > 120
) and a non-positive departure delay (dep_delay <= 0
).
filter(flights, arr_delay > 120, dep_delay <= 0)
(f) Were delayed by at least an hour, but made up over 30 minutes in flight
Were delayed by at least an hour, but made up over 30 minutes in flight. If a flight was delayed by at least an hour, then dep_delay >= 60
. If the flight didn’t make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning dep_delay == arr_delay
, or alternatively, dep_delay - arr_delay == 0
. If it makes up over 30 minutes in the air, then the arrival delay must be at least 30 minutes less than the departure delay, which is stated as dep_delay - arr_delay > 30
.
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
(g) Departed between midnight and 6 am (inclusive)
Finding flights that departed between midnight and 6 a.m. is complicated by the way in which times are represented in the data. In dep_time
, midnight is represented by 2400
, not 0
. You can verify this by checking the minimum and maximum of dep_time.
summary(flights$dep_time)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 907 1401 1349 1744 2400 8255
This is an example of why it is always good to check the summary statistics of your data. Unfortunately, this means we cannot simply check that dep_time < 600
, because we also have to consider the special case of midnight.
filter(flights, dep_time <= 600 | dep_time == 2400)
Alternatively, we could use the modulo operator, %%
. The modulo operator returns the remainder of division. Let’s see how how this affects our times.
c(600, 1200, 2400) %% 2400
[1] 600 1200 0
Since 2400 %% 2400 == 0
and all other times are left unchanged, we can compare the result of the modulo operation to 600
,
filter(flights, dep_time %% 2400 <= 600)
This filter expression is more compact, but its readability will depends on the familiarity of the reader with modular arithmetic.
2. Another useful dplyr filtering helper is between()
. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
The expression between(x, left, right)
is equivalent to x >= left & x <= right
.
Of the answers in the previous question, we could simplify the statement of departed in summer (month >= 7 & month <= 9
) using the between()
function.
filter(flights, between(month, 7, 9))
---
title: "Filter rows with filter()"
output: 
  html_notebook:
    toc: true
    toc_float: true
    toc_depth: 4
---


```{r loadlibrary}
suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(tidyverse))
```

### 1. Find all flights that

#### (a) Had an arrival delay of two or more hours

Since the `arr_delay` variable is measured in minutes, find flights with an arrival delay of 120 or more minutes.

```{r delay}
filter(flights, arr_delay >= 120)
```

#### (b) Flew to Houston (IAH or HOU)

The flights that flew to Houston are those flights where the destination (`dest`) is either “IAH” or “HOU”.

```{r destHOU}
filter(flights, dest == "IAH" | dest == "HOU")
```

However, using `%in%` is more compact and would scale to cases where there were more than two airports we were interested in.

```{r in}
filter(flights, dest %in% c("IAH", "HOU"))
```

#### (c) Were operated by United, American, or Delta

In the `flights` dataset, the column `carrier` indicates the airline, but it uses two-character carrier codes. We can find the carrier codes for the airlines in the airlines dataset. Since the carrier code dataset only has 16 rows, and the names of the airlines in that dataset are not exactly “United”, “American”, or “Delta”, it is easiest to manually look up their carrier codes in that data.

```{r airlines}
airlines
```

The carrier code for Delta is `"DL"`, for American is `"AA"`, and for United is `"UA"`. Using these carriers codes, we check whether `carrier` is one of those.

```{r}
filter(flights, carrier %in% c("AA", "DL", "UA"))
```

#### (d) Departed in summer (July, August, and September)

The variable `month` has the month, and it is numeric. So, the summer flights are those that departed in months 7 (July), 8 (August), and 9 (September).

```{r month}
filter(flights, month >= 7, month <= 9)
```

The `%in%` operator is an alternative. If the `:` operator is used to specify the integer range, the expression is readable and compact.

```{r incolon}
filter(flights, month %in% 7:9)
```

We could also use the `|` operator. However, the `|` does not scale to many choices. Even with only three choices, it is quite verbose.

```{r verbose}
filter(flights, month == 7 | month == 8 | month == 9)
```


#### (e) Arrived more than two hours late, but didn’t leave late

Flights that arrived more than two hours late, but didn’t leave late will have an arrival delay of more than 120 minutes (`arr_delay > 120`) and a non-positive departure delay (`dep_delay <= 0`).

```{r noleave}
filter(flights, arr_delay > 120, dep_delay <= 0)
```

#### (f) Were delayed by at least an hour, but made up over 30 minutes in flight

Were delayed by at least an hour, but made up over 30 minutes in flight. If a flight was delayed by at least an hour, then `dep_delay >= 60`. If the flight didn’t make up any time in the air, then its arrival would be delayed by the same amount as its departure, meaning `dep_delay == arr_delay`, or alternatively, `dep_delay - arr_delay == 0`. If it makes up over 30 minutes in the air, then the arrival delay must be at least 30 minutes less than the departure delay, which is stated as `dep_delay - arr_delay > 30`.

```{r makeup30}
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
```

#### (g) Departed between midnight and 6 am (inclusive)

Finding flights that departed between midnight and 6 a.m. is complicated by the way in which times are represented in the data.
In `dep_time`, midnight is represented by `2400`, not `0`. You can verify this by checking the minimum and maximum of dep_time.

```{r checkmidnight}
summary(flights$dep_time)
```

This is an example of why it is always good to check the summary statistics of your data. Unfortunately, this means we cannot simply check that `dep_time < 600`, because we also have to consider the special case of midnight.

```{r getmidnight}
filter(flights, dep_time <= 600 | dep_time == 2400)
```


Alternatively, we could use the [modulo operator](https://en.wikipedia.org/wiki/Modulo_operation), `%%`. The modulo operator returns the remainder of division. Let’s see how how this affects our times.

```{r modulo}
c(600, 1200, 2400) %% 2400
```

Since `2400 %% 2400 == 0` and all other times are left unchanged, we can compare the result of the modulo operation to `600`,

```{r mod2400}
filter(flights, dep_time %% 2400 <= 600)
```

This filter expression is more compact, but its readability will depends on the familiarity of the reader with modular arithmetic.

### 2. Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?

The expression `between(x, left, right)` is equivalent to `x >= left & x <= right`.

Of the answers in the previous question, we could simplify the statement of *departed in summer* (`month >= 7 & month <= 9`) using the `between()` function.

```{r summer}
filter(flights, between(month, 7, 9))
```

### 3. How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?

Find the rows of flights with a missing departure time (`dep_time`) using the `is.na()` function.

```{r missing_deptime}
filter(flights, is.na(dep_time))
```

Notably, the arrival time (`arr_time`) is also missing for these rows. These seem to be cancelled flights.

### 4. Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)

`NA ^ 0 == 1` since for all numeric values $x^0=1$.

```{r na0}
NA^0
```

`NA | TRUE` is `TRUE` because the value of the missing `TRUE` or `FALSE`,  
$x$ or `TRUE` is `TRUE` for all values of $x$.

```{r natrue}
NA | TRUE
```


Likewise, anything and `FALSE` is always `FALSE`.

```{r nafalse}
NA & FALSE
```

Because the value of the missing element matters in `NA | FALSE` and `NA & TRUE`, these are missing:

```{r naor}
NA | FALSE
NA & TRUE
```

Since $x \times0=0$ for all finite, numeric $x$, we might expect `NA * 0 == 0`, but that’s not the case.

```{r natimes0}
NA * 0
```

The reason that `NA * 0` is not equal to `0` is that $x \times \infty$ and $x \times -\infty$ is undefined. R represents undefined results as `NaN`, which is an abbreviation of [not a number](https://en.wikipedia.org/wiki/NaN).

```{r inf0}
Inf * 0
-Inf * 0
```

