suppressPackageStartupMessages(library("tidyverse"))
package 㤼㸱tidyverse㤼㸲 was built under R version 3.6.3
suppressPackageStartupMessages(library("nycflights13"))
package 㤼㸱nycflights13㤼㸲 was built under R version 3.6.3
suppressPackageStartupMessages(library("viridis"))
package 㤼㸱viridis㤼㸲 was built under R version 3.6.2
1. What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)
Flights that have a missing tailnum all have missing values of arr_time
, meaning that the flight was canceled.
flights %>%
filter(is.na(tailnum), !is.na(arr_time)) %>%
nrow()
[1] 0
Many of the tail numbers that don’t have a matching value in planes are registered to American Airlines (AA) or Envoy Airlines (MQ). The documentation for planes
states
American Airways (AA) and Envoy Air (MQ) report fleet numbers rather than tail numbers so can’t be matched.
flights %>%
anti_join(planes, by = "tailnum") %>%
count(carrier, sort = TRUE) %>%
mutate(p = n / sum(n))
However, not all tail numbers appearing in flights
from these carriers are missing from the planes
table. I don’t know how to reconcile this discrepancy.
flights %>%
distinct(carrier, tailnum) %>%
left_join(planes, by = "tailnum") %>%
group_by(carrier) %>%
summarise(
total_planes = n(),
not_in_planes = sum(is.na(model))
) %>%
mutate(missing_pct = not_in_planes / total_planes) %>%
arrange(desc(missing_pct))
2. Filter flights to only show flights with planes that have flown at least 100 flights.
First, I find all planes that have flown at least 100 flights. I need to filter flights that are missing a tail number otherwise all flights missing a tail number will be treated as a single plane.
planes_gte100 <- flights %>%
filter(!is.na(tailnum)) %>%
group_by(tailnum) %>%
count() %>%
filter(n >= 100)
Now, I will semi join the data frame of planes that have flown at least 100 flights to the data frame of flights to select the flights by those planes.
flights %>%
semi_join(planes_gte100, by = "tailnum")
This can also be answered with a grouped mutate.
flights %>%
filter(!is.na(tailnum)) %>%
group_by(tailnum) %>%
mutate(n = n()) %>%
filter(n >= 100)
3.Combine fueleconomy::vehicles
and fueleconomy::common
to find only the records for the most common models.
fueleconomy::vehicles %>%
semi_join(fueleconomy::common, by = c("make", "model"))
Detecting old grouped_df format, replacing `vars` attribute by `groups`
Why does the above code join on make
and model
and not just model
? It is possible for two car brands (make) to produce a car with the same name (model). In both the vehicles and common data we can find some examples. For example, “Truck 4WD”
is produced by many different brands.
fueleconomy::vehicles %>%
distinct(model, make) %>%
group_by(model) %>%
filter(n() > 1) %>%
arrange(model)
fueleconomy::common %>%
distinct(model, make) %>%
group_by(model) %>%
filter(n() > 1) %>%
arrange(model)
If we were to merge these data on the model
column alone, there would be incorrect matches.
4. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?
I will start by clarifying how I will be measuring the concepts in the question. There are three concepts that need to be defined more precisely.
What is meant by “delay”? I will use departure delay. Since the weather data only contains data for the New York City airports, and departure delays will be more sensitive to New York City weather conditions than arrival delays.
What is meant by “worst”? I define worst delay as the average departure delay per flight for flights scheduled to depart in that hour. For hour, I will use the scheduled departure time rather than the actual departure time. If planes are delayed due to weather conditions, the weather conditions during the scheduled time are more important than the actual departure time, at which point, the weather could have improved.
What is meant by “48 hours over the course of the year”? This could mean two days, a span of 48 contiguous hours, or 48 hours that are not necessarily contiguous hours. I will find 48 not-necessarily contiguous hours. That definition makes better use of the methods introduced in this section and chapter.
What is the unit of analysis? Although the question mentions only hours, I will use airport hours. The weather dataset has an observation for each airport for each hour. Since all the departure airports are in the vicinity of New York City, their weather should be similar, it will not be the same.
First, I need to find the 48 hours with the worst delays. I group flights by hour of scheduled departure time and calculate the average delay. Then I select the 48 observations (hours) with the highest average delay.
worst_hours <- flights %>%
mutate(hour = sched_dep_time %/% 100) %>%
group_by(origin, year, month, day, hour) %>%
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ungroup() %>%
arrange(desc(dep_delay)) %>%
slice(1:48)
Then I can use semi_join()
to get the weather for these hours.
weather_most_delayed <- semi_join(weather, worst_hours,
by = c(
"origin", "year",
"month", "day", "hour"
)
)
For weather, I’ll focus on precipitation, wind speed, and temperature. I will display these in both a table and a plot.
Many of these observations have a higher than average wind speed (10 mph) or some precipitation. However, I would have expected the weather for the hours with the worst delays to be much worse.
select(weather_most_delayed, temp, wind_speed, precip) %>%
print(n = 48)
ggplot(weather_most_delayed, aes(x = precip, y = wind_speed, color = temp)) +
geom_point()

It’s hard to say much more than that without using the tools from Exploratory Data Analysis section to look for covariation between weather and flight delays using all flights. Implicitly in my informal analysis of trends in weather using only the 48 hours with the worst delays, I was comparing the weather in these hours to some belief I had about what constitutes “normal” or “good” weather. It would be better to actually use data to make that comparison.
5. What does anti_join(flights, airports, by = c("dest" = "faa"))
tell you? What does anti_join(airports, flights, by = c("faa" = "dest"))
tell you?
The expression anti_join(flights, airports, by = c("dest" = "faa"))
returns the flights that went to an airport that is not in the FAA list of destinations. Since the FAA list only contains domestic airports, these are likely foreign flights.
The expression anti_join(airports, flights, by = c("faa" = "dest"))
returns the US airports that were not the destination of any flight in the data. Since the data contains all flights from New York City airports, this is also the list of US airports that did not have a nonstop flight from New York City in 2013.
---
title: "Filtering joins"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

```{r}
suppressPackageStartupMessages(library("tidyverse"))
suppressPackageStartupMessages(library("nycflights13"))
suppressPackageStartupMessages(library("viridis"))
```

### 1. What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

Flights that have a missing tailnum all have missing values of `arr_time`, meaning that the flight was canceled.

```{r}
flights %>%
  filter(is.na(tailnum), !is.na(arr_time)) %>%
  nrow()
```

Many of the tail numbers that don’t have a matching value in planes are registered to American Airlines (AA) or Envoy Airlines (MQ). The documentation for `planes` states

> American Airways (AA) and Envoy Air (MQ) report fleet numbers rather than tail numbers so can’t be matched.

```{r}
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(carrier, sort = TRUE) %>%
  mutate(p = n / sum(n))
```

However, not all tail numbers appearing in `flights` from these carriers are missing from the `planes` table. I don’t know how to reconcile this discrepancy.

```{r}
flights %>%
  distinct(carrier, tailnum) %>%
  left_join(planes, by = "tailnum") %>%
  group_by(carrier) %>%
  summarise(
    total_planes = n(),
    not_in_planes = sum(is.na(model))
  ) %>%
  mutate(missing_pct = not_in_planes / total_planes) %>%
  arrange(desc(missing_pct))
```


### 2. Filter flights to only show flights with planes that have flown at least 100 flights.

First, I find all planes that have flown at least 100 flights. I need to filter flights that are missing a tail number otherwise all flights missing a tail number will be treated as a single plane.

```{r}
planes_gte100 <- flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  count() %>%
  filter(n >= 100)
```

Now, I will semi join the data frame of planes that have flown at least 100 flights to the data frame of flights to select the flights by those planes.

```{r}
flights %>%
  semi_join(planes_gte100, by = "tailnum")
```

This can also be answered with a grouped mutate.

```{r}
flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  mutate(n = n()) %>%
  filter(n >= 100)
```

### 3.Combine `fueleconomy::vehicles` and `fueleconomy::common` to find only the records for the most common models.

```{r}
fueleconomy::vehicles %>%
  semi_join(fueleconomy::common, by = c("make", "model"))
```

Why does the above code join on `make` and `model` and not just `model`? It is possible for two car brands (make) to produce a car with the same name (model). In both the vehicles and common data we can find some examples. For example, `“Truck 4WD”` is produced by many different brands.

```{r}
fueleconomy::vehicles %>%
  distinct(model, make) %>%
  group_by(model) %>%
  filter(n() > 1) %>%
  arrange(model)
fueleconomy::common %>%
  distinct(model, make) %>%
  group_by(model) %>%
  filter(n() > 1) %>%
  arrange(model)
```

If we were to merge these data on the `model` column alone, there would be incorrect matches.

### 4. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

I will start by clarifying how I will be measuring the concepts in the question. There are three concepts that need to be defined more precisely.

1. What is meant by “delay”? I will use departure delay. Since the weather data only contains data for the New York City airports, and departure delays will be more sensitive to New York City weather conditions than arrival delays.

2. What is meant by “worst”? I define worst delay as the average departure delay per flight for flights scheduled to depart in that hour. For hour, I will use the scheduled departure time rather than the actual departure time. If planes are delayed due to weather conditions, the weather conditions during the scheduled time are more important than the actual departure time, at which point, the weather could have improved.

3. What is meant by “48 hours over the course of the year”? This could mean two days, a span of 48 contiguous hours, or 48 hours that are not necessarily contiguous hours. I will find 48 not-necessarily contiguous hours. That definition makes better use of the methods introduced in this section and chapter.

4. What is the unit of analysis? Although the question mentions only hours, I will use airport hours. The weather dataset has an observation for each airport for each hour. Since all the departure airports are in the vicinity of New York City, their weather should be similar, it will not be the same.

First, I need to find the 48 hours with the worst delays. I group flights by hour of scheduled departure time and calculate the average delay. Then I select the 48 observations (hours) with the highest average delay.

```{r}
worst_hours <- flights %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(origin, year, month, day, hour) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(dep_delay)) %>%
  slice(1:48)
```

Then I can use `semi_join()` to get the weather for these hours.

```{r}
weather_most_delayed <- semi_join(weather, worst_hours,
  by = c(
    "origin", "year",
    "month", "day", "hour"
  )
)
```

For weather, I’ll focus on precipitation, wind speed, and temperature. I will display these in both a table and a plot.

Many of these observations have a higher than average wind speed (10 mph) or some precipitation. However, I would have expected the weather for the hours with the worst delays to be much worse.

```{r}
select(weather_most_delayed, temp, wind_speed, precip) %>%
  print(n = 48)
ggplot(weather_most_delayed, aes(x = precip, y = wind_speed, color = temp)) +
  geom_point()
```

It’s hard to say much more than that without using the tools from Exploratory Data Analysis section to look for covariation between weather and flight delays using all flights. Implicitly in my informal analysis of trends in weather using only the 48 hours with the worst delays, I was comparing the weather in these hours to some belief I had about what constitutes “normal” or “good” weather. It would be better to actually use data to make that comparison.

### 5. What does `anti_join(flights, airports, by = c("dest" = "faa"))` tell you? What does `anti_join(airports, flights, by = c("faa" = "dest"))` tell you?

The expression `anti_join(flights, airports, by = c("dest" = "faa"))` returns the flights that went to an airport that is not in the FAA list of destinations. Since the FAA list only contains domestic airports, these are likely foreign flights.

The expression `anti_join(airports, flights, by = c("faa" = "dest"))` returns the US airports that were not the destination of any flight in the data. Since the data contains all flights from New York City airports, this is also the list of US airports that did not have a nonstop flight from New York City in 2013.

### 6. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.

At each point in time, each plane is flown by a single airline. However, a plane can be sold and fly for multiple airlines. Logically, it is possible that a plane can fly for multiple airlines over the course of its lifetime. But, it is not necessarily the case that a plane will fly for more than one airline in this data, especially since it comprises only a year of data. So let’s check to see if there are any planes in the data flew for multiple airlines.

First, find all distinct airline, plane combinations.

```{r}
planes_carriers <-
  flights %>%
  filter(!is.na(tailnum)) %>%
  distinct(tailnum, carrier)
```

The number of planes that have flown for more than one airline are those tailnum that appear more than once in the planes_carriers data.

```{r}
planes_carriers %>%
  count(tailnum) %>%
  filter(n > 1) %>%
  nrow()
```

The names of airlines are easier to understand than the two-letter carrier codes. The airlines data frame contains the names of the airlines.

```{r}
carrier_transfer_tbl <- planes_carriers %>%
  # keep only planes which have flown for more than one airline
  group_by(tailnum) %>%
  filter(n() > 1) %>%
  # join with airlines to get airline names
  left_join(airlines, by = "carrier") %>%
  arrange(tailnum, carrier)

carrier_transfer_tbl
```