## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Today will complete our work on relational data. `
Filternig joins match observations in the same way as mutating joins, but affect the observations, not the variables.
`semi_join(x,y) keeps all observations in x that have a match in y.
`anti_join(x,y) drops all observations in x that have a match in y.
Today will complete our work on relational data.
The above picture is a semi_join. As you can see only the existence of a match is important. Thus filtering joins can never duplicate rows like mutating joins do. see below.
The opposite of a semi-join is an anti-join which only keeps the rows that do not have a match. See below. Thus filtering joins really deserve the name “filtering” you can accomplish the same thing with filter in tidyverse, but in some situations that might prove to be more problematic.
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
## # A tibble: 722 × 2
## tailnum n
## <chr> <int>
## 1 <NA> 2512
## 2 N725MQ 575
## 3 N722MQ 513
## 4 N723MQ 507
## 5 N713MQ 483
## 6 N735MQ 396
## 7 N0EGMQ 371
## 8 N534MQ 364
## 9 N542MQ 363
## 10 N531MQ 349
## # … with 712 more rows
Exercises 13.4.6
Exercises 13.5.1
This work is based on “R for Data Science” by Hadley and Grolemund. Tt is licensed under
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/us/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.
airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()
#Exercise13.5.1
13.5.1 Exercises 1.What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)
Flights that a missing tailnum will not have arrival time recorded becouse of identification.
2.Filter flights to only show flights with planes that have flown at least 100 flights.
planes_gte100 <- flights %>%
filter(!is.na(tailnum)) %>%
group_by(tailnum) %>%
count() %>%
filter(n>= 100)
3.Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.
4.Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?
5.What does anti_join(flights, airports, by = c(“dest” = “faa”)) tell you? What does anti_join(airports, flights, by = c(“faa” = “dest”)) tell you?
6.You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.