Joining one table to another probably appeals to your intuition as something you reach for when you’re combining data stored at different levels. But it’s actually a much more general approach! You’ve probably seen me gratuitously left_join()-ing in other syntax I’ve written for this lab. It comes up, for instance, when you want to do something like full_join() to make sure we have rows in the data frame for every possible combination of values, even if they don’t show up in the actual data.
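Here’s a minimal sketch of that “every combination” use, with a made-up obs data frame (the grp, wave, and n columns are purely for illustration):
library(tidyverse)

# a small data frame which is missing the ("b", 2) combination
obs <- tribble(
  ~grp, ~wave, ~n,
  "a",  1,     10,
  "a",  2,     12,
  "b",  1,      8
)

# full_join() against a complete grid of combinations keeps the
# missing ("b", 2) row, with n filled in as NA
expand_grid(
  grp = c("a", "b"),
  wave = 1:2
) %>%
  full_join(obs, by = c("grp", "wave"))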
Here we’ll focus on the *_join functions specifically.
[Figure: the four major join types, with the left circle referred to as x and the right circle referred to as y.]
What does each join achieve?
An inner_join() keeps only those rows which match in both x and y.
A left_join() keeps all the rows from x, with matching rows from y.
A full_join() keeps all rows from both x and y, whether they match or not.
A right_join() keeps all the rows from y, with matching rows from x.
Let’s do some examples.
inner_join()
library(tidyverse)
library(magrittr)
x <- tribble(
  ~id, ~y1,
  "A", .5,
  "B", 1,
  "C", .5,
  "D", 1.5
)
y <- tribble(
  ~id, ~y2,
  "B", -1,
  "C", -.5,
  "E", -1.5
)
x %>%
inner_join(y)
## # A tibble: 2 × 3
## id y1 y2
## <chr> <dbl> <dbl>
## 1 B 1 -1
## 2 C 0.5 -0.5
As we can see, we merged on id (when we omitted the by argument inside inner_join(), dplyr insightfully chose to join on all the variables which were common to both data frames) and kept only the rows which were in both x and y.
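If you’d rather be explicit about the key (or the automatic choice picks up columns you don’t want to join on), you can name it yourself:
x %>%
  inner_join(y, by = "id")

# or, in dplyr 1.1.0 and later, with join_by()
x %>%
  inner_join(y, by = join_by(id))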
left_join()
x %>%
left_join(y)
## # A tibble: 4 × 3
## id y1 y2
## <chr> <dbl> <dbl>
## 1 A 0.5 NA
## 2 B 1 -1
## 3 C 0.5 -0.5
## 4 D 1.5 NA
We merged on id and kept all the rows from x, and the matching rows from y.
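One wrinkle worth knowing: the key columns don’t have to share a name. If y had called its identifier code instead of id (a hypothetical renaming, just to illustrate), you’d spell out the correspondence:
y_renamed <- y %>%
  rename(code = id)

# join x$id to y_renamed$code
x %>%
  left_join(y_renamed, by = c("id" = "code"))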
full_join()
x %>%
full_join(y)
## # A tibble: 5 × 3
## id y1 y2
## <chr> <dbl> <dbl>
## 1 A 0.5 NA
## 2 B 1 -1
## 3 C 0.5 -0.5
## 4 D 1.5 NA
## 5 E NA -1.5
We merged on id and kept all the rows from either x or y.
right_join()
x %>%
right_join(y)
## # A tibble: 3 × 3
## id y1 y2
## <chr> <dbl> <dbl>
## 1 B 1 -1
## 2 C 0.5 -0.5
## 3 E NA -1.5
We merged on id and kept all the rows from y, and the matching rows from x.
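In other words, a right_join() is just the mirror image of a left_join(): swapping which data frame comes first returns the same rows, though the columns come out in a different order.
# these two calls return the same rows
x %>%
  right_join(y)

y %>%
  left_join(x)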
Let’s use joins to compare polls to vote outcomes.
d_polls <- "https://github.com/thomasjwood/ps7160/raw/master/polls_20.rds" %>%
  url %>%
  readRDS %>%
  as_tibble %>%
  mutate(
    end_date = end_date %>%
      mdy %>%
      as.Date,
    state = state %>%
      is.na %>%
      ifelse(
        "United States",
        state
      )
  )
d_votes <- "https://github.com/thomasjwood/ps7160/raw/master/votes_pres_senate_76-20.rds" %>%
  url %>%
  readRDS %>%
  filter(year == 2020)
Now we’ll do some simple computation: averaging each candidate’s polling from early September onward by race and state, and computing the Republican polling lead.
t_polls <- d_polls %>%
  filter(
    race %>%
      is_in(
        c("presidential election",
          "senate election")
      ) &
      end_date %>%
        is_weakly_greater_than(
          "2020-09-04" %>%
            as.Date
        ) &
      candidate_party %>%
        is_in(
          c("DEM", "REP")
        )
  ) %>%
  mutate(
    race = race %>%
      str_to_lower %>%
      str_extract(
        "president|senate"
      )
  ) %>%
  group_by(
    race, state, candidate_party
  ) %>%
  summarize(
    mu = pct %>% mean
  ) %>%
  pivot_wider(
    names_from = candidate_party,
    values_from = mu
  ) %>%
  mutate(
    poll_lead = REP - DEM
  )
t_votes <- d_votes %>%
  rename(
    race = office
  ) %>%
  mutate(
    race = race %>%
      str_to_lower %>%
      str_extract(
        "president|senate"
      ),
    party_simplified = party_simplified %>%
      str_sub(, 3),
    state = state %>%
      str_replace_all(
        state.name %>%
          c("District of Columbia") %>%
          set_attr(
            "names",
            state.name %>%
              str_to_upper %>%
              c("DISTRICT OF COLUMBIA")
          )
      )
  ) %>%
  filter(
    party_simplified %>%
      is_in(
        c("DEM", "REP")
      )
  ) %>%
  group_by(
    race, state
  ) %>%
  mutate(
    perc = candidatevotes %>%
      divide_by(
        candidatevotes %>%
          sum
      ) %>%
      multiply_by(100)
  ) %>%
  select(
    race, state, party_simplified, perc
  ) %>%
  # removing repeated single party candidates
  group_by(race, state, party_simplified) %>%
  slice(1) %>%
  group_by(race, state) %>%
  pivot_wider(
    names_from = party_simplified,
    values_from = perc
  ) %>%
  mutate(
    vote_lead = REP - DEM
  )
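Before joining, it can be worth checking which poll rows won’t find a match in the vote data. anti_join(), a filtering join from the same *_join family, returns the rows of x with no match in y rather than merging any columns (here, for instance, the national “United States” polls shouldn’t have a state-level vote counterpart):
# poll rows with no matching race/state in the vote data
t_polls %>%
  anti_join(
    t_votes,
    by = c("race", "state")
  )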
And we left_join()
to compare both quantities
t3 <- t_polls %>%
  left_join(
    t_votes %>%
      select(
        race, state, vote_lead
      )
  ) %>%
  mutate(
    miss = poll_lead - vote_lead
  ) %>%
  arrange(miss)
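With polls and votes side by side, we can also summarize how far off the polls were by race. A quick sketch (given how miss is defined above, positive values mean the polls overstated the Republican margin):
t3 %>%
  group_by(race) %>%
  summarize(
    mean_miss = miss %>%
      mean(na.rm = TRUE),
    mean_abs_miss = miss %>%
      abs %>%
      mean(na.rm = TRUE)
  )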
Or we can plot the comparison:
t3 %>%
  filter(
    state %>%
      equals("District of Columbia") %>%
      not
  ) %>%
  ggplot() +
  geom_text(
    aes(
      poll_lead,
      vote_lead,
      label = state %>%
        plyr::mapvalues(
          state.name,
          state.abb
        )
    )
  ) +
  geom_hline(
    yintercept = 0,
    linetype = "dashed"
  ) +
  geom_vline(
    xintercept = 0,
    linetype = "dashed"
  ) +
  geom_abline(
    intercept = 0,
    slope = 1,
    linetype = "dashed"
  ) +
  facet_wrap(~race, nrow = 1) +
  labs(
    x = "GOP polling lead - Dem polling lead",
    y = "GOP vote share - Dem vote share"
  )
I want to dedicate more attention to these creative uses of *_join to solve a broader array of problems. Hopefully the above is enough to provide some of the building blocks. But to motivate you:
diamonds %>%
  group_by(cut) %>%
  nest %>%
  left_join(
    expand_grid(
      cut = diamonds$cut %>% unique,
      frm = str_c(
        "price ~ ",
        c("color", "clarity", "carat")
      )
    )
  ) %>%
  mutate(
    mods = frm %>%
      map2(
        data,
        \(i, j)
          lm(as.formula(i), data = j)
      )
  )
## # A tibble: 15 × 4
## # Groups: cut [5]
## cut data frm mods
## <ord> <list> <chr> <list>
## 1 Ideal <tibble [21,551 × 9]> price ~ color <lm>
## 2 Ideal <tibble [21,551 × 9]> price ~ clarity <lm>
## 3 Ideal <tibble [21,551 × 9]> price ~ carat <lm>
## 4 Premium <tibble [13,791 × 9]> price ~ color <lm>
## 5 Premium <tibble [13,791 × 9]> price ~ clarity <lm>
## 6 Premium <tibble [13,791 × 9]> price ~ carat <lm>
## 7 Good <tibble [4,906 × 9]> price ~ color <lm>
## 8 Good <tibble [4,906 × 9]> price ~ clarity <lm>
## 9 Good <tibble [4,906 × 9]> price ~ carat <lm>
## 10 Very Good <tibble [12,082 × 9]> price ~ color <lm>
## 11 Very Good <tibble [12,082 × 9]> price ~ clarity <lm>
## 12 Very Good <tibble [12,082 × 9]> price ~ carat <lm>
## 13 Fair <tibble [1,610 × 9]> price ~ color <lm>
## 14 Fair <tibble [1,610 × 9]> price ~ clarity <lm>
## 15 Fair <tibble [1,610 × 9]> price ~ carat <lm>
Now we’re off to the races! List columns, everything!
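To give one nudge in that direction: suppose we assigned the data frame above to a name, say m_tab (a hypothetical name, since the chunk above just printed its result). Then broom::tidy() plus unnest() would flatten each model’s coefficients back into a regular tibble. A sketch, assuming the broom package is installed:
library(broom)

m_tab %>%                  # the nested data frame built above (hypothetical name)
  mutate(
    tidied = mods %>%
      map(tidy)            # one coefficient table per fitted model
  ) %>%
  select(cut, frm, tidied) %>%
  unnest(tidied)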