Two table verbs–way more useful than you think.

Joining one table to another probably appeals to your intuition as an application when you’re joining between different levels of data:

But it’s actually a way more general approach! You’ve probably seen me gratuitously left_joining in other syntax I’ve written for this lab. It comes up when you want to do something like

Let’s do a nice graphical summary of two table verbs–*_join, specifically

The four major join types:



Four kinds of two table joins. In each example, the left circle is referred to by x , and the right circle referred to as y .
Four kinds of two table joins. In each example, the left circle is referred to by x , and the right circle referred to as y .



What does each join achieve?

  1. An inner_join() keeps only those rows from both x and y.

  2. A left_join() keeps all the rows from x, with matching rows from y.

  3. A full_join() keeps all rows from both x and y, whether they match or not.

  4. A right_join() keeps all the rows from y, with matching rows from x.

Let’s do some examples.

An inner_join()

library(tidyverse) 
library(magrittr) 

x <- tribble(
  ~id, ~y1,
  "A",  .5,
  "B",  1,
  "C",  .5,
  "D",  1.5
)

y <- tribble(
  ~id, ~y2,
  "B",  -1,
  "C",  -.5,
  "E",  -1.5
)

x %>%
  inner_join(y) 
## # A tibble: 2 × 3
##   id       y1    y2
##   <chr> <dbl> <dbl>
## 1 B       1    -1  
## 2 C       0.5  -0.5

As we can see, we merged on id (when we omitted the argument to by inside inner_join(), dplyr insightful chose to join on all the variables which were common to both data frames) and kept only the rows which were in both x and y,

A left_join()

x %>%
  left_join(y) 
## # A tibble: 4 × 3
##   id       y1    y2
##   <chr> <dbl> <dbl>
## 1 A       0.5  NA  
## 2 B       1    -1  
## 3 C       0.5  -0.5
## 4 D       1.5  NA

We merged on id and kept all the rows from x, and the matching rows from y.

A full_join()

x %>%
  full_join(y) 
## # A tibble: 5 × 3
##   id       y1    y2
##   <chr> <dbl> <dbl>
## 1 A       0.5  NA  
## 2 B       1    -1  
## 3 C       0.5  -0.5
## 4 D       1.5  NA  
## 5 E      NA    -1.5

We merged on id and kept all the rows from either x or y.

A right_join()

x %>%
  right_join(y) 
## # A tibble: 3 × 3
##   id       y1    y2
##   <chr> <dbl> <dbl>
## 1 B       1    -1  
## 2 C       0.5  -0.5
## 3 E      NA    -1.5

We merged on id and kept all the rows from y, and the matching rows from x.

An example of joining

Let’s use joins to compare polls to vote

d_polls <- "https://github.com/thomasjwood/ps7160/raw/master/polls_20.rds" %>%
  url %>%
  readRDS %>%
  as_tibble %>%
  mutate(
    end_date = end_date %>%
      mdy %>%
      as.Date,
    state = state %>%
      is.na %>%
      ifelse(
        "United States",
        state
      )
  )

d_votes <- "https://github.com/thomasjwood/ps7160/raw/master/votes_pres_senate_76-20.rds" %>%
  url %>%
  readRDS %>% 
  filter(year == 2020)

Now we’ll do some simple computation

t_polls <- d_polls %>%
  filter(
    race %>%
      is_in(
        c("presidential election",
          "senate election"
        )
      ) &
      end_date %>%
      is_weakly_greater_than(
        "2020-09-04" %>%
          as.Date
      ) &
      candidate_party %>%
      is_in(
        c("DEM", "REP")
      )
  ) %>%
  mutate(
    race = race %>%
      str_to_lower %>%
      str_extract(
        "president|senate"
      )
  ) %>%
  group_by(
    race, state, candidate_party
  ) %>%
  summarize(
    mu = pct %>% mean
  ) %>%
  pivot_wider(
    names_from = candidate_party,
    values_from = mu
  ) %>%
  mutate(
    poll_lead = REP - DEM
  )

t_votes <- d_votes %>%
  rename(
    race = office
  ) %>%
  mutate(
    race = race %>%
      str_to_lower %>%
      str_extract(
        "president|senate"
      ),
    party_simplified = party_simplified %>%
      str_sub(, 3),
    state = state %>%
      str_replace_all(
        state.name %>%
          c("District of Columbia") %>%
          set_attr(
            "names",
            state.name %>%
              str_to_upper %>%
              c("DISTRICT OF COLUMBIA")
          )
      )
  ) %>%
  filter(
    party_simplified %>%
      is_in(
        c("DEM", "REP")
      )
  ) %>%
  group_by(
    race, state
  ) %>%
  mutate(
    perc = candidatevotes %>%
      divide_by(
        candidatevotes %>%
          sum
      ) %>%
      multiply_by(100)
  ) %>%
  select(
    race, state, party_simplified, perc
  ) %>%
  # removing repeated single party candidates
  group_by(race, state, party_simplified) %>%
  slice(1) %>%
  group_by(race, state) %>%
  pivot_wider(
    names_from = party_simplified,
    values_from = perc
  ) %>%
  mutate(
    vote_lead = REP - DEM
  )

And we left_join() to compare both quantities

t3 <- t_polls %>%
  left_join(
    t_votes %>%
      select(
        race, state, vote_lead
      )
  ) %>%
  mutate(
    miss = poll_lead - vote_lead
  ) %>%
  arrange(miss)

Or we can plot

t3 %>%
  filter(
    state %>%
      equals("District of Columbia") %>%
      not
  ) %>%
  ggplot() +
  geom_text(
    aes(
      poll_lead,
      vote_lead,
      label = state %>%
        plyr::mapvalues(
          state.name,
          state.abb
        )
    )
  ) +
  geom_hline(
    yintercept = 0,
    linetype = "dashed"
  ) +
  geom_vline(
    xintercept = 0,
    linetype = "dashed"
  ) +
  geom_abline(
    yintercept = 0,
    slope = 1,
    linetype = "dashed"
  ) +
  facet_wrap(~race, nrow = 1) +
  labs(
    x = "GOP polling lead  - Dem polling Lead",
    y = "GOP vote share - Dem vote share"
  )

More creative use of joins

I want to dedicate more attention to these more creative uses of *_join to solve a broader array of problems. Hopefully the above is enough to provide some of the building blocks. But to motivate you:

diamonds %>% 
  group_by(cut) %>% 
  nest %>% 
  left_join(
    expand_grid(
      cut = diamonds$cut %>% unique, 
      frm = str_c(
        "price ~ ",
        c("color", "clarity", "carat")
        )
      )
    ) %>% 
  mutate(
    mods = frm %>% 
      map2(
        data,
          \(i, j)
          lm(as.formula(i), data = j)
          )
    )
## # A tibble: 15 × 4
## # Groups:   cut [5]
##    cut       data                  frm             mods  
##    <ord>     <list>                <chr>           <list>
##  1 Ideal     <tibble [21,551 × 9]> price ~ color   <lm>  
##  2 Ideal     <tibble [21,551 × 9]> price ~ clarity <lm>  
##  3 Ideal     <tibble [21,551 × 9]> price ~ carat   <lm>  
##  4 Premium   <tibble [13,791 × 9]> price ~ color   <lm>  
##  5 Premium   <tibble [13,791 × 9]> price ~ clarity <lm>  
##  6 Premium   <tibble [13,791 × 9]> price ~ carat   <lm>  
##  7 Good      <tibble [4,906 × 9]>  price ~ color   <lm>  
##  8 Good      <tibble [4,906 × 9]>  price ~ clarity <lm>  
##  9 Good      <tibble [4,906 × 9]>  price ~ carat   <lm>  
## 10 Very Good <tibble [12,082 × 9]> price ~ color   <lm>  
## 11 Very Good <tibble [12,082 × 9]> price ~ clarity <lm>  
## 12 Very Good <tibble [12,082 × 9]> price ~ carat   <lm>  
## 13 Fair      <tibble [1,610 × 9]>  price ~ color   <lm>  
## 14 Fair      <tibble [1,610 × 9]>  price ~ clarity <lm>  
## 15 Fair      <tibble [1,610 × 9]>  price ~ carat   <lm>

Now we’re off to the races! List columns, everything!


  1. Now, this is also possibly given a group_by() %>% mutate() chain, but the left_join() approach might be more pedagogically apparent.↩︎

  2. This also could be done with a specialist function–specifically, tidyr::complete()↩︎