Joining one table to another probably appeals to your intuition as something you reach for when you’re combining data stored at different levels. But it’s actually a much more general approach! You’ve probably seen me gratuitously left_join()-ing in other syntax I’ve written for this lab. It comes up, for instance, when you want to do something like full_join() to make sure we have rows in the data frame for every possible combination of values, even if they don’t show up in the actual data.
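Here’s a minimal sketch of that “every combination” use, with a made-up obs data frame (the grp, wave, and n columns are purely for illustration):
library(tidyverse)

# a small data frame which is missing the ("b", 2) combination
obs <- tribble(
  ~grp, ~wave, ~n,
  "a",  1,     10,
  "a",  2,     12,
  "b",  1,      8
)

# full_join() against a complete grid of combinations keeps the
# missing ("b", 2) row, with n filled in as NA
expand_grid(
  grp = c("a", "b"),
  wave = 1:2
) %>%
  full_join(obs, by = c("grp", "wave"))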
Here we’ll focus on the *_join functions specifically.
[Figure: the four major join types, with the left circle referred to as x and the right circle referred to as y.]
What does each join achieve?
An inner_join() keeps only those rows which match in both x and y.
A left_join() keeps all the rows from x, with matching rows from y.
A full_join() keeps all rows from both x and y, whether they match or not.
A right_join() keeps all the rows from y, with matching rows from x.
Let’s do some examples.
inner_join()
library(tidyverse)
library(magrittr)
x <- tribble(
  ~id, ~y1,
  "A", .5,
  "B", 1,
  "C", .5,
  "D", 1.5
)
y <- tribble(
  ~id, ~y2,
  "B", -1,
  "C", -.5,
  "E", -1.5
)
x %>%
inner_join(y)
## # A tibble: 2 × 3
## id y1 y2
## <chr> <dbl> <dbl>
## 1 B 1 -1
## 2 C 0.5 -0.5
As we can see, we merged on id (when we omitted the by argument inside inner_join(), dplyr insightfully chose to join on all the variables which were common to both data frames) and kept only the rows which were in both x and y.
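If you’d rather be explicit about the key (or the automatic choice picks up columns you don’t want to join on), you can name it yourself:
x %>%
  inner_join(y, by = "id")

# or, in dplyr 1.1.0 and later, with join_by()
x %>%
  inner_join(y, by = join_by(id))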
left_join()
x %>%
left_join(y)
## # A tibble: 4 × 3
## id y1 y2
## <chr> <dbl> <dbl>
## 1 A 0.5 NA
## 2 B 1 -1
## 3 C 0.5 -0.5
## 4 D 1.5 NA
We merged on id and kept all the rows from x, and the matching rows from y.
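One wrinkle worth knowing: the key columns don’t have to share a name. If y had called its identifier code instead of id (a hypothetical renaming, just to illustrate), you’d spell out the correspondence:
y_renamed <- y %>%
  rename(code = id)

# join x$id to y_renamed$code
x %>%
  left_join(y_renamed, by = c("id" = "code"))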
full_join()
x %>%
full_join(y)
## # A tibble: 5 × 3
## id y1 y2
## <chr> <dbl> <dbl>
## 1 A 0.5 NA
## 2 B 1 -1
## 3 C 0.5 -0.5
## 4 D 1.5 NA
## 5 E NA -1.5
We merged on id and kept all the rows from either x or y.
right_join()
x %>%
right_join(y)
## # A tibble: 3 × 3
## id y1 y2
## <chr> <dbl> <dbl>
## 1 B 1 -1
## 2 C 0.5 -0.5
## 3 E NA -1.5
We merged on id and kept all the rows from y, and the matching rows from x.
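In other words, a right_join() is just the mirror image of a left_join(): swapping which data frame comes first returns the same rows, though the columns come out in a different order.
# these two calls return the same rows
x %>%
  right_join(y)

y %>%
  left_join(x)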
Let’s use joins to compare polls to vote outcomes.
d_polls <- "https://github.com/thomasjwood/ps7160/raw/master/polls_20.rds" %>%
  url %>%
  readRDS %>%
  as_tibble %>%
  mutate(
    end_date = end_date %>%
      mdy %>%
      as.Date,
    state = state %>%
      is.na %>%
      ifelse(
        "United States",
        state
      )
  )
d_votes <- "https://github.com/thomasjwood/ps7160/raw/master/votes_pres_senate_76-20.rds" %>%
  url %>%
  readRDS %>%
  filter(year == 2020)
Now we’ll do some simple computation: averaging each candidate’s polling from early September onward by race and state, and computing the Republican polling lead.
t_polls <- d_polls %>%
  filter(
    race %>%
      is_in(
        c("presidential election",
          "senate election")
      ) &
      end_date %>%
        is_weakly_greater_than(
          "2020-09-04" %>%
            as.Date
        ) &
      candidate_party %>%
        is_in(
          c("DEM", "REP")
        )
  ) %>%
  mutate(
    race = race %>%
      str_to_lower %>%
      str_extract(
        "president|senate"
      )
  ) %>%
  group_by(
    race, state, candidate_party
  ) %>%
  summarize(
    mu = pct %>% mean
  ) %>%
  pivot_wider(
    names_from = candidate_party,
    values_from = mu
  ) %>%
  mutate(
    poll_lead = REP - DEM
  )
t_votes <- d_votes %>%
  rename(
    race = office
  ) %>%
  mutate(
    race = race %>%
      str_to_lower %>%
      str_extract(
        "president|senate"
      ),
    party_simplified = party_simplified %>%
      str_sub(, 3),
    state = state %>%
      str_replace_all(
        state.name %>%
          c("District of Columbia") %>%
          set_attr(
            "names",
            state.name %>%
              str_to_upper %>%
              c("DISTRICT OF COLUMBIA")
          )
      )
  ) %>%
  filter(
    party_simplified %>%
      is_in(
        c("DEM", "REP")
      )
  ) %>%
  group_by(
    race, state
  ) %>%
  mutate(
    perc = candidatevotes %>%
      divide_by(
        candidatevotes %>%
          sum
      ) %>%
      multiply_by(100)
  ) %>%
  select(
    race, state, party_simplified, perc
  ) %>%
  # removing repeated single party candidates
  group_by(race, state, party_simplified) %>%
  slice(1) %>%
  group_by(race, state) %>%
  pivot_wider(
    names_from = party_simplified,
    values_from = perc
  ) %>%
  mutate(
    vote_lead = REP - DEM
  )
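Before joining, it can be worth checking which poll rows won’t find a match in the vote data. anti_join(), a filtering join from the same *_join family, returns the rows of x with no match in y rather than merging any columns (here, for instance, the national “United States” polls shouldn’t have a state-level vote counterpart):
# poll rows with no matching race/state in the vote data
t_polls %>%
  anti_join(
    t_votes,
    by = c("race", "state")
  )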
And we left_join()
to compare both quantities
t3 <- t_polls %>%
  left_join(
    t_votes %>%
      select(
        race, state, vote_lead
      )
  ) %>%
  mutate(
    miss = poll_lead - vote_lead
  ) %>%
  arrange(miss)
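With polls and votes side by side, we can also summarize how far off the polls were by race. A quick sketch (given how miss is defined above, positive values mean the polls overstated the Republican margin):
t3 %>%
  group_by(race) %>%
  summarize(
    mean_miss = miss %>%
      mean(na.rm = TRUE),
    mean_abs_miss = miss %>%
      abs %>%
      mean(na.rm = TRUE)
  )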
Or we can plot the comparison:
t3 %>%
  filter(
    state %>%
      equals("District of Columbia") %>%
      not
  ) %>%
  ggplot() +
  geom_text(
    aes(
      poll_lead,
      vote_lead,
      label = state %>%
        plyr::mapvalues(
          state.name,
          state.abb
        )
    )
  ) +
  geom_hline(
    yintercept = 0,
    linetype = "dashed"
  ) +
  geom_vline(
    xintercept = 0,
    linetype = "dashed"
  ) +
  geom_abline(
    intercept = 0,
    slope = 1,
    linetype = "dashed"
  ) +
  facet_wrap(~race, nrow = 1) +
  labs(
    x = "GOP polling lead - Dem polling lead",
    y = "GOP vote share - Dem vote share"
  )
I want to dedicate more attention to these creative uses of *_join to solve a broader array of problems. Hopefully the above is enough to provide some of the building blocks. But to motivate you:
diamonds %>%
  group_by(cut) %>%
  nest %>%
  left_join(
    expand_grid(
      cut = diamonds$cut %>% unique,
      frm = str_c(
        "price ~ ",
        c("color", "clarity", "carat")
      )
    )
  ) %>%
  mutate(
    mods = frm %>%
      map2(
        data,
        \(i, j)
          lm(as.formula(i), data = j)
      )
  )
## # A tibble: 15 × 4
## # Groups: cut [5]
## cut data frm mods
## <ord> <list> <chr> <list>
## 1 Ideal <tibble [21,551 × 9]> price ~ color <lm>
## 2 Ideal <tibble [21,551 × 9]> price ~ clarity <lm>
## 3 Ideal <tibble [21,551 × 9]> price ~ carat <lm>
## 4 Premium <tibble [13,791 × 9]> price ~ color <lm>
## 5 Premium <tibble [13,791 × 9]> price ~ clarity <lm>
## 6 Premium <tibble [13,791 × 9]> price ~ carat <lm>
## 7 Good <tibble [4,906 × 9]> price ~ color <lm>
## 8 Good <tibble [4,906 × 9]> price ~ clarity <lm>
## 9 Good <tibble [4,906 × 9]> price ~ carat <lm>
## 10 Very Good <tibble [12,082 × 9]> price ~ color <lm>
## 11 Very Good <tibble [12,082 × 9]> price ~ clarity <lm>
## 12 Very Good <tibble [12,082 × 9]> price ~ carat <lm>
## 13 Fair <tibble [1,610 × 9]> price ~ color <lm>
## 14 Fair <tibble [1,610 × 9]> price ~ clarity <lm>
## 15 Fair <tibble [1,610 × 9]> price ~ carat <lm>
Now we’re off to the races! List columns, everything!
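To give one nudge in that direction: suppose we assigned the data frame above to a name, say m_tab (a hypothetical name, since the chunk above just printed its result). Then broom::tidy() plus unnest() would flatten each model’s coefficients back into a regular tibble. A sketch, assuming the broom package is installed:
library(broom)

m_tab %>%                  # the nested data frame built above (hypothetical name)
  mutate(
    tidied = mods %>%
      map(tidy)            # one coefficient table per fitted model
  ) %>%
  select(cut, frm, tidied) %>%
  unnest(tidied)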