Reshaping tabular data

I can very clearly remember reading Hadley Wickham’s 2007 JSS paper on reshaping data (written while he was still a grad student!), and struggling to understand why we’d want to invest so much time in his approach. I initially thought reshaping was a toolkit designed to address an anomaly, a special class of misshapen data.

Looking back, it’s apparent why I labored under this misapprehension: this is how Saint Hadley introduced the approach:

     subject time age weight height
1 John Smith    1   33     90      2
2 Mary Smith    1                  2

and then

     subject time variable value
1 John Smith    1      age    33
2 John Smith    1   weight    90
3 John Smith    1   height     2
4 Mary Smith    1   height     2

So, I could grasp that these were two separate ways to represent the same data, but why would we care about a general approach for moving data between these formats?

It turns out that reshaping data is an enormously powerful approach, useful for solving a huge array of problems. Not quite split-apply-combine, but gosh, it’s close! (It’s also far less well understood among social scientists, so it’s a real point of distinction for those who master it.)

How can we understand the stakes?

  1. Long data are for computation
  2. Wide data are for comparisons

Becoming adept at reshaping will lead you to use these tools constantly.

A note about software and terminology, where you can be led astray when looking for help online. The three packages which Hadley and his peers developed for reshaping are:

package     introduced   function: wide -> long   function: long -> wide
reshape     2006         reshape                  deshape
reshape2    2011         melt                     cast
tidyr 0.2   2014         gather                   spread
tidyr 1.0   2019         pivot_longer             pivot_wider

The older packages are still around, but are no longer under active development.

Let’s do a couple of toy examples before we get to a typical application used in opinion research.

Pivoting longer (converting variables to cases)

So maybe we can reinforce this further with a graphical aid?

[Figure: a diagrammatic summary of pivoting]

A nice toy example

library(tidyverse)
library(magrittr)

t0 <- tribble(
  ~name, ~class, ~exam, ~paper, ~attendance,
  "kelly", "psych_100", 85, 90, 71,
  "kelly", "psych_220", 75, 68, 84,
  "mike",  "psych_100", 83, 87, 81,
  "cindy", "psych_100", 91, 65, 73,
  "james",  "psych_300", 69, 69, 65
  )

Compare the before…

t0
## # A tibble: 5 × 5
##   name  class      exam paper attendance
##   <chr> <chr>     <dbl> <dbl>      <dbl>
## 1 kelly psych_100    85    90         71
## 2 kelly psych_220    75    68         84
## 3 mike  psych_100    83    87         81
## 4 cindy psych_100    91    65         73
## 5 james psych_300    69    69         65

…and the after (longer):

t1 <- t0 %>% 
  pivot_longer(
    cols = exam:attendance, 
    names_to = "item",
    values_to = "score"
  )

t1
## # A tibble: 15 × 4
##    name  class     item       score
##    <chr> <chr>     <chr>      <dbl>
##  1 kelly psych_100 exam          85
##  2 kelly psych_100 paper         90
##  3 kelly psych_100 attendance    71
##  4 kelly psych_220 exam          75
##  5 kelly psych_220 paper         68
##  6 kelly psych_220 attendance    84
##  7 mike  psych_100 exam          83
##  8 mike  psych_100 paper         87
##  9 mike  psych_100 attendance    81
## 10 cindy psych_100 exam          91
## 11 cindy psych_100 paper         65
## 12 cindy psych_100 attendance    73
## 13 james psych_300 exam          69
## 14 james psych_300 paper         69
## 15 james psych_300 attendance    65

At this stage, we should note that

  1. There was no change made to the underlying data. Nothing was created, removed, or summarized.
  2. Instead, these are just two ways to represent the same underlying data.
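
One quick way to convince yourself of this, previewing pivot_wider() (which we cover properly below), is to pivot t1 back out to wide; it should print the same table as t0.

# round trip: the long table, pivoted back out, should reproduce t0
t1 %>% 
  pivot_wider(
    names_from = item, 
    values_from = score
  )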

A brief interlude

With the “long” version of these data (t1), please:

  • Make a table which reports a student/class average, assuming an equal weight for all assessment components (a sketch of one approach follows this list).
  • Make a subtable which reports the average for exams and papers, sorted by the average and component.

But most importantly–

  • Do it first when these data are long (i.e., as t1)
  • Then do it when the data are wide (i.e., as t0)
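
For the first task, here is a minimal sketch of the long-data approach (one possible solution, not the only one), using t1 and plain dplyr:

# an equally weighted student/class average is just a grouped mean
t1 %>% 
  group_by(name, class) %>% 
  summarize(
    avg = mean(score), 
    .groups = "drop"
  )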

We can also preview the pivot_wider() function, to compare students’ performance in psych_100:

t2 <- t1 %>% 
  filter(
    class %>% 
      equals("psych_100")
  ) %>% 
  pivot_wider(
    names_from = name, 
    values_from = score
    )

We’ll get to these arguments in a moment, but just quickly use a wide table to figure out:

  • On which class components did Kelly beat Cindy, and on which components did Cindy beat Kelly?
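
A hedged way to do this with the wide table (the indicator column name below is purely illustrative):

# t2 has one column per student; flag the rows where kelly's score exceeds cindy's
t2 %>% 
  mutate(kelly_beats_cindy = kelly > cindy) %>% 
  select(item, kelly, cindy, kelly_beats_cindy)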

And back to the action…

We can borrow some data from tidyr to demonstrate the core functionality of pivot_longer().

data("relig_income")
head(relig_income)
## # A tibble: 6 × 11
##   religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
## 1 Agnostic       27        34        60        81        76       137        122
## 2 Atheist        12        27        37        52        35        70         73
## 3 Buddhist       27        21        30        34        33        58         62
## 4 Catholic      418       617       732       670       638      1116        949
## 5 Don’t kn…      15        14        15        11        10        35         21
## 6 Evangeli…     575       869      1064       982       881      1486        949
## # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
## #   `Don't know/refused` <dbl>

First things first–whoa, what a mess. All those $s actually in the variable names. Anyway, that’s life in the big city.

The survey:

  • … was conducted in 2007 by Pew (“The Religious Landscape Study”)
  • … comprised 35,000 phone interviews, modeled to be both nationally representative and representative of the demographics (race, income, and educational attainment) of all 50 states.

But this table, relig_income, is just a weighted contingency table of household income by religious affiliation.

Imagine we want to know:

“Is there an association between atheist identification and income? Specifically, is the share of atheists with incomes less than or equal to $50,000/annum the same as the share of those with higher incomes?”

The weirdness of the underlying table makes this a non-obvious challenge.

  • Do we compute the atheist share separately for each income group, and then compute an overall weighted mean?

  • Do we compute a mean income for each religious affiliation?

No, we do not do these silly things. We will pivot_longer().

Let’s do this step by step:

relig_income %>% 
  pivot_longer(
    cols = `<$10k`:`>150k`, 
    names_to = "income", 
    values_to = "inc_n"
    )
## # A tibble: 162 × 4
##    religion `Don't know/refused` income    inc_n
##    <chr>                   <dbl> <chr>     <dbl>
##  1 Agnostic                   96 <$10k        27
##  2 Agnostic                   96 $10-20k      34
##  3 Agnostic                   96 $20-30k      60
##  4 Agnostic                   96 $30-40k      81
##  5 Agnostic                   96 $40-50k      76
##  6 Agnostic                   96 $50-75k     137
##  7 Agnostic                   96 $75-100k    122
##  8 Agnostic                   96 $100-150k   109
##  9 Agnostic                   96 >150k        84
## 10 Atheist                    76 <$10k        12
## # ℹ 152 more rows

Whoa! I didn’t mean to leave Don't know/refused in there, but maybe it will be pedagogically instructive?

What did we do?

  1. We told pivot_longer() which variables to pivot with its first argument, cols.
  2. The other stuff was just variable naming! We could have said whatever!

Maybe the most important thing: why is Don't know/refused repeated? It’s because pivoting doesn’t create or destroy data. That column wasn’t pivoted, so each of Agnostic’s nine long rows simply carries along the single value (96) that used to sit in the one Agnostic row.

But already, in its long form, this is way more tractable for next steps. Let’s fix our DK fuckup and proceed.

relig_income %>% 
  pivot_longer(
    cols = `<$10k`:`Don't know/refused`, 
    names_to = "income", 
    values_to = "inc_n"
    ) %>% 
  mutate(
    inc2 = income %>% 
      case_match(
        c("<$10k",
          "$10-20k",
          "$20-30k",
          "$30-40k",
          "$40-50k") ~ "<$50k",
        "Don't know/refused" ~ "DK/Ref", 
        .default = ">$50k" 
        ) %>% 
      fct_inorder
    ) %>% 
  group_by(
    religion, inc2, 
  ) %>% 
  tally(wt = inc_n)
## # A tibble: 54 × 3
## # Groups:   religion [18]
##    religion inc2       n
##    <chr>    <fct>  <dbl>
##  1 Agnostic <$50k    278
##  2 Agnostic >$50k    452
##  3 Agnostic DK/Ref    96
##  4 Atheist  <$50k    163
##  5 Atheist  >$50k    276
##  6 Atheist  DK/Ref    76
##  7 Buddhist <$50k    145
##  8 Buddhist >$50k    212
##  9 Buddhist DK/Ref    54
## 10 Catholic <$50k   3075
## # ℹ 44 more rows

Whoa, our old friends from dplyr! Excuse me for the gratuitous use of case_match(), but I needed a nice indicator to simplify the income groups.

rel2 <- relig_income %>% 
  pivot_longer(
    cols = `<$10k`:`Don't know/refused`, 
    names_to = "income", 
    values_to = "inc_n"
    ) %>% 
  mutate(
    inc2 = income %>% 
      case_match(
        c("<$10k",
          "$10-20k",
          "$20-30k",
          "$30-40k",
          "$40-50k") ~ "<$50k",
        "Don't know/refused" ~ "DK/Ref", 
        .default = ">$50k" 
        ) %>% 
      fct_inorder
    ) %>% 
  group_by(
    religion, inc2, 
  ) %>% 
  tally(wt = inc_n) %>% 
  mutate(
    perc = n %>% 
      divide_by(n %>% sum) %>% 
      multiply_by(100)
  ) %>% 
  filter(
    inc2 == "<$50k"
  ) %>% 
  arrange(desc(perc))
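
To circle back to the motivating question, one option (a sketch, not the only way) is to pull the atheist row out of rel2; note that perc keeps the DK/Ref respondents in its denominator.

# the (weighted) share of atheists reporting household income under $50k;
# DK/refused respondents remain in the denominator of perc
rel2 %>% 
  filter(religion == "Atheist")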

Pivoting wider (converting cases to variables)

Pivoting wider helps us make comparisons:

us_rent_income 
## # A tibble: 104 × 5
##    GEOID NAME       variable estimate   moe
##    <chr> <chr>      <chr>       <dbl> <dbl>
##  1 01    Alabama    income      24476   136
##  2 01    Alabama    rent          747     3
##  3 02    Alaska     income      32940   508
##  4 02    Alaska     rent         1200    13
##  5 04    Arizona    income      27517   148
##  6 04    Arizona    rent          972     4
##  7 05    Arkansas   income      23789   165
##  8 05    Arkansas   rent          709     5
##  9 06    California income      29454   109
## 10 06    California rent         1358     3
## # ℹ 94 more rows

To pivot wider, we only need to tell pivot_wider() which variable supplies the new variable names, and which supplies the new variables’ values.

us_rent_wide <- us_rent_income %>% 
  select(-moe) %>% 
  pivot_wider(
    names_from = variable, 
    values_from = estimate
    )

us_rent_wide
## # A tibble: 52 × 4
##    GEOID NAME                 income  rent
##    <chr> <chr>                 <dbl> <dbl>
##  1 01    Alabama               24476   747
##  2 02    Alaska                32940  1200
##  3 04    Arizona               27517   972
##  4 05    Arkansas              23789   709
##  5 06    California            29454  1358
##  6 08    Colorado              32401  1125
##  7 09    Connecticut           35326  1123
##  8 10    Delaware              31560  1076
##  9 11    District of Columbia  43198  1424
## 10 12    Florida               25952  1077
## # ℹ 42 more rows

And we can use this table to figure out which states have the biggest and smallest surplus income estimates:

us_rent_wide %>% 
  na.omit %>% 
  mutate(
    surp_inc = income %>% 
      subtract(
        rent
      )
  ) %>% 
  arrange(surp_inc) %>% 
  slice(
    c(1:5, 47:51)
  )
## # A tibble: 10 × 5
##    GEOID NAME                 income  rent surp_inc
##    <chr> <chr>                 <dbl> <dbl>    <dbl>
##  1 28    Mississippi           22766   740    22026
##  2 54    West Virginia         23707   681    23026
##  3 05    Arkansas              23789   709    23080
##  4 35    New Mexico            24457   809    23648
##  5 01    Alabama               24476   747    23729
##  6 25    Massachusetts         34498  1173    33325
##  7 34    New Jersey            35075  1249    33826
##  8 09    Connecticut           35326  1123    34203
##  9 24    Maryland              37147  1311    35836
## 10 11    District of Columbia  43198  1424    41774

An exercise

We’re going to practice reshaping some ANES data: a subset of the 2020 ANES.

There are three question types which underpin these 9 variables:

libcon: We hear a lot of talk these days about liberals and conservatives. Here is a seven-point scale on which the political views that people might hold are arranged from extremely liberal to extremely conservative. Where would you place yourself on this scale, or haven't you thought much about this?

govspend: Some people think the government should provide fewer services even in areas such as health and education in order to reduce spending. Suppose these people are at one end of a scale, at point 7. Other people feel it is important for the government to provide many more services even if it means an increase in spending. Suppose these people are at the other end, at point 1. And, of course, some other people have opinions somewhere in between, at points 2, 3, 4, 5 or 6.

defspend: Some people believe that we should spend much less money for defense. Suppose these people are at one end of a scale, at point 1. Others feel that defense spending should be greatly increased. Suppose these people are at the other end, at point 7. And, of course, some other people have opinions somewhere in between, at points 2, 3, 4, 5 or 6. Where would you place yourself on this scale, or haven’t you thought much about this?

Respondents located themselves, and the Democratic and Republican presidential candidates on each of the three scales above.

We can load these data here:

a1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/anes20_scales_wide.rds" %>% 
  url %>% 
  readRDS

Once loaded, try to answer the following:

  1. Where do Democratic and Republican respondents locate both candidates on each scale?
  2. What share of respondents perceive the Democratic candidate as more liberal than the Republican, by scale and level of educational attainment?
  3. Do partisans perceive their parties’ nominees as more conservative or less conservative than themselves, by scale and educational attainment?
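
As a starting point for the first question, here is a hedged sketch. The column names below (pid3, and scale variables named like libcon_self, libcon_dem, libcon_rep) are assumptions for illustration only, not the actual names in a1; run glimpse(a1) first and adjust accordingly.

glimpse(a1)  # inspect the real column names before reshaping

# hypothetical column names below -- adjust to whatever a1 actually contains
a1 %>% 
  pivot_longer(
    cols = starts_with(c("libcon", "govspend", "defspend")),
    names_to = c("scale", "target"),
    names_sep = "_",
    values_to = "placement"
  ) %>% 
  filter(target %in% c("dem", "rep")) %>% 
  group_by(pid3, scale, target) %>% 
  summarize(
    mean_placement = mean(placement, na.rm = TRUE),
    .groups = "drop"
  )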