I can very clearly remember reading Hadley Wickham’s 2007 JSS paper on reshaping data (written while he was still a grad student!), and struggling to understand why we’d want to invest so much time in his approach. I initially thought reshaping was a toolkit designed to address an anomaly, where we’re dealing with a special class of misshapen data.
Looking back, it’s apparent why I labored under this misapprehension: this is how Saint Hadley introduced the approach.
| | subject | time | age | weight | height |
|---|---|---|---|---|---|
| 1 | John Smith | 1 | 33 | 90 | 2 |
| 2 | Mary Smith | 1 | | | 2 |
and then
| | subject | time | variable | value |
|---|---|---|---|---|
| 1 | John Smith | 1 | age | 33 |
| 2 | John Smith | 1 | weight | 90 |
| 3 | John Smith | 1 | height | 2 |
| 4 | Mary Smith | 1 | height | 2 |
So, I could grasp that these were two separate ways to represent the same data, but why would we care about a general approach for moving data between these formats?
It turns out that reshaping is an enormously powerful technique, useful for solving a huge array of problems. Not quite split-apply-combine, but gosh, it’s close! (It’s also far less well understood among social scientists, so it’s a real point of distinction for those who master it.)
How can we understand the stakes? Simply: once you become adept at reshaping, you will reach for these tools constantly.
A note about software and terminology, where you can be led astray when looking for help online. The three packages which Hadley and his collaborators have developed for reshaping are:
| package | introduced | function: wide -> long | function: long -> wide |
|---|---|---|---|
| reshape | 2006 | reshape | deshape |
| reshape2 | 2011 | melt | cast |
| tidyr 0.2 | 2014 | gather | spread |
| tidyr 1.0 | 2019 | pivot_longer | pivot_wider |
The older packages are still around, but are no longer under active development.
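For orientation, here’s the same wide -> long step in the more recent spellings. This is a minimal sketch on a made-up two-column data frame, and assumes reshape2 is installed (the tidyr < 1.0 spelling still runs, but is deprecated):
df <- data.frame(id = 1:2, a = c(1, 3), b = c(2, 4))  # made-up example

reshape2::melt(df, id.vars = "id")     # reshape2 (2011)
tidyr::gather(df, key, value, a:b)     # tidyr < 1.0 (deprecated)
tidyr::pivot_longer(df, cols = a:b)    # tidyr >= 1.0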
Let’s do a couple of toy examples before we get to a typical application from opinion research.
A nice toy example
library(tidyverse)  # pivot_longer(), pivot_wider(), and the dplyr verbs
library(magrittr)   # pipe aliases like equals(), divide_by(), subtract()
t0 <- tribble(
~name, ~class, ~exam, ~paper, ~attendance,
"kelly", "psych_100", 85, 90, 71,
"kelly", "psych_220", 75, 68, 84,
"mike", "psych_100", 83, 87, 81,
"cindy", "psych_100", 91, 65, 73,
"james", "psych_300", 69, 69, 65
)
Compare the before…
t0
## # A tibble: 5 × 5
## name class exam paper attendance
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 kelly psych_100 85 90 71
## 2 kelly psych_220 75 68 84
## 3 mike psych_100 83 87 81
## 4 cindy psych_100 91 65 73
## 5 james psych_300 69 69 65
…and the after (longer):
t1 <- t0 %>%
pivot_longer(
cols = exam:attendance,
names_to = "item",
values_to = "score"
)
t1
## # A tibble: 15 × 4
## name class item score
## <chr> <chr> <chr> <dbl>
## 1 kelly psych_100 exam 85
## 2 kelly psych_100 paper 90
## 3 kelly psych_100 attendance 71
## 4 kelly psych_220 exam 75
## 5 kelly psych_220 paper 68
## 6 kelly psych_220 attendance 84
## 7 mike psych_100 exam 83
## 8 mike psych_100 paper 87
## 9 mike psych_100 attendance 81
## 10 cindy psych_100 exam 91
## 11 cindy psych_100 paper 65
## 12 cindy psych_100 attendance 73
## 13 james psych_300 exam 69
## 14 james psych_300 paper 69
## 15 james psych_300 attendance 65
At this stage, we should note that pivoting neither created nor destroyed any data: the 15 rows of t1 hold exactly the same scores as the 15 cells in t0’s three value columns. But most importantly, the “long” version of these data (t1) indexes each score by student, class, and item, which is far handier for grouped work than t0.
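As a quick illustration (a minimal sketch, using only the columns defined above), one grouped summary now covers every assessment item at once:
t1 %>%
  group_by(name) %>%
  summarize(mean_score = mean(score))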
We can also preview the pivot_wider() function, to compare students’ performance in psych_100:
t2 <- t1 %>%
  filter(
    class %>%
      equals("psych_100")   # magrittr alias for ==
  ) %>%
  pivot_wider(
    names_from = name,    # new columns named for each student
    values_from = score   # filled with that student's scores
  )
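Working the values out by hand from t0, we should expect one row per item and a column per student, something like:
t2
## # A tibble: 3 × 5
##   class     item       kelly  mike cindy
##   <chr>     <chr>      <dbl> <dbl> <dbl>
## 1 psych_100 exam          85    83    91
## 2 psych_100 paper         90    87    65
## 3 psych_100 attendance    71    81    73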
We’ll get to these arguments in a moment, but notice how a wide table makes it quick to compare the three students, item by item.
We can borrow some data from tidyr to demonstrate the core functionality of pivot_longer().
data("relig_income")
head(relig_income)
## # A tibble: 6 × 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t kn… 15 14 15 11 10 35 21
## 6 Evangeli… 575 869 1064 982 881 1486 949
## # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
## # `Don't know/refused` <dbl>
First things first: whoa, what a mess. All those $s actually in the variable names. Anyway, that’s life in the big city.
The survey comes from the Pew Research Center, and the table relig_income is just a weighted contingency count of household income by religious affiliation.
Imagine we want to know:
“Is there an association between atheist identification and income? Specifically, is the share of atheists with incomes at or below $50,000/annum the same as the share with higher incomes?”
The weirdness of the underlying table makes this a non-obvious challenge.
Do we compute the atheist share separately for each income group, and then compute an overall weighted mean?
Do we compute a mean income for each religious affiliation?
No, we do not do these silly things. We will pivot_longer().
Let’s do this step by step:
relig_income %>%
pivot_longer(
cols = `<$10k`:`>150k`,
names_to = "income",
values_to = "inc_n"
)
## # A tibble: 162 × 4
## religion `Don't know/refused` income inc_n
## <chr> <dbl> <chr> <dbl>
## 1 Agnostic 96 <$10k 27
## 2 Agnostic 96 $10-20k 34
## 3 Agnostic 96 $20-30k 60
## 4 Agnostic 96 $30-40k 81
## 5 Agnostic 96 $40-50k 76
## 6 Agnostic 96 $50-75k 137
## 7 Agnostic 96 $75-100k 122
## 8 Agnostic 96 $100-150k 109
## 9 Agnostic 96 >150k 84
## 10 Atheist 76 <$10k 12
## # ℹ 152 more rows
Whoa! I didn’t mean to leave Don't know/refused in there, but maybe it will be pedagogically instructive?
What did we do? We told pivot_longer() which columns to stack (cols), what to name the new column holding the old column names (names_to), and what to name the column holding their values (values_to).
Maybe the most important thing: why is Don't know/refused repeated? It’s because pivoting doesn’t create or destroy data. It’s just re-indexing the one value that used to sit in the Agnostic row.
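We can check that conservation claim directly: 18 religions times 9 pivoted income columns should yield 162 rows, which matches the tibble header above.
relig_income %>%
  nrow
## 18

relig_income %>%
  pivot_longer(
    cols = `<$10k`:`>150k`,
    names_to = "income",
    values_to = "inc_n"
  ) %>%
  nrow
## 162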
But already, in its long form, this is way more tractable for next steps. Let’s fix our DK fuckup and proceed.
relig_income %>%
pivot_longer(
cols = `<$10k`:`Don't know/refused`,
names_to = "income",
values_to = "inc_n"
) %>%
mutate(
inc2 = income %>%
case_match(
c("<$10k",
"$10-20k",
"$20-30k",
"$30-40k",
"$40-50k") ~ "<$50k",
"Don't know/refused" ~ "DK/Ref",
.default = ">$50k"
) %>%
fct_inorder
) %>%
group_by(
religion, inc2,
) %>%
tally(wt = inc_n)
## # A tibble: 54 × 3
## # Groups: religion [18]
## religion inc2 n
## <chr> <fct> <dbl>
## 1 Agnostic <$50k 278
## 2 Agnostic >$50k 452
## 3 Agnostic DK/Ref 96
## 4 Atheist <$50k 163
## 5 Atheist >$50k 276
## 6 Atheist DK/Ref 76
## 7 Buddhist <$50k 145
## 8 Buddhist >$50k 212
## 9 Buddhist DK/Ref 54
## 10 Catholic <$50k 3075
## # ℹ 44 more rows
Whoa, our old friends from dplyr! Excuse me for the gratuitous use of case_match, but I needed a nice indicator to simplify the income groups.
rel2 <- relig_income %>%
pivot_longer(
cols = `<$10k`:`Don't know/refused`,
names_to = "income",
values_to = "inc_n"
) %>%
mutate(
inc2 = income %>%
case_match(
c("<$10k",
"$10-20k",
"$20-30k",
"$30-40k",
"$40-50k") ~ "<$50k",
"Don't know/refused" ~ "DK/Ref",
.default = ">$50k"
) %>%
fct_inorder
) %>%
group_by(
religion, inc2,
) %>%
tally(wt = inc_n) %>%
mutate(
perc = n %>%
divide_by(n %>% sum) %>%
multiply_by(100)
) %>%
filter(
inc2 == "<$50k"
) %>%
arrange(desc(perc))
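With rel2 in hand, the original atheist question is settled by a single row. From the counts above, atheists split 163 (<$50k) to 276 (>$50k), with 76 DK/Ref, so the at-or-below-$50k share is 163/515 ≈ 31.7%, against roughly 53.6% above: the two shares are clearly not the same. A one-line check (rel2 as built above):
rel2 %>%
  filter(religion == "Atheist")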
Pivoting also helps us make comparisons. Consider another dataset bundled with tidyr, us_rent_income:
us_rent_income
## # A tibble: 104 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama income 24476 136
## 2 01 Alabama rent 747 3
## 3 02 Alaska income 32940 508
## 4 02 Alaska rent 1200 13
## 5 04 Arizona income 27517 148
## 6 04 Arizona rent 972 4
## 7 05 Arkansas income 23789 165
## 8 05 Arkansas rent 709 5
## 9 06 California income 29454 109
## 10 06 California rent 1358 3
## # ℹ 94 more rows
To pivot wider, we only need to tell pivot_wider() which variable defines the new variable names, and which defines the new variables’ values.
us_rent_wide <- us_rent_income %>%
select(-moe) %>%
pivot_wider(
names_from = variable,
values_from = estimate
)
us_rent_wide
## # A tibble: 52 × 4
## GEOID NAME income rent
## <chr> <chr> <dbl> <dbl>
## 1 01 Alabama 24476 747
## 2 02 Alaska 32940 1200
## 3 04 Arizona 27517 972
## 4 05 Arkansas 23789 709
## 5 06 California 29454 1358
## 6 08 Colorado 32401 1125
## 7 09 Connecticut 35326 1123
## 8 10 Delaware 31560 1076
## 9 11 District of Columbia 43198 1424
## 10 12 Florida 25952 1077
## # ℹ 42 more rows
And we can use this table to figure out which states have the biggest and smallest surplus income estimates (loosely speaking: income here is an annual estimate while rent is monthly, so treat the difference as illustrative):
us_rent_wide %>%
na.omit %>%
mutate(
surp_inc = income %>%
subtract(
rent
)
) %>%
arrange(surp_inc) %>%
slice(
c(1:5, 47:51)
)
## # A tibble: 10 × 5
## GEOID NAME income rent surp_inc
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 28 Mississippi 22766 740 22026
## 2 54 West Virginia 23707 681 23026
## 3 05 Arkansas 23789 709 23080
## 4 35 New Mexico 24457 809 23648
## 5 01 Alabama 24476 747 23729
## 6 25 Massachusetts 34498 1173 33325
## 7 34 New Jersey 35075 1249 33826
## 8 09 Connecticut 35326 1123 34203
## 9 24 Maryland 37147 1311 35836
## 10 11 District of Columbia 43198 1424 41774
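If we wanted the units to agree, here’s a sketch of one alternative: annualize rent (times 12) and compute it as a share of income, using the same magrittr aliases as above.
us_rent_wide %>%
  na.omit %>%
  mutate(
    rent_share = rent %>%
      multiply_by(12) %>%    # monthly rent -> annual rent
      divide_by(income)      # as a share of annual income
  ) %>%
  arrange(desc(rent_share))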
We’re going to practice reshaping some ANES data: a subset of the 2020 ANES.
There are three question types which underpin these 9 variables:
libcon: We hear a lot of talk these days about liberals and conservatives. Here is a seven-point scale on which the political views that people might hold are arranged from extremely liberal to extremely conservative. Where would you place yourself on this scale, or haven't you thought much about this?
govspend: Some people think the government should provide fewer services even in areas such as health and education in order to reduce spending. Suppose these people are at one end of a scale, at point 7. Other people feel it is important for the government to provide many more services even if it means an increase in spending. Suppose these people are at the other end, at point 1. And, of course, some other people have opinions somewhere in between, at points 2, 3, 4, 5 or 6.
defspend: Some people believe that we should spend much less money for defense. Suppose these people are at one end of a scale, at point 1. Others feel that defense spending should be greatly increased. Suppose these people are at the other end, at point 7. And, of course, some other people have opinions somewhere in between, at points 2, 3, 4, 5 or 6. Where would you place yourself on this scale, or haven’t you thought much about this?
Respondents located themselves, and the Democratic and Republican presidential candidates, on each of the three scales above.
We can load these data here:
a1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/anes20_scales_wide.rds" %>%
url %>%
readRDS
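Before reshaping, it’s worth a quick structural peek (output omitted here; the columns should include the nine self and candidate placement variables described above):
a1 %>%
  glimpse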