Tidy Data: Pivoting

The Big Picture

Our primary focus in this Section is on the Tidy phase.

Pivoting Data

Pivoting

Pivoting involves transforming data from to “narrower” or “wider” formats.
The data starts as tidy, and is transformed to another tidy form.
So it’s really transformation, but we call it “tidy-ing.”

The important functions come from the tidyr package:

pivot_longer()
pivot_wider()

Gathering (“Pivoting Longer”)

Hypothetical Data

Create this data table for an imaginary study on the effects of two days of fasting.

Fasting <- data.frame(name = c("April", "Bishma", "Carl"),
                      before = c(120, 175, 182),
                      after = c(115, 167, 178))

The `Fasting` Table

name	before	after
April	120	115
Bishma	175	167
Carl	182	178

Cases are individual people.
Variables are:
- name of person
- weight at start of fast
- weight at end of fast

Wide Form

Fasting is in “wide” form: few cases, more variables for each case.

This is convenient for some purposes, for example:

How much did the weights change?

Fasting %>%
  mutate(diff = after - before)

The Result

##     name before after diff
## 1  April    120   115   -5
## 2 Bishma    175   167   -8
## 3   Carl    182   178   -4

Drawbacks: Difficult Target Graph

This graph is difficult to make from the wide data table.

Solution: Gather the Data!

We will use the tidyr package (loaded when we attach the tidyverse).

Use tidyr’s pivot_longer() data verb:

Fasting_narrow <-
  Fasting %>%
  pivot_longer(cols = -name, names_to = "when", values_to = "weight")

The Result

## # A tibble: 6 × 3
##   name   when   weight
##   <chr>  <chr>   <dbl>
## 1 April  before    120
## 2 April  after     115
## 3 Bishma before    175
## 4 Bishma after     167
## 5 Carl   before    182
## 6 Carl   after     178

Remarks

In pivot_longer(cols = -name, names_to = "when", values_to = "weight"):

the first input is data = Fasting (provided by %>% in our code)
cols tells you which columns to gather together
names_to says what to call the column that will say whether the weight is a “before” or “after” weight
values_to says what to call the column that will hold the weights

Another Way

You can also list out the columns you want to gather:

Fasting %>%
  pivot_longer(
    cols = c(before, after), 
    names_to = "when", 
    values_to = "weight"
  )

## # A tibble: 6 × 3
##   name   when   weight
##   <chr>  <chr>   <dbl>
## 1 April  before    120
## 2 April  after     115
## 3 Bishma before    175
## 4 Bishma after     167
## 5 Carl   before    182
## 6 Carl   after     178

Narrow Form

Fasting_narrow is in “narrow” form: there are more cases.

Each case is now a single act of weighing a person.

Code for the Graph: Try #1

ggplot(Fasting_narrow, aes(x = when, y = weight)) +
    geom_point(aes(color = name)) +
    geom_line(aes(color = name))

Ooops, No Lines!

Diagnosis

Fasting_narrow %>%
  str()

## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
##  $ name  : chr [1:6] "April" "April" "Bishma" "Bishma" ...
##  $ when  : chr [1:6] "before" "after" "before" "after" ...
##  $ weight: num [1:6] 120 115 175 167 182 178

when is a categorical variable.
If the x-aesthetic is categorical, geom_line() does not know which groups to connect.

Solution

ggplot() accepts a “group” aesthetic. Set it to name.

ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
    geom_point(aes(color = name)) +
    geom_line(aes(color = name))

The Result

We need better labels for the axes.

Axis-Labels

ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
    geom_point(aes(color = name)) +
    geom_line(aes(color = name)) +
    labs(x = "time when weight was recorded",
         y = "weight (pounds)")

The Result

Another Option: Flipping Coordinates

Just so you know that it can be done …

ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
    geom_point(aes(color = name)) +
    geom_line(aes(color = name)) +
    labs(x = "time when weight was recorded",
         y = "weight (pounds)") +
  coord_flip()

The Result

Spreading (“Pivoting Wider”)

How to Spread Data

Spreading converts data from narrow to wide form.

Fasting_narrow %>%
  pivot_wider(names_from = when, values_from = weight)

## # A tibble: 3 × 3
##   name   before after
##   <chr>   <dbl> <dbl>
## 1 April     120   115
## 2 Bishma    175   167
## 3 Carl      182   178

This takes you back to the original Fasting data.

Naming Options

Sometimes you want to help the reader understand what the new variables mean. You can alter their names a bit with the names_prefix parameter:

Fasting_narrow %>%
  pivot_wider(
    names_from = when, 
    values_from = weight, 
    names_prefix = "weight_"
  )

## # A tibble: 3 × 3
##   name   weight_before weight_after
##   <chr>          <dbl>        <dbl>
## 1 April            120          115
## 2 Bishma           175          167
## 3 Carl             182          178

`babynames` Again

Research Question:

Which common babynames since the year 2000 are the most gender-neutral?

Let’s get the total babies born since 2000, using only names common among both sexes

RecentBabies <-
  babynames %>%
  filter(year >= 2000) %>%
  group_by(name, sex) %>%
  summarise(total = sum(n))

The Result

name	sex	total
Aaban	M	107
Aabha	F	35
Aabid	M	10
Aabir	M	5
Aabriella	F	32

It would be easier to have male and female counts side-by-side.

Spread!

RecentBabiesWide <-
  RecentBabies %>%
  pivot_wider(names_from = sex, values_from = total)

The Result

name	M	F
Aaban	107	NA
Aabha	NA	35
Aabid	10	NA
Aabir	5	NA
Aabriella	NA	32

We get NA when there were no babies for a given name and sex. We would prefer to have counts of 0.

Solution

Use the values_fill parameter:.

RecentBabiesWide <-
  RecentBabies %>%
  pivot_wider(
    names_from = sex, 
    values_from = total,
    values_fill = list(total = 0)
  )

The Result

name	M	F
Aaban	107	0
Aabha	0	35
Aabid	10	0
Aabir	5	0
Aabriella	0	32

Much better.

Names Common Among Both Sexes

Let’s consider only names where at least 1000 babies of each sex have that name.

Common <-
  RecentBabiesWide %>%
  filter(M > 1000, F > 1000)

Measuring Gender-Neutrality

There are about the same number of boys and girls born each year.
So a name is “gender-neutral” when the number boys with that name is “about the same” as the number of girls with that name.
Let \(F = \text{number of girls with the name}\).
Let \(M = \text{number of boys with the name}\).
Compute:

\[\min(\frac{M}{F}, \frac{F}{M})\]

The closer this is to 1, the more gender-neutral the name is!

Example: Leslie

Leslie <-
  Common %>%
  filter(name == "Leslie")

name	M	F
Leslie	1359	39613

Use R to compute \(\min(M/F, F/M)\). What do you get?

Example: Robin

Robin <-
  Common %>%
  filter(name == "Robin")

name	M	F
Robin	2466	4223

Use R to compute \(\min(M/F, F/M)\). What do you get?
Which name (Leslie, Robin) is more gender-neutral?

All Names at Once

Common %>%
  mutate(gnMeasure = pmin(M / F, F / M)) %>%
  arrange(desc(gnMeasure))

name	M	F	gnMeasure
Gentry	1224	1215	0.9926471
Justice	11267	10947	0.9715985
Baby	1639	1573	0.9597315
Jules	1166	1215	0.9596708
Marion	2092	2189	0.9556875

Note on `pmin()`

numbers1 <- c(8, 10, 3, -2, 0)
numbers2 <- c(7, 11, 3, -5, 4)
pmin(numbers1, numbers2)

## [1]  7 10  3 -5  0

pmin() finds the minimum value of each corresponding pair of numbers.

`M / F`

M / F gave the male-to-female ratio of counts, for each name.
M / F gave the female-to-male ratio of counts, for each name.
pmin(M /f, F / M) found the minimum ratio for each name.

Here’s a more detailed look:

Common %>%
  mutate(MtoF = M / F, FtoM = F / M) %>%
  mutate(gnMeasure = pmin(MtoF, FtoM)) %>%
  select(name, MtoF, FtoM, gnMeasure)

Result

name	MtoF	FtoM	gnMeasure
Addison	0.0451761	22.1355765	0.0451761
Adrian	43.2266523	0.0231339	0.0231339
Aidan	52.8602784	0.0189178	0.0189178
Aiden	114.3429923	0.0087456	0.0087456
Alex	26.3813927	0.0379055	0.0379055
Alexis	0.2250332	4.4437892	0.2250332
Ali	3.4644784	0.2886437	0.2886437

Repeated Measures

Repeated-Measure Studies

Fasting

##     name before after
## 1  April    120   115
## 2 Bishma    175   167
## 3   Carl    182   178

This hypothetical study was a repeated-measure study. Each subject was weighed more than once.

Narrow format is especially convenient when working with repeated-measure data.

The Labels Experiment

Access and learn about it:

data("labels", package = "tigerstats")

?tigerstats::labels

Research Questions:

Do people rate a product more highly when they believe it is more expensive?
Are men and women differently affected by the difference in labels?

The Data (Wide)

head(labels, n = 6)

##   jiffrating greatvaluerating    sex
## 1          8                5 female
## 2         10                7 female
## 3          8                6 female
## 4          7                5 female
## 5          9                5 female
## 6          8                9 female

Target Graph

Gather (First Try)

labels_narrow <-
  labels %>%
  pivot_longer(cols = c(jiffrating, greatvaluerating), 
               names_to = "which",
               values_to = "rating") %>%
  mutate(better = recode(which, 'jiffrating' = 'jiff', 
                         'greatvaluerating' = 'greatvalue'))

The Result

sex	which	rating	better
female	jiffrating	8	jiff
female	greatvaluerating	5	greatvalue
female	jiffrating	10	jiff
female	greatvaluerating	7	greatvalue
female	jiffrating	8	jiff

Oops! No way to tell which two cases go with a given person.

Gather (Second Try)

Add an “id” variable:

labels_narrow <-
  labels %>%
  mutate(id = 1:nrow(labels)) %>%
  pivot_longer(cols = c(jiffrating, greatvaluerating), 
               names_to = "which",
               values_to = "rating") %>%
  mutate(better = recode(which, 'jiffrating' = 'jiff', 
                         'greatvaluerating' = 'greatvalue')) %>% 
  arrange(id)

The Result

Practice

Replace the VARs below with the choices that will produce the target graph.

ggplot(labels_narrow, aes(x = VAR1, y = VAR2, group = VAR3)) +
  geom_point() +
  geom_line() +
  facet_grid(. ~ VAR4) +
  labs(x = "label on jar")

Chick Weights

?ChickWeight
View(ChickWeight)

Research Question:

Which diet produces the most growth, over time?

Target Graph

Mean weights of each diet-group, over time.

Practice

Fill in VARs to make the target graph.

ChickWeight %>%
  group_by(VAR1, VAR2) %>%
  summarize(meanWeight = mean(VAR3, na.rm = T)) %>%
  ggplot(aes(x = VAR4, y = VAR5)) +
    geom_point(aes(color = VAR6), size = 3) +
    geom_line(aes(color = VAR7))

The Big Picture

Pivoting Data

Pivoting

Gathering (“Pivoting Longer”)

Hypothetical Data

The Fasting Table

Wide Form

The Result

Drawbacks: Difficult Target Graph

Solution: Gather the Data!

The Result

Remarks

Another Way

Narrow Form

Code for the Graph: Try #1

Ooops, No Lines!

Diagnosis

Solution

The Result

Axis-Labels

The Result

Another Option: Flipping Coordinates

The Result

Spreading (“Pivoting Wider”)

How to Spread Data

Naming Options

babynames Again

The Result

Spread!

The Result

Solution

The Result

Names Common Among Both Sexes

Measuring Gender-Neutrality

Example: Leslie

Example: Robin

All Names at Once

Note on pmin()

M / F

Result

Repeated Measures

Repeated-Measure Studies

The Labels Experiment

The Data (Wide)

Target Graph

Gather (First Try)

The Result

Gather (Second Try)

The Result

Practice

Chick Weights

Target Graph

Practice

The `Fasting` Table

`babynames` Again

Note on `pmin()`

`M / F`