The Big Picture

Our primary focus in this Section is on the Tidy phase.

Our primary focus in this Section is on the Tidy phase.

Pivoting Data

Pivoting

  • Pivoting involves transforming data from to “narrower” or “wider” formats.
  • The data starts as tidy, and is transformed to another tidy form.
  • So it’s really transformation, but we call it “tidy-ing.”

The important functions come from the tidyr package:

  • pivot_longer()
  • pivot_wider()

Gathering (“Pivoting Longer”)

Hypothetical Data

Create this data table for an imaginary study on the effects of two days of fasting.

Fasting <- data.frame(name = c("April", "Bishma", "Carl"),
                      before = c(120, 175, 182),
                      after = c(115, 167, 178))

The Fasting Table

name before after
April 120 115
Bishma 175 167
Carl 182 178
  • Cases are individual people.
  • Variables are:
    • name of person
    • weight at start of fast
    • weight at end of fast

Wide Form

Fasting is in “wide” form: few cases, more variables for each case.

This is convenient for some purposes, for example:

How much did the weights change?

Fasting %>%
  mutate(diff = after - before)

The Result

##     name before after diff
## 1  April    120   115   -5
## 2 Bishma    175   167   -8
## 3   Carl    182   178   -4

Drawbacks: Difficult Target Graph

This graph is difficult to make from the wide data table.

This graph is difficult to make from the wide data table.

Solution: Gather the Data!

We will use the tidyr package (loaded when we attach the tidyverse).

Use tidyr’s pivot_longer() data verb:

Fasting_narrow <-
  Fasting %>%
  pivot_longer(cols = -name, names_to = "when", values_to = "weight")

The Result

## # A tibble: 6 × 3
##   name   when   weight
##   <chr>  <chr>   <dbl>
## 1 April  before    120
## 2 April  after     115
## 3 Bishma before    175
## 4 Bishma after     167
## 5 Carl   before    182
## 6 Carl   after     178

Remarks

In pivot_longer(cols = -name, names_to = "when", values_to = "weight"):

  • the first input is data = Fasting (provided by %>% in our code)
  • cols tells you which columns to gather together
  • names_to says what to call the column that will say whether the weight is a “before” or “after” weight
  • values_to says what to call the column that will hold the weights

Another Way

You can also list out the columns you want to gather:

Fasting %>%
  pivot_longer(
    cols = c(before, after), 
    names_to = "when", 
    values_to = "weight"
  )

## # A tibble: 6 × 3
##   name   when   weight
##   <chr>  <chr>   <dbl>
## 1 April  before    120
## 2 April  after     115
## 3 Bishma before    175
## 4 Bishma after     167
## 5 Carl   before    182
## 6 Carl   after     178

Narrow Form

Fasting_narrow is in “narrow” form: there are more cases.

Each case is now a single act of weighing a person.

Code for the Graph: Try #1

ggplot(Fasting_narrow, aes(x = when, y = weight)) +
    geom_point(aes(color = name)) +
    geom_line(aes(color = name))

Ooops, No Lines!

Diagnosis

Fasting_narrow %>%
  str()
## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
##  $ name  : chr [1:6] "April" "April" "Bishma" "Bishma" ...
##  $ when  : chr [1:6] "before" "after" "before" "after" ...
##  $ weight: num [1:6] 120 115 175 167 182 178
  • when is a categorical variable.
  • If the x-aesthetic is categorical, geom_line() does not know which groups to connect.

Solution

ggplot() accepts a “group” aesthetic. Set it to name.

ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
    geom_point(aes(color = name)) +
    geom_line(aes(color = name))

The Result

We need better labels for the axes.

We need better labels for the axes.

Axis-Labels

ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
    geom_point(aes(color = name)) +
    geom_line(aes(color = name)) +
    labs(x = "time when weight was recorded",
         y = "weight (pounds)")

The Result

Another Option: Flipping Coordinates

Just so you know that it can be done …

ggplot(Fasting_narrow, aes(x = when, y = weight, group = name)) +
    geom_point(aes(color = name)) +
    geom_line(aes(color = name)) +
    labs(x = "time when weight was recorded",
         y = "weight (pounds)") +
  coord_flip()

The Result

Spreading (“Pivoting Wider”)

How to Spread Data

Spreading converts data from narrow to wide form.

Fasting_narrow %>%
  pivot_wider(names_from = when, values_from = weight)
## # A tibble: 3 × 3
##   name   before after
##   <chr>   <dbl> <dbl>
## 1 April     120   115
## 2 Bishma    175   167
## 3 Carl      182   178

This takes you back to the original Fasting data.

Naming Options

Sometimes you want to help the reader understand what the new variables mean. You can alter their names a bit with the names_prefix parameter:

Fasting_narrow %>%
  pivot_wider(
    names_from = when, 
    values_from = weight, 
    names_prefix = "weight_"
  )
## # A tibble: 3 × 3
##   name   weight_before weight_after
##   <chr>          <dbl>        <dbl>
## 1 April            120          115
## 2 Bishma           175          167
## 3 Carl             182          178

babynames Again

Research Question:

Which common babynames since the year 2000 are the most gender-neutral?

Let’s get the total babies born since 2000, using only names common among both sexes

RecentBabies <-
  babynames %>%
  filter(year >= 2000) %>%
  group_by(name, sex) %>%
  summarise(total = sum(n))

The Result

name sex total
Aaban M 107
Aabha F 35
Aabid M 10
Aabir M 5
Aabriella F 32


It would be easier to have male and female counts side-by-side.

Spread!

RecentBabiesWide <-
  RecentBabies %>%
  pivot_wider(names_from = sex, values_from = total)

The Result

name M F
Aaban 107 NA
Aabha NA 35
Aabid 10 NA
Aabir 5 NA
Aabriella NA 32


We get NA when there were no babies for a given name and sex. We would prefer to have counts of 0.

Solution

Use the values_fill parameter:.

RecentBabiesWide <-
  RecentBabies %>%
  pivot_wider(
    names_from = sex, 
    values_from = total,
    values_fill = list(total = 0)
  )

The Result

name M F
Aaban 107 0
Aabha 0 35
Aabid 10 0
Aabir 5 0
Aabriella 0 32


Much better.

Names Common Among Both Sexes

Let’s consider only names where at least 1000 babies of each sex have that name.

Common <-
  RecentBabiesWide %>%
  filter(M > 1000, F > 1000)

Measuring Gender-Neutrality

  • There are about the same number of boys and girls born each year.
  • So a name is “gender-neutral” when the number boys with that name is “about the same” as the number of girls with that name.
  • Let \(F = \text{number of girls with the name}\).
  • Let \(M = \text{number of boys with the name}\).
  • Compute:

\[\min(\frac{M}{F}, \frac{F}{M})\]

The closer this is to 1, the more gender-neutral the name is!

Example: Leslie

Leslie <-
  Common %>%
  filter(name == "Leslie")
name M F
Leslie 1359 39613


Use R to compute \(\min(M/F, F/M)\). What do you get?

Example: Robin

Robin <-
  Common %>%
  filter(name == "Robin")
name M F
Robin 2466 4223


  • Use R to compute \(\min(M/F, F/M)\). What do you get?
  • Which name (Leslie, Robin) is more gender-neutral?

All Names at Once

Common %>%
  mutate(gnMeasure = pmin(M / F, F / M)) %>%
  arrange(desc(gnMeasure))
name M F gnMeasure
Gentry 1224 1215 0.9926471
Justice 11267 10947 0.9715985
Baby 1639 1573 0.9597315
Jules 1166 1215 0.9596708
Marion 2092 2189 0.9556875

Note on pmin()

numbers1 <- c(8, 10, 3, -2, 0)
numbers2 <- c(7, 11, 3, -5, 4)
pmin(numbers1, numbers2)
## [1]  7 10  3 -5  0

pmin() finds the minimum value of each corresponding pair of numbers.

M / F

  • M / F gave the male-to-female ratio of counts, for each name.
  • M / F gave the female-to-male ratio of counts, for each name.
  • pmin(M /f, F / M) found the minimum ratio for each name.

Here’s a more detailed look:

Common %>%
  mutate(MtoF = M / F, FtoM = F / M) %>%
  mutate(gnMeasure = pmin(MtoF, FtoM)) %>%
  select(name, MtoF, FtoM, gnMeasure)

Result

name MtoF FtoM gnMeasure
Addison 0.0451761 22.1355765 0.0451761
Adrian 43.2266523 0.0231339 0.0231339
Aidan 52.8602784 0.0189178 0.0189178
Aiden 114.3429923 0.0087456 0.0087456
Alex 26.3813927 0.0379055 0.0379055
Alexis 0.2250332 4.4437892 0.2250332
Ali 3.4644784 0.2886437 0.2886437

Repeated Measures

Repeated-Measure Studies

Fasting
##     name before after
## 1  April    120   115
## 2 Bishma    175   167
## 3   Carl    182   178

This hypothetical study was a repeated-measure study. Each subject was weighed more than once.

Narrow format is especially convenient when working with repeated-measure data.

The Labels Experiment

Access and learn about it:

data("labels", package = "tigerstats")
?tigerstats::labels

Research Questions:

  • Do people rate a product more highly when they believe it is more expensive?
  • Are men and women differently affected by the difference in labels?

The Data (Wide)

head(labels, n = 6)
##   jiffrating greatvaluerating    sex
## 1          8                5 female
## 2         10                7 female
## 3          8                6 female
## 4          7                5 female
## 5          9                5 female
## 6          8                9 female

Target Graph

Gather (First Try)

labels_narrow <-
  labels %>%
  pivot_longer(cols = c(jiffrating, greatvaluerating), 
               names_to = "which",
               values_to = "rating") %>%
  mutate(better = recode(which, 'jiffrating' = 'jiff', 
                         'greatvaluerating' = 'greatvalue'))

The Result

sex which rating better
female jiffrating 8 jiff
female greatvaluerating 5 greatvalue
female jiffrating 10 jiff
female greatvaluerating 7 greatvalue
female jiffrating 8 jiff


Oops! No way to tell which two cases go with a given person.

Gather (Second Try)

Add an “id” variable:

labels_narrow <-
  labels %>%
  mutate(id = 1:nrow(labels)) %>%
  pivot_longer(cols = c(jiffrating, greatvaluerating), 
               names_to = "which",
               values_to = "rating") %>%
  mutate(better = recode(which, 'jiffrating' = 'jiff', 
                         'greatvaluerating' = 'greatvalue')) %>% 
  arrange(id)

The Result

Practice

Replace the VARs below with the choices that will produce the target graph.

ggplot(labels_narrow, aes(x = VAR1, y = VAR2, group = VAR3)) +
  geom_point() +
  geom_line() +
  facet_grid(. ~ VAR4) +
  labs(x = "label on jar")

Chick Weights

?ChickWeight
View(ChickWeight)

Research Question:

Which diet produces the most growth, over time?

Target Graph

Mean weights of each diet-group, over time.

Mean weights of each diet-group, over time.

Practice

Fill in VARs to make the target graph.

ChickWeight %>%
  group_by(VAR1, VAR2) %>%
  summarize(meanWeight = mean(VAR3, na.rm = T)) %>%
  ggplot(aes(x = VAR4, y = VAR5)) +
    geom_point(aes(color = VAR6), size = 3) +
    geom_line(aes(color = VAR7))