Tidying data

Introduction

Tidy datasets are all alike, but every messy dataset is messy in its own way.

— Hadley Wickham

The Tidyverse revolves around tidy data. Data can be represented in several ways, but to be tidy it must have the following properties:

Each variable corresponds to exactly one column.
Each observation corresponds to exactly one row.
Each value corresponds to exactly one cell.

Since these three rules are related, they can be simplified to two practical rules:

Put each dataset in a tibble.
Put each variable in a column.

This is all we need to know for this workshop, but if you are interested in the details make sure to check out this article.

If a dataset is already tidy, operations using Tidyverse packages are easy and straightforward. If this is not the case, the tidyr package provides useful functions to transform a messy dataset into a tidy one.

Here are some examples for tidy and messy data frames representing the same dataset consisting of four variables country, year, population, and cases. These examples are available in the tidyr package, so let’s activate it first:

library(tidyr)

table1

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

table3

# A tibble: 6 × 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

table4b

# A tibble: 3 × 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

All data frames represent exactly the same data, but only table1 is tidy. Let’s determine how each of the messy data frames are not tidy:

table2 contains values of two variables (cases and population) in its last column; this violates the rule that each variable must have its own column.
table3 combines two variables (cases and population) in one column (rate). In addition, the rate is represented as a character column, which makes it impossible to directly perform numeric calculations.
table4a and table4b are spread across two data frames although they each contain a variable for the same observations.

Here are some examples how easy it is to work with table1:

library(dplyr)

# compute rate per 10000
table1 |>
    mutate(rate=cases / population * 10000)

# A tibble: 6 × 5
  country      year  cases population  rate
  <chr>       <int>  <int>      <int> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67

# compute cases per year
table1 |>
    count(year, wt=cases)

# A tibble: 2 × 2
   year      n
  <int>  <int>
1  1999 250740
2  2000 296920

# plot changes over time
library(ggplot2)

ggplot(data=table1, mapping=aes(x=year, y=cases, color=country)) +
    geom_line() +
    geom_point()

Pivoting

There are two main things to consider if a dataset is not tidy. First, it is important to determine what the variables and observations are. This sounds easier as it might be in practice, so sometimes it is necessary to do a fair amount of research before completing this step. Second, we need to make sure that the rules of tidy data apply. This means that each variable should correspond to one column, and each observation should correspond to one row. If this is not the case, we can pivot the data so that it fulfills these requirements. In most cases, we only need two functions from the tidyr package, namely pivot_longer() and pivot_wider().

Longer

A variable is often spread across two or more columns. This means that column titles actually contain values of a variable (instead of variable names). For example, table4a (and table4b) contains two columns named `1999` and `2000`:

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

These column titles are actually two values of the variable year. The values in these columns are values of the variable cases, and therefore one row corresponds to two observations. We can tidy this representation by pivoting the two columns into one using pivot_longer(). To do this, we need to specify three pieces of information:

The columns which contain values instead of names (`1999` and `2000` in this case).
A (new) variable name to move the column names to (we choose year in this example).
A (new) variable name to move the values to (here we choose cases).

table4a |>
    pivot_longer(c(`1999`, `2000`), names_to="year", values_to="cases")

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Note that because these two columns contain names that would be a syntax error (names must not start with a digit), we have to enclose them with backticks. Furthermore, the type of the year column is character, because this is the default behavior for columns created with the names_to argument. If we wanted an integer column, we could use the names_transform argument:

tidy4a = table4a |>
    pivot_longer(
        c(`1999`, `2000`),
        names_to="year",
        values_to="cases",
        names_transform=list(year=as.integer)
    )

The resulting data frame is now tidy. We can perform the same pivoting for table4b containing data on the population variable:

tidy4b = table4b |>
    pivot_longer(
        c(`1999`, `2000`),
        names_to="year",
        values_to="population",
        names_transform=list(year=as.integer)
    )
tidy4b

# A tibble: 6 × 3
  country      year population
  <chr>       <int>      <int>
1 Afghanistan  1999   19987071
2 Afghanistan  2000   20595360
3 Brazil       1999  172006362
4 Brazil       2000  174504898
5 China        1999 1272915272
6 China        2000 1280428583

The function is called pivot_longer() because it generally results in a data frame consisting of more rows and/or fewer columns than the original one (so the data frame gets “longer”).

Finally, we would like to combine tidy4a and tidy4b to a single data frame. We will not cover how to do that (you can read about it in the relational data chapter in R for Data Science), but just provide the relevant dplyr function call:

left_join(tidy4a, tidy4b)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Wider

The opposite operation to pivot_longer() is pivot_wider(). It is useful in the case where one observation is represented in multiple rows, so our goal is to put the values into separate columns. In general, the resulting data frame is wider because it consists of more columns and/or fewer rows.

Here’s an example of a data frame with multiple rows per observation:

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

To separate rows into columns, we pass two arguments to pivot_wider(), namely the column containing the names of the variables and the column containing the values:

table2 |>
    pivot_wider(names_from=type, values_from=count)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Exercises

See https://r4ds.had.co.nz/tidy-data.html#exercises-24.

Separating and uniting

TODO