Tidy datasets are all alike, but every messy dataset is messy in its own way.
— Hadley Wickham
The Tidyverse revolves around tidy data. Data can be represented in several ways, but to be tidy it must have the following properties:
Since these three rules are related, they can be simplified to two practical rules:
This is all we need to know for this workshop, but if you are interested in the details make sure to check out this article.
If a dataset is already tidy, operations using Tidyverse packages are easy and straightforward. If this is not the case, the tidyr package provides useful functions to transform a messy dataset into a tidy one.
Here are some examples for tidy and messy data frames representing the same dataset consisting of four variables country, year, population, and cases. These examples are available in the tidyr package, so let’s activate it first:
library(tidyr)
table1
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2
# A tibble: 12 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table4a
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4b
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
All data frames represent exactly the same data, but only
table1 is tidy. Let’s determine how each of the messy data
frames are not tidy:
table2 contains values of two variables
(cases and population) in its last column;
this violates the rule that each variable must have its own column.table3 combines two variables (cases and
population) in one column (rate). In addition,
the rate is represented as a character column, which makes it impossible
to directly perform numeric calculations.table4a and table4b are spread across two
data frames although they each contain a variable for the same
observations.Here are some examples how easy it is to work with
table1:
library(dplyr)
# compute rate per 10000
table1 |>
mutate(rate=cases / population * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <int> <int> <int> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
# compute cases per year
table1 |>
count(year, wt=cases)
# A tibble: 2 × 2
year n
<int> <int>
1 1999 250740
2 2000 296920
# plot changes over time
library(ggplot2)
ggplot(data=table1, mapping=aes(x=year, y=cases, color=country)) +
geom_line() +
geom_point()
There are two main things to consider if a dataset is not tidy.
First, it is important to determine what the variables and observations
are. This sounds easier as it might be in practice, so sometimes it is
necessary to do a fair amount of research before completing this step.
Second, we need to make sure that the rules of tidy data apply. This
means that each variable should correspond to one column, and each
observation should correspond to one row. If this is not the case, we
can pivot the data so that it fulfills these requirements. In most
cases, we only need two functions from the tidyr package,
namely pivot_longer() and pivot_wider().
A variable is often spread across two or more columns. This means
that column titles actually contain values of a variable (instead of
variable names). For example, table4a (and
table4b) contains two columns named `1999` and
`2000`:
table4a
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
These column titles are actually two values of the variable
year. The values in these columns are values of the
variable cases, and therefore one row corresponds to two
observations. We can tidy this representation by pivoting the two
columns into one using pivot_longer(). To do this, we need
to specify three pieces of information:
`1999` and `2000` in this case).year in this example).cases).table4a |>
pivot_longer(c(`1999`, `2000`), names_to="year", values_to="cases")
# A tibble: 6 × 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
Note that because these two columns contain names that would be a
syntax error (names must not start with a digit), we have to enclose
them with backticks. Furthermore, the type of the year
column is character, because this is the default behavior for columns
created with the names_to argument. If we wanted an integer
column, we could use the names_transform argument:
tidy4a = table4a |>
pivot_longer(
c(`1999`, `2000`),
names_to="year",
values_to="cases",
names_transform=list(year=as.integer)
)
The resulting data frame is now tidy. We can perform the same
pivoting for table4b containing data on the
population variable:
tidy4b = table4b |>
pivot_longer(
c(`1999`, `2000`),
names_to="year",
values_to="population",
names_transform=list(year=as.integer)
)
tidy4b
# A tibble: 6 × 3
country year population
<chr> <int> <int>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
The function is called pivot_longer() because it
generally results in a data frame consisting of more rows and/or fewer
columns than the original one (so the data frame gets “longer”).
Finally, we would like to combine tidy4a and
tidy4b to a single data frame. We will not cover how to do
that (you can read about it in the relational
data chapter in R for Data Science), but just provide the relevant
dplyr function call:
left_join(tidy4a, tidy4b)
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
The opposite operation to pivot_longer() is
pivot_wider(). It is useful in the case where one
observation is represented in multiple rows, so our goal is to put the
values into separate columns. In general, the resulting data frame is
wider because it consists of more columns and/or fewer rows.
Here’s an example of a data frame with multiple rows per observation:
table2
# A tibble: 12 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
To separate rows into columns, we pass two arguments to
pivot_wider(), namely the column containing the names of
the variables and the column containing the values:
table2 |>
pivot_wider(names_from=type, values_from=count)
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583