Tidy Data

Author

Jamal Rogers

Published

May 12, 2023

We shall use the built-in datasets from the tidyr package.

library(tidyverse)

There are three interrelated rules that make a dataset tidy:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

Why ensure your data is tidy?

There are two main advantages:

There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with table1.

Compute rate per 10,000

table1 |>
  mutate(rate = cases / population * 10000)

# A tibble: 6 × 5
  country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67

Compute total cases per year

table1 |> 
  group_by(year) |> 
  summarize(total_cases = sum(cases))

# A tibble: 2 × 2
   year total_cases
  <dbl>       <dbl>
1  1999      250740
2  2000      296920

Visualize changes over time

ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000