Introduction

So far, every example in this book has started with a nice dataset that’s easy to plot. That’s great for learning (because you don’t want to struggle with data handling while you’re learning visualisation), but in real life, datasets hardly ever come in exactly the right structure. To use ggplot2 in practice, you’ll need to learn some data wrangling skills. Indeed, in my experience, visualisation is often the easiest part of the data analysis process: once you have the right data, in the right format, aggregated in the right way, the right visualisation is often obvious.

The goal of this part of the book is to show you how to integrate ggplot2 with other tools needed for a complete data analysis:

  • In this chapter, you’ll learn the principles of tidy data (Wickham 2014), which help you organise your data in a way that makes it easy to visualise with ggplot2, manipulate with dplyr and model with the many modelling packages. The principles of tidy data are supported by the tidyr package, which helps you tidy messy datasets.

  • Most visualisations require some data transformation, whether it’s creating a new variable from existing variables, or performing simple aggregations so you can see the forest for the trees. The next chapter will show you how to do this with the dplyr package.

  • If you’re using R, you’re almost certainly using it for its fantastic modelling capabilities. While there’s an R package for almost every type of model you can think of, the results of these models can be hard to visualise. In modelling, you’ll learn about the broom package, by David Robinson, which converts models into tidy datasets so you can easily visualise them with ggplot2.

Tidy data is the foundation for data manipulation and visualising models. In the following sections, you’ll learn the definition of tidy data, and the tools you need to make messy data tidy. The chapter concludes with two case studies that show how to apply the tools in sequence to work with real(istic) data.

Tidy data

The principle behind tidy data is simple: storing your data in a consistent way makes it easier to work with it. Tidy data is a mapping between the statistical structure of a data frame (variables and observations) and the physical structure (columns and rows). Tidy data follows two main principles:

  1. Variables go in columns.
  2. Observations go in rows.

Tidy data is particularly important for ggplot2 because the job of ggplot2 is to map variables to visual properties: if your data isn’t tidy, you’ll have a hard time visualising it.

Sometimes you’ll find a dataset that you have no idea how to plot. That’s normally because it’s not tidy. For example, take this data frame that contains monthly employment data for the United States:

ec2
#> # A tibble: 12 × 11
#>   month `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014`
#> * <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1    8.6    8.3    9.0   10.7   20.0   21.6   21.0   16.2   15.9
#> 2     2    9.1    8.5    8.7   11.7   19.9   21.1   19.8   17.5   16.2
#> 3     3    8.7    9.1    8.7   12.3   20.4   21.5   19.2   17.7   15.9
#> 4     4    8.4    8.6    9.4   13.1   22.1   20.9   19.1   17.1   15.6
#> 5     5    8.5    8.2    7.9   14.2   22.3   21.6   19.9   17.0   14.5
#> 6     6    7.3    7.7    9.0   17.2   25.2   22.3   20.1   16.6   13.2
#> # ... with 6 more rows, and 1 more variables: `2015` <dbl>

(If it looks familiar, it’s because it’s derived from the economics dataset that we used earlier in the book.)
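In case you’re curious, here’s one way a table like ec2 could be constructed from economics. This is a sketch rather than the exact code: it assumes the values shown are uempmed (median duration of unemployment, in weeks), and it uses spread(), which is introduced later in this chapter.

library(dplyr)
library(tidyr)

# A sketch: pull month and year out of the date, then spread the years
# across the columns. uempmed is an assumption about which value is shown.
ec2_sketch <- ggplot2::economics %>%
  filter(date >= as.Date("2006-01-01"), date <= as.Date("2015-12-31")) %>%
  transmute(
    month = as.numeric(format(date, "%m")),
    year  = format(date, "%Y"),
    value = uempmed
  ) %>%
  spread(key = year, value = value)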

Imagine you want to plot a time series showing how unemployment has changed over the last 10 years. Can you picture the ggplot2 command you’d need to do it? What if you wanted to focus on the seasonal component of unemployment by putting months on the x-axis and drawing one line for each year? It’s difficult to see how to create those plots because the data is not tidy. There are three variables, month, year and unemployment rate, but each variable is stored in a different way:

  • month is stored in a column.
  • year is spread across the column names.
  • rate is the value of each cell.

To make it possible to plot this data we first need to tidy it. There are two important pairs of tools:

  • Spread & gather.
  • Separate & unite.

Spread and gather

Take a look at the two tables below:

x  y   z
a  A   1
b  D   5
c  A   4
c  B  10
d  C   9

x   A   B   C   D
a   1  NA  NA  NA
b  NA  NA  NA   5
c   4  10  NA  NA
d  NA  NA   9  NA

If you study them for a little while, you’ll notice that they contain the same data in different forms. I call the first form indexed data, because you look up a value using an index (the values of the x and y variables). I call the second form Cartesian data, because you find a value by looking at the intersection of a row and a column. We can’t tell if these datasets are tidy or not. Either form could be tidy depending on what the values “A”, “B”, “C”, “D” mean.

(Also note the missing values: missing values that are explicit in one form may be implicit in the other. An NA is the presence of an absence; but sometimes a missing value is the absence of a presence.)

Tidying your data will often require translating Cartesian → indexed forms, called gathering, and less commonly, indexed → Cartesian, called spreading. The tidyr package provides the spread() and gather() functions to perform these operations, as described below.

(You can imagine generalising these ideas to higher dimensions. However, data is almost always stored in 2d (rows & columns), so these generalisations are fun to think about, but not that practical. I explore the idea more in Wickham (2007).)
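To make the translation concrete, here’s a small sketch that rebuilds the indexed table above and converts it back and forth. The names indexed and cartesian are just for illustration; gather() and spread() are described in detail below.

library(tidyr)

indexed <- dplyr::data_frame(
  x = c("a", "b", "c", "c", "d"),
  y = c("A", "D", "A", "B", "C"),
  z = c(1, 5, 4, 10, 9)
)

# indexed -> Cartesian: spread the values of y across the columns
cartesian <- spread(indexed, key = y, value = z)

# Cartesian -> indexed: gather the columns back into y and z;
# na.rm = TRUE drops the NAs that were implicit in the indexed form
gather(cartesian, key = y, value = z, -x, na.rm = TRUE)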

Gather

gather() has four main arguments:

  • data: the dataset to translate.

  • key & value: the key is the name of the variable that will be created from the column names, and the value is the name of the variable that will be created from the cell values.

  • ...: which variables to gather. You can specify them individually, e.g. A, B, C, D, or as a range, A:D. Alternatively, you can specify which columns should not be gathered by prefixing them with -: -E, -F.

To tidy the economics dataset shown above, you first need to identify the variables: year, month and rate. month is already in a column, but year and rate are in Cartesian form, and we want them in indexed form, so we need to use gather(). In this example, the key is year, the value is unemp and we want to select columns from 2006 to 2015:

gather(ec2, key = year, value = unemp, `2006`:`2015`)
#> # A tibble: 120 × 3
#>   month  year unemp
#>   <dbl> <chr> <dbl>
#> 1     1  2006   8.6
#> 2     2  2006   9.1
#> 3     3  2006   8.7
#> 4     4  2006   8.4
#> 5     5  2006   8.5
#> 6     6  2006   7.3
#> # ... with 114 more rows

Note that the columns have names that are not standard variable names in R (they don’t start with a letter). This means that we need to surround them in backticks, e.g. `2006`, to refer to them.

Alternatively, we could gather all columns except month:

gather(ec2, key = year, value = unemp, -month)
#> # A tibble: 120 × 3
#>   month  year unemp
#>   <dbl> <chr> <dbl>
#> 1     1  2006   8.6
#> 2     2  2006   9.1
#> 3     3  2006   8.7
#> 4     4  2006   8.4
#> 5     5  2006   8.5
#> 6     6  2006   7.3
#> # ... with 114 more rows

To make the result as useful as possible, we can provide two extra arguments:

economics_2 <- gather(ec2, year, rate, `2006`:`2015`, 
  convert = TRUE, na.rm = TRUE)
economics_2
#> # A tibble: 112 × 3
#>   month  year  rate
#> * <dbl> <int> <dbl>
#> 1     1  2006   8.6
#> 2     2  2006   9.1
#> 3     3  2006   8.7
#> 4     4  2006   8.4
#> 5     5  2006   8.5
#> 6     6  2006   7.3
#> # ... with 106 more rows

We use convert = TRUE to automatically convert the years from character strings to numbers, and na.rm = TRUE to remove the months with no data. (In some sense the data isn’t actually missing because it represents dates that haven’t occurred yet.)

When the data is in this form, it’s easy to visualise in many different ways. For example, we can choose to emphasise either the long-term trend or the seasonal variation:

ggplot(economics_2, aes(year + (month - 1) / 12, rate)) +
  geom_line()

ggplot(economics_2, aes(month, rate, group = year)) +
  geom_line(aes(colour = year), size = 1)

Spread

spread() is the opposite of gather(). You use it when you have a pair of columns that are in indexed form, instead of Cartesian form. For example, the following example dataset contains three variables (day, rain and temp), but rain and temp are stored in indexed form.

weather <- dplyr::data_frame(
  day = rep(1:3, 2),
  obs = rep(c("temp", "rain"), each = 3),
  val = c(c(23, 22, 20), c(0, 0, 5))
)
weather
#> # A tibble: 6 × 3
#>     day   obs   val
#>   <int> <chr> <dbl>
#> 1     1  temp    23
#> 2     2  temp    22
#> 3     3  temp    20
#> 4     1  rain     0
#> 5     2  rain     0
#> 6     3  rain     5

Spread allows us to turn this messy indexed form into a tidy Cartesian form. It shares many arguments with gather(). You’ll need to supply the data to translate, as well as the name of the key column, which gives the variable names, and the value column, which contains the cell values. Here the key is obs and the value is val:

spread(weather, key = obs, value = val)
#> # A tibble: 3 × 3
#>     day  rain  temp
#> * <int> <dbl> <dbl>
#> 1     1     0    23
#> 2     2     0    22
#> 3     3     5    20

Exercises

  1. How can you translate each of the initial example datasets into the other form?

  2. How can you convert back and forth between the economics and economics_long datasets built into ggplot2?

  3. Install the EDAWR package from https://github.com/rstudio/EDAWR. Tidy the storms, population and tb datasets.

Separate and unite

Spread and gather help when the variables are in the wrong place in the dataset. Separate and unite help when multiple variables are crammed into one column, or spread across multiple columns.

For example, the following dataset stores some information about the response to a medical treatment. There are three variables (time, treatment and value), but time and treatment are jammed together in one column:

trt <- dplyr::data_frame(
  var = paste0(rep(c("beg", "end"), each = 3), "_", rep(c("a", "b", "c"))),
  val = c(1, 4, 2, 10, 5, 11)
)
trt
#> # A tibble: 6 × 2
#>     var   val
#>   <chr> <dbl>
#> 1 beg_a     1
#> 2 beg_b     4
#> 3 beg_c     2
#> 4 end_a    10
#> 5 end_b     5
#> 6 end_c    11

The separate() function makes it easy to tease apart multiple variables stored in one column. It has four main arguments:

  • data: the data frame to modify.

  • col: the name of the variable to split into pieces.

  • into: a character vector giving the names of the new variables.

  • sep: a description of how to split the variable apart. This can either be a regular expression, e.g. _ to split on underscores or [^a-z] to split on anything that isn’t a letter, or an integer giving the position to split at.

In this case, we want to split by the _ character:

separate(trt, var, c("time", "treatment"), "_")
#> # A tibble: 6 × 3
#>    time treatment   val
#> * <chr>     <chr> <dbl>
#> 1   beg         a     1
#> 2   beg         b     4
#> 3   beg         c     2
#> 4   end         a    10
#> 5   end         b     5
#> 6   end         c    11

(If the variables are combined in a more complex form, have a look at extract(). Alternatively, you might need to create columns individually yourself using other calculations. A useful tool for this is mutate(), which you’ll learn about in the next chapter.)
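To give a flavour of extract(), here’s a sketch applying it to the simple case above. It produces the same result as the separate() call; the regular expression is just one possible pattern, with each pair of parentheses capturing one new variable.

# Each set of parentheses in the regular expression becomes one new column
extract(trt, var, c("time", "treatment"), "([a-z]+)_([a-z]+)")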

unite() is the inverse of separate(): it joins multiple columns back into one column. You’ll need it less often, but it’s useful to know about.
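Here’s a quick sketch (trt_sep is a name introduced just for illustration): separating and then uniting with the same separator recovers the original column.

trt_sep <- separate(trt, var, c("time", "treatment"), "_")

# Paste time and treatment back together into a single var column
unite(trt_sep, var, time, treatment, sep = "_")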

Exercises

  1. Install the EDAWR package from https://github.com/rstudio/EDAWR. Tidy the who dataset.

  2. Work through the demos included in the tidyr package (demo(package = "tidyr")).

Case studies

For most real datasets, you’ll need to use more than one tidying verb. There may be multiple ways to get there, but as long as each step makes the data tidier, you’ll eventually get to the tidy dataset. That said, you typically apply the functions in the same order: gather(), separate() and spread() (although you might not use all three).

Blood pressure

The first step when tidying a new dataset is always to identify the variables. Take the following simulated medical data. There are six variables in this dataset: name, age, start date, week, systolic & diastolic blood pressure. Can you see how they’re stored?

# Adapted from example by Barry Rowlingson, 
# http://barryrowlingson.github.io/hadleyverse/
bpd <- readr::read_table(
"name age      start  week1  week2  week3
Anne  35 2014-03-27 100/80 100/75 120/90
 Ben  41 2014-03-09 110/65 100/65 135/70
Carl  33 2014-04-02 125/80   <NA>   <NA>
", na = "<NA>")

The first step is to convert from Cartesian to indexed form:

bpd_1 <- gather(bpd, week, bp, week1:week3)
bpd_1
#> # A tibble: 9 × 5
#>    name   age      start  week     bp
#>   <chr> <int>     <date> <chr>  <chr>
#> 1  Anne    35 2014-03-27 week1 100/80
#> 2   Ben    41 2014-03-09 week1 110/65
#> 3  Carl    33 2014-04-02 week1 125/80
#> 4  Anne    35 2014-03-27 week2 100/75
#> 5   Ben    41 2014-03-09 week2 100/65
#> 6  Carl    33 2014-04-02 week2   <NA>
#> # ... with 3 more rows

This is tidier, but we have two variables combined together in the bp column. This is a common way of writing down blood pressure, but analysis is easier if we break it into two variables. That’s the job of separate():

bpd_2 <- separate(bpd_1, bp, c("sys", "dia"), "/")
bpd_2
#> # A tibble: 9 × 6
#>    name   age      start  week   sys   dia
#> * <chr> <int>     <date> <chr> <chr> <chr>
#> 1  Anne    35 2014-03-27 week1   100    80
#> 2   Ben    41 2014-03-09 week1   110    65
#> 3  Carl    33 2014-04-02 week1   125    80
#> 4  Anne    35 2014-03-27 week2   100    75
#> 5   Ben    41 2014-03-09 week2   100    65
#> 6  Carl    33 2014-04-02 week2  <NA>  <NA>
#> # ... with 3 more rows

This dataset is now tidy, but we could do a little more to make it easier to use. The following code uses extract() to pull the week number out into its own variable (using regular expressions is beyond the scope of the book, but \\d stands for any digit). I also use arrange() (which you’ll learn about in the next chapter) to order the rows so that each person’s records stay together.

bpd_3 <- extract(bpd_2, week, "week", "(\\d)", convert = TRUE)
bpd_4 <- dplyr::arrange(bpd_3, name, week)
bpd_4
#> # A tibble: 9 × 6
#>    name   age      start  week   sys   dia
#>   <chr> <int>     <date> <int> <chr> <chr>
#> 1  Anne    35 2014-03-27     1   100    80
#> 2  Anne    35 2014-03-27     2   100    75
#> 3  Anne    35 2014-03-27     3   120    90
#> 4   Ben    41 2014-03-09     1   110    65
#> 5   Ben    41 2014-03-09     2   100    65
#> 6   Ben    41 2014-03-09     3   135    70
#> # ... with 3 more rows

You might notice that there’s some repetition in this dataset: if you know the name, then you also know the age and start date. This reflects a third condition of tidiness that I don’t discuss here: each data frame should contain one and only one dataset. Here there are really two datasets: information about each person that doesn’t change over time, and their weekly blood pressure measurements. You can learn more about this sort of messiness in the resources mentioned at the end of the chapter.
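If you wanted to, you could split bpd_4 into those two pieces. Here’s a sketch; the names bpd_person and bpd_measure are just for illustration.

# Per-person information that never changes
bpd_person <- dplyr::distinct(dplyr::select(bpd_4, name, age, start))

# The weekly blood pressure measurements
bpd_measure <- dplyr::select(bpd_4, name, week, sys, dia)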

Test scores

Imagine you’re interested in the effect of an intervention on test scores. You’ve collected the following data. What are the variables?

# Adapted from http://stackoverflow.com/questions/29775461
scores <- dplyr::data_frame(
  person = rep(c("Greg", "Sally", "Sue"), each = 2),
  time   = rep(c("pre", "post"), 3),
  test1  = round(rnorm(6, mean = 80, sd = 4), 0),
  test2  = round(jitter(test1, 15), 0)
)
scores
#> # A tibble: 6 × 4
#>   person  time test1 test2
#>    <chr> <chr> <dbl> <dbl>
#> 1   Greg   pre    84    82
#> 2   Greg  post    73    75
#> 3  Sally   pre    74    75
#> 4  Sally  post    77    76
#> 5    Sue   pre    81    80
#> 6    Sue  post    79    81

I think the variables are person, test, pre-test score and post-test score. As usual, we start by converting columns in Cartesian form (test1 and test2) to indexed form (test and score):

scores_1 <- gather(scores, test, score, test1:test2)
scores_1
#> # A tibble: 12 × 4
#>   person  time  test score
#>    <chr> <chr> <chr> <dbl>
#> 1   Greg   pre test1    84
#> 2   Greg  post test1    73
#> 3  Sally   pre test1    74
#> 4  Sally  post test1    77
#> 5    Sue   pre test1    81
#> 6    Sue  post test1    79
#> # ... with 6 more rows

Now we need to do the opposite: pre and post should be variables, not values, so we need to spread time and score:

scores_2 <- spread(scores_1, time, score)
scores_2
#> # A tibble: 6 × 4
#>   person  test  post   pre
#> *  <chr> <chr> <dbl> <dbl>
#> 1   Greg test1    73    84
#> 2   Greg test2    75    82
#> 3  Sally test1    77    74
#> 4  Sally test2    76    75
#> 5    Sue test1    79    81
#> 6    Sue test2    81    80

A good indication that we have made a tidy dataset is that it’s now easy to calculate the statistic of interest: the difference between pre- and post-intervention scores:

scores_3 <- mutate(scores_2, diff = post - pre)
scores_3
#> # A tibble: 6 × 5
#>   person  test  post   pre  diff
#>    <chr> <chr> <dbl> <dbl> <dbl>
#> 1   Greg test1    73    84   -11
#> 2   Greg test2    75    82    -7
#> 3  Sally test1    77    74     3
#> 4  Sally test2    76    75     1
#> 5    Sue test1    79    81    -2
#> 6    Sue test2    81    80     1

And it’s similarly easy to plot:

ggplot(scores_3, aes(person, diff, color = test)) +
  geom_hline(size = 2, colour = "white", yintercept = 0) +
  geom_point() +
  geom_path(aes(group = person), colour = "grey50", 
    arrow = arrow(length = unit(0.25, "cm")))

(Again, you’ll learn about mutate() in the next chapter.)

Learning more

Data tidying is a big topic and this chapter only scratches the surface. I recommend the following references which go into considerably more depth on this topic:

  • The tidyr documentation. I’ve described the most important arguments, but most functions have other arguments that help deal with less common situations. If you’re struggling, make sure to read the documentation to see if there’s an argument that might help you.

  • “Tidy data”, an article in the Journal of Statistical Software. It describes the ideas of tidy data in more depth and shows other types of messy data. Unfortunately the paper was written before tidyr existed, so to see how to use tidyr instead of reshape2, consult the tidyr vignette.

  • The data wrangling cheatsheet by RStudio includes the most common tidyr verbs in a form designed to jog your memory when you’re stuck.

References

Wickham, Hadley. 2007. “Reshaping Data with the Reshape Package.” Journal of Statistical Software 21 (12). http://www.jstatsoft.org/v21/i12/paper.

———. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.