Introduction

Tidy data is important, but it’s not the end of the road. Often you won’t have quite the right variables, or your data might need a little aggregation before you visualise it. This chapter will show you how to solve these problems (and more!) with the dplyr package.

The goal of dplyr is to provide verbs (functions) that help you solve the most common 95% of data manipulation problems. dplyr is similar to ggplot2, but instead of providing a grammar of graphics, it provides a grammar of data manipulation. Like ggplot2, dplyr helps you not just by giving you functions, but also by helping you think about data manipulation. In particular, dplyr helps by constraining you: instead of struggling to think about which of the thousands of functions might help, you can just pick from a handful that are designed to be very likely to be helpful. In this chapter you’ll learn four of the most important dplyr verbs:

  • filter()
  • mutate()
  • group_by() & summarise()

These verbs are easy to learn because they all work the same way: they take a data frame as the first argument, and return a modified data frame. The other arguments control the details of the transformation, and are always interpreted in the context of the data frame so you can refer to variables directly. I’ll also explain each in the same way: I’ll show you a motivating example using the diamonds data, give you more details about how the function works, and finish up with some exercises for you to practice your skills with.

You’ll also learn how to create data transformation pipelines using %>%. %>% plays a similar role to + in ggplot2: it allows you to solve complex problems by combining small pieces that are easily understood in isolation.

This chapter only scratches the surface of dplyr’s capabilities but it should be enough to help you with visualisation problems. You can learn more by using the resources discussed at the end of the chapter.

Filter observations

It’s common to only want to explore one part of a dataset. A great data analysis strategy is to start with just one observation unit (one person, one city, etc), and understand how it works before attempting to generalise the conclusion to others. This is a great technique if you ever feel overwhelmed by an analysis: zoom down to a small subset, master it, and then zoom back out, to apply your conclusions to the full dataset.

Filtering is also useful for extracting outliers. Generally, you don’t want to just throw outliers away, as they’re often highly revealing, but it’s useful to think about partitioning the data into the common and the unusual. You summarise the common to look at the broad trends; you examine the outliers individually to see if you can figure out what’s going on.

For example, look at this plot that shows how the x and y dimensions of the diamonds are related:

ggplot(diamonds, aes(x, y)) + 
  geom_bin2d()

There are around 50,000 points in this dataset: most of them lie along the diagonal, but there are a handful of outliers. One clear set of incorrect values are those diamonds with zero dimensions. We can use filter() to pull them out:

filter(diamonds, x == 0 | y == 0)
#> # A tibble: 8 × 10
#>   carat       cut color clarity depth table price     x     y     z
#>   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  1.07     Ideal     F     SI2  61.6    56  4954     0  6.62     0
#> 2  1.00 Very Good     H     VS2  63.3    53  5139     0  0.00     0
#> 3  1.14      Fair     G     VS1  57.5    67  6381     0  0.00     0
#> 4  1.56     Ideal     G     VS2  62.2    54 12800     0  0.00     0
#> 5  1.20   Premium     D    VVS1  62.1    59 15686     0  0.00     0
#> 6  2.25   Premium     H     SI2  62.8    59 18034     0  0.00     0
#> # ... with 2 more rows

This is equivalent to the base R code diamonds[diamonds$x == 0 | diamonds$y == 0, ], but is more concise because filter() knows to look for the bare x in the data frame.

(If you’ve used subset() before, you’ll notice that it has very similar behaviour. The biggest difference is that subset() can select both observations and variables, where in dplyr, filter() works exclusively with observations and select() with variables. There are some other subtle differences, but the main advantage to using filter() is that it behaves identically to the other dplyr verbs and it tends to be a bit faster than subset().)
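
For example, these two calls pull out the same zero-dimension diamonds (a quick sketch; it assumes ggplot2 and dplyr are loaded so that diamonds and filter() are available alongside base subset()):

# Base R subset() and dplyr filter() return the same rows here
subset(diamonds, x == 0 | y == 0)
filter(diamonds, x == 0 | y == 0)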

In a real analysis, you’d look at the outliers in more detail to see if you can find the root cause of the data quality problem. In this case, we’re just going to throw them out and focus on what remains. To save some typing, you can supply multiple arguments to filter(), which combines them with &.

diamonds_ok <- filter(diamonds, x > 0, y > 0, y < 20)
ggplot(diamonds_ok, aes(x, y)) +
  geom_bin2d() +
  geom_abline(slope = 1, colour = "white", size = 1, alpha = 0.5)

This plot is now more informative - we can see a very strong relationship between x and y. I’ve added the reference line to make it clear that for most diamonds, x and y are very similar. However, this plot still has problems:

  • The plot is mostly empty, because most of the data lies along the diagonal.

  • There are some clear bivariate outliers, but it’s hard to select them with a simple filter.

We’ll solve both of these problems in the next section by adding a new variable that’s a transformation of x and y. But before we continue on to that, let’s talk more about the details of filter().

Useful tools

The first argument to filter() is a data frame. The second and subsequent arguments must be logical vectors: filter() selects every row where all the logical expressions are TRUE. Each logical vector must have one element per row of the data frame: if not, you’ll get an error. Typically you create the logical vectors with the comparison operators:

  • x == y: x and y are equal.
  • x != y: x and y are not equal.
  • x %in% c("a", "b", "c"): x is one of the values in the right hand side.
  • x > y, x >= y, x < y, x <= y: greater than, greater than or equal to, less than, less than or equal to.

And combine them with logical operators:

  • !x (pronounced “not x”) flips TRUE and FALSE, so it keeps all the values where x is FALSE.
  • x & y: TRUE if both x and y are TRUE.
  • x | y: TRUE if either x or y (or both) are TRUE.
  • xor(x, y): TRUE if either x or y are TRUE, but not both (exclusive or).

Most real queries involve some combination of both:

  • Price less than $500: price < 500
  • Size between 1 and 2 carats: carat >= 1 & carat < 2
  • Cut is ideal or premium: cut == "Premium" | cut == "Ideal", or cut %in% c("Premium", "Ideal") (note that R is case sensitive)
  • Worst colour, cut and clarity: cut == "Fair" & color == "J" & clarity == "I1"
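
For example, here’s a quick sketch (assuming dplyr and ggplot2 are loaded) translating two of these queries into complete filter() calls:

# Diamonds costing less than $500
filter(diamonds, price < 500)

# Premium or Ideal diamonds between 1 and 2 carats
filter(diamonds, cut %in% c("Premium", "Ideal"), carat >= 1 & carat < 2)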

You can also use functions in the filtering expression:

  • Size is between 1 and 2 carats: floor(carat) == 1
  • An average dimension greater than 3: (x + y + z) / 3 > 3

This is useful for simple expressions, but as things get more complicated it’s better to create a new variable first so you can check that you’ve done the computation correctly before doing the subsetting. You’ll learn how to do that in the next section.

The rules for NA are a bit trickier, so I’ll explain them next.

Missing values

NA, R’s missing value indicator, can be frustrating to work with. R’s underlying philosophy is to force you to recognise that you have missing values, and make a deliberate choice to deal with them: missing values never silently go missing. This is a pain because you almost always want to just get rid of them, but it’s a good principle to force you to think about the correct option.

The most important thing to understand about missing values is that they are infectious: with few exceptions, the result of any operation that includes a missing value will be a missing value. This happens because NA represents an unknown value, and there are few operations that turn an unknown value into a known value.

x <- c(1, NA, 2)
x == 1
#> [1]  TRUE    NA FALSE
x > 2
#> [1] FALSE    NA FALSE
x + 10
#> [1] 11 NA 12

When you first learn R, you might be tempted to find missing values using ==:

x == NA
#> [1] NA NA NA
x != NA
#> [1] NA NA NA

But that doesn’t work! A little thought reveals why: there’s no reason why two unknown values should be the same. Instead, use is.na(x) to determine if a value is missing:

is.na(x)
#> [1] FALSE  TRUE FALSE

filter() only includes observations where all arguments are TRUE, so NA values are automatically dropped. If you want to include missing values, be explicit: x > 10 | is.na(x). In other parts of R, you’ll sometimes need to convert missing values into FALSE. You can do that with x > 10 & !is.na(x).
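
Here’s a minimal sketch of the difference, using a tiny made-up data frame (df and its column x are just for illustration):

df <- data.frame(x = c(5, NA, 15))
filter(df, x > 10)             # the row with the missing value is dropped
filter(df, x > 10 | is.na(x))  # the row with the missing value is kept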

Exercises

  1. Practice your filtering skills by:

    • Finding all the diamonds with equal x and y dimensions.
    • A depth between 55 and 70.
    • A carat smaller than the median carat.
    • A cost of more than $10,000 per carat.
    • A cut of good or better quality.
  2. Fill in the question marks in this table:

    Expression   TRUE   FALSE   NA
    x            x
    ?                   x
    is.na(x)                    x
    !is.na(x)    ?      ?       ?
    ?            x              x
    ?                   x       x
  3. Repeat the analysis of outlying values with z. Compared to x and y, how would you characterise the relationship of x and z, or y and z?

  4. Install the ggplot2movies package and look at the movies that have a missing budget. How are they different from the movies with a budget? (Hint: try a frequency polygon plus colour = is.na(budget).)

  5. What is NA & FALSE and NA | TRUE? Why? Why doesn’t NA * 0 equal zero? What number times zero does not equal 0? What do you expect NA ^ 0 to equal? Why?

Create new variables

To better explore the relationship between x and y, it’s useful to “rotate” the plot so that the data is flat, not diagonal. We can do that by creating two new variables: one that represents the difference between x and y (which in this context represents the symmetry of the diamond) and one that represents its size (the length of the diagonal).

To create new variables use mutate(). Like filter(), it takes a data frame as its first argument and returns a data frame. Its second and subsequent arguments are named expressions that generate new variables. As with filter(), you can refer to variables just by their name; you don’t need to also include the name of the dataset.

diamonds_ok2 <- mutate(diamonds_ok,
  sym = x - y,
  size = sqrt(x ^ 2 + y ^ 2)
)
diamonds_ok2
#> # A tibble: 53,930 × 12
#>   carat       cut color clarity depth table price     x     y     z   sym
#>   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43 -0.03
#> 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31  0.05
#> 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31 -0.02
#> 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63 -0.03
#> 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75 -0.01
#> 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48 -0.02
#> # ... with 5.392e+04 more rows, and 1 more variables: size <dbl>

ggplot(diamonds_ok2, aes(size, sym)) + 
  stat_bin2d()

This plot has two advantages: we can more easily see the pattern followed by most diamonds, and we can easily select outliers. Here, it doesn’t seem important whether the outliers are positive (i.e. x is bigger than y) or negative (i.e. y is bigger than x). So we can use the absolute value of the symmetry variable to pull out the outliers. Based on the plot, and a little experimentation, I came up with a threshold of 0.20. We’ll check out the results with a histogram.

ggplot(diamonds_ok2, aes(abs(sym))) + 
  geom_histogram(binwidth = 0.10)

diamonds_ok3 <- filter(diamonds_ok2, abs(sym) < 0.20)
ggplot(diamonds_ok3, aes(abs(sym))) + 
  geom_histogram(binwidth = 0.01)

That’s an interesting histogram! While most diamonds are close to being symmetric, there are very few that are perfectly symmetric (i.e. x == y).

Useful tools

Typically, transformations will be suggested by your domain knowledge. However, there are a few transformations that are useful in a surprisingly wide range of circumstances.

  • Log-transformations are often useful. They turn multiplicative relationships into additive relationships; they compress data that varies over orders of magnitude; they convert power relationships to linear relationships. See examples at http://stats.stackexchange.com/questions/27951

  • Relative difference: If you’re interested in the relative difference between two variables, use log(x / y). It’s better than x / y because it’s symmetric: if x < y, x / y takes values [0, 1), but if x > y, x / y takes values (1, Inf). See Törnqvist, Vartia, and Vartia (1985) for more details.

  • Sometimes integrating or differentiating might make the data more interpretable: if you have distance and time, would speed or acceleration be more useful (or vice versa)? (Note that integration makes data smoother; differentiation makes it less smooth.)

  • Partition a number into magnitude and direction with abs(x) and sign(x).

There are also a few useful ways to transform pairs of variables:

  • Partitioning into overall size and difference is often useful, as seen above.

  • If you see a strong trend, using a model to partition it into pattern and residuals is often useful. You’ll learn more about that in the next chapter.

  • Sometimes it’s useful to change positions to polar coordinates (or vice versa): distance (sqrt(x^2 + y^2)) and angle (atan2(y, x)).
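
As a rough sketch, here are a few of these transformations applied to diamonds_ok with mutate() (diamonds_extra and the new column names are just made up for illustration):

diamonds_extra <- mutate(diamonds_ok,
  lcarat = log10(carat),        # compress a variable spanning orders of magnitude
  lprice = log10(price),
  rel_diff = log(x / y),        # symmetric relative difference between x and y
  dist = sqrt(x ^ 2 + y ^ 2),   # polar coordinates: distance ...
  angle = atan2(y, x)           # ... and angle
)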

Exercises

  1. Practice your variable creation skills by creating the following new variables:

    • The approximate volume of the diamond (using x, y, and z).
    • The approximate density of the diamond.
    • The price per carat.
    • Log transformation of carat and price.
  2. How can you improve the data density of ggplot(diamonds, aes(x, z)) + stat_bin2d()? What transformation makes it easier to extract outliers?

  3. The depth variable is just the height of the diamond (z) divided by its average width (the mean of x and y), multiplied by 100 and rounded to the nearest integer. Compute the depth yourself and compare it to the existing depth variable. Summarise your findings with a plot.

  4. Compare the distribution of symmetry for diamonds with x > y vs. x < y.

Group-wise summaries

Many insightful visualisations require that you reduce the full dataset down to a meaningful summary. ggplot2 provides a number of geoms that will do summaries for you. But it’s often useful to do summaries by hand: that gives you more flexibility and you can use the summaries for other purposes.

dplyr does summaries in two steps:

  1. Define the grouping variables with group_by().
  2. Describe how to summarise each group in a single row with summarise().

For example, to look at the average price per clarity, we first group by clarity, then summarise:

by_clarity <- group_by(diamonds, clarity)
sum_clarity <- summarise(by_clarity, price = mean(price))
sum_clarity
#> # A tibble: 8 × 2
#>   clarity price
#>     <ord> <dbl>
#> 1      I1  3924
#> 2     SI2  5063
#> 3     SI1  3996
#> 4     VS2  3925
#> 5     VS1  3839
#> 6    VVS2  3284
#> # ... with 2 more rows

ggplot(sum_clarity, aes(clarity, price)) + 
  geom_line(aes(group = 1), colour = "grey80") +
  geom_point(size = 2)

You might be surprised by this pattern: why do diamonds with better clarity have lower prices? We’ll see why this is the case, and what to do about it, in the section on removing trend.

Supply additional variables to group_by() to create groups based on more than one variable. The next example shows how we can compute (by hand) a frequency polygon that shows how cut and depth interact. The special summary function n() counts the number of observations in each group.

cut_depth <- summarise(group_by(diamonds, cut, depth), n = n())
cut_depth <- filter(cut_depth, depth > 55, depth < 70)
cut_depth
#> Source: local data frame [455 x 3]
#> Groups: cut [5]
#> 
#>     cut depth     n
#>   <ord> <dbl> <int>
#> 1  Fair  55.1     3
#> 2  Fair  55.2     6
#> 3  Fair  55.3     5
#> 4  Fair  55.4     2
#> 5  Fair  55.5     3
#> 6  Fair  55.6     4
#> # ... with 449 more rows

ggplot(cut_depth, aes(depth, n, colour = cut)) + 
  geom_line()

We can use a grouped mutate() to convert counts to proportions, so it’s easier to compare across the cuts. summarise() strips one level of grouping off, so cut_depth will be grouped by cut.

cut_depth <- mutate(cut_depth, prop = n / sum(n))
ggplot(cut_depth, aes(depth, prop, colour = cut)) + 
  geom_line()

Useful tools

summarise() needs to be used with functions that take a vector of n values and always return a single value. Those functions include:

  • Counts: n(), n_distinct(x).
  • Middle: mean(x), median(x).
  • Spread: sd(x), mad(x), IQR(x).
  • Extremes: quantile(x, 0.25), min(x), max(x).
  • Positions: first(x), last(x), nth(x, 2).
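
For example, a quick sketch combining several of these in one grouped summary (the output column names are arbitrary):

by_cut <- group_by(diamonds, cut)
summarise(by_cut,
  n = n(),
  n_clarity = n_distinct(clarity),
  median_price = median(price),
  iqr_price = IQR(price)
)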

Another extremely useful technique is to use sum() or mean() with a logical vector. When a logical vector is treated as numeric, TRUE becomes 1 and FALSE becomes 0. This means that sum() tells you the number of TRUEs, and mean() tells you the proportion of TRUEs. For example, the following code counts the number of diamonds with carat greater than or equal to 4, and the proportion of diamonds that cost less than $1000.

summarise(diamonds, 
  n_big = sum(carat >= 4), 
  prop_cheap = mean(price < 1000)
)
#> # A tibble: 1 × 2
#>   n_big prop_cheap
#>   <int>      <dbl>
#> 1     6      0.269

Most summary functions have a na.rm argument: na.rm = TRUE tells the summary function to remove any missing values prior to summarisation. This is a convenient shortcut: rather than removing the missing values then summarising, you can do it in one step.
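
A minimal sketch with a made-up data frame:

df <- data.frame(x = c(1, NA, 3))
summarise(df, mean = mean(x))                # NA: the missing value is infectious
summarise(df, mean = mean(x, na.rm = TRUE))  # 2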

Statistical considerations

When summarising with the mean or median, it’s always a good idea to include a count and a measure of spread. This helps you calibrate your assessments - if you don’t include them you’re likely to think that the data is less variable than it really is, and potentially draw unwarranted conclusions.

The following example extends our previous summary of the average price by clarity to also include the number of observations in each group, and the upper and lower quartiles. It suggests the mean might be a bad summary for this data - the distributions of price are so highly skewed that the mean is higher than the upper quartile for some of the groups!

by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(
    n = n(), 
    mean = mean(price), 
    lq = quantile(price, 0.25), 
    uq = quantile(price, 0.75)
  )
by_clarity
#> # A tibble: 8 × 5
#>   clarity     n  mean    lq    uq
#>     <ord> <int> <dbl> <dbl> <dbl>
#> 1      I1   741  3924  2080  5161
#> 2     SI2  9194  5063  2264  5777
#> 3     SI1 13065  3996  1089  5250
#> 4     VS2 12258  3925   900  6024
#> 5     VS1  8171  3839   876  6023
#> 6    VVS2  5066  3284   794  3638
#> # ... with 2 more rows
ggplot(by_clarity, aes(clarity, mean)) + 
  geom_linerange(aes(ymin = lq, ymax = uq)) + 
  geom_line(aes(group = 1), colour = "grey50") +
  geom_point(aes(size = n))

Another example of this comes from baseball. Let’s take the MLB batting data from the Lahman package and calculate the batting average: the number of hits divided by the number of at bats. Who’s the best batter according to this metric?

data(Batting, package = "Lahman")
batters <- filter(Batting, AB > 0)
per_player <- group_by(batters, playerID)
ba <- summarise(per_player, 
  ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE)
)
ggplot(ba, aes(ba)) + 
  geom_histogram(binwidth = 0.01)

Wow, there are a lot of players who can hit the ball every single time! Would you want them on your fantasy baseball team? Let’s double check they’re really that good by also showing the total number of at bats:

ba <- summarise(per_player, 
  ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
  ab = sum(AB, na.rm = TRUE)
)
ggplot(ba, aes(ab, ba)) + 
  geom_bin2d(bins = 100) + 
  geom_smooth()

The highest batting averages occur for the players with the smallest number of at bats - it’s not hard to hit the ball every time if you’ve only had two pitches. We can make the pattern a little clearer by getting rid of the players with fewer than 10 at bats.

ggplot(filter(ba, ab >= 10), aes(ab, ba)) + 
  geom_bin2d() + 
  geom_smooth()

You’ll often see a similar pattern whenever you plot number of observations vs. an average. Be aware!

Exercises

  1. For each year in the ggplot2movies::movies data determine the percent of movies with missing budgets. Visualise the result.

  2. How does the average length of a movie change over time? Display your answer with a plot, including some display of uncertainty.

  3. For each combination of diamond quality (e.g. cut, colour and clarity), count the number of diamonds, the average price and the average size. Visualise the results.

  4. Compute a histogram of carat by “hand” using a binwidth of 0.1. Display the results with geom_bar(stat = "identity"). (Hint: you might need to create a new variable first).

  5. In the baseball example, the batting average seems to increase as the number of at bats increases. Why?

Transformation pipelines

Most real analyses require you to string together multiple mutate()s, filter()s, group_by()s, and summarise()s. For example, above, we created a frequency polygon by hand with a combination of all four verbs:

# By using intermediate values
cut_depth <- group_by(diamonds, cut, depth)
cut_depth <- summarise(cut_depth, n = n())
cut_depth <- filter(cut_depth, depth > 55, depth < 70)
cut_depth <- mutate(cut_depth, prop = n / sum(n))

This sequence of operations is a bit painful because we repeated the name of the data frame many times. An alternative is just to do it with one sequence of function calls:

# By "composing" functions
mutate(
  filter(
    summarise(
      group_by(
        diamonds, 
        cut, 
        depth
      ), 
      n = n()
    ), 
    depth > 55, 
    depth < 70
  ), 
  prop = n / sum(n)
)

But this is also hard to read because the sequence of operations is inside out, and the arguments to each function can be quite far apart. dplyr provides an alternative approach with the pipe, %>%. With the pipe, we can write the above sequence of operations as:

cut_depth <- diamonds %>% 
  group_by(cut, depth) %>% 
  summarise(n = n()) %>% 
  filter(depth > 55, depth < 70) %>% 
  mutate(prop = n / sum(n))

This makes it easier to understand what’s going on as we can read it almost like an English sentence: first group, then summarise, then filter, then mutate. In fact, the best way to pronounce %>% when reading a sequence of code is as “then”. %>% comes from the magrittr package, by Stefan Milton Bache. It provides a number of other tools that dplyr doesn’t expose by default, so I highly recommend that you check out the magrittr website.

%>% works by taking the thing on the left hand side (LHS) and supplying it as the first argument to the function on the right hand side (RHS). Each of these pairs of calls is equivalent:

f(x, y)
# is the same as
x %>% f(y)

g(f(x, y), z)
# is the same as
x %>% f(y) %>% g(z)

Exercises

  1. Translate each of the examples in this chapter to use the pipe.

  2. What does the following pipe do?

    library(magrittr)
    x <- runif(100)
    x %>%
      subtract(mean(.)) %>%
      raise_to_power(2) %>%
      mean() %>%
      sqrt()
  3. Which player in the Batting dataset has had the most consistently good performance over the course of their career?

Learning more

dplyr provides a number of other verbs that are less useful for visualisation, but important to know about in general:

  • arrange() orders observations according to variable(s). This is most useful when you’re looking at the data from the console. It can also be useful for visualisations if you want to control which points are plotted on top.

  • select() picks variables based on their names. Useful when you have many variables and want to focus on just a few for analysis.

  • rename() allows you to change the name of variables.

  • Grouped mutates and filters are also useful, but more advanced. See vignette("window-functions", package = "dplyr") for more details.

  • There are a number of verbs designed to work with two tables of data at a time. These include SQL joins (like the base merge() function) and set operations. Learn more about them in vignette("two-table", package = "dplyr").

  • dplyr can work directly with data stored in a database - you use the same R code as you do for local data and dplyr generates SQL to send to the database. See vignette("databases", package = "dplyr") for the details.
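
A few quick sketches of arrange(), select(), and rename() applied to diamonds (just illustrative one-liners):

arrange(diamonds, desc(price))       # most expensive diamonds first
select(diamonds, carat, cut, price)  # keep just three variables
rename(diamonds, colour = color)     # rename color to colour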

Finally, RStudio provides a handy dplyr cheatsheet that will help jog your memory when you’re wondering which function to use. Get it from http://rstudio.com/cheatsheets.

References

Törnqvist, Leo, Pentti Vartia, and Yrjö O Vartia. 1985. “How Should Relative Changes Be Measured?” The American Statistician 39 (1): 43–46.