Executive Summary

The dplyr package is one of my workhorses when manipulating dataframes in R. The verb-based approach fits comfortably with my SQL background. It is a part of Hadley Wickham’s tidyverse which is a toolbox for almost anything a data scientist would need, with a common grammer. This paper exclusively walks through the new functionality and datasets available in the new version of dplyr. At the end we will look at dplyrs implementation of rlang functionality to better provide column-references for functions and apps.

Sources

Blog Source: https://blog.rstudio.org/2017/06/13/dplyr-0-7-0/
Vignette: http://dplyr.tidyverse.org/articles/programming.html

General Setup

Load libraries

library(tidyverse)
library(rlang)

Version check

packageVersion("dplyr")
## [1] '0.7.0'

Good, we can see that dplyr is v0.7.0 for this exercise.

Basics

To show the basic dplyr functionality we’ll use one of the new datasets called starwars.

Look at the new starwars dataset

glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...

Note that the last three columns are actually embedded lists.

General existing dplyr syntax example

dplyr is all about manipulating dataframes. This is made simpler by the new version.

In this case you start with the starwars dataframe and grind it down to what wou want.

set.seed(317)
starwars %>%
    select(species, mass, height, birth_year) %>%
    filter(height >= 100) %>%
    group_by(species) %>%
    na.omit() %>%
    summarise(avg_mass = mean(mass), avg_height = mean(height), 
              avg_birth_year = mean(birth_year)) %>%
    sample_n(6) %>%
    arrange(-avg_mass)
## # A tibble: 6 x 4
##      species avg_mass avg_height avg_birth_year
##        <chr>    <dbl>      <dbl>          <dbl>
## 1 Trandoshan    113.0      190.0           53.0
## 2      Droid    107.5      183.5           63.5
## 3    Kel Dor     80.0      188.0           22.0
## 4     Zabrak     80.0      175.0           54.0
## 5     Gungan     66.0      196.0           52.0
## 6   Mirialan     53.1      168.0           49.0

In this example I select the species, for grouping. I then select all of the columns with continuous data, filter out the very small species, like R2D2, group by species then drop NA rows, get the average height, mass, and birth_year, and then sample six of the entire data set, and sorts the species with the highest mass to the top.

NOTE: Since this uses the sample_n() function we are introducing randomness. For reproducibility I have set the seed to 317.

NOTE: I really didn’t want to filter out R2D2 but there were few other options.

What’s New

Other new datasets

Storms

storms has the trajectories of ~200 tropical storms. It contains a strong grouping structure.

glimpse(storms)
## Observations: 10,010
## Variables: 13
## $ name        <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "...
## $ year        <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 19...
## $ month       <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7,...
## $ day         <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 30...
## $ hour        <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12...
## $ lat         <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3, 34.0, 34...
## $ long        <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7, -78.0, -...
## $ status      <chr> "tropical depression", "tropical depression", "tro...
## $ category    <ord> -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, ...
## $ wind        <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 45, 50, 50...
## $ pressure    <int> 1013, 1013, 1013, 1013, 1012, 1012, 1011, 1006, 10...
## $ ts_diameter <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ hu_diameter <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

band_members, band_instruments and band_instruments2

These are tiny datasets of bands. It’s designed to be very simple so you can illustrate how joins work without getting distracted by the details of the data.

band_members
## # A tibble: 3 x 2
##    name    band
##   <chr>   <chr>
## 1  Mick  Stones
## 2  John Beatles
## 3  Paul Beatles
band_instruments
## # A tibble: 3 x 2
##    name  plays
##   <chr>  <chr>
## 1  John guitar
## 2  Paul   bass
## 3 Keith guitar
band_instruments2
## # A tibble: 3 x 2
##   artist  plays
##    <chr>  <chr>
## 1   John guitar
## 2   Paul   bass
## 3  Keith guitar

New and Improved Verbs

pull()

The pull() function allows you to extract a single column either by name or position. It’s similar to select() but returns a vector, rather than a single-column dataframe.

Included dataset: mtcars

For this we’re switching to another dataset included in the tidyverse called mtcars The mtcars data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 other aspects of automobile design and performance for 32 automobiles (1973-74 models). That’s 44 years ago!!!

mtcars is unusual as the car names are stored in the row name instead of any actual column. This was probably done to maintain the complete numerical structure of the data.

row.names(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

Look at mtcars data

glimpse(mtcars)
## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

Pull by name

This is the most straight-forward approach.

mtcars %>% pull(cyl) %>% str()
##  num [1:32] 6 6 4 6 8 6 8 4 4 6 ...

Pull by position

Sometimes this works better. In this example I pull he same column as above, cyl.

mtcars %>% pull(2) %>% str()
##  num [1:32] 6 6 4 6 8 6 8 4 4 6 ...

NOTE: The result is a vector and not a data_frame.

Changes to arrange()

arrange() is used for sorting data by specified variables. You can also specify many variables to create progressive grouping.

Sort groups using arrange()

This is very good new functionality in dplyr.

Assumption: The dataframe is already grouped.

Gains a .by_group argument so you can choose to sort by groups if you want to. Defaults to FALSE

Existing functionality

Group cars by cyl and gear
mtcars_grouped <- mtcars %>%
    select(cyl, gear, carb, everything()) %>%
    group_by(cyl, gear)
Sort mtcars by cyl and gear, ascending.
head(arrange(mtcars_grouped, cyl, gear))
## # A tibble: 6 x 11
## # Groups:   cyl, gear [2]
##     cyl  gear  carb   mpg  disp    hp  drat    wt  qsec    vs    am
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     4     3     1  21.5 120.1    97  3.70 2.465 20.01     1     0
## 2     4     4     1  22.8 108.0    93  3.85 2.320 18.61     1     1
## 3     4     4     2  24.4 146.7    62  3.69 3.190 20.00     1     0
## 4     4     4     2  22.8 140.8    95  3.92 3.150 22.90     1     0
## 5     4     4     1  32.4  78.7    66  4.08 2.200 19.47     1     1
## 6     4     4     2  30.4  75.7    52  4.93 1.615 18.52     1     1

Sort by gear, descending

head(arrange(mtcars_grouped, desc(gear)))
## # A tibble: 6 x 11
## # Groups:   cyl, gear [4]
##     cyl  gear  carb   mpg  disp    hp  drat    wt  qsec    vs    am
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     4     5     2  26.0 120.3    91  4.43 2.140 16.70     0     1
## 2     4     5     2  30.4  95.1   113  3.77 1.513 16.90     1     1
## 3     8     5     4  15.8 351.0   264  4.22 3.170 14.50     0     1
## 4     6     5     6  19.7 145.0   175  3.62 2.770 15.50     0     1
## 5     8     5     8  15.0 301.0   335  3.54 3.570 14.60     0     1
## 6     6     4     4  21.0 160.0   110  3.90 2.620 16.46     0     1

Existing grouped arrange() ignores groups

Not good, because in a lot of cases you want to see the highest or lowest values by group.

by_cyl <- mtcars_grouped %>% group_by(cyl)
by_cyl %>% arrange(desc(gear))
## # A tibble: 32 x 11
## # Groups:   cyl [3]
##      cyl  gear  carb   mpg  disp    hp  drat    wt  qsec    vs    am
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     4     5     2  26.0 120.3    91  4.43 2.140 16.70     0     1
##  2     4     5     2  30.4  95.1   113  3.77 1.513 16.90     1     1
##  3     8     5     4  15.8 351.0   264  4.22 3.170 14.50     0     1
##  4     6     5     6  19.7 145.0   175  3.62 2.770 15.50     0     1
##  5     8     5     8  15.0 301.0   335  3.54 3.570 14.60     0     1
##  6     6     4     4  21.0 160.0   110  3.90 2.620 16.46     0     1
##  7     6     4     4  21.0 160.0   110  3.90 2.875 17.02     0     1
##  8     4     4     1  22.8 108.0    93  3.85 2.320 18.61     1     1
##  9     4     4     2  24.4 146.7    62  3.69 3.190 20.00     1     0
## 10     4     4     2  22.8 140.8    95  3.92 3.150 22.90     1     0
## # ... with 22 more rows

New .by_group parameter does what you want

by_cyl %>% arrange(desc(gear), .by_group = TRUE)
## # A tibble: 32 x 11
## # Groups:   cyl [3]
##      cyl  gear  carb   mpg  disp    hp  drat    wt  qsec    vs    am
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     4     5     2  26.0 120.3    91  4.43 2.140 16.70     0     1
##  2     4     5     2  30.4  95.1   113  3.77 1.513 16.90     1     1
##  3     4     4     1  22.8 108.0    93  3.85 2.320 18.61     1     1
##  4     4     4     2  24.4 146.7    62  3.69 3.190 20.00     1     0
##  5     4     4     2  22.8 140.8    95  3.92 3.150 22.90     1     0
##  6     4     4     1  32.4  78.7    66  4.08 2.200 19.47     1     1
##  7     4     4     2  30.4  75.7    52  4.93 1.615 18.52     1     1
##  8     4     4     1  33.9  71.1    65  4.22 1.835 19.90     1     1
##  9     4     4     1  27.3  79.0    66  4.08 1.935 18.90     1     1
## 10     4     4     2  21.4 121.0   109  4.11 2.780 18.60     1     1
## # ... with 22 more rows

Very useful

All single-table verbs now have scoped variants suffixed with _if(), _at() and _all().

Use these if you want to do something to: * every variable (_all), * variables selected by their names (_at), * or variables that satisfy some predicate (_if).

summarise_if()

This example summarizes all numeric columns by returning the mean. Great for separate processing by column datatype.

iris %>% summarise_if(is.numeric, mean)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333

select_if()

This selects all columns that do not contain sub-lists. Very useful.

starwars %>% select_if(Negate(is.list))
## # A tibble: 87 x 10
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 77 more rows, and 4 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>

group_by_at()

This is a great example when you want to group by multiple variables.
It groups successively by each column from year to hour. Note that this works best when your data columns follow the groupings that are needed. You can individually select columns but your data looks better and this function works more naturally when they are in the correct order

storms %>% group_by_at(vars(year:hour))
## # A tibble: 10,010 x 13
## # Groups:   year, month, day, hour [7,958]
##     name  year month   day  hour   lat  long              status category
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>               <chr>    <ord>
##  1   Amy  1975     6    27     0  27.5 -79.0 tropical depression       -1
##  2   Amy  1975     6    27     6  28.5 -79.0 tropical depression       -1
##  3   Amy  1975     6    27    12  29.5 -79.0 tropical depression       -1
##  4   Amy  1975     6    27    18  30.5 -79.0 tropical depression       -1
##  5   Amy  1975     6    28     0  31.5 -78.8 tropical depression       -1
##  6   Amy  1975     6    28     6  32.4 -78.7 tropical depression       -1
##  7   Amy  1975     6    28    12  33.3 -78.0 tropical depression       -1
##  8   Amy  1975     6    28    18  34.0 -77.0 tropical depression       -1
##  9   Amy  1975     6    29     0  34.4 -75.8      tropical storm        0
## 10   Amy  1975     6    29     6  34.0 -74.8      tropical storm        0
## # ... with 10,000 more rows, and 4 more variables: wind <int>,
## #   pressure <int>, ts_diameter <dbl>, hu_diameter <dbl>

Use of the vars() function is a little strange for me. Most dplyr syntax doesn’t force you to use this convention. I’m sure there is a good reason for it.

What is this vars() function?

According to the vars() documentation:

vars() is intended to provide equivalent semantics to select(). It is used for instance in scoped summarising and mutating verbs (mutate_at() and summarise_at())

I tried it without vars() and got this error
Error in typeof(x) : object 'month' not found

Wickham tries to make everything as consistent as possible so I’m sure that this fits into the overall grammar in ways that I’m not currently aware of.

Better joining when missing values are present

Local join functions can now control how missing values are matched. The default value is na_matches = "na", which treats two missing values as equal. To prevent missing values from matching, use na_matches = "never".

#                         <need code example>

This behavior can also be changed at the session level.

pkgconfig::set_config("dplyr::na_matches", "never").

bind_rows() and combine() are more strict when coercing.

  • Logical values are no longer coerced to integer and numeric.
  • Date, POSIXct and other integer or double-based classes are no longer coerced to integer or double to avoid dropping important metadata. Wickham plans to continue improving this interface in the future.
#                          <Need code example>

The Big Change

TIDY EVALUATION

The biggest change is a new system for programming with dplyr, called tidy evaluation, or tidy eval for short.

Tidy eval is a system for capturing expressions and later evaluating them in the correct context.

This is important because it allows you to interpolate values in contexts where dplyr usually works with expressions.

my_var <- quo(homeworld)
my_var
## <quosure: global>
## ~homeworld

What is a quosure?

Quosures are quoted expressions that keep track of an environment (just like closure functions). They are implemented as a subclass of one-sided formulas. They are an essential piece of the tidy evaluation framework.

How is a quosure used?

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
## # A tibble: 49 x 3
##         homeworld   height  mass
##             <chr>    <dbl> <dbl>
##  1       Alderaan 176.3333  64.0
##  2    Aleen Minor  79.0000  15.0
##  3         Bespin 175.0000  79.0
##  4     Bestine IV 180.0000 110.0
##  5 Cato Neimoidia 191.0000  90.0
##  6          Cerea 198.0000  82.0
##  7       Champala 196.0000   NaN
##  8      Chandrila 150.0000   NaN
##  9   Concord Dawn 183.0000  79.0
## 10       Corellia 175.0000  78.5
## # ... with 39 more rows

The !! operator is dplyr’s implementation of quosures. Note that no quotes are used on the quo() input.

my_var has the value homeworld which is expected to map to the homeworld column of the starwars dataframe.

I noticed that this syntax does basically what I did in the first example but in a much more concise way. Good learning.

The Value-Add

This makes it possible to write your functions that work like dplyr functions, reducing the amount of copy-and-paste in your code:

starwars_mean <- function(my_var) {
  my_var <- enquo(my_var)
  
  starwars %>%
    group_by(!!my_var) %>%
    summarise_at(vars(height:mass), mean, na.rm = TRUE)
}
starwars_mean(homeworld)
## # A tibble: 49 x 3
##         homeworld   height  mass
##             <chr>    <dbl> <dbl>
##  1       Alderaan 176.3333  64.0
##  2    Aleen Minor  79.0000  15.0
##  3         Bespin 175.0000  79.0
##  4     Bestine IV 180.0000 110.0
##  5 Cato Neimoidia 191.0000  90.0
##  6          Cerea 198.0000  82.0
##  7       Champala 196.0000   NaN
##  8      Chandrila 150.0000   NaN
##  9   Concord Dawn 183.0000  79.0
## 10       Corellia 175.0000  78.5
## # ... with 39 more rows

This took a little bit to get my head around.

It introduces the enquo() function that converts the quo() version of the grouping factor back to something that can be used in a function.

These two functions are not a part of the dplyr package. They are part of the rlang package. What Wickham has done here is to harness another packages functionality to enhance dplyr functionality. The !! operator is the dplyr implementation of rlang classes.

Documentation: quo(), and enquo()

quo() quotes its input (i.e. captures R code without evaluation), captures the current environment, and bundles them in a quosure.

enquo() takes a symbol referring to a function argument, quotes the R code that was supplied to this argument, captures the environment where the function was called (and thus where the R code was typed), and bundles them in a quosure.

The .data pronoun

You can also use the new .data pronoun to refer to variables with strings:

my_var <- "homeworld"

starwars %>%
  group_by(.data[[my_var]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
## # A tibble: 49 x 3
##            my_var   height  mass
##             <chr>    <dbl> <dbl>
##  1       Alderaan 176.3333  64.0
##  2    Aleen Minor  79.0000  15.0
##  3         Bespin 175.0000  79.0
##  4     Bestine IV 180.0000 110.0
##  5 Cato Neimoidia 191.0000  90.0
##  6          Cerea 198.0000  82.0
##  7       Champala 196.0000   NaN
##  8      Chandrila 150.0000   NaN
##  9   Concord Dawn 183.0000  79.0
## 10       Corellia 175.0000  78.5
## # ... with 39 more rows

NOTE: This is useful when you’re writing packages that use dplyr code because it avoids an annoying note from R CMD check.

Observation

I can see where this mechanism will be useful. If you have a character vector of column names and you want to reference specific elements of the vector to choose individual columns then this is what you would use.

The introduction of tidy evaluation means that the standard evaluation (underscored) version of each main verb (filter_(), select_() etc) is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).

Conclusion

To me dplyr is a mature product and these changes should be considered enhancements to reduce coding and to use the assumptions that data scientists usually have as the default.
The big question is When will we get to v1.0 ;-) ?

END