The dplyr package is one of my workhorses when manipulating dataframes in R. The verb-based approach fits comfortably with my SQL background. It is a part of Hadley Wickham’s tidyverse which is a toolbox for almost anything a data scientist would need, with a common grammer. This paper exclusively walks through the new functionality and datasets available in the new version of dplyr. At the end we will look at dplyrs implementation of rlang functionality to better provide column-references for functions and apps.
Blog Source: https://blog.rstudio.org/2017/06/13/dplyr-0-7-0/
Vignette: http://dplyr.tidyverse.org/articles/programming.html
library(tidyverse)
library(rlang)
packageVersion("dplyr")
## [1] '0.7.0'
Good, we can see that dplyr is v0.7.0 for this exercise.
To show the basic dplyr functionality we’ll use one of the new datasets called starwars.
starwars datasetglimpse(starwars)
## Observations: 87
## Variables: 13
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
## $ gender <chr> "male", NA, NA, "male", "female", "male", "female",...
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
## $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
## $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...
Note that the last three columns are actually embedded lists.
dplyr syntax exampledplyr is all about manipulating dataframes. This is made simpler by the new version.
In this case you start with the starwars dataframe and grind it down to what wou want.
set.seed(317)
starwars %>%
select(species, mass, height, birth_year) %>%
filter(height >= 100) %>%
group_by(species) %>%
na.omit() %>%
summarise(avg_mass = mean(mass), avg_height = mean(height),
avg_birth_year = mean(birth_year)) %>%
sample_n(6) %>%
arrange(-avg_mass)
## # A tibble: 6 x 4
## species avg_mass avg_height avg_birth_year
## <chr> <dbl> <dbl> <dbl>
## 1 Trandoshan 113.0 190.0 53.0
## 2 Droid 107.5 183.5 63.5
## 3 Kel Dor 80.0 188.0 22.0
## 4 Zabrak 80.0 175.0 54.0
## 5 Gungan 66.0 196.0 52.0
## 6 Mirialan 53.1 168.0 49.0
In this example I select the species, for grouping. I then select all of the columns with continuous data, filter out the very small species, like R2D2, group by species then drop NA rows, get the average height, mass, and birth_year, and then sample six of the entire data set, and sorts the species with the highest mass to the top.
NOTE: Since this uses the sample_n() function we are introducing randomness. For reproducibility I have set the seed to 317.
NOTE: I really didn’t want to filter out R2D2 but there were few other options.
storms has the trajectories of ~200 tropical storms. It contains a strong grouping structure.
glimpse(storms)
## Observations: 10,010
## Variables: 13
## $ name <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "...
## $ year <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 19...
## $ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7,...
## $ day <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 30...
## $ hour <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12...
## $ lat <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3, 34.0, 34...
## $ long <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7, -78.0, -...
## $ status <chr> "tropical depression", "tropical depression", "tro...
## $ category <ord> -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, ...
## $ wind <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 45, 50, 50...
## $ pressure <int> 1013, 1013, 1013, 1013, 1012, 1012, 1011, 1006, 10...
## $ ts_diameter <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ hu_diameter <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
These are tiny datasets of bands. It’s designed to be very simple so you can illustrate how joins work without getting distracted by the details of the data.
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
band_instruments2
## # A tibble: 3 x 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
The pull() function allows you to extract a single column either by name or position. It’s similar to select() but returns a vector, rather than a single-column dataframe.
mtcarsFor this we’re switching to another dataset included in the tidyverse called mtcars The mtcars data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 other aspects of automobile design and performance for 32 automobiles (1973-74 models). That’s 44 years ago!!!
mtcars is unusual as the car names are stored in the row name instead of any actual column. This was probably done to maintain the complete numerical structure of the data.
row.names(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
glimpse(mtcars)
## Observations: 32
## Variables: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
This is the most straight-forward approach.
mtcars %>% pull(cyl) %>% str()
## num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
Sometimes this works better. In this example I pull he same column as above, cyl.
mtcars %>% pull(2) %>% str()
## num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
NOTE: The result is a vector and not a data_frame.
arrange()arrange() is used for sorting data by specified variables. You can also specify many variables to create progressive grouping.
arrange()This is very good new functionality in dplyr.
Assumption: The dataframe is already grouped.
Gains a .by_group argument so you can choose to sort by groups if you want to. Defaults to FALSE
mtcars_grouped <- mtcars %>%
select(cyl, gear, carb, everything()) %>%
group_by(cyl, gear)
mtcars by cyl and gear, ascending.head(arrange(mtcars_grouped, cyl, gear))
## # A tibble: 6 x 11
## # Groups: cyl, gear [2]
## cyl gear carb mpg disp hp drat wt qsec vs am
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 3 1 21.5 120.1 97 3.70 2.465 20.01 1 0
## 2 4 4 1 22.8 108.0 93 3.85 2.320 18.61 1 1
## 3 4 4 2 24.4 146.7 62 3.69 3.190 20.00 1 0
## 4 4 4 2 22.8 140.8 95 3.92 3.150 22.90 1 0
## 5 4 4 1 32.4 78.7 66 4.08 2.200 19.47 1 1
## 6 4 4 2 30.4 75.7 52 4.93 1.615 18.52 1 1
head(arrange(mtcars_grouped, desc(gear)))
## # A tibble: 6 x 11
## # Groups: cyl, gear [4]
## cyl gear carb mpg disp hp drat wt qsec vs am
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 5 2 26.0 120.3 91 4.43 2.140 16.70 0 1
## 2 4 5 2 30.4 95.1 113 3.77 1.513 16.90 1 1
## 3 8 5 4 15.8 351.0 264 4.22 3.170 14.50 0 1
## 4 6 5 6 19.7 145.0 175 3.62 2.770 15.50 0 1
## 5 8 5 8 15.0 301.0 335 3.54 3.570 14.60 0 1
## 6 6 4 4 21.0 160.0 110 3.90 2.620 16.46 0 1
Not good, because in a lot of cases you want to see the highest or lowest values by group.
by_cyl <- mtcars_grouped %>% group_by(cyl)
by_cyl %>% arrange(desc(gear))
## # A tibble: 32 x 11
## # Groups: cyl [3]
## cyl gear carb mpg disp hp drat wt qsec vs am
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 5 2 26.0 120.3 91 4.43 2.140 16.70 0 1
## 2 4 5 2 30.4 95.1 113 3.77 1.513 16.90 1 1
## 3 8 5 4 15.8 351.0 264 4.22 3.170 14.50 0 1
## 4 6 5 6 19.7 145.0 175 3.62 2.770 15.50 0 1
## 5 8 5 8 15.0 301.0 335 3.54 3.570 14.60 0 1
## 6 6 4 4 21.0 160.0 110 3.90 2.620 16.46 0 1
## 7 6 4 4 21.0 160.0 110 3.90 2.875 17.02 0 1
## 8 4 4 1 22.8 108.0 93 3.85 2.320 18.61 1 1
## 9 4 4 2 24.4 146.7 62 3.69 3.190 20.00 1 0
## 10 4 4 2 22.8 140.8 95 3.92 3.150 22.90 1 0
## # ... with 22 more rows
.by_group parameter does what you wantby_cyl %>% arrange(desc(gear), .by_group = TRUE)
## # A tibble: 32 x 11
## # Groups: cyl [3]
## cyl gear carb mpg disp hp drat wt qsec vs am
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 5 2 26.0 120.3 91 4.43 2.140 16.70 0 1
## 2 4 5 2 30.4 95.1 113 3.77 1.513 16.90 1 1
## 3 4 4 1 22.8 108.0 93 3.85 2.320 18.61 1 1
## 4 4 4 2 24.4 146.7 62 3.69 3.190 20.00 1 0
## 5 4 4 2 22.8 140.8 95 3.92 3.150 22.90 1 0
## 6 4 4 1 32.4 78.7 66 4.08 2.200 19.47 1 1
## 7 4 4 2 30.4 75.7 52 4.93 1.615 18.52 1 1
## 8 4 4 1 33.9 71.1 65 4.22 1.835 19.90 1 1
## 9 4 4 1 27.3 79.0 66 4.08 1.935 18.90 1 1
## 10 4 4 2 21.4 121.0 109 4.11 2.780 18.60 1 1
## # ... with 22 more rows
Very useful
Use these if you want to do something to: * every variable (_all), * variables selected by their names (_at), * or variables that satisfy some predicate (_if).
This example summarizes all numeric columns by returning the mean. Great for separate processing by column datatype.
iris %>% summarise_if(is.numeric, mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.843333 3.057333 3.758 1.199333
This selects all columns that do not contain sub-lists. Very useful.
starwars %>% select_if(Negate(is.list))
## # A tibble: 87 x 10
## name height mass hair_color skin_color eye_color
## <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Skywalker 172 77 blond fair blue
## 2 C-3PO 167 75 <NA> gold yellow
## 3 R2-D2 96 32 <NA> white, blue red
## 4 Darth Vader 202 136 none white yellow
## 5 Leia Organa 150 49 brown light brown
## 6 Owen Lars 178 120 brown, grey light blue
## 7 Beru Whitesun lars 165 75 brown light blue
## 8 R5-D4 97 32 <NA> white, red red
## 9 Biggs Darklighter 183 84 black light brown
## 10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
## # ... with 77 more rows, and 4 more variables: birth_year <dbl>,
## # gender <chr>, homeworld <chr>, species <chr>
This is a great example when you want to group by multiple variables.
It groups successively by each column from year to hour. Note that this works best when your data columns follow the groupings that are needed. You can individually select columns but your data looks better and this function works more naturally when they are in the correct order
storms %>% group_by_at(vars(year:hour))
## # A tibble: 10,010 x 13
## # Groups: year, month, day, hour [7,958]
## name year month day hour lat long status category
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord>
## 1 Amy 1975 6 27 0 27.5 -79.0 tropical depression -1
## 2 Amy 1975 6 27 6 28.5 -79.0 tropical depression -1
## 3 Amy 1975 6 27 12 29.5 -79.0 tropical depression -1
## 4 Amy 1975 6 27 18 30.5 -79.0 tropical depression -1
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical depression -1
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical depression -1
## 7 Amy 1975 6 28 12 33.3 -78.0 tropical depression -1
## 8 Amy 1975 6 28 18 34.0 -77.0 tropical depression -1
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical storm 0
## 10 Amy 1975 6 29 6 34.0 -74.8 tropical storm 0
## # ... with 10,000 more rows, and 4 more variables: wind <int>,
## # pressure <int>, ts_diameter <dbl>, hu_diameter <dbl>
Use of the vars() function is a little strange for me. Most dplyr syntax doesn’t force you to use this convention. I’m sure there is a good reason for it.
vars() function?According to the vars() documentation:
vars()is intended to provide equivalent semantics toselect(). It is used for instance in scoped summarising and mutating verbs (mutate_at()andsummarise_at())
I tried it without vars() and got this error
Error in typeof(x) : object 'month' not found
Wickham tries to make everything as consistent as possible so I’m sure that this fits into the overall grammar in ways that I’m not currently aware of.
Local join functions can now control how missing values are matched. The default value is na_matches = "na", which treats two missing values as equal. To prevent missing values from matching, use na_matches = "never".
# <need code example>
This behavior can also be changed at the session level.
pkgconfig::set_config("dplyr::na_matches", "never").
bind_rows() and combine() are more strict when coercing.# <Need code example>
The biggest change is a new system for programming with dplyr, called tidy evaluation, or tidy eval for short.
Tidy eval is a system for capturing expressions and later evaluating them in the correct context.
This is important because it allows you to interpolate values in contexts where dplyr usually works with expressions.
my_var <- quo(homeworld)
my_var
## <quosure: global>
## ~homeworld
quosure?Quosures are quoted expressions that keep track of an environment (just like closure functions). They are implemented as a subclass of one-sided formulas. They are an essential piece of the tidy evaluation framework.
starwars %>%
group_by(!!my_var) %>%
summarise_at(vars(height:mass), mean, na.rm = TRUE)
## # A tibble: 49 x 3
## homeworld height mass
## <chr> <dbl> <dbl>
## 1 Alderaan 176.3333 64.0
## 2 Aleen Minor 79.0000 15.0
## 3 Bespin 175.0000 79.0
## 4 Bestine IV 180.0000 110.0
## 5 Cato Neimoidia 191.0000 90.0
## 6 Cerea 198.0000 82.0
## 7 Champala 196.0000 NaN
## 8 Chandrila 150.0000 NaN
## 9 Concord Dawn 183.0000 79.0
## 10 Corellia 175.0000 78.5
## # ... with 39 more rows
The !! operator is dplyr’s implementation of quosures. Note that no quotes are used on the quo() input.
my_var has the value homeworld which is expected to map to the homeworld column of the starwars dataframe.
I noticed that this syntax does basically what I did in the first example but in a much more concise way. Good learning.
This makes it possible to write your functions that work like dplyr functions, reducing the amount of copy-and-paste in your code:
starwars_mean <- function(my_var) {
my_var <- enquo(my_var)
starwars %>%
group_by(!!my_var) %>%
summarise_at(vars(height:mass), mean, na.rm = TRUE)
}
starwars_mean(homeworld)
## # A tibble: 49 x 3
## homeworld height mass
## <chr> <dbl> <dbl>
## 1 Alderaan 176.3333 64.0
## 2 Aleen Minor 79.0000 15.0
## 3 Bespin 175.0000 79.0
## 4 Bestine IV 180.0000 110.0
## 5 Cato Neimoidia 191.0000 90.0
## 6 Cerea 198.0000 82.0
## 7 Champala 196.0000 NaN
## 8 Chandrila 150.0000 NaN
## 9 Concord Dawn 183.0000 79.0
## 10 Corellia 175.0000 78.5
## # ... with 39 more rows
This took a little bit to get my head around.
It introduces the enquo() function that converts the quo() version of the grouping factor back to something that can be used in a function.
These two functions are not a part of the dplyr package. They are part of the rlang package. What Wickham has done here is to harness another packages functionality to enhance dplyr functionality. The !! operator is the dplyr implementation of rlang classes.
quo(), and enquo()quo() quotes its input (i.e. captures R code without evaluation), captures the current environment, and bundles them in a quosure.
enquo() takes a symbol referring to a function argument, quotes the R code that was supplied to this argument, captures the environment where the function was called (and thus where the R code was typed), and bundles them in a quosure.
.data pronounYou can also use the new .data pronoun to refer to variables with strings:
my_var <- "homeworld"
starwars %>%
group_by(.data[[my_var]]) %>%
summarise_at(vars(height:mass), mean, na.rm = TRUE)
## # A tibble: 49 x 3
## my_var height mass
## <chr> <dbl> <dbl>
## 1 Alderaan 176.3333 64.0
## 2 Aleen Minor 79.0000 15.0
## 3 Bespin 175.0000 79.0
## 4 Bestine IV 180.0000 110.0
## 5 Cato Neimoidia 191.0000 90.0
## 6 Cerea 198.0000 82.0
## 7 Champala 196.0000 NaN
## 8 Chandrila 150.0000 NaN
## 9 Concord Dawn 183.0000 79.0
## 10 Corellia 175.0000 78.5
## # ... with 39 more rows
NOTE: This is useful when you’re writing packages that use dplyr code because it avoids an annoying note from R CMD check.
I can see where this mechanism will be useful. If you have a character vector of column names and you want to reference specific elements of the vector to choose individual columns then this is what you would use.
The introduction of tidy evaluation means that the standard evaluation (underscored) version of each main verb (filter_(), select_() etc) is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).
To me dplyr is a mature product and these changes should be considered enhancements to reduce coding and to use the assumptions that data scientists usually have as the default.
The big question is When will we get to v1.0 ;-) ?