Tidy evaluation is a special type of non-standard evaluation used throughout the tidyverse. There are two basic forms found in dplyr:
arrange(), count(), filter(), group_by(), mutate(), and summarise() use data masking so that you can use data variables as if they were variables in the environment (i.e. you write my_variable not df$my_variable).
across(), relocate(), rename(), select(), and pull() use tidy selection so you can easily choose variables based on their position, name, or type (e.g. starts_with("x") or where(is.numeric)).
To determine whether a function argument uses data masking or tidy selection, look at the documentation: in the arguments list, you’ll see <data-masking> or <tidy-select>.
Data masking and tidy selection make interactive data exploration fast and fluid, but they add some new challenges when you attempt to use them indirectly such as in a for loop or a function. This vignette shows you how to overcome those challenges. We’ll first go over the basics of data masking and tidy selection, talk about how to use them indirectly, and then show you a number of recipes to solve common problems.
Data masking makes data manipulation faster because it requires less typing. In most (but not all) base R functions you need to refer to variables with $, leading to code that repeats the name of the data frame many times:
starwars[starwars$homeworld == "Naboo" & starwars$species == "Human", ]
The dplyr equivalent of this code is more concise because data masking means you only need to type starwars once:
starwars %>% filter(homeworld == "Naboo", species == "Human")
The key idea behind data masking is that it blurs the line between the two different meanings of the word “variable”:
env-variables are “programming” variables that live in an environment. They are usually created with <-.
data-variables are “statistical” variables that live in a data frame. They usually come from data files (e.g. .csv, .xls), or are created manipulating existing variables.
To make those definitions a little more concrete, take this piece of code:
df <- data.frame(x = runif(3), y = runif(3))
df$x
#> [1] 0.08075014 0.83433304 0.60076089
It creates an env-variable, df, that contains two data-variables, x and y. Then it extracts the data-variable x out of the env-variable df using $.
I think this blurring of the meaning of “variable” is a really nice feature for interactive data analysis because it allows you to refer to data-vars as is, without any prefix. And this seems to be fairly intuitive since many newer R users will attempt to write diamonds[x == 0 | y == 0, ].
Unfortunately, this benefit does not come for free. When you start to program with these tools, you’re going to have to grapple with the distinction. This will be hard because you’ve never had to think about it before, so it’ll take a while for your brain to learn these new concepts and categories. However, once you’ve teased apart the idea of “variable” into data-variable and env-variable, I think you’ll find it fairly straightforward to use.
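To see the two kinds of variable side by side, here is a minimal sketch that uses base R's eval() as a stand-in for data masking. It is only an illustration of the lookup order, not how dplyr is implemented internally; the names offset and masked are made up for this example.

```r
# A data frame holding one data-variable, x
df <- data.frame(x = c(1, 2, 3))

# Two env-variables; one of them deliberately shares the name x
x <- 100
offset <- 10

# eval() looks names up in df first and falls back to the calling
# environment, so x resolves to the column and offset to the env-variable
masked <- eval(quote(x + offset), df)
masked
```

The column x "masks" the env-variable of the same name, which is exactly the ambiguity you have to manage once you start programming with these functions.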
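The main challenge of programming with functions that use data masking arises when you introduce some indirection, i.e. when you want to get the data-variable from an env-variable instead of directly typing the data-variable’s name. There are two main cases, illustrated in turn by the two examples that follow. When the data-variable arrives in a function argument, embrace the argument by surrounding it in doubled braces, as in summarise(df, min = min({{ var }})). When the name of the data-variable is stored in a character vector, index into the .data pronoun with [[, as in count(df, .data[[var]]).
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)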
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## cyl n min max
## <dbl> <int> <dbl> <dbl>
## 1 4 11 21.4 33.9
## 2 6 7 17.8 21.4
## 3 8 14 10.4 19.2
for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}
Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don’t expect other functions to work with it.
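As a small sketch of the second case outside a loop, the function below takes a column name as a string and looks it up through the .data pronoun. The helper name mean_by_name and its argument var_name are hypothetical, chosen only for this illustration.

```r
library(dplyr)

# Hypothetical helper: var_name is a length-one character vector
# naming the column to average
mean_by_name <- function(data, var_name) {
  data %>% summarise(mean = mean(.data[[var_name]]))
}

res <- mean_by_name(mtcars, "mpg")
res
```

Because the lookup goes through .data, a typo in var_name produces an error rather than silently picking up an env-variable of the same name.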
Data masking makes it easy to compute on values within a dataset. Tidy selection is a complementary tool that makes it easy to work with the columns of a dataset.
Underneath all functions that use tidy selection is the tidyselect package. It provides a miniature domain specific language that makes it easy to select columns by name, position, or type. For example:
select(df, 1) selects the first column; select(df, last_col()) selects the last column.
select(df, c(a, b, c)) selects columns a, b, and c.
select(df, starts_with("a")) selects all columns whose name starts with "a"; select(df, ends_with("z")) selects all columns whose name ends with "z".
select(df, where(is.numeric)) selects all numeric columns.
You can see more details in ?dplyr_tidy_select.
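The selections listed above can be tried on a small table. This is a sketch on a made-up tibble (the column names a, b, a2 and z are assumptions for the example); names() is used only to make the selected columns easy to see.

```r
library(dplyr)

# A made-up table with integer, character, double and logical columns
df <- tibble(
  a  = 1:3,
  b  = c("u", "v", "w"),
  a2 = c(0.5, 1.5, 2.5),
  z  = c(TRUE, FALSE, TRUE)
)

by_position <- names(select(df, 1))                  # first column
by_last     <- names(select(df, last_col()))         # last column
by_name     <- names(select(df, c(a, b)))            # columns a and b
by_prefix   <- names(select(df, starts_with("a")))   # names starting with "a"
by_type     <- names(select(df, where(is.numeric)))  # numeric columns only
```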
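As with data masking, tidy selection makes a common task easier at the cost of making a less common task harder. When you want to use tidy selection indirectly, with the column specification stored in an intermediate variable, you’ll need to learn some new tools. Again, there are two forms of indirection, illustrated in turn by the two examples that follow. When the selection arrives in a function argument, use the same technique as for data masking and embrace the argument, as in across({{ vars }}, mean). When the column names are stored in a character vector, wrap the vector in all_of(); negate it with ! to select every column not named in the vector.
summarise_mean <- function(data, vars) {
  data %>% summarise(n = n(), across({{ vars }}, mean))
}
mtcars %>%
  group_by(cyl, gear) %>%
  summarise_mean(where(is.numeric))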
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
## # A tibble: 8 x 12
## # Groups: cyl [3]
## cyl gear n mpg disp hp drat wt qsec vs am carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 3 1 21.5 120. 97 3.7 2.46 20.0 1 0 1
## 2 4 4 8 26.9 103. 76 4.11 2.38 19.6 1 0.75 1.5
## 3 4 5 2 28.2 108. 102 4.1 1.83 16.8 0.5 1 2
## 4 6 3 2 19.8 242. 108. 2.92 3.34 19.8 1 0 1
## 5 6 4 4 19.8 164. 116. 3.91 3.09 17.7 0.5 0.5 4
## 6 6 5 1 19.7 145 175 3.62 2.77 15.5 0 1 6
## 7 8 3 12 15.0 358. 194. 3.12 4.10 17.1 0 0 3.08
## 8 8 5 2 15.4 326 300. 3.88 3.37 14.6 0 1 6
vars <- c("mpg", "gear")
mtcars %>% select(all_of(vars)) %>% head()
## mpg gear
## Mazda RX4 21.0 4
## Mazda RX4 Wag 21.0 4
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 3
## Hornet Sportabout 18.7 3
## Valiant 18.1 3
mtcars %>% select(!all_of(vars)) %>% head()
## cyl disp hp drat wt qsec vs am carb
## Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4
## Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4
## Datsun 710 4 108 93 3.85 2.320 18.61 1 1 1
## Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 1
## Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 2
## Valiant 6 225 105 2.76 3.460 20.22 1 0 1
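The example above uses all_of(), which is strict: it fails if any requested name is missing. tidyselect also provides any_of(), which quietly skips names that are not present. The sketch below contrasts the two; the name "not_a_column" is deliberately bogus, and the variables lenient and strict_failed exist only for this illustration.

```r
library(dplyr)

vars <- c("mpg", "gear", "not_a_column")

# any_of() keeps the names that exist and ignores the rest
lenient <- mtcars %>% select(any_of(vars))

# all_of() raises an error when a name is missing; capture that here
strict_failed <- tryCatch(
  {
    mtcars %>% select(all_of(vars))
    FALSE
  },
  error = function(e) TRUE
)
```

Which one you want depends on whether a missing column should be treated as a bug in the caller or as an expected situation.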
The following examples solve a grab bag of common problems.
If you check the documentation, you’ll see that the .data argument of dplyr verbs never uses data masking or tidy select. That means you don’t need to do anything special when your function simply passes a data frame through:
mutate_y <- function(data) {
  mutate(data, y = a + x)
}
If you’re writing a package and you have a function that uses data-variables:
my_summary_function <- function(data) {
  data %>%
    filter(x > 0) %>%
    group_by(grp) %>%
    summarise(y = mean(y), n = n())
}
You’ll get an R CMD check NOTE:
N checking R code for possible problems
my_summary_function: no visible binding for global variable ‘x’, ‘grp’, ‘y’
Undefined global functions or variables:
x grp y
You can eliminate this by using .data$var and importing .data from its source in the rlang package (the underlying package that implements tidy evaluation):
```r
#' @importFrom rlang .data
my_summary_function <- function(data) {
  data %>%
    filter(.data$x > 0) %>%
    group_by(.data$grp) %>%
    summarise(y = mean(.data$y), n = n())
}
```
### One or more user-supplied expressions
If you want the user to supply an expression that’s passed onto an argument which uses data masking or tidy select, embrace the argument:
```r
my_summarise <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean = mean(mpg))
}
mtcars %>% my_summarise(gear)
```
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## gear mean
## <dbl> <dbl>
## 1 3 16.1
## 2 4 24.5
## 3 5 21.4
This generalises in a straightforward way if you want to use one user-supplied expression in multiple places:
my_summarise2 <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}
mtcars %>% my_summarise2(mpg)
## mean sum n
## 1 20.09062 642.9 32
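An embraced argument is not limited to a bare column name; the caller can pass any expression built from data-variables. The sketch below repeats my_summarise2() so that it runs on its own and calls it with a ratio of two columns; the choice of hp / wt is just an example.

```r
library(dplyr)

# Same definition as above, repeated so the block is self-contained
my_summarise2 <- function(data, expr) {
  data %>% summarise(
    mean = mean({{ expr }}),
    sum = sum({{ expr }}),
    n = n()
  )
}

# The whole expression hp / wt is evaluated inside the data mask
ratio_summary <- mtcars %>% my_summarise2(hp / wt)
ratio_summary
```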
If you want the user to provide multiple expressions, embrace each of them:
my_summarise3 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(mean = mean({{ mean_var }}), sd = sd({{ sd_var }}))
}
mtcars %>% my_summarise3(wt, mpg)
## mean sd
## 1 3.21725 6.026948
If you want to use the names of variables in the output, you can use glue syntax in conjunction with :=:
my_summarise4 <- function(data, expr) {
  data %>% summarise(
    "mean_{{expr}}" := mean({{ expr }}),
    "sum_{{expr}}" := sum({{ expr }}),
    "n_{{expr}}" := n()
  )
}
my_summarise5 <- function(data, mean_var, sd_var) {
  data %>%
    summarise(
      "mean_{{mean_var}}" := mean({{ mean_var }}),
      "sd_{{sd_var}}" := sd({{ sd_var }})
    )
}
mtcars %>% my_summarise5(mpg, wt)
## mean_mpg sd_wt
## 1 20.09062 0.9784574
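my_summarise4() is defined above but never called, so here is a short sketch that runs it and inspects the generated column names. The definition is repeated to keep the block self-contained; the variable out is only for illustration.

```r
library(dplyr)

# Same definition as above: the glue-style names on the left of :=
# are filled in from the expression the caller supplies
my_summarise4 <- function(data, expr) {
  data %>% summarise(
    "mean_{{expr}}" := mean({{ expr }}),
    "sum_{{expr}}" := sum({{ expr }}),
    "n_{{expr}}" := n()
  )
}

out <- mtcars %>% my_summarise4(mpg)
names(out)
```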
If you want to take an arbitrary number of user-supplied expressions, use the ... argument. This is most often useful when you want to give the user full control over a single part of the pipeline, like a group_by() or a mutate().
When you use ... in this way, make sure that any other arguments start with . to reduce the chances of argument clashes; see https://design.tidyverse.org/dots-prefix.html for more details.
my_summarise <- function(.data, ...) {
  .data %>%
    group_by(...) %>%
    summarise(mass = mean(mass, na.rm = TRUE), height = mean(height, na.rm = TRUE))
}
starwars %>% my_summarise(homeworld) %>% head()
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
## homeworld mass height
## <chr> <dbl> <dbl>
## 1 Alderaan 64 176.
## 2 Aleen Minor 15 79
## 3 Bespin 79 175
## 4 Bestine IV 110 180
## 5 Cato Neimoidia 90 191
## 6 Cerea 82 198
starwars %>% my_summarise(sex, gender)
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 6 x 4
## # Groups: sex [5]
## sex gender mass height
## <chr> <chr> <dbl> <dbl>
## 1 female feminine 54.7 169.
## 2 hermaphroditic masculine 1358 175
## 3 male masculine 81.0 179.
## 4 none feminine NaN 96
## 5 none masculine 69.8 140
## 6 <NA> <NA> 48 181.
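To make the dot-prefix advice concrete, here is a sketch of a wrapper that forwards ... to group_by() and has one extra option. The function count_by() and its .sort argument are hypothetical names invented for this example; .groups = "drop" is used so the result comes back ungrouped.

```r
library(dplyr)

# Hypothetical helper: counts rows per group. The extra option is named
# .sort (with a leading dot) so it cannot collide with a grouping
# expression that the caller passes through ...
count_by <- function(.data, ..., .sort = FALSE) {
  out <- .data %>%
    group_by(...) %>%
    summarise(n = n(), .groups = "drop")
  if (.sort) {
    out <- arrange(out, desc(n))
  }
  out
}

by_cyl <- count_by(mtcars, cyl, .sort = TRUE)
by_cyl
```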
If you want the user to provide a set of data-variables that are then transformed, use across():
my_summarise <- function(data, summary_vars) {
  data %>%
    summarise(across({{ summary_vars }}, ~ mean(., na.rm = TRUE)))
}
starwars %>%
  group_by(species) %>%
  my_summarise(c(mass, height)) %>%
  head()
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
## species mass height
## <chr> <dbl> <dbl>
## 1 Aleena 15 79
## 2 Besalisk 102 198
## 3 Cerean 82 198
## 4 Chagrian NaN 196
## 5 Clawdite 55 168
## 6 Droid 69.8 131.
You can use this same idea for multiple sets of input data-variables:
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean))
}
mtcars %>% my_summarise(c(cyl, gear), c(mpg, wt))
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
## # A tibble: 8 x 4
## # Groups: cyl [3]
## cyl gear mpg wt
## <dbl> <dbl> <dbl> <dbl>
## 1 4 3 21.5 2.46
## 2 4 4 26.9 2.38
## 3 4 5 28.2 1.83
## 4 6 3 19.8 3.34
## 5 6 4 19.8 3.09
## 6 6 5 19.7 2.77
## 7 8 3 15.0 4.10
## 8 8 5 15.4 3.37
Use the .names argument to across() to control the names of the output.
my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean, .names = "mean_{.col}"))
}
mtcars %>% my_summarise(c(am, gear), c(mpg, hp))
## `summarise()` regrouping output by 'am' (override with `.groups` argument)
## # A tibble: 4 x 4
## # Groups: am [2]
## am gear mean_mpg mean_hp
## <dbl> <dbl> <dbl> <dbl>
## 1 0 3 16.1 176.
## 2 0 4 21.0 101.
## 3 1 4 26.3 83.9
## 4 1 5 21.4 196.
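The .names specification becomes more useful when several summary functions are applied at once. The sketch below passes a named list of functions to across() and combines the {.col} and {.fn} placeholders; the wrapper name summarise_stats() is hypothetical.

```r
library(dplyr)

# Hypothetical wrapper: one set of grouping columns, one set of columns
# to summarise, and two summary functions applied to each of them
summarise_stats <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(
      across(
        {{ summarise_var }},
        list(mean = mean, sd = sd),
        .names = "{.col}_{.fn}"
      ),
      .groups = "drop"
    )
}

stats <- summarise_stats(mtcars, cyl, c(mpg, wt))
names(stats)
```

Each output column is named after the input column and the list name of the function that produced it.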
If you have a character vector of variable names, and want to operate on them with a for loop, index into the special .data pronoun:
for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% head(1) %>% print()
}
## mpg n
## 1 10.4 2
## cyl n
## 1 4 11
## disp n
## 1 71.1 1
## hp n
## 1 52 1
## drat n
## 1 2.76 2
## wt n
## 1 1.513 1
## qsec n
## 1 14.5 1
## vs n
## 1 0 18
## am n
## 1 0 19
## gear n
## 1 3 15
## carb n
## 1 1 7
This same technique works with for loop alternatives like the base R apply() family and the purrr map() family:
mtcars %>%
  names() %>%
  purrr::map(~ count(mtcars, .data[[.x]]))
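For the base R side of that statement, a sketch with lapply() looks much the same. Naming the resulting list (the counts variable here is just for illustration) makes it easy to pull out the table for one column afterwards.

```r
library(dplyr)

# One count table per column, looked up by name through .data
counts <- lapply(names(mtcars), function(var) count(mtcars, .data[[var]]))

# Label each table with the column it came from
names(counts) <- names(mtcars)

counts$cyl
```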
Many Shiny input controls return character vectors, so you can use the same approach as above: .data[[input$var]].
library(shiny)
library(dplyr)
library(ggplot2) # provides the diamonds dataset
ui <- fluidPage(
  selectInput("var", "Variable", choices = names(diamonds)),
  tableOutput("output")
)
server <- function(input, output, session) {
  # input$var is a string, so look the column up through .data
  data <- reactive(filter(diamonds, .data[[input$var]] > 0))
  output$output <- renderTable(head(data()))
}
See https://mastering-shiny.org/action-tidy.html for more details and case studies.
select_col <- function(df, col_name) {
  # col_name may be bare column names or, as below, a character vector
  df %>% select({{ col_name }})
}
mtcars %>% select_col(c("cyl", "mpg", "gear")) %>% head()
## cyl mpg gear
## Mazda RX4 6 21.0 4
## Mazda RX4 Wag 6 21.0 4
## Datsun 710 4 22.8 4
## Hornet 4 Drive 6 21.4 3
## Hornet Sportabout 8 18.7 3
## Valiant 6 18.1 3
mutate_col <- function(df, col_name) {
  # col_name is a character vector; copy its second column into col2
  df %>%
    select(all_of(col_name)) %>%
    mutate(col2 = .data[[col_name[2]]])
}
mtcars %>% mutate_col(c("cyl", "mpg", "gear")) %>% head()
## cyl mpg gear col2
## Mazda RX4 6 21.0 4 21.0
## Mazda RX4 Wag 6 21.0 4 21.0
## Datsun 710 4 22.8 4 22.8
## Hornet 4 Drive 6 21.4 3 21.4
## Hornet Sportabout 8 18.7 3 18.7
## Valiant 6 18.1 3 18.1
groupby_col <- function(df, col_name) {
  # Group by a character vector of names using tidy selection
  df %>%
    group_by(across(all_of(col_name))) %>%
    summarise(n = n())
}
groupby_col2 <- function(df, col_name) {
  # Same result: turn the strings into symbols and splice them in
  df %>%
    group_by(!!!rlang::syms(col_name)) %>%
    summarise(n = n())
}
mtcars %>% groupby_col(c("cyl", "gear")) %>% head()
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups: cyl [2]
## cyl gear n
## <dbl> <dbl> <int>
## 1 4 3 1
## 2 4 4 8
## 3 4 5 2
## 4 6 3 2
## 5 6 4 4
## 6 6 5 1
mtcars %>% groupby_col2(c("cyl","gear")) %>% head()
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups: cyl [2]
## cyl gear n
## <dbl> <dbl> <int>
## 1 4 3 1
## 2 4 4 8
## 3 4 5 2
## 4 6 3 2
## 5 6 4 4
## 6 6 5 1
summarise_mean <- function(data, vars) {
  data %>% summarise(n = n(), across({{ vars }}, mean))
}
mtcars %>%
  group_by(cyl) %>%
  summarise_mean(where(is.numeric))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 12
## cyl n mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 11 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
## 2 6 7 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
## 3 8 14 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5