across()The newest versions of the dplyr package introduced a new function, across(), to be used within summarise, mutate and filter functions, and I am trying to learn it. I am, however, having a hard time understanding the use and the mechanics of the function in combination with filter. While this vignette has ample examples on mutate and summarise, the examples on filter are few and not very insightful. The same is true for several non-official guides that I have read.
As mentioned, the dplyr documentation provides examples with the summarise and mutate functions, but it does not deal extensively with the filter function, which is not (in my opinion) straightforward to use at all. The objective of this pamphlet is to introduce the reader to the (proper and transparent) usage of across() in conjunction with filter, but also to remind the writer of the functionality of across(). Finally, it aims to help new and experienced programmers wrap their head around this new function. After reading this pamphlet, you should be familiar enough with across() to be able to replace the _all, _at, and _if function family intuitively. I will illustrate how this works with the help of some examples.
First, let us look at the most basic usage of across in filter: filtering a dataset based on all rows meeting a certain condition or requirement.
Suppose we want to select rows in a dataset such that for every character variable, the length of the string should be larger than a particular number, in this example, 5. This would be the equivalent of filter_all in a data.frame consisting purely of character vectors, or otherwise, of filter_at.
First, load the relevant packages..
As mentioned in the vignette, the across function without any additions works automatically as all_vars, as evidenced by the following behavior:
a b
1 10000 10000
2 90000 20000
3 60000 4000
4 3000 30000
a b
1 10000 10000
2 90000 20000
We can also specify which variables we want to meet the condition. Let’s take the dataset from before and apply the filter only on the first column:
a b
1 10000 10000
2 90000 20000
3 60000 4000
As as second example, et’s implement a condition requiring that the string length of every character variable be greater than 2 (or 3) in the dplyr::starwars dataset:
# A tibble: 69 x 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke… 172 77 blond fair blue 19 male mascu…
2 Dart… 202 136 none white yellow 41.9 male mascu…
3 Leia… 150 49 brown light brown 19 fema… femin…
4 Owen… 178 120 brown, gr… light blue 52 male mascu…
5 Beru… 165 75 brown light blue 47 fema… femin…
6 Bigg… 183 84 black light brown 24 male mascu…
7 Obi-… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anak… 188 84 blond fair blue 41.9 male mascu…
9 Wilh… 180 NA auburn, g… fair blue 64 male mascu…
10 Chew… 228 112 brown unknown blue 200 male mascu…
# … with 59 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
# A tibble: 63 x 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke… 172 77 blond fair blue 19 male mascu…
2 Dart… 202 136 none white yellow 41.9 male mascu…
3 Leia… 150 49 brown light brown 19 fema… femin…
4 Owen… 178 120 brown, gr… light blue 52 male mascu…
5 Beru… 165 75 brown light blue 47 fema… femin…
6 Bigg… 183 84 black light brown 24 male mascu…
7 Obi-… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anak… 188 84 blond fair blue 41.9 male mascu…
9 Wilh… 180 NA auburn, g… fair blue 64 male mascu…
10 Chew… 228 112 brown unknown blue 200 male mascu…
# … with 53 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
We select all character variables with is.character, and condition the filter on the string length of all variables being larger than 5.
Similarly, consider the following modified dataframe from the colwise vignette I mentioned above:
df <- tibble(x = c("a", "b"),
y = c(1, 1),
z = c(-1, 1),
w = c("Hanz", "Genghis Khan"),
u = c("Werner", "Monsieur Eugene Duchene"))
df# A tibble: 2 x 5
x y z w u
<chr> <dbl> <dbl> <chr> <chr>
1 a 1 -1 Hanz Werner
2 b 1 1 Genghis Khan Monsieur Eugene Duchene
Supppose we want to select, in parallel to the example given in the vignete, the rows for which all numeric variables are greater than -1:
# A tibble: 1 x 5
x y z w u
<chr> <dbl> <dbl> <chr> <chr>
1 b 1 1 Genghis Khan Monsieur Eugene Duchene
So it is very straightforward to use the filter() function if you want to filter such that all variables meet a criterion. It is also very straightforward to filter using multiple criteria, as long as you want all variables, and NOT some variables, to which you apply the criterion have to meet it:
# A tibble: 1 x 5
x y z w u
<chr> <dbl> <dbl> <chr> <chr>
1 b 1 1 Genghis Khan Monsieur Eugene Duchene
.. or perhaps better illustrated by:
# A tibble: 35 x 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Dart… 202 136 none white yellow 41.9 male mascu…
5 Leia… 150 49 brown light brown 19 fema… femin…
6 Owen… 178 120 brown, gr… light blue 52 male mascu…
7 Beru… 165 75 brown light blue 47 fema… femin…
8 Bigg… 183 84 black light brown 24 male mascu…
9 Obi-… 182 77 auburn, w… fair blue-gray 57 male mascu…
10 Anak… 188 84 blond fair blue 41.9 male mascu…
# … with 25 more rows, and 5 more variables: homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
# All numeric variables must be larger than 10
# and both name AND hair_color should contain an a.
starwars %>%
filter(across(is.numeric, ~ . > 10),
across(c(name, hair_color), ~ grepl("a", .)))# A tibble: 7 x 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Bigg… 183 84 black light brown 24 male mascu…
2 Obi-… 182 77 auburn, w… fair blue-gray 57 male mascu…
3 Boba… 183 78.2 black fair brown 31.5 male mascu…
4 Land… 177 79 black dark brown 31 male mascu…
5 Lumi… 170 56.2 black yellow blue 58 fema… femin…
6 Barr… 166 50 black yellow blue 40 fema… femin…
7 Jang… 183 79 black tan brown 66 male mascu…
# … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Problems arise when we attempt to do the following: Suppose we want to laxen the condition, and instead, we want to require that at least one condition be met.
For example, in the data.frame df below, we want at least one character string to be larger than 7. How do we go about this? Instead of using the any_vars and all_vars helpers, we must use an auxiliary function, as documented here. In my opinion, this way of filtering is far from intuitive and transparent, and it took me a while to figure it out what I should do in each specific case: it is the reason why I wrote this pamphlet.
# A tibble: 2 x 4
x y z w
<chr> <dbl> <dbl> <chr>
1 a 1 -1 Harry
2 b 1 1 Potterisverycool
Now, suppose we want to filter the data.frame df to include only the rows in which at least one character vector has a string length greater than 6. The vignette advises us to use a helper function, which is defined to be:
In my view, this seems to be pretty much the only ‘helper’ function I can think of and that works, so I don’t see why it is not included as a function in the package. In any case, the job consists of apply the across() function as an argument to rowAny:
# A tibble: 1 x 4
x y z w
<chr> <dbl> <dbl> <chr>
1 b 1 1 Potterisverycool
# A tibble: 2 x 4
x y z w
<chr> <dbl> <dbl> <chr>
1 a 1 -1 Harry
2 b 1 1 Potterisverycool
In other words, we are asking R to filter to any rows to which the following condition aplies:
For any row, check all character vectors
Compute the string length (str_length) of the cells
Ask whether it is larger than 6 (or 4, respectively, in the examples)
We get back a logical in every cell
If the rowwise-sum of all these conditions is larger than 0, then keep the row
Let’s go over each of these individual steps quickly and try to emulate R’s behavior over this function:
#First, this is what we get when we evaluate the condition in each cell
df %>%
summarise(across(everything(), ~ str_length(.x) > 5))# A tibble: 2 x 4
x y z w
<lgl> <lgl> <lgl> <lgl>
1 FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE TRUE
#Then, we sum the value of all the logicals per row:
df %>%
summarise(across(everything(), ~ str_length(.x) > 5)) %>%
rowwise() %>%
rowSums() > 0[1] FALSE TRUE
#which gives us back the rows eligible for inclusion
df[df %>%
summarise(across(everything(), ~ str_length(.x) > 5)) %>%
rowwise() %>%
rowSums() > 0,
]# A tibble: 1 x 4
x y z w
<chr> <dbl> <dbl> <chr>
1 b 1 1 Potterisverycool
Similarly, going back to the df example, we could have also changed the zero to go from “at least one” to “at least two”, for example. So this would mean “select all Rows in which at least 2 variables meet a certain condition”. More concretely:
# A tibble: 2 x 5
x y z w u
<chr> <dbl> <dbl> <chr> <chr>
1 a 1 -1 Harry Hermione
2 b 1 1 Potterisverycool Granger
Let’s see if filter now works as expected, that is to say, it should only include the second row, as I’ve specified that the amount of occurrences (the count of the logical vectors) should be larger than one. (In case of the first row, it is 1.)
# A tibble: 1 x 5
x y z w u
<chr> <dbl> <dbl> <chr> <chr>
1 b 1 1 Potterisverycool Granger
# A tibble: 0 x 5
# … with 5 variables: x <chr>, y <dbl>, z <dbl>, w <chr>, u <chr>
Whereas across() has many other benefits clearly demonstrated in various vignettes, and has enormous potential, in the specific case of filtering, it might have become more awkward to use: the functions of filter_all or filter_at in combination with any_vars and all_vars were (i) at least as, or more intuitive, and (ii) did not require the user to specify a helper function. As for me, I hope this rowAny will get a place inside dplyr, preferably under an intuitive monniker. In any case, I hope you’ve found the small demonstration useful. Feel free to contact me at any time via e-mail or Github.