I’m working to become more facile with tidyeval, which is the name of a new framework for “computing on the language” (the R language, that is). The immediate motivation for tidyeval is to provide better tools for programming around packages in the tidyverse, especially dplyr and, eventually, ggplot2. To learn more:
Luis Verde Arregoitia tagged me in a tweet linking to his recent blog post Using Tidy Evaluation to untangle header rows. I admire his posts on the practical problems at the interface of R and spreadsheets. In this post, he does battle with non-data rows that are embedded in the data rectangle and not necessarily at the top. This is a nasty combination of two common patterns:
For the full setup, please read his post.
Luis wants a function to put the “trickle down” information into its own variable and then eliminate the non-data rows. And he wants it to feel like a dplyr verb: the user can provide the names of these variables in bare form, i.e. not surrounded by quotes.
First, I am easily sucked into spreadsheet problems. Second, the best way I have found to learn something, like tidyeval, is to take every opportunity to work seemingly simple examples. If they are simple … great! You’ve gotten practice. If they are not … great! You’ve learned something.
untangle()
So here’s my version, where “my” means I actually benefitted from very helpful input from the masterminds behind tidyeval, Lionel Henry and Hadley Wickham.
Load dplyr, tidyr, and rlang. Why rlang? dplyr re-exports the rlang functions that are most useful for “civilians”, but it was helpful here to use sym(), which is not re-exported by dplyr.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(rlang)
Set up a tibble with one column, which is contaminated by both family and diet information. Ultimately, only the rows holding "Sp1", "Sp2", etc. should survive as data rows, with new variables for family and diet.
(dat <- tibble(
  jumble = c("Muridae", "diet:seeds", "Sp1", "Sp2", "Sp3",
             "diet:unknown", "Sp4", "Sp5", "Cricetidae", "diet:fruits", "Sp11",
             "Sp32", "Sp113")
))
## # A tibble: 13 x 1
##    jumble
##    <chr>
##  1 Muridae
##  2 diet:seeds
##  3 Sp1
##  4 Sp2
##  5 Sp3
##  6 diet:unknown
##  7 Sp4
##  8 Sp5
##  9 Cricetidae
## 10 diet:fruits
## 11 Sp11
## 12 Sp32
## 13 Sp113
Define the new verb, untangle2(). The regex identifies cells holding group-level info in the existing variable orig. new specifies the new variable you want to create.
untangle2 <- function(df, regex, orig, new) {
  orig <- enquo(orig)
  new <- sym(quo_name(enquo(new)))
  df %>%
    mutate(
      !!new := if_else(grepl(regex, !! orig), !! orig, NA_character_)
    ) %>%
    fill(!! new) %>%
    filter(!grepl(regex, !! orig))
}
dat %>%
  untangle2("dae$", jumble, family) %>%
  untangle2("^diet", jumble, diet)
## # A tibble: 8 x 3
##   jumble family     diet
##   <chr>  <chr>      <chr>
## 1 Sp1    Muridae    diet:seeds
## 2 Sp2    Muridae    diet:seeds
## 3 Sp3    Muridae    diet:seeds
## 4 Sp4    Muridae    diet:unknown
## 5 Sp5    Muridae    diet:unknown
## 6 Sp11   Cricetidae diet:fruits
## 7 Sp32   Cricetidae diet:fruits
## 8 Sp113  Cricetidae diet:fruits
Voilà!
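To see what each piece of the pipeline inside untangle2() contributes, here is a minimal sketch that runs the three steps one at a time, without any tidyeval. The tibble toy and the names step1, step2, and step3 are made up for illustration and are not part of the original example.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Hypothetical, shortened stand-in for `dat` (assumption: same layout)
toy <- tibble(jumble = c("Muridae", "Sp1", "Sp2", "Cricetidae", "Sp11"))

# Step 1: copy group-level cells into a new column, NA everywhere else
step1 <- toy %>%
  mutate(family = if_else(grepl("dae$", jumble), jumble, NA_character_))

# Step 2: carry each family name down into the NA cells below it
step2 <- step1 %>% fill(family)

# Step 3: drop the rows that only held group-level info
step3 <- step2 %>% filter(!grepl("dae$", jumble))

step3
```

The unquoting in untangle2() only exists so that jumble and family can be supplied by the caller instead of being hard-wired as they are here.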
What are our innovations? How and why does this differ from Luis’s original (included below)?
We use sym(quo_name(enquo(new))) to capture the new variable as a symbol, as opposed to a quosure or a string. This is nice because later, we need to use it in both LHS and RHS contexts.
We use if_else() instead of case_when() to initialize the group-level info, just because it seems more transparent re: what’s going on.
Here’s Luis’s original, for easier comparison:
untangle_luis <- function(dframe, matchstring, newCol) {
  match <- enquo(match)
  newCol <- quo_name(newCol)
  dframe %>% mutate(!!newCol := case_when(grepl(!!matchstring,!!dframe[[1]])~!!dframe[[1]]),
                    removeLater = case_when(grepl(!!matchstring,!!dframe[[1]])~"yes")) %>%
    fill(newCol) %>%
    filter(is.na(removeLater)) %>% select(-removeLater)
}
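One difference that is easy to inspect in isolation is how the name of the new column gets captured. The sketch below pulls that single line out of untangle2() into a helper, capture_new(), which is a hypothetical name invented here and not part of either function. It also tries rlang's ensym(), which I believe gives the same symbol in one call when the argument is a bare name.

```r
library(rlang)

# Hypothetical helper: isolates the capture step used in untangle2()
capture_new <- function(new) {
  sym(quo_name(enquo(new)))
}

new_sym <- capture_new(family)
is_symbol(new_sym)   # the captured name is a symbol
as_string(new_sym)   # its name as a plain string

# quo_name() on its own stops one step earlier, at a string
new_chr <- quo_name(quo(family))
is_string(new_chr)

# Assumption: ensym() is an equivalent shortcut for a bare-name argument
capture_new2 <- function(new) ensym(new)
identical(capture_new2(family), new_sym)
```

A symbol can be unquoted with !! on the left of := and also passed to fill(), which is the "both LHS and RHS" convenience mentioned above.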
Huge thanks again to Luis for this great series of spreadsheet posts and for sharing an excellent motivating example for practicing with tidyeval!