I’m working to become more facile with tidyeval, which is the name of a new framework for “computing on the language” (the R language, that is). The immediate motivation for tidyeval is to provide better tools for programming around packages in the tidyverse, especially dplyr and, eventually, ggplot2. To learn more:
Luis Verde Arregoitia tagged me in a tweet linking to his recent blog post Using Tidy Evaluation to untangle header rows. I admire his posts on the practical problems at the interface of R and spreadsheets. In this post, he does battle with non-data rows that are embedded in the data rectangle and not necessarily at the top. This is an nasty combination of two common patterns:
For the full setup, please read his post.
Luis wants a function to put the “trickle down” information into its own variable and then eliminate the non-data rows. And he wants it to feel like a dplyr verb: user can provide the names of these variables in bare form, i.e. not surrounded by quotes.
First, I am easily sucked into spreadsheet problems. Second, the best way I have found to learn something, like tidyeval, is to take every opportunity to work seemingly simple examples. If they are simple … great! You’ve gotten practice. If they are not … great! You’ve learned something.
untangle()So here’s my version, where “my” means I actually benefitted from very helpful input from the masterminds behind tidyeval, Lionel Henry and Hadley Wickham.
Load dplyr, tidyr, and rlang. Why rlang? dplyr re-exports the rlang functions that are most useful for “civilians”, but it was helpful here to use sym(), which is not re-exported by dplyr.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(rlang)
Set up a tibble with one column, which is contaminated by both species and diet information. Ultimately, only the rows holding "Sp1", "Sp2", etc. should survive as data rows, with new variables for species and diet.
(dat <- tibble(
jumble = c("Muridae", "diet:seeds", "Sp1", "Sp2", "Sp3",
"diet:unknown", "Sp4", "Sp5", "Cricetidae", "diet:fruits", "Sp11",
"Sp32", "Sp113")
))
## # A tibble: 13 x 1
## jumble
## <chr>
## 1 Muridae
## 2 diet:seeds
## 3 Sp1
## 4 Sp2
## 5 Sp3
## 6 diet:unknown
## 7 Sp4
## 8 Sp5
## 9 Cricetidae
## 10 diet:fruits
## 11 Sp11
## 12 Sp32
## 13 Sp113
Define the new verb, untangle2(). The regex identifies cells holding group-level info in the existing variable orig. new specifies the new variable you want to create.
untangle2 <- function(df, regex, orig, new) {
orig <- enquo(orig)
new <- sym(quo_name(enquo(new)))
df %>%
mutate(
!!new := if_else(grepl(regex, !! orig), !! orig, NA_character_)
) %>%
fill(!! new) %>%
filter(!grepl(regex, !! orig))
}
dat %>%
untangle2("dae$", jumble, family) %>%
untangle2("^diet", jumble, diet)
## # A tibble: 8 x 3
## jumble family diet
## <chr> <chr> <chr>
## 1 Sp1 Muridae diet:seeds
## 2 Sp2 Muridae diet:seeds
## 3 Sp3 Muridae diet:seeds
## 4 Sp4 Muridae diet:unknown
## 5 Sp5 Muridae diet:unknown
## 6 Sp11 Cricetidae diet:fruits
## 7 Sp32 Cricetidae diet:fruits
## 8 Sp113 Cricetidae diet:fruits
Voilà!
What are our innovations? How and why does this differ from Luis’s original (included below)?
sym(quo_name(enquo(new))) to capture the new variable as a symbol, as opposed to a quosure or a string. This is nice because later, we need to use it in both LHS and RHS contexts.if_else() instead of case_when() to initialize the group-level info, just because it seems more transparent re: what’s going on.Here’s Luis’s original, for easier comparison:
untangle_luis <- function(dframe, matchstring, newCol) {
match <- enquo(match)
newCol <- quo_name(newCol)
dframe %>% mutate(!!newCol := case_when(grepl(!!matchstring,!!dframe[[1]])~!!dframe[[1]]),
removeLater = case_when(grepl(!!matchstring,!!dframe[[1]])~"yes")) %>%
fill(newCol) %>%
filter(is.na(removeLater)) %>% select(-removeLater)
}
Huge thanks again to Luis for this great series of spreadsheet posts and for sharing an excellent motivating example for practicing with tidyeval!