tidyeval meets Spreadsheet Hell

What does this have to do with spreadsheets?

Luis Verde Arregoitia tagged me in a tweet linking to his recent blog post Using Tidy Evaluation to untangle header rows. I admire his posts on the practical problems at the interface of R and spreadsheets. In this post, he does battle with non-data rows that are embedded in the data rectangle and not necessarily at the top. This is an nasty combination of two common patterns:

Embedding summary rows in the data rectangle, such as sub-totals, for some grouping variable.
“Shit trickles down”: saving space in a spreadsheet by having the value in one cell somehow apply to cells below, usually in a way that is easy for humans to parse, but not computers.

For the full setup, please read his post.

What does this have to do with tidy eval?

Luis wants a function to put the “trickle down” information into its own variable and then eliminate the non-data rows. And he wants it to feel like a dplyr verb: user can provide the names of these variables in bare form, i.e. not surrounded by quotes.

What does this have to do me?

First, I am easily sucked into spreadsheet problems. Second, the best way I have found to learn something, like tidyeval, is to take every opportunity to work seemingly simple examples. If they are simple … great! You’ve gotten practice. If they are not … great! You’ve learned something.

“My” version of `untangle()`

So here’s my version, where “my” means I actually benefitted from very helpful input from the masterminds behind tidyeval, Lionel Henry and Hadley Wickham.

Load dplyr, tidyr, and rlang. Why rlang? dplyr re-exports the rlang functions that are most useful for “civilians”, but it was helpful here to use sym(), which is not re-exported by dplyr.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(rlang)

Set up a tibble with one column, which is contaminated by both species and diet information. Ultimately, only the rows holding "Sp1", "Sp2", etc. should survive as data rows, with new variables for species and diet.

(dat <- tibble(
  jumble = c("Muridae", "diet:seeds", "Sp1", "Sp2", "Sp3",
             "diet:unknown", "Sp4", "Sp5", "Cricetidae", "diet:fruits", "Sp11",
             "Sp32", "Sp113")
))

## # A tibble: 13 x 1
##          jumble
##           <chr>
##  1      Muridae
##  2   diet:seeds
##  3          Sp1
##  4          Sp2
##  5          Sp3
##  6 diet:unknown
##  7          Sp4
##  8          Sp5
##  9   Cricetidae
## 10  diet:fruits
## 11         Sp11
## 12         Sp32
## 13        Sp113

Define the new verb, untangle2(). The regex identifies cells holding group-level info in the existing variable orig. new specifies the new variable you want to create.

 untangle2 <- function(df, regex, orig, new) {
  orig <- enquo(orig)
  new <- sym(quo_name(enquo(new)))

  df %>%
    mutate(
      !!new := if_else(grepl(regex, !! orig), !! orig, NA_character_)
    ) %>%
    fill(!! new) %>%
    filter(!grepl(regex, !! orig))
}

dat %>%
  untangle2("dae$", jumble, family) %>%
  untangle2("^diet", jumble, diet)

## # A tibble: 8 x 3
##   jumble     family         diet
##    <chr>      <chr>        <chr>
## 1    Sp1    Muridae   diet:seeds
## 2    Sp2    Muridae   diet:seeds
## 3    Sp3    Muridae   diet:seeds
## 4    Sp4    Muridae diet:unknown
## 5    Sp5    Muridae diet:unknown
## 6   Sp11 Cricetidae  diet:fruits
## 7   Sp32 Cricetidae  diet:fruits
## 8  Sp113 Cricetidae  diet:fruits

Voilà!

What are our innovations? How and why does this differ from Luis’s original (included below)?

We use tidyeval to make it easy to specify both the existing and new variables. The original was hard-wired to work on the first column.
We treat the regex that identifies group-level info as a plain string. It doesn’t need to be handled via tidyeval.
We use sym(quo_name(enquo(new))) to capture the new variable as a symbol, as opposed to a quosure or a string. This is nice because later, we need to use it in both LHS and RHS contexts.
We choose if_else() instead of case_when() to initialize the group-level info, just because it seems more transparent re: what’s going on.
We avoid the creation of a temporary variable to indicate the rows that need to be eliminated.

Here’s Luis’s original, for easier comparison:

untangle_luis <- function(dframe, matchstring, newCol) {
  match <- enquo(match)
  newCol <- quo_name(newCol)

  dframe %>% mutate(!!newCol := case_when(grepl(!!matchstring,!!dframe[[1]])~!!dframe[[1]]),
                    removeLater = case_when(grepl(!!matchstring,!!dframe[[1]])~"yes")) %>%
    fill(newCol) %>%
    filter(is.na(removeLater)) %>% select(-removeLater)
}

Huge thanks again to Luis for this great series of spreadsheet posts and for sharing an excellent motivating example for practicing with tidyeval!

tidyeval meets Spreadsheet Hell

Jenny Bryan

2017-06-28

What does this have to do with spreadsheets?

What does this have to do with tidy eval?

What does this have to do me?

“My” version of `untangle()`

tidyeval meets Spreadsheet Hell

Jenny Bryan

2017-06-28

What does this have to do with spreadsheets?

What does this have to do with tidy eval?

What does this have to do me?

“My” version of untangle()

“My” version of `untangle()`