bind_rows and JSON

Sample lists of lists (from JSON data)

Context: JSON is often returned as a list of lists in R. JSON allows for NULL which is not replaced by NA by default using httr or httr2 built-in JSON functions. JSON also can have empty lists, which are “empty” but not NULL in R.

Below are the simulated JSON data objects, with varied combinations of NULL and list(), prompted by a real world application where bind_rows() consistently failed with no additional information.

We are using three different variations of the example data to try to narrow down the cause of the issue. I’m going to include the str() output of each below.

default: age is some NULL, country is all NULL, items is all list()
some_NULL: one country entry filled in; no all NULL entries
some_list: one items entry filled; no all list() entries

## *** default ***
## List of 3
##  $ :List of 5
##   ..$ id     : num 1
##   ..$ name   : chr "bob"
##   ..$ age    : num 23
##   ..$ country: NULL
##   ..$ items  : list()
##  $ :List of 5
##   ..$ id     : num 2
##   ..$ name   : chr "sam"
##   ..$ age    : NULL
##   ..$ country: NULL
##   ..$ items  : list()
##  $ :List of 5
##   ..$ id     : num 3
##   ..$ name   : chr "joe"
##   ..$ age    : NULL
##   ..$ country: NULL
##   ..$ items  : list()
## 
## *** some_NULL ***
## List of 3
##  $ :List of 5
##   ..$ id     : num 1
##   ..$ name   : chr "bob"
##   ..$ age    : num 23
##   ..$ country: NULL
##   ..$ items  : list()
##  $ :List of 5
##   ..$ id     : num 2
##   ..$ name   : chr "sam"
##   ..$ age    : NULL
##   ..$ country: NULL
##   ..$ items  : list()
##  $ :List of 5
##   ..$ id     : num 3
##   ..$ name   : chr "joe"
##   ..$ age    : NULL
##   ..$ country: chr "US"
##   ..$ items  : list()
## 
## *** some_list ***
## List of 3
##  $ :List of 5
##   ..$ id     : num 1
##   ..$ name   : chr "bob"
##   ..$ age    : num 23
##   ..$ country: NULL
##   ..$ items  :List of 3
##   .. ..$ : chr "apple"
##   .. ..$ : chr "banana"
##   .. ..$ : chr "pear"
##  $ :List of 5
##   ..$ id     : num 2
##   ..$ name   : chr "sam"
##   ..$ age    : NULL
##   ..$ country: NULL
##   ..$ items  : list()
##  $ :List of 5
##   ..$ id     : num 3
##   ..$ name   : chr "joe"
##   ..$ age    : NULL
##   ..$ country: NULL
##   ..$ items  : list()

Binding to a dataframe (or data.table, or tibble)

When working this up into a dataframe we have three general options from the big three in R programming paradigms. base::do.call(rbind), dplyr::bind_rows(), and data.table::rbindlist(). For brevity, we’ll just show right off the bat that rbindlist() works as expected and then we can focus on the tidyverse for the remainder.

c("default", "some_NULL", "some_list") |> 
  purrr::walk(\(x) {
    cat("*** ", x, "***\n")
    get(x) |> data.table::rbindlist() |> print()
    cat("\n")
  })

## ***  default ***
##       id   name   age country  items
##    <num> <char> <num>  <lgcl> <list>
## 1:     1    bob    23      NA       
## 2:     2    sam    NA      NA       
## 3:     3    joe    NA      NA       
## 
## ***  some_NULL ***
##       id   name   age country  items
##    <num> <char> <num>  <char> <list>
## 1:     1    bob    23    <NA>       
## 2:     2    sam    NA    <NA>       
## 3:     3    joe    NA      US       
## 
## ***  some_list ***
##       id   name   age country  items
##    <num> <char> <num>  <lgcl> <list>
## 1:     1    bob    23      NA  apple
## 2:     1    bob    23      NA banana
## 3:     1    bob    23      NA   pear
## 4:     2    sam    NA      NA       
## 5:     3    joe    NA      NA

So we can see in all three instances of our test data, rbindlist() kept all columns including those with all NULL and all list(). It correctly replaced the NULL values with NA in age and county, but there are still NULL values in items which we’d have to deal with at some point.

It also successfully unpacked our list of items in the final example - we have three rows for bob. Let’s examine the behavior of bind_rows().

c("default", "some_NULL", "some_list") |> 
  purrr::walk(\(x) {
    cat("*** ", x, "***\n")
    get(x) |> dplyr::bind_rows() |> print()
    cat("\n")
  })

## ***  default ***
## # A tibble: 0 × 4
## # ℹ 4 variables: id <dbl>, name <chr>, age <dbl>, items <list>
## 
## ***  some_NULL ***
## # A tibble: 0 × 5
## # ℹ 5 variables: id <dbl>, name <chr>, age <dbl>, items <list>, country <chr>
## 
## ***  some_list ***
## # A tibble: 3 × 4
##      id name    age items    
##   <dbl> <chr> <dbl> <list>   
## 1     1 bob      23 <chr [1]>
## 2     1 bob      23 <chr [1]>
## 3     1 bob      23 <chr [1]>

So in all of the results we have lost something somewhere.

Mainly, any column that was all NULL was dropped. This could cause programming issues down the line, where an API may define a field (and the programmer would expect it to exist) but dplyr removes it from the results simply because it happens to be NULL from the JSON representation.

In addition, any data structure where items was all empty list() was lost completely. Something about list() causes bind_rows() to completely fail.

Only by modifying one of the items to contain a sublist did we actually get a result, and we still lost rows sam and joe since their items were list().

Making bind_rows() work

We have identified three problems we need to fix so far to achieve a tidy result:

Replace NULL with NA to avoid column loss where all entries are NULL (country) from bind_rows().
Replace items == list() with NA to avoid row loss during bind_rows().
Unpack-list col items where items != list() into separate rows

Now that we’ve identified the issues we’ll stick with some_list since it represents all of these issues - one all NULL column, one some NULL column, and one some list() column.

For replacing NULL and list() with NA, we can’t just use is.null() since is.null(list()) would return FALSE. So, we use length() to capture both NULL and list().

some_list |> 
  purrr::map(
    \(sub) purrr::map(sub, \(i) if (length(i) == 0) NA else i)
  ) |> 
  dplyr::bind_rows()

## # A tibble: 5 × 5
##      id name    age country items    
##   <dbl> <chr> <dbl> <lgl>   <list>   
## 1     1 bob      23 NA      <chr [1]>
## 2     1 bob      23 NA      <chr [1]>
## 3     1 bob      23 NA      <chr [1]>
## 4     2 sam      NA NA      <NULL>   
## 5     3 joe      NA NA      <NULL>

This looks promising in that we now retain the country column and the records for sam and joe, but this falls apart when trying to unpack items into separate rows with unnest().

some_list |> 
  purrr::map(
    \(sub) purrr::map(sub, \(i) if (length(i) == 0) NA else i)
  ) |> 
  dplyr::bind_rows() |> 
  tidyr::unnest(items)

## # A tibble: 3 × 5
##      id name    age country items 
##   <dbl> <chr> <dbl> <lgl>   <chr> 
## 1     1 bob      23 NA      apple 
## 2     1 bob      23 NA      banana
## 3     1 bob      23 NA      pear

Now we’ve lost sam and joe again! The issue actually occurs during bind_rows(). Although our inital nested map step correctly replaces both NULL and list() with NA (try it yourself), after calling bind_rows() our newly minted NA entries in the list-col items for some reason revert back to NULL! This causes them to be dropped during unnest().

So, we have to add another step and put the NA values back in after using bind_rows() but before unnest().

some_list |> 
  purrr::map(
    \(sub) purrr::map(sub, \(i) if (length(i) == 0) NA else i)
  ) |> 
  dplyr::bind_rows() |> 
  dplyr::mutate(items = purrr::map(items, \(i) if (is.null(i)) NA else i)) |> 
  tidyr::unnest(items)

## # A tibble: 5 × 5
##      id name    age country items 
##   <dbl> <chr> <dbl> <lgl>   <chr> 
## 1     1 bob      23 NA      apple 
## 2     1 bob      23 NA      banana
## 3     1 bob      23 NA      pear  
## 4     2 sam      NA NA      <NA>  
## 5     3 joe      NA NA      <NA>

Now we finally have a working, tidy result, but in my opinion this is too much manipulation required to get there. In summary, we had to do the following:

nested maps to handle NULL/list() values in items before binding:
- NULL and list() become NA
bind_rows()
- NA in list-col (items) becomes NULL (bad)
- NA in remaining columns untouched (good)
map within mutate to replace NULL generated during step 2 with NA (again).
unnest() to unpack the list-col items

Parity with rbindlist()

In comparison, here is all we need to do using rbindlist() to get the same result. We still need to fix the NULL in the list-col items before using unnest(), but at least we don’t need a nested map step. Note that here we allowed warnings from data.table since it kindly lets us know that it has handled the NULL values for us.

some_list |> 
  data.table::rbindlist() |> 
  dplyr::mutate(items = purrr::map(items, \(i) if (is.null(i)) NA else i)) |>
  tidyr::unnest(items)

## Warning in data.table::rbindlist(some_list): Column 4 ['country'] of item 1 is
## length 0. This (and 6 others like it) has been filled with NA (NULL for list
## columns) to make each item uniform.

## # A tibble: 5 × 5
##      id name    age country items 
##   <dbl> <chr> <dbl> <lgl>   <chr> 
## 1     1 bob      23 NA      apple 
## 2     1 bob      23 NA      banana
## 3     1 bob      23 NA      pear  
## 4     2 sam      NA NA      <NA>  
## 5     3 joe      NA NA      <NA>

Summary

Here are the places where tidyverse can improve:

bind_rows should not return an empty tibble when a list() column is present
bind_rows should not drop columns where all are NULL
bind_rows should not replace NA with NULL
unnest() should not drop rows where val is NULL in unnest(val)
Most importantly, if none of these behaviors can change, a warning message should be posted where columns / rows are dropped as a result.

At the very least, allowing a fill argument to bind_rows() would bring its functionality very close to rbindlist() for working with JSON data that have been converted to R list structures.