Context: JSON is often returned as a list of lists in R. JSON allows
for NULL which is not replaced by NA by
default using httr or httr2 built-in JSON
functions. JSON also can have empty lists, which are “empty” but not
NULL in R.
Below are the simulated JSON data objects, with varied combinations
of NULL and list(), prompted by a real world
application where bind_rows() consistently failed with no
additional information.
We are using three different variations of the example data to try to
narrow down the cause of the issue. I’m going to include the
str() output of each below.
default: age is some NULL,
country is all NULL, items is all
list()
some_NULL: one country entry filled in;
no all NULL entries
some_list: one items entry filled; no
all list() entries
## *** default ***
## List of 3
## $ :List of 5
## ..$ id : num 1
## ..$ name : chr "bob"
## ..$ age : num 23
## ..$ country: NULL
## ..$ items : list()
## $ :List of 5
## ..$ id : num 2
## ..$ name : chr "sam"
## ..$ age : NULL
## ..$ country: NULL
## ..$ items : list()
## $ :List of 5
## ..$ id : num 3
## ..$ name : chr "joe"
## ..$ age : NULL
## ..$ country: NULL
## ..$ items : list()
##
## *** some_NULL ***
## List of 3
## $ :List of 5
## ..$ id : num 1
## ..$ name : chr "bob"
## ..$ age : num 23
## ..$ country: NULL
## ..$ items : list()
## $ :List of 5
## ..$ id : num 2
## ..$ name : chr "sam"
## ..$ age : NULL
## ..$ country: NULL
## ..$ items : list()
## $ :List of 5
## ..$ id : num 3
## ..$ name : chr "joe"
## ..$ age : NULL
## ..$ country: chr "US"
## ..$ items : list()
##
## *** some_list ***
## List of 3
## $ :List of 5
## ..$ id : num 1
## ..$ name : chr "bob"
## ..$ age : num 23
## ..$ country: NULL
## ..$ items :List of 3
## .. ..$ : chr "apple"
## .. ..$ : chr "banana"
## .. ..$ : chr "pear"
## $ :List of 5
## ..$ id : num 2
## ..$ name : chr "sam"
## ..$ age : NULL
## ..$ country: NULL
## ..$ items : list()
## $ :List of 5
## ..$ id : num 3
## ..$ name : chr "joe"
## ..$ age : NULL
## ..$ country: NULL
## ..$ items : list()
When working this up into a dataframe we have three general options
from the big three in R programming paradigms.
base::do.call(rbind), dplyr::bind_rows(), and
data.table::rbindlist(). For brevity, we’ll just show right
off the bat that rbindlist() works as expected and then we
can focus on the tidyverse for the remainder.
c("default", "some_NULL", "some_list") |>
purrr::walk(\(x) {
cat("*** ", x, "***\n")
get(x) |> data.table::rbindlist() |> print()
cat("\n")
})
## *** default ***
## id name age country items
## <num> <char> <num> <lgcl> <list>
## 1: 1 bob 23 NA
## 2: 2 sam NA NA
## 3: 3 joe NA NA
##
## *** some_NULL ***
## id name age country items
## <num> <char> <num> <char> <list>
## 1: 1 bob 23 <NA>
## 2: 2 sam NA <NA>
## 3: 3 joe NA US
##
## *** some_list ***
## id name age country items
## <num> <char> <num> <lgcl> <list>
## 1: 1 bob 23 NA apple
## 2: 1 bob 23 NA banana
## 3: 1 bob 23 NA pear
## 4: 2 sam NA NA
## 5: 3 joe NA NA
So we can see in all three instances of our test data,
rbindlist() kept all columns including those with all
NULL and all list(). It correctly replaced the
NULL values with NA in age and
county, but there are still NULL values in
items which we’d have to deal with at some point.
It also successfully unpacked our list of items in the
final example - we have three rows for bob. Let’s examine
the behavior of bind_rows().
c("default", "some_NULL", "some_list") |>
purrr::walk(\(x) {
cat("*** ", x, "***\n")
get(x) |> dplyr::bind_rows() |> print()
cat("\n")
})
## *** default ***
## # A tibble: 0 × 4
## # ℹ 4 variables: id <dbl>, name <chr>, age <dbl>, items <list>
##
## *** some_NULL ***
## # A tibble: 0 × 5
## # ℹ 5 variables: id <dbl>, name <chr>, age <dbl>, items <list>, country <chr>
##
## *** some_list ***
## # A tibble: 3 × 4
## id name age items
## <dbl> <chr> <dbl> <list>
## 1 1 bob 23 <chr [1]>
## 2 1 bob 23 <chr [1]>
## 3 1 bob 23 <chr [1]>
So in all of the results we have lost something somewhere.
Mainly, any column that was all NULL was dropped. This
could cause programming issues down the line, where an API may define a
field (and the programmer would expect it to exist) but
dplyr removes it from the results simply because it happens
to be NULL from the JSON representation.
In addition, any data structure where items was all
empty list() was lost completely. Something about
list() causes bind_rows() to completely
fail.
Only by modifying one of the items to contain a sublist
did we actually get a result, and we still lost rows sam
and joe since their items were
list().
We have identified three problems we need to fix so far to achieve a tidy result:
Replace NULL with NA to avoid column
loss where all entries are NULL (country) from
bind_rows().
Replace items == list() with NA to
avoid row loss during bind_rows().
Unpack-list col items where
items != list() into separate rows
Now that we’ve identified the issues we’ll stick with
some_list since it represents all of these issues - one all
NULL column, one some NULL column, and one
some list() column.
For replacing NULL and list() with
NA, we can’t just use is.null() since
is.null(list()) would return FALSE. So, we use
length() to capture both NULL and
list().
some_list |>
purrr::map(
\(sub) purrr::map(sub, \(i) if (length(i) == 0) NA else i)
) |>
dplyr::bind_rows()
## # A tibble: 5 × 5
## id name age country items
## <dbl> <chr> <dbl> <lgl> <list>
## 1 1 bob 23 NA <chr [1]>
## 2 1 bob 23 NA <chr [1]>
## 3 1 bob 23 NA <chr [1]>
## 4 2 sam NA NA <NULL>
## 5 3 joe NA NA <NULL>
This looks promising in that we now retain the country
column and the records for sam and joe, but
this falls apart when trying to unpack items into separate
rows with unnest().
some_list |>
purrr::map(
\(sub) purrr::map(sub, \(i) if (length(i) == 0) NA else i)
) |>
dplyr::bind_rows() |>
tidyr::unnest(items)
## # A tibble: 3 × 5
## id name age country items
## <dbl> <chr> <dbl> <lgl> <chr>
## 1 1 bob 23 NA apple
## 2 1 bob 23 NA banana
## 3 1 bob 23 NA pear
Now we’ve lost sam and joe again! The issue
actually occurs during bind_rows(). Although our inital
nested map step correctly replaces both NULL
and list() with NA (try it yourself), after
calling bind_rows() our newly minted NA
entries in the list-col items for some reason revert back
to NULL! This causes them to be dropped during
unnest().
So, we have to add another step and put the NA values back in after
using bind_rows() but before unnest().
some_list |>
purrr::map(
\(sub) purrr::map(sub, \(i) if (length(i) == 0) NA else i)
) |>
dplyr::bind_rows() |>
dplyr::mutate(items = purrr::map(items, \(i) if (is.null(i)) NA else i)) |>
tidyr::unnest(items)
## # A tibble: 5 × 5
## id name age country items
## <dbl> <chr> <dbl> <lgl> <chr>
## 1 1 bob 23 NA apple
## 2 1 bob 23 NA banana
## 3 1 bob 23 NA pear
## 4 2 sam NA NA <NA>
## 5 3 joe NA NA <NA>
Now we finally have a working, tidy result, but in my opinion this is too much manipulation required to get there. In summary, we had to do the following:
maps to handle
NULL/list() values in items before binding:
NULL and list() become
NAbind_rows()
NA in list-col (items) becomes
NULL (bad)NA in remaining columns untouched (good)map within mutate to replace
NULL generated during step 2 with NA
(again).unnest() to unpack the list-col itemsIn comparison, here is all we need to do using
rbindlist() to get the same result. We still need to fix
the NULL in the list-col items before using
unnest(), but at least we don’t need a nested map step.
Note that here we allowed warnings from data.table since it
kindly lets us know that it has handled the NULL values for
us.
some_list |>
data.table::rbindlist() |>
dplyr::mutate(items = purrr::map(items, \(i) if (is.null(i)) NA else i)) |>
tidyr::unnest(items)
## Warning in data.table::rbindlist(some_list): Column 4 ['country'] of item 1 is
## length 0. This (and 6 others like it) has been filled with NA (NULL for list
## columns) to make each item uniform.
## # A tibble: 5 × 5
## id name age country items
## <dbl> <chr> <dbl> <lgl> <chr>
## 1 1 bob 23 NA apple
## 2 1 bob 23 NA banana
## 3 1 bob 23 NA pear
## 4 2 sam NA NA <NA>
## 5 3 joe NA NA <NA>
Here are the places where tidyverse can improve:
bind_rows should not return an empty tibble when a
list() column is presentbind_rows should not drop columns where all are
NULLbind_rows should not replace NA with
NULLunnest() should not drop rows where val is
NULL in unnest(val)At the very least, allowing a fill argument to
bind_rows() would bring its functionality very close to
rbindlist() for working with JSON data that have been
converted to R list structures.