This is my version of the tidyr article
which I felt needed further details.
Nested list data such as JSON or XML into dataframe rows and columns. To help get this JSON into dataframe use the following
tidyr::unnest_longer() takes each element of a
list-column and makes a new rowtidyr::unnest_wider() takes each element of a
list-column and makes a new columnunnest_auto() guesses at what function longer or
widerhoist() only plucks out selected componentslibrary(tidyverse)
library(jsontools)
library(jsonlite)
library(repurrrsive)
The library repurrrsive has JSON on GitHub users
# load in GitHub users data
ghu = repurrrsive::gh_users
# put data into a tibble
users = tibble(user= ghu)
users
## # A tibble: 6 × 1
## user
## <list>
## 1 <named list [30]>
## 2 <named list [30]>
## 3 <named list [30]>
## 4 <named list [30]>
## 5 <named list [30]>
## 6 <named list [30]>
# get the 1st element (0 is not allowed)
names(users$user[[1]])
## [1] "login" "id" "avatar_url"
## [4] "gravatar_id" "url" "html_url"
## [7] "followers_url" "following_url" "gists_url"
## [10] "starred_url" "subscriptions_url" "organizations_url"
## [13] "repos_url" "events_url" "received_events_url"
## [16] "type" "site_admin" "name"
## [19] "company" "blog" "location"
## [22] "email" "hireable" "bio"
## [25] "public_repos" "public_gists" "followers"
## [28] "following" "created_at" "updated_at"
There are 2 ways of turning JSON lists into columns,
unnest_wider() and hoist().
# method 1 - unnest_wider()
users %>% unnest_wider(user)
## # A tibble: 6 × 30
## login id avatar_url gravatar_id url html_url followers_url following_url
## <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 gabo… 6.60e5 https://a… "" http… https:/… https://api.… https://api.…
## 2 jenn… 5.99e5 https://a… "" http… https:/… https://api.… https://api.…
## 3 jtle… 1.57e6 https://a… "" http… https:/… https://api.… https://api.…
## 4 juli… 1.25e7 https://a… "" http… https:/… https://api.… https://api.…
## 5 leep… 3.51e6 https://a… "" http… https:/… https://api.… https://api.…
## 6 masa… 8.36e6 https://a… "" http… https:/… https://api.… https://api.…
## # … with 22 more variables: gists_url <chr>, starred_url <chr>,
## # subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
## # events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
## # name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
## # hireable <lgl>, bio <chr>, public_repos <int>, public_gists <int>,
## # followers <int>, following <int>, created_at <chr>, updated_at <chr>
I was curious what unnest_longer() would look like, it
makes 180 row tibble from the simple 6 rows above.
if you wanted only GitHub user login /
name, followers, following and
url we can use the hoist() function.
selected_users = users %>%
hoist(user,
name = 'name',
followers = 'followers',
following = 'following',
url = 'html_url'
)
selected_users
## # A tibble: 6 × 5
## name followers following url user
## <chr> <int> <int> <chr> <list>
## 1 Gábor Csárdi 303 22 https://github.com/ga… <named list>
## 2 Jennifer (Jenny) Bryan 780 34 https://github.com/je… <named list>
## 3 Jeff L. 3958 6 https://github.com/jt… <named list>
## 4 Julia Silge 115 10 https://github.com/ju… <named list>
## 5 Thomas J. Leeper 213 230 https://github.com/le… <named list>
## 6 Maëlle Salmon 34 38 https://github.com/ma… <named list>
The JSON tibble data has list [26] elements. We have data that we can work with, let us group by number of followers and arrange them in descending order.
selected_users %>%
select(-user) %>%
group_by(followers) %>%
arrange(desc(followers), sort=TRUE)
## # A tibble: 6 × 4
## # Groups: followers [6]
## name followers following url
## <chr> <int> <int> <chr>
## 1 Jeff L. 3958 6 https://github.com/jtleek
## 2 Jennifer (Jenny) Bryan 780 34 https://github.com/jennybc
## 3 Gábor Csárdi 303 22 https://github.com/gaborcsardi
## 4 Thomas J. Leeper 213 230 https://github.com/leeper
## 5 Julia Silge 115 10 https://github.com/juliasilge
## 6 Maëlle Salmon 34 38 https://github.com/masalmon
Game of Thrones character data in JSON, this data is available from
repurrrsive.
chars <- tibble(char = got_chars)
chars
## # A tibble: 30 × 1
## char
## <list>
## 1 <named list [18]>
## 2 <named list [18]>
## 3 <named list [18]>
## 4 <named list [18]>
## 5 <named list [18]>
## 6 <named list [18]>
## 7 <named list [18]>
## 8 <named list [18]>
## 9 <named list [18]>
## 10 <named list [18]>
## # … with 20 more rows
chars2 = chars %>% unnest_wider(char)
chars2
## # A tibble: 30 × 18
## url id name gender culture born died alive titles aliases father
## <chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
## 1 https://w… 1022 Theo… Male "Ironb… "In … "" TRUE <chr> <chr> ""
## 2 https://w… 1052 Tyri… Male "" "In … "" TRUE <chr> <chr> ""
## 3 https://w… 1074 Vict… Male "Ironb… "In … "" TRUE <chr> <chr> ""
## 4 https://w… 1109 Will Male "" "" "In … FALSE <chr> <chr> ""
## 5 https://w… 1166 Areo… Male "Norvo… "In … "" TRUE <chr> <chr> ""
## 6 https://w… 1267 Chett Male "" "At … "In … FALSE <chr> <chr> ""
## 7 https://w… 1295 Cres… Male "" "In … "In … FALSE <chr> <chr> ""
## 8 https://w… 130 Aria… Female "Dorni… "In … "" TRUE <chr> <chr> ""
## 9 https://w… 1303 Daen… Female "Valyr… "In … "" TRUE <chr> <chr> ""
## 10 https://w… 1319 Davo… Male "Weste… "In … "" TRUE <chr> <chr> ""
## # … with 20 more rows, and 7 more variables: mother <chr>, spouse <chr>,
## # allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
## # playedBy <list>
Some of the columns are in lists.
chars2 %>%
select(name, books, tvSeries )
## # A tibble: 30 × 3
## name books tvSeries
## <chr> <list> <list>
## 1 Theon Greyjoy <chr [3]> <chr [6]>
## 2 Tyrion Lannister <chr [2]> <chr [6]>
## 3 Victarion Greyjoy <chr [3]> <chr [1]>
## 4 Will <chr [1]> <chr [1]>
## 5 Areo Hotah <chr [3]> <chr [2]>
## 6 Chett <chr [2]> <chr [1]>
## 7 Cressen <chr [2]> <chr [1]>
## 8 Arianne Martell <chr [4]> <chr [1]>
## 9 Daenerys Targaryen <chr [1]> <chr [6]>
## 10 Davos Seaworth <chr [1]> <chr [5]>
## # … with 20 more rows
The list-columns of books and tvSeries pivot to column called media, using pivot_longer function.
chars2 %>%
select(name, books, tvSeries ) %>%
pivot_longer( c(books, tvSeries), names_to = "media", values_to = "value") %>%
unnest_longer(value )
## # A tibble: 180 × 3
## name media value
## <chr> <chr> <chr>
## 1 Theon Greyjoy books A Game of Thrones
## 2 Theon Greyjoy books A Storm of Swords
## 3 Theon Greyjoy books A Feast for Crows
## 4 Theon Greyjoy tvSeries Season 1
## 5 Theon Greyjoy tvSeries Season 2
## 6 Theon Greyjoy tvSeries Season 3
## 7 Theon Greyjoy tvSeries Season 4
## 8 Theon Greyjoy tvSeries Season 5
## 9 Theon Greyjoy tvSeries Season 6
## 10 Tyrion Lannister books A Feast for Crows
## # … with 170 more rows
Another version is to have name associated with title
chars2 %>%
select(name, title= titles) %>%
unnest_longer(title)
## # A tibble: 60 × 2
## name title
## <chr> <chr>
## 1 Theon Greyjoy "Prince of Winterfell"
## 2 Theon Greyjoy "Captain of Sea Bitch"
## 3 Theon Greyjoy "Lord of the Iron Islands (by law of the green lands)"
## 4 Tyrion Lannister "Acting Hand of the King (former)"
## 5 Tyrion Lannister "Master of Coin (former)"
## 6 Victarion Greyjoy "Lord Captain of the Iron Fleet"
## 7 Victarion Greyjoy "Master of the Iron Victory"
## 8 Will ""
## 9 Areo Hotah "Captain of the Guard at Sunspear"
## 10 Chett ""
## # … with 50 more rows
Read in the astros.json file, which is about how many
NASA astronauts are in space. I use Thomas Mock’s article
to read and work with json data, I use different data.
The astros.json file is simple object that has 3 lists, the list item for people is an array of objects.
astros = read_json("astros.json")
astros %>%
purrr::pluck('people') %>%
tibble() %>%
unnest_wider(1)
## # A tibble: 11 × 2
## name craft
## <chr> <chr>
## 1 Raja Chari ISS
## 2 Tom Marshburn ISS
## 3 Kayla Barron ISS
## 4 Matthias Maurer ISS
## 5 Oleg Artemyev ISS
## 6 Denis Matveev ISS
## 7 Sergey Korsakov ISS
## 8 Kjell Lindgren ISS
## 9 Bob Hines ISS
## 10 Samantha Cristoforetti ISS
## 11 Jessica Watkins ISS
Using the purrr::pluck() function to pull out the data
we want, much like $ or [[ ]] does.
astros %>%
purrr::pluck("people", 3) %>% # 3 is the index
unlist()
## name craft
## "Kayla Barron" "ISS"