Rectangling JSON data
Game of Thrones JSON
- Unnest GoT characters
- Pivot longer
Read in JSON file
- Pluck the data

This is my version of the tidyr article which I felt needed further details.

Rectangling JSON data

Nested list data such as JSON or XML into dataframe rows and columns. To help get this JSON into dataframe use the following

tidyr::unnest_longer() takes each element of a list-column and makes a new row
tidyr::unnest_wider() takes each element of a list-column and makes a new column
unnest_auto() guesses at what function longer or wider
hoist() only plucks out selected components

library(tidyverse)
library(jsontools)
library(jsonlite)
library(repurrrsive)

GitHub users JSON data

The library repurrrsive has JSON on GitHub users

# load in GitHub users data
ghu = repurrrsive::gh_users

# put data into a tibble
users = tibble(user= ghu)
users

## # A tibble: 6 × 1
##   user             
##   <list>           
## 1 <named list [30]>
## 2 <named list [30]>
## 3 <named list [30]>
## 4 <named list [30]>
## 5 <named list [30]>
## 6 <named list [30]>

# get the 1st element (0 is not allowed)
names(users$user[[1]])

##  [1] "login"               "id"                  "avatar_url"         
##  [4] "gravatar_id"         "url"                 "html_url"           
##  [7] "followers_url"       "following_url"       "gists_url"          
## [10] "starred_url"         "subscriptions_url"   "organizations_url"  
## [13] "repos_url"           "events_url"          "received_events_url"
## [16] "type"                "site_admin"          "name"               
## [19] "company"             "blog"                "location"           
## [22] "email"               "hireable"            "bio"                
## [25] "public_repos"        "public_gists"        "followers"          
## [28] "following"           "created_at"          "updated_at"

Lists to Columns

There are 2 ways of turning JSON lists into columns, unnest_wider() and hoist().

# method 1 - unnest_wider()
users %>% unnest_wider(user)

## # A tibble: 6 × 30
##   login     id avatar_url gravatar_id url   html_url followers_url following_url
##   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
## 1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
## 2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
## 3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
## 4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
## 5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
## 6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
## # … with 22 more variables: gists_url <chr>, starred_url <chr>,
## #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
## #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
## #   name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
## #   hireable <lgl>, bio <chr>, public_repos <int>, public_gists <int>,
## #   followers <int>, following <int>, created_at <chr>, updated_at <chr>

I was curious what unnest_longer() would look like, it makes 180 row tibble from the simple 6 rows above.

Hoist it

if you wanted only GitHub user login / name, followers, following and url we can use the hoist() function.

selected_users = users %>% 
  hoist(user,
        name = 'name',
        followers = 'followers',
        following = 'following',
        url = 'html_url'
        )

selected_users

## # A tibble: 6 × 5
##   name                   followers following url                    user        
##   <chr>                      <int>     <int> <chr>                  <list>      
## 1 Gábor Csárdi                 303        22 https://github.com/ga… <named list>
## 2 Jennifer (Jenny) Bryan       780        34 https://github.com/je… <named list>
## 3 Jeff L.                     3958         6 https://github.com/jt… <named list>
## 4 Julia Silge                  115        10 https://github.com/ju… <named list>
## 5 Thomas J. Leeper             213       230 https://github.com/le… <named list>
## 6 Maëlle Salmon                 34        38 https://github.com/ma… <named list>

The JSON tibble data has list [26] elements. We have data that we can work with, let us group by number of followers and arrange them in descending order.

selected_users %>% 
  select(-user) %>% 
  group_by(followers) %>% 
  arrange(desc(followers), sort=TRUE)

## # A tibble: 6 × 4
## # Groups:   followers [6]
##   name                   followers following url                           
##   <chr>                      <int>     <int> <chr>                         
## 1 Jeff L.                     3958         6 https://github.com/jtleek     
## 2 Jennifer (Jenny) Bryan       780        34 https://github.com/jennybc    
## 3 Gábor Csárdi                 303        22 https://github.com/gaborcsardi
## 4 Thomas J. Leeper             213       230 https://github.com/leeper     
## 5 Julia Silge                  115        10 https://github.com/juliasilge 
## 6 Maëlle Salmon                 34        38 https://github.com/masalmon

Game of Thrones JSON

Game of Thrones character data in JSON, this data is available from repurrrsive.

chars <- tibble(char = got_chars)
chars

## # A tibble: 30 × 1
##    char             
##    <list>           
##  1 <named list [18]>
##  2 <named list [18]>
##  3 <named list [18]>
##  4 <named list [18]>
##  5 <named list [18]>
##  6 <named list [18]>
##  7 <named list [18]>
##  8 <named list [18]>
##  9 <named list [18]>
## 10 <named list [18]>
## # … with 20 more rows

Unnest GoT characters

chars2 = chars %>% unnest_wider(char)
chars2

## # A tibble: 30 × 18
##    url           id name  gender culture born  died  alive titles aliases father
##    <chr>      <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
##  1 https://w…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
##  2 https://w…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
##  3 https://w…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
##  4 https://w…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
##  5 https://w…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
##  6 https://w…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
##  7 https://w…  1295 Cres… Male   ""      "In … "In … FALSE <chr>  <chr>   ""    
##  8 https://w…   130 Aria… Female "Dorni… "In … ""    TRUE  <chr>  <chr>   ""    
##  9 https://w…  1303 Daen… Female "Valyr… "In … ""    TRUE  <chr>  <chr>   ""    
## 10 https://w…  1319 Davo… Male   "Weste… "In … ""    TRUE  <chr>  <chr>   ""    
## # … with 20 more rows, and 7 more variables: mother <chr>, spouse <chr>,
## #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
## #   playedBy <list>

Some of the columns are in lists.

chars2 %>% 
  select(name, books, tvSeries )

## # A tibble: 30 × 3
##    name               books     tvSeries 
##    <chr>              <list>    <list>   
##  1 Theon Greyjoy      <chr [3]> <chr [6]>
##  2 Tyrion Lannister   <chr [2]> <chr [6]>
##  3 Victarion Greyjoy  <chr [3]> <chr [1]>
##  4 Will               <chr [1]> <chr [1]>
##  5 Areo Hotah         <chr [3]> <chr [2]>
##  6 Chett              <chr [2]> <chr [1]>
##  7 Cressen            <chr [2]> <chr [1]>
##  8 Arianne Martell    <chr [4]> <chr [1]>
##  9 Daenerys Targaryen <chr [1]> <chr [6]>
## 10 Davos Seaworth     <chr [1]> <chr [5]>
## # … with 20 more rows

Pivot longer

The list-columns of books and tvSeries pivot to column called media, using pivot_longer function.

chars2 %>% 
  select(name, books, tvSeries ) %>% 
  pivot_longer( c(books, tvSeries), names_to = "media", values_to = "value") %>% 
  unnest_longer(value )

## # A tibble: 180 × 3
##    name             media    value            
##    <chr>            <chr>    <chr>            
##  1 Theon Greyjoy    books    A Game of Thrones
##  2 Theon Greyjoy    books    A Storm of Swords
##  3 Theon Greyjoy    books    A Feast for Crows
##  4 Theon Greyjoy    tvSeries Season 1         
##  5 Theon Greyjoy    tvSeries Season 2         
##  6 Theon Greyjoy    tvSeries Season 3         
##  7 Theon Greyjoy    tvSeries Season 4         
##  8 Theon Greyjoy    tvSeries Season 5         
##  9 Theon Greyjoy    tvSeries Season 6         
## 10 Tyrion Lannister books    A Feast for Crows
## # … with 170 more rows

Another version is to have name associated with title

chars2 %>% 
  select(name, title= titles) %>% 
  unnest_longer(title)

## # A tibble: 60 × 2
##    name              title                                                 
##    <chr>             <chr>                                                 
##  1 Theon Greyjoy     "Prince of Winterfell"                                
##  2 Theon Greyjoy     "Captain of Sea Bitch"                                
##  3 Theon Greyjoy     "Lord of the Iron Islands (by law of the green lands)"
##  4 Tyrion Lannister  "Acting Hand of the King (former)"                    
##  5 Tyrion Lannister  "Master of Coin (former)"                             
##  6 Victarion Greyjoy "Lord Captain of the Iron Fleet"                      
##  7 Victarion Greyjoy "Master of the Iron Victory"                          
##  8 Will              ""                                                    
##  9 Areo Hotah        "Captain of the Guard at Sunspear"                    
## 10 Chett             ""                                                    
## # … with 50 more rows

Read in JSON file

Read in the astros.json file, which is about how many NASA astronauts are in space. I use Thomas Mock’s article to read and work with json data, I use different data.

The astros.json file is simple object that has 3 lists, the list item for people is an array of objects.

astros = read_json("astros.json")

astros %>% 
  purrr::pluck('people') %>% 
  tibble() %>%
  unnest_wider(1)

## # A tibble: 11 × 2
##    name                   craft
##    <chr>                  <chr>
##  1 Raja Chari             ISS  
##  2 Tom Marshburn          ISS  
##  3 Kayla Barron           ISS  
##  4 Matthias Maurer        ISS  
##  5 Oleg Artemyev          ISS  
##  6 Denis Matveev          ISS  
##  7 Sergey Korsakov        ISS  
##  8 Kjell Lindgren         ISS  
##  9 Bob Hines              ISS  
## 10 Samantha Cristoforetti ISS  
## 11 Jessica Watkins        ISS

Pluck the data

Using the purrr::pluck() function to pull out the data we want, much like $ or [[ ]] does.

astros %>% 
  purrr::pluck("people", 3) %>%  # 3 is the index
  unlist()

##           name          craft 
## "Kayla Barron"          "ISS"

JSON data

Zane Dax

2022-05-07