Parsing JSON data

I prepared this document to show you how to parse JSON data with the awesome jsonlite package.

Let’s say that you are a huge fan of the great and powerful Hadley Wickham. (He’s the Chief Scientist at RStudio and a big reason why all this R magic is possible.) You want to stay up to date with his latest work. You could skim through his GitHub repos every day, or you could use an API to pull his GitHub activity as JSON. Let’s do the latter.

Let’s import a few libraries.

library(jsonlite)
library(tidyverse)
library(knitr)
library(kableExtra)

Let’s pull the JSON data with the GitHub API, then look at the structure of the first seven columns:

data1 <- fromJSON("https://api.github.com/users/hadley/repos")
str(data1[, 1:7])

## 'data.frame':    30 obs. of  7 variables:
##  $ id       : int  40423928 40544418 14984909 12241750 5154874 9324319 20228011 82348 888200 3116998 ...
##  $ node_id  : chr  "MDEwOlJlcG9zaXRvcnk0MDQyMzkyOA==" "MDEwOlJlcG9zaXRvcnk0MDU0NDQxOA==" "MDEwOlJlcG9zaXRvcnkxNDk4NDkwOQ==" "MDEwOlJlcG9zaXRvcnkxMjI0MTc1MA==" ...
##  $ name     : chr  "15-state-of-the-union" "15-student-papers" "500lines" "adv-r" ...
##  $ full_name: chr  "hadley/15-state-of-the-union" "hadley/15-student-papers" "hadley/500lines" "hadley/adv-r" ...
##  $ private  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ owner    :'data.frame':   30 obs. of  18 variables:
##   ..$ login              : chr  "hadley" "hadley" "hadley" "hadley" ...
##   ..$ id                 : int  4196 4196 4196 4196 4196 4196 4196 4196 4196 4196 ...
##   ..$ node_id            : chr  "MDQ6VXNlcjQxOTY=" "MDQ6VXNlcjQxOTY=" "MDQ6VXNlcjQxOTY=" "MDQ6VXNlcjQxOTY=" ...
##   ..$ avatar_url         : chr  "https://avatars3.githubusercontent.com/u/4196?v=4" "https://avatars3.githubusercontent.com/u/4196?v=4" "https://avatars3.githubusercontent.com/u/4196?v=4" "https://avatars3.githubusercontent.com/u/4196?v=4" ...
##   ..$ gravatar_id        : chr  "" "" "" "" ...
##   ..$ url                : chr  "https://api.github.com/users/hadley" "https://api.github.com/users/hadley" "https://api.github.com/users/hadley" "https://api.github.com/users/hadley" ...
##   ..$ html_url           : chr  "https://github.com/hadley" "https://github.com/hadley" "https://github.com/hadley" "https://github.com/hadley" ...
##   ..$ followers_url      : chr  "https://api.github.com/users/hadley/followers" "https://api.github.com/users/hadley/followers" "https://api.github.com/users/hadley/followers" "https://api.github.com/users/hadley/followers" ...
##   ..$ following_url      : chr  "https://api.github.com/users/hadley/following{/other_user}" "https://api.github.com/users/hadley/following{/other_user}" "https://api.github.com/users/hadley/following{/other_user}" "https://api.github.com/users/hadley/following{/other_user}" ...
##   ..$ gists_url          : chr  "https://api.github.com/users/hadley/gists{/gist_id}" "https://api.github.com/users/hadley/gists{/gist_id}" "https://api.github.com/users/hadley/gists{/gist_id}" "https://api.github.com/users/hadley/gists{/gist_id}" ...
##   ..$ starred_url        : chr  "https://api.github.com/users/hadley/starred{/owner}{/repo}" "https://api.github.com/users/hadley/starred{/owner}{/repo}" "https://api.github.com/users/hadley/starred{/owner}{/repo}" "https://api.github.com/users/hadley/starred{/owner}{/repo}" ...
##   ..$ subscriptions_url  : chr  "https://api.github.com/users/hadley/subscriptions" "https://api.github.com/users/hadley/subscriptions" "https://api.github.com/users/hadley/subscriptions" "https://api.github.com/users/hadley/subscriptions" ...
##   ..$ organizations_url  : chr  "https://api.github.com/users/hadley/orgs" "https://api.github.com/users/hadley/orgs" "https://api.github.com/users/hadley/orgs" "https://api.github.com/users/hadley/orgs" ...
##   ..$ repos_url          : chr  "https://api.github.com/users/hadley/repos" "https://api.github.com/users/hadley/repos" "https://api.github.com/users/hadley/repos" "https://api.github.com/users/hadley/repos" ...
##   ..$ events_url         : chr  "https://api.github.com/users/hadley/events{/privacy}" "https://api.github.com/users/hadley/events{/privacy}" "https://api.github.com/users/hadley/events{/privacy}" "https://api.github.com/users/hadley/events{/privacy}" ...
##   ..$ received_events_url: chr  "https://api.github.com/users/hadley/received_events" "https://api.github.com/users/hadley/received_events" "https://api.github.com/users/hadley/received_events" "https://api.github.com/users/hadley/received_events" ...
##   ..$ type               : chr  "User" "User" "User" "User" ...
##   ..$ site_admin         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ html_url : chr  "https://github.com/hadley/15-state-of-the-union" "https://github.com/hadley/15-student-papers" "https://github.com/hadley/500lines" "https://github.com/hadley/adv-r" ...

We can see from the first few columns that this JSON is nested: the “owner” column is itself a data frame with 18 variables.
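To see where that nesting comes from, here is a minimal sketch that parses a small inline JSON string instead of calling the API. The two records in `txt` are made up for illustration; they only mimic the shape of the GitHub response.

```r
library(jsonlite)

# Hypothetical two-record payload that mimics the shape of the GitHub response.
txt <- '[
  {"name": "adv-r",     "owner": {"login": "hadley", "id": 4196}},
  {"name": "babynames", "owner": {"login": "hadley", "id": 4196}}
]'

repos <- fromJSON(txt)

names(repos)                # top-level columns only
is.data.frame(repos$owner)  # the nested JSON object became a data frame column
repos$owner$login           # reach into the nested column with a second `$`
```

Each JSON object inside a record turns into a data frame stored in a single column, which is why `str()` prints the “owner” fields with an extra level of indentation.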

We want a flat data frame. Luckily for us, jsonlite comes with a function to flatten nested data frames.

data1 <- jsonlite::flatten(data1)
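As a small sketch of what flattening does to the column names, the example below again uses a made-up inline record rather than the live API. It also shows that `fromJSON()` accepts a `flatten = TRUE` argument, which should give the same columns in one step. The `jsonlite::` prefix matters once the tidyverse is loaded, because purrr exports a function called `flatten` as well.

```r
library(jsonlite)

# Hypothetical single-record payload, shaped like one GitHub repo entry.
txt <- '[
  {"name": "adv-r", "owner": {"login": "hadley", "id": 4196}}
]'

# Two steps: parse, then flatten the nested "owner" data frame.
nested <- fromJSON(txt)
flat_two_step <- jsonlite::flatten(nested)

# One step: ask fromJSON() to flatten while parsing.
flat_one_step <- fromJSON(txt, flatten = TRUE)

names(flat_two_step)  # nested fields are prefixed with the parent name, e.g. owner.login
identical(names(flat_two_step), names(flat_one_step))
```

After flattening, the nested fields sit next to the top-level ones, so `select()` can address them directly.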

We can now narrow the columns down to the repository name, the date of the last update and the repository URL, and see what he’s been working on lately.

data1 %>% 
  select(name, updated_at, git_url) %>%
  arrange(desc(updated_at)) %>%
  head(5) %>% 
  kable() %>%
  kable_styling()

name	updated_at	git_url
assertthat	2019-12-12T09:55:44Z	git://github.com/hadley/assertthat.git
adv-r	2019-12-11T10:22:52Z	git://github.com/hadley/adv-r.git
data-baby-names	2019-12-02T23:06:51Z	git://github.com/hadley/data-baby-names.git
babynames	2019-11-21T12:15:46Z	git://github.com/hadley/babynames.git
beautiful-data	2019-11-06T20:02:39Z	git://github.com/hadley/beautiful-data.git
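One caveat: the `str()` output above shows only 30 observations, and the names run alphabetically. As far as I know, that is because the endpoint returns 30 repositories per page by default, so the sort above only ranks that first page. The sketch below asks GitHub to do the sorting and to return a bigger page through the `sort` and `per_page` query parameters. `repos_url()` is a hypothetical helper written for this post, not part of jsonlite, and the fetch is guarded because it needs network access.

```r
library(jsonlite)

# Hypothetical helper (not part of jsonlite): builds the repos URL with
# GitHub's `sort` and `per_page` query parameters.
repos_url <- function(user, sort = "updated", per_page = 100) {
  paste0("https://api.github.com/users/", user,
         "/repos?sort=", sort, "&per_page=", per_page)
}

url <- repos_url("hadley")
url

# Fetching needs network access, so it only runs in an interactive session.
if (interactive()) {
  latest <- fromJSON(url, flatten = TRUE)
  head(latest[, c("name", "updated_at", "html_url")], 5)
}
```

Letting the API sort means the most recently updated repositories come first even when the account has more repositories than fit on one page.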

There we go: APIs and JSON making our lives easier!