This document explains how to use json OCDS data in R.

Transforming json data to dataframes

To load json data in R, we can use the fromJSON command from the jsonlite package. For this example we will use sample data from Colombia, that you can download in this link.

Setup

Set your working directory and load the packages:

setwd("SET YOUR WORKING DIRECTORY")
library(jsonlite)
library(dplyr)

Load the data and explore the structure

1- Load the data with the fromJSON command. Use the option flatten=TRUE to flatten the json data into a list.

2- The list only has one element, a dataframe called releases that contains all the data. We extract it from the list with the command data[["releases"]] and transform it into a tibble as_tibble.

3- Explore the structure of the dataframe. The releases dataframe has different list columns, each containing dataframes; these dataframes correspond to arrays in the json data.

Note: if you are working with large and multiple json files we recommend to use a parallel approach. You can check section 2.2 on this R guide

#load the data
data<-fromJSON("test.json", flatten = TRUE)

releases<- as_tibble(data[["releases"]])

glimpse(releases)
## Rows: 8
## Columns: 30
## $ ocid                                 <chr> "ocds-k50g02-20-12-10675174", "o…
## $ id                                   <chr> "20-12-10675174", "20-12-1067555…
## $ date                                 <chr> "2020-04-14T14:39:03.000Z", "202…
## $ tag                                  <list> ["contract", "contract", "contr…
## $ initiationType                       <chr> "tender", "tender", "tender", "t…
## $ language                             <chr> "es", "es", "es", "es", "es", "e…
## $ lastUpdate                           <chr> "2020-04-14T00:00:00.000Z", "202…
## $ parties                              <list> [<data.frame[2 x 16]>, <data.fr…
## $ awards                               <list> [<data.frame[1 x 10]>, <data.fr…
## $ contracts                            <list> [<data.frame[1 x 10]>, <data.fr…
## $ buyer.name                           <chr> "AMAZONAS - GOBERNACION", "AMAZO…
## $ planning.rationale                   <chr> "PRESTACION DE SERVICIOS DE UN T…
## $ planning.milestones                  <list> [<data.frame[1 x 4]>, <data.fra…
## $ planning.budget.id                   <chr> "1193", "1148", "1191", "1150", …
## $ planning.budget.description          <chr> "CDP", "CDP", "CDP", "CDP", NA, …
## $ planning.budget.amount.amount        <int> 4800000, 9600000, 4800000, 18600…
## $ tender.id                            <chr> "20-12-10675174", "20-12-1067555…
## $ tender.title                         <chr> "800/2020", "777/2020", "802/202…
## $ tender.description                   <chr> "PRESTACION DE SERVICIOS DE UN T…
## $ tender.status                        <chr> "complete", "complete", "complet…
## $ tender.procurementMethod             <chr> "limited", "limited", "limited",…
## $ tender.procurementMethodDetails      <chr> "Contratación Directa (Ley 1150 …
## $ tender.procurementMethodRationale    <chr> "Urgencia Manifiesta (Literal A)…
## $ tender.submissionMethod              <list> ["inPerson", "inPerson", "inPer…
## $ tender.submissionMethodDetails       <chr> "Municipio obtención: Leticia Mu…
## $ tender.milestones                    <list> [<data.frame[1 x 4]>, <data.fra…
## $ tender.coveredBy                     <list> ["x_Estatuto_General_de_Contrat…
## $ tender.procuringEntity.name          <chr> "AMAZONAS - GOBERNACIÓN", "AMAZO…
## $ tender.value.amount                  <int> 4800000, 9600000, 4800000, 18600…
## $ tender.expressionAddress.countryName <chr> "Colombia", "Colombia", "Colombi…
head(releases,5)
## # A tibble: 5 x 30
##   ocid  id    date  tag   initiationType language lastUpdate parties awards
##   <chr> <chr> <chr> <lis> <chr>          <chr>    <chr>      <list>  <list>
## 1 ocds… 20-1… 2020… <chr… tender         es       2020-04-1… <df[,1… <df[,…
## 2 ocds… 20-1… 2020… <chr… tender         es       2020-04-1… <df[,1… <df[,…
## 3 ocds… 20-1… 2020… <chr… tender         es       2020-04-1… <df[,1… <df[,…
## 4 ocds… 20-1… 2020… <chr… tender         es       2020-04-1… <df[,1… <df[,…
## 5 ocds… 20-1… 2020… <chr… tender         es       2020-04-1… <df[,1… <df[,…
## # … with 21 more variables: contracts <list>, buyer.name <chr>,
## #   planning.rationale <chr>, planning.milestones <list>,
## #   planning.budget.id <chr>, planning.budget.description <chr>,
## #   planning.budget.amount.amount <int>, tender.id <chr>, tender.title <chr>,
## #   tender.description <chr>, tender.status <chr>,
## #   tender.procurementMethod <chr>, tender.procurementMethodDetails <chr>,
## #   tender.procurementMethodRationale <chr>, tender.submissionMethod <list>,
## #   tender.submissionMethodDetails <chr>, tender.milestones <list>,
## #   tender.coveredBy <list>, tender.procuringEntity.name <chr>,
## #   tender.value.amount <int>, tender.expressionAddress.countryName <chr>

Extract dataframes

To extract the dataframes, we can use the apply function, that applies the same function to all the rows (we add 1 in the apply function). In this example we are going to extract the contracts column to a dataframe:

1- result <- r['contracts'][[1]] creates a new dataframe called result, that includes the first element of the parties column (a dataframe).

2- result['ocid'] <- r['ocid'] creates a new column in the result dataframe that includes the ocid, this is useful in order to link the different dataframes.

3-parties<- bind_rows(parties) binds all the elements of the parties object created above, into a single dataframe.

As you can see this new dataframe also contains some list columns, that we can extract with the same procedure. In general, each array in the original json file is transformed into a list column, that contains a dataframe for each observation.

contracts <- apply(releases, 1, function(r) {
  result <- r['contracts'][[1]]
  result['ocid'] <- r['ocid']
  return(result)
})
contracts<- as_tibble(bind_rows(contracts))

head(contracts,5)
## # A tibble: 5 x 11
##       id awardID title description items dateSigned period.startDate
##    <int>   <int> <chr> <chr>       <lis> <chr>      <chr>           
## 1 9.79e6 9789698 800/… PRESTACION… <df[… 2020-04-0… 2020-04-08T00:0…
## 2 9.79e6 9790170 777/… PRESTACION… <df[… 2020-03-2… 2020-03-30T00:0…
## 3 9.79e6 9789924 802/… PRESTACION… <df[… 2020-04-0… 2020-04-06T00:0…
## 4 9.79e6 9789626 788/… PRESTACIÓN… <df[… 2020-03-2… 2020-03-30T00:0…
## 5 9.79e6 9789822 801/… PRESTACION… <df[… 2020-04-0… 2020-04-08T00:0…
## # … with 4 more variables: period.durationInDays <int>, value.amount <int>,
## #   value.currency <chr>, ocid <chr>

What if we want to extract the suppliers of each award? We repeat the same process, but indexing the right column:

1- The suppliers are stored in a list column in the awards dataframe. So we need to extract first the awards column ['awards'][[1]] and then the suppliers column [['suppliers']][[1]].

2- In this example we also included the award id column.

suppliers <- apply(releases, 1, function(r) {
  result <- r['awards'][[1]][['suppliers']][[1]]
  result['award_id'] <- r['awards'][[1]][['id']]
  result['ocid'] <- r['ocid']
  return(result)
})
suppliers<- as_tibble(bind_rows(suppliers))

head(suppliers,5)
## # A tibble: 5 x 4
##   id        name                             award_id ocid                      
##   <chr>     <chr>                               <int> <chr>                     
## 1 112119937 LORENA HERMINIA MALDONADO MATAPI  9789698 ocds-k50g02-20-12-10675174
## 2 112121617 DANIELA ANDREA BARBOSA RODRIGUEZ  9790170 ocds-k50g02-20-12-10675553
## 3 11432675  FRANCISCO JAVIER SALDAÑA VARGAS   9789924 ocds-k50g02-20-12-10675365
## 4 107524099 JHEFERSON SALAZAR MUÑOZ           9789626 ocds-k50g02-20-12-10675127
## 5 112121320 MELISSA MERCEDES FORERO GARCIA    9789822 ocds-k50g02-20-12-10675228

Query OCDS data from a SQLite database

Another option to load OCDS data in R is to first load it into a database engine such as SQLite or PostgreSQL. In this example we are going to load json OCDS data into a SQLite database using OCDSkit

Load data into a database using the command line

Run these commands on the Terminal (you can use the Terminal in R Studio):

1- Install ocds kit: pip install ocdskit

2- Install SQLite: sudo apt install sqlite3

3- Create a database to load the data: sqlite3 database_name.db

4- Load OCDS data into the database: cat OCDS_FILE.json | ocdskit tabulate sqlite:///database_name.db. Make sure the json file is in the directory you are working with, if not add the full path. In you are using Mac use sqlite://// (with four slashes).

Load the data into R

Once you have loaded the json data into the SQLite database, its quite simple to query it in R. In this example we also use sample data from Colombia, loaded into a database testocds1.db:

1-Load or install the DBI and RSQLite packages

2-Connect to the database using the dbConnect command

3-List all the available tables with dbListTables. As you can see in the output, when we load the json data into the SQLite database a table is created for each array. This was what we were trying to do in the first section, where we extracted elements from lists into new dataframes

library(DBI)
library(RSQLite)
db <- dbConnect(RSQLite::SQLite(), "YOUR DATABASE.db")
dbListTables(db)
##  [1] "awards"                                                           
##  [2] "awards_amendment_changes"                                         
##  [3] "awards_amendments"                                                
##  [4] "awards_amendments_changes"                                        
##  [5] "awards_documents"                                                 
##  [6] "awards_items"                                                     
##  [7] "awards_items_additionalClassifications"                           
##  [8] "awards_suppliers"                                                 
##  [9] "awards_suppliers_additionalIdentifiers"                           
## [10] "buyer_additionalIdentifiers"                                      
## [11] "contracts"                                                        
## [12] "contracts_amendment_changes"                                      
## [13] "contracts_amendments"                                             
## [14] "contracts_amendments_changes"                                     
## [15] "contracts_documents"                                              
## [16] "contracts_implementation_documents"                               
## [17] "contracts_implementation_milestones"                              
## [18] "contracts_implementation_milestones_documents"                    
## [19] "contracts_implementation_transactions"                            
## [20] "contracts_implementation_transactions_payee_additionalIdentifiers"
## [21] "contracts_implementation_transactions_payer_additionalIdentifiers"
## [22] "contracts_items"                                                  
## [23] "contracts_items_additionalClassifications"                        
## [24] "contracts_milestones"                                             
## [25] "contracts_milestones_documents"                                   
## [26] "contracts_relatedProcesses"                                       
## [27] "parties"                                                          
## [28] "parties_additionalIdentifiers"                                    
## [29] "planning_documents"                                               
## [30] "planning_milestones"                                              
## [31] "planning_milestones_documents"                                    
## [32] "relatedProcesses"                                                 
## [33] "releases"                                                         
## [34] "tender_amendment_changes"                                         
## [35] "tender_amendments"                                                
## [36] "tender_amendments_changes"                                        
## [37] "tender_documents"                                                 
## [38] "tender_items"                                                     
## [39] "tender_items_additionalClassifications"                           
## [40] "tender_milestones"                                                
## [41] "tender_milestones_documents"                                      
## [42] "tender_procuringEntity_additionalIdentifiers"                     
## [43] "tender_tenderers"                                                 
## [44] "tender_tenderers_additionalIdentifiers"

4-Create specific dataframes for each table, for instance a dataframe for the awards table. We can first explore all the fields available in each table with the dbListFields command. We use SELECT * to extract all the fields:

dbListFields(db, "awards")
##  [1] "ocid"                          "release_id"                   
##  [3] "award_id"                      "amendment_amendsReleaseID"    
##  [5] "amendment_date"                "amendment_description"        
##  [7] "amendment_id"                  "amendment_rationale"          
##  [9] "amendment_releaseID"           "contractPeriod_durationInDays"
## [11] "contractPeriod_endDate"        "contractPeriod_maxExtentDate" 
## [13] "contractPeriod_startDate"      "date"                         
## [15] "description"                   "status"                       
## [17] "title"                         "value_amount"                 
## [19] "value_currency"                "extras"
awards<-dbGetQuery(db, "SELECT * from awards")

head(as_tibble(awards,5))
## # A tibble: 6 x 20
##   ocid  release_id award_id amendment_amend… amendment_date amendment_descr…
##   <chr> <chr>      <chr>    <chr>            <chr>          <chr>           
## 1 ocds… 20-12-106… 9789698  <NA>             <NA>           <NA>            
## 2 ocds… 20-12-106… 9790170  <NA>             <NA>           <NA>            
## 3 ocds… 20-12-106… 9789924  <NA>             <NA>           <NA>            
## 4 ocds… 20-12-106… 9789626  <NA>             <NA>           <NA>            
## 5 ocds… 20-12-106… 9789822  <NA>             <NA>           <NA>            
## 6 ocds… 20-12-106… 9789972  <NA>             <NA>           <NA>            
## # … with 14 more variables: amendment_id <chr>, amendment_rationale <chr>,
## #   amendment_releaseID <chr>, contractPeriod_durationInDays <int>,
## #   contractPeriod_endDate <chr>, contractPeriod_maxExtentDate <chr>,
## #   contractPeriod_startDate <chr>, date <chr>, description <chr>,
## #   status <chr>, title <chr>, value_amount <int64>, value_currency <chr>,
## #   extras <chr>

You can do all types of SQL queries, for instance selecting some fields or agregating data. This can be a useful way to analyze data, without loading all the data into R. For instance:

#Select only some fields
dbGetQuery(db, "SELECT ocid, award_id, value_amount  
                FROM awards
                LIMIT 5")
##                         ocid award_id value_amount
## 1 ocds-k50g02-20-12-10675174  9789698      4800000
## 2 ocds-k50g02-20-12-10675553  9790170      9600000
## 3 ocds-k50g02-20-12-10675365  9789924      4800000
## 4 ocds-k50g02-20-12-10675127  9789626     18600000
## 5 ocds-k50g02-20-12-10675228  9789822      4800000
#Calculate the total value awarded
dbGetQuery(db, "SELECT SUM(value_amount) as total_value 
                          FROM awards")
##     total_value
## 1 1313862772986

If you want to learn more about the SQL queries you can check this tutorial.

Other resources