This document explains how to use json OCDS data in R.
To load json data in R, we can use the fromJSON
command from the jsonlite
package. For this example we will use sample data from Colombia, that you can download in this link.
Set your working directory and load the packages:
setwd("SET YOUR WORKING DIRECTORY")
library(jsonlite)
library(dplyr)
1- Load the data with the fromJSON
command. Use the option flatten=TRUE
to flatten the json data into a list.
2- The list only has one element, a dataframe called releases that contains all the data. We extract it from the list with the command data[["releases"]]
and transform it into a tibble as_tibble
.
3- Explore the structure of the dataframe. The releases dataframe has different list columns, each containing dataframes; these dataframes correspond to arrays in the json data.
Note: if you are working with large and multiple json files we recommend to use a parallel approach. You can check section 2.2 on this R guide
#load the data
data<-fromJSON("test.json", flatten = TRUE)
releases<- as_tibble(data[["releases"]])
glimpse(releases)
## Rows: 8
## Columns: 30
## $ ocid <chr> "ocds-k50g02-20-12-10675174", "o…
## $ id <chr> "20-12-10675174", "20-12-1067555…
## $ date <chr> "2020-04-14T14:39:03.000Z", "202…
## $ tag <list> ["contract", "contract", "contr…
## $ initiationType <chr> "tender", "tender", "tender", "t…
## $ language <chr> "es", "es", "es", "es", "es", "e…
## $ lastUpdate <chr> "2020-04-14T00:00:00.000Z", "202…
## $ parties <list> [<data.frame[2 x 16]>, <data.fr…
## $ awards <list> [<data.frame[1 x 10]>, <data.fr…
## $ contracts <list> [<data.frame[1 x 10]>, <data.fr…
## $ buyer.name <chr> "AMAZONAS - GOBERNACION", "AMAZO…
## $ planning.rationale <chr> "PRESTACION DE SERVICIOS DE UN T…
## $ planning.milestones <list> [<data.frame[1 x 4]>, <data.fra…
## $ planning.budget.id <chr> "1193", "1148", "1191", "1150", …
## $ planning.budget.description <chr> "CDP", "CDP", "CDP", "CDP", NA, …
## $ planning.budget.amount.amount <int> 4800000, 9600000, 4800000, 18600…
## $ tender.id <chr> "20-12-10675174", "20-12-1067555…
## $ tender.title <chr> "800/2020", "777/2020", "802/202…
## $ tender.description <chr> "PRESTACION DE SERVICIOS DE UN T…
## $ tender.status <chr> "complete", "complete", "complet…
## $ tender.procurementMethod <chr> "limited", "limited", "limited",…
## $ tender.procurementMethodDetails <chr> "Contratación Directa (Ley 1150 …
## $ tender.procurementMethodRationale <chr> "Urgencia Manifiesta (Literal A)…
## $ tender.submissionMethod <list> ["inPerson", "inPerson", "inPer…
## $ tender.submissionMethodDetails <chr> "Municipio obtención: Leticia Mu…
## $ tender.milestones <list> [<data.frame[1 x 4]>, <data.fra…
## $ tender.coveredBy <list> ["x_Estatuto_General_de_Contrat…
## $ tender.procuringEntity.name <chr> "AMAZONAS - GOBERNACIÓN", "AMAZO…
## $ tender.value.amount <int> 4800000, 9600000, 4800000, 18600…
## $ tender.expressionAddress.countryName <chr> "Colombia", "Colombia", "Colombi…
head(releases,5)
## # A tibble: 5 x 30
## ocid id date tag initiationType language lastUpdate parties awards
## <chr> <chr> <chr> <lis> <chr> <chr> <chr> <list> <list>
## 1 ocds… 20-1… 2020… <chr… tender es 2020-04-1… <df[,1… <df[,…
## 2 ocds… 20-1… 2020… <chr… tender es 2020-04-1… <df[,1… <df[,…
## 3 ocds… 20-1… 2020… <chr… tender es 2020-04-1… <df[,1… <df[,…
## 4 ocds… 20-1… 2020… <chr… tender es 2020-04-1… <df[,1… <df[,…
## 5 ocds… 20-1… 2020… <chr… tender es 2020-04-1… <df[,1… <df[,…
## # … with 21 more variables: contracts <list>, buyer.name <chr>,
## # planning.rationale <chr>, planning.milestones <list>,
## # planning.budget.id <chr>, planning.budget.description <chr>,
## # planning.budget.amount.amount <int>, tender.id <chr>, tender.title <chr>,
## # tender.description <chr>, tender.status <chr>,
## # tender.procurementMethod <chr>, tender.procurementMethodDetails <chr>,
## # tender.procurementMethodRationale <chr>, tender.submissionMethod <list>,
## # tender.submissionMethodDetails <chr>, tender.milestones <list>,
## # tender.coveredBy <list>, tender.procuringEntity.name <chr>,
## # tender.value.amount <int>, tender.expressionAddress.countryName <chr>
To extract the dataframes, we can use the apply
function, that applies the same function to all the rows (we add 1
in the apply function). In this example we are going to extract the contracts column to a dataframe:
1- result <- r['contracts'][[1]]
creates a new dataframe called result, that includes the first element of the parties column (a dataframe).
2- result['ocid'] <- r['ocid']
creates a new column in the result dataframe that includes the ocid
, this is useful in order to link the different dataframes.
3-parties<- bind_rows(parties)
binds all the elements of the parties object created above, into a single dataframe.
As you can see this new dataframe also contains some list columns, that we can extract with the same procedure. In general, each array in the original json file is transformed into a list column, that contains a dataframe for each observation.
contracts <- apply(releases, 1, function(r) {
result <- r['contracts'][[1]]
result['ocid'] <- r['ocid']
return(result)
})
contracts<- as_tibble(bind_rows(contracts))
head(contracts,5)
## # A tibble: 5 x 11
## id awardID title description items dateSigned period.startDate
## <int> <int> <chr> <chr> <lis> <chr> <chr>
## 1 9.79e6 9789698 800/… PRESTACION… <df[… 2020-04-0… 2020-04-08T00:0…
## 2 9.79e6 9790170 777/… PRESTACION… <df[… 2020-03-2… 2020-03-30T00:0…
## 3 9.79e6 9789924 802/… PRESTACION… <df[… 2020-04-0… 2020-04-06T00:0…
## 4 9.79e6 9789626 788/… PRESTACIÓN… <df[… 2020-03-2… 2020-03-30T00:0…
## 5 9.79e6 9789822 801/… PRESTACION… <df[… 2020-04-0… 2020-04-08T00:0…
## # … with 4 more variables: period.durationInDays <int>, value.amount <int>,
## # value.currency <chr>, ocid <chr>
What if we want to extract the suppliers of each award? We repeat the same process, but indexing the right column:
1- The suppliers are stored in a list column in the awards dataframe. So we need to extract first the awards column ['awards'][[1]]
and then the suppliers column [['suppliers']][[1]]
.
2- In this example we also included the award id column.
suppliers <- apply(releases, 1, function(r) {
result <- r['awards'][[1]][['suppliers']][[1]]
result['award_id'] <- r['awards'][[1]][['id']]
result['ocid'] <- r['ocid']
return(result)
})
suppliers<- as_tibble(bind_rows(suppliers))
head(suppliers,5)
## # A tibble: 5 x 4
## id name award_id ocid
## <chr> <chr> <int> <chr>
## 1 112119937 LORENA HERMINIA MALDONADO MATAPI 9789698 ocds-k50g02-20-12-10675174
## 2 112121617 DANIELA ANDREA BARBOSA RODRIGUEZ 9790170 ocds-k50g02-20-12-10675553
## 3 11432675 FRANCISCO JAVIER SALDAÑA VARGAS 9789924 ocds-k50g02-20-12-10675365
## 4 107524099 JHEFERSON SALAZAR MUÑOZ 9789626 ocds-k50g02-20-12-10675127
## 5 112121320 MELISSA MERCEDES FORERO GARCIA 9789822 ocds-k50g02-20-12-10675228
Another option to load OCDS data in R is to first load it into a database engine such as SQLite or PostgreSQL. In this example we are going to load json OCDS data into a SQLite database using OCDSkit
Run these commands on the Terminal (you can use the Terminal in R Studio):
1- Install ocds kit: pip install ocdskit
2- Install SQLite: sudo apt install sqlite3
3- Create a database to load the data: sqlite3 database_name.db
4- Load OCDS data into the database: cat OCDS_FILE.json | ocdskit tabulate sqlite:///database_name.db
. Make sure the json file is in the directory you are working with, if not add the full path. In you are using Mac use sqlite:////
(with four slashes).
Once you have loaded the json data into the SQLite database, its quite simple to query it in R. In this example we also use sample data from Colombia, loaded into a database testocds1.db
:
1-Load or install the DBI
and RSQLite
packages
2-Connect to the database using the dbConnect
command
3-List all the available tables with dbListTables
. As you can see in the output, when we load the json data into the SQLite database a table is created for each array. This was what we were trying to do in the first section, where we extracted elements from lists into new dataframes
library(DBI)
library(RSQLite)
db <- dbConnect(RSQLite::SQLite(), "YOUR DATABASE.db")
dbListTables(db)
## [1] "awards"
## [2] "awards_amendment_changes"
## [3] "awards_amendments"
## [4] "awards_amendments_changes"
## [5] "awards_documents"
## [6] "awards_items"
## [7] "awards_items_additionalClassifications"
## [8] "awards_suppliers"
## [9] "awards_suppliers_additionalIdentifiers"
## [10] "buyer_additionalIdentifiers"
## [11] "contracts"
## [12] "contracts_amendment_changes"
## [13] "contracts_amendments"
## [14] "contracts_amendments_changes"
## [15] "contracts_documents"
## [16] "contracts_implementation_documents"
## [17] "contracts_implementation_milestones"
## [18] "contracts_implementation_milestones_documents"
## [19] "contracts_implementation_transactions"
## [20] "contracts_implementation_transactions_payee_additionalIdentifiers"
## [21] "contracts_implementation_transactions_payer_additionalIdentifiers"
## [22] "contracts_items"
## [23] "contracts_items_additionalClassifications"
## [24] "contracts_milestones"
## [25] "contracts_milestones_documents"
## [26] "contracts_relatedProcesses"
## [27] "parties"
## [28] "parties_additionalIdentifiers"
## [29] "planning_documents"
## [30] "planning_milestones"
## [31] "planning_milestones_documents"
## [32] "relatedProcesses"
## [33] "releases"
## [34] "tender_amendment_changes"
## [35] "tender_amendments"
## [36] "tender_amendments_changes"
## [37] "tender_documents"
## [38] "tender_items"
## [39] "tender_items_additionalClassifications"
## [40] "tender_milestones"
## [41] "tender_milestones_documents"
## [42] "tender_procuringEntity_additionalIdentifiers"
## [43] "tender_tenderers"
## [44] "tender_tenderers_additionalIdentifiers"
4-Create specific dataframes for each table, for instance a dataframe for the awards table. We can first explore all the fields available in each table with the dbListFields
command. We use SELECT *
to extract all the fields:
dbListFields(db, "awards")
## [1] "ocid" "release_id"
## [3] "award_id" "amendment_amendsReleaseID"
## [5] "amendment_date" "amendment_description"
## [7] "amendment_id" "amendment_rationale"
## [9] "amendment_releaseID" "contractPeriod_durationInDays"
## [11] "contractPeriod_endDate" "contractPeriod_maxExtentDate"
## [13] "contractPeriod_startDate" "date"
## [15] "description" "status"
## [17] "title" "value_amount"
## [19] "value_currency" "extras"
awards<-dbGetQuery(db, "SELECT * from awards")
head(as_tibble(awards,5))
## # A tibble: 6 x 20
## ocid release_id award_id amendment_amend… amendment_date amendment_descr…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ocds… 20-12-106… 9789698 <NA> <NA> <NA>
## 2 ocds… 20-12-106… 9790170 <NA> <NA> <NA>
## 3 ocds… 20-12-106… 9789924 <NA> <NA> <NA>
## 4 ocds… 20-12-106… 9789626 <NA> <NA> <NA>
## 5 ocds… 20-12-106… 9789822 <NA> <NA> <NA>
## 6 ocds… 20-12-106… 9789972 <NA> <NA> <NA>
## # … with 14 more variables: amendment_id <chr>, amendment_rationale <chr>,
## # amendment_releaseID <chr>, contractPeriod_durationInDays <int>,
## # contractPeriod_endDate <chr>, contractPeriod_maxExtentDate <chr>,
## # contractPeriod_startDate <chr>, date <chr>, description <chr>,
## # status <chr>, title <chr>, value_amount <int64>, value_currency <chr>,
## # extras <chr>
You can do all types of SQL queries, for instance selecting some fields or agregating data. This can be a useful way to analyze data, without loading all the data into R. For instance:
#Select only some fields
dbGetQuery(db, "SELECT ocid, award_id, value_amount
FROM awards
LIMIT 5")
## ocid award_id value_amount
## 1 ocds-k50g02-20-12-10675174 9789698 4800000
## 2 ocds-k50g02-20-12-10675553 9790170 9600000
## 3 ocds-k50g02-20-12-10675365 9789924 4800000
## 4 ocds-k50g02-20-12-10675127 9789626 18600000
## 5 ocds-k50g02-20-12-10675228 9789822 4800000
#Calculate the total value awarded
dbGetQuery(db, "SELECT SUM(value_amount) as total_value
FROM awards")
## total_value
## 1 1313862772986
If you want to learn more about the SQL queries you can check this tutorial.
You can check our Guide to analyze OCDS data in R that includes more detail and examples on how to analyze OCDS data in R.
Check our OCDS kit tutorial