If you consider yourself a budding data scientist, an explorer of a brave new world where data is ubiquitous then there are a few things you will need to do, one of which is access data from the web. One of the ways that we can get information from the way when using R is through a Web API; a Web API is a gate, an Application Programming Interface. The API is the ‘thing’ that allows different data sets to escape from the yard with the correct keys. It’s an agreement between an enterprise and the applications that use it.
This is where we get specific, the enterprise for us is CKAN and the application is R. CKAN is Comprehensive Knowledge Archive Network, it is an open-source open data portal for the storage and distribution of data. CKAN is used by governments at all levels internationally along with universities and some private enterprises.
Now we have a question, let’s say it relates to the department of foreign affairs and trade and the portfolio budget statements (PBS) for Australia. To get this we will have to go to the government data website [link] (https://data.gov.au) site to see what they have. There are plenty of results but we also know that we are going to analyse the data in R. There is a better way to get the information straight into R. First there are a few things we will have to do.
library(tidyverse)
library(ckanr)
library(jsonlite)
library(readxl)
library(curl)
library(readxl)
url_site <- "https://data.gov.au/"
ckanr_setup(url = url_site)
Now the url that you will conduct all of your searches in is set to https://data.gov.au
Information on CKAN data bases is grouped. As a start point we can see what groups there are
group_list(as = "table")
## [1] "business" "civic" "communications"
## [4] "community" "cultural" "defence"
## [7] "education" "employment" "environment"
## [10] "finance" "governance" "health"
## [13] "immigration" "indigenous" "international"
## [16] "justice" "maritime" "natural"
## [19] "primary" "sciences" "security"
## [22] "sport" "statistical-services" "tourism"
## [25] "trade" "transport"
So we can see there are a nuber of groups. If we look at a group in a bit more detail we can see that there are a number of packages under each group
group_list(limit = 2)
## [[1]]
## <CKAN Group> ada735db-98c9-42cb-8969-dc356ea4281e
## Name: business
## Display Name: Business Support and Regulation
## No. users: 1
## No. packages: 85
## No. followers: 1
## Created: 2013-05-28T09:05:53.425083
##
## [[2]]
## <CKAN Group> 7d67a152-2ac6-4f5f-b1e5-f6a2791067cc
## Name: civic
## Display Name: Civic Infrastructure
## No. users: 1
## No. packages: 216
## No. followers: 0
## Created: 2013-09-18T02:31:05.900153
The information can be searched in two broad formats, packages and resources. Packages are the way the information is grouped under a ‘title’ where as resources are the things we are really searching for. Let’s have a look at the packages first.
package_list(as = "table")
## [1] "00110732-e257-49bf-b10d-7f347252244a"
## [2] "00160570-97e4-4394-8073-00b2c891ecf5"
## [3] "00358558-b7fc-4165-a4e3-fd1ec26ed51a"
## [4] "0039f296-c7c7-410d-bb37-a1f4b1eb5976"
## [5] "003e3429-7601-4d7e-a996-fb1676f58f0f"
## [6] "00509e30-743b-42ff-b86f-9fa407a1470a"
## [7] "00656ea3-336d-4fa5-9070-3aa8d6fc522e"
## [8] "00d5b6be-144b-410b-8289-5bf40b0a1ee6"
## [9] "00d84f32-82ef-46ae-908b-880a727fafaf"
## [10] "00ebd1eb-f908-4853-a2a0-a1626746985f"
## [11] "01089f5d-33ef-4fc4-b044-6e1e1f72d698"
## [12] "01130a0e-782a-4ec2-b8c8-ac6d9a2f5337"
## [13] "0115c2ba-73c6-4c98-b539-9f67594980cf"
## [14] "01269b84-ebf1-4f9f-a100-1820e8bed267"
## [15] "014e470d-c343-4eec-97ce-28916a87dab0"
## [16] "01631b54-16a3-4a84-8dac-3e8d142e12b9"
## [17] "018bfc12-6b9f-4ccc-83e4-e002cfd72b6a"
## [18] "019128a1-d747-4ac2-842d-c0b30a4f8627"
## [19] "019a5cf0-8a13-4e7b-953d-5c16e3b2de99"
## [20] "01c1f112-3629-4527-a576-f84dbcda493e"
## [21] "01c48600-f9d3-42b6-8449-d2c5f40ef26f"
## [22] "01f26038-daed-4ba7-b1d8-efabc749cd17"
## [23] "01f58f60-9688-4b62-9de9-7e58aa0eeeb6"
## [24] "020957ea-4877-4009-872c-3cacfb6f8ded"
## [25] "02418c67-f8bb-48a8-88a3-2a5c6b485f78"
## [26] "02c18a51-61e7-4a20-9edf-bd0138e5002a"
## [27] "02ffb180-7fc4-40ce-9b86-44c8031eeaa8"
## [28] "033319be-8058-48b5-ba2d-23f3931a25d4"
## [29] "0336445d-a34c-4a4c-a83d-000a3ed1895d"
## [30] "0353426b-403d-4d4b-aac0-75afb55e1176"
## [31] "038c169b-606f-4478-99db-e3125319b321"
Packages come out like this, not all that helpful (yet)… we need another way to search the data base to make a little more sense of it. If we show a package we can get more information about the package.
package_show("038c169b-606f-4478-99db-e3125319b321")
## <CKAN Package> b9e8a89d-c022-4a93-a6a8-a4d6a224d026
## Title: QLD Current Authorities to Prospect for Petroleum (ATP), 6/3/2013
## Creator/Modified: 2016-03-23T01:01:40.170544 / 2019-11-20T02:42:51.072334
## Resources (up to 5): QLD Current Authorities to Prospect for Petroleum (ATP), 6/3/2013
## Tags (up to 5): BOUNDARIES, Cooper subregion, Galilee subregion, Queensland, economy
## Groups (up to 5):
We can see that within each of the packages there are number of resources. If we list all of the resources in the data base it is extensive - too many to look through. Instead we can do a more targeted search within the resources.
x <- resource_search(q = "name:foreign affairs", limit = 5)
x$results
## [[1]]
## <CKAN Resource> 401239c8-50fe-46e5-befd-11981aa9188b
## Name: Department of Foreign Affairs and Trade 2015-16 PAES
## Description:
## Creator/Modified: 2016-02-09T11:14:52.370311 / 2016-02-09T00:00:00
## Size:
## Format: excel (.xlsx)
##
## [[2]]
## <CKAN Resource> 8e5fe4f6-2047-4c4e-aa1d-ddaf2335b796
## Name: Department of Foreign Affairs and Trade PAES 2014-15
## Description:
## Creator/Modified: 2015-02-12T10:02:20.756066 /
## Size:
## Format: XLSX
##
## [[3]]
## <CKAN Resource> b1b051d4-552a-4276-9063-9699ecd707d9
## Name: Department of Foreign Affairs and Trade PBS 2015-16
## Description:
## Creator/Modified: 2015-05-12T11:22:44.035039 /
## Size:
## Format: XLSX
##
## [[4]]
## <CKAN Resource> eba5e32d-2c02-4cf3-8288-5a9ff63d42ac
## Name: Department of Foreign Affairs and Trade 2014 - 15 PBS
## Description:
## Creator/Modified: 2014-05-13T10:44:49.166867 /
## Size:
## Format: XLSX
##
## [[5]]
## <CKAN Resource> 9475a3f7-ba5f-46e4-9391-9f2a82cea51b
## Name: Foreign Affairs and Trade 2016-17 PBS.xlsx
## Description:
## Creator/Modified: 2019-03-20T16:54:14.640637 / 2019-03-20T16:54:14.330457
## Size: 77073
## Format: XLSX
As a point to note you can search the Name , Description and Format fields to offer more options to search within resources.
Let’s say that we want to look last entry at the last entry, the PBS from 2016-17. It’s an xlsx format file. The first thing we will have to do is to get the url. The resource number is the
url<-resource_show(id ="9475a3f7-ba5f-46e4-9391-9f2a82cea51b")
url$url
## [1] "https://data.gov.au/data/dataset/0c516e3a-0fb0-45a4-ac3f-0ab33c265885/resource/9475a3f7-ba5f-46e4-9391-9f2a82cea51b/download/foreign-affairs-and-trade-2016-17-pbs.xlsx"
Now we have the url for the data so let’s import it.
url <- ("https://data.gov.au/data/dataset/0c516e3a-0fb0-45a4-ac3f-0ab33c265885/resource/9475a3f7-ba5f-46e4-9391-9f2a82cea51b/download/foreign-affairs-and-trade-2016-17-pbs.xlsx")
destfile <- ("foreign_affairs_and_trade_2016_17_pbs.xlsx")
curl::curl_download(url, destfile)
foreign_affairs_and_trade_2016_17_pbs <- read_excel(destfile, col_names = c("Item", "2015-2016", "2016-2017"), skip = 2)
View(foreign_affairs_and_trade_2016_17_pbs)
As part of the import we did a little tidying up of the headers with the column names.
And there we have it, the Preliminary Budget Statement data for the department of foriegn affairs and trade for 2016-2017.
As you can see, the CKAN data base is extensive and it is used by governments at all levels all over the world. Some private organisations are using it too.
Here are some links that help us to download other data types from CKAN data bases.
1.https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio#importing-excel 2.https://extensions.ckan.org/extension/ckanr/ 3.https://support.rstudio.com/hc/en-us/categories/200035113-How-To-Articles 4.https://cran.r-project.org/web/packages/ckanr/ckanr.pdf