govEDA: A package for use and consumption of government open data

Case study: Datos abiertos - Colombia

José Fernando Zea, PhD(c.) - USTA, DNP

Open data platforms

Service platforms let users download data in many open formats (CSV, JSON, etc.), so there is no risk of lock-in at the data level.

  • Socrata is software that provides a cloud-based solution for open data publishing and visualization.

  • CKAN is an open source project, developed by the Open Knowledge Foundation, that lets users provision open data catalogs and, in some cases, visualizations and APIs.

  • Successful cases: Colombia open data portal, Mexico open data portal, Paraguay Open Data portal, etc.

Basic functions

  • get_metada: Fetch detailed metadata of open datasets (extracts the JSON and a basic list of the datasets available from a Socrata domain).
  • search_data: Returns a list object with the available datasets matching keywords/tags. The list contains four data frames (tabular, geo, href and blobby).
  • eda_opendata: Generate a basic report in rmarkdown or shiny with selected datasets from an open data portal.
  • sizeTabularData: Compute the number of rows and columns, the size and the elapsed system time of every selected dataset.
  • ReadTabularData: Read a tabular dataset from an open data portal into R.
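
The kind of measurement that sizeTabularData describes can be sketched in base R. This is only an illustration under stated assumptions: `size_tabular_sketch` is a hypothetical helper, not the package's implementation, and a temporary CSV file stands in for a dataset downloaded from a portal.

```r
# Hypothetical helper (not the package's actual code): read one CSV file and
# report its dimensions, in-memory size and elapsed read time.
size_tabular_sketch <- function(path) {
  # system.time() evaluates the assignment here, so `dat` is available below.
  timing <- system.time(dat <- read.csv(path, stringsAsFactors = FALSE))
  data.frame(
    n_row = nrow(dat),
    n_col = ncol(dat),
    size = format(object.size(dat), units = "Kb"),
    elapsed_time = unname(timing["elapsed"]),
    stringsAsFactors = FALSE
  )
}

# A small temporary file stands in for a downloaded dataset.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5, y = letters[1:5]), tmp, row.names = FALSE)
info <- size_tabular_sketch(tmp)
print(info)
```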

Future: a basic recommendation system.

get_metada

Description Fetch detailed metadata of open datasets (extracts the JSON and a basic list of the datasets available from a Socrata/CKAN domain).

Usage get_socrata_metada(url)

Arguments

  • url - A Socrata/CKAN URL. In the simplest case this points to the site root.

Details An R data frame containing a listing of datasets along with detailed metadata. The following fields are preserved for every open dataset:

  • id, name, attribution, averageRating, category, createdAt, description, displayType, downloadCount, hideFromCatalog, hideFromDataJson
  • viewCount, viewLastModified, viewType, columns, grants, license, metadata, ratings, disabledFeatureFlags, accessLevel, landingPage, issued, @type, modified
  • keyword, contactPoint, publisher, identifier, title, distribution, theme

And others…

Value The function returns a data frame with detailed information about datasets.
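
As a rough illustration of how nested metadata records can end up in one data frame, the base R sketch below flattens a list of records and keeps a fixed set of fields. The records, their values and the helper `get_field` are invented for this example; the package may build its data frame differently.

```r
# Toy records standing in for parsed JSON metadata. Field names follow the
# list above; the values are invented for illustration.
records <- list(
  list(id = "aaaa-0001", name = "Dataset A", viewType = "tabular", viewCount = 120),
  list(id = "bbbb-0002", name = "Dataset B", viewType = "geo")  # no viewCount
)

fields <- c("id", "name", "viewType", "viewCount")

# Pull one field from one record, using NA when the field is absent.
get_field <- function(rec, field) {
  value <- rec[[field]]
  if (is.null(value)) NA else value
}

# Build one column per field, then bind the columns into a data frame.
columns <- lapply(fields, function(f) sapply(records, get_field, field = f))
names(columns) <- fields
metadata_sketch <- as.data.frame(columns, stringsAsFactors = FALSE)
print(metadata_sketch)
```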

search_data

Description Returns a list object with the available datasets matching keywords/tags. The list contains four data frames:

- Tabular datasets: A data frame describing each dataset: name, attribution (public agency), number of rows and columns, size of the dataset (Mb) and elapsed time for R export.
- Geo: A data frame with general information about geo files.
- Href: A data frame with general information about href files.
- Blobby: A data frame with general information about blobby files.

Usage

search_data(metadata, tags)

Arguments

  • metadata: A data frame obtained from the get_metadata function
  • tags: A character vector with the keywords and/or tags to search for in the dataset name, tag field or public agency field

Details The function finds the datasets associated with the given tags or keywords.

Value The function returns a list of 4 data frames. The 4 data frames contain information on the tabular, geo, href and blobby data of the open data portal.
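
The search-and-split behaviour described above can be sketched in base R as follows. Everything here is an assumption made for illustration: `search_sketch` and the toy metadata are hypothetical, tags are treated as case-insensitive regular expressions, and only the name and attribution fields are searched.

```r
# Toy metadata with invented rows; column names follow the metadata fields.
metadata_toy <- data.frame(
  id = c("a1", "b2", "c3", "d4"),
  name = c("SALAS DE CINE EN BOGOTA", "OFERTA HIDRICA",
           "MAPA DE LOCALIDADES", "INFORME ANUAL"),
  attribution = c("Alcaldia de Bogota", "IDEAM", "Alcaldia de Bogota", "DNP"),
  viewType = c("tabular", "tabular", "geo", "blobby"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch: keep rows whose name or attribution matches any tag,
# then split the matches into one data frame per view type.
search_sketch <- function(metadata, tags) {
  pattern <- paste(tags, collapse = "|")
  text <- paste(metadata$name, metadata$attribution)
  hits <- metadata[grepl(pattern, text, ignore.case = TRUE), , drop = FALSE]
  types <- c("tabular", "geo", "href", "blobby")
  result <- lapply(types, function(t) hits[hits$viewType == t, , drop = FALSE])
  names(result) <- types
  result
}

found <- search_sketch(metadata_toy, tags = "bogota")
print(found$tabular)
```

Returning a named list keeps the four types separate, so a caller can take only the tabular matches with `found$tabular` or `found[[1]]`.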

The output (search_data)

  • Example output of search_data (searching for datasets related to “Bogotá”):
  • R code: > search_data(metadata = metada_colombia, tags = "Bogotá")[[1]]
| id | name | n_row | n_col | size | elapsed_time |
|---|---|---|---|---|---|
| 229w-qzrf | DATOS ABIERTOS BOGOTA | 16 | 11 | 2.2 Kb | 1.72 |
| 45pp-fbx3 | RESERVA DE CUPOS | 18 | 7 | 23.8 Mb | 77.69 |
| 63i4-nng2 | DNP-BASE EJEC PTAL 02-15 (ACT 30 ABRIL 15) | 9 | 6 | 4.5 Mb | 22.13 |
| 9vy2-biux | LISTA DE JUNTAS DE ACCION COMUNAL MUNICIPIO DE TESALIA - HUILA | 12 | 9 | 6.1 Kb | 0.92 |
| b4cc-cqqu | MEDICION 2 SENSÓRICA CONDUCTUAL CHINCHINA | 10 | 11 | 8.1 Kb | 0.95 |
| d8dw-68hx | SALAS DE CINE Y CINEMATECAS EN LA CIUDAD DE BOGOTÁ | 13 | 3 | 21.5 Kb | 0.90 |
| dmmd-s8ju | ENTIDADES ACREDITADAS CON AVAL A 31 DE ENERO DE 2017 | 17 | 4 | 60.8 Kb | 1.00 |
| e2it-6w34 | REGISTRO USUARIOS DE ASISTENCIA TÉCNICA BUSBANZA | 7 | 5 | 37.4 Kb | 0.92 |
| f8ve-dac2 | CADENAS PRODUCTIVAS DE TUTAZÁ | 3 | 1 | 5 Kb | 0.98 |
| gr6q-6jhn | PARTICIPACIÓN PROVEEDORES POR DEMANDA | 2 | 13 | 3.7 Kb | 0.87 |
| ih48-erzn | SÍFILIS GESTACIONAL EN EL DEPARTAMENTO DE NARIÑO AÑO 2008 A 2015 (GOBERNACIÓN DE NARIÑO) | 15 | 8 | 20.9 Kb | 0.90 |
| m8cr-nt44 | ACUEDUCTOS VEREDALES CHÍQUIZA | 4 | 12 | 3.6 Kb | 0.97 |
| qqbd-442m | ESTACIONES DE SERVICIO DE MISTRATÓ RISARALDA | 1 | 11 | 1.3 Kb | 0.99 |
| s52q-9chx | OFERTA HIDRICA JUNIO 2015 | 5 | 10 | 26.9 Kb | 0.98 |
| snc5-vevu | CONSOLIDADO DE PROCESOS DE SELECCIÓN | 14 | 2 | 328 Kb | 1.25 |
| tdkw-xhkb | LISTADO DE DROGUERIAS -TUNJA | 11 | 2 | 9.2 Kb | 0.92 |
| tg45-q549 | ROTACIÓN DEPORTIVA ESCOLAR COPACABANA 2016 | 8 | 13 | 5.2 Kb | 1.00 |
| udix-2txh | INTERNET POR SUSCRIPCIÓN Y TECNOLOGÍA | 6 | 11 | 2.1 Kb | 1.00 |
| wheb-axi3 | SITIOS TURÍSTICOS-6 | 2 | 14 | 5.9 Kb | 0.95 |

eda_opendata

Description Generate a basic report in rmarkdown or shiny with selected datasets from an open data portal.

Usage

eda_opendata(metadata, ids, tags)

Arguments

  • metadata - a data frame obtained from the get_metadata function
  • ids - a character vector with the ids of the required datasets. If both ids and tags are NULL, all datasets are analyzed.
  • tags - a character vector with the tags/keywords required for the analysis. If both ids and tags are NULL, all datasets are analyzed.

Details An exploratory data analysis is carried out on a previously chosen subset of datasets.

Value An rmarkdown or shiny report is generated. Summary statistics and plots are generated by default.
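
The summary-statistics part of such a report can be sketched in base R. This is a minimal illustration, not the package's report code: the two data frames, their ids and the helper `summarise_numeric` are invented, and only numeric columns are summarised.

```r
# Invented datasets keyed by hypothetical ids.
datasets <- list(
  "aaaa-0001" = data.frame(views = c(10, 20, 30), city = c("A", "B", "C")),
  "bbbb-0002" = data.frame(rows = c(1, 2, 3, 4))
)

# Summarise every numeric column of one dataset: one row per column.
summarise_numeric <- function(dat) {
  numeric_cols <- dat[vapply(dat, is.numeric, logical(1))]
  stats <- vapply(
    numeric_cols,
    function(x) c(mean = mean(x), sd = sd(x), min = min(x), max = max(x)),
    numeric(4)
  )
  t(stats)
}

# One summary matrix per selected dataset.
report <- lapply(datasets, summarise_numeric)
print(report)
```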

Case study: Datos abiertos - Colombia

Data types

  • As of 2017-03-04, most datasets are of tabular type.
  • A frequency analysis of the dataset types in the Colombia open data portal is shown below:

viewType count percent
tabular 3543 82.0
href 574 13.3
blobby 162 3.7
geo 44 1.0
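
A frequency table of this shape can be computed in base R with `table()`. The sketch below rebuilds a viewType vector from the counts reported above (the real input would be the viewType column of the metadata data frame); the percentages it produces agree with the table after rounding to one decimal.

```r
# Counts taken from the table above; the vector is rebuilt only to have
# something to tabulate.
counts <- c(tabular = 3543, href = 574, blobby = 162, geo = 44)
view_type <- rep(names(counts), times = counts)

# Tabulate, order by frequency, and add the percentage column.
freq <- sort(table(view_type), decreasing = TRUE)
freq_table <- data.frame(
  viewType = names(freq),
  count = as.integer(freq),
  percent = round(100 * as.integer(freq) / sum(freq), 1),
  stringsAsFactors = FALSE
)
print(freq_table)
```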

Open Data in Colombia is new!

  • New open datasets are added every day to the “Datos Abiertos - Colombia” portal:

[Figure: open datasets added to the portal over time]

Main topics

  • Some common topics in Colombia open data are shown in the next wordcloud:

[Figure: word cloud of common topics]

View Count Distribution

  • Distribution of number of views of Colombian datasets:

[Figure: distribution of view counts]

Summary View Counts

  • A descriptive statistics summary of view counts is presented below:

Stats
nbr.val 4323.0
nbr.null 0.0
nbr.na 0.0
min 1.0
max 25932.0
range 25931.0
sum 375981.0
median 26.0
mean 87.0
SE.mean 10.9
CI.mean.0.95 21.4
var 513229.0
std.dev 716.4
coef.var 8.2
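
The derived rows of this summary follow from the standard formulas, which can be checked in base R from the reported n, sum and standard deviation. The sketch assumes the confidence interval row is the half-width of a 95% t-interval for the mean; under that assumption the rounded results agree with the table.

```r
# Values reported in the table above.
n <- 4323
total <- 375981
std_dev <- 716.4

mean_views <- total / n                      # mean = sum / n
variance <- std_dev^2                        # var = std.dev squared
se_mean <- std_dev / sqrt(n)                 # standard error of the mean
ci_half <- qt(0.975, df = n - 1) * se_mean   # assumed: 95% t-interval half-width
coef_var <- std_dev / mean_views             # coefficient of variation

derived <- c(
  mean = round(mean_views, 1),
  var = round(variance),
  SE.mean = round(se_mean, 1),
  CI.mean.0.95 = round(ci_half, 1),
  coef.var = round(coef_var, 1)
)
print(derived)
```

The mean (87.0) is far above the median (26.0) and the coefficient of variation is large, which indicates a strongly right-skewed distribution of views.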

Most Frequently Viewed Datasets:

[Figure: most frequently viewed datasets]

Most Downloaded Datasets

[Figure: most downloaded datasets]