goveda: a package for the use and consumption of open data

José F. Zea
March 4th, 2017

Datos abiertos Colombia

Open data platforms

Service platforms enable customers and users to download data in a wide range of open formats (CSV, JSON, etc.), so there is no risk of lock-in at the data level.

  • Socrata is a software-as-a-service platform that provides a cloud-based solution for open data publishing and visualization

  • CKAN is an open source project, developed by the Open Knowledge Foundation, that lets users provision open data catalogs and, in some cases, visualizations and APIs.

  • Successful cases: the Colombia, Mexico and Paraguay open data portals, among others.


Basic functions

  • get_socrata_metada: fetches detailed metadata for open datasets (extracts the JSON and a basic list of the datasets available from a Socrata domain).

  • search_data: returns a list of the available datasets that match the given keywords/tags. The list contains four data frames (tabular, geo, href and blobby).

  • eda_opendata: generates a basic report in rmarkdown or shiny for selected datasets from an open data portal.

Future: a basic recommendation system.


get_socrata_metada

Description

Fetches detailed metadata for open datasets (extracts the JSON and a basic list of the datasets available from a Socrata domain).

Usage

get_socrata_metada(url)

Arguments

  • url - A Socrata URL. This simply points to the site root.

Details

An R data frame containing a listing of datasets along with detailed metadata. The following fields are preserved for every open dataset:

  • id, name, attribution, averageRating, category, createdAt, description, displayType, downloadCount, hideFromCatalog, hideFromDataJson
  • indexUpdatedAt, licenseId, newBackend, numberOfComments, oid, provenance, publicationAppendEnabled, publicationDate, publicationGroup
  • publicationStage, rowsUpdatedAt, rowsUpdatedBy, tableId, totalTimesRated, viewCount, viewLastModified, viewType, columns, grants, license, metadata
  • owner, query, rights, tableAuthor, tags, flags, attributionLink, blobFilename, blobFileSize, blobId, blobMimeType, rowClass, previewImageId, childViews, displayFormat, iconUrl, rowIdentifierColumnId, ratings, disabledFeatureFlags, accessLevel, landingPage, issued, @type, modified
  • keyword, contactPoint, publisher, identifier, title, distribution, theme

Value

The function returns a data frame with detailed information about the datasets.
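As a rough illustration of where such a listing can come from: Socrata sites publish a catalog at the data.json path under the site root. The sketch below is not the package's implementation. catalog_url and fetch_catalog are hypothetical helpers, the jsonlite package is an assumed dependency, and https://example.org stands in for a real portal.

```r
# Hypothetical helper: build the data.json catalog URL from a site root.
catalog_url <- function(url) {
  root <- sub("/+$", "", url)  # drop any trailing slashes
  paste0(root, "/data.json")
}

# Hypothetical helper: download the catalog and return its dataset listing.
# Assumes the jsonlite package is installed and the portal is reachable.
fetch_catalog <- function(url) {
  catalog <- jsonlite::fromJSON(catalog_url(url))
  catalog$dataset  # one row per dataset (title, keyword, publisher, ...)
}

# Placeholder site root, not a real portal.
catalog_url("https://example.org/")
```

Only catalog_url runs offline; fetch_catalog needs network access and is shown to indicate the shape of the call.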

search_data

Description

Returns a list of the available datasets that match the given keywords/tags. The list contains four elements:

  • tabular datasets: a data frame describing each dataset (dataset name, attribution (public agency), number of rows and columns, size of the dataset, and elapsed time for export to R).
  • geo: a data frame with general information about the geo files.
  • href: a data frame with general information about the href files.
  • blobby: a data frame with general information about the blobby files.

Usage

search_data(metadata, tags)

Arguments

  • metadata - a data frame obtained from get_socrata_metada
  • tags - a character vector with the keywords and/or tags to search for in the dataset name, tag field or public agency field

Details

The selected sample is drawn according to a selection-rejection (list-sequential) algorithm.

Value

The function returns a list of four data frames, containing information on the tabular, geo, href and blobby data of the open data portal.
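The keyword search described in the Arguments can be pictured with a minimal base R sketch. It is an assumption about the matching step only, not the package's code: match_tags is a hypothetical helper, the toy data frame is invented for illustration, and the tags column is assumed to be plain character.

```r
# Hypothetical helper: keep rows whose name, tags or attribution mention
# any of the requested keywords (case-insensitive, literal match).
match_tags <- function(metadata, tags) {
  fields <- c("name", "tags", "attribution")
  hit <- rep(FALSE, nrow(metadata))
  for (tag in tags) {
    for (field in fields) {
      text <- tolower(as.character(metadata[[field]]))
      hit <- hit | grepl(tolower(tag), text, fixed = TRUE)
    }
  }
  metadata[hit, , drop = FALSE]
}

# Toy metadata, invented for illustration only.
toy <- data.frame(
  id = c("a", "b", "c"),
  name = c("Cinemas in Bogota", "Water supply", "Pharmacies"),
  tags = c("culture", "water", "health"),
  attribution = c("City hall", "Agency of Bogota", "Health office"),
  stringsAsFactors = FALSE
)

# Rows "a" (match in name) and "b" (match in attribution) are kept.
match_tags(toy, tags = "bogota")$id
```

Searching several fields at once would explain why a query can return datasets whose names do not contain the keyword.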

The output (search_data)

  • Output for tabular data (searching for Bogotá datasets):

> search_data(metadata = metada_colombia, tags = "Bogotá")[[1]]
id | name | n_row | n_col | size | elapsed_time
229w-qzrf | DATOS ABIERTOS BOGOTA | 16 | 11 | 2.2 Kb | 1.72
45pp-fbx3 | RESERVA DE CUPOS | 18 | 7 | 23.8 Mb | 77.69
63i4-nng2 | DNP-BASE EJEC PTAL 02-15 (ACT 30 ABRIL 15) | 9 | 6 | 4.5 Mb | 22.13
9vy2-biux | LISTA DE JUNTAS DE ACCION COMUNAL MUNICIPIO DE TESALIA - HUILA | 12 | 9 | 6.1 Kb | 0.92
b4cc-cqqu | MEDICION 2 SENSÓRICA CONDUCTUAL CHINCHINA | 10 | 11 | 8.1 Kb | 0.95
d8dw-68hx | SALAS DE CINE Y CINEMATECAS EN LA CIUDAD DE BOGOTÁ | 13 | 3 | 21.5 Kb | 0.90
dmmd-s8ju | ENTIDADES ACREDITADAS CON AVAL A 31 DE ENERO DE 2017 | 17 | 4 | 60.8 Kb | 1.00
e2it-6w34 | REGISTRO USUARIOS DE ASISTENCIA TÉCNICA BUSBANZA | 7 | 5 | 37.4 Kb | 0.92
f8ve-dac2 | CADENAS PRODUCTIVAS DE TUTAZÁ | 3 | 1 | 5 Kb | 0.98
gr6q-6jhn | PARTICIPACIÓN PROVEEDORES POR DEMANDA | 2 | 13 | 3.7 Kb | 0.87
ih48-erzn | SÍFILIS GESTACIONAL EN EL DEPARTAMENTO DE NARIÑO AÑO 2008 A 2015 (GOBERNACIÓN DE NARIÑO) | 15 | 8 | 20.9 Kb | 0.90
m8cr-nt44 | ACUEDUCTOS VEREDALES CHÍQUIZA | 4 | 12 | 3.6 Kb | 0.97
qqbd-442m | ESTACIONES DE SERVICIO DE MISTRATÓ RISARALDA | 1 | 11 | 1.3 Kb | 0.99
s52q-9chx | OFERTA HIDRICA JUNIO 2015 | 5 | 10 | 26.9 Kb | 0.98
snc5-vevu | CONSOLIDADO DE PROCESOS DE SELECCIÓN | 14 | 2 | 328 Kb | 1.25
tdkw-xhkb | LISTADO DE DROGUERIAS -TUNJA | 11 | 2 | 9.2 Kb | 0.92
tg45-q549 | ROTACIÓN DEPORTIVA ESCOLAR COPACABANA 2016 | 8 | 13 | 5.2 Kb | 1.00
udix-2txh | INTERNET POR SUSCRIPCIÓN Y TECNOLOGÍA | 6 | 11 | 2.1 Kb | 1.00
wheb-axi3 | SITIOS TURÍSTICOS-6 | 2 | 14 | 5.9 Kb | 0.95

eda_opendata

Description

Generates a basic report in rmarkdown or shiny for selected datasets from an open data portal.

Usage

eda_opendata(metadata, ids, tags)

Arguments

  • metadata - a data frame obtained from get_socrata_metada
  • ids - a character vector with the ids of the required datasets. If both ids and tags are NULL, all datasets are analysed.
  • tags - a character vector with the tags/keywords required for the analysis. If both ids and tags are NULL, all datasets are analysed.

Details

The selected sample is drawn according to a selection-rejection (list-sequential) algorithm.

Value

The function returns a list of four data frames, containing information on the tabular, geo, href and blobby data of the open data portal.

Case study: Datos abiertos - Colombia

Data types

  • As of 2017-03-04, most datasets are tabular.
  • A frequency analysis of the data types in the Colombia open data portal is shown below:

viewType count percent
tabular 3543 82.0
href 574 13.3
blobby 162 3.7
geo 44 1.0
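A frequency table of this kind can be produced with base R along the following lines. This is a sketch rather than the code behind the slide: the view_type vector is rebuilt from the counts in the table, standing in for the viewType column of the metadata.

```r
# Counts copied from the table, expanded into one value per dataset.
view_type <- rep(c("tabular", "href", "blobby", "geo"),
                 times = c(3543, 574, 162, 44))

# Tabulate, order by decreasing frequency and add percentages.
counts <- sort(table(view_type), decreasing = TRUE)
freq <- data.frame(
  viewType = names(counts),
  count = as.integer(counts),
  percent = round(100 * as.integer(counts) / sum(counts), 1),
  stringsAsFactors = FALSE
)
freq
```

The counts add up to 4323 datasets, which is the total used for the percentages.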

Open Data in Colombia is new!

  • New open data is added to the “datos abiertos” portal every day:

[Figure: datasets added to the portal over time]

Main topics

  • Some common topics in Colombia open data are:

[Figure: common topics in Colombia open data]

View Count Distribution

  • Distribution of the number of views of Colombian datasets:

[Figure: distribution of dataset view counts]

Summary of view counts

  • Descriptive statistics for the view counts are presented below:

Stats
nbr.val 4323.0
nbr.null 0.0
nbr.na 0.0
min 1.0
max 25932.0
range 25931.0
sum 375981.0
median 26.0
mean 87.0
SE.mean 10.9
CI.mean.0.95 21.4
var 513229.0
std.dev 716.4
coef.var 8.2
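The rows of this summary are related by simple formulas, which the base R sketch below spells out. The describe function is a hypothetical helper written for illustration (the row labels resemble the output of stat.desc in the pastecs package, but that is an assumption), and the input vector is toy data, not the portal's view counts.

```r
# Hypothetical helper: descriptive statistics with the same row labels.
describe <- function(x, p = 0.95) {
  x_ok <- x[!is.na(x)]
  n <- length(x_ok)
  se <- sd(x_ok) / sqrt(n)  # standard error of the mean
  c(
    nbr.val = n,
    nbr.null = sum(x_ok == 0),  # number of zero values
    nbr.na = sum(is.na(x)),
    min = min(x_ok),
    max = max(x_ok),
    range = max(x_ok) - min(x_ok),
    sum = sum(x_ok),
    median = median(x_ok),
    mean = mean(x_ok),
    SE.mean = se,
    CI.mean = qt((1 + p) / 2, df = n - 1) * se,  # half-width of the CI
    var = var(x_ok),
    std.dev = sd(x_ok),
    coef.var = sd(x_ok) / mean(x_ok)
  )
}

# Toy input, invented for illustration only.
stats <- describe(c(1, 2, 3, 4, 10))
round(stats, 2)
```

The same relations hold in the table: 716.4 / sqrt(4323) gives the SE.mean of about 10.9, and 716.4 / 87.0 gives the coef.var of about 8.2.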

Most frequently viewed datasets:

[Figure: most frequently viewed datasets]

Most downloaded datasets

[Figure: most downloaded datasets]