data science for the ocean health index

ben best <bbest@nceas.ucsb.edu>
2014-04-15 in santa barbara, ca

overview

ocean health index
data flow
data wrangling
distributed development

what is a healthy ocean?

pristine?
“a healthy ocean sustainably delivers a range of benefits to people now and in the future”

quantify using publicly available datasets (mostly)
halpern et al. (2012) nature.
rolling out toolbox for individual countries to formulate own index
“reproducible research”

what do we measure?

goals

natural products
food provision
artisanal fishing
carbon storage
coastal protection
tourism & recreation
livelihoods & economies
sense of place
biodiversity
clean waters

process

set a reference point
index = avg(status, future)
future ~ status, trend, resilience (+), pressures (-)
pressure = matrix of weights, goals vs pressures
resilience = matrix of weights, goals vs resilience
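
the scoring steps above can be sketched in r. this is a minimal illustration, not the toolbox's actual code: the function names (`goal_pressure`, `goal_score`), the layer values, the weight matrix and the weight `beta` are all assumptions made for the example, and pressures are collapsed to a simple weighted mean.

```r
# hypothetical weight matrix: rows = goals, columns = pressure layers
# (0 = layer does not affect the goal, 3 = strong effect)
weights <- matrix(
  c(3, 1, 0,
    0, 2, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("food_provision", "clean_waters"),
                  c("fishing", "pollution", "habitat_loss")))

# hypothetical pressure layer scores for one region, each scaled 0..1
layers <- c(fishing = 0.8, pollution = 0.4, habitat_loss = 0.2)

# simplified pressure per goal: weighted mean of the layer scores
goal_pressure <- function(weights, layers) {
  drop(weights %*% layers[colnames(weights)]) / rowSums(weights)
}

# index = avg(status, future), where future is status adjusted by
# trend, resilience (+) and pressures (-); beta is an assumed weight
goal_score <- function(status, trend, resilience, pressures, beta = 0.67) {
  future <- (1 + beta * trend + (1 - beta) * (resilience - pressures)) * status
  future <- pmin(pmax(future, 0), 1)  # keep the future status within 0..1
  (status + future) / 2
}

pressures <- goal_pressure(weights, layers)
scores <- goal_score(status     = c(0.6, 0.5),
                     trend      = c(0.1, -0.2),
                     resilience = c(0.7, 0.7),
                     pressures  = pressures)
print(round(scores, 3))
```

the resilience matrix would be handled the same way as the pressure matrix: one row of weights per goal, applied to the resilience layers.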

flow

darren hardy (https://github.com/drh-stanford)
linear-flow: swiss-army knife for scientific workflows
linux file server with ssd, postgresql
data/
- raw/ -> ingest/ -> stable/ -> model/
  - [study area]-[provider]-[product]_[version]/
    eg “GL-WorldBank-Statistics_v2012/”
    - manual_output/, tmp/ -> data/
    - scripts.*: setup, download, import, ingest, model, run, digest, export, report, upload, finish
    - *.languages: sh, py, r, sql, pl, rb
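
the naming convention above can be checked with a few lines of r. this is a sketch only: `package_name` and `parse_package` are hypothetical helpers, not part of linear-flow, and the assumption that each stage folder holds the package directories is my reading of the layout.

```r
# build a package directory name: [study area]-[provider]-[product]_[version]
package_name <- function(study_area, provider, product, version) {
  sprintf("%s-%s-%s_%s", study_area, provider, product, version)
}

# split a package directory name back into its parts
# (assumes study area and provider contain no "-", and version no "_")
parse_package <- function(name) {
  name <- sub("/$", "", name)  # tolerate a trailing slash
  m <- regmatches(name, regexec("^([^-]+)-([^-]+)-(.+)_([^_]+)$", name))[[1]]
  if (length(m) == 0) stop("not a package directory name: ", name)
  setNames(m[-1], c("study_area", "provider", "product", "version"))
}

# the stages a package moves through: raw/ -> ingest/ -> stable/ -> model/
stages <- c("raw", "ingest", "stable", "model")
pkg    <- package_name("GL", "WorldBank", "Statistics", "v2012")
paths  <- file.path("data", stages, pkg)
parts  <- parse_package("GL-WorldBank-Statistics_v2012/")
print(paths)
print(parts)
```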

flow: install & run

python setup.py install
cd example
flow

Running flow [rootdir=~/linear-flow-master/example] depth=0 nproc=1

Running package with default style [~/linear-flow-master/example]

Running python script import.py
Generating some dummy data in data.csv...

Running R script model.R
Running R script model.r

flow: principles

extract, transform, load (etl)
makefile
common output format: data/*.csv
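
a minimal r sketch of these principles, assuming a single package whose scripts end by writing data/*.csv. the function names, the toy table and the temporary package directory are all made up for illustration; a real package would read its raw files instead.

```r
# extract: stand-in for a downloaded raw table (values are made up)
extract <- function() {
  data.frame(country = c("A", "B", "C"),
             value   = c("10", "n/a", "30"),
             stringsAsFactors = FALSE)
}

# transform: coerce to numeric and drop rows that fail to parse
transform_raw <- function(raw) {
  raw$value <- suppressWarnings(as.numeric(raw$value))
  raw[!is.na(raw$value), , drop = FALSE]
}

# load: write the common output format, data/*.csv
load_csv <- function(clean, pkg_dir, name) {
  out_dir <- file.path(pkg_dir, "data")
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out <- file.path(out_dir, paste0(name, ".csv"))
  write.csv(clean, out, row.names = FALSE)
  out
}

pkg_dir <- file.path(tempdir(), "example-package")  # throwaway location
out     <- load_csv(transform_raw(extract()), pkg_dir, "layer")
result  <- read.csv(out, stringsAsFactors = FALSE)
print(result)
```

because every package ends in the same csv layout, a downstream step only needs to know the path, not which language produced the file.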

data wrangling

task: calculate the mean per-season batting average (AVG) using the Lahman baseball database, where each season's AVG is the number of base hits (H) divided by the number of at bats (AB). limit to Babe Ruth and Jackie Robinson.

setup

library(Lahman)
library(dplyr)
library(RSQLite)

result

  nameFirst nameLast    avg
      Babe      Ruth  0.323
    Jackie  Robinson  0.308

data wrangling: sql

sql (csv2psql)

# mean of per-season batting averages for the two players
tbl(lahman_sqlite(), sql(
"SELECT nameFirst, nameLast,
  ROUND(AVG(H/(AB*1.0)), 3) AS avg
FROM Batting
JOIN Master USING (playerID)
WHERE AB > 0 AND (
  (nameFirst = 'Babe' AND
   nameLast = 'Ruth') OR
  (nameFirst = 'Jackie' AND
   nameLast = 'Robinson'))
GROUP BY nameFirst, nameLast
ORDER BY avg DESC"))

data wrangling: dplyr

chaining (%>%, formerly %.%): grammar of data manipulation

Batting %>%
  merge(Master, by='playerID') %>%
  # extra parentheses: & binds tighter than |, so AB > 0 must wrap both players
  filter(
    AB > 0 &
    ((nameFirst=='Babe' &
      nameLast =='Ruth') |
     (nameFirst=='Jackie' &
      nameLast =='Robinson'))) %>%
  group_by(nameFirst, nameLast) %>%
  summarise(avg = round(mean(H/AB), 3)) %>%
  arrange(desc(avg))
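
a base r sketch of what the grouped mean computes, on a made-up table rather than the Lahman data (players `p1` and `p2` and their totals are hypothetical). `mean(H/AB)` averages the per-season ratios; pooling hits and at bats first, `sum(H)/sum(AB)`, weights each season by its at bats and can give a different number.

```r
# hypothetical season totals for two made-up players
seasons <- data.frame(
  player = c("p1", "p1", "p2", "p2"),
  H      = c(  2, 150,  90, 100),
  AB     = c( 10, 450, 300, 320))

# mean of per-season averages (what mean(H/AB) computes after grouping)
by_season <- aggregate(avg ~ player,
                       data = transform(seasons, avg = H / AB),
                       FUN  = mean)

# pooled average: total hits over total at bats
totals <- aggregate(cbind(H, AB) ~ player, data = seasons, FUN = sum)
totals$avg <- totals$H / totals$AB

by_season$avg <- round(by_season$avg, 3)
totals$avg    <- round(totals$avg, 3)
print(by_season)
print(totals)
```

for `p1` the short first season pulls the per-season mean well below the pooled figure, which is the case where the two definitions diverge most.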

distributed dev: fork & pull

direction   org web                     user web                     user local
            `github.com/[org]/[repo]`   `github.com/[user]/[repo]`   `~/github/[repo]`
-> (1x)                                 -> fork                      -> clone
<-          merge {admin} <-            <- pull request              <- push, <-> commit

where:

[org] is an organization (eg ohi-science)
[repo] is a repository in the organization (eg ohiprep)
[user] is your github username (eg bbest)

github features

track changes, issues, etc. free for public repos
limits: 1 GB per repo, 100 MB per file, so larger (and binary) files stay on the file server, with a remote vpn option
render markdown, eg README.md
csv view
geojson view

for more...

ohi-science.org
- ohiprep: data layer preparation
- ohicore: r library of core functions
- ohigui: graphical user interface with shiny and rCharts
github.com/bbest / bbest.github.com / talks
- this talk, as r presentation
- productivity with rstudio: 1. data wrangling with dplyr, 2. documenting with markdown, 3. versioning with github
other
- marwick reproducible research
- github.com/datasets