data science for the ocean health index

ben best <bbest@nceas.ucsb.edu>
2014-04-15 in santa barbara, ca

overview

  1. ocean health index
  2. data flow
  3. data wrangling
  4. distributed development

what is a healthy ocean?

  • pristine?

  • “a healthy ocean sustainably delivers a range of benefits to people now and in the future”

  • quantify using publicly available datasets (mostly)
  • halpern et al. (2012) nature.
  • rolling out toolbox for individual countries to formulate own index
  • “reproducible research”

what do we measure?

goals

  1. natural products
  2. food provision
  3. artisanal fishing
  4. carbon storage
  5. coastal protection
  6. tourism & recreation
  7. livelihoods & economies
  8. sense of place
  9. biodiversity
  10. clean waters

process

  • set a reference point
  • index = avg(status, future)
  • future ~ status, trend, resilience (+), pressures (-)
  • pressure = matrix of weights, goals vs pressures
  • resilience = matrix of weights, goals vs resilience
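
the bullets above can be sketched in a few lines of base r. this is a simplified illustration, not the toolbox code: every number is made up, the weighted-mean rule for pressures is an assumption, and the weight `beta = 0.67` and the form of the future term are recalled from halpern et al. (2012) rather than stated on this slide. resilience is given directly here, although it would be built from its own weight matrix in the same way.

```r
# all values below are made up for illustration; they are not ohi data.
# rows = goals, columns = pressure layers; entries are weights (0 = not relevant)
weights <- matrix(c(3, 1, 0,
                    0, 2, 2),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("food provision", "clean waters"),
                                  c("fishing", "pollution", "habitat loss")))

# pressure layer values for one region, scaled 0..1
layers <- c(fishing = 0.8, pollution = 0.4, "habitat loss" = 0.2)

# assumption: a goal's pressure is the weighted mean of its relevant layers
pressures <- as.vector(weights %*% layers) / rowSums(weights)
names(pressures) <- rownames(weights)

# per-goal inputs (hypothetical), in the same goal order as the matrix rows
status     <- c("food provision" = 0.6, "clean waters" = 0.8)  # vs. reference point
trend      <- c(-0.10, 0.05)
resilience <- c(0.5, 0.7)
beta       <- 0.67  # assumed weight of trend relative to resilience - pressures

# future ~ status, trend, resilience (+), pressures (-), clamped to 0..1
future <- status * (1 + beta * trend + (1 - beta) * (resilience - pressures))
future <- pmin(pmax(future, 0), 1)

# goal score = avg(status, future); overall index assumed to weight goals equally
score <- (status + future) / 2
index <- mean(score)
print(round(score, 3))
print(round(index, 3))
```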

flow

  • darren hardy (https://github.com/drh-stanford)
  • linear-flow: swiss-army knife for scientific workflows
  • linux file server with ssd, postgresql
  • data/
    • raw/ -> ingest/ -> stable/ -> model/
      • [study area]-[provider]-[product]_[version]/
        eg “GL-WorldBank-Statistics_v2012/”
        • manual_output/, tmp/ -> data/
        • scripts.*: setup, download, import, ingest, model, run, digest, export, report, upload, finish
        • *.languages: sh, py, r, sql, pl, rb
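
to make the naming and ordering conventions above concrete, here is a minimal base r sketch. it is not linear-flow's implementation: `package_dir()` and `order_scripts()` are hypothetical helpers, and treating the listed stage names as the run order is an assumption drawn from the order on the slide.

```r
# stage names copied from the slide, in the order listed there
stages <- c("setup", "download", "import", "ingest", "model", "run",
            "digest", "export", "report", "upload", "finish")

# hypothetical helper: build "[study area]-[provider]-[product]_[version]/"
package_dir <- function(area, provider, product, version) {
  sprintf("%s-%s-%s_%s/", area, provider, product, version)
}

# hypothetical helper: keep only files named after a stage, sorted by stage order
order_scripts <- function(files) {
  stage <- tools::file_path_sans_ext(basename(files))
  keep <- stage %in% stages
  files[keep][order(match(stage[keep], stages))]
}

print(package_dir("GL", "WorldBank", "Statistics", "v2012"))
print(order_scripts(c("model.R", "notes.txt", "import.py")))
```

files whose base name is not a stage (here `notes.txt`) are simply skipped, so notes and other extras can sit alongside the scripts.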

flow: install & run

python setup.py install
cd example
flow
Running flow [rootdir=~/linear-flow-master/example] depth=0 nproc=1

Running package with default style [~/linear-flow-master/example]

Running python script import.py
Generating some dummy data in data.csv...

Running R script model.R
Running R script model.r

flow: principles

  • extract, transform, load (etl)
  • makefile
  • common output format: data/*.csv
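
as an illustration of the etl idea with csv as the common output, a base r sketch follows. the table, the file name `values.csv`, and the use of a temporary directory are all made up for the example.

```r
# extract: a tiny made-up table stands in for a downloaded raw file
raw <- data.frame(region = c("a", "b", "c"),
                  value  = c("1.5", "n/a", "3.0"),
                  stringsAsFactors = FALSE)

# transform: coerce to numeric and drop rows that could not be parsed
clean <- raw
clean$value <- suppressWarnings(as.numeric(clean$value))
clean <- clean[!is.na(clean$value), ]

# load: write to data/*.csv (under a temp dir so the example leaves no files behind)
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "values.csv")
write.csv(clean, out_file, row.names = FALSE)

print(read.csv(out_file))
```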

data wrangling

  • task: calculate the batting average (AVG): number of base hits (H) divided by the total number of at bats (AB) using the Lahman baseball database. limit to Babe Ruth and Jackie Robinson.

  • setup

library(Lahman)
library(dplyr)
library(RSQLite)

  • result
  nameFirst nameLast    avg
      Babe      Ruth  0.323
    Jackie  Robinson  0.308

data wrangling: sql

tbl(lahman_sqlite(), sql(
"SELECT nameFirst, nameLast, 
  ROUND(AVG(H/(AB*1.0)), 3) AS avg 
FROM Batting
JOIN Master USING (playerID)
WHERE AB > 0 AND (
  (nameFirst = 'Babe' AND 
   nameLast = 'Ruth') OR 
  (nameFirst = 'Jackie' AND 
   nameLast = 'Robinson')) 
GROUP BY nameFirst, nameLast
ORDER BY avg DESC"))

data wrangling: dplyr

  • chaining (%.%): grammar of data manipulation

Batting %.%
  merge(Master, by='playerID') %.%
  filter(
    AB > 0 &
    ((nameFirst=='Babe' & 
      nameLast =='Ruth') | 
     (nameFirst=='Jackie' & 
      nameLast =='Robinson'))) %.%  
  group_by(nameFirst, nameLast) %.%
  summarise(avg = round(mean(H/AB), 3)) %.%
  arrange(desc(avg))
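
note that both versions average the per-season ratio H/AB, which is not the same as total hits over total at bats. the base r sketch below shows the difference on made-up seasons for two hypothetical players "a" and "b"; it does not use the Lahman data.

```r
# made-up seasons: player "a" has one short and one long season
seasons <- data.frame(player = c("a", "a", "b", "b"),
                      H      = c(10, 150, 50, 60),
                      AB     = c(20, 500, 200, 200))

# mean of season averages (what mean(H/AB) computes within each group)
mean_of_ratios <- tapply(seasons$H / seasons$AB, seasons$player, mean)

# career average: total hits divided by total at bats
ratio_of_sums <- tapply(seasons$H, seasons$player, sum) /
                 tapply(seasons$AB, seasons$player, sum)

print(round(cbind(mean_of_ratios, ratio_of_sums), 3))
```

the two measures agree only when every season has the same number of at bats, as for player "b" here.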

distributed dev: fork & pull

direction  org web                     user web                     user local
           github.com/[org]/[repo]     github.com/[user]/[repo]     ~/github/[repo]
-> (1x)                             -> fork                      -> clone
<-         merge {admin}            <- pull request              <- push, <-> commit

where:

  • [org] is an organization (eg ohi-science)
  • [repo] is a repository in the organization (eg ohiprep)
  • [user] is your github username (eg bbest)

github features

  • track changes, issues, etc. free for public repos
  • limits: 1GB per repo, 100MB per file, so larger (and binary) files stay on the file server, with a remote vpn option
  • render markdown, eg README.md
  • csv view
  • geojson view

for more...