2019-04-30

About me

  • Dr Julien Colomb, data curator
  • @j_colomb
  • Berlin Open Science Meetup organiser
  • steering committee OSMOOC
  • worked for the Uni Jena RDM helpdesk for 8 months: https://rdmpromotion.rbind.io (a Hugo-based website)
  • freelance teacher at A2P
  • ex-neurobiologist, worked with Y. Winter a couple of times

Open science ?

Why open research data

  • requested by funders
  • increase visibility/citations
  • defence against accusation of fraud, build trust
  • foster collaboration
  • allows preliminary analyses and meta-analyses
  • accelerates research!
  • when well organised, speeds up the research process?!

Why not open research data

  • No time?
    • zip it + push it to Zenodo + add authors = 30 minutes
  • Fear that people will find errors?
    • If there are errors, do you really want no one to find them?
  • Ashamed of the state of the data?
    • Design your work to be shared (next time)
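The "zip it" step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the folder name `my_dataset`, the helper `package_dataset` and the metadata fields are hypothetical, and the upload to Zenodo itself is left out.

```python
import json
import shutil
import tempfile
from pathlib import Path


def package_dataset(data_dir: Path, out_dir: Path, authors: list[str]) -> Path:
    """Zip a data folder and write a small author list next to the archive."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # make_archive appends ".zip" to the base name and returns the archive path.
    archive = shutil.make_archive(str(out_dir / data_dir.name), "zip", root_dir=data_dir)
    # Illustrative metadata only; these field names are not a repository schema.
    metadata = {"title": data_dir.name, "authors": authors}
    (out_dir / f"{data_dir.name}_metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return Path(archive)


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "my_dataset"
    data_dir.mkdir()
    (data_dir / "raw.txt").write_text("animal,score\nA1,0.7\n", encoding="utf-8")
    archive_path = package_dataset(data_dir, Path(tmp) / "upload", ["A. Researcher"])
    archive_name = archive_path.name
    archive_exists = archive_path.exists()
```

The archive is written outside the data folder so that it never ends up inside itself.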

Two real stories

https://www.frontiersin.org/articles/10.3389/fnsys.2017.00100/full

A reviewer asked for all figures with points instead of boxplots.

  • create 11 boxplot images with points
  • concatenate them into the 5 figures
  • upload them online, privately
  • get a link giving access for the reviewer
  • https://figshare.com/s/7681d9bea3bd62220fd5
  • How long ?

  • took 10 minutes

https://peerj.com/articles/1971/

Needed to filter all data with a pretraining score above 0.66

  • 10 experiments to deal with
  • filter data
  • make statistics
  • create boxplots (with sample sizes and stars for statistically significant differences)
  • How long ?

  • took 30 minutes

How is it possible? Reproducible reports!

  1. set variables and data path
  2. raw data loaded
    • putative filtering
  3. analysis performed
  4. statistics produced
  5. graphs produced via reusable code or a function
    • Same look for all figures
    • using a different function -> different visualisation
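The five steps above can be sketched as a small script. This is a minimal sketch, not the code behind the two papers: the column names, the toy values and all function names are made up for illustration, the filter direction (keep rows above the threshold) is an assumption, and a text summary stands in for the plotting function a real report would call in step 5.

```python
import csv
import io
import statistics

# 1. Set variables and data path (hypothetical names, toy values).
# A real report would open a file; an in-memory CSV keeps the sketch runnable.
RAW_DATA = io.StringIO(
    "experiment,group,pretraining_score,score\n"
    "exp1,control,0.80,0.52\n"
    "exp1,control,0.50,0.48\n"
    "exp1,mutant,0.90,0.31\n"
    "exp1,mutant,0.70,0.35\n"
)
MIN_PRETRAINING = 0.66  # change this one value and rerun everything


def load_raw(handle):
    """2. Load the raw data as a list of dict rows."""
    return list(csv.DictReader(handle))


def filter_rows(rows, min_pretraining):
    """2b. Optional filtering: keep rows above the pretraining threshold."""
    return [r for r in rows if float(r["pretraining_score"]) > min_pretraining]


def summarise(rows):
    """3-4. Group the scores and compute sample size and mean per group."""
    groups = {}
    for r in rows:
        groups.setdefault(r["group"], []).append(float(r["score"]))
    return {g: {"n": len(v), "mean": statistics.mean(v)} for g, v in groups.items()}


def text_report(summary):
    """5. One reusable output function gives every result the same look."""
    return [f"{g}: n={s['n']}, mean={s['mean']:.2f}" for g, s in sorted(summary.items())]


rows = filter_rows(load_raw(RAW_DATA), MIN_PRETRAINING)
report = text_report(summarise(rows))
print("\n".join(report))
```

Because the threshold lives in one variable and every output goes through one function, a new filtering request means changing a single line and rerunning the script.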

What is the relation with open data?

  1. set variables and data path
  2. raw data loaded

set variables and data path?

Data should not be lost

No backup, no pity ("Kein Backup, kein Mitleid")

  • 3 copies in 2 locations
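One possible way to apply the "3 copies in 2 locations" rule is to copy each raw file to two backup folders and check the copies against a checksum. The sketch below assumes that approach; the folder names are hypothetical stand-ins for a second disk and an off-site server, and `back_up` is an illustrative helper, not an existing tool.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(source: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy `source` into each backup folder and verify every copy."""
    expected = sha256(source)
    copies = []
    for folder in backup_dirs:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, folder / source.name))
        if sha256(copy) != expected:
            raise OSError(f"checksum mismatch for {copy}")
        copies.append(copy)
    return copies


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "lab_pc" / "raw_data.txt"
    original.parent.mkdir()
    original.write_text("time,x,y\n0,1,2\n", encoding="utf-8")
    # Stand-ins for a second local disk and an off-site location.
    copies = back_up(original, [Path(tmp) / "external_disk", Path(tmp) / "offsite_server"])
    n_copies = 1 + len(copies)  # original plus two backups
    all_identical = all(sha256(c) == sha256(original) for c in copies)
```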

Find the data, a practical case

Task: analyse home cage scan data using machine learning algorithms, and run them on the data produced so far.

Data from Prof. Steele, no data management:

  • only got uninteresting hourly summaries
  • grouping not clear

Data from the AOCF, with data management plan:

  • I found the data in minutes :)
  • Raw data were saved as a text file :)
  • Animal history and genotype were easy to get.
  • File naming rules exist :) but
    • they changed over time or were not followed completely :(
    • special characters were used :(

Find the data, a practical case

Solution: forget about the Steele data, move the AOCF data, rename some files (Prüß -> Pruess), create an index (semi-automatically), and add information about the animals to the index: https://www.overleaf.com/12764720rmhzqbpvtwxd

Now I can load the data.
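The renaming and the semi-automatic index could look roughly like the sketch below. It is an assumption about how such a clean-up might be scripted, not the script that was actually used: the transliteration table, the file names and the index columns are all hypothetical.

```python
import csv
import tempfile
import unicodedata
from pathlib import Path

# Assumed transliteration table for German special characters.
TRANSLITERATION = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}
)


def safe_name(name: str) -> str:
    """Replace special characters and spaces in a file name."""
    # NFC first, so that "u" + combining diaeresis is treated as "ü".
    composed = unicodedata.normalize("NFC", name)
    return composed.translate(TRANSLITERATION).replace(" ", "_")


def rename_and_index(data_dir: Path, index_path: Path) -> list[str]:
    """Rename files to safe names and write a CSV index of the results."""
    new_names = []
    for path in sorted(data_dir.iterdir()):
        if not path.is_file():
            continue
        target = path.with_name(safe_name(path.name))
        if target != path:
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            path.rename(target)
        new_names.append(target.name)
    with index_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "animal_id", "genotype"])
        for name in new_names:
            writer.writerow([name, "", ""])  # animal details are added by hand
    return new_names


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "aocf"
    data_dir.mkdir()
    (data_dir / "Prüß_cage01.txt").write_text("raw", encoding="utf-8")
    (data_dir / "Smith_cage02.txt").write_text("raw", encoding="utf-8")
    indexed = rename_and_index(data_dir, Path(tmp) / "index.csv")
```

The script only fills in the file column; the animal information still has to be entered manually, which is the "semi" in semi-automatic.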

Data should be accessible via the computer

  • Folder organisation
  • File naming
  • Data index

2. raw data loaded

Data format

  • What format is the raw data in?
  • What analysis will be performed?
  • How/where will the data be published/stored?

  • 10-year rule

Well-organised open data speeds up the research process?!

Who will analyse your data ?

My stupid future self will take care of that shit!

  • Your future self is talking shit about you.

Designed to be analysed and shared

  • Plan everything ahead:
    • Folder organisation and file naming
    • Data format along the data flow
    • Where will it be published ?
  • Apply and document your own rules:
    • Implementation is key
    • Check regularly that you are still following the rules
    • Document, document, document
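Checking that the rules are still followed can be automated. The sketch below assumes a hypothetical naming rule (date, experimenter, animal ID, ASCII characters only); the pattern and the example file names are invented for illustration and should be replaced by your own documented rule.

```python
import re

# Hypothetical rule: <YYYY-MM-DD>_<experimenter>_<animal id>.txt, ASCII only.
NAME_RULE = re.compile(r"\d{4}-\d{2}-\d{2}_[A-Za-z]+_[A-Za-z0-9]+\.txt")


def rule_breakers(file_names):
    """Return the names that do not match the documented naming rule."""
    return [name for name in file_names if NAME_RULE.fullmatch(name) is None]


names = [
    "2019-04-30_Pruess_m012.txt",  # follows the rule
    "2019-04-30_Prüß_m013.txt",    # special characters
    "final data v2.txt",           # no date, spaces
]
broken = rule_breakers(names)
print(broken)
```

Run on a schedule, or each time new data is added, such a check catches drifting file names before they pile up.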

FAIR open data

Time to upgrade your digital gear !