class: center, middle, inverse

# Adopting open source practices for better science

### Pierce Edmiston
pedmiston@wisc.edu
sapir.psych.wisc.edu
github.com/pedmiston
---

# Outline

Open source practices that make for more reproducible science:

1. Version control
2. Dynamic documents
3. Building from source

Conclusion: It's worth it!

---

# Why I care about reproducibility

1. I want my research to be reproducible.
2. I want to attract collaborators.

???

I want my research to be reproducible by other people and by myself.
This means no undocumented steps! Document things in code for maximum
reproducibility.

I want the bar for getting involved in one of my projects to be as low
as possible.

---

# The reproducibility crisis

<small>Open Science Collaboration (2015). Estimating the reproducibility of psychological science. _Science_.</small>

<img src="presentation_files/figure-html/reproducibility-crisis-1.png" width="672" />

???

My interest in reproducibility is really more of an obsession or a
paranoia. I worry that my own research is not going to replicate.

---

# Why isn't psychological research more reproducible?

- Simmons, Nelson, & Simonsohn (2011). False-positive psychology. _Psychological Science_.
- Gelman & Loken (2013). The garden of forking paths. Unpublished manuscript.
- Palmeri (2016). Psychology is in crisis over whether it's in crisis. _WIRED_.
- Ioannidis (2005). Why most published research findings are false. _PLOS Medicine_.
- Edmiston (now). Publications are not answers to research questions. Unpublished thoughts.

???

The first reason is that there is a lot of flexibility in how we analyze
behavioral data. The second reason is that we tend to look until we find
something.

But of course there are plenty of people who disagree about whether the
state of psychology research is as bad as some say. And these problems
with publication bias and the file drawer effect are much more widespread
than just psychology.

Personally I think the culprit is the idea that publications are
definitive answers to research questions.

---

# Reproducibility is a big problem

<small>Munafò et al. (2017). A manifesto for reproducible science. _Nature_.</small>

<img src="presentation_files/figure-html/unnamed-chunk-2-1.png" width="672" />

---

# Steps toward reproducible science

- Blind data analysts to experiment conditions.
- Improve statistics education (adapted for the web).
- Hire methodological consultants.
- Seek collaboration for scalability.

---

# Why I think open source is the answer

Compare these two goals of reproducibility in science and in open source:

> Fellow researchers should be able to reproduce my work.

> Anyone should be able to use and contribute to this project.
???

Open source is the answer because the goal of reproducibility in open
source communities is actually loftier than the corresponding goal in
the scientific community.

---

# Version control

<img src="presentation_files/figure-html/unnamed-chunk-3-1.png" width="1152" />

---

# Pick your poison

- **git**
- mercurial
- subversion
- gitless

---

# Tools for climbing

<img src="presentation_files/figure-html/unnamed-chunk-4-1.png" width="672" />

???

There are a number of ways to think about version control. One way is as
a safety net: no matter what you do, you can always roll back to a
previous state. This is the power of the "undo" button.

However, that doesn't really get at why I think version control is such
a powerful tool. A better analogy is to think of version control as a
tool for climbing. The picture shows tools used by rock climbers called
"nuts": you jam one into a crack in the rock and then use it as a hold.
That's how I think about version control. It keeps you safer, but it
also lets you climb in places you otherwise couldn't.

---

# Conquer clutter

<img src="presentation_files/figure-html/unnamed-chunk-5-1.png" width="672" />

---

# Forks and branches
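A minimal sketch of what a branch-based workflow looks like on the command
line; the branch name, file name, and commit message here are made up for
illustration:

```bash
# try out an alternative analysis on its own branch
git checkout -b outlier-sensitivity

# edit the analysis script, then record the change
git add models.R
git commit -m "Refit models excluding RTs > 3 SD"

# if the new analysis is a keeper, fold it back into the main line
git checkout master
git merge outlier-sensitivity
```

If the alternative doesn't pan out, the branch can simply be abandoned and
the main line of work is untouched.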
---

# Submodules

.pull-left[

```
# key
parent_repo/
├── child_repo_1  # submodule 1
└── child_repo_2  # submodule 2

# example 1
talk or publication/
├── research_project_1
└── research_project_2
```

]

.pull-right[

```
# example 2
meta-analysis/
├── research_project_1
├── ...
└── research_project_n

# example 3
big-project/
├── *web-app*  -> also installed on web server
├── *lab-exp*  -> also installed on lab computers
├── *r-pkg*    -> installed by anyone who wants the data
├── conference
└── journal
```

]

---

# Version control's dirty little secret

(It only really works on plaintext files.)

<div class="figure">
<img src="presentation_files/figure-html/unnamed-chunk-7-1.png" alt="Excel panic. Well, did you make changes or didn't you??" width="672" />
<p class="caption">Excel panic. Well, did you make changes or didn't you??</p>
</div>

---

# The good news

Once you're working in plaintext, you can do lots of cool things.

- Full power of VCS (merge, blame, etc.).
- Use free and open source tools (Unix).
- Write dynamic documents.

---

# Dynamic documents

- Philosophy: DRY, Literate Programming
- Tools: Sphinx, Jupyter, Knitr, Pandoc

---

# Philosophy

.pull-left[

## Don't Repeat Yourself (DRY)

> Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

Hunt & Thomas (1999). _The pragmatic programmer_.

]

.pull-right[

## Literate programming (LP)

* Intermingle prose and code for a better understanding of the program.
* The explanation of a program does not need to resemble the program's structure.

Knuth (1983). _Literate programming_.

]

---

# Sphinx

**Python documentation generation**

* Python standard library: [json](https://docs.python.org/3/library/json.html)
* Third-party packages: [requests](http://docs.python-requests.org/en/master/api/#requests.request)

---

# Project Jupyter

**Web-based, language-agnostic lab notebook.**

<img src="presentation_files/figure-html/unnamed-chunk-8-1.png" width="672" />

---

# Knitr

**Elegant, flexible and fast dynamic report generation with R.**

Participants in condition A outperformed participants in condition B, `report_model_results(mod, param = "condition")`.

<img src="presentation_files/figure-html/unnamed-chunk-9-1.png" width="672" />

---

# Dynamic documents in practice

- Handouts
- Homework
- Supplemental materials
- Conference proceedings
- Journal papers

---

# A litmus test for reproducible research

Can you build the published paper without the original data?

(A sketch of what this looks like in practice is at the end of the deck.)

---

# Identical environments

.pull-left[

## Open source tools

- python
- R (S)
- octave (Matlab)

## Virtual environments

- Enthought
- Anaconda

]

.pull-right[

## Cloud computing

- Amazon Web Services
- OpenStack

## Infrastructure-as-code

- ansible
- terraform

]

---

# Examples of "Building from source"

1. Kaggle leaderboards
2. Totems game

---

# Does it work?

<small>McKiernan et al. (2016). How open science helps researchers succeed. _eLife_.</small>

<img src="presentation_files/figure-html/unnamed-chunk-10-1.png" width="672" />

---

# Cultural ratchets

.pull-left[

## Many species have elements of culture

- chimps
- whales
- crows

]

.pull-right[

## Only humans exhibit **cumulative** culture

- ratchet effect
- evolutionary process
- transmission fidelity

]

---

# Adopting open source practices for better science

Pierce Edmiston

<pedmiston@wisc.edu>  
[sapir.psych.wisc.edu](http://sapir.psych.wisc.edu)  
[github.com/pedmiston](https://github.com/pedmiston)

Open source practices that make for more reproducible science:

1. Version control
2. Dynamic documents
3. Building from source
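
---

# Building from source, sketched

A minimal sketch of the litmus test, assuming a project laid out like the
`big-project` example earlier. The repository URL, package path, and file
names are hypothetical.

```bash
# grab the project and its pinned submodules
git clone --recursive https://github.com/username/big-project.git
cd big-project

# install the R package that documents and ships the data
Rscript -e 'devtools::install("r-pkg")'

# knit the manuscript: every figure and statistic is regenerated from code
Rscript -e 'rmarkdown::render("journal/paper.Rmd")'
```

If the render succeeds on a fresh machine, the paper builds from source.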