Literate programming, version control, reproducible research, collaboration, and all that

Version: 2013-07-14 15:38:35

Literate programming, reproducible research, workflow management, version control, dependency control, and collaborative tools are a constellation of complementary and overlapping concepts and tools that are extremely useful for working computational scientists (which in the broad sense includes any scientist who uses a computer to do their work – i.e. all scientists). This document tries to lay out some of the bits and pieces, with an emphasis on reproducible research using R.

Literate programming (LP)

Literate programming (LP), the oldest of these ideas, was introduced by Donald Knuth both as a way to document code and as a new way to write code. Although (according to Wikipedia and its sources) LP is supposed to mean more than just a system for automatic generation of documentation and (nicely formatted) code, it is often used in that lowest-common-denominator sense. noweb is one of the oldest tools for literate programming. It is language-agnostic and relatively markup-agnostic (TeX, LaTeX, HTML, troff) (see Targets below).

Literate programming seems relatively inactive at present, having been largely superseded (for better or worse) by the related activities of reproducible research and documentation generation (see comments on StackOverflow). (When asked in an interview Donald Knuth refers to a comment by Jon Bentley that “a small percentage of the world's population is good at programming, and a small percentage is good at writing; apparently I [Knuth] am asking everybody to be in both subsets” [I can't find the original source].)

Reproducible research (RR)

RR was partly inspired by LP, but has a different scope – its ultimate output is a document describing or illustrating scientific research (a journal article, technical report, class notes, blog post, etc.) rather than a documented piece of software. Proximally, a reproducible document tends to consist of shorter scripts interspersed with text, rather than of code chunks interspersed with documentation.

It is admittedly hard to make a sharp demarcation between code within a document in the sense of literate programming and in the sense of reproducible research. I think of them as being somewhat different in terms of their typical endpoints: the endpoint of LP is a piece of software that is literately documented, while that of RR is a research document whose methods are well-documented. (Well-documented software is often an adjunct of a research document, but not the primary goal.)

General tools for RR

A reproducible research blog.
Emacs org-mode-babel is a general-purpose tool for interleaving “chunks” of code in different languages within a document
Working wiki and its spin-off Projects wiki are web-based frameworks for RR.
In principle, proprietary tools such as Mathematica notebooks could be used for RR, but they are limited by the availability of the underlying tools.

R-specific tools

The R task view on reproducible research gives a general overview.
Sweave is the grand-daddy of RR in R, inspired by Knuth's LP tools; it has R as a code target and LaTeX as a document-format target.
knitr now dominates Sweave, with a wider variety of code and document-format targets and a variety of additional features.
Emacs org-mode can handle R.
Other packages such as brew are designed more generally for reporting (i.e. target is typically an HTML page), rather than for generating scientific or technical papers.
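
To make the "R as code target, LaTeX as document target" idea concrete, here is a minimal sketch using the Sweave and Stangle functions that ship with base R (in the utils package). The file name example.Rnw, the chunk label, and the toy data are all made up for illustration; knitr would work similarly, but I stick to base R so that nothing extra needs to be installed.

```r
# Minimal weave/tangle sketch with base R's Sweave and Stangle.
# "example.Rnw", the chunk label, and the data are invented for illustration.
workdir <- tempfile("sweave-demo-")
dir.create(workdir)
old_wd <- setwd(workdir)  # Sweave writes its output into the working directory

rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of our three measurements is computed below.",
  "<<mean-chunk>>=",
  "x <- c(2, 4, 6)",
  "mean(x)",
  "@",
  "\\end{document}"
)
writeLines(rnw, "example.Rnw")

# Weave: run the chunk and write example.tex (text + code + results).
utils::Sweave("example.Rnw", quiet = TRUE)
tex_lines <- readLines("example.tex")

# Tangle: extract only the R code into example.R.
utils::Stangle("example.Rnw", quiet = TRUE)
code_lines <- readLines("example.R")

setwd(old_wd)  # restore the original working directory
```

The resulting example.tex still has to be compiled with LaTeX to get a PDF; the point here is only that the text and the code that produces the numbers live in one source file.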

Workflows

Another category of tools and ideas deals more specifically with pipelines for doing complicated series of computational steps in an efficient and reproducible way. These tools are primarily concerned with (1) interconnections with data sources (sensors, web databases, etc.) and grid computing networks allowing computations to be farmed out to remote processors; (2) dependency description and management, especially figuring out which steps of a computation can be done independently/in parallel (see Dependency control below); (3) establishing the provenance of computational results (i.e., a rigorous chain of evidence describing where each result comes from). For the most part these tools do not seem to be designed particularly well for lightweight or small projects. They often combine a GUI (useful for visualization and for interacting with tech-naive users) with back-end definitions written in a machine-friendly XML format; in principle these definitions are human-editable, but you wouldn't generally want to have to do it.

Kepler
VDS (Virtual Data System)/Open Science Grid
Sumatra is a Python-based framework for tracking the results of computational experiments. It currently handles Python and MATLAB; presentations mention extensions to R … there's also a LaTeX package for embedding figures.
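
For small projects, a very lightweight version of the provenance idea can be sketched in a few lines of base R. This is only an illustration under my own assumptions, not how Kepler, VDS, or Sumatra actually work: the function name run_with_provenance and the fields it records are hypothetical.

```r
# Hypothetical helper (not from any workflow tool): run one analysis step
# on an input file and record where the result came from.
run_with_provenance <- function(input_file, step) {
  stopifnot(file.exists(input_file), is.function(step))
  list(
    result = step(input_file),
    provenance = list(
      input     = normalizePath(input_file),
      input_md5 = unname(tools::md5sum(input_file)),  # fingerprint of the data
      r_version = R.version.string,
      run_at    = Sys.time()
    )
  )
}

# Usage with a made-up data file
data_file <- tempfile(fileext = ".csv")
write.csv(data.frame(mass = c(1.2, 3.4, 5.6)), data_file, row.names = FALSE)
out <- run_with_provenance(data_file, function(f) mean(read.csv(f)$mass))
str(out$provenance)
```

Storing the checksum rather than just the file name means that a later change to the data file can be detected even if the name and location stay the same.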

Revision/version control

Logically, revision/version control is separate from LP and RR, but in practice it is a critical component of RR. Part of the need for version control comes from the research process (as you work, you continuously change your scripts, correct errors in data, etc., and knowing when and why things changed, and being able to go back and fix them after you broke them, is critical). The other part comes from the additional complexity of computational tools; because it's so easy to break your manuscript by making an apparently trivial change, it becomes important to be able to go back to earlier versions. Version control is critical for multi-author projects, but is considered extremely useful even for single-author projects.

Subversion (SVN) and Git are probably the most popular revision control systems (RCSs) in my circles. Collaborative tools (see below) often provide at least a simple form of version control. I don't know much about Mercurial or other systems.
Working Wiki provides automatic version tracking, which is convenient … but non-automatic versioning as in SVN/git that requires an explicit check-in (rather than “version on save”) is better for maintaining an appropriate granularity in the record of changes. (Providing a tagging capability (i.e., labeling a particular version as being of interest) for WW might help address this issue.)
Dropbox sort of provides version control. It's extremely convenient because it works automatically in the background (with no need to explicitly commit revisions or explain how a given revision differs from the previous version). However, it also offers little control – there are no tags describing the history, just timestamps (at least I'm not aware of a way to tag revisions), and no easy way to generate diffs. It's really a backup system rather than a version control system.

Dependency control

Dependency control is like version control in being logically distinct from, but closely integrated with, RR. Having a robust, flexible dependency control system allows a reasonable compromise between (1) making errors due to out-of-date upstream changes in data files or scripts and (2) re-making an entire project every time something trivial changes. Various Sweave add-ons (weaver, cacheSweave, pgfSweave) provided caching; knitr does too, and dominates them because of its other features and relative robustness. However, it's hard to beat the flexibility (if not necessarily transparency) of make, which is built into Working Wiki. (Rob Hyndman uses make for his projects.) I'm not sure if/how other RR tools integrate dependency control …

Maven/Ant? (http://ant.apache.org/)
discussion of build systems on SO
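
The core rule behind make-style dependency control (rebuild a target only if it is missing or older than something it depends on) is easy to sketch in base R. The helper needs_rebuild below is hypothetical, written just for this illustration; it is not part of make, knitr, or Working Wiki, and it ignores everything that real build tools do beyond comparing timestamps.

```r
# Hypothetical helper: TRUE if `target` is missing or older than any of `deps`.
needs_rebuild <- function(target, deps) {
  missing_deps <- deps[!file.exists(deps)]
  if (length(missing_deps) > 0) {
    stop("missing dependency: ", paste(missing_deps, collapse = ", "))
  }
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Usage with made-up file names: refit only when the data or script changed.
proj   <- tempfile("proj-"); dir.create(proj)
data_f <- file.path(proj, "data.csv")
fit_f  <- file.path(proj, "fit.rds")
write.csv(data.frame(x = 1:5, y = c(2, 4, 5, 4, 5)), data_f, row.names = FALSE)

if (needs_rebuild(fit_f, data_f)) {
  fit <- lm(y ~ x, data = read.csv(data_f))
  saveRDS(fit, fit_f)  # cache the (possibly expensive) result
}
```

Timestamp comparison is cheap but can be fooled (e.g. by copying files around); comparing checksums is slower but more robust.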

Documentation generation

In documentation generation, the idea is to keep documentation in sync with code by providing tools that (1) keep code and documentation in a single file; (2) can produce nicely organized/indexed/formatted lists of functions etc. The actual nuts and bolts are similar to literate programming, and a number of people argue that document generators are the successors to LP.

Dexy (also usable as a collaboration tool)
Doxygen (and derivatives Roxygen, Roxygen2, Roxygen3, …)
Javadoc
R vignettes
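
As a small illustration of keeping code and documentation in a single file, here is what a Roxygen-style documented function might look like. The function standard_error is made up for this example; the #' comment lines use standard roxygen2 tags, and within a package roxygen2 would turn them into a help page, but the snippet also runs as plain R since the documentation lines are just comments.

```r
#' Standard error of the mean
#'
#' @param x Numeric vector of observations.
#' @param na.rm Logical; drop missing values before computing?
#' @return A single numeric value, sd(x) / sqrt(n).
#' @examples
#' standard_error(c(1, 2, 3, 4))
#' @export
standard_error <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]  # remove NAs so that n counts only real values
  stats::sd(x) / sqrt(length(x))
}
```

Because the description sits directly above the code it describes, it is at least harder (though not impossible) to change one without noticing the other.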

Collaborative tools

In conjunction with all this stuff one may want tools that make it easier to work together, especially through some easy-to-use web-based system. Not completely separate from version control, but typically a nicer front-end and perhaps more aimed at non-technical users. May allow very fine-grained version control/simultaneous editing (Google Docs).

MS Word/Track changes/Dropbox
Google Docs
LaTeXLab, a (beta) version of Google Docs for LaTeX
Subversion (SVN)
Github

Integrated development environments

These primarily offer code editing, compilation/linking, and debugging tools, but also provide hooks into dependency control, documentation management, version control, and platforms for reproducible research.

Emacs (via ESS and/or org-babel)
Vim
Eclipse
RStudio
other R IDEs: Tinn-R, RKward, etc.

Targets

Text

The text target is the markup language in which the actual documentation/paper is written. In some simple cases this could just be plain text, but it may be important to have support for simple markup (bold, italic, typewriter/monospaced font, headings), math notation, bibliographic formatting …

Markdown (+ various extensions)
TeX/LaTeX
HTML
Wikimedia/other wiki markup languages (MLs)
Word processor formats: Open Document Format (ODF), Microsoft Office (doc/docx)
pandoc is useful for interconversion among text targets

Language

The language target is the programming language (or languages) in which the code is written. Scripting or domain-specific languages are especially popular (Python, R, MATLAB), but it can be useful to have access to compiled languages (C++, FORTRAN) or computer algebra systems (Yacas, Maxima, Mathematica, SAGE?).

Dissemination

Output as PDF, HTML, other options?
Easy ways to “push” to blogs, other publishing platforms (Rpubs)

Misc.

Other bits and pieces or ideas that haven't made it in yet.

Editor dependence: some systems are tied to specific editors (e.g. org-babel, RStudio [native and Vim modes]), others are editor-agnostic (e.g. Working Wiki/It's All Text, git); not being able to use a preferred editor (or at least a preferred set of keybindings) can be a deal-breaker
Convergence: Systems have a tendency to expand. e.g. knitr and RStudio started out as cross-platform but language (R) and text (LaTeX)-specific, but have now expanded both their language (R, Python, awk, … http://yihui.name/knitr/demo/engines/) and their text support (especially Markdown). GitHub is a server and front-end for git that formats documents in various ways, and thus adds document-preparation and dissemination features to what was originally just an RCS/collaboration tool. A specific case of creeping featurism?

library(fortunes)
fortune("pizza")

## 
## Roger D. Peng: I don't think anyone actually believes that R is designed
## to make *everyone* happy. For me, R does about 99% of the things I need to
## do, but sadly, when I need to order a pizza, I still have to pick up the
## telephone.
## Douglas Bates: There are several chains of pizzerias in the U.S. that
## provide for Internet-based ordering (e.g. www.papajohnsonline.com) so,
## with the Internet modules in R, it's only a matter of time before you will
## have a pizza-ordering function available.
## Brian D. Ripley: Indeed, the GraphApp toolkit (used for the RGui interface
## under R for Windows, but Guido forgot to include it) provides one (for use
## in Sydney, Australia, we presume as that is where the GraphApp author
## hails from). Alternatively, a Padovian has no need of ordering pizzas with
## both home and neighbourhood restaurants ....
##    -- Roger D. Peng, Douglas Bates, and Brian D. Ripley
##       R-help (June 2004)

Electronic lab notebooks: another niche, with many of the same ideas, typically aimed at a tech-naive audience and focusing on (1) backup/provenance; (2) GUIs.
Metadata and data management systems (Morpho etc., RDBMSs; RDBMSs often lack version control, relying on backups instead, because they are designed for high performance)
Document structure: Which is better, a single big blob (as in Sweave/knitr), or collection of documents with dependency structure? Keep everything as text files, or use a richer binary format?
- Small files generally provide better granularity for version tracking/branching; blobs are easier to redistribute. R's only format for disseminating combined data, metadata, code, documentation, and documents (plus other material such as bibliographic info?) is the package, which seems too heavy for a simple “here's my data plus the code required to run it”. Giant XML containers (e.g. ODF) may be useful, but are not human/command-line friendly.
- Binary formats obviously add a lot of potential richness, but at an obvious cost to openness/manageability with text-based tools.
Would a comparison table be useful? (Rows=systems; primary category; columns=code language targets; text format targets; storage format; version control; dependency control; dissemination; collaborative editing …)
remote/cloud computation: discussed a bit under Workflows. Obviously a huge topic. It would be nice to lower the activation energy required for cloud computing by ecologists, i.e. a really simple/easy-to-use/lightweight pushToCloud() function …
offline modes: having everything cloud/web based is a two-edged sword. The most convenient platforms (Git, Dropbox, RStudio, etc.) allow a seamless transition between local and remote storage/computation, so that you aren't screwed when your network connection (or the server) goes down at a critical moment.

History

updated 2013-07-13 after conversations with Matt Jones at the NCEAS Summer Institute: added Kepler, VDL etc.