Literate programming, version control, reproducible research, collaboration, and all that

Version: 2013-07-14 15:38:35

Literate programming, reproducible research, workflow management, version control, dependency control, and collaborative tools are a constellation of complementary and overlapping concepts and tools that are extremely useful for working computational scientists (which in the broad sense includes any scientist who uses a computer to do their work – i.e. all scientists). This document tries to lay out some of the bits and pieces, with an emphasis on reproducible research using R.

Literate programming (LP)

Literate programming (LP), the oldest of these ideas, was introduced by Donald Knuth as a way to document code, and as a new way to write code. Although (according to Wikipedia and sources) LP is supposed to mean more than just a system for automatic generation of documentation and (nicely formatted) code, it is often used in that lowest-common-denominator sense. noweb is one of the oldest tools for literate programming. It is language-agnostic and relatively markup-agnostic (TeX, LaTeX, HTML, troff) (see Targets below).
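As a rough illustration of the two basic LP operations, "weaving" (producing formatted documentation) and "tangling" (extracting the bare code), here is a minimal sketch using Sweave and Stangle from R's utils package, which read noweb-style chunk syntax. The file name, chunk label, and document contents are made up for the example; this is one way to try the idea, not a recommended workflow.

```r
# Work in a scratch directory so the generated files do not clutter the project.
workdir <- tempfile("lp-demo")
dir.create(workdir)
oldwd <- setwd(workdir)

# A tiny noweb-style source: LaTeX text with one R code chunk.
# "<<label>>=" opens a chunk and "@" closes it.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "We compute the mean of the first ten integers.",
  "<<mean-chunk>>=",
  "x <- 1:10",
  "mean(x)",
  "@",
  "\\end{document}"
), "demo.Rnw")

utils::Sweave("demo.Rnw", quiet = TRUE)   # weave: writes demo.tex (text + code + output)
utils::Stangle("demo.Rnw", quiet = TRUE)  # tangle: writes demo.R (code only)

woven   <- readLines("demo.tex")
tangled <- readLines("demo.R")
setwd(oldwd)
```

The woven file still has to be run through LaTeX to get a formatted document; the tangled file is an ordinary R script.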

Literate programming seems relatively inactive at present, having been largely superseded (for better or worse) by the related activities of reproducible research and documentation generation (see comments on StackOverflow). (When asked in an interview, Donald Knuth referred to a comment by Jon Bentley that “a small percentage of the world's population is good at programming, and a small percentage is good at writing; apparently I [Knuth] am asking everybody to be in both subsets” [I can't find the original source].)

Reproducible research (RR)

RR was partly inspired by LP, but has a different scope – its ultimate output is a document describing or illustrating scientific research (a journal article, technical report, class notes, blog post, etc.) rather than a documented piece of software. Proximally, a reproducible document tends to consist of shorter scripts interspersed with text, rather than of code chunks interspersed with documentation.

It is admittedly hard to make a sharp demarcation between code within a document in the sense of literate programming and in the sense of reproducible research. I think of them as being somewhat different in terms of their typical endpoints: the endpoint of LP is a piece of software that is literately documented, while that of RR is a research document whose methods are well-documented. (Well-documented software is often an adjunct of a research document, but not the primary goal.)

General tools for RR

R-specific tools

Workflows

Another category of tools and ideas deals more specifically with pipelines for doing complicated series of computational steps in an efficient and reproducible way. These tools are primarily concerned with (1) interconnections with data sources (sensors, web databases, etc.) and grid computing networks, allowing computations to be farmed out to remote processors; (2) dependency description and management, especially figuring out which steps of a computation can be done independently/in parallel (see Dependency control below); (3) establishing the provenance of computational results (i.e., a rigorous chain of evidence describing where each result comes from). For the most part these tools do not seem to be designed particularly well for lightweight or small projects. They often combine a GUI (useful for visualization and for interacting with tech-naive users) with back-end definitions written in a machine-friendly XML format; in principle these definitions are human-editable, but you wouldn't generally want to have to do it.
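For a small project, the provenance idea can be approximated by hand. The following is a minimal sketch, not how any particular workflow system does it: the helper with_provenance() and the fields it records are hypothetical, chosen only to show a result travelling together with a fingerprint of its input and a note of the software that produced it.

```r
# Hypothetical helper: bundle a result with a record of where it came from.
with_provenance <- function(input_file, compute) {
  list(
    result = compute(input_file),
    provenance = list(
      input     = basename(input_file),
      input_md5 = unname(tools::md5sum(input_file)),  # fingerprint of the input data
      r_version = R.version.string,                   # software that did the computing
      computed  = format(Sys.time(), "%Y-%m-%d %H:%M:%S")
    )
  )
}

# Made-up input data, written to a temporary file for the illustration.
input_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = c(2, 4, 6)), input_file, row.names = FALSE)

out <- with_provenance(input_file, function(f) mean(read.csv(f)$x))
str(out)
```

If the data file is later edited, its MD5 sum no longer matches the one stored with the result, which flags the result as stale.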

Revision/version control

Logically, revision/version control is separate from LP and RR, but in practice it is a critical component of RR. Part of the need for version control comes from the research process (as you work, you continuously change your scripts, correct errors in data, etc., and knowing when and why things changed, and being able to go back and fix them after you broke them, is critical). The other part comes from the additional complexity of computational tools; because it's so easy to break your manuscript by making an apparently trivial change, it becomes important to be able to go back to earlier versions. Version control is critical for multi-author projects, but is considered extremely useful even for single-author projects.

Dependency control

Dependency control is like version control in being logically distinct from, but closely integrated with, RR. Having a robust, flexible dependency control system allows a reasonable compromise between (1) making errors because results are out of date with respect to upstream changes in data files or scripts and (2) re-making an entire project every time something trivial changes. Various Sweave add-ons (weaver, cacheSweave, pgfSweave) provided caching; knitr does too, and dominates them because of its other features and relative robustness. However, it's hard to beat the flexibility (if not necessarily transparency) of make, which is built into Working Wiki (Rob Hyndman uses make for his projects). I'm not sure if/how other RR tools integrate dependency control …
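The core rule that make applies (rebuild a target only if it is missing or older than something it depends on) is simple enough to sketch in a few lines of R. The helpers needs_rebuild() and make_target() below are hypothetical names invented for this illustration, and the file names are made up; real make adds much more, such as chained rules and pattern matching.

```r
# Hypothetical helper: TRUE when `target` is missing or older than at least
# one of its `sources` (the basic timestamp rule used by make).
needs_rebuild <- function(target, sources) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(sources) > file.mtime(target))
}

# Hypothetical helper: run `build()` only when the target is out of date.
# Returns TRUE if the target was rebuilt, FALSE if it was left alone.
make_target <- function(target, sources, build) {
  if (needs_rebuild(target, sources)) {
    build()
    return(TRUE)
  }
  FALSE
}

# Illustration in a scratch directory.
workdir <- tempfile("dep-demo")
dir.create(workdir)
data_file    <- file.path(workdir, "data.csv")
summary_file <- file.path(workdir, "summary.txt")
write.csv(data.frame(x = 1:10), data_file, row.names = FALSE)

build_summary <- function() {
  d <- read.csv(data_file)
  writeLines(sprintf("mean = %.2f", mean(d$x)), summary_file)
}

first_run  <- make_target(summary_file, data_file, build_summary)  # target is missing
second_run <- make_target(summary_file, data_file, build_summary)  # nothing has changed

# Simulate an upstream edit by pushing the data file's timestamp forward.
Sys.setFileTime(data_file, Sys.time() + 60)
third_run  <- make_target(summary_file, data_file, build_summary)  # source is now newer
```

Timestamps are a blunt instrument (touching a file without changing it triggers a rebuild), which is part of the trade-off between transparency and flexibility mentioned above.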

Documentation generation

In documentation generation, the idea is to keep documentation in sync with code by providing tools that (1) keep code and documentation in a single file; (2) can produce nicely organized/indexed/formatted lists of functions etc. The actual nuts and bolts are similar to literate programming, and a number of people argue that document generators are the successors to LP.
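To make the two ingredients concrete, here is a toy sketch in base R: documentation is kept next to the code in special comment lines, and a small function turns those comments into an index of functions. The "#'" marker follows the convention used by roxygen2-style comments, but extract_docs() is a hypothetical helper written for this example and handles only a one-line description directly above each definition; real documentation generators do far more.

```r
# A source file whose documentation lives in "#'" comment lines placed
# directly above each function definition (contents are made up).
src <- c(
  "#' Convert temperatures from Celsius to Fahrenheit.",
  "c_to_f <- function(x) x * 9 / 5 + 32",
  "",
  "#' Convert temperatures from Fahrenheit to Celsius.",
  "f_to_c <- function(x) (x - 32) * 5 / 9"
)

# Hypothetical helper: pair each "#'" line with the function defined on the
# following line and return an index of function names and descriptions.
extract_docs <- function(lines) {
  doc_idx   <- grep("^#' ", lines)
  def_lines <- lines[doc_idx + 1]
  data.frame(
    name        = sub("^([A-Za-z0-9_.]+) *<- *function.*$", "\\1", def_lines),
    description = sub("^#' ", "", lines[doc_idx]),
    stringsAsFactors = FALSE
  )
}

docs <- extract_docs(src)
print(docs)
```

Because the description sits in the same file as the function, editing one without seeing the other is harder, which is the point of the single-file approach.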

Collaborative tools

In conjunction with all this stuff one may want tools that make it easier to work together, especially through some easy-to-use web-based system. Such tools are not completely separate from version control, but they typically offer a nicer front-end and are perhaps aimed more at non-technical users. Some allow very fine-grained version control/simultaneous editing (Google Docs).

Integrated development environments

These primarily offer code editing, compilation/linking, and debugging tools, but also hooks into dependency control, documentation management, version control, and platforms for reproducible research.

Targets

Text

The text target is the markup language in which the actual documentation/paper is written. In some simple cases this could just be plain text, but it may be important to have support for simple markup (bold, italic, typewriter/monospaced font, headings), math notation, bibliographic formatting …

Language

The language target is the programming language (or languages) in which the computations are written. Scripting or domain-specific languages are especially popular (Python, R, MATLAB), but it can be useful to have access to compiled languages (C++, FORTRAN) or computer algebra systems (Yacas, Maxima, Mathematica, SAGE?).

Dissemination

Misc.

Other bits and pieces or ideas that haven't made it in yet.

library(fortunes)
fortune("pizza")
## 
## Roger D. Peng: I don't think anyone actually believes that R is designed
## to make *everyone* happy. For me, R does about 99% of the things I need to
## do, but sadly, when I need to order a pizza, I still have to pick up the
## telephone.
## Douglas Bates: There are several chains of pizzerias in the U.S. that
## provide for Internet-based ordering (e.g. www.papajohnsonline.com) so,
## with the Internet modules in R, it's only a matter of time before you will
## have a pizza-ordering function available.
## Brian D. Ripley: Indeed, the GraphApp toolkit (used for the RGui interface
## under R for Windows, but Guido forgot to include it) provides one (for use
## in Sydney, Australia, we presume as that is where the GraphApp author
## hails from). Alternatively, a Padovian has no need of ordering pizzas with
## both home and neighbourhood restaurants ....
##    -- Roger D. Peng, Douglas Bates, and Brian D. Ripley
##       R-help (June 2004)

History