Reproducible research consists of many different parts that stack on top of each other. It is important to get the basics right, and these are mostly conceptual ideas about data, code and general project organisation.
A big part of reproducible research is a sensible organisation of (raw) data, code and outputs in a folder structure. This usually evolves step by step - it is the rule rather than the exception that a structure (or the lack of one) turns out not to work so well once a project progresses. Still, there are some guidelines worth following. A good folder structure to start with looks something like this:
.
├── R
├── data
│   ├── processed
│   └── raw
└── output
In this example, you would have a subfolder R that contains R scripts with functions that can be used/shared by your analysis scripts, e.g. a function that plots sediment core photographs together with data.

The folder data can directly contain data or sub-folders. In this case, we make a distinction between raw or original and processed data.

The folder output, finally, could contain all your “deliverables”, i.e. tables, figures and other things you want to use in your manuscript. The outputs don’t have to be identical to the deliverables in your manuscript; e.g. you might want to touch up/annotate a figure with a vector illustration program (such as Adobe Illustrator, Affinity Designer or Inkscape).
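If you like, you can create this structure from within R; a minimal sketch (the folder names simply mirror the example tree above):

```r
# recursive = TRUE also creates the parent folder "data";
# showWarnings = FALSE keeps re-runs quiet if a folder already exists.
dirs <- c("R", "data/raw", "data/processed", "output")
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
```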
It is up to you whether you keep all analyses in a single document or split them up into several scripts that are called sequentially and each do part of the work.
This current document is written in R-flavoured Markdown. Markdown is a simple formatting language: we use plain text symbols to create formatting, so we don’t have to rely on binary formats like Microsoft Word. Markdown supports the most important text-editing features, but not (remotely) as many as Microsoft Word or LibreOffice.
Some people prefer to use R scripts with comments for their analyses, while others use R Markdown. The main advantage of R Markdown (similar to Jupyter notebooks in Python) is that textual content with a meaning (e.g. the text that I am writing here) is combined with code that can be run. You can even use results from your calculations inline in the text. Did you know, for example, that the average waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park is 1 hour(s) and 11 minute(s)?
The concept of mixing code and text is called literate programming and is especially useful for data analyses.
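As a hedged sketch of how the inline numbers in the sentence above could be produced, using the faithful dataset that ships with R:

```r
# In the .Rmd source, the two numbers would be inserted into the text with
# inline code (backtick-r-expression-backtick) rather than copied by hand.
mean_wait <- mean(faithful$waiting)  # mean waiting time in minutes (~70.9)
hours     <- mean_wait %/% 60        # full hours -> 1
minutes   <- round(mean_wait %% 60)  # remaining minutes -> 11
```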
What is original data? That question is more difficult than it seems! Is an Excel workbook that you receive with a compilation of data original? Probably not. But does that mean you need to use a machine’s internal and obscure files that are hard to read? Probably not. There are many definitions of raw or original data, and ultimately you will have to decide (and be able to justify) for yourself what original data means to you. Whenever possible, do not make calculations in Excel and prefer text files (.csv, .txt, .dat etc.). A good start is to use the least modified but still usable data you get from a machine (this data may well already be the result of calculations, e.g. concentrations calculated from peak areas in spectrometry). Original data ideally changes rarely (only if new data is added or if an error in the measurement has to be corrected).
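As a small sketch of this workflow (the file names are hypothetical), raw data is only ever read, and anything derived from it goes to data/processed:

```r
# Read the untouched original file from data/raw ...
raw <- read.csv("data/raw/core_AW-13_counts.csv")

# ... do the processing in R (here: drop incomplete rows as a stand-in step) ...
processed <- raw[complete.cases(raw), ]

# ... and write the derived table to data/processed, leaving the raw file as is.
write.csv(processed, "data/processed/core_AW-13_counts_clean.csv", row.names = FALSE)
```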
What about “weird” files? My data is not well-formatted/lacks columns/uses weird symbols or I only received a compilation of data and there is no raw data?
If for one reason or another you are not able to read the raw data directly into R, by all means use Excel or another tool to help you format your data. This may include adding column names or removing ill-defined text (e.g. unformatted text that follows a data table). Be careful, however: Excel might convert numbers into dates, cut off digits or otherwise alter your data. Always check the data exported from Excel, e.g. with a plain text editor like Windows Notepad, macOS TextEdit or the excellent free Visual Studio Code editor, which has plug-ins for many different file types.
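If you prefer to stay in R, a quick way to check such an export is to print its first lines verbatim (the file name here is hypothetical):

```r
# Print the first five lines exactly as they are stored on disk,
# so date-mangled or truncated values are easy to spot.
cat(readLines("data/raw/export_from_excel.csv", n = 5), sep = "\n")
```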
Why should you even bother about file types? And why is it a bad idea to keep your data only in an Excel sheet or Word document? To understand this, we have to talk about how computers save and represent data: there are two basic kinds of files on modern computers - text files and binary files (everything else).
An example of a plain text file is the hosts file in the directory /etc that lives on every Linux, Unix and macOS system, shown here with the very basic cat command that concatenates and prints files.

Output of cat /etc/hosts

A binary file opened in a plain text editor shows mostly weird symbols that can’t be read by humans. The structure of binary files differs between file types. The %PDF at the beginning is a typical “magic” string that helps a computer find out what file type it is dealing with.

PDF binary data in a hex editor
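You can peek at those first bytes from within R; a minimal sketch (the PDF file name is hypothetical):

```r
# Read the first 8 bytes of a (hypothetical) PDF file as raw bytes.
first_bytes <- readBin("output/figure_1.pdf", what = "raw", n = 8)
first_bytes                  # e.g. 25 50 44 46 2d 31 2e 37
rawToChar(first_bytes[1:4])  # "%PDF" - the magic string mentioned above
```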
All common computer systems organise their data in hierarchical folders. In Unix/Linux/macOS systems, the top-level directory is called the root directory (like the root of a plant, from which the whole plant originates). Following that metaphor, the next directories would be the stem, then branches and twigs, and files would be the leaves. Directories are separated with a forward slash /. The file /etc/hosts resides in the directory /etc, and the first slash represents the top-level directory or root. Your personal data might be in a folder like /Users/<username> (macOS) or /home/<username> (Linux). In Unix/Linux/macOS systems, your personal folder is abbreviated with a tilde ~.

In Windows systems, the top level is given by a drive letter, e.g. C:, and directories are commonly separated by backslashes \. Your personal data on Windows is in a folder of the form C:\Users\<username> (R might use forward slashes instead because backslashes are special symbols in R).
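R has a few helpers for dealing with these platform differences; a small sketch (the printed values are of course system- and user-specific):

```r
getwd()                          # the current working directory as an absolute path
path.expand("~")                 # your personal/home folder, e.g. "/Users/jane"
file.path("data", "raw", "xrf")  # "data/raw/xrf" - R builds paths with forward
                                 # slashes, which also work on Windows
```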
All these folders are given in absolute notation. Absolute means that they contain the full path to a directory or file. Relative notation, in comparison, means that paths are given starting from a defined working directory. In RStudio, this working directory is shown at the top of the console, and the Files panel on the right side shows the files and folders inside it. If we wanted to access one of the XRF files in the subfolders data > raw > xrf, we could write "data/raw/xrf/AW-13-16_10kVa_5mm_30sec.csv" or "./data/raw/xrf/AW-13-16_10kVa_5mm_30sec.csv". These two options are equivalent, since a dot in front of the first forward slash just means “working directory”. Two dots (..) in front of the first forward slash mean “one directory level up”, which is, in this case, the folder R-code (see current working directory).

Working directory as shown in RStudio
Why are relative paths so important? Relative paths make your code/content portable. If you only ever refer to data that is contained within a working directory using relative paths, your code should run on any computer!
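As a hedged sketch, reading the XRF file mentioned above via its relative path (assuming it is a plain .csv file; the exact import function depends on the file format):

```r
# Relative path: works on any computer where the working directory is the
# project folder and the data travels with the project.
xrf <- read.csv("data/raw/xrf/AW-13-16_10kVa_5mm_30sec.csv")

# Absolute path (hypothetical): tied to one specific machine and user - avoid.
# xrf <- read.csv("/Users/<username>/some/folder/data/raw/xrf/AW-13-16_10kVa_5mm_30sec.csv")
```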
R by itself does not provide an option to create projects - a project being a self-contained entity that is portable. A project can be anything from just code, to code and data, to a managed environment where R and package versions are controlled. This is where RStudio projects come in: when you work with an RStudio project, you can always be sure that you’re starting in the correct working directory (= the project directory), and you can set options that are specific to this project.
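You normally create a project via File > New Project… in RStudio; as an alternative sketch (assuming the usethis package is installed), the same can be done from the console:

```r
# Creates a new folder with an .Rproj file and opens it as an RStudio project.
# The path is hypothetical - pick your own location and project name.
usethis::create_project("~/projects/lake-sediments")
```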
When you use projects, you can and should do away with some bad habits many R users have learnt:
Drastic views on the infamous rm(list = ls()) command
Putting rm(list = ls()) at the top of your scripts is very widely taught and practised in R stats classes, so why is it frowned upon? Think about getting a script from a colleague or student and executing it. The first line will delete all (visible) objects in your R workspace/session, which could be annoying. Even worse, it doesn’t actually delete everything: graphic parameters and other options (set with par() and options(), respectively) are retained. Instead, use an RStudio project.
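A small sketch that demonstrates the problem (run it in a fresh R session if you want to see it yourself):

```r
options(digits = 3)   # change a global option for this session
x <- 1:10             # create an object in the workspace

rm(list = ls())       # removes x (and all other visible objects) ...
exists("x")           # FALSE
getOption("digits")   # ... but still 3: the option survived the "reset"
```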