Reproducible research consists of many different parts that stack on top of each other. It is important to get the basics right, and these are mostly conceptual ideas about data, code and general project organisation.
A big part of reproducible research is a sensible organisation of (raw) data, code and outputs in a folder structure. This usually evolves step by step - it is the rule rather than the exception that a structure (or the lack of one) turns out not to work so well once a project progresses. Still, there are some guidelines worth following. A good folder structure to start with looks something like this:
.
├── R
├── data
│   ├── processed
│   └── raw
└── output
In this example, you would have a subfolder R that contains R scripts with functions that can be used/shared by your analysis scripts, e.g. a function that plots sediment core photographs together with data.

The folder data can directly contain data or sub-folders. In this case, we make a distinction between raw or original and processed data.

The folder output, finally, could contain all your “deliverables”, i.e. tables, figures and other things you want to use in your manuscript. The outputs don’t have to be identical to the deliverables in your manuscript; e.g. you might want to touch up/annotate a figure with a vector illustration program (such as Adobe Illustrator, Affinity Designer or Inkscape).
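If you like, you can create this structure from within R; a minimal sketch (the folder names simply mirror the example tree above):

```r
# recursive = TRUE also creates the parent folder "data";
# showWarnings = FALSE keeps re-runs quiet if a folder already exists.
dirs <- c("R", "data/raw", "data/processed", "output")
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
```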
It is up to you whether you keep all analyses in a single document or split them up into several scripts that are called sequentially and each do part of the work.
This current document is written in R-flavoured Markdown. Markdown is a simple formatting language: we use plain text symbols to create formatting, so we don’t have to rely on binary formats like Microsoft Word. Markdown supports the most important text-editing features, but not (remotely) as many as Microsoft Word or LibreOffice.
Some people prefer to use R scripts with comments for their analyses, while others use R Markdown. The main advantage of R Markdown (similar to Jupyter notebooks in Python) is that textual content with a meaning (e.g. the text that I am writing here) is combined with code that can be run. You can even use results from your calculations inline in the text. Did you know, for example, that the average waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park is 1 hour(s) and 11 minute(s)?
The concept of mixing code and text is called literate programming and is especially useful for data analyses.
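As a hedged sketch of how the inline numbers in the sentence above could be produced, using the faithful dataset that ships with R:

```r
# In the .Rmd source, the two numbers would be inserted into the text with
# inline code (backtick-r-expression-backtick) rather than copied by hand.
mean_wait <- mean(faithful$waiting)  # mean waiting time in minutes (~70.9)
hours     <- mean_wait %/% 60        # full hours -> 1
minutes   <- round(mean_wait %% 60)  # remaining minutes -> 11
```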
What is original data? That question is more difficult than it seems! Is an Excel workbook that you receive with a compilation of data original? Probably not. But does that mean you need to use a machine’s internal and obscure files that are hard to read? Probably not. There are many definitions of raw or original data, and ultimately you will have to decide (and be able to justify) for yourself what original data means to you. Whenever possible, do not make calculations in Excel and prefer text files (.csv, .txt, .dat etc.). A good start is to use the least modified but still usable data you get from a machine (this data may well already be the result of calculations, e.g. concentrations calculated from peak areas in spectrometry). Original data ideally changes rarely (only if new data is added or if an error in the measurement has to be corrected).
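As a small sketch of this workflow (the file names are hypothetical), raw data is only ever read, and anything derived from it goes to data/processed:

```r
# Read the untouched original file from data/raw ...
raw <- read.csv("data/raw/core_AW-13_counts.csv")

# ... do the processing in R (here: drop incomplete rows as a stand-in step) ...
processed <- raw[complete.cases(raw), ]

# ... and write the derived table to data/processed, leaving the raw file as is.
write.csv(processed, "data/processed/core_AW-13_counts_clean.csv", row.names = FALSE)
```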
What about “weird” files? My data is not well-formatted/lacks columns/uses weird symbols or I only received a compilation of data and there is no raw data?
If for one reason or another you are not able to read the raw data directly into R, by all means use Excel or another tool to help you format your data. This may include adding column names or removing ill-defined text (e.g. unformatted text that follows a data table). Be careful, however: Excel might convert numbers into dates, cut off digits or otherwise alter your data. Always check the data exported from Excel, e.g. with a plain text editor like Windows Notepad, macOS TextEdit or the excellent free Visual Studio Code editor, which has plug-ins for many different file types.
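If you prefer to stay in R, a quick way to check such an export is to print its first lines verbatim (the file name here is hypothetical):

```r
# Print the first five lines exactly as they are stored on disk,
# so date-mangled or truncated values are easy to spot.
cat(readLines("data/raw/export_from_excel.csv", n = 5), sep = "\n")
```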
Why should you even bother about file types? And why is it a bad idea to keep your data only in an Excel sheet or Word document? To understand this, we have to talk about how computers save and represent data: there are two basic kinds of files on modern computers - text files and binary files (everything else).
An example of a plain text file is the hosts file in the directory /etc that lives on every Linux, Unix and macOS system, shown here with the very basic cat command that concatenates and prints files.

Output of cat /etc/hosts

A binary file opened in a plain text editor shows mostly weird symbols that can’t be read by humans. The structure of binary files differs between file types. The %PDF at the beginning is a typical “magic” string that helps a computer find out what file type it is dealing with.

PDF binary data in a hex editor
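You can peek at those first bytes from within R; a minimal sketch (the PDF file name is hypothetical):

```r
# Read the first 8 bytes of a (hypothetical) PDF file as raw bytes.
first_bytes <- readBin("output/figure_1.pdf", what = "raw", n = 8)
first_bytes                  # e.g. 25 50 44 46 2d 31 2e 37
rawToChar(first_bytes[1:4])  # "%PDF" - the magic string mentioned above
```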
All common computer systems organise their data in hierarchical folders. In Unix/Linux/macOS systems, the top-level directory is called the root directory (like the root of a plant, from which the whole plant originates). Following that metaphor, the next directories would be the stem, then branches and twigs, and files would be the leaves. Directories are separated with a forward slash /. The file /etc/hosts resides in the directory /etc, and the first slash represents the top-level directory or root. Your personal data might be in a folder like /Users/<username> (macOS) or /home/<username> (Linux). In Unix/Linux/macOS systems, your personal folder is abbreviated with a tilde ~.

In Windows systems, the top level is given by a drive letter, e.g. C:, and directories are commonly separated by backslashes \. Your personal data on Windows is in a folder of the form C:\Users\<username> (R might use forward slashes instead because backslashes are special symbols in R).
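R has a few helpers for dealing with these platform differences; a small sketch (the printed values are of course system- and user-specific):

```r
getwd()                          # the current working directory as an absolute path
path.expand("~")                 # your personal/home folder, e.g. "/Users/jane"
file.path("data", "raw", "xrf")  # "data/raw/xrf" - R builds paths with forward
                                 # slashes, which also work on Windows
```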
All these folders are given in absolute notation. Absolute means that they contain the full path to a directory or file. Relative notation, in comparison, means that paths are given starting from a defined working directory. In RStudio, this working directory is shown at the top of the console, and the Files panel on the right side shows the files and folders inside it. If we wanted to access one of the XRF files in the subfolders data > raw > xrf, we could write "data/raw/xrf/AW-13-16_10kVa_5mm_30sec.csv" or "./data/raw/xrf/AW-13-16_10kVa_5mm_30sec.csv". These two options are equivalent, since a dot in front of the first forward slash just means “working directory”. Two dots (..) in front of the first forward slash mean “one directory level up”, which is, in this case, the folder R-code (see current working directory).

Working directory as shown in RStudio
Why are relative paths so important? Relative paths make your code/content portable. If you only ever refer to data that is contained within a working directory using relative paths, your code should run on any computer!
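As a hedged sketch, reading the XRF file mentioned above via its relative path (assuming it is a plain .csv file; the exact import function depends on the file format):

```r
# Relative path: works on any computer where the working directory is the
# project folder and the data travels with the project.
xrf <- read.csv("data/raw/xrf/AW-13-16_10kVa_5mm_30sec.csv")

# Absolute path (hypothetical): tied to one specific machine and user - avoid.
# xrf <- read.csv("/Users/<username>/some/folder/data/raw/xrf/AW-13-16_10kVa_5mm_30sec.csv")
```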
R by itself does not provide an option to create projects - a project being a self-contained entity that is portable. A project can be anything from just code, to code and data, to a managed environment where R and package versions are controlled. This is where RStudio projects come in: when you work with an RStudio project, you can always be sure that you’re starting in the correct working directory (= the project directory), and you can set options that are specific to this project.
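You normally create a project via File > New Project… in RStudio; as an alternative sketch (assuming the usethis package is installed), the same can be done from the console:

```r
# Creates a new folder with an .Rproj file and opens it as an RStudio project.
# The path is hypothetical - pick your own location and project name.
usethis::create_project("~/projects/lake-sediments")
```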
When you use projects, you can and should do away with some bad habits many R users have learnt:
Drastic views on the infamous rm(list = ls()) command
Putting rm(list = ls()) at the top of your scripts is very widely taught and practised in R stats classes, so why is it frowned upon? Think about getting a script from a colleague or student and executing it. The first line will delete all (visible) objects in your R workspace/session, which could be annoying. Even worse, it doesn’t actually delete everything: graphic parameters and other options (set with par() and options(), respectively) are retained. Instead, use an RStudio project.
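A small sketch that demonstrates the problem (run it in a fresh R session if you want to see it yourself):

```r
options(digits = 3)   # change a global option for this session
x <- 1:10             # create an object in the workspace

rm(list = ls())       # removes x (and all other visible objects) ...
exists("x")           # FALSE
getOption("digits")   # ... but still 3: the option survived the "reset"
```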