The data science community is plagued by reproducibility issues. This post
addresses some of the barriers to reproducible research and how
to start tackling them during the data management
and analysis phases of the research life cycle. I will then present a
brief introduction to R Markdown. This post assumes that you know
Who cares about reproducible research?
Science is plagued by reproducibility problems, in fields such as:
- health
- finance
- economics
- social studies

How do we reproduce it? What do we need?
- The data.
  - The data points themselves.
  - Other metadata.
- The code.
  - Should be readable.
  - Comments in the code / well-documented so a normal person can figure out how it runs.
  - How were the trend lines drawn?
  - What version of software / packages were used?
This kind of information is rarely available in scientific
publications, but it’s now extraordinarily easy to put it
on the web.
As scientists, we should aim for robust and reproducible research:
- “Robust research is about doing small things that stack the deck in your favor to prevent mistakes.”
- Reproducible research can be repeated by other researchers with the same results.
It will make your life (and science) easier!
- Most likely, you will have to re-run your analysis more than
once.
- In the future, you or a collaborator may have to re-visit part of
the project.
- Your most likely collaborator is your future self, and your past
self doesn’t answer emails.
- You can turn modular parts of the project into reusable tools
for the future.
- Reproducibility makes you easier to work and collaborate with.
Some recommendations for reproducible research
- Write code for humans, write data for computers.
  - Code should be broken down into small chunks that may be re-used.
  - Make names/variables consistent, distinctive and meaningful.
  - Adopt a consistent style.
  - Write concise and clear comments.
- Make incremental changes and use version control.
- Make assertions and be loud, in code and in your
methods.
- Use existing libraries (packages) whenever
possible. Don’t reinvent the wheel. Use functions that have
already been developed and tested by others.
- Prevent catastrophe and help reproducibility by making your
data read-only. Rather than modifying your original
data directly, always use a workflow that reads in data,
processes/modifies, then writes out intermediate and final files as
necessary.
- Release your code and data. Simple. Without your code and data, your research is not reproducible.
  - GitHub (https://github.com/) is a great place for storing, distributing, collaborating, and version-controlling code.
  - RPubs (http://rpubs.com/) allows you to share dynamic documents you write in RStudio online.
- Always set your seed. If you’re doing anything that
involves random/Monte Carlo approaches, always use set.seed().
- Document everything and use code as documentation.
  - Document why you do something, not mechanics.
  - Document your methods and workflows.
  - Document the origin of all data in your project directory.
  - Document when and how you downloaded the data.
  - Record data version info.
  - Record software version info with session_info().
- Use dynamic documentation to make your life easier.
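
Several of the recommendations above (set a seed, make loud assertions, treat raw data as read-only, record software versions) can be combined in one short script. The following is only a minimal sketch in base R: the file names and the toy data are hypothetical, and it uses base R's sessionInfo() rather than session_info(), which comes from an add-on package.

```r
# Sketch of one reproducible analysis step (base R only).
# File names and the toy data are hypothetical placeholders.

set.seed(42)  # fix the random number stream so results repeat

# Stand-in for raw data. In a real project this file already exists
# and is never edited by hand.
raw_path <- file.path(tempdir(), "raw_measurements.csv")
if (!file.exists(raw_path)) {
  write.csv(data.frame(id = 1:10, value = rnorm(10)), raw_path, row.names = FALSE)
  Sys.chmod(raw_path, mode = "0444")  # mark the raw file read-only
}

# Read the raw data; never write back to raw_path.
raw <- read.csv(raw_path)

# Be loud: stop immediately if the data are not what we expect.
stopifnot(
  nrow(raw) == 10,
  !anyNA(raw$value),
  anyDuplicated(raw$id) == 0
)

# Process in memory and write the result to a NEW file.
processed <- transform(raw, value_centered = value - mean(value))
processed_path <- file.path(tempdir(), "processed_measurements.csv")
write.csv(processed, processed_path, row.names = FALSE)

# Record the software environment next to the outputs.
writeLines(capture.output(sessionInfo()),
           file.path(tempdir(), "session_info.txt"))
```

Because the raw file is only ever read, a mistake in the processing step can always be fixed by re-running the script from the untouched original.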
R Markdown

R Markdown makes analysis in R easier
because it lets you weave together narrative text and code in
a single document. It supports multiple programming languages, including R,
Python, and SQL. Once the analysis is done, R Markdown can render it to
dozens of static and dynamic output formats, including
PDF, Word documents, HTML documents, and interactive PowerPoint
presentations.
Using R Markdown
Before using R Markdown, you need to install the
rmarkdown package on your machine. If this is your first time using
RStudio, you will see this in your RStudio window:

Above is the default view of RStudio. There are 4 panels, each with
its own function:
- Editor: where we write code and narration
in files that can be saved to our computer.
- Console: where we can enter code and perform
analysis without saving it to our computer.
- Environment: where R temporarily stores our data
during data analysis. This allows us to see and track our data
as we work. There are also History
and Connections tabs, though we will not use these in this
workshop.
- Files, packages, help, etc.: where we can track
the files on our computer and our packages, and search for documentation
and descriptions of the specific functions/commands we use in our project.
There are also Plots and
Viewer tabs to preview plots and files generated using
R.
To easily analyze data and produce business reports using R, we will
be using R Markdown. We can create a new R Markdown document by
clicking on the menu File > New File > R
Markdown. Alternatively, we can click the dropdown
menu in the upper-left corner of RStudio
and then choose “R Markdown”. A pop-up for
creating a new R Markdown document will appear.

We can choose the title and author
for our project, and there are several output options to choose from. For
this introduction, let’s use the default HTML output. An R Markdown
document, a plain text file with the extension .Rmd, will
be created in our Editor panel.

The document contains three types of content:
- YAML Header
  - surrounded by --- before and after its section.
  - this is where we can customize our report template (will be discussed in the following section).
- Code Chunks
  - surrounded by ``` before and after its section, colored gray.
  - this is where we can put R functions/commands for data analysis.
- Text/Narration
  - space colored white.
  - this is where we write paragraphs or explanations for our business report.
  - it can include various text formatting, such as # for headings.
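
To make the three content types concrete, here is a minimal sketch that builds a tiny R Markdown document as plain text using base R. The title, the chunk label, and the file name are hypothetical, and the built-in cars dataset stands in for real data.

```r
# Build a minimal R Markdown document as plain text (base R only).
# Title, chunk label, and file name are hypothetical.
fence <- strrep("`", 3)  # three backticks delimit a code chunk

rmd_lines <- c(
  "---",                              # YAML header starts
  "title: \"Sales report\"",
  "output: html_document",
  "---",                              # YAML header ends
  "",
  "# Summary",                        # narrative text; '#' makes a heading
  "",
  "The chunk below summarises the built-in cars data.",
  "",
  paste0(fence, "{r cars-summary}"),  # code chunk starts
  "summary(cars)",
  fence                               # code chunk ends
)

# Save the lines as an .Rmd file and show what was written.
rmd_path <- file.path(tempdir(), "minimal_report.Rmd")
writeLines(rmd_lines, rmd_path)
cat(readLines(rmd_path), sep = "\n")
```

Opening the resulting file in RStudio should show the same three regions described above: the header between the --- lines, the gray chunk, and the white narrative text.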
Together, these contents allow us to write both R commands for data analysis and
business explanations in one file. That’s like working with Excel and
Word at the same time, with the added ability to export to
various outputs with a customized template! A lot of work can be
done in one document.
In addition to the versatility, another benefit of using R Markdown
is its notebook interface. With R Markdown, the code
inside the chunk can be executed independently and interactively, with
output displayed immediately beneath the chunk. This
allows complex data analysis using R to be performed and previewed
easily. For example, if we run the code in the last chunk by clicking
the green ‘play’ icon on the right side, a plot will come out.

Finally, we can export the document into one or multiple
formats by using the Knit button in RStudio, located at the top of the
document.
If we haven’t saved our document, RStudio will prompt us to save the file.
The best practice is to store our R Markdown document and the data we
use in one working directory. This prevents path errors
when importing data. After knitting the document, R will
produce the output document in the format we chose earlier.
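
Knitting can also be done from code with rmarkdown::render(). The sketch below is an assumption-laden illustration, not the workshop's own setup: the folder name, file names, and sample data are hypothetical, and rendering only runs if the rmarkdown package and pandoc (bundled with RStudio) are available. It keeps the data and the .Rmd file in the same directory so the relative path inside the chunk resolves.

```r
# Programmatic counterpart of the Knit button (a sketch).
# install.packages("rmarkdown")  # one-time setup, if not yet installed

work_dir <- file.path(tempdir(), "report_project")  # hypothetical project folder
dir.create(work_dir, showWarnings = FALSE)

# Keep the data and the document side by side in one directory.
write.csv(head(cars), file.path(work_dir, "cars_sample.csv"), row.names = FALSE)

fence <- strrep("`", 3)  # three backticks delimit a code chunk
writeLines(
  c("---", "title: \"Demo\"", "output: html_document", "---", "",
    paste0(fence, "{r}"),
    "dat <- read.csv(\"cars_sample.csv\")  # relative path next to the .Rmd",
    "summary(dat)",
    fence),
  file.path(work_dir, "demo.Rmd")
)

# Render only when the required tools are present.
out_file <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(file.path(work_dir, "demo.Rmd"), quiet = TRUE)
  print(out_file)  # path of the generated HTML file
}
```

By default the chunks are evaluated in the directory of the .Rmd file, which is why storing the data alongside the document avoids import errors.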
