Reproducible Research in Data Science

The data science community is plagued by reproducibility issues. This post looks at some of the barriers to reproducible research and how to start addressing them during the data management and analysis phases of the research life cycle. I will then present a brief introduction to R Markdown. This post assumes that you have:

basic familiarity with R
experience manipulating data with dplyr and %>%
experience plotting with ggplot2
familiarity with Jupyter notebooks

0.1 Who cares about reproducible research?

Science as a whole is plagued by reproducibility problems, in fields as varied as:

health
finance
economics
social studies

How do we reproduce a published result? What do we need?

The data.
- Data points themselves.
- Other metadata.
The code.
- Should be readable.
- Should be commented and well documented, so that a non-specialist can figure out how it runs.
- How were the trend lines drawn?
- What version of software / packages were used?

This kind of information is rarely available in scientific publications, but it’s now extraordinarily easy to put it on the web.

As scientists, we should aim for robust and reproducible research:

“Robust research is about doing small things that stack the deck in your favor to prevent mistakes.”
Reproducible research can be repeated by other researchers with the same results.

0.2 It will make your life (and science) easier!

Most likely, you will have to re-run your analysis more than once.
In the future, you or a collaborator may have to re-visit part of the project.
Your most likely collaborator is your future self, and your past self doesn’t answer emails.
You can turn modularized parts of the project into reusable tools for the future.
Reproducibility makes you easier to work and collaborate with.

0.3 Some recommendations for reproducible research

Write code for humans, write data for computers.
- Code should be broken down into small chunks that may be re-used.
- Make names/variables consistent, distinctive and meaningful.
- Adopt a consistent style
- Write concise and clear comments.
Make incremental changes and use version control.
Make assertions and be loud, in code and in your methods.
Use existing libraries (packages) whenever possible. Don’t reinvent the wheel. Use functions that have already been developed and tested by others.
Prevent catastrophe and help reproducibility by making your data read-only. Rather than modifying your original data directly, always use a workflow that reads in data, processes/modifies, then writes out intermediate and final files as necessary.
Release your code and data. Simple. Without your code and data, your research is not reproducible.
- GitHub (https://github.com/) is a great place for storing, distributing, collaborating, and version-controlling code.
- RPubs (http://rpubs.com/) allows you to share dynamic documents you write in RStudio online.
Always set your seed. If you’re doing anything that involves random numbers or Monte Carlo approaches, always use set.seed().
Document everything and use code as documentation.
- Document why you do something, not the mechanics of how.
- Document your methods and workflows.
- Document the origin of all data in your project directory.
- Document when and how you downloaded the data.
- Record data version info.
- Record software version info with sessionInfo() (base R) or session_info() (from the sessioninfo and devtools packages).
- Use dynamic documentation to make your life easier.
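
Several of the recommendations above (setting a seed, keeping raw data read-only, making loud assertions, and recording software versions) can be sketched together in a few lines of base R. This is only an illustration, not a prescribed workflow: the directory, file names, column names, and toy data are all hypothetical stand-ins for your own project.

```r
# Minimal sketch of a reproducible workflow in base R.
# All paths, column names, and the toy data are hypothetical.

# Fix the random seed so the simulated values are the same on every run.
set.seed(42)

# Stand-in for a raw data file that would normally be downloaded or collected.
project_dir <- tempfile("demo_project_")
dir.create(project_dir)
raw_path <- file.path(project_dir, "raw_measurements.csv")
raw_data <- data.frame(id = 1:10, value = rnorm(10, mean = 5, sd = 1))
write.csv(raw_data, raw_path, row.names = FALSE)

# Mark the raw file read-only so later steps are less likely to overwrite it.
Sys.chmod(raw_path, mode = "0444")

# Read the raw data; all changes happen on the in-memory copy.
measurements <- read.csv(raw_path)

# Be loud: stop immediately if the data do not look as expected.
stopifnot(
  nrow(measurements) == 10,
  is.numeric(measurements$value),
  !anyNA(measurements$value)
)

# Process, then write the result to a new file instead of editing the raw one.
measurements$value_centered <- measurements$value - mean(measurements$value)
clean_path <- file.path(project_dir, "measurements_centered.csv")
write.csv(measurements, clean_path, row.names = FALSE)

# Record the R version and loaded packages next to the results.
session_path <- file.path(project_dir, "session_info.txt")
writeLines(capture.output(sessionInfo()), session_path)
```

Because the seed is set before rnorm() is called, running the script again regenerates exactly the same toy values, and the saved session_info.txt tells a future reader which R version and packages produced them.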

0.4 R Markdown

R Markdown makes analysis in R easier because it lets users weave together narrative text and code in a single document. It supports multiple programming languages, including R, Python, and SQL. Once the analysis is done, R Markdown can render it to dozens of static and dynamic output formats, including PDF, Word, HTML, and PowerPoint presentations.

0.5 Using R Markdown

Before using R Markdown, you need to install the rmarkdown package on your machine. If this is your first time using RStudio, you will see this in your RStudio window:

Above is the default view of RStudio. There are four panels, each with its own function:

Editor: where we write code and narrative in files that can be saved to our computer.
Console: where we can type code and run analyses without saving them to a file.
Environment: where R temporarily stores our data during an analysis. This allows us to see and track our data as we work. There are also History and Connections tabs, though we will not use these in this workshop.
Files, Packages, Help, etc.: where we can browse the files on our computer, manage our packages, and search the documentation for the functions and commands we use in our project. There are also Plots and Viewer tabs for previewing plots and files generated with R.

To analyze data and produce business reports in R, we will be using R Markdown. We can create a new R Markdown document from the menu File > New File > R Markdown. Alternatively, we can click the dropdown menu in the upper-left corner of RStudio and then choose “R Markdown”. Either way, a pop-up for creating a new R Markdown document will appear.

We can choose the title and author for our project, and there are several output options to choose from. For this introduction, let’s use the default HTML output. An R Markdown document, a plain text file with the extension .Rmd, will then open in our Editor panel.

The document contains three types of content:

YAML Header
- surrounded by --- before and after its section.
- this is where we can customize our report template (will be discussed in the following section).
Code Chunks
- surrounded by ``` before and after its section, colored gray.
- this is where we can put R function/commands for data analysis.
Text/Narration
- space colored white.
- this is where we write paragraphs or explanations for our business report.
- it can be added with various text formatting such as the use of # for heading.
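
To make the three content types concrete, the sketch below writes a minimal .Rmd file from R and prints it. The report title, file name, and chunk contents are hypothetical examples rather than the template RStudio generates; the built-in cars data set is used only so the chunk has something to summarise.

```r
# Sketch: write a minimal .Rmd file that contains all three content types.
# The title, file name, and chunk contents are hypothetical examples.

fence <- strrep("`", 3)  # three backticks, the delimiter of a code chunk

rmd_lines <- c(
  "---",                                  # YAML header starts
  "title: \"My first report\"",
  "output: html_document",
  "---",                                  # YAML header ends
  "",
  "# Summary",                            # narrative: a heading
  "",
  "The chunk below summarises the built-in cars data set.",
  "",
  paste0(fence, "{r cars-summary}"),      # code chunk starts
  "summary(cars)",
  fence                                   # code chunk ends
)

rmd_path <- file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Print the file so the three parts are visible.
cat(readLines(rmd_path), sep = "\n")
```

Opening the resulting file in RStudio should show the header between the two --- lines, the gray chunk, and the white narrative area described above.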

Together, these content types allow us to write both the R commands for data analysis and the business explanation in one file. That’s like working with Excel and Word at the same time, with the added ability to export the result to various outputs with a customized template! A lot of work can be done in one document.

In addition to the versatility, another benefit of using R Markdown is its notebook interface. With R Markdown, the code inside the chunk can be executed independently and interactively, with output displayed immediately beneath the chunk. This allows complex data analysis using R to be performed and previewed easily. For example, if we run the code in the last chunk by clicking the green ‘play’ icon on the right side, a plot will come out.

Finally, we can export the document to one or more formats by using the Knit button at the top of the document in RStudio.

If we haven’t saved our document, RStudio will prompt us to save the file first. The best practice is to store our R Markdown document and the data we use in one working directory. This helps prevent file path errors when importing data. After knitting, R will produce the output document in the format we chose earlier.
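
The Knit button has a scripted counterpart, rmarkdown::render(), which can be handy when a report has to be rebuilt without opening RStudio. The sketch below is an illustration under stated assumptions: the report content and file names are hypothetical, and the rendering step is skipped when the rmarkdown package or pandoc is not available.

```r
# Sketch: knit a document from code instead of the Knit button.
# The report content and file names are hypothetical.

fence <- strrep("`", 3)  # three backticks, the delimiter of a code chunk
work_dir <- tempfile("knit_demo_")
dir.create(work_dir)

# Keep the data and the .Rmd file in the same working directory.
write.csv(head(mtcars), file.path(work_dir, "cars_subset.csv"))
rmd_path <- file.path(work_dir, "report.Rmd")
writeLines(c(
  "---",
  "title: \"Knit demo\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "dat <- read.csv(\"cars_subset.csv\")  # relative path: same folder as the .Rmd",
  "summary(dat$mpg)",
  fence
), rmd_path)

# Only render when both the rmarkdown package and pandoc can be found.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

output_path <- NA_character_
if (can_render) {
  # render() returns the path of the file it produced.
  output_path <- rmarkdown::render(rmd_path, output_format = "html_document",
                                   quiet = TRUE)
}
```

Because the data file sits next to the .Rmd file, the relative path inside the chunk resolves without any extra setup, which is the point of keeping both in one working directory.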