Reproducible Research Using RMarkdown and Git through RStudio

The Agenda and Learning Goals
Key Resources Used to Build this lesson
We scientists have a few problems
How to replicate this Figure 1?
- Could I replicate Figure 1 from your last publication, grant proposal, or presentation?
As scientists, it should be our goal to perform robust and reproducible research.
Some Recommendations
- 5 Recommendations for Robust Research
- 6 Recommendations for reproducible research
Using Git through RStudio
Time to discover how amazing the RStudio interface is!
RMarkdown
R Code Chunks
A Git Workflow within RStudio
Challenge
Resources

The Agenda and Learning Goals

Agenda

What is robust and reproducible research?
How can you make your research more robust and reproducible?
You can use RMarkdown! Wait, what’s RMarkdown?
How to use RMarkdown and why it may change your workflow.
Using git through RStudio.
Initialize a more reproducible GitHub repo within RStudio!

Learning Goals

Walk away knowing what robust and reproducible research are.
To know the changes you can make to your research to make it more robust and reproducible.
To understand what RMarkdown is.
To know how to create an RMarkdown document and understand its uses for code sharing.
To understand how to use git through the RStudio interface.
Know how to initialize a more reproducible github repo within RStudio!

Key Resources Used to Build this lesson

Vince Buffalo’s Bioinformatics Data Skills book and its helpful github page.
- Main source for this presentation.
A course by Karl Broman at the University of Wisconsin-Madison on Reproducible Research.
Best Practices for Scientific Computing by Wilson et al., 2014.

This tutorial was originally constructed as a part of Titus Brown’s Next Generation Sequencing Data Analysis Workshop Week 3 that took place at Michigan State University’s Kellogg Biological Station between August 24-28, 2015.

Titus Brown’s NGS Course
- Schedule of the NGS Course Week 3
- A blog post from Titus synthesizing the third week of the course.
Github for this website

We scientists have a few problems

The LaCour Scandal of fabricated data published in Science.
Scientists in the United States spend $28 billion each year on basic biomedical research that cannot be repeated successfully.
A reproducibility study in psychology found that only 39 of 100 studies could be reproduced.
- A quote: “A lot of working scientists assume that if it’s published, it’s right… This makes it hard to dismiss that there are still a lot of false positives in the literature.”
- Recent Atlantic Article by Ed Yong summarizing the project
The Journal Nature on the issue of reproducibility
- A comment from the journal Nature: “Nature and the Nature research journals will introduce editorial measures to address the problem by improving the consistency and quality of reporting in life-sciences articles… we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.”
- Nature also released a checklist that includes a computational check, albeit a wimpy one (see #18).
  - This talk will hopefully help all of us improve this step.
The Economist in 2013 published on Trouble at the lab

How to replicate this Figure 1?

To replicate this figure 1, we need…

Sequencing data!
All other data (i.e. where in the water column it was sampled, particle-association, which lake, nutrient profile, mixing status)
The code! Which should be:
- Easily readable
- Very well commented and documented.
  - Using set.seed!
- How to calculate:
  - Observed richness?
  - Simpson’s Evenness?
The versions of the software and packages/libraries used

Could I replicate Figure 1 from your last publication, grant proposal, or presentation?

If not, what would you and your co-authors need to provide or do so I could replicate Figure 1 from your last publication?

As scientists, it should be our goal to perform robust and reproducible research.

“Robust research is about doing small things that stack the deck in your favor to prevent mistakes.” ~Vince Buffalo
Reproducible research may be repeated by other researchers with the same results.

Reproducibility can be difficult with genomic data.

Genomics data is too large and high dimensional to easily inspect or visualize. Usually, workflows involve multiple steps and it’s not feasible to inspect every step.
Unlike in the wet lab, we don’t always know what to expect of our genomics data analysis.
It’s difficult to distinguish good from bad results.
Scientific code is usually only run once to generate results for a publication, and is therefore more likely to contain (silent) bugs.
- Silent errors arise from code that may produce unknowingly incorrect output (rather than stop with an error message).

What are the ingredients to robust and reproducible research?

Work must be well documented! Methods, code, and data must be made available to others!
Adopt a cautious attitude and check everything.
- Vince Buffalo’s golden rule of bioinformatics: “Never ever trust your tools (or data)”
- Remember, “garbage in, garbage out” - an analysis is only as good as the data going in.
- Let the data prove that it is high quality.
Take the time to develop frequently used scripts into tools.
- Then have your lab mates or collaborators test them and try to break them.
Collaborate!
- Do paired-programming with your labmates and collaborators.
- Hack-a-thons!

What’s the benefit for you?

Yeah, it takes a lot of effort to be robust and reproducible. However, it will make your life (and science) easier!

Most likely, you will have to re-run your analysis more than once.
In the future, you or a collaborator may have to re-visit part of the project.
You can make modularized parts of the project into re-useable tools for the future.
Reproducibility makes you easier to work and collaborate with.

Some Recommendations

5 Recommendations for Robust Research

1a. Write code for humans

Code readability is very important.

Code should be broken down into small chunks that may be re-used.
- Do not re-write code to do the same task over and over again.
  - Do not repeat yourself! (Who wants to read that?)
    - If you need to, make it a tool/function.
Do not require readers to have to think of multiple facts at once.
Make names/variables consistent, distinctive and meaningful.
Adopt a style and format and keep it consistent.
Be a concise and clear commenter.

If your code is more readable, then:

Your project is more reproducible.
It’s easier to find and correct bugs.
You will be your own friend in the future when you revisit the code.
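
As a minimal sketch of the "make it a tool/function" advice above, the snippet below wraps a calculation that might otherwise be copied and pasted for every sample. The function name `relative_abundance` and the counts are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: convert raw counts to proportions that sum to 1.
relative_abundance <- function(counts) {
  stopifnot(is.numeric(counts), all(counts >= 0), sum(counts) > 0)
  counts / sum(counts)
}

# Made-up counts for two samples; the same function is reused for both.
lake_surface <- c(taxon_a = 10, taxon_b = 30, taxon_c = 60)
lake_bottom  <- c(taxon_a = 5,  taxon_b = 5,  taxon_c = 10)

relative_abundance(lake_surface)
relative_abundance(lake_bottom)
```

Because the logic lives in one place, a bug only has to be fixed once, and the descriptive name tells a reader what each call does.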

1b. Write data for computers

Let your computer do the work for you
Format your data so it’s easily read by your computer, not by you or other humans.

Data formatted for people to read requires cleaning and tidying before it can be processed by a computer.
Name data files in a consistent way.
- Automating tasks will be easier, which will prevent you from making trivial mistakes.
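
The payoff of consistent file names can be sketched as follows. The directory, the file names (`sample_01.csv` and so on), and the values are all made up for this illustration; the point is only that one pattern finds every file.

```r
# Create a throwaway directory with consistently named (made-up) files.
data_dir <- file.path(tempdir(), "consistent_names_demo")
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  out_file <- file.path(data_dir, sprintf("sample_%02d.csv", i))
  write.csv(data.frame(sample = i, value = i * 10), out_file, row.names = FALSE)
}

# Because the names follow one pattern, a single regular expression finds them all.
sample_files <- list.files(data_dir, pattern = "^sample_[0-9]{2}\\.csv$",
                           full.names = TRUE)

# Read every file the same way and stack the results into one data frame.
all_samples <- do.call(rbind, lapply(sample_files, read.csv))
all_samples
```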

2. Make incremental changes.

Work in small steps with frequent feedback.
- Have a friend or labmate test your code and try to break it.
- Challenge your PI to test your code!
Use version control!
Put all manual changes under version control, too!

3. Be a “Defensive Programmer” - Make Assertions

Add tests within your code to make sure your code is doing what it is supposed to do.

Assertions are statements that something holds true. Assertions:
1. Ensure that if something goes wrong, the program will stop.
2. Explain what the program is doing.

In R you can use stopifnot().
- The testthat package is made for this! Check out the testthat package here.
In Python you can use the assert statement.
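
A minimal sketch of an assertion with `stopifnot()` is shown below. The genome sizes are made-up numbers standing in for real data, and the three checks are only examples of conditions you might want to hold.

```r
# Made-up genome sizes (in Mbp) standing in for real data.
genome_size <- c(4.62, 4.63, 4.60, 4.59)

# Each condition must be TRUE, otherwise R stops with an error.
stopifnot(
  is.numeric(genome_size),
  !anyNA(genome_size),
  all(genome_size > 0)
)

mean_size <- mean(genome_size)
mean_size
```

If a missing value or a negative size slipped into the data, the script would halt at `stopifnot()` instead of silently producing a wrong mean.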

4. Use existing libraries (packages) whenever possible

Do not try to re-invent the wheel while you’re performing your data analysis.
Use functions that have already been written and tested for you.

5. Prevent catastrophe and help reproducibility by making your data read-only

Read-only is important because:

Modifying data can corrupt your results.
It’s easy to lose track of how you have changed a file when you modify it in place.
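
One way to make a data file read-only from within R is sketched below, assuming a Unix-like system (on Windows, `Sys.chmod()` can only toggle the read-only flag). The file name and contents are made up, and the file is written to a temporary directory so nothing real is touched.

```r
# Write a small made-up data file to a temporary location.
raw_file <- file.path(tempdir(), "raw_counts_demo.csv")
if (file.exists(raw_file)) {
  Sys.chmod(raw_file, mode = "0644", use_umask = FALSE)  # allow a clean re-run
}
write.csv(data.frame(sample = c("A", "B"), count = c(12, 7)),
          raw_file, row.names = FALSE)

# Remove write permission: owner, group, and others may only read.
Sys.chmod(raw_file, mode = "0444", use_umask = FALSE)

# Inspect the permission bits of the file.
format(file.mode(raw_file))
```

With write permission removed, an accidental overwrite of the raw data from a script fails for an ordinary user instead of quietly changing the file.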

5 Recommendations for Robust Research

Write code for humans, write data for computers
Make incremental changes.
Make assertions and be loud, in code and in your methods
Use existing libraries (packages) whenever possible
Prevent catastrophe and help reproducibility by making your data read-only

6 Recommendations for reproducible research

1. Encapsulate the full project into one directory that is supported with version control.

The Reproducible-Science-Curriculum Github repo for Reproducible Research Project Initialization is a great place to start a reproducible research project.

2. Release your code and data

It is simple. Without your code and data, your research is not reproducible.

3. Document everything!

Bottom line: Adopt a computing notebook that is as good as a wet-lab notebook.

To fully reproduce a study, each step of analysis must be described in much more detail than can be included in a publication.

Include a record of your steps, where files are, where they came from, and what they contain.

Include session_info() in your document, preferably at the bottom. Session info lists the version of R that you’re using plus all of the packages you’ve loaded.

In your computing notebook:

Document your methods and workflows
Document the origin of all data in your project directory
Document when and how you downloaded the data
Record data version info
Record software version info with session_info()

For example, all the above information could be stored in a README file
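
As a rough sketch of recording software versions in a plain-text file, the snippet below uses base R's `sessionInfo()` rather than `devtools::session_info()`, so it runs without extra packages. The file name `session-info.txt` is an illustrative choice, and a temporary directory is used here; in a real project the file would live in the project directory.

```r
# File name is an illustrative choice; any text file in the project would do.
info_file <- file.path(tempdir(), "session-info.txt")

# capture.output() turns what sessionInfo() would print into a character vector.
info_lines <- c(
  paste("Session info recorded on", format(Sys.Date())),
  capture.output(sessionInfo())
)
writeLines(info_lines, info_file)

# Show the first few lines that were written.
head(readLines(info_file))
```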

4. Make figures, tables, and statistics the results of scripts.

Using inline code can make the creation of tables much easier if the data changes!

5. Write code that uses relative paths.

Do not rely on hard-coded absolute paths (e.g. /Users/marschmi/Data/seq-data.csv or even ~/Data/seq-data.csv).

Relative paths (e.g. Data/seq-data.csv) or command line arguments are better alternatives.
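
A small sketch of building and using a relative path is given below. It reuses the Data/seq-data.csv example from above; the file is not assumed to exist, so the code checks before reading.

```r
# Build the path from pieces; file.path() joins them with "/".
seq_file <- file.path("Data", "seq-data.csv")
seq_file

# The path is resolved against the current working directory,
# which in an RStudio project is normally the project folder.
getwd()

# Checking first gives a clear message instead of a cryptic read error.
if (file.exists(seq_file)) {
  seq_data <- read.csv(seq_file)
} else {
  message("Could not find ", seq_file, " relative to ", getwd())
}
```

Anyone who clones the whole project directory can then run the code unchanged, because nothing depends on one person's home folder.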

6. Always Set your seed

If there is any randomization of data or any simulation, use set.seed() in the first code chunk.

Karl Broman suggests opening R, typing runif(1, 0, 10^8), and then pasting the resulting large number into set.seed() in the first code chunk. If you do this, the random aspects of your analysis will be repeated the same way every time.
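
The effect of setting the seed can be checked with a short sketch like the one below. The seed value is an arbitrary example, not one taken from the lesson; you would substitute your own number.

```r
# Example seed only; generate your own with runif(1, 0, 10^8).
my_seed <- 8675309

set.seed(my_seed)
first_draw <- runif(3)

set.seed(my_seed)
second_draw <- runif(3)

# Resetting the seed reproduces exactly the same "random" numbers.
identical(first_draw, second_draw)
```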

6 Recommendations for reproducible research

Encapsulate the full project into one directory that is supported with version control.
Release your code and data.
Document everything and use code as documentation!
Make figures, tables, and statistics the results of scripts.
Write code that uses relative paths.
Always Set your seed.

How can you revise your work flow?

Where can you introduce robust steps?
Where can you add reproducible steps?

Do you have …

RStudio?
R?
- Please install these packages:
  - install.packages("knitr")
  - install.packages("rmarkdown")
A Github account?
Git?
- We may need to generate an SSH key.
  - An SSH key is a way to identify a trusted computer without a password.
  - We will come back to this if necessary.

Using Git through RStudio

Sign into Github.
Initialize repo on the github page.
- Name the Repo “Bioinformatics_reproducibility”
- Down in the right-hand corner, copy the SSH clone URL. Do not be tempted to copy the HTTPS URL!
Open up RStudio
File -> New Project -> Version Control -> Git -> Paste the SSH clone URL.
- Be sure to use the same repo name as on your github page!
- If you get the following error, you may have copied the HTTPS clone URL instead of the SSH clone url.
```
Cloning into 'repo_name'...
error: unable to read askpass response from 'rpostback-askpass'
fatal: could not read Username for 'https://github.com': Device not configured
```

Time to discover how amazing the RStudio interface is!

RMarkdown

What is R Markdown?

RMarkdown is a variant of Markdown that has embedded R code chunks to be used with knitr to make it easy to create reproducible web-based reports.
- Markdown: A system for writing simple, readable text that is easily converted to html.
  - Allows you to write using an easy-to-read, easy-to-write plain text format.
Rmd -> md -> html (docx, pdf)
Can include both text and code to execute

Why R Markdown?

A convenient tool for reproducible and dynamic reports with R!

Execute code with knitr.
Easy to learn syntax.
Include LaTeX equations.
Don’t need to worry about page breaks or figure placement.
Consolidate your code and write up into a single file:
- Slideshows, pdfs, html documents, word files
It’s so easy to use with Git version control!

Simple Workflow

How to Open an Rmd File

Choose Output

YAML Header: A set of key-value pairs at the start of your file. Begin and end the header with a line of three dashes (---).

R Studio template writes the YAML header for you

output: html_document
output: pdf_document
output: word_document
output: beamer_presentation (beamer slideshow - pdf)
output: ioslides_presentation (ioslides presentation - html)

For example: Here’s the YAML header for this webpage with a table of contents.

---
title: "Reproducible Research Using RMarkdown and Git through RStudio"
subtitle: "Tutorial for EEB 416 - Intro to Bioinformatics"
author: "Marian L. Schmidt, @micro_marian, marschmi@umich.edu"
date: "October 12th, 2015"
output:
  html_document:
    theme: united
    toc: yes
---

Markdown basics

Markdown is a simple formatting language that is easy to use

Create lists with * or + sign
- like this
- and this
- A very important note: The end of a line is marked by two spaces and an enter!! Otherwise your list will look ugly like the one below:
Use one or two asterisks to provide emphasis, such as *italics* and **bold**. You can even include tables:

First Header	Second Header
Content Cell	Content Cell
Content Cell	Content Cell

R Code Chunks

Code blocks display with fixed-width font

#quick summary
library(ggplot2)
min(diamonds$price)

## [1] 326

mean(diamonds$price)

## [1] 3932.8

max(diamonds$price)

## [1] 18823

More R Code Chunks

You can name the code chunk.
echo = TRUE: The code will be displayed.
eval = TRUE: Yes, execute the code.

R Code Chunk Arguments

R Code Chunks: Displaying Plots

Global Chunk Options

You may want to use the same set of chunk options throughout a document and you don’t want to retype those options in every chunk.

Global chunk options are for you!
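
A minimal sketch of setting global chunk options is shown below, using `knitr::opts_chunk$set()`. The particular option values are illustrative rather than the settings used for this lesson, and the `requireNamespace()` guard is only there so the snippet does nothing if knitr is not installed.

```r
# Typically placed in the first chunk of an .Rmd file.
# The specific option values here are illustrative, not the lesson's own settings.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = TRUE,           # show the code in the report
    eval = TRUE,           # run the code
    fig.align = "center"   # center figures on the page
  )
  knitr::opts_chunk$get("fig.align")
}
```

Options given in an individual chunk header still override these defaults for that chunk.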

Inline R Code

You can evaluate expressions inline by enclosing the expression within single back-ticks, with the letter r directly after the opening back-tick.

Inline code is underappreciated!

Last night, I saw 7 shooting stars!
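
To see how a sentence like the one above could be produced, the sketch below knits a single line of R Markdown text from within R. The expression `3 + 4` is a hypothetical stand-in, since the original expression behind the sentence is not shown; the guard simply skips the demo if knitr is not installed.

```r
# One line of R Markdown text with an inline R expression.
rmd_line <- "Last night, I saw `r 3 + 4` shooting stars!"

if (requireNamespace("knitr", quietly = TRUE)) {
  # knit(text = ...) processes the string instead of a file.
  rendered <- knitr::knit(text = rmd_line, quiet = TRUE)
  cat(rendered, "\n")
}
```

In a real document the expression would usually refer to an object computed in an earlier chunk, so the sentence updates itself whenever the data change.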

Rendering document

Run rmarkdown::render("<filepath>")
Click the very cute knit HTML button at the top of the RStudio scripts pane

When you render, R will:

Execute each embedded code chunk and insert the results into your report.
Build a new version of your report in the output file type.
Open a preview of the output file in the viewer pane.
Save the output file in your working directory.

A Git Workflow within RStudio

Make some changes to your document that you would like to save a copy of.
Git add by checking the box under “staged” in the git screen.
- Hint: The git screen is in the same pane as the RStudio Environment and history.
Draft your commit message.
- It should be a meaningful message!
- Think of you in 6 months looking for changes you had made to your document. Don’t you want to be your own friend?
- Do not allow your commit messages to get less informative as your project continues.
Click “Commit.”
Perform Git Push by clicking the bright green arrow, which is the “git Push” button.
Make sure everything is pushed to the remote repository without any errors.

Click on the clock button to view your git history. Here, you can also view the difference between documents.

Challenge

Working within your new R Project “Bioinformatics_reproducibility”:
Let’s create a new directory called “data”.
Download the Ecoli_metadata.csv from Data Carpentry into your data directory within your “Bioinformatics_reproducibility” repository.
Create a new RMarkdown file within “Bioinformatics_reproducibility”.
Time to start our project!

For example, we can learn from part of the Data Carpentry lesson written by Kate Hertweck, Susan McClatchey, Tracy Teal, and Ryan Williams:

runif(1, 0, 10^8) # Generate a random number 

#############  First Code chunk - setting the seed
## {r Set the seed, include=TRUE, echo=TRUE, eval = TRUE}
set.seed() # Insert your random number inside the parentheses - NOTE:  Only do this once, when you are initializing your file!
## end chunk 1

#############  Second Code Chunk - reading in the data
## {r Import Data, echo = TRUE, eval = TRUE}
metadata <- read.csv('data/Ecoli_metadata.csv') # Load in the data from the data directory!
head(metadata) # This will show us the first 6 rows of the dataframe
str(metadata) # This will show us the structure of the data
mean(metadata$genome_size) # Calculate the mean genome_size
## end chunk 2

#############  Third Chunk - Install and load necessary packages
## {r package import, echo = TRUE, eval = TRUE}
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2") # Install the best plotting package in R, but only if it is missing
library(ggplot2) # Make sure R knows to source from it
## end chunk 3

#############  Fourth Chunk - Create some plots!
## {r data exploration, echo = TRUE, eval = TRUE, fig.align = "center"}
##  Plot 1:  Let's look at the distribution of the genome size
ggplot(metadata, aes(x = genome_size)) +
  geom_histogram(binwidth = 0.01) # create a histogram of genome size with a bin width of 0.01

# Plot 2:  Looking at all of the genome sizes for each strain
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
  geom_point(size = rel(3.0)) + # we are going to make points
  theme(axis.text.x = element_text(angle=45, hjust=1)) # x-axis text on a 45 degree angle 
  
# Plot 3: Comparing the distribution of genome size across the types of E. coli mutants
ggplot(metadata, aes(x = cit, y = genome_size, fill = cit)) + # plot time 
  geom_boxplot() + # make it a boxplot
  ggtitle('Boxplot of genome size by citrate mutant type') + #add a title
  xlab('citrate mutant') + # add x axis label
  ylab('genome size') + #add y axis label
  theme(axis.text.x = element_text(angle=45, hjust=1), # put x axis text on a 45 degree angle
          axis.title = element_text(size = rel(1.5)), #make the relative size of the axis title text
          axis.text = element_text(size = rel(1.25))) #make the relative size of the axis text
## end chunk 4
          
#############  Final Chunk 5         
## {r Presentation session_info, include=TRUE, echo=TRUE, results='markup'}
devtools::session_info() # This will include session info
## end chunk 5

Resources

Resources for Reproducible Research

Vince Buffalo’s Bioinformatics Data Skills book and its helpful github page.
- Main source for this presentation.
A course by Karl Broman at the University of Wisconsin-Madison on Reproducible Research.
Reproducible-Science-Curriculum Github repo for Reproducible Research Project Initialization
ROpenSci Reproducibility Research guidelines
Publications:
- Best Practices for Scientific Computing by Wilson et al., 2014.
- A Quick Guide to Organizing Computational Biology Projects by Noble, 2009.
- Reproducible Research in Computational Science by Peng, 2011.
Statistics Department Resources from the University of Wisconsin
“Baby steps for the open-curious” from Christie Bahlai

Resources for RMarkdown, RStudio and R

Yihui Xie’s Dynamic Documents with R and Knitr and its github page.
Resources from Jennifer Bryan’s Stats 545.
RMarkdown Quick Reference Guide
Christopher Gandrud’s Reproducible Research with R and RStudio and its github page
RStudio RMarkdown Documentation
Rmd Cheatsheet
Knitr Reference Card
R Cookbook for ggplot

Other Resources

Session Info

Karl Broman recommends using the session_info() function from the devtools package.

devtools::session_info()

##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Detroit             
## 
##  package    * version date       source        
##  colorspace   1.2-6   2015-03-11 CRAN (R 3.2.0)
##  curl         0.9.3   2015-08-25 CRAN (R 3.2.2)
##  devtools     1.8.0   2015-05-09 CRAN (R 3.2.0)
##  digest       0.6.8   2014-12-31 CRAN (R 3.2.0)
##  evaluate     0.7.2   2015-08-13 CRAN (R 3.2.0)
##  formatR      1.2     2015-04-21 CRAN (R 3.2.0)
##  ggplot2    * 1.0.1   2015-03-17 CRAN (R 3.2.0)
##  git2r        0.11.0  2015-08-12 CRAN (R 3.2.0)
##  gtable       0.1.2   2012-12-05 CRAN (R 3.2.0)
##  htmltools    0.2.6   2014-09-08 CRAN (R 3.2.0)
##  knitr        1.11    2015-08-14 CRAN (R 3.2.2)
##  labeling     0.3     2014-08-23 CRAN (R 3.2.0)
##  magrittr     1.5     2014-11-22 CRAN (R 3.2.0)
##  MASS         7.3-43  2015-07-16 CRAN (R 3.2.2)
##  memoise      0.2.1   2014-04-22 CRAN (R 3.2.0)
##  munsell      0.4.2   2013-07-11 CRAN (R 3.2.0)
##  plyr         1.8.3   2015-06-12 CRAN (R 3.2.0)
##  proto        0.3-10  2012-12-22 CRAN (R 3.2.0)
##  Rcpp         0.12.0  2015-07-25 CRAN (R 3.2.0)
##  reshape2     1.4.1   2014-12-06 CRAN (R 3.2.0)
##  rmarkdown    0.7     2015-06-13 CRAN (R 3.2.0)
##  rversions    1.0.2   2015-07-13 CRAN (R 3.2.0)
##  scales       0.3.0   2015-08-25 CRAN (R 3.2.2)
##  stringi      0.5-5   2015-06-29 CRAN (R 3.2.0)
##  stringr      1.0.0   2015-04-30 CRAN (R 3.2.0)
##  xml2         0.1.1   2015-06-02 CRAN (R 3.2.0)
##  yaml         2.1.13  2014-06-12 CRAN (R 3.2.0)