This tutorial was originally constructed as a part of Titus Brown’s Next Generation Sequencing Data Analysis Workshop Week 3 that took place at Michigan State University’s Kellogg Biological Station between August 24-28, 2015.
If not, what would you and your co-authors need to provide or do so I could replicate Figure 1 from your last publication?
Yeah, it takes a lot of effort to be robust and reproducible. However, it will make your life (and science) easier!
Code readability is very important.
If your code is more readable, then:
Let your computer do the work for you
Format your data so its easily read by your computer, not by you or other humans.
Add tests within your code to make sure your code is doing what it is supposed to do.
Assertions are statments that something holds true. Assertions:
1. Ensure that if something goes wrong, the program will stop.
2. They also explain what the program is doing.
stopifnot()
testthat package is made for this! Check it out the testthat package hereassert()Read-only is important because:
The Reproducible-Science-Curriculum Github repo for Reproducible Research Project Initialization is a great place to start a reproducible research project.
It is simple. Without your code and data, your research is not reproducible.
Bottom line: Adopt a computing notebook that is as good as a wet-lab notebook.
To fully reproduce a study, each step of analysis must be described in much more detail than can be included in a publication.
Include a record of your steps, where files are, where they came from, and what they contain.
Include session_info() in your document, preferably at the bottom. Session info lists the version of R that you’re using plus all of the packages you’ve loaded.
session_info()For example, all the above information could be stored in a README file
Using inline code can make the creation of tables much easier if the data changes!
Do not rely on hard-coded absolute paths (i.e. /Users/marschmi/Data/seq-data.csv or even ~/Data/seq-data.csv).
Relative paths (i.e. Data/seq-data.csv) or command line arguments are better alternatives.
If there is any randomizations of data or simulations, use set.seed() in the first code chunk.
Karl Broman suggests to open R and type runif(1, 0, 10^8) and then paste the resulting large number into set.seed() in the first code chunk. If you do this, then the random aspects of your analysis should be repeated the same way.
If you get the following error, you may have copied the HTTPS clone URL instead of the SSH clone url.
Cloning into 'repo_name'...
error: unable to read askpass response from 'rpostback-askpass'
fatal: could not read Username for 'https://github.com': Device not configuredknitr to make it easy to create reproducible web-based reports.
A convenient tool for reproducible and dynamic reports with R!
knitr.YAML Header: A set of key value pairs at the start of your file. Begin and end the header with a line of three dashes (- - -)
R Studio template writes the YAML header for you
output: html_document
output: pdf_document
output: word_document
output: beamer_presentation (beamer slideshow - pdf)
output: ioslides_presentation (ioslides presentation - html)
For example: Here’s the YAML header for this webpage with a table of contents.
---
title: "Reproducible Research Using RMarkdown and Git through RStudio"
subtitle: "Tutorial for EEB 416 - Intro to Bioinformatics"
author: "Marian L. Schmidt, @micro_marian, marschmi@umich.edu"
date: "October 12th, 2015"
output:
html_document:
theme: united
toc: yes
---
Markdown is a simple formatting language that is easy to use
* or + sign
*italics* and **bold**. Can even include tables:| First Header | Second Header |
|---|---|
| Content Cell | Content Cell |
| Content Cell | Content Cell |
Code blocks display with fixed-width font
#quick summary
library(ggplot2)
min(diamonds$price)
## [1] 326
mean(diamonds$price)
## [1] 3932.8
max(diamonds$price)
## [1] 18823
You can name the code chunk.
echo = TRUE: The code will be displayed.
eval = TRUE: Yes, execute the code.
You may want to use the same set of chunk options throughout a document and you don’t want to retype those options in every chunk.
Global chunk options are for you!
You can evaluate expressions inline by enclosing the expression within a single back-tick qualified with r.
Inline code is underappreciated!
Last night, I saw 7 shooting stars!
rmarkdown::render("<filepath>")When you render, R will:
Execute each embedded code chunk and insert the results into your report.
Build a new version of your report in the output file type.
Open a preview of the output file in the viewer pane.
Save the output file in your working directory.
Click on the clock button to view your git history. Here, you can also view the difference between documents.
For example we can learn from some of the data carpentry lesson written by Kate Hertweck, Susan McClatchey, Tracy Teal, and Ryan Williams:
runif(1, 0, 10^8) # Generate a random number
############# First Code chunk - setting the seed
## {r Set the seed, include=TRUE, echo=TRUE, eval = TRUE}
set.seed() # Insert your random number here - NOTE: Only do this once when you are initalizing your file!
## end chunk 1
############# Second Code Chunk - reading in the data
## {r Import Data, echo = TRUE, eval = TRUE}
metadata <- read.csv('data/Ecoli_metadata.csv') # Load in the data from the data directory!
head(metadata) # This will show us the first 6 rows of the dataframe
str(metadata) # This will show us the structure of the data
mean(metadata$genome_size) # Calculate the mean genome_size
## end chunk 2
############# Third Chunk - Install and load necessary packages
## {r package import, echo = TRUE, eval = TRUE}
install.packages("ggplot2") # Install the best plotting package in R
library(ggplot2) # Make sure R knows to source from it
## end chunk 3
############# Fourth Chunk - Create some plots!
## {r data exploration, echo = TRUE, eval = TRUE, fig.center = TRUE}
## Plot 1: Let's look at the distribution of the genome size
ggplot(metadata, aes(x = genome_size)) +
geom_bar(stat = "bin", binwidth=0.01) # create a bar plot (histogram) with bins by a genome size of 0.01
# Plot 2: Looking at all of the genome sizes for each strain
ggplot(metadata, aes(x = sample, y= genome_size, color = generation, shape = cit)) +
geom_point(size = rel(3.0)) + # we are going to make points
theme(axis.text.x = element_text(angle=45, hjust=1)) # x-axis text on a 45 degree angle
# Plot 3: Taking the average genome size for the types of E.coli mutants
ggplot(metadata, aes(x = cit, y = genome_size, fill = cit)) + # plot time
geom_boxplot() + # make it a boxplot
ggtitle('Boxplot of genome size by citrate mutant type') + #add a title
xlab('citrate mutant') + # add x axis label
ylab('genome size') + #add y axis label
theme(axis.text.x = element_text(angle=45, hjust=1), # put x axis text on a 45 degree angle
axis.title = element_text(size = rel(1.5)), #make the relative size of the axis title text
axis.text = element_text(size = rel(1.25))) #make the relative size of the axis text
## end chunk 4
############# Final Chunk 5
# {r Presentation session_info, include=TRUE, echo=TRUE, results='markup'}
devtools::session_info() # This will include session info
## end chunk 5
Karl Broman recommends using the session_info() from the devtools package.
devtools::session_info()
## setting value
## version R version 3.2.2 (2015-08-14)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Detroit
##
## package * version date source
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
## curl 0.9.3 2015-08-25 CRAN (R 3.2.2)
## devtools 1.8.0 2015-05-09 CRAN (R 3.2.0)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## evaluate 0.7.2 2015-08-13 CRAN (R 3.2.0)
## formatR 1.2 2015-04-21 CRAN (R 3.2.0)
## ggplot2 * 1.0.1 2015-03-17 CRAN (R 3.2.0)
## git2r 0.11.0 2015-08-12 CRAN (R 3.2.0)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
## knitr 1.11 2015-08-14 CRAN (R 3.2.2)
## labeling 0.3 2014-08-23 CRAN (R 3.2.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## MASS 7.3-43 2015-07-16 CRAN (R 3.2.2)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
## proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
## Rcpp 0.12.0 2015-07-25 CRAN (R 3.2.0)
## reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
## rmarkdown 0.7 2015-06-13 CRAN (R 3.2.0)
## rversions 1.0.2 2015-07-13 CRAN (R 3.2.0)
## scales 0.3.0 2015-08-25 CRAN (R 3.2.2)
## stringi 0.5-5 2015-06-29 CRAN (R 3.2.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## xml2 0.1.1 2015-06-02 CRAN (R 3.2.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)