Learning Objectives:

This course covers the following content:

Week 1

1-1 Concepts

Replication is hard for several reasons. For reproducible research, make the data and methods available, make the analytic code available, and document both the code and the data. Reproducibility serves as a validation of your analysis.

The research pipeline: measured data → analytic data → computational results → figures/tables/numerical summaries → article.

Literate programming
An article is a stream of text and code chunks. Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable code. knitr is the modern alternative to Sweave.
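As a minimal sketch, an R Markdown document interleaves prose with code chunks like the one below (the chunk label and its use of R's built-in airquality data are illustrative, not from the course):

```{r ozone-summary}
## a code chunk: evaluated when the document is weaved/knit
summary(airquality$Ozone)
```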

Scripting everything is essential for reproducible research

1-2 Organize your data analysis

Data analysis files:

  • Data: raw data and processed data (including the processing scripts and instructions)
  • Figures: exploratory figures and final figures
  • R code: final scripts, clearly commented, including processing details; R Markdown files
  • Text: readme.txt and the text of the document
Steps in a data analysis:

  1. Define the question
  2. Define the ideal data set
  3. Obtain the data
  4. Clean the data
  5. Exploratory data analysis
  6. Statistical prediction/modeling
  7. Interpret results
  8. Challenge results
  9. Synthesize/write up results
  10. Create reproducible code

Quiz 1

Q10
How the knitr system differs from Sweave: knitr allows chunks to be written in Markdown in addition to LaTeX. (Note: knitr, like Sweave, is written in R, not Python.)
Q9
A documentation language like LaTeX is required to implement a literate programming system

Week 2

Learning Objectives:

2-1 Tools: R Markdown

Coding standards:

  • Write code in text files, using a text editor
  • Indent your code
  • Limit the size of your individual functions
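For instance, a small single-purpose function written to these standards (a made-up example, not from the course):

## Mean of a numeric vector, ignoring missing values:
## small, single-purpose, and consistently indented
column_mean <- function(x) {
    if (!is.numeric(x)) {
        stop("'x' must be numeric")
    }
    mean(x, na.rm = TRUE)
}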

R Markdown is a core tool in literate statistical programming.
Slides written in R Markdown can be converted to HTML slides using the slidify package.
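A rough sketch of that workflow, assuming slidify is installed (it is distributed on GitHub rather than CRAN; the deck name here is illustrative):

library(slidify)
author("mydeck")         ## scaffolds a new deck, including an index.Rmd
## ... edit index.Rmd, writing the slides in R Markdown ...
slidify("index.Rmd")     ## compiles the R Markdown slides to index.html
browseURL("index.html")  ## view the result in a browser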

2-2 knitr

Setting Global Options

```{r setoptions, echo=FALSE}
opts_chunk$set(echo = FALSE, results = "hide")
```

Caching computations: set cache=TRUE on a chunk so its results are stored and reused on later knits. eval=FALSE shows the code without evaluating it. Chunks can be named/labeled with {r labelname, echo=FALSE}. (In RStudio, a comment ending in #### becomes a code section listed in the editor's jump-to menu.)
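For example, a labeled chunk with caching turned on (the label and computation here are illustrative, not from the course):

```{r big-simulation, cache=TRUE}
## an expensive chunk: it runs once, and later knits reload the cached result
x <- rnorm(1e6)
mean(x)
```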

Inline R code: first compute values in a code chunk,

time <- Sys.time()
rand <- rnorm(1)

then reference them in the text as `r time` and `r rand`. When the document is knit, this renders as, for example: "The current time is 2016-02-29 10:29:12, the random number is -0.5083631."

library(knitr) loads the package; ls("package:knitr") then lists the functions it provides.

Week 3

Learning Objectives:

3-1 Distribution and Checklist

Communicate data analysis findings and publish them on the web: click the Knit HTML button in the editor toolbar to preview your document, then in the preview window click the Publish button to post it to RPubs.

Reproducible Research Checklist:

  • Don’t do things by hand; anything done by hand must be precisely documented
  • Don’t point and click
  • Teach the computer to do as much as possible
  • Use version control (e.g., GitHub)
  • Keep track of your software environment
  • Don’t save output; save the data and code that generate it
  • Set your seed, e.g. set.seed(1) (see the sketch after this list)
  • Think about the entire pipeline
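A minimal sketch of the seed and environment points:

## Setting the seed makes "random" results exactly reproducible
set.seed(1)
rnorm(3)    ## -0.6264538  0.1836433 -0.8356286 (with R's default RNG)
set.seed(1)
rnorm(3)    ## the same three numbers again

## Record the software environment alongside the analysis
sessionInfo()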

3-2 Evidence-based data analysis

Replicable vs. reproducible: reproducibility focuses on the validity of the data analysis, but it does not settle the question of whether the analysis itself is trustworthy.

Week 4

4-1 Caching computations

The cacher package (usable both for authoring an analysis and for reading someone else's) evaluates code written in source files and stores the intermediate results in a key-value database. A cache package contains the source files, the cached data objects, and metadata. The checkcode() function re-evaluates all the code from scratch to check the cached results.

library(cacher)  ## Note: this package is not available for R version 3.2.3
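Since the package will not install here, this is only a sketch of the reader workflow described in the lectures (the cache id and file name are placeholders, not real values):

library(cacher)
clonecache(id = "abc123")  ## placeholder id: clone someone else's cached analysis
showfiles()                ## list the source files in the cache
sourcefile("top20.R")      ## placeholder name: choose a file to examine
code()                     ## show the expressions in that source file
loadcache()                ## lazy-load cached objects into the workspace
checkcode()                ## re-evaluate all the code from scratch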

4-2 Case studies

Air pollution: PM10 and chemicals such as nickel in the air. The question is whether nickel makes particulate matter (PM) more toxic.

A case study from Course 4 (Exploratory Data Analysis):
We'll apply some of the techniques learned in this course to study air pollution data, specifically fine particulate matter (sometimes written pm25). Our goal is to see whether there has been a noticeable decline in this type of air pollution between 1999 and 2012.

## pm0 and pm1 are the data sets for the two years (1999 and 2012)
head(pm0)
cnames <- readLines("RD_501_88101_1999-0.txt", 1)  ## column names are in the first line of the raw file
cnames <- strsplit(cnames, "|", fixed = TRUE)
names(pm0) <- make.names(cnames[[1]][wcol])  ## wcol: indices of the columns that were read
names(pm1) <- make.names(cnames[[1]][wcol])
x0 <- pm0$Sample.Value; str(x0); mean(is.na(x0)); summary(x0)
x1 <- pm1$Sample.Value
boxplot(x0, x1)
boxplot(log10(x0), log10(x1))  ## the data are right-skewed, so a log scale helps
negative <- x1 < 0; sum(negative, na.rm = TRUE); mean(negative, na.rm = TRUE)
dates <- pm1$Date; str(dates)
dates <- as.Date(as.character(dates), "%Y%m%d")  ## dates are stored as YYYYMMDD integers

## When do the negative values occur? Histogram of their dates, binned by month
hist(dates[negative], "month")

## site0, site1: monitor IDs ("County.Code.Site.ID") for New York in 1999 and 2012
both <- intersect(site0, site1)  ## monitors operating in both years
pm0$county.site <- with(pm0, paste(County.Code, Site.ID, sep = "."))
pm1$county.site <- with(pm1, paste(County.Code, Site.ID, sep = "."))
cnt0 <- subset(pm0, State.Code == 36 & county.site %in% both)  ## 36 is the code for New York
cnt1 <- subset(pm1, State.Code == 36 & county.site %in% both)
sapply(split(cnt0, cnt0$county.site), nrow)  ## how many measurements each monitor recorded
sapply(split(cnt1, cnt1$county.site), nrow)
## Which monitor is the only one whose number of measurements increased from 1999 to 2012?
## The monitor with ID 63.2008
pm0sub <- subset(cnt0, County.Code == 63 & Site.ID == 2008); x0sub <- pm0sub$Sample.Value
pm1sub <- subset(cnt1, County.Code == 63 & Site.ID == 2008); x1sub <- pm1sub$Sample.Value
dates0 <- as.Date(as.character(pm0sub$Date), "%Y%m%d")
dates1 <- as.Date(as.character(pm1sub$Date), "%Y%m%d")
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))  ## two panels side by side
plot(dates0, x0sub, pch = 20)
plot(dates1, x1sub, pch = 20)
abline(h = median(x1sub, na.rm = TRUE), lwd = 2)  ## the 1999 panel shows a wider range of y values than the 2012 panel
rng <- range(x0sub, x1sub, na.rm = TRUE)  ## common y-axis range, used to replot both panels (see below)
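A sketch of that replot, putting both panels on the shared range so they are directly comparable:

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(dates0, x0sub, pch = 20, ylim = rng)
abline(h = median(x0sub, na.rm = TRUE), lwd = 2)
plot(dates1, x1sub, pch = 20, ylim = rng)
abline(h = median(x1sub, na.rm = TRUE), lwd = 2)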

## Average Sample.Value by state for each year
mn0 <- with(pm0, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))
mn1 <- with(pm1, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))
d0 <- data.frame(state = names(mn0), mean = mn0)
d1 <- data.frame(state = names(mn1), mean = mn1)
mrg <- merge(d0, d1, by = "state"); head(mrg)  ## mean.x is 1999, mean.y is 2012
mrg[mrg$mean.x < mrg$mean.y, ]  ## states whose mean increased
## Plot comparing the two years: one point per state per year, connected by segments
with(mrg, plot(rep(1, 52), mrg[, 2], xlim = c(.5, 2.5)))
with(mrg, points(rep(2, 52), mrg[, 3]))
segments(rep(1, 52), mrg[, 2], rep(2, 52), mrg[, 3])

Case study: high-throughput biology

A guest lecture by Prof. …

Questions