Learning Objectives:

This course covers the following content:

Week 1

1-1 Concepts

Replication is hard for several reasons. For reproducible research, make the data and methods available, make the analytic code available, and document both the code and the data. Reproducibility serves as a validation of your analysis.

The research pipeline: measured data → analytic data → computational results → figures/tables/numerical summaries → article.

Literate programming
An article is a stream of text and code chunks. Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable code. knitr is the modern alternative to Sweave.
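As a minimal sketch, an R Markdown document interleaves prose with code chunks like the one below (the chunk label and its use of R's built-in airquality data are illustrative, not from the course):

```{r ozone-summary}
## a code chunk: evaluated when the document is weaved/knit
summary(airquality$Ozone)
```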

Scripting everything is essential for reproducible research

1-2 Organize your data analysis

Data analysis files:

  • Data: raw data and processed data (including the processing scripts and instructions)
  • Figures: exploratory figures and final figures
  • R code: final scripts, clearly commented, including processing details; R Markdown files
  • Text: readme.txt and the text of the document
Steps in a data analysis:

  1. Define the question
  2. Define the ideal data set
  3. Obtain the data
  4. Clean the data
  5. Exploratory data analysis
  6. Statistical prediction/modeling
  7. Interpret results
  8. Challenge results
  9. Synthesize/write up results
  10. Create reproducible code

Quiz 1

Q10
How the knitr system differs from Sweave: knitr allows chunks to be written in Markdown in addition to LaTeX. (Note: knitr, like Sweave, is written in R, not Python.)
Q9
A documentation language like LaTeX is required to implement a literate programming system

Week 2

Learning Objectives:

2-1 Tools: R Markdown

Coding standards:

  • Write code in text files, using a text editor
  • Indent your code
  • Limit the size of your individual functions
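For instance, a small single-purpose function written to these standards (a made-up example, not from the course):

## Mean of a numeric vector, ignoring missing values:
## small, single-purpose, and consistently indented
column_mean <- function(x) {
    if (!is.numeric(x)) {
        stop("'x' must be numeric")
    }
    mean(x, na.rm = TRUE)
}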

R Markdown is a core tool in literate statistical programming.
Slides written in R Markdown can be converted to HTML slides using the slidify package.
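A rough sketch of that workflow, assuming slidify is installed (it is distributed on GitHub rather than CRAN; the deck name here is illustrative):

library(slidify)
author("mydeck")         ## scaffolds a new deck, including an index.Rmd
## ... edit index.Rmd, writing the slides in R Markdown ...
slidify("index.Rmd")     ## compiles the R Markdown slides to index.html
browseURL("index.html")  ## view the result in a browser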

2-2 knitr

Setting Global Options

```{r setoptions, echo=FALSE}
opts_chunk$set(echo = FALSE, results = "hide")
```

Caching computations: set cache=TRUE on a chunk so its results are stored and reused on later knits. eval=FALSE shows the code without evaluating it. Chunks can be named/labeled with {r labelname, echo=FALSE}. (In RStudio, a comment ending in #### becomes a code section listed in the editor's jump-to menu.)
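For example, a labeled chunk with caching turned on (the label and computation here are illustrative, not from the course):

```{r big-simulation, cache=TRUE}
## an expensive chunk: it runs once, and later knits reload the cached result
x <- rnorm(1e6)
mean(x)
```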

Inline R code: first compute values in a code chunk,

time <- Sys.time()
rand <- rnorm(1)

then reference them in the text as `r time` and `r rand`. When the document is knit, this renders as, for example: "The current time is 2016-02-29 10:29:12, the random number is -0.5083631."

library(knitr) loads the package; ls("package:knitr") then lists the functions it provides.

Week 3

Learning Objectives:

3-1 Distribution and Checklist

Communicate data analysis findings and publish them on the web: click the Knit HTML button in the editor toolbar to preview your document, then in the preview window click the Publish button to post it to RPubs.

Reproducible Research Checklist:

  • Don’t do things by hand; anything done by hand must be precisely documented
  • Don’t point and click
  • Teach the computer to do as much as possible
  • Use version control (e.g., GitHub)
  • Keep track of your software environment
  • Don’t save output; save the data and code that generate it
  • Set your seed, e.g. set.seed(1) (see the sketch after this list)
  • Think about the entire pipeline
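A minimal sketch of the seed and environment points:

## Setting the seed makes "random" results exactly reproducible
set.seed(1)
rnorm(3)    ## -0.6264538  0.1836433 -0.8356286 (with R's default RNG)
set.seed(1)
rnorm(3)    ## the same three numbers again

## Record the software environment alongside the analysis
sessionInfo()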

3-2 Evidence-based data analysis

Replicable vs. reproducible: reproducibility focuses on the validity of the data analysis, but it does not settle the question of whether the analysis itself is trustworthy.

Week 4

4-1 Caching computations

The cacher package (usable both for authoring an analysis and for reading someone else's) evaluates code written in source files and stores the intermediate results in a key-value database. A cache package contains the source files, the cached data objects, and metadata. The checkcode() function re-evaluates all the code from scratch to check the cached results.

library(cacher)  ## Note: this package is not available for R version 3.2.3
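Since the package will not install here, this is only a sketch of the reader workflow described in the lectures (the cache id and file name are placeholders, not real values):

library(cacher)
clonecache(id = "abc123")  ## placeholder id: clone someone else's cached analysis
showfiles()                ## list the source files in the cache
sourcefile("top20.R")      ## placeholder name: choose a file to examine
code()                     ## show the expressions in that source file
loadcache()                ## lazy-load cached objects into the workspace
checkcode()                ## re-evaluate all the code from scratch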

4-2 Case studies

Air pollution: PM10 and chemicals such as nickel in the air. The question is whether nickel makes particulate matter (PM) more toxic.

A case study from Course 4 (Exploratory Data Analysis):
We'll apply some of the techniques learned in this course to study air pollution data, specifically fine particulate matter (sometimes written pm25). Our goal is to see whether there has been a noticeable decline in this type of air pollution between 1999 and 2012.

## pm0 and pm1 are the data sets for the two years (1999 and 2012)
head(pm0)
cnames <- readLines("RD_501_88101_1999-0.txt", 1)  ## column names are in the first line of the raw file
cnames <- strsplit(cnames, "|", fixed = TRUE)
names(pm0) <- make.names(cnames[[1]][wcol])  ## wcol: indices of the columns that were read
names(pm1) <- make.names(cnames[[1]][wcol])
x0 <- pm0$Sample.Value; str(x0); mean(is.na(x0)); summary(x0)
x1 <- pm1$Sample.Value
boxplot(x0, x1)
boxplot(log10(x0), log10(x1))  ## the data are right-skewed, so a log scale helps
negative <- x1 < 0; sum(negative, na.rm = TRUE); mean(negative, na.rm = TRUE)
dates <- pm1$Date; str(dates)
dates <- as.Date(as.character(dates), "%Y%m%d")  ## dates are stored as YYYYMMDD integers

## When do the negative values occur? Histogram of their dates, binned by month
hist(dates[negative], "month")

## site0, site1: monitor IDs ("County.Code.Site.ID") for New York in 1999 and 2012
both <- intersect(site0, site1)  ## monitors operating in both years
pm0$county.site <- with(pm0, paste(County.Code, Site.ID, sep = "."))
pm1$county.site <- with(pm1, paste(County.Code, Site.ID, sep = "."))
cnt0 <- subset(pm0, State.Code == 36 & county.site %in% both)  ## 36 is the code for New York
cnt1 <- subset(pm1, State.Code == 36 & county.site %in% both)
sapply(split(cnt0, cnt0$county.site), nrow)  ## how many measurements each monitor recorded
sapply(split(cnt1, cnt1$county.site), nrow)
## Which monitor is the only one whose number of measurements increased from 1999 to 2012?
## The monitor with ID 63.2008
pm0sub <- subset(cnt0, County.Code == 63 & Site.ID == 2008); x0sub <- pm0sub$Sample.Value
pm1sub <- subset(cnt1, County.Code == 63 & Site.ID == 2008); x1sub <- pm1sub$Sample.Value
dates0 <- as.Date(as.character(pm0sub$Date), "%Y%m%d")
dates1 <- as.Date(as.character(pm1sub$Date), "%Y%m%d")
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))  ## two panels side by side
plot(dates0, x0sub, pch = 20)
plot(dates1, x1sub, pch = 20)
abline(h = median(x1sub, na.rm = TRUE), lwd = 2)  ## the 1999 panel shows a wider range of y values than the 2012 panel
rng <- range(x0sub, x1sub, na.rm = TRUE)  ## common y-axis range, used to replot both panels (see below)
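A sketch of that replot, putting both panels on the shared range so they are directly comparable:

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(dates0, x0sub, pch = 20, ylim = rng)
abline(h = median(x0sub, na.rm = TRUE), lwd = 2)
plot(dates1, x1sub, pch = 20, ylim = rng)
abline(h = median(x1sub, na.rm = TRUE), lwd = 2)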

## Average Sample.Value by state for each year
mn0 <- with(pm0, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))
mn1 <- with(pm1, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))
d0 <- data.frame(state = names(mn0), mean = mn0)
d1 <- data.frame(state = names(mn1), mean = mn1)
mrg <- merge(d0, d1, by = "state"); head(mrg)  ## mean.x is 1999, mean.y is 2012
mrg[mrg$mean.x < mrg$mean.y, ]  ## states whose mean increased
## Plot comparing the two years: one point per state per year, connected by segments
with(mrg, plot(rep(1, 52), mrg[, 2], xlim = c(.5, 2.5)))
with(mrg, points(rep(2, 52), mrg[, 3]))
segments(rep(1, 52), mrg[, 2], rep(2, 52), mrg[, 3])

Case study: high-throughput biology

A guest lecture by Prof. …

Questions