Learning Objectives:
This course includes the following contents:
Replication is hard for several reasons. for reproducible research, you could make the data and method available, Analytic code are available, documentation of code and data. It is the validation of your analysis.
Scripting everything is essential for reproducible research
Data analysis files:
learning Objectives:
coding standards
R markdown is core tool in literate statistical programming.
slides written in R markdown can converted to slides using the slidify packge
Seting Global Options
\```{r,setoptions, echo=FALSE}
opts_chunk$set(echo=FALSE, result="hide")
\```
Caching Computations: cache=TRUE
. Add ####
at the end of every comment you want to list.eval=FALSE
will show the code without evaluation the codes.You could just do {r labelname, echo = FALSE}
to name or label the chunks.
time <- Sys.time()
rand <- rnorm(1)
The current time is 2016-02-29 10:29:12, the random number is -0.5083631.
library(knitr)
to see functions in this package.
learning Objectives:
Communicate data analysis findings and publish them on the web: Click the Knit HTML button in the doc toolbar to preview your document. In the preview window, click the Publish button(R Pubs).
Reproducible Research Checklist:
replicable vs reproducible, Reproducibility focuses on the validity of data analysis, but does not solve the quesiton of whether a data analysis is trustworthy.
package cacher using cacher as a reader. evaluates code written in files and stores intermediate results in a key-value database. it stores source file, cached data objects, metadata. checkcode()
function to see if it starts from scratch.
library(cacher) ## This package is not availble this R version 3.2.3
Air pollution, pm10, chemicals such as nickel in the air. The question is that whether Nickel make PM toxic?
A case study in Class 4(Exploratory data analysis)
we’ll apply some of the techniques we learned in this course to study air pollution data, specifically particulate matter (we’ll call it pm25 sometimes).Our goal is to see if there’s been a noticeable decline in this type of air pollution between these two years.
## pm0, pm1 are the datasets for those two years
head(pm0)
cnames <- names(dat)
cnames <- strsplit(cnames, "|", fixed = TRUE)
names(pm0) <- make.names(cnames[[1]][wcol])
names(pm1) <- make.names(cnames[[1]][wcol])
x0 <- pm0$Sample.Value; str(x0); mean(is.na(x0));summary(x0)
x1 <- pm1$Sample.Value
boxplot(x0,x1)
boxplot(log10(x0), log10(x1))
negative <- x1 < 0; sum(negative, na.rm = TRUE); mean(negative, na.rm = TRUE)
dates <- pm1$Date; str(dates)
dates <- as.Date(as.character(dates), "%Y%m%d")
## Time series dealing methods
hist(dates[negative], "month")
both <- intersect(site0, site1)
cnt0 <- subset(pm0, State.Code == 36 & county.site %in% both) ##The code for New York
sapply(split(cnt0,cnt0$county.site),nrow)##how many measurements each monitor recorded.
sapply(split(cnt1, cnt1$county.site), nrow)
## which monitor is the only one whose number of measurements increased from 1999 to 2012?
## Monitor with ID 63.2008
pm0sub <- subset(cnt0, County.Code==63 & Site.ID==2008);x0sub <- pm0sub$Sample.Value
pm1sub <- subset(cnt1, County.Code==63 & Site.ID==2008);x1sub <- pm1sub$Sample.Value
dates0 <- as.Date(as.character(pm0sub$Date),"%Y%m%d")
dates1 <- as.Date(as.character(pm1sub$Date),"%Y%m%d")
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(dates0, x0sub, pch = 20)
plot(dates1, x1sub, pch = 20)
abline(h = median(x1sub, na.rm = TRUE),lwd=2) ##The 1999 plot shows a bigger range of y values than the 2012 plot
rng <- range(x0sub,x1sub,na.rm=TRUE) ## Then we replot the two plots
mn0 <- with(pm0,tapply(Sample.Value,State.Code,mean,na.rm=TRUE))
mn1 <- with(pm1,tapply(Sample.Value,State.Code,mean,na.rm=TRUE))
d0 <- data.frame(state = names(mn0), mean = mn0)
d1 <- data.frame(state = names(mn1), mean = mn1)
mrg <- merge(d0, d1, by = "state");head(mrg)
mrg[mrg$mean.x < mrg$mean.y,]
## Plots with comparison
with(mrg, plot(rep(1, 52), mrg[, 2], xlim = c(.5, 2.5)))
with(mrg, points(rep(2, 52), mrg[, 3]))
segments(rep(1, 52), mrg[, 2], rep(2, 52), mrg[, 3])
case: throughput biology
A class by Prof. …

)library(knitr)
{r labelname, echo = FALSE}
. But how to count chunks generalized figures?how to see all the chunk options in package knitr?
library(knitr); ?opts_chunk