Reproducible Research Using R and RStudio

문 건 웅
2014-6-13

Contents

  • Replication? Reproducible Research?
  • Reproducible research using RStudio
  • Quiz


Replication

  • The confirmation of results and conclusions from one study, obtained independently in another

  • Is considered the scientific gold standard

Replication

  • Replication of findings and conducting studies with independent

    • Investigators
    • Data
    • Analytic Methods
    • Laboratories
    • Instruments
  • Replication is particularly important in studies that can impact broad policy or regulatory decisions

What’s Wrong with Replication?

  • Some studies cannot be replicated

    • No time, opportunistic: long-term studies
    • No money: large-scale randomized controlled trials
    • Unique
  • What about studies that are neither large-scale nor long-term?

    • Is replication possible?


Repeatability of published microarray gene expression analyses

  • Selected articles published in Nature Genetics between January 2005 and December 2006 that had used profiling with microarrays

  • Of the 56 items retrieved electronically, 20 articles were considered potentially eligible for the project

  • The four teams were from

    • University of Alabama at Birmingham(UAB)
    • Stanford/Dana-Farber(SD)
    • London(L)
    • Ioannina/Trento(IT)
  • Each team consisted of 3-6 scientists who worked together to evaluate each article.

Results

  • Result could be reproduced : n=2

  • Reproduced with discrepancy : n=6

  • Could not be reproduced : n=10

    • No data n=4 (no data n=2, subset n=1, no reporter data n=1)
    • Confusion over matching of data to analysis (n=2)
    • Specialized software required and not available (n=1)
    • Raw data available but could not be processed (n=2)


How Can We Bridge the Gap


Reproducible Research

  • The data and the computer code used to analyze them are made available to others
  • An attainable minimum standard of reproducibility
  • Fills the gap between full replication of a study and no replication at all

Why Do We Need Reproducible Research?

  • New technologies increasing data collection throughput; data are more complex and extremely high dimensional
  • Existing databases can be merged into new “megadatabases”
  • Computing power is greatly increased, allowing more sophisticated analyses
  • For every field “X” there is a field “Computational X”

Example: Reproducible Air Pollution and Health Research

  • Estimating small (but important) health effects in the presence of much stronger signals
  • Results inform substantial policy decisions, affect many stakeholders
    • EPA regulations can cost billions of dollars
  • Complex statistical methods are needed and subjected to intense scrutiny

Internet-based Health and Air-Pollution Surveillance System(iHAPSS)


Research Pipeline


Recent Developments in Reproducible Research


Omics?

  • genomics
  • transcriptomics
  • proteomics
  • metabolomics

The IOM Report

In the Discovery/Test Validation stage of omics-based tests:

  • Data/metadata used to develop test should be made publicly available
  • The computer code and fully specified computational procedures used for development of the candidate omics-based test should be made sustainably available
  • “Ideally, the computer code that is released will encompass all of the steps of computational analysis, including all data preprocessing steps, that have been described in this chapter. All aspects of the analysis need to be transparently reported.”

What do We Need?

  • Analytic data are available
  • Analytic code is available
  • Documentation of code and data
  • Standard means of distribution

Who are the Players?

  • Authors
    • Want to make their research reproducible
    • Want tools for RR to make their lives easier (or at least not much harder)
  • Readers
    • Want to reproduce (and perhaps expand upon) interesting findings
    • Want tools for RR to make their lives easier

Challenges

  • Authors must undertake considerable effort to put data/results on the web (may not have resources like a web server)
  • Readers must download data/results individually and piece together which data go with which code sections, etc.
  • Readers may not have the same resources as authors
  • Few tools to help authors/readers (although toolbox is growing!)

In Reality...

  • Authors
    • Just put stuff on the web
    • (Infamous) Journal supplementary materials
    • There are some central databases for various fields (e.g. biology, ICPSR)
  • Readers
    • Just download the data and (try to) figure it out
    • Piece together the software and run it

Reproducible Research Using R and RStudio

Why R?

  • (Why should I use R?)

RStudio?

  • (What is this now? Never heard of it.)

Reproducible research using R and RStudio

  • (And what on earth is this?)

Why R ?

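Why should I use R?

  • I already know how to use SPSS.
  • SAS is out of my league... and R is hard too.
  • I don't want to learn something new. Not at my age...
  • I'll just keep doing what I've always done. It's easier that way. It's my call.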

The compareGroups package

library(compareGroups)
data(predimed)

Data from the PREDIMED study

  • 6324 observations
  • 15 variables

How long would it take to make this table?

compareGroups ... createTable

Compare every variable in the predimed data across groups

res <- compareGroups(group ~ ., data = predimed)

Build the table

createTable(res)

A More Exhaustive Table

Compare by group (leaving out sex and hormo)

res <- compareGroups(group ~ . - sex - hormo, data = predimed)

Make three separate tables: all patients, female patients only, and male patients only

alltab <- createTable(res)
femaletab <- createTable(update(res, subset = sex == "Female"))
maletab <- createTable(update(res, subset = sex == "Male"))

Combine the three tables.

cbind("ALL" = alltab, "FEMALE" = femaletab, "MALE" = maletab)

With R and RStudio

  • Minimal effort, maximal result

  • Nice plots


R Graphical Manual


Reproducible research

Literate Programming

Reproducible Research

When issues of reproducibility arise

  • “Remember that microarray analysis you did six months ago?
    We ran a few more arrays.
    Can you add them to the project and repeat the same analysis?”
  • “The statistical analyst who looked at the data I generated previously is no longer available.
    Can you get someone else to analyze my new data set using the same methods (and thus producing a report I can expect to understand)?”
  • “Please write/edit the methods sections for the abstract/paper/grant proposal I am submitting based on the analysis you did several months ago.”

Typical workflow of many research projects

  • First have an idea

  • e.g. does stopping distance correlate with speed?

1. Prepare data (Excel/Numbers)

2. Do some analysis (R/SPSS/SAS)

3. Write a report/paper (Word/Pages)

All results (figures, tables) are manually imported into Word
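
As a concrete instance of step 2, the stopping-distance question can be answered in a few lines of R. This is only a minimal sketch: it assumes the built-in `cars` dataset (the slides do not say which data were used), and the object names `r` and `fit` are illustrative.

```r
# Assumption: the built-in `cars` data (speed in mph, stopping distance in ft)
data(cars)

# Strength of the linear association between speed and stopping distance
r <- cor(cars$speed, cars$dist)

# Simple linear regression of stopping distance on speed
fit <- lm(dist ~ speed, data = cars)

print(round(r, 2))
print(round(coef(fit), 2))
```

In the workflow described here, each of these numbers would then be copied into Word by hand, which is where the link between result and report is lost.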

This workflow is BROKEN

1. Collect and manage data (Excel)

2. Analysis (R/SPSS)

3. Write-up (Word)

Problems brought by the broken workflow


  • What analysis is behind this figure? Did you account for [ooo] in the analysis?
  • What dataset was used (e.g. final vs preliminary dataset)?
  • Oops, there is an error in the data. Can you repeat the analysis? And update figures/tables in Word!
  • As a coauthor/reader, I'd like to see the whole research process (how you arrived at that conclusion), rather than a cooked manuscript with inserted tables/figures.

RStudio allows us to fix the disconnect

Integrating

  • 1. Data management
  • 2. Data analysis
  • 3. Writing up results

in a single dynamic document

Reproducible research !!

Let's make our project reproducible


  • In literate programming, an analytical document is composed of a descriptive narrative “woven” together with software code and computed results.

  • Advantages

    • A single document both describes and performs the analysis
    • Enforces reproducibility
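
A document that "both describes and performs the analysis" might look like the sketch below. It writes a tiny R Markdown file and knits it with `knitr::knit()` when the knitr package is installed. The file name `report.Rmd`, the narrative text, and the use of the built-in `cars` data are assumptions made for illustration, not the author's actual project.

```r
fence <- strrep("`", 3)   # chunk delimiter, built here to keep the markers explicit

# Narrative text, one R code chunk, and one inline R expression
rmd <- c(
  "# Stopping distance and speed",
  "",
  "We ask whether stopping distance grows with speed.",
  "",
  paste0(fence, "{r fit}"),
  "fit <- lm(dist ~ speed, data = cars)",
  "coef(fit)",
  fence,
  "",
  "The estimated slope is `r round(coef(fit)[[2]], 2)` ft per mph."
)

rmd_file <- file.path(tempdir(), "report.Rmd")   # hypothetical file name
writeLines(rmd, rmd_file)

# Knit only if knitr is available; the result is a Markdown file in which
# the chunk output and the inline value have been filled in.
md_lines <- NULL
if (requireNamespace("knitr", quietly = TRUE)) {
  md_file <- knitr::knit(rmd_file,
                         output = file.path(tempdir(), "report.md"),
                         quiet = TRUE,
                         envir = new.env())
  md_lines <- readLines(md_file)
}
```

Because the slope in the last sentence is computed when the document is knitted, the text cannot drift out of sync with the analysis.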

Conversion to Word or PDF with Pandoc
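
Outside the RStudio button, one way to drive this conversion from code is `rmarkdown::render()`, which calls Pandoc behind the scenes. The sketch below is hedged: it assumes the rmarkdown package and a Pandoc installation are present (it does nothing otherwise), and the file name `convert-demo.Rmd` is made up for the example.

```r
# A throwaway R Markdown file (hypothetical name and content)
rmd_file <- file.path(tempdir(), "convert-demo.Rmd")
writeLines(c(
  "---",
  "title: \"Conversion demo\"",
  "---",
  "",
  "The `cars` data have `r nrow(cars)` rows."
), rmd_file)

# Render to Word only when both rmarkdown and Pandoc are available
docx <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  docx <- rmarkdown::render(rmd_file,
                            output_format = "word_document",
                            output_dir = tempdir(),
                            quiet = TRUE)
}
```

Switching `output_format` to `"pdf_document"` targets PDF instead, but that route additionally needs a LaTeX installation.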


If error

  • If you spot an error in the data, or switch to a different dataset…
  • make the change in the R Markdown file and re-knit; the report updates automatically

So... Main Advantages


  • Data management fully documented (no more manual changes in Excel!)
  • Analysis fully documented
  • Automated reports
  • can publish via RPubs (rpubs.com): http://www.rpubs.com/cardiomoon/19541
  • can share the project
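
To illustrate the first advantage, a data correction can live in code instead of being typed into a spreadsheet. The sketch below is hypothetical throughout: it plants one typo in a copy of the built-in `cars` data, and the cleaning rule (distances above 150 ft are entry errors) and the helper names `clean_data` and `analyse` are assumptions made for the example.

```r
# Hypothetical raw data: the built-in cars data with one typo planted
raw <- cars
raw$dist[3] <- 400   # pretend a distance of 4 was keyed in as 400

# Scripted, documented cleaning rule (an assumption for this sketch):
# distances above max_dist are treated as entry errors and set to missing
clean_data <- function(d, max_dist = 150) {
  d$dist[d$dist > max_dist] <- NA
  d
}

# The analysis is a function of the data, so it can be re-run on any version
analyse <- function(d) coef(lm(dist ~ speed, data = d))

before <- analyse(raw)
after  <- analyse(clean_data(raw))
print(round(rbind(before, after), 2))
```

The rule that removed the bad value stays in the script, so a coauthor can see exactly what changed and why.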

Literate (Statistical) Programming

  • An article is a stream of text and code
  • Analysis code is divided into text and code “chunks”
  • Each code chunk loads data and computes results
  • Presentation code formats results (tables, figures, etc.)
  • Article text explains what is going on
  • Literate programs can be weaved to produce human-readable documents and tangled to produce machine-readable documents

Literate (Statistical) Programming

  • Literate programming is a general concept that requires
    1. A documentation language (human readable)
    2. A programming language (machine readable)
  • Sweave uses LaTeX and R as the documentation and programming languages
  • Sweave was developed by Friedrich Leisch (a member of R Core) and is maintained by R Core
  • Main web site: http://www.statistik.lmu.de/~leisch/Sweave
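
The weave and tangle steps can be tried with `Sweave()` and `Stangle()`, both of which ship with base R in the utils package. The sketch below is a minimal illustration: the file names (`demo.Rnw`, `demo.tex`, `demo.R`), the document text, and the use of the built-in `cars` data are assumptions for the example. Compiling the resulting `.tex` to PDF would need a LaTeX installation, which the sketch does not attempt.

```r
# A tiny Sweave source: LaTeX text, one R chunk, and one inline \Sexpr
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Does stopping distance grow with speed?",
  "<<fit>>=",
  "fit <- lm(dist ~ speed, data = cars)",
  "coef(fit)",
  "@",
  "The estimated slope is \\Sexpr{round(coef(fit)[[2]], 2)}.",
  "\\end{document}"
)

old_wd <- setwd(tempdir())   # keep generated files out of the project folder
writeLines(rnw, "demo.Rnw")

# Weave: run the chunk and write a human-readable LaTeX document
Sweave("demo.Rnw", output = "demo.tex", quiet = TRUE)

# Tangle: extract only the code into a machine-readable R script
Stangle("demo.Rnw", output = "demo.R", quiet = TRUE)

woven   <- readLines("demo.tex")
tangled <- readLines("demo.R")
setwd(old_wd)
```

The woven file contains the narrative with the computed slope filled in, while the tangled file contains only the R code from the chunk.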

Sweave Limitations

  • Sweave has many limitations
  • Focused primarily on LaTeX, a difficult to learn markup language used only by weirdos
  • Lacks features like caching, multiple plots per chunk, mixing programming languages and many other technical items
  • Not frequently updated or very actively developed

Literate (Statistical) Programming

  • knitr is an alternative (more recent) package
  • Brings together many features added on to Sweave to address limitations
  • knitr uses R as the programming language (although others are allowed) and a variety of documentation languages
    • LaTeX, Markdown, HTML
  • knitr was developed by Yihui Xie (while a graduate student in statistics at Iowa State)
  • See http://yihui.name/knitr/

Summary

  • Reproducible research is important as a minimum standard, particularly for studies that are difficult to replicate
  • Infrastructure is needed for creating and distributing reproducible documents, beyond what is currently available
  • There is a growing number of tools for creating reproducible documents

Quiz