Reproducible Research Using R and RStudio

Keon-Woong Moon
2014-4-25

Contents

Why R ?

  • (Why should I use R?)

RStudio ?

  • (And what is this? Never heard of it.)

Reproducible research using R and RStudio

  • (And what on earth is this?)

Why R ?

Why should I use R?

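  • I already know how to use SPSS.
  • SAS is out of my league... and R is hard too.
  • I don't want to learn something new. Not at my age...
  • I'll just keep doing what I've always done. It's easier that way. It's my choice.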

The compareGroups package

library(compareGroups)
data(predimed)

Data from the PREDIMED study

  • 6324 observations
  • 15 variables

How long would it take to build this table?

[Image: the table in question]

compareGroups ... createTable

Compare every variable in the predimed data across groups:

res=compareGroups(group ~ . , data=predimed)

Build the table:

createTable(res)

More Exhausting Table

Compare across groups, leaving out sex and hormo:

res = compareGroups(group ~ .-sex-hormo, data=predimed)

Build three separate tables: all patients, female patients only, and male patients only:

alltab=createTable(res)
femaletab=createTable(update(res,subset=sex=="Female"))
maletab=createTable(update(res,subset=sex=="Male"))

Combine the three tables:

cbind("ALL"=alltab,"FEMALE"=femaletab,"MALE"=maletab)

With R and RStudio

  • Minimal effort, maximal result

  • Nice plots

Reproducible research

Literate Programming

Reproducible Research

When issues of reproducibility arise

  • “Remember that microarray analysis you did six months ago?
    We ran a few more arrays.
    Can you add them to the project and repeat the same analysis?”

  • “The statistical analyst who looked at the data I generated previously is no longer available.
    Can you get someone else to analyze my new data set using the same methods (and thus producing a report I can expect to understand)?”

  • “Please write/edit the methods sections for the abstract/paper/grant proposal I am submitting based on the analysis you did several months ago.”

Repeatability of published microarray gene expression analyses

  • Selected articles published in Nature Genetics between January 2005 and December 2006 that had used profiling with microarrays

  • Of the 56 items retrieved electronically, 20 articles were considered potentially eligible for the project

  • The four teams were from

    • University of Alabama at Birmingham (UAB)
    • Stanford/Dana-Farber (SD)
    • London (L)
    • Ioannina/Trento (IT)
  • Each team was comprised of 3-6 scientists who worked together to evaluate each article.

Results

  • Results could be reproduced: n=2

  • Reproduced with discrepancies: n=6

  • Could not be reproduced: n=10

    • No data: n=4 (no data n=2, subset n=1, no reporter data n=1)
    • Confusion over matching of data to analysis: n=2
    • Specialized software required and not available: n=1
    • Raw data available but could not be processed: n=2

Typical workflow of many research projects

  • First have an idea

  • e.g. does stopping distance correlate with speed?

1. Prepare data (Excel/Numbers)

2. Do some analysis (R/SPSS/SAS)
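
As a minimal sketch, this analysis step might look as follows in R. The slides do not name the data, so the built-in `cars` dataset (speed in mph, stopping distance in ft) is assumed here as a stand-in:

```r
# Assumption: the built-in `cars` dataset stands in for the prepared data.
data(cars)

# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)

# Simple linear regression of stopping distance on speed
fit <- lm(dist ~ speed, data = cars)
slope <- coef(fit)[["speed"]]

print(round(r, 2))
print(summary(fit))
```

The point is less the particular model than that the whole analysis is a script that can be re-run unchanged.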

3. Write a report/paper (Word/Pages)

All results (figures, tables) are manually imported into Word

This workflow is BROKEN

1. Collect and manage data (Excel)

2. Analysis (R/SPSS)

3. Write-up (Word)

Problems brought by the broken workflow

  • What analysis is behind this figure? Did you account for [ooo] in the analysis?
  • What dataset was used (e.g. final vs preliminary dataset)?
  • Oops, there is an error in the data. Can you repeat the analysis? And update figures/tables in Word!
  • As a coauthor/reader, I'd like to see the whole research process (how you arrived at that conclusion), rather than a cooked manuscript with inserted tables/figures.

RStudio allows us to fix the disconnect

Integrating

  • 1. Data management
  • 2. Data analysis
  • 3. Writing up results

in a single dynamic document

Reproducible research !!

Let's make our project reproducible

  • In literate programming, an analytical document is composed of a descriptive narrative “woven” together with software code and computed results.

  • Advantages

    • A single document both describes and performs the analysis
    • Enforces reproducibility
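
To make the idea concrete, here is a minimal sketch of such a document, written out from R so it can be run as-is. The file name, title, and narrative text are illustrative, and the built-in `cars` data is an assumption rather than something the slides specify. Rendering is attempted only if the rmarkdown package and Pandoc are available:

```r
# Build a tiny R Markdown file: narrative and code chunks live in one document.
# File name and contents are illustrative, not taken from the slides.
fence <- strrep("`", 3)  # three backticks that open and close a code chunk

rmd_lines <- c(
  "---",
  "title: \"Stopping distance and speed\"",
  "output: word_document",
  "---",
  "",
  "We ask whether stopping distance correlates with speed.",
  "",
  paste0(fence, "{r}"),
  "fit <- lm(dist ~ speed, data = cars)",
  "summary(fit)",
  "plot(dist ~ speed, data = cars)",
  "abline(fit)",
  fence
)

rmd_file <- file.path(tempdir(), "stopping_distance.Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering needs the rmarkdown package and Pandoc; skip if either is missing.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  message("Report written to ", out_file)
}
```

In RStudio the same file would normally be edited directly and rendered with the Knit button, so the code above is only a scripted equivalent.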

Conversion to Word or PDF with Pandoc

If there is an error

  • If you spot an error in the data, or switch to a different dataset…

  • make the changes in R Markdown and the report will update automatically
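
One way to picture the automatic update, as a sketch rather than the author's actual workflow: every number in the narrative is computed from the data, so correcting the data and re-running changes the text. The function name `report_sentence` and the "corrected" value are hypothetical, and the built-in `cars` data is assumed as a stand-in:

```r
# Hypothetical helper: build one sentence of the report from the data itself.
report_sentence <- function(data) {
  fit <- lm(dist ~ speed, data = data)
  sprintf(
    "Based on %d observations, stopping distance rises by %.2f ft per mph (r = %.2f).",
    nrow(data),
    coef(fit)[["speed"]],
    cor(data$speed, data$dist)
  )
}

# Sentence produced from the data as first received
before <- report_sentence(cars)

# Pretend the 49th stopping distance was mistyped and has now been corrected.
# The value 100 is made up purely for illustration.
corrected <- cars
corrected$dist[49] <- 100
after <- report_sentence(corrected)

print(before)
print(after)
```

In an R Markdown document the same effect comes from inline R expressions and code chunks, which are re-evaluated each time the document is rendered.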

So... Main Advantages

  • Data management fully documented (no more manual changes in Excel!)
  • Analysis fully documented
  • Automated reports
  • Can publish via RPubs: http://www.rpubs.com/cardiomoon/16255
  • Can share the project