Reproducible Research Using R and RStudio

Keon-Woong Moon


Why R?

  • (Why should I use R?)

RStudio?

  • (What is this now? Never heard of it.)

Reproducible research using R and RStudio

  • (And what on earth is this?)

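Why R?

Why should I use R?

  • I already know how to use SPSS.
  • SAS is way out of my league... and R is hard too.
  • I don't want to learn anything new. Not at my age...
  • I'll just keep doing what I've always done. It's easier that way. It's my call.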

The compareGroups package


Data from the PREDIMED study

  • 6324 observations
  • 15 variables
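
A minimal sketch for loading this data set and checking its size, assuming the compareGroups package (which bundles the predimed data) is already installed. The reported dimensions should match the figures on the slide.

```r
# Assumes install.packages("compareGroups") has been run beforehand
library(compareGroups)

# predimed ships with the compareGroups package
data(predimed)

dim(predimed)   # number of observations and number of variables
str(predimed)   # type and first few values of each variable
```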

How long would it take to make this table?

compareGroups ... createTable

Compare every variable in the predimed data across the groups:

res = compareGroups(group ~ ., data = predimed)

Then create the table.
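
The code for this step did not survive on the slide. A minimal sketch of what it might look like, assuming the compareGroups package is installed; the object name `restab` is illustrative and not from the original:

```r
library(compareGroups)
data(predimed)

# Describe every variable by intervention group
res <- compareGroups(group ~ ., data = predimed)

# Turn the comparison into a descriptive table and print it
restab <- createTable(res)
print(restab)
```

Two function calls are enough to get a full descriptive table, which is the point the slide is making.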


More exhausting table

More Exhausting Table

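Compare the variables by group, leaving out sex and hormo:

res = compareGroups(group ~ . - sex - hormo, data = predimed)

Make three separate tables: one for all patients, one for men only, and one for women only.

Combine the three tables.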
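
The code for the last two steps is missing from the slide. One way they could be sketched, assuming the compareGroups package is installed and that `sex` in predimed has the levels "Male" and "Female"; the object names (`alltab`, `maletab`, `femaletab`, `alltabs`) are illustrative, and using the `subset` argument of `compareGroups` is a choice made here, not necessarily the author's original code:

```r
library(compareGroups)
data(predimed)

# Table for all patients
alltab <- createTable(
  compareGroups(group ~ . - sex - hormo, data = predimed)
)

# Same comparison restricted to men only
maletab <- createTable(
  compareGroups(group ~ . - sex - hormo, data = predimed,
                subset = sex == "Male")
)

# Same comparison restricted to women only
femaletab <- createTable(
  compareGroups(group ~ . - sex - hormo, data = predimed,
                subset = sex == "Female")
)

# Place the three tables side by side under labelled column blocks
alltabs <- cbind("All" = alltab, "Male" = maletab, "Female" = femaletab)
print(alltabs)
```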


With R and RStudio

  • Minimal effort, maximal result

  • Nice plots

Reproducible research

Literate Programming

Reproducible Research

When issues of reproducibility arise

  • “Remember that microarray analysis you did six months ago?
    We ran a few more arrays.
    Can you add them to the project and repeat the same analysis?”

  • “The statistical analyst who looked at the data I generated previously is no longer available.
    Can you get someone else to analyze my new data set using the same methods (and thus producing a report I can expect to understand)?”

  • “Please write/edit the methods sections for the abstract/paper/grant proposal I am submitting based on the analysis you did several months ago.”

Repeatability of published microarray gene expression analyses

  • Selected articles published in Nature Genetics between January 2005 and December 2006 that had used profiling with microarrays

  • Of the 56 items retrieved electronically, 20 articles were considered potentially eligible for the project

  • The four teams were from:

    • University of Alabama at Birmingham (UAB)
    • Stanford/Dana-Farber (SD)
    • London (L)
    • Ioannina/Trento (IT)

  • Each team comprised 3-6 scientists who worked together to evaluate each article.


  • Results could be reproduced: n=2

  • Reproduced with discrepancies: n=6

  • Could not be reproduced: n=10

    • No data: n=4 (no data n=2, subset n=1, no reporter data n=1)
    • Confusion over matching of data to analysis: n=2
    • Specialized software required and not available: n=1
    • Raw data available but could not be processed: n=2