To R or not to R

Experience from Using R in a Medical School

Kamarul Imran Musa
Associate Professor in Epidemiology and Statistics (MD, M Community Med, PhD)

What do I use

For this presentation, I use:

  1. R
  2. RStudio
    • Slidify
    • markdown

R

RStudio

For slides

Slidify

Objectives

The objectives of the presentation are:

  1. to provide an overview of data analysis software in medicine
  2. to describe common software used at medical schools
  3. to describe the syllabus that we teach
  4. to describe our analytics pedagogy
  5. to describe our reproducibility initiative
  6. to reflect on our experience teaching statistics

Overview of data analysis software in medicine

Software used at USM medical school

  • We generally use IBM SPSS.
  • In addition to SPSS, we use Stata:
    • for more mathematically inclined students
    • especially for post-graduate students doing:
      1. Master of Science (Medical Statistics)
      2. Doctor of Public Health (DrPH)

What do we teach

  • For these post-graduate students, we teach:
  1. General linear models
  2. Generalized linear models
  3. Structural Equation Modeling
  • Independent and dependent t-tests, ANOVA, chi-square tests and so on are also taught, but in the introductory-level classes

Migrating to R

  • R is a freely available language and environment for statistical computing and graphics.

  • It has over 10200 packages to cater for the needs of statisticians, data scientists, epidemiologists, ecologists, econometricians and many more people.

  • For more info about R, check here https://cran.r-project.org/

  • Introduced R at USM Medical School in late 2013

  • We formally integrated R into our academic syllabus in 2015

RStudio to enhance R learning

  • The plain R interface is intimidating
  • So, we use the RStudio IDE https://www.rstudio.com/
  • Other popular options include Microsoft R Open
  • Previously known as Revolution R
  • In the class, we teach applied statistics. Students need to know how:
  1. to read data
  2. to perform statistical analysis and
  3. to produce reports (we recommend Rmarkdown)

The aesthetic difference

R and RStudio

This is R

RStudio

What comes with R

  • Headache!
    • Students need to analyze data with code
    • They are more used to 'point-and-click' software
  • Steeper learning curve
  • Limited resources for applied statistics (in my opinion)
    • Resources on Stata and SPSS are more organized
    • As a result, students need to google a lot
  • But versatile
    • Data can be approached in many ways. There is no more ONLY THIS WAY.
  • The focus is on the code
    • This makes students more aware of what they want to do and
    • what they want to get from data analysis

Packages

  • R has its base and user-contributed packages
  • Unfortunately, we do not learn how to program at all
  • We act more like analysts than programmers
  • We use R packages to help us, rather than writing code to create packages
  • These packages help us to:
  1. Read statistical data
  2. Perform data management
  3. Do exploratory data analysis
  4. Do inferential data analysis
  5. Produce reproducible reports

Data that we deal with

  • Data in medicine and health:
    • Excel .xlsx and .csv
    • SPSS .sav
    • Stata .dta
    • SAS (not common in Malaysia)
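library(foreign)
# Stata file; convert.factors = TRUE turns value labels into factors
dat1 <- read.dta('abc.dta', convert.factors = TRUE)
# SPSS file; return a data frame and use value labels as factor levels
dat2 <- read.spss('abc.sav', to.data.frame = TRUE, use.value.labels = TRUE)

# haven is an alternative that reads the same formats into tibbles
library(haven)
dat1 <- read_dta('abc.dta')
dat2 <- read_spss('abc.sav')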
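
Data exported from Excel as .csv can be read with base R alone. The following is a minimal sketch, not part of the original slides: the file and its columns (id, sex, sbp) are made up for illustration and written to a temporary file so that the example runs on its own.

```r
# Hypothetical data: three patients with sex and systolic blood pressure
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,sbp",
             "1,male,120",
             "2,female,135",
             "3,female,NA"),
           csv_path)

# Read the file; stringsAsFactors = TRUE turns text columns into factors
dat3 <- read.csv(csv_path, stringsAsFactors = TRUE)

str(dat3)             # check variable types
colSums(is.na(dat3))  # count missing values per variable
```

For .xlsx files, readxl::read_excel() is one commonly used option.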

Data management

  • It is very important to make sure variables and observations can be analyzed properly
  • So we need to prepare the data
  • Perhaps 70% of our time is spent on data cleaning and data management
  • We use packages like:
  1. psych
  2. dplyr
  • The psych package provides many useful data summaries
  • dplyr aims to provide a function for each basic verb of data manipulation

Pipelines for data analysis

As proposed by Hadley Wickham

dplyr

  • dplyr provides verbs for data management
  • For example, functions like:
    • filter, then select, then mutate
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Keep cars with disp < 200, keep three columns, then add a derived column
mtcars2 <- mtcars %>%
  filter(disp < 200) %>%
  select(mpg, cyl, disp) %>%
  mutate(cyl_mpg = cyl + mpg)
  • slice(), arrange()
  • select(), rename(), distinct()
  • mutate(), transmute()
  • summarise(), sample_n(), sample_frac()
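
As an illustration of how these verbs combine, here is a short sketch that summarises mtcars by number of cylinders. It assumes dplyr is installed; group_by() is not in the list above but is part of dplyr, and the object name mpg_by_cyl is only illustrative.

```r
library(dplyr)

# Number of cars and mean mpg per cylinder group, sorted by mean mpg
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))

mpg_by_cyl
```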

dplyr makes use of pipe %>%
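
The pipe passes the result on its left as the first argument of the function on its right. A small sketch, assuming dplyr is installed, compares nested calls with the piped form; the object names nested and piped are only illustrative.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- select(filter(mtcars, disp < 200), mpg, cyl, disp)

# Piped calls: read from top to bottom
piped <- mtcars %>%
  filter(disp < 200) %>%
  select(mpg, cyl, disp)

# Both forms should give the same data frame
identical(nested, piped)
```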

Exploratory data analysis

  • We use the base function graphics::plot for plotting
  • But we recommend that our students use the ggplot2 package

graphics::plot

plot(mtcars$mpg, mtcars$hp, main = 'my mpg vs hp')

[Plot: hp against mpg, titled 'my mpg vs hp']

ggplot2

  • ggplot2 implements the grammar of graphics
  • It is intuitive but requires more coding
library(ggplot2)
ggplot(mtcars, aes(x= mpg, y= hp)) +
    geom_point(shape=1) +    
    geom_smooth()            
## `geom_smooth()` using method = 'loess'

[Plot: hp against mpg with a loess smooth]

Inferential Statistics

Depends on the syllabus:

  1. survival analysis: time-to-event data
  2. logistic regression: categorical data analysis
  3. SEM: multivariate data
library(survival)
mod_sur <- coxph(Surv(time, status) ~ age , data = cancer)
summary(mod_sur)

mod_sur_para <- survreg(Surv(time, status) ~ age , data = cancer, dist = 'weibull')
summary(mod_sur_para)
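
In class, the Cox model output is usually read as a hazard ratio. The sketch below is an illustration rather than part of the original slides: it refits the same Cox model, exponentiates the coefficient, and checks the proportional hazards assumption with cox.zph(). The object name hr is only illustrative.

```r
library(survival)

# Same Cox model as above, refitted so the example is self-contained
mod_sur <- coxph(Surv(time, status) ~ age, data = cancer)

# Hazard ratio per one-year increase in age, with 95% confidence interval
hr <- exp(cbind(HR = coef(mod_sur), confint(mod_sur)))
hr

# Check the proportional hazards assumption using scaled Schoenfeld residuals
cox.zph(mod_sur)
```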

General linear model

General and generalized linear model

mod1 <- lm(mpg ~ disp, data = mtcars)
mod2 <- glm(vs ~ wt, data = mtcars, 
            family = binomial(link = 'logit'))
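
Logistic regression coefficients are on the log-odds scale, so they are usually exponentiated for reporting. A minimal sketch using only base R; the object name or_wt and the chosen weight of 3 (3000 lb) are illustrative assumptions.

```r
# Logistic regression from above, refitted so the example is self-contained
mod2 <- glm(vs ~ wt, data = mtcars,
            family = binomial(link = 'logit'))

# Odds ratio for a one-unit (1000 lb) increase in weight
or_wt <- exp(coef(mod2))["wt"]
or_wt

# Wald 95% confidence intervals on the odds ratio scale
exp(confint.default(mod2))

# Predicted probability of vs = 1 for a car weighing 3000 lb
predict(mod2, newdata = data.frame(wt = 3), type = "response")
```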

More on generalized linear model:

  • count outcome data, e.g. Poisson regression
  • multinomial outcome data, e.g. multinomial logistic regression
  • ordinal outcome data, e.g. proportional odds model
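
Two of the outcome types listed above can be sketched with data sets that ship with R. This is an illustration under stated assumptions, not the course material: warpbreaks and MASS::housing are standard example data, and the object names mod_pois and mod_ord are made up here.

```r
# Count outcome: Poisson regression on the built-in warpbreaks data
mod_pois <- glm(breaks ~ wool + tension, data = warpbreaks,
                family = poisson(link = 'log'))
exp(coef(mod_pois))  # rate ratios

# Ordinal outcome: proportional odds model on the MASS housing data.
# MASS:: is used instead of library(MASS) to avoid masking dplyr::select.
mod_ord <- MASS::polr(Sat ~ Infl + Type + Cont, weights = Freq,
                      data = MASS::housing, Hess = TRUE)
summary(mod_ord)
```

For multinomial outcomes, nnet::multinom() is one commonly used option.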

Our approaches in teaching

  1. Reassurance. We tell our students: it is OK, keep going.
  2. Counter the myth that point-and-click software is useful in the long run
  3. Students need to:
    • understand how R works
    • gain the skill to write and understand code
    • identify useful packages
    • read and understand error messages
  4. Sharing

Presenting results

With SPSS and Stata, most users need to copy and paste their results.

Risks of doing that:

  1. errors in results after transfer or copying
  2. formatting problems
  3. aesthetic issues

We advocate Reproducibility

Reproducibility

  • Reproducibility means the ability to regenerate results
  • It is an important issue
  • Results of study must be reproducible

The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the laboratory notebooks and full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based ...

https://en.wikipedia.org/wiki/Reproducibility

Reasons for non-reproducible results

How do we ensure reproducibility

  • To reproduce analyses and results, we use Markdown. RStudio has built-in R Markdown and Pandoc (which converts the document to MS Word, PDF and HTML)
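
To make this concrete, the sketch below writes a very small R Markdown file from R and renders it if the tools are available. It is only an illustration: the title and contents of the file are made up, and rendering assumes the rmarkdown package and Pandoc are installed.

```r
# Build a minimal R Markdown file line by line.
# The chunk fence is assembled with strrep() only to keep this listing readable.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  "title: 'Example analysis'",
  "output: html_document",
  "---",
  "",
  "Mean mpg in mtcars is `r round(mean(mtcars$mpg), 1)`.",
  "",
  paste0(fence, "{r}"),
  "summary(lm(mpg ~ disp, data = mtcars))",
  fence
)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Render only if rmarkdown and Pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_file)
}
```

Because the numbers in the report are computed when the file is rendered, nothing has to be copied and pasted by hand.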

Markdown

Markdown was made by John Gruber http://daringfireball.net/

"Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML)."

What I have learned

Some lessons for us:

  • Academicians must use R. It is coming BIG time
  • Always get feedback from students. Some are LOST
  • Then improve teaching and pedagogy based on that feedback
  • Encourage them to use R. There is no point teaching if students are not using it or do not understand it

What are our wishlists?

I wish:

  1. R integrates a drop-down menu for data management, or at least for tidyverse packages
  2. R provides better documentation for packages. We need working, applied examples, not just code
  3. R offers a data entry interface

Our activities at USM Health Campus

Some of our activities:

  1. Data science related activities
  2. BI related activities using Power BI (R can be integrated)
  3. Project sharing on R

We have RStudio Server Pro running at our medical school https://healthdata.usm.my/rstudio/auth-sign-in

Opportunity

The opportunity at USM, Health Campus:

  1. Post-graduate study in epidemiology
  2. Post-graduate study in statistics
  3. Post-graduate study in data science in health and medicine

You can contact me at drkamarul@usm.my or drki.musa@gmail.com

Conclusions

  1. R forces students to understand every step taken when analysing data
  2. R has a steep learning curve, especially for non-programmers
  3. R opens up possibilities to use the most up-to-date technology in data science and statistics
  4. Self-study and learning are very much required to improve skills in R

The race is on for data and statistical literacy in health and medicine. Is Malaysia in the driving seat?