October 13, 2016

About me


  • R enthusiast

  • More than 4 years of R experience

  • 2 packages on CRAN (and GitHub)

  • Web applications with Shiny

  • Reproducible research with R Markdown

  • Co-organizer of the Stockholm R useR group (SRUG)



Why R

RStudio

  • Integrated Development Environment (IDE) for R

  • Syntax highlighting, code completion, and smart indentation

  • Easily manage multiple working directories using projects

  • Workspace browser and data viewer

  • Plot history, zooming, and flexible image and PDF export

  • Integrated R help and documentation

Learning outcomes

  • Get familiar with R (in RStudio)

  • Read different data formats into R

  • Manipulate and manage data

  • Explore and obtain summary statistics

  • Produce common, useful graphs

R for Data Science

tidyverse R package

The tidyverse is a set of packages that work in harmony because they share common data representations and API design.

It includes:

  • ggplot2, for data visualisation
  • dplyr, for data manipulation
  • tidyr, for data tidying
  • readr, for data import
  • purrr, for functional programming
  • tibble, for tibbles, a modern re-imagining of data frames
  • and many more …
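
As a small illustration of the tibble entry in the list above, the sketch below builds a tibble by hand and converts a base data frame into one. The `runners` values are invented for illustration and are not taken from the marathon dataset; `mtcars` ships with base R. Running `library(tidyverse)` would attach the core packages in one call; only tibble is loaded here to keep the example small.

```r
library(tibble)

# Toy values invented for illustration (not the marathon data)
runners <- tibble(
  id      = 1:3,
  runtime = c(180, 240, 210)  # assumed to be minutes
)
print(runners)

# A base R data frame can be converted to a tibble
cars_tbl <- as_tibble(mtcars)
print(class(cars_tbl))
```

Printing a tibble shows only the first rows and the column types, which is one of the practical differences from a base data frame.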

Motivating example

Dataset

marathon.Rdata

load(url("http://alecri.github.io/downloads/data/marathon.Rdata"))

Reference (article)

"Hyponatremia among Runners in the Boston Marathon", New England Journal of Medicine, 2005, Volume 352:1550-1556.

Descriptive abstract

Hyponatremia has emerged as an important cause of race-related death and life-threatening illness among marathon runners. We studied a cohort of marathon runners to estimate the incidence of hyponatremia and to identify the principal risk factors.

Data transformation

dplyr: grammar of data manipulation

https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html


5 functions (verbs) to manipulate data

  • Pick observations by their values (filter())
  • Reorder the rows (arrange())
  • Pick variables by their names (select())
  • Create new variables with functions of existing variables (mutate())
  • Collapse many values down to a single summary (summarise())
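
The five verbs above can be sketched one at a time on a small data frame. The `runners` data and its column names are hypothetical, made up for this illustration rather than taken from marathon.Rdata.

```r
library(dplyr)

# Toy data invented for illustration (not the marathon dataset)
runners <- data.frame(
  id      = 1:6,
  sex     = c("female", "male", "female", "male", "female", "male"),
  bmi     = c(21.5, 24.0, 31.2, 27.8, 19.9, 30.5),
  runtime = c(215, 190, 260, 240, 205, 275)  # assumed to be minutes
)

fast     <- filter(runners, runtime < 220)         # keep rows by value
by_time  <- arrange(runners, runtime)              # reorder rows (ascending)
slim     <- select(runners, id, bmi)               # keep columns by name
with_hrs <- mutate(runners, timeh = runtime / 60)  # add a derived column
overall  <- summarise(runners, mean_bmi = mean(bmi), n = n())  # collapse to one row

print(overall)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to chain.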

The %>% (pipe) operator

# load dplyr for the data manipulation verbs
library(dplyr)

# create a variable timeh (time in hours)
marathon = mutate(marathon, timeh = runtime/60)
# select only a few variables
marathon_sub = select(marathon, id, female, age, na, bmi, timeh)
# keep only females with bmi > 30
female_30 = filter(marathon_sub, female == "female", bmi > 30)
# sort by descending na levels, then by timeh
arrange(female_30, desc(na), timeh)

Equivalent to

marathon %>% 
   mutate(timeh = runtime/60) %>%
   select(id, female, age, na, bmi, timeh) %>%
   filter(female == "female", bmi > 30) %>%
   arrange(desc(na), timeh)
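
To check that the piped form really is the same computation, the sketch below runs a nested call and its piped equivalent on a small data frame and compares the results. The `runners` data is hypothetical and stands in for the marathon dataset so the example runs without a download.

```r
library(dplyr)

# Toy data invented for illustration (not the marathon dataset)
runners <- data.frame(
  id      = 1:6,
  bmi     = c(21.5, 24.0, 31.2, 27.8, 19.9, 30.5),
  runtime = c(215, 190, 260, 240, 205, 275)  # assumed to be minutes
)

# Nested calls: read from the inside out
nested <- arrange(
  filter(mutate(runners, timeh = runtime / 60), bmi > 25),
  desc(timeh)
)

# Piped calls: read top to bottom; x %>% f(y) is evaluated as f(x, y)
piped <- runners %>%
  mutate(timeh = runtime / 60) %>%
  filter(bmi > 25) %>%
  arrange(desc(timeh))

print(all.equal(nested, piped))
```

The pipe passes the left-hand side as the first argument of the next call, so no intermediate objects need to be named.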

Data visualization

ggplot2: grammar of graphics

http://www.cookbook-r.com/Graphs/

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

  1. The first argument is the dataset to use in the plot

  2. The geom_functions add different layers to the plot

  3. Each geom_function takes a mapping argument (paired with aes())
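
Filling in the template with concrete values may help. The sketch below uses the `mtcars` dataset that ships with R, chosen here only so the example runs without a download; the variable choices (`wt`, `mpg`, `cyl`) are illustrative assumptions, not part of the original material.

```r
library(ggplot2)

# <DATA> = mtcars; each <GEOM_FUNCTION> adds one layer with its own mapping
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_smooth(mapping = aes(x = wt, y = mpg), method = "lm", se = FALSE)

# Printing the object draws the plot
print(p)
```

Mappings placed inside `ggplot()` itself are inherited by every layer, which is how the marathon example avoids repeating `x` and `y` in each geom.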

# an example
ggplot(data = marathon, aes(x = wtdiff, y = na, color = female)) +
   geom_point(aes(size = runtime), shape = 18) + 
   geom_smooth(method = "lm", se = FALSE)

ggplot behind the scenes

Univariate geom_functions