Get started with R

October 13, 2016

About me

R enthusiast
More than 4 years of R experience
2 packages on CRAN (and GitHub)
Web applications with Shiny
Reproducible research with R markdown
Co-organizer of the Stockholm R useR group (SRUG)

Why R

Free and open source
Leading software for statistics, data analysis, and machine learning (https://www.r-bloggers.com/r-passes-sas-in-scholarly-use-finally/)
Many packages available on CRAN
Support for reproducible research (rmarkdown), interactive analyses (shiny)

RStudio

Integrated Development Environment (IDE) for R
Syntax highlighting, code completion, and smart indentation
Easily manage multiple working directories using projects
Workspace browser and data viewer
Plot history, zooming, and flexible image and PDF export
Integrated R help and documentation

Learning outcomes

Get familiar with R (in RStudio)
Read different data formats into R
Manipulate and manage data
Explore and obtain summary statistics
Produce common, useful graphs

R for Data Science

http://r4ds.had.co.nz/index.html

tidyverse R package

The tidyverse is a set of packages that work in harmony because they share common data representations and API design.

It includes:

ggplot2, for data visualisation
dplyr, for data manipulation
tidyr, for data tidying
readr, for data import
purrr, for functional programming
tibble, for tibbles, a modern re-imagining of data frames
and many more …
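As a minimal sketch, assuming the tidyverse packages are already installed (for example with install.packages("tidyverse")), a package can be loaded on its own or together with the others via library(tidyverse). The runners tibble below is made-up toy data, used only to show what a tibble looks like.

```r
# Load one tidyverse package on its own
# (library(tidyverse) would attach the core packages in one call).
library(tibble)

# A tibble is built much like a data frame (toy values, for illustration only)
runners <- tibble(
  id  = 1:3,
  age = c(34, 41, 29)
)

# A tibble is still a data frame, so functions expecting data frames keep working
is.data.frame(runners)
class(runners)
```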

Motivating example

Dataset

marathon.Rdata

load(url("http://alecri.github.io/downloads/data/marathon.Rdata"))

Reference (article)

"Hyponatremia among Runners in the Boston Marathon", New England Journal of Medicine, 2005, Volume 352:1550-1556.

Descriptive abstract

Hyponatremia has emerged as an important cause of race-related death and life-threatening illness among marathon runners. We studied a cohort of marathon runners to estimate the incidence of hyponatremia and to identify the principal risk factors.

Data transformation

dplyr: grammar of data manipulation

https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

5 functions (verbs) to manipulate data

Pick observations by their values (filter())
Reorder the rows (arrange())
Pick variables by their names (select())
Create new variables with functions of existing variables (mutate())
Collapse many values down to a single summary (summarise())
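The five verbs can be sketched one at a time on a small data frame. The runners data and its column names are hypothetical toy values, not the marathon dataset; each verb takes the data frame as its first argument and returns a new data frame.

```r
library(dplyr)

# Toy data, made up for illustration (not the marathon dataset)
runners <- data.frame(
  id      = 1:4,
  sex     = c("female", "male", "female", "male"),
  runtime = c(240, 210, 300, 195)   # minutes
)

fast    <- filter(runners, runtime < 250)                    # keep rows by value
sorted  <- arrange(runners, desc(runtime))                   # reorder rows, slowest first
small   <- select(runners, id, runtime)                      # keep columns by name
hours   <- mutate(runners, timeh = runtime / 60)             # add a derived column
overall <- summarise(runners, mean_runtime = mean(runtime))  # collapse to a single row
```

None of these calls modifies runners; the result has to be assigned to keep it.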

The %>% (pipe) operator

library(dplyr)

# create a variable timeh (time in hours)
# note: the first argument of the dplyr verbs is named .data, so pass it by position
marathon <- mutate(marathon, timeh = runtime/60)
# select only a few variables
marathon_sub <- select(marathon, id, female, age, na, bmi, timeh)
# keep only female runners with bmi > 30
female_30 <- filter(marathon_sub, female == "female", bmi > 30)
# sort by descending na levels, then by timeh
arrange(female_30, desc(na), timeh)

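Equivalent to the following pipeline, which passes each result to the next step without creating intermediate objects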

marathon %>% 
   mutate(timeh = runtime/60) %>%
   select(id, female, age, na, bmi, timeh) %>%
   filter(female == "female", bmi > 30) %>%
   arrange(desc(na), timeh)
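The pipe passes the left-hand side as the first argument of the call on the right, so x %>% f(y) is evaluated as f(x, y). A minimal sketch comparing nested calls with a pipeline, using hypothetical toy data rather than the marathon dataset:

```r
library(dplyr)

# Toy data, made up for illustration (not the marathon dataset)
runners <- data.frame(
  id      = 1:4,
  sex     = c("female", "male", "female", "male"),
  runtime = c(240, 210, 300, 195)   # minutes
)

# Nested calls have to be read from the inside out
nested <- arrange(filter(runners, sex == "female"), desc(runtime))

# The same steps as a pipeline read from top to bottom
piped <- runners %>%
  filter(sex == "female") %>%
  arrange(desc(runtime))

# Compare the two results
all.equal(nested, piped)
```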

Data visualization

ggplot2: grammar of graphics

http://www.cookbook-r.com/Graphs/

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

The first argument is the dataset to use in the plot
The geom functions add layers to the plot
Each geom function takes a mapping argument, built with aes()
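A minimal sketch of the template filled in, assuming ggplot2 is installed. It uses mtcars, a dataset that ships with R, in place of <DATA>; the choice of wt and mpg is only for illustration.

```r
library(ggplot2)

# mtcars stands in for <DATA>, geom_point for <GEOM_FUNCTION>,
# and x = wt, y = mpg for <MAPPINGS>
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))

# Adding a second geom adds a second layer to the same plot
p2 <- p +
  geom_smooth(mapping = aes(x = wt, y = mpg), method = "lm", se = FALSE)

print(p2)
```

A mapping given inside ggplot() is inherited by every layer, which avoids repeating aes() in each geom.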

# an example: na against wtdiff, colored by sex, with point size given by runtime
library(ggplot2)
ggplot(data = marathon, aes(x = wtdiff, y = na, color = female)) +
   geom_point(aes(size = runtime), shape = 18) + 
   geom_smooth(method = "lm", se = FALSE)