Introduction to R for Public Health Researchers

last update: October 8, 2016

Aims

Get familiar with R
Basic data types in R
Read data into R
Manipulate data
Explore and summarize data
Make exploratory plots

Get familiar with R

What is R

"R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis."

– Google "what is r?"

Features of R

Free and open source
Powerful and flexible tool designed for statistical computing and graphics
Programming language
Many packages available on CRAN. See https://cran.r-project.org/web/views/ for an overview

Why R

Leading software for statistics, data analysis, and machine learning (https://www.r-bloggers.com/r-passes-sas-in-scholarly-use-finally/)
Big community
Integration with many other languages (Python, C++, Perl, and Java)
Support for reproducible research (rmarkdown), interactive analysis (shiny)

Possible limitations

Steep learning curve
Little central support (based on community)
Can be slow and memory intensive (objects are held in memory)

Installing R

Installation is via the installer R-3.3.1-win.exe . Just double-click on the icon and follow the instructions. When installing on a 64-bit version of Windows the options will include 32- or 64-bit versions of R (and the default is to install both). You can uninstall R from the Control Panel.

– Google "how to install r?"

https://www.r-project.org/

RStudio

Integrated Development Environment (IDE) for R
Syntax highlighting, code completion, and smart indentation
Easily manage multiple working directories using projects
Workspace browser and data viewer
Plot history, zooming, and flexible image and PDF export
Integrated R help and documentation

Installing RStudio

Just follow these steps:

Go to RStudio Download.

Click the Download RStudio Desktop button.

Select the installation file for your system.

Run the installation file.

– Google "how to install rstudio?"

https://www.rstudio.com/ https://www.rstudio.com/products/rstudio/download2/

How to use R

From the console (interactive)
- calculator
- create variables
- run functions
Using an R script (reproducible)
- keep track of what you write
- try code interactively, then add it to the script
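
The script workflow can be sketched as follows. The file name and its contents are made up for illustration; in practice you would save the script from the editor. `source()` runs every line of a script in order.

```r
# write a tiny script to a temporary file (hypothetical content)
script_file = tempfile(fileext = ".R")
writeLines(c("a = 2 + 2", "b = a * 3"), script_file)

# run the whole script; the objects it creates appear in the workspace
source(script_file)
b
## [1] 12
```

Re-running the script later reproduces the same objects, which is the main advantage over typing in the console.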

R as a calculator

# this is a comment
2 + 2
## [1] 4

2*3
## [1] 6

3^2
## [1] 9

((2 + 2)*2*3)^2
## [1] 576

exp(2)
## [1] 7.389056

log(2.718282)
## [1] 1

9^.5
## [1] 3

round(pi, 2)
## [1] 3.14

R from console

# create a variable
x = 2 # same as: x <- 2
x
## [1] 2

# R is case-sensitive:
# 'X' is not the same as 'x'
X
## Error in eval(expr, envir, enclos): object 'X' not found

# which objects are in the workspace
ls()
## [1] "x"

x + 2
## [1] 4

x^2 + x/2 + 5
## [1] 10

y = "This is a string"
print(y)
## [1] "This is a string"

ls()
## [1] "x" "y"

# remove objects
rm("x", "y")
ls()
## character(0)

Basic data types in R

Basic Data Types

Vectors
Matrices
Lists
Data frames

Vectors

The simplest and most basic data structure in R
Contain elements of the same type (numeric, character, logical)
Can be created using the combine function c()

# examples of three vectors of different type
x = c(2, 5, 9, 15, 4)
letters = c("a", "b", "r", "d", "e")  # note: masks R's built-in 'letters' constant
y = c(T, T, F, FALSE, TRUE)

# print one vector to the console
x
## [1]  2  5  9 15  4

Access/modify elements of a vector by using []

# First and Fourth element of x
x[c(1, 4)]
## [1]  2 15

# Change the third element of letters
letters[3] = "c"
letters
## [1] "a" "b" "c" "d" "e"

# Using logical
x < 7 # is each element of x less than 7?
## [1]  TRUE  TRUE FALSE FALSE  TRUE

x[x < 7] # select only those elements of x < 7
## [1] 2 5 4

letters[y]
## [1] "a" "b" "e"

Arithmetic Operators

Operator	description
+	addition
-	subtraction
*	multiplication
/	division
^	exponentiation

Logical Operators

Operator	description
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	exactly equal to
!=	not equal to
x | y	x OR y
x & y	x AND y
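
A short sketch of how comparison operators combine with `&` and `|`. The vectors `age` and `smoker` are made-up values used only for illustration.

```r
age = c(23, 35, 41, 58, 62)                    # hypothetical ages
smoker = c(TRUE, FALSE, TRUE, FALSE, TRUE)     # hypothetical smoking status

# AND: both conditions must hold
age > 40 & smoker
## [1] FALSE FALSE  TRUE FALSE  TRUE

# OR: at least one condition must hold
age < 30 | age >= 60
## [1]  TRUE FALSE FALSE FALSE  TRUE

# use the logical result to select elements
age[age > 40 & smoker]
## [1] 41 62
```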

# find out about the class
class(letters)
## [1] "character"

# how many elements
length(y)
## [1] 5

# Depending on the type of vector we can obtain different statistics
mean(x)
## [1] 7

# create a vector (stat) with min, median, and max of x
stat = c(minimum = min(x), median = median(x), maximum = max(x))
stat
## minimum  median maximum 
##       2       5      15

Many functions

https://cran.r-project.org/doc/contrib/Short-refcard.pdf

function	explanation
`sin, cos, tan, sqrt, log, log10, exp`	elementary mathematical functions
`max(x)`	maximum of the elements of x
`min(x)`	minimum of the elements of x
`range(x)`	same as c(min(x), max(x))
`sum(x)`	sum of the elements of x
`diff(x)`	lagged and iterated differences of vector x
`prod(x)`	product of the elements of x
`mean(x)`	mean of the elements of x
`median(x)`	median of the elements of x

Many functions (2)

function	explanation
`quantile(x,probs=)`	sample quantiles corresponding to the given probabilities (defaults to 0,.25,.5,.75,1)
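`var(x)`	variance of the elements of x; `var(x)` or `cov(x)` gives the variance-covariance matrix if x is a matrix or data frame
`sd(x)`	standard deviation of x
`cor(x)`	correlation matrix of x, if x is a matrix or data frame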
`var(x, y)` or `cov(x, y)`	covariance between x and y
`cor(x, y)`	linear correlation between x and y
`round(x, n)`	rounds the elements of x to n decimals
`log(x, base)`	computes the logarithm of x with base base

?Arithmetic
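
A few of the functions in the tables above, applied to two small vectors. The values of `height` and `weight` are invented for the example.

```r
height = c(160, 170, 180)   # centimetres (made-up values)
weight = c(55, 70, 85)      # kilograms (made-up values)

sd(weight)
## [1] 15

# weight is an exact linear function of height here,
# so the correlation is 1 (up to rounding error)
cor(height, weight)

# quartiles of weight: 62.5, 70 and 77.5 with the default method
quantile(weight, probs = c(0.25, 0.5, 0.75))
```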

Factors

A vector that contains only predefined values (e.g. gender, treatment)
Numeric values associated with a character label

# create a factor
gender = factor(c("male", "female", "male", "male", "female"))
gender
## [1] male   female male   male   female
## Levels: female male

class(gender)
## [1] "factor"

# changing the labels
levels(gender) = c("woman", "man")
gender
## [1] man   woman man   man   woman
## Levels: woman man
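
To see the "numeric values associated with a character label" idea directly, the sketch below uses a hypothetical `treatment` variable (not part of the original examples) and looks at its labels and the integer codes stored behind them.

```r
# hypothetical treatment variable
treatment = factor(c("placebo", "drug", "drug", "placebo", "drug"))

levels(treatment)        # the labels, sorted alphabetically by default
## [1] "drug"    "placebo"

as.integer(treatment)    # the underlying integer codes
## [1] 2 1 1 2 1

table(treatment)         # number of observations per level
```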

Matrices

A collection of elements arranged in a two-dimensional layout

# Create a matrix from elements
elements = seq(5, 30, 5)
A = matrix(elements, nrow = 2, ncol = 3)
A
##      [,1] [,2] [,3]
## [1,]    5   15   25
## [2,]   10   20   30

# Create by combining columns
a.1 = c(5, 10)
a.2 = c(15, 20)
a.3 = c(25, 30)
cbind(a.1, a.2, a.3)
##      a.1 a.2 a.3
## [1,]   5  15  25
## [2,]  10  20  30

# Create by combining rows
a1. = c(5, 10, 15)
a2. = c(20, 25, 30)
rbind(a1., a2.)
##     [,1] [,2] [,3]
## a1.    5   10   15
## a2.   20   25   30

Access/modify elements of a matrix with [ , ]: row indices go before the comma, column indices after it

A[1, c(1, 3)]
## [1]  5 25

# First row
A[1, ]
## [1]  5 15 25

# Second column
A[, 2]
## [1] 15 20

# Change a value
A[1, 1] = 0
A
##      [,1] [,2] [,3]
## [1,]    0   15   25
## [2,]   10   20   30

Basic functions

class(A)
## [1] "matrix"

dim(A)
## [1] 2 3

nrow(A)
## [1] 2

ncol(A)
## [1] 3

names(A)
## NULL

Many functions

function	explanation
`A * B`	Element-wise multiplication
`A %*% B`	Matrix multiplication
`A %o% B`	Outer product. AB'
`crossprod(A,B)`	A'B
`t(A)`	Transpose
`diag(x)`	Creates diagonal matrix with elements of x in the principal diagonal
`diag(A)`	Returns a vector containing the elements of the principal diagonal
`solve(A)`	Inverse of A where A is a square matrix
`rowMeans(A)`	Returns vector of row means
`rowSums(A)`	Returns vector of row sums
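
A minimal sketch of some of these matrix functions. The matrices `M` and `B` are made-up examples, chosen so the results are easy to check by hand.

```r
M = matrix(1:6, nrow = 2)             # 2 x 3 matrix, filled column by column
B = matrix(c(2, 0, 0, 4), nrow = 2)   # 2 x 2 diagonal matrix

t(M)              # transpose: 2 x 3 becomes 3 x 2
rowSums(M)        # one sum per row
## [1]  9 12
colSums(M)        # one sum per column
## [1]  3  7 11

B_inv = solve(B)  # inverse of the square matrix B
B %*% B_inv       # matrix product: gives the 2 x 2 identity matrix
B * B             # element-wise product: squares each entry
```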

Lists

Similar to vectors but with elements that may be of different type

# create a list using the list() function
mylist = list(first = x, letters, A, c(1, 2))
mylist
## $first
## [1]  2  5  9 15  4
## 
## [[2]]
## [1] "a" "b" "c" "d" "e"
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    0   15   25
## [2,]   10   20   30
## 
## [[4]]
## [1] 1 2

Use [[]] or $ (if it's a named list) to access the element

# with name
mylist$first
## [1]  2  5  9 15  4

# or position number
mylist[[1]]
## [1]  2  5  9 15  4

Basic functions

class(mylist)
## [1] "list"

length(mylist)
## [1] 4

names(mylist)
## [1] "first" ""      ""      ""

Data frame

Common way of storing data in R
Similar to a matrix but with columns that may be of different types
Share some properties of matrices and lists

# How to create a data frame
mydata = data.frame(
   errors = x,
   letters = letters,
   logical = y,
   sex = gender
)
mydata
##   errors letters logical   sex
## 1      2       a    TRUE   man
## 2      5       b    TRUE woman
## 3      9       c   FALSE   man
## 4     15       d   FALSE   man
## 5      4       e    TRUE woman
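
Because a data frame shares properties of both lists and matrices, its columns can be reached either way. The sketch below uses a small made-up data frame called `patients`.

```r
# small made-up data frame
patients = data.frame(
   id = 1:3,
   sex = c("female", "male", "female"),
   age = c(34, 51, 27)
)

# like a list: extract a column by name
patients$age
## [1] 34 51 27

# like a matrix: index by [row, column]
patients[2, "age"]
## [1] 51

dim(patients)
## [1] 3 3

# keep only the rows that satisfy a condition
patients[patients$age > 30, ]
```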

Read data into R

Motivating example

Dataset

marathon.Rda

Reference

"Hyponatremia among Runners in the Boston Marathon", New England Journal of Medicine, 2005, Volume 352:1550-1556.

Descriptive abstract

Hyponatremia has emerged as an important cause of race-related death and life-threatening illness among marathon runners. We studied a cohort of marathon runners to estimate the incidence of hyponatremia and to identify the principal risk factors.

Data import

Different formats of marathon.Rda
Reading data is the first step in a project
R can read almost any file format and has many dedicated packages (foreign, haven, readxl, and many more)
Import data from databases, webscraping, etc.

NB: You need to provide the location (path) of the raw data or place the files in the working directory

# to get your working directory
getwd()
## [1] "/Users/alecri/Dropbox/KI/Teaching/Rintro/slides/ioslides"

# to change it
## setwd("path/to/file/data")

format	function	package	example
`.txt`	`read.table()`	`utils`	`read.table("http://alecri.github.io/downloads/data/marathon.txt")`
`.csv`	`read.csv()`	`utils`	`read.csv("http://alecri.github.io/downloads/data/marathon.csv")`
`.dta`	`read_dta()`	`haven`	`read_dta("http://alecri.github.io/downloads/data/marathon.dta")`
`.sav`	`read_spss()`	`haven`	`read_spss("http://alecri.github.io/downloads/data/marathon.sav")`
`.sas7bdat`	`read_sas()`	`haven`	`read_sas("http://alecri.github.io/downloads/data/marathon.b7dat")`
`.xlsx`	`read_excel()`	`readxl`	`read_excel("data/marathon.xlsx")`

… and many more
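
If you want to try an import without downloading anything, the following sketch writes a small made-up data set to a temporary CSV file and reads it back. The file and the object name `runners` are illustrative only.

```r
# made-up data written to a temporary CSV file
csv_file = tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, runtime = c(161, 156, 346)),
          csv_file, row.names = FALSE)

# read it back; read.csv() treats the first line as column names by default
runners = read.csv(csv_file)
str(runners)
```

The same pattern applies to the other functions in the table: pass the path (or URL) of the file and assign the result to an object.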

#load("data/marathon.Rdata")
load(url("http://alecri.github.io/downloads/data/marathon.Rdata"))

Looking at the data

Use the data viewer in RStudio
View, Sort, Filter, and Search

View(marathon)

Or use basic functions

# What are the dimensions (i.e. rows and columns)
dim(marathon)
## [1] 488  18

# Which variables
names(marathon)
##  [1] "id"        "na"        "nas135"    "female"    "age"       "urinat3p"  "prewt"    
##  [8] "postwt"    "wtdiff"    "height"    "bmi"       "runtime"   "trainpace" "prevmara" 
## [15] "fluidint"  "waterload" "nsaid"     "wtdiffc"

# Get the structure of the data
str(marathon)
## Classes 'tbl_df', 'tbl' and 'data.frame':    488 obs. of  18 variables:
##  $ id       :Classes 'labelled', 'integer'  atomic [1:488] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..- attr(*, "label")= chr "ID number"
##  $ na       :Classes 'labelled', 'integer'  atomic [1:488] 138 142 151 139 145 140 142 140 141 138 ...
##   .. ..- attr(*, "label")= chr "Serum sodium concentration (mmol/liter)"
##  $ nas135   :Class 'labelled'  atomic [1:488] 0 0 0 0 0 0 0 0 0 0 ...
##   .. ..- attr(*, "label")= chr "Serum sodium concentration <= 135 mmol/liter"
##   .. ..- attr(*, "labels")= Named int [1:2] 0 1
##   .. .. ..- attr(*, "names")= chr [1:2] "No" "Yes"
##  $ female   : Factor w/ 2 levels "male","female": 2 1 1 1 2 2 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Female"
##  $ age      :Classes 'labelled', 'numeric'  atomic [1:488] 24.2 44.3 42 28.2 30.2 ...
##   .. ..- attr(*, "label")= chr "Age (years)"
##  $ urinat3p :Class 'labelled'  atomic [1:488] 1 0 0 1 0 0 0 0 0 0 ...
##   .. ..- attr(*, "label")= chr "Urine output"
##   .. ..- attr(*, "labels")= Named int [1:2] 0 1
##   .. .. ..- attr(*, "names")= chr [1:2] "<3" ">=3"
##  $ prewt    :Classes 'labelled', 'numeric'  atomic [1:488] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..- attr(*, "label")= chr "Weight (kg) pre-race"
##  $ postwt   :Classes 'labelled', 'numeric'  atomic [1:488] NA NA NA NA 50.7 ...
##   .. ..- attr(*, "label")= chr "Weight (kg) post-race"
##  $ wtdiff   :Classes 'labelled', 'numeric'  atomic [1:488] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..- attr(*, "label")= chr "Weight change (kg) pre/post race"
##  $ height   :Classes 'labelled', 'numeric'  atomic [1:488] 1.73 NA NA 1.73 NA ...
##   .. ..- attr(*, "label")= chr "Height (cm)"
##  $ bmi      :Classes 'labelled', 'numeric'  atomic [1:488] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..- attr(*, "label")= chr "Body-mass index (kg/m^2)"
##  $ runtime  :Classes 'labelled', 'integer'  atomic [1:488] NA 161 156 346 185 233 183 162 182 190 ...
##   .. ..- attr(*, "label")= chr "Race duration (minutes)"
##  $ trainpace:Classes 'labelled', 'numeric'  atomic [1:488] 480 430 360 630 NA NA 435 450 435 440 ...
##   .. ..- attr(*, "label")= chr "Training pace (seconds/mile)"
##  $ prevmara :Classes 'labelled', 'integer'  atomic [1:488] 3 40 40 1 3 25 19 2 80 10 ...
##   .. ..- attr(*, "label")= chr "Previous marathons (no.)"
##  $ fluidint : Factor w/ 3 levels "Every mile","Every other mile",..: 1 1 2 1 1 2 2 3 1 1 ...
##   ..- attr(*, "label")= chr "Self-reported fluid intake"
##  $ waterload: Factor w/ 2 levels "No","Yes": 2 2 NA 2 2 2 2 1 2 2 ...
##   ..- attr(*, "label")= chr "Self-reported water loading"
##  $ nsaid    : Factor w/ 2 levels "No","Yes": 2 2 NA 1 2 1 2 1 2 2 ...
##   ..- attr(*, "label")= chr "Self-reported use of NSAIDs"
##  $ wtdiffc  : Factor w/ 7 levels "3.0 to 4.9","2.0 to 2.9",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Categorization of weight change"

# First 6 observations
# use 'tail' for the last 6
head(marathon)
## # A tibble: 6 × 18
##               id             na         nas135 female            age       urinat3p
##   <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1              1            138              0 female       24.20534              1
## 2              2            142              0   male       44.28200              0
## 3              3            151              0   male       41.96304              0
## 4              4            139              0   male       28.19713              1
## 5              5            145              0 female       30.18207              0
## 6              6            140              0 female       28.29295              0
## # ... with 12 more variables: prewt <S3: labelled>, postwt <S3: labelled>, wtdiff <S3:
## #   labelled>, height <S3: labelled>, bmi <S3: labelled>, runtime <S3: labelled>,
## #   trainpace <S3: labelled>, prevmara <S3: labelled>, fluidint <fctr>, waterload <fctr>,
## #   nsaid <fctr>, wtdiffc <fctr>

Manipulate data

The `dplyr` package

A fast and consistent tool for working with data frames
Exploratory data analysis and manipulation
Makes it easier to decide what to do, and how to program and execute it
Identifies the most important data manipulation verbs and makes them easy to use from R

install.packages("dplyr")   # only needed once
library(dplyr)
# to learn more about the package
browseVignettes(package = "dplyr")

Verbs

filter() and slice()
arrange()
select() and rename()
distinct()
mutate() and transmute()
summarise()
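
A minimal sketch of a few of these verbs. It assumes `dplyr` is installed and uses a small made-up data frame (loosely modelled on the marathon variables) rather than the real data set.

```r
library(dplyr)

# made-up data, for illustration only
runners = data.frame(
   id = 1:4,
   female = c("female", "male", "male", "female"),
   runtime = c(185, 161, 346, 233)    # race duration in minutes
)

# filter(): keep runners finishing in under 300 minutes
fast = filter(runners, runtime < 300)

# arrange(): sort by race duration
sorted = arrange(fast, runtime)

# mutate(): add a new column with the duration in hours
with_hours = mutate(sorted, hours = runtime / 60)

# select(): keep only some columns
select(with_hours, id, hours)

# summarise(): collapse to a one-row summary
overall = summarise(runners, mean_runtime = mean(runtime), n = n())
overall
```

Each verb takes a data frame as its first argument and returns a new data frame, so the steps can be chained one after another.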