last update: October 8, 2016

Aims

  • Get familiar with R

  • Basic data types in R

  • Read data into R

  • Manipulate data

  • Explore and summarize data

  • Make exploratory plots

Get familiar with R

What is R


"R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis."

– Google "what is r?"

Features of R

  • Free and open source

  • Powerful and flexible tool designed for statistical computing and graphics

  • Programming language

  • Many packages available on CRAN. See https://cran.r-project.org/web/views/ for an overview

Why R

Possible limitations


  • Steep learning curve
  • Little central support (based on community)
  • Slower and more memory intensive



Installing R


Installation is via the installer R-3.3.1-win.exe . Just double-click on the icon and follow the instructions. When installing on a 64-bit version of Windows the options will include 32- or 64-bit versions of R (and the default is to install both). You can uninstall R from the Control Panel.

– Google "how to install r?"

https://www.r-project.org/

Rstudio

  • Integrated Development Environment (IDE) for R

  • Syntax highlighting, code completion, and smart indentation

  • Easily manage multiple working directories using projects

  • Workspace browser and data viewer

  • Plot history, zooming, and flexible image and PDF export

  • Integrated R help and documentation

Installing Rstudio

How to use R

  1. From the console (interactive)

    • calculator

    • create variables

    • run functions

  2. Using an R script (reproducible)

    • keep trace of what you write

    • try interactive and then add to the script

R as a calculator

# this is a comment
2 + 2
## [1] 4
2*3
## [1] 6
3^2
## [1] 9
((2 + 2)*2*3)^2
## [1] 576
exp(2)
## [1] 7.389056
log(2.718282)
## [1] 1
9^.5
## [1] 3
round(pi, 2)
## [1] 3.14

R from console

# create a variable
x = 2 # same as: x <- 2
x
## [1] 2
# R is case-sensitive:
# 'X' is not the same as 'x'
X
## Error in eval(expr, envir, enclos): object 'X' not found
# which objects are in the workspace
ls()
## [1] "x"
x + 2
## [1] 4
x^2 + x/2 + 5
## [1] 10
y = "This is a string"
print(y)
## [1] "This is a string"
ls()
## [1] "x" "y"
# remove objects
rm("x", "y")
ls()
## character(0)

Basic data types in R

Basic Data Types

  1. Vectors

  2. Matrices

  3. Lists

  4. Data frames

Vectors

  • Simpler and basic data structure in R

  • Contain elements of the same type (numeric, character, logical)

  • Can be created using the combine function c()

# examples of three vectors of different type
x = c(2, 5, 9, 15, 4)
letters = c("a", "b", "r", "d", "e")
y = c(T, T, F, FALSE, TRUE)
# print one vector to the console
x
## [1]  2  5  9 15  4

  • Access/modify elements of a vector by using []
# First and Fourth element of x
x[c(1, 4)]
## [1]  2 15
# Changing the third elemnt in letters
letters[3] = "c"
letters
## [1] "a" "b" "c" "d" "e"
# Using logical
x < 7 # is it (each element of) x less than 7  
## [1]  TRUE  TRUE FALSE FALSE  TRUE
x[x < 7] # select only those elements of x < 7
## [1] 2 5 4
letters[y]
## [1] "a" "b" "e"

Arithmetic Operators

Operator symbol
+ addition
- subtraction
* multiplication
/ division
^ exponentiation

Logical Operators

Operator symbol
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
x | y x OR y
x & y x AND y

# find out about the class
class(letters)
## [1] "character"
# how many elements
length(y)
## [1] 5
# Depending of the type of vector we can obtain different statistics
mean(x)
## [1] 7
# create a vector (stat) with min, median, and max of x
stat = c(minimun = min(x), median = median(x), maximum = max(x))
stat
## minimun  median maximum 
##       2       5      15

Many functions

https://cran.r-project.org/doc/contrib/Short-refcard.pdf

function explanation
sin, cos, tan, sqrt, log, log10, exp
max(x) maximum of the elements of x
min(x) minimum of the elements of x
range(x) id. then c(min(x), max(x))
sum(x) sum of the elements of x
diff(x) lagged and iterated differences of vector x
prod(x) product of the elements of x
mean(x) mean of the elements of x
median(x) median of the elements of x

Many functions (2)

function explanation
quantile(x,probs=) sample quantiles corresponding to the given probabilities (defaults to 0,.25,.5,.75,1)
var(x) or cov(x) variance of the elements of x
sd(x) standard deviation of x
cor(x) correlation matrix of x
var(x, y) or cov(x, y) covariance between x and y
cor(x, y) linear correlation between x and y
round(x, n) rounds the elements of x to n decimals
log(x, base) computes the logarithm of x with base base


?Arithmetic

Factors

  • A vector that contains only predefined values (e.g. gender, treatment)
  • Numeric values associated with a character label
# create a factor
gender = factor(c("male", "female", "male", "male", "female"))
gender
## [1] male   female male   male   female
## Levels: female male
class(gender)
## [1] "factor"
# changing the labels
levels(gender) = c("woman", "man")
gender
## [1] man   woman man   man   woman
## Levels: woman man

Matrices

  • A collection of elements arranged in a two-dimensional layout
# Create a matrix from elements
elements = seq(5, 30, 5)
A = matrix(elements, nrow = 2, ncol = 3)
A
##      [,1] [,2] [,3]
## [1,]    5   15   25
## [2,]   10   20   30
# Create by combining columns
a.1 = c(5, 10)
a.2 = c(15, 20)
a.3 = c(25, 30)
cbind(a.1, a.2, a.3)
##      a.1 a.2 a.3
## [1,]   5  15  25
## [2,]  10  20  30
# Create by combining rows
a1. = c(5, 10, 15)
a2. = c(20, 25, 30)
rbind(a1., a2.)
##     [,1] [,2] [,3]
## a1.    5   10   15
## a2.   20   25   30

  • access/modify elements of a matrix by [ , ]: rows and columns before and after the comma
A[1, c(1, 3)]
## [1]  5 25
# First row
A[1, ]
## [1]  5 15 25
# Second columns
A[, 2]
## [1] 15 20
# Change a value
A[1, 1] = 0
A
##      [,1] [,2] [,3]
## [1,]    0   15   25
## [2,]   10   20   30

Basic functions

class(A)
## [1] "matrix"
dim(A)
## [1] 2 3
nrow(A)
## [1] 2
ncol(A)
## [1] 3
names(A)
## NULL

Many functions

function explanation
A * B Element-wise multiplication
A %*% B Matrix multiplication
A %o% B Outer product. AB'
crossprod(A,B) A'B
t(A) Transpose
diag(x) Creates diagonal matrix with elements of x in the principal diagona
diag(A) Returns a vector containing the elements of the principal diagonal
solve(A) Inverse of A where A is a square matrix
rowMeans(A) Returns vector of row means
rowSums(A) Returns vector of column sums

Lists

  • Similar to vectors but with elements that may be of different type
# create a list using the list() function
mylist = list(first = x, letters, A, c(1, 2))
mylist
## $first
## [1]  2  5  9 15  4
## 
## [[2]]
## [1] "a" "b" "c" "d" "e"
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    0   15   25
## [2,]   10   20   30
## 
## [[4]]
## [1] 1 2

  • Use [[]] or $ (if it's a named list) to access the element
# with name
mylist$first
## [1]  2  5  9 15  4
# or position number
mylist[[1]]
## [1]  2  5  9 15  4
  • Basic functions
class(mylist)
## [1] "list"
length(mylist)
## [1] 4
names(mylist)
## [1] "first" ""      ""      ""

Data frame

  • Common way of storing data in R
  • Similar to matrix but with columns that may be of a different type
  • Share some properties of matrices and lists
# How to create a data frame
mydata = data.frame(
   errors = x,
   letters = letters,
   logical = y,
   sex = gender
)
mydata
##   errors letters logical   sex
## 1      2       a    TRUE   man
## 2      5       b    TRUE woman
## 3      9       c   FALSE   man
## 4     15       d   FALSE   man
## 5      4       e    TRUE woman

Read data into R

Motivating example

Dataset

marathon.Rda

Reference

"Hyponatremia among Runners in the Boston Marathon", New England Journal of Medicine, 2005, Volume 352:1550-1556.

Descriptive abstract

Hyponatremia has emerged as an important cause of race-related death and life-threatening illness among marathon runners. We studied a cohort of marathon runners to estimate the incidence of hyponatremia and to identify the principal risk factors.

Data import

  • Different formats of marathon.Rda

  • Reading data is the first step in a project

  • R can read almost any file format and has many dedicated package (foreign, haven, read_excel, and many more)

  • Import data from databases, webscraping, etc.

NB: You need to provide the location (path) of the row data or place them in the working directory

# to get you working directory
getwd()
## [1] "/Users/alecri/Dropbox/KI/Teaching/Rintro/slides/ioslides"
# to change it
## setwd("path/to/file/data")

format function package example
.txt read.table() base read.table("http://alecri.github.io/downloads/data/marathon.txt")
.csv read.csv() base read.csv("http://alecri.github.io/downloads/data/marathon.csv")
.dta read_dta() haven read_dta("http://alecri.github.io/downloads/data/marathon.dta")
.sav read_spss() haven read_spss("http://alecri.github.io/downloads/data/marathon.sav")
.b7dat read_b7dat() haven read_b7dat("http://alecri.github.io/downloads/data/marathon.b7dat")
.xlsx read_excel() readxl read_excel("data/marathon.xlsx")


… and many more

#load("data/marathon.Rdata")
load(url("http://alecri.github.io/downloads/data/marathon.Rdata"))

Looking at the data

  • Use data viewer in Rstudio
  • View, Sort, Filter, and Search
View(marathon)
  • Or use basic functions
# What are the dimensions (i.e. rows and columns)
dim(marathon)
## [1] 488  18
# Which variable
names(marathon)
##  [1] "id"        "na"        "nas135"    "female"    "age"       "urinat3p"  "prewt"    
##  [8] "postwt"    "wtdiff"    "height"    "bmi"       "runtime"   "trainpace" "prevmara" 
## [15] "fluidint"  "waterload" "nsaid"     "wtdiffc"

# Get the structure of the data
str(marathon)
## Classes 'tbl_df', 'tbl' and 'data.frame':    488 obs. of  18 variables:
##  $ id       :Classes 'labelled', 'integer'  atomic [1:488] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..- attr(*, "label")= chr "ID number"
##  $ na       :Classes 'labelled', 'integer'  atomic [1:488] 138 142 151 139 145 140 142 140 141 138 ...
##   .. ..- attr(*, "label")= chr "Serum sodium concentration (mmol/liter)"
##  $ nas135   :Class 'labelled'  atomic [1:488] 0 0 0 0 0 0 0 0 0 0 ...
##   .. ..- attr(*, "label")= chr "Serum sodium concentration <= 135 mmol/liter"
##   .. ..- attr(*, "labels")= Named int [1:2] 0 1
##   .. .. ..- attr(*, "names")= chr [1:2] "No" "Yes"
##  $ female   : Factor w/ 2 levels "male","female": 2 1 1 1 2 2 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Female"
##  $ age      :Classes 'labelled', 'numeric'  atomic [1:488] 24.2 44.3 42 28.2 30.2 ...
##   .. ..- attr(*, "label")= chr "Age (years)"
##  $ urinat3p :Class 'labelled'  atomic [1:488] 1 0 0 1 0 0 0 0 0 0 ...
##   .. ..- attr(*, "label")= chr "Urine output"
##   .. ..- attr(*, "labels")= Named int [1:2] 0 1
##   .. .. ..- attr(*, "names")= chr [1:2] "<3" ">=3"
##  $ prewt    :Classes 'labelled', 'numeric'  atomic [1:488] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..- attr(*, "label")= chr "Weight (kg) pre-race"
##  $ postwt   :Classes 'labelled', 'numeric'  atomic [1:488] NA NA NA NA 50.7 ...
##   .. ..- attr(*, "label")= chr "Weight (kg) post-race"
##  $ wtdiff   :Classes 'labelled', 'numeric'  atomic [1:488] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..- attr(*, "label")= chr "Weight change (kg) pre/post race"
##  $ height   :Classes 'labelled', 'numeric'  atomic [1:488] 1.73 NA NA 1.73 NA ...
##   .. ..- attr(*, "label")= chr "Height (cm)"
##  $ bmi      :Classes 'labelled', 'numeric'  atomic [1:488] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..- attr(*, "label")= chr "Body-mass index (kg/m^2)"
##  $ runtime  :Classes 'labelled', 'integer'  atomic [1:488] NA 161 156 346 185 233 183 162 182 190 ...
##   .. ..- attr(*, "label")= chr "Race duration (minutes)"
##  $ trainpace:Classes 'labelled', 'numeric'  atomic [1:488] 480 430 360 630 NA NA 435 450 435 440 ...
##   .. ..- attr(*, "label")= chr "Training pace (seconds/mile)"
##  $ prevmara :Classes 'labelled', 'integer'  atomic [1:488] 3 40 40 1 3 25 19 2 80 10 ...
##   .. ..- attr(*, "label")= chr "Previous marathons (no.)"
##  $ fluidint : Factor w/ 3 levels "Every mile","Every other mile",..: 1 1 2 1 1 2 2 3 1 1 ...
##   ..- attr(*, "label")= chr "Self-reported fluid intake"
##  $ waterload: Factor w/ 2 levels "No","Yes": 2 2 NA 2 2 2 2 1 2 2 ...
##   ..- attr(*, "label")= chr "Self-reported water loading"
##  $ nsaid    : Factor w/ 2 levels "No","Yes": 2 2 NA 1 2 1 2 1 2 2 ...
##   ..- attr(*, "label")= chr "Self-reported use of NSAIDs"
##  $ wtdiffc  : Factor w/ 7 levels "3.0 to 4.9","2.0 to 2.9",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Categorization of weight change"

# First 6 observations
# use 'tail' for the last 6
head(marathon)
## # A tibble: 6 × 18
##               id             na         nas135 female            age       urinat3p
##   <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1              1            138              0 female       24.20534              1
## 2              2            142              0   male       44.28200              0
## 3              3            151              0   male       41.96304              0
## 4              4            139              0   male       28.19713              1
## 5              5            145              0 female       30.18207              0
## 6              6            140              0 female       28.29295              0
## # ... with 12 more variables: prewt <S3: labelled>, postwt <S3: labelled>, wtdiff <S3:
## #   labelled>, height <S3: labelled>, bmi <S3: labelled>, runtime <S3: labelled>,
## #   trainpace <S3: labelled>, prevmara <S3: labelled>, fluidint <fctr>, waterload <fctr>,
## #   nsaid <fctr>, wtdiffc <fctr>

Manipulate data

The dplyr package

  • A fast and consistent tool for working with data frame

  • Exploratory data analysis and manipulation

  • Make it easier to choose what to do, how to program and execute it

  • Identify the most important data manipulation verbs and make them easy to use from R.

install.packages("dplyr")
library(dplyr)
## To learn more about that
browseVignettes(package = "dplyr")

Verbs

  • filter() and slice()

  • arrange()

  • select() and rename()

  • distinct()

  • mutate() and transmute()

  • summarise()

filter() and slice()

Select a subset of rows

# Filter male with a bmi > 30
filter(marathon, female == "male", bmi > 30)
## # A tibble: 6 × 18
##               id             na         nas135 female            age       urinat3p
##   <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1            305            143              0   male       31.99726              0
## 2            393            138              0   male       45.57974              1
## 3            404            145              0   male       48.55031              1
## 4            426            143              0   male       35.51266              0
## 5            485            128              1   male       38.83915              0
## 6            487            131              1   male       38.33265              0
## # ... with 12 more variables: prewt <S3: labelled>, postwt <S3: labelled>, wtdiff <S3:
## #   labelled>, height <S3: labelled>, bmi <S3: labelled>, runtime <S3: labelled>,
## #   trainpace <S3: labelled>, prevmara <S3: labelled>, fluidint <fctr>, waterload <fctr>,
## #   nsaid <fctr>, wtdiffc <fctr>

Equivalent to the code in base R

subset(marathon, female == "male" & bmi > 30)
marathon[marathon$female == "male" & marathon$bmi > 30 & !is.na(marathon$bmi), ]

Select rows by position, use slice():

# Select the first 10 obs
slice(marathon, 1:10)
## # A tibble: 10 × 18
##                id             na         nas135 female            age       urinat3p
##    <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1               1            138              0 female       24.20534              1
## 2               2            142              0   male       44.28200              0
## 3               3            151              0   male       41.96304              0
## 4               4            139              0   male       28.19713              1
## 5               5            145              0 female       30.18207              0
## 6               6            140              0 female       28.29295              0
## 7               7            142              0   male       34.44216              0
## 8               8            140              0   male       26.16016              0
## 9               9            141              0   male       50.35729              0
## 10             10            138              0   male       49.39083              0
## # ... with 12 more variables: prewt <S3: labelled>, postwt <S3: labelled>, wtdiff <S3:
## #   labelled>, height <S3: labelled>, bmi <S3: labelled>, runtime <S3: labelled>,
## #   trainpace <S3: labelled>, prevmara <S3: labelled>, fluidint <fctr>, waterload <fctr>,
## #   nsaid <fctr>, wtdiffc <fctr>

Equivalent to the code in base R

marathon[1:10, ]

arrange()

Arrange or reorder the rows

# Sort by (increasing) runtime and decreasing age
arrange(marathon, runtime, desc(age))
## # A tibble: 488 × 18
##                id             na         nas135 female            age       urinat3p
##    <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1             222            144              0   male       54.16838              0
## 2             208            137              0   male       33.97399              0
## 3               3            151              0   male       41.96304              0
## 4             464            133              1   male       32.89802              0
## 5             429            134              1   male       31.52361              0
## 6               2            142              0   male       44.28200              0
## 7             104            141              0   male       44.93087              0
## 8               8            140              0   male       26.16016              0
## 9             285            141              0   male       30.55441              0
## 10            234            139              0   male       39.13758              0
## # ... with 478 more rows, and 12 more variables: prewt <S3: labelled>, postwt <S3:
## #   labelled>, wtdiff <S3: labelled>, height <S3: labelled>, bmi <S3: labelled>,
## #   runtime <S3: labelled>, trainpace <S3: labelled>, prevmara <S3: labelled>,
## #   fluidint <fctr>, waterload <fctr>, nsaid <fctr>, wtdiffc <fctr>

Equivalent to the code in base R

marathon[order(marathon$runtime, -marathon$age), ]

select()

Select a subset of columns

# Select id, na, bmi, runtime
select(marathon, id, na, bmi, runtime)
## # A tibble: 488 × 4
##                id             na            bmi        runtime
##    <S3: labelled> <S3: labelled> <S3: labelled> <S3: labelled>
## 1               1            138             NA             NA
## 2               2            142             NA            161
## 3               3            151             NA            156
## 4               4            139             NA            346
## 5               5            145             NA            185
## 6               6            140             NA            233
## 7               7            142             NA            183
## 8               8            140             NA            162
## 9               9            141             NA            182
## 10             10            138             NA            190
## # ... with 478 more rows

Equivalent to the code in base R

marathon[, c("id", "na", "bmi", "runtime")]
subset(marathon, select = c("id", "na", "bmi", "runtime"))

rename()

Rename some of the variables

# Rename nas135 with hyponatremia
rename(marathon, hyponatremia = nas135)
## # A tibble: 488 × 18
##                id             na   hyponatremia female            age       urinat3p
##    <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1               1            138              0 female       24.20534              1
## 2               2            142              0   male       44.28200              0
## 3               3            151              0   male       41.96304              0
## 4               4            139              0   male       28.19713              1
## 5               5            145              0 female       30.18207              0
## 6               6            140              0 female       28.29295              0
## 7               7            142              0   male       34.44216              0
## 8               8            140              0   male       26.16016              0
## 9               9            141              0   male       50.35729              0
## 10             10            138              0   male       49.39083              0
## # ... with 478 more rows, and 12 more variables: prewt <S3: labelled>, postwt <S3:
## #   labelled>, wtdiff <S3: labelled>, height <S3: labelled>, bmi <S3: labelled>,
## #   runtime <S3: labelled>, trainpace <S3: labelled>, prevmara <S3: labelled>,
## #   fluidint <fctr>, waterload <fctr>, nsaid <fctr>, wtdiffc <fctr>

Equivalent to the code in base R

names(marathon)[3] = "hyponatremia"

distinct()

Rename some of the variables

# How many distinct id
distinct(marathon, id)
## # A tibble: 488 × 1
##                id
##    <S3: labelled>
## 1               1
## 2               2
## 3               3
## 4               4
## 5               5
## 6               6
## 7               7
## 8               8
## 9               9
## 10             10
## # ... with 478 more rows

Equivalent to the code in base R

length(unique(marathon$id))

mutate()

Create new variables (e.g. runtime in hours)

mutate(marathon, timeh = runtime/60)
## # A tibble: 488 × 19
##                id             na         nas135 female            age       urinat3p
##    <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1               1            138              0 female       24.20534              1
## 2               2            142              0   male       44.28200              0
## 3               3            151              0   male       41.96304              0
## 4               4            139              0   male       28.19713              1
## 5               5            145              0 female       30.18207              0
## 6               6            140              0 female       28.29295              0
## 7               7            142              0   male       34.44216              0
## 8               8            140              0   male       26.16016              0
## 9               9            141              0   male       50.35729              0
## 10             10            138              0   male       49.39083              0
## # ... with 478 more rows, and 13 more variables: prewt <S3: labelled>, postwt <S3:
## #   labelled>, wtdiff <S3: labelled>, height <S3: labelled>, bmi <S3: labelled>,
## #   runtime <S3: labelled>, trainpace <S3: labelled>, prevmara <S3: labelled>,
## #   fluidint <fctr>, waterload <fctr>, nsaid <fctr>, wtdiffc <fctr>, timeh <S3: labelled>

Equivalent to the code in base R

transform(marathon, timeh = runtime/60)
marathon$timeh = marathon$runtime/60

How to create a categorical variable

mutate(marathon, bmic = cut(bmi, breaks = c(15, 20, 25, 35), 
                                        labels = c("<20", "20-25", ">25"),
                                        include.lowest = T, right = T))
## # A tibble: 488 × 19
##                id             na         nas135 female            age       urinat3p
##    <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1               1            138              0 female       24.20534              1
## 2               2            142              0   male       44.28200              0
## 3               3            151              0   male       41.96304              0
## 4               4            139              0   male       28.19713              1
## 5               5            145              0 female       30.18207              0
## 6               6            140              0 female       28.29295              0
## 7               7            142              0   male       34.44216              0
## 8               8            140              0   male       26.16016              0
## 9               9            141              0   male       50.35729              0
## 10             10            138              0   male       49.39083              0
## # ... with 478 more rows, and 13 more variables: prewt <S3: labelled>, postwt <S3:
## #   labelled>, wtdiff <S3: labelled>, height <S3: labelled>, bmi <S3: labelled>,
## #   runtime <S3: labelled>, trainpace <S3: labelled>, prevmara <S3: labelled>,
## #   fluidint <fctr>, waterload <fctr>, nsaid <fctr>, wtdiffc <fctr>, bmic <fctr>

Equivalent to the code in base R

marathon$bmic = cut(marathon$bmi, breaks = c(15, 20, 25, 35), 
                                        labels = c("<20", "20-25", ">25"),
                                        include.lowest = T, right = T)

Chaining

You can wrap the different function inside each other

arrange(
   filter(
      select(
         mutate(marathon,
                timeh = runtime/60),
         id, female, age, na, bmi, timeh),
      female == "male", bmi > 30), timeh, desc(age))
## # A tibble: 6 × 6
##               id female            age             na            bmi          timeh
##   <S3: labelled> <fctr> <S3: labelled> <S3: labelled> <S3: labelled> <S3: labelled>
## 1            305   male       31.99726            143       32.09602       3.033333
## 2            485   male       38.83915            128       31.38768       4.166667
## 3            426   male       35.51266            143       32.20786       4.500000
## 4            487   male       38.33265            131       31.22440       5.133333
## 5            404   male       48.55031            145       30.60518       5.583333
## 6            393   male       45.57974            138       31.18349       5.716667

Difficult to read: because the order of the operations is from inside to out

The %>% (pipe) operator

Ceci n'est pas un pipe

x %>% f, rather than f(x)

marathon %>% 
   mutate(timeh = runtime/60) %>%
   select(id, female, age, na, bmi, timeh) %>%
   filter(female == "male", bmi > 30) %>%
   arrange(timeh, desc(age))
## # A tibble: 6 × 6
##               id female            age             na            bmi          timeh
##   <S3: labelled> <fctr> <S3: labelled> <S3: labelled> <S3: labelled> <S3: labelled>
## 1            305   male       31.99726            143       32.09602       3.033333
## 2            485   male       38.83915            128       31.38768       4.166667
## 3            426   male       35.51266            143       32.20786       4.500000
## 4            487   male       38.33265            131       31.22440       5.133333
## 5            404   male       48.55031            145       30.60518       5.583333
## 6            393   male       45.57974            138       31.18349       5.716667

Explore and summarize data

Summary statistics

In descriptive statistics, summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible

– Wikipedia

They can be univariate or (multi) bivariate and they depend on the type of the variable(s)

  • continuous

  • categorical

A generic summary of the variable of the data set can be obtained by using the summary() function.
NB to be useful, the variables need to be properly defined

summary(marathon)
##        id              na            nas135         female         age       
##  Min.   :  1.0   Min.   :114.0   Min.   :0.000   male  :322   Min.   :19.44  
##  1st Qu.:122.8   1st Qu.:138.0   1st Qu.:0.000   female:166   1st Qu.:31.40  
##  Median :244.5   Median :141.0   Median :0.000                Median :38.80  
##  Mean   :244.5   Mean   :140.4   Mean   :0.127                Mean   :38.85  
##  3rd Qu.:366.2   3rd Qu.:143.0   3rd Qu.:0.000                3rd Qu.:45.71  
##  Max.   :488.0   Max.   :158.0   Max.   :1.000                Max.   :73.08  
##                                                               NA's   :2      
##     urinat3p           prewt            postwt           wtdiff            height     
##  Min.   :0.00000   Min.   : 42.05   Min.   : 42.73   Min.   :-7.0454   Min.   :1.511  
##  1st Qu.:0.00000   1st Qu.: 60.00   1st Qu.: 60.17   1st Qu.:-1.7756   1st Qu.:1.676  
##  Median :0.00000   Median : 68.64   Median : 67.95   Median :-0.6818   Median :1.727  
##  Mean   :0.05625   Mean   : 69.26   Mean   : 68.62   Mean   :-0.6895   Mean   :1.734  
##  3rd Qu.:0.00000   3rd Qu.: 75.91   3rd Qu.: 75.23   3rd Qu.: 0.4546   3rd Qu.:1.800  
##  Max.   :1.00000   Max.   :101.82   Max.   :100.00   Max.   : 4.0909   Max.   :2.108  
##  NA's   :8         NA's   :19       NA's   :20       NA's   :33        NA's   :18     
##       bmi           runtime        trainpace        prevmara      
##  Min.   :16.77   Min.   :142.0   Min.   :330.0   Min.   :  0.000  
##  1st Qu.:21.16   1st Qu.:195.0   1st Qu.:450.0   1st Qu.:  2.000  
##  Median :22.54   Median :220.0   Median :480.0   Median :  5.000  
##  Mean   :22.94   Mean   :225.5   Mean   :488.8   Mean   :  8.839  
##  3rd Qu.:24.25   3rd Qu.:248.0   3rd Qu.:525.0   3rd Qu.: 10.000  
##  Max.   :32.21   Max.   :400.0   Max.   :780.0   Max.   :120.000  
##  NA's   :23      NA's   :11      NA's   :59      NA's   :3        
##                            fluidint   waterload   nsaid             wtdiffc   
##  Every mile                    :276   No  :122   No  :218   -1.0 to -0.1:109  
##  Every other mile              :168   Yes :354   Yes :259   -2.0 to -1.1: 99  
##  Every third mile or less often: 40   NA's: 12   NA's: 11   0.0 to 0.9  : 96  
##  NA's                          :  4                         -5.0 to -2.1: 84  
##                                                             1.0 to 1.9  : 39  
##                                                             (Other)     : 23  
##                                                             NA's        : 38

For a more detailed description check the describe() function in the Hmisc package

library(Hmisc)
describe(marathon)

Univariate continuous variable

R provides a wide range of functions for obtaining summary statistics of a continuous variable

# specific quantiles + mean
summary(marathon$na)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   114.0   138.0   141.0   140.4   143.0   158.0
# mean, sd, var, range
with(marathon, c(mean = mean(na), sd = sd(na), var = var(na), range = range(na)))
##       mean         sd        var     range1     range2 
## 140.405738   4.752102  22.582472 114.000000 158.000000
# skewness and kurtosis
library(moments)
with(marathon, c(skewness = skewness(na), kurtosis = kurtosis(na)))
##  skewness  kurtosis 
## -0.654781  6.784140

At the finish line, the runners had a mean serum sodium concentration of 140 ± 5 mmol per liter (range, 114: 158). The middle 50% of the marathon runners had sodium concentration between 138 and 143 nmmo/liter.

Most common graphical presentation: histogram & boxplot

ggplot(marathon, aes(x = na)) + 
   geom_histogram()

ggplot(marathon, aes(x = factor(1), y = na)) + 
   geom_boxplot()

Similar to the code in base R

with(marathon, hist(na))

boxplot(marathon$na, main = "Boxplot")

More elaborate plots

with(marathon, hist(na, freq = F))
d <- density(marathon$na)
lines(d)

ggplot(marathon, aes(x = na)) +
   geom_histogram(aes(y = ..density..), 
      binwidth = density(marathon$na)$bw) +
   geom_density() + theme_bw()

Univariate categorical

  • table of frequencies are the tipical way to summarize a categorical variable
# summary(marathon$nas135)
# xtabs(~ nas135, data = marathon)
(freq = table(marathon$nas135))
## 
##   0   1 
## 426  62
# get percentages
cbind(freq, percent = prop.table(freq))
##   freq   percent
## 0  426 0.8729508
## 1   62 0.1270492

Most common graphical presentation: barplot

barplot(freq)

ggplot(marathon, aes(x = nas135)) + geom_bar()

Two categorical variables

Two way table

#xtabs(~ nas135 + female, data = marathon)
(tab = with(marathon, table(nas135, female)))
##       female
## nas135 male female
##      0  297    129
##      1   25     37
# get proportion (margin: 1 by row, 2 by col)
prop.table(tab, 1)
##       female
## nas135      male    female
##      0 0.6971831 0.3028169
##      1 0.4032258 0.5967742
# add margin
addmargins(tab)
##       female
## nas135 male female Sum
##    0    297    129 426
##    1     25     37  62
##    Sum  322    166 488

Check out the Epi package

library(Epi)
stat.table(list(nas135, female), 
           list(count(), percent(female)), 
           marathon, margin = T)
##  --------------------------------- 
##          ---------female---------- 
##  nas135      male  female   Total  
##  --------------------------------- 
##  0            297     129     426  
##              69.7    30.3   100.0  
##                                    
##  1             25      37      62  
##              40.3    59.7   100.0  
##                                    
##                                    
##  Total        322     166     488  
##              66.0    34.0   100.0  
##  ---------------------------------

Graphical presentation

barplot(tab, legend.text = T)

ggplot(marathon, aes(x = female, fill = nas135)) + 
   geom_bar()

Testing for a two-way table

Differences in proportion

Let’s compare the proportions of cases of hyponatremia in men and women.

chisq.test(tab, correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 20.837, df = 1, p-value = 5.002e-06

A programming example

# Expected counts
expect = margin.table(tab, 1) %*% t(margin.table(tab, 2)) / margin.table(tab)
# Computing Chi2 statistic and df
chi2 = sum((tab - expect)^2/expect)
df = (ncol(tab)-1)*(nrow(tab)-1)
cat("The test statistics is ", chi2, ",\n with ", 
    df, " degrees of freedom \n p-value =", 1-pchisq(chi2, df))
## The test statistics is  20.83654 ,
##  with  1  degrees of freedom 
##  p-value = 5.001948e-06

Bivariate: continuous outcome & binary exposure

Summary statistics by levels of the exposure

# mean, std, n, and se of bmi for levels of nas135
descr = marathon %>% group_by(nas135) %>%
   summarise(mean = mean(bmi, na.rm = T),
             sd = sd(bmi, na.rm = T),
             n = sum(!is.na(bmi)),
             se = sd/sqrt(n))
descr
## # A tibble: 2 × 5
##           nas135     mean       sd     n        se
##   <S3: labelled>    <dbl>    <dbl> <int>     <dbl>
## 1              0 22.96212 2.521375   405 0.1252881
## 2              1 22.75710 3.711100    60 0.4791009

Similar to the code in base R

aggregate(bmi ~ nas135, data = marathon, FUN = summary)
with(marathon, tapply(bmi, nas135, mean, na.rm = T))

Graphical presentation: boxplot

boxplot(bmi ~ nas135, data = marathon)



ggplot(marathon, aes(x = nas135, y = bmi)) + 
   geom_boxplot()

Graphical presentation: errorbar plot

ggplot(descr, aes(x = nas135, y = mean, 
   fill = nas135)) + geom_bar(stat="identity") +
   geom_errorbar(aes(ymin = mean - se, 
               ymax = mean + se), width = .1)

with(descr, {
   bp = barplot(height = mean, ylim = c(0, 25),
           names.arg = levels(nas135))
   errbar(bp, mean, mean - se, mean + se, add = T)
   }
)

Testing

Differences in means

Let’s compare the means of bmi between cases and non-cases

t.test(bmi ~ nas135, data = marathon)
## 
##  Welch Two Sample t-test
## 
## data:  bmi by nas135
## t = 0.41399, df = 67.299, p-value = 0.6802
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7833524  1.1933799
## sample estimates:
## mean in group 0 mean in group 1 
##        22.96212        22.75710

NB by default var.equal = FALSE

Continuous outcome & 2 binary exposures

#with(marathon, tapply(bmi, list(female, nas135), mean, na.rm = T))
xtabs(bmi ~ female + nas135, data = aggregate(bmi ~ female + nas135, marathon, mean))
##         nas135
## female          0        1
##   male   23.61298 25.21807
##   female 21.52093 21.11646
stat.table(list(female, nas135), list(mean = mean(bmi, na.rm = T),
                                      sd = sd(bmi, na.rm = T)),
           data = marathon, margin = T)
##  --------------------------------- 
##          ---------nas135---------- 
##  female         0       1   Total  
##  --------------------------------- 
##  male       23.61   25.22   23.74  
##              2.52    3.80    2.67  
##                                    
##  female     21.52   21.12   21.43  
##              1.85    2.61    2.04  
##                                    
##                                    
##  Total      22.96   22.76   22.94  
##              2.52    3.71    2.70  
##  ---------------------------------

Figure 1




library(tidyr)
marathon %>%
   group_by(wtdiffc) %>%
   summarise(n = n(), severe = 100*mean(na <= 130),
     nosevere = 100*(mean(na <= 135)) - severe) %>%
   na.omit(cols = wdiffc) %>% # remove NA
   gather(severe, risk, severe, nosevere) %>% 
   mutate(n = replace(n, severe == "severe", NA)) %>%
   ggplot(aes(x = wtdiffc, y = risk, fill = severe)) +
   geom_bar(stat = "identity", position = "stack") +
   geom_text(aes(label = n), position = 
                position_stack(), vjust=-0.25) +
   labs(x = "Weight Change (kg)", 
        y = "Risk of Hyponatremia (%)")