Get familiar with R
Basic data types in R
Read data into R
Manipulate data
Explore and summarize data
Make exploratory plots
last update: October 8, 2016
Get familiar with R
Basic data types in R
Read data into R
Manipulate data
Explore and summarize data
Make exploratory plots
"R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis."
Free and open source
Powerful and flexible tool designed for statistical computing and graphics
Programming language
Many packages available on CRAN. See https://cran.r-project.org/web/views/ for an overview
Leading software for statistics, data analysis, and machine learning (https://www.r-bloggers.com/r-passes-sas-in-scholarly-use-finally/)
Big community
Integration with many other language (Python, and C++, Perl and Java)
Support for reproducible research (rmarkdown), interactive analysis (shiny)
Installation is via the installer R-3.3.1-win.exe . Just double-click on the icon and follow the instructions. When installing on a 64-bit version of Windows the options will include 32- or 64-bit versions of R (and the default is to install both). You can uninstall R from the Control Panel.
Integrated Development Environment (IDE) for R
Syntax highlighting, code completion, and smart indentation
Easily manage multiple working directories using projects
Workspace browser and data viewer
Plot history, zooming, and flexible image and PDF export
Integrated R help and documentation
Just follow these steps:
- Go to RStudio Download.
- Click the Download RStudio Desktop button.
- Select the installation file for your system.
- Run the installation file.
https://www.rstudio.com/ https://www.rstudio.com/products/rstudio/download2/
From the console (interactive)
calculator
create variables
run functions
Using an R script (reproducible)
keep trace of what you write
try interactive and then add to the script
# this is a comment 2 + 2 ## [1] 4
2*3 ## [1] 6
3^2 ## [1] 9
((2 + 2)*2*3)^2 ## [1] 576
exp(2) ## [1] 7.389056
log(2.718282) ## [1] 1
9^.5 ## [1] 3
round(pi, 2) ## [1] 3.14
# create a variable x = 2 # same as: x <- 2 x ## [1] 2
# R is case-sensitive: # 'X' is not the same as 'x' X ## Error in eval(expr, envir, enclos): object 'X' not found
# which objects are in the workspace ls() ## [1] "x"
x + 2 ## [1] 4
x^2 + x/2 + 5 ## [1] 10
y = "This is a string" print(y) ## [1] "This is a string"
ls() ## [1] "x" "y"
# remove objects
rm("x", "y")
ls()
## character(0)
Vectors
Matrices
Lists
Data frames
Simpler and basic data structure in R
Contain elements of the same type (numeric, character, logical)
Can be created using the combine function c()
# examples of three vectors of different type
x = c(2, 5, 9, 15, 4)
letters = c("a", "b", "r", "d", "e")
y = c(T, T, F, FALSE, TRUE)
# print one vector to the console x ## [1] 2 5 9 15 4
[]# First and Fourth element of x x[c(1, 4)] ## [1] 2 15
# Changing the third elemnt in letters letters[3] = "c" letters ## [1] "a" "b" "c" "d" "e"
# Using logical x < 7 # is it (each element of) x less than 7 ## [1] TRUE TRUE FALSE FALSE TRUE
x[x < 7] # select only those elements of x < 7 ## [1] 2 5 4
letters[y] ## [1] "a" "b" "e"
| Operator | symbol |
|---|---|
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| ^ | exponentiation |
| Operator | symbol |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| x | y | x OR y |
| x & y | x AND y |
# find out about the class class(letters) ## [1] "character"
# how many elements length(y) ## [1] 5
# Depending of the type of vector we can obtain different statistics mean(x) ## [1] 7
# create a vector (stat) with min, median, and max of x stat = c(minimun = min(x), median = median(x), maximum = max(x)) stat ## minimun median maximum ## 2 5 15
https://cran.r-project.org/doc/contrib/Short-refcard.pdf
| function | explanation |
|---|---|
sin, cos, tan, sqrt, log, log10, exp |
|
max(x) |
maximum of the elements of x |
min(x) |
minimum of the elements of x |
range(x) |
id. then c(min(x), max(x)) |
sum(x) |
sum of the elements of x |
diff(x) |
lagged and iterated differences of vector x |
prod(x) |
product of the elements of x |
mean(x) |
mean of the elements of x |
median(x) |
median of the elements of x |
| function | explanation |
|---|---|
quantile(x,probs=) |
sample quantiles corresponding to the given probabilities (defaults to 0,.25,.5,.75,1) |
var(x) or cov(x) |
variance of the elements of x |
sd(x) |
standard deviation of x |
cor(x) |
correlation matrix of x |
var(x, y) or cov(x, y) |
covariance between x and y |
cor(x, y) |
linear correlation between x and y |
round(x, n) |
rounds the elements of x to n decimals |
log(x, base) |
computes the logarithm of x with base base |
?Arithmetic
# create a factor
gender = factor(c("male", "female", "male", "male", "female"))
gender
## [1] male female male male female
## Levels: female male
class(gender) ## [1] "factor"
# changing the labels
levels(gender) = c("woman", "man")
gender
## [1] man woman man man woman
## Levels: woman man
# Create a matrix from elements elements = seq(5, 30, 5) A = matrix(elements, nrow = 2, ncol = 3) A ## [,1] [,2] [,3] ## [1,] 5 15 25 ## [2,] 10 20 30
# Create by combining columns a.1 = c(5, 10) a.2 = c(15, 20) a.3 = c(25, 30) cbind(a.1, a.2, a.3) ## a.1 a.2 a.3 ## [1,] 5 15 25 ## [2,] 10 20 30
# Create by combining rows a1. = c(5, 10, 15) a2. = c(20, 25, 30) rbind(a1., a2.) ## [,1] [,2] [,3] ## a1. 5 10 15 ## a2. 20 25 30
[ , ]: rows and columns before and after the commaA[1, c(1, 3)] ## [1] 5 25
# First row A[1, ] ## [1] 5 15 25
# Second columns A[, 2] ## [1] 15 20
# Change a value A[1, 1] = 0 A ## [,1] [,2] [,3] ## [1,] 0 15 25 ## [2,] 10 20 30
Basic functions
class(A) ## [1] "matrix"
dim(A) ## [1] 2 3
nrow(A) ## [1] 2
ncol(A) ## [1] 3
names(A) ## NULL
| function | explanation |
|---|---|
A * B |
Element-wise multiplication |
A %*% B |
Matrix multiplication |
A %o% B |
Outer product. AB' |
crossprod(A,B) |
A'B |
t(A) |
Transpose |
diag(x) |
Creates diagonal matrix with elements of x in the principal diagona |
diag(A) |
Returns a vector containing the elements of the principal diagonal |
solve(A) |
Inverse of A where A is a square matrix |
rowMeans(A) |
Returns vector of row means |
rowSums(A) |
Returns vector of column sums |
# create a list using the list() function mylist = list(first = x, letters, A, c(1, 2)) mylist ## $first ## [1] 2 5 9 15 4 ## ## [[2]] ## [1] "a" "b" "c" "d" "e" ## ## [[3]] ## [,1] [,2] [,3] ## [1,] 0 15 25 ## [2,] 10 20 30 ## ## [[4]] ## [1] 1 2
[[]] or $ (if it's a named list) to access the element# with name mylist$first ## [1] 2 5 9 15 4
# or position number mylist[[1]] ## [1] 2 5 9 15 4
class(mylist) ## [1] "list"
length(mylist) ## [1] 4
names(mylist) ## [1] "first" "" "" ""
# How to create a data frame mydata = data.frame( errors = x, letters = letters, logical = y, sex = gender ) mydata ## errors letters logical sex ## 1 2 a TRUE man ## 2 5 b TRUE woman ## 3 9 c FALSE man ## 4 15 d FALSE man ## 5 4 e TRUE woman
marathon.Rda
"Hyponatremia among Runners in the Boston Marathon", New England Journal of Medicine, 2005, Volume 352:1550-1556.
Hyponatremia has emerged as an important cause of race-related death and life-threatening illness among marathon runners. We studied a cohort of marathon runners to estimate the incidence of hyponatremia and to identify the principal risk factors.
Different formats of marathon.Rda
Reading data is the first step in a project
R can read almost any file format and has many dedicated package (foreign, haven, read_excel, and many more)
Import data from databases, webscraping, etc.
NB: You need to provide the location (path) of the row data or place them in the working directory
# to get you working directory getwd() ## [1] "/Users/alecri/Dropbox/KI/Teaching/Rintro/slides/ioslides"
# to change it
## setwd("path/to/file/data")
| format | function | package | example |
|---|---|---|---|
.txt |
read.table() |
base |
read.table("http://alecri.github.io/downloads/data/marathon.txt") |
.csv |
read.csv() |
base |
read.csv("http://alecri.github.io/downloads/data/marathon.csv") |
.dta |
read_dta() |
haven |
read_dta("http://alecri.github.io/downloads/data/marathon.dta") |
.sav |
read_spss() |
haven |
read_spss("http://alecri.github.io/downloads/data/marathon.sav") |
.b7dat |
read_b7dat() |
haven |
read_b7dat("http://alecri.github.io/downloads/data/marathon.b7dat") |
.xlsx |
read_excel() |
readxl |
read_excel("data/marathon.xlsx") |
… and many more
#load("data/marathon.Rdata")
load(url("http://alecri.github.io/downloads/data/marathon.Rdata"))
View(marathon)
# What are the dimensions (i.e. rows and columns) dim(marathon) ## [1] 488 18
# Which variable names(marathon) ## [1] "id" "na" "nas135" "female" "age" "urinat3p" "prewt" ## [8] "postwt" "wtdiff" "height" "bmi" "runtime" "trainpace" "prevmara" ## [15] "fluidint" "waterload" "nsaid" "wtdiffc"
# Get the structure of the data str(marathon) ## Classes 'tbl_df', 'tbl' and 'data.frame': 488 obs. of 18 variables: ## $ id :Classes 'labelled', 'integer' atomic [1:488] 1 2 3 4 5 6 7 8 9 10 ... ## .. ..- attr(*, "label")= chr "ID number" ## $ na :Classes 'labelled', 'integer' atomic [1:488] 138 142 151 139 145 140 142 140 141 138 ... ## .. ..- attr(*, "label")= chr "Serum sodium concentration (mmol/liter)" ## $ nas135 :Class 'labelled' atomic [1:488] 0 0 0 0 0 0 0 0 0 0 ... ## .. ..- attr(*, "label")= chr "Serum sodium concentration <= 135 mmol/liter" ## .. ..- attr(*, "labels")= Named int [1:2] 0 1 ## .. .. ..- attr(*, "names")= chr [1:2] "No" "Yes" ## $ female : Factor w/ 2 levels "male","female": 2 1 1 1 2 2 1 1 1 1 ... ## ..- attr(*, "label")= chr "Female" ## $ age :Classes 'labelled', 'numeric' atomic [1:488] 24.2 44.3 42 28.2 30.2 ... ## .. ..- attr(*, "label")= chr "Age (years)" ## $ urinat3p :Class 'labelled' atomic [1:488] 1 0 0 1 0 0 0 0 0 0 ... ## .. ..- attr(*, "label")= chr "Urine output" ## .. ..- attr(*, "labels")= Named int [1:2] 0 1 ## .. .. ..- attr(*, "names")= chr [1:2] "<3" ">=3" ## $ prewt :Classes 'labelled', 'numeric' atomic [1:488] NA NA NA NA NA NA NA NA NA NA ... ## .. ..- attr(*, "label")= chr "Weight (kg) pre-race" ## $ postwt :Classes 'labelled', 'numeric' atomic [1:488] NA NA NA NA 50.7 ... ## .. ..- attr(*, "label")= chr "Weight (kg) post-race" ## $ wtdiff :Classes 'labelled', 'numeric' atomic [1:488] NA NA NA NA NA NA NA NA NA NA ... ## .. ..- attr(*, "label")= chr "Weight change (kg) pre/post race" ## $ height :Classes 'labelled', 'numeric' atomic [1:488] 1.73 NA NA 1.73 NA ... ## .. ..- attr(*, "label")= chr "Height (cm)" ## $ bmi :Classes 'labelled', 'numeric' atomic [1:488] NA NA NA NA NA NA NA NA NA NA ... ## .. ..- attr(*, "label")= chr "Body-mass index (kg/m^2)" ## $ runtime :Classes 'labelled', 'integer' atomic [1:488] NA 161 156 346 185 233 183 162 182 190 ... ## .. ..- attr(*, "label")= chr "Race duration (minutes)" ## $ trainpace:Classes 'labelled', 'numeric' atomic [1:488] 480 430 360 630 NA NA 435 450 435 440 ... ## .. ..- attr(*, "label")= chr "Training pace (seconds/mile)" ## $ prevmara :Classes 'labelled', 'integer' atomic [1:488] 3 40 40 1 3 25 19 2 80 10 ... ## .. ..- attr(*, "label")= chr "Previous marathons (no.)" ## $ fluidint : Factor w/ 3 levels "Every mile","Every other mile",..: 1 1 2 1 1 2 2 3 1 1 ... ## ..- attr(*, "label")= chr "Self-reported fluid intake" ## $ waterload: Factor w/ 2 levels "No","Yes": 2 2 NA 2 2 2 2 1 2 2 ... ## ..- attr(*, "label")= chr "Self-reported water loading" ## $ nsaid : Factor w/ 2 levels "No","Yes": 2 2 NA 1 2 1 2 1 2 2 ... ## ..- attr(*, "label")= chr "Self-reported use of NSAIDs" ## $ wtdiffc : Factor w/ 7 levels "3.0 to 4.9","2.0 to 2.9",..: NA NA NA NA NA NA NA NA NA NA ... ## ..- attr(*, "label")= chr "Categorization of weight change"
# First 6 observations # use 'tail' for the last 6 head(marathon) ## # A tibble: 6 × 18 ## id na nas135 female age urinat3p ## <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled> ## 1 1 138 0 female 24.20534 1 ## 2 2 142 0 male 44.28200 0 ## 3 3 151 0 male 41.96304 0 ## 4 4 139 0 male 28.19713 1 ## 5 5 145 0 female 30.18207 0 ## 6 6 140 0 female 28.29295 0 ## # ... with 12 more variables: prewt <S3: labelled>, postwt <S3: labelled>, wtdiff <S3: ## # labelled>, height <S3: labelled>, bmi <S3: labelled>, runtime <S3: labelled>, ## # trainpace <S3: labelled>, prevmara <S3: labelled>, fluidint <fctr>, waterload <fctr>, ## # nsaid <fctr>, wtdiffc <fctr>
dplyr packageA fast and consistent tool for working with data frame
Exploratory data analysis and manipulation
Make it easier to choose what to do, how to program and execute it
Identify the most important data manipulation verbs and make them easy to use from R.
install.packages("dplyr")
library(dplyr)
## To learn more about that
browseVignettes(package = "dplyr")
filter() and slice()
arrange()
select() and rename()
distinct()
mutate() and transmute()
summarise()
Select a subset of rows
# Filter male with a bmi > 30 filter(marathon, female == "male", bmi > 30) ## # A tibble: 6 × 18 ## id na nas135 female age urinat3p ## <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled> ## 1 305 143 0 male 31.99726 0 ## 2 393 138 0 male 45.57974 1 ## 3 404 145 0 male 48.55031 1 ## 4 426 143 0 male 35.51266 0 ## 5 485 128 1 male 38.83915 0 ## 6 487 131 1 male 38.33265 0 ## # ... with 12 more variables: prewt <S3: labelled>, postwt <S3: labelled>, wtdiff <S3: ## # labelled>, height <S3: labelled>, bmi <S3: labelled>, runtime <S3: labelled>, ## # trainpace <S3: labelled>, prevmara <S3: labelled>, fluidint <fctr>, waterload <fctr>, ## # nsaid <fctr>, wtdiffc <fctr>
Equivalent to the code in base R
subset(marathon, female == "male" & bmi > 30) marathon[marathon$female == "male" & marathon$bmi > 30 & !is.na(marathon$bmi), ]
Select rows by position, use slice():
# Select the first 10 obs slice(marathon, 1:10) ## # A tibble: 10 × 18 ## id na nas135 female age urinat3p ## <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled> ## 1 1 138 0 female 24.20534 1 ## 2 2 142 0 male 44.28200 0 ## 3 3 151 0 male 41.96304 0 ## 4 4 139 0 male 28.19713 1 ## 5 5 145 0 female 30.18207 0 ## 6 6 140 0 female 28.29295 0 ## 7 7 142 0 male 34.44216 0 ## 8 8 140 0 male 26.16016 0 ## 9 9 141 0 male 50.35729 0 ## 10 10 138 0 male 49.39083 0 ## # ... with 12 more variables: prewt <S3: labelled>, postwt <S3: labelled>, wtdiff <S3: ## # labelled>, height <S3: labelled>, bmi <S3: labelled>, runtime <S3: labelled>, ## # trainpace <S3: labelled>, prevmara <S3: labelled>, fluidint <fctr>, waterload <fctr>, ## # nsaid <fctr>, wtdiffc <fctr>
Equivalent to the code in base R
marathon[1:10, ]
Arrange or reorder the rows
# Sort by (increasing) runtime and decreasing age arrange(marathon, runtime, desc(age)) ## # A tibble: 488 × 18 ## id na nas135 female age urinat3p ## <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled> ## 1 222 144 0 male 54.16838 0 ## 2 208 137 0 male 33.97399 0 ## 3 3 151 0 male 41.96304 0 ## 4 464 133 1 male 32.89802 0 ## 5 429 134 1 male 31.52361 0 ## 6 2 142 0 male 44.28200 0 ## 7 104 141 0 male 44.93087 0 ## 8 8 140 0 male 26.16016 0 ## 9 285 141 0 male 30.55441 0 ## 10 234 139 0 male 39.13758 0 ## # ... with 478 more rows, and 12 more variables: prewt <S3: labelled>, postwt <S3: ## # labelled>, wtdiff <S3: labelled>, height <S3: labelled>, bmi <S3: labelled>, ## # runtime <S3: labelled>, trainpace <S3: labelled>, prevmara <S3: labelled>, ## # fluidint <fctr>, waterload <fctr>, nsaid <fctr>, wtdiffc <fctr>
Equivalent to the code in base R
marathon[order(marathon$runtime, -marathon$age), ]
Select a subset of columns
# Select id, na, bmi, runtime select(marathon, id, na, bmi, runtime) ## # A tibble: 488 × 4 ## id na bmi runtime ## <S3: labelled> <S3: labelled> <S3: labelled> <S3: labelled> ## 1 1 138 NA NA ## 2 2 142 NA 161 ## 3 3 151 NA 156 ## 4 4 139 NA 346 ## 5 5 145 NA 185 ## 6 6 140 NA 233 ## 7 7 142 NA 183 ## 8 8 140 NA 162 ## 9 9 141 NA 182 ## 10 10 138 NA 190 ## # ... with 478 more rows
Equivalent to the code in base R
marathon[, c("id", "na", "bmi", "runtime")]
subset(marathon, select = c("id", "na", "bmi", "runtime"))
Rename some of the variables
# Rename nas135 with hyponatremia rename(marathon, hyponatremia = nas135) ## # A tibble: 488 × 18 ## id na hyponatremia female age urinat3p ## <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled> ## 1 1 138 0 female 24.20534 1 ## 2 2 142 0 male 44.28200 0 ## 3 3 151 0 male 41.96304 0 ## 4 4 139 0 male 28.19713 1 ## 5 5 145 0 female 30.18207 0 ## 6 6 140 0 female 28.29295 0 ## 7 7 142 0 male 34.44216 0 ## 8 8 140 0 male 26.16016 0 ## 9 9 141 0 male 50.35729 0 ## 10 10 138 0 male 49.39083 0 ## # ... with 478 more rows, and 12 more variables: prewt <S3: labelled>, postwt <S3: ## # labelled>, wtdiff <S3: labelled>, height <S3: labelled>, bmi <S3: labelled>, ## # runtime <S3: labelled>, trainpace <S3: labelled>, prevmara <S3: labelled>, ## # fluidint <fctr>, waterload <fctr>, nsaid <fctr>, wtdiffc <fctr>
Equivalent to the code in base R
names(marathon)[3] = "hyponatremia"
Rename some of the variables
# How many distinct id distinct(marathon, id) ## # A tibble: 488 × 1 ## id ## <S3: labelled> ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5 ## 6 6 ## 7 7 ## 8 8 ## 9 9 ## 10 10 ## # ... with 478 more rows
Equivalent to the code in base R
length(unique(marathon$id))
Create new variables (e.g. runtime in hours)
mutate(marathon, timeh = runtime/60) ## # A tibble: 488 × 19 ## id na nas135 female age urinat3p ## <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled> ## 1 1 138 0 female 24.20534 1 ## 2 2 142 0 male 44.28200 0 ## 3 3 151 0 male 41.96304 0 ## 4 4 139 0 male 28.19713 1 ## 5 5 145 0 female 30.18207 0 ## 6 6 140 0 female 28.29295 0 ## 7 7 142 0 male 34.44216 0 ## 8 8 140 0 male 26.16016 0 ## 9 9 141 0 male 50.35729 0 ## 10 10 138 0 male 49.39083 0 ## # ... with 478 more rows, and 13 more variables: prewt <S3: labelled>, postwt <S3: ## # labelled>, wtdiff <S3: labelled>, height <S3: labelled>, bmi <S3: labelled>, ## # runtime <S3: labelled>, trainpace <S3: labelled>, prevmara <S3: labelled>, ## # fluidint <fctr>, waterload <fctr>, nsaid <fctr>, wtdiffc <fctr>, timeh <S3: labelled>
Equivalent to the code in base R
transform(marathon, timeh = runtime/60) marathon$timeh = marathon$runtime/60
How to create a categorical variable
mutate(marathon, bmic = cut(bmi, breaks = c(15, 20, 25, 35),
labels = c("<20", "20-25", ">25"),
include.lowest = T, right = T))
## # A tibble: 488 × 19
## id na nas135 female age urinat3p
## <S3: labelled> <S3: labelled> <S3: labelled> <fctr> <S3: labelled> <S3: labelled>
## 1 1 138 0 female 24.20534 1
## 2 2 142 0 male 44.28200 0
## 3 3 151 0 male 41.96304 0
## 4 4 139 0 male 28.19713 1
## 5 5 145 0 female 30.18207 0
## 6 6 140 0 female 28.29295 0
## 7 7 142 0 male 34.44216 0
## 8 8 140 0 male 26.16016 0
## 9 9 141 0 male 50.35729 0
## 10 10 138 0 male 49.39083 0
## # ... with 478 more rows, and 13 more variables: prewt <S3: labelled>, postwt <S3:
## # labelled>, wtdiff <S3: labelled>, height <S3: labelled>, bmi <S3: labelled>,
## # runtime <S3: labelled>, trainpace <S3: labelled>, prevmara <S3: labelled>,
## # fluidint <fctr>, waterload <fctr>, nsaid <fctr>, wtdiffc <fctr>, bmic <fctr>
Equivalent to the code in base R
marathon$bmic = cut(marathon$bmi, breaks = c(15, 20, 25, 35),
labels = c("<20", "20-25", ">25"),
include.lowest = T, right = T)
You can wrap the different function inside each other
arrange(
filter(
select(
mutate(marathon,
timeh = runtime/60),
id, female, age, na, bmi, timeh),
female == "male", bmi > 30), timeh, desc(age))
## # A tibble: 6 × 6
## id female age na bmi timeh
## <S3: labelled> <fctr> <S3: labelled> <S3: labelled> <S3: labelled> <S3: labelled>
## 1 305 male 31.99726 143 32.09602 3.033333
## 2 485 male 38.83915 128 31.38768 4.166667
## 3 426 male 35.51266 143 32.20786 4.500000
## 4 487 male 38.33265 131 31.22440 5.133333
## 5 404 male 48.55031 145 30.60518 5.583333
## 6 393 male 45.57974 138 31.18349 5.716667
Difficult to read: because the order of the operations is from inside to out
Ceci n'est pas un pipe
x %>% f, rather than f(x)
marathon %>% mutate(timeh = runtime/60) %>% select(id, female, age, na, bmi, timeh) %>% filter(female == "male", bmi > 30) %>% arrange(timeh, desc(age)) ## # A tibble: 6 × 6 ## id female age na bmi timeh ## <S3: labelled> <fctr> <S3: labelled> <S3: labelled> <S3: labelled> <S3: labelled> ## 1 305 male 31.99726 143 32.09602 3.033333 ## 2 485 male 38.83915 128 31.38768 4.166667 ## 3 426 male 35.51266 143 32.20786 4.500000 ## 4 487 male 38.33265 131 31.22440 5.133333 ## 5 404 male 48.55031 145 30.60518 5.583333 ## 6 393 male 45.57974 138 31.18349 5.716667
In descriptive statistics, summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible
They can be univariate or (multi) bivariate and they depend on the type of the variable(s)
continuous
categorical
A generic summary of the variable of the data set can be obtained by using the summary() function.
NB to be useful, the variables need to be properly defined
summary(marathon) ## id na nas135 female age ## Min. : 1.0 Min. :114.0 Min. :0.000 male :322 Min. :19.44 ## 1st Qu.:122.8 1st Qu.:138.0 1st Qu.:0.000 female:166 1st Qu.:31.40 ## Median :244.5 Median :141.0 Median :0.000 Median :38.80 ## Mean :244.5 Mean :140.4 Mean :0.127 Mean :38.85 ## 3rd Qu.:366.2 3rd Qu.:143.0 3rd Qu.:0.000 3rd Qu.:45.71 ## Max. :488.0 Max. :158.0 Max. :1.000 Max. :73.08 ## NA's :2 ## urinat3p prewt postwt wtdiff height ## Min. :0.00000 Min. : 42.05 Min. : 42.73 Min. :-7.0454 Min. :1.511 ## 1st Qu.:0.00000 1st Qu.: 60.00 1st Qu.: 60.17 1st Qu.:-1.7756 1st Qu.:1.676 ## Median :0.00000 Median : 68.64 Median : 67.95 Median :-0.6818 Median :1.727 ## Mean :0.05625 Mean : 69.26 Mean : 68.62 Mean :-0.6895 Mean :1.734 ## 3rd Qu.:0.00000 3rd Qu.: 75.91 3rd Qu.: 75.23 3rd Qu.: 0.4546 3rd Qu.:1.800 ## Max. :1.00000 Max. :101.82 Max. :100.00 Max. : 4.0909 Max. :2.108 ## NA's :8 NA's :19 NA's :20 NA's :33 NA's :18 ## bmi runtime trainpace prevmara ## Min. :16.77 Min. :142.0 Min. :330.0 Min. : 0.000 ## 1st Qu.:21.16 1st Qu.:195.0 1st Qu.:450.0 1st Qu.: 2.000 ## Median :22.54 Median :220.0 Median :480.0 Median : 5.000 ## Mean :22.94 Mean :225.5 Mean :488.8 Mean : 8.839 ## 3rd Qu.:24.25 3rd Qu.:248.0 3rd Qu.:525.0 3rd Qu.: 10.000 ## Max. :32.21 Max. :400.0 Max. :780.0 Max. :120.000 ## NA's :23 NA's :11 NA's :59 NA's :3 ## fluidint waterload nsaid wtdiffc ## Every mile :276 No :122 No :218 -1.0 to -0.1:109 ## Every other mile :168 Yes :354 Yes :259 -2.0 to -1.1: 99 ## Every third mile or less often: 40 NA's: 12 NA's: 11 0.0 to 0.9 : 96 ## NA's : 4 -5.0 to -2.1: 84 ## 1.0 to 1.9 : 39 ## (Other) : 23 ## NA's : 38
For a more detailed description check the describe() function in the Hmisc package
library(Hmisc) describe(marathon)
R provides a wide range of functions for obtaining summary statistics of a continuous variable
# specific quantiles + mean summary(marathon$na) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 114.0 138.0 141.0 140.4 143.0 158.0
# mean, sd, var, range with(marathon, c(mean = mean(na), sd = sd(na), var = var(na), range = range(na))) ## mean sd var range1 range2 ## 140.405738 4.752102 22.582472 114.000000 158.000000
# skewness and kurtosis library(moments) with(marathon, c(skewness = skewness(na), kurtosis = kurtosis(na))) ## skewness kurtosis ## -0.654781 6.784140
At the finish line, the runners had a mean serum sodium concentration of 140 ± 5 mmol per liter (range, 114: 158). The middle 50% of the marathon runners had sodium concentration between 138 and 143 nmmo/liter.
Most common graphical presentation: histogram & boxplot
ggplot(marathon, aes(x = na)) + geom_histogram()
ggplot(marathon, aes(x = factor(1), y = na)) + geom_boxplot()
Similar to the code in base R
with(marathon, hist(na))
boxplot(marathon$na, main = "Boxplot")
More elaborate plots
with(marathon, hist(na, freq = F)) d <- density(marathon$na) lines(d)
ggplot(marathon, aes(x = na)) +
geom_histogram(aes(y = ..density..),
binwidth = density(marathon$na)$bw) +
geom_density() + theme_bw()
# summary(marathon$nas135) # xtabs(~ nas135, data = marathon) (freq = table(marathon$nas135)) ## ## 0 1 ## 426 62
# get percentages cbind(freq, percent = prop.table(freq)) ## freq percent ## 0 426 0.8729508 ## 1 62 0.1270492
Most common graphical presentation: barplot
barplot(freq)
ggplot(marathon, aes(x = nas135)) + geom_bar()
#xtabs(~ nas135 + female, data = marathon) (tab = with(marathon, table(nas135, female))) ## female ## nas135 male female ## 0 297 129 ## 1 25 37
# get proportion (margin: 1 by row, 2 by col) prop.table(tab, 1) ## female ## nas135 male female ## 0 0.6971831 0.3028169 ## 1 0.4032258 0.5967742
# add margin addmargins(tab) ## female ## nas135 male female Sum ## 0 297 129 426 ## 1 25 37 62 ## Sum 322 166 488
Check out the Epi package
library(Epi)
stat.table(list(nas135, female),
list(count(), percent(female)),
marathon, margin = T)
## ---------------------------------
## ---------female----------
## nas135 male female Total
## ---------------------------------
## 0 297 129 426
## 69.7 30.3 100.0
##
## 1 25 37 62
## 40.3 59.7 100.0
##
##
## Total 322 166 488
## 66.0 34.0 100.0
## ---------------------------------
Graphical presentation
barplot(tab, legend.text = T)
ggplot(marathon, aes(x = female, fill = nas135)) + geom_bar()
Let’s compare the proportions of cases of hyponatremia in men and women.
chisq.test(tab, correct = F) ## ## Pearson's Chi-squared test ## ## data: tab ## X-squared = 20.837, df = 1, p-value = 5.002e-06
A programming example
# Expected counts
expect = margin.table(tab, 1) %*% t(margin.table(tab, 2)) / margin.table(tab)
# Computing Chi2 statistic and df
chi2 = sum((tab - expect)^2/expect)
df = (ncol(tab)-1)*(nrow(tab)-1)
cat("The test statistics is ", chi2, ",\n with ",
df, " degrees of freedom \n p-value =", 1-pchisq(chi2, df))
## The test statistics is 20.83654 ,
## with 1 degrees of freedom
## p-value = 5.001948e-06
Summary statistics by levels of the exposure
# mean, std, n, and se of bmi for levels of nas135
descr = marathon %>% group_by(nas135) %>%
summarise(mean = mean(bmi, na.rm = T),
sd = sd(bmi, na.rm = T),
n = sum(!is.na(bmi)),
se = sd/sqrt(n))
descr
## # A tibble: 2 × 5
## nas135 mean sd n se
## <S3: labelled> <dbl> <dbl> <int> <dbl>
## 1 0 22.96212 2.521375 405 0.1252881
## 2 1 22.75710 3.711100 60 0.4791009
Similar to the code in base R
aggregate(bmi ~ nas135, data = marathon, FUN = summary) with(marathon, tapply(bmi, nas135, mean, na.rm = T))
Graphical presentation: boxplot
boxplot(bmi ~ nas135, data = marathon)
ggplot(marathon, aes(x = nas135, y = bmi)) + geom_boxplot()
Graphical presentation: errorbar plot
ggplot(descr, aes(x = nas135, y = mean,
fill = nas135)) + geom_bar(stat="identity") +
geom_errorbar(aes(ymin = mean - se,
ymax = mean + se), width = .1)
with(descr, {
bp = barplot(height = mean, ylim = c(0, 25),
names.arg = levels(nas135))
errbar(bp, mean, mean - se, mean + se, add = T)
}
)
Let’s compare the means of bmi between cases and non-cases
t.test(bmi ~ nas135, data = marathon) ## ## Welch Two Sample t-test ## ## data: bmi by nas135 ## t = 0.41399, df = 67.299, p-value = 0.6802 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -0.7833524 1.1933799 ## sample estimates: ## mean in group 0 mean in group 1 ## 22.96212 22.75710
NB by default var.equal = FALSE
#with(marathon, tapply(bmi, list(female, nas135), mean, na.rm = T)) xtabs(bmi ~ female + nas135, data = aggregate(bmi ~ female + nas135, marathon, mean)) ## nas135 ## female 0 1 ## male 23.61298 25.21807 ## female 21.52093 21.11646
stat.table(list(female, nas135), list(mean = mean(bmi, na.rm = T),
sd = sd(bmi, na.rm = T)),
data = marathon, margin = T)
## ---------------------------------
## ---------nas135----------
## female 0 1 Total
## ---------------------------------
## male 23.61 25.22 23.74
## 2.52 3.80 2.67
##
## female 21.52 21.12 21.43
## 1.85 2.61 2.04
##
##
## Total 22.96 22.76 22.94
## 2.52 3.71 2.70
## ---------------------------------
library(tidyr)
marathon %>%
group_by(wtdiffc) %>%
summarise(n = n(), severe = 100*mean(na <= 130),
nosevere = 100*(mean(na <= 135)) - severe) %>%
na.omit(cols = wdiffc) %>% # remove NA
gather(severe, risk, severe, nosevere) %>%
mutate(n = replace(n, severe == "severe", NA)) %>%
ggplot(aes(x = wtdiffc, y = risk, fill = severe)) +
geom_bar(stat = "identity", position = "stack") +
geom_text(aes(label = n), position =
position_stack(), vjust=-0.25) +
labs(x = "Weight Change (kg)",
y = "Risk of Hyponatremia (%)")