Feiyang Niu
July 7, 2017
dplyr is a R package, developed by Hadley Wickham and Romain Francoisdplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.dplyr provides blazing fast performance for in-memory data by writing key pieces in C++dplyr| Column name | Description |
|---|---|
| STUDYID | Study ID |
| USUBJID | Unique subject ID |
| PARAM | Biomarker name (Biomarker-1, …, Biomarker-4) |
| AVISIT | Visit (Baseline, Week 4, …) |
| AVISITN | Numeric visit (0, 4, …) |
| AVAL | Biomarker raw measurement |
| BASE | Biomarker baseline measurement |
| UNIT | Biomarker measurement unit |
| ETHNIC | Ethnic (Hispanic or Latino, Not Hispanic or Latino) |
| COUNTRY | Country (USA, FRA, GBR, ITA, POL, DEU) |
| SEX | Sex (M, F) |
| AGE | Age category (<65, >=65) |
| Region | Region (US, Non-US) |
| Disease_type | Disease type (A, B, C, D) |
| AGE1 | Actual age (continuous, in years) |
# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")
# Alternatively, install just dplyr:
install.packages("dplyr")
# Or the the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")
“Verbs” in dplyr are a set of functions that perform common data manipulation operations such as
select(data, colname1[, colname2, ...])select(biomarker, STUDYID, USUBJID, PARAM, AVISIT, AVISITN, AVAL)
# A tibble: 4,368 × 6
STUDYID USUBJID PARAM AVISIT AVISITN AVAL
<chr> <fctr> <chr> <fctr> <dbl> <dbl>
1 GS-Sample-01 SUBJID-1 Biomarker-1 Baseline 0 4.086984
2 GS-Sample-01 SUBJID-1 Biomarker-1 Week 4 4 10.071358
3 GS-Sample-01 SUBJID-1 Biomarker-1 Week 8 8 6.023101
4 GS-Sample-01 SUBJID-1 Biomarker-1 Week 12 12 7.057249
5 GS-Sample-01 SUBJID-1 Biomarker-1 Week 16 16 4.522446
6 GS-Sample-01 SUBJID-1 Biomarker-1 Week 20 20 3.121714
7 GS-Sample-01 SUBJID-1 Biomarker-1 Week 24 24 3.288062
8 GS-Sample-01 SUBJID-1 Biomarker-1 Week 30 30 4.336657
9 GS-Sample-01 SUBJID-1 Biomarker-1 Week 36 36 5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1 Week 42 42 4.437676
# ... with 4,358 more rows
select(data, -colname1[, -colname2, ...])select(biomarker, -BASE, -UNIT, -ETHNIC, -COUNTRY, -SEX, -AGE, -Region, -Disease_type, -AGE1)
# A tibble: 4,368 × 6
STUDYID USUBJID PARAM AVISIT AVISITN AVAL
<chr> <fctr> <chr> <fctr> <dbl> <dbl>
1 GS-Sample-01 SUBJID-1 Biomarker-1 Baseline 0 4.086984
2 GS-Sample-01 SUBJID-1 Biomarker-1 Week 4 4 10.071358
3 GS-Sample-01 SUBJID-1 Biomarker-1 Week 8 8 6.023101
4 GS-Sample-01 SUBJID-1 Biomarker-1 Week 12 12 7.057249
5 GS-Sample-01 SUBJID-1 Biomarker-1 Week 16 16 4.522446
6 GS-Sample-01 SUBJID-1 Biomarker-1 Week 20 20 3.121714
7 GS-Sample-01 SUBJID-1 Biomarker-1 Week 24 24 3.288062
8 GS-Sample-01 SUBJID-1 Biomarker-1 Week 30 30 4.336657
9 GS-Sample-01 SUBJID-1 Biomarker-1 Week 36 36 5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1 Week 42 42 4.437676
# ... with 4,358 more rows
select(data, start_col:end_col)# [STUDYID USUBJID PARAM AVISIT AVISITN AVAL] BASE UNIT ETHNIC COUNTRY SEX AGE Region Disease_type AGE1
select(biomarker, STUDYID:AVAL)
# A tibble: 4,368 × 6
STUDYID USUBJID PARAM AVISIT AVISITN AVAL
<chr> <fctr> <chr> <fctr> <dbl> <dbl>
1 GS-Sample-01 SUBJID-1 Biomarker-1 Baseline 0 4.086984
2 GS-Sample-01 SUBJID-1 Biomarker-1 Week 4 4 10.071358
3 GS-Sample-01 SUBJID-1 Biomarker-1 Week 8 8 6.023101
4 GS-Sample-01 SUBJID-1 Biomarker-1 Week 12 12 7.057249
5 GS-Sample-01 SUBJID-1 Biomarker-1 Week 16 16 4.522446
6 GS-Sample-01 SUBJID-1 Biomarker-1 Week 20 20 3.121714
7 GS-Sample-01 SUBJID-1 Biomarker-1 Week 24 24 3.288062
8 GS-Sample-01 SUBJID-1 Biomarker-1 Week 30 30 4.336657
9 GS-Sample-01 SUBJID-1 Biomarker-1 Week 36 36 5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1 Week 42 42 4.437676
# ... with 4,358 more rows
starts_with(): select all columns that start with a character stringhead(select(biomarker, starts_with('S')), 3)
# A tibble: 3 × 2
STUDYID SEX
<chr> <chr>
1 GS-Sample-01 M
2 GS-Sample-01 M
3 GS-Sample-01 M
ends_with(): select columns that end with a character stringcontains(): select columns that contain a character stringmatches(): select columns that match a regular expressionone_of(): select columns names that are from a group of namesfilter(data, criterion1[, criterion2, ...])# Find raw measurements
# - of Biomarker-1
# - of Female subjects
# - above the age of 75
# - at Week 4
temp <- filter(biomarker, PARAM == 'Biomarker-1', SEX == 'F', AGE1 > 75, AVISIT == 'Week 4')
select(temp, USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL)
# A tibble: 6 × 6
USUBJID PARAM AVISIT SEX AGE1 AVAL
<fctr> <chr> <fctr> <chr> <dbl> <dbl>
1 SUBJID-8 Biomarker-1 Week 4 F 87 5.202696
2 SUBJID-12 Biomarker-1 Week 4 F 81 56.945610
3 SUBJID-69 Biomarker-1 Week 4 F 77 5.531761
4 SUBJID-74 Biomarker-1 Week 4 F 79 6.980639
5 SUBJID-82 Biomarker-1 Week 4 F 81 3.523846
6 SUBJID-120 Biomarker-1 Week 4 F 81 10.016988
# input +--------+ +--------+ +--------+ result
# data %>% | verb | %>% | verb | %>% | verb | -> data
# frame +--------+ +--------+ +--------+ frame
%>% was first introduced in magrittr# Find raw measurements
# - of Biomarker-1
# - of Female subjects
# - above the age of 75
# - at Week 4
biomarker %>%
filter(PARAM == 'Biomarker-1',
SEX == 'F',
AGE1 > 75,
AVISIT == 'Week 4') %>%
select(USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL)
# A tibble: 6 × 6
USUBJID PARAM AVISIT SEX AGE1 AVAL
<fctr> <chr> <fctr> <chr> <dbl> <dbl>
1 SUBJID-8 Biomarker-1 Week 4 F 87 5.202696
2 SUBJID-12 Biomarker-1 Week 4 F 81 56.945610
3 SUBJID-69 Biomarker-1 Week 4 F 77 5.531761
4 SUBJID-74 Biomarker-1 Week 4 F 79 6.980639
5 SUBJID-82 Biomarker-1 Week 4 F 81 3.523846
6 SUBJID-120 Biomarker-1 Week 4 F 81 10.016988
# As a comparison, to achieve the above, below is Base R version
temp <- biomarker[biomarker$PARAM == 'Biomarker-1' &
biomarker$SEX == 'F' &
biomarker$AGE1 > 75 &
biomarker$AVISIT == 'Week 4', ]
temp[, c('USUBJID','PARAM','AVISIT','SEX','AGE1','AVAL')]
# A tibble: 6 × 6
USUBJID PARAM AVISIT SEX AGE1 AVAL
<fctr> <chr> <fctr> <chr> <dbl> <dbl>
1 SUBJID-8 Biomarker-1 Week 4 F 87 5.202696
2 SUBJID-12 Biomarker-1 Week 4 F 81 56.945610
3 SUBJID-69 Biomarker-1 Week 4 F 77 5.531761
4 SUBJID-74 Biomarker-1 Week 4 F 79 6.980639
5 SUBJID-82 Biomarker-1 Week 4 F 81 3.523846
6 SUBJID-120 Biomarker-1 Week 4 F 81 10.016988
arrange(data, colname1[, colname2, ...])arrange(data, desc(colname1))# Find raw measurements
# - of Biomarker-1
# - of Female subjects
# - above the age of 75
# - at Week 4
# Sort them in descending order
biomarker %>%
filter(PARAM == 'Biomarker-1', SEX == 'F', AGE1 > 75, AVISIT == 'Week 4') %>%
select(USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL) %>%
arrange(desc(AVAL))
# A tibble: 6 × 6
USUBJID PARAM AVISIT SEX AGE1 AVAL
<fctr> <chr> <fctr> <chr> <dbl> <dbl>
1 SUBJID-12 Biomarker-1 Week 4 F 81 56.945610
2 SUBJID-120 Biomarker-1 Week 4 F 81 10.016988
3 SUBJID-74 Biomarker-1 Week 4 F 79 6.980639
4 SUBJID-69 Biomarker-1 Week 4 F 77 5.531761
5 SUBJID-8 Biomarker-1 Week 4 F 87 5.202696
6 SUBJID-82 Biomarker-1 Week 4 F 81 3.523846
mutate(data, new_var1 = expression1[, new_var2 = expression2, ...])# Create three new columns
# - CHG: change from baseline
# - PCHG: % change from baseline
# - PRCHG: % baseline
biomarker %>%
mutate(CHG = AVAL - BASE,
PCHG = CHG / BASE * 100,
PRCHG = AVAL / BASE * 100) %>%
select(USUBJID, PARAM, AVISIT, AVAL, BASE, CHG, PCHG, PRCHG)
# A tibble: 4,368 × 8
USUBJID PARAM AVISIT AVAL BASE CHG PCHG
<fctr> <chr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 SUBJID-1 Biomarker-1 Baseline 4.086984 4.086984 0.0000000 0.000000
2 SUBJID-1 Biomarker-1 Week 4 10.071358 4.086984 5.9843738 146.425168
3 SUBJID-1 Biomarker-1 Week 8 6.023101 4.086984 1.9361169 47.372749
4 SUBJID-1 Biomarker-1 Week 12 7.057249 4.086984 2.9702647 72.676192
5 SUBJID-1 Biomarker-1 Week 16 4.522446 4.086984 0.4354616 10.654838
6 SUBJID-1 Biomarker-1 Week 20 3.121714 4.086984 -0.9652702 -23.618151
7 SUBJID-1 Biomarker-1 Week 24 3.288062 4.086984 -0.7989224 -19.547968
8 SUBJID-1 Biomarker-1 Week 30 4.336657 4.086984 0.2496724 6.108964
9 SUBJID-1 Biomarker-1 Week 36 5.992019 4.086984 1.9050345 46.612228
10 SUBJID-1 Biomarker-1 Week 42 4.437676 4.086984 0.3506914 8.580690
# ... with 4,358 more rows, and 1 more variables: PRCHG <dbl>
rename(data, newname1 = oldname1[, newname2 = oldname2, ...])# Rename the following two columns
# - AGE -> age_categorical
# - AGE1 -> age_continuous
biomarker %>%
rename(age_categorical = AGE,
age_continuous = AGE1) %>%
select(USUBJID, age_categorical, age_continuous)
# A tibble: 4,368 × 3
USUBJID age_categorical age_continuous
<fctr> <chr> <dbl>
1 SUBJID-1 <65 34
2 SUBJID-1 <65 34
3 SUBJID-1 <65 34
4 SUBJID-1 <65 34
5 SUBJID-1 <65 34
6 SUBJID-1 <65 34
7 SUBJID-1 <65 34
8 SUBJID-1 <65 34
9 SUBJID-1 <65 34
10 SUBJID-1 <65 34
# ... with 4,358 more rows
summarise(data, var1 = fun1(args1)[, var2 = fun2(args2), ...])# Calculate the following summary statistics of raw Biomarker-1 measurements
# - min
# - max
# - mean
# - sd
biomarker %>%
filter(PARAM == 'Biomarker-1') %>%
summarise('Bmk1 min' = min(AVAL),
'Bmk1 max' = max(AVAL),
'Bmk1 mean' = mean(AVAL),
'Bmk1 sd' = sd(AVAL))
# A tibble: 1 × 4
`Bmk1 min` `Bmk1 max` `Bmk1 mean` `Bmk1 sd`
<dbl> <dbl> <dbl> <dbl>
1 0.6611332 131.7934 8.612313 9.230064
group_by(data, colname1[, colname2, ...])# Calculate summary statistics of raw Biomarker-1 measurements at each visit
biomarker %>%
filter(PARAM == 'Biomarker-1') %>%
group_by(AVISIT) %>%
summarise('Bmk1 min' = min(AVAL),
'Bmk1 max' = max(AVAL),
'Bmk1 mean' = mean(AVAL),
'Bmk1 sd' = sd(AVAL))
# A tibble: 23 × 5
AVISIT `Bmk1 min` `Bmk1 max` `Bmk1 mean` `Bmk1 sd`
<fctr> <dbl> <dbl> <dbl> <dbl>
1 Baseline 1.236720 31.39085 7.038520 4.178917
2 Week 4 1.719325 56.94561 8.479816 7.174010
3 Week 8 1.655803 131.79340 10.896503 17.562428
4 Week 12 2.153257 51.07267 9.279077 8.221138
5 Week 16 2.384246 37.02386 8.970557 6.511436
6 Week 20 1.746164 56.76062 7.908845 7.286787
7 Week 24 1.089179 44.46084 9.066420 8.285184
8 Week 30 1.317913 44.94915 8.043445 6.282829
9 Week 36 1.229387 34.95352 7.922025 6.437708
10 Week 42 1.987400 31.41053 7.821189 5.361287
# ... with 13 more rows
biomarker %>%
filter(PARAM == 'Biomarker-1') %>%
group_by(AVISIT) %>%
summarise(N = n())
# A tibble: 23 × 2
AVISIT N
<fctr> <int>
1 Baseline 121
2 Week 4 118
3 Week 8 108
4 Week 12 90
5 Week 16 89
6 Week 20 79
7 Week 24 75
8 Week 30 61
9 Week 36 61
10 Week 42 51
# ... with 13 more rows
biomarker %>%
filter(PARAM == 'Biomarker-1') %>%
count(AVISIT)
# A tibble: 23 × 2
AVISIT n
<fctr> <int>
1 Baseline 121
2 Week 4 118
3 Week 8 108
4 Week 12 90
5 Week 16 89
6 Week 20 79
7 Week 24 75
8 Week 30 61
9 Week 36 61
10 Week 42 51
# ... with 13 more rows
summarise_each(): apply one or more functions to one or more columnsdo(): perform arbitrary computation, returning a list. Particular useful when working with modelsslice(): select rows by positionsample_n(): sample a fixed number of rows, without replacementsample_frac(): sample a fixed percentage of rows, without replacementinner_join(), left_join(), right_join(), full_join(), semi_join(), anti_join()| Biomarker | Gender | N | Median | Mean | Std | P value |
|---|---|---|---|---|---|---|
| Biomarker-1 | F | |||||
| Biomarker-1 | M | |||||
| … | … | … | … | … | … | … |
| Biomarker-4 | F | |||||
| Biomarker-4 | M |
# It takes three steps:
# Step 1: Compute summary statistics
# Step 2: Compute Wilcoxon rank sum test p value
# Step 3: Merge summary statistics and p values
# Step 1: Compute summary statistics
summary_tbl <- biomarker %>%
filter(AVISIT == 'Week 4') %>%
group_by(PARAM, SEX) %>%
summarise(N = n(), Median = median(AVAL), Mean = mean(AVAL), Std = sd(AVAL))
# Step 2: Compute Wilcoxon rank sum test p value
p_value_tbl <- biomarker %>%
filter(AVISIT == 'Week 4') %>%
group_by(PARAM) %>%
do(wil_test = wilcox.test(AVAL ~ SEX, data = .)) %>%
summarise(PARAM, p_value = wil_test$p.value)
# Step 3: Merge summary statistics and p values
association_tbl <- left_join(summary_tbl, p_value_tbl, by = c('PARAM'))
# print out association table of biomarkers at Week 4 with gender
association_tbl
Source: local data frame [8 x 7]
Groups: PARAM [?]
PARAM SEX N Median Mean Std p_value
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Biomarker-1 F 43 5.793712 8.374152 8.9205442 0.22283879
2 Biomarker-1 M 75 6.467959 8.540396 6.0164397 0.22283879
3 Biomarker-2 F 43 2.273683 4.390728 7.8210077 0.67902511
4 Biomarker-2 M 75 2.431877 3.890596 7.0265673 0.67902511
5 Biomarker-3 F 43 4.465109 4.530818 1.4929249 0.50220969
6 Biomarker-3 M 75 4.126813 4.535112 1.9115268 0.50220969
7 Biomarker-4 F 43 1.617746 1.694869 0.7456146 0.04180881
8 Biomarker-4 M 75 2.030626 1.993155 0.8635891 0.04180881
dplyr verbs inside of a function?
For example, what if we want to write a function, summary_stat_by_group(), which
dplyr verbs!#----- Regular verbs -----#
# select
select(biomarker, AVAL)
# filter
filter(biomarker, AGE1 > 65)
# arrange
arrange(biomarker, PARAM)
# mutate
mutate(biomarker, CHG = AVAL / BASE)
# group_by
group_by(biomarker, PARAM, AVISIT)
#----- SE verbs -----#
# select_
select_(biomarker, 'AVAL')
# filter_
filter_(biomarker, 'AGE1 > 65')
# arrange_
arrange_(biomarker, 'PARAM')
# mutate_
mutate_(biomarker, CHG = 'AVAL / BASE')
# group_by_
group_by_(biomarker, 'PARAM', 'AVISIT')
group_by_(biomarker, .dots = c('PARAM', 'AVISIT'))
# It only takes two steps
# Step 1: Define a utility function that applies a list of function on a vector
# Step 2: Define `summary_stat_by_group()`
# Step 1: Define a utility function
fapply <- function(vec, named_func_list) {
vec <- unlist(vec)
res <- sapply(named_func_list, function(FUN, x) return(FUN(x)), x = vec)
res <- setNames(data.frame(t(res)), names(named_func_list))
return(res)
}
# Step 2: Define `summary_stat_by_group()`
summary_stat_by_group <- function(df, column, named_func_list, groups = NULL) {
if(!is.null(groups)) df <- group_by_(df, .dots = groups)
columns_to_select <- c(groups, column)
res <- df %>%
select_(.dots = columns_to_select) %>%
do(fapply(vec = .[[column]], named_func_list = named_func_list))
return(res)
}
summary_stat_by_group() to compute summary statistics of raw biomarker at each visitnamed_func_list <- list(Min = min, Max = max, Mean = mean, Std = sd)
summary_stat_by_group(biomarker, 'AVAL', named_func_list, c('PARAM', 'AVISIT'))
Source: local data frame [92 x 6]
Groups: PARAM, AVISIT [92]
PARAM AVISIT Min Max Mean Std
<chr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 Biomarker-1 Baseline 1.236720 31.39085 7.038520 4.178917
2 Biomarker-1 Week 4 1.719325 56.94561 8.479816 7.174010
3 Biomarker-1 Week 8 1.655803 131.79340 10.896503 17.562428
4 Biomarker-1 Week 12 2.153257 51.07267 9.279077 8.221138
5 Biomarker-1 Week 16 2.384246 37.02386 8.970557 6.511436
6 Biomarker-1 Week 20 1.746164 56.76062 7.908845 7.286787
7 Biomarker-1 Week 24 1.089179 44.46084 9.066420 8.285184
8 Biomarker-1 Week 30 1.317913 44.94915 8.043445 6.282829
9 Biomarker-1 Week 36 1.229387 34.95352 7.922025 6.437708
10 Biomarker-1 Week 42 1.987400 31.41053 7.821189 5.361287
# ... with 82 more rows
Specifically, we would like to define a function, aggregate_records(), which
aggregate_records()# It only takes two steps
# Step 1: Define a default record aggregation rule
# Step 2: Define `aggregate_records()`
# Step 1: Define a default record aggregation rule
default_aggregation_rule <- function(vec) {
vec <- unlist(vec)
if(length(na.omit(vec)) == 0) return(NA)
else if(is.numeric(vec)) return(mean(vec, na.rm = TRUE))
else return(na.omit(vec)[1])
}
# Step 2: Define `aggregate_records()`
aggregate_records <- function(df, duprem_rule = default_aggregation_rule, groups = NULL) {
if(!is.null(groups)) df <- group_by_(df, .dots = groups)
res <- df %>% summarise_each(funs(duprem_rule))
return(res)
}
%>%)_ on the end
summarise() is summarise_()reshape2: a package that makes it easy to transform data between wide and long formatstidyr: a package that helps create tidy datadata.table: extension of data.frame
plyr: tools for splitting, applying and combining data
dplyr: load plyr first, then dplyrdplyr and SQLdplyr translates R code into SQL code
select() -> SELECTmutate() -> user-defined columnssummarise() -> aggregated columnsleft_join() -> LEFT JOINfilter() -> WHEREgroup_by() -> GROUP BYdplyr and SQLHadley Wickham, the author of dplyr explains:
When working with databases, dplyr tries to be as lazy as possible:
- It never pulls data into R unless you explicitly ask for it
- It delays doing any work until the last possible moment - it collects together everything you want to do and then sends it to the database in one step.
Resources
Tibbles are a modern take on data frames. There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.
[ and [[
[ always returns another tibble[[ always returns a vector. No more drop = FALSE!
More information can be found here, Tibbles