Efficient Data Manipulation with dplyr

Feiyang Niu
July 7, 2017

Data flow

dplyr

When & Where
- Not relevant
Who
- dplyr is a R package, developed by Hadley Wickham and Romain Francois

dplyr

What
- dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
Why
- Speed. dplyr provides blazing fast performance for in-memory data by writing key pieces in C++
- Direct connection to and analysis within external databases permitting simpler handling of large data
- Function chaining that allows us to avoid cluttering our workspace with interim objects
- Syntax simplicity and ease of use. The code is easy to write and to follow.

Overview

Verbs
Pipe operator
A statistical problem
Programming with dplyr

Dataset

Column name	Description
STUDYID	Study ID
USUBJID	Unique subject ID
PARAM	Biomarker name (Biomarker-1, …, Biomarker-4)
AVISIT	Visit (Baseline, Week 4, …)
AVISITN	Numeric visit (0, 4, …)
AVAL	Biomarker raw measurement
BASE	Biomarker baseline measurement
UNIT	Biomarker measurement unit
ETHNIC	Ethnic (Hispanic or Latino, Not Hispanic or Latino)
COUNTRY	Country (USA, FRA, GBR, ITA, POL, DEU)
SEX	Sex (M, F)
AGE	Age category (<65, >=65)
Region	Region (US, Non-US)
Disease_type	Disease type (A, B, C, D)
AGE1	Actual age (continuous, in years)

First things first…

Installation

# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")

# Or the the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

Verbs

“Verbs” in dplyr are a set of functions that perform common data manipulation operations such as

selecting specific columns,
filtering for rows,
re-ordering rows,
adding new columns,
grouping data,
summarizing data,
and their possible combinations.

select: Pick columns by name

`select(data, colname1[, colname2, ...])`

select(biomarker, STUDYID, USUBJID, PARAM, AVISIT, AVISITN, AVAL)

# A tibble: 4,368 × 6
        STUDYID  USUBJID       PARAM   AVISIT AVISITN      AVAL
          <chr>   <fctr>       <chr>   <fctr>   <dbl>     <dbl>
1  GS-Sample-01 SUBJID-1 Biomarker-1 Baseline       0  4.086984
2  GS-Sample-01 SUBJID-1 Biomarker-1   Week 4       4 10.071358
3  GS-Sample-01 SUBJID-1 Biomarker-1   Week 8       8  6.023101
4  GS-Sample-01 SUBJID-1 Biomarker-1  Week 12      12  7.057249
5  GS-Sample-01 SUBJID-1 Biomarker-1  Week 16      16  4.522446
6  GS-Sample-01 SUBJID-1 Biomarker-1  Week 20      20  3.121714
7  GS-Sample-01 SUBJID-1 Biomarker-1  Week 24      24  3.288062
8  GS-Sample-01 SUBJID-1 Biomarker-1  Week 30      30  4.336657
9  GS-Sample-01 SUBJID-1 Biomarker-1  Week 36      36  5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1  Week 42      42  4.437676
# ... with 4,358 more rows

select: Pick columns by name

`select(data, -colname1[, -colname2, ...])`

select(biomarker, -BASE, -UNIT, -ETHNIC, -COUNTRY, -SEX, -AGE, -Region, -Disease_type, -AGE1)

# A tibble: 4,368 × 6
        STUDYID  USUBJID       PARAM   AVISIT AVISITN      AVAL
          <chr>   <fctr>       <chr>   <fctr>   <dbl>     <dbl>
1  GS-Sample-01 SUBJID-1 Biomarker-1 Baseline       0  4.086984
2  GS-Sample-01 SUBJID-1 Biomarker-1   Week 4       4 10.071358
3  GS-Sample-01 SUBJID-1 Biomarker-1   Week 8       8  6.023101
4  GS-Sample-01 SUBJID-1 Biomarker-1  Week 12      12  7.057249
5  GS-Sample-01 SUBJID-1 Biomarker-1  Week 16      16  4.522446
6  GS-Sample-01 SUBJID-1 Biomarker-1  Week 20      20  3.121714
7  GS-Sample-01 SUBJID-1 Biomarker-1  Week 24      24  3.288062
8  GS-Sample-01 SUBJID-1 Biomarker-1  Week 30      30  4.336657
9  GS-Sample-01 SUBJID-1 Biomarker-1  Week 36      36  5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1  Week 42      42  4.437676
# ... with 4,358 more rows

select: Pick columns by name

`select(data, start_col:end_col)`

# [STUDYID USUBJID PARAM AVISIT AVISITN AVAL] BASE UNIT ETHNIC COUNTRY SEX AGE Region Disease_type AGE1
select(biomarker, STUDYID:AVAL)

# A tibble: 4,368 × 6
        STUDYID  USUBJID       PARAM   AVISIT AVISITN      AVAL
          <chr>   <fctr>       <chr>   <fctr>   <dbl>     <dbl>
1  GS-Sample-01 SUBJID-1 Biomarker-1 Baseline       0  4.086984
2  GS-Sample-01 SUBJID-1 Biomarker-1   Week 4       4 10.071358
3  GS-Sample-01 SUBJID-1 Biomarker-1   Week 8       8  6.023101
4  GS-Sample-01 SUBJID-1 Biomarker-1  Week 12      12  7.057249
5  GS-Sample-01 SUBJID-1 Biomarker-1  Week 16      16  4.522446
6  GS-Sample-01 SUBJID-1 Biomarker-1  Week 20      20  3.121714
7  GS-Sample-01 SUBJID-1 Biomarker-1  Week 24      24  3.288062
8  GS-Sample-01 SUBJID-1 Biomarker-1  Week 30      30  4.336657
9  GS-Sample-01 SUBJID-1 Biomarker-1  Week 36      36  5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1  Week 42      42  4.437676
# ... with 4,358 more rows

select: Pick columns by name

starts_with(): select all columns that start with a character string

head(select(biomarker, starts_with('S')), 3)

# A tibble: 3 × 2
       STUDYID   SEX
         <chr> <chr>
1 GS-Sample-01     M
2 GS-Sample-01     M
3 GS-Sample-01     M

ends_with(): select columns that end with a character string
contains(): select columns that contain a character string
matches(): select columns that match a regular expression
one_of(): select columns names that are from a group of names

filter: Keep rows matching criteria

`filter(data, criterion1[, criterion2, ...])`

# Find raw measurements
#   - of Biomarker-1
#   - of Female subjects 
#   - above the age of 75
#   - at Week 4 
temp <- filter(biomarker, PARAM == 'Biomarker-1', SEX == 'F', AGE1 > 75, AVISIT == 'Week 4')
select(temp, USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL)

# A tibble: 6 × 6
     USUBJID       PARAM AVISIT   SEX  AGE1      AVAL
      <fctr>       <chr> <fctr> <chr> <dbl>     <dbl>
1   SUBJID-8 Biomarker-1 Week 4     F    87  5.202696
2  SUBJID-12 Biomarker-1 Week 4     F    81 56.945610
3  SUBJID-69 Biomarker-1 Week 4     F    77  5.531761
4  SUBJID-74 Biomarker-1 Week 4     F    79  6.980639
5  SUBJID-82 Biomarker-1 Week 4     F    81  3.523846
6 SUBJID-120 Biomarker-1 Week 4     F    81 10.016988

Pipe operator: %>%

# input         +--------+        +--------+        +--------+      result
# data   %>%    |  verb  |  %>%   |  verb  |  %>%   |  verb  |  ->  data
# frame         +--------+        +--------+        +--------+      frame

%>% was first introduced in magrittr

Pipe operator: %>%

# Find raw measurements
#   - of Biomarker-1
#   - of Female subjects 
#   - above the age of 75
#   - at Week 4 
biomarker %>%
    filter(PARAM == 'Biomarker-1',
           SEX == 'F',
           AGE1 > 75,
           AVISIT == 'Week 4') %>%
    select(USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL)

# A tibble: 6 × 6
     USUBJID       PARAM AVISIT   SEX  AGE1      AVAL
      <fctr>       <chr> <fctr> <chr> <dbl>     <dbl>
1   SUBJID-8 Biomarker-1 Week 4     F    87  5.202696
2  SUBJID-12 Biomarker-1 Week 4     F    81 56.945610
3  SUBJID-69 Biomarker-1 Week 4     F    77  5.531761
4  SUBJID-74 Biomarker-1 Week 4     F    79  6.980639
5  SUBJID-82 Biomarker-1 Week 4     F    81  3.523846
6 SUBJID-120 Biomarker-1 Week 4     F    81 10.016988

# As a comparison, to achieve the above, below is Base R version
temp <- biomarker[biomarker$PARAM == 'Biomarker-1' &
                      biomarker$SEX == 'F' &
                      biomarker$AGE1 > 75 &
                      biomarker$AVISIT == 'Week 4', ]
temp[, c('USUBJID','PARAM','AVISIT','SEX','AGE1','AVAL')]

# A tibble: 6 × 6
     USUBJID       PARAM AVISIT   SEX  AGE1      AVAL
      <fctr>       <chr> <fctr> <chr> <dbl>     <dbl>
1   SUBJID-8 Biomarker-1 Week 4     F    87  5.202696
2  SUBJID-12 Biomarker-1 Week 4     F    81 56.945610
3  SUBJID-69 Biomarker-1 Week 4     F    77  5.531761
4  SUBJID-74 Biomarker-1 Week 4     F    79  6.980639
5  SUBJID-82 Biomarker-1 Week 4     F    81  3.523846
6 SUBJID-120 Biomarker-1 Week 4     F    81 10.016988

arrange: Re-order rows

`arrange(data, colname1[, colname2, ...])`

In descending order: arrange(data, desc(colname1))

# Find raw measurements
#   - of Biomarker-1
#   - of Female subjects 
#   - above the age of 75
#   - at Week 4
# Sort them in descending order
biomarker %>%
    filter(PARAM == 'Biomarker-1', SEX == 'F', AGE1 > 75, AVISIT == 'Week 4') %>%
    select(USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL) %>%
    arrange(desc(AVAL))

# A tibble: 6 × 6
     USUBJID       PARAM AVISIT   SEX  AGE1      AVAL
      <fctr>       <chr> <fctr> <chr> <dbl>     <dbl>
1  SUBJID-12 Biomarker-1 Week 4     F    81 56.945610
2 SUBJID-120 Biomarker-1 Week 4     F    81 10.016988
3  SUBJID-74 Biomarker-1 Week 4     F    79  6.980639
4  SUBJID-69 Biomarker-1 Week 4     F    77  5.531761
5   SUBJID-8 Biomarker-1 Week 4     F    87  5.202696
6  SUBJID-82 Biomarker-1 Week 4     F    81  3.523846

mutate: Create new columns

`mutate(data, new_var1 = expression1[, new_var2 = expression2, ...])`

# Create three new columns
#   - CHG:      change from baseline
#   - PCHG:     % change from baseline
#   - PRCHG:    % baseline
biomarker %>%
    mutate(CHG = AVAL - BASE,
           PCHG = CHG / BASE * 100,
           PRCHG = AVAL / BASE * 100) %>%
    select(USUBJID, PARAM, AVISIT, AVAL, BASE, CHG, PCHG, PRCHG)

# A tibble: 4,368 × 8
    USUBJID       PARAM   AVISIT      AVAL     BASE        CHG       PCHG
     <fctr>       <chr>   <fctr>     <dbl>    <dbl>      <dbl>      <dbl>
1  SUBJID-1 Biomarker-1 Baseline  4.086984 4.086984  0.0000000   0.000000
2  SUBJID-1 Biomarker-1   Week 4 10.071358 4.086984  5.9843738 146.425168
3  SUBJID-1 Biomarker-1   Week 8  6.023101 4.086984  1.9361169  47.372749
4  SUBJID-1 Biomarker-1  Week 12  7.057249 4.086984  2.9702647  72.676192
5  SUBJID-1 Biomarker-1  Week 16  4.522446 4.086984  0.4354616  10.654838
6  SUBJID-1 Biomarker-1  Week 20  3.121714 4.086984 -0.9652702 -23.618151
7  SUBJID-1 Biomarker-1  Week 24  3.288062 4.086984 -0.7989224 -19.547968
8  SUBJID-1 Biomarker-1  Week 30  4.336657 4.086984  0.2496724   6.108964
9  SUBJID-1 Biomarker-1  Week 36  5.992019 4.086984  1.9050345  46.612228
10 SUBJID-1 Biomarker-1  Week 42  4.437676 4.086984  0.3506914   8.580690
# ... with 4,358 more rows, and 1 more variables: PRCHG <dbl>

rename: Rename columns

`rename(data, newname1 = oldname1[, newname2 = oldname2, ...])`

# Rename the following two columns
#   - AGE  -> age_categorical
#   - AGE1 -> age_continuous
biomarker %>%
    rename(age_categorical = AGE,
           age_continuous = AGE1) %>%
    select(USUBJID, age_categorical, age_continuous)

# A tibble: 4,368 × 3
    USUBJID age_categorical age_continuous
     <fctr>           <chr>          <dbl>
1  SUBJID-1             <65             34
2  SUBJID-1             <65             34
3  SUBJID-1             <65             34
4  SUBJID-1             <65             34
5  SUBJID-1             <65             34
6  SUBJID-1             <65             34
7  SUBJID-1             <65             34
8  SUBJID-1             <65             34
9  SUBJID-1             <65             34
10 SUBJID-1             <65             34
# ... with 4,358 more rows

summarise: Summarize multiple values to a single value

`summarise(data, var1 = fun1(args1)[, var2 = fun2(args2), ...])`

# Calculate the following summary statistics of raw Biomarker-1 measurements
#   - min
#   - max
#   - mean
#   - sd
biomarker %>%
    filter(PARAM == 'Biomarker-1')  %>%
    summarise('Bmk1 min' = min(AVAL),
              'Bmk1 max' = max(AVAL),
              'Bmk1 mean' = mean(AVAL),
              'Bmk1 sd' = sd(AVAL))

# A tibble: 1 × 4
  `Bmk1 min` `Bmk1 max` `Bmk1 mean` `Bmk1 sd`
       <dbl>      <dbl>       <dbl>     <dbl>
1  0.6611332   131.7934    8.612313  9.230064

group_by: Create the groups that will be operated on

`group_by(data, colname1[, colname2, ...])`

# Calculate summary statistics of raw Biomarker-1 measurements at each visit
biomarker %>%
    filter(PARAM == 'Biomarker-1')  %>%
    group_by(AVISIT) %>%
    summarise('Bmk1 min' = min(AVAL),
              'Bmk1 max' = max(AVAL),
              'Bmk1 mean' = mean(AVAL),
              'Bmk1 sd' = sd(AVAL))

# A tibble: 23 × 5
     AVISIT `Bmk1 min` `Bmk1 max` `Bmk1 mean` `Bmk1 sd`
     <fctr>      <dbl>      <dbl>       <dbl>     <dbl>
1  Baseline   1.236720   31.39085    7.038520  4.178917
2    Week 4   1.719325   56.94561    8.479816  7.174010
3    Week 8   1.655803  131.79340   10.896503 17.562428
4   Week 12   2.153257   51.07267    9.279077  8.221138
5   Week 16   2.384246   37.02386    8.970557  6.511436
6   Week 20   1.746164   56.76062    7.908845  7.286787
7   Week 24   1.089179   44.46084    9.066420  8.285184
8   Week 30   1.317913   44.94915    8.043445  6.282829
9   Week 36   1.229387   34.95352    7.922025  6.437708
10  Week 42   1.987400   31.41053    7.821189  5.361287
# ... with 13 more rows

group_by: Create the groups that will be operated on

Calculate number of subjects at each visit

biomarker %>%
    filter(PARAM == 'Biomarker-1')  %>%
    group_by(AVISIT) %>%
    summarise(N = n())

# A tibble: 23 × 2
     AVISIT     N
     <fctr> <int>
1  Baseline   121
2    Week 4   118
3    Week 8   108
4   Week 12    90
5   Week 16    89
6   Week 20    79
7   Week 24    75
8   Week 30    61
9   Week 36    61
10  Week 42    51
# ... with 13 more rows

count() version

biomarker %>%
    filter(PARAM == 'Biomarker-1')  %>%
    count(AVISIT)

# A tibble: 23 × 2
     AVISIT     n
     <fctr> <int>
1  Baseline   121
2    Week 4   118
3    Week 8   108
4   Week 12    90
5   Week 16    89
6   Week 20    79
7   Week 24    75
8   Week 30    61
9   Week 36    61
10  Week 42    51
# ... with 13 more rows

More verbs

summarise_each(): apply one or more functions to one or more columns
do(): perform arbitrary computation, returning a list. Particular useful when working with models
slice(): select rows by position
sample_n(): sample a fixed number of rows, without replacement
sample_frac(): sample a fixed percentage of rows, without replacement
Join two tables
- inner_join(), left_join(), right_join(), full_join(), semi_join(), anti_join()

A statistical problem

Problem: Association table of biomarkers at Week 4 with gender

Association table mockup

Biomarker	Gender	N	Median	Mean	Std	P value
Biomarker-1	F
Biomarker-1	M
…	…	…	…	…	…	…
Biomarker-4	F
Biomarker-4	M

Note: p value is calculated by Wilcoxon rank sum test

A statistical problem

Problem: Association table of biomarkers at Week 4 with gender

# It takes three steps:
#   Step 1: Compute summary statistics
#   Step 2: Compute Wilcoxon rank sum test p value
#   Step 3: Merge summary statistics and p values

# Step 1: Compute summary statistics
summary_tbl <- biomarker %>%
    filter(AVISIT == 'Week 4') %>%
    group_by(PARAM, SEX) %>%
    summarise(N = n(), Median = median(AVAL), Mean = mean(AVAL), Std = sd(AVAL))

# Step 2: Compute Wilcoxon rank sum test p value
p_value_tbl <- biomarker %>%
    filter(AVISIT == 'Week 4') %>%
    group_by(PARAM) %>%
    do(wil_test = wilcox.test(AVAL ~ SEX, data = .)) %>%
    summarise(PARAM, p_value = wil_test$p.value)

# Step 3: Merge summary statistics and p values
association_tbl <- left_join(summary_tbl, p_value_tbl, by = c('PARAM'))

A statistical problem

# print out association table of biomarkers at Week 4 with gender
association_tbl

Source: local data frame [8 x 7]
Groups: PARAM [?]

        PARAM   SEX     N   Median     Mean       Std    p_value
        <chr> <chr> <int>    <dbl>    <dbl>     <dbl>      <dbl>
1 Biomarker-1     F    43 5.793712 8.374152 8.9205442 0.22283879
2 Biomarker-1     M    75 6.467959 8.540396 6.0164397 0.22283879
3 Biomarker-2     F    43 2.273683 4.390728 7.8210077 0.67902511
4 Biomarker-2     M    75 2.431877 3.890596 7.0265673 0.67902511
5 Biomarker-3     F    43 4.465109 4.530818 1.4929249 0.50220969
6 Biomarker-3     M    75 4.126813 4.535112 1.9115268 0.50220969
7 Biomarker-4     F    43 1.617746 1.694869 0.7456146 0.04180881
8 Biomarker-4     M    75 2.030626 1.993155 0.8635891 0.04180881

Programming with dplyr

Problem: Can we use `dplyr` verbs inside of a function?

For example, what if we want to write a function, summary_stat_by_group(), which

takes
- a dataframe,
- a column on which the summary statistics are computed,
- a list of named summary functions, and
- optionally a vector of group variables
returns
- summary statistics by groups

Tip: We need to know standard evaluation (SE) equivalents of `dplyr` verbs!

SE equivalents of verbs

#----- Regular verbs -----#

# select
select(biomarker, AVAL)

# filter
filter(biomarker, AGE1 > 65)

# arrange
arrange(biomarker, PARAM)

# mutate
mutate(biomarker, CHG = AVAL / BASE)

# group_by
group_by(biomarker, PARAM, AVISIT)

Programming with dplyr

#----- SE verbs -----#

# select_
select_(biomarker, 'AVAL')

# filter_
filter_(biomarker, 'AGE1 > 65')

# arrange_
arrange_(biomarker, 'PARAM')

# mutate_
mutate_(biomarker, CHG = 'AVAL / BASE')

# group_by_
group_by_(biomarker, 'PARAM', 'AVISIT')
group_by_(biomarker, .dots = c('PARAM', 'AVISIT'))

summary_stat_by_group()

# It only takes two steps
#   Step 1: Define a utility function that applies a list of function on a vector
#   Step 2: Define `summary_stat_by_group()`

# Step 1: Define a utility function
fapply <- function(vec, named_func_list) {
    vec <- unlist(vec)
    res <- sapply(named_func_list, function(FUN, x) return(FUN(x)), x = vec)
    res <- setNames(data.frame(t(res)), names(named_func_list))
    return(res)
}

# Step 2: Define `summary_stat_by_group()`
summary_stat_by_group <- function(df, column, named_func_list, groups = NULL) {
    if(!is.null(groups)) df <- group_by_(df, .dots = groups)
    columns_to_select <- c(groups, column)
    res <- df %>%
        select_(.dots = columns_to_select) %>%
        do(fapply(vec = .[[column]], named_func_list = named_func_list))
    return(res)
}

summary_stat_by_group()

Use `summary_stat_by_group()` to compute summary statistics of raw biomarker at each visit

named_func_list <- list(Min = min, Max = max, Mean = mean, Std = sd)
summary_stat_by_group(biomarker, 'AVAL', named_func_list, c('PARAM', 'AVISIT'))

Source: local data frame [92 x 6]
Groups: PARAM, AVISIT [92]

         PARAM   AVISIT      Min       Max      Mean       Std
         <chr>   <fctr>    <dbl>     <dbl>     <dbl>     <dbl>
1  Biomarker-1 Baseline 1.236720  31.39085  7.038520  4.178917
2  Biomarker-1   Week 4 1.719325  56.94561  8.479816  7.174010
3  Biomarker-1   Week 8 1.655803 131.79340 10.896503 17.562428
4  Biomarker-1  Week 12 2.153257  51.07267  9.279077  8.221138
5  Biomarker-1  Week 16 2.384246  37.02386  8.970557  6.511436
6  Biomarker-1  Week 20 1.746164  56.76062  7.908845  7.286787
7  Biomarker-1  Week 24 1.089179  44.46084  9.066420  8.285184
8  Biomarker-1  Week 30 1.317913  44.94915  8.043445  6.282829
9  Biomarker-1  Week 36 1.229387  34.95352  7.922025  6.437708
10 Biomarker-1  Week 42 1.987400  31.41053  7.821189  5.361287
# ... with 82 more rows

Problem: Aggregate records

Specifically, we would like to define a function, aggregate_records(), which

takes
- a dataframe,
- a record aggregation rule, and
- optionally a vector of group variables
returns
- a dataframe with at most one observation per group

`aggregate_records()`

# It only takes two steps
#   Step 1: Define a default record aggregation rule
#   Step 2: Define `aggregate_records()`

# Step 1: Define a default record aggregation rule
default_aggregation_rule <- function(vec) {
    vec <- unlist(vec)
    if(length(na.omit(vec)) == 0) return(NA)
    else if(is.numeric(vec)) return(mean(vec, na.rm = TRUE))
    else return(na.omit(vec)[1])
}

# Step 2: Define `aggregate_records()`
aggregate_records <- function(df, duprem_rule = default_aggregation_rule, groups = NULL) {
    if(!is.null(groups)) df <- group_by_(df, .dots = groups)
    res <- df %>% summarise_each(funs(duprem_rule))
    return(res)
}

Summary

Verbs
Pipe operator (%>%)
Standard evaluation
- Every function in dplyr that uses NSE also has a version that uses SE
- The name of the SE version is always the NSE name with an _ on the end
  - For example, the SE version of summarise() is summarise_()

Miscellaneous

`dplyr` and SQL

dplyr translates R code into SQL code

select() -> SELECT
mutate() -> user-defined columns
summarise() -> aggregated columns
left_join() -> LEFT JOIN
filter() -> WHERE
group_by() -> GROUP BY

Miscellaneous

`dplyr` and SQL

Hadley Wickham, the author of dplyr explains:

When working with databases, dplyr tries to be as lazy as possible:

It never pulls data into R unless you explicitly ask for it

It delays doing any work until the last possible moment - it collects together everything you want to do and then sends it to the database in one step.

Resources

Miscellaneous

About this slide deck

Miscellaneous

What is a local dataframe, or a tibble

Tibbles are a modern take on data frames. There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
Tibbles also clearly delineate [ and [[
- [ always returns another tibble
- [[ always returns a vector. No more drop = FALSE!

More information can be found here, Tibbles

Efficient Data Manipulation with dplyr

Data flow

dplyr

dplyr

dplyr

Overview

Dataset

First things first…

Installation

Verbs

select: Pick columns by name

select(data, colname1[, colname2, ...])

select: Pick columns by name

select(data, -colname1[, -colname2, ...])

select: Pick columns by name

select(data, start_col:end_col)

select: Pick columns by name

filter: Keep rows matching criteria

filter(data, criterion1[, criterion2, ...])

Pipe operator: %>%

Pipe operator: %>%

arrange: Re-order rows

arrange(data, colname1[, colname2, ...])

mutate: Create new columns

mutate(data, new_var1 = expression1[, new_var2 = expression2, ...])

rename: Rename columns

rename(data, newname1 = oldname1[, newname2 = oldname2, ...])

summarise: Summarize multiple values to a single value

summarise(data, var1 = fun1(args1)[, var2 = fun2(args2), ...])

group_by: Create the groups that will be operated on

group_by(data, colname1[, colname2, ...])

group_by: Create the groups that will be operated on

group_by: Create the groups that will be operated on

Calculate number of subjects at each visit

count() version

More verbs

A statistical problem

Problem: Association table of biomarkers at Week 4 with gender

Association table mockup

Note: p value is calculated by Wilcoxon rank sum test

A statistical problem

Problem: Association table of biomarkers at Week 4 with gender

A statistical problem

Programming with dplyr

Problem: Can we use dplyr verbs inside of a function?

Tip: We need to know standard evaluation (SE) equivalents of dplyr verbs!

SE equivalents of verbs

summary_stat_by_group()

summary_stat_by_group()

Use summary_stat_by_group() to compute summary statistics of raw biomarker at each visit

More problem-solving (if time permits)

Problem: Aggregate records

More problem-solving (if time permits)

aggregate_records()

Summary

Miscellaneous

More on data manipulation

Miscellaneous

dplyr and SQL

Miscellaneous

dplyr and SQL

Miscellaneous

About this slide deck

Miscellaneous

What is a local dataframe, or a tibble

Thank you!

`select(data, colname1[, colname2, ...])`

`select(data, -colname1[, -colname2, ...])`

`select(data, start_col:end_col)`

`filter(data, criterion1[, criterion2, ...])`

`arrange(data, colname1[, colname2, ...])`

`mutate(data, new_var1 = expression1[, new_var2 = expression2, ...])`

`rename(data, newname1 = oldname1[, newname2 = oldname2, ...])`

`summarise(data, var1 = fun1(args1)[, var2 = fun2(args2), ...])`

`group_by(data, colname1[, colname2, ...])`

Problem: Can we use `dplyr` verbs inside of a function?

Tip: We need to know standard evaluation (SE) equivalents of `dplyr` verbs!

Use `summary_stat_by_group()` to compute summary statistics of raw biomarker at each visit

`aggregate_records()`

`dplyr` and SQL

`dplyr` and SQL