Efficient Data Manipulation with dplyr

Feiyang Niu
July 7, 2017

Data flow

dplyr

dplyr

  • When & Where
    • Not relevant
  • Who
    • dplyr is a R package, developed by Hadley Wickham and Romain Francois

dplyr

  • What
    • dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
  • Why
    • Speed. dplyr provides blazing fast performance for in-memory data by writing key pieces in C++
    • Direct connection to and analysis within external databases permitting simpler handling of large data
    • Function chaining that allows us to avoid cluttering our workspace with interim objects
    • Syntax simplicity and ease of use. The code is easy to write and to follow.

Overview

  • Verbs
  • Pipe operator
  • A statistical problem
  • Programming with dplyr

Dataset

Column name Description
STUDYID Study ID
USUBJID Unique subject ID
PARAM Biomarker name (Biomarker-1, …, Biomarker-4)
AVISIT Visit (Baseline, Week 4, …)
AVISITN Numeric visit (0, 4, …)
AVAL Biomarker raw measurement
BASE Biomarker baseline measurement
UNIT Biomarker measurement unit
ETHNIC Ethnic (Hispanic or Latino, Not Hispanic or Latino)
COUNTRY Country (USA, FRA, GBR, ITA, POL, DEU)
SEX Sex (M, F)
AGE Age category (<65, >=65)
Region Region (US, Non-US)
Disease_type Disease type (A, B, C, D)
AGE1 Actual age (continuous, in years)

First things first…

Installation

# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")

# Or the the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

Verbs

“Verbs” in dplyr are a set of functions that perform common data manipulation operations such as

  • selecting specific columns,
  • filtering for rows,
  • re-ordering rows,
  • adding new columns,
  • grouping data,
  • summarizing data,
  • and their possible combinations.

select: Pick columns by name

select(data, colname1[, colname2, ...])

select(biomarker, STUDYID, USUBJID, PARAM, AVISIT, AVISITN, AVAL)
# A tibble: 4,368 × 6
        STUDYID  USUBJID       PARAM   AVISIT AVISITN      AVAL
          <chr>   <fctr>       <chr>   <fctr>   <dbl>     <dbl>
1  GS-Sample-01 SUBJID-1 Biomarker-1 Baseline       0  4.086984
2  GS-Sample-01 SUBJID-1 Biomarker-1   Week 4       4 10.071358
3  GS-Sample-01 SUBJID-1 Biomarker-1   Week 8       8  6.023101
4  GS-Sample-01 SUBJID-1 Biomarker-1  Week 12      12  7.057249
5  GS-Sample-01 SUBJID-1 Biomarker-1  Week 16      16  4.522446
6  GS-Sample-01 SUBJID-1 Biomarker-1  Week 20      20  3.121714
7  GS-Sample-01 SUBJID-1 Biomarker-1  Week 24      24  3.288062
8  GS-Sample-01 SUBJID-1 Biomarker-1  Week 30      30  4.336657
9  GS-Sample-01 SUBJID-1 Biomarker-1  Week 36      36  5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1  Week 42      42  4.437676
# ... with 4,358 more rows

select: Pick columns by name

select(data, -colname1[, -colname2, ...])

select(biomarker, -BASE, -UNIT, -ETHNIC, -COUNTRY, -SEX, -AGE, -Region, -Disease_type, -AGE1)
# A tibble: 4,368 × 6
        STUDYID  USUBJID       PARAM   AVISIT AVISITN      AVAL
          <chr>   <fctr>       <chr>   <fctr>   <dbl>     <dbl>
1  GS-Sample-01 SUBJID-1 Biomarker-1 Baseline       0  4.086984
2  GS-Sample-01 SUBJID-1 Biomarker-1   Week 4       4 10.071358
3  GS-Sample-01 SUBJID-1 Biomarker-1   Week 8       8  6.023101
4  GS-Sample-01 SUBJID-1 Biomarker-1  Week 12      12  7.057249
5  GS-Sample-01 SUBJID-1 Biomarker-1  Week 16      16  4.522446
6  GS-Sample-01 SUBJID-1 Biomarker-1  Week 20      20  3.121714
7  GS-Sample-01 SUBJID-1 Biomarker-1  Week 24      24  3.288062
8  GS-Sample-01 SUBJID-1 Biomarker-1  Week 30      30  4.336657
9  GS-Sample-01 SUBJID-1 Biomarker-1  Week 36      36  5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1  Week 42      42  4.437676
# ... with 4,358 more rows

select: Pick columns by name

select(data, start_col:end_col)

# [STUDYID USUBJID PARAM AVISIT AVISITN AVAL] BASE UNIT ETHNIC COUNTRY SEX AGE Region Disease_type AGE1
select(biomarker, STUDYID:AVAL)
# A tibble: 4,368 × 6
        STUDYID  USUBJID       PARAM   AVISIT AVISITN      AVAL
          <chr>   <fctr>       <chr>   <fctr>   <dbl>     <dbl>
1  GS-Sample-01 SUBJID-1 Biomarker-1 Baseline       0  4.086984
2  GS-Sample-01 SUBJID-1 Biomarker-1   Week 4       4 10.071358
3  GS-Sample-01 SUBJID-1 Biomarker-1   Week 8       8  6.023101
4  GS-Sample-01 SUBJID-1 Biomarker-1  Week 12      12  7.057249
5  GS-Sample-01 SUBJID-1 Biomarker-1  Week 16      16  4.522446
6  GS-Sample-01 SUBJID-1 Biomarker-1  Week 20      20  3.121714
7  GS-Sample-01 SUBJID-1 Biomarker-1  Week 24      24  3.288062
8  GS-Sample-01 SUBJID-1 Biomarker-1  Week 30      30  4.336657
9  GS-Sample-01 SUBJID-1 Biomarker-1  Week 36      36  5.992019
10 GS-Sample-01 SUBJID-1 Biomarker-1  Week 42      42  4.437676
# ... with 4,358 more rows

select: Pick columns by name

  • starts_with(): select all columns that start with a character string
head(select(biomarker, starts_with('S')), 3)
# A tibble: 3 × 2
       STUDYID   SEX
         <chr> <chr>
1 GS-Sample-01     M
2 GS-Sample-01     M
3 GS-Sample-01     M
  • ends_with(): select columns that end with a character string
  • contains(): select columns that contain a character string
  • matches(): select columns that match a regular expression
  • one_of(): select columns names that are from a group of names

filter: Keep rows matching criteria

filter(data, criterion1[, criterion2, ...])

# Find raw measurements
#   - of Biomarker-1
#   - of Female subjects 
#   - above the age of 75
#   - at Week 4 
temp <- filter(biomarker, PARAM == 'Biomarker-1', SEX == 'F', AGE1 > 75, AVISIT == 'Week 4')
select(temp, USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL)
# A tibble: 6 × 6
     USUBJID       PARAM AVISIT   SEX  AGE1      AVAL
      <fctr>       <chr> <fctr> <chr> <dbl>     <dbl>
1   SUBJID-8 Biomarker-1 Week 4     F    87  5.202696
2  SUBJID-12 Biomarker-1 Week 4     F    81 56.945610
3  SUBJID-69 Biomarker-1 Week 4     F    77  5.531761
4  SUBJID-74 Biomarker-1 Week 4     F    79  6.980639
5  SUBJID-82 Biomarker-1 Week 4     F    81  3.523846
6 SUBJID-120 Biomarker-1 Week 4     F    81 10.016988

Pipe operator: %>%

# input         +--------+        +--------+        +--------+      result
# data   %>%    |  verb  |  %>%   |  verb  |  %>%   |  verb  |  ->  data
# frame         +--------+        +--------+        +--------+      frame

Pipe operator: %>%

# Find raw measurements
#   - of Biomarker-1
#   - of Female subjects 
#   - above the age of 75
#   - at Week 4 
biomarker %>%
    filter(PARAM == 'Biomarker-1',
           SEX == 'F',
           AGE1 > 75,
           AVISIT == 'Week 4') %>%
    select(USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL)
# A tibble: 6 × 6
     USUBJID       PARAM AVISIT   SEX  AGE1      AVAL
      <fctr>       <chr> <fctr> <chr> <dbl>     <dbl>
1   SUBJID-8 Biomarker-1 Week 4     F    87  5.202696
2  SUBJID-12 Biomarker-1 Week 4     F    81 56.945610
3  SUBJID-69 Biomarker-1 Week 4     F    77  5.531761
4  SUBJID-74 Biomarker-1 Week 4     F    79  6.980639
5  SUBJID-82 Biomarker-1 Week 4     F    81  3.523846
6 SUBJID-120 Biomarker-1 Week 4     F    81 10.016988
# As a comparison, to achieve the above, below is Base R version
temp <- biomarker[biomarker$PARAM == 'Biomarker-1' &
                      biomarker$SEX == 'F' &
                      biomarker$AGE1 > 75 &
                      biomarker$AVISIT == 'Week 4', ]
temp[, c('USUBJID','PARAM','AVISIT','SEX','AGE1','AVAL')]
# A tibble: 6 × 6
     USUBJID       PARAM AVISIT   SEX  AGE1      AVAL
      <fctr>       <chr> <fctr> <chr> <dbl>     <dbl>
1   SUBJID-8 Biomarker-1 Week 4     F    87  5.202696
2  SUBJID-12 Biomarker-1 Week 4     F    81 56.945610
3  SUBJID-69 Biomarker-1 Week 4     F    77  5.531761
4  SUBJID-74 Biomarker-1 Week 4     F    79  6.980639
5  SUBJID-82 Biomarker-1 Week 4     F    81  3.523846
6 SUBJID-120 Biomarker-1 Week 4     F    81 10.016988

arrange: Re-order rows

arrange(data, colname1[, colname2, ...])

  • In descending order: arrange(data, desc(colname1))
# Find raw measurements
#   - of Biomarker-1
#   - of Female subjects 
#   - above the age of 75
#   - at Week 4
# Sort them in descending order
biomarker %>%
    filter(PARAM == 'Biomarker-1', SEX == 'F', AGE1 > 75, AVISIT == 'Week 4') %>%
    select(USUBJID, PARAM, AVISIT, SEX, AGE1, AVAL) %>%
    arrange(desc(AVAL))
# A tibble: 6 × 6
     USUBJID       PARAM AVISIT   SEX  AGE1      AVAL
      <fctr>       <chr> <fctr> <chr> <dbl>     <dbl>
1  SUBJID-12 Biomarker-1 Week 4     F    81 56.945610
2 SUBJID-120 Biomarker-1 Week 4     F    81 10.016988
3  SUBJID-74 Biomarker-1 Week 4     F    79  6.980639
4  SUBJID-69 Biomarker-1 Week 4     F    77  5.531761
5   SUBJID-8 Biomarker-1 Week 4     F    87  5.202696
6  SUBJID-82 Biomarker-1 Week 4     F    81  3.523846

mutate: Create new columns

mutate(data, new_var1 = expression1[, new_var2 = expression2, ...])

# Create three new columns
#   - CHG:      change from baseline
#   - PCHG:     % change from baseline
#   - PRCHG:    % baseline
biomarker %>%
    mutate(CHG = AVAL - BASE,
           PCHG = CHG / BASE * 100,
           PRCHG = AVAL / BASE * 100) %>%
    select(USUBJID, PARAM, AVISIT, AVAL, BASE, CHG, PCHG, PRCHG)
# A tibble: 4,368 × 8
    USUBJID       PARAM   AVISIT      AVAL     BASE        CHG       PCHG
     <fctr>       <chr>   <fctr>     <dbl>    <dbl>      <dbl>      <dbl>
1  SUBJID-1 Biomarker-1 Baseline  4.086984 4.086984  0.0000000   0.000000
2  SUBJID-1 Biomarker-1   Week 4 10.071358 4.086984  5.9843738 146.425168
3  SUBJID-1 Biomarker-1   Week 8  6.023101 4.086984  1.9361169  47.372749
4  SUBJID-1 Biomarker-1  Week 12  7.057249 4.086984  2.9702647  72.676192
5  SUBJID-1 Biomarker-1  Week 16  4.522446 4.086984  0.4354616  10.654838
6  SUBJID-1 Biomarker-1  Week 20  3.121714 4.086984 -0.9652702 -23.618151
7  SUBJID-1 Biomarker-1  Week 24  3.288062 4.086984 -0.7989224 -19.547968
8  SUBJID-1 Biomarker-1  Week 30  4.336657 4.086984  0.2496724   6.108964
9  SUBJID-1 Biomarker-1  Week 36  5.992019 4.086984  1.9050345  46.612228
10 SUBJID-1 Biomarker-1  Week 42  4.437676 4.086984  0.3506914   8.580690
# ... with 4,358 more rows, and 1 more variables: PRCHG <dbl>

rename: Rename columns

rename(data, newname1 = oldname1[, newname2 = oldname2, ...])

# Rename the following two columns
#   - AGE  -> age_categorical
#   - AGE1 -> age_continuous
biomarker %>%
    rename(age_categorical = AGE,
           age_continuous = AGE1) %>%
    select(USUBJID, age_categorical, age_continuous)
# A tibble: 4,368 × 3
    USUBJID age_categorical age_continuous
     <fctr>           <chr>          <dbl>
1  SUBJID-1             <65             34
2  SUBJID-1             <65             34
3  SUBJID-1             <65             34
4  SUBJID-1             <65             34
5  SUBJID-1             <65             34
6  SUBJID-1             <65             34
7  SUBJID-1             <65             34
8  SUBJID-1             <65             34
9  SUBJID-1             <65             34
10 SUBJID-1             <65             34
# ... with 4,358 more rows

summarise: Summarize multiple values to a single value

summarise(data, var1 = fun1(args1)[, var2 = fun2(args2), ...])

# Calculate the following summary statistics of raw Biomarker-1 measurements
#   - min
#   - max
#   - mean
#   - sd
biomarker %>%
    filter(PARAM == 'Biomarker-1')  %>%
    summarise('Bmk1 min' = min(AVAL),
              'Bmk1 max' = max(AVAL),
              'Bmk1 mean' = mean(AVAL),
              'Bmk1 sd' = sd(AVAL))
# A tibble: 1 × 4
  `Bmk1 min` `Bmk1 max` `Bmk1 mean` `Bmk1 sd`
       <dbl>      <dbl>       <dbl>     <dbl>
1  0.6611332   131.7934    8.612313  9.230064

group_by: Create the groups that will be operated on

group_by(data, colname1[, colname2, ...])

# Calculate summary statistics of raw Biomarker-1 measurements at each visit
biomarker %>%
    filter(PARAM == 'Biomarker-1')  %>%
    group_by(AVISIT) %>%
    summarise('Bmk1 min' = min(AVAL),
              'Bmk1 max' = max(AVAL),
              'Bmk1 mean' = mean(AVAL),
              'Bmk1 sd' = sd(AVAL))
# A tibble: 23 × 5
     AVISIT `Bmk1 min` `Bmk1 max` `Bmk1 mean` `Bmk1 sd`
     <fctr>      <dbl>      <dbl>       <dbl>     <dbl>
1  Baseline   1.236720   31.39085    7.038520  4.178917
2    Week 4   1.719325   56.94561    8.479816  7.174010
3    Week 8   1.655803  131.79340   10.896503 17.562428
4   Week 12   2.153257   51.07267    9.279077  8.221138
5   Week 16   2.384246   37.02386    8.970557  6.511436
6   Week 20   1.746164   56.76062    7.908845  7.286787
7   Week 24   1.089179   44.46084    9.066420  8.285184
8   Week 30   1.317913   44.94915    8.043445  6.282829
9   Week 36   1.229387   34.95352    7.922025  6.437708
10  Week 42   1.987400   31.41053    7.821189  5.361287
# ... with 13 more rows

group_by: Create the groups that will be operated on

group_by: Create the groups that will be operated on

Calculate number of subjects at each visit

biomarker %>%
    filter(PARAM == 'Biomarker-1')  %>%
    group_by(AVISIT) %>%
    summarise(N = n())
# A tibble: 23 × 2
     AVISIT     N
     <fctr> <int>
1  Baseline   121
2    Week 4   118
3    Week 8   108
4   Week 12    90
5   Week 16    89
6   Week 20    79
7   Week 24    75
8   Week 30    61
9   Week 36    61
10  Week 42    51
# ... with 13 more rows

count() version

biomarker %>%
    filter(PARAM == 'Biomarker-1')  %>%
    count(AVISIT)
# A tibble: 23 × 2
     AVISIT     n
     <fctr> <int>
1  Baseline   121
2    Week 4   118
3    Week 8   108
4   Week 12    90
5   Week 16    89
6   Week 20    79
7   Week 24    75
8   Week 30    61
9   Week 36    61
10  Week 42    51
# ... with 13 more rows

More verbs

  • summarise_each(): apply one or more functions to one or more columns
  • do(): perform arbitrary computation, returning a list. Particular useful when working with models
  • slice(): select rows by position
  • sample_n(): sample a fixed number of rows, without replacement
  • sample_frac(): sample a fixed percentage of rows, without replacement
  • Join two tables
    • inner_join(), left_join(), right_join(), full_join(), semi_join(), anti_join()

A statistical problem

Problem: Association table of biomarkers at Week 4 with gender


Association table mockup

Biomarker Gender N Median Mean Std P value
Biomarker-1 F
Biomarker-1 M
Biomarker-4 F
Biomarker-4 M
Note: p value is calculated by Wilcoxon rank sum test

A statistical problem

Problem: Association table of biomarkers at Week 4 with gender

# It takes three steps:
#   Step 1: Compute summary statistics
#   Step 2: Compute Wilcoxon rank sum test p value
#   Step 3: Merge summary statistics and p values

# Step 1: Compute summary statistics
summary_tbl <- biomarker %>%
    filter(AVISIT == 'Week 4') %>%
    group_by(PARAM, SEX) %>%
    summarise(N = n(), Median = median(AVAL), Mean = mean(AVAL), Std = sd(AVAL))

# Step 2: Compute Wilcoxon rank sum test p value
p_value_tbl <- biomarker %>%
    filter(AVISIT == 'Week 4') %>%
    group_by(PARAM) %>%
    do(wil_test = wilcox.test(AVAL ~ SEX, data = .)) %>%
    summarise(PARAM, p_value = wil_test$p.value)

# Step 3: Merge summary statistics and p values
association_tbl <- left_join(summary_tbl, p_value_tbl, by = c('PARAM'))

A statistical problem

# print out association table of biomarkers at Week 4 with gender
association_tbl
Source: local data frame [8 x 7]
Groups: PARAM [?]

        PARAM   SEX     N   Median     Mean       Std    p_value
        <chr> <chr> <int>    <dbl>    <dbl>     <dbl>      <dbl>
1 Biomarker-1     F    43 5.793712 8.374152 8.9205442 0.22283879
2 Biomarker-1     M    75 6.467959 8.540396 6.0164397 0.22283879
3 Biomarker-2     F    43 2.273683 4.390728 7.8210077 0.67902511
4 Biomarker-2     M    75 2.431877 3.890596 7.0265673 0.67902511
5 Biomarker-3     F    43 4.465109 4.530818 1.4929249 0.50220969
6 Biomarker-3     M    75 4.126813 4.535112 1.9115268 0.50220969
7 Biomarker-4     F    43 1.617746 1.694869 0.7456146 0.04180881
8 Biomarker-4     M    75 2.030626 1.993155 0.8635891 0.04180881

Programming with dplyr

Problem: Can we use dplyr verbs inside of a function?


For example, what if we want to write a function, summary_stat_by_group(), which

  • takes
    • a dataframe,
    • a column on which the summary statistics are computed,
    • a list of named summary functions, and
    • optionally a vector of group variables
  • returns
    • summary statistics by groups


Tip: We need to know standard evaluation (SE) equivalents of dplyr verbs!

SE equivalents of verbs

#----- Regular verbs -----#

# select
select(biomarker, AVAL)

# filter
filter(biomarker, AGE1 > 65)

# arrange
arrange(biomarker, PARAM)

# mutate
mutate(biomarker, CHG = AVAL / BASE)

# group_by
group_by(biomarker, PARAM, AVISIT)





Programming with dplyr

#----- SE verbs -----#

# select_
select_(biomarker, 'AVAL')

# filter_
filter_(biomarker, 'AGE1 > 65')

# arrange_
arrange_(biomarker, 'PARAM')

# mutate_
mutate_(biomarker, CHG = 'AVAL / BASE')

# group_by_
group_by_(biomarker, 'PARAM', 'AVISIT')
group_by_(biomarker, .dots = c('PARAM', 'AVISIT'))

summary_stat_by_group()

# It only takes two steps
#   Step 1: Define a utility function that applies a list of function on a vector
#   Step 2: Define `summary_stat_by_group()`

# Step 1: Define a utility function
fapply <- function(vec, named_func_list) {
    vec <- unlist(vec)
    res <- sapply(named_func_list, function(FUN, x) return(FUN(x)), x = vec)
    res <- setNames(data.frame(t(res)), names(named_func_list))
    return(res)
}

# Step 2: Define `summary_stat_by_group()`
summary_stat_by_group <- function(df, column, named_func_list, groups = NULL) {
    if(!is.null(groups)) df <- group_by_(df, .dots = groups)
    columns_to_select <- c(groups, column)
    res <- df %>%
        select_(.dots = columns_to_select) %>%
        do(fapply(vec = .[[column]], named_func_list = named_func_list))
    return(res)
}

summary_stat_by_group()

Use summary_stat_by_group() to compute summary statistics of raw biomarker at each visit

named_func_list <- list(Min = min, Max = max, Mean = mean, Std = sd)
summary_stat_by_group(biomarker, 'AVAL', named_func_list, c('PARAM', 'AVISIT'))
Source: local data frame [92 x 6]
Groups: PARAM, AVISIT [92]

         PARAM   AVISIT      Min       Max      Mean       Std
         <chr>   <fctr>    <dbl>     <dbl>     <dbl>     <dbl>
1  Biomarker-1 Baseline 1.236720  31.39085  7.038520  4.178917
2  Biomarker-1   Week 4 1.719325  56.94561  8.479816  7.174010
3  Biomarker-1   Week 8 1.655803 131.79340 10.896503 17.562428
4  Biomarker-1  Week 12 2.153257  51.07267  9.279077  8.221138
5  Biomarker-1  Week 16 2.384246  37.02386  8.970557  6.511436
6  Biomarker-1  Week 20 1.746164  56.76062  7.908845  7.286787
7  Biomarker-1  Week 24 1.089179  44.46084  9.066420  8.285184
8  Biomarker-1  Week 30 1.317913  44.94915  8.043445  6.282829
9  Biomarker-1  Week 36 1.229387  34.95352  7.922025  6.437708
10 Biomarker-1  Week 42 1.987400  31.41053  7.821189  5.361287
# ... with 82 more rows

More problem-solving (if time permits)

Problem: Aggregate records


Specifically, we would like to define a function, aggregate_records(), which

  • takes
    • a dataframe,
    • a record aggregation rule, and
    • optionally a vector of group variables
  • returns
    • a dataframe with at most one observation per group

More problem-solving (if time permits)

aggregate_records()

# It only takes two steps
#   Step 1: Define a default record aggregation rule
#   Step 2: Define `aggregate_records()`

# Step 1: Define a default record aggregation rule
default_aggregation_rule <- function(vec) {
    vec <- unlist(vec)
    if(length(na.omit(vec)) == 0) return(NA)
    else if(is.numeric(vec)) return(mean(vec, na.rm = TRUE))
    else return(na.omit(vec)[1])
}

# Step 2: Define `aggregate_records()`
aggregate_records <- function(df, duprem_rule = default_aggregation_rule, groups = NULL) {
    if(!is.null(groups)) df <- group_by_(df, .dots = groups)
    res <- df %>% summarise_each(funs(duprem_rule))
    return(res)
}

Summary

  • Verbs
  • Pipe operator (%>%)
  • Standard evaluation
    • Every function in dplyr that uses NSE also has a version that uses SE
    • The name of the SE version is always the NSE name with an _ on the end
      • For example, the SE version of summarise() is summarise_()

Miscellaneous

More on data manipulation

  • reshape2: a package that makes it easy to transform data between wide and long formats
  • tidyr: a package that helps create tidy data
  • data.table: extension of data.frame
    • Fast aggregation of large data (e.g. 100GB in RAM)
    • Fast ordered joins
    • Fast add/modify/delete of columns by group using no copies at all
  • plyr: tools for splitting, applying and combining data
    • Compatible issue with dplyr: load plyr first, then dplyr

Miscellaneous

dplyr and SQL

dplyr translates R code into SQL code

  • select() -> SELECT
  • mutate() -> user-defined columns
  • summarise() -> aggregated columns
  • left_join() -> LEFT JOIN
  • filter() -> WHERE
  • group_by() -> GROUP BY

Miscellaneous

dplyr and SQL

Hadley Wickham, the author of dplyr explains:

When working with databases, dplyr tries to be as lazy as possible:

  • It never pulls data into R unless you explicitly ask for it
  • It delays doing any work until the last possible moment - it collects together everything you want to do and then sends it to the database in one step.

Resources

Miscellaneous

Miscellaneous

What is a local dataframe, or a tibble

Tibbles are a modern take on data frames. There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.

  • Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
  • Tibbles also clearly delineate [ and [[
    • [ always returns another tibble
    • [[ always returns a vector. No more drop = FALSE!



More information can be found here, Tibbles

Thank you!