Tidyverse <-> data.table
Equivalence between Tidyverse and data.table expressions
data.table, summarizing the equivalences between most dplyr/tidyr verbs and data.table.
Please visit this page for a more up-to-date version of this post.
-
V1: 2022-05-19
-
V2: 2022-05-26
- Improved the section on keys (for ordering & filtering)
- Adding a section for translations of
Tidyr(and other similar packages)
- Capping tables to display 15 rows max when unfolded
- Improving table display (stripping, hiding the contents of nested columns, …)
- Improved the section on keys (for ordering & filtering)
-
V3: 2022-07-20
- Updating examples of dynamic programming based on the latest recommendations
- Added new entries in processing examples
- Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
- Added
pivot_widerexamples to match thedcastones in the Pivots section
- Added some new examples here and there across the Basic Operations section
- Added an entry for operating inside nested data.frames/data.tables
- Added a processing example for run-length encoding (i.e. successive event tagging)
- Updating examples of dynamic programming based on the latest recommendations
-
V4: 2022-08-05
- Improved
pivotsection: example of one-hot encoding (and reverse operation) + better examples of partial pivots with.value
- Added
tidyr::uncount()(row duplication) example
- Improved both light & dark themes (code highlight, tables, …)
- Improved
1 Setup
library(here) # Project management
library(data.table) # Data wrangling (>= 1.14.3)
library(dplyr) # Data wrangling (>= 1.1.0)
library(tidyr) # Data wrangling (extras) (>= 1.2.0)
library(pipebind) # Piping goodies (>= 0.1.1)
library(stringr) # Manipulating strings
library(purrr) # Manipulating lists
library(lubridate) # Manipulating dates
library(broom)
data.table::setDTthreads(parallel::detectCores(logical = TRUE))─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os Ubuntu 20.04.4 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz Europe/Paris
date 2022-09-24
pandoc 2.19.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
Quarto 1.1.251
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P broom * 1.0.0 2022-07-01 [?] CRAN (R 4.2.0)
P data.table * 1.14.3 2022-07-27 [?] Github (Rdatatable/data.table@c4a2085)
dplyr * 1.0.99.9000 2022-08-15 [1] Github (tidyverse/dplyr@d8294b4)
P here * 1.0.1 2020-12-13 [?] CRAN (R 4.2.0)
P lubridate * 1.8.0 2021-10-07 [?] CRAN (R 4.2.0)
pipebind * 0.1.1 2022-08-10 [1] CRAN (R 4.2.0)
P purrr * 0.3.4 2020-04-17 [?] CRAN (R 4.2.0)
P stringr * 1.4.0 2019-02-10 [2] CRAN (R 4.2.0)
P tidyr * 1.2.0 2022-02-01 [?] CRAN (R 4.2.0)
[1] /home/mar/Dev/Projects/R/Misc/renv/library/R-4.2/x86_64-pc-linux-gnu
[2] /home/mar/.cache/R/renv/library/Misc-f25fd835/R-4.2/x86_64-pc-linux-gnu
[3] /usr/lib/R/library
[4] /usr/local/lib/R/site-library
[5] /usr/lib/R/site-library
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
2 Basic Operations:
data.table general syntax:
DT[row selector (filter/sort), col selector (select/mutate/summarize/rename), modifiers (group)]
Data
MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]2.1 Arrange / Order:
mtcars |> arrange(desc(cyl))data.frame [32 x 11]
| [ omitted 17 entries ] |
MT[order(-cyl)]data.table [32 x 11]
| [ omitted 17 entries ] |
setorder(MT, -cyl)[]data.table [32 x 11]
| [ omitted 17 entries ] |
MT[order(-cyl, gear)]data.table [32 x 11]
| [ omitted 17 entries ] |
Ordering on a character column
IRIS[chorder(Species)]data.table [150 x 5]
| [ omitted 135 entries ] |
Ordering with keys
- Keys physically reorders the dataset within the RAM (by reference)
- No memory is used for sorting (other than marking which columns is the key)
- No memory is used for sorting (other than marking which columns is the key)
- The dataset is marked with an attribute “sorted”
- The dataset is always sorted in ascending order, with NAs first
- Using
keybyinstead ofbywhen grouping will set the grouping factors as keys
See this SO post for more information on keys.
setkey(MT, cyl, gear)
setkeyv(MT, c("cyl", "gear"))
MTdata.table [32 x 11]
| [ omitted 17 entries ] |
To see over which keys (if any) the dataset is currently ordered:
haskey(MT)[1] TRUE
key(MT)[1] “cyl” “gear”
Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.
Ordering with (secondary) indices
-
setindexcreates an index for the provided columns, but doesn’t physically reorder the dataset in RAM.
- It computes the ordering vector of the dataset’s rows according to the provided columns in an additional attribute called index
setindex(MT, cyl, gear)
setindexv(MT, c("cyl", "gear"))
MTdata.table [32 x 11]
| [ omitted 17 entries ] |
We can see the additional index attribute added to the data.table:
names(attributes(MT))[1] "names" "row.names" "class"
[4] ".internal.selfref" "index"
We can get the currently used indices with:
indices(MT)[1] “cyl__gear”
Adding a new index doesn’t remove a previously existing one:
setindex(MT, hp)
indices(MT)[1] “cyl__gear” “hp”
We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will be using to group or subset by frequently !
2.2 Subset / Filter:
mtcars |> filter(cyl >= 6 & disp < 180)data.frame [5 x 11]
MT[cyl >= 6 & disp < 180]data.table [5 x 11]
Filter based on a range:
MT[disp %between% c(200, 300)]data.table [5 x 11]
Filtering on characters:
For non-regex, use %chin%, which is a character-optimized version of %in%.
IRIS[Species %chin% c("setosa")]data.table [50 x 5]
| [ omitted 35 entries ] |
Filter with pattern:
For regex patterns, use %like%
mtcars |> filter(str_detect(disp, "^\\d{3}\\."))data.frame [9 x 11]
MT[like(disp, "^\\d{3}\\.")]data.table [9 x 11]
Alternatively:
MT[disp %like% "^\\d{3}\\."]data.table [9 x 11]
Filter by keys
When keys or indices are defined, we can filter based on them, which is often a lot faster.
We do not even need to specify the column name we are filtering on: the values will be attributed to the keys in order.
setkey(MT, cyl)
MT[.(6)] # Equivalent to MT[cyl == 6]data.table [7 x 11]
setkey(MT, cyl, gear)
MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]data.table [4 x 11]
Filter by indices
To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).
IRIS["setosa", on = "Species"]data.table [50 x 5]
| [ omitted 35 entries ] |
Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.
When using on with multiple values, the nomatch = NULL argument avoids creating combinations that do not exist in the original data (i.e. for cyl == 5 here)
MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]data.table [12 x 11]
Filter based on position:
dplyr::first(MT$cyl)[1] 4
MT[, first(cyl)][1] 4
dplyr::last(MT$cyl)[1] 8
MT[, last(cyl)][1] 8
dplyr::nth(MT$cyl, 5)[1] 4
MT[5, cyl][1] 4
Distinct / Unique
mtcars |> distinct(mpg, hp, .keep_all = TRUE)data.frame [31 x 11]
| [ omitted 16 entries ] |
data.table [31 x 11]
| [ omitted 16 entries ] |
N Distinct / Unique N
n_distinct(mtcars$gear)[1] 3
uniqueN(MT, by = "gear")[1] 3
Applying a filtering function on multiple columns
Function to filter rows that have 2 or more non-zero decimals in one column
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) > 1Manual solution:
mtcars |> filter(decp(drat) & decp(wt) & decp(qsec))data.frame [13 x 11]
MT[decp(drat) & decp(wt) & decp(qsec), ]data.table [13 x 11]
Programmatically applying the method to the different columns:
cols <- c("drat", "wt", "qsec")mtcars |> filter(if_all(cols, decp))data.frame [13 x 11]
data.table [13 x 11]
data.table [13 x 11]
With the newer env meta-programming interface:
data.table [13 x 11]
We can’t use .SD in the i clause of a data.table, but we can bypass that constraint by doing the operation in two steps:
- Obtaining a vector stating if each row of the table matches or not the conditions
- Filtering the original table based on the vector
2.3 Select:
MT |> select(matches("cyl|disp"))data.table [32 x 2]
| [ omitted 17 entries ] |
MT[, .(mpg, disp)]data.table [32 x 2]
| [ omitted 17 entries ] |
MT[ , .SD, .SDcols = c("mpg", "disp")]data.table [32 x 2]
| [ omitted 17 entries ] |
MT[, .SD, .SDcols = patterns("mpg|disp")]data.table [32 x 2]
| [ omitted 17 entries ] |
By dynamic name:
cols <- c("cyl", "disp")
mtcars |> select(all_of(cols))data.frame [32 x 2]
| [ omitted 17 entries ] |
mtcars |> select(!!cols)data.frame [32 x 2]
| [ omitted 17 entries ] |
copy(MT)[, ..cols]data.table [32 x 2]
| [ omitted 17 entries ] |
copy(MT)[, mget(cols)]data.table [32 x 2]
| [ omitted 17 entries ] |
copy(MT)[, cols, with = FALSE]data.table [32 x 2]
| [ omitted 17 entries ] |
data.table [32 x 2]
| [ omitted 17 entries ] |
Remove a column
mtcars |> select(-cyl)data.frame [32 x 10]
| [ omitted 17 entries ] |
copy(MT)[, c("cyl") := NULL][]data.table [32 x 10]
| [ omitted 17 entries ] |
copy(MT)[, !"cyl"] # MT[, -"cyl"]data.table [32 x 10]
| [ omitted 17 entries ] |
By dynamic name:
col <- "cyl"
copy(MT)[, (col) := NULL][]data.table [32 x 10]
| [ omitted 17 entries ] |
copy(MT)[, j := NULL, env = list(j = col)][]data.table [32 x 10]
| [ omitted 17 entries ] |
cols <- c("cyl", "disp")
mtcars |> select(!matches(cols))data.frame [32 x 9]
| [ omitted 17 entries ] |
copy(MT)[, !..cols]data.table [32 x 9]
| [ omitted 17 entries ] |
copy(MT)[, !cols, with = FALSE]data.table [32 x 9]
| [ omitted 17 entries ] |
data.table [32 x 9]
| [ omitted 17 entries ] |
By pattern:
mtcars |> select(-matches("^d"))data.frame [32 x 9]
| [ omitted 17 entries ] |
copy(MT)[, .SD, .SDcols = !patterns("^d")]data.table [32 x 9]
| [ omitted 17 entries ] |
data.table [32 x 9]
| [ omitted 17 entries ] |
By type:
IRIS |> select(where(\(c) !is.numeric(c)))data.table [150 x 1]
| [ omitted 135 entries ] |
IRIS[, .SD, .SDcols = !is.numeric]data.table [150 x 1]
| [ omitted 135 entries ] |
Select + pull
mtcars |> pull(disp)MT[, disp]Select + rename
mtcars |> select(dispp = disp)data.frame [32 x 1]
| [ omitted 17 entries ] |
MT[, .(dispp = disp)]data.table [32 x 1]
| [ omitted 17 entries ] |
2.4 Rename:
Manually:
mtcars |> rename(CYL = cyl, MPG = mpg)data.frame [32 x 11]
| [ omitted 17 entries ] |
data.table [32 x 11]
| [ omitted 17 entries ] |
Programmatically:
2.5 Mutate:
data.table can mutate in 2 ways:
- Using = creates a new DT with the new columns only (like dplyr::transmute)
- Using := modifies the current dt in place (like dplyr::mutate)
The function modifying a column should be the same size as the original column (or group).
If only one value is provided with :=, it will be recycled to the whole column/group.
If the number of values provided is smaller than the original column/group:
- With :=, an error will be raised, asking to manually specify how to recycle the values.
- With =, it will behave like dplyr::summarize (if a grouping has been specified).
2.5.1 Transmute:
MT[, .(cyl = cyl * 2)]data.table [32 x 1]
| [ omitted 17 entries ] |
2.5.2 In-Place:
2.5.2.1 Single column:
mtcars |> mutate(cyl = 200)data.frame [32 x 11]
| [ omitted 17 entries ] |
copy(MT)[, cyl := 200][]data.table [32 x 11]
| [ omitted 17 entries ] |
Mutate a single column with a function:
mtcars |> mutate(mean_cyl = mean(cyl, na.rm = TRUE))data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, mean_cyl := mean(cyl, na.rm = TRUE)][]data.table [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, `:=`(mean_cyl = mean(cyl, na.rm = TRUE))][]data.table [32 x 12]
| [ omitted 17 entries ] |
Dynamic mutate:
Dynamic name on the LHS:
RHS <- "MPG"
mtcars |> mutate({{RHS}} := mean(mpg))data.frame [32 x 12]
| [ omitted 17 entries ] |
mtcars |> mutate("{RHS}" := mean(mpg))data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, (RHS) := mean(mpg)][] # (RHS) <=> c(RHS)data.table [32 x 12]
| [ omitted 17 entries ] |
data.table [32 x 12]
| [ omitted 17 entries ] |
Dynamic name on both LHS & RHS:
data.table requires the use of base::get() on the LHS
LHS <- "MPG"
RHS <- "mpg"mtcars |> mutate("{LHS}" := as.character(.data[[RHS]]))data.frame [32 x 12]
| [ omitted 17 entries ] |
mtcars |> mutate({{LHS}} := as.character(cur_data()[[RHS]]))data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, c(LHS) := as.character(get(RHS))][]data.table [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, x := y, env = list(x = LHS, y = RHS)][]data.table [32 x 12]
| [ omitted 17 entries ] |
Mutate based on multiple conditions:
if_else:
mtcars |> mutate(Size = if_else(cyl >= 6, "BIG", "small"))data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, Size := fifelse(cyl >= 6, "BIG", "small")][]data.table [32 x 12]
| [ omitted 17 entries ] |
case_when:
data.frame [32 x 12]
| [ omitted 17 entries ] |
data.table [32 x 12]
| [ omitted 17 entries ] |
Mutate only if condition is met:
It will keep all the rows and only mutate the ones meeting the provided condition (in i).
This can be extended to mutating multiple columns, of course.
mtcars |> mutate(BIG = case_when(am == 1 ~ cyl >= 6))data.frame [32 x 12]
| [ omitted 17 entries ] |
# mtcars |> mutate(BIG = cyl >= 6, .when = am == 1) # Not implemented yet as of dplyr 1.0.9copy(MT)[am == 1, BIG := cyl >= 6][]data.table [32 x 12]
| [ omitted 17 entries ] |
Lag / Lead
mtcars |> mutate(gear1 = lead(gear))data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, gear1 := shift(gear, 1, type = "lead")][]data.table [32 x 12]
| [ omitted 17 entries ] |
2.5.2.2 Mutate multiple columns:
mtcars |> mutate(cyl = 200, gear = 5)data.frame [32 x 11]
| [ omitted 17 entries ] |
copy(MT)[, `:=`(cyl = 200, gear = 5)][]data.table [32 x 11]
| [ omitted 17 entries ] |
data.table [32 x 11]
| [ omitted 17 entries ] |
One function applied to multiple columns (across rows):
data.frame [32 x 13]
| [ omitted 17 entries ] |
data.table [32 x 13]
| [ omitted 17 entries ] |
With dynamic naming:
new <- c("min_mpg", "min_disp")
old <- c("mpg", "disp")
copy(MT)[, c(new) := lapply(mget(old), min)][]data.table [32 x 13]
| [ omitted 17 entries ] |
data.table [32 x 13]
| [ omitted 17 entries ] |
Multiple functions on one column (across rows):
data.table [32 x 13]
| [ omitted 17 entries ] |
data.table [32 x 13]
| [ omitted 17 entries ] |
copy(MT)[, c("min_mpg", "max_mpg") := lapply(.SD, \(x) list(min(x), max(x))) |> rbindlist(), .SDcols = "mpg"][]data.table [32 x 13]
| [ omitted 17 entries ] |
copy(MT)[, c("min_mpg", "max_mpg") := lapply(.SD[, .(mpg)], \(x) list(min(x), max(x))) |> rbindlist()][]data.table [32 x 13]
| [ omitted 17 entries ] |
copy(MT)[, c("min_mpg", "max_mpg") := lapply(.(mpg), \(x) list(min(x), max(x))) |> do.call(rbind, args = _)][]data.table [32 x 13]
| [ omitted 17 entries ] |
One function applied to multiple columns (across columns)
mtcars |> rowwise() |> mutate(RowSum = sum(c_across(where(is.numeric)))) |> ungroup()data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, RowSum := rowSums(.SD), .SDcols = is.numeric][]data.table [32 x 12]
| [ omitted 17 entries ] |
More general option using row-wise apply:
data.table [32 x 12]
| [ omitted 17 entries ] |
Multiple functions applied to multiple columns (row-wise)
copy(MT)[, c("row_mean", "row_sum") := apply(.SD, 1, \(x) list(mean(x), sum(x))) |> rbindlist(), .SDcols = is.numeric][]data.table [32 x 13]
| [ omitted 17 entries ] |
Apply an anonymous function inside the DT:
2.6 Group / Aggregate:
The examples listed apply a grouping but do nothing (using .SD to simply keep all columns as is)
One group:
mtcars |> group_by(cyl)data.frame [32 x 11]
| [ omitted 17 entries ] |
MT[, .SD, by = cyl]data.table [32 x 11]
| [ omitted 17 entries ] |
Multiple groups:
MT[, .SD, by = .(cyl, gear)]data.table [32 x 11]
| [ omitted 17 entries ] |
Dynamic grouping:
cols <- c("cyl", "disp")
mtcars |> group_by(across(any_of(cols)))data.frame [32 x 11]
| [ omitted 17 entries ] |
MT[, .SD, by = cols]data.table [32 x 11]
| [ omitted 17 entries ] |
With potentially absent columns:
cols <- c("cyl", "disp", "fake_col")
mtcars |> group_by(across(any_of(cols)))data.frame [32 x 11]
| [ omitted 17 entries ] |
data.table [32 x 11]
| [ omitted 17 entries ] |
Getting the current group name:
Use the .BY argument to get the current group name:
2.7 Row numbers & indices:
.I: Row indices.N: Number of rows
.GRP: Group indices.NGRP: Number of groups
Getting rows indices:
MT[, .I] [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32
Adding rows indices:
mtcars |> mutate(I = row_number())data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[ , I := .I][]data.table [32 x 12]
| [ omitted 17 entries ] |
Getting row indices (after filtering):
.I gives the vector of row numbers after any subsetting/filtering has been done
Returns the row numbers in the original dataset:
mtcars |> mutate(I = row_number()) |> filter(gear == 4) |> pull(I)[1] 1 2 3 8 9 10 11 18 19 20 26 32
MT[, .I[gear == 4]][1] 1 2 3 8 9 10 11 18 19 20 26 32
Returns the row numbers in the new dataset (after filtering):
mtcars |> filter(gear == 4) |> mutate(I = row_number()) |> pull(I)[1] 1 2 3 4 5 6 7 8 9 10 11 12
MT[gear == 4, .I][1] 1 2 3 4 5 6 7 8 9 10 11 12
Getting the row numbers of specific observations:
Row number of the first and last observation of each group:
mtcars |> group_by(cyl) |> summarize(I = cur_group_rows()[c(1, n())]) |> ungroup()data.frame [6 x 2]
MT[, .I[c(1, .N)], keyby = cyl]data.table [6 x 2]
Keeping all other columns:
mtcars |> mutate(I = row_number()) |> group_by(cyl) |> slice(c(1, n())) |> ungroup()data.frame [6 x 12]
copy(MT)[, I := .I][, .SD[c(1, .N)], keyby = cyl]data.table [6 x 12]
Filtering based on row numbers:
MT[MT[, .I[(.N-10):.N]]]data.table [11 x 11]
(Gets the indices of the last 10 rows and filters based on them)
Adding group indices:
mtcars |> group_by(cyl) |> summarize(GRP = cur_group_id())data.frame [3 x 2]
MT[, .GRP, by = cyl]data.table [3 x 2]
Mutate instead of summarize:
mtcars |> arrange(cyl) |> group_by(cyl) |> mutate(GRP = cur_group_id())data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, GRP := .GRP, keyby = cyl][]data.table [32 x 12]
| [ omitted 17 entries ] |
Row numbers by group:
mtcars |> arrange(gear) |> group_by(gear) |> mutate(I_GRP = row_number())data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[, I_GRP := 1:.N, keyby = gear][]data.table [32 x 12]
| [ omitted 17 entries ] |
Random sample:
Sample by group:
mtcars |> group_by(cyl) |> slice_sample(n = 5)data.frame [15 x 11]
MT[, .SD[sample(.N, 5)], keyby = cyl]data.table [15 x 11]
Filter by group size:
mtcars |> group_by(cyl) |> filter(n() >= 8)data.frame [25 x 11]
| [ omitted 10 entries ] |
MT[, if(.N >= 8) .SD, by = cyl]data.table [25 x 11]
| [ omitted 10 entries ] |
2.8 Relocate:
mtcars |> group_by(cyl) |> mutate(GRP = cur_group_id(), .before = 1)data.frame [32 x 12]
| [ omitted 17 entries ] |
(copy(MT)[ , GRP := .GRP, by = cyl] |> setcolorder(c("GRP", .SD)))[]data.table [32 x 12]
| [ omitted 17 entries ] |
Ordering by column names
mtcars |> select(sort(tidyselect::peek_vars()))data.frame [32 x 11]
| [ omitted 17 entries ] |
data.table [32 x 11]
| [ omitted 17 entries ] |
2.9 Summarize:
Summarizes uses the = operator.
It’s only difference with mutate is that it takes a function that returns a list of values smaller than the original column (or group) size.
By default, it will only keep the modified columns (like transmute).
mtcars |> summarize(mean_cyl = mean(cyl, na.rm = T))data.frame [1 x 1]
MT[, .(mean_cyl = mean(cyl, na.rm = T))]data.table [1 x 1]
Group > summarize
mtcars |> group_by(cyl) |> summarize(N = n())data.frame [3 x 2]
MT[, .N, by = cyl]data.table [3 x 2]
dplyr automatically arrange the result by the grouping factor.
To mimic this with data.table:
MT[, .N, keyby = cyl]data.table [3 x 2]
MT[order(cyl), .N, by = cyl]data.table [3 x 2]
MT[, .N, by = cyl][order(cyl)]data.table [3 x 2]
Grouping on a condition:
mtcars |> group_by(cyl > 6) |> summarize(N = n())data.frame [2 x 2]
MT[, .N, by = .(cyl > 6)]data.table [2 x 2]
Group > filter > summarize
mtcars |> filter(cyl >= 6 & disp >= 200) |> summarize(N = n())data.frame [1 x 1]
MT[cyl >= 6 & disp >= 200, .(.N)]data.table [1 x 1]
mtcars |> summarize(N = sum(cyl >= 6 & disp >= 200, na.rm = T))data.frame [1 x 1]
MT[, .(N = sum(cyl >= 6 & disp >= 200, na.rm = T))]data.table [1 x 1]
Obtaining one summary statistic on multiple columns
mtcars |> group_by(cyl) |> summarize(across(everything(), \(c) mean(c)))data.frame [3 x 11]
data.table [3 x 11]
Apply summary function based on column type:
mtcars |> group_by(cyl) |> summarize(across(where(is.double), \(col) mean(col)))data.frame [3 x 11]
data.table [3 x 11]
Apply summary function to specific columns:
data.frame [3 x 3]
data.table [3 x 3]
data.table [3 x 3]
Apply summary function to specific columns (by pattern):
mtcars |> group_by(cyl) |> summarize(across(matches("^mpg|^disp"), \(.x) mean(.x)))data.frame [3 x 3]
MT[, lapply(.SD, mean), keyby = cyl, .SDcols = patterns("^mpg|^disp")]data.table [3 x 3]
Obtaining multiple summary statistics for one column:
data.frame [3 x 3]
data.table [3 x 3]
MT[, lapply(.SD, \(x) list(mean_mpg = mean(x), sd_mpg = sd(x))) |> rbindlist(), keyby = cyl, .SDcols = "mpg"]data.table [3 x 3]
data.table [3 x 3]
Obtaining multiple summary statistics on multiple columns (as rows):
Obtaining multiple summary statistics on multiple columns (as columns):
dplyr & data.table don’t use the same “format” when pre-defining a list of function to be applied:
- dplyr needs a list of individual functions
- data.table needs a function returning a list
mtcars |> group_by(gear) |> summarize(across(cols, .fns = list_of_funs, .names = "{.col}.{.fn}"))data.frame [3 x 5]
data.table [3 x 5]
data.table [3 x 5]
Different column order & naming scheme:
Here we can use the list_of_funs with data.table since we apply them individually.
MT[, lapply(list_of_funs, \(f) lapply(.SD, f)) |> do.call(c, args = _), keyby = gear, .SDcols = cols]data.table [3 x 5]
Using dcast (see next section):
dcast(MT, gear ~ ., fun.aggregate = list(mean, sd), value.var = cols)data.table [3 x 5]
3 Pivots:
3.1 Melt / Longer:
Data:
FAM1data.table [5 x 5]
FAM2data.table [5 x 8]
One group of columns –> single value column
FAM1 |> pivot_longer(cols = matches("dob_"), names_to = "variable")data.frame [15 x 4]
FAM1 |> melt(measure.vars = c("dob_child1", "dob_child2", "dob_child3"))data.table [15 x 4]
FAM1 |> melt(measure.vars = patterns("^dob_"))data.table [15 x 4]
One group of columns –> multiple value columns
FAM1 |> melt(measure.vars = patterns(child1 = "child1$", child2 = "child2$|child3$"))data.table [10 x 5]
3.1.1 Merging multiple yes/no columns:
Melting multiple presence/absence columns into a single variable:
movies_widedata.frame [3 x 4]
pivot_longer(
movies_wide, -ID, names_to = "Genre",
values_transform = \(x) ifelse(x == 0, NA, x), values_drop_na = TRUE
) |> select(-value)data.frame [6 x 2]
melt(MOVIES_WIDE, id.vars = "ID", variable.name = "Genre")[value != 0][order(ID), -"value"]data.table [6 x 2]
3.1.2 Partial pivot:
Multiple groups of columns –> Multiple value columns
Manually:
colA <- str_subset(colnames(FAM2), "^dob")
colB <- str_subset(colnames(FAM2), "^gender")
FAM2 |> melt(measure.vars = list(colA, colB), value.name = c("dob", "gender"), variable.name = "child")data.table [15 x 5]
FAM2 |> melt(measure.vars = list(a, b), value.name = c("dob", "gender"), variable.name = "child") |>
substitute2(env = list(a = I(str_subset(colnames(FAM2), "^dob")), b = I(str_subset(colnames(FAM2), "^gender")))) |> eval()data.table [15 x 5]
Using .value:
Using the .value special identifier allows to do a “half” pivot: the values that would be listed as rows under .value are instead used as columns.
FAM2 |> pivot_longer(cols = matches("^dob|^gender"), names_to = c(".value", "child"), names_sep = "_child")data.frame [15 x 5]
FAM2 |> melt(measure.vars = patterns("^dob", "^gender"), value.name = c("dob", "gender"), variable.name = "child")data.table [15 x 5]
Using measure and value.name:
data.table only
FAM2 |> melt(measure.vars = measure(value.name, child = \(x) as.integer(x), sep = "_child"))data.table [15 x 5]
FAM2 |> melt(measure.vars = measurev(list(value.name = NULL, child = as.integer), pattern = "(.*)_child(\\d{1})"))data.table [15 x 5]
3.2 Dcast / Wider:
General idea:
- Pivot around the combination of id.vars (LHS of the formula)
- The measure.vars (RHS of the formula) are the ones whose values become column names
- The value.var are the ones the values are taken from to fill the new columns
Data:
(FAM1L <- FAM1 |> melt(measure.vars = c("dob_child1", "dob_child2", "dob_child3")))data.table [15 x 4]
(FAM2L <- FAM2 |> melt(measure.vars = measure(value.name, child = \(.x) as.integer(.x), sep = "_child")))data.table [15 x 5]
Basic pivot wider:
FAM1L |> pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")data.frame [5 x 5]
FAM1L |> dcast(family_id + age_mother ~ variable)data.table [5 x 5]
Using all the columns as IDs:
By default, id_cols = everything()
FAM1L |> pivot_wider(names_from = variable)data.frame [5 x 5]
... => “every unused column”
FAM1L |> dcast(... ~ variable)data.table [5 x 5]
Multiple value columns –> Multiple groups of columns:
FAM2L |> pivot_wider(
id_cols = c("family_id", "age_mother"), values_from = c("dob", "gender"),
names_from = "child", names_sep = "_child"
)data.frame [5 x 8]
FAM2L |> dcast(family_id + age_mother ~ child, value.var = c("dob", "gender"), sep = "_child")data.table [5 x 8]
FAM2L |> dcast(... ~ child, value.var = c("dob", "gender"), sep = "_child")data.table [5 x 8]
Dynamic names in the formula:
var_name <- "variable"FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = {{ var_name }})data.frame [5 x 5]
Multiple variables:
id_vars <- c("family_id", "age_mother")FAM1L |> pivot_wider(id_cols = all_of(id_vars), names_from = variable)data.frame [5 x 5]
FAM1L |> dcast(v1 + v2 ~ variable) |> substitute2(env = list(v1 = id_vars[1], v2 = id_vars[2])) |> eval()data.table [5 x 5]
3.2.1 Renaming (prefix/suffix) the columns:
FAM1L |> pivot_wider(names_from = variable, values_from = value, names_prefix = "Attr: ")data.frame [5 x 5]
FAM1L |> pivot_wider(names_from = variable, values_from = value, names_glue = "Attr: {variable}")data.frame [5 x 5]
FAM1L |> dcast(family_id + age_mother ~ paste0("Attr: ", variable))data.table [5 x 5]
3.2.2 Unused combinations:
The logic is inverted between dplyr (keep) and data.table (drop)
FAM1L |> pivot_wider(names_from = variable, values_from = value, id_expand = TRUE, names_expand = FALSE) # (keep_id, keep_names)data.frame [25 x 5]
| [ omitted 10 entries ] |
FAM1L |> dcast(family_id + age_mother ~ variable, drop = c(F, T)) # (drop_LHS, drop_RHS)data.table [25 x 5]
| [ omitted 10 entries ] |
3.2.3 Subsetting:
AFAIK, pivot_wider can’t do this on it’s own.
FAM1L |> filter(value >= lubridate::ymd(20030101)) |>
pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")data.frame [3 x 5]
FAM1L |> dcast(family_id + age_mother ~ variable, subset = .(value >= lubridate::ymd(20030101)))data.table [3 x 5]
3.2.4 Aggregating:
Not specifying the column holding the measure vars (the names) will result in an empty column counting the number of columns that should have been created for all the measures.
FAM1L |> dcast(family_id + age_mother ~ .)data.table [5 x 3]
We can customize that default behavior using the fun.aggregate argument:
Here, we count the number of child for each each combination of (family_id + age_mother) -> sum all non-NA value
FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(.x) sum(!is.na(.x))) |>
rowwise() |> mutate(child_count = sum(c_across(matches("_child")))) |> ungroup()data.frame [5 x 6]
FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(.x) sum(!is.na(.x))) |>
mutate(child_count = apply(select(cur_data(), matches("_child")), 1, \(r) sum(r)))data.frame [5 x 6]
(FAM1L |> dcast(family_id + age_mother ~ ., fun.agg = \(.x) sum(!is.na(.x))) |> setnames(".", "child_count"))data.table [5 x 3]
Applying multiple fun.agg:
Data:
(DTL <- data.table(
id1 = sample(5, 20, TRUE),
id2 = sample(2, 20, TRUE),
group = sample(letters[1:2], 20, TRUE),
v1 = runif(20),
v2 = 1L)
)data.table [20 x 5]
| [ omitted 5 entries ] |
Multiple fun.agg applied to one variable:
DTL |> dcast(id1 + id2 ~ group, fun.aggregate = list(sum, mean), value.var = "v1")data.table [9 x 6]
Multiple fun.agg to multiple value.var (all combinations):
data.table [9 x 10]
Multiple fun.agg and multiple value.var (one-to-one):
Here, we apply sum to v1 (for both group a & b), and mean to v2 (for both group a & b)
3.2.5 One-hot encoding:
Making each level of a variable into a presence/absence column:
movies_longdata.frame [6 x 2]
4 Joins:
In data.table, a JOIN is just another type of SUBSET: we subset the rows of one data.table with the rows of a second one, based on some conditions that define the type of JOIN.
Matching two tables based on their rows can be done:
- Either on equivalences (equi-joins)
- Or functions comparing one row to another (non-equi joins)
Data:
(DT1 <- data.table(
ID = LETTERS[1:10],
A = sample(1:5, 10, replace = TRUE),
B = sample(10:20, 10)
))data.table [10 x 3]
(DT2 <- data.table(
ID = LETTERS[5:14],
C = sample(1:5, 10, replace = TRUE),
D = sample(10:20, 10)
))data.table [10 x 3]
Basic (right) join example:
right_join(
DT1 |> select(ID, A),
DT2 |> select(ID, C),
by = "ID"
) |> as_tibble()data.frame [10 x 3]
DT1[DT2, .(ID, A, C), on = .(ID)]data.table [10 x 3]
4.1 Outer (right, left):
Appends data of one at the end of the other.
data.table doesn’t do left joins natively
Subsetting DT1 by DT2:
DT2 (everything) + DT1 (all columns, but only the rows that match those in DT1).
> Looking up DT1’s rows using DT2 (or DT2’s key, if it has one) as an index.
As a right join:
right_join(DT1, DT2, by = "ID") # DT1 into DT2data.table [10 x 5]
DT1[DT2, on = .(ID)]data.table [10 x 5]
As a left join:
Not exactly equivalent to the right join: same columns, but DT2 is first instead of DT1
left_join(DT2, DT1, by = "ID") # DT1 into DT2data.table [10 x 5]
data.table [10 x 5]
Subsetting DT2 by DT1:
DT1 (everything) + DT2 (all columns, but only the rows that match those in DT1).
> Looking up DT2’s rows using DT1 (or DT1’s key, if it has one) as an index.
As a right join:
right_join(DT2, DT1, by = "ID") # DT2 into DT1data.table [10 x 5]
DT2[DT1, on = .(ID)]data.table [10 x 5]
As a left join:
Not exactly equivalent to the right join: same columns, but DT1 is first instead of DT2
4.2 Full (outer):
full_join(DT1, DT2, by = "ID")data.table [14 x 5]
data.table::merge.data.table(DT1, DT2, by = "ID", all = TRUE)data.table [14 x 5]
Alternatively:
setkey(DT1, ID)
setkey(DT2, ID)
# Getting the union of the unique keys of both DT
unique_keys <- union(DT1[, ID], DT2[, ID])
DT1[DT2[unique_keys, on = "ID"]]data.table [14 x 5]
4.3 Inner:
Only returns the ROWS matching both tables:
- Inner: rows matching both DT1 and DT2, columns of both (add DT2’s columns to the right)
- Semi: rows matching both DT1 and DT2, columns of first one
Inner:
inner_join(DT1, DT2, by = "ID") data.table [6 x 5]
DT1[DT2, on = .(ID), nomatch = NULL]data.table [6 x 5]
Semi:
semi_join(DT1, DT2, by = "ID")data.table [6 x 3]
DT1[na.omit(DT1[DT2, on = .(ID), which = TRUE])]data.table [6 x 3]
which = TRUE returns the row numbers instead of the rows themselves.
4.4 Anti:
ROWS of DT1 that are NOT in DT2, and only the columns of DT1.
anti_join(DT1, DT2, by = "ID")data.table [4 x 3]
DT1[!DT2, on = .(ID)]data.table [4 x 3]
ROWS of DT2 that are NOT in DT1, and only the columns of DT2.
anti_join(DT2, DT1, by = "ID")data.table [4 x 3]
DT2[!DT1, on = .(ID)]data.table [4 x 3]
4.5 Non-equi joins:
DT1[DT2, on = .(ID, A <= C)]data.table [10 x 4]
4.6 Rolling joins:
DT1[DT2, on = "ID", roll = TRUE]data.table [10 x 5]
Inverse the rolling direction:
DT1[DT2, on = "ID", roll = -Inf]data.table [10 x 5]
DT1[DT2, on = "ID", rollends = TRUE]data.table [10 x 5]
5 Tidyr & Others:
5.1 Remove NA:
tidyr::drop_na(IRIS, matches("Sepal"))data.table [150 x 5]
| [ omitted 135 entries ] |
na.omit(IRIS, cols = str_subset(colnames(IRIS), "Sepal"))data.table [150 x 5]
| [ omitted 135 entries ] |
5.2 Unite:
Combine multiple columns into a single one:
mtcars |> tidyr::unite("x", gear, carb, sep = "_")data.frame [32 x 10]
| [ omitted 17 entries ] |
copy(MT)[, x := paste(gear, carb, sep = "_")][]data.table [32 x 12]
| [ omitted 17 entries ] |
5.3 Extract / Separate:
Separate a row into multiple columns based on a pattern (extract) or a separator (separate):
MT.ext <- MT[, .(x = str_c(gear, carb, sep = "_"))]5.4 Separate rows:
Separate a row into multiple rows based on a separator:
Data
(SP <- data.table(
val = c(1,"2,3",4),
date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"), origin = "1970-01-01")
)
)data.table [3 x 2]
SP |> tidyr::separate_rows(val, sep = ",", convert = TRUE)data.frame [4 x 2]
Solution 1:
data.table [4 x 2]
Solution 2:
SP[, strsplit(val, ",", fixed = TRUE), by = val][SP, on = "val"][, `:=`(val = V1, V1 = NULL)][]data.table [4 x 2]
Solution 3:
(With type conversion)
SP[, unlist(tstrsplit(val, ",", type.convert = TRUE)), by = val][SP, on = "val"][, `:=`(val = V1, V1 = NULL)][]data.table [4 x 2]
Solution 4:
data.table [4 x 2]
(With type conversion)
copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))
][, val := strsplit(val, ","), by = val
][, val := utils::type.convert(val, as.is = T)][]data.table [4 x 2]
5.5 Duplicates:
5.5.1 Duplicated rows:
Finding duplicated rows:
mtcars |> group_by(mpg, hp) |> filter(n() > 1)data.frame [2 x 11]
MT[, if(.N > 1) .SD, by = .(mpg, hp)]data.table [2 x 11]
Only keeping non-duplicated rows:
This is different from distinct/unique, which will keep one of the duplicated rows of each group.
This removes all groups which have duplicated rows.
Solution 1:
mtcars |> group_by(mpg, hp) |> filter(n() == 1)data.frame [30 x 11]
| [ omitted 15 entries ] |
MT[, if(.N == 1) .SD, by = .(mpg, hp)]data.table [30 x 11]
| [ omitted 15 entries ] |
Solution 2:
More convoluted
mtcars |> group_by(mpg, hp) |> filter(n() > 1) |> anti_join(mtcars, y = _)data.frame [30 x 11]
| [ omitted 15 entries ] |
MT[!MT[, if(.N > 1) .SD, by = .(mpg, hp)], on = names(MT)]data.table [30 x 11]
| [ omitted 15 entries ] |
fsetdiff(MT, setcolorder(MT[, if(.N > 1) .SD, by = .(mpg, hp)], names(MT)))data.table [30 x 11]
| [ omitted 15 entries ] |
5.5.2 Duplicated values (per row):
(DUPED <- data.table(
A = c("A1", "A2", "B3", "A4"),
B = c("B1", "B2", "B3", "B4"),
C = c("A1", "C2", "D3", "C4"),
D = c("A1", "D2", "D3", "D4")
)
)data.table [4 x 4]
DUPED |> mutate(Repeats = apply(cur_data(), 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", ")))data.table [4 x 5]
DUPED[, Repeats := apply(.SD, 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", "))][]data.table [4 x 5]
With duplication counter:
DUPED |> mutate(Repeats = apply(cur_data(), 1, \(r) dup_counts(r)))data.table [4 x 5]
DUPED[, Repeats := apply(.SD, 1, \(r) dup_counts(r))][]data.table [4 x 5]
5.6 Expand & Complete:
Here, we are missing an entry for person B on year 2010, that we want to fill:
(CAR <- data.table(
year = c(2010,2011,2012,2013,2014,2015,2011,2012,2013,2014,2015),
person = c("A","A","A","A","A","A", "B","B","B","B","B"),
car = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes", "Citroen","Citroen", "Citroen", "Toyota", "Toyota")
)
)data.table [11 x 3]
5.6.1 Expand:
tidyr::expand(CAR, person, year)data.frame [12 x 2]
CJ(CAR$person, CAR$year, unique = TRUE)data.table [12 x 2]
5.6.2 Complete:
Joins the original dataset with the expanded one:
CAR |> tidyr::complete(person, year)data.frame [12 x 3]
CAR[CJ(person, year, unique = TRUE), on = .(person, year)]data.table [12 x 3]
5.7 Uncount:
Duplicating aggregated rows to get the un-aggregated version back
Data
cols <- c("Mild", "Moderate", "Severe")
dat_aggdata.frame [10 x 6]
dat_agg |>
tidyr::pivot_longer(cols = cols, names_to = "Severity", values_to = "Count") |>
tidyr::uncount(Count) |>
mutate(ID_new = row_number(), .after = "ID") |>
tidyr::pivot_wider(
names_from = "Severity", values_from = "Severity",
values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
)data.frame [23 x 7]
| [ omitted 8 entries ] |
Solution 1:
(melt(DAT_AGG, measure.vars = cols, variable.name = "Severity", value.name = "Count")
[rep(1:.N, Count)][, ID_new := .I] |>
dcast(... ~ Severity, value.var = "Severity", fun.agg = \(x) ifelse(is.na(x), 0, 1), fill = 0)
)[, -"Count"]data.table [23 x 7]
| [ omitted 8 entries ] |
Solution 2:
5.8 List / Unlist:
When a column contains a simple vector/list of values (of the same type, without structure)
5.8.1 One listed column:
Single ID (grouping) column:
(mtcars_list <- mtcars |> group_by(cyl) |> summarize(mpg = list(mpg)) |> ungroup())data.frame [3 x 2]
(MT_LIST <- MT[, .(mpg = .(mpg)), keyby = cyl])data.table [3 x 2]
Solution 1:
mtcars_list |> unnest(mpg)data.frame [32 x 2]
| [ omitted 17 entries ] |
MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]data.table [32 x 2]
| [ omitted 17 entries ] |
Solution 2:
Bypasses the need of grouping when unlisting by growing the data.table back to its original number of rows before unlisting.
data.table [32 x 2]
| [ omitted 17 entries ] |
Multiple ID (grouping) columns:
(mtcars_list2 <- mtcars |> group_by(cyl, gear) |> summarize(mpg = list(mpg)) |> ungroup())data.frame [8 x 3]
(MT_LIST2 <- MT[, .(mpg = .(mpg)), keyby = .(cyl, gear)])data.table [8 x 3]
Solution 1:
mtcars_list2 |> unnest(mpg) # group_by(cyl, gear) is optionaldata.frame [32 x 3]
| [ omitted 17 entries ] |
data.table [32 x 3]
| [ omitted 17 entries ] |
Solution 2:
Same as with one grouping column
5.8.2 Multiple listed column:
Creating the data:
(mtcars_list_mult <- mtcars |> group_by(cyl, gear) |> summarize(across(c(mpg, disp), \(c) list(c))) |> ungroup())data.frame [8 x 4]
data.table [8 x 4]
Solution 1:
5.9 Nest / Unnest:
When a column contains a data.table/data.frame (with multiple columns, structured)
5.9.1 One nested column:
Nesting
(mtcars_nest <- mtcars |> tidyr::nest(data = -cyl)) # Data is inside a tibbledata.frame [3 x 2]
mtcars_nest <- mtcars |> nest_by(cyl) |> ungroup() # Data is inside a vctrs_list_of(MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl])data.table [3 x 2]
Unnesting
mtcars_nest |> unnest(data) |> ungroup()data.frame [32 x 11]
| [ omitted 17 entries ] |
MT_NEST[, rbindlist(data), keyby = cyl]data.table [32 x 11]
| [ omitted 17 entries ] |
# MT_NEST[, do.call(c, data), keyby = cyl]5.9.2 Multiple nested column:
Nesting:
(mtcars_nest_mult <- mtcars |> group_by(cyl, gear) |> nest(data1 = c(mpg, hp), data2 = !c(cyl, gear, mpg, hp)) |> ungroup())data.frame [8 x 4]
(MT_NEST_MULT <- MT[, .(data1 = .(.SD[, .(mpg, hp)]), data2 = .(.SD[, !c("mpg", "hp")])), keyby = .(cyl, gear)])data.table [8 x 4]
Unnesting:
mtcars_nest_mult |> unnest(c(data1, data2)) |> ungroup()data.frame [32 x 11]
| [ omitted 17 entries ] |
MT_NEST_MULT[, c(rbindlist(data1), rbindlist(data2)), keyby = .(cyl, gear)]data.table [32 x 11]
| [ omitted 17 entries ] |
MT_NEST_MULT[, do.call(c, unname(lapply(.SD, \(c) rbindlist(c)))), .SDcols = patterns('data'), keyby = .(cyl, gear)]data.table [32 x 11]
| [ omitted 17 entries ] |
5.9.3 Operate on nested/list columns:
(mtcars_nest <- mtcars |> nest(-cyl) |> ungroup())data.frame [3 x 2]
(MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl])data.table [3 x 2]
Creating a new column using the nested data:
Keeping the nested column:
data.frame [3 x 3]
data.table [3 x 3]
Dropping the nested column:
data.frame [3 x 2]
data.table [3 x 2]
Creating multiple new columns using the nested data:
linreg <- \(data) lm(mpg ~ hp, data = data) |> broom::tidy()mtcars_nest |> group_by(cyl) |> group_modify(\(d, g) linreg(unnest(d, everything()))) |> ungroup()data.frame [6 x 6]
MT_NEST[, rbindlist(lapply(data, \(ndt) linreg(ndt))), keyby = cyl][]data.table [6 x 6]
Operating inside the nested data:
mtcars_nest |>
mutate(data = map(data, \(tibl) mutate(tibl, sum = purrr::pmap_dbl(cur_data(), sum)))) |>
unnest(data)data.frame [32 x 12]
| [ omitted 17 entries ] |
mtcars_nest |>
mutate(across(data, \(tibls) map(tibls, \(tibl) mutate(tibl, sum = apply(cur_data(), 1, sum))))) |>
unnest(data)data.frame [32 x 12]
| [ omitted 17 entries ] |
Using the nplyr package:
5.10 Rotate / Transpose:
(MT_SUMMARY <- MT[, tidy(summary(mpg)), by = cyl])data.table [3 x 7]
Solution 1:
Using pivots to fully rotate the data.table:
MT_SUMMARY |>
pivot_longer(!cyl, names_to = "Statistic") |>
pivot_wider(id_cols = "Statistic", names_from = "cyl", names_prefix = "Cyl ")data.frame [6 x 4]
MT_SUMMARY |>
melt(id.vars = "cyl", variable.name = "Statistic") |>
dcast(Statistic ~ paste0("Cyl ", cyl))data.table [6 x 4]
Solution 2:
Using a dedicated function:
AFAIK there is no native Tidyverse function to do this.
library(datawizard)
datawizard::data_rotate(MT_SUMMARY, colnames = TRUE, rownames = "Statistic")data.frame [6 x 4]
data.table::transpose(MT_SUMMARY, keep.names = "Statistic", make.names = 1)data.table [6 x 4]
6 Processing examples:
Examples of interesting tasks that I’ve collected over time.
6.1 Find minimum in each group:
MT |> group_by(cyl) |> arrange(mpg) |> slice(1) |> ungroup()data.frame [3 x 11]
MT[, .SD[which.min(mpg)], keyby = cyl]data.table [3 x 11]
6.2 GROUP > FILTER > MUTATE
Data:
(DAT <- structure(list(
id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"),
year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L),
job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss", "Boss", "Boss", "Boss"),
job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)
),
.Names = c("id", "name", "year", "job", "job2"),
class = "data.frame",
row.names = c(NA, -16L)
) |> setDT())data.table [16 x 5]
| [ omitted 1 entries ] |
Tidyverse:
DAT |> group_by(name, job) |>
filter(job != "Boss" | year == min(year)) |>
mutate(cumu_job2 = cumsum(job2)) |>
ungroup()data.frame [11 x 6]
Here, the grouping is done BEFORE the filter -> there will be empty groups, meaning they will sum to 0
data.table:
Solution 1:
data.table [11 x 3]
Solution 2:
data.table [11 x 3]
Solution 3:
DAT[
DAT[, .I[job != "Boss" | year == min(year)], by = .(name, job)]$V1 # Row indices
][
, cumu_job2 := cumsum(job2), by = .(name, job)
][]data.table [11 x 6]
If we filtered after the grouping:
6.3 GROUP > SUMMARIZE > JOIN > MUTATE
Data:
(GSJM1 <- data.table(x = c(1,1,1,1,2,2,2,2), y = c("a", "a", "b", "b"), z = 1:8, key = c("x", "y")))data.table [8 x 3]
data.table [2 x 3]
Tidyverse:
as.data.frame(GSJM1) |>
group_by(x, y) |>
summarise(z = sum(z)) |>
ungroup() |>
right_join(GSJM2) |>
mutate(z = z * mul) |>
select(-mul)data.frame [2 x 3]
data.table:
Basic:
GSJM1[, .(z = sum(z)), by = .(x, y)][GSJM2][, `:=`(z = z * mul, mul = NULL)][]data.table [2 x 3]
Advanced (using .EACHI):
GSJM1[GSJM2, .(z = sum(z) * mul), by = .EACHI]data.table [2 x 3]
6.4 Separating rows & cleaning text:
Data
(DT_COMA <- data.table(
first = c(1,"2,3",3,4,5,6.5,7,8,9,0),
second = c(1,"2,,5",3,4,5,"6,5,9",7,8,9,0),
third = c("one", "two", "thr,ee", "four", "five", "six", "sev,en", "eight", "nine", "zero"),
fourth = as.Date(c(1/1/2020, 2/1/2020, 3/1/2020, 4/1/2020, 5/1/2020, 6/1/2020, 7/1/2020, 8/1/2020, 9/1/2020, 10/1/2020), origin = "1970-01-01")
)
)data.table [10 x 4]
6.4.1 Step1: Cleaning
Removing unwanted commas within words
Tidyverse:
DT_COMA |> mutate(across(where(\(v) is.character(v) & all(is.na(as.numeric(v)))), \(v) stringr::str_remove_all(v, ",")))data.table [10 x 4]
data.table:
cols_to_clean <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & all(is.na(as.numeric(v)))] |> colnames()
copy(DT_COMA)[, c(cols_to_clean) := purrr::map(.SD[, cols_to_clean, with = F], \(v) stringr::str_remove_all(v, ","))][]data.table [10 x 4]
6.4.2 Step 2: Separating rows
Each numeric row that has multiple comma-separated values has to be split into multiple rows (one value per row)
Tidyverse:
cols_to_separate <- DT_COMA |> select(where(\(v) is.character(v) & any(!is.na(as.numeric(v))))) |> colnames()
purrr::reduce(
cols_to_separate,
\(acc, col) acc |> tidyr::separate_rows(col, sep = ",", convert = T),
.init = DT_COMA
)data.frame [17 x 4]
| [ omitted 2 entries ] |
data.table:
cols_to_separate <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & any(!is.na(as.numeric(v)))] |> colnames()
(purrr::reduce(
cols_to_separate,
\(acc, col) acc[rep(1:.N, lengths(strsplit(get(col), ",")))][, (col) := type.convert(unlist(strsplit(acc[[col]], ",", fixed = T)), as.is = T, na.strings = "")],
.init = DT_COMA
))[]data.table [17 x 4]
| [ omitted 2 entries ] |
6.4.3 Combining both steps:
Tidyverse:
DT_COMA <- DT_COMA |> mutate(across(where(\(v) is.character(v) & all(is.na(as.numeric(v)))), \(v) stringr::str_remove_all(v, ",")))
purrr::reduce(
select(DT_COMA, where(\(v) is.character(v) & any(!is.na(as.numeric(v))))) |> colnames(),
\(acc, col) acc |> tidyr::separate_rows(col, sep = ",", convert = T),
.init = DT_COMA
)data.frame [17 x 4]
| [ omitted 2 entries ] |
data.table:
cols_to_clean <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & all(is.na(as.numeric(v)))] |> colnames()
cols_to_separate <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & any(!is.na(as.numeric(v)))] |> colnames()
DT_COMA[, c(cols_to_clean) := purrr::map(.SD[, cols_to_clean, with = F], \(v) stringr::str_remove_all(v, ","))]data.table [10 x 4]
(purrr::reduce(
cols_to_separate,
\(acc, col) acc[rep(1:.N, lengths(strsplit(get(col), ",")))][, (col) := type.convert(unlist(strsplit(acc[[col]], ",", fixed = T)), as.is = T, na.strings = "")],
.init = DT_COMA
))[]data.table [17 x 4]
| [ omitted 2 entries ] |
6.5 Multiple choice questions:
Data:
survdata.frame [5 x 2]
Here we will spread the answers into their own columns using a pivot because not all rows have all the possible answers:
Tidyverse:
surv |>
mutate(response = str_split(response, fixed("|"))) |>
unnest(response) |>
pivot_wider(id_cols = ID, names_from = response, values_from = response, values_fn = \(.x) sum(!is.na(.x)), values_fill = 0)data.frame [5 x 6]
data.table:
6.6 Filling with lagging conditions:
Task: See this SO question.
Data:
ZIP <- structure(
list(
zipcode = c(1001, 1002, 1003, 1004, 1101, 1102, 1103, 1104, 1201, 1202, 1203, 1302),
areacode = c(4, 4, NA, 4, 4, 4, NA, 1, 4, 4, NA, 4),
type = structure(c(1L, 1L, NA, 1L, 2L, 2L, NA, 1L, 1L, 1L, NA, 1L), .Label = c("clay", "sand"), class = "factor"),
region = c(3, 3, NA, 3, 3, 3, NA, 3, 3, 3, NA, 3),
do_not_fill = c(1, NA, NA, 1, 1, NA, NA, 1, NA, NA, NA, 1)
),
class = c("data.table", "data.frame"), row.names = c(NA, -4L)
)Tidyverse:
as_tibble(ZIP) |>
mutate(type = as.character(type)) |>
mutate(
across(1:4, ~ ifelse(
is.na(.) & lag(areacode) == lead(areacode) &
lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))),
lag(.), .
)
)
)data.frame [12 x 5]
data.table:
ZIP[, c(lapply(.SD, \(v) {fifelse(
is.na(areacode) & lag(areacode) == lead(areacode) &
lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))), lag(v), v)}),
.SD[, .(do_not_fill)]), .SDcols = !patterns("do_not_fill")]data.table [12 x 5]
6.7 Join + Coalesce:
Task: Replace the missing dates from one dataset with the earliest date from another dataset, matching by ID:
Data:
(dt1 <- data.table::fread(
"
id x y z
1 A 1 NA
2 C 3 NA
3 C 3 NA
4 C 2 NA
5 B 2 2019-08-04
6 C 1 2019-09-18
7 B 3 2019-12-17
8 A 2 2019-11-02
9 A 3 2020-03-16
10 A 1 2020-01-31
"
))data.table [10 x 4]
(dt2 <- data.table::fread(
" id date
1 2012-09-25
1 2012-03-26
1 2012-11-12
2 2013-01-24
2 2012-05-04
2 2012-02-24
3 2012-05-30
3 2012-02-15
4 2012-03-13
4 2012-05-18
"))data.table [10 x 2]
Tidyverse:
Using coalesce:
left_join(
dt1,
dt2 |> group_by(id) |> summarize(date = min(date)),
by = "id"
) |> mutate(date = coalesce(z, date), z = NULL)data.table [10 x 4]
Using the rows_* functions:
dplyr::rows_patch(
dt1 |> rename(date = z),
dt2 |> group_by(id) |> summarize(date = min(date)),
by = "id"
)data.table [10 x 4]
data.table:
As a right join:
copy(dt2)[, .(date = min(date)), by = id
][dt1, on = "id"][, `:=`(date = fcoalesce(date, z), z = NULL)][]data.table [10 x 4]
As a left join:
6.8 Join on multiple columns (partial matching):
Task: Join both tables based on matching IDs, but the IDs are split between multiple columns in one table (id1 & id2).
(dt1 <- data.table(id = c("ABC", "AAA", "CBC"), x = 1:3))data.table [3 x 2]
(dt2 <- data.table(
id1 = c("ABC", "AA", "CB"),
id2 = c("AB", "AAA", "CBC"),
y = c(0.307, 0.144, 0.786))
)data.table [3 x 3]
Solution 1:
Combine the two ID columns into one with pivot_longer, then join:
dt2 |> pivot_longer(matches("^id"), names_to = NULL, values_to = "id") |> right_join(dt1)data.frame [3 x 3]
melt(dt2, measure.vars = patterns("^id"), value.name = "id")[, variable := NULL][dt1, on = "id"]data.table [3 x 3]
Solution 2:
Combine the two ID columns into one with unite + separate_rows, then join:
(From @TimTeaFan
dt2 |> unite("id", id1, id2, sep = "_") |> separate_rows("id") |> right_join(dt1)data.frame [3 x 3]
copy(dt2)[, id := paste(id1, id2, sep = "_")
][, c(V1 = strsplit(id, "_", fixed = TRUE), .SD), by = id
][, `:=`(id = V1, V1 = NULL, id1 = NULL, id2 = NULL)
][dt1, on = "id"]data.table [3 x 3]
Solution 3:
Join on one of the two columns (id2 here), and then fill in (patch) the missing values:
left_join(dt2, dt1, by = c("id2" = "id")) |>
rows_patch(rename(dt1, id1 = id), unmatched = "ignore")data.table [3 x 4]
6.9 Merging rows across multiple columns (every X rows):
Data:
(BANK <- data.table(
date = c("30 feb", "NA", "NA", "NA", "31 feb", "NA", "NA", "NA"),
description = c("Mary", "had a", "little", "lamb", "Twinkle", "twinkle", "little", "star"),
withdrawal = c("100", "NA", "NA", "NA", "NA", "NA", "NA", "NA"),
deposit = c("NA", "NA", "NA", "NA", "100", "NA", "NA", "NA")
)[, lapply(.SD, \(c) utils::type.convert(c, as.is = T))]
)data.table [8 x 4]
merge_and_convert <- function(v) {
utils::type.convert(v, as.is = T) |> na.omit() |>
paste(collapse = " ") |> utils::type.convert(as.is = T) |>
bind(x, ifelse(is.logical(x), as.integer(x), x))
}Tidyverse:
Solution 1:
mutate(BANK, ID = ceiling(seq_along(row_number())/4)) |>
group_by(ID) |>
summarize(across(everything(), \(m) merge_and_convert(m)))data.frame [2 x 5]
Solution 2:
summarize(BANK, across(
everything(),
\(c) sapply(split(c, ceiling(seq_along(c)/4)), \(m) merge_and_convert(m))
))data.frame [2 x 4]
data.table:
data.table [2 x 4]
6.10 Tagging successive events:
Tagging repeated blocks of events (aka run length encoding):
data.table [15 x 1]
data.table [15 x 2]
DAT |> mutate(ID = c(0, cumsum(diff(as.integer(factor(event))) != 0)) + 1)data.table [15 x 2]
Using data.table’s rleid() function:
DAT |> mutate(ID = data.table::rleid(event))data.table [15 x 2]
copy(DAT)[, ID := rleid(event)][]data.table [15 x 2]
7 Miscellaneous:
7.1 Keywords:
.SD
.I, .N
.GRP, .NGRP
.BY
.EACHI
7.2 Useful functions:
fsetdiff, fintersect, funion and fsetequal (apply to data.tables instead of vectors)
nafill, fcoalesce
as.IDate