Tidyverse <-> data.table

Equivalence between Tidyverse and data.table expressions

Data Manipulation
Tidyverse
data.table
R
Author

Marc-Aurèle Rivière

Published

May 19, 2022

Abstract
This document is a collection of notes I took while learning to use data.table, summarizing the equivalences between most dplyr/tidyr verbs and data.table.
This document is no longer updated

Please visit this page for a more up-to-date version of this post.

  • V1: 2022-05-19
  • V2: 2022-05-26
    • Improved the section on keys (for ordering & filtering)
    • Adding a section for translations of Tidyr (and other similar packages)
    • Capping tables to display 15 rows max when unfolded
    • Improving table display (stripping, hiding the contents of nested columns, …)
  • V3: 2022-07-20
    • Updating examples of dynamic programming based on the latest recommendations
    • Added new entries in processing examples
    • Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
    • Added pivot_wider examples to match the dcast ones in the Pivots section
    • Added some new examples here and there across the Basic Operations section
    • Added an entry for operating inside nested data.frames/data.tables
    • Added a processing example for run-length encoding (i.e. successive event tagging)
  • V4: 2022-08-05
    • Improved pivot section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with .value
    • Added tidyr::uncount() (row duplication) example
    • Improved both light & dark themes (code highlight, tables, …)

1 Setup


renv::install(
  c(
    "here",
    "Rdatatable/data.table",
    "tidyverse/dplyr",
    "tidyr",
    "pipebind",
    "stringr",
    "purrr",
    "lubridate",
    "broom"
  )
)
library(here)        # Project management

library(data.table)  # Data wrangling (>= 1.14.3)
library(dplyr)       # Data wrangling (>= 1.1.0)
library(tidyr)       # Data wrangling (extras) (>= 1.2.0)
library(pipebind)    # Piping goodies (>= 0.1.1)

library(stringr)     # Manipulating strings
library(purrr)       # Manipulating lists
library(lubridate)   # Manipulating dates

library(broom)

data.table::setDTthreads(parallel::detectCores(logical = TRUE))
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 20.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/Paris
 date     2022-09-24
 pandoc   2.19.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
 Quarto   1.1.251

─ Packages ───────────────────────────────────────────────────────────────────
 ! package    * version     date (UTC) lib source
 P broom      * 1.0.0       2022-07-01 [?] CRAN (R 4.2.0)
 P data.table * 1.14.3      2022-07-27 [?] Github (Rdatatable/data.table@c4a2085)
   dplyr      * 1.0.99.9000 2022-08-15 [1] Github (tidyverse/dplyr@d8294b4)
 P here       * 1.0.1       2020-12-13 [?] CRAN (R 4.2.0)
 P lubridate  * 1.8.0       2021-10-07 [?] CRAN (R 4.2.0)
   pipebind   * 0.1.1       2022-08-10 [1] CRAN (R 4.2.0)
 P purrr      * 0.3.4       2020-04-17 [?] CRAN (R 4.2.0)
 P stringr    * 1.4.0       2019-02-10 [2] CRAN (R 4.2.0)
 P tidyr      * 1.2.0       2022-02-01 [?] CRAN (R 4.2.0)

 [1] /home/mar/Dev/Projects/R/Misc/renv/library/R-4.2/x86_64-pc-linux-gnu
 [2] /home/mar/.cache/R/renv/library/Misc-f25fd835/R-4.2/x86_64-pc-linux-gnu
 [3] /usr/lib/R/library
 [4] /usr/local/lib/R/site-library
 [5] /usr/lib/R/site-library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

2 Basic Operations:


data.table general syntax:

DT[row selector (filter/sort), col selector (select/mutate/summarize/rename), modifiers (group)]

Data

MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]
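As a minimal sketch of how the three slots of the general syntax combine, the call below filters in i, summarizes in j, and groups with by. The names DT, mean_mpg and n are illustrative choices for this sketch, not part of the examples that follow.

```r
library(data.table)

# Illustrative copy of mtcars (the name DT is an assumption for this sketch)
DT <- as.data.table(mtcars)

# i : keep rows with 6 or more cylinders
# j : compute the mean mpg and the number of rows (.N)
# by: do so separately for each value of gear
res <- DT[cyl >= 6, .(mean_mpg = mean(mpg), n = .N), by = gear]
print(res)
```

This reads roughly like filter(), then group_by(), then summarise() in dplyr, written as a single call.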

2.1 Arrange / Order:

mtcars |> arrange(desc(cyl))
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
MT[order(-cyl)]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
setorder(MT, -cyl)[]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
MT[order(-cyl, gear)]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
[ omitted 17 entries ]

Ordering on a character column

IRIS[chorder(Species)]
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]

Ordering with keys

  • Setting a key physically reorders the dataset in RAM (by reference)
    • No extra memory is used for sorting (other than marking which columns form the key)
  • The dataset is marked with a “sorted” attribute
  • The dataset is always sorted in ascending order, with NAs first
  • Using keyby instead of by when grouping will set the grouping columns as keys

Tip

See this SO post for more information on keys.

setkey(MT, cyl, gear)

setkeyv(MT, c("cyl", "gear"))

MT
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
[ omitted 17 entries ]

To see over which keys (if any) the dataset is currently ordered:

haskey(MT)

[1] TRUE

key(MT)

[1] "cyl"  "gear"
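The properties of keys listed earlier can be checked directly. The sketch below works on an illustrative copy (DT) so that MT is left untouched; agg and mean_mpg are likewise names made up for this example.

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy, so MT is left untouched

setkey(DT, cyl, gear)        # sorts DT by reference, no assignment needed
attr(DT, "sorted")           # the key is stored in the "sorted" attribute
is.unsorted(DT$cyl)          # FALSE: rows are now in ascending cyl order

# keyby groups and also keys the result by the grouping columns
agg <- DT[, .(mean_mpg = mean(mpg)), keyby = .(am, carb)]
key(agg)
```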

Warning

Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.

Ordering with (secondary) indices

  • setindex creates an index for the provided columns, but doesn’t physically reorder the dataset in RAM.
  • It computes the ordering vector of the dataset’s rows according to the provided columns and stores it in an additional attribute called index.

setindex(MT, cyl, gear)

setindexv(MT, c("cyl", "gear"))

MT
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

We can see the additional index attribute added to the data.table:

names(attributes(MT))

[1] "names"             "row.names"         "class"            
[4] ".internal.selfref" "index"            

We can get the currently used indices with:

indices(MT)

[1] "cyl__gear"

Adding a new index doesn’t remove a previously existing one:

setindex(MT, hp)

indices(MT)

[1] "cyl__gear" "hp"

We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will frequently group or subset by.
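To make the contrast with keys concrete, here is a small sketch on an illustrative copy (DT, mpg_before and sub are names chosen for this example). It checks that the row order is untouched by setindex, and that a subset through on can then be expressed on the indexed columns.

```r
library(data.table)

DT <- as.data.table(mtcars)    # illustrative copy
mpg_before <- DT$mpg

setindex(DT, cyl, gear)
setindex(DT, hp)

identical(DT$mpg, mpg_before)  # TRUE: the row order is unchanged
indices(DT)                    # both indices coexist
haskey(DT)                     # FALSE: an index is not a key

# A subset through on= may reuse the existing cyl__gear index
sub <- DT[.(6, 4), on = c("cyl", "gear")]
```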

2.2 Subset / Filter:

mtcars |> filter(cyl >= 6 & disp < 180)
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
MT[cyl >= 6 & disp < 180]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6

Filter based on a range:

MT[disp %between% c(200, 300)]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
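As a quick check of what %between% does, the sketch below compares it with the written-out comparison and with the function form between(), whose incbounds argument controls whether the bounds are included. The object names are illustrative, and between() is namespaced because dplyr exports a function of the same name.

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy

a <- DT[disp %between% c(200, 300)]
b <- DT[disp >= 200 & disp <= 300]   # same rows, written out

# Function form: bounds are included by default, excluded with incbounds = FALSE
incl <- DT[data.table::between(disp, 225, 258)]
excl <- DT[data.table::between(disp, 225, 258, incbounds = FALSE)]
```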

Filtering on characters:

For exact (non-regex) matching, use %chin%, a version of %in% optimized for character vectors.

IRIS[Species %chin% c("setosa")]
data.table [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]
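A short sketch with more than one value, on an illustrative copy of iris (DT and keep are names chosen for this example), showing that %chin% selects the same rows as base R's %in% on a character column:

```r
library(data.table)

# Species is converted to character, as in the setup of this post
DT <- as.data.table(iris)[, Species := as.character(Species)]
keep <- c("setosa", "virginica")

a <- DT[Species %chin% keep]
b <- DT[Species %in% keep]   # base R equivalent
nrow(a)
```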

Filter with pattern:

For regex patterns, use %like% (or its function form, like()).

mtcars |> filter(str_detect(disp, "^\\d{3}\\."))
data.frame [9 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
MT[like(disp, "^\\d{3}\\.")]
data.table [9 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2

Alternatively:

MT[disp %like% "^\\d{3}\\."]
data.table [9 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
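data.table also ships case-insensitive and fixed-string variants of %like%. The sketch below, on an illustrative copy of iris, assumes a data.table version recent enough to provide %ilike% and %flike%.

```r
library(data.table)

DT <- as.data.table(iris)[, Species := as.character(Species)]

starts_v  <- DT[Species %like% "^v"]      # regex: versicolor and virginica
case_free <- DT[Species %ilike% "^SET"]   # regex, ignoring case
any_char  <- DT[Species %like% "."]       # regex: "." matches any character
literal   <- DT[Species %flike% "."]      # fixed string: a literal dot
```

The last two lines differ only in how the pattern is interpreted: as a regex, "." matches every row, while as a fixed string it matches none, since no species name contains a dot.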

Filter by keys

When keys or indices are defined, we can filter based on them, which is often a lot faster.

Tip

We do not even need to specify the names of the columns we are filtering on: the values are matched against the key columns in order.

setkey(MT, cyl)

MT[.(6)] # Equivalent to MT[cyl == 6]
data.table [7 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
setkey(MT, cyl, gear)

MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]
data.table [4 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
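The equivalence stated in the comments above can be verified on an illustrative copy (all object names below are made up for this sketch), along with two variations: supplying only the first key column, and supplying several values for it.

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy
setkey(DT, cyl, gear)

by_key  <- DT[.(6, 4)]
by_expr <- DT[cyl == 6 & gear == 4]
identical(by_key$mpg, by_expr$mpg)  # same rows, in the same order

six         <- DT[.(6)]        # only the first key column is supplied
four_or_six <- DT[.(c(4, 6))]  # several values for the first key column
```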

Filter by indices

To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).

IRIS["setosa", on = "Species"]
data.table [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]

Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.

Tip

When using on with multiple values, the nomatch = NULL argument drops the requested combinations that do not exist in the data (e.g. cyl == 5 here), instead of returning them as rows filled with NA.

MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]
data.table [12 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
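To see what nomatch = NULL changes, the sketch below runs the same subset with and without it on an illustrative copy (with_na and no_na are names chosen for this example).

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy

# Default (nomatch = NA): the unmatched (cyl = 5, gear = 4) request is kept
# as one row whose other columns are NA
with_na <- DT[.(4:6, 4), on = c("cyl", "gear")]

# nomatch = NULL: unmatched requests are dropped
no_na <- DT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]

nrow(with_na) - nrow(no_na)  # number of requests without any match
```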

Filter based on position:

dplyr::first(MT$cyl)

[1] 4

MT[, first(cyl)]

[1] 4

dplyr::last(MT$cyl)

[1] 8

MT[, last(cyl)]

[1] 8

dplyr::nth(MT$cyl, 5)

[1] 4

MT[5, cyl]

[1] 4
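The same helpers can be combined with by to get positions within each group. This is a small sketch on an illustrative copy, with only data.table attached, so first() and last() resolve to the data.table versions; first_mpg and last_mpg are made-up names.

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy, in the original row order

# First and last mpg value within each cyl group
res <- DT[, .(first_mpg = first(mpg), last_mpg = last(mpg)), by = cyl]
print(res)
```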

Distinct / Unique

mtcars |> distinct(mpg, hp, .keep_all = TRUE)
data.frame [31 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
[ omitted 16 entries ]
unique(MT, by = c("mpg", "hp"))
data.table [31 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
21 6 160 110 3.9 2.62 16.46 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
[ omitted 16 entries ]

N Distinct / Unique N

n_distinct(mtcars$gear)

[1] 3

uniqueN(MT, by = "gear")

[1] 3
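Both functions also accept several columns, and uniqueN can be used inside j to count distinct values per group. A minimal sketch on an illustrative copy (object names are assumptions of this example):

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy

n_pairs <- uniqueN(DT, by = c("cyl", "gear"))  # number of distinct (cyl, gear) pairs
pairs   <- unique(DT[, .(cyl, gear)])          # the pairs themselves

# Number of distinct gear values within each cyl group
per_cyl <- DT[, .(n_gear = uniqueN(gear)), by = cyl]
```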

Applying a filtering function on multiple columns

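Helper that flags the values whose decimal part has two or more digits. Note that whole numbers with two or more digits (e.g. 18 or 20) also pass, since they contain no decimal point to strip: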

decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) > 1

Manual solution:

mtcars |> filter(decp(drat) & decp(wt) & decp(qsec))
data.frame [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
MT[decp(drat) & decp(wt) & decp(qsec), ]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

Programmatically applying the method to the different columns:

cols <- c("drat", "wt", "qsec")
mtcars |> filter(if_all(all_of(cols), decp))
data.frame [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
MT[Reduce(`&`, lapply(mget(cols), decp)), ]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
MT[Reduce(`&`, lapply(MT[, ..cols], decp)), ]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

With the newer env meta-programming interface:

MT[Reduce(`&`, lapply(v1, decp)), env = list(v1 = as.list(cols))]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
MT[f1(`&`, f2(v1, decp)), env = list(f1 = "Reduce", f2 = "lapply", v1 = as.list(cols))]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

Note

We can’t use .SD in the i clause of a data.table, but we can bypass that constraint by doing the operation in two steps:

  • Computing a logical vector stating whether each row of the table matches the conditions
  • Filtering the original table based on that vector

MT[MT[, Reduce(`&`, lapply(.SD, decp)), .SDcols = cols]]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
MT[MT[, Reduce(`&`, lapply(.SD[, mget(cols)], decp))]]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
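The same two-step pattern extends to "at least one column matches" (the counterpart of dplyr's if_any) by reducing with `|` instead of `&`. The sketch below uses a deliberately simple, hypothetical predicate (above4) so the result can be compared with a hand-written filter.

```r
library(data.table)

DT <- as.data.table(mtcars)      # illustrative copy
cols <- c("drat", "wt")
above4 <- function(x) x > 4      # hypothetical predicate for this sketch

# Step 1: logical vector, TRUE when at least one of the columns matches
keep <- DT[, Reduce(`|`, lapply(.SD, above4)), .SDcols = cols]

# Step 2: filter the table with that vector
any_match <- DT[keep]

# Hand-written equivalent, for comparison
manual <- DT[drat > 4 | wt > 4]
```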

2.3 Select:

MT |> select(matches("mpg|disp"))
data.table [32 x 2]
mpg disp
21 160
21 160
22.8 108
21.4 258
18.7 360
18.1 225
14.3 360
24.4 146.7
22.8 140.8
19.2 167.6
17.8 167.6
16.4 275.8
17.3 275.8
15.2 275.8
10.4 472
[ omitted 17 entries ]
MT[, .(mpg, disp)]
data.table [32 x 2]
mpg disp
21 160
21 160
22.8 108
21.4 258
18.7 360
18.1 225
14.3 360
24.4 146.7
22.8 140.8
19.2 167.6
17.8 167.6
16.4 275.8
17.3 275.8
15.2 275.8
10.4 472
[ omitted 17 entries ]
MT[ , .SD, .SDcols = c("mpg", "disp")]
data.table [32 x 2]
mpg disp
21 160
21 160
22.8 108
21.4 258
18.7 360
18.1 225
14.3 360
24.4 146.7
22.8 140.8
19.2 167.6
17.8 167.6
16.4 275.8
17.3 275.8
15.2 275.8
10.4 472
[ omitted 17 entries ]
MT[, .SD, .SDcols = patterns("mpg|disp")]
data.table [32 x 2]
mpg disp
21 160
21 160
22.8 108
21.4 258
18.7 360
18.1 225
14.3 360
24.4 146.7
22.8 140.8
19.2 167.6
17.8 167.6
16.4 275.8
17.3 275.8
15.2 275.8
10.4 472
[ omitted 17 entries ]
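Columns can also be selected by range or by position. The sketch below shows three ways to get the first three columns of an illustrative copy (object names are made up for this example).

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy

by_range    <- DT[, mpg:disp]                  # from mpg to disp, in column order
by_position <- DT[, .SD, .SDcols = 1:3]        # numeric positions
by_pattern  <- DT[, .SD, .SDcols = patterns("^(mpg|cyl|disp)$")]

names(by_range)
```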

By dynamic name:

cols <- c("cyl", "disp")

mtcars |> select(all_of(cols))
data.frame [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]
mtcars |> select(!!cols)
data.frame [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]
copy(MT)[, ..cols]
data.table [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]
copy(MT)[, mget(cols)]
data.table [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]
copy(MT)[, cols, with = FALSE]
data.table [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]
copy(MT)[, j, env = list(j = as.list(cols))]
data.table [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]
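Dynamic selection is mostly useful inside functions. Below is a sketch of a hypothetical helper (select_cols is not part of data.table or of the original post) built on .SDcols, which takes the column names as a regular character vector.

```r
library(data.table)

# Hypothetical helper: return the columns of dt named in cols
select_cols <- function(dt, cols) {
  dt[, .SD, .SDcols = cols]
}

DT <- as.data.table(mtcars)  # illustrative copy
res <- select_cols(DT, c("cyl", "disp"))
names(res)

# A name that is not a column raises an error rather than being silently dropped
bad <- tryCatch(select_cols(DT, "not_a_column"), error = function(e) e)
inherits(bad, "error")
```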

Remove a column

mtcars |> select(-cyl)
data.frame [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, c("cyl") := NULL][]
data.table [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, !"cyl"] # MT[, -"cyl"]
data.table [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

By dynamic name:

col <- "cyl"

copy(MT)[, (col) := NULL][]
data.table [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, j := NULL, env = list(j = col)][]
data.table [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
cols <- c("cyl", "disp")

mtcars |> select(!all_of(cols))
data.frame [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, !..cols]
data.table [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, !cols, with = FALSE]
data.table [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, -j, env = list(j = I(cols))][]
data.table [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
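The examples above wrap MT in copy() because := NULL deletes columns by reference, whereas the negated selections return a new table. A minimal sketch of the difference, on an illustrative copy:

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy
cols <- c("cyl", "disp")

# Negated selection: returns a new table, DT itself is untouched
dropped <- DT[, !..cols]
ncol(dropped)  # 9
ncol(DT)       # 11

# := NULL: removes the columns from DT itself, no assignment needed
DT[, (cols) := NULL]
ncol(DT)       # 9
```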

By pattern:

mtcars |> select(-matches("^d"))
data.frame [32 x 9]
mpg cyl hp wt qsec vs am gear carb
21 6 110 2.62 16.46 0 1 4 4
21 6 110 2.875 17.02 0 1 4 4
22.8 4 93 2.32 18.61 1 1 4 1
21.4 6 110 3.215 19.44 1 0 3 1
18.7 8 175 3.44 17.02 0 0 3 2
18.1 6 105 3.46 20.22 1 0 3 1
14.3 8 245 3.57 15.84 0 0 3 4
24.4 4 62 3.19 20 1 0 4 2
22.8 4 95 3.15 22.9 1 0 4 2
19.2 6 123 3.44 18.3 1 0 4 4
17.8 6 123 3.44 18.9 1 0 4 4
16.4 8 180 4.07 17.4 0 0 3 3
17.3 8 180 3.73 17.6 0 0 3 3
15.2 8 180 3.78 18 0 0 3 3
10.4 8 205 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, .SD, .SDcols = !patterns("^d")]
data.table [32 x 9]
mpg cyl hp wt qsec vs am gear carb
21 6 110 2.62 16.46 0 1 4 4
21 6 110 2.875 17.02 0 1 4 4
22.8 4 93 2.32 18.61 1 1 4 1
21.4 6 110 3.215 19.44 1 0 3 1
18.7 8 175 3.44 17.02 0 0 3 2
18.1 6 105 3.46 20.22 1 0 3 1
14.3 8 245 3.57 15.84 0 0 3 4
24.4 4 62 3.19 20 1 0 4 2
22.8 4 95 3.15 22.9 1 0 4 2
19.2 6 123 3.44 18.3 1 0 4 4
17.8 6 123 3.44 18.9 1 0 4 4
16.4 8 180 4.07 17.4 0 0 3 3
17.3 8 180 3.73 17.6 0 0 3 3
15.2 8 180 3.78 18 0 0 3 3
10.4 8 205 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, grep("^d", colnames(MT)) := NULL][]
data.table [32 x 9]
mpg cyl hp wt qsec vs am gear carb
21 6 110 2.62 16.46 0 1 4 4
21 6 110 2.875 17.02 0 1 4 4
22.8 4 93 2.32 18.61 1 1 4 1
21.4 6 110 3.215 19.44 1 0 3 1
18.7 8 175 3.44 17.02 0 0 3 2
18.1 6 105 3.46 20.22 1 0 3 1
14.3 8 245 3.57 15.84 0 0 3 4
24.4 4 62 3.19 20 1 0 4 2
22.8 4 95 3.15 22.9 1 0 4 2
19.2 6 123 3.44 18.3 1 0 4 4
17.8 6 123 3.44 18.9 1 0 4 4
16.4 8 180 4.07 17.4 0 0 3 3
17.3 8 180 3.73 17.6 0 0 3 3
15.2 8 180 3.78 18 0 0 3 3
10.4 8 205 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

By type:

IRIS |> select(where(\(c) !is.numeric(c)))
data.table [150 x 1]
Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]
IRIS[, .SD, .SDcols = !is.numeric]
data.table [150 x 1]
Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]

Select + pull

mtcars |> pull(disp)
MT[, disp]

Select + rename

mtcars |> select(dispp = disp)
data.frame [32 x 1]
dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]
MT[, .(dispp = disp)]
data.table [32 x 1]
dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

2.4 Rename:

Manually:

mtcars |> rename(CYL = cyl, MPG = mpg)
data.frame [32 x 11]
MPG CYL disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
setnames(copy(MT), c("cyl", "mpg"), c("CYL", "MPG"))[]
data.table [32 x 11]
MPG CYL disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Programmatically:

mtcars |> rename_with(\(c) toupper(c), .cols = matches("^d"))
data.frame [32 x 11]
mpg cyl DISP hp DRAT wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
setnames(copy(MT), grep("^d", names(MT)), \(c) toupper(c))[]
data.table [32 x 11]
mpg cyl DISP hp DRAT wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

2.5 Mutate:

data.table can mutate in 2 ways:
- Using = inside .() (or list()) creates a new data.table containing only the new columns (like dplyr::transmute)
- Using := modifies the current data.table in place, by reference (like dplyr::mutate)

The result of the expression modifying a column should have the same length as the original column (or group).
If only one value is provided with :=, it will be recycled to the whole column/group.

If the number of values provided is smaller than the length of the original column/group:
- With :=, an error will be raised, asking to manually specify how to recycle the values.
- With =, it will behave like dplyr::summarize (if a grouping has been specified).
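
The rules above can be sketched on a small toy table. The table DT and its columns g and x are made up for this illustration and do not come from the examples in this post; the behaviour shown is what recent data.table versions should give.

```r
library(data.table)

# Hypothetical toy table, used only for this illustration
DT <- data.table(g = c("a", "a", "b", "b"), x = 1:4)

# `.()` with `=` returns a NEW table holding only the listed columns (transmute-like)
new_dt <- DT[, .(x2 = x * 2)]
names(new_dt)  # "x2"
names(DT)      # "g" "x": DT itself is untouched

# `:=` adds or modifies columns of DT itself, by reference (mutate-like)
DT[, x2 := x * 2]

# A single value is recycled to the whole column...
DT[, const := 0]
# ...or to each group when `by` is given
DT[, grp_max := max(x), by = g]

# A RHS that is neither length 1 nor the column/group length raises an error
res <- tryCatch(DT[, bad := 1:3], error = function(e) "error")

# With `.()` and a grouping, a shorter result behaves like summarize: one row per group
summ <- DT[, .(mean_x = mean(x)), by = g]
```

Wrapping the table in copy(), as done throughout this post, keeps the original untouched when using :=.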

2.5.1 Transmute:

MT[, .(cyl = cyl * 2)]
data.table [32 x 1]
cyl
12
12
8
12
16
12
16
8
8
12
12
16
16
16
16
[ omitted 17 entries ]

2.5.2 In-Place:

2.5.2.1 Single column:

mtcars |> mutate(cyl = 200)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 4 4
21 200 160 110 3.9 2.875 17.02 0 1 4 4
22.8 200 108 93 3.85 2.32 18.61 1 1 4 1
21.4 200 258 110 3.08 3.215 19.44 1 0 3 1
18.7 200 360 175 3.15 3.44 17.02 0 0 3 2
18.1 200 225 105 2.76 3.46 20.22 1 0 3 1
14.3 200 360 245 3.21 3.57 15.84 0 0 3 4
24.4 200 146.7 62 3.69 3.19 20 1 0 4 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 200 275.8 180 3.07 3.78 18 0 0 3 3
10.4 200 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, cyl := 200][]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 4 4
21 200 160 110 3.9 2.875 17.02 0 1 4 4
22.8 200 108 93 3.85 2.32 18.61 1 1 4 1
21.4 200 258 110 3.08 3.215 19.44 1 0 3 1
18.7 200 360 175 3.15 3.44 17.02 0 0 3 2
18.1 200 225 105 2.76 3.46 20.22 1 0 3 1
14.3 200 360 245 3.21 3.57 15.84 0 0 3 4
24.4 200 146.7 62 3.69 3.19 20 1 0 4 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 200 275.8 180 3.07 3.78 18 0 0 3 3
10.4 200 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Mutate a single column with a function:

mtcars |> mutate(mean_cyl = mean(cyl, na.rm = TRUE))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_cyl
21 6 160 110 3.9 2.62 16.46 0 1 4 4 6.188
21 6 160 110 3.9 2.875 17.02 0 1 4 4 6.188
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 6.188
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.188
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 6.188
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6.188
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 6.188
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 6.188
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 6.188
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6.188
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 6.188
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 6.188
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6.188
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 6.188
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 6.188
[ omitted 17 entries ]
copy(MT)[, mean_cyl := mean(cyl, na.rm = TRUE)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_cyl
21 6 160 110 3.9 2.62 16.46 0 1 4 4 6.188
21 6 160 110 3.9 2.875 17.02 0 1 4 4 6.188
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 6.188
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.188
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 6.188
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6.188
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 6.188
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 6.188
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 6.188
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6.188
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 6.188
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 6.188
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6.188
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 6.188
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 6.188
[ omitted 17 entries ]
copy(MT)[, `:=`(mean_cyl = mean(cyl, na.rm = TRUE))][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_cyl
21 6 160 110 3.9 2.62 16.46 0 1 4 4 6.188
21 6 160 110 3.9 2.875 17.02 0 1 4 4 6.188
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 6.188
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.188
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 6.188
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6.188
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 6.188
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 6.188
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 6.188
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6.188
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 6.188
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 6.188
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6.188
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 6.188
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 6.188
[ omitted 17 entries ]

Dynamic mutate:

Dynamic name on the LHS:

RHS <- "MPG" # Despite its name, this variable holds the name of the new column (i.e. what goes on the LHS)

mtcars |> mutate({{RHS}} := mean(mpg))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb MPG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
mtcars |> mutate("{RHS}" := mean(mpg))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb MPG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
copy(MT)[, (RHS) := mean(mpg)][] # (RHS) <=> c(RHS)
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb MPG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
copy(MT)[, j := mean(mpg), env = list(j = RHS)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb MPG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
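
The "(RHS) <=> c(RHS)" comment deserves a quick illustration: without the parentheses (or c()), data.table takes the symbol on the left of := literally as the name of the new column. A minimal sketch on a toy table (the names DT, x and new_name are assumptions made for this example):

```r
library(data.table)

DT <- data.table(x = c(1, 2, 3))  # hypothetical toy table
new_name <- "X_MEAN"              # column name stored in a variable

# Bare symbol: the column is literally called "new_name"
a <- copy(DT)[, new_name := mean(x)]
names(a)  # "x" "new_name"

# Wrapped in () or c(): the variable is evaluated, so the column is called "X_MEAN"
b <- copy(DT)[, (new_name) := mean(x)]
d <- copy(DT)[, c(new_name) := mean(x)]
names(b)  # "x" "X_MEAN"
```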

Dynamic name on both LHS & RHS:

data.table requires the use of base::get() on the RHS (or substituting the column names through the env argument)

LHS <- "MPG"
RHS <- "mpg"
mtcars |> mutate("{LHS}" := as.character(.data[[RHS]]))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb MPG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 21
21 6 160 110 3.9 2.875 17.02 0 1 4 4 21
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 22.8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 18.7
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 18.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 14.3
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 22.8
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 19.2
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 17.8
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 16.4
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 17.3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 15.2
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4
[ omitted 17 entries ]
mtcars |> mutate({{LHS}} := as.character(pick(everything())[[RHS]])) # pick() supersedes cur_data(), deprecated in dplyr 1.1.0
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb MPG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 21
21 6 160 110 3.9 2.875 17.02 0 1 4 4 21
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 22.8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 18.7
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 18.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 14.3
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 22.8
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 19.2
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 17.8
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 16.4
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 17.3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 15.2
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4
[ omitted 17 entries ]
copy(MT)[, c(LHS) := as.character(get(RHS))][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb MPG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 21
21 6 160 110 3.9 2.875 17.02 0 1 4 4 21
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 22.8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 18.7
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 18.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 14.3
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 22.8
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 19.2
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 17.8
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 16.4
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 17.3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 15.2
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4
[ omitted 17 entries ]
copy(MT)[, x := y, env = list(x = LHS, y = RHS)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb MPG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 21
21 6 160 110 3.9 2.875 17.02 0 1 4 4 21
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 22.8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 18.7
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 18.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 14.3
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 22.8
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 19.2
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 17.8
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 16.4
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 17.3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 15.2
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4
[ omitted 17 entries ]
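
Dynamic names on both sides come up naturally inside helper functions. A minimal sketch, assuming a hypothetical helper add_chr_copy() (it is not part of data.table nor of the examples above) that stores a character version of one column under a new name. It uses the get() approach, so it does not depend on the env argument being available:

```r
library(data.table)

# Hypothetical helper: stores as.character(<old>) in a new column named <new>.
# It works on a copy so the caller's table is not modified by reference.
add_chr_copy <- function(dt, new, old) {
  out <- copy(dt)
  out[, (new) := as.character(get(old))]  # (new): dynamic LHS, get(old): dynamic RHS
  out[]
}

MT <- as.data.table(mtcars)
res <- add_chr_copy(MT, new = "MPG", old = "mpg")
class(res$MPG)  # "character"
```

One caveat of get(): if the table happened to contain a column called old, that column would shadow the function argument, which is one reason the env interface can be preferable when it is available.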

Mutate based on multiple conditions:

if_else:

mtcars |> mutate(Size = if_else(cyl >= 6, "BIG", "small"))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb Size
21 6 160 110 3.9 2.62 16.46 0 1 4 4 BIG
21 6 160 110 3.9 2.875 17.02 0 1 4 4 BIG
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 small
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 BIG
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 BIG
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 BIG
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 BIG
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 small
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 small
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 BIG
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 BIG
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 BIG
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 BIG
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 BIG
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 BIG
[ omitted 17 entries ]
copy(MT)[, Size := fifelse(cyl >= 6, "BIG", "small")][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb Size
21 6 160 110 3.9 2.62 16.46 0 1 4 4 BIG
21 6 160 110 3.9 2.875 17.02 0 1 4 4 BIG
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 small
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 BIG
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 BIG
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 BIG
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 BIG
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 small
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 small
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 BIG
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 BIG
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 BIG
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 BIG
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 BIG
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 BIG
[ omitted 17 entries ]

case_when:

mtcars |> mutate(Size = case_when(
  cyl %between% c(2,4) ~ "small",
  cyl %between% c(4,8) ~ "BIG"
))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb Size
21 6 160 110 3.9 2.62 16.46 0 1 4 4 BIG
21 6 160 110 3.9 2.875 17.02 0 1 4 4 BIG
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 small
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 BIG
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 BIG
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 BIG
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 BIG
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 small
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 small
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 BIG
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 BIG
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 BIG
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 BIG
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 BIG
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 BIG
[ omitted 17 entries ]
copy(MT)[, Size := fcase(
  cyl %between% c(2,4), "small", 
  cyl %between% c(4,8), "BIG"
)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb Size
21 6 160 110 3.9 2.62 16.46 0 1 4 4 BIG
21 6 160 110 3.9 2.875 17.02 0 1 4 4 BIG
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 small
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 BIG
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 BIG
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 BIG
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 BIG
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 small
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 small
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 BIG
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 BIG
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 BIG
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 BIG
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 BIG
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 BIG
[ omitted 17 entries ]
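
In the example above, the two ranges overlap at cyl == 4: conditions are tested in order and the first one that is TRUE wins, which is why those rows end up as "small". Rows matching no condition get NA unless a default is given. A minimal sketch on a made-up table (dt and the "other" label are assumptions for this illustration):

```r
library(data.table)

dt <- data.table(cyl = c(4, 6, 8, 10))  # hypothetical toy table

# First matching condition wins; `default` covers rows that match nothing
dt[, Size := fcase(
  cyl %between% c(2, 4), "small",
  cyl %between% c(4, 8), "BIG",
  default = "other"
)]

# Without `default`, unmatched rows are NA
dt[, Size_na := fcase(
  cyl %between% c(2, 4), "small",
  cyl %between% c(4, 8), "BIG"
)]
```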

Mutate only if condition is met:

All the rows are kept: only the ones meeting the condition provided in i are mutated. When the column is new, the other rows are filled with NA.

Note: this can be extended to mutating multiple columns.

mtcars |> mutate(BIG = case_when(am == 1 ~ cyl >= 6))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb BIG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 TRUE
21 6 160 110 3.9 2.875 17.02 0 1 4 4 TRUE
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 FALSE
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 NA
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 NA
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 NA
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 NA
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 NA
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 NA
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 NA
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 NA
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 NA
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 NA
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 NA
[ omitted 17 entries ]
# mtcars |> mutate(BIG = cyl >= 6, .when = am == 1) # Hypothetical syntax: mutate() has no such argument as of dplyr 1.1.0
copy(MT)[am == 1, BIG := cyl >= 6][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb BIG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 TRUE
21 6 160 110 3.9 2.875 17.02 0 1 4 4 TRUE
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 FALSE
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 NA
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 NA
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 NA
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 NA
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 NA
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 NA
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 NA
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 NA
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 NA
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 NA
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 NA
[ omitted 17 entries ]
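
A minimal sketch of the same idea applied to several columns at once, and to a column that already exists. The toy table DT and the column cyl2 are made up for this illustration:

```r
library(data.table)

# Hypothetical toy table, used only for this illustration
DT <- data.table(am = c(1, 0, 1, 0), cyl = c(6, 8, 4, 6))

# Only the rows where am == 1 are assigned; both new columns are NA elsewhere
DT[am == 1, `:=`(BIG = cyl >= 6, cyl2 = cyl * 2)]

# An existing column keeps its previous values in the rows that do not match
DT[am == 0, cyl := 0]
```

After these two calls, BIG and cyl2 are only filled for the am == 1 rows, while cyl is zeroed only for the am == 0 rows.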

Lag / Lead

mtcars |> mutate(gear1 = lead(gear))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb gear1
21 6 160 110 3.9 2.62 16.46 0 1 4 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 3
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3
[ omitted 17 entries ]
copy(MT)[, gear1 := shift(gear, 1, type = "lead")][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb gear1
21 6 160 110 3.9 2.62 16.46 0 1 4 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 3
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3
[ omitted 17 entries ]

2.5.2.2 Mutate multiple columns:

mtcars |> mutate(cyl = 200, gear = 5)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 5 4
21 200 160 110 3.9 2.875 17.02 0 1 5 4
22.8 200 108 93 3.85 2.32 18.61 1 1 5 1
21.4 200 258 110 3.08 3.215 19.44 1 0 5 1
18.7 200 360 175 3.15 3.44 17.02 0 0 5 2
18.1 200 225 105 2.76 3.46 20.22 1 0 5 1
14.3 200 360 245 3.21 3.57 15.84 0 0 5 4
24.4 200 146.7 62 3.69 3.19 20 1 0 5 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 5 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 5 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 5 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 5 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 5 3
15.2 200 275.8 180 3.07 3.78 18 0 0 5 3
10.4 200 472 205 2.93 5.25 17.98 0 0 5 4
[ omitted 17 entries ]
copy(MT)[, `:=`(cyl = 200, gear = 5)][]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 5 4
21 200 160 110 3.9 2.875 17.02 0 1 5 4
22.8 200 108 93 3.85 2.32 18.61 1 1 5 1
21.4 200 258 110 3.08 3.215 19.44 1 0 5 1
18.7 200 360 175 3.15 3.44 17.02 0 0 5 2
18.1 200 225 105 2.76 3.46 20.22 1 0 5 1
14.3 200 360 245 3.21 3.57 15.84 0 0 5 4
24.4 200 146.7 62 3.69 3.19 20 1 0 5 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 5 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 5 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 5 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 5 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 5 3
15.2 200 275.8 180 3.07 3.78 18 0 0 5 3
10.4 200 472 205 2.93 5.25 17.98 0 0 5 4
[ omitted 17 entries ]
copy(MT)[, c("cyl", "gear") := list(200, 5)][]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 5 4
21 200 160 110 3.9 2.875 17.02 0 1 5 4
22.8 200 108 93 3.85 2.32 18.61 1 1 5 1
21.4 200 258 110 3.08 3.215 19.44 1 0 5 1
18.7 200 360 175 3.15 3.44 17.02 0 0 5 2
18.1 200 225 105 2.76 3.46 20.22 1 0 5 1
14.3 200 360 245 3.21 3.57 15.84 0 0 5 4
24.4 200 146.7 62 3.69 3.19 20 1 0 5 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 5 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 5 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 5 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 5 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 5 3
15.2 200 275.8 180 3.07 3.78 18 0 0 5 3
10.4 200 472 205 2.93 5.25 17.98 0 0 5 4
[ omitted 17 entries ]

One function applied to multiple columns (across rows):

mtcars |> mutate(across(c("mpg", "disp"), \(c) min(c), .names = "min_{col}"))
data.frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1
[ omitted 17 entries ]
copy(MT)[, c("min_mpg", "min_disp") := lapply(.SD, \(c) min(c)), .SDcols = c("mpg", "disp")][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1
[ omitted 17 entries ]

With dynamic naming:

new <- c("min_mpg", "min_disp")
old <- c("mpg", "disp")

copy(MT)[, c(new) := lapply(mget(old), min)][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1
[ omitted 17 entries ]
copy(MT)[, c(new) := lapply(x, min), env = list(x = as.list(setNames(nm = old)))][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1
[ omitted 17 entries ]

Multiple functions on one column (across rows):

copy(MT)[, c("min_mpg", "max_mpg") := list(min(c(mpg)), max(c(mpg)))][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]
copy(MT)[, `:=`(min_mpg = min(c(mpg)), max_mpg = max(c(mpg)))][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]
copy(MT)[, c("min_mpg", "max_mpg") := lapply(.SD, \(x) list(min(x), max(x))) |> rbindlist(), .SDcols = "mpg"][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]
copy(MT)[, c("min_mpg", "max_mpg") := lapply(.SD[, .(mpg)], \(x) list(min(x), max(x))) |> rbindlist()][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]
copy(MT)[, c("min_mpg", "max_mpg") := lapply(.(mpg), \(x) list(min(x), max(x))) |> do.call(rbind, args = _)][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]

One function applied to multiple columns (across columns)

mtcars |> rowwise() |> mutate(RowSum = sum(c_across(where(is.numeric)))) |> ungroup()
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb RowSum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 728.56
[ omitted 17 entries ]
copy(MT)[, RowSum := rowSums(.SD), .SDcols = is.numeric][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb RowSum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 728.56
[ omitted 17 entries ]

More general option using row-wise apply:

copy(MT)[, RowMean := apply(.SD, 1, \(x) mean(x)), .SDcols = is.numeric][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb RowMean
21 6 160 110 3.9 2.62 16.46 0 1 4 4 29.907
21 6 160 110 3.9 2.875 17.02 0 1 4 4 29.981
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 23.598
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 38.74
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 53.665
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 35.049
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 59.72
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.635
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 27.234
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 31.86
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 31.787
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 46.431
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 46.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 46.35
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 66.233
[ omitted 17 entries ]

Multiple functions applied to multiple columns (row-wise)

copy(MT)[, c("row_mean", "row_sum") := apply(.SD, 1, \(x) list(mean(x), sum(x))) |> rbindlist(), .SDcols = is.numeric][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb row_mean row_sum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 29.907 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 29.981 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 23.598 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 38.74 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 53.665 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 35.049 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 59.72 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.635 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 27.234 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 31.86 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 31.787 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 46.431 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 46.5 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 46.35 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 66.233 728.56
[ omitted 17 entries ]

Evaluate an arbitrary block of code (wrapped in braces) inside the DT:

MT[, {
    print(summary(mpg))
    x <- cyl + gear
    .(RN = 1:.N, CG = x)
  }
]

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90

data.table [32 x 2]
RN CG
1 10
2 10
3 8
4 9
5 11
6 9
7 11
8 8
9 8
10 10
11 10
12 11
13 11
14 11
15 11
[ omitted 17 entries ]
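
As a small sketch of the same idea: everything inside the braces is evaluated in order, and only the value of the last expression becomes the result. The object MT is assumed here to be mtcars converted with as.data.table(), as the outputs in this post suggest; the names rng and scaled are made up for the illustration.

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

res <- MT[, {
  rng <- range(mpg)                      # intermediate value, used only inside the block
  scaled <- (mpg - rng[1]) / diff(rng)   # rescale mpg to the [0, 1] interval
  .(cyl, mpg_scaled = scaled)            # the last expression is what gets returned
}]

names(res)
```

Intermediate results do not have to be stored as columns first, which keeps multi-step computations in a single call.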

2.6 Group / Aggregate:

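The examples below apply a grouping without computing anything (.SD simply returns all columns as is). Unlike group_by(), data.table returns the rows gathered by group, with the grouping columns moved first.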

One group:

mtcars |> group_by(cyl)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, .SD, by = cyl]
data.table [32 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
6 21 160 110 3.9 2.62 16.46 0 1 4 4
6 21 160 110 3.9 2.875 17.02 0 1 4 4
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
6 17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
[ omitted 17 entries ]

Multiple groups:

MT[, .SD, by = .(cyl, gear)]
data.table [32 x 11]
cyl gear mpg disp hp drat wt qsec vs am carb
6 4 21 160 110 3.9 2.62 16.46 0 1 4
6 4 21 160 110 3.9 2.875 17.02 0 1 4
6 4 19.2 167.6 123 3.92 3.44 18.3 1 0 4
6 4 17.8 167.6 123 3.92 3.44 18.9 1 0 4
4 4 22.8 108 93 3.85 2.32 18.61 1 1 1
4 4 24.4 146.7 62 3.69 3.19 20 1 0 2
4 4 22.8 140.8 95 3.92 3.15 22.9 1 0 2
4 4 32.4 78.7 66 4.08 2.2 19.47 1 1 1
4 4 30.4 75.7 52 4.93 1.615 18.52 1 1 2
4 4 33.9 71.1 65 4.22 1.835 19.9 1 1 1
4 4 27.3 79 66 4.08 1.935 18.9 1 1 1
4 4 21.4 121 109 4.11 2.78 18.6 1 1 2
6 3 21.4 258 110 3.08 3.215 19.44 1 0 1
6 3 18.1 225 105 2.76 3.46 20.22 1 0 1
8 3 18.7 360 175 3.15 3.44 17.02 0 0 2
[ omitted 17 entries ]

Dynamic grouping:

cols <- c("cyl", "disp")

mtcars |> group_by(across(any_of(cols)))
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, .SD, by = cols]
data.table [32 x 11]
cyl disp mpg hp drat wt qsec vs am gear carb
6 160 21 110 3.9 2.62 16.46 0 1 4 4
6 160 21 110 3.9 2.875 17.02 0 1 4 4
4 108 22.8 93 3.85 2.32 18.61 1 1 4 1
6 258 21.4 110 3.08 3.215 19.44 1 0 3 1
8 360 18.7 175 3.15 3.44 17.02 0 0 3 2
8 360 14.3 245 3.21 3.57 15.84 0 0 3 4
6 225 18.1 105 2.76 3.46 20.22 1 0 3 1
4 146.7 24.4 62 3.69 3.19 20 1 0 4 2
4 140.8 22.8 95 3.92 3.15 22.9 1 0 4 2
6 167.6 19.2 123 3.92 3.44 18.3 1 0 4 4
6 167.6 17.8 123 3.92 3.44 18.9 1 0 4 4
8 275.8 16.4 180 3.07 4.07 17.4 0 0 3 3
8 275.8 17.3 180 3.07 3.73 17.6 0 0 3 3
8 275.8 15.2 180 3.07 3.78 18 0 0 3 3
8 472 10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

With potentially absent columns:

cols <- c("cyl", "disp", "fake_col")

mtcars |> group_by(across(any_of(cols)))
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, .SD, by = intersect(cols, colnames(MT))]
data.table [32 x 11]
cyl disp mpg hp drat wt qsec vs am gear carb
6 160 21 110 3.9 2.62 16.46 0 1 4 4
6 160 21 110 3.9 2.875 17.02 0 1 4 4
4 108 22.8 93 3.85 2.32 18.61 1 1 4 1
6 258 21.4 110 3.08 3.215 19.44 1 0 3 1
8 360 18.7 175 3.15 3.44 17.02 0 0 3 2
8 360 14.3 245 3.21 3.57 15.84 0 0 3 4
6 225 18.1 105 2.76 3.46 20.22 1 0 3 1
4 146.7 24.4 62 3.69 3.19 20 1 0 4 2
4 140.8 22.8 95 3.92 3.15 22.9 1 0 4 2
6 167.6 19.2 123 3.92 3.44 18.3 1 0 4 4
6 167.6 17.8 123 3.92 3.44 18.9 1 0 4 4
8 275.8 16.4 180 3.07 4.07 17.4 0 0 3 3
8 275.8 17.3 180 3.07 3.73 17.6 0 0 3 3
8 275.8 15.2 180 3.07 3.78 18 0 0 3 3
8 472 10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Getting the current group name:

Use the .BY special symbol (a list holding the current value of each grouping variable):

mtcars |> group_by(cyl) |> 
  group_walk(
    \(d, g) with(d, plot(gear, mpg, main = paste("Cylinders:", g$cyl)))
  )

MT[, with(.SD, plot(gear, mpg, main = paste("Cylinders:", .BY))), by = cyl] -> void
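
Since .BY is a list with one element per grouping variable, individual values can be picked by name when grouping on several columns. A minimal sketch, assuming MT is as.data.table(mtcars); the label format is only an illustration:

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

# Build one text label per (cyl, gear) group from the elements of .BY
labels <- MT[
  , .(label = paste0("cyl=", .BY$cyl, ", gear=", .BY$gear), n = .N),
  by = .(cyl, gear)
]

labels
```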

2.7 Row numbers & indices:

.I: Row indices
.N: Number of rows

.GRP: Group indices
.NGRP: Number of groups
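
The four symbols can be combined in a single grouped call, which may help to see what each one refers to. A minimal sketch, assuming MT is as.data.table(mtcars); the output column names are arbitrary:

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

res <- MT[
  , .(
    first_row = .I[1],   # row index (in MT) of the first row of the group
    n_rows    = .N,      # number of rows in the group
    grp       = .GRP,    # index of the current group
    n_grp     = .NGRP    # total number of groups
  ),
  keyby = cyl
]

res
```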

Getting row indices:

MT[, .I]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32

Adding row indices:

mtcars |> mutate(I = row_number())
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 5
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 7
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 8
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 11
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 12
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 13
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 14
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 15
[ omitted 17 entries ]
copy(MT)[ , I := .I][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 5
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 7
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 8
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 11
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 12
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 13
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 14
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 15
[ omitted 17 entries ]

Getting row indices (after filtering):

Important

.I gives the vector of row numbers after any subsetting/filtering has been done

Returns the row numbers in the original dataset:

mtcars |> mutate(I = row_number()) |> filter(gear == 4) |> pull(I)

[1] 1 2 3 8 9 10 11 18 19 20 26 32

MT[, .I[gear == 4]]

[1] 1 2 3 8 9 10 11 18 19 20 26 32

Returns the row numbers in the new dataset (after filtering):

mtcars |> filter(gear == 4) |> mutate(I = row_number()) |> pull(I)

[1] 1 2 3 4 5 6 7 8 9 10 11 12

MT[gear == 4, .I]

[1] 1 2 3 4 5 6 7 8 9 10 11 12
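
When the original row numbers of the matching rows are all that is needed, the which argument of the data.table subset is another option. A minimal sketch, assuming MT is as.data.table(mtcars):

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

# which = TRUE returns the row numbers matched by i instead of the rows themselves
idx_which <- MT[gear == 4, which = TRUE]

# Same positions as indexing .I with the condition inside j
idx_I <- MT[, .I[gear == 4]]

identical(idx_which, idx_I)
```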

Getting the row numbers of specific observations:

Row number of the first and last observation of each group:

mtcars |> group_by(cyl) |> reframe(I = cur_group_rows()[c(1, n())])
data.frame [6 x 2]
cyl I
4 3
4 32
6 1
6 30
8 5
8 31
MT[, .I[c(1, .N)], keyby = cyl]
data.table [6 x 2]
cyl V1
4 3
4 32
6 1
6 30
8 5
8 31

Keeping all other columns:

mtcars |> mutate(I = row_number()) |> group_by(cyl) |> slice(c(1, n())) |> ungroup()
data.frame [6 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 32
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 30
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 5
15 8 301 335 3.54 3.57 14.6 0 1 5 8 31
copy(MT)[, I := .I][, .SD[c(1, .N)], keyby = cyl]
data.table [6 x 12]
cyl mpg disp hp drat wt qsec vs am gear carb I
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1 3
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2 32
6 21 160 110 3.9 2.62 16.46 0 1 4 4 1
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 30
8 18.7 360 175 3.15 3.44 17.02 0 0 3 2 5
8 15 301 335 3.54 3.57 14.6 0 1 5 8 31

Filtering based on row numbers:

mtcars |> tail(10)
data.frame [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
MT[(.N-9):.N] # Get the last 10 rows
data.table [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
MT[MT[, .I[(.N-9):.N]]]
data.table [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

(Gets the indices of the last 10 rows and filters based on them)
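
Integer ranges in R include both endpoints, so the last k rows correspond to (.N - k + 1):.N. A minimal sketch that checks this against tail(), assuming MT is as.data.table(mtcars); the variable name k is illustrative:

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

k <- 10

# Both endpoints are included, hence the "+ 1"
last_k <- MT[(.N - k + 1):.N]

nrow(last_k)
all.equal(last_k, tail(MT, k))
```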

Adding group indices:

mtcars |> group_by(cyl) |> summarize(GRP = cur_group_id())
data.frame [3 x 2]
cyl GRP
4 1
6 2
8 3
MT[, .GRP, by = cyl]
data.table [3 x 2]
cyl GRP
6 1
4 2
8 3

Mutate instead of summarize:

mtcars |> arrange(cyl) |> group_by(cyl) |> mutate(GRP = cur_group_id())
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb GRP
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 1
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 1
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2 1
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 1
21 6 160 110 3.9 2.62 16.46 0 1 4 4 2
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 2
[ omitted 17 entries ]
copy(MT)[, GRP := .GRP, keyby = cyl][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb GRP
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 1
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 1
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2 1
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 1
21 6 160 110 3.9 2.62 16.46 0 1 4 4 2
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 2
[ omitted 17 entries ]

Row numbers by group:

mtcars |> arrange(gear) |> group_by(gear) |> mutate(I_GRP = row_number())
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I_GRP
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 5
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 7
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 8
10.4 8 460 215 3 5.424 17.82 0 0 3 4 9
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 10
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1 11
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2 12
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2 13
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 14
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2 15
[ omitted 17 entries ]
copy(MT)[, I_GRP := 1:.N, keyby = gear][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I_GRP
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 5
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 7
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 8
10.4 8 460 215 3 5.424 17.82 0 0 3 4 9
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 10
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1 11
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2 12
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2 13
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 14
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2 15
[ omitted 17 entries ]
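
data.table also provides rowid(), which numbers rows within groups without a by clause and without reordering the table. A minimal sketch, assuming MT is as.data.table(mtcars), compared with a seq_len(.N) version:

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

# Within-group row number, original row order preserved
a <- copy(MT)[, I_GRP := rowid(gear)]

# Same numbering with an explicit grouping (by, not keyby, so no reordering)
b <- copy(MT)[, I_GRP := seq_len(.N), by = gear]

identical(a$I_GRP, b$I_GRP)
```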

Random sample:

mtcars |> slice_sample(n = 5)
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
MT[sample(.N, 5)]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
10.4 8 460 215 3 5.424 17.82 0 0 3 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4

Sample by group:

mtcars |> group_by(cyl) |> slice_sample(n = 5)
data.frame [15 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
15 8 301 335 3.54 3.57 14.6 0 1 5 8
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
MT[, .SD[sample(.N, 5)], keyby = cyl]
data.table [15 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2
6 21 160 110 3.9 2.62 16.46 0 1 4 4
6 21 160 110 3.9 2.875 17.02 0 1 4 4
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
8 13.3 350 245 3.73 3.84 15.41 0 0 3 4
8 14.7 440 230 3.23 5.345 17.42 0 0 3 4
8 19.2 400 175 3.08 3.845 17.05 0 0 3 2
8 15.8 351 264 4.22 3.17 14.5 0 1 5 4
8 17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
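
Two practical details when sampling by group: sample() fails when asked for more rows than a group contains, and results change between runs unless a seed is set. A minimal sketch, assuming MT is as.data.table(mtcars); the sample size of 8 is chosen only because the 6-cylinder group has 7 rows:

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

set.seed(42)  # make the draw reproducible

# Cap the sample size at the group size to avoid an error on small groups
res <- MT[, .SD[sample(.N, min(.N, 8))], keyby = cyl]

res[, .N, keyby = cyl]
```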

Filter by group size:

mtcars |> group_by(cyl) |> filter(n() >= 8)
data.frame [25 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
[ omitted 10 entries ]
MT[, if(.N >= 8) .SD, by = cyl]
data.table [25 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2
8 18.7 360 175 3.15 3.44 17.02 0 0 3 2
8 14.3 360 245 3.21 3.57 15.84 0 0 3 4
8 16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
8 17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
[ omitted 10 entries ]
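
If the original row and column order should be kept (as dplyr does), one possible variant is to collect the row indices of the qualifying groups and subset with them. A minimal sketch, assuming MT is as.data.table(mtcars):

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

# Row indices of groups with at least 8 rows (empty for the other groups)
idx <- MT[, .I[.N >= 8], by = cyl]$V1

# Sorting the indices restores the original row order
kept <- MT[sort(idx)]

dim(kept)
```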

2.8 Relocate:

mtcars |> group_by(cyl) |> mutate(GRP = cur_group_id(), .before = 1)
data.frame [32 x 12]
GRP mpg cyl disp hp drat wt qsec vs am gear carb
2 21 6 160 110 3.9 2.62 16.46 0 1 4 4
2 21 6 160 110 3.9 2.875 17.02 0 1 4 4
1 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
2 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
3 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
2 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
3 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
1 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
1 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
2 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
2 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
3 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
3 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
3 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
3 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
(copy(MT)[, GRP := .GRP, by = cyl] |> setcolorder("GRP"))[] # columns not listed keep their order after GRP
data.table [32 x 12]
GRP mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
1 21 6 160 110 3.9 2.875 17.02 0 1 4 4
2 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
3 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
1 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
3 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
2 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
2 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
1 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
1 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
3 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
3 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
3 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
3 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Ordering by column names

mtcars |> select(sort(tidyselect::peek_vars()))
data.frame [32 x 11]
am carb cyl disp drat gear hp mpg qsec vs wt
1 4 6 160 3.9 4 110 21 16.46 0 2.62
1 4 6 160 3.9 4 110 21 17.02 0 2.875
1 1 4 108 3.85 4 93 22.8 18.61 1 2.32
0 1 6 258 3.08 3 110 21.4 19.44 1 3.215
0 2 8 360 3.15 3 175 18.7 17.02 0 3.44
0 1 6 225 2.76 3 105 18.1 20.22 1 3.46
0 4 8 360 3.21 3 245 14.3 15.84 0 3.57
0 2 4 146.7 3.69 4 62 24.4 20 1 3.19
0 2 4 140.8 3.92 4 95 22.8 22.9 1 3.15
0 4 6 167.6 3.92 4 123 19.2 18.3 1 3.44
0 4 6 167.6 3.92 4 123 17.8 18.9 1 3.44
0 3 8 275.8 3.07 3 180 16.4 17.4 0 4.07
0 3 8 275.8 3.07 3 180 17.3 17.6 0 3.73
0 3 8 275.8 3.07 3 180 15.2 18 0 3.78
0 4 8 472 2.93 3 205 10.4 17.98 0 5.25
[ omitted 17 entries ]
setcolorder(copy(MT), sort(names(MT)))[]
data.table [32 x 11]
am carb cyl disp drat gear hp mpg qsec vs wt
1 4 6 160 3.9 4 110 21 16.46 0 2.62
1 4 6 160 3.9 4 110 21 17.02 0 2.875
1 1 4 108 3.85 4 93 22.8 18.61 1 2.32
0 1 6 258 3.08 3 110 21.4 19.44 1 3.215
0 2 8 360 3.15 3 175 18.7 17.02 0 3.44
0 1 6 225 2.76 3 105 18.1 20.22 1 3.46
0 4 8 360 3.21 3 245 14.3 15.84 0 3.57
0 2 4 146.7 3.69 4 62 24.4 20 1 3.19
0 2 4 140.8 3.92 4 95 22.8 22.9 1 3.15
0 4 6 167.6 3.92 4 123 19.2 18.3 1 3.44
0 4 6 167.6 3.92 4 123 17.8 18.9 1 3.44
0 3 8 275.8 3.07 3 180 16.4 17.4 0 4.07
0 3 8 275.8 3.07 3 180 17.3 17.6 0 3.73
0 3 8 275.8 3.07 3 180 15.2 18 0 3.78
0 4 8 472 2.93 3 205 10.4 17.98 0 5.25
[ omitted 17 entries ]
mtcars |> select(carb, sort(tidyselect::peek_vars()))
data.frame [32 x 11]
carb am cyl disp drat gear hp mpg qsec vs wt
4 1 6 160 3.9 4 110 21 16.46 0 2.62
4 1 6 160 3.9 4 110 21 17.02 0 2.875
1 1 4 108 3.85 4 93 22.8 18.61 1 2.32
1 0 6 258 3.08 3 110 21.4 19.44 1 3.215
2 0 8 360 3.15 3 175 18.7 17.02 0 3.44
1 0 6 225 2.76 3 105 18.1 20.22 1 3.46
4 0 8 360 3.21 3 245 14.3 15.84 0 3.57
2 0 4 146.7 3.69 4 62 24.4 20 1 3.19
2 0 4 140.8 3.92 4 95 22.8 22.9 1 3.15
4 0 6 167.6 3.92 4 123 19.2 18.3 1 3.44
4 0 6 167.6 3.92 4 123 17.8 18.9 1 3.44
3 0 8 275.8 3.07 3 180 16.4 17.4 0 4.07
3 0 8 275.8 3.07 3 180 17.3 17.6 0 3.73
3 0 8 275.8 3.07 3 180 15.2 18 0 3.78
4 0 8 472 2.93 3 205 10.4 17.98 0 5.25
[ omitted 17 entries ]
setcolorder(copy(MT), c("carb", sort(setdiff(names(MT), "carb"))))[]
data.table [32 x 11]
carb am cyl disp drat gear hp mpg qsec vs wt
4 1 6 160 3.9 4 110 21 16.46 0 2.62
4 1 6 160 3.9 4 110 21 17.02 0 2.875
1 1 4 108 3.85 4 93 22.8 18.61 1 2.32
1 0 6 258 3.08 3 110 21.4 19.44 1 3.215
2 0 8 360 3.15 3 175 18.7 17.02 0 3.44
1 0 6 225 2.76 3 105 18.1 20.22 1 3.46
4 0 8 360 3.21 3 245 14.3 15.84 0 3.57
2 0 4 146.7 3.69 4 62 24.4 20 1 3.19
2 0 4 140.8 3.92 4 95 22.8 22.9 1 3.15
4 0 6 167.6 3.92 4 123 19.2 18.3 1 3.44
4 0 6 167.6 3.92 4 123 17.8 18.9 1 3.44
3 0 8 275.8 3.07 3 180 16.4 17.4 0 4.07
3 0 8 275.8 3.07 3 180 17.3 17.6 0 3.73
3 0 8 275.8 3.07 3 180 15.2 18 0 3.78
4 0 8 472 2.93 3 205 10.4 17.98 0 5.25
[ omitted 17 entries ]

2.9 Summarize:

Summarizing uses the = operator inside .() (or list()), rather than :=.
Its only difference from mutate is that the expression returns a list of values shorter than the original column (or group) size.
By default, only the computed columns (plus any grouping columns) are kept, like transmute.
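
The contrast can be seen by running the same computation both ways. A minimal sketch, assuming MT is as.data.table(mtcars):

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

# := adds a column by reference: every row and column is kept
mutated <- copy(MT)[, mean_mpg := mean(mpg), by = cyl]

# .( = ) returns a new table: one row per group, only the listed columns
summarized <- MT[, .(mean_mpg = mean(mpg)), by = cyl]

dim(mutated)
dim(summarized)
```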

mtcars |> summarize(mean_cyl = mean(cyl, na.rm = T))
data.frame [1 x 1]
mean_cyl
6.188
MT[, .(mean_cyl = mean(cyl, na.rm = T))]
data.table [1 x 1]
mean_cyl
6.188

Group > summarize

mtcars |> group_by(cyl) |> summarize(N = n())
data.frame [3 x 2]
cyl N
4 11
6 7
8 14
MT[, .N, by = cyl]
data.table [3 x 2]
cyl N
6 7
4 11
8 14

dplyr automatically arranges the result by the grouping factor.
To mimic this with data.table:

MT[, .N, keyby = cyl]
data.table [3 x 2]
cyl N
4 11
6 7
8 14
MT[order(cyl), .N, by = cyl]
data.table [3 x 2]
cyl N
4 11
6 7
8 14
MT[, .N, by = cyl][order(cyl)]
data.table [3 x 2]
cyl N
4 11
6 7
8 14
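
The three variants return the same rows, but keyby additionally sets the grouping column as the key of the result. A minimal sketch comparing two of them, assuming MT is as.data.table(mtcars):

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

a <- MT[, .N, keyby = cyl]          # sorted, and keyed on cyl
b <- MT[, .N, by = cyl][order(cyl)] # sorted, but no key

key(a)
key(b)
```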

Grouping on a condition:

mtcars |> group_by(cyl > 6) |> summarize(N = n())
data.frame [2 x 2]
cyl > 6 N
FALSE 18
TRUE 14
MT[, .N, by = .(cyl > 6)]
data.table [2 x 2]
cyl N
FALSE 18
TRUE 14
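
Note that data.table names the grouping column after the variable used in the expression (cyl here), which can be misleading. A minimal sketch of naming it explicitly, assuming MT is as.data.table(mtcars); the name cyl_gt_6 is arbitrary:

```r
library(data.table)

# Assumption: MT is mtcars converted to a data.table
MT <- as.data.table(mtcars)

# Give the computed grouping column an explicit name
res <- MT[, .N, by = .(cyl_gt_6 = cyl > 6)]

res
```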

Filter > summarize

mtcars |> filter(cyl >= 6 & disp >= 200) |> summarize(N = n())
data.frame [1 x 1]
N
16
MT[cyl >= 6 & disp >= 200, .(.N)]
data.table [1 x 1]
N
16
mtcars |> summarize(N = sum(cyl >= 6 & disp >= 200, na.rm = T))
data.frame [1 x 1]
N
16
MT[, .(N = sum(cyl >= 6 & disp >= 200, na.rm = T))]
data.table [1 x 1]
N
16

Obtaining one summary statistic on multiple columns

mtcars |> group_by(cyl) |> summarize(across(everything(), \(c) mean(c)))
data.frame [3 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 26.664 105.136 82.636 4.071 2.286 19.137 0.909 0.727 4.091 1.545
6 19.743 183.314 122.286 3.586 3.117 17.977 0.571 0.429 3.857 3.429
8 15.1 353.1 209.214 3.229 3.999 16.772 0 0.143 3.286 3.5
MT[, lapply(.SD, \(c) mean(c)), keyby = cyl]
data.table [3 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 26.664 105.136 82.636 4.071 2.286 19.137 0.909 0.727 4.091 1.545
6 19.743 183.314 122.286 3.586 3.117 17.977 0.571 0.429 3.857 3.429
8 15.1 353.1 209.214 3.229 3.999 16.772 0 0.143 3.286 3.5

Apply summary function based on column type:

mtcars |> group_by(cyl) |> summarize(across(where(is.double), \(col) mean(col)))
data.frame [3 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 26.664 105.136 82.636 4.071 2.286 19.137 0.909 0.727 4.091 1.545
6 19.743 183.314 122.286 3.586 3.117 17.977 0.571 0.429 3.857 3.429
8 15.1 353.1 209.214 3.229 3.999 16.772 0 0.143 3.286 3.5
MT[, lapply(.SD, \(col) mean(col)), keyby = cyl, .SDcols = is.double][, cyl := NULL][] # cyl is a double too, so it appears twice (grouping column + .SD); this drops the first copy
data.table [3 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26.664 4 105.136 82.636 4.071 2.286 19.137 0.909 0.727 4.091 1.545
19.743 6 183.314 122.286 3.586 3.117 17.977 0.571 0.429 3.857 3.429
15.1 8 353.1 209.214 3.229 3.999 16.772 0 0.143 3.286 3.5

Apply summary function to specific columns:

mtcars |> group_by(cyl) |> summarize(across(c(mpg, disp), \(.x) mean(.x)))
data.frame [3 x 3]
cyl mpg disp
4 26.664 105.136
6 19.743 183.314
8 15.1 353.1
MT[, lapply(.SD, \(.x) mean(.x)), keyby = cyl, .SDcols = c("mpg", "disp")]
data.table [3 x 3]
cyl mpg disp
4 26.664 105.136
6 19.743 183.314
8 15.1 353.1
MT[, lapply(.SD[, .(mpg, disp)], \(.x) mean(.x)), keyby = cyl]
data.table [3 x 3]
cyl mpg disp
4 26.664 105.136
6 19.743 183.314
8 15.1 353.1

Apply summary function to specific columns (by pattern):

mtcars |> group_by(cyl) |> summarize(across(matches("^mpg|^disp"), \(.x) mean(.x)))
data.frame [3 x 3]
cyl mpg disp
4 26.664 105.136
6 19.743 183.314
8 15.1 353.1
MT[, lapply(.SD, mean), keyby = cyl, .SDcols = patterns("^mpg|^disp")]
data.table [3 x 3]
cyl mpg disp
4 26.664 105.136
6 19.743 183.314
8 15.1 353.1

Obtaining multiple summary statistics for one column:

mtcars |> group_by(cyl) |> summarize(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
data.frame [3 x 3]
cyl mean_mpg sd_mpg
4 26.664 4.51
6 19.743 1.454
8 15.1 2.56
MT[, .(mean_mpg = mean(mpg), sd_mpg = sd(mpg)), keyby = cyl]
data.table [3 x 3]
cyl mean_mpg sd_mpg
4 26.664 4.51
6 19.743 1.454
8 15.1 2.56
MT[, lapply(.SD, \(x) list(mean_mpg = mean(x), sd_mpg = sd(x))) |> rbindlist(), keyby = cyl, .SDcols = "mpg"]
data.table [3 x 3]
cyl mean_mpg sd_mpg
4 26.664 4.51
6 19.743 1.454
8 15.1 2.56
MT[, lapply(.(mpg), \(x) list(mean_mpg = mean(x), sd_mpg = sd(x))) |> rbindlist(), keyby = cyl]
data.table [3 x 3]
cyl mean_mpg sd_mpg
4 26.664 4.51
6 19.743 1.454
8 15.1 2.56

Obtaining multiple summary statistics on multiple columns (as rows):

MT[, lapply(.SD, \(v) c(mean(v), sd(v)))]
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
20.091 6.188 230.722 146.688 3.597 3.217 17.849 0.438 0.406 3.688 2.812
6.027 1.786 123.939 68.563 0.535 0.978 1.787 0.504 0.499 0.738 1.615
list_of_funs <- list(mean = \(x) mean(x, na.rm = TRUE), sd = \(x) sd(x, na.rm = TRUE))
MT[, lapply(list_of_funs, \(fun) lapply(.SD, fun)) |> rbindlist()]
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
20.091 6.188 230.722 146.688 3.597 3.217 17.849 0.438 0.406 3.688 2.812
6.027 1.786 123.939 68.563 0.535 0.978 1.787 0.504 0.499 0.738 1.615

Obtaining multiple summary statistics on multiple columns (as columns):

cols <- c("mpg", "cyl")

funs_as_list <- \(x) list(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
Warning

dplyr & data.table don’t use the same “format” when pre-defining a list of functions to apply:
- dplyr needs a list of individual functions
- data.table needs a function returning a list
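
As a minimal sketch of the two formats on the data.table side (the toy table `DT` and its columns `g` and `x` are hypothetical and only serve as an illustration):

```r
library(data.table)

# Hypothetical toy data, for illustration only
DT <- data.table(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))

# dplyr-style format: a named list of individual functions
list_of_funs <- list(mean = mean, max = max)

# data.table-style format: a single function returning a named list
funs_as_list <- function(v) list(mean = mean(v), max = max(v))

# lapply() returns one list per column of .SD; unlist(recursive = FALSE)
# flattens one level, which yields columns named "<col>.<stat>"
res <- DT[, unlist(lapply(.SD, funs_as_list), recursive = FALSE), by = g, .SDcols = "x"]

# A list of individual functions still works when each one is applied separately
res2 <- DT[, lapply(list_of_funs, \(f) f(x)), by = g]

print(res)
print(res2)
```

The first form scales to several columns through .SDcols, while the second one is convenient when only one column is summarized.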

mtcars |> group_by(gear) |> summarize(across(all_of(cols), .fns = list_of_funs, .names = "{.col}.{.fn}"))
data.frame [3 x 5]
gear mpg.mean mpg.sd cyl.mean cyl.sd
3 16.107 3.372 7.467 1.187
4 24.533 5.277 4.667 0.985
5 21.38 6.659 6 2
MT[, lapply(.SD, funs_as_list) |> unlist(recursive = FALSE), keyby = gear, .SDcols = cols]
data.table [3 x 5]
gear mpg.mean mpg.sd cyl.mean cyl.sd
3 16.107 3.372 7.467 1.187
4 24.533 5.277 4.667 0.985
5 21.38 6.659 6 2
MT[, lapply(.SD, funs_as_list) |> do.call(c, args = _), keyby = gear, .SDcols = cols]
data.table [3 x 5]
gear mpg.mean mpg.sd cyl.mean cyl.sd
3 16.107 3.372 7.467 1.187
4 24.533 5.277 4.667 0.985
5 21.38 6.659 6 2

Different column order & naming scheme:

Tip

Here we can use the list_of_funs with data.table since we apply them individually.

MT[, lapply(list_of_funs, \(f) lapply(.SD, f)) |> do.call(c, args = _), keyby = gear, .SDcols = cols]
data.table [3 x 5]
gear mean.mpg mean.cyl sd.mpg sd.cyl
3 16.107 7.467 3.372 1.187
4 24.533 4.667 5.277 0.985
5 21.38 6 6.659 2

Using dcast (see next section):

dcast(MT, gear ~ ., fun.aggregate = list(mean, sd), value.var = cols)
data.table [3 x 5]
gear mpg_mean cyl_mean mpg_sd cyl_sd
3 16.107 7.467 3.372 1.187
4 24.533 4.667 5.277 0.985
5 21.38 6 6.659 2

3 Pivots:


3.1 Melt / Longer:

Data:

FAM1
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
FAM2
data.table [5 x 8]
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA

One group of columns –> single value column

FAM1 |> pivot_longer(cols = matches("dob_"), names_to = "variable")
data.frame [15 x 4]
family_id age_mother variable value
1 30 dob_child1 1998-11-26
1 30 dob_child2 2000-01-29
1 30 dob_child3 NA
2 27 dob_child1 1996-06-22
2 27 dob_child2 NA
2 27 dob_child3 NA
3 26 dob_child1 2002-07-11
3 26 dob_child2 2004-04-05
3 26 dob_child3 2007-09-02
4 32 dob_child1 2004-10-10
4 32 dob_child2 2009-08-27
4 32 dob_child3 2012-07-21
5 29 dob_child1 2000-12-05
5 29 dob_child2 2005-02-28
5 29 dob_child3 NA
FAM1 |> melt(measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
data.table [15 x 4]
family_id age_mother variable value
1 30 dob_child1 1998-11-26
2 27 dob_child1 1996-06-22
3 26 dob_child1 2002-07-11
4 32 dob_child1 2004-10-10
5 29 dob_child1 2000-12-05
1 30 dob_child2 2000-01-29
2 27 dob_child2 NA
3 26 dob_child2 2004-04-05
4 32 dob_child2 2009-08-27
5 29 dob_child2 2005-02-28
1 30 dob_child3 NA
2 27 dob_child3 NA
3 26 dob_child3 2007-09-02
4 32 dob_child3 2012-07-21
5 29 dob_child3 NA
FAM1 |> melt(measure.vars = patterns("^dob_"))
data.table [15 x 4]
family_id age_mother variable value
1 30 dob_child1 1998-11-26
2 27 dob_child1 1996-06-22
3 26 dob_child1 2002-07-11
4 32 dob_child1 2004-10-10
5 29 dob_child1 2000-12-05
1 30 dob_child2 2000-01-29
2 27 dob_child2 NA
3 26 dob_child2 2004-04-05
4 32 dob_child2 2009-08-27
5 29 dob_child2 2005-02-28
1 30 dob_child3 NA
2 27 dob_child3 NA
3 26 dob_child3 2007-09-02
4 32 dob_child3 2012-07-21
5 29 dob_child3 NA

One group of columns –> multiple value columns

FAM1 |> melt(measure.vars = patterns(child1 = "child1$", child2 = "child2$|child3$"))
data.table [10 x 5]
family_id age_mother variable child1 child2
1 30 1 1998-11-26 2000-01-29
2 27 1 1996-06-22 NA
3 26 1 2002-07-11 2004-04-05
4 32 1 2004-10-10 2009-08-27
5 29 1 2000-12-05 2005-02-28
1 30 2 NA NA
2 27 2 NA NA
3 26 2 NA 2007-09-02
4 32 2 NA 2012-07-21
5 29 2 NA NA

3.1.1 Merging multiple yes/no columns:

Melting multiple presence/absence columns into a single variable:

movies_wide
data.frame [3 x 4]
ID action adventure animation
1 1 0 0
2 1 1 0
3 1 1 1
pivot_longer(
    movies_wide, -ID, names_to = "Genre", 
    values_transform = \(x) ifelse(x == 0, NA, x), values_drop_na = TRUE
  ) |> select(-value)
data.frame [6 x 2]
ID Genre
1 action
2 action
2 adventure
3 action
3 adventure
3 animation
melt(MOVIES_WIDE, id.vars = "ID", variable.name = "Genre")[value != 0][order(ID), -"value"]
data.table [6 x 2]
ID Genre
1 action
2 action
2 adventure
3 action
3 adventure
3 animation

3.1.2 Partial pivot:

Multiple groups of columns –> Multiple value columns

Manually:

colA <- str_subset(colnames(FAM2), "^dob")
colB <- str_subset(colnames(FAM2), "^gender")

FAM2 |> melt(measure.vars = list(colA, colB), value.name = c("dob", "gender"), variable.name = "child")
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA
FAM2 |> melt(measure.vars = list(a, b), value.name = c("dob", "gender"), variable.name = "child") |> 
  substitute2(env = list(a = I(str_subset(colnames(FAM2), "^dob")), b = I(str_subset(colnames(FAM2), "^gender")))) |> eval()
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA

Using .value:

Tip

Using the .value special identifier lets us do a “half” pivot: the values that would otherwise be listed as rows under .value are instead used as columns.

FAM2 |> pivot_longer(cols = matches("^dob|^gender"), names_to = c(".value", "child"), names_sep = "_child")
data.frame [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
1 30 2 2000-01-29 2
1 30 3 NA NA
2 27 1 1996-06-22 2
2 27 2 NA NA
2 27 3 NA NA
3 26 1 2002-07-11 2
3 26 2 2004-04-05 2
3 26 3 2007-09-02 1
4 32 1 2004-10-10 1
4 32 2 2009-08-27 1
4 32 3 2012-07-21 1
5 29 1 2000-12-05 2
5 29 2 2005-02-28 1
5 29 3 NA NA
FAM2 |> melt(measure.vars = patterns("^dob", "^gender"), value.name = c("dob", "gender"), variable.name = "child")
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA

Using measure and value.name:

Warning

data.table only

FAM2 |> melt(measure.vars = measure(value.name, child = \(x) as.integer(x), sep = "_child"))
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA
FAM2 |> melt(measure.vars = measurev(list(value.name = NULL, child = as.integer), pattern = "(.*)_child(\\d{1})"))
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA

3.2 Dcast / Wider:

General idea:
- Pivot around each combination of the id.vars (LHS of the formula): one output row per combination
- The measure.vars (RHS of the formula) are the ones whose values become column names
- The value.var columns are the ones whose values fill the new columns
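
The three roles listed above can be sketched on a tiny table (the table `long` and its column names are hypothetical, chosen only for this illustration):

```r
library(data.table)

# Hypothetical long-format data: one row per (id, name) pair
long <- data.table(
  id   = c(1, 1, 2, 2),
  name = c("a", "b", "a", "b"),
  val  = c(10, 20, 30, 40)
)

# LHS (id): one output row per id
# RHS (name): each distinct value becomes a column
# value.var (val): the values used to fill those columns
wide <- dcast(long, id ~ name, value.var = "val")
print(wide)
```

Stating value.var explicitly avoids relying on dcast guessing which column holds the values.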

Data:

(FAM1L <- FAM1 |> melt(measure.vars = c("dob_child1", "dob_child2", "dob_child3")))
data.table [15 x 4]
family_id age_mother variable value
1 30 dob_child1 1998-11-26
2 27 dob_child1 1996-06-22
3 26 dob_child1 2002-07-11
4 32 dob_child1 2004-10-10
5 29 dob_child1 2000-12-05
1 30 dob_child2 2000-01-29
2 27 dob_child2 NA
3 26 dob_child2 2004-04-05
4 32 dob_child2 2009-08-27
5 29 dob_child2 2005-02-28
1 30 dob_child3 NA
2 27 dob_child3 NA
3 26 dob_child3 2007-09-02
4 32 dob_child3 2012-07-21
5 29 dob_child3 NA
(FAM2L <- FAM2 |> melt(measure.vars = measure(value.name, child = \(.x) as.integer(.x), sep = "_child")))
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA

Basic pivot wider:

FAM1L |> pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
FAM1L |> dcast(family_id + age_mother ~ variable)
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

Using all the columns as IDs:

Note

By default, id_cols is every column not used in names_from or values_from

FAM1L |> pivot_wider(names_from = variable)
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
Note

In a dcast formula, ... => “every unused column”

FAM1L |> dcast(... ~ variable)
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

Multiple value columns –> Multiple groups of columns:

FAM2L |> pivot_wider(
  id_cols = c("family_id", "age_mother"), values_from = c("dob", "gender"), 
  names_from = "child", names_sep = "_child"
)
data.frame [5 x 8]
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA
FAM2L |> dcast(family_id + age_mother ~ child, value.var = c("dob", "gender"), sep = "_child")
data.table [5 x 8]
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA
FAM2L |> dcast(... ~ child, value.var = c("dob", "gender"), sep = "_child")
data.table [5 x 8]
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA

Dynamic names in the formula:

var_name <- "variable"
FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = all_of(var_name))
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
FAM1L |> dcast(family_id + age_mother ~ base::get(var_name))
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
FAM1L |> dcast(family_id + age_mother ~ v1) |> substitute2(env = list(v1 = var_name)) |> eval()
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

Multiple variables:

id_vars <- c("family_id", "age_mother")
FAM1L |> pivot_wider(id_cols = all_of(id_vars), names_from = variable)
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
FAM1L |> dcast(str_c(str_c(id_vars, collapse = " + "), " ~ variable"))
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
FAM1L |> dcast(v1 + v2 ~ variable) |> substitute2(env = list(v1 = id_vars[1], v2 = id_vars[2])) |> eval()
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

3.2.1 Renaming (prefix/suffix) the columns:

FAM1L |> pivot_wider(names_from = variable, values_from = value, names_prefix = "Attr: ")
data.frame [5 x 5]
family_id age_mother Attr: dob_child1 Attr: dob_child2 Attr: dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
FAM1L |> pivot_wider(names_from = variable, values_from = value, names_glue = "Attr: {variable}")
data.frame [5 x 5]
family_id age_mother Attr: dob_child1 Attr: dob_child2 Attr: dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
FAM1L |> dcast(family_id + age_mother ~ paste0("Attr: ", variable))
data.table [5 x 5]
family_id age_mother Attr: dob_child1 Attr: dob_child2 Attr: dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

3.2.2 Unused combinations:

Warning

The logic is inverted between tidyr (expand, i.e. keep) and data.table (drop)
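
A minimal sketch of the drop side of that inversion, on a hypothetical table (`long`, `id1`, `id2`, `name` and `val` are illustrative names, not part of the original data):

```r
library(data.table)

# Hypothetical data: only 2 of the 4 possible (id1, id2) combinations exist
long <- data.table(
  id1  = c(1, 2),
  id2  = c("x", "y"),
  name = c("a", "b"),
  val  = c(10, 20)
)

# Default (drop = TRUE): unused combinations are dropped
dropped <- dcast(long, id1 + id2 ~ name, value.var = "val")

# drop = c(FALSE, TRUE): keep every LHS combination, still drop unused RHS ones
kept <- dcast(long, id1 + id2 ~ name, value.var = "val", drop = c(FALSE, TRUE))

print(dropped)
print(kept)
```

Here `kept` should contain the full cross of id1 and id2, with NA in the cells that had no data.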

FAM1L |> pivot_wider(names_from = variable, values_from = value, id_expand = TRUE, names_expand = FALSE) # (keep_id, keep_names)
data.frame [25 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 26 NA NA NA
1 27 NA NA NA
1 29 NA NA NA
1 30 1998-11-26 2000-01-29 NA
1 32 NA NA NA
2 26 NA NA NA
2 27 1996-06-22 NA NA
2 29 NA NA NA
2 30 NA NA NA
2 32 NA NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
3 27 NA NA NA
3 29 NA NA NA
3 30 NA NA NA
3 32 NA NA NA
[ omitted 10 entries ]
FAM1L |> dcast(family_id + age_mother ~ variable, drop = c(FALSE, TRUE)) # (drop_LHS, drop_RHS)
data.table [25 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 26 NA NA NA
1 27 NA NA NA
1 29 NA NA NA
1 30 1998-11-26 2000-01-29 NA
1 32 NA NA NA
2 26 NA NA NA
2 27 1996-06-22 NA NA
2 29 NA NA NA
2 30 NA NA NA
2 32 NA NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
3 27 NA NA NA
3 29 NA NA NA
3 30 NA NA NA
3 32 NA NA NA
[ omitted 10 entries ]

3.2.3 Subsetting:

Note

AFAIK, pivot_wider can’t do this on its own.

FAM1L |> filter(value >= lubridate::ymd(20030101)) |> 
  pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")
data.frame [3 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
4 32 2004-10-10 2009-08-27 2012-07-21
3 26 NA 2004-04-05 2007-09-02
5 29 NA 2005-02-28 NA
FAM1L |> dcast(family_id + age_mother ~ variable, subset = .(value >= lubridate::ymd(20030101)))
data.table [3 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
3 26 NA 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 NA 2005-02-28 NA

3.2.4 Aggregating:

When the RHS of the formula is . (i.e. no column provides the new names), dcast aggregates all the values of each LHS combination into a single column named “.”. Without a fun.aggregate, it defaults to length, so that column holds the number of rows in each group.

FAM1L |> dcast(family_id + age_mother ~ .)
data.table [5 x 3]
family_id age_mother .
1 30 3
2 27 3
3 26 3
4 32 3
5 29 3

We can customize that default behavior using the fun.aggregate argument:

Here, we count the number of children for each combination of (family_id + age_mother) -> sum all non-NA values
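
The difference between the default aggregation and a custom one can be sketched on a hypothetical table (`long`, `id` and `val` are names invented for this illustration):

```r
library(data.table)

# Hypothetical data: two rows per id, one of them missing for id 1
long <- data.table(id = c(1, 1, 2, 2), val = c(5, NA, 7, 8))

# Counting rows per id (length is what dcast falls back to when
# no fun.aggregate is given; it is passed explicitly here)
n_rows <- dcast(long, id ~ ., value.var = "val", fun.aggregate = length)

# Counting only the non-NA values per id
n_non_na <- dcast(long, id ~ ., value.var = "val", fun.aggregate = \(x) sum(!is.na(x)))

print(n_rows)
print(n_non_na)
```

In both results the aggregated column is named ".", so renaming it (e.g. with setnames) is usually the next step.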

FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(.x) sum(!is.na(.x))) |>
  rowwise() |> mutate(child_count = sum(c_across(matches("_child")))) |> ungroup()
data.frame [5 x 6]
family_id age_mother dob_child1 dob_child2 dob_child3 child_count
1 30 1 1 0 2
2 27 1 0 0 1
3 26 1 1 1 3
4 32 1 1 1 3
5 29 1 1 0 2
FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(.x) sum(!is.na(.x))) |>
  mutate(child_count = apply(pick(matches("_child")), 1, \(r) sum(r)))
data.frame [5 x 6]
family_id age_mother dob_child1 dob_child2 dob_child3 child_count
1 30 1 1 0 2
2 27 1 0 0 1
3 26 1 1 1 3
4 32 1 1 1 3
5 29 1 1 0 2
(FAM1L |> dcast(family_id + age_mother ~ ., fun.agg = \(.x) sum(!is.na(.x))) |> setnames(".", "child_count"))
data.table [5 x 3]
family_id age_mother child_count
1 30 2
2 27 1
3 26 3
4 32 3
5 29 2

Applying multiple fun.agg:

Data:

(DTL <- data.table(
  id1 = sample(5, 20, TRUE), 
  id2 = sample(2, 20, TRUE), 
  group = sample(letters[1:2], 20, TRUE), 
  v1 = runif(20), 
  v2 = 1L)
)
data.table [20 x 5]
id1 id2 group v1 v2
5 1 a 0.706 1
2 2 a 0.152 1
1 2 b 0.487 1
1 2 b 0.416 1
4 2 b 0.681 1
2 2 a 0.384 1
3 1 a 0.263 1
2 2 a 0.536 1
4 2 a 0.447 1
3 1 b 0.446 1
2 2 a 0.855 1
5 1 b 0.832 1
1 2 a 0.854 1
1 1 a 0.987 1
5 1 a 0.59 1
[ omitted 5 entries ]

Multiple fun.agg applied to one variable:

DTL |> dcast(id1 + id2 ~ group, fun.aggregate = list(sum, mean), value.var = "v1")
data.table [9 x 6]
id1 id2 v1_sum_a v1_sum_b v1_mean_a v1_mean_b
1 1 0.987 0 0.987 NaN
1 2 0.854 0.903 0.854 0.452
2 1 0 0.25 NaN 0.25
2 2 2.221 0.048 0.444 0.048
3 1 0.263 0.446 0.263 0.446
3 2 0.083 0 0.083 NaN
4 1 0 0.518 NaN 0.518
4 2 0.447 0.681 0.447 0.681
5 1 1.296 0.832 0.648 0.832

Multiple fun.agg to multiple value.var (all combinations):

DTL |> dcast(id1 + id2 ~ group, fun.aggregate = list(sum, mean), value.var = c("v1", "v2"))
data.table [9 x 10]
id1 id2 v1_sum_a v1_sum_b v2_sum_a v2_sum_b v1_mean_a v1_mean_b v2_mean_a v2_mean_b
1 1 0.987 0 1 0 0.987 NaN 1 NaN
1 2 0.854 0.903 1 2 0.854 0.452 1 1
2 1 0 0.25 0 1 NaN 0.25 NaN 1
2 2 2.221 0.048 5 1 0.444 0.048 1 1
3 1 0.263 0.446 1 1 0.263 0.446 1 1
3 2 0.083 0 1 0 0.083 NaN 1 NaN
4 1 0 0.518 0 1 NaN 0.518 NaN 1
4 2 0.447 0.681 1 1 0.447 0.681 1 1
5 1 1.296 0.832 2 1 0.648 0.832 1 1

Multiple fun.agg and multiple value.var (one-to-one):

Here, we apply sum to v1 (for both group a & b), and mean to v2 (for both group a & b)

DTL |> dcast(id1 + id2 ~ group, fun.aggregate = list(sum, mean), value.var = list("v1", "v2"))
data.table [9 x 6]
id1 id2 v1_sum_a v1_sum_b v2_mean_a v2_mean_b
1 1 0.987 0 1 NaN
1 2 0.854 0.903 1 1
2 1 0 0.25 NaN 1
2 2 2.221 0.048 1 1
3 1 0.263 0.446 1 1
3 2 0.083 0 1 NaN
4 1 0 0.518 NaN 1
4 2 0.447 0.681 1 1
5 1 1.296 0.832 1 1

3.2.5 One-hot encoding:

Making each level of a variable into a presence/absence column:

movies_long
data.frame [6 x 2]
ID Genre
1 action
2 action
2 adventure
3 action
3 adventure
3 animation
pivot_wider(
  movies_long, names_from = "Genre", values_from = "Genre", 
  values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
)
data.frame [3 x 4]
ID action adventure animation
1 1 0 0
2 1 1 0
3 1 1 1
dcast(
  MOVIES_LONG, ID ~ Genre, value.var = "Genre", 
  fun.agg = \(x) ifelse(is.na(x), 0, 1), fill = 0
)
data.table [3 x 4]
ID action adventure animation
1 1 0 0
2 1 1 0
3 1 1 1

4 Joins:


Tip

In data.table, a JOIN is just another type of SUBSET: we subset the rows of one data.table with the rows of a second one, based on some conditions that define the type of JOIN.

Matching two tables based on their rows can be done:
- Either on equivalences (equi-joins)
- Or functions comparing one row to another (non-equi joins)
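
A minimal sketch of both kinds of matching, using hypothetical tables (`X`, `Y` and `limits` are illustrative and unrelated to the data used below):

```r
library(data.table)

# Hypothetical tables, for illustration only
X <- data.table(id = c("a", "b", "c"), x = 1:3)
Y <- data.table(id = c("b", "c", "d"), y = c(10, 20, 30))

# Equi-join: for each row of Y, look up the rows of X with the same id
equi <- X[Y, on = "id"]

# Non-equi join: for each row of `limits`, keep the rows of X where x <= max_x
limits <- data.table(max_x = 2L)
non_equi <- X[limits, on = .(x <= max_x)]

# In a non-equi join, the join column shows the values coming from `i`;
# the x. and i. prefixes make both sides available explicitly
explicit <- X[limits, .(id, x = x.x, max_x = i.max_x), on = .(x <= max_x)]

print(equi)
print(non_equi)
print(explicit)
```

The `explicit` version is often easier to read, since the original values of the join column are otherwise masked.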

Data:

(DT1 <- data.table( 
  ID = LETTERS[1:10],
  A = sample(1:5, 10, replace = TRUE),
  B = sample(10:20, 10)
))
data.table [10 x 3]
ID A B
A 1 16
B 5 12
C 3 11
D 1 19
E 4 13
F 2 18
G 2 20
H 4 17
I 5 15
J 2 10
(DT2 <- data.table(
  ID = LETTERS[5:14],
  C = sample(1:5, 10, replace = TRUE),
  D = sample(10:20, 10) 
))
data.table [10 x 3]
ID C D
E 5 16
F 2 13
G 3 11
H 1 19
I 3 12
J 1 14
K 2 17
L 2 20
M 3 10
N 4 18

Basic (right) join example:

right_join(
  DT1 |> select(ID, A),
  DT2 |> select(ID, C), 
  by = "ID"
) |> as_tibble()
data.frame [10 x 3]
ID A C
E 4 5
F 2 2
G 2 3
H 4 1
I 5 3
J 2 1
K NA 2
L NA 2
M NA 3
N NA 4
DT1[DT2, .(ID, A, C), on = .(ID)]
data.table [10 x 3]
ID A C
E 4 5
F 2 2
G 2 3
H 4 1
I 5 3
J 2 1
K NA 2
L NA 2
M NA 3
N NA 4

4.1 Outer (right, left):

Keeps every row of one table and appends the matching columns of the other (rows without a match are filled with NA).

Note

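data.table doesn’t do left joins natively: X[Y] keeps every row of Y (a right join), so a left join is written by swapping the tables or by adding columns by reference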

Subsetting DT1 by DT2:

Note

DT2 (everything) + DT1 (all columns, but only the rows that match those in DT2).
> Looking up DT1’s rows using DT2 (or DT2’s key, if it has one) as an index.

As a right join:

right_join(DT1, DT2, by = "ID") # DT1 into DT2
data.table [10 x 5]
ID A B C D
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
K NA NA 2 17
L NA NA 2 20
M NA NA 3 10
N NA NA 4 18
DT1[DT2, on = .(ID)]
data.table [10 x 5]
ID A B C D
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
K NA NA 2 17
L NA NA 2 20
M NA NA 3 10
N NA NA 4 18

As a left join:

Note

Not exactly equivalent to the right join: same columns, but DT2 is first instead of DT1

left_join(DT2, DT1, by = "ID") # DT1 into DT2
data.table [10 x 5]
ID C D A B
E 5 16 4 13
F 2 13 2 18
G 3 11 2 20
H 1 19 4 17
I 3 12 5 15
J 1 14 2 10
K 2 17 NA NA
L 2 20 NA NA
M 3 10 NA NA
N 4 18 NA NA
copy(DT2)[DT1, c("A", "B") := list(i.A, i.B), on = .(ID)][]
data.table [10 x 5]
ID C D A B
E 5 16 4 13
F 2 13 2 18
G 3 11 2 20
H 1 19 4 17
I 3 12 5 15
J 1 14 2 10
K 2 17 NA NA
L 2 20 NA NA
M 3 10 NA NA
N 4 18 NA NA

Subsetting DT2 by DT1:

Note

DT1 (everything) + DT2 (all columns, but only the rows that match those in DT1).
> Looking up DT2’s rows using DT1 (or DT1’s key, if it has one) as an index.

As a right join:

right_join(DT2, DT1, by = "ID") # DT2 into DT1
data.table [10 x 5]
ID C D A B
E 5 16 4 13
F 2 13 2 18
G 3 11 2 20
H 1 19 4 17
I 3 12 5 15
J 1 14 2 10
A NA NA 1 16
B NA NA 5 12
C NA NA 3 11
D NA NA 1 19
DT2[DT1, on = .(ID)]
data.table [10 x 5]
ID C D A B
A NA NA 1 16
B NA NA 5 12
C NA NA 3 11
D NA NA 1 19
E 5 16 4 13
F 2 13 2 18
G 3 11 2 20
H 1 19 4 17
I 3 12 5 15
J 1 14 2 10

As a left join:

Note

Not exactly equivalent to the right join: same columns, but DT1 is first instead of DT2

left_join(DT1, DT2, by = "ID") # DT2 into DT1
data.table [10 x 5]
ID A B C D
A 1 16 NA NA
B 5 12 NA NA
C 3 11 NA NA
D 1 19 NA NA
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
copy(DT1)[DT2, c("C", "D") := list(i.C, i.D), on = .(ID)][]
data.table [10 x 5]
ID A B C D
A 1 16 NA NA
B 5 12 NA NA
C 3 11 NA NA
D 1 19 NA NA
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14

4.2 Full (outer):

full_join(DT1, DT2, by = "ID")
data.table [14 x 5]
ID A B C D
A 1 16 NA NA
B 5 12 NA NA
C 3 11 NA NA
D 1 19 NA NA
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
K NA NA 2 17
L NA NA 2 20
M NA NA 3 10
N NA NA 4 18
data.table::merge.data.table(DT1, DT2, by = "ID", all = TRUE)
data.table [14 x 5]
ID A B C D
A 1 16 NA NA
B 5 12 NA NA
C 3 11 NA NA
D 1 19 NA NA
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
K NA NA 2 17
L NA NA 2 20
M NA NA 3 10
N NA NA 4 18

Alternatively:

setkey(DT1, ID)
setkey(DT2, ID)

# Getting the union of the unique keys of both DT
unique_keys <- union(DT1[, ID], DT2[, ID])

DT1[DT2[unique_keys, on = "ID"]]
data.table [14 x 5]
ID A B C D
A 1 16 NA NA
B 5 12 NA NA
C 3 11 NA NA
D 1 19 NA NA
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
K NA NA 2 17
L NA NA 2 20
M NA NA 3 10
N NA NA 4 18

4.3 Inner:

Only returns the ROWS matching both tables:
- Inner: rows matching both DT1 and DT2, columns of both (add DT2’s columns to the right)
- Semi: rows matching both DT1 and DT2, columns of first one

Inner:

inner_join(DT1, DT2, by = "ID") 
data.table [6 x 5]
ID A B C D
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
DT1[DT2, on = .(ID), nomatch = NULL]
data.table [6 x 5]
ID A B C D
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14

Semi:

semi_join(DT1, DT2, by = "ID")
data.table [6 x 3]
ID A B
E 4 13
F 2 18
G 2 20
H 4 17
I 5 15
J 2 10
DT1[na.omit(DT1[DT2, on = .(ID), which = TRUE])]
data.table [6 x 3]
ID A B
E 4 13
F 2 18
G 2 20
H 4 17
I 5 15
J 2 10
Note

which = TRUE returns the row numbers instead of the rows themselves.
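
A small sketch of that mechanism on hypothetical tables (`X` and `Y` are illustrative names):

```r
library(data.table)

# Hypothetical tables, for illustration only
X <- data.table(id = c("a", "b", "c"), x = 1:3)
Y <- data.table(id = c("b", "c", "d"))

# Row numbers of X matched by each row of Y (NA when there is no match)
idx <- X[Y, on = "id", which = TRUE]

# Dropping the NAs and subsetting X by row number gives the semi-join
semi <- X[na.omit(idx)]

print(idx)
print(semi)
```

Note that if several rows of Y matched the same row of X, that row number would appear more than once in `idx`; wrapping it in unique() is one way to guard against this.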

4.4 Anti:

ROWS of DT1 that are NOT in DT2, and only the columns of DT1.

anti_join(DT1, DT2, by = "ID")
data.table [4 x 3]
ID A B
A 1 16
B 5 12
C 3 11
D 1 19
DT1[!DT2, on = .(ID)]
data.table [4 x 3]
ID A B
A 1 16
B 5 12
C 3 11
D 1 19

ROWS of DT2 that are NOT in DT1, and only the columns of DT2.

anti_join(DT2, DT1, by = "ID")
data.table [4 x 3]
ID C D
K 2 17
L 2 20
M 3 10
N 4 18
DT2[!DT1, on = .(ID)]
data.table [4 x 3]
ID C D
K 2 17
L 2 20
M 3 10
N 4 18

4.5 Non-equi joins:

DT1[DT2, on = .(ID, A <= C)]
data.table [10 x 4]
ID A B D
E 5 13 16
F 2 18 13
G 3 20 11
H 1 NA 19
I 3 NA 12
J 1 NA 14
K 2 NA 17
L 2 NA 20
M 3 NA 10
N 4 NA 18

4.6 Rolling joins:

DT1[DT2, on = "ID", roll = TRUE]
data.table [10 x 5]
ID A B C D
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
K 2 10 2 17
L 2 10 2 20
M 2 10 3 10
N 2 10 4 18
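
Rolling joins are easier to read on an ordered numeric key. The sketch below uses hypothetical tables (`prices` and `queries` are not part of the original data): when a lookup value has no exact match, roll = TRUE carries the last preceding observation forward, while roll = -Inf carries the next observation backward.

```r
library(data.table)

# Hypothetical data: observations at times 1, 5 and 10
prices  <- data.table(time = c(1, 5, 10), price = c(100, 105, 110))
queries <- data.table(time = c(3, 5, 12))

# Forward roll: time 3 takes the value observed at time 1,
# time 12 takes the last known value (time 10)
fwd <- prices[queries, on = "time", roll = TRUE]

# Backward roll: time 3 takes the value observed at time 5,
# time 12 has no later observation and stays NA
bwd <- prices[queries, on = "time", roll = -Inf]

print(fwd)
print(bwd)
```

The behavior at both ends of the key can be adjusted with the rollends argument.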

Reversing the rolling direction:

DT1[DT2, on = "ID", roll = -Inf]
data.table [10 x 5]
ID A B C D
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
K NA NA 2 17
L NA NA 2 20
M NA NA 3 10
N NA NA 4 18
DT1[DT2, on = "ID", rollends = TRUE]
data.table [10 x 5]
ID A B C D
E 4 13 5 16
F 2 18 2 13
G 2 20 3 11
H 4 17 1 19
I 5 15 3 12
J 2 10 1 14
K NA NA 2 17
L NA NA 2 20
M NA NA 3 10
N NA NA 4 18

5 Tidyr & Others:


5.1 Remove NA:

tidyr::drop_na(IRIS, matches("Sepal"))
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]
na.omit(IRIS, cols = str_subset(colnames(IRIS), "Sepal"))
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]

5.2 Unite:

Combine multiple columns into a single one:

mtcars |> tidyr::unite("x", gear, carb, sep = "_")
data.frame [32 x 10]
mpg cyl disp hp drat wt qsec vs am x
21 6 160 110 3.9 2.62 16.46 0 1 4_4
21 6 160 110 3.9 2.875 17.02 0 1 4_4
22.8 4 108 93 3.85 2.32 18.61 1 1 4_1
21.4 6 258 110 3.08 3.215 19.44 1 0 3_1
18.7 8 360 175 3.15 3.44 17.02 0 0 3_2
18.1 6 225 105 2.76 3.46 20.22 1 0 3_1
14.3 8 360 245 3.21 3.57 15.84 0 0 3_4
24.4 4 146.7 62 3.69 3.19 20 1 0 4_2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4_2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4_4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4_4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3_3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3_3
15.2 8 275.8 180 3.07 3.78 18 0 0 3_3
10.4 8 472 205 2.93 5.25 17.98 0 0 3_4
[ omitted 17 entries ]
copy(MT)[, x := paste(gear, carb, sep = "_")][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb x
21 6 160 110 3.9 2.62 16.46 0 1 4 4 4_4
21 6 160 110 3.9 2.875 17.02 0 1 4 4 4_4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 4_1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3_1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3_2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3_1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 3_4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4_2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 4_2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 4_4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 4_4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3_3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3_3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3_3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3_4
[ omitted 17 entries ]

5.3 Extract / Separate:

Separate a column into multiple columns based on a pattern (extract) or a separator (separate):

MT.ext <- MT[, .(x = str_c(gear, carb, sep = "_"))]
MT.ext |> tidyr::extract(col = x, into = c("a", "b"), regex = "(.*)_(.*)", remove = FALSE)
data.table [32 x 3]
x a b
4_4 4 4
4_4 4 4
4_1 4 1
3_1 3 1
3_2 3 2
3_1 3 1
3_4 3 4
4_2 4 2
4_2 4 2
4_4 4 4
4_4 4 4
3_3 3 3
3_3 3 3
3_3 3 3
3_4 3 4
[ omitted 17 entries ]
MT.ext[, c("a", "b") := tstrsplit(x, "_", fixed = TRUE)][] 
data.table [32 x 3]
x a b
4_4 4 4
4_4 4 4
4_1 4 1
3_1 3 1
3_2 3 2
3_1 3 1
3_4 3 4
4_2 4 2
4_2 4 2
4_4 4 4
4_4 4 4
3_3 3 3
3_3 3 3
3_3 3 3
3_4 3 4
[ omitted 17 entries ]

5.4 Separate rows:

Separate a row into multiple rows based on a separator:

Data

(SP <- data.table(
  val = c(1,"2,3",4), 
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"), origin = "1970-01-01")
  )
)
data.table [3 x 2]
val date
1 2020-01-01
2,3 2020-01-02
4 2020-01-03
SP |> tidyr::separate_rows(val, sep = ",", convert = TRUE)
data.frame [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

Solution 1:

copy(SP)[, c(V1 = strsplit(val, ",", fixed = TRUE), .SD), by = val][, `:=`(val = V1, V1 = NULL)][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

Solution 2:

SP[, strsplit(val, ",", fixed = TRUE), by = val][SP, on = "val"][, `:=`(val = V1, V1 = NULL)][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

Solution 3:

(With type conversion)

SP[, unlist(tstrsplit(val, ",", type.convert = TRUE)), by = val][SP, on = "val"][, `:=`(val = V1, V1 = NULL)][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

Solution 4:

copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))][, val := strsplit(val, ","), by = val][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

(With type conversion)

copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))
       ][, val := strsplit(val, ","), by = val
       ][, val := utils::type.convert(val, as.is = TRUE)][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

5.5 Duplicates:

5.5.1 Duplicated rows:

Finding duplicated rows:

mtcars |> group_by(mpg, hp) |> filter(n() > 1)
data.frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
MT[, if(.N > 1) .SD, by = .(mpg, hp)]
data.table [2 x 11]
mpg hp cyl disp drat wt qsec vs am gear carb
21 110 6 160 3.9 2.62 16.46 0 1 4 4
21 110 6 160 3.9 2.875 17.02 0 1 4 4

Only keeping non-duplicated rows:

Note

This is different from distinct/unique, which will keep one of the duplicated rows of each group.

This removes all groups which have duplicated rows.
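
To make the distinction concrete, here is a small sketch on a toy table (the table `DT` and its columns are made up for this illustration):

```r
library(data.table)

DT <- data.table(id = c(1, 1, 2, 3), v = c("a", "b", "c", "d"))

# unique() keeps the first row of each `id` group: ids 1, 2 and 3 remain
kept_first <- unique(DT, by = "id")

# Dropping duplicated groups removes every row whose `id` occurs more than once:
# only ids 2 and 3 remain
no_dups <- DT[, if (.N == 1) .SD, by = id]
```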

Solution 1:

mtcars |> group_by(mpg, hp) |> filter(n() == 1)
data.frame [30 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
[ omitted 15 entries ]
MT[, if(.N == 1) .SD, by = .(mpg, hp)]
data.table [30 x 11]
mpg hp cyl disp drat wt qsec vs am gear carb
22.8 93 4 108 3.85 2.32 18.61 1 1 4 1
21.4 110 6 258 3.08 3.215 19.44 1 0 3 1
18.7 175 8 360 3.15 3.44 17.02 0 0 3 2
18.1 105 6 225 2.76 3.46 20.22 1 0 3 1
14.3 245 8 360 3.21 3.57 15.84 0 0 3 4
24.4 62 4 146.7 3.69 3.19 20 1 0 4 2
22.8 95 4 140.8 3.92 3.15 22.9 1 0 4 2
19.2 123 6 167.6 3.92 3.44 18.3 1 0 4 4
17.8 123 6 167.6 3.92 3.44 18.9 1 0 4 4
16.4 180 8 275.8 3.07 4.07 17.4 0 0 3 3
17.3 180 8 275.8 3.07 3.73 17.6 0 0 3 3
15.2 180 8 275.8 3.07 3.78 18 0 0 3 3
10.4 205 8 472 2.93 5.25 17.98 0 0 3 4
10.4 215 8 460 3 5.424 17.82 0 0 3 4
14.7 230 8 440 3.23 5.345 17.42 0 0 3 4
[ omitted 15 entries ]

Solution 2:

More convoluted

mtcars |> group_by(mpg, hp) |> filter(n() > 1) |> anti_join(mtcars, y = _)
data.frame [30 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
[ omitted 15 entries ]
MT[!MT[, if(.N > 1) .SD, by = .(mpg, hp)], on = names(MT)]
data.table [30 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
[ omitted 15 entries ]
fsetdiff(MT, setcolorder(MT[, if(.N > 1) .SD, by = .(mpg, hp)], names(MT)))
data.table [30 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
[ omitted 15 entries ]

5.5.2 Duplicated values (per row):

(DUPED <- data.table(
    A = c("A1", "A2", "B3", "A4"), 
    B = c("B1", "B2", "B3", "B4"), 
    C = c("A1", "C2", "D3", "C4"), 
    D = c("A1", "D2", "D3", "D4")
  )
)
data.table [4 x 4]
A B C D
A1 B1 A1 A1
A2 B2 C2 D2
B3 B3 D3 D3
A4 B4 C4 D4
DUPED |> mutate(Repeats = apply(pick(A:D), 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", ")))
data.table [4 x 5]
A B C D Repeats
A1 B1 A1 A1 A1
A2 B2 C2 D2
B3 B3 D3 D3 B3, D3
A4 B4 C4 D4
DUPED[, Repeats := apply(.SD, 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", ")), .SDcols = A:D][]
data.table [4 x 5]
A B C D Repeats
A1 B1 A1 A1 A1
A2 B2 C2 D2
B3 B3 D3 D3 B3, D3
A4 B4 C4 D4

With duplication counter:

dup_counts <- function(v) {
  rles <- as.data.table(unclass(rle(v[which(duplicated(v))])))[, lengths := lengths + 1]
  paste(apply(rles, 1, \(r) paste0(r[2], " (", r[1], ")")), collapse = ", ")
}
DUPED |> mutate(Repeats = apply(pick(A:D), 1, \(r) dup_counts(r)))
data.table [4 x 5]
A B C D Repeats
A1 B1 A1 A1 A1 (3)
A2 B2 C2 D2
B3 B3 D3 D3 B3 (2), D3 (2)
A4 B4 C4 D4
DUPED[, Repeats := apply(.SD, 1, \(r) dup_counts(r)), .SDcols = A:D][]
data.table [4 x 5]
A B C D Repeats
A1 B1 A1 A1 A1 (3)
A2 B2 C2 D2
B3 B3 D3 D3 B3 (2), D3 (2)
A4 B4 C4 D4
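
The rle()-based counter only sees runs of adjacent repeats. A possible alternative, sketched below under the assumption that only columns A to D should be inspected, counts with table() instead; `dup_counts_tbl` is a hypothetical helper, not part of the original notes. Restricted to those four columns, A1 occurs three times in the first row.

```r
library(data.table)

# Hypothetical alternative to dup_counts(): count with table(), so repeated
# values are found even when they are not next to each other in the row
dup_counts_tbl <- function(v) {
  counts <- table(v)
  counts <- counts[counts > 1]
  if (length(counts) == 0L) return("")
  paste0(names(counts), " (", as.integer(counts), ")", collapse = ", ")
}

DUPED <- data.table(
  A = c("A1", "A2", "B3", "A4"),
  B = c("B1", "B2", "B3", "B4"),
  C = c("A1", "C2", "D3", "C4"),
  D = c("A1", "D2", "D3", "D4")
)

# Restrict .SD to the original columns so the result does not depend on Repeats
DUPED[, Repeats := apply(.SD, 1, dup_counts_tbl), .SDcols = c("A", "B", "C", "D")][]
```

Note that table() reports the values in sorted order rather than in order of appearance.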

5.6 Expand & Complete:

Here, the entry for person B in year 2010 is missing, and we want to fill it in:

(CAR <- data.table(
    year = c(2010,2011,2012,2013,2014,2015,2011,2012,2013,2014,2015), 
    person = c("A","A","A","A","A","A", "B","B","B","B","B"),
    car = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes", "Citroen","Citroen", "Citroen", "Toyota", "Toyota")
  )
)
data.table [11 x 3]
year person car
2010 A BMW
2011 A BMW
2012 A AUDI
2013 A AUDI
2014 A AUDI
2015 A Mercedes
2011 B Citroen
2012 B Citroen
2013 B Citroen
2014 B Toyota
2015 B Toyota

5.6.1 Expand:

tidyr::expand(CAR, person, year)
data.frame [12 x 2]
person year
A 2010
A 2011
A 2012
A 2013
A 2014
A 2015
B 2010
B 2011
B 2012
B 2013
B 2014
B 2015
CJ(CAR$person, CAR$year, unique = TRUE)
data.table [12 x 2]
V1 V2
A 2010
A 2011
A 2012
A 2013
A 2014
A 2015
B 2010
B 2011
B 2012
B 2013
B 2014
B 2015

5.6.2 Complete:

Joins the original dataset with the expanded one:

CAR |> tidyr::complete(person, year)
data.frame [12 x 3]
person year car
A 2010 BMW
A 2011 BMW
A 2012 AUDI
A 2013 AUDI
A 2014 AUDI
A 2015 Mercedes
B 2010 NA
B 2011 Citroen
B 2012 Citroen
B 2013 Citroen
B 2014 Toyota
B 2015 Toyota
CAR[CJ(person, year, unique = TRUE), on = .(person, year)]
data.table [12 x 3]
year person car
2010 A BMW
2011 A BMW
2012 A AUDI
2013 A AUDI
2014 A AUDI
2015 A Mercedes
2010 B NA
2011 B Citroen
2012 B Citroen
2013 B Citroen
2014 B Toyota
2015 B Toyota
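
tidyr::complete() also accepts a `fill` argument to replace the NAs it creates. A rough data.table analogue, sketched on a shortened version of the CAR table (the placeholder value "none" is an arbitrary choice for this example):

```r
library(data.table)

CAR <- data.table(
  year   = c(2010, 2011, 2012, 2011, 2012),
  person = c("A", "A", "A", "B", "B"),
  car    = c("BMW", "BMW", "AUDI", "Citroen", "Citroen")
)

# All person x year combinations, then a join to pull in the known cars
full <- CAR[CJ(person, year, unique = TRUE), on = .(person, year)]

# Rough analogue of tidyr::complete(fill = list(car = "none")):
# replace the NA created by the join with an explicit placeholder
full[is.na(car), car := "none"]
```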

5.7 Uncount:

Duplicating aggregated rows to get the un-aggregated version back

Data

cols <- c("Mild", "Moderate", "Severe")

dat_agg
data.frame [10 x 6]
ID Site Domain Mild Moderate Severe
1 23 A1 4 0 0
2 27 A1 0 1 1
3 28 A1 0 1 0
4 29 A1 0 0 1
5 31 A1 0 1 0
6 33 A1 0 1 1
7 41 A1 3 0 1
8 48 A1 0 2 4
9 64 A1 1 0 0
10 66 A1 1 0 0
dat_agg |> 
  tidyr::pivot_longer(cols = all_of(cols), names_to = "Severity", values_to = "Count") |> 
  tidyr::uncount(Count) |> 
  mutate(ID_new = row_number(), .after = "ID") |>
  tidyr::pivot_wider(
    names_from = "Severity", values_from = "Severity", 
    values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
  )
data.frame [23 x 7]
ID ID_new Site Domain Mild Moderate Severe
1 1 23 A1 1 0 0
1 2 23 A1 1 0 0
1 3 23 A1 1 0 0
1 4 23 A1 1 0 0
2 5 27 A1 0 1 0
2 6 27 A1 0 0 1
3 7 28 A1 0 1 0
4 8 29 A1 0 0 1
5 9 31 A1 0 1 0
6 10 33 A1 0 1 0
6 11 33 A1 0 0 1
7 12 41 A1 1 0 0
7 13 41 A1 1 0 0
7 14 41 A1 1 0 0
7 15 41 A1 0 0 1
[ omitted 8 entries ]

Solution 1:

(melt(DAT_AGG, measure.vars = cols, variable.name = "Severity", value.name = "Count")
  [rep(1:.N, Count)][, ID_new := .I] |> 
  dcast(... ~ Severity, value.var = "Severity", fun.agg = \(x) ifelse(is.na(x), 0, 1), fill = 0)
)[, -"Count"]
data.table [23 x 7]
ID Site Domain ID_new Mild Moderate Severe
1 23 A1 1 1 0 0
1 23 A1 2 1 0 0
1 23 A1 3 1 0 0
1 23 A1 4 1 0 0
2 27 A1 10 0 1 0
2 27 A1 16 0 0 1
3 28 A1 11 0 1 0
4 29 A1 17 0 0 1
5 31 A1 12 0 1 0
6 33 A1 13 0 1 0
6 33 A1 18 0 0 1
7 41 A1 19 0 0 1
7 41 A1 5 1 0 0
7 41 A1 6 1 0 0
7 41 A1 7 1 0 0
[ omitted 8 entries ]

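Solution 2:

Note

This version only duplicates rows: each copy keeps all of its non-zero indicators (see IDs 2, 6 and 7 in the output below), so it matches Solution 1 only for rows with a single non-zero severity.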

DAT_AGG[Reduce(`c`, sapply(mget(cols), \(x) rep(1:.N, x)))
      ][, (cols) := lapply(.SD, \(x) ifelse(x > 1, 1, x)), .SDcols = cols
      ][order(ID)]
data.table [23 x 6]
ID Site Domain Mild Moderate Severe
1 23 A1 1 0 0
1 23 A1 1 0 0
1 23 A1 1 0 0
1 23 A1 1 0 0
2 27 A1 0 1 1
2 27 A1 0 1 1
3 28 A1 0 1 0
4 29 A1 0 0 1
5 31 A1 0 1 0
6 33 A1 0 1 1
6 33 A1 0 1 1
7 41 A1 1 0 1
7 41 A1 1 0 1
7 41 A1 1 0 1
7 41 A1 1 0 1
[ omitted 8 entries ]
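
If each duplicated row should flag exactly one severity level (as in Solution 1), one possible approach is to expand the table once per level and rebuild the indicator columns. This is a sketch, not the author's method; the three rows below are copied from the printed dat_agg table, and `one_hot` is a name introduced here.

```r
library(data.table)

cols <- c("Mild", "Moderate", "Severe")

# Three rows copied from the printed dat_agg table (IDs 1, 2 and 7)
DAT_AGG <- data.table(
  ID = c(1, 2, 7), Site = c(23, 27, 41), Domain = "A1",
  Mild = c(4, 0, 3), Moderate = c(0, 1, 0), Severe = c(0, 1, 1)
)

# For each severity level: repeat each row `count` times, then set the
# indicator columns so that only the current level is flagged
one_hot <- rbindlist(lapply(cols, \(lvl) {
  out <- DAT_AGG[rep(seq_len(.N), DAT_AGG[[lvl]])]
  out[, (cols) := lapply(cols, \(x) as.integer(x == lvl))]
}))[order(ID)]
```

Summing the indicator columns by ID should give back the original counts, which is a convenient sanity check.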

5.8 List / Unlist:

When a column contains a simple vector/list of values (of the same type, without structure)

5.8.1 One listed column:

Single ID (grouping) column:

(mtcars_list <- mtcars |> group_by(cyl) |> summarize(mpg = list(mpg)) |> ungroup())
data.frame [3 x 2]
cyl mpg
4 <numeric [11]>
6 <numeric [7]>
8 <numeric [14]>
(MT_LIST <- MT[, .(mpg = .(mpg)), keyby = cyl])
data.table [3 x 2]
cyl mpg
4 <numeric [11]>
6 <numeric [7]>
8 <numeric [14]>

Solution 1:

mtcars_list |> unnest(mpg)
data.frame [32 x 2]
cyl mpg
4 22.8
4 24.4
4 22.8
4 32.4
4 30.4
4 33.9
4 21.5
4 27.3
4 26
4 30.4
4 21.4
6 21
6 21
6 21.4
6 18.1
[ omitted 17 entries ]
MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]
data.table [32 x 2]
cyl mpg
4 22.8
4 24.4
4 22.8
4 32.4
4 30.4
4 33.9
4 21.5
4 27.3
4 26
4 30.4
4 21.4
6 21
6 21
6 21.4
6 18.1
[ omitted 17 entries ]

Solution 2:

Avoids grouping altogether: first repeat each row once per list element (restoring the original number of rows), then overwrite the list column with the unlisted values.

MT_LIST[rep(MT_LIST[, .I], lengths(mpg))][, mpg := unlist(MT_LIST$mpg)][]
data.table [32 x 2]
cyl mpg
4 22.8
4 24.4
4 22.8
4 32.4
4 30.4
4 33.9
4 21.5
4 27.3
4 26
4 30.4
4 21.4
6 21
6 21
6 21.4
6 18.1
[ omitted 17 entries ]
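
As a quick check that both solutions give the same rows, here is a small sketch (assuming MT is simply `as.data.table(mtcars)`; `by_group` and `expanded` are names introduced here):

```r
library(data.table)

MT <- as.data.table(mtcars)
MT_LIST <- MT[, .(mpg = .(mpg)), keyby = cyl]

# Solution 1: unlist within each group
by_group <- MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]

# Solution 2: repeat each row once per list element, then overwrite the list column
expanded <- MT_LIST[rep(seq_len(.N), lengths(mpg))][, mpg := unlist(MT_LIST$mpg)][]

# Same values in the same order (attributes such as the key may differ)
all.equal(by_group, expanded, check.attributes = FALSE)
```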

Multiple ID (grouping) columns:

(mtcars_list2 <- mtcars |> group_by(cyl, gear) |> summarize(mpg = list(mpg)) |> ungroup())
data.frame [8 x 3]
cyl gear mpg
4 3 <numeric [1]>
4 4 <numeric [8]>
4 5 <numeric [2]>
6 3 <numeric [2]>
6 4 <numeric [4]>
6 5 <numeric [1]>
8 3 <numeric [12]>
8 5 <numeric [2]>
(MT_LIST2 <- MT[, .(mpg = .(mpg)), keyby = .(cyl, gear)])
data.table [8 x 3]
cyl gear mpg
4 3 <numeric [1]>
4 4 <numeric [8]>
4 5 <numeric [2]>
6 3 <numeric [2]>
6 4 <numeric [4]>
6 5 <numeric [1]>
8 3 <numeric [12]>
8 5 <numeric [2]>

Solution 1:

mtcars_list2 |> unnest(mpg) # group_by(cyl, gear) is optional
data.frame [32 x 3]
cyl gear mpg
4 3 21.5
4 4 22.8
4 4 24.4
4 4 22.8
4 4 32.4
4 4 30.4
4 4 33.9
4 4 27.3
4 4 21.4
4 5 26
4 5 30.4
6 3 21.4
6 3 18.1
6 4 21
6 4 21
[ omitted 17 entries ]
MT_LIST2[, .(mpg = unlist(mpg)), by = setdiff(names(MT_LIST2), 'mpg')]
data.table [32 x 3]
cyl gear mpg
4 3 21.5
4 4 22.8
4 4 24.4
4 4 22.8
4 4 32.4
4 4 30.4
4 4 33.9
4 4 27.3
4 4 21.4
4 5 26
4 5 30.4
6 3 21.4
6 3 18.1
6 4 21
6 4 21
[ omitted 17 entries ]

Solution 2:

Same as with one grouping column

MT_LIST2[rep(MT_LIST2[, .I], lengths(mpg))][, mpg := unlist(MT_LIST2$mpg)][]
data.table [32 x 3]
cyl gear mpg
4 3 21.5
4 4 22.8
4 4 24.4
4 4 22.8
4 4 32.4
4 4 30.4
4 4 33.9
4 4 27.3
4 4 21.4
4 5 26
4 5 30.4
6 3 21.4
6 3 18.1
6 4 21
6 4 21
[ omitted 17 entries ]

5.8.2 Multiple listed columns:

Creating the data:

(mtcars_list_mult <- mtcars |> group_by(cyl, gear) |> summarize(across(c(mpg, disp), \(c) list(c))) |> ungroup())
data.frame [8 x 4]
cyl gear mpg disp
4 3 <numeric [1]> <numeric [1]>
4 4 <numeric [8]> <numeric [8]>
4 5 <numeric [2]> <numeric [2]>
6 3 <numeric [2]> <numeric [2]>
6 4 <numeric [4]> <numeric [4]>
6 5 <numeric [1]> <numeric [1]>
8 3 <numeric [12]> <numeric [12]>
8 5 <numeric [2]> <numeric [2]>
(MT_LIST_MULT <- MT[, lapply(.SD, \(c) .(c)), keyby = .(cyl, gear), .SDcols = c("mpg", "disp")])
data.table [8 x 4]
cyl gear mpg disp
4 3 <numeric [1]> <numeric [1]>
4 4 <numeric [8]> <numeric [8]>
4 5 <numeric [2]> <numeric [2]>
6 3 <numeric [2]> <numeric [2]>
6 4 <numeric [4]> <numeric [4]>
6 5 <numeric [1]> <numeric [1]>
8 3 <numeric [12]> <numeric [12]>
8 5 <numeric [2]> <numeric [2]>

Solution 1:

mtcars_list_mult |> unnest(c(mpg, disp)) # group_by(cyl, gear) is optional
data.frame [32 x 4]
cyl gear mpg disp
4 3 21.5 120.1
4 4 22.8 108
4 4 24.4 146.7
4 4 22.8 140.8
4 4 32.4 78.7
4 4 30.4 75.7
4 4 33.9 71.1
4 4 27.3 79
4 4 21.4 121
4 5 26 120.3
4 5 30.4 95.1
6 3 21.4 258
6 3 18.1 225
6 4 21 160
6 4 21 160
[ omitted 17 entries ]
MT_LIST_MULT[, lapply(.SD, \(c) unlist(c)), by = setdiff(names(MT_LIST_MULT), c("mpg", "disp"))]
data.table [32 x 4]
cyl gear mpg disp
4 3 21.5 120.1
4 4 22.8 108
4 4 24.4 146.7
4 4 22.8 140.8
4 4 32.4 78.7
4 4 30.4 75.7
4 4 33.9 71.1
4 4 27.3 79
4 4 21.4 121
4 5 26 120.3
4 5 30.4 95.1
6 3 21.4 258
6 3 18.1 225
6 4 21 160
6 4 21 160
[ omitted 17 entries ]

5.9 Nest / Unnest:

When a column contains a data.table/data.frame (with multiple columns, structured)

5.9.1 One nested column:

Nesting

(mtcars_nest <- mtcars |> tidyr::nest(data = -cyl)) # Data is inside a tibble
data.frame [3 x 2]
cyl data
6 <tbl_df [7 x 10]>
4 <tbl_df [11 x 10]>
8 <tbl_df [14 x 10]>
mtcars_nest <- mtcars |> nest_by(cyl) |> ungroup() # Data is inside a vctrs_list_of
(MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl])
data.table [3 x 2]
cyl data
4 <data.table [11 x 10]>
6 <data.table [7 x 10]>
8 <data.table [14 x 10]>

Unnesting

mtcars_nest |> unnest(data) |> ungroup()
data.frame [32 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
6 21 160 110 3.9 2.62 16.46 0 1 4 4
6 21 160 110 3.9 2.875 17.02 0 1 4 4
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
6 17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
[ omitted 17 entries ]
MT_NEST[, rbindlist(data), keyby = cyl]
data.table [32 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2
6 21 160 110 3.9 2.62 16.46 0 1 4 4
6 21 160 110 3.9 2.875 17.02 0 1 4 4
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1
[ omitted 17 entries ]
# MT_NEST[, do.call(c, data), keyby = cyl]
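
Nesting followed by unnesting should give back the original rows, sorted by the grouping column. A minimal round-trip sketch, assuming MT is `as.data.table(mtcars)` (`MT_BACK` is a name introduced here):

```r
library(data.table)

MT <- as.data.table(mtcars)

# Nest: one row per cyl, with the remaining columns stored as a data.table
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]

# Unnest: bind the nested tables back together within each group
MT_BACK <- MT_NEST[, rbindlist(data), keyby = cyl]

# Same content as the original table sorted by cyl, with cyl moved to the front
all.equal(MT_BACK, MT[order(cyl), names(MT_BACK), with = FALSE], check.attributes = FALSE)
```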

5.9.2 Multiple nested columns:

Nesting:

(mtcars_nest_mult <- mtcars |> group_by(cyl, gear) |> nest(data1 = c(mpg, hp), data2 = !c(cyl, gear, mpg, hp)) |> ungroup())
data.frame [8 x 4]
cyl gear data1 data2
6 4 <tbl_df [4 x 2]> <tbl_df [4 x 7]>
4 4 <tbl_df [8 x 2]> <tbl_df [8 x 7]>
6 3 <tbl_df [2 x 2]> <tbl_df [2 x 7]>
8 3 <tbl_df [12 x 2]> <tbl_df [12 x 7]>
4 3 <tbl_df [1 x 2]> <tbl_df [1 x 7]>
4 5 <tbl_df [2 x 2]> <tbl_df [2 x 7]>
8 5 <tbl_df [2 x 2]> <tbl_df [2 x 7]>
6 5 <tbl_df [1 x 2]> <tbl_df [1 x 7]>
(MT_NEST_MULT <- MT[, .(data1 = .(.SD[, .(mpg, hp)]), data2 = .(.SD[, !c("mpg", "hp")])), keyby = .(cyl, gear)])
data.table [8 x 4]
cyl gear data1 data2
4 3 <data.table [1 x 2]> <data.table [1 x 7]>
4 4 <data.table [8 x 2]> <data.table [8 x 7]>
4 5 <data.table [2 x 2]> <data.table [2 x 7]>
6 3 <data.table [2 x 2]> <data.table [2 x 7]>
6 4 <data.table [4 x 2]> <data.table [4 x 7]>
6 5 <data.table [1 x 2]> <data.table [1 x 7]>
8 3 <data.table [12 x 2]> <data.table [12 x 7]>
8 5 <data.table [2 x 2]> <data.table [2 x 7]>

Unnesting:

mtcars_nest_mult |> unnest(c(data1, data2)) |> ungroup()
data.frame [32 x 11]
cyl gear mpg hp disp drat wt qsec vs am carb
6 4 21 110 160 3.9 2.62 16.46 0 1 4
6 4 21 110 160 3.9 2.875 17.02 0 1 4
6 4 19.2 123 167.6 3.92 3.44 18.3 1 0 4
6 4 17.8 123 167.6 3.92 3.44 18.9 1 0 4
4 4 22.8 93 108 3.85 2.32 18.61 1 1 1
4 4 24.4 62 146.7 3.69 3.19 20 1 0 2
4 4 22.8 95 140.8 3.92 3.15 22.9 1 0 2
4 4 32.4 66 78.7 4.08 2.2 19.47 1 1 1
4 4 30.4 52 75.7 4.93 1.615 18.52 1 1 2
4 4 33.9 65 71.1 4.22 1.835 19.9 1 1 1
4 4 27.3 66 79 4.08 1.935 18.9 1 1 1
4 4 21.4 109 121 4.11 2.78 18.6 1 1 2
6 3 21.4 110 258 3.08 3.215 19.44 1 0 1
6 3 18.1 105 225 2.76 3.46 20.22 1 0 1
8 3 18.7 175 360 3.15 3.44 17.02 0 0 2
[ omitted 17 entries ]
MT_NEST_MULT[, c(rbindlist(data1), rbindlist(data2)), keyby = .(cyl, gear)]
data.table [32 x 11]
cyl gear mpg hp disp drat wt qsec vs am carb
4 3 21.5 97 120.1 3.7 2.465 20.01 1 0 1
4 4 22.8 93 108 3.85 2.32 18.61 1 1 1
4 4 24.4 62 146.7 3.69 3.19 20 1 0 2
4 4 22.8 95 140.8 3.92 3.15 22.9 1 0 2
4 4 32.4 66 78.7 4.08 2.2 19.47 1 1 1
4 4 30.4 52 75.7 4.93 1.615 18.52 1 1 2
4 4 33.9 65 71.1 4.22 1.835 19.9 1 1 1
4 4 27.3 66 79 4.08 1.935 18.9 1 1 1
4 4 21.4 109 121 4.11 2.78 18.6 1 1 2
4 5 26 91 120.3 4.43 2.14 16.7 0 1 2
4 5 30.4 113 95.1 3.77 1.513 16.9 1 1 2
6 3 21.4 110 258 3.08 3.215 19.44 1 0 1
6 3 18.1 105 225 2.76 3.46 20.22 1 0 1
6 4 21 110 160 3.9 2.62 16.46 0 1 4
6 4 21 110 160 3.9 2.875 17.02 0 1 4
[ omitted 17 entries ]
MT_NEST_MULT[, do.call(c, unname(lapply(.SD, \(c) rbindlist(c)))), .SDcols = patterns('data'), keyby = .(cyl, gear)]
data.table [32 x 11]
cyl gear mpg hp disp drat wt qsec vs am carb
4 3 21.5 97 120.1 3.7 2.465 20.01 1 0 1
4 4 22.8 93 108 3.85 2.32 18.61 1 1 1
4 4 24.4 62 146.7 3.69 3.19 20 1 0 2
4 4 22.8 95 140.8 3.92 3.15 22.9 1 0 2
4 4 32.4 66 78.7 4.08 2.2 19.47 1 1 1
4 4 30.4 52 75.7 4.93 1.615 18.52 1 1 2
4 4 33.9 65 71.1 4.22 1.835 19.9 1 1 1
4 4 27.3 66 79 4.08 1.935 18.9 1 1 1
4 4 21.4 109 121 4.11 2.78 18.6 1 1 2
4 5 26 91 120.3 4.43 2.14 16.7 0 1 2
4 5 30.4 113 95.1 3.77 1.513 16.9 1 1 2
6 3 21.4 110 258 3.08 3.215 19.44 1 0 1
6 3 18.1 105 225 2.76 3.46 20.22 1 0 1
6 4 21 110 160 3.9 2.62 16.46 0 1 4
6 4 21 110 160 3.9 2.875 17.02 0 1 4
[ omitted 17 entries ]

5.9.3 Operate on nested/list columns:

(mtcars_nest <- mtcars |> nest(data = -cyl) |> ungroup())
data.frame [3 x 2]
cyl data
6 <tbl_df [7 x 10]>
4 <tbl_df [11 x 10]>
8 <tbl_df [14 x 10]>
(MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl])
data.table [3 x 2]
cyl data
4 <data.table [11 x 10]>
6 <data.table [7 x 10]>
8 <data.table [14 x 10]>

Creating a new column using the nested data:

Keeping the nested column:

mtcars_nest |> group_by(cyl) |> mutate(sum = sum(unlist(data))) |> ungroup()
data.frame [3 x 3]
cyl data sum
6 <tbl_df [7 x 10]> 2508.16
4 <tbl_df [11 x 10]> 2719.233
8 <tbl_df [14 x 10]> 8516.809
copy(MT_NEST)[, sum := sapply(data, \(r) sum(r)), keyby = cyl][]
data.table [3 x 3]
cyl data sum
4 <data.table [11 x 10]> 2719.233
6 <data.table [7 x 10]> 2508.16
8 <data.table [14 x 10]> 8516.809

Dropping the nested column:

mtcars_nest |> group_by(cyl) |> summarize(sum = sum(unlist(data))) |> ungroup()
data.frame [3 x 2]
cyl sum
4 2719.233
6 2508.16
8 8516.809
MT_NEST[, .(sum = sapply(data, \(r) sum(r))), keyby = cyl]
data.table [3 x 2]
cyl sum
4 2719.233
6 2508.16
8 8516.809

Creating multiple new columns using the nested data:

linreg <- \(data) lm(mpg ~ hp, data = data) |> broom::tidy()
mtcars_nest |> group_by(cyl) |> group_modify(\(d, g) linreg(unnest(d, everything()))) |> ungroup()
data.frame [6 x 6]
cyl term estimate std.error statistic p.value
4 (Intercept) 35.983 5.201 6.918 <0.001 ***
4 hp −0.113 0.061 −1.843 0.098
6 (Intercept) 20.674 3.304 6.256 0.002 **
6 hp −0.008 0.027 −0.286 0.786
8 (Intercept) 18.08 2.988 6.052 <0.001 ***
8 hp −0.014 0.014 −1.025 0.326
MT_NEST[, rbindlist(lapply(data, \(ndt) linreg(ndt))), keyby = cyl][]
data.table [6 x 6]
cyl term estimate std.error statistic p.value
4 (Intercept) 35.983 5.201 6.918 <0.001 ***
4 hp −0.113 0.061 −1.843 0.098
6 (Intercept) 20.674 3.304 6.256 0.002 **
6 hp −0.008 0.027 −0.286 0.786
8 (Intercept) 18.08 2.988 6.052 <0.001 ***
8 hp −0.014 0.014 −1.025 0.326
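
When only the coefficients are needed, the same pattern works without broom. A sketch under that assumption (`coefs` and `MT_COEF` are hypothetical names; MT is taken to be `as.data.table(mtcars)`):

```r
library(data.table)

MT <- as.data.table(mtcars)
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]

# Hypothetical helper: one row of coefficients per nested table
coefs <- \(d) as.data.table(as.list(coef(lm(mpg ~ hp, data = d))))

# One row per cyl, one column per model coefficient
MT_COEF <- MT_NEST[, rbindlist(lapply(data, coefs)), keyby = cyl]
```

This returns a wide layout (one column per coefficient) rather than the long layout produced by broom::tidy().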

Operating inside the nested data:

mtcars_nest |> 
  mutate(data = map(data, \(tibl) mutate(tibl, sum = purrr::pmap_dbl(pick(everything()), sum)))) |> 
  unnest(data)
data.frame [32 x 12]
cyl mpg disp hp drat wt qsec vs am gear carb sum
6 21 160 110 3.9 2.62 16.46 0 1 4 4 322.98
6 21 160 110 3.9 2.875 17.02 0 1 4 4 323.795
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1 420.135
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1 379.54
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4 344.46
6 17.8 167.6 123 3.92 3.44 18.9 1 0 4 4 343.66
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 373.59
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1 255.58
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2 266.98
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 295.57
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1 209.85
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2 191.165
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1 202.955
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1 269.775
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1 204.215
[ omitted 17 entries ]
mtcars_nest |> 
  mutate(across(data, \(tibls) map(tibls, \(tibl) mutate(tibl, sum = apply(pick(everything()), 1, sum))))) |> 
  unnest(data)
data.frame [32 x 12]
cyl mpg disp hp drat wt qsec vs am gear carb sum
6 21 160 110 3.9 2.62 16.46 0 1 4 4 322.98
6 21 160 110 3.9 2.875 17.02 0 1 4 4 323.795
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1 420.135
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1 379.54
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4 344.46
6 17.8 167.6 123 3.92 3.44 18.9 1 0 4 4 343.66
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 373.59
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1 255.58
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2 266.98
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 295.57
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1 209.85
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2 191.165
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1 202.955
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1 269.775
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1 204.215
[ omitted 17 entries ]

Using the nplyr package:

library(nplyr)

mtcars_nest |> 
  nplyr::nest_mutate(data, sum = apply(cur_data(), 1, sum)) |> 
  unnest(data)
data.frame [32 x 12]
cyl mpg disp hp drat wt qsec vs am gear carb sum
6 21 160 110 3.9 2.62 16.46 0 1 4 4 322.98
6 21 160 110 3.9 2.875 17.02 0 1 4 4 323.795
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1 420.135
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1 379.54
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4 344.46
6 17.8 167.6 123 3.92 3.44 18.9 1 0 4 4 343.66
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 373.59
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1 255.58
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2 266.98
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 295.57
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1 209.85
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2 191.165
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1 202.955
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1 269.775
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1 204.215
[ omitted 17 entries ]
copy(MT_NEST)[, data := lapply(data, \(dt) dt[, sum := apply(.SD, 1, sum)])
            ][, rbindlist(data), keyby = cyl]
data.table [32 x 12]
cyl mpg disp hp drat wt qsec vs am gear carb sum
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1 255.58
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2 266.98
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 295.57
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1 209.85
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2 191.165
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1 202.955
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1 269.775
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1 204.215
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2 268.57
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2 269.683
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2 284.89
6 21 160 110 3.9 2.62 16.46 0 1 4 4 322.98
6 21 160 110 3.9 2.875 17.02 0 1 4 4 323.795
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1 420.135
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1 379.54
[ omitted 17 entries ]
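
Since every nested column is numeric here, rowSums() can stand in for apply(.SD, 1, sum). The sketch below also builds new nested tables rather than updating the stored ones with `:=`; it assumes MT is `as.data.table(mtcars)`, and `MT_NEST2` and `res` are names introduced for the example.

```r
library(data.table)

MT <- as.data.table(mtcars)
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]

# Copy each nested table before adding the column, so MT_NEST is left unchanged;
# rowSums() is a vectorised stand-in for apply(.SD, 1, sum)
MT_NEST2 <- MT_NEST[, .(cyl, data = lapply(data, \(dt) copy(dt)[, sum := rowSums(.SD)]))]

res <- MT_NEST2[, rbindlist(data), keyby = cyl]
```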

5.10 Rotate / Transpose:

(MT_SUMMARY <- MT[, tidy(summary(mpg)), by = cyl])
data.table [3 x 7]
cyl minimum q1 median mean q3 maximum
6 17.8 18.65 19.7 19.743 21 21.4
4 21.4 22.8 26 26.664 30.4 33.9
8 10.4 14.4 15.2 15.1 16.25 19.2

Solution 1:

Using pivots to fully rotate the data.table:

MT_SUMMARY |> 
  pivot_longer(!cyl, names_to = "Statistic") |> 
  pivot_wider(id_cols = "Statistic", names_from = "cyl", names_prefix = "Cyl ")
data.frame [6 x 4]
Statistic Cyl 6 Cyl 4 Cyl 8
minimum 17.8 21.4 10.4
q1 18.65 22.8 14.4
median 19.7 26 15.2
mean 19.743 26.664 15.1
q3 21 30.4 16.25
maximum 21.4 33.9 19.2
MT_SUMMARY |> 
  melt(id.vars = "cyl", variable.name = "Statistic") |> 
  dcast(Statistic ~ paste0("Cyl ", cyl))
data.table [6 x 4]
Statistic Cyl 4 Cyl 6 Cyl 8
minimum 21.4 17.8 10.4
q1 22.8 18.65 14.4
median 26 19.7 15.2
mean 26.664 19.743 15.1
q3 30.4 21 16.25
maximum 33.9 21.4 19.2

Solution 2:

Using a dedicated function:

Note

AFAIK there is no native Tidyverse function to do this.

library(datawizard)

datawizard::data_rotate(MT_SUMMARY, colnames = TRUE, rownames = "Statistic")
data.frame [6 x 4]
Statistic 6 4 8
minimum 17.8 21.4 10.4
q1 18.65 22.8 14.4
median 19.7 26 15.2
mean 19.743 26.664 15.1
q3 21 30.4 16.25
maximum 21.4 33.9 19.2
data.table::transpose(MT_SUMMARY, keep.names = "Statistic", make.names = 1)
data.table [6 x 4]
Statistic 6 4 8
minimum 17.8 21.4 10.4
q1 18.65 22.8 14.4
median 19.7 26 15.2
mean 19.743 26.664 15.1
q3 21 30.4 16.25
maximum 21.4 33.9 19.2
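
The two naming arguments of transpose() are easy to mix up, so here is a small sketch on a cut-down summary table (values copied from the output above; `SUMMARY`, `rotated` and `back` are names introduced here):

```r
library(data.table)

SUMMARY <- data.table(
  cyl     = c(6, 4, 8),
  minimum = c(17.8, 21.4, 10.4),
  maximum = c(21.4, 33.9, 19.2)
)

# keep.names: name of the column that will store the former column names
# make.names: column whose values become the new column names
rotated <- transpose(SUMMARY, keep.names = "Statistic", make.names = "cyl")

# Transposing again with the roles swapped restores the original layout
# (cyl comes back as character, since it went through the column names)
back <- transpose(rotated, keep.names = "cyl", make.names = "Statistic")
```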

6 Processing examples:


Examples of interesting tasks that I’ve collected over time.

6.1 Find minimum in each group:

MT |> group_by(cyl) |> arrange(mpg) |> slice(1) |> ungroup()
data.frame [3 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
MT[, .SD[which.min(mpg)], keyby = cyl]
data.table [3 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2
6 17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
8 10.4 472 205 2.93 5.25 17.98 0 0 3 4
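
Both versions above keep a single row per group even when several rows share the minimum (two 8-cylinder cars have an mpg of 10.4). A sketch of the row-index variant, with and without ties, assuming MT is `as.data.table(mtcars)`:

```r
library(data.table)

MT <- as.data.table(mtcars)

# which.min() returns a single position, so ties are resolved by keeping the first row
first_min <- MT[MT[, .I[which.min(mpg)], by = cyl]$V1]

# Keep every row tied for the group minimum instead
all_min <- MT[MT[, .I[mpg == min(mpg)], by = cyl]$V1]
```

The `.I` form collects row numbers first and subsets once, which is often faster than `.SD[...]` on large tables.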

6.2 GROUP > FILTER > MUTATE

Data:

(DAT <- structure(list(
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
  name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"), 
  year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L), 
  job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss", "Boss", "Boss", "Boss"), 
  job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)
  ), 
  .Names = c("id", "name", "year", "job", "job2"), 
  class = "data.frame", 
  row.names = c(NA, -16L)
) |> setDT())
data.table [16 x 5]
id name year job job2
1 Jane 1980 Manager 1
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Manager 1
2 Bob 1986 Manager 1
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
[ omitted 1 entries ]

Tidyverse:

DAT |> group_by(name, job) |> 
  filter(job != "Boss" | year == min(year)) |> 
  mutate(cumu_job2 = cumsum(job2)) |> 
  ungroup()
data.frame [11 x 6]
id name year job job2 cumu_job2
1 Jane 1980 Manager 1 1
1 Jane 1981 Manager 1 2
1 Jane 1982 Manager 1 3
1 Jane 1983 Manager 1 4
1 Jane 1984 Manager 1 5
1 Jane 1985 Manager 1 6
1 Jane 1986 Boss 0 0
2 Bob 1985 Manager 1 1
2 Bob 1986 Manager 1 2
2 Bob 1987 Manager 1 3
2 Bob 1988 Boss 0 0
Note

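Here, the grouping is done BEFORE the filter, so min(year) is evaluated within each (name, job) group: the first "Boss" year of each person is kept, with a cumulative sum of 0 because job2 is 0 for every "Boss" row.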

data.table:

Solution 1:

DAT[ , .SD[job != "Boss" | year == min(year), .(cumu_job2 = cumsum(job2))], by = .(name, job)]
data.table [11 x 3]
name job cumu_job2
Jane Manager 1
Jane Manager 2
Jane Manager 3
Jane Manager 4
Jane Manager 5
Jane Manager 6
Jane Boss 0
Bob Manager 1
Bob Manager 2
Bob Manager 3
Bob Boss 0

Solution 2:

DAT[ , .(cumu_job2 = cumsum(job2[job != "Boss" | year == min(year)])), by = .(name, job)]
data.table [11 x 3]
name job cumu_job2
Jane Manager 1
Jane Manager 2
Jane Manager 3
Jane Manager 4
Jane Manager 5
Jane Manager 6
Jane Boss 0
Bob Manager 1
Bob Manager 2
Bob Manager 3
Bob Boss 0

Solution 3:

DAT[
    DAT[, .I[job != "Boss" | year == min(year)], by = .(name, job)]$V1 # Row indices
  ][
    , cumu_job2 := cumsum(job2), by = .(name, job)
  ][]
data.table [11 x 6]
id name year job job2 cumu_job2
1 Jane 1980 Manager 1 1
1 Jane 1981 Manager 1 2
1 Jane 1982 Manager 1 3
1 Jane 1983 Manager 1 4
1 Jane 1984 Manager 1 5
1 Jane 1985 Manager 1 6
1 Jane 1986 Boss 0 0
2 Bob 1985 Manager 1 1
2 Bob 1986 Manager 1 2
2 Bob 1987 Manager 1 3
2 Bob 1988 Boss 0 0

If we filtered before the grouping (min(year) is then computed over the whole table, so every "Boss" row is dropped):

DAT[job != "Boss" | year == min(year), list(cumu_job2 = cumsum(job2)), by = .(name, job)]
data.table [9 x 3]
name job cumu_job2
Jane Manager 1
Jane Manager 2
Jane Manager 3
Jane Manager 4
Jane Manager 5
Jane Manager 6
Bob Manager 1
Bob Manager 2
Bob Manager 3
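
The difference comes from where min(year) is evaluated. A reduced sketch of the two placements, on a made-up five-row table (`within_group` and `whole_table` are names introduced here):

```r
library(data.table)

DAT <- data.table(
  name = c("Jane", "Jane", "Jane", "Bob", "Bob"),
  year = c(1980L, 1981L, 1982L, 1985L, 1986L),
  job  = c("Manager", "Boss", "Boss", "Manager", "Boss")
)

# Filter inside each group: min(year) is the first year of that (name, job) group,
# so the first "Boss" year of each person is kept
within_group <- DAT[, .SD[job != "Boss" | year == min(year)], by = .(name, job)]

# Filter in `i`: min(year) is computed once over the whole table (1980),
# so every "Boss" row is dropped
whole_table <- DAT[job != "Boss" | year == min(year)]
```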

6.3 GROUP > SUMMARIZE > JOIN > MUTATE

Data:

(GSJM1 <- data.table(x = c(1,1,1,1,2,2,2,2), y = c("a", "a", "b", "b"), z = 1:8, key = c("x", "y")))
data.table [8 x 3]
x y z
1 a 1
1 a 2
1 b 3
1 b 4
2 a 5
2 a 6
2 b 7
2 b 8
(GSJM2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y")))
data.table [2 x 3]
x y mul
1 a 4
2 b 3

Tidyverse:

as.data.frame(GSJM1) |> 
  group_by(x, y) |>
  summarise(z = sum(z)) |>
  ungroup() |> 
  right_join(GSJM2, by = c("x", "y")) |>
  mutate(z = z * mul) |> 
  select(-mul)
data.frame [2 x 3]
x y z
1 a 12
2 b 45

data.table:

Basic:

GSJM1[, .(z = sum(z)), by = .(x, y)][GSJM2][, `:=`(z = z * mul, mul = NULL)][]
data.table [2 x 3]
x y z
1 a 12
2 b 45

Advanced (using .EACHI):

GSJM1[GSJM2, .(z = sum(z) * mul), by = .EACHI]
data.table [2 x 3]
x y z
1 a 12
2 b 45
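
With `by = .EACHI`, j is evaluated once for each row of the table passed in `i`, on the rows it matches. A sketch contrasting it with a plain join, using toy tables and an explicit `on =` instead of keys (`X`, `Y`, `each_i` and `overall` are names introduced here):

```r
library(data.table)

X <- data.table(x = c(1, 1, 2, 2), y = c("a", "a", "b", "b"), z = 1:4)
Y <- data.table(x = c(1, 2), y = c("a", "b"), mul = c(10, 100))

# by = .EACHI evaluates j once per row of Y, on the rows of X matched by that row,
# so `mul` is a single value each time
each_i <- X[Y, .(z = sum(z) * mul), on = .(x, y), by = .EACHI]

# Without .EACHI, j is evaluated once over the whole join result
overall <- X[Y, .(z = sum(z * mul)), on = .(x, y)]
```

Here `each_i` has one row per row of Y (30 and 700), while `overall` collapses everything into a single value (730).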

6.4 Separating rows & cleaning text:

Data

(DT_COMA <- data.table(
  first = c(1,"2,3",3,4,5,6.5,7,8,9,0), 
  second = c(1,"2,,5",3,4,5,"6,5,9",7,8,9,0), 
  third = c("one", "two", "thr,ee", "four", "five", "six", "sev,en", "eight", "nine", "zero"), 
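  # NB: the dates are unquoted, so each one is evaluated as a division (e.g. 1/1/2020 is about 0.0005) and every value becomes 1970-01-01
  fourth = as.Date(c(1/1/2020, 2/1/2020, 3/1/2020, 4/1/2020, 5/1/2020, 6/1/2020, 7/1/2020, 8/1/2020, 9/1/2020, 10/1/2020), origin = "1970-01-01")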
  )
)
data.table [10 x 4]
first second third fourth
1 1 one 1970-01-01
2,3 2,,5 two 1970-01-01
3 3 thr,ee 1970-01-01
4 4 four 1970-01-01
5 5 five 1970-01-01
6.5 6,5,9 six 1970-01-01
7 7 sev,en 1970-01-01
8 8 eight 1970-01-01
9 9 nine 1970-01-01
0 0 zero 1970-01-01

6.4.1 Step 1: Cleaning

Removing unwanted commas within words

Tidyverse:

DT_COMA |> mutate(across(where(\(v) is.character(v) & all(is.na(as.numeric(v)))), \(v) stringr::str_remove_all(v, ",")))
data.table [10 x 4]
first second third fourth
1 1 one 1970-01-01
2,3 2,,5 two 1970-01-01
3 3 three 1970-01-01
4 4 four 1970-01-01
5 5 five 1970-01-01
6.5 6,5,9 six 1970-01-01
7 7 seven 1970-01-01
8 8 eight 1970-01-01
9 9 nine 1970-01-01
0 0 zero 1970-01-01

data.table:

cols_to_clean <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & all(is.na(as.numeric(v)))] |> colnames()

copy(DT_COMA)[, (cols_to_clean) := lapply(.SD, \(v) stringr::str_remove_all(v, ",")), .SDcols = cols_to_clean][]
data.table [10 x 4]
first second third fourth
1 1 one 1970-01-01
2,3 2,,5 two 1970-01-01
3 3 three 1970-01-01
4 4 four 1970-01-01
5 5 five 1970-01-01
6.5 6,5,9 six 1970-01-01
7 7 seven 1970-01-01
8 8 eight 1970-01-01
9 9 nine 1970-01-01
0 0 zero 1970-01-01
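
The column test above relies on as.numeric() returning NA for non-numeric strings, which also emits coercion warnings. A self-contained sketch of the same idea with the warnings silenced, using base gsub() in place of stringr (the table `DT`, the helper `is_text` and `text_cols` are made up for this example):

```r
library(data.table)

DT <- data.table(
  num  = c("1", "2,3", "4"),
  word = c("one", "t,wo", "four")
)

# A character column counts as "text" when none of its values parses as a number
is_text <- \(v) is.character(v) && all(is.na(suppressWarnings(as.numeric(v))))

text_cols <- names(DT)[vapply(DT, is_text, logical(1))]

# Remove commas only in the text columns; `num` is left for the splitting step
DT[, (text_cols) := lapply(.SD, \(v) gsub(",", "", v, fixed = TRUE)), .SDcols = text_cols][]
```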

6.4.2 Step 2: Separating rows

Each row with multiple comma-separated values in a numeric column has to be split into multiple rows (one value per row)

Tidyverse:

cols_to_separate <- DT_COMA |> select(where(\(v) is.character(v) & any(!is.na(as.numeric(v))))) |> colnames()

purrr::reduce(
  cols_to_separate, 
  \(acc, col) acc |> tidyr::separate_rows(all_of(col), sep = ",", convert = TRUE), 
  .init = DT_COMA
)
data.frame [17 x 4]
first second third fourth
1 1 one 1970-01-01
2 2 two 1970-01-01
2 NA two 1970-01-01
2 5 two 1970-01-01
3 2 two 1970-01-01
3 NA two 1970-01-01
3 5 two 1970-01-01
3 3 thr,ee 1970-01-01
4 4 four 1970-01-01
5 5 five 1970-01-01
6.5 6 six 1970-01-01
6.5 5 six 1970-01-01
6.5 9 six 1970-01-01
7 7 sev,en 1970-01-01
8 8 eight 1970-01-01
[ omitted 2 entries ]

data.table:

cols_to_separate <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & any(!is.na(as.numeric(v)))] |> colnames()

(purrr::reduce(
  cols_to_separate,
  \(acc, col) acc[rep(1:.N, lengths(strsplit(get(col), ",")))][, (col) := type.convert(unlist(strsplit(acc[[col]], ",", fixed = T)), as.is = T, na.strings = "")],
  .init = DT_COMA
))[]
data.table [17 x 4]
first second third fourth
1 1 one 1970-01-01
2 2 two 1970-01-01
2 NA two 1970-01-01
2 5 two 1970-01-01
3 2 two 1970-01-01
3 NA two 1970-01-01
3 5 two 1970-01-01
3 3 thr,ee 1970-01-01
4 4 four 1970-01-01
5 5 five 1970-01-01
6.5 6 six 1970-01-01
6.5 5 six 1970-01-01
6.5 9 six 1970-01-01
7 7 sev,en 1970-01-01
8 8 eight 1970-01-01
[ omitted 2 entries ]
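
The data.table one-liner above packs two moves into each `reduce()` step: repeating every row once per comma-separated piece, then overwriting the column with the flattened pieces. Here is a minimal sketch of those two moves on a single column, using a made-up table `dt` with illustrative names:

```r
library(data.table)

dt <- data.table(id = 1:3, x = c("1", "2,,5", "7"))

# One character vector of pieces per row ("2,,5" gives "2", "", "5")
pieces <- strsplit(dt$x, ",", fixed = TRUE)

# Move 1: repeat each row as many times as it has pieces
out <- dt[rep(seq_len(nrow(dt)), lengths(pieces))]

# Move 2: replace the column with the flattened pieces; empty strings become NA
out[, x := type.convert(unlist(pieces), as.is = TRUE, na.strings = "")]
out
```

Because `:=` replaces the whole column here (no row filter), the column type is free to change from character to integer.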

6.4.3 Combining both steps:

Tidyverse:

DT_COMA <- DT_COMA |> mutate(across(where(\(v) is.character(v) & all(is.na(as.numeric(v)))), \(v) stringr::str_remove_all(v, ",")))

purrr::reduce(
  select(DT_COMA, where(\(v) is.character(v) & any(!is.na(as.numeric(v))))) |> colnames(), 
  \(acc, col) acc |> tidyr::separate_rows(all_of(col), sep = ",", convert = TRUE), # all_of(): col holds a column name as a string
  .init = DT_COMA
)
data.frame [17 x 4]
first second third fourth
1 1 one 1970-01-01
2 2 two 1970-01-01
2 NA two 1970-01-01
2 5 two 1970-01-01
3 2 two 1970-01-01
3 NA two 1970-01-01
3 5 two 1970-01-01
3 3 three 1970-01-01
4 4 four 1970-01-01
5 5 five 1970-01-01
6.5 6 six 1970-01-01
6.5 5 six 1970-01-01
6.5 9 six 1970-01-01
7 7 seven 1970-01-01
8 8 eight 1970-01-01
[ omitted 2 entries ]

data.table:

cols_to_clean <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & all(is.na(as.numeric(v)))] |> colnames()
cols_to_separate <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & any(!is.na(as.numeric(v)))] |> colnames()

DT_COMA[, c(cols_to_clean) := purrr::map(.SD[, cols_to_clean, with = F], \(v) stringr::str_remove_all(v, ","))]
data.table [10 x 4]
first second third fourth
1 1 one 1970-01-01
2,3 2,,5 two 1970-01-01
3 3 three 1970-01-01
4 4 four 1970-01-01
5 5 five 1970-01-01
6.5 6,5,9 six 1970-01-01
7 7 seven 1970-01-01
8 8 eight 1970-01-01
9 9 nine 1970-01-01
0 0 zero 1970-01-01
(purrr::reduce(
  cols_to_separate,
  \(acc, col) acc[rep(1:.N, lengths(strsplit(get(col), ",")))][, (col) := type.convert(unlist(strsplit(acc[[col]], ",", fixed = T)), as.is = T, na.strings = "")],
  .init = DT_COMA
))[]
data.table [17 x 4]
first second third fourth
1 1 one 1970-01-01
2 2 two 1970-01-01
2 NA two 1970-01-01
2 5 two 1970-01-01
3 2 two 1970-01-01
3 NA two 1970-01-01
3 5 two 1970-01-01
3 3 three 1970-01-01
4 4 four 1970-01-01
5 5 five 1970-01-01
6.5 6 six 1970-01-01
6.5 5 six 1970-01-01
6.5 9 six 1970-01-01
7 7 seven 1970-01-01
8 8 eight 1970-01-01
[ omitted 2 entries ]

6.5 Multiple choice questions:

Data:

surv
data.frame [5 x 2]
ID response
1 I read the assigned readings.|I reread my notes.|I worked with one or more classmates.
2 I read the assigned readings.|I reviewed this week’s slides.
3 I worked on practice problems.|I read the assigned readings.|I reread my notes.
4 I worked on practice problems.|I read the assigned readings.|I reread my notes.
5 I worked on practice problems.|I read the assigned readings.|I reread my notes.|I reviewed this week’s slides.|I worked with one or more classmates.

Each possible answer gets its own indicator column (1 if the respondent selected it, 0 otherwise). A pivot is used because respondents selected different subsets of the answers:

Tidyverse:

surv |> 
  mutate(response = str_split(response, fixed("|"))) |> 
  unnest(response) |> 
  pivot_wider(id_cols = ID, names_from = response, values_from = response, values_fn = \(.x) sum(!is.na(.x)), values_fill = 0)
data.frame [5 x 6]
ID | I read the assigned readings. | I reread my notes. | I worked with one or more classmates. | I reviewed this week’s slides. | I worked on practice problems.
1 | 1 | 1 | 1 | 0 | 0
2 | 1 | 0 | 0 | 1 | 0
3 | 1 | 1 | 0 | 0 | 1
4 | 1 | 1 | 0 | 0 | 1
5 | 1 | 1 | 1 | 1 | 1

data.table:

as.data.table(surv)[, c(.SD, tstrsplit(response, "|", fixed = TRUE))][, -"response"] |> 
  melt(measure.vars = patterns("^V")) |> 
  dcast(ID ~ value, fun.agg = \(.x) sum(!is.na(.x)), subset = .(!is.na(value)))
data.table [5 x 6]
ID | I read the assigned readings. | I reread my notes. | I reviewed this week’s slides. | I worked on practice problems. | I worked with one or more classmates.
1 | 1 | 1 | 0 | 0 | 1
2 | 1 | 0 | 1 | 0 | 0
3 | 1 | 1 | 0 | 1 | 0
4 | 1 | 1 | 0 | 1 | 0
5 | 1 | 1 | 1 | 1 | 1
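
A possible shortcut for the data.table side is to go straight to long format with `strsplit()` by ID and then count with `dcast()`, skipping the intermediate `tstrsplit()` + `melt()` step. This is only a sketch on shortened, made-up answers; the names `long` and `wide` are illustrative.

```r
library(data.table)

surv_dt <- data.table(ID = 1:3, response = c("a|b", "b", "a|c"))

# One row per (ID, answer) pair
long <- surv_dt[, .(answer = unlist(strsplit(response, "|", fixed = TRUE))), by = ID]

# Count occurrences; combinations that never occur are filled with length(character(0)), i.e. 0
wide <- dcast(long, ID ~ answer, fun.aggregate = length, value.var = "answer")
wide
```

With `fun.aggregate = length`, absent ID/answer pairs come out as 0 without an explicit `fill` argument.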

6.6 Filling with lagging conditions:

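Task: Fill a missing value with the value from the previous row, but only when the previous and next rows share the same areacode and the same first two zipcode digits (see this SO question).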

Data:

ZIP <- structure(
  list(
    zipcode = c(1001, 1002, 1003, 1004, 1101, 1102, 1103, 1104, 1201, 1202, 1203, 1302), 
    areacode = c(4, 4, NA, 4, 4, 4, NA, 1, 4, 4, NA, 4), 
    type = structure(c(1L, 1L, NA, 1L, 2L, 2L, NA, 1L, 1L, 1L, NA, 1L), .Label = c("clay", "sand"), class = "factor"), 
    region = c(3, 3, NA, 3, 3, 3, NA, 3, 3, 3, NA, 3), 
    do_not_fill = c(1, NA, NA, 1, 1, NA, NA, 1, NA, NA, NA, 1)
    ), 
  class = c("data.table", "data.frame"), row.names = c(NA, -12L)
)

Tidyverse:

as_tibble(ZIP) |>
  mutate(type = as.character(type)) |>
  mutate(
    across(1:4, ~ ifelse(
        is.na(.) & lag(areacode) == lead(areacode) & 
          lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))),
        lag(.), .
      )
    )
  )
data.frame [12 x 5]
zipcode areacode type region do_not_fill
1 001 4 clay 3 1
1 002 4 clay 3 NA
1 003 4 clay 3 NA
1 004 4 clay 3 1
1 101 4 sand 3 1
1 102 4 sand 3 NA
1 103 NA NA NA NA
1 104 1 clay 3 1
1 201 4 clay 3 NA
1 202 4 clay 3 NA
1 203 NA NA NA NA
1 302 4 clay 3 1

data.table:

# Test is.na(v) rather than is.na(areacode), so that non-missing columns such as zipcode are never overwritten
ZIP[, c(lapply(.SD, \(v) {fifelse(
  is.na(v) & lag(areacode) == lead(areacode) &
    lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))), lag(v), v)}), 
  .SD[, .(do_not_fill)]), .SDcols = !patterns("do_not_fill")]
data.table [12 x 5]
zipcode areacode type region do_not_fill
1 001 4 clay 3 1
1 002 4 clay 3 NA
1 003 4 clay 3 NA
1 004 4 clay 3 1
1 101 4 sand 3 1
1 102 4 sand 3 NA
1 103 NA NA NA NA
1 104 1 clay 3 1
1 201 4 clay 3 NA
1 202 4 clay 3 NA
1 203 NA NA NA NA
1 302 4 clay 3 1
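
The data.table version above borrows `lag()` and `lead()` from dplyr. As a hedged sketch of a data.table-only variant, the neighbour condition can be computed once with `shift()` and then reused for every column to fill. The shortened table `zip` and the names `prefix`, `ok` and `cols` are made up for this illustration.

```r
library(data.table)

zip <- data.table(
  zipcode  = c(1001, 1002, 1003, 1004, 1101),
  areacode = c(4, 4, NA, 4, 4),
  region   = c(3, 3, NA, 3, 3)
)

# First two digits of the zipcode, as character
prefix <- substr(zip$zipcode, 1, 2)

# TRUE when the previous and next rows agree on both areacode and zipcode prefix
ok <- shift(zip$areacode) == shift(zip$areacode, type = "lead") &
  shift(prefix) == shift(prefix, type = "lead")

# Fill a missing value from the previous row only where the condition holds
cols <- c("areacode", "region")
filled <- copy(zip)[
  , (cols) := lapply(.SD, \(v) fifelse(is.na(v) & ok, shift(v), v)),
  .SDcols = cols
]
filled
```

Restricting `.SDcols` to the columns that may be filled leaves zipcode (and any do-not-fill column) untouched.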

6.7 Join + Coalesce:

Task: Replace the missing dates from one dataset with the earliest date from another dataset, matching by ID:

Data:

(dt1 <- data.table::fread(
"
      id  x       y   z         
     1    A       1    NA        
     2    C       3    NA        
     3    C       3    NA        
     4    C       2    NA        
     5    B       2    2019-08-04
     6    C       1    2019-09-18
     7    B       3    2019-12-17
     8    A       2    2019-11-02
     9    A       3    2020-03-16
    10    A       1    2020-01-31
"
))
data.table [10 x 4]
id x y z
1 A 1 NA
2 C 3 NA
3 C 3 NA
4 C 2 NA
5 B 2 2019-08-04
6 C 1 2019-09-18
7 B 3 2019-12-17
8 A 2 2019-11-02
9 A 3 2020-03-16
10 A 1 2020-01-31
(dt2 <- data.table::fread(
"      id      date
      1      2012-09-25
      1      2012-03-26
      1      2012-11-12
      2      2013-01-24
      2      2012-05-04
      2      2012-02-24
      3      2012-05-30
      3      2012-02-15
      4      2012-03-13
      4      2012-05-18
"))
data.table [10 x 2]
id date
1 2012-09-25
1 2012-03-26
1 2012-11-12
2 2013-01-24
2 2012-05-04
2 2012-02-24
3 2012-05-30
3 2012-02-15
4 2012-03-13
4 2012-05-18

Tidyverse:

Using coalesce:

left_join(
  dt1, 
  dt2 |> group_by(id) |> summarize(date = min(date)), 
  by = "id"
) |> mutate(date = coalesce(z, date), z = NULL)
data.table [10 x 4]
id x y date
1 A 1 2012-03-26
2 C 3 2012-02-24
3 C 3 2012-02-15
4 C 2 2012-03-13
5 B 2 2019-08-04
6 C 1 2019-09-18
7 B 3 2019-12-17
8 A 2 2019-11-02
9 A 3 2020-03-16
10 A 1 2020-01-31

Using the rows_* functions:

dplyr::rows_patch(
  dt1 |> rename(date = z), 
  dt2 |> group_by(id) |> summarize(date = min(date)), 
  by = "id"
)
data.table [10 x 4]
id x y date
1 A 1 2012-03-26
2 C 3 2012-02-24
3 C 3 2012-02-15
4 C 2 2012-03-13
5 B 2 2019-08-04
6 C 1 2019-09-18
7 B 3 2019-12-17
8 A 2 2019-11-02
9 A 3 2020-03-16
10 A 1 2020-01-31

data.table:

As a right join:

copy(dt2)[, .(date = min(date)), by = id
  ][dt1, on = "id"][, `:=`(date = fcoalesce(date, z), z = NULL)][]
data.table [10 x 4]
id date x y
1 2012-03-26 A 1
2 2012-02-24 C 3
3 2012-02-15 C 3
4 2012-03-13 C 2
5 2019-08-04 B 2
6 2019-09-18 C 1
7 2019-12-17 B 3
8 2019-11-02 A 2
9 2020-03-16 A 3
10 2020-01-31 A 1

As a left join:

copy(dt1)[dt2[, .(date = min(date)), by = id], c("id", "date") := .(i.id, i.date), on = "id"
  ][, `:=`(date = fcoalesce(date, z), z = NULL)][]
data.table [10 x 4]
id x y date
1 A 1 2012-03-26
2 C 3 2012-02-24
3 C 3 2012-02-15
4 C 2 2012-03-13
5 B 2 2019-08-04
6 C 1 2019-09-18
7 B 3 2019-12-17
8 A 2 2019-11-02
9 A 3 2020-03-16
10 A 1 2020-01-31
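
Another way to write the same idea is an update join that patches the existing column in place, with no extra column to create and drop. A minimal sketch on small made-up tables (`a`, `b`, `earliest` and `res` are illustrative names):

```r
library(data.table)

a <- data.table(id = 1:3, z = as.IDate(c(NA, NA, "2019-08-04")))
b <- data.table(
  id   = c(1L, 1L, 2L),
  date = as.IDate(c("2012-09-25", "2012-03-26", "2013-01-24"))
)

# Earliest date per id in the second table
earliest <- b[, .(date = min(date)), by = id]

# Update join: for matching ids, keep z when present, otherwise take the earliest date
res <- copy(a)[earliest, on = "id", z := fcoalesce(z, i.date)]
res
```

Rows of `a` without a match in `earliest` (id 3 here) are left untouched.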

6.8 Join on multiple columns (partial matching):

Task: Join two tables on their IDs, where the matching ID can sit in either of two columns (id1 or id2) of one of the tables.

(dt1 <- data.table(id = c("ABC", "AAA", "CBC"), x = 1:3))
data.table [3 x 2]
id x
ABC 1
AAA 2
CBC 3
(dt2 <- data.table(
  id1 = c("ABC", "AA", "CB"), 
  id2 = c("AB", "AAA", "CBC"), 
  y = c(0.307, 0.144, 0.786))
)
data.table [3 x 3]
id1 id2 y
ABC AB 0.307
AA AAA 0.144
CB CBC 0.786

Solution 1:

Combine the two ID columns into one with pivot_longer, then join:

dt2 |> pivot_longer(matches("^id"), names_to = NULL, values_to = "id") |> right_join(dt1, by = "id")
data.frame [3 x 3]
y id x
0.307 ABC 1
0.144 AAA 2
0.786 CBC 3
melt(dt2, measure.vars = patterns("^id"), value.name = "id")[, variable := NULL][dt1, on = "id"]
data.table [3 x 3]
y id x
0.307 ABC 1
0.144 AAA 2
0.786 CBC 3

Solution 2:

Combine the two ID columns into one with unite + separate_rows, then join:

(From @TimTeaFan)

dt2 |> unite("id", id1, id2, sep = "_") |> separate_rows("id") |> right_join(dt1, by = "id")
data.frame [3 x 3]
id y x
ABC 0.307 1
AAA 0.144 2
CBC 0.786 3
copy(dt2)[, id := paste(id1, id2, sep = "_")
        ][, c(V1 = strsplit(id, "_", fixed = TRUE), .SD), by = id
        ][, `:=`(id = V1, V1 = NULL, id1 = NULL, id2 = NULL)
        ][dt1, on = "id"]
data.table [3 x 3]
id y x
ABC 0.307 1
AAA 0.144 2
CBC 0.786 3

Solution 3:

Join on one of the two columns (id2 here), and then fill in (patch) the missing values:

left_join(dt2, dt1, by = c("id2" = "id")) |> 
  rows_patch(rename(dt1, id1 = id), unmatched = "ignore")
data.table [3 x 4]
id1 id2 y x
ABC AB 0.307 1
AA AAA 0.144 2
CB CBC 0.786 3
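
Solution 3 is only shown with dplyr. One possible data.table counterpart is a pair of update joins: the first matches on id2, the second fills the remaining gaps by matching on id1. This is a sketch rather than a tested equivalent of the `rows_patch()` call; `left` and `right` are hypothetical stand-ins for dt1 and dt2.

```r
library(data.table)

left  <- data.table(id = c("ABC", "AAA", "CBC"), x = 1:3)
right <- data.table(
  id1 = c("ABC", "AA", "CB"),
  id2 = c("AB", "AAA", "CBC"),
  y   = c(0.307, 0.144, 0.786)
)

res <- copy(right)[
  left, on = c(id2 = "id"), x := i.x                   # first pass: match on id2
][
  left, on = c(id1 = "id"), x := fcoalesce(x, i.x)     # second pass: patch what is still NA via id1
]
res
```

`fcoalesce()` in the second pass keeps any value already found through id2.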

6.9 Merging rows across multiple columns (every X rows):

Data:

(BANK <- data.table(
    date = c("30 feb", "NA", "NA", "NA", "31 feb", "NA", "NA", "NA"), 
    description = c("Mary", "had a", "little", "lamb", "Twinkle", "twinkle", "little", "star"), 
    withdrawal = c("100", "NA", "NA", "NA", "NA", "NA", "NA", "NA"), 
    deposit = c("NA", "NA", "NA", "NA", "100", "NA", "NA", "NA")
  )[, lapply(.SD, \(c) utils::type.convert(c, as.is = T))]
)
data.table [8 x 4]
date description withdrawal deposit
30 feb Mary 100 NA
NA had a NA NA
NA little NA NA
NA lamb NA NA
31 feb Twinkle NA 100
NA twinkle NA NA
NA little NA NA
NA star NA NA
merge_and_convert <- function(v) {
  utils::type.convert(v, as.is = T) |> na.omit() |> 
    paste(collapse = " ") |> utils::type.convert(as.is = T) |> 
    bind(x, ifelse(is.logical(x), as.integer(x), x))
}

Tidyverse:

Solution 1:

mutate(BANK, ID = ceiling(row_number() / 4)) |> 
  group_by(ID) |> 
  summarize(across(everything(), \(m) merge_and_convert(m)))
data.frame [2 x 5]
ID date description withdrawal deposit
1 30 feb Mary had a little lamb 100 NA
2 31 feb Twinkle twinkle little star NA 100

Solution 2:

summarize(BANK, across(
  everything(), 
  \(c) sapply(split(c, ceiling(seq_along(c)/4)), \(m) merge_and_convert(m))
))
data.frame [2 x 4]
date description withdrawal deposit
30 feb Mary had a little lamb 100 NA
31 feb Twinkle twinkle little star NA 100

data.table:

BANK[, lapply(.SD, \(c) sapply(split(c, ceiling(seq_along(c)/4)), \(m) merge_and_convert(m)))]
data.table [2 x 4]
date description withdrawal deposit
30 feb Mary had a little lamb 100 NA
31 feb Twinkle twinkle little star NA 100
copy(BANK)[, ID := ceiling(.I / 4)][, lapply(.SD, \(m) merge_and_convert(m)), by = ID][]
data.table [2 x 5]
ID date description withdrawal deposit
1 30 feb Mary had a little lamb 100 NA
2 31 feb Twinkle twinkle little star NA 100
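
The helper merge_and_convert() does several things at once. The sketch below unpacks the same idea without pipebind: collapse the non-missing values of each block of four rows into one string, then let `type.convert()` restore the column types at the end. It is a simplified variant, not a drop-in replacement; `collapse_block()`, `block_id` and `value_cols` are illustrative names.

```r
library(data.table)

bank <- data.table(
  date        = c("30 feb", NA, NA, NA, "31 feb", NA, NA, NA),
  description = c("Mary", "had a", "little", "lamb", "Twinkle", "twinkle", "little", "star"),
  withdrawal  = c(100L, NA, NA, NA, NA, NA, NA, NA)
)

# Paste the non-missing values of a block together; an all-missing block stays NA
collapse_block <- function(v) {
  kept <- v[!is.na(v)]
  if (length(kept) == 0L) NA_character_ else paste(kept, collapse = " ")
}

# Rows 1-4 form block 1, rows 5-8 form block 2
block_id <- ceiling(seq_len(nrow(bank)) / 4)
merged <- bank[, lapply(.SD, collapse_block), by = .(block = block_id)]

# Everything is character at this point: convert back to the narrowest fitting type
value_cols <- setdiff(names(merged), "block")
merged[, (value_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = value_cols]
merged
```

Returning a character value from every block keeps the per-group results type-consistent, which data.table requires when binding groups together.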

6.10 Tagging successive events:

Tagging repeated blocks of events (a.k.a. run-length encoding):

(DAT <- data.table(event = c(
  rep("A", 3),
  rep("B", 5),
  rep("C", 2),
  rep("B", 2),
  rep("A", 3)
)))
data.table [15 x 1]
event
A
A
A
B
B
B
B
B
C
C
B
B
A
A
A
DAT |> mutate(ID = with(rle(event), rep(seq_along(lengths), lengths)))
data.table [15 x 2]
event ID
A 1
A 1
A 1
B 2
B 2
B 2
B 2
B 2
C 3
C 3
B 4
B 4
A 5
A 5
A 5
DAT |> mutate(ID = c(0, cumsum(diff(as.integer(factor(event))) != 0)) + 1)
data.table [15 x 2]
event ID
A 1
A 1
A 1
B 2
B 2
B 2
B 2
B 2
C 3
C 3
B 4
B 4
A 5
A 5
A 5

Using data.table’s rleid() function:

DAT |> mutate(ID = data.table::rleid(event))
data.table [15 x 2]
event ID
A 1
A 1
A 1
B 2
B 2
B 2
B 2
B 2
C 3
C 3
B 4
B 4
A 5
A 5
A 5
copy(DAT)[, ID := rleid(event)][]
data.table [15 x 2]
event ID
A 1
A 1
A 1
B 2
B 2
B 2
B 2
B 2
C 3
C 3
B 4
B 4
A 5
A 5
A 5
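
Once runs are tagged, `rleid()` can also be used directly as a grouping expression, for example to get the length of each run. A small sketch on a made-up event vector (`dat` and `runs` are illustrative names):

```r
library(data.table)

dat <- data.table(event = c("A", "A", "A", "B", "B", "C", "A"))

# One row per run: which event it is and how many consecutive rows it spans
runs <- dat[, .(event = event[1], n = .N), by = .(run = rleid(event))]
runs
```

The last "A" gets its own run because it is not adjacent to the first three.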

7 Miscellaneous:


7.1 Keywords:

.SD
.I, .N
.GRP, .NGRP
.BY
.EACHI
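
As a quick reminder of what each of these special symbols does, here is a small sketch on a toy table. The table and the result names are made up for illustration, and `.NGRP` assumes data.table >= 1.13.0.

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), v = 1:3)

# .N: number of rows in the current group
counts <- dt[, .N, by = g]

# .I: row numbers in the full table (here, the row holding each group's max)
top_rows <- dt[, .(row = .I[which.max(v)]), by = g]

# .GRP: running group counter; .NGRP: total number of groups
grp_info <- dt[, .(grp = .GRP, of = .NGRP), by = g]

# .BY: list holding the current group's grouping values
labels <- dt[, .(label = paste0(.BY$g, "!")), by = g]

# .SD: the current group's data, restricted to .SDcols
sums <- dt[, lapply(.SD, sum), by = g, .SDcols = "v"]

# .EACHI: evaluate j once for each row of i in a join
lookup <- data.table(g = c("a", "b"))
totals <- dt[lookup, on = "g", .(total = sum(v)), by = .EACHI]
```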

7.2 Useful functions:

fsetdiff, fintersect, funion and fsetequal (set operations on the rows of data.tables instead of on vectors)

nafill, fcoalesce

as.IDate
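
A minimal sketch of these helpers in action, on made-up inputs (all object names are illustrative):

```r
library(data.table)

x <- data.table(id = c(1L, 2L, 3L))
y <- data.table(id = c(2L, 3L, 4L))

only_x <- fsetdiff(x, y)     # rows of x that are not in y
both   <- fintersect(x, y)   # rows present in both tables
either <- funion(x, y)       # distinct rows from both tables
same   <- fsetequal(x, x)    # TRUE when both tables hold the same rows

# nafill: fill NAs in a numeric vector, here by carrying the last observation forward
carried <- nafill(c(1, NA, NA, 4), type = "locf")

# fcoalesce: first non-missing value, element by element
first_known <- fcoalesce(c(NA, 2L), c(1L, 9L))

# as.IDate: a Date class stored as integer
d <- as.IDate("2020-01-31")
```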