Tidyverse <-> data.table

Equivalence between Tidyverse and data.table expressions

Data Manipulation

Tidyverse

data.table

Author

Marc-Aurèle Rivière

Published

May 19, 2022

Abstract

This document is a collection of notes I took while learning to use data.table, summarizing the equivalences between most dplyr/tidyr verbs and data.table.

This document is no longer updated

Please visit this page for a more up-to-date version of this post.

Version History

V1: 2022-05-19
V2: 2022-05-26
- Improved the section on keys (for ordering & filtering)
- Added a section for translations of Tidyr (and other similar packages)
- Capped tables to display 15 rows max when unfolded
- Improved table display (striping, hiding the contents of nested columns, …)
V3: 2022-07-20
- Updated examples of dynamic programming based on the latest recommendations
- Added new entries in processing examples
- Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
- Added pivot_wider examples to match the dcast ones in the Pivots section
- Added some new examples here and there across the Basic Operations section
- Added an entry for operating inside nested data.frames/data.tables
- Added a processing example for run-length encoding (i.e. successive event tagging)
V4: 2022-08-05
- Improved pivot section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with .value
- Added tidyr::uncount() (row duplication) example
- Improved both light & dark themes (code highlight, tables, …)

1 Setup

renv::install(
  c(
    "here",
    "Rdatatable/data.table",
    "tidyverse/dplyr",
    "tidyr",
    "pipebind",
    "stringr",
    "purrr",
    "lubridate",
    "broom"
  )
)

library(here)        # Project management

library(data.table)  # Data wrangling (>= 1.14.3)
library(dplyr)       # Data wrangling (>= 1.1.0)
library(tidyr)       # Data wrangling (extras) (>= 1.2.0)
library(pipebind)    # Piping goodies (>= 0.1.1)

library(stringr)     # Manipulating strings
library(purrr)       # Manipulating lists
library(lubridate)   # Manipulating dates

library(broom)       # Tidying model outputs

data.table::setDTthreads(parallel::detectCores(logical = TRUE))

Session Info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 20.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/Paris
 date     2022-09-24
 pandoc   2.19.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
 Quarto   1.1.251

─ Packages ───────────────────────────────────────────────────────────────────
 ! package    * version     date (UTC) lib source
 P broom      * 1.0.0       2022-07-01 [?] CRAN (R 4.2.0)
 P data.table * 1.14.3      2022-07-27 [?] Github (Rdatatable/data.table@c4a2085)
   dplyr      * 1.0.99.9000 2022-08-15 [1] Github (tidyverse/dplyr@d8294b4)
 P here       * 1.0.1       2020-12-13 [?] CRAN (R 4.2.0)
 P lubridate  * 1.8.0       2021-10-07 [?] CRAN (R 4.2.0)
   pipebind   * 0.1.1       2022-08-10 [1] CRAN (R 4.2.0)
 P purrr      * 0.3.4       2020-04-17 [?] CRAN (R 4.2.0)
 P stringr    * 1.4.0       2019-02-10 [2] CRAN (R 4.2.0)
 P tidyr      * 1.2.0       2022-02-01 [?] CRAN (R 4.2.0)

 [1] /home/mar/Dev/Projects/R/Misc/renv/library/R-4.2/x86_64-pc-linux-gnu
 [2] /home/mar/.cache/R/renv/library/Misc-f25fd835/R-4.2/x86_64-pc-linux-gnu
 [3] /usr/lib/R/library
 [4] /usr/local/lib/R/site-library
 [5] /usr/lib/R/site-library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

2 Basic Operations:

data.table general syntax:

DT[row selector (filter/sort), col selector (select/mutate/summarize/rename), modifiers (group)]
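
As a rough illustration of this i / j / by layout, here is a minimal sketch. The table name DT and the column name mean_mpg are placeholders chosen for this example, not names used elsewhere in the post.

```r
library(data.table)

DT <- as.data.table(mtcars)

# i:  keep the rows with more than 4 cylinders
# j:  compute a summary column
# by: do so within each value of gear
res <- DT[cyl > 4, .(mean_mpg = mean(mpg)), by = gear]
print(res)
```

Leaving a slot empty simply skips that step: DT[, .(mean_mpg = mean(mpg))] summarises all rows without filtering or grouping.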

Data

MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]

2.1 Arrange / Order:

mtcars |> arrange(desc(cyl))

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21	6	160	110	3.9	2.62	16.46	0	1	4	4
[ omitted 17 entries ]

MT[order(-cyl)]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21	6	160	110	3.9	2.62	16.46	0	1	4	4
[ omitted 17 entries ]

setorder(MT, -cyl)[]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21	6	160	110	3.9	2.62	16.46	0	1	4	4
[ omitted 17 entries ]

MT[order(-cyl, gear)]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
[ omitted 17 entries ]

Ordering on a character column

IRIS[chorder(Species)]

data.table [150 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 135 entries ]

Ordering with keys

- Keys physically reorder the dataset in RAM (by reference)
- No copy of the dataset is made for sorting (beyond temporary working memory and marking which columns are the key)
- The dataset is marked with an attribute “sorted”
- The dataset is always sorted in ascending order, with NAs first
- Using keyby instead of by when grouping sets the grouping columns as the key of the result
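
The points above can be checked on a small example. This is a minimal sketch: DT and agg are placeholder names introduced here, and the example works on a fresh copy of mtcars so that it doesn't alter MT.

```r
library(data.table)

DT <- as.data.table(mtcars)

# setkey() sorts DT by reference and records the key columns
setkey(DT, cyl, gear)
attr(DT, "sorted")      # the key columns are stored in the "sorted" attribute
is.unsorted(DT$cyl)     # FALSE: rows are now in ascending order of cyl

# keyby groups the data and sets the grouping columns as the key of the result
agg <- DT[, .(mean_mpg = mean(mpg)), keyby = .(am, vs)]
key(agg)
```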

Tip

See this SO post for more information on keys.

setkey(MT, cyl, gear)

setkeyv(MT, c("cyl", "gear"))

MT

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
[ omitted 17 entries ]

To see over which keys (if any) the dataset is currently ordered:

haskey(MT)

[1] TRUE

key(MT)

[1] "cyl"  "gear"

Warning

Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.

Ordering with (secondary) indices

setindex creates an index for the provided columns, but doesn’t physically reorder the dataset in RAM.
It computes the ordering vector of the dataset’s rows according to the provided columns and stores it in an additional attribute called index.

setindex(MT, cyl, gear)

setindexv(MT, c("cyl", "gear"))

MT

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

We can see the additional index attribute added to the data.table:

names(attributes(MT))

[1] "names"             "row.names"         "class"            
[4] ".internal.selfref" "index"

We can get the currently used indices with:

indices(MT)

[1] "cyl__gear"

Adding a new index doesn’t remove a previously existing one:

setindex(MT, hp)

indices(MT)

[1] "cyl__gear" "hp"

We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will frequently group or subset by.
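
To make the contrast with keys concrete, here is a minimal sketch. DT and sub are placeholder names introduced here, and the example uses a fresh copy of mtcars.

```r
library(data.table)

DT <- as.data.table(mtcars)
setindex(DT, cyl, gear)

# Unlike setkey(), setindex() leaves the row order untouched
identical(DT$mpg, mtcars$mpg)

# The index is stored under a name built from its columns
indices(DT)

# Subsetting with `on` on the same columns can then rely on the stored index
sub <- DT[.(6, 4), on = c("cyl", "gear")]
nrow(sub)
```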

2.2 Subset / Filter:

mtcars |> filter(cyl >= 6 & disp < 180)

data.frame [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6

MT[cyl >= 6 & disp < 180]

data.table [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6

Filter based on a range:

MT[disp %between% c(200, 300)]

data.table [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
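
For comparison, a dplyr counterpart of this range filter could look like the sketch below. It assumes dplyr::between() is an acceptable equivalent (both it and %between% include the bounds by default). DT, dt_res and df_res are placeholder names introduced here.

```r
library(data.table)
library(dplyr)

DT <- as.data.table(mtcars)

# data.table: inclusive range filter
dt_res <- DT[disp %between% c(200, 300)]

# dplyr: between() is called with its namespace because data.table
# also exports a function named between()
df_res <- mtcars |> filter(dplyr::between(disp, 200, 300))

nrow(dt_res)
nrow(df_res)
```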

Filtering on characters:

For exact (non-regex) matching, use %chin%, a character-optimized version of %in%.

IRIS[Species %chin% c("setosa")]

data.table [50 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 35 entries ]

Filter with pattern:

For regex patterns, use %like%

mtcars |> filter(str_detect(disp, "^\\d{3}\\."))

data.frame [9 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2

MT[like(disp, "^\\d{3}\\.")]

data.table [9 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2

Alternatively:

MT[disp %like% "^\\d{3}\\."]

data.table [9 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2

Filter by keys

When keys or indices are defined, we can filter based on them, which is often a lot faster.

Tip

We do not even need to specify the column names we are filtering on: the values are matched to the key columns in order.

setkey(MT, cyl)

MT[.(6)] # Equivalent to MT[cyl == 6]

data.table [7 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6

setkey(MT, cyl, gear)

MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]

data.table [4 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4

Filter by indices

To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).

IRIS["setosa", on = "Species"]

data.table [50 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 35 entries ]

Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.

Tip

When using on with multiple values, the nomatch = NULL argument drops the requested combinations that do not exist in the original data (here, cyl == 5) instead of returning them as rows filled with NA.

MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]

data.table [12 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
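
The effect of nomatch is easiest to see by running the same lookup with and without it. This is a minimal sketch: DT, with_na and no_na are placeholder names introduced here, and the example uses a fresh copy of mtcars.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Default behaviour: the unmatched combination (cyl == 5, gear == 4)
# is returned as a row whose other columns are NA
with_na <- DT[.(4:6, 4), on = c("cyl", "gear")]

# nomatch = NULL: unmatched combinations are dropped
no_na <- DT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]

nrow(with_na)
nrow(no_na)
with_na[is.na(mpg)]
```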

Filter based on position:

dplyr::first(MT$cyl)

[1] 4

MT[, first(cyl)]

[1] 4

dplyr::last(MT$cyl)

[1] 8

MT[, last(cyl)]

[1] 8

dplyr::nth(MT$cyl, 5)

[1] 4

MT[5, cyl]

[1] 4

Distinct / Unique

mtcars |> distinct(mpg, hp, .keep_all = TRUE)

data.frame [31 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
[ omitted 16 entries ]

unique(MT, by = c("mpg", "hp"))

data.table [31 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
21	6	160	110	3.9	2.62	16.46	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
[ omitted 16 entries ]

N Distinct / Unique N

n_distinct(mtcars$gear)

[1] 3

uniqueN(MT, by = "gear")

[1] 3

Applying a filtering function on multiple columns

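Function flagging the values that have two or more decimal digits. As written, it also flags whole numbers with two or more digits (e.g. 18), since they have no decimal point to strip.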

decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) > 1
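
To see what this helper returns, it can be called on a few hand-picked values. This is a minimal sketch; the input values are chosen here for illustration only.

```r
library(stringr)

decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) > 1

# One decimal digit is rejected, two or more are accepted
one_vs_many <- decp(c(3.9, 3.85, 2.875))
one_vs_many   # FALSE TRUE TRUE

# Whole numbers have no decimal point to strip, so their digits are counted instead
whole <- decp(c(5, 18))
whole         # FALSE TRUE
```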

Manual solution:

mtcars |> filter(decp(drat) & decp(wt) & decp(qsec))

data.frame [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

MT[decp(drat) & decp(wt) & decp(qsec), ]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

Programmatically applying the method to the different columns:

cols <- c("drat", "wt", "qsec")

mtcars |> filter(if_all(all_of(cols), decp))

data.frame [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

MT[Reduce(`&`, lapply(mget(cols), decp)), ]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

MT[Reduce(`&`, lapply(MT[, ..cols], decp)), ]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

With the newer env meta-programming interface:

MT[Reduce(`&`, lapply(v1, decp)), env = list(v1 = as.list(cols))]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

MT[f1(`&`, f2(v1, decp)), env = list(f1 = "Reduce", f2 = "lapply", v1 = as.list(cols))]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

Note

We can’t use .SD in the i clause of a data.table, but we can bypass that constraint by doing the operation in two steps:
- Obtaining a logical vector stating whether each row of the table matches the conditions
- Filtering the original table based on that vector

MT[MT[, Reduce(`&`, lapply(.SD, decp)), .SDcols = cols]]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

MT[MT[, Reduce(`&`, lapply(.SD[, mget(cols)], decp))]]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
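
The two steps can also be written out separately, which makes the intermediate logical vector visible. This is a minimal sketch: DT, keep and res are placeholder names introduced here, and decp and cols are redefined so that the block runs on its own.

```r
library(data.table)
library(stringr)

DT <- as.data.table(mtcars)
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) > 1
cols <- c("drat", "wt", "qsec")

# Step 1: one TRUE/FALSE per row, TRUE when every column in cols passes decp
keep <- DT[, Reduce(`&`, lapply(.SD, decp)), .SDcols = cols]

# Step 2: use that vector in i to filter the original table
res <- DT[keep]

length(keep)
nrow(res)
```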

2.3 Select:

MT |> select(matches("cyl|disp"))

data.table [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

MT[, .(mpg, disp)]

data.table [32 x 2]

mpg	disp
21	160
21	160
22.8	108
21.4	258
18.7	360
18.1	225
14.3	360
24.4	146.7
22.8	140.8
19.2	167.6
17.8	167.6
16.4	275.8
17.3	275.8
15.2	275.8
10.4	472
[ omitted 17 entries ]

MT[ , .SD, .SDcols = c("mpg", "disp")]

data.table [32 x 2]

mpg	disp
21	160
21	160
22.8	108
21.4	258
18.7	360
18.1	225
14.3	360
24.4	146.7
22.8	140.8
19.2	167.6
17.8	167.6
16.4	275.8
17.3	275.8
15.2	275.8
10.4	472
[ omitted 17 entries ]

MT[, .SD, .SDcols = patterns("mpg|disp")]

data.table [32 x 2]

mpg	disp
21	160
21	160
22.8	108
21.4	258
18.7	360
18.1	225
14.3	360
24.4	146.7
22.8	140.8
19.2	167.6
17.8	167.6
16.4	275.8
17.3	275.8
15.2	275.8
10.4	472
[ omitted 17 entries ]

By dynamic name:

cols <- c("cyl", "disp")

mtcars |> select(all_of(cols))

data.frame [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

mtcars |> select(!!cols)

data.frame [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

copy(MT)[, ..cols]

data.table [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

copy(MT)[, mget(cols)]

data.table [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

copy(MT)[, cols, with = FALSE]

data.table [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

copy(MT)[, j, env = list(j = as.list(cols))]

data.table [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

Remove a column

mtcars |> select(-cyl)

data.frame [32 x 10]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	160	110	3.9	2.62	16.46	0	1	4	4
21	160	110	3.9	2.875	17.02	0	1	4	4
22.8	108	93	3.85	2.32	18.61	1	1	4	1
21.4	258	110	3.08	3.215	19.44	1	0	3	1
18.7	360	175	3.15	3.44	17.02	0	0	3	2
18.1	225	105	2.76	3.46	20.22	1	0	3	1
14.3	360	245	3.21	3.57	15.84	0	0	3	4
24.4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	275.8	180	3.07	3.78	18	0	0	3	3
10.4	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, c("cyl") := NULL][]

data.table [32 x 10]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	160	110	3.9	2.62	16.46	0	1	4	4
21	160	110	3.9	2.875	17.02	0	1	4	4
22.8	108	93	3.85	2.32	18.61	1	1	4	1
21.4	258	110	3.08	3.215	19.44	1	0	3	1
18.7	360	175	3.15	3.44	17.02	0	0	3	2
18.1	225	105	2.76	3.46	20.22	1	0	3	1
14.3	360	245	3.21	3.57	15.84	0	0	3	4
24.4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	275.8	180	3.07	3.78	18	0	0	3	3
10.4	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, !"cyl"] # MT[, -"cyl"]

data.table [32 x 10]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	160	110	3.9	2.62	16.46	0	1	4	4
21	160	110	3.9	2.875	17.02	0	1	4	4
22.8	108	93	3.85	2.32	18.61	1	1	4	1
21.4	258	110	3.08	3.215	19.44	1	0	3	1
18.7	360	175	3.15	3.44	17.02	0	0	3	2
18.1	225	105	2.76	3.46	20.22	1	0	3	1
14.3	360	245	3.21	3.57	15.84	0	0	3	4
24.4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	275.8	180	3.07	3.78	18	0	0	3	3
10.4	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

By dynamic name:

col <- "cyl"

copy(MT)[, (col) := NULL][]

data.table [32 x 10]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	160	110	3.9	2.62	16.46	0	1	4	4
21	160	110	3.9	2.875	17.02	0	1	4	4
22.8	108	93	3.85	2.32	18.61	1	1	4	1
21.4	258	110	3.08	3.215	19.44	1	0	3	1
18.7	360	175	3.15	3.44	17.02	0	0	3	2
18.1	225	105	2.76	3.46	20.22	1	0	3	1
14.3	360	245	3.21	3.57	15.84	0	0	3	4
24.4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	275.8	180	3.07	3.78	18	0	0	3	3
10.4	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, j := NULL, env = list(j = col)][]

data.table [32 x 10]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	160	110	3.9	2.62	16.46	0	1	4	4
21	160	110	3.9	2.875	17.02	0	1	4	4
22.8	108	93	3.85	2.32	18.61	1	1	4	1
21.4	258	110	3.08	3.215	19.44	1	0	3	1
18.7	360	175	3.15	3.44	17.02	0	0	3	2
18.1	225	105	2.76	3.46	20.22	1	0	3	1
14.3	360	245	3.21	3.57	15.84	0	0	3	4
24.4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	275.8	180	3.07	3.78	18	0	0	3	3
10.4	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

cols <- c("cyl", "disp")

mtcars |> select(!all_of(cols))

data.frame [32 x 9]

mpg	hp	drat	wt	qsec	vs	am	gear	carb
21	110	3.9	2.62	16.46	0	1	4	4
21	110	3.9	2.875	17.02	0	1	4	4
22.8	93	3.85	2.32	18.61	1	1	4	1
21.4	110	3.08	3.215	19.44	1	0	3	1
18.7	175	3.15	3.44	17.02	0	0	3	2
18.1	105	2.76	3.46	20.22	1	0	3	1
14.3	245	3.21	3.57	15.84	0	0	3	4
24.4	62	3.69	3.19	20	1	0	4	2
22.8	95	3.92	3.15	22.9	1	0	4	2
19.2	123	3.92	3.44	18.3	1	0	4	4
17.8	123	3.92	3.44	18.9	1	0	4	4
16.4	180	3.07	4.07	17.4	0	0	3	3
17.3	180	3.07	3.73	17.6	0	0	3	3
15.2	180	3.07	3.78	18	0	0	3	3
10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, !..cols]

data.table [32 x 9]

mpg	hp	drat	wt	qsec	vs	am	gear	carb
21	110	3.9	2.62	16.46	0	1	4	4
21	110	3.9	2.875	17.02	0	1	4	4
22.8	93	3.85	2.32	18.61	1	1	4	1
21.4	110	3.08	3.215	19.44	1	0	3	1
18.7	175	3.15	3.44	17.02	0	0	3	2
18.1	105	2.76	3.46	20.22	1	0	3	1
14.3	245	3.21	3.57	15.84	0	0	3	4
24.4	62	3.69	3.19	20	1	0	4	2
22.8	95	3.92	3.15	22.9	1	0	4	2
19.2	123	3.92	3.44	18.3	1	0	4	4
17.8	123	3.92	3.44	18.9	1	0	4	4
16.4	180	3.07	4.07	17.4	0	0	3	3
17.3	180	3.07	3.73	17.6	0	0	3	3
15.2	180	3.07	3.78	18	0	0	3	3
10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, !cols, with = FALSE]

data.table [32 x 9]

mpg	hp	drat	wt	qsec	vs	am	gear	carb
21	110	3.9	2.62	16.46	0	1	4	4
21	110	3.9	2.875	17.02	0	1	4	4
22.8	93	3.85	2.32	18.61	1	1	4	1
21.4	110	3.08	3.215	19.44	1	0	3	1
18.7	175	3.15	3.44	17.02	0	0	3	2
18.1	105	2.76	3.46	20.22	1	0	3	1
14.3	245	3.21	3.57	15.84	0	0	3	4
24.4	62	3.69	3.19	20	1	0	4	2
22.8	95	3.92	3.15	22.9	1	0	4	2
19.2	123	3.92	3.44	18.3	1	0	4	4
17.8	123	3.92	3.44	18.9	1	0	4	4
16.4	180	3.07	4.07	17.4	0	0	3	3
17.3	180	3.07	3.73	17.6	0	0	3	3
15.2	180	3.07	3.78	18	0	0	3	3
10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, -j, env = list(j = I(cols))][]

data.table [32 x 9]

mpg	hp	drat	wt	qsec	vs	am	gear	carb
21	110	3.9	2.62	16.46	0	1	4	4
21	110	3.9	2.875	17.02	0	1	4	4
22.8	93	3.85	2.32	18.61	1	1	4	1
21.4	110	3.08	3.215	19.44	1	0	3	1
18.7	175	3.15	3.44	17.02	0	0	3	2
18.1	105	2.76	3.46	20.22	1	0	3	1
14.3	245	3.21	3.57	15.84	0	0	3	4
24.4	62	3.69	3.19	20	1	0	4	2
22.8	95	3.92	3.15	22.9	1	0	4	2
19.2	123	3.92	3.44	18.3	1	0	4	4
17.8	123	3.92	3.44	18.9	1	0	4	4
16.4	180	3.07	4.07	17.4	0	0	3	3
17.3	180	3.07	3.73	17.6	0	0	3	3
15.2	180	3.07	3.78	18	0	0	3	3
10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

By pattern:

mtcars |> select(-matches("^d"))

data.frame [32 x 9]

mpg	cyl	hp	wt	qsec	vs	am	gear	carb
21	6	110	2.62	16.46	0	1	4	4
21	6	110	2.875	17.02	0	1	4	4
22.8	4	93	2.32	18.61	1	1	4	1
21.4	6	110	3.215	19.44	1	0	3	1
18.7	8	175	3.44	17.02	0	0	3	2
18.1	6	105	3.46	20.22	1	0	3	1
14.3	8	245	3.57	15.84	0	0	3	4
24.4	4	62	3.19	20	1	0	4	2
22.8	4	95	3.15	22.9	1	0	4	2
19.2	6	123	3.44	18.3	1	0	4	4
17.8	6	123	3.44	18.9	1	0	4	4
16.4	8	180	4.07	17.4	0	0	3	3
17.3	8	180	3.73	17.6	0	0	3	3
15.2	8	180	3.78	18	0	0	3	3
10.4	8	205	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, .SD, .SDcols = !patterns("^d")]

data.table [32 x 9]

mpg	cyl	hp	wt	qsec	vs	am	gear	carb
21	6	110	2.62	16.46	0	1	4	4
21	6	110	2.875	17.02	0	1	4	4
22.8	4	93	2.32	18.61	1	1	4	1
21.4	6	110	3.215	19.44	1	0	3	1
18.7	8	175	3.44	17.02	0	0	3	2
18.1	6	105	3.46	20.22	1	0	3	1
14.3	8	245	3.57	15.84	0	0	3	4
24.4	4	62	3.19	20	1	0	4	2
22.8	4	95	3.15	22.9	1	0	4	2
19.2	6	123	3.44	18.3	1	0	4	4
17.8	6	123	3.44	18.9	1	0	4	4
16.4	8	180	4.07	17.4	0	0	3	3
17.3	8	180	3.73	17.6	0	0	3	3
15.2	8	180	3.78	18	0	0	3	3
10.4	8	205	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, grep("^d", colnames(MT)) := NULL][]

data.table [32 x 9]

mpg	cyl	hp	wt	qsec	vs	am	gear	carb
21	6	110	2.62	16.46	0	1	4	4
21	6	110	2.875	17.02	0	1	4	4
22.8	4	93	2.32	18.61	1	1	4	1
21.4	6	110	3.215	19.44	1	0	3	1
18.7	8	175	3.44	17.02	0	0	3	2
18.1	6	105	3.46	20.22	1	0	3	1
14.3	8	245	3.57	15.84	0	0	3	4
24.4	4	62	3.19	20	1	0	4	2
22.8	4	95	3.15	22.9	1	0	4	2
19.2	6	123	3.44	18.3	1	0	4	4
17.8	6	123	3.44	18.9	1	0	4	4
16.4	8	180	4.07	17.4	0	0	3	3
17.3	8	180	3.73	17.6	0	0	3	3
15.2	8	180	3.78	18	0	0	3	3
10.4	8	205	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

By type:

IRIS |> select(where(\(c) !is.numeric(c)))

data.table [150 x 1]

Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]

IRIS[, .SD, .SDcols = !is.numeric]

data.table [150 x 1]

Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]

Select + pull

mtcars |> pull(disp)

MT[, disp]

Select + rename

mtcars |> select(dispp = disp)

data.frame [32 x 1]

dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

MT[, .(dispp = disp)]

data.table [32 x 1]

dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

2.4 Rename:

Manually:

mtcars |> rename(CYL = cyl, MPG = mpg)

data.frame [32 x 11]

MPG	CYL	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

setnames(copy(MT), c("cyl", "mpg"), c("CYL", "MPG"))[]

data.table [32 x 11]

MPG	CYL	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Programmatically:

mtcars |> rename_with(\(c) toupper(c), .cols = matches("^d"))

data.frame [32 x 11]

mpg	cyl	DISP	hp	DRAT	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

setnames(copy(MT), grep("^d", names(MT)), \(c) toupper(c))[]

data.table [32 x 11]

mpg	cyl	DISP	hp	DRAT	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

2.5 Mutate:

data.table can mutate in two ways:
- Using = inside .() creates a new DT with the new columns only (like dplyr::transmute)
- Using := modifies the current DT in place (like dplyr::mutate)

The expression computing a column should return as many values as the original column (or group) has rows.
If only one value is provided with :=, it will be recycled to the whole column/group.

If the number of values provided is smaller than the size of the original column/group:
- With :=, an error will be raised, asking you to recycle the values explicitly (e.g. with rep()).
- With =, it will behave like dplyr::summarize (per group, if a grouping has been specified).
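
The rules above can be checked on a toy table. This is only a sketch: `DT` and its columns `g` and `x` are made-up names, not objects used elsewhere in this document, and the expected error is caught with `tryCatch()` so that the block runs to the end.

```r
library(data.table)

# Hypothetical toy data (not part of the original examples)
DT <- data.table(g = c("a", "a", "b", "b"), x = c(1, 2, 3, 4))

# `.()` returns a new data.table and leaves DT untouched
res_new <- DT[, .(x2 = x * 2)]

# A single value given to `:=` is recycled to the whole group
DT[, x_max := max(x), by = g]

# Three values cannot fill four rows: `:=` raises an error
# and DT is left unchanged
msg <- tryCatch(
  DT[, bad := c(1, 2, 3)],
  error = function(e) conditionMessage(e)
)

# With `.()` and a grouping, one value per group behaves like summarize()
res_sum <- DT[, .(x_mean = mean(x)), by = g]
```

Printing `msg` shows the exact wording of the error, which may differ between data.table versions.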

2.5.1 Transmute:

MT[, .(cyl = cyl * 2)]

data.table [32 x 1]

cyl
12
12
8
12
16
12
16
8
8
12
12
16
16
16
16
[ omitted 17 entries ]

2.5.2 In-Place:

2.5.2.1 Single column:

mtcars |> mutate(cyl = 200)

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	4	4
21	200	160	110	3.9	2.875	17.02	0	1	4	4
22.8	200	108	93	3.85	2.32	18.61	1	1	4	1
21.4	200	258	110	3.08	3.215	19.44	1	0	3	1
18.7	200	360	175	3.15	3.44	17.02	0	0	3	2
18.1	200	225	105	2.76	3.46	20.22	1	0	3	1
14.3	200	360	245	3.21	3.57	15.84	0	0	3	4
24.4	200	146.7	62	3.69	3.19	20	1	0	4	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	200	275.8	180	3.07	3.78	18	0	0	3	3
10.4	200	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, cyl := 200][]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	4	4
21	200	160	110	3.9	2.875	17.02	0	1	4	4
22.8	200	108	93	3.85	2.32	18.61	1	1	4	1
21.4	200	258	110	3.08	3.215	19.44	1	0	3	1
18.7	200	360	175	3.15	3.44	17.02	0	0	3	2
18.1	200	225	105	2.76	3.46	20.22	1	0	3	1
14.3	200	360	245	3.21	3.57	15.84	0	0	3	4
24.4	200	146.7	62	3.69	3.19	20	1	0	4	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	200	275.8	180	3.07	3.78	18	0	0	3	3
10.4	200	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Mutate a single column with a function:

mtcars |> mutate(mean_cyl = mean(cyl, na.rm = TRUE))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_cyl
21	6	160	110	3.9	2.62	16.46	0	1	4	4	6.188
21	6	160	110	3.9	2.875	17.02	0	1	4	4	6.188
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	6.188
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	6.188
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	6.188
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6.188
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	6.188
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	6.188
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	6.188
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6.188
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	6.188
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	6.188
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6.188
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	6.188
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	6.188
[ omitted 17 entries ]

copy(MT)[, mean_cyl := mean(cyl, na.rm = TRUE)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_cyl
21	6	160	110	3.9	2.62	16.46	0	1	4	4	6.188
21	6	160	110	3.9	2.875	17.02	0	1	4	4	6.188
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	6.188
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	6.188
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	6.188
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6.188
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	6.188
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	6.188
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	6.188
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6.188
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	6.188
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	6.188
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6.188
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	6.188
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	6.188
[ omitted 17 entries ]

copy(MT)[, `:=`(mean_cyl = mean(cyl, na.rm = TRUE))][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_cyl
21	6	160	110	3.9	2.62	16.46	0	1	4	4	6.188
21	6	160	110	3.9	2.875	17.02	0	1	4	4	6.188
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	6.188
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	6.188
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	6.188
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6.188
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	6.188
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	6.188
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	6.188
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6.188
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	6.188
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	6.188
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6.188
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	6.188
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	6.188
[ omitted 17 entries ]

Dynamic mutate:

Dynamic name on the LHS:

RHS <- "MPG" # Name of the new column (despite its name, it is used on the LHS of the assignment)

mtcars |> mutate({{RHS}} := mean(mpg))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	MPG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

mtcars |> mutate("{RHS}" := mean(mpg))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	MPG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

copy(MT)[, (RHS) := mean(mpg)][] # (RHS) <=> c(RHS)

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	MPG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

copy(MT)[, j := mean(mpg), env = list(j = RHS)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	MPG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

Dynamic name on both LHS & RHS:

data.table requires the use of base::get() on the RHS (the env argument is an alternative):

LHS <- "MPG"
RHS <- "mpg"

mtcars |> mutate("{LHS}" := as.character(.data[[RHS]]))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	MPG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	21
21	6	160	110	3.9	2.875	17.02	0	1	4	4	21
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	22.8
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	21.4
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	18.7
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	18.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	14.3
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	22.8
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	19.2
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	17.8
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	16.4
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	17.3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	15.2
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4
[ omitted 17 entries ]

mtcars |> mutate({{LHS}} := as.character(pick(all_of(RHS))[[1]])) # cur_data() is deprecated since dplyr 1.1.0

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	MPG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	21
21	6	160	110	3.9	2.875	17.02	0	1	4	4	21
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	22.8
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	21.4
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	18.7
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	18.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	14.3
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	22.8
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	19.2
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	17.8
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	16.4
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	17.3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	15.2
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4
[ omitted 17 entries ]

copy(MT)[, c(LHS) := as.character(get(RHS))][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	MPG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	21
21	6	160	110	3.9	2.875	17.02	0	1	4	4	21
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	22.8
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	21.4
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	18.7
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	18.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	14.3
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	22.8
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	19.2
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	17.8
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	16.4
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	17.3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	15.2
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4
[ omitted 17 entries ]

copy(MT)[, x := as.character(y), env = list(x = LHS, y = RHS)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	MPG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	21
21	6	160	110	3.9	2.875	17.02	0	1	4	4	21
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	22.8
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	21.4
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	18.7
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	18.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	14.3
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	22.8
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	19.2
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	17.8
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	16.4
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	17.3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	15.2
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4
[ omitted 17 entries ]
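
Dynamic names are mostly useful inside functions, where the column names arrive as strings. A minimal sketch, assuming nothing beyond the `(LHS) :=` and `get()` idioms shown above: `add_char_copy()` and the `toy` table are hypothetical names introduced only for this illustration.

```r
library(data.table)

# Hypothetical helper: store column `from`, converted to character,
# in a new column named `to`
add_char_copy <- function(dt, from, to) {
  out <- copy(dt)                          # do not modify the caller's table
  out[, (to) := as.character(get(from))]   # dynamic LHS and dynamic RHS
  out[]
}

toy <- data.table(mpg = c(21, 22.8), cyl = c(6, 4))
res <- add_char_copy(toy, from = "mpg", to = "MPG")
```

This sketch would misbehave if the table itself had a column called `from` or `to`, since `get()` would then find the column first; the `env` variant avoids that ambiguity.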

Mutate based on multiple conditions:

if_else:

mtcars |> mutate(Size = if_else(cyl >= 6, "BIG", "small"))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	Size
21	6	160	110	3.9	2.62	16.46	0	1	4	4	BIG
21	6	160	110	3.9	2.875	17.02	0	1	4	4	BIG
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	small
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	BIG
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	BIG
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	BIG
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	BIG
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	small
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	small
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	BIG
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	BIG
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	BIG
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	BIG
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	BIG
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	BIG
[ omitted 17 entries ]

copy(MT)[, Size := fifelse(cyl >= 6, "BIG", "small")][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	Size
21	6	160	110	3.9	2.62	16.46	0	1	4	4	BIG
21	6	160	110	3.9	2.875	17.02	0	1	4	4	BIG
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	small
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	BIG
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	BIG
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	BIG
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	BIG
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	small
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	small
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	BIG
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	BIG
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	BIG
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	BIG
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	BIG
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	BIG
[ omitted 17 entries ]

case_when:

mtcars |> mutate(Size = case_when(
  cyl %between% c(2,4) ~ "small",
  cyl %between% c(4,8) ~ "BIG"
))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	Size
21	6	160	110	3.9	2.62	16.46	0	1	4	4	BIG
21	6	160	110	3.9	2.875	17.02	0	1	4	4	BIG
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	small
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	BIG
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	BIG
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	BIG
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	BIG
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	small
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	small
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	BIG
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	BIG
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	BIG
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	BIG
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	BIG
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	BIG
[ omitted 17 entries ]

copy(MT)[, Size := fcase(
  cyl %between% c(2,4), "small", 
  cyl %between% c(4,8), "BIG"
)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	Size
21	6	160	110	3.9	2.62	16.46	0	1	4	4	BIG
21	6	160	110	3.9	2.875	17.02	0	1	4	4	BIG
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	small
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	BIG
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	BIG
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	BIG
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	BIG
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	small
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	small
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	BIG
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	BIG
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	BIG
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	BIG
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	BIG
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	BIG
[ omitted 17 entries ]
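
Both case_when() and fcase() test their conditions in order and keep the first match, which is why cyl == 4 ends up as "small" above even though it falls in both ranges. Rows matching no condition get NA unless a default is supplied. A small sketch on made-up values (`cyl_toy` is not taken from the original data):

```r
library(data.table)

# Hypothetical values: 4 matches both ranges, 12 matches none
cyl_toy <- c(4, 6, 8, 12)

# Without a default, unmatched values become NA
size_na <- fcase(
  cyl_toy %between% c(2, 4), "small",
  cyl_toy %between% c(4, 8), "BIG"
)

# With a default, unmatched values get the fallback label
size <- fcase(
  cyl_toy %between% c(2, 4), "small",
  cyl_toy %between% c(4, 8), "BIG",
  default = "other"
)
```

On the dplyr side, case_when() has a `.default` argument for the same purpose (dplyr >= 1.1.0).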

Mutate only if condition is met:

All rows are kept, but only the ones meeting the condition provided in i are mutated. If the target column is new, the remaining rows are filled with NA.

Note

This can be extended to mutating multiple columns, of course.

mtcars |> mutate(BIG = case_when(am == 1 ~ cyl >= 6))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	BIG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	TRUE
21	6	160	110	3.9	2.875	17.02	0	1	4	4	TRUE
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	FALSE
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	NA
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	NA
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	NA
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	NA
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	NA
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	NA
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	NA
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	NA
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	NA
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	NA
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	NA
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	NA
[ omitted 17 entries ]

# mtcars |> mutate(BIG = cyl >= 6, .when = am == 1) # Not implemented yet as of dplyr 1.0.9

copy(MT)[am == 1, BIG := cyl >= 6][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	BIG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	TRUE
21	6	160	110	3.9	2.875	17.02	0	1	4	4	TRUE
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	FALSE
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	NA
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	NA
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	NA
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	NA
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	NA
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	NA
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	NA
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	NA
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	NA
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	NA
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	NA
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	NA
[ omitted 17 entries ]
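
The same pattern with several columns, and with a column that already exists, might look like the sketch below. The `toy` table and the `hp_double` column are invented for the illustration.

```r
library(data.table)

# Hypothetical toy data
toy <- data.table(am = c(1, 0, 1), cyl = c(6, 8, 4), hp = c(110, 175, 93))

# New columns: rows not selected in `i` are filled with NA
toy[am == 1, `:=`(BIG = cyl >= 6, hp_double = hp * 2)]

# Existing column: rows not selected in `i` keep their current value
toy[am == 0, cyl := 0]
```

Only the second row is touched by the last call, so cyl stays at 6 and 4 for the other rows.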

Lag / Lead

mtcars |> mutate(gear1 = lead(gear))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	gear1
21	6	160	110	3.9	2.62	16.46	0	1	4	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	3
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	3
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	3
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	3
[ omitted 17 entries ]

copy(MT)[, gear1 := shift(gear, 1, type = "lead")][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	gear1
21	6	160	110	3.9	2.62	16.46	0	1	4	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	3
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	3
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	3
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	3
[ omitted 17 entries ]
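
The example above only shows a lead. For completeness, here is a hedged sketch of a lag with a fill value and of a lead computed within groups, on a made-up table (`toy`, `g` and `x` are illustrative names):

```r
library(data.table)

# Hypothetical toy data
toy <- data.table(g = c("a", "a", "a", "b", "b"), x = 1:5)

# Lag by one row, filling the gap with 0 instead of NA
toy[, x_lag := shift(x, 1, type = "lag", fill = 0L)]

# Lead within each group: the last row of every group gets NA
toy[, x_lead := shift(x, 1, type = "lead"), by = g]
```

The dplyr counterparts would be lag(x, default = 0L) and lead(x) after a group_by(g).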

2.5.2.2 Mutate multiple columns:

mtcars |> mutate(cyl = 200, gear = 5)

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	5	4
21	200	160	110	3.9	2.875	17.02	0	1	5	4
22.8	200	108	93	3.85	2.32	18.61	1	1	5	1
21.4	200	258	110	3.08	3.215	19.44	1	0	5	1
18.7	200	360	175	3.15	3.44	17.02	0	0	5	2
18.1	200	225	105	2.76	3.46	20.22	1	0	5	1
14.3	200	360	245	3.21	3.57	15.84	0	0	5	4
24.4	200	146.7	62	3.69	3.19	20	1	0	5	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	5	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	5	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	5	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	5	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	5	3
15.2	200	275.8	180	3.07	3.78	18	0	0	5	3
10.4	200	472	205	2.93	5.25	17.98	0	0	5	4
[ omitted 17 entries ]

copy(MT)[, `:=`(cyl = 200, gear = 5)][]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	5	4
21	200	160	110	3.9	2.875	17.02	0	1	5	4
22.8	200	108	93	3.85	2.32	18.61	1	1	5	1
21.4	200	258	110	3.08	3.215	19.44	1	0	5	1
18.7	200	360	175	3.15	3.44	17.02	0	0	5	2
18.1	200	225	105	2.76	3.46	20.22	1	0	5	1
14.3	200	360	245	3.21	3.57	15.84	0	0	5	4
24.4	200	146.7	62	3.69	3.19	20	1	0	5	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	5	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	5	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	5	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	5	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	5	3
15.2	200	275.8	180	3.07	3.78	18	0	0	5	3
10.4	200	472	205	2.93	5.25	17.98	0	0	5	4
[ omitted 17 entries ]

copy(MT)[, c("cyl", "gear") := list(200, 5)][]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	5	4
21	200	160	110	3.9	2.875	17.02	0	1	5	4
22.8	200	108	93	3.85	2.32	18.61	1	1	5	1
21.4	200	258	110	3.08	3.215	19.44	1	0	5	1
18.7	200	360	175	3.15	3.44	17.02	0	0	5	2
18.1	200	225	105	2.76	3.46	20.22	1	0	5	1
14.3	200	360	245	3.21	3.57	15.84	0	0	5	4
24.4	200	146.7	62	3.69	3.19	20	1	0	5	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	5	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	5	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	5	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	5	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	5	3
15.2	200	275.8	180	3.07	3.78	18	0	0	5	3
10.4	200	472	205	2.93	5.25	17.98	0	0	5	4
[ omitted 17 entries ]

One function applied to multiple columns (across rows):

mtcars |> mutate(across(c("mpg", "disp"), \(c) min(c), .names = "min_{col}"))

data.frame [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1
[ omitted 17 entries ]

copy(MT)[, c("min_mpg", "min_disp") := lapply(.SD, \(c) min(c)), .SDcols = c("mpg", "disp")][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1
[ omitted 17 entries ]

With dynamic naming:

new <- c("min_mpg", "min_disp")
old <- c("mpg", "disp")

copy(MT)[, c(new) := lapply(mget(old), min)][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1
[ omitted 17 entries ]

copy(MT)[, c(new) := lapply(x, min), env = list(x = as.list(setNames(nm = old)))][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1
[ omitted 17 entries ]
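
To mirror the `.names = "min_{col}"` argument of across(), the new names can also be derived from the old ones instead of being typed by hand. A minimal sketch on a made-up table (`toy` and `cols` are illustrative names):

```r
library(data.table)

# Hypothetical toy data
toy <- data.table(
  mpg  = c(21, 22.8, 18.7),
  disp = c(160, 108, 360),
  cyl  = c(6, 4, 8)
)

cols <- c("mpg", "disp")

# Build the new names from the old ones, then apply min() to each column
toy[, paste0("min_", cols) := lapply(.SD, min), .SDcols = cols]
```

Because each min() returns a single value, it is recycled over all rows, as in the examples above.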

Multiple functions on one column (across rows):

copy(MT)[, c("min_mpg", "max_mpg") := list(min(mpg), max(mpg))][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

copy(MT)[, `:=`(min_mpg = min(mpg), max_mpg = max(mpg))][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

copy(MT)[, c("min_mpg", "max_mpg") := lapply(.SD, \(x) list(min(x), max(x))) |> rbindlist(), .SDcols = "mpg"][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

copy(MT)[, c("min_mpg", "max_mpg") := lapply(.SD[, .(mpg)], \(x) list(min(x), max(x))) |> rbindlist()][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

copy(MT)[, c("min_mpg", "max_mpg") := lapply(.(mpg), \(x) list(min(x), max(x))) |> do.call(rbind, args = _)][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

One function applied across multiple columns (row-wise)

mtcars |> rowwise() |> mutate(RowSum = sum(c_across(where(is.numeric)))) |> ungroup()

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	RowSum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	728.56
[ omitted 17 entries ]

copy(MT)[, RowSum := rowSums(.SD), .SDcols = is.numeric][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	RowSum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	728.56
[ omitted 17 entries ]

More general option using row-wise apply:

copy(MT)[, RowMean := apply(.SD, 1, \(x) mean(x)), .SDcols = is.numeric][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	RowMean
21	6	160	110	3.9	2.62	16.46	0	1	4	4	29.907
21	6	160	110	3.9	2.875	17.02	0	1	4	4	29.981
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	23.598
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	38.74
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	53.665
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	35.049
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	59.72
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.635
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	27.234
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	31.86
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	31.787
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	46.431
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	46.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	46.35
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	66.233
[ omitted 17 entries ]

Multiple functions applied to multiple columns (row-wise)

copy(MT)[, c("row_mean", "row_sum") := apply(.SD, 1, \(x) list(mean(x), sum(x))) |> rbindlist(), .SDcols = is.numeric][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	row_mean	row_sum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	29.907	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	29.981	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	23.598	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	38.74	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	53.665	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	35.049	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	59.72	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.635	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	27.234	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	31.86	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	31.787	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	46.431	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	46.5	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	46.35	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	66.233	728.56
[ omitted 17 entries ]

Evaluate an anonymous code block (wrapped in {}) inside the DT; the value of its last expression is returned:

MT[, {
    print(summary(mpg))
    x <- cyl + gear
    .(RN = 1:.N, CG = x)
  }
]

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90

data.table [32 x 2]

RN	CG
1	10
2	10
3	8
4	9
5	11
6	9
7	11
8	8
9	8
10	10
11	10
12	11
13	11
14	11
15	11
[ omitted 17 entries ]

2.6 Group / Aggregate:

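The examples below apply a grouping but perform no operation on the groups: .SD simply returns all columns as they are. Note that data.table moves the grouping columns to the front and returns the rows group by group.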

One group:

mtcars |> group_by(cyl)

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

MT[, .SD, by = cyl]

data.table [32 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
6	21	160	110	3.9	2.62	16.46	0	1	4	4
6	21	160	110	3.9	2.875	17.02	0	1	4	4
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
6	17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
[ omitted 17 entries ]

Multiple groups:

MT[, .SD, by = .(cyl, gear)]

data.table [32 x 11]

cyl	gear	mpg	disp	hp	drat	wt	qsec	vs	am	carb
6	4	21	160	110	3.9	2.62	16.46	0	1	4
6	4	21	160	110	3.9	2.875	17.02	0	1	4
6	4	19.2	167.6	123	3.92	3.44	18.3	1	0	4
6	4	17.8	167.6	123	3.92	3.44	18.9	1	0	4
4	4	22.8	108	93	3.85	2.32	18.61	1	1	1
4	4	24.4	146.7	62	3.69	3.19	20	1	0	2
4	4	22.8	140.8	95	3.92	3.15	22.9	1	0	2
4	4	32.4	78.7	66	4.08	2.2	19.47	1	1	1
4	4	30.4	75.7	52	4.93	1.615	18.52	1	1	2
4	4	33.9	71.1	65	4.22	1.835	19.9	1	1	1
4	4	27.3	79	66	4.08	1.935	18.9	1	1	1
4	4	21.4	121	109	4.11	2.78	18.6	1	1	2
6	3	21.4	258	110	3.08	3.215	19.44	1	0	1
6	3	18.1	225	105	2.76	3.46	20.22	1	0	1
8	3	18.7	360	175	3.15	3.44	17.02	0	0	2
[ omitted 17 entries ]

Dynamic grouping:

cols <- c("cyl", "disp")

mtcars |> group_by(across(any_of(cols)))

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

MT[, .SD, by = cols]

data.table [32 x 11]

cyl	disp	mpg	hp	drat	wt	qsec	vs	am	gear	carb
6	160	21	110	3.9	2.62	16.46	0	1	4	4
6	160	21	110	3.9	2.875	17.02	0	1	4	4
4	108	22.8	93	3.85	2.32	18.61	1	1	4	1
6	258	21.4	110	3.08	3.215	19.44	1	0	3	1
8	360	18.7	175	3.15	3.44	17.02	0	0	3	2
8	360	14.3	245	3.21	3.57	15.84	0	0	3	4
6	225	18.1	105	2.76	3.46	20.22	1	0	3	1
4	146.7	24.4	62	3.69	3.19	20	1	0	4	2
4	140.8	22.8	95	3.92	3.15	22.9	1	0	4	2
6	167.6	19.2	123	3.92	3.44	18.3	1	0	4	4
6	167.6	17.8	123	3.92	3.44	18.9	1	0	4	4
8	275.8	16.4	180	3.07	4.07	17.4	0	0	3	3
8	275.8	17.3	180	3.07	3.73	17.6	0	0	3	3
8	275.8	15.2	180	3.07	3.78	18	0	0	3	3
8	472	10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

With potentially absent columns:

cols <- c("cyl", "disp", "fake_col")

mtcars |> group_by(across(any_of(cols)))

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

MT[, .SD, by = intersect(cols, colnames(MT))]

data.table [32 x 11]

cyl	disp	mpg	hp	drat	wt	qsec	vs	am	gear	carb
6	160	21	110	3.9	2.62	16.46	0	1	4	4
6	160	21	110	3.9	2.875	17.02	0	1	4	4
4	108	22.8	93	3.85	2.32	18.61	1	1	4	1
6	258	21.4	110	3.08	3.215	19.44	1	0	3	1
8	360	18.7	175	3.15	3.44	17.02	0	0	3	2
8	360	14.3	245	3.21	3.57	15.84	0	0	3	4
6	225	18.1	105	2.76	3.46	20.22	1	0	3	1
4	146.7	24.4	62	3.69	3.19	20	1	0	4	2
4	140.8	22.8	95	3.92	3.15	22.9	1	0	4	2
6	167.6	19.2	123	3.92	3.44	18.3	1	0	4	4
6	167.6	17.8	123	3.92	3.44	18.9	1	0	4	4
8	275.8	16.4	180	3.07	4.07	17.4	0	0	3	3
8	275.8	17.3	180	3.07	3.73	17.6	0	0	3	3
8	275.8	15.2	180	3.07	3.78	18	0	0	3	3
8	472	10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Getting the current group name:

Use the .BY special symbol (a list holding the current value of each grouping variable) to get the current group name:

mtcars |> group_by(cyl) |> 
  group_walk(
    \(d, g) with(d, plot(gear, mpg, main = paste("Cylinders:", g$cyl)))
  )

MT[, with(.SD, plot(gear, mpg, main = paste("Cylinders:", .BY))), by = cyl] -> void
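Since .BY is a list, a single grouping value can be extracted by name. A minimal sketch, assuming MT is mtcars converted with as.data.table() (the column names label and n_rows are made up for illustration):

```r
library(data.table)

# Assumption: MT is the built-in mtcars data converted to a data.table
MT <- as.data.table(mtcars)

# .BY is a named list with one element per grouping variable,
# so the current group's value can be pulled out by name
labels <- MT[, .(label = paste("Cylinders:", .BY$cyl), n_rows = .N), by = cyl]
labels
```

With several grouping variables, .BY holds one element per variable, so .BY$cyl and .BY$gear would both be available under by = .(cyl, gear).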

2.7 Row numbers & indices:

.I: Row indices
.N: Number of rows

.GRP: Group indices
.NGRP: Number of groups
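The four symbols above can be combined in a single grouped call. A small sketch, assuming MT is mtcars as a data.table and a data.table version recent enough to provide .NGRP (the output column names are illustrative only):

```r
library(data.table)

# Assumption: MT is the built-in mtcars data converted to a data.table
MT <- as.data.table(mtcars)

# One row per cyl group, reporting each special symbol
sym <- MT[, .(
  first_row  = .I[1],  # position in MT of the group's first row
  group_size = .N,     # number of rows in the group
  group_id   = .GRP,   # counter of the current group
  n_groups   = .NGRP   # total number of groups
), keyby = cyl]
sym
```

Because keyby sorts the groups, group_id follows the sorted cyl values here; with by, it would follow the order of first appearance instead.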

Getting row indices:

MT[, .I]

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32

Adding row indices:

mtcars |> mutate(I = row_number())

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	4
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	5
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	7
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	8
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	11
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	12
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	13
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	14
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	15
[ omitted 17 entries ]

copy(MT)[ , I := .I][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	4
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	5
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	7
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	8
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	11
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	12
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	13
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	14
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	15
[ omitted 17 entries ]

Getting row indices (after filtering):

Important

.I gives the vector of row numbers after any subsetting/filtering in i has been applied, so it is numbered relative to the filtered rows rather than the original table.

Returns the row numbers in the original dataset:

mtcars |> mutate(I = row_number()) |> filter(gear == 4) |> pull(I)

[1] 1 2 3 8 9 10 11 18 19 20 26 32

MT[, .I[gear == 4]]

[1] 1 2 3 8 9 10 11 18 19 20 26 32

Returns the row numbers in the new dataset (after filtering):

mtcars |> filter(gear == 4) |> mutate(I = row_number()) |> pull(I)

[1] 1 2 3 4 5 6 7 8 9 10 11 12

MT[gear == 4, .I]

[1] 1 2 3 4 5 6 7 8 9 10 11 12
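When the original row positions are needed while still filtering in i, the which argument of data.table's subsetting is another option. A short sketch, assuming MT is mtcars as a data.table:

```r
library(data.table)

# Assumption: MT is the built-in mtcars data converted to a data.table
MT <- as.data.table(mtcars)

# which = TRUE returns the row numbers matched by i instead of the rows themselves
orig_idx <- MT[gear == 4, which = TRUE]

# Same positions, obtained by subsetting .I inside j
orig_idx_j <- MT[, .I[gear == 4]]
```

Both vectors refer to positions in the unfiltered table, so they can be reused to subset MT later on.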

Getting the row numbers of specific observations:

Row number of the first and last observation of each group:

mtcars |> group_by(cyl) |> reframe(I = cur_group_rows()[c(1, n())])

data.frame [6 x 2]

cyl	I
4	3
4	32
6	1
6	30
8	5
8	31

MT[, .I[c(1, .N)], keyby = cyl]

data.table [6 x 2]

cyl	V1
4	3
4	32
6	1
6	30
8	5
8	31

Keeping all other columns:

mtcars |> mutate(I = row_number()) |> group_by(cyl) |> slice(c(1, n())) |> ungroup()

data.frame [6 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2	32
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6	30
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	5
15	8	301	335	3.54	3.57	14.6	0	1	5	8	31

copy(MT)[, I := .I][, .SD[c(1, .N)], keyby = cyl]

data.table [6 x 12]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	I
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1	3
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2	32
6	21	160	110	3.9	2.62	16.46	0	1	4	4	1
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6	30
8	18.7	360	175	3.15	3.44	17.02	0	0	3	2	5
8	15	301	335	3.54	3.57	14.6	0	1	5	8	31

Filtering based on row numbers:

mtcars |> tail(10)

data.frame [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

MT[(.N-9):.N] # Get the last 10 rows

data.table [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

MT[MT[, .I[(.N-9):.N]]]

data.table [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

(Gets the indices of the last 10 rows and filters based on them)
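Index ranges built from .N are prone to off-by-one slips: (.N - n + 1):.N spans exactly n rows, whereas (.N - n):.N spans n + 1. A sketch for an arbitrary n, assuming MT is mtcars as a data.table (n is a hypothetical parameter):

```r
library(data.table)

# Assumption: MT is the built-in mtcars data converted to a data.table
MT <- as.data.table(mtcars)

n <- 10  # hypothetical number of trailing rows to keep

# Explicit index range; assumes n <= nrow(MT)
last_by_index <- MT[(.N - n + 1):.N]

# tail() works on a data.table and avoids the index arithmetic
last_by_tail <- tail(MT, n)
```

The tail() form also copes with n larger than the table, where the explicit range would produce invalid indices.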

Adding group indices:

mtcars |> group_by(cyl) |> summarize(GRP = cur_group_id())

data.frame [3 x 2]

cyl	GRP
4	1
6	2
8	3

MT[, .GRP, by = cyl]

data.table [3 x 2]

cyl	GRP
6	1
4	2
8	3

Mutate instead of summarize:

mtcars |> arrange(cyl) |> group_by(cyl) |> mutate(GRP = cur_group_id())

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	GRP
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	1
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2	1
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1	1
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1	1
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2	1
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2	1
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2	1
21	6	160	110	3.9	2.62	16.46	0	1	4	4	2
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	2
[ omitted 17 entries ]

copy(MT)[, GRP := .GRP, keyby = cyl][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	GRP
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	1
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2	1
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1	1
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1	1
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2	1
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2	1
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2	1
21	6	160	110	3.9	2.62	16.46	0	1	4	4	2
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	2
[ omitted 17 entries ]

Row numbers by group:

mtcars |> arrange(gear) |> group_by(gear) |> mutate(I_GRP = row_number())

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I_GRP
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	5
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	7
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	8
10.4	8	460	215	3	5.424	17.82	0	0	3	4	9
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4	10
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1	11
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2	12
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2	13
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4	14
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2	15
[ omitted 17 entries ]

copy(MT)[, I_GRP := 1:.N, keyby = gear][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I_GRP
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	5
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	7
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	8
10.4	8	460	215	3	5.424	17.82	0	0	3	4	9
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4	10
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1	11
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2	12
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2	13
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4	14
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2	15
[ omitted 17 entries ]

Random sample:

mtcars |> slice_sample(n = 5)

data.frame [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3

MT[sample(.N, 5)]

data.table [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
10.4	8	460	215	3	5.424	17.82	0	0	3	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4

Sample by group:

mtcars |> group_by(cyl) |> slice_sample(n = 5)

data.frame [15 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
15	8	301	335	3.54	3.57	14.6	0	1	5	8
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4

MT[, .SD[sample(.N, 5)], keyby = cyl]

data.table [15 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2
6	21	160	110	3.9	2.62	16.46	0	1	4	4
6	21	160	110	3.9	2.875	17.02	0	1	4	4
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
8	13.3	350	245	3.73	3.84	15.41	0	0	3	4
8	14.7	440	230	3.23	5.345	17.42	0	0	3	4
8	19.2	400	175	3.08	3.845	17.05	0	0	3	2
8	15.8	351	264	4.22	3.17	14.5	0	1	5	4
8	17.3	275.8	180	3.07	3.73	17.6	0	0	3	3

Filter by group size:

mtcars |> group_by(cyl) |> filter(n() >= 8)

data.frame [25 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
[ omitted 10 entries ]

MT[, if(.N >= 8) .SD, by = cyl]

data.table [25 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2
8	18.7	360	175	3.15	3.44	17.02	0	0	3	2
8	14.3	360	245	3.21	3.57	15.84	0	0	3	4
8	16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
8	17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
[ omitted 10 entries ]

2.8 Relocate:

mtcars |> group_by(cyl) |> mutate(GRP = cur_group_id(), .before = 1)

data.frame [32 x 12]

GRP	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
2	21	6	160	110	3.9	2.62	16.46	0	1	4	4
2	21	6	160	110	3.9	2.875	17.02	0	1	4	4
1	22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
2	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
3	18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
2	18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
3	14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
1	24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
1	22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
2	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
2	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
3	16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
3	17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
3	15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
3	10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

(copy(MT)[ , GRP := .GRP, by = cyl] |> setcolorder("GRP"))[]

data.table [32 x 12]

GRP	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
1	21	6	160	110	3.9	2.62	16.46	0	1	4	4
1	21	6	160	110	3.9	2.875	17.02	0	1	4	4
2	22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
1	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
3	18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
1	18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
3	14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
2	24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
2	22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
1	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
1	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
3	16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
3	17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
3	15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
3	10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Ordering by column names

mtcars |> select(sort(tidyselect::peek_vars()))

data.frame [32 x 11]

am	carb	cyl	disp	drat	gear	hp	mpg	qsec	vs	wt
1	4	6	160	3.9	4	110	21	16.46	0	2.62
1	4	6	160	3.9	4	110	21	17.02	0	2.875
1	1	4	108	3.85	4	93	22.8	18.61	1	2.32
0	1	6	258	3.08	3	110	21.4	19.44	1	3.215
0	2	8	360	3.15	3	175	18.7	17.02	0	3.44
0	1	6	225	2.76	3	105	18.1	20.22	1	3.46
0	4	8	360	3.21	3	245	14.3	15.84	0	3.57
0	2	4	146.7	3.69	4	62	24.4	20	1	3.19
0	2	4	140.8	3.92	4	95	22.8	22.9	1	3.15
0	4	6	167.6	3.92	4	123	19.2	18.3	1	3.44
0	4	6	167.6	3.92	4	123	17.8	18.9	1	3.44
0	3	8	275.8	3.07	3	180	16.4	17.4	0	4.07
0	3	8	275.8	3.07	3	180	17.3	17.6	0	3.73
0	3	8	275.8	3.07	3	180	15.2	18	0	3.78
0	4	8	472	2.93	3	205	10.4	17.98	0	5.25
[ omitted 17 entries ]

setcolorder(copy(MT), sort(names(MT)))[]

data.table [32 x 11]

am	carb	cyl	disp	drat	gear	hp	mpg	qsec	vs	wt
1	4	6	160	3.9	4	110	21	16.46	0	2.62
1	4	6	160	3.9	4	110	21	17.02	0	2.875
1	1	4	108	3.85	4	93	22.8	18.61	1	2.32
0	1	6	258	3.08	3	110	21.4	19.44	1	3.215
0	2	8	360	3.15	3	175	18.7	17.02	0	3.44
0	1	6	225	2.76	3	105	18.1	20.22	1	3.46
0	4	8	360	3.21	3	245	14.3	15.84	0	3.57
0	2	4	146.7	3.69	4	62	24.4	20	1	3.19
0	2	4	140.8	3.92	4	95	22.8	22.9	1	3.15
0	4	6	167.6	3.92	4	123	19.2	18.3	1	3.44
0	4	6	167.6	3.92	4	123	17.8	18.9	1	3.44
0	3	8	275.8	3.07	3	180	16.4	17.4	0	4.07
0	3	8	275.8	3.07	3	180	17.3	17.6	0	3.73
0	3	8	275.8	3.07	3	180	15.2	18	0	3.78
0	4	8	472	2.93	3	205	10.4	17.98	0	5.25
[ omitted 17 entries ]

mtcars |> select(carb, sort(tidyselect::peek_vars()))

data.frame [32 x 11]

carb	am	cyl	disp	drat	gear	hp	mpg	qsec	vs	wt
4	1	6	160	3.9	4	110	21	16.46	0	2.62
4	1	6	160	3.9	4	110	21	17.02	0	2.875
1	1	4	108	3.85	4	93	22.8	18.61	1	2.32
1	0	6	258	3.08	3	110	21.4	19.44	1	3.215
2	0	8	360	3.15	3	175	18.7	17.02	0	3.44
1	0	6	225	2.76	3	105	18.1	20.22	1	3.46
4	0	8	360	3.21	3	245	14.3	15.84	0	3.57
2	0	4	146.7	3.69	4	62	24.4	20	1	3.19
2	0	4	140.8	3.92	4	95	22.8	22.9	1	3.15
4	0	6	167.6	3.92	4	123	19.2	18.3	1	3.44
4	0	6	167.6	3.92	4	123	17.8	18.9	1	3.44
3	0	8	275.8	3.07	3	180	16.4	17.4	0	4.07
3	0	8	275.8	3.07	3	180	17.3	17.6	0	3.73
3	0	8	275.8	3.07	3	180	15.2	18	0	3.78
4	0	8	472	2.93	3	205	10.4	17.98	0	5.25
[ omitted 17 entries ]

setcolorder(copy(MT), c("carb", sort(setdiff(names(MT), "carb"))))[]

data.table [32 x 11]

carb	am	cyl	disp	drat	gear	hp	mpg	qsec	vs	wt
4	1	6	160	3.9	4	110	21	16.46	0	2.62
4	1	6	160	3.9	4	110	21	17.02	0	2.875
1	1	4	108	3.85	4	93	22.8	18.61	1	2.32
1	0	6	258	3.08	3	110	21.4	19.44	1	3.215
2	0	8	360	3.15	3	175	18.7	17.02	0	3.44
1	0	6	225	2.76	3	105	18.1	20.22	1	3.46
4	0	8	360	3.21	3	245	14.3	15.84	0	3.57
2	0	4	146.7	3.69	4	62	24.4	20	1	3.19
2	0	4	140.8	3.92	4	95	22.8	22.9	1	3.15
4	0	6	167.6	3.92	4	123	19.2	18.3	1	3.44
4	0	6	167.6	3.92	4	123	17.8	18.9	1	3.44
3	0	8	275.8	3.07	3	180	16.4	17.4	0	4.07
3	0	8	275.8	3.07	3	180	17.3	17.6	0	3.73
3	0	8	275.8	3.07	3	180	15.2	18	0	3.78
4	0	8	472	2.93	3	205	10.4	17.98	0	5.25
[ omitted 17 entries ]

2.9 Summarize:

Summarizing uses the = operator (inside .()) instead of :=.
Its only difference with mutate is that it takes a function that returns a list of values shorter than the original column (or group) size.
By default, it only keeps the computed columns (like transmute), along with any grouping columns.
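The contrast between the two operators can be seen on the same computation. A minimal sketch, assuming MT is mtcars as a data.table (mean_mpg is an illustrative column name):

```r
library(data.table)

# Assumption: MT is the built-in mtcars data converted to a data.table
MT <- as.data.table(mtcars)

# := adds a column by reference: one value per row, all columns kept
mutated <- copy(MT)[, mean_mpg := mean(mpg), by = cyl]

# .() with = builds a new table: one row per group, only the listed columns kept
summarized <- MT[, .(mean_mpg = mean(mpg)), by = cyl]

dim(mutated)     # 32 rows, 12 columns
dim(summarized)  # 3 rows, 2 columns
```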

mtcars |> summarize(mean_cyl = mean(cyl, na.rm = T))

data.frame [1 x 1]

mean_cyl
6.188

MT[, .(mean_cyl = mean(cyl, na.rm = T))]

data.table [1 x 1]

mean_cyl
6.188

Group > summarize

mtcars |> group_by(cyl) |> summarize(N = n())

data.frame [3 x 2]

cyl	N
4	11
6	7
8	14

MT[, .N, by = cyl]

data.table [3 x 2]

cyl	N
6	7
4	11
8	14

dplyr automatically arranges the result by the grouping variable.
To mimic this with data.table:

MT[, .N, keyby = cyl]

data.table [3 x 2]

cyl	N
4	11
6	7
8	14

MT[order(cyl), .N, by = cyl]

data.table [3 x 2]

cyl	N
4	11
6	7
8	14

MT[, .N, by = cyl][order(cyl)]

data.table [3 x 2]

cyl	N
4	11
6	7
8	14
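The three variants above print the same table, but they are not fully interchangeable: keyby also sets a key on the result, while ordering does not. A quick sketch to check this, assuming MT is mtcars as a data.table:

```r
library(data.table)

# Assumption: MT is the built-in mtcars data converted to a data.table
MT <- as.data.table(mtcars)

with_keyby <- MT[, .N, keyby = cyl]           # sorted and keyed on cyl
with_order <- MT[, .N, by = cyl][order(cyl)]  # sorted, but no key is set

key(with_keyby)  # "cyl"
key(with_order)  # NULL
```

If later steps rely on keyed joins or keyed subsetting, keyby saves a separate setkey() call.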

Grouping on a condition:

mtcars |> group_by(cyl > 6) |> summarize(N = n())

data.frame [2 x 2]

cyl > 6	N
FALSE	18
TRUE	14

MT[, .N, by = .(cyl > 6)]

data.table [2 x 2]

cyl	N
FALSE	18
TRUE	14

Filter > summarize

mtcars |> filter(cyl >= 6 & disp >= 200) |> summarize(N = n())

data.frame [1 x 1]

N
16

MT[cyl >= 6 & disp >= 200, .(.N)]

data.table [1 x 1]

N
16

mtcars |> summarize(N = sum(cyl >= 6 & disp >= 200, na.rm = T))

data.frame [1 x 1]

N
16

MT[, .(N = sum(cyl >= 6 & disp >= 200, na.rm = T))]

data.table [1 x 1]

N
16

Obtaining one summary statistic on multiple columns

mtcars |> group_by(cyl) |> summarize(across(everything(), \(c) mean(c)))

data.frame [3 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	26.664	105.136	82.636	4.071	2.286	19.137	0.909	0.727	4.091	1.545
6	19.743	183.314	122.286	3.586	3.117	17.977	0.571	0.429	3.857	3.429
8	15.1	353.1	209.214	3.229	3.999	16.772	0	0.143	3.286	3.5

MT[, lapply(.SD, \(c) mean(c)), keyby = cyl]

data.table [3 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	26.664	105.136	82.636	4.071	2.286	19.137	0.909	0.727	4.091	1.545
6	19.743	183.314	122.286	3.586	3.117	17.977	0.571	0.429	3.857	3.429
8	15.1	353.1	209.214	3.229	3.999	16.772	0	0.143	3.286	3.5

Apply summary function based on column type:

mtcars |> group_by(cyl) |> summarize(across(where(is.double), \(col) mean(col)))

data.frame [3 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	26.664	105.136	82.636	4.071	2.286	19.137	0.909	0.727	4.091	1.545
6	19.743	183.314	122.286	3.586	3.117	17.977	0.571	0.429	3.857	3.429
8	15.1	353.1	209.214	3.229	3.999	16.772	0	0.143	3.286	3.5

MT[, lapply(.SD, \(col) mean(col)), keyby = cyl, .SDcols = is.double][, cyl := NULL][]

data.table [3 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26.664	4	105.136	82.636	4.071	2.286	19.137	0.909	0.727	4.091	1.545
19.743	6	183.314	122.286	3.586	3.117	17.977	0.571	0.429	3.857	3.429
15.1	8	353.1	209.214	3.229	3.999	16.772	0	0.143	3.286	3.5
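In the data.table call above, cyl itself satisfies is.double, so it appears to end up both as the grouping column and inside .SD, which is what the trailing [, cyl := NULL] step cleans up. A possible alternative is to leave the grouping column out of .SDcols beforehand. This is only a sketch, assuming MT is mtcars as a data.table (group_col and dbl_cols are hypothetical helper names):

```r
library(data.table)

# Assumption: MT is the built-in mtcars data converted to a data.table
MT <- as.data.table(mtcars)

# Names of the double columns, minus the grouping column
group_col <- "cyl"
dbl_cols  <- setdiff(names(MT)[sapply(MT, is.double)], group_col)

# cyl now appears only once, as the (sorted) grouping column
means <- MT[, lapply(.SD, mean), keyby = group_col, .SDcols = dbl_cols]
means
```

This keeps cyl as the first column, which matches the column order of the dplyr result.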

Apply summary function to specific columns:

mtcars |> group_by(cyl) |> summarize(across(c(mpg, disp), \(.x) mean(.x)))

data.frame [3 x 3]

cyl	mpg	disp
4	26.664	105.136
6	19.743	183.314
8	15.1	353.1

MT[, lapply(.SD, \(.x) mean(.x)), keyby = cyl, .SDcols = c("mpg", "disp")]

data.table [3 x 3]

cyl	mpg	disp
4	26.664	105.136
6	19.743	183.314
8	15.1	353.1

MT[, lapply(.SD[, .(mpg, disp)], \(.x) mean(.x)), keyby = cyl]

data.table [3 x 3]

cyl	mpg	disp
4	26.664	105.136
6	19.743	183.314
8	15.1	353.1

Apply summary function to specific columns (by pattern):

mtcars |> group_by(cyl) |> summarize(across(matches("^mpg|^disp"), \(.x) mean(.x)))

data.frame [3 x 3]

cyl	mpg	disp
4	26.664	105.136
6	19.743	183.314
8	15.1	353.1

MT[, lapply(.SD, mean), keyby = cyl, .SDcols = patterns("^mpg|^disp")]

data.table [3 x 3]

cyl	mpg	disp
4	26.664	105.136
6	19.743	183.314
8	15.1	353.1

Obtaining multiple summary statistics for one column:

mtcars |> group_by(cyl) |> summarize(mean_mpg = mean(mpg), sd_mpg = sd(mpg))

data.frame [3 x 3]

cyl	mean_mpg	sd_mpg
4	26.664	4.51
6	19.743	1.454
8	15.1	2.56

MT[, .(mean_mpg = mean(mpg), sd_mpg = sd(mpg)), keyby = cyl]

data.table [3 x 3]

cyl	mean_mpg	sd_mpg
4	26.664	4.51
6	19.743	1.454
8	15.1	2.56

MT[, lapply(.SD, \(x) list(mean_mpg = mean(x), sd_mpg = sd(x))) |> rbindlist(), keyby = cyl, .SDcols = "mpg"]

data.table [3 x 3]

cyl	mean_mpg	sd_mpg
4	26.664	4.51
6	19.743	1.454
8	15.1	2.56

MT[, lapply(.(mpg), \(x) list(mean_mpg = mean(x), sd_mpg = sd(x))) |> rbindlist(), keyby = cyl]

data.table [3 x 3]

cyl	mean_mpg	sd_mpg
4	26.664	4.51
6	19.743	1.454
8	15.1	2.56

Obtaining multiple summary statistics on multiple columns (as rows):

MT[, lapply(.SD, \(v) c(mean(v), sd(v)))]

data.table [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
20.091	6.188	230.722	146.688	3.597	3.217	17.849	0.438	0.406	3.688	2.812
6.027	1.786	123.939	68.563	0.535	0.978	1.787	0.504	0.499	0.738	1.615

list_of_funs <- list(mean = \(x) mean(x, na.rm = TRUE), sd = \(x) sd(x, na.rm = TRUE))

MT[, lapply(list_of_funs, \(fun) lapply(.SD, fun)) |> rbindlist()]

data.table [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
20.091	6.188	230.722	146.688	3.597	3.217	17.849	0.438	0.406	3.688	2.812
6.027	1.786	123.939	68.563	0.535	0.978	1.787	0.504	0.499	0.738	1.615

Obtaining multiple summary statistics on multiple columns (as columns):

cols <- c("mpg", "cyl")

funs_as_list <- \(x) list(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))

Warning

dplyr & data.table don’t use the same “format” when pre-defining a list of functions to be applied:
- dplyr needs a list of individual functions
- data.table needs a function returning a list
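One way to bridge the two formats is to wrap a named list of functions into a single function that returns a named list. The sketch below is only an illustration: `as_list_fun` is a hypothetical helper (it is not part of dplyr or data.table), and `MT` is assumed to be `mtcars` converted to a data.table.

```r
library(data.table)

MT <- as.data.table(mtcars)

# dplyr-style format: a named list of individual functions
list_of_funs <- list(mean = \(x) mean(x, na.rm = TRUE), sd = \(x) sd(x, na.rm = TRUE))

# Hypothetical helper: turns a named list of functions into
# one function returning a named list (the data.table-style format)
as_list_fun <- function(funs) {
  function(x) lapply(funs, \(f) f(x))
}

funs_as_list <- as_list_fun(list_of_funs)

# Each column of .SD yields list(mean = ..., sd = ...);
# unlist(recursive = FALSE) flattens this into <col>.<fun> columns
res <- MT[, lapply(.SD, funs_as_list) |> unlist(recursive = FALSE),
          keyby = gear, .SDcols = c("mpg", "cyl")]
```

With this wrapper, the same `list_of_funs` can be maintained once and reused on both sides.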

mtcars |> group_by(gear) |> summarize(across(all_of(cols), .fns = list_of_funs, .names = "{.col}.{.fn}"))

data.frame [3 x 5]

gear	mpg.mean	mpg.sd	cyl.mean	cyl.sd
3	16.107	3.372	7.467	1.187
4	24.533	5.277	4.667	0.985
5	21.38	6.659	6	2

MT[, lapply(.SD, funs_as_list) |> unlist(recursive = FALSE), keyby = gear, .SDcols = cols]

data.table [3 x 5]

gear	mpg.mean	mpg.sd	cyl.mean	cyl.sd
3	16.107	3.372	7.467	1.187
4	24.533	5.277	4.667	0.985
5	21.38	6.659	6	2

MT[, lapply(.SD, funs_as_list) |> do.call(c, args = _), keyby = gear, .SDcols = cols]

data.table [3 x 5]

gear	mpg.mean	mpg.sd	cyl.mean	cyl.sd
3	16.107	3.372	7.467	1.187
4	24.533	5.277	4.667	0.985
5	21.38	6.659	6	2

Different column order & naming scheme:

Tip

Here we can use the list_of_funs with data.table since we apply them individually.

MT[, lapply(list_of_funs, \(f) lapply(.SD, f)) |> do.call(c, args = _), keyby = gear, .SDcols = cols]

data.table [3 x 5]

gear	mean.mpg	mean.cyl	sd.mpg	sd.cyl
3	16.107	7.467	3.372	1.187
4	24.533	4.667	5.277	0.985
5	21.38	6	6.659	2

Using dcast (see next section):

dcast(MT, gear ~ ., fun.aggregate = list(mean, sd), value.var = cols)

data.table [3 x 5]

gear	mpg_mean	cyl_mean	mpg_sd	cyl_sd
3	16.107	7.467	3.372	1.187
4	24.533	4.667	5.277	0.985
5	21.38	6	6.659	2

3 Pivots:

3.1 Melt / Longer:

Data:

FAM1

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

FAM2

data.table [5 x 8]

family_id	age_mother	dob_child1	dob_child2	dob_child3	gender_child1	gender_child2	gender_child3
1	30	1998-11-26	2000-01-29	NA	1	2	NA
2	27	1996-06-22	NA	NA	2	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02	2	2	1
4	32	2004-10-10	2009-08-27	2012-07-21	1	1	1
5	29	2000-12-05	2005-02-28	NA	2	1	NA

One group of columns –> single value column

FAM1 |> pivot_longer(cols = matches("dob_"), names_to = "variable")

data.frame [15 x 4]

family_id	age_mother	variable	value
1	30	dob_child1	1998-11-26
1	30	dob_child2	2000-01-29
1	30	dob_child3	NA
2	27	dob_child1	1996-06-22
2	27	dob_child2	NA
2	27	dob_child3	NA
3	26	dob_child1	2002-07-11
3	26	dob_child2	2004-04-05
3	26	dob_child3	2007-09-02
4	32	dob_child1	2004-10-10
4	32	dob_child2	2009-08-27
4	32	dob_child3	2012-07-21
5	29	dob_child1	2000-12-05
5	29	dob_child2	2005-02-28
5	29	dob_child3	NA

FAM1 |> melt(measure.vars = c("dob_child1", "dob_child2", "dob_child3"))

data.table [15 x 4]

family_id	age_mother	variable	value
1	30	dob_child1	1998-11-26
2	27	dob_child1	1996-06-22
3	26	dob_child1	2002-07-11
4	32	dob_child1	2004-10-10
5	29	dob_child1	2000-12-05
1	30	dob_child2	2000-01-29
2	27	dob_child2	NA
3	26	dob_child2	2004-04-05
4	32	dob_child2	2009-08-27
5	29	dob_child2	2005-02-28
1	30	dob_child3	NA
2	27	dob_child3	NA
3	26	dob_child3	2007-09-02
4	32	dob_child3	2012-07-21
5	29	dob_child3	NA

FAM1 |> melt(measure.vars = patterns("^dob_"))

data.table [15 x 4]

family_id	age_mother	variable	value
1	30	dob_child1	1998-11-26
2	27	dob_child1	1996-06-22
3	26	dob_child1	2002-07-11
4	32	dob_child1	2004-10-10
5	29	dob_child1	2000-12-05
1	30	dob_child2	2000-01-29
2	27	dob_child2	NA
3	26	dob_child2	2004-04-05
4	32	dob_child2	2009-08-27
5	29	dob_child2	2005-02-28
1	30	dob_child3	NA
2	27	dob_child3	NA
3	26	dob_child3	2007-09-02
4	32	dob_child3	2012-07-21
5	29	dob_child3	NA

One group of columns –> multiple value columns

FAM1 |> melt(measure.vars = patterns(child1 = "child1$", child2 = "child2$|child3$"))

data.table [10 x 5]

family_id	age_mother	variable	child1	child2
1	30	1	1998-11-26	2000-01-29
2	27	1	1996-06-22	NA
3	26	1	2002-07-11	2004-04-05
4	32	1	2004-10-10	2009-08-27
5	29	1	2000-12-05	2005-02-28
1	30	2	NA	NA
2	27	2	NA	NA
3	26	2	NA	2007-09-02
4	32	2	NA	2012-07-21
5	29	2	NA	NA

3.1.1 Merging multiple yes/no columns:

Melting multiple presence/absence columns into a single variable:

movies_wide

data.frame [3 x 4]

ID	action	adventure	animation
1	1	0	0
2	1	1	0
3	1	1	1

pivot_longer(
    movies_wide, -ID, names_to = "Genre", 
    values_transform = \(x) ifelse(x == 0, NA, x), values_drop_na = TRUE
  ) |> select(-value)

data.frame [6 x 2]

ID	Genre
1	action
2	action
2	adventure
3	action
3	adventure
3	animation

melt(MOVIES_WIDE, id.vars = "ID", variable.name = "Genre")[value != 0][order(ID), -"value"]

data.table [6 x 2]

ID	Genre
1	action
2	action
2	adventure
3	action
3	adventure
3	animation

3.1.2 Partial pivot:

Multiple groups of columns –> Multiple value columns

Manually:

colA <- str_subset(colnames(FAM2), "^dob")
colB <- str_subset(colnames(FAM2), "^gender")

FAM2 |> melt(measure.vars = list(colA, colB), value.name = c("dob", "gender"), variable.name = "child")

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

FAM2 |> melt(measure.vars = list(a, b), value.name = c("dob", "gender"), variable.name = "child") |> 
  substitute2(env = list(a = I(str_subset(colnames(FAM2), "^dob")), b = I(str_subset(colnames(FAM2), "^gender")))) |> eval()

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

Using .value:

Tip

The .value special identifier lets you do a “half” pivot: the name parts that would otherwise be listed as rows under .value are instead used as the names of the value columns.

FAM2 |> pivot_longer(cols = matches("^dob|^gender"), names_to = c(".value", "child"), names_sep = "_child")

data.frame [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
1	30	2	2000-01-29	2
1	30	3	NA	NA
2	27	1	1996-06-22	2
2	27	2	NA	NA
2	27	3	NA	NA
3	26	1	2002-07-11	2
3	26	2	2004-04-05	2
3	26	3	2007-09-02	1
4	32	1	2004-10-10	1
4	32	2	2009-08-27	1
4	32	3	2012-07-21	1
5	29	1	2000-12-05	2
5	29	2	2005-02-28	1
5	29	3	NA	NA

FAM2 |> melt(measure.vars = patterns("^dob", "^gender"), value.name = c("dob", "gender"), variable.name = "child")

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

Using measure and value.name:

Warning

data.table only

FAM2 |> melt(measure.vars = measure(value.name, child = \(x) as.integer(x), sep = "_child"))

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

FAM2 |> melt(measure.vars = measurev(list(value.name = NULL, child = as.integer), pattern = "(.*)_child(\\d{1})"))

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

3.2 Dcast / Wider:

General idea:
- Pivot around the combination of id.vars (LHS of the formula): one output row per combination
- The measure.vars (RHS of the formula) are the columns whose values become the new column names
- The value.var columns supply the values that fill the new columns
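The three roles can be sketched on a tiny table. The data below is hypothetical (the names `long`, `id`, `key` and `score` are made up for illustration) and only serves to label each part of the call.

```r
library(data.table)

# Hypothetical toy data: one row per (id, key) pair
long <- data.table(
  id    = c(1, 1, 2, 2),
  key   = c("a", "b", "a", "b"),
  score = c(10, 20, 30, 40)
)

# LHS (id):          one output row per distinct id
# RHS (key):         each distinct value of key becomes a column
# value.var (score): supplies the values filling those columns
wide <- dcast(long, id ~ key, value.var = "score")
```

Here `wide` has the columns `id`, `a` and `b`, with one row per `id`.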

Data:

(FAM1L <- FAM1 |> melt(measure.vars = c("dob_child1", "dob_child2", "dob_child3")))

data.table [15 x 4]

family_id	age_mother	variable	value
1	30	dob_child1	1998-11-26
2	27	dob_child1	1996-06-22
3	26	dob_child1	2002-07-11
4	32	dob_child1	2004-10-10
5	29	dob_child1	2000-12-05
1	30	dob_child2	2000-01-29
2	27	dob_child2	NA
3	26	dob_child2	2004-04-05
4	32	dob_child2	2009-08-27
5	29	dob_child2	2005-02-28
1	30	dob_child3	NA
2	27	dob_child3	NA
3	26	dob_child3	2007-09-02
4	32	dob_child3	2012-07-21
5	29	dob_child3	NA

(FAM2L <- FAM2 |> melt(measure.vars = measure(value.name, child = \(.x) as.integer(.x), sep = "_child")))

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

Basic pivot wider:

FAM1L |> pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

FAM1L |> dcast(family_id + age_mother ~ variable)

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Using all the columns as IDs:

Note

By default, id_cols is every column not used in names_from or values_from

FAM1L |> pivot_wider(names_from = variable)

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Note

In a dcast formula, ... stands for “every column not otherwise used in the formula or in value.var”

FAM1L |> dcast(... ~ variable)

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Multiple value columns –> Multiple groups of columns:

FAM2L |> pivot_wider(
  id_cols = c("family_id", "age_mother"), values_from = c("dob", "gender"), 
  names_from = "child", names_sep = "_child"
)

data.frame [5 x 8]

family_id	age_mother	dob_child1	dob_child2	dob_child3	gender_child1	gender_child2	gender_child3
1	30	1998-11-26	2000-01-29	NA	1	2	NA
2	27	1996-06-22	NA	NA	2	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02	2	2	1
4	32	2004-10-10	2009-08-27	2012-07-21	1	1	1
5	29	2000-12-05	2005-02-28	NA	2	1	NA

FAM2L |> dcast(family_id + age_mother ~ child, value.var = c("dob", "gender"), sep = "_child")

data.table [5 x 8]

family_id	age_mother	dob_child1	dob_child2	dob_child3	gender_child1	gender_child2	gender_child3
1	30	1998-11-26	2000-01-29	NA	1	2	NA
2	27	1996-06-22	NA	NA	2	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02	2	2	1
4	32	2004-10-10	2009-08-27	2012-07-21	1	1	1
5	29	2000-12-05	2005-02-28	NA	2	1	NA

FAM2L |> dcast(... ~ child, value.var = c("dob", "gender"), sep = "_child")

data.table [5 x 8]

family_id	age_mother	dob_child1	dob_child2	dob_child3	gender_child1	gender_child2	gender_child3
1	30	1998-11-26	2000-01-29	NA	1	2	NA
2	27	1996-06-22	NA	NA	2	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02	2	2	1
4	32	2004-10-10	2009-08-27	2012-07-21	1	1	1
5	29	2000-12-05	2005-02-28	NA	2	1	NA

Dynamic names in the formula:

var_name <- "variable"

FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = {{ var_name }})

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

FAM1L |> dcast(family_id + age_mother ~ base::get(var_name))

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

FAM1L |> dcast(family_id + age_mother ~ v1) |> substitute2(env = list(v1 = var_name)) |> eval()

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Multiple variables:

id_vars <- c("family_id", "age_mother")

FAM1L |> pivot_wider(id_cols = all_of(id_vars), names_from = variable)

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

FAM1L |> dcast(str_c(str_c(id_vars, collapse = " + "), " ~ variable"))

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

FAM1L |> dcast(v1 + v2 ~ variable) |> substitute2(env = list(v1 = id_vars[1], v2 = id_vars[2])) |> eval()

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

3.2.1 Renaming (prefix/suffix) the columns:

FAM1L |> pivot_wider(names_from = variable, values_from = value, names_prefix = "Attr: ")

data.frame [5 x 5]

family_id	age_mother	Attr: dob_child1	Attr: dob_child2	Attr: dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

FAM1L |> pivot_wider(names_from = variable, values_from = value, names_glue = "Attr: {variable}")

data.frame [5 x 5]

family_id	age_mother	Attr: dob_child1	Attr: dob_child2	Attr: dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

FAM1L |> dcast(family_id + age_mother ~ paste0("Attr: ", variable))

data.table [5 x 5]

family_id	age_mother	Attr: dob_child1	Attr: dob_child2	Attr: dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

3.2.2 Unused combinations:

Warning

The logic is inverted between tidyr (which columns to expand/keep) and data.table (which ones to drop)

FAM1L |> pivot_wider(names_from = variable, values_from = value, id_expand = TRUE, names_expand = FALSE) # (keep_id, keep_names)

data.frame [25 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	26	NA	NA	NA
1	27	NA	NA	NA
1	29	NA	NA	NA
1	30	1998-11-26	2000-01-29	NA
1	32	NA	NA	NA
2	26	NA	NA	NA
2	27	1996-06-22	NA	NA
2	29	NA	NA	NA
2	30	NA	NA	NA
2	32	NA	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
3	27	NA	NA	NA
3	29	NA	NA	NA
3	30	NA	NA	NA
3	32	NA	NA	NA
[ omitted 10 entries ]

FAM1L |> dcast(family_id + age_mother ~ variable, drop = c(FALSE, TRUE)) # (drop_LHS, drop_RHS)

data.table [25 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	26	NA	NA	NA
1	27	NA	NA	NA
1	29	NA	NA	NA
1	30	1998-11-26	2000-01-29	NA
1	32	NA	NA	NA
2	26	NA	NA	NA
2	27	1996-06-22	NA	NA
2	29	NA	NA	NA
2	30	NA	NA	NA
2	32	NA	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
3	27	NA	NA	NA
3	29	NA	NA	NA
3	30	NA	NA	NA
3	32	NA	NA	NA
[ omitted 10 entries ]

3.2.3 Subsetting:

Note

AFAIK, pivot_wider can’t do this on its own.

FAM1L |> filter(value >= lubridate::ymd(20030101)) |> 
  pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")

data.frame [3 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
4	32	2004-10-10	2009-08-27	2012-07-21
3	26	NA	2004-04-05	2007-09-02
5	29	NA	2005-02-28	NA

FAM1L |> dcast(family_id + age_mother ~ variable, subset = .(value >= lubridate::ymd(20030101)))

data.table [3 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
3	26	NA	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	NA	2005-02-28	NA

3.2.4 Aggregating:

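Using . as the RHS of the formula (i.e. not specifying which column provides the new column names) yields a single column named ".". Since each group then holds several rows and no fun.aggregate is given, dcast falls back to length and simply counts the rows of each group (here, 3 per family).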

FAM1L |> dcast(family_id + age_mother ~ .)

data.table [5 x 3]

family_id	age_mother	.
1	30	3
2	27	3
3	26	3
4	32	3
5	29	3

We can customize that default behavior using the fun.aggregate argument:

Here, we count the number of children for each combination of (family_id + age_mother) -> sum all non-NA values

FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(.x) sum(!is.na(.x))) |>
  rowwise() |> mutate(child_count = sum(c_across(matches("_child")))) |> ungroup()

data.frame [5 x 6]

family_id	age_mother	dob_child1	dob_child2	dob_child3	child_count
1	30	1	1	0	2
2	27	1	0	0	1
3	26	1	1	1	3
4	32	1	1	1	3
5	29	1	1	0	2

FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(.x) sum(!is.na(.x))) |>
  mutate(child_count = apply(pick(matches("_child")), 1, \(r) sum(r)))

data.frame [5 x 6]

family_id	age_mother	dob_child1	dob_child2	dob_child3	child_count
1	30	1	1	0	2
2	27	1	0	0	1
3	26	1	1	1	3
4	32	1	1	1	3
5	29	1	1	0	2

(FAM1L |> dcast(family_id + age_mother ~ ., fun.agg = \(.x) sum(!is.na(.x))) |> setnames(".", "child_count"))

data.table [5 x 3]

family_id	age_mother	child_count
1	30	2
2	27	1
3	26	3
4	32	3
5	29	2

Applying multiple fun.agg:

Data:

(DTL <- data.table(
  id1 = sample(5, 20, TRUE), 
  id2 = sample(2, 20, TRUE), 
  group = sample(letters[1:2], 20, TRUE), 
  v1 = runif(20), 
  v2 = 1L)
)

data.table [20 x 5]

id1	id2	group	v1	v2
5	1	a	0.706	1
2	2	a	0.152	1
1	2	b	0.487	1
1	2	b	0.416	1
4	2	b	0.681	1
2	2	a	0.384	1
3	1	a	0.263	1
2	2	a	0.536	1
4	2	a	0.447	1
3	1	b	0.446	1
2	2	a	0.855	1
5	1	b	0.832	1
1	2	a	0.854	1
1	1	a	0.987	1
5	1	a	0.59	1
[ omitted 5 entries ]

Multiple fun.agg applied to one variable:

DTL |> dcast(id1 + id2 ~ group, fun.aggregate = list(sum, mean), value.var = "v1")

data.table [9 x 6]

id1	id2	v1_sum_a	v1_sum_b	v1_mean_a	v1_mean_b
1	1	0.987	0	0.987	NaN
1	2	0.854	0.903	0.854	0.452
2	1	0	0.25	NaN	0.25
2	2	2.221	0.048	0.444	0.048
3	1	0.263	0.446	0.263	0.446
3	2	0.083	0	0.083	NaN
4	1	0	0.518	NaN	0.518
4	2	0.447	0.681	0.447	0.681
5	1	1.296	0.832	0.648	0.832

Multiple fun.agg to multiple value.var (all combinations):

DTL |> dcast(id1 + id2 ~ group, fun.aggregate = list(sum, mean), value.var = c("v1", "v2"))

data.table [9 x 10]

id1	id2	v1_sum_a	v1_sum_b	v2_sum_a	v2_sum_b	v1_mean_a	v1_mean_b	v2_mean_a	v2_mean_b
1	1	0.987	0	1	0	0.987	NaN	1	NaN
1	2	0.854	0.903	1	2	0.854	0.452	1	1
2	1	0	0.25	0	1	NaN	0.25	NaN	1
2	2	2.221	0.048	5	1	0.444	0.048	1	1
3	1	0.263	0.446	1	1	0.263	0.446	1	1
3	2	0.083	0	1	0	0.083	NaN	1	NaN
4	1	0	0.518	0	1	NaN	0.518	NaN	1
4	2	0.447	0.681	1	1	0.447	0.681	1	1
5	1	1.296	0.832	2	1	0.648	0.832	1	1

Multiple fun.agg and multiple value.var (one-to-one):

Here, we apply sum to v1 (for both group a & b), and mean to v2 (for both group a & b)

DTL |> dcast(id1 + id2 ~ group, fun.aggregate = list(sum, mean), value.var = list("v1", "v2"))

data.table [9 x 6]

id1	id2	v1_sum_a	v1_sum_b	v2_mean_a	v2_mean_b
1	1	0.987	0	1	NaN
1	2	0.854	0.903	1	1
2	1	0	0.25	NaN	1
2	2	2.221	0.048	1	1
3	1	0.263	0.446	1	1
3	2	0.083	0	1	NaN
4	1	0	0.518	NaN	1
4	2	0.447	0.681	1	1
5	1	1.296	0.832	1	1

3.2.5 One-hot encoding:

Making each level of a variable into a presence/absence column:

movies_long

data.frame [6 x 2]

ID	Genre
1	action
2	action
2	adventure
3	action
3	adventure
3	animation

pivot_wider(
  movies_long, names_from = "Genre", values_from = "Genre", 
  values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
)

data.frame [3 x 4]

ID	action	adventure	animation
1	1	0	0
2	1	1	0
3	1	1	1

dcast(
  MOVIES_LONG, ID ~ Genre, value.var = "Genre", 
  fun.agg = \(x) ifelse(is.na(x), 0, 1), fill = 0
)

data.table [3 x 4]

ID	action	adventure	animation
1	1	0	0
2	1	1	0
3	1	1	1

4 Joins:

Tip

In data.table, a JOIN is just another type of SUBSET: we subset the rows of one data.table with the rows of a second one, based on some conditions that define the type of JOIN.

Matching two tables based on their rows can be done:
- Either on equivalences (equi-joins)
- Or functions comparing one row to another (non-equi joins)
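The difference between the two kinds of matching can be sketched on two small tables. Everything below is hypothetical toy data (`X`, `Y` and their columns are made up for illustration), not the tables used in the rest of this section.

```r
library(data.table)

# Hypothetical toy tables
X <- data.table(id = c("a", "b", "c"), x = c(1, 5, 9))
Y <- data.table(id = c("a", "b", "d"), y = c(3, 3, 3))

# Equi-join: a row of X matches a row of Y when their id is equal.
# Every row of Y is kept; unmatched ones get NA in X's columns.
equi <- X[Y, on = .(id)]

# Non-equi join: rows match when id is equal AND x <= y.
# nomatch = NULL drops the rows of Y without any match.
non_equi <- X[Y, on = .(id, x <= y), nomatch = NULL]
```

Only `id == "a"` survives the non-equi join, since it is the only row where `x <= y` holds. One thing to keep in mind: in the non-equi result, the column named `x` reports the value coming from `Y$y`, not the original `X$x`.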

Data:

(DT1 <- data.table( 
  ID = LETTERS[1:10],
  A = sample(1:5, 10, replace = TRUE),
  B = sample(10:20, 10)
))

data.table [10 x 3]

ID	A	B
A	1	16
B	5	12
C	3	11
D	1	19
E	4	13
F	2	18
G	2	20
H	4	17
I	5	15
J	2	10

(DT2 <- data.table(
  ID = LETTERS[5:14],
  C = sample(1:5, 10, replace = TRUE),
  D = sample(10:20, 10) 
))

data.table [10 x 3]

ID	C	D
E	5	16
F	2	13
G	3	11
H	1	19
I	3	12
J	1	14
K	2	17
L	2	20
M	3	10
N	4	18

Basic (right) join example:

right_join(
  DT1 |> select(ID, A),
  DT2 |> select(ID, C), 
  by = "ID"
) |> as_tibble()

data.frame [10 x 3]

ID	A	C
E	4	5
F	2	2
G	2	3
H	4	1
I	5	3
J	2	1
K	NA	2
L	NA	2
M	NA	3
N	NA	4

DT1[DT2, .(ID, A, C), on = .(ID)]

data.table [10 x 3]

ID	A	C
E	4	5
F	2	2
G	2	3
H	4	1
I	5	3
J	2	1
K	NA	2
L	NA	2
M	NA	3
N	NA	4

4.1 Outer (right, left):

Keeps every row of one table and appends the matching columns of the other (rows without a match get NA).

Note

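data.table has no dedicated left-join syntax: X[Y] keeps every row of Y (a right join), so a left join is written by swapping the tables or with an update join.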
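A left-join result can still be obtained in a few ways. The sketch below uses hypothetical toy tables `A` and `B` (not the `DT1`/`DT2` of this section) and shows three options that should give the same rows, differing mainly in column order and in whether a key is set.

```r
library(data.table)

# Hypothetical toy tables
A <- data.table(ID = c("a", "b", "c"), v = 1:3)
B <- data.table(ID = c("b", "c", "d"), w = 4:6)

# Option 1: swap the tables. B[A] keeps every row of A,
# but B's columns come right after the join column.
left1 <- B[A, on = .(ID)]

# Option 2: merge() with all.x = TRUE keeps every row of A,
# with A's columns first (the result is sorted and keyed by ID).
left2 <- merge(A, B, by = "ID", all.x = TRUE)

# Option 3: update join on a copy of A, adding w by reference.
left3 <- copy(A)[B, w := i.w, on = .(ID)]
```

Option 3 avoids materialising a new joined table, which can matter on large data, but it only adds the columns that are listed explicitly.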

Subsetting DT1 by DT2:

Note

DT2 (everything) + DT1 (all columns, but only the rows that match those in DT2).
> Looking up DT1’s rows using DT2 (or DT2’s key, if it has one) as an index.

As a right join:

right_join(DT1, DT2, by = "ID") # DT1 into DT2

data.table [10 x 5]

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

DT1[DT2, on = .(ID)]

data.table [10 x 5]

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

As a left join:

Note

Not exactly equivalent to the right join: same columns, but DT2’s columns come first instead of DT1’s

left_join(DT2, DT1, by = "ID") # DT1 into DT2

data.table [10 x 5]

ID	C	D	A	B
E	5	16	4	13
F	2	13	2	18
G	3	11	2	20
H	1	19	4	17
I	3	12	5	15
J	1	14	2	10
K	2	17	NA	NA
L	2	20	NA	NA
M	3	10	NA	NA
N	4	18	NA	NA

copy(DT2)[DT1, c("A", "B") := list(i.A, i.B), on = .(ID)][]

data.table [10 x 5]

ID	C	D	A	B
E	5	16	4	13
F	2	13	2	18
G	3	11	2	20
H	1	19	4	17
I	3	12	5	15
J	1	14	2	10
K	2	17	NA	NA
L	2	20	NA	NA
M	3	10	NA	NA
N	4	18	NA	NA

Subsetting DT2 by DT1:

Note

DT1 (everything) + DT2 (all columns, but only the rows that match those in DT1).
> Looking up DT2’s rows using DT1 (or DT1’s key, if it has one) as an index.

As a right join:

right_join(DT2, DT1, by = "ID") # DT2 into DT1

data.table [10 x 5]

ID	C	D	A	B
E	5	16	4	13
F	2	13	2	18
G	3	11	2	20
H	1	19	4	17
I	3	12	5	15
J	1	14	2	10
A	NA	NA	1	16
B	NA	NA	5	12
C	NA	NA	3	11
D	NA	NA	1	19

DT2[DT1, on = .(ID)]

data.table [10 x 5]

ID	C	D	A	B
A	NA	NA	1	16
B	NA	NA	5	12
C	NA	NA	3	11
D	NA	NA	1	19
E	5	16	4	13
F	2	13	2	18
G	3	11	2	20
H	1	19	4	17
I	3	12	5	15
J	1	14	2	10

As a left join:

Note

Not exactly equivalent to the right join: same columns, but DT1’s columns come first instead of DT2’s

left_join(DT1, DT2, by = "ID") # DT2 into DT1

data.table [10 x 5]

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14

copy(DT1)[DT2, c("C", "D") := list(i.C, i.D), on = .(ID)][]

data.table [10 x 5]

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14

4.2 Full (outer):

full_join(DT1, DT2, by = "ID")

data.table [14 x 5]

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

data.table::merge.data.table(DT1, DT2, by = "ID", all = TRUE)

data.table [14 x 5]

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

Alternatively:

setkey(DT1, ID)
setkey(DT2, ID)

# Getting the union of the unique keys of both DT
unique_keys <- union(DT1[, ID], DT2[, ID])

DT1[DT2[unique_keys, on = "ID"]]

data.table [14 x 5]

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

4.3 Inner:

Only returns the ROWS matching both tables:
- Inner: rows matching both DT1 and DT2, columns of both (add DT2’s columns to the right)
- Semi: rows matching both DT1 and DT2, columns of first one

Inner:

inner_join(DT1, DT2, by = "ID")

data.table [6 x 5]

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14

DT1[DT2, on = .(ID), nomatch = NULL]

data.table [6 x 5]

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14

Semi:

semi_join(DT1, DT2, by = "ID")

data.table [6 x 3]

ID	A	B
E	4	13
F	2	18
G	2	20
H	4	17
I	5	15
J	2	10

DT1[na.omit(DT1[DT2, on = .(ID), which = TRUE])]

data.table [6 x 3]

ID	A	B
E	4	13
F	2	18
G	2	20
H	4	17
I	5	15
J	2	10

Note

which = TRUE returns the row numbers instead of the rows themselves.
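A caveat worth sketching: if the second table contains duplicated keys, the row numbers returned by `which = TRUE` are duplicated too, whereas a semi join keeps each matching row only once. The example below uses hypothetical toy tables `A` and `B` and de-duplicates the indices before subsetting.

```r
library(data.table)

# Hypothetical toy tables; note that "c" appears twice in B
A <- data.table(ID = c("a", "b", "c"), v = 1:3)
B <- data.table(ID = c("b", "c", "c", "d"), w = 4:7)

# For each row of B: the matching row number in A (NA when there is none)
idx <- A[B, on = .(ID), which = TRUE]

# Drop the NAs and the repeats, and restore A's original row order
semi <- A[sort(unique(na.omit(idx)))]
```

An alternative that sidesteps the issue on a single key column is `A[ID %in% B$ID]`.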

4.4 Anti:

ROWS of DT1 that are NOT in DT2, and only the columns of DT1.

anti_join(DT1, DT2, by = "ID")

data.table [4 x 3]

ID	A	B
A	1	16
B	5	12
C	3	11
D	1	19

DT1[!DT2, on = .(ID)]

data.table [4 x 3]

ID	A	B
A	1	16
B	5	12
C	3	11
D	1	19

ROWS of DT2 that are NOT in DT1, and only the columns of DT2.

anti_join(DT2, DT1, by = "ID")

data.table [4 x 3]

ID	C	D
K	2	17
L	2	20
M	3	10
N	4	18

DT2[!DT1, on = .(ID)]

data.table [4 x 3]

ID	C	D
K	2	17
L	2	20
M	3	10
N	4	18

4.5 Non-equi joins:

DT1[DT2, on = .(ID, A <= C)]

data.table [10 x 4]

ID	A	B	D
E	5	13	16
F	2	18	13
G	3	20	11
H	1	NA	19
I	3	NA	12
J	1	NA	14
K	2	NA	17
L	2	NA	20
M	3	NA	10
N	4	NA	18

4.6 Rolling joins:

DT1[DT2, on = "ID", roll = TRUE]

data.table [10 x 5]

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	2	10	2	17
L	2	10	2	20
M	2	10	3	10
N	2	10	4	18

Reversing the rolling direction:

DT1[DT2, on = "ID", roll = -Inf]

data.table [10 x 5]

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

DT1[DT2, on = "ID", rollends = TRUE]

data.table [10 x 5]

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18
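Rolling joins are usually easier to read on a numeric (e.g. time) key than on letters. The sketch below uses hypothetical toy tables `X` and `Y` (made up for illustration) to show what each rolling direction does when there is no exact match.

```r
library(data.table)

# Hypothetical toy data: readings taken at times 1, 5 and 10
X <- data.table(t = c(1, 5, 10), x = c("a", "b", "c"))

# Times to look up (none of them matches a reading exactly)
Y <- data.table(t = c(3, 7, 12))

# roll = TRUE: the last observation before each lookup time is carried forward
fwd <- X[Y, on = .(t), roll = TRUE]

# roll = -Inf: the next observation after each lookup time is carried backward
bwd <- X[Y, on = .(t), roll = -Inf]
```

With `roll = TRUE`, the lookups at 3, 7 and 12 pick up `"a"`, `"b"` and `"c"`. With `roll = -Inf`, they pick up `"b"`, `"c"` and `NA`, since nothing follows time 10.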

5 Tidyr & Others:

5.1 Remove NA:

tidyr::drop_na(IRIS, matches("Sepal"))

data.table [150 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 135 entries ]

na.omit(IRIS, cols = str_subset(colnames(IRIS), "Sepal"))

data.table [150 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 135 entries ]

5.2 Unite:

Combine multiple columns into a single one:

mtcars |> tidyr::unite("x", gear, carb, sep = "_")

data.frame [32 x 10]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	x
21	6	160	110	3.9	2.62	16.46	0	1	4_4
21	6	160	110	3.9	2.875	17.02	0	1	4_4
22.8	4	108	93	3.85	2.32	18.61	1	1	4_1
21.4	6	258	110	3.08	3.215	19.44	1	0	3_1
18.7	8	360	175	3.15	3.44	17.02	0	0	3_2
18.1	6	225	105	2.76	3.46	20.22	1	0	3_1
14.3	8	360	245	3.21	3.57	15.84	0	0	3_4
24.4	4	146.7	62	3.69	3.19	20	1	0	4_2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4_2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4_4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4_4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3_3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3_3
15.2	8	275.8	180	3.07	3.78	18	0	0	3_3
10.4	8	472	205	2.93	5.25	17.98	0	0	3_4
[ omitted 17 entries ]

copy(MT)[, x := paste(gear, carb, sep = "_")][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	x
21	6	160	110	3.9	2.62	16.46	0	1	4	4	4_4
21	6	160	110	3.9	2.875	17.02	0	1	4	4	4_4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	4_1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	3_1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	3_2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3_1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	3_4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4_2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	4_2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	4_4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	4_4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	3_3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	3_3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	3_3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	3_4
[ omitted 17 entries ]

5.3 Extract / Separate:

Separate a column into multiple columns based on a pattern (extract) or a separator (separate):

MT.ext <- MT[, .(x = str_c(gear, carb, sep = "_"))]

MT.ext |> tidyr::extract(col = x, into = c("a", "b"), regex = "(.*)_(.*)", remove = FALSE)

data.table [32 x 3]

x	a	b
4_4	4	4
4_4	4	4
4_1	4	1
3_1	3	1
3_2	3	2
3_1	3	1
3_4	3	4
4_2	4	2
4_2	4	2
4_4	4	4
4_4	4	4
3_3	3	3
3_3	3	3
3_3	3	3
3_4	3	4
[ omitted 17 entries ]

MT.ext[, c("a", "b") := tstrsplit(x, "_", fixed = TRUE)][]

data.table [32 x 3]

x	a	b
4_4	4	4
4_4	4	4
4_1	4	1
3_1	3	1
3_2	3	2
3_1	3	1
3_4	3	4
4_2	4	2
4_2	4	2
4_4	4	4
4_4	4	4
3_3	3	3
3_3	3	3
3_3	3	3
3_4	3	4
[ omitted 17 entries ]
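The `tstrsplit()` version above splits on a separator. For a pattern-based split closer to `tidyr::extract()`, one possible approach (a sketch, not necessarily the idiomatic data.table way) is base R's `utils::strcapture()`, which returns one column per capture group and converts each to the type given in a prototype. The table `DT` and the prototype `proto` below are hypothetical.

```r
library(data.table)

# Hypothetical toy data
DT <- data.table(x = c("4_4", "3_1", "5_2"))

# Prototype giving the name and type of each capture group
proto <- data.frame(a = integer(), b = integer())

# One new column per capture group, converted to the prototype's types
DT[, c("a", "b") := utils::strcapture("^([0-9]+)_([0-9]+)$", x, proto)]
```

Strings that don't match the pattern would yield `NA` in the new columns.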

5.4 Separate rows:

Separate a row into multiple rows based on a separator:

Data

(SP <- data.table(
  val = c(1,"2,3",4), 
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"), origin = "1970-01-01")
  )
)

data.table [3 x 2]

val	date
1	2020-01-01
2,3	2020-01-02
4	2020-01-03

SP |> tidyr::separate_rows(val, sep = ",", convert = TRUE)

data.frame [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Solution 1:

copy(SP)[, c(V1 = strsplit(val, ",", fixed = TRUE), .SD), by = val][, `:=`(val = V1, V1 = NULL)][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Solution 2:

SP[, strsplit(val, ",", fixed = TRUE), by = val][SP, on = "val"][, `:=`(val = V1, V1 = NULL)][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Solution 3:

(With type conversion)

SP[, unlist(tstrsplit(val, ",", type.convert = TRUE)), by = val][SP, on = "val"][, `:=`(val = V1, V1 = NULL)][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Solution 4:

copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))][, val := strsplit(val, ","), by = val][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

(With type conversion)

copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))
       ][, val := strsplit(val, ","), by = val
       ][, val := utils::type.convert(val, as.is = T)][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03
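The row-replication idea behind Solution 4 can be wrapped in a small function. This is only a sketch: separate_rows_dt is a hypothetical helper (it is not part of data.table), and it assumes a single column and a literal separator.

```r
library(data.table)

# Hypothetical helper: split `col` on `sep` and return one value per row
separate_rows_dt <- function(dt, col, sep = ",") {
  pieces <- strsplit(as.character(dt[[col]]), sep, fixed = TRUE)
  # Repeat each row once per piece (subsetting returns a new data.table)
  out <- dt[rep(seq_len(nrow(dt)), lengths(pieces))]
  # Replace the column with the flattened pieces, converted to their natural type
  out[, (col) := utils::type.convert(unlist(pieces), as.is = TRUE)]
  out[]
}

SP <- data.table(
  val = c("1", "2,3", "4"),
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"))
)

res <- separate_rows_dt(SP, "val")
res
```

Because the rows are replicated through a subset, SP itself is left untouched.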

5.5 Duplicates:

5.5.1 Duplicated rows:

Finding duplicated rows:

mtcars |> group_by(mpg, hp) |> filter(n() > 1)

data.frame [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4

MT[, if(.N > 1) .SD, by = .(mpg, hp)]

data.table [2 x 11]

mpg	hp	cyl	disp	drat	wt	qsec	vs	am	gear	carb
21	110	6	160	3.9	2.62	16.46	0	1	4	4
21	110	6	160	3.9	2.875	17.02	0	1	4	4

Only keeping non-duplicated rows:

Note

This is different from distinct/unique, which will keep one of the duplicated rows of each group.

This removes all groups which have duplicated rows.

Solution 1:

mtcars |> group_by(mpg, hp) |> filter(n() == 1)

data.frame [30 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
[ omitted 15 entries ]

MT[, if(.N == 1) .SD, by = .(mpg, hp)]

data.table [30 x 11]

mpg	hp	cyl	disp	drat	wt	qsec	vs	am	gear	carb
22.8	93	4	108	3.85	2.32	18.61	1	1	4	1
21.4	110	6	258	3.08	3.215	19.44	1	0	3	1
18.7	175	8	360	3.15	3.44	17.02	0	0	3	2
18.1	105	6	225	2.76	3.46	20.22	1	0	3	1
14.3	245	8	360	3.21	3.57	15.84	0	0	3	4
24.4	62	4	146.7	3.69	3.19	20	1	0	4	2
22.8	95	4	140.8	3.92	3.15	22.9	1	0	4	2
19.2	123	6	167.6	3.92	3.44	18.3	1	0	4	4
17.8	123	6	167.6	3.92	3.44	18.9	1	0	4	4
16.4	180	8	275.8	3.07	4.07	17.4	0	0	3	3
17.3	180	8	275.8	3.07	3.73	17.6	0	0	3	3
15.2	180	8	275.8	3.07	3.78	18	0	0	3	3
10.4	205	8	472	2.93	5.25	17.98	0	0	3	4
10.4	215	8	460	3	5.424	17.82	0	0	3	4
14.7	230	8	440	3.23	5.345	17.42	0	0	3	4
[ omitted 15 entries ]

Solution 2:

More convoluted

mtcars |> group_by(mpg, hp) |> filter(n() > 1) |> anti_join(mtcars, y = _)

data.frame [30 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
[ omitted 15 entries ]

MT[!MT[, if(.N > 1) .SD, by = .(mpg, hp)], on = names(MT)]

data.table [30 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
[ omitted 15 entries ]

fsetdiff(MT, setcolorder(MT[, if(.N > 1) .SD, by = .(mpg, hp)], names(MT)))

data.table [30 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
[ omitted 15 entries ]

5.5.2 Duplicated values (per row):

(DUPED <- data.table(
    A = c("A1", "A2", "B3", "A4"), 
    B = c("B1", "B2", "B3", "B4"), 
    C = c("A1", "C2", "D3", "C4"), 
    D = c("A1", "D2", "D3", "D4")
  )
)

data.table [4 x 4]

A	B	C	D
A1	B1	A1	A1
A2	B2	C2	D2
B3	B3	D3	D3
A4	B4	C4	D4

DUPED |> mutate(Repeats = apply(pick(everything()), 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", ")))

data.table [4 x 5]

A	B	C	D	Repeats
A1	B1	A1	A1	A1
A2	B2	C2	D2
B3	B3	D3	D3	B3, D3
A4	B4	C4	D4

DUPED[, Repeats := apply(.SD, 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", "))][]

data.table [4 x 5]

A	B	C	D	Repeats
A1	B1	A1	A1	A1
A2	B2	C2	D2
B3	B3	D3	D3	B3, D3
A4	B4	C4	D4

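With duplication counter (DUPED was updated by reference above, so the existing Repeats column is part of each row here, which is why the first row reports A1 (4) rather than 3):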

dup_counts <- function(v) {
  rles <- as.data.table(unclass(rle(v[which(duplicated(v))])))[, lengths := lengths + 1]
  paste(apply(rles, 1, \(r) paste0(r[2], " (", r[1], ")")), collapse = ", ")
}

DUPED |> mutate(Repeats = apply(pick(everything()), 1, \(r) dup_counts(r)))

data.table [4 x 5]

A	B	C	D	Repeats
A1	B1	A1	A1	A1 (4)
A2	B2	C2	D2
B3	B3	D3	D3	B3 (2), D3 (2)
A4	B4	C4	D4

DUPED[, Repeats := apply(.SD, 1, \(r) dup_counts(r))][]

data.table [4 x 5]

A	B	C	D	Repeats
A1	B1	A1	A1	A1 (4)
A2	B2	C2	D2
B3	B3	D3	D3	B3 (2), D3 (2)
A4	B4	C4	D4
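The rle()-based counter relies on repeated values sitting next to each other once extracted. A variant built on table() does not need that; dup_counts_tbl is a hypothetical name used for this sketch only, and it lists values in alphabetical order:

```r
# Hypothetical alternative to dup_counts(): count occurrences with table()
dup_counts_tbl <- function(v) {
  counts <- table(v)
  counts <- counts[counts > 1]          # keep values seen at least twice
  if (length(counts) == 0) return("")
  paste0(names(counts), " (", as.integer(counts), ")", collapse = ", ")
}

dup_counts_tbl(c("A1", "B1", "A1", "A1"))  # "A1 (3)"
dup_counts_tbl(c("B3", "B3", "D3", "D3"))  # "B3 (2), D3 (2)"
dup_counts_tbl(c("A2", "B2", "C2", "D2"))  # ""
```

It can be dropped into the apply() calls above in place of dup_counts.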

5.6 Expand & Complete:

Here, the entry for person B in year 2010 is missing, and we want to fill it in:

(CAR <- data.table(
    year = c(2010,2011,2012,2013,2014,2015,2011,2012,2013,2014,2015), 
    person = c("A","A","A","A","A","A", "B","B","B","B","B"),
    car = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes", "Citroen","Citroen", "Citroen", "Toyota", "Toyota")
  )
)

data.table [11 x 3]

year	person	car
2010	A	BMW
2011	A	BMW
2012	A	AUDI
2013	A	AUDI
2014	A	AUDI
2015	A	Mercedes
2011	B	Citroen
2012	B	Citroen
2013	B	Citroen
2014	B	Toyota
2015	B	Toyota

5.6.1 Expand:

tidyr::expand(CAR, person, year)

data.frame [12 x 2]

person	year
A	2010
A	2011
A	2012
A	2013
A	2014
A	2015
B	2010
B	2011
B	2012
B	2013
B	2014
B	2015

CJ(CAR$person, CAR$year, unique = TRUE)

data.table [12 x 2]

V1	V2
A	2010
A	2011
A	2012
A	2013
A	2014
A	2015
B	2010
B	2011
B	2012
B	2013
B	2014
B	2015

5.6.2 Complete:

Joins the original dataset with the expanded one:

CAR |> tidyr::complete(person, year)

data.frame [12 x 3]

person	year	car
A	2010	BMW
A	2011	BMW
A	2012	AUDI
A	2013	AUDI
A	2014	AUDI
A	2015	Mercedes
B	2010	NA
B	2011	Citroen
B	2012	Citroen
B	2013	Citroen
B	2014	Toyota
B	2015	Toyota

CAR[CJ(person, year, unique = TRUE), on = .(person, year)]

data.table [12 x 3]

year	person	car
2010	A	BMW
2011	A	BMW
2012	A	AUDI
2013	A	AUDI
2014	A	AUDI
2015	A	Mercedes
2010	B	NA
2011	B	Citroen
2012	B	Citroen
2013	B	Citroen
2014	B	Toyota
2015	B	Toyota
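Naming the arguments of CJ() keeps meaningful column names (instead of V1 and V2), and the gaps created by the join can then be filled by reference. A sketch on a reduced version of the data; the "unknown" placeholder is an arbitrary choice for illustration:

```r
library(data.table)

# Reduced toy version of CAR: person B has no entry for 2010
CAR_SMALL <- data.table(
  year = c(2010, 2011, 2011),
  person = c("A", "A", "B"),
  car = c("BMW", "BMW", "Citroen")
)

# Expand: every person x year combination, with named columns
grid <- CJ(person = CAR_SMALL$person, year = CAR_SMALL$year, unique = TRUE)

# Complete: join the original data onto the grid, then fill the gaps
full <- CAR_SMALL[grid, on = .(person, year)]
full[is.na(car), car := "unknown"]

full[]
```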

5.7 Uncount:

Duplicating aggregated rows to get the un-aggregated version back

Data

cols <- c("Mild", "Moderate", "Severe")

dat_agg

data.frame [10 x 6]

ID	Site	Domain	Mild	Moderate	Severe
1	23	A1	4	0	0
2	27	A1	0	1	1
3	28	A1	0	1	0
4	29	A1	0	0	1
5	31	A1	0	1	0
6	33	A1	0	1	1
7	41	A1	3	0	1
8	48	A1	0	2	4
9	64	A1	1	0	0
10	66	A1	1	0	0
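The code creating the data is not shown here. Based on the printed table only, it can be rebuilt as follows (a reconstruction, so the column types are assumptions; DAT_AGG is taken to be the data.table version used in the solutions below):

```r
library(data.table)

# Rebuilt from the printed table above; numeric types are assumed
dat_agg <- data.frame(
  ID       = 1:10,
  Site     = c(23, 27, 28, 29, 31, 33, 41, 48, 64, 66),
  Domain   = "A1",
  Mild     = c(4, 0, 0, 0, 0, 0, 3, 0, 1, 1),
  Moderate = c(0, 1, 1, 0, 1, 1, 0, 2, 0, 0),
  Severe   = c(0, 1, 0, 1, 0, 1, 1, 4, 0, 0)
)

# data.table counterpart (assumed)
DAT_AGG <- as.data.table(dat_agg)
```

The three count columns add up to 23, which matches the number of rows of the un-aggregated results.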

dat_agg |> 
  tidyr::pivot_longer(cols = all_of(cols), names_to = "Severity", values_to = "Count") |> 
  tidyr::uncount(Count) |> 
  mutate(ID_new = row_number(), .after = "ID") |>
  tidyr::pivot_wider(
    names_from = "Severity", values_from = "Severity", 
    values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
  )

data.frame [23 x 7]

ID	ID_new	Site	Domain	Mild	Moderate	Severe
1	1	23	A1	1	0	0
1	2	23	A1	1	0	0
1	3	23	A1	1	0	0
1	4	23	A1	1	0	0
2	5	27	A1	0	1	0
2	6	27	A1	0	0	1
3	7	28	A1	0	1	0
4	8	29	A1	0	0	1
5	9	31	A1	0	1	0
6	10	33	A1	0	1	0
6	11	33	A1	0	0	1
7	12	41	A1	1	0	0
7	13	41	A1	1	0	0
7	14	41	A1	1	0	0
7	15	41	A1	0	0	1
[ omitted 8 entries ]

Solution 1:

(melt(DAT_AGG, measure.vars = cols, variable.name = "Severity", value.name = "Count")
  [rep(1:.N, Count)][, ID_new := .I] |> 
  dcast(... ~ Severity, value.var = "Severity", fun.aggregate = \(x) ifelse(is.na(x), 0, 1), fill = 0)
)[, -"Count"]

data.table [23 x 7]

ID	Site	Domain	ID_new	Mild	Moderate	Severe
1	23	A1	1	1	0	0
1	23	A1	2	1	0	0
1	23	A1	3	1	0	0
1	23	A1	4	1	0	0
2	27	A1	10	0	1	0
2	27	A1	16	0	0	1
3	28	A1	11	0	1	0
4	29	A1	17	0	0	1
5	31	A1	12	0	1	0
6	33	A1	13	0	1	0
6	33	A1	18	0	0	1
7	41	A1	19	0	0	1
7	41	A1	5	1	0	0
7	41	A1	6	1	0	0
7	41	A1	7	1	0	0
[ omitted 8 entries ]

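Solution 2:

Note that this variant only duplicates the rows: when an ID has several non-zero severities, each of its copies keeps all of them flagged, so the result is not one-hot encoded like in Solution 1.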

DAT_AGG[Reduce(`c`, sapply(mget(cols), \(x) rep(1:.N, x)))
      ][, (cols) := lapply(.SD, \(x) ifelse(x > 1, 1, x)), .SDcols = cols
      ][order(ID)]

data.table [23 x 6]

ID	Site	Domain	Mild	Moderate	Severe
1	23	A1	1	0	0
1	23	A1	1	0	0
1	23	A1	1	0	0
1	23	A1	1	0	0
2	27	A1	0	1	1
2	27	A1	0	1	1
3	28	A1	0	1	0
4	29	A1	0	0	1
5	31	A1	0	1	0
6	33	A1	0	1	1
6	33	A1	0	1	1
7	41	A1	1	0	1
7	41	A1	1	0	1
7	41	A1	1	0	1
7	41	A1	1	0	1
[ omitted 8 entries ]
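Stripped of the pivoting, the core of uncount() in data.table is a row subset with repeated indices. A minimal sketch on made-up data (AGG and its columns are illustrative):

```r
library(data.table)

AGG <- data.table(fruit = c("apple", "pear", "plum"), n = c(2L, 0L, 3L))

# Repeat each row index n times (rows with n = 0 disappear), then drop the counter
UNAGG <- AGG[rep(seq_len(.N), n)][, n := NULL][]

UNAGG
```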

5.8 List / Unlist:

When a column contains a simple vector/list of values (of the same type, without structure)

5.8.1 One listed column:

Single ID (grouping) column:

(mtcars_list <- mtcars |> group_by(cyl) |> summarize(mpg = list(mpg)) |> ungroup())

data.frame [3 x 2]

cyl	mpg
4	<numeric [11]>
6	<numeric [7]>
8	<numeric [14]>

(MT_LIST <- MT[, .(mpg = .(mpg)), keyby = cyl])

data.table [3 x 2]

cyl	mpg
4	<numeric [11]>
6	<numeric [7]>
8	<numeric [14]>

Solution 1:

mtcars_list |> unnest(mpg)

data.frame [32 x 2]

cyl	mpg
4	22.8
4	24.4
4	22.8
4	32.4
4	30.4
4	33.9
4	21.5
4	27.3
4	26
4	30.4
4	21.4
6	21
6	21
6	21.4
6	18.1
[ omitted 17 entries ]

MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]

data.table [32 x 2]

cyl	mpg
4	22.8
4	24.4
4	22.8
4	32.4
4	30.4
4	33.9
4	21.5
4	27.3
4	26
4	30.4
4	21.4
6	21
6	21
6	21.4
6	18.1
[ omitted 17 entries ]

Solution 2:

Bypasses the need for grouping when unlisting: the data.table is first grown back to its original number of rows, then the list column is replaced by its unlisted values.

MT_LIST[rep(MT_LIST[, .I], lengths(mpg))][, mpg := unlist(MT_LIST$mpg)][]

data.table [32 x 2]

cyl	mpg
4	22.8
4	24.4
4	22.8
4	32.4
4	30.4
4	33.9
4	21.5
4	27.3
4	26
4	30.4
4	21.4
6	21
6	21
6	21.4
6	18.1
[ omitted 17 entries ]
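A quick way to convince oneself that both solutions are interchangeable is to build them side by side and compare the columns. A small self-contained check (the object names by_group and by_rep are made up for this sketch):

```r
library(data.table)

MT <- as.data.table(mtcars)
MT_LIST <- MT[, .(mpg = .(mpg)), keyby = cyl]

# Solution 1: unlist within each group
by_group <- MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]

# Solution 2: repeat each row once per list element, then overwrite the list column
row_idx <- rep(seq_len(nrow(MT_LIST)), lengths(MT_LIST$mpg))
by_rep  <- MT_LIST[row_idx][, mpg := unlist(MT_LIST$mpg)][]

identical(by_group$cyl, by_rep$cyl) && identical(by_group$mpg, by_rep$mpg)
```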

Multiple ID (grouping) columns:

(mtcars_list2 <- mtcars |> group_by(cyl, gear) |> summarize(mpg = list(mpg)) |> ungroup())

data.frame [8 x 3]

cyl	gear	mpg
4	3	<numeric [1]>
4	4	<numeric [8]>
4	5	<numeric [2]>
6	3	<numeric [2]>
6	4	<numeric [4]>
6	5	<numeric [1]>
8	3	<numeric [12]>
8	5	<numeric [2]>

(MT_LIST2 <- MT[, .(mpg = .(mpg)), keyby = .(cyl, gear)])

data.table [8 x 3]

cyl	gear	mpg
4	3	<numeric [1]>
4	4	<numeric [8]>
4	5	<numeric [2]>
6	3	<numeric [2]>
6	4	<numeric [4]>
6	5	<numeric [1]>
8	3	<numeric [12]>
8	5	<numeric [2]>

Solution 1:

mtcars_list2 |> unnest(mpg) # group_by(cyl, gear) is optional

data.frame [32 x 3]

cyl	gear	mpg
4	3	21.5
4	4	22.8
4	4	24.4
4	4	22.8
4	4	32.4
4	4	30.4
4	4	33.9
4	4	27.3
4	4	21.4
4	5	26
4	5	30.4
6	3	21.4
6	3	18.1
6	4	21
6	4	21
[ omitted 17 entries ]

MT_LIST2[, .(mpg = unlist(mpg)), by = setdiff(names(MT_LIST2), 'mpg')]

data.table [32 x 3]

cyl	gear	mpg
4	3	21.5
4	4	22.8
4	4	24.4
4	4	22.8
4	4	32.4
4	4	30.4
4	4	33.9
4	4	27.3
4	4	21.4
4	5	26
4	5	30.4
6	3	21.4
6	3	18.1
6	4	21
6	4	21
[ omitted 17 entries ]

Solution 2:

Same as with one grouping column

MT_LIST2[rep(MT_LIST2[, .I], lengths(mpg))][, mpg := unlist(MT_LIST2$mpg)][]

data.table [32 x 3]

cyl	gear	mpg
4	3	21.5
4	4	22.8
4	4	24.4
4	4	22.8
4	4	32.4
4	4	30.4
4	4	33.9
4	4	27.3
4	4	21.4
4	5	26
4	5	30.4
6	3	21.4
6	3	18.1
6	4	21
6	4	21
[ omitted 17 entries ]

5.8.2 Multiple listed columns:

Creating the data:

(mtcars_list_mult <- mtcars |> group_by(cyl, gear) |> summarize(across(c(mpg, disp), \(c) list(c))) |> ungroup())

data.frame [8 x 4]

cyl	gear	mpg	disp
4	3	<numeric [1]>	<numeric [1]>
4	4	<numeric [8]>	<numeric [8]>
4	5	<numeric [2]>	<numeric [2]>
6	3	<numeric [2]>	<numeric [2]>
6	4	<numeric [4]>	<numeric [4]>
6	5	<numeric [1]>	<numeric [1]>
8	3	<numeric [12]>	<numeric [12]>
8	5	<numeric [2]>	<numeric [2]>

(MT_LIST_MULT <- MT[, lapply(.SD, \(c) .(c)), keyby = .(cyl, gear), .SDcols = c("mpg", "disp")])

data.table [8 x 4]

cyl	gear	mpg	disp
4	3	<numeric [1]>	<numeric [1]>
4	4	<numeric [8]>	<numeric [8]>
4	5	<numeric [2]>	<numeric [2]>
6	3	<numeric [2]>	<numeric [2]>
6	4	<numeric [4]>	<numeric [4]>
6	5	<numeric [1]>	<numeric [1]>
8	3	<numeric [12]>	<numeric [12]>
8	5	<numeric [2]>	<numeric [2]>

Solution 1:

mtcars_list_mult |> unnest(c(mpg, disp)) # group_by(cyl, gear) is optional

data.frame [32 x 4]

cyl	gear	mpg	disp
4	3	21.5	120.1
4	4	22.8	108
4	4	24.4	146.7
4	4	22.8	140.8
4	4	32.4	78.7
4	4	30.4	75.7
4	4	33.9	71.1
4	4	27.3	79
4	4	21.4	121
4	5	26	120.3
4	5	30.4	95.1
6	3	21.4	258
6	3	18.1	225
6	4	21	160
6	4	21	160
[ omitted 17 entries ]

MT_LIST_MULT[, lapply(.SD, \(c) unlist(c)), by = setdiff(names(MT_LIST_MULT), c("mpg", "disp"))]

data.table [32 x 4]

cyl	gear	mpg	disp
4	3	21.5	120.1
4	4	22.8	108
4	4	24.4	146.7
4	4	22.8	140.8
4	4	32.4	78.7
4	4	30.4	75.7
4	4	33.9	71.1
4	4	27.3	79
4	4	21.4	121
4	5	26	120.3
4	5	30.4	95.1
6	3	21.4	258
6	3	18.1	225
6	4	21	160
6	4	21	160
[ omitted 17 entries ]

5.9 Nest / Unnest:

When a column contains a data.table/data.frame (with multiple columns, structured)

5.9.1 One nested column:

Nesting

(mtcars_nest <- mtcars |> tidyr::nest(data = -cyl)) # Data is inside a tibble

data.frame [3 x 2]

cyl	data
6	<tbl_df [7 x 10]>
4	<tbl_df [11 x 10]>
8	<tbl_df [14 x 10]>

mtcars_nest <- mtcars |> nest_by(cyl) |> ungroup() # Data is inside a vctrs_list_of

(MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl])

data.table [3 x 2]

cyl	data
4	<data.table [11 x 10]>
6	<data.table [7 x 10]>
8	<data.table [14 x 10]>

Unnesting

mtcars_nest |> unnest(data) |> ungroup()

data.frame [32 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
6	21	160	110	3.9	2.62	16.46	0	1	4	4
6	21	160	110	3.9	2.875	17.02	0	1	4	4
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
6	17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
[ omitted 17 entries ]

MT_NEST[, rbindlist(data), keyby = cyl]

data.table [32 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2
6	21	160	110	3.9	2.62	16.46	0	1	4	4
6	21	160	110	3.9	2.875	17.02	0	1	4	4
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1
[ omitted 17 entries ]

# MT_NEST[, do.call(c, data), keyby = cyl]
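Unnesting can also be done without grouping, by naming the nested tables after their group and letting rbindlist() rebuild the ID column through idcol. A sketch of that alternative; note that the ID comes back as character, so it is converted explicitly:

```r
library(data.table)

MT <- as.data.table(mtcars)
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]

# Name each nested table after its cyl value, then stack them with an ID column
UNNESTED <- rbindlist(setNames(MT_NEST$data, MT_NEST$cyl), idcol = "cyl")

# idcol is built from the list names, hence character: convert it back
UNNESTED[, cyl := as.numeric(cyl)]

UNNESTED[]
```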

5.9.2 Multiple nested columns:

Nesting:

(mtcars_nest_mult <- mtcars |> group_by(cyl, gear) |> nest(data1 = c(mpg, hp), data2 = !c(cyl, gear, mpg, hp)) |> ungroup())

data.frame [8 x 4]

cyl	gear	data1	data2
6	4	<tbl_df [4 x 2]>	<tbl_df [4 x 7]>
4	4	<tbl_df [8 x 2]>	<tbl_df [8 x 7]>
6	3	<tbl_df [2 x 2]>	<tbl_df [2 x 7]>
8	3	<tbl_df [12 x 2]>	<tbl_df [12 x 7]>
4	3	<tbl_df [1 x 2]>	<tbl_df [1 x 7]>
4	5	<tbl_df [2 x 2]>	<tbl_df [2 x 7]>
8	5	<tbl_df [2 x 2]>	<tbl_df [2 x 7]>
6	5	<tbl_df [1 x 2]>	<tbl_df [1 x 7]>

(MT_NEST_MULT <- MT[, .(data1 = .(.SD[, .(mpg, hp)]), data2 = .(.SD[, !c("mpg", "hp")])), keyby = .(cyl, gear)])

data.table [8 x 4]

cyl	gear	data1	data2
4	3	<data.table [1 x 2]>	<data.table [1 x 7]>
4	4	<data.table [8 x 2]>	<data.table [8 x 7]>
4	5	<data.table [2 x 2]>	<data.table [2 x 7]>
6	3	<data.table [2 x 2]>	<data.table [2 x 7]>
6	4	<data.table [4 x 2]>	<data.table [4 x 7]>
6	5	<data.table [1 x 2]>	<data.table [1 x 7]>
8	3	<data.table [12 x 2]>	<data.table [12 x 7]>
8	5	<data.table [2 x 2]>	<data.table [2 x 7]>

Unnesting:

mtcars_nest_mult |> unnest(c(data1, data2)) |> ungroup()

data.frame [32 x 11]

cyl	gear	mpg	hp	disp	drat	wt	qsec	vs	am	carb
6	4	21	110	160	3.9	2.62	16.46	0	1	4
6	4	21	110	160	3.9	2.875	17.02	0	1	4
6	4	19.2	123	167.6	3.92	3.44	18.3	1	0	4
6	4	17.8	123	167.6	3.92	3.44	18.9	1	0	4
4	4	22.8	93	108	3.85	2.32	18.61	1	1	1
4	4	24.4	62	146.7	3.69	3.19	20	1	0	2
4	4	22.8	95	140.8	3.92	3.15	22.9	1	0	2
4	4	32.4	66	78.7	4.08	2.2	19.47	1	1	1
4	4	30.4	52	75.7	4.93	1.615	18.52	1	1	2
4	4	33.9	65	71.1	4.22	1.835	19.9	1	1	1
4	4	27.3	66	79	4.08	1.935	18.9	1	1	1
4	4	21.4	109	121	4.11	2.78	18.6	1	1	2
6	3	21.4	110	258	3.08	3.215	19.44	1	0	1
6	3	18.1	105	225	2.76	3.46	20.22	1	0	1
8	3	18.7	175	360	3.15	3.44	17.02	0	0	2
[ omitted 17 entries ]

MT_NEST_MULT[, c(rbindlist(data1), rbindlist(data2)), keyby = .(cyl, gear)]

data.table [32 x 11]

cyl	gear	mpg	hp	disp	drat	wt	qsec	vs	am	carb
4	3	21.5	97	120.1	3.7	2.465	20.01	1	0	1
4	4	22.8	93	108	3.85	2.32	18.61	1	1	1
4	4	24.4	62	146.7	3.69	3.19	20	1	0	2
4	4	22.8	95	140.8	3.92	3.15	22.9	1	0	2
4	4	32.4	66	78.7	4.08	2.2	19.47	1	1	1
4	4	30.4	52	75.7	4.93	1.615	18.52	1	1	2
4	4	33.9	65	71.1	4.22	1.835	19.9	1	1	1
4	4	27.3	66	79	4.08	1.935	18.9	1	1	1
4	4	21.4	109	121	4.11	2.78	18.6	1	1	2
4	5	26	91	120.3	4.43	2.14	16.7	0	1	2
4	5	30.4	113	95.1	3.77	1.513	16.9	1	1	2
6	3	21.4	110	258	3.08	3.215	19.44	1	0	1
6	3	18.1	105	225	2.76	3.46	20.22	1	0	1
6	4	21	110	160	3.9	2.62	16.46	0	1	4
6	4	21	110	160	3.9	2.875	17.02	0	1	4
[ omitted 17 entries ]

MT_NEST_MULT[, do.call(c, unname(lapply(.SD, \(c) rbindlist(c)))), .SDcols = patterns('data'), keyby = .(cyl, gear)]

data.table [32 x 11]

cyl	gear	mpg	hp	disp	drat	wt	qsec	vs	am	carb
4	3	21.5	97	120.1	3.7	2.465	20.01	1	0	1
4	4	22.8	93	108	3.85	2.32	18.61	1	1	1
4	4	24.4	62	146.7	3.69	3.19	20	1	0	2
4	4	22.8	95	140.8	3.92	3.15	22.9	1	0	2
4	4	32.4	66	78.7	4.08	2.2	19.47	1	1	1
4	4	30.4	52	75.7	4.93	1.615	18.52	1	1	2
4	4	33.9	65	71.1	4.22	1.835	19.9	1	1	1
4	4	27.3	66	79	4.08	1.935	18.9	1	1	1
4	4	21.4	109	121	4.11	2.78	18.6	1	1	2
4	5	26	91	120.3	4.43	2.14	16.7	0	1	2
4	5	30.4	113	95.1	3.77	1.513	16.9	1	1	2
6	3	21.4	110	258	3.08	3.215	19.44	1	0	1
6	3	18.1	105	225	2.76	3.46	20.22	1	0	1
6	4	21	110	160	3.9	2.62	16.46	0	1	4
6	4	21	110	160	3.9	2.875	17.02	0	1	4
[ omitted 17 entries ]

5.9.3 Operate on nested/list columns:

(mtcars_nest <- mtcars |> nest(data = -cyl) |> ungroup())

data.frame [3 x 2]

cyl	data
6	<tbl_df [7 x 10]>
4	<tbl_df [11 x 10]>
8	<tbl_df [14 x 10]>

(MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl])

data.table [3 x 2]

cyl	data
4	<data.table [11 x 10]>
6	<data.table [7 x 10]>
8	<data.table [14 x 10]>

Creating a new column using the nested data:

Keeping the nested column:

mtcars_nest |> group_by(cyl) |> mutate(sum = sum(unlist(data))) |> ungroup()

data.frame [3 x 3]

cyl	data	sum
6	<tbl_df [7 x 10]>	2 508.16
4	<tbl_df [11 x 10]>	2 719.233
8	<tbl_df [14 x 10]>	8 516.809

copy(MT_NEST)[, sum := sapply(data, \(r) sum(r)), keyby = cyl][]

data.table [3 x 3]

cyl	data	sum
4	<data.table [11 x 10]>	2 719.233
6	<data.table [7 x 10]>	2 508.16
8	<data.table [14 x 10]>	8 516.809

Dropping the nested column:

mtcars_nest |> group_by(cyl) |> summarize(sum = sum(unlist(data))) |> ungroup()

data.frame [3 x 2]

cyl	sum
4	2 719.233
6	2 508.16
8	8 516.809

MT_NEST[, .(sum = sapply(data, \(r) sum(r))), keyby = cyl]

data.table [3 x 2]

cyl	sum
4	2 719.233
6	2 508.16
8	8 516.809

Creating multiple new columns using the nested data:

linreg <- \(data) lm(mpg ~ hp, data = data) |> broom::tidy()

mtcars_nest |> group_by(cyl) |> group_modify(\(d, g) linreg(unnest(d, everything()))) |> ungroup()

data.frame [6 x 6]

cyl	term	estimate	std.error	statistic	p.value
4	(Intercept)	35.983	5.201	6.918	<0.001 ***
4	hp	−0.113	0.061	−1.843	0.098
6	(Intercept)	20.674	3.304	6.256	0.002 **
6	hp	−0.008	0.027	−0.286	0.786
8	(Intercept)	18.08	2.988	6.052	<0.001 ***
8	hp	−0.014	0.014	−1.025	0.326

MT_NEST[, rbindlist(lapply(data, \(ndt) linreg(ndt))), keyby = cyl][]

data.table [6 x 6]

cyl	term	estimate	std.error	statistic	p.value
4	(Intercept)	35.983	5.201	6.918	<0.001 ***
4	hp	−0.113	0.061	−1.843	0.098
6	(Intercept)	20.674	3.304	6.256	0.002 **
6	hp	−0.008	0.027	−0.286	0.786
8	(Intercept)	18.08	2.988	6.052	<0.001 ***
8	hp	−0.014	0.014	−1.025	0.326
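When the data is not nested to begin with, the same per-group model can be fitted directly on .SD. The sketch below uses base R's coef() instead of broom::tidy(), so it only returns the estimates (the term and estimate column names are chosen to mirror the output above):

```r
library(data.table)

MT <- as.data.table(mtcars)

# Fit one regression per cyl group and return its coefficients as rows
coefs <- MT[, {
  fit <- lm(mpg ~ hp, data = .SD)
  list(term = names(coef(fit)), estimate = unname(coef(fit)))
}, keyby = cyl]

coefs
```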

Operating inside the nested data:

mtcars_nest |> 
  mutate(data = map(data, \(tibl) mutate(tibl, sum = purrr::pmap_dbl(pick(everything()), sum)))) |> 
  unnest(data)

data.frame [32 x 12]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	sum
6	21	160	110	3.9	2.62	16.46	0	1	4	4	322.98
6	21	160	110	3.9	2.875	17.02	0	1	4	4	323.795
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1	420.135
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1	379.54
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4	344.46
6	17.8	167.6	123	3.92	3.44	18.9	1	0	4	4	343.66
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6	373.59
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1	255.58
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2	266.98
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	295.57
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1	209.85
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2	191.165
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1	202.955
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1	269.775
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1	204.215
[ omitted 17 entries ]

mtcars_nest |> 
  mutate(across(data, \(tibls) map(tibls, \(tibl) mutate(tibl, sum = apply(pick(everything()), 1, sum))))) |> 
  unnest(data)

data.frame [32 x 12]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	sum
6	21	160	110	3.9	2.62	16.46	0	1	4	4	322.98
6	21	160	110	3.9	2.875	17.02	0	1	4	4	323.795
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1	420.135
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1	379.54
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4	344.46
6	17.8	167.6	123	3.92	3.44	18.9	1	0	4	4	343.66
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6	373.59
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1	255.58
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2	266.98
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	295.57
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1	209.85
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2	191.165
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1	202.955
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1	269.775
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1	204.215
[ omitted 17 entries ]

Using the nplyr package:

library(nplyr)

mtcars_nest |> 
  nplyr::nest_mutate(data, sum = apply(cur_data(), 1, sum)) |> 
  unnest(data)

data.frame [32 x 12]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	sum
6	21	160	110	3.9	2.62	16.46	0	1	4	4	322.98
6	21	160	110	3.9	2.875	17.02	0	1	4	4	323.795
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1	420.135
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1	379.54
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4	344.46
6	17.8	167.6	123	3.92	3.44	18.9	1	0	4	4	343.66
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6	373.59
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1	255.58
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2	266.98
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	295.57
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1	209.85
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2	191.165
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1	202.955
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1	269.775
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1	204.215
[ omitted 17 entries ]

copy(MT_NEST)[, data := lapply(data, \(dt) dt[, sum := apply(.SD, 1, sum)])
            ][, rbindlist(data), keyby = cyl]

data.table [32 x 12]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	sum
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1	255.58
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2	266.98
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	295.57
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1	209.85
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2	191.165
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1	202.955
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1	269.775
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1	204.215
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2	268.57
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2	269.683
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2	284.89
6	21	160	110	3.9	2.62	16.46	0	1	4	4	322.98
6	21	160	110	3.9	2.875	17.02	0	1	4	4	323.795
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1	420.135
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1	379.54
[ omitted 17 entries ]
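A variation on the data.table version, using rowSums() instead of apply(). In this sketch copy() is also applied to each nested table so that the := cannot leak back into MT_NEST; whether that extra copy is strictly needed depends on how deep copy(MT_NEST) goes, so treat it as a precaution:

```r
library(data.table)

MT <- as.data.table(mtcars)
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]

res <- copy(MT_NEST)[
  # Add the row-wise sum inside a copy of each nested table
  , data := lapply(data, \(dt) copy(dt)[, sum := rowSums(.SD)])
][
  # Unnest
  , rbindlist(data), keyby = cyl
]

res
```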

5.10 Rotate / Transpose:

(MT_SUMMARY <- MT[, tidy(summary(mpg)), by = cyl])

data.table [3 x 7]

cyl	minimum	q1	median	mean	q3	maximum
6	17.8	18.65	19.7	19.743	21	21.4
4	21.4	22.8	26	26.664	30.4	33.9
8	10.4	14.4	15.2	15.1	16.25	19.2

Solution 1:

Using pivots to fully rotate the data.table:

MT_SUMMARY |> 
  pivot_longer(!cyl, names_to = "Statistic") |> 
  pivot_wider(id_cols = "Statistic", names_from = "cyl", names_prefix = "Cyl ")

data.frame [6 x 4]

Statistic	Cyl 6	Cyl 4	Cyl 8
minimum	17.8	21.4	10.4
q1	18.65	22.8	14.4
median	19.7	26	15.2
mean	19.743	26.664	15.1
q3	21	30.4	16.25
maximum	21.4	33.9	19.2

MT_SUMMARY |> 
  melt(id.vars = "cyl", variable.name = "Statistic") |> 
  dcast(Statistic ~ paste0("Cyl ", cyl))

data.table [6 x 4]

Statistic	Cyl 4	Cyl 6	Cyl 8
minimum	21.4	17.8	10.4
q1	22.8	18.65	14.4
median	26	19.7	15.2
mean	26.664	19.743	15.1
q3	30.4	21	16.25
maximum	33.9	21.4	19.2

Solution 2:

Using a dedicated function:

Note

AFAIK there is no native Tidyverse function to do this.

library(datawizard)

datawizard::data_rotate(MT_SUMMARY, colnames = TRUE, rownames = "Statistic")

data.frame [6 x 4]

Statistic	6	4	8
minimum	17.8	21.4	10.4
q1	18.65	22.8	14.4
median	19.7	26	15.2
mean	19.743	26.664	15.1
q3	21	30.4	16.25
maximum	21.4	33.9	19.2

data.table::transpose(MT_SUMMARY, keep.names = "Statistic", make.names = 1)

data.table [6 x 4]

Statistic	6	4	8
minimum	17.8	21.4	10.4
q1	18.65	22.8	14.4
median	19.7	26	15.2
mean	19.743	26.664	15.1
q3	21	30.4	16.25
maximum	21.4	33.9	19.2
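To make the two arguments of transpose() explicit: make.names picks the input column whose values become the new column names, and keep.names names the column that stores the old column names. A tiny sketch on made-up values (S and ROT are illustrative names):

```r
library(data.table)

S <- data.table(cyl = c(6, 4), min = c(17.8, 21.4), max = c(21.4, 33.9))

# cyl values become the column names; the former column names go into "Statistic"
ROT <- transpose(S, keep.names = "Statistic", make.names = "cyl")

ROT
```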

6 Processing examples:

Examples of interesting tasks that I’ve collected over time.

6.1 Find minimum in each group:

MT |> group_by(cyl) |> arrange(mpg) |> slice(1) |> ungroup()

data.frame [3 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4

MT[, .SD[which.min(mpg)], keyby = cyl]

data.table [3 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2
6	17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
8	10.4	472	205	2.93	5.25	17.98	0	0	3	4
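Another common idiom for the same task is to collect the row indices of the per-group minima with .I and subset once. A sketch of that variant (it keeps the original column order, since the final subset involves no grouping):

```r
library(data.table)

MT <- as.data.table(mtcars)

# Row index of the smallest mpg within each cyl group
idx <- MT[, .I[which.min(mpg)], by = cyl]$V1

MIN_ROWS <- MT[idx]
MIN_ROWS
```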

6.2 GROUP > FILTER > MUTATE

Data:

(DAT <- structure(list(
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
  name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"), 
  year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L), 
  job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss", "Boss", "Boss", "Boss"), 
  job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)
  ), 
  .Names = c("id", "name", "year", "job", "job2"), 
  class = "data.frame", 
  row.names = c(NA, -16L)
) |> setDT())

data.table [16 x 5]

id	name	year	job	job2
1	Jane	1980	Manager	1
1	Jane	1981	Manager	1
1	Jane	1982	Manager	1
1	Jane	1983	Manager	1
1	Jane	1984	Manager	1
1	Jane	1985	Manager	1
1	Jane	1986	Boss	0
1	Jane	1987	Boss	0
2	Bob	1985	Manager	1
2	Bob	1986	Manager	1
2	Bob	1987	Manager	1
2	Bob	1988	Boss	0
2	Bob	1989	Boss	0
2	Bob	1990	Boss	0
2	Bob	1991	Boss	0
[ omitted 1 entries ]

Tidyverse:

DAT |> group_by(name, job) |> 
  filter(job != "Boss" | year == min(year)) |> 
  mutate(cumu_job2 = cumsum(job2)) |> 
  ungroup()

data.frame [11 x 6]

id	name	year	job	job2	cumu_job2
1	Jane	1980	Manager	1	1
1	Jane	1981	Manager	1	2
1	Jane	1982	Manager	1	3
1	Jane	1983	Manager	1	4
1	Jane	1984	Manager	1	5
1	Jane	1985	Manager	1	6
1	Jane	1986	Boss	0	0
2	Bob	1985	Manager	1	1
2	Bob	1986	Manager	1	2
2	Bob	1987	Manager	1	3
2	Bob	1988	Boss	0	0

Note

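Here, the grouping is done BEFORE the filter, so min(year) is computed within each (name, job) group: every Boss group keeps only its first year, and since job2 is 0 for Boss rows, their cumulative sum is 0.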

data.table:

Solution 1:

DAT[ , .SD[job != "Boss" | year == min(year), .(cumu_job2 = cumsum(job2))], by = .(name, job)]

data.table [11 x 3]

name	job	cumu_job2
Jane	Manager	1
Jane	Manager	2
Jane	Manager	3
Jane	Manager	4
Jane	Manager	5
Jane	Manager	6
Jane	Boss	0
Bob	Manager	1
Bob	Manager	2
Bob	Manager	3
Bob	Boss	0

Solution 2:

DAT[ , .(cumu_job2 = cumsum(job2[job != "Boss" | year == min(year)])), by = .(name, job)]

data.table [11 x 3]

name	job	cumu_job2
Jane	Manager	1
Jane	Manager	2
Jane	Manager	3
Jane	Manager	4
Jane	Manager	5
Jane	Manager	6
Jane	Boss	0
Bob	Manager	1
Bob	Manager	2
Bob	Manager	3
Bob	Boss	0

Solution 3:

DAT[
    DAT[, .I[job != "Boss" | year == min(year)], by = .(name, job)]$V1 # Row indices
  ][
    , cumu_job2 := cumsum(job2), by = .(name, job)
  ][]

data.table [11 x 6]

id	name	year	job	job2	cumu_job2
1	Jane	1980	Manager	1	1
1	Jane	1981	Manager	1	2
1	Jane	1982	Manager	1	3
1	Jane	1983	Manager	1	4
1	Jane	1984	Manager	1	5
1	Jane	1985	Manager	1	6
1	Jane	1986	Boss	0	0
2	Bob	1985	Manager	1	1
2	Bob	1986	Manager	1	2
2	Bob	1987	Manager	1	3
2	Bob	1988	Boss	0	0

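If we filtered before the grouping (the i expression is evaluated on the whole table, so min(year) is the global minimum and every Boss row is dropped):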

DAT[job != "Boss" | year == min(year), list(cumu_job2 = cumsum(job2)), by = .(name, job)]

data.table [9 x 3]

name	job	cumu_job2
Jane	Manager	1
Jane	Manager	2
Jane	Manager	3
Jane	Manager	4
Jane	Manager	5
Jane	Manager	6
Bob	Manager	1
Bob	Manager	2
Bob	Manager	3
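The difference between the two placements of the filter is easier to see on a toy table (DT, g and year below are made up for illustration): a condition in i sees the whole table, whereas the same condition applied to .SD in j is evaluated once per group.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b"), year = c(2000, 2001, 2005, 2006))

# Filter in i: min(year) is the global minimum (2000), so only one row survives
global_min <- DT[year == min(year)]

# Filter inside j: min(year) is computed per group, so one row per group survives
group_min <- DT[, .SD[year == min(year)], by = g]
```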

6.3 GROUP > SUMMARIZE > JOIN > MUTATE

Data:

(GSJM1 <- data.table(x = c(1,1,1,1,2,2,2,2), y = c("a", "a", "b", "b"), z = 1:8, key = c("x", "y")))

data.table [8 x 3]

x	y	z
1	a	1
1	a	2
1	b	3
1	b	4
2	a	5
2	a	6
2	b	7
2	b	8

(GSJM2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y")))

data.table [2 x 3]

x	y	mul
1	a	4
2	b	3

Tidyverse:

as.data.frame(GSJM1) |> 
  group_by(x, y) |>
  summarise(z = sum(z)) |>
  ungroup() |> 
  right_join(GSJM2) |>
  mutate(z = z * mul) |> 
  select(-mul)

data.frame [2 x 3]

x	y	z
1	a	12
2	b	45

data.table:

Basic:

GSJM1[, .(z = sum(z)), by = .(x, y)][GSJM2][, `:=`(z = z * mul, mul = NULL)][]

data.table [2 x 3]

x	y	z
1	a	12
2	b	45

Advanced (using .EACHI):

GSJM1[GSJM2, .(z = sum(z) * mul), by = .EACHI]

data.table [2 x 3]

x	y	z
1	a	12
2	b	45
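With by = .EACHI, j is evaluated once for each row of the i table, on the rows of the outer table matching it, which is what allows the aggregation and the multiplication to happen in a single step. A small sketch on made-up tables (X, Y and their columns are illustrative):

```r
library(data.table)

X <- data.table(id = c(1, 1, 2, 2, 3), v = 1:5)
Y <- data.table(id = c(1, 3), w = c(10, 100))

# For each row of Y: sum the matching v values of X, then scale by that row's w
res <- X[Y, on = "id", .(total = sum(v) * w), by = .EACHI]

res
```

Here id 2 never appears in the result, since it has no counterpart in Y.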

6.4 Separating rows & cleaning text:

Data

(DT_COMA <- data.table(
  first = c(1,"2,3",3,4,5,6.5,7,8,9,0), 
  second = c(1,"2,,5",3,4,5,"6,5,9",7,8,9,0), 
  third = c("one", "two", "thr,ee", "four", "five", "six", "sev,en", "eight", "nine", "zero"), 
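  # NB: the dates are unquoted, so each one is evaluated as a division (e.g. 1/1/2020 is about 0.0005), hence 1970-01-01 on every row
  fourth = as.Date(c(1/1/2020, 2/1/2020, 3/1/2020, 4/1/2020, 5/1/2020, 6/1/2020, 7/1/2020, 8/1/2020, 9/1/2020, 10/1/2020), origin = "1970-01-01")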
  )
)

data.table [10 x 4]

first	second	third	fourth
1	1	one	1970-01-01
2,3	2,,5	two	1970-01-01
3	3	thr,ee	1970-01-01
4	4	four	1970-01-01
5	5	five	1970-01-01
6.5	6,5,9	six	1970-01-01
7	7	sev,en	1970-01-01
8	8	eight	1970-01-01
9	9	nine	1970-01-01
0	0	zero	1970-01-01
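The fourth column prints 1970-01-01 on every row because the unquoted dates are arithmetic expressions: each one evaluates to a tiny fraction of a day after the origin. A short sketch of the difference with a quoted date; the day/month/year reading used here is an assumption, since the intended format cannot be told from the data:

```r
# Unquoted: 1/1/2020 is a division, i.e. a fraction of a day after the origin
unquoted <- as.Date(1/1/2020, origin = "1970-01-01")
format(unquoted)

# Quoted, with an explicit format (day/month/year is assumed here)
quoted <- as.Date("2/1/2020", format = "%d/%m/%Y")
format(quoted)
```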

6.4.1 Step 1: Cleaning

Removing unwanted commas within words

Tidyverse:

DT_COMA |> mutate(across(where(\(v) is.character(v) & all(is.na(as.numeric(v)))), \(v) stringr::str_remove_all(v, ",")))

data.table [10 x 4]

first	second	third	fourth
1	1	one	1970-01-01
2,3	2,,5	two	1970-01-01
3	3	three	1970-01-01
4	4	four	1970-01-01
5	5	five	1970-01-01
6.5	6,5,9	six	1970-01-01
7	7	seven	1970-01-01
8	8	eight	1970-01-01
9	9	nine	1970-01-01
0	0	zero	1970-01-01

data.table:

cols_to_clean <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & all(is.na(as.numeric(v)))] |> colnames()

copy(DT_COMA)[, c(cols_to_clean) := purrr::map(.SD[, cols_to_clean, with = F], \(v) stringr::str_remove_all(v, ","))][]

data.table [10 x 4]

first	second	third	fourth
1	1	one	1970-01-01
2,3	2,,5	two	1970-01-01
3	3	three	1970-01-01
4	4	four	1970-01-01
5	5	five	1970-01-01
6.5	6,5,9	six	1970-01-01
7	7	seven	1970-01-01
8	8	eight	1970-01-01
9	9	nine	1970-01-01
0	0	zero	1970-01-01
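
The column selection above relies on a predicate that tells "pure text" columns apart from "number-like" character columns. A minimal base R sketch of that logic (the helper names is_text_col and is_numlike_col are hypothetical, introduced here only for illustration):

```r
# A character column counts as "text" when none of its values parse as a number
is_text_col <- function(v) {
  is.character(v) && all(is.na(suppressWarnings(as.numeric(v))))
}

# A character column counts as "number-like" when at least one value parses
is_numlike_col <- function(v) {
  is.character(v) && any(!is.na(suppressWarnings(as.numeric(v))))
}

text_check    <- is_text_col(c("one", "thr,ee"))          # no value is numeric
mixed_check   <- is_text_col(c("1", "2,3"))               # "1" parses as a number
numlike_check <- is_numlike_col(c("1", "2,3"))
date_check    <- is_numlike_col(as.Date("2020-01-01"))    # not a character vector
```

One caveat of this heuristic: a character column in which every value contains a comma (e.g. only "2,3"-style entries) would be classified as text, since none of its values parse as a number.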

6.4.2 Step 2: Separating rows

Each row where a numeric column holds multiple comma-separated values has to be split into multiple rows (one value per row):

Tidyverse:

cols_to_separate <- DT_COMA |> select(where(\(v) is.character(v) & any(!is.na(as.numeric(v))))) |> colnames()

purrr::reduce(
  cols_to_separate, 
  \(acc, col) acc |> tidyr::separate_rows(all_of(col), sep = ",", convert = TRUE), 
  .init = DT_COMA
)

data.frame [17 x 4]

first	second	third	fourth
1	1	one	1970-01-01
2	2	two	1970-01-01
2	NA	two	1970-01-01
2	5	two	1970-01-01
3	2	two	1970-01-01
3	NA	two	1970-01-01
3	5	two	1970-01-01
3	3	thr,ee	1970-01-01
4	4	four	1970-01-01
5	5	five	1970-01-01
6.5	6	six	1970-01-01
6.5	5	six	1970-01-01
6.5	9	six	1970-01-01
7	7	sev,en	1970-01-01
8	8	eight	1970-01-01
[ omitted 2 entries ]

data.table:

cols_to_separate <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & any(!is.na(as.numeric(v)))] |> colnames()

(purrr::reduce(
  cols_to_separate,
  \(acc, col) acc[rep(1:.N, lengths(strsplit(get(col), ",")))][, (col) := type.convert(unlist(strsplit(acc[[col]], ",", fixed = T)), as.is = T, na.strings = "")],
  .init = DT_COMA
))[]

data.table [17 x 4]

first	second	third	fourth
1	1	one	1970-01-01
2	2	two	1970-01-01
2	NA	two	1970-01-01
2	5	two	1970-01-01
3	2	two	1970-01-01
3	NA	two	1970-01-01
3	5	two	1970-01-01
3	3	thr,ee	1970-01-01
4	4	four	1970-01-01
5	5	five	1970-01-01
6.5	6	six	1970-01-01
6.5	5	six	1970-01-01
6.5	9	six	1970-01-01
7	7	sev,en	1970-01-01
8	8	eight	1970-01-01
[ omitted 2 entries ]
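
The data.table version above packs two operations into one line. A minimal sketch of the same idea on a single column, with hypothetical toy data (dt, parts and long are illustrative names): repeat each row once per value it contains, then overwrite the column with the flattened values.

```r
library(data.table)

# Hypothetical toy data, for illustration only
dt <- data.table(id = c("a", "b", "c"), vals = c("1", "2,3", "4,,6"))

# One character vector of pieces per row
parts <- strsplit(dt$vals, ",", fixed = TRUE)

# Repeat each row as many times as it has pieces
long <- dt[rep(seq_len(.N), lengths(parts))]

# Replace the column with the flattened pieces, converting "" to NA
long[, vals := type.convert(unlist(parts), as.is = TRUE, na.strings = "")]
```

Note that strsplit() drops a trailing empty piece (e.g. "4," yields a single value), so trailing commas would not produce an NA row with this approach.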

6.4.3 Combining both steps:

Tidyverse:

DT_COMA <- DT_COMA |> mutate(across(where(\(v) is.character(v) & all(is.na(as.numeric(v)))), \(v) stringr::str_remove_all(v, ",")))

purrr::reduce(
  select(DT_COMA, where(\(v) is.character(v) & any(!is.na(as.numeric(v))))) |> colnames(), 
  \(acc, col) acc |> tidyr::separate_rows(all_of(col), sep = ",", convert = TRUE), 
  .init = DT_COMA
)

data.frame [17 x 4]

first	second	third	fourth
1	1	one	1970-01-01
2	2	two	1970-01-01
2	NA	two	1970-01-01
2	5	two	1970-01-01
3	2	two	1970-01-01
3	NA	two	1970-01-01
3	5	two	1970-01-01
3	3	three	1970-01-01
4	4	four	1970-01-01
5	5	five	1970-01-01
6.5	6	six	1970-01-01
6.5	5	six	1970-01-01
6.5	9	six	1970-01-01
7	7	seven	1970-01-01
8	8	eight	1970-01-01
[ omitted 2 entries ]

data.table:

cols_to_clean <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & all(is.na(as.numeric(v)))] |> colnames()
cols_to_separate <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & any(!is.na(as.numeric(v)))] |> colnames()

DT_COMA[, c(cols_to_clean) := purrr::map(.SD[, cols_to_clean, with = F], \(v) stringr::str_remove_all(v, ","))]

data.table [10 x 4]

first	second	third	fourth
1	1	one	1970-01-01
2,3	2,,5	two	1970-01-01
3	3	three	1970-01-01
4	4	four	1970-01-01
5	5	five	1970-01-01
6.5	6,5,9	six	1970-01-01
7	7	seven	1970-01-01
8	8	eight	1970-01-01
9	9	nine	1970-01-01
0	0	zero	1970-01-01

(purrr::reduce(
  cols_to_separate,
  \(acc, col) acc[rep(1:.N, lengths(strsplit(get(col), ",")))][, (col) := type.convert(unlist(strsplit(acc[[col]], ",", fixed = T)), as.is = T, na.strings = "")],
  .init = DT_COMA
))[]

data.table [17 x 4]

first	second	third	fourth
1	1	one	1970-01-01
2	2	two	1970-01-01
2	NA	two	1970-01-01
2	5	two	1970-01-01
3	2	two	1970-01-01
3	NA	two	1970-01-01
3	5	two	1970-01-01
3	3	three	1970-01-01
4	4	four	1970-01-01
5	5	five	1970-01-01
6.5	6	six	1970-01-01
6.5	5	six	1970-01-01
6.5	9	six	1970-01-01
7	7	seven	1970-01-01
8	8	eight	1970-01-01
[ omitted 2 entries ]

6.5 Multiple choice questions:

Data:

surv

data.frame [5 x 2]

ID	response
1	I read the assigned readings.|I reread my notes.|I worked with one or more classmates.
2	I read the assigned readings.|I reviewed this week’s slides.
3	I worked on practice problems.|I read the assigned readings.|I reread my notes.
4	I worked on practice problems.|I read the assigned readings.|I reread my notes.
5	I worked on practice problems.|I read the assigned readings.|I reread my notes.|I reviewed this week’s slides.|I worked with one or more classmates.

Here, each possible answer gets its own indicator column. A pivot is used because not every row contains every possible answer:

Tidyverse:

surv |> 
  mutate(response = str_split(response, fixed("|"))) |> 
  unnest(response) |> 
  pivot_wider(id_cols = ID, names_from = response, values_from = response, values_fn = \(.x) sum(!is.na(.x)), values_fill = 0)

data.frame [5 x 6]

ID	I read the assigned readings.	I reread my notes.	I worked with one or more classmates.	I reviewed this week’s slides.	I worked on practice problems.
1	1	1	1	0	0
2	1	0	0	1	0
3	1	1	0	0	1
4	1	1	0	0	1
5	1	1	1	1	1

data.table:

as.data.table(surv)[, c(.SD, tstrsplit(response, "|", fixed = T))][, -"response"] |> 
  melt(measure.vars = patterns("^V")) |> 
  dcast(ID ~ value, fun.agg = \(.x) sum(!is.na(.x)), subset = .(!is.na(value)))

data.table [5 x 6]

ID	I read the assigned readings.	I reread my notes.	I reviewed this week’s slides.	I worked on practice problems.	I worked with one or more classmates.
1	1	1	0	0	1
2	1	0	1	0	0
3	1	1	0	1	0
4	1	1	0	1	0
5	1	1	1	1	1
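
An alternative sketch of the same one-hot encoding, which goes straight to a long format (one row per ID and answer) instead of splitting into columns and melting. The toy data and the names answers, long and wide are hypothetical:

```r
library(data.table)

# Hypothetical toy data, for illustration only
answers <- data.table(ID = 1:3, response = c("x|y", "y", "x|z"))

# One row per (ID, choice) pair
long <- answers[, .(choice = unlist(strsplit(response, "|", fixed = TRUE))), by = ID]

# Count occurrences of each choice per ID; absent combinations get length 0
wide <- dcast(long, ID ~ choice, fun.aggregate = length, value.var = "choice")
```

Because the aggregation function is length, combinations that never occur are filled with 0, so no separate fill or NA handling is needed in this variant.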

6.6 Filling with lagging conditions:

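Task: Fill the missing values of a row with those of the previous row, but only when the rows directly before and after it share the same areacode and the same first two zipcode digits (see this SO question).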

Data:

ZIP <- structure(
  list(
    zipcode = c(1001, 1002, 1003, 1004, 1101, 1102, 1103, 1104, 1201, 1202, 1203, 1302), 
    areacode = c(4, 4, NA, 4, 4, 4, NA, 1, 4, 4, NA, 4), 
    type = structure(c(1L, 1L, NA, 1L, 2L, 2L, NA, 1L, 1L, 1L, NA, 1L), .Label = c("clay", "sand"), class = "factor"), 
    region = c(3, 3, NA, 3, 3, 3, NA, 3, 3, 3, NA, 3), 
    do_not_fill = c(1, NA, NA, 1, 1, NA, NA, 1, NA, NA, NA, 1)
    ), 
  class = c("data.table", "data.frame"), row.names = c(NA, -12L)
)

Tidyverse:

as_tibble(ZIP) |>
  mutate(type = as.character(type)) |>
  mutate(
    across(1:4, ~ ifelse(
        is.na(.) & lag(areacode) == lead(areacode) & 
          lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))),
        lag(.), .
      )
    )
  )

data.frame [12 x 5]

zipcode	areacode	type	region	do_not_fill
1 001	4	clay	3	1
1 002	4	clay	3	NA
1 003	4	clay	3	NA
1 004	4	clay	3	1
1 101	4	sand	3	1
1 102	4	sand	3	NA
1 103	NA	NA	NA	NA
1 104	1	clay	3	1
1 201	4	clay	3	NA
1 202	4	clay	3	NA
1 203	NA	NA	NA	NA
1 302	4	clay	3	1

data.table:

ZIP[, c(lapply(.SD, \(v) {fifelse(
  is.na(v) & lag(areacode) == lead(areacode) &
    lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))), lag(v), v)}), 
  .SD[, .(do_not_fill)]), .SDcols = !patterns("do_not_fill")]

data.table [12 x 5]

zipcode	areacode	type	region	do_not_fill
1 001	4	clay	3	1
1 002	4	clay	3	NA
1 003	4	clay	3	NA
1 004	4	clay	3	1
1 101	4	sand	3	1
1 102	4	sand	3	NA
1 103	NA	NA	NA	NA
1 104	1	clay	3	1
1 201	4	clay	3	NA
1 202	4	clay	3	NA
1 203	NA	NA	NA	NA
1 302	4	clay	3	1
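
Both versions above use dplyr's lag() and lead(). A minimal sketch of the same conditional fill using only data.table's shift(), on a hypothetical single-column example (dt and filled are illustrative names): a missing value is replaced by the previous one only when its two neighbours are equal.

```r
library(data.table)

# Hypothetical toy data, for illustration only
dt <- data.table(code = c(4, NA, 4, 4, NA, 1))

dt[, filled := {
  prev <- shift(code, 1L, type = "lag")   # value of the previous row
  nxt  <- shift(code, 1L, type = "lead")  # value of the next row
  # Fill only when the value is missing and both neighbours exist and agree
  fifelse(is.na(code) & !is.na(prev) & !is.na(nxt) & prev == nxt, prev, code)
}]
```

The second NA is left untouched because its neighbours (4 and 1) differ. The explicit !is.na() checks keep the test free of NA, which is a design choice of this sketch and not something the examples above do.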

6.7 Join + Coalesce:

Task: Replace the missing dates from one dataset with the earliest date from another dataset, matching by ID:

Data:

(dt1 <- data.table::fread(
"
      id  x       y   z         
     1    A       1    NA        
     2    C       3    NA        
     3    C       3    NA        
     4    C       2    NA        
     5    B       2    2019-08-04
     6    C       1    2019-09-18
     7    B       3    2019-12-17
     8    A       2    2019-11-02
     9    A       3    2020-03-16
    10    A       1    2020-01-31
"
))

data.table [10 x 4]

id	x	y	z
1	A	1	NA
2	C	3	NA
3	C	3	NA
4	C	2	NA
5	B	2	2019-08-04
6	C	1	2019-09-18
7	B	3	2019-12-17
8	A	2	2019-11-02
9	A	3	2020-03-16
10	A	1	2020-01-31

(dt2 <- data.table::fread(
"      id      date
      1      2012-09-25
      1      2012-03-26
      1      2012-11-12
      2      2013-01-24
      2      2012-05-04
      2      2012-02-24
      3      2012-05-30
      3      2012-02-15
      4      2012-03-13
      4      2012-05-18
"))

data.table [10 x 2]

id	date
1	2012-09-25
1	2012-03-26
1	2012-11-12
2	2013-01-24
2	2012-05-04
2	2012-02-24
3	2012-05-30
3	2012-02-15
4	2012-03-13
4	2012-05-18

Tidyverse:

Using coalesce:

left_join(
  dt1, 
  dt2 |> group_by(id) |> summarize(date = min(date)), 
  by = "id"
) |> mutate(date = coalesce(z, date), z = NULL)

data.table [10 x 4]

id	x	y	date
1	A	1	2012-03-26
2	C	3	2012-02-24
3	C	3	2012-02-15
4	C	2	2012-03-13
5	B	2	2019-08-04
6	C	1	2019-09-18
7	B	3	2019-12-17
8	A	2	2019-11-02
9	A	3	2020-03-16
10	A	1	2020-01-31

Using the rows_* functions:

dplyr::rows_patch(
  dt1 |> rename(date = z), 
  dt2 |> group_by(id) |> summarize(date = min(date)), 
  by = "id"
)

data.table [10 x 4]

id	x	y	date
1	A	1	2012-03-26
2	C	3	2012-02-24
3	C	3	2012-02-15
4	C	2	2012-03-13
5	B	2	2019-08-04
6	C	1	2019-09-18
7	B	3	2019-12-17
8	A	2	2019-11-02
9	A	3	2020-03-16
10	A	1	2020-01-31

data.table:

As a right join:

copy(dt2)[, .(date = min(date)), by = id
  ][dt1, on = "id"][, `:=`(date = fcoalesce(date, z), z = NULL)][]

data.table [10 x 4]

id	date	x	y
1	2012-03-26	A	1
2	2012-02-24	C	3
3	2012-02-15	C	3
4	2012-03-13	C	2
5	2019-08-04	B	2
6	2019-09-18	C	1
7	2019-12-17	B	3
8	2019-11-02	A	2
9	2020-03-16	A	3
10	2020-01-31	A	1

As a left join:

copy(dt1)[dt2[, .(date = min(date)), by = id], c("id", "date") := .(i.id, i.date), on = "id"
  ][, `:=`(date = fcoalesce(date, z), z = NULL)][]

data.table [10 x 4]

id	x	y	date
1	A	1	2012-03-26
2	C	3	2012-02-24
3	C	3	2012-02-15
4	C	2	2012-03-13
5	B	2	2019-08-04
6	C	1	2019-09-18
7	B	3	2019-12-17
8	A	2	2019-11-02
9	A	3	2020-03-16
10	A	1	2020-01-31
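
A more compact variant of the update join, sketched on hypothetical toy data (main, extra, earliest and patched are illustrative names): the coalescing is done directly inside the join, so no temporary date column has to be created and removed.

```r
library(data.table)

# Hypothetical toy data, for illustration only
main  <- data.table(id = 1:3, z = as.IDate(c(NA, "2020-01-31", NA)))
extra <- data.table(
  id   = c(1L, 1L, 3L),
  date = as.IDate(c("2012-09-25", "2012-03-26", "2012-05-30"))
)

# Earliest date per id in the second table
earliest <- extra[, .(date = min(date)), by = id]

# Update join on a copy: keep z where present, else take the matched date
patched <- copy(main)[earliest, on = "id", z := fcoalesce(z, i.date)]
```

Unlike the examples above, this sketch keeps the column name z instead of renaming it to date.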

6.8 Join on multiple columns (partial matching):

Task: Join two tables on matching IDs, where the IDs of one table are split across multiple columns (id1 & id2).

(dt1 <- data.table(id = c("ABC", "AAA", "CBC"), x = 1:3))

data.table [3 x 2]

id	x
ABC	1
AAA	2
CBC	3

(dt2 <- data.table(
  id1 = c("ABC", "AA", "CB"), 
  id2 = c("AB", "AAA", "CBC"), 
  y = c(0.307, 0.144, 0.786))
)

data.table [3 x 3]

id1	id2	y
ABC	AB	0.307
AA	AAA	0.144
CB	CBC	0.786

Solution 1:

Combine the two ID columns into one with pivot_longer, then join:

dt2 |> pivot_longer(matches("^id"), names_to = NULL, values_to = "id") |> right_join(dt1)

data.frame [3 x 3]

y	id	x
0.307	ABC	1
0.144	AAA	2
0.786	CBC	3

melt(dt2, measure.vars = patterns("^id"), value.name = "id")[, variable := NULL][dt1, on = "id"]

data.table [3 x 3]

y	id	x
0.307	ABC	1
0.144	AAA	2
0.786	CBC	3

Solution 2:

Combine the two ID columns into one with unite + separate_rows, then join:

(From @TimTeaFan)

dt2 |> unite("id", id1, id2, sep = "_") |> separate_rows("id") |> right_join(dt1)

data.frame [3 x 3]

id	y	x
ABC	0.307	1
AAA	0.144	2
CBC	0.786	3

copy(dt2)[, id := paste(id1, id2, sep = "_")
        ][, c(V1 = strsplit(id, "_", fixed = TRUE), .SD), by = id
        ][, `:=`(id = V1, V1 = NULL, id1 = NULL, id2 = NULL)
        ][dt1, on = "id"]

data.table [3 x 3]

id	y	x
ABC	0.307	1
AAA	0.144	2
CBC	0.786	3

Solution 3:

Join on one of the two columns (id2 here), and then fill in (patch) the missing values:

left_join(dt2, dt1, by = c("id2" = "id")) |> 
  rows_patch(rename(dt1, id1 = id), unmatched = "ignore")

data.table [3 x 4]

id1	id2	y	x
ABC	AB	0.307	1
AA	AAA	0.144	2
CB	CBC	0.786	3
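
Solution 3 is only shown with dplyr above. One possible data.table counterpart, as a sketch: two successive update joins, the first matching on id2 and the second patching the remaining gaps from id1. The result name res is hypothetical, and this is one way to express the idea and not necessarily the most idiomatic one.

```r
library(data.table)

dt1 <- data.table(id = c("ABC", "AAA", "CBC"), x = 1:3)
dt2 <- data.table(
  id1 = c("ABC", "AA", "CB"),
  id2 = c("AB", "AAA", "CBC"),
  y   = c(0.307, 0.144, 0.786)
)

res <- copy(dt2)

# First pass: match dt1$id against id2 (unmatched rows get NA in x)
res[dt1, on = c(id2 = "id"), x := i.x]

# Second pass: match against id1 and only fill the values still missing
res[dt1, on = c(id1 = "id"), x := fcoalesce(x, i.x)]
```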

6.9 Merging rows across multiple columns (every X rows):

Data:

(BANK <- data.table(
    date = c("30 feb", "NA", "NA", "NA", "31 feb", "NA", "NA", "NA"), 
    description = c("Mary", "had a", "little", "lamb", "Twinkle", "twinkle", "little", "star"), 
    withdrawal = c("100", "NA", "NA", "NA", "NA", "NA", "NA", "NA"), 
    deposit = c("NA", "NA", "NA", "NA", "100", "NA", "NA", "NA")
  )[, lapply(.SD, \(c) utils::type.convert(c, as.is = T))]
)

data.table [8 x 4]

date	description	withdrawal	deposit
30 feb	Mary	100	NA
NA	had a	NA	NA
NA	little	NA	NA
NA	lamb	NA	NA
31 feb	Twinkle	NA	100
NA	twinkle	NA	NA
NA	little	NA	NA
NA	star	NA	NA

merge_and_convert <- function(v) {
  utils::type.convert(v, as.is = T) |> na.omit() |> 
    paste(collapse = " ") |> utils::type.convert(as.is = T) |> 
    bind(x, ifelse(is.logical(x), as.integer(x), x))
}

Tidyverse:

Solution 1:

mutate(BANK, ID = ceiling(row_number()/4)) |> 
  group_by(ID) |> 
  summarize(across(everything(), \(m) merge_and_convert(m)))

data.frame [2 x 5]

ID	date	description	withdrawal	deposit
1	30 feb	Mary had a little lamb	100	NA
2	31 feb	Twinkle twinkle little star	NA	100

Solution 2:

summarize(BANK, across(
  everything(), 
  \(c) sapply(split(c, ceiling(seq_along(c)/4)), \(m) merge_and_convert(m))
))

data.frame [2 x 4]

date	description	withdrawal	deposit
30 feb	Mary had a little lamb	100	NA
31 feb	Twinkle twinkle little star	NA	100

data.table:

BANK[, lapply(.SD, \(c) sapply(split(c, ceiling(seq_along(c)/4)), \(m) merge_and_convert(m)))]

data.table [2 x 4]

date	description	withdrawal	deposit
30 feb	Mary had a little lamb	100	NA
31 feb	Twinkle twinkle little star	NA	100

copy(BANK)[, ID := ceiling(.I/4)][, lapply(.SD, \(m) merge_and_convert(m)), by = ID][]

data.table [2 x 5]

ID	date	description	withdrawal	deposit
1	30 feb	Mary had a little lamb	100	NA
2	31 feb	Twinkle twinkle little star	NA	100
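
All the solutions above rest on the same trick: dividing the row index by the block size and rounding up gives a group ID that changes every X rows. A minimal sketch on a hypothetical one-column table (rows_dt, block_size and merged are illustrative names):

```r
library(data.table)

# Hypothetical toy data, for illustration only
rows_dt <- data.table(
  txt = c("Mary", "had a", "little", "lamb", "Twinkle", "twinkle", "little", "star")
)
block_size <- 4L

# Rows 1-4 get block 1, rows 5-8 get block 2, and so on
merged <- rows_dt[
  , .(txt = paste(txt, collapse = " ")),
  by = .(block = ceiling(seq_len(nrow(rows_dt)) / block_size))
]
```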

6.10 Tagging successive events:

Tagging repeated blocks of events (aka run length encoding):

(DAT <- data.table(event = c(
  rep("A", 3),
  rep("B", 5),
  rep("C", 2),
  rep("B", 2),
  rep("A", 3)
)))

data.table [15 x 1]

event
A
A
A
B
B
B
B
B
C
C
B
B
A
A
A

DAT |> mutate(ID = with(rle(event), rep(seq_along(lengths), lengths)))

data.table [15 x 2]

event	ID
A	1
A	1
A	1
B	2
B	2
B	2
B	2
B	2
C	3
C	3
B	4
B	4
A	5
A	5
A	5

DAT |> mutate(ID = c(0, cumsum(diff(as.integer(factor(event))) != 0)) + 1)

data.table [15 x 2]

event	ID
A	1
A	1
A	1
B	2
B	2
B	2
B	2
B	2
C	3
C	3
B	4
B	4
A	5
A	5
A	5

Using data.table’s rleid() function:

DAT |> mutate(ID = data.table::rleid(event))

data.table [15 x 2]

event	ID
A	1
A	1
A	1
B	2
B	2
B	2
B	2
B	2
C	3
C	3
B	4
B	4
A	5
A	5
A	5

copy(DAT)[, ID := rleid(event)][]

data.table [15 x 2]

event	ID
A	1
A	1
A	1
B	2
B	2
B	2
B	2
B	2
C	3
C	3
B	4
B	4
A	5
A	5
A	5
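
Once the runs are tagged, the ID can be used as a grouping variable, for instance to get the length of each run. A short sketch with hypothetical toy data (ev and runs are illustrative names):

```r
library(data.table)

# Hypothetical toy data, for illustration only
ev <- data.table(event = c("A", "A", "B", "B", "B", "A"))

# Tag each block of consecutive identical events
ev[, run := rleid(event)]

# One row per run: which event it was and how long it lasted
runs <- ev[, .(event = event[1], n = .N), by = run]
```

The last "A" forms its own run (run 3) because rleid() only groups consecutive repeats.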

7 Miscellaneous:

7.1 Keywords:

.SD
.I, .N
.GRP, .NGRP
.BY
.EACHI
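
A small sketch showing most of these symbols at work in one grouped call, on hypothetical toy data (dt and out are illustrative names). It assumes a data.table version that provides .NGRP (1.13.0 or later, to the best of my knowledge); .EACHI is only meaningful in joins and is illustrated in the join examples earlier.

```r
library(data.table)

# Hypothetical toy data, for illustration only
dt <- data.table(g = c("a", "a", "b"), v = c(1, 2, 3))

out <- dt[, .(
  n         = .N,                         # number of rows in the current group
  grp       = .GRP,                       # index of the current group
  n_groups  = .NGRP,                      # total number of groups
  first_row = .I[1],                      # row number (in dt) of the group's first row
  total     = sum(.SD$v),                 # .SD: the group's data, minus grouping columns
  label     = paste0("group-", .BY$g)     # .BY: the current grouping value(s)
), by = g]
```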

7.2 Useful functions:

fsetdiff, fintersect, funion and fsetequal (set operations on the rows of data.tables, rather than on vectors)

nafill, fcoalesce

as.IDate
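
A quick sketch of these functions on hypothetical toy inputs (all object names below are illustrative):

```r
library(data.table)

# Hypothetical toy data, for illustration only
a <- data.table(id = 1:3)
b <- data.table(id = 2:4)

common <- fintersect(a, b)        # rows present in both tables
only_a <- fsetdiff(a, b)          # rows of a that are absent from b
either <- funion(a, b)            # distinct rows found in a or b
same   <- fsetequal(a, copy(a))   # do both tables hold the same rows?

filled_locf  <- nafill(c(1, NA, 3), type = "locf")   # carry the last value forward
filled_const <- nafill(c(NA, 2), fill = 0)           # replace NA with a constant
first_non_na <- fcoalesce(c(NA, 2L), c(1L, 5L))      # first non-missing value, element-wise

d <- as.IDate("2022-05-19")       # integer-backed date class
```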

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

ID	A	B	C	D
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

ID	C	D	A	B
E	5	16	4	13
F	2	13	2	18
G	3	11	2	20
H	1	19	4	17
I	3	12	5	15
J	1	14	2	10
K	2	17	NA	NA
L	2	20	NA	NA
M	3	10	NA	NA
N	4	18	NA	NA

ID	C	D	A	B
E	5	16	4	13
F	2	13	2	18
G	3	11	2	20
H	1	19	4	17
I	3	12	5	15
J	1	14	2	10
K	2	17	NA	NA
L	2	20	NA	NA
M	3	10	NA	NA
N	4	18	NA	NA

ID	C	D	A	B
E	5	16	4	13
F	2	13	2	18
G	3	11	2	20
H	1	19	4	17
I	3	12	5	15
J	1	14	2	10
A	NA	NA	1	16
B	NA	NA	5	12
C	NA	NA	3	11
D	NA	NA	1	19

ID	C	D	A	B
A	NA	NA	1	16
B	NA	NA	5	12
C	NA	NA	3	11
D	NA	NA	1	19
E	5	16	4	13
F	2	13	2	18
G	3	11	2	20
H	1	19	4	17
I	3	12	5	15
J	1	14	2	10

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18

ID	A	B	C	D
A	1	16	NA	NA
B	5	12	NA	NA
C	3	11	NA	NA
D	1	19	NA	NA
E	4	13	5	16
F	2	18	2	13
G	2	20	3	11
H	4	17	1	19
I	5	15	3	12
J	2	10	1	14
K	NA	NA	2	17
L	NA	NA	2	20
M	NA	NA	3	10
N	NA	NA	4	18
