Using data.table in R

Author

David McCabe

Published

December 9, 2025

Introduction to data.table

This vignette is a personal/custom edit of the official introduction to data table vignette. More vignettes are found in the documentation available through the ??data.table command.

The general syntax is DT[i,j,by] which translates to;

  • take DT

  • subset rows using i

  • calculate j

  • grouped by by


WARNING: data.table uses a mixture of value and reference semantics which can lead to unexpected or undefined behaviour. Exercised caution when using features marked

Pitfall Fix
:= always modifies data.table in place Use copy() if you want a separate object
Subsetting [] sometimes creates copies Use setDT() to enforce reference behavior
Chained := can unexpectedly subset Always separate modification and filtering
Modifying data.table inside a list affects original Use copy() before modifying
Factor levels persist after subsetting Use droplevels()

0. Loading the data.table package

library(data.table)

1. Creating data.table objects

1.1. Create from .csv import

flights<-fread("flights14.csv")
flights
         year month   day dep_delay arr_delay carrier origin   dest air_time
        <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
     1:  2014     1     1        14        13      AA    JFK    LAX      359
     2:  2014     1     1        -3        13      AA    JFK    LAX      363
     3:  2014     1     1         2         9      AA    JFK    LAX      351
     4:  2014     1     1        -8       -26      AA    LGA    PBI      157
     5:  2014     1     1         2         1      AA    JFK    LAX      350
    ---                                                                     
253312:  2014    10    31         1       -30      UA    LGA    IAH      201
253313:  2014    10    31        -5       -14      UA    EWR    IAH      189
253314:  2014    10    31        -8        16      MQ    LGA    RDU       83
253315:  2014    10    31        -4        15      MQ    LGA    DTW       75
253316:  2014    10    31        -5         1      MQ    LGA    SDF      110
        distance  hour
           <int> <int>
     1:     2475     9
     2:     2475    11
     3:     2475    19
     4:     1035     7
     5:     2475    13
    ---               
253312:     1416    14
253313:     1400     8
253314:      431    11
253315:      502    11
253316:      659     8

1.2. Constructor call

DT = data.table(
  ID=c("b","b","b","a","a","c"),
  a=1:6,
  b=7:12,
  c=13:18
)
DT
       ID     a     b     c
   <char> <int> <int> <int>
1:      b     1     7    13
2:      b     2     8    14
3:      b     3     9    15
4:      a     4    10    16
5:      a     5    11    17
6:      c     6    12    18

2. Subsetting (& Adding)

Subsetting with [] behaviour varies

Subsetting with [] Optimises to Create a Copy or a Reference, unlike data.frame, where operations always create a copy, data.table can modify the original object directly. The behavior of this code is undefined…
dt1 <- data.table(a = 1:3, b = 4:6)
dt2 <- dt1[a > 1]  # This creates a copy or a reference
dt2[, b := 10]  # Only modifies `dt2`
print(dt1)  # `b` is unchanged in dt1
Use setDT() if you need to force a reference or copy() for a value.
setDT(dt2)  # Now dt2 is a reference, modifying dt1 too!

\(\:\)

Never subset rows and columns together!!!

data.table can modify and subset in one step, but this can sometimes cause unexpected data loss.
dt <- data.table(a = c(1, 2, 3), b = c(4, 5, 6))
dt <- dt[a != 1, b := 100]  # `dt` is unexpectedly subsetted!
print(dt)  # The row `a == 1` is missing
Fix: Always separate := and filtering:
dt[, b := 100, by = a != 1]  # Modify but keep all rows
dt <- dt[a != 1]  # Now remove rows safely

\(\:\)

2.1. Subsetting rows

flights[origin=="JFK" & month ==6L]
       year month   day dep_delay arr_delay carrier origin   dest air_time
      <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
   1:  2014     6     1        -9        -5      AA    JFK    LAX      324
   2:  2014     6     1       -10       -13      AA    JFK    LAX      329
   3:  2014     6     1        18        -1      AA    JFK    LAX      326
   4:  2014     6     1        -6       -16      AA    JFK    LAX      320
   5:  2014     6     1        -4       -45      AA    JFK    LAX      326
  ---                                                                     
8418:  2014     6    30        -3        -6      MQ    JFK    PIT       62
8419:  2014     6    30        -5       -32      MQ    JFK    RDU       65
8420:  2014     6    30        -3       -16      MQ    JFK    DCA       39
8421:  2014     6    30        -2         7      MQ    JFK    DCA       52
8422:  2014     6    30        -7       -18      MQ    JFK    RDU       67
      distance  hour
         <int> <int>
   1:     2475     8
   2:     2475    12
   3:     2475     7
   4:     2475    10
   5:     2475    18
  ---               
8418:      340    14
8419:      427    14
8420:      213    17
8421:      213     7
8422:      427     8
flights[1:3]
    year month   day dep_delay arr_delay carrier origin   dest air_time
   <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
1:  2014     1     1        14        13      AA    JFK    LAX      359
2:  2014     1     1        -3        13      AA    JFK    LAX      363
3:  2014     1     1         2         9      AA    JFK    LAX      351
   distance  hour
      <int> <int>
1:     2475     9
2:     2475    11
3:     2475    19
flights[-(1:253312)]
    year month   day dep_delay arr_delay carrier origin   dest air_time
   <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
1:  2014    10    31        -5       -14      UA    EWR    IAH      189
2:  2014    10    31        -8        16      MQ    LGA    RDU       83
3:  2014    10    31        -4        15      MQ    LGA    DTW       75
4:  2014    10    31        -5         1      MQ    LGA    SDF      110
   distance  hour
      <int> <int>
1:     1400     8
2:      431    11
3:      502    11
4:      659     8

2.2. Subsetting columns

subset and return array

flights[,arr_delay][1:8]
[1]  13  13   9 -26   1   0 -18 -14

return a data.table

flights[,.(arr_delay)]
        arr_delay
            <int>
     1:        13
     2:        13
     3:         9
     4:       -26
     5:         1
    ---          
253312:       -30
253313:       -14
253314:        16
253315:        15
253316:         1

rekey columns

flights[,.(arrival.delay = arr_delay, departure.delay = dep_delay)]
        arrival.delay departure.delay
                <int>           <int>
     1:            13              14
     2:            13              -3
     3:             9               2
     4:           -26              -8
     5:             1               2
    ---                              
253312:           -30               1
253313:           -14              -5
253314:            16              -8
253315:            15              -4
253316:             1              -5

2.3 Subsetting using match/%in% and &

let’s get a set of possible origin airports…

unique(as.factor(flights$origin))
[1] JFK LGA EWR
Levels: EWR JFK LGA

let’s get a set of possible destination airports…

unique(as.factor(flights$dest))
  [1] LAX PBI MIA SEA SFO BOS ORD IAH AUS DFW STT SJU LAS MCO EGE CHS FLL RSW
 [19] BTV BUF IAD PWM SYR RDU JAX HOU DEN ROC SAN BQN LGB PSE TPA PHX SMF SRQ
 [37] SJC SLC ABQ OAK BUR PDX MSY CLT ATL DTW MSP DAY MHT DCA SDF CLE STL MEM
 [55] BNA GSP MSN OMA IND RIC PVD AVL MCI OKC MKE TUL PIT SAV BWI GSO GRR DSM
 [73] CVG ORF BHM BGR CAK HNL CMH TYS MDW XNA SAT SNA PHL EYW ALB ILM JAC CAE
 [91] PSP MTJ HDN BZN BDL MDT CHO MYR AVP ROA LIT AGS ACK MVY ANC TVC HYA SBN
[109] DAL
109 Levels: ABQ ACK AGS ALB ANC ATL AUS AVL AVP BDL BGR BHM BNA BOS BQN ... XNA

let’s get all the flights with origin “JFK” and dest “MEM, BNA, GSP, MSN”

flights[origin=="JFK" & dest%in%c("MEM", "BNA", "GSP", "MSN"),]
      year month   day dep_delay arr_delay carrier origin   dest air_time
     <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
  1:  2014     1     1        -2       -10      MQ    JFK    BNA      132
  2:  2014     1     2         9       113      MQ    JFK    BNA      141
  3:  2014     1     4       172       172      MQ    JFK    BNA      124
  4:  2014     1     5        98       175      MQ    JFK    BNA      138
  5:  2014     1     6        13        53      MQ    JFK    BNA      140
 ---                                                                     
280:  2014    10    27        -6       -30      MQ    JFK    BNA      113
281:  2014    10    28        -6       -22      MQ    JFK    BNA      123
282:  2014    10    29        -6       -20      MQ    JFK    BNA      121
283:  2014    10    30        -4       -15      MQ    JFK    BNA      116
284:  2014    10    31        -8       -22      MQ    JFK    BNA      115
     distance  hour
        <int> <int>
  1:      765    17
  2:      765    17
  3:      765    20
  4:      765    18
  5:      765    17
 ---               
280:      765    16
281:      765    16
282:      765    16
283:      765    16
284:      765    16

2.3. Adding Columns :=

:= Modifies a data.table In-Place (Even Without Assignment)

Unlike data.frame, where operations create a copy, data.table always modifies the object directly.

dt <- data.table(x = 1:5, y = 6:10)
dt[, y := y * 2]  # Modifies `dt` directly
print(dt)  # y is now doubled
If you need a copy, use copy(dt).
dt <- data.table(x = 1:5, y = 6:10)
dt[, y := y * 2]  # Modifies `dt` directly
print(dt)  # y is now doubled

\(\:\)

2.3.1. Adding Single Columns :=

flights2<-flights[,overall.delay := arr_delay + dep_delay]
head(flights2)
    year month   day dep_delay arr_delay carrier origin   dest air_time
   <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
1:  2014     1     1        14        13      AA    JFK    LAX      359
2:  2014     1     1        -3        13      AA    JFK    LAX      363
3:  2014     1     1         2         9      AA    JFK    LAX      351
4:  2014     1     1        -8       -26      AA    LGA    PBI      157
5:  2014     1     1         2         1      AA    JFK    LAX      350
6:  2014     1     1         4         0      AA    EWR    LAX      339
   distance  hour overall.delay
      <int> <int>         <int>
1:     2475     9            27
2:     2475    11            10
3:     2475    19            11
4:     1035     7           -34
5:     2475    13             3
6:     2454    18             4

2.3.2. Adding Multiple Columns `:=`

d<-data.table(index=1:10)
d[,`:=`(
     x = rnorm(10, mean = 50, sd = 50),
     y = rnorm(10, mean = 50, sd = 50),
     z = rnorm(10, mean = 50, sd = 50)
)]
d
    index         x          y          z
    <int>     <num>      <num>      <num>
 1:     1  90.25226 147.490656  40.676605
 2:     2 -59.23579  55.511816   1.638272
 3:     3  16.81574 102.171900 133.085506
 4:     4  17.29510  74.200344 133.933189
 5:     5 117.48605  65.368615  93.667348
 6:     6  30.48328 114.181640  66.191825
 7:     7  76.98604  57.422301 -15.310592
 8:     8  55.59638 104.440345 -34.824527
 9:     9 113.66572  29.093106  62.462069
10:    10  71.30864   4.507574 -39.067807

3. Sorting Rows

ascending order sort

order(c(2,4,6,12,10,8))
[1] 1 2 3 6 5 4

ascending order sort with tie breaking

order(c(1,2,3,3,2,1),c(2,2,2,1,1,1))
[1] 6 1 5 2 4 3

data.table’s optimised order functions just the same…

flights[order(origin, -dest)]
         year month   day dep_delay arr_delay carrier origin   dest air_time
        <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
     1:  2014     1     5         6        49      EV    EWR    XNA      195
     2:  2014     1     6         7        13      EV    EWR    XNA      190
     3:  2014     1     7        -6       -13      EV    EWR    XNA      179
     4:  2014     1     8        -7       -12      EV    EWR    XNA      184
     5:  2014     1     9        16         7      EV    EWR    XNA      181
    ---                                                                     
253312:  2014    10    31        -1       -22      WN    LGA    ATL      112
253313:  2014    10    31        -5       -23      WN    LGA    ATL      112
253314:  2014     4     6        -6        -1      EV    LGA    AGS      110
253315:  2014     4     7         2         1      EV    LGA    AGS      111
253316:  2014     4    11         0       -19      EV    LGA    AGS      102
        distance  hour overall.delay
           <int> <int>         <int>
     1:     1131     8            55
     2:     1131     8            20
     3:     1131     8           -19
     4:     1131     8           -19
     5:     1131     8            23
    ---                             
253312:      762     9           -23
253313:      762    20           -28
253314:      678    10            -7
253315:      678    11             3
253316:      678    10           -19

4. Reductions

Reductions are useful in their own right, they become powerful later when we use by

flights[origin=="JFK" & month ==6L,.(
  average.delay=mean(arr_delay),
  total.delay=sum(arr_delay)
)]
   average.delay total.delay
           <num>       <int>
1:      5.839349       49179
flights[origin=="JFK" & month ==6L, .N]
[1] 8422

5. Aggregations

5.1. Grouping

get total no. American Airlines flights from each airport

flights[carrier=="AA", .N, by=origin]
   origin     N
   <char> <int>
1:    JFK 11923
2:    LGA 11730
3:    EWR  2649

get total no. American Airlines flights from each route (route is all existing permutations of origin and destination).

flights[carrier=="AA", .N, by=.(origin,dest)]
    origin   dest     N
    <char> <char> <int>
 1:    JFK    LAX  3387
 2:    LGA    PBI   245
 3:    EWR    LAX    62
 4:    JFK    MIA  1876
 5:    JFK    SEA   298
 6:    EWR    MIA   848
 7:    JFK    SFO  1312
 8:    JFK    BOS  1173
 9:    JFK    ORD   432
10:    JFK    IAH     7
11:    JFK    AUS   297
12:    EWR    DFW  1618
13:    LGA    ORD  4366
14:    JFK    STT   229
15:    JFK    SJU   690
16:    LGA    MIA  3334
17:    LGA    DFW  3785
18:    JFK    LAS   595
19:    JFK    MCO   597
20:    JFK    EGE    85
21:    JFK    DFW   474
22:    JFK    SAN   299
23:    JFK    DCA   172
24:    EWR    PHX   121
    origin   dest     N

automatically order columns with keyby

flights[carrier=="AA", .N, keyby=.(origin,dest)]
Key: <origin, dest>
    origin   dest     N
    <char> <char> <int>
 1:    EWR    DFW  1618
 2:    EWR    LAX    62
 3:    EWR    MIA   848
 4:    EWR    PHX   121
 5:    JFK    AUS   297
 6:    JFK    BOS  1173
 7:    JFK    DCA   172
 8:    JFK    DFW   474
 9:    JFK    EGE    85
10:    JFK    IAH     7
11:    JFK    LAS   595
12:    JFK    LAX  3387
13:    JFK    MCO   597
14:    JFK    MIA  1876
15:    JFK    ORD   432
16:    JFK    SAN   299
17:    JFK    SEA   298
18:    JFK    SFO  1312
19:    JFK    SJU   690
20:    JFK    STT   229
21:    LGA    DFW  3785
22:    LGA    MIA  3334
23:    LGA    ORD  4366
24:    LGA    PBI   245
    origin   dest     N

computation…

flights[, .N, by=.(departure.delay=dep_delay>0,arrival.delay=arr_delay>0)]
   departure.delay arrival.delay      N
            <lgcl>        <lgcl>  <int>
1:            TRUE          TRUE  72836
2:           FALSE          TRUE  34583
3:           FALSE         FALSE 119304
4:            TRUE         FALSE  26593

6. Chaining

flights[carrier=="AA", .N, by =.(origin, dest)][order(origin,dest)]
    origin   dest     N
    <char> <char> <int>
 1:    EWR    DFW  1618
 2:    EWR    LAX    62
 3:    EWR    MIA   848
 4:    EWR    PHX   121
 5:    JFK    AUS   297
 6:    JFK    BOS  1173
 7:    JFK    DCA   172
 8:    JFK    DFW   474
 9:    JFK    EGE    85
10:    JFK    IAH     7
11:    JFK    LAS   595
12:    JFK    LAX  3387
13:    JFK    MCO   597
14:    JFK    MIA  1876
15:    JFK    ORD   432
16:    JFK    SAN   299
17:    JFK    SEA   298
18:    JFK    SFO  1312
19:    JFK    SJU   690
20:    JFK    STT   229
21:    LGA    DFW  3785
22:    LGA    MIA  3334
23:    LGA    ORD  4366
24:    LGA    PBI   245
    origin   dest     N

7. Subset Data

In this section I’ll use the dollowing data…

dt<-flights[1:20,.(carrier,dest,air_time,distance,origin)]

7.1. .N reminder

.N is a special variable that holds the number of rows in the current group defined by by

dt[,.N,by=.(origin)]
   origin     N
   <char> <int>
1:    JFK    15
2:    LGA     2
3:    EWR     3

7.2. Subset of data .SD

…likewise .SD (Subset of Data) holds the data for the current group defined using by i.e. \(.N=\sum\#(.SD)\)

dt[, print(.SD), by = .(origin)]
    carrier   dest air_time distance
     <char> <char>    <int>    <int>
 1:      AA    LAX      359     2475
 2:      AA    LAX      363     2475
 3:      AA    LAX      351     2475
 4:      AA    LAX      350     2475
 5:      AA    LAX      338     2475
 6:      AA    LAX      356     2475
 7:      AA    MIA      161     1089
 8:      AA    SEA      349     2422
 9:      AA    SFO      365     2586
10:      AA    BOS       39      187
11:      AA    LAX      345     2475
12:      AA    BOS       35      187
13:      AA    ORD      155      740
14:      AA    IAH      234     1417
15:      AA    AUS      232     1521
   carrier   dest air_time distance
    <char> <char>    <int>    <int>
1:      AA    PBI      157     1035
2:      AA    ORD      142      733
   carrier   dest air_time distance
    <char> <char>    <int>    <int>
1:      AA    LAX      339     2454
2:      AA    MIA      161     1085
3:      AA    DFW      214     1372
Empty data.table (0 rows and 1 cols): origin

7.3. Aggregation of subset data

the lapply function allows aggregation of subset of data .SD. Notice how somecolumns are dropped from the subset of data by specifying the subset .SDcols of columns in the subset data.

dt[, lapply(.SD, mean), by = .(origin), .SDcols = c("air_time","distance")]
   origin air_time distance
   <char>    <num>    <num>
1:    JFK    268.8   1831.6
2:    LGA    149.5    884.0
3:    EWR    238.0   1637.0

8. Type Conversion

8.1. Character columns to factors

I’ll create a data.table with character columns…

dt <- as.data.table(as.data.frame(HairEyeColor, stringsAsFactors = FALSE))
head(dt)
     Hair    Eye    Sex  Freq
   <char> <char> <char> <num>
1:  Black  Brown   Male    32
2:  Brown  Brown   Male    53
3:    Red  Brown   Male    10
4:  Blond  Brown   Male     3
5:  Black   Blue   Male    11
6:  Brown   Blue   Male    50

The standard way to convert the character columns into factors is to:

  • use the reassignment operator := to update columns by reference
  • apply lapply() to the subset of data defined by .SD (Subset of Data)
  • specify the relevant columns using the .SDcols argument
  • .SDcols to subset the original data.table data
dt[, (c("Hair", "Eye")) := lapply(.SD, as.factor), .SDcols = c("Hair", "Eye")]
head(dt)
     Hair    Eye    Sex  Freq
   <fctr> <fctr> <char> <num>
1:  Black  Brown   Male    32
2:  Brown  Brown   Male    53
3:    Red  Brown   Male    10
4:  Blond  Brown   Male     3
5:  Black   Blue   Male    11
6:  Brown   Blue   Male    50

9. Ordinal Indexing

we can add an uid indexing column using the .I symbol. This can be useful for processing and plotting…

dt <- as.data.table(as.data.frame(HairEyeColor))
dt[, Subject_ID := .I]
head(dt)
     Hair    Eye    Sex  Freq Subject_ID
   <fctr> <fctr> <fctr> <num>      <int>
1:  Black  Brown   Male    32          1
2:  Brown  Brown   Male    53          2
3:    Red  Brown   Male    10          3
4:  Blond  Brown   Male     3          4
5:  Black   Blue   Male    11          5
6:  Brown   Blue   Male    50          6

10. Melt/dcast(Reshape)

When you need to reshape wide data into long format for:

  • Easier grouping, summarizing, or plotting (e.g., with ggplot2)
  • Tidy data format (1 variable per column, 1 observation per row)
  • Performing operations like mean scores by subject
# Create a simple data.table
dt <- data.table(
  id = 1:2,
  math = c(90, 80),
  science = c(85, 88)
)

10.1 Melt from wide to long

melted_dt <- melt(dt, id.vars = "id", variable.name = "subject", value.name = "score")
print(melted_dt)
      id subject score
   <int>  <fctr> <num>
1:     1    math    90
2:     2    math    80
3:     1 science    85
4:     2 science    88

10.2 dcast: back to wide format

wide_dt <- dcast(melted_dt, id ~ subject, value.var = "score")
print(wide_dt)
Key: <id>
      id  math science
   <int> <num>   <num>
1:     1    90      85
2:     2    80      88

DT: An R interface to the DataTables library

todo