library(data.table)
Using data.table in R
|
|
Introduction to data.table
This vignette is a personal/custom edit of the official introduction to data table vignette. More vignettes are found in the documentation available through the ??data.table
command.
The general syntax is DT[i,j,by]
which translates to;
take
DT
subset rows using
i
calculate
j
grouped by
by
”
WARNING: data.table
uses a mixture of value and reference semantics which can lead to unexpected or undefined behaviour. Exercised caution when using features marked
Pitfall | Fix |
---|---|
:= always modifies data.table in place |
Use copy() if you want a separate object |
Subsetting [] sometimes creates copies |
Use setDT() to enforce reference behavior |
Chained := can unexpectedly subset |
Always separate modification and filtering |
Modifying data.table inside a list affects original |
Use copy() before modifying |
Factor levels persist after subsetting | Use droplevels() |
0. Loading the data.table package
1. Creating data.table objects
1.1. Create from .csv import
<-fread("flights14.csv")
flights flights
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 1 1 14 13 AA JFK LAX 359
2: 2014 1 1 -3 13 AA JFK LAX 363
3: 2014 1 1 2 9 AA JFK LAX 351
4: 2014 1 1 -8 -26 AA LGA PBI 157
5: 2014 1 1 2 1 AA JFK LAX 350
---
253312: 2014 10 31 1 -30 UA LGA IAH 201
253313: 2014 10 31 -5 -14 UA EWR IAH 189
253314: 2014 10 31 -8 16 MQ LGA RDU 83
253315: 2014 10 31 -4 15 MQ LGA DTW 75
253316: 2014 10 31 -5 1 MQ LGA SDF 110
distance hour
<int> <int>
1: 2475 9
2: 2475 11
3: 2475 19
4: 1035 7
5: 2475 13
---
253312: 1416 14
253313: 1400 8
253314: 431 11
253315: 502 11
253316: 659 8
1.2. Constructor call
= data.table(
DT ID=c("b","b","b","a","a","c"),
a=1:6,
b=7:12,
c=13:18
) DT
ID a b c
<char> <int> <int> <int>
1: b 1 7 13
2: b 2 8 14
3: b 3 9 15
4: a 4 10 16
5: a 5 11 17
6: c 6 12 18
2. Subsetting (& Adding)
Subsetting with []
behaviour varies
Subsetting with []
Optimises to Create a Copy or a Reference, unlike data.frame, where operations always create a copy, data.table can modify the original object directly. The behavior of this code is undefined…
<- data.table(a = 1:3, b = 4:6)
dt1 <- dt1[a > 1] # This creates a copy or a reference
dt2 := 10] # Only modifies `dt2`
dt2[, b print(dt1) # `b` is unchanged in dt1
setDT()
if you need to force a reference or copy()
for a value.
setDT(dt2) # Now dt2 is a reference, modifying dt1 too!
\(\:\)
Never subset rows and columns together!!!
data.table can modify and subset in one step, but this can sometimes cause unexpected data loss.<- data.table(a = c(1, 2, 3), b = c(4, 5, 6))
dt <- dt[a != 1, b := 100] # `dt` is unexpectedly subsetted!
dt print(dt) # The row `a == 1` is missing
:=
and filtering:
:= 100, by = a != 1] # Modify but keep all rows
dt[, b <- dt[a != 1] # Now remove rows safely dt
\(\:\)
2.1. Subsetting rows
=="JFK" & month ==6L] flights[origin
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 6 1 -9 -5 AA JFK LAX 324
2: 2014 6 1 -10 -13 AA JFK LAX 329
3: 2014 6 1 18 -1 AA JFK LAX 326
4: 2014 6 1 -6 -16 AA JFK LAX 320
5: 2014 6 1 -4 -45 AA JFK LAX 326
---
8418: 2014 6 30 -3 -6 MQ JFK PIT 62
8419: 2014 6 30 -5 -32 MQ JFK RDU 65
8420: 2014 6 30 -3 -16 MQ JFK DCA 39
8421: 2014 6 30 -2 7 MQ JFK DCA 52
8422: 2014 6 30 -7 -18 MQ JFK RDU 67
distance hour
<int> <int>
1: 2475 8
2: 2475 12
3: 2475 7
4: 2475 10
5: 2475 18
---
8418: 340 14
8419: 427 14
8420: 213 17
8421: 213 7
8422: 427 8
1:3] flights[
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 1 1 14 13 AA JFK LAX 359
2: 2014 1 1 -3 13 AA JFK LAX 363
3: 2014 1 1 2 9 AA JFK LAX 351
distance hour
<int> <int>
1: 2475 9
2: 2475 11
3: 2475 19
-(1:253312)] flights[
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 10 31 -5 -14 UA EWR IAH 189
2: 2014 10 31 -8 16 MQ LGA RDU 83
3: 2014 10 31 -4 15 MQ LGA DTW 75
4: 2014 10 31 -5 1 MQ LGA SDF 110
distance hour
<int> <int>
1: 1400 8
2: 431 11
3: 502 11
4: 659 8
2.2. Subsetting columns
subset and return array
1:8] flights[,arr_delay][
[1] 13 13 9 -26 1 0 -18 -14
return a data.table
flights[,.(arr_delay)]
arr_delay
<int>
1: 13
2: 13
3: 9
4: -26
5: 1
---
253312: -30
253313: -14
253314: 16
253315: 15
253316: 1
rekey columns
arrival.delay = arr_delay, departure.delay = dep_delay)] flights[,.(
arrival.delay departure.delay
<int> <int>
1: 13 14
2: 13 -3
3: 9 2
4: -26 -8
5: 1 2
---
253312: -30 1
253313: -14 -5
253314: 16 -8
253315: 15 -4
253316: 1 -5
2.3 Subsetting using match
/%in%
and &
let’s get a set of possible origin airports…
unique(as.factor(flights$origin))
[1] JFK LGA EWR
Levels: EWR JFK LGA
let’s get a set of possible destination airports…
unique(as.factor(flights$dest))
[1] LAX PBI MIA SEA SFO BOS ORD IAH AUS DFW STT SJU LAS MCO EGE CHS FLL RSW
[19] BTV BUF IAD PWM SYR RDU JAX HOU DEN ROC SAN BQN LGB PSE TPA PHX SMF SRQ
[37] SJC SLC ABQ OAK BUR PDX MSY CLT ATL DTW MSP DAY MHT DCA SDF CLE STL MEM
[55] BNA GSP MSN OMA IND RIC PVD AVL MCI OKC MKE TUL PIT SAV BWI GSO GRR DSM
[73] CVG ORF BHM BGR CAK HNL CMH TYS MDW XNA SAT SNA PHL EYW ALB ILM JAC CAE
[91] PSP MTJ HDN BZN BDL MDT CHO MYR AVP ROA LIT AGS ACK MVY ANC TVC HYA SBN
[109] DAL
109 Levels: ABQ ACK AGS ALB ANC ATL AUS AVL AVP BDL BGR BHM BNA BOS BQN ... XNA
let’s get all the flights with origin “JFK” and dest “MEM, BNA, GSP, MSN”
=="JFK" & dest%in%c("MEM", "BNA", "GSP", "MSN"),] flights[origin
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 1 1 -2 -10 MQ JFK BNA 132
2: 2014 1 2 9 113 MQ JFK BNA 141
3: 2014 1 4 172 172 MQ JFK BNA 124
4: 2014 1 5 98 175 MQ JFK BNA 138
5: 2014 1 6 13 53 MQ JFK BNA 140
---
280: 2014 10 27 -6 -30 MQ JFK BNA 113
281: 2014 10 28 -6 -22 MQ JFK BNA 123
282: 2014 10 29 -6 -20 MQ JFK BNA 121
283: 2014 10 30 -4 -15 MQ JFK BNA 116
284: 2014 10 31 -8 -22 MQ JFK BNA 115
distance hour
<int> <int>
1: 765 17
2: 765 17
3: 765 20
4: 765 18
5: 765 17
---
280: 765 16
281: 765 16
282: 765 16
283: 765 16
284: 765 16
2.3. Adding Columns :=
:=
Modifies a data.table
In-Place (Even Without Assignment)
Unlike data.frame, where operations create a copy, data.table always modifies the object directly.
<- data.table(x = 1:5, y = 6:10)
dt := y * 2] # Modifies `dt` directly
dt[, y print(dt) # y is now doubled
copy(dt)
.
<- data.table(x = 1:5, y = 6:10)
dt := y * 2] # Modifies `dt` directly
dt[, y print(dt) # y is now doubled
\(\:\)
2.3.1. Adding Single Columns :=
<-flights[,overall.delay := arr_delay + dep_delay]
flights2head(flights2)
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 1 1 14 13 AA JFK LAX 359
2: 2014 1 1 -3 13 AA JFK LAX 363
3: 2014 1 1 2 9 AA JFK LAX 351
4: 2014 1 1 -8 -26 AA LGA PBI 157
5: 2014 1 1 2 1 AA JFK LAX 350
6: 2014 1 1 4 0 AA EWR LAX 339
distance hour overall.delay
<int> <int> <int>
1: 2475 9 27
2: 2475 11 10
3: 2475 19 11
4: 1035 7 -34
5: 2475 13 3
6: 2454 18 4
2.3.2. Adding Multiple Columns `:=`
<-data.table(index=1:10)
d`:=`(
d[,x = rnorm(10, mean = 50, sd = 50),
y = rnorm(10, mean = 50, sd = 50),
z = rnorm(10, mean = 50, sd = 50)
)] d
index x y z
<int> <num> <num> <num>
1: 1 90.25226 147.490656 40.676605
2: 2 -59.23579 55.511816 1.638272
3: 3 16.81574 102.171900 133.085506
4: 4 17.29510 74.200344 133.933189
5: 5 117.48605 65.368615 93.667348
6: 6 30.48328 114.181640 66.191825
7: 7 76.98604 57.422301 -15.310592
8: 8 55.59638 104.440345 -34.824527
9: 9 113.66572 29.093106 62.462069
10: 10 71.30864 4.507574 -39.067807
3. Sorting Rows
ascending order sort
order(c(2,4,6,12,10,8))
[1] 1 2 3 6 5 4
ascending order sort with tie breaking
order(c(1,2,3,3,2,1),c(2,2,2,1,1,1))
[1] 6 1 5 2 4 3
data.table’s optimised order functions just the same…
order(origin, -dest)] flights[
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 1 5 6 49 EV EWR XNA 195
2: 2014 1 6 7 13 EV EWR XNA 190
3: 2014 1 7 -6 -13 EV EWR XNA 179
4: 2014 1 8 -7 -12 EV EWR XNA 184
5: 2014 1 9 16 7 EV EWR XNA 181
---
253312: 2014 10 31 -1 -22 WN LGA ATL 112
253313: 2014 10 31 -5 -23 WN LGA ATL 112
253314: 2014 4 6 -6 -1 EV LGA AGS 110
253315: 2014 4 7 2 1 EV LGA AGS 111
253316: 2014 4 11 0 -19 EV LGA AGS 102
distance hour overall.delay
<int> <int> <int>
1: 1131 8 55
2: 1131 8 20
3: 1131 8 -19
4: 1131 8 -19
5: 1131 8 23
---
253312: 762 9 -23
253313: 762 20 -28
253314: 678 10 -7
253315: 678 11 3
253316: 678 10 -19
4. Reductions
Reductions are useful in their own right, they become powerful later when we use by
=="JFK" & month ==6L,.(
flights[originaverage.delay=mean(arr_delay),
total.delay=sum(arr_delay)
)]
average.delay total.delay
<num> <int>
1: 5.839349 49179
=="JFK" & month ==6L, .N] flights[origin
[1] 8422
5. Aggregations
5.1. Grouping
get total no. American Airlines flights from each airport
=="AA", .N, by=origin] flights[carrier
origin N
<char> <int>
1: JFK 11923
2: LGA 11730
3: EWR 2649
get total no. American Airlines flights from each route (route is all existing permutations of origin and destination).
=="AA", .N, by=.(origin,dest)] flights[carrier
origin dest N
<char> <char> <int>
1: JFK LAX 3387
2: LGA PBI 245
3: EWR LAX 62
4: JFK MIA 1876
5: JFK SEA 298
6: EWR MIA 848
7: JFK SFO 1312
8: JFK BOS 1173
9: JFK ORD 432
10: JFK IAH 7
11: JFK AUS 297
12: EWR DFW 1618
13: LGA ORD 4366
14: JFK STT 229
15: JFK SJU 690
16: LGA MIA 3334
17: LGA DFW 3785
18: JFK LAS 595
19: JFK MCO 597
20: JFK EGE 85
21: JFK DFW 474
22: JFK SAN 299
23: JFK DCA 172
24: EWR PHX 121
origin dest N
automatically order columns with keyby
…
=="AA", .N, keyby=.(origin,dest)] flights[carrier
Key: <origin, dest>
origin dest N
<char> <char> <int>
1: EWR DFW 1618
2: EWR LAX 62
3: EWR MIA 848
4: EWR PHX 121
5: JFK AUS 297
6: JFK BOS 1173
7: JFK DCA 172
8: JFK DFW 474
9: JFK EGE 85
10: JFK IAH 7
11: JFK LAS 595
12: JFK LAX 3387
13: JFK MCO 597
14: JFK MIA 1876
15: JFK ORD 432
16: JFK SAN 299
17: JFK SEA 298
18: JFK SFO 1312
19: JFK SJU 690
20: JFK STT 229
21: LGA DFW 3785
22: LGA MIA 3334
23: LGA ORD 4366
24: LGA PBI 245
origin dest N
computation…
=.(departure.delay=dep_delay>0,arrival.delay=arr_delay>0)] flights[, .N, by
departure.delay arrival.delay N
<lgcl> <lgcl> <int>
1: TRUE TRUE 72836
2: FALSE TRUE 34583
3: FALSE FALSE 119304
4: TRUE FALSE 26593
6. Chaining
=="AA", .N, by =.(origin, dest)][order(origin,dest)] flights[carrier
origin dest N
<char> <char> <int>
1: EWR DFW 1618
2: EWR LAX 62
3: EWR MIA 848
4: EWR PHX 121
5: JFK AUS 297
6: JFK BOS 1173
7: JFK DCA 172
8: JFK DFW 474
9: JFK EGE 85
10: JFK IAH 7
11: JFK LAS 595
12: JFK LAX 3387
13: JFK MCO 597
14: JFK MIA 1876
15: JFK ORD 432
16: JFK SAN 299
17: JFK SEA 298
18: JFK SFO 1312
19: JFK SJU 690
20: JFK STT 229
21: LGA DFW 3785
22: LGA MIA 3334
23: LGA ORD 4366
24: LGA PBI 245
origin dest N
7. Subset Data
In this section I’ll use the dollowing data…
<-flights[1:20,.(carrier,dest,air_time,distance,origin)] dt
7.1. .N
reminder
.N
is a special variable that holds the number of rows in the current group defined by by
=.(origin)] dt[,.N,by
origin N
<char> <int>
1: JFK 15
2: LGA 2
3: EWR 3
7.2. Subset of data .SD
…likewise .SD
(Subset of Data) holds the data for the current group defined using by
i.e. \(.N=\sum\#(.SD)\)
print(.SD), by = .(origin)] dt[,
carrier dest air_time distance
<char> <char> <int> <int>
1: AA LAX 359 2475
2: AA LAX 363 2475
3: AA LAX 351 2475
4: AA LAX 350 2475
5: AA LAX 338 2475
6: AA LAX 356 2475
7: AA MIA 161 1089
8: AA SEA 349 2422
9: AA SFO 365 2586
10: AA BOS 39 187
11: AA LAX 345 2475
12: AA BOS 35 187
13: AA ORD 155 740
14: AA IAH 234 1417
15: AA AUS 232 1521
carrier dest air_time distance
<char> <char> <int> <int>
1: AA PBI 157 1035
2: AA ORD 142 733
carrier dest air_time distance
<char> <char> <int> <int>
1: AA LAX 339 2454
2: AA MIA 161 1085
3: AA DFW 214 1372
Empty data.table (0 rows and 1 cols): origin
7.3. Aggregation of subset data
the lapply
function allows aggregation of subset of data .SD
. Notice how somecolumns are dropped from the subset of data by specifying the subset .SDcols
of columns in the subset data.
lapply(.SD, mean), by = .(origin), .SDcols = c("air_time","distance")] dt[,
origin air_time distance
<char> <num> <num>
1: JFK 268.8 1831.6
2: LGA 149.5 884.0
3: EWR 238.0 1637.0
8. Type Conversion
8.1. Character columns to factors
I’ll create a data.table with character columns…
<- as.data.table(as.data.frame(HairEyeColor, stringsAsFactors = FALSE))
dt head(dt)
Hair Eye Sex Freq
<char> <char> <char> <num>
1: Black Brown Male 32
2: Brown Brown Male 53
3: Red Brown Male 10
4: Blond Brown Male 3
5: Black Blue Male 11
6: Brown Blue Male 50
The standard way to convert the character columns into factors is to:
- use the reassignment operator
:=
to update columns by reference - apply
lapply()
to the subset of data defined by.SD
(Subset of Data) - specify the relevant columns using the
.SDcols
argument .SDcols
to subset the original data.table data
c("Hair", "Eye")) := lapply(.SD, as.factor), .SDcols = c("Hair", "Eye")]
dt[, (head(dt)
Hair Eye Sex Freq
<fctr> <fctr> <char> <num>
1: Black Brown Male 32
2: Brown Brown Male 53
3: Red Brown Male 10
4: Blond Brown Male 3
5: Black Blue Male 11
6: Brown Blue Male 50
9. Ordinal Indexing
we can add an uid indexing column using the .I
symbol. This can be useful for processing and plotting…
<- as.data.table(as.data.frame(HairEyeColor))
dt := .I]
dt[, Subject_ID head(dt)
Hair Eye Sex Freq Subject_ID
<fctr> <fctr> <fctr> <num> <int>
1: Black Brown Male 32 1
2: Brown Brown Male 53 2
3: Red Brown Male 10 3
4: Blond Brown Male 3 4
5: Black Blue Male 11 5
6: Brown Blue Male 50 6
10. Melt/dcast(Reshape)
When you need to reshape wide data into long format for:
- Easier grouping, summarizing, or plotting (e.g., with ggplot2)
- Tidy data format (1 variable per column, 1 observation per row)
- Performing operations like mean scores by subject
# Create a simple data.table
<- data.table(
dt id = 1:2,
math = c(90, 80),
science = c(85, 88)
)
10.1 Melt from wide to long
<- melt(dt, id.vars = "id", variable.name = "subject", value.name = "score")
melted_dt print(melted_dt)
id subject score
<int> <fctr> <num>
1: 1 math 90
2: 2 math 80
3: 1 science 85
4: 2 science 88
10.2 dcast: back to wide format
<- dcast(melted_dt, id ~ subject, value.var = "score")
wide_dt print(wide_dt)
Key: <id>
id math science
<int> <num> <num>
1: 1 90 85
2: 2 80 88
DT: An R interface to the DataTables library
todo