Short answer: because life’s short and you’ve got stuff to do.
Longer answer:
This overview shows you many aspects of data.table, using an airline dataset with 337 thousand flights.
The basic syntax looks a lot like SQL:
df[i, j, by]
select rows i, do something with variable[s] j, grouped by by
Simple, huh?
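To make the SQL analogy concrete, here is a minimal sketch [it uses the flight data that we load below]:
# roughly: SELECT carrier, AVG(arr_delay) FROM flights WHERE origin = 'JFK' GROUP BY carrier
df[origin == "JFK", mean(arr_delay, na.rm = TRUE), by = carrier]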
We get 337K flights from a data.frame in the nycflights13 library, and convert that to a data.table.
If you need to read data, then data.table’s fread is much faster than base R. See http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r/15058684#15058684. Here we will just use the flight data from a library.
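For illustration, a hypothetical fread call ["flights.csv" is a made-up file name]:
# hypothetical: read a csv file from disk
# flights = fread("flights.csv")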
library(data.table)
data(flights, package = "nycflights13")
df = setDT(flights)
df
## year month day dep_time dep_delay arr_time arr_delay carrier
## 1: 2013 1 1 517 2 830 11 UA
## 2: 2013 1 1 533 4 850 20 UA
## 3: 2013 1 1 542 2 923 33 AA
## 4: 2013 1 1 544 -1 1004 -18 B6
## 5: 2013 1 1 554 -6 812 -25 DL
## ---
## 336772: 2013 9 30 NA NA NA NA 9E
## 336773: 2013 9 30 NA NA NA NA 9E
## 336774: 2013 9 30 NA NA NA NA MQ
## 336775: 2013 9 30 NA NA NA NA MQ
## 336776: 2013 9 30 NA NA NA NA MQ
## tailnum flight origin dest air_time distance hour minute
## 1: N14228 1545 EWR IAH 227 1400 5 17
## 2: N24211 1714 LGA IAH 227 1416 5 33
## 3: N619AA 1141 JFK MIA 160 1089 5 42
## 4: N804JB 725 JFK BQN 183 1576 5 44
## 5: N668DN 461 LGA ATL 116 762 5 54
## ---
## 336772: 3393 JFK DCA NA 213 NA NA
## 336773: 3525 LGA SYR NA 198 NA NA
## 336774: N535MQ 3461 LGA BNA NA 764 NA NA
## 336775: N511MQ 3572 LGA CLE NA 419 NA NA
## 336776: N839MQ 3531 LGA RDU NA 431 NA NA
The first advantage of data.table is immediately visible. When you type “df”, R does not try to spit out all 337K flights at you. Instead, it prints the head and the tail.
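If you want to see more or fewer rows, the data.table print method has a topn argument [a small sketch]:
print(df, topn = 3) # show only the first and last 3 rows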
As the class(df) call shows, the object remains a data.frame as well - data.table just makes it much more efficient. Note that a data.table has no rownames; if you need an identifier, keep it in a regular column [and use it as a key].
class(df)
## [1] "data.table" "data.frame"
You use the “j” position to select your variables. The “.” is a shortcut for “list”. No quotes are necessary.
df[, .(tailnum, flight, origin)]
data.table discourages selecting columns by index number rather than by name, since relying on column positions is bad programming practice. In some situations you will still need a computed selection [numbers or a logical vector]; in that case use “with = FALSE” as follows:
delay.cols = grepl("delay", names(df)) # find columns with "delay"
df[, delay.cols, with = FALSE]
# simple select
df[origin == "EWR"]
# setting a key also automatically sorts the data.table
setkey(df, origin)
df["EWR"]
# select multiple rows
df[origin %in% c("LGA", "EWR")]
# grep style select
df[dest %like% "LA"] # finds LAS and LAX
# between select
df[distance %between% c(1000, 1200)] # distance between 1000 and 1200 nautical miles
In the example below we find, by carrier, the average delay of flights to IAH that were delayed more than 30 minutes. This command uses “.N”, which is the data.table shortcut for the number of rows in the group.
df[arr_delay > 30 & dest == "IAH", # <- select rows
.(mean_delay = mean(arr_delay), # <- mean arrival delay
count = .N), # <- number of flights
by = carrier] # <- summarize by airline
Row and column selection, combined with group by will already save you tons of time. But wait, there’s much more.
The documentation of the nycflights13 dataset does not specify units for distance or air time. We assume that distance is measured in nautical miles and air time in minutes. The speed is then calculated in km/h [1 nautical mile = 1.852 km].
df[, speed := (distance * 1.852) / (air_time / 60)]
You can also update fields in place. For example, set NA speed values to zero, and blank tail numbers to “Unknown”.
df[is.na(speed), speed := 0]
df[nchar(tailnum) < 2, tailnum := "Unknown"]
This is a real time saver. Deleting a field is just as easy:
df[, speed := NULL]
You can also add several fields at once. Note that you don’t have to write “df = df[i, j, by]” - the variables are updated by reference. Another time saver.
The example below uses the lubridate library, which handles date conversions and calculations very efficiently.
library(lubridate)
df[, `:=` (
speed = (distance * 1.852) / (air_time / 60),
date = ymd(paste(year, month, day, sep = "-"))
)]
df[, c("speed", "date") := NULL]
The nested command to get a date field in the code above is not very transparent. There is much excitement about the use of “pipes” from library magrittr in combination with library dplyr.
data.table allows for code evaluation in steps and chaining. There is therefore not much need for pipes as an additional concept. However, if you absolutely want to write “%>%” all the time in data.table nobody will stop you.
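If you do, here is a minimal sketch [assuming magrittr is installed; the “.” stands for the left-hand side of the pipe]:
library(magrittr)
df[arr_delay > 30 & dest == "IAH"] %>%
  .[, .(mean_delay = mean(arr_delay)), by = carrier]
Back to evaluating code in steps: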
df[, date := {
temp = paste(year, month, day, sep = "-")
date = ymd(temp)
}]
You can calculate several fields in steps simultaneously. You can also chain outputs, Unix-pipe style. So let’s go wild and find pilots that went through the sound barrier [about 1234 km/h at sea level]:
df[, c("date", "speed") := {
temp = paste(year, month, day, sep = "-")
date = ymd(temp)
km = distance * 1.852
hr = air_time / 60
speed = km / hr
.(date = date, speed = speed)
}][speed > 1234, .(date, air_time, distance, speed, carrier, tailnum)]
The first part of the code gets dates and km/h in steps. The results are then piped into a command that both finds supersonic flights and makes a subselection of the fields. This results in transparent code that you will still be able to understand when you revisit your script months later.
An important part of your time is spent merging different datasets together. data.table can do this really fast - and, as usual, there’s more.
We have found a flight that went through the sound barrier and want to know which plane it was. This requires us to get the manufacturer’s data [which are also included in the nycflights13 library] and join that dataset with the flight data.
data(planes, package = "nycflights13")
setDT(planes)
setkey(planes, tailnum)
setkey(df, tailnum)
dx = df[planes]
dx[speed > 1234, .(carrier, tailnum, origin, dest, speed, manufacturer, model, engine)]
First, we get the plane data and set it to data.table. Second, we set the keys for the datasets we want to merge: both contain the field “tailnum”. Then we do a left join on the datasets, and find the supersonic flight. Delta went through the sound barrier with a Boeing 757!
What was the first flight per tail number?
setkey(df, tailnum, date)
df[planes, mult = "first"] # finds the first flight per tail number
df[planes, mult = "last"] # ditto, the last flight per tail number
What is the second flight of each tail number per carrier?
dx[, .SD[2], by = .(carrier, tailnum)]
The “.SD” means “subset of data”.
What are the first 3 flights per tail number per carrier?
dx[, head(.SD, 3), .(carrier, tailnum)]
What was the minimum speed per carrier?
df[, .SD[which.min(speed)], by = carrier]
Instead of joins in the form A[B], you can also use merge. This allows you to specify field names for A and B if they have different names [e.g. tailnum and TailNum]. You can also specify left, right, and full joins.
# left join
x = merge(df, planes, by = "tailnum", all.x = T)
setkey(df, tailnum)
setkey(planes, tailnum)
y = planes[df] # keyed equivalent of the left join
# full join
z = merge(df, planes, by = "tailnum", all = T)
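If the key columns have different names in the two tables, merge can match them up; a hypothetical sketch [“TailNum” is a made-up column name]:
# hypothetical: suppose planes used "TailNum" instead of "tailnum"
# w = merge(df, planes, by.x = "tailnum", by.y = "TailNum", all.x = TRUE)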
data.table offers a unique feature to find overlaps between intervals in two data.tables. The feature is best explained with a simple example [see ?foverlaps for more details].
# make 2 simple data.tables
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
# find the overlaps
z = foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
print(x) # data.table 1
## start end val2
## 1: 5 8 7
## 2: 31 50 8
## 3: 22 25 9
## 4: 16 18 10
print(y) # data.table 2
## start end val1
## 1: 10 15 1
## 2: 20 35 2
## 3: 30 45 3
print(z) # overlap indices
## xid yid
## 1: 1 NA
## 2: 2 2
## 3: 2 3
## 4: 3 2
## 5: 4 NA
There are 2 data.tables with intervals [start, end]. The last table shows the overlap indices.
The first row of x has interval [5, 8], which overlaps with none of the rows in y: all of their intervals start above 8. The second row of x has interval [31, 50], which overlaps with both rows 2 and 3 of y: [20, 35] and [30, 45] respectively.
The table below shows the joined data.table.
foverlaps(x, y, type="any")
## start end val1 i.start i.end val2
## 1: NA NA NA 5 8 7
## 2: 20 35 2 31 50 8
## 3: 30 45 3 31 50 8
## 4: 20 35 2 22 25 9
## 5: NA NA NA 16 18 10
A data.table is, like a data.frame, a list of vectors. Therefore, processing multiple fields requires list processing: lapply. You have to indicate which columns you want processed. This can be a cumbersome operation in data.table; we will work from the easiest case to the most difficult one [which is unfortunately the most frequent one].
# convert times in minutes to hours in the delay columns
df[, lapply(.SD, function(x) x/60), .SDcols = c("dep_delay", "arr_delay")]
As you can see, this has given you the delay columns in hours. However, often you want to do this in place - you want to work in the original data.table. Here, we will find the variable names rather than typing them.
df[, arr.delay.corr := arr_delay - dep_delay] # correction, see below
# which variable names have "delay" in them?
idx.numcols = which(names(df) %like% "delay")
names.numcols = names(df)[idx.numcols]
# convert to minutes in place
df[, (names.numcols) := lapply(.SD, function(x) ifelse(is.na(x), x, x/60)), .SDcols = names.numcols]
print(head(df, 3)[, names.numcols, with = F])
## dep_delay arr_delay arr.delay.corr
## 1: 1.1333333 1.51666667 0.38333333
## 2: 0.6666667 0.73333333 0.06666667
## 3: 0.4000000 0.03333333 -0.36666667
Oftentimes data.tables are in wide format [many variables], whereas you want them in long format. You need the long format, for example, to work with ggplot2.
First the melt [wide -> long] operation; this is optimized for speed in data.table (even more optimization is scheduled for the upcoming version 1.9.6). We do the melt operation for a few carriers only, for plotting purposes later on.
mycols = c("carrier", names.numcols)
m = melt(df[, mycols, with = F], id.var = "carrier")[carrier %chin% c("AA", "NW", "UA", "US")]
print(head(m, 3))
## carrier variable value
## 1: US dep_delay -0.1166667
## 2: US dep_delay -0.1500000
## 3: US dep_delay 0.6500000
Now you can easily plot in ggplot2. How do the arrival and departure delay compare by carrier?
library(ggplot2)
theme_set(theme_bw())
p = ggplot(m, aes(variable, value)) +
geom_boxplot() +
facet_grid(.~ carrier) +
labs(x = "", y = "hours", title = "delays per carrier")
print(p)
In this plot, 50% of the observations fall inside the “box” [the horizontal dash is the median]. The plot shows that the arrival delays have a wider spread [and are often negative]. The outliers are shown as dots; AA is the winner here. Note the large arrival delays: if you depart one hour too late, you have already clocked one hour of arrival delay, unless you fly faster than usual. The variable “arr.delay.corr” corrects for this.
We had just melted the data.table for delays per carrier.
You can now cast this back - the latest and greatest data.table 1.9.6 has some more functionality for this.
For example, you can run multiple functions while casting. The code below calculates both mean and standard deviation for the delays of the selected carriers.
dc = dcast(m, variable ~ carrier, value.var = "value", fun.aggregate = list(mean, sd), na.rm = T)
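If you prefer the long format, the same summary can also be computed directly with data.table’s grouping syntax [a small sketch]:
m[, .(mean = mean(value, na.rm = T), sd = sd(value, na.rm = T)), by = .(carrier, variable)]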
Data cleaning always involves the removal of duplicates. Suppose that we want to work with unique carriers.
data.table finds duplicates per key. This means that your results will depend on the keys that you have set. In the example below we remove the key, test for duplicates, then repeat the process when a key is set.
key(df) # what is the current key?
setkey(df, NULL) # remove any keys
dedup = df[!duplicated(df),] # find the unique records
nrow(dedup) # how many unique records?
nrow(unique(df, by = "carrier")) # unique by carrier only - a different count!
setkey(df, carrier)
dedup = df[!duplicated(df),]
nrow(dedup) # now the number of unique records corresponds to what you wanted [unique carriers]
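Recent versions also offer uniqueN as a shortcut [a sketch, assuming your data.table version provides it]:
uniqueN(df, by = "carrier") # number of unique carriers in one call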
data.table avoids making copies. This means that you can get tricked when you make a new data.table out of another one and do things with it, such as deleting variables.
planes.copy = planes
planes.copy[, type := NULL]
"type" %in% names(planes) # the deleted variable has gone in the original dataset too
## [1] FALSE
planes.deep.copy = copy(planes)
planes.deep.copy[, model := NULL]
"model" %in% names(planes) # deleted variable still there in the original dataset
## [1] TRUE
You can rename columns in place with setnames:
setnames(df, c("air_time", "speed"), c("air_time.min", "speed.km.hr"))
You can reorder columns in two ways. First, there is the aptly named “setcolorder”. Second, you can use the fact that the columns show up in the order in which you queried them.
dfx = df[, .(year, month, day, date)] # sample data.table
setcolorder(dfx, c("date", "day", "month", "year"))
dfx[, .(date, day, month, year)]
R CRAN location: https://cran.r-project.org/web/packages/data.table/index.html
The mothership: https://github.com/Rdatatable/data.table/wiki
bigvis: https://github.com/hadley/bigvis
You can also work in dplyr if you prefer to use a command for every subaction. Some people may find the syntax of dplyr easier to remember, as it looks more “natural”. The resulting code is more verbose.
As you can see below, you will have the result in data.table before you have finished typing it in dplyr. data.table needs one command with two elements (“[]” and “by”), whereas dplyr needs four (filter, group_by, summarise, and %>%). For me, this gets old really fast. For others, typing “%>%” combined with a verb for every action is easier to work with.
A discussion of differences can be found here: http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly
library(dplyr)
datatable.way = df[arr_delay > 30 & origin == "JFK",
mean(arr_delay),
by = carrier]
dplyr.way = df %>%
filter(arr_delay > 30 & origin == "JFK") %>%
group_by(carrier) %>%
summarise(mean(arr_delay))