R is my favorite programming language, and the data.table package is the primary reason why. Using data.table, I find it easy to rapidly express any question that I want to ask about my data. Due to the wide variety of features built into data.table, there are very few limits to the types of questions that I can ask. I am able to perform powerful tasks in a small number of steps.

I want to thank Matt Dowle, Arun Srinivasan, and the entire data.table team for producing such a wonderful tool.

data.table is expressive

In my work, it is crucial that I am able to quickly produce analytical insights. I often have multiple stakeholders asking me to answer questions with our data. When I need to perform a deep dive into our business, I don’t want to be burdened by the syntax or the complexity of the language that I am using. I want to spend my brain power thinking about the business problem that I’m trying to solve, not the coding problem that I’m trying to solve!

Using data.table to perform an analysis makes it very easy for me to focus on the data and ignore the programming syntax. I like to say that, due to the i, j, by paradigm that data.table uses, I am able to “live inside” my data while programming. Let me show you what I mean with a simplified example of a typical analysis.

A stakeholder comes to me and says, “Michael, the price per mile of shipments has increased a lot in the past week. Can you investigate why?” They show me a graph that looks something like:

The first thing that I do when I get my data is compute a new column, “price per mile”. Then, I summarize the distribution of that quantity.

DT[, price_per_mile := price / distance]
DT[, summary(price_per_mile)]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.670   2.971   3.024   3.231   3.143   8.145

Next, I use data.table::week to compute a column that represents whether each observation occurred in the current week or the previous week. I use that column to compare the mean price per mile for each week to the global mean.

DT[, week_num := week(date_time)]
DT[, mean(price_per_mile)]
## [1] 3.230788
DT[, mean(price_per_mile), by = 'week_num']
##    week_num       V1
## 1:        1 3.160383
## 2:        2 3.274862

The code above demonstrates what I mean when I say that data.table allows me to “live inside” my data. When I am investigating a dataset, almost all of the data.table code that I write exists between the [] brackets of DT. I feel like I have the different columns of my data at my fingertips, and I can ignore everything else that is happening in my analysis. Operating only between the [] brackets of DT removes the burden of syntax and code formatting, which allows me to operate much more quickly and focus on exploring my data. The feeling of being so close to my data that I can practically touch it is the primary reason why I love data.table so much.

Let’s continue my investigation so that I can show you other terrific aspects of data.table.

DT[week_num == 1, hist(distance, main = 'week 1')]
DT[week_num == 2, hist(distance, main = 'week 2')]

The i, j, by paradigm that data.table uses within the [] brackets reduces the cognitive load required to express a new question about my data. I always know that I choose the rows I want to use with i, the computation I want to perform on certain columns with j, and whether the computation is grouped by any values with by. After programming in data.table for multiple years, the i, j, by paradigm feels extremely natural to me. If I want to modify a previous computation by adding a filter, I copy, paste, and, with half a line of code, I can see how the filter changes the result. There aren’t dozens of exceptions that I need to remember, and 95% of the questions that I want to answer only need i, j, by.
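For example, here is a sketch of that copy-paste-and-filter workflow, starting from the grouped mean above (the price cutoff is just an illustrative value, not part of the original analysis):

# The grouped computation from before
DT[, mean(price_per_mile), by = 'week_num']
# The same computation restricted to higher-priced shipments: only i changes
DT[price > 500, mean(price_per_mile), by = 'week_num']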

Returning to the story of my investigation, you can see that I’ve discovered a very interesting shift in our business. In the most recent week, we’ve had a much higher proportion of short-distance shipments. I wonder if that explains the change in price per mile…

DT[, mean(price_per_mile), by = distance > 250]
##    distance       V1
## 1:     TRUE 2.998963
## 2:    FALSE 4.009094
DT[distance > 250, mean(price_per_mile), by = 'week_num']
##    week_num       V1
## 1:        1 2.999506
## 2:        2 2.998568

Yes! The shift in the mix of short-distance shipments does seem to explain the phenomenon that my stakeholder is asking me about. The average price per mile for short-distance shipments is much higher than the price per mile of longer-distance shipments. And, if we exclude the short-distance shipments from the computation, we can see that the price per mile did not change week over week.

Looking back at this analysis, one of the most surprising things that jumps out at me is that I didn’t actually write that much code. It was only 9 lines of code, and the majority of code on each line is made up of the column names in my dataset. The fact that I don’t need to do a lot of typing to iterate through all of the questions that I have is very important. My goal is to spend the majority of my time looking at the visualizations and tables that my code produces, thinking about the patterns that I am observing in the data, and trying to solve a particular business problem. I love that data.table allows me to spend a minimal amount of time typing code and worrying about syntax.

data.table is powerful

In the previous section, I showed how data.table allowed me to rapidly iterate through a series of simple questions about my data. However, data.table isn’t just a tool for simple analyses. It helps me do extremely powerful and complex computations! I could talk about how I’ve used data.table::frollmean and by to compute time series aggregates for each user in my data, how I’ve used setkey to make a computation execute in seconds instead of minutes, or how I’ve used non-equi joins to produce very interesting datasets. Instead, I’d like to demonstrate the power of data.table by talking about one basic feature that turned an extremely messy and convoluted task into a one-line command: .SD.
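As a quick aside, here is a minimal sketch of the kind of per-user rolling aggregate I mean when combining frollmean with by; the user_id, date, and value columns are made up purely for illustration:

events <- data.table(
    user_id = rep(c('u1', 'u2'), each = 10)
    , date = rep(seq(as.Date('2020-01-01'), by = 'day', length.out = 10), times = 2)
    , value = rnorm(20)
)
# 7-day rolling mean of value, computed separately for each user
events[, rolling_mean_7d := frollmean(value, n = 7), by = 'user_id']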

At my company, I build tools to help other data scientists be more productive and more collaborative. One of the tools that I built helps data scientists understand how their metrics behaved differently on different partitions of their data. For example, if price_per_mile of shipments varies significantly across different mileage bands in the data, the tool would identify that pattern for the user. The tool automatically searches over all possible combinations of different grouping columns in the data and computes a series of user-defined metrics. Then, it identifies the partitions of the data where a metric behaves most abnormally and notifies the user.
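To make the idea of searching over partitions a bit more concrete, here is a rough sketch (not the tool’s actual code) of how every candidate combination of grouping columns could be enumerated with base R’s combn:

candidate_cols <- c('grouping_col_1', 'grouping_col_2')
# Every non-empty combination of the candidate grouping columns
col_combos <- unlist(
    lapply(seq_along(candidate_cols), function(k) combn(candidate_cols, k, simplify = FALSE))
    , recursive = FALSE
)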

One of the challenges with this problem is that some data scientists at my company like to compute metrics that are complicated. For example:

# Weighted mean of the per-row margin fraction, (revenue - cost) / revenue,
# weighted by the weight column
weighted_margin <- function(DT) {
    DT[, sum(weight * (revenue - cost) / revenue) / sum(weight)]
}

# Mean of some_column, returned only when there are enough observations;
# otherwise the metric is NA
sparse_average <- function(DT, min_num_observations = 25) {
    num_obs <- DT[, .N]
    if (num_obs > min_num_observations) {
        DT[, mean(some_column)]
    } else {
        NA_real_
    }
}

In general, the tool that I built doesn’t know ahead of time which functions will be used. When users give my tool their data, they also provide a list of functions that compute all of their metrics. Here’s an example of what the setup code looks like:

DT <- data.table(
    revenue = rnorm(100)
    , cost = rnorm(100)
    , weight = rnorm(100)
    , some_column = rbinom(100, 1, 0.5)
    , grouping_col_1 = sample(c('a', 'b'), 100, replace = TRUE)
    , grouping_col_2 = sample(c('x', 'y'), 100, replace = TRUE)
)
list_of_funs <- list(
    weighted_margin = weighted_margin
    , sparse_average = sparse_average
)
grouping_cols <- c('grouping_col_1', 'grouping_col_2')

I needed to write a function that would group the data on the partitioning columns and compute all of my metrics. Here’s the single line of code that I wrote to accomplish this task:

DT[, lapply(list_of_funs, function(f) f(.SD)), by = grouping_cols]
##    grouping_col_1 grouping_col_2 weighted_margin sparse_average
## 1:              b              y       0.9733648             NA
## 2:              a              y       4.5892169      0.6551724
## 3:              b              x      -2.7084592      0.6333333
## 4:              a              x      -7.1285230             NA

I was so happy when I wrote that one line of code! And the hero at the center of it all is .SD, which stands for “Subset of Data”. Let’s break down this line of code to see what is going on.

First, I am grouping the computation by the two columns defined in grouping_cols.

Second, I am looping over the list of functions defined in list_of_funs with lapply, so j evaluates to a named list with one element per metric. Because all data.frames (which include data.tables) are actually just lists of columns, data.table turns that named list into the columns of the result.

Third, for each function in my list, I am passing in the subset of data for the current group, obtained while grouping on the columns in grouping_cols. This is what is happening with f(.SD): f is one of my metric functions, and .SD is a data.table containing the current group’s rows (every column except the grouping columns, by default).
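To make the role of .SD concrete, here is roughly what that one-liner does for a single group: passing .SD to f hands each metric function just that group’s rows.

# Equivalent by-hand computation for the group where
# grouping_col_1 == 'a' and grouping_col_2 == 'x'
weighted_margin(DT[grouping_col_1 == 'a' & grouping_col_2 == 'x'])
sparse_average(DT[grouping_col_1 == 'a' & grouping_col_2 == 'x'])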