So, what is data.table? An extension of data frame that’s optimized for larger data sets. It inherits from data frames, so anything that works with data frame should work for data tables.
OK, real “big data” this may not be, but we’re going to start by building a “medium-sized” data set here: 13,450,000 rows and 8 columns.
So what do we need to do before we can actually work with the data?
When working with a data table, first step is typically to “set a key”:
“setkey() sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference. No copy is made at all, other than temporary working memory as large as one column.” (see this Stackoverflow post)
Let’s start by correctly formatting our date columns. FYI: you may need to scroll over to see the whole code chunk.
## insert slightly more meaningful column names
setnames(data, c("a_number", "a_small_number", "a_letter","one_or_zero", "a_negative_number", "a_date", "a_different_letter", "another_date"))
## set a key
system.time(setkey(data, a_number))
## user system elapsed
## 3.287 0.066 3.410
## format date columns correctly
system.time(data$a_date <- as.Date(data$a_date, format = '%Y%m%d'))
## user system elapsed
## 0.994 0.300 1.321
system.time(data$another_date <- as.Date(data$another_date, format = '%Y%m%d'))
## user system elapsed
## 0.909 0.205 1.130
## format factor columns correctly
factors <- c("a_letter", "a_date", "another_date")
system.time(data[, as.factor(factors), with = FALSE])
## user system elapsed
## 0.117 0.067 0.187
How about subsetting?
## subset a single month's worth of data
system.time(april_subset <- subset(data, a_date >= "2016-04-01" & a_date <= "2016-04-30"))
## user system elapsed
## 7.614 1.475 9.578
Subsetting with a ~13 million row dataset in a few seconds. That’s pretty dern fast!
But why did we do that? For real reporting, it’s very handy to be able to reference R objects in text. Here’s a quick example, though you’ll need to view the raw file to see how the numbers are pulled from the R objects:
“… Loading this data into R, we see that 4,034,158 things happened in April 2016. On slide 87 of Some Powerpoint Given to Us by an Exec, we see a total of 2,265,116 things happening in April 2016. This gives a discrepancy of 1,769,042 members between the results of our query and the documentation provided by business. Furthermore, in the month of April 2016, the data retrieved by our query indicates $67,269,074,303 in sales for…”
OK, I can see how that would be useful. But show us something amazing that data.table can do already!!!
How about regressing select levels of a categorical variable on the rest of the data?
## subset the 'a_letter' variable
setkey(data, a_letter); options(datatable.print.nrows=0)
system.time(data_a_letter_a_or_c <- data[a_letter %in% c("A","C")])
## user system elapsed
## 0.579 0.139 0.730
## target levels "A" and "C" of that variable in `lm` call
system.time(my_regression <- data_a_letter_a_or_c[, as.list(coef(lm(a_number~a_date))), by = a_letter])
## user system elapsed
## 21.854 2.045 25.068
knitr::kable(my_regression)
| a_letter | (Intercept) | a_date |
|---|---|---|
| A | 4781.933 | 0.0128557 |
| C | 3775.328 | 0.0724217 |
This is really just the tip of the proverbial iceberg! The package is easy to use and makes pretty fast work of many common munging tasks. Enjoy!