Medium Size Data Munging in R

Introduction

So, what is data.table? An extension of data frame that’s optimized for larger data sets. It inherits from data frames, so anything that works with data frame should work for data tables.

Munging with a large table

OK, real “big data” this may not be, but we’re going to start by building a “medium-sized” data set here: 13,450,000 rows and 8 columns.

So what do we need to do before we can actually work with the data?

Setting the key

When working with a data table, first step is typically to “set a key”:

“setkey() sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference. No copy is made at all, other than temporary working memory as large as one column.” (see this Stackoverflow post)

Can we get to some munging now?

Let’s start by correctly formatting our date columns. FYI: you may need to scroll over to see the whole code chunk.

## insert slightly more meaningful column names
setnames(data, c("a_number", "a_small_number", "a_letter","one_or_zero", "a_negative_number", "a_date", "a_different_letter", "another_date"))

## set a key
system.time(setkey(data, a_number))

##    user  system elapsed 
##   3.287   0.066   3.410

## format date columns correctly
system.time(data$a_date <- as.Date(data$a_date, format = '%Y%m%d'))

##    user  system elapsed 
##   0.994   0.300   1.321

system.time(data$another_date <- as.Date(data$another_date, format = '%Y%m%d'))

##    user  system elapsed 
##   0.909   0.205   1.130

## format factor columns correctly
factors <- c("a_letter", "a_date", "another_date")
system.time(data[, as.factor(factors), with = FALSE])

##    user  system elapsed 
##   0.117   0.067   0.187

How about subsetting?

## subset a single month's worth of data
system.time(april_subset <- subset(data, a_date >= "2016-04-01" & a_date <= "2016-04-30"))

##    user  system elapsed 
##   7.614   1.475   9.578

Subsetting with a ~13 million row dataset in a few seconds. That’s pretty dern fast!

Inline code reference

But why did we do that? For real reporting, it’s very handy to be able to reference R objects in text. Here’s a quick example, though you’ll need to view the raw file to see how the numbers are pulled from the R objects:

“… Loading this data into R, we see that 4,034,158 things happened in April 2016. On slide 87 of Some Powerpoint Given to Us by an Exec, we see a total of 2,265,116 things happening in April 2016. This gives a discrepancy of 1,769,042 members between the results of our query and the documentation provided by business. Furthermore, in the month of April 2016, the data retrieved by our query indicates $67,269,074,303 in sales for…”

OK, I can see how that would be useful. But show us something amazing that data.table can do already!!!

Slightly more advanced data.table operations

How about regressing select levels of a categorical variable on the rest of the data?

## subset the 'a_letter' variable
setkey(data, a_letter); options(datatable.print.nrows=0)
system.time(data_a_letter_a_or_c <- data[a_letter %in% c("A","C")])

##    user  system elapsed 
##   0.579   0.139   0.730

## target levels "A" and "C" of that variable in `lm` call
system.time(my_regression <- data_a_letter_a_or_c[,  as.list(coef(lm(a_number~a_date))), by = a_letter])

##    user  system elapsed 
##  21.854   2.045  25.068

knitr::kable(my_regression)

a_letter	(Intercept)	a_date
A	4781.933	0.0128557
C	3775.328	0.0724217

This is really just the tip of the proverbial iceberg! The package is easy to use and makes pretty fast work of many common munging tasks. Enjoy!