Apache Drill is an open-source tool for querying distributed data. It also works on file systems, including the local one. With sergeant, R and dplyr bindings for Drill by Bob Rudis, an R session can therefore treat files of any type Drill can read—delimited (CSV, TSV, PSV), JSON, Parquet, etc.—as database tables, effectively reading them into R without reading them into memory. Manipulations are limited akin to other dplyr database backends, but Drill requires little to no setup to read any file visible to your filesystem.

To try out Drill+sergeant, first install Drill (with Homebrew, brew install apache-drill will do) and start it in a shell:

drill-embedded &

Keeping it running in a bash block (and system) has failed lately, but it’s easy to run drill-embedded in a shell to start Drill before querying anything. Drill’s errors will print to that shell, so it’s worth keeping open regardless.

Now use src_drill to connect to the Drill instance:

…and tbl to get a file:

mtcars_drl <- tbl(drl, "dfs.`/Users/alistaire/Desktop/mtcars.csvh`")
mtcars_drl

dfs is the file system storage plugin for Drill (short for “distributed file system”). Paths need to be prefaced by a plugin, and enclosed in backticks. The default location of dfs is /, from which all paths have to be fully qualified. Full details are in the docs.

If fully-qualifying paths gets annoying, you can set your own locations within dfs by changing its configuration. The easiest way to do so is to go to localhost:8047 (that’s the default port), where if you go to the “Storage” tab you can view and edit the full JSON configuration for dfs.

To add a workspace for my home directory, I copied the tmp one, then updated the name (to “alistaire”) and path (“/Users/alistaire”). It should now show up as in drl’s print method:

drl
src:  DrillConnection
tbls: INFORMATION_SCHEMA, cp.default, dfs.alistaire, dfs.default, dfs.root, dfs.tmp, sys

Now the workspace can be appended to dfs with a dot, and paths continue as relative:

tbl(drl, "dfs.alistaire.`Desktop/mtcars.csvh`") %>% 
    group_by(cyl) %>% 
    summarise_all(max, na.rm = TRUE)

Working with tables is a little finicky due to the SQL translation, but with a little patience can work pretty well.

More information about the configuration is available with a full Drill connection:

con <- drill_connection()
drill_active(con)
[1] TRUE
drill_show_schemas(con)

  |                                                                                            
  |                                                                                      |   0%
  |                                                                                            
  |======================================================================================| 100%
drill_query(con, 
            "SELECT mpg, cyl 
            FROM dfs.alistaire.`Desktop/mtcars.csvh` 
            LIMIT 3;")

  |                                                                                            
  |                                                                                      |   0%
  |                                                                                            
  |======================================================================================| 100%

The Drill docs go much further in showing what is possible. It has limits and can cause headaches, but when it works it’s glorious.

For instance, I can read in a 1.5 Gb CSV on an external hard drive:

pryr::mem_used()
95.2 MB
ohio1 <- tbl(drl, "dfs.`/Volumes/Mymble/ohio1_long.csvh`")
pryr::mem_used()
95.2 MB

…without using any additional memory. But I can still do this:

n_voters <- ohio1 %>% 
    filter(voted != "NA") %>% 
    count(election_type, election_date) %>% 
    arrange(desc(election_date)) %>% 
    collect()    # use `collect` to bring table into R
n_voters
library(ggplot2)
ggplot(n_voters, aes(election_date, n, fill = election_type)) + 
    geom_col(width = 50, position = "identity") + 
    scale_y_continuous(labels = scales::comma_format()) + 
    ggpomological::scale_fill_pomological(
        labels = function(x) tools::toTitleCase(tolower(x))
    ) + 
    labs(title = "OH-1 Voters per Election", x = NULL, y = "Number of Voters",
         fill = NULL) + 
    hrbrthemes::theme_ipsum_rc()

Woo!

LS0tCnRpdGxlOiAiQXBhY2hlIERyaWxsIG9uIHRoZSBsb2NhbCBmaWxlc3lzdGVtIHZpYSBzZXJnZWFudCIKYXV0aG9yOiAiRWR3YXJkIFZpc2VsIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpbQXBhY2hlIERyaWxsXShodHRwczovL2RyaWxsLmFwYWNoZS5vcmcvKSBpcyBhbiBvcGVuLXNvdXJjZSB0b29sIGZvciBxdWVyeWluZyAKZGlzdHJpYnV0ZWQgZGF0YS4gSXQgYWxzbyB3b3JrcyBvbiBmaWxlIHN5c3RlbXMsIGluY2x1ZGluZyB0aGUgbG9jYWwgb25lLiAKV2l0aCBbc2VyZ2VhbnRdKGh0dHBzOi8vZ2l0aHViLmNvbS9ocmJybXN0ci9zZXJnZWFudCksIFIgYW5kIGRwbHlyIGJpbmRpbmdzIGZvciAKRHJpbGwgYnkgQm9iIFJ1ZGlzLCBhbiBSIHNlc3Npb24gY2FuIHRoZXJlZm9yZSB0cmVhdCBmaWxlcyBvZiBhbnkgdHlwZSBEcmlsbCAKY2FuIHJlYWTigJRkZWxpbWl0ZWQgKENTViwgVFNWLCBQU1YpLCBKU09OLCBQYXJxdWV0LCBldGMu4oCUYXMgZGF0YWJhc2UgdGFibGVzLCAKZWZmZWN0aXZlbHkgcmVhZGluZyB0aGVtIGludG8gUiB3aXRob3V0IHJlYWRpbmcgdGhlbSBpbnRvIG1lbW9yeS4gTWFuaXB1bGF0aW9ucyAKYXJlIGxpbWl0ZWQgYWtpbiB0byBvdGhlciBkcGx5ciBkYXRhYmFzZSBiYWNrZW5kcywgYnV0IERyaWxsIHJlcXVpcmVzIGxpdHRsZSB0byAKbm8gc2V0dXAgdG8gcmVhZCBhbnkgZmlsZSB2aXNpYmxlIHRvIHlvdXIgZmlsZXN5c3RlbS4KClRvIHRyeSBvdXQgRHJpbGwrc2VyZ2VhbnQsIGZpcnN0IGluc3RhbGwgRHJpbGwgKHdpdGggCkhvbWVicmV3LCBgYnJldyBpbnN0YWxsIGFwYWNoZS1kcmlsbGAgd2lsbCBkbykgYW5kIHN0YXJ0IGl0IGluIGEgc2hlbGw6CgpgYGB7YmFzaH0KZHJpbGwtZW1iZWRkZWQgJgpgYGAKCktlZXBpbmcgaXQgcnVubmluZyBpbiBhIGJhc2ggYmxvY2sgKGFuZCBgc3lzdGVtYCkgaGFzIGZhaWxlZCBsYXRlbHksIGJ1dCBpdCdzIAplYXN5IHRvIHJ1biBgZHJpbGwtZW1iZWRkZWRgIGluIGEgc2hlbGwgdG8gc3RhcnQgRHJpbGwgYmVmb3JlIHF1ZXJ5aW5nIAphbnl0aGluZy4gRHJpbGwncyBlcnJvcnMgd2lsbCBwcmludCB0byB0aGF0IHNoZWxsLCBzbyBpdCdzIHdvcnRoIGtlZXBpbmcgb3BlbiAKcmVnYXJkbGVzcy4KCk5vdyB1c2UgYHNyY19kcmlsbGAgdG8gY29ubmVjdCB0byB0aGUgRHJpbGwgaW5zdGFuY2U6CgpgYGB7cn0KbGlicmFyeShkcGx5cikKIyBpbnN0YWxsLnBhY2thZ2VzKCJzZXJnZWFudCIpCmxpYnJhcnkoc2VyZ2VhbnQpCgpkcmwgPC0gc3JjX2RyaWxsKCkKCiMgd3JpdGUgYSBzYW1wbGUgQ1NWOyAiY3N2aCIgaXMgcHJlY29uZmlndXJlZCBpbiBEcmlsbCB0byBiZSByZWFkIHdpdGggaGVhZGVycwp3cml0ZS5jc3YobXRjYXJzLCAnfi9EZXNrdG9wL210Y2Fycy5jc3ZoJywgcm93Lm5hbWVzID0gRkFMU0UpCmBgYAoKLi4uYW5kIGB0YmxgIHRvIGdldCBhIGZpbGU6CgpgYGB7cn0KbXRjYXJzX2RybCA8LSB0YmwoZHJsLCAiZGZzLmAvVXNlcnMvYWxpc3RhaXJlL0Rlc2t0b3AvbXRjYXJzLmNzdmhgIikKCm10Y2Fyc19kcmwKYGBgCgpgZGZzYCBpcyB0aGUgZmlsZSBzeXN0ZW0gc3RvcmFnZSBwbHVnaW4gZm9yIERyaWxsIChzaG9ydCBmb3IgImRpc3RyaWJ1dGVkIGZpbGUgCnN5c3RlbSIpLiBQYXRocyBuZWVkIHRvIGJlIHByZWZhY2VkIGJ5IGEgcGx1Z2luLCBhbmQgZW5jbG9zZWQgaW4gYmFja3RpY2tzLiBUaGUgCmRlZmF1bHQgbG9jYXRpb24gb2YgYGRmc2AgaXMgYC9gLCBmcm9tIHdoaWNoIGFsbCBwYXRocyBoYXZlIHRvIGJlIGZ1bGx5IApxdWFsaWZpZWQuIFtGdWxsIGRldGFpbHMgYXJlIGluIHRoZSAKZG9jcy5dKGh0dHBzOi8vZHJpbGwuYXBhY2hlLm9yZy9kb2NzL2ZpbGUtc3lzdGVtLXN0b3JhZ2UtcGx1Z2luLykKCklmIGZ1bGx5LXF1YWxpZnlpbmcgcGF0aHMgZ2V0cyBhbm5veWluZywgeW91IGNhbiBzZXQgeW91ciBvd24gbG9jYXRpb25zIHdpdGhpbiAKYGRmc2AgYnkgY2hhbmdpbmcgaXRzIGNvbmZpZ3VyYXRpb24uIFRoZSBlYXNpZXN0IHdheSB0byBkbyBzbyBpcyB0byBnbyAKdG8gW2xvY2FsaG9zdDo4MDQ3XShodHRwOi8vbG9jYWxob3N0OjgwNDcvKSAodGhhdCdzIHRoZSBkZWZhdWx0IHBvcnQpLCB3aGVyZSBpZiAKeW91IGdvIHRvIHRoZSAiU3RvcmFnZSIgdGFiIHlvdSBjYW4gdmlldyBhbmQgZWRpdCB0aGUgZnVsbCBKU09OIGNvbmZpZ3VyYXRpb24gCmZvciBgZGZzYC4KClRvIGFkZCBhIHdvcmtzcGFjZSBmb3IgbXkgaG9tZSBkaXJlY3RvcnksIEkgY29waWVkIHRoZSBgdG1wYCBvbmUsIHRoZW4gdXBkYXRlZCAKdGhlIG5hbWUgKHRvICJhbGlzdGFpcmUiKSBhbmQgcGF0aCAoIi9Vc2Vycy9hbGlzdGFpcmUiKS4gSXQgc2hvdWxkIG5vdyBzaG93IHVwIAphcyBpbiBgZHJsYCdzIHByaW50IG1ldGhvZDoKYGBge3J9CmRybApgYGAKCk5vdyB0aGUgd29ya3NwYWNlIGNhbiBiZSBhcHBlbmRlZCB0byBgZGZzYCB3aXRoIGEgZG90LCBhbmQgcGF0aHMgY29udGludWUgYXMgCnJlbGF0aXZlOgpgYGB7cn0KdGJsKGRybCwgImRmcy5hbGlzdGFpcmUuYERlc2t0b3AvbXRjYXJzLmNzdmhgIikgJT4lIAogICAgZ3JvdXBfYnkoY3lsKSAlPiUgCiAgICBzdW1tYXJpc2VfYWxsKG1heCwgbmEucm0gPSBUUlVFKQpgYGAKCldvcmtpbmcgd2l0aCB0YWJsZXMgaXMgYSBsaXR0bGUgZmluaWNreSBkdWUgdG8gdGhlIFNRTCB0cmFuc2xhdGlvbiwgYnV0IHdpdGggYSAKbGl0dGxlIHBhdGllbmNlIGNhbiB3b3JrIHByZXR0eSB3ZWxsLgoKTW9yZSBpbmZvcm1hdGlvbiBhYm91dCB0aGUgY29uZmlndXJhdGlvbiBpcyBhdmFpbGFibGUgd2l0aCBhIGZ1bGwgRHJpbGwgCmNvbm5lY3Rpb246CmBgYHtyLCBtZXNzYWdlPUZBTFNFfQpjb24gPC0gZHJpbGxfY29ubmVjdGlvbigpCgpkcmlsbF9hY3RpdmUoY29uKQpkcmlsbF9zaG93X3NjaGVtYXMoY29uKQoKZHJpbGxfcXVlcnkoY29uLCAKICAgICAgICAgICAgIlNFTEVDVCBtcGcsIGN5bCAKICAgICAgICAgICAgRlJPTSBkZnMuYWxpc3RhaXJlLmBEZXNrdG9wL210Y2Fycy5jc3ZoYCAKICAgICAgICAgICAgTElNSVQgMzsiKQpgYGAKCltUaGUgRHJpbGwgZG9jc10oaHR0cHM6Ly9kcmlsbC5hcGFjaGUub3JnL2RvY3MvKSBnbyBtdWNoIGZ1cnRoZXIgaW4gc2hvd2luZyAKd2hhdCBpcyBwb3NzaWJsZS4gSXQgaGFzIGxpbWl0cyBhbmQgY2FuIGNhdXNlIGhlYWRhY2hlcywgYnV0IHdoZW4gaXQgd29ya3MgaXQncyAKZ2xvcmlvdXMuCgpGb3IgaW5zdGFuY2UsIEkgY2FuIHJlYWQgaW4gYSAxLjUgR2IgQ1NWIG9uIGFuIGV4dGVybmFsIGhhcmQgZHJpdmU6CmBgYHtyfQpwcnlyOjptZW1fdXNlZCgpCm9oaW8xIDwtIHRibChkcmwsICJkZnMuYC9Wb2x1bWVzL015bWJsZS9vaGlvMV9sb25nLmNzdmhgIikKcHJ5cjo6bWVtX3VzZWQoKQpgYGAKCi4uLndpdGhvdXQgdXNpbmcgYW55IGFkZGl0aW9uYWwgbWVtb3J5LiBCdXQgSSBjYW4gc3RpbGwgZG8gdGhpczoKCmBgYHtyfQpuX3ZvdGVycyA8LSBvaGlvMSAlPiUgCiAgICBmaWx0ZXIodm90ZWQgIT0gIk5BIikgJT4lIAogICAgY291bnQoZWxlY3Rpb25fdHlwZSwgZWxlY3Rpb25fZGF0ZSkgJT4lIAogICAgYXJyYW5nZShkZXNjKGVsZWN0aW9uX2RhdGUpKSAlPiUgCiAgICBjb2xsZWN0KCkgICAgIyB1c2UgYGNvbGxlY3RgIHRvIGJyaW5nIHRhYmxlIGludG8gUgoKbl92b3RlcnMKYGBgCgpgYGB7cn0KbGlicmFyeShnZ3Bsb3QyKQoKZ2dwbG90KG5fdm90ZXJzLCBhZXMoZWxlY3Rpb25fZGF0ZSwgbiwgZmlsbCA9IGVsZWN0aW9uX3R5cGUpKSArIAogICAgZ2VvbV9jb2wod2lkdGggPSA1MCwgcG9zaXRpb24gPSAiaWRlbnRpdHkiKSArIAogICAgc2NhbGVfeV9jb250aW51b3VzKGxhYmVscyA9IHNjYWxlczo6Y29tbWFfZm9ybWF0KCkpICsgCiAgICBnZ3BvbW9sb2dpY2FsOjpzY2FsZV9maWxsX3BvbW9sb2dpY2FsKAogICAgICAgIGxhYmVscyA9IGZ1bmN0aW9uKHgpIHRvb2xzOjp0b1RpdGxlQ2FzZSh0b2xvd2VyKHgpKQogICAgKSArIAogICAgbGFicyh0aXRsZSA9ICJPSC0xIFZvdGVycyBwZXIgRWxlY3Rpb24iLCB4ID0gTlVMTCwgeSA9ICJOdW1iZXIgb2YgVm90ZXJzIiwKICAgICAgICAgZmlsbCA9IE5VTEwpICsgCiAgICBocmJydGhlbWVzOjp0aGVtZV9pcHN1bV9yYygpCmBgYAoKCldvbyEK