Examining Feather Read and Write Speeds

Thomas Jackson
April 10, 2017


Summary

Introduction

R can read and write many different file types, both with base R and with add-on packages (see the foreign and haven packages). Over a year ago, Wes McKinney and Hadley Wickham created the 'feather' package to introduce a new file type that is a "...fast, lightweight, and easy-to-use binary file format for storing data frames" [1]. Feather files are designed to be read and created by both R and Python. The Data Science Team at Centerfield could find no existing well-documented tests comparing Feather read/write speeds to those of other file types. Thus, we took it upon ourselves to test five commonly used file types in R to determine which has the fastest read/write times.

Testing was conducted on a Lenovo ThinkPad T560 Signature Edition laptop with an Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz processor and 16.0 GB (15.9 GB usable) of RAM.

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

loaded via a namespace (and not attached):
[1] tools_3.3.3

To minimize bias from other processing jobs, we created 10 files of each type for every data size, computed the mean and standard deviation of the read and write times, and constructed 95% confidence intervals for reporting. Two measurements are commonly used to describe the size of a data set: the number of rows or observations (n) and the number of columns or variables (p). In our tests, we examined data sizes from n = 500 to 10,000 in increments of 500 and from p = 50 to 200 in increments of 50.
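The timing procedure described above can be sketched in base R as follows. This is a minimal illustration, not the code used for the reported results: the function names (make_data, time_writes, summarise_times) are hypothetical, saveRDS stands in for whichever writer is being timed, and the t-based interval is an assumption, since the report does not state how the 95% confidence intervals were constructed.

```r
# Sketch only: hypothetical helper names, not the original benchmark code.

# Build an n-by-p data frame of standard normal draws.
make_data <- function(n, p) {
  as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
}

# Write `reps` files with write_fun(df, path); return elapsed seconds per file.
time_writes <- function(write_fun, n, p, reps = 10) {
  vapply(seq_len(reps), function(i) {
    df <- make_data(n, p)
    path <- tempfile()
    on.exit(unlink(path), add = TRUE)  # remove the temporary file afterwards
    unname(system.time(write_fun(df, path))["elapsed"])
  }, numeric(1))
}

# Mean, standard deviation, and a t-based confidence interval (assumed form).
summarise_times <- function(times, level = 0.95) {
  k <- length(times)
  m <- mean(times)
  s <- sd(times)
  half <- qt(1 - (1 - level) / 2, df = k - 1) * s / sqrt(k)
  c(mean = m, sd = s, lower = m - half, upper = m + half)
}

set.seed(1)
# saveRDS is a base-R stand-in so the sketch runs without extra packages.
times <- time_writes(function(df, path) saveRDS(df, path), n = 500, p = 50)
summarise_times(times)
```

To time Feather instead, feather::write_feather could be passed as write_fun, assuming the feather package is installed; it takes the data frame and the path in the same order.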

Writing Files

Using the listed write commands, we compared the following five file types:

We first held p constant and allowed n to vary. The results with p=100 are in Figure 1.

Feather performed better than the other file types in this test. There is very little variation in write speed for Feather files regardless of the number of observations (mean of 0.0245 seconds across all sample sizes). We then held n constant and allowed p to vary. The results with n=5,000 are in Figure 2.

The results were similar: the number of variables has very little impact on the write speed of Feather files (mean of 0.0293 seconds across all values of p).

Reading Files

We performed similar tests for reading files. The five file types (and commands used) were:

Again, we first held p constant and allowed n to vary. When we set p=100, the results were similar to those seen when writing files (see Figure 3).

As before, we then fixed n=5,000 and let p vary. Results are in Figure 4.

Feather again demonstrated little variation regardless of p (mean of 0.0293 seconds across all values of p). Overall, there appears to be no linear relationship between speed and file size when writing and reading Feather files up to n = 10,000 and p = 200. Feather files are indeed very fast, and much faster than the other file types tested as n and p increase.

When Does Feather Slow Down?

Because of Feather's promising performance in our tests, we were inclined to ask, "What does it take to make Feather perform slowly?" To answer this, we wrote increasingly large files to a local directory until the mean time to write 10 files was statistically greater than 0.5 seconds. Starting with n=10,000 and p=200, we created and wrote out 10 files of size n by p. If the mean write time was not significantly greater than 0.5 seconds, n was increased by 500 and p was increased by 5, and the test was repeated.
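The search procedure above could look roughly like the following sketch. The function name grow_until_slow and its arguments are hypothetical, and the one-sided t-test at the 0.05 level is our assumed reading of "statistically greater than 0.5 seconds"; the report does not name the test used. The toy_timer below produces synthetic numbers purely to exercise the loop and is not measurement data.

```r
# Sketch only: hypothetical names; the one-sided t-test is an assumption.
grow_until_slow <- function(time_fun, n = 10000, p = 200,
                            n_step = 500, p_step = 5,
                            threshold = 0.5, alpha = 0.05, max_steps = 1000) {
  for (step in 0:max_steps) {
    times <- time_fun(n, p)  # numeric vector: one elapsed time per file
    # H0: mean time <= threshold; H1: mean time > threshold
    test <- t.test(times, mu = threshold, alternative = "greater")
    if (test$p.value < alpha) {
      return(list(n = n, p = p, steps = step, mean_time = mean(times)))
    }
    n <- n + n_step
    p <- p + p_step
  }
  stop("threshold not reached within max_steps")
}

# Synthetic timer for demonstration: time grows with n * p, plus fixed jitter
# so that the t-test has non-zero variance to work with.
toy_timer <- function(n, p) {
  n * p * 1e-7 + seq(-0.0045, 0.0045, length.out = 10)
}

result <- grow_until_slow(toy_timer)
result
```

In a real run, time_fun would write (or read) 10 files of size n by p and return the 10 elapsed times.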

n and p reached 45,500 and 555, respectively (file size of 193.401 MB), before the write times were statistically greater than 0.5 seconds (see Figure 5). This is not meant to define a given n by p size needed to reach statistical significance, but only to give an idea of how large a file must be before Feather takes a 'large' amount of time to write it out.

We performed a similar test for reading Feather files using the same starting and incremental values for n and p. For this test, n and p reached 42,000 and 520, respectively (file size of 174.356 MB), before read times were statistically greater than 0.5 seconds (see Figure 6). This, again, is not meant to define a given n by p size needed to reach statistical significance, but only to give an idea of how large a file must be before Feather takes a 'large' amount of time to read it in. Interestingly, the confidence intervals around the average read times were much narrower (more stable) than those around the average write times.
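As a quick arithmetic check, both reported stopping points lie on the grid implied by the starting values and increments (n = 10,000 + 500k and p = 200 + 5k for step k). The helper names below are illustrative only.

```r
# Sketch only: hypothetical helpers to check the stopping points against the
# stated start values (n = 10,000, p = 200) and increments (500 and 5).
steps_for <- function(n, n_start = 10000, n_step = 500) (n - n_start) / n_step
p_at <- function(k, p_start = 200, p_step = 5) p_start + p_step * k

k_write <- steps_for(45500)  # 71 increments before the write test stopped
k_read  <- steps_for(42000)  # 64 increments before the read test stopped

c(write = p_at(k_write), read = p_at(k_read))  # 555 and 520
```

So the write test stopped after 71 increments and the read test after 64.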

In both the write and read tests, there does appear to be a point beyond which a stronger linear relationship between time and file size emerges. Precisely where this relationship starts is interesting but was not investigated further in this report.

Conclusion

The Feather file type is very fast at both writing files from and reading files into the R environment, with little sensitivity to the number of observations (n) or the number of variables (p) across the sizes we tested. If files need to be written from R quickly, Feather is currently one of the best solutions. At Centerfield, this exploration sparked interest in using Feather files more broadly in our work. One can imagine file systems with large amounts of data stored in Feather files that both our R and Python data scientists could use with greater speed than similar systems based around JSON or .txt files.

Limitations

All attempts were made to minimize bias in this experiment, but some may still be present. First, the data used to populate all files were randomly drawn from the Gaussian distribution. These trials did not include complex, logical, or character data in any of the files. We assumed these other data types would not significantly alter the time it takes to write files from or read files into the R environment. Second, because the simulations are already computationally expensive for some of the file types, the number of files was restricted to 10 for each n by p combination. This allowed us to compute means, standard deviations, and confidence intervals, but a larger number of files would give a stronger estimate of the time involved. Finally, other file types may be available for testing. Although Feather performed best in these tests, this does not imply that it is the fastest method available for R to read and write files.


1. McKinney, Wes, & Hadley Wickham. "Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow." blog.rstudio.org/2016/03/29/feather. March 29, 2016.