Executive Summary

After testing the new Feather package and noticing how fast it is, I wanted to learn how Feather compares with other libraries, and whether its relative performance degrades on larger data sets.

Here are the highlights of the results:

Test Details

I tested 32 CSV files. Each file contains 80 variables. The smallest file is 50 megabytes and the largest is a bit under 2 gigabytes; each file is 50 megabytes larger than the previous one.

The measurements taken are:

  1. Time it takes to read the file into memory
  2. Time it takes to write the data into a file
  3. Size of the file
  4. Memory usage when file is loaded

The R Markdown with the full test details can be found in my GitHub account.
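As an illustration only, here is a minimal sketch of what one pass of the benchmark could look like; the file names and naming scheme are assumptions, not necessarily what the actual test used:

    library(feather)

    csv_files <- sprintf("test_%02d.csv", 1:32)   # assumed naming scheme

    results <- lapply(csv_files, function(csv_file) {
      feather_file <- sub("\\.csv$", ".feather", csv_file)

      # 1. Time to read the CSV into memory
      read_secs <- system.time(df <- read.csv(csv_file))[["elapsed"]]

      # 2. Time to write the same data out as a Feather file
      write_secs <- system.time(write_feather(df, feather_file))[["elapsed"]]

      data.frame(
        file       = csv_file,
        read_secs  = read_secs,
        write_secs = write_secs,
        file_mb    = file.size(feather_file) / 1024^2,      # 3. size of the file
        memory_mb  = as.numeric(object.size(df)) / 1024^2   # 4. memory usage when loaded
      )
    })

    results <- do.call(rbind, results)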

Results

1 - Time it takes to read the file into memory

The following plot traces the time it takes ‘read.csv’ and ‘fread’ to read each CSV file, and the time it takes ‘read_feather’ to load a Feather file containing the same data as the original CSV file.

To calculate the Performance Increase, I divided the time it took ‘read.csv’ and ‘fread’ to read each CSV file by the time it took ‘read_feather’ to load the Feather file containing the same data.
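As a rough sketch of that calculation in R (the file names here are made up, not the ones from the actual test):

    # Time the three readers on a single file pair
    csv_time     <- system.time(read.csv("test_01.csv"))[["elapsed"]]
    fread_time   <- system.time(data.table::fread("test_01.csv"))[["elapsed"]]
    feather_time <- system.time(feather::read_feather("test_01.feather"))[["elapsed"]]

    # Performance Increase = time of the CSV reader / time of read_feather
    perf_increase_read_csv <- csv_time   / feather_time
    perf_increase_fread    <- fread_time / feather_time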

2 - Time it takes to write the data into a file

Here is a comparison of the time it takes ‘write.csv’ and ‘write_feather’ to create files from the same data frame.
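A minimal sketch of that comparison, assuming a data frame ‘df’ is already in memory and using made-up output file names:

    # Time writing the same data frame with each function
    write_csv_time     <- system.time(write.csv(df, "test_01.csv", row.names = FALSE))[["elapsed"]]
    write_feather_time <- system.time(feather::write_feather(df, "test_01.feather"))[["elapsed"]]

    write_csv_time / write_feather_time   # how many times faster the Feather write is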

3 - Size of the file

A comparison of the size of the files that ‘write.csv’ and ‘write_feather’ created from the same data frame. The ‘Function’ label says ‘read.csv’ and ‘read_feather’ because the measurement was taken at the time of running those commands.
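With the same made-up file names as above, the size comparison itself is a one-liner per file:

    # On-disk size of each file, in megabytes
    file.size("test_01.csv") / 1024^2
    file.size("test_01.feather") / 1024^2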

4 - Memory usage when file is loaded

Here is a comparison of how much memory the data occupies once loaded via each of the commands.
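A sketch of how such a measurement can be taken with base R’s ‘object.size’, again with assumed file names:

    # In-memory size of the object returned by each reader
    df_csv     <- read.csv("test_01.csv")
    df_fread   <- data.table::fread("test_01.csv")
    df_feather <- feather::read_feather("test_01.feather")

    object.size(df_csv)
    object.size(df_fread)
    object.size(df_feather)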