Feather Performance Analysis

Executive Summary

After testing the new Feather package and noticing how fast it is, I wanted to learn how Feather compares with other libraries and whether its comparative performance degrades on larger data sets.

Here are the highlights of the results:

On average, Feather read the files 218 times faster than ‘read.csv’ and 33 times faster than ‘fread’.
‘read.csv’ reads 1 gigabyte of CSV data in 80 seconds, ‘fread’ in 12 seconds and Feather in less than half a second.
‘write.csv’ took 90 seconds to write 1 gigabyte of CSV data; writing the same data frame with Feather took 0.67 seconds.
Feather kept its performance even on the largest files tested.
Feather’s files were consistently half the size of the CSV files.
When loaded into memory, the data read by ‘read.csv’ and by Feather were the same size. The data read by ‘fread’ was consistently larger.

Test Details

I tested 32 CSV files. Each file contains 80 variables. The smallest file is 50 megabytes and the largest is a bit under 2 gigabytes. Each file is 50 megabytes larger than the previous one.

The measurements taken are:

Time it takes to read the file into memory
Time it takes to write the data into a file
Size of the file
Memory usage when file is loaded

The R Markdown file with the test details can be found in my GitHub account.

Results

1 - Time it takes to read the file into memory

The following plot traces the time it takes ‘read.csv’ and ‘fread’ to read the CSV files, and the time it takes ‘read_feather’ to load the Feather file that holds the same data as the original CSV file.
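As a rough illustration of how read timings like these can be collected, here is a minimal sketch. It is not the script used for these tests. It assumes the data.table and feather packages are installed, and the small generated data frame and temporary file paths are hypothetical stand-ins for the actual test files.

```r
library(data.table)  # provides fread()
library(feather)     # provides read_feather() and write_feather()

# Hypothetical stand-in data; the real tests used far larger files.
set.seed(1)
df <- data.frame(
  id    = 1:10000,
  value = rnorm(10000),
  group = sample(letters, 10000, replace = TRUE)
)

# Write the same data frame once as CSV and once as Feather.
csv_path     <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")
write.csv(df, csv_path, row.names = FALSE)
write_feather(df, feather_path)

# system.time() reports user, system and elapsed time; keep elapsed seconds.
read_times <- c(
  read.csv     = system.time(read.csv(csv_path))[["elapsed"]],
  fread        = system.time(fread(csv_path))[["elapsed"]],
  read_feather = system.time(read_feather(feather_path))[["elapsed"]]
)
print(read_times)
```

On a file this small the timings will be close to zero and noisy, so repeating each read several times and averaging would give steadier numbers.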

To calculate the Performance Increase, I divided the time it took ‘read.csv’ and ‘fread’ to read the CSV file by the time it took ‘read_feather’ to load the Feather file that holds the same data.
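The division described above can be sketched in a few lines of R. The timings here are made-up values chosen for easy arithmetic, not measured results.

```r
# Hypothetical elapsed times in seconds for one file (not measured results).
read_times <- c(read.csv = 40, fread = 6, read_feather = 0.2)

# Performance Increase = time taken by the CSV reader / time taken by read_feather
performance_increase <-
  read_times[c("read.csv", "fread")] / read_times[["read_feather"]]
print(performance_increase)
```

With these made-up numbers the ratios are 200 for ‘read.csv’ and 30 for ‘fread’, meaning ‘read_feather’ would be 200 and 30 times faster, respectively.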

2 - Time it takes to write the data into a file

Here is a comparison of the time it takes ‘write.csv’ and ‘write_feather’ to create the files from the same data frame.
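A write comparison of this kind could be timed along the following lines. This is a sketch under assumptions: the feather package is installed, and the small data frame and temporary paths are hypothetical placeholders for the real test data.

```r
library(feather)  # provides write_feather()

# Hypothetical stand-in data frame.
set.seed(1)
df <- data.frame(id = 1:10000, value = rnorm(10000))

csv_path     <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")

# Time each writer on the same data frame; keep elapsed seconds.
write_times <- c(
  write.csv     = system.time(write.csv(df, csv_path, row.names = FALSE))[["elapsed"]],
  write_feather = system.time(write_feather(df, feather_path))[["elapsed"]]
)
print(write_times)
```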

3 - Size of the file

Here is a comparison of the sizes of the files that ‘write.csv’ and ‘write_feather’ created from the same data frame. The ‘Function’ label says ‘read.csv’ and ‘read_feather’ because the measurement was taken at the time of running those commands.
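File sizes can be compared with base R's file.size(), as in the sketch below. It assumes the feather package is installed and uses a small hypothetical data frame, so the ratio it prints will not necessarily match the one reported in this analysis.

```r
library(feather)  # provides write_feather()

# Hypothetical stand-in data frame.
set.seed(1)
df <- data.frame(id = 1:10000, value = rnorm(10000))

csv_path     <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")
write.csv(df, csv_path, row.names = FALSE)
write_feather(df, feather_path)

# file.size() returns the size of each file in bytes.
sizes <- file.size(c(csv_path, feather_path))
names(sizes) <- c("csv", "feather")
print(sizes)

# Feather file size as a fraction of the CSV file size.
print(sizes[["feather"]] / sizes[["csv"]])
```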

4 - Memory usage when file is loaded

Here is a comparison of the size of the data loaded via each of the commands.
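One way to measure in-memory size is base R's object.size(), sketched below. This assumes that object.size() is an acceptable proxy for memory usage, which may differ from how the figures in this analysis were obtained. It also assumes the data.table and feather packages are installed, and the data is a hypothetical stand-in.

```r
library(data.table)  # provides fread()
library(feather)     # provides read_feather() and write_feather()

# Hypothetical stand-in data frame written in both formats.
set.seed(1)
df <- data.frame(id = 1:10000, value = rnorm(10000))
csv_path     <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")
write.csv(df, csv_path, row.names = FALSE)
write_feather(df, feather_path)

# Load the same data with each reader.
loaded <- list(
  read.csv     = read.csv(csv_path),
  fread        = fread(csv_path),
  read_feather = read_feather(feather_path)
)

# object.size() estimates the memory used by each object, in bytes.
memory_bytes <- sapply(loaded, function(x) as.numeric(object.size(x)))
print(memory_bytes)
```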