After testing the new Feather package and noticing how fast it is, I wanted to learn how Feather compares with other libraries and if its comparative performance degrades on larger data sets.
Here are the highlights of the results:
I tested 32 CSV files. Each file contains 80 variables. The smallest file is 50 megabytes and the largest is a bit under 2 gigabytes. Each file is 50 megabytes larger than the previous one.
The measurements taken are:
The R Markdown file with the test details is available in my GitHub account.
The following plot traces the time it takes ‘read.csv’ and ‘fread’ to read the CSV files, and the time it takes ‘read_feather’ to load the Feather file that holds the same data as the original CSV file.
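As a rough sketch of how this kind of timing can be taken (not the exact script from my test), the snippet below reads the same data with all three functions and keeps the elapsed seconds. The small synthetic data frame, the temporary file paths and the variable names are assumptions made for illustration only; the real test used much larger files.

```r
# Sketch: time read.csv, fread and read_feather on the same data.
# The data frame is a small synthetic stand-in, not one of the test files.
library(data.table)  # provides fread
library(feather)     # provides read_feather and write_feather

set.seed(1)
df <- data.frame(id = 1:100000, x = rnorm(100000), y = runif(100000))

# Write the same data once as CSV and once as Feather.
csv_path     <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")
write.csv(df, csv_path, row.names = FALSE)
write_feather(df, feather_path)

# system.time() reports user, system and elapsed seconds; keep elapsed.
read_times <- c(
  read.csv     = system.time(read.csv(csv_path))[["elapsed"]],
  fread        = system.time(fread(csv_path))[["elapsed"]],
  read_feather = system.time(read_feather(feather_path))[["elapsed"]]
)
print(read_times)
```

On a file this small the timings are noisy, so a single run says little; repeating each read several times and averaging gives a steadier number.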
To calculate the Performance Increase, I divided the time it took ‘read.csv’ and ‘fread’ to read the CSV file by the time it took ‘read_feather’ to load the Feather file containing the same data.
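The division described above can be written as a one-line helper. This is a minimal sketch: the function name `performance_increase` is hypothetical, and the timings are placeholder values invented for the example, not results from the test.

```r
# Performance increase: CSV read time divided by Feather read time.
performance_increase <- function(csv_seconds, feather_seconds) {
  csv_seconds / feather_seconds
}

# Placeholder timings in seconds, invented for illustration only.
timings <- c(read.csv = 12, fread = 3, read_feather = 1.5)

increase <- performance_increase(
  timings[c("read.csv", "fread")],
  timings[["read_feather"]]
)
print(increase)
```

A value of 2 means the CSV reader took twice as long as ‘read_feather’ on the same data; a value below 1 would mean the CSV reader was faster.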
Here is a comparison of the time it takes ‘write.csv’ and ‘write_feather’ to create the files from the same data frame.
Here is a comparison of the sizes of the files that ‘write.csv’ and ‘write_feather’ created from the same data frame. The ‘Function’ label says ‘read.csv’ and ‘read_feather’ because the measurement was taken when those commands were run.
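For readers who want to reproduce the write-time and file-size comparison on their own data, here is a minimal sketch. It assumes the CSV is written with base R's ‘write.csv’, and it uses a small synthetic data frame and temporary paths that are not part of my test.

```r
# Sketch: compare write time and on-disk size for CSV and Feather.
library(feather)  # provides write_feather

set.seed(1)
df <- data.frame(id = 1:100000, x = rnorm(100000), y = runif(100000))

csv_path     <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")

# Elapsed seconds needed to write the same data frame in each format.
write_times <- c(
  write.csv     = system.time(write.csv(df, csv_path, row.names = FALSE))[["elapsed"]],
  write_feather = system.time(write_feather(df, feather_path))[["elapsed"]]
)

# file.size() returns bytes; convert to megabytes.
file_sizes_mb <- c(
  write.csv     = file.size(csv_path),
  write_feather = file.size(feather_path)
) / 1024^2

print(write_times)
print(file_sizes_mb)
```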
Here is a comparison of the size of the data loaded via each of the commands.
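One way to take this kind of measurement is with ‘object.size’, which estimates the memory an R object occupies. The sketch below assumes that approach; the synthetic data frame, the paths and the variable names are illustrative only.

```r
# Sketch: compare the in-memory size of the data returned by each reader.
library(data.table)  # provides fread
library(feather)     # provides read_feather and write_feather

set.seed(1)
df <- data.frame(id = 1:100000, x = rnorm(100000), y = runif(100000))

csv_path     <- tempfile(fileext = ".csv")
feather_path <- tempfile(fileext = ".feather")
write.csv(df, csv_path, row.names = FALSE)
write_feather(df, feather_path)

# Load the same data with each function.
loaded <- list(
  read.csv     = read.csv(csv_path),
  fread        = fread(csv_path),
  read_feather = read_feather(feather_path)
)

# object.size() reports bytes; convert to megabytes.
loaded_mb <- sapply(loaded, function(x) as.numeric(object.size(x)) / 1024^2)
print(loaded_mb)
```

The three functions return different classes (a data frame, a data.table and a tibble), so small differences in size can come from the container and the inferred column types as well as from the data itself.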