Some people still carry the notion that R is limited to the analysis of tiny data sets. We wanted to find out how much data is actually being analyzed in practice, so we surveyed our Twitter followers about the size of the data sets they work with.
The result was that most of these users were analyzing 1 million to 100 million records, with total data sizes of 1 GB to 25 GB.
We received 58 responses on Twitter. Of those respondents, 41 reported the number of records, 31 reported the number of columns, and 26 reported the data set size. The median number of records was 10 million, the median number of columns was 30, and the median data set size was 18.5 GB.
How big is my data set? The convention for sizing analytic files is the number of records; that is what most people report when asked about data size. For R, however, the most important metric is total data size in memory, which depends on the number of columns and the column types as well as the number of records. We want an equation that translates the number of records into an approximate data size in memory.
One way to estimate total size is to multiply the average cell size by the number of records and columns. Another way is to multiply the average record size by the number of records. For example, if we assume each cell is 8 bytes and each record has 32 columns, then the average record size is 256 bytes. A file with 10 MM records therefore equates to roughly 2.56 GB.
\[ \textbf{Cell Estimate: } \textrm{8 bytes/cell} \times \textrm{32 columns} \times \textrm{10,000,000 records} = \textrm{2.56 GB} \]
\[ \textbf{Row Estimate: } \textrm{256 bytes/record} \times \textrm{10,000,000 records} = \textrm{2.56 GB} \]
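The same back-of-the-envelope arithmetic is easy to script. Here is a minimal sketch in R, assuming 8 bytes per cell; the helper name `estimate_size_gb` is ours, not part of the survey.

```r
# Rough in-memory size estimate: records x columns x bytes per cell
# (8 bytes per cell is an assumption; actual size depends on column types)
estimate_size_gb <- function(records, columns, bytes_per_cell = 8) {
  records * columns * bytes_per_cell / 1e9
}

estimate_size_gb(10e6, 32)   # 10 MM records x 32 columns -> 2.56 GB
```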
We looked at four files to estimate cell and record size. Average cell size ranged from 3 to 12 bytes per cell, and average record size ranged from 45 to 671 bytes per record.
| Source | Size (GB) | Records | Columns | Bytes per Cell | Bytes per Record |
|---|---|---|---|---|---|
| User event stream | 70.1 | 167,365,820 | 34 | 12.32 | 418 |
| Site weblogs | 3.6 | 5,368,598 | 156 | 4.30 | 671 |
| Analytic Data | 3.5 | 78,000,000 | 6 | 7.48 | 45 |
| Air On Time | 0.62 | 7,140,596 | 29 | 3.00 | 87 |
Notice that the variation in record sizes is far greater than the variation in cell sizes. Clearly, the number of columns matters; the number of records alone is not a sufficient measure of data size.
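If you would rather measure than estimate, R can report the in-memory size of a data frame directly with `object.size()`. The sketch below uses a small made-up data frame to show how the per-cell and per-record averages in the table can be derived for your own data.

```r
# Measure actual in-memory size of a data frame and derive the same averages
# as the table above (df here is a small made-up example, not one of the four files)
df <- data.frame(x = rnorm(1e6),
                 y = sample(letters, 1e6, replace = TRUE))

total_bytes <- as.numeric(object.size(df))
total_bytes / (nrow(df) * ncol(df))   # average bytes per cell
total_bytes / nrow(df)                # average bytes per record
```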
The size of a file also varies with file format. We used the popular Air On Time data set to measure file size across various storage techniques. The Air On Time data for 2005 has 7,140,596 records and 29 columns (5 character and 24 numeric). The “flat” file in CSV format is roughly 640 MB. We compared this data set across several formats: compressed, serialized, in memory (RAM), and in a database.
| Format | Bytes | Megabytes |
|---|---|---|
| RDS (bzip2) | 74,416,359 | 74 |
| RDat (9-bzip2) | 101,815,577 | 102 |
| RDS (default) | 111,664,784 | 112 |
| RDat (6-gzip) | 111,664,856 | 112 |
| Bzip2 (download) | 112,450,321 | 112 |
| CSV | 671,027,265 | 671 |
| RAM (as.factor) | 828,633,592 | 829 |
| RAM (as.is) | 971,398,176 | 971 |
| Postgres | 1,007,419,392 | 1,007 |
| RDat (default) | 1,069,527,811 | 1,070 |
In this case, the Air On Time data grew when loaded into RAM and grew even larger when loaded into a Postgres database. The compressed formats were all smaller than the CSV format, and the largest of all was the default R output format (RDat default).
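A comparison along these lines can be reproduced with base R alone. The sketch below assumes a local copy of the 2005 Air On Time CSV (the path is a placeholder) and writes a few of the formats from the table; exact byte counts will vary with R version and compression settings, and the Postgres size would need to be measured in the database itself.

```r
# Reproduce a rough version of the comparison above
# ("2005.csv" is a placeholder path for the 2005 Air On Time file)
air_chr <- read.csv("2005.csv", stringsAsFactors = FALSE)   # strings as character
air_fac <- read.csv("2005.csv", stringsAsFactors = TRUE)    # strings as factors

print(object.size(air_chr), units = "MB")                   # RAM (as.is)
print(object.size(air_fac), units = "MB")                   # RAM (as.factor)

saveRDS(air_fac, "air_default.rds")                         # RDS, default compression
saveRDS(air_fac, "air_bzip2.rds", compress = "bzip2")       # RDS, bzip2
save(air_fac, file = "air_9bzip2.RData",
     compress = "bzip2", compression_level = 9)             # RDat, 9-bzip2

file.size(c("air_default.rds", "air_bzip2.rds",
            "air_9bzip2.RData")) / 1e6                      # on-disk sizes in MB
```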
R is not only for tiny amounts of data. It is regularly used to analyze data in the tens of gigabytes, with hundreds of millions of rows, and it is likely that some users are analyzing well over 100 GB of data in memory on a single R instance.
Also, keep in mind that R is used for analytic data that has been cleansed, filtered, and aggregated. While tens of gigabytes might not be much raw event data by today’s standards, it is still a lot for many analytic problems.
For example, say you want to analyze site behavior using weblogs. The data warehouse might have terabytes of data, but chances are the analysis you’re interested in will only require a few months and a few important columns. You might even aggregate the data from an event level to a session or user level. By the time you have extracted your analytic data set, you can easily reduce your data by 100X.
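As an illustration of that kind of reduction, here is a sketch using dplyr on a toy weblog table; the table and column names are invented, and the point is simply selecting a time window and a few columns, then rolling events up to one row per user.

```r
library(dplyr)

# Toy event-level weblog data standing in for a warehouse extract
# (all table and column names here are made up for illustration)
weblogs <- data.frame(
  user_id    = sample(1:1000, 1e5, replace = TRUE),
  event_time = as.Date("2015-01-01") + sample(0:180, 1e5, replace = TRUE),
  page       = sample(c("home", "search", "checkout"), 1e5, replace = TRUE)
)

analytic <- weblogs %>%
  filter(event_time < as.Date("2015-04-01")) %>%       # keep a few months
  select(user_id, event_time, page) %>%                # keep a few columns
  group_by(user_id) %>%
  summarise(active_days = n_distinct(event_time),      # aggregate to the user level
            page_views  = n())

nrow(weblogs) / nrow(analytic)   # rough reduction factor
```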
We also saw that we can create a rough estimate of data size by multiplying the number of cells by 8 bytes per cell, or the number of rows by 256 bytes per row. Finally, we saw that some CSV files increase in size when loaded into RAM.
When sizing hardware, make sure to give yourself plenty of extra memory for scratch work and overhead. For example, if you’re working with 12 GB files, you will probably want at least 48 GB of RAM available to do all the required data manipulation and analysis.
KDNuggets conducts an annual survey asking the question: What was the largest dataset you analyzed / data mined? The survey is not R specific, but it does reflect the size of data analyzed versus data warehoused. They conclude that “over 50% of answers are in the Gigabyte range (median answer between 11 and 100 GB for each year 2012-14).”