The most important limitation of R is that it cannot handle data sets larger than roughly half the available RAM, because it loads each data set in its entirety into memory. This makes sense from a performance viewpoint for medium-sized data sets, but it is a serious limitation for big data.

Fortunately, there is a workaround that enables R to work with massive data sets. This proof of concept (POC) explains how.

The solution is based on two important concepts.

Firstly, for most of the tasks that one usually performs in R, such as exploratory data analysis and testing models, it makes no sense to use the entire data set. A random sample produces valid conclusions and runs much faster, saving precious time.
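
To illustrate this first point with a small in-memory sketch (simulated numbers only, unrelated to the 100-million-row data set used below), a 0.1% random sample of a simulated table gives a regression fit very close to the full-data fit:

# Small in-memory sketch: a 0.1% random sample recovers nearly the same fit
set.seed(1)
n_demo = 1e6                          # 1 million rows, small enough for RAM
demo = data.frame(X = 1:n_demo)
demo$Y = demo$X + rnorm(n_demo, sd = 2e5)
coef(lm(Y ~ X, data = demo))                            # full data
coef(lm(Y ~ X, data = demo[sample(n_demo, 1000), ]))    # 0.1% sample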

Secondly, for a production scenario where it does make sense to use the entire data set, there are packages that enable this. In this POC we demonstrate the RevoScaleR package.

To store the data, it is much better to use a relational database management system (RDBMS), which is designed for heavy-duty data storage and retrieval, than a flat CSV file. The key advantage of an RDBMS is that we can send it SQL queries to extract exactly the sample of data that we want, something that cannot be done with a CSV file.
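
As a simple illustration (assuming a DBI connection conn to the database, as set up in the first code block below), SQL Server can return an approximate random sample directly, without ever loading the full table into R:

# Illustration only: pull an approximate 0.1% sample straight from the server.
# TABLESAMPLE samples data pages, so the row count is approximate; the POC below
# uses a temp-table join instead to obtain an exact sample size.
sql_quick = "select ID, X, Y from XYData tablesample (0.1 percent)"
df_quick = dbGetQuery(conn = conn, statement = sql_quick)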

For this POC, we create a table of 100 million rows and store it in a SQL Server database. (A preliminary test that attempted to create a data frame of 1 billion rows directly in R crashed my laptop; this POC uses 100 million rows rather than 1 billion simply to keep running times reasonable.) We use R to generate the data with a Monte Carlo simulation from the following simple linear model:

\[Y_i = X_i + \epsilon_i\]

where \(\epsilon_i\) is normally distributed noise. The details and the script that generate this simulated data set are in the appendix.

1) Extract Sample for Exploratory Data Analysis and Testing

Given a data set of 100 million rows in the database (too big to load into R), we extract a random sample of 0.1% of the rows: just 100,000. To do this, we create a temporary table on the server containing the row numbers of the random sample, and then send the server a SQL query with an INNER JOIN between the data table and this temporary table. The R code is as follows:

# Packages
library(DBI)
library(odbc)
library(ggplot2)

# Parameters
driver = "ODBC Driver 17 for SQL Server"
server = "localhost"
database = "POC_BigDataR"
table = "XYData"
uid = "sa"
pwd = "1234"
n = 1e8 # Total N° of rows (100 million)
k = n/100 # N° of rows in each chunk (1 million)

# Establish connection
conn = dbConnect(  drv = odbc()
                   , Driver = driver
                   , Server = server
                   , Database = database
                   , UID = uid
                   , PWD = pwd
                   , Port = 1433
)

# Extract random sample of ns rows. 17 sec
t0 = Sys.time()
ns = 100000 # Sample size
df = data.frame(RowNum = sample(1:n, ns))
dbWriteTable(conn = conn, name = "Temp_Sample", value = df, overwrite = T)
sqltxt = "select ID, X, Y from"
sqltxt = paste(sqltxt, "(select ROW_NUMBER() over (order by ID) as RowNum, * from XYData) as T")
sqltxt = paste(sqltxt, "inner join Temp_Sample as S on T.RowNum = S.RowNum")
df_sample = dbGetQuery(conn = conn, statement = sqltxt)
print(Sys.time() - t0)
## Time difference of 18.3333 secs

This only took about 20 seconds.

Now we can easily create a scatterplot and perform a linear regression with this sample, and by the principles of statistical sampling, the estimates will be close to those we would have obtained if we had used the entire data set.

# Scatterplot of a further sub-sample of 1,000 rows (enough to see the relationship)
nr = nrow(df_sample)
df = df_sample[sample(1:nr, 1000), ]
ggplot(data = df, mapping = aes(x=X, y=Y)) + geom_point()

lin.mod = lm(data = df_sample, formula = Y ~ X)
summary(lin.mod)
## 
## Call:
## lm(formula = Y ~ X, data = df_sample)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -84245842 -13534090    -38129  13483360  81150645 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.770e+04  1.262e+05    0.22    0.826    
## X           9.991e-01  2.185e-03  457.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19980000 on 99998 degrees of freedom
## Multiple R-squared:  0.6764, Adjusted R-squared:  0.6764 
## F-statistic: 2.09e+05 on 1 and 99998 DF,  p-value: < 2.2e-16

2) Perform a Regression with Entire Data Set Stored in RDBMS

In a production scenario we may want to periodically retrain our model with the entire data set. For this we use the RevoScaleR package, which has functions for linear regression and machine learning algorithms that automatically pull the data in chunks from the RDBMS server as they do their work.

The RevoScaleR package was developed by Revolution Analytics, which Microsoft acquired in 2015. Unfortunately, RevoScaleR on its own is not available on CRAN or from any other repository. It is bundled with Microsoft's distribution of R, Microsoft R Client, and the only way to obtain the package is to install Microsoft R Client and use that version of R. The download URL is:

https://aka.ms/rclient/

Microsoft R Client is essentially the same as standard R. However, it has some quirks that make it less user-friendly, which is why I personally prefer standard R for almost everything.

There are two options: 1) use only Microsoft R Client and install all the usual packages from CRAN, or 2) install standard R and Microsoft R Client side by side and use Microsoft R Client only for models trained on massive data sets. I prefer the second option: standard R for almost everything, including extracting a sample from a database and analysing it as discussed above, and Microsoft R Client only for training models on the full data set.

One can easily switch between the two R versions in the Global Options of RStudio.

With RevoScaleR, to perform a linear regression or train a model on data in a database, it is not necessary to load the data set into a data frame. (Indeed, we cannot, because of the memory constraint.) Instead, we pass a data-source object, built from a connection string, as an argument to the RevoScaleR regression and machine-learning functions. The functions then automatically pull the data from the database one chunk at a time. This is transparent to the user; one does not need to be involved in the details. The following code demonstrates this:

connectionString = "Driver={SQL Server Native Client RDA 11.0};Server=(local);Database=POC_BigDataR;Uid=sa;Pwd=1234"
sql_data = RxSqlServerData(table = "XYData", connectionString = connectionString) # This is just a pointer to the table in the SQL Server database
lin.model = rxLinMod(data = sql_data, formula = Y ~ X)
summary(lin.model)

Due to the large size of the data set on the server, the regression takes a while. Helpfully, all RevoScaleR functions print progress messages, so one is not left wondering whether the process will ever finish.

The output is:

Rows Read: 100000000, Total Rows Processed: 100000000, Total Chunk Time: 81.293 seconds
Computation time: 83.841 seconds.
Call:
rxLinMod(formula = Y ~ X, data = sql_data)

Linear Regression Results for: Y ~ X
Data: sql_data (RxSqlServerData Data Source)
Dependent variable(s): Y
Total independent variables: 2
Number of valid observations: 1e+08
Number of missing observations: 0

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept) -9.195e+03  4.000e+03    -2.299   0.0215 *  
X            1.000e+00  6.928e-05 14435.590 2.22e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2e+07 on 1e+08 degrees of freedom
Multiple R-squared: 0.6757
Adjusted R-squared: 0.6757
F-statistic: 2.084e+08 on 1 and 1e+08 DF, p-value: < 2.2e-16
Condition number: 1

As we can see, the regression statistics for the entire data set are extremely close to those of the sample.

For other algorithms such as decision trees, random forests and neural networks, RevoScaleR has functions that work the same way.
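
As a sketch (reusing the sql_data source object from above, with all arguments beyond formula and data left at their defaults), a decision tree or random forest could be trained like this:

# Sketch: other RevoScaleR learners accept the same data-source object
tree.model = rxDTree(formula = Y ~ X, data = sql_data)      # decision tree
forest.model = rxDForest(formula = Y ~ X, data = sql_data)  # random forest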

To conclude, we have seen two methods for handling massive data sets in R. As data sets grow ever larger, the ability to do this becomes increasingly important.

Appendix - Generate Simulated Data Set

To generate the simulated data set used in this POC, we proceed in two steps.

First, we create the table in SQL Server with SQL code:

create table XYData(
  ID bigint not null
, X float
, Y float
, constraint PK_Data primary key clustered (ID)
)

We then generate the data in R, in 100 chunks of 1 million rows each (so as not to overflow the RAM), appending each chunk to the SQL Server table. On my laptop this took 18 minutes.

# Generate data and append to SQL Server table, one chunk at a time. 18 min
t0 = Sys.time()
for (i in 1:100){
  print(paste0(i, "%"))
  df = data.frame(ID = ((i-1)*k+1):(i*k))
  df$X = df$ID
  df$Y = df$X + rnorm(k, sd=20*k)
  dbWriteTable(conn = conn, name = table, value = df, append = T)
  rm(df) # Free memory each time
}
print(Sys.time() - t0)