Recently, my co-author and I gave a presentation at NYC Data Science Academy on options for handling large data sets in R.
You can watch the presentation here.
This blog post gives an overview of that presentation, covering the available options for processing large data sets efficiently in R.
R reads the entire data set into RAM at once, whereas other programs can read sections of a file on demand.
R objects live entirely in memory.
R does not have a native int64 data type.
It is not possible to index objects with huge numbers of rows and columns, even on 64-bit systems (each dimension is limited to 2^31 - 1, about 2 billion, elements). In practice, R hits a file size limit at around 2-4 GB.
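The integer limit mentioned above can be inspected directly in base R. The snippet below is a minimal sketch; the variable names are illustrative and not part of the original presentation.

```r
# Largest value R's native (32-bit) integer type can hold.
max.int <- .Machine$integer.max

# Integer arithmetic past that limit overflows to NA (with a warning).
overflow <- suppressWarnings(max.int + 1L)

# Mixing in a double avoids the overflow, because the result is stored as a double.
as.dbl <- max.int + 1

print(max.int)
print(overflow)
print(as.dbl)
```

Because there is no native 64-bit integer type, values beyond this limit are usually held as doubles, which represent whole numbers exactly only up to 2^53.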
We can group large data sets in R into two broad categories:
Medium-sized files that can be loaded in R (within the memory limit, but processing is cumbersome; typically in the 1-2 GB range).
Large files that cannot be loaded in R due to the R / OS limitations discussed above. We can further split this group into two subgroups:
- Large files (typically 2-10 GB) that can still be processed locally using some workaround solutions.
- Very large files (> 10 GB) that need distributed large-scale computing.
We will go through the solution approach for each of these situations in the following sections.
Try to reduce the size of the file before loading it into R.
If you are loading xls files, you can select the specific columns required for analysis instead of loading the entire data set.
read.csv has no direct column-selection argument for csv or text files (although a column whose class is set to "NULL" in colClasses is skipped), so you might want to pre-process the data on the command line using the cut or awk commands and keep only the data required for analysis.
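Within R itself, one option is to skip unwanted columns at read time: read.csv drops any column whose entry in colClasses is "NULL". The sketch below uses a small, made-up data set written to a temporary file, so the column names and values are assumptions for illustration only.

```r
# Write a small hypothetical CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5,
                     name = letters[1:5],
                     score = c(10, 20, 30, 40, 50)),
          tmp, row.names = FALSE)

# "NULL" skips a column entirely; NA lets read.csv guess the class.
subset.df <- read.csv(tmp, colClasses = c(NA, "NULL", NA))

print(names(subset.df))
```

The skipped column is never stored in the resulting data frame, which helps when only a few columns of a wide file are needed.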
Pre-allocate the number of rows and pre-define the column classes.
Read optimization example:
Read in a few records of the input file, identify the class of each column, and pass those classes as colClasses when reading the entire data set.
Calculate the approximate row count of the data set based on the size of the file and the number of fields per row (or using wc on the command line), and set the nrows= parameter.
Set the comment.char parameter (comment.char = "" turns off comment scanning).
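library(dplyr)  # provides as_tibble(); tbl_df() is deprecated in current dplyr
# Read a small sample to infer the column classes.
bigfile.sample <- read.csv("data/SAT_Results2014.csv",
                           stringsAsFactors = FALSE, header = TRUE, nrows = 20)
bigfile.colclass <- sapply(bigfile.sample, class)
# Read the file with known classes; nrows is the maximum number of rows to read.
bigfile.raw <- as_tibble(read.csv("data/SAT_Results2014.csv",
                                  stringsAsFactors = FALSE, header = TRUE, nrows = 10000,
                                  colClasses = bigfile.colclass, comment.char = ""))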
These simple changes will significantly speed up loading in R.
Alternatively, use the fread function from the data.table package.
The following benchmark shows the optimization steps while reading a file and the relative performance improvement achieved.
url <- "./311_Service_2014.csv"
#File size (MB) : 844
#1,844,515 rows 52 columns
# Standard read.csv ####
# ==========================================================================
system.time(DF1 <- read.csv(url,stringsAsFactors=FALSE))
#user system elapsed
#243.38 5.49 249.73
# Optimized read.csv ####
# ==========================================================================
system.time(length(readLines(url)))
#Number of lines : 1844516
#user system elapsed
#106.56 2.47 109.63
classes <- c("numeric",rep("character",48),rep("numeric",2), "character")
system.time(DF2 <- read.csv(url, header = TRUE, sep = ",", stringsAsFactors = FALSE, nrow = 1844516, colClasses = classes))
#user system elapsed
#173.73 3.43 182.73
# fread ####
# ==========================================================================
library(data.table)
system.time(DT1 <- fread(url))
#user system elapsed
#80.10 1.09 81.30
# Summary ####
# ==========================================================================
##   user  system  elapsed  Method
## 243.38    5.49   249.73  read.csv (first time)
## 173.73    3.43   182.73  Optimized read.csv
##  80.10    1.09    81.30  fread
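The line count in the benchmark above calls readLines on the whole file, which pulls every line into memory at once. One possible alternative, sketched here with a hypothetical helper named count_lines, is to read the file through a connection in fixed-size chunks so that only one chunk is held at a time. This is an illustration on a small temporary file, not a measured improvement.

```r
# Hypothetical helper: count lines by reading a connection chunk by chunk.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0  # stored as a double, so it cannot overflow the integer limit
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    total <- total + length(chunk)
  }
  total
}

# Demonstrate on a small temporary CSV (header plus 25 data rows).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:25, y = rnorm(25)), tmp, row.names = FALSE)
n.lines <- count_lines(tmp, chunk_size = 10L)
print(n.lines)
```

Subtracting one for the header row gives a value that can be passed to nrows= in read.csv.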
Where it suits your processing requirements, use pipe operators to chain processing steps and overwrite the object with the intermediate results, which minimizes duplicate copies of the data set across processing steps.
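As a minimal sketch of this idea, the example below chains two steps with the native pipe (|>, which assumes R 4.1 or later) and assigns the result back to the same name, so no separately named intermediate data frames are left behind. The data and column names are made up for illustration.

```r
# Small hypothetical data set.
scores <- data.frame(school = c("A", "B", "C", "D"),
                     math = c(410, 520, NA, 480))

# Chain the steps and overwrite the original object with the result.
scores <- scores |>
  subset(!is.na(math)) |>
  transform(math.scaled = math / 800)

print(scores)
```

R may still create temporary copies internally while each step runs; the saving here is that no intermediate objects stay in the workspace afterwards.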
Parallel Processing
Parallel processing runs several computations at the same time and takes advantage of multiple cores or CPUs on a single system or across systems. The following R packages are used for parallel processing in R.
Explicit parallelism (user controlled), for example:
- Rmpi (Message Passing Interface)
- snow (Simple Network of Workstations)
Implicit parallelism (system abstraction), for example:
- doMC / foreach
Given below is an example of multi-core registration using doMC:
# enable parallel processing for computationally intensive operations.
library(doMC)
registerDoMC(cores = 4)
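To show what a parallel computation itself can look like, here is a small sketch that uses mclapply from the parallel package that ships with R, rather than doMC/foreach. It is a swapped-in, fork-based approach chosen so the example runs without extra packages; the worker count and the toy computation are assumptions for illustration.

```r
library(parallel)

# Forking is not available on Windows, so fall back to a single core there.
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply a function to each element, spreading the work across the cores.
squares <- mclapply(1:8, function(i) i^2, mc.cores = n.cores)

# Combine the per-element results.
total <- sum(unlist(squares))
print(total)
```

For real workloads, each task should be expensive enough to outweigh the overhead of starting and coordinating the worker processes.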
For large data sets (typically 2-10 GB) that are too big for in-memory processing but too small to justify distributed computing, the following R packages come in handy.
bigmemory
bigmemory is part of the "big" family, which consists of several packages that perform analysis on large data sets. bigmemory provides several matrix objects, but we will focus only on big.matrix.
big.matrix is an R object that uses a pointer to a C++ data structure. The location of the pointer to the C++ matrix can be saved to disk or RAM and shared with other users in different sessions.
By loading the pointer object, users can access the data set without reading the entire set into R.
The following sample code will give a better understanding of how to use bigmemory:
example
# User / Session 1
library(bigmemory)
library(biganalytics)
library(bigtabulate)
#Create big.matrix
setwd("/Users/sundar/dev")
school.matrix <- read.big.matrix(
"./numeric_matrix_SAT__College_Board__2010_School_Level_Results.csv",
type ="integer", header = TRUE, backingfile = "school.bin",
descriptorfile ="school.desc", extraCols =NULL)
# Get the location of the pointer to school.matrix.
desc <- describe(school.matrix)
str(school.matrix)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
## ..@ address:<externalptr>
# process big matrix in active session.
colsums.session1 <- sum(as.numeric(school.matrix[,3]))
colsums.session1
## [1] 67147
# save the location to disk to share the object .
dput(desc , file="/tmp/A.desc")
# Session 2
setwd("/Users/sundar/dev")
library (bigmemory)
library (biganalytics)
# Read the pointer from disk .
shared.desc <- dget("/tmp/A.desc")
# Attach to the pointer in RAM.
shared.bigobject <- attach.big.matrix(shared.desc)
# Check our results .
colsums.session2 <- sum(shared.bigobject[,3])
colsums.session2
## [1] 67147
As one can see, bigmemory is a powerful option for reading and processing big files. The object can be shared across sessions as a pointer to the matrix, and it can be treated like a normal R data object.
However, bigmemory has a limitation: C++ matrices allow only one type of data, so the whole data set has to be of a single class.
That leads us to the next package for handling large data sets in R.
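One possible workaround for the single-type restriction, sketched below in base R, is to encode non-numeric columns as integer codes before writing a numeric file for read.big.matrix. The data frame here is made up, and whether integer codes are acceptable depends on the analysis; this is an assumption for illustration, not a step taken from the presentation.

```r
# Hypothetical mixed-type data: one factor column and one integer column.
mixed <- data.frame(school = factor(c("A", "B", "A")),
                    takers = c(29L, 91L, 70L))

# data.matrix() replaces factors with their integer codes,
# producing a matrix with a single storage type.
numeric.matrix <- data.matrix(mixed)

print(storage.mode(numeric.matrix))
print(numeric.matrix)
```

The mapping from codes back to labels (levels(mixed$school)) has to be kept separately, because the matrix stores only the integers.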
ff
ff is another package for dealing with large data sets, similar to bigmemory. It also uses a pointer, but to a flat binary file stored on disk, and it can be shared across different sessions.
One advantage ff has over bigmemory is that it supports multiple data class types in the same data set.
example
library(ff)
# creating the file
school.ff <- read.csv.ffdf(file="/Users/sundar/dev/mixed_matrix_SAT__College_Board__2010_School_Level_Results.csv")
#creates a ffdf object
class(school.ff)
## [1] "ffdf"
# ffdf is a virtual dataframe
str(school.ff)
## List of 3
## $ virtual: 'data.frame': 5 obs. of 7 variables:
## .. $ VirtualVmode : chr "integer" "integer" "integer" "integer" ...
## .. $ AsIs : logi FALSE FALSE FALSE FALSE FALSE
## .. $ VirtualIsMatrix : logi FALSE FALSE FALSE FALSE FALSE
## .. $ PhysicalIsMatrix : logi FALSE FALSE FALSE FALSE FALSE
## .. $ PhysicalElementNo: int 1 2 3 4 5
## .. $ PhysicalFirstCol : int 1 1 1 1 1
## .. $ PhysicalLastCol : int 1 1 1 1 1
## .. - attr(*, "Dim")= int 157 5
## .. - attr(*, "Dimorder")= int 1 2
## $ physical: List of 5
## .. $ characters : list()
## .. ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>
## .. .. ..- attr(*, "vmode")= chr "integer"
## .. .. ..- attr(*, "maxlength")= int 157
## .. .. ..- attr(*, "pattern")= chr "ffdf"
## .. .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd045531d5b.ff"
## .. .. ..- attr(*, "pagesize")= int 65536
## .. .. ..- attr(*, "finalizer")= chr "close"
## .. .. ..- attr(*, "finonexit")= logi TRUE
## .. .. ..- attr(*, "readonly")= logi FALSE
## .. .. ..- attr(*, "caching")= chr "mmnoflush"
## .. ..- attr(*, "virtual")= list()
## .. .. ..- attr(*, "Length")= int 157
## .. .. ..- attr(*, "Symmetric")= logi FALSE
## .. .. ..- attr(*, "Levels")= chr "aabc"
## .. .. ..- attr(*, "ramclass")= chr "factor"
## .. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
## .. $ Number.of.Test.Takers: list()
## .. ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>
## .. .. ..- attr(*, "vmode")= chr "integer"
## .. .. ..- attr(*, "maxlength")= int 157
## .. .. ..- attr(*, "pattern")= chr "ffdf"
## .. .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd053ac64eb.ff"
## .. .. ..- attr(*, "pagesize")= int 65536
## .. .. ..- attr(*, "finalizer")= chr "close"
## .. .. ..- attr(*, "finonexit")= logi TRUE
## .. .. ..- attr(*, "readonly")= logi FALSE
## .. .. ..- attr(*, "caching")= chr "mmnoflush"
## .. ..- attr(*, "virtual")= list()
## .. .. ..- attr(*, "Length")= int 157
## .. .. ..- attr(*, "Symmetric")= logi FALSE
## .. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
## .. $ Critical.Reading.Mean: list()
## .. ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>
## .. .. ..- attr(*, "vmode")= chr "integer"
## .. .. ..- attr(*, "maxlength")= int 157
## .. .. ..- attr(*, "pattern")= chr "ffdf"
## .. .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd05b15ab37.ff"
## .. .. ..- attr(*, "pagesize")= int 65536
## .. .. ..- attr(*, "finalizer")= chr "close"
## .. .. ..- attr(*, "finonexit")= logi TRUE
## .. .. ..- attr(*, "readonly")= logi FALSE
## .. .. ..- attr(*, "caching")= chr "mmnoflush"
## .. ..- attr(*, "virtual")= list()
## .. .. ..- attr(*, "Length")= int 157
## .. .. ..- attr(*, "Symmetric")= logi FALSE
## .. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
## .. $ Mathematics.Mean : list()
## .. ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>
## .. .. ..- attr(*, "vmode")= chr "integer"
## .. .. ..- attr(*, "maxlength")= int 157
## .. .. ..- attr(*, "pattern")= chr "ffdf"
## .. .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd06b9bd698.ff"
## .. .. ..- attr(*, "pagesize")= int 65536
## .. .. ..- attr(*, "finalizer")= chr "close"
## .. .. ..- attr(*, "finonexit")= logi TRUE
## .. .. ..- attr(*, "readonly")= logi FALSE
## .. .. ..- attr(*, "caching")= chr "mmnoflush"
## .. ..- attr(*, "virtual")= list()
## .. .. ..- attr(*, "Length")= int 157
## .. .. ..- attr(*, "Symmetric")= logi FALSE
## .. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
## .. $ Writing.Mean : list()
## .. ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>
## .. .. ..- attr(*, "vmode")= chr "integer"
## .. .. ..- attr(*, "maxlength")= int 157
## .. .. ..- attr(*, "pattern")= chr "ffdf"
## .. .. ..- attr(*, "filename")= chr "/private/var/folders/c2/8484fxfn3x30bhw7_3skdc6r0000gp/T/RtmptObzLr/ffdf6dd04425cc59.ff"
## .. .. ..- attr(*, "pagesize")= int 65536
## .. .. ..- attr(*, "finalizer")= chr "close"
## .. .. ..- attr(*, "finonexit")= logi TRUE
## .. .. ..- attr(*, "readonly")= logi FALSE
## .. .. ..- attr(*, "caching")= chr "mmnoflush"
## .. ..- attr(*, "virtual")= list()
## .. .. ..- attr(*, "Length")= int 157
## .. .. ..- attr(*, "Symmetric")= logi FALSE
## .. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
## $ row.names: NULL
## - attributes: List of 2
## .. $ names: chr [1:3] "virtual" "physical" "row.names"
## .. $ class: chr "ffdf"
# ffdf object can be treated as any other R object
sum(school.ff[,3])
## [1] 66029
There are two options for processing very large data sets (> 10 GB) in R:
Use integrated environment packages like Rhipe to leverage the Hadoop MapReduce framework.
Use RHadoop directly on a Hadoop distributed system.
Storing large files in databases and connecting through DBI/ODBC calls from R is also an option worth considering.
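The database route can be sketched with the DBI package. The example below assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database with a made-up table named sat_scores; a real setup would connect to an existing database server or file instead.

```r
library(DBI)

# Connect to a throwaway in-memory SQLite database (assumes RSQLite is installed).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a small hypothetical table into the database.
scores <- data.frame(school = c("A", "B", "C", "D"),
                     math = c(410, 520, 390, 480),
                     stringsAsFactors = FALSE)
dbWriteTable(con, "sat_scores", scores)

# Let the database do the filtering and return only the rows needed in R.
high.scores <- dbGetQuery(
  con,
  "SELECT school, math FROM sat_scores WHERE math > 400 ORDER BY math DESC"
)

dbDisconnect(con)
print(high.scores)
```

The point of this design is that filtering and aggregation happen inside the database, so only the reduced result set has to fit in R's memory.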
As you will have realized by now, R provides many options for handling data files, whatever size they come in: small, medium or large.
Go ahead and analyse in full that data set you have been holding off on because of system memory limitations.