The idea behind these packages is to handle large volumes of data without allocating the full data size in RAM.
A file of several GB can thus effectively be squeezed down to several MB, or even KB, of memory.
Only the metadata are kept in RAM.
The physical data are stored and manipulated on external storage (for instance, an HDD or SSD).
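As a minimal sketch of this mechanism (the object name v and the vector length are purely illustrative), an ff vector keeps its data in a file on disk while the R object itself is only a small descriptor:

library(ff)
v <- ff(vmode = "double", length = 1e6)  # data live in a file under fftempdir, not in RAM
filename(v)                              # path of the file backing the vector on disk
object.size(v)                           # only a few KB of metadata are held in RAM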
This approach has its price:
Data manipulation & calculation trade-offs;
Hardware & data consistency/integrity issues;
Possible performance degradation of the workstation.
The ff package is mainly responsible for data structures and their manipulation.
The ffbase package is mainly responsible for computations on those structures.
To process data appropriately, all data must be converted to one of the following types (a brief conversion sketch follows the list):
ff (vector)
ff_vector (vector)
ffdf (table/dataframe)
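A minimal conversion sketch, assuming both packages are installed (object names and sizes are illustrative; the mean() call also hints at the ff/ffbase division of labour described above):

library(ff)
library(ffbase)
v <- as.ff(rnorm(1e5))                           # becomes class c("ff_vector", "ff")
d <- as.ffdf(data.frame(a = 1:5, b = runif(5)))  # becomes class "ffdf"
class(v)
class(d)
mean(v)                                          # ffbase provides chunked methods such as mean()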
library(ff)
library(ffbase)
rm(list=ls())
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 787739 42.1 5438107 290.5 9493837 507.1
Vcells 5389718 41.2 33430281 255.1 153274983 1169.4
#make the initial data frame
df_size <- 50000000
x <- data.frame(a = numeric(df_size), b = numeric(df_size), c = numeric(df_size))
print(paste(round(object.size(x)/1024/1024,2),"Mb"))
[1] "1144.41 Mb"
#convert to ffdf object
system.time(
x1<-as.ffdf(x)
)
user system elapsed
2.27 3.81 264.36
#save df as ffdf object to the file
system.time(
write.csv2.ffdf(x1,"ff_test.csv")
)
user system elapsed
365.16 6.16 568.69
ffdf, ff, and ff_vector objects can be used just like standard vectors or data frames, but doing so sacrifices their main advantage (no RAM allocation).
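For instance, simple row indexing of the x1 object created above would already materialise an ordinary in-RAM data frame (a hypothetical snippet, not part of the timings):

head_ram <- x1[1:10, ]   # row indexing returns an ordinary in-RAM data.frame
class(head_ram)          # "data.frame"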
Let’s walk through a brief example:
rm(list=ls())
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 787516 42.1 3965509 211.8 9493837 507.1
Vcells 5389355 41.2 265730432 2027.4 255390158 1948.5
#read the raw data as ffdf object
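# note on the chunking parameters: first.rows is the size of the first chunk (from which
# column classes are determined) and next.rows is the size of each subsequent chunk read from disk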
system.time(
recovery <- read.csv2.ffdf(file = "ff_test.csv"
                           , header = TRUE
                           , first.rows = 10000
                           , next.rows = 1000000
                           , levels = NULL
)
)
user system elapsed
52.87 1.83 87.31
#the size of the ffdf object
paste("ffdf object size:", round(object.size(recovery)/1024/1024,2),"Mb")
[1] "ffdf object size: 0.01 Mb"
#the type & size of a column extracted from the ffdf object by positional indexing (this pulls the whole column into RAM)
class(recovery[,1])
[1] "integer"
paste("recovery[,1]:",round(object.size(recovery[,1])/1024/1024,2),"Mb")
[1] "recovery[,1]: 190.73 Mb"
#the type & size of a column accessed from the ffdf object by name (this returns the ff_vector itself, not its data)
class(recovery$a)
[1] "ff_vector" "ff"
paste("recovery$a:",round(object.size(recovery$a)/1024,2),"kb")
[1] "recovery$a: 3.12 kb"
To monitor RAM allocation during the standard and the ffdf calculations below, it is recommended to run a monitoring tool such as Windows Task Manager.
mean_standard <- mean(recovery[,1])
rm(mean_standard)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 787566 42.1 2537925 135.6 9493837 507.1
Vcells 5389376 41.2 170067476 1297.6 255390158 1948.5
mean_ff <- mean(recovery$a)
rm(mean_ff)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 787566 42.1 2537925 135.6 9493837 507.1
Vcells 5389376 41.2 136053980 1038.1 255390158 1948.5
system.time(
  within(recovery, {
    a <- a + 5
    #print(mean(a))
  })
)
user system elapsed
66.35 1.56 119.12
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 787625 42.1 6380318 340.8 9493837 507.1
Vcells 5389489 41.2 44582167 340.2 255390158 1948.5
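Note that, as with base R's within(), the expression above returns a modified copy rather than changing recovery in place; to keep the transformed column it would presumably have to be assigned back (a sketch, excluded from the timing above):

recovery <- within(recovery, {a <- a + 5})   # keep the transformed column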
The package is quite useful. However, it can require some tuning in specific cases, i.e. adjusting how the script allocates RAM and processes chunks for the task at hand.
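For example, two knobs that are commonly adjusted (shown as a hedged sketch; the path and sizes are purely illustrative) are the directory where ff keeps its backing files and the byte budget used when processing in chunks:

options(fftempdir = "D:/ff_temp")     # illustrative path: keep backing files on a fast, spacious drive
options(ffbatchbytes = 64 * 1024^2)   # approximate RAM budget per chunk in batched operations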