A follow-up question to how R function calls are sent to the H2O cluster for computation is how H2O handles garbage collection across different R or Python sessions.

Naming convention of frames

When you import data into H2O using h2o.importFile, you can assign the frame name with the destination_frame argument. When you upload local data to the H2O cluster using as.h2o, the frame name should default to the R variable name.

For example, if you move the iris data set that exists in R to the cluster by running iris.hex = as.h2o(iris), the key (or pointer) to the H2OFrame in the DKV, H2O's distributed key-value store, is iris, which you can access in Flow.

If you subset or manipulate the iris.hex H2OFrame, H2O creates new temporary frames (RTMPs). For example, if you keep only the first two columns by running tmp = iris.hex[,1:2], you will have both iris and an RTMP_ frame in the DKV. The lesson here is that RTMP frames are temporary, but any frame you assigned a name to, whether through as.h2o, h2o.importFile, or h2o.assign, will not be cleaned out by garbage collection.

Manipulate DKV from R

R is simply a client front end: the R function call sends a GET request to H2O, and H2O responds with a manageable summary of what the user requested. The objects in the H2O environment are different from those in your R workspace. Where you would run ls() in R to list all the objects and variables defined in the workspace, you run h2o.ls() to list all the keys currently in the H2O distributed key-value store. This should correspond to running both getFrames and getModels in Flow.

> h2o.ls()
   key
1 iris

You can remove specific keys from the H2O cluster by calling h2o.rm with the keys you want removed, and then confirm that the frames are gone by running h2o.ls().

> h2o.rm(ids = c("iris"))
> h2o.ls()
[1] key
<0 rows> (or 0-length row.names)

One last option is to run h2o.removeAll(), which removes all keys (frames, models, model metrics, etc.) from the cluster. Run this function with caution if you are sharing the H2O cluster: h2o.removeAll is not sensitive to the session ID and will also remove frames and models that were not created in your current R session.

Garbage Collection Between R Sessions

By default, H2O periodically cleans out RTMPs once they are no longer used or referenced. They are also all cleaned out when you close your R session. All frames (but not the models) associated with that particular R session ID will be removed by garbage collection upon exit. This frees up memory and cleans up the workspace a little for other people using the H2O cluster.

Although garbage collection can be triggered explicitly with the hidden function h2o:::.h2o.garbageCollect(), it is not really meant to be called from the front end. H2O should handle garbage collection in a similar way to R, preserving R semantics as much as possible.

When sharing a cluster and importing a relatively big data set, remember to explicitly remove the frames you assigned a name to (see Naming convention of frames). For example, if I have a 100 GB cluster and the data set I imported into memory is about 30 GB, it is important to remove that 30 GB frame when exiting.

Example of how to quit without leaving a trace
library(h2o)
h2o.init(ip = 'mr-0xd1', port = 54321)

## There should be one key called 'frame'
frame = h2o.importFile(path = "hdfs://namenode/path/to/data", destination_frame = "frame" )

## There should be two keys, 'frame' and a 'RTMP_'
frame = frame[,1:5]

## If I assign the destination frame when splitting the data I will have four keys, 'frame', 'RTMP_', 'train', 'valid'
split = h2o.splitFrame(data = frame, destination_frames = c("train", "valid"))
train = split[[1]]
valid = split[[2]]

## Let's assume that someone else on the cluster imported a frame called zframe
## > h2o.ls()
##               key
## 1 RTMP_sid_96ab_12
## 2            frame
## 3            train
## 4            valid
## 5           zframe

At this point, if I quit my R session, only ‘RTMP_sid_96ab_12’ will be cleaned out by garbage collection; ‘frame’, ‘train’, and ‘valid’ need to be removed explicitly.

h2o.rm(ids = c("frame", "train", "valid"))
quit(save = "no")

The other R user should see the following when they run h2o.ls() after your exit:

> h2o.ls()
    key
1 zframe
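
The cleanup step in this walkthrough relies on remembering which keys you named yourself. As a minimal sketch, the helper below sorts a key listing into temporary and named keys. It assumes temporary keys always start with the RTMP_ prefix seen in the listing above. The function split_keys and the vector mine are hypothetical names introduced for illustration and are not part of the h2o package. The listing is a hand-built stand-in for h2o.ls() output, so the snippet runs without a cluster.

```r
# Hypothetical helper (not part of the h2o package): split a vector of DKV
# keys into temporary keys (assumed to start with "RTMP_") and named keys.
split_keys <- function(keys) {
  is_tmp <- grepl("^RTMP_", keys)
  list(temporary = keys[is_tmp], named = keys[!is_tmp])
}

# Stand-in for the data frame returned by h2o.ls() in the example above.
listing <- data.frame(
  key = c("RTMP_sid_96ab_12", "frame", "train", "valid", "zframe"),
  stringsAsFactors = FALSE
)

# Keys this session named explicitly. Anything else that is named
# (here 'zframe') belongs to another user and must be left alone.
mine <- c("frame", "train", "valid")

parts <- split_keys(listing$key)
to_remove <- intersect(parts$named, mine)
print(to_remove)

# Against a live cluster, the last step would be: h2o.rm(ids = to_remove)
```

Intersecting with an explicit list of your own keys, rather than removing every named key, is what keeps another user's frame such as zframe untouched on a shared cluster.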