This is a test of how R handles mwl data in HDF5 format using the rhdf5 package. The data models suggested by Christian and by Tommy are compared.
This is the structure of Christian’s HDF5 file:
library(rhdf5)
str.chr <- h5ls(file="cml_test.h5")
print(str.chr)
##              group      name       otype   dclass  dim
## 0                /     cml_1   H5I_GROUP
## 1           /cml_1 channel_1   H5I_GROUP
## 2 /cml_1/channel_1      data H5I_DATASET COMPOUND 1000
## 3 /cml_1/channel_1  metadata H5I_DATASET COMPOUND    1
## 4           /cml_1 channel_2   H5I_GROUP
## 5 /cml_1/channel_2      data H5I_DATASET COMPOUND 1000
## 6 /cml_1/channel_2  metadata H5I_DATASET COMPOUND    1
## 7           /cml_1  metadata H5I_DATASET COMPOUND    1
This is the structure of Tommy’s HDF5 file:
str.tom <- h5ls(file="example.h5")
print(str.tom)
##                 group        name       otype  dclass  dim
## 0                   /       cml_1   H5I_GROUP
## 1              /cml_1   channel_1   H5I_GROUP
## 2    /cml_1/channel_1          Rx H5I_DATASET   FLOAT 1000
## 3    /cml_1/channel_1          Tx H5I_DATASET   FLOAT 1000
## 4    /cml_1/channel_1        time H5I_DATASET INTEGER 1000
## 5              /cml_1   channel_2   H5I_GROUP
## 6    /cml_1/channel_2          Rx H5I_DATASET   FLOAT 1000
## 7    /cml_1/channel_2          Tx H5I_DATASET   FLOAT 1000
## 8    /cml_1/channel_2        time H5I_DATASET INTEGER 1000
## 9              /cml_1 geolocation   H5I_GROUP
## 10 /cml_1/geolocation    altitude H5I_DATASET INTEGER    2
## 11 /cml_1/geolocation    latitude H5I_DATASET   FLOAT    2
## 12 /cml_1/geolocation   longitude H5I_DATASET   FLOAT    2
## 13 /cml_1/geolocation      siteID H5I_DATASET  STRING    2
All data (the main group) can be read at once. R reads the HDF5 file into a list object (an ordered collection of objects), in which each subgroup is represented as another list nested in the main list.
This is what the object read from Christian’s HDF5 looks like:
h5.chr <- h5read(file="cml_test.h5", name="cml_1")
str(h5.chr)
## List of 3
## $ channel_1:List of 2
## ..$ data :'data.frame': 1000 obs. of 3 variables:
## .. ..$ time_UTC: num [1:1000(1d)] 1.43e+09 1.43e+09 1.43e+09 1.43e+09 1.43e+09 ...
## .. ..$ RX : num [1:1000(1d)] -45.6 -45.9 -44.7 -45.9 -45.3 ...
## .. ..$ TX : num [1:1000(1d)] 20 20 20 20 20 20 20 20 20 20 ...
## ..$ metadata:'data.frame': 1 obs. of 4 variables:
## .. ..$ RX_site: chr [1(1d)] "Site A"
## .. ..$ TX_site: chr [1(1d)] "Site B"
## .. ..$ name : chr [1(1d)] "far_near"
## .. ..$ short : chr [1(1d)] "fn"
## $ channel_2:List of 2
## ..$ data :'data.frame': 1000 obs. of 3 variables:
## .. ..$ time_UTC: num [1:1000(1d)] 1.43e+09 1.43e+09 1.43e+09 1.43e+09 1.43e+09 ...
## .. ..$ RX : num [1:1000(1d)] -45.6 -45 -45.3 -45.9 -45.6 ...
## .. ..$ TX : num [1:1000(1d)] 20 20 20 20 20 20 20 20 20 20 ...
## ..$ metadata:'data.frame': 1 obs. of 4 variables:
## .. ..$ RX_site: chr [1(1d)] "Site B"
## .. ..$ TX_site: chr [1(1d)] "Site A"
## .. ..$ name : chr [1(1d)] "near_far"
## .. ..$ short : chr [1(1d)] "nf"
## $ metadata :'data.frame': 1 obs. of 1 variable:
## ..$ ID: chr [1(1d)] "MY2345_MY4567"
This is what the object read from Tommy’s HDF5 looks like:
h5.tom <- h5read(file="example.h5", name="cml_1")
str(h5.tom)
## List of 3
## $ channel_1 :List of 3
## ..$ Rx : num [1:1000(1d)] 0.1391 0.6608 -0.0294 -0.7938 -1.4106 ...
## ..$ Tx : num [1:1000(1d)] 0.515 -0.415 -1.062 1.461 -1.248 ...
## ..$ time: int [1:1000(1d)] 0 1 2 3 4 5 6 7 8 9 ...
## $ channel_2 :List of 3
## ..$ Rx : num [1:1000(1d)] 0.1391 0.6608 -0.0294 -0.7938 -1.4106 ...
## ..$ Tx : num [1:1000(1d)] 0.515 -0.415 -1.062 1.461 -1.248 ...
## ..$ time: int [1:1000(1d)] 0 1 2 3 4 5 6 7 8 9 ...
## $ geolocation:List of 4
## ..$ altitude : int [1:2(1d)] 30 20
## ..$ latitude : num [1:2(1d)] 52.5 52.5
## ..$ longitude: num [1:2(1d)] 5.66 5.67
## ..$ siteID : chr [1:2(1d)] "siteA" "siteB"
The data model suggested by Tommy leads to less deeply nested lists than the one from Christian. Accessing datasets once the full HDF5 file is loaded is therefore a bit more convenient.
Get Rx from Tommy’s HDF5:
print(h5.tom$channel_1$Rx[1:5])
## [1] 0.13905687 0.66080112 -0.02937565 -0.79383851 -1.41060434
#or
print(h5.tom[[1]][[1]][1:5])
## [1] 0.13905687 0.66080112 -0.02937565 -0.79383851 -1.41060434
Get RX from Christian’s HDF5:
print(h5.chr$channel_1$data$RX[1:5])
## [1] -45.59375 -45.90625 -44.68750 -45.90625 -45.31250
#or
print(h5.chr[[1]][[1]][[2]][1:5])
## [1] -45.59375 -45.90625 -44.68750 -45.90625 -45.31250
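If the tabular view of Christian’s model is preferred, the per-dataset list obtained from Tommy’s model can probably be converted after reading. The following is only a minimal sketch: the list `channel` is a hypothetical stand-in for `h5.tom$channel_1`, and its values are made up for illustration.

```r
# Hypothetical stand-in for h5.tom$channel_1: a list of equal-length
# 1-d arrays, which is the shape rhdf5 returned above.
# The values are illustrative only, not taken from the real files.
channel <- list(
  Rx   = array(c(-45.6, -45.9, -44.7), dim = 3),
  Tx   = array(c(20, 20, 20), dim = 3),
  time = array(0:2, dim = 3)
)

# Drop the 1-d array attribute of each dataset and bind them column-wise.
channel.df <- as.data.frame(lapply(channel, as.vector))
str(channel.df)
```

This assumes that all datasets in a channel group have the same length, as they do in `example.h5`.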
I do not know whether it is possible to extract a subset of RX directly from Christian’s HDF5 file. I was only able to extract the whole data.frame.
dat.chr <- h5read("cml_test.h5", "cml_1/channel_1/data/")
str(dat.chr)
## 'data.frame': 1000 obs. of 3 variables:
## $ time_UTC: num [1:1000(1d)] 1.43e+09 1.43e+09 1.43e+09 1.43e+09 1.43e+09 ...
## $ RX : num [1:1000(1d)] -45.6 -45.9 -44.7 -45.9 -45.3 ...
## $ TX : num [1:1000(1d)] 20 20 20 20 20 20 20 20 20 20 ...
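A possible workaround is to read the whole compound dataset and subset it in memory. The sketch below is an assumption-laden illustration, not a test on the real file: it writes a small made-up data.frame to a temporary file (a hypothetical stand-in for `cml_test.h5`), which rhdf5 stores as a compound dataset, and then subsets after reading.

```r
library(rhdf5)

# Hypothetical stand-in for Christian's file: one compound dataset
# written from a data.frame (values are illustrative only).
tmp <- tempfile(fileext = ".h5")
h5createFile(tmp)
toy <- data.frame(time_UTC = 1:10, RX = -45 - (1:10) / 10, TX = rep(20, 10))
h5write(toy, tmp, "data")

# Read the whole compound dataset, then take the first five RX values
# in memory rather than in the file.
dat <- h5read(tmp, "data")
rx.head <- dat$RX[1:5]
print(rx.head)
```

The drawback is that the full dataset is loaded before subsetting, which may matter for long time series.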
Tommy’s structure enables easy subsetting:
dat.tom <- h5read("example.h5", "cml_1/channel_1/Rx/", index=list(1:5))
print(dat.tom)
## [1] 0.13905687 0.66080112 -0.02937565 -0.79383851 -1.41060434
It might be possible to store Rx and Tx in one matrix, which could make subsetting even more powerful. We should test how subsetting performs on big data sets.
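The single-matrix idea could be tried along the following lines. This is a sketch under stated assumptions: the dataset name `RxTx`, the column order (column 1 = Rx, column 2 = Tx) and the values are hypothetical and are not part of either proposed data model.

```r
library(rhdf5)

tmp <- tempfile(fileext = ".h5")
h5createFile(tmp)

# Hypothetical layout: column 1 = Rx, column 2 = Tx (illustrative values).
rxtx <- cbind(-45 - (1:1000) / 1000, rep(20, 1000))
h5write(rxtx, tmp, "RxTx")

# Rows 1-5 of both columns; NULL selects everything along that dimension.
sub.both <- h5read(tmp, "RxTx", index = list(1:5, NULL))

# Rows 1-5 of the Rx column only.
sub.rx <- h5read(tmp, "RxTx", index = list(1:5, 1))

print(dim(sub.both))
```

With this layout one `index` list selects both the time steps and the variables, so a time window of Rx and Tx can be read in a single call.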