Using rgallery for physics analysis

Introducing rgallery

rgallery is a small R package that provides functions that make it easy to read HDF5 ntuple files (as written by hep_hpc, from https://bitbucket.org/fnalscdcomputationalscience/hep_hpc). It allows ntuples in those files to be read into an R data.table or data.frame. This introduction concentrates on the use of data.table.

Obtaining rgallery

The rgallery package can be installed using devtools::install_github. See https://github.com/marcpaterno/rgallery for more about installation.
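
For example (a minimal sketch; it assumes the devtools package is available, and uses the repository named above):

# install.packages("devtools")  # uncomment if devtools is not yet installed
devtools::install_github("marcpaterno/rgallery")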

Example of use

For this example, we use an HDF5 ntuple file containing some information about clusters, hits, and vertices found in a simulation of DUNE data. Thanks to the DUNE collaboration for allowing access to their simulation output.
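
The examples below assume the needed packages are attached. The exact set is an assumption on our part: the printed file handle below suggests h5file comes from the h5 package, while histogram, bwplot, and xyplot are lattice functions and kde2d comes from MASS. A plausible preamble, to be adjusted to your setup:

library(rgallery)    # read.data.table
library(h5)          # h5file (assumed provider; see note above)
library(data.table)  # setkey, merge, grouping with .N
library(lattice)     # histogram, bwplot, xyplot
library(MASS)        # kde2d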

f <- h5file("demo.h5","r") # Open the file in readonly mode
f                          # printing the file handle shows us the table names
## H5File 'demo.h5' (mode 'r')
## + clusters
## + hits
## + mctruths
## + vertices

Exploration of one-dimensional data

Let's first explore the hits data.

hits <- read.data.table(f, "hits")
hits
##          clus dof eid.1 eid.2 eid.3        gof id  integral
##       1:    0  18 2e+07    93  9201  1.3416872  0 117.13450
##       2:    0  17 2e+07    93  9201  1.6482247  1  99.51738
##       3:    0  20 2e+07    93  9201  2.2353501  2  95.33622
##       4:    0  20 2e+07    93  9201  2.0840235  3 113.36713
##       5:    0  22 2e+07    93  9201  5.8113637  4 115.89322
##      ---                                                   
## 3249733:   25  23 2e+07   480 48000 19.6003094  4 544.60748
## 3249734:   25  28 2e+07   480 48000  2.5209873  5 214.10004
## 3249735:   25  11 2e+07   480 48000  2.5733807  6 151.48633
## 3249736:   25  12 2e+07   480 48000  0.7821626  7 192.55653
## 3249737:   25   9 2e+07   480 48000  0.7482567  8  89.51905

Printing a data.table shows the first few, and last few, rows in the table. In this table, we have one row per hit found in the events that were processed. The columns eid.1, eid.2, and eid.3 are the run, subrun, and event numbers; together, these provide a unique identifier for the event in which the hit was found. The column id is a unique identifier of the hit within that event. The column clus tells which cluster in that event is associated with this hit. Finally, integral is a measure of the amount of charge associated with the observed hit. The printout also shows a row number for each row; from this we can see that the table contains 3249737 hits.
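
The row count is also available directly; the value agrees with the label of the final row in the printout above:

nrow(hits)
## [1] 3249737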

First, let's look at the distribution of integral. There are many hits, so it is reasonable to first try a histogram as a density estimator:

# nint tells the number of bins to use.
# type="count" gives a plot showing the number of entries
#   (rather than the default relative frequency)
histogram(~integral, hits, nint=50, type="count")

That's rather disappointing. Almost all the data fall into very few bins. The maximum value of integral is about 28000, but there are few values near the maximum, so the histogram has many empty bins and we can't really see the distribution. Producing the same plot using a logarithmic scale for the x-axis may tell us more.

histogram(~integral, hits, nint=50, type="count",
          scales = list(x = list(log = 10, equispaced = FALSE)))

From this, we start to see some shape in the distribution: the lower end does not actually go to zero, and the tails are very long. A better tool for visualizing such data may be the box and whisker plot.

bwplot(~integral, hits,
       scales = list(x = list(log = 10, equispaced = FALSE)))

With this, we see our data cluster mostly between about 100 and 200, but there are long tails, and perhaps even some structure at the low end of the distribution.

Here is a more detailed view of the main body of the distribution:

histogram(~integral, hits[integral > 10 & integral < 5000],
          nint=50,
          scales = list(x = list(log = 10, equispaced = FALSE)),
          type = "count")

Next we will look at what happens if we aggregate the data event by event. First, let's look at the number of hits per event. We need to group by the three columns that uniquely identify an event; we will also use the data.table facility .N to count the number of entries in the table (and thus hits) per event. The data.table [] operator performs both the grouping and the calculation on each resulting group.

histogram(~N,
          hits[, .N, .(eid.1,eid.2,eid.3)],
          nint = 50, type = "count",
          scales = list(x = list(log = 10, equispaced = FALSE)),
          xlab="Number of hits per event")

It is useful to take a look at the data.table returned by the grouping operation:

hits[,.N,.(eid.1,eid.2,eid.3)]
##       eid.1 eid.2 eid.3    N
##    1: 2e+07    93  9201 4773
##    2: 2e+07    93  9202 6925
##    3: 2e+07    93  9203  482
##    4: 2e+07    93  9204 5813
##    5: 2e+07    93  9205 1755
##   ---                       
## 1354: 2e+07   480 47994 4626
## 1355: 2e+07   480 47995   10
## 1356: 2e+07   480 47998 1500
## 1357: 2e+07   480 47999 4262
## 1358: 2e+07   480 48000 1215

The result has the columns we used to uniquely identify the events, and also the calculated column N, containing the number of hits in each event. Note that in the data file we are reading, some events have no clustered hits. Note also that, because of the way the ntuple was filled, we only record hits that were associated with some cluster.
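
One way to check that some events are missing from hits is to compare distinct event counts across tables. This sketch assumes the mctruths table contains one row per generated event; the count of 1358 comes from the printout above:

mctruths <- read.data.table(f, "mctruths")
uniqueN(mctruths, by = c("eid.1", "eid.2", "eid.3"))  # events in the file (assumed)
uniqueN(hits, by = c("eid.1", "eid.2", "eid.3"))      # events with hits: 1358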

Two-dimensional analyses

The data.table package provides facilities to merge (or join) tables, to allow us to look at relations between quantities in different tables. Next we will look at the relationship between the number of clusters associated with a vertex and the sum of the summed ADCs for the clusters in that vertex. This requires using information from both the vertices and clusters tables, making sure that we associate each cluster with the right vertex.

vertices <- read.data.table(f, "vertices")
clusters <- read.data.table(f, "clusters")
vertices
##      eid.1 eid.2 eid.3 id          x         y          z
##   1: 2e+07    93  9202  0   84.33222 -153.2539   98.21487
##   2: 2e+07    93  9203  0  141.55946 -163.6484  173.50929
##   3: 2e+07    93  9205  0  193.75005 -435.9047  492.58594
##   4: 2e+07    93  9206  0  244.42218  544.2476  918.68146
##   5: 2e+07    93  9206  1  221.11116  548.2454  944.15546
##  ---                                                     
## 922: 2e+07   480 47978  0  253.70491  511.0650  297.27756
## 923: 2e+07   480 47979  0 -263.26633  107.9334 1336.42529
## 924: 2e+07   480 47987  0  174.00801  580.6284  644.04614
## 925: 2e+07   480 47988  0  216.04219  301.9754  814.07391
## 926: 2e+07   480 47991  0 -153.81104  265.5946  524.19812
clusters
##       eid.1 eid.2 eid.3 id    sumadc vtx
##    1: 2e+07    93  9202  0 61172.836   0
##    2: 2e+07    93  9202  1 42437.789   0
##    3: 2e+07    93  9202  2 55847.285   0
##    4: 2e+07    93  9202  3  6355.563   0
##    5: 2e+07    93  9202  4 40968.859   0
##   ---                                   
## 7616: 2e+07   480 47991  5  1829.834   0
## 7617: 2e+07   480 47991  6 15284.536   0
## 7618: 2e+07   480 47991  7 13698.257   0
## 7619: 2e+07   480 47991  8 40042.828   0
## 7620: 2e+07   480 47991  9 23399.721   0

First, we need to add some keys to the data.table objects we are using. This allows the merging to work, and to be efficient. We will use the 3-part event identifier, and the vertex identifier (which has a different name in each table). After creating the keys, we can perform the merge; we will end up with one row in the merged table per cluster. In the call to merge, the by.x and by.y arguments must be vectors of the same length; the order tells which column in x is to be compared with which column in y.

Note: because of the way our ntuples were constructed, we have included in the clusters table only those clusters that have associated vertices. We have included every vertex in the vertices table. Thus when we make the merged table, we will have the same number of rows as we have clusters.

setkey(vertices, eid.1, eid.2, eid.3, id)
setkey(clusters, eid.1, eid.2, eid.3, id, vtx)

vc <- merge(clusters, vertices,
            by.x = c("eid.1", "eid.2", "eid.3", "vtx"),
            by.y = c("eid.1", "eid.2", "eid.3", "id"))
vc
##       eid.1 eid.2 eid.3 vtx id    sumadc         x         y         z
##    1: 2e+07    24  2302   0  0 14430.347 246.32928 -277.1885  643.4622
##    2: 2e+07    24  2302   0  1  1190.716 246.32928 -277.1885  643.4622
##    3: 2e+07    24  2302   0  2 17813.695 246.32928 -277.1885  643.4622
##    4: 2e+07    24  2302   0  3 12898.758 246.32928 -277.1885  643.4622
##    5: 2e+07    24  2302   0  4 11218.224 246.32928 -277.1885  643.4622
##   ---                                                                 
## 7616: 2e+07   703 70298   0  2 20877.008  93.32211 -274.7898 1348.6790
## 7617: 2e+07   703 70298   0  3 22456.398  93.32211 -274.7898 1348.6790
## 7618: 2e+07   703 70298   0  4  4322.727  93.32211 -274.7898 1348.6790
## 7619: 2e+07   703 70298   0  5  2396.208  93.32211 -274.7898 1348.6790
## 7620: 2e+07   703 70298   0  6 30761.635  93.32211 -274.7898 1348.6790

In vc, the columns that provide the event identification are as we have seen before; the vtx column is an identifier for a vertex within that event, the id column is the identifier for a cluster within that vertex, sumadc is the sum of ADC counts for the cluster, and x, y, and z are the coordinates of the vertex.

The quantities whose correlation we want to explore are the number of clusters per vertex and the sum of the ADC counts for all clusters in the vertex. To get these, we summarize the dataset:

xyplot(adcsum~nclus,
       vc[, .(nclus=.N,adcsum=sum(sumadc)), .(eid.1,eid.2,eid.3,vtx)],
       xlab = "Number of clusters",
       ylab = "Sum of SummedADC")

The number of points plotted is not too large, but because the x-axis takes only integer values, there is significant overplotting. This can be reduced by jittering the value plotted on the x-axis. We'll also switch to a log scale for the y axis.

xyplot(adcsum~jitter(nclus),
       vc[, .(nclus=.N,adcsum=sum(sumadc)), .(eid.1,eid.2,eid.3,vtx)],
       scales = list(y = list(log = 10, equispaced = FALSE)),
       xlab = "Number of clusters",
       ylab = "Sum of SummedADC")

Two-dimensional analysis of a larger data set

The scatterplot above works reasonably well in part because there are not too many points to be plotted. If the number of points is too large, a scatterplot becomes of little use. In such cases, it can be useful to use a kernel density estimator, and then to visualize the density estimate.

In the next bit of analysis, we want to look at the relationship between hits and clusters. In particular, we want to see how each cluster's summed ADC compares with the sum of the integrals for all the hits in the cluster. First, we must form the correct merged data.table. We already have the correct key on the clusters table; we need to add one for the hits table.

Note that the hits table includes hits that were not associated with any cluster appearing in the clusters table. Thus the merged table will contain a row for a hit only if that hit belongs to some cluster found there; many hits will therefore be absent from the merged table.

setkey(hits, eid.1, eid.2, eid.3, id, clus)
hc <- merge(clusters, hits,
            by.x = c("eid.1","eid.2","eid.3","id"),
            by.y = c("eid.1","eid.2","eid.3","clus"),
            suffixes=c("cl","h"))
hc
##         eid.1 eid.2 eid.3 id   sumadc vtx dof        gof id  integral
##      1: 2e+07    24  2302  0 14430.35   0  11  0.4446217  0  53.47519
##      2: 2e+07    24  2302  0 14430.35   0  13  0.6528029  1 116.18021
##      3: 2e+07    24  2302  0 14430.35   0  15  0.4665284  2  65.38274
##      4: 2e+07    24  2302  0 14430.35   0  16  1.7912945  3  90.11420
##      5: 2e+07    24  2302  0 14430.35   0  12  1.3788036  4 113.45422
##     ---                                                              
## 287409: 2e+07   703 70298  6 30761.63   0  18  1.2289388 16 154.58118
## 287410: 2e+07   703 70298  6 30761.63   0  12  0.9213578 17 124.72017
## 287411: 2e+07   703 70298  6 30761.63   0  16  1.2534797 18 174.21358
## 287412: 2e+07   703 70298  6 30761.63   0  13  1.9055115 19 248.89500
## 287413: 2e+07   703 70298  6 30761.63   0  11 48.9476509 20 698.91150
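
As a quick check on the note above, the merged table is much smaller than the hits table; both counts below are taken from the printouts:

nrow(hc)    # 287413: one row per hit belonging to a cluster in the clusters table
nrow(hits)  # 3249737: all hits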

What we want to study is the relation between the sumadc for each cluster and the sum of integral for all the hits in that cluster. But first, we will prepare a summary data.table, from which we exclude any clusters for which sumadc is not positive.

hc.summary <- hc[, .(sumadc=first(sumadc),sumint=sum(integral)), .(eid.1,eid.2,eid.3,id)]
hc.summary <- hc.summary[sumadc>0]
xyplot(sumadc~sumint,
       hc.summary,
       xlab = "sum of integral for hits in cluster",
       ylab = "sumadc for cluster",
       scales = list(x=list(log=10, equispaced=FALSE),
                     y=list(log=10, equispaced=FALSE)),
       pch = ".")

That is getting to be too many points to visualize with a scatterplot. We can try looking at a one-dimensional distribution of a derived quantity:

histogram(~(sumadc/sumint),
          hc.summary,
          nint = 100,
          scales = list(x = list(log = 10, equispaced = FALSE)),
          type = "count")

Another possibility is to visualize a 2-d kernel density estimate. Recall that some clusters have negative sumadc; we removed them from hc.summary above, which matters here because we want to use log scales.

hc.dens <- kde2d(log10(hc.summary$sumadc), log10(hc.summary$sumint))
image(hc.dens, xlab="log(sumadc)", ylab="log(sum of integral)")
contour(hc.dens, add=TRUE)

The final thing we will consider is grouping the data by "goodness of fit" (gof).

hc[, gofbin := cut(gof, breaks = c(-1, 0, 10, 100, 1000, 10000), include.lowest = TRUE)]
summary(hc$gofbin)
##        [-1,0]        (0,10]      (10,100]   (100,1e+03] (1e+03,1e+04] 
##           136        254379         24395          1837          6665 
##          NA's 
##             1

We can now try making a plot conditioned on the value of gofbin. This time, we'll use a histogram showing relative frequency; otherwise, because the number of counts in the (0,10] bin is so large, we would not see much in the panels for the other bins.

histogram(~integral|gofbin, hc[integral>0.01], nint = 30,
          scales = list(x = list(log = 10, equispaced = FALSE)))
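
When the analysis is complete, the file handle should be closed. Assuming h5file came from the h5 package (as the printed handle suggests), this is done with h5close:

h5close(f)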