A recent project being worked on by part of the art team involved the translation of a large number (about 250 thousand) of art/ROOT data files into another format. When running an early version of the code to do the translation, we found that the jobs were exhausting the memory resources on the platform where they were being run. Following our own recommended practice, we had run the jobs with the MemoryTracker service enabled, and writing to a database file. The following is a record of the analysis we did to find the source of the problem.
The MemoryTracker service, and the tables in the database file it writes, is described in detail at https://cdcvs.fnal.gov/redmine/projects/art/wiki/MemoryTracker. The pages linked to there include basic instructions in the use of the SQLite command line application (sqlite3).
MemoryTracker database fileIn this document, I use R (see the section Setting up for use of artsupport) for analysis of the data. One is free, of course, to use any system that is able to read an SQLite database file; R is my preference. In all the following, the shaded rectangles present the R commands, typed at the R prompt, that yield the output (tables and plots) shown. If you are not familiar with R, and are uninterested in learning about it, please just ignore the commands and concentrate on the analysis, which is the main point of this document.
First, we looked at the ModuleInfo table, which contains event-by-event and module-by-module information about the virtual size (Vsize) and resident set size (RSS) of the process. The following is the first few rows of the table:
mods <- load_memory_use("Run2c_015_memory.db", "ModuleInfo")
head(mods) # print the first few rows of the data
| Step | Run | SubRun | Event | Path | ModuleLabel | ModuleType | Vsize | RSS | sample |
|---|---|---|---|---|---|---|---|---|---|
| PreProcessModule | 9907 | 32 | 2012 | t1 | rawPrimer | RawPrimer | 3996.164 | 262.8362 | 1 |
| PostProcessModule | 9907 | 32 | 2012 | t1 | rawPrimer | RawPrimer | 3999.265 | 266.1663 | 2 |
| PreProcessModule | 9907 | 32 | 2012 | t1 | TriggerResults | TriggerResultInserter | 3999.265 | 266.1663 | 3 |
| PostProcessModule | 9907 | 32 | 2012 | t1 | TriggerResults | TriggerResultInserter | 3999.265 | 266.1663 | 4 |
| PreProcessModule | 9907 | 32 | 2012 | end_path | raw2hdf5 | Raw2HDF5 | 3999.265 | 266.1663 | 5 |
| PostProcessModule | 9907 | 32 | 2012 | end_path | raw2hdf5 | Raw2HDF5 | 4001.985 | 273.6005 | 6 |
We are interested first in looking at the virtual size of the process. This tells about how much memory the program has allocated, even if that allocated memory has not yet been “touched” by the program. The resident set size of the program indicated how much memory has actually been provided by the operating system to the program. Allocation of large data structures containing uninitialized data can increase Vsize without increasing RSS.
First, we looked at how the virtual size increased with time, taking advantage of the fact that the database table entries are inserted in the order of their collection:
xyplot(Vsize~sample, mods, xlab = "Sample Number", pch = ".", ylab = "Vsize (MB)", grid = TRUE)
There is no sign of a continual increase over time; that would be the indication of a likely memory leak. Note the scale: the full range shown on the plot goes from just less than 4000 MB to about 4030 MB. So we do not appear to be suffering from such a leak. In addition, it seems that the memory usage gets to its maximum value very quickly. Let’s zoom in on the first few samples:
xyplot(Vsize~sample, mods[1:100,], xlab = "Sample Number", ylab = "Vsize (MB)", grid = TRUE)
Note that the very first sample already has a memory usage of 3996 MB. This indicates that the memory usage is associated with something that happens before the first event. We are looking for a problem that happens earlier than that. To discover what that might be, we load a different table from the database file:
other <- load_memory_use("Run2c_015_memory.db", "OtherInfo")
head(other)
| Step | ModuleLabel | ModuleType | Vsize | RSS | sample |
|---|---|---|---|---|---|
| PreModuleConstruction | raw2hdf5 | Raw2HDF5 | 1047.040 | 261.9146 | 1 |
| PostModuleConstruction | raw2hdf5 | Raw2HDF5 | 3996.164 | 262.1972 | 2 |
| PreModuleConstruction | rawPrimer | RawPrimer | 3996.164 | 262.4635 | 3 |
| PostModuleConstruction | rawPrimer | RawPrimer | 3996.164 | 262.4635 | 4 |
| PreModuleConstruction | TriggerResults | TriggerResultInserter | 3996.164 | 262.4635 | 5 |
| PostModuleConstruction | TriggerResults | TriggerResultInserter | 3996.164 | 262.5372 | 6 |
A brief survey of what is available in this table is in order. Let’s take a look at the values that the column Step can take, and how many times (NumEntries in the table below) we see each value:
other %>% group_by(Step) %>% summarize(NumEntries = n())
| Step | NumEntries |
|---|---|
| PostBeginJob | 3 |
| PostBeginRun | 5025 |
| PostBeginSubRun | 5025 |
| PostEndJob | 3 |
| PostEndRun | 5025 |
| PostEndSubRun | 5025 |
| PostModuleConstruction | 3 |
| PreBeginJob | 3 |
| PreBeginRun | 5025 |
| PreBeginSubRun | 5025 |
| PreEndJob | 3 |
| PreEndRun | 5025 |
| PreEndSubRun | 5025 |
| PreModuleConstruction | 3 |
We see one entry for PreModuleConstruction (and several others) for each module in the program. For other values, we see many more entries: this is because the data being processed contained many runs and subruns.
As with all other tables in the MemoryTracker database, the rows are filled sequentially in time. Let’s look at what the first few samples say about the virtual size:
xyplot(Vsize~sample, other[1:100,], xlab = "Sample Number", ylab = "Vsize (MB)", grid = TRUE)
Again, we need to zoom in on the very beginning of the sequence:
xyplot(Vsize~sample, other[1:10,], xlab = "Sample Number", ylab = "Vsize (MB)", grid=TRUE)
By the second sample, we have hit what is essentially the full memory usage of the program. Let’s look at the first two samples:
head(other, 2)
| Step | ModuleLabel | ModuleType | Vsize | RSS | sample |
|---|---|---|---|---|---|
| PreModuleConstruction | raw2hdf5 | Raw2HDF5 | 1047.040 | 261.9146 | 1 |
| PostModuleConstruction | raw2hdf5 | Raw2HDF5 | 3996.164 | 262.1972 | 2 |
We see from this that before the construction of the Raw2HDF5 module, the process was using about 1 GB of memory; after construction of this module, it was using very nearly 4GB. We have found our culprit: construction of this module consumes 3 GB of memory, much more than was expected. We should look at the constructor of this class to determine what is going on. Note also that while the virtual size increased dramatically due to the construction of this object, there is no substantial change in the resident set size. This indicates that the data structures responsible for the memory allocation are not being initialized. This extra clue helped us identify the culprit. If we look at how the resident set size changed with time, we see the filling over time of this data structure:
xyplot(RSS~sample, mods, xlab = "Sample Number", pch = ".", ylab = "RSS (MB)", grid = TRUE)
If we look at both the virtual size and the resident set size for the early samples, we see the clear difference in the total amounts:
xyplot(Vsize+RSS~sample, mods[1:30000,], pch = ".", xlab = "Sample number", ylab = "Memory use (MB)", auto.key = list(space = "right"), grid = TRUE)
As it turns out, the problem was simple to fix; an elementary error had been made that caused the initialization of some very large HDF5 support data structures. Fixing that error removed the huge allocation, and brought the program memory usage under control.
The time overhead introduced by running the MemoryTracker or TimeTracker service is generally small; if your program is doing any real work, the time consumed by that work will be much larger than the time used by these services. Collection of the data is thus generally worthwhile. If your program runs correctly, and no performance issues are observed, the database files can be disposed of. But, in the event that some kind of problem is encountered, having the database files immediately available can be of great benefit. In this particular case, because we followed our own suggested practice of collecting the performance databases, we were able to diagnose the issue in much less than an hour, and fix it rapidly.
artsupportThe R package artsupport provides a few facilities to aid in the analysis of TimeTracker and MemoryTracker database files written by art. While it is possible to analyze these database files with many different tools, this is the recommended toolkit.
Use of artsupport requires an installation of R, “a free software environment for statistical computing and graphics”. R is available for Linux, Unix, Windows, and macOS. You can obtain R from https://cran.r-project.org. On SLF6 and SLF7 machines, the most recent version of R is available for installation using yum; R can be found in the EPEL repository. The package to install is R-devel. To obtain access to the EPEL repository, and then install R, the following commands suffice:
sudo yum install epel-release
sudo yum install R-devel
On macOS, the link provided can be used to find the binary installer; R can also be installed using Homebrew. R is also available for Windows environments. Because SQLite databases are platform-independent, you can analyze the data on a machine different from the machine on which the data were collected.
R has a very convenient system for managing the installation of R add-ons; these add-ons are called “packages”. Installing artsupport is most easily done from inside an R session, using the function install_github, provided by the devtools package. If you do not have devtools installed in R, install it at the R prompt using:
install.packages("devtools")
Once you have devtools installed, load it, and install artsupport:
library(devtools)
install_github("marcpaterno/artsupport")
Please note the version of artsupport listed below. This document will be updated to use new features of artsupport whenever relevant new features appear.
This document was created using the RStudio integrated development environment; it was written in RMarkdown, using the knitr package. The set of R packages and their versions used was:
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin16.7.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_1.5 printr_0.1 dplyr_0.7.3
## [4] latticeExtra_0.6-28 RColorBrewer_1.1-2 lattice_0.20-35
## [7] artsupport_0.0.0.9010
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 knitr_1.17 bindr_0.1 R6_2.2.2
## [5] rlang_0.1.2 stringr_1.2.0 tools_3.4.1 grid_3.4.1
## [9] htmltools_0.3.6 yaml_2.1.14 rprojroot_1.2 digest_0.6.12
## [13] assertthat_0.2.0 tibble_1.3.4 bindrcpp_0.2 glue_1.1.1
## [17] evaluate_0.10.1 rmarkdown_1.6 stringi_1.1.5 compiler_3.4.1
## [21] backports_1.1.0 pkgconfig_2.0.1