#library(ARminassaymodel)This vignette describes how to run all of the analyses in the AR (androgen receptor) minimal assay model. The purpose of the overall AR project is to derive a consensus extimate of activity of chemicals against the AR from a large number of in vitro assays, each of which is noisy in its own way.The original model is presented in the paper by Klenistreuer et al., which used 12 assays. In the current project, we develop a new full model with a somewhat different set of assays (now totalling 14) and then look for subset models, which are models using the same biological pathway structure of Kleinstreuer, but with fewer than the full set of 14. The results of this analysis are presented in the manuscript “Selecting a Minimal set of Androgen Receptor Assays for Screening Chemicals” by Judson et al. All code is written in R and is in the ARminassaymodel package.
The model proceeds in several steps
Except for the initial extraction of data from invitrodb, all input and output is file-based, and there are many directories to keep track of, so they are described here.
For each run of the model (full or subset), the result is in an excel file called superMatrix_…, where the elipses give the subset model name, which is an “A” followed by a mask of length 14 indicating which assays are included. So the output from the full model (run in the subset mode) is superMatrix_A11111111111111.xlsx. For the original full model, the output file name is superMatrix_final.xlsx.The superMatrix file contains one row per chemical, with columns being the chemical identity (dsstox_substance_id, casrn, code, name), and the model results (AUC_agonist, AUC_antagonist, AUC_R3 …). The full description of the columns will be given below.
The parent directory is called “model”. Within that are …
This step extracts all required data from invitrodb. To do this, run the following command:
allAssayDriver(do.db.extract=T,tbmode="med")
Before this is done, the database connection has to be set using
setDBConn(server={server},user={username},password={password})
For the calculations in the paper, we used database version ‘prod_internal_invitrodb_v3_2’ on server ‘mysql-res1.epa.gov’
The full model uses all 14 assays. To run this, first, prepare all of the data files using the command
allAssayDriver(do.file.prep=T,tbmode="med")
and load the data into a set of global variables (done to speed up execution) using the command:
allAssayDriver(do.prep=T,tbmode="med")
This only needs to be done once per session.
The full model is then run using either the command
allAssayDriver(do.everything=T,tbmode="med")
or the series of commands
allAssayDriver(do.plot0=T,tbmode="med")
allAssayDriver(do.ref=T,tbmode="med")
allAssayDriver(do.all=T,tbmode="med")
allAssayDriver(do.summary=T,tbmode="med")
The result of these commands will be to generate the reference chemical performance data (in plots/med and output/med) and the full model supermatrix (in output/med). Depending on the processor speed, this takes 20-30 minutes.
IMPORTANT NOTE In much of the code, there is a variable called “tbmode” which allows all code to run with all assays set to their lower or upper 95% confidence interval ranges from toxboot. For these cases, tbmode would be set ot “min” or “max”, The current variation analysis (see below) is more sophisticated than this, so for all current analyses, tbmode is set to “med”
There are a large number of subset models: 2**Nassay - (Nassay+1). For Nassay=14, this is 16369. The CPU time increases with the model complexity (number of assays in the submodel), from about 2 minutes for 2-assay models to 20-30 minutes for the full assay model. Because of this, it is necessary to run in parallel. The command to run the subset models is:
submodelDriver(tbmode="med",minassay={minimum number of assays},maxassay={maximum number of assays},mc.cores={number of cores to use})
On Windows, mc.cores must be set to 1.
As stated above, each run is named with an “A” followed by the assay mask, e.g. “A1111000000000011” and put in the directory varAMASK/{run name}/med.superMatrix_{run name}.xlsx. When each run is started, the appropriate directories are created. Because runs could crash, there is a restart capability. Only runs for which no directory exists will be started. What this does not yet do is restart runs where the directory was created, but where no results were created. You can check the progress of the processing (which takes several days to weeks) by running the command:
submodeProgress()
This only looks at the presence of the directories, but not the contents. An example output is
| nassay | complete | total | percent |
|---|---|---|---|
| 2 | 91 | 91 | 100 |
| 3 | 364 | 364 | 100 |
| 4 | 1001 | 1001 | 100 |
| 5 | 1804 | 2002 | 90 |
| 6 | 0 | 3003 | 0 |
| 7 | 876 | 3432 | 26 |
| 8 | 0 | 3003 | 0 |
| 9 | 0 | 2002 | 0 |
| 10 | 10 | 1001 | 1 |
| 11 | 26 | 364 | 7 |
| 12 | 91 | 91 | 100 |
| 13 | 14 | 14 | 100 |
| 14 | 1 | 1 | 100 |
One analysis presented in the paper compares the performance of the original and new full models. (The new model has more and different assays from the original full model). To run this analysis, use the command:
fullModelOldNewComp(to.file=T)
This produces a plot: plots/med/fullModelOldNewComp.pdf and a table comaring the old and new values, chemical by chemical: plots/med/fullModelOldNewComp.xlsx.
Full models with assay variability are compared to the full model with its point estimate AC50 values using the same approach as was used to compare the subset models. The variation analysis reruns the full model N times. (For this paper, N=100.) The results, in the form of a simplified superMatrix for each run, are placed in the directory varVAR/run_{i}/med where i runs from 1 to N. To perform the run over a range of i, run the command:
varmodelDriver(tbmode="med",runmin={minimum run number},runmax={maximum run number},mc.cores={number of cores to use})
For each of the modes (agonist,antagonist), we plot the AUC values of the variable full model vs. the point estimate full model and calculate an RMSE and R2. As above, AUC values are in the range of 0 to ~1. Values are then dichotomized with values <0.1 set to inactive (0) and those >=0.1 set to positive (1). From this, we create the confusion matrix and calculate sensitivity, specificity, balanced accuracy (BA), positive predictive value (PPV) and negative predictive values (NPV).These later statistics are performed for the full set of chemicals (1820) and for the in vitro and in vivo reference chemicals. To calculate these statistics across all subset models, run the command:
varmodelSummary(to.file=T,tbmode="med",agonist.cutoff=0.1,antagonist.cutoff=0.1,antagonist.min.score=2)
This produces summary statistics files: varmodel.output/med/sample_stats_agonist_0.1_2.xlsx , sample_stats_antagonist_0.1_2.xlsx and *sample_stats_antgonist__filtered_0.1_2.xlsx* , and a set of plots in submodel.output/med/assay_scan_0.1_0.1_2.pdf . The antagonist_filtered mode output sets the full model AUC(antagonist) to zero if the antagonist confidence score is <2.
The results across all subset models are summarized by running the command:
varmodelPlotStats(to.file=T,mode,cutoff.hit=0.1,antagonist.min.score=2,tbmode="med")
separately for mode in (agonist,antagonist,antagonist_filtered). This produces plots in the varmodel.output/med folder with names starting with “varmodel_stats_”. Note that the variation in the full model based on variability in the assay results is significantly smaller (in most cases) than the difference between the full model and the submodels.
Submodels are compared to the full model using several statistics. For each of the modes (agonist,antagonist), we plot the AUC values of the submodel vs. the full model and calculate an RMSE and R2. AUC values are in the range of 0 to ~1. Values are then dichotomized with values <0.1 set to inactive (0) and those >=0.1 set to positive (1). Using these dichotimized values, we create the confusion matrix and calculate sensitivity, specificity, balanced accuracy (BA), positive predictive value (PPV) and negative predictive values (NPV).These later statistics are performed for the full set of chemicals (1820) and for the in vitro and in vivo reference chemicals. To calculate these statistics across all subset models, run the command:
submodelSummary(to.file=T,tbmode="med",agonist.cutoff=0.1,antagonist.cutoff=0.1,antagonist.min.score=2)
This produces summary statistics files: submodel.output/med/sample_stats_agonist_0.1_2.xlsx, sample_stats_antagonist_0.1_2.xlsx and *sample_stats_antgonist__filtered_0.1_2.xlsx, and a set of plots in submodel.output/med/assay_scan_0.1_0.1_2.pdf*. The antagonist_filtered mode sets the full model AUC(antagonist) to zero if the antagonist confidence score is <2. An example plot from this pdf is used for Figure 2 of the paper.
The results across all subset models are summarized by running the command:
submodelPlotStatsMax(to.file=T,cutoff.hit=0.1,antagonist.min.score=2,tbmode="med")
separately for mode in (agonist,antagonist,antagonist_filtered). This produces plots in the submodel.output/med folder with names starting with “submodel_stats_”, and are used to build Figure 2 of the paper.
Figure 3 of the paper shows the subset model performance for a single model. This is produced by running the command:
submodelSummary(to.file=T,tbmode="med",agonist.cutoff=0.1,antagonist.cutoff=0.1,antagonist.min.score=2,amask={amask})
This puts a pdf file with the graphs for just that subset model in the submodel.output/med/ directory.
The full analysis of the paper has used all 1820 chemicals that have data for all 14 assays. ToxCast has more chemcials with subsets of the assays, and these can be run through the appropriate subset models. To prepare the input data files, run the series of commands:
prepToxCastMatrixFiles.forward(date.string="190624",
get.chems=T,
get.nvs=T,
get.nonnvs=T,
filter.chems=T,
invitrodb.filter=T)
The date.string is the most recent date for data extraction from invitrodb, and must be the same as in allAssayDriver.R. This will create series of files in the directory input/forward/toxcast_matrix_filtered with the logAC50, hitcalls, tested status, etc.
To run a subset model with new chemicals (outside of the 1820 used in all of the previous analysis), prepare the new data as above and run the command:
submodel.forward(amask.name={model name, e.g. "A01000000000011"})
This will create the superMatrix for this model in the directory forward_submodel/med
The Final output of the model is given in a file called superMatrix_final.xlsx. It includes one row per chemical and many columns, described here
| name | definition |
|---|---|
| dsstox_substance_id | DSSTox generic substance ID |
| code | Chemical code (CASRN with no dashes, preprended with a “C”) |
| casrn | Chemical CASRN |
| name | Chemical name |
| EDSP.Universe | Is this chemical in the EPA EDSP Universe |
| refchem_agonist_potency | For reference chemicals, the agonist potency designation |
| refchem_antagonist_potency | For reference chemicals, the antagonist potency designation |
| average_antagonist_zscore | The average Z-score for the antagonist assays |
| antagonist_auc_score | The antaogonist AUC score |
| antagonist_shift_score | The antagonist shift score |
| antagonist_zscore_score | The antagonist z-score score |
| antagonist_max_receptor_score | The antagonist max-receptor score |
| antagonist_cell_free_score | The antagonist cell-free confidence score |
| antagonist_confidence_score | The overall antagonist confidence score, the sume of the previous 5 rows. See Table 2 of the paper |
| cytotox_median_um | The median of the cytotoxicity region |
| cytotox_lower_bound_um | The lower bound of the cytotoxcity region |
| cytotox.assays.hit | Number of active cytotoxicity assays |
| AUC.Agonist | AUC for agonist mode |
| AUC.Antagonist | AUC for antagonist mode |
| AUC.R3 … AUC.R6 | Multi-assay-wise receptor AUC values |
| AUC.A1 … AUC.A14 | Assay-wise AUC values |
| {assay_name}_logAC50 | assay/chemical-logAC50 (Concentration where curve = 50% of Top) |
| {assay_name}_Emax | assay/chemical-Emax (Maximum effect magnitude) |
| {assay_name}_T | assay/chemical-Top (Top of the Hill curve) |
| {assay_name}_W | assay/chemical-W (Hill curve slope parameter) |
| {assay_name}_Zscore | assay/chemical-Z-score (Measure of distance from cytotoxicity concentration) |
| {assay_name}_maxConc | assay/chemical-logACB (Concentration where curve = 1 BMAD) |
| {assay_name}_logAC10 | assay/chemical-logA10 (Concentration where curve = 10% of Top) |
| {assay_name}_logACC | assay/chemical-logACB (Concentration where curve = 3 BMAD) |
| {assay_name}_logACB | assay/chemical-logACB (Concentration where curve = 1 BMAD) |
| {assay_name}_flags | Listing of any flags for the assay |
| pseudo.logAC50.median | Median pseudo A50 |
| pseudo.logAC50.min | Lower CI of the pseudo AC50 |
| pseudo.logAC10.median | Median pseudo A10 |
| pseudo.logAC10.min | Lower CI of the pseudo AC10 |
| pseudo.logACC.median | Median pseudo ACC |
| pseudo.logACC.min | Lower CI of the pseudo ACC |
| pseudo.logACB.median | Median pseudo ACB |
| pseudo.logACB.min | Lower CI of the pseudo ACB |
| maximum.receptor | The receptor with the maximum AUC values |
| N.assays.hit | Number of active assays |
| N.assays.hit.hi.Z | Number of active assays with Z>3 |
| N.assays.hit.lo.Z | Number of assay active with Z<3 |
| promiscuity.hi.Z | Fraction of assay hits with Z>3 |
| promiscuity.lo.Z | Fraction of assay hits with Z<3 |
| specificity_score | This is a measure of how likely the chemical is to be causing specific AR activity rather than assay interference |
| ga_min_left | These values are metrics associated with the antagonist shift calcualtion. ga_min_left is the minimum CI for AC50 of the assay that should be left-shifted |
| ga_med_left | median AC50 of the assay that should be left-shifted |
| ga_max_left | maximum CI for AC50 of the assay that should be left-shifted |
| hit_pct_left | Hit percentage for the left-shifted assay |
| ga_min_right | minimum CI for AC50 of the assay that should be right-shifted |
| ga_med_right | median AC50 of the assay that should be right-shifted |
| ga_max_right | maximum CI for AC50 of the assay that should be right-shifted |
| hit_pct_right | Hit percentage for the right-shifted assay |