AR Minimal Assay Model Vignette

Richard Judson

#library(ARminassaymodel)

Introduction

This vignette describes how to run all of the analyses in the AR (androgen receptor) minimal assay model. The purpose of the overall AR project is to derive a consensus extimate of activity of chemicals against the AR from a large number of in vitro assays, each of which is noisy in its own way.The original model is presented in the paper by Klenistreuer et al., which used 12 assays. In the current project, we develop a new full model with a somewhat different set of assays (now totalling 14) and then look for subset models, which are models using the same biological pathway structure of Kleinstreuer, but with fewer than the full set of 14. The results of this analysis are presented in the manuscript “Selecting a Minimal set of Androgen Receptor Assays for Screening Chemicals” by Judson et al. All code is written in R and is in the ARminassaymodel package.

The model proceeds in several steps

  1. Prepare the data. This step extracts the in vitro assay results from invitrodb (the ToxCast database) to a set of files that the rest of the code will use.
  2. Run the full model. This produces the baseline model results which the subset models will be compared to.
  3. Run the subset models. This will run the model with all combinations of 2 to 14 assays. Note that the 14 assay subset model should give identical results to the full model, but is run as a QC check on the code.
  4. Run variation analysis on the full model. This uses the results of the ToxCast toxboot process that calculates confidence intervals around the AC50 values for each chemical-assay pair.
  5. Run several sets of analyses that produce overall statistics and figures.

Folder structure

Except for the initial extraction of data from invitrodb, all input and output is file-based, and there are many directories to keep track of, so they are described here.

For each run of the model (full or subset), the result is in an excel file called superMatrix_…, where the elipses give the subset model name, which is an “A” followed by a mask of length 14 indicating which assays are included. So the output from the full model (run in the subset mode) is superMatrix_A11111111111111.xlsx. For the original full model, the output file name is superMatrix_final.xlsx.The superMatrix file contains one row per chemical, with columns being the chemical identity (dsstox_substance_id, casrn, code, name), and the model results (AUC_agonist, AUC_antagonist, AUC_R3 …). The full description of the columns will be given below.

The parent directory is called “model”. Within that are …

  1. input - This contains all of the data required to run the code. Key data here are
    • DSStox - directory that contains a recent RData file with the entire DSSTox chemical database
    • toxcast_matrix - directory that contains a recent dump of the invitrodb database - see “Preparing the data”
    • toxcast_matrix_filtered - directory that contains the ToxCast data filtered for just the assays and chemicals used in the model
    • forward - directory holding the fiels with the input data for running forward models (new chemicals not used in the full analysis)
    • assay_list.txt - the list of assays in the order they appear in the pathway
    • AR_pathway_refchems in vitro in vivo cut.xlsx - modified set of reference chemicals based on an activity cutoff of 0.1
    • AR_pathway_refchems.xlsx - full set of reference chemicals
    • AR_pathway_chemicals.xlsx - full set of chemicals used in the model
    • ar_toxboot.xlsx - the uncertainty values from toxboot
    • Kleinstreuer et al AR model CRT 2016.pdf - a copy of the Kleinstreuer et al. paper
    • AR_supermatrix_Kleinstreuer_etal_2017.xlsx - the summary supermatrix from the original Kleinstreuer et al. paper. One analysis compares this result to the new full model.
    • Supplemental File 2_ARpathway_Results_ConfScores_CI_2016-08-30.xlsx - the detailed supermatrix from the original Kleinstreuer et al. paper. One analysis compares this result to the new full model.
    • EDSP_universe_input_2014_04_17.xlsx - Chemicals in the EDSP universe - this is added to the supermatrix output
  2. output - this is the tabular output from the full model. As stated above, only the “med” run is done, so the results are in this directory
    • med/superMatrix_final.xlsx - The final results for the full model
  3. plots - this is the graphical output from the full model. As stated above, only the “med” run is done, so the results are in this directory. This directory contains all of the full model plots used in the manuscript or the supplemental files
  4. varAMASK - this contains one directory per submodel run. Within each of these directories is the med directory, and within that is the result file called superMatrix_{mask}.xlsx
  5. submodel.output - this contains (in the “med” directory), the results of analyzing the submodels, both tabular and graphical
  6. varVAR - this contains one directory per variation run of the full model. Within each of these directories is the med directory, and within that is the result file called superMatrix_A11111111111111.xlsx
  7. varmodel.output - this contains (in the “med” directory), the results of analyzing the variational models, both tabular and graphical
  8. ARminassaymodel - this directory contains the RStudio file, the code in the R directory, and the local git repository
  9. forward_submodel - this directory holds the resutls of the submodels run with chemicals beyond the 1820 in all of hte other analyses. All data is in the med folder

Preparing the data

This step extracts all required data from invitrodb. To do this, run the following command:

allAssayDriver(do.db.extract=T,tbmode="med")

Before this is done, the database connection has to be set using

setDBConn(server={server},user={username},password={password})

For the calculations in the paper, we used database version ‘prod_internal_invitrodb_v3_2’ on server ‘mysql-res1.epa.gov’

Running the full model

The full model uses all 14 assays. To run this, first, prepare all of the data files using the command

allAssayDriver(do.file.prep=T,tbmode="med")

and load the data into a set of global variables (done to speed up execution) using the command:

allAssayDriver(do.prep=T,tbmode="med")

This only needs to be done once per session.

The full model is then run using either the command

allAssayDriver(do.everything=T,tbmode="med")

or the series of commands

allAssayDriver(do.plot0=T,tbmode="med")
allAssayDriver(do.ref=T,tbmode="med")
allAssayDriver(do.all=T,tbmode="med")
allAssayDriver(do.summary=T,tbmode="med")

The result of these commands will be to generate the reference chemical performance data (in plots/med and output/med) and the full model supermatrix (in output/med). Depending on the processor speed, this takes 20-30 minutes.

IMPORTANT NOTE In much of the code, there is a variable called “tbmode” which allows all code to run with all assays set to their lower or upper 95% confidence interval ranges from toxboot. For these cases, tbmode would be set ot “min” or “max”, The current variation analysis (see below) is more sophisticated than this, so for all current analyses, tbmode is set to “med”

Running the subset models

There are a large number of subset models: 2**Nassay - (Nassay+1). For Nassay=14, this is 16369. The CPU time increases with the model complexity (number of assays in the submodel), from about 2 minutes for 2-assay models to 20-30 minutes for the full assay model. Because of this, it is necessary to run in parallel. The command to run the subset models is:

submodelDriver(tbmode="med",minassay={minimum number of assays},maxassay={maximum number of assays},mc.cores={number of cores to use})

On Windows, mc.cores must be set to 1.

As stated above, each run is named with an “A” followed by the assay mask, e.g. “A1111000000000011” and put in the directory varAMASK/{run name}/med.superMatrix_{run name}.xlsx. When each run is started, the appropriate directories are created. Because runs could crash, there is a restart capability. Only runs for which no directory exists will be started. What this does not yet do is restart runs where the directory was created, but where no results were created. You can check the progress of the processing (which takes several days to weeks) by running the command:

submodeProgress()

This only looks at the presence of the directories, but not the contents. An example output is

nassay complete total percent
2 91 91 100
3 364 364 100
4 1001 1001 100
5 1804 2002 90
6 0 3003 0
7 876 3432 26
8 0 3003 0
9 0 2002 0
10 10 1001 1
11 26 364 7
12 91 91 100
13 14 14 100
14 1 1 100

Comparing the new full model with the original full model

One analysis presented in the paper compares the performance of the original and new full models. (The new model has more and different assays from the original full model). To run this analysis, use the command:

fullModelOldNewComp(to.file=T)

This produces a plot: plots/med/fullModelOldNewComp.pdf and a table comaring the old and new values, chemical by chemical: plots/med/fullModelOldNewComp.xlsx.

Calculating the variability in the full model results based on toxboot analysis

Full models with assay variability are compared to the full model with its point estimate AC50 values using the same approach as was used to compare the subset models. The variation analysis reruns the full model N times. (For this paper, N=100.) The results, in the form of a simplified superMatrix for each run, are placed in the directory varVAR/run_{i}/med where i runs from 1 to N. To perform the run over a range of i, run the command:

varmodelDriver(tbmode="med",runmin={minimum run number},runmax={maximum run number},mc.cores={number of cores to use})

For each of the modes (agonist,antagonist), we plot the AUC values of the variable full model vs. the point estimate full model and calculate an RMSE and R2. As above, AUC values are in the range of 0 to ~1. Values are then dichotomized with values <0.1 set to inactive (0) and those >=0.1 set to positive (1). From this, we create the confusion matrix and calculate sensitivity, specificity, balanced accuracy (BA), positive predictive value (PPV) and negative predictive values (NPV).These later statistics are performed for the full set of chemicals (1820) and for the in vitro and in vivo reference chemicals. To calculate these statistics across all subset models, run the command:

varmodelSummary(to.file=T,tbmode="med",agonist.cutoff=0.1,antagonist.cutoff=0.1,antagonist.min.score=2)

This produces summary statistics files: varmodel.output/med/sample_stats_agonist_0.1_2.xlsx , sample_stats_antagonist_0.1_2.xlsx and *sample_stats_antgonist__filtered_0.1_2.xlsx* , and a set of plots in submodel.output/med/assay_scan_0.1_0.1_2.pdf . The antagonist_filtered mode output sets the full model AUC(antagonist) to zero if the antagonist confidence score is <2.

The results across all subset models are summarized by running the command:

varmodelPlotStats(to.file=T,mode,cutoff.hit=0.1,antagonist.min.score=2,tbmode="med")

separately for mode in (agonist,antagonist,antagonist_filtered). This produces plots in the varmodel.output/med folder with names starting with “varmodel_stats_”. Note that the variation in the full model based on variability in the assay results is significantly smaller (in most cases) than the difference between the full model and the submodels.

Calculating the subset model performance

Submodels are compared to the full model using several statistics. For each of the modes (agonist,antagonist), we plot the AUC values of the submodel vs. the full model and calculate an RMSE and R2. AUC values are in the range of 0 to ~1. Values are then dichotomized with values <0.1 set to inactive (0) and those >=0.1 set to positive (1). Using these dichotimized values, we create the confusion matrix and calculate sensitivity, specificity, balanced accuracy (BA), positive predictive value (PPV) and negative predictive values (NPV).These later statistics are performed for the full set of chemicals (1820) and for the in vitro and in vivo reference chemicals. To calculate these statistics across all subset models, run the command:

submodelSummary(to.file=T,tbmode="med",agonist.cutoff=0.1,antagonist.cutoff=0.1,antagonist.min.score=2)

This produces summary statistics files: submodel.output/med/sample_stats_agonist_0.1_2.xlsx, sample_stats_antagonist_0.1_2.xlsx and *sample_stats_antgonist__filtered_0.1_2.xlsx, and a set of plots in submodel.output/med/assay_scan_0.1_0.1_2.pdf*. The antagonist_filtered mode sets the full model AUC(antagonist) to zero if the antagonist confidence score is <2. An example plot from this pdf is used for Figure 2 of the paper.

The results across all subset models are summarized by running the command:

submodelPlotStatsMax(to.file=T,cutoff.hit=0.1,antagonist.min.score=2,tbmode="med")

separately for mode in (agonist,antagonist,antagonist_filtered). This produces plots in the submodel.output/med folder with names starting with “submodel_stats_”, and are used to build Figure 2 of the paper.

Figure 3 of the paper shows the subset model performance for a single model. This is produced by running the command:

submodelSummary(to.file=T,tbmode="med",agonist.cutoff=0.1,antagonist.cutoff=0.1,antagonist.min.score=2,amask={amask})

This puts a pdf file with the graphs for just that subset model in the submodel.output/med/ directory.

Preparing data for new chemicals not run in the full analysis

The full analysis of the paper has used all 1820 chemicals that have data for all 14 assays. ToxCast has more chemcials with subsets of the assays, and these can be run through the appropriate subset models. To prepare the input data files, run the series of commands:

prepToxCastMatrixFiles.forward(date.string="190624",
                               get.chems=T,
                               get.nvs=T,
                               get.nonnvs=T,
                               filter.chems=T,
                               invitrodb.filter=T)

The date.string is the most recent date for data extraction from invitrodb, and must be the same as in allAssayDriver.R. This will create series of files in the directory input/forward/toxcast_matrix_filtered with the logAC50, hitcalls, tested status, etc.

Calculating the results of a single model for a new set of chemicals

To run a subset model with new chemicals (outside of the 1820 used in all of the previous analysis), prepare the new data as above and run the command:

submodel.forward(amask.name={model name, e.g. "A01000000000011"})

This will create the superMatrix for this model in the directory forward_submodel/med

Description of the superMatrix file

The Final output of the model is given in a file called superMatrix_final.xlsx. It includes one row per chemical and many columns, described here

name definition
dsstox_substance_id DSSTox generic substance ID
code Chemical code (CASRN with no dashes, preprended with a “C”)
casrn Chemical CASRN
name Chemical name
EDSP.Universe Is this chemical in the EPA EDSP Universe
refchem_agonist_potency For reference chemicals, the agonist potency designation
refchem_antagonist_potency For reference chemicals, the antagonist potency designation
average_antagonist_zscore The average Z-score for the antagonist assays
antagonist_auc_score The antaogonist AUC score
antagonist_shift_score The antagonist shift score
antagonist_zscore_score The antagonist z-score score
antagonist_max_receptor_score The antagonist max-receptor score
antagonist_cell_free_score The antagonist cell-free confidence score
antagonist_confidence_score The overall antagonist confidence score, the sume of the previous 5 rows. See Table 2 of the paper
cytotox_median_um The median of the cytotoxicity region
cytotox_lower_bound_um The lower bound of the cytotoxcity region
cytotox.assays.hit Number of active cytotoxicity assays
AUC.Agonist AUC for agonist mode
AUC.Antagonist AUC for antagonist mode
AUC.R3 … AUC.R6 Multi-assay-wise receptor AUC values
AUC.A1 … AUC.A14 Assay-wise AUC values
{assay_name}_logAC50 assay/chemical-logAC50 (Concentration where curve = 50% of Top)
{assay_name}_Emax assay/chemical-Emax (Maximum effect magnitude)
{assay_name}_T assay/chemical-Top (Top of the Hill curve)
{assay_name}_W assay/chemical-W (Hill curve slope parameter)
{assay_name}_Zscore assay/chemical-Z-score (Measure of distance from cytotoxicity concentration)
{assay_name}_maxConc assay/chemical-logACB (Concentration where curve = 1 BMAD)
{assay_name}_logAC10 assay/chemical-logA10 (Concentration where curve = 10% of Top)
{assay_name}_logACC assay/chemical-logACB (Concentration where curve = 3 BMAD)
{assay_name}_logACB assay/chemical-logACB (Concentration where curve = 1 BMAD)
{assay_name}_flags Listing of any flags for the assay
pseudo.logAC50.median Median pseudo A50
pseudo.logAC50.min Lower CI of the pseudo AC50
pseudo.logAC10.median Median pseudo A10
pseudo.logAC10.min Lower CI of the pseudo AC10
pseudo.logACC.median Median pseudo ACC
pseudo.logACC.min Lower CI of the pseudo ACC
pseudo.logACB.median Median pseudo ACB
pseudo.logACB.min Lower CI of the pseudo ACB
maximum.receptor The receptor with the maximum AUC values
N.assays.hit Number of active assays
N.assays.hit.hi.Z Number of active assays with Z>3
N.assays.hit.lo.Z Number of assay active with Z<3
promiscuity.hi.Z Fraction of assay hits with Z>3
promiscuity.lo.Z Fraction of assay hits with Z<3
specificity_score This is a measure of how likely the chemical is to be causing specific AR activity rather than assay interference
ga_min_left These values are metrics associated with the antagonist shift calcualtion. ga_min_left is the minimum CI for AC50 of the assay that should be left-shifted
ga_med_left median AC50 of the assay that should be left-shifted
ga_max_left maximum CI for AC50 of the assay that should be left-shifted
hit_pct_left Hit percentage for the left-shifted assay
ga_min_right minimum CI for AC50 of the assay that should be right-shifted
ga_med_right median AC50 of the assay that should be right-shifted
ga_max_right maximum CI for AC50 of the assay that should be right-shifted
hit_pct_right Hit percentage for the right-shifted assay