Welcome to the Mixed Cumulative Probit (MCP) Vignette. This document is meant to provide all necessary functions, descriptions, and data structures needed to successfully use the MCP to generate univariate and multivariate models.
If you would like the .HTML file of this document for use without internet access, please email Elaine Y. Chu (elainechu@txstate.edu)
The MCP is a novel algorithm developed by Michael H. Price that uses a Bayesian framework to calculate a continuous variable using any number and combination of continuous, categorical, and/or ordinal data.
While this demonstration uses subadult age estimation to provide data and reproducible results of the MCP process, the goal of this document is to make the MCP as accessible and applicable to as wide of an audience as possible, with the hope that users may then use the MCP algorithm with their own data and research agendas.
All analyses are conducted in R and/or RStudio.
NOTE: There are a couple of text formats in this HTML document that signal specific things.
Any highlighted text with red font is R/code-related.
Any text with a yellow background will be demonstration-specific. Anything that is in yellow can be changed to meet your specific needs.
Any text with a red background will provide some commonly-encountered errors, with some suggestions on how to solve them.
The data used for this demonstration are from the following publication:
Stull, K.E., Chu, E.Y., Corron, L.K., & Price, M.H. (2023). The Mixed Cumulative Probit: A multivariate generalization of transition analysis that accommodates variation in the shape, spread, and structure of data. Royal Society Open Science.
Finally, the authors recognize the potential for the MCP to address research agendas in Anthropology, Demography, Economics, and beyond, and encourage researchers to apply the MCP algorithm to their appropriate data.
Most functions used in this document are located in an R package
hosted on GitHub called yada, which stands for “Yet Another
Demographic Analysis”. To install the functions used in this
vignette:
install.packages("devtools") # install devtools package, if needed
devtools::install_github("MichaelHoltonPrice/yada") # install yadaBecause yada is a living repository that gets updated,
we recommend re-installing often and from the development
branch on GitHub. To check whether any updates have been made,
refer to the “Last Updated” date at the top-left corner of this
document, which reflects the date of any last updates to
yada. To update yada, add the argument
ref="dev":
# detach("package:yada", unload=T) # unload yada, if already loaded
# update yada using the development ("dev") branch
devtools::install_github("MichaelHoltonPrice/yada", ref="dev") In addition to yada, this document uses additional R
packages for faster optimization (“parallel processing”) and
visualization. Please ensure the following packages are in your R
library:
doParallel
tidyverse

Once you have confirmed that all required packages are installed in the local system, you can begin the MCP algorithm optimization process.

Step 1: Initial Steps for the MCP
If you would like to follow the vignette step-by-step with the examples provided, you may download the folder containing the data, a template script, and a template var_info file (more on this later) by right-clicking the following link and selecting “Open link in new tab”: LINK.
The templates are provided as a script-version guideline for changing only the necessary arguments for your own data.
After confirming that all package dependencies and the repository
yada are included in the local system, we can start setting
up our local R system for utilizing the MCP algorithm.
Before proceeding, please ensure that your
working directory is set to whatever file path you would
like all of your analyses to be contained within.
# NOTE: make sure to change "file/path/to/analysis/folder" appropriately
setwd("file/path/to/analysis/folder")If you downloaded the folder from Google
Drive, your working directory file path should match the location where you
placed the folder. For this demonstration, we recommend the shortest file path possible: `setwd("C:/MCP-Vignette-Files")`
Next, use the following lines to create a data folder and a folder in which all of your MCP algorithm files are stored - we recommend this be a results folder.
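The exact chunk is not shown in this rendering; a minimal sketch, assuming the folder names "data" and "results" used throughout this vignette:

dir.create("data")    # folder to hold the input data files
dir.create("results") # folder to hold all files generated by the MCP algorithm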
If you are following this vignette using the MCP-Vignette-Files folder, you do not need to complete this step, as the "data" and "results" folders already exist.
If you are using your own data files, you will need to copy/paste them into the data folder that was just created. As noted above, if you are using the cloned MCP-Vignette-Files folder, then the data files already exist in the data folder and you will not need to do anything. The results folder should remain empty at this time.
Clear the global environment and load the package libraries to use in subsequent analyses:
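A minimal sketch of the chunk that produced the warnings below (the package list is assumed from the setup section above):

rm(list=ls())       # clear the global environment
library(yada)       # MCP functions
library(doParallel) # parallel processing (also attaches foreach, iterators, parallel)
library(tidyverse)  # data wrangling and visualization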
## Warning: package 'yada' was built under R version 4.4.2
## Warning: package 'doParallel' was built under R version 4.4.2
## Warning: package 'foreach' was built under R version 4.4.2
## Warning: package 'iterators' was built under R version 4.4.2
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tibble' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'stringr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
Finally, initialize parallel processing for future parts of the analysis, which allows for multiple processes to run at once through R by using the cores available on the local system.
We choose to use `cl=detectCores()` to use all of the cores on our local system to run parallel processing. The value of `cl=` can be changed by the user, in the event they want to use less of the local system's processing power.
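The registration call itself is not printed above; a minimal sketch using registerDoParallel(), which is referenced again in Step 3:

registerDoParallel(cl=detectCores()) # register all available cores as the parallel backend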
The data folder should contain two main files. If you are using your own data, you will need to place these files inside of the data folder you created.
If you are using the MCP-Vignette-Files folder, then you will find the two following files in the data folder.
var_info file (for us, US_var_info.csv) - This file holds all variable information and must be standardized with the following columns to match the format expected by the data manipulation functions in yada (explained below). If you are using your own data, you may adapt your own var_info file to this format; a template var_info file is found in the MCP-Vignette-Files folder.

| Column Name | Description |
| --- | --- |
| Variable | The variable name. |
| Type | The variable type. There should be exactly one variable, labeled as “x”, which is the output variable. Input variables should be labeled as either “ordinal” or “continuous”, and additional variables to be kept in the data file but not used for model optimization should be labeled as “other”. |
| Categories | A category specification for ordinal variables; essentially, enter the possible stages (in order) for the ordinal variable. Ordinal stages may be collapsed, per the researcher’s specifications. Leave blank for continuous variables. Ex: 1 2 3 4 5 6 7 8 9 10 11 {12 13} will collapse values 12 and 13 together. |
| Missing_Value | Value(s) that represent missing data for each variable. Multiple values are separated by commas. |
| Left_Right_Side | Specifies whether the Left/Right label is located at the “start” or the “end” of the variable name (columns) in the data file. |
| Left_Label | The label that signals left variables. |
| Right_Label | The label that signals right variables. |
| Left_Right_Approach | The approach used for merging left and right variables. |
Each variable named in the var_info file (US_var_info) must match a column name in the corresponding data file (here, SVAD_US.csv). The first four columns are required,
whereas the final four columns are optional and only used if left/right
variables exist in the same data file and are to be merged.
NOTE: Only variables named in the
var_info file will be kept for analyses.
The following functions are used in the initial setup to prepare the
data for the MCP. They are not required, if the user feels
comfortable with: 1) collapsing left and right columns (if
necessary), 2) selecting only the desired variables, and 3) formatting
the data into a problem file in the format expected
by functions in yada to use the MCP algorithm.
These functions rely on the var_info file, with the format specified above, to provide instructions on how the input data should be reformatted.
Here, we are loading the file US_var_info.csv found in the data folder.
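The loading call is not shown in this rendering; one simple possibility is reading the file as a standard CSV (yada may provide its own loading helper), with the object name var_info matching its use in reformat_mcp_data() below:

var_info <- read.csv("data/US_var_info.csv") # load the variable information file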
This file specifies that we are keeping the following variables (as columns):
This function re-formats and saves the data, including:

merging the left and right columns specified in the var_info file
keeping only the variables named in the var_info file
save_file=T saves the new, re-formatted data with “_reformatted” as the suffix to the original data file path

new_data <- reformat_mcp_data(data_file_path="data/SVAD_US.csv",
                              var_info=var_info,
                              save_file=T)
head(new_data) # check that columns are maintained and merged
## SVAD_identifier SEX agey FDL RDL max_M1 man_I2 HME_EF TC_Oss
## 1 US_0186 F 1.172603 141.79 83.86 NA NA 0 3
## 2 US_0006 F 4.756164 257.30 143.38 7 8 1 5
## 3 US_0837 M 4.073973 211.32 111.39 NA NA 0 5
## 4 US_0915 M 1.087671 144.60 88.97 2 1 0 2
## 5 US_0617 M 0.830137 124.93 77.32 1 1 0 2
## 6 US_0165 F 14.860274 NA NA 11 11 5 5
After confirming that our specified columns are retained and that the left and right columns of our continuous and ordinal variables are combined into single columns, we can proceed with Step 2: Generating Problem File(s).
The problem file contains all information that is used
by the MCP algorithm to optimize univariate and multivariate models. The
main problem is always required, while an
additional function is available to generate secondary
problem files for cross-validation (more below).
Problem files are lists, which is a
data type in R that can contain multiple other data types
(e.g., vectors, data frames, matrices, other lists, single
values). After generating a problem file for the MCP, you should have
the following features:
x, the output variable extracted from the data_file, containing the “predictor” variable (in this demonstration, known age)
Y, a matrix of the input variables extracted from the data_file, in which variables are stored as rows and observations as columns. This is opposite of what we see when we view our reformatted data frame, new_data

The following functions facilitate the creation and saving of problem files for the MCP pipeline.
This function takes the re-formatted data from the previous step
(using reformat_mcp_data()),
extracts pertinent information, and structures it in the
list format expected by the MCP algorithm.
In the MCP pipeline, univariate and multivariate model selection can currently be conducted in two ways: 1) using cross-validation (cv) or 2) using Akaike’s Information Criterion (AIC). It is up to the user to decide which model selection method is desirable for them.
This function uses the main problem file to generate
cross-validation training and test folds for model
selection. The number of folds is dictated by the argument
K=. Each of the secondary problem files generated
will have the same structure as the main problem, but will
differ in the subsets of x and Y. To
ensure reproducibility of the subsets of x and
Y within each fold, the argument “seed=” can be
used.
For this demonstration, we choose to use cross-validation as our method for model selection. Those who want to use AIC can skip this step. We generate 4 cross-validation folds and set the seed as 234227327, which was randomly generated using random.org.
This function saves problem files. Using this function is the first instance of two very important arguments that will be common throughout the functions in the MCP pipeline / vignette and are chosen by the user:
data_dir - The data directory (i.e., folder) where all files generated by the MCP algorithm should be stored
analysis_name - The unique identifier for the current analysis, which distinguishes the files from previous or future MCP algorithm uses

To initialize these objects:
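The initialization chunk is not shown; the values below are inferred from the file paths printed later in this vignette (a "results" folder and the analysis name "US") and should be adjusted for your own analysis:

data_dir <- "results" # folder where all MCP algorithm files will be stored
analysis_name <- "US" # unique identifier for this analysis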
If you are using aic, you will only need to run
save_problem() once:
# Save main problem
save_problem(data_dir=data_dir,
analysis_name=analysis_name,
problem=main_problem,
is_folds=F)
Note how this function does not require defining the output as an
object using <-.
If you are using cv, you should run save_problem()
twice:
# Save main problem
save_problem(data_dir=data_dir,
analysis_name=analysis_name,
problem=main_problem,
is_folds=F)
# Save cv problems
save_problem(data_dir=data_dir,
analysis_name=analysis_name,
problem=cv_problems,
is_folds=T) # NOTE: is_folds=T here, because we have folds!
Once we have generated all problem files and they are saved in our
data_dir, we can move onto Step 3:
Optimizing Univariate Models
The MCP algorithm has the ability to fit univariate models to data
that are non-linear (shape) and have heteroskedastic (non-uniform)
variance (spread). A benefit of the functions in yada is
that the present demonstration does not assume a certain shape or spread
of the data, instead allowing for model selection to decide which are
the best model configurations (combination of shape and spread) for each
variable in the data.
The mean and noise options for each univariate model are as follows:

Mean Options
lin_ord - linear (ordinal variables)
log_ord - logarithmic (ordinal variables)
pow_law_ord - power law (ordinal variables)
pow_law - power law (continuous variables)

Noise Options
const - constant noise (homoskedastic)
lin_pos_int - linear with a positive intercept (heteroskedastic)

The functions in Step 3 have more options than previous steps. Each function is individually introduced below, some with more complicated chunks of code. This step is where you are first introduced to code for parallel processing.
There are two nearly-identical functions called
build_univariate_ord_problems() and
build_univariate_cont_problems(). These functions do
exactly what they say, which is build the possible model combinations
for all ordinal and continuous variables defined in the main
problem file located in the data_dir. The result is a
list of problem files to generate all possible model
combinations for all univariate models.
Noteworthy arguments for these functions include:
mean_specs - the mean model options
noise_specs - the noise model options
add_folds - a T/F statement on whether to build univariate models using cross-validation folds

We choose to use the default values for `mean_specs` and `noise_specs`, which are to generate all possible combinations available in `yada` for ordinal and continuous univariate models. You do not need to use all of the available options. In addition, we use `add_folds=T` to create univariate fold models for model selection using cross-validation.
mean_specs_ord <- c(rep("pow_law_ord",2),rep("log_ord",2),rep("lin_ord",2))
noise_specs_ord <- rep(c("const","lin_pos_int"),3)
ord_prob_list <- build_univariate_ord_problems(data_dir=data_dir,
analysis_name=analysis_name,
mean_specs=mean_specs_ord,
noise_specs=noise_specs_ord,
add_folds=T)
mean_specs_cont <- c(rep("pow_law",2))
noise_specs_cont <- c("const","lin_pos_int")
cont_prob_list <- build_univariate_cont_problems(data_dir=data_dir,
analysis_name=analysis_name,
mean_specs=mean_specs_cont,
noise_specs=noise_specs_cont,
add_folds=T)
To calculate the number of ordinal and continuous models being built by these functions, you can use the following equation:
expected = (num_folds+1) * [6(num_ord_variables) + 2(num_cont_variables)]
Given the above equation, we expect:
(4+1) * [6(4) + 2(2)] = 140 models built using both functions.
To check:
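One way to perform that check, using the two problem lists created above:

length(ord_prob_list) + length(cont_prob_list) # should equal the expected number of models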
## [1] 140
Nice. Now that all of our desired univariate problems are created, we can optimize each model individually.
There are again two near-identical functions called
solve_ord_problem() and solve_cont_problem().
These functions “solve” (i.e., optimize) the univariate
problems they are given. In other words, it optimizes the univariate
model in question using the data provided by the main and/or
secondary (if desired) problems.
Noteworthy arguments for these functions include:
*_prob - the univariate model problem to optimize
anneal_seed - an optional seed number for ordinal problem reproducibility; it is not an argument for continuous problems

Because there are many, many ordinal and continuous univariate models
to optimize, we solve the problems in parallel using
foreach and %dopar%. Essentially, this process
takes each problem, i, and optimizes the model using one of the
cores “registered” using registerDoParallel. This allows
for multiple univariate models to be optimized at the same time, as the
process is undertaken by different cores.
Depending on how many univariate problems need to be optimized, the
following process may take a while, certainly long enough to take a
break, eat some snacks, or enjoy a coffee at a cafe! If you would like
to monitor the optimization process, check your data_dir
and see how many files starting with solutiony_ are there. At
the end of the entire process, you should find the same number of files,
which you calculated above, in your data_dir.
We set a `base_seed` and generate a `seed_vect` below for further reproducibility of our univariate models. The `seed_vect` is fed into the argument `anneal_seed` for univariate ordinal problems.
# Set up for reproducibility
base_seed <- 264528 # number generated from random.org
set.seed(base_seed) # set the seed
seed_vect <- sample.int(1000000,
length(ord_prob_list),
replace=F) # vector of sampled numbers
# Solve the ordinal problems in parallel and save to data_dir
ord_success <-
foreach::foreach(i=1:length(ord_prob_list), .combine=cbind) %dopar% {
yada::solve_ord_problem(data_dir=data_dir,
analysis_name=analysis_name,
ord_prob=ord_prob_list[[i]],
anneal_seed=seed_vect[i])
}
# Solve the continuous problems in parallel and save to data_dir
cont_success <-
foreach::foreach(i=1:length(cont_prob_list), .combine=cbind) %dopar% {
yada::solve_cont_problem(data_dir=data_dir,
analysis_name=analysis_name,
cont_prob=cont_prob_list[[i]])
}
Check that your data_dir has all of the expected solutiony_ files. Some bonus code to do so in R:
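A possible version of that bonus code, assuming the solution files are written directly into data_dir with the solutiony_ prefix:

length(list.files(data_dir, pattern="^solutiony_")) # count the univariate solution files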
## [1] 140
Congratulations on making it this far! Proceed to Step 4: Univariate Model Selection
Model selection is an important step in the MCP algorithm pipeline because it allows the algorithm and data to inform what the best model combinations are, out of the different options created in Step 3, instead of the decision being made by the user/researcher.
The functions in yada currently support two different methods of model selection: 1) k-fold cross-validation (“cv”) and 2) Akaike’s Information Criterion (“aic”).
The authors highly suggest the user do additional research into what each of these methods entails to make the best-informed decision when using the MCP algorithm to optimize univariate and multivariate models. Moving forward, both model selection methods will be demonstrated.
The functions in Step 4 are specific to the univariate model
selection (“evaluation”) process. In general, there is one large
function, evaluate_univariate_models, that calculates all
of the metrics and formats all of the data in a structure expected by
subsequent MCP algorithm steps. The secondary functions are all for
user/researcher interpretation.
This function combines an input of mean and noise specifications into a single vector with all possible univariate models.
Noteworthy arguments include:
mean_models - A vector of all mean model specifications used to generate univariate models
noise_models - A vector of all noise model specifications used to generate univariate models

ord_models <- build_model_vec(mean_models=unique(mean_specs_ord),
                              noise_models=unique(noise_specs_ord))
cont_models <- build_model_vec(mean_models=unique(mean_specs_cont),
                               noise_models=unique(noise_specs_cont))
Given that we have generated each possible combination of mean and noise model specifications currently generated by the functions `build_univariate_ord_problems` and `build_univariate_cont_problems`, we expect `ord_models` to include 6 mean-noise combinations and `cont_models` to include 2 mean-noise combinations, totaling 8 mean-noise combinations.
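A quick check of that count, assuming build_model_vec() returns one entry per mean-noise combination:

length(ord_models) + length(cont_models) # expect 6 + 2 = 8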
## [1] 8
This function ranks the univariate models (i.e., files
starting with solutiony_ in the data_dir).
Broadly, the ranking system is dictated by either out-of-sample negative
log-likelihood k-fold cross-validation (“cv”), or Akaike’s
Information Criterion (“aic”). For a more detailed explanation on how
the ranking of univariate models, by variable, occurs, refer to the help
file:
Noteworthy arguments for this function include:
eval_type - Model selection technique. Options: “cv” or “aic”; default is “aic”
cand_tol - The candidate model tolerance, which is the value at which univariate models are considered “equally good” if their respective model evaluation values lie within the cand_tol value of each other. Model simplicity is then used as a “tie-breaker” for ranking
scale_exp_min - The minimum acceptable value of the scaling exponent (applicable only to log and pow_law models)
beta2_max - The maximum acceptable value of beta2, the heteroskedastic noise parameter (i.e., lin_pos_int)

For this demonstration:
`cand_tol=0.05`, `scale_exp_min=0.01`, and `beta2_max=5`.
These values were chosen for our specific analyses, but we recommend them as a fair starting point if the user is unsure of other values to use. As stated above, this demonstration chooses k-fold cross-validation for model selection, therefore `eval_type="cv"`.
eval_data <- evaluate_univariate_models(data_dir=data_dir,
analysis_name=analysis_name,
eval_type="cv",
ord_models=ord_models,
cont_models=cont_models,
cand_tol=0.05,
scale_exp_min=0.01,
beta2_max=5,
save_file=T)
Warning messages may show up in your console at this step; these warnings simply mean that certain mean-noise model combinations are not producing "rank-able" results. It is okay to proceed, and this does not affect multivariate model construction.
There are two near-identical functions:
write_ordinal_report() and
write_continuous_report(). These functions use the
evaluation data generated by evaluate_univariate_models()
to generate both .HTML and .Rmd files in the data_dir that
“report” the model selection results for each univariate variable, by
variable type. These functions are not necessary within the MCP
pipeline, but are useful for user/researcher interpretation.
Noteworthy arguments for these functions are:
eval_type - Model selection technique. Make sure that this argument matches what was used for evaluate_univariate_models() above
j or k - The number assigned to the ordinal (j) or continuous (k) variable in question. An easy way to find the associated number is to look in your data_dir for the solutiony_ files, in which either j or k will be followed by a number, followed by the variable name
line_width - the maximum width of each line in the report. This is purely for formatting the resulting .HTML file, and does not need to be altered, unless desired

We bundle both functions inside of for-loops to streamline the process instead of writing individual lines of code for each of our six variables. This is not required, but highly recommended for instances where there are many univariate models to go through.
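A sketch of that bundling, assuming the four ordinal and two continuous variables of this demonstration are numbered j=1-4 and k=1-2; the eval_type, line_width, and loop bounds are illustrative:

for (j in 1:4) { # one report per ordinal variable
  write_ordinal_report(data_dir=data_dir,
                       analysis_name=analysis_name,
                       j=j,
                       eval_type="cv",
                       line_width=80)
}
for (k in 1:2) { # one report per continuous variable
  write_continuous_report(data_dir=data_dir,
                          analysis_name=analysis_name,
                          k=k,
                          eval_type="cv",
                          line_width=80)
}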
This function takes the output file created by
evaluate_univariate_models() and returns a data frame with
all of the variables, the “best” mean and noise specifications, and the
parameter names and values associated with that model. This function
is also not required for the MCP pipeline, but is useful for reporting
univariate models and their specifications in tables and analysis
reports.
Noteworthy arguments include:
eval_type - Model selection technique. Again, make sure that this argument is the same as what was chosen for evaluate_univariate_models()

best_params <- get_best_univariate_params(data_dir=data_dir,
                                          analysis_name=analysis_name,
                                          save_file=T)
Once the univariate model selection step is complete, we can proceed to Step 5: Optimizing Multivariate Models
The MCP algorithm uses the optimized univariate model parameters (from Step 4) as a “jumping-off” point for multivariate model optimization. Multivariate model optimization is achieved using maximum likelihood estimation. What is unique about the MCP algorithm is that it can generate multivariate models that either assume conditional independence among the input variables or explicitly model their conditional dependence.
There are two main functions in yada that generate
multivariate models. Each function is explained below, in great
detail.
This function takes the selected univariate models and strings them together into a multivariate model that assumes conditional independence among the input variables. Importantly, conditional independence means that the input variables have no relationship with each other and a change in one of the variables does not affect change in another. It is important to know whether your data are fundamentally independent or dependent.
Noteworthy arguments for this function are:
fold - Which fold is being used to optimize the model. If fold=NA, the main conditionally independent model is optimized
calc_se - A logical (T/F) argument on whether to calculate the standard error of the unconstrained parameter vector. calc_se=T is required to optimize a conditionally dependent multivariate model
allow_corner - A logical (T/F) argument on whether the algorithm should allow corner solutions. If allow_corner=T, an attempt is made to use pertinent median values from other variables to set the scale
remove_var - A logical (T/F) argument on whether input variables that yielded no best univariate model should be removed. If remove_var=T, those variables are removed from the model and recorded. If remove_var=F, the conditionally independent model is not created and an error is returned

Thus far, we have been using cross-validation for model selection. For the multivariate models, we demonstrate AIC for model selection between the conditionally independent and conditionally dependent models. Of note, `fold=NA` because we do not need to generate model folds for AIC.
cindep_model <- build_cindep_model(data_dir=data_dir,
analysis_name=analysis_name,
fold=NA,
calc_se=T,
allow_corner=T,
remove_var=T,
save_file=T)
Warning messages may arise at this step, such as "Univariate solution may be a corner solution or may not be an optimum for...". Do not panic. These are just warning messages and do not mean that the conditionally independent model is not optimized. Additionally, these warnings will not hinder the optimization of the conditionally dependent model.
In order to transition from a conditionally independent (“cindep”)
model to a conditionally dependent (“cdep”) model, adjustments must be
made to the model specification (mod_spec), which is housed
in the problem files. For conditionally dependent
multivariate models, we can outline any baseline assumptions about
conditional dependence between input variables within the
mod_spec.
First, extract the mod_spec from the main conditionally
independent multivariate model optimized above and set it as the
conditionally dependent mod_spec:
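The extraction itself is not shown; a minimal sketch, assuming the optimized model object stores its specification as $mod_spec (the object name cdep_mod_spec matches its use below):

cdep_mod_spec <- cindep_model$mod_spec # start from the conditionally independent specification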
Next, define your assumptions for among-variable dependence. The MCP algorithm can calculate the among-variable dependence for you, although this method takes the longest amount of time:
J <- cdep_mod_spec$J # number of ordinal variables
K <- cdep_mod_spec$K # number of continuous variables
cdep_mod_spec$cdep_groups <- 1:(J+K) # all variables are their own "group"
To speed up optimization time, we chose to add a baseline assumption that all variables of the same type are more correlated among their types than between types. Therefore, we use the following code to define four conditionally dependent groups consisting of Epiphyseal Fusion ("EF"), Ossification ("Oss"), Dental ("max/man"), and Long Bones. This code can be manipulated based on the user/researcher's knowledge of their data.
# THE FOLLOWING CODE IS DEMONSTRATION SPECIFIC, WHERE WE ARE DEFINING FOUR (4)
# GROUPS OF VARIABLES THAT WE ASSUME HAVE GREATER AMONG-GROUP CORRELATION.
J <- cdep_mod_spec$J # number of ordinal variables
K <- cdep_mod_spec$K # number of continuous variables
ind_EF <- which(endsWith(main_problem$var_names,"EF")) # EF variables
ind_Oss <- which(endsWith(main_problem$var_names,"Oss")) # Oss variables
ind_Dent <- which(startsWith(main_problem$var_names,"ma")) # dental variables
ind_LB <- J + (1:K) # long bones
cdep_groups <- rep(NA,J+K) # initialize empty vector for all variables
cdep_groups[ind_EF] <- 1 # all EF variables are in group 1
cdep_groups[ind_Oss] <- 2 # all Oss variables are in group 2
cdep_groups[ind_Dent] <- 3 # all dental variables are in group 3
cdep_groups[ind_LB] <- 4 # all long bones are in group 4
cdep_mod_spec$cdep_groups <- cdep_groups # add cdep_groups to mod_spec
Finally, define the model as conditionally dependent:
build_file_path() is a bonus function. Behind the
scenes, this function from yada has been used to
standardize file names throughout the MCP algorithm pipeline that are
saved in your data_dir. This step will be the first
instance in which you explicitly call on build_file_path(),
which does exactly what the function name indicates: it builds a file
path for saving files.
Noteworthy arguments for this function include:
file_type - The type of file that you are saving. Current options for this argument are: “main_problem”, “training_problem”, “univariate_ord_soln”, “ordinal_ci”, “univariate_ord_rmd”, “univariate_cont_soln”, “univariate_cont_rmd”, “solutionx”, “cv”, “aic”, “mcp_inputs”, “cindep_model”, “hjk_progress”, “cdep_model”
fold - The fold number - some of the file_types can take a fold number as input, to add to the file path

Four types of file paths are generated specifically for the conditionally dependent multivariate model optimization:
main_problem_file_ex <- build_file_path(data_dir=data_dir,
analysis_name=analysis_name,
file_type="main_problem",
fold=NA)
fold_problem_file_ex <- build_file_path(data_dir=data_dir,
analysis_name=analysis_name,
file_type="training_problem",
fold=1)
hjk_progress_file_ex <- build_file_path(data_dir=data_dir,
analysis_name=analysis_name,
file_type="hjk_progress",
fold=NA)
cdep_save_file_ex <- build_file_path(data_dir=data_dir,
analysis_name=analysis_name,
file_type="cdep_model",
fold=NA)
The example file paths should look like:
## [1] "results/problem_US.rds"
## [1] "results/train_US_fold1.rds"
## [1] "results/hjk_progress_US.rds"
## [1] "results/cdep_model_US.rds"
This function takes the optimized conditionally independent model (cindep_model) and optimizes a conditionally dependent model using the updated mod_spec object created with the Intermission code (cdep_mod_spec). The function fit_multivariate() uses a Hooke-Jeeves (hjk) algorithm, with the conditionally independent model as the baseline fit, to calculate a maximum likelihood fit for a conditionally dependent, multivariate, mixed model.
Depending on the number of input variables supplied for the
multivariate model, this step may take between a few hours and a few
weeks (yikes!). Because the authors understand that it may not
be possible to leave a local system running for weeks on end, the
function is coded to allow for multiple “legs” of analyses, that start
where you left off, instead of starting from the beginning every time. A
record of your last run is saved as the
hjk_progress_file.
Noteworthy arguments of this function include:
mod_spec - This is the model specification for the conditionally dependent model, as created in the Intermission
cindep_model - The conditionally independent model that will be used as a baseline fit for the conditionally dependent model
prog_file - The file path for the optimization progress to be saved. In the event that optimization needs to be paused at any time, this file is used to start off where the optimization left off, instead of starting anew
save_file - The file path for the final conditionally dependent model file
hjk_control - A list that feeds into the embedded hjk function in the dfoptim package; see the dfoptim documentation for more information

Here, we only calculate one conditionally dependent model, as our intention is to use AIC for model selection down the line.
`hjk_control=list(info=T)` allows us to monitor the progress of the optimization in the console. The number of "steps" it takes to finish optimization depends on the number of variables and sample size.
# Initialize hjk progress and final model file paths
hjk_progress_file <- build_file_path(data_dir=data_dir,
analysis_name=analysis_name,
file_type="hjk_progress",
fold=NA)
cdep_save_file <- build_file_path(data_dir=data_dir,
analysis_name=analysis_name,
file_type="cdep_model",
fold=NA)
# Optimize the conditionally dependent model
cdep_model <- fit_multivariate(x=main_problem$x,
Y=main_problem$Y,
mod_spec=cdep_mod_spec,
cindep_model=cindep_model,
prog_file=hjk_progress_file,
save_file=cdep_save_file,
hjk_control=list(info=T))
Once all of our multivariate models are constructed, we move onto Step 6: Multivariate Model Selection.
As is important with univariate model selection, we choose to rely on a model selection method to convince us whether the assumption of conditional independence or conditional dependence is appropriate for our data.
The MCP algorithm again allows for model selection via cross-validation (cv) or Akaike’s Information Criterion (aic), which is conducted using a single function.
This function calculates either cv or aic to help
inform the user/researcher on the appropriate structure of the
multivariate model. evaluate_multivariate_models() returns
a list with evaluation type (eval_type), model selection
metric for the conditionally independent model(s)
(eval_cindep), and model selection metric for the
conditionally dependent model(s) (eval_cdep).
Noteworthy arguments in this function include:
eval_type - The model selection method used. As before, eval_type="cv" requires conditionally independent and conditionally dependent model folds to be present in the data_dir

The general rule for both cross-validation and AIC is that the model with the lowest model selection metric is selected.
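A sketch of the call that produced the output below, assuming evaluate_multivariate_models() takes the same data_dir and analysis_name arguments as the other evaluation functions:

multi_eval <- evaluate_multivariate_models(data_dir=data_dir,
                                           analysis_name=analysis_name,
                                           eval_type="aic")
multi_eval # print the evaluation list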
## $eval_type
## [1] "aic"
##
## $eval_cindep
## [1] 11629.8
##
## $eval_cdep
## [1] 11034.08
Given the above results, our model selection indicates that conditional dependence is the appropriate assumption for our current data, and that the conditionally dependent model should be used.
This concludes the end of our “model construction” portion of the demonstration. Next is Step 7: Getting Results.
Probability distributions are one of the main factors of a Bayesian framework. Instead of a single point estimate and generally equidistant confidence interval, models optimized using the MCP algorithm result in a distribution of probabilities over all possible values of x, the continuous output. This probability distribution can then be used to extract a single point estimate or range (i.e., “credible interval”), if desired. This type of Bayesian output allows for probabilistic statements on the output x.
Bayes’ Theorem is generalized as:
\(\tt{Posterior\:Distribution} = \frac{\tt{Likelihood} \times \tt{Prior}}{\tt{Probability\:of\:Input}}\)
Thus, both the \(\tt{Likelihood}\) and the \(\tt{Prior}\) have influence on the \(\tt{Posterior\:Distribution}\).
In this lengthy discussion, we introduce the functions of
yada that are used after a model is generated using the MCP
algorithm. Within this section, we discuss:
This function generates a prior distribution on the output variable,
x, the \(Prior\) portion of Bayes’ Theorem, which is
required for then calculating a posterior distribution of our output.
calc_x_prior() uses the known distribution of
x, provided by a problem file, to apply an offset
Weibull mixture model, which assumes higher probabilities of
x at the lower and upper bounds of all provided values
of x. A Weibull distribution is informative for
situations where data follow the mortality curve, commonly
observed in medical examiners’ offices and subadult skeletal
collections.
Noteworthy arguments for this function include:
prior_type - Currently, only a Weibull (prior_type="weibull") parameterization can be applied to our prior distribution on x. We are currently working on adding more distribution shapes
offset - The amount by which the prior distribution should be offset from the known output, x. This is a required argument, as we want our prior distribution to be representative of, but not identical to, our knowledge of our output

Because our data are related to subadult age and use subadult skeletal data, we choose to apply a Weibull distribution to our x prior.
th_x <- calc_x_prior(data_dir=data_dir,
analysis_name=analysis_name,
prior_type="weibull",
offset=0.002,
seed=895131,
fold=NA,
save_file=T)
## number of iterations = 2000
calc_x_prior() results in a single file in the
data_dir starting with solutionx_.
This function loads the “best” (i.e., selected from either cross-validation or aic) univariate model for the variable given.
Noteworthy arguments for this function include:
var_name - The variable name of the model to be loaded
fold - The fold number to be loaded. Default is fold=NA, which loads the main univariate model
load_data - A logical for whether the data used to train the model should be included. Default is load_data=F

This function uses the prior distribution, calculates the likelihood distribution, and then calculates the posterior probability distribution for the output, x.
Noteworthy arguments for this function include:
y - The input data, either a single value for univariate models or a vector of values for multivariate models
th_x - The parameterization of the prior on the output value, calculated using calc_x_prior()
model - The model used to calculate the posterior distribution
xcalc - A vector of evenly-spaced output values for which the posterior distribution should be calculated. If length(xcalc)==0, it is calculated for you and a seed must be provided for replication. Default is xcalc=c()
normalize - A logical for whether the posterior distribution should be normalized to integrate to 1. Default is normalize=T
seed - A seed value for result replication. If length(xcalc)==0, a seed must be provided. Default is seed=NA

The result of calc_x_posterior() is a list with the following objects:

x - The values of the output, x, for which a posterior probability density was calculated. This vector should be identical to xcalc, if it was provided
density - The density value for the corresponding value of x

To demonstrate `calc_x_posterior()` we will use observation n=165 and provide both a univariate and multivariate example. NOTE: `calc_x_posterior()` only works on a single individual (aka one row of your data frame) at a time. If you would like to produce posterior distributions (aka estimates) for multiple individuals, please skip to Step 8: Reporting Results.
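The chunk selecting the demonstration observation is not shown; a minimal sketch (n is the row index used in the examples below):

n <- 165      # demonstration observation
new_data[n, ] # inspect this individual's reformatted data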
## SVAD_identifier SEX agey FDL RDL max_M1 man_I2 HME_EF TC_Oss
## 165 US_0409 F 2.369863 157.81 89.86 6 4 0 3
There are two ways of extracting the input variables to use in `calc_x_posterior()`, either from the reformatted data, or from the problem file. We demonstrate both ways of doing so below, first using the reformatted data, `new_data` to extract a single variable for a univariate model calculation, `FDL` for our demonstration, and then from the `main_problem` for an extraction of multiple variables for multivariate model calculation.
We choose to define `xcalc=seq(0,23,by=0.01)`, given that our known output data range between 0-23.
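The univariate example also assumes the selected FDL model has been loaded with load_best_univariate_model(), as described above; a minimal sketch (the object name fdl_model matches its use in the next chunk):

fdl_model <- load_best_univariate_model(data_dir=data_dir,
                                        analysis_name=analysis_name,
                                        var_name="FDL",
                                        fold=NA, load_data=F)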
# Extract a single variable value, FDL, from the reformatted data, where
# observations are represented as rows, not columns
y <- new_data[n, "FDL"]
fdl_post <- calc_x_posterior(y=y,
th_x=th_x,
model=fdl_model,
xcalc=seq(0,23,by=0.01),
normalize=T,
seed=NA)
Calculating the posterior distribution using a multivariate model will inherently take a bit longer than using a univariate model. Depending on the number of variables included in the multivariate model, this calculation may take anywhere from a few seconds (6 variables) to an hour+ (62 variables). Please exercise patience.
# Extract the vector of input values from the Y matrix, where observations
# are represented by columns, not rows
y <- main_problem$Y[,n]
# Load cdep_model if you did not optimize during the same session
cdep_model <- readRDS(build_file_path(data_dir=data_dir,
analysis_name=analysis_name,
file_type="cdep_model"))
# Calculate the posterior probability distribution
cdep_post <- calc_x_posterior(y=y,
th_x=th_x,
model=cdep_model,
xcalc=seq(0,23,by=0.01),
normalize=T,
seed=NA)
At the end of this step, you should be able to calculate posterior distributions for your output variable, x, for a single observation using either a univariate or multivariate model. Next, Step 8: Reporting Results
This section is the fun stuff. Here, we demonstrate how to extract meaning from our posterior probability distributions, such as calculating measures of central tendency (e.g., mean, mode, median) and credible intervals (which is the appropriate term when using Bayesian analyses instead of the frequentist “confidence interval”), compiling results into tables, and visualization.
The functions introduced in this step, as stated above, all have to do with manipulating our posterior probability distributions.
This function takes the output of calc_x_posterior() and
returns a list containing the output value and density
for the posterior mean (xmean), posterior median
(xmed), posterior mode (xmode), 95% credible
interval (xlo-xhi), and 99% credible interval
(xlolo-xhihi).
Noteworthy arguments for this function include:
xv - The vector of output values for which the posterior was calculated. This vector is most likely the $x from the result of calc_x_posterior()
fv - The vector of densities calculated for the posterior. This vector is most likely the $density from the result of calc_x_posterior()
ci_type - Whether the credible interval should be calculated based on quantiles (“quantiles”) or highest posterior density (“hdi”). Default is ci_type="hdi"
xknown - An optional argument; when provided, the posterior density of a specific value of xknown, often the known output, is calculated

We again provide examples for using `analyze_x_posterior()` for both univariate and multivariate contexts. We also use the default credible interval type, where `ci_type="hdi"`. Additionally, we demonstrate the differences in output when `xknown` is provided.
fdl_analyze <- analyze_x_posterior(xv=fdl_post$x,
fv=fdl_post$density,
ci_type="hdi",
xknown=new_data[n,"agey"])## Identical densities found along different segments of the distribution,
## choosing rightmost.
Using the conditionally dependent multivariate model, with `ci_type="quantiles"` and `xknown=NA`.
cdep_analyze <- analyze_x_posterior(xv=cdep_post$x,
fv=cdep_post$density,
ci_type="quantiles",
xknown=NA)
We will be keeping these two *_analyze objects for use
later when we visualize the results.
In order to introduce the next two functions, we must generate an example test sample. Ideally, the test sample is composed of completely new data.
As mentioned above, a test sample is ideally a set of completely new data, but here we will use a subset of our original data as an example.
og_data <- read.csv("data/SVAD_US.csv") # load original ("og") data
set.seed(2022) # set seed for replication
test_idx <- sample(1:nrow(og_data), size=20, replace=F) # sample 5 data rows
sub_data <- og_data[test_idx,] # extract the sampled rows
write.csv(sub_data, "data/test_sample.csv", row.names=F) # save dataThere are two nearly-identical functions in yada:
univariate_batch_calc() and
multivariate_batch_calc(). These two functions both use
calc_x_posterior() and analyze_x_posterior()
together when given a data frame with multiple observations for which we
would like to calculate measures of central tendency and credible
intervals. Depending on how large your test sample is and how many
variables you are calculating for, this process may take between a few
minutes up to a few days! Make your arrangements accordingly.
IMPORTANT: Before using a test sample to calculate
posterior probability distributions using the functions below, make sure
that the data are reformatted using the var_info file and
reformat_mcp_data().
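That reformatting step is not shown explicitly; a minimal sketch using reformat_mcp_data() as in Step 1 (the object name test_samp matches its use below):

test_samp <- reformat_mcp_data(data_file_path="data/test_sample.csv",
                               var_info=var_info,
                               save_file=T)
head(test_samp) # confirm columns are maintained and merged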
## SVAD_identifier SEX agey FDL RDL max_M1 man_I2 HME_EF TC_Oss
## 1 US_0281 M 2.926027 142.06 87.88 6 4 0 3
## 2 US_0100 F 1.478439 145.60 88.25 4 3 NA NA
## 3 US_0850 F 18.315068 NA NA 11 11 NA NA
## 4 US_0194 F 15.638604 NA NA 11 11 NA NA
## 5 US_1154 M 18.098630 NA NA 11 11 NA NA
## 6 US_0497 M 16.687671 NA NA 11 11 6 5
Noteworthy arguments for these functions include:
test_samp - The reformatted test sample to use for posterior predictions
demo_cols - Column numbers for demographic columns to be kept in the final prediction table
input_cols - Column numbers for the input variables
ci_type - Whether the credible interval should be calculated using “hdi” or “quantiles”

As univariate_batch_calc() runs, you will be updated with which input variable and observation is being processed.
univariate_batch_calc(data_dir=data_dir,
analysis_name=analysis_name,
test_samp=test_samp,
demo_cols=1:3,
input_cols=4:ncol(test_samp),
ci_type="hdi", th_x=th_x,
xcalc=seq(0,23,by=0.01),
seed=NA, save_file=T)
After applying calc_x_posterior() and analyze_x_posterior() for each observation for each model, if applicable, the top rows of the output table are printed to monitor the function's progress. An example of the output is provided below for multivariate_batch_calc().
Noteworthy arguments for this function include:
model_type - Whether the conditionally independent (“cindep”) or conditionally dependent (“cdep”) model should be used

multivariate_batch_calc(data_dir=data_dir,
analysis_name=analysis_name,
test_samp=test_samp,
demo_cols=1:3,
model_type="cdep", ci_type="hdi",
th_x=th_x, xcalc=seq(0,23,by=0.01),
seed=NA, save_file=T)## [1] "Calculating age posterior for observation=1"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=2"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=3"
## [1] "Calculating age posterior for observation=4"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=5"
## [1] "Calculating age posterior for observation=6"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=7"
## [1] "Calculating age posterior for observation=8"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=9"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=10"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=11"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=12"
## [1] "Calculating age posterior for observation=13"
## [1] "Calculating age posterior for observation=14"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=15"
## [1] "Calculating age posterior for observation=16"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=17"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=18"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Calculating age posterior for observation=19"
## [1] "Calculating age posterior for observation=20"
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## [1] "Prediction Table for cdep_model:"
## SVAD_identifier SEX agey xmean xmed xmode lower99 lower95 upper95
## 1 US_0281 M 2.926027 1.89 1.87 1.81 1.24 1.35 2.52
## 2 US_0100 F 1.478439 1.51 1.49 1.45 0.95 1.06 2.02
## 3 US_0850 F 18.315068 17.19 17.50 18.10 9.71 11.89 21.86
## 4 US_0194 F 15.638604 17.19 17.50 18.10 9.69 11.99 21.97
## 5 US_1154 M 18.098630 17.19 17.50 18.10 9.88 11.66 21.71
## 6 US_0497 M 16.687671 18.00 18.11 18.36 12.69 13.94 21.62
## upper99
## 1 2.52
## 2 2.02
## 3 21.86
## 4 21.97
## 5 21.71
## 6 21.62
Occasionally you will get an error along the lines of: `e$fun(obj, substitute(ex), parent.frame(), e$data) : worker initialization failed`. This is a by-product of parallel processing, and re-starting your R session should take care of the error.
This function calculates the 95% and 99% credible interval for each stage of a given univariate ordinal model. As there are a finite number of ordinal stages for each variable, the point estimate and credible interval will be the same for any observation with the same ordinal stage.
Noteworthy arguments for this function include:
var_name - The variable name for the univariate model to be used for estimation
point_est - The type of point estimate to be reported. Options are: xmean, xmed, and xmode
ci_type - Whether the credible interval should be calculated using “hdi” or “quantiles”
input_seed - The base seed used to calculate a seed for each ordinal stage

Depending on how many stages there are for the chosen ordinal variable, this function may take a few minutes.
NOTE: generate_ord_ci() can only be
used for univariate ordinal models, as continuous variables do not have
distinct stages.
We demonstrate this function using the variable max_M1.
max_M1_df <- generate_ord_ci(data_dir=data_dir,
analysis_name=analysis_name,
var_name="max_M1",
th_x=th_x,
point_est="xmode",
ci_type="hdi",
xcalc=seq(0,23,by=0.01),
save_file=T)
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
## Identical densities found along different segments of the distribution,
## choosing rightmost.
The following code provides examples of the different types of visualizations that can be produced using the results generated from models optimized using the MCP algorithm.
Similar to the concept of generate_ord_ci(), these two visualizations are meant to display the outputs returned by univariate ordinal models.
Using the output from generate_ord_ci(), we can visualize the point estimate and credible interval for each stage of a univariate ordinal variable.
For this demonstration, we use the data frame of max_M1 generated using `generate_ord_ci()`.
ggplot(max_M1_df) +
geom_errorbar(aes(x=ord_stage, ymin=lower95, # factor stages
ymax=upper95, color=factor(ord_stage))) + # color by stage
geom_point(aes(x=ord_stage, y=point_est,
color=factor(ord_stage))) + # point estimate by stage
scale_x_continuous(breaks=c(0:11), labels=c(1:12)) + # add ticks on x-axis at each stage
labs(x="Ordinal Stage", y="Age [years]", title="Maxillary M1") + # labels
theme_bw() + # black and white plot space
theme(legend.position="none") # remove legendNOTE: The ordinal stages used by any model generated
using the MCP pipeline automatically start at 0 and are re-coded as
integers. These may not reflect your original scores, and therefore
adjustments can be made such as using
scale_x_continuous(labels=c("vector","of","stages")).
In order to visualize the posterior distributions of a univariate
model by stage as densities, the following steps can be taken:
1. Load the selected (“best”) univariate model using
load_best_univariate_model()
2. Extract the unique stages from lowest to highest using
sort() and unique()
3. Initialize an empty data frame for storing the vectors of
x, density, and stage
from the results of each calc_x_posterior()
4. Use a for-loop to run through each unique stage for that variable
To demonstrate these steps, we use the Humerus Medial Epicondyle Epiphyseal Fusion Scores (`HME_EF`).
# 1. Load the univariate model for HME_EF
hme_model <- load_best_univariate_model(data_dir=data_dir,
analysis_name=analysis_name,
var_name="HME_EF",
fold=NA, load_data=F)
# 2. Extract unique stages from lowest to highest
hme_stages <- sort(unique(new_data$HME_EF))
# 3. Initialize an empty data frame with column names: x, density, stage
hme_post_df <- data.frame(x=NA,density=NA,stage=NA)
# 4. Loop through each unique stage
for(stage in 1:length(hme_stages)) {
df0 <- matrix(data=NA, ncol=3, nrow=length(seq(0,23,by=0.01)))
hme_post <- calc_x_posterior(y=hme_stages[stage],
th_x=th_x,
model=hme_model,
xcalc=seq(0,23,by=0.01),
normalize=T)
df0 <- data.frame(x=hme_post$x,
density=hme_post$density,
stage=rep(hme_stages[stage],length(hme_post$x)))
hme_post_df <- rbind(hme_post_df,df0)
}
# Post-processing step
hme_post_df <- hme_post_df[-1,] # remove first row of all NAs
head(hme_post_df) # inspect top 6 rows
## x density stage
## 2 0.00 3.963317 0
## 3 0.01 1.992746 0
## 4 0.02 1.570963 0
## 5 0.03 1.352773 0
## 6 0.04 1.211704 0
## 7 0.05 1.110026 0
In the previous example using max_M1, we re-labeled
the ordinal stages on the x-axis using
scale_x_continuous(labels=c()). Below, we demonstrate how
you can re-code the stages in the data instead, using the
recode() function from the package dplyr,
which was loaded with the tidyverse library.
hme_post_df$stage <- recode(hme_post_df$stage,
"0"="0",
"1"="1",
"2"="12",
"3"="2",
"4"="23",
"5"="3",
"6"="4")
unique(hme_post_df$stage) # check that the recoding was done correctly
## [1] "0" "1" "12" "2" "23" "3" "4"
Now that the data are set up, we can plot each posterior density distribution, colored and separated (“faceted”) by stage:
ggplot(hme_post_df) +
geom_area(aes(x=x, y=density, fill=factor(stage)),
color="black", alpha=0.5) + # fills space under line
labs(x="Age [years]", y="Density", fill="Stage") + # sets the labels
theme_bw() + # uses black and white theme
theme(legend.position="none") + # removes legend
facet_wrap(~stage, scales="free_y") # facets by stage, free y-axisSimilar to plotting the point estimate and credible intervals for each ordinal stage, we can achieve similar plots for predictions regardless of model type and input variables. These are helpful for visualizing accuracy and precision of the model where:
To demonstrate, we will use the multivariate model predictions calculated during the `multivariate_batch_calc()` example.
multi_pred <- read.csv("results/cdep_model_US_test_predictions.csv")
ggplot(multi_pred) +
geom_errorbar(aes(x=agey, ymin=lower95, ymax=upper95)) +
geom_point(aes(x=agey, y=xmode)) +
geom_abline(slope=1, intercept=0, color="red", linetype="dashed") +
labs(x="Known Age [years]", y="Predicted Age [years]") +
theme_bw()
This plot is a great way of observing model performance using test data, although plots of larger samples will start to become busy. We encourage users to explore other ways of visualizing the outputs of models generated by the MCP algorithm.
If you have made it all the way to the end of this document, the authors commend and thank you for your time, patience, and curiosity about the MCP algorithm. We hope you have found this vignette informative and helpful.
If you have any questions about this document or the MCP algorithm
and the yada package, or would like to provide any
suggestions, please contact us at:
This work was funded by the National Institute of Justice (Awards 2015-DN-BX-K409 and 2017-DN-BX-0144) and the National Science Foundation (BCS-1551913). Opinions expressed herein do not necessarily represent the official position or policies of the U.S. Department of Justice or the National Science Foundation.