Overview

Welcome to the Mixed Cumulative Probit (MCP) Vignette. This document is meant to provide all necessary functions, descriptions, and data structures needed to successfully use the MCP to generate univariate and multivariate models.

If you would like the .HTML file of this document for use without internet access, please email Elaine Y. Chu (elainechu@txstate.edu)

The Mixed Cumulative Probit

The MCP is a novel algorithm developed by Michael H. Price that uses a Bayesian framework to estimate a continuous variable from any number and combination of continuous, categorical, and/or ordinal data.

Objectives

While this demonstration uses subadult age estimation to provide data and reproducible results of the MCP process, the goal of this document is to make the MCP as accessible and applicable to as wide an audience as possible, with the hope that users may then use the MCP algorithm with their own data and research agendas.

All analyses are conducted in R and/or RStudio.

NOTE: There are a couple of text formats in this HTML document that signal specific things.

Any highlighted text with a red font is R/code-related.

Any text with a yellow background will be demonstration-specific. Anything that is in yellow can be changed to meet your specific needs.
Any text with a red background will provide some commonly-encountered errors, with some suggestions on how to solve them.

The data used for this demonstration are from the following publication:

Stull, K.E., Chu, E.Y., Corron, L.K., & Price, M.H. (2023). The Mixed Cumulative Probit: A multivariate generalization of transition analysis that accommodates variation in the shape, spread, and structure of data. Royal Society Open Science.

Finally, the authors recognize the potential for the MCP to address research agendas in Anthropology, Demography, Economics, and beyond, and encourage researchers to apply the MCP algorithm to their appropriate data.

Package Dependencies

Most functions used in this document are located in an R package hosted on GitHub called yada, which stands for “Yet Another Demographic Analysis”. To install the functions used in this vignette:

install.packages("devtools")  # install devtools package, if needed
devtools::install_github("MichaelHoltonPrice/yada")  # install yada

Because yada is a living repository that gets updated, we recommend re-installing often and from the development branch on GitHub. To check whether any updates have been made, refer to the “Last Updated” date at the top-left corner of this document, which reflects the date of any last updates to yada. To update yada, add the argument ref="dev":

# detach("package:yada", unload=T)  # unload yada, if already loaded
# update yada using the development ("dev") branch
devtools::install_github("MichaelHoltonPrice/yada", ref="dev")  

In addition to yada, this document uses additional R packages for faster optimization (“parallel processing”) and visualization. Please ensure the following packages are in your R library:

  • doParallel
  • tidyverse
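If either package is missing, it can be installed from CRAN:

install.packages(c("doParallel", "tidyverse"))  # install from CRAN, if needed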

Once you have confirmed that all required packages are installed on the local system, you can begin the MCP algorithm optimization process with Step 1: Initial Steps for the MCP.

1. Initial Steps for the MCP

Sample Data

If you would like to follow the vignette step-by-step with the examples provided, you may download the folder containing the data, a template script, and a template var_info file (more on this later) by right-clicking the following link and selecting “Open link in new tab”: LINK. The templates are provided as a script-version guideline for changing only the necessary arguments for your own data.


After confirming that all package dependencies and the repository yada are included in the local system, we can start setting up our local R system for utilizing the MCP algorithm.

Setup

Before proceeding, please ensure that your working directory is set to whatever file path you would like all of your analyses to be contained within.

# NOTE: make sure to change "file/path/to/analysis/folder" appropriately
setwd("file/path/to/analysis/folder")
If you downloaded the folder from Google Drive, your working directory file path should match the location where you placed the folder. For this demonstration, we recommend the shortest file path possible: `setwd("C:/MCP-Vignette-Files")`

Next, use the following lines to create a data folder and a folder in which all of your MCP algorithm files are stored - we recommend this be a results folder.

If you are following this vignette using the MCP-Vignette-Files folder, you do not need to complete this step, as the "data" and "results" folders already exist.
dir.create("data")  # create data folder
dir.create("results") # create results folder

If you are using your own data files, you will need to copy/paste them into the data folder that was just created. As noted above, if you are using the cloned MCP-Vignette-Files folder, then the data files already exist in the data folder and you will not need to do anything. The results folder should remain empty at this time.

Clear the global environment and load the package libraries to use in subsequent analyses:

rm(list=ls())  # clear global environment

# Load Libraries
library(yada)
## Warning: package 'yada' was built under R version 4.4.2
library(doParallel)
## Warning: package 'doParallel' was built under R version 4.4.2
## Warning: package 'foreach' was built under R version 4.4.2
## Warning: package 'iterators' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tibble' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'stringr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2

Finally, initialize parallel processing for future parts of the analysis, which allows for multiple processes to run at once through R by using the cores available on the local system.

We choose to use `cl=detectCores()` to use all of the cores on our local system to run parallel processing. The value of `cl=` can be changed by the user, in the event they want to use less of the local system's processing power.
registerDoParallel(cl=detectCores())  # prepare system for parallel processing
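Optionally, you can confirm how many workers were registered; getDoParWorkers() comes from the foreach package, which doParallel loads:

getDoParWorkers()  # number of parallel workers currently registered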

The Input

The data folder should contain two main files. If you are using your own data, you will need to place these files inside of the data folder you created.

If you are using the MCP-Vignette-Files folder, then you will find the two following files in the data folder.
  • The data file (for us, SVAD_US.csv) - This file holds individual observations as rows, and demographic and trait information as columns. For more information on the data and its source, please explore the publication on the Subadult Virtual Anthropology Database.
  • The var_info file (for us, US_var_info.csv) - This file holds all variable information and must be standardized with the following columns to match the format expected by the data manipulation functions in yada (explained below). If you are using your own data, you may download and modify the var_info template found in the MCP-Vignette-Files folder.

Variable: The variable name.

Type: The variable type. There should be exactly one variable, labeled as “x”, which is the output variable. Input variables should be labeled as either “ordinal” or “continuous”, and additional variables to be kept in the data file but not used for model optimization should be labeled as “other”.

  • Options: x, ordinal, continuous, other

Categories: A category specification for ordinal variables; essentially, enter the possible stages (in order) for each ordinal variable. Ordinal stages may be collapsed, per the researcher’s specifications (see example). Leave blank for continuous variables.

  • Ex: 1 2 3 4 5 6 7 8 9 10 11 {12 13} will collapse values 12 and 13 together

Missing_Value: Value(s) that represent missing data for each variable. Multiple values are separated by commas.

  • Ex: -1,NA,99

Left_Right_Side: Specifies whether the Left/Right label is located at the “start” or the “end” of the variable name (columns) in the data file.

  • Ex: L_FDL == “start”, FDL_L == “end”, where _L is the Left/Right label (see below).

Left_Label: The label that signals left variables.

  • Ex: _L

Right_Label: The label that signals right variables.

  • Ex: _R

Left_Right_Approach: The approach used for merging left and right variables.

  • Options: left, right, mean, highest, lowest

Each variable in the data file (here, SVAD_US.csv) must have a matching entry in the corresponding var_info file (US_var_info.csv). The first four columns are required, whereas the final four columns are optional and only used if left/right variables exist in the same data file and are to be merged.

NOTE: Only variables named in the var_info file will be kept for analyses.
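If you would like a quick look at the var_info file itself before handing it to yada, a plain read with base R works (the yada-specific parsing happens in load_var_info(), introduced below):

head(read.csv("data/US_var_info.csv"))  # preview the first rows of the var_info file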

The Functions

The following functions are used in the initial setup to prepare the data for the MCP. They are not required if the user feels comfortable with: 1) collapsing left and right columns (if necessary), 2) selecting only the desired variables, and 3) formatting the data into a problem file in the format expected by functions in yada to use the MCP algorithm.

load_var_info()

This function loads the specified var_info file, with the format specified above, to provide instructions on how the input data should be reformatted.
Here, we are loading the file US_var_info.csv found in the data folder.

This file specifies that we are keeping the following variables (as columns):

  • Demographic / other: SVAD_identifier, SEX
  • x: agey
  • continuous: FDL, RDL
  • ordinal: max_M1, man_I2, HME_EF, TC_Oss
var_info <- load_var_info(var_info_file="data/US_var_info.csv")

reformat_mcp_data()

This function re-formats and saves the data, including:

  • selecting certain columns to keep in the final data file
  • merging left and right variable columns, per specifications in the var_info file
  • collapsing any ordinal variables, per specifications in the var_info file
  • save_file=T saves the new, re-formatted data with “_reformatted” as the suffix to the original data file path
new_data <- reformat_mcp_data(data_file_path="data/SVAD_US.csv",
                              var_info=var_info, 
                              save_file=T)
head(new_data)  # check that columns are maintained and merged
##   SVAD_identifier SEX      agey    FDL    RDL max_M1 man_I2 HME_EF TC_Oss
## 1         US_0186   F  1.172603 141.79  83.86     NA     NA      0      3
## 2         US_0006   F  4.756164 257.30 143.38      7      8      1      5
## 3         US_0837   M  4.073973 211.32 111.39     NA     NA      0      5
## 4         US_0915   M  1.087671 144.60  88.97      2      1      0      2
## 5         US_0617   M  0.830137 124.93  77.32      1      1      0      2
## 6         US_0165   F 14.860274     NA     NA     11     11      5      5

After confirming that our specified columns are retained and that the left and right columns of our continuous and ordinal variables are combined into single columns, we can proceed with Step 2: Generating Problem File(s).

2. Generating Problem File(s)

The problem file contains all information that is used by the MCP algorithm to optimize univariate and multivariate models. The main problem is always required, while an additional function is available to generate secondary problem files for cross-validation (more below).

Problem files are lists, which is a data type in R that can contain multiple other data types (e.g., vectors, data frames, matrices, other lists, single values). After generating a problem file for the MCP, you should have the following features:

  • x - A vector, with length equal to the sample size of data_file, containing the “predictor” variable (in this demonstration, known age)
  • Y - A matrix, with rows equal to the number of “response variables” and columns equal to the sample size, containing variable values of each observation in data_file. This is the opposite of what we see when we view our reformatted data frame, new_data
  • var_names - A vector, with length equal to the number of “response variables”, containing variable names
  • mod_spec - A list containing the model specifications, including:
      • J - The number of ordinal variables
      • K - The number of continuous variables
      • M - The number of “breaks” for each ordinal variable, which is the number of stages - 1
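As a quick sketch, once the main problem exists (it is created with generate_main_problem() below), these components can be inspected with base R:

str(main_problem, max.level=1)  # overview of the list elements
length(main_problem$x)          # sample size
dim(main_problem$Y)             # rows = response variables, columns = observations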

The Functions

The following functions facilitate the creation and saving of problem files for the MCP pipeline.

generate_main_problem()

This function takes the re-formatted data from the previous step (using reformat_mcp_data()), extracts pertinent information, and structures it in the list format expected by the MCP algorithm.

main_problem <- generate_main_problem(data_file=new_data, 
                                      var_info=var_info)
# View(main_problem)  # use this code to view the contents of main_problem

generate_cv_problems()

In the MCP pipeline, univariate and multivariate model selection can currently be conducted in two ways: 1) using cross-validation (cv) or 2) using Akaike’s Information Criterion (AIC). It is up to the user to decide which model selection method is desirable for them.

This function uses the main problem file to generate cross-validation training and test folds for model selection. The number of folds is dictated by the argument K=. Each of the secondary problem files generated will have the same structure as the main problem, but will differ in the subsets of x and Y. To ensure reproducibility of the subsets of x and Y within each fold, the argument “seed=” can be used.

For this demonstration, we choose to use cross-validation as our method for model selection. Those who want to use AIC can skip this step. We generate 4 cross-validation folds and set the seed as 234227327, which was randomly generated using random.org.
cv_problems <- generate_cv_problems(main_problem=main_problem,
                                    K=4,
                                    seed=234227327)

save_problem()

This function saves problem files. Using this function is the first instance of two very important arguments that will be common throughout the functions in the MCP pipeline / vignette and are chosen by the user:

  • data_dir - The data directory (i.e., folder) where all files generated by the MCP algorithm should be stored
  • analysis_name - The unique identifier for the current analysis, that distinguishes the files from previous or future MCP algorithm uses

To initialize these objects:

data_dir <- "results"
analysis_name <- "US"

If you are using aic, you will only need to run save_problem() once:

# Save main problem
save_problem(data_dir=data_dir,
             analysis_name=analysis_name,
             problem=main_problem,
             is_folds=F)

Note how this function does not require defining the output as an object using <-.

If you are using cv, you should run save_problem() twice:

# Save main problem
save_problem(data_dir=data_dir,
             analysis_name=analysis_name,
             problem=main_problem,
             is_folds=F)

# Save cv problems
save_problem(data_dir=data_dir,
             analysis_name=analysis_name,
             problem=cv_problems,
             is_folds=T)  # NOTE: is_folds=T here, because we have folds!

Once we have generated all problem files and they are saved in our data_dir, we can move onto Step 3: Optimizing Univariate Models

3. Optimizing Univariate Models

The MCP algorithm has the ability to fit univariate models to data that are non-linear (shape) and have heteroskedastic (non-uniform) variance (spread). A benefit of the functions in yada is that the present demonstration does not assume a certain shape or spread of the data; instead, model selection decides the best model configuration (combination of shape and spread) for each variable in the data.

The mean and noise options for each univariate model are as follows:

Mean Options

  • lin_ord - linear (ordinal variables)
  • log_ord - logarithmic (ordinal variables)
  • pow_law_ord - power law (ordinal variables)
  • pow_law - power law (continuous variables)

Noise Options

  • const - constant noise (homoskedastic)
  • lin_pos_int - linear with a positive intercept (heteroskedastic)

The Functions

The functions in Step 3 have more options than previous steps. Each function is individually introduced below, some with more complicated chunks of code. This step is where you are first introduced to code for parallel processing.

build_univariate_*_problems()

There are two nearly-identical functions called build_univariate_ord_problems() and build_univariate_cont_problems(). These functions do exactly what they say, which is build the possible model combinations for all ordinal and continuous variables defined in the main problem file located in the data_dir. The result is a list of problem files to generate all possible model combinations for all univariate models.

Noteworthy arguments for these functions include:

  • mean_specs - the mean model options
  • noise_specs - the noise model options
  • add_folds - a T/F statement on whether to build univariate models using cross-validation folds
We choose to use the default values for `mean_specs` and `noise_specs`, which are to generate all possible combinations available in `yada` for ordinal and continuous univariate models. You do not need to use all of the available options. In addition, we use `add_folds=T` to create univariate fold models for model selection using cross-validation.
mean_specs_ord <- c(rep("pow_law_ord",2),rep("log_ord",2),rep("lin_ord",2))
noise_specs_ord <- rep(c("const","lin_pos_int"),3)

ord_prob_list <- build_univariate_ord_problems(data_dir=data_dir,
                                               analysis_name=analysis_name,
                                               mean_specs=mean_specs_ord,
                                               noise_specs=noise_specs_ord,
                                               add_folds=T)

mean_specs_cont <- c(rep("pow_law",2))
noise_specs_cont <- c("const","lin_pos_int")
cont_prob_list <- build_univariate_cont_problems(data_dir=data_dir,
                                                 analysis_name=analysis_name,
                                                 mean_specs=mean_specs_cont,
                                                 noise_specs=noise_specs_cont,
                                                 add_folds=T)

To calculate the number of ordinal and continuous models being built by these functions, you can use the following equation:

expected = (num_folds+1) * [6(num_ord_variables) + 2(num_cont_variables)]

Given the above equation, we expect: 
(4+1) * [6(4) + 2(2)] = 140 models built using both functions.
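As a sketch, the same arithmetic can be done in R (the object names here are ours, for illustration only):

num_folds <- 4  # cross-validation folds generated in Step 2
num_ord   <- 4  # ordinal variables: max_M1, man_I2, HME_EF, TC_Oss
num_cont  <- 2  # continuous variables: FDL, RDL
(num_folds + 1) * (6*num_ord + 2*num_cont)  # expected number of models: 140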

To check:

length(ord_prob_list) + length(cont_prob_list)
## [1] 140

Nice. Now that all of our desired univariate problems are created, we can optimize each model individually.

solve_*_problem()

There are again two near-identical functions called solve_ord_problem() and solve_cont_problem(). These functions “solve” (i.e., optimize) the univariate problems they are given. In other words, each optimizes the univariate model in question using the data provided by the main and/or secondary (if desired) problems.

Noteworthy arguments for these functions include:

  • *_prob - the univariate model problem to optimize
  • anneal_seed - an optional seed number for ordinal problem reproducibility; it is not an argument for continuous problems

Because there are many, many ordinal and continuous univariate models to optimize, we solve the problems in parallel using foreach and %dopar%. Essentially, this process takes each problem, i, and optimizes the model using one of the cores “registered” using registerDoParallel. This allows for multiple univariate models to be optimized at the same time, as the process is undertaken by different cores.

Depending on how many univariate problems need to be optimized, the following process may take a while, certainly long enough to take a break, eat some snacks, or enjoy a coffee at a cafe! If you would like to monitor the optimization process, check your data_dir and see how many files starting with solutiony_ are there. At the end of the entire process, you should find the same number of files, which you calculated above, in your data_dir.

We set a `base_seed` and generate a `seed_vect` below for further reproducibility of our univariate models. The `seed_vect` is fed into the argument `anneal_seed` for univariate ordinal problems.
# Set up for reproducibility
base_seed <- 264528  # number generated from random.org
set.seed(base_seed)  # set the seed
seed_vect <- sample.int(1000000, 
                        length(ord_prob_list), 
                        replace=F)  # vector of sampled numbers

# Solve the ordinal problems in parallel and save to data_dir
ord_success <-
  foreach::foreach(i=1:length(ord_prob_list), .combine=cbind) %dopar% {
    yada::solve_ord_problem(data_dir=data_dir,
                            analysis_name=analysis_name,
                            ord_prob=ord_prob_list[[i]],
                            anneal_seed=seed_vect[i])
  }

# Solve the continuous problems in parallel and save to data_dir
cont_success <-
  foreach::foreach(i=1:length(cont_prob_list), .combine=cbind) %dopar% {
    yada::solve_cont_problem(data_dir=data_dir, 
                             analysis_name=analysis_name, 
                             cont_prob=cont_prob_list[[i]])
  }

Check that your data_dir has all of the expected solutiony_ files. Some bonus code to do so in R:

uni_files <- list.files(data_dir,"solutiony_")
length(uni_files)
## [1] 140

Congratulations on making it this far! Proceed to Step 4: Univariate Model Selection

4. Univariate Model Selection

Model selection is an important step in the MCP algorithm pipeline because it allows the algorithm and data to inform what the best model combinations are, out of the different options created in Step 3, instead of the decision being made by the user/researcher.

The functions in yada currently support two different methods of model selection:

  • k-fold cross-validation
  • Akaike’s Information Criterion (AIC)
The authors highly suggest the user do additional research into what each of these methods entails to make the best-informed decision when using the MCP algorithm to optimize univariate and multivariate models. Moving forward, both model selection methods will be demonstrated.

The Functions

The functions in Step 4 are specific to the univariate model selection (“evaluation”) process. In general, there is one large function, evaluate_univariate_models, that calculates all of the metrics and formats all of the data in a structure expected by subsequent MCP algorithm steps. The secondary functions are all for user/researcher interpretation.

build_model_vec()

This function combines an input of mean and noise specifications into a single vector with all possible univariate models.

Noteworthy arguments include:

  • mean_models - A vector of all mean model specifications used to generate univariate models
  • noise_models - A vector of all noise model specifications used to generate univariate models
ord_models <- build_model_vec(mean_models=unique(mean_specs_ord),
                              noise_models=unique(noise_specs_ord))

cont_models <- build_model_vec(mean_models=unique(mean_specs_cont),
                               noise_models=unique(noise_specs_cont))
Given that we have generated each possible combination of mean and noise model specifications currently generated by the functions `build_univariate_ord_problems` and `build_univariate_cont_problems`, we expect `ord_models` to include 6 mean-noise combinations and `cont_models` to include 2 mean-noise combinations, totaling 8 mean-noise combinations.
length(ord_models) + length(cont_models)
## [1] 8

evaluate_univariate_models()

This function ranks the univariate models (i.e., files starting with solutiony_ in the data_dir). Broadly, the ranking system is dictated by either out-of-sample negative log-likelihood k-fold cross-validation (“cv”) or Akaike’s Information Criterion (“aic”). For a more detailed explanation of how univariate models are ranked, by variable, refer to the help file:

?evaluate_univariate_models

Noteworthy arguments for this function include:

  • eval_type - Model selection technique. Options: “cv” or “aic”, default is “aic”
  • cand_tol - The candidate model tolerance, which is the value at which univariate models are considered “equally good” if their respective model evaluation values lie within the cand_tol value of each other. Model simplicity is then used as a “tie-breaker” for ranking
  • scale_exp_min - The minimum acceptable value of the scaling exponent (only applicable to log and pow_law models)
  • beta2_max - The maximum acceptable value of beta2, the heteroskedastic noise parameter (i.e., lin_pos_int)
For this demonstration: 
`cand_tol=0.05`, `scale_exp_min=0.01`, and `beta2_max=5`.
These values were chosen for our specific analyses, but we recommend them as a fair starting point if the user is unsure of other values to use. As stated above, this demonstration chooses k-fold cross-validation for model selection, therefore `eval_type="cv"`.
eval_data <- evaluate_univariate_models(data_dir=data_dir,
                                        analysis_name=analysis_name,
                                        eval_type="cv",
                                        ord_models=ord_models,
                                        cont_models=cont_models,
                                        cand_tol=0.05,
                                        scale_exp_min=0.01,
                                        beta2_max=5,
                                        save_file=T)
Warning messages may show up in your console at this step; these warnings simply mean that certain mean-noise model combinations are not producing "rank-able" results. It is okay to proceed, and this does not affect multivariate model construction.
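To peek at what evaluate_univariate_models() returned, base R works; the mod_select_ord and mod_select_cont element names below are the same ones used in the report-writing loops that follow:

names(eval_data)                   # elements of the evaluation output
length(eval_data$mod_select_ord)   # one entry per ordinal variable
length(eval_data$mod_select_cont)  # one entry per continuous variable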

write_*_report()

There are two near-identical functions: write_ordinal_report() and write_continuous_report(). These functions use the evaluation data generated by evaluate_univariate_models() to generate both .HTML and .Rmd files in the data_dir that “report” the model selection results for each univariate variable, by variable type. These functions are not necessary within the MCP pipeline, but are useful for user/researcher interpretation.

Noteworthy arguments for these functions are:

  • eval_type - Model selection technique. Make sure that this argument matches what was used for evaluate_univariate_models() above
  • j or k - The number assigned to the ordinal (j) or continuous (k) variable in question. An easy way to find the associated number is to look in your data_dir for the solutiony_ files, in which either j or k will be followed by a number, followed by the variable name
  • line_width - the maximum width of each line in the report. This is purely for formatting the resulting .HTML file, and does not need to be altered, unless desired
We bundle both functions inside of for-loops to streamline the process instead of writing individual lines of code for each of our six variables. This is not required, but highly recommended for instances where there are many univariate models to go through.
for(j in 1:length(eval_data$mod_select_ord)) {
  write_ordinal_report(data_dir=data_dir,
                       analysis_name=analysis_name,
                       j=j, line_width=200)
}

for(k in 1:length(eval_data$mod_select_cont)) {
  write_continuous_report(data_dir=data_dir,
                          analysis_name=analysis_name,
                          k=k, line_width=200)
}

get_best_univariate_params()

This function takes the output file created by evaluate_univariate_models() and returns a data frame with all of the variables, the “best” mean and noise specifications, and the parameter names and values associated with that model. This function is also not required for the MCP pipeline, but is useful for reporting univariate models and their specifications in tables and analysis reports.

Noteworthy arguments include:

  • eval_type - Model selection technique. Again, make sure that this argument is the same as what was chosen for evaluate_univariate_models()
best_params <- get_best_univariate_params(data_dir=data_dir,
                                          analysis_name=analysis_name,
                                          save_file=T)
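To take a quick look at the selected model and parameter values for each variable:

head(best_params)  # preview the “best” mean/noise specification and parameters per variable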

Once the univariate model selection step is complete, we can proceed to Step 5: Optimizing Multivariate Models

5. Optimizing Multivariate Models

The MCP algorithm uses the optimized univariate model parameters (from Step 4) as a “jumping-off” point for multivariate model optimization. Multivariate model optimization is achieved using maximum likelihood estimation. What is unique about the MCP algorithm is that it can generate multivariate models that:

  • Accept mixed (ordinal and continuous) variables as input
  • Assume either conditional independence (“cindep”) or conditional dependence (“cdep”) among the input variables
  • Can handle missing data

The Functions

There are two main functions in yada that generate multivariate models. Each function is explained below, in great detail.

build_cindep_model()

This function takes the selected univariate models and strings them together into a multivariate model that assumes conditional independence among the input variables. Importantly, conditional independence means that, given the output variable x, the input variables have no relationship with each other: a change in one variable does not affect change in another. It is important to know whether your data are fundamentally independent or dependent.

Noteworthy arguments for this function are:

  • fold - Which fold is being used to optimize the model. If fold=NA, the main conditionally independent model is optimized
  • calc_se - A logical (T/F) argument on whether to calculate the standard error of the unconstrained parameter vector. calc_se=T is required to optimize a conditionally dependent multivariate model
  • allow_corner - A logical (T/F) argument on whether the algorithm should allow corner solutions. If allow_corner=T, an attempt is made to use pertinent median values from other variables to set the scale.
  • remove_var - A logical (T/F) argument on whether input variables that yielded no best univariate model should be removed. If remove_var=T, those variables are removed from the model and recorded. If remove_var=F, the conditionally independent model is not created and an error is returned
Thus far, we have been using cross-validation for model selection. For the multivariate models, we demonstrate AIC for model selection between the conditionally independent and conditionally dependent models. Of note, `fold=NA` because we do not need to generate model folds for AIC.
cindep_model <- build_cindep_model(data_dir=data_dir,
                                   analysis_name=analysis_name,
                                   fold=NA,
                                   calc_se=T,
                                   allow_corner=T,
                                   remove_var=T,
                                   save_file=T)
Warning messages may arise at this step, such as "Univariate solution may be a corner solution or may not be an optimum for...". Do not panic. These are just warning messages and do not mean that the conditionally independent model failed to optimize. Additionally, these warnings will not hinder the optimization of the conditionally dependent model.

A Brief Intermission…

In order to transition from a conditionally independent (“cindep”) model to a conditionally dependent (“cdep”) model, adjustments must be made to the model specification (mod_spec), which is housed in the problem files. For conditionally dependent multivariate models, we can outline any baseline assumptions about conditional dependence between input variables within the mod_spec.

First, extract the mod_spec from the main conditionally independent multivariate model optimized above and set it as the conditionally dependent mod_spec:

cdep_mod_spec <- cindep_model$mod_spec

Next, define your assumptions for among-variable dependence. The MCP algorithm can calculate the among-variable dependence for you, although this method takes the longest amount of time:

J <- cdep_mod_spec$J  # number of ordinal variables
K <- cdep_mod_spec$K  # number of continuous variables

cdep_mod_spec$cdep_groups <- 1:(J+K)  # all variables are their own "group"
To speed up optimization time, we chose to add a baseline assumption that all variables of the same type are more correlated within their type than between types. Therefore, we use the following code to define four conditionally dependent groups consisting of Epiphyseal Fusion ("EF"), Ossification ("Oss"), Dental ("max/man"), and Long Bones. This code can be manipulated based on the user/researcher's knowledge of their data.
# THE FOLLOWING CODE IS DEMONSTRATION SPECIFIC, WHERE WE ARE DEFINING FOUR (4)
# GROUPS OF VARIABLES THAT WE ASSUME HAVE GREATER AMONG-GROUP CORRELATION. 
J <- cdep_mod_spec$J  # number of ordinal variables
K <- cdep_mod_spec$K  # number of continuous variables

ind_EF <- which(endsWith(main_problem$var_names,"EF"))  # EF variables
ind_Oss <- which(endsWith(main_problem$var_names,"Oss"))  # Oss variables
ind_Dent <- which(startsWith(main_problem$var_names,"ma"))  # dental variables
ind_LB <- J + (1:K)  # long bones

cdep_groups <- rep(NA,J+K)  # initialize empty vector for all variables
cdep_groups[ind_EF] <- 1  # all EF variables are in group 1
cdep_groups[ind_Oss] <- 2  # all Oss variables are in group 2
cdep_groups[ind_Dent] <- 3  # all dental variables are in group 3
cdep_groups[ind_LB] <- 4  # all long bones are in group 4

cdep_mod_spec$cdep_groups <- cdep_groups  # add cdep_groups to mod_spec

Finally, define the model as conditionally dependent:

cdep_mod_spec$cdep_spec <- "dep"  

build_file_path()

build_file_path() is a bonus function. Behind the scenes, this function from yada has been used to standardize file names throughout the MCP algorithm pipeline that are saved in your data_dir. This step will be the first instance in which you explicitly call on build_file_path(), which does exactly what the function name indicates: it builds a file path for saving files.

Noteworthy arguments for this function include:

  • file_type - The type of file that you are saving. Current options for this argument are: “main_problem”, “training_problem”, “univariate_ord_soln”, “ordinal_ci”, “univariate_ord_rmd”, “univariate_cont_soln”, “univariate_cont_rmd”, “solutionx”, “cv”, “aic”, “mcp_inputs”, “cindep_model”, “hjk_progress”, “cdep_model”
  • fold - The fold number - some of the file_types can take a fold number as input, to add to the file path

Four types of file paths are generated specifically for the conditionally dependent multivariate model optimization:

main_problem_file_ex <- build_file_path(data_dir=data_dir,
                                        analysis_name=analysis_name,
                                        file_type="main_problem",
                                        fold=NA)

fold_problem_file_ex <- build_file_path(data_dir=data_dir,
                                        analysis_name=analysis_name,
                                        file_type="training_problem",
                                        fold=1)

hjk_progress_file_ex <- build_file_path(data_dir=data_dir,
                                        analysis_name=analysis_name,
                                        file_type="hjk_progress",
                                        fold=NA)

cdep_save_file_ex <- build_file_path(data_dir=data_dir,
                                      analysis_name=analysis_name,
                                      file_type="cdep_model",
                                      fold=NA)
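As a quick check, printing each object with base R should reproduce the standardized paths shown below:

print(main_problem_file_ex)
print(fold_problem_file_ex)
print(hjk_progress_file_ex)
print(cdep_save_file_ex)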

The example file paths should look like:

## [1] "results/problem_US.rds"
## [1] "results/train_US_fold1.rds"
## [1] "results/hjk_progress_US.rds"
## [1] "results/cdep_model_US.rds"

fit_multivariate()

This function takes the optimized conditionally independent model (cindep_model) and optimizes a conditionally dependent model using the updated mod_spec object (cdep_mod_spec) created with the Intermission code. The function fit_multivariate() uses a Hooke-Jeeves (hjk) algorithm, starting from the conditionally independent fit, to calculate a maximum likelihood fit for a conditionally dependent, multivariate, mixed model.

Depending on the number of input variables supplied for the multivariate model, this step may take between a few hours and a few weeks (yikes!). Because the authors understand that it may not be possible to leave a local system running for weeks on end, the function is coded to allow for multiple “legs” of analysis that start where you left off, instead of starting from the beginning every time. A record of your last run is saved as the hjk_progress_file.

Noteworthy arguments of this function include:

  • mod_spec - This is the model specification for the conditionally dependent model, as created in the Intermission
  • cindep_model - The conditionally independent model that will be used as a baseline fit for the conditionally dependent model
  • prog_file - The file path for the optimization progress to be saved. In the event that optimization needs to be paused at any time, this file is used to start off where the optimization left off, instead of starting anew
  • save_file - The file path for the final conditionally dependent model file
  • hjk_control - A list that feeds into the embedded hjk function from the dfoptim package; see the dfoptim documentation for more information.
Here, we only calculate one conditionally dependent model, as our intentions are to use AIC for model selection down the line.
`hjk_control=list(info=T)` allows us to monitor the progress of the optimization in the console. The number of "steps" it takes to finish optimization depends on the number of variables and sample size.
# Initialize hjk progress and final model file paths
hjk_progress_file <- build_file_path(data_dir=data_dir,
                                     analysis_name=analysis_name,
                                     file_type="hjk_progress",
                                     fold=NA)
cdep_save_file <- build_file_path(data_dir=data_dir,
                                  analysis_name=analysis_name,
                                  file_type="cdep_model",
                                  fold=NA)

# Optimize the conditionally dependent model
cdep_model <- fit_multivariate(x=main_problem$x,
                               Y=main_problem$Y,
                               mod_spec=cdep_mod_spec,
                               cindep_model=cindep_model,
                               prog_file=hjk_progress_file,
                               save_file=cdep_save_file,
                               hjk_control=list(info=T))

Once all of our multivariate models are constructed, we move onto Step 6: Multivariate Model Selection.

6. Multivariate Model Selection

As with univariate model selection, we rely on a model selection method to determine whether the assumption of conditional independence or conditional dependence is appropriate for our data.

The Function

The MCP algorithm again allows for model selection via cross-validation (cv) or Akaike’s Information Criterion (aic), which is conducted using a single function.

evaluate_multivariate_models()

This function calculates either cv or aic to help inform the user/researcher on the appropriate structure of the multivariate model. evaluate_multivariate_models() returns a list with evaluation type (eval_type), model selection metric for the conditionally independent model(s) (eval_cindep), and model selection metric for the conditionally dependent model(s) (eval_cdep).

Noteworthy arguments in this function include:

  • eval_type - The model selection method used. As before, eval_type="cv" requires conditionally independent and conditionally dependent model folds to be present in the data_dir

The general rule for both cross-validation and AIC is that the model with the lowest model selection metric is selected.

evaluate_multivariate_models(data_dir=data_dir,
                             analysis_name=analysis_name,
                             eval_type="aic")
## $eval_type
## [1] "aic"
## 
## $eval_cindep
## [1] 11629.8
## 
## $eval_cdep
## [1] 11034.08
Given the above results, our model selection indicates that conditional dependence is the appropriate assumption for our current data, and that the conditionally dependent model should be used.

This concludes the end of our “model construction” portion of the demonstration. Next is Step 7: Getting Results.

7. Getting Results

Probability distributions are a central feature of a Bayesian framework. Instead of a single point estimate and a generally symmetric confidence interval, models optimized using the MCP algorithm result in a distribution of probabilities over all possible values of x, the continuous output. This probability distribution can then be used to extract a single point estimate or range (i.e., “credible interval”), if desired. This type of Bayesian output allows for probabilistic statements on the output x.

Bayes’ Theorem is generalized as:

\(\tt{Posterior\:Distribution} = \frac{\tt{Likelihood} \times \tt{Prior}}{\tt{Probability\:of\:Input}}\)

Thus, both the \(\tt{Likelihood}\) and the \(\tt{Prior}\) have influence on the \(\tt{Posterior\:Distribution}\).
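As a toy illustration only (this is not yada code, and every parameter value below is made up), the discrete version of this computation on a grid of candidate x values looks like:

x_grid     <- seq(0, 23, by=0.01)                           # candidate output values
prior      <- dweibull(x_grid, shape=1.5, scale=8)          # assumed prior shape
likelihood <- dnorm(150, mean=60 + 25*sqrt(x_grid), sd=10)  # assumed toy mean/noise model
posterior  <- prior * likelihood                            # numerator of Bayes' Theorem
posterior  <- posterior / sum(posterior * 0.01)             # normalize to integrate to 1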

The Functions

In this lengthy discussion, we introduce the functions of yada that are used after a model is generated using the MCP algorithm. Within this section, we discuss:

  • Calculating a prior distribution
  • Calculating a posterior distribution
  • Calculating a point estimate and credible interval from a posterior distribution

calc_x_prior()

This function generates a prior distribution on the output variable, x, which is the \(\tt{Prior}\) portion of Bayes’ Theorem and is required for calculating a posterior distribution of our output. calc_x_prior() uses the known distribution of x, provided by a problem file, to apply an offset Weibull mixture model, which assumes higher probabilities of x at the lower and upper bounds of all provided values of x. A Weibull distribution is informative for situations where data follow the mortality curve, commonly observed in medical examiners’ offices and subadult skeletal collections.

Noteworthy arguments for this function include:

  • prior_type - Currently, only a Weibull (prior_type="weibull") parameterization can be applied to our prior distribution on x. We are currently working on adding more distribution shapes
  • offset - The amount by which the prior distribution should be offset from the known output, x. This is a required argument, as we want our prior distribution to be representative but not identical to our knowledge of our output
Because our data are related to subadult age and use subadult skeletal data, we choose to apply a Weibull distribution to our x prior.
th_x <- calc_x_prior(data_dir=data_dir,
                     analysis_name=analysis_name,
                     prior_type="weibull",
                     offset=0.002,
                     seed=895131,
                     fold=NA,
                     save_file=T)
## number of iterations = 2000

calc_x_prior() results in a single file in the data_dir starting with solutionx_.
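To confirm the file was written:

list.files(data_dir, "solutionx_")  # should return a single solutionx_ file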

load_best_univariate_model()

This function loads the “best” (i.e., selected from either cross-validation or aic) univariate model for the variable given.

Noteworthy arguments for this function include:

  • var_name - The variable name of the model to be loaded
  • fold - The fold number to be loaded. Default is fold=NA, which loads the main univariate model
  • load_data - A logical for whether the data used to train the model should be included. Default is load_data=F
# Import selected univariate model for FDL
fdl_model <- load_best_univariate_model(data_dir=data_dir,
                                        analysis_name=analysis_name,
                                        var_name="FDL",
                                        fold=NA,
                                        load_data=F)

calc_x_posterior()

This function uses the prior distribution, calculates the likelihood, and then calculates the posterior probability distribution for the output, x.

Noteworthy arguments for this function include:

  • y - The input data, either a single value for univariate models or a vector of values for multivariate models
  • th_x - The parameterization of the output value, calculated using calc_x_prior()
  • model - The model used to calculate the posterior distribution
  • xcalc - A vector of evenly-spaced output values for which the posterior distribution should be calculated. If length(xcalc)==0, it is calculated for you and a seed must be provided for replication. Default is xcalc=c()
  • normalize - A logical for whether the posterior distribution should be normalized to integrate to 1. Default is normalize=T
  • seed - A seed value for result replication. If length(xcalc)==0, a seed must be provided. Default is seed=NA

The result of calc_x_posterior() is a list with the following objects:

  • x - The values of the output, x, for which a posterior probability density was calculated. This vector should be identical to xcalc, if it was provided
  • density - The density value for the corresponding value of x
To demonstrate `calc_x_posterior()` we will use observation n=165 and provide both a univariate and a multivariate example. NOTE: `calc_x_posterior()` only works on a single individual (aka one row of your data frame) at a time. If you would like to produce posterior distributions (aka estimates) for multiple individuals, please skip to Step 8: Reporting Results.
n <- 165

new_data[n,]  # View observation from the reformatted data
##     SVAD_identifier SEX     agey    FDL   RDL max_M1 man_I2 HME_EF TC_Oss
## 165         US_0409   F 2.369863 157.81 89.86      6      4      0      3
There are two ways of extracting the input variables to use in `calc_x_posterior()`: either from the reformatted data, or from the problem file. We demonstrate both ways below, first using the reformatted data, `new_data`, to extract a single variable (`FDL`) for a univariate model calculation, and then using the `main_problem` to extract multiple variables for a multivariate model calculation.
We choose to define `xcalc=seq(0,23,by=0.01)`, given that our known output data range from 0 to 23.
# Extract a single variable value, FDL, from the reformatted data, where
# observations are represented as rows, not columns
y <- new_data[n, "FDL"]

fdl_post <- calc_x_posterior(y=y,
                             th_x=th_x,
                             model=fdl_model,
                             xcalc=seq(0,23,by=0.01),
                             normalize=T,
                             seed=NA)
Calculating the posterior distribution using a multivariate model will inherently take a bit longer than using a univariate model. Depending on the number of variables included in the multivariate model, this calculation may take anywhere from a few seconds (6 variables) to an hour or more (62 variables). Please exercise patience.
# Extract the vector of input values from the Y matrix, where observations
# are represented by columns, not rows
y <- main_problem$Y[,n] 

# Load cdep_model if you did not optimize during the same session
cdep_model <- readRDS(build_file_path(data_dir=data_dir,
                                      analysis_name=analysis_name,
                                      file_type="cdep_model"))

# Calculate the posterior probability distribution
cdep_post <- calc_x_posterior(y=y,
                              th_x=th_x,
                              model=cdep_model,
                              xcalc=seq(0,23,by=0.01),
                              normalize=T,
                              seed=NA)
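For a quick visual check of the posterior (plain base R; not part of the yada pipeline), you can plot the density over x and mark the known age of observation n:

plot(cdep_post$x, cdep_post$density, type="l",
     xlab="x (age in years)", ylab="Posterior density")
abline(v=new_data[n, "agey"], lty=2)  # known age for observation n, for reference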

At the end of this step, you should be able to calculate posterior distributions for your output variable, x, for a single observation using either a univariate or multivariate model. Next, Step 8: Reporting Results

8. Reporting Results

This section is the fun stuff. Here, we demonstrate how to extract meaning from our posterior probability distributions, such as calculating measures of central tendency (e.g., mean, mode, median) and credible intervals (which is the appropriate term when using Bayesian analyses instead of the frequentist “confidence interval”), compiling results into tables, and visualization.

The Functions

The functions introduced in this step, as stated above, all have to do with manipulating our posterior probability distributions.

analyze_x_posterior()

This function takes the output of calc_x_posterior() and returns a list containing the output value and density for the posterior mean (xmean), posterior median (xmed), posterior mode (xmode), 95% credible interval (xlo-xhi), and 99% credible interval (xlolo-xhihi).

Noteworthy arguments for this function include:

  • xv - The vector of output values for which the posterior was calculated. This vector is most likely the $x from the result of calc_x_posterior()
  • fv - The vector of densities calculated for the posterior. This vector is most likely the $density from the result of calc_x_posterior()
  • ci_type - Whether the credible interval should be calculated based on quantiles (“quantiles”) or highest posterior density (“hdi”). Default is ci_type="hdi"
  • xknown - An optional argument; if provided, the posterior density at the specific value of xknown (often the known output) is also calculated
We again provide examples for using `analyze_x_posterior()` for both univariate and multivariate contexts. We also use the default credible interval type, where `ci_type="hdi"`. Additionally, we demonstrate the differences in output when `xknown` is provided.
fdl_analyze <- analyze_x_posterior(xv=fdl_post$x,
                                   fv=fdl_post$density,
                                   ci_type="hdi",
                                   xknown=new_data[n,"agey"])
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
Next, we use the conditionally dependent multivariate model, with `ci_type="quantiles"` and `xknown=NA`.
cdep_analyze <- analyze_x_posterior(xv=cdep_post$x,
                                    fv=cdep_post$density,
                                    ci_type="quantiles",
                                    xknown=NA)
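To pull numbers out of these objects, the element names described above (e.g., xmean, xlo, xhi) can be accessed directly; if your version of yada stores them differently, str() will show the actual structure:

str(fdl_analyze, max.level=1)        # overview of everything returned
fdl_analyze$xmean                    # posterior mean
c(fdl_analyze$xlo, fdl_analyze$xhi)  # 95% credible interval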

We will be keeping these two *_analyze objects for use later when we visualize the results.

Another Brief Intermission…

In order to introduce the next two functions, we must generate an example test sample. Ideally, the test sample is composed of completely new data.

As mentioned above, a test sample is ideally a set of completely new data, but here we will use a subset of our original data as an example.
og_data <- read.csv("data/SVAD_US.csv")  # load original ("og") data

set.seed(2022)  # set seed for replication
test_idx <- sample(1:nrow(og_data), size=20, replace=F)  # sample 20 data rows
sub_data <- og_data[test_idx,]  # extract the sampled rows
write.csv(sub_data, "data/test_sample.csv", row.names=F)  # save data

*_batch_calc()

There are two nearly-identical functions in yada: univariate_batch_calc() and multivariate_batch_calc(). These two functions both use calc_x_posterior() and analyze_x_posterior() together when given a data frame with multiple observations for which we would like to calculate measures of central tendency and credible intervals. Depending on how large your test sample is and how many variables you are calculating for, this process may take between a few minutes and a few days! Make your arrangements accordingly.

IMPORTANT: Before using a test sample to calculate posterior probability distributions using the functions below, make sure that the data are reformatted using the var_info file and reformat_mcp_data().

test_samp <- reformat_mcp_data("data/test_sample.csv", var_info)

head(test_samp)  # view test sample
##   SVAD_identifier SEX      agey    FDL   RDL max_M1 man_I2 HME_EF TC_Oss
## 1         US_0281   M  2.926027 142.06 87.88      6      4      0      3
## 2         US_0100   F  1.478439 145.60 88.25      4      3     NA     NA
## 3         US_0850   F 18.315068     NA    NA     11     11     NA     NA
## 4         US_0194   F 15.638604     NA    NA     11     11     NA     NA
## 5         US_1154   M 18.098630     NA    NA     11     11     NA     NA
## 6         US_0497   M 16.687671     NA    NA     11     11      6      5

Noteworthy arguments for these functions include:

  • test_samp - The reformatted test sample to use for posterior predictions
  • demo_cols - Column numbers for demographic columns to be kept in the final prediction table
  • input_cols - Column numbers for the input variables
  • ci_type - Whether the credible interval should be calculated using “hdi” or “quantiles”

As univariate_batch_calc() runs, you will be updated with which input variable and observation is being processed.

univariate_batch_calc(data_dir=data_dir,
                      analysis_name=analysis_name,
                      test_samp=test_samp, 
                      demo_cols=1:3,
                      input_cols=4:ncol(test_samp),
                      ci_type="hdi", th_x=th_x, 
                      xcalc=seq(0,23,by=0.01),
                      seed=NA, save_file=T)

After applying calc_x_posterior() and analyze_x_posterior() for each observation for each model, if applicable, the top rows of the output table are printed to monitor the function’s progress. An example of the output is provided below for multivariate_batch_calc().

Noteworthy arguments for this function include:

  • model_type - Whether the conditionally independent (“cindep”) or conditionally dependent (“cdep”) model should be used
multivariate_batch_calc(data_dir=data_dir,
                        analysis_name=analysis_name,
                        test_samp=test_samp,
                        demo_cols=1:3,
                        model_type="cdep", ci_type="hdi",
                        th_x=th_x, xcalc=seq(0,23,by=0.01),
                        seed=NA, save_file=T)
## [1] "Calculating age posterior for observation=1"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=2"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=3"
## [1] "Calculating age posterior for observation=4"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=5"
## [1] "Calculating age posterior for observation=6"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=7"
## [1] "Calculating age posterior for observation=8"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=9"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=10"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=11"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=12"
## [1] "Calculating age posterior for observation=13"
## [1] "Calculating age posterior for observation=14"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=15"
## [1] "Calculating age posterior for observation=16"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=17"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=18"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Calculating age posterior for observation=19"
## [1] "Calculating age posterior for observation=20"
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## [1] "Prediction Table for cdep_model:"
##   SVAD_identifier SEX      agey xmean  xmed xmode lower99 lower95 upper95
## 1         US_0281   M  2.926027  1.89  1.87  1.81    1.24    1.35    2.52
## 2         US_0100   F  1.478439  1.51  1.49  1.45    0.95    1.06    2.02
## 3         US_0850   F 18.315068 17.19 17.50 18.10    9.71   11.89   21.86
## 4         US_0194   F 15.638604 17.19 17.50 18.10    9.69   11.99   21.97
## 5         US_1154   M 18.098630 17.19 17.50 18.10    9.88   11.66   21.71
## 6         US_0497   M 16.687671 18.00 18.11 18.36   12.69   13.94   21.62
##   upper99
## 1    2.52
## 2    2.02
## 3   21.86
## 4   21.97
## 5   21.71
## 6   21.62
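
The xmean, xmed, and xmode columns above are point summaries of the age posterior that calc_x_posterior() evaluates over the xcalc grid. As a rough illustration only (this is not a yada function, and a toy density stands in for a real posterior), the arithmetic behind such grid-based summaries looks like this:

# Illustrative only: a toy density standing in for the posterior returned by
# calc_x_posterior() on the xcalc grid (ages 0-23 in steps of 0.01)
xcalc <- seq(0, 23, by=0.01)
dens  <- dnorm(xcalc, mean=2, sd=0.4)       # hypothetical posterior density
w     <- dens / sum(dens)                   # normalize to weights on the grid

xmean <- sum(xcalc * w)                     # posterior mean
xmed  <- xcalc[which(cumsum(w) >= 0.5)[1]]  # posterior median
xmode <- xcalc[which.max(dens)]             # x value with the highest density
c(xmean=xmean, xmed=xmed, xmode=xmode)
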
Occasionally you will get an error along the lines of: `e$fun(obj, substitute(ex), parent.frame(), e$data) : worker initialization failed`. This is a by-product of parallel processing, and re-starting your R session should take care of the error.
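
If you prefer to restart from code rather than through the menus, the snippet below is one option; it assumes you are working in RStudio and have the rstudioapi package installed (it is not part of yada).

# Optional convenience: programmatically restart the R session (RStudio only);
# equivalent to Session > Restart R and requires the rstudioapi package
if (requireNamespace("rstudioapi", quietly=TRUE) && rstudioapi::isAvailable()) {
  rstudioapi::restartSession()
}
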

generate_ord_ci()

This function calculates the 95% and 99% credible intervals for each stage of a given univariate ordinal model. Because each variable has a finite number of ordinal stages, the point estimate and credible interval are the same for any observation with the same ordinal stage.

Noteworthy arguments for this function include:

  • var_name - The variable name for the univariate model to be used for estimation
  • point_est - The type of point estimate to be reported. Options are: xmean, xmed, and xmode
  • ci_type - Whether the credible interval should be calculated using “hdi” or “quantiles”
  • input_seed - The base seed used to calculate a seed for each ordinal stage

Depending on how many stages the chosen ordinal variable has, this function may take a few minutes.

NOTE: generate_ord_ci() can only be used for univariate ordinal models, as continuous variables do not have distinct stages.

We demonstrate this function using the variable max_M1.
max_M1_df <- generate_ord_ci(data_dir=data_dir,
                             analysis_name=analysis_name,
                             var_name="max_M1",
                             th_x=th_x,
                             point_est="xmode", 
                             ci_type="hdi",
                             xcalc=seq(0,23,by=0.01), 
                             save_file=T)
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
## Identical densities found along different segments of the distribution,
##   choosing rightmost.
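
Because the estimates depend only on the ordinal stage, the returned data frame works as a simple lookup table for new cases. The minimal sketch below assumes the column names used in the plotting code in the next section (ord_stage, point_est, lower95, upper95) and uses stage 5 purely as an example value; check names(max_M1_df) if your columns differ.

# Hedged example: retrieve the point estimate and 95% interval for a new
# individual scored as (re-coded) stage 5 on max_M1
max_M1_df[max_M1_df$ord_stage == 5, c("point_est", "lower95", "upper95")]
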

Visualizations

The following code provides examples of the different types of visualizations that can be produced using the results generated from models optimized using the MCP algorithm.

Ordinal Posterior Distributions by Stage

Similar in concept to generate_ord_ci(), these two visualizations are meant to display the outputs returned by univariate ordinal models.

Point and “Errorbars”

Using the output from generate_ord_ci(), we can visualize the point estimate and credible interval for each stage of a univariate ordinal variable.

For this demonstration, we use the max_M1 data frame generated above with `generate_ord_ci()`.
ggplot(max_M1_df) + 
  geom_errorbar(aes(x=ord_stage, ymin=lower95,   # error bars by stage
                    ymax=upper95, color=factor(ord_stage))) +  # color by stage
  geom_point(aes(x=ord_stage, y=point_est, 
                 color=factor(ord_stage))) +  # point estimate by stage
  scale_x_continuous(breaks=c(0:11), labels=c(1:12)) +  # add ticks on x-axis at each stage
  labs(x="Ordinal Stage", y="Age [years]", title="Maxillary M1") +  # labels
  theme_bw() +  # black and white plot space
  theme(legend.position="none")  # remove legend

NOTE: The ordinal stages used by any model generated with the MCP pipeline automatically start at 0 and are re-coded as integers. These may not reflect your original scores, so adjustments can be made, for example using scale_x_continuous(labels=c("vector","of","stages")) as in the plot above.

Density Distributions

In order to visualize the posterior distributions of a univariate model by stage as densities, the following steps can be taken:
1. Load the selected (“best”) univariate model using load_best_univariate_model()
2. Extract the unique stages from lowest to highest using sort() and unique()
3. Initialize an empty data frame for storing the vectors of x, density, and stage from the results of each calc_x_posterior()
4. Use a for-loop to run through each unique stage for that variable

To demonstrate these steps, we use the Humerus Medial Epicondyle Epiphyseal Fusion Scores (`HME_EF`).
# 1. Load the univariate model for HME_EF
hme_model <- load_best_univariate_model(data_dir=data_dir,
                                        analysis_name=analysis_name,
                                        var_name="HME_EF",
                                        fold=NA, load_data=F)

# 2. Extract unique stages from lowest to highest
hme_stages <- sort(unique(new_data$HME_EF))

# 3. Initialize an empty data frame with column names: x, density, stage
hme_post_df <- data.frame(x=NA,density=NA,stage=NA)

# 4. Loop through each unique stage
for(stage in 1:length(hme_stages)) {
  hme_post <- calc_x_posterior(y=hme_stages[stage],
                               th_x=th_x,
                               model=hme_model,
                               xcalc=seq(0,23,by=0.01),
                               normalize=T)
  df0 <- data.frame(x=hme_post$x,
                    density=hme_post$density,
                    stage=rep(hme_stages[stage],length(hme_post$x)))
  
  hme_post_df <- rbind(hme_post_df,df0)
}

# Post-processing step
hme_post_df <- hme_post_df[-1,]  # remove first row of all NAs
head(hme_post_df)  # inspect top 6 rows
##      x  density stage
## 2 0.00 3.963317     0
## 3 0.01 1.992746     0
## 4 0.02 1.570963     0
## 5 0.03 1.352773     0
## 6 0.04 1.211704     0
## 7 0.05 1.110026     0

In the previous example using max_M1, we re-labeled the ordinal stages on the x-axis using scale_x_continuous(labels=c()). Below, we demonstrate how you can re-code the stages in the data instead, using the recode() function from the package dplyr, which was loaded with the tidyverse library.

hme_post_df$stage <- recode(hme_post_df$stage,
                            "0"="0",
                            "1"="1",
                            "2"="12",
                            "3"="2",
                            "4"="23",
                            "5"="3",
                            "6"="4")
unique(hme_post_df$stage)  # check that the recoding was done correctly
## [1] "0"  "1"  "12" "2"  "23" "3"  "4"

Now that the data are set up, we can plot each posterior density distribution, colored and separated (“faceted”) by stage:

ggplot(hme_post_df) + 
  geom_area(aes(x=x, y=density, fill=factor(stage)), 
            color="black", alpha=0.5) +   # fills space under line
  labs(x="Age [years]", y="Density", fill="Stage") +  # sets the labels
  theme_bw() +  # uses black and white theme
  theme(legend.position="none") +  # removes legend
  facet_wrap(~stage, scales="free_y")   # facets by stage, free y-axis

Point and Credible Intervals for Predictions

Just as we plotted the point estimate and credible interval for each ordinal stage, we can produce analogous plots for predictions, regardless of model type and input variables. These are helpful for visualizing the accuracy and precision of the model, where:

  • accuracy = whether the known value falls within the credible interval
  • precision = the distance between the point estimate and the known value

To demonstrate, we will use the multivariate model predictions calculated during the `multivariate_batch_calc()` example.
multi_pred <- read.csv("results/cdep_model_US_test_predictions.csv")

ggplot(multi_pred) + 
  geom_errorbar(aes(x=agey, ymin=lower95, ymax=upper95)) + 
  geom_point(aes(x=agey, y=xmode)) + 
  geom_abline(slope=1, intercept=0, color="red", linetype="dashed") + 
  labs(x="Known Age [years]", y="Predicted Age [years]") + 
  theme_bw()

This plot is a great way to assess model performance on test data, although it will start to become busy with larger sample sizes. We encourage users to explore other ways of visualizing the outputs of models generated by the MCP algorithm.
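
As a quick numeric complement to the plot, the minimal sketch below computes the two measures defined at the start of this section, reusing the multi_pred data frame read in above; it is plain base R, not a yada function.

# Accuracy: proportion of known ages that fall inside the 95% credible interval
mean(multi_pred$agey >= multi_pred$lower95 & multi_pred$agey <= multi_pred$upper95)

# Precision (as defined above): mean absolute distance between the point
# estimate (posterior mode) and the known age
mean(abs(multi_pred$xmode - multi_pred$agey))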

Contact Us!

If you have made it all the way to the end of this document, the authors commend and thank you for your time, patience, and curiosity about the MCP algorithm. We hope you have found this vignette informative and helpful.

If you have any questions about this document or the MCP algorithm and the yada package, or would like to provide any suggestions, please contact us at:

Funding Statement

This work was funded by the National Institute of Justice (Awards 2015-DN-BX-K409 and 2017-DN-BX-0144) and the National Science Foundation (BCS-1551913). Opinions expressed herein do not necessarily represent the official position or policies of the U.S. Department of Justice or the National Science Foundation.