# print link to github repo if anyif (file.exists("./.git/config")){ config <-readLines("./.git/config") url <-grep("url", config, value =TRUE) url <-gsub("\\turl = |.git$", "", url)cat("\nSource code and data found at [", url, "](", url, ")", sep ="") }
Source code and data found at
Overview
This document fits a phylogenetic Structural Equation Model (SEM) to investigate the drivers of vocal repertoire size in parrots. The model accounts for shared evolutionary history via a phylogenetic covariance matrix and handles missing trait data through multiple imputation. Three sub-models are linked: body mass predicts longevity and brain size, which in turn — alongside gregariousness, flock size, and coloration — predict repertoire size.
Load Packages
Code
# knitr is require for creating html/pdf/word reports# formatR is used for soft-wrapping code# install knitr package if not installedif (!requireNamespace("sketchy", quietly =TRUE)) {install.packages("sketchy")}packages <-c("tidyverse","reshape2","ape","phytools","mice","brms","brmsish","cmdstanr","ggplot2","patchwork","dagitty","ggdag","posterior","qs2","pbapply" )# install/ load packagessketchy::load_packages(packages = packages)options(knitr.kable.NA ='')print <-function(x, row.names =FALSE) { kb <-kable(x, row.names = row.names, digits =4, "html") kb <-kable_styling(kb,bootstrap_options =c("striped", "hover", "condensed", "responsive"))scroll_box(kb, width ="100%")}
0.1 Data Import and Visualisation
The main trait dataset contains species-level data on repertoire size, brain size, flock size, gregariousness, longevity, and body mass. A separate colour dataset (Carballo et al. 2020) provides chromatic and achromatic colour scores that are matched to the trait data by species name. Species with no repertoire data are excluded, as is one extreme outlier value for Melopsittacus undulatus that is almost an order of magnitude larger than the next largest value for that species.
Code
# main trait datasettraits <-read.csv("./data/raw/12222025ALLTRAITS.csv")# rename columns for claritytraits <- traits |>rename(phylo ="gs", repertoire ="sp", flock ="log.maxflock") |>mutate(species = phylo)traits <- traits |>select(phylo, species, repertoire, brain, flock, greg, longevity, mass)# remove species with no repertoire datatraits <- traits |>filter(!is.na(repertoire))# colour datasetparrot_col <-read.csv("./data/raw/carballo_2020_colour_avg.csv")# capitalise species names for matchingparrot_col$species <- stringr::str_to_title(parrot_col$scinam)# match colour scores to trait datatraits$cds_avg <- parrot_col$cds_avg[match(traits$phylo, parrot_col$species)]traits$ces_avg <- parrot_col$ces_avg[match(traits$phylo, parrot_col$species)]# remove the one outlier M. undulatus repertoire valuetraits <- traits |>filter(!(phylo =="Melopsittacus_undulatus"& repertoire ==1050))
A quick look at the raw distributions of all traits before any transformation:
Raw distributions of all trait variables prior to transformation.
0.2 Data Transformation
Continuous predictors (brain size, longevity, body mass, flock size, and colour scores) are log-transformed where appropriate and then standardised to mean = 0, SD = 1. Repertoire size is left on its raw count scale, as transforming count data prior to modelling is generally discouraged; instead, it will be modelled with a negative binomial distribution.
The phylogenetic tree is pruned to retain only species present in the trait dataset. A variance-covariance matrix is then computed from the pruned tree; this matrix is passed to brms to model the expected covariance among species due to shared evolutionary history (a Brownian motion model of trait evolution).
Code
tree <-read.tree("./data/raw/08112025_BrianSmith.treefile")# prune to species in trait datamatched_tips <-intersect(tree$tip.label, traits$phylo)tree <-drop.tip(tree, setdiff(tree$tip.label, matched_tips))# generate phylogenetic variance-covariance matrixA <- ape::vcv.phylo(tree)# trim trait data to match pruned treetraits_filtered <- traits[traits$phylo %in% matched_tips, ]
Distribution check after filtering to matched species:
Trait distributions after filtering to species present in the phylogeny.
0.4 Multiple Imputation
Missing trait values are handled using Multiple Imputation by Chained Equations (MICE), implemented via the mice package. We generate 100 imputed datasets, each with a different plausible set of values for the missing observations. The model is later run on all 100 datasets and results are combined, propagating imputation uncertainty into the final inference.
# 100 imputations, default mice algorithm (non-phylogenetic)imp <-mice(traits_filtered, m =100, seed =1859)# store as long-format data frame with .imp variable (1:100)imp_long <-complete(imp, action ="long")qs_save(imp_long, "./data/raw/imputed_data.qs")rm(imp)
0.5 Model Specification
The SEM consists of three linked sub-models:
Longevity: predicted by body mass and phylogeny
Brain size: predicted by body mass and phylogeny
Repertoire size: predicted by longevity, brain size, body mass, flock size, gregariousness, and chromatic colour score — plus phylogeny
All sub-models include a phylogenetic random effect via gr(phylo, cov = A) and a species-level random intercept to capture residual non-phylogenetic species variation. The set_rescor(FALSE) call means residual correlations among the three responses are not estimated (i.e., any correlation is assumed to be fully mediated by the explicit paths in the model).
Weakly informative priors are used throughout. Priors on the standardised predictors in the brain and longevity sub-models are more regularising (SD = 0.5) than those on repertoire predictors (SD = 1), reflecting the fact that repertoire is on its original count scale.
Code
priors <-c(prior(normal(0, 1), class ="b", resp ="repertoire"),prior(normal(0, 0.5), class ="b", resp ="brain"),prior(normal(0, 0.5), class ="b", resp ="longevity"))
0.6 Initial Model Fit (Single Imputation)
Before running the full 100-imputation pipeline, the model is first fit on a single imputed dataset to check for convergence, inspect posterior predictive fit, and confirm the model structure is sensible.
Visual inspection of the posterior predictive distribution against the observed data. The model-predicted distributions (blue lines) should closely follow the observed data distribution (black line).
Posterior predictive checks for the initial single-imputation model fit.
0.7 Full Pipeline: All 100 Imputed Datasets
The model is first compiled with zero chains (no sampling), producing a compiled Stan program. This compiled object is then updated via update() for each of the 100 imputed datasets in parallel, avoiding recompilation on every iteration.
Each imputed dataset gets its own set of 2 chains × 5000 iterations (1000 warmup). Running in parallel across 20 cores substantially reduces wall-clock time.
After fitting all 100 models, Rhat values are checked for each. Rhat > 1.05 is considered a convergence warning. The table below summarises how many parameters with Rhat > 1.05 each model has.
Check R hats across all models. The majority of models should have zero parameters with Rhat > 1.05, and the maximum Rhat values should be close to 1.0.
# proportion of models with any bad Rhatstable(df_rhats$n_bad_rhat >0)
FALSE TRUE
54 1
Code
# distribution of bad Rhat countssummary(df_rhats$n_bad_rhat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.7273 0.0000 40.0000
0.9 Combine Models and Final Inference
The 100 individual model fits are combined into a single brmsfit object using combine_models(). This pools the posterior draws across all imputed datasets, so that subsequent inference integrates over both sampling uncertainty and imputation uncertainty.
Fixed effect estimates with 95% credible intervals. Predictors whose credible intervals exclude zero are interpreted as having a meaningful effect on the response.
---title: Parrot vocal repertoire evolutionsubtitle: Statistical analysis SEM on imputed data author: <a href="https://marce10.github.io/">Marcelo Araya-Salas, PhD</a>date: "`r Sys.Date()`"toc: truetoc-depth: 4toc-location: leftnumber-sections: truehighlight-style: pygmentsformat: html: df-print: kable code-fold: true code-tools: true css: qmd.csseditor_options: chunk_output_type: console---```{r set root directory}#| eval: true#| echo: false# set working directory knitr::opts_knit$set(root.dir ="..")``````{r add link to github repo}#| eval: true#| output: asis# print link to github repo if anyif (file.exists("./.git/config")){ config <-readLines("./.git/config") url <-grep("url", config, value =TRUE) url <-gsub("\\turl = |.git$", "", url)cat("\nSource code and data found at [", url, "](", url, ")", sep ="") }``````{r setup, include=FALSE}knitr::opts_chunk$set(echo =TRUE,warning =FALSE,message =FALSE,fig.width =10,fig.height =6)```::: {.alert .alert-info}## Overview {.unnumbered .unlisted}This document fits a phylogenetic Structural Equation Model (SEM) to investigate the drivers of vocal repertoire size in parrots. The model accounts for shared evolutionary history via a phylogenetic covariance matrix and handles missing trait data through multiple imputation. Three sub-models are linked: body mass predicts longevity and brain size, which in turn — alongside gregariousness, flock size, and coloration — predict repertoire size.:::---## Load Packages {.unnumbered .unlisted}```{r load-packages}# knitr is require for creating html/pdf/word reports# formatR is used for soft-wrapping code# install knitr package if not installedif (!requireNamespace("sketchy", quietly =TRUE)) {install.packages("sketchy")}packages <-c("tidyverse","reshape2","ape","phytools","mice","brms","brmsish","cmdstanr","ggplot2","patchwork","dagitty","ggdag","posterior","qs2","pbapply" )# install/ load packagessketchy::load_packages(packages = packages)options(knitr.kable.NA ='')print <-function(x, row.names =FALSE) { kb <-kable(x, row.names = row.names, digits =4, "html") kb <-kable_styling(kb,bootstrap_options =c("striped", "hover", "condensed", "responsive"))scroll_box(kb, width ="100%")}```---## Data Import and VisualisationThe main trait dataset contains species-level data on repertoire size, brain size, flock size, gregariousness, longevity, and body mass. A separate colour dataset (Carballo et al. 2020) provides chromatic and achromatic colour scores that are matched to the trait data by species name. Species with no repertoire data are excluded, as is one extreme outlier value for *Melopsittacus undulatus* that is almost an order of magnitude larger than the next largest value for that species.```{r data-import}# main trait datasettraits <-read.csv("./data/raw/12222025ALLTRAITS.csv")# rename columns for claritytraits <- traits |>rename(phylo ="gs", repertoire ="sp", flock ="log.maxflock") |>mutate(species = phylo)traits <- traits |>select(phylo, species, repertoire, brain, flock, greg, longevity, mass)# remove species with no repertoire datatraits <- traits |>filter(!is.na(repertoire))# colour datasetparrot_col <-read.csv("./data/raw/carballo_2020_colour_avg.csv")# capitalise species names for matchingparrot_col$species <- stringr::str_to_title(parrot_col$scinam)# match colour scores to trait datatraits$cds_avg <- parrot_col$cds_avg[match(traits$phylo, parrot_col$species)]traits$ces_avg <- parrot_col$ces_avg[match(traits$phylo, parrot_col$species)]# remove the one outlier M. undulatus repertoire valuetraits <- traits |>filter(!(phylo =="Melopsittacus_undulatus"& repertoire ==1050))```A quick look at the raw distributions of all traits before any transformation:```{r raw-distributions, fig.cap="Raw distributions of all trait variables prior to transformation."}reshape2::melt(traits) |>ggplot(aes(x = value)) +facet_wrap(~variable, scales ="free") +geom_histogram() +theme_bw()```---## Data TransformationContinuous predictors (brain size, longevity, body mass, flock size, and colour scores) are log-transformed where appropriate and then standardised to mean = 0, SD = 1. Repertoire size is left on its raw count scale, as transforming count data prior to modelling is generally discouraged; instead, it will be modelled with a negative binomial distribution.```{r transform}traits <- traits |>mutate_at(c("brain", "longevity", "mass"), ~(log(.))) |>mutate_at(c("brain", "longevity", "mass", "flock", "cds_avg", "ces_avg"),~(scale(.) |>as.vector()))```---## Phylogeny: Import and MatchingThe phylogenetic tree is pruned to retain only species present in the trait dataset. A variance-covariance matrix is then computed from the pruned tree; this matrix is passed to `brms` to model the expected covariance among species due to shared evolutionary history (a Brownian motion model of trait evolution).```{r phylogeny}tree <-read.tree("./data/raw/08112025_BrianSmith.treefile")# prune to species in trait datamatched_tips <-intersect(tree$tip.label, traits$phylo)tree <-drop.tip(tree, setdiff(tree$tip.label, matched_tips))# generate phylogenetic variance-covariance matrixA <- ape::vcv.phylo(tree)# trim trait data to match pruned treetraits_filtered <- traits[traits$phylo %in% matched_tips, ]```Distribution check after filtering to matched species:```{r filtered-distributions, fig.cap="Trait distributions after filtering to species present in the phylogeny."}reshape2::melt(traits_filtered) |>ggplot(aes(x = value)) +facet_wrap(~variable, scales ="free") +geom_histogram() +theme_bw()```---## Multiple ImputationMissing trait values are handled using Multiple Imputation by Chained Equations (MICE), implemented via the `mice` package. We generate 100 imputed datasets, each with a different plausible set of values for the missing observations. The model is later run on all 100 datasets and results are combined, propagating imputation uncertainty into the final inference.```{r missingness, fig.cap="Missing data pattern across traits."}md.pattern(traits_filtered)``````{r imputation, eval=FALSE}# 100 imputations, default mice algorithm (non-phylogenetic)imp <-mice(traits_filtered, m =100, seed =1859)# store as long-format data frame with .imp variable (1:100)imp_long <-complete(imp, action ="long")qs_save(imp_long, "./data/raw/imputed_data.qs")rm(imp)``````{r load-imputed, echo=FALSE}# load pre-computed imputed dataimp_long <-qs_read("./data/raw/imputed_data.qs")```---## Model SpecificationThe SEM consists of three linked sub-models:- **Longevity**: predicted by body mass and phylogeny- **Brain size**: predicted by body mass and phylogeny - **Repertoire size**: predicted by longevity, brain size, body mass, flock size, gregariousness, and chromatic colour score — plus phylogenyAll sub-models include a phylogenetic random effect via `gr(phylo, cov = A)` and a species-level random intercept to capture residual non-phylogenetic species variation. The `set_rescor(FALSE)` call means residual correlations among the three responses are not estimated (i.e., any correlation is assumed to be fully mediated by the explicit paths in the model).```{r model-spec}# longevity sub-modellong_mod <-bf( longevity ~ mass + (1|gr(phylo, cov = A)) + (1| species))# brain size sub-modelbr_mod <-bf( brain ~ mass + (1|gr(phylo, cov = A)) + (1| species))# repertoire sub-model (negative binomial for count data)rep_mod <-bf( repertoire ~ longevity + brain + mass + flock + greg + cds_avg + (1|gr(phylo, cov = A)) + (1| species),family =negbinomial())```### PriorsWeakly informative priors are used throughout. Priors on the standardised predictors in the brain and longevity sub-models are more regularising (SD = 0.5) than those on repertoire predictors (SD = 1), reflecting the fact that repertoire is on its original count scale.```{r priors}priors <-c(prior(normal(0, 1), class ="b", resp ="repertoire"),prior(normal(0, 0.5), class ="b", resp ="brain"),prior(normal(0, 0.5), class ="b", resp ="longevity"))```---## Initial Model Fit (Single Imputation)Before running the full 100-imputation pipeline, the model is first fit on a single imputed dataset to check for convergence, inspect posterior predictive fit, and confirm the model structure is sensible.```{r single-fit, eval=FALSE}mod_SEM <-brm( rep_mod + br_mod + long_mod +set_rescor(FALSE),data = imp_long |>filter(.imp ==1),data2 =list(A = A),prior = priors,backend ="cmdstanr",chains =4,cores =4,iter =5000,seed =123,file ="./data/processed/base_SEM_model_fit"control =list(adapt_delta =0.95))``````{r load-single-fit, echo=FALSE}mod_SEM <-readRDS("./data/processed/base_SEM_model_fit.rds")```### Model SummaryRhat values close to 1.0 and large effective sample sizes (ESS) indicate good convergence.```{r single-summary}summary(mod_SEM)```### Posterior Predictive ChecksVisual inspection of the posterior predictive distribution against the observed data. The model-predicted distributions (blue lines) should closely follow the observed data distribution (black line).```{r ppc-single, fig.cap="Posterior predictive checks for the initial single-imputation model fit."}p1 <-pp_check(mod_SEM, ndraws =100, resp ="repertoire") +ggtitle("Response: repertoire")p2 <-pp_check(mod_SEM, ndraws =100, resp ="brain") +ggtitle("Response: brain")p3 <-pp_check(mod_SEM, ndraws =100, resp ="longevity") +ggtitle("Response: longevity")p1 + p2 + p3 +plot_annotation("Posterior predictive checks (blue: model, black: data)")```---## Full Pipeline: All 100 Imputed DatasetsThe model is first compiled with zero chains (no sampling), producing a compiled Stan program. This compiled object is then updated via `update()` for each of the 100 imputed datasets in parallel, avoiding recompilation on every iteration.### Compile (No Sampling)```{r compile-empty, eval=FALSE}mod_SEM_empty <-brm( rep_mod + br_mod + long_mod +set_rescor(FALSE),data = imp_long |>filter(.imp ==1),data2 =list(A = A),prior = priors,backend ="cmdstanr",chains =0,cores =0,iter =2200,warmup =200,seed =123,file ="./data/processed/SEM_model_fit_empty.rds",control =list(adapt_delta =0.95))```### Run Across All ImputationsEach imputed dataset gets its own set of 2 chains × 5000 iterations (1000 warmup). Running in parallel across 20 cores substantially reduces wall-clock time.```{r run-all-imputations, eval=FALSE}models <-pblapply(1:100, cl =20, function(i) {update(mod_SEM_empty,newdata = imp_long |>filter(.imp == i),data2 =list(A = A),prior = priors,backend ="cmdstanr",chains =2,cores =2,iter =5000,seed =123,control =list(adapt_delta =0.95))})saveRDS(models, "./data/processed/SEM_models_all_imputations.rds")```---## Convergence DiagnosticsAfter fitting all 100 models, Rhat values are checked for each. Rhat > 1.05 is considered a convergence warning. The table below summarises how many parameters with Rhat > 1.05 each model has.```{r convergence, eval=FALSE}models <-readRDS("./data/processed/SEM_models_all_imputations.rds")n_bad_rhat <-sapply(models, function(mod) { rhats <-summarise_draws(as_draws_array(mod))$rhatsum(rhats >1.05, na.rm =TRUE)})max_rhat <-sapply(models, function(mod) { rhats <-summarise_draws(as_draws_array(mod) )$rhatmax(rhats, na.rm =TRUE)})df_rhats <-data.frame(model =1:length(models),n_bad_rhat = n_bad_rhat,max_rhat = max_rhat)saveRDS(df_rhats, "./data/processed/n_bad_rhat.rds")```Check R hats across all models. The majority of models should have zero parameters with Rhat > 1.05, and the maximum Rhat values should be close to 1.0.```{r convergence-summary}df_rhats <-readRDS("./data/processed/n_bad_rhat.rds")df_rhats |>subset(max_rhat >1.05)# proportion of models with any bad Rhatstable(df_rhats$n_bad_rhat >0)# distribution of bad Rhat countssummary(df_rhats$n_bad_rhat)```---## Combine Models and Final InferenceThe 100 individual model fits are combined into a single `brmsfit` object using `combine_models()`. This pools the posterior draws across all imputed datasets, so that subsequent inference integrates over both sampling uncertainty and imputation uncertainty.```{r combine, eval=FALSE}comb_mod <-combine_models(mlist = models, check_data =FALSE)saveRDS(comb_mod, "./data/processed/SEM_combined_model.rds")``````{r load-combined, echo=FALSE}comb_mod <-readRDS("./data/processed/SEM_combined_model.rds")```### Results SummaryFixed effect estimates with 95% credible intervals. Predictors whose credible intervals exclude zero are interpreted as having a meaningful effect on the response.```{r final-summary, results="asis"}extended_summary( comb_mod,highlight =TRUE,trace.palette = viridis::mako,remove.intercepts =TRUE,trace =FALSE)```### Repertoire Effects PlotMarginal posterior distributions for all fixed effects on repertoire size:```{r repertoire-effects, fig.cap="Posterior distributions of fixed effects on vocal repertoire size."}mcmc_plot(comb_mod, variable ="^b_rep.*", type ="areas", regex =TRUE)```::: {.alert .alert-success}# Takeaways {.unnumbered .unlisted}- Repertoire size is positively associated with gregariousness- Body mass is positively associated with brain size and longevity::: <!-- '---' adds a gray vertical line -->---::: {.callout-note collapse="true"}## Session information {.unnumbered .unlisted}```{r session info}#| echo: false# if devtools is installed use devtools::session_info()if (requireNamespace("devtools", quietly =TRUE)) { devtools::session_info()} else {sessionInfo()}```:::