Full Documentation For: An investigation of the relationship between long bone measurements and stature: Implications for estimating skeletal stature in subadults.

Overview

This research compendium provides code and steps needed to replicate the results of the publication:

Chu, E.Y. & Stull, K.E. (In press). An investigation of the relationship between long bone measurements and stature: Implications for estimating skeletal stature in subadults. International Journal of Legal Medicine

To run the code for this paper locally, please ensure that R, and optionally RStudio, are installed on your local system.

Additionally, the following packages are used and should be installed using install.packages("package_name") (a one-call sketch follows the list):
* tidyverse
* magrittr
* caret
* car
* gridExtra
* ggrepel
* propagate
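For convenience, all of the required packages can be installed in a single call:

install.packages(c("tidyverse", "magrittr", "caret", "car",
                   "gridExtra", "ggrepel", "propagate"))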

How to cite

Please cite this compendium as:

Chu, E.Y. & Stull, K.E. (2024). Compendium of R code for “An investigation of the relationship between long bone measurements and stature: Implications for estimating skeletal stature in subadults”. Accessed [Current Date].

The Data

All data are from the Subadult Virtual Anthropology Database. Specifically, this project uses data from the United States (U.S.), which can be downloaded HERE. For this specific analysis, the U.S. data were filtered to individuals who had at least one long bone measurement and a known stature. Additionally, the data were visually checked for outliers, which were removed. The final data are provided in a GitHub repository, which can be accessed at: https://github.com/ElaineYChu/chu-and-stull_nonlinear-stature. Instructions for “cloning the repository” can be found in the README; a sketch is also given below.
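For reference, one way to clone the repository from within R is shown below. This is a minimal sketch that assumes git is installed and available on your PATH; see the repository README for full instructions:

## Clone the compendium repository (assumes git is on the PATH)
system("git clone https://github.com/ElaineYChu/chu-and-stull_nonlinear-stature.git")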

Organization

This document is divided into three sections:
1. Data Manipulation
2. Data Analysis
3. Data Visualization

Setup

After cloning the repository to your own system, the following code is used to set up your local R/RStudio environment:

rm(list=ls())  # clears the global environment

## Check that the cloned repository is set as the working directory
if (!(grepl("chu-and-stull_nonlinear-stature", getwd()))) {
  stop("Please check your working directory")
}

## Load required libraries
library(tidyverse)
library(magrittr)
library(caret)
library(car)
library(gridExtra)
library(ggrepel)

## Initialize some objects for visualizations
source("elaine_theme.R")
colors <- c("#073b4c","#0b6583","#43c4ef","#06efb1",
            "#ffd166","#ef436b","#cccccc","#444444")
xlab <- "Age (years)"

## Initialize seed from random.org
seed <- 167416

Data Manipulation

1. Import Data

  • Located in data folder
  • Stored as .rds file
og <- readRDS("data/US_0-21_FINAL.rds")  # original U.S. data

og$stature <- og$stature / 10  # convert stature from mm to cm
og$upper <- og$HDL + og$RDL  # create upper limb length
og$lower <- og$FDL + og$TDL  # create lower limb length
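A quick sanity check that the unit conversion and derived limb lengths behaved as expected (a minimal sketch using the columns created above):

## Stature should now be in cm; limb lengths remain in mm
summary(og[c("stature", "upper", "lower")])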

2. Split into Training and Testing Sets

  • 80% Training and 20% Testing
  • Uses createDataPartition() from package caret
base <- og %>% filter(!is.na(stature)) %>% droplevels()  # must have stature

set.seed(seed)  # for data reproducibility
idx <- createDataPartition(base$SEX, p=0.8, list=F)
train <- base[idx, ]  # Pooled training set
test <- base[-idx, ]  # Pooled testing set
train_F <- train %>% filter(SEX=="F")  # Female training set
train_M <- train %>% filter(SEX=="M")  # Male training set
test_F <- test %>% filter(SEX=="F")  # Female testing set
test_M <- test %>% filter(SEX=="M")  # Male testing set

## Store resulting sets in data folder
saveRDS(train, "data/train.rds")  
saveRDS(test, "data/test.rds")
saveRDS(train_F, "data/train_F.rds")
saveRDS(train_M, "data/train_M.rds")
saveRDS(test_F, "data/test_F.rds")
saveRDS(test_M, "data/test_M.rds")
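Because createDataPartition() stratifies on SEX, the sex distribution should be preserved across the training and testing sets. A minimal sketch to verify the split:

## Verify the stratified 80/20 split
prop.table(table(base$SEX))   # overall sex distribution
prop.table(table(train$SEX))  # should be approximately the same
prop.table(table(test$SEX))
nrow(train) / nrow(base)      # should be approximately 0.80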

3. Define Columns that Contain Long Bone Measurements

  • Column numbers
  • Give them better (more descriptive) labels
uni_vars <- 9:28

uni_labels <- c("Femur Length (mm)",
               "Femur Midshaft Breadth (mm)",
               "Femur Distal Breadth (mm)",
               "Tibia Length (mm)",
               "Tibia Proximal Breadth (mm)",
               "Tibia Midshaft Breadth (mm)",
               "Tibia Distal Breadth (mm)",
               "Fibula Length (mm)",
               "Humerus Length (mm)",
               "Humerus Proximal Breadth (mm)",
               "Humers Midshaft Breadth (mm)",
               "Humerus Distal Breadth (mm)",
               "Radius Length (mm)",
               "Radius Proximal Breadth (mm)",
               "Radius Midshaft Breadth (mm)",
               "Radius Distal Breadth (mm)",
               "Ulna Length (mm)",
               "Ulna Midshaft Breadth (mm)",
               "Upper Limb Length (mm)",
               "Lower Limb Length (mm)")

Assumptions Testing

A Shapiro-Wilk test was used to assess normality for each long bone measurement, given the large sample size:

sw_df <- as.data.frame(matrix(nrow=length(uni_vars), ncol=3))
names(sw_df) <- c("var", "W", "p")

for(i in 1:length(uni_vars)){
  sw <- shapiro.test(data[[uni_vars[i]]])
  
  sw_df[i, "var"] <- names(data)[uni_vars[i]]
  sw_df[i, "W"] <- round(sw$statistic, 4)
  sw_df[i, "p"] <- signif(sw$p.value, 4)

}

sw_df
##      var      W         p
## 1    FDL 0.8801 5.603e-25
## 2   FMSB 0.9396 5.024e-18
## 3    FDB 0.9358 1.198e-19
## 4    TDL 0.8815 5.154e-25
## 5    TPB 0.9236 1.596e-21
## 6   TMSB 0.9365 9.102e-19
## 7    TDB 0.9235 2.342e-21
## 8   FBDL 0.8818 4.949e-25
## 9    HDL 0.8675 3.428e-25
## 10   HPB 0.9486 6.048e-15
## 11  HMSB 0.9657 2.574e-13
## 12   HDB 0.9311 1.715e-19
## 13   RDL 0.8779 1.592e-24
## 14   RPB 0.9389 2.199e-18
## 15  RMSB 0.9625 1.057e-13
## 16   RDB 0.9373 4.985e-19
## 17   UDL 0.8850 6.195e-24
## 18  UMSB 0.9747 9.058e-11
## 19 upper 0.8617 2.894e-25
## 20 lower 0.8767 4.534e-25
# write.csv(sw_df, "results/shapiro-wilks.csv", row.names=F)

Data are not normally distributed, which suggests the use of nonparametric tests.

Data Analysis

1. Calculate Correlation Coefficients (Kendall’s Tau)

Correlation coefficients (\(r\)) are calculated between stature and all diaphyseal dimensions. Kendall’s tau was used for a conservative estimate of \(r\), especially considering the generally nonlinear relationship between diaphyseal dimensions and stature during ontogeny.
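As a brief illustration of why the rank-based Kendall’s tau is more conservative than Pearson’s r for noisy, curvilinear growth data, here is a minimal sketch on simulated data (not the SVAD sample):

## Simulated saturating "growth" curve with noise (illustration only)
set.seed(1)
x <- runif(200, 50, 450)                              # hypothetical lengths (mm)
y <- 40 + 110*(1 - exp(-0.008*x)) + rnorm(200, sd=5)  # hypothetical statures (cm)
cor(x, y, method="pearson")  # inflated by the strong trend
cor(x, y, method="kendall")  # rank-based, typically lower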

cor_mat <- cor(data[uni_vars],data["stature"], method="kendall", use="complete.obs")
cor_mat <- as.data.frame(cor_mat)
cor_mat$var <- rownames(cor_mat)
colnames(cor_mat) <- c("r","var")

cor_mat <- cor_mat[order(cor_mat$r, decreasing=T),]  # reorder by decreasing r

cor_mat$r <- signif(cor_mat$r, digits=3)

write.csv(cor_mat, "results/correlations.csv", row.names=F)

cor_mat
##           r   var
## lower 0.907 lower
## FBDL  0.907  FBDL
## HDL   0.905   HDL
## FDL   0.905   FDL
## TDL   0.905   TDL
## upper 0.904 upper
## UDL   0.895   UDL
## RDL   0.895   RDL
## FDB   0.871   FDB
## TPB   0.869   TPB
## TDB   0.858   TDB
## HDB   0.854   HDB
## FMSB  0.839  FMSB
## HPB   0.830   HPB
## RPB   0.826   RPB
## HMSB  0.820  HMSB
## TMSB  0.819  TMSB
## RDB   0.810   RDB
## RMSB  0.808  RMSB
## UMSB  0.759  UMSB

All correlations yielded strong, positive relationships with stature. The full cor_mat shows that the diaphyseal lengths have the strongest relationships, followed by the lower limb distal and proximal breadths, then the upper limb breadths.

  • The strongest relationship: 0.907, lower
  • The weakest relationship: 0.759, UMSB

With these strong, positive relationships confirmed, I move on to generating both linear and nonlinear models to estimate stature.

2. Stature Estimation Equations

First, let’s evaluate the type of bivariate relationship that femur length and femur distal breadth each have with stature:

plot1 <- ggplot(data, aes(x=FDL, y=stature)) + elaine_theme + 
  geom_point(aes(shape=meas_type), alpha=0.2, size=2, fill="grey85") + 
  geom_smooth(method="lm", lty="dashed", se=F, color="black") + 
  geom_smooth(method="loess", lty="solid", se=F, color="black") + 
  scale_shape_manual(values=c(24,21)) + 
  labs(x=uni_labels[1], y="Stature (cm)", title="a)") + 
  theme(legend.position="none", plot.title=element_text(hjust=0))

plot2 <- ggplot(data, aes(x=FDB, y=stature)) + elaine_theme + 
  geom_point(aes(shape=meas_type), alpha=0.2, size=2, fill="grey85") + 
  geom_smooth(method="lm", lty="dashed", se=F, color="black") + 
  geom_smooth(method="loess", lty="solid", se=F, color="black") + 
  scale_shape_manual(values=c(24,21)) + 
  labs(x=uni_labels[3], y="Stature (cm)", title="b)") + 
  theme(legend.position="none", plot.title=element_text(hjust=0))

grid.arrange(plot1, plot2, ncol=2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Figure 1 description: circles represent diaphyseal length, whereas triangles represent maximum length, which includes the epiphyses.

From the plots, none of the relationships are strictly linear. Traditionally, subadult (diaphyseal dimension) stature estimation equations have used linear regression (Ruff, 2007; Smith, 2007). The present data follow a relatively linear pattern for the length measurements, but individuals at the lower end of lower limb length and stature, as well as at middle values of lower limb length, are “missed” by the linear regression, so both types of models will be explored. Let’s try the following (a quick sketch of the three curve shapes follows the list):

  1. Linear: \(y = mx + b\)
  2. 3-parameter exponential (similar to a power law): \(y = a - be^{-cx}\)
  3. 3-parameter logistic (sigmoidal): \(y = \frac{a}{1 + be^{-cx}}\)
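To build intuition for these three shapes before any fitting, here is a quick sketch plotting each functional form with arbitrary parameter values (chosen only for illustration, not the fitted coefficients):

## Sketch of the three candidate curve shapes (arbitrary parameters)
curve(20 + 0.3*x, from=0, to=500, ylim=c(0,180), lty="dashed",
      xlab="x", ylab="y")                                 # linear
curve(150 - 120*exp(-0.01*x), add=TRUE, lty="solid")      # 3-parameter exponential
curve(160/(1 + 15*exp(-0.02*x)), add=TRUE, lty="dotted")  # 3-parameter logistic
legend("bottomright", lty=c("dashed","solid","dotted"),
       legend=c("linear","exponential","logistic"))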

Let’s go!

2a. Linear Models

In this section, two types of linear regression models were considered:

  • Univariate linear models
  • Stepwise multiple linear models

Only the univariate models are shown here and carried through the model comparisons below.
lm_models <- list()

for(i in uni_vars) {
  var <- names(train)[i]
  
  lm_models[[var]] <- lm(as.formula(paste("stature ~", var)), data=train)
}

saveRDS(lm_models, "results/linear_models.rds")
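Any single fitted model can be inspected by indexing the stored list by measurement name; for example (FDL chosen arbitrarily):

## Inspect one univariate fit: coefficients, residual standard error, R-squared
summary(lm_models[["FDL"]])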

2b. Nonlinear Models

Many nonlinear models were tested using nls() on the training data and visually assessed to find the models that best fit the data. Equations and nls() usage were adapted from: https://data-flair.training/blogs/r-nonlinear-regression/

  • 3-parameter asymptotic exponential, \(y = a - be^{-cx}\): closest fit for lengths (starting values a=1500, b=1000, c=0.005)
  • 3-parameter logistic, \(y = \frac{a}{1 + be^{-cx}}\): best for breadths (starting values a=1500, b=15, c=0.05)

Pooled

nls_models <- list()

set.seed(seed)
for(i in uni_vars) {
  var <- names(train)[i]
  df <- data.frame(stature=train$stature, x=train[[var]])
  
  if (grepl("DL|upper|lower", var)) {
    # lengths: 3-parameter asymptotic exponential
    nls_models[[var]] <- tryCatch(
      nls(stature ~ a - b*exp(-c*x), data=df,
          start=list(a=1500, b=1000, c=0.005)),
      error=function(e) NA
    )
  } else {
    # breadths: 3-parameter logistic
    nls_models[[var]] <- tryCatch(
      nls(stature ~ a/(1 + b*exp(-c*x)), data=df,
          start=list(a=1500, b=8, c=0.05)),
      error=function(e) NA
    )
  }
}

saveRDS(nls_models, "results/nonlinear_models.rds")

Only two variables had unsuccessful fits: TDB and RDB
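Because the tryCatch() above stores NA for any fit that throws an error, the failed fits can be identified programmatically; a minimal sketch:

## List the measurements whose nls() fits failed (stored as NA)
names(nls_models)[!sapply(nls_models, inherits, what="nls")]
## Expected (per the text above): "TDB" "RDB"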

2c. Model Parameters

Extract the model coefficients and the residual standard error (“SEE”).

lm_params <- data.frame(matrix(nrow=length(lm_models), ncol=4))
names(lm_params) <- c("var","b","m","SEE")

for(i in 1:length(lm_models)) {
  lm_params[i,"var"] <- names(lm_models)[i]
  coefs <- lm_models[[i]]$coefficients
  lm_params[i,2:(length(coefs)+1)] <- signif(coefs,3)
  lm_params[i, "SEE"] <- signif(sigma(lm_models[[i]]), 3)
}

write.csv(lm_params, "results/linear_parameters.csv", row.names=F)

nls_params <- data.frame(matrix(nrow=length(nls_models), ncol=5))
names(nls_params) <- c("var","a","b","c","SEE")

for(i in 1:length(nls_models)) {
  if(names(nls_models)[i] %in% c("TDB","RDB")) {
    next
  }
  nls_params[i,"var"] <- names(nls_models)[i]
  coefs <- nls_models[[i]]$m$getAllPars()
  nls_params[i,2:(length(coefs)+1)] <- signif(coefs, 3)
  nls_params[i,"SEE"] <- signif(sigma(nls_models[[i]]), 3)
}

nls_params <- na.omit(nls_params)

write.csv(nls_params, "results/nonlinear_parameters.csv", row.names=F)
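As a quick check of how the extracted parameters are meant to be used, here is a worked example with hypothetical values (illustration only, not the published coefficients):

## Point estimates from the two equation forms (hypothetical parameters)
fdl <- 300                          # hypothetical femur diaphyseal length (mm)
b <- 30; m <- 0.25                  # hypothetical linear coefficients
a2 <- 180; b2 <- 160; c2 <- 0.005   # hypothetical exponential coefficients
b + m*fdl               # linear: stature = b + m*x -> 105 cm
a2 - b2*exp(-c2*fdl)    # exponential: stature = a - b*exp(-c*x) -> ~144.3 cm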

3. Model Comparisons - Accuracy and Precision

Now that the models are generated, calculate performance metrics for each model:

  • Goodness-of-fit (Kolmogorov-Smirnov statistic and p-value)
  • Testing accuracy (proportion of known statures falling within the 95% prediction interval)
  • Bias (residuals)
  • MAD (median absolute deviation of the residuals)
perf_df <- data.frame(matrix(nrow=length(lm_models)+length(nls_models),
                             ncol=6))
names(perf_df) <- c("model","vars","KS_stat", "KS_p", "test_acc", "MAD")
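The “testing accuracy” computed below is the coverage of the 95% prediction interval: the proportion of known statures that fall inside [lo, hi]. A toy illustration with made-up values:

## Toy example of the coverage ("testing accuracy") calculation
known <- c(100, 120, 140)
lo <- c(95, 118, 145); hi <- c(110, 130, 150)
mean(known >= lo & known <= hi)  # 2 of 3 inside -> 0.667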

Linear Testing Performance

for (i in 1:length(lm_models)) {
  var <- names(lm_models)[i]
  df0 <- data.frame(matrix(nrow=nrow(test),ncol=5))
  names(df0) <- c("medrec","stature","point","lo","hi")
  df0$medrec <- test$SVAD_medrec
  df0$stature <- test$stature
  
  model <- lm_models[[i]]
  
  pred <- predict(model, newdata=test[names(model$model)], interval="prediction")
  
  df0$point <- pred[,1]
  df0$lo <- pred[,2]
  df0$hi <- pred[,3]
  df0$resid <- df0$stature - df0$point
  
  write.csv(df0, paste0("results/",var,"_lm_predictions.csv"), row.names=F)
  
  df <- na.omit(df0)
  
  ks <- suppressWarnings(ks.test(df$point, df$stature))
  
  perf_df$model[i] <- "linear"
  perf_df$vars[i] <- ifelse(length(names(model$coefficients)[-1])>1,
                        paste(names(model$coefficients)[-1],collapse=", "),
                        names(model$coefficients)[-1])
  correct <- which(df$stature <= df$hi & df$stature >= df$lo)
  perf_df$test_acc[i] <- signif(length(correct) / nrow(df), 3)
  perf_df$KS_stat[i] <- signif(ks$statistic, 3)
  perf_df$KS_p[i] <- signif(ks$p.value, 3)
  perf_df$MAD[i] <- signif(mad(df$resid, center=mean(df$resid)), 3)
}

write.csv(perf_df[1:length(lm_models),], "results/linear_performance.csv", 
          row.names=F)

Nonlinear Testing Performance

set.seed(seed)
for(i in 1:length(nls_models)) {
  var <- names(nls_models)[i]
  test0 <- na.omit(test[c("SVAD_medrec","stature",var)])
  df0 <- data.frame(matrix(nrow=nrow(test0),ncol=5))
  names(df0) <- c("medrec","stature","point","lo","hi")
  df0$medrec <- test0$SVAD_medrec
  df0$stature <- test0$stature
  
  model <- nls_models[[i]]
  
  if(length(model)==1) {  # skip failed fits (stored as NA)
    next
  } 
  
  gc()  # clear memory
  pred <- propagate::predictNLS(model=model, newdata=data.frame(x=test0[[var]]),
                                interval="prediction")
  
  df0$point <- pred$summary[[1]]
  df0$lo <- pred$summary[[5]]
  df0$hi <- pred$summary[[6]]
  df0$resid <- df0$stature - df0$point
  
  write.csv(df0, paste0("results/",var,"_nls_predictions.csv"), row.names=F)
  
}

pred_files <- list.files("results","_nls_predictions.csv")

for(j in 1:length(pred_files)) {
  df0 <- read.csv(paste0("results/", pred_files[j]))
  var <- stringr::str_remove(pred_files[j], "_nls_predictions.csv")
  df <- na.omit(df0)
  
  ks <- suppressWarnings(ks.test(df$point, df$stature))
       
  perf_df$model[j+20] <- "nonlinear"  # rows 1-20 hold the linear results
  perf_df$vars[j+20] <- var
  correct <- which(df$stature <= df$hi & df$stature >= df$lo)
  perf_df$test_acc[j+20] <- signif(length(correct) / nrow(df), 3)
  perf_df$KS_stat[j+20] <- signif(ks$statistic, 3)
  perf_df$KS_p[j+20] <- signif(ks$p.value, 3)
  perf_df$MAD[j+20] <- signif(mad(df$resid, center=mean(df$resid)), 3)
}


write.csv(perf_df[perf_df$model=="nonlinear" & 
                    !is.na(perf_df$vars),], 
          "results/nonlinear_performance.csv", row.names=F)

write.csv(perf_df[!is.na(perf_df$vars),], 
          "results/model_performance.csv", row.names=F)

4. Model Evaluations

Now that all models have been constructed and performance metrics have been calculated, we can begin to provide overall recommendations for subadult stature estimation.

Full output of model performance:

perf_df
##        model  vars KS_stat  KS_p test_acc   MAD
## 1     linear   FDL  0.0778 0.648    0.961  5.10
## 2     linear  FMSB  0.0889 0.476    0.956  7.31
## 3     linear   FDB  0.0942 0.364    0.948  7.61
## 4     linear   TDL  0.0829 0.563    0.978  5.65
## 5     linear   TPB  0.0733 0.684    0.927  7.51
## 6     linear  TMSB  0.0659 0.824    0.956  7.92
## 7     linear   TDB  0.0785 0.598    0.958  8.34
## 8     linear  FBDL  0.0820 0.570    0.973  4.85
## 9     linear   HDL  0.0833 0.604    0.958  5.54
## 10    linear   HPB  0.0544 0.981    0.966  8.31
## 11    linear  HMSB  0.1140 0.206    0.972 10.20
## 12    linear   HDB  0.0838 0.556    0.939  8.22
## 13    linear   RDL  0.0888 0.519    0.988  6.14
## 14    linear   RPB  0.0686 0.805    0.949  7.57
## 15    linear  RMSB  0.0994 0.367    0.942 12.30
## 16    linear   RDB  0.1000 0.329    0.933  9.04
## 17    linear   UDL  0.0702 0.794    0.977  5.71
## 18    linear  UMSB  0.1240 0.147    0.947 15.40
## 19    linear upper  0.0864 0.581    0.994  5.52
## 20    linear lower  0.0791 0.637    0.966  5.22
## 21 nonlinear  FBDL  0.0601 0.896    0.967  3.65
## 22 nonlinear   FDB  0.0733 0.684    0.958  6.96
## 23 nonlinear   FDL  0.0500 0.978    0.956  4.06
## 24 nonlinear  FMSB  0.0611 0.890    0.967  8.10
## 25 nonlinear   HDB  0.0503 0.977    0.950  7.03
## 26 nonlinear   HDL  0.0476 0.991    0.958  3.99
## 27 nonlinear  HMSB  0.0909 0.461    0.943  9.98
## 28 nonlinear   HPB  0.0544 0.981    0.959  7.75
## 29 nonlinear lower  0.0508 0.976    0.949  3.76
## 30 nonlinear   RDL  0.0533 0.970    0.970  4.39
## 31 nonlinear  RMSB  0.0819 0.615    0.930 11.90
## 32 nonlinear   RPB  0.0629 0.880    0.914  7.84
## 33 nonlinear   TDL  0.0552 0.945    0.961  3.82
## 34 nonlinear  TMSB  0.0549 0.946    0.934  8.09
## 35 nonlinear   TPB  0.0733 0.684    0.937  7.75
## 36 nonlinear   UDL  0.0585 0.932    0.965  3.88
## 37 nonlinear  UMSB  0.1010 0.360    0.953 15.90
## 38 nonlinear upper  0.0494 0.989    0.969  3.93

4a. Bland-Altman Plots

The goodness-of-fit tests (the KS statistics) demonstrate an overall better fit of the nonlinear models compared to the linear models. Bland-Altman plots illustrate this further:

# Import linear predictions (one file per variable, written in section 3)
lm_vars <- c("HDL","HPB","HMSB","HDB","RDL","RPB","RMSB","RDB","UDL","UMSB",
             "upper","FDL","FMSB","FDB","TDL","TPB","TMSB","TDB","FBDL","lower")

lm_pred <- data.frame()
for (v in lm_vars) {
  tmp <- read.csv(paste0("results/", v, "_lm_predictions.csv")) %>% mutate(var=v)
  lm_pred <- rbind(lm_pred, tmp)
}
lm_pred %<>% mutate(type="Linear")

# Import nonlinear predictions (TDB and RDB are excluded; their fits failed)
nls_vars <- c("HDL","HPB","HMSB","HDB","RDL","RPB","RMSB","UDL","UMSB",
              "upper","FDL","FMSB","FDB","TDL","TPB","TMSB","FBDL","lower")

nls_pred <- data.frame()
for (v in nls_vars) {
  tmp <- read.csv(paste0("results/", v, "_nls_predictions.csv")) %>% mutate(var=v)
  nls_pred <- rbind(nls_pred, tmp)
}
nls_pred %<>% mutate(type="Nonlinear")

# Combine datasets, save as .csv and .rds files
all_pred <- rbind(lm_pred, nls_pred)

all_pred %<>% mutate(mt = ifelse(grepl('DL$', var), "Length", 
                                 ifelse(grepl('PB$', var), "Proximal",
                                        ifelse(grepl('MSB$', var), "Midshaft", "Distal"))))

all_pred$mt <- factor(all_pred$mt, levels=c("Length","Proximal","Midshaft","Distal"))

write.csv(all_pred, "results/all_predictions.csv", row.names=F)
saveRDS(all_pred, "results/all_predictions.rds")
all_pred$avg <- rowMeans(all_pred[c("stature","point")])  # mean of known and predicted
mean_diff <- mean(all_pred$resid)            # mean difference (bias)
lwr <- mean_diff - 1.96*sd(all_pred$resid)   # lower 95% limit of agreement
upr <- mean_diff + 1.96*sd(all_pred$resid)   # upper 95% limit of agreement
names(all_pred)[1] <- "SVAD_medrec"          # rename 'medrec' for the join below

all_pred2 <- left_join(all_pred, test[c("SVAD_medrec","meas_type","SEX","agey")], by="SVAD_medrec", relationship='many-to-many')

ggplot(all_pred2) + elaine_theme + 
  geom_hline(yintercept=mean_diff, color="black") + 
  geom_hline(yintercept=lwr, lty="dotted") + 
  geom_hline(yintercept=upr, lty="dotted") + 
  geom_point(aes(x=avg, y=resid), size=2, pch=21, fill=colors[4]) + 
  facet_grid(mt ~ type) + 
  labs(x="Mean of Known and Predicted Stature (cm)",
       y="Residuals (cm)")

Figure 2 description: Bland-Altman plots demonstrating the pattern of residuals based on measurement type (rows) and model type (columns). The solid line represents the mean of the residuals, while the dotted lines represent the upper and lower 95% limits of agreement (mean ± 1.96 SD of the residuals).

  • Minimum residual for linear models: -41.64
  • Maximum residual for linear models: 46.28
  • Minimum residual for nonlinear models: -44.98
  • Maximum residual for nonlinear models: 43.20

4b. Misclassifications

Given the Bland-Altman plots, the nonlinear models show less bias in stature estimates than the linear models, especially at the lower end of stature, i.e., in younger individuals. This matches the bivariate plots above, which showed that a linear model would “miss” those individuals.

Let’s take a look at the individuals whose known stature falls outside of the model-specific 95% prediction intervals:

mc_df <- all_pred2 %>% filter(stature >= hi | stature <= lo)

nrow(mc_df)
## [1] 296
nrow(all_pred2)
## [1] 6675
nrow(mc_df %>% filter(type=="Linear", mt=="Length")) / nrow(all_pred2) * 100
## [1] 0.7790262
nrow(mc_df %>% filter(type=="Nonlinear", mt=="Length")) / nrow(all_pred2) * 100
## [1] 0.5842697
nrow(mc_df %>% filter(type=="Linear", mt=="Proximal")) / nrow(all_pred2) * 100
## [1] 0.1947566
nrow(mc_df %>% filter(type=="Nonlinear", mt=="Proximal")) / nrow(all_pred2) * 100
## [1] 0.494382
nrow(mc_df %>% filter(type=="Linear", mt=="Midshaft")) / nrow(all_pred2) * 100
## [1] 0.5842697
nrow(mc_df %>% filter(type=="Nonlinear", mt=="Midshaft")) / nrow(all_pred2) * 100
## [1] 0.7191011
nrow(mc_df %>% filter(type=="Linear", mt=="Distal")) / nrow(all_pred2) * 100
## [1] 0.6142322
nrow(mc_df %>% filter(type=="Nonlinear", mt=="Distal")) / nrow(all_pred2) * 100
## [1] 0.4644195
nrow(mc_df %>% filter(type=="Linear")) / nrow(all_pred2) * 100
## [1] 2.172285
nrow(mc_df %>% filter(type=="Nonlinear")) / nrow(all_pred2) * 100
## [1] 2.262172
mc_df$meas_type <- car::recode(mc_df$meas_type, 
                               "'DCP'='With Epiphyses';
                               'Stull'='Diaphysis Only'")

ggplot(mc_df) + elaine_theme + 
  geom_point(aes(x=agey, y=stature, fill=SEX, shape=meas_type), size=2) + 
  facet_grid(type ~ mt ) + 
  scale_fill_manual(values=colors[c(3,5)]) + 
  scale_shape_manual(values=c(21, 24)) + 
  labs(x="Age (years)", y="Known Stature (cm)") + 
  theme(legend.position="none")

ggplot(mc_df) + elaine_theme + 
  geom_point(aes(x=agey, y=stature, fill=SEX, shape=meas_type), size=2) + 
  facet_grid(mt ~ type ) +  
  scale_shape_manual(values=c(21, 24), guide="none") + 
  scale_fill_manual(values=colors[c(3,5)]) +
  labs(x="Age (years)", y="Known Stature (cm)", fill=NULL) + 
  theme(legend.position="none")

Figure 3 description: Misclassifications by measurement type (rows) and model type (columns). Female = filled blue shapes. Male = filled yellow shapes. Diaphyseal measurements = circles. Maximum measurements = triangles.

5. Coverage in Subadults

chu <- ggplot(data, aes(x=upper, y=stature)) + elaine_theme + 
  geom_point(color="grey85", pch=16, alpha=0.4) + 
  geom_smooth(se=F, color="purple") + 
  labs(x="Upper Limb Length (mm)", y="Known Stature (cm)", title="Present Study")
murray <- ggplot(data, aes(x=upper, y=stature)) + elaine_theme + 
  geom_point(color="grey85", pch=16, alpha=0.4) + 
  geom_smooth(data=data %>% filter(agey<=12), se=F, method="lm", color="red") + 
  labs(x="Upper Limb Length (mm)", y="Known Stature (cm)", title="Murray et al. (2024)")
feldesman <- ggplot(data, aes(x=upper, y=stature)) + elaine_theme + 
  geom_point(color="grey85", pch=16, alpha=0.4) + 
  geom_smooth(data=data %>% filter(agey<=18, agey>=8), se=F, method="lm", color="blue") + 
  labs(x="Upper Limb Length (mm)", y="Known Stature (cm)", title="Feldesman (1992)")
smith <- ggplot(data, aes(x=upper, y=stature)) + elaine_theme + 
  geom_point(color="grey85", pch=16, alpha=0.4) + 
  geom_smooth(data=data %>% filter(agey<=10, agey>=3), se=F, method="lm", color="green") + 
  labs(x="Upper Limb Length (mm)", y="Known Stature (cm)", title="Smith (2007)")
  
grid.arrange(smith, feldesman, murray, chu, nrow=2, ncol=2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Figure 4 description: Coverage of different stature estimation methods compared to data in the current study.