Introduction

This page is to show my process in simulating data made by the package ‘squidSim’ to test different sample sizes and variances for a discriminant analysis prior to running my observed data. This analysis and simulating data is novel to me and I am choosing to write down the steps to help myself at a later time, and to have simulated data to compare my observed data analysis to.

Background

I have a small data set of adult American barn owls (Tyto furcata) I am using to run an LDA prior to using this analysis on nestling American barn owls (hereafter, barn owls). My goal is to identify the emergence of the reverse sexual dimorphism that appears in this species and to determine if an LDA can help sex these individuals at an earlier age (ie prior to full body plumage coloration).

Linear discriminant analyses have been used for sexing individual bird species for decades(cite). Recently, there was a paper explaining how the different methods of using this analysis may have led to poorly understood and even misused results. I aim to remove that uncertainty about my data and species by using ‘squidSim’, an r package for simulating data populations to determine prior to running my own data, if it is plausible to use an LDA on my birds.

EDIT First, I will run an LDA simulation on adult barn owls based on some of my observed data. This will include: the average morphometrics of hallux, culmen, and mass for each sex, and estimated variance for each morphometric of each sex. Will add more if anything else comes to mind, after.


I have measurements for 37 adults females, and 30 adult males. I will begin my simulations starting at 5 individuals and working my way up in increments of 10, until 100. I will run each of these population sizes at different variances beginning at 0.5 and increasing in increments of 0.5, until 2. This comes out to 40 simulations which I will input in a table, and have figures for showing changes over the differing sample sizes. Each of these simulations will be run with 5000 populations to ensure accurate results.

2023-07-11

Steps for running an LDA

1. Box’s M test (homogeneity of covariances)

2. Homogeneity of variances

3. Identify outliers

4. Use jackknife (leave-one-out) method

5. Report LDA coeff, percentage accuracy, and the CI for each

2023-07-19

I have found what I believe to be the best method for finding confidence intervals and that is simply finding the confidence interval of the proportion (ie posterior probability).

#example data used from Dechaume-Moncharmont et al 2011. Very similar results (tenths are slighly different than reported, most likely due to raw data used in their analysis)
n <- 525
p <- 0.81
margin <- qnorm(0.975)*sqrt(p*(1-p)/n)

low <- p - margin
low #reported is 77.5
## [1] 0.7764426
high <- p + margin
high #reported is 84.8
## [1] 0.8435574

I also think it would be beneficial to include the range of proportions. Perhaps visualize these by # populations simulated and boxplots?

Beginning simulations

I will bring over code from “Discriminant Analysis Setup” for this below part:

Introduction

This code is for setting up simulated population(s) for comparing real data collected by E.M. Phillips for their dissertation. Specifically this simulation is for creating data for a linear discriminant analysis. Dechaume-Moncharmont (et al. 2011) showed different LDA analyses methods and determined larger sample sizes using the jackknifing (ie leave-one-out) method. I began with 5 individuals of each sex for comparison, with different, increasing variances at each sample size (0.5, 0.75, 1, 1.25, 1.5, 1.75, 2). Tables were created from these outputs with a set.seed(18).

library(squidSim)
library(lme4)
library(ggplot2)
library(MASS)
library(rstatix)
library(readxl)
Below is testing the code and method on a single population prior to running on thousands…

Adding intercept and variance data using estimates from real dataset:

adultdf <- read_excel("adults_thesis.xlsx", sheet = 'comp')
adultdf <- as.data.frame(adultdf)
adults <- adultdf %>%
  filter(age == 'adult') %>%
  filter(!is.na(hallux))
mean(adults$hallux) #intercept to be used in simulations
## [1] 20.09231
fem <- adults %>%
  filter(sex1 == 'female')
male <- adults %>%
  filter(sex1 == 'male')
mean(fem$hallux) - mean(male$hallux) #used for beta for the average hallux difference between adults.
## [1] 1.780952
var(adults$hallux) #variance to be used in the data simulations
## [1] 1.820721

Intercept is from Phillips (2022) Master’s thesis data. For this example, I am using estimates of adults (sample size: females = 37; males = 30) for hallux. Reference group is ‘female’ based on alphabetical and order in data. Intercept is roughly the median between the average male’s and female’s hallux measurements (20mm). Beta of -1.78mm states the females’ hallux is on average 1.78mm larger than the males’. It is negative because the female is the reference group.

bnow.cv <- make_structure(structure="sex(2)", 
                         repeat_obs=5,  
                         level_names=list(sex=c("female","male"))) #creates simulated data structure

pop.cv <- simulate_population(
  data_structure = bnow.cv,
  parameters = list(
    intercept= 20,
    sex=list(
      fixed=TRUE,
      beta=c(0,-1.78)
    ),
    residual = list(
      vcov = 1.82
    )
  )
)
#simulates data based on parameters put in to the simulation

data.cv <- get_population_data(pop.cv)

Because this is simulated data, there should not be outliers, and it should be normally distributed. I should be in the habit of checking all these things for the observed dataset.

1. Box’s M test

For this simulation, I will not have covariance because I am only using one predictor (hallux).

2. Homogeneity of Variance

data.cv %>% levene_test(y ~ sex) ## p-value is not significant, meaning it is homogeneous
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## # A tibble: 1 × 4
##     df1   df2 statistic     p
##   <int> <int>     <dbl> <dbl>
## 1     1     8     0.712 0.423

3. Identify Outliers

data.cv %>% identify_outliers(y) ##no outliers found
## [1] y           sex_effect1 sex_effect2 residual    sex         squid_pop  
## [7] is.outlier  is.extreme 
## <0 rows> (or 0-length row.names)

4. Jackknife method

model.cv <- lda(sex ~ y, CV = TRUE, data = data.cv) #currently using lda method, might change to qda at a later time, but need to research more.

#Assess the accuracy of the prediction
#percent correct for each category of 'sex'

ct <- table(data.cv$sex, model.cv$class)
diag(prop.table(ct, 1))
## female   male 
##    0.8    0.8
#total percent correct
sum(diag(prop.table(ct)))
## [1] 0.8
n <- 10
p <- 0.9
margin <- qnorm(0.975)*sqrt(p*(1-p)/n)

low <- p - margin
low
## [1] 0.7140615
high <- p + margin
high
## [1] 1.085939

This small dataset with the observed variance is not giving very good reports because it is so small. Let’s increase the number of populations used to get better estimates. I still do not expect good estimates because of how small the sample size is.

Increased population numbers

Increasing number of populations to 5,000 for better statistical estimates
rm(list=ls()) #clear the workspace
set.seed(18)
bnow.pop1 <- make_structure(structure="sex(2)", 
                         repeat_obs=5,  
                         level_names=list(sex=c("female","male")))


pop1 <- simulate_population(
  data_structure = bnow.pop1,
  parameters = list(
    intercept= 20,
    sex=list(
      fixed=TRUE,
      beta=c(0,-1.78) #mean difference between male and female hallux
    ),
    residual = list(
      vcov = 1.82
    )
  ),
  n_pop = 5000
)

data.pop1 <- get_population_data(pop1, list = TRUE)

Retesting assumptions…can’t run these on lists(vvvv), so much create an lapply?

# <make lapply function>
# 2. homogeneity of variance
# data.pop1 %>% levene_test(y ~ sex) #not significant
# 3. outliers
# data.pop1 %>% identify_outliers(y) ##outliers were found

Outliers were found, but not to be extreme. Nonetheless, it may be best practice to remove them. While there is a possibility for a female hallux to reach upwards of 23mm, it is not likely. Similarly, if a male’s hallux was near 14mm, I would consider the possibility it was not an adult (ie fledgling still growing), or some malformation. This is one reason I tried my best to measure both hallux on each nestling. For these reasons, I think it best they are removed.

Removal of outliers

# <make lapply function>
# quartiles <- quantile(data.pop1$y, probs=c(.25, .75), na.rm = FALSE)
# IQR <- IQR(data.pop1$y)
#  
# Lower <- quartiles[1] - 1.5*IQR
# Upper <- quartiles[2] + 1.5*IQR 
#  
# data_no_outlier <- subset(data.pop1, data.pop1$y > Lower & data.pop1$y < Upper)
# 
# ###checking for outliers again
# data_no_outlier %>% identify_outliers(y) 

There are still outliers, but a lot less. I will run the data with these included. For now…will review this way or removing outliers when creating the actual simulation data that will be used in the project.

Discriminant analysis jackknife method

pop1_list <- lapply(data.pop1, function(i){
  
  model.pop1 <- lda(sex ~ y, CV =TRUE, data = i)
  
  ctl <- table(i$sex, model.pop1$class)
  # diag(prop.table(ctl, 1))
  #total percent correct
  df <- data.frame('Percent_Correct' = sum(diag(prop.table(ctl)))
  )
  
  
})

pop1_list1 <- do.call(rbind, pop1_list)
#total percent correct
mean(pop1_list1$Percent_Correct)
## [1] 0.72218
# pop1_list.no.out <- lapply(data_no_outlier, function(i){
# 
#   model.pop1.no.out <- lda(sex ~ y, CV =TRUE, data = i)
# 
#   ctl <- table(i$sex, model.pop1.no.out$class)
#   # diag(prop.table(ctl, 1))
#   #total percent correct
#   df <- data.frame('Percent_Correct' = sum(diag(prop.table(ctl)))
#   )
# 
# 
# })
# 

# ct.no.out <- table(data.pop1$sex, model.no.outlier$class)
# diag(prop.table(ct.no.out, 1))
#total percent correct
# sum(diag(prop.table(ct.no.out)))

I can’t determine the probability correct with different size tables. Will look into this later. For now, making sure the code can run correctly without removing outliers.