This page is to show my process in simulating data made by the package ‘squidSim’ to test different sample sizes and variances for a discriminant analysis prior to running my observed data. This analysis and simulating data is novel to me and I am choosing to write down the steps to help myself at a later time, and to have simulated data to compare my observed data analysis to.
I have a small data set of adult American barn owls (Tyto furcata) I am using to run an LDA prior to using this analysis on nestling American barn owls (hereafter, barn owls). My goal is to identify the emergence of the reverse sexual dimorphism that appears in this species and to determine if an LDA can help sex these individuals at an earlier age (ie prior to full body plumage coloration). Linear discriminant analyses have been used for sexing individual bird species for decades(cite). Recently, there was a paper explaining how the different methods of using this analysis may have led to poorly understood and even misused results. I aim to remove that uncertainty about my data and species by using ‘squidSim’, an r package for simulating data populations to determine prior to running my own data, if it is plausible to use an LDA on my birds.
I have measurements for 37 adults females, and 30 adult males. I will begin my simulations starting at 5 individuals and working my way up in increments of 10, until 100. I will run each of these population sizes at different variances beginning at 0.5 and increasing in increments of 0.5, until 2. This comes out to 40 simulations which I will input in a table, and have figures for showing changes over the differing sample sizes. Each of these simulations will be run with 5000 populations to ensure accurate results.
I have found what I believe to be the best method for finding confidence intervals and that is simply finding the confidence interval of the proportion (ie posterior probability).
#example data used from Dechaume-Moncharmont et al 2011. Very similar results (tenths are slighly different than reported, most likely due to raw data used in their analysis)
n <- 525
p <- 0.81
margin <- qnorm(0.975)*sqrt(p*(1-p)/n)
low <- p - margin
low #reported is 77.5
## [1] 0.7764426
high <- p + margin
high #reported is 84.8
## [1] 0.8435574
I also think it would be beneficial to include the range of proportions. Perhaps visualize these by # populations simulated and boxplots?
I will bring over code from “Discriminant Analysis Setup” for this below part:
This code is for setting up simulated population(s) for comparing real data collected by E.M. Phillips for their dissertation. Specifically this simulation is for creating data for a linear discriminant analysis. Dechaume-Moncharmont (et al. 2011) showed different LDA analyses methods and determined larger sample sizes using the jackknifing (ie leave-one-out) method. I began with 5 individuals of each sex for comparison, with different, increasing variances at each sample size (0.5, 0.75, 1, 1.25, 1.5, 1.75, 2). Tables were created from these outputs with a set.seed(18).
library(squidSim)
library(lme4)
library(ggplot2)
library(MASS)
library(rstatix)
library(readxl)
Adding intercept and variance data using estimates from real dataset:
adultdf <- read_excel("adults_thesis.xlsx", sheet = 'comp')
adultdf <- as.data.frame(adultdf)
adults <- adultdf %>%
filter(age == 'adult') %>%
filter(!is.na(hallux))
mean(adults$hallux) #intercept to be used in simulations
## [1] 20.09231
fem <- adults %>%
filter(sex1 == 'female')
male <- adults %>%
filter(sex1 == 'male')
mean(fem$hallux) - mean(male$hallux) #used for beta for the average hallux difference between adults.
## [1] 1.780952
var(adults$hallux) #variance to be used in the data simulations
## [1] 1.820721
Intercept is from Phillips (2022) Master’s thesis data. For this example, I am using estimates of adults (sample size: females = 37; males = 30) for hallux. Reference group is ‘female’ based on alphabetical and order in data. Intercept is roughly the median between the average male’s and female’s hallux measurements (20mm). Beta of -1.78mm states the females’ hallux is on average 1.78mm larger than the males’. It is negative because the female is the reference group.
bnow.cv <- make_structure(structure="sex(2)",
repeat_obs=5,
level_names=list(sex=c("female","male"))) #creates simulated data structure
pop.cv <- simulate_population(
data_structure = bnow.cv,
parameters = list(
intercept= 20,
sex=list(
fixed=TRUE,
beta=c(0,-1.78)
),
residual = list(
vcov = 1.82
)
)
)
#simulates data based on parameters put in to the simulation
data.cv <- get_population_data(pop.cv)
Because this is simulated data, there should not be outliers, and it should be normally distributed. I should be in the habit of checking all these things for the observed dataset.
For this simulation, I will not have covariance because I am only using one predictor (hallux).
data.cv %>% levene_test(y ~ sex) ## p-value is not significant, meaning it is homogeneous
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 1 8 0.712 0.423
data.cv %>% identify_outliers(y) ##no outliers found
## [1] y sex_effect1 sex_effect2 residual sex squid_pop
## [7] is.outlier is.extreme
## <0 rows> (or 0-length row.names)
model.cv <- lda(sex ~ y, CV = TRUE, data = data.cv) #currently using lda method, might change to qda at a later time, but need to research more.
#Assess the accuracy of the prediction
#percent correct for each category of 'sex'
ct <- table(data.cv$sex, model.cv$class)
diag(prop.table(ct, 1))
## female male
## 0.8 0.8
#total percent correct
sum(diag(prop.table(ct)))
## [1] 0.8
n <- 10
p <- 0.9
margin <- qnorm(0.975)*sqrt(p*(1-p)/n)
low <- p - margin
low
## [1] 0.7140615
high <- p + margin
high
## [1] 1.085939
This small dataset with the observed variance is not giving very good reports because it is so small. Let’s increase the number of populations used to get better estimates. I still do not expect good estimates because of how small the sample size is.
rm(list=ls()) #clear the workspace
set.seed(18)
bnow.pop1 <- make_structure(structure="sex(2)",
repeat_obs=5,
level_names=list(sex=c("female","male")))
pop1 <- simulate_population(
data_structure = bnow.pop1,
parameters = list(
intercept= 20,
sex=list(
fixed=TRUE,
beta=c(0,-1.78) #mean difference between male and female hallux
),
residual = list(
vcov = 1.82
)
),
n_pop = 5000
)
data.pop1 <- get_population_data(pop1, list = TRUE)
Retesting assumptions…can’t run these on lists(vvvv), so much create an lapply?
# <make lapply function>
# 2. homogeneity of variance
# data.pop1 %>% levene_test(y ~ sex) #not significant
# 3. outliers
# data.pop1 %>% identify_outliers(y) ##outliers were found
Outliers were found, but not to be extreme. Nonetheless, it may be best practice to remove them. While there is a possibility for a female hallux to reach upwards of 23mm, it is not likely. Similarly, if a male’s hallux was near 14mm, I would consider the possibility it was not an adult (ie fledgling still growing), or some malformation. This is one reason I tried my best to measure both hallux on each nestling. For these reasons, I think it best they are removed.
# <make lapply function>
# quartiles <- quantile(data.pop1$y, probs=c(.25, .75), na.rm = FALSE)
# IQR <- IQR(data.pop1$y)
#
# Lower <- quartiles[1] - 1.5*IQR
# Upper <- quartiles[2] + 1.5*IQR
#
# data_no_outlier <- subset(data.pop1, data.pop1$y > Lower & data.pop1$y < Upper)
#
# ###checking for outliers again
# data_no_outlier %>% identify_outliers(y)
There are still outliers, but a lot less. I will run the data with these included. For now…will review this way or removing outliers when creating the actual simulation data that will be used in the project.
pop1_list <- lapply(data.pop1, function(i){
model.pop1 <- lda(sex ~ y, CV =TRUE, data = i)
ctl <- table(i$sex, model.pop1$class)
# diag(prop.table(ctl, 1))
#total percent correct
df <- data.frame('Percent_Correct' = sum(diag(prop.table(ctl)))
)
})
pop1_list1 <- do.call(rbind, pop1_list)
#total percent correct
mean(pop1_list1$Percent_Correct)
## [1] 0.72218
# pop1_list.no.out <- lapply(data_no_outlier, function(i){
#
# model.pop1.no.out <- lda(sex ~ y, CV =TRUE, data = i)
#
# ctl <- table(i$sex, model.pop1.no.out$class)
# # diag(prop.table(ctl, 1))
# #total percent correct
# df <- data.frame('Percent_Correct' = sum(diag(prop.table(ctl)))
# )
#
#
# })
#
# ct.no.out <- table(data.pop1$sex, model.no.outlier$class)
# diag(prop.table(ct.no.out, 1))
#total percent correct
# sum(diag(prop.table(ct.no.out)))
I can’t determine the probability correct with different size tables. Will look into this later. For now, making sure the code can run correctly without removing outliers.