library(squidSim)
library(lme4)
library(ggplot2)
library(MASS)
library(rstatix)
library(readxl)
This page is to show my process in simulating data made by the package ‘squidSim’ to test different sample sizes and variances for a discriminant analysis prior to running my observed data. This analysis and simulating data is novel to me and I am choosing to write down the steps to help myself at a later time, and to have simulated data to compare my observed data analysis to.
I have a small data set of adult American barn owls (Tyto furcata) I am using to run a discriminant function analysis (DFA) prior to using this analysis on nestling American barn owls (hereafter, barn owls). My goal is to identify the emergence of the reverse sexual dimorphism that appears in this species and to determine if a DFA can help sex these individuals at an earlier age (ie prior to full body plumage coloration). DFAs have been used for sexing individual bird species for decades. Recently, there was a paper explaining how the different methods of using this analysis may have led to poorly understood and even misused results. I aim to understand the uncertainty that may accompany my observed data by using ‘squidSim’, an r package for simulating data populations prior to running my own data.
This rmarkdown is for running different simulations with variance of 0.5. Sample sizes will go from 10 to 100 (ie. 5 individuals per sex to 50 individuals per sex) in intervals of 10.
I will input in a table, and have figures for showing changes over the differing sample sizes. Each of these simulations will be run with 5000 populations to ensure accurate results.
adultdf <- read_excel("adults_thesis.xlsx", sheet = 'comp')
adultdf <- as.data.frame(adultdf)
adults <- adultdf %>%
filter(age == 'adult') %>%
filter(!is.na(hallux))
mean(adults$hallux) #intercept to be used in simulations
## [1] 20.09231
fem <- adults %>%
filter(sex1 == 'female')
male <- adults %>%
filter(sex1 == 'male')
mean(fem$hallux) - mean(male$hallux) #used for beta for the average hallux difference between adults.
## [1] 1.780952
var(adults$hallux) #variance to compare simulated variance to.
## [1] 1.820721
Intercept and mean difference remains the same (taken from my observed data; above), but variance will remain at 0.5 throughout this code.
rm(list=ls()) #clear the workspace
set.seed(18)
bnow.pop10 <- make_structure(structure="sex(2)",
repeat_obs=5,
level_names=list(sex=c("female","male")))
pop10 <- simulate_population(
data_structure = bnow.pop10,
parameters = list(
intercept= 20,
sex=list(
fixed=TRUE,
beta=c(0,-1.78) #mean difference between male and female hallux
),
residual = list(
vcov = 0.5
)
),
n_pop = 5000
)
data.pop10 <- get_population_data(pop10, list = TRUE)
Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.
x10id <- lapply(data.pop10, function(i){
norm.dat <- shapiro.test(i$y)
df <- data.frame('pvalue' = norm.dat$p.value)
})
not.norm <- do.call(rbind, x10id)
not.norm <- not.norm %>%
filter(pvalue < 0.05)
pval <- nrow(not.norm)
cat(pval, 'are found to not be normally distributed')
## 177 are found to not be normally distributed
out10id <- lapply(data.pop10, function(i){
df <- i %>% identify_outliers(y)
})
outlier.df <- do.call(rbind, out10id)
cat(nrow(outlier.df), 'outliers were found')
## 556 outliers were found
pop10_list <- lapply(data.pop10, function(i){
model.pop10 <- lda(sex ~ y, CV =TRUE, data = i)
ctl <- table(i$sex, model.pop10$class)
# diag(prop.table(ctl, 1))
#total percent correct
df <- data.frame('Percent_Correct' = sum(diag(prop.table(ctl)))
)
})
pop10_list1 <- do.call(rbind, pop10_list)
#total percent correct
p <- mean(pop10_list1$Percent_Correct)
n <- 5000
margin <- qnorm(0.975)*sqrt(p*(1-p)/n)
low <- p - margin
high <- p + margin
cat('The posterior probability is', p,'.', 'The confidence intervals (95%) are', '[',low,',', high,']')
## The posterior probability is 0.8812 . The confidence intervals (95%) are [ 0.8722317 , 0.8901683 ]
cat('The range of posterior probabilities is',min(pop10_list1), '-',max(pop10_list1))
## The range of posterior probabilities is 0.4 - 1
Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.
## 228 are found to not be normally distributed
## 335 outliers were found
## The posterior probability is 0.89059 . The confidence intervals (95%) are [ 0.8819377 , 0.8992423 ]
## The range of posterior probabilities is 0.6 - 1
Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.
## 384 are found to not be normally distributed
## 253 outliers were found
## The posterior probability is 0.89352 . The confidence intervals (95%) are [ 0.8849703 , 0.9020697 ]
## The range of posterior probabilities is 0.6666667 - 1
Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.
## 527 are found to not be normally distributed
## 214 outliers were found
## The posterior probability is 0.89363 . The confidence intervals (95%) are [ 0.8850842 , 0.9021758 ]
## The range of posterior probabilities is 0.625 - 1
Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.
## 723 are found to not be normally distributed
## 173 outliers were found
## The posterior probability is 0.895036 . The confidence intervals (95%) are [ 0.8865402 , 0.9035318 ]
## The range of posterior probabilities is 0.72 - 1
Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.
## 935 are found to not be normally distributed
## 169 outliers were found
## The posterior probability is 0.8946967 . The confidence intervals (95%) are [ 0.8861888 , 0.9032046 ]
## The range of posterior probabilities is 0.7166667 - 1
Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.
## 1108 are found to not be normally distributed
## 130 outliers were found
## The posterior probability is 0.8950629 . The confidence intervals (95%) are [ 0.886568 , 0.9035577 ]
## The range of posterior probabilities is 0.7571429 - 1