DFA with variance of 0.5

library(squidSim)
library(lme4)
library(ggplot2)
library(MASS)
library(rstatix)
library(readxl)

Introduction

This page is to show my process in simulating data made by the package ‘squidSim’ to test different sample sizes and variances for a discriminant analysis prior to running my observed data. This analysis and simulating data is novel to me and I am choosing to write down the steps to help myself at a later time, and to have simulated data to compare my observed data analysis to.

Background

I have a small data set of adult American barn owls (Tyto furcata) I am using to run a discriminant function analysis (DFA) prior to using this analysis on nestling American barn owls (hereafter, barn owls). My goal is to identify the emergence of the reverse sexual dimorphism that appears in this species and to determine if a DFA can help sex these individuals at an earlier age (ie prior to full body plumage coloration).

DFAs have been used for sexing individual bird species for decades. Recently, there was a paper explaining how the different methods of using this analysis may have led to poorly understood and even misused results. I aim to understand the uncertainty that may accompany my observed data by using ‘squidSim’, an r package for simulating data populations prior to running my own data.

Overview

This rmarkdown is for running different simulations with variance of 0.5. Sample sizes will go from 10 to 100 (ie. 5 individuals per sex to 50 individuals per sex) in intervals of 10.

I will input in a table, and have figures for showing changes over the differing sample sizes. Each of these simulations will be run with 5000 populations to ensure accurate results.

Getting intercept, averages and variance from my observed dataset

adultdf <- read_excel("adults_thesis.xlsx", sheet = 'comp')
adultdf <- as.data.frame(adultdf)
adults <- adultdf %>%
  filter(age == 'adult') %>%
  filter(!is.na(hallux))
mean(adults$hallux) #intercept to be used in simulations

## [1] 20.09231

fem <- adults %>%
  filter(sex1 == 'female')
male <- adults %>%
  filter(sex1 == 'male')
mean(fem$hallux) - mean(male$hallux) #used for beta for the average hallux difference between adults.

## [1] 1.780952

var(adults$hallux) #variance to compare simulated variance to.

## [1] 1.820721

Increased population numbers

Increasing number of populations to 5,000 for better statistical estimates

Intercept and mean difference remains the same (taken from my observed data; above), but variance will remain at 0.5 throughout this code.

10 individuals

1. Simulate data

rm(list=ls()) #clear the workspace
set.seed(18)
bnow.pop10 <- make_structure(structure="sex(2)", 
                         repeat_obs=5,  
                         level_names=list(sex=c("female","male")))


pop10 <- simulate_population(
  data_structure = bnow.pop10,
  parameters = list(
    intercept= 20,
    sex=list(
      fixed=TRUE,
      beta=c(0,-1.78) #mean difference between male and female hallux
    ),
    residual = list(
      vcov = 0.5 
    )
  ),
  n_pop = 5000
)

data.pop10 <- get_population_data(pop10, list = TRUE)

2. Test for assumption

Test for normality (homogeneity of variance)

Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.

x10id <- lapply(data.pop10, function(i){
  
  norm.dat <- shapiro.test(i$y)
  
 
  df <- data.frame('pvalue' = norm.dat$p.value)
  
})

not.norm <- do.call(rbind, x10id)
not.norm <- not.norm %>%
  filter(pvalue < 0.05)
pval <- nrow(not.norm)
cat(pval, 'are found to not be normally distributed')

## 177 are found to not be normally distributed

Look for outliers

out10id <- lapply(data.pop10, function(i){
  
  df <- i %>% identify_outliers(y)
})

outlier.df <- do.call(rbind, out10id)

cat(nrow(outlier.df), 'outliers were found')

## 556 outliers were found

4. Loop for running analysis

pop10_list <- lapply(data.pop10, function(i){
  
  model.pop10 <- lda(sex ~ y, CV =TRUE, data = i)
  
  ctl <- table(i$sex, model.pop10$class)
  # diag(prop.table(ctl, 1))
  #total percent correct
  df <- data.frame('Percent_Correct' = sum(diag(prop.table(ctl)))
  )
  
  
})

pop10_list1 <- do.call(rbind, pop10_list)
#total percent correct
p <- mean(pop10_list1$Percent_Correct)

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

n <- 5000
margin <- qnorm(0.975)*sqrt(p*(1-p)/n)

low <- p - margin
high <- p + margin

cat('The posterior probability is', p,'.', 'The confidence intervals (95%) are', '[',low,',', high,']')

## The posterior probability is 0.8812 . The confidence intervals (95%) are [ 0.8722317 , 0.8901683 ]

cat('The range of posterior probabilities is',min(pop10_list1), '-',max(pop10_list1))

## The range of posterior probabilities is 0.4 - 1

20 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.

## 228 are found to not be normally distributed

Look for outliers

## 335 outliers were found

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

## The posterior probability is 0.89059 . The confidence intervals (95%) are [ 0.8819377 , 0.8992423 ]

## The range of posterior probabilities is 0.6 - 1

30 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.

## 384 are found to not be normally distributed

Look for outliers

## 253 outliers were found

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

## The posterior probability is 0.89352 . The confidence intervals (95%) are [ 0.8849703 , 0.9020697 ]

## The range of posterior probabilities is 0.6666667 - 1

40 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.

## 527 are found to not be normally distributed

Look for outliers

## 214 outliers were found

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

## The posterior probability is 0.89363 . The confidence intervals (95%) are [ 0.8850842 , 0.9021758 ]

## The range of posterior probabilities is 0.625 - 1

50 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.

## 723 are found to not be normally distributed

Look for outliers

## 173 outliers were found

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

## The posterior probability is 0.895036 . The confidence intervals (95%) are [ 0.8865402 , 0.9035318 ]

## The range of posterior probabilities is 0.72 - 1

60 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.

## 935 are found to not be normally distributed

Look for outliers

## 169 outliers were found

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

## The posterior probability is 0.8946967 . The confidence intervals (95%) are [ 0.8861888 , 0.9032046 ]

## The range of posterior probabilities is 0.7166667 - 1

70 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Test in a function, then do a ‘if p -value is less than 0.05’ situation and find the number of times it is violated.

## 1108 are found to not be normally distributed

Look for outliers

## 130 outliers were found

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

## The posterior probability is 0.8950629 . The confidence intervals (95%) are [ 0.886568 , 0.9035577 ]

## The range of posterior probabilities is 0.7571429 - 1

DFA with variance of 0.5

EM Phillips

2023-07-19

Introduction

Background

Overview

Getting intercept, averages and variance from my observed dataset

Increased population numbers

Increasing number of populations to 5,000 for better statistical estimates

10 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Look for outliers

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

20 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Look for outliers

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

30 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Look for outliers

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

40 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Look for outliers

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

50 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Look for outliers

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

60 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Look for outliers

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?

70 individuals

1. Simulate data

2. Test for assumption

Test for normality (homogeneity of variance)

Look for outliers

4. Loop for running analysis

5. Posterior probabilities, confidence intervals of pp, min-max range, all in table?