Load libraries and data

library(here)
library(tidyverse)
library(ggplot2)
library(VGAM)
library(arm)

## Load data
ytang <- read_csv(here("Week_03", "Data", "yellowtang.csv"))
copepod <- read_csv(here("Week_03", "Data", "dam_acartia_survival_subset.csv"))

Copepod Histogram

First plot a histrogram of the data, using the discrete.histogram() function from the package ‘arm’. This function is better than standard histogram functions for our purposes, because the standard binning algorithms makes it hard to see what is happening at the ends of the distribution.

discrete.histogram(data = copepod,
                   x = copepod$nx) # number of surviving copepods

Copepod Mean & Variance

Now calculate the mean and the variance of the number of survivors. This kind of data could potentially be modelled as a binomial distribution (review the lecture notes to understand why). Based on the formula for the binomial distribution (it is in the lecture notes, or on Wikipedia), what should the variance be, if the data were actually binomially distributed, given the mean number of survivors we observe?

print(paste("Mean:", mean(copepod$nx)))  #calculate mean of count of survivors

## [1] "Mean: 17.3846153846154"

print(paste("Variance:", var(copepod$nx))) # calculate variance of count of survivors

## [1] "Variance: 58.1059829059829"

#The binomial is parameterize with two parameters, n = number of trials, and p = probability of ‘success’. The mean of the distribution is n*p, which should be intuitive because it’s just the expected number of successes. The variance of the distribution is a generalization of the variance for the Bernoulli: n*p*(1-p).
n <- 25 # number of copepods per trial
p <- mean(copepod$nx) / n
expected_var <- n*p*(1 - p)
print(paste("Binomially Distributed Variance:", expected_var))

## [1] "Binomially Distributed Variance: 5.29562130177515"

Copepod Random Draws from Binomial Distribution

Use the function rbinom() to simulate random draws from a binomial distribution that has the same probability of survival as the observed copepod data. Draw the same number of values as are present in the observed data. Calculate the mean and the variance of the number of survivors. Plot a histogram of your randomly drawn values. Now repeat this four more times – you will have a total five histograms as well as five means and variances. Describe how your simulated binomial data differs from the observed data. Can you imagine reasons why the two distributions might be different?

set.seed(123)
# probability of survival
survivalprobability = sum(copepod$nx)/sum(copepod$ni) #all 91 survival counts/total copepods

#simulate random draws from a binom distribution that has the same prob of survival as observed copepod data. draw the same number of values as are present in the observed data
rdraw1 = rbinom(n = 91, #91 unique ...1
       size = 25, #25 copepods
       prob = survivalprobability) 
mean(rdraw1)

## [1] 17.36264

var(rdraw1)

## [1] 5.4337

rdraw2 = rbinom(n = 91, #91 unique ...1
       size = 25, #25 copepods
       prob = survivalprobability) 
mean(rdraw2)

## [1] 17.38462

var(rdraw2)

## [1] 3.77265

rdraw3 = rbinom(n = 91, #91 unique ...1
       size = 25, #25 copepods
       prob = survivalprobability) 
mean(rdraw3)

## [1] 17.26374

var(rdraw3)

## [1] 4.907448

rdraw4 = rbinom(n = 91, #91 unique ...1
       size = 25, #25 copepods
       prob = survivalprobability) 
mean(rdraw4)

## [1] 17.49451

var(rdraw4)

## [1] 6.252747

rdraw5 = rbinom(n = 91, #91 unique ...1
       size = 25, #25 copepods
       prob = survivalprobability) 
mean(rdraw5)

## [1] 17.47253

var(rdraw5)

## [1] 5.252015

Histograms side by side

par(mfrow = c(3,2))
discrete.histogram(data = copepod,
                   x = copepod$nx)
discrete.histogram(rdraw1)
discrete.histogram(rdraw2)
discrete.histogram(rdraw3)
discrete.histogram(rdraw4)
discrete.histogram(rdraw5)

The simulated data has a closer to normal distribution versus the observed data, and the range is between 10-24 survivors with no outliers in comparison to the normal distribution with values of 0 and 6 survivors. The simulated data is more closely clustered around the the mean of ~17 survivors. These distributions may be different because we see what the data would look like if it were truly binomial. However, we see that there are other factors outside of random sampling/survival that affect copepod survival from the observed data.

VGAM

Now load the package VGAM, which has functions for the beta-binomial distribution. This distribution is similar to the binomial, in that it models success/failure of a ‘trial’, but it allows the probability of success to vary across trials. This extra variability is controlled by the parameter ‘rho’. Use trial and error, with the function rbetabinom, to find a value of rho that creates a distribution that looks similar to the observed copepod data. A value of 0 for rho produces a binomial distribution, while increasing values of rho between 0 and 1 lead to more variability in the data. Display the results of your trial and error simulations, and report what value of rho you think best corresponds to the observed data.

par(mfrow = c(3,4))
discrete.histogram(data = copepod,
                   x = copepod$nx, 
                   main = "Observed Data")
for (i in seq(0, 1, 0.1)) { 
  random = rbetabinom(n = 91,
                      size = 25,
                      prob = survivalprobability,
                      rho = i)
  rho = i
  discrete.histogram(random, main = paste("Rho:", round(rho, 2))) 
   }  #make a histogram of the number of survivors per plot

The rho value of 0.2 or 0.3 appear to be the closest to the observed data. The extra variability can be controlled with this rho value and we see that the spread of data for these two rho values has a flatter distribution pattern with higher probability around 15 or more copepods.

Yellow Tang Histogram

Plot a histogram of the yellow tang counts.

discrete.histogram(data = ytang,
                   x = ytang$count)

Mean and Variance

Calculate the mean and variance of the counts. Also calculate what the variance would be if the counts were poisson-distributed. How does it differ from the observed data? Why might that be?

mean(ytang$count)

## [1] 5.026948

var(ytang$count)

## [1] 35.96793

#also the variance, i.e. for the Poisson distribution the variance is equal to the mean. Poisson variance equals 5.026948

The variance for poisson-distributed counts would be the mean, which is 5.03. This poisson-distributed variance of 5.03 is much lower than the observed variance of 35.97. The poisson distribution cannot handle this overdispersion.

Random draws

Simulate five sets of random draws from the poisson distribution with the appropriate mean and variance (using the function rpois()). Plot histograms of the results, and report the means and variances of the simulated data. Verbally compare these distributions to the observed distribution.

#lambda of poisson is just mean

par(mfrow = c(3, 2))

for (i in 1:5) { 
 graph <- rpois(n = 1373, 
      lambda = mean(ytang$count))
 smean <- mean(graph)
 svar <- var(graph)
discrete.histogram(graph, main = paste("Mean:", round(smean, 2), 
                                "Var:", round(svar, 2)))
   }

The distribution of the data from the random draws ranges between 0 and 15, while the observed distribution has a range of up to 50. These ranges are also near normal distribution, but the observed distribution is skewed to the left.

Negative binomial distribution

Perform trial and error to find good-fitting parameter values from the negative binomial distribution, which can model counts with more variability than the poisson distribution. Use the function rnegbin() from the package ‘MASS’. The parameter ‘theta’ is the one you will want to manipulate to change the shape of the distribution (keep the value of the mean, mu, the same as the observed data). Note: smaller values of theta lead to larger variances.

par(mfrow = c(3,2))
for (i in  seq(0.5, 1, 0.1)) { #seq start at 0.5, end at 1, increase in intervals of 0.1
 negbin <- rnegbin(n = 1373, #1373 counts
                        mu = mean(ytang$count),
                        theta = i) #theta value 0.1
discrete.histogram(negbin, main = paste("Theta:", round(i, 2)))
   }

0.8 appears to model the data the closest with the skewed distribution to the left and a range between 0 and 53.

Week 3: Copepod Distributions