OCN 683 - Homework 4

Sarah Pennington

The file ‘dam_acartia_survival_subset.csv’ contains some data on the survival of the copepod Acartia tonsa, from a study conducted by Hans Dam and colleagues. In experimental cultures the survival of copepods was tracked from the nauplius 1 stage to the copepodid 6 stage. The data used here is a subset: only the copepods reared at temperature 18ºC, and with survival measured at day 14. The column ‘nx’ is the number of surviving copepods, out of 25, on day 14. We are going to investigate the distribution of this subset of the data.

First plot a histrogram of the data, using the discrete.histogram() function from the package ‘arm’. This function is better than standard histogram functions for our purposes, because the standard binning algorithms makes it hard to see what is happening at the ends of the distribution.

getwd()
## [1] "C:/Users/sarah/Desktop/Masters program/Classes_spring_25/homework"
copsub <-  read_csv("dam_acartia_survival_subset.csv", 
    col_types = cols(...1 = col_skip()))
## New names:
## • `` -> `...1`
names(copsub) 
##  [1] "Generation" "Treatment"  "Temp"       "pH"         "Rep"       
##  [6] "Beak"       "time"       "nx"         "lx"         "ni"
str(copsub)
## spc_tbl_ [91 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Generation: num [1:91] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Treatment : num [1:91] 1 1 1 2 2 2 2 2 2 2 ...
##  $ Temp      : num [1:91] 18 18 18 18 18 18 18 18 18 18 ...
##  $ pH        : num [1:91] 8.2 8.2 8.2 7.5 7.5 7.5 7.5 7.5 7.5 7.5 ...
##  $ Rep       : num [1:91] 1 1 3 1 2 2 2 3 3 3 ...
##  $ Beak      : num [1:91] 2 3 2 1 1 2 3 1 2 3 ...
##  $ time      : num [1:91] 14 14 14 14 14 14 14 14 14 14 ...
##  $ nx        : num [1:91] 10 6 13 20 15 17 16 19 17 16 ...
##  $ lx        : num [1:91] 0.4 0.24 0.52 0.8 0.6 0.68 0.64 0.76 0.68 0.64 ...
##  $ ni        : num [1:91] 25 25 25 25 25 25 25 25 25 25 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_skip(),
##   ..   Generation = col_double(),
##   ..   Treatment = col_double(),
##   ..   Temp = col_double(),
##   ..   pH = col_double(),
##   ..   Rep = col_double(),
##   ..   Beak = col_double(),
##   ..   time = col_double(),
##   ..   nx = col_double(),
##   ..   lx = col_double(),
##   ..   ni = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
nx <- copsub$nx
discrete.histogram(copsub$nx)

Now calculate the mean and the variance of the number of survivors. This kind of data could potentially be modelled as a binomial distribution. Based on the formula for the binomial distribution (it is in the lecture notes, or on Wikipedia), what should the variance be, if the data were actually binomially distributed, given the mean number of survivors we observe?

#Mean and variance: 
mean <- mean(nx)
variance <- var(nx) # (1/n-1)E[(X – Mean)2]
print(mean)
## [1] 17.38462
  ### [1] 17.38462
print(variance)
## [1] 58.10598
  ### 58.10598


#Based on the formula for the binomial distribution (it is in the lecture notes, or on Wikipedia), what should the variance be, if the data were actually binomially distributed, given the mean number of survivors we observe? 

  ## From lecture 5:E[(X – Mean)2]
binomial_var<- 0
for (x in nx) {
  binomial_var <- binomial_var + (x - mean)^2
}
print(binomial_var)
## [1] 5229.538
### 5229.538  

Use the function rbinom() to simulate random draws from a binomial distribution that has the same probability of survival as the observed copepod data. Draw the same number of values as are present in the observed data. Calculate the mean and the variance of the number of survivors. Plot a histogram of your randomly drawn values. Now repeat this four more times – you will have a total five histograms as well as five means and variances. Describe how your simulated binomial data differs from the observed data. Can you imagine reasons why the two distributions might be different?

#Use the function rbinom() to simulate random draws from a binomial distribution that has the same probability of survival as the observed copepod data.
    ###X/n
prob <- sum(copsub$nx)/sum(copsub$ni)

#Simulate random draws from a binomial distribution with the same probability of survival
  ##rbinom(n, size, prob)
draw1 <- rbinom(91, size = 25, prob = prob)

#Calculate the mean and the variance of the number of survivors. 
m_draw1 <- mean(draw1)
print(m_draw1) 
## [1] 17.98901
var_draw1 <- var(draw1)
print(var_draw1) 
## [1] 4.566545
#Plot a histogram of your randomly drawn values.
discrete.histogram(draw1)

#Now repeat this four more times – you will have a total five histograms as well as five means and variances.

draw2 <- rbinom(91, size = 25, prob = prob)
m_draw2 <- mean(draw2) 
print(m_draw2) 
## [1] 17.0989
var_draw2 <- var(draw2) 
print(var_draw2) 
## [1] 4.378999
draw3 <- rbinom(91, size = 25, prob = prob)
m_draw3 <- mean(draw3)
print(m_draw3) 
## [1] 16.98901
var_draw3 <- var(draw3)
print(var_draw3) 
## [1] 5.744322
draw4 <- rbinom(91, size = 25, prob = prob)
m_draw4 <- mean(draw4)
print(m_draw4) 
## [1] 17.42857
var_draw4 <- var(draw4)
print(var_draw4) 
## [1] 3.95873
draw5 <- rbinom(91, size = 25, prob = prob)
m_draw5 <- mean(draw5)
print(m_draw5) 
## [1] 17.57143
var_draw5 <- var(draw5)
print(var_draw5) 
## [1] 5.025397
par(mfrow = c(2, 3))
discrete.histogram(copsub$nx)
discrete.histogram(draw1)
discrete.histogram(draw2)
discrete.histogram(draw3)
discrete.histogram(draw4)
discrete.histogram(draw5)

# Describe how your simulated binomial data differs from the observed data. Can you imagine reasons why the two distributions might be different?
  ### The simulated data has a lot lower variance than the actual data. Randomly creating the distributions based on the probability and number of samples doesn't allow us to alter the variability of the data.   

Now load the package VGAM, which has functions for the beta-binomial distribution. This distribution is similar to the binomial, in that it models success/failure of a ‘trial’, but it allows the probability of success to vary across trials. This extra variability is controlled by the parameter ‘rho’. Use trial and error, with the function rbetabinom, to find a value of rho that creates a distribution that looks similar to the observed copepod data. A value of 0 for rho produces a binomial distribution, while increasing values of rho between 0 and 1 lead to more variability in the data. Display the results of your trial and error simulations, and report what value of rho you think best corresponds to the observed data.

library(VGAM)
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked from 'package:arm':
## 
##     logit
## The following object is masked from 'package:car':
## 
##     logit
rho_values <- seq(0, 1, by = 0.1)

for (rho_value in rho_values) {
  rho_value_name <- paste("rho", round(rho_value, 2), sep = "_")
  data <- rbetabinom(n = 91, size = 25, rho = rho_value, prob = prob)
  assign(rho_value_name, data)
}

#Display the results of your trial and error simulations, and report what value of rho you think best corresponds to the observed data.
par(mfrow = c(3, 4))
discrete.histogram(copsub$nx, main = "data")
discrete.histogram(rho_0, main= "0")
discrete.histogram(rho_0.1, main= "0.1")
discrete.histogram(rho_0.2, main= "0.2")
discrete.histogram(rho_0.3, main= "0.3")
discrete.histogram(rho_0.4, main= "0.4")
discrete.histogram(rho_0.5, main= "0.5")
discrete.histogram(rho_0.6, main= "0.6")
discrete.histogram(rho_0.7, main= "0.7")
discrete.histogram(rho_0.8, main= "0.8")
discrete.histogram(rho_0.9, main= "0.9")
discrete.histogram(rho_1, main= "1")

###Based on the figures generated, I think rho= 0.2 or rho=0.3 replicated the observed data the best. 

The other common kind of discrete data that we will discuss in this class is count data. The file ‘yellowtang.csv’ contains counts of the reef fish yellow tang (lauʻipala) from sites around Hawai’i Island collected by NOAA. The column with the counts is called ‘count’. We’re going to repeat everything we did the copepod data, but instead of using the binomial and beta-binomial distributions, we will use the poisson and negative binomial distributions for the count data.

#Plot a histogram of the yellow tang counts.
tang <- read_csv("yellowtang.csv", 
    col_types = cols(...1 = col_skip(), X = col_skip()))
## New names:
## • `` -> `...1`
names(tang)
##  [1] "commonname"           "time"                 "latitude"            
##  [4] "longitude"            "region_name"          "island"              
##  [7] "site"                 "reef_zone"            "depth_bin"           
## [10] "sitevisitid"          "date_"                "obs_year"            
## [13] "diver"                "rep"                  "method"              
## [16] "photographer"         "training_yn"          "depth"               
## [19] "survey_radius_m"      "hard_coral"           "soft_coral"          
## [22] "ma"                   "cca"                  "ta"                  
## [25] "sand"                 "tunicate"             "zoanthid"            
## [28] "corallimorph"         "clam"                 "cyano"               
## [31] "sponge"               "other"                "other_type"          
## [34] "habitat_code"         "habitat_type"         "current_strength"    
## [37] "visibility"           "substrate_height_0"   "substrate_height_20" 
## [40] "substrate_height_50"  "substrate_height_100" "substrate_height_150"
## [43] "max_height"           "urchin_dacor"         "boring_urchin_dacor" 
## [46] "roundid"              "missionid"            "min_depth"           
## [49] "max_depth"            "objectid"             "replicateid"         
## [52] "species"              "taxonname"            "commonfamilyall"     
## [55] "family"               "scientific_name"      "trophic"             
## [58] "count"
discrete.histogram(tang$count)

#Calculate the mean and variance of the counts. Also calculate what the variance would be if the counts were poisson-distributed. How does it differ from the observed data? Why might that be? 
count <-tang$count
mtang <- mean(count)
print(mtang) #5.026948
## [1] 5.026948
vartang <- var(count)
print(vartang) #35.96793
## [1] 35.96793
#Poisson distributed:  Mean = variance Poisson variance =5.026948

# How does it differ from the observed data? Why might that be?
    ## The Poisson variance is a lot lower than the actual variance. This could be because environmental variability at sites data were collected.  

Simulate five sets of random draws from the poisson distribution with the appropriate mean and variance (using the function rpois()). Plot histograms of the results, and report the means and variances of the simulated data. Verbally compare these distributions to the observed distribution.

count <-nrow(tang)
print(count)
## [1] 1373
?rpois()
## starting httpd help server ... done
## rpois(n, lambda)  λ = mean # of counts
draw1<- rpois(count, mtang)
m_draw1 <- mean(draw1)
print(m_draw1) 
## [1] 5.006555
var_draw1 <- var(draw1)
print(var_draw1) 
## [1] 5.174155
draw2<- rpois(count, mtang)
m_draw2 <- mean(draw2) 
print(m_draw2) 
## [1] 4.900218
var_draw2 <- var(draw2) 
print(var_draw2) 
## [1] 4.914964
draw3<- rpois(count, mtang)
m_draw3 <- mean(draw3)
print(m_draw3) 
## [1] 5.088128
var_draw3 <- var(draw3)
print(var_draw3) 
## [1] 5.226193
draw4<- rpois(count, mtang)
m_draw4 <- mean(draw4)
print(m_draw4) 
## [1] 5.115076
var_draw4 <- var(draw4)
print(var_draw4) 
## [1] 5.738934
draw5<- rpois(count, mtang)
m_draw5 <- mean(draw5)
print(m_draw5) 
## [1] 5.05244
var_draw5 <- var(draw5)
print(var_draw5) 
## [1] 5.350018
par(mfrow = c(2, 3))
discrete.histogram(tang$count)
discrete.histogram(draw1)
discrete.histogram(draw2)
discrete.histogram(draw3)
discrete.histogram(draw4)
discrete.histogram(draw5)

###The random distributions generated don't have the same range as the actual data, which goes from zero to over 50. The random distributions are all approximately normally distributed, whereas the actual data is skewed left. 

Perform trial and error to find good-fitting parameter values from the negative binomial distribution, which can model counts with more variability than the poisson distribution. Use the function rnegbin() from the package ‘MASS’. The parameter ‘theta’ is the one you will want to manipulate to change the shape of the distribution (keep the value of the mean, mu, the same as the observed data). Hint: smaller values of theta lead to larger variances.

library(MASS)
theta_values <- seq(.5, 4, by = 0.5)

#rnegbin(n, mu = n, theta = stop("'theta' must be specified"))
count <-nrow(tang)

for (theta_value in theta_values) {
  theta_value_name <- paste("theta", round(theta_value, 2), sep = "_")
  data <- rnegbin(count, mu = mtang, theta = theta_value)
  assign(theta_value_name, data)
}

par(mfrow = c(3, 4))
discrete.histogram(tang$count, main = "data")
discrete.histogram(theta_0.5, main= "0.5")
discrete.histogram(theta_1, main= "1")
discrete.histogram(theta_1.5, main= "1.5")
discrete.histogram(theta_2, main= "2")
discrete.histogram(theta_2.5, main= "2.5")
discrete.histogram(theta_3, main= "3")
discrete.histogram(theta_3.5, main= "3.5")
discrete.histogram(theta_4, main= "4")
discrete.histogram(theta_2, main= "6")
discrete.histogram(theta_2, main= "8")
discrete.histogram(theta_2, main= "10")

###It seems the random draw best represents the data when theta is somewhere between 0.5 and 1. 
theta_values <- seq(.5, 1, by = 0.1)
for (theta_value in theta_values) {
  theta_value_name <- paste("theta", round(theta_value, 2), sep = "_")
  data <- rnegbin(count, mu = mtang, theta = theta_value)
  assign(theta_value_name, data)
}

par(mfrow = c(3, 3))
discrete.histogram(tang$count, main = "data")
discrete.histogram(theta_0.5, main= "0.5")
discrete.histogram(theta_0.6, main= "0.6")
discrete.histogram(theta_0.7, main= "0.7")
discrete.histogram(theta_0.8, main= "0.8")
discrete.histogram(theta_0.9, main= "0.9")
discrete.histogram(theta_1, main= "1")

###I would say theta = 0.8 did the best job of modeling the actual data.