This is just aimed at describing the mesopelagic fish data available
through the CalCOFI collection. In particular, we are looking at fish in
the families Myctophidae, Gonostomatidae, and Bathylagidae. Note that
all CalCOFI datasets mentioned (ASIDE from the juvenile info) are shown
on this site (https://coastwatch.pfeg.noaa.gov/erddap/search/index.html?page=1&itemsPerPage=1000&searchFor=calcofi)
and were downloaded via the R packge rerddap.
If we look at CalCOFI larval count data, we see that a full 10% of entries are from these families, including 96 different species. The breakdown number-wise is Myctophidae are the most abundant (1,033,245), followed by Gonostomatidae (174,421), and then Bathylagidae (134,170). However, these numbers are for larval fish that are quantitatively sampled, and instead we are focusing on juvenile and adult fish accidentally caught and NOT quantitatively sampled. This data is not yet publicly available, but folks at CalCOFI provided it (Ed, Noelle, Andrew, and Bill). The data currently only go back as far as 1978, but CalCOFI personnel (Nico) are actively working through earlier cruises. We can provide cruises we would like sorted first to expedite obtaining earlier data, but the end goal is to have all cruises sorted and cataloged. To obtain the physical samples we will go through Sherri.
The data provided by CalCOFI has lengths but not weights. We kind of need an approximate length-weight relationship for fish if we are going to create pools. To create a very rough approximation, we can use other CalCOFI data (namely the CPS Trawl Life History Specimen Data) to create a relationship, and compare predicted values (the black line below) with observed values from our 12 test myctophid specimens (the red points). Note that samples are sparse (we have 169 samples) and tend to be larger than our archived specimens. No data for the family Gonostomatidae is available.
# Download the CalCOFI dataset and filter by species in our target families
testdat <- rerddap::tabledap("FRDCPSTrawlLHSpecimen") %>%
dplyr::left_join(., readRDS(here::here("data","itis_tsn_family.rds")), by="itis_tsn") %>%
as.data.frame() %>%
dplyr::filter(family %in% c("Myctophidae","Gonostomatidae","Bathylagidae"))
# These are our 12 test sample length and weights. All were myctophids.
our.test.samples = data.frame(weight = c(3.1298, 0.8554, 0.7428, 1.3002, 0.7338, 0.9178, 0.4317, 0.2748, 0.1417, 0.8783, 0.7426, 0.3195),
length = c(65.5, 42.4, 41.7, 53.6, 42.3, 46.2, 37.4, 33.3, 28.8, 44.7, 42.2, 33.8),
family= rep("Myctophidae", 12))
# Plot length-weight relationships
testdat %>%
dplyr::filter(as.numeric(weight) < 30) %>% #There's one fish that's unreasonably large - I think a misplaced decimal point
dplyr::mutate(weight = as.numeric(weight),
standard_length = as.numeric(standard_length)) %>%
ggplot(aes(y=log(weight), x = log(standard_length), fill=family)) +
geom_point(pch=21) +
stat_poly_line(col="black") +
stat_poly_eq(aes(label = after_stat(eq.label)), label.y=0.9) +
stat_poly_eq(label.y = 0.85) +
geom_point(data = our.test.samples, aes(x=log(length), y=log(weight), fill=family, fill=""), col="red", pch=17) +
theme_bw() +
facet_wrap(~family) +
xlab("log(length) - mm") +
ylab("log(weight) - g") +
theme(legend.position = "none")
There’s an approximately linear log-log relationship between length and weight. Our samples tend to fall below the line, indicating weight is underestimated for a given length. This might be an issue with weight loss due to formaldehyde preservation, species specific differences we don’t account for, or natural variability at small sizes. In any case, it’s a decent starting point for developing pools and figuring out approximately how many individuals we will need to combine to achieve ~10 g of biomass. We will need to weigh everything as we pull it to ensure biomass is sufficient.
We can now look at the data to see a few things. The first is that this data is distributed across the entire west coast - I’m going to subset data to only be within the 75-station CalCOFI sampling area, as this is when data is more abundant and consistent!
I updated the R Shiny app to show the number and total weight of myctophids (only the three most abundant species, Stenobrachius leucopsarus, Triphoturus mexicanus, and Nannobrachium ritteri) per year/site: https://lmcgill.shinyapps.io/DDT_sample_locations/. Note that although the app shows individuals across the 113-station area, we are using only the 75-station area. Looks like samples are only from Bongo tows (which is at least consistent I suppose), and are pretty evenly spread across the year. There are four Manta cruises in the database, but they don’t capture our species of interest. Note that there is slightly more spatial variability in sample location than is shown due to interannual variation in where catches at a specific sampling station occur.
The table below shows summary data across all individuals within the sampling area. Some notes. First, the juveniles caught are overwhelmingly Myctophidae, but there are a number of Bathylagidae. The number of Gonostomatidae is quite low. Second, a few species tend to dominate the catches. The top 3 species from Myctophidae and Bathylagidae are Stenobrachius leucopsarus, Triphoturus mexicanus, and Nannobrachium ritteri for Myctophidae, and Bathylagoides wesethi, Leuroglossus stilbius, and Lipolagus ochotensis for Bathylagidae. It seems like all six species are vertical migrators (https://academic.oup.com/plankt/article/34/9/828/1494081). Note that these number are from after 1978. We can reasonably expect a large number BEFORE 1978 as well - which is important to keep in mind when determining how many specimens to pull.
If you look on the Shiny app, there’s definite spatial separation between the species. SL are more likely to be caught north of Long Beach, TM are more likely to be caught south of long beach, and NR are more likely to be caught offshore.
knitr::kable(
sizes %>%
dplyr::group_by(family, SpeciesName) %>%
dplyr::summarize(count = sum(Count),
mean.weight.g = round(mean(weight), 2),
mean.length.mm = round(mean(Size), 2)) %>%
dplyr::ungroup() %>% arrange(family, desc(count)) %>% as.data.frame() )
| family | SpeciesName | count | mean.weight.g | mean.length.mm |
|---|---|---|---|---|
| Bathylagidae | Bathylagoides wesethi | 121 | 0.46 | 34.81 |
| Bathylagidae | Leuroglossus stilbius | 86 | 1.10 | 46.05 |
| Bathylagidae | Lipolagus ochotensis | 57 | 0.38 | 33.47 |
| Bathylagidae | Bathylagus | 4 | 0.45 | 38.02 |
| Gonostomatidae | Cyclothone signata | 12 | NA | 20.57 |
| Myctophidae | Triphoturus mexicanus | 1126 | 0.45 | 30.34 |
| Myctophidae | Stenobrachius leucopsarus | 945 | 0.80 | 36.86 |
| Myctophidae | Nannobrachium ritteri | 539 | 0.80 | 37.49 |
| Myctophidae | Diaphus theta | 268 | 0.17 | 21.40 |
| Myctophidae | Symbolophorus californiensis | 196 | 1.04 | 41.63 |
| Myctophidae | Diogenichthys atlanticus | 128 | 0.10 | 19.54 |
| Myctophidae | Ceratoscopelus townsendi | 108 | 0.46 | 31.16 |
| Myctophidae | Protomyctophum crockeri | 73 | 0.13 | 21.40 |
| Myctophidae | Tarletonbeania crenularis | 54 | 0.50 | 32.49 |
| Myctophidae | Notolychnus valdiviae | 33 | 0.33 | 26.35 |
| Myctophidae | Nannobrachium regale | 31 | 0.92 | 39.90 |
| Myctophidae | Myctophum nitidulum | 30 | 1.11 | 37.96 |
| Myctophidae | Lampanyctus | 20 | 0.98 | 41.27 |
| Myctophidae | Diaphus | 18 | 0.11 | 19.15 |
| Myctophidae | Myctophidae | 10 | 0.39 | 27.89 |
| Myctophidae | Diaphus anderseni | 6 | 0.16 | 22.87 |
| Myctophidae | Nannobrachium hawaiiensis | 6 | 0.67 | 37.02 |
| Myctophidae | Diogenichthys laternatus | 5 | 0.32 | 22.48 |
| Myctophidae | Hygophum reinhardtii | 5 | 0.53 | 33.60 |
| Myctophidae | Gonichthys tenuiculus | 4 | 0.84 | 41.00 |
| Myctophidae | Lampanyctus nobilis | 4 | 0.04 | 14.73 |
| Myctophidae | Nannobrachium | 4 | 0.32 | 24.18 |
| Myctophidae | Centrobranchus nigroocellatus | 3 | 0.31 | 28.17 |
| Myctophidae | Hygophum atratum | 3 | 1.02 | 41.07 |
| Myctophidae | Bolinichthys longipes | 2 | 0.19 | 24.55 |
| Myctophidae | Electrona risso | 2 | 0.02 | 11.85 |
| Myctophidae | Diaphus pacificus | 1 | 0.36 | 31.50 |
| Myctophidae | Lampadena urophaos | 1 | 0.11 | 20.90 |
| Myctophidae | Lampanyctus tenuiformis | 1 | 0.18 | 24.70 |
| Myctophidae | Myctophiformes | 1 | 0.11 | 21.00 |
| Myctophidae | Notoscopelus resplendens | 1 | 0.14 | 22.80 |
| Myctophidae | Parvilux boschmai | 1 | 0.12 | 21.30 |
| Myctophidae | Parvilux ingens | 1 | 2.83 | 65.00 |
| Myctophidae | Protomyctophum | 1 | 0.07 | 18.00 |
The below graphs just show data from Stenobrachius leucopsarus, Triphoturus mexicanus, and Nannobrachium ritteri. You can see that length and weight distributions aren’t that different across species.
sizes %>%
dplyr::filter(SpeciesName %in% c(c("Stenobrachius leucopsarus", "Triphoturus mexicanus","Nannobrachium ritteri"))) %>%
expandRows(., "Count", drop=FALSE) %>%
dplyr::select(SpeciesName, weight, Size) %>%
dplyr::rename("Weight (g)" = "weight", "Length (mm)" = "Size") %>%
gather("Parameter", "Value",-SpeciesName) %>%
ggplot(aes(x=Value, fill=SpeciesName)) +
geom_histogram(color="black") +
theme_half_open() +
facet_grid(SpeciesName~Parameter, scales="free") +
theme(legend.position = "bottom", legend.title=element_blank())
I propose the following cuts on the data:
You can check out the Shiny app to see how the number of individuals and biomass change with you change species composition and raising/ lowering minimum size requirements.
Given biomass estimates from the juvenile database, I’m thinking we will have enough tissue for between 100-300 DDT estimates (depending on how many species we include, how the actual weights shake out, and how many juveniles are in earlier years). This really isn’t a lot given the temporal (60+ years) and spatial (hundreds of km) extents we are working with. Therefore, I think we need to think of this as trying to pick up broad scale, mean changes in DDT amount and source in a few abundant species of myctophids, acknowledging that there is likely fine scale spatial/ temporal/ ontogenetic variation we won’t be able to resolve. I know we’ve talked about it before, but being explicit in what contrasts we want to see (i.e., changes through time time, near and far from the dump site, different sizes classes/ species, etc.) will be important. NOTE AGAIN THAT WE DO NOT HAVE ENOUGH BIOMASS FOR VERY FINE SCALE POOLING. What I mean by this, is that we’re not going to have 10 grams of biomass from the same species, same year, same site, and same size class. We will need to make explicit decisions about what variation we want to aggregate over.
Pooling strategies range from random (points selected totally at random) to homogeneous (groups selected to be as similar as possible). Generally, the groups should be as small as possible, but large enough to ensure there’s sufficient tissue for analysis. Generally, it seems like homogeneous pooling is the way to go IF we know apriori what we are looking for. There’s a bunch of work in the biomedical literature that shows measurements taken on pools that are formed homogeneously, with respect to covariate information, can be used to construct estimators that are nearly as efficient as the analogous estimators based on individual level data for both continuous and binary regression (Mitchell et al. 2014; Zhang et al. 2013; Liu et al. 2017). Important to note that, these studies are generally examining how well pooled-level estimates reflect equivalent individual-level estimates for the same set of covariates which were used to pool the data. So it’s super super important to have an apriori hypothesis of what we want to test. I’m thinking 1) distance from the dump site (and latitude/ longitude of the site), 2) length (proxy for age/trophic position), 3) time, and 4) species.
Selecting pools could be completed by hand, but it will likely be difficult to simultaneously maximize both heterogeneity across groups and homogeneity within groups. I’m just going to use nearest neighbor balanced clustering as a first pass at pool development, which will create sets of elements that are similar while ensuring clusters are of equal size. As a small caveat, in this case I’m taking “size” to indicate total grams of the fish, as opposed to an individual fish. So clusters will all be around 10 grams, but might contain anywhere from 2 to many fish.
It might be important to note that this is a first pass approximation. It would be a definitely do-able optimization problem to get an exact / more customized solution to this problem, but it doesn’t seem necessary at the moment. I figure we can tweak by hand as necessary.
We use the R package anticlust (Papenberg and Kalu 2021)
to create a set of clusters of equal size (in this case ~10 grams) that
seeks to maximize homogeneity within clusters and maximize heterogeneity
across clusters. The algorithm first computes the centroid of all data
points (mean of all covariates or a dissimilarity matrix). After
identifying the centroid, the algorithm proceeds as follows:
In this case, our covariates of interest are distance from dump site, latitude, longitude, length data (trophic position), and year. We can weight any of these to give them more importance, add on additional covariates we are interested in, or run separate clustering algorithms for each year/species/site (if we want individual years/species/sites to stay unmixed in all cases). I imagine this will give us a first pass to see exactly what sorts of clustering options are possible.
We can run the clustering algorithm to generate our potential pools. You can put more importance on certain covariates (so, for example, pools are more homogeneous with respect to year and location, but less homogeneous with respect to fish size). I imagine it would be a sort of thing were we give the data a once over with this algorithm, and then go through by hand to make sure everythign looks good, and make any necessary changes.
Clearly it’s important to figure out what we care about beforehand because there are definite tradeoffs. I have it setup where you can simulate data using a spatial GLMM, and now that I have a pooling scheme setup and the individual fish locations and covariates, I can test a few different pooling schemes using our actual data covariates. Also, I’ll make sure to check everything beforehand to be totally positive that covariates aren’t confounded. I do think the most important consideration is what contrasts we want present in the data (both for SI and DDT analysis) - we can accentuate those.
sizes.new %>%
dplyr::filter(SpeciesName =="Stenobrachius leucopsarus") %>%
dplyr::select(Size, year, distance, cluster, SpeciesName) %>%
dplyr::rename("Size (mm)" = "Size", "Distance from dumpsite (km)" = "distance", "Year" = "year") %>%
gather(Metric, Value, `Size (mm)`:`Distance from dumpsite (km)`) %>%
ggplot(aes(x=cluster, y=Value, group=cluster)) +
geom_boxplot()+
theme_bw() +
facet_wrap(~Metric, nrow=3, scales="free_y")+
xlab("Pool Number")+
ggtitle("Stenobrachius leucopsarus") +
ylab("Value")
sizes.new %>%
dplyr::filter(SpeciesName =="Triphoturus mexicanus") %>%
dplyr::select(Size, year, distance, cluster, SpeciesName) %>%
dplyr::rename("Size (mm)" = "Size", "Distance from dumpsite (km)" = "distance", "Year" = "year") %>%
gather(Metric, Value, `Size (mm)`:`Distance from dumpsite (km)`) %>%
ggplot(aes(x=cluster, y=Value, group=cluster)) +
geom_boxplot()+
theme_bw() +
facet_wrap(~Metric, nrow=3, scales="free_y")+
xlab("Pool Number")+
ggtitle("Triphoturus mexicanus") +
ylab("Value")
sizes.new %>%
dplyr::filter(SpeciesName =="Nannobrachium ritteri") %>%
dplyr::select(Size, year, distance, cluster, SpeciesName) %>%
dplyr::rename("Size (mm)" = "Size", "Distance from dumpsite (km)" = "distance", "Year" = "year") %>%
gather(Metric, Value, `Size (mm)`:`Distance from dumpsite (km)`) %>%
ggplot(aes(x=cluster, y=Value, group=cluster)) +
geom_boxplot()+
theme_bw() +
facet_wrap(~Metric, nrow=3, scales="free_y")+
xlab("Pool Number")+
ggtitle("Nannobrachium ritteri") +
ylab("Value")
We need to obtain test samples to see what the highest priority samples are. The things we need to do/ necessary samples are as follows:
Develop a relationship between length and trophic position. If we can pull a number of samples with different sizes, from one time period and location (to minimize variation in the baseline isotope values), we can run bulk and CSIA isotope analysis on those samples to develop that relationship. Samples for this would be 10-20 individuals of varying sizes, from a single (or small cluster) or sites and similar years.
Otolith examination. Pull fish from a few time periods to see if otoliths are disintegrated, or if it will be possible to obtain age and growth information. Toni found that the Scripps Vertebrate collection and LA county history collection have otoliths from our two species in their collections! Could provide a reference for what a “good” otolith looks like, and we could potentially use these to develop an age-length curve if our otoliths are too degraded to use. Samples necessary to complete this step will just be the individuals from Steps 1 and 3. If possible in more recent years, we could get a couple of reference fish from the EtOH preserved side of the bongo net and inspect their otoliths.
Initial estimates of DDT “envelope” across space and time. Determining the sampling extent will be somewhat iterative. Prioritize a few pooled samples across both space and time to run preliminary DDT/ SI analyses on. I imagine it would be ~4-6 pools, with some samples right on the dump site and others further offshore. Time periods could be bookends (as far back as possible to close to the present). Looking at the data, we could do relatively large Stenobrachius leucopsarus 1) from sites <100 km from the dumpsite, late 1980s, 2) from sites > 200 km from the dumpsite, late 1980s, 3) from sites <100 km from the dumpsite, mid 2010s, 4) from sites > 200 km from the dumpsite, mid 2010s.