Mesopelagic Fish Data

Background

This is just aimed at describing the mesopelagic fish data available through the CalCOFI collection. In particular, we are looking at fish in the families Myctophidae, Gonostomatidae, and Bathylagidae. Note that all CalCOFI datasets mentioned (ASIDE from the juvenile info) are shown on this site (https://coastwatch.pfeg.noaa.gov/erddap/search/index.html?page=1&itemsPerPage=1000&searchFor=calcofi) and were downloaded via the R packge rerddap.

If we look at CalCOFI larval count data, we see that a full 10% of entries are from these families, including 96 different species. The breakdown number-wise is Myctophidae are the most abundant (1,033,245), followed by Gonostomatidae (174,421), and then Bathylagidae (134,170). However, these numbers are for larval fish that are quantitatively sampled, and instead we are focusing on juvenile and adult fish accidentally caught and NOT quantitatively sampled. This data is not yet publicly available, but folks at CalCOFI provided it (Ed, Noelle, Andrew, and Bill). The data currently only go back as far as 1978, but CalCOFI personnel (Nico) are actively working through earlier cruises. We can provide cruises we would like sorted first to expedite obtaining earlier data, but the end goal is to have all cruises sorted and cataloged. To obtain the physical samples we will go through Sherri.

Length-weight relationships

The data provided by CalCOFI has lengths but not weights. We kind of need an approximate length-weight relationship for fish if we are going to create pools. To create a very rough approximation, we can use other CalCOFI data (namely the CPS Trawl Life History Specimen Data) to create a relationship, and compare predicted values (the black line below) with observed values from our 12 test myctophid specimens (the red points). Note that samples are sparse (we have 169 samples) and tend to be larger than our archived specimens. No data for the family Gonostomatidae is available.

# Download the CalCOFI dataset and filter by species in our target families 
testdat <- rerddap::tabledap("FRDCPSTrawlLHSpecimen") %>% 
  dplyr::left_join(., readRDS(here::here("data","itis_tsn_family.rds")), by="itis_tsn") %>% 
  as.data.frame() %>% 
  dplyr::filter(family %in% c("Myctophidae","Gonostomatidae","Bathylagidae"))

# These are our 12 test sample length and weights. All were myctophids. 
our.test.samples = data.frame(weight = c(3.1298, 0.8554, 0.7428, 1.3002, 0.7338, 0.9178, 0.4317, 0.2748, 0.1417, 0.8783, 0.7426, 0.3195), 
           length = c(65.5, 42.4, 41.7, 53.6, 42.3, 46.2, 37.4, 33.3, 28.8, 44.7, 42.2, 33.8), 
           family= rep("Myctophidae", 12))

# Plot length-weight relationships 
testdat %>% 
  dplyr::filter(as.numeric(weight) < 30) %>% #There's one fish that's unreasonably large - I think a misplaced decimal point
  dplyr::mutate(weight = as.numeric(weight), 
                standard_length = as.numeric(standard_length)) %>% 
  ggplot(aes(y=log(weight), x = log(standard_length), fill=family)) + 
  geom_point(pch=21) + 
  stat_poly_line(col="black") +
  stat_poly_eq(aes(label = after_stat(eq.label)), label.y=0.9) +
  stat_poly_eq(label.y = 0.85) +
  geom_point(data = our.test.samples, aes(x=log(length), y=log(weight), fill=family, fill=""), col="red", pch=17) +
  theme_bw() + 
  facet_wrap(~family) + 
  xlab("log(length) - mm") + 
  ylab("log(weight) - g") + 
  theme(legend.position = "none")

There’s an approximately linear log-log relationship between length and weight. Our samples tend to fall below the line, indicating weight is underestimated for a given length. This might be an issue with weight loss due to formaldehyde preservation, species specific differences we don’t account for, or natural variability at small sizes. In any case, it’s a decent starting point for developing pools and figuring out approximately how many individuals we will need to combine to achieve ~10 g of biomass. We will need to weigh everything as we pull it to ensure biomass is sufficient.

Available juvenile data

Spatial and Temporal Extent

We can now look at the data to see a few things. The first is that this data is distributed across the entire west coast - I’m going to subset data to only be within the 75-station CalCOFI sampling area, as this is when data is more abundant and consistent!

I updated the R Shiny app to show the number and total weight of myctophids (only the three most abundant species, Stenobrachius leucopsarus, Triphoturus mexicanus, and Nannobrachium ritteri) per year/site: https://lmcgill.shinyapps.io/DDT_sample_locations/. Note that although the app shows individuals across the 113-station area, we are using only the 75-station area. Looks like samples are only from Bongo tows (which is at least consistent I suppose), and are pretty evenly spread across the year. There are four Manta cruises in the database, but they don’t capture our species of interest. Note that there is slightly more spatial variability in sample location than is shown due to interannual variation in where catches at a specific sampling station occur.

Species Available

The table below shows summary data across all individuals within the sampling area. Some notes. First, the juveniles caught are overwhelmingly Myctophidae, but there are a number of Bathylagidae. The number of Gonostomatidae is quite low. Second, a few species tend to dominate the catches. The top 3 species from Myctophidae and Bathylagidae are Stenobrachius leucopsarus, Triphoturus mexicanus, and Nannobrachium ritteri for Myctophidae, and Bathylagoides wesethi, Leuroglossus stilbius, and Lipolagus ochotensis for Bathylagidae. It seems like all six species are vertical migrators (https://academic.oup.com/plankt/article/34/9/828/1494081). Note that these number are from after 1978. We can reasonably expect a large number BEFORE 1978 as well - which is important to keep in mind when determining how many specimens to pull.

If you look on the Shiny app, there’s definite spatial separation between the species. SL are more likely to be caught north of Long Beach, TM are more likely to be caught south of long beach, and NR are more likely to be caught offshore.

knitr::kable(
  sizes %>% 
  dplyr::group_by(family, SpeciesName) %>% 
  dplyr::summarize(count = sum(Count), 
                   mean.weight.g = round(mean(weight), 2), 
                   mean.length.mm = round(mean(Size), 2)) %>% 
  dplyr::ungroup() %>% arrange(family, desc(count)) %>% as.data.frame() )

family	SpeciesName	count	mean.weight.g	mean.length.mm
Bathylagidae	Bathylagoides wesethi	121	0.46	34.81
Bathylagidae	Leuroglossus stilbius	86	1.10	46.05
Bathylagidae	Lipolagus ochotensis	57	0.38	33.47
Bathylagidae	Bathylagus	4	0.45	38.02
Gonostomatidae	Cyclothone signata	12	NA	20.57
Myctophidae	Triphoturus mexicanus	1126	0.45	30.34
Myctophidae	Stenobrachius leucopsarus	945	0.80	36.86
Myctophidae	Nannobrachium ritteri	539	0.80	37.49
Myctophidae	Diaphus theta	268	0.17	21.40
Myctophidae	Symbolophorus californiensis	196	1.04	41.63
Myctophidae	Diogenichthys atlanticus	128	0.10	19.54
Myctophidae	Ceratoscopelus townsendi	108	0.46	31.16
Myctophidae	Protomyctophum crockeri	73	0.13	21.40
Myctophidae	Tarletonbeania crenularis	54	0.50	32.49
Myctophidae	Notolychnus valdiviae	33	0.33	26.35
Myctophidae	Nannobrachium regale	31	0.92	39.90
Myctophidae	Myctophum nitidulum	30	1.11	37.96
Myctophidae	Lampanyctus	20	0.98	41.27
Myctophidae	Diaphus	18	0.11	19.15
Myctophidae	Myctophidae	10	0.39	27.89
Myctophidae	Diaphus anderseni	6	0.16	22.87
Myctophidae	Nannobrachium hawaiiensis	6	0.67	37.02
Myctophidae	Diogenichthys laternatus	5	0.32	22.48
Myctophidae	Hygophum reinhardtii	5	0.53	33.60
Myctophidae	Gonichthys tenuiculus	4	0.84	41.00
Myctophidae	Lampanyctus nobilis	4	0.04	14.73
Myctophidae	Nannobrachium	4	0.32	24.18
Myctophidae	Centrobranchus nigroocellatus	3	0.31	28.17
Myctophidae	Hygophum atratum	3	1.02	41.07
Myctophidae	Bolinichthys longipes	2	0.19	24.55
Myctophidae	Electrona risso	2	0.02	11.85
Myctophidae	Diaphus pacificus	1	0.36	31.50
Myctophidae	Lampadena urophaos	1	0.11	20.90
Myctophidae	Lampanyctus tenuiformis	1	0.18	24.70
Myctophidae	Myctophiformes	1	0.11	21.00
Myctophidae	Notoscopelus resplendens	1	0.14	22.80
Myctophidae	Parvilux boschmai	1	0.12	21.30
Myctophidae	Parvilux ingens	1	2.83	65.00
Myctophidae	Protomyctophum	1	0.07	18.00

The below graphs just show data from Stenobrachius leucopsarus, Triphoturus mexicanus, and Nannobrachium ritteri. You can see that length and weight distributions aren’t that different across species.

sizes %>% 
  dplyr::filter(SpeciesName %in% c(c("Stenobrachius leucopsarus", "Triphoturus mexicanus","Nannobrachium ritteri"))) %>% 
  expandRows(., "Count", drop=FALSE) %>% 
  dplyr::select(SpeciesName, weight, Size) %>% 
  dplyr::rename("Weight (g)" = "weight", "Length (mm)" = "Size") %>% 
  gather("Parameter", "Value",-SpeciesName) %>% 
  ggplot(aes(x=Value, fill=SpeciesName)) + 
  geom_histogram(color="black") + 
  theme_half_open() + 
  facet_grid(SpeciesName~Parameter, scales="free") + 
  theme(legend.position = "bottom", legend.title=element_blank())

Overall numbers

I propose the following cuts on the data:

Use only our two (or three) most abundant myctophids: Stenobrachius leucopsarus, Triphoturus mexicanus, and possibly Nannobrachium ritteri. Although it woud be really nice to have the biomass of all individuals, I think it will introduce too much variability (and we don’t have enough individuals of other species to be well-resolved in time and space).
Prioritize larger individuals. If you look at the Shiny app, you will see you can cut out ~1000 of the smallest individuals and only lose ~200 grams of material. It takes more small individuals to obtain the 10 gram limit of biomass, and we will eat through the collection faster if we include them in the DDT analysis.
Use only sites within the 75-station core CalCOFI sampling (this has the most repeated site visits and highest density of juveniles). So anything below line 76.7 gets removed. We may also want to cut out some sizes far offshore in the future.

You can check out the Shiny app to see how the number of individuals and biomass change with you change species composition and raising/ lowering minimum size requirements.

Creating pools of individuals for DDT analysis

Background

Given biomass estimates from the juvenile database, I’m thinking we will have enough tissue for between 100-300 DDT estimates (depending on how many species we include, how the actual weights shake out, and how many juveniles are in earlier years). This really isn’t a lot given the temporal (60+ years) and spatial (hundreds of km) extents we are working with. Therefore, I think we need to think of this as trying to pick up broad scale, mean changes in DDT amount and source in a few abundant species of myctophids, acknowledging that there is likely fine scale spatial/ temporal/ ontogenetic variation we won’t be able to resolve. I know we’ve talked about it before, but being explicit in what contrasts we want to see (i.e., changes through time time, near and far from the dump site, different sizes classes/ species, etc.) will be important. NOTE AGAIN THAT WE DO NOT HAVE ENOUGH BIOMASS FOR VERY FINE SCALE POOLING. What I mean by this, is that we’re not going to have 10 grams of biomass from the same species, same year, same site, and same size class. We will need to make explicit decisions about what variation we want to aggregate over.

Pooling strategies range from random (points selected totally at random) to homogeneous (groups selected to be as similar as possible). Generally, the groups should be as small as possible, but large enough to ensure there’s sufficient tissue for analysis. Generally, it seems like homogeneous pooling is the way to go IF we know apriori what we are looking for. There’s a bunch of work in the biomedical literature that shows measurements taken on pools that are formed homogeneously, with respect to covariate information, can be used to construct estimators that are nearly as efficient as the analogous estimators based on individual level data for both continuous and binary regression (Mitchell et al. 2014; Zhang et al. 2013; Liu et al. 2017). Important to note that, these studies are generally examining how well pooled-level estimates reflect equivalent individual-level estimates for the same set of covariates which were used to pool the data. So it’s super super important to have an apriori hypothesis of what we want to test. I’m thinking 1) distance from the dump site (and latitude/ longitude of the site), 2) length (proxy for age/trophic position), 3) time, and 4) species.

Logistics

Selecting pools could be completed by hand, but it will likely be difficult to simultaneously maximize both heterogeneity across groups and homogeneity within groups. I’m just going to use nearest neighbor balanced clustering as a first pass at pool development, which will create sets of elements that are similar while ensuring clusters are of equal size. As a small caveat, in this case I’m taking “size” to indicate total grams of the fish, as opposed to an individual fish. So clusters will all be around 10 grams, but might contain anywhere from 2 to many fish.

It might be important to note that this is a first pass approximation. It would be a definitely do-able optimization problem to get an exact / more customized solution to this problem, but it doesn’t seem necessary at the moment. I figure we can tweak by hand as necessary.

We use the R package anticlust (Papenberg and Kalu 2021) to create a set of clusters of equal size (in this case ~10 grams) that seeks to maximize homogeneity within clusters and maximize heterogeneity across clusters. The algorithm first computes the centroid of all data points (mean of all covariates or a dissimilarity matrix). After identifying the centroid, the algorithm proceeds as follows:

The element having the smallest distance (defined through a distance/ dissimilarity matrix) from the centroid is clustered with its (N/K) - 1 nearest neighbors (neighborhood is defined according to the Euclidean distance if the data input is a feature matrix).
From the remaining elements, again the element nearest to the centroid is selected and clustered with its (N/K) - 1 neighbors.
The procedure is repeated until all elements are part of a cluster.

In this case, our covariates of interest are distance from dump site, latitude, longitude, length data (trophic position), and year. We can weight any of these to give them more importance, add on additional covariates we are interested in, or run separate clustering algorithms for each year/species/site (if we want individual years/species/sites to stay unmixed in all cases). I imagine this will give us a first pass to see exactly what sorts of clustering options are possible.

Example

We can run the clustering algorithm to generate our potential pools. You can put more importance on certain covariates (so, for example, pools are more homogeneous with respect to year and location, but less homogeneous with respect to fish size). I imagine it would be a sort of thing were we give the data a once over with this algorithm, and then go through by hand to make sure everythign looks good, and make any necessary changes.

Clearly it’s important to figure out what we care about beforehand because there are definite tradeoffs. I have it setup where you can simulate data using a spatial GLMM, and now that I have a pooling scheme setup and the individual fish locations and covariates, I can test a few different pooling schemes using our actual data covariates. Also, I’ll make sure to check everything beforehand to be totally positive that covariates aren’t confounded. I do think the most important consideration is what contrasts we want present in the data (both for SI and DDT analysis) - we can accentuate those.

sizes.new %>% 
  dplyr::filter(SpeciesName =="Stenobrachius leucopsarus") %>% 
  dplyr::select(Size, year, distance, cluster, SpeciesName) %>% 
  dplyr::rename("Size (mm)" = "Size", "Distance from dumpsite (km)" = "distance", "Year" = "year") %>% 
  gather(Metric, Value, `Size (mm)`:`Distance from dumpsite (km)`) %>% 
  ggplot(aes(x=cluster, y=Value, group=cluster)) + 
  geom_boxplot()+ 
  theme_bw() + 
  facet_wrap(~Metric, nrow=3, scales="free_y")+
  xlab("Pool Number")+
  ggtitle("Stenobrachius leucopsarus") + 
  ylab("Value")

sizes.new %>% 
  dplyr::filter(SpeciesName =="Triphoturus mexicanus") %>% 
  dplyr::select(Size, year, distance, cluster, SpeciesName) %>% 
  dplyr::rename("Size (mm)" = "Size", "Distance from dumpsite (km)" = "distance", "Year" = "year") %>% 
  gather(Metric, Value, `Size (mm)`:`Distance from dumpsite (km)`) %>% 
  ggplot(aes(x=cluster, y=Value, group=cluster)) + 
  geom_boxplot()+ 
  theme_bw() + 
  facet_wrap(~Metric, nrow=3, scales="free_y")+
  xlab("Pool Number")+
  ggtitle("Triphoturus mexicanus") + 
  ylab("Value")

sizes.new %>% 
  dplyr::filter(SpeciesName =="Nannobrachium ritteri") %>% 
  dplyr::select(Size, year, distance, cluster, SpeciesName) %>% 
  dplyr::rename("Size (mm)" = "Size", "Distance from dumpsite (km)" = "distance", "Year" = "year") %>% 
  gather(Metric, Value, `Size (mm)`:`Distance from dumpsite (km)`) %>% 
  ggplot(aes(x=cluster, y=Value, group=cluster)) + 
  geom_boxplot()+ 
  theme_bw() + 
  facet_wrap(~Metric, nrow=3, scales="free_y")+
  xlab("Pool Number")+
  ggtitle("Nannobrachium ritteri") + 
  ylab("Value")

Getting started

We need to obtain test samples to see what the highest priority samples are. The things we need to do/ necessary samples are as follows:

Develop a relationship between length and trophic position. If we can pull a number of samples with different sizes, from one time period and location (to minimize variation in the baseline isotope values), we can run bulk and CSIA isotope analysis on those samples to develop that relationship. Samples for this would be 10-20 individuals of varying sizes, from a single (or small cluster) or sites and similar years.
Otolith examination. Pull fish from a few time periods to see if otoliths are disintegrated, or if it will be possible to obtain age and growth information. Toni found that the Scripps Vertebrate collection and LA county history collection have otoliths from our two species in their collections! Could provide a reference for what a “good” otolith looks like, and we could potentially use these to develop an age-length curve if our otoliths are too degraded to use. Samples necessary to complete this step will just be the individuals from Steps 1 and 3. If possible in more recent years, we could get a couple of reference fish from the EtOH preserved side of the bongo net and inspect their otoliths.
Initial estimates of DDT “envelope” across space and time. Determining the sampling extent will be somewhat iterative. Prioritize a few pooled samples across both space and time to run preliminary DDT/ SI analyses on. I imagine it would be ~4-6 pools, with some samples right on the dump site and others further offshore. Time periods could be bookends (as far back as possible to close to the present). Looking at the data, we could do relatively large Stenobrachius leucopsarus 1) from sites <100 km from the dumpsite, late 1980s, 2) from sites > 200 km from the dumpsite, late 1980s, 3) from sites <100 km from the dumpsite, mid 2010s, 4) from sites > 200 km from the dumpsite, mid 2010s.

Mesopelagic Fish Data

Lily McGill

2022-10-24

Background

Length-weight relationships

Available juvenile data

Spatial and Temporal Extent

Species Available

Overall numbers

Creating pools of individuals for DDT analysis

Background

Logistics

Example

Getting started