mgsim: Round 1 Analysis

Author

July Pilowsky

Code
library(tidyplots)
library(paletteer)
library(here)
library(cowplot)
library(GGally)
library(abc)
library(qs)        # provides qread() and set_trust_promises(), used below
library(tidyverse)
source(here("Scripts/validation_metric_functions.R"))
source(here("Scripts/convenience_functions.R"))
source(here("Scripts/animate_sim.R"))
set_trust_promises(TRUE)
[1] FALSE
Code
region <- qread(here("Data/Input/finch_region.qs"))
round1_metrics <- read_csv(here("Data/Validation/round1_validation_metrics.csv")) |> 
  mutate(hm_trend_1970 = NA_real_, hm_trend_1993 = NA_real_)
# The raw hm_trend column holds the two trend windows in alternating rows
# (9752 = 2 x 4876 values); unpack them into one column per window for the
# simulations with finches in DC
round1_metrics$hm_trend_1970[round1_metrics$dc] <- round1_metrics$hm_trend[seq(1, 9752, 2)]
round1_metrics$hm_trend_1993[round1_metrics$dc] <- round1_metrics$hm_trend[seq(2, 9752, 2)]
round1_priors <- read_csv(here("Data/Input/sample_data_round1.csv"))

1. What is the motivation and methodology of this research?

1.a. Motivation

The motivation here is to show the utility and promise of my R package, epizootic, for building process-explicit models of disease outbreaks in wildlife. I’m using the Mycoplasma outbreak in house finches as a case study to show what we can learn from using epizootic to reconstruct the outbreak and narrow the prior distributions for aspects of the outbreak that are difficult to observe.

1.b. Methods

I simulated the house finch system from 1940 to 2016 (the latest year covered by my climate and land use data). I started in 1940 with the release of house finches in New York, because the scientific consensus is that the invasion biology of the house finch is essential to understanding the outbreak. The disease is introduced to the system in Washington, D.C. in early 1994, just as it happened in real life.

The simulations include the demography of the house finch, dispersal (both demography and dispersal are density-dependent), the epidemiological dynamics of the disease, and seasonality. The prior distributions for these were drawn from the extensive literature on this system. The simulations also include changes in climate and land use, with the house finch’s climate and land use preferences determined by a species distribution model. The simulations run on a daily time step, but data are recorded only seasonally.

I ran 10,000 simulations based on stratified sampling of the prior distributions. I collected a variety of data to validate the simulations, all non-overlapping with the data used to parameterize the simulations. In this document, I use these validation metrics to select top-performing simulations and make scientific inferences based on ensembles of top-performing models.
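
The sampling script itself isn’t shown in this document. As a minimal sketch of how a stratified sample can be drawn, here is a Latin hypercube approach using the lhs package; the parameter names and bounds are invented for illustration and are not the real priors.

Code
library(lhs)  # assumption: lhs provides randomLHS() for Latin hypercube draws

# Illustrative parameter bounds only; NOT the real priors
prior_bounds <- tibble::tribble(
  ~param,              ~min, ~max,
  "transmission_rate", 0.01, 0.5,
  "dispersal_max_km",  10,   500
)

n_sims <- 10000
# One stratified draw per simulation on the unit hypercube
unit_cube <- randomLHS(n_sims, nrow(prior_bounds))

# Rescale each unit-interval column onto its prior's support
samples <- sweep(unit_cube, 2, prior_bounds$max - prior_bounds$min, "*") +
  matrix(prior_bounds$min, n_sims, nrow(prior_bounds), byrow = TRUE)
colnames(samples) <- prior_bounds$param
samples <- tibble::as_tibble(samples)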

2. How well did the simulations perform?

The validation metrics are as follows:

  1. Are there finches in Washington DC in 1994? This is crucial because this is the time and place where the Mycoplasma gallisepticum outbreak first occurred. If there are no finches there, there can’t be an outbreak. All the subsequent metrics are conditioned on whether the simulation meets this criterion. All simulations that fail are excluded.
  2. Mycoplasma presence. There are locations where Project FeederWatch spotted infected finches every year for years on end. A lower score indicates a better ability to replicate that persistence.
  3. Haemorhous presence/absence. From many years of Breeding Bird Survey and Christmas Bird Count data, we know of places in North America where the house finch is always observed and places where it is never observed. A lower score means the simulation more accurately replicates both sets of sites.
  4. Haemorhous population trends. For bird conservation regions throughout North America, Audubon has records of trends in Haemorhous population size. I used two metrics: the population trend since 1970 and the population trend since 1993. A lower score means the simulated population trends are more accurate.
  5. Point prevalence. Studies from across North America report Mycoplasma prevalence at specific points in space and time, with uncertainty around these estimates. I assigned a penalty for the simulated prevalence falling outside the estimated range (see the sketch after this list).
  6. First arrival of Mycoplasma. I estimated the first arrival of the Mycoplasma outbreak to different parts of the Northeast based on Project FeederWatch sightings, with uncertainty due to incomplete observation. I assigned a penalty for the simulated date of first arrival falling outside the estimated range.
  7. First arrival of Haemorhous. I estimated the first arrival of the house finch in parts of its invasive range based on sightings from the Breeding Bird Survey, with uncertainty due to incomplete observation. I assigned a penalty for the simulated date of first arrival falling outside the estimated range.
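
Metrics 5–7 share the same interval-based scoring logic. The real scoring functions live in Scripts/validation_metric_functions.R and are not reproduced here; the sketch below is a hypothetical stand-in showing the idea: no penalty inside the empirical uncertainty range, and a penalty that grows with distance outside it.

Code
# Hypothetical stand-in for the interval-based scores (metrics 5-7); the real
# functions are in Scripts/validation_metric_functions.R and may differ.
interval_penalty <- function(simulated, lower, upper) {
  # zero inside [lower, upper], linear in the distance outside it
  pmax(0, lower - simulated) + pmax(0, simulated - upper)
}

# e.g. a site where observed prevalence was estimated at 0.05-0.15:
interval_penalty(simulated = c(0.02, 0.10, 0.30), lower = 0.05, upper = 0.15)
# returns 0.03 0.00 0.15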

2.a. Presence in Washington, D.C.

First of all, let’s look at the performance on the “presence of finches in DC” metric, since everything else depends on that.

Code
round1_priors |> 
  mutate(dc = round1_metrics$dc) |> 
  select(-mortality_Sa_summer, -mortality_Ra_summer) |> 
  pivot_longer(1:45, names_to = "Prior", values_to = "Value") |> 
  ggplot(aes(x = Value, group = dc, fill = dc)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~Prior, scales = "free") +
  scale_fill_paletteer_d("PrettyCols::Light", name = "Birds in DC?")

About half of the simulations (4876) succeeded on this metric. The main determinant of success was the density-dependent control on dispersal: the simulations with birds in DC maximized the range of population densities at which dispersal occurs.
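
The same comparison can be made numerically. This quick summary of group means is a follow-up sketch, not part of the original analysis; it relies only on the objects already loaded above.

Code
# Mean of each prior, split by whether the simulation produced finches in DC
# (a quick numeric companion to the density plots above)
round1_priors |> 
  mutate(dc = round1_metrics$dc) |> 
  group_by(dc) |> 
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) |> 
  pivot_longer(-dc, names_to = "Prior", values_to = "Mean") |> 
  pivot_wider(names_from = dc, values_from = Mean, names_prefix = "in_dc_")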

2.b. Other Validation Metrics

Now, let’s look at performance on the remaining seven validation metrics (counting the two population-trend windows separately). What I’m looking for here are metrics that are good filters: not all the simulations succeed on them, but not all fail horribly, either. Good filters help us choose the best simulations.

Code
round1_metrics_dc <- round1_metrics |> 
  filter(dc) |> 
  select(-index, -dc, -hm_trend)
# 1st-percentile cutoff for each metric (lower scores are better)
quantiles <- round1_metrics_dc |> 
  map_dbl(\(x) quantile(x, probs = 0.01, na.rm = TRUE))
labels <- c("Mycoplasma presence", "Finch pres/abs", 
            "Point prevalence", "Mycoplasma arrival", "Finch arrival",
            "Finch trend from 1970", "Finch trend from 1993")
plotlist <- pmap(list(round1_metrics_dc, quantiles, labels), function(x, y, z) {
  ggplot(data = NULL, aes(x = x)) + 
    geom_histogram() + 
    geom_vline(xintercept = y, color = "blue") +
    xlab(z)
})
plot_grid(plotlist = plotlist)

Here I’ve added a blue line at the 1% quantile of each metric: since lower scores are better, it marks the cutoff for the top-performing 1% of simulations. This is tricky, because some of these metrics are poor filters: almost all simulations do well on them. Though perhaps they look better once extreme outliers are removed:

Code
round1_metrics_trimmed <- round1_metrics_dc |> 
  filter(if_all(everything(), \(x) x < quantile(x, 0.99, na.rm = TRUE)))
plotlist2 <- pmap(list(round1_metrics_trimmed, quantiles, labels), function(x, y, z) {
  ggplot(data = NULL, aes(x = x)) + 
    geom_histogram() + 
    geom_vline(xintercept = y, color = "blue") +
    xlab(z)
})
plot_grid(plotlist = plotlist2)

Mycoplasma presence remains a poor filter, as does finch presence/absence. Another issue is that performance on finch presence/absence, Mycoplasma arrival, and finch arrival is very poor. I think the best thing I can do at this point is to try ABC on different combinations of validation metrics, choose top models, and check:

  1. whether the same top models are chosen by different combinations of validation metrics, and
  2. whether the ensemble of the chosen top models produces results that I know to be accurate based on the natural history of the house finch and the conjunctivitis outbreak.

3. Which simulations scored the best on validation metrics?

You can’t reasonably use seven validation metrics at once: ABC accepts simulations by their distance to the target in summary-statistic space, and that distance loses discriminating power as dimensions are added. A small set of statistics, around three, is generally recommended, but choosing a good combination of three can be very challenging.
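
To see why: rejection-style ABC keeps the tol fraction of simulations closest to the target in scaled summary-statistic space, so every added metric dilutes the influence of the others on that single distance. Here is a toy sketch of the selection step, assuming the MAD scaling the abc package applies by default (its internals differ in detail):

Code
# Toy sketch, not the abc package's exact code: scale each metric by its MAD,
# then keep the 1% of simulations closest to the target. The target is 0 for
# every metric, so the distance is just the norm of the scaled scores.
mads   <- map_dbl(round1_metrics_dc, \(x) mad(x, na.rm = TRUE))
scaled <- sweep(as.matrix(round1_metrics_dc), 2, mads, "/")
dist_to_target  <- sqrt(rowSums(scaled^2))
top_one_percent <- dist_to_target <= quantile(dist_to_target, 0.01, na.rm = TRUE)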

3.a. Correlation

It is important to know which validation metrics are correlated, because you want three that are uncorrelated with one another for maximum effectiveness in choosing models that perform well at multiple functions.

Code
ggpairs(round1_metrics_trimmed)

Point prevalence appears to be negatively correlated with Mycoplasma presence, and positively correlated with finch presence/absence, with the population trend from 1993 on, and with the arrival of Mycoplasma. I would strongly prefer point prevalence over Mycoplasma presence or finch presence/absence, because it is a better filter. Finch presence/absence is also highly correlated with the population trend from 1993 on, and the population trend is the better filter of the two.
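
To put numbers on those visual impressions (a quick check, not part of the original analysis):

Code
# Pairwise correlations among the validation metrics, outliers trimmed as above
round1_metrics_trimmed |> 
  cor(use = "pairwise.complete.obs") |> 
  round(2)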

3.c. Mycoplasma presence & finch arrival

I used abctools::selectsumm, which essentially uses cross-validation to find which summary statistics can most refine the posterior distributions. This method selected Mycoplasma presence and first arrival of house finches as the best combination of variables.
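
For reference, a sketch of what that selection call can look like is below. The interface (obs for the target summaries, param for the sampled parameters, sumstats for the simulated summaries) follows my reading of the abctools documentation and should be double-checked before reuse.

Code
# Sketch only: summary-statistic selection with abctools. Argument names are
# assumptions from the documentation, not a verified reproduction of my call.
library(abctools)
best_stats <- selectsumm(
  obs      = matrix(0, nrow = 1, ncol = ncol(round1_metrics_dc)),  # every metric's ideal score is 0
  param    = as.matrix(round1_priors[round1_metrics$dc, ]),
  sumstats = as.matrix(round1_metrics_dc)
)

With those two statistics chosen, the ABC fit follows: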

Code
abc49 <- abc(param = round1_priors[round1_metrics$dc,], 
             sumstat = round1_metrics_dc[c(1, 5)],  # mg_presence and hm_arrival
             target = rep(0, 2), 
             tol = 0.01, 
             method = "neuralnet")
Code
head(abc49$ss, 10)
      mg_presence hm_arrival
 [1,]          33      27255
 [2,]          33      26715
 [3,]          33      26953
 [4,]          33      24692
 [5,]          33      26641
 [6,]          33      27561
 [7,]          33      26849
 [8,]          33      27010
 [9,]           3      28160
[10,]           3      28652
Code
# Match the accepted (unadjusted) parameter values back to the prior table
# to recover the row indices of the selected simulations
selected_samples <- round1_priors |> 
  rowid_to_column("sample") |> 
  semi_join(abc49$unadj.values, copy = TRUE) |> 
  pull(sample)
write_sftp_transfer("/glade/work/pilowskyj/Round1_matrix", 
                      "/Users/caryinstitute/Documents/mgsim/Data/Output/Round1_matrix", 
                      selected_samples, 
                      here("Scripts/Globus/abc2.txt"))
ensemble_sa <- ensemble_mean(selected_samples, abc49$weights, "Sa", here("Data/Output/Round1_matrix"))
animate_sim(ensemble_sa, region)
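
ensemble_mean comes from this project’s convenience functions; conceptually, it is a weighted average of the selected simulations’ output arrays, using the ABC weights. A generic sketch of that idea (not the project’s actual implementation):

Code
# Generic weighted ensemble mean (NOT the project's ensemble_mean, which lives
# in Scripts/convenience_functions.R). Assumes each simulation's output is a
# numeric array of identical shape.
weighted_ensemble <- function(sim_list, weights) {
  w <- weights / sum(weights)                       # normalize the ABC weights
  Reduce(`+`, Map(\(sim, wi) sim * wi, sim_list, w))
}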

Code
ensemble_ra <- ensemble_mean(selected_samples, abc49$weights, "Ra", here("Data/Output/Round1_matrix"))
animate_sim(ensemble_ra, region)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries

This ensemble seems to have slower dispersal than the others, with the house finch range expanding more slowly, and the disease spreading more slowly as well.

Code
plot_abc_posteriors(abc49, round1_priors)
✔ split_plot: split into 45 plots across 12 pages
This validation dataset selects for rather low transmission rates, all things considered.

3.d. Everything!

I read a number of papers about the curse of dimensionality and overfitting with excessive summary statistics, and they suggested that the neural network regression method of ABC may be able to compensate for these issues: the regression adjustment corrects accepted draws toward the target, partially offsetting the dimensionality penalty. So I’m trying all seven metrics here.

Code
abc49 <- abc(param = round1_priors[round1_metrics$dc,],  # overwrites the fit from 3.c
             sumstat = round1_metrics_dc,                # all seven metrics
             target = rep(0, 7), 
             tol = 0.01, 
             method = "neuralnet")
Code
head(abc49$ss, 10)
      mg_presence hm_presabs point_prevalence mg_arrival hm_arrival
 [1,]           2       2997         2.646297       5131      30930
 [2,]           2       3013         1.502044       4793      30003
 [3,]           7       3022         1.675591       4905      31021
 [4,]           3       2998         1.416690       4854      31133
 [5,]           3       3049         2.098506       4632      31197
 [6,]           3       2997         2.623132       4861      28160
 [7,]           8       2997         2.045344       5058      31102
 [8,]          16       2997         1.214188       5074      31192
 [9,]           3       2997         2.039519       4769      30895
[10,]          10       2997         1.193360       4748      31213
      hm_trend_1970 hm_trend_1993
 [1,]      79.64963      174.5768
 [2,]      79.61800      223.6906
 [3,]      76.87590      213.3664
 [4,]      71.83274      223.4325
 [5,]      81.81654      216.8374
 [6,]      65.63331      221.4385
 [7,]      52.15980      193.9473
 [8,]      50.62347      193.5241
 [9,]      48.46236      193.9341
[10,]      57.35784      196.7086
Code
# As before, match the accepted parameter values back to the prior table
# to recover the row indices of the selected simulations
selected_samples <- round1_priors |> 
  rowid_to_column("sample") |> 
  semi_join(abc49$unadj.values, copy = TRUE) |> 
  pull(sample)
write_sftp_transfer("/glade/work/pilowskyj/Round1_matrix", 
                      "/Users/caryinstitute/Documents/mgsim/Data/Output/Round1_matrix", 
                      selected_samples, 
                      here("Scripts/Globus/abc3.txt"))
ensemble_sa <- ensemble_mean(selected_samples, abc49$weights, "Sa", here("Data/Output/Round1_matrix"))
animate_sim(ensemble_sa, region)

This run looks similar to the others, but with lower abundance and higher seasonality. Let’s look at recovered adults to get a sense of the outbreak:

Code
# Drop the 24th selected sample (and its weight) from this ensemble
ensemble_ra <- ensemble_mean(selected_samples[-24], abc49$weights[-24], "Ra", here("Data/Output/Round1_matrix"))
animate_sim(ensemble_ra, region)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries

The infection spreads more slowly in this simulation, and it isn’t as evenly distributed across the landscape. It replicates the pattern of more intense infection in the east, which is great. It also reaches the West at about the times reported in the literature.

Code
plot_abc_posteriors(abc49, round1_priors)
✔ split_plot: split into 45 plots across 12 pages
4. Conclusions

What I find very exciting in these results is that they give us insight into aspects of the Mycoplasma outbreak that are unobservable or unknowable by any other method. For one, they give us dynamic maps of the expansion of the house finch’s range, and of the geographic expansion of house finch conjunctivitis, that are unaffected by observer bias (birdwatchers’ observations cluster heavily in certain parts of the country). For another, the posterior distributions provide estimates of quantities like the number of house finches initially released in New York, and the number of house finches initially infected with Mycoplasma in early 1994.

I’m still working out which of these sets of validation metrics does the best job of reconstructing the range expansion of the house finch and the spread of Mycoplasma within it. Thus far, I think the best performance comes from the set I selected based on my own judgment as a scientist, in section 3b above, but I’m open to other opinions. Almost certainly I will want to perform a second round of simulations using the posterior distributions from this round as priors.