Question 1

Briefly describe what analytical methods could be used to determine whether Agent X is likely to be present or not for the entire region of interest. Note that the region extent is defined by the covariate data layers.

The simplest approach is to check which covariate values coincide with known locations of occurrence of Agent X. The ranges of these values can be used to develop a “suitability” map of conditions for Agent X. I applied this approach during the data exploration phase, and the results are shown below.

Suitable Locations per at least Four Covariates
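
A minimal sketch of the range-based suitability check described above, assuming each raster cell is reduced to a dictionary of covariate values. The covariate names and numbers are hypothetical, and the threshold of four matching covariates is taken from the figure title.

```python
# Range-based ("envelope") suitability: a cell is suitable when at least
# `min_matches` of its covariates fall inside the range seen at presence points.
# All covariate names and values below are hypothetical, for illustration only.

def covariate_ranges(presence_rows):
    """Return {covariate: (min, max)} over the rows sampled at presence points."""
    names = presence_rows[0].keys()
    return {
        name: (min(r[name] for r in presence_rows),
               max(r[name] for r in presence_rows))
        for name in names
    }

def suitability_count(cell, ranges):
    """Count how many covariates of one cell fall inside the presence ranges."""
    return sum(lo <= cell[name] <= hi for name, (lo, hi) in ranges.items())

def is_suitable(cell, ranges, min_matches=4):
    """Flag a cell as suitable when enough covariates match."""
    return suitability_count(cell, ranges) >= min_matches

presence = [
    {"cov_a": 10, "cov_b": 0.6, "cov_c": 120, "urban": 1, "cov_d": 5},
    {"cov_a": 40, "cov_b": 0.8, "cov_c": 300, "urban": 1, "cov_d": 20},
]
ranges = covariate_ranges(presence)

cell_near = {"cov_a": 25, "cov_b": 0.7, "cov_c": 200, "urban": 1, "cov_d": 50}
cell_far = {"cov_a": 900, "cov_b": 0.2, "cov_c": 5, "urban": 0, "cov_d": 300}

print(suitability_count(cell_near, ranges), is_suitable(cell_near, ranges))
print(suitability_count(cell_far, ranges), is_suitable(cell_far, ranges))
```

Because the ranges come only from presence points, this check says nothing about conditions where Agent X was looked for and not found, which is the limitation the background approach below tries to address.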

A slightly more robust approach involves characterizing the “background” environment, or generating pseudo-absence data to account for the fact that we only have data at some locations. This approach is detailed as a response to Question 2.
With a wider variety of covariates, a geographically weighted regression or principal components analysis could also be used to narrow down key drivers, which could then feed logistic regression models that assess the potential extent of Agent X.

Question 2

Demonstrate one application of the methods outlined in Question 1 using reproducible code. Please plot your results as a map across the geographic area outlined by the covariate brick. Why did you choose this method over any of the others? What assumptions have you made (if any) concerning the data?

(left) Random Background and (right) Restricted Background samples

I chose to investigate the background method over the others primarily because a robust understanding of species distribution requires some understanding of where Agent X may be absent. I chose not to work with pseudo-absences, since we do not have a good understanding of where Agent X does not occur. The background method instead characterizes differences in the environment, which is a safer approach that involves fewer assumptions given limited data.

To develop background points, I tested two approaches: random sampling within the study area, and restricted sampling within a 100 km radius of observation points. I tested both because the size of the background can significantly affect prediction results (Barbet-Massin et al., 2012; VanDerWal et al., 2009).
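
The two sampling schemes can be sketched as follows. This is an illustration under stated assumptions, not the code used for the figures: the extent and presence coordinates are hypothetical, points are drawn uniformly in longitude/latitude (which is not equal-area), and no land or raster mask is applied.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def random_background(extent, n, rng):
    """Draw n points uniformly inside extent = (xmin, xmax, ymin, ymax)."""
    xmin, xmax, ymin, ymax = extent
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n)]

def restricted_background(extent, presence, n, radius_km, rng, max_tries=100000):
    """Rejection sampling: keep random points within radius_km of any presence."""
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        lon, lat = random_background(extent, 1, rng)[0]
        if any(haversine_km(lon, lat, px, py) <= radius_km for px, py in presence):
            points.append((lon, lat))
    return points

# Hypothetical study extent and presence locations (lon, lat).
extent = (-95.0, -85.0, 15.0, 25.0)
presence = [(-89.6, 21.0), (-88.3, 20.7)]

rng = random.Random(0)
bg_random = random_background(extent, 50, rng)
bg_restricted = restricted_background(extent, presence, 50, 100.0, rng)
print(len(bg_random), len(bg_restricted))
```

The restricted sample covers a much narrower range of environments than the random sample, which is why the two backgrounds can lead to different model fits.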

Evaluation of Background Fits
(top) Random Background
(bottom) Restricted Background

The evaluation shows that the first subset (Random Background) yields more distinct distributions for presence and absence data points, with a high Area Under Curve (AUC) value, indicating stronger discriminating power. We will use the Random Background method for the next steps.
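
For reference, AUC can be read as the probability that a randomly chosen presence point receives a higher model score than a randomly chosen background point. A small sketch of that pairwise definition follows; the scores are made up for illustration.

```python
def auc(presence_scores, background_scores):
    """Share of (presence, background) pairs where presence scores higher.

    Ties count as half a win, so a model that cannot separate the two
    groups at all scores 0.5.
    """
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))

# Hypothetical model scores at presence and background points.
presence_scores = [0.9, 0.8, 0.4]
background_scores = [0.5, 0.3, 0.2, 0.1]
print(auc(presence_scores, background_scores))
```

An AUC near 0.5 means the scores do not separate presence from background, while values closer to 1 mean presence points are consistently ranked higher.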

K-fold Partitioning

To cross-validate the model, I split the dataset into testing and training datasets, each sub-sampled into presence and absence datasets. For this example, I use K-Fold Partitioning with 5 groups. I also correct for spatial sorting bias, “the difference between two point data sets in the average distance to the nearest point in a reference dataset” (Hijmans, 2012).

K-Fold Partitioning with Random Background Samples
* : Training, Absence
* : Testing, Absence
+ : Training, Presence
+ : Testing, Presence
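
A minimal sketch of the 5-group partitioning, assuming records are assigned to groups at random with near-equal group sizes. The function names and the record count are hypothetical.

```python
import random

def kfold(n, k, rng):
    """Assign each of n records to one of k groups (1..k) of near-equal size."""
    groups = [i % k + 1 for i in range(n)]
    rng.shuffle(groups)
    return groups

def split(records, groups, test_group):
    """Hold out one group for testing and train on the remaining groups."""
    train = [r for r, g in zip(records, groups) if g != test_group]
    test = [r for r, g in zip(records, groups) if g == test_group]
    return train, test

rng = random.Random(42)
records = list(range(23))          # stand-ins for 23 presence records
groups = kfold(len(records), 5, rng)
train, test = split(records, groups, test_group=1)
print(len(train), len(test))
```

The same assignment is applied separately to presence and background records, so each fold has its own training and testing set for both classes.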

Modeling

To model the probability surface based on covariates, I develop a prediction averaged across four models:

Generalized Linear Model (GLM): chosen for simplicity
Random Forest (RF): chosen for balance of bias and variance across variables
Bioclim (BC): chosen due to built-in ‘suitability’ selector; “[BC method] compares the values of environmental variables at any location to a percentile distribution of the values at known locations of occurrence (‘training sites’)” (Hijmans & Elith, 2017)
Mahalanobis (MH): chosen because it accounts for correlations between covariates and is independent of measurement scale

GLM

(left) GLM Model, Raw Values
(right) GLM Model Prediction, Presence/Absence
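
As an illustration of what a binomial GLM estimates, the sketch below fits a logistic regression on a single covariate by gradient ascent. It is a teaching example under stated assumptions: the data are invented, the covariate is assumed to be scaled to roughly 0..1, and a statistics package would normally be used instead of a hand-written optimizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, row):
    """Predicted probability of presence; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit intercept and coefficients by batch gradient ascent on the log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            err = target - predict(w, row)
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Hypothetical covariate values with presence (1) / background (0) labels.
X = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y = [0, 0, 1, 0, 1, 0, 1, 1]

w = fit_logistic(X, y)
print(w, predict(w, [0.1]), predict(w, [0.9]))
```

The raw values in the left panel correspond to predicted probabilities like these; the presence/absence panel requires an additional threshold, whose choice is not shown here.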

Random Forest

(left) Random Forest Model, Raw Values
(right) Random Forest Model Prediction, Presence/Absence

Bioclim

(left) Bioclim Model, Raw Values
(right) Bioclim Model Prediction, Presence/Absence
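
A simplified sketch of a Bioclim-style score follows. It is an assumption-laden reading of the percentile idea quoted above, not the dismo implementation: each covariate value is ranked against the values at training sites, both tails are treated alike, the worst covariate determines the score, and the result is rescaled to 0..1. The covariate names and values are hypothetical.

```python
def percentile_rank(value, training_values):
    """Fraction of training values below `value`, counting ties as half."""
    less = sum(v < value for v in training_values)
    equal = sum(v == value for v in training_values)
    return (less + 0.5 * equal) / len(training_values)

def bioclim_score(cell, training):
    """Score in [0, 1]: 1 at the median of every covariate, 0 outside any range.

    cell:     {covariate: value} for one location
    training: {covariate: [values at presence sites]}
    """
    tails = []
    for name, values in training.items():
        p = percentile_rank(cell[name], values)
        tails.append(min(p, 1.0 - p))   # upper tail folded onto the lower tail
    return 2.0 * min(tails)             # the most limiting covariate wins

# Hypothetical covariate values at training (presence) sites.
training = {
    "cov_a": [10, 20, 30, 40, 50],
    "cov_b": [100, 200, 300, 400, 500],
}

print(bioclim_score({"cov_a": 30, "cov_b": 300}, training))
print(bioclim_score({"cov_a": 20, "cov_b": 300}, training))
print(bioclim_score({"cov_a": 30, "cov_b": 50}, training))
```

Taking the minimum across covariates means a single unsuitable covariate is enough to make a location unsuitable, regardless of how well the others match.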

Bioclim Model Evaluation
(left) Regular
(right) Spatial Sorting Bias Corrected
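
The spatial sorting bias mentioned earlier can be approximated with a simple distance ratio. The sketch below assumes planar coordinates and uses training presence points as the reference dataset; the function names and coordinates are hypothetical.

```python
import math

def nearest_distance(point, reference):
    """Distance from one point to the closest point in the reference set."""
    return min(math.hypot(point[0] - rx, point[1] - ry) for rx, ry in reference)

def mean_nearest_distance(points, reference):
    return sum(nearest_distance(p, reference) for p in points) / len(points)

def spatial_sorting_bias(test_presence, test_absence, train_presence):
    """Ratio of mean nearest-neighbour distances to training presence points.

    A ratio near 1 suggests testing presence and testing absence points are
    similarly far from the training data; a ratio near 0 suggests testing
    presences sit much closer to it, which can inflate AUC.
    """
    d_presence = mean_nearest_distance(test_presence, train_presence)
    d_absence = mean_nearest_distance(test_absence, train_presence)
    return d_presence / d_absence

# Hypothetical planar coordinates.
train_presence = [(0.0, 0.0), (1.0, 0.0)]
test_presence = [(0.0, 1.0), (1.0, 1.0)]
test_absence = [(0.0, 4.0), (5.0, 0.0)]

print(spatial_sorting_bias(test_presence, test_absence, train_presence))
```

One way to correct for the bias is to subsample the testing absence points so that their distances to the training data resemble those of the testing presence points, then re-evaluate the model on the subsample.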

Mahalanobis distance

(left) Mahalanobis Distance Model, Raw Values
(right) Mahalanobis Distance Model Prediction, Presence/Absence
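
To show why Mahalanobis distance accounts for correlation and is independent of measurement scale, here is a two-covariate sketch using the closed-form inverse of a 2x2 covariance matrix. The covariate values are invented for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def mahalanobis_2d(point, xs, ys):
    """Mahalanobis distance from `point` to the centre of the (xs, ys) cloud."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance of the two covariates.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2x2.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical values of two positively correlated covariates at presence sites.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

along = mahalanobis_2d((5, 5), xs, ys)     # follows the correlation
against = mahalanobis_2d((5, 1), xs, ys)   # goes against the correlation

# Rescaling the first covariate (e.g. different units) leaves the distance unchanged.
rescaled = mahalanobis_2d((5000, 5), [x * 1000 for x in xs], ys)
print(along, against, rescaled)
```

Both test points are equally far from the centre in Euclidean terms, but the one that contradicts the observed correlation is three times farther in Mahalanobis terms.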

Question 3

Summarize the key results. How valid are your predictions? How appropriate is the assessment across the entire region?

Combining model results

These models provide a range of expected areas of spread, but the Random Forest, Bioclim, and Mahalanobis models all show a similar trend: areas closer to the coast are more likely to be affected by Agent X. The model evaluation indicates that the GLM performs no better than random, so we exclude it from our analysis.

The final model is a combination of the Random Forest, Bioclim, and Mahalanobis models. Instead of a simple mean of predicted model results, we develop a weighted mean of predicted surfaces to give more weight to higher AUC models.

Weighted Mean of Predicted Models
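
The weighting step can be sketched as below. The exact weighting function is an assumption: raw AUC values are used as weights here, and other choices (for example, weights based on how far each AUC exceeds 0.5) are equally plausible. The surfaces and AUC values are invented, and the surfaces are assumed to be on a comparable 0..1 scale.

```python
def weighted_mean_surface(surfaces, weights):
    """Cell-by-cell weighted mean of equally sized 2-D prediction grids."""
    total = sum(weights)
    n_rows, n_cols = len(surfaces[0]), len(surfaces[0][0])
    return [
        [
            sum(w * s[r][c] for s, w in zip(surfaces, weights)) / total
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]

# Hypothetical 1x2 prediction grids for the three retained models.
rf = [[0.8, 0.2]]
bc = [[0.6, 0.4]]
mh = [[0.4, 0.0]]

# Hypothetical AUC values used as weights (not the values from this analysis).
aucs = [0.9, 0.8, 0.7]

combined = weighted_mean_surface([rf, bc, mh], aucs)
print(combined)
```

With equal weights this reduces to the simple mean; unequal weights pull the combined surface toward the better-performing models.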

Given the strong geographic clustering of Agent X observations in Yucatan, Mexico, my model predictions are most valid for the Eastern coast of Mexico. The Western coast and Northern inland parts of the study extent may be subject to slightly different environmental drivers, which may affect our predictions there. Model AUCs are generally above 0.75, indicating that these models discriminate reasonably well between presence and background points.

Question 4

What additional information would you need to improve your estimates? (For example, discuss the quality of the existing data and/or identify what other fields could be added to the input dataset or covariate brick). How could field epidemiologists or Ministries of Health assist with this?

The resolution of available data was excellent, but additional covariate data would improve the quality of our models. Environmental covariates such as temperature, precipitation, and land use would help us further understand environmental drivers, potentially at a higher resolution. The binary nature of the Rural/Urban covariate is especially limiting; a land use raster would be much more useful to characterize the extent of potential spread. Also, given Agent X’s predicted occurrence near the coast, we have reason to suspect that sea surface temperature or precipitation might influence its behavior. As a quick test:

(left) Random Forest model result after adding Precipitation covariate
(middle) Random Forest AUC before addition of Precipitation layer
(right) Random Forest AUC after addition of Precipitation layer

The AUC of the Random Forest model improves when mean annual precipitation is added as a covariate. This suggests that better covariate data would help us further narrow down the preferred environments for Agent X.

Field epidemiologists and Ministries of Health can also help provide surveillance information which would be beneficial to this analysis. Field reports about the conditions in which Agent X was contracted and spread can be immensely beneficial in characterizing pathogen properties and human susceptibilities. Field epidemiologists can provide valuable information about transmission pathways, local environments and behaviors, which we can use to further quantify and develop models.

The Ministries of Health can also assist modeling by providing higher-resolution data. I was unable to investigate seasonality in this analysis because only the year of each observation was reported. While this is useful, information such as month or week would help us understand the seasonality of Agent X and calibrate model covariates accordingly. It could also help target intervention and prevention strategies through increased biosecurity in target regions during key periods.

References

Hijmans RJ and Elith J (2017). Species distribution modeling with R.
Hijmans, R.J., Phillips, S., Leathwick, J. and Elith, J. (2011), Package ‘dismo’. Available online at: http://cran.r-project.org/web/packages/dismo/index.html.
Naimi B and Araujo MB (2016). “sdm: a reproducible and extensible R platform for species distribution modelling.” Ecography, 39, pp. 368-375. doi: 10.1111/ecog.01881.
Barbet-Massin, Morgane, et al. “Selecting pseudo-absences for species distribution models: how, where and how many?” Methods in Ecology and Evolution 3.2 (2012): 327-338.
VanDerWal, Jeremy, et al. “Selecting pseudo-absence data for presence-only distribution modeling: how far should you stray from what you know?” Ecological Modelling 220.4 (2009): 589-594.
Hijmans, Robert J. “Cross-validation of species distribution models: removing spatial sorting bias and calibration with a null model.” Ecology 93.3 (2012): 679-688.
Hanberry, Brice B., Hong S. He, and Brian J. Palik. “Pseudoabsence generation strategies for species distribution models.” PLoS ONE 7.8 (2012): e44486.

Geospatial Researcher Test, Atlas of Baseline Risk Assessment for Infectious Disease (ABRAID) Team

Submitted to Institute for Health Metrics & Evaluation

Aishwarya Venkat

August 10, 2017

Data Exploration