Note: All code used to generate this document can be found here
(top) Number of Cases of Agent X by Year
We start by exploring the data provided for this exercise:
Provided Covariates
(top) Correlation plot of Provided Covariates
(left) Location of Agent X cases
Briefly describe what analytical methods could be used to determine whether Agent X is likely to be present or not for the entire region of interest. Note that the region extent is defined by the covariate data layers.
Suitable Locations per at least Four Covariates
A slightly more robust approach involves characterizing the “background” environment, or generating pseudo-absence data to account for the fact that we only have data at some locations. This approach is detailed as a response to Question 2.
If a wider variety of covariates were available, a geographically weighted regression or principal components analysis could also be used to narrow down the key drivers, and logistic regression models could then be developed to assess the potential extent of Agent X.
Demonstrate one application of the methods outlined in Question 1 using reproducible code. Please plot your results as a map across the geographic area outlined by the covariate brick. Why did you choose this method over any of the others? What assumptions have you made (if any) concerning the data?
(left) Random Background and (right) Restricted Background samples
I chose to investigate the background method over the others, primarily because a robust understanding of species distribution requires an understanding of locations where Agent X may be absent. I also chose not to work with pseudo-absences, since we don’t have a good understanding of where Agent X does not occur. The background method instead characterizes differences in the environment, which is a safer approach that involves fewer assumptions given the limited data.
To develop background points, I tested two approaches: random sampling within the study area, and restricted sampling within a 100 km radius of the observation points. I tested both because the size of the background can significantly affect prediction results (Barbet-Massin, 2012; VanDerWal, 2009).
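The two sampling schemes can be sketched as follows. This is a minimal illustration in plain Python, not the code used to produce the figures: the extent, the presence coordinates, the sample sizes, and the function names are all hypothetical, and points are drawn uniformly in longitude/latitude rather than from the covariate raster cells.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def random_background(extent, n, rng):
    """Draw n points uniformly (in degrees) inside extent = (xmin, xmax, ymin, ymax)."""
    xmin, xmax, ymin, ymax = extent
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n)]


def restricted_background(extent, presences, n, radius_km, rng, max_tries=100000):
    """Rejection-sample n points that lie within radius_km of any presence point."""
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        lon, lat = random_background(extent, 1, rng)[0]
        if any(haversine_km(lon, lat, plon, plat) <= radius_km for plon, plat in presences):
            points.append((lon, lat))
    return points


# Hypothetical inputs: a bounding box and three made-up presence locations.
rng = random.Random(42)
extent = (-100.0, -86.0, 14.0, 24.0)
presences = [(-89.6, 20.9), (-88.3, 20.7), (-87.5, 20.2)]

random_bg = random_background(extent, 200, rng)
restricted_bg = restricted_background(extent, presences, 50, 100.0, rng)
print(len(random_bg), len(restricted_bg))
```

A real workflow would also discard points that fall on masked (for example, ocean) cells of the covariate layers; that step is omitted here.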
Evaluation of Background Fits
(top) Random Background
(bottom) Restricted Background
The evaluation shows that the first subset (Random Background) separates the presence and absence distributions more distinctly and has a high Area Under the Curve (AUC) value, indicating stronger discriminating power. We therefore use the Random Background method for the next steps.
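For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen presence point receives a higher model score than a randomly chosen background point. The sketch below computes it directly from that definition; the scores are invented for illustration and are not results from this analysis.

```python
def auc(presence_scores, background_scores):
    """AUC as the share of presence/background pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Hypothetical model scores at presence and background locations.
presence_scores = [0.9, 0.8, 0.6, 0.55]
background_scores = [0.7, 0.4, 0.3, 0.2, 0.1]
example_auc = auc(presence_scores, background_scores)
print(example_auc)
```

A value of 0.5 corresponds to a model that ranks presence and background points no better than chance, and 1.0 to perfect separation.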
To cross-validate the model, I split the presence and absence datasets into training and testing subsets. For this example, I use K-Fold Partitioning with 5 groups. I also correct for spatial sorting bias, “the difference between two point data sets in the average distance to the nearest point in a reference dataset” (Hijmans, 2009).
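The partitioning step can be sketched as below. This is an illustrative stand-in, not the original code: the record counts are hypothetical, and the helper names `kfold` and `split` are introduced here only for the example. Each record is assigned to one of 5 groups of near-equal size, and one group is held out for testing.

```python
import random


def kfold(n, k, rng):
    """Return n fold labels in 1..k, shuffled, with near-equal group sizes."""
    labels = [(i % k) + 1 for i in range(n)]
    rng.shuffle(labels)
    return labels


def split(records, labels, test_fold):
    """Split records into (train, test) using test_fold as the held-out group."""
    train = [r for r, g in zip(records, labels) if g != test_fold]
    test = [r for r, g in zip(records, labels) if g == test_fold]
    return train, test


rng = random.Random(1)
k = 5

# Hypothetical record IDs standing in for presence and background points.
presence = list(range(23))
background = list(range(100))

presence_folds = kfold(len(presence), k, rng)
background_folds = kfold(len(background), k, rng)

# Hold out group 1 for testing; train on the other four groups.
pres_train, pres_test = split(presence, presence_folds, test_fold=1)
back_train, back_test = split(background, background_folds, test_fold=1)
print(len(pres_train), len(pres_test), len(back_train), len(back_test))
```

Repeating the split with each group held out in turn gives five train/test evaluations per model.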
K-Fold Partitioning with Random Background Samples
* : Training, Absence
* : Testing, Absence
+ : Training, Presence
+ : Testing, Presence
To model the probability surface based on the covariates, I fit four candidate models and then combine their predictions:
GLM
(left) GLM Model, Raw Values
(right) GLM Model Prediction, Presence/Absence
Random Forest
(left) Random Forest Model, Raw Values
(right) Random Forest Model Prediction, Presence/Absence
Bioclim
(left) Bioclim Model, Raw Values
(right) Bioclim Model Prediction, Presence/Absence
Bioclim Model Evaluation
(left) Regular
(right) Spatial Sorting Bias Corrected
Mahalanobis distance
(left) Mahalanobis Distance Model, Raw Values
(right) Mahalanobis Distance Model Prediction, Presence/Absence
Summarize the key results. How valid are your predictions? How appropriate is the assessment across the entire region?
These models give a range of expected areas of spread, but the Random Forest, Bioclim, and Mahalanobis models all show a similar trend: areas closer to the coast are more likely to be affected by Agent X. The model evaluation indicates that the GLM performs no better than random, so we exclude it from our analysis.
The final model is a combination of the Random Forest, Bioclim, and Mahalanobis models. Instead of taking a simple mean of the predicted surfaces, we compute a weighted mean that gives more weight to models with a higher AUC.
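One way to build such a weighted mean is sketched below. The weighting rule, (AUC - 0.5) squared, is an assumption chosen for illustration (it gives zero weight to a model that is no better than chance); the text does not state the exact weights used. The AUC values and the 2 x 2 surfaces are hypothetical, and the surfaces are assumed to be rescaled to a common 0 to 1 range beforehand.

```python
def weighted_mean_surface(surfaces, aucs):
    """Cell-wise weighted mean of equally sized grids, weighted by (AUC - 0.5)^2."""
    weights = [(a - 0.5) ** 2 for a in aucs]  # assumed weighting rule
    total = sum(weights)
    if total == 0:
        raise ValueError("all models have AUC 0.5; weights are undefined")
    rows, cols = len(surfaces[0]), len(surfaces[0][0])
    return [
        [sum(w * s[r][c] for w, s in zip(weights, surfaces)) / total for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2 x 2 prediction surfaces, each already scaled to 0..1.
random_forest = [[1.0, 0.8], [0.2, 0.0]]
bioclim = [[0.0, 0.6], [0.4, 0.0]]
mahalanobis = [[0.0, 0.4], [0.6, 0.0]]

# Hypothetical AUC values for the three models.
aucs = [0.9, 0.7, 0.6]

combined = weighted_mean_surface([random_forest, bioclim, mahalanobis], aucs)
print(combined)
```

With these made-up numbers the Random Forest surface dominates the combined result, because its weight (0.16) is four times that of Bioclim (0.04) and sixteen times that of Mahalanobis (0.01).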
Weighted Mean of Predicted Models
Given the strong geographic clustering of Agent X observations in Yucatan, Mexico, my model predictions are most valid for the eastern coast of Mexico. The western coast and northern inland parts of the study extent may be subject to slightly different environmental drivers, which may affect our predictions. Model AUCs are mostly above 0.75, indicating that the models fit the data reasonably well.
What additional information would you need to improve your estimates? (For example, discuss the quality of the existing data and/or identify what other fields could be added to the input dataset or covariate brick). How could field epidemiologists or Ministries of Health assist with this?
The resolution of the available data was excellent, but additional covariate data would improve the quality of our models. Environmental covariates such as temperature, precipitation, and land use would help us further understand environmental drivers, potentially at a higher resolution. The binary nature of the Rural/Urban covariate is especially limiting; a land use raster would be much more useful for characterizing the extent of potential spread. Also, given Agent X’s predicted occurrence near the coast, we have reason to suspect that sea surface temperature or precipitation might influence its behavior. As a quick test:
(left) Random Forest model result after adding Precipitation covariate
(middle) Random Forest AUC before addition of Precipitation layer
(right) Random Forest AUC after addition of Precipitation layer
The AUC of the random forest model is improved by adding mean annual precipitation as a covariate. Therefore, better covariate data would help us further narrow down the preferred environments for Agent X.
Field epidemiologists and Ministries of Health can help by providing surveillance information. Field reports about the conditions in which Agent X was contracted and spread would be valuable for characterizing pathogen properties and human susceptibilities. Field epidemiologists can also provide information about transmission pathways, local environments, and behaviors, which we can use to further quantify and develop models.
The Ministries of Health can also assist modeling by providing data at a finer temporal resolution. I was unable to investigate seasonality in this analysis because only the year of each case was reported. While the year is useful, information such as the month or week would help us understand the seasonality of Agent X and calibrate model covariates accordingly. It could also help target intervention and prevention strategies, for example through increased biosecurity in target regions during key periods.