Outline of the problem

We are attempting to fit a spatio-temporal GAM to county-level estimates of R0 (the basic reproductive number in epidemiology) for a disease across Texas. We are primarily interested in the spatial effects of ecoregion (i.e. macro-habitat type) and human population density, but want to make sure we account appropriately for spatial autocorrelation. We are also interested in patterns over time.

We have spatial data on 11 ecoregions, and have calculated the proportions of each county that are covered by each ecoregion (some counties have as many as four regions, while others have only a single region). We have human population density for each county. In total, we have 2,737 estimates spread spatially across 94 of the 254 counties in Texas and temporally across the 216 months between January 2000 and December 2018. Some counties are well sampled (maximum of 179 data points for a single county), while 13 counties have a single data point. The distribution of data points across counties looks roughly exponential, but there is a spatial bias such that more populated counties from central to southeast Texas are better sampled. Estimates for R0 range from 1.14 to 2.05, with a median of 1.72.

We have fit a number of models (using a Gaussian error distribution) that have included the spatial predictors: ecoregion, (lat, lon) (which is approximately—deviates slightly—from a square grid in Texas), and log(human population density), and the temporal predictors: month, and year. We have failed to specify a model that has less than 0.8 estimated concurvity for both spatial variables when two of the three spatial variables are included and 0.95 estimated concurvity when all three are included. In all of the models that we have fit, this has caused estimates to blow up (e.g. a given ecoregion will be estimated to have an effect of increasing R0 by a large amount, but low population density will be estimated to decrease R0 by a large amount). In the end this results in accurate predictions, but unrealistic parameter estimates.

We have tried models with many combinations of the following: thin plate splines for (lat, lon); tensor products for (lat, lon); Markov random fields for counties (NAs for missing counties with drop.unused.levels = FALSE); splines for 10/11 ecoregions (compositional); linear effect of ecoregion; Markov random fields for ecoregion (each county assigned to the ecoregion that is dominant in that county); splines for log(population density); linear effect of population density.

gam.check( ) plots and output from this function for two models are given below. The top output is for a model with s(lat, lon) + s(ecoregion, bs = “mrf”, xt = xt) + s(Month) + s(Year). We don’t notice any obvious problems with the gam.check( ) plots; there are approximately five outliers, but it would be surprising (?) if those alone are driving our problems. gam.check() with s(ecoregion, bs = “mrf”, xt = xt) gives plots but gives the error below, which I haven’t gotten to the bottom of. The output on the bottom is for a model with ecogregion as a parametric factor.

Maybe we are just expecting too much for the data we have. Any thoughts on the best model for these data are appreciated.

Plots/output

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 19 iterations.
## The RMS GCV score gradient at convergence was 9.482236e-08 .
## The Hessian was not positive definite.
## Model rank =  122 / 122
## Error in .subset2(x, i, exact = exact): subscript out of bounds

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 13 iterations.
## The RMS GCV score gradient at convergence was 4.254451e-08 .
## The Hessian was not positive definite.
## Model rank =  122 / 122 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##             k'   edf k-index p-value    
## s(X,Y)   87.00 84.79    0.91  <2e-16 ***
## s(Month) 11.00  9.77    0.66  <2e-16 ***
## s(Year)  13.00  8.20    0.75  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1