In the following, I will use geostatistics to explore Professor Hugh Sturrock’s data on soil-transmitted helminths prevalence in western Kenya with a focus on T. Trichiura, or the human whipworm. T. Trichiura is one of three well-documented soil-transmitted helminths common to Asia and Sub-Saharan Africa. In human beings, it causes a neglected tropical disease known as trichuriasis. Understanding the geographic distribution of soil-transmitted helminths like whipworm is important for developing anthelminthic treatment strategies in tropical countries.
A sizable body of evidence out of several tropical countries points to a potential correlation between altitude and T. Trichiura, as well as annual precipitation and T. Trichiura. Recent studies including Chaiyos et. al.’s 2018 study out of Thailand, as well as older studies like Appleton et. al.’s 1995 study in South Africa, find that prevalence of T. trichiura was significantly correlated with annual precipitation and annual rainfall. Chammartin et. al.’s 2013 study in Bolivia finds that altitude has a negative effect on T. Trichiura.
A smaller set of studies in East Africa point to a correlation between T. Trichiura prevalence and proximity to Lake Victoria. Handzel et. al.’s 2003 study out of western Kenya finds that closer proximity to Lake Victoria was associated with increased T. trichiura infection rates among school-aged children. A similar 2015 study in Tanzania by Siza et. al. finds high whipworm infection rates among residents of Tanzania’s Lake Victoria Basin.
Based on this short literature review, I investigated the following two hypotheses using geostatistical methods:
In the following, I will discuss my methods which included pre-processing the whipworm prevalence data, cluster analyses, processing model covariates, spatial and non-spatial regression analyses, model cross-validation, and risk mapping.
My first step to examine my hypothesis was to pre-process the whipworm prevalence data in preparation for a cluster analysis. The “raw” whipworm prevalence data had a sample size of 68 prevalence points. My data pre-processing steps included generating a histogram, which revealed that the data was not normally distributed. I conducted a logit transformation on this data to calculate log whipworm prevalence and dropped any NA values in the log whipworm prevalence variable. This reduced my sample size to 61 observations.
The logit transformation did not correct the skew in the data, as shown in the plot below.
The limitations of using prevalence points that are not normally distributed is detailed in the conclusion.
After pre-processing the whipworm prevalence data, I examined the data for instances of global and local clustering.
To examine instances of global clustering, I generated correlograms to determine the highest distance classes at which statistically significant positive spatial autocorrelation could be discerned. I then used a Monte Carlo simulation of the global Moran’s I test to determine the Moran’s I test statistic.
The neighbor structure used within my Moran’s I tests followed a k-nearest neighbors approach with the parameters k = 1 and row-standardized weights. With these parameters, only immediately adjacent points were defined as “neighbors,” and each neighbor linkage was assigned a spatial weight value equivalent to the 1 divided by the number of neighbors.
Correlograms calculated with 15 and 20 distance classes initially revealed that spatial autocorrelation exists up to about 0.17 decimal degrees and I conducted a Monte Carlo simulation of the Moran’s I test to confirm the value of the test statistic, as shown below.
##
## Monte-Carlo simulation of Moran I
##
## data: .
## weights: weights
## number of simulations + 1: 1000
##
## statistic = 0.5979, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
The Monte Carlo simulation reveals that to a high degree of statistical significance, there is positive global clustering in our whipworm prevalence data with a Moran’s I statistic of 0.597.
However, running the checks for neighbor symmetry below, we find that neighbors are not symmetrical:
## d1 d2
## TRUE TRUE
## d1 d2
## 6 3
This limitation in assigning neighbors and spatial weights to identify clusters is explored in the conclusion.
To examine instances of local clustering, I decomposed the global Moran’s I statistic into its local parts by generating windows of Local Instances of Spatial Autocorrelation (LISAs) around each whipworm prevalence point. The LISAs for each point show several instances of local clustering, as shown below.
## Ii E.Ii Var.Ii Z.Ii Pr(z > 0)
## 1 5.9510570 -0.0166667 0.2259525 12.5545 < 2.2e-16 ***
## 2 3.1784448 -0.0166667 0.2259525 6.7217 8.982e-12 ***
## 3 5.2396479 -0.0166667 0.2259525 11.0579 < 2.2e-16 ***
## 4 5.5483026 -0.0166667 0.2259525 11.7072 < 2.2e-16 ***
## 5 3.2066525 -0.0166667 0.2259525 6.7810 5.967e-12 ***
## 6 0.1201696 -0.0166667 0.3063512 0.2472 0.402367
## 7 0.0233194 -0.0166667 0.3063512 0.0722 0.471204
## 8 0.0311102 -0.0166667 0.2259525 0.1005 0.459970
## 9 -0.0011243 -0.0166667 0.3063512 0.0281 0.488799
## 10 -1.0901757 -0.0166667 0.9495402 -1.1017 0.864696
## 11 -0.7200786 -0.0166667 0.1777134 -1.6686 0.952401
## 12 0.3629910 -0.0166667 0.9495402 0.3896 0.348411
## 13 0.3255957 -0.0166667 0.2259525 0.7200 0.235753
## 14 -0.3748589 -0.0166667 0.4671484 -0.5241 0.699885
## 15 0.9007664 -0.0166667 0.2259525 1.9300 0.026801 *
## 16 0.5110948 -0.0166667 0.3063512 0.9535 0.170164
## 17 0.8955145 -0.0166667 0.3063512 1.6481 0.049671 *
## 18 0.0118733 -0.0166667 0.2259525 0.0600 0.476062
## 19 0.0146410 -0.0166667 0.2259525 0.0659 0.473743
## 20 -0.3591299 -0.0166667 0.1455539 -0.8976 0.815311
## 21 0.0117781 -0.0166667 0.1777134 0.0675 0.473102
## 22 -0.1023434 -0.0166667 0.2259525 -0.1802 0.571518
## 23 0.0015268 -0.0166667 0.3063512 0.0329 0.486889
## 24 0.1442948 -0.0166667 0.4671484 0.2355 0.406910
## 25 0.2787526 -0.0166667 0.9495402 0.3032 0.380881
## 26 0.0140307 -0.0166667 0.4671484 0.0449 0.482088
## 27 0.0182245 -0.0166667 0.9495402 0.0358 0.485718
## 28 0.3640302 -0.0166667 0.0536698 1.6433 0.050161 .
## 29 0.3668178 -0.0166667 0.0536698 1.6553 0.048929 *
## 30 0.0231878 -0.0166667 0.0589708 0.1641 0.434819
## 31 0.2059328 -0.0166667 0.0450556 1.0487 0.147159
## 32 0.0187641 -0.0166667 0.0536698 0.1529 0.439223
## 33 0.2443487 -0.0166667 0.0306987 1.4897 0.068148 .
## 34 0.1443796 -0.0166667 0.0286104 0.9521 0.170520
## 35 0.4419216 -0.0166667 0.0329958 2.5246 0.005791 **
## 36 0.2899394 -0.0166667 0.0490755 1.3840 0.083173 .
## 37 0.3793914 -0.0166667 0.0204894 2.7669 0.002830 **
## 38 0.2813866 -0.0166667 0.0536698 1.2866 0.099124 .
## 39 0.0297390 -0.0166667 0.1053546 0.1430 0.443157
## 40 -0.0185104 -0.0166667 0.0355347 -0.0098 0.503902
## 41 0.8881212 -0.0166667 0.0267038 5.5368 1.540e-08 ***
## 42 0.4167058 -0.0166667 0.0355347 2.2990 0.010753 *
## 43 0.3070451 -0.0166667 0.0249560 2.0491 0.020224 *
## 44 0.3630882 -0.0166667 0.0267038 2.3239 0.010065 *
## 45 -0.1973534 -0.0166667 0.0724642 -0.6712 0.748960
## 46 0.2697499 -0.0166667 0.0450556 1.3493 0.088613 .
## 47 0.2193869 -0.0166667 0.0450556 1.1121 0.133052
## 48 0.4345555 -0.0166667 0.0249560 2.8563 0.002143 **
## 49 0.3366259 -0.0166667 0.0286104 2.0887 0.018368 *
## 50 0.7223797 -0.0166667 0.0249560 4.6783 1.447e-06 ***
## 51 -0.3746710 -0.0166667 0.0919548 -1.1806 0.881118
## 52 0.2099439 -0.0166667 0.0329958 1.2475 0.106102
## 53 -0.0983158 -0.0166667 0.0355347 -0.4331 0.667542
## 54 0.1732682 -0.0166667 0.0490755 0.8574 0.195618
## 55 0.2543117 -0.0166667 0.0383557 1.3836 0.083236 .
## 56 0.5027408 -0.0166667 0.0383557 2.6521 0.003999 **
## 57 0.3142038 -0.0166667 0.0355347 1.7552 0.039611 *
## 58 0.5847013 -0.0166667 0.0589708 2.4764 0.006636 **
## 59 1.4106691 -0.0166667 0.4671484 2.0883 0.018384 *
## 60 0.9263473 -0.0166667 0.9495402 0.9677 0.166586
## 61 1.8949908 -0.0166667 0.9495402 1.9618 0.024893 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, the plot of observed vs. expected whipworm prevalence (log) shown below indicates that there are not many outliers in the data, only 7 outliers. This means that most of the data does not deviate from expectation, where expectation is defined as the weighted mean of each point’s neighbors.
Following a cluster analysis, I used geostatistics to discern the correlation between whipworm prevalence and distance to waterbodies, distance to transit, altitude, and annual precipitation. I will briefly describe 1) the repository from which each covariate was pulled, and 2) transformations applied on covariates.
The following describes each covariate, the data source it was pulled from, and relevant sample sizes.
Here is a visualization of the vector-based covariates listed above:
In the map above, points in shades of red are whipworm prevalence points, blue lines and polygons are inland waterbodies and grey lines are roads and railroads. For illustration purposes, I have clipped these features to the bounding box of the whipworm prevalence data, which is in and around western Kenya.
After loading the covariates, I rasterized all vector data, resampled the rasters, and calculated distances between each prevalence point and the nearest lake, rivers/canal, road, and railroad.
I rasterized my four covariates containing vector data from the DCW database:
First, I re-projected all of my vector data into the WGS84 coordinate reference system:
Next, I rasterized each covariate with a step-wise process. Here is the process used for rasterizing the roads shapefile, which contains spatial lines:
A plot of the newly produced raster (in this case, roads) revealed that rasterization worked:
To rasterize my lakes vector data, I applied a method for rasterizing spatial polygons as shown below:
# rasterize lakes ----
ext <- extent(33.90959, 41.92622, -4.720417, 5.061166)
xy <- ext %>% bbox() %>% as.matrix() %>% apply(1, diff) %>% abs()
n <- 5
r <- raster(ext, ncol=xy[1]*n, nrow=xy[2]*n)
lakes_r <- lakes %>% rasterize(r, fun='first')
This also successfully rasterized my lakes data:
After rasterizing my vector data, I cropped each of the 6 raster layers to match the extent of the Kenya Admin 1 Level per the GADM, then resampled all 6 of my raster layers to the lowest resolution raster of the set: the altitude raster layer.
Because the annual precipitation data is continuous, I resampled using a nearest-neighbor approach, which assigns the value of the nearest cell to cells in the prediction grid. The resampling procedure worked.
We can see that the extent and resolution of all of raster layers now match up, and are ready for calculating distances:
KEN_alt
## class : RasterLayer
## dimensions : 1130, 962, 1087060 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 33.90833, 41.925, -4.716667, 4.7 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## source : memory
## names : KEN_alt
## values : 0, 5778 (min, max)
KEN_bioclim_resamp
## class : RasterLayer
## dimensions : 1130, 962, 1087060 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 33.90833, 41.925, -4.716667, 4.7 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## source : memory
## names : bio12
## values : 173, 2493 (min, max)
KEN_canals_r_resamp
## class : RasterLayer
## dimensions : 1130, 962, 1087060 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 33.90833, 41.925, -4.716667, 4.7 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## source : memory
## names : layer
## values : 8.163305e-06, 0.01207146 (min, max)
KEN_roads_r_resamp
## class : RasterLayer
## dimensions : 1130, 962, 1087060 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 33.90833, 41.925, -4.716667, 4.7 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## source : memory
## names : layer
## values : 6.373713e-06, 0.01217149 (min, max)
KEN_railroads_r_resamp
## class : RasterLayer
## dimensions : 1130, 962, 1087060 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 33.90833, 41.925, -4.716667, 4.7 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## source : memory
## names : layer
## values : 0.0002596938, 0.006324815 (min, max)
KEN_lakes_r_resamp
## class : RasterLayer
## dimensions : 1130, 962, 1087060 (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333 (x, y)
## extent : 33.90833, 41.925, -4.716667, 4.7 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## source : memory
## names : layer
## values : 6, 244 (min, max)
I calculated distance between prevalence points to each feature using the distance command. This command served to generate raster layers of nothing but distances. Finally, I extracted values from the distance rasters as columns into my data.frame.
The extraction revealed two immediate problems with data availability: 1) I could not run regressions using distance to canals/lakes because all extractions had a value of 0:
## canals
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
And 2) I could not run regressions using distance to roads because only the following two values were non-zero extractions:
## roads
## 1 1855.3227
## 2 927.6611
I specified non-spatial generalized linear models (GLMs) with various combinations of the 4 covariates available (bio12, altitude, distance to lakes, and distance to railroads). All models specified included binomial outputs. Importantly, I determined that models that included distance to roads resulted in a negative coefficient and dropped this variable from the models. I used two criteria to determine model fit:
The model that fit this criteria was the 3-covariate non-spatial GLM with altitude, annual precipitation, and distance to lakes:
##
## Call:
## glm(formula = cbind(tri_np, round(tri_nneg)) ~ alt + bio12 +
## lakes, family = binomial(), data = STH_kenya_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.8508 -1.5117 -0.4945 0.9151 8.6723
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.765e+00 3.621e-01 13.159 < 2e-16 ***
## alt -2.604e-03 3.990e-04 -6.526 6.75e-11 ***
## bio12 -2.583e-03 2.781e-04 -9.289 < 2e-16 ***
## lakes 4.780e-05 2.828e-06 16.901 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1287.54 on 60 degrees of freedom
## Residual deviance: 645.31 on 57 degrees of freedom
## AIC: 890.42
##
## Number of Fisher Scoring iterations: 4
I also used the deviance residuals to evaluate model robustness. The third quartile (3Q) of the deviance residuals has a value of 0.91. This means that the first 75 percent of our model observations are within an acceptable distance of our predicted values, with “acceptable” being defined as a value less than 2.
After choosing the 4-covariate GLM, I ran a few diagnostics to determine the extent of residual spatial autocorrelation. I first examined a residual plot for any linear patterns:
The plot illustrates a lack of linearity. This indicates that these covariates, while statistically significant, still do not explain much of the spatial pattern underlying the whipworm prevalence data. I then used a correlogram to examine clustering effects. I took note of the highest distance class at which statistically significant residual clustering can be detected.
## Moran I statistic
## dist.class coef p.value n
## [1,] 0.05109279 0.0308233804 0.154947188 1132
## [2,] 0.15327888 -0.0833084652 0.908832873 480
## [3,] 0.25546447 -0.1017832993 0.642793997 96
## [4,] 0.35765005 -0.0280393801 0.489358328 182
## [5,] 0.45983564 -0.0450639612 0.789614671 1026
## [6,] 0.56202123 0.1492423106 0.004960484 412
## [7,] 0.66420681 -0.1049455990 0.839680895 166
## [8,] 0.76639240 0.0807562638 0.157565161 38
## [9,] 0.86857799 -0.0420579931 0.372533428 72
## [10,] 0.97076357 -0.0006592442 0.213781584 54
It seems that the Moran’s I statistic is not statistically significant at any of the distance classes. This indicates that we have no residual spatial autocorrelation. Our non-spatial GLM has independent and identically distributed residuals (iid), satisfying a key assumption of linear regression modeling.
However, I also ran a spatial regression regardless to see how the inclusion of a spatially-correlated fixed effect might improve model fit. I ran a Matern model with binomial outputs:
## formula: cbind(tri_np, round(tri_nneg)) ~ alt + bio12 + lakes + Matern(1 |
## lat + long)
## Estimation of corrPars and lambda by Laplace ML approximation (p_v).
## Estimation of fixed effects by Laplace ML approximation (p_v).
## Estimation of lambda by 'outer' ML, maximizing p_v.
## Family: binomial ( link = logit )
## ------------ Fixed effects (beta) ------------
## Estimate Cond. SE t-value
## (Intercept) 5.736e+00 2.187e+00 2.6232
## alt -6.905e-03 2.341e-03 -2.9495
## bio12 1.477e-03 1.691e-03 0.8734
## lakes 2.009e-05 2.409e-05 0.8337
## --------------- Random effects ---------------
## Family: gaussian ( link = identity )
## --- Correlation parameters:
## 1.nu 1.rho
## 0.4157388 13.9707489
## --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian;
## lat + long : 1.334
## # of obs: 61; # of groups: lat + long, 60
## ------------- Likelihood values -------------
## logLik
## p_v(h) (marginal L): -204.3296
I calculated 95% confidence intervals on each covariate and expressed them as odds ratios, as follows:
Next, I checked the residual plot and correlogram for the spatial regression to determine the effect of including the spatially-correlated fixed effect.
Here is the residual plot:
And here is the correlogram:
## Moran I statistic
## dist.class coef p.value n
## [1,] 0.05109279 -0.066685436 0.82411359 1132
## [2,] 0.15327888 -0.017140805 0.49704509 480
## [3,] 0.25546447 0.143182733 0.07316679 96
## [4,] 0.35765005 -0.058880569 0.61789255 182
## [5,] 0.45983564 -0.013031736 0.45413458 1026
## [6,] 0.56202123 0.058991876 0.14550497 412
## [7,] 0.66420681 -0.014941530 0.41596123 166
## [8,] 0.76639240 -0.009228469 0.32282485 38
## [9,] 0.86857799 -0.001231104 0.22160556 72
## [10,] 0.97076357 -0.011738657 0.24117375 54
Both the residual plot and correlogram indicates no spatial autocorrelation. The plot also shows a clear linear pattern between observed and expected values. This means that the inclusion of the spatially-correlated fixed effect improved our model fit.
Next, I validated the spatial regression using a 10-fold cross validation technique. This is a plot of cross-validated predictions vs. observed values:
The mean squared error of the cross-validated spatial model with 4 covariates was the smallest of the other model specifications explored as well:
## [1] 0.000582857
Finally, I used the spatial regression to generate a risk map of whipworm prevalence across Kenya, as shown below. To generate the risk map, I rasterized model covariates, created a rasterstack with column names that matched model covariate names, and used the Matern model specification to predict across the raster stack. The risk map is shown below, masked to Kenya Admin 1 bounds.
Two key findings can be delineated from this analysis:
Altitude, annual precipitation, and distance to lakes are correlated with whipworm prevalence to a statistically significant degree, with p-values of <6.75e-11, < 2e-16, and < 2e-16 respectively. Our non-spatial GLM also does not exhibit global clustering.
The inclusion of a spatially-correlated fixed effect improves model fit. However, the marginal impact of each of these factors on whipworm prevalence in western Kenya is extremely small. Here are the odds ratios:
The findings from this study indicated statistically significant marginal impacts of altitude, annual precipitation, distance to waterbodies, or transit accessibility on whipworm prevalence in Western Kenya, however, those impacts were found to be negligible. Thus, this study neither evidenced nor contradicted the primary and secondary hypothesis. Potential reasons for these results include the two factors that affected our analysis:
The first point limits our conclusions around the existence of clustering because the null hypothesis of the Moran’s I is a normal distribution. Our outcomes data should be normally distributed for the hypothesis test to be conclusive. The second point is problematic because it points to a logical fallacy; it means that if k is a neighbor of j, then j is not a neighbor of k. Constructing neighbor structures in a cluster analysis of the whipworm prevalence data yields asymmetrical results and this should be investigated.
Other potential reasons include the outdatedness and incompleteness of the DCW data. The DCW database was compiled only in 1992, and Langaas et. al. (1995) has pointed to the challenges of using this data for statistical analysis due its outdatedness and incompleteness.
Appleton, C. C., and E. Gouws. “The distribution of common intestinal nematodes along an altitudinal transect in KwaZulu-Natal, South Africa.” Annals of Tropical Medicine & Parasitology 90.2 (1996): 181-188.
Appleton, C. C., M. Maurihungirire, and E. Gouws. “The distribution of helminth infections along the coastal plain of Kwazulu-Natal province, South Africa.” Annals of Tropical Medicine & Parasitology 93.8 (1999): 859-868.
Chaiyos, J., et al. “MaxEnt modeling of soil-transmitted helminth infection distributions in Thailand.” Parasitology research 117.11 (2018): 3507-3517.
Chammartin, Frédérique, et al. “Modelling the geographical distribution of soil-transmitted helminth infections in Bolivia.” Parasites & vectors 6.1 (2013): 152.
Defense Mapping Agency (DMA), 1992. Digital Chart of the World. Defense Mapping Agency, Fairfax, Virginia. (Four CD-ROMs.)
Fick, S.E. and R.J. Hijmans, 2017. WorldClim 2: new 1km spatial resolution climate surfaces for global land areas. International Journal of Climatology 37 (12): 4302-4315.
Global Administrative Areas ( 2012). GADM database of Global Administrative Areas, version 2.0. [online] URL: www.gadm.org.
Handzel, Thomas, et al. “Geographic distribution of schistosomiasis and soil-transmitted helminths in Western Kenya: implications for anthelminthic mass treatment.” The American journal of tropical medicine and hygiene 69.3 (2003): 318-323.
Langaas, Sindre. Completeness of the Digital Chart of the World (DCW) database. UNEP/GRID-Arendal, 1995.
NASA Shuttle Radar Topography Mission (SRTM)(2013). Shuttle Radar Topography Mission (SRTM) Global. Distributed by OpenTopography. https://doi.org/10.5069/G9445JDF Accessed: 2020-05-03
Siza, Julius E., et al. “Prevalence of schistosomes and soil-transmitted helminths and morbidity associated with schistosomiasis among adult population in Lake Victoria Basin, Tanzania.” The Korean journal of parasitology 53.5 (2015): 525.