Introduction

In the following, I will use geostatistics to explore Professor Hugh Sturrock’s data on soil-transmitted helminths prevalence in western Kenya with a focus on T. Trichiura, or the human whipworm. T. Trichiura is one of three well-documented soil-transmitted helminths common to Asia and Sub-Saharan Africa. In human beings, it causes a neglected tropical disease known as trichuriasis. Understanding the geographic distribution of soil-transmitted helminths like whipworm is important for developing anthelminthic treatment strategies in tropical countries.

A sizable body of evidence out of several tropical countries points to a potential correlation between altitude and T. Trichiura, as well as annual precipitation and T. Trichiura. Recent studies including Chaiyos et. al.’s 2018 study out of Thailand, as well as older studies like Appleton et. al.’s 1995 study in South Africa, find that prevalence of T. trichiura was significantly correlated with annual precipitation and annual rainfall. Chammartin et. al.’s 2013 study in Bolivia finds that altitude has a negative effect on T. Trichiura.

A smaller set of studies in East Africa point to a correlation between T. Trichiura prevalence and proximity to Lake Victoria. Handzel et. al.’s 2003 study out of western Kenya finds that closer proximity to Lake Victoria was associated with increased T. trichiura infection rates among school-aged children. A similar 2015 study in Tanzania by Siza et. al. finds high whipworm infection rates among residents of Tanzania’s Lake Victoria Basin.

Based on this short literature review, I investigated the following two hypotheses using geostatistical methods:

  1. Closer proximity to waterbodies (i.e. lakes, rivers, and canals) is associated with higher prevalence of T. Trichiura in western Kenya. In other words, distance to lakes, distance to rivers/canals, and whipworm prevalence are negatively correlated.
  2. This association is tempered by altitude and annual precipitation as well as transit accessibility. In other words, lower altitudes and lower annual precipitation are associated with lower whipworm prevalence, and transit accessibility, which will be measured by a closer proximity to roads and railroads, is negatively correlated with whipworm prevalence.

Methods

In the following, I will discuss my methods which included pre-processing the whipworm prevalence data, cluster analyses, processing model covariates, spatial and non-spatial regression analyses, model cross-validation, and risk mapping.

Pre-processing whipworm prevalence data

My first step to examine my hypothesis was to pre-process the whipworm prevalence data in preparation for a cluster analysis. The “raw” whipworm prevalence data had a sample size of 68 prevalence points. My data pre-processing steps included generating a histogram, which revealed that the data was not normally distributed. I conducted a logit transformation on this data to calculate log whipworm prevalence and dropped any NA values in the log whipworm prevalence variable. This reduced my sample size to 61 observations.

The logit transformation did not correct the skew in the data, as shown in the plot below.

The limitations of using prevalence points that are not normally distributed is detailed in the conclusion.

Global and local cluster analysis

After pre-processing the whipworm prevalence data, I examined the data for instances of global and local clustering.

Global clustering

To examine instances of global clustering, I generated correlograms to determine the highest distance classes at which statistically significant positive spatial autocorrelation could be discerned. I then used a Monte Carlo simulation of the global Moran’s I test to determine the Moran’s I test statistic.

The neighbor structure used within my Moran’s I tests followed a k-nearest neighbors approach with the parameters k = 1 and row-standardized weights. With these parameters, only immediately adjacent points were defined as “neighbors,” and each neighbor linkage was assigned a spatial weight value equivalent to the 1 divided by the number of neighbors.

Correlograms calculated with 15 and 20 distance classes initially revealed that spatial autocorrelation exists up to about 0.17 decimal degrees and I conducted a Monte Carlo simulation of the Moran’s I test to confirm the value of the test statistic, as shown below.

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  . 
## weights: weights  
## number of simulations + 1: 1000 
## 
## statistic = 0.5979, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater

The Monte Carlo simulation reveals that to a high degree of statistical significance, there is positive global clustering in our whipworm prevalence data with a Moran’s I statistic of 0.597.

However, running the checks for neighbor symmetry below, we find that neighbors are not symmetrical:

##   d1   d2 
## TRUE TRUE
## d1 d2 
##  6  3

This limitation in assigning neighbors and spatial weights to identify clusters is explored in the conclusion.

Local clustering

To examine instances of local clustering, I decomposed the global Moran’s I statistic into its local parts by generating windows of Local Instances of Spatial Autocorrelation (LISAs) around each whipworm prevalence point. The LISAs for each point show several instances of local clustering, as shown below.

##            Ii       E.Ii     Var.Ii    Z.Ii Pr(z > 0)    
## 1   5.9510570 -0.0166667  0.2259525 12.5545 < 2.2e-16 ***
## 2   3.1784448 -0.0166667  0.2259525  6.7217 8.982e-12 ***
## 3   5.2396479 -0.0166667  0.2259525 11.0579 < 2.2e-16 ***
## 4   5.5483026 -0.0166667  0.2259525 11.7072 < 2.2e-16 ***
## 5   3.2066525 -0.0166667  0.2259525  6.7810 5.967e-12 ***
## 6   0.1201696 -0.0166667  0.3063512  0.2472  0.402367    
## 7   0.0233194 -0.0166667  0.3063512  0.0722  0.471204    
## 8   0.0311102 -0.0166667  0.2259525  0.1005  0.459970    
## 9  -0.0011243 -0.0166667  0.3063512  0.0281  0.488799    
## 10 -1.0901757 -0.0166667  0.9495402 -1.1017  0.864696    
## 11 -0.7200786 -0.0166667  0.1777134 -1.6686  0.952401    
## 12  0.3629910 -0.0166667  0.9495402  0.3896  0.348411    
## 13  0.3255957 -0.0166667  0.2259525  0.7200  0.235753    
## 14 -0.3748589 -0.0166667  0.4671484 -0.5241  0.699885    
## 15  0.9007664 -0.0166667  0.2259525  1.9300  0.026801 *  
## 16  0.5110948 -0.0166667  0.3063512  0.9535  0.170164    
## 17  0.8955145 -0.0166667  0.3063512  1.6481  0.049671 *  
## 18  0.0118733 -0.0166667  0.2259525  0.0600  0.476062    
## 19  0.0146410 -0.0166667  0.2259525  0.0659  0.473743    
## 20 -0.3591299 -0.0166667  0.1455539 -0.8976  0.815311    
## 21  0.0117781 -0.0166667  0.1777134  0.0675  0.473102    
## 22 -0.1023434 -0.0166667  0.2259525 -0.1802  0.571518    
## 23  0.0015268 -0.0166667  0.3063512  0.0329  0.486889    
## 24  0.1442948 -0.0166667  0.4671484  0.2355  0.406910    
## 25  0.2787526 -0.0166667  0.9495402  0.3032  0.380881    
## 26  0.0140307 -0.0166667  0.4671484  0.0449  0.482088    
## 27  0.0182245 -0.0166667  0.9495402  0.0358  0.485718    
## 28  0.3640302 -0.0166667  0.0536698  1.6433  0.050161 .  
## 29  0.3668178 -0.0166667  0.0536698  1.6553  0.048929 *  
## 30  0.0231878 -0.0166667  0.0589708  0.1641  0.434819    
## 31  0.2059328 -0.0166667  0.0450556  1.0487  0.147159    
## 32  0.0187641 -0.0166667  0.0536698  0.1529  0.439223    
## 33  0.2443487 -0.0166667  0.0306987  1.4897  0.068148 .  
## 34  0.1443796 -0.0166667  0.0286104  0.9521  0.170520    
## 35  0.4419216 -0.0166667  0.0329958  2.5246  0.005791 ** 
## 36  0.2899394 -0.0166667  0.0490755  1.3840  0.083173 .  
## 37  0.3793914 -0.0166667  0.0204894  2.7669  0.002830 ** 
## 38  0.2813866 -0.0166667  0.0536698  1.2866  0.099124 .  
## 39  0.0297390 -0.0166667  0.1053546  0.1430  0.443157    
## 40 -0.0185104 -0.0166667  0.0355347 -0.0098  0.503902    
## 41  0.8881212 -0.0166667  0.0267038  5.5368 1.540e-08 ***
## 42  0.4167058 -0.0166667  0.0355347  2.2990  0.010753 *  
## 43  0.3070451 -0.0166667  0.0249560  2.0491  0.020224 *  
## 44  0.3630882 -0.0166667  0.0267038  2.3239  0.010065 *  
## 45 -0.1973534 -0.0166667  0.0724642 -0.6712  0.748960    
## 46  0.2697499 -0.0166667  0.0450556  1.3493  0.088613 .  
## 47  0.2193869 -0.0166667  0.0450556  1.1121  0.133052    
## 48  0.4345555 -0.0166667  0.0249560  2.8563  0.002143 ** 
## 49  0.3366259 -0.0166667  0.0286104  2.0887  0.018368 *  
## 50  0.7223797 -0.0166667  0.0249560  4.6783 1.447e-06 ***
## 51 -0.3746710 -0.0166667  0.0919548 -1.1806  0.881118    
## 52  0.2099439 -0.0166667  0.0329958  1.2475  0.106102    
## 53 -0.0983158 -0.0166667  0.0355347 -0.4331  0.667542    
## 54  0.1732682 -0.0166667  0.0490755  0.8574  0.195618    
## 55  0.2543117 -0.0166667  0.0383557  1.3836  0.083236 .  
## 56  0.5027408 -0.0166667  0.0383557  2.6521  0.003999 ** 
## 57  0.3142038 -0.0166667  0.0355347  1.7552  0.039611 *  
## 58  0.5847013 -0.0166667  0.0589708  2.4764  0.006636 ** 
## 59  1.4106691 -0.0166667  0.4671484  2.0883  0.018384 *  
## 60  0.9263473 -0.0166667  0.9495402  0.9677  0.166586    
## 61  1.8949908 -0.0166667  0.9495402  1.9618  0.024893 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, the plot of observed vs. expected whipworm prevalence (log) shown below indicates that there are not many outliers in the data, only 7 outliers. This means that most of the data does not deviate from expectation, where expectation is defined as the weighted mean of each point’s neighbors.

Processing model covariates

Following a cluster analysis, I used geostatistics to discern the correlation between whipworm prevalence and distance to waterbodies, distance to transit, altitude, and annual precipitation. I will briefly describe 1) the repository from which each covariate was pulled, and 2) transformations applied on covariates.

Covariate data sources

The following describes each covariate, the data source it was pulled from, and relevant sample sizes.

  • Annual precipitation data (RasterLayer): The annual precipitation data was available on the WorldClimate database. I pulled the variable Bio12 which corresponds to annual precipitation at a resolution of 2.5 meters.
  • Altitude data (RasterLayer): The altitude data was aggregated from NASA’s Shuttle Radar Topography Mission (SRTM) in February 2000, which obtained 90 meter (3 arc-second) resolution between -60 and 60 latitude. This data provides altitude information at 90m resolution for many parts of the world. I pulled this data corresponding to the Kenya GADM administrative region.
  • Inland lakes (Polygons): The inland lakes data provides the locations of lakes across Kenya. This data comes from the Digital Chart of the World (DCW) inland waterbodies database. The DCW was commissioned by the United States Defense Mapping Agency around 1992 for global scientific research purposes and to support military analysis. I only downloaded a subset of the DCW inland waterbodies database corresponding to Kenya’s national borders. This subset had a sample size of 255 inland lakes.
  • Inland rivers and canals (Lines): The inland rivers and canals data provides the locations of inland rivers and canals in Kenya. This data comes from the DCW inland waterbodies database. I only downloaded a subset of this database which corresponds to Kenya’s national borders. This subset had a sample size of 2569 inland lakes.
  • Roads (Lines): The roads data provides the locations of roads in Kenya. This data comes from the DCW roads database. I only downloaded a subset of this database which corresponds to Kenya’s national borders. This subset had a sample size of 2097 roads.
  • Railroads (Lines): The railroads data provides the locations of railroads in Kenya. This data comes from the DCW railroads database. I only downloaded a subset of this database which corresponds to Kenya’s national borders. This subset had a sample size of 42 railroads.

Here is a visualization of the vector-based covariates listed above:

In the map above, points in shades of red are whipworm prevalence points, blue lines and polygons are inland waterbodies and grey lines are roads and railroads. For illustration purposes, I have clipped these features to the bounding box of the whipworm prevalence data, which is in and around western Kenya.

After loading the covariates, I rasterized all vector data, resampled the rasters, and calculated distances between each prevalence point and the nearest lake, rivers/canal, road, and railroad.

Rasterizing all vector data

I rasterized my four covariates containing vector data from the DCW database:

  • lakes shapefile
  • canals/lakes shapefile
  • roads shapefile
  • railroads shapefile

First, I re-projected all of my vector data into the WGS84 coordinate reference system:

Next, I rasterized each covariate with a step-wise process. Here is the process used for rasterizing the roads shapefile, which contains spatial lines:

A plot of the newly produced raster (in this case, roads) revealed that rasterization worked:

To rasterize my lakes vector data, I applied a method for rasterizing spatial polygons as shown below:

# rasterize lakes ----
ext <-  extent(33.90959, 41.92622, -4.720417, 5.061166)
xy <- ext %>% bbox() %>% as.matrix() %>% apply(1, diff) %>% abs()
n <- 5
r <- raster(ext, ncol=xy[1]*n, nrow=xy[2]*n)
lakes_r <- lakes %>% rasterize(r, fun='first')

This also successfully rasterized my lakes data:

Resampling rasters

After rasterizing my vector data, I cropped each of the 6 raster layers to match the extent of the Kenya Admin 1 Level per the GADM, then resampled all 6 of my raster layers to the lowest resolution raster of the set: the altitude raster layer.

Because the annual precipitation data is continuous, I resampled using a nearest-neighbor approach, which assigns the value of the nearest cell to cells in the prediction grid. The resampling procedure worked.

We can see that the extent and resolution of all of raster layers now match up, and are ready for calculating distances:

KEN_alt
## class      : RasterLayer 
## dimensions : 1130, 962, 1087060  (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 33.90833, 41.925, -4.716667, 4.7  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : KEN_alt 
## values     : 0, 5778  (min, max)
KEN_bioclim_resamp
## class      : RasterLayer 
## dimensions : 1130, 962, 1087060  (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 33.90833, 41.925, -4.716667, 4.7  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : bio12 
## values     : 173, 2493  (min, max)
KEN_canals_r_resamp
## class      : RasterLayer 
## dimensions : 1130, 962, 1087060  (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 33.90833, 41.925, -4.716667, 4.7  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : layer 
## values     : 8.163305e-06, 0.01207146  (min, max)
KEN_roads_r_resamp
## class      : RasterLayer 
## dimensions : 1130, 962, 1087060  (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 33.90833, 41.925, -4.716667, 4.7  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : layer 
## values     : 6.373713e-06, 0.01217149  (min, max)
KEN_railroads_r_resamp
## class      : RasterLayer 
## dimensions : 1130, 962, 1087060  (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 33.90833, 41.925, -4.716667, 4.7  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : layer 
## values     : 0.0002596938, 0.006324815  (min, max)
KEN_lakes_r_resamp
## class      : RasterLayer 
## dimensions : 1130, 962, 1087060  (nrow, ncol, ncell)
## resolution : 0.008333333, 0.008333333  (x, y)
## extent     : 33.90833, 41.925, -4.716667, 4.7  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## source     : memory
## names      : layer 
## values     : 6, 244  (min, max)

Calculating distances

I calculated distance between prevalence points to each feature using the distance command. This command served to generate raster layers of nothing but distances. Finally, I extracted values from the distance rasters as columns into my data.frame.

The extraction revealed two immediate problems with data availability: 1) I could not run regressions using distance to canals/lakes because all extractions had a value of 0:

##      canals 
##  Min.   :0  
##  1st Qu.:0  
##  Median :0  
##  Mean   :0  
##  3rd Qu.:0  
##  Max.   :0

And 2) I could not run regressions using distance to roads because only the following two values were non-zero extractions:

##       roads
## 1 1855.3227
## 2  927.6611

Spatial and non-spatial regression analyses

I specified non-spatial generalized linear models (GLMs) with various combinations of the 4 covariates available (bio12, altitude, distance to lakes, and distance to railroads). All models specified included binomial outputs. Importantly, I determined that models that included distance to roads resulted in a negative coefficient and dropped this variable from the models. I used two criteria to determine model fit:

  • lowest residual deviances
  • lowest cross-validated mean squared error.

The model that fit this criteria was the 3-covariate non-spatial GLM with altitude, annual precipitation, and distance to lakes:

## 
## Call:
## glm(formula = cbind(tri_np, round(tri_nneg)) ~ alt + bio12 + 
##     lakes, family = binomial(), data = STH_kenya_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.8508  -1.5117  -0.4945   0.9151   8.6723  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.765e+00  3.621e-01  13.159  < 2e-16 ***
## alt         -2.604e-03  3.990e-04  -6.526 6.75e-11 ***
## bio12       -2.583e-03  2.781e-04  -9.289  < 2e-16 ***
## lakes        4.780e-05  2.828e-06  16.901  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1287.54  on 60  degrees of freedom
## Residual deviance:  645.31  on 57  degrees of freedom
## AIC: 890.42
## 
## Number of Fisher Scoring iterations: 4

I also used the deviance residuals to evaluate model robustness. The third quartile (3Q) of the deviance residuals has a value of 0.91. This means that the first 75 percent of our model observations are within an acceptable distance of our predicted values, with “acceptable” being defined as a value less than 2.

After choosing the 4-covariate GLM, I ran a few diagnostics to determine the extent of residual spatial autocorrelation. I first examined a residual plot for any linear patterns:

The plot illustrates a lack of linearity. This indicates that these covariates, while statistically significant, still do not explain much of the spatial pattern underlying the whipworm prevalence data. I then used a correlogram to examine clustering effects. I took note of the highest distance class at which statistically significant residual clustering can be detected.

## Moran I statistic 
##       dist.class          coef     p.value    n
##  [1,] 0.05109279  0.0308233804 0.154947188 1132
##  [2,] 0.15327888 -0.0833084652 0.908832873  480
##  [3,] 0.25546447 -0.1017832993 0.642793997   96
##  [4,] 0.35765005 -0.0280393801 0.489358328  182
##  [5,] 0.45983564 -0.0450639612 0.789614671 1026
##  [6,] 0.56202123  0.1492423106 0.004960484  412
##  [7,] 0.66420681 -0.1049455990 0.839680895  166
##  [8,] 0.76639240  0.0807562638 0.157565161   38
##  [9,] 0.86857799 -0.0420579931 0.372533428   72
## [10,] 0.97076357 -0.0006592442 0.213781584   54

It seems that the Moran’s I statistic is not statistically significant at any of the distance classes. This indicates that we have no residual spatial autocorrelation. Our non-spatial GLM has independent and identically distributed residuals (iid), satisfying a key assumption of linear regression modeling.

However, I also ran a spatial regression regardless to see how the inclusion of a spatially-correlated fixed effect might improve model fit. I ran a Matern model with binomial outputs:

## formula: cbind(tri_np, round(tri_nneg)) ~ alt + bio12 + lakes + Matern(1 | 
##     lat + long)
## Estimation of corrPars and lambda by Laplace ML approximation (p_v).
## Estimation of fixed effects by Laplace ML approximation (p_v).
## Estimation of lambda by 'outer' ML, maximizing p_v.
## Family: binomial ( link = logit ) 
##  ------------ Fixed effects (beta) ------------
##               Estimate  Cond. SE t-value
## (Intercept)  5.736e+00 2.187e+00  2.6232
## alt         -6.905e-03 2.341e-03 -2.9495
## bio12        1.477e-03 1.691e-03  0.8734
## lakes        2.009e-05 2.409e-05  0.8337
##  --------------- Random effects ---------------
## Family: gaussian ( link = identity ) 
##                    --- Correlation parameters:
##       1.nu      1.rho 
##  0.4157388 13.9707489 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    lat + long  :  1.334  
## # of obs: 61; # of groups: lat + long, 60 
##  ------------- Likelihood values  -------------
##                         logLik
## p_v(h) (marginal L): -204.3296

I calculated 95% confidence intervals on each covariate and expressed them as odds ratios, as follows:

  • Altitude: 0.9885724, 0.9976861
  • Annual Precipitation: 0.9981645, 1.0048019
  • Distance to lakes: 0.9999729, 1.0000673

Next, I checked the residual plot and correlogram for the spatial regression to determine the effect of including the spatially-correlated fixed effect.

Here is the residual plot:

And here is the correlogram:

## Moran I statistic 
##       dist.class         coef    p.value    n
##  [1,] 0.05109279 -0.066685436 0.82411359 1132
##  [2,] 0.15327888 -0.017140805 0.49704509  480
##  [3,] 0.25546447  0.143182733 0.07316679   96
##  [4,] 0.35765005 -0.058880569 0.61789255  182
##  [5,] 0.45983564 -0.013031736 0.45413458 1026
##  [6,] 0.56202123  0.058991876 0.14550497  412
##  [7,] 0.66420681 -0.014941530 0.41596123  166
##  [8,] 0.76639240 -0.009228469 0.32282485   38
##  [9,] 0.86857799 -0.001231104 0.22160556   72
## [10,] 0.97076357 -0.011738657 0.24117375   54

Both the residual plot and correlogram indicates no spatial autocorrelation. The plot also shows a clear linear pattern between observed and expected values. This means that the inclusion of the spatially-correlated fixed effect improved our model fit.

Model cross-validation

Next, I validated the spatial regression using a 10-fold cross validation technique. This is a plot of cross-validated predictions vs. observed values:

The mean squared error of the cross-validated spatial model with 4 covariates was the smallest of the other model specifications explored as well:

## [1] 0.000582857

Risk mapping

Finally, I used the spatial regression to generate a risk map of whipworm prevalence across Kenya, as shown below. To generate the risk map, I rasterized model covariates, created a rasterstack with column names that matched model covariate names, and used the Matern model specification to predict across the raster stack. The risk map is shown below, masked to Kenya Admin 1 bounds.

Results

Two key findings can be delineated from this analysis:

  1. Altitude, annual precipitation, and distance to lakes are correlated with whipworm prevalence to a statistically significant degree, with p-values of <6.75e-11, < 2e-16, and < 2e-16 respectively. Our non-spatial GLM also does not exhibit global clustering.

  2. The inclusion of a spatially-correlated fixed effect improves model fit. However, the marginal impact of each of these factors on whipworm prevalence in western Kenya is extremely small. Here are the odds ratios:

  1. The risk map generated with our spatial regression actually indicates lower risk of whipworm prevalence close to waterbodies, particularly Lake Victoria. This contradicts findings that closer proximity to Lake Victoria increases whipworm prevalence. However, it is important to note that this conclusion is limited by the fact that this region is primarily where our sample was taken. Thus, our risk estimation suffers somewhat from the lack of availability of whipworm prevalence data from regions outside of western Kenya.

Conclusion

The findings from this study indicated statistically significant marginal impacts of altitude, annual precipitation, distance to waterbodies, or transit accessibility on whipworm prevalence in Western Kenya, however, those impacts were found to be negligible. Thus, this study neither evidenced nor contradicted the primary and secondary hypothesis. Potential reasons for these results include the two factors that affected our analysis:

  1. The data is not normally distributed even with a logit transformation, which is problematic for conducting a hypothesis test using the Moran’s I.
  2. There does not seem to be symmetry in the neighbor structure, which is problematic for defining spatial weights.

The first point limits our conclusions around the existence of clustering because the null hypothesis of the Moran’s I is a normal distribution. Our outcomes data should be normally distributed for the hypothesis test to be conclusive. The second point is problematic because it points to a logical fallacy; it means that if k is a neighbor of j, then j is not a neighbor of k. Constructing neighbor structures in a cluster analysis of the whipworm prevalence data yields asymmetrical results and this should be investigated.

Other potential reasons include the outdatedness and incompleteness of the DCW data. The DCW database was compiled only in 1992, and Langaas et. al. (1995) has pointed to the challenges of using this data for statistical analysis due its outdatedness and incompleteness.

Citations

Appleton, C. C., and E. Gouws. “The distribution of common intestinal nematodes along an altitudinal transect in KwaZulu-Natal, South Africa.” Annals of Tropical Medicine & Parasitology 90.2 (1996): 181-188.

Appleton, C. C., M. Maurihungirire, and E. Gouws. “The distribution of helminth infections along the coastal plain of Kwazulu-Natal province, South Africa.” Annals of Tropical Medicine & Parasitology 93.8 (1999): 859-868.

Chaiyos, J., et al. “MaxEnt modeling of soil-transmitted helminth infection distributions in Thailand.” Parasitology research 117.11 (2018): 3507-3517.

Chammartin, Frédérique, et al. “Modelling the geographical distribution of soil-transmitted helminth infections in Bolivia.” Parasites & vectors 6.1 (2013): 152.

Defense Mapping Agency (DMA), 1992. Digital Chart of the World. Defense Mapping Agency, Fairfax, Virginia. (Four CD-ROMs.)

Fick, S.E. and R.J. Hijmans, 2017. WorldClim 2: new 1km spatial resolution climate surfaces for global land areas. International Journal of Climatology 37 (12): 4302-4315.

Global Administrative Areas ( 2012). GADM database of Global Administrative Areas, version 2.0. [online] URL: www.gadm.org.

Handzel, Thomas, et al. “Geographic distribution of schistosomiasis and soil-transmitted helminths in Western Kenya: implications for anthelminthic mass treatment.” The American journal of tropical medicine and hygiene 69.3 (2003): 318-323.

Langaas, Sindre. Completeness of the Digital Chart of the World (DCW) database. UNEP/GRID-Arendal, 1995.

NASA Shuttle Radar Topography Mission (SRTM)(2013). Shuttle Radar Topography Mission (SRTM) Global. Distributed by OpenTopography. https://doi.org/10.5069/G9445JDF Accessed: 2020-05-03

Siza, Julius E., et al. “Prevalence of schistosomes and soil-transmitted helminths and morbidity associated with schistosomiasis among adult population in Lake Victoria Basin, Tanzania.” The Korean journal of parasitology 53.5 (2015): 525.