Model Based Clustering of Disease Risk in Glasgow

Introduction

In the dynamic landscape of public health, the spatial analysis of disease prevalence stands as a pivotal tool for unraveling the intricate patterns of disease spread and risk concentration. This project explores the spatial distribution and clustering of respiratory disease risks within Glasgow’s Intermediate Zones, using data from two different years (2006 and 2007). Clarifying the underlying spatial pattern of respiratory health issues is our main objective since it is essential for strategic health planning and intervention.

In order to do this, we use spatial modeling techniques, utilizing the Poisson Leroux Conditional Autoregressive (CAR) model to capture the local variations in disease risk and to dissect the complex spatial dependencies. In addition, we investigate the potential applications of Model-Based Clustering Methods to identify discrete clusters in the geographical terrain, thus offering a more sophisticated comprehension of the spatial variability of disease risks.

Our goal is to create an in-depth study of Glasgow’s respiratory illness landscape by combining advanced statistical techniques with spatial analysis to provide insights into the spatial dynamics at work. Our approach contributes to a better understanding of the geographical determinants of health and has the potential to guide targeted treatments and public health policy. It is not just a statistical exercise.

Necessary Libraries:

library(sf)
library(dplyr)
library(tidyverse)
library(spdep)
library(leaflet)
library(sp)
library(ggplot2)
library(broom)
library(RColorBrewer)
library(CARBayes)
library(mclust)
library(clevr)
library(pdfCluster)
library(cluster)

Reading in and combining spatial data

names(expected_counts)[1] <- "IZ_CODE"
names(respiratory_admissions)[1] <- "IZ_CODE"

combined_counts <- merge(expected_counts, respiratory_admissions, by = "IZ_CODE")

The fields within the dataset include:

IZ_CODE: A unique code assigned to each Intermediate Zone.
IZ_NAME: The name of the Intermediate Zone.
STDAREA_HA: The standard area in hectares for each zone.
Shape_Leng: The length of the shape’s perimeter.
Shape_Area: The area of the shape in square meters.
E2006 and E2007: The expected disease counts for the years 2006 and 2007, respectively.
Y2006 and Y2007: The actual observed disease counts for the years 2006 and 2007, respectively.

sf_data <- merge(shape, combined_counts, by = "IZ_CODE")
head(sf_data,3)

## Simple feature collection with 3 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 260667 ymin: 669235 xmax: 263842 ymax: 671065
## Projected CRS: OSGB36 / British National Grid
##     IZ_CODE       IZ_NAME STDAREA_HA Shape_Leng Shape_Area    E2006    E2007
## 1 S02000260    Auchinairn   112.2925   7751.798    1107841 94.83067 98.24602
## 2 S02000261 Woodhill East   111.5763   6464.223    1104265 42.79192 45.26085
## 3 S02000262 Woodhill West   107.2337   7316.999    1065622 89.90771 92.36517
##   Y2006 Y2007                       geometry
## 1    95    97 MULTIPOLYGON (((260718 6698...
## 2    13    15 MULTIPOLYGON (((262047 6699...
## 3    42    49 MULTIPOLYGON (((261158.7 67...

In the context of our work on modeling disease risk across Glasgow, the Standardized Mortality Ratio (SMR) represents a crucial metric for assessing the relative risk of respiratory diseases within different Intermediate Zones over two years. Essentially, the SMR compares the observed number of respiratory-related deaths in each zone to the number expected based on a broader reference population, standardized for age and other factors.

For our analysis, an SMR greater than 1 in a particular zone for a given year would indicate a higher-than-expected occurrence of respiratory diseases, possibly signaling underlying health issues, environmental factors, or socio-economic conditions contributing to elevated risks. On the other hand, an SMR less than 1 would suggest fewer cases than expected, possibly reflecting better health outcomes or effective public health interventions.

#SMR = Y/E

sf_data$SMR2006 <- sf_data$Y2006 / sf_data$E2006
sf_data$SMR2007 <- sf_data$Y2007 / sf_data$E2007

summary(sf_data$SMR2006)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2645  0.5591  0.7800  0.8118  1.0045  1.6044

summary(sf_data$SMR2007)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2891  0.6061  0.8504  0.8667  1.0784  1.7462

Creating the neighborhood matrix W

The neighborhood matrix W is fundamental in spatial analysis, encapsulating the spatial relationships among Glasgow’s Intermediate Zones. We create these connections using the spdep package by using shared boundaries. This produces a matrix where ‘1’ denotes neighboring zones and ‘0’ denotes no direct connection.

The neighborhood list, when summarized, shows 271 regions, with an average of about 5.25 links per region. Plotting red lines between the centroids of adjacent zones creates a visual representation that provides a preliminary look at the spatial structure we plan to analyze for respiratory disease risks.

W.nb <- poly2nb(sf_data, row.names = rownames(sf_data))

#summary of the neighbourhood connections
summary(W.nb)

## Neighbour list object:
## Number of regions: 271 
## Number of nonzero links: 1424 
## Percentage nonzero weights: 1.938971 
## Average number of links: 5.254613 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 14 15 20 
##  4 15 30 51 62 50 32 12  5  5  2  1  1  1 
## 4 least connected regions:
## 202 227 239 265 with 1 link
## 1 most connected region:
## 234 with 20 links

W <- nb2mat(W.nb, style = "B")
class(W)

## [1] "matrix" "array"

#Centroids of the areas to join to show neighbour connections
sf_data_cent<-st_centroid(sf_data$geometry)
plot(sf_data$geometry)
#Connecting lines of neighbours' centroids
plot.nb(W.nb, coords=sf_data_cent, col="red", add=TRUE)

Calculating Moran’s I Statistics

The Moran’s I statistic is a measure of spatial autocorrelation, indicating whether there is a pattern of similarity or dissimilarity among the values of a variable in a geographic space. In this case, the Moran’s I value is 0.41943 for year 2006 and 0.43109 for year 2007. The Moran’s I statistic indicates a positive spatial autocorrelation in the variables SMR2006 and SMR2007. The p-value of 9.999e-05 is smaller than the conventional significance level of 0.05, suggesting strong evidence to reject the null hypothesis of no spatial autocorrelation. The observed rank being 10,001 out of 10,001 simulations further emphasizes the significance of the result. Therefore, there is a statistically significant positive spatial autocorrelation, indicating that areas with similar SMR2006 values are clustered in space.

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  sf_data$SMR2006 
## weights: W.list  
## number of simulations + 1: 10001 
## 
## statistic = 0.41943, observed rank = 10001, p-value = 9.999e-05
## alternative hypothesis: greater

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  sf_data$SMR2007 
## weights: W.list  
## number of simulations + 1: 10001 
## 
## statistic = 0.43109, observed rank = 10001, p-value = 9.999e-05
## alternative hypothesis: greater

Mapping Spatial Data

Examining the Standard Mortality Ratios (SMRs) for 2006 and 2007, which show the spatial distribution of disease risk, darker colors denote high-risk zones where actual disease cases surpass predicted numbers. These regions of increased color highlight possible hotspots of concern that could call for additional investigation and the distribution of healthcare resources. The two maps’ comparison could provide insight into the temporal dynamics of disease spread, possibly highlighting the impact of shifting environmental or socioeconomic factors or the effectiveness of health interventions. These maps show the distribution of disease risk in an illustrative manner, but they also highlight the need for thorough analysis in order to identify underlying causes and create targeted public health interventions. For health officials and decision-makers, these visual aids are priceless because they provide direction for tactical planning and intervention aimed at reducing the impact of disease in impacted communities.

sp_dat<-as_Spatial(sf_data)
class(sp_dat)

## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

sp_dat@data$id <- rownames(sp_dat@data)
temp1<-tidy(sp_dat)

## Regions defined for each Polygons

#temp1<-st_as_sf(sp_dat)
sp_dat2 <- merge(temp1, sp_dat@data, by = "id")

ggplot(data = sp_dat2, aes(x=long, y=lat, goup=group, fill = c(SMR2006))) +
  geom_polygon() +
  coord_equal() +
  xlab("Easting (m)") +
  ylab("Northing (m)") +
  labs(title = "Spatial Distribution of SMR in 2006", fill = "SMR") +
  theme(title = element_text(size=16)) +
  scale_fill_gradientn(colors=brewer.pal(n=9, name="YlOrRd"))

ggplot(data = sp_dat2, aes(x=long, y=lat, goup=group, fill = c(SMR2007))) +
  geom_polygon() +
  coord_equal() +
  xlab("Easting (m)") +
  ylab("Northing (m)") +
  labs(title = "Spatial Distribution of SMR in 2007", fill = "SMR") +
  theme(title = element_text(size=16)) +
  scale_fill_gradientn(colors=brewer.pal(n=9, name="YlOrRd"))

Overlaying spatial data on an OpenStreetMap

SMR 2006

sp_dat_ll <- spTransform(sp_dat, CRS("+proj=longlat +datum=WGS84 +no_defs"))

colours <- colorNumeric(palette = "YlOrRd", domain = sp_dat_ll@data$SMR2006)
leaflet(data=sp_dat_ll) %>%
  addTiles() %>%
  addPolygons(fillColor = ~colours(SMR2006),
                            color="", weight=1,
                            fillOpacity = 0.7) %>%
  addLegend(pal = colours, values = sp_dat_ll@data$SMR2006,
            opacity = 1, title="SMR 2006") %>%
  addScaleBar(position="bottomleft")

SMR 2007

colours <- colorNumeric(palette = "YlOrRd", domain = sp_dat_ll@data$SMR2007)
leaflet(data=sp_dat_ll) %>%
  addTiles() %>%
  addPolygons(fillColor = ~colours(SMR2007),
              color="", weight=1,
              fillOpacity = 0.7) %>%
  addLegend(pal = colours, values = sp_dat_ll@data$SMR2007,
            opacity = 1, title="SMR 2007") %>%
  addScaleBar(position="bottomleft")

Spatial Modelling with CARBayes

The study looks at the spatial patterns of respiratory disease risks for two different years in Glasgow’s Intermediate Zones using Conditional Autoregressive (CAR) models. A more sophisticated comprehension of the spatial correlations found in the disease count data is possible by utilizing the CARBayes package and the Leroux model in particular.

This modeling approach is supported, among other things, by its ability to incorporate the spatial adjacency matrix, which acknowledges the possibility of interconnections between health outcomes in nearby regions. This is especially true for illnesses where the environment may have a big impact. The methodology selected also takes into account the Poisson distribution of disease counts, which is a widely accepted assumption in epidemiological research.

By utilizing a significant burn-in period and a high number of iterations, the MCMC samples are guaranteed to be representative of the actual posterior distribution, resulting in reliable estimates of the risks of disease specific to a given area. The fitted values play a crucial role in identifying spatial trends and evaluating the impact of public health interventions throughout the studied years.

formula_2006 <- Y2006 ~ offset(log(E2006))

model_2_2006 <- S.CARleroux(formula=formula_2006, family="poisson", data=sf_data, W=W,
                      burnin=10000, n.sample=100000, thin=10, verbose=FALSE)
#print(model_2_2006)

summary(model_2_2006)

##                     Length Class      Mode     
## summary.results      21    -none-     numeric  
## samples               6    -none-     list     
## fitted.values       271    -none-     numeric  
## residuals             2    data.frame list     
## modelfit              6    -none-     numeric  
## accept                4    -none-     numeric  
## localised.structure   0    -none-     NULL     
## formula               3    formula    call     
## model                 2    -none-     character
## mcmc.info             5    -none-     numeric  
## X                   271    -none-     numeric

formula_2007 <- Y2007 ~ offset(log(E2007))

model_2_2007 <- S.CARleroux(formula=formula_2007, family="poisson", data=sf_data, W=W,
                            burnin=10000, n.sample=100000, thin=10, verbose=FALSE)
#print(model_2_2007)

summary(model_2_2007)

##                     Length Class      Mode     
## summary.results      21    -none-     numeric  
## samples               6    -none-     list     
## fitted.values       271    -none-     numeric  
## residuals             2    data.frame list     
## modelfit              6    -none-     numeric  
## accept                4    -none-     numeric  
## localised.structure   0    -none-     NULL     
## formula               3    formula    call     
## model                 2    -none-     character
## mcmc.info             5    -none-     numeric  
## X                   271    -none-     numeric

Checking Convergence for Both Year

The summary() function gives a summary of the MCMC samples for each parameter for both the 2006 and 2007 models. Thousands of iterations indicate that the models have been given enough time to explore the parameter space. Thousands of samples are gathered post-burn-in for each of the important parameters of interest, which include tau2 (the variance parameter), rho (the spatial correlation parameter), phi (the spatial random effects), and beta (the regression coefficients). For the purpose of estimating parameters and evaluating convergence, this thorough sample is crucial.

summary(model_2_2006$samples)

##        Length  Class Mode   
## beta      9000 mcmc  numeric
## phi    2439000 mcmc  numeric
## rho       9000 mcmc  numeric
## tau2      9000 mcmc  numeric
## fitted 2439000 mcmc  numeric
## Y            1 mcmc  logical

The values of rho at each MCMC algorithm iteration are shown in the trace plots for the parameter rho, which in the Leroux CAR model measures spatial correlation. A stationary distribution is best indicated by a “fuzzy caterpillar” pattern, in which the trace plot oscillates around a stable mean without showing any obvious trends or directionality. These desired features may be seen in the trace plots for both years that are provided; the samples are densely covered throughout a range and do not display any consistent patterns over time. The rho chains appear to be well-mixed and have probably converged to their stationary distributions based on this behavior.

The density graphs that go with it show the sampled values’ empirical distribution for rho. The samples are aggregating around a single value, according to the unimodal, bell-shaped curves seen in both plots; there is no indication of numerous modes or appreciable asymmetry. The density graphs provide more proof that the parameter rho has converged, supporting the inference made from the trace plots.

plot(model_2_2006$samples$rho)

plot(model_2_2007$samples$rho)

It seems that the models have reached convergence for the rho parameter based on these diagnostics. To thoroughly evaluate convergence, it is crucial to carry out comparable diagnostics for every model parameter. For a more thorough assessment, additional diagnostics like autocorrelation plots and the Gelman-Rubin statistic would be helpful. A lower autocorrelation and a close Gelman-Rubin statistic for all parameters indicate that the MCMC method is likely to converge.

To be sure that the model is accurately representing the underlying spatial processes and that the MCMC outputs are consistent with known information, it is advisable to compare the estimated posterior distributions with prior beliefs and empirical data.

Assessing goodness of fit

The spatial autocorrelation in the residuals of the models, as measured by Moran’s I statistic, was used to evaluate the goodness of fit for the spatial models of 2006 and 2007. In contrast to clustering, the statistic values for both years are negative, suggesting a dispersion pattern where areas with a high risk of disease are surrounded by areas with a lower risk, or vice versa. The p-values, however, are extremely high (0.9843 for 2006 and 0.966 for 2007), indicating that random chance rather than any innate spatial process may be the cause of the observed spatial patterns. This suggests that the models have a good fit to the data with no unaccounted-for spatial trends because the residuals do not exhibit significant spatial autocorrelation.

moran.mc(x = residuals(model_2_2006, type="pearson"), listw = W.list, nsim = 10000)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(model_2_2006, type = "pearson") 
## weights: W.list  
## number of simulations + 1: 10001 
## 
## statistic = -0.08351, observed rank = 115, p-value = 0.9885
## alternative hypothesis: greater

moran.mc(x = residuals(model_2_2007, type="pearson"), listw = W.list, nsim = 10000)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  residuals(model_2_2007, type = "pearson") 
## weights: W.list  
## number of simulations + 1: 10001 
## 
## statistic = -0.068693, observed rank = 336, p-value = 0.9664
## alternative hypothesis: greater

Mapping with Fitted Values

The spatial distribution of fitted values for disease risk throughout the Glasgow Intermediate Zones for the years 2006 and 2007 is displayed on the maps that you have provided. The color gradation in these visualizations indicates areas with elevated risk levels, which makes them especially helpful for that purpose. Higher fitted values are represented by darker shades, indicating that more research or funding may be needed in these areas.

A concentration of higher fitted values appears to be centrally located in the 2006 map, which may correspond to densely populated or urban areas. This pattern might be a sign of social determinants or environmental factors impacting the risk of disease. In a similar vein, the 2007 map shows a central area of elevated risk, albeit with a distribution that appears a little more dispersed than the year before.

sf_data$fitted2006 <- model_2_2006$fitted.values

sf_data$fitted2007 <- model_2_2007$fitted.values

merged_df <- merge(sp_dat2, sf_data, by = "IZ_CODE", all.x = TRUE)

Spatial Distribution of fitted values for 2006

ggplot(data = merged_df, aes(x=long, y=lat, goup=group, fill = c(fitted2006))) +
  geom_polygon() +
  coord_equal() +
  xlab("Easting (m)") +
  ylab("Northing (m)") +
  labs(title = "Spatial Distribution of fitted values for 2006", fill = "SMR") +
  theme(title = element_text(size=16)) +
  scale_fill_gradientn(colors=brewer.pal(n=9, name="YlOrRd"))

Spatial Distribution of fitted values for 2007

ggplot(data = merged_df, aes(x=long, y=lat, goup=group, fill = c(fitted2007))) +
  geom_polygon() +
  coord_equal() +
  xlab("Easting (m)") +
  ylab("Northing (m)") +
  labs(title = "Spatial Distribution of fitted values for 2007", fill = "SMR") +
  theme(title = element_text(size=16)) +
  scale_fill_gradientn(colors=brewer.pal(n=9, name="YlOrRd"))

Model Based Clustering

The term “model-based clustering” signifies the application of finite mixture models as a method to conduct clustering analyses, and this particular approach forms the central theme of the current review. This approach relies on the assumption that the observed data are generated from a mixture of underlying probability distributions, and the goal is to infer both the number of clusters and the parameters of these distributions. Model-based clustering provides a flexible framework that accommodates different types of data and allows for varying cluster shapes and sizes. It contrasts with traditional methods, such as k-means clustering, by incorporating probabilistic models that provide a richer representation of the inherent uncertainty in the clustering process.

combined_fitted <- data.frame(
  fitted_2006 = model_2_2006$fitted.values,
  fitted_2007 = model_2_2007$fitted.values
)

head(combined_fitted)

##   fitted_2006 fitted_2007
## 1    90.84801    93.08621
## 2    18.01717    19.70937
## 3    44.43942    50.54876
## 4    30.96528    46.31603
## 5    66.70678    74.12395
## 6    21.94707    28.39613

The scatter plot that is presented illustrates the correlation between the estimated disease risk values in the Glasgow Intermediate Zones for the years 2006 and 2007. The fitted disease risk values for each year determine the position of each point on the graph, which represents an Intermediate Zone.

Different colors correspond to the clustering patterns that the bivariate mixture model clustering revealed. Based on the similarity of fitted values over the course of the two years, the algorithm has grouped the data points into 2 clusters, as indicated by the ellipses that represent the confidence regions for each cluster.

According to the clustering results, there are different groups of Intermediate Zones that have comparable disease risk profiles over the course of the two years. For example, the red cluster shows lower risks, and the blue cluster seems to represent areas with consistently higher disease risks in both years. The gradient of risk levels that is suggested by the transition between these clusters may be a sign of underlying socioeconomic, environmental, or health-related factors that have varying effects on these areas.

Areas with a high risk of disease in 2006 are likely to have a high risk of disease in 2007, according to the scatter plot, which indicates a positive correlation between the fitted values for the two years. This consistency may indicate stable underlying factors that influence the risk of disease over an extended period of time.

All things considered, the clustering of the mixture model offers insightful information about the spatial distribution and consistency of disease risk. Policymakers and public health experts can use this information to focus interventions and allocate resources to areas that are consistently at higher risk.

clustering_result_mclust <- Mclust(combined_fitted, G=1:9)


summary(clustering_result_mclust)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVE (ellipsoidal, equal orientation) model with 2 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -2319.618 271 10 -4695.258 -4771.875
## 
## Clustering table:
##   1   2 
##  90 181

plot(clustering_result_mclust, what="BIC")

plot(clustering_result_mclust, what="classification")

Mclust uses the Bayesian Information Criterion (BIC) to weigh model fit against complexity in order to determine the ideal number of clusters. In general, a model with a lower BIC score is considered better, and Mclust will automatically choose the number of clusters that produce the lowest BIC.

An ideal number of clusters is suggested by the model that was chosen based on the Bayesian Information Criterion (BIC) score of -4694.695, which effectively categorizes the spatial disease risk without overfitting the data. We can distinguish between the high-risk and low-risk areas by looking at the component means within these clusters, which show the average disease risk. This level of detail makes it easier to identify areas that might need more extensive medical care and preventative measures.

cluster_assignments <- predict(clustering_result_mclust)

sp_dat@data$clusters <- cluster_assignments$classification

The model-based clustering (Mclust) results for the fitted disease risk values for the two years in a row are displayed visually on the map. The two distinct colors, red and blue, each represent a group of areas with similar disease risk profiles and correspond to the identified clusters. Cluster 2 (blue) may represent regions with a lower risk, while Cluster 1 (red) may represent areas with a higher risk. Health authorities can more effectively target interventions by concentrating resources on the areas indicated by the red cluster thanks to this delineation. As a result, the clustering analysis offers a clear visual aid for comprehending the spatial distribution of disease risk throughout the area under study.

merged_df <- merge(sp_dat2, sp_dat@data, by = "IZ_CODE", all.x = TRUE)

ggplot(data = merged_df, aes(x=long, y=lat, goup=group, fill = factor(clusters))) +
  geom_polygon()

K-means Clustering

inertias <- numeric(10)

for (k in 1:10) {
  kmeans_model <- kmeans(combined_fitted, centers = k)
  inertias[k] <- kmeans_model$tot.withinss
}

plot(1:10, inertias, type = 'b', xlab = 'Number of Clusters (k)', ylab = 'Variance Explained',
     main = 'Elbow Method for Optimal k')

k=2
set.seed(123) 
kmeans_result <- kmeans(combined_fitted, centers = k)

The elbow method identified two as the ideal number of clusters for the given data, and the map that follows shows the clustering result from a K-means analysis. The regions are designated with color codes of red and blue, which stand for Clusters 1 and 2, respectively. The dataset’s spatial patterns can be seen thanks to this clustering arrangement, which may indicate the boundary between two different risk profiles that differ in the region. Finding these clusters helps reveal possible hotspots (red) and lower-risk areas (blue), which can be important for allocating resources and planning public health initiatives. Thus, by reflecting the underlying structure in the spatial distribution of disease risk, the K-means clustering offers an insightful partitioning of the data.

sp_dat@data$kmeans_clusters <- kmeans_result$cluster


merged_df <- merge(sp_dat2, sp_dat@data, by = "IZ_CODE", all.x = TRUE)

ggplot(data = merged_df, aes(x=long, y=lat, goup=group, fill = factor(kmeans_clusters))) +
  geom_polygon()

Comparing Clustering Results

ARI score

adj.rand.index(merged_df$kmeans_clusters, merged_df$clusters)

## [1] 0.7710257

FMI Score

fowlkes_mallows(merged_df$kmeans_clusters, merged_df$clusters)

## [1] 0.8993553

Silhouette score for mclust

silhouette_score_mclust <- silhouette(sp_dat@data$clusters, dist(combined_fitted))
mean(silhouette_score_mclust[, "sil_width"])

## [1] 0.5128216

Silhouette score for k-means

silhouette_score_kmeans <- silhouette(sp_dat@data$kmeans_clusters, dist(combined_fitted))

mean(silhouette_score_kmeans[, "sil_width"])

## [1] 0.5413335

The cluster assignments appear to be in strong agreement when the two clustering results from the Mclust and K-means methods are compared using the Fowlkes-Mallows Index (FMI) and the Adjusted Rand Index (ARI). Given that ARI values closer to 1 indicate greater harmony, the two sets of clusters exhibit substantial similarity, as indicated by the 0.771 ARI score. Comparably, a high degree of agreement between the clustering results is also implied by the FMI score of 0.899, which is near to 1.

Both clustering results have a moderately positive silhouette score, according to a further comparison using the silhouette score, which gauges how similar an object is to its own cluster in relation to other clusters. K-means slightly outperforms Mclust in this regard (0.541 vs. 0.513). This shows that in the K-means solution as opposed to the Mclust solution, objects are generally closer to the centers of their clusters and farther from the centers of other clusters.

In conclusion, the ARI and FMI show that both clustering techniques produce comparable data partitions. On the other hand, the silhouette score of the K-means clustering is slightly better, indicating a slightly more cohesive and distinct clustering structure.

Conclusion

In summary, the two-year study of the risk of respiratory diseases in Glasgow’s Intermediate Zones shed light on disease clustering and spatial patterns. Through the use of Poisson Leroux CAR models, disease risk could be examined while taking spatial autocorrelation into account, producing risk estimates that were more precise and spatially adjusted.

All things considered, the study was successful in identifying areas with comparable risk profiles, which can guide the distribution of resources and specific public health initiatives. The robustness of the results and the applicability of these techniques in geographical epidemiological research are supported by the agreement between various clustering approaches.

References

Craig Anderson, Duncan Lee, Nema Dean, Bayesian cluster detection via adjacency modelling, Spatial and Spatio-temporal Epidemiology, Volume 16, 2016, Pages 11-20, ISSN 1877-5845, https://doi.org/10.1016/j.sste.2015.11.005.

Paul D. McNicholas, 2016. “Model-Based Clustering,” Journal of Classification, Springer;The Classification Society, vol. 33(3), pages 331-373, October.

Bivand, R, E Pebesma, and V Gomez-Rubio. 2013. Applied Spatial Data Analysis with r. 2nd ed. Springer-Verlag, New York.