This script demonstrates how to integrate spatial data with public health records to visualize the geographic distribution of dengue cases in Davao City at the barangay level. The workflow begins by loading the necessary R packages, including sf for spatial data handling, dplyr for data manipulation, and ggplot2 for visualization. Additional packages like viridis and stringi are used to enhance color accessibility and ensure consistent text formatting, respectively.
The first major step involves importing a Level 3 administrative boundary shapefile of the Philippines, which contains polygon geometries for all barangays. This is done using the st_read() function from the sf package. Since the focus is on Davao City, the dataset is filtered to include only those barangays where the NAME_2 attribute matches “DavaoCity”. Next, the script loads dengue case data from a CSV file. This dataset is expected to contain barangay-level information such as the number of dengue cases and population figures. However, to ensure that the spatial and tabular datasets can be merged correctly, the script standardizes the barangay names in both datasets. This involves converting all names to lowercase and removing any special characters or accents using stri_trans_general() from the stringi package. This preprocessing step is crucial because even minor inconsistencies in naming can prevent a successful join.
Once the names are standardized, the dengue data is merged with the spatial polygons using a left join on the NAME_3 column, which represents barangay names. The result is a spatial data frame that contains both the geographic boundaries and the associated dengue case data for each barangay in Davao City.
With the data prepared, the script proceeds to create a choropleth map using ggplot2. Each barangay is shaded according to the number of dengue cases it recorded, with a gradient from white (low cases) to dark red (high cases). The map includes a clean, minimal theme and informative labels, making it suitable for both analysis and presentation. This visualization allows public health officials and researchers to quickly identify areas with high dengue incidence and potential clustering patterns.
Finally, the map is saved as a high-resolution PDF using ggsave(), making it easy to include in reports, presentations, or printed materials. The entire workflow illustrates how spatial analysis in R can be used to support data-driven decision-making in public health, particularly in identifying and responding to disease hotspots.
# Load required packages
library(sf)
library(dplyr)
library(readxl)
library(ggplot2)
library(viridis)
library(stringi)
# 1. Load Barangay-level shapefile (GADM Level 3)
# Make sure the file path and layer name match your downloaded GADM file
ph_l3<- st_read("C:/Users/User/Desktop/Spatial Analysis/gadm41_PHL_3.json/gadm41_PHL_3.json", layer = "gadm41_PHL_3")## Reading layer `gadm41_PHL_3' from data source
## `C:\Users\User\Desktop\Spatial Analysis\gadm41_PHL_3.json\gadm41_PHL_3.json'
## using driver `GeoJSON'
## Simple feature collection with 41948 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 116.9283 ymin: 4.5869 xmax: 126.6053 ymax: 21.0699
## Geodetic CRS: WGS 84
# Filter only Davao City
davao_barangays <- ph_l3 %>% filter(NAME_2 == "DavaoCity")
# 2. Load Dengue data (barangay-level data in Excel)
# Example columns: Barangay | Dengue_Cases | Population
dengue_data <- read.csv("C:/Users/User/Desktop/Spatial Analysis/dengue.csv")
# Standardize both datasets
davao_barangays <- davao_barangays %>%
mutate(NAME_3 = stri_trans_general(NAME_3, "Latin-ASCII"))
# Optional: check spelling differences
# Use stringr::str_to_lower() or similar if names mismatch
dengue_data$NAME_3 <- tolower(dengue_data$NAME_3)
davao_barangays$NAME_3 <- tolower(davao_barangays$NAME_3)
# 4. Merge dengue data with barangay polygons
davao_dengue <- davao_barangays %>%
left_join(dengue_data, by = "NAME_3")
# 5. Plot choropleth map of dengue cases with white to dark red gradient
ggplot(davao_dengue) +
geom_sf(aes(fill = cases), color = "gray20", size = 0.1) +
scale_fill_gradient(
low = "white",
high = "darkred",
name = "Dengue Cases"
) +
theme_minimal() +
labs(
title = "Distribution of Dengue Cases in Davao City Barangays",
subtitle = "Data Source: [Your source here]",
caption = "Mapping dengue cases helps identify clusters and high-risk areas."
)After visualizing the raw distribution of dengue cases across barangays in Davao City, we move into a more advanced spatial analysis technique: hotspot detection. This method allows us to statistically identify areas where dengue cases are significantly clustered — either as high-value hotspots or low-value coldspots. To achieve this, we use the Local Gi* statistic, a well-established tool in spatial epidemiology.
We begin by defining the spatial relationships between barangays. This is done using the poly2nb() function, which examines the geometry of each polygon (barangay) and identifies its neighbors — those that share a border. These neighbor relationships are essential for understanding spatial dependence, as they allow us to compare each barangay’s dengue case count with those of its immediate surroundings.
Next, we convert this neighbor list into a spatial weights matrix using nb2listw(). This matrix assigns weights to each neighbor, indicating how much influence it has on the target barangay. We use the “W” style, which standardizes the weights so that they sum to one for each row. This ensures that the influence of neighboring areas is proportionally distributed and comparable across all barangays.
With the spatial structure in place, we calculate the Local Gi* statistic using the localG() function. This function compares each barangay’s dengue case count to the average of its neighbors, adjusted for overall spatial variation. The result is a Z-score for each barangay: - A high positive Z-score indicates a potential hotspot — a barangay with high dengue cases surrounded by other high-case areas. - A low negative Z-score suggests a coldspot — a barangay with low cases surrounded by other low-case areas. - Z-scores near zero imply no significant clustering. We then attach these Z-scores to the spatial dataset as a new column called GiZScore. This allows us to visualize the results on a map.
To create the map, we use the tmap package. We set the mode to “plot” for static visualization, and use tm_shape() to initialize the map with our spatial data. The tm_fill() function fills each barangay based on its Gi* Z-score, using a diverging red-blue color palette (-RdBu). In this palette, red represents hotspots and blue represents coldspots. We add borders for clarity and include a title and legend to make the map informative and presentation-ready.
This hotspot map provides a powerful visual tool for public health decision-making. It highlights areas of concern where dengue cases are not just high, but statistically clustered — guiding targeted interventions, resource allocation, and further investigation.
library(spdep)
library(tmap)
# 8. Hotspot Analysis using Local Gi* Statistic
nb <- poly2nb(davao_dengue) # Create neighbors list
listw <- nb2listw(nb, style = "W") # Spatial weights
gi_star <- localG(davao_dengue$cases, listw) # Calculate Gi* statistic
davao_dengue$GiZScore <- as.numeric(gi_star) # Add Gi* Z-scores to data
# 9. Visualize Hotspots
tmap_mode("plot")
tm_shape(davao_dengue) +
tm_fill("GiZScore", palette = "-RdBu", title = "Hotspot (Gi*)") +
tm_borders() +
tm_layout(
title = "Hotspot Analysis of Dengue Cases in Davao City",
legend.outside = TRUE
)Spatial regression is a vital tool in public health research, especially when analyzing geographically distributed data like dengue cases. Unlike traditional regression models, spatial regression accounts for the fact that nearby locations often influence each other — a phenomenon known as spatial autocorrelation. In this analysis, we aim to understand how population size relates to dengue case counts across barangays in Davao City, while adjusting for spatial dependence using two models: the Spatial Lag Model and the Spatial Error Model.
We begin by preparing the data. Using the filter() function from the dplyr package, we remove any barangays with missing values in the cases or population columns. This ensures that our models are built on complete data:
Next, we define the spatial relationships between barangays. This is done using the poly2nb() function from the spdep package, which constructs a neighbor list based on shared polygon boundaries. We then convert this list into a spatial weights matrix using nb2listw(), specifying the “W” style to standardize the weights.
With the spatial structure in place, we fit our first model: the Spatial Lag Model. This model assumes that dengue cases in one barangay are directly influenced by the number of cases in neighboring barangays. We use the lagsarlm() function from the spatialreg package to estimate this model.
The summary() output provides key information, including the regression coefficient for population and the spatial lag parameter (ρ), which quantifies the strength of spatial dependence. A significant ρ value suggests that dengue cases tend to cluster — high values in one barangay are associated with high values in adjacent areas.
To complement this, we also fit a Spatial Error Model, which assumes that spatial dependence exists in the error terms rather than the outcome variable. This model is useful when unobserved spatial factors — such as environmental conditions or mobility patterns — influence dengue case counts. We estimate this model using the errorsarlm() function.
Again, the summary() output reveals the spatial error parameter (λ), which captures the degree of autocorrelation in the residuals. Comparing the two models helps us determine whether spatial clustering is better explained by direct spillover effects (as in the lag model) or by unmeasured spatial processes (as in the error model).
By incorporating spatial regression into our analysis, we gain a deeper understanding of how dengue spreads across Davao City. These models allow us to quantify the relationship between population and disease burden while adjusting for spatial clustering — a critical step in designing targeted public health interventions and allocating resources effectively.
# Load required libraries
library(spdep)
library(spatialreg)
# 1. Filter out missing values FIRST
model_data <- davao_dengue %>%
filter(!is.na(cases), !is.na(population))
# 2. Rebuild spatial weights based on filtered data
nb <- poly2nb(model_data)
listw <- nb2listw(nb, style = "W")
# 3. Run Spatial Lag Model
library(spatialreg)
lag_model <- lagsarlm(cases ~ population, data = model_data, listw = listw)
summary(lag_model)##
## Call:lagsarlm(formula = cases ~ population, data = model_data, listw = listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.5548 -2.9974 -1.7686 3.1687 19.9927
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17446748 0.67697351 0.2577 0.7966
## population 0.00108260 0.00004682 23.1224 <2e-16
##
## Rho: 0.20063, LR test value: 22.628, p-value: 1.9656e-06
## Asymptotic standard error: 0.041422
## z-value: 4.8437, p-value: 1.2744e-06
## Wald statistic: 23.461, p-value: 1.2744e-06
##
## Log likelihood: -604.2332 for lag model
## ML residual variance (sigma squared): 44.399, (sigma: 6.6633)
## Number of observations: 182
## Number of parameters estimated: 4
## AIC: 1216.5, (AIC for lm: 1237.1)
## LM test for residual autocorrelation
## test value: 7.9247, p-value: 0.0048766
# 4. Run Spatial Error Model
error_model <- errorsarlm(cases ~ population, data = model_data, listw = listw)
summary(error_model)##
## Call:errorsarlm(formula = cases ~ population, data = model_data, listw = listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.5940 -3.3250 -2.4151 3.3367 23.8875
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3469e+00 6.5377e-01 2.0602 0.03938
## population 1.2542e-03 3.4223e-05 36.6475 < 2e-16
##
## Lambda: -0.026461, LR test value: 0.036285, p-value: 0.84893
## Asymptotic standard error: 0.11224
## z-value: -0.23575, p-value: 0.81363
## Wald statistic: 0.055576, p-value: 0.81363
##
## Log likelihood: -615.5292 for error model
## ML residual variance (sigma squared): 50.705, (sigma: 7.1208)
## Number of observations: 182
## Number of parameters estimated: 4
## AIC: 1239.1, (AIC for lm: 1237.1)
In spatial analysis, clustering techniques allow us to group geographic units based on similarity in attributes and spatial proximity. One such method is the SKATER algorithm — short for Spatial ’K’luster Analysis by Tree Edge Removal. This technique is particularly useful in public health mapping, where we aim to identify groups of barangays that share similar characteristics, such as dengue case counts and population size, while also being geographically connected.
The process begins with data preparation. We first filter the dataset to remove any barangays with missing values in the cases or population columns. This ensures that the clustering algorithm operates on complete and meaningful data.
Next, we define the spatial structure of the barangays by creating a neighbor list using the poly2nb() function from the spdep package. This function identifies which barangays share boundaries. We then convert this neighbor list into a spatial weights matrix using nb2listw(), with the “W” style to standardize the weights.
Next, we define the spatial structure of the barangays by creating a neighbor list using the poly2nb() function from the spdep package. This function identifies which barangays share boundaries. We then convert this neighbor list into a spatial weights matrix using nb2listw(), with the “W” style to standardize the weights.
To prepare for clustering, we construct a minimum spanning tree (MST) — a structure that connects all barangays with the shortest possible total distance, without forming loops. This is done by first calculating the centroid coordinates of each barangay using st_centroid() and st_coordinates(). We then compute the distances between neighbors using nbdists() and transform these into weights using an inverse distance function. The MST is built using mstree().
With the spatial structure defined, we select the numeric variables to be used for clustering. In this case, we choose cases and population, ensuring they are numeric and standardized using scale() to give them equal weight in the clustering process.
We then apply the SKATER algorithm using the skater() function. This function partitions the MST into clusters by removing edges, based on attribute similarity and spatial connectivity. The number of clusters is controlled by the ncuts parameter, which is set to one less than the desired number of clusters. For example, to create five clusters, we set ncuts = 4.
The resulting cluster assignments are stored in skater_result$groups, which we attach to the spatial dataset as a new factor variable called cluster.
Finally, we visualize the clusters using the tmap package. We set the mode to “plot” and use tm_shape() to initialize the map. The tm_fill() function colors each barangay according to its cluster assignment, using the “Set3” palette for clear differentiation. Borders are added with tm_borders() to enhance readability.
This map provides a powerful visual summary of spatial clustering in dengue data, revealing patterns that may not be apparent from raw case counts alone. By grouping barangays based on both their attributes and spatial relationships, the SKATER algorithm supports more targeted and regionally coherent public health interventions.
library(spdep)
library(sf)
library(tmap)
# 1. Filter out missing values
model_data <- davao_dengue %>%
filter(!is.na(cases), !is.na(population))
# 2. Create spatial weights
nb <- poly2nb(model_data)
listw <- nb2listw(nb, style = "W")
# 3. Build minimum spanning tree (MST)
coords <- st_coordinates(st_centroid(model_data))
dists <- nbdists(nb, coords)
weights <- lapply(dists, function(x) 1 / (1 + x))
mst <- mstree(nb2listw(nb, glist = weights))
# 4. Select numeric variables for clustering
# Make sure only numeric columns are passed
vars <- model_data %>%
st_drop_geometry() %>%
select(cases, population) %>%
mutate(across(everything(), as.numeric)) %>%
scale()
# 5. Run SKATER clustering (e.g., 5 clusters)
skater_result <- skater(mst, vars, ncuts = 4) # ncuts = clusters - 1
# 6. Add cluster labels to data
model_data$cluster <- factor(skater_result$groups)
# 7. Visualize clusters
tmap_mode("plot")
tm_shape(model_data) +
tm_fill("cluster", palette = "Set3", title = "SKATER Clusters") +
tm_borders()After performing spatial clustering using the SKATER algorithm, the next step is to organize and export the results for reporting, visualization, or further analysis. The SKATER algorithm assigns each barangay to a cluster based on both spatial proximity and attribute similarity (e.g., dengue cases and population). These cluster labels are stored in the model_data object, which contains both the spatial geometry and the associated attributes.
To prepare the cluster assignments for export, we first remove the spatial geometry using the st_drop_geometry() function from the sf package. This simplifies the dataset to a regular data frame, making it easier to manipulate and save. We then select only the relevant columns: NAME_3, which contains the barangay names, and cluster, which holds the cluster labels assigned by SKATER. Using arrange(cluster), we sort the barangays by their cluster number to produce a clean, grouped list.
This grouped data frame provides a clear summary of which barangays belong to each cluster. It can be printed directly in the R console using print(barangay_clusters) to inspect the results or share them during analysis.
For documentation or reporting purposes, we save the grouped cluster assignments to a CSV file using the write.csv() function. This creates a file named “barangay_clusters_by_skater.csv” that can be opened in spreadsheet software or imported into other analytical tools.
This export step ensures that the results of the spatial clustering are preserved and accessible outside of R, supporting collaboration, presentation, and integration with other datasets or mapping platforms.
# Assuming 'model_data' contains the cluster assignments and barangay names
# Group barangays by cluster
barangay_clusters <- model_data %>%
st_drop_geometry() %>%
select(NAME_3, cluster) %>%
arrange(cluster)
# View the grouped list
print(barangay_clusters)## NAME_3 cluster
## 1 acacia 1
## 2 alambre 1
## 3 alejandranavarro 1
## 4 angalan 1
## 5 atan-awe 1
## 6 baganihan 1
## 7 bagoaplaya 1
## 8 bagooshiro 1
## 9 baguio 1
## 10 balengaeng 1
## 11 baliok 1
## 12 bangkasheights 1
## 13 bantol 1
## 14 baracatan 1
## 15 bato 1
## 16 bayabas 1
## 17 biaoescuela 1
## 18 biaoguianga 1
## 19 biaojoaquin 1
## 20 binugao 1
## 21 buda 1
## 22 bunawan 1
## 23 cadalian 1
## 24 calinan 1
## 25 callawa 1
## 26 camansi 1
## 27 carmen 1
## 28 catalunangrande 1
## 29 catalunanpequeno 1
## 30 catigan 1
## 31 cawayan 1
## 32 colosas 1
## 33 crossingbayabas 1
## 34 dacudao 1
## 35 dalag 1
## 36 dalagdag 1
## 37 daliao 1
## 38 daliaonplantation 1
## 39 datusalumay 1
## 40 dominga 1
## 41 dumoy 1
## 42 eden 1
## 43 fatima 1
## 44 gatungan 1
## 45 gumalang 1
## 46 gumitan 1
## 47 ilang 1
## 48 inayangan 1
## 49 indangan 1
## 50 kilate 1
## 51 lacson 1
## 52 lamanan 1
## 53 lampianao 1
## 54 langub 1
## 55 lizada 1
## 56 losamigos 1
## 57 lubogan 1
## 58 lumiad 1
## 59 mabuhay 1
## 60 magsaysay 1
## 61 mahayag 1
## 62 malabog 1
## 63 malagos 1
## 64 malamba 1
## 65 manambulan 1
## 66 mandug 1
## 67 manuelguianga 1
## 68 mapula 1
## 69 marapangi 1
## 70 marilog 1
## 71 matinabiao 1
## 72 megkawayan 1
## 73 mintal 1
## 74 mudiang 1
## 75 mulig 1
## 76 newcarmen 1
## 77 newvalencia 1
## 78 panacan 1
## 79 panalum 1
## 80 pandaitan 1
## 81 pangyan 1
## 82 paquibato 1
## 83 paradiseembak 1
## 84 riverside 1
## 85 salapawan 1
## 86 salaysay 1
## 87 saloy 1
## 88 sanisidro 1
## 89 santonino 1
## 90 sasa 1
## 91 sibulan 1
## 92 sirawan 1
## 93 sirib 1
## 94 suawan 1
## 95 subasta 1
## 96 sumimao 1
## 97 tacunan 1
## 98 tagakpan 1
## 99 tagluno 1
## 100 tagurano 1
## 101 talandang 1
## 102 talomoriver 1
## 103 tamayong 1
## 104 tambobong 1
## 105 tamugan 1
## 106 tapak 1
## 107 tawan-tawan 1
## 108 tibuloy 1
## 109 tibungco 1
## 110 toril 1
## 111 tugbok 1
## 112 tungakalan 1
## 113 ula 1
## 114 vicentehizonsr. 1
## 115 waan 1
## 116 wangan 1
## 117 wines 1
## 118 centro 2
## 119 agdao 3
## 120 alfonsoangliongtos 3
## 121 barangay1-a 3
## 122 barangay13-b 3
## 123 barangay14-b 3
## 124 barangay15-b 3
## 125 barangay16-b 3
## 126 barangay17-b 3
## 127 barangay18-b 3
## 128 barangay2-a 3
## 129 barangay3-a 3
## 130 barangay34-d 3
## 131 barangay35-d 3
## 132 barangay37-d 3
## 133 barangay38-d 3
## 134 barangay39-d 3
## 135 barangay40-d 3
## 136 barangay5-a 3
## 137 barangay6-a 3
## 138 barangay7-a 3
## 139 bucana 3
## 140 buhangin 3
## 141 cabantian 3
## 142 communal 3
## 143 gov.pacianobangoy 3
## 144 gov.vicenteduterte 3
## 145 kap.tomasmonteverd 3
## 146 lapu-lapu 3
## 147 leongarcia,sr. 3
## 148 ma-a 3
## 149 magtuod 3
## 150 matinaaplaya 3
## 151 matinacrossing 3
## 152 matinapangi 3
## 153 pampanga 3
## 154 rafaelcastillo 3
## 155 sanantonio 3
## 156 talomo 3
## 157 tigatto 3
## 158 ubalde 3
## 159 wilfredoaquino 3
## 160 bagogallera 4
## 161 barangay10-a 5
## 162 barangay11-b 5
## 163 barangay12-b 5
## 164 barangay19-b 5
## 165 barangay20-b 5
## 166 barangay21-c 5
## 167 barangay22-c 5
## 168 barangay23-c 5
## 169 barangay24-c 5
## 170 barangay25-c 5
## 171 barangay26-c 5
## 172 barangay27-c 5
## 173 barangay28-c 5
## 174 barangay29-c 5
## 175 barangay30-c 5
## 176 barangay31-d 5
## 177 barangay32-d 5
## 178 barangay33-d 5
## 179 barangay36-d 5
## 180 barangay4-a 5
## 181 barangay8-a 5
## 182 barangay9-a 5
# Optional: Save to CSV for reporting
write.csv(barangay_clusters, "barangay_clusters_by_skater.csv", row.names = FALSE)After performing spatial clustering using the SKATER algorithm, the next step is to interpret the results by summarizing the characteristics of each cluster. Clustering groups barangays based on both geographic proximity and similarity in attributes such as dengue case counts and population. However, to make these clusters meaningful for public health planning, we need to quantify what each group represents.
We begin by creating a cluster profile table using the dplyr package. First, we remove the spatial geometry from the dataset using st_drop_geometry() to simplify the data into a standard tabular format. We then group the data by the cluster variable, which contains the cluster assignments generated by SKATER. Within each group, we calculate several key statistics.
This summary provides a comprehensive overview of each cluster: - Barangay_Count shows how many barangays are in each group. - Total_Cases and Total_Population give the cumulative burden and size of each cluster. - Avg_Cases and Avg_Population help compare clusters on a per-barangay basis. - Incidence_Rate expresses the number of dengue cases per 1,000 people, offering a standardized measure of disease intensity. To present these results in a clear and professional format, we use the kable() function from the knitr package. This generates a neatly formatted table with custom column names and rounded values.
This table is suitable for inclusion in reports, presentations, or dashboards. It allows public health officials to quickly compare clusters and identify which ones have the highest disease burden or incidence rates. For further inspection or sharing, we print the profile directly to the console: print(cluster_profile).
And optionally, we export the results to a CSV file using write.csv(), making it easy to share with collaborators or integrate into other tools: write.csv(cluster_profile, “cluster_profiles.csv”, row.names = FALSE).
This profiling step transforms raw cluster assignments into actionable insights. By summarizing the characteristics of each group, we support data-driven decision-making and help prioritize interventions in the most affected areas.
# Assuming 'model_data' contains 'cluster', 'cases', 'population', and 'NAME_3'
# 1. Summarize key statistics by cluster
cluster_profile <- model_data %>%
st_drop_geometry() %>%
group_by(cluster) %>%
summarise(
Barangay_Count = n(),
Total_Cases = sum(cases, na.rm = TRUE),
Total_Population = sum(population, na.rm = TRUE),
Avg_Cases = mean(cases, na.rm = TRUE),
Avg_Population = mean(population, na.rm = TRUE),
Incidence_Rate = (sum(cases, na.rm = TRUE) / sum(population, na.rm = TRUE)) * 1000
)
library(knitr)
# Display the cluster profile as a formatted table
kable(
cluster_profile,
caption = "Cluster Profiles of Davao City Barangays",
digits = 2,
col.names = c(
"Cluster",
"No. of Barangays",
"Total Dengue Cases",
"Total Population",
"Average Cases",
"Average Population",
"Incidence Rate (per 1,000)"
)
)| Cluster | No. of Barangays | Total Dengue Cases | Total Population | Average Cases | Average Population | Incidence Rate (per 1,000) |
|---|---|---|---|---|---|---|
| 1 | 117 | 497 | 426466.2 | 4.25 | 3645.01 | 1.17 |
| 2 | 1 | 3 | 2415.0 | 3.00 | 2415.00 | 1.24 |
| 3 | 41 | 1403 | 915812.5 | 34.22 | 22336.89 | 1.53 |
| 4 | 1 | 55 | 39852.0 | 55.00 | 39852.00 | 1.38 |
| 5 | 22 | 981 | 763750.0 | 44.59 | 34715.91 | 1.28 |
## # A tibble: 5 × 7
## cluster Barangay_Count Total_Cases Total_Population Avg_Cases Avg_Population
## <fct> <int> <int> <dbl> <dbl> <dbl>
## 1 1 117 497 426466. 4.25 3645.
## 2 2 1 3 2415 3 2415
## 3 3 41 1403 915812. 34.2 22337.
## 4 4 1 55 39852 55 39852
## 5 5 22 981 763750 44.6 34716.
## # ℹ 1 more variable: Incidence_Rate <dbl>
Spatial regression is a powerful technique for modeling geographically distributed phenomena, especially when observations are influenced by their spatial context. In this analysis, we use a Spatial Lag Model to predict dengue case counts across barangays in Davao City, based on population size and spatial relationships. The goal is to identify potential hotspots — areas where dengue cases are expected to be high due to both demographic and geographic factors.
We begin by preparing the data. Using the filter() function from the dplyr package, we remove any barangays with missing values in the cases or population columns. This ensures that the regression model is built on complete and reliable data.
Next, we define the spatial structure of the barangays. The poly2nb() function from the spdep package creates a neighbor list based on shared polygon boundaries, identifying which barangays are adjacent to each other. We then convert this neighbor list into a spatial weights matrix using nb2listw(), with the “W” style to standardize the weights.
With the spatial weights defined, we fit a Spatial Lag Model using the lagsarlm() function from the spatialreg package. This model assumes that dengue cases in one barangay are influenced not only by its own population size but also by the number of cases in neighboring barangays. The model includes a spatial lag term (ρ) that captures this dependence.
The summary() output provides estimates for the regression coefficient of population and the spatial lag parameter. A significant lag term indicates that dengue cases tend to cluster — high values in one barangay are associated with high values in adjacent areas. Once the model is fitted, we use the predict() function to generate predicted dengue case counts for each barangay. These predictions reflect both the influence of population and the spatial structure of the data.
To visualize the results, we use the tmap package to create a choropleth map. We set the mode to “plot” for static visualization and use tm_shape() to initialize the map with our spatial data. The tm_fill() function colors each barangay based on its predicted case count, using a red gradient to indicate intensity. Borders are added with tm_borders(), and the layout is customized with tm_layout().
This map provides a powerful visual tool for identifying potential dengue hotspots. By incorporating both demographic and spatial factors, the Spatial Lag Model offers a more accurate and context-aware prediction of disease distribution — supporting targeted public health interventions and resource allocation.
# Load required libraries
library(sf)
library(spdep)
library(spatialreg)
library(tmap)
library(dplyr)
# 1. Prepare data: remove missing values
model_data <- davao_dengue %>%
filter(!is.na(cases), !is.na(population))
# 2. Create spatial weights
nb <- poly2nb(model_data)
listw <- nb2listw(nb, style = "W")
# 3. Fit Spatial Lag Model
lag_model <- lagsarlm(cases ~ population, data = model_data, listw = listw)
summary(lag_model)##
## Call:lagsarlm(formula = cases ~ population, data = model_data, listw = listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.5548 -2.9974 -1.7686 3.1687 19.9927
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17446748 0.67697351 0.2577 0.7966
## population 0.00108260 0.00004682 23.1224 <2e-16
##
## Rho: 0.20063, LR test value: 22.628, p-value: 1.9656e-06
## Asymptotic standard error: 0.041422
## z-value: 4.8437, p-value: 1.2744e-06
## Wald statistic: 23.461, p-value: 1.2744e-06
##
## Log likelihood: -604.2332 for lag model
## ML residual variance (sigma squared): 44.399, (sigma: 6.6633)
## Number of observations: 182
## Number of parameters estimated: 4
## AIC: 1216.5, (AIC for lm: 1237.1)
## LM test for residual autocorrelation
## test value: 7.9247, p-value: 0.0048766
# 4. Predict dengue cases
model_data$predicted_cases <- predict(lag_model)
# 5. Visualize predicted cases
tmap_mode("plot")
tm_shape(model_data) +
tm_fill("predicted_cases", palette = "Reds", title = "Predicted Dengue Cases") +
tm_borders() +
tm_layout(title = "Spatial Regression Prediction", legend.outside = TRUE)In addition to traditional spatial statistics and regression models, machine learning offers a flexible and powerful approach to predicting disease hotspots. In this analysis, we use a Random Forest classifier to predict which barangays in Davao City are likely to be dengue hotspots based on population data. This method allows us to model complex, nonlinear relationships and evaluate predictive performance using standard classification metrics.
We begin by preparing the spatial dataset. Using the filter() function from the dplyr package, we remove any barangays with missing values in the cases or population columns. This ensures that the model is trained on complete and reliable data.
Next, we define a binary hotspot label. We use the median number of dengue cases as a threshold: barangays with case counts above the median are labeled as hotspots (), and those below or equal to the median are labeled as non-hotspots (). This simplifies the prediction task into a binary classification problem
We then prepare the data for machine learning. Using , we remove the spatial geometry to create a standard data frame. We select only the relevant columns — the hotspot label and the population predictor — and remove any remaining missing values.
To train and evaluate the model, we split the data into training and testing sets using from the package. We allocate 80% of the data for training and reserve 20% for testing.
We then train a Random Forest classifier using the randomForest() function. This ensemble method builds multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting. We specify importance = TRUE to enable variable importance analysis.
Once the model is trained, we use it to predict hotspot labels for the test data. To evaluate the model’s performance, we convert the predicted and actual labels into factors and use confusionMatrix() from the caret package. This provides metrics such as accuracy, sensitivity, specificity, and the confusion matrix itself.
Next, we map the predictions back to the spatial data. We create a new column ml_hotspot in the original model_data and assign predicted values to the corresponding test rows. This allows us to visualize the results geographically. We then convert the hotspot labels into descriptive factors — “Hotspot” and “Non-Hotspot” — for mapping.
Finally, we use the tmap package to create a choropleth map of predicted hotspots. Barangays predicted as hotspots are shaded in dark red, while non-hotspots are shown in gray. The map includes borders and a clear legend for interpretation.
This machine learning approach complements traditional spatial methods by offering predictive insights based on patterns in the data. It enables rapid identification of high-risk areas and supports proactive public health planning.
# Load required libraries
library(sf)
library(dplyr)
library(caret)
library(randomForest)
library(tmap)
# 1. Prepare spatial data: remove missing values
model_data <- davao_dengue %>%
filter(!is.na(cases), !is.na(population))
# 2. Define hotspot label using median threshold
model_data$hotspot <- ifelse(model_data$cases > median(model_data$cases, na.rm = TRUE), 1, 0)
# 3. Prepare ML data: drop geometry and select predictors
ml_data <- model_data %>%
st_drop_geometry() %>%
select(hotspot, population) %>%
na.omit()
# 4. Train/test split
set.seed(123)
train_index <- createDataPartition(ml_data$hotspot, p = 0.8, list = FALSE)
train_data <- ml_data[train_index, ]
test_data <- ml_data[-train_index, ]
# 5. Train Random Forest model
rf_model <- randomForest(hotspot ~ population, data = train_data, importance = TRUE)
# 6. Predict on test data
predicted_hotspots <- predict(rf_model, newdata = test_data)
# 7. Evaluate model performance
predicted_hotspots <- factor(predicted_hotspots, levels = c(0, 1))
actual_hotspots <- factor(test_data$hotspot, levels = c(0, 1))
confusionMatrix(predicted_hotspots, actual_hotspots)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11 0
## 1 0 12
##
## Accuracy : 1
## 95% CI : (0.8518, 1)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : 3.173e-07
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4783
## Detection Rate : 0.4783
## Detection Prevalence : 0.4783
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
# 8. Map predictions back to spatial data
model_data$ml_hotspot <- NA
test_rows <- which(rownames(model_data) %in% rownames(test_data))
model_data$ml_hotspot[test_rows] <- as.numeric(as.character(predicted_hotspots))
# 9. Visualize predicted hotspots
model_data$ml_hotspot <- factor(model_data$ml_hotspot, levels = c(0, 1), labels = c("Non-Hotspot", "Hotspot"))
tmap_mode("plot")
tm_shape(model_data) +
tm_fill("ml_hotspot", palette = c("gray80", "darkred"), title = "ML Predicted Hotspots") +
tm_borders() +
tm_layout(title = "Machine Learning Hotspot Prediction", legend.outside = TRUE)