Spatial Analysis using R

1.0 Hotspot Analysis

1.1 Mapping Dengue Cases

This script demonstrates how to integrate spatial data with public health records to visualize the geographic distribution of dengue cases in Davao City at the barangay level. The workflow begins by loading the necessary R packages, including sf for spatial data handling, dplyr for data manipulation, and ggplot2 for visualization. Additional packages like viridis and stringi are used to enhance color accessibility and ensure consistent text formatting, respectively.

The first major step involves importing a Level 3 administrative boundary shapefile of the Philippines, which contains polygon geometries for all barangays. This is done using the st_read() function from the sf package. Since the focus is on Davao City, the dataset is filtered to include only those barangays where the NAME_2 attribute matches “DavaoCity”. Next, the script loads dengue case data from a CSV file. This dataset is expected to contain barangay-level information such as the number of dengue cases and population figures. However, to ensure that the spatial and tabular datasets can be merged correctly, the script standardizes the barangay names in both datasets. This involves converting all names to lowercase and removing any special characters or accents using stri_trans_general() from the stringi package. This preprocessing step is crucial because even minor inconsistencies in naming can prevent a successful join.

Once the names are standardized, the dengue data is merged with the spatial polygons using a left join on the NAME_3 column, which represents barangay names. The result is a spatial data frame that contains both the geographic boundaries and the associated dengue case data for each barangay in Davao City.

With the data prepared, the script proceeds to create a choropleth map using ggplot2. Each barangay is shaded according to the number of dengue cases it recorded, with a gradient from white (low cases) to dark red (high cases). The map includes a clean, minimal theme and informative labels, making it suitable for both analysis and presentation. This visualization allows public health officials and researchers to quickly identify areas with high dengue incidence and potential clustering patterns.

Finally, the map is saved as a high-resolution PDF using ggsave(), making it easy to include in reports, presentations, or printed materials. The entire workflow illustrates how spatial analysis in R can be used to support data-driven decision-making in public health, particularly in identifying and responding to disease hotspots.

# Load required packages
library(sf)
library(dplyr)
library(readxl)
library(ggplot2)
library(viridis)
library(stringi)

# 1. Load Barangay-level shapefile (GADM Level 3)
# Make sure the file path and layer name match your downloaded GADM file
ph_l3<- st_read("C:/Users/User/Desktop/Spatial Analysis/gadm41_PHL_3.json/gadm41_PHL_3.json", layer = "gadm41_PHL_3")

## Reading layer `gadm41_PHL_3' from data source 
##   `C:\Users\User\Desktop\Spatial Analysis\gadm41_PHL_3.json\gadm41_PHL_3.json' 
##   using driver `GeoJSON'
## Simple feature collection with 41948 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 116.9283 ymin: 4.5869 xmax: 126.6053 ymax: 21.0699
## Geodetic CRS:  WGS 84

# Filter only Davao City
davao_barangays <- ph_l3 %>% filter(NAME_2 == "DavaoCity")

# 2. Load Dengue data (barangay-level data in Excel)
# Example columns: Barangay | Dengue_Cases | Population
dengue_data <- read.csv("C:/Users/User/Desktop/Spatial Analysis/dengue.csv")

# Standardize both datasets
davao_barangays <- davao_barangays %>%
  mutate(NAME_3 = stri_trans_general(NAME_3, "Latin-ASCII"))


# Optional: check spelling differences
# Use stringr::str_to_lower() or similar if names mismatch
dengue_data$NAME_3 <- tolower(dengue_data$NAME_3)
davao_barangays$NAME_3 <- tolower(davao_barangays$NAME_3)

# 4. Merge dengue data with barangay polygons
davao_dengue <- davao_barangays %>%
  left_join(dengue_data, by = "NAME_3")

# 5. Plot choropleth map of dengue cases with white to dark red gradient
ggplot(davao_dengue) +
  geom_sf(aes(fill = cases), color = "gray20", size = 0.1) +
  scale_fill_gradient(
    low = "white",
    high = "darkred",
    name = "Dengue Cases"
  ) +
  theme_minimal() +
  labs(
    title = "Distribution of Dengue Cases in Davao City Barangays",
    subtitle = "Data Source: [Your source here]",
    caption = "Mapping dengue cases helps identify clusters and high-risk areas."
  )

# 6. Save the map
ggsave("davao_dengue_map.pdf", width = 8.5, height = 11, units = "in", dpi = 300)

1.2 Hotspot Analysis

After visualizing the raw distribution of dengue cases across barangays in Davao City, we move into a more advanced spatial analysis technique: hotspot detection. This method allows us to statistically identify areas where dengue cases are significantly clustered — either as high-value hotspots or low-value coldspots. To achieve this, we use the Local Gi* statistic, a well-established tool in spatial epidemiology.

We begin by defining the spatial relationships between barangays. This is done using the poly2nb() function, which examines the geometry of each polygon (barangay) and identifies its neighbors — those that share a border. These neighbor relationships are essential for understanding spatial dependence, as they allow us to compare each barangay’s dengue case count with those of its immediate surroundings.

Next, we convert this neighbor list into a spatial weights matrix using nb2listw(). This matrix assigns weights to each neighbor, indicating how much influence it has on the target barangay. We use the “W” style, which standardizes the weights so that they sum to one for each row. This ensures that the influence of neighboring areas is proportionally distributed and comparable across all barangays.

With the spatial structure in place, we calculate the Local Gi* statistic using the localG() function. This function compares each barangay’s dengue case count to the average of its neighbors, adjusted for overall spatial variation. The result is a Z-score for each barangay: - A high positive Z-score indicates a potential hotspot — a barangay with high dengue cases surrounded by other high-case areas. - A low negative Z-score suggests a coldspot — a barangay with low cases surrounded by other low-case areas. - Z-scores near zero imply no significant clustering. We then attach these Z-scores to the spatial dataset as a new column called GiZScore. This allows us to visualize the results on a map.

To create the map, we use the tmap package. We set the mode to “plot” for static visualization, and use tm_shape() to initialize the map with our spatial data. The tm_fill() function fills each barangay based on its Gi* Z-score, using a diverging red-blue color palette (-RdBu). In this palette, red represents hotspots and blue represents coldspots. We add borders for clarity and include a title and legend to make the map informative and presentation-ready.

This hotspot map provides a powerful visual tool for public health decision-making. It highlights areas of concern where dengue cases are not just high, but statistically clustered — guiding targeted interventions, resource allocation, and further investigation.

library(spdep)
library(tmap)

# 8. Hotspot Analysis using Local Gi* Statistic
nb <- poly2nb(davao_dengue)                      # Create neighbors list
listw <- nb2listw(nb, style = "W")               # Spatial weights
gi_star <- localG(davao_dengue$cases, listw)     # Calculate Gi* statistic

davao_dengue$GiZScore <- as.numeric(gi_star)     # Add Gi* Z-scores to data

# 9. Visualize Hotspots
tmap_mode("plot")
tm_shape(davao_dengue) +
  tm_fill("GiZScore", palette = "-RdBu", title = "Hotspot (Gi*)") +
  tm_borders() +
  tm_layout(
    title = "Hotspot Analysis of Dengue Cases in Davao City",
    legend.outside = TRUE
  )

2.0 Spatial Modelling

2.1 Spatial Regression

Spatial regression is a vital tool in public health research, especially when analyzing geographically distributed data like dengue cases. Unlike traditional regression models, spatial regression accounts for the fact that nearby locations often influence each other — a phenomenon known as spatial autocorrelation. In this analysis, we aim to understand how population size relates to dengue case counts across barangays in Davao City, while adjusting for spatial dependence using two models: the Spatial Lag Model and the Spatial Error Model.

We begin by preparing the data. Using the filter() function from the dplyr package, we remove any barangays with missing values in the cases or population columns. This ensures that our models are built on complete data:

Next, we define the spatial relationships between barangays. This is done using the poly2nb() function from the spdep package, which constructs a neighbor list based on shared polygon boundaries. We then convert this list into a spatial weights matrix using nb2listw(), specifying the “W” style to standardize the weights.

With the spatial structure in place, we fit our first model: the Spatial Lag Model. This model assumes that dengue cases in one barangay are directly influenced by the number of cases in neighboring barangays. We use the lagsarlm() function from the spatialreg package to estimate this model.

The summary() output provides key information, including the regression coefficient for population and the spatial lag parameter (ρ), which quantifies the strength of spatial dependence. A significant ρ value suggests that dengue cases tend to cluster — high values in one barangay are associated with high values in adjacent areas.

To complement this, we also fit a Spatial Error Model, which assumes that spatial dependence exists in the error terms rather than the outcome variable. This model is useful when unobserved spatial factors — such as environmental conditions or mobility patterns — influence dengue case counts. We estimate this model using the errorsarlm() function.

Again, the summary() output reveals the spatial error parameter (λ), which captures the degree of autocorrelation in the residuals. Comparing the two models helps us determine whether spatial clustering is better explained by direct spillover effects (as in the lag model) or by unmeasured spatial processes (as in the error model).

By incorporating spatial regression into our analysis, we gain a deeper understanding of how dengue spreads across Davao City. These models allow us to quantify the relationship between population and disease burden while adjusting for spatial clustering — a critical step in designing targeted public health interventions and allocating resources effectively.

# Load required libraries
library(spdep)
library(spatialreg)

# 1. Filter out missing values FIRST
model_data <- davao_dengue %>% 
  filter(!is.na(cases), !is.na(population))

# 2. Rebuild spatial weights based on filtered data
nb <- poly2nb(model_data)
listw <- nb2listw(nb, style = "W")

# 3. Run Spatial Lag Model
library(spatialreg)
lag_model <- lagsarlm(cases ~ population, data = model_data, listw = listw)
summary(lag_model)

## 
## Call:lagsarlm(formula = cases ~ population, data = model_data, listw = listw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.5548  -2.9974  -1.7686   3.1687  19.9927 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17446748 0.67697351  0.2577   0.7966
## population  0.00108260 0.00004682 23.1224   <2e-16
## 
## Rho: 0.20063, LR test value: 22.628, p-value: 1.9656e-06
## Asymptotic standard error: 0.041422
##     z-value: 4.8437, p-value: 1.2744e-06
## Wald statistic: 23.461, p-value: 1.2744e-06
## 
## Log likelihood: -604.2332 for lag model
## ML residual variance (sigma squared): 44.399, (sigma: 6.6633)
## Number of observations: 182 
## Number of parameters estimated: 4 
## AIC: 1216.5, (AIC for lm: 1237.1)
## LM test for residual autocorrelation
## test value: 7.9247, p-value: 0.0048766

# 4. Run Spatial Error Model
error_model <- errorsarlm(cases ~ population, data = model_data, listw = listw)
summary(error_model)

## 
## Call:errorsarlm(formula = cases ~ population, data = model_data, listw = listw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.5940  -3.3250  -2.4151   3.3367  23.8875 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3469e+00 6.5377e-01  2.0602  0.03938
## population  1.2542e-03 3.4223e-05 36.6475  < 2e-16
## 
## Lambda: -0.026461, LR test value: 0.036285, p-value: 0.84893
## Asymptotic standard error: 0.11224
##     z-value: -0.23575, p-value: 0.81363
## Wald statistic: 0.055576, p-value: 0.81363
## 
## Log likelihood: -615.5292 for error model
## ML residual variance (sigma squared): 50.705, (sigma: 7.1208)
## Number of observations: 182 
## Number of parameters estimated: 4 
## AIC: 1239.1, (AIC for lm: 1237.1)

2.2 Spatial Clustering

In spatial analysis, clustering techniques allow us to group geographic units based on similarity in attributes and spatial proximity. One such method is the SKATER algorithm — short for Spatial ’K’luster Analysis by Tree Edge Removal. This technique is particularly useful in public health mapping, where we aim to identify groups of barangays that share similar characteristics, such as dengue case counts and population size, while also being geographically connected.

The process begins with data preparation. We first filter the dataset to remove any barangays with missing values in the cases or population columns. This ensures that the clustering algorithm operates on complete and meaningful data.

Next, we define the spatial structure of the barangays by creating a neighbor list using the poly2nb() function from the spdep package. This function identifies which barangays share boundaries. We then convert this neighbor list into a spatial weights matrix using nb2listw(), with the “W” style to standardize the weights.

Next, we define the spatial structure of the barangays by creating a neighbor list using the poly2nb() function from the spdep package. This function identifies which barangays share boundaries. We then convert this neighbor list into a spatial weights matrix using nb2listw(), with the “W” style to standardize the weights.

To prepare for clustering, we construct a minimum spanning tree (MST) — a structure that connects all barangays with the shortest possible total distance, without forming loops. This is done by first calculating the centroid coordinates of each barangay using st_centroid() and st_coordinates(). We then compute the distances between neighbors using nbdists() and transform these into weights using an inverse distance function. The MST is built using mstree().

With the spatial structure defined, we select the numeric variables to be used for clustering. In this case, we choose cases and population, ensuring they are numeric and standardized using scale() to give them equal weight in the clustering process.

We then apply the SKATER algorithm using the skater() function. This function partitions the MST into clusters by removing edges, based on attribute similarity and spatial connectivity. The number of clusters is controlled by the ncuts parameter, which is set to one less than the desired number of clusters. For example, to create five clusters, we set ncuts = 4.

The resulting cluster assignments are stored in skater_result$groups, which we attach to the spatial dataset as a new factor variable called cluster.

Finally, we visualize the clusters using the tmap package. We set the mode to “plot” and use tm_shape() to initialize the map. The tm_fill() function colors each barangay according to its cluster assignment, using the “Set3” palette for clear differentiation. Borders are added with tm_borders() to enhance readability.

This map provides a powerful visual summary of spatial clustering in dengue data, revealing patterns that may not be apparent from raw case counts alone. By grouping barangays based on both their attributes and spatial relationships, the SKATER algorithm supports more targeted and regionally coherent public health interventions.

library(spdep)
library(sf)
library(tmap)

# 1. Filter out missing values
model_data <- davao_dengue %>%
  filter(!is.na(cases), !is.na(population))

# 2. Create spatial weights
nb <- poly2nb(model_data)
listw <- nb2listw(nb, style = "W")

# 3. Build minimum spanning tree (MST)
coords <- st_coordinates(st_centroid(model_data))
dists <- nbdists(nb, coords)
weights <- lapply(dists, function(x) 1 / (1 + x))
mst <- mstree(nb2listw(nb, glist = weights))

# 4. Select numeric variables for clustering
# Make sure only numeric columns are passed
vars <- model_data %>%
  st_drop_geometry() %>%
  select(cases, population) %>%
  mutate(across(everything(), as.numeric)) %>%
  scale()

# 5. Run SKATER clustering (e.g., 5 clusters)
skater_result <- skater(mst, vars, ncuts = 4)  # ncuts = clusters - 1

# 6. Add cluster labels to data
model_data$cluster <- factor(skater_result$groups)

# 7. Visualize clusters
tmap_mode("plot")
tm_shape(model_data) +
  tm_fill("cluster", palette = "Set3", title = "SKATER Clusters") +
  tm_borders()

After performing spatial clustering using the SKATER algorithm, the next step is to organize and export the results for reporting, visualization, or further analysis. The SKATER algorithm assigns each barangay to a cluster based on both spatial proximity and attribute similarity (e.g., dengue cases and population). These cluster labels are stored in the model_data object, which contains both the spatial geometry and the associated attributes.

To prepare the cluster assignments for export, we first remove the spatial geometry using the st_drop_geometry() function from the sf package. This simplifies the dataset to a regular data frame, making it easier to manipulate and save. We then select only the relevant columns: NAME_3, which contains the barangay names, and cluster, which holds the cluster labels assigned by SKATER. Using arrange(cluster), we sort the barangays by their cluster number to produce a clean, grouped list.

This grouped data frame provides a clear summary of which barangays belong to each cluster. It can be printed directly in the R console using print(barangay_clusters) to inspect the results or share them during analysis.

For documentation or reporting purposes, we save the grouped cluster assignments to a CSV file using the write.csv() function. This creates a file named “barangay_clusters_by_skater.csv” that can be opened in spreadsheet software or imported into other analytical tools.

This export step ensures that the results of the spatial clustering are preserved and accessible outside of R, supporting collaboration, presentation, and integration with other datasets or mapping platforms.

# Assuming 'model_data' contains the cluster assignments and barangay names

# Group barangays by cluster
barangay_clusters <- model_data %>%
  st_drop_geometry() %>%
  select(NAME_3, cluster) %>%
  arrange(cluster)

# View the grouped list
print(barangay_clusters)

##                 NAME_3 cluster
## 1               acacia       1
## 2              alambre       1
## 3     alejandranavarro       1
## 4              angalan       1
## 5             atan-awe       1
## 6            baganihan       1
## 7           bagoaplaya       1
## 8           bagooshiro       1
## 9               baguio       1
## 10          balengaeng       1
## 11              baliok       1
## 12      bangkasheights       1
## 13              bantol       1
## 14           baracatan       1
## 15                bato       1
## 16             bayabas       1
## 17         biaoescuela       1
## 18         biaoguianga       1
## 19         biaojoaquin       1
## 20             binugao       1
## 21                buda       1
## 22             bunawan       1
## 23            cadalian       1
## 24             calinan       1
## 25             callawa       1
## 26             camansi       1
## 27              carmen       1
## 28     catalunangrande       1
## 29    catalunanpequeno       1
## 30             catigan       1
## 31             cawayan       1
## 32             colosas       1
## 33     crossingbayabas       1
## 34             dacudao       1
## 35               dalag       1
## 36            dalagdag       1
## 37              daliao       1
## 38   daliaonplantation       1
## 39         datusalumay       1
## 40             dominga       1
## 41               dumoy       1
## 42                eden       1
## 43              fatima       1
## 44            gatungan       1
## 45            gumalang       1
## 46             gumitan       1
## 47               ilang       1
## 48           inayangan       1
## 49            indangan       1
## 50              kilate       1
## 51              lacson       1
## 52             lamanan       1
## 53           lampianao       1
## 54              langub       1
## 55              lizada       1
## 56           losamigos       1
## 57             lubogan       1
## 58              lumiad       1
## 59             mabuhay       1
## 60           magsaysay       1
## 61             mahayag       1
## 62             malabog       1
## 63             malagos       1
## 64             malamba       1
## 65          manambulan       1
## 66              mandug       1
## 67       manuelguianga       1
## 68              mapula       1
## 69           marapangi       1
## 70             marilog       1
## 71          matinabiao       1
## 72          megkawayan       1
## 73              mintal       1
## 74             mudiang       1
## 75               mulig       1
## 76           newcarmen       1
## 77         newvalencia       1
## 78             panacan       1
## 79             panalum       1
## 80           pandaitan       1
## 81             pangyan       1
## 82           paquibato       1
## 83       paradiseembak       1
## 84           riverside       1
## 85           salapawan       1
## 86            salaysay       1
## 87               saloy       1
## 88           sanisidro       1
## 89           santonino       1
## 90                sasa       1
## 91             sibulan       1
## 92             sirawan       1
## 93               sirib       1
## 94              suawan       1
## 95             subasta       1
## 96             sumimao       1
## 97             tacunan       1
## 98            tagakpan       1
## 99             tagluno       1
## 100           tagurano       1
## 101          talandang       1
## 102        talomoriver       1
## 103           tamayong       1
## 104          tambobong       1
## 105            tamugan       1
## 106              tapak       1
## 107        tawan-tawan       1
## 108            tibuloy       1
## 109           tibungco       1
## 110              toril       1
## 111             tugbok       1
## 112         tungakalan       1
## 113                ula       1
## 114    vicentehizonsr.       1
## 115               waan       1
## 116             wangan       1
## 117              wines       1
## 118             centro       2
## 119              agdao       3
## 120 alfonsoangliongtos       3
## 121        barangay1-a       3
## 122       barangay13-b       3
## 123       barangay14-b       3
## 124       barangay15-b       3
## 125       barangay16-b       3
## 126       barangay17-b       3
## 127       barangay18-b       3
## 128        barangay2-a       3
## 129        barangay3-a       3
## 130       barangay34-d       3
## 131       barangay35-d       3
## 132       barangay37-d       3
## 133       barangay38-d       3
## 134       barangay39-d       3
## 135       barangay40-d       3
## 136        barangay5-a       3
## 137        barangay6-a       3
## 138        barangay7-a       3
## 139             bucana       3
## 140           buhangin       3
## 141          cabantian       3
## 142           communal       3
## 143  gov.pacianobangoy       3
## 144 gov.vicenteduterte       3
## 145 kap.tomasmonteverd       3
## 146          lapu-lapu       3
## 147     leongarcia,sr.       3
## 148               ma-a       3
## 149            magtuod       3
## 150       matinaaplaya       3
## 151     matinacrossing       3
## 152        matinapangi       3
## 153           pampanga       3
## 154     rafaelcastillo       3
## 155         sanantonio       3
## 156             talomo       3
## 157            tigatto       3
## 158             ubalde       3
## 159     wilfredoaquino       3
## 160        bagogallera       4
## 161       barangay10-a       5
## 162       barangay11-b       5
## 163       barangay12-b       5
## 164       barangay19-b       5
## 165       barangay20-b       5
## 166       barangay21-c       5
## 167       barangay22-c       5
## 168       barangay23-c       5
## 169       barangay24-c       5
## 170       barangay25-c       5
## 171       barangay26-c       5
## 172       barangay27-c       5
## 173       barangay28-c       5
## 174       barangay29-c       5
## 175       barangay30-c       5
## 176       barangay31-d       5
## 177       barangay32-d       5
## 178       barangay33-d       5
## 179       barangay36-d       5
## 180        barangay4-a       5
## 181        barangay8-a       5
## 182        barangay9-a       5

# Optional: Save to CSV for reporting
write.csv(barangay_clusters, "barangay_clusters_by_skater.csv", row.names = FALSE)

After performing spatial clustering using the SKATER algorithm, the next step is to interpret the results by summarizing the characteristics of each cluster. Clustering groups barangays based on both geographic proximity and similarity in attributes such as dengue case counts and population. However, to make these clusters meaningful for public health planning, we need to quantify what each group represents.

We begin by creating a cluster profile table using the dplyr package. First, we remove the spatial geometry from the dataset using st_drop_geometry() to simplify the data into a standard tabular format. We then group the data by the cluster variable, which contains the cluster assignments generated by SKATER. Within each group, we calculate several key statistics.

This summary provides a comprehensive overview of each cluster: - Barangay_Count shows how many barangays are in each group. - Total_Cases and Total_Population give the cumulative burden and size of each cluster. - Avg_Cases and Avg_Population help compare clusters on a per-barangay basis. - Incidence_Rate expresses the number of dengue cases per 1,000 people, offering a standardized measure of disease intensity. To present these results in a clear and professional format, we use the kable() function from the knitr package. This generates a neatly formatted table with custom column names and rounded values.

This table is suitable for inclusion in reports, presentations, or dashboards. It allows public health officials to quickly compare clusters and identify which ones have the highest disease burden or incidence rates. For further inspection or sharing, we print the profile directly to the console: print(cluster_profile).

And optionally, we export the results to a CSV file using write.csv(), making it easy to share with collaborators or integrate into other tools: write.csv(cluster_profile, “cluster_profiles.csv”, row.names = FALSE).

This profiling step transforms raw cluster assignments into actionable insights. By summarizing the characteristics of each group, we support data-driven decision-making and help prioritize interventions in the most affected areas.

# Assuming 'model_data' contains 'cluster', 'cases', 'population', and 'NAME_3'

# 1. Summarize key statistics by cluster
cluster_profile <- model_data %>%
  st_drop_geometry() %>%
  group_by(cluster) %>%
  summarise(
    Barangay_Count = n(),
    Total_Cases = sum(cases, na.rm = TRUE),
    Total_Population = sum(population, na.rm = TRUE),
    Avg_Cases = mean(cases, na.rm = TRUE),
    Avg_Population = mean(population, na.rm = TRUE),
    Incidence_Rate = (sum(cases, na.rm = TRUE) / sum(population, na.rm = TRUE)) * 1000
  )

library(knitr)

# Display the cluster profile as a formatted table
kable(
  cluster_profile,
  caption = "Cluster Profiles of Davao City Barangays",
  digits = 2,
  col.names = c(
    "Cluster",
    "No. of Barangays",
    "Total Dengue Cases",
    "Total Population",
    "Average Cases",
    "Average Population",
    "Incidence Rate (per 1,000)"
  )
)

Cluster Profiles of Davao City Barangays
Cluster	No. of Barangays	Total Dengue Cases	Total Population	Average Cases	Average Population	Incidence Rate (per 1,000)
1	117	497	426466.2	4.25	3645.01	1.17
2	1	3	2415.0	3.00	2415.00	1.24
3	41	1403	915812.5	34.22	22336.89	1.53
4	1	55	39852.0	55.00	39852.00	1.38
5	22	981	763750.0	44.59	34715.91	1.28

# 2. View the profile
print(cluster_profile)

## # A tibble: 5 × 7
##   cluster Barangay_Count Total_Cases Total_Population Avg_Cases Avg_Population
##   <fct>            <int>       <int>            <dbl>     <dbl>          <dbl>
## 1 1                  117         497          426466.      4.25          3645.
## 2 2                    1           3            2415       3             2415 
## 3 3                   41        1403          915812.     34.2          22337.
## 4 4                    1          55           39852      55            39852 
## 5 5                   22         981          763750      44.6          34716.
## # ℹ 1 more variable: Incidence_Rate <dbl>

# Optional: Save to CSV
write.csv(cluster_profile, "cluster_profiles.csv", row.names = FALSE)

3.0 Prediction of hotpots

3.1 Spatial Regression

Spatial regression is a powerful technique for modeling geographically distributed phenomena, especially when observations are influenced by their spatial context. In this analysis, we use a Spatial Lag Model to predict dengue case counts across barangays in Davao City, based on population size and spatial relationships. The goal is to identify potential hotspots — areas where dengue cases are expected to be high due to both demographic and geographic factors.

Next, we define the spatial structure of the barangays. The poly2nb() function from the spdep package creates a neighbor list based on shared polygon boundaries, identifying which barangays are adjacent to each other. We then convert this neighbor list into a spatial weights matrix using nb2listw(), with the “W” style to standardize the weights.

With the spatial weights defined, we fit a Spatial Lag Model using the lagsarlm() function from the spatialreg package. This model assumes that dengue cases in one barangay are influenced not only by its own population size but also by the number of cases in neighboring barangays. The model includes a spatial lag term (ρ) that captures this dependence.

The summary() output provides estimates for the regression coefficient of population and the spatial lag parameter. A significant lag term indicates that dengue cases tend to cluster — high values in one barangay are associated with high values in adjacent areas. Once the model is fitted, we use the predict() function to generate predicted dengue case counts for each barangay. These predictions reflect both the influence of population and the spatial structure of the data.

To visualize the results, we use the tmap package to create a choropleth map. We set the mode to “plot” for static visualization and use tm_shape() to initialize the map with our spatial data. The tm_fill() function colors each barangay based on its predicted case count, using a red gradient to indicate intensity. Borders are added with tm_borders(), and the layout is customized with tm_layout().

This map provides a powerful visual tool for identifying potential dengue hotspots. By incorporating both demographic and spatial factors, the Spatial Lag Model offers a more accurate and context-aware prediction of disease distribution — supporting targeted public health interventions and resource allocation.

# Load required libraries
library(sf)
library(spdep)
library(spatialreg)
library(tmap)
library(dplyr)

# 1. Prepare data: remove missing values
model_data <- davao_dengue %>%
  filter(!is.na(cases), !is.na(population))

# 2. Create spatial weights
nb <- poly2nb(model_data)
listw <- nb2listw(nb, style = "W")

# 3. Fit Spatial Lag Model
lag_model <- lagsarlm(cases ~ population, data = model_data, listw = listw)
summary(lag_model)

## 
## Call:lagsarlm(formula = cases ~ population, data = model_data, listw = listw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.5548  -2.9974  -1.7686   3.1687  19.9927 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17446748 0.67697351  0.2577   0.7966
## population  0.00108260 0.00004682 23.1224   <2e-16
## 
## Rho: 0.20063, LR test value: 22.628, p-value: 1.9656e-06
## Asymptotic standard error: 0.041422
##     z-value: 4.8437, p-value: 1.2744e-06
## Wald statistic: 23.461, p-value: 1.2744e-06
## 
## Log likelihood: -604.2332 for lag model
## ML residual variance (sigma squared): 44.399, (sigma: 6.6633)
## Number of observations: 182 
## Number of parameters estimated: 4 
## AIC: 1216.5, (AIC for lm: 1237.1)
## LM test for residual autocorrelation
## test value: 7.9247, p-value: 0.0048766

# 4. Predict dengue cases
model_data$predicted_cases <- predict(lag_model)

# 5. Visualize predicted cases
tmap_mode("plot")
tm_shape(model_data) +
  tm_fill("predicted_cases", palette = "Reds", title = "Predicted Dengue Cases") +
  tm_borders() +
  tm_layout(title = "Spatial Regression Prediction", legend.outside = TRUE)

3.2 Machine Learning

In addition to traditional spatial statistics and regression models, machine learning offers a flexible and powerful approach to predicting disease hotspots. In this analysis, we use a Random Forest classifier to predict which barangays in Davao City are likely to be dengue hotspots based on population data. This method allows us to model complex, nonlinear relationships and evaluate predictive performance using standard classification metrics.

We begin by preparing the spatial dataset. Using the filter() function from the dplyr package, we remove any barangays with missing values in the cases or population columns. This ensures that the model is trained on complete and reliable data.

Next, we define a binary hotspot label. We use the median number of dengue cases as a threshold: barangays with case counts above the median are labeled as hotspots (), and those below or equal to the median are labeled as non-hotspots (). This simplifies the prediction task into a binary classification problem

We then prepare the data for machine learning. Using , we remove the spatial geometry to create a standard data frame. We select only the relevant columns — the hotspot label and the population predictor — and remove any remaining missing values.

To train and evaluate the model, we split the data into training and testing sets using from the package. We allocate 80% of the data for training and reserve 20% for testing.

We then train a Random Forest classifier using the randomForest() function. This ensemble method builds multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting. We specify importance = TRUE to enable variable importance analysis.

Once the model is trained, we use it to predict hotspot labels for the test data. To evaluate the model’s performance, we convert the predicted and actual labels into factors and use confusionMatrix() from the caret package. This provides metrics such as accuracy, sensitivity, specificity, and the confusion matrix itself.

Next, we map the predictions back to the spatial data. We create a new column ml_hotspot in the original model_data and assign predicted values to the corresponding test rows. This allows us to visualize the results geographically. We then convert the hotspot labels into descriptive factors — “Hotspot” and “Non-Hotspot” — for mapping.

Finally, we use the tmap package to create a choropleth map of predicted hotspots. Barangays predicted as hotspots are shaded in dark red, while non-hotspots are shown in gray. The map includes borders and a clear legend for interpretation.

This machine learning approach complements traditional spatial methods by offering predictive insights based on patterns in the data. It enables rapid identification of high-risk areas and supports proactive public health planning.

# Load required libraries
library(sf)
library(dplyr)
library(caret)
library(randomForest)
library(tmap)

# 1. Prepare spatial data: remove missing values
model_data <- davao_dengue %>%
  filter(!is.na(cases), !is.na(population))

# 2. Define hotspot label using median threshold
model_data$hotspot <- ifelse(model_data$cases > median(model_data$cases, na.rm = TRUE), 1, 0)

# 3. Prepare ML data: drop geometry and select predictors
ml_data <- model_data %>%
  st_drop_geometry() %>%
  select(hotspot, population) %>%
  na.omit()

# 4. Train/test split
set.seed(123)
train_index <- createDataPartition(ml_data$hotspot, p = 0.8, list = FALSE)
train_data <- ml_data[train_index, ]
test_data <- ml_data[-train_index, ]

# 5. Train Random Forest model
rf_model <- randomForest(hotspot ~ population, data = train_data, importance = TRUE)

# 6. Predict on test data
predicted_hotspots <- predict(rf_model, newdata = test_data)

# 7. Evaluate model performance
predicted_hotspots <- factor(predicted_hotspots, levels = c(0, 1))
actual_hotspots <- factor(test_data$hotspot, levels = c(0, 1))
confusionMatrix(predicted_hotspots, actual_hotspots)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 11  0
##          1  0 12
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8518, 1)
##     No Information Rate : 0.5217     
##     P-Value [Acc > NIR] : 3.173e-07  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4783     
##          Detection Rate : 0.4783     
##    Detection Prevalence : 0.4783     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
##

# 8. Map predictions back to spatial data
model_data$ml_hotspot <- NA
test_rows <- which(rownames(model_data) %in% rownames(test_data))
model_data$ml_hotspot[test_rows] <- as.numeric(as.character(predicted_hotspots))

# 9. Visualize predicted hotspots
model_data$ml_hotspot <- factor(model_data$ml_hotspot, levels = c(0, 1), labels = c("Non-Hotspot", "Hotspot"))

tmap_mode("plot")
tm_shape(model_data) +
  tm_fill("ml_hotspot", palette = c("gray80", "darkred"), title = "ML Predicted Hotspots") +
  tm_borders() +
  tm_layout(title = "Machine Learning Hotspot Prediction", legend.outside = TRUE)