1. Introduction

Understanding how environmental exposures influence population health is central to public health surveillance. This mini project applies foundational epidemiological methods to a synthetic dataset simulating small-area population health patterns. It draws inspiration from the work of the Small Area Health Statistics Unit (SAHSU) at Imperial College London, which investigates disease risk at a fine geographic scale.

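This study explores the association between a simulated environmental exposure index and disease incidence, using concepts drawn from introductory epidemiology coursework (Coursera) and the Oxford Handbook. Section 2.2 maps each concept to its source.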

2. Methods

2.1 Data Simulation

A dataset was generated with the following structure:

Variable             Description
-------------------  -----------------------------------------------
Area                 Area identifier (Area_001–Area_100)
Population           Simulated population size
Cases                Simulated number of disease cases
Exposure_Index       Numeric value from 1–10 (e.g., pollution level)
Latitude, Longitude  Coordinates for mapping

# Load the packages used for data import and manipulation
library(readr)
library(dplyr)

# Load the data
df <- read_csv("mini_project_synthetic_data.csv")

# Calculate incidence per 1000 population
df <- df %>%
  mutate(Incidence_per_1000 = round((Cases / Population) * 1000, 2))

head(df)
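
The code that generated the synthetic file is not shown in this report. As a rough sketch only, a dataset with the same columns could be simulated as below. The seed, the value ranges, the baseline rate, the exposure effect, the coordinate bounding box and the output filename are all illustrative assumptions, not the values used for the project data.

```r
# Hypothetical simulation of a dataset with the structure described above.
# Every numeric choice here is an assumption made for illustration.
set.seed(42)

n_areas    <- 100
exposure   <- round(runif(n_areas, min = 1, max = 10), 1)
population <- sample(500:5000, n_areas, replace = TRUE)

# Assumed log-linear effect of exposure on the expected rate per person
baseline_rate   <- 0.005   # assumed rate per person at exposure = 0
log_rr_per_unit <- 0.1     # assumed log rate ratio per unit of exposure
expected_cases  <- population * baseline_rate * exp(log_rr_per_unit * exposure)

sim <- data.frame(
  Area           = sprintf("Area_%03d", seq_len(n_areas)),
  Population     = population,
  Cases          = rpois(n_areas, lambda = expected_cases),
  Exposure_Index = exposure,
  Latitude       = runif(n_areas, min = 51.3, max = 51.7),  # arbitrary box
  Longitude      = runif(n_areas, min = -0.5, max = 0.3)    # arbitrary box
)

head(sim)

# Uncomment to save under a hypothetical filename:
# write.csv(sim, "simulated_example.csv", row.names = FALSE)
```

Drawing case counts from a Poisson distribution keeps them as non-negative integers and makes rates in small areas noisier than rates in large ones, which is the usual behaviour of small-area data.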

2.2 Epidemiological Measures Applied

Concept                       Source                        Application
----------------------------  ----------------------------  -----------------------------------------------------------
Incidence Rate                Coursera 1                    New cases per 1,000 people
Risk Difference               Coursera 1 & Oxford Handbook  Difference in incidence between high vs low exposure groups
Ecological Design             Coursera 2                    Area-level rather than individual analysis
Exposure-Outcome Association  Coursera 2                    Explored via regression
Mapping & Visualisation       Data science application      Choropleth and bubble maps of disease patterns
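
The first two measures in the table can be written out explicitly. The notation is introduced here for illustration and does not appear in the original analysis: the subscript i indexes an area, and Q1 and Q5 denote the lowest and highest exposure quintiles.

```latex
\text{Incidence}_i = \frac{\text{Cases}_i}{\text{Population}_i} \times 1000
\qquad
\text{RD} = \overline{\text{Incidence}}_{Q5} - \overline{\text{Incidence}}_{Q1}
```

Here the bar denotes the unweighted mean of the area-level rates within a quintile, which is how the quintile summary in Section 3 is computed.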

3. Results

3.1 Distribution of Incidence Rates

The incidence rates across the simulated areas were calculated and visualized to understand the distribution of disease burden.

# Histogram of incidence (one observation per area)
library(ggplot2)

ggplot(df, aes(x = Incidence_per_1000)) +
  geom_histogram(fill = "#0072B2", color = "white", binwidth = 0.5) +
  labs(title = "Distribution of Incidence Rates", x = "Incidence per 1,000", y = "Count of Areas") +
  theme_minimal()

3.2 Incidence by Exposure Quintile

To explore the relationship between environmental exposure and disease incidence, the areas were divided into quintiles based on the Exposure_Index. The mean incidence rate was calculated for each quintile, and the risk difference was then calculated between the highest (Q5) and lowest (Q1) exposure quintiles.

# Calculate and display mean incidence per exposure quintile
df <- df %>%
  mutate(Exposure_Quintile = ntile(Exposure_Index, 5))

quintile_summary <- df %>%
  group_by(Exposure_Quintile) %>%
  summarise(Mean_Incidence = round(mean(Incidence_per_1000), 2), .groups = "drop")

# Print the summary table to the console
print(quintile_summary)
## # A tibble: 5 × 2
##   Exposure_Quintile Mean_Incidence
##               <int>          <dbl>
## 1                 1           7.25
## 2                 2           7.9 
## 3                 3          10.2 
## 4                 4          11.8 
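## 5                 5          14.6

# Calculate the risk difference between Quintile 5 and Quintile 1.
# Note: the tibble printout shows 3 significant figures, so the difference of
# the stored values may not equal the difference of the displayed ones.
risk_diff <- quintile_summary$Mean_Incidence[5] - quintile_summary$Mean_Incidence[1]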

# Display the result 
cat("Risk Difference (Q5 - Q1):", round(risk_diff, 2), "cases per 1,000\n")
## Risk Difference (Q5 - Q1): 7.39 cases per 1,000
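
The quintile means above weight every area equally, regardless of population size. One possible sensitivity check, not part of the original analysis, is to pool cases and population within each quintile. The sketch below uses a small hypothetical data frame (`toy`) with the report's column names so that it runs without the CSV file.

```r
library(dplyr)

# Hypothetical stand-in for df; the values are made up for illustration.
toy <- data.frame(
  Area           = sprintf("Area_%03d", 1:10),
  Population     = c(1000, 2000, 1500, 3000, 2500, 1200, 1800, 2200, 900, 1600),
  Cases          = c(5, 12, 9, 24, 20, 12, 20, 26, 12, 24),
  Exposure_Index = 1:10
)

quintile_compare <- toy %>%
  mutate(Exposure_Quintile = ntile(Exposure_Index, 5)) %>%
  group_by(Exposure_Quintile) %>%
  summarise(
    # Unweighted mean of area-level rates (the approach used in the report)
    Mean_Incidence   = mean(Cases / Population * 1000),
    # Population-weighted rate: total cases over total population
    Pooled_Incidence = sum(Cases) / sum(Population) * 1000,
    .groups = "drop"
  )

print(quintile_compare)

# Risk difference (Q5 - Q1) on the pooled scale
pooled_rd <- quintile_compare$Pooled_Incidence[5] - quintile_compare$Pooled_Incidence[1]
pooled_rd
```

If the two columns differ noticeably in the real data, small areas with unstable rates are likely to be influencing the unweighted means.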

3.3 Exposure vs. Incidence Regression

To assess the relationship between the Exposure_Index and Incidence_per_1000, a linear regression line was fitted to the area-level data and overlaid on a scatterplot with its 95% confidence band. The slope of this line quantifies the change in incidence per unit increase in exposure.

ggplot(df, aes(x = Exposure_Index, y = Incidence_per_1000)) +
  geom_point(color = "#D55E00") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "Exposure vs. Incidence", x = "Exposure Index", y = "Incidence per 1,000") +
  theme_minimal()
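
The plot draws the fitted line but does not report its coefficients. A minimal sketch of how they could be obtained is shown below, again on a small hypothetical data frame (`toy`) rather than the project data. The Poisson model with a population offset is an alternative often used for count outcomes and is included here as an assumption for comparison, not as the method used in this report.

```r
# Hypothetical stand-in for df; the values are made up for illustration.
toy <- data.frame(
  Population     = rep(1000, 10),
  Cases          = c(7, 6, 9, 8, 11, 10, 13, 12, 15, 14),
  Exposure_Index = 1:10
)
toy$Incidence_per_1000 <- toy$Cases / toy$Population * 1000

# Ordinary least squares: change in incidence per 1,000 per unit of exposure
fit_lm <- lm(Incidence_per_1000 ~ Exposure_Index, data = toy)
coef(fit_lm)
confint(fit_lm)

# Poisson alternative: models counts directly, with log(Population) as offset
fit_pois <- glm(Cases ~ Exposure_Index + offset(log(Population)),
                family = poisson(link = "log"), data = toy)

# Exponentiated slope = rate ratio per one-unit increase in exposure
rate_ratio <- exp(coef(fit_pois)["Exposure_Index"])
rate_ratio
```

The linear model gives an absolute effect (cases per 1,000 per exposure unit), while the Poisson model gives a relative one (rate ratio per exposure unit), so the two answer slightly different questions.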

3.4 Spatial Mapping

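To visualize the spatial distribution of disease incidence, an interactive point map was created using the leaflet package, with one circle marker per area coloured by incidence quintile. Because the data contain point coordinates rather than area boundaries, this is a marker map rather than a true choropleth. The map shows the spatial variation in Incidence_per_1000.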

library(leaflet)
library(RColorBrewer)

# Create a color palette for Incidence
pal <- colorQuantile("YlOrRd", df$Incidence_per_1000, n = 5)

# Create an interactive map
leaflet(df) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(
    lng = ~Longitude,
    lat = ~Latitude,
    radius = 6,
    color = ~pal(Incidence_per_1000),
    stroke = TRUE,
    fillOpacity = 0.8,
    popup = ~paste0(
      "<b>Area:</b> ", Area, "<br>",
      "<b>Exposure Index:</b> ", Exposure_Index, "<br>",
      "<b>Cases:</b> ", Cases, "<br>",
      "<b>Incidence/1,000:</b> ", Incidence_per_1000
    )
  ) %>%
  addLegend(
    "bottomright",
    pal = pal,
    values = ~Incidence_per_1000,
    title = "Incidence per 1,000",
    opacity = 1
  )
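
The markers above all share a fixed radius, so incidence is encoded by colour only. To make the map a bubble map in the stricter sense, the radius could also scale with incidence. The helper below, `scale_radius`, is a hypothetical function written for this sketch and is not part of leaflet. The pixel range of 3 to 12 is an arbitrary assumption.

```r
# Hypothetical helper: map a numeric vector linearly onto a radius range (px)
scale_radius <- function(x, min_radius = 3, max_radius = 12) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # All values equal: fall back to the midpoint radius
    return(rep((min_radius + max_radius) / 2, length(x)))
  }
  min_radius + (x - rng[1]) / (rng[2] - rng[1]) * (max_radius - min_radius)
}

# Smallest value maps to 3, largest to 12, the middle value to 7.5
scale_radius(c(5, 10, 15))

# In the addCircleMarkers() call, the fixed radius could then be replaced by:
#   radius = ~scale_radius(Incidence_per_1000)
```

Linear scaling of the radius exaggerates differences in perceived area, so scaling on the square root of incidence may be preferable if the bubbles are meant to be read quantitatively.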

4. Discussion

The data show a clear ecological pattern: mean incidence rises with every exposure quintile, from 7.25 per 1,000 in the lowest quintile to 14.6 per 1,000 in the highest. The risk difference of 7.39 cases per 1,000 between the highest and lowest quintiles suggests a substantial difference in disease burden potentially attributable to environmental exposure.

This relationship is further supported by a positive trend in the scatterplot and the color gradient in the spatial bubble map. While the dataset is synthetic, it mirrors SAHSU-style real-world investigations into spatial variation in environmental determinants of health.

Limitations:

  • This is an ecological study: area-level associations may not hold for individuals, and confounding by factors such as age or socioeconomic status (SES) cannot be ruled out.

  • No age standardisation or temporal factors were included.

  • Exposure is simplified and not based on real-world pollution metrics.

Strengths:

  • All analytical techniques align with epidemiological principles from the Coursera training.

  • The study design is realistic for small-area health surveillance work like that done at SAHSU.

5. Conclusion

This project demonstrates the application of core epidemiological principles to simulated spatial data. It reflects the kind of analytical approach SAHSU uses to explore small-area variations in disease risk and environmental exposures.

Skills Applied:

  • Incidence & risk comparison

  • Ecological analysis

  • Exposure-disease relationship

  • Spatial visualisation using R

Appendix

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-3 leaflet_2.2.2      knitr_1.50         tmap_4.1          
## [5] sf_1.0-21          readr_2.1.5        ggplot2_3.5.2      dplyr_1.1.4       
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6            xfun_0.52               bslib_0.9.0            
##  [4] raster_3.6-32           htmlwidgets_1.6.4       lattice_0.22-7         
##  [7] tzdb_0.5.0              leaflet.providers_2.0.0 vctrs_0.6.5            
## [10] tools_4.5.1             crosstalk_1.2.1         generics_0.1.4         
## [13] parallel_4.5.1          tibble_3.3.0            proxy_0.4-27           
## [16] pkgconfig_2.0.3         Matrix_1.7-3            KernSmooth_2.23-26     
## [19] data.table_1.17.6       lifecycle_1.0.4         compiler_4.5.1         
## [22] farver_2.1.2            terra_1.8-54            codetools_0.2-20       
## [25] leafsync_0.1.0          leaflegend_1.2.1        stars_0.6-8            
## [28] htmltools_0.5.8.1       class_7.3-23            sass_0.4.10            
## [31] yaml_2.3.10             crayon_1.5.3            pillar_1.11.0          
## [34] jquerylib_0.1.4         classInt_0.4-11         cachem_1.1.0           
## [37] lwgeom_0.2-14           wk_0.9.4                abind_1.4-8            
## [40] nlme_3.1-168            tidyselect_1.2.1        digest_0.6.37          
## [43] splines_4.5.1           labeling_0.4.3          fastmap_1.2.0          
## [46] grid_4.5.1              colorspace_2.1-1        cli_3.6.5              
## [49] logger_0.4.0            magrittr_2.0.3          maptiles_0.10.0        
## [52] base64enc_0.1-3         dichromat_2.0-0.1       XML_3.99-0.18          
## [55] cols4all_0.8            leafem_0.2.4            e1071_1.7-16           
## [58] withr_3.0.2             scales_1.4.0            bit64_4.6.0-1          
## [61] sp_2.2-0                rmarkdown_2.29          bit_4.6.0              
## [64] png_0.1-8               hms_1.1.3               evaluate_1.0.4         
## [67] tmaptools_3.2           viridisLite_0.4.2       mgcv_1.9-3             
## [70] s2_1.1.9                rlang_1.1.6             Rcpp_1.1.0             
## [73] glue_1.8.0              DBI_1.2.3               vroom_1.6.5            
## [76] rstudioapi_0.17.1       jsonlite_2.0.0          R6_2.6.1               
## [79] spacesXYZ_1.6-0         units_0.8-7