Understanding how environmental exposures influence population health is central to public health surveillance. This mini project applies foundational epidemiological methods to a synthetic dataset simulating small-area population health patterns. It draws inspiration from the work of the Small Area Health Statistics Unit (SAHSU) at Imperial College London, which investigates disease risk at a fine geographic scale.
This study explores the association between a simulated environmental exposure index and disease incidence, using concepts learned through Coursera epidemiology training and the Oxford Handbook.
A dataset was generated with the following structure:
| Variable | Description |
|---|---|
| Area | Area identifier (Area_001–Area_100) |
| Population | Simulated population size |
| Cases | Simulated number of disease cases |
| Exposure_Index | Numeric value from 1–10 (e.g., pollution level) |
| Latitude, Longitude | Coordinates for mapping |
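The generating process is not described here, so the following is only a minimal sketch of how a dataset with this structure could be simulated. Every parameter value (the seed, population range, baseline rate, exposure effect and coordinate bounding box) and the object name `sim` are assumptions made for illustration, not the values used to build `mini_project_synthetic_data.csv`.

```r
# Hypothetical simulation of a dataset with the structure in the table above.
# All numeric settings below are illustrative assumptions.
set.seed(42)

n_areas <- 100

sim <- data.frame(
  Area           = sprintf("Area_%03d", seq_len(n_areas)),      # Area_001 ... Area_100
  Population     = sample(1000:10000, n_areas, replace = TRUE),
  Exposure_Index = round(runif(n_areas, min = 1, max = 10), 1),
  Latitude       = runif(n_areas, min = 51.3, max = 51.7),      # assumed bounding box
  Longitude      = runif(n_areas, min = -0.5, max = 0.3)
)

# The expected rate rises log-linearly with exposure (assumed form),
# and observed cases are drawn as Poisson counts around that expectation.
baseline_rate  <- 0.005   # assumed rate per person, extrapolated to exposure 0
log_rate_ratio <- 0.10    # assumed change in log rate per unit of exposure
expected_cases <- sim$Population * baseline_rate *
  exp(log_rate_ratio * sim$Exposure_Index)
sim$Cases <- rpois(n_areas, lambda = expected_cases)

str(sim)
```

Drawing cases from a Poisson distribution scaled by population keeps small areas noisier than large ones, which is the behaviour small-area analyses usually have to contend with.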
```r
library(readr)  # read_csv()
library(dplyr)  # %>% and mutate()

# Load the data
df <- read_csv("mini_project_synthetic_data.csv")

# Calculate incidence per 1,000 population
df <- df %>%
  mutate(Incidence_per_1000 = round((Cases / Population) * 1000, 2))

head(df)
```
| Concept | Source | Application |
|---|---|---|
| Incidence Rate | Coursera 1 | New cases per 1,000 people |
| Risk Difference | Coursera 1 & Oxford Handbook | Difference in incidence between high vs low exposure groups |
| Ecological Design | Coursera 2 | Area-level rather than individual analysis |
| Exposure-Outcome Association | Coursera 2 | Explored via regression |
| Mapping & Visualisation | Data science application | Choropleth and bubble maps of disease patterns |
The incidence rates across the simulated areas were calculated and visualized to understand the distribution of disease burden.
```r
library(ggplot2)

# Histogram of incidence
ggplot(df, aes(x = Incidence_per_1000)) +
  geom_histogram(fill = "#0072B2", color = "white", binwidth = 0.5) +
  labs(title = "Distribution of Incidence Rates", x = "Incidence per 1,000", y = "Count of Areas") +
  theme_minimal()
```
### Incidence by Exposure Quintile
To explore the relationship between environmental exposure and
disease incidence, the areas were divided into quintiles of
`Exposure_Index`. The mean incidence rate was calculated for each
quintile, and the risk difference was then calculated between the
highest and lowest exposure quintiles.
```r
# Calculate and display mean incidence per exposure quintile
df <- df %>%
  mutate(Exposure_Quintile = ntile(Exposure_Index, 5))

quintile_summary <- df %>%
  group_by(Exposure_Quintile) %>%
  summarise(Mean_Incidence = round(mean(Incidence_per_1000), 2), .groups = "drop")

# Print the summary table to the console
print(quintile_summary)
```
```
## # A tibble: 5 × 2
##   Exposure_Quintile Mean_Incidence
##               <int>          <dbl>
## 1                 1           7.25
## 2                 2           7.9
## 3                 3          10.2
## 4                 4          11.8
## 5                 5          14.6
```
```r
# Calculate the risk difference between Quintile 5 and Quintile 1
# (rows of quintile_summary are ordered by Exposure_Quintile)
risk_diff <- quintile_summary$Mean_Incidence[5] - quintile_summary$Mean_Incidence[1]

# Display the result
cat("Risk Difference (Q5 - Q1):", round(risk_diff, 2), "cases per 1,000\n")
```
```
## Risk Difference (Q5 - Q1): 7.39 cases per 1,000
```
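The quintile means above are unweighted averages of area-level rates, so a small area counts as much as a large one. As a hedged illustration of an alternative, the sketch below pools cases and population within each group before taking the difference, and also reports a rate ratio. The `toy` data frame and its values are invented for this example and are not taken from the project dataset.

```r
# Toy data invented for illustration: two areas in each of two exposure groups.
toy <- data.frame(
  Exposure_Quintile = c(1, 1, 5, 5),
  Population        = c(1000, 9000, 2000, 8000),
  Cases             = c(12, 54, 40, 80)
)
toy$Incidence_per_1000 <- toy$Cases / toy$Population * 1000

# Unweighted mean of area-level rates (the approach used in the analysis above)
unweighted <- tapply(toy$Incidence_per_1000, toy$Exposure_Quintile, mean)

# Pooled rate: total cases over total population within each group
pooled <- tapply(toy$Cases, toy$Exposure_Quintile, sum) /
  tapply(toy$Population, toy$Exposure_Quintile, sum) * 1000

# Absolute and relative contrasts between the highest and lowest groups
rate_difference <- unname(pooled["5"] - pooled["1"])
rate_ratio      <- unname(pooled["5"] / pooled["1"])

round(rbind(unweighted, pooled), 2)
cat("Pooled rate difference (Q5 - Q1):", round(rate_difference, 2), "per 1,000\n")
cat("Pooled rate ratio (Q5 / Q1):", round(rate_ratio, 2), "\n")
```

In this toy example the unweighted means are pulled upwards by the small, high-rate areas, which is why the two summaries differ. Which summary is preferable depends on whether the area or the person is the unit of interest.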
To assess the relationship between `Exposure_Index`
and `Incidence_per_1000`, a linear regression line was
fitted and overlaid on a scatterplot of the areas. The fitted line and
its confidence band summarise the direction and strength of the association
between environmental exposure and disease incidence.
```r
# Scatterplot of incidence against exposure with a fitted linear trend
ggplot(df, aes(x = Exposure_Index, y = Incidence_per_1000)) +
  geom_point(color = "#D55E00") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "Exposure vs. Incidence", x = "Exposure Index", y = "Incidence per 1,000") +
  theme_minimal()
```
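The plot draws the regression line but does not report its coefficients. The sketch below shows one way the same linear model could be fitted explicitly with `lm()`, alongside a Poisson regression with a log-population offset, which is a different technique often preferred for area-level counts. It runs on simulated stand-in data: the `demo` data frame and all parameter values are assumptions, so the estimates say nothing about the project dataset.

```r
# Simulated stand-in data; every parameter value here is an assumption.
set.seed(1)
n <- 100
demo <- data.frame(
  Population     = sample(2000:10000, n, replace = TRUE),
  Exposure_Index = runif(n, min = 1, max = 10)
)
demo$Cases <- rpois(n, demo$Population * 0.005 * exp(0.1 * demo$Exposure_Index))
demo$Incidence_per_1000 <- demo$Cases / demo$Population * 1000

# Linear model: the same y ~ x fit that geom_smooth(method = "lm") draws
fit_lm <- lm(Incidence_per_1000 ~ Exposure_Index, data = demo)
coef(fit_lm)      # intercept and change in incidence per unit of exposure
confint(fit_lm)   # 95% confidence intervals for both coefficients

# Poisson alternative: model counts directly, with population as an offset
fit_pois <- glm(Cases ~ Exposure_Index + offset(log(Population)),
                family = poisson(link = "log"), data = demo)

# Exponentiated slope = rate ratio per one-unit increase in exposure
rate_ratio_per_unit <- exp(coef(fit_pois)[["Exposure_Index"]])
rate_ratio_per_unit
```

The linear slope reads as additional cases per 1,000 for each unit of exposure, while the Poisson model gives a multiplicative rate ratio and accounts for areas with larger populations carrying more information.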
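To visualize the spatial distribution of disease incidence, an
interactive point map was created using the `leaflet` package. Each
area is drawn as a circle marker coloured by its quintile of
`Incidence_per_1000`, which shows the spatial variation in incidence
rates. Because the data contain point coordinates rather than area
boundaries, this is a bubble map rather than a true choropleth.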
```r
library(leaflet)
library(RColorBrewer)

# Create a color palette for Incidence
pal <- colorQuantile("YlOrRd", df$Incidence_per_1000, n = 5)

# Create an interactive map
leaflet(df) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(
    lng = ~Longitude,
    lat = ~Latitude,
    radius = 6,
    color = ~pal(Incidence_per_1000),
    stroke = TRUE,
    fillOpacity = 0.8,
    popup = ~paste0(
      "<b>Area:</b> ", Area, "<br>",
      "<b>Exposure Index:</b> ", Exposure_Index, "<br>",
      "<b>Cases:</b> ", Cases, "<br>",
      "<b>Incidence/1,000:</b> ", Incidence_per_1000
    )
  ) %>%
  addLegend(
    "bottomright",
    pal = pal,
    values = ~Incidence_per_1000,
    title = "Incidence per 1,000",
    opacity = 1
  )
```
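The palette above assigns colours by quantile rather than by equal-width intervals. As a rough base R analogue of that binning (a sketch of the idea, not leaflet's internal code), the example below computes quintile breaks and assigns each value to a bin. The `incidence` vector is made up for illustration.

```r
# Hypothetical incidence values, invented for illustration
incidence <- c(4.2, 6.8, 7.5, 9.1, 10.4, 11.0, 12.7, 13.9, 15.3, 18.6)

# Six break points define five quantile bins (0%, 20%, ..., 100%)
breaks <- quantile(incidence, probs = seq(0, 1, length.out = 6))

# Assign each value to a bin; include.lowest keeps the minimum in bin 1
bin <- cut(incidence, breaks = breaks, include.lowest = TRUE, labels = FALSE)

data.frame(incidence, bin)
table(bin)   # roughly equal counts per bin by construction
```

Quantile binning gives each colour about the same number of areas, which makes relative rank easy to read but can hide how far apart the actual incidence values are. A legend built from such a palette typically describes percentile ranges rather than incidence values.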
The data show a clear ecological pattern: mean incidence rises with each exposure quintile, from 7.25 per 1,000 in the lowest quintile to 14.6 per 1,000 in the highest. The risk difference between these two quintiles, 7.39 cases per 1,000, suggests a substantial difference in disease burden potentially attributable to environmental exposure.
This relationship is further supported by a positive trend in the scatterplot and the color gradient in the spatial bubble map. While the dataset is synthetic, it mirrors SAHSU-style real-world investigations into spatial variation in environmental determinants of health.
This is an ecological study: confounding by individual-level factors such as age or socioeconomic status (SES) cannot be ruled out.
No age-standardisation was performed, and temporal factors were not included.
The exposure measure is simplified and not based on real-world pollution metrics.
All analytical techniques align with epidemiological principles from the Coursera training.
The study design is realistic for small-area health surveillance work like that done at SAHSU.
This project demonstrates the application of core epidemiological principles to simulated spatial data. It reflects the kind of analytical approach SAHSU uses to explore small-area variations in disease risk and environmental exposures.
Skills Applied:
```
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 leaflet_2.2.2 knitr_1.50 tmap_4.1
## [5] sf_1.0-21 readr_2.1.5 ggplot2_3.5.2 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0
## [4] raster_3.6-32 htmlwidgets_1.6.4 lattice_0.22-7
## [7] tzdb_0.5.0 leaflet.providers_2.0.0 vctrs_0.6.5
## [10] tools_4.5.1 crosstalk_1.2.1 generics_0.1.4
## [13] parallel_4.5.1 tibble_3.3.0 proxy_0.4-27
## [16] pkgconfig_2.0.3 Matrix_1.7-3 KernSmooth_2.23-26
## [19] data.table_1.17.6 lifecycle_1.0.4 compiler_4.5.1
## [22] farver_2.1.2 terra_1.8-54 codetools_0.2-20
## [25] leafsync_0.1.0 leaflegend_1.2.1 stars_0.6-8
## [28] htmltools_0.5.8.1 class_7.3-23 sass_0.4.10
## [31] yaml_2.3.10 crayon_1.5.3 pillar_1.11.0
## [34] jquerylib_0.1.4 classInt_0.4-11 cachem_1.1.0
## [37] lwgeom_0.2-14 wk_0.9.4 abind_1.4-8
## [40] nlme_3.1-168 tidyselect_1.2.1 digest_0.6.37
## [43] splines_4.5.1 labeling_0.4.3 fastmap_1.2.0
## [46] grid_4.5.1 colorspace_2.1-1 cli_3.6.5
## [49] logger_0.4.0 magrittr_2.0.3 maptiles_0.10.0
## [52] base64enc_0.1-3 dichromat_2.0-0.1 XML_3.99-0.18
## [55] cols4all_0.8 leafem_0.2.4 e1071_1.7-16
## [58] withr_3.0.2 scales_1.4.0 bit64_4.6.0-1
## [61] sp_2.2-0 rmarkdown_2.29 bit_4.6.0
## [64] png_0.1-8 hms_1.1.3 evaluate_1.0.4
## [67] tmaptools_3.2 viridisLite_0.4.2 mgcv_1.9-3
## [70] s2_1.1.9 rlang_1.1.6 Rcpp_1.1.0
## [73] glue_1.8.0 DBI_1.2.3 vroom_1.6.5
## [76] rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1
## [79] spacesXYZ_1.6-0 units_0.8-7
```