This geospatial statistical model uses routinely collected malaria case data, population data and remotely sensed data on open and vegetated water bodies to estimate the population living around open water bodies, the expected number of malaria cases, and the standardised morbidity ratio (SMR) of malaria. Ultimately, it quantifies the association between proximity to larval habitat and malaria risk in health facility catchment areas in Kasungu. The SMR compares the risk of morbidity in a population of interest with that of a standard population. In this case, we want to find out whether the number of dry season malaria cases in each catchment area is greater than we would expect given the malaria rate for the entire Kasungu district.
We do this by comparing what we observe (O) with what we would expect (E) if the risk of malaria were equal throughout Kasungu. The SMR of catchment i is written as follows: \[SMR_i = \frac{O_i}{E_i}\]
Buffers around water bodies are created and combined with population data in raster format to estimate the proportion of the catchment population living within 1 km, 2 km and 3 km of water bodies. The observed malaria cases are then modeled using Poisson regression to find out whether living within these distances of water bodies is associated with variation in malaria risk in Kasungu district. We hypothesize that the risk of being a case in a catchment depends on proximity to water bodies. The data span 2017 to 2020 and were derived from digitized DHIS2 malaria records, accessibility mapping, an aggregated population geospatial layer and the TropWet tool in Google Earth Engine.
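As a rough illustration of the modelling step described above, the sketch below fits a Poisson regression to simulated catchment data using base R only. The data frame `toy`, its columns (`expected`, `prop_1km`, `observed`), the simulated effect size and the use of log expected cases as an offset are all assumptions made for this example; they are not taken from the Kasungu analysis.

```r
# Minimal sketch on simulated data (all values are hypothetical)
set.seed(42)
n_catchments <- 26

toy <- data.frame(
  # expected cases per catchment under equal risk
  expected = round(runif(n_catchments, 800, 12000)),
  # proportion of the catchment population living within 1 km of water
  prop_1km = runif(n_catchments, 0.1, 0.9)
)

# Simulate observed counts with a known log rate ratio of 1 for prop_1km
true_log_rr <- 1
toy$observed <- rpois(
  n_catchments,
  lambda = toy$expected * exp(-0.5 + true_log_rr * toy$prop_1km)
)

# Poisson regression with log(expected) as an offset, so that the
# coefficients describe variation in the SMR rather than in raw counts
fit <- glm(observed ~ prop_1km + offset(log(expected)),
           family = poisson(link = "log"),
           data = toy)

# Exponentiated coefficient: multiplicative change in risk when
# prop_1km goes from 0 to 1
rate_ratio <- exp(coef(fit)[["prop_1km"]])
rate_ratio
```

With the offset in place, an exponentiated coefficient above 1 would indicate that catchments with a larger share of people near water have more cases than expected, which is the direction the hypothesis predicts.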
Loading the R packages that will be used to read in, view, transform and model the malaria cases and spatial datasets.
suppressPackageStartupMessages({
library(SpatialEpi)
library(spdep)
library(spaMM)
library(popEpi)
library(Epi)
library(epitools) # compute confidence intervals of malaria data
library(tidyverse)
library(caret) # easily compute cross-validation methods to test model performance
library(stats)
library(MASS) # fit negative binomial model
library(performance) # check linear model performance
library(see) # to plot model assumptions
library(caTools) # splitting data into training and test data sets
library(glmmTMB) # check overdispersion
library(lme4) # fit (generalized) linear mixed-effects models
library(report) # automatically produce regression model reports
library(qqplotr)
library(DescTools) # tools for descriptive statistics e.g., pseudo R-squared
library(gtsummary) # easily display regression model outputs, such as P value, CI
library(ggpubr)
library(plotly)
library(lubridate)
library(knitr)
library(raster)
library(rgdal)
library(rgeos)
library(readr)
library(sf)
library(sp)
library(tmap)
library(maptools)
library(gridExtra)
library(ggsci)
library(grid)
library(exactextractr)
library(DataExplorer)
library(thematic)
library(mapview)
library(kableExtra) # style tables created with knitr::kable()
library(gt) # create beautiful HTML tables
library(pander) # improves the aesthetics of R outputs
library(imputeTS) # easily remove NAs
`%>%` <- magrittr::`%>%`
})
file.path(getwd(),"data")
## [1] "C:/Users/cnkolokosa/Documents/R/upscaled_2021_updated_May/upscaled_2021/data"
here::here()
## [1] "C:/Users/cnkolokosa/Documents/R/upscaled_2021_updated_May/upscaled_2021"
The total dry season malaria cases recorded at health-care facilities in Kasungu from 2017 to 2020 are contained in KasunguData.csv, sourced from https://dhis2.health.gov.mw/. The kasungu_facility_catchments_2004.shp shapefile contains the population and health information for each health facility catchment area in Kasungu district.
The aggregated population raster layers for Malawi, e.g. ku_pop_2017_1km_aggregated.tif, were downloaded from the WorldPop (Open Spatial Demographic Data and Research) website: https://www.worldpop.org/geodata/country?iso3=MWI. These layers estimate the total number of people per grid cell. The units are number of people per pixel, with country totals adjusted to match the corresponding official United Nations population estimates. The datasets were downloaded in GeoTIFF format at a resolution of 1 km and are in the WGS84 geographic coordinate system.
The kasungu_water.shp and water_bodies layers contain polygons of open and vegetated water bodies detected using the Tropical Wetland Unmixing Tool (TropWet). TropWet is a Google Earth Engine hosted toolbox that uses the Landsat archive to map tropical wetlands and can be accessed at: https://www.aber.ac.uk/en/dges/research/earth-observation-laboratory/research/tropwet/
# Kasungu dry season malaria data
# dry_season_malaria_2017_2020 <- readr::read_csv(
# "https://raw.githubusercontent.com/ClintyNkolokosa/Analysis-of-dry-season-malaria-cases-in-Kasungu/main/data/dry_season_malaria_2017_2020.csv")
dry_season_malaria_2017_2020 <- read.csv(here::here("data/dry_season_malaria_2017_2020.csv"),
stringsAsFactors = FALSE)
# Kasungu monthly NMCP confirmed malaria cases
monthly_malaria_2017_2021 <- read.csv(here::here("data/Kasungu monthly malaria 2017-2021.csv"),
stringsAsFactors = FALSE)
monthly_malaria_2017_2021$date <- lubridate::ym(monthly_malaria_2017_2021$periodid)
# Kasungu district boundary shapefile
kasungu_district <- sf::st_read(here::here("data", "kasungu_district.shp"))
# Kasungu health facility catchments generated from accessibility mapping
malire_new <- sf::st_read(here::here("data", "zipatala_catchment_areas.shp")) |>
sf::st_transform(32736) # reproject to WGS 84 / UTM zone 36S (EPSG:32736)
# Kasungu population raster layer
kasungu_population_2017 <- raster(here::here("data", "ku_pop_2017_1km_aggregated.tif"))
kasungu_population_2018 <- raster(here::here("data", "ku_pop_2018_1km_aggregated.tif"))
kasungu_population_2019 <- raster(here::here("data", "ku_pop_2019_1km_aggregated.tif"))
kasungu_population_2020 <- raster(here::here("data", "ku_pop_2020_1km_aggregated.tif"))
# Read in waterbodies polygons
dryseason_waterbodies_2017 <- sf::st_read(here::here("data", "water_bodies_2017.shp"))
dryseason_waterbodies_2018 <- sf::st_read(here::here("data", "kasungu_2018_water.shp"))
dryseason_waterbodies_2019 <- sf::st_read(here::here("data", "kasungu_2019_water.shp"))
dryseason_waterbodies_2020 <- sf::st_read(here::here("data", "water_bodies_2020.shp"))
# Add a row ID to water bodies polygons
dryseason_waterbodies_2017$ID <- 1:nrow(dryseason_waterbodies_2017)
dryseason_waterbodies_2018$ID <- 1:nrow(dryseason_waterbodies_2018)
dryseason_waterbodies_2019$ID <- 1:nrow(dryseason_waterbodies_2019)
dryseason_waterbodies_2020$ID <- 1:nrow(dryseason_waterbodies_2020)
Let's have a closer look at the malaria dataset. Kasungu district has 30 health facilities, classified as dispensary, health centre, rural hospital or district hospital, and the highest number of malaria cases was recorded at Kasungu District Hospital.
# Plotly bar chart -------------------------------------------------------------
bar_chart <- dry_season_malaria_2017_2020 |>
dplyr::filter(Names != "K2 Taso Clinic", # Have missing malaria records
Names != "Kalikeni Private Clinic",
Names != "Kakwale Health Centre",
Names != "St Andrews Community Hospital",
Names != "St. Faith Health Centre",
Names != "Chambwe Health Centre") |>
plotly::plot_ly(y = ~Names,
x = ~dr_2017,
type = "bar",
orientation = 'h',
name = "2017") |>
plotly::add_trace(x = ~ dr_2018,
name = "2018") |>
plotly::add_trace(x = ~ dr_2019,
name = "2019") |>
plotly::add_trace(x = ~ dr_2020,
name = "2020") |>
plotly::layout(xaxis = list(title = "Total malaria cases"),
yaxis = list(title = " "),
hovermode = "compare",
margin = list(b = 10,
t = 10,
pad = 2))
bar_chart
Fig. 1: The total malaria cases recorded at each health-care facility in Kasungu district
ggplot2::ggplot(monthly_malaria_2017_2021) +
aes(x = date,
y = NMCP) +
geom_line(size = 0.6,
colour = "#112446") +
scale_y_continuous(trans = "log2") +
labs(x = "Year",
y = "Confirmed malaria cases") +
theme_classic()
Fig. 2: Seasonal variation in confirmed malaria cases
A health facility catchment area is the area from which a health facility attracts patients. Since the available official catchment areas are outdated and no recent spatial data on catchment areas are available, new health facility catchment polygons were generated with a generic accessibility mapping script (adapted from https://malariaatlas.org/wp-content/uploads/accessibility/R_generic_accessibilty_mapping_script.r). The script requires two user-supplied datasets: the 2015 friction surface, available from http://www.map.ox.ac.uk/accessibility_to_cities/, and a .csv file of points, here dry_season_malaria_2017_2020. The accumulated cost algorithm accCost and the r.cost algorithm in QGIS were run to produce the final map of new health facility catchment boundaries.
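The text above does not spell out how accumulated-cost surfaces become catchment polygons. One common approach, assumed here purely for illustration, is least-cost allocation: each grid cell is assigned to the facility with the lowest accumulated travel time. The sketch below uses two hypothetical 3 x 3 travel-time grids (`facility_A`, `facility_B`) in base R and does not reproduce the accCost or r.cost workflow itself.

```r
# Hypothetical accumulated travel-time surfaces (minutes), one grid per facility
travel_time <- list(
  facility_A = matrix(c(10, 20, 30,
                        20, 30, 40,
                        30, 40, 50), nrow = 3, byrow = TRUE),
  facility_B = matrix(c(55, 45, 35,
                        45, 35, 25,
                        35, 25, 15), nrow = 3, byrow = TRUE)
)

# Flatten each surface into one column: rows are cells, columns are facilities
cost_by_facility <- sapply(travel_time, as.vector)

# For every cell, find the facility with the smallest travel time
nearest <- apply(cost_by_facility, 1, which.min)

# Reshape the allocation back into the grid layout
catchment <- matrix(names(travel_time)[nearest], nrow = 3)
catchment
```

Cells allocated to the same facility form that facility's catchment; in a GIS workflow the resulting raster would then be converted to polygons.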
# Use complete.cases() to keep only health facilities without missing values.
# Note that this drops a row with an NA in any column, including the
# longitude and latitude coordinates.
zipatala <- dry_season_malaria_2017_2020[complete.cases(dry_season_malaria_2017_2020),]
# Aggregate health facilities close to each other:
# a) Kasalika Health Centre and Kasungu District Hospital,
# b) Bua and Mziza Health Centres, and
# c) Kaluluma and Nkhamenya Rural Hospitals
# d) Anchor and Santhe Health Centres are combined in order to
# generate catchment areas that are geographically correct
# Each pair names the facility that keeps the combined cases ("into")
# and the facility whose cases are added to it ("from")
merge_pairs <- list(
  c(into = "Kasungu District Hospital", from = "Kasalika Health Centre"),
  c(into = "Nkhamenya Rural Hospital",  from = "Kaluluma Rural Hospital"),
  c(into = "Mziza Health Centre",       from = "Bua Health Centre"),
  c(into = "Santhe Health Centre",      from = "Anchor Farm")
)
case_columns <- c("dr_2017", "dr_2018", "dr_2019", "dr_2020")
# Add the dry season cases of the "from" facility to the "into" facility, year by year
for (pair in merge_pairs) {
  into_row <- which(zipatala$Names == pair[["into"]])
  from_row <- which(zipatala$Names == pair[["from"]])
  for (column in case_columns) {
    zipatala[[column]][into_row] <- zipatala[[column]][into_row] +
      zipatala[[column]][from_row]
  }
}
# Drop out the other health facilities
zipatala_aggregated <- zipatala |>
dplyr::filter(Names != "Kasalika Health Centre",
Names != "Bua Health Centre",
Names != "Kaluluma Rural Hospital",
Names != "Anchor Farm")
# Save csv file
# write.csv(zipatala_aggregated, "data/zipatala_aggregated.csv")
# Convert to sf object
zipatala_aggregated_sf <- sf::st_as_sf(zipatala_aggregated,
coords = c("LONGITU", "LATITUD"),
crs = 4326, agr = "constant")
# st_write(zipatala_aggregated_sf, "data/zipatala_combined.shp")
# Plot map
tm_shape(malire_new)+
tm_polygons()+
tm_shape(zipatala_aggregated_sf)+
tm_dots(size = .3,
col = "blue",
alpha = 0.5)+
tm_text("Names",
size = .3,
just = "top",
col = "black",
remove.overlap = TRUE)+
tm_layout(frame = FALSE,
title = "New Kasungu health facility \n catchment boundaries",
title.size = .8,
title.position = c("left", "top"))+
tm_compass(position=c("right", "top"))+
tm_scale_bar(breaks = c(0, 10, 20),
text.size = .5)
Fig. 3: Kasungu health-care facilities and catchment areas
# Custom function to create a raster population map ----------------------------
create.population.map <- function(population.raster, title){
# raster population map
# arguments:
# population.raster: aggregated population raster layer from WorldPop
# title: legend title
# returns:
# a tmap-element (plots a map)
# set to interactive view
# tmap::tmap_mode("view")
# plot map
tmap::tm_shape(population.raster)+
tmap::tm_raster(palette = "-viridis",
title = title,
breaks = c(0,100,200,400,600,800,1000,2000,4000,6000,8000))+
tmap::tm_layout(legend.position = c("right", "bottom"),
frame = FALSE)+
tmap::tm_scale_bar(position = c("left", "bottom"))
}
# Set to static map
tmap_mode("plot")
# Invoking function ------------------------------------------------------------
estimated_pop_2017 <- create.population.map(kasungu_population_2017, title = "2017 Population")
estimated_pop_2018 <- create.population.map(kasungu_population_2018, title = "2018 Population")
estimated_pop_2019 <- create.population.map(kasungu_population_2019, title = "2019 Population")
estimated_pop_2020 <- create.population.map(kasungu_population_2020, title = "2020 Population")
# Layout the maps
tmap::tmap_arrange(estimated_pop_2017, estimated_pop_2018,
estimated_pop_2019, estimated_pop_2020, nrow = 2)
Fig. 4: Estimated total number of people per 1 km grid cell
The WorldPop aggregated population rasters (e.g. kasungu_population_2017) and the DHIS2 dry season malaria dataset (dry_season_malaria_2017_2020) are assigned to the new health facility catchments.
# Helper function that assigns malaria data from health facilities to their catchments areas
assign.malaria.data <- function(catchment_boundary, malaria_data){
# arguments:
# catchment_boundary: sf polygon object of new catchment boundaries
# malaria_data: sf point object with a data frame containing the dry season malaria cases
# returns:
# catchments_malaria_sf: sf polygon object with a data frame containing dry season malaria cases
# Convert sf objects to spatial
catchment_shp <- as(catchment_boundary, "Spatial")
malaria_shp <- as(malaria_data, "Spatial")
# Match CRS
malaria_shp <- spTransform(malaria_shp, crs(catchment_shp))
# Overlay aggregated health facility points and extract 2017 - 2020 malaria cases
# Using 'point.in.poly' to return a point spatial object, in this case location of health facilities
# and estimated population instead of sp::over function, which simply returns
# a data frame, with the same no. rows.
# Argument 'sp = TRUE' returns an sp class object, else returns sf class object
# Joining the malaria and population dataset using only 'merge' function can't work due to
# non-unique columns and differences in row numbers
hospitals_in_catchment <- spatialEco::point.in.poly(malaria_shp, catchment_shp, sp = TRUE)
# Add the extracted ID, health facility names and dry season malaria cases to
# the health facility catchments (hfc)
hfc_malaria_shp <- merge(catchment_shp, hospitals_in_catchment, by.x = "DN", by.y = "rowID")
# Convert the shapefile containing malaria data to sf-object
hfc_malaria_sf <- sf::st_as_sf(hfc_malaria_shp)
# Tidy the data by dropping columns not needed
catchment_malaria <- hfc_malaria_sf |>
dplyr::select(-c(coords.x1, coords.x2))
return(catchment_malaria)
}
# Invoking the function --------------------------------------------------------
malaria_by_catchment <- assign.malaria.data(malire_new, zipatala_aggregated_sf)
The population raster is a continuous gridded surface in which every grid cell holds an estimated number of people. The population values are extracted using raster::extract(), summed and assigned to the catchment polygons.
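The idea behind this zonal sum can be shown without any spatial packages. The sketch below is a conceptual base R illustration, not the raster::extract() call itself: `population` and `catchment_id` are hypothetical 4 x 4 grids, where the second grid says which catchment each cell belongs to.

```r
# Hypothetical population grid (people per cell); NA marks a cell with no data
population <- matrix(c(10, 20,  5,  0,
                       15, 25, NA, 10,
                        0,  5, 30, 40,
                        5, 10, 20, 35), nrow = 4, byrow = TRUE)

# Hypothetical catchment ID of each cell
catchment_id <- matrix(c(1, 1, 2, 2,
                         1, 1, 2, 2,
                         3, 3, 2, 2,
                         3, 3, 3, 3), nrow = 4, byrow = TRUE)

# Sum the cell values within each catchment, ignoring missing cells
pop_by_catchment <- tapply(as.vector(population),
                           as.vector(catchment_id),
                           sum, na.rm = TRUE)
pop_by_catchment
```

Because the raster stores counts rather than densities, summing the cells in a zone gives the population total for that zone directly.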
# Helper function to extract population from WorldPop raster file and assign
# the values to the new catchments.
extract.pop.values <- function(kasungu_pop_raster, catchments){
# function to extract population from raster file and assign the population to catchments
# arguments:
# kasungu_pop_raster: population raster file clipped to Kasungu district
# catchments: shapefile containing the polygons that we will use as boundaries
# returns:
# catchments_malaria_pop_sf: sf polygon object containing malaria and population data
# convert from sf to sp
catchments_sp <- as(catchments, "Spatial")
# Reproject the catchments to the coordinate reference system of the raster
catchments_sp <- spTransform(catchments_sp, proj4string(kasungu_pop_raster))
# Crop and mask the population raster to exclude Kasungu National Park
pop_raster_clip <- raster::crop(kasungu_pop_raster, extent(catchments_sp)) |>
raster::mask(catchments_sp)
# Extracting zonal statistics from a population raster layer
pop_by_catchment <- round(raster::extract(pop_raster_clip, catchments_sp, fun = sum, na.rm = TRUE))
pop_by_catchment_df <- pop_by_catchment %>%
# apply unlist to the lists to have vectors as the list elements
lapply(unlist) %>%
# convert vectors to data.frames
lapply(as_tibble) %>%
# combine the list of data.frames
bind_rows(., .id = "rowID") %>%
# rename the value variable
dplyr::rename(pop = value)
# Add row ID to column to catchment layer
catchments$rowID <- 1:nrow(catchments)
# Merge catchment areas and population data
pop_by_catchments <- merge(catchments, pop_by_catchment_df, by = "rowID")
# Replace infinite values with NA; the result is assigned so the cleaning takes effect
pop_by_catchments <- pop_by_catchments |>
dplyr::mutate_if(is.numeric, ~ replace(., is.infinite(.), NA))
return(pop_by_catchments)
}
# Invoking the function -------------------------------------------------------------------------
malaria_pop_by_catchment_2017 <- extract.pop.values(kasungu_population_2017, malaria_by_catchment)
malaria_pop_by_catchment_2018 <- extract.pop.values(kasungu_population_2018, malaria_by_catchment)
malaria_pop_by_catchment_2019 <- extract.pop.values(kasungu_population_2019, malaria_by_catchment)
malaria_pop_by_catchment_2020 <- extract.pop.values(kasungu_population_2020, malaria_by_catchment)
Estimated total number of people within health facility catchment areas.
# max(malaria_pop_by_catchment_2017$pop)
# [1] 143490
# max(malaria_pop_by_catchment_2018$pop)
# [1] 147175
# max(malaria_pop_by_catchment_2019$pop)
# [1] 151079
# max(malaria_pop_by_catchment_2020$pop)
# [1] 185282
# Custom function to create maps of estimated population by catchment areas --------------------------
create.population.map <- function(catchment.area,
variable = "pop",
title,
legend.title = "Estimated \n population"){
# estimated population map
# catchment.area: estimated population layer returned by extract.pop.values()
# variable: variable name (as character, in quotes)
# title: map title in quotes
# legend.title: legend title in quotes
# returns:
# a tmap-element (plots a map)
tm_shape(catchment.area)+
tm_fill(col = variable,
breaks = c(0, 13000, 19000, 27000, 35000, 70000, 140000, 200000),
palette = "YlOrBr",
title = legend.title)+
tm_borders(col = "grey",
lwd = 0.4)+
tm_layout(legend.position = c(0.75, "bottom"),
legend.text.size = 0.6,
legend.title.size = 0.8,
frame = FALSE)+
tm_credits(title,
position = c(0.3, 0.8),
size = 1)
}
# Invoking the function --------------------------------------------------------------------------
pop_by_catchment_2017 <- create.population.map(malaria_pop_by_catchment_2017, title = "2017")
pop_by_catchment_2018 <- create.population.map(malaria_pop_by_catchment_2018, title = "2018")
pop_by_catchment_2019 <- create.population.map(malaria_pop_by_catchment_2019, title = "2019")
pop_by_catchment_2020 <- create.population.map(malaria_pop_by_catchment_2020, title = "2020")
tmap::tmap_arrange(pop_by_catchment_2017, pop_by_catchment_2018,
pop_by_catchment_2019, pop_by_catchment_2020, ncol = 2)
Fig. 5: Estimated population by health facility catchment areas
# Helper function to calculate population density by catchment -----------------
calculate.population.density <- function(pop.data){
# Convert to spatial object
pop.sp <- as(pop.data, "Spatial")
# Calculate area of catchment polygon in square kilometres
pop.sp$area_sqkm <- round(rgeos::gArea(pop.sp, byid = TRUE) / (1000 * 1000))
# Calculate population density
pop.sp$pop_density <- round(pop.sp$pop / pop.sp$area_sqkm)
# Convert back to sf object
pop.sf <- sf::st_as_sf(pop.sp)
return(pop.sf)
}
# Invoking function ------------------------------------------------------------
pop_density_2017 <- calculate.population.density(malaria_pop_by_catchment_2017)
pop_density_2018 <- calculate.population.density(malaria_pop_by_catchment_2018)
pop_density_2019 <- calculate.population.density(malaria_pop_by_catchment_2019)
pop_density_2020 <- calculate.population.density(malaria_pop_by_catchment_2020)
# Helper function to create population density maps ----------------------------
create.pop.density.map <- function(pop.density.data,
variable = "pop_density",
title = NA,
legend.title = "Population \ndensity/km^2"){
tm_shape(pop.density.data)+
tm_fill(col = variable,
breaks = c(0, 50, 100, 150, 200, 250, 300, 350),
palette = "-magma",
title = legend.title)+
tm_borders()+
tm_layout(legend.position = c(0.75, "bottom"),
legend.text.size = 0.6,
legend.title.size = 0.8,
frame = FALSE)+
tm_credits(title,
position = c(0.3, 0.8),
size = 1)
}
# Invoking function ------------------------------------------------------------
pop_density_2017_map <- create.pop.density.map(pop_density_2017, title = "2017")
pop_density_2018_map <- create.pop.density.map(pop_density_2018, title = "2018")
pop_density_2019_map <- create.pop.density.map(pop_density_2019, title = "2019")
pop_density_2020_map <- create.pop.density.map(pop_density_2020, title = "2020")
# Layout maps
tmap::tmap_arrange(pop_density_2017_map, pop_density_2018_map,
pop_density_2019_map, pop_density_2020_map, ncol = 2)
Fig. 6: Estimated population density by health facility catchment areas
The expected number of dry season malaria cases in catchment i is calculated as the observed risk (r) of malaria in the district, i.e. the total number of malaria cases in Kasungu district divided by the total population of the district, multiplied by N_i, the number of people in the catchment area: \[E_i = r \times N_i = \frac{\sum_j O_j}{\sum_j N_j}\times N_i\]
The expected number of dry season malaria cases is calculated under the assumption that there is no spatial variation in risk, i.e. no difference in infection rates between the catchment areas.
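To make the formula concrete, the following sketch recomputes the 2017 expected cases in base R. The two vectors are copied from the pop_2017 and observed_2017 columns of the 2017 SMR table in this document, in table order; the object names `district_rate` and `expected_2017` are chosen for this example only.

```r
# Catchment populations, copied from the pop_2017 column of the 2017 SMR table
pop_2017 <- c(9923, 40154, 13879, 27459, 39950, 26999, 28098, 27906, 35727,
              22009, 13061, 32704, 20026, 11864, 12768, 18177, 18744, 143490,
              35353, 17772, 44295, 26385, 21226, 22664, 43948, 23295)

# Observed dry season cases, copied from the observed_2017 column
observed_2017 <- c(564, 2720, 216, 1523, 1480, 1159, 1930, 3482, 2970,
                   594, 1553, 1153, 1005, 2558, 1140, 1382, 1982, 14663,
                   2031, 1987, 4098, 3845, 2588, 1012, 5668, 1487)

# District-wide risk: total cases divided by total population
district_rate <- sum(observed_2017) / sum(pop_2017)
round(district_rate, 2)  # 0.08

# Expected cases per catchment if every catchment shared the district-wide risk
expected_2017 <- round(district_rate * pop_2017)

# Rows 1, 14 and 18 are Lodjwa, Wimbe and Kasungu District Hospital
expected_2017[c(1, 14, 18)]  # 826 988 11951, as in the expected_2017 column
```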
# Compute and print the overall incidence of dry season malaria cases
overall_malaria_incidence_rate_2017 <- round(sum(
malaria_pop_by_catchment_2017$dr_2017) / sum(
malaria_pop_by_catchment_2017$pop), 2)
overall_malaria_incidence_rate_2017
## [1] 0.08
overall_malaria_incidence_rate_2018 <- round(sum(
malaria_pop_by_catchment_2018$dr_2018) / sum(
malaria_pop_by_catchment_2018$pop), 2)
overall_malaria_incidence_rate_2018
## [1] 0.09
overall_malaria_incidence_rate_2019 <- round(sum(
malaria_pop_by_catchment_2019$dr_2019) / sum(
malaria_pop_by_catchment_2019$pop), 2)
overall_malaria_incidence_rate_2019
## [1] 0.09
overall_malaria_incidence_rate_2020 <- round(sum(
malaria_pop_by_catchment_2020$dr_2020) / sum(
malaria_pop_by_catchment_2020$pop), 2)
overall_malaria_incidence_rate_2020
## [1] 0.12
# Calculate expected malaria cases ------------------------------------------------
expected_malaria_2017 <- malaria_pop_by_catchment_2017 |>
dplyr::rename(
observed_2017 = dr_2017,
pop_2017 = pop) |>
dplyr::mutate(
expected_2017 = round(sum(observed_2017)/sum(pop_2017, na.rm = TRUE)*pop_2017))
expected_malaria_2018 <- malaria_pop_by_catchment_2018 |>
dplyr::rename(
observed_2018 = dr_2018,
pop_2018 = pop) |>
dplyr::mutate(
expected_2018 = round(sum(observed_2018)/sum(pop_2018, na.rm = TRUE)*pop_2018))
expected_malaria_2019 <- malaria_pop_by_catchment_2019 |>
dplyr::rename(
observed_2019 = dr_2019,
pop_2019 = pop) |>
dplyr::mutate(
expected_2019 = round(sum(observed_2019)/sum(pop_2019, na.rm = TRUE)*pop_2019))
expected_malaria_2020 <- malaria_pop_by_catchment_2020 |>
dplyr::rename(
observed_2020 = dr_2020,
pop_2020 = pop) |>
dplyr::mutate(
expected_2020 = round(sum(observed_2020)/sum(pop_2020, na.rm = TRUE)*pop_2020))
The SMR compares the risk of morbidity in a population of interest with that of a standard population. In this case, we want to find out whether the number of dry season malaria cases in each catchment area is greater than we would expect given the malaria rate for the entire Kasungu district.
We do this by comparing what we observe (O) with what we would expect (E) if the risk of malaria were equal throughout Kasungu. The SMR of catchment i can be calculated as follows: \[SMR_i = \frac{O_i}{E_i}\]
An SMR above 1 means that more dry season malaria cases were observed than expected, i.e. a higher risk than the district average; an SMR below 1 means fewer cases than expected, i.e. a lower risk.
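As a small worked example of this interpretation, the sketch below computes the SMR for three catchments using observed and expected values taken from the 2017 SMR table in this document. The data frame `smr_example` and the `risk` labels are hypothetical names introduced for illustration.

```r
# Observed and expected 2017 cases for three catchments (from the 2017 SMR table)
smr_example <- data.frame(
  facility = c("Wimbe Health Centre",
               "Kapelula Health Centre",
               "Newa Mpasazi Health Centre"),
  observed = c(2558, 2970, 216),
  expected = c(988, 2976, 1156)
)

# SMR = observed / expected, rounded to one decimal place as in the tables
smr_example$SMR <- round(smr_example$observed / smr_example$expected, 1)

# Label each catchment relative to the district-wide risk
smr_example$risk <- ifelse(smr_example$SMR > 1, "higher than expected",
                    ifelse(smr_example$SMR < 1, "lower than expected",
                           "as expected"))
smr_example
```

Wimbe, with an SMR of 2.6, recorded well over twice the cases expected under equal risk, while Newa Mpasazi, with an SMR of 0.2, recorded about a fifth.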
# Calculate the ratio of observed to expected (SMR) ----------------------------
SMR_2017 <- expected_malaria_2017 |>
dplyr::mutate(SMR = round(observed_2017/expected_2017, 1)) |>
dplyr::select(rowID,Names, pop_2017, observed_2017, expected_2017, SMR)
SMR_2018 <- expected_malaria_2018 |>
dplyr::mutate(SMR = round(observed_2018/expected_2018, 1)) |>
dplyr::select(rowID, Names, pop_2018, observed_2018, expected_2018, SMR)
SMR_2019 <- expected_malaria_2019 |>
dplyr::mutate(SMR = round(observed_2019/expected_2019, 1)) |>
dplyr::select(rowID, Names, pop_2019, observed_2019, expected_2019, SMR)
SMR_2020 <- expected_malaria_2020 |>
dplyr::mutate(SMR = round(observed_2020/expected_2020, 1)) |>
dplyr::select(rowID, Names, pop_2020, observed_2020, expected_2020, SMR)
# Create SMR tables ------------------------------------------------------------
SMR_table_2017 <- SMR_2017 |>
dplyr::as_tibble() |>
dplyr::rename(Health_facility = Names) |>
dplyr::select(-rowID, -geometry) |>
kable() |>
kableExtra::kable_styling(full_width = FALSE)
SMR_table_2018 <- SMR_2018 |>
dplyr::as_tibble() |>
dplyr::rename(Health_facility = Names) |>
dplyr::select(-rowID, -geometry) |>
kable() |>
kableExtra::kable_styling(full_width = FALSE)
SMR_table_2019 <- SMR_2019 |>
dplyr::as_tibble() |>
dplyr::rename(Health_facility = Names) |>
dplyr::select(-rowID, -geometry) |>
kable() |>
kableExtra::kable_styling(full_width = FALSE)
SMR_table_2020 <- SMR_2020 |>
dplyr::as_tibble() |>
dplyr::rename(Health_facility = Names) |>
dplyr::select(-rowID, -geometry) |>
kable() |>
kableExtra::kable_styling(full_width = FALSE)
SMR_table_2017
| Health_facility | pop_2017 | observed_2017 | expected_2017 | SMR |
|---|---|---|---|---|
| Lodjwa Health Centre | 9923 | 564 | 826 | 0.7 |
| Nkhamenya Rural Hospital | 40154 | 2720 | 3344 | 0.8 |
| Newa Mpasazi Health Centre | 13879 | 216 | 1156 | 0.2 |
| Mpepa /Chisinga Health Centre | 27459 | 1523 | 2287 | 0.7 |
| Mnyanja Health Centre | 39950 | 1480 | 3327 | 0.4 |
| Simlemba Health Centre | 26999 | 1159 | 2249 | 0.5 |
| Ofesi Health Centre | 28098 | 1930 | 2340 | 0.8 |
| Chulu Health Centre | 27906 | 3482 | 2324 | 1.5 |
| Kapelula Health Centre | 35727 | 2970 | 2976 | 1.0 |
| Livwezi Health Centre | 22009 | 594 | 1833 | 0.3 |
| Gogode Dispensary | 13061 | 1553 | 1088 | 1.4 |
| Dwangwa Dispensary | 32704 | 1153 | 2724 | 0.4 |
| Chamama Health Facility | 20026 | 1005 | 1668 | 0.6 |
| Wimbe Health Centre | 11864 | 2558 | 988 | 2.6 |
| Chinyama | 12768 | 1140 | 1063 | 1.1 |
| Mdunga Health Centre | 18177 | 1382 | 1514 | 0.9 |
| Mtunthama Health Centre | 18744 | 1982 | 1561 | 1.3 |
| Kasungu District Hospital | 143490 | 14663 | 11951 | 1.2 |
| Chamwabvi Health Centre | 35353 | 2031 | 2945 | 0.7 |
| Linyangwa Health Centre | 17772 | 1987 | 1480 | 1.3 |
| Mziza Health Centre | 44295 | 4098 | 3689 | 1.1 |
| Kawamba Health Centre | 26385 | 3845 | 2198 | 1.7 |
| Kamboni Health Centre | 21226 | 2588 | 1768 | 1.5 |
| Khola Health Centre | 22664 | 1012 | 1888 | 0.5 |
| Santhe Health Centre | 43948 | 5668 | 3660 | 1.5 |
| Mkhota Health Centre | 23295 | 1487 | 1940 | 0.8 |
SMR_table_2018
| Health_facility | pop_2018 | observed_2018 | expected_2018 | SMR |
|---|---|---|---|---|
| Lodjwa Health Centre | 10281 | 1151 | 934 | 1.2 |
| Nkhamenya Rural Hospital | 41642 | 3343 | 3785 | 0.9 |
| Newa Mpasazi Health Centre | 14248 | 434 | 1295 | 0.3 |
| Mpepa /Chisinga Health Centre | 28488 | 2616 | 2589 | 1.0 |
| Mnyanja Health Centre | 41856 | 1715 | 3804 | 0.5 |
| Simlemba Health Centre | 27455 | 1506 | 2496 | 0.6 |
| Ofesi Health Centre | 29002 | 1773 | 2636 | 0.7 |
| Chulu Health Centre | 28832 | 3330 | 2621 | 1.3 |
| Kapelula Health Centre | 37630 | 3480 | 3420 | 1.0 |
| Livwezi Health Centre | 22544 | 1128 | 2049 | 0.6 |
| Gogode Dispensary | 13368 | 2550 | 1215 | 2.1 |
| Dwangwa Dispensary | 33534 | 1216 | 3048 | 0.4 |
| Chamama Health Facility | 20372 | 1226 | 1852 | 0.7 |
| Wimbe Health Centre | 11814 | 3167 | 1074 | 2.9 |
| Chinyama | 13138 | 1673 | 1194 | 1.4 |
| Mdunga Health Centre | 18928 | 1894 | 1720 | 1.1 |
| Mtunthama Health Centre | 19074 | 3358 | 1734 | 1.9 |
| Kasungu District Hospital | 147175 | 12019 | 13377 | 0.9 |
| Chamwabvi Health Centre | 36167 | 2079 | 3287 | 0.6 |
| Linyangwa Health Centre | 18032 | 1500 | 1639 | 0.9 |
| Mziza Health Centre | 46313 | 2291 | 4210 | 0.5 |
| Kawamba Health Centre | 26253 | 3881 | 2386 | 1.6 |
| Kamboni Health Centre | 21430 | 3250 | 1948 | 1.7 |
| Khola Health Centre | 23195 | 1697 | 2108 | 0.8 |
| Santhe Health Centre | 45063 | 6195 | 4096 | 1.5 |
| Mkhota Health Centre | 23884 | 4218 | 2171 | 1.9 |
SMR_table_2019
| Health_facility | pop_2019 | observed_2019 | expected_2019 | SMR |
|---|---|---|---|---|
| Lodjwa Health Centre | 10608 | 1168 | 909 | 1.3 |
| Nkhamenya Rural Hospital | 43293 | 3932 | 3709 | 1.1 |
| Newa Mpasazi Health Centre | 14780 | 626 | 1266 | 0.5 |
| Mpepa /Chisinga Health Centre | 29456 | 4169 | 2523 | 1.7 |
| Mnyanja Health Centre | 43783 | 2504 | 3751 | 0.7 |
| Simlemba Health Centre | 28076 | 1788 | 2405 | 0.7 |
| Ofesi Health Centre | 30065 | 2124 | 2576 | 0.8 |
| Chulu Health Centre | 29731 | 3537 | 2547 | 1.4 |
| Kapelula Health Centre | 39747 | 3357 | 3405 | 1.0 |
| Livwezi Health Centre | 22945 | 435 | 1966 | 0.2 |
| Gogode Dispensary | 13641 | 1469 | 1169 | 1.3 |
| Dwangwa Dispensary | 34415 | 1370 | 2948 | 0.5 |
| Chamama Health Facility | 20701 | 1127 | 1773 | 0.6 |
| Wimbe Health Centre | 11855 | 2162 | 1016 | 2.1 |
| Chinyama | 13475 | 1260 | 1154 | 1.1 |
| Mdunga Health Centre | 19960 | 1485 | 1710 | 0.9 |
| Mtunthama Health Centre | 19385 | 1718 | 1661 | 1.0 |
| Kasungu District Hospital | 151079 | 13052 | 12942 | 1.0 |
| Chamwabvi Health Centre | 36899 | 1180 | 3161 | 0.4 |
| Linyangwa Health Centre | 18279 | 2692 | 1566 | 1.7 |
| Mziza Health Centre | 48452 | 3135 | 4151 | 0.8 |
| Kawamba Health Centre | 26356 | 3469 | 2258 | 1.5 |
| Kamboni Health Centre | 21509 | 2537 | 1843 | 1.4 |
| Khola Health Centre | 23816 | 2139 | 2040 | 1.0 |
| Santhe Health Centre | 46196 | 5793 | 3957 | 1.5 |
| Mkhota Health Centre | 24429 | 2268 | 2093 | 1.1 |
SMR_table_2020
| Health_facility | pop_2020 | observed_2020 | expected_2020 | SMR |
|---|---|---|---|---|
| Lodjwa Health Centre | 13081 | 1788 | 1538 | 1.2 |
| Nkhamenya Rural Hospital | 53692 | 8539 | 6313 | 1.4 |
| Newa Mpasazi Health Centre | 18311 | 2182 | 2153 | 1.0 |
| Mpepa /Chisinga Health Centre | 36317 | 5186 | 4270 | 1.2 |
| Mnyanja Health Centre | 54649 | 6117 | 6426 | 1.0 |
| Simlemba Health Centre | 34240 | 5310 | 4026 | 1.3 |
| Ofesi Health Centre | 37240 | 2323 | 4379 | 0.5 |
| Chulu Health Centre | 36638 | 7160 | 4308 | 1.7 |
| Kapelula Health Centre | 50214 | 7297 | 5904 | 1.2 |
| Livwezi Health Centre | 27786 | 1028 | 3267 | 0.3 |
| Gogode Dispensary | 16681 | 2767 | 1961 | 1.4 |
| Dwangwa Dispensary | 42282 | 2869 | 4971 | 0.6 |
| Chamama Health Facility | 25248 | 635 | 2969 | 0.2 |
| Wimbe Health Centre | 14367 | 2233 | 1689 | 1.3 |
| Chinyama | 16463 | 1605 | 1936 | 0.8 |
| Mdunga Health Centre | 25108 | 3169 | 2952 | 1.1 |
| Mtunthama Health Centre | 23501 | 1882 | 2763 | 0.7 |
| Kasungu District Hospital | 185282 | 19393 | 21785 | 0.9 |
| Chamwabvi Health Centre | 45106 | 1128 | 5304 | 0.2 |
| Linyangwa Health Centre | 22144 | 4380 | 2604 | 1.7 |
| Mziza Health Centre | 60648 | 5791 | 7131 | 0.8 |
| Kawamba Health Centre | 32076 | 7073 | 3771 | 1.9 |
| Kamboni Health Centre | 25750 | 4665 | 3028 | 1.5 |
| Khola Health Centre | 29367 | 3426 | 3453 | 1.0 |
| Santhe Health Centre | 56704 | 6556 | 6667 | 1.0 |
| Mkhota Health Centre | 29986 | 4592 | 3526 | 1.3 |
# Helper function to create maps of observed and expected dry season malaria cases
create.malaria.map <- function(malaria.data,
variable = NA,
title = NA,
legend.title = NA){
# observed and expected malaria incidence map
# malaria.data: data frame containing observed and expected malaria cases
# variable: variable name (as character, in quotes e.g. "observed")
# title: map title in quotes
# legend.title: legend title in quotes
# returns:
# a tmap-element (plots a map)
tm_shape(malaria.data)+
tm_fill(col = variable,
breaks = c(0, 500, 1000, 2500, 5000, 10000, 15000, 20000, 25000),
palette = "YlOrRd",
title = legend.title)+
tm_borders(lwd = 0.3)+
tm_layout(legend.position = c(0.75,"bottom"),
legend.text.size = 0.5,
legend.title.size = 0.7,
frame = FALSE)+
tm_credits(title,
position = c(0.2, 0.8),
size = 1)
}
# Invoking the function
# 2017 observed and expected malaria cases -------------------------------------
observed_malaria_2017_map <- create.malaria.map(malaria_pop_by_catchment_2017,
variable = "dr_2017",
title = "2017",
legend.title = "Observed malaria")
expected_malaria_2017_map <- create.malaria.map(expected_malaria_2017,
variable = "expected_2017",
title = "2017",
legend.title = "Expected malaria")
# 2018 observed and expected malaria cases -------------------------------------
observed_malaria_2018_map <- create.malaria.map(malaria_pop_by_catchment_2018,
variable = "dr_2018",
title = "2018",
legend.title = "Observed malaria")
expected_malaria_2018_map <- create.malaria.map(expected_malaria_2018,
variable = "expected_2018",
title = "2018",
legend.title = "Expected malaria")
# 2019 observed and expected malaria cases -------------------------------------
observed_malaria_2019_map <- create.malaria.map(malaria_pop_by_catchment_2019,
variable = "dr_2019",
title = "2019",
legend.title = "Observed malaria")
expected_malaria_2019_map <- create.malaria.map(expected_malaria_2019,
variable = "expected_2019",
title = "2019",
legend.title = "Expected malaria")
# 2020 observed and expected malaria cases -------------------------------------
observed_malaria_2020_map <- create.malaria.map(malaria_pop_by_catchment_2020,
variable = "dr_2020",
title = "2020",
legend.title = "Observed malaria")
expected_malaria_2020_map <- create.malaria.map(expected_malaria_2020,
variable = "expected_2020",
title = "2020",
legend.title = "Expected malaria")
# Layout maps ------------------------------------------------------------------
tmap::tmap_arrange(observed_malaria_2017_map, expected_malaria_2017_map,
observed_malaria_2018_map, expected_malaria_2018_map,
observed_malaria_2019_map, expected_malaria_2019_map,
observed_malaria_2020_map, expected_malaria_2020_map, ncol = 2)
Fig. 7: Observed and expected malaria incidence by health facility catchment area, Kasungu
A ratio greater than 1.0 indicates that more malaria cases occurred than expected, while a ratio less than 1.0 indicates that fewer cases occurred than expected. Catchments with SMRs above 1 therefore have a higher dry season malaria risk than the district as a whole.
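As a quick check of this interpretation, the SMR can be recomputed by hand from observed and expected counts. The sketch below uses three rows copied from the 2019 table above; the data frame and its column names are illustrative only and are not objects from the analysis.

```r
# Three rows copied from the 2019 SMR table (illustrative data frame)
smr_check <- data.frame(
  facility = c("Wimbe Health Centre", "Livwezi Health Centre",
               "Kasungu District Hospital"),
  observed = c(2162, 435, 13052),
  expected = c(1016, 1966, 12942)
)

# SMR = observed / expected, rounded to one decimal place as in the tables
smr_check$SMR <- round(smr_check$observed / smr_check$expected, 1)

# Flag catchments with more cases than expected
smr_check$excess_risk <- smr_check$SMR > 1
print(smr_check)
```

Wimbe has roughly twice as many cases as expected, Livwezi about a fifth, and the district hospital is close to the district average.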
# max(SMR_2017$SMR)
# [1] 2.6
# max(SMR_2018$SMR)
# [1] 2.9
# max(SMR_2019$SMR)
# [1] 2.1
# max(SMR_2020$SMR)
# [1] 1.9
# Define function to create maps of SMR by catchment ---------------------------
create.smr.map <- function(smr.data,
variable = "SMR_category",
title = NA,
legend.title = "SMR"){
# SMR by catchment map
# smr.data: sf polygon object containing SMR by catchment data
# variable: variable name (as character, in quotes)
# title: map title in quotes
# legend.title: legend title in quotes
# returns:
# a tmap-element (plots a map)
# create category column
smr.data$SMR_category <- NA
# assigning labels for the SMR estimate legends
smr.category.list <- c("<0.50", "0.50 to 0.75", "0.76 to 0.99", "1.00 to 1.09",
"1.10 to 1.24", "1.25 to 1.49", ">=1.50")
# assigning categories; the intervals are contiguous so that no SMR value
# falls between two categories and is left as NA
smr.data$SMR_category[smr.data$SMR >= 0.00 & smr.data$SMR < 0.50] <- -3
smr.data$SMR_category[smr.data$SMR >= 0.50 & smr.data$SMR < 0.76] <- -2
smr.data$SMR_category[smr.data$SMR >= 0.76 & smr.data$SMR < 1.00] <- -1
smr.data$SMR_category[smr.data$SMR >= 1.00 & smr.data$SMR < 1.10] <- 0
smr.data$SMR_category[smr.data$SMR >= 1.10 & smr.data$SMR < 1.25] <- 1
smr.data$SMR_category[smr.data$SMR >= 1.25 & smr.data$SMR < 1.50] <- 2
smr.data$SMR_category[smr.data$SMR >= 1.50] <- 3
# generating divergent colour schemes [Blues - Light Blues – White – Light Reds – Reds]
# smr.palette <- c("#33a6fe", "#cbe6fe", "#dfeffe", "#feb1b1", "#fe8e8e","#fe0000")
tm_shape(smr.data)+
tm_fill(col = variable,
style = "cat",
palette = "-RdBu",
title = legend.title,
labels = smr.category.list)+
tm_borders(lwd = 0.6)+
tm_layout(legend.position = c(0.75,"bottom"),
legend.text.size = 0.5,
legend.title.size = 0.7,
frame = FALSE)+
tm_credits(title,
position = c(0.2, 0.8),
size = 1)
}
# Invoking function ------------------------------------------------------------
SMR_2017_map <- create.smr.map(SMR_2017, title = "2017")
SMR_2018_map <- create.smr.map(SMR_2018, title = "2018")
SMR_2019_map <- create.smr.map(SMR_2019, title = "2019")
SMR_2020_map <- create.smr.map(SMR_2020, title = "2020")
# Layout maps ------------------------------------------------------------------
tmap::tmap_arrange(SMR_2017_map, SMR_2018_map, SMR_2019_map, SMR_2020_map, ncol = 2)
Fig. 8: Standardised morbidity ratio of malaria by health facility catchment
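The category assignment in create.smr.map can also be written with cut(), which returns a factor that keeps every category level even when a level has no catchments in a given year. This is a sketch of an alternative, not the code used to produce Fig. 8; the SMR values, break points and labels below are assumptions chosen for illustration.

```r
# Hypothetical SMR values, one per category
smr <- c(0.2, 0.5, 0.9, 1.0, 1.2, 1.4, 2.9)

# Left-closed intervals: [0, 0.5), [0.5, 0.76), ..., [1.5, Inf)
smr_category <- cut(
  smr,
  breaks = c(0, 0.50, 0.76, 1.00, 1.10, 1.25, 1.50, Inf),
  labels = c("<0.50", "0.50 to 0.75", "0.76 to 0.99", "1.00 to 1.09",
             "1.10 to 1.24", "1.25 to 1.49", ">=1.50"),
  right = FALSE
)

# One value falls in each of the seven categories
table(smr_category)
```

Because the result is a factor with fixed levels, the legend labels stay aligned with the categories across years.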
First, using st_buffer, we compute 1 km, 2 km and 3 km buffers around dry season water bodies derived from Landsat satellite imagery with the TropWet tool in Google Earth Engine. The buffer geometries are then combined (dissolved) so that internal boundaries between overlapping buffers are removed, which allows population values within the buffers to be extracted from the WorldPop raster without double counting. Finally, we calculate the proportion of people in each catchment area living within each buffer distance of water bodies.
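The buffer-and-dissolve step can be illustrated on toy geometries. The sketch below is not the analysis code: the two points stand in for water bodies, and the projected CRS (EPSG:32736, UTM zone 36S, with units in metres) is an assumption made for illustration.

```r
library(sf)

# Two hypothetical water bodies 1.5 km apart, in a projected CRS with metre
# units (EPSG:32736 is assumed here for illustration)
water <- st_sfc(st_point(c(0, 0)), st_point(c(1500, 0)), crs = 32736)

# 1 km buffer around each water body; the two circles overlap
buffers <- st_buffer(water, dist = 1000)

# Dissolve the overlapping buffers so that shared area is counted once
dissolved <- st_union(buffers) |>
  st_cast("POLYGON")

length(buffers)    # two separate buffers
length(dissolved)  # one merged polygon

# The dissolved area is smaller than the sum of the individual buffer areas
sum(st_area(buffers)) > st_area(dissolved)
```

Without the dissolve, people living where the two buffers overlap would be counted twice when population is summed per buffer.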
# Combine and transform TropWet derived waterbody polygons -------------------------------
surface_waterbodies_2017 <- sf::st_buffer(dryseason_waterbodies_2017, dist = 30) |>
sf::st_union() |>
sf::st_cast("POLYGON") |>
sf::st_as_sf()
surface_waterbodies_2018 <- sf::st_buffer(dryseason_waterbodies_2018, dist = 30) |>
sf::st_union() |>
sf::st_cast("POLYGON") |>
sf::st_as_sf()
surface_waterbodies_2019 <- sf::st_buffer(dryseason_waterbodies_2019, dist = 30) |>
sf::st_union() |>
sf::st_cast("POLYGON") |>
sf::st_as_sf()
surface_waterbodies_2020 <- sf::st_buffer(dryseason_waterbodies_2020, dist = 30) |>
sf::st_union() |>
sf::st_cast("POLYGON") |>
sf::st_as_sf()
# Helper function to compute 1km, 2km and 3km buffers around the water bodies ---------------------
create.waterbody.buffer <- function(waterbody, distance, catchment){
# function for creating buffers around waterbodies
# arguments:
# waterbody: waterbody shapefile
# distance: buffer distance in meters
# catchment: catchment area shapefile
# returns:
# buffered waterbodies
# Create buffers around water bodies
buffer_radius <- sf::st_buffer(waterbody, distance)
# Dissolve the buffers
# buffer_union <- sf::st_as_sf(st_cast(st_union(buffer_radius),"MULTIPOLYGON"))
buffer_union <- sf::st_union(buffer_radius) |>
sf::st_cast("MULTIPOLYGON") |>
sf::st_as_sf()
# Treat the catchment attributes as spatially constant before intersecting,
# so that they can be carried over to the clipped buffers
sf::st_agr(catchment) <- "constant"
# Clip the dissolved buffer by catchment; each piece inherits the attributes
# of the catchment it falls in
buffer_intersect <- sf::st_intersection(buffer_union, catchment)
buffer_intersect_sf <- sf::st_as_sf(buffer_intersect)
# Repair geometries with a zero-width buffer, then convert each MULTIPOLYGON
# object into several POLYGON objects
buffer_intersect_polygons <- sf::st_buffer(buffer_intersect_sf, 0.0) |>
sf::st_cast("MULTIPOLYGON") |>
sf::st_cast("POLYGON")
# Diagnostic only: sf::st_intersects(buffer_intersect_polygons, catchment)
# lists the catchments each polygon touches (some touch more than one)
# Make the assumption that the attribute is constant throughout the geometry
sf::st_agr(buffer_intersect_polygons) <- "constant"
return(buffer_intersect_polygons)
}
# Invoking function
# For 2017 TropWet surface water polygons --------------------------------------------------------
buffer_1km_2017 <- create.waterbody.buffer(waterbody = surface_waterbodies_2017,
distance = 1000,
catchment = malire_new)
buffer_2km_2017 <- create.waterbody.buffer(waterbody = surface_waterbodies_2017,
distance = 2000,
catchment = malire_new)
buffer_3km_2017 <- create.waterbody.buffer(waterbody = surface_waterbodies_2017,
distance = 3000,
catchment = malire_new)
# For 2018 TropWet surface water polygons --------------------------------------------------------
buffer_1km_2018 <- create.waterbody.buffer(waterbody = surface_waterbodies_2018,
distance = 1000,
catchment = malire_new)
buffer_2km_2018 <- create.waterbody.buffer(waterbody = surface_waterbodies_2018,
distance = 2000,
catchment = malire_new)
buffer_3km_2018 <- create.waterbody.buffer(waterbody = surface_waterbodies_2018,
distance = 3000,
catchment = malire_new)
# For 2019 TropWet surface water polygons ------------------------------------------------------
buffer_1km_2019 <- create.waterbody.buffer(waterbody = surface_waterbodies_2019,
distance = 1000,
catchment = malire_new)
buffer_2km_2019 <- create.waterbody.buffer(waterbody = surface_waterbodies_2019,
distance = 2000,
catchment = malire_new)
buffer_3km_2019 <- create.waterbody.buffer(waterbody = surface_waterbodies_2019,
distance = 3000,
catchment = malire_new)
# For 2020 TropWet surface water polygons ------------------------------------------------------
buffer_1km_2020 <- create.waterbody.buffer(waterbody = surface_waterbodies_2020,
distance = 1000,
catchment = malire_new)
buffer_2km_2020 <- create.waterbody.buffer(waterbody = surface_waterbodies_2020,
distance = 2000,
catchment = malire_new)
buffer_3km_2020 <- create.waterbody.buffer(waterbody = surface_waterbodies_2020,
distance = 3000,
catchment = malire_new)
Note that blank areas in the map represent catchment areas in which no water body was detected. This may reflect a limitation of using moderate-resolution satellite imagery (>30 m spatial resolution) to identify surface water.
# Map the buffers
create.buffer.map <- function(buffers, boundary = malire_new, title = NA){
# function for creating buffer map in ggplot
# arguments:
# buffers: waterbodies buffer polygon layer
# boundary: health facility catchment polygons
# title: main title
# returns:
# a map-element (plots a map)
ggplot(data = buffers)+
geom_sf()+
geom_sf(data = boundary,
fill = NA)+
theme_void()+
labs(title = title)
}
# Invoking the function
# For 2017 -------------------------------------------------------------------------------
buffer_1km_2017_map <- create.buffer.map(buffer_1km_2017, title = "2017: 1km Buffers")
buffer_2km_2017_map <- create.buffer.map(buffer_2km_2017, title = "2017: 2km Buffers")
buffer_3km_2017_map <- create.buffer.map(buffer_3km_2017, title = "2017: 3km Buffers")
# For 2018 --------------------------------------------------------------------------------
buffer_1km_2018_map <- create.buffer.map(buffer_1km_2018, title = "2018: 1km Buffers")
buffer_2km_2018_map <- create.buffer.map(buffer_2km_2018, title = "2018: 2km Buffers")
buffer_3km_2018_map <- create.buffer.map(buffer_3km_2018, title = "2018: 3km Buffers")
# For 2019 ---------------------------------------------------------------------------------
buffer_1km_2019_map <- create.buffer.map(buffer_1km_2019, title = "2019: 1km Buffers")
buffer_2km_2019_map <- create.buffer.map(buffer_2km_2019, title = "2019: 2km Buffers")
buffer_3km_2019_map <- create.buffer.map(buffer_3km_2019, title = "2019: 3km Buffers")
# For 2020 --------------------------------------------------------------------------------
buffer_1km_2020_map <- create.buffer.map(buffer_1km_2020, title = "2020: 1km Buffers")
buffer_2km_2020_map <- create.buffer.map(buffer_2km_2020, title = "2020: 2km Buffers")
buffer_3km_2020_map <- create.buffer.map(buffer_3km_2020, title = "2020: 3km Buffers")
gridExtra::grid.arrange(buffer_1km_2017_map, buffer_1km_2018_map, buffer_1km_2019_map, buffer_1km_2020_map,
buffer_2km_2017_map, buffer_2km_2018_map, buffer_2km_2019_map, buffer_2km_2020_map,
buffer_3km_2017_map, buffer_3km_2018_map, buffer_3km_2019_map, buffer_3km_2020_map, ncol = 4)
Fig 9. Buffers around dry season waterbodies in Kasungu
# Helper function to calculate estimated number of people living within waterbody buffers
# in each catchment area
estimate.buffer.pop <- function(catchment.population, buffers, catchment.area){
# Extract population estimates from WorldPop raster
buffers$buffer_pop <- raster::extract(catchment.population,
buffers,
fun = sum,
na.rm = TRUE)
# Find which catchment each polygon belongs to using its centroid - a point dataset
# representing the geographic center-points of the polygons
# buffer_by_catchment <- st_intersection(st_centroid(buffers), catchment.area)
buffer_by_catchment <- sf::st_centroid(buffers) |>
sf::st_intersection(catchment.area)
# buffer_by_catchment holds one point per buffer POLYGON, so a catchment can
# appear in several rows. To get one row per catchment we group the rows by
# catchment ID (DN) and sum their populations with dplyr::group_by() and
# dplyr::summarize(); for sf objects summarize() also combines the grouped
# geometries, much like the "dissolve" operation in QGIS or ArcMap.
# The aggregate() method for sf objects does essentially the same.
buffer_pop_aggregated <- buffer_by_catchment |>
dplyr::group_by(DN) |>
dplyr::summarize(buffer_pop_aggregated = round(sum(buffer_pop, na.rm = TRUE)))
buffer_pop <- merge(catchment.area, sf::st_drop_geometry(buffer_pop_aggregated),
by = 'DN', all.x = TRUE)
return(buffer_pop)
}
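The aggregation inside estimate.buffer.pop and the percentage calculated afterwards can be illustrated without any spatial data. The sketch below uses made-up numbers and base R in place of dplyr; DN is the catchment identifier used above, and all other names and values are hypothetical.

```r
# Hypothetical buffer pieces: catchment 1 has two pieces, catchment 2 has one,
# and catchment 3 has no detected water body
buffer_pieces <- data.frame(DN = c(1, 1, 2),
                            buffer_pop = c(1200, 300, 4000))
catchments <- data.frame(DN = c(1, 2, 3),
                         pop = c(10000, 8000, 5000))

# Sum the buffer population per catchment
buffer_totals <- aggregate(buffer_pop ~ DN, data = buffer_pieces, FUN = sum)

# Left join keeps catchment 3, which gets NA for buffer_pop
buffer_summary <- merge(catchments, buffer_totals, by = "DN", all.x = TRUE)

# Catchments without water bodies have zero people living in buffers
buffer_summary$buffer_pop[is.na(buffer_summary$buffer_pop)] <- 0

# Percentage of the catchment population living within the buffers
buffer_summary$prop_buffer_catchment_pop <-
  round(buffer_summary$buffer_pop / buffer_summary$pop * 100)
print(buffer_summary)
```

The left join (all.x = TRUE) is what keeps catchments with no water body in the result, which is why the NA values need replacing with zero.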
# Invoking the function and calculating proportion of
# catchment population living within buffers
# 2017 buffer population -------------------------------------------------------
buffer_pop_1km_2017 <- estimate.buffer.pop(
kasungu_population_2017,
buffer_1km_2017,
malaria_pop_by_catchment_2017) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
buffer_pop_2km_2017 <- estimate.buffer.pop(
kasungu_population_2017,
buffer_2km_2017,
malaria_pop_by_catchment_2017) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100)) |>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
buffer_pop_3km_2017 <- estimate.buffer.pop(
kasungu_population_2017,
buffer_3km_2017,
malaria_pop_by_catchment_2017) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
# 2018 buffer population -------------------------------------------------------
buffer_pop_1km_2018 <- estimate.buffer.pop(
kasungu_population_2018,
buffer_1km_2018,
malaria_pop_by_catchment_2018) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
buffer_pop_2km_2018 <- estimate.buffer.pop(
kasungu_population_2018,
buffer_2km_2018,
malaria_pop_by_catchment_2018) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
buffer_pop_3km_2018 <- estimate.buffer.pop(
kasungu_population_2018,
buffer_3km_2018,
malaria_pop_by_catchment_2018) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
# 2019 buffer population -------------------------------------------------------
buffer_pop_1km_2019 <- estimate.buffer.pop(
kasungu_population_2019,
buffer_1km_2019,
malaria_pop_by_catchment_2019) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
buffer_pop_2km_2019 <- estimate.buffer.pop(
kasungu_population_2019,
buffer_2km_2019,
malaria_pop_by_catchment_2019) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0))) # replace NA with zero
buffer_pop_3km_2019 <- estimate.buffer.pop(
kasungu_population_2019,
buffer_3km_2019,
malaria_pop_by_catchment_2019) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
# 2020 buffer population -------------------------------------------------------
buffer_pop_1km_2020 <- estimate.buffer.pop(
kasungu_population_2020,
buffer_1km_2020,
malaria_pop_by_catchment_2020) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
buffer_pop_2km_2020 <- estimate.buffer.pop(
kasungu_population_2020,
buffer_2km_2020,
malaria_pop_by_catchment_2020) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100))|>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
buffer_pop_3km_2020 <- estimate.buffer.pop(
kasungu_population_2020,
buffer_3km_2020,
malaria_pop_by_catchment_2020) |>
dplyr::rename(catchment_pop = pop,
buffer_pop = buffer_pop_aggregated) |>
dplyr::mutate(
prop_buffer_catchment_pop = round((buffer_pop/catchment_pop)*100)) |>
dplyr::mutate(across(everything(), .fns = ~replace_na(.,0)))
# Helper function to create maps of proportion of people living in proximity ----------
# to water bodies in each catchment area
create.pop.proportion.map <- function(pop.data,
variable = "prop_buffer_catchment_pop",
title = NA,
legend.title = NA){
# pop.data: sf polygon object containing the percentage of catchment population
# living within the water body buffers
# variable: variable name (as character, in quotes)
# title: map title in quotes
# legend.title: legend title in quotes
# returns:
# a tmap-element (plots a map)
tm_shape(pop.data)+
tm_fill(col = variable,
breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
palette = "YlOrBr",
title = legend.title)+
tm_borders(lwd = 0.3)+
tm_layout(legend.position = c(0.8,"bottom"),
legend.text.size = 0.5,
legend.title.size = 0.7,
frame = FALSE)+
tm_credits(title,
position = c(0.25, 0.75),
size = 1)
}
# Invoking function
# 2017 population proportion ---------------------------------------------------
pop_proportion_1km_2017_map <- create.pop.proportion.map(
buffer_pop_1km_2017,
title = "2017",
legend.title = "Population within \n1km buffers (%)")
pop_proportion_2km_2017_map <- create.pop.proportion.map(
buffer_pop_2km_2017,
title = "2017",
legend.title = "Population within \n2km buffers (%)")
pop_proportion_3km_2017_map <- create.pop.proportion.map(
buffer_pop_3km_2017,
title = "2017",
legend.title = "Population within \n3km buffers (%)")
# 2018 population proportion ---------------------------------------------------
pop_proportion_1km_2018_map <- create.pop.proportion.map(
buffer_pop_1km_2018,
title = "2018",
legend.title = "Population within \n1km buffers (%)")
pop_proportion_2km_2018_map <- create.pop.proportion.map(
buffer_pop_2km_2018,
title = "2018",
legend.title = "Population within \n2km buffers (%)")
pop_proportion_3km_2018_map <- create.pop.proportion.map(
buffer_pop_3km_2018,
title = "2018",
legend.title = "Population within \n3km buffers (%)")
# 2019 population proportion ---------------------------------------------------
pop_proportion_1km_2019_map <- create.pop.proportion.map(
buffer_pop_1km_2019,
title = "2019",
legend.title = "Population within \n1km buffers (%)")
pop_proportion_2km_2019_map <- create.pop.proportion.map(
buffer_pop_2km_2019,
title = "2019",
legend.title = "Population within \n2km buffers (%)")
pop_proportion_3km_2019_map <- create.pop.proportion.map(
buffer_pop_3km_2019,
title = "2019",
legend.title = "Population within \n3km buffers (%)")
# 2020 population proportion ---------------------------------------------------
pop_proportion_1km_2020_map <- create.pop.proportion.map(
buffer_pop_1km_2020,
title = "2020",
legend.title = "Population within \n1km buffers (%)")
pop_proportion_2km_2020_map <- create.pop.proportion.map(
buffer_pop_2km_2020,
title = "2020",
legend.title = "Population within \n2km buffers (%)")
pop_proportion_3km_2020_map <- create.pop.proportion.map(
buffer_pop_3km_2020,
title = "2020",
legend.title = "Population within \n3km buffers (%)")
# Layout maps ------------------------------------------------------------------
tmap::tmap_arrange(pop_proportion_1km_2017_map, pop_proportion_2km_2017_map,
pop_proportion_3km_2017_map, pop_proportion_1km_2018_map,
pop_proportion_2km_2018_map, pop_proportion_3km_2018_map,
pop_proportion_1km_2019_map, pop_proportion_2km_2019_map,
pop_proportion_3km_2019_map, pop_proportion_1km_2020_map,
pop_proportion_2km_2020_map, pop_proportion_3km_2020_map, ncol = 3)
Fig. 10. Proportion of catchment population living around water bodies
A positive correlation coefficient (here, r > 0.1) indicates some positive association between the SMR and the buffer population variables. That is, the SMR of dry season malaria tends to increase as the percentage of the catchment population living near water bodies increases.
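For readers who want to see how such a coefficient is obtained, the sketch below computes a Pearson correlation and its p-value with cor.test from base R. The six data points are hypothetical and are not taken from the analysis; Pearson's method is assumed here because it is the default in ggpubr::stat_cor.

```r
# Hypothetical data for six catchments (not taken from the analysis)
smr      <- c(0.4, 0.7, 0.9, 1.1, 1.5, 1.9)
prop_pop <- c(5, 20, 15, 40, 35, 60)   # % of population within a buffer

# Pearson correlation and its test of the null hypothesis r = 0
result <- cor.test(prop_pop, smr, method = "pearson")

result$estimate   # correlation coefficient r
result$p.value    # p-value for the null hypothesis of no correlation
```

With only a few dozen catchments per year, the p-value is worth reading alongside r, since a small positive r can easily arise by chance.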
# Helper function to tidy and bind the SMR and proportion of -------------------
# buffer-catchment population data frames
tidy.data <- function(smr.df,
proportion.pop.1km,
proportion.pop.2km,
proportion.pop.3km){
# Convert the sf objects to data frames-------------------------------------------
smr_df <- as.data.frame(smr.df) |>
dplyr::select(rowID, Names, SMR)
proportion_pop_1km_df <- as.data.frame(proportion.pop.1km) |>
dplyr::select(rowID, prop_pop_1km = `prop_buffer_catchment_pop`)
proportion_pop_2km_df <- as.data.frame(proportion.pop.2km) |>
dplyr::select(rowID, prop_pop_2km = `prop_buffer_catchment_pop`)
proportion_pop_3km_df <- as.data.frame(proportion.pop.3km) |>
dplyr::select(rowID, prop_pop_3km = `prop_buffer_catchment_pop`)
# Merge SMR and population data frames -----------------------------------------
combined_1 <- merge(smr_df, proportion_pop_1km_df, by = "rowID", all = TRUE)
combined_2 <- merge(proportion_pop_2km_df, proportion_pop_3km_df, by = "rowID")
combined_fully <- merge(combined_1, combined_2, by = "rowID", all = TRUE)
# Return the combined data frame explicitly
return(combined_fully)
}
# Invoking the function --------------------------------------------------------
smr_pop_2017 <- tidy.data(SMR_2017, buffer_pop_1km_2017, buffer_pop_2km_2017, buffer_pop_3km_2017)
smr_pop_2018 <- tidy.data(SMR_2018, buffer_pop_1km_2018, buffer_pop_2km_2018, buffer_pop_3km_2018)
smr_pop_2019 <- tidy.data(SMR_2019, buffer_pop_1km_2019, buffer_pop_2km_2019, buffer_pop_3km_2019)
smr_pop_2020 <- tidy.data(SMR_2020, buffer_pop_1km_2020, buffer_pop_2km_2020, buffer_pop_3km_2020)
# Helper function to create scatter plots --------------------------------------
create.scatter.plot <- function(smr.pop.df,
independent.var = NA,
dependent.var = "SMR",
x.label = NA,
plot.title = NA){
scatter.plot <- ggpubr::ggscatter(smr.pop.df, # data frame
x = independent.var, # x-axis variable
y = dependent.var, # y-axis variable
add = "reg.line", # Add regression line
conf.int = TRUE, # Add confidence interval
add.params = list(color = "red",
fill = "lightgray"),
palette = "jco", # journal color palette. see ?ggpar
xlab = x.label, # x-axis label
ylab = "SMR", # y-axis label
title = plot.title)+
ggpubr::stat_cor(label.y = 3)+ # Add correlation coefficient
ggpubr::font("title", size = 10, face = "bold")+
ggpubr::font("xlab", size = 10)+
ggpubr::font("ylab", size = 10)
return(scatter.plot)
}
# Invoking function
# 2017 scatter plots ------------------------------------------------------------
scatter_1km_2017 <- create.scatter.plot(
smr_pop_2017, independent.var = "prop_pop_1km",
x.label = "Percentage of catchment population \nliving in 1km buffer",
plot.title = "2017")
scatter_2km_2017 <- create.scatter.plot(
smr_pop_2017, independent.var = "prop_pop_2km",
x.label = "Percentage of catchment population \nliving in 2km buffer",
plot.title = "2017")
scatter_3km_2017 <- create.scatter.plot(
smr_pop_2017, independent.var = "prop_pop_3km",
x.label = "Percentage of catchment population \nliving in 3km buffer",
plot.title = "2017")
# 2018 scatter plots -----------------------------------------------------------
scatter_1km_2018 <- create.scatter.plot(
smr_pop_2018, independent.var = "prop_pop_1km",
x.label = "Percentage of catchment population \nliving in 1km buffer",
plot.title = "2018")
scatter_2km_2018 <- create.scatter.plot(
smr_pop_2018, independent.var = "prop_pop_2km",
x.label = "Percentage of catchment population \nliving in 2km buffer",
plot.title = "2018")
scatter_3km_2018 <- create.scatter.plot(
smr_pop_2018, independent.var = "prop_pop_3km",
x.label = "Percentage of catchment population \nliving in 3km buffer",
plot.title = "2018")
# 2019 scatter plots -----------------------------------------------------------
scatter_1km_2019 <- create.scatter.plot(
smr_pop_2019, independent.var = "prop_pop_1km",
x.label = "Percentage of catchment population \nliving in 1km buffer",
plot.title = "2019")
scatter_2km_2019 <- create.scatter.plot(
smr_pop_2019, independent.var = "prop_pop_2km",
x.label = "Percentage of catchment population \nliving in 2km buffer",
plot.title = "2019")
scatter_3km_2019 <- create.scatter.plot(
smr_pop_2019, independent.var = "prop_pop_3km",
x.label = "Percentage of catchment population \nliving in 3km buffer",
plot.title = "2019")
# 2020 scatter plots -----------------------------------------------------------
scatter_1km_2020 <- create.scatter.plot(
smr_pop_2020, independent.var = "prop_pop_1km",
x.label = "Percentage of catchment population \nliving in 1km buffer",
plot.title = "2020")
scatter_2km_2020 <- create.scatter.plot(
smr_pop_2020, independent.var = "prop_pop_2km",
x.label = "Percentage of catchment population \nliving in 2km buffer",
plot.title = "2020")
scatter_3km_2020 <- create.scatter.plot(
smr_pop_2020, independent.var = "prop_pop_3km",
x.label = "Percentage of catchment population \nliving in 3km buffer",
plot.title = "2020")
# Arrange the plots ------------------------------------------------------------
ggpubr::ggarrange(scatter_1km_2017, scatter_2km_2017, scatter_3km_2017,
scatter_1km_2018, scatter_2km_2018, scatter_3km_2018,
scatter_1km_2019, scatter_2km_2019, scatter_3km_2019,
scatter_1km_2020, scatter_2km_2020, scatter_3km_2020,
ncol = 3, nrow = 4)
Fig 11. Relationship between standardised morbidity ratio and living near waterbodies
First, we check whether the count data (dry season malaria cases) are normally distributed or better described by a Poisson distribution. The Poisson distribution is a discrete distribution that gives the probability of a given number of events occurring in a specified time period when the events are generated by a Poisson process. By a Poisson process, we mean one in which events are counted as whole numbers and occur independently of one another at a constant average rate, e.g., dry season malaria cases.
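A defining property of the Poisson distribution is that its mean and variance are equal, which is what the later mean and variance comparison relies on. The sketch below verifies this numerically with dpois; the rate of 4 events per period is an arbitrary value chosen for illustration.

```r
lambda <- 4            # hypothetical average number of events per period
x <- 0:100             # support wide enough to hold almost all probability
p <- dpois(x, lambda)  # P(X = x)

# Mean and variance computed directly from the probability mass function
pois_mean <- sum(x * p)
pois_var  <- sum((x - pois_mean)^2 * p)

c(mean = pois_mean, variance = pois_var)
```

Both quantities equal the rate, so observed counts whose variance is far larger than their mean point to over-dispersion.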
# Combine data for model fitting -----------------------------------------------
model_data_2017 <- merge(expected_malaria_2017, smr_pop_2017, by = "rowID", all = TRUE) |>
dplyr::select(-Names.y) |>
dplyr::rename(Names = Names.x)
model_data_2018 <- merge(expected_malaria_2018, smr_pop_2018, by = "rowID", all = TRUE) |>
dplyr::select(-Names.y) |>
dplyr::rename(Names = Names.x)
model_data_2019 <- merge(expected_malaria_2019, smr_pop_2019, by = "rowID", all = TRUE) |>
dplyr::select(-Names.y) |>
dplyr::rename(Names = Names.x)
model_data_2020 <- merge(expected_malaria_2020, smr_pop_2020, by = "rowID", all = TRUE) |>
dplyr::select(-Names.y) |>
dplyr::rename(Names = Names.x)
# Normality test ---------------------------------------------------------------
# Check whether the dependent variable looks closer to a Poisson than a normal distribution
# Dry season malaria cases appear to be non-normally distributed, i.e. highly skewed
histogram_2017 <- ggplot2::ggplot(model_data_2017, aes(x = observed_2017)) +
geom_histogram(fill = "white", color = "black")+
geom_vline(aes(xintercept = mean(observed_2017)),
color = "blue",
linetype = "dashed")+
labs(title = "2017", x = "Observed malaria cases", y = "Count")+
theme_classic()
histogram_2018 <- ggplot2::ggplot(model_data_2018, aes(x = observed_2018)) +
geom_histogram(fill = "white", color = "black")+
geom_vline(aes(xintercept = mean(observed_2018)),
color = "blue",
linetype = "dashed")+
labs(title = "2018", x = "Observed malaria cases", y = "Count")+
theme_classic()
histogram_2019 <- ggplot2::ggplot(model_data_2019, aes(x = observed_2019)) +
geom_histogram(fill = "white", color = "black")+
geom_vline(aes(xintercept = mean(observed_2019)),
color = "blue",
linetype = "dashed")+
labs(title = "2019", x = "Observed malaria cases", y = "Count")+
theme_classic()
histogram_2020 <- ggplot2::ggplot(model_data_2020, aes(x = observed_2020)) +
geom_histogram(fill = "white", color = "black")+
geom_vline(aes(xintercept = mean(observed_2020)),
color = "blue",
linetype = "dashed")+
labs(title = "2020", x = "Observed malaria cases", y = "Count")+
theme_classic()
gridExtra::grid.arrange(histogram_2017, histogram_2018, histogram_2019, histogram_2020)
Fig 12. Poisson distribution of dry season malaria cases. Clearly, the data is not in the form of a bell curve like in a normal distribution. Many health facilities reported very few malaria cases. A few health facilities have a large number of cases making for a distribution that appears to be far from normal. Therefore, Poisson regression will be used to model our dry season malaria data.
# Let’s check out the mean and variance of the dependent variable:
mean(model_data_2017$observed_2017)
## [1] 2491.923
var(model_data_2017$observed_2017)
## [1] 7693348
mean(model_data_2018$observed_2018)
## [1] 2795.769
var(model_data_2018$observed_2018)
## [1] 5090401
mean(model_data_2019$observed_2019)
## [1] 2711.385
var(model_data_2019$observed_2019)
## [1] 5947880
mean(model_data_2020$observed_2020)
## [1] 4580.538
var(model_data_2020$observed_2020)
## [1] 14157771
# The variance is much greater than the mean, which suggests that we will
# have over-dispersion in the model.
\[\ln(E(y_i)) = \beta_0 + \beta_1 x_i + \ln(e_i)\] where, for catchment \(i\), the dependent variable \(y_i\) = observed malaria cases; \(E(y_i)\) = expected count value; the independent variable \(x_i\) = percentage of people living near dams; and the offset \(e_i\) = expected malaria cases.
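To cope with the malaria count data coming from populations of different sizes, we specify an offset argument. This adds a term with a known value for each row of the data, whose coefficient is fixed at 1 rather than estimated. The log of the expected cases is used as the offset, so the model effectively describes the ratio of observed to expected cases (the SMR).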
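To illustrate what the offset does, here is a minimal sketch using made-up counts for five hypothetical catchments (the numbers are assumptions, not study data). With no covariates, the exponentiated intercept of a Poisson model with a log(expected) offset should equal the pooled ratio of observed to expected cases.

```r
# Hypothetical toy data: observed and expected counts for five catchments
observed <- c(120, 80, 200, 45, 310)
expected <- c(100, 100, 150, 60, 250)

# Intercept-only Poisson model with log(expected) as the offset
fit_offset <- glm(observed ~ 1 + offset(log(expected)),
                  family = poisson(link = "log"))

# exp(intercept) is the overall ratio of observed to expected cases
model_smr <- unname(exp(coef(fit_offset)[1]))
pooled_smr <- sum(observed) / sum(expected)

print(c(model = model_smr, pooled = pooled_smr))
```

Adding a covariate to this model then describes how that ratio varies between catchments.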
Summary outputs:
Estimate : the intercept (\(𝜷_0\)) and the beta coefficient estimates associated to each predictor variable.
Std.Error : the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.
z value : the z-statistic (Wald statistic), which is the coefficient estimate divided by the standard error of the estimate. For a given predictor, the z-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, i.e., whether the beta coefficient of the predictor is significantly different from zero. A Poisson GLM reports a z value rather than the t value of a linear model.
Pr(>|z|) : The p-value corresponding to the z-statistic. The smaller the p-value, the stronger the evidence that the coefficient differs from zero.
Deviance Residuals : Provide a quick view of the distribution of the deviance residuals. For a well-fitting model, the median should not be far from zero, and the minimum (Min) and maximum (Max) should be roughly equal in absolute value.
Coefficients: Shows the regression beta coefficients and their statistical significance. Predictor variables, that are significantly associated to the outcome variable, are marked by stars.
Null deviance, Residual deviance and AIC tell how well the model fits our data. A Poisson GLM summary does not report the residual standard error (RSE) or \(R^2\) of a linear model; the analogous quantities are the deviances and the pseudo \(R^2\) measures derived from them. A pseudo \(R^2\) that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model. A number near 0 indicates that the regression model did not explain much of the variability in the outcome.
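The relationship between the summary columns can be checked directly. The sketch below uses simulated counts (an assumption made only for illustration) and recomputes the z value and its two-sided p-value from the estimate and standard error in the coefficient table.

```r
# Simulated example data (hypothetical, not the Kasungu data)
set.seed(1)
x <- runif(30, 0, 50)
y <- rpois(30, lambda = exp(1 + 0.02 * x))

fit <- glm(y ~ x, family = poisson(link = "log"))

# Coefficient table: Estimate, Std. Error, z value, Pr(>|z|)
coef_table <- coef(summary(fit))

# z value = estimate / standard error; p-value from the standard normal
z_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]
p_manual <- 2 * pnorm(-abs(z_manual))

print(cbind(coef_table, z_manual, p_manual))
```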
# Fit generalised linear model -------------------------------------------------
# Defining model parameters:
# response variable: observed_2017, observed_2018, observed_2019, observed_2020 are
# recorded dry season malaria cases in that year
# risk factor: prop_pop_1km, prop_pop_2km, prop_pop_3km are the percentage of people living
# within 1km, 2km and 3km buffers of water bodies, respectively.
# offset: expected_* is the number of malaria cases we would expect if the malaria rate
# was equal in all the catchment areas
# 2017 -------------------------------------------------------------------------
model_1km_2017 <- glm(observed_2017~prop_pop_1km+offset(log(expected_2017)),
data = model_data_2017, family = poisson(link = "log"))
# Display the statistical summary of the model
# summary(model_1km_2017)
# generate model report according to best practices guidelines
report::report(model_1km_2017)
## We fitted a poisson model (estimated using ML) to predict observed_2017 with prop_pop_1km and expected_2017 (formula: observed_2017 ~ prop_pop_1km + offset(log(expected_2017))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_1km = 0 and expected_2017 = 0, is at -0.29 (95% CI [-0.30, -0.28], p < .001). Within this model:
##
## - The effect of prop_pop_1km is statistically significant and positive (beta = 0.03, 95% CI [0.03, 0.03], p < .001; Std. beta = -0.37, 95% CI [-0.38, -0.36])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
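Because the model uses a log link, the estimated regression equation can be written as follows: ln(expected value of observed malaria cases) = -0.290015 + 0.032594 * percentage of catchment population living within 1km of a water body + ln(e), where e is the expected number of malaria cases (the offset). Moving the offset to the left-hand side gives the model for dry season malaria risk relative to the district: ln(SMR) = -0.29 + 0.03 * percentage. The coefficients are therefore interpreted after exponentiation. If the percentage of people living around dams was 0, the SMR would be exp(-0.29) = 0.75. That is, the estimate 0.75 can be interpreted as the dry season malaria risk, relative to the district, in a catchment with no people living near water bodies. Each additional percentage point of the population living within 1km multiplies the SMR by exp(0.032594) = 1.033, an increase of about 3.3%.
For example: - for a buffer population percentage equal to zero, we can expect exp(-0.290015) = 0.75 times the expected malaria cases. - for a buffer population percentage equal to 10, we can expect exp(-0.290015 + 0.032594 * 10) = 1.04 times the expected malaria cases.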
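As a small worked check of this interpretation, the sketch below plugs the reported coefficients of model_1km_2017 into the log-link equation. The helper name predicted_smr is hypothetical and introduced only for illustration.

```r
# Coefficients reported for model_1km_2017
b0 <- -0.290015  # intercept
b1 <- 0.032594   # slope for prop_pop_1km (percentage points)

# Predicted SMR (observed / expected) for a given percentage of the
# catchment population living within 1km of a water body
predicted_smr <- function(prop_pop) exp(b0 + b1 * prop_pop)

smr_at_0 <- predicted_smr(0)    # catchment with nobody near water
smr_at_10 <- predicted_smr(10)  # catchment with 10% near water
irr_per_point <- exp(b1)        # multiplicative change per percentage point

print(round(c(smr_at_0 = smr_at_0, smr_at_10 = smr_at_10,
              irr_per_point = irr_per_point), 4))
```

Multiplying a predicted SMR by a catchment's expected cases gives its predicted number of cases.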
sjPlot::tab_model(model_1km_2017, digits = 4, digits.re = 4,
show.r2 = FALSE, show.aic = TRUE)
| observed_2017 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.7483 | 0.7384 – 0.7582 | <0.001 |
| prop_pop_1km | 1.0331 | 1.0320 – 1.0343 | <0.001 |
| Observations | 26 | ||
| AIC | 9834.177 | ||
# From the output above, the p-value is below 0.001, well under the alpha value of 0.05.
# We therefore reject our null hypothesis, meaning that there is a difference in
# dry season malaria risk between people living close to water bodies and those living far away.
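The incidence rate ratios in the table are the exponentiated model coefficients. A minimal sketch on simulated data (all values are assumptions) shows one way to obtain them with Wald confidence intervals from base R; these intervals may differ slightly from those in the table, depending on how the table computes them.

```r
# Simulated catchment-level data (hypothetical values)
set.seed(42)
prop_pop <- runif(26, 0, 40)
expected <- runif(26, 500, 5000)
observed <- rpois(26, lambda = expected * exp(-0.3 + 0.03 * prop_pop))

fit <- glm(observed ~ prop_pop + offset(log(expected)),
           family = poisson(link = "log"))

# Exponentiate the coefficients and their Wald confidence limits
irr_table <- exp(cbind(IRR = coef(fit), confint.default(fit)))
print(round(irr_table, 4))
```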
model_2km_2017 <- glm(observed_2017~1+prop_pop_2km+offset(log(expected_2017)),
data = model_data_2017, family = poisson(link = "log"))
# summary(model_2km_2017)
report::report(model_2km_2017)
## We fitted a poisson model (estimated using ML) to predict observed_2017 with prop_pop_2km and expected_2017 (formula: observed_2017 ~ 1 + prop_pop_2km + offset(log(expected_2017))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_2km = 0 and expected_2017 = 0, is at -0.35 (95% CI [-0.36, -0.33], p < .001). Within this model:
##
## - The effect of prop_pop_2km is statistically significant and positive (beta = 0.01, 95% CI [0.01, 0.02], p < .001; Std. beta = -0.47, 95% CI [-0.48, -0.45])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_2km_2017, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2017 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.707 | 0.697 – 0.718 | <0.001 |
| prop_pop_2km | 1.015 | 1.014 – 1.015 | <0.001 |
| Observations | 26 | ||
| AIC | 9674.729 | ||
model_3km_2017 <- glm(observed_2017~1+prop_pop_3km+offset(log(expected_2017)),
data = model_data_2017, family = poisson(link = "log"))
#summary(model_3km_2017)
report::report(model_3km_2017)
## We fitted a poisson model (estimated using ML) to predict observed_2017 with prop_pop_3km and expected_2017 (formula: observed_2017 ~ 1 + prop_pop_3km + offset(log(expected_2017))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_3km = 0 and expected_2017 = 0, is at -0.40 (95% CI [-0.42, -0.39], p < .001). Within this model:
##
## - The effect of prop_pop_3km is statistically significant and positive (beta = 0.01, 95% CI [0.01, 0.01], p < .001; Std. beta = -0.50, 95% CI [-0.51, -0.49])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_3km_2017, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2017 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.668 | 0.657 – 0.680 | <0.001 |
| prop_pop_3km | 1.010 | 1.010 – 1.011 | <0.001 |
| Observations | 26 | ||
| AIC | 9942.005 | ||
# 2018 -------------------------------------------------------------------------
model_1km_2018 <- glm(observed_2018~1+prop_pop_1km+offset(log(expected_2018)),
data = model_data_2018, family = poisson(link = "log"))
# summary(model_1km_2018)
report::report(model_1km_2018)
## We fitted a poisson model (estimated using ML) to predict observed_2018 with prop_pop_1km and expected_2018 (formula: observed_2018 ~ 1 + prop_pop_1km + offset(log(expected_2018))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_1km = 0 and expected_2018 = 0, is at -0.29 (95% CI [-0.30, -0.27], p < .001). Within this model:
##
## - The effect of prop_pop_1km is statistically significant and positive (beta = 0.02, 95% CI [0.02, 0.02], p < .001; Std. beta = -0.58, 95% CI [-0.59, -0.56])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_1km_2018, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2018 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.750 | 0.740 – 0.761 | <0.001 |
| prop_pop_1km | 1.024 | 1.023 – 1.025 | <0.001 |
| Observations | 26 | ||
| AIC | 13953.927 | ||
model_2km_2018 <- glm(observed_2018~1+prop_pop_2km+offset(log(expected_2018)),
data = model_data_2018, family = poisson(link = "log"))
# summary(model_2km_2018)
report::report(model_2km_2018)
## We fitted a poisson model (estimated using ML) to predict observed_2018 with prop_pop_2km and expected_2018 (formula: observed_2018 ~ 1 + prop_pop_2km + offset(log(expected_2018))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_2km = 0 and expected_2018 = 0, is at -0.30 (95% CI [-0.31, -0.28], p < .001). Within this model:
##
## - The effect of prop_pop_2km is statistically significant and positive (beta = 9.22e-03, 95% CI [8.81e-03, 9.62e-03], p < .001; Std. beta = -0.47, 95% CI [-0.48, -0.46])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_2km_2018, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2018 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.742 | 0.731 – 0.754 | <0.001 |
| prop_pop_2km | 1.009 | 1.009 – 1.010 | <0.001 |
| Observations | 26 | ||
| AIC | 14604.406 | ||
model_3km_2018 <- glm(observed_2018~1+prop_pop_3km+offset(log(expected_2018)),
data = model_data_2018, family = poisson(link = "log"))
# summary(model_3km_2018)
report::report(model_3km_2018)
## We fitted a poisson model (estimated using ML) to predict observed_2018 with prop_pop_3km and expected_2018 (formula: observed_2018 ~ 1 + prop_pop_3km + offset(log(expected_2018))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_3km = 0 and expected_2018 = 0, is at -0.40 (95% CI [-0.42, -0.38], p < .001). Within this model:
##
## - The effect of prop_pop_3km is statistically significant and positive (beta = 8.03e-03, 95% CI [7.67e-03, 8.38e-03], p < .001; Std. beta = -0.47, 95% CI [-0.48, -0.45])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_3km_2018, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2018 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.668 | 0.655 – 0.681 | <0.001 |
| prop_pop_3km | 1.008 | 1.008 – 1.008 | <0.001 |
| Observations | 26 | ||
| AIC | 14554.845 | ||
# 2019 -------------------------------------------------------------------------
model_1km_2019 <- glm(observed_2019~1+prop_pop_1km+offset(log(expected_2019)),
data = model_data_2019, family = poisson(link = "log"))
# summary(model_1km_2019)
report::report(model_1km_2019)
## We fitted a poisson model (estimated using ML) to predict observed_2019 with prop_pop_1km and expected_2019 (formula: observed_2019 ~ 1 + prop_pop_1km + offset(log(expected_2019))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_1km = 0 and expected_2019 = 0, is at -0.19 (95% CI [-0.20, -0.17], p < .001). Within this model:
##
## - The effect of prop_pop_1km is statistically significant and positive (beta = 0.01, 95% CI [9.39e-03, 0.01], p < .001; Std. beta = -0.42, 95% CI [-0.43, -0.41])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_1km_2019, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2019 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.830 | 0.817 – 0.844 | <0.001 |
| prop_pop_1km | 1.010 | 1.009 – 1.011 | <0.001 |
| Observations | 26 | ||
| AIC | 10313.731 | ||
model_2km_2019 <- glm(observed_2019~1+prop_pop_2km+offset(log(expected_2019)),
data = model_data_2019, family = poisson(link = "log"))
# summary(model_2km_2019)
report::report(model_2km_2019)
## We fitted a poisson model (estimated using ML) to predict observed_2019 with prop_pop_2km and expected_2019 (formula: observed_2019 ~ 1 + prop_pop_2km + offset(log(expected_2019))). The model's explanatory power is substantial (Nagelkerke's R2 = 0.96). The model's intercept, corresponding to prop_pop_2km = 0 and expected_2019 = 0, is at -0.08 (95% CI [-0.10, -0.06], p < .001). Within this model:
##
## - The effect of prop_pop_2km is statistically significant and positive (beta = 1.76e-03, 95% CI [1.38e-03, 2.14e-03], p < .001; Std. beta = -0.33, 95% CI [-0.34, -0.31])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_2km_2019, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2019 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.924 | 0.907 – 0.942 | <0.001 |
| prop_pop_2km | 1.002 | 1.001 – 1.002 | <0.001 |
| Observations | 26 | ||
| AIC | 10906.895 | ||
model_3km_2019 <- glm(observed_2019~1+prop_pop_3km+offset(log(expected_2019)),
data = model_data_2019, family = poisson(link = "log"))
# summary(model_3km_2019)
report::report(model_3km_2019)
## We fitted a poisson model (estimated using ML) to predict observed_2019 with prop_pop_3km and expected_2019 (formula: observed_2019 ~ 1 + prop_pop_3km + offset(log(expected_2019))). The model's explanatory power is substantial (Nagelkerke's R2 = 0.83). The model's intercept, corresponding to prop_pop_3km = 0 and expected_2019 = 0, is at -0.07 (95% CI [-0.10, -0.05], p < .001). Within this model:
##
## - The effect of prop_pop_3km is statistically significant and positive (beta = 1.08e-03, 95% CI [7.70e-04, 1.39e-03], p < .001; Std. beta = -0.43, 95% CI [-0.44, -0.41])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_3km_2019, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2019 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.929 | 0.909 – 0.950 | <0.001 |
| prop_pop_3km | 1.001 | 1.001 – 1.001 | <0.001 |
| Observations | 26 | ||
| AIC | 10942.072 | ||
# 2020 -------------------------------------------------------------------------
model_1km_2020 <- glm(observed_2020~1+prop_pop_1km+offset(log(expected_2020)),
data = model_data_2020, family = poisson(link = "log"))
# summary(model_1km_2020)
report::report(model_1km_2020)
## We fitted a poisson model (estimated using ML) to predict observed_2020 with prop_pop_1km and expected_2020 (formula: observed_2020 ~ 1 + prop_pop_1km + offset(log(expected_2020))). The model's explanatory power is weak (Nagelkerke's R2 = 0.08). The model's intercept, corresponding to prop_pop_1km = 0 and expected_2020 = 0, is at 5.44e-03 (95% CI [-3.63e-03, 0.01], p = 0.239). Within this model:
##
## - The effect of prop_pop_1km is statistically non-significant and negative (beta = -5.94e-04, 95% CI [-1.36e-03, 1.77e-04], p = 0.131; Std. beta = -0.49, 95% CI [-0.50, -0.49])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_1km_2020, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2020 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 1.005 | 0.996 – 1.015 | 0.239 |
| prop_pop_1km | 0.999 | 0.999 – 1.000 | 0.131 |
| Observations | 26 | ||
| AIC | 21077.777 | ||
model_2km_2020 <- glm(observed_2020~1+prop_pop_2km+offset(log(expected_2020)),
data = model_data_2020, family = poisson(link = "log"))
# summary(model_2km_2020)
report::report(model_2km_2020)
## We fitted a poisson model (estimated using ML) to predict observed_2020 with prop_pop_2km and expected_2020 (formula: observed_2020 ~ 1 + prop_pop_2km + offset(log(expected_2020))). The model's explanatory power is very weak (Nagelkerke's R2 = 0.01). The model's intercept, corresponding to prop_pop_2km = 0 and expected_2020 = 0, is at -2.27e-03 (95% CI [-0.01, 7.71e-03], p = 0.656). Within this model:
##
## - The effect of prop_pop_2km is statistically non-significant and positive (beta = 9.77e-05, 95% CI [-2.56e-04, 4.51e-04], p = 0.588; Std. beta = -0.58, 95% CI [-0.59, -0.57])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_2km_2020, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2020 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.998 | 0.988 – 1.008 | 0.656 |
| prop_pop_2km | 1.000 | 1.000 – 1.000 | 0.588 |
| Observations | 26 | ||
| AIC | 21079.760 | ||
model_3km_2020 <- glm(observed_2020~1+prop_pop_3km+offset(log(expected_2020)),
data = model_data_2020, family = poisson(link = "log"))
# summary(model_3km_2020)
report::report(model_3km_2020)
## We fitted a poisson model (estimated using ML) to predict observed_2020 with prop_pop_3km and expected_2020 (formula: observed_2020 ~ 1 + prop_pop_3km + offset(log(expected_2020))). The model's explanatory power is very weak (Nagelkerke's R2 = 4.42e-03). The model's intercept, corresponding to prop_pop_3km = 0 and expected_2020 = 0, is at -1.59e-03 (95% CI [-0.01, 9.18e-03], p = 0.773). Within this model:
##
## - The effect of prop_pop_3km is statistically non-significant and positive (beta = 4.14e-05, 95% CI [-1.97e-04, 2.80e-04], p = 0.734; Std. beta = -0.51, 95% CI [-0.52, -0.51])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(model_3km_2020, digits = 3, digits.re = 3,
show.r2 = FALSE, show.aic = TRUE)
| observed_2020 | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.998 | 0.988 – 1.009 | 0.773 |
| prop_pop_3km | 1.000 | 1.000 – 1.000 | 0.734 |
| Observations | 26 | ||
| AIC | 21079.939 | ||
In the summaries above, all p-values for the 2017 to 2019 models are below 0.001, whereas none of the 2020 models shows a significant effect. Taken at face value, the explanatory variable (percentage of people living near water bodies) is significantly associated with dry season malaria cases in 2017 to 2019. However, the residual deviance is far greater than the residual degrees of freedom, indicating that over-dispersion exists, as anticipated. This means that the coefficient estimates remain valid, but the standard errors (Std. Error) are underestimated by the Poisson model, so the confidence intervals are too narrow and the p-values too small.
The null deviance shows how well the response variable is predicted by a model that includes only the intercept (grand mean), whereas the residual deviance shows how well it is predicted once the independent variables are included. For model_1km_2017, for example, we can see that the addition of 1 (25-24 = 1) independent variable decreased the deviance to 9587.9 from 12787.5. This reduction shows that the predictor improves on the intercept-only model, but the residual deviance is still very large relative to its 24 degrees of freedom, which means a poor fit. The estimated dispersion parameter is 399.4958 (9587.9/24), far above the value of 1 assumed by the Poisson model. One possibility is that there are other important covariates that could be used to describe the differences in the observed dry season malaria cases. We consider over-dispersion as a possible explanation of the significant lack-of-fit. Over-dispersion suggests that there is more variation in the response than the model implies.
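The dispersion estimate and its effect on the standard errors can be sketched as follows. The data are simulated from a negative binomial distribution (an assumption, chosen only to produce over-dispersed counts), and the quasi-Poisson refit is shown as one common way of handling over-dispersion, not as the method used in this analysis.

```r
# Simulated over-dispersed counts (hypothetical, not the study data)
set.seed(123)
n <- 26
x <- runif(n, 0, 40)
expected <- runif(n, 500, 5000)
mu <- expected * exp(-0.3 + 0.02 * x)
y <- rnbinom(n, mu = mu, size = 2)

fit_pois <- glm(y ~ x + offset(log(expected)), family = poisson(link = "log"))

# Two common dispersion estimates: deviance-based and Pearson-based
dispersion_deviance <- deviance(fit_pois) / df.residual(fit_pois)
dispersion_pearson <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)

# Quasi-Poisson keeps the same coefficients but scales the standard
# errors by the square root of the Pearson dispersion estimate
fit_quasi <- glm(y ~ x + offset(log(expected)),
                 family = quasipoisson(link = "log"))
se_pois <- coef(summary(fit_pois))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]

print(c(deviance = dispersion_deviance, pearson = dispersion_pearson))
print(cbind(se_pois, se_quasi))
```

A negative binomial model would be another option when the dispersion is this large.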
Here, we evaluate how well the Poisson regression model predicts the outcome for test data, i.e., how close the predictions are to the observed values. For an honest estimate of predictive performance, the test observations should not have been used to fit the model.
Two important metrics have been used to assess the performance of the predictive regression model:
Root Mean Squared Error (RMSE), which measures the model prediction error. It corresponds to the square root of the average squared difference between the observed known values of the outcome and the values predicted by the model. RMSE is computed as RMSE = mean((observed - predicted)^2) |> sqrt(). The lower the RMSE, the better the model.
\[RMSE = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}}\]
Pseudo R-squared, a goodness-of-fit measure for count data in Poisson regression, which is nonlinear. It is computed using deviance statistics:
\[D = 2\sum_{i=1}^{n}\left\{Y_i \log(Y_i/\mu_i) - (Y_i - \mu_i)\right\}\] where \[\mu_i = \exp(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_p X_p)\] denotes the predicted mean for observation i based on the estimated model parameters. AJKOER (https://stats.stackexchange.com/users/54013/ajkoer), Basic R-Squared in Poisson Regression, URL (version: 2020-06-28): https://stats.stackexchange.com/q/474500
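Both metrics can be written out in a few lines. The sketch below uses simulated counts (an assumption) and two hypothetical helpers, rmse and poisson_deviance. The pseudo R-squared shown, 1 minus the ratio of residual to null deviance, is one common deviance-based definition and is not necessarily the variant reported later in this analysis.

```r
# RMSE: square root of the mean squared prediction error
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Poisson deviance; the term y * log(y / mu) is taken as 0 when y == 0
poisson_deviance <- function(y, mu) {
  log_term <- ifelse(y == 0, 0, y * log(y / mu))
  2 * sum(log_term - (y - mu))
}

# Simulated example data (hypothetical)
set.seed(7)
x <- runif(40, 0, 30)
y <- rpois(40, lambda = exp(0.5 + 0.05 * x))

fit <- glm(y ~ x, family = poisson(link = "log"))
fit_null <- glm(y ~ 1, family = poisson(link = "log"))

d_model <- poisson_deviance(y, fitted(fit))
d_null <- poisson_deviance(y, fitted(fit_null))

# Deviance-based pseudo R-squared and in-sample RMSE
pseudo_r2 <- 1 - d_model / d_null
model_rmse <- rmse(y, fitted(fit))

print(c(pseudo_r2 = pseudo_r2, rmse = model_rmse))
```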
# Tidy model assessment data ---------------------------------------------------
model_assessment_data_2017 <- model_data_2017 |>
dplyr::as_tibble() |>
dplyr::select(Names, observed_2017, expected_2017,
prop_pop_1km, prop_pop_2km, prop_pop_3km)
model_assessment_data_2018 <- model_data_2018 |>
dplyr::as_tibble() |>
dplyr::select(Names, observed_2018, expected_2018,
prop_pop_1km, prop_pop_2km, prop_pop_3km)
model_assessment_data_2019 <- model_data_2019 |>
dplyr::as_tibble() |>
dplyr::select(Names, observed_2019, expected_2019,
prop_pop_1km, prop_pop_2km, prop_pop_3km)
model_assessment_data_2020 <- model_data_2020 |>
dplyr::as_tibble() |>
dplyr::select(Names, observed_2020, expected_2020,
prop_pop_1km, prop_pop_2km, prop_pop_3km)
# To do: LOOCV(Leave One Out Cross-Validation) and k-fold Cross Validation
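A leave-one-out cross-validation for this kind of model could look roughly like the sketch below. It runs on a simulated data frame named toy (hypothetical column names and values), refits the Poisson model once per catchment with that catchment held out, and predicts the held-out count.

```r
# Simulated catchment-level data (hypothetical, not the study data)
set.seed(2024)
n <- 26
toy <- data.frame(prop_pop = runif(n, 0, 40),
                  expected = runif(n, 500, 5000))
toy$observed <- rpois(n, lambda = toy$expected * exp(-0.3 + 0.02 * toy$prop_pop))

# Refit the model n times, each time leaving out one catchment,
# and predict the number of cases for the held-out catchment
loocv_pred <- vapply(seq_len(n), function(i) {
  fit_i <- glm(observed ~ prop_pop + offset(log(expected)),
               data = toy[-i, ], family = poisson(link = "log"))
  unname(predict(fit_i, newdata = toy[i, , drop = FALSE], type = "response"))
}, numeric(1))

# Out-of-sample prediction error
loocv_rmse <- sqrt(mean((toy$observed - loocv_pred)^2))
print(loocv_rmse)
```

With only 26 catchments per year, leave-one-out uses the data more efficiently than a single 80/20 split.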
# Split the data into training and test set ------------------------------------
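set.seed(2) # fix the random seed so that the split is reproducible
# sample.split() expects the outcome vector and returns one logical value per row
split_2017 <- caTools::sample.split(model_assessment_data_2017$observed_2017,
SplitRatio = 0.8) # use 80% of the data for training
# subset() needs a logical condition; a named argument such as split = "TRUE"
# does not filter, so train and test would both keep all 26 rows
train_2017 <- subset(model_assessment_data_2017, split_2017)
test_2017 <- subset(model_assessment_data_2017, !split_2017)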
# Build model
model_2017 <- glm(observed_2017~1+prop_pop_1km+offset(log(expected_2017)),
data = train_2017, family = 'poisson')
summary(model_2017)
##
## Call:
## glm(formula = observed_2017 ~ 1 + prop_pop_1km + offset(log(expected_2017)),
## family = "poisson", data = train_2017)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -34.204 -13.024 -3.552 15.308 40.886
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.290015 0.006763 -42.88 <2e-16 ***
## prop_pop_1km 0.032594 0.000570 57.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12787.5 on 25 degrees of freedom
## Residual deviance: 9587.9 on 24 degrees of freedom
## AIC: 9834.2
##
## Number of Fisher Scoring iterations: 4
# Prediction
predicted_2017 <- predict(model_2017, test_2017, type = "response")
# Compare predicted vs actual dry season malaria cases
par(mfrow = c(2, 1))
# plot(test_2017$observed_2017, type = "b", lty = 1.8, col = "blue")
# plot(predicted_2017, type = "b", lty = 1.8, col = "red", add = TRUE)
barplot(test_2017$observed_2017, main = "Observed malaria cases",
xlab = "Catchment area", ylab = "Observed malaria cases")
barplot(predicted_2017, main = "Predicted malaria cases",
xlab = "Catchment area", ylab = "Predicted malaria cases")
par(mfrow = c(1, 1)) # Reset to a single plotting panel
# Split the data into training and test set
set.seed(123) # fix the random seed so that the partitions are reproducible
# 2017 -------------------------------------------------------------------------
training_samples_2017 <- model_assessment_data_2017$observed_2017 |>
caret::createDataPartition(p = 0.8, list = FALSE) # use 80% of the data for training
train_data_2017 <- model_assessment_data_2017[training_samples_2017, ]
test_data_2017 <- model_assessment_data_2017[-training_samples_2017, ]
# 2018
training_samples_2018 <- model_assessment_data_2018$observed_2018 |>
caret::createDataPartition(p = 0.8, list = FALSE) # see ?createDataPartition
train_data_2018 <- model_assessment_data_2018[training_samples_2018, ]
test_data_2018 <- model_assessment_data_2018[-training_samples_2018, ]
# 2019
training_samples_2019 <- model_assessment_data_2019$observed_2019 |>
caret::createDataPartition(p = 0.8, list = FALSE)
train_data_2019 <- model_assessment_data_2019[training_samples_2019, ]
test_data_2019 <- model_assessment_data_2019[-training_samples_2019, ]
# 2020
training_samples_2020 <- model_assessment_data_2020$observed_2020 |>
caret::createDataPartition(p = 0.8, list = FALSE)
train_data_2020 <- model_assessment_data_2020[training_samples_2020, ]
test_data_2020 <- model_assessment_data_2020[-training_samples_2020, ]
# Make predictions using the test data in order to evaluate the predictive
# performance of our regression models. For a GLM, the default predictions are
# on the scale of the linear predictor (the log scale for our Poisson models,
# log-odds for a binomial model). type = "response" returns predictions on the
# scale of the response variable, i.e. the predicted number of malaria cases.
# 2017 -------------------------------------------------------------------------
predictions_1km_2017 <- model_1km_2017 |>
stats::predict.glm(test_data_2017, type = "response") # see ?stats::predict.glm
predictions_2km_2017 <- model_2km_2017 |>
stats::predict.glm(test_data_2017, type = "response")
# 2018
predictions_1km_2018 <- model_1km_2018 |>
stats::predict.glm(test_data_2018, type = "response")
predictions_2km_2018 <- model_2km_2018 |>
stats::predict.glm(test_data_2018, type = "response")
# 2019
predictions_1km_2019 <- model_1km_2019 |>
stats::predict.glm(test_data_2019, type = "response")
predictions_2km_2019 <- model_2km_2019 |>
stats::predict.glm(test_data_2019, type = "response")
# 2020
predictions_1km_2020 <- model_1km_2020 |>
stats::predict.glm(test_data_2020, type = "response")
predictions_2km_2020 <- model_2km_2020 |>
stats::predict.glm(test_data_2020, type = "response")
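The difference between the two prediction scales can be seen in a short sketch on simulated data (the data frame toy and its columns are assumptions). For a Poisson model with a log link, exponentiating the link-scale predictions should reproduce the response-scale predictions, and the offset is taken from the new data.

```r
# Simulated data (hypothetical values)
set.seed(99)
toy <- data.frame(x = runif(30, 0, 40),
                  expected = runif(30, 500, 5000))
toy$y <- rpois(30, lambda = toy$expected * exp(-0.2 + 0.01 * toy$x))

fit <- glm(y ~ x + offset(log(expected)), data = toy,
           family = poisson(link = "log"))

new_rows <- toy[1:5, ]

# Link scale: log of the predicted count (offset included)
pred_link <- predict(fit, newdata = new_rows, type = "link")
# Response scale: predicted count
pred_response <- predict(fit, newdata = new_rows, type = "response")

print(cbind(pred_link, exp_link = exp(pred_link), pred_response))
```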
# Model performance ------------------------------------------------------------
# (a) Compute the prediction error, RMSE. The lower the RMSE, the better the model
# 2017
caret::RMSE(predictions_1km_2017, test_data_2017$observed_2017)
## [1] 753.5512
caret::RMSE(predictions_2km_2017, test_data_2017$observed_2017)
## [1] 656.2454
# 2018
caret::RMSE(predictions_1km_2018, test_data_2018$observed_2018)
## [1] 1801.58
caret::RMSE(predictions_2km_2018, test_data_2018$observed_2018)
## [1] 1682.038
# 2019
caret::RMSE(predictions_1km_2019, test_data_2019$observed_2019)
## [1] 1376.436
caret::RMSE(predictions_2km_2019, test_data_2019$observed_2019)
## [1] 1348.131
# 2020
caret::RMSE(predictions_1km_2020, test_data_2020$observed_2020)
## [1] 2398.95
caret::RMSE(predictions_2km_2020, test_data_2020$observed_2020)
## [1] 2400.144
# (b) Compute pseudo R-square
# We express the goodness of fit of the Poisson regression model by widely
# used variants of pseudo R squared statistics, most of which are based on
# the deviance of the model:
# - the Aldrich-Nelson pseudo-R2 with the Veall-Zimmermann correction, which
# is the best approximation of the McKelvey-Zavoina.
# Efron, Aldrich-Nelson, McFadden and Nagelkerke approaches severely underestimate
# the "true R2". -----------------------------------------------------------------
# DescTools::PseudoR2(model_1km_2017, "all")
round(DescTools::PseudoR2(model_1km_2017, c("AldrichNelson", "AIC", "LogLikNull",
"LogLik", "G2", "VeallZimmermann")), 2)
## AldrichNelson AIC G2 VeallZimmermann
## 0.99 9834.18 3199.65 0.99
round(DescTools::PseudoR2(model_2km_2017, c("AldrichNelson", "AIC", "LogLikNull",
"LogLik", "G2", "VeallZimmermann")), 2)
## AldrichNelson AIC G2 VeallZimmermann
## 0.99 9674.73 3359.10 0.99
round(DescTools::PseudoR2(model_1km_2018, c("AldrichNelson", "AIC", "LogLikNull",
"LogLik", "G2", "VeallZimmermann")), 2)
## AldrichNelson AIC G2 VeallZimmermann
## 0.99 13953.93 2639.84 0.99
round(DescTools::PseudoR2(model_2km_2018, c("AldrichNelson", "AIC", "LogLikNull",
"LogLik", "G2", "VeallZimmermann")), 2)
## AldrichNelson AIC G2 VeallZimmermann
## 0.99 14604.41 1989.36 0.99
round(DescTools::PseudoR2(model_1km_2019, c("AldrichNelson", "AIC", "LogLikNull",
"LogLik", "G2", "VeallZimmermann")), 2)
## AldrichNelson AIC G2 VeallZimmermann
## 0.96 10313.73 675.17 0.97
round(DescTools::PseudoR2(model_2km_2019, c("AldrichNelson", "AIC", "LogLikNull",
"LogLik", "G2", "VeallZimmermann")), 2)
## AldrichNelson AIC G2 VeallZimmermann
## 0.76 10906.89 82.01 0.76
round(DescTools::PseudoR2(model_1km_2020, c("AldrichNelson", "AIC", "LogLikNull",
"LogLik", "G2", "VeallZimmermann")), 2)
## AldrichNelson AIC G2 VeallZimmermann
## 0.08 21077.78 2.28 0.08
round(DescTools::PseudoR2(model_2km_2020, c("AldrichNelson", "AIC", "LogLikNull",
"LogLik", "G2", "VeallZimmermann")), 2)
## AldrichNelson AIC G2 VeallZimmermann
## 0.01 21079.76 0.29 0.01
# Check for collinearity, normality or heteroscedasticity --------------------
performance::check_model(model_1km_2017, theme = "see::theme_modern")
performance::check_model(model_2km_2017, theme = "see::theme_modern")
performance::check_model(model_1km_2018, theme = "see::theme_modern")
performance::check_model(model_2km_2018, theme = "see::theme_modern")
performance::check_model(model_1km_2019, theme = "see::theme_modern")
performance::check_model(model_2km_2019, theme = "see::theme_modern")
performance::check_model(model_1km_2020, theme = "see::theme_modern")
performance::check_model(model_2km_2020, theme = "see::theme_modern")
# Model performance summaries --------------------------------------------------
# Compute indices of model performance for the regression models and
# compare the quality of the models.
# Note that the score values do not necessarily sum up to 100%. See ?compare_performance
model_comparison <- performance::compare_performance(model_1km_2017, model_2km_2017,
model_1km_2018, model_2km_2018,
model_1km_2019, model_2km_2019,
model_1km_2020, model_2km_2020,
metrics = "common", rank = TRUE)
model_comparison
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | BIC | Nagelkerke's R2 | RMSE | Performance-Score
## -----------------------------------------------------------------------------------------------
## model_2km_2017 | glm | 9674.729 | 9677.245 | 1.000 | 906.793 | 100.00%
## model_1km_2017 | glm | 9834.177 | 9836.693 | 1.000 | 934.411 | 98.50%
## model_1km_2019 | glm | 10313.731 | 10316.247 | 1.000 | 952.120 | 95.89%
## model_2km_2019 | glm | 10906.895 | 10909.411 | 0.957 | 972.461 | 91.62%
## model_1km_2018 | glm | 13953.927 | 13956.443 | 1.000 | 1338.635 | 68.77%
## model_2km_2018 | glm | 14604.406 | 14606.922 | 1.000 | 1363.406 | 65.20%
## model_1km_2020 | glm | 21077.777 | 21080.293 | 0.084 | 1764.232 | 2.08%
## model_2km_2020 | glm | 21079.760 | 21082.277 | 0.011 | 1772.332 | 0.00%
The 8 models are ranked on each of four metrics: AIC, BIC, RMSE and Nagelkerke's R-squared. The radar chart indicates which models performed well or poorly on each metric. For example, model_1km_2020 and model_2km_2020 rank lowest on all four metrics, whereas the other models (model_1km_2017, model_2km_2017, model_1km_2018, model_2km_2018, model_1km_2019, model_2km_2019) rank higher in AIC, BIC, RMSE and R-squared.
plot(performance::compare_performance(model_1km_2017, model_2km_2017,
model_1km_2018, model_2km_2018,
model_1km_2019, model_2km_2019,
model_1km_2020, model_2km_2020,
metrics = "common", rank = TRUE))
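The Performance-Score column in the comparison table can be reproduced approximately by hand. The sketch below assumes, following the description in ?compare_performance, that each metric is rescaled to the 0-1 range (reversed for AIC, BIC and RMSE, where lower is better) and that the score is the unweighted mean of the rescaled values. The helper `rescale01` and the data frame `perf` are illustrative names, and the inputs are the rounded values printed in the table, so small differences in the last digit are expected.

```r
# Metric values copied from the printed comparison table (rounded as printed)
perf <- data.frame(
  model = c("model_2km_2017", "model_1km_2017", "model_1km_2019", "model_2km_2019",
            "model_1km_2018", "model_2km_2018", "model_1km_2020", "model_2km_2020"),
  AIC  = c(9674.729, 9834.177, 10313.731, 10906.895,
           13953.927, 14604.406, 21077.777, 21079.760),
  BIC  = c(9677.245, 9836.693, 10316.247, 10909.411,
           13956.443, 14606.922, 21080.293, 21082.277),
  R2   = c(1.000, 1.000, 1.000, 0.957, 1.000, 1.000, 0.084, 0.011),
  RMSE = c(906.793, 934.411, 952.120, 972.461,
           1338.635, 1363.406, 1764.232, 1772.332)
)

# Hypothetical helper: min-max rescaling to [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Lower is better for AIC, BIC and RMSE, so those scales are reversed
perf$score <- rowMeans(cbind(
  1 - rescale01(perf$AIC),
  1 - rescale01(perf$BIC),
  rescale01(perf$R2),
  1 - rescale01(perf$RMSE)
))

# Models ordered from best to worst overall score
perf[order(-perf$score), c("model", "score")]
```

Because the score is relative to the best and worst model in the set, adding or removing a model changes every score; it ranks models within this comparison only and says nothing about absolute goodness of fit.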
The fitted values appear to line up particularly well with the observed data, suggesting that prop_pop_* (i.e., proportion of catchment population living near water bodies) can help us understand malaria risk in the catchment areas.
# Helper function to create scatter plots to see how well
# fitted values line up with observed malaria cases
plot.fitted.values <- function(fitted.values.df, model.df, title){
# Remove missing values from model data since
# model fitting deletes missing observations
model.df.complete <- model.df |>
tidyr::drop_na() |>
dplyr::rename_at(vars(starts_with("observed_")), ~ str_c("observed"))
# Plot fitted versus observed values
scatter.plot <- ggplot2::ggplot()+
ggplot2::geom_point(aes(fitted.values.df$fitted.values,
model.df.complete$observed))+
ggplot2::theme_classic()+
ggplot2::labs(x = "Fitted values",
y = "Observed values",
title = title)
return(scatter.plot)
}
# Invoking function
# 2017 -------------------------------------------------------------------------
fitted_1km_2017 <- plot.fitted.values(model_1km_2017, model_data_2017, "2017: 1km model")
fitted_2km_2017 <- plot.fitted.values(model_2km_2017, model_data_2017, "2017: 2km model")
fitted_3km_2017 <- plot.fitted.values(model_3km_2017, model_data_2017, "2017: 3km model")
# 2018 -------------------------------------------------------------------------
fitted_1km_2018 <- plot.fitted.values(model_1km_2018, model_data_2018, "2018: 1km model")
fitted_2km_2018 <- plot.fitted.values(model_2km_2018, model_data_2018, "2018: 2km model")
fitted_3km_2018 <- plot.fitted.values(model_3km_2018, model_data_2018, "2018: 3km model")
# 2019 -------------------------------------------------------------------------
fitted_1km_2019 <- plot.fitted.values(model_1km_2019, model_data_2019, "2019: 1km model")
fitted_2km_2019 <- plot.fitted.values(model_2km_2019, model_data_2019, "2019: 2km model")
fitted_3km_2019 <- plot.fitted.values(model_3km_2019, model_data_2019, "2019: 3km model")
# 2020 -------------------------------------------------------------------------
fitted_1km_2020 <- plot.fitted.values(model_1km_2020, model_data_2020, "2020: 1km model")
fitted_2km_2020 <- plot.fitted.values(model_2km_2020, model_data_2020, "2020: 2km model")
fitted_3km_2020 <- plot.fitted.values(model_3km_2020, model_data_2020, "2020: 3km model")
# Layout scatter plots ---------------------------------------------------------
cowplot::plot_grid(fitted_1km_2017, fitted_2km_2017,
fitted_1km_2018, fitted_2km_2018,
fitted_1km_2019, fitted_2km_2019,
fitted_1km_2020, fitted_2km_2020,
ncol = 2, nrow = 4)
Fig. 12. How well the percentage of catchment population living around water bodies explains observed malaria incidence
To understand whether the similarity between malaria cases in catchment areas is a function of the distance between them, we incorporate a spatial dependency effect into the multivariate model and test for spatial autocorrelation.
# Prep data
prep.spatial.dependency.data <- function(sf_2017, sf_2018, sf_2019, sf_2020){
df_2017 <- sf_2017 |>
dplyr::as_tibble() |>
dplyr::rename(observed_2018 = dr_2018,
observed_2019 = dr_2019,
observed_2020 = dr_2020,
SMR_2017 = SMR) |>
dplyr::select(rowID, Names, SMR_2017, geometry)
df_2018 <- sf_2018 |>
dplyr::as_tibble() |>
dplyr::rename(SMR_2018 = SMR) |>
dplyr::select(rowID, SMR_2018)
df_2019 <- sf_2019 |>
dplyr::as_tibble() |>
dplyr::rename(SMR_2019 = SMR) |>
dplyr::select(rowID, SMR_2019)
df_2020 <- sf_2020 |>
dplyr::as_tibble() |>
dplyr::rename(SMR_2020 = SMR) |>
dplyr::select(rowID, SMR_2020)
spatial_dependency_data <- merge(
merge(
merge(
df_2017, df_2018, by = "rowID", all = TRUE),
df_2019, by = "rowID", all = TRUE),
df_2020, by = "rowID", all = TRUE)
return(spatial_dependency_data)
}
# Invoking function ------------------------------------------------------------
spatial_dependency_data <- prep.spatial.dependency.data(model_data_2017,
model_data_2018,
model_data_2019,
model_data_2020)
# Find adjacent polygons, i.e., make a neighbour list
# Contiguity neighbours: all polygons that share a boundary point
spatial_dependency_shp <- sf::st_as_sf(spatial_dependency_data) |>
as("Spatial")
catchment_neighbours <- spdep::poly2nb(spatial_dependency_shp) # Queen contiguity
summary(catchment_neighbours)
## Neighbour list object:
## Number of regions: 26
## Number of nonzero links: 98
## Percentage nonzero weights: 14.49704
## Average number of links: 3.769231
## Link number distribution:
##
## 1 2 3 4 5 6 7 8
## 2 3 8 5 5 1 1 1
## 2 least connected regions:
## 1 26 with 1 link
## 1 most connected region:
## 18 with 8 links
# Get the center point of each catchment polygon
coords <- sp::coordinates(spatial_dependency_shp)
# View the connections: draw the polygons first, then overlay the neighbour links
plot(spatial_dependency_shp, asp = 1)
plot(catchment_neighbours, coords, col = "blue", add = TRUE)
Fig. 13. Neighbourhood matrix
# Run a Moran I test on SMR
moran.test(spatial_dependency_data$SMR_2017,
nb2listw(catchment_neighbours))
##
## Moran I test under randomisation
##
## data: spatial_dependency_data$SMR_2017
## weights: nb2listw(catchment_neighbours)
##
## Moran I statistic standard deviate = 1.0451, p-value = 0.148
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.10287363 -0.04000000 0.01868994
moran.test(spatial_dependency_data$SMR_2018,
nb2listw(catchment_neighbours))
##
## Moran I test under randomisation
##
## data: spatial_dependency_data$SMR_2018
## weights: nb2listw(catchment_neighbours)
##
## Moran I statistic standard deviate = 1.4625, p-value = 0.0718
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.16260895 -0.04000000 0.01919258
moran.test(spatial_dependency_data$SMR_2019,
nb2listw(catchment_neighbours))
##
## Moran I test under randomisation
##
## data: spatial_dependency_data$SMR_2019
## weights: nb2listw(catchment_neighbours)
##
## Moran I statistic standard deviate = 0.64086, p-value = 0.2608
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.05113078 -0.04000000 0.02022102
moran.test(spatial_dependency_data$SMR_2020,
nb2listw(catchment_neighbours))
##
## Moran I test under randomisation
##
## data: spatial_dependency_data$SMR_2020
## weights: nb2listw(catchment_neighbours)
##
## Moran I statistic standard deviate = 1.0093, p-value = 0.1564
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.10407174 -0.04000000 0.02037431
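For reference, Moran's I can be computed by hand on a small example. The sketch below is a toy illustration in base R, not the catchment data: four hypothetical regions in a row with row-standardised contiguity weights (style "W", the default of nb2listw). It also shows where the Expectation of -0.04 in the test output comes from: under no spatial autocorrelation E[I] = -1/(n - 1), and there are n = 26 catchments.

```r
# Toy example: 4 regions on a line (1-2-3-4), binary contiguity
adj <- matrix(0, 4, 4)
adj[cbind(1:3, 2:4)] <- 1
adj <- adj + t(adj)                 # make the adjacency symmetric

W <- adj / rowSums(adj)             # row-standardised weights (style "W")
x <- c(1, 2, 3, 4)                  # hypothetical values with a clear spatial trend
z <- x - mean(x)                    # deviations from the mean
n <- length(x)

# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with S0 = sum of weights
moran_I <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
moran_I

# Expected value under no spatial autocorrelation for the 26 catchments
expected_I_26 <- -1 / (26 - 1)
expected_I_26
```

Values of I well above the expectation indicate that neighbouring regions have similar values; the tests above ask whether the observed I exceeds that expectation by more than chance would allow.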
# Run a Moran I MC test on SMR
moran.mc(spatial_dependency_data$SMR_2017,
nb2listw(catchment_neighbours),
nsim = 999)
##
## Monte-Carlo simulation of Moran I
##
## data: spatial_dependency_data$SMR_2017
## weights: nb2listw(catchment_neighbours)
## number of simulations + 1: 1000
##
## statistic = 0.10287, observed rank = 844, p-value = 0.156
## alternative hypothesis: greater
moran.mc(spatial_dependency_data$SMR_2018,
nb2listw(catchment_neighbours),
nsim = 999)
##
## Monte-Carlo simulation of Moran I
##
## data: spatial_dependency_data$SMR_2018
## weights: nb2listw(catchment_neighbours)
## number of simulations + 1: 1000
##
## statistic = 0.16261, observed rank = 908, p-value = 0.092
## alternative hypothesis: greater
moran.mc(spatial_dependency_data$SMR_2019,
nb2listw(catchment_neighbours),
nsim = 999)
##
## Monte-Carlo simulation of Moran I
##
## data: spatial_dependency_data$SMR_2019
## weights: nb2listw(catchment_neighbours)
## number of simulations + 1: 1000
##
## statistic = 0.051131, observed rank = 773, p-value = 0.227
## alternative hypothesis: greater
moran.mc(spatial_dependency_data$SMR_2020,
nb2listw(catchment_neighbours),
nsim = 999)
##
## Monte-Carlo simulation of Moran I
##
## data: spatial_dependency_data$SMR_2020
## weights: nb2listw(catchment_neighbours)
## number of simulations + 1: 1000
##
## statistic = 0.10407, observed rank = 838, p-value = 0.162
## alternative hypothesis: greater
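The Monte Carlo p-values follow directly from the observed ranks. As a sketch, the reported values are consistent with p = (nsim - rank + 1) / (nsim + 1) for the one-sided ("greater") alternative, where the nsim + 1 values are the 999 permuted statistics plus the observed one. The ranks below are copied from the output above; the object names are illustrative.

```r
nsim <- 999

# Observed ranks of Moran's I among the 1000 values, as reported per year
observed_rank <- c(`2017` = 844, `2018` = 908, `2019` = 773, `2020` = 838)

# One-sided pseudo p-value implied by each rank
p_value <- (nsim - observed_rank + 1) / (nsim + 1)
p_value
```

None of the four years falls below 0.05, which agrees with the analytical tests: there is no strong evidence of spatial autocorrelation in the SMR.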
# Run a Conditional Autoregressive (CAR) model, which allows us to incorporate
# the spatial autocorrelation between neighbours within our GLM
# First, generate a weights matrix from a neighbours list with spatial weights
adj_matrix <- spdep::nb2mat(catchment_neighbours, style = "B") # see ?nb2mat
# Match row and column names with those of geographic location index
rownames(adj_matrix) <- colnames(adj_matrix) <- spatial_dependency_data$rowID
# row.names(adj_matrix) <- NULL # alternatively
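# Now we can fit the model. In spaMM the spatial effect is requested with an
# adjacency(1|rowID) term in the formula, where rowID is the grouping factor
# (i.e. the rowID of each catchment area). The formula below has no such term,
# so this call estimates fixed effects only, as the summary output shows.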
CAR_model_1km_2017 <- spaMM::fitme(observed_2017~prop_pop_1km+offset(log(expected_2017)),
adjMatrix = adj_matrix,
data = model_data_2017,
family = 'poisson')
# Extract the fixed-effect table (estimate, conditional SE, t-value),
# from which 95% CIs can be derived
coefs <- as.data.frame(summary(CAR_model_1km_2017)$beta_table)
## formula: observed_2017 ~ prop_pop_1km + offset(log(expected_2017))
## Estimation of fixed effects by ML.
## family: poisson( link = log )
## ------------ Fixed effects (beta) ------------
## Estimate Cond. SE t-value
## (Intercept) -0.29002 0.006763 -42.88
## prop_pop_1km 0.03259 0.000570 57.18
## ------------- Likelihood values -------------
## logLik
## p(h) (Likelihood): -4915.088
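The beta table gives estimates and standard errors but no intervals. A minimal sketch of Wald-type 95% confidence intervals, assuming approximate normality of the estimates, is shown below. The data frame `beta_table` is an illustrative name and its numbers are copied from the fixed-effects output above.

```r
# Estimates and conditional standard errors copied from the fixed-effects table
beta_table <- data.frame(
  term     = c("(Intercept)", "prop_pop_1km"),
  estimate = c(-0.29002, 0.03259),
  se       = c(0.006763, 0.000570)
)

z <- qnorm(0.975)  # about 1.96 for a two-sided 95% interval

# Wald interval on the log scale
beta_table$lower <- beta_table$estimate - z * beta_table$se
beta_table$upper <- beta_table$estimate + z * beta_table$se

# Exponentiate to express the effect as a rate ratio
beta_table$rate_ratio <- exp(beta_table$estimate)
beta_table$rr_lower   <- exp(beta_table$lower)
beta_table$rr_upper   <- exp(beta_table$upper)

beta_table
```

On the rate-ratio scale, the prop_pop_1km row reads as the multiplicative change in malaria risk per one percentage point increase in the population living within 1km of water bodies.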
# Moran's I contiguity test
MI_2017 <- spdep::moran(model_data_2017$observed_2017,
nb2listw(catchment_neighbours),
length(model_data_2017$observed_2017),
Szero(nb2listw(catchment_neighbours)))
The findings from the univariate models above suggest that the risk of dry season malaria transmission varies by year, so we consider a multivariate model that includes both the proportion of people living close to dams and year as predictors. We would like to explain dry season malaria risk based on the proportion of people living close to dams and the year.
We also compare the effect of removing the intercept from the multiple Poisson regression. Mathematically, \(\beta_0 = 0\). Hence our multivariate model without intercept can be written as: \(\ln(E(y)) = \beta_1 x_{i1} + \ln(e_i)\) where \(i = 1, 2, 3, \cdots, n\)
# Prep model data --------------------------------------------------------------
df2017 <- model_data_2017 |>
dplyr::as_tibble() |>
dplyr::rename(observed_cases = observed_2017,
expected_cases = expected_2017,
health_facility = Names) |>
dplyr::select(-geometry, -fid,-DN, -X, -SMR,
-pop_2017, -dr_2018, -dr_2019, -dr_2020)
df2017$year <- "2017" # add new column
# df2017 <- cbind(df2017, year = "2017") # alternatively
df2018 <- model_data_2018 |>
dplyr::as_tibble() |>
dplyr::rename(observed_cases = observed_2018,
expected_cases = expected_2018) |>
dplyr::select(-geometry, -fid,-DN, -X, -SMR,
-pop_2018,-dr_2017, -dr_2019, -dr_2020)
df2018$year <- "2018"
colnames(df2018) <- colnames(df2017) # match columns names
df2019 <- model_data_2019 |>
dplyr::as_tibble() |>
dplyr::rename(observed_cases = observed_2019,
expected_cases = expected_2019) |>
dplyr::select(-geometry, -fid,-DN, -X, -SMR,
-pop_2019, -dr_2017, -dr_2018, -dr_2020)
df2019$year <- "2019"
colnames(df2019) <- colnames(df2017) # match columns names
df2020 <- model_data_2020 |>
dplyr::as_tibble() |>
dplyr::rename(observed_cases = observed_2020,
expected_cases = expected_2020) |>
dplyr::select(-geometry, -fid,-DN, -X, -SMR,
-pop_2020, -dr_2017, -dr_2018, -dr_2019)
df2020$year <- "2020"
colnames(df2020) <- colnames(df2017) # match columns names
model_data <- rbind(df2017, df2018, df2019, df2020)
model_data <- imputeTS::na.replace(model_data, 0) # replace NA with zero
# Add the natural log of the expected cases as a new column,
# referring to the column by name rather than by position
model_data <- model_data |>
dplyr::mutate(log_expected = log(expected_cases))
model_data |> # View model data in table format
gt::gt() |>
gt::tab_style(style = list(gt::cell_text(align = "center")),
locations = gt::cells_column_labels() ) |>
gt::cols_label(health_facility = "Health facility",
observed_cases = "Observed cases",
expected_cases = "Expected cases",
prop_pop_1km = "Proportion of population in 1km buffers (%)",
prop_pop_2km = "Proportion of population in 2km buffers (%)",
prop_pop_3km = "Proportion of population in 3km buffers (%)",
log_expected = "Log of expected cases",
year = "Year")
| rowID | Health facility | Observed cases | Expected cases | Proportion of population in 1km buffers (%) | Proportion of population in 2km buffers (%) | Proportion of population in 3km buffers (%) | Year | Log of expected cases |
|---|---|---|---|---|---|---|---|---|
| 1 | Lodjwa Health Centre | 564 | 826 | 10 | 7 | 19 | 2017 | 6.716595 |
| 2 | Nkhamenya Rural Hospital | 2720 | 3344 | 4 | 16 | 30 | 2017 | 8.114923 |
| 3 | Newa Mpasazi Health Centre | 216 | 1156 | 1 | 7 | 14 | 2017 | 7.052721 |
| 4 | Mpepa /Chisinga Health Centre | 1523 | 2287 | 0 | 0 | 0 | 2017 | 7.734996 |
| 5 | Mnyanja Health Centre | 1480 | 3327 | 2 | 6 | 19 | 2017 | 8.109826 |
| 6 | Simlemba Health Centre | 1159 | 2249 | 2 | 8 | 17 | 2017 | 7.718241 |
| 7 | Ofesi Health Centre | 1930 | 2340 | 0 | 2 | 2 | 2017 | 7.757906 |
| 8 | Chulu Health Centre | 3482 | 2324 | 7 | 23 | 45 | 2017 | 7.751045 |
| 9 | Kapelula Health Centre | 2970 | 2976 | 2 | 8 | 16 | 2017 | 7.998335 |
| 10 | Livwezi Health Centre | 594 | 1833 | 8 | 20 | 40 | 2017 | 7.513709 |
| 11 | Gogode Dispensary | 1553 | 1088 | 6 | 19 | 39 | 2017 | 6.992096 |
| 12 | Dwangwa Dispensary | 1153 | 2724 | 9 | 29 | 54 | 2017 | 7.909857 |
| 13 | Chamama Health Facility | 1005 | 1668 | 2 | 9 | 16 | 2017 | 7.419381 |
| 14 | Wimbe Health Centre | 2558 | 988 | 21 | 53 | 80 | 2017 | 6.895683 |
| 15 | Chinyama | 1140 | 1063 | 11 | 30 | 46 | 2017 | 6.968850 |
| 16 | Mdunga Health Centre | 1382 | 1514 | 0 | 0 | 0 | 2017 | 7.322510 |
| 17 | Mtunthama Health Centre | 1982 | 1561 | 21 | 64 | 82 | 2017 | 7.353082 |
| 18 | Kasungu District Hospital | 14663 | 11951 | 18 | 39 | 58 | 2017 | 9.388570 |
| 19 | Chamwabvi Health Centre | 2031 | 2945 | 7 | 24 | 46 | 2017 | 7.987864 |
| 20 | Linyangwa Health Centre | 1987 | 1480 | 5 | 20 | 41 | 2017 | 7.299797 |
| 21 | Mziza Health Centre | 4098 | 3689 | 13 | 30 | 51 | 2017 | 8.213111 |
| 22 | Kawamba Health Centre | 3845 | 2198 | 13 | 34 | 60 | 2017 | 7.695303 |
| 23 | Kamboni Health Centre | 2588 | 1768 | 6 | 20 | 33 | 2017 | 7.477604 |
| 24 | Khola Health Centre | 1012 | 1888 | 2 | 8 | 16 | 2017 | 7.543273 |
| 25 | Santhe Health Centre | 5668 | 3660 | 4 | 13 | 21 | 2017 | 8.205218 |
| 26 | Mkhota Health Centre | 1487 | 1940 | 5 | 17 | 24 | 2017 | 7.570443 |
| 1 | Lodjwa Health Centre | 1151 | 934 | 0 | 0 | 0 | 2018 | 6.839476 |
| 2 | Nkhamenya Rural Hospital | 3343 | 3785 | 3 | 6 | 13 | 2018 | 8.238801 |
| 3 | Newa Mpasazi Health Centre | 434 | 1295 | 1 | 7 | 16 | 2018 | 7.166266 |
| 4 | Mpepa /Chisinga Health Centre | 2616 | 2589 | 4 | 16 | 31 | 2018 | 7.859027 |
| 5 | Mnyanja Health Centre | 1715 | 3804 | 3 | 11 | 30 | 2018 | 8.243808 |
| 6 | Simlemba Health Centre | 1506 | 2496 | 3 | 10 | 23 | 2018 | 7.822445 |
| 7 | Ofesi Health Centre | 1773 | 2636 | 2 | 5 | 10 | 2018 | 7.877018 |
| 8 | Chulu Health Centre | 3330 | 2621 | 11 | 39 | 65 | 2018 | 7.871311 |
| 9 | Kapelula Health Centre | 3480 | 3420 | 9 | 27 | 51 | 2018 | 8.137396 |
| 10 | Livwezi Health Centre | 1128 | 2049 | 8 | 25 | 54 | 2018 | 7.625107 |
| 11 | Gogode Dispensary | 2550 | 1215 | 9 | 22 | 44 | 2018 | 7.102499 |
| 12 | Dwangwa Dispensary | 1216 | 3048 | 10 | 37 | 61 | 2018 | 8.022241 |
| 13 | Chamama Health Facility | 1226 | 1852 | 3 | 12 | 22 | 2018 | 7.524021 |
| 14 | Wimbe Health Centre | 3167 | 1074 | 20 | 54 | 80 | 2018 | 6.979145 |
| 15 | Chinyama | 1673 | 1194 | 14 | 33 | 48 | 2018 | 7.085064 |
| 16 | Mdunga Health Centre | 1894 | 1720 | 7 | 17 | 45 | 2018 | 7.450080 |
| 17 | Mtunthama Health Centre | 3358 | 1734 | 25 | 69 | 84 | 2018 | 7.458186 |
| 18 | Kasungu District Hospital | 12019 | 13377 | 19 | 50 | 66 | 2018 | 9.501292 |
| 19 | Chamwabvi Health Centre | 2079 | 3287 | 16 | 46 | 70 | 2018 | 8.097731 |
| 20 | Linyangwa Health Centre | 1500 | 1639 | 5 | 18 | 39 | 2018 | 7.401842 |
| 21 | Mziza Health Centre | 2291 | 4210 | 17 | 40 | 58 | 2018 | 8.345218 |
| 22 | Kawamba Health Centre | 3881 | 2386 | 33 | 62 | 84 | 2018 | 7.777374 |
| 23 | Kamboni Health Centre | 3250 | 1948 | 19 | 36 | 53 | 2018 | 7.574558 |
| 24 | Khola Health Centre | 1697 | 2108 | 5 | 14 | 28 | 2018 | 7.653495 |
| 25 | Santhe Health Centre | 6195 | 4096 | 6 | 21 | 36 | 2018 | 8.317766 |
| 26 | Mkhota Health Centre | 4218 | 2171 | 17 | 40 | 66 | 2018 | 7.682943 |
| 1 | Lodjwa Health Centre | 1168 | 909 | 10 | 3 | 10 | 2019 | 6.812345 |
| 2 | Nkhamenya Rural Hospital | 3932 | 3709 | 10 | 29 | 58 | 2019 | 8.218518 |
| 3 | Newa Mpasazi Health Centre | 626 | 1266 | 5 | 18 | 44 | 2019 | 7.143618 |
| 4 | Mpepa /Chisinga Health Centre | 4169 | 2523 | 3 | 11 | 19 | 2019 | 7.833204 |
| 5 | Mnyanja Health Centre | 2504 | 3751 | 7 | 17 | 36 | 2019 | 8.229778 |
| 6 | Simlemba Health Centre | 1788 | 2405 | 11 | 32 | 55 | 2019 | 7.785305 |
| 7 | Ofesi Health Centre | 2124 | 2576 | 5 | 20 | 36 | 2019 | 7.853993 |
| 8 | Chulu Health Centre | 3537 | 2547 | 27 | 60 | 86 | 2019 | 7.842671 |
| 9 | Kapelula Health Centre | 3357 | 3405 | 10 | 35 | 54 | 2019 | 8.133000 |
| 10 | Livwezi Health Centre | 435 | 1966 | 17 | 43 | 72 | 2019 | 7.583756 |
| 11 | Gogode Dispensary | 1469 | 1169 | 10 | 22 | 42 | 2019 | 7.063904 |
| 12 | Dwangwa Dispensary | 1370 | 2948 | 22 | 64 | 85 | 2019 | 7.988882 |
| 13 | Chamama Health Facility | 1127 | 1773 | 2 | 13 | 18 | 2019 | 7.480428 |
| 14 | Wimbe Health Centre | 2162 | 1016 | 21 | 54 | 81 | 2019 | 6.923629 |
| 15 | Chinyama | 1260 | 1154 | 14 | 35 | 48 | 2019 | 7.050989 |
| 16 | Mdunga Health Centre | 1485 | 1710 | 8 | 33 | 50 | 2019 | 7.444249 |
| 17 | Mtunthama Health Centre | 1718 | 1661 | 30 | 76 | 91 | 2019 | 7.415175 |
| 18 | Kasungu District Hospital | 13052 | 12942 | 25 | 58 | 90 | 2019 | 9.468233 |
| 19 | Chamwabvi Health Centre | 1180 | 3161 | 13 | 54 | 77 | 2019 | 8.058644 |
| 20 | Linyangwa Health Centre | 2692 | 1566 | 15 | 45 | 77 | 2019 | 7.356280 |
| 21 | Mziza Health Centre | 3135 | 4151 | 28 | 66 | 89 | 2019 | 8.331105 |
| 22 | Kawamba Health Centre | 3469 | 2258 | 42 | 78 | 94 | 2019 | 7.722235 |
| 23 | Kamboni Health Centre | 2537 | 1843 | 28 | 52 | 78 | 2019 | 7.519150 |
| 24 | Khola Health Centre | 2139 | 2040 | 10 | 19 | 34 | 2019 | 7.620705 |
| 25 | Santhe Health Centre | 5793 | 3957 | 24 | 58 | 81 | 2019 | 8.283241 |
| 26 | Mkhota Health Centre | 2268 | 2093 | 23 | 53 | 86 | 2019 | 7.646354 |
| 1 | Lodjwa Health Centre | 1788 | 1538 | 1 | 4 | 9 | 2020 | 7.338238 |
| 2 | Nkhamenya Rural Hospital | 8539 | 6313 | 5 | 20 | 42 | 2020 | 8.750366 |
| 3 | Newa Mpasazi Health Centre | 2182 | 2153 | 2 | 8 | 17 | 2020 | 7.674617 |
| 4 | Mpepa /Chisinga Health Centre | 5186 | 4270 | 0 | 0 | 0 | 2020 | 8.359369 |
| 5 | Mnyanja Health Centre | 6117 | 6426 | 1 | 2 | 5 | 2020 | 8.768108 |
| 6 | Simlemba Health Centre | 5310 | 4026 | 3 | 10 | 22 | 2020 | 8.300529 |
| 7 | Ofesi Health Centre | 2323 | 4379 | 0 | 3 | 2 | 2020 | 8.384576 |
| 8 | Chulu Health Centre | 7160 | 4308 | 13 | 36 | 54 | 2020 | 8.368229 |
| 9 | Kapelula Health Centre | 7297 | 5904 | 0 | 0 | 0 | 2020 | 8.683385 |
| 10 | Livwezi Health Centre | 1028 | 3267 | 7 | 17 | 38 | 2020 | 8.091627 |
| 11 | Gogode Dispensary | 2767 | 1961 | 7 | 20 | 40 | 2020 | 7.581210 |
| 12 | Dwangwa Dispensary | 2869 | 4971 | 9 | 34 | 58 | 2020 | 8.511376 |
| 13 | Chamama Health Facility | 635 | 2969 | 2 | 8 | 14 | 2020 | 7.995980 |
| 14 | Wimbe Health Centre | 2233 | 1689 | 21 | 53 | 80 | 2020 | 7.431892 |
| 15 | Chinyama | 1605 | 1936 | 11 | 28 | 45 | 2020 | 7.568379 |
| 16 | Mdunga Health Centre | 3169 | 2952 | 0 | 0 | 0 | 2020 | 7.990238 |
| 17 | Mtunthama Health Centre | 1882 | 2763 | 20 | 65 | 82 | 2020 | 7.924072 |
| 18 | Kasungu District Hospital | 19393 | 21785 | 19 | 40 | 62 | 2020 | 9.988977 |
| 19 | Chamwabvi Health Centre | 1128 | 5304 | 9 | 28 | 53 | 2020 | 8.576217 |
| 20 | Linyangwa Health Centre | 4380 | 2604 | 5 | 20 | 42 | 2020 | 7.864804 |
| 21 | Mziza Health Centre | 5791 | 7131 | 18 | 29 | 45 | 2020 | 8.872207 |
| 22 | Kawamba Health Centre | 7073 | 3771 | 19 | 42 | 64 | 2020 | 8.235095 |
| 23 | Kamboni Health Centre | 4665 | 3028 | 8 | 28 | 43 | 2020 | 8.015658 |
| 24 | Khola Health Centre | 3426 | 3453 | 4 | 13 | 21 | 2020 | 8.146999 |
| 25 | Santhe Health Centre | 6556 | 6667 | 5 | 17 | 31 | 2020 | 8.804925 |
| 26 | Mkhota Health Centre | 4592 | 3526 | 10 | 30 | 55 | 2020 | 8.167919 |
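As a quick check on the table, the SMR and the log offset can be recomputed from the observed and expected counts. The sketch below uses the first three 2017 rows; the object name `check` is illustrative.

```r
# First three 2017 rows copied from the model data table
check <- data.frame(
  health_facility = c("Lodjwa Health Centre",
                      "Nkhamenya Rural Hospital",
                      "Newa Mpasazi Health Centre"),
  observed_cases = c(564, 2720, 216),
  expected_cases = c(826, 3344, 1156)
)

# Standardised morbidity ratio: observed over expected cases
check$SMR <- check$observed_cases / check$expected_cases

# Natural log of expected cases, the offset used in the Poisson models
check$log_expected <- log(check$expected_cases)

check
```

An SMR below 1, as in all three rows here, means fewer dry season cases were observed than expected under the district-wide rate.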
# Model fitting
# 1km model --------------------------------------------------------------------
multivariate_1km <- glm(observed_cases~1+prop_pop_1km+year+offset(log(expected_cases)),
data = model_data, family = poisson(link = "log"))
#summary.glm(multivariate_1km)
report::report(multivariate_1km)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_1km, year and expected_cases (formula: observed_cases ~ 1 + prop_pop_1km + year + offset(log(expected_cases))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_1km = 0, year = 2017 and expected_cases = 0, is at -0.11 (95% CI [-0.12, -0.10], p < .001). Within this model:
##
## - The effect of prop_pop_1km is statistically significant and positive (beta = 0.01, 95% CI [0.01, 0.01], p < .001; Std. beta = -0.54, 95% CI [-0.55, -0.54])
## - The effect of year [2018] is statistically significant and negative (beta = -0.05, 95% CI [-0.06, -0.04], p < .001; Std. beta = -0.27, 95% CI [-0.29, -0.25])
## - The effect of year [2019] is statistically significant and negative (beta = -0.13, 95% CI [-0.14, -0.12], p < .001; Std. beta = 0.30, 95% CI [0.29, 0.32])
## - The effect of year [2020] is statistically significant and negative (beta = -0.01, 95% CI [-0.02, -4.60e-03], p = 0.004; Std. beta = -0.67, 95% CI [-0.68, -0.65])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
# Check effect of removing intercept.
# When you remove an intercept from a regression model, you’re setting
# it equal to 0 rather than estimating it from the data.
multivariate_1km_no_intercept <- glm(observed_cases~0+prop_pop_1km+year, # leaving the intercept out
offset = log(expected_cases),
data = model_data,
family = poisson(link = "log"))
#summary.glm(multivariate_1km_no_intercept)
report::report(multivariate_1km_no_intercept)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_1km and year (formula: observed_cases ~ 0 + prop_pop_1km + year). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_1km = 0 and year = 2017, is at (p ). Within this model:
##
## - The effect of prop_pop_1km is statistically significant and positive (beta = 0.01, 95% CI [0.01, 0.01], p < .001; Std. beta = -0.05, 95% CI [-0.06, -0.05])
## - The effect of year [2017] is statistically significant and negative (beta = -0.11, 95% CI [-0.12, -0.10], p < .001; Std. beta = 8.79, 95% CI [8.78, 8.80])
## - The effect of year [2018] is statistically significant and negative (beta = -0.16, 95% CI [-0.17, -0.15], p < .001; Std. beta = 8.65, 95% CI [8.64, 8.66])
## - The effect of year [2019] is statistically significant and negative (beta = -0.24, 95% CI [-0.25, -0.23], p < .001; Std. beta = 8.76, 95% CI [8.74, 8.77])
## - The effect of year [2020] is statistically significant and negative (beta = -0.12, 95% CI [-0.13, -0.12], p < .001; Std. beta = 8.57, 95% CI [8.57, 8.58])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(multivariate_1km, show.r2 = FALSE, show.aic = TRUE,
digits = 3, digits.re = 3)
| Predictors (outcome: observed_cases) | Incidence Rate Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.895 | 0.888 – 0.903 | <0.001 |
| prop_pop_1km | 1.013 | 1.013 – 1.014 | <0.001 |
| year [2018] | 0.955 | 0.945 – 0.966 | <0.001 |
| year [2019] | 0.877 | 0.867 – 0.887 | <0.001 |
| year [2020] | 0.986 | 0.977 – 0.995 | 0.004 |
| Observations | 104 | | |
| AIC | 58129.020 | | |
sjPlot::tab_model(multivariate_1km_no_intercept,
show.r2 = FALSE, show.aic = TRUE,
digits = 3, digits.re = 3)
| Predictors (outcome: observed_cases) | Incidence Rate Ratios | CI | p |
|---|---|---|---|
| prop_pop_1km | 1.013 | 1.013 – 1.014 | <0.001 |
| year [2017] | 0.895 | 0.888 – 0.903 | <0.001 |
| year [2018] | 0.855 | 0.847 – 0.863 | <0.001 |
| year [2019] | 0.785 | 0.777 – 0.794 | <0.001 |
| year [2020] | 0.883 | 0.876 – 0.889 | <0.001 |
| Observations | 104 | | |
| AIC | 58129.020 | | |
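The two 1km tables share the same AIC and the same prop_pop_1km estimate because, with a factor such as year in the model, dropping the intercept only reparameterises the fit: each year gets its own level instead of being expressed relative to 2017. The sketch below illustrates this on simulated data; the data frame `toy` and its columns are hypothetical, not the Kasungu data.

```r
set.seed(1)

# Simulated stand-in for the stacked catchment-year data (26 catchments x 4 years)
toy <- data.frame(
  year     = factor(rep(2017:2020, each = 26)),
  prop     = runif(104, 0, 40),
  expected = round(runif(104, 800, 4000))
)
toy$observed <- rpois(104, toy$expected * exp(-0.1 + 0.01 * toy$prop))

# Same model with and without an intercept
fit_int <- glm(observed ~ prop + year + offset(log(expected)),
               family = poisson(link = "log"), data = toy)
fit_noint <- glm(observed ~ 0 + prop + year + offset(log(expected)),
                 family = poisson(link = "log"), data = toy)

# Identical likelihood, hence identical AIC
c(with_intercept = AIC(fit_int), no_intercept = AIC(fit_noint))

# Year levels without intercept equal intercept + year contrast
b <- coef(fit_int)
rebuilt <- unname(b["(Intercept)"] + c(0, b["year2018"], b["year2019"], b["year2020"]))
direct  <- unname(coef(fit_noint)[c("year2017", "year2018", "year2019", "year2020")])
cbind(rebuilt, direct)
```

The same relationship holds in the tables on the rate-ratio scale: for 2018, 0.895 * 0.955 is about 0.855. Removing the intercept therefore changes how the year effects are reported, not how well the model fits.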
# Alternatively
# multivariate_1km_rate <- glm(observed_cases~prop_pop_1km+year+offset(log_expected),
# data = model_data, family = poisson(link = "log"))
# summary(multivariate_1km_rate)
# report::report(multivariate_1km_rate)
# 2km model --------------------------------------------------------------------
multivariate_2km <- glm(observed_cases~1+prop_pop_2km+year+offset(log(expected_cases)),
data = model_data, family = 'poisson')
# summary(multivariate_2km)
report::report(multivariate_2km)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_2km, year and expected_cases (formula: observed_cases ~ 1 + prop_pop_2km + year + offset(log(expected_cases))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_2km = 0, year = 2017 and expected_cases = 0, is at -0.11 (95% CI [-0.12, -0.11], p < .001). Within this model:
##
## - The effect of prop_pop_2km is statistically significant and positive (beta = 5.13e-03, 95% CI [4.93e-03, 5.33e-03], p < .001; Std. beta = -0.52, 95% CI [-0.52, -0.51])
## - The effect of year [2018] is statistically significant and negative (beta = -0.05, 95% CI [-0.06, -0.04], p < .001; Std. beta = -0.10, 95% CI [-0.12, -0.09])
## - The effect of year [2019] is statistically significant and negative (beta = -0.12, 95% CI [-0.13, -0.11], p < .001; Std. beta = 0.36, 95% CI [0.35, 0.38])
## - The effect of year [2020] is statistically non-significant and negative (beta = -8.47e-03, 95% CI [-0.02, 1.11e-03], p = 0.083; Std. beta = -0.60, 95% CI [-0.61, -0.59])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
# Check effect of removing intercept
multivariate_2km_no_intercept <- glm(observed_cases~prop_pop_2km+year-1,
offset = log(expected_cases),
data = model_data,
family = poisson(link = "log"))
# summary(multivariate_2km_no_intercept)
report::report(multivariate_2km_no_intercept)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_2km and year (formula: observed_cases ~ prop_pop_2km + year - 1). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_2km = 0 and year = 2017, is at (p ). Within this model:
##
## - The effect of prop_pop_2km is statistically significant and positive (beta = 5.13e-03, 95% CI [4.93e-03, 5.33e-03], p < .001; Std. beta = -0.02, 95% CI [-0.03, -0.02])
## - The effect of year [2017] is statistically significant and negative (beta = -0.11, 95% CI [-0.12, -0.11], p < .001; Std. beta = 8.78, 95% CI [8.77, 8.79])
## - The effect of year [2018] is statistically significant and negative (beta = -0.16, 95% CI [-0.17, -0.15], p < .001; Std. beta = 8.64, 95% CI [8.63, 8.65])
## - The effect of year [2019] is statistically significant and negative (beta = -0.23, 95% CI [-0.25, -0.22], p < .001; Std. beta = 8.72, 95% CI [8.71, 8.74])
## - The effect of year [2020] is statistically significant and negative (beta = -0.12, 95% CI [-0.13, -0.12], p < .001; Std. beta = 8.58, 95% CI [8.57, 8.58])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(multivariate_2km,
show.r2 = FALSE, show.aic = TRUE,
digits = 3, digits.re = 3)
| Predictors (outcome: observed_cases) | Incidence Rate Ratios | CI | p |
|---|---|---|---|
| (Intercept) | 0.892 | 0.884 – 0.900 | <0.001 |
| prop_pop_2km | 1.005 | 1.005 – 1.005 | <0.001 |
| year [2018] | 0.953 | 0.942 – 0.963 | <0.001 |
| year [2019] | 0.887 | 0.877 – 0.898 | <0.001 |
| year [2020] | 0.992 | 0.982 – 1.001 | 0.083 |
| Observations | 104 | | |
| AIC | 59186.646 | | |
sjPlot::tab_model(multivariate_2km_no_intercept,
show.r2 = FALSE, show.aic = TRUE,
digits = 3, digits.re = 3)
| Predictors (outcome: observed_cases) | Incidence Rate Ratios | CI | p |
|---|---|---|---|
| prop_pop_2km | 1.005 | 1.005 – 1.005 | <0.001 |
| year [2017] | 0.892 | 0.884 – 0.900 | <0.001 |
| year [2018] | 0.850 | 0.842 – 0.858 | <0.001 |
| year [2019] | 0.792 | 0.782 – 0.801 | <0.001 |
| year [2020] | 0.885 | 0.878 – 0.891 | <0.001 |
| Observations | 104 | | |
| AIC | 59186.646 | | |
# 3km model --------------------------------------------------------------------
multivariate_3km <- glm(observed_cases~prop_pop_3km+year+offset(log(expected_cases)),
data = model_data, family = 'poisson')
# summary(multivariate_3km)
report::report(multivariate_3km)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_3km, year and expected_cases (formula: observed_cases ~ prop_pop_3km + year + offset(log(expected_cases))). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_3km = 0, year = 2017 and expected_cases = 0, is at -0.13 (95% CI [-0.14, -0.12], p < .001). Within this model:
##
## - The effect of prop_pop_3km is statistically significant and positive (beta = 3.60e-03, 95% CI [3.45e-03, 3.75e-03], p < .001; Std. beta = -0.45, 95% CI [-0.46, -0.45])
## - The effect of year [2018] is statistically significant and negative (beta = -0.04, 95% CI [-0.05, -0.03], p < .001; Std. beta = -0.17, 95% CI [-0.18, -0.15])
## - The effect of year [2019] is statistically significant and negative (beta = -0.11, 95% CI [-0.12, -0.10], p < .001; Std. beta = 0.41, 95% CI [0.39, 0.42])
## - The effect of year [2020] is statistically non-significant and negative (beta = -7.82e-03, 95% CI [-0.02, 1.75e-03], p = 0.109; Std. beta = -0.55, 95% CI [-0.57, -0.54])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
In this combined dataset, the residual deviance is far larger than the residual degrees of freedom: the dispersion estimate (residual deviance divided by residual degrees of freedom) is, e.g., 576.9798 (57121/99), which is large. This indicates over-dispersion and a poor fit, that is, a substantial difference between fitted values and observed values, meaning that there is extra variance not accounted for by the model. A consequence is that the Poisson standard errors are too small, so the p-values reported for these models overstate the strength of the evidence. One way to deal with over-dispersion is to run a quasi-Poisson model, which fits an extra dispersion parameter to account for that extra variance. In the Poisson fits, the year coefficients are negative relative to 2017 (highly significant for 2018 and 2019, p < .001), so yearly variation influences observed cases negatively, whereas the percentage of people living close to dams influences observed cases positively.
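The over-dispersion check and the quasi-Poisson remedy can be sketched on simulated data. The example below is illustrative only: the data frame `toy` is hypothetical, with counts drawn from a negative binomial distribution so that they are over-dispersed by construction. Note that the ratio used in the text is based on the residual deviance, whereas the quasi-Poisson dispersion reported by summary() is based on the Pearson chi-square; the two are usually close but not identical.

```r
set.seed(42)
n <- 104

# Simulated over-dispersed counts (negative binomial around a log-linear mean)
toy <- data.frame(
  prop     = runif(n, 0, 40),
  expected = round(runif(n, 800, 4000))
)
toy$observed <- rnbinom(n, mu = toy$expected * exp(-0.1 + 0.01 * toy$prop), size = 5)

# Poisson fit and the deviance-based dispersion check used in the text
fit_pois <- glm(observed ~ prop + offset(log(expected)),
                family = poisson(link = "log"), data = toy)
deviance_ratio <- deviance(fit_pois) / df.residual(fit_pois)
deviance_ratio   # values far above 1 indicate over-dispersion

# Quasi-Poisson fit: same coefficients, standard errors scaled by sqrt(dispersion)
fit_quasi <- glm(observed ~ prop + offset(log(expected)),
                 family = quasipoisson(link = "log"), data = toy)
pearson_dispersion <- summary(fit_quasi)$dispersion

se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
cbind(se_pois, se_quasi, ratio = se_quasi / se_pois)
```

The point estimates, and hence the incidence rate ratios, are unchanged by the quasi-Poisson fit; only the standard errors, confidence intervals and p-values widen to reflect the extra variance.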
# Check effect of removing intercept
multivariate_3km_no_intercept <- glm(observed_cases~prop_pop_3km+year-1,
offset = log(expected_cases),
data = model_data,
family = poisson(link = "log"))
# summary(multivariate_3km_no_intercept)
report::report(multivariate_3km_no_intercept)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_3km and year (formula: observed_cases ~ prop_pop_3km + year - 1). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_3km = 0 and year = 2017, is at (p ). Within this model:
##
## - The effect of prop_pop_3km is statistically significant and positive (beta = 3.60e-03, 95% CI [3.45e-03, 3.75e-03], p < .001; Std. beta = -0.02, 95% CI [-0.03, -0.02])
## - The effect of year [2017] is statistically significant and negative (beta = -0.13, 95% CI [-0.14, -0.12], p < .001; Std. beta = 8.78, 95% CI [8.77, 8.79])
## - The effect of year [2018] is statistically significant and negative (beta = -0.18, 95% CI [-0.19, -0.17], p < .001; Std. beta = 8.64, 95% CI [8.63, 8.65])
## - The effect of year [2019] is statistically significant and negative (beta = -0.25, 95% CI [-0.26, -0.23], p < .001; Std. beta = 8.72, 95% CI [8.71, 8.74])
## - The effect of year [2020] is statistically significant and negative (beta = -0.14, 95% CI [-0.15, -0.13], p < .001; Std. beta = 8.58, 95% CI [8.57, 8.58])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
sjPlot::tab_model(multivariate_3km,
show.r2 = FALSE, show.aic = TRUE,
digits = 3, digits.re = 3)
| observed_cases | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.875 | 0.866 – 0.883 | <0.001 |
| prop_pop_3km | 1.004 | 1.003 – 1.004 | <0.001 |
| year [2018] | 0.957 | 0.947 – 0.968 | <0.001 |
| year [2019] | 0.893 | 0.883 – 0.904 | <0.001 |
| year [2020] | 0.992 | 0.983 – 1.002 | 0.109 |
| Observations | 104 | ||
| AIC | 59540.390 | ||
sjPlot::tab_model(multivariate_3km_no_intercept,
show.r2 = FALSE, show.aic = TRUE,
digits = 3, digits.re = 3)
| observed_cases | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| prop_pop_3km | 1.004 | 1.003 – 1.004 | <0.001 |
| year [2017] | 0.875 | 0.866 – 0.883 | <0.001 |
| year [2018] | 0.837 | 0.829 – 0.846 | <0.001 |
| year [2019] | 0.781 | 0.771 – 0.791 | <0.001 |
| year [2020] | 0.868 | 0.861 – 0.875 | <0.001 |
| Observations | 104 | ||
| AIC | 59540.390 | ||
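The two tables above report the same AIC (59540.390) because dropping the intercept only relabels the year effects: each year then gets its own level instead of a contrast against 2017 (for example, 0.875 × 0.957 ≈ 0.837 for 2018). A minimal sketch of this equivalence on simulated data follows; the data frame `sim` and its columns are hypothetical, not the study data.

```r
set.seed(2)
n <- 120
sim <- data.frame(
  year = factor(rep(2017:2020, each = n / 4)),
  prop_pop = runif(n, 0, 100),
  expected = runif(n, 500, 2000)
)
sim$observed <- rpois(n, sim$expected * exp(-0.1 + 0.004 * sim$prop_pop))

with_intercept <- glm(observed ~ prop_pop + year + offset(log(expected)),
                      data = sim, family = poisson(link = "log"))
no_intercept <- glm(observed ~ prop_pop + year - 1 + offset(log(expected)),
                    data = sim, family = poisson(link = "log"))

# Same fit, different labelling of the year effects
all.equal(fitted(with_intercept), fitted(no_intercept))
all.equal(AIC(with_intercept), AIC(no_intercept))

# The 2018 level without an intercept equals intercept + 2018 contrast
level_2018 <- unname(coef(no_intercept)["year2018"])
contrast_2018 <- unname(coef(with_intercept)["(Intercept)"] +
                          coef(with_intercept)["year2018"])
c(level_2018 = level_2018, contrast_2018 = contrast_2018)
```

Because the fits are identical, removing the intercept changes how the year coefficients and their p-values are read, not how well the model describes the data.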
# Build regression model table -------------------------------------------------
regression_table_1km <- gtsummary::tbl_regression(multivariate_1km,
exponentiate = FALSE) |>
gtsummary::bold_p()
regression_table_2km <- gtsummary::tbl_regression(multivariate_2km) |>
gtsummary::bold_p()
regression_table_3km <- gtsummary::tbl_regression(multivariate_3km,
exponentiate = FALSE) |>
gtsummary::bold_p()
# merge regression model tables ------------------------------------------------
table_merge <- gtsummary::tbl_merge(
tbls = list(
regression_table_1km,
regression_table_2km,
regression_table_3km
),
tab_spanner = c("**Model 1km**",
"**Model 2km**",
"**Model 3km**")
)
table_merge
| Characteristic | Model 1km: log(IRR)¹ | Model 1km: 95% CI¹ | Model 1km: p-value | Model 2km: log(IRR)¹ | Model 2km: 95% CI¹ | Model 2km: p-value | Model 3km: log(IRR)¹ | Model 3km: 95% CI¹ | Model 3km: p-value |
|---|---|---|---|---|---|---|---|---|---|
| prop_pop_1km | 0.01 | 0.01, 0.01 | <0.001 | | | | | | |
| year | | | | | | | | | |
| 2017 | — | — | | — | — | | — | — | |
| 2018 | -0.05 | -0.06, -0.04 | <0.001 | -0.05 | -0.06, -0.04 | <0.001 | -0.04 | -0.05, -0.03 | <0.001 |
| 2019 | -0.13 | -0.14, -0.12 | <0.001 | -0.12 | -0.13, -0.11 | <0.001 | -0.11 | -0.12, -0.10 | <0.001 |
| 2020 | -0.01 | -0.02, 0.00 | 0.004 | -0.01 | -0.02, 0.00 | 0.083 | -0.01 | -0.02, 0.00 | 0.11 |
| prop_pop_2km | | | | 0.01 | 0.00, 0.01 | <0.001 | | | |
| prop_pop_3km | | | | | | | 0.00 | 0.00, 0.00 | <0.001 |

¹ IRR = Incidence Rate Ratio, CI = Confidence Interval
# Check model performance ------------------------------------------------------
performance::check_model(multivariate_1km, theme = "see::theme_modern")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
performance::check_model(multivariate_2km, theme = "see::theme_modern")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
performance::check_model(multivariate_3km, theme = "see::theme_modern")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
performance::compare_performance(
multivariate_1km, multivariate_2km, multivariate_3km,
metrics = c("AIC", "BIC", "RMSE", "Sigma"), rank = TRUE)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | BIC | RMSE | Sigma | Performance-Score
## ----------------------------------------------------------------------------------------
## multivariate_1km | glm | 58129.020 | 58142.242 | 1372.801 | 24.020 | 75.00%
## multivariate_2km | glm | 59186.646 | 59199.868 | 1352.076 | 24.242 | 43.44%
## multivariate_3km | glm | 59540.390 | 59553.611 | 1351.802 | 24.315 | 25.00%
Here we fit quasi-Poisson and negative binomial regression models to the multivariate model, which we have seen suffers from overdispersion under ordinary Poisson regression. One approach to overdispersion is to model the response with a negative binomial distribution instead of a Poisson distribution. This introduces a second parameter in addition to \(\lambda\), which gives the model more flexibility, and, unlike the quasi-Poisson model, the negative binomial model assumes an explicit likelihood.
Mathematically, the negative binomial can be expressed as a Poisson model in which \(\lambda\) is itself random and follows a gamma distribution. Specifically, if \(Y \mid \lambda \sim \text{Poisson}(\lambda)\) and \(\lambda \sim \text{gamma}(r, \frac{1-p}{p})\), then \(Y \sim \text{NegBinom}(r, p)\), where \(E(Y) = \frac{pr}{1-p} = \mu\) and \(Var(Y) = \frac{pr}{(1-p)^2} = \mu + \frac{\mu^2}{r}\). The overdispersion in this case is given by \(\frac{\mu^2}{r}\), which approaches \(0\) as \(r\) increases (so smaller values of \(r\) indicate greater overdispersion).
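The gamma-Poisson construction can be checked by simulation. The sketch below assumes the second gamma parameter, \(\frac{1-p}{p}\), is a rate (which is what makes the stated mean and variance come out), and the values of `r` and `p` are arbitrary illustrative choices.

```r
set.seed(3)
r <- 2
p <- 0.8
n <- 200000

# lambda ~ gamma with shape r and rate (1 - p) / p (assumed parameterisation),
# then Y | lambda ~ Poisson(lambda)
lambda <- rgamma(n, shape = r, rate = (1 - p) / p)
y <- rpois(n, lambda)

mu <- p * r / (1 - p)               # theoretical mean, 8 for these values
theoretical_var <- mu + mu^2 / r    # theoretical variance, 40 for these values

c(sample_mean = mean(y), theoretical_mean = mu)
c(sample_var = var(y), theoretical_var = theoretical_var)
```

With these values the variance is five times the mean, whereas a Poisson distribution would force the two to be equal.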
In a quasi-Poisson model, the variance is multiplied by a dispersion parameter \(\phi\), so the standard errors are larger than the likelihood approach would imply: \(SE_Q(\hat{\beta}) = \sqrt{\hat{\phi}} \cdot SE(\hat{\beta})\), where \(Q\) stands for “quasi-Poisson”, since multiplying variances by \(\phi\) is an ad-hoc solution (Roback and Legler, 2021).
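The relationship between Poisson and quasi-Poisson standard errors can be verified directly in R. This is a minimal sketch on simulated counts; `x` and `y` are hypothetical and unrelated to the study data.

```r
set.seed(4)
n <- 200
x <- runif(n, 0, 100)
y <- rnbinom(n, size = 2, mu = exp(5 + 0.004 * x))  # overdispersed counts

fit_poisson <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

phi_hat <- summary(fit_quasi)$dispersion
se_poisson <- coef(summary(fit_poisson))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]

# Point estimates agree; quasi-Poisson SEs are sqrt(phi_hat) times larger
all.equal(coef(fit_poisson), coef(fit_quasi))
all.equal(se_quasi, sqrt(phi_hat) * se_poisson)
```

This is why the quasi-Poisson fits report the same coefficients as the Poisson fits but much wider confidence intervals and larger p-values.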
# Account for overdispersion using quasi-Poisson model--------------------------
quassipoisson_1km <- glm(observed_cases~prop_pop_1km+year, family = quasipoisson,
offset = log(expected_cases), data = model_data)
# summary(quassipoisson_1km)
report::report(quassipoisson_1km)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_1km and year (formula: observed_cases ~ prop_pop_1km + year). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_1km = 0 and year = 2017, is at -0.11 (95% CI [-0.32, 0.09], t(99) = -1.06, p = 0.290). Within this model:
##
## - The effect of prop_pop_1km is statistically significant and positive (beta = 0.01, 95% CI [2.87e-03, 0.02], t(99) = 2.52, p = 0.013; Std. beta = -0.05, 95% CI [-1.22, 1.13])
## - The effect of year [2018] is statistically non-significant and negative (beta = -0.05, 95% CI [-0.30, 0.21], t(99) = -0.35, p = 0.725; Std. beta = -0.14, 95% CI [-3.86, 3.87])
## - The effect of year [2019] is statistically non-significant and negative (beta = -0.13, 95% CI [-0.41, 0.14], t(99) = -0.93, p = 0.352; Std. beta = -0.03, 95% CI [-3.88, 4.07])
## - The effect of year [2020] is statistically non-significant and negative (beta = -0.01, 95% CI [-0.24, 0.22], t(99) = -0.12, p = 0.903; Std. beta = -0.21, 95% CI [-2.65, 3.61])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
# In the absence of overdispersion, we expect the dispersion parameter estimate to be 1.0.
# The estimated dispersion parameter here is much larger than 1.0 (594.1947) indicating
# overdispersion (extra variance) that should be accounted for.
sjPlot::tab_model(quassipoisson_1km, show.r2 = FALSE,
digits = 3, digits.re = 3)
| observed_cases | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 0.895 | 0.726 – 1.092 | 0.290 |
| prop_pop_1km | 1.013 | 1.003 – 1.024 | 0.013 |
| year [2018] | 0.955 | 0.741 – 1.234 | 0.725 |
| year [2019] | 0.877 | 0.666 – 1.155 | 0.352 |
| year [2020] | 0.986 | 0.787 – 1.242 | 0.903 |
| Observations | 104 | ||
quassipoisson_2km <- glm(observed_cases~prop_pop_2km+year, family = quasipoisson,
offset = log(expected_cases), data = model_data)
# summary(quassipoisson_2km)
report::report(quassipoisson_2km)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_2km and year (formula: observed_cases ~ prop_pop_2km + year). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_2km = 0 and year = 2017, is at -0.11 (95% CI [-0.33, 0.10], t(99) = -1.04, p = 0.299). Within this model:
##
## - The effect of prop_pop_2km is statistically significant and positive (beta = 5.13e-03, 95% CI [3.28e-04, 9.94e-03], t(99) = 2.09, p = 0.039; Std. beta = -0.02, 95% CI [-1.19, 1.28])
## - The effect of year [2018] is statistically non-significant and negative (beta = -0.05, 95% CI [-0.31, 0.21], t(99) = -0.37, p = 0.713; Std. beta = -0.14, 95% CI [-3.82, 3.82])
## - The effect of year [2019] is statistically non-significant and negative (beta = -0.12, 95% CI [-0.40, 0.16], t(99) = -0.84, p = 0.404; Std. beta = -0.06, 95% CI [-3.93, 4.04])
## - The effect of year [2020] is statistically non-significant and negative (beta = -8.47e-03, 95% CI [-0.23, 0.22], t(99) = -0.07, p = 0.942; Std. beta = -0.20, 95% CI [-2.61, 3.55])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
quassipoisson_3km <- glm(observed_cases~prop_pop_3km+year, family = quasipoisson,
offset = log(expected_cases), data = model_data)
# summary(quassipoisson_3km)
report::report(quassipoisson_3km)
## We fitted a poisson model (estimated using ML) to predict observed_cases with prop_pop_3km and year (formula: observed_cases ~ prop_pop_3km + year). The model's explanatory power is substantial (Nagelkerke's R2 = 1.00). The model's intercept, corresponding to prop_pop_3km = 0 and year = 2017, is at -0.13 (95% CI [-0.37, 0.09], t(99) = -1.14, p = 0.259). Within this model:
##
## - The effect of prop_pop_3km is statistically non-significant and positive (beta = 3.60e-03, 95% CI [-4.06e-05, 7.29e-03], t(99) = 1.93, p = 0.057; Std. beta = -0.02, 95% CI [-1.12, 1.29])
## - The effect of year [2018] is statistically non-significant and negative (beta = -0.04, 95% CI [-0.30, 0.21], t(99) = -0.33, p = 0.740; Std. beta = -0.14, 95% CI [-3.81, 3.80])
## - The effect of year [2019] is statistically non-significant and negative (beta = -0.11, 95% CI [-0.39, 0.17], t(99) = -0.79, p = 0.431; Std. beta = -0.06, 95% CI [-3.94, 4.05])
## - The effect of year [2020] is statistically non-significant and negative (beta = -7.82e-03, 95% CI [-0.23, 0.22], t(99) = -0.07, p = 0.947; Std. beta = -0.20, 95% CI [-2.61, 3.55])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
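# Account for overdispersion using negative binomial model----------------------
# Note: the second argument is unnamed, so glm.nb() matches it to `weights`
# (see the Call in the summary below); it is not applied as an offset here.
# To use it as an offset, add `+ offset(log(expected_cases))` to the formula.
binomial_1km <- MASS::glm.nb(observed_cases~prop_pop_1km+year,
offset(log(expected_cases)), data = model_data)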
summary(binomial_1km)
##
## Call:
## MASS::glm.nb(formula = observed_cases ~ prop_pop_1km + year,
## data = model_data, weights = offset(log(expected_cases)),
## init.theta = 2.331719086, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.5122 -2.2287 -1.0880 0.8793 8.1660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.515872 0.050701 148.240 < 2e-16 ***
## prop_pop_1km 0.040085 0.002934 13.663 < 2e-16 ***
## year2018 -0.019164 0.066298 -0.289 0.772541
## year2019 -0.249465 0.071120 -3.508 0.000452 ***
## year2020 0.621513 0.064556 9.628 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.3317) family taken to be 1)
##
## Null deviance: 1179.62 on 103 degrees of freedom
## Residual deviance: 872.72 on 99 degrees of freedom
## AIC: 14469
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.332
## Std. Err.: 0.108
##
## 2 x log-likelihood: -14456.809
sjPlot::tab_model(binomial_1km, show.r2 = FALSE,
digits = 3, digits.re = 3)
| observed_cases | |||
|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p |
| (Intercept) | 1836.969 | 1659.547 – 2038.241 | <0.001 |
| prop_pop_1km | 1.041 | 1.035 – 1.047 | <0.001 |
| year [2018] | 0.981 | 0.862 – 1.116 | 0.773 |
| year [2019] | 0.779 | 0.681 – 0.892 | <0.001 |
| year [2020] | 1.862 | 1.640 – 2.113 | <0.001 |
| Observations | 104 | ||
binomial_2km <- MASS::glm.nb(observed_cases~prop_pop_2km+year,
offset(log(expected_cases)), data = model_data)
summary(binomial_2km)
##
## Call:
## MASS::glm.nb(formula = observed_cases ~ prop_pop_2km + year,
## data = model_data, weights = offset(log(expected_cases)),
## init.theta = 2.245157282, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.4895 -2.1614 -1.0243 0.6682 8.8453
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.485606 0.053867 138.964 < 2e-16 ***
## prop_pop_2km 0.016519 0.001315 12.562 < 2e-16 ***
## year2018 -0.033563 0.067624 -0.496 0.61967
## year2019 -0.236145 0.072419 -3.261 0.00111 **
## year2020 0.629752 0.065788 9.572 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.2452) family taken to be 1)
##
## Null deviance: 1135.87 on 103 degrees of freedom
## Residual deviance: 874.84 on 99 degrees of freedom
## AIC: 14504
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.245
## Std. Err.: 0.104
##
## 2 x log-likelihood: -14491.952
sjPlot::tab_model(binomial_2km, show.r2 = FALSE,
digits = 3, digits.re = 3)
binomial_3km <- MASS::glm.nb(observed_cases~prop_pop_3km+year,
offset(log(expected_cases)), data = model_data)
summary(binomial_3km)
##
## Call:
## MASS::glm.nb(formula = observed_cases ~ prop_pop_3km + year,
## data = model_data, weights = offset(log(expected_cases)),
## init.theta = 2.216971935, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.4199 -2.1948 -1.0507 0.5095 9.0198
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.424712 0.058284 127.389 < 2e-16 ***
## prop_pop_3km 0.011655 0.001001 11.639 < 2e-16 ***
## year2018 -0.029366 0.068234 -0.430 0.66693
## year2019 -0.234897 0.072969 -3.219 0.00129 **
## year2020 0.632333 0.066192 9.553 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.217) family taken to be 1)
##
## Null deviance: 1121.62 on 103 degrees of freedom
## Residual deviance: 875.57 on 99 degrees of freedom
## AIC: 14516
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.217
## Std. Err.: 0.103
##
## 2 x log-likelihood: -14503.732
sjPlot::tab_model(binomial_3km, show.r2 = FALSE,
digits = 3, digits.re = 3)
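The `Call:` lines in the negative binomial summaries above show `weights = offset(log(expected_cases))`, and the intercepts are near 7.4 to 7.5 rather than near zero, which indicates that the expected cases were used as weights rather than as an offset. A minimal sketch, on simulated data, of how the offset can be placed inside the formula instead; the data frame `sim` and its columns are hypothetical stand-ins for `model_data`.

```r
library(MASS)

set.seed(5)
n <- 200
sim <- data.frame(
  prop_pop = runif(n, 0, 100),
  expected = runif(n, 500, 2000)
)
sim$observed <- rnbinom(n, size = 2,
                        mu = sim$expected * exp(-0.1 + 0.004 * sim$prop_pop))

# Offset inside the formula: coefficients are on the log(observed / expected) scale
nb_offset <- glm.nb(observed ~ prop_pop + offset(log(expected)), data = sim)

# Same model without any offset, for comparison
nb_no_offset <- glm.nb(observed ~ prop_pop, data = sim)

coef(nb_offset)     # intercept should be near the simulated -0.1
coef(nb_no_offset)  # intercept is on the scale of log counts instead
```

With the offset in the formula, the exponentiated coefficients can be read as rate ratios relative to the expected cases, which is the scale used by the Poisson and quasi-Poisson models.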
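Because our dataset is small, splitting it into separate training and test sets would increase bias, so we use leave-one-out cross-validation (LOOCV), which uses every data point to measure the performance of the Poisson model. The method works as follows (James et al., 2014): each observation is left out in turn, the model is fitted to the remaining observations, the left-out observation is predicted and its error recorded, and the recorded errors are averaged over all observations to give the overall prediction error.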
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated.
# Define training control
train_control <- caret::trainControl(method = "LOOCV")
# Removing missing values
model_data_evaluation <- model_data[complete.cases(model_data),]
# Train the model, fit a regression model and use LOOCV to evaluate performance
train_model_1km <- caret::train(
observed_cases~1+prop_pop_1km+year+offset(log(expected_cases)),
data = model_data_evaluation, method = "glm",
trControl = train_control, family = "poisson")
train_model_2km <- caret::train(
observed_cases~1+prop_pop_2km+year+offset(log(expected_cases)),
data = model_data_evaluation, method = "glm",
trControl = train_control, family = "poisson")
train_model_3km <- caret::train(
observed_cases~1+prop_pop_3km+year+offset(log(expected_cases)),
data = model_data_evaluation, method = "glm",
trControl = train_control, family = "poisson")
# Summarize the results
# The lower the Mean Absolute Error, the more closely a model can predict the actual observations
print(train_model_1km)
## Generalized Linear Model
##
## 104 samples
## 3 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 103, 103, 103, 103, 103, 103, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2854.871 0.07951775 1774.594
summary(train_model_1km)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -70.91 -25.76 -11.04 10.69 138.63
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.5596403 0.0042883 1762.873 < 2e-16 ***
## prop_pop_1km 0.0345465 0.0002076 166.381 < 2e-16 ***
## year2018 -0.0241893 0.0054887 -4.407 1.05e-05 ***
## year2019 -0.2740150 0.0059470 -46.076 < 2e-16 ***
## year2020 0.5769079 0.0048867 118.056 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 201942 on 103 degrees of freedom
## Residual deviance: 154281 on 99 degrees of freedom
## AIC: 155289
##
## Number of Fisher Scoring iterations: 5
print(train_model_2km)
## Generalized Linear Model
##
## 104 samples
## 3 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 103, 103, 103, 103, 103, 103, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2920.994 0.04763413 1780.845
summary(train_model_2km)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -80.94 -26.36 -11.11 11.99 148.88
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.549e+00 4.478e-03 1685.834 <2e-16 ***
## prop_pop_2km 1.288e-02 9.391e-05 137.173 <2e-16 ***
## year2018 3.048e-03 5.472e-03 0.557 0.578
## year2019 -1.985e-01 5.853e-03 -33.913 <2e-16 ***
## year2020 5.813e-01 4.887e-03 118.952 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 201942 on 103 degrees of freedom
## Residual deviance: 162246 on 99 degrees of freedom
## AIC: 163254
##
## Number of Fisher Scoring iterations: 5
print(train_model_3km)
## Generalized Linear Model
##
## 104 samples
## 3 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 103, 103, 103, 103, 103, 103, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2904.301 0.05092564 1766.576
summary(train_model_3km)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -73.324 -25.390 -11.284 9.963 149.735
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.464e+00 4.870e-03 1532.691 <2e-16 ***
## prop_pop_3km 9.929e-03 7.484e-05 132.678 <2e-16 ***
## year2018 -3.717e-03 5.477e-03 -0.679 0.497
## year2019 -1.965e-01 5.848e-03 -33.596 <2e-16 ***
## year2020 5.834e-01 4.887e-03 119.391 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 201942 on 103 degrees of freedom
## Residual deviance: 162692 on 99 degrees of freedom
## AIC: 163699
##
## Number of Fisher Scoring iterations: 5
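The caret fits above report intercepts near 7.5 rather than the roughly -0.1 of the earlier offset models, which suggests that the `offset()` term may not have been carried through `train()`'s formula interface. As a cross-check, LOOCV can be written by hand in base R so that the offset is kept in every refit. The sketch below uses simulated data; `sim` and its columns are hypothetical, and the loop is an illustration of the technique rather than a replacement for the results shown.

```r
set.seed(6)
n <- 60
sim <- data.frame(
  year = factor(rep(2017:2020, each = n / 4)),
  prop_pop = runif(n, 0, 100),
  expected = runif(n, 500, 2000)
)
sim$observed <- rpois(n, sim$expected * exp(-0.1 + 0.004 * sim$prop_pop))

# Hold out each row in turn, refit on the rest, and predict the held-out count
loocv_pred <- vapply(seq_len(nrow(sim)), function(i) {
  fit <- glm(observed ~ prop_pop + year + offset(log(expected)),
             data = sim[-i, ], family = poisson(link = "log"))
  # predict() re-evaluates offset(log(expected)) on the held-out row
  predict(fit, newdata = sim[i, , drop = FALSE], type = "response")
}, numeric(1))

errors <- sim$observed - loocv_pred
c(RMSE = sqrt(mean(errors^2)), MAE = mean(abs(errors)))
```

Replacing `sim` with the complete-case data and the covariate with `prop_pop_1km`, `prop_pop_2km` or `prop_pop_3km` would give RMSE and MAE values that can be compared across buffer distances on the same footing as the offset models.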