Part 1 - Water Quality Impacts on Health

For the first part of this lab, you will attempt to answer a simple question: how would the burden of disease be different for the United States if household drinking water quality was similar to the water quality in Boulder Creek? To answer this question, you will need to break it down into several sub-questions, including:

What is the current burden of disease associated with drinking water illnesses in the United States?
How would the burden of disease change based on the quality of water in Boulder Creek?
1. How many people would be affected by the impacted drinking water quality?
2. How much water do these people drink on average each day?
3. How many bacteria would they be ingesting, and how many of them would get sick as a result of ingesting more pathogens?
4. What would be the total health impact, expressed in disability adjusted life years (DALYs)?

In other words, you will be conducting a quantitative microbial risk assessment (QMRA) to estimate the increased burden of disease associated with reduced drinking-water quality. We recommend that you use the following resources to aid in your investigation:

Havelaar, A.H., & Melse, J.M. (2003) Quantifying public health risk in the WHO Guidelines for drinking-water quality: A burden of disease approach. National Institute for Public Health and the Environment of the Netherlands
- A discussion paper on the concepts and methodology of disability-adjusted life years (DALYs) as a common public health metric and its usefulness for drinking-water quality. http://www.who.int/water_sanitation_health/publications/quantifyinghealthrisks/en/
Brown, J., & Clasen, T. (2012). High Adherence Is Necessary to Realize Health Gains from Water Quality Interventions. PLoS ONE, 7(5), e36735. http://doi.org/10.1371/journal.pone.0036735
- A description of a QRMA model used to estimate the level of adherence necessary to significantly increase DALYs using different levels of contamination and water treatment efficacies.

1. Current Burden of Disease

To determine the current burden of disease from unsafe water, you’ll want to check out the Global Burden of Disease (GBD) Visualization at the Institute of Health Metrics and Evaluation (IHME): https://vizhub.healthdata.org/gbd-compare/.

Select the second icon on the left-hand side (“Risks by Cause”), select level four on the slider, and select “United States” for location. Then select “Unsafe water, sanitation and hygiene” on the right-hand side. Find “Unsafe water” in the rankings. What are the total number of DALYs, the percent of total DALYs, and the DALY rate (number of DALYs per 100,000)?

Now navigate to the arrow diagram showing the risks for each cause and make sure that level four is still selected on the slider and United States is selected for location. What are the total DALYs, the percent of total DALYs, the DALY rate, and the cause ranking for unsafe water in the United States? For ranking, change the Category back to “All Risk Factors” and look at the plot.

Now do the same thing (determine the total DALYs, the percent of total DALYs, the DALY rate, and the cause ranking) for Sub-Saharan Africa in the most recent year available.

Screenshot of IHME Visualization Dashboard

Answers - To check your work - results from 2019 to confirm you are accessing the data correctly. Repeat for most recent year available.

Enteric infections from unsafe water sources for United States in 2019:

14,503 DALYs
0.013% of total DALYs
4.42 DALYs per 100,000 (0.7 to 11.76)

Diarrheal diseases from unsafe water sources for Sub-Saharan Africa in 2019:

36 million DALYs
7.05% of total DALYs
3,338 DALYs per 100,000

2a. Population

What was the population of the United States in the most recent year available?

See: http://ghdx.healthdata.org/geography/united-states-america

Intro to Monte Carlo Methods -

https://en.wikipedia.org/wiki/Monte_Carlo_method

https://www.investopedia.com/terms/m/montecarlosimulation.asp

library(MonteCarlo)
library(data.table)

population = 327978730

2b. Water Consumption

How much water do people drink on average per day? Should the amount of water consumed be averaged or simulated? What are the benefits of conducting a simulation versus a single estimate based on average values?

If you’re not familiar with generating random numbers or writing a function in R, here’s an example to simulate the approximate number of liters that a person could drink each day in a year. See if you can generate a data table that randomly samples the amount of water 100,000 people would drink in a year. Hint: you might try the function replicate.

# function to randomly selected between 1, 2, and 5 liters of water per person per day
generate_water = function(n=1) {
  sample(c(1,2,5),n,replace = TRUE)
}

generate_water(365)

# answer
dt_water = data.table(replicate(365,generate_water(100000)))

2c. Dose-Response

How many organisms would people be ingesting on average per liter of water? What is the probability of infection for the number of microbes ingested? What is the probability of illness given infection?

Here are two different ways to randomly select the number of organisms present in a liter of water: one samples from a log-normal distribution and the other samples from an empirical distribution based on water quality testing of Boulder Creek. Make sure that you can justify which one that you use, as well as the parameters used (e.g., the mean for the log-normal distribution). How do your two data tables compare?

# generate number of organisms consumed empirically (based on log-normal distribution)
organisms_lnorm <- function(n = 1, mean = 1) {
  
  rlnorm(n, meanlog = mean, sdlog = mean)
}

# generate number of organisms consumed empirically (based on E. coli water testing)
organisms_empirical <- function(n = 1) {
 
  sample(c(14,75,9,18,14,22,8,14,6,12,26,6,1,2,8,4),n,replace = TRUE) * 10
}

# now create two new data tables with the number of organism per liter for 100,000 people for both distributions
dt_lnorm <- data.table(replicate(365,organisms_lnorm(100000,1)))
dt_empirical <- data.table(replicate(365,organisms_empirical(100000)))
dt_lnorm
dt_empirical

You now have two data table that represent the number of organisms 100,000 people would ingest in each liter of water for a year. Pick the one that you think is more accurate. Now you need to determine each person’s daily dose of organisms, their probability of infection, and their probability of illness.

# number of microbes ingested per day
dose = dt_water * dt_lnorm

# probabiliy density function for campylobacter infection
beta_poisson = function(dose, alpha = 0.145, n50 = 896) {
  
  1 - ( 1 + (dose / n50) * ( 2^(1/alpha) - 1 ) )^(-alpha)
}

# generate probability of daily infection (beta-poisson function)
p_inf_daily <- beta_poisson(dose)
  
# generate probability of illness given infection (0.3 for campylobacter)
p_illness_inf = 0.3
p_illness <- p_inf_daily * p_illness_inf

2d. Health Impact

You should have a large matrix with the risk of daily infection due to ingesting different amounts of campylobacter bacteria over the course of one year for 100,000 individuals. Now we will use the disease burden associated with campylobacteriosis (try saying that five times really fast) to cacluate the total number of disability adjusted life years associated with increased exposure to campylobacter bacteria. To do this you have to make assumptions (that can be justified) about the proportion of the population that is susceptible to infection and the disease burden associated with campylobacteriosis.

# percent of the population susceptible to infection (campylobacter)
susceptible = 1.00

# disease burden per case (campylobacter)
db_case = 4.6e-3

# generate DALYs per year
DALYs_year = p_illness * db_case * susceptible
  
# generate total DALYs per 100,000
DALYs_total <- sum(DALYs_year)

DALYs_total

Simulation of Disease Burden

Now you have the total disease burden associated with increased exposure to campylobacter. However, there are two problems. First, we assumed that a person could have campylobacteriosis every day, whereas in reality infections and illnesses have average durations that are baked into the disease burden per case estimates. Second, we made assumptions about the number of campylobacter bacteria present in each liter of water when in reality we’re not sure how many bacteria might be present. Also, for some illnesses the disease burden per case isn’t a fixed number. As a result, we need to re-run the experiment with multiple repetitions to generate a confidence interval for our DALY estimates.

In order to simulate the disease burden associated with increased exposure to pathogens, we will conduct a Monte Carlo simulation. I’ve opted to use the MonteCarlo package (https://www.r-bloggers.com/introducing-the-montecarlo-package/), but you can feel free to use another method as you wish.

Included below is the code for for campylobacter, rotavirus, shinga toxin-producing E. coli, and cryptosporidium.

Your responsibilty is to understand how this code functions, so that you can describe the code and results in your lab report.

Campylobacter, Rotavirus, E. Coli, Cryptosporidium

# function to calculate disease burden
disease_burden = function(people, meanlog, alpha, n50, p_illness_inf, susceptible, db_case){
  
  # generate water consumed
  water_consumed <- sample(c(1,2,5),people,replace = TRUE)
  
  # generate number of organism consumed from log-normal distribution
  number_of_organisms <- rlnorm(people, meanlog = meanlog)
  
  # generate daily dose
  dose <- water_consumed * number_of_organisms
  
  # generate probability of daily infection (beta-poisson function)
  p_inf_daily <- 1 - ( 1 + (dose / n50) * ( 2^(1/alpha) - 1 ) )^(-alpha)
  
  # generate probability of illness
  p_illness <- p_inf_daily * p_illness_inf
  
  # generate yearly infection risk
  p_year = 1 - (1 - p_illness)^365
  
  # generate DALYs per year
  DALYs_year = p_year * db_case * susceptible
  
  # generate total DALYs per 100,000
  DALYs_total <- sum(DALYs_year)
  
  # returns value for MonteCarlo function
  return(list("DALYs"=DALYs_total))
  
}

# people = 100000
# water = 5
# meanlog = 1
# alpha = 0.145
# n50 = 896
# p_illness_inf = 0.3
# susceptible = 1.00
# db_case = 4.6e-3

# parameter list for campylobacter
param_campylobacter <- list("people" = 100000,
                            "meanlog" = c(1, 0.1, 0.01, 0.001, 0.0001),
                            "alpha" = 0.145,
                            "n50" = 896,
                            "p_illness_inf" = 0.3,
                            "susceptible" = 1.00,
                            "db_case" = 4.6e-3)

# parameter list for rotavirus
param_rotavirus <- list("people" = 100000,
                        "meanlog" = c(1, 0.1, 0.01, 0.001, 0.0001),
                        "alpha" = 0.253,
                        "n50" = 6.17,
                        "p_illness_inf" = 0.5,
                        "susceptible" = 0.06,
                        "db_case" = c(0.2, 0.3, 0.4, 0.5))

# parameter list for E. coli
param_ecoli <- list("people" = 100000,
                    "meanlog" = c(1, 0.1, 0.01, 0.001, 0.0001),
                    "alpha" = 0.16,
                    "n50" = 2.11e6,
                    "p_illness_inf" = 1.00,
                    "susceptible" = 1.00,
                    "db_case" = 5.47e-2)

# parameter list for cryptosporidium
param_crypto <- list("people" = 100000,
                     "meanlog" = c(0.0001, 0.001, 0.01, 0.1, 1),
                    "alpha" = 0.16,
                    "n50" = 2.11e6,
                     "p_illness_inf" = 0.7,
                     "susceptible" = 1.00,
                     "db_case" = 1.5e-3)

# function to run the simulation
crypto <- MonteCarlo(func=disease_burden, nrep=1000, param_list = param_crypto)

# summary of results
summary(crypto$results$DALYs)

# function to run the simulation
campylobacter <- MonteCarlo(func=disease_burden, nrep=1000, param_list = param_campylobacter)

# summary of results
summary(campylobacter$results$DALYs)

# function to run the simulation
rotavirus <- MonteCarlo(func=disease_burden, nrep=1000, param_list = param_rotavirus)

# summary of results
summary(rotavirus$results$DALYs)

# function to run the simulation
ecoli <- MonteCarlo(func=disease_burden, nrep=1000, param_list = param_ecoli)

# summary of results
summary(ecoli$results$DALYs)

You should now have estimates of the total DALYs due to infections from campylobacter, rotavirus, shinga toxin-producing E. coli, and cryptosporidium. How does these estimates compare with the first DALY estimates from 2a. through 2d. above? What different assumptions were made, and how did that affect the outcome?

Part 2 - Air Quality Impacts on Health

In the previous part of this lab you were able to conduct a quantitative microbial risk assessment to estimate the change in disease burden resulting from reduced water quality. In the second part of this lab you will conduct a similar exercise guided by a simple question: how does disease burden change with varying levels of air quality?

To unpack this question you will explore the cost-effectiveness of different technologies in different countries.

Cost-Effectiveness Scenarios

To begin you will use the Household Air Polluation Intervention Tool (HAPIT) to compare the relative impact of different technologies based on a variety of user-determined parameters. In particular, you will attempt to determine the most cost-effective intervention (in terms of dollars per averted disability adjusted life year) for three countries: Ethiopia, India, and a country of your choosing. You will use the following inputs (see the ASSUMPTIONS section below for applicable values):

For the pre-intervention exposure use the median 24-hour PM2.5 concentrations in the kitchen for wood from Balakrishnan et al., 2013
For the post-intervention exposure use the scenarios provided in the ASSUMPTIONS section below as well as the concentrations from our lab experiments for the rocket and LPG stoves. Assume default standard devation for non-lab scenarios.
Assume there are 25,000 targeted households
Assume 60% adoption for all interventions
Assume a 5 year lifespan for the Ecozoom Dura and LPG stoves
Use default values for all other parameters

For this first exercise you will use the public-facing version of HAPIT, developed by Ajay Pillarisetti, located here: https://hapit.shinyapps.io/HAPIT/. In your final lab report you will need to present your results as figures with a table of compiled results in the appendix. For example, for your first figure you could create a stacked clustered bar chart, where the stack would represent Total DALYs and Total Deaths, the horizontal axis would be the scenarios, and each cluster would represent the three countries.

To estimate the cost-effectiveness of these interventions, refer to the WHO-CHOICE tool - Here is a critique that summarizes the information you need for this part of the exercise:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4339959/

“In recent years, the most common approach has involved the use of thresholds based on per capita gross domestic product (GDP). Under this approach – which has been promoted by the World Health Organization’s Choosing Interventions that are Cost–Effective (WHO-CHOICE) project – an intervention that, per disability-adjusted life-year (DALY) avoided, costs less than three times the national annual GDP per capita is considered cost–effective, whereas one that costs less than once the national annual GDP per capita is considered highly cost–effective.”

Based on this approach, what is the cost effectiveness of the interventions you modeled with HAPIT for each of your contexts? In addition you will need to address the following questions in the discussion:

Based on a thorough review of the HAPIT model (including Chapter 10 of Broken Promises and the information referenced in the Info tab), how do you account for the differences between each country when all the other inputs were held constant? (i.e., what are the assumptions that are baked into HAPIT’s calculations and how do they affect the health estimates?)
Do you agree with the results? Which assumptions seem the least plausible?
Can you imagine using this as a policy tool for selecting the most cost-effective air quality intervention? Why or why not?
What are the ethical implications of the WHO-CHOICE approach?

Assumptions

Three-Stone Fire: Pre-Intervention Concentration (three-stone fire): \(386 \mu g/m^3\) using wood cooking fuel from Balakrishnan et al., 2013

Rocket Stove: Post-Intervention Concentration (rocket stove): \(216 \mu g/m^3\) using wood cooking fuel from HAPIT

LPG Stove: Post-Intervention Concentration (LPG stove): \(54 \mu g/m^3\) using LPG cooking fuel from HAPIT

Lab Assignment

You will need to compose a formal lab report (that includes part one and part two) not exceeding five pages (excluding appendices). This can be submitted as an R Notebook (PDF version of the main report less than 5 pages) or as a PDF. You will need to include a brief introduction, a methods section, results, discussion, and a brief conclusion. See the syllabus and grading rubric for more a more detailed description.

In particular, you should address the following questions:

For Water Quality impacts:

How does the newly calculated disease burden assuming water quality similar to Boulder Creek compare to the disease burden reflected in IHME? How does it compare with other causes of disease burden in the United States (https://vizhub.healthdata.org/gbd-compare/)? How about in Low SDI countries?
Is disease burden per case the same regardless of context? Justify your answer.

For Air Quality Impacts:

Would the disease burden from cooking with rocket stoves be greater than the current disease burden associated with smoking in the United States?
How does the disease burden compare with other causes of disease burden in the United States (https://vizhub.healthdata.org/gbd-compare/)? How about in Low SDI countries?

For Both:

Simulations require a lot of assumptions. Comment on which assumptions felt dubious (or that you weren’t able to justify from the literature) and which assumptions most significantly influenced your results. How much faith do you have in your results?
Why is it important to conduct a Monte Carlo simulation instead of just using average values?

Global Burden of Disease Lab