Introduction

The Water Crisis in Bangladesh

In rural Bangladesh, many wells contain water with unsafe levels of naturally occurring arsenic. This contamination is a major problem in the country, increasing the population’s risk of cancer and other diseases (Gelman & Hill, 2007). While some may question why Bangladeshis rely on wells for their water, other sources in the area, such as surface water, carry high levels of microbial contamination, making them unsafe to drink. In fact, the wells were drilled to provide a safe water source after the surface water was found to be detrimental to the health of the population (Gelman et al., 2004). Households continue to use these wells because many factors, such as the difficulty of relocating, prevent them from finding a different source of water. This study examines factors that may influence a household’s decision to switch to an alternate well that may provide safer water.

The Data

The following analysis uses data from Araihazar upazila, a rural area of Bangladesh where both safe and unsafe wells can be found. A well is considered “safe” if its arsenic level is below 0.5 hundreds of micrograms per liter, and “unsafe” if its arsenic level is above this threshold (Gelman & Hill, 2007). The data set contains one response variable and four explanatory variables, as well as a constant, for \(n=3020\) households.

The constant variable, \(z_{i,1}\), is used to ensure the model has an intercept. This variable has a value of 1 for each household. The four explanatory variables are as follows:

  • \(z_{i,2}\) - the arsenic level (in hundreds of micrograms per liter) of the well
  • \(z_{i,3}\) - the distance (in meters) to the nearest “safe” well
  • \(z_{i,4}\) - whether or not any member of the household is active in a community organization (a binary variable with 1 indicating “yes” and 0 indicating “no”)
  • \(z_{i,5}\) - the education level of the most educated household member

The information explained above is organized into an \(n \times 5\) matrix \(Z\) whose \(i\)th row is \(z_i^\top = (z_{i,1}, \ldots, z_{i,5})\).

Additionally, the response variable indicating whether the household switched wells is organized into a vector \(y=(Y_1, \ldots, Y_n)^\top,\) where \(Y_i = 1\) indicates a switch, while \(Y_i = 0\) indicates no switch.

The data will be brought in with the following code.

wells <- read.table(file=url("https://michaeljauch.github.io/STA4102Data/wells.txt"), sep=" ", header=TRUE)
n <- nrow(wells)
y <- wells[,c("switch")]
ones <- rep(1, n)
Z <- as.matrix(cbind(ones, wells[, c("arsenic", "dist", "assoc","educ")]))

The Analysis Process

Using the data explained above, a logistic regression analysis will be performed to examine the relationships between the predictor variables (\(z_{i,2}\) through \(z_{i,5}\)) and the response \(Y_i\). For this problem, the logistic regression model can be defined as

\[\begin{align*} Y_i \vert z_i &\sim \text{Bernoulli}(\pi_i) \\ \pi_i &= \frac{1}{1 + e^{-z_i^\top\beta}} = \frac{e^{z_i^\top\beta}}{1 + e^{z_i^\top\beta}} \end{align*}\]

for each \(i = 1, \ldots, n\) and an unknown parameter vector \(\beta = (\beta_1, \ldots, \beta_5)^\top\) of regression coefficients.
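To make the link function concrete, the short sketch below evaluates \(\pi_i\) for a single hypothetical household. The covariate values and the coefficient vector are made up for illustration and are not estimates from the wells data; `inv_logit` is a helper name introduced here.

```r
# Inverse logit: maps a linear predictor to a probability in (0, 1)
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical household: intercept, arsenic, dist, assoc, educ (illustrative values only)
z_i  <- c(1, 1.5, 40, 0, 5)
# Hypothetical coefficients (not fitted values)
beta <- c(0, 0.5, -0.01, -0.1, 0.05)

eta  <- sum(z_i * beta)   # linear predictor z_i' beta
pi_i <- inv_logit(eta)    # P(Y_i = 1 | z_i)

# The two forms of the formula agree, and both match R's built-in plogis()
all.equal(pi_i, exp(eta) / (1 + exp(eta)))
all.equal(pi_i, plogis(eta))
```

In practice `plogis()` is usually the safer choice, since it avoids overflow for linear predictors of large magnitude.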

Possible Questions

When running this analysis, there are some questions that should be considered, such as:

  • What variables seem to impact a household’s decision to switch wells?
  • Do any variables seem to be correlated with the others?
  • Do any variables appear more important in impacting a household’s decision to switch wells?

Exploratory Analysis

A summary of the data set can be found below. This includes summaries for all predictor variables, as well as the response variable.

##      switch          arsenic           dist             assoc       
##  Min.   :0.0000   Min.   :0.510   Min.   :  0.387   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.820   1st Qu.: 21.117   1st Qu.:0.0000  
##  Median :1.0000   Median :1.300   Median : 36.761   Median :0.0000  
##  Mean   :0.5752   Mean   :1.657   Mean   : 48.332   Mean   :0.4228  
##  3rd Qu.:1.0000   3rd Qu.:2.200   3rd Qu.: 64.041   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :9.650   Max.   :339.531   Max.   :1.0000  
##       educ       
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 5.000  
##  Mean   : 4.828  
##  3rd Qu.: 8.000  
##  Max.   :17.000

Based on the summaries above, the average distance to a safe well is 48.332 meters, and more than half of the households investigated decided to switch wells (57.52%). The average level of arsenic in these wells is 1.657 hundreds of micrograms per liter, which is well above the “unsafe” threshold. Furthermore, fewer than half of the households have members active in a community organization (42.28%), and the average education level in the households is 4.828 years. These statistics are all important to consider when examining why (or why not) a household chooses to switch wells.

Below are visualizations of this information for the numerical variables (arsenic, distance, and education).

Model Fitting

To estimate the parameters of the logistic regression model described earlier, we first need the log likelihood function, which is

\[\begin{align*} l(\beta)=y^\top{Z}\beta-b^\top1_n \end{align*}\]

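where \(b\) is the \(n\)-vector with entries \(b_i = \log(1 + e^{z_i^\top\beta})\) and \(1_n\) is the \(n\)-vector of ones.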
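For readers who want the intermediate steps, the log likelihood follows from the Bernoulli mass function, using \(\log\{\pi_i/(1-\pi_i)\} = z_i^\top\beta\) and \(\log(1-\pi_i) = -\log(1+e^{z_i^\top\beta})\):

\[\begin{align*} l(\beta) &= \sum_{i=1}^n \left[ Y_i \log \pi_i + (1-Y_i)\log(1-\pi_i) \right] \\ &= \sum_{i=1}^n \left[ Y_i \log\frac{\pi_i}{1-\pi_i} + \log(1-\pi_i) \right] \\ &= \sum_{i=1}^n \left[ Y_i z_i^\top\beta - \log\left(1+e^{z_i^\top\beta}\right) \right] \end{align*}\]

The first term sums to \(y^\top Z\beta\) and the second to the total of the \(\log(1+e^{z_i^\top\beta})\) terms, which gives the matrix form of \(l(\beta)\).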

To maximize this log likelihood, we need its first and second derivatives, that is, its gradient and Hessian matrix. These are

\[\begin{align*} l'(\beta)=Z^\top(y-\pi) \end{align*}\]

and

\[\begin{align*} l''(\beta)=-Z^\top{WZ} \end{align*}\]

where \(\pi = (\pi_1, \ldots, \pi_n)^\top\) and \(W\) is a diagonal matrix with its \(i^{th}\) diagonal entry equal to \(\pi_i(1-\pi_i)\). Using this information, the updating equation for Newton’s Method is

\[\begin{align*} \beta^{(t+1)}=\beta^{(t)}+(Z^\top{W^{(t)}Z})^{-1}Z^\top(y-\pi^{(t)}) \end{align*}\]
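Before iterating, the gradient and Hessian formulas can be checked numerically. The sketch below does this on a small simulated data set (not the wells data) by comparing the analytic gradient with central finite differences of the log likelihood. The names `Z_sim`, `y_sim`, `loglik`, `grad`, and `hess` are introduced here for illustration only.

```r
set.seed(1)  # simulated data, used only to exercise the formulas

# Small synthetic design matrix: intercept column plus two covariates
n_sim     <- 200
Z_sim     <- cbind(1, rnorm(n_sim), rnorm(n_sim))
beta_true <- c(-0.5, 1, 0.5)                        # arbitrary illustrative values
y_sim     <- rbinom(n_sim, 1, plogis(drop(Z_sim %*% beta_true)))

# Log likelihood, gradient, and Hessian in vectorized form
loglik <- function(beta) {
  eta <- drop(Z_sim %*% beta)
  sum(y_sim * eta - log(1 + exp(eta)))
}
grad <- function(beta) drop(t(Z_sim) %*% (y_sim - plogis(drop(Z_sim %*% beta))))
hess <- function(beta) {
  p <- plogis(drop(Z_sim %*% beta))
  -t(Z_sim) %*% (Z_sim * (p * (1 - p)))             # -Z' W Z without forming diag(w)
}

# Central finite differences of the log likelihood at an arbitrary point
beta0 <- c(0.1, -0.2, 0.3)
h <- 1e-6
fd_grad <- sapply(seq_along(beta0), function(j) {
  e <- replace(numeric(length(beta0)), j, h)
  (loglik(beta0 + e) - loglik(beta0 - e)) / (2 * h)
})

max(abs(fd_grad - grad(beta0)))                     # should be close to zero

# All eigenvalues negative means the log likelihood is concave at beta0
all(eigen(hess(beta0), symmetric = TRUE)$values < 0)
```

A negative definite Hessian is what makes the Newton step an ascent direction here, which is why the method tends to behave well for logistic regression when \(Z\) has full column rank.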

Using this updating equation, we will use Newton’s Method to find the maximum likelihood estimates of the regression coefficients. The code below shows this process, implementing the log likelihood, gradient, Hessian matrix, and updating equation.

## Initial Values
x = c(0, 0, 0, 0, 0)
itr = 10
x.values = matrix(0,itr+1,5)
x.values[1,] = x

## Log Likelihood Function
g = function(beta, y, Z){
  n <- length(y)
  ones <- rep(1, n)
  b <- rep(0, n)
  for(i in 1:n){
    b[i] <- log(1 + exp(Z[i,] %*% beta))
  }
  return(y %*% Z %*% beta - b %*% ones)
}

## Gradient
g.prime = function(beta, y, Z){
  n <- length(y)
  pi_vec <- rep(0, n)
  for(i in 1:n){
    pi_vec[i] <- 1/(1 + exp(-Z[i,] %*% beta))
  }
  return(t(Z) %*% (y - pi_vec))
}

## Hessian Matrix
g.2prime=function(beta, y, Z){
  n <- length(y) 
  pi_vec <- rep(0, n)
  w <- rep(0, n) 
  for(i in 1:n){
    pi_vec[i] <- 1/(1 + exp(-Z[i,] %*% beta))
    w[i] <- pi_vec[i]*(1-pi_vec[i])
  }
  return(-t(Z) %*% diag(w) %*% Z)
}

## Updating Equation and Newton's Method
for(i in 1:itr){
  x = x - solve(g.2prime(x, y, Z))%*%g.prime(x, y, Z)
  x.values[i+1,] = x
}

## Estimated Values
x
##                 [,1]
## ones    -0.156711657
## arsenic  0.467021588
## dist    -0.008961102
## assoc   -0.124299982
## educ     0.042446614
## X-Values for Each Iteration
x.values
##              [,1]      [,2]         [,3]       [,4]       [,5]
##  [1,]  0.00000000 0.0000000  0.000000000  0.0000000 0.00000000
##  [2,] -0.08292061 0.3801041 -0.007907759 -0.1149084 0.03822372
##  [3,] -0.15256787 0.4621707 -0.008910053 -0.1239865 0.04228893
##  [4,] -0.15669849 0.4670060 -0.008960949 -0.1242992 0.04244621
##  [5,] -0.15671166 0.4670216 -0.008961102 -0.1243000 0.04244661
##  [6,] -0.15671166 0.4670216 -0.008961102 -0.1243000 0.04244661
##  [7,] -0.15671166 0.4670216 -0.008961102 -0.1243000 0.04244661
##  [8,] -0.15671166 0.4670216 -0.008961102 -0.1243000 0.04244661
##  [9,] -0.15671166 0.4670216 -0.008961102 -0.1243000 0.04244661
## [10,] -0.15671166 0.4670216 -0.008961102 -0.1243000 0.04244661
## [11,] -0.15671166 0.4670216 -0.008961102 -0.1243000 0.04244661

The parameter estimates above describe the direction and size of each predictor’s relationship with the log odds that a household switches to a safe well. Distance and involvement in community organizations have negative coefficients, whereas arsenic level and years of education have positive coefficients, with arsenic having the largest coefficient in magnitude. Because the predictors are measured on different scales, the raw magnitudes should be compared with care: the distance coefficient, for example, is the change in log odds per single meter. Furthermore, Newton’s Method has converged, since the values are unchanged to the displayed precision from iteration 4 onward.
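A fixed count of 10 iterations works here, but convergence can also be detected automatically and the result cross-checked against R’s built-in fitter. The sketch below is one way to do that on simulated data (not the wells data). The function `newton_logistic`, its `tol` and `max_iter` arguments, and the simulated inputs are all assumptions made for illustration.

```r
set.seed(42)  # simulated data; the wells data are not used here

n_sim <- 500
Z_sim <- cbind(ones = 1, x1 = rnorm(n_sim), x2 = rnorm(n_sim))
y_sim <- rbinom(n_sim, 1, plogis(drop(Z_sim %*% c(-0.3, 0.8, -0.5))))

# Newton's method with an explicit stopping rule instead of a fixed iteration count
newton_logistic <- function(y, Z, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(Z))
  for (iter in seq_len(max_iter)) {
    p    <- plogis(drop(Z %*% beta))             # fitted probabilities pi^(t)
    grad <- drop(t(Z) %*% (y - p))               # Z'(y - pi)
    info <- t(Z) %*% (Z * (p * (1 - p)))         # Z' W Z
    step <- solve(info, grad)                    # solves (Z'WZ) step = Z'(y - pi)
    beta <- beta + step
    if (max(abs(step)) < tol) break              # stop once the update is negligible
  }
  list(coefficients = beta, iterations = iter)
}

fit_newton <- newton_logistic(y_sim, Z_sim)

# "- 1" drops glm's own intercept because Z_sim already has a column of ones
fit_glm <- glm(y_sim ~ Z_sim - 1, family = binomial)

fit_newton$iterations
max(abs(unname(fit_newton$coefficients) - unname(coef(fit_glm))))  # should be near zero
```

Calling `solve(info, grad)` solves the linear system directly rather than forming the inverse matrix, which is generally the more stable choice. The same comparison could be run on `y` and `Z` from the wells data.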

Interpretation

Below are charts illustrating the relationships between the numeric predictor variables and a household’s decision to switch wells. While each chart shows a variable’s individual influence on the response, interactions between the predictors may also have been a deciding factor.

As the chart above shows, the arsenic level of a household’s current well had a strong influence on whether the household switched. The predicted probability of switching is nearly 40% even when the arsenic level is 0. Since water is considered “unsafe” above 0.5 hundreds of micrograms per liter, and the lowest arsenic level in this data set is 0.51, that baseline lies outside the observed range and should not be over-read. Many households in this data have an arsenic level far beyond the “unsafe” threshold, as the chart shows, and the probability of switching rises steadily as the arsenic level increases.

The chart above shows the negative relationship between a household’s probability of switching to a safe well and its distance from that well. This makes sense: the farther a household is from a safe well, the less feasible it is to use it, even if doing so might improve the household’s health. While some may look at this chart and think households are being unreasonable in not traveling to a safe well that is mere meters away, they must also haul their containers of water back to the house in order to use it. So, it truly is favorable to have a safe well closer to the house.

Lastly, this chart reveals the positive effect that the household’s education level has on the probability of switching to a safe well. While there is still roughly a 50% chance that a household with no years of education makes the switch, the amount of education does influence this decision. This may be because more educated households know what ingesting arsenic does to the body and want to protect their families.
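The headline numbers quoted for the charts can be approximately reproduced from the fitted coefficients. The sketch below assumes that each curve holds the other predictors at their sample means from the summary table. The source does not state how the charts were constructed, so this is an assumption, and `predict_at` is a helper introduced here.

```r
# Fitted coefficients as reported in the Newton's Method output
beta_hat <- c(ones = -0.156711657, arsenic = 0.467021588, dist = -0.008961102,
              assoc = -0.124299982, educ = 0.042446614)

# Sample means from the summary table (assumed reference values for the other predictors)
z_mean <- c(ones = 1, arsenic = 1.657, dist = 48.332, assoc = 0.4228, educ = 4.828)

# Predicted switching probability when one predictor is set to `value`
# and all others stay at their means
predict_at <- function(name, value) {
  z <- z_mean
  z[name] <- value
  unname(plogis(sum(z * beta_hat)))
}

p_arsenic0 <- predict_at("arsenic", 0)   # roughly 0.39; extrapolated below the observed minimum of 0.51
p_educ0    <- predict_at("educ", 0)      # roughly 0.53

# Probability of switching at several distances (in meters) to the nearest safe well
sapply(c(0, 50, 100, 200, 300), function(d) predict_at("dist", d))
```

Under this assumption the values line up with the “nearly 40%” and “50%” figures read from the charts.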

Conclusion

In conclusion, a household is more likely to switch to a nearby “safe” well if the arsenic level of its current water is high, the “safe” well is not far away, no member is active in a community organization, and/or its members are more highly educated. Among the estimated coefficients, the arsenic level of the well has the largest magnitude at 0.467 and distance the smallest at -0.009. The strength of arsenic makes sense since this factor directly affects the health of the household, raising its risk of cancer and other diseases. The small distance coefficient is less obvious until one recalls that distance is measured in meters, so the coefficient is the change in log odds for a single additional meter. The manpower needed to transport containers to a “safe” well, gather water, and transport it back to the house each day is exhausting, and it grows with every meter.
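The effect of measurement units can be made concrete by re-expressing the two coefficients on more comparable scales. The sketch below uses the reported estimates and the quartiles from the summary table. The choice of a 100-meter step and of the interquartile range as a common yardstick are assumptions made here for illustration, not part of the original analysis.

```r
# Reported coefficients for the two predictors being compared
beta_hat <- c(arsenic = 0.467021588, dist = -0.008961102)

# Odds ratio for a one-unit increase in each predictor
# (one unit = 100 micrograms per liter of arsenic, or 1 meter of distance)
exp(beta_hat)

# Odds ratio for a 100-meter increase in distance
or_100m <- unname(exp(100 * beta_hat["dist"]))
or_100m

# Change in log odds across each predictor's interquartile range
iqr <- c(arsenic = 2.200 - 0.820, dist = 64.041 - 21.117)
iqr_effect <- beta_hat * iqr
iqr_effect
```

Under these assumptions, an extra 100 meters multiplies the odds of switching by about 0.41, and across the interquartile range the two predictors shift the log odds by amounts of the same order of magnitude (about 0.64 for arsenic and about -0.38 for distance).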

For public health as a whole, this analysis shows that the availability of “safer” options does not mean that people have easy access to them. The proportion of households that did not switch to a “safe” well (42.48%) is alarming, considering there is most likely a “safe” well within 300 meters of them. Therefore, authorities should look into a more easily accessible source of water that Bangladeshis can use in order to protect their health.

While we only looked at a few factors that play a role in a household’s decision to find a “safe” well, there are many others that could be examined. For instance, average household age, number of family members in the household, and household income could all reveal relationships that are important to consider. Furthermore, this analysis only looked at one area of Bangladesh. Surrounding areas could (a) offer assistance to their suffering neighbors or (b) also be suffering, which would make the problem even more severe. This water crisis is an important issue, and a solution is urgently needed.