Research Question 1: Is there a correlation between the level of arsenic contamination in the household’s original well and the distance of a households to the closest known safe well?

Research Question 2: Is there an association between households switching to another well from an unsafe well and participation of members in the community organizations?


Data

data()
data(package = .packages(all.available=TRUE))
#install.packages("carData")
library(carData)
mydata <- force(Wells)
head(mydata)
##   switch arsenic distance education association
## 1    yes    2.36   16.826         0          no
## 2    yes    0.71   47.322         0          no
## 3     no    2.07   20.967        10          no
## 4    yes    1.15   21.486        12          no
## 5    yes    1.10   40.874        14         yes
## 6    yes    3.90   69.518         9         yes

Data on whether or not households in Bangladesh changed the wells that they were using.

nrow(mydata)
## [1] 3020


Explanation of data

  • Unit of observation: a household in Bangladesh
  • Sample size: 3020
  • Variables:
    • Switch: whether or not the household switched to another well from an unsafe well (categorical variable; yes and no)
    • Arsenic: the level of arsenic contamination in the household’s original well (numerical variable; unit of measurement-hundreds of micrograms per liter)
    • Distance: distance of a households to the closest known safe well (numerical variable; unit of measurement-meter)
    • Education: education of the head of the household (numerical variable; unit of measurement-year)
    • Association: whether or not any members of the household participated in any community organizations (categorical variable; yes and no)

*Note:

  • For the first given research question, the only needed variables are “arsenic” and “distance”.

  • For the second given research question, the only needed variables are “switch” and “association”.


Source


Data manipulation and Descriptive statistics

Converting categorical variables into factors
mydata$switchF <- factor(mydata$switch,
                         levels = c("no", "yes"),
                         labels = c ("NoSw" , "YesSw"))

mydata$associationF <- factor(mydata$association,
                         levels = c("no", "yes"),
                         labels = c ("NoAs" , "YesAs"))
head(mydata,3)
##   switch arsenic distance education association switchF associationF
## 1    yes    2.36   16.826         0          no   YesSw         NoAs
## 2    yes    0.71   47.322         0          no   YesSw         NoAs
## 3     no    2.07   20.967        10          no    NoSw         NoAs


Checking for missing values
any(is.na(mydata))
## [1] FALSE

There are no missing values.


Statistical description
summary(mydata)
##  switch        arsenic         distance         education     
##  no :1283   Min.   :0.510   Min.   :  0.387   Min.   : 0.000  
##  yes:1737   1st Qu.:0.820   1st Qu.: 21.117   1st Qu.: 0.000  
##             Median :1.300   Median : 36.761   Median : 5.000  
##             Mean   :1.657   Mean   : 48.332   Mean   : 4.828  
##             3rd Qu.:2.200   3rd Qu.: 64.041   3rd Qu.: 8.000  
##             Max.   :9.650   Max.   :339.531   Max.   :17.000  
##  association  switchF     associationF
##  no :1743    NoSw :1283   NoAs :1743  
##  yes:1277    YesSw:1737   YesAs:1277  
##                                       
##                                       
##                                       
## 
  • Median 1.30 for variable “arsenic”: Half of the households have arsenic levels up to 1.30 hundreds of micrograms per liter while the other half has arsenic levels higher than 1.30 hundreds of micrograms per liter.

  • 3rd quartile 64.041 for the variable “distance”: 75% of the households have a distance to the closest known well up to 64.041 meters while the other half has distances larger than this.


For the variables used in the first research question

library(psych)
psych::describe(mydata [ , c("arsenic" , "distance")])
##          vars    n  mean    sd median trimmed   mad  min    max
## arsenic     1 3020  1.66  1.11   1.30    1.49  0.86 0.51   9.65
## distance    2 3020 48.33 38.48  36.76   42.44 28.31 0.39 339.53
##           range skew kurtosis   se
## arsenic    9.14 1.66     3.95 0.02
## distance 339.14 1.70     4.26 0.70
  • Mean 1.66 for the variable “arsenic”: The average level of arsenic contamination in the household’s original well is 1.66 hundreds of micrograms per liter.

  • Min 0.39 for the variable “distance”: The shortest distance from a household to the closest known safe well is 0.39 meters.


For the variables used in the second research question

summary(mydata[, c("switchF","associationF")])
##   switchF     associationF
##  NoSw :1283   NoAs :1743  
##  YesSw:1737   YesAs:1277
  • The summary indicates that among the observed households, 1 283 did not switch from an unsafe well, while 1 737 households switched to another well from an unsafe source.

  • The summary shows that among the observed households, 1 743 did not have any members participating in community organizations, while 1 277 households had at least one member involved in community organizations.


Correlation analysis

For the first research question: Is there a correlation between the level of arsenic contamination in the household’s original well and the distance of a households to the closest known safe well?

Assumptions
  • Both variables are numeric. This assumption is met.
  • Errors are normally distributed. Since we have big enough sample, we do not test to check this.
  • Linear relationship between the variables.

Reducing the size of the sample to check linearity, by taking a random sample of 300 units.

set.seed(1) 
mydata1 <- mydata[sample(nrow(mydata), 300), ]
head(mydata)
##   switch arsenic distance education association switchF associationF
## 1    yes    2.36   16.826         0          no   YesSw         NoAs
## 2    yes    0.71   47.322         0          no   YesSw         NoAs
## 3     no    2.07   20.967        10          no    NoSw         NoAs
## 4    yes    1.15   21.486        12          no   YesSw         NoAs
## 5    yes    1.10   40.874        14         yes   YesSw        YesAs
## 6    yes    3.90   69.518         9         yes   YesSw        YesAs


Scatterplot for linearity check

library(car)
## Warning: package 'car' was built under R version 4.3.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(mydata1$arsenic, mydata1$distance, 
            smooth = TRUE,
            ylim = c(0, 345),
            xlim = c(0, 8),
            main = "Relationship between Arsenic and Distance",
            xlab = "Arsenic contamination", 
            ylab = "Distance")

From this scatterplot it seems like the linear relationship assumption is violated as well. However, I will continue with calculating Pearson correlation coefficient for educational purposes.


cor(mydata1$arsenic, mydata1$distance,
    method = "pearson",
    use = "complete.obs")
## [1] 0.1931792

The linear (assumed) relationship between variables “arsenic” and “distance” is positive and weak.


cor.test(mydata1$arsenic, mydata1$distance,
         method = "pearson",
         use ="complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  mydata1$arsenic and mydata1$distance
## t = 3.3988, df = 298, p-value = 0.000769
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0817262 0.2998604
## sample estimates:
##       cor 
## 0.1931792

H0: rho (arsenic, distance) = 0

H1: rho (arsenic, distance) =/= 0

We reject null hypothesis (at p<0.001) and conclude that there is correlation between the level of arsenic contamination in the household’s original well and the distance of a households to the closest known safe well.


Pearson Chi2 test for association between two categorical variables

Is there an association between households switching to another well from an unsafe well and participation of members in the community organizations?

Since association between two categorical variables needs to be analyzed, the use of the typical correlation coefficient is not appropriate. The appropriate solution is the analysis of frequencies.

However, in order to be able to perform Pearson Chi2 test, certain assumptions need to be met:

Assumptions
  • Observations must be independent. This assumption is met.
  • All expected frequencies are greater than 1.
  • Up to (maximum) 20% of the expected frequencies can be between 1 and 5 (however, any expected frequency that is lower than 5 will reduce the power of the test)


H0: There is no association between two categorical variables “switch” and “association”.

H1: There is association between two categorical variables “switch” and “association”.

results <- chisq.test(mydata$switchF, mydata$associationF, 
                      correct = TRUE) #Yates' continuity correction, 2x2 contingency table

results
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata$switchF and mydata$associationF
## X-squared = 3.7497, df = 1, p-value = 0.05282

p-value is greater than 5%, meaning that the null hypothesis cannot be rejected.


Observed (empirical) frequency:
addmargins (results$observed)
##               mydata$associationF
## mydata$switchF NoAs YesAs  Sum
##          NoSw   714   569 1283
##          YesSw 1029   708 1737
##          Sum   1743  1277 3020


Expected (theoretical) frequency:
round(results$expected, 2)
##               mydata$associationF
## mydata$switchF    NoAs  YesAs
##          NoSw   740.49 542.51
##          YesSw 1002.51 734.49

740.49 - If there is no association between two categorical variables we would expect to see 740.49 households that did not switch to another well from an unsafe well and did not have any members that participated in any community organizations.


Second assumption is met, since all of the expected frequencies are greater than 1.

Third assumption is met, since all of the expected frequencies are also greater than 5.


Standardized residuals - tell us if the differences between empirical and expected frequencies are statistically significant.
round(results$res, 2)
##               mydata$associationF
## mydata$switchF  NoAs YesAs
##          NoSw  -0.97  1.14
##          YesSw  0.84 -0.98

Table of standardized residuals shows that the discrepancies between the actual and expected values is not statistically significant (all values of standardized residuals are below 1.96).

However, for educational purposes I am going to explain a few numbers like if they were significant.

-0.97 - The actual number of households that did not switch to another well from an unsafe well and did not have any members that participated in any community organizations is smaller than expected. (Alfa=)

1.14 - The actual number of households that did not switch to another well from an unsafe well but did had members that participated in community organizations is higher than expected. (Alfa=)


Conclusion:

Pearson’s Chi-squared test: Based on the sample data, we cannot reject the null hypothesis (0.053) and therefore we can conclude that there is no association between households switching to another well from an unsafe well and participation of members in the community organizations.


Proportion Tables
addmargins(round(prop.table(results$observed), 3))
##               mydata$associationF
## mydata$switchF  NoAs YesAs   Sum
##          NoSw  0.236 0.188 0.424
##          YesSw 0.341 0.234 0.575
##          Sum   0.577 0.422 0.999

0.236 - Out of 3020 households, there are 23.6% (714/3020) of households that did not switch to another well from an unsafe well and did not have any members that participated in any community organizations.

0.234 - Out of 3020 households, there are 23.4% (708/3020) of households that switched to another well from an unsafe well and had at least one member that participated in a community organization.

*Note: 0.999 is not 1.000 due to rounding errors, however it represents 100%.


addmargins(round(prop.table(results$observed, 1), 3), 2) 
##               mydata$associationF
## mydata$switchF  NoAs YesAs   Sum
##          NoSw  0.557 0.443 1.000
##          YesSw 0.592 0.408 1.000

0.557 - Out of 1283 households that did not switch to another well from an unsafe well, 55.7% (714/1283) of them did not have any members that participated in any community organizations.

0.408 - Out of 1737 households that switched to another well from an unsafe well, 40.8% (708/1737) of them had at least one member that participated in a community organization.


addmargins(round(prop.table(results$observed, 2), 3), 1)
##               mydata$associationF
## mydata$switchF  NoAs YesAs
##          NoSw  0.410 0.446
##          YesSw 0.590 0.554
##          Sum   1.000 1.000

0.410 - Out of 1743 households that did not have any members that participated in any community organizations, 41% (714/1743) of them did not switch to another well from an unsafe well.

0.554 - Out of 1277 households that had at least one member that participated in a community organization, 55.4% (708/1277) of them switched to another well from an unsafe well.


Effects Size

Cramer’s V statistics

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cramers_v(mydata$switchF, mydata$associationF)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.03              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.03)
## [1] "tiny"
## (Rules: funder2019)

There is no association between households switching to another well from an unsafe well and participation of members in the community organizations, and the effect is tiny (r=0.03), which also makes sense due to the fact that we concluded that there is no association.


Odds Ratio

oddsratio(mydata$switchF, mydata$associationF)
## Odds ratio |       95% CI
## -------------------------
## 0.86       | [0.75, 1.00]
interpret_oddsratio(0.86)
## [1] "very small"
## (Rules: chen2010)

Considering that the Pearson chi-squared test shows no significant association the result is very small (OR=0.86) , the odds ratio of 0.86 implies that any observed difference in the odds of households switching from an unsafe well associated with community organization participation may be negligible.

Additionally suggests that the odds of switching to another well from an unsafe well among those who participate in community organizations are 0.86 times the odds of switching among those who do not participate.