data()
data(package = .packages(all.available=TRUE))
#install.packages("carData")
library(carData)
mydata <- force(Wells)
head(mydata)
## switch arsenic distance education association
## 1 yes 2.36 16.826 0 no
## 2 yes 0.71 47.322 0 no
## 3 no 2.07 20.967 10 no
## 4 yes 1.15 21.486 12 no
## 5 yes 1.10 40.874 14 yes
## 6 yes 3.90 69.518 9 yes
Data on whether or not households in Bangladesh changed the wells that they were using.
nrow(mydata)
## [1] 3020
*Note:
For the first given research question, the only needed variables are “arsenic” and “distance”.
For the second given research question, the only needed variables are “switch” and “association”.
mydata$switchF <- factor(mydata$switch,
levels = c("no", "yes"),
labels = c ("NoSw" , "YesSw"))
mydata$associationF <- factor(mydata$association,
levels = c("no", "yes"),
labels = c ("NoAs" , "YesAs"))
head(mydata,3)
## switch arsenic distance education association switchF associationF
## 1 yes 2.36 16.826 0 no YesSw NoAs
## 2 yes 0.71 47.322 0 no YesSw NoAs
## 3 no 2.07 20.967 10 no NoSw NoAs
any(is.na(mydata))
## [1] FALSE
There are no missing values.
summary(mydata)
## switch arsenic distance education
## no :1283 Min. :0.510 Min. : 0.387 Min. : 0.000
## yes:1737 1st Qu.:0.820 1st Qu.: 21.117 1st Qu.: 0.000
## Median :1.300 Median : 36.761 Median : 5.000
## Mean :1.657 Mean : 48.332 Mean : 4.828
## 3rd Qu.:2.200 3rd Qu.: 64.041 3rd Qu.: 8.000
## Max. :9.650 Max. :339.531 Max. :17.000
## association switchF associationF
## no :1743 NoSw :1283 NoAs :1743
## yes:1277 YesSw:1737 YesAs:1277
##
##
##
##
Median 1.30 for variable “arsenic”: Half of the households have arsenic levels up to 1.30 hundreds of micrograms per liter while the other half has arsenic levels higher than 1.30 hundreds of micrograms per liter.
3rd quartile 64.041 for the variable “distance”: 75% of the households have a distance to the closest known well up to 64.041 meters while the other half has distances larger than this.
For the variables used in the first research question
library(psych)
psych::describe(mydata [ , c("arsenic" , "distance")])
## vars n mean sd median trimmed mad min max
## arsenic 1 3020 1.66 1.11 1.30 1.49 0.86 0.51 9.65
## distance 2 3020 48.33 38.48 36.76 42.44 28.31 0.39 339.53
## range skew kurtosis se
## arsenic 9.14 1.66 3.95 0.02
## distance 339.14 1.70 4.26 0.70
Mean 1.66 for the variable “arsenic”: The average level of arsenic contamination in the household’s original well is 1.66 hundreds of micrograms per liter.
Min 0.39 for the variable “distance”: The shortest distance from a household to the closest known safe well is 0.39 meters.
For the variables used in the second research question
summary(mydata[, c("switchF","associationF")])
## switchF associationF
## NoSw :1283 NoAs :1743
## YesSw:1737 YesAs:1277
The summary indicates that among the observed households, 1 283 did not switch from an unsafe well, while 1 737 households switched to another well from an unsafe source.
The summary shows that among the observed households, 1 743 did not have any members participating in community organizations, while 1 277 households had at least one member involved in community organizations.
For the first research question: Is there a correlation between the level of arsenic contamination in the household’s original well and the distance of a households to the closest known safe well?
Reducing the size of the sample to check linearity, by taking a random sample of 300 units.
set.seed(1)
mydata1 <- mydata[sample(nrow(mydata), 300), ]
head(mydata)
## switch arsenic distance education association switchF associationF
## 1 yes 2.36 16.826 0 no YesSw NoAs
## 2 yes 0.71 47.322 0 no YesSw NoAs
## 3 no 2.07 20.967 10 no NoSw NoAs
## 4 yes 1.15 21.486 12 no YesSw NoAs
## 5 yes 1.10 40.874 14 yes YesSw YesAs
## 6 yes 3.90 69.518 9 yes YesSw YesAs
Scatterplot for linearity check
library(car)
## Warning: package 'car' was built under R version 4.3.2
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(mydata1$arsenic, mydata1$distance,
smooth = TRUE,
ylim = c(0, 345),
xlim = c(0, 8),
main = "Relationship between Arsenic and Distance",
xlab = "Arsenic contamination",
ylab = "Distance")
From this scatterplot it seems like the linear relationship assumption is violated as well. However, I will continue with calculating Pearson correlation coefficient for educational purposes.
cor(mydata1$arsenic, mydata1$distance,
method = "pearson",
use = "complete.obs")
## [1] 0.1931792
The linear (assumed) relationship between variables “arsenic” and “distance” is positive and weak.
cor.test(mydata1$arsenic, mydata1$distance,
method = "pearson",
use ="complete.obs")
##
## Pearson's product-moment correlation
##
## data: mydata1$arsenic and mydata1$distance
## t = 3.3988, df = 298, p-value = 0.000769
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0817262 0.2998604
## sample estimates:
## cor
## 0.1931792
H0: rho (arsenic, distance) = 0
H1: rho (arsenic, distance) =/= 0
We reject null hypothesis (at p<0.001) and conclude that there is correlation between the level of arsenic contamination in the household’s original well and the distance of a households to the closest known safe well.
Is there an association between households switching to another well from an unsafe well and participation of members in the community organizations?
Since association between two categorical variables needs to be analyzed, the use of the typical correlation coefficient is not appropriate. The appropriate solution is the analysis of frequencies.
However, in order to be able to perform Pearson Chi2 test, certain assumptions need to be met:
H0: There is no association between two categorical variables “switch” and “association”.
H1: There is association between two categorical variables “switch” and “association”.
results <- chisq.test(mydata$switchF, mydata$associationF,
correct = TRUE) #Yates' continuity correction, 2x2 contingency table
results
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$switchF and mydata$associationF
## X-squared = 3.7497, df = 1, p-value = 0.05282
p-value is greater than 5%, meaning that the null hypothesis cannot be rejected.
addmargins (results$observed)
## mydata$associationF
## mydata$switchF NoAs YesAs Sum
## NoSw 714 569 1283
## YesSw 1029 708 1737
## Sum 1743 1277 3020
round(results$expected, 2)
## mydata$associationF
## mydata$switchF NoAs YesAs
## NoSw 740.49 542.51
## YesSw 1002.51 734.49
740.49 - If there is no association between two categorical variables we would expect to see 740.49 households that did not switch to another well from an unsafe well and did not have any members that participated in any community organizations.
Second assumption is met, since all of the expected frequencies are greater than 1.
Third assumption is met, since all of the expected frequencies are also greater than 5.
round(results$res, 2)
## mydata$associationF
## mydata$switchF NoAs YesAs
## NoSw -0.97 1.14
## YesSw 0.84 -0.98
Table of standardized residuals shows that the discrepancies between the actual and expected values is not statistically significant (all values of standardized residuals are below 1.96).
However, for educational purposes I am going to explain a few numbers like if they were significant.
-0.97 - The actual number of households that did not switch to another well from an unsafe well and did not have any members that participated in any community organizations is smaller than expected. (Alfa=)
1.14 - The actual number of households that did not switch to another well from an unsafe well but did had members that participated in community organizations is higher than expected. (Alfa=)
Pearson’s Chi-squared test: Based on the sample data, we cannot reject the null hypothesis (0.053) and therefore we can conclude that there is no association between households switching to another well from an unsafe well and participation of members in the community organizations.
addmargins(round(prop.table(results$observed), 3))
## mydata$associationF
## mydata$switchF NoAs YesAs Sum
## NoSw 0.236 0.188 0.424
## YesSw 0.341 0.234 0.575
## Sum 0.577 0.422 0.999
0.236 - Out of 3020 households, there are 23.6% (714/3020) of households that did not switch to another well from an unsafe well and did not have any members that participated in any community organizations.
0.234 - Out of 3020 households, there are 23.4% (708/3020) of households that switched to another well from an unsafe well and had at least one member that participated in a community organization.
*Note: 0.999 is not 1.000 due to rounding errors, however it represents 100%.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## mydata$associationF
## mydata$switchF NoAs YesAs Sum
## NoSw 0.557 0.443 1.000
## YesSw 0.592 0.408 1.000
0.557 - Out of 1283 households that did not switch to another well from an unsafe well, 55.7% (714/1283) of them did not have any members that participated in any community organizations.
0.408 - Out of 1737 households that switched to another well from an unsafe well, 40.8% (708/1737) of them had at least one member that participated in a community organization.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## mydata$associationF
## mydata$switchF NoAs YesAs
## NoSw 0.410 0.446
## YesSw 0.590 0.554
## Sum 1.000 1.000
0.410 - Out of 1743 households that did not have any members that participated in any community organizations, 41% (714/1743) of them did not switch to another well from an unsafe well.
0.554 - Out of 1277 households that had at least one member that participated in a community organization, 55.4% (708/1277) of them switched to another well from an unsafe well.
Cramer’s V statistics
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(mydata$switchF, mydata$associationF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.03 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.03)
## [1] "tiny"
## (Rules: funder2019)
There is no association between households switching to another well from an unsafe well and participation of members in the community organizations, and the effect is tiny (r=0.03), which also makes sense due to the fact that we concluded that there is no association.
Odds Ratio
oddsratio(mydata$switchF, mydata$associationF)
## Odds ratio | 95% CI
## -------------------------
## 0.86 | [0.75, 1.00]
interpret_oddsratio(0.86)
## [1] "very small"
## (Rules: chen2010)
Considering that the Pearson chi-squared test shows no significant association the result is very small (OR=0.86) , the odds ratio of 0.86 implies that any observed difference in the odds of households switching from an unsafe well associated with community organization participation may be negligible.
Additionally suggests that the odds of switching to another well from an unsafe well among those who participate in community organizations are 0.86 times the odds of switching among those who do not participate.