Statistical Investigation and Hypothesis Testing

Introduction:
Access to adequate sanitation is a fundamental human right and a critical determinant of public health and well-being. However, disparities in sanitation access persist globally, with some regions facing greater challenges than others. Here, I use statistical methods to analyze disparities in access to sanitation services, focusing on two hypotheses. First, I investigate whether there is a significant difference in the coverage of safely managed sanitation between urban and rural areas. Second, I explore the association between region and the type of sanitation service provided, focusing on whether the prevalence of safely managed sanitation services differs by region. By examining these hypotheses, I aim to uncover insights that can inform targeted interventions and policy decisions to improve sanitation access and equity worldwide.

Hypothesis 1: Two-sample t-test:
This test asks whether mean coverage of safely managed sanitation differs significantly between urban and rural areas. The null hypothesis is that mean coverage is the same in both; the alternative is that it differs.

Significance:
If the null hypothesis is rejected, it implies a statistically significant difference in mean coverage between urban and rural areas. This could indicate disparities in sanitation infrastructure or access between the two settings.

Further Questions:
What factors contribute to the observed differences in sanitation coverage between urban and rural areas?
Are there specific interventions or policies that could help bridge the gap in sanitation coverage between urban and rural areas?
How do demographic and socioeconomic factors influence sanitation access in urban and rural areas?

Hypothesis 2: Fisher's exact test with simulation:
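This test asks whether there is a significant association between region and the type of sanitation service (safely managed or not). The null hypothesis is that the two are independent. The p-value is estimated by Monte Carlo simulation (2,000 replicates, R's default) rather than computed exactly.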

Significance:
If the null hypothesis is rejected, it suggests that there's a significant relationship between the region and the type of sanitation service provided. This could indicate regional disparities in the provision of sanitation services.

Further Questions:
What are the underlying reasons for the observed regional disparities in sanitation service provision?
How do governance structures, cultural norms, and economic factors influence sanitation service delivery across different regions?
Are there specific geographical areas or regions that require targeted interventions to improve sanitation access?

Visualization 1: Boxplot:
The boxplot visualization for Hypothesis 1 provides a graphical representation of the distribution of coverage of safely managed sanitation across urban and rural areas.

Significance:
Visualizing the coverage distribution helps in understanding the extent of differences between urban and rural areas in terms of sanitation access. It provides a clear picture of the data distribution and potential outliers.

Further Questions:
Are there any noticeable patterns or trends in the coverage of safely managed sanitation between urban and rural areas?
Do certain regions exhibit consistently higher or lower coverage rates compared to others?

Visualization 2: Stacked bar plot:
The stacked bar plot visualization for Hypothesis 2 illustrates the frequency distribution of sanitation service types across different regions.

Significance:
Visualizing the association between region and sanitation service type highlights any regional disparities in sanitation service provision. It allows for a quick comparison of the proportion of safely managed sanitation services across different regions.

Further Questions:
Are there particular regions where the prevalence of safely managed sanitation services is notably higher or lower?
How do environmental factors and geographical characteristics influence the distribution of sanitation services across regions?

Conclusion:
This analysis examined disparities in access to sanitation services along two dimensions. Through the two-sample t-test, I found a significant difference in the coverage of safely managed sanitation between urban and rural areas, underscoring the need for tailored interventions to address disparities in sanitation infrastructure and access. Fisher's exact test with a simulated p-value (p of about 0.0005) indicated a significant association between region and the type of sanitation service provided, suggesting regional disparities in service provision. Together, these findings emphasize the importance of targeted interventions and policies to ensure equitable access to safe and sustainable sanitation services across regions.

library(pwr)

library(vcd)

# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")

# Convert 'Service level' into binary variable
data$binary_service_level <- as.factor(data$Service.level == "Safely managed service")

# Convert 'Region' to factor
data$Region <- as.factor(data$Region)

# Hypothesis 1: Two-sample t-test
effect_size <- 0.1
alpha <- 0.05
beta <- 0.2
power <- 1 - beta

# Perform power analysis: pwr.t.test returns the required sample size per group
sample_size <- pwr.t.test(d = effect_size, sig.level = alpha, power = power, type = "two.sample")$n

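# Check the required per-group sample size from the power analysis.
# Note: this compares the required n (about 1,571 per group for d = 0.1) with 30;
# it does not compare it with the number of observations actually in `data`.
if (sample_size >= 30) {
  print("There is enough data to perform the hypothesis test for Hypothesis 1.")
} else {
  print("There is not enough data to perform the hypothesis test for Hypothesis 1.")
}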

## [1] "There is enough data to perform the hypothesis test for Hypothesis 1."
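
# A minimal sketch of the two-sample test itself, since the check above only
# inspects the required sample size. Assumptions (not confirmed by the data):
# `Coverage` is numeric, `Residence.Type` contains the values "urban" and
# "rural" (case-insensitive), and the comparison should be restricted to rows
# at the "Safely managed service" level. `welch_urban_rural` is a hypothetical
# helper, not part of any package.
welch_urban_rural <- function(df, required_n) {
  # Keep only safely managed rows for urban and rural residence types
  keep <- df$Service.level == "Safely managed service" &
    tolower(df$Residence.Type) %in% c("urban", "rural") &
    !is.na(df$Coverage)
  sub <- df[which(keep), ]
  sub$Residence.Type <- factor(tolower(sub$Residence.Type))

  # Observed group sizes versus the per-group n from the power analysis
  group_n <- table(sub$Residence.Type)
  enough <- all(group_n >= ceiling(required_n))

  # t.test defaults to Welch's test, which does not assume equal variances
  list(group_n = group_n, enough_data = enough,
       test = t.test(Coverage ~ Residence.Type, data = sub))
}

# Usage with the objects defined above:
# h1 <- welch_urban_rural(data, required_n = sample_size)
# print(h1$group_n); print(h1$enough_data); print(h1$test)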

# Hypothesis 2: Fisher's exact test with simulation
fisher_test <- fisher.test(data$Region, data$binary_service_level, simulate.p.value = TRUE)
print(fisher_test)

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  data$Region and data$binary_service_level
## p-value = 0.0004998
## alternative hypothesis: two.sided

# Interpret p-value for Hypothesis 2
if (fisher_test$p.value < 0.05) {
  print("Reject Null Hypothesis (Hypothesis 2): There is a significant association between region and the type of sanitation service.")
} else {
  print("Fail to Reject Null Hypothesis (Hypothesis 2): There is no significant association between region and the type of sanitation service.")
}

## [1] "Reject Null Hypothesis (Hypothesis 2): There is a significant association between region and the type of sanitation service."
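
# With 2000 replicates, the smallest simulated p-value R can report is
# 1 / (2000 + 1) = 0.0004998, which is the value printed above, so the result
# is better read as "p <= 0.0005". A minimal sketch of a reproducible re-run
# with more replicates follows; the seed and B = 1e5 are arbitrary
# illustrative choices, and `fisher_sim` is a hypothetical helper.
fisher_sim <- function(x, y, B = 1e5, seed = 123) {
  set.seed(seed)  # fix the random number stream so the simulated p-value is repeatable
  fisher.test(x, y, simulate.p.value = TRUE, B = B)
}

# Usage with the objects defined above:
# print(fisher_sim(data$Region, data$binary_service_level))
# Share of safely managed rows within each region (rows sum to 1):
# prop.table(table(data$Region, data$binary_service_level), margin = 1)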

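# Visualization for Hypothesis 1: Boxplot of coverage by urban/rural
# Restrict to safely managed rows so the plot matches its title
safely_managed <- subset(data, Service.level == "Safely managed service")
boxplot(Coverage ~ Residence.Type, data = safely_managed,
        main = "Coverage of Safely Managed Sanitation by Residence Type",
        xlab = "Residence Type", ylab = "Coverage")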

# Visualization for Hypothesis 2: Stacked bar plot of region vs. type of sanitation service
# Rows are service type (FALSE/TRUE) and columns are regions, so each bar is one region
contingency_table <- table(data$binary_service_level, data$Region)
barplot(contingency_table, beside = FALSE,
        legend.text = c("Not safely managed", "Safely managed"),
        main = "Association between Region and Type of Sanitation Service",
        xlab = "Region", ylab = "Frequency", col = c("blue", "green"))
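
# A minimal sketch for comparing proportions rather than raw counts, since
# regions with more rows dominate a frequency plot. `region_service_shares`
# is a hypothetical helper; the legend labels in the usage note assume the
# service indicator has exactly the two levels FALSE and TRUE, as created above.
region_service_shares <- function(region, safely_managed) {
  counts <- table(safely_managed, region)  # rows: service type, columns: regions
  prop.table(counts, margin = 2)           # scale each region's column to sum to 1
}

# Usage with the objects defined above:
# shares <- region_service_shares(data$Region, data$binary_service_level)
# barplot(shares, legend.text = c("Not safely managed", "Safely managed"),
#         main = "Share of Sanitation Service Type by Region",
#         xlab = "Region", ylab = "Proportion", col = c("blue", "green"))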