Statistical Investigation and Hypothesis Testing

Introduction:
Access to safe sanitation is a critical aspect of public health and sustainable development. In this study, I use statistical methods to examine the factors influencing access to safe sanitation, with the aim of uncovering insights that can inform policies and interventions to improve sanitation services globally.

Hypothesis 1:

Null Hypothesis: There is no significant association between the region and the level of safely managed sanitation services.

Methodology:
I use the Neyman-Pearson framework to test this hypothesis. The significance level (α) is set to 0.05 and the Type II error rate (β) is fixed at 0.2, corresponding to a power of 0.8. I calculate the sample size required for a two-sample t-test, assuming a moderate effect size of 0.5. If the available sample is sufficient, I perform the test and interpret the results.
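The sample-size step described above can be sketched with base R alone. This is a minimal illustration, not necessarily the exact calculation used in the analysis: it assumes a two-sided test, equal group sizes, and that the effect size of 0.5 is a standardized mean difference (so `sd = 1`). The variable names are chosen for illustration only.

```r
# Sketch: per-group sample size for a two-sided, two-sample t-test.
alpha <- 0.05        # Type I error rate
power <- 0.80        # 1 - beta, with beta = 0.2
effect_size <- 0.5   # assumed standardized mean difference

# Normal-approximation formula: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2
n_approx <- 2 * ((qnorm(1 - alpha / 2) + qnorm(power)) / effect_size)^2

# Calculation based on the noncentral t distribution, from the 'stats'
# package. With sd = 1, delta is the standardized effect size.
res <- power.t.test(delta = effect_size, sd = 1,
                    sig.level = alpha, power = power,
                    type = "two.sample", alternative = "two.sided")

# Round up, since observations come in whole units
n_per_group <- ceiling(res$n)
print(c(normal_approx = ceiling(n_approx), t_based = n_per_group))
```

The t-based figure is slightly larger than the normal approximation because it accounts for estimating the variance from the sample.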

Insights and Significance:
The analysis aims to reveal whether there is a substantial difference in the level of safely managed sanitation services across different regions. If the null hypothesis is rejected, it would imply that regional disparities exist, highlighting the need for targeted interventions in regions with inadequate sanitation services. Conversely, if the null hypothesis is not rejected, the data provide no evidence of regional differences at the chosen significance level. This is not proof that service levels are the same everywhere, but it could still inform resource allocation strategies.

Further Investigation:
If the null hypothesis is rejected, further investigation could explore the specific factors contributing to regional disparities in sanitation access. This could involve analyzing socio-economic indicators, infrastructure development, and policy implementation at the regional level.

Hypothesis 2:

Null Hypothesis: There is no significant association between the region and the binary classification of safely managed sanitation services.

Methodology:
I use Fisher's significance testing framework to test this hypothesis. I calculate the p-value from a chi-square test of independence between the region and the binary classification of sanitation services. A p-value less than 0.05 indicates a significant association. I then interpret the p-value and explain how much confidence the result warrants.
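To make the chi-square test of independence concrete, the sketch below computes the statistic by hand on a small made-up table and checks it against `chisq.test()`. The counts and the region labels "A" and "B" are hypothetical and are not taken from the study's data set.

```r
# Hypothetical counts, NOT taken from the study's data set.
toy <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Region = c("A", "B"),
                              SafelyManaged = c("FALSE", "TRUE")))

# Expected counts under independence: (row total * column total) / N
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((toy - expected)^2 / expected)
dof <- (nrow(toy) - 1) * (ncol(toy) - 1)
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

# Cross-check against the built-in test. The continuity correction is
# switched off so that both computations agree for a 2 x 2 table.
builtin <- chisq.test(toy, correct = FALSE)
print(c(manual = x2, builtin = unname(builtin$statistic)))
print(p_value)
```

The p-value is the probability, under independence, of a statistic at least as large as the one observed, which is why only the upper tail of the chi-square distribution is used.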

Insights and Significance:
This analysis aims to determine whether there is a significant association between the region and the binary classification of sanitation services. A significant association would imply that regional factors influence the likelihood of having safely managed sanitation services, which could guide targeted interventions and policy decisions.

Further Investigation:
If a significant association is found, further investigation could explore the underlying reasons for regional disparities in sanitation service classifications. This could involve qualitative research to understand contextual factors such as governance structures, cultural norms, and community engagement practices impacting sanitation access.

# Install the 'vcd' package only if it is missing, and set a CRAN mirror
# explicitly (install.packages() fails when no mirror is configured)
if (!requireNamespace("vcd", quietly = TRUE)) {
  install.packages("vcd", repos = "https://cloud.r-project.org")
}

# Load the 'vcd' package after installation
library(vcd)

## Warning: package 'vcd' was built under R version 4.3.3

## Loading required package: grid

# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")

# View summary of the data
summary(data)

##      Type              Region          Residence.Type     Service.Type      
##  Length:3367        Length:3367        Length:3367        Length:3367       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       Year         Coverage         Population        Service.level     
##  Min.   :2010   Min.   :  0.000   Min.   :0.000e+00   Length:3367       
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:4.366e+06   Class :character  
##  Median :2016   Median : 12.110   Median :3.306e+07   Mode  :character  
##  Mean   :2016   Mean   : 22.447   Mean   :1.497e+08                     
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.:1.755e+08                     
##  Max.   :2022   Max.   :100.000   Max.   :2.173e+09

# Convert 'Service level' into binary variable
data$binary_service_level <- as.factor(data$Service.level == "Safely managed service")

# Convert 'Region' to factor
data$Region <- as.factor(data$Region)

# Perform chi-square test of independence
chi_square_test <- chisq.test(data$Region, data$binary_service_level)
print(chi_square_test)

## 
##  Pearson's Chi-squared test
## 
## data:  data$Region and data$binary_service_level
## X-squared = 37.714, df = 7, p-value = 3.434e-06
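With several thousand rows, even a weak association can produce a very small p-value, so it helps to look at an effect size alongside the test. The sketch below computes Cramér's V from the numbers printed above. It assumes that all 3367 rows entered the test (no rows dropped because of missing values) and infers 8 regions from df = 7 with a two-level service variable. The variable names are illustrative.

```r
# Values copied from the printed test output and data summary
x_squared <- 37.714  # Pearson chi-squared statistic
n_obs <- 3367        # rows in the data set (assumes none were dropped as NA)
n_rows <- 8          # regions, inferred from df = 7 with 2 columns
n_cols <- 2          # levels of the binary service variable

# Cramer's V = sqrt(X^2 / (N * min(r - 1, c - 1)))
cramers_v <- sqrt(x_squared / (n_obs * min(n_rows - 1, n_cols - 1)))
print(round(cramers_v, 3))  # approximately 0.106
```

A value near 0.1 is conventionally read as a weak association, so the result is statistically significant but modest in size. The `assocstats()` function in the `vcd` package reports the same measure for a two-way table.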

# Perform power analysis for two-sample t-test
alpha <- 0.05
beta <- 0.2
power <- 1 - beta
effect_size <- 0.5

# Perform power analysis (pwr.t.test() comes from the 'pwr' package,
# which must be installed and loaded first)
library(pwr)
sample_size <- pwr.t.test(d = effect_size, sig.level = alpha, power = power, type = "two.sample")$n

print(paste("Required sample size per group:", round(sample_size)))

## [1] "Required sample size per group: 64"

# Interpret p-value
if (chi_square_test$p.value < 0.05) {
  print("Reject Null Hypothesis: There is a significant association between region and binary service level.")
} else {
  print("Fail to Reject Null Hypothesis: There is no significant association between region and binary service level.")
}

## [1] "Reject Null Hypothesis: There is a significant association between region and binary service level."

# Visualize the association between region and binary service level using a grouped bar plot
contingency_table <- table(data$Region, data$binary_service_level)
# Transpose so that regions lie on the x-axis, with one bar per service level (FALSE/TRUE)
barplot(t(contingency_table), beside = TRUE, legend.text = colnames(contingency_table),
        main = "Region by Binary Service Level",
        xlab = "Region", ylab = "Frequency", col = c("blue", "green"))

# Visualize the association between region and binary service level using a mosaic plot
mosaicplot(contingency_table, main = "Mosaic Plot of Association between Region and Binary Service Level")