Exploring Relationships, Correlation, and Confidence Intervals in Washdash Data

Access to adequate sanitation and hygiene services is essential for public health and well-being. In this analysis, I use the Washdash dataset to explore trends and relationships in sanitation coverage and population. Specifically, I investigate how sanitation coverage has evolved over time and how it correlates with population size.

Summary of the Data: A summary of the dataset covers the type, region, residence type, service type, year, coverage, population, and service level, plus two derived columns: per capita coverage and the ordered service level. This gives an overview of each variable's type and distribution.

Data Processing and Exploration: I began by reading the CSV file, setting a seed for reproducibility, and setting the display options to avoid scientific notation so that numerical values are easier to read. This analysis focuses on two main pairs of variables:

Pair 1: Per Capita Coverage Over Time: I calculate per capita coverage by dividing coverage by population for each observation, as a rough measure of sanitation service relative to population size. I then visualize the trend over time using a scatter plot, with years on the x-axis and per capita coverage on the y-axis.
Insight: The scatter plot shows the trend of per capita coverage over the years. The values are very small (the maximum is about 0.00004), and 112 rows were dropped from the plot because their per capita coverage is missing. Among the remaining points, coverage appears relatively stable over time.
Significance: This insight suggests that the level of sanitation coverage per person has remained relatively consistent over the years. Understanding this stability is crucial for assessing the effectiveness of sanitation policies and interventions.
Further Questions: What interventions or policies have contributed to maintaining stable coverage over time?
Are there any specific periods where coverage significantly improved or declined, and if so, what factors influenced these changes?
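
The 112 missing values appear to come from the division step: the summary shows a minimum of 0 for both Coverage and Population, and in R 0 / 0 evaluates to NaN, which summary() and ggplot2 count as missing. The following is a minimal sketch of that behaviour on made-up numbers. The `toy` data frame and the `Per_capita_safe` column are hypothetical and are not part of the original analysis.

```r
# Toy data (hypothetical values, not taken from the Washdash file)
toy <- data.frame(
  Coverage   = c(50, 0, 20),
  Population = c(1000, 0, 400)
)

# 0 / 0 evaluates to NaN in R, which summary() and ggplot2 treat as missing
toy$Per_capita_coverage <- toy$Coverage / toy$Population

# One way to make the gap explicit: mark zero-population rows as NA up front
toy$Per_capita_safe <- ifelse(toy$Population > 0,
                              toy$Coverage / toy$Population,
                              NA_real_)
print(toy)
```

Counting the rows where Population is zero in the real data would confirm whether this accounts for all 112 missing values.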

Pair 2: Population Across Service Levels: I convert the 'Service Level' variable into an ordered factor and explore its relationship with population size. A box plot is used to visualize the distribution of population across different service levels, providing insights into variations in sanitation access among different segments of the population.
Insight: The box plot illustrates the distribution of population across different service levels. It highlights variations in population medians among service levels, with potential outliers in certain categories.
Significance: This insight indicates disparities in population distribution across different levels of sanitation services. Understanding these disparities is crucial for equitable resource allocation and targeted interventions.
Further Questions: What factors contribute to the variation in population distribution across service levels?
Are there any socio-economic or geographical factors associated with areas where outliers occur, and how do they impact sanitation service provision?
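
One detail worth checking in this step: factor() silently converts any value that is not listed in `levels` to NA, and the summary reports 1,083 NAs in Service_Level_Ordered. The sketch below shows the mechanism on made-up labels and one way to list what was dropped. The label "Unimproved" and the variable names are assumptions for illustration only; the labels that actually went unmatched in the Washdash file would need to be checked against the data.

```r
# Hypothetical labels for illustration; "Unimproved" stands in for any value
# that is absent from the levels vector
service <- c("Basic service", "Unimproved", "Limited service", "Unimproved")
wanted  <- c("Limited service", "Basic service")

ordered_service <- factor(service, levels = wanted, ordered = TRUE)

# Values outside `wanted` become NA without a warning
n_dropped <- sum(is.na(ordered_service))
print(n_dropped)

# List the labels that were dropped
unmatched <- setdiff(unique(service), wanted)
print(unmatched)
```

Rows with an NA service level are drawn as a separate "NA" box in the box plot, so knowing which labels they represent affects how the plot should be read.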

Correlation Coefficients: To quantify the relationships between variables, I calculate Pearson correlation coefficients for key pairs of variables:
Year and Per Capita Coverage
Population and Coverage
Population and Per Capita Coverage

Insight: These correlation coefficients help us understand the strength and direction of associations between sanitation coverage, population size, and time:
Year and Per Capita Coverage: NA. Per capita coverage contains 112 missing values, and cor() returns NA by default when either input contains a missing value.
Population and Coverage: Moderate positive correlation (0.615), suggesting that areas with higher populations tend to have higher coverage rates.
Population and Per Capita Coverage: NA, for the same reason: the 112 missing per capita values propagate into the result.
Significance: These correlation coefficients provide insights into the relationships between variables. Understanding these relationships helps identify factors influencing sanitation coverage and population distribution.
Further Questions: What specific factors contribute to the positive correlation between population and coverage?
How do demographic, economic, and geographical factors influence sanitation coverage and per capita coverage?
Are there any underlying reasons for the missing or out-of-range data points, and how do they affect the analysis results?
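
The NA results can be avoided by telling cor() how to handle missing values. The sketch below uses small made-up vectors to contrast the default behaviour with `use = "complete.obs"`, which drops incomplete rows before computing the coefficient. The vectors `x` and `y` are hypothetical; the values this would produce on the Washdash columns are not known from the output above.

```r
# Toy vectors (hypothetical values) with one missing entry in y
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)

# Default behaviour: any NA in the inputs makes the result NA
cor_default <- cor(x, y)

# Complete-case behaviour: rows with an NA are dropped first
cor_complete <- cor(x, y, use = "complete.obs")

print(cor_default)
print(cor_complete)
```

Dropping rows changes the sample the coefficient describes, so the number of excluded observations should be reported alongside the result.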

Confidence Interval for Population Mean: I estimate a 95% confidence interval for the mean of the Population variable, using the t-distribution with n - 1 degrees of freedom. This interval provides a range within which we can be reasonably confident that the true mean lies, based on the sample data.

Insight: A confidence interval was constructed to estimate the range in which the true population mean lies with 95% confidence. The interval was approximately 140.51 million to 158.82 million.
Significance: This confidence interval provides a measure of uncertainty around the estimated population mean. It helps gauge the reliability of the sample data in representing the entire population.
Further Questions: How does the confidence interval change with different confidence levels (e.g., 90%, 99%)?
What factors might contribute to the variability in population estimates, and how can data collection methods be improved to reduce uncertainty?
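
As a starting point for the first question, the t-based interval can be wrapped in a small function and evaluated at several confidence levels. This is a sketch on simulated data, not on the Washdash file: the helper name `mean_ci` and the `toy_population` vector are hypothetical, and the formula is the same mean plus or minus t times the standard error used in this analysis.

```r
# Hypothetical helper: t-based confidence interval for the mean of x
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                           # drop missing values
  n <- length(x)
  t_value <- qt((1 + level) / 2, df = n - 1)  # two-sided critical value
  margin <- t_value * sd(x) / sqrt(n)         # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Simulated data for illustration only
set.seed(123)
toy_population <- rnorm(200, mean = 100, sd = 15)

for (level in c(0.90, 0.95, 0.99)) {
  ci <- mean_ci(toy_population, level)
  cat(sprintf("%.0f%% CI: %.2f - %.2f\n", level * 100, ci["lower"], ci["upper"]))
}
```

A higher confidence level uses a larger critical value, so the interval widens: the 99% interval is always wider than the 90% interval for the same data.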

Conclusion: Although this analysis provides valuable insights into sanitation and drinking water services, further investigation is needed to address data quality issues, validate findings, and explore the relationships between variables. Ongoing monitoring and evaluation are also essential to inform policy decisions and interventions aimed at improving access to essential services for all populations.

# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")
# Set seed for reproducibility
set.seed(123)
# Set the display format to avoid scientific notation
options(scipen = 999)
# Pair 1: Calculate per capita coverage
data$Per_capita_coverage <- data$Coverage / data$Population

# Pair 2: Convert Service Level to an ordered factor
data$Service_Level_Ordered <- factor(data$Service.level, levels = c("Open defecation", "Limited service", "Basic service", "At least basic", "Safely managed service"), ordered = TRUE)
summary(data)

##      Type              Region          Residence.Type     Service.Type      
##  Length:3367        Length:3367        Length:3367        Length:3367       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       Year         Coverage         Population         Service.level     
##  Min.   :2010   Min.   :  0.000   Min.   :         0   Length:3367       
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:   4366312   Class :character  
##  Median :2016   Median : 12.110   Median :  33061569   Mode  :character  
##  Mean   :2016   Mean   : 22.447   Mean   : 149665335                     
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.: 175499562                     
##  Max.   :2022   Max.   :100.000   Max.   :2173068885                     
##                                                                          
##  Per_capita_coverage            Service_Level_Ordered
##  Min.   :0.00000     Open defecation       : 286     
##  1st Qu.:0.00000     Limited service       : 748     
##  Median :0.00000     Basic service         : 652     
##  Mean   :0.00000     At least basic        : 105     
##  3rd Qu.:0.00000     Safely managed service: 493     
##  Max.   :0.00004     NA's                  :1083     
##  NA's   :112

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

# Pair 1: Per_capita_coverage vs. Year
ggplot(data = data, aes(x = Year, y = Per_capita_coverage)) +
  geom_point() +
  labs(title = "Per Capita Coverage Over Time",
       x = "Year",
       y = "Per Capita Coverage")

## Warning: Removed 112 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Pair 2: Population vs. Service Level
ggplot(data = data, aes(x = Service_Level_Ordered, y = Population)) +
  geom_boxplot() +
  labs(title = "Population Across Service Levels",
       x = "Service Level",
       y = "Population")

# Calculate correlation coefficient for Year and Per Capita Coverage
cor_year_per_capita <- cor(data$Year, data$Per_capita_coverage)

# Calculate correlation coefficient for Population and Coverage
cor_population_coverage <- cor(data$Population, data$Coverage)

# Calculate correlation coefficient for Population and Per Capita Coverage
cor_population_per_capita <- cor(data$Population, data$Per_capita_coverage)

# Print correlation coefficients
print(paste("Correlation coefficient for Year and Per Capita Coverage:", cor_year_per_capita))

## [1] "Correlation coefficient for Year and Per Capita Coverage: NA"

print(paste("Correlation coefficient for Population and Coverage:", cor_population_coverage))

## [1] "Correlation coefficient for Population and Coverage: 0.615361436962529"

print(paste("Correlation coefficient for Population and Per Capita Coverage:", cor_population_per_capita))

## [1] "Correlation coefficient for Population and Per Capita Coverage: NA"

# Calculate the sample mean and standard deviation of the population
mean_population <- mean(data$Population)
sd_population <- sd(data$Population)
n <- length(data$Population)

# Choose the desired confidence level (95% confidence level)
confidence_level <- 0.95

# Look up the t-value for the chosen confidence level and degrees of freedom (n - 1)
t_value <- qt((1 + confidence_level) / 2, df = n - 1)

# Calculate the margin of error
margin_of_error <- t_value * (sd_population / sqrt(n))

# Construct the confidence interval
lower_bound <- mean_population - margin_of_error
upper_bound <- mean_population + margin_of_error

# Print the confidence interval
cat("Confidence interval for Population mean (95% confidence level):", lower_bound, "-", upper_bound, "\n")

## Confidence interval for Population mean (95% confidence level): 140508790 - 158821879