Introduction:
Access to adequate sanitation and hygiene services is essential for public health and well-being. In this analysis, I use the Washdash dataset to explore trends and relationships in sanitation coverage and population dynamics. Specifically, I investigate how sanitation coverage has evolved over time and how it correlates with population size.
Summary of the Data: A summary of the dataset shows the following variables: type, region, residence type, service type, year, coverage, population, service level, per capita coverage, and ordered service level. This gives an overview of the data's characteristics and distributions.
Data Processing and Exploration: I began by reading the CSV file, setting a seed for reproducibility, and disabling scientific notation so that large numerical values are easier to read. This analysis focuses on two main pairs of variables:
Pair 1: Per Capita Coverage Over Time: I calculate per capita coverage by dividing coverage by population for each observation. This metric is intended to assess the level of sanitation services available per individual. I then visualize its trend over time using a scatter plot, with year on the x-axis and per capita coverage on the y-axis.
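The summary output reports a minimum Population of 0 and 112 missing values in per capita coverage, so division by zero is a likely (though unconfirmed) source of those gaps. The following is a minimal sketch on made-up values, not the Washdash data; the `toy` data frame and the `usable` column are hypothetical names introduced only for illustration.

```r
# Toy data standing in for the Washdash columns (values are made up).
toy <- data.frame(
  Year       = c(2010, 2015, 2020, 2020),
  Coverage   = c(12.5, 30.0, 0.0, 5.0),
  Population = c(1000000, 2500000, 0, 0)
)

# Same calculation as in the analysis: coverage divided by population.
toy$Per_capita_coverage <- toy$Coverage / toy$Population

# 0 / 0 yields NaN and a positive number / 0 yields Inf,
# so flag every row whose result is not a finite number.
toy$usable <- is.finite(toy$Per_capita_coverage)

print(toy)
```

Checking `is.finite()` before plotting or correlating makes it explicit which rows would otherwise be dropped silently.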
Insight: The scatter plot shows the trend of per capita coverage over the years. Although 112 rows were dropped from the plot because of missing or out-of-range values, the remaining points indicate relatively stable coverage over time.
Significance: This insight suggests that the level of sanitation coverage per person has remained relatively consistent over the years. Understanding this stability is crucial for assessing the effectiveness of sanitation policies and interventions.
Further Questions: What interventions or policies have contributed to maintaining stable coverage over time?
Are there any specific periods where coverage significantly improved or declined, and if so, what factors influenced these changes?
Pair 2: Population Across Service Levels: I convert the 'Service Level' variable into an ordered factor and explore its relationship with population size. A box plot is used to visualize the distribution of population across different service levels, providing insights into variations in sanitation access among different segments of the population.
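One detail of the conversion is worth illustrating: `factor()` turns any label that is not in `levels` into NA, which may explain part of the 1083 NA's reported for the ordered variable in the summary. The sketch below uses the same level names as the analysis but a made-up input vector; the label "Some other label" is a hypothetical placeholder, not a value known to be in the dataset.

```r
# Level order used in the analysis, from lowest to highest service level.
levels_used <- c("Open defecation", "Limited service", "Basic service",
                 "At least basic", "Safely managed service")

# Made-up input; "Some other label" is a hypothetical unmatched value.
raw <- c("Basic service", "Open defecation",
         "Some other label", "Safely managed service")

ordered_level <- factor(raw, levels = levels_used, ordered = TRUE)

# Labels missing from `levels_used` become NA.
print(is.na(ordered_level))

# Ordered factors support comparisons between levels.
print(ordered_level[1] > ordered_level[2])  # Basic service vs. Open defecation

# Count each level, including the NA bucket.
print(table(ordered_level, useNA = "ifany"))
```

Comparing `unique()` of the raw column against the level vector is a quick way to see which labels fall into the NA bucket.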
Insight: The box plot illustrates the distribution of population across different service levels. It highlights variations in population medians among service levels, with potential outliers in certain categories.
Significance: This insight indicates disparities in population distribution across different levels of sanitation services. Understanding these disparities is crucial for equitable resource allocation and targeted interventions.
Further Questions: What factors contribute to the variation in population distribution across service levels?
Are there any socio-economic or geographical factors associated with areas where outliers occur, and how do they impact sanitation service provision?
Correlation Coefficients: To quantify the relationships between variables, I calculate Pearson correlation coefficients for key pairs of variables:
Year and Per Capita Coverage
Population and Coverage
Population and Per Capita Coverage
Insight: These correlation coefficients help us understand the strength and direction of associations between sanitation coverage, population size, and time:
Year and Per Capita Coverage: NA. By default, cor() returns NA when either input contains missing values, and per capita coverage has 112 of them.
Population and Coverage: Moderate positive correlation (0.615), suggesting that areas with higher populations tend to have higher coverage rates.
Population and Per Capita Coverage: NA, for the same reason: the 112 missing values in per capita coverage propagate into the result.
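The NA results can be avoided by restricting the calculation to rows where both variables are present. The following is a minimal sketch on made-up vectors, not the Washdash data; `x`, `y`, and `ok` are hypothetical names, and whether dropping incomplete rows is appropriate for this dataset is an assumption that would need checking.

```r
# Made-up values: y has one missing entry, as per capita coverage does.
x <- c(2010, 2012, 2014, 2016, 2018)
y <- c(0.10, 0.12, NA, 0.15, 0.18)

# Default behaviour: any missing value makes the result NA.
cor_default <- cor(x, y)

# Option 1: let cor() drop incomplete pairs.
cor_complete <- cor(x, y, use = "complete.obs")

# Option 2: filter explicitly, which also excludes NaN and Inf.
ok <- is.finite(x) & is.finite(y)
cor_filtered <- cor(x[ok], y[ok])

print(c(default = cor_default, complete = cor_complete, filtered = cor_filtered))
```

Both options use only the four complete pairs here, so they give the same coefficient.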
Significance: These correlation coefficients provide insights into the relationships between variables. Understanding these relationships helps identify factors influencing sanitation coverage and population distribution.
Further Questions: What specific factors contribute to the positive correlation between population and coverage?
How do demographic, economic, and geographical factors influence sanitation coverage and per capita coverage?
Are there any underlying reasons for the missing or out-of-range data points, and how do they affect the analysis results?
Confidence Interval for Population Mean: I estimate a confidence interval for the population mean at a 95% confidence level. This interval provides a range within which we can be reasonably confident that the true population mean lies, based on the sample data.
Insight: The 95% confidence interval for the population mean is approximately 140.51 million to 158.82 million.
Significance: This confidence interval provides a measure of uncertainty around the estimated population mean. It helps gauge the reliability of the sample data in representing the entire population.
Further Questions: How does the confidence interval change with different confidence levels (e.g., 90%, 99%)?
What factors might contribute to the variability in population estimates, and how can data collection methods be improved to reduce uncertainty?
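The first question can be explored by wrapping the same t-based formula in a function and varying the confidence level. This is a minimal sketch on simulated values, not the Washdash data; `mean_ci` and `sample_pop` are hypothetical names, and the log-normal sample is only an assumed stand-in for a skewed population variable.

```r
# t-based confidence interval for a mean, following the formula in the analysis.
mean_ci <- function(x, confidence_level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  t_value <- qt((1 + confidence_level) / 2, df = n - 1)
  margin_of_error <- t_value * sd(x) / sqrt(n)
  c(lower = mean(x) - margin_of_error, upper = mean(x) + margin_of_error)
}

# Simulated, right-skewed sample (made-up values).
set.seed(123)
sample_pop <- rlnorm(200, meanlog = 15, sdlog = 2)

ci90 <- mean_ci(sample_pop, 0.90)
ci95 <- mean_ci(sample_pop, 0.95)
ci99 <- mean_ci(sample_pop, 0.99)

# Higher confidence levels give wider intervals.
print(rbind(ci90, ci95, ci99))

# Cross-check against base R's t.test().
print(t.test(sample_pop, conf.level = 0.95)$conf.int)
```

The interval widens as the confidence level rises because the t quantile grows while the sample mean and standard error stay fixed.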
Conclusion: While this analysis provides valuable insights into sanitation and drinking water services, further investigation is needed to address data quality issues, validate findings, and explore the relationships between variables. Ongoing monitoring and evaluation are also essential to inform policy decisions and interventions aimed at improving access to essential services for all populations.
# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")

# Set seed for reproducibility
set.seed(123)

# Set the display format to avoid scientific notation
options(scipen = 999)

# Pair 1: Calculate per capita coverage
data$Per_capita_coverage <- data$Coverage / data$Population

# Pair 2: Convert Service Level to an ordered factor
data$Service_Level_Ordered <- factor(
  data$Service.level,
  levels = c("Open defecation", "Limited service", "Basic service",
             "At least basic", "Safely managed service"),
  ordered = TRUE
)

summary(data)
##      Type              Region          Residence.Type     Service.Type
##  Length:3367        Length:3367        Length:3367        Length:3367
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##       Year         Coverage         Population         Service.level
##  Min.   :2010   Min.   :  0.000   Min.   :         0   Length:3367
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:   4366312   Class :character
##  Median :2016   Median : 12.110   Median :  33061569   Mode  :character
##  Mean   :2016   Mean   : 22.447   Mean   : 149665335
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.: 175499562
##  Max.   :2022   Max.   :100.000   Max.   :2173068885
##
##  Per_capita_coverage            Service_Level_Ordered
##  Min.   :0.00000     Open defecation       : 286
##  1st Qu.:0.00000     Limited service       : 748
##  Median :0.00000     Basic service         : 652
##  Mean   :0.00000     At least basic        : 105
##  3rd Qu.:0.00000     Safely managed service: 493
##  Max.   :0.00004     NA's                  :1083
##  NA's   :112
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
# Pair 1: Per_capita_coverage vs. Year
ggplot(data = data, aes(x = Year, y = Per_capita_coverage)) +
  geom_point() +
  labs(title = "Per Capita Coverage Over Time",
       x = "Year",
       y = "Per Capita Coverage")
## Warning: Removed 112 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Pair 2: Population vs. Service Level
ggplot(data = data, aes(x = Service_Level_Ordered, y = Population)) +
  geom_boxplot() +
  labs(title = "Population Across Service Levels",
       x = "Service Level",
       y = "Population")
# Calculate correlation coefficient for Year and Per Capita Coverage
cor_year_per_capita <- cor(data$Year, data$Per_capita_coverage)

# Calculate correlation coefficient for Population and Coverage
cor_population_coverage <- cor(data$Population, data$Coverage)

# Calculate correlation coefficient for Population and Per Capita Coverage
cor_population_per_capita <- cor(data$Population, data$Per_capita_coverage)

# Print correlation coefficients
print(paste("Correlation coefficient for Year and Per Capita Coverage:", cor_year_per_capita))
## [1] "Correlation coefficient for Year and Per Capita Coverage: NA"
print(paste("Correlation coefficient for Population and Coverage:", cor_population_coverage))
## [1] "Correlation coefficient for Population and Coverage: 0.615361436962529"
print(paste("Correlation coefficient for Population and Per Capita Coverage:", cor_population_per_capita))
## [1] "Correlation coefficient for Population and Per Capita Coverage: NA"
# Calculate the sample mean and standard deviation of the population
mean_population <- mean(data$Population)
sd_population <- sd(data$Population)
n <- length(data$Population)

# Choose the desired confidence level (95% confidence level)
confidence_level <- 0.95

# Look up the t-value for the chosen confidence level and degrees of freedom (n - 1)
t_value <- qt((1 + confidence_level) / 2, df = n - 1)

# Calculate the margin of error
margin_of_error <- t_value * (sd_population / sqrt(n))

# Construct the confidence interval
lower_bound <- mean_population - margin_of_error
upper_bound <- mean_population + margin_of_error

# Print the confidence interval
cat("Confidence interval for Population mean (95% confidence level):",
    lower_bound, "-", upper_bound, "\n")
## Confidence interval for Population mean (95% confidence level): 140508790 - 158821879