Access to safe sanitation facilities is a fundamental human right and a key indicator of public health and well-being. However, millions of people worldwide still lack access to adequate sanitation services, leading to significant health risks and environmental challenges. In this project, I aim to leverage data-driven techniques to understand and address the challenges associated with access to safe sanitation.
By analyzing a comprehensive dataset containing information on various factors such as geographical region, residence type, service type, and coverage, I will seek to uncover insights that can inform policy decisions, resource allocation, and interventions aimed at improving sanitation infrastructure and services.
Through statistical modeling, visualization, and interpretation of the data, I will identify patterns, trends, and disparities in access to safe sanitation, and explore the effectiveness of different sanitation interventions and their impact on public health outcomes.
Data Summary and Exploration
Insight:
The summary and head of the data provide an overview of its structure, variables, and the first few observations.
By examining the summary statistics, I aim to understand the distribution and range of each variable and to identify potential issues such as missing values or outliers.
Converting the "Service level" variable into a binary indicator ("binary_service_level") allows us to model the occurrence of the "Safely managed service" category.
Significance:
Understanding the structure of the dataset is crucial for data preprocessing and modeling.
Converting the "Service level" variable into a binary format enables us to perform binary classification tasks, such as predicting whether a service is safely managed or not.
Further Investigations:
Investigate the distribution of each variable to identify potential outliers or data quality issues.
Explore the relationship between variables to understand their interactions and correlations.
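A minimal sketch of these initial checks, using base R and assuming the data object and column names (Coverage, Population) loaded in the code later in this report, could look like:

# Count missing values in each column
colSums(is.na(data))

# Examine the distributions of the numeric variables
hist(data$Coverage, main = "Distribution of Coverage", xlab = "Coverage (%)")
hist(log10(data$Population + 1), main = "Distribution of Population (log scale)",
     xlab = "log10(Population + 1)")

# Correlation between the two numeric variables
cor(data$Coverage, data$Population, use = "complete.obs")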
Logistic Regression Modeling
Insight:
The logistic regression model is built using the binary response variable ("binary_service_level") and explanatory variables ("Region", "Residence_Type", "Service_Type", "Coverage").
The coefficients obtained from the model summary provide insights into the impact of each explanatory variable on the likelihood of a service being safely managed.
A 95% confidence interval is calculated for the "Region" coefficient, indicating the range of values within which the true coefficient is likely to fall. Because "Region" enters the model as a set of dummy variables, the interval applies to the coefficient of a specific region rather than to "Region" as a whole.
Significance:
The logistic regression model helps quantify the relationships between explanatory variables and the binary outcome, aiding in predictive modeling and inference.
Understanding the confidence interval of the coefficients provides insights into the uncertainty associated with the model estimates.
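One way to express both effect size and uncertainty is to exponentiate the coefficients and their Wald confidence intervals into odds ratios. A minimal sketch, assuming the fitted glm object from the code below is available as model:

# Odds ratios with Wald 95% confidence intervals for all coefficients
odds_ratios <- exp(cbind(OR = coef(model), confint.default(model)))
round(odds_ratios, 3)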
Further Investigations:
Assess the significance of each explanatory variable and its impact on the binary outcome.
Evaluate the goodness-of-fit of the logistic regression model and assess its predictive performance using appropriate metrics.
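A minimal sketch of such an assessment, again assuming the fitted object model and the binary_service_level column created in the code below, could use McFadden's pseudo-R-squared and in-sample classification accuracy at a 0.5 probability cutoff:

# McFadden's pseudo-R-squared from the null and residual deviance
pseudo_r2 <- 1 - model$deviance / model$null.deviance
pseudo_r2

# Confusion matrix and accuracy at a 0.5 cutoff (in-sample)
predicted_class <- as.integer(fitted(model) > 0.5)
table(Predicted = predicted_class, Observed = data$binary_service_level)
mean(predicted_class == data$binary_service_level)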
Scatterplot Visualization
Insight:
Scatterplots of "Coverage" vs. "Binary Service Level" and "Population" vs. "Binary Service Level" visualize the relationship between the explanatory variables and the binary response.
These plots allow for visual inspection of any potential patterns or trends in the data.
Significance:
Scatterplots help identify any linear or non-linear relationships between the explanatory variables and the binary outcome.
Visualizing the data aids in determining if transformations or additional variables are needed for model improvement.
Further Investigations:
Explore additional relationships between variables through scatterplot matrices or pairwise plots.
Investigate any outliers or clusters in the data that may impact model performance.
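A possible starting point for this, using base R graphics and the variables created in the code below, is a scatterplot matrix plus simple boxplots for outlier screening:

# Scatterplot matrix of the numeric variables and the binary response
pairs(data[, c("Coverage", "Population", "binary_service_level")],
      main = "Pairwise relationships")

# Boxplots to screen for outliers in the numeric variables
boxplot(data$Coverage, main = "Coverage", ylab = "Coverage (%)")
boxplot(data$Population, main = "Population", ylab = "Population")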
Code and Output
# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")
summary(data)
##      Type              Region          Residence.Type     Service.Type      
##  Length:3367        Length:3367        Length:3367        Length:3367       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       Year         Coverage          Population         Service.level     
##  Min.   :2010   Min.   :  0.000   Min.   :0.000e+00   Length:3367       
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:4.366e+06   Class :character  
##  Median :2016   Median : 12.110   Median :3.306e+07   Mode  :character  
##  Mean   :2016   Mean   : 22.447   Mean   :1.497e+08                     
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.:1.755e+08                     
##  Max.   :2022   Max.   :100.000   Max.   :2.173e+09                     
head(data)
##   Type                     Region Residence.Type   Service.Type Year Coverage
## 1  sdg Australia and New Zealand           total     Sanitation 2010  5.40789
## 2  sdg Australia and New Zealand           total     Sanitation 2010  0.00000
## 3  sdg Australia and New Zealand           total     Sanitation 2010  0.00000
## 4  sdg Australia and New Zealand           total     Sanitation 2010 94.58855
## 5  sdg Australia and New Zealand           total     Sanitation 2010  0.00356
## 6  sdg Australia and New Zealand           total Drinking water 2010 99.90241
##     Population          Service.level
## 1 1.425817e+06          Basic service
## 2 0.000000e+00        Limited service
## 3 0.000000e+00        Open defecation
## 4 2.493875e+07 Safely managed service
## 5 9.382345e+02             Unimproved
## 6 2.633978e+07         At least basic
options(scipen = 999)

# Convert "Service level" into binary variable
data$binary_service_level <- as.integer(data$Service.level == "Safely managed service")

# Select the binary response variable
binary_response <- data$binary_service_level

# Select explanatory variables
explanatory_variables <- c("Region", "Residence_Type", "Service_Type", "Coverage")

# Convert categorical variables into factors
data$Region <- as.factor(data$Region)
data$Residence_Type <- as.factor(data$Residence.Type)
data$Service_Type <- as.factor(data$Service.Type)

# Build logistic regression model
model <- glm(binary_response ~ Region + Residence_Type + Service_Type + Coverage,
             data = data, family = binomial)

# Interpret coefficients
summary(model)
## 
## Call:
## glm(formula = binary_response ~ Region + Residence_Type + Service_Type + 
##     Coverage, family = binomial, data = data)
## 
## Coefficients:
##                                          Estimate Std. Error z value
## (Intercept)                            -11.937597   0.638303 -18.702
## RegionCentral and Southern Asia          6.314541   0.475179  13.289
## RegionEastern and South-Eastern Asia     6.051834   0.492755  12.282
## RegionEurope and Northern America        5.352273   0.640049   8.362
## RegionLatin America and the Caribbean    5.237538   0.453593  11.547
## RegionNorthern Africa and Western Asia   4.956446   0.465147  10.656
## RegionOceania                            4.302086   0.454547   9.465
## RegionSub-Saharan Africa                 6.937023   0.493999  14.043
## Residence_Typetotal                      0.515518   0.194166   2.655
## Residence_Typeurban                      0.894654   0.200123   4.471
## Service_TypeHygiene                    -21.387634 358.970665  -0.060
## Service_TypeSanitation                   0.719919   0.171236   4.204
## Coverage                                 0.114257   0.005355  21.337
##                                                    Pr(>|z|)    
## (Intercept)                            < 0.0000000000000002 ***
## RegionCentral and Southern Asia        < 0.0000000000000002 ***
## RegionEastern and South-Eastern Asia   < 0.0000000000000002 ***
## RegionEurope and Northern America      < 0.0000000000000002 ***
## RegionLatin America and the Caribbean  < 0.0000000000000002 ***
## RegionNorthern Africa and Western Asia < 0.0000000000000002 ***
## RegionOceania                          < 0.0000000000000002 ***
## RegionSub-Saharan Africa               < 0.0000000000000002 ***
## Residence_Typetotal                                 0.00793 ** 
## Residence_Typeurban                               0.0000078 ***
## Service_TypeHygiene                                 0.95249    
## Service_TypeSanitation                            0.0000262 ***
## Coverage                               < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2804.4  on 3366  degrees of freedom
## Residual deviance: 1104.6  on 3354  degrees of freedom
## AIC: 1130.6
## 
## Number of Fisher Scoring iterations: 18
# Calculate confidence interval for a coefficient
coef_summary <- summary(model)$coefficients
coef_index <- match("Region", rownames(coef_summary)) # Find the row index of the coefficient
coef <- coef_summary[coef_index, "Estimate"]
std_err <- coef_summary[coef_index, "Std. Error"]

# For a 95% confidence interval
z_value <- qnorm(0.975)
margin_of_error <- z_value * std_err
lower_bound <- coef - margin_of_error
upper_bound <- coef + margin_of_error

# Translate the meaning of the CI
cat("The 95% confidence interval for the coefficient of Region is [",
    round(lower_bound, 3), ", ", round(upper_bound, 3), "].\n")
## The 95% confidence interval for the coefficient of Region is [ NA , NA ].
cat("This means that we are 95% confident that the true value of the coefficient for Region falls within this interval.\n")
## This means that we are 95% confident that the true value of the coefficient for Region falls within this interval.
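The interval above is reported as NA because "Region" enters the model as a factor, so the coefficient table contains one row per region level (e.g. "RegionCentral and Southern Asia") and no row named "Region". A sketch of how the interval for one specific region coefficient could be extracted (the region is chosen here purely for illustration):

# Match the full dummy-coefficient name produced by the factor encoding
coef_name <- "RegionCentral and Southern Asia"
coef_index <- match(coef_name, rownames(coef_summary))

coef <- coef_summary[coef_index, "Estimate"]
std_err <- coef_summary[coef_index, "Std. Error"]
margin_of_error <- qnorm(0.975) * std_err

cat("The 95% confidence interval for", coef_name, "is [",
    round(coef - margin_of_error, 3), ",",
    round(coef + margin_of_error, 3), "].\n")

# Equivalent Wald intervals for all coefficients: confint.default(model)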
# Load necessary libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
# Scatterplot for Coverage vs. binary_service_level
ggplot(data, aes(x = Coverage, y = binary_service_level)) +
  geom_point() +
  labs(x = "Coverage", y = "Binary Service Level") +
  ggtitle("Scatterplot of Coverage vs. Binary Service Level")
# Scatterplot for Population vs. binary_service_level
ggplot(data, aes(x = Population, y = binary_service_level)) +
  geom_point() +
  labs(x = "Population", y = "Binary Service Level") +
  ggtitle("Scatterplot of Population vs. Binary Service Level")
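To make the binary response easier to read against Coverage, one possible refinement (a sketch using standard ggplot2 layers on the same data) is to jitter the points vertically and overlay a univariate logistic fit:

# Jittered points with a univariate logistic fit overlaid for reference
ggplot(data, aes(x = Coverage, y = binary_service_level)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Coverage", y = "Binary Service Level") +
  ggtitle("Coverage vs. Binary Service Level with fitted logistic curve")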