Access to safe sanitation facilities is a fundamental human right and a key indicator of public health and well-being. However, millions of people worldwide still lack access to adequate sanitation services, leading to significant health risks and environmental challenges. In this project, I aim to leverage data-driven techniques to understand and address the challenges associated with access to safe sanitation.
By analyzing a comprehensive dataset containing information on various factors such as geographical region, residence type, service type, and coverage, I will seek to uncover insights that can inform policy decisions, resource allocation, and interventions aimed at improving sanitation infrastructure and services.
Through statistical modeling, visualization, and interpretation of the data, I will identify patterns, trends, and disparities in access to safe sanitation, and explore the effectiveness of different sanitation interventions and their impact on public health outcomes.
Data Summary and Exploration
Insight:
The summary and head of the data provide an overview of its structure, variables, and the first few observations.
By examining the summary statistics, I aim to understand the distribution and range of each variable and to identify potential issues such as missing values or outliers.
Converting the "Service level" variable into a binary indicator ("binary_service_level") allows us to model the occurrence of the "Safely managed service" category.
Significance:
Understanding the structure of the dataset is crucial for data preprocessing and modeling.
Converting the "Service level" variable into a binary format enables us to perform binary classification tasks, such as predicting whether a service is safely managed or not.
Further Investigations:
Investigate the distribution of each variable to identify potential outliers or data quality issues.
Explore the relationship between variables to understand their interactions and correlations.
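A minimal sketch of these initial checks, using base R and assuming the data object and column names (Coverage, Population) loaded in the code later in this report, could look like:

# Count missing values in each column
colSums(is.na(data))

# Examine the distributions of the numeric variables
hist(data$Coverage, main = "Distribution of Coverage", xlab = "Coverage (%)")
hist(log10(data$Population + 1), main = "Distribution of Population (log scale)",
     xlab = "log10(Population + 1)")

# Correlation between the two numeric variables
cor(data$Coverage, data$Population, use = "complete.obs")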
Logistic Regression Modeling
Insight:
The logistic regression model is built using the binary response variable ("binary_service_level") and explanatory variables ("Region", "Residence_Type", "Service_Type", "Coverage").
The coefficients obtained from the model summary provide insights into the impact of each explanatory variable on the likelihood of a service being safely managed.
A 95% confidence interval is calculated for the "Region" coefficient, indicating the range of values within which the true coefficient is likely to fall. Because "Region" enters the model as a set of dummy variables, the interval applies to the coefficient of a specific region rather than to "Region" as a whole.
Significance:
The logistic regression model helps quantify the relationships between explanatory variables and the binary outcome, aiding in predictive modeling and inference.
Understanding the confidence interval of the coefficients provides insights into the uncertainty associated with the model estimates.
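One way to express both effect size and uncertainty is to exponentiate the coefficients and their Wald confidence intervals into odds ratios. A minimal sketch, assuming the fitted glm object from the code below is available as model:

# Odds ratios with Wald 95% confidence intervals for all coefficients
odds_ratios <- exp(cbind(OR = coef(model), confint.default(model)))
round(odds_ratios, 3)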
Further Investigations:
Assess the significance of each explanatory variable and its impact on the binary outcome.
Evaluate the goodness-of-fit of the logistic regression model and assess its predictive performance using appropriate metrics.
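A minimal sketch of such an assessment, again assuming the fitted object model and the binary_service_level column created in the code below, could use McFadden's pseudo-R-squared and in-sample classification accuracy at a 0.5 probability cutoff:

# McFadden's pseudo-R-squared from the null and residual deviance
pseudo_r2 <- 1 - model$deviance / model$null.deviance
pseudo_r2

# Confusion matrix and accuracy at a 0.5 cutoff (in-sample)
predicted_class <- as.integer(fitted(model) > 0.5)
table(Predicted = predicted_class, Observed = data$binary_service_level)
mean(predicted_class == data$binary_service_level)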
Scatterplot Visualization
Insight:
Scatterplots of "Coverage" vs. "Binary Service Level" and "Population" vs. "Binary Service Level" visualize the relationship between the explanatory variables and the binary response.
These plots allow for visual inspection of any potential patterns or trends in the data.
Significance:
Scatterplots help identify any linear or non-linear relationships between the explanatory variables and the binary outcome.
Visualizing the data aids in determining if transformations or additional variables are needed for model improvement.
Further Investigations:
Explore additional relationships between variables through scatterplot matrices or pairwise plots.
Investigate any outliers or clusters in the data that may impact model performance.
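A possible starting point for this, using base R graphics and the variables created in the code below, is a scatterplot matrix plus simple boxplots for outlier screening:

# Scatterplot matrix of the numeric variables and the binary response
pairs(data[, c("Coverage", "Population", "binary_service_level")],
      main = "Pairwise relationships")

# Boxplots to screen for outliers in the numeric variables
boxplot(data$Coverage, main = "Coverage", ylab = "Coverage (%)")
boxplot(data$Population, main = "Population", ylab = "Population")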
Code and Output
# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")
summary(data)
##      Type              Region          Residence.Type     Service.Type      
##  Length:3367        Length:3367        Length:3367        Length:3367       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       Year         Coverage          Population         Service.level     
##  Min.   :2010   Min.   :  0.000   Min.   :0.000e+00   Length:3367       
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:4.366e+06   Class :character  
##  Median :2016   Median : 12.110   Median :3.306e+07   Mode  :character  
##  Mean   :2016   Mean   : 22.447   Mean   :1.497e+08                     
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.:1.755e+08                     
##  Max.   :2022   Max.   :100.000   Max.   :2.173e+09                     
head(data)
##   Type                     Region Residence.Type   Service.Type Year Coverage
## 1  sdg Australia and New Zealand           total     Sanitation 2010  5.40789
## 2  sdg Australia and New Zealand           total     Sanitation 2010  0.00000
## 3  sdg Australia and New Zealand           total     Sanitation 2010  0.00000
## 4  sdg Australia and New Zealand           total     Sanitation 2010 94.58855
## 5  sdg Australia and New Zealand           total     Sanitation 2010  0.00356
## 6  sdg Australia and New Zealand           total Drinking water 2010 99.90241
##     Population          Service.level
## 1 1.425817e+06          Basic service
## 2 0.000000e+00        Limited service
## 3 0.000000e+00        Open defecation
## 4 2.493875e+07 Safely managed service
## 5 9.382345e+02             Unimproved
## 6 2.633978e+07         At least basic
options(scipen = 999)

# Convert "Service level" into binary variable
data$binary_service_level <- as.integer(data$Service.level == "Safely managed service")

# Select the binary response variable
binary_response <- data$binary_service_level

# Select explanatory variables
explanatory_variables <- c("Region", "Residence_Type", "Service_Type", "Coverage")

# Convert categorical variables into factors
data$Region <- as.factor(data$Region)
data$Residence_Type <- as.factor(data$Residence.Type)
data$Service_Type <- as.factor(data$Service.Type)

# Build logistic regression model
model <- glm(binary_response ~ Region + Residence_Type + Service_Type + Coverage,
             data = data, family = binomial)

# Interpret coefficients
summary(model)
## 
## Call:
## glm(formula = binary_response ~ Region + Residence_Type + Service_Type + 
##     Coverage, family = binomial, data = data)
## 
## Coefficients:
##                                          Estimate Std. Error z value
## (Intercept)                            -11.937597   0.638303 -18.702
## RegionCentral and Southern Asia          6.314541   0.475179  13.289
## RegionEastern and South-Eastern Asia     6.051834   0.492755  12.282
## RegionEurope and Northern America        5.352273   0.640049   8.362
## RegionLatin America and the Caribbean    5.237538   0.453593  11.547
## RegionNorthern Africa and Western Asia   4.956446   0.465147  10.656
## RegionOceania                            4.302086   0.454547   9.465
## RegionSub-Saharan Africa                 6.937023   0.493999  14.043
## Residence_Typetotal                      0.515518   0.194166   2.655
## Residence_Typeurban                      0.894654   0.200123   4.471
## Service_TypeHygiene                    -21.387634 358.970665  -0.060
## Service_TypeSanitation                   0.719919   0.171236   4.204
## Coverage                                 0.114257   0.005355  21.337
##                                                    Pr(>|z|)    
## (Intercept)                            < 0.0000000000000002 ***
## RegionCentral and Southern Asia        < 0.0000000000000002 ***
## RegionEastern and South-Eastern Asia   < 0.0000000000000002 ***
## RegionEurope and Northern America      < 0.0000000000000002 ***
## RegionLatin America and the Caribbean  < 0.0000000000000002 ***
## RegionNorthern Africa and Western Asia < 0.0000000000000002 ***
## RegionOceania                          < 0.0000000000000002 ***
## RegionSub-Saharan Africa               < 0.0000000000000002 ***
## Residence_Typetotal                                 0.00793 ** 
## Residence_Typeurban                               0.0000078 ***
## Service_TypeHygiene                                 0.95249    
## Service_TypeSanitation                            0.0000262 ***
## Coverage                               < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2804.4  on 3366  degrees of freedom
## Residual deviance: 1104.6  on 3354  degrees of freedom
## AIC: 1130.6
## 
## Number of Fisher Scoring iterations: 18
# Calculate confidence interval for a coefficient
coef_summary <- summary(model)$coefficients
coef_index <- match("Region", rownames(coef_summary)) # Find the row index of the coefficient
coef <- coef_summary[coef_index, "Estimate"]
std_err <- coef_summary[coef_index, "Std. Error"]

# For a 95% confidence interval
z_value <- qnorm(0.975)
margin_of_error <- z_value * std_err
lower_bound <- coef - margin_of_error
upper_bound <- coef + margin_of_error

# Translate the meaning of the CI
cat("The 95% confidence interval for the coefficient of Region is [",
    round(lower_bound, 3), ", ", round(upper_bound, 3), "].\n")
## The 95% confidence interval for the coefficient of Region is [ NA , NA ].
cat("This means that we are 95% confident that the true value of the coefficient for Region falls within this interval.\n")
## This means that we are 95% confident that the true value of the coefficient for Region falls within this interval.
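The interval above is reported as NA because "Region" enters the model as a factor, so the coefficient table contains one row per region level (e.g. "RegionCentral and Southern Asia") and no row named "Region". A sketch of how the interval for one specific region coefficient could be extracted (the region is chosen here purely for illustration):

# Match the full dummy-coefficient name produced by the factor encoding
coef_name <- "RegionCentral and Southern Asia"
coef_index <- match(coef_name, rownames(coef_summary))

coef <- coef_summary[coef_index, "Estimate"]
std_err <- coef_summary[coef_index, "Std. Error"]
margin_of_error <- qnorm(0.975) * std_err

cat("The 95% confidence interval for", coef_name, "is [",
    round(coef - margin_of_error, 3), ",",
    round(coef + margin_of_error, 3), "].\n")

# Equivalent Wald intervals for all coefficients: confint.default(model)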
# Load necessary libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
# Scatterplot for Coverage vs. binary_service_level
ggplot(data, aes(x = Coverage, y = binary_service_level)) +
  geom_point() +
  labs(x = "Coverage", y = "Binary Service Level") +
  ggtitle("Scatterplot of Coverage vs. Binary Service Level")
# Scatterplot for Population vs. binary_service_level
ggplot(data, aes(x = Population, y = binary_service_level)) +
  geom_point() +
  labs(x = "Population", y = "Binary Service Level") +
  ggtitle("Scatterplot of Population vs. Binary Service Level")
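To make the binary response easier to read against Coverage, one possible refinement (a sketch using standard ggplot2 layers on the same data) is to jitter the points vertically and overlay a univariate logistic fit:

# Jittered points with a univariate logistic fit overlaid for reference
ggplot(data, aes(x = Coverage, y = binary_service_level)) +
  geom_jitter(height = 0.05, alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Coverage", y = "Binary Service Level") +
  ggtitle("Coverage vs. Binary Service Level with fitted logistic curve")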