Introduction:
Sanitation coverage is a critical indicator of public health and environmental well-being, with access to adequate sanitation facilities being fundamental for disease prevention and community health. To gain insights into the factors influencing sanitation coverage levels, this study employs a data-driven approach to evaluate a multiple linear regression model.
The dataset utilized in this analysis contains comprehensive information on various parameters such as population demographics, service types, and temporal trends in sanitation coverage. Leveraging this dataset, we construct a multiple linear regression model to predict sanitation coverage based on key predictors including population size, service type, and year.
This study aims to assess the performance and validity of the regression model through thorough evaluation using diagnostic plots. These plots include Residuals vs Fitted Plot, Normal Q-Q Plot, Scale-Location Plot, Residuals vs Leverage Plot, and Cook's Distance Plot, each providing valuable insights into the model's adherence to key assumptions and potential areas for improvement.
Multiple Linear Regression Model Building:
The multiple linear regression model is built using the lm() function. The response variable Coverage is regressed on the predictor variables Population, Service.Type, and Year.
The model output summary provides coefficients, standard errors, t-values, and p-values for each predictor variable.
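As a minimal sketch of how the quantities in the summary output can be pulled out programmatically, the example below fits the same formula to a small simulated data frame. The simulated values, the object names `sim`, `fit`, and `coef_table`, and the baseline level name "Drinking water" are assumptions for illustration only; they are not the study data.

```r
# Simulated stand-in data (NOT the study data); column names mirror the report.
set.seed(1)
n <- 200
sim <- data.frame(
  Population   = runif(n, 1e5, 1e8),
  # "Drinking water" is an assumed name for the baseline level
  Service.Type = factor(sample(c("Drinking water", "Hygiene", "Sanitation"),
                               n, replace = TRUE)),
  Year         = sample(2010:2022, n, replace = TRUE)
)
sim$Coverage <- 20 + 3e-7 * sim$Population + rnorm(n, sd = 10)

# Same model formula as in the report
fit <- lm(Coverage ~ Population + Service.Type + Year, data = sim)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))
print(signif(coef_table, 3))

# 95% confidence intervals for each coefficient
print(confint(fit, level = 0.95))

# Share of variance explained
print(summary(fit)$r.squared)
print(summary(fit)$adj.r.squared)
```

Because Service.Type is categorical, lm() represents it with one indicator column per non-baseline level, which is why the table has separate rows for Hygiene and Sanitation.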
Diagnostic Plots Evaluation:
Residuals vs Fitted Plot: This plot helps assess the linearity assumption. Ideally, the residuals should be randomly scattered around the horizontal line at zero, indicating that there's no systematic pattern in the residuals.
Normal Q-Q Plot: This plot checks the normality assumption of residuals. The points on the plot should fall approximately along the diagonal line, suggesting that the residuals are normally distributed.
Scale-Location Plot: This plot examines the homoscedasticity assumption by plotting the square root of the absolute standardized residuals against the fitted values. A roughly horizontal trend line with points spread evenly along it indicates constant variance of residuals.
Residuals vs Leverage Plot: This plot helps identify influential data points that might have a disproportionate impact on the regression model. Points far to the right have high leverage, and points that fall outside the dashed Cook's distance contours are influential.
Cook's Distance Plot: Cook's distance measures the influence of each observation on the regression coefficients. Large values of Cook's distance indicate influential observations that might warrant further investigation.
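The leverage and Cook's distance plots can be complemented with a numeric check. The sketch below uses simulated data with one deliberately planted outlier; the data, the object names, and the cutoffs 2p/n (leverage) and 4/n (Cook's distance) are assumptions, and those cutoffs are common rules of thumb rather than formal tests.

```r
# Simulated data (NOT the study data) with one planted influential point
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
x[1] <- 6    # far from the other x values: high leverage
y[1] <- -10  # far from the true line: large residual
fit <- lm(y ~ x)

lev  <- hatvalues(fit)       # leverage of each observation
cook <- cooks.distance(fit)  # influence on the fitted coefficients
p    <- length(coef(fit))    # number of estimated coefficients

# Rule-of-thumb cutoffs (conventions, not strict tests)
high_leverage <- which(lev > 2 * p / n)
influential   <- which(cook > 4 / n)

print(high_leverage)
print(influential)

# Standardized residuals for the flagged observations
print(rstandard(fit)[influential])
```

Flagged observations are candidates for closer inspection, not automatic removal.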
Insights:
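The coefficients and their significance levels describe the relationship between the predictors and the response variable: Population (p < 2e-16) and the Hygiene level of Service.Type (p = 9.29e-08) are statistically significant, while the Sanitation level (p = 0.252) and Year (p = 0.309) are not. The model explains about 39% of the variance in Coverage (adjusted R-squared 0.3851).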
Diagnostic plots help assess the assumptions of the multiple linear regression model.
By examining these plots, we can identify potential issues such as non-linearity, non-normality, heteroscedasticity, and influential observations.
Further investigation is needed to address any violations of assumptions, to consider interaction terms, and to explore other variables not included in the current model.
Further Questions:
Are there any transformations needed for predictor variables to better meet the assumptions?
Are there additional variables that could improve the model's predictive power?
How robust is the model to potential outliers or influential observations?
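One way to start on the first and third questions is sketched below, using simulated data rather than the study data. It compares a raw and a log-transformed predictor, then refits after dropping observations flagged by Cook's distance. The data, the object names, the use of `log1p()` (chosen because Population has a minimum of 0 in the summary output), and the 4/n cutoff are all assumptions for illustration.

```r
# Simulated, strongly right-skewed predictor (NOT the study data)
set.seed(7)
n <- 300
sim <- data.frame(Population = exp(rnorm(n, mean = 15, sd = 2)))
sim$Coverage <- 5 + 2 * log1p(sim$Population) + rnorm(n, sd = 3)

# Transformation check: raw vs. log1p-transformed predictor
fit_raw <- lm(Coverage ~ Population, data = sim)
fit_log <- lm(Coverage ~ log1p(Population), data = sim)
print(c(raw = summary(fit_raw)$adj.r.squared,
        log = summary(fit_log)$adj.r.squared))

# Robustness check: refit without observations above the 4/n rule of thumb
keep     <- cooks.distance(fit_log) <= 4 / n
fit_trim <- lm(Coverage ~ log1p(Population), data = sim[keep, ])
print(rbind(all = coef(fit_log), trimmed = coef(fit_trim)))
```

If the coefficients change little after trimming, the fit is not being driven by a handful of observations; a large change would suggest examining those observations individually.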
In conclusion, the multiple linear regression analysis gave me useful insight into the factors associated with sanitation coverage. The model showed statistically significant relationships between coverage and both population size and the Hygiene service type, but not the Sanitation service type or year. Further exploration is needed to better understand the complexities of these relationships and to improve the predictive accuracy of the model.
# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")

# Summary of the data
summary(data)
##      Type              Region          Residence.Type     Service.Type
##  Length:3367        Length:3367        Length:3367        Length:3367
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##       Year         Coverage          Population        Service.level
##  Min.   :2010   Min.   :  0.000   Min.   :0.000e+00   Length:3367
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:4.366e+06   Class :character
##  Median :2016   Median : 12.110   Median :3.306e+07   Mode  :character
##  Mean   :2016   Mean   : 22.447   Mean   :1.497e+08
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.:1.755e+08
##  Max.   :2022   Max.   :100.000   Max.   :2.173e+09
# Build multiple linear regression model
lm_model <- lm(Coverage ~ Population + Service.Type + Year, data = data)
summary(lm_model)
##
## Call:
## lm(formula = Coverage ~ Population + Service.Type + Year, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -51.285 -12.483  -8.141   6.917  86.984
##
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             2.068e+02  1.901e+02   1.088    0.277
## Population              5.774e-08  1.302e-09  44.349  < 2e-16 ***
## Service.TypeHygiene     5.737e+00  1.072e+00   5.352 9.29e-08 ***
## Service.TypeSanitation -8.655e-01  7.555e-01  -1.146    0.252
## Year                   -9.595e-02  9.429e-02  -1.018    0.309
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.27 on 3362 degrees of freedom
## Multiple R-squared:  0.3858, Adjusted R-squared:  0.3851
## F-statistic:   528 on 4 and 3362 DF,  p-value: < 2.2e-16
# Diagnostic plots for multiple linear regression model evaluation
par(mfrow = c(2, 2))  # Set the layout for multiple plots

# Residuals vs Fitted Plot
plot(lm_model, which = 1)

# Normal Q-Q Plot
plot(lm_model, which = 2)

# Scale-Location Plot (square root of absolute standardized residuals vs fitted values)
plot(lm_model, which = 3)

# Residuals vs Leverage Plot
plot(lm_model, which = 5)

# Cook's Distance Plot
plot(lm_model, which = 4)

# Reset the layout
par(mfrow = c(1, 1))