Evaluation of Multiple Linear Regression Model for Sanitation Coverage Prediction

Introduction:
Sanitation coverage is a critical indicator of public health and environmental well-being, with access to adequate sanitation facilities being fundamental for disease prevention and community health. To gain insights into the factors influencing sanitation coverage levels, this study employs a data-driven approach to evaluate a multiple linear regression model.
The dataset used in this analysis contains 3,367 records spanning 2010 to 2022, with fields for region, residence type, service type, service level, population, and coverage. From this dataset, we construct a multiple linear regression model that predicts coverage from three predictors: population size, service type, and year.
This study aims to assess the performance and validity of the regression model through thorough evaluation using diagnostic plots. These plots include Residuals vs Fitted Plot, Normal Q-Q Plot, Scale-Location Plot, Residuals vs Leverage Plot, and Cook's Distance Plot, each providing valuable insights into the model's adherence to key assumptions and potential areas for improvement.

Multiple Linear Regression Model Building:
The multiple linear regression model is built using the lm() function. The response variable Coverage is regressed on the predictor variables Population, Service.Type, and Year.
The model output summary provides coefficients, standard errors, t-values, and p-values for each predictor variable.
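As an illustration of how the coefficient table can be pulled out of a fitted model for further inspection, the following is a minimal sketch on simulated data. The data frame `sim`, its values, and the service type label "Drinking water" are hypothetical stand-ins; only the column names and model formula mirror those used in this report.

```r
# Simulated stand-in for the real data (values are hypothetical).
set.seed(1)
n <- 300
sim <- data.frame(
  Population   = runif(n, 0, 2e9),
  Service.Type = sample(c("Drinking water", "Hygiene", "Sanitation"),
                        n, replace = TRUE),
  Year         = sample(2010:2022, n, replace = TRUE)
)
sim$Coverage <- 10 + 5e-08 * sim$Population +
  5 * (sim$Service.Type == "Hygiene") + rnorm(n, sd = 20)

# Same formula as the model in this report
fit <- lm(Coverage ~ Population + Service.Type + Year, data = sim)

# Coefficient table as a numeric matrix with columns
# Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))
print(round(coef_table, 4))

# lm() dummy-codes a character predictor; the alphabetically first
# level becomes the reference category that the others are compared to
ref_level <- levels(factor(sim$Service.Type))[1]

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)
print(ci)
```

Because Service.Type is categorical, its coefficients are read as differences from the reference level rather than as slopes.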

Diagnostic Plots Evaluation:
Residuals vs Fitted Plot: This plot helps assess the linearity assumption. Ideally, the residuals should be randomly scattered around the horizontal line at zero, indicating that there's no systematic pattern in the residuals.
Normal Q-Q Plot: This plot checks the normality assumption of residuals. The points on the plot should fall approximately along the diagonal line, suggesting that the residuals are normally distributed.
Scale-Location Plot: This plot examines the homoscedasticity assumption. It plots the square root of the absolute standardized residuals against the fitted values to check whether the spread of residuals is consistent across the range of fitted values. A roughly horizontal trend line with evenly scattered points indicates constant variance of residuals.
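Residuals vs Leverage Plot: This plot helps identify influential data points that might have a disproportionate impact on the regression model. Points far to the right have high leverage, and points that fall outside the dashed Cook's distance contours combine high leverage with a large residual and are likely to be influential.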
Cook's Distance Plot: Cook's distance measures the influence of each observation on the regression coefficients. Large values of Cook's distance indicate influential observations that might warrant further investigation.
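The leverage and Cook's distance plots can be complemented with numeric checks. The sketch below uses simulated data with one deliberately planted unusual observation; the cut-offs 2p/n for leverage and 4/n for Cook's distance are common rules of thumb, assumed here for illustration rather than taken from this analysis.

```r
# Simulated data with one planted influential point (hypothetical values)
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
x[n] <- 8     # extreme predictor value gives high leverage
y[n] <- -20   # far from the trend gives a large residual
fit <- lm(y ~ x)

lev   <- hatvalues(fit)       # leverage: diagonal of the hat matrix
cooks <- cooks.distance(fit)  # influence on the fitted coefficients
rstd  <- rstandard(fit)       # standardized residuals

p <- length(coef(fit))        # number of estimated coefficients

# Rule-of-thumb thresholds (assumptions, not hard limits)
high_leverage <- which(lev > 2 * p / n)
influential   <- which(cooks > 4 / n)

# Five observations with the largest Cook's distance
print(head(sort(cooks, decreasing = TRUE), 5))
```

Applied to the model in this report, the same calls on `lm_model` would list which rows to inspect before deciding whether to keep, correct, or exclude them.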

Insights:
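The coefficients and their significance levels describe the relationship between each predictor and the response variable. In the fitted model, Population has a positive and highly significant coefficient (5.774e-08 per person, p < 2e-16), and the Hygiene service type averages about 5.7 units higher coverage than the reference service type (p = 9.29e-08). The Sanitation service type and Year coefficients are not significant at the 0.05 level, and the model explains roughly 39% of the variance in coverage (R-squared = 0.3858).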
Diagnostic plots help assess the assumptions of the multiple linear regression model.
By examining these plots, we can identify potential issues such as non-linearity, non-normality, heteroscedasticity, and influential observations.
Further investigation is needed to address any violations of assumptions, consider interaction terms, and explore other variables not included in the current model.

Further Questions:
Are there any transformations needed for predictor variables to better meet the assumptions?
Are there additional variables that could improve the model's predictive power?
How robust is the model to potential outliers or influential observations?
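One way to approach the transformation question is to fit the same model with and without a transformed predictor and compare the fits. The sketch below does this on simulated data in which coverage is generated from the log of population, so the result says nothing about the real dataset. `log1p()` is used as an assumption because the Population column has a minimum of zero, where `log()` would return -Inf.

```r
# Simulated, right-skewed population values (hypothetical)
set.seed(7)
n <- 500
sim <- data.frame(Population = round(rlnorm(n, meanlog = 17, sdlog = 2)))

# By construction, coverage depends on log population here
sim$Coverage <- 5 + 2 * log1p(sim$Population) + rnorm(n, sd = 5)

fit_raw <- lm(Coverage ~ Population, data = sim)
fit_log <- lm(Coverage ~ log1p(Population), data = sim)

# The response is identical in both models, so AIC values are comparable
comparison <- data.frame(
  model  = c("raw", "log1p"),
  adj_r2 = c(summary(fit_raw)$adj.r.squared,
             summary(fit_log)$adj.r.squared),
  aic    = AIC(fit_raw, fit_log)$AIC
)
print(comparison)
```

A higher adjusted R-squared and lower AIC for the transformed model, together with cleaner diagnostic plots, would support keeping the transformation.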

In conclusion, the multiple linear regression analysis gave me useful insights into the factors associated with coverage. Population size and the Hygiene service type (relative to the reference service type) showed statistically significant relationships with coverage, whereas the Sanitation service type and Year were not significant at the 0.05 level. Because the model explains less than 40% of the variance in coverage, further exploration is needed to better understand these relationships and to improve the predictive accuracy of the model.

# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")

# Summary of the data
summary(data)

##      Type              Region          Residence.Type     Service.Type      
##  Length:3367        Length:3367        Length:3367        Length:3367       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       Year         Coverage         Population        Service.level     
##  Min.   :2010   Min.   :  0.000   Min.   :0.000e+00   Length:3367       
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:4.366e+06   Class :character  
##  Median :2016   Median : 12.110   Median :3.306e+07   Mode  :character  
##  Mean   :2016   Mean   : 22.447   Mean   :1.497e+08                     
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.:1.755e+08                     
##  Max.   :2022   Max.   :100.000   Max.   :2.173e+09

# Build multiple linear regression model
lm_model <- lm(Coverage ~ Population + Service.Type + Year, data = data)
summary(lm_model)

## 
## Call:
## lm(formula = Coverage ~ Population + Service.Type + Year, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.285 -12.483  -8.141   6.917  86.984 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.068e+02  1.901e+02   1.088    0.277    
## Population              5.774e-08  1.302e-09  44.349  < 2e-16 ***
## Service.TypeHygiene     5.737e+00  1.072e+00   5.352 9.29e-08 ***
## Service.TypeSanitation -8.655e-01  7.555e-01  -1.146    0.252    
## Year                   -9.595e-02  9.429e-02  -1.018    0.309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.27 on 3362 degrees of freedom
## Multiple R-squared:  0.3858, Adjusted R-squared:  0.3851 
## F-statistic:   528 on 4 and 3362 DF,  p-value: < 2.2e-16

# Diagnostic plots for multiple linear regression model evaluation
par(mfrow = c(2, 3))  # 2 x 3 grid so that all five plots fit on one page

# Residuals vs Fitted Plot
plot(lm_model, which = 1)

# Normal Q-Q Plot
plot(lm_model, which = 2)

# Scale-Location Plot (square root of |standardized residuals| vs fitted values)
plot(lm_model, which = 3)

# Residuals vs Leverage Plot
plot(lm_model, which = 5)

# Cook's Distance Plot
plot(lm_model, which = 4)

# Reset the layout
par(mfrow = c(1, 1))