The dataset covers water quality across more than 41,000 US zip codes. It is derived from 20 federal and state sources, such as the EPA SDWIS, EPA ECHO, UCMR5, CCR, state MCL databases, CDC blood lead, FEMA flood claims, Census housing age, and USGS groundwater. The data has 42679 observations and 25 variables, but only a select few are used to answer the question: home safety score, contaminant count, population, enforcement action count, and water source. I chose this topic because it is important to see which factors go into deciding the safety score of a home, and whether the factors that dictate the score are the ones that matter to the people they affect.
library(tidyverse)
library(car)
setwd("C:/Users/wesle/Downloads")
hsds <- read_csv("zipcheckup-water-quality.csv")
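# Keep only DMV zip codes, the model variables, and the lead/copper columns
hsdsdmv <- hsds |>
  filter(state %in% c("MD", "VA", "DC")) |>
  select(c("home_safety_score", "contaminant_count", "population", "enforcement_action_count", "water_source", "copper_level_mg_l", "lead_level_mg_l")) |>
  # Recode water source as binary: 0 = surface water ("SW"), 1 = anything else
  mutate(water_source = ifelse(water_source %in% c("SW"), 0, 1))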
I filtered the data to only include states in the DMV (Maryland, Virginia, and DC) to focus on a smaller sample and to see how the results could affect me and those around me. I changed water source to be binary: 0 for surface water, 1 for ground water. I also selected the five previously mentioned variables, plus the copper and lead levels, to make later work with the dataset easier.
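One detail of the recode is worth checking: `ifelse(water_source %in% c("SW"), 0, 1)` sends every value that is not exactly "SW" to 1, including missing values and any other source codes. The sketch below illustrates this on a toy data frame. Only the "SW" code comes from the code above; "GW" and "GU" are hypothetical codes used purely for illustration.

```r
# Toy data: "SW" is the only code confirmed by the analysis;
# "GW" and "GU" are hypothetical stand-ins for other source types.
toy <- data.frame(water_source = c("SW", "GW", "SW", NA, "GU"))

# Same recode as in the analysis: "SW" -> 0, everything else -> 1.
# Note that NA %in% "SW" is FALSE, so NA is also mapped to 1.
toy$water_binary <- ifelse(toy$water_source %in% c("SW"), 0, 1)

# Cross-tabulate the original codes against the recode to see
# which raw values ended up in each group.
recode_check <- table(toy$water_source, toy$water_binary, useNA = "ifany")
print(recode_check)
```

Running the same cross-tabulation on the raw `water_source` column before recoding would confirm whether group 1 really contains only ground water.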
colSums(is.na(hsdsdmv))
## home_safety_score contaminant_count population
## 0 0 0
## enforcement_action_count water_source copper_level_mg_l
## 0 0 2103
## lead_level_mg_l
## 1028
hsdsdmv <- hsdsdmv |>
  select(!c("copper_level_mg_l", "lead_level_mg_l"))
None of the five chosen variables have NA values. However, the other variables I had planned to use, the lead and copper levels in the water, have many NA values: copper is NA for all but 26 of the observations, and lead is NA for almost half of them. For that reason I have chosen not to use those variables.
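The NA counts are easier to judge as proportions. With the counts shown above and the 2129 rows implied by the regression output (2124 residual degrees of freedom plus 5 coefficients), copper is missing in roughly 98.8% of rows (2103/2129) and lead in roughly 48.3% (1028/2129). A minimal sketch of the same calculation, using a small made-up data frame rather than the real data:

```r
# Made-up values for illustration only; column names mirror the dataset.
toy <- data.frame(
  home_safety_score = c(60, 72, 55, 48),
  copper_level_mg_l = c(NA, NA, 0.3, NA),
  lead_level_mg_l   = c(0.002, NA, NA, 0.004)
)

# is.na() gives a logical matrix; column means are the share of NAs.
na_share <- colMeans(is.na(toy))
print(round(na_share, 2))
```

Applied to the real data, `colMeans(is.na(hsdsdmv))` would give the shares directly, provided it is run before the two columns are dropped.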
hist(hsdsdmv$home_safety_score)
mlrm <- lm(home_safety_score ~ contaminant_count + population + enforcement_action_count + water_source, data = hsdsdmv)
crPlots(mlrm)
All variables except enforcement_action_count seem to have good or acceptable linearity, since they mostly follow the dotted line. Population drifts away from it a bit at the start and end of its plot, but it seems close enough. Enforcement action count is not near the dotted line for most of the plot and is only close to it at the start and end, which could indicate some non-linearity. Overall, though, linearity seems okay.
plot(resid(mlrm), type = "b",
     main = "Residuals vs Order", ylab = "Residuals")
abline(h = 0, lty = 2)
The residuals seem mostly centered around zero, but a few areas are off. The section from 0 to 500 sits almost entirely at 0, the section from 1500 to 1750 is much further from zero, closer to twenty, and the section from 1750 to 2000 is closer to negative twenty instead. Apart from these sections, there doesn't seem to be a pattern.
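A visual check of residuals against row order can be backed up with the Durbin-Watson statistic, which measures first-order correlation between neighboring residuals. Values near 2 suggest little correlation, while values near 0 or 4 suggest positive or negative correlation. The sketch below computes it by hand on simulated data, not the water-quality data; the `car` package loaded earlier also provides `durbinWatsonTest()` for a fitted `lm`. Because the rows here are zip codes rather than time points, the result mostly reflects how the file happens to be sorted.

```r
# Simulated example with independent errors (not the real dataset).
set.seed(1)
x <- rnorm(200)
y <- 2 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

# Durbin-Watson statistic: sum of squared differences between
# consecutive residuals, divided by the sum of squared residuals.
e  <- resid(fit)
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)
```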
par(mfrow=c(2,2)); plot(mlrm); par(mfrow=c(1,1))
Residuals vs Fitted: Seems okay for the most part. There isn't any curvature at any point in the plot, but it does seem to trend downwards.
Homoscedasticity: The spread of the residuals does seem to be somewhat constant, making a box-like shape.
Scale-Location: The spread appears to increase a bit but still stays somewhat constant.
Q-Q: The only deviation is near the right tail, where the points go upwards and then back down right before the end of the plot.
Residuals vs Leverage: There are a couple of high-leverage points, but none appear to be influential.
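The judgment about influential points can also be checked numerically with Cook's distance and leverage (hat values), both available in base R for `lm` fits. The sketch below uses simulated data with one deliberately extreme x value. The cutoff of 4/n is a common rule of thumb and an assumption here, not something the plots above define.

```r
# Simulated data: 49 ordinary points plus one extreme x value.
set.seed(2)
x <- c(rnorm(49), 8)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

cd  <- cooks.distance(fit)  # influence of each point on the fitted coefficients
lev <- hatvalues(fit)       # leverage of each point

# Flag points above the 4/n rule-of-thumb cutoff (an assumed threshold).
flagged <- which(cd > 4 / length(cd))
print(head(sort(cd, decreasing = TRUE), 3))
```

The extreme point always has the highest leverage in this setup, but it is only influential if it also pulls the fit, which is the same distinction the Residuals vs Leverage plot draws.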
cor(hsdsdmv)
## home_safety_score contaminant_count population
## home_safety_score 1.00000000 0.1269712 -0.09816382
## contaminant_count 0.12697124 1.0000000 0.32449901
## population -0.09816382 0.3244990 1.00000000
## enforcement_action_count 0.36776138 0.6690223 0.15436652
## water_source 0.04676857 -0.2546402 -0.39536038
## enforcement_action_count water_source
## home_safety_score 0.3677614 0.04676857
## contaminant_count 0.6690223 -0.25464022
## population 0.1543665 -0.39536038
## enforcement_action_count 1.0000000 -0.27698509
## water_source -0.2769851 1.00000000
Most of the correlation coefficients between the variables don't appear to be strong, as their absolute values don't exceed 0.4. However, enforcement action count and contaminant count are moderately correlated, with a value of 0.669, meaning that dropping one of them could possibly be beneficial.
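Pairwise correlations can miss collinearity that involves several predictors at once, so a variance inflation factor (VIF) is a useful follow-up. For predictor j, VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. As a rough guide, a pairwise correlation of 0.669 on its own corresponds to a VIF of about 1.8, although the actual values in a four-predictor model will differ. The sketch below computes VIFs by hand on simulated predictors (`x1`, `x2`, `x3` are made up); `car::vif()` does the same job for a fitted model.

```r
# Simulated predictors: x2 is deliberately correlated with x1.
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the remaining predictors
# and convert the resulting R-squared to 1 / (1 - R^2).
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Values close to 1 mean a predictor is nearly unrelated to the others. Thresholds such as 5 or 10 are often quoted as signs of trouble, but they are conventions rather than hard rules.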
summary(mlrm)
##
## Call:
## lm(formula = home_safety_score ~ contaminant_count + population +
## enforcement_action_count + water_source, data = hsdsdmv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.076 -9.363 0.815 6.966 40.563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.936e+01 5.328e-01 111.414 < 2e-16 ***
## contaminant_count -2.386e+00 3.888e-01 -6.138 9.97e-10 ***
## population -2.266e-06 6.524e-07 -3.472 0.000526 ***
## enforcement_action_count 1.742e+00 8.924e-02 19.516 < 2e-16 ***
## water_source 3.750e+00 6.938e-01 5.405 7.21e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.33 on 2124 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1835
## F-statistic: 120.6 on 4 and 2124 DF, p-value: < 2.2e-16
The p-values for the overall model and for each of the variables within it are all very small, far lower than the default α of 0.05. Holding the other variables fixed, each additional contaminant is associated with a home safety score about 2.4 points lower, and each additional person in the population with a score about 2.3e-06 points lower. Enforcement action count and water source are both associated with higher scores: about 1.7 points per additional enforcement action, and about 3.8 points for ground water compared with surface water. However, the adjusted R-squared is 0.1835, which means that this model only explains around 18.35% of the variance in home safety scores. This means that these variables aren't the greatest fit for predicting home safety scores.
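Because the population coefficient is per person, it is easier to read after rescaling. The sketch below copies the estimates from the summary output and works through two pieces of arithmetic: the expected change per 10,000 additional residents, and the combined change for a hypothetical zip code with two more contaminants and one more enforcement action. The scenario is an illustration, not a result from the data.

```r
# Coefficient estimates copied from the summary(mlrm) output.
coefs <- c(
  contaminant_count        = -2.386,
  population               = -2.266e-06,
  enforcement_action_count =  1.742,
  water_source             =  3.750
)

# Expected change in score per 10,000 additional residents.
effect_per_10k <- coefs[["population"]] * 10000

# Hypothetical scenario: +2 contaminants and +1 enforcement action,
# with population and water source unchanged.
scenario <- 2 * coefs[["contaminant_count"]] +
  1 * coefs[["enforcement_action_count"]]

print(c(per_10k_residents = effect_per_10k, scenario_change = scenario))
```

The population effect works out to about -0.023 points per 10,000 residents, so it is statistically detectable but very small next to the residual standard error of 13.33.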
While the multiple linear regression model showed that contaminant count, water source, population, and enforcement action count together explain only a small share of the variation in home safety score, it does help in understanding that they are still small factors that are taken into account. Contaminant count is a factor many people may find important when looking at safety, so its being a somewhat weak predictor can be surprising or concerning. Better variables, such as lead and copper levels in mg/L, would have been good to have, but as shown before they had too many NA values to be useful. Future research and analyses can focus on those factors, as those two are most likely stronger predictors of home safety scores.