This webpage and subsequent code may be periodically updated as new analyses are conducted and improved on. Planned updates include omitting water rates as a variable in order to observe a larger number of water systems and the relationship between their water quality and socioeconomic/demographic characteristics. This may also help to address potential simultaneous causality concerns. Additional updates include observing how distance affects spatial relationships, using duration of violations as a dependent variable instead of number of violations, observing how different water sources (groundwater vs. surface water) impacts water quality, and sub-setting the data by specific types of contaminants. These analyses were not included in the full report due to time and data constraints.
The second research question of our project seeks to explore the relationship between drinking water quality and prices in order to help the US Environmental Protection Agency (EPA) understand the necessary data to collect to create a national drinking water system screening tool. This analysis will address this research question by evaluating the correlation between the number of health-based SDWIS violations over the past 10 years and annual water rates, as well as with other variables, such as demographic and housing characteristics, and the percent of disadvantaged communities in a CWS. This analysis will evaluate these relationships for all of California’s CWS, those whose communities are more than 10% disadvantaged communities, and ones that are considered Small and Very Small CWS. It will also consider the spatial concentration of these relationships.
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
#tinytex::install_tinytex()
#install.packages("kableExtra")
#install.packages(c("cowplot", "googleway", "ggplot2", "ggrepel", "ggspatial", "libwgeom", "sf", "rnaturalearth", "rnaturalearthdata"))
#install.packages("maps")
#install.packages("jtools")
#install.packages("countreg", repos="http://R-Forge.R-project.org")
#install.packages("spaMM")
#install.packages("remotes")
#install.packages("CRAN")
#devtools::install_github("chris31415926535/bivariatechoropleths")
#install.packages("stringr")
library(kableExtra)
library(tinytex)
library(readxl)
library(RColorBrewer)
library(biscale)
library(classInt)
library(pastecs)
library(psych)
library(tidycensus)
library(tidyverse)
library(deldir)
library(geojsonio)
library(geojson)
library(geojsonlint)
library(geojsonsf)
library(leaflet)
library(leafpop)
library(leaflet.providers)
library(leafem)
library(maptools)
library(mapview)
library(raster)
library(rgdal)
library(s2)
library(sf)
library(shiny)
library(spData)
library(spdep)
library(svglite)
library(terra)
library(dplyr)
library(tigris)
library(rmapshaper)
library(ggplot2)
library(sfdep)
library(car)
library(cowplot)
library(collapse)
library(extrafont)
library(ggplot2)
library(ggspatial)
library(patchwork)
library(scico)
library(googleway)
library(ggrepel)
library(ggspatial)
library(rnaturalearth)
library(rnaturalearthdata)
library(maps)
library(jtools)
library(spatialreg)
library(msme)
library(MASS)
library(countreg)
library(spaMM)
library(viridis)
library(remotes)
library(bivariatechoropleths)
library("stringr")
The data for this analysis comes from multiple sources. California’s CWS boundaries data was requested and gathered from the California Water Boards System Area Boundary Layer (SABL) webpage (California Water Resources Control Board, 2022). Information on the number of health-based violations came from the U.S. EPA Safe Drinking Water Information System (SDWIS). Annual water rates and other data presented in the CA HR2W tool were requested and obtained from the authors of the OEHHA HR2W report and tool (OEHHA, 2021). Census tract boundaries were obtained from the U.S. Census Bureau (2018). Census tract demographic and socioeconomic information as well as data on the age of housing infrastructure was downloaded from Social Explorer and based on tables from the American Community Survey (ACS) five year estimates (U.S. Census Bureau, 2020).
The data was transformed and merged into one dataset by matching CWS identifiers in R. However, CWS boundaries first had to be matched to census tracts to attribute demographic information of the census tract to the water system. This was completed using a spatial join in ArcGIS Pro. The joined dataset was then exported to R, where census tract data was merged by census tract ID and water system data was merged by CWS ID. At this stage, we used the definitions for disadvantaged and severely disadvantaged communities from the OEHHA HR2W tool to identify disadvantaged communities in each CWS. This was done by identifying whether or not their median household income was less than $49,454 for disadvantaged community status or less than $37,091 for severely disadvantaged community status (OEHHA, 2021). Lastly, to create the final dataset, the data points were aggregated to the CWS level using a combination of averages and sums for each observation within a CWS. Therefore, median income, population density, and the Gini coefficient variables are averages of all the census tracts within the CWS. Percent of the CWS population that is non-white, the percent of housing stock that are rental units, and the percent of housing stock that was built before 1980 are all true estimates calculated from the total population and housing stock information aggregated from all census tracts serviced by the CWS. Percent of communities that are disadvantaged communities was calculated by summing the number of disadvantaged communities serviced by a CWS and dividing it by the total number of communities the CWS services. Annual water rates, number of service connections, and whether or not a CWS purchases water is already reported at the CWS level from CA OEHHA. The CWS violations represent the number of health-based violations that a CWS has had since 2012. This was calculated by observing how many health-based contaminants have exceeded their maximum contaminant level (MCL) over a 10-year period (2012 to 2022). This is the same water quality information used in the production of the sample drinking water screening tool for New Jersey. A final data list can be found in the Appendix.
Various methods were used to approach this statistical analysis. Initially, the distribution and spatial concentration of the data were explored using exploratory data analysis and exploratory spatial data analysis methods. These include calculating descriptive statistics; creating boxplots, histograms, and scatterplots; and developing choropleth and bivariate maps. Additionally, we estimated a Global Moran’s I statistic for the primary variables of interest – annual water rates and the number of health-based violations. These were useful in determining the structure of the data, determining whether or not any of the variables needed to be transformed to create a normal distribution, and identifying any potential patterns, relationships, and spatial clustering in our main variables of interest.
Next, we estimated a negative binomial model because our dependent variable, the number of health-based violations, is count data that cannot be estimated using a basic linear model, such as Ordinary Least Squares. We estimated a full model with all relevant variables that we were able to collect. The explanatory variables of the full model include annual water rates, average median income, population density, the number of service connections for each CWS (a measure of size), whether or not the CWS purchases water, the percentage of the population that is non-white, percentage of communities that are disadvantaged communities, percentage of housing units that are rental units, percentage of housing stock built before 1980, and the average Gini coefficient (a measure of inequality) for each CWS.
After this full model was estimated, we used various methods to evaluate the fit of the model and whether or not spatial dependency needs to be incorporated into the regression with a spatial weights matrix. These diagnostic methods included the Akaike Information Criterion (AIC) and the Moran’s I of the residuals. AIC is a method that identifies the fit of the model. For AIC, a backward step method was used, which removes one variable at a time from the initial model to determine the combination of variables with the best fit. This best fit is determined through finding the model that has the lowest AIC value. After settling on the final negative binomial model, we observed a plot of the fitted values compared to observed values to visualize the fit of the model. Then, we used a Moran’s I test of the residuals of the model to determine whether or not spatial dependency still persisted in the residuals. Finally, we utilized a conditional autoregressive (CAR) spatial model to account for the remaining spatial autocorrelation and analyzed the fit of this model. After conducting this analysis for the full dataset, we repeated these steps for the data filtered by percentage of communities that are disadvantaged communities and CWS that are classified as small and very small.
In the CAR model, rho (ρ) represents the coefficient on the spatial weights matrix (W), which indicates the strength above spatial autocorrelation that influences the response variable. This model was chosen because it estimates the probability of values at a certain location occurring, conditional on neighboring values. This is the appropriate spatial model to use because there are potential spatial spillover effects from explanatory variables or unmeasurable influences from neighboring CWS that influence the water quality in a specific CWS. With the matrix of weights of the neighbors of each CWS, we are able to account for those spillover effects, and make the model more accurate. We determined CWS neighbors by selecting the three nearest CWS.
# Estimating number of violations
vios <- read_xlsx("data//nj_ca_vios.xlsx", sheet = "Sheet1", col_names = TRUE)
ca_vios <- vios[str_detect(vios$`PWS ID`, "CA"),]
ca_vios$current <- ifelse(ca_vios$`Compliance Status` == "Known" | ca_vios$`Compliance Status` == "Open", 1, 0)
ca_vios$closed <- ifelse(ca_vios$`Compliance Status`=="Returned to Compliance", 1, 0)
## Limiting to post-2012
ca_vios <- ca_vios[ca_vios$`Compliance Period Begin Date` >= "2012-01-01",]
ca_vios$vioyear <- format(ca_vios$`Compliance Period Begin Date`, format = "%Y")
ca_vios <- ca_vios %>% relocate(vioyear, .after = `Compliance Period End Date`)
ca_vios$rtcyear <- format(ca_vios$`RTC Date`, format = "%Y")
## Aggregate violations to year level (originally quarter)
ca_vios_new <- collap(ca_vios, ~ `PWS ID` + `Rule Name` + `Violation Type` +
`Contaminant Name` + `Compliance Status` + vioyear,
custom = list(fmean = c("closed", "current", "Public Notification Tier",
"Severity Indicator Count"), flast = c("rtcyear")))
ca_vios_new$vioyear <- as.numeric(ca_vios_new$vioyear)
ca_vios_new$rtcyear <- as.numeric(ca_vios_new$rtcyear)
ca_vios_new$yeartocomp <- ca_vios_new$rtcyear - ca_vios_new$vioyear
## Aggregating violations to CWS level
ca_vios_count <- collap(ca_vios_new, ~ `PWS ID`, custom = list(fsum = c("current", "closed")))
ca_vios_count$total_vios <- ca_vios_count$current + ca_vios_count$closed
write.csv(ca_vios_count, file = "data//caviolations.csv", col.names = TRUE)
# Loading in CWS boundary, CWS data, and census tract data
system_data <- read_xlsx("data\\CalHRTWv1_FinalDataSpreadsheet.xlsx", col_names = TRUE)
systems <- st_read("data\\ca_systems.geojson")
## Reading layer `ca_systems' from data source
## `C:\Users\kelly\OneDrive\Desktop\MPP\Praticum\Analysis\data\ca_systems.geojson'
## using driver `GeoJSON'
## Simple feature collection with 3017 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.3161 ymin: 32.53425 xmax: -114.1686 ymax: 41.99871
## Geodetic CRS: WGS 84
tracts <- st_read("data\\tract2020.geojson")
## Reading layer `tract2020' from data source
## `C:\Users\kelly\OneDrive\Desktop\MPP\Praticum\Analysis\data\tract2020.geojson'
## using driver `GeoJSON'
## Simple feature collection with 9129 features and 2 fields (with 20 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.4096 ymin: 32.53444 xmax: -114.1312 ymax: 42.00948
## Geodetic CRS: WGS 84
census_data <- read.csv("tracts\\cleaned_census_data.csv", header = TRUE)
ca_violations <- read.csv("data\\caviolations.csv", header = TRUE) ##this was generated above
system_data <- subset(system_data, `Annual Water Cost at 6 HCF`>0) #limits the water systems to those that have annual water cost data
system_data <- system_data %>% rename(pwsid=PWSID)
census_data$FIPS <- paste0("0", census_data$FIPS)
census_data <- census_data %>% rename(GEOID = FIPS)
census_data <- subset(census_data, select = -c(white))
census_data[3:42] <- lapply(census_data[3:42], as.numeric)
ca_sys <- left_join(system_data, systems, by = "pwsid")
#st_write(ca_sys, "newsystems2.shp") ## checking in ArcGIS to make sure geographies look correct and join with census tracts
joined_tracts <- read.csv("data\\joinedOEHHA.csv", header = TRUE) ##attribute table created in ArcGIS that links CWS to census tracts
systems <- systems %>% dplyr::select(pwsid, geometry)
ca_sys <- left_join(joined_tracts, systems, by = "pwsid")
ca_sys <- rename(ca_sys, cwsboundaries = geometry )
ca_violations <- rename(ca_violations, pwsid = PWS.ID)
ca_violations <- ca_violations[,2:5]
ca_sys$GEOID <- paste0("0", ca_sys$GEOID)
# Finally joining all data to get final large dataset
reg.df <- left_join(ca_sys, census_data, by = "GEOID")
reg.df <- reg.df %>% dplyr::select(pwsid, system_name, pop_served, num_contam_he, max_dur_he, num_mcl_vio,
max_dur_mcl_vio, prchw, num_sources, water_type,
emerg_source, over_draft_basin,
AR_MH, mdhi, monthly, annual, disadv_cs, num_mrvio, sys_size,
connections, population,
fed_class, state_class, service_connect, GEOID, cwsboundaries, 51:91)
# Cleaning and transforming final large dataset
## Estimating the number of disadvantaged communities serviced by a CWS
reg.df$disad_c <- ifelse(reg.df$med_inc < 49454, 1, 0) #once data is aggregated to the CWS level this variable will be summed, giving the total number of disadvantaged communities in a CWS. This will then be divided by the total number of communities serviced by a CWS to get the percentage of disadvantaged communities.
reg.df$sev_disad <- ifelse(reg.df$med_inc < 37091, 1, 0)
reg.df$communities <- 1
## Aggregating data to the CWS level
reg.df2 <- collap(reg.df, ~ pwsid,
custom = list(fmean = 3:7,
fmean = 13:16,
fmean = 18,
fmean = 20:21,
fmean = 24,
fmean = 27,
fsum = 28:50,
fmean = 51:53,
fsum = 54:65,
fmean = 66:67,
fsum = 68:70,
flast = c("system_name", "prchw", "num_sources", "water_type",
"emerg_source", "over_draft_basin", "sys_size",
"fed_class", "state_class")))
reg.df2 <- left_join(reg.df2, systems, by = "pwsid") #adds in CWS geography
reg.df2[26:32] <- (reg.df2[26:32]/reg.df2[,25])*100
reg.df2$non_white <- rowSums(reg.df2[,27:32])
reg.df2 <- reg.df2 %>% relocate(non_white, .after = white.1)
reg.df2[35:41] <- (reg.df2[35:41]/reg.df2[,34])*100
reg.df2[46:48] <- (reg.df2[46:48]/reg.df2[,42])*100 ##no. employed, unemployed, or not in LF / total LF
reg.df2$renter_units_perc <- (reg.df2$renter_units/reg.df2$housing_units)*100
reg.df2[54:63] <- (reg.df2[54:63]/reg.df2[,53])*100 ##percent of rental units by age
reg.df2$pre1980 <- rowSums(reg.df2[,59:63])
reg.df2$post1980 <- rowSums(reg.df2[,54:58])
reg.df2$disad_per <- (reg.df2$disad_c/reg.df2$communities)*100
reg.df2$sev_per <- (reg.df2$sev_disad/reg.df2$communities)*100
reg.df2$prchw <- ifelse(reg.df2$prchw == "Yes", 1, 0)
reg.df2$over_draft_basin <- ifelse(reg.df2$over_draft_basin == "Yes", 1, 0)
reg.df2$emerg_source <- ifelse(reg.df2$emerg_source == "Yes", 1, 0)
reg.df2$joint_dis <- ifelse(reg.df2$disad_c == 1 & reg.df2$sev_disad == 1, 1, 0)
reg.df2 <- reg.df2 %>% filter(!st_is_empty(geometry))
# Loading in general SDWIS information about all CWS in California to compare sample population to total population
system_eval <- read.csv("CA_WaterSystemDetail.csv", header = TRUE)
system_eval <- system_eval %>% dplyr::select(PWS.ID, Population.Served.Count)
system_eval$Population.Served.Count <- as.numeric(gsub(",", "", system_eval$Population.Served.Count, fixed = TRUE))
system_eval$size <- ifelse(system_eval$Population.Served.Count <= 200, "Very Small",
ifelse(system_eval$Population.Served.Count <= 999 & system_eval$Population.Served.Count >= 201, "Small",
ifelse(system_eval$Population.Served.Count <= 3300 & system_eval$Population.Served.Count >= 1000, "Medium",
ifelse(system_eval$Population.Served.Count > 3300, "Large/Very Large", NA))))
reg.df2$size <- ifelse(reg.df2$sys_size == "Very Small (1-200)", 1,
ifelse(reg.df2$sys_size == "Small (201-999)", 2,
ifelse(reg.df2$sys_size == "Medium (1000-3,300)", 3,
ifelse(reg.df2$sys_size == "Large/Very Large(>3300)", 4, NA))))
reg.df2$sys_size <- ifelse(reg.df2$sys_size == "Large/Very Large(>3300)", "Lg(>3300)",
ifelse(reg.df2$sys_size == "Very Small (1-200)", "V.Sm(1-200)",
ifelse(reg.df2$sys_size == "Small (201-999)", "Sm(201-999)",
ifelse(reg.df2$sys_size == "Medium (1000-3,300)", "Md(1000-3300)", NA))))
table(reg.df2$sys_size)
##
## Lg(>3300) Md(1000-3300) Sm(201-999) V.Sm(1-200)
## 352 181 209 397
# Removing N/A and repeat observations
reg.df2 <- na.omit(reg.df2)
## Finding repeat observations
which(reg.df2$pwsid == c("CA0110011"))
## [1] 8 9
which(reg.df2$pwsid == c("CA0910013"))
## [1] 66 67
which(reg.df2$pwsid == c("CA1010007"))
## [1] 81 82
which(reg.df2$pwsid == c("CA1910034"))
## [1] 251 252
which(reg.df2$pwsid == c("CA4410002"))
## [1] 876 877
which(reg.df2$pwsid == c("CA4410014"))
## [1] 881 882
which(reg.df2$pwsid == c("CA4410017"))
## [1] 883 884
which(reg.df2$pwsid == c("CA4810003"))
## [1] 916 917
which(reg.df2$pwsid == c("CA5010018"))
## [1] 992 993
which(reg.df2$pwsid == c("CA5510009"))
## [1] 1057 1058
reg.df2 <- reg.df2[-c(9,67,82,252,877,882,884,917,993,1058),]
reg.df2 <- reg.df2[,-c(4,5,7)]
reg.df2 <- left_join(reg.df2, ca_violations, by = "pwsid")
reg.df2$total_vios <- as.numeric(reg.df2$total_vios)
reg.df2["total_vios"][is.na(reg.df2["total_vios"])] <- 0
#st_write(reg.df2, dsn = "data/final/finaldata.geojson", append = FALSE) #final dataset available to download
#st_write(reg.df2, dsn = "data/final/finaldata.shp")
The initial exploration of the data revealed many interesting patterns. First, the descriptive statistics are displayed in Table 1. This shows that more than half of systems do not have any health-based violations and the average annual water rate in CA for the minimum water needed to live is $580. The following figures show the distribution and relationship of multiple variables. For the distribution of violations in Figure X, it is clear that there are a large number of 0 violations. This distribution shows that a negative binomial regression model would be best to use. The following figure shows the distribution of water rates. Despite the right skew of this variable, a negative binomial model does not require the transformation of skewed data.
sf_use_s2(FALSE)
options(scipen=999)
# Creating descriptive statistics table
violations <- stat.desc(reg.df2$total_vios)
annual <- stat.desc(reg.df2$annual)
percent_disad <- stat.desc(reg.df2$disad_per)
percent_severe_disad <- stat.desc(reg.df2$sev_per)
median_income <- stat.desc(reg.df2$med_inc)
gini <- stat.desc(reg.df2$gini)
rental_units <- stat.desc(reg.df2$renter_units_perc)
housing_pre1980 <- stat.desc(reg.df2$pre1980)
pop_density <- stat.desc(reg.df2$popdensity)
size <- stat.desc(reg.df2$connections)
purchase_water <- stat.desc(reg.df2$prchw)
non_white <- stat.desc(reg.df2$non_white)
stats <- t(data.frame(violations, annual, median_income, non_white, percent_disad, gini, rental_units, housing_pre1980, pop_density, size, purchase_water))
stats <- stats[,-c(3,6,7,10:12,14)]
kable(stats, digits = 2, row.names = TRUE, col.names = c("N", "Number of Null",
"Min", "Max", "Median",
"Mean", "Std. Dev."),
align = "c", label = "Descriptive Statistics", caption = "Table X: Statistics for variables considered in this analysis")
N | Number of Null | Min | Max | Median | Mean | Std. Dev. | |
---|---|---|---|---|---|---|---|
violations | 1108 | 848 | 0.00 | 26.00 | 0.00 | 0.86 | 2.61 |
annual | 1108 | 0 | 175.44 | 2108.88 | 493.20 | 580.42 | 324.34 |
median_income | 1108 | 0 | 18387.00 | 247679.00 | 74655.00 | 80776.95 | 32018.62 |
non_white | 1108 | 0 | 1.29 | 91.10 | 25.58 | 29.97 | 17.58 |
percent_disad | 1108 | 682 | 0.00 | 100.00 | 0.00 | 18.42 | 32.15 |
gini | 1108 | 0 | 0.21 | 0.65 | 0.43 | 0.43 | 0.05 |
rental_units | 1108 | 0 | 2.24 | 78.00 | 27.80 | 28.84 | 13.45 |
housing_pre1980 | 1108 | 5 | 0.00 | 100.00 | 57.04 | 55.54 | 19.01 |
pop_density | 1108 | 0 | 1.06 | 33403.60 | 523.15 | 2441.94 | 3769.51 |
size | 1108 | 0 | 1.00 | 686540.00 | 777.50 | 7637.03 | 29442.83 |
purchase_water | 1108 | 732 | 0.00 | 1.00 | 0.00 | 0.34 | 0.47 |
hist(reg.df2$total_vios, freq=FALSE, main = "Figure X: Number of Violations", xlab="MCL of Violations")
hist(reg.df2$annual, freq=FALSE, main = "Figure X: Distribution of Annual Water Rates", xlab="Annual Water Rates")
The next set of graphs display the relationships between different variables to start to explore whether or not certain variables should be included in the regression analysis. For the relationship between violations and water rates, as shown in Figure X, there appears to be more health-based violations for systems with lower water rates, although it is not a stark pattern. This trend indicates that there could be a relationship between the two variables. A more drastic relationship appears when looking at the relationship between the number of health-based violations and the median income of the water system, as shown in Figure X. Other variables the literature suggests are potentially important factors contributing to water quality are system size and disadvantaged community status. To explore the relationship of these variables with violations and water rates, we created a box plot of the number of violations per different system size classification and plotted the number of violations and water rates against the percent of disadvantaged communities in a CWS (Figures X through X). The boxplots relating water system size to the variables of interest show that small and very small systems have higher numbers of violations as well as higher water rates. While literature had suggested that these may be important variables to investigate, these figures solidify that there is a potential relationship between these variables and the response variable. There is not a clear pattern that arises from either plot comparing disadvantaged community status to the two variables of interest, however they are still included in the initial regression model because the literature emphasizes their significance.
plot(reg.df2$annual, reg.df2$total_vios, main = "Figure X: Violations by Water Rates",
xlab="Annual Water Rates", ylab="MCL Violations")
plot(reg.df2$med_inc, reg.df2$total_vios, main = "Figure X: Violations by Median Income",
xlab="Average Median Income", ylab="Violations")
plot(reg.df2$med_inc, reg.df2$annual, main = "Figure X: Water Rates by Median Income",
xlab="Average Median Income", ylab="Annual Water Rate")
boxplot(reg.df2$total_vios~reg.df2$sys_size, xlab = "System Size", ylab = "MCL Violations", show.names = TRUE)
boxplot(reg.df2$annual~reg.df2$sys_size, xlab = "System Size", ylab = "Annual Water Rates", show.names = TRUE)
plot(reg.df2$disad_per, reg.df2$annual, main = "Figure X: Water Rates by Disadvantaged Communities",
xlab="Percent Disadv. Communities", ylab="Water rates")
plot(reg.df2$disad_per, reg.df2$total_vios, main = "Figure X: Violations by Disadvantaged Communities",
xlab="Percent Disadv. Communities", ylab="MCL Violations")
Finally, as a part of the data exploration, it is important to show the difference between the full population of community water systems in California and the sample data that is used in this analysis. One of the best ways to show the difference between the two is to look at the distribution of data among different sizes of systems. Table X shows that the data used in the analysis represents a disproportionately higher percentage of medium and large systems compared to small and very small systems. This may limit the external validity of our results and highlights deficiencies in data collection for smaller systems.
#Comparing sample systems to full population of systems
actual_size <- table(system_eval$size)
sample_size <- table(reg.df2$sys_size)
size_comp <- data.frame(actual_size, sample_size)
size_comp <- size_comp[,-c(1)]
size_comp <- rename(size_comp, "Size" = "Var1.1")
size_comp <- size_comp %>% relocate(Size, .before = "Freq")
size_comp <- rename(size_comp, "Actual" = "Freq")
size_comp <- rename(size_comp, "Sample" = "Freq.1")
kable(size_comp, align = "l", caption = "Comparison of the breakdown of sample CWS sizes to population of CWS sizes", label = "Sample v. Population CWS Size Breakdown")
Size | Actual | Sample |
---|---|---|
Lg(>3300) | 660 | 344 |
Md(1000-3300) | 285 | 178 |
Sm(201-999) | 679 | 201 |
V.Sm(1-200) | 1237 | 385 |
Next, we explored the spatial component of our data through choropleth maps and a spatial dependency statistic. The maps below (Figures X-X) show the distribution and concentration of our two main variables of interest, the number of health-based violations and annual water rates across California (the full, interactive maps can be seen by following the links in the figure notes). The map reflects similar ideas discussed above and in the literature that smaller systems face higher numbers of health-based violations, as shown by the concentration of CWS with high numbers of violations in California’s central valley. Both figures also show that there is potentially spatial clustering of violations and water rates. These maps solidify the hypothesis that both system size and spatial clustering should be considered in regression analyses.
Figure X shows a bivariate choropleth map that combines the two previous variables. If we expected to see that high water bills led to fewer violations and vice versa, there should be mainly colors along the gradient going from top left to bottom right of the matrix key. This map does not show any clear pattern in terms of the gradient; however it does look like systems with a similar relationship between water rates and violations are clustered near each other. We verify this spatial clustering by using a Global Moran’s I statistic.
map_dat <- reg.df2[,c(1,2,13,15,66,76)]
map_dat <- sf::st_as_sf(map_dat)
vio_bins <- c(-1, 0, 1, 10, 26)
pal <- colorBin("Reds", domain = reg.df2$total_vios, bins = vio_bins)
labels <- sprintf(
"<strong>%s</strong><br/><strong>%s</strong><br/>%g MCL Violations<br/>%s",
reg.df2$system_name, reg.df2$pwsid, reg.df2$total_vios, reg.df2$sys_size
) %>% lapply(htmltools::HTML)
violationmap <- leaflet() %>% addTiles() %>%
addPolygons(data = map_dat$geometry, fillOpacity= 0.7, fillColor = pal(reg.df2$total_vios),
color = "grey", weight=1, highlightOptions = leaflet::highlightOptions(
weight = 1,
color = "#666",
fillOpacity = 0.7),
label = labels,
labelOptions = labelOptions(
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "15px",
direction = "auto")) %>%
addLegend("bottomright", pal = pal, values = reg.df2$total_vios,
title = "Number of Violations",
opacity = 1)
violationmap