This webpage and its accompanying code may be updated periodically as new analyses are conducted and improved upon. Planned updates include omitting water rates as a variable in order to observe a larger number of water systems and the relationship between their water quality and socioeconomic/demographic characteristics; this may also help address potential simultaneous causality concerns. Additional updates include examining how distance affects spatial relationships, using the duration of violations as the dependent variable instead of the number of violations, examining how different water sources (groundwater vs. surface water) affect water quality, and sub-setting the data by specific types of contaminants. These analyses were not included in the full report due to time and data constraints.

Statistical Analysis

The second research question of our project explores the relationship between drinking water quality and prices in order to help the US Environmental Protection Agency (EPA) understand what data it would need to collect to create a national drinking water system screening tool. This analysis addresses that question by evaluating the correlation between the number of health-based SDWIS violations over the past 10 years and annual water rates, as well as other variables such as demographic and housing characteristics and the percent of disadvantaged communities in a CWS. We evaluate these relationships for all of California's CWS, for those whose service areas are more than 10% disadvantaged communities, and for those classified as Small and Very Small CWS. We also consider the spatial concentration of these relationships.

library(knitr)
knitr::opts_chunk$set(echo = TRUE)
#tinytex::install_tinytex()
#install.packages("kableExtra")

#install.packages(c("cowplot", "googleway", "ggplot2", "ggrepel", "ggspatial", "libwgeom", "sf", "rnaturalearth", "rnaturalearthdata"))
#install.packages("maps")
#install.packages("jtools")
#install.packages("countreg", repos="http://R-Forge.R-project.org")
#install.packages("spaMM")
#install.packages("remotes")
#install.packages("CRAN")
#devtools::install_github("chris31415926535/bivariatechoropleths")
#install.packages("stringr")

library(kableExtra)
library(tinytex)
library(readxl)
library(RColorBrewer)
library(biscale)
library(classInt)
library(pastecs)
library(psych)
library(tidycensus)
library(tidyverse)
library(deldir)
library(geojsonio)
library(geojson)
library(geojsonlint)
library(geojsonsf)
library(leaflet)
library(leafpop)
library(leaflet.providers)
library(leafem)
library(maptools)
library(mapview)
library(raster)
library(rgdal)
library(s2)
library(sf)
library(shiny)
library(spData)
library(spdep)
library(svglite)
library(terra)
library(dplyr)
library(tigris)
library(rmapshaper)
library(ggplot2)
library(sfdep)
library(car)
library(cowplot)
library(collapse)
library(extrafont)
library(ggspatial)
library(patchwork)
library(scico)
library(googleway)
library(ggrepel)
library(rnaturalearth)
library(rnaturalearthdata)
library(maps)
library(jtools)
library(spatialreg)
library(msme)
library(MASS)
library(countreg)
library(spaMM)
library(viridis)
library(remotes)
library(bivariatechoropleths)
library("stringr")

Data

The data for this analysis come from multiple sources. California's CWS boundary data were requested and gathered from the California Water Boards System Area Boundary Layer (SABL) webpage (California Water Resources Control Board, 2022). Information on the number of health-based violations came from the U.S. EPA Safe Drinking Water Information System (SDWIS). Annual water rates and other data presented in the CA HR2W tool were requested and obtained from the authors of the OEHHA HR2W report and tool (OEHHA, 2021). Census tract boundaries were obtained from the U.S. Census Bureau (2018). Census tract demographic and socioeconomic information, as well as data on the age of housing infrastructure, was downloaded from Social Explorer and is based on tables from the American Community Survey (ACS) five-year estimates (U.S. Census Bureau, 2020).

The data were transformed and merged into one dataset by matching CWS identifiers in R. However, CWS boundaries first had to be matched to census tracts to attribute each tract's demographic information to the water system. This was completed using a spatial join in ArcGIS Pro. The joined dataset was then exported to R, where census tract data were merged by census tract ID and water system data were merged by CWS ID. At this stage, we used the definitions of disadvantaged and severely disadvantaged communities from the OEHHA HR2W tool to identify disadvantaged communities in each CWS: a community was flagged as disadvantaged if its median household income was less than $49,454, and as severely disadvantaged if its median household income was less than $37,091 (OEHHA, 2021).

Lastly, to create the final dataset, the data points were aggregated to the CWS level using a combination of averages and sums across the observations within each CWS. Median income, population density, and the Gini coefficient are therefore averages of all the census tracts within the CWS. The percent of the CWS population that is non-white, the percent of housing stock that consists of rental units, and the percent of housing stock built before 1980 are true estimates calculated from the total population and housing stock information aggregated across all census tracts serviced by the CWS. The percent of communities that are disadvantaged was calculated by summing the number of disadvantaged communities serviced by a CWS and dividing by the total number of communities the CWS services. Annual water rates, the number of service connections, and whether or not a CWS purchases water are already reported at the CWS level by CA OEHHA. The CWS violations variable represents the number of health-based violations a CWS has had since 2012, calculated as the number of times a health-based contaminant exceeded its maximum contaminant level (MCL) over the 10-year period from 2012 to 2022. This is the same water quality information used in the production of the sample drinking water screening tool for New Jersey. A final data list can be found in the Appendix.

Methods

Various methods were used in this statistical analysis. Initially, the distribution and spatial concentration of the data were explored using exploratory data analysis and exploratory spatial data analysis methods. These include calculating descriptive statistics; creating boxplots, histograms, and scatterplots; and developing choropleth and bivariate maps. Additionally, we estimated a Global Moran's I statistic for the primary variables of interest, annual water rates and the number of health-based violations. These methods helped us understand the structure of the data, determine whether any variables needed to be transformed toward a normal distribution, and identify potential patterns, relationships, and spatial clustering in our main variables of interest.

Next, we estimated a negative binomial model because our dependent variable, the number of health-based violations, is count data that cannot be estimated using a basic linear model, such as Ordinary Least Squares. We estimated a full model with all relevant variables that we were able to collect. The explanatory variables of the full model include annual water rates, average median income, population density, the number of service connections for each CWS (a measure of size), whether or not the CWS purchases water, the percentage of the population that is non-white, percentage of communities that are disadvantaged communities, percentage of housing units that are rental units, percentage of housing stock built before 1980, and the average Gini coefficient (a measure of inequality) for each CWS.
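
For reference, the negative binomial model accommodates the overdispersion typical of violation counts by relaxing the Poisson assumption that the variance equals the mean. In the parameterization used by glm.nb, which we fit below,

$$\mathrm{Var}(Y) = \mu + \frac{\mu^{2}}{\theta},$$

where $\mu$ is the conditional mean and $\theta$ is the estimated dispersion parameter (as $\theta \to \infty$, the model reduces to the Poisson).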

After this full model was estimated, we used various methods to evaluate its fit and to determine whether spatial dependency needed to be incorporated into the regression with a spatial weights matrix. These diagnostic methods included the Akaike Information Criterion (AIC) and the Moran's I of the residuals. AIC measures the relative fit of a model while penalizing the number of parameters; a lower AIC indicates a better fit. For AIC, a backward step method was used, which removes one variable at a time from the initial model until it finds the combination of variables with the lowest AIC. After settling on the final negative binomial model, we plotted the fitted values against the observed values to visualize the fit of the model. Then, we used a Moran's I test on the residuals of the model to determine whether spatial dependency persisted in the residuals. Finally, we utilized a conditional autoregressive (CAR) spatial model to account for the remaining spatial autocorrelation and analyzed the fit of this model. After conducting this analysis for the full dataset, we repeated these steps for the data filtered by the percentage of communities that are disadvantaged and for CWS classified as small and very small.
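
For reference, AIC trades goodness of fit against model complexity:

$$\mathrm{AIC} = 2k - 2\ln(\hat{L}),$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood, so a variable is dropped in the backward step only when doing so lowers the AIC.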

In the CAR model, rho (ρ) is the coefficient on the spatial weights matrix (W) and indicates the strength of the spatial autocorrelation influencing the response variable. This model was chosen because it estimates the probability of values occurring at a given location conditional on the values at neighboring locations. It is the appropriate spatial model here because there are potential spatial spillover effects from explanatory variables, or unmeasured influences from neighboring CWS, that affect the water quality in a specific CWS. With the matrix of weights over each CWS's neighbors, we are able to account for those spillover effects and make the model more accurate. We defined each CWS's neighbors as its three nearest CWS.
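
A sketch of this specification, assuming the standard adjacency parameterization (spaMM's exact internal parameterization may differ in detail): the CAR term enters the negative binomial mean as a spatially structured random effect,

$$\log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i, \qquad \mathbf{u} \sim N\!\left(\mathbf{0},\; \lambda\,(\mathbf{I} - \rho \mathbf{W})^{-1}\right),$$

so the random effect for a CWS is correlated with those of its neighbors through ρ and the binary adjacency matrix W, and λ is the variance parameter reported as lambda in the model output below.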

# Estimating number of violations
vios <- read_xlsx("data//nj_ca_vios.xlsx", sheet = "Sheet1", col_names = TRUE)
ca_vios <- vios[str_detect(vios$`PWS ID`, "CA"),]
ca_vios$current <- ifelse(ca_vios$`Compliance Status` == "Known" | ca_vios$`Compliance Status` == "Open", 1, 0)
ca_vios$closed <- ifelse(ca_vios$`Compliance Status`=="Returned to Compliance", 1, 0)

## Limiting to post-2012
ca_vios <- ca_vios[ca_vios$`Compliance Period Begin Date` >= "2012-01-01",]
ca_vios$vioyear <- format(ca_vios$`Compliance Period Begin Date`, format = "%Y")
ca_vios <- ca_vios %>% relocate(vioyear, .after = `Compliance Period End Date`)
ca_vios$rtcyear <- format(ca_vios$`RTC Date`, format = "%Y")

## Aggregate violations to year level (originally quarter)
ca_vios_new <- collap(ca_vios, ~ `PWS ID` + `Rule Name` + `Violation Type` + 
                        `Contaminant Name` + `Compliance Status` + vioyear, 
                      custom = list(fmean = c("closed", "current", "Public Notification Tier", 
                                              "Severity Indicator Count"), flast = c("rtcyear")))
ca_vios_new$vioyear <- as.numeric(ca_vios_new$vioyear)
ca_vios_new$rtcyear <- as.numeric(ca_vios_new$rtcyear)
ca_vios_new$yeartocomp <- ca_vios_new$rtcyear - ca_vios_new$vioyear

## Aggregating violations to CWS level
ca_vios_count <- collap(ca_vios_new, ~ `PWS ID`, custom = list(fsum = c("current", "closed")))
ca_vios_count$total_vios <- ca_vios_count$current + ca_vios_count$closed

write.csv(ca_vios_count, file = "data//caviolations.csv") # write.csv always writes headers; a col.names argument is ignored with a warning


# Loading in CWS boundary, CWS data, and census tract data
system_data <- read_xlsx("data\\CalHRTWv1_FinalDataSpreadsheet.xlsx", col_names = TRUE)
systems <- st_read("data\\ca_systems.geojson")
## Reading layer `ca_systems' from data source 
##   `C:\Users\kelly\OneDrive\Desktop\MPP\Praticum\Analysis\data\ca_systems.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 3017 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.3161 ymin: 32.53425 xmax: -114.1686 ymax: 41.99871
## Geodetic CRS:  WGS 84
tracts <- st_read("data\\tract2020.geojson")
## Reading layer `tract2020' from data source 
##   `C:\Users\kelly\OneDrive\Desktop\MPP\Praticum\Analysis\data\tract2020.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 9129 features and 2 fields (with 20 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.4096 ymin: 32.53444 xmax: -114.1312 ymax: 42.00948
## Geodetic CRS:  WGS 84
census_data <- read.csv("tracts\\cleaned_census_data.csv", header = TRUE)
ca_violations <- read.csv("data\\caviolations.csv", header = TRUE) ##this was generated above

system_data <- subset(system_data, `Annual Water Cost at 6 HCF`>0) #limits the water systems to those that have annual water cost data
system_data <- system_data %>% rename(pwsid=PWSID)

census_data$FIPS <- paste0("0", census_data$FIPS)
census_data <- census_data %>% rename(GEOID = FIPS)
census_data <- subset(census_data, select = -c(white))

census_data[3:42] <- lapply(census_data[3:42], as.numeric)

ca_sys <- left_join(system_data, systems, by = "pwsid")
#st_write(ca_sys, "newsystems2.shp") ## checking in ArcGIS to make sure geographies look correct and join with census tracts

joined_tracts <- read.csv("data\\joinedOEHHA.csv", header = TRUE) ##attribute table created in ArcGIS that links CWS to census tracts

systems <- systems %>% dplyr::select(pwsid, geometry)

ca_sys <- left_join(joined_tracts, systems, by = "pwsid")
ca_sys <- rename(ca_sys, cwsboundaries = geometry )
ca_violations <- rename(ca_violations, pwsid = PWS.ID)
ca_violations <- ca_violations[,2:5]

ca_sys$GEOID <- paste0("0", ca_sys$GEOID)


# Finally joining all data to get final large dataset
reg.df <- left_join(ca_sys, census_data, by = "GEOID")
reg.df <- reg.df %>% dplyr::select(pwsid, system_name, pop_served, num_contam_he, max_dur_he, num_mcl_vio, 
                                   max_dur_mcl_vio, prchw, num_sources, water_type, 
                                   emerg_source, over_draft_basin,
                                   AR_MH, mdhi, monthly, annual, disadv_cs, num_mrvio, sys_size, 
                                   connections, population, 
                                   fed_class, state_class, service_connect, GEOID, cwsboundaries, 51:91)

# Cleaning and transforming final large dataset
## Estimating the number of disadvantaged communities serviced by a CWS
reg.df$disad_c <- ifelse(reg.df$med_inc < 49454, 1, 0) #once data is aggregated to the CWS level this variable will be summed, giving the total number of disadvantaged communities in a CWS. This will then be divided by the total number of communities serviced by a CWS to get the percentage of disadvantaged communities. 
reg.df$sev_disad <- ifelse(reg.df$med_inc < 37091, 1, 0)

reg.df$communities <- 1

## Aggregating data to the CWS level
reg.df2 <- collap(reg.df, ~ pwsid, 
                  custom = list(fmean = 3:7,
                                fmean = 13:16,
                                fmean = 18,
                                fmean = 20:21, 
                                fmean = 24,
                                fmean = 27,
                                fsum = 28:50,
                                fmean = 51:53,
                                fsum = 54:65,
                                fmean = 66:67,
                                fsum = 68:70,
                                flast = c("system_name", "prchw", "num_sources", "water_type",
                                          "emerg_source", "over_draft_basin", "sys_size",
                                          "fed_class", "state_class")))

reg.df2 <- left_join(reg.df2, systems, by = "pwsid") #adds in CWS geography


reg.df2[26:32] <- (reg.df2[26:32]/reg.df2[,25])*100
reg.df2$non_white <- rowSums(reg.df2[,27:32])
reg.df2 <- reg.df2 %>% relocate(non_white, .after = white.1)
reg.df2[35:41] <- (reg.df2[35:41]/reg.df2[,34])*100
reg.df2[46:48] <- (reg.df2[46:48]/reg.df2[,42])*100 ##no. employed, unemployed, or not in LF / total LF
reg.df2$renter_units_perc <- (reg.df2$renter_units/reg.df2$housing_units)*100
reg.df2[54:63] <- (reg.df2[54:63]/reg.df2[,53])*100 ##percent of rental units by age
reg.df2$pre1980 <- rowSums(reg.df2[,59:63])
reg.df2$post1980 <- rowSums(reg.df2[,54:58])
reg.df2$disad_per <- (reg.df2$disad_c/reg.df2$communities)*100
reg.df2$sev_per <- (reg.df2$sev_disad/reg.df2$communities)*100

reg.df2$prchw <- ifelse(reg.df2$prchw == "Yes", 1, 0)
reg.df2$over_draft_basin <- ifelse(reg.df2$over_draft_basin == "Yes", 1, 0)
reg.df2$emerg_source <- ifelse(reg.df2$emerg_source == "Yes", 1, 0)
reg.df2$joint_dis <- ifelse(reg.df2$disad_c == 1 & reg.df2$sev_disad == 1, 1, 0)

reg.df2 <- reg.df2 %>% filter(!st_is_empty(geometry))

# Loading in general SDWIS information about all CWS in California to compare sample population to total population
system_eval <- read.csv("CA_WaterSystemDetail.csv", header = TRUE)

system_eval <- system_eval %>% dplyr::select(PWS.ID, Population.Served.Count)
system_eval$Population.Served.Count <- as.numeric(gsub(",", "", system_eval$Population.Served.Count, fixed = TRUE))

system_eval$size <- ifelse(system_eval$Population.Served.Count <= 200, "Very Small", 
                           ifelse(system_eval$Population.Served.Count <= 999 & system_eval$Population.Served.Count >= 201, "Small", 
                                  ifelse(system_eval$Population.Served.Count <= 3300 & system_eval$Population.Served.Count >= 1000, "Medium", 
                                         ifelse(system_eval$Population.Served.Count > 3300, "Large/Very Large", NA))))

reg.df2$size <- ifelse(reg.df2$sys_size == "Very Small (1-200)", 1, 
                       ifelse(reg.df2$sys_size == "Small (201-999)", 2, 
                              ifelse(reg.df2$sys_size == "Medium (1000-3,300)", 3, 
                                     ifelse(reg.df2$sys_size == "Large/Very Large(>3300)", 4, NA))))
reg.df2$sys_size <- ifelse(reg.df2$sys_size == "Large/Very Large(>3300)", "Lg(>3300)",
                           ifelse(reg.df2$sys_size == "Very Small (1-200)", "V.Sm(1-200)",
                                  ifelse(reg.df2$sys_size == "Small (201-999)", "Sm(201-999)",
                                         ifelse(reg.df2$sys_size == "Medium (1000-3,300)", "Md(1000-3300)", NA))))
table(reg.df2$sys_size)
## 
##     Lg(>3300) Md(1000-3300)   Sm(201-999)   V.Sm(1-200) 
##           352           181           209           397
# Removing N/A and repeat observations
reg.df2 <- na.omit(reg.df2)

## Finding repeat observations
which(reg.df2$pwsid == c("CA0110011"))
## [1] 8 9
which(reg.df2$pwsid == c("CA0910013"))
## [1] 66 67
which(reg.df2$pwsid == c("CA1010007"))
## [1] 81 82
which(reg.df2$pwsid == c("CA1910034"))
## [1] 251 252
which(reg.df2$pwsid == c("CA4410002"))
## [1] 876 877
which(reg.df2$pwsid == c("CA4410014"))
## [1] 881 882
which(reg.df2$pwsid == c("CA4410017"))
## [1] 883 884
which(reg.df2$pwsid == c("CA4810003"))
## [1] 916 917
which(reg.df2$pwsid == c("CA5010018"))
## [1] 992 993
which(reg.df2$pwsid == c("CA5510009"))
## [1] 1057 1058
reg.df2 <- reg.df2[-c(9,67,82,252,877,882,884,917,993,1058),]

reg.df2 <- reg.df2[,-c(4,5,7)]

reg.df2 <- left_join(reg.df2, ca_violations, by = "pwsid")
reg.df2$total_vios <- as.numeric(reg.df2$total_vios)
reg.df2["total_vios"][is.na(reg.df2["total_vios"])] <- 0

#st_write(reg.df2, dsn = "data/final/finaldata.geojson", append = FALSE) #final dataset available to download
#st_write(reg.df2, dsn = "data/final/finaldata.shp")

Exploratory Data Analysis Findings

The initial exploration of the data revealed several interesting patterns. First, the descriptive statistics are displayed in Table X. They show that more than half of the systems have no health-based violations, and that the average annual water rate in CA for the minimum amount of water needed to live (6 HCF) is $580. The following figures show the distributions of and relationships among multiple variables. The distribution of violations in Figure X clearly contains a large number of zeros; this heavily right-skewed count distribution supports the choice of a negative binomial regression model. The following figure shows the distribution of water rates. Despite the right skew of this variable, a negative binomial model does not require the transformation of skewed data.

sf_use_s2(FALSE)
options(scipen=999)

# Creating descriptive statistics table
violations <- stat.desc(reg.df2$total_vios)
annual <- stat.desc(reg.df2$annual)
percent_disad <- stat.desc(reg.df2$disad_per)
percent_severe_disad <- stat.desc(reg.df2$sev_per)
median_income <- stat.desc(reg.df2$med_inc)
gini <- stat.desc(reg.df2$gini)
rental_units <- stat.desc(reg.df2$renter_units_perc)
housing_pre1980 <- stat.desc(reg.df2$pre1980)
pop_density <- stat.desc(reg.df2$popdensity)
size <- stat.desc(reg.df2$connections)
purchase_water <- stat.desc(reg.df2$prchw)
non_white <- stat.desc(reg.df2$non_white)
stats <- t(data.frame(violations, annual, median_income, non_white, percent_disad, gini, rental_units, housing_pre1980, pop_density, size, purchase_water))
stats <- stats[,-c(3,6,7,10:12,14)]

kable(stats, digits = 2, row.names = TRUE,
      col.names = c("N", "Number of Null", "Min", "Max", "Median", "Mean", "Std. Dev."),
      align = "c", label = "Descriptive Statistics",
      caption = "Table X: Statistics for variables considered in this analysis")
Table X: Statistics for variables considered in this analysis
N Number of Null Min Max Median Mean Std. Dev.
violations 1108 848 0.00 26.00 0.00 0.86 2.61
annual 1108 0 175.44 2108.88 493.20 580.42 324.34
median_income 1108 0 18387.00 247679.00 74655.00 80776.95 32018.62
non_white 1108 0 1.29 91.10 25.58 29.97 17.58
percent_disad 1108 682 0.00 100.00 0.00 18.42 32.15
gini 1108 0 0.21 0.65 0.43 0.43 0.05
rental_units 1108 0 2.24 78.00 27.80 28.84 13.45
housing_pre1980 1108 5 0.00 100.00 57.04 55.54 19.01
pop_density 1108 0 1.06 33403.60 523.15 2441.94 3769.51
size 1108 0 1.00 686540.00 777.50 7637.03 29442.83
purchase_water 1108 732 0.00 1.00 0.00 0.34 0.47
hist(reg.df2$total_vios, freq=FALSE, main = "Figure X: Number of Violations", xlab="Number of MCL Violations")

hist(reg.df2$annual, freq=FALSE, main = "Figure X: Distribution of Annual Water Rates", xlab="Annual Water Rates")

The next set of graphs displays the relationships between different variables to begin exploring whether certain variables should be included in the regression analysis. For the relationship between violations and water rates, shown in Figure X, there appear to be more health-based violations among systems with lower water rates, although the pattern is not stark. This trend indicates that there could be a relationship between the two variables. A more drastic relationship appears between the number of health-based violations and the median income of the water system, as shown in Figure X. Other variables the literature suggests are potentially important factors contributing to water quality are system size and disadvantaged community status. To explore the relationship of these variables with violations and water rates, we created box plots of the number of violations and of water rates for each system size classification, and plotted the number of violations and water rates against the percent of disadvantaged communities in a CWS (Figures X through X). The boxplots relating water system size to the variables of interest show that small and very small systems have higher numbers of violations as well as higher water rates. While the literature had suggested that these may be important variables to investigate, these figures solidify that there is a potential relationship between these variables and the response variable. No clear pattern arises from either plot comparing disadvantaged community status to the two variables of interest; however, these variables are still included in the initial regression model because the literature emphasizes their significance.

plot(reg.df2$annual, reg.df2$total_vios, main = "Figure X: Violations by Water Rates", 
     xlab="Annual Water Rates", ylab="MCL Violations")

plot(reg.df2$med_inc, reg.df2$total_vios, main = "Figure X: Violations by Median Income", 
     xlab="Average Median Income", ylab="Violations")

plot(reg.df2$med_inc, reg.df2$annual, main = "Figure X: Water Rates by Median Income", 
     xlab="Average Median Income", ylab="Annual Water Rate")

boxplot(reg.df2$total_vios~reg.df2$sys_size, xlab = "System Size", ylab = "MCL Violations", show.names = TRUE)

boxplot(reg.df2$annual~reg.df2$sys_size, xlab = "System Size", ylab = "Annual Water Rates", show.names = TRUE)

plot(reg.df2$disad_per, reg.df2$annual, main = "Figure X: Water Rates by Disadvantaged Communities", 
     xlab="Percent Disadv. Communities", ylab="Water rates")

plot(reg.df2$disad_per, reg.df2$total_vios, main = "Figure X: Violations by Disadvantaged Communities", 
     xlab="Percent Disadv. Communities", ylab="MCL Violations")

Finally, as a part of the data exploration, it is important to show the difference between the full population of community water systems in California and the sample data that is used in this analysis. One of the best ways to show the difference between the two is to look at the distribution of data among different sizes of systems. Table X shows that the data used in the analysis represents a disproportionately higher percentage of medium and large systems compared to small and very small systems. This may limit the external validity of our results and highlights deficiencies in data collection for smaller systems.

#Comparing sample systems to full population of systems 
actual_size <- table(system_eval$size)
sample_size <- table(reg.df2$sys_size)
size_comp <- data.frame(actual_size, sample_size)
size_comp <- size_comp[,-c(1)]
size_comp <- rename(size_comp, "Size" = "Var1.1")
size_comp <- size_comp %>% relocate(Size, .before = "Freq")
size_comp <- rename(size_comp, "Actual" = "Freq")
size_comp <- rename(size_comp, "Sample" = "Freq.1")

kable(size_comp, align = "l", caption = "Comparison of the breakdown of sample CWS sizes to population of CWS sizes", label = "Sample v. Population CWS Size Breakdown")
Comparison of the breakdown of sample CWS sizes to population of CWS sizes
Size Actual Sample
Lg(>3300) 660 344
Md(1000-3300) 285 178
Sm(201-999) 679 201
V.Sm(1-200) 1237 385

Exploratory Spatial Data Analysis Findings

Next, we explored the spatial component of our data through choropleth maps and a spatial dependency statistic. The maps below (Figures X-X) show the distribution and concentration across California of our two main variables of interest, the number of health-based violations and annual water rates (the full, interactive maps can be seen by following the links in the figure notes). The maps reflect the idea, discussed above and in the literature, that smaller systems face higher numbers of health-based violations, as shown by the concentration of CWS with high numbers of violations in California's Central Valley. Both figures also suggest spatial clustering of violations and water rates. These maps reinforce the hypothesis that both system size and spatial clustering should be considered in the regression analyses.

Figure X shows a bivariate choropleth map that combines the two previous variables. If high water bills led to fewer violations and vice versa, we would expect to see mainly colors along the gradient running from the top left to the bottom right of the matrix key. This map does not show any clear pattern in terms of the gradient; however, systems with a similar relationship between water rates and violations do appear to be clustered near each other. We verify this spatial clustering using a Global Moran's I statistic.

map_dat <- reg.df2[,c(1,2,13,15,66,76)]
map_dat <- sf::st_as_sf(map_dat)

vio_bins <- c(-1, 0, 1, 10, 26)
pal <- colorBin("Reds", domain = reg.df2$total_vios, bins = vio_bins)

labels <- sprintf(
  "<strong>%s</strong><br/><strong>%s</strong><br/>%g MCL Violations<br/>%s",
  reg.df2$system_name, reg.df2$pwsid, reg.df2$total_vios, reg.df2$sys_size
) %>% lapply(htmltools::HTML)

violationmap <- leaflet() %>% addTiles() %>% 
  addPolygons(data = map_dat$geometry, fillOpacity= 0.7,  fillColor = pal(reg.df2$total_vios), 
              color = "grey", weight=1, highlightOptions = leaflet::highlightOptions(
    weight = 1,
    color = "#666",
    fillOpacity = 0.7),
  label = labels,
  labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto")) %>%
  addLegend("bottomright", pal = pal, values = reg.df2$total_vios,
            title = "Number of Violations",
            opacity = 1)
violationmap
binpal <- colorBin("Blues", reg.df2$annual, pretty = FALSE)

labels_an <- sprintf(
  "<strong>%s</strong><br/><strong>%s</strong><br/>%g MCL Violations<br/>%s",
  reg.df2$system_name, reg.df2$pwsid, reg.df2$total_vios, reg.df2$sys_size
) %>% lapply(htmltools::HTML)

annualmap <- leaflet() %>% addTiles() %>% 
  addPolygons(data = map_dat$geometry, fillOpacity= 0.7,  fillColor = binpal(reg.df2$annual), 
              color = "grey", weight=1, highlightOptions = leaflet::highlightOptions(
    weight = 1,
    color = "#666",
    fillOpacity = 0.7),
  label = labels_an,
  labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto")) %>%
  addLegend("bottomright", pal = binpal, values = reg.df2$annual,
            title = "Annual Water Rates",
            opacity = 1)
annualmap
map_dat$vio_bins <- ifelse(map_dat$total_vios == 0, 1, 
                           ifelse(map_dat$total_vios > 0 & map_dat$total_vios <= 1, 2, 
                                  ifelse(map_dat$total_vios > 1 & map_dat$total_vios <= 10, 3, 
                                         ifelse(map_dat$total_vios >10, 4, 0))))

labels_bi <- sprintf(
  "<strong>%s</strong><br/><strong>%s</strong><br/>%g MCL Violations<br/>$%g per year<br/>%s",
  map_dat$system_name, map_dat$pwsid, map_dat$total_vios, map_dat$annual, reg.df2$sys_size
) %>% lapply(htmltools::HTML)

library(bivariatechoropleths)
library(leaflet)

bivar <- leaflet::leaflet() %>%
  leaflet::addTiles() %>%
  bivariatechoropleths::addBivariateChoropleth(
  map_data = map_dat,
  var1_name = annual,
  var2_name = vio_bins,
  ntiles = 4,
  var1_label = "Annual Water Cost",
  var2_label = "# MCL Violations",
  weight = 1,
  fillOpacity = 0.7,
  color = "grey",
  paletteFunction = pals::arc.bluepink)

bivar <- bivar %>% leaflet::addPolygons(data = map_dat$geometry, weight = 0.5, 
                       color = "lightgrey", fillOpacity = 0, 
                         highlightOptions = highlightOptions(
    weight = 2,
    color = "#666",
    fillOpacity = 0,
    bringToFront = TRUE),
  label = labels_bi,
  labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto"))

bivar

The Moran’s I statistic shows positive spatial autocorrelation in the number of health-based violations: water systems with more health-based violations tend to be surrounded by other water systems that also have more health-based violations. This test statistic shows that there is spatial autocorrelation in the dependent variable, suggesting that a spatial regression model should be explored if the spatial dependency is not fully accounted for by the explanatory variables. We also explored different neighborhood structures to determine the best number of neighbors to use in the analysis. Looking at the Global Moran’s I statistic for water quality violations under different numbers of neighbors (Table X), the spatial autocorrelation is stronger with three neighbors than with 10 or 38. Therefore, the neighborhood structure we move forward with is 3-nearest neighbors.
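
For reference, the Global Moran's I statistic for a variable $x$ observed over $n$ water systems with spatial weights $w_{ij}$ is

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i} (x_i - \bar{x})^{2}},$$

where values above the expectation of $-1/(n-1)$ indicate positive spatial autocorrelation among neighboring systems.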

coords <- st_centroid(st_geometry(reg.df2$geometry), of_largest_polygon = TRUE)
nb_k38 <- knearneigh(coords, k = 38)
options(scipen=999)
lwk <- nb2listw(knn2nb(nb_k38), style = "W")
mt38 <- moran.test(reg.df2$total_vios, listw = lwk, randomisation = FALSE)

nb_k3 <- knearneigh(coords, k = 3) ### k = 3 yields the highest Moran's I and matches the average number of neighbors found in an ArcGIS neighbor analysis
options(scipen=999)
lwk3 <- nb2listw(knn2nb(nb_k3), style = "W")
mt3 <- moran.test(reg.df2$total_vios, listw = lwk3, randomisation = FALSE)
plot(st_geometry(reg.df2$geometry), border="grey")
plot(knn2nb(nb_k3), coords, add=TRUE)
title(main="K nearest neighbours, k = 3")

nb_k10 <- knearneigh(coords, k = 10)
options(scipen=999)
lwk10 <- nb2listw(knn2nb(nb_k10), style = "W")
mt10 <- moran.test(reg.df2$total_vios, listw = lwk10, randomisation = FALSE)

mtests <- t(data.frame(mt38$estimate, mt10$estimate, mt3$estimate))
mtests <- mtests[,-c(2,3)]
mtests <- data.frame(mtests)
kable(mtests, digits = 4, row.names = TRUE, col.names = "Moran's I", align = "l", 
      caption = "Moran's I Test Statistic for 38, 10, and 3 K-Nearest Neighbors")
Moran’s I Test Statistic for 38, 10, and 3 K-Nearest Neighbors
Moran’s I
mt38.estimate 0.0877
mt10.estimate 0.1555
mt3.estimate 0.1949

Regression Analysis Findings

While a traditional linear model (Ordinary Least Squares) cannot be used for count data, two alternative models were explored: Poisson and negative binomial. By running both models, computing a Pearson dispersion statistic, and comparing fitted to observed values, we identified that the negative binomial model produced the best fit. This analysis can be found in the Appendix. The initial negative binomial model included all relevant variables identified in the literature, which are presented in the data table in the Appendix. After the initial negative binomial regression was run, a backward step AIC test was conducted to identify the combination of variables that gives the best model fit (the lowest AIC). The tables below show the regression results from the final negative binomial model as well as the Moran's I test on the residuals to identify any remaining spatial dependency.
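
As a minimal sketch of the dispersion check referenced above (the full Poisson versus negative binomial comparison is in the Appendix, and the covariates here are illustrative), a Pearson dispersion statistic well above 1 for the Poisson fit signals overdispersion and favors the negative binomial:

# Illustrative Pearson dispersion check on a Poisson fit (covariate set abbreviated)
pois_check <- glm(total_vios ~ annual + med_inc + popdensity, family = poisson, data = reg.df2)
sum(residuals(pois_check, type = "pearson")^2) / df.residual(pois_check) # values well above 1 indicate overdispersion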

summary(negbi <- glm.nb(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                           renter_units_perc + pre1980 + disad_per, data = reg.df2))
## Warning: glm.fit: algorithm did not converge
## Warning in glm.nb(total_vios ~ annual + med_inc + popdensity + prchw +
## connections + : alternation limit reached
## 
## Call:
## glm.nb(formula = total_vios ~ annual + med_inc + popdensity + 
##     prchw + connections + non_white + gini + renter_units_perc + 
##     pre1980 + disad_per, data = reg.df2, init.theta = 0.1871919901, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1672  -0.7990  -0.6441  -0.2800   2.4364  
## 
## Coefficients:
##                       Estimate   Std. Error z value  Pr(>|z|)    
## (Intercept)       -1.750173591  0.886450288  -1.974  0.048341 *  
## annual             0.001042347  0.000253805   4.107 0.0000401 ***
## med_inc           -0.000006934  0.000003587  -1.933  0.053181 .  
## popdensity        -0.000215434  0.000046771  -4.606 0.0000041 ***
## prchw             -0.821061445  0.219299624  -3.744  0.000181 ***
## connections        0.000004982  0.000003009   1.656  0.097768 .  
## non_white          0.019620242  0.006419216   3.056  0.002239 ** 
## gini               2.901535652  1.833017233   1.583  0.113438    
## renter_units_perc  0.009380714  0.008221648   1.141  0.253879    
## pre1980           -0.004209697  0.004598444  -0.915  0.359950    
## disad_per         -0.000073255  0.003110942  -0.024  0.981214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1872) family taken to be 1)
## 
##     Null deviance: 742.99  on 1107  degrees of freedom
## Residual deviance: 612.54  on 1097  degrees of freedom
## AIC: 2182.1
## 
## Number of Fisher Scoring iterations: 25
## 
## 
##               Theta:  0.1872 
##           Std. Err.:  0.0174 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -2158.0600
negbi2 <- step(negbi)
## Start:  AIC=2180.06
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980 + disad_per
## Warning: glm.fit: algorithm did not converge
##                     Df Deviance    AIC
## - disad_per          1   612.54 2178.1
## - pre1980            1   613.39 2178.9
## - renter_units_perc  1   613.93 2179.4
## <none>                   612.54 2180.1
## - gini               1   614.89 2180.4
## - med_inc            1   616.56 2182.1
## - connections        1   619.42 2184.9
## - non_white          1   621.59 2187.1
## - prchw              1   625.17 2190.7
## - annual             1   632.05 2197.6
## - popdensity         1   638.30 2203.8
## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached
## 
## Step:  AIC=2178.05
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980
## Warning: glm.fit: algorithm did not converge
##                     Df Deviance    AIC
## - pre1980            1   613.41 2176.9
## - renter_units_perc  1   613.99 2177.5
## <none>                   612.55 2178.1
## - gini               1   614.95 2178.5
## - med_inc            1   617.73 2181.2
## - connections        1   619.43 2182.9
## - non_white          1   621.62 2185.1
## - prchw              1   625.55 2189.1
## - annual             1   632.10 2195.6
## - popdensity         1   638.49 2202.0
## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached
## 
## Step:  AIC=2176.91
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc
## Warning: glm.fit: algorithm did not converge
##                     Df Deviance    AIC
## - renter_units_perc  1   613.42 2176.4
## - gini               1   613.86 2176.8
## <none>                   611.95 2176.9
## - med_inc            1   616.77 2179.7
## - connections        1   619.01 2182.0
## - non_white          1   620.93 2183.9
## - prchw              1   625.37 2188.3
## - annual             1   630.71 2193.7
## - popdensity         1   640.14 2203.1
## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached
## 
## Step:  AIC=2176.38
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini
## Warning: glm.fit: algorithm did not converge
##               Df Deviance    AIC
## <none>             612.95 2176.4
## - gini         1   615.09 2176.5
## - connections  1   619.84 2181.3
## - med_inc      1   621.25 2182.7
## - prchw        1   625.99 2187.4
## - non_white    1   626.06 2187.5
## - annual       1   630.50 2191.9
## - popdensity   1   640.03 2201.5
plot(negbi2, which=c(1,2), cex=0.1)

nbresults <- summary(negbi2)$coefficients
nbresults <- nbresults[,-c(3)]
nbresults <- data.frame(nbresults)
nbresults$odds_ratio <- exp(nbresults$Estimate)
nbresults <- nbresults[,c(1,4,2,3)]
nbresults <- rename(nbresults, "P-value" = "Pr...z..")

redmtest <- lm.morantest(negbi2, listw = lwk3) ## the residuals of this model show even stronger spatial autocorrelation
testsum <- t(data.frame(redmtest$estimate))
pval <- redmtest$p.value
testsum <- data.frame(testsum, pval)
testsum <- testsum[,-c(2,3)]

reg.df2$fitted_neg <- fitted(negbi2)
results_negbi <- ggplot() + geom_point(aes(reg.df2$fitted_neg, reg.df2$total_vios)) + xlab("Fitted Values") + ylab("Observed Values")
kable(nbresults, digits = 5, row.names = TRUE, col.names = c("Estimate", "Odds Ratio", "Std. Error", "P-Value"), caption = "Negative Binomial Regression Results")
Negative Binomial Regression Results
Estimate Odds Ratio Std. Error P-Value
(Intercept) -1.56963 0.20812 0.86341 0.06907
annual 0.00097 1.00097 0.00025 0.00014
med_inc -0.00001 0.99999 0.00000 0.00540
popdensity -0.00021 0.99979 0.00004 0.00000
prchw -0.82428 0.43855 0.21882 0.00017
connections 0.00001 1.00001 0.00000 0.08760
non_white 0.02213 1.02238 0.00614 0.00031
gini 2.66892 14.42440 1.76383 0.13025
kable(testsum, digits = 5, row.names = FALSE, col.names = c("Moran's I Estimate", "P-Value"), caption = "Moran's I Test Statistic on Residuals")
Moran’s I Test Statistic on Residuals
Moran’s I Estimate P-Value
0.41151 0
results_negbi

Despite the improvement in model fit from the AIC backward step, the plot of fitted versus observed values in Figure X shows that the fit could still be improved. An additional diagnostic step is needed because of the spatial autocorrelation apparent in our response variable. We used the Global Moran's I of the residuals to evaluate whether a spatial regression should be explored. For the negative binomial regression in Table X, the Global Moran's I of the residuals is positive and statistically significant (as shown in Table X), indicating that there is spatial autocorrelation in the error term of the regression. This could be biasing the results and shows that we should incorporate spatial autocorrelation into the regression with a spatial weights matrix, using neighboring dependent variable values to assist in the prediction of the dependent variable. This is done through a conditional autoregressive (CAR) model in combination with the negative binomial.

The table below shows the results from the CAR model.

is.symmetric.nb(knn2nb(nb_k3))
## [1] FALSE
nb_mat <- make.sym.nb(knn2nb(nb_k3))
is.symmetric.nb(nb_mat)
## [1] TRUE
matrix <- nb2mat(nb_mat, style = "B")
rownames(matrix) <- colnames(matrix) <- reg.df2$pwsid

car <- fitme(total_vios ~ annual + med_inc + prchw + popdensity + non_white +
               gini + connections + adjacency(1|pwsid), adjMatrix = matrix, 
             data = reg.df2, family = 'negbin')
## If the 'RSpectra' package were installed, an extreme eigenvalue computation could be faster.
summary(car)
## formula: total_vios ~ annual + med_inc + prchw + popdensity + non_white + 
##     gini + connections + adjacency(1 | pwsid)
## Estimation of corrPars, lambda and NB_shape by ML (P_v approximation of logL).
## Estimation of fixed effects by ML (P_v approximation of logL).
## Estimation of lambda and NB_shape by 'outer' ML, maximizing logL.
## family: negbin2(shape=998300)( link = log ) 
##  ------------ Fixed effects (beta) ------------
##                 Estimate    Cond. SE t-value
## (Intercept) -2.982345757 1.122500935 -2.6569
## annual       0.001092281 0.000303286  3.6015
## med_inc     -0.000009092 0.000003981 -2.2842
## prchw       -0.652508075 0.305883517 -2.1332
## popdensity  -0.000195481 0.000061352 -3.1862
## non_white    0.023639575 0.008384515  2.8194
## gini         1.184772171 2.246961146  0.5273
## connections  0.000005512 0.000003211  1.7165
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##     1.rho 
## 0.1089231 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    pwsid  :  4.363  
## # of obs: 1108; # of groups: pwsid, 1108 
##  ------------- Likelihood values  -------------
##                        logLik
## logL      (P_v(h)): -1057.893
aic <- get_any_IC(car)
aic
##        marginal AIC:     conditional AIC:      dispersion AIC: 
##            2137.7852            1735.7020            2121.7852 
##        effective df: 
##             687.3681
coefficients <- (car$fixef)
oddsratio <- exp(coefficients)
rho <- car$corrPars
rho <- rho[["1"]][["rho"]]

results <- data.frame(coefficients, oddsratio)
# Significance stars assigned manually from the t-values in the CAR summary above
results$significance <- "*"
results[c(1,2,5,6),3] <- "***"
results[c(3,4), 3] <- "**"
results[7,3] <- ""


reg.df2$fitted_vals <- fitted(car)

results_plot <- ggplot() + geom_point(aes(reg.df2$fitted_vals, reg.df2$total_vios))+ xlab("Fitted Values") + ylab("Observed Values")
kable(results, digits = 3, row.names = TRUE, col.names = c("Estimate", "Odds Ratio", "Significance"), align = "c", label = "CAR Negative Binomial Results")
Estimate Odds Ratio Significance
(Intercept) -2.982 0.051 ***
annual 0.001 1.001 ***
med_inc 0.000 1.000 **
prchw -0.653 0.521 **
popdensity 0.000 1.000 ***
non_white 0.024 1.024 ***
gini 1.185 3.270
connections 0.000 1.000
results_plot

As shown by the fitted values graph in Figure X, incorporating the spatial dependence parameter in the negative binomial regression greatly improved the fit of the model. This model now reveals interesting relationships between the explanatory variables and the dependent variable. For the continuous variables in the model (all except purchased water), the odds ratios can be interpreted as the factor by which the incidence rate of violations changes with a one-unit increase in the explanatory variable. For example, if the percentage of the population serviced by the CWS that is non-white increases by 1 percentage point (the unit of the variable), then the incidence rate of health-based violations in the CWS increases by a factor of 1.024, holding all other factors constant. For binary variables, such as purchased water, the odds ratio represents the factor by which the incidence rate differs between the two values of the variable. For example, CWS that purchase water have a violation incidence rate that is 0.52 times that of CWS that do not purchase water.
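
These factors come directly from exponentiating the coefficient estimates in the CAR summary above:

exp(0.0236)  # non_white: incidence rate factor of about 1.024 per percentage point
exp(-0.6525) # prchw: purchasing systems' rate is about 0.52 times that of non-purchasers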

These results suggest that variables related to the sourcing of the water, cost of the water, and the demographic makeup of a CWS all impact the number of health-based violations a CWS faces to varying degrees.

Filtering Data for Different Characteristics

The next step of the analysis conducts the same CAR model as above on data filtered by different characteristics, to determine whether different types of CWS exhibit different relationships. First, we filtered the data to include only CWS whose serviced communities are more than 10% disadvantaged. Although the percentage of disadvantaged communities that a CWS services was not a significant predictor of health-based violations, we wanted to explore the relationships between variables in this subset and see whether any new relationships emerge that may have been masked in the full data model.

Disadvantaged Communities

filtered_dat <- reg.df2 %>% filter(disad_per >= 10)

disad_vios <- stat.desc(filtered_dat$total_vios)
disad_price <- stat.desc(filtered_dat$annual)
disad_inc <- stat.desc(filtered_dat$med_inc)
disad_perc <- stat.desc(filtered_dat$disad_per)
disad_gini <- stat.desc(filtered_dat$gini)
disad_rental_units <- stat.desc(filtered_dat$renter_units_perc)
disad_housing_pre1980 <- stat.desc(filtered_dat$pre1980)
disad_pop_density <- stat.desc(filtered_dat$popdensity)
disad_size <- stat.desc(filtered_dat$connections)
disad_purchase <- stat.desc(filtered_dat$prchw)
disad_non_white <- stat.desc(filtered_dat$non_white)
disad_stats <- t(data.frame(disad_vios, disad_price, disad_inc, disad_non_white, disad_perc, disad_gini, disad_rental_units, disad_housing_pre1980, disad_pop_density, disad_size, disad_purchase))
disad_stats <- disad_stats[,-c(3,6,7,10:12,14)]

kable(disad_stats, digits = 3, row.names = TRUE,
      col.names = c("N", "Number of Null", "Min", "Max", "Median", "Mean", "Std. Dev."),
      align = "c", label = "Descriptive Statistics",
      caption = "Statistics for variables considered in this analysis of CWS with communities serviced greater than 10% disadvantaged communities")
Statistics for variables considered in this analysis of CWS with communities serviced greater than 10% disadvantaged communities
N Number of Null Min Max Median Mean Std. Dev.
disad_vios 351 264 0.000 20.000 0.000 0.929 2.852
disad_price 351 0 178.560 1986.000 430.080 490.034 276.434
disad_inc 351 0 18387.000 144022.000 50719.600 54126.540 14438.753
disad_non_white 351 0 5.904 91.104 30.513 33.126 17.584
disad_perc 351 0 10.000 100.000 50.000 56.992 32.835
disad_gini 351 0 0.310 0.629 0.436 0.438 0.044
disad_rental_units 351 0 8.231 77.997 35.607 35.960 13.894
disad_housing_pre1980 351 0 13.855 94.413 58.542 57.982 15.714
disad_pop_density 351 0 1.123 22494.830 891.993 3005.457 4446.933
disad_size 351 0 12.000 686540.000 1856.000 10446.630 42707.024
disad_purchase 351 245 0.000 1.000 0.000 0.302 0.460
disad_coords <- st_centroid(st_geometry(filtered_dat$geometry), of_largest_polygon = TRUE)
## Warning in st_centroid.sfc(st_geometry(filtered_dat$geometry),
## of_largest_polygon = TRUE): st_centroid does not give correct centroids for
## longitude/latitude data
disad_nb_k3 <- knearneigh(disad_coords, k = 3) ###this one yields the highest Moran's I
options(scipen=999)
disad_lwk3 <- nb2listw(knn2nb(disad_nb_k3), style = "W")

is.symmetric.nb(knn2nb(disad_nb_k3))
## [1] FALSE
disad_mat <- make.sym.nb(knn2nb(disad_nb_k3))
is.symmetric.nb(disad_mat)
## [1] TRUE
disad_matrix <- nb2mat(disad_mat, style = "B")

rownames(disad_matrix) <- colnames(disad_matrix) <- filtered_dat$pwsid

summary(nb_disad <- glm.nb(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                renter_units_perc + pre1980, data = filtered_dat))
## 
## Call:
## glm.nb(formula = total_vios ~ annual + med_inc + popdensity + 
##     prchw + connections + non_white + gini + renter_units_perc + 
##     pre1980, data = filtered_dat, init.theta = 0.2259402352, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1556  -0.8189  -0.6236  -0.1587   2.5597  
## 
## Coefficients:
##                       Estimate   Std. Error z value Pr(>|z|)   
## (Intercept)        0.005893104  2.080668960   0.003  0.99774   
## annual             0.001174027  0.000466799   2.515  0.01190 * 
## med_inc           -0.000030016  0.000013940  -2.153  0.03129 * 
## popdensity        -0.000287906  0.000088546  -3.251  0.00115 **
## prchw             -0.383853530  0.374084125  -1.026  0.30484   
## connections        0.000008284  0.000003471   2.387  0.01701 * 
## non_white          0.010808947  0.011386944   0.949  0.34250   
## gini               0.713861208  3.461777139   0.206  0.83663   
## renter_units_perc  0.024703281  0.012507535   1.975  0.04826 * 
## pre1980           -0.006641010  0.009847214  -0.674  0.50005   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.2259) family taken to be 1)
## 
##     Null deviance: 269.48  on 350  degrees of freedom
## Residual deviance: 199.55  on 341  degrees of freedom
## AIC: 711.16
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.2259 
##           Std. Err.:  0.0369 
## 
##  2 x log-likelihood:  -689.1590
nb_disad1 <- step(nb_disad)
## Start:  AIC=709.16
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980
## 
##                     Df Deviance    AIC
## - gini               1   199.58 707.19
## - pre1980            1   200.09 707.70
## - non_white          1   200.33 707.94
## - prchw              1   200.60 708.21
## <none>                   199.55 709.16
## - renter_units_perc  1   203.08 710.69
## - med_inc            1   204.84 712.45
## - annual             1   208.99 716.60
## - connections        1   211.44 719.05
## - popdensity         1   214.33 721.94
## 
## Step:  AIC=707.19
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + renter_units_perc + pre1980
## Warning: glm.fit: algorithm did not converge
##                     Df Deviance    AIC
## - pre1980            1   200.17 705.70
## - non_white          1   200.41 705.94
## - prchw              1   200.68 706.21
## <none>                   199.66 707.19
## - renter_units_perc  1   203.16 708.69
## - med_inc            1   205.51 711.05
## - annual             1   209.25 714.78
## - connections        1   212.19 717.73
## - popdensity         1   215.03 720.56
## 
## Step:  AIC=705.7
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + renter_units_perc
## Warning: glm.fit: algorithm did not converge
##                     Df Deviance    AIC
## - non_white          1   200.58 704.69
## - prchw              1   200.60 704.71
## <none>                   199.59 705.70
## - renter_units_perc  1   203.57 707.68
## - med_inc            1   204.94 709.05
## - annual             1   209.19 713.30
## - connections        1   212.23 716.34
## - popdensity         1   218.59 722.70
## 
## Step:  AIC=704.69
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     renter_units_perc
## Warning: glm.fit: algorithm did not converge
##                     Df Deviance    AIC
## - prchw              1   200.78 703.56
## <none>                   199.91 704.69
## - med_inc            1   205.11 707.89
## - renter_units_perc  1   207.21 709.99
## - annual             1   209.27 712.04
## - connections        1   211.79 714.57
## - popdensity         1   217.88 720.66
## 
## Step:  AIC=703.56
## total_vios ~ annual + med_inc + popdensity + connections + renter_units_perc
## Warning: glm.fit: algorithm did not converge
##                     Df Deviance    AIC
## <none>                   200.32 703.56
## - med_inc            1   206.35 707.59
## - renter_units_perc  1   207.85 709.09
## - annual             1   208.83 710.07
## - connections        1   211.86 713.10
## - popdensity         1   221.14 722.38
disad_car <- fitme(total_vios ~ med_inc + annual + popdensity +
               renter_units_perc + connections + adjacency(1|pwsid), adjMatrix = disad_matrix, 
             data = filtered_dat, family = 'negbin')
summary(disad_car)
## formula: total_vios ~ med_inc + annual + popdensity + renter_units_perc + 
##     connections + adjacency(1 | pwsid)
## Estimation of corrPars, lambda and NB_shape by ML (P_v approximation of logL).
## Estimation of fixed effects by ML (P_v approximation of logL).
## Estimation of lambda and NB_shape by 'outer' ML, maximizing logL.
## family: negbin2(shape=980600)( link = log ) 
##  ------------ Fixed effects (beta) ------------
##                      Estimate    Cond. SE t-value
## (Intercept)       -2.13059640 1.013132712  -2.103
## med_inc           -0.00002797 0.000016061  -1.742
## annual             0.00130991 0.000531226   2.466
## popdensity        -0.00027651 0.000111831  -2.473
## renter_units_perc  0.03643121 0.014774976   2.466
## connections        0.00000911 0.000003825   2.382
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##      1.rho 
## 0.01598932 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    pwsid  :  4.145  
## # of obs: 351; # of groups: pwsid, 351 
##  ------------- Likelihood values  -------------
##                        logLik
## logL      (P_v(h)): -340.2451
disad_aic <- get_any_IC(disad_car)
disad_aic
##        marginal AIC:     conditional AIC:      dispersion AIC: 
##             698.4901             562.3710             686.4901 
##        effective df: 
##             218.8552
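For a rough comparison with the non-spatial fit, the AIC of the stepwise-selected glm.nb (703.56 in the trace above) can be set beside the marginal AIC of the CAR fit (~698.5), with the caveat that information criteria from different estimation procedures are only loosely comparable. A minimal check, assuming the nb_disad1 object from the step() call above:

# Non-spatial negative binomial AIC, for comparison with the CAR marginal AIC above
AIC(nb_disad1)  # ~703.6 per the step() trace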
dis_coefficients <- disad_car$fixef
dis_interpretation <- exp(dis_coefficients)  # exponentiated coefficients = rate ratios

disad_results <- data.frame(dis_coefficients, dis_interpretation)
# Flag significance from the t-values printed above: "**" for |t| > 1.96, "*" otherwise
disad_results$significance <- "*"
disad_results[c(1, 3:6), 3] <- "**"

filtered_dat$fitted_vals <- fitted(disad_car)
kable(disad_results, digits = 3, row.names = TRUE,
      col.names = c("Estimate", "Rate Ratio", "Significance"),
      align = "c", label = "CAR Negative Binomial Results for Disadvantaged Communities")
                    Estimate   Rate Ratio   Significance
(Intercept)           -2.131        0.119   **
med_inc                0.000        1.000   *
annual                 0.001        1.001   **
popdensity             0.000        1.000   **
renter_units_perc      0.036        1.037   **
connections            0.000        1.000   **
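Several estimates display as 0.000/1.000 because the covariates are on large scales and the table rounds to three digits. Rescaling the full-precision estimates printed in the CAR summary gives more readable rate ratios; for example:

# Rate ratios for interpretable increments, using the estimates printed above
exp(100 * 0.00130991)     # ~1.14: +$100 in annual rates ~ 14% more expected violations
exp(10000 * -0.00002797)  # ~0.76: +$10,000 in median income ~ 24% fewer expected violations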
ggplot(filtered_dat, aes(x = fitted_vals, y = total_vios)) +
  geom_point() +
  xlab("Fitted Values") + ylab("Observed Values")

Small and Very Small Systems

# Subset to Small and Very Small CWS (size codes 1 and 2)
filtered_size <- reg.df2 %>% filter(size %in% c(1, 2))

# Descriptive statistics (pastecs::stat.desc) for each variable in the subset
size_vios <- stat.desc(filtered_size$total_vios)
size_price <- stat.desc(filtered_size$annual)
size_inc <- stat.desc(filtered_size$med_inc)
size_gini <- stat.desc(filtered_size$gini)
size_rental_units <- stat.desc(filtered_size$renter_units_perc)
size_housing_pre1980 <- stat.desc(filtered_size$pre1980)
size_pop_density <- stat.desc(filtered_size$popdensity)
size_disad <- stat.desc(filtered_size$disad_per)
size_non_white <- stat.desc(filtered_size$non_white)
size_connections <- stat.desc(filtered_size$connections)
size_purchase <- stat.desc(filtered_size$prchw)
size_stats <- t(data.frame(size_vios, size_price, size_inc, size_non_white, size_disad, size_gini, size_rental_units, size_housing_pre1980, size_pop_density, size_connections, size_purchase))
# Drop the unused stat.desc summaries, keeping N, null count, min, max, median, mean, and SD
size_stats <- size_stats[, -c(3, 6, 7, 10:12, 14)]

kable(size_stats, digits = 3, row.names = TRUE,
      col.names = c("N", "Number of Null", "Min", "Max", "Median", "Mean", "Std. Dev."),
      align = "c", label = "Descriptive Statistics",
      caption = "Statistics for variables considered in this analysis of small and very small CWS")
Statistics for variables considered in this analysis of small and very small CWS

                         N   Number of Null        Min         Max      Median        Mean   Std. Dev.
size_vios              586              421      0.000      26.000       0.000       1.213       3.204
size_price             586                0    180.000    2108.880     600.000     680.714     370.461
size_inc               586                0  23929.000  247679.000   70857.750   78499.997   33452.427
size_non_white         586                0      1.289      91.104      20.468      24.742      15.212
size_disad             586              443      0.000     100.000       0.000      19.769      37.341
size_gini              586                0      0.214       0.651       0.444       0.443       0.054
size_rental_units      586                0      2.239      77.997      21.889      24.431      13.114
size_housing_pre1980   586                5      0.000     100.000      57.931      55.600      20.261
size_pop_density       586                0      1.057   18323.710      92.966     655.710    1907.038
size_connections       586                0      1.000     993.000      96.500     221.770     253.750
size_purchase          586              497      0.000       1.000       0.000       0.152       0.359
# Centroids (largest polygon per CWS), used to define k-nearest-neighbor links
size_coords <- st_centroid(st_geometry(filtered_size$geometry), of_largest_polygon = TRUE)
## Warning in st_centroid.sfc(st_geometry(filtered_size$geometry),
## of_largest_polygon = TRUE): st_centroid does not give correct centroids for
## longitude/latitude data
size_nb_k3 <- knearneigh(size_coords, k = 3)  # k = 3 yields the highest Moran's I (see the check below)

options(scipen=999)
size_lwk3 <- nb2listw(knn2nb(size_nb_k3), style = "W")
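The choice of k = 3 can be verified by comparing Moran's I across candidate k; the loop below is a sketch (the original selection code was not shown), assuming the size_coords and filtered_size objects defined above:

# Compare global Moran's I of total violations across k-nearest-neighbor choices
for (k in 1:5) {
  lw <- nb2listw(knn2nb(knearneigh(size_coords, k = k)), style = "W")
  mi <- moran.test(filtered_size$total_vios, lw)
  cat("k =", k, " Moran's I =", round(mi$estimate[["Moran I statistic"]], 4), "\n")
}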

is.symmetric.nb(knn2nb(size_nb_k3))
## [1] FALSE
# The CAR adjacency matrix must be symmetric, but kNN graphs are directed,
# so symmetrize the neighbor list before building the binary matrix
size_mat <- make.sym.nb(knn2nb(size_nb_k3))
is.symmetric.nb(size_mat)
## [1] TRUE
size_matrix <- nb2mat(size_mat, style = "B")

rownames(size_matrix) <- colnames(size_matrix) <- filtered_size$pwsid

summary(nb_size <- glm.nb(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                renter_units_perc + pre1980, data = filtered_size))
## 
## Call:
## glm.nb(formula = total_vios ~ annual + med_inc + popdensity + 
##     prchw + connections + non_white + gini + renter_units_perc + 
##     pre1980, data = filtered_size, init.theta = 0.2012915594, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.27906  -0.85858  -0.71414  -0.08344   2.54921  
## 
## Coefficients:
##                       Estimate   Std. Error z value  Pr(>|z|)    
## (Intercept)       -0.007590073  1.085804088  -0.007  0.994423    
## annual             0.001157082  0.000292646   3.954 0.0000769 ***
## med_inc           -0.000013356  0.000003889  -3.434  0.000595 ***
## popdensity        -0.000494517  0.000139337  -3.549  0.000387 ***
## prchw             -0.889408901  0.355796103  -2.500  0.012427 *  
## connections       -0.000499104  0.000455885  -1.095  0.273603    
## non_white          0.031779723  0.008215332   3.868  0.000110 ***
## gini               0.954119156  2.077765052   0.459  0.646087    
## renter_units_perc -0.002362267  0.009624955  -0.245  0.806122    
## pre1980           -0.010114646  0.005450316  -1.856  0.063483 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.2013) family taken to be 1)
## 
##     Null deviance: 438.20  on 585  degrees of freedom
## Residual deviance: 364.57  on 576  degrees of freedom
## AIC: 1395.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.2013 
##           Std. Err.:  0.0227 
## 
##  2 x log-likelihood:  -1373.3680
nb_size1 <- step(nb_size)  # backward selection by AIC
## Start:  AIC=1393.37
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980
## 
##                     Df Deviance    AIC
## - renter_units_perc  1   364.63 1391.4
## - gini               1   364.75 1391.5
## - connections        1   365.83 1392.6
## <none>                   364.57 1393.4
## - pre1980            1   368.19 1395.0
## - prchw              1   370.55 1397.3
## - med_inc            1   376.64 1403.4
## - non_white          1   378.09 1404.9
## - popdensity         1   382.85 1409.7
## - annual             1   383.02 1409.8
## 
## Step:  AIC=1391.43
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + pre1980
## 
##               Df Deviance    AIC
## - gini         1   364.65 1389.6
## - connections  1   365.80 1390.8
## <none>             364.48 1391.4
## - pre1980      1   368.06 1393.0
## - prchw        1   370.44 1395.4
## - med_inc      1   377.96 1402.9
## - non_white    1   379.48 1404.4
## - annual       1   383.48 1408.4
## - popdensity   1   383.72 1408.7
## 
## Step:  AIC=1389.6
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + pre1980
## 
##               Df Deviance    AIC
## - connections  1   365.89 1389.0
## <none>             364.54 1389.6
## - pre1980      1   367.98 1391.0
## - prchw        1   370.34 1393.4
## - med_inc      1   379.28 1402.3
## - non_white    1   380.35 1403.4
## - annual       1   383.76 1406.8
## - popdensity   1   384.14 1407.2
## 
## Step:  AIC=1388.95
## total_vios ~ annual + med_inc + popdensity + prchw + non_white + 
##     pre1980
## 
##              Df Deviance    AIC
## <none>            365.08 1389.0
## - pre1980     1   368.35 1390.2
## - prchw       1   372.14 1394.0
## - med_inc     1   378.52 1400.4
## - non_white   1   380.07 1401.9
## - popdensity  1   384.66 1406.5
## - annual      1   386.88 1408.8
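Here AIC-based elimination removes renter_units_perc, gini, and connections; the remaining predictors enter the CAR model for small and very small systems below.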
# CAR negative binomial for small/very small systems, using the predictors
# retained by step() and the symmetric adjacency matrix built above
size_car <- fitme(total_vios ~ med_inc + annual + popdensity +
                    non_white + prchw + pre1980 + adjacency(1 | pwsid),
                  adjMatrix = size_matrix, data = filtered_size, family = "negbin")
summary(size_car)
## formula: total_vios ~ med_inc + annual + popdensity + non_white + prchw + 
##     pre1980 + adjacency(1 | pwsid)
## Estimation of corrPars, lambda and NB_shape by ML (P_v approximation of logL).
## Estimation of fixed effects by ML (P_v approximation of logL).
## Estimation of lambda and NB_shape by 'outer' ML, maximizing logL.
## family: negbin2(shape=0.2419)( link = log ) 
##  ------------ Fixed effects (beta) ------------
##                Estimate   Cond. SE  t-value
## (Intercept) -0.00151919 0.51318731 -0.00296
## med_inc     -0.00001364 0.00000381 -3.58015
## annual       0.00128650 0.00029966  4.29323
## popdensity  -0.00049419 0.00013318 -3.71072
## non_white    0.03198563 0.00856153  3.73597
## prchw       -0.90186010 0.36129074 -2.49622
## pre1980     -0.01032298 0.00563905 -1.83063
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##     1.rho 
## 0.1899922 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    pwsid  :  0.292  
## # of obs: 586; # of groups: pwsid, 586 
##  ------------- Likelihood values  -------------
##                        logLik
## logL      (P_v(h)): -686.0091
size_aic <- get_any_IC(size_car)  # information criteria for the CAR fit, stored for comparison
size_coefficients <- size_car$fixef
size_interpretation <- exp(size_coefficients)  # exponentiated coefficients = rate ratios

size_results <- data.frame(size_coefficients, size_interpretation)
# Flag significance based on the t-values printed above
size_results$significance <- ""
size_results[c(2:5), 3] <- "***"
size_results[c(6, 7), 3] <- "**"

filtered_size$fitted_vals <- fitted(size_car)

kable(size_results, digits = 3, row.names = TRUE,
      col.names = c("Estimate", "Rate Ratio", "Significance"),
      align = "c", label = "CAR Negative Binomial Results for Small and Very Small CWS")
              Estimate   Rate Ratio   Significance
(Intercept)     -0.002        0.998
med_inc          0.000        1.000   ***
annual           0.001        1.001   ***
popdensity       0.000        1.000   ***
non_white        0.032        1.033   ***
prchw           -0.902        0.406   **
pre1980         -0.010        0.990   **
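Approximate 95% Wald intervals can be recovered from the estimates and conditional standard errors printed in the CAR summary; a hand computation for the annual rate coefficient (estimate 0.00128650, Cond. SE 0.00029966):

# Wald interval on the coefficient scale, exponentiated to a rate-ratio interval
est <- 0.00128650
se  <- 0.00029966
exp(est + c(-1.96, 1.96) * se)  # ~ (1.0007, 1.0019) per $1, i.e. ~ (1.07, 1.21) per $100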
ggplot(filtered_size, aes(x = fitted_vals, y = total_vios)) +
  geom_point() +
  xlab("Fitted Values") + ylab("Observed Values")

Appendix

# Baseline non-spatial Poisson model over all CWS; overdispersion is checked next
poisson <- glm(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                 renter_units_perc + pre1980 + disad_per, family = "poisson", data = reg.df2)
summary(poisson)
## 
## Call:
## glm(formula = total_vios ~ annual + med_inc + popdensity + prchw + 
##     connections + non_white + gini + renter_units_perc + pre1980 + 
##     disad_per, family = "poisson", data = reg.df2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7718  -1.3221  -0.8652  -0.3281  10.3599  
## 
## Coefficients:
##                       Estimate   Std. Error z value             Pr(>|z|)    
## (Intercept)        0.170870363  0.361589721   0.473             0.636532    
## annual             0.001077909  0.000079114  13.625 < 0.0000000000000002 ***
## med_inc           -0.000016310  0.000001859  -8.774 < 0.0000000000000002 ***
## popdensity        -0.000275633  0.000026670 -10.335 < 0.0000000000000002 ***
## prchw             -0.575549792  0.107921208  -5.333         0.0000000966 ***
## connections        0.000006890  0.000001362   5.059         0.0000004214 ***
## non_white          0.020471709  0.002368936   8.642 < 0.0000000000000002 ***
## gini               0.686884196  0.677361007   1.014             0.310554    
## renter_units_perc  0.007174199  0.002869085   2.501             0.012401 *  
## pre1980           -0.006865692  0.001809320  -3.795             0.000148 ***
## disad_per         -0.004013333  0.001127443  -3.560             0.000371 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3702.5  on 1107  degrees of freedom
## Residual deviance: 2952.5  on 1097  degrees of freedom
## AIC: 3690.7
## 
## Number of Fisher Scoring iterations: 6
P__disp(poisson)
## pearson.chi2   dispersion 
##   6098.98009      5.55969
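The dispersion statistic reported by P__disp() is the Pearson chi-square divided by the residual degrees of freedom, and can be reproduced directly from the fitted model:

# Pearson dispersion: values well above 1 (here ~5.56) indicate overdispersion,
# motivating the negative binomial models that follow
sum(residuals(poisson, type = "pearson")^2) / df.residual(poisson)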
summary(negbi <- glm.nb(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                           renter_units_perc + pre1980 + disad_per, data = reg.df2))
## Warning: glm.fit: algorithm did not converge
## (previous warning repeated 25 times)
## Warning in glm.nb(total_vios ~ annual + med_inc + popdensity + prchw +
## connections + : alternation limit reached
## 
## Call:
## glm.nb(formula = total_vios ~ annual + med_inc + popdensity + 
##     prchw + connections + non_white + gini + renter_units_perc + 
##     pre1980 + disad_per, data = reg.df2, init.theta = 0.1871919901, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1672  -0.7990  -0.6441  -0.2800   2.4364  
## 
## Coefficients:
##                       Estimate   Std. Error z value  Pr(>|z|)    
## (Intercept)       -1.750173591  0.886450288  -1.974  0.048341 *  
## annual             0.001042347  0.000253805   4.107 0.0000401 ***
## med_inc           -0.000006934  0.000003587  -1.933  0.053181 .  
## popdensity        -0.000215434  0.000046771  -4.606 0.0000041 ***
## prchw             -0.821061445  0.219299624  -3.744  0.000181 ***
## connections        0.000004982  0.000003009   1.656  0.097768 .  
## non_white          0.019620242  0.006419216   3.056  0.002239 ** 
## gini               2.901535652  1.833017233   1.583  0.113438    
## renter_units_perc  0.009380714  0.008221648   1.141  0.253879    
## pre1980           -0.004209697  0.004598444  -0.915  0.359950    
## disad_per         -0.000073255  0.003110942  -0.024  0.981214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1872) family taken to be 1)
## 
##     Null deviance: 742.99  on 1107  degrees of freedom
## Residual deviance: 612.54  on 1097  degrees of freedom
## AIC: 2182.1
## 
## Number of Fisher Scoring iterations: 25
## 
## 
##               Theta:  0.1872 
##           Std. Err.:  0.0174 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -2158.0600
negbi2 <- step(negbi)  # backward selection by AIC
## Start:  AIC=2180.06
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980 + disad_per
## Warning: glm.fit: algorithm did not converge
## (previous warning repeated 8 times)
##                     Df Deviance    AIC
## - disad_per          1   612.54 2178.1
## - pre1980            1   613.39 2178.9
## - renter_units_perc  1   613.93 2179.4
## <none>                   612.54 2180.1
## - gini               1   614.89 2180.4
## - med_inc            1   616.56 2182.1
## - connections        1   619.42 2184.9
## - non_white          1   621.59 2187.1
## - prchw              1   625.17 2190.7
## - annual             1   632.05 2197.6
## - popdensity         1   638.30 2203.8
## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached
## 
## Step:  AIC=2178.05
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980
## Warning: glm.fit: algorithm did not converge
## (previous warning repeated 6 times)
##                     Df Deviance    AIC
## - pre1980            1   613.41 2176.9
## - renter_units_perc  1   613.99 2177.5
## <none>                   612.55 2178.1
## - gini               1   614.95 2178.5
## - med_inc            1   617.73 2181.2
## - connections        1   619.43 2182.9
## - non_white          1   621.62 2185.1
## - prchw              1   625.55 2189.1
## - annual             1   632.10 2195.6
## - popdensity         1   638.49 2202.0
## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached
## 
## Step:  AIC=2176.91
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc
## Warning: glm.fit: algorithm did not converge
## (previous warning repeated 5 times)
##                     Df Deviance    AIC
## - renter_units_perc  1   613.42 2176.4
## - gini               1   613.86 2176.8
## <none>                   611.95 2176.9
## - med_inc            1   616.77 2179.7
## - connections        1   619.01 2182.0
## - non_white          1   620.93 2183.9
## - prchw              1   625.37 2188.3
## - annual             1   630.71 2193.7
## - popdensity         1   640.14 2203.1
## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached
## 
## Step:  AIC=2176.38
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini
## Warning: glm.fit: algorithm did not converge
## (previous warning repeated 5 times)
##               Df Deviance    AIC
## <none>             612.95 2176.4
## - gini         1   615.09 2176.5
## - connections  1   619.84 2181.3
## - med_inc      1   621.25 2182.7
## - prchw        1   625.99 2187.4
## - non_white    1   626.06 2187.5
## - annual       1   630.50 2191.9
## - popdensity   1   640.03 2201.5
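The selected model retains annual, med_inc, popdensity, prchw, connections, non_white, and gini; the Moran test below checks whether spatial autocorrelation persists in its residuals.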
lm.morantest(negbi2, listw = lwk3)  # residuals from this model show even stronger spatial autocorrelation
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = total_vios ~ annual + med_inc + popdensity +
## prchw + connections + non_white + gini, data = reg.df2, init.theta =
## 0.1861751384, link = log)
## weights: lwk3
## 
## Moran I statistic standard deviate = 14.814, p-value <
## 0.00000000000000022
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.4115064872    -0.0356637804     0.0009112054
countreg::rootogram(poisson)  # fit of the Poisson model to the count distribution

countreg::rootogram(negbi2)   # fit of the selected negative binomial, for comparison
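A rootogram plots the square roots of observed and model-implied count frequencies, so bars hanging below or above the reference line flag under- or over-predicted counts; comparing the two plots gives a visual check of how the negative binomial accommodates the zero-heavy violation counts relative to the Poisson.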