This webpage and subsequent code may be periodically updated as new analyses are conducted and improved on. Planned updates include omitting water rates as a variable in order to observe a larger number of water systems and the relationship between their water quality and socioeconomic/demographic characteristics. This may also help to address potential simultaneous causality concerns. Additional updates include observing how distance affects spatial relationships, using duration of violations as a dependent variable instead of number of violations, observing how different water sources (groundwater vs. surface water) impacts water quality, and sub-setting the data by specific types of contaminants. These analyses were not included in the full report due to time and data constraints.

Statistical Analysis

The second research question of our project seeks to explore the relationship between drinking water quality and prices in order to help the US Environmental Protection Agency (EPA) understand the necessary data to collect to create a national drinking water system screening tool. This analysis will address this research question by evaluating the correlation between the number of health-based SDWIS violations over the past 10 years and annual water rates, as well as with other variables, such as demographic and housing characteristics, and the percent of disadvantaged communities in a CWS. This analysis will evaluate these relationships for all of California’s CWS, those whose communities are more than 10% disadvantaged communities, and ones that are considered Small and Very Small CWS. It will also consider the spatial concentration of these relationships.

library(knitr)
knitr::opts_chunk$set(echo = TRUE)
#tinytex::install_tinytex()
#install.packages("kableExtra")

#install.packages(c("cowplot", "googleway", "ggplot2", "ggrepel", "ggspatial", "libwgeom", "sf", "rnaturalearth", "rnaturalearthdata"))
#install.packages("maps")
#install.packages("jtools")
#install.packages("countreg", repos="http://R-Forge.R-project.org")
#install.packages("spaMM")
#install.packages("remotes")
#install.packages("CRAN")
#devtools::install_github("chris31415926535/bivariatechoropleths")
#install.packages("stringr")

library(kableExtra)
library(tinytex)
library(readxl)
library(RColorBrewer)
library(biscale)
library(classInt)
library(pastecs)
library(psych)
library(tidycensus)
library(tidyverse)
library(deldir)
library(geojsonio)
library(geojson)
library(geojsonlint)
library(geojsonsf)
library(leaflet)
library(leafpop)
library(leaflet.providers)
library(leafem)
library(maptools)
library(mapview)
library(raster)
library(rgdal)
library(s2)
library(sf)
library(shiny)
library(spData)
library(spdep)
library(svglite)
library(terra)
library(dplyr)
library(tigris)
library(rmapshaper)
library(ggplot2)
library(sfdep)
library(car)
library(cowplot)
library(collapse)
library(extrafont)
library(ggplot2)
library(ggspatial)
library(patchwork)
library(scico)
library(googleway)
library(ggrepel)
library(ggspatial)
library(rnaturalearth)
library(rnaturalearthdata)
library(maps)
library(jtools)
library(spatialreg)
library(msme)
library(MASS)
library(countreg)
library(spaMM)
library(viridis)
library(remotes)
library(bivariatechoropleths)
library("stringr")

Data

The data for this analysis comes from multiple sources. California’s CWS boundaries data was requested and gathered from the California Water Boards System Area Boundary Layer (SABL) webpage (California Water Resources Control Board, 2022). Information on the number of health-based violations came from the U.S. EPA Safe Drinking Water Information System (SDWIS). Annual water rates and other data presented in the CA HR2W tool were requested and obtained from the authors of the OEHHA HR2W report and tool (OEHHA, 2021). Census tract boundaries were obtained from the U.S. Census Bureau (2018). Census tract demographic and socioeconomic information as well as data on the age of housing infrastructure was downloaded from Social Explorer and based on tables from the American Community Survey (ACS) five year estimates (U.S. Census Bureau, 2020).

The data was transformed and merged into one dataset by matching CWS identifiers in R. However, CWS boundaries first had to be matched to census tracts to attribute demographic information of the census tract to the water system. This was completed using a spatial join in ArcGIS Pro. The joined dataset was then exported to R, where census tract data was merged by census tract ID and water system data was merged by CWS ID. At this stage, we used the definitions for disadvantaged and severely disadvantaged communities from the OEHHA HR2W tool to identify disadvantaged communities in each CWS. This was done by identifying whether or not their median household income was less than $49,454 for disadvantaged community status or less than $37,091 for severely disadvantaged community status (OEHHA, 2021). Lastly, to create the final dataset, the data points were aggregated to the CWS level using a combination of averages and sums for each observation within a CWS. Therefore, median income, population density, and the Gini coefficient variables are averages of all the census tracts within the CWS. Percent of the CWS population that is non-white, the percent of housing stock that are rental units, and the percent of housing stock that was built before 1980 are all true estimates calculated from the total population and housing stock information aggregated from all census tracts serviced by the CWS. Percent of communities that are disadvantaged communities was calculated by summing the number of disadvantaged communities serviced by a CWS and dividing it by the total number of communities the CWS services. Annual water rates, number of service connections, and whether or not a CWS purchases water is already reported at the CWS level from CA OEHHA. The CWS violations represent the number of health-based violations that a CWS has had since 2012. This was calculated by observing how many health-based contaminants have exceeded their maximum contaminant level (MCL) over a 10-year period (2012 to 2022). This is the same water quality information used in the production of the sample drinking water screening tool for New Jersey. A final data list can be found in the Appendix.

Methods

Various methods were used to approach this statistical analysis. Initially, the distribution and spatial concentration of the data were explored using exploratory data analysis and exploratory spatial data analysis methods. These include calculating descriptive statistics; creating boxplots, histograms, and scatterplots; and developing choropleth and bivariate maps. Additionally, we estimated a Global Moran’s I statistic for the primary variables of interest – annual water rates and the number of health-based violations. These were useful in determining the structure of the data, determining whether or not any of the variables needed to be transformed to create a normal distribution, and identifying any potential patterns, relationships, and spatial clustering in our main variables of interest.

Next, we estimated a negative binomial model because our dependent variable, the number of health-based violations, is count data that cannot be estimated using a basic linear model, such as Ordinary Least Squares. We estimated a full model with all relevant variables that we were able to collect. The explanatory variables of the full model include annual water rates, average median income, population density, the number of service connections for each CWS (a measure of size), whether or not the CWS purchases water, the percentage of the population that is non-white, percentage of communities that are disadvantaged communities, percentage of housing units that are rental units, percentage of housing stock built before 1980, and the average Gini coefficient (a measure of inequality) for each CWS.

After this full model was estimated, we used various methods to evaluate the fit of the model and whether or not spatial dependency needs to be incorporated into the regression with a spatial weights matrix. These diagnostic methods included the Akaike Information Criterion (AIC) and the Moran’s I of the residuals. AIC is a method that identifies the fit of the model. For AIC, a backward step method was used, which removes one variable at a time from the initial model to determine the combination of variables with the best fit. This best fit is determined through finding the model that has the lowest AIC value. After settling on the final negative binomial model, we observed a plot of the fitted values compared to observed values to visualize the fit of the model. Then, we used a Moran’s I test of the residuals of the model to determine whether or not spatial dependency still persisted in the residuals. Finally, we utilized a conditional autoregressive (CAR) spatial model to account for the remaining spatial autocorrelation and analyzed the fit of this model. After conducting this analysis for the full dataset, we repeated these steps for the data filtered by percentage of communities that are disadvantaged communities and CWS that are classified as small and very small.

In the CAR model, rho (ρ) represents the coefficient on the spatial weights matrix (W), which indicates the strength above spatial autocorrelation that influences the response variable. This model was chosen because it estimates the probability of values at a certain location occurring, conditional on neighboring values. This is the appropriate spatial model to use because there are potential spatial spillover effects from explanatory variables or unmeasurable influences from neighboring CWS that influence the water quality in a specific CWS. With the matrix of weights of the neighbors of each CWS, we are able to account for those spillover effects, and make the model more accurate. We determined CWS neighbors by selecting the three nearest CWS.

# Estimating number of violations
vios <- read_xlsx("data//nj_ca_vios.xlsx", sheet = "Sheet1", col_names = TRUE)
ca_vios <- vios[str_detect(vios$`PWS ID`, "CA"),]
ca_vios$current <- ifelse(ca_vios$`Compliance Status` == "Known" | ca_vios$`Compliance Status` == "Open", 1, 0)
ca_vios$closed <- ifelse(ca_vios$`Compliance Status`=="Returned to Compliance", 1, 0)

## Limiting to post-2012
ca_vios <- ca_vios[ca_vios$`Compliance Period Begin Date` >= "2012-01-01",]
ca_vios$vioyear <- format(ca_vios$`Compliance Period Begin Date`, format = "%Y")
ca_vios <- ca_vios %>% relocate(vioyear, .after = `Compliance Period End Date`)
ca_vios$rtcyear <- format(ca_vios$`RTC Date`, format = "%Y")

## Aggregate violations to year level (originally quarter)
ca_vios_new <- collap(ca_vios, ~ `PWS ID` + `Rule Name` + `Violation Type` + 
                        `Contaminant Name` + `Compliance Status` + vioyear, 
                      custom = list(fmean = c("closed", "current", "Public Notification Tier", 
                                              "Severity Indicator Count"), flast = c("rtcyear")))
ca_vios_new$vioyear <- as.numeric(ca_vios_new$vioyear)
ca_vios_new$rtcyear <- as.numeric(ca_vios_new$rtcyear)
ca_vios_new$yeartocomp <- ca_vios_new$rtcyear - ca_vios_new$vioyear

## Aggregating violations to CWS level
ca_vios_count <- collap(ca_vios_new, ~ `PWS ID`, custom = list(fsum = c("current", "closed")))
ca_vios_count$total_vios <- ca_vios_count$current + ca_vios_count$closed

write.csv(ca_vios_count, file = "data//caviolations.csv", col.names = TRUE)


# Loading in CWS boundary, CWS data, and census tract data
system_data <- read_xlsx("data\\CalHRTWv1_FinalDataSpreadsheet.xlsx", col_names = TRUE)
systems <- st_read("data\\ca_systems.geojson")

## Reading layer `ca_systems' from data source 
##   `C:\Users\kelly\OneDrive\Desktop\MPP\Praticum\Analysis\data\ca_systems.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 3017 features and 29 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.3161 ymin: 32.53425 xmax: -114.1686 ymax: 41.99871
## Geodetic CRS:  WGS 84

tracts <- st_read("data\\tract2020.geojson")

## Reading layer `tract2020' from data source 
##   `C:\Users\kelly\OneDrive\Desktop\MPP\Praticum\Analysis\data\tract2020.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 9129 features and 2 fields (with 20 geometries empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.4096 ymin: 32.53444 xmax: -114.1312 ymax: 42.00948
## Geodetic CRS:  WGS 84

census_data <- read.csv("tracts\\cleaned_census_data.csv", header = TRUE)
ca_violations <- read.csv("data\\caviolations.csv", header = TRUE) ##this was generated above

system_data <- subset(system_data, `Annual Water Cost at 6 HCF`>0) #limits the water systems to those that have annual water cost data
system_data <- system_data %>% rename(pwsid=PWSID)

census_data$FIPS <- paste0("0", census_data$FIPS)
census_data <- census_data %>% rename(GEOID = FIPS)
census_data <- subset(census_data, select = -c(white))

census_data[3:42] <- lapply(census_data[3:42], as.numeric)

ca_sys <- left_join(system_data, systems, by = "pwsid")
#st_write(ca_sys, "newsystems2.shp") ## checking in ArcGIS to make sure geographies look correct and join with census tracts

joined_tracts <- read.csv("data\\joinedOEHHA.csv", header = TRUE) ##attribute table created in ArcGIS that links CWS to census tracts

systems <- systems %>% dplyr::select(pwsid, geometry)

ca_sys <- left_join(joined_tracts, systems, by = "pwsid")
ca_sys <- rename(ca_sys, cwsboundaries = geometry )
ca_violations <- rename(ca_violations, pwsid = PWS.ID)
ca_violations <- ca_violations[,2:5]

ca_sys$GEOID <- paste0("0", ca_sys$GEOID)


# Finally joining all data to get final large dataset
reg.df <- left_join(ca_sys, census_data, by = "GEOID")
reg.df <- reg.df %>% dplyr::select(pwsid, system_name, pop_served, num_contam_he, max_dur_he, num_mcl_vio, 
                                   max_dur_mcl_vio, prchw, num_sources, water_type, 
                                   emerg_source, over_draft_basin,
                                   AR_MH, mdhi, monthly, annual, disadv_cs, num_mrvio, sys_size, 
                                   connections, population, 
                                   fed_class, state_class, service_connect, GEOID, cwsboundaries, 51:91)

# Cleaning and transforming final large dataset
## Estimating the number of disadvantaged communities serviced by a CWS
reg.df$disad_c <- ifelse(reg.df$med_inc < 49454, 1, 0) #once data is aggregated to the CWS level this variable will be summed, giving the total number of disadvantaged communities in a CWS. This will then be divided by the total number of communities serviced by a CWS to get the percentage of disadvantaged communities. 
reg.df$sev_disad <- ifelse(reg.df$med_inc < 37091, 1, 0)

reg.df$communities <- 1

## Aggregating data to the CWS level
reg.df2 <- collap(reg.df, ~ pwsid, 
                  custom = list(fmean = 3:7,
                                fmean = 13:16,
                                fmean = 18,
                                fmean = 20:21, 
                                fmean = 24,
                                fmean = 27,
                                fsum = 28:50,
                                fmean = 51:53,
                                fsum = 54:65,
                                fmean = 66:67,
                                fsum = 68:70,
                                flast = c("system_name", "prchw", "num_sources", "water_type",
                                          "emerg_source", "over_draft_basin", "sys_size",
                                          "fed_class", "state_class")))

reg.df2 <- left_join(reg.df2, systems, by = "pwsid") #adds in CWS geography


reg.df2[26:32] <- (reg.df2[26:32]/reg.df2[,25])*100
reg.df2$non_white <- rowSums(reg.df2[,27:32])
reg.df2 <- reg.df2 %>% relocate(non_white, .after = white.1)
reg.df2[35:41] <- (reg.df2[35:41]/reg.df2[,34])*100
reg.df2[46:48] <- (reg.df2[46:48]/reg.df2[,42])*100 ##no. employed, unemployed, or not in LF / total LF
reg.df2$renter_units_perc <- (reg.df2$renter_units/reg.df2$housing_units)*100
reg.df2[54:63] <- (reg.df2[54:63]/reg.df2[,53])*100 ##percent of rental units by age
reg.df2$pre1980 <- rowSums(reg.df2[,59:63])
reg.df2$post1980 <- rowSums(reg.df2[,54:58])
reg.df2$disad_per <- (reg.df2$disad_c/reg.df2$communities)*100
reg.df2$sev_per <- (reg.df2$sev_disad/reg.df2$communities)*100

reg.df2$prchw <- ifelse(reg.df2$prchw == "Yes", 1, 0)
reg.df2$over_draft_basin <- ifelse(reg.df2$over_draft_basin == "Yes", 1, 0)
reg.df2$emerg_source <- ifelse(reg.df2$emerg_source == "Yes", 1, 0)
reg.df2$joint_dis <- ifelse(reg.df2$disad_c == 1 & reg.df2$sev_disad == 1, 1, 0)

reg.df2 <- reg.df2 %>% filter(!st_is_empty(geometry))

# Loading in general SDWIS information about all CWS in California to compare sample population to total population
system_eval <- read.csv("CA_WaterSystemDetail.csv", header = TRUE)

system_eval <- system_eval %>% dplyr::select(PWS.ID, Population.Served.Count)
system_eval$Population.Served.Count <- as.numeric(gsub(",", "", system_eval$Population.Served.Count, fixed = TRUE))

system_eval$size <- ifelse(system_eval$Population.Served.Count <= 200, "Very Small", 
                           ifelse(system_eval$Population.Served.Count <= 999 & system_eval$Population.Served.Count >= 201, "Small", 
                                  ifelse(system_eval$Population.Served.Count <= 3300 & system_eval$Population.Served.Count >= 1000, "Medium", 
                                         ifelse(system_eval$Population.Served.Count > 3300, "Large/Very Large", NA))))

reg.df2$size <- ifelse(reg.df2$sys_size == "Very Small (1-200)", 1, 
                       ifelse(reg.df2$sys_size == "Small (201-999)", 2, 
                              ifelse(reg.df2$sys_size == "Medium (1000-3,300)", 3, 
                                     ifelse(reg.df2$sys_size == "Large/Very Large(>3300)", 4, NA))))
reg.df2$sys_size <- ifelse(reg.df2$sys_size == "Large/Very Large(>3300)", "Lg(>3300)",
                           ifelse(reg.df2$sys_size == "Very Small (1-200)", "V.Sm(1-200)",
                                  ifelse(reg.df2$sys_size == "Small (201-999)", "Sm(201-999)",
                                         ifelse(reg.df2$sys_size == "Medium (1000-3,300)", "Md(1000-3300)", NA))))
table(reg.df2$sys_size)

## 
##     Lg(>3300) Md(1000-3300)   Sm(201-999)   V.Sm(1-200) 
##           352           181           209           397

# Removing N/A and repeat observations
reg.df2 <- na.omit(reg.df2)

## Finding repeat observations
which(reg.df2$pwsid == c("CA0110011"))

## [1] 8 9

which(reg.df2$pwsid == c("CA0910013"))

## [1] 66 67

which(reg.df2$pwsid == c("CA1010007"))

## [1] 81 82

which(reg.df2$pwsid == c("CA1910034"))

## [1] 251 252

which(reg.df2$pwsid == c("CA4410002"))

## [1] 876 877

which(reg.df2$pwsid == c("CA4410014"))

## [1] 881 882

which(reg.df2$pwsid == c("CA4410017"))

## [1] 883 884

which(reg.df2$pwsid == c("CA4810003"))

## [1] 916 917

which(reg.df2$pwsid == c("CA5010018"))

## [1] 992 993

which(reg.df2$pwsid == c("CA5510009"))

## [1] 1057 1058

reg.df2 <- reg.df2[-c(9,67,82,252,877,882,884,917,993,1058),]

reg.df2 <- reg.df2[,-c(4,5,7)]

reg.df2 <- left_join(reg.df2, ca_violations, by = "pwsid")
reg.df2$total_vios <- as.numeric(reg.df2$total_vios)
reg.df2["total_vios"][is.na(reg.df2["total_vios"])] <- 0

#st_write(reg.df2, dsn = "data/final/finaldata.geojson", append = FALSE) #final dataset available to download
#st_write(reg.df2, dsn = "data/final/finaldata.shp")

Exploratory Data Analysis Findings

The initial exploration of the data revealed many interesting patterns. First, the descriptive statistics are displayed in Table 1. This shows that more than half of systems do not have any health-based violations and the average annual water rate in CA for the minimum water needed to live is $580. The following figures show the distribution and relationship of multiple variables. For the distribution of violations in Figure X, it is clear that there are a large number of 0 violations. This distribution shows that a negative binomial regression model would be best to use. The following figure shows the distribution of water rates. Despite the right skew of this variable, a negative binomial model does not require the transformation of skewed data.

sf_use_s2(FALSE)
options(scipen=999)

# Creating descriptive statistics table
violations <- stat.desc(reg.df2$total_vios)
annual <- stat.desc(reg.df2$annual)
percent_disad <- stat.desc(reg.df2$disad_per)
percent_severe_disad <- stat.desc(reg.df2$sev_per)
median_income <- stat.desc(reg.df2$med_inc)
gini <- stat.desc(reg.df2$gini)
rental_units <- stat.desc(reg.df2$renter_units_perc)
housing_pre1980 <- stat.desc(reg.df2$pre1980)
pop_density <- stat.desc(reg.df2$popdensity)
size <- stat.desc(reg.df2$connections)
purchase_water <- stat.desc(reg.df2$prchw)
non_white <- stat.desc(reg.df2$non_white)
stats <- t(data.frame(violations, annual, median_income, non_white, percent_disad, gini, rental_units, housing_pre1980, pop_density, size, purchase_water))
stats <- stats[,-c(3,6,7,10:12,14)]

kable(stats, digits = 2, row.names = TRUE, col.names = c("N", "Number of Null", 
                                                                                  "Min", "Max", "Median", 
                                                                                  "Mean", "Std. Dev."),
              align = "c", label = "Descriptive Statistics", caption = "Table X: Statistics for variables considered in this analysis")

Table X: Statistics for variables considered in this analysis
	N	Number of Null	Min	Max	Median	Mean	Std. Dev.
violations	1108	848	0.00	26.00	0.00	0.86	2.61
annual	1108	0	175.44	2108.88	493.20	580.42	324.34
median_income	1108	0	18387.00	247679.00	74655.00	80776.95	32018.62
non_white	1108	0	1.29	91.10	25.58	29.97	17.58
percent_disad	1108	682	0.00	100.00	0.00	18.42	32.15
gini	1108	0	0.21	0.65	0.43	0.43	0.05
rental_units	1108	0	2.24	78.00	27.80	28.84	13.45
housing_pre1980	1108	5	0.00	100.00	57.04	55.54	19.01
pop_density	1108	0	1.06	33403.60	523.15	2441.94	3769.51
size	1108	0	1.00	686540.00	777.50	7637.03	29442.83
purchase_water	1108	732	0.00	1.00	0.00	0.34	0.47

hist(reg.df2$total_vios, freq=FALSE, main = "Figure X: Number of Violations", xlab="MCL of Violations")

hist(reg.df2$annual, freq=FALSE, main = "Figure X: Distribution of Annual Water Rates", xlab="Annual Water Rates")

The next set of graphs display the relationships between different variables to start to explore whether or not certain variables should be included in the regression analysis. For the relationship between violations and water rates, as shown in Figure X, there appears to be more health-based violations for systems with lower water rates, although it is not a stark pattern. This trend indicates that there could be a relationship between the two variables. A more drastic relationship appears when looking at the relationship between the number of health-based violations and the median income of the water system, as shown in Figure X. Other variables the literature suggests are potentially important factors contributing to water quality are system size and disadvantaged community status. To explore the relationship of these variables with violations and water rates, we created a box plot of the number of violations per different system size classification and plotted the number of violations and water rates against the percent of disadvantaged communities in a CWS (Figures X through X). The boxplots relating water system size to the variables of interest show that small and very small systems have higher numbers of violations as well as higher water rates. While literature had suggested that these may be important variables to investigate, these figures solidify that there is a potential relationship between these variables and the response variable. There is not a clear pattern that arises from either plot comparing disadvantaged community status to the two variables of interest, however they are still included in the initial regression model because the literature emphasizes their significance.

plot(reg.df2$annual, reg.df2$total_vios, main = "Figure X: Violations by Water Rates", 
     xlab="Annual Water Rates", ylab="MCL Violations")

plot(reg.df2$med_inc, reg.df2$total_vios, main = "Figure X: Violations by Median Income", 
     xlab="Average Median Income", ylab="Violations")

plot(reg.df2$med_inc, reg.df2$annual, main = "Figure X: Water Rates by Median Income", 
     xlab="Average Median Income", ylab="Annual Water Rate")

boxplot(reg.df2$total_vios~reg.df2$sys_size, xlab = "System Size", ylab = "MCL Violations", show.names = TRUE)

boxplot(reg.df2$annual~reg.df2$sys_size, xlab = "System Size", ylab = "Annual Water Rates", show.names = TRUE)

plot(reg.df2$disad_per, reg.df2$annual, main = "Figure X: Water Rates by Disadvantaged Communities", 
     xlab="Percent Disadv. Communities", ylab="Water rates")

plot(reg.df2$disad_per, reg.df2$total_vios, main = "Figure X: Violations by Disadvantaged Communities", 
     xlab="Percent Disadv. Communities", ylab="MCL Violations")

Finally, as a part of the data exploration, it is important to show the difference between the full population of community water systems in California and the sample data that is used in this analysis. One of the best ways to show the difference between the two is to look at the distribution of data among different sizes of systems. Table X shows that the data used in the analysis represents a disproportionately higher percentage of medium and large systems compared to small and very small systems. This may limit the external validity of our results and highlights deficiencies in data collection for smaller systems.

#Comparing sample systems to full population of systems 
actual_size <- table(system_eval$size)
sample_size <- table(reg.df2$sys_size)
size_comp <- data.frame(actual_size, sample_size)
size_comp <- size_comp[,-c(1)]
size_comp <- rename(size_comp, "Size" = "Var1.1")
size_comp <- size_comp %>% relocate(Size, .before = "Freq")
size_comp <- rename(size_comp, "Actual" = "Freq")
size_comp <- rename(size_comp, "Sample" = "Freq.1")

kable(size_comp, align = "l", caption = "Comparison of the breakdown of sample CWS sizes to population of CWS sizes", label = "Sample v. Population CWS Size Breakdown")

Comparison of the breakdown of sample CWS sizes to population of CWS sizes
Size	Actual	Sample
Lg(>3300)	660	344
Md(1000-3300)	285	178
Sm(201-999)	679	201
V.Sm(1-200)	1237	385

Exploratory Spatial Data Analysis Findings

Next, we explored the spatial component of our data through choropleth maps and a spatial dependency statistic. The maps below (Figures X-X) show the distribution and concentration of our two main variables of interest, the number of health-based violations and annual water rates across California (the full, interactive maps can be seen by following the links in the figure notes). The map reflects similar ideas discussed above and in the literature that smaller systems face higher numbers of health-based violations, as shown by the concentration of CWS with high numbers of violations in California’s central valley. Both figures also show that there is potentially spatial clustering of violations and water rates. These maps solidify the hypothesis that both system size and spatial clustering should be considered in regression analyses.

Figure X shows a bivariate choropleth map that combines the two previous variables. If we expected to see that high water bills led to fewer violations and vice versa, there should be mainly colors along the gradient going from top left to bottom right of the matrix key. This map does not show any clear pattern in terms of the gradient; however it does look like systems with a similar relationship between water rates and violations are clustered near each other. We verify this spatial clustering by using a Global Moran’s I statistic.

map_dat <- reg.df2[,c(1,2,13,15,66,76)]
map_dat <- sf::st_as_sf(map_dat)

vio_bins <- c(-1, 0, 1, 10, 26)
pal <- colorBin("Reds", domain = reg.df2$total_vios, bins = vio_bins)

labels <- sprintf(
  "<strong>%s</strong><br/><strong>%s</strong><br/>%g MCL Violations<br/>%s",
  reg.df2$system_name, reg.df2$pwsid, reg.df2$total_vios, reg.df2$sys_size
) %>% lapply(htmltools::HTML)

violationmap <- leaflet() %>% addTiles() %>% 
  addPolygons(data = map_dat$geometry, fillOpacity= 0.7,  fillColor = pal(reg.df2$total_vios), 
              color = "grey", weight=1, highlightOptions = leaflet::highlightOptions(
    weight = 1,
    color = "#666",
    fillOpacity = 0.7),
  label = labels,
  labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto")) %>%
  addLegend("bottomright", pal = pal, values = reg.df2$total_vios,
            title = "Number of Violations",
            opacity = 1)
violationmap

binpal <- colorBin("Blues", reg.df2$annual, pretty = FALSE)

labels_an <- sprintf(
  "<strong>%s</strong><br/><strong>%s</strong><br/>%g MCL Violations<br/>%s",
  reg.df2$system_name, reg.df2$pwsid, reg.df2$total_vios, reg.df2$sys_size
) %>% lapply(htmltools::HTML)

annualmap <- leaflet() %>% addTiles() %>% 
  addPolygons(data = map_dat$geometry, fillOpacity= 0.7,  fillColor = binpal(reg.df2$annual), 
              color = "grey", weight=1, highlightOptions = leaflet::highlightOptions(
    weight = 1,
    color = "#666",
    fillOpacity = 0.7),
  label = labels_an,
  labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto")) %>%
  addLegend("bottomright", pal = binpal, values = reg.df2$annual,
            title = "Annual Water Rates",
            opacity = 1)
annualmap

vio_bins <- c(-1, 0, 1, 10, 26)
map_dat$vio_bins <- ifelse(map_dat$total_vios == 0, 1, 
                           ifelse(map_dat$total_vios > 0 & map_dat$total_vios <= 1, 2, 
                                  ifelse(map_dat$total_vios > 1 & map_dat$total_vios <= 10, 3, 
                                         ifelse(map_dat$total_vios >10, 4, 0))))

labels_bi <- sprintf(
  "<strong>%s</strong><br/><strong>%s</strong><br/>%g MCL Violations<br/>$%g per year<br/>%s",
  map_dat$system_name, map_dat$pwsid, map_dat$total_vios, map_dat$annual, reg.df2$sys_size
) %>% lapply(htmltools::HTML)

library(bivariatechoropleths)
library(leaflet)

bivar <- leaflet::leaflet() %>%
  leaflet::addTiles() %>%
  bivariatechoropleths::addBivariateChoropleth(
  map_data = map_dat,
  var1_name = annual,
  var2_name = vio_bins,
  ntiles = 4,
  var1_label = "Annual Water Cost",
  var2_label = "# MCL Violations",
  weight = 1,
  fillOpacity = 0.7,
  color = "grey",
  paletteFunction = pals::arc.bluepink)

bivar <- bivar %>% leaflet::addPolygons(data = map_dat$geometry, weight = 0.5, 
                       color = "lightgrey", fillOpacity = 0, 
                         highlightOptions = highlightOptions(
    weight = 2,
    color = "#666",
    fillOpacity = 0,
    bringToFront = TRUE),
  label = labels_bi,
  labelOptions = labelOptions(
    style = list("font-weight" = "normal", padding = "3px 8px"),
    textsize = "15px",
    direction = "auto"))

bivar

The Moran’s I statistic shows positive spatial autocorrelation of the number of health-based violations. This means that water systems with a higher number of health-based violations are surrounded by other water systems that also have higher numbers of health-based violations. This test statistic shows that there is spatial autocorrelation in the dependent variable, suggesting that a spatial regression model should be explored if the spatial dependency is not fully accounted for by the explanatory variables. We also explored different neighborhood structures to determine the best number of neighbors to use in the analysis. By looking at the Global Moran’s I statistic for water quality violations for different numbers of neighbors (Table X), the statistic shows that there is stronger spatial autocorrelation when using three neighbors as opposed to 10 or 38. Therefore, the neighborhood structure that we will move forward with is 3-nearest neighbors.

coords <- st_centroid(st_geometry(reg.df2$geometry), of_largest_polygon = TRUE)
nb_k38 <- knearneigh(coords, k = 38)
options(scipen=999)
lwk <- nb2listw(knn2nb(nb_k38), style = "W")
mt38 <- moran.test(reg.df2$total_vios, listw = lwk, randomisation = FALSE)

nb_k3 <- knearneigh(coords, k = 3) ###this one yields the highest Moran's I, is also the average number of neighbors discovered after an ArcGIS analysis of neighbors
options(scipen=999)
lwk3 <- nb2listw(knn2nb(nb_k3), style = "W")
mt3 <- moran.test(reg.df2$total_vios, listw = lwk3, randomisation = FALSE)
plot(st_geometry(reg.df2$geometry), border="grey")
plot(knn2nb(nb_k3), coords, add=TRUE)
title(main="K nearest neighbours, k = 3")

nb_k10 <- knearneigh(coords, k = 10)
options(scipen=999)
lwk10 <- nb2listw(knn2nb(nb_k10), style = "W")
mt10 <- moran.test(reg.df2$total_vios, listw = lwk10, randomisation = FALSE)

mtests <- t(data.frame(mt38$estimate, mt10$estimate, mt3$estimate))
mtests <- mtests[,-c(2,3)]
mtests <- data.frame(mtests)
kable(mtests, digits = 4, row.names = TRUE, col.names = "Moran's I", align = "l", 
      caption = "Moran's I Test Statistic for 38, 10, and 3 K-Nearest Neighbors")

Moran’s I Test Statistic for 38, 10, and 3 K-Nearest Neighbors
	Moran’s I
mt38.estimate	0.0877
mt10.estimate	0.1555
mt3.estimate	0.1949

Regression Analysis Findings

While a traditional linear model (Ordinary Least Squares) cannot be used for count data, there are two alternative models that were explored: Poisson and Negative Binomial. By running both models, as well as a Pearson dispersion statistic and comparing fitted to observed values, we identified that the negative binomial model produced the best fit. This analysis can be found in the Appendix. The initial negative binomial model included all relevant variables that were identified in the literature and are presented in the data table in the appendix. After the initial negative binomial regression was run, a backward step AIC test was conducted to identify the combination of variables from this model that present the best model fit with the lowest AIC. The tables below shows the regression results from the final negative binomial model as well as the Moran’s I test on the residuals to identify any remaining spatial dependency.

summary(negbi <- glm.nb(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                           renter_units_perc + pre1980 + disad_per, data = reg.df2))

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning in glm.nb(total_vios ~ annual + med_inc + popdensity + prchw +
## connections + : alternation limit reached

## 
## Call:
## glm.nb(formula = total_vios ~ annual + med_inc + popdensity + 
##     prchw + connections + non_white + gini + renter_units_perc + 
##     pre1980 + disad_per, data = reg.df2, init.theta = 0.1871919901, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1672  -0.7990  -0.6441  -0.2800   2.4364  
## 
## Coefficients:
##                       Estimate   Std. Error z value  Pr(>|z|)    
## (Intercept)       -1.750173591  0.886450288  -1.974  0.048341 *  
## annual             0.001042347  0.000253805   4.107 0.0000401 ***
## med_inc           -0.000006934  0.000003587  -1.933  0.053181 .  
## popdensity        -0.000215434  0.000046771  -4.606 0.0000041 ***
## prchw             -0.821061445  0.219299624  -3.744  0.000181 ***
## connections        0.000004982  0.000003009   1.656  0.097768 .  
## non_white          0.019620242  0.006419216   3.056  0.002239 ** 
## gini               2.901535652  1.833017233   1.583  0.113438    
## renter_units_perc  0.009380714  0.008221648   1.141  0.253879    
## pre1980           -0.004209697  0.004598444  -0.915  0.359950    
## disad_per         -0.000073255  0.003110942  -0.024  0.981214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1872) family taken to be 1)
## 
##     Null deviance: 742.99  on 1107  degrees of freedom
## Residual deviance: 612.54  on 1097  degrees of freedom
## AIC: 2182.1
## 
## Number of Fisher Scoring iterations: 25
## 
## 
##               Theta:  0.1872 
##           Std. Err.:  0.0174 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -2158.0600

negbi2 <- step(negbi)

## Start:  AIC=2180.06
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980 + disad_per

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - disad_per          1   612.54 2178.1
## - pre1980            1   613.39 2178.9
## - renter_units_perc  1   613.93 2179.4
## <none>                   612.54 2180.1
## - gini               1   614.89 2180.4
## - med_inc            1   616.56 2182.1
## - connections        1   619.42 2184.9
## - non_white          1   621.59 2187.1
## - prchw              1   625.17 2190.7
## - annual             1   632.05 2197.6
## - popdensity         1   638.30 2203.8

## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached

## 
## Step:  AIC=2178.05
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - pre1980            1   613.41 2176.9
## - renter_units_perc  1   613.99 2177.5
## <none>                   612.55 2178.1
## - gini               1   614.95 2178.5
## - med_inc            1   617.73 2181.2
## - connections        1   619.43 2182.9
## - non_white          1   621.62 2185.1
## - prchw              1   625.55 2189.1
## - annual             1   632.10 2195.6
## - popdensity         1   638.49 2202.0

## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached

## 
## Step:  AIC=2176.91
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - renter_units_perc  1   613.42 2176.4
## - gini               1   613.86 2176.8
## <none>                   611.95 2176.9
## - med_inc            1   616.77 2179.7
## - connections        1   619.01 2182.0
## - non_white          1   620.93 2183.9
## - prchw              1   625.37 2188.3
## - annual             1   630.71 2193.7
## - popdensity         1   640.14 2203.1

## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached

## 
## Step:  AIC=2176.38
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

##               Df Deviance    AIC
## <none>             612.95 2176.4
## - gini         1   615.09 2176.5
## - connections  1   619.84 2181.3
## - med_inc      1   621.25 2182.7
## - prchw        1   625.99 2187.4
## - non_white    1   626.06 2187.5
## - annual       1   630.50 2191.9
## - popdensity   1   640.03 2201.5

plot(negbi2, which=c(1,2), cex=0.1)

nbresults <- summary(negbi2)$coefficients
nbresults <- nbresults[,-c(3)]
nbresults <- data.frame(nbresults)
nbresults$odds_ratio <- exp(nbresults$Estimate)
nbresults <- nbresults[,c(1,4,2,3)]
nbresults <- rename(nbresults, "P-value" = "Pr...z..")

redmtest <- lm.morantest(negbi2, listw = lwk3)##even stronger spatial autocorrelation showing us in resids with this model
testsum <- t(data.frame(redmtest$estimate))
pval <- redmtest$p.value
testsum <- data.frame(testsum, pval)
testsum <- testsum[,-c(2,3)]

reg.df2$fitted_neg <- fitted(negbi2)
results_negbi <- ggplot() + geom_point(aes(reg.df2$fitted_neg, reg.df2$total_vios)) + xlab("Fitted Values") + ylab("Observed Values")

kable(nbresults, digits = 5, row.names = TRUE, col.names = c("Estimate", "Odds Ratio", "Std. Error", "P-Value"), caption = "Negative Binomial Regression Results")

Negative Binomial Regression Results
	Estimate	Odds Ratio	Std. Error	P-Value
(Intercept)	-1.56963	0.20812	0.86341	0.06907
annual	0.00097	1.00097	0.00025	0.00014
med_inc	-0.00001	0.99999	0.00000	0.00540
popdensity	-0.00021	0.99979	0.00004	0.00000
prchw	-0.82428	0.43855	0.21882	0.00017
connections	0.00001	1.00001	0.00000	0.08760
non_white	0.02213	1.02238	0.00614	0.00031
gini	2.66892	14.42440	1.76383	0.13025

kable(testsum, digits = 5, row.names = FALSE, col.names = c("Moran's I Estimate", "P-Value"), caption = "Moran's I Test Statistic on Residuals")

Moran’s I Test Statistic on Residuals
Moran’s I Estimate	P-Value
0.41151	0

results_negbi

Despite the improvement in the fit of the model from the AIC backward step model, the fitted values to observed values, as seen in Figure X, shows that the fit could still be improved. There is an additional diagnostic step that needs to be taken because of the spatial autocorrelation that appears to be apparent in our response variable. We used the Global Moran’s I of the residuals to evaluate whether or not a spatial regression needs to be explored. For the negative binomial regression in Table X, the Global Moran’s I of residuals is positive and statistically significant (as shown in Table X), indicating that there is spatial autocorrelation in the error term of the regression. This could be biasing the results and shows that we should incorporate spatial autocorrelation in the regression with a spatial weights matrix and using neighboring dependent variable values to assist in the prediction of the dependent variable. This is done through a conditional autoregressive model (CAR) in combination with the negative binomial.

The table below shows the results from the CAR model.

is.symmetric.nb(knn2nb(nb_k3))

## [1] FALSE

nb_mat <- make.sym.nb(knn2nb(nb_k3))
is.symmetric.nb(nb_mat)

## [1] TRUE

matrix <- nb2mat(nb_mat, style = "B")
rownames(matrix) <- colnames(matrix) <- reg.df2$pwsid

car <- fitme(total_vios ~ annual + med_inc + prchw + popdensity + non_white +
               gini + connections + adjacency(1|pwsid), adjMatrix = matrix, 
             data = reg.df2, family = 'negbin')

## If the 'RSpectra' package were installed, an extreme eigenvalue computation could be faster.

summary(car)

## formula: total_vios ~ annual + med_inc + prchw + popdensity + non_white + 
##     gini + connections + adjacency(1 | pwsid)
## Estimation of corrPars, lambda and NB_shape by ML (P_v approximation of logL).
## Estimation of fixed effects by ML (P_v approximation of logL).
## Estimation of lambda and NB_shape by 'outer' ML, maximizing logL.
## family: negbin2(shape=998300)( link = log ) 
##  ------------ Fixed effects (beta) ------------
##                 Estimate    Cond. SE t-value
## (Intercept) -2.982345757 1.122500935 -2.6569
## annual       0.001092281 0.000303286  3.6015
## med_inc     -0.000009092 0.000003981 -2.2842
## prchw       -0.652508075 0.305883517 -2.1332
## popdensity  -0.000195481 0.000061352 -3.1862
## non_white    0.023639575 0.008384515  2.8194
## gini         1.184772171 2.246961146  0.5273
## connections  0.000005512 0.000003211  1.7165
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##     1.rho 
## 0.1089231 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    pwsid  :  4.363  
## # of obs: 1108; # of groups: pwsid, 1108 
##  ------------- Likelihood values  -------------
##                        logLik
## logL      (P_v(h)): -1057.893

aic <- get_any_IC(car)
aic

##        marginal AIC:     conditional AIC:      dispersion AIC: 
##            2137.7852            1735.7020            2121.7852 
##        effective df: 
##             687.3681

coefficients <- (car$fixef)
oddsratio <- exp(coefficients)
rho <- car$corrPars
rho <- rho[["1"]][["rho"]]

results <- data.frame(coefficients, oddsratio)
results$significance <- "*"
results[c(1,2,5,6),3] <- "***"
results[c(3,4), 3] <- "**"
results[7,3] <-""


reg.df2$fitted_vals <- fitted(car)

results_plot <- ggplot() + geom_point(aes(reg.df2$fitted_vals, reg.df2$total_vios))+ xlab("Fitted Values") + ylab("Observed Values")
kable(results, digits = 3, row.names = TRUE, col.names = c("Estimate", "Odds Ratio", "Significance"), align = "c", label = "CAR Negative Binomial Results")

	Estimate	Odds Ratio	Significance
(Intercept)	-2.982	0.051	***
annual	0.001	1.001	***
med_inc	0.000	1.000	**
prchw	-0.653	0.521	**
popdensity	0.000	1.000	***
non_white	0.024	1.024	***
gini	1.185	3.270
connections	0.000	1.000

results_plot

As shown by the fitted values graph in Figure X, incorporating the spatial dependence parameter in the negative binomial regression greatly improved the fit of the model. This model now reveals interesting relationships between the explanatory variables and the dependent variable. For all of the continuous variables in the model (all except purchased water), the Odds Ratios can be interpreted as the factor by which the incidence rate of violations will change from a unit increase in the explanatory variable. For example, if the percentage of the population serviced by the CWS that is not white increases by 1 percentage point (the unit of the variable), then the incidence rate of health-based violations in the CWS will increase by a factor of 1.024, holding all other factors constant. For binomial variables, such as purchased water, the odds ratio for this variable represents the factor by which the incidence rate is different compared to the alternative specification of the variable. For example, CWS that purchase water will have a violation incidence rate that is 0.52 times lower than CWS that do not purchase water.

These results suggest that variables related to the sourcing of the water, cost of the water, and the demographic makeup of a CWS all impact the number of health-based violations a CWS faces to varying degrees.

Filtering Data for Different Characteristics

The next step of the analysis includes conducting the same CAR model above for the data filtered by different characteristics to determine if different types of CWS portray different characteristics. First, we filtered the data to only include CWS whose communities they service are more than 10% disadvantaged communities. Although the percentage of disadvantaged communities that a CWS services was not determined to be a significant predictor of health-based violations, we wanted to explore the relationships between variables in this set of data and see if any new relationships emerge that may have been masked in the full data model.

Disadvantaged Communities

filtered_dat <- reg.df2 %>% filter(disad_per >= 10)

disad_vios <- stat.desc(filtered_dat$total_vios)
disad_price <- stat.desc(filtered_dat$annual)
disad_inc <- stat.desc(filtered_dat$med_inc)
disad_perc <- stat.desc(filtered_dat$disad_per)
disad_gini <- stat.desc(filtered_dat$gini)
disad_rental_units <- stat.desc(filtered_dat$renter_units_perc)
disad_housing_pre1980 <- stat.desc(filtered_dat$pre1980)
disad_pop_density <- stat.desc(filtered_dat$popdensity)
disad_size <- stat.desc(filtered_dat$connections)
disad_purchase <- stat.desc(filtered_dat$prchw)
disad_non_white <- stat.desc(filtered_dat$non_white)
disad_stats <- t(data.frame(disad_vios, disad_price, disad_inc, disad_non_white, disad_perc, disad_gini, disad_rental_units, disad_housing_pre1980, disad_pop_density, disad_size, disad_purchase))
disad_stats <- disad_stats[,-c(3,6,7,10:12,14)]

kable(disad_stats, digits = 3, row.names = TRUE, col.names = c("N", "Number of Null", 
                                                                                  "Min", "Max", "Median", 
                                                                                  "Mean", "Std. Dev."),
              align = "c", label = "Descriptive Statistics", caption = "Statistics for variables considered in this analysis of CWS with communities serviced greater than 10% disadvantaged communities")

Statistics for variables considered in this analysis of CWS with communities serviced greater than 10% disadvantaged communities
	N	Number of Null	Min	Max	Median	Mean	Std. Dev.
disad_vios	351	264	0.000	20.000	0.000	0.929	2.852
disad_price	351	0	178.560	1986.000	430.080	490.034	276.434
disad_inc	351	0	18387.000	144022.000	50719.600	54126.540	14438.753
disad_non_white	351	0	5.904	91.104	30.513	33.126	17.584
disad_perc	351	0	10.000	100.000	50.000	56.992	32.835
disad_gini	351	0	0.310	0.629	0.436	0.438	0.044
disad_rental_units	351	0	8.231	77.997	35.607	35.960	13.894
disad_housing_pre1980	351	0	13.855	94.413	58.542	57.982	15.714
disad_pop_density	351	0	1.123	22494.830	891.993	3005.457	4446.933
disad_size	351	0	12.000	686540.000	1856.000	10446.630	42707.024
disad_purchase	351	245	0.000	1.000	0.000	0.302	0.460

disad_coords <- st_centroid(st_geometry(filtered_dat$geometry), of_largest_polygon = TRUE)

## Warning in st_centroid.sfc(st_geometry(filtered_dat$geometry),
## of_largest_polygon = TRUE): st_centroid does not give correct centroids for
## longitude/latitude data

disad_nb_k3 <- knearneigh(disad_coords, k = 3) ###this one yields the highest Moran's I
options(scipen=999)
disad_lwk3 <- nb2listw(knn2nb(disad_nb_k3), style = "W")

is.symmetric.nb(knn2nb(disad_nb_k3))

## [1] FALSE

disad_mat <- make.sym.nb(knn2nb(disad_nb_k3))
is.symmetric.nb(disad_mat)

## [1] TRUE

disad_matrix <- nb2mat(disad_mat, style = "B")

rownames(disad_matrix) <- colnames(disad_matrix) <- filtered_dat$pwsid

summary(nb_disad <- glm.nb(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                renter_units_perc + pre1980, data = filtered_dat))

## 
## Call:
## glm.nb(formula = total_vios ~ annual + med_inc + popdensity + 
##     prchw + connections + non_white + gini + renter_units_perc + 
##     pre1980, data = filtered_dat, init.theta = 0.2259402352, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1556  -0.8189  -0.6236  -0.1587   2.5597  
## 
## Coefficients:
##                       Estimate   Std. Error z value Pr(>|z|)   
## (Intercept)        0.005893104  2.080668960   0.003  0.99774   
## annual             0.001174027  0.000466799   2.515  0.01190 * 
## med_inc           -0.000030016  0.000013940  -2.153  0.03129 * 
## popdensity        -0.000287906  0.000088546  -3.251  0.00115 **
## prchw             -0.383853530  0.374084125  -1.026  0.30484   
## connections        0.000008284  0.000003471   2.387  0.01701 * 
## non_white          0.010808947  0.011386944   0.949  0.34250   
## gini               0.713861208  3.461777139   0.206  0.83663   
## renter_units_perc  0.024703281  0.012507535   1.975  0.04826 * 
## pre1980           -0.006641010  0.009847214  -0.674  0.50005   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.2259) family taken to be 1)
## 
##     Null deviance: 269.48  on 350  degrees of freedom
## Residual deviance: 199.55  on 341  degrees of freedom
## AIC: 711.16
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.2259 
##           Std. Err.:  0.0369 
## 
##  2 x log-likelihood:  -689.1590

nb_disad1 <- step(nb_disad)

## Start:  AIC=709.16
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980
## 
##                     Df Deviance    AIC
## - gini               1   199.58 707.19
## - pre1980            1   200.09 707.70
## - non_white          1   200.33 707.94
## - prchw              1   200.60 708.21
## <none>                   199.55 709.16
## - renter_units_perc  1   203.08 710.69
## - med_inc            1   204.84 712.45
## - annual             1   208.99 716.60
## - connections        1   211.44 719.05
## - popdensity         1   214.33 721.94
## 
## Step:  AIC=707.19
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + renter_units_perc + pre1980

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - pre1980            1   200.17 705.70
## - non_white          1   200.41 705.94
## - prchw              1   200.68 706.21
## <none>                   199.66 707.19
## - renter_units_perc  1   203.16 708.69
## - med_inc            1   205.51 711.05
## - annual             1   209.25 714.78
## - connections        1   212.19 717.73
## - popdensity         1   215.03 720.56
## 
## Step:  AIC=705.7
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + renter_units_perc

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - non_white          1   200.58 704.69
## - prchw              1   200.60 704.71
## <none>                   199.59 705.70
## - renter_units_perc  1   203.57 707.68
## - med_inc            1   204.94 709.05
## - annual             1   209.19 713.30
## - connections        1   212.23 716.34
## - popdensity         1   218.59 722.70
## 
## Step:  AIC=704.69
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     renter_units_perc

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - prchw              1   200.78 703.56
## <none>                   199.91 704.69
## - med_inc            1   205.11 707.89
## - renter_units_perc  1   207.21 709.99
## - annual             1   209.27 712.04
## - connections        1   211.79 714.57
## - popdensity         1   217.88 720.66
## 
## Step:  AIC=703.56
## total_vios ~ annual + med_inc + popdensity + connections + renter_units_perc

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## <none>                   200.32 703.56
## - med_inc            1   206.35 707.59
## - renter_units_perc  1   207.85 709.09
## - annual             1   208.83 710.07
## - connections        1   211.86 713.10
## - popdensity         1   221.14 722.38

disad_car <- fitme(total_vios ~ med_inc + annual + popdensity +
               renter_units_perc + connections + adjacency(1|pwsid), adjMatrix = disad_matrix, 
             data = filtered_dat, family = 'negbin')
summary(disad_car)

## formula: total_vios ~ med_inc + annual + popdensity + renter_units_perc + 
##     connections + adjacency(1 | pwsid)
## Estimation of corrPars, lambda and NB_shape by ML (P_v approximation of logL).
## Estimation of fixed effects by ML (P_v approximation of logL).
## Estimation of lambda and NB_shape by 'outer' ML, maximizing logL.
## family: negbin2(shape=980600)( link = log ) 
##  ------------ Fixed effects (beta) ------------
##                      Estimate    Cond. SE t-value
## (Intercept)       -2.13059640 1.013132712  -2.103
## med_inc           -0.00002797 0.000016061  -1.742
## annual             0.00130991 0.000531226   2.466
## popdensity        -0.00027651 0.000111831  -2.473
## renter_units_perc  0.03643121 0.014774976   2.466
## connections        0.00000911 0.000003825   2.382
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##      1.rho 
## 0.01598932 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    pwsid  :  4.145  
## # of obs: 351; # of groups: pwsid, 351 
##  ------------- Likelihood values  -------------
##                        logLik
## logL      (P_v(h)): -340.2451

disad_aic <- get_any_IC(disad_car)
disad_aic

##        marginal AIC:     conditional AIC:      dispersion AIC: 
##             698.4901             562.3710             686.4901 
##        effective df: 
##             218.8552

dis_coefficients <- (disad_car$fixef)
dis_interpretation <- exp(dis_coefficients)

disad_results <- data.frame(dis_coefficients, dis_interpretation)
disad_results$significance <- "*"
disad_results[c(1,3:6),3] <- "**"

filtered_dat$fitted_vals <- fitted(disad_car)
kable(disad_results, digits = 3, row.names = TRUE, col.names = c("Estimate", "Odds Ratio", "Significance"), align = "c", label = "CAR Negative Binomial Results for Disadvantaged Communities")

	Estimate	Odds Ratio	Significance
(Intercept)	-2.131	0.119	**
med_inc	0.000	1.000
annual	0.001	1.001	**
popdensity	0.000	1.000	**
renter_units_perc	0.036	1.037	**
connections	0.000	1.000	**

ggplot() + geom_point(aes(filtered_dat$fitted_vals, filtered_dat$total_vios))+ xlab("Fitted Values") + ylab("Observed Values")

Small and Very Small Systems

filtered_size <- reg.df2 %>% filter(size == 1 | size == 2)

size_vios <- stat.desc(filtered_size$total_vios)
size_price <- stat.desc(filtered_size$annual)
size_inc <- stat.desc(filtered_size$med_inc)
size_gini <- stat.desc(filtered_size$gini)
size_rental_units <- stat.desc(filtered_size$renter_units_perc)
size_housing_pre1980 <- stat.desc(filtered_size$pre1980)
size_pop_density <- stat.desc(filtered_size$popdensity)
size_disad <- stat.desc(filtered_size$disad_per)
size_non_white <- stat.desc(filtered_size$non_white)
size_connections <- stat.desc(filtered_size$connections)
size_purchase <- stat.desc(filtered_size$prchw)
size_stats <- t(data.frame(size_vios, size_price, size_inc, size_non_white, size_disad, size_gini, size_rental_units, size_housing_pre1980, size_pop_density, size_connections, size_purchase))
size_stats <- size_stats[,-c(3,6,7,10:12,14)]

kable(size_stats, digits = 3, row.names = TRUE, col.names = c("N", "Number of Null", 
                                                                                  "Min", "Max", "Median", 
                                                                                  "Mean", "Std. Dev."),
              align = "c", label = "Descriptive Statistics", caption = "Statistics for variables considered in this analysis of small and very small CWS")

Statistics for variables considered in this analysis of small and very small CWS
	N	Number of Null	Min	Max	Median	Mean	Std. Dev.
size_vios	586	421	0.000	26.000	0.000	1.213	3.204
size_price	586	0	180.000	2108.880	600.000	680.714	370.461
size_inc	586	0	23929.000	247679.000	70857.750	78499.997	33452.427
size_non_white	586	0	1.289	91.104	20.468	24.742	15.212
size_disad	586	443	0.000	100.000	0.000	19.769	37.341
size_gini	586	0	0.214	0.651	0.444	0.443	0.054
size_rental_units	586	0	2.239	77.997	21.889	24.431	13.114
size_housing_pre1980	586	5	0.000	100.000	57.931	55.600	20.261
size_pop_density	586	0	1.057	18323.710	92.966	655.710	1907.038
size_connections	586	0	1.000	993.000	96.500	221.770	253.750
size_purchase	586	497	0.000	1.000	0.000	0.152	0.359

size_coords <- st_centroid(st_geometry(filtered_size$geometry), of_largest_polygon = TRUE)

## Warning in st_centroid.sfc(st_geometry(filtered_size$geometry),
## of_largest_polygon = TRUE): st_centroid does not give correct centroids for
## longitude/latitude data

size_nb_k3 <- knearneigh(size_coords, k = 3) ###this one yields the highest Moran's I

options(scipen=999)
size_lwk3 <- nb2listw(knn2nb(size_nb_k3), style = "W")

is.symmetric.nb(knn2nb(size_nb_k3))

## [1] FALSE

size_mat <- make.sym.nb(knn2nb(size_nb_k3))
is.symmetric.nb(size_mat)

## [1] TRUE

size_matrix <- nb2mat(size_mat, style = "B")

rownames(size_matrix) <- colnames(size_matrix) <- filtered_size$pwsid

summary(nb_size <- glm.nb(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                renter_units_perc + pre1980, data = filtered_size))

## 
## Call:
## glm.nb(formula = total_vios ~ annual + med_inc + popdensity + 
##     prchw + connections + non_white + gini + renter_units_perc + 
##     pre1980, data = filtered_size, init.theta = 0.2012915594, 
##     link = log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.27906  -0.85858  -0.71414  -0.08344   2.54921  
## 
## Coefficients:
##                       Estimate   Std. Error z value  Pr(>|z|)    
## (Intercept)       -0.007590073  1.085804088  -0.007  0.994423    
## annual             0.001157082  0.000292646   3.954 0.0000769 ***
## med_inc           -0.000013356  0.000003889  -3.434  0.000595 ***
## popdensity        -0.000494517  0.000139337  -3.549  0.000387 ***
## prchw             -0.889408901  0.355796103  -2.500  0.012427 *  
## connections       -0.000499104  0.000455885  -1.095  0.273603    
## non_white          0.031779723  0.008215332   3.868  0.000110 ***
## gini               0.954119156  2.077765052   0.459  0.646087    
## renter_units_perc -0.002362267  0.009624955  -0.245  0.806122    
## pre1980           -0.010114646  0.005450316  -1.856  0.063483 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.2013) family taken to be 1)
## 
##     Null deviance: 438.20  on 585  degrees of freedom
## Residual deviance: 364.57  on 576  degrees of freedom
## AIC: 1395.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.2013 
##           Std. Err.:  0.0227 
## 
##  2 x log-likelihood:  -1373.3680

nb_size1 <- step(nb_size)

## Start:  AIC=1393.37
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980
## 
##                     Df Deviance    AIC
## - renter_units_perc  1   364.63 1391.4
## - gini               1   364.75 1391.5
## - connections        1   365.83 1392.6
## <none>                   364.57 1393.4
## - pre1980            1   368.19 1395.0
## - prchw              1   370.55 1397.3
## - med_inc            1   376.64 1403.4
## - non_white          1   378.09 1404.9
## - popdensity         1   382.85 1409.7
## - annual             1   383.02 1409.8
## 
## Step:  AIC=1391.43
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + pre1980
## 
##               Df Deviance    AIC
## - gini         1   364.65 1389.6
## - connections  1   365.80 1390.8
## <none>             364.48 1391.4
## - pre1980      1   368.06 1393.0
## - prchw        1   370.44 1395.4
## - med_inc      1   377.96 1402.9
## - non_white    1   379.48 1404.4
## - annual       1   383.48 1408.4
## - popdensity   1   383.72 1408.7
## 
## Step:  AIC=1389.6
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + pre1980
## 
##               Df Deviance    AIC
## - connections  1   365.89 1389.0
## <none>             364.54 1389.6
## - pre1980      1   367.98 1391.0
## - prchw        1   370.34 1393.4
## - med_inc      1   379.28 1402.3
## - non_white    1   380.35 1403.4
## - annual       1   383.76 1406.8
## - popdensity   1   384.14 1407.2
## 
## Step:  AIC=1388.95
## total_vios ~ annual + med_inc + popdensity + prchw + non_white + 
##     pre1980
## 
##              Df Deviance    AIC
## <none>            365.08 1389.0
## - pre1980     1   368.35 1390.2
## - prchw       1   372.14 1394.0
## - med_inc     1   378.52 1400.4
## - non_white   1   380.07 1401.9
## - popdensity  1   384.66 1406.5
## - annual      1   386.88 1408.8

size_car <- fitme(total_vios ~ med_inc + annual + popdensity + 
                    non_white + prchw + pre1980 +adjacency(1|pwsid), adjMatrix = size_matrix, 
             data = filtered_size, family = 'negbin')
summary(size_car)

## formula: total_vios ~ med_inc + annual + popdensity + non_white + prchw + 
##     pre1980 + adjacency(1 | pwsid)
## Estimation of corrPars, lambda and NB_shape by ML (P_v approximation of logL).
## Estimation of fixed effects by ML (P_v approximation of logL).
## Estimation of lambda and NB_shape by 'outer' ML, maximizing logL.
## family: negbin2(shape=0.2419)( link = log ) 
##  ------------ Fixed effects (beta) ------------
##                Estimate   Cond. SE  t-value
## (Intercept) -0.00151919 0.51318731 -0.00296
## med_inc     -0.00001364 0.00000381 -3.58015
## annual       0.00128650 0.00029966  4.29323
## popdensity  -0.00049419 0.00013318 -3.71072
## non_white    0.03198563 0.00856153  3.73597
## prchw       -0.90186010 0.36129074 -2.49622
## pre1980     -0.01032298 0.00563905 -1.83063
##  --------------- Random effects ---------------
## Family: gaussian( link = identity ) 
##                    --- Correlation parameters:
##     1.rho 
## 0.1899922 
##            --- Variance parameters ('lambda'):
## lambda = var(u) for u ~ Gaussian; 
##    pwsid  :  0.292  
## # of obs: 586; # of groups: pwsid, 586 
##  ------------- Likelihood values  -------------
##                        logLik
## logL      (P_v(h)): -686.0091

size_aic <- get_any_IC(size_car)

size_coefficients <- (size_car$fixef)
size_interpretation <- exp(size_coefficients)

size_results <- data.frame(size_coefficients, size_interpretation)
size_results$significance <- ""
size_results[c(2:5),3] <- "***"
size_results[c(6,7),3] <- "**"

filtered_size$fitted_vals <- fitted(size_car)

kable(size_results, digits = 3, row.names = TRUE, col.names = c("Estimate", "Odds Ratio", "Significance"), align = "c", label = "CAR Negative Binomial Results for Small and Very Small CWS")

	Estimate	Odds Ratio	Significance
(Intercept)	-0.002	0.998
med_inc	0.000	1.000	***
annual	0.001	1.001	***
popdensity	0.000	1.000	***
non_white	0.032	1.033	***
prchw	-0.902	0.406	**
pre1980	-0.010	0.990	**

ggplot() + geom_point(aes(filtered_size$fitted_vals, filtered_size$total_vios))+ xlab("Fitted Values") + ylab("Observed Values")

Appendix

poisson <- glm(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
               renter_units_perc + pre1980 + disad_per, family = "poisson", data = reg.df2)
summary(poisson)

## 
## Call:
## glm(formula = total_vios ~ annual + med_inc + popdensity + prchw + 
##     connections + non_white + gini + renter_units_perc + pre1980 + 
##     disad_per, family = "poisson", data = reg.df2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7718  -1.3221  -0.8652  -0.3281  10.3599  
## 
## Coefficients:
##                       Estimate   Std. Error z value             Pr(>|z|)    
## (Intercept)        0.170870363  0.361589721   0.473             0.636532    
## annual             0.001077909  0.000079114  13.625 < 0.0000000000000002 ***
## med_inc           -0.000016310  0.000001859  -8.774 < 0.0000000000000002 ***
## popdensity        -0.000275633  0.000026670 -10.335 < 0.0000000000000002 ***
## prchw             -0.575549792  0.107921208  -5.333         0.0000000966 ***
## connections        0.000006890  0.000001362   5.059         0.0000004214 ***
## non_white          0.020471709  0.002368936   8.642 < 0.0000000000000002 ***
## gini               0.686884196  0.677361007   1.014             0.310554    
## renter_units_perc  0.007174199  0.002869085   2.501             0.012401 *  
## pre1980           -0.006865692  0.001809320  -3.795             0.000148 ***
## disad_per         -0.004013333  0.001127443  -3.560             0.000371 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3702.5  on 1107  degrees of freedom
## Residual deviance: 2952.5  on 1097  degrees of freedom
## AIC: 3690.7
## 
## Number of Fisher Scoring iterations: 6

P__disp(poisson)

## pearson.chi2   dispersion 
##   6098.98009      5.55969

summary(negbi <- glm.nb(total_vios ~ annual + med_inc + popdensity + prchw + connections + non_white + gini +
                           renter_units_perc + pre1980 + disad_per, data = reg.df2))

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning in glm.nb(total_vios ~ annual + med_inc + popdensity + prchw +
## connections + : alternation limit reached

## 
## Call:
## glm.nb(formula = total_vios ~ annual + med_inc + popdensity + 
##     prchw + connections + non_white + gini + renter_units_perc + 
##     pre1980 + disad_per, data = reg.df2, init.theta = 0.1871919901, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1672  -0.7990  -0.6441  -0.2800   2.4364  
## 
## Coefficients:
##                       Estimate   Std. Error z value  Pr(>|z|)    
## (Intercept)       -1.750173591  0.886450288  -1.974  0.048341 *  
## annual             0.001042347  0.000253805   4.107 0.0000401 ***
## med_inc           -0.000006934  0.000003587  -1.933  0.053181 .  
## popdensity        -0.000215434  0.000046771  -4.606 0.0000041 ***
## prchw             -0.821061445  0.219299624  -3.744  0.000181 ***
## connections        0.000004982  0.000003009   1.656  0.097768 .  
## non_white          0.019620242  0.006419216   3.056  0.002239 ** 
## gini               2.901535652  1.833017233   1.583  0.113438    
## renter_units_perc  0.009380714  0.008221648   1.141  0.253879    
## pre1980           -0.004209697  0.004598444  -0.915  0.359950    
## disad_per         -0.000073255  0.003110942  -0.024  0.981214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.1872) family taken to be 1)
## 
##     Null deviance: 742.99  on 1107  degrees of freedom
## Residual deviance: 612.54  on 1097  degrees of freedom
## AIC: 2182.1
## 
## Number of Fisher Scoring iterations: 25
## 
## 
##               Theta:  0.1872 
##           Std. Err.:  0.0174 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -2158.0600

negbi2 <- step(negbi)

## Start:  AIC=2180.06
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980 + disad_per

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - disad_per          1   612.54 2178.1
## - pre1980            1   613.39 2178.9
## - renter_units_perc  1   613.93 2179.4
## <none>                   612.54 2180.1
## - gini               1   614.89 2180.4
## - med_inc            1   616.56 2182.1
## - connections        1   619.42 2184.9
## - non_white          1   621.59 2187.1
## - prchw              1   625.17 2190.7
## - annual             1   632.05 2197.6
## - popdensity         1   638.30 2203.8

## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached

## 
## Step:  AIC=2178.05
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc + pre1980

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - pre1980            1   613.41 2176.9
## - renter_units_perc  1   613.99 2177.5
## <none>                   612.55 2178.1
## - gini               1   614.95 2178.5
## - med_inc            1   617.73 2181.2
## - connections        1   619.43 2182.9
## - non_white          1   621.62 2185.1
## - prchw              1   625.55 2189.1
## - annual             1   632.10 2195.6
## - popdensity         1   638.49 2202.0

## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached

## 
## Step:  AIC=2176.91
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini + renter_units_perc

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

##                     Df Deviance    AIC
## - renter_units_perc  1   613.42 2176.4
## - gini               1   613.86 2176.8
## <none>                   611.95 2176.9
## - med_inc            1   616.77 2179.7
## - connections        1   619.01 2182.0
## - non_white          1   620.93 2183.9
## - prchw              1   625.37 2188.3
## - annual             1   630.71 2193.7
## - popdensity         1   640.14 2203.1

## Warning in glm.nb(formula = total_vios ~ annual + med_inc + popdensity + :
## alternation limit reached

## 
## Step:  AIC=2176.38
## total_vios ~ annual + med_inc + popdensity + prchw + connections + 
##     non_white + gini

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: algorithm did not converge

##               Df Deviance    AIC
## <none>             612.95 2176.4
## - gini         1   615.09 2176.5
## - connections  1   619.84 2181.3
## - med_inc      1   621.25 2182.7
## - prchw        1   625.99 2187.4
## - non_white    1   626.06 2187.5
## - annual       1   630.50 2191.9
## - popdensity   1   640.03 2201.5

lm.morantest(negbi2, listw = lwk3)##even stronger spatial autocorrelation showing us in resids with this model

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = total_vios ~ annual + med_inc + popdensity +
## prchw + connections + non_white + gini, data = reg.df2, init.theta =
## 0.1861751384, link = log)
## weights: lwk3
## 
## Moran I statistic standard deviate = 14.814, p-value <
## 0.00000000000000022
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.4115064872    -0.0356637804     0.0009112054

countreg::rootogram(poisson)

countreg::rootogram(negbi2)

USC EPA Practicum Statistical Analysis

Kelly-Anne Moffa

2023-05-06