OBTAINING DRINKING WATER DATA FROM THE ENVIRONMENTAL PROTECTION AGENCY - EPA
Water is an essential factor for the whole life and the human survival and, having an important role for both drinking as well economic sectors. Therefore, protecting this source against any pollution has become necessary (Witek and Jarosiewicz, 2009, Reza and Singh, 2010; Sojobi, 2016). In the last century, the availability and quality of surface or ground waters have been changing, mainly due to urbanization, industrialization etc.
EPA
Using ENVIROFACTS - API
SDWIS - Safe Drinking Water Information System
Information about safe drinking water is stored in SDWIS, the EPA’s Safe Drinking Water Information System. SDWIS tracks information on drinking water contamination levels as required by the 1974 Safe Drinking Water Act and its 1986 and 1996 amendments. The Safe Drinking Water Act (SDWA) and accompanying regulations establish Maximum Contaminant Levels (MCLs), treatment techniques, and monitoring and reporting requirements to ensure that water provided to customers is safe for human consumption.
Map of tables available in SDWIS
UCMR - Unregulated Contaminant Monitoring Rule
EPA collects data for chemicals and microbes that may be present in drinking water, but are not currently subject to EPA drinking water regulations.
The UCMR program was developed in coordination with the Contaminant Candidate List (CCL) a list of contaminants that:
Are not regulated by the National Primary Drinking Water Regulations
Are known or anticipated to occur at public water systems
May warrant regulation under the SDWA
library(plyr)
library(knitr)
library(readr)
library(tidyverse)
library(tidymodels)
library(DT)
library(dplyr)
library(textrecipes)
library(ggplot2)
library(lubridate)
set.seed(1234)
Dataset # 1 - Using API
Safe Drinking Water Information System (SDWIS)
The Safe Drinking Water Information System (SDWIS) contains information about public water systems and their violations of EPA’s drinking water regulations. Searching SDWIS will allow you to locate your drinking water supplier and view its violations and enforcement history for the last ten years.
# EPA API
#change import way
#URL <- ("https://data.epa.gov/efservice/VIOLATION/CSV")
library(readr)
<- read_csv("data/EnvirofactsRestAP_violations.csv") api_data
Dataset # 2 - UCMR.Csv
What are the public health benefits of UCMR? This permits assessment of the population being exposed and the levels of exposure.
UCMR data represent one of the primary sources of national occurrence data in drinking water that EPA uses to inform regulatory and other risk management decisions for drinking water contaminant candidates.
library(readr)
<- read_delim("data/UCMR4_All_MA_WY.txt",
water_data "\t", escape_double = FALSE, locale = locale(encoding = "Latin1"),
trim_ws = TRUE)
names(water_data)
Dataset 1 - APi Data
library(dplyr)
<- api_data %>%
api_water select(Facility_name = VIOLATION.PWSID,
Pop_served = VIOLATION.POPULATION_SERVED_COUNT,
State = VIOLATION.PRIMACY_AGENCY_CODE,
Health = VIOLATION.IS_HEALTH_BASED_IND,
Contaminant = VIOLATION.CONTAMINANT_CODE,
Result_value = VIOLATION.VIOL_MEASURE,
State_MCL = VIOLATION.STATE_MCL)
library(tidyr)
<- api_water %>%
api_clean drop_na()
library(skimr)
library(ggplot2)
%>% skim() api_clean2
Name | Piped data |
Number of rows | 3837 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Facility_name | 0 | 1 | 8 | 9 | 0 | 2302 | 0 |
State | 0 | 1 | 1 | 2 | 0 | 51 | 0 |
Health | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Pop_served | 0 | 1 | 4644.81 | 43712.50 | 1 | 102.00 | 360.00 | 1654 | 1064730 | ▇▁▁▁▁ |
Contaminant | 0 | 1 | 2240.65 | 1109.80 | 100 | 1025.00 | 2456.00 | 2950 | 4109 | ▁▇▂▆▃ |
Result_value | 0 | 1 | 51.89 | 1159.13 | 0 | 0.07 | 0.16 | 16 | 54000 | ▇▁▁▁▁ |
State_MCL | 0 | 1 | 13.06 | 33.88 | 0 | 0.06 | 0.08 | 10 | 500 | ▇▁▁▁▁ |
Dataset 2 - Contaminant Monitoring UCMR
<- water_data %>%
epa_water select(PWSName,
Size,
FacilityName,
FacilityWaterType,
CollectionDate,
Contaminant,
MRL,Result = AnalyticalResultsSign,
Result_value = `AnalyticalResultValue(µg/L)`,
State)
library(plyr)
library(lubridate)
<- epa_water %>%
epa_water mutate(Result = replace(Result, Result == "=", TRUE),
Date = mdy(CollectionDate),
CollectionDate = NULL) %>%
mutate(MRL = as.numeric(MRL))
library(tidyr)
<- epa_water %>%
epa_clean drop_na()
<- read_csv("epa_clean.csv") epa_clean
##
## -- Column specification --------------------------------------------------------
## cols(
## PWSName = col_character(),
## Size = col_character(),
## FacilityName = col_character(),
## FacilityWaterType = col_character(),
## Contaminant = col_character(),
## MRL = col_double(),
## Result = col_logical(),
## Result_value = col_double(),
## State = col_character(),
## Date = col_date(format = "")
## )
library(lubridate)
<- epa_clean %>%
epa_clean mutate(Date = ymd(Date))
library(skimr)
library(ggplot2)
%>% skim() epa_clean
Name | Piped data |
Number of rows | 16527 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 6 |
Date | 1 |
logical | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
PWSName | 0 | 1 | 4 | 54 | 0 | 2659 | 0 |
Size | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
FacilityName | 0 | 1 | 2 | 50 | 0 | 4893 | 0 |
FacilityWaterType | 0 | 1 | 2 | 2 | 0 | 4 | 0 |
Contaminant | 0 | 1 | 8 | 27 | 0 | 22 | 0 |
State | 0 | 1 | 2 | 2 | 0 | 35 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2018-01-02 | 2020-12-01 | 2019-05-21 | 685 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Result | 0 | 1 | 1 | TRU: 16527 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
MRL | 0 | 1 | 0.41 | 0.17 | 0.01 | 0.40 | 0.4 | 0.4 | 2 | ▇▁▁▁▁ |
Result_value | 0 | 1 | 17.94 | 81.95 | 0.01 | 0.95 | 2.9 | 10.5 | 3960 | ▇▁▁▁▁ |
Correlation is a measure of the relationship or interdependence of two variables. In other words, how much do the values of one variable change with the values of another. Correlation can be either positive or negative and it ranges from -1 to 1, with 1 and -1 indicating perfect correlation (1 being positive and -1 being negative) and 0 indicating no correlation.
Dataset 1 - APi Data
library(PerformanceAnalytics)
%>%
api_clean2 select_if(is.numeric) %>%
chart.Correlation()
library(GGally)
%>%
api_clean2 select_if(is.numeric) %>%
ggcorr(label = TRUE)
library(GGally)
%>%
api_clean2 select_if(is.numeric) %>%
ggpairs()
Dataset 2 - APi Data
library(PerformanceAnalytics)
%>%
epa_clean select_if(is.numeric) %>%
chart.Correlation()
library(GGally)
%>%
epa_clean select_if(is.numeric) %>%
ggcorr(label = TRUE)
library(GGally)
%>%
epa_clean select_if(is.numeric) %>%
ggpairs()
MOST RECENT REGULATED contaminants by the EPA
REGULATED CONTAMINANTS VIOLATIONS - States with levels higher than the Maximum Concentration Levels
library(ggplot2)
ggplot(api_clean2) +
aes(x = State_MCL, size = Result_value) +
geom_histogram(bins = 30L, fill = "#FF8C00") +
scale_x_continuous(trans = "log10") +
theme_minimal() +
facet_wrap(vars(State))
UNREGULATED CONTAMINANTS - > Maximum Concentration Levels (MCL) by State
The Safe Drinking Water Act (SDWA) defines “contaminant” as any physical, chemical, biological or radiological substance or matter in water. Drinking water may reasonably be expected to contain at least small amounts of some contaminants. Some contaminants may be harmful if consumed at certain levels in drinking water. The presence of contaminants does not necessarily indicate that the water poses a health risk.
library(DT)
datatable(epa_clean2)
ggplot(epa_clean) +
aes(x = Date, fill = Contaminant) +
geom_histogram(bins = 30L) +
scale_fill_hue(direction = 1) +
labs(subtitle = "Contaminants per State") +
theme_minimal() +
facet_wrap(vars(State))
Identifiying States of interest
ggplot(epa_clean) +
aes(x = Date, fill = Contaminant) +
geom_histogram(bins = 30L) +
scale_fill_hue(direction = 1) +
labs(subtitle = "Contaminants per State") +
theme_minimal() +
facet_wrap(vars(State))
Violations
using the Public Api, I was able to identify 100K violations in Region 2 of the EPA database. Region 2 includes 34 states, among them TX, NY, NJ, NC, UT. UCMR provided the counties served field for 99% of the PWSs. (n=100,002) but only 26% reported information without any NA’s (n= 3837)
%>%
api_clean2 filter(State == "NY") %>%
summarise(Avg_ppl_affected = mean(Pop_served))
## # A tibble: 1 x 1
## Avg_ppl_affected
## <dbl>
## 1 717.
Unregulated Contaminants
Using the city served and county served fields in UCMR to determine the areas with reports of unregulated contaminants by each PWS. From the 500k entries found we were able to gather 15,527 with no missing values.
When we filter by State = NY we can see that there are 1369 reports stating levels of unregulated contaminants above the MRL (max result value)
library(dplyr)
count(epa_clean$State == "NY")
#average mu of contaminant Reporting Levels
%>%
epa_clean select(State, MRL) %>%
filter(State == "NY") %>%
group_by(State) %>%
summarise(Average_MRL = mean(MRL))
## # A tibble: 1 x 2
## State Average_MRL
## <chr> <dbl>
## 1 NY 0.407
sampling
Create a new variable df1 using the dataset_1 - Violations - epa_clean - filtering the States of interest
<- epa_clean %>%
df1 select(State, MRL) %>%
filter(State == "NY" |
== "TX") State
sampling variable of df1
<- df1 %>%
df1_sampling rep_sample_n(size = 50)
df1_sampling
## # A tibble: 50 x 3
## # Groups: replicate [1]
## replicate State MRL
## <int> <chr> <dbl>
## 1 1 TX 0.4
## 2 1 TX 0.4
## 3 1 TX 0.007
## 4 1 TX 0.4
## 5 1 TX 0.4
## 6 1 TX 0.3
## 7 1 TX 0.4
## 8 1 NY 0.4
## 9 1 TX 0.4
## 10 1 NY 0.4
## # ... with 40 more rows
Let’s compute the proportion of MRL meeting national average of 0.3
%>%
df1_sampling mutate(is_avg_MRL = (MRL == 0.4))
## # A tibble: 50 x 4
## # Groups: replicate [1]
## replicate State MRL is_avg_MRL
## <int> <chr> <dbl> <lgl>
## 1 1 TX 0.4 TRUE
## 2 1 TX 0.4 TRUE
## 3 1 TX 0.007 FALSE
## 4 1 TX 0.4 TRUE
## 5 1 TX 0.4 TRUE
## 6 1 TX 0.3 FALSE
## 7 1 TX 0.4 TRUE
## 8 1 NY 0.4 TRUE
## 9 1 TX 0.4 TRUE
## 10 1 NY 0.4 TRUE
## # ... with 40 more rows
%>%
df1_sampling mutate(is_avg_MRL = (MRL == 0.4)) %>%
summarize(nat_avg = sum(is_avg_MRL))
## # A tibble: 1 x 2
## replicate nat_avg
## <int> <int>
## 1 1 42
let’s compute the proportion of the 50 sampled entries that are MRL= 0.4 by dividing nat_avg by 50
%>%
df1_sampling mutate(is_avg_MRL = (MRL == 0.4)) %>%
summarize(nat_avg = sum(is_avg_MRL)) %>%
mutate(prop_avg_MRL = nat_avg / 50)
## # A tibble: 1 x 3
## replicate nat_avg prop_avg_MRL
## <int> <int> <dbl>
## 1 1 42 0.84
Great! 78% of df1_sampling Maximum Regulated Levels is 0.4
Bootstrapping
# Repeat resampling 1000 times
# Compute 1000 sample means
<- df1 %>%
virtual_resampled_means rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
group_by(replicate) %>%
summarize(mean_MRL = mean(MRL))
virtual_resampled_means
## # A tibble: 1,000 x 2
## replicate mean_MRL
## <int> <dbl>
## 1 1 0.373
## 2 2 0.426
## 3 3 0.42
## 4 4 0.377
## 5 5 0.421
## 6 6 0.368
## 7 7 0.381
## 8 8 0.437
## 9 9 0.422
## 10 10 0.392
## # ... with 990 more rows
ggplot(virtual_resampled_means, aes(x = mean_MRL)) +
geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
labs(x = "sample mean")
Statistical Anaysis
Null Hypothesis
NYC has one of the best water in the world, we should drink tap water. Reason why there is no difference in mean between the MRL (Max Reported levels) and the Violation reported mean Results and the Non regulated Contaminant Mean concentration Levels MCL.
Alternate Hypothesis
Due to increase in population,industrialization, poor environmental policies and lack of accurate and continuous monitoring, NYC tap water is not the best in the world and it should be filtered before consumption. There is a significant difference in mean between the accepted levels of contaminants in water and the concentration levels observed in the reported violations
We will investigate NYC
# lets take a look at the data
<- epa_clean %>%
df1 select(State, MRL) %>%
filter(State == "NY" |
== "TX") State
# Looking for Average Result Value / above State concentration levels
%>%
epa_clean select(State, Result_value) %>%
filter(State == "NY" |
== "TX") %>%
State group_by(State) %>%
summarise(Average_RV = mean(Result_value))
## # A tibble: 2 x 2
## State Average_RV
## <chr> <dbl>
## 1 NY 17.3
## 2 TX 6.83
<- epa_clean %>%
df2 select(State, Result_value) %>%
filter(State == "NY" |
== "TX") State
1 sample T-test
I will test whether the average MCL/ MRL levels for contaminants in NYC and TX abides with the National standard of 0.3 ug/l.
To do so we make use of the MRL ( max reporting level ) giving the amount allowed to be reported by the EPA and then we use the Result value variable which tells us the amount above the MCL
hist(log(df1$MRL))
hist(log(df2$Result_value))
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 -1.04
%>%
df2 specify(response = Result_value) %>%
hypothesize(null = "point", mu = 0.4) %>%
calculate(stat = "t")
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 13.6
Hypothesis for Violation Data - api_clean2
t.test(api_clean2$Result_value,
mu = 0.4,
alternative = "greater")
##
## One Sample t-test
##
## data: api_clean2$Result_value
## t = 2.7518, df = 3836, p-value = 0.002978
## alternative hypothesis: true mean is greater than 0.4
## 95 percent confidence interval:
## 21.10582 Inf
## sample estimates:
## mean of x
## 51.89279
Hypothesis for Unregulated Contaminant Data - epa_clean
t.test(epa_clean$Result_value,
mu = 0.4,
alternative = "greater")
##
## One Sample t-test
##
## data: epa_clean$Result_value
## t = 27.515, df = 16526, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 0.4
## 95 percent confidence interval:
## 16.89116 Inf
## sample estimates:
## mean of x
## 17.93973
We reject the Null Hypothesis, there is significant evidence to believe NYC water is not safe for consumption without being filtered first.