OBTAINING DRINKING WATER DATA FROM THE ENVIRONMENTAL PROTECTION AGENCY - EPA
Water is an essential factor for the whole life and the human survival and, having an important role for both drinking as well economic sectors. Therefore, protecting this source against any pollution has become necessary (Witek and Jarosiewicz, 2009, Reza and Singh, 2010; Sojobi, 2016). In the last century, the availability and quality of surface or ground waters have been changing, mainly due to urbanization, industrialization etc.
EPA
Using ENVIROFACTS - API
SDWIS - Safe Drinking Water Information System
Information about safe drinking water is stored in SDWIS, the EPA’s Safe Drinking Water Information System. SDWIS tracks information on drinking water contamination levels as required by the 1974 Safe Drinking Water Act and its 1986 and 1996 amendments. The Safe Drinking Water Act (SDWA) and accompanying regulations establish Maximum Contaminant Levels (MCLs), treatment techniques, and monitoring and reporting requirements to ensure that water provided to customers is safe for human consumption.
Map of tables available in SDWIS
UCMR - Unregulated Contaminant Monitoring Rule
EPA collects data for chemicals and microbes that may be present in drinking water, but are not currently subject to EPA drinking water regulations.
The UCMR program was developed in coordination with the Contaminant Candidate List (CCL) a list of contaminants that:
Are not regulated by the National Primary Drinking Water Regulations
Are known or anticipated to occur at public water systems
May warrant regulation under the SDWA
library(plyr)
library(knitr)
library(readr)
library(tidyverse)
library(tidymodels)
library(DT)
library(dplyr)
library(textrecipes)
library(ggplot2)
library(lubridate)
set.seed(1234)Dataset # 1 - Using API
Safe Drinking Water Information System (SDWIS)
The Safe Drinking Water Information System (SDWIS) contains information about public water systems and their violations of EPA’s drinking water regulations. Searching SDWIS will allow you to locate your drinking water supplier and view its violations and enforcement history for the last ten years.
# EPA API
#change import way
#URL <- ("https://data.epa.gov/efservice/VIOLATION/CSV")
library(readr)
api_data <- read_csv("data/EnvirofactsRestAP_violations.csv")Dataset # 2 - UCMR.Csv
What are the public health benefits of UCMR? This permits assessment of the population being exposed and the levels of exposure.
UCMR data represent one of the primary sources of national occurrence data in drinking water that EPA uses to inform regulatory and other risk management decisions for drinking water contaminant candidates.
library(readr)
water_data <- read_delim("data/UCMR4_All_MA_WY.txt",
"\t", escape_double = FALSE, locale = locale(encoding = "Latin1"),
trim_ws = TRUE)
names(water_data)Dataset 1 - APi Data
library(dplyr)
api_water <- api_data %>%
select(Facility_name = VIOLATION.PWSID,
Pop_served = VIOLATION.POPULATION_SERVED_COUNT,
State = VIOLATION.PRIMACY_AGENCY_CODE,
Health = VIOLATION.IS_HEALTH_BASED_IND,
Contaminant = VIOLATION.CONTAMINANT_CODE,
Result_value = VIOLATION.VIOL_MEASURE,
State_MCL = VIOLATION.STATE_MCL)library(tidyr)
api_clean <- api_water %>%
drop_na()library(skimr)
library(ggplot2)
api_clean2 %>% skim()| Name | Piped data |
| Number of rows | 3837 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Facility_name | 0 | 1 | 8 | 9 | 0 | 2302 | 0 |
| State | 0 | 1 | 1 | 2 | 0 | 51 | 0 |
| Health | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Pop_served | 0 | 1 | 4644.81 | 43712.50 | 1 | 102.00 | 360.00 | 1654 | 1064730 | ▇▁▁▁▁ |
| Contaminant | 0 | 1 | 2240.65 | 1109.80 | 100 | 1025.00 | 2456.00 | 2950 | 4109 | ▁▇▂▆▃ |
| Result_value | 0 | 1 | 51.89 | 1159.13 | 0 | 0.07 | 0.16 | 16 | 54000 | ▇▁▁▁▁ |
| State_MCL | 0 | 1 | 13.06 | 33.88 | 0 | 0.06 | 0.08 | 10 | 500 | ▇▁▁▁▁ |
Dataset 2 - Contaminant Monitoring UCMR
epa_water <- water_data %>%
select(PWSName,
Size,
FacilityName,
FacilityWaterType,
CollectionDate,
Contaminant,
MRL,
Result = AnalyticalResultsSign,
Result_value = `AnalyticalResultValue(µg/L)`,
State)library(plyr)
library(lubridate)
epa_water <- epa_water %>%
mutate(Result = replace(Result, Result == "=", TRUE),
Date = mdy(CollectionDate),
CollectionDate = NULL) %>%
mutate(MRL = as.numeric(MRL))library(tidyr)
epa_clean <- epa_water %>%
drop_na()epa_clean <- read_csv("epa_clean.csv")##
## -- Column specification --------------------------------------------------------
## cols(
## PWSName = col_character(),
## Size = col_character(),
## FacilityName = col_character(),
## FacilityWaterType = col_character(),
## Contaminant = col_character(),
## MRL = col_double(),
## Result = col_logical(),
## Result_value = col_double(),
## State = col_character(),
## Date = col_date(format = "")
## )
library(lubridate)
epa_clean <- epa_clean %>%
mutate(Date = ymd(Date))library(skimr)
library(ggplot2)
epa_clean %>% skim()| Name | Piped data |
| Number of rows | 16527 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| Date | 1 |
| logical | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PWSName | 0 | 1 | 4 | 54 | 0 | 2659 | 0 |
| Size | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| FacilityName | 0 | 1 | 2 | 50 | 0 | 4893 | 0 |
| FacilityWaterType | 0 | 1 | 2 | 2 | 0 | 4 | 0 |
| Contaminant | 0 | 1 | 8 | 27 | 0 | 22 | 0 |
| State | 0 | 1 | 2 | 2 | 0 | 35 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Date | 0 | 1 | 2018-01-02 | 2020-12-01 | 2019-05-21 | 685 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| Result | 0 | 1 | 1 | TRU: 16527 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| MRL | 0 | 1 | 0.41 | 0.17 | 0.01 | 0.40 | 0.4 | 0.4 | 2 | ▇▁▁▁▁ |
| Result_value | 0 | 1 | 17.94 | 81.95 | 0.01 | 0.95 | 2.9 | 10.5 | 3960 | ▇▁▁▁▁ |
Correlation is a measure of the relationship or interdependence of two variables. In other words, how much do the values of one variable change with the values of another. Correlation can be either positive or negative and it ranges from -1 to 1, with 1 and -1 indicating perfect correlation (1 being positive and -1 being negative) and 0 indicating no correlation.
Dataset 1 - APi Data
library(PerformanceAnalytics)
api_clean2 %>%
select_if(is.numeric) %>%
chart.Correlation()library(GGally)
api_clean2 %>%
select_if(is.numeric) %>%
ggcorr(label = TRUE)library(GGally)
api_clean2 %>%
select_if(is.numeric) %>%
ggpairs()Dataset 2 - APi Data
library(PerformanceAnalytics)
epa_clean %>%
select_if(is.numeric) %>%
chart.Correlation()library(GGally)
epa_clean %>%
select_if(is.numeric) %>%
ggcorr(label = TRUE)library(GGally)
epa_clean %>%
select_if(is.numeric) %>%
ggpairs()MOST RECENT REGULATED contaminants by the EPA
REGULATED CONTAMINANTS VIOLATIONS - States with levels higher than the Maximum Concentration Levels
library(ggplot2)
ggplot(api_clean2) +
aes(x = State_MCL, size = Result_value) +
geom_histogram(bins = 30L, fill = "#FF8C00") +
scale_x_continuous(trans = "log10") +
theme_minimal() +
facet_wrap(vars(State))UNREGULATED CONTAMINANTS - > Maximum Concentration Levels (MCL) by State
The Safe Drinking Water Act (SDWA) defines “contaminant” as any physical, chemical, biological or radiological substance or matter in water. Drinking water may reasonably be expected to contain at least small amounts of some contaminants. Some contaminants may be harmful if consumed at certain levels in drinking water. The presence of contaminants does not necessarily indicate that the water poses a health risk.
library(DT)
datatable(epa_clean2)ggplot(epa_clean) +
aes(x = Date, fill = Contaminant) +
geom_histogram(bins = 30L) +
scale_fill_hue(direction = 1) +
labs(subtitle = "Contaminants per State") +
theme_minimal() +
facet_wrap(vars(State))Identifiying States of interest
ggplot(epa_clean) +
aes(x = Date, fill = Contaminant) +
geom_histogram(bins = 30L) +
scale_fill_hue(direction = 1) +
labs(subtitle = "Contaminants per State") +
theme_minimal() +
facet_wrap(vars(State))Violations
using the Public Api, I was able to identify 100K violations in Region 2 of the EPA database. Region 2 includes 34 states, among them TX, NY, NJ, NC, UT. UCMR provided the counties served field for 99% of the PWSs. (n=100,002) but only 26% reported information without any NA’s (n= 3837)
api_clean2 %>%
filter(State == "NY") %>%
summarise(Avg_ppl_affected = mean(Pop_served))## # A tibble: 1 x 1
## Avg_ppl_affected
## <dbl>
## 1 717.
Unregulated Contaminants
Using the city served and county served fields in UCMR to determine the areas with reports of unregulated contaminants by each PWS. From the 500k entries found we were able to gather 15,527 with no missing values.
When we filter by State = NY we can see that there are 1369 reports stating levels of unregulated contaminants above the MRL (max result value)
library(dplyr)
count(epa_clean$State == "NY")#average mu of contaminant Reporting Levels
epa_clean %>%
select(State, MRL) %>%
filter(State == "NY") %>%
group_by(State) %>%
summarise(Average_MRL = mean(MRL))## # A tibble: 1 x 2
## State Average_MRL
## <chr> <dbl>
## 1 NY 0.407
sampling
Create a new variable df1 using the dataset_1 - Violations - epa_clean - filtering the States of interest
df1 <- epa_clean %>%
select(State, MRL) %>%
filter(State == "NY" |
State == "TX")sampling variable of df1
df1_sampling <- df1 %>%
rep_sample_n(size = 50)
df1_sampling## # A tibble: 50 x 3
## # Groups: replicate [1]
## replicate State MRL
## <int> <chr> <dbl>
## 1 1 TX 0.4
## 2 1 TX 0.4
## 3 1 TX 0.007
## 4 1 TX 0.4
## 5 1 TX 0.4
## 6 1 TX 0.3
## 7 1 TX 0.4
## 8 1 NY 0.4
## 9 1 TX 0.4
## 10 1 NY 0.4
## # ... with 40 more rows
Let’s compute the proportion of MRL meeting national average of 0.3
df1_sampling %>%
mutate(is_avg_MRL = (MRL == 0.4))## # A tibble: 50 x 4
## # Groups: replicate [1]
## replicate State MRL is_avg_MRL
## <int> <chr> <dbl> <lgl>
## 1 1 TX 0.4 TRUE
## 2 1 TX 0.4 TRUE
## 3 1 TX 0.007 FALSE
## 4 1 TX 0.4 TRUE
## 5 1 TX 0.4 TRUE
## 6 1 TX 0.3 FALSE
## 7 1 TX 0.4 TRUE
## 8 1 NY 0.4 TRUE
## 9 1 TX 0.4 TRUE
## 10 1 NY 0.4 TRUE
## # ... with 40 more rows
df1_sampling %>%
mutate(is_avg_MRL = (MRL == 0.4)) %>%
summarize(nat_avg = sum(is_avg_MRL))## # A tibble: 1 x 2
## replicate nat_avg
## <int> <int>
## 1 1 42
let’s compute the proportion of the 50 sampled entries that are MRL= 0.4 by dividing nat_avg by 50
df1_sampling %>%
mutate(is_avg_MRL = (MRL == 0.4)) %>%
summarize(nat_avg = sum(is_avg_MRL)) %>%
mutate(prop_avg_MRL = nat_avg / 50)## # A tibble: 1 x 3
## replicate nat_avg prop_avg_MRL
## <int> <int> <dbl>
## 1 1 42 0.84
Great! 78% of df1_sampling Maximum Regulated Levels is 0.4
Bootstrapping
# Repeat resampling 1000 times
# Compute 1000 sample means
virtual_resampled_means <- df1 %>%
rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>%
group_by(replicate) %>%
summarize(mean_MRL = mean(MRL))
virtual_resampled_means## # A tibble: 1,000 x 2
## replicate mean_MRL
## <int> <dbl>
## 1 1 0.373
## 2 2 0.426
## 3 3 0.42
## 4 4 0.377
## 5 5 0.421
## 6 6 0.368
## 7 7 0.381
## 8 8 0.437
## 9 9 0.422
## 10 10 0.392
## # ... with 990 more rows
ggplot(virtual_resampled_means, aes(x = mean_MRL)) +
geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
labs(x = "sample mean")Statistical Anaysis
Null Hypothesis
NYC has one of the best water in the world, we should drink tap water. Reason why there is no difference in mean between the MRL (Max Reported levels) and the Violation reported mean Results and the Non regulated Contaminant Mean concentration Levels MCL.
Alternate Hypothesis
Due to increase in population,industrialization, poor environmental policies and lack of accurate and continuous monitoring, NYC tap water is not the best in the world and it should be filtered before consumption. There is a significant difference in mean between the accepted levels of contaminants in water and the concentration levels observed in the reported violations
We will investigate NYC
# lets take a look at the data
df1 <- epa_clean %>%
select(State, MRL) %>%
filter(State == "NY" |
State == "TX")# Looking for Average Result Value / above State concentration levels
epa_clean %>%
select(State, Result_value) %>%
filter(State == "NY" |
State == "TX") %>%
group_by(State) %>%
summarise(Average_RV = mean(Result_value))## # A tibble: 2 x 2
## State Average_RV
## <chr> <dbl>
## 1 NY 17.3
## 2 TX 6.83
df2 <- epa_clean %>%
select(State, Result_value) %>%
filter(State == "NY" |
State == "TX")1 sample T-test
I will test whether the average MCL/ MRL levels for contaminants in NYC and TX abides with the National standard of 0.3 ug/l.
To do so we make use of the MRL ( max reporting level ) giving the amount allowed to be reported by the EPA and then we use the Result value variable which tells us the amount above the MCL
hist(log(df1$MRL))hist(log(df2$Result_value))## # A tibble: 1 x 1
## stat
## <dbl>
## 1 -1.04
df2 %>%
specify(response = Result_value) %>%
hypothesize(null = "point", mu = 0.4) %>%
calculate(stat = "t")## # A tibble: 1 x 1
## stat
## <dbl>
## 1 13.6
Hypothesis for Violation Data - api_clean2
t.test(api_clean2$Result_value,
mu = 0.4,
alternative = "greater")##
## One Sample t-test
##
## data: api_clean2$Result_value
## t = 2.7518, df = 3836, p-value = 0.002978
## alternative hypothesis: true mean is greater than 0.4
## 95 percent confidence interval:
## 21.10582 Inf
## sample estimates:
## mean of x
## 51.89279
Hypothesis for Unregulated Contaminant Data - epa_clean
t.test(epa_clean$Result_value,
mu = 0.4,
alternative = "greater")##
## One Sample t-test
##
## data: epa_clean$Result_value
## t = 27.515, df = 16526, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 0.4
## 95 percent confidence interval:
## 16.89116 Inf
## sample estimates:
## mean of x
## 17.93973
We reject the Null Hypothesis, there is significant evidence to believe NYC water is not safe for consumption without being filtered first.