Ginorio_FinalProject.utf8

Overview

OBTAINING DRINKING WATER DATA FROM THE ENVIRONMENTAL PROTECTION AGENCY - EPA

Water is an essential factor for the whole life and the human survival and, having an important role for both drinking as well economic sectors. Therefore, protecting this source against any pollution has become necessary (Witek and Jarosiewicz, 2009, Reza and Singh, 2010; Sojobi, 2016). In the last century, the availability and quality of surface or ground waters have been changing, mainly due to urbanization, industrialization etc.

EPA

Using ENVIROFACTS - API

SDWIS - Safe Drinking Water Information System

Information about safe drinking water is stored in SDWIS, the EPA’s Safe Drinking Water Information System. SDWIS tracks information on drinking water contamination levels as required by the 1974 Safe Drinking Water Act and its 1986 and 1996 amendments. The Safe Drinking Water Act (SDWA) and accompanying regulations establish Maximum Contaminant Levels (MCLs), treatment techniques, and monitoring and reporting requirements to ensure that water provided to customers is safe for human consumption.

Map of tables available in SDWIS

UCMR - Unregulated Contaminant Monitoring Rule

EPA collects data for chemicals and microbes that may be present in drinking water, but are not currently subject to EPA drinking water regulations.

The UCMR program was developed in coordination with the Contaminant Candidate List (CCL) a list of contaminants that:

Are not regulated by the National Primary Drinking Water Regulations
Are known or anticipated to occur at public water systems
May warrant regulation under the SDWA

Packages

library(plyr)
library(knitr)
library(readr)
library(tidyverse)
library(tidymodels)
library(DT)
library(dplyr)
library(textrecipes)
library(ggplot2)
library(lubridate)
set.seed(1234)

Data

API_data

Dataset # 1 - Using API

Safe Drinking Water Information System (SDWIS)

The Safe Drinking Water Information System (SDWIS) contains information about public water systems and their violations of EPA’s drinking water regulations. Searching SDWIS will allow you to locate your drinking water supplier and view its violations and enforcement history for the last ten years.

# EPA API
#change import way

#URL <- ("https://data.epa.gov/efservice/VIOLATION/CSV")

library(readr)

api_data <- read_csv("data/EnvirofactsRestAP_violations.csv")

EPA_data

Dataset # 2 - UCMR.Csv

What are the public health benefits of UCMR? This permits assessment of the population being exposed and the levels of exposure.

UCMR data represent one of the primary sources of national occurrence data in drinking water that EPA uses to inform regulatory and other risk management decisions for drinking water contaminant candidates.

library(readr)

water_data <- read_delim("data/UCMR4_All_MA_WY.txt", 
    "\t", escape_double = FALSE, locale = locale(encoding = "Latin1"), 
    trim_ws = TRUE)
names(water_data)

Exploratory Analysis

Dataset 1 - APi Data

library(dplyr)

api_water <- api_data %>% 
  select(Facility_name = VIOLATION.PWSID,
         Pop_served = VIOLATION.POPULATION_SERVED_COUNT,
         State = VIOLATION.PRIMACY_AGENCY_CODE,
         Health = VIOLATION.IS_HEALTH_BASED_IND,
         Contaminant = VIOLATION.CONTAMINANT_CODE,
         Result_value = VIOLATION.VIOL_MEASURE,
         State_MCL = VIOLATION.STATE_MCL)

library(tidyr)
api_clean <- api_water %>% 
  drop_na()

library(skimr)
library(ggplot2)

 api_clean2 %>% skim()

Data summary
Name	Piped data
Number of rows	3837
Number of columns	7
_______________________
Column type frequency:
character	3
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Facility_name	1	8	9	2302
State	1	1	2	51
Health	1	1	1	1

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Pop_served	1	4644.81	43712.50	1	102.00	360.00	1654	1064730	▇▁▁▁▁
Contaminant	1	2240.65	1109.80	100	1025.00	2456.00	2950	4109	▁▇▂▆▃
Result_value	1	51.89	1159.13	0	0.07	0.16	16	54000	▇▁▁▁▁
State_MCL	1	13.06	33.88	0	0.06	0.08	10	500	▇▁▁▁▁

Dataset 2 - Contaminant Monitoring UCMR

epa_water <- water_data %>% 
  select(PWSName,
         Size,
         FacilityName,
         FacilityWaterType,
         CollectionDate,
         Contaminant,
         MRL,
         Result = AnalyticalResultsSign,
         Result_value = `AnalyticalResultValue(µg/L)`,
         State)

library(plyr)
library(lubridate)

epa_water <- epa_water %>% 
  mutate(Result = replace(Result, Result == "=", TRUE),
         Date = mdy(CollectionDate),
         CollectionDate = NULL) %>% 
  mutate(MRL = as.numeric(MRL))

library(tidyr)

epa_clean <- epa_water %>% 
  drop_na()

epa_clean <- read_csv("epa_clean.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   PWSName = col_character(),
##   Size = col_character(),
##   FacilityName = col_character(),
##   FacilityWaterType = col_character(),
##   Contaminant = col_character(),
##   MRL = col_double(),
##   Result = col_logical(),
##   Result_value = col_double(),
##   State = col_character(),
##   Date = col_date(format = "")
## )

library(lubridate)

epa_clean <- epa_clean %>% 
  mutate(Date = ymd(Date))

library(skimr)
library(ggplot2)

 epa_clean %>% skim()

Data summary
Name	Piped data
Number of rows	16527
Number of columns	10
_______________________
Column type frequency:
character	6
Date	1
logical	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
PWSName	1	4	54	2659
Size	1	1	1	2
FacilityName	1	2	50	4893
FacilityWaterType	1	2	2	4
Contaminant	1	8	27	22
State	1	2	2	35

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Date	0	1	2018-01-02	2020-12-01	2019-05-21	685

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
Result	0	1	1	TRU: 16527

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
MRL	0	1	0.41	0.17	0.01	0.40	0.4	0.4	2	▇▁▁▁▁
Result_value	0	1	17.94	81.95	0.01	0.95	2.9	10.5	3960	▇▁▁▁▁

Relationships

Correlation is a measure of the relationship or interdependence of two variables. In other words, how much do the values of one variable change with the values of another. Correlation can be either positive or negative and it ranges from -1 to 1, with 1 and -1 indicating perfect correlation (1 being positive and -1 being negative) and 0 indicating no correlation.

Dataset 1 - APi Data

library(PerformanceAnalytics)

api_clean2 %>% 
  select_if(is.numeric) %>% 
  chart.Correlation()

library(GGally)

api_clean2 %>% 
  select_if(is.numeric) %>% 
  ggcorr(label = TRUE)

library(GGally)

api_clean2 %>% 
  select_if(is.numeric) %>% 
  ggpairs()

Dataset 2 - APi Data

library(PerformanceAnalytics)

epa_clean %>% 
  select_if(is.numeric) %>% 
  chart.Correlation()

library(GGally)

epa_clean %>% 
  select_if(is.numeric) %>% 
  ggcorr(label = TRUE)

library(GGally)

epa_clean %>% 
  select_if(is.numeric) %>% 
  ggpairs()

Contaminants

MOST RECENT REGULATED contaminants by the EPA

Regulated contaminants

REGULATED CONTAMINANTS VIOLATIONS - States with levels higher than the Maximum Concentration Levels

library(ggplot2)

ggplot(api_clean2) +
 aes(x = State_MCL, size = Result_value) +
 geom_histogram(bins = 30L, fill = "#FF8C00") +
 scale_x_continuous(trans = "log10") +
 theme_minimal() +
 facet_wrap(vars(State))

UNREGULATED CONTAMINANTS - > Maximum Concentration Levels (MCL) by State

The Safe Drinking Water Act (SDWA) defines “contaminant” as any physical, chemical, biological or radiological substance or matter in water. Drinking water may reasonably be expected to contain at least small amounts of some contaminants. Some contaminants may be harmful if consumed at certain levels in drinking water. The presence of contaminants does not necessarily indicate that the water poses a health risk.

library(DT)
datatable(epa_clean2)

ggplot(epa_clean) +
 aes(x = Date, fill = Contaminant) +
 geom_histogram(bins = 30L) +
 scale_fill_hue(direction = 1) +
 labs(subtitle = "Contaminants per State") +
 theme_minimal() +
 facet_wrap(vars(State))

Methods

Identifiying States of interest

ggplot(epa_clean) +
 aes(x = Date, fill = Contaminant) +
 geom_histogram(bins = 30L) +
 scale_fill_hue(direction = 1) +
 labs(subtitle = "Contaminants per State") +
 theme_minimal() +
 facet_wrap(vars(State))

Violations

using the Public Api, I was able to identify 100K violations in Region 2 of the EPA database. Region 2 includes 34 states, among them TX, NY, NJ, NC, UT. UCMR provided the counties served field for 99% of the PWSs. (n=100,002) but only 26% reported information without any NA’s (n= 3837)

api_clean2 %>% 
  filter(State == "NY") %>% 
  summarise(Avg_ppl_affected = mean(Pop_served))

## # A tibble: 1 x 1
##   Avg_ppl_affected
##              <dbl>
## 1             717.

Unregulated Contaminants

Using the city served and county served fields in UCMR to determine the areas with reports of unregulated contaminants by each PWS. From the 500k entries found we were able to gather 15,527 with no missing values.

When we filter by State = NY we can see that there are 1369 reports stating levels of unregulated contaminants above the MRL (max result value)

library(dplyr)
count(epa_clean$State == "NY")

#average mu of contaminant Reporting Levels

epa_clean %>% 
  select(State, MRL) %>% 
  filter(State == "NY") %>% 
  group_by(State) %>% 
  summarise(Average_MRL = mean(MRL))

## # A tibble: 1 x 2
##   State Average_MRL
##   <chr>       <dbl>
## 1 NY          0.407

sampling

Create a new variable df1 using the dataset_1 - Violations - epa_clean - filtering the States of interest

df1 <- epa_clean %>% 
  select(State, MRL) %>% 
  filter(State == "NY" |
           State == "TX")

sampling variable of df1

df1_sampling <- df1 %>% 
  rep_sample_n(size = 50)

df1_sampling

## # A tibble: 50 x 3
## # Groups:   replicate [1]
##    replicate State   MRL
##        <int> <chr> <dbl>
##  1         1 TX    0.4  
##  2         1 TX    0.4  
##  3         1 TX    0.007
##  4         1 TX    0.4  
##  5         1 TX    0.4  
##  6         1 TX    0.3  
##  7         1 TX    0.4  
##  8         1 NY    0.4  
##  9         1 TX    0.4  
## 10         1 NY    0.4  
## # ... with 40 more rows

Let’s compute the proportion of MRL meeting national average of 0.3

df1_sampling %>% 
  mutate(is_avg_MRL = (MRL == 0.4))

## # A tibble: 50 x 4
## # Groups:   replicate [1]
##    replicate State   MRL is_avg_MRL
##        <int> <chr> <dbl> <lgl>     
##  1         1 TX    0.4   TRUE      
##  2         1 TX    0.4   TRUE      
##  3         1 TX    0.007 FALSE     
##  4         1 TX    0.4   TRUE      
##  5         1 TX    0.4   TRUE      
##  6         1 TX    0.3   FALSE     
##  7         1 TX    0.4   TRUE      
##  8         1 NY    0.4   TRUE      
##  9         1 TX    0.4   TRUE      
## 10         1 NY    0.4   TRUE      
## # ... with 40 more rows

df1_sampling %>% 
  mutate(is_avg_MRL = (MRL == 0.4)) %>% 
  summarize(nat_avg = sum(is_avg_MRL))

## # A tibble: 1 x 2
##   replicate nat_avg
##       <int>   <int>
## 1         1      42

let’s compute the proportion of the 50 sampled entries that are MRL= 0.4 by dividing nat_avg by 50

df1_sampling %>% 
  mutate(is_avg_MRL = (MRL == 0.4)) %>% 
  summarize(nat_avg = sum(is_avg_MRL)) %>% 
  mutate(prop_avg_MRL = nat_avg / 50)

## # A tibble: 1 x 3
##   replicate nat_avg prop_avg_MRL
##       <int>   <int>        <dbl>
## 1         1      42         0.84

Great! 78% of df1_sampling Maximum Regulated Levels is 0.4

Bootstrapping

# Repeat resampling 1000 times

# Compute 1000 sample means

virtual_resampled_means <- df1 %>% 
  rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>% 
  group_by(replicate) %>% 
  summarize(mean_MRL = mean(MRL))
virtual_resampled_means

## # A tibble: 1,000 x 2
##    replicate mean_MRL
##        <int>    <dbl>
##  1         1    0.373
##  2         2    0.426
##  3         3    0.42 
##  4         4    0.377
##  5         5    0.421
##  6         6    0.368
##  7         7    0.381
##  8         8    0.437
##  9         9    0.422
## 10        10    0.392
## # ... with 990 more rows

ggplot(virtual_resampled_means, aes(x = mean_MRL)) +
  geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
  labs(x = "sample mean")

Inference

Statistical Anaysis

Null Hypothesis

NYC has one of the best water in the world, we should drink tap water. Reason why there is no difference in mean between the MRL (Max Reported levels) and the Violation reported mean Results and the Non regulated Contaminant Mean concentration Levels MCL.

Alternate Hypothesis

Due to increase in population,industrialization, poor environmental policies and lack of accurate and continuous monitoring, NYC tap water is not the best in the world and it should be filtered before consumption. There is a significant difference in mean between the accepted levels of contaminants in water and the concentration levels observed in the reported violations

We will investigate NYC

# lets take a look at the data 

df1 <- epa_clean %>% 
  select(State, MRL) %>% 
  filter(State == "NY" |
           State == "TX")

# Looking for Average Result Value / above State concentration levels

epa_clean %>% 
  select(State, Result_value) %>% 
  filter(State == "NY" |
           State == "TX") %>% 
  group_by(State) %>% 
  summarise(Average_RV = mean(Result_value))

## # A tibble: 2 x 2
##   State Average_RV
##   <chr>      <dbl>
## 1 NY         17.3 
## 2 TX          6.83

df2 <- epa_clean %>% 
  select(State, Result_value) %>% 
  filter(State == "NY" |
           State == "TX")

1 sample T-test

I will test whether the average MCL/ MRL levels for contaminants in NYC and TX abides with the National standard of 0.3 ug/l.

To do so we make use of the MRL ( max reporting level ) giving the amount allowed to be reported by the EPA and then we use the Result value variable which tells us the amount above the MCL

hist(log(df1$MRL))

hist(log(df2$Result_value))

## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 -1.04

df2 %>% 
  specify(response = Result_value) %>% 
  hypothesize(null = "point", mu = 0.4) %>% 
  calculate(stat = "t")

## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  13.6

Hypothesis for Violation Data - api_clean2

t.test(api_clean2$Result_value,
       mu = 0.4,
       alternative = "greater")

## 
##  One Sample t-test
## 
## data:  api_clean2$Result_value
## t = 2.7518, df = 3836, p-value = 0.002978
## alternative hypothesis: true mean is greater than 0.4
## 95 percent confidence interval:
##  21.10582      Inf
## sample estimates:
## mean of x 
##  51.89279

Hypothesis for Unregulated Contaminant Data - epa_clean

t.test(epa_clean$Result_value,
       mu = 0.4,
       alternative = "greater")

## 
##  One Sample t-test
## 
## data:  epa_clean$Result_value
## t = 27.515, df = 16526, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 0.4
## 95 percent confidence interval:
##  16.89116      Inf
## sample estimates:
## mean of x 
##  17.93973

Conclusion

We reject the Null Hypothesis, there is significant evidence to believe NYC water is not safe for consumption without being filtered first.

Quality of Drinking Water

Overview

Overview

Packages

Data

API_data

EPA_data

Exploratory Analysis

Relationships

Contaminants

Methods

Inference

Conclusion