Quality of Drinking Water

Overview

Overview

OBTAINING DRINKING WATER DATA FROM THE ENVIRONMENTAL PROTECTION AGENCY - EPA


Water is an essential factor for the whole life and the human survival and, having an important role for both drinking as well economic sectors. Therefore, protecting this source against any pollution has become necessary (Witek and Jarosiewicz, 2009, Reza and Singh, 2010; Sojobi, 2016). In the last century, the availability and quality of surface or ground waters have been changing, mainly due to urbanization, industrialization etc.

EPA

Using ENVIROFACTS - API

SDWIS - Safe Drinking Water Information System


Information about safe drinking water is stored in SDWIS, the EPA’s Safe Drinking Water Information System. SDWIS tracks information on drinking water contamination levels as required by the 1974 Safe Drinking Water Act and its 1986 and 1996 amendments. The Safe Drinking Water Act (SDWA) and accompanying regulations establish Maximum Contaminant Levels (MCLs), treatment techniques, and monitoring and reporting requirements to ensure that water provided to customers is safe for human consumption.

Map of tables available in SDWIS


UCMR - Unregulated Contaminant Monitoring Rule


EPA collects data for chemicals and microbes that may be present in drinking water, but are not currently subject to EPA drinking water regulations.

The UCMR program was developed in coordination with the Contaminant Candidate List (CCL) a list of contaminants that:

  • Are not regulated by the National Primary Drinking Water Regulations

  • Are known or anticipated to occur at public water systems

  • May warrant regulation under the SDWA

Packages

library(plyr)
library(knitr)
library(readr)
library(tidyverse)
library(tidymodels)
library(DT)
library(dplyr)
library(textrecipes)
library(ggplot2)
library(lubridate)
set.seed(1234)

Data

API_data

Dataset # 1 - Using API


Safe Drinking Water Information System (SDWIS)

The Safe Drinking Water Information System (SDWIS) contains information about public water systems and their violations of EPA’s drinking water regulations. Searching SDWIS will allow you to locate your drinking water supplier and view its violations and enforcement history for the last ten years.

# EPA API
#change import way

#URL <- ("https://data.epa.gov/efservice/VIOLATION/CSV")

library(readr)

api_data <- read_csv("data/EnvirofactsRestAP_violations.csv")
EPA_data

Dataset # 2 - UCMR.Csv


What are the public health benefits of UCMR? This permits assessment of the population being exposed and the levels of exposure.

UCMR data represent one of the primary sources of national occurrence data in drinking water that EPA uses to inform regulatory and other risk management decisions for drinking water contaminant candidates.

library(readr)

water_data <- read_delim("data/UCMR4_All_MA_WY.txt", 
    "\t", escape_double = FALSE, locale = locale(encoding = "Latin1"), 
    trim_ws = TRUE)
names(water_data)
Exploratory Analysis

Dataset 1 - APi Data


library(dplyr)

api_water <- api_data %>% 
  select(Facility_name = VIOLATION.PWSID,
         Pop_served = VIOLATION.POPULATION_SERVED_COUNT,
         State = VIOLATION.PRIMACY_AGENCY_CODE,
         Health = VIOLATION.IS_HEALTH_BASED_IND,
         Contaminant = VIOLATION.CONTAMINANT_CODE,
         Result_value = VIOLATION.VIOL_MEASURE,
         State_MCL = VIOLATION.STATE_MCL)
library(tidyr)
api_clean <- api_water %>% 
  drop_na()
library(skimr)
library(ggplot2)

 api_clean2 %>% skim()
Data summary
Name Piped data
Number of rows 3837
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Facility_name 0 1 8 9 0 2302 0
State 0 1 1 2 0 51 0
Health 0 1 1 1 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Pop_served 0 1 4644.81 43712.50 1 102.00 360.00 1654 1064730 ▇▁▁▁▁
Contaminant 0 1 2240.65 1109.80 100 1025.00 2456.00 2950 4109 ▁▇▂▆▃
Result_value 0 1 51.89 1159.13 0 0.07 0.16 16 54000 ▇▁▁▁▁
State_MCL 0 1 13.06 33.88 0 0.06 0.08 10 500 ▇▁▁▁▁

Dataset 2 - Contaminant Monitoring UCMR


epa_water <- water_data %>% 
  select(PWSName,
         Size,
         FacilityName,
         FacilityWaterType,
         CollectionDate,
         Contaminant,
         MRL,
         Result = AnalyticalResultsSign,
         Result_value = `AnalyticalResultValue(µg/L)`,
         State)
library(plyr)
library(lubridate)

epa_water <- epa_water %>% 
  mutate(Result = replace(Result, Result == "=", TRUE),
         Date = mdy(CollectionDate),
         CollectionDate = NULL) %>% 
  mutate(MRL = as.numeric(MRL))
library(tidyr)

epa_clean <- epa_water %>% 
  drop_na()
epa_clean <- read_csv("epa_clean.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   PWSName = col_character(),
##   Size = col_character(),
##   FacilityName = col_character(),
##   FacilityWaterType = col_character(),
##   Contaminant = col_character(),
##   MRL = col_double(),
##   Result = col_logical(),
##   Result_value = col_double(),
##   State = col_character(),
##   Date = col_date(format = "")
## )
library(lubridate)

epa_clean <- epa_clean %>% 
  mutate(Date = ymd(Date))
library(skimr)
library(ggplot2)

 epa_clean %>% skim()
Data summary
Name Piped data
Number of rows 16527
Number of columns 10
_______________________
Column type frequency:
character 6
Date 1
logical 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
PWSName 0 1 4 54 0 2659 0
Size 0 1 1 1 0 2 0
FacilityName 0 1 2 50 0 4893 0
FacilityWaterType 0 1 2 2 0 4 0
Contaminant 0 1 8 27 0 22 0
State 0 1 2 2 0 35 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 2018-01-02 2020-12-01 2019-05-21 685

Variable type: logical

skim_variable n_missing complete_rate mean count
Result 0 1 1 TRU: 16527

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MRL 0 1 0.41 0.17 0.01 0.40 0.4 0.4 2 ▇▁▁▁▁
Result_value 0 1 17.94 81.95 0.01 0.95 2.9 10.5 3960 ▇▁▁▁▁
Relationships

Correlation is a measure of the relationship or interdependence of two variables. In other words, how much do the values of one variable change with the values of another. Correlation can be either positive or negative and it ranges from -1 to 1, with 1 and -1 indicating perfect correlation (1 being positive and -1 being negative) and 0 indicating no correlation.

Dataset 1 - APi Data


library(PerformanceAnalytics)

api_clean2 %>% 
  select_if(is.numeric) %>% 
  chart.Correlation()

library(GGally)

api_clean2 %>% 
  select_if(is.numeric) %>% 
  ggcorr(label = TRUE)

library(GGally)

api_clean2 %>% 
  select_if(is.numeric) %>% 
  ggpairs()


Dataset 2 - APi Data


library(PerformanceAnalytics)

epa_clean %>% 
  select_if(is.numeric) %>% 
  chart.Correlation()

library(GGally)

epa_clean %>% 
  select_if(is.numeric) %>% 
  ggcorr(label = TRUE)

library(GGally)

epa_clean %>% 
  select_if(is.numeric) %>% 
  ggpairs()

Contaminants

MOST RECENT REGULATED contaminants by the EPA


Regulated contaminants


REGULATED CONTAMINANTS VIOLATIONS - States with levels higher than the Maximum Concentration Levels


library(ggplot2)

ggplot(api_clean2) +
 aes(x = State_MCL, size = Result_value) +
 geom_histogram(bins = 30L, fill = "#FF8C00") +
 scale_x_continuous(trans = "log10") +
 theme_minimal() +
 facet_wrap(vars(State))


UNREGULATED CONTAMINANTS - > Maximum Concentration Levels (MCL) by State


The Safe Drinking Water Act (SDWA) defines “contaminant” as any physical, chemical, biological or radiological substance or matter in water. Drinking water may reasonably be expected to contain at least small amounts of some contaminants. Some contaminants may be harmful if consumed at certain levels in drinking water. The presence of contaminants does not necessarily indicate that the water poses a health risk.

library(DT)
datatable(epa_clean2)
ggplot(epa_clean) +
 aes(x = Date, fill = Contaminant) +
 geom_histogram(bins = 30L) +
 scale_fill_hue(direction = 1) +
 labs(subtitle = "Contaminants per State") +
 theme_minimal() +
 facet_wrap(vars(State))

Methods

Identifiying States of interest


ggplot(epa_clean) +
 aes(x = Date, fill = Contaminant) +
 geom_histogram(bins = 30L) +
 scale_fill_hue(direction = 1) +
 labs(subtitle = "Contaminants per State") +
 theme_minimal() +
 facet_wrap(vars(State))

Violations

using the Public Api, I was able to identify 100K violations in Region 2 of the EPA database. Region 2 includes 34 states, among them TX, NY, NJ, NC, UT. UCMR provided the counties served field for 99% of the PWSs. (n=100,002) but only 26% reported information without any NA’s (n= 3837)

api_clean2 %>% 
  filter(State == "NY") %>% 
  summarise(Avg_ppl_affected = mean(Pop_served))
## # A tibble: 1 x 1
##   Avg_ppl_affected
##              <dbl>
## 1             717.

Unregulated Contaminants

Using the city served and county served fields in UCMR to determine the areas with reports of unregulated contaminants by each PWS. From the 500k entries found we were able to gather 15,527 with no missing values.

When we filter by State = NY we can see that there are 1369 reports stating levels of unregulated contaminants above the MRL (max result value)

library(dplyr)
count(epa_clean$State == "NY")
#average mu of contaminant Reporting Levels

epa_clean %>% 
  select(State, MRL) %>% 
  filter(State == "NY") %>% 
  group_by(State) %>% 
  summarise(Average_MRL = mean(MRL))
## # A tibble: 1 x 2
##   State Average_MRL
##   <chr>       <dbl>
## 1 NY          0.407

sampling


Create a new variable df1 using the dataset_1 - Violations - epa_clean - filtering the States of interest

df1 <- epa_clean %>% 
  select(State, MRL) %>% 
  filter(State == "NY" |
           State == "TX")

sampling variable of df1

df1_sampling <- df1 %>% 
  rep_sample_n(size = 50)

df1_sampling
## # A tibble: 50 x 3
## # Groups:   replicate [1]
##    replicate State   MRL
##        <int> <chr> <dbl>
##  1         1 TX    0.4  
##  2         1 TX    0.4  
##  3         1 TX    0.007
##  4         1 TX    0.4  
##  5         1 TX    0.4  
##  6         1 TX    0.3  
##  7         1 TX    0.4  
##  8         1 NY    0.4  
##  9         1 TX    0.4  
## 10         1 NY    0.4  
## # ... with 40 more rows

Let’s compute the proportion of MRL meeting national average of 0.3

df1_sampling %>% 
  mutate(is_avg_MRL = (MRL == 0.4))
## # A tibble: 50 x 4
## # Groups:   replicate [1]
##    replicate State   MRL is_avg_MRL
##        <int> <chr> <dbl> <lgl>     
##  1         1 TX    0.4   TRUE      
##  2         1 TX    0.4   TRUE      
##  3         1 TX    0.007 FALSE     
##  4         1 TX    0.4   TRUE      
##  5         1 TX    0.4   TRUE      
##  6         1 TX    0.3   FALSE     
##  7         1 TX    0.4   TRUE      
##  8         1 NY    0.4   TRUE      
##  9         1 TX    0.4   TRUE      
## 10         1 NY    0.4   TRUE      
## # ... with 40 more rows
df1_sampling %>% 
  mutate(is_avg_MRL = (MRL == 0.4)) %>% 
  summarize(nat_avg = sum(is_avg_MRL))
## # A tibble: 1 x 2
##   replicate nat_avg
##       <int>   <int>
## 1         1      42

let’s compute the proportion of the 50 sampled entries that are MRL= 0.4 by dividing nat_avg by 50

df1_sampling %>% 
  mutate(is_avg_MRL = (MRL == 0.4)) %>% 
  summarize(nat_avg = sum(is_avg_MRL)) %>% 
  mutate(prop_avg_MRL = nat_avg / 50)
## # A tibble: 1 x 3
##   replicate nat_avg prop_avg_MRL
##       <int>   <int>        <dbl>
## 1         1      42         0.84

Great! 78% of df1_sampling Maximum Regulated Levels is 0.4


Bootstrapping


# Repeat resampling 1000 times

# Compute 1000 sample means

virtual_resampled_means <- df1 %>% 
  rep_sample_n(size = 50, replace = TRUE, reps = 1000) %>% 
  group_by(replicate) %>% 
  summarize(mean_MRL = mean(MRL))
virtual_resampled_means
## # A tibble: 1,000 x 2
##    replicate mean_MRL
##        <int>    <dbl>
##  1         1    0.373
##  2         2    0.426
##  3         3    0.42 
##  4         4    0.377
##  5         5    0.421
##  6         6    0.368
##  7         7    0.381
##  8         8    0.437
##  9         9    0.422
## 10        10    0.392
## # ... with 990 more rows
ggplot(virtual_resampled_means, aes(x = mean_MRL)) +
  geom_histogram(binwidth = 1, color = "white", boundary = 1990) +
  labs(x = "sample mean")

Inference

Statistical Anaysis


Null Hypothesis


NYC has one of the best water in the world, we should drink tap water. Reason why there is no difference in mean between the MRL (Max Reported levels) and the Violation reported mean Results and the Non regulated Contaminant Mean concentration Levels MCL.


Alternate Hypothesis


Due to increase in population,industrialization, poor environmental policies and lack of accurate and continuous monitoring, NYC tap water is not the best in the world and it should be filtered before consumption. There is a significant difference in mean between the accepted levels of contaminants in water and the concentration levels observed in the reported violations


We will investigate NYC

# lets take a look at the data 

df1 <- epa_clean %>% 
  select(State, MRL) %>% 
  filter(State == "NY" |
           State == "TX")
# Looking for Average Result Value / above State concentration levels

epa_clean %>% 
  select(State, Result_value) %>% 
  filter(State == "NY" |
           State == "TX") %>% 
  group_by(State) %>% 
  summarise(Average_RV = mean(Result_value))
## # A tibble: 2 x 2
##   State Average_RV
##   <chr>      <dbl>
## 1 NY         17.3 
## 2 TX          6.83
df2 <- epa_clean %>% 
  select(State, Result_value) %>% 
  filter(State == "NY" |
           State == "TX")

1 sample T-test


I will test whether the average MCL/ MRL levels for contaminants in NYC and TX abides with the National standard of 0.3 ug/l.

To do so we make use of the MRL ( max reporting level ) giving the amount allowed to be reported by the EPA and then we use the Result value variable which tells us the amount above the MCL

hist(log(df1$MRL))

hist(log(df2$Result_value))

## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 -1.04
df2 %>% 
  specify(response = Result_value) %>% 
  hypothesize(null = "point", mu = 0.4) %>% 
  calculate(stat = "t")
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1  13.6

Hypothesis for Violation Data - api_clean2

t.test(api_clean2$Result_value,
       mu = 0.4,
       alternative = "greater")
## 
##  One Sample t-test
## 
## data:  api_clean2$Result_value
## t = 2.7518, df = 3836, p-value = 0.002978
## alternative hypothesis: true mean is greater than 0.4
## 95 percent confidence interval:
##  21.10582      Inf
## sample estimates:
## mean of x 
##  51.89279

Hypothesis for Unregulated Contaminant Data - epa_clean

t.test(epa_clean$Result_value,
       mu = 0.4,
       alternative = "greater")
## 
##  One Sample t-test
## 
## data:  epa_clean$Result_value
## t = 27.515, df = 16526, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 0.4
## 95 percent confidence interval:
##  16.89116      Inf
## sample estimates:
## mean of x 
##  17.93973

Conclusion

We reject the Null Hypothesis, there is significant evidence to believe NYC water is not safe for consumption without being filtered first.