library(readr)
library(tidyverse)
library(ggplot2)
library(lubridate)
water_data <- read_delim("data/UCMR4_All_MA_WY.txt",
                         delim = "\t", escape_double = FALSE,
                         locale = locale(encoding = "Latin1"),
                         trim_ws = TRUE)
##
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   MRL = col_double(),
##   `AnalyticalResultValue(µg/L)` = col_double()
## )
## i Use `spec()` for the full column specifications.
names(water_data)
##  [1] "PWSID"                       "PWSName"
##  [3] "Size"                        "FacilityID"
##  [5] "FacilityName"                "FacilityWaterType"
##  [7] "SamplePointID"               "SamplePointName"
##  [9] "SamplePointType"             "CollectionDate"
## [11] "SampleID"                    "Contaminant"
## [13] "MRL"                         "MethodID"
## [15] "AnalyticalResultsSign"       "AnalyticalResultValue(µg/L)"
## [17] "SampleEventCode"             "MonitoringRequirement"
## [19] "Region"                      "State"
epa_water <- water_data %>%
  select(PWSName,
         Size,
         FacilityName,
         FacilityWaterType,
         CollectionDate,
         Contaminant,
         MRL,
         Result = AnalyticalResultsSign,
         Result_value = `AnalyticalResultValue(µg/L)`,
         State)
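epa_water <- epa_water %>%
  # Recode "=" in the sign column as "TRUE" (the column stays character)
  mutate(Result = replace(Result, Result == "=", TRUE),
         # Parse the month/day/year string into a Date, then drop the raw column
         Date = mdy(CollectionDate),
         CollectionDate = NULL)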
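As a quick check of the recoding step above, the sketch below uses a made-up three-element vector (not taken from the dataset) to show that `replace()` on a character vector stores the string "TRUE" rather than a logical value. If a true logical flag is wanted, a comparison is one possible alternative; the name `detected` is hypothetical and does not appear in the analysis.

```r
# Toy stand-in for the sign column (values invented for illustration)
sign <- c("=", NA, "=")

# replace() keeps the vector's type, so TRUE is coerced to the string "TRUE";
# NA positions are left untouched
recoded <- replace(sign, sign == "=", TRUE)
class(recoded)

# Possible alternative: a logical flag that treats a missing sign as FALSE
detected <- !is.na(sign) & sign == "="
class(detected)
```

The same logic would apply inside `mutate()`, since `replace()` is evaluated on the column as an ordinary vector.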
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
I would like to know, for a specific city and state, the violations and enforcement actions, as well as the definitions, health effects, and sources of contamination for any contaminants.
What are the cases, and how many are there?
Each case is one analytical result for a contaminant sampled at a water treatment facility in a given state. There are 502,043 cases and 10 columns.
Describe the method of data collection.
The motivation for this project is to understand the drinking water requirements for states and public water systems by analyzing data provided by the United States Environmental Protection Agency (EPA). When a public water system is found to contain a contaminant in an amount exceeding the Maximum Contaminant Level (MCL), it is in violation and must take action to restore the quality of its water. Information from the Safe Drinking Water Information System (SDWIS) can be queried and retrieved from the SDWIS database through the Envirofacts REST API.
What type of study is this (observational/experiment)?
Observational
If you collected the data, state self-collected. If not, provide a citation/link.
EPA REST API National Contaminant Occurrence Database (NCOD)
What is the response variable? Is it quantitative or qualitative?
MRL -> Minimum Reporting Level (quantitative)
You should have two independent variables, one quantitative and one qualitative.
Result_value -> Quantitative
State -> Qualitative
Contaminant -> Qualitative
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g., scatter plots, boxplots). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(epa_water)
##    PWSName              Size           FacilityName       FacilityWaterType
##  Length:502043      Length:502043      Length:502043      Length:502043
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##
##  Contaminant             MRL            Result           Result_value
##  Length:502043      Min.   :0.00    Length:502043      Min.   :   0.0
##  Class :character   1st Qu.:0.03    Class :character   1st Qu.:   3.6
##  Mode  :character   Median :0.09    Mode  :character   Median :  11.3
##                     Mean   :0.25                       Mean   :  18.2
##                     3rd Qu.:0.30                       3rd Qu.:  25.4
##                     Max.   :2.00                       Max.   :3960.0
##                     NA's   :106834                     NA's   :378694
##     State                Date
##  Length:502043      Min.   :2018-01-02
##  Class :character   1st Qu.:2018-11-07
##  Mode  :character   Median :2019-06-10
##                     Mean   :2019-06-03
##                     3rd Qu.:2019-12-16
##                     Max.   :2020-12-08
##
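Because `summary()` reports only length and class for character columns, the qualitative variables (State, Contaminant) are not really summarised above. A minimal sketch of one way to get per-group summaries follows. It runs on a small made-up data frame, `toy_water`, whose values are invented for illustration; the names `state_counts` and `value_summary` are also hypothetical. The same calls could be pointed at `epa_water`.

```r
library(dplyr)

# Toy rows standing in for epa_water (values invented for illustration only)
toy_water <- data.frame(
  State        = c("MA", "MA", "WY", "WY"),
  Contaminant  = c("manganese", "manganese", "manganese", "germanium"),
  Result_value = c(10, 30, NA, 2)
)

# Number of cases per level of a qualitative variable
state_counts <- toy_water %>%
  count(State, sort = TRUE)

# Per-contaminant summary of the measured value, ignoring missing results
value_summary <- toy_water %>%
  group_by(Contaminant) %>%
  summarise(
    n_reported   = sum(!is.na(Result_value)),
    median_value = median(Result_value, na.rm = TRUE),
    .groups      = "drop"
  )

print(state_counts)
print(value_summary)
```

Grouping before summarising keeps the missing results (the `NA's` rows in the output above) from hiding which contaminants actually have reported values.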
# Distributions of the two quantitative variables on a log scale
hist(log(epa_water$MRL))
hist(log(epa_water$Result_value))
# Measured values by state
boxplot(Result_value ~ State, data = epa_water)
# MRL: independent variable (x)
# Result_value: dependent variable (y)
plot(log(Result_value) ~ log(MRL), data = epa_water)
ggplot(epa_water) +
  aes(x = State, fill = Contaminant) +
  geom_bar() +
  scale_fill_hue(direction = 1) +
  coord_flip() +
  theme_minimal()