This document serves as an overview of monomeric aluminum (test code 01106D) data availability in Pennsylvania. Monomeric Al is a dissolved species of aluminum that passes through a 0.1 micron filter.
A query was run in SIS to pull all statewide monomeric Al data and match it with spatial data using the NHD_ID field. The code used to query SIS is included below, along with a description of the results, with special emphasis on shortcomings of the dataset.
ODBC drivers allow us to connect directly to SIS and customize queries from RStudio. Below is the code necessary to pull all monomeric Al (test code 01106D) data currently in SIS.
# load necessary packages - must be in the office or connected to the VPN
library(DBI)
library(odbc)
library(tidyverse)
library(tidylog)
# make connection with SIS via oracle driver and ODBC package
con <- dbConnect(odbc::odbc(), "fixdwh", PWD="test")
# query monomeric Al samples by specifying test code
sampleQuery <- dbSendQuery(con,"SELECT SAMPLE_ID, SAMPLE_NHD_ID, COLLECTOR_ID, SEQUENCE_NUMBER, DATE_COLLECTED, TIME_COLLECTED, SAMPLE_MEDIUM_CODE, SAMPLE_MEDIUM_DESC, QUALITY_ASSURANCE_TYPE_DESC, STANDARD_ANALYSIS_CODE, TEST_CODE, FIELD_ANALYSIS, TEST_SHORT_DESC, TEST_DESC, TEST_UNIT_TIME, READING_INDICATOR_CODE, FINAL_AMOUNT, FINAL_AMOUNT_ABBREV, LOWER_REPORTING_LIMIT, ANALYTE_SHORT_DESC, ANALYTE_CODE, SAMPLE_COMMENTS, MONITORING_POINT_NAME, MONITORING_POINT_ALIAS_ID, SAMPLE_STREAM_CODE, SAMPLE_STREAM_NAME, SAMPLE_LOCATION
FROM DWH_DBA.DWH_ALL_SAMPLE_RESULTS
WHERE TEST_CODE = '01106D' ")
We have now connected to SIS (the con object) and created our query (the sampleQuery object) using the dbSendQuery() function. Now we can fetch the results using dbFetch(), then use pipe operators to perform the additional steps necessary for analysis.
# fetch query
sampleData <- dbFetch(sampleQuery) %>%
as_tibble() %>% # save as tibble instead of data.frame
rename(NHD_ID = SAMPLE_NHD_ID) %>% # prepare for join
mutate_at('FINAL_AMOUNT', as.numeric) # FINAL_AMOUNT is character, must convert to numeric
dim(sampleData) # 426 27
summary(sampleData) # 273 out of 426 NHD_ID's are NA
length(unique(sampleData$NHD_ID)) # 92 unique values - NA counts as one of them, so 91 actual NHD_IDs
Our query returned 426 monomeric Al sample results collected from 2003 to 2020. However, about 64% of the samples (273 of 426) have no NHD_ID, meaning no location information is available for the majority of samples. This makes those samples nearly useless, because we do not know where they were collected.
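As a quick sanity check (this line is not part of the original script), the proportion of samples missing an NHD_ID can be computed directly:
# proportion of sample results with no NHD_ID (location); about 0.64 for this pull
mean(is.na(sampleData$NHD_ID))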
There are 91 unique NHD_IDs (sites) where data has been collected and location information is included in SIS (the 92nd "unique" value counted above is NA). In the next step, we will pull location information for only these unique sites using a further customized query.
uniqueNHD <- sort(unique(sampleData$NHD_ID)) # create vector of unique NHD_IDs to use for query
sql_nhd <- paste("'", as.character(uniqueNHD), "'", collapse=", ", sep="") # format vector to nest within query using sprintf
# customize query to pull location data for only the unique NHD_IDs included in the sample results
nhd_sql <- sprintf("SELECT NHD_ID, NHD_REACH_CODE, NHD_HUC_CODE, NHD_WATERSHED_DESCRIPTION, NHD_GNIS_ID, NHD_GNIS_FEATURE_NAME, FLOWLINE_COM_ID, ACCEPTED_LATITUDE_DECIMAL, ACCEPTED_LONGITUDE_DECIMAL
FROM DWH_DBA.DWH_NHD
WHERE NHD_ID IN (%s)", sql_nhd)
siteQuery <- dbSendQuery(con, nhd_sql)
siteData <- dbFetch(siteQuery) %>% # run query
  as_tibble() %>%
  distinct() # each NHD_ID is duplicated 10 times in DWH_NHD! keep only unique combinations of all columns (should be n = 1 per site)
dim(siteData) # 91 rows - one for each non-NA NHD_ID requested (the 92nd "unique" value above was NA)
dbDisconnect(con) # disconnect from SIS
This query returned location information for all 91 of the NHD_IDs we actually requested; the apparent discrepancy with the 92 "unique" values reported earlier arises because unique() counts NA as a value while sort() drops it. The location data is duplicated in DWH_NHD, so running distinct() removes the duplicated rows and leaves one row per site.
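A quick check (not part of the original script) can confirm that nothing besides the NA value was lost between the sample results and the site query:
# requested NHD_IDs with no match in DWH_NHD - should be empty if every requested ID matched
setdiff(uniqueNHD, siteData$NHD_ID)
# number of sample results with no NHD_ID at all
sum(is.na(sampleData$NHD_ID)) # 273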
After disconnecting from SIS, we can join the results and location dataframes to begin spatial analysis.
# join sample results to site locations (SAMPLE_NHD_ID was renamed to NHD_ID above)
sample_site_join <-
  sampleData %>%
  left_join(siteData, by = 'NHD_ID') %>%
  filter(QUALITY_ASSURANCE_TYPE_DESC != 'Blank' | is.na(QUALITY_ASSURANCE_TYPE_DESC)) %>% # drop blank QA samples, keep rows with no QA type recorded
  drop_na(NHD_ID, FINAL_AMOUNT) # drop rows missing location or concentration
In the code above, we join location information to the sample results, drop blank samples, and drop observations whenever there is no location information or concentration. Now we have a clean dataset to begin analysis.
After some checks (code not shown), I’m confident that the detection limit is consistently 10 ug/L for the dataset, so no further data cleaning steps are necessary.
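The check itself is not shown; a minimal sketch of one way to verify it, using the LOWER_REPORTING_LIMIT column pulled in the first query, might look like this (this is an illustration, not the original check):
# tabulate reporting limits in the cleaned dataset; a single value of 10 across
# all rows would support a consistent detection limit
sample_site_join %>%
  count(LOWER_REPORTING_LIMIT)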
After calculating summary statistics such as n, mean, and sd at each unique site (code not shown), we can visualize the dataset.
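The summary-statistics code is not shown; a sketch of how the per-site statistics could be computed is below (the object name siteSummary and the grouping choice are assumptions):
# per-site summary statistics (sketch only; siteSummary is a hypothetical name)
siteSummary <- sample_site_join %>%
  group_by(NHD_ID, ACCEPTED_LATITUDE_DECIMAL, ACCEPTED_LONGITUDE_DECIMAL) %>%
  summarise(n = n(),
            mean_Al = mean(FINAL_AMOUNT),
            sd_Al = sd(FINAL_AMOUNT),
            .groups = 'drop')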
The histogram below shows the distribution of sampling intensity, expressed as the number of samples taken at each unique site. The majority of sites had only 1 or 2 samples; 3 or more samples were rarely collected.
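A sketch of how this histogram could be drawn from the hypothetical siteSummary object above (the original plotting code is not shown):
# histogram of number of samples per site (sketch only)
ggplot(siteSummary, aes(x = n)) +
  geom_histogram(binwidth = 1) +
  labs(x = 'Samples per site', y = 'Number of sites')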
The histogram below shows the distribution of mean monomeric Al concentrations at unique sites. Mean concentrations ranged from 10 to 1,420 ug/L. The majority of sites had mean concentrations < 300 ug/L.
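Similarly, a sketch for the concentration histogram, again assuming the siteSummary object and an arbitrary bin width:
# histogram of mean monomeric Al concentration per site (sketch only)
ggplot(siteSummary, aes(x = mean_Al)) +
  geom_histogram(binwidth = 50) +
  labs(x = 'Mean monomeric Al (ug/L)', y = 'Number of sites')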
We can use the coordinates from sites where spatial information was available to map the results. Mean concentrations at each site are symbolized using a color ramp. Data coverage appears limited to areas where acid deposition is a suspected cause of impairment.
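The mapping code is not shown; one possible sketch using the coordinates pulled from DWH_NHD is below (the viridis color ramp and a plain point map with no basemap are assumptions, not the original figure):
# map mean concentrations using site coordinates (sketch only)
ggplot(siteSummary,
       aes(x = ACCEPTED_LONGITUDE_DECIMAL, y = ACCEPTED_LATITUDE_DECIMAL, color = mean_Al)) +
  geom_point(size = 2) +
  scale_color_viridis_c(name = 'Mean monomeric Al (ug/L)') +
  coord_quickmap() +
  labs(x = 'Longitude', y = 'Latitude')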
Although there were 426 available sample results, only 152 (about 36%) of those were usable. The lack of spatial information attached to samples in SIS limits the utility of the dataset. Here are some thoughts about how to improve the utility of the dataset (and many others like it):