Overview


       This document serves as an overview of monomeric Aluminum (test code 01106D) availability in Pennsylvania. Monomeric Al is a dissolved species of aluminum that is filtered through a 0.1 micron filter.

       A query was run in SIS to pull all statewide monomeric Al data and match with spatial data using the NHD_ID field. Code is included below that was used to query SIS. A description of the results is also included with special emphasis on shortcomings.




SIS Query in R


       ODBC drivers allow us to connect directly to SIS and customize queries with Rstudio. Below is code necessary to pull all monomeric Al (test code 01106D) data that is currently in SIS.

# load packages necessary - must be in office or connected to VPN
library(DBI)
library(odbc)
library(tidyverse)
library(tidylog)

# make connection with SIS via oracle driver and ODBC package
con <- dbConnect(odbc::odbc(), "fixdwh", PWD="test")


# query monomeric Al samples by specifying test code
sampleQuery <- dbSendQuery(con,"SELECT SAMPLE_ID, SAMPLE_NHD_ID, COLLECTOR_ID, SEQUENCE_NUMBER, DATE_COLLECTED, TIME_COLLECTED, SAMPLE_MEDIUM_CODE, SAMPLE_MEDIUM_DESC, QUALITY_ASSURANCE_TYPE_DESC, STANDARD_ANALYSIS_CODE, TEST_CODE, FIELD_ANALYSIS, TEST_SHORT_DESC, TEST_DESC, TEST_UNIT_TIME, READING_INDICATOR_CODE, FINAL_AMOUNT, FINAL_AMOUNT_ABBREV, LOWER_REPORTING_LIMIT, ANALYTE_SHORT_DESC, ANALYTE_CODE, SAMPLE_COMMENTS, MONITORING_POINT_NAME, MONITORING_POINT_ALIAS_ID, SAMPLE_STREAM_CODE, SAMPLE_STREAM_NAME, SAMPLE_LOCATION

                    FROM DWH_DBA.DWH_ALL_SAMPLE_RESULTS

                    WHERE TEST_CODE = '01106D' ")


       We have now connected to sis (con object) and created our query (sampleQuery object) using the dbSendQuery() function. Now we can fetch the query using dbfetch(). We can then use pipe operators to perform additional steps necessary for analysis.

# fetch query
sampleData <-  dbFetch(sampleQuery) %>% 
  as_tibble() %>% # save as tibble instead of data.frame
  rename(NHD_ID = SAMPLE_NHD_ID) %>%  # prepare for join
  mutate_at('FINAL_AMOUNT', as.numeric) # FINAL_AMOUNT is character, must convert to numeric


dim(sampleData) # 426  27
summary(sampleData) # 273 out of 426 NHD_ID's are NA
length(unique(sampleData$NHD_ID)) # 92 unique NHD_ID's


       Our query resulted in 426 sample results of monomeric Al from 2003-2020. However 65% of the samples do not have an NHD_ID, meaning there is no location data available for the majority of samples. This makes the data almost useless, because we don’t know where it was collected.
       There are 92 unique NHD_IDs (sites) where data has been collected (and location information included in SIS). In the next step, we will pull location information for only these 92 unique sites using a further customized query.

uniqueNHD <- sort(unique(sampleData$NHD_ID)) # create vector of unique NHD_IDs to use for query
sql_nhd <- paste("'", as.character(uniqueNHD), "'", collapse=", ", sep="") # format vector to nest within query using sprintf


# customize query to pull location data for only unique NHD_IDs inlcuded in sample results
nhd_sql <- sprintf("SELECT NHD_ID, NHD_REACH_CODE, NHD_HUC_CODE, NHD_WATERSHED_DESCRIPTION, NHD_GNIS_ID, NHD_GNIS_FEATURE_NAME, FLOWLINE_COM_ID, ACCEPTED_LATITUDE_DECIMAL, ACCEPTED_LONGITUDE_DECIMAL
                    
                    FROM DWH_DBA.DWH_NHD 
                    
                    WHERE NHD_ID IN (%s)", sql_nhd)


siteQuery <- dbSendQuery(con, nhd_sql) 


siteData <- dbFetch(siteQuery) %>%  # run query
  as_tibble %>% 
  distinct() # each NHD ID is duplicated 10 times! this will save only unique combinatiions of all columns (should be n= 1)


dim(siteData) # 91  one less than length of unique NHD_ID's above


dbDisconnect(con) # disconnect from SIS


       This query resulted in location information from 91 of the 92 NHD_IDs we requested. The location data is duplicated, so by running distinct() we can remove duplicated rows.
       After disconnecting from SIS, we can join the results and location dataframes to begin spatial analysis.

# SAMPLE_NHD_ID == NHD_ID
sample_site_join <- 
  sampleData %>% 
  left_join(siteData, by='NHD_ID') %>% 
  
  filter(QUALITY_ASSURANCE_TYPE_DESC != 'Blank' | is.na(QUALITY_ASSURANCE_TYPE_DESC)) %>% 
  
  drop_na(NHD_ID, FINAL_AMOUNT)


       In the code above, we join location information to the sample results, drop blank samples, and drop observations whenever there is no location information or concentration. Now we have a clean dataset to begin analysis.
       After some checks (code not shown), I’m confident that the detection limit is consistently 10 ug/L for the dataset, so no further data cleaning steps are necessary.




Data Visualization


       After calculating summary statistics such as n, mean, sd, etc… (code not shown) at unique sites, we are able to visualize our dataset.


Sampling intensity at unique sites


       The histogram below shows the distribution sampling intensity, shown as number of samples taken at unique sites. The majority of site had only 1 or 2 samples taken. Very rarely were 3 or more samples collected.


Monomeric Al concentrations


       The histogram below shows the distribution of mean monomeric Al concentrations at unique sites. Mean concentrations ranged from 10 - 1420 ug/L. The majority of sites had mean concentrations < 300 ug/L.

Leaflet map


       We can use the coordinates from sites where spatial information was available to map the results throughout space. Mean concentrations at each site are symbolized using a color ramp. Data coverage seems limited to areas where acid deposition is a suspected cause of impairment.




Implications


Although there were 426 available sample results, only 152 (35%) of those were useable. The lack of spatial information included for samples in SIS hinders the utility of the dataset. Here are some thoughts about how to improve the utility of the dataset (and many others like it):

  • After pulling data, it is extremely simple to identify samples that do not contain location information (NHD_ID = NA). We could reach out to staff that collected these data and provide them with collecter id, sequence number, date, and time and request that they retroactively enter spatial information associated with samples.


  • Many of the samples seemed to come from non-DEP staff, such as SRBC. I recognized many of these aliases from being collected as part of acid deposition TMDL projects or watershed-wide monitoring projects.
    • How is location information typically entered when samples are collected by non-DEP staff? Inclusion of location information with these samples may also help improve the utility of the dataset.


  • The issue of samples lacking location information in SIS keeps rearing it’s ugly head. We need to consistently communicate the importance of creating monitoring points to associate with samples in SIS. We could develop a training tutorial/video/session to aid understanding of the process.


  • The existing data that do have location information are limited to areas where acid dep is a likely cause of impairment. We don’t have much monomeric Al data from well-buffered streams. We need to improve data coverage in anticipation of inclusion of an acid/atmospheric deposition protocol for source/cause determinations.