Below are steps taken to:
1. identify a population of WQN sites that are representative of the FAME dataset
2. Compare nutrient concentrations (N & P species) between WQN and FAME sites
I first imported FAME land use (n = 49 sites). I then aggregated land use (LU) into more meaningful groups:
Then I calculated the ranges of the above LU categories and drainage area (DA) for FAME sites. I calculated LU identically for 474 active and historic WQN sites. I used the FAME ranges in LU and DA to filter the wqn dataset using this simple code:
wqn_sub <-
wqn %>%
filter(between(drnarea, 1.4, 25995) &
between(developed, 0.2, 78.4) &
between(forest, 3.4, 99) &
between(ag, 0.01, 72))
458 WQNs were within the FAME ranges of LU and DA. My takeaway is that FAME sites are generally representative of WQN sites, because of the large ranges of LU and DA, and because 458/474 WQN sites were also within these ranges.
I then created a unique vector of WQN stations names and queried SIS for nutrients using tests codes:
wqn_sisPull <-
wqn_sub %>%
pull(wqn_id)
sql_wqns <- paste("'", as.character(wqn_sisPull), "'", collapse=", ", sep="")
length(wqn_sisPull) # 458 wqns
# connect to sis and pull data --------------------------------------------
con <- dbConnect(odbc::odbc(), "fixdwh", PWD="test")
samp_sql <- sprintf("SELECT SAMPLE_ID, SAMPLE_NHD_ID, COLLECTOR_ID, SEQUENCE_NUMBER,
DATE_COLLECTED, TIME_COLLECTED, SAMPLE_MEDIUM_CODE, SAMPLE_MEDIUM_DESC, QUALITY_ASSURANCE_TYPE_DESC,
STANDARD_ANALYSIS_CODE, TEST_CODE, FIELD_ANALYSIS, TEST_SHORT_DESC, TEST_DESC, TEST_UNIT_TIME,
READING_INDICATOR_CODE, FINAL_AMOUNT, FINAL_AMOUNT_ABBREV, LOWER_REPORTING_LIMIT, ANALYTE_SHORT_DESC,
ANALYTE_CODE, SAMPLE_COMMENTS, MONITORING_POINT_NAME, MONITORING_POINT_ALIAS_ID, SAMPLE_STREAM_CODE,
SAMPLE_STREAM_NAME, SAMPLE_LOCATION
FROM DWH_DBA.DWH_ALL_SAMPLE_RESULTS
WHERE DATE_COLLECTED > TO_DATE('2008-01-01 12:00:00', 'yyyy-mm-dd HH24:MI:SS')
AND TEST_CODE IN ('00600A', '00610A', '00615', '00620',
'00665A', '70507A' , '00666A', '00671A')
AND MONITORING_POINT_ALIAS_ID IN (%s)", sql_wqns)
sampleQuery <- dbSendQuery(con, samp_sql)
begin <- Sys.time()
sampleData <- dbFetch(sampleQuery) %>%
as_tibble() %>%
rename(NHD_ID = SAMPLE_NHD_ID)
end <- Sys.time()
end - begin # 10.6 mins
This query resulted in 141,058 records from 210 WQNs. We asked for data from 458 stations, but the filters we put on the query (date since 2008, only N&P test codes) must have precluded them from the pull.
Below are histograms of DA, % forest, % ag, and % development, showing FAME and WQN sites. The y axis is scaled to the density of the dataset, and not to absolute frequency of the number of sites because of the disparity between the sample sizes of each respective dataset.
Overall, FAME sites are generally smaller and are less forested, more agricultural, and slightly more developed than the population of WQN sites. This is probably advantageous when trying to come up with a gradient of nutrient concentrations, but we can confirm that in the next steps.
This plot of histograms comparing density of nutrient concentrations between the dataset show promising results. The distributions generally follow the same shape - the majority of observations in both datasets are low in nutrient concentrations. However, the FAME dataset has comparatively fewer observations of low concentrations - and more of a gradient of from low to high.
This dynamic is likely an asset when trying to describe fatty acid responses across a nutrient gradient.
If you’d like static images of any of the plots above, or more information/analysis, let me know.