This document is intended to serve 2 purposes:
## read in Class A biomass df from MB
classAbiomass <- read_excel('classAbiomass.xlsx', sheet = 'Sheet1')
classAbiomass <- janitor::clean_names(classAbiomass) #clean column names
For spreadsheets with problematic column headers, such as spaces, special characters, capitalization, etc., the ‘janitor’ package is useful. Also, the read_excel() function imports data as a tibble, which has numerous advantages over traditional dataframes.
| county | water | section | fishery | section_limits | lower_limit_latitude | lower_limit_longitude | length_miles | date_added_to_class_a_list | wild_brook_trout_biomass | wild_brown_trout_biomass | wild_rainbow_trout_biomass |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Somerset | South Fork Bens Creek | 4 | Brown | 30 m downstream private bridge off T-590 to T-785 bridge | 40.25059 | -78.97206 | 3.51 | July 2019 | NA | 44.03 | NA |
| McKean | UNT to Lewis Run (RM 3.20) | 1 | Brown | Headwaters to Mouth | 41.84286 | -78.69338 | 2.68 | April 2019 | 1.29 | 40.62 | NA |
| Clearfield/ Elk | Grapevine Run | 1 | Brook | Headwaters to Mouth | 41.20558 | -78.66495 | 0.76 | April 2019 | 40.47 | NA | NA |
After looking at the dataframe, I identified numerous issues which I dealt with in one code chunk.
classA_df <-
classAbiomass %>%
mutate(
#calculate total biomass
total_biomass = rowSums(dplyr::select(
., wild_brook_trout_biomass, wild_brown_trout_biomass, wild_rainbow_trout_biomass), na.rm=T),
date_correct = str_replace(date_added_to_class_a_list, "Febuary", "February"), # use str_replace to correct Febuary
lower_limit_longitude = replace(lower_limit_longitude, lower_limit_longitude > 77.71, -77.71111), # replace errant longitude
date = ymd(parse_date_time(date_correct, orders="bY")), # use parse_date to get into date format
month = month(date), # extract month
year = year(date) # extract year
) %>%
#remove another errant longitude that cannot be understood
filter(!(lower_limit_longitude > -10)) %>%
#drop unnecesary columns
select(-date_added_to_class_a_list)
The result is a clean dataframe
| county | water | section | fishery | section_limits | lower_limit_latitude | lower_limit_longitude | length_miles | wild_brook_trout_biomass | wild_brown_trout_biomass | wild_rainbow_trout_biomass | total_biomass | date_correct | date | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Somerset | South Fork Bens Creek | 4 | Brown | 30 m downstream private bridge off T-590 to T-785 bridge | 40.25059 | -78.97206 | 3.51 | NA | 44.03 | NA | 44.03 | July 2019 | 2019-07-01 | 7 | 2019 |
| McKean | UNT to Lewis Run (RM 3.20) | 1 | Brown | Headwaters to Mouth | 41.84286 | -78.69338 | 2.68 | 1.29 | 40.62 | NA | 41.91 | April 2019 | 2019-04-01 | 4 | 2019 |
| Clearfield/ Elk | Grapevine Run | 1 | Brook | Headwaters to Mouth | 41.20558 | -78.66495 | 0.76 | 40.47 | NA | NA | 40.47 | April 2019 | 2019-04-01 | 4 | 2019 |
Since we now have a clean class A dataframe with lat/long, we can create an interactive map using the leaflet() package. This is a very simple process, but allows you to leverage basemaps and display info with pop-up labels that display while hovering/clicking. You can also add data filters via radio buttons that help to display categorical information spatially. Another advantage is these maps are easily embedded in Rmarkdown documents that can be published to the web, thus no need for IT involvement.
# leaflet map with radio buttons to toggle between species
#define legend categories
bins <- c(0, 20, 40, 60, 100, 200, Inf)
pal_bin <- colorBin("viridis", domain = classA_df$total_biomass, bins = bins, na.color = 'transparent')
# setting na.color = 'transparent' causes sites where no biomass data is present to become invisible
biomassBySpecies_map <-
classA_df %>%
leaflet() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
#species
#brook
addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
col = ~pal_bin(wild_brook_trout_biomass),
opacity = 0.8, popup = ~ water,
label = ~ as.character(round(wild_brook_trout_biomass, 2)), group = 'Brook') %>%
#brown
addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
col = ~pal_bin(wild_brown_trout_biomass),
opacity = 0.8, popup = ~ water,
label = ~ as.character(round(wild_brown_trout_biomass, 2)), group = 'Brown') %>%
#rainbow
addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
col = ~pal_bin(wild_rainbow_trout_biomass),
opacity = 0.8, popup = ~ water,
label = ~ as.character(round(wild_rainbow_trout_biomass, 2)), group = 'Rainbow') %>%
#total
addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
col = ~pal_bin(total_biomass),
opacity = 0.8, popup = ~ water,
label = ~ as.character(round(total_biomass, 2)), group = 'Total') %>%
#legend -- for bin palette
addLegend("topright", pal = pal_bin, values = ~total_biomass,
title = "Trout Biomass",
opacity = 0.8) %>%
# Layers control
addLayersControl(
overlayGroups = c("Brook", 'Brown', 'Rainbow', "Total"),
options = layersControlOptions(collapsed = FALSE)
)
biomassBySpecies_map
Now, since our intention is to examine the relationship of trout biomass and water quality, lets bring in the WQI dataset from TW (2008-2018). We again use read_excel() and janitor to clean column names.
wqi <- read_excel('WQI_2018.xlsx', sheet = 'Sheet1')
wqi <- janitor::clean_names(wqi) #clean column names
I want to show WQI data spatially with a leaflet map too, but first we have to deal with the classifications. These categorical ‘grades’ are problematic, so it’s best to make an additional column as a factor and set levels.
## make grade a factor variable for use with symbology below
wqi <-
wqi %>%
mutate(f_grade = factor(grade, levels = c("Poor", "Fair", "Average", "Good")))
levels(wqi$f_grade)
## [1] "Poor" "Fair" "Average" "Good"
Now we can code our leaflet map (just showing wqi, no swqi’s)
## define color palette
factpal <- colorFactor(c('#d7191c', '#fdae61', '#abdda4', '#2b83ba'), wqi$f_grade)
# overall WQI only
wqi_map <-
wqi %>%
leaflet() %>%
addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
addCircleMarkers(
col = ~factpal(f_grade),
opacity = 0.8,
popup = ~ as.character(round(wqi, 2)), label = ~ flowline_gnis) %>%
addLegend(
colors=c('#d7191c', '#fdae61', '#abdda4', '#2b83ba'),
labels=c('Poor', 'Fair', 'Average', 'Good'),
opacity = 0.8,
title = 'DEP 2018 WQI Scores')
wqi_map
Now that we’ve visualized trout biomass and wqi throughout space, the next step is to export our clean dataframes to excel so we can merge with HUC12 watershed in GIS. This is an annoying step that requires multiple steps outside of the Rstudio project.
For example, it’s a pain to export to excel, import to GIS, spatial join, save as shapefile, open .dbf in excel, save, and re-import to R. Plus, I’m learning that DEP’s GIS only accepts .xls – while R exports as .xlsx. Need to find a work-around to eliminate yet another intermediate step.
For the long-term we need to find way to work with polygons/perfom GIS joins/relates in R. Maybe ArcPro would provide some less painful alternatives?
For the GIS task itself, you can actually build a spreadsheet [write_xlsx()] using dataframes in R. See example below using biomass and wqi dataframes. This helps prevent a bunch of individual csv files in your directory – too many and naming conventions become an issue.
## I don't want to take the entire dataframes out to merge with HUC12, GIS does strange things to numbers
## see if first 7 columns have enough unique info to complete merge
dim(classA_df) # 962 16
classA_df_merge <-
classA_df %>%
select(1:7) %>% # select only first 7 cols (includes lat/long)
distinct() # determine how many unique combos of 1st 6 cols are present (hopefully 962!)
classA_df_merge
dim(classA_df_merge) # 962 7 -- check!
## same for wqi df
dim(wqi) #2625 17
wqi_merge <-
wqi %>%
select(1:6, 15:16) %>% # select only first 6 cols plus lat/long for projecting
distinct() # determine how many unique combos of 1st 6 cols are present (hopefully 2625!)
wqi_merge
dim(wqi_merge) #2625 8 -- check!
# build spreadsheet/export dataframes as single excel sheet with 2 tabs for GIS merge with HUC12
write_xlsx(list(classA_df_merge = classA_df_merge, wqi_merge = wqi_merge), "gis_huc12.xlsx")
I simply projected both dataframes as shapefiles, and merged with HUC12 polygon layer
However, GIS chops off column names if they’re too long so we need to deal with that to merge with the original df to access the numeric fields of interest. Using piping we can read-in data, subset columns, rename columns, and join with original df all in one step.
classA_huc12_join <- read_excel('classA_huc12_join.xlsx', sheet = 'classA_huc12_join') %>%
select(2:6, 16:21) %>% # select only columns of interest
rename(section_limits = section_li) %>% #rename columns that gis cut off
left_join(classA_df, by=c("county", "water", "section", "fishery", "section_limits"))
We confirm that dfs merged successfully due to helpful messages from the tidylog package. When loaded after tidyverse this package gives narrations on what each function accomplished.
| county | water | section | fishery | section_limits | HUC_8 | HUC_8_NAME | HUC_10 | HUC_10_NAM | HUC_12 | HUC_12_NAM | lower_limit_latitude | lower_limit_longitude | length_miles | wild_brook_trout_biomass | wild_brown_trout_biomass | wild_rainbow_trout_biomass | total_biomass | date_correct | date | month | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lancaster | Segloch Run | 2 | Brook | Intersection of T 596 and T 548 to SR026 | 02050306 | Lower Susquehanna | 0205030609 | Cocalico Creek | 020503060902 | Middle Creek | 40.23230 | -76.27900 | 2.60 | 27.35 | 29.95 | NA | 57.30 | January 1983 | 1983-01-01 | 1 | 1983 |
| Bedford-Blair | Boiling Spring Run | 2 | Brown | Dam at pond immediately downstream from Harvest Lane to Mouth | 02050302 | Upper Juniata | 0205030201 | Upper Frankstown Branch Juniata River | 020503020101 | Beaverdam Creek | 40.27194 | -78.46639 | 2.11 | NA | 56.21 | NA | 56.21 | January 2017 | 2017-01-01 | 1 | 2017 |
| Perry | Kansas Valley Run | 1 | Mixed Brook/Brown | Headwaters to Mouth | 02050304 | Lower Juniata | 0205030409 | Tuscarora Creek | 020503040903 | Horse Valley Run | 40.34373 | -77.59420 | 2.48 | 23.21 | 31.96 | NA | 55.17 | March 2001 | 2001-03-01 | 3 | 2001 |
Same for the WQI - HUC12 df
wqi_huc12_join <- read_excel('wqi_huc12_join.xlsx', sheet = 'wqi_huc12_join') %>%
select(2:7, 17:22) %>% # select only columns of interest
rename(nhd_location_id = nhd_locati, monitoring_alias = monitoring,
flowline_gnis = flowline_g, year_collected = year_colle) %>% #rename columns that gis cut off
left_join(wqi, by=c("nhd_location_id", "monitori_1", "monitoring_alias", "sample_loc", "flowline_gnis", "year_collected"))
| nhd_location_id | monitori_1 | monitoring_alias | sample_loc | flowline_gnis | year_collected | HUC_8 | HUC_8_NAME | HUC_10 | HUC_10_NAM | HUC_12 | HUC_12_NAM | wqi | wq_similar | forest | urban | ag | land_dist | grade | n | latitude | longitude | f_grade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1100734 | NA | NA | Cedar Run at culvert outlet downstream of Waste Management | NA | 18 | 02050305 | Lower Susquehanna-Swatara | 0205030505 | Yellow Breeches Creek | 020503050505 | Lower Yellow Breeches Creek | 30.33704 | Ag | 33.33339 | 37.79162 | 30.33704 | 39.21016 | Poor | 1 | 40.23016 | -76.95059 | Poor |
| 36455 | NA | WQN0212 | OLD US RT 11; BRIDGE ST BR @ NEW CUMBERLAND | Yellow Breeches Creek | 8 | 02050305 | Lower Susquehanna-Swatara | 0205030505 | Yellow Breeches Creek | 020503050505 | Lower Yellow Breeches Creek | 49.34103 | Forest | 49.34103 | 50.26803 | 53.29568 | 76.61736 | Fair | 16 | 40.22430 | -76.86069 | Fair |
| 36455 | NA | WQN0212 | OLD US RT 11; BRIDGE ST BR @ NEW CUMBERLAND | Yellow Breeches Creek | 9 | 02050305 | Lower Susquehanna-Swatara | 0205030505 | Yellow Breeches Creek | 020503050505 | Lower Yellow Breeches Creek | 56.38481 | Ag | 57.43163 | 60.63640 | 56.38481 | 76.58827 | Fair | 19 | 40.22430 | -76.86069 | Fair |
The analysis routine consists of averaging wqi in each huc12 watershed since we could potentially have multiple wqi points per huc 12. The resulting characterization of water quality in each watershed can serve as a predictor variable for trout biomass.
wqi_huc12_stat <-
wqi_huc12_join %>%
group_by(HUC_12, HUC_12_NAM) %>% # summarize by these fields
summarize(mean_wqi = mean(wqi, na.rm=T), sd_wqi = sd(wqi, na.rm=T), n_wqi = length(wqi), # calculate stats
minYear = min(year_collected, na.rm=T), maxYear = max(year_collected, na.rm=T))
# could keep going to calculate mean swqi scores
444 huc12s have wqi points within. Range of n is 1-37.
| HUC_12 | HUC_12_NAM | mean_wqi | sd_wqi | n_wqi | mean_forest_wqi | sd_forest_wqi | mean_urban_wqi | sd_urban_wqi | mean_ag_wqi | sd_ag_wqi | mean_land_dist_wqi | sd_land_dist_wqi | minYear | maxYear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 020401010402 | Lower Equinunk Creek | 93.47634 | NA | 1 | 97.07617 | NA | 98.40826 | NA | 93.51108 | NA | 93.47634 | NA | 18 | 18 |
| 020401010605 | Masthope Creek | 82.67053 | NA | 1 | 96.03777 | NA | 96.08027 | NA | 95.36778 | NA | 82.67053 | NA | 16 | 16 |
| 020401030201 | East Branch Dyberry Creek | 91.91476 | 1.170703 | 2 | 96.94153 | 2.389315 | 98.16760 | 1.569052 | 93.56753 | 3.508076 | 94.57599 | 2.592843 | 10 | 18 |
classA_wqi_df <-
classA_huc12_join %>%
inner_join(wqi_huc12_stat, by=c('HUC_12', 'HUC_12_NAM')) ## using inner_join so that only matching x & y records are returned
The merge resulted in 137 unique huc12s with water quality AND trout biomass data - decent n
To accomplish, we need to melt and facet_wrap in ggplot2 to show x - y relationships. Piping like this - including melting and plotting results in no objects created in your local environment!
classA_wqi_df %>%
reshape2::melt(id.vars = c('HUC_12', 'HUC_12_NAM', 'mean_wqi', 'sd_wqi'),
measure.vars = c("wild_brook_trout_biomass", "wild_brown_trout_biomass",
"wild_rainbow_trout_biomass","total_biomass")) %>%
drop_na(value) %>% #drop the rows with NA value for trout biomass
ggplot(aes(x=mean_wqi, y=value, color = variable, fill = variable)) +
facet_wrap(~variable, ncol = 2, scales = 'free_y') +
geom_point(pch=21, alpha = 0.7) +
geom_smooth(method = lm) +
labs(x = 'Mean HUC12 WQI', y = 'Trout Biomass (kg/ha)') +
theme_minimal() +
theme(legend.position = 'none')
If we want to save high quality image file of plot, use ggsave() to manipulate file type, dpi, and dimensions
ggsave(file='WQI_vs_TroutBiomass.png', scale=1, units='in', dpi=300, width=10, height=6)
Subsequent regression modeling of brook and brown trout vs. water quality is revealing. Rainbow trout response was not modeled due to low sample size (n=14).
| Species | WQI Type | Intercept | Slope | P | R2_adjusted | n | Result |
|---|---|---|---|---|---|---|---|
| Brook Trout | Agriculture SWQI | 14.81 +/- 7.9 | 0.29 +/- 0.09 | 0.002 | 2.88% | 307 | Positive |
| Brook Trout | Forest SWQI | 14.78 +/- 9.09 | 0.29 +/- 0.1 | 0.006 | 2.11% | 307 | Positive |
| Brook Trout | Land Disturbance SWQI | 53.1 +/- 8.12 | -0.16 +/- 0.1 | 0.092 | 0.60% | 307 | No Trend |
| Brook Trout | Urban SWQI | 16.35 +/- 10.07 | 0.27 +/- 0.11 | 0.021 | 1.42% | 307 | Positive |
| Brook Trout | WQI | 33.17 +/- 6.45 | 0.08 +/- 0.08 | 0.312 | 0.01% | 307 | No Trend |
| Brown Trout | Agriculture SWQI | 170.61 +/- 13.01 | -1.5 +/- 0.17 | 0.000 | 26.38% | 221 | Negative |
| Brown Trout | Forest SWQI | 191.33 +/- 15.1 | -1.7 +/- 0.19 | 0.000 | 26.78% | 221 | Negative |
| Brown Trout | Land Disturbance SWQI | 191.74 +/- 51.42 | -1.54 +/- 0.6 | 0.010 | 2.53% | 221 | Negative |
| Brown Trout | Urban SWQI | 213.08 +/- 17.54 | -1.92 +/- 0.21 | 0.000 | 26.58% | 221 | Negative |
| Brown Trout | WQI | 174.98 +/- 13.87 | -1.65 +/- 0.19 | 0.000 | 25.20% | 221 | Negative |
Brook trout biomass is significantly, positively correlated with agriculture, forest, and urban SWQIs. For example, for every 3 point increase in the agriculture SWQI score (improving water quality), brook trout biomass increased by ~ 1 kg/ha. This suggests that brook trout are responding positively to increases in water quality spatially across PA. Although these relationships are positive and significant, the models describe very little variation in the data (0 - 2.88%). The most likely interpretation is that there are many other stressors involved in controlling brook trout abundance such as habitat degradation and population fragmentation. Water quality has a role in regulating brook trout abundance, but it seems to be a minor factor.
Conversely, brown trout are negatively correlated with all SWQI scores and the overall WQI score. The slopes (effect sizes) are much greater for browns compared to brooks. For example, for every 1 point increase in overall WQI score, brown trout biomass decreases by 1.65 kg/ha. Additionally, the amount of variation described by brown trout models is much increased (25 - 27% for overall WQI and all SWQIs except land disturbance). Taken together, the magnitude and directionality of the slope estimates suggest that water quality is not playing a large role in regulating brown trout populations.
A shortcoming of this analysis may be that mean wqi scores in huc12 watersheds are not representative of on-site water quality. However, huc12s are small enough that it shouldn’t be a major issue.
I hope this exercise included some useful code while informing a current topic of discussion.