CABIN Data Exploration

Preliminary wrangling, tidying and exploration for benthic data within all regions sampled in Canada within the Canadian Aquatic Biomonitoring Network (CABIN).

CABIN is an aquatic biomonitoring program for freshwater ecosystems that began in 1987. Sites are divided by “Reference”, or “Test” sites.

Reference sites represent sites without industrial, agriculture, or development activity. Test sites may or may not have similar activity, and data from the reference sites is used to assess changes to macroinvertebrate communities, water chemistry, and habitat.

This data set is a collection of 30 or so spreadsheets from different regions, with different site visit identifiers and variables across each type of spreadsheet (“study”, “habitat” and “benthic”). This is an example of how to aggregate and sort somewhat disparate data from different sources for graphing and exploration.

The differences between test and reference sites are graphed here for the initial analysis to begin to see if a) we can detect a difference in pH, macroinvertebrate abundance, and percent sorted between test and reference sites. Further analyses are coming, and will be done with the MVAbund package in R for multivariate environmental data.

Data retrieved from the open source data set: https://open.canada.ca/data/en/dataset/13564ca4-e330-40a5-9521-bfb1be767147

Open-sourced R packages used: tidyverse, tidyhydat, zoo, lubridate, ggplot, scales, data.table

source code with quality control methods for this dataset: https://github.com/Jacqui-123/CABIN/blob/main/CABIN%20Benthic_data.Rmd

1) Load in the data

Smoosh all the benthic datasheets into one dataframe, with a column of region to be able to keep track of the different regions each data set originally came from. Do the same thing for the “study” and “habitat” spreadsheets.

library(data.table)
files <- list.files(path = "/Users/LevyJ/Desktop/CABIN", pattern = "^cabin_benthic.*?\\.csv", full.names = TRUE) 
df_benthic <- map_df(files, ~read_csv(., col_types = cols(.default = "c")), .id = "Region", encoding = 'Latin-1') 
#do this for all files

2) Data Tidying

a) Benthic Data

-Rename the columns to English and standard characters to be easier to work with. (code not shown)

-Create a new column “unique ID” with the site visit ID, site, and year separated by a “-”. This is so we have a unique id for each site visit and year. This was done for each set of spreadsheet to create a unique identifier for each site visit that can be used across all types of datasheets.

df_benthic <- df_benthic %>%
  tidyr::unite("Unique_Id", c("Site_Visit_ID", "Site", "Year" ), sep = "_", remove = TRUE) 

-Select only kick net samples, sample # = 1, mesh size 400, and calculate the total abundance for each unique site. Individual taxa are removed for now.

b) Study data tidying: -Create a unique site identifier to be able to accurately match up data -Keep only the “River” sites, as some are lake or wetland sites.

c) Habitat data tidying:

-Create a unique site identifier to be able to accurately match up data

-Pivot habitat data so the variables are in columns and the values in rows

df_habitat <- df_habitat %>%
#keep only first sample from triplicates
  filter(Sample_Number == "1") %>%
  #get rid of protocols other than wadeable streams
  #filter(Protocol == "CABIN - Wadeable Streams") %>% 
  #keep only alkalinity and pH data
  filter(Variable == "General-Alkalinity" | Variable == "General-pH")  %>%
  filter(Type == "Water Chemistry")%>%
  #Value needs to be as.numeric or pivot won't work. 
  mutate_at(c('Value'), as.numeric) %>%
  #unite so have units of measurement in column name as well when pivot
  tidyr::unite("Habitat Variable", c("Variable", "Unit"), sep = "_") %>%
  select(-c( "MDL", "VariableDescription")) %>%
  #pivot, preserving multiple columns (value, notes, computed)
  pivot_wider(names_from = "Habitat Variable", values_from = c("Value", "Notes", "Computed") ) %>%
  rename("pH" = "Value_General-pH_pH",
         "Alkalinity" = "Value_General-Alkalinity_mg/L") %>%
    distinct()

3) Merge data together

-Change df to Data.tables for quicker processing speed using the data.table vignette

-Join the three data tables together.

#merge on benthic and habitat data, keeping everything: 

df_both<- merge(df_habitat, df_benthic, by = "Unique_Id", all = TRUE)

#merge df_both and the study data - left join to keep only the df_both (hab + benth) sites, as study data doesn't have any new variables of interest.  

df_all <- merge(df_both, df_study, by = "Unique_Id", all.x = TRUE)

#length(unique(df_all$Unique_Id)) #sanity check, should be 12408

Tidy the joined df to get rid of lake/wetland/non wadeable stream sites

#tidy to get rid of wetlands, lakes, etc- make sure to keep NAs
df_all <- df_all %>%
  filter(is.na(Protocol) | Protocol == "CABIN - Wadeable Streams") %>%
  filter(is.na(Sampling_Device) | Sampling_Device == "Kick Net") %>%
  #note: we are keeping if River or wadeable
  filter(is.na(Type.y) | Type.y  == "River") %>%
  filter(is.na(Mesh_size) | Mesh_size == "400") %>%
  mutate(Status = coalesce(Status.x, Status.y)) %>%
  mutate(Region = coalesce(Region.x, Region.y))

4) Get ready to graph

-Tidy and reformat df_all pH and alkalinity data so that it is as numeric, and has the correct number of digits

-Graphing prep: make separate data.table for test, reference, and all sites for quicker processing

-Make a boxplot function so that it’s easier to make multiple graphs with different inputs

#Graphing function: boxplot 
boxplot <- function(data, x, y, fill, title, ylabtitle) {
  p <-
    ggplot(data, aes({{x}}, {{y}}, fill = {{fill}} )) +
geom_boxplot() +
  theme_bw() +
  geom_point() +
    theme(plot.title =  element_text(size = 10, hjust = .5, face = "bold")) +
   stat_boxplot(geom = "errorbar", width = 0.25) +
  ggtitle({{title}}) +
   ylab({{ylabtitle}}) +
 theme(legend.position = 'none') +
scale_x_discrete(labels = wrap_format(10)) +
  viridis::scale_fill_viridis(discrete = TRUE)
    print(p)
}

5) Graphing: Percent Sorted by reference and test sites

Table with # of data points for each region (ref/potential ref sites):

##                                       Region_Name    n
##  1:                          Arctic Drainage Area  322
##  2:                Great Slave Lake Drainage Area  172
##  3:              Maritime Provinces Drainage Area  653
##  4:                    Nelson River Drainage Area  121
##  5:    Northern Quebec and Labrador Drainage Area   49
##  6:                         Pacific Drainage Area 1435
##  7:         Southwestern Hudson Bay Drainage Area  110
##  8:                    St. Lawrence Drainage Area  336
##  9: Western and Northern Hudson Bay Drainage Area    8
## 10:                     Yukon River Drainage Area  137

Table with # of data points for each region (test sites)

##                                      Region_Name    n
## 1:                          Arctic Drainage Area  419
## 2:                Great Slave Lake Drainage Area  140
## 3:              Maritime Provinces Drainage Area  759
## 4:                    Nelson River Drainage Area  460
## 5:    Northern Quebec and Labrador Drainage Area    2
## 6:                         Pacific Drainage Area 1204
## 7:                    St. Lawrence Drainage Area  355
## 8: Western and Northern Hudson Bay Drainage Area    1
## 9:                     Yukon River Drainage Area   91

6) Graphing: Total Abundance (macroinvertebrate count) data

Note: looks like there may be outliers? These are not excluded from the scatterplots below

Abundance data, # of data points for each region (ref sites)

##                                       Region_Name    n
##  1:                          Arctic Drainage Area  322
##  2:                Great Slave Lake Drainage Area  172
##  3:              Maritime Provinces Drainage Area  653
##  4:                    Nelson River Drainage Area  121
##  5:    Northern Quebec and Labrador Drainage Area   49
##  6:                         Pacific Drainage Area 1435
##  7:         Southwestern Hudson Bay Drainage Area  110
##  8:                    St. Lawrence Drainage Area  336
##  9: Western and Northern Hudson Bay Drainage Area    8
## 10:                     Yukon River Drainage Area  137

Abundance data, # of data points for each region (test sites)

##                                      Region_Name    n
## 1:                          Arctic Drainage Area  419
## 2:                Great Slave Lake Drainage Area  140
## 3:              Maritime Provinces Drainage Area  759
## 4:                    Nelson River Drainage Area  460
## 5:    Northern Quebec and Labrador Drainage Area    2
## 6:                         Pacific Drainage Area 1205
## 7:                    St. Lawrence Drainage Area  355
## 8: Western and Northern Hudson Bay Drainage Area    1
## 9:                     Yukon River Drainage Area   91

7) Graphing: pH data (& Percent sorted)

pH data, # of data points for each region (ref sites)

##                                      Region_Name    n
## 1:                Great Slave Lake Drainage Area  176
## 2:              Maritime Provinces Drainage Area  594
## 3:                    Nelson River Drainage Area  127
## 4:    Northern Quebec and Labrador Drainage Area   33
## 5:                         Pacific Drainage Area 1409
## 6:         Southwestern Hudson Bay Drainage Area  110
## 7:                    St. Lawrence Drainage Area  345
## 8: Western and Northern Hudson Bay Drainage Area    8
## 9:                     Yukon River Drainage Area  143

pH data, # of data points for each region (test sites)

##                                      Region_Name    n
## 1:                Great Slave Lake Drainage Area  140
## 2:              Maritime Provinces Drainage Area  614
## 3:                    Nelson River Drainage Area  449
## 4:                         Pacific Drainage Area 1072
## 5:                    St. Lawrence Drainage Area  239
## 6: Western and Northern Hudson Bay Drainage Area    1
## 7:                     Yukon River Drainage Area   90

8) Graphing: pH data (& Abundance)

9) Graphing: Alkalinity data & Percent Sorted

**Note: Lower (<20.0 mg/L) values differ depending on the lab and might not be entirely accurate to compare across the sites. For example in some cases they list a “low quantification limit” with a value close to zero (values = .035 for example). Or an MDL threshold like <20.0, or <10.0, and in each case is entered into CABIN as 1/2 the MDL (ie 10.0, or 5.0). All of these values were retained, as is, for the plots down below.

Alkalinity data, # of data points for each region (ref sites)

##                                   Region_Name    n
## 1:             Great Slave Lake Drainage Area  124
## 2:           Maritime Provinces Drainage Area  201
## 3:                 Nelson River Drainage Area   67
## 4: Northern Quebec and Labrador Drainage Area   13
## 5:                      Pacific Drainage Area 1190
## 6:      Southwestern Hudson Bay Drainage Area  110
## 7:                 St. Lawrence Drainage Area  174
## 8:                  Yukon River Drainage Area   97

Alkalinity data, # of data points for each region (test sites)

##                         Region_Name   n
## 1:   Great Slave Lake Drainage Area  51
## 2: Maritime Provinces Drainage Area 213
## 3:       Nelson River Drainage Area 292
## 4:            Pacific Drainage Area 734
## 5:       St. Lawrence Drainage Area  64
## 6:        Yukon River Drainage Area  56

10) Graphing: Alkalinity data & Abundance

**Note: MDL values included