PIBI Notebook: Draft

2018-01-05

Summary of Issues

This document aims to provide R-scripts to calculate the Chesapeake Bay Phytoplankton indices described by (Lacouture et al. 2006). In the Validation section of this document, an independent data source produced by Jacqueline M. Johnson is used QA/QC the processes outlined in this document. R functions developed to for metric scoring (Metric Score Disagreement), IBI scoring (IBI Value Disagreement), and IBI rating (Rating Disagreement) for the most part appear to be working correctly (IBI Value, IBI Rating, Metric Scoring Disagreements). However, before you dive into the documentation, I would like to direct your attention to a number of discrepancies that were identified in the Metric Disagreement section. The Overview of Descrepancies subsection is a good starting point because it summarizes the descrepancies by index, size of the descrepancy, metric, and data source. Some of these descrepancies are very small when metric values are compared and could be a result of round differences. To reduce the impact of minor changes metric scores are compared instead of metric values. Metric scores bin the data into four categories: 1, 3, 5, or NA. Therefore, if the metric score in this document differs from Jacqueline M. Johnson’s metric score, the metric value has most likely changed quite a bit.

The metric with the greatest number of scoring descrepancies is Pheophytin.
- This is an issue for all indices and both data sources.
  - MSU/Pearl Pheophytin values are incorrect more frequently.
- Potential reason(s) for issue:
  - Calculation of mean pheophytin values.
    - Current procedure (Prepare Water Quality Data).
      - For oligohaline, mesohaline, and polyhaline indics water quality data is summarized and represented as the mean of the available samples collected from the surface layer and above the pycnocline layer.
      - For the freshwater index, water quality data is summarized as the mean value of all of the samples collected in the water column (i.e. collected from the surface, above pycnocline, below pycnocline, and benthic samples).
    - Jacqueline M. Johnson may have had a more complicated procedure then outlined in Prepare Water Quality Data.
    - Additional data may have been acquired since the Jacqueline M. Johnson last calculated these values.
The metric with the second greatest number of scoring descrepancies is total_phyto_biomass_chla_ratio.
- This is an issue for all indices that incorporate this metric and both data sources.
  - MSU/Pearl values are incorrect more frequently.
- Potential reason(s) for issue:
  - Calculation of mean chla values (See reasons discussed for Pheophytine above).
  - Jacqueline M. Johnson may have had a different procedure for calculating biomass then what is outlined in Prepare Carbon Assignments.
    - This seems unlikely becuase there are many instances where the total_phyto_biomass_chla_ratio values do agree. Maybe Jacqueline M. Johnson used a different carbon table or maybe there are a select set of taxa that require modified biomass calculations?
The metric with the third greatest number of scoring descrepancies is doc.
- See reasons discussed for Pheophytine above.
The remaining biomass metrics scoring descrepancies are most likely due to an incorrect calculation of biomass (see total_phyto_biomass_chla_ratio discussed above).
- Could be the result of differences between the assigned taxonomic hierarchy in this document and the hierarchy assigned by Jacqueline M. Johnson.
  - ITIS was used to aquire the majority of taxonomic hierarchy information for the reported phytoplankton taxa (see ITIS Data and Prepare Taxonomic Hierarchy). I believe ITIS was used by Jacqueline M. Johnson as well becuase the Chesapeake Bay Program Incorporates TSNs (Taxonomic Serial Number assigned by ITIS) in most of their taxonomic databases. Taxonomic updates to ITIS between the assessments could cause some of these differences.

There are several probable sources for these discrepancies:

The most likely source is an error or issue in this documents R-script.
- Typos
- Incorrect interpretation of data preparation or calculation
- Overlooked data preparation steps
- Poorly documented data prepatration steps, which make it difficult for me to replicate
There is potential for issues in the Jacqueline M. Johnson’s SQL code, which would ultimately lead to discrepancies between her calculated values and the values calculated in this document.
The data may have been updated or altered between the time Jacqueline M. Johnson calculated the indices and now.
- Taxonomic names may have been updated to reflect currently accepted taxonomic nomenclature.
- Data could be represented differently in the new Chesapeake Bay Program CEDR database.

Introduction

This R-script and documentation was created to calculate the Phytoplankton Index of Biotic Integrity (PIBI) for the Chesapeake Bay and its tidal tributaries developed by (Lacouture et al. 2006). The following text documents the process used to prepare and calculate the PIBI. Data was acquired from the Chesapeake Bay Program’s (CBP) CEDR database.

Intended Audience

I developed this document for the Chesapeake Bay Program and individuals/groups interested in assessing the Chesapeake Bay condition with phytoplankton. The document describes the how to use the statistical coding language, (R Core Team 2017), to re-create the PIBI. My hope is that this document will allow others to assess Chesapeake Bay conditions based on the indices developed by (Lacouture et al. 2006).

This document provides the R code necessary to import, prepare, and calculated the indices developed by (Lacouture et al. 2006). R Markdown is used to integrate text and code. The intention is to provide a more readable document, with better descriptions of how and why an activity was performed, than a standard R script with a few commented lines. Although my hope is that readers without any R background could comprehend the R code in this document with a little bit of patience, this document is really intended for readers with at least a basic understanding of R programming.

Additionally, I am a big fan of the tidyverse packages developed by the RStudio team and use these packages frequently throughout the document. The tidyverse is essentially an ecosystem of packages that work well together and often make it easier to deal with data in R. For R users who do not use packages from the tidyverse, I think many of the functions will be intuitive or easy to quickly search for online. However, I often find that the pipe operator, %>%, from the magrittr package is confusing to those unfamiliar with the tidyverse, and right fully so. It takes a little while to wrap your head around the pipe operator but once you do I think you will find its use makes R code more legible. In essence base R works from the inside-out, while the pipe operator presents the code in a linear fashion. For example, imagine you have a character vector x and you want to trim leading/trailing white space, then keep only unique strings, and then make all characters lowercase. In base R, the code would look this: tolower(unique(trimws(x))). Again, base R works from the inside-out, so first trimws() is used to remove leading/trailing white space, then unique() is used to remove duplicate strings, and finally tolower() is used to convert all characters to lowercase. Using the pipe operator, the code would look like this: x %>% trimws() %>% unique() %>% tolower(). The pipe operator presents the functions in the order you intend them to be performed; therefore, the code should be easier to read and more intuitive, which should in turn reduce errors. For more information on the pipe operator please refer to http://magrittr.tidyverse.org/.

Download and Edit this Document

This document can be downloaded and edited from my GitHub account. As stated in the previous section, this document was created using R Markdown, so that the code could be integrated with descriptive text. The document is broken up into multiple children documents in the “notebook” folder, which are all compiled together using the “pipi_notebook.Rmd” parent document. For small corrections, the easiest way to inform me will be to open a new issue on the GitHub respository for this document. However, my hope is that there will be others interested in becoming more involved, of whom I can add as collaborators to the Git Hub repository. RStudio has made it very easy to connect R projects to GitHub. Using these tools, it can be very easy to “push” and “pull” edits, while maintaining a record of the documents history. If you are interested in collaborating on this document, please contact me, Zachary M. Smith.

How to Download from GitHub

For those interested in collaborating, please install the latest version of RStudio. RStudio interacts well with GitHub and some aspects of this document may depend on the RStudio IDE. Next create a new project:

Create a new project (File -> New Project).
Select Version Control and Git (Version Controal -> Git).
- If you recieve a message say “Git Not Found”, then you need to install Git (Install Git Here). Once Git is installed, try setting up the project again in RStudio.
- I believe RStudio should automatically recognize Git on your computer but I had to specify where the git.exe was located by going to Tools -> Global Options -> Git/SVN -> locate git.exe (my file path: “C:/Users/zsmith/AppData/Local/Programs/Git/bin/git.exe”).
Paste the phyto repository URL (https://github.com/zsmith27/phyto.git) into the “Repository URL” box.
Use the “Create project as subdirectory of:” box to manage where you want to store the project on your computer.

Compile the Document

To view the html document you must compile the document using Knit. Follow these steps to Knit the document:

Open pibi_notebook.Rmd in phyto/notebook.
Find and click the Knit button (it looks like a ball of yarn) in the toolbar above the editor window.
- If a window appears saying “Install Requred Packages” for R Markdown, install the necessary packages for knitting the document.
Hopefully the document will compile correctly. If the document does not compile correctly, try to interpret and correct for the error message returned. If you cannot interprete the error message, please open a new issue on the GitHub respository for this document and I will try to help you resolve the issue.

Getting Started

Load the necessary packages into the environment (Dowle and Srinivasan 2017; Wickham 2017 a; Grolemund and Wickham 2011; Wickham 2017 b; Wickham and Bryan 2017; Lang and the CRAN team 2016; Wickham et al. 2017; Wickham and Henry 2017; Henry and Wickham 2017; Wickham 2009; Cheng, Karambelkar, and Xie 2017; Ooms 2014; Müller 2017; Chamberlain 2017).

library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(stringr)
library(leaflet)
library(lubridate)
library(readxl)
library(httr)
library(RCurl)
library(data.table)
library(jsonlite)
library(rprojroot)
library(ritis)

mmir is a costume package that I am developing to calculate taxonomic community metrics. The development version of mmir can be downloaded from my GitHub account using devtools::install_github(repo = "zsmith27/mmir"). After the installation mmir must be loaded into the environment using library().

library(mmir)

sessionInfo() is used to provide the information necessary to replicate this document. As packages are updated, especially mmir, functions may become defunct or a function may not respond in a similar fashion to previous versions of the package. Without the information provided by sessionInfo() it can be extremely difficult, if not impossible to replicate the results in this report.

sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] packrat_0.4.8-54    bindrcpp_0.2        mmir_0.0.0.9000    
##  [4] ritis_0.7.0         rprojroot_1.3-1     jsonlite_1.5       
##  [7] data.table_1.10.4-3 RCurl_1.95-4.8      bitops_1.0-6       
## [10] httr_1.3.1          readxl_1.0.0        lubridate_1.7.1    
## [13] leaflet_1.1.0       stringr_1.2.0       purrr_0.2.4        
## [16] ggplot2_2.2.1       tidyr_0.7.2         dplyr_0.7.4        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.14           compiler_3.4.3         cellranger_1.1.0      
##  [4] plyr_1.8.4             bindr_0.1              tools_3.4.3           
##  [7] digest_0.6.13          evaluate_0.10.1        tibble_1.3.4          
## [10] gtable_0.2.0           pkgconfig_2.0.1        rlang_0.1.4           
## [13] shiny_1.0.5            crosstalk_1.0.0        curl_3.1              
## [16] crul_0.4.0             yaml_2.1.16            xml2_1.1.1            
## [19] knitr_1.17             htmlwidgets_0.9        grid_3.4.3            
## [22] glue_1.2.0             R6_2.2.2               solrium_1.0.0         
## [25] rmarkdown_1.8          magrittr_1.5           codetools_0.2-15      
## [28] OptimalCutpoints_1.1-3 backports_1.1.2        scales_0.5.0          
## [31] htmltools_0.3.6        assertthat_0.2.0       mime_0.5              
## [34] colorspace_1.3-2       xtable_1.8-2           httpuv_1.3.5          
## [37] stringi_1.1.6          lazyeval_0.2.1         munsell_0.4.3

Data for this document are located in various files. The object project.dir represents the phytoplankton project directory and will be used as the root of the file path throughout the document.

project.dir <- rprojroot::find_rstudio_root_file()

Custom Functions

The clean_string() function modifies character objects by removing leading/trailing white space, converting all characters to lower case, and replacing all spaces (" “) with an underscore (”_“). These specifications eliminate common typos and standardize character objects.

clean_string <- function(x) {
  x %>% 
    stringr::str_trim() %>% 
    tolower() %>% 
    stringr::str_replace_all("\\s+", " ") %>% 
    stringr::str_replace_all(" ", "_") %>%  
    if_else(. == "", as.character(NA), .)
}

The clean_up() function is a wrapper for clean_string(). This function applies clean_string() to the column headers and character columns of data frames. Again, this process standardizes character values and makes it easier work with the data.

clean_up <- function(x) {
  x %>% 
    rename_all(clean_string) %>% 
    mutate_if(is.character, funs(clean_string))%>% 
    distinct()
}

Data Acquisition

CEDR Data

It is NOT recommended that the following section of R-script be run with everytime this document is compiled. Running this script each time should not cause any errors but could lead to significant wait times associated with downloading the data from CEDR. The phytoplankton taxonomic counts and water quality data is relatively large and can take some time to download; therefore, this script should be run after your intial download from GitHUB, after a known update has been made, or after a signficant amount of time has passed from the intial data download (~6 months).

For those working with the Rmarkdown file, the object run.cedr.acquisition is a hidden object specified at the head of the pibi_notebook.Rmd created to indicate if the user would like to download data from CEDR (run.cedr.acquisition <- TRUE) or if user has already downloaded the data and wants to skip downloading the data again (run.cedr.acquisition <- FALSE). run.cedr.acquisition is then used in this section to signifiy if the code chunks should be executed or ignored using the eval option at the top of the code chunk; therefore, if run.cedr.acquisition is set to TRUE, then eval=TRUE and the code will be executed, and vice versa ifrun.cedr.acquisition is set to FALSE. In the subsequent code chunks in this section, the CEDR data will be downloaded using the Chesapeake Bay Program CEDR API and the downloaded data will be saved into the R project directory. The data will be imported in subsequent sections.

Create url.root, which represents the root of the CEDR API. All API calls will start with url.root. Additionally, use Sys.Date() to find todays, which will be used in the API query. This makes the following script more dynamic because if this script is run in 2 years, it will then include any data that was added to the CEDR database within those 2 years.

url.root <- "http://datahub.chesapeakebay.net/api.JSON"
todays.date <- format(Sys.Date(), "%m-%d-%Y")

Download Station Vector

Create a vector (station.vec) of stations that contain phytoplankton data. The vector will be used to query data from from the CEDR API in subsequent code chunks.

station.vec <- file.path(url.root,
                       "LivingResources",
                       "TidalPlankton",
                       "Reported",
                       "1-01-1970",
                       todays.date,
                       "17",
                       "Station") %>% 
  fromJSON() %>% 
  pull(unique(MonitoringLocationId))

Download Phytoplankton Data

Download all of the phytoplankton data from the CEDR API.

phyto.df <- file.path(url.root,
                      "LivingResources",
                      "TidalPlankton",
                      "Reported",
                      "1-01-1970",
                      todays.date, 
                      "17",
                      "Station",
                      paste(station.vec, collapse = ",")) %>%
  fromJSON() %>% 
  clean_up()

Subsetphyto.df to only represent data collected from the above pycnocline layer or in the water column (Layer %in% c("AP", "WC)) and rows that contain taxonomic counts (!is.na(ReportingValue)). Also, convert SampleDate to class date, which will be used to limit the water quality download query [Water Quality Data].

phyto.df <- phyto.df %>% 
  filter(layer %in% c("ap", "wc"),
         !is.na(reportingvalue)) %>% 
  mutate(sampledate = as.Date(sampledate))

Update TSNs (final_tsn) for reported taxa (latinname) known to be problematic. Most of these taxa are not in the ITIS database but an upstream taxonomic rank is found in ITIS. For example, the species, Navicula notablis, is not found in ITIS but the genus Navicula is present in ITIS; therefore, the TSN and hierarchy for Navicula is used to represent an incomplete hierarchy for Navicula notablis. For the purposes of this document, these incomplete hierarchies are good enough because the metrics calulated in the Metric Calculation section only requires taxonomic data for the ranks division, phylum, class, and two specific species. For future development of indices, these hierarchies will need to be completed either manually or find a more complete database (maybe Algaebase?).

phyto.df <- phyto.df %>% 
  mutate(
    tsn = as.integer(tsn),
    final_tsn = case_when(
      latinname == "navicula_notablis" ~ as.integer(4327),
      latinname == "pleurosigma_macrum" ~ as.integer(4650),
      latinname == "pleurosigma_obscurum" ~ as.integer(591383),
      latinname == "polykrikos_hartmannii" ~ as.integer(331299),
      latinname == "protoperidinium_aciculiderum" ~ as.integer(10329),
      latinname == "protoperidinium_paulseni" ~ as.integer(3568),
      latinname == "scrippsiella_favionese" ~ as.integer(10537),
      latinname == "tetrastrum_caudatum" ~ as.integer(5691),
      latinname == "didymocystis" ~ as.integer(5810),
      latinname == "lauterborniella_elegantissima" ~ as.integer(6097),
      latinname == "characium_sp." ~ as.integer(5756),
      latinname == "cylindrospermopsis_sp." ~ as.integer(203689),
      latinname == "chaetoceros_neogracilis" ~ as.integer(1004011),
      latinname == "navicula_retusa_cancellata" ~ as.integer(1020372),
      latinname == "karlodinium_micrum" ~ as.integer(180904),
      latinname == "lagerheimia" ~ as.integer(6017),
      latinname == "quadricoccus_euryhalinicus" ~ as.integer(957939),
      latinname == "scrippsiella_precaria" ~ as.integer(10536),
      latinname == "psuedosolenia_calcar-avis" ~ as.integer(970064),
      latinname == "centronella" ~ as.integer(970064),
      latinname == "amphidinium_tatrae" ~ as.integer(9997),
      latinname == "navicula_lata" ~ as.integer(4450),
      latinname == "nitzschia_vitrea_recta" ~ as.integer(5204),
      latinname == "rhaphoneis_gemmifera" ~ as.integer(3145),
      latinname == "delphineis_surirella" ~ as.integer(969978),
      latinname == "navicula_annulata" ~ as.integer(3649),
      latinname == "proboscia_alata_gracillima" ~ as.integer(610099),
      latinname == "guinardia_striata" ~ as.integer(2921),
      latinname == "guinardia_cylindrus" ~ as.integer(2921),
      latinname == "aphanizomenon_issatschenkoi" ~ as.integer(1191),
      latinname == "helicotheca_tamesis" ~ as.integer(590815),
      latinname == "corethron_valdivae" ~ as.integer(2386),
      latinname == "gonyaulax_conjuncta" ~ as.integer(10359),
      latinname == "lioloma_delicatulum" ~ as.integer(573597),
      latinname == "syracosphaera_histrica" ~ as.integer(2234),
      latinname == "rhizosolenia_formosa" ~ as.integer(2879),
      latinname == "proboscla_alata_curvirostris" ~ as.integer(610099),
      latinname == "membraneis_challengeri" ~ as.integer(3648),
      latinname == "chrysococcus_tesselatus" ~ as.integer(1751),
      latinname == "rhoicosphenia_abbreviata" ~ as.integer(3633),
      latinname == "protoperidinium_aciculiferum" ~ as.integer(10340),
      latinname == "protoperidinium_fimbriatum" ~ as.integer(10340),
      latinname == "licmophora_inflata" ~ as.integer(3155),
      latinname == "biddulphia_reticulata" ~ as.integer(2678),
      latinname == "caloneis_lepidula" ~ as.integer(4369),
      latinname == "caloneis_trinodis" ~ as.integer(4369),
      latinname == "amphiprora_cholnokyi" ~ as.integer(4674),
      latinname == "navicula_interrupta" ~ as.integer(3649),
      latinname == "cerataulus_radiatus" ~ as.integer(2709),
      latinname == "gyrosigma_balticum_silimis" ~ as.integer(4623),
      latinname == "dictyocha_siderea" ~ as.integer(1804),
      latinname == "odontella_alternans" ~ as.integer(573604),
      latinname == "nitzschia_vitrea_salinarum" ~ as.integer(5204),
      latinname == "proboscla_alata_indica" ~ as.integer(610099),
      latinname == "attheya_decora" ~ as.integer(2876),
      latinname == "synedra_closterioides" ~ as.integer(970065),
      latinname == "trinacria_regina" ~ as.integer(2747),
      latinname == "chattonella" ~ as.integer(969917),
      latinname == "chattonella_subsalsa" ~ as.integer(969917),
      latinname == "heterosigma_akashiwo" ~ as.integer(969917),
      latinname == "vibrio_fisheri" ~ as.integer(959178),
      TRUE ~ as.integer(tsn)
    )
  )

Export the phytoplankton taxonomic counts (phyto.df) as a CSV file to “data” folder in this project directory. Using a combination of dir.create(), file.path(), and project.dir(created using rprojroot::find_rstudio_root_file() in Getting Started) the necessary folder structure in the project directory will be created, if it does not already exist.

dir.create(file.path(project.dir, "data/phytoplankton"),
           recursive = TRUE, showWarnings = FALSE)

data.table::fwrite(phyto.df, file.path(rprojroot::find_rstudio_root_file(), "data/phytoplankton", "cedr_phyto_taxa.csv"))

Download Monitoring Event

Download all of the monitoring event data associated with the collection of phytoplankton from the CEDR API.

event.df <- file.path(url.root,
                      "LivingResources",
                      "TidalPlankton",
                      "MonitorEvent",
                      "1-01-1970",
                      todays.date, 
                      "17",
                      "Station",
                      paste(station.vec, collapse = ",")) %>%
  fromJSON() %>% 
  clean_up()

Export the phytoplankton event data (event.df) as a CSV file to “data” folder in this project directory. Using a combination of dir.create(), file.path(), and rprojroot::find_rstudio_root_file() the necessary folder structure in the project directory will be created, if it does not already exist.

dir.create(file.path(rprojroot::find_rstudio_root_file(), "data/phytoplankton"),
           recursive = TRUE, showWarnings = FALSE)

data.table::fwrite(event.df, file.path(rprojroot::find_rstudio_root_file(), "data/phytoplankton", "cedr_phyto_event.csv"))

Download Station Information

Download all of the station data associated with the collection of phytoplankton from the CEDR API.

station.df <- file.path(url.root,
                        "LivingResources",
                        "TidalPlankton",
                        "Station",
                        "station",
                        paste(station.vec, collapse = ",")) %>%
  fromJSON() %>% 
  clean_up()

Export the phytoplankton station data (station.df) as a CSV file to “data” folder in this project directory. Using a combination of dir.create(), file.path(), and rprojroot::find_rstudio_root_file() the necessary folder structure in the project directory will be created, if it does not already exist.

dir.create(file.path(rprojroot::find_rstudio_root_file(), "data/phytoplankton"),
           recursive = TRUE, showWarnings = FALSE)

data.table::fwrite(station.df, file.path(rprojroot::find_rstudio_root_file(), "data/phytoplankton", "cedr_phyto_station.csv"))

Download Water Quality Data

Download chlorophyll a, dissolved organic carbon, and pheophytin water quality data collected from the same stations as the phytoplankton data. The minimum and maximum dates found in the phytoplankton taxonomic count data (phyto.df$SampleDate) are used to limit the water quality query to reduce the amount of unwanted data and to speed up the download. Three days is subtracted from the minimum date and three days are added to the maximum date becuase water quality data collected within ± 3 day window (see 3-Day Window for more details).

wq.df <- file.path(url.root,
                   "WaterQuality",
                   "WaterQuality",
                   format(min(phyto.df$sampledate) - days(3), "%m-%d-%Y"),
                   format(max(phyto.df$sampledate) + days(3), "%m-%d-%Y"), 
                   "6",
                   "7,16,23,24",
                   "station",
                   paste(station.vec, collapse = ","),
                   "21,34,74") %>% 
  fromJSON() %>% 
  clean_up()

Export the water quality data (wq.df) as a CSV file to “data” folder in this project directory. Using a combination of dir.create(), file.path(), and rprojroot::find_rstudio_root_file() the necessary folder structure in the project directory will be created, if it does not already exist.

dir.create(file.path(rprojroot::find_rstudio_root_file(), "data/water_quality"),
           recursive = TRUE, showWarnings = FALSE)

data.table::fwrite(wq.df, file.path(rprojroot::find_rstudio_root_file(), "data/water_quality", "cedr_wq.csv"))

Remove data objects that are no longer useful to clean up the global environment.

rm(run.cedr.acquisition, url.root, todays.date, station.vec, phyto.df, event.df, station.df, wq.df)

ITIS Data

The Integrated Taxonomic Information System (ITIS) database is a useful resource for obtaining taxonomic hierarchy information, which is necessary to compute many of the phytoplankton community metrics. The majority of taxa observed in the Chesapeake Bay and its tributaries can be found in the database; however, there are a number of taxa that require edits or are missing entirely from the database (see Download Phytoplankton Data). The following script uses package ritis to access the ITIS API and download the necessary taxonomic hierarchy information (Chamberlain 2017). The ITIS API and a simple way of querying the API with ritis makes this document more robust and dynamic. Anyone interested party should be able to download this documents R Markdown file from GitHub and run this script to access the ITIS database. As long as there are no changes to the API, this script can be used to access the ITIS data in subsequent years and will incorporate any subsequent changes to the ITIS database.

It is not necessary to download the entire ITIS database, we only need the taxonomic hierarchy information for the phytoplankton reported in the CEDR database. Phytoplankton taxonomic count data obtained from CEDR (Download Phytoplankton Data) are imported. The unique list of reported taxa, represented by TSN, will be used to query the ITIS database.

col.class.vec <- c("samplenumber" = "character",
                   "tsn" = "character",
                   "speccode" = "character",
                   "reportingvalue" = "integer")

bay.temp <- data.table::fread(file.path(project.dir, "data/phytoplankton/cedr_phyto_taxa.csv"),
                            data.table = FALSE,
                            colClasses = col.class.vec)

The phytoplankton reported in the CEDR database are used to query the the ITIS database and extract all of the taxonomic information related to each reported taxon. The ritis function, usage(), is used to check if there is any information for the specified taxon (final_tsn); if there is no information related to the specified taxon (final_tsn), then an empty data frame is returned. CEDR assigns taxa that are not found in the ITIS database a dumby TSN of zero, which does not exist in the ITIS database. Any query of the ITIS API with a TSN equal to zero will return an error; the return of an empty data frame prevents this error and can be used to check which TSNs were not found in the ITIS database (i.e., all taxonomic columns are filled with NA). If there is information related to the specified taxon (final_tsn), then usage() is used to check if the specified TSN is “accepted” or “valid”. If the TSN is not “accepted” or “valid”, then ritis::accepted_names() is used to updated the TSN that will be used to query the ITIS database (tsn.accepted). The ritis function, hierarchy_full(), is then used to pull all of the hierarchy information upstream of the specified taxon (i.e. Kingdom to the specified taxons rank) and immediate children of the specified taxon (e.g. if the specified taxon represents Family-level, then the children might represent all Subfamilies below the specified taxon). For the purposes of this document, the immediate children of the specified taxon are not need and are filtered out of the data. The orginal reported TSN in CEDR (org_tsn), the accepted TSN (final_tsn), and the accepted taxonomic name of accepted TSN (final_id) are extracted as individual columns. These columns are used as a grouping feature to transform the data frame from a long format, where the taxonomic ranks are represented as a single column and taxonomic names are represented as a single column, to a wide format, where the taxonomic ranks represent the column names and the taxonomic names fill in each row respective to the appropriate taxonomic rank column. Finally, clean_up() is applied.

hier.wide <- unique(bay.temp$final_tsn) %>% 
  lapply(function(tsn.i) {
 #print(tsn.i)
  
  if(ncol(usage(tsn.i)) == 0) return(data.frame(org_tsn = as.integer(tsn.i),
                                                stringsAsFactors = FALSE))
  if(!ritis::usage(tsn.i)$taxonUsageRating %in% c("accepted", "valid")) {
    tsn.accepted <- ritis::accepted_names(as.integer(tsn.i))$acceptedTsn
  } else {
    tsn.accepted <- as.integer(tsn.i)
  }
  
  full.df <- ritis::hierarchy_full(tsn.accepted) %>% 
    select(rankname, taxonname, tsn) %>% 
    slice(1:which(tsn == tsn.accepted)) %>% 
    mutate(org_tsn = as.integer(tsn.i),
           final_tsn = as.integer(tsn.accepted),
           final_id = slice(., which(tsn == tsn.accepted))$taxonname)
}) %>% 
  bind_rows() %>% 
  select(-tsn) %>% 
  spread(rankname, taxonname) %>% 
  clean_up()

A vector of column names is created to sort the columns. Not all of the names in column.vec are found in hier.wide and are filtered out in the last line. This is purposefully written to be more robust; future data acquisitions may contain different taxonomic ranks due to ITIS updates and this script will capture these changes.

column.vec <- c("org_tsn", "final_tsn", "final_id",
                "kingdom", "subkingdom", "infrakingdom", 
                "superdivision", "division", "subdivision", "infradivision",
                "superphylum", "phylum", "subphylum", "infraphylum", 
                "superclass", "class", "subclass", "infraclass", 
                "superorder", "order", "suborder", "infraorder", 
                "superfamily", "family", "subfamily", 
                "tribe", "subtribe", 
                "genus", "subgenus", 
                "species", "subspecies")

column.vec <- column.vec[column.vec %in% names(hier.wide)]

The columns are sorted according to column.vec.

hier.wide <- hier.wide[, column.vec]

Export the taxonomic hierarchy information (hier.wide) as a CSV file to “data” folder in this project directory. Using a combination of dir.create(), file.path(), and rprojroot::find_rstudio_root_file() the necessary folder structure in the project directory will be created, if it does not already exist.

dir.create(file.path(project.dir, "data/itis"),
           recursive = TRUE, showWarnings = FALSE)

data.table::fwrite(hier.wide, file.path(rprojroot::find_rstudio_root_file(), "data/itis", "itis_hierarchy.csv"))

Remove data objects that are no longer useful to clean up the global environment.

rm(col.class.vec, bay.temp, hier.wide, column.vec)

Prepare Taxonomic Hierarchy

This section joins the phytoplankton data obtained from CEDR (CEDR Data) and the taxonomic hierarcy information data obtained from ITIS (ITIS Data).

Taxonomic Data

Phytoplankton taxonomic count data obtained from CEDR (Download Phytoplankton Data) is imported.

col.class.vec <- c("samplenumber" = "character",
                   "tsn" = "character",
                   "speccode" = "character",
                   "reportingvalue" = "integer")

taxa.raw <- data.table::fread(file.path(project.dir, "data/phytoplankton/cedr_phyto_taxa.csv"),
                            data.table = FALSE,
                            colClasses = col.class.vec,
                            na.strings = "") %>% 
  mutate(sampledate = as.Date(sampledate))

The PIBI was developed for the taxa found above the pycnocline. Therefore, rows where the layer column does not specifies above the pycnocline (“ap”) or water column (“wc”) are removed. Furthermore, (Lacouture et al. 2006) (page 599) developed indices for data collected in the spring (March-May) and summer (July-September).

bay.df <- taxa.raw %>% 
  filter(layer %in% c("ap", "wc")) %>% 
  distinct() %>% 
  mutate(month = month(sampledate),
         season = case_when(
           month %in% c(3, 4, 5) ~ "spring",
           month %in% c(7, 8, 9) ~ "summer",
           TRUE ~ "remove"
         )) %>% 
  filter(season %in% c("spring", "summer"))

If there are multiple taxa with the same taxonomic name, then the reporting values should be summed. Each row should represent a unique taxon.

bay.df <- bay.df %>% 
  group_by_at(vars(-reportingvalue)) %>% 
  summarize(reportingvalue = sum(reportingvalue)) %>% 
  ungroup()

Only phytoplankton data (“ph”) is retained for analysis. Additionally, some taxa were not identified to any useful taxonomic rank and were excluded from the analysis.

bay.df <- bay.df %>% 
  # Removes methods with prefix "BE" (Tidal Benthic Taxa Enumeration) or blanks.
  filter(grepl("ph", method),
         !latinname %in% c("micro-phytoflagellates",
                           "microflagellates",
                           #"green_cells",
                           #"blue_green_sphere",
                           "epiphytic_flagellates",
                           "hydrodictyon_reticulatum"))

Taxonomic Hierarchy

Taxonomic hierarchy data obtained from ITIS (ITIS Data) is imported.

hier.wide <- data.table::fread(file.path(project.dir, "data/itis/itis_hierarchy.csv"),
                            data.table = FALSE,
                            na.strings = "") %>% 
  clean_up()

The taxonomic counts are joined with the taxonomic hierarchy data by the final_tsn columns. A unique identifier (unique_id) is created for each sampling event by concatenating station, layer, samplenumber, and sampledate.

bay.df <- left_join(bay.df, hier.wide, by = c("final_tsn" = "org_tsn")) %>% 
  mutate(unique_id = paste(station, layer, samplenumber, sampledate, sep = "_"))

Some taxa require specific attention before they can be finalized.

bay.df <- bay.df %>% 
  mutate(
    final_id = if_else(latinname == "trinacria_regina", "trinacria_regina",  final_id),
    final_id = case_when(
      latinname == "green_cells" ~ "green_cells",
      latinname == "blue_green_spheres" ~ "blue_green_spheres",
      TRUE ~ final_id
    ),
    kingdom = if_else(latinname == "trinacria_regina",
                      "chromista", kingdom),
    phylum = if_else(latinname == "trinacria_regina",
                     "bacillariophyta", phylum),
    subphylum = if_else(latinname == "trinacria_regina",
                        "bacillariophytina", subphylum),
    class = if_else(latinname == "trinacria_regina",
                    "mediophyceae", class),
    subclass = if_else(latinname == "trinacria_regina",
                       "chaetocerotophycidae", subclass),
    order = if_else(latinname == "trinacria_regina",
                    "hemiaulales", order),
    family = if_else(latinname == "trinacria_regina",
                     "hemiaulaceae", family),
    genus = case_when(latinname == "trinacria_regina" ~ "trinacria",
                      latinname == "didymocystis" ~ "didymocystis",
                      latinname == "Lauterborniella_elegantissima" ~ "lauterborniella",
                      TRUE ~ genus),
    species = case_when(latinname == "trinacria_regina" ~ "trinacria_regina",
                        latinname == "lauterborniella_elegantissima" ~ "lauterborniella_elegantissima",
                        TRUE ~ species)
  )

Remove all unnecessary objects from the environment.

rm(col.class.vec)

test <- bay.df %>% 
  filter(is.na(kingdom))

Prepare Carbon Assignments

A large but incomplete table of taxa size and carbon values is imported from the project data directory.

carbon.df <- readxl::read_excel(file.path(project.dir, "data/carbon/carbon_list_2014.xlsx"),
                                sheet = "carbon_list_2014") %>% 
  clean_up()

Some of the size categories are standardized to make it possible to accurately join carbon.df and bay.df. “unidentified” is converted to NA. Both carbon.df and bay.df have inconsistent representation of “species” (i.e. “species”, “species_”, “species_#”, “sp.”, and “sp#”): therefore, “species” was standardized to “sp#”.

carbon.df <- carbon.df %>% 
  mutate(size = case_when(
    size %in% "unidentified" ~ as.character(NA),
    str_detect(size, "species|sp.") ~ str_replace_all(size, "sp.#|species_#|species_|species", "sp#"),
    TRUE ~size
  ),
  latinname = if_else(latinname == "blue_green_trichome", "blue_green_sphere", latinname)) %>% 
  mutate(picoplankton = if_else(
    diameter <= 20 &
      height <= 20 &
      width <= 20 &
      width <= 20 &
      height_2 <= 20 &
      width_2 <= 20,
    TRUE,
    FALSE
  )) %>% 
  select(latinname, size, old_carbon) %>%
  distinct() %>% 
  bind_rows(data.frame(latinname = "asterionellopsis_glacialis", 
                       old_carbon = carbon.df[carbon.df$latinname == "asterionellopsis_kariana", "old_carbon"],
                       stringsAsFactors = FALSE))

Check for taxa with duplicate carbon values. If a taxon have two or more carbon values, then when carbon.df is joined with bay.df it will duplicate the taxon’s reporting value.

carbon.dups <- carbon.df %>% 
  group_by(latinname, size) %>% 
  mutate(count = n()) %>% 
  ungroup() %>% 
  filter(count > 1)

Taxa with duplicate carbon values: biddulphia, chaetoceros_wighami, gymnodinium. One of the “chaetoceros_wighami” old_carbon values is NA, and therefore, is excluded. The selected old_carbon values for “biddulphia” and “gymnodinium” appear to be the values used by (Lacouture et al. 2006).

carbon.df <- carbon.df %>% 
  filter(!(latinname == "chaetoceros_wighami" & is.na(old_carbon)),
         !(latinname == "biddulphia" & old_carbon == 7899.50), # %>%  #,
         !(latinname == "gymnodinium" & old_carbon == 848))
#  mutate(size = case_when(
#    is.na(size) & latinname == "gymnodinium" & old_carbon == 848 ~ "msu",
#    is.na(size) & latinname == "gymnodinium" & old_carbon == 43.900 ~ "odu",
#    TRUE ~ size
#    ))

Convert string values “not_applicable” and “not_specified” in the size column to NA. Also, some “gymnodinium” had a reported size of “cell-_small“, which had no equivalent size category in carbon.df; therefore,”cell-_small" is replaced with NA. Finally, as was done above for carbon.df, strings representing “species” were standardized to “sp#”.

bay.df <- bay.df %>% 
  mutate(size = case_when(
    size %in% c("not_applicable", "not_specified") ~ as.character(NA),
    size == "cell-_small" & latinname == "gymnodinium" ~ as.character(NA),
    str_detect(size, "species|sp.") ~ str_replace_all(size, "sp.#|species_#|species_|species", "sp#"),
    TRUE ~ size
  ))# %>% 
#  mutate(size = case_when(
#        is.na(size) & latinname == "gymnodinium" & str_detect(source, "msu") ~ "msu",
#    is.na(size) & latinname == "gymnodinium" & str_detect(source, "odu") ~ "odu",
#    TRUE ~ size
#  ))

Identify all taxa (latinname) in bay.df that do not have a size equivalent in carbon.df.

no.match.df <- anti_join(bay.df, carbon.df, by = c("latinname", "size")) %>% 
  select(latinname, size) %>% 
  distinct()

Attempt to find partial matches for taxa in no.match.df. Each row in no.match.df represents a unique taxon and size combination. lapply loops through each row, extracts the unique size string, splits the string into smaller pieces, and attempts to find partial matches with associated size strings in carbon.df. The string in carbon.df with the most partial matches will be used to replace the size string in bay.df.

partial.match.df <- lapply(1:nrow(no.match.df), function(row.i) {
  sub.df <- no.match.df %>% 
    slice(row.i)
  
  size.vec <- str_split(sub.df$size, "_") %>% unlist()
  
  getmode <- function(v) {
    uniqv <- unique(v)
    uniqv[which.max(tabulate(match(v, uniqv)))]
  }
  
  
  carbon.sub <- carbon.df %>% 
    filter(latinname == sub.df$latinname) %>% 
    mutate(row_max_matches = sapply(size.vec, function(size.i) grep(size.i, size)) %>% 
             unlist() %>% 
             getmode()) %>% 
    slice(unique(row_max_matches)) %>% 
    select(-row_max_matches, -old_carbon) %>% 
    rename(new_size = size) %>% 
    mutate(size = sub.df$size)
  
  return(carbon.sub)
}) %>% 
  bind_rows()

Merge bay.df with partial.match.df and if a new_size is present from partial.match.df, then replace the size string with the new_size string. By updating the size strings it is possible to obtain more matches when merging bay.df and carbon.df.

bay.df <- left_join(bay.df, partial.match.df, by = c("latinname", "size")) %>% 
  mutate(reported_size = size,
         size = if_else(!is.na(new_size), new_size, size))

Merge bay.df with carbon.df and calculate biomass. The phytoplankton counts are reported in picoliters. Biomass is converted to microliters.

bay.df <- left_join(bay.df, carbon.df, by = c("latinname", "size")) %>% 
  mutate(biomass = reportingvalue * old_carbon / 10^6)

Prepare Event Data

Phytoplankton sampling event data obtained from CEDR (Download Monitoring Event) is imported.

events.df <- data.table::fread(file.path(project.dir, "data/phytoplankton/cedr_phyto_event.csv"),
                            data.table = FALSE,
                            na.strings = "")

Some of the salinity zones (salzone) differ from the categories in (Lacouture et al. 2006). Salinity zones are converted to one of four categories: “f” (freshwater), “m” (mesohaline), “o” (oligohaline), and “p” (polyhaline).

events.df <- events.df %>% 
  mutate(sampledate = as.Date(sampledate),
         salzone = case_when(
           salzone %in% c("tf", "fe") ~ "f",
           salzone == "me" ~ "m",
           salzone == "oe" ~ "o",
           salzone == "pe" ~ "p",
           TRUE ~ salzone
         ))

Retain only unique information regarding the data provider (source), station name (station), sample date (sampledate), water layer (layer), and salinity zone (salzone).

events.sub <- events.df %>% 
  select(source, station, sampledate, layer, salzone) %>% 
  distinct()

Import the salinity zone used by Jacqueline M. Johnson of the Chesapeake Bay Program. This will provide a check of salinity zone classification from 10 years ago to the salinity zone classification in CEDR.

old.salzone <- readxl::read_excel(file.path(project.dir, "data/jackie_data/Data 2013_4plus-phyto-metrics.xlsx"),
                         sheet = "DATA_4+biometrics",
                         skip = 1) %>% 
  clean_up() %>% 
  select(station, sample_date, ibi_salzone) %>% 
  mutate(sample_date = as.Date(sample_date)) %>% 
  rename(old_salzone = ibi_salzone,
         sampledate = sample_date) %>% 
  distinct()

Append the salinity zone defined by Jacqueline M. Johnson (old_salzone) to wq.df.

events.sub <- inner_join(events.sub, old.salzone, by = c("station", "sampledate"))

Identify rows where the salinity zone specified by CEDR (salzone) differs from the salinity zone specified by Jacqueline M. Johnson (old_salzone).

events.sub <- events.sub %>% 
  mutate(salzone_disagree = if_else(salzone != old_salzone, TRUE, FALSE))

Summarize and plot the differences in specified salinity zones.

salzone.diff <- events.sub %>% 
  select(station, sampledate, source, salzone, old_salzone, salzone_disagree) %>% 
  unite(sal_group, salzone, old_salzone) %>% 
  distinct() %>% 
  filter(salzone_disagree == TRUE) %>% 
  group_by(sal_group) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  arrange(count) %>% 
  mutate(sal_group = factor(sal_group, levels = unique(sal_group))) %>% 
  ggplot(aes(sal_group, count, fill = count)) + 
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  geom_bar(stat = "identity")

To be consitent with the data used to develope the PIBI, salinity zones specified by Jacqueline M. Johnson (old_salzone) were favored when salinity zones differed, therefore, salzone is replaced by old_salzone when salinity zones differed.

events.sub <- events.sub %>% 
  rename(cedr_salzone = salzone) %>% 
  mutate(salzone = if_else(salzone_disagree == TRUE, old_salzone, cedr_salzone))

Join bay.df and events.sub to combine phytoplankton count data with event data.

bay.df <- left_join(bay.df, events.sub, by = c("source", "station",
                                               "sampledate", "layer")) %>% 
  mutate(unique_id = paste(unique_id, season, salzone, sep = "_"))

Prepare Station Data

Identify any stations in events.df that are not found in bay.df.

missing.taxa<- anti_join(events.df, bay.df, by = "station")

Phytoplankton station data obtained from CEDR (Download Station Information) is imported and clean_up() applied.

station.df <- data.table::fread(file.path(project.dir, "data/phytoplankton/cedr_phyto_station.csv"),
                            data.table = FALSE,
                            na.strings = "")

leaflet is used to provide a visual reference and check of the data. The stations with event data are plotted as blue circles and the stations without event data are plotted as red circles. jitter() is used to alter the points position slightly, to prevent event data for the same station from completely overlapping one another.

library(leaflet)
events.sub <- events.df %>% 
  select(catalogingunitdescription, source, latitude, longitude, station) %>% 
  filter(!is.na(latitude)) %>% 
  mutate(no_taxa_data = station %in% unique(missing.taxa$station),
         no_taxa_data = factor(no_taxa_data)) %>% 
  distinct() %>% 
  mutate(longitude = jitter(longitude, amount = 0.005),
         latitude = jitter(latitude, amount = 0.005))

factpal <- colorFactor(c("blue", "red"), events.sub$no_taxa_data)

leaflet(events.sub) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircleMarkers( ~longitude, ~latitude,
                    color = ~factpal(no_taxa_data),
                    stroke = FALSE,
                    fillOpacity = 0.5,
                    popup = paste("Station:", events.sub$station, "<br/>",
                                  "Description:", events.sub$catalogingunitdescription, "<br/>",
                                  "Source:", events.sub$source, "<br/>",
                                  "Latitude:", events.sub$latitude, "<br/>",
                                  "Longitude:", events.sub$longitude),
                    popupOptions = popupOptions(style = list(color = "black")))

Prepare Water Quality Data

Phytoplankton sampling event data obtained from CEDR (Download Monitoring Event) is imported.

wq.df <- data.table::fread(file.path(project.dir, "data/water_quality/cedr_wq.csv"),
                            data.table = FALSE,
                           na.strings = c(""))

Convert the sample date (sampledate) to class date.

wq.df <- wq.df %>% 
  mutate(sampledate = as.Date(sampledate))

Extract the salinity zone (salzone) found in bay.df.

bay.salzone <- bay.df %>% 
  select(station, sampledate, salzone) %>% 
  rename(sampledate = sampledate) %>% 
  distinct()

Append the salinity zone (salzone) to wq.df.

wq.df <- inner_join(wq.df, bay.salzone, by = c("station", "sampledate"))

Processing

The phytoplankton indices only utilize measures of Dissolved Organic Carbon (DOC), chlorophyll a, and pheophytin. Therefore, only these water quality parameters are retained.

wq.df <- wq.df %>% 
  filter(is.na(problem)) %>% 
  unite(parameter, layer, parameter) %>% 
  filter(str_detect(parameter, "chla|pheo|doc"))

The stations with water quality data are plotted using leaflet, to provide a visual reference and check of the data.

stations.df <- wq.df %>% 
  dplyr::select(station, agency, source, latitude, longitude) %>% 
  distinct()

leaflet(stations.df) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>%  
  addCircleMarkers( ~longitude, ~latitude,
                    stroke = FALSE,
                    fillOpacity = 0.5,
                    popup = paste("Station:", stations.df$station, "<br/>",
                                  "Agency:", stations.df$agency, "<br/>",
                                  "Source:", stations.df$source, "<br/>",
                                  "Latitude:", stations.df$latitude, "<br/>",
                                  "Longitude:", stations.df$longitude))

The following chunks of code summarize the water quality data according the the Methods section in (Buchanan et al. 2005) (page 139). The water quality data is divided into separate data frames, manipulated accordingly, and then appended back together.

Surface chlorophyll a (“s_chla”) is extracted as a separate data frame and the summarized to represent the mean value for each station, replicate type, and date.

wq.s_chla <- wq.df %>% 
  select(station, samplereplicatetype, sampledate, parameter, measurevalue) %>% 
  filter(parameter == "s_chla") %>% 
  distinct() %>% 
  select(-samplereplicatetype) %>% 
  group_by_at(vars(-measurevalue)) %>% 
  summarize(measurevalue = mean(measurevalue, is.na = TRUE)) %>% 
  ungroup()

If the upper pycnocline depth is not specified, then the mean for each parameter (i.e. chlorophyll a, pheophytin, DOC) is found using samples from the entire water column (Buchanan et al. 2005) (page 139).

wq.sub.tf <- wq.df %>% 
  filter(salzone == "f") %>% 
  #filter(is.na(upperpycnocline)) %>% 
  #filter(startsWith(station, "f")) %>% 
  select(station, samplereplicatetype, sampledate, parameter, measurevalue) %>% 
  filter(grepl("chla|pheo|doc", parameter)) %>% 
  distinct() %>% 
  mutate(parameter = case_when(
    grepl("chla", parameter) ~ "chla",
    grepl("pheo", parameter) ~ "pheo",
    grepl("doc", parameter) ~ "doc",
    TRUE ~ "ERROR"
  )) %>% 
  select(-samplereplicatetype) %>% 
  group_by_at(vars(-measurevalue)) %>% 
  summarize(measurevalue = mean(measurevalue)) %>% 
  ungroup()

If the upper pycnocline depth is specified, then the mean for each parameter (i.e. chlorophyll a, pheophytin, DOC) is found using samples specified as above the pycnocline (“ap”) and surface (“s”) (Buchanan et al. 2005) (page 139).

wq.sub.pycno <- wq.df %>% 
  filter(salzone != "f",
    #!is.na(upperpycnocline),
         grepl("ap_|s_", parameter)) %>% 
  #filter(!startsWith(station, "f")) %>% 
  select(station, samplereplicatetype, sampledate, parameter, measurevalue) %>% 
  distinct() %>% 
  mutate(parameter = case_when(
    parameter %in% c("ap_chla", "s_chla") ~ "chla",
    parameter %in% c("ap_pheo", "s_pheo") ~ "pheo",
    parameter %in% c("ap_doc", "s_doc") ~ "doc",
    TRUE ~ "ERROR"
  )) %>% 
  select(-samplereplicatetype) %>% 
  group_by_at(vars(-measurevalue)) %>% 
  summarize(measurevalue = mean(measurevalue)) %>% 
  ungroup()

The separate water quality data frames are appended together.

wq.sub <- bind_rows(wq.s_chla, wq.sub.tf, wq.sub.pycno) %>% 
  rename(date = sampledate)

3-Day Window

In some case the water quality data was not collected on the same day as the phytoplankton data. To obtain more phytoplankton sampling events with associated water quality data, (Lacouture et al. 2006) used water quality data collected within ± 3 days of the phytoplankton.

bay.df is subset to only represent unique combinations of station and sampledate.

bay.sub <- bay.df %>% 
  select(station, sampledate) %>% 
  distinct() %>% 
  mutate(lower_date = sampledate - lubridate::days(3),
         upper_date = sampledate + lubridate::days(3))

The process to identify water quality dates within the ± 3 day window of the phytoplankton sampling date can be lengthy. The parallel package is used again to speed up the process.

library(parallel)
n.cores <- detectCores() - 1
cl <- makeCluster(n.cores)
clusterExport(cl = cl, varlist = c("wq.sub", "bay.sub"))
clusterEvalQ(cl, c(library(dplyr))) %>% invisible()

For each station and sample date combination (each row) in bay.sub, all of the wq.sub samples, from the same station, within a ± 3 day window are found. If there are multiple wq.sub samples within the ± 3 day window, then the sample collected closest to the phytoplankton data is retained. There is a possibility that water quality samples could be collected at equal intervals before and after phytoplankton sampling. For example, water quality samples could have been collected one day before and one day after phytoplankton sampling. In those instances the sample collected prior to the phytoplankton sampling is retained.

env.df <- parLapply(cl, 1:nrow(bay.sub), function(row.i) {

  sub.df <- slice(bay.sub, row.i)
  #----------------------------------------------------------------------------
  sub.env <- wq.sub %>% 
    filter(station == sub.df$station,
           date >= sub.df$lower_date,
           date <= sub.df$upper_date)
  #----------------------------------------------------------------------------
  if (nrow(sub.env) == 0) return(data.frame(
    station = NA,
    date = NA,
    parameter = NA,
    measurevalue = NA
  ))
  #----------------------------------------------------------------------------
  final.df <- sub.env %>% 
    mutate(date_diff = date - sub.df$sampledate,
           abs_date_diff = abs(date_diff),
           sampledate = sub.df$sampledate) %>% 
    filter(abs_date_diff == min(abs_date_diff))
  #----------------------------------------------------------------------------
  if (nrow(final.df) > 1) {
    final.df <- final.df %>% 
      filter(date == min(date))
  }
  #----------------------------------------------------------------------------
  return(final.df)
}) %>% 
  bind_rows() %>% 
  filter(!is.na(station))

stopCluster(cl)
#closeAllConnections()

The water quality data is transformed from a long data format to a long data format.

env.wide <- env.df %>% 
  spread(parameter, measurevalue)

s_shla and pheo are given more descriptive names, surface_chla and pheophytin, respectively. Additionally, the script ensures that surface_chla, pheophytin, and doc are all numeric values. Finally, the columns are subset to prepare to join with bay.df.

env.wide <- env.wide %>% 
  mutate(
    surface_chla = if_else(!is.na(s_chla), s_chla, as.numeric(NA)),
    pheophytin = if_else(!is.na(pheo), pheo, as.numeric(NA)),
    doc = if_else(!is.na(doc), doc, as.numeric(NA))
    ) %>% 
  select(station, sampledate, surface_chla, chla, pheophytin, doc)

Join bay.df with env.wide to combine phytoplankton count data with water quality data.

bay.df <- left_join(bay.df, env.wide, by = c("station", "sampledate"))

Remove objects that are no longer necessary.

#rm(env.df, env.wide, wq.df, wq.sub, bay.sub)

Metric Calculation

Subset bay.df to just represent the columns necessary for metric calculations. Samples without specified salinity zones (salzone) are excluded.

bay.taxa <- bay.df %>% 
  select(unique_id, sampledate, season, salzone,
         surface_chla, chla, pheophytin, doc,
         reportingvalue, biomass, latinname, final_id,
         division, phylum, class, species)  %>% 
  filter(!is.na(salzone)) %>% 
  distinct()

The majority of the metrics are calculated using the Multi-Metric Index R-package I am developing, mmir. mmir is installed and loaded in the Introduction. All of the metrics are calculated for each unique sampling event (unique_id), even though not every phytoplankton index uses the same metrics. This is down to simplify the process and the filter of metrics to calculate the IBI scores is done during the scoring section.

metrics.df <- bay.taxa %>% 
  select(unique_id, surface_chla, chla, doc, pheophytin) %>% 
  distinct() %>% 
  mutate(total_phyto_biomass = taxa_abund(bay.taxa, unique_id, biomass, species),
         total_phyto_biomass_chla_ratio = total_phyto_biomass / chla,
         pct_cryptophyte = taxa_pct(bay.taxa, unique_id, biomass, division, "cryptophycophyta"),
         cyanophyte_biomass = taxa_abund(bay.taxa, unique_id, biomass, phylum, "cyanobacteria"), 
         diatom_biomass = taxa_abund(bay.taxa, unique_id, biomass, class, "bacillariophyceae"),
         dinoflagellate_biomass = taxa_abund(bay.taxa, unique_id, biomass, division, "pyrrophycophyta"),
         scrippsiella_precaria_biomass = taxa_abund(bay.taxa, unique_id,
                                                    biomass, latinname, "scrippsiella_precaria"),
         dinoflagellate_biomass = dinoflagellate_biomass - scrippsiella_precaria_biomass,
         microcystis_aeruginosa_abundance = taxa_abund(bay.taxa, unique_id,
                                                       reportingvalue, species, "microcystis_aeruginosa"),
         #picoplankton_abundance = taxa_abund(bay.taxa, unique_id, reportingvalue, picoplankton, TRUE),
         prorocentrum_minimum_abundance = taxa_abund(bay.taxa, unique_id,
                                                     reportingvalue, species, "prorocentrum_minimum")
         ) %>% 
  select(-scrippsiella_precaria_biomass)

metrics.df is converted to a long data format, which will make it easier to score the metrics in the following section.

metrics.long <- metrics.df %>% 
  gather(metric, value, "surface_chla":"prorocentrum_minimum_abundance")

Score and Rate Metrics

Scoring Functions

The code below seems like a lot of duplication but rating thresholds for each index are very specific and sometimes even the same metric used in two different indices is not scored using the same logic. For example, in the spring freshwater index total biomass (total_phyto_biomass) is scored as either a 1, 3, or 5, while in the spring mesohaline index total biomass (total_phyto_biomass) is scored as either a 1 or NA. Due to these discrepancies, I have chosen to create a separate function specific to each phytoplankton index.

Each of the functions follows the same logic. metrics.long is filtered to only represent a single season and salinity zone (salzone). Additionally, metrics.long is filtered to only include the metrics associated with the specified season and salinity zone. If any of the required metrics are missing from the data, then and error will be returned specifying that the function in question requires the following metric(s). Finally, the scoring thresholds found in an old Microsoft Access database created by Jacqueline M. Johnson of the Chesapeake Bay Program are applied to each of the metrics. I first tried to use the thresholds provided in (Lacouture et al. 2006) (pages 602-203) but minor discrepancies, mostly rounding differences, resulted in more errors later in this document; therefore, Jacqueline M. Johnson’s thresholds were favored. Originally, summer mesohaline and polyhaline indices included a picoplankton abundance metric. The picoplankton abundance metric has been dropped because picoplankton are no longer collected by the data providers. A warning will still appear if pico.warning is not set to FALSE indicating that their was no value for picoplankton abundance and therefore no score could be assigned.

Each of the scoring functions does not require a description, so only a subheading will be provided to break up the functions.

Spring Freshwater

score_spring_f <- function(metrics.long) {
  metric.vec <- c("total_phyto_biomass_chla_ratio",
                  "surface_chla",
                  "cyanophyte_biomass",
                  "doc",
                  "pheophytin",
                  "total_phyto_biomass")
  #----------------------------------------------------------------------------
  metrics.sub <- metrics.long %>% 
    filter(season == "spring", 
           salzone == "f",
           metric %in% metric.vec)
  #----------------------------------------------------------------------------
  if (any(!metric.vec %in% unique(metrics.sub$metric))) {
    stop(paste("score_spring_f requires the following metric(s):",
               paste(metric.vec[!metric.vec %in% unique(metrics.sub$metric)], collapse = ", ")))
  }
  #----------------------------------------------------------------------------
  final.df <- metrics.sub %>% 
    mutate(score = case_when(
      # Jackie's code 39.28 and 41.72
      # Lacouture (2006) 39.3 and 41.7
      metric == "total_phyto_biomass_chla_ratio" & value < 39.28 ~ 1,
      metric == "total_phyto_biomass_chla_ratio" & value >= 39.28 & value <= 41.72 ~ 3,
      metric == "total_phyto_biomass_chla_ratio" & value > 41.72 ~ 5,
      metric == "surface_chla" & (value < 3.42 | value > 14.45) ~ 1,
      metric == "surface_chla" & ((value >= 3.42 & value <= 4.37) |(value >= 13.98 & value <= 14.45)) ~ 3,
      metric == "surface_chla" & value > 4.37 & value < 13.98 ~ 5,
      # Jackie's code 23.02
      # Lacouture (2006) 23
      metric == "cyanophyte_biomass" & value > 23.02 ~ 1,
      metric == "cyanophyte_biomass" & value <= 23.02 ~ as.numeric(NA),
      metric == "doc" & value > 2.40 ~ 1,
      metric == "doc" & value >= 2.19 & value <= 2.40 ~ 3,
      metric == "doc" & value < 2.19 ~ 5,
      metric == "pheophytin" & value > 2.50 ~ 1,
      metric == "pheophytin" & value >= 1.55 & value <= 2.50 ~ 3,
      metric == "pheophytin" & value < 1.55 ~ 5,
      # Jackie's code 172.99, 583.9, and 828.5
      # Lacouture (2006) 173, 584, and 829
      metric == "total_phyto_biomass"  & (value <= 172.99 | value > 828.5) ~ 1,
      metric == "total_phyto_biomass"  & value >= 583.9 & value <= 828.5 ~ 3,
      metric == "total_phyto_biomass"  & value > 172.99 & value < 583.9 ~ 5,
      TRUE ~ as.numeric(NA)))
  return(final.df)
}

Spring Oligohaline

score_spring_o <- function(metrics.long) {
  metric.vec <- c("total_phyto_biomass_chla_ratio",
                  "surface_chla",
                  "doc",
                  "pheophytin",
                  "total_phyto_biomass")
  #----------------------------------------------------------------------------
  metrics.sub <- metrics.long %>% 
    filter(season == "spring", 
           salzone == "o",
           metric %in% metric.vec)
  #----------------------------------------------------------------------------
  if (any(!metric.vec %in% unique(metrics.sub$metric))) {
    stop(paste("score_spring_o requires the following metric(s):",
               paste(metric.vec[!metric.vec %in% unique(metrics.sub$metric)], collapse = ", ")))
  }
  #----------------------------------------------------------------------------
  final.df <- metrics.sub %>%
    mutate(score = case_when(
      metric == "total_phyto_biomass_chla_ratio" & (value < 18.8 | value > 48.6) ~ 1,
      metric == "total_phyto_biomass_chla_ratio" & value >= 18.8 & value <= 21.2 ~ 3,
      metric == "total_phyto_biomass_chla_ratio" & value > 21.2 & value < 48.6 ~ 5,
      metric == "surface_chla" & (value < 6.77 | value > 33.64) ~ 1,
      metric == "surface_chla" & ((value >= 20.93 & value <= 33.64) |(value >= 6.77 & value <= 8.82)) ~ 3,
      metric == "surface_chla" & value > 8.82 & value < 20.93 ~ 5,
      metric == "doc" & value > 3.27 ~ 1,
      metric == "doc" & value >= 2.69 & value <= 3.27 ~ 3,
      metric == "doc" & value < 2.69 ~ 5,
      metric == "pheophytin" & value > 2.68 ~ 1,
      metric == "pheophytin" & value >= 2.23 & value <= 2.68 ~ 3,
      metric == "pheophytin" & value < 2.23 ~ 5,
      # Jackie's code 133.67, 426.31, 131.37, and 685.81
      # Lacouture (2006) 134, 426, 131, and 686
      metric == "total_phyto_biomass"  & value > 133.67 & value < 426.31 ~ 5,
      metric == "total_phyto_biomass"  & ((value >= 131.37 & value <= 133.67) | (value >= 426.31 & value < 685.81)) ~ 3,
      metric == "total_phyto_biomass"  & any(value < 131.37 | value >= 685.81) ~ 1,
      TRUE ~ as.numeric(NA)))
  return(final.df)
}

Spring Mesohaline

score_spring_m <- function(metrics.long) {
  metric.vec <- c("total_phyto_biomass_chla_ratio",
                  "surface_chla",
                  "diatom_biomass",
                  "dinoflagellate_biomass",
                  "doc",
                  "pheophytin",
                  "prorocentrum_minimum_abundance",
                  "total_phyto_biomass")
  #----------------------------------------------------------------------------
  metrics.sub <- metrics.long %>% 
    filter(season == "spring", 
           salzone == "m",
           metric %in% metric.vec)
  #----------------------------------------------------------------------------
  if (any(!metric.vec %in% unique(metrics.sub$metric))) {
    stop(paste("score_spring_m requires the following metric(s):",
               paste(metric.vec[!metric.vec %in% unique(metrics.sub$metric)],
                     collapse = ", ")))
  }
  #----------------------------------------------------------------------------
  final.df <- metrics.sub %>% 
    mutate(score = case_when(
      # Jackie's code 69.52 and 45.04
      # Lacouture (2006) 69.5 and 45
      metric == "total_phyto_biomass_chla_ratio" & value < 45.04 ~ 1,
      metric == "total_phyto_biomass_chla_ratio" & value >= 45.04 & value <= 69.52 ~ 3,
      metric == "total_phyto_biomass_chla_ratio" & value > 69.52 ~ 5,
      metric == "surface_chla" & (value < 2.60 | value > 8.00) ~ 1,
      metric == "surface_chla" & ((value >= 2.60 & value <= 2.90) |(value >= 6.17 & value <= 8.00)) ~ 3,
      metric == "surface_chla" & value > 2.90 & value < 6.17 ~ 5,
      # Jackie's code 275.4
      # Lacouture (2006) 275
      metric == "diatom_biomass" & (value < 149 | value >= 2513) ~ 1,
      metric == "diatom_biomass" & value >= 149 & value <= 275.4 ~ 3,
      metric == "diatom_biomass" & value > 275.4 & value < 2513 ~ 5,
      # Jackie's code 28.3, 156.9, and 268.2
      # Lacouture (2006) 157 and 268
      metric == "dinoflagellate_biomass" & (value < 28.3 | value > 268.2) ~ 1,
      metric == "dinoflagellate_biomass" & value >= 156.9 & value <= 268.2 ~ 3,
      metric == "dinoflagellate_biomass" & value > 28.3 & value < 156.9 ~ 5,
      metric == "doc" & value > 3.17 ~ 1,
      metric == "doc" & value >= 2.84 & value <= 3.17 ~ 3,
      metric == "doc" & value < 2.84 ~ 5,
      metric == "pheophytin" & value > 1.03 ~ 1,
      metric == "pheophytin" & value >= 1.00 & value <= 1.03 ~ 3,
      metric == "pheophytin" & value < 1.00 ~ 5,
      metric == "prorocentrum_minimum_abundance" & value > 1477600 ~ 1,
      metric == "prorocentrum_minimum_abundance" & value <= 1477600 ~ as.numeric(NA),
      metric == "total_phyto_biomass"  & value > 1150 ~ 1,
      metric == "total_phyto_biomass"  & value <= 1150 ~ as.numeric(NA),
      TRUE ~ as.numeric(NA)))
  return(final.df)
}

Spring Polyhaline

score_spring_p <- function(metrics.long) {
  metric.vec <- c("total_phyto_biomass_chla_ratio",
                  "surface_chla",
                  "pct_cryptophyte",
                  "doc",
                  "pheophytin",
                  "prorocentrum_minimum_abundance",
                  "total_phyto_biomass")
  #----------------------------------------------------------------------------
  metrics.sub <- metrics.long %>% 
    filter(season == "spring", 
           salzone == "p",
           metric %in% metric.vec)
  #----------------------------------------------------------------------------
  if (any(!metric.vec %in% unique(metrics.sub$metric))) {
    stop(paste("score_spring_p requires the following metric(s):",
               paste(metric.vec[!metric.vec %in% unique(metrics.sub$metric)], collapse = ", ")))
  }
  #----------------------------------------------------------------------------
  final.df <- metrics.sub %>% 
    mutate(score = case_when(
      metric == "total_phyto_biomass_chla_ratio" & value < 71.0 ~ 1,
      metric == "total_phyto_biomass_chla_ratio" & value >= 71.0 & value <= 107.5 ~ 3,
      metric == "total_phyto_biomass_chla_ratio" & value > 107.5 ~ 5,
      metric == "surface_chla" & value > 4.00 ~ 1,
      metric == "surface_chla" & value >= 2.80 & value <= 4.00 ~ 3,
      metric == "surface_chla" & value < 2.80 ~ 5,
      # Jackie's code 4.93 and 7.06
      # Lacouture (2006) 4.9 and 7.1
      metric == "pct_cryptophyte" & value > 7.06 ~ 1,
      metric == "pct_cryptophyte" & value >= 4.93 & value <= 7.06 ~ 3,
      metric == "pct_cryptophyte" & value < 4.93 ~ 5,
      metric == "doc" & value > 2.61 ~ 1,
      metric == "doc" & value >= 2.50 & value <= 2.61 ~ 3,
      metric == "doc" & value < 2.50 ~ 5,
      metric == "pheophytin" & value > 0.90 ~ 1,
      metric == "pheophytin" & value >= 0.55 & value <= 0.90 ~ 3,
      metric == "pheophytin" & value < 0.55 ~ 5,
      metric == "prorocentrum_minimum_abundance" & value > 7488 ~ 1,
      metric == "prorocentrum_minimum_abundance" & value >= 672 & value <= 7488 ~ 3,
      metric == "prorocentrum_minimum_abundance" & value < 672 ~ 5,
      # Jackie's code 1061.7
      # Lacouture (2006) 1062
      metric == "total_phyto_biomass"  & value > 1061.7 ~ 1,
      metric == "total_phyto_biomass"  & value <= 1061.7 ~ as.numeric(NA),
      TRUE ~ as.numeric(NA)))
  return(final.df)
}

Summer Freshwater

score_summer_f <- function(metrics.long) {
  metric.vec <- c("surface_chla",
                  "cyanophyte_biomass",
                  "diatom_biomass",
                  "doc", 
                  "microcystis_aeruginosa_abundance",
                  "pheophytin", 
                  "total_phyto_biomass")
  #----------------------------------------------------------------------------
  metrics.sub <- metrics.long %>% 
    filter(season == "summer", 
           salzone == "f",
           metric %in% metric.vec)
  #----------------------------------------------------------------------------
  if (any(!metric.vec %in% unique(metrics.sub$metric))) {
    stop(paste("score_summer_f requires the following metric(s):",
               paste(metric.vec[!metric.vec %in% unique(metrics.sub$metric)], collapse = ", ")))
  }
  #----------------------------------------------------------------------------
  final.df <- metrics.sub %>% 
    mutate(score = case_when(
      metric == "surface_chla" & value > 12.30 ~ 1,
      metric == "surface_chla" & ((value >= 12.00 & value <= 12.30) | value <= 5.4) ~ 3,
      metric == "surface_chla" & value > 5.40 & value < 12.00 ~ 5,
      # Jackie's code 38.87 and 67.4
      # Lacouture (2006) 39 and 67
      metric == "cyanophyte_biomass" & value > 67.4 ~ 1,
      metric == "cyanophyte_biomass" & value >= 38.87 & value <= 67.4 ~ 3,
      metric == "cyanophyte_biomass" & value < 38.87 ~ 5,
      # Jackie's code 122.1 and 192.6
      # Lacouture (2006) 122 and 193
      metric == "diatom_biomass" & value > 192.6 ~ 1,
      metric == "diatom_biomass" & value >= 122.1 & value <= 192.6 ~ 3,
      metric == "diatom_biomass" & value < 122.1 ~ 5,
      metric == "doc" & value > 3.18 ~ 1,
      metric == "doc" & value >= 2.67 & value <= 3.18 ~ 3,
      metric == "doc" & value < 2.67 ~ 5,
      metric == "microcystis_aeruginosa_abundance" & value > 262507 ~ 1,
      metric == "microcystis_aeruginosa_abundance" & value <= 262507 ~ as.numeric(NA),
      metric == "pheophytin" & value > 4.30 ~ 1,
      metric == "pheophytin" & value >= 2.40 & value <= 4.30 ~ 3,
      metric == "pheophytin" & value < 2.40 ~ 5,
      # Jackie's code 231.3 and 555.7
      # Lacouture (2006) 232 and 551
      metric == "total_phyto_biomass"  & value > 555.7 ~ 1,
      metric == "total_phyto_biomass"  & value < 231.3 ~ 3,
      metric == "total_phyto_biomass"  & value >= 231.3 & value <= 555.7 ~ 5,
      TRUE ~ as.numeric(NA)))
  return(final.df)
}

Summer Oligohaline

score_summer_o <- function(metrics.long) {
  metric.vec <- c("surface_chla",
                  "cyanophyte_biomass",
                  "diatom_biomass",
                  "doc",
                  "pheophytin")
  #----------------------------------------------------------------------------
  metrics.sub <- metrics.long %>% 
    filter(season == "summer", 
           salzone == "o",
           metric %in% metric.vec)
  #----------------------------------------------------------------------------
  if (any(!metric.vec %in% unique(metrics.sub$metric))) {
    stop(paste("score_summer_o requires the following metric(s):",
               paste(metric.vec[!metric.vec %in% unique(metrics.sub$metric)], collapse = ", ")))
  }
  #----------------------------------------------------------------------------
  final.df <- metrics.sub %>% 
    mutate(score = case_when(
      metric == "surface_chla" & value >= 9.47 ~ 1,
      metric == "surface_chla" & value <= 4.20 ~ 3,
      metric == "surface_chla" & value > 4.20 & value < 9.47 ~ 5,
      # Jackie's code 1.79 and 26.55
      # Lacouture (2006) 1.8 and 27
      metric == "cyanophyte_biomass" & value >= 26.55 ~ 1,
      metric == "cyanophyte_biomass" & value < 1.79 ~ 3,
      metric == "cyanophyte_biomass" & value >= 1.79 & value < 26.55 ~ 5,
      # Jackie's code 44.14 and 126.59
      # Lacouture (2006) 44 and 127
      metric == "diatom_biomass" & value >= 126.59 ~ 1,
      metric == "diatom_biomass" & value <= 44.14 ~ 3,
      metric == "diatom_biomass" & value > 44.14 & value < 126.59 ~ 5,
      metric == "doc" & value > 4.00 ~ 1,
      metric == "doc" & value >= 3.15 & value <= 4.00 ~ 3,
      metric == "doc" & value < 3.15 ~ 5,
      metric == "pheophytin" & value > 2.81 ~ 1,
      metric == "pheophytin" & value >= 1.58 & value <= 2.81 ~ 3,
      metric == "pheophytin" & value < 1.58 ~ 5,
      TRUE ~ as.numeric(NA)))
  return(final.df)
}

Summer Mesohaline

score_summer_m <- function(metrics.long, pico.warning) {
  metric.vec <- c("total_phyto_biomass_chla_ratio",
                  "surface_chla",
                  "dinoflagellate_biomass", 
                  "doc",
                  "pheophytin",
                  "total_phyto_biomass")
  pico.vec <- "picoplankton_abundance"
  #----------------------------------------------------------------------------
  metrics.sub <- metrics.long %>% 
    filter(season == "summer", 
           salzone == "m",
           metric %in% c(metric.vec, pico.vec))
  #----------------------------------------------------------------------------
  if (any(!metric.vec %in% unique(metrics.sub$metric))) {
    stop(paste("score_summer_m requires the following metric(s):",
               paste(metric.vec[!metric.vec %in% unique(metrics.sub$metric)], collapse = ", ")))
  }
  
  if (pico.warning == TRUE && any(!pico.vec %in% unique(metrics.sub$metric))) {
    warning("Warning: picoplankton_abundance missing from score_summer_m")
  }
  #----------------------------------------------------------------------------
  final.df <- metrics.sub %>% 
    mutate(score = case_when(
      metric == "total_phyto_biomass_chla_ratio" & value < 32.2 ~ 1,
      metric == "total_phyto_biomass_chla_ratio" & value >= 32.2 & value <= 36.9 ~ 3,
      metric == "total_phyto_biomass_chla_ratio" & value > 36.9 ~ 5,
      metric == "surface_chla" & value >= 9.74 ~ 1,
      metric == "surface_chla" & ((value >= 7.70 & value < 9.74) | value <= 4.00) ~ 3,
      metric == "surface_chla" & value > 4.00 & value < 7.70 ~ 5,
      # Jackie's code 200.92, 31.22, 55.98
      # Lacouture (2006) 201, 31, 56
      metric == "dinoflagellate_biomass" & (value <= 31.22 | value > 200.92) ~ 1,
      metric == "dinoflagellate_biomass" & value > 31.22 & value <= 55.98 ~ 3, 
      metric == "dinoflagellate_biomass" & value > 55.98 & value < 200.92 ~ 5,
      metric == "doc" & value > 3.35 ~ 1,
      metric == "doc" & value >= 2.99 & value <= 3.35 ~ 3,
      metric == "doc" & value < 2.99 ~ 5,
      metric == "pheophytin" & value > 1.60 ~ 1,
      metric == "pheophytin" & value >= 1.23 & value <= 1.60 ~ 3,
      metric == "pheophytin" & value < 1.23 ~ 5,
      # Jackie's code 598720000
      # Lacouture (2006) 598700000
      metric == "picoplankton_abundance" & value < 352000000 ~ 1,
      metric == "picoplankton_abundance" & value >= 352000000 & value <= 598720000 ~ 3,
      metric == "picoplankton_abundance" & value > 598720000 ~ 5,
      metric == "total_phyto_biomass"  & value > 660 ~ 1,
      metric == "total_phyto_biomass"  & value <= 660 ~ as.numeric(NA),
      TRUE ~ as.numeric(NA)))
  return(final.df)
}

Summer Polyhaline

score_summer_p <- function(metrics.long, pico.warning) {
  
  metric.vec <- c("total_phyto_biomass_chla_ratio",
                  "surface_chla",
                  "pct_cryptophyte",
                  "diatom_biomass",
                  "dinoflagellate_biomass",
                  "doc",
                  "pheophytin",
                  "total_phyto_biomass")
  pico.vec <- "picoplankton_abundance"
  #----------------------------------------------------------------------------
  metrics.sub <- metrics.long %>% 
    filter(season == "summer", 
           salzone == "p",
           metric %in% c(metric.vec, pico.vec))
  #----------------------------------------------------------------------------
  if (any(!metric.vec %in% unique(metrics.sub$metric))) {
    stop(paste("score_summer_p requires the following metric(s):",
               paste(metric.vec[!metric.vec %in% unique(metrics.sub$metric)], collapse = ", ")))
  }
  
  if (pico.warning == TRUE && any(!pico.vec %in% unique(metrics.sub$metric))) {
    warning("Warning: picoplankton_abundance missing from score_summer_p")
  }
  #----------------------------------------------------------------------------
  final.df <- metrics.sub %>% 
    mutate(score = case_when(
      metric == "total_phyto_biomass_chla_ratio" & value < 37.7 ~ 1,
      metric == "total_phyto_biomass_chla_ratio" & value >= 37.7 & value <= 74.5 ~ 3,
      metric == "total_phyto_biomass_chla_ratio" & value > 74.5 ~ 5,
      metric == "surface_chla" & value > 5.33 ~ 1,
      metric == "surface_chla" & value >= 4.52 & value <= 5.33 ~ 3,
      metric == "surface_chla" & value < 4.52 ~ 5,
      metric == "pct_cryptophyte" & value > 6.5 ~ 1,
      metric == "pct_cryptophyte" & value >= 3.9 & value <= 6.5 ~ 3,
      metric == "pct_cryptophyte" & value < 3.9 ~ 5,
      metric == "diatom_biomass" & (value >= 799 | value < 137) ~ 1,
      metric == "diatom_biomass" & value >= 137 & value <= 181 ~ 3,
      metric == "diatom_biomass" & value > 181 & value < 799 ~ 5,
      metric == "dinoflagellate_biomass" & (value < 23 | value >= 544) ~ 1,
      metric == "dinoflagellate_biomass" & value >= 23 & value <= 37 ~ 3, 
      metric == "dinoflagellate_biomass" & value > 37 & value < 554 ~ 5,
      metric == "doc" & value > 2.80 ~ 1,
      metric == "doc" & value >= 2.58 & value <= 2.80 ~ 3,
      metric == "doc" & value < 2.58 ~ 5,
      metric == "pheophytin" & value > 1.50 ~ 1,
      metric == "pheophytin" & value >= 0.93 & value <= 1.50 ~ 3,
      metric == "pheophytin" & value < 0.93 ~ 5,
      metric == "picoplankton_abundance" & value < 208600000 ~ 1,
      metric == "picoplankton_abundance" & value >= 208600000 & value <= 269500000 ~ 3,
      metric == "picoplankton_abundance" & value > 269500000 ~ 5,
      # In Jackie's code 718
      # Lacouture (2006) 711
      metric == "total_phyto_biomass"  & (value < 181 | value > 831) ~ 1,
      metric == "total_phyto_biomass"  & ((value >= 181 & value <= 207) | (value >= 718 & value <= 831)) ~ 3,
      metric == "total_phyto_biomass"  & value > 207 & value < 718 ~ 5,
      TRUE ~ as.numeric(NA)))
  return(final.df)
}

Scoring Wrapper Function

The following function wraps all of the functions above into a single function to make it simple to calculate scores. The unique_id column is separated to represent station name (station), water layer (layer), sample number (samplenumber), sample date (sampledate), season (season), and salinity zone (salzone). Many of these fields are necessary for the individual scoring functions (e.g. score_spring_f() and score_summer_p()). The individual scoring functions are designed to score only the data applicable to a specific season and salinity zone; therefore, the output of each of the functions can simply be appended together.

score_phyto <- function(metrics.long, pico.warning) {
  #----------------------------------------------------------------------------
  spring.f <- score_spring_f(metrics.long)
  spring.o <- score_spring_o(metrics.long)
  spring.m <- score_spring_m(metrics.long)
  spring.p <- score_spring_p(metrics.long)
  #----------------------------------------------------------------------------
  summer.f <- score_summer_f(metrics.long)
  summer.o <- score_summer_o(metrics.long)
  summer.m <- score_summer_m(metrics.long, pico.warning)
  summer.p <- score_summer_p(metrics.long, pico.warning)
  #----------------------------------------------------------------------------
  final.df <- bind_rows(spring.f, spring.o, spring.m, spring.p,
                        summer.f, summer.o, summer.m, summer.p) %>% 
    select(-layer)
  #----------------------------------------------------------------------------
  return(final.df)
}

Apply the Scoring Function

score_phyto() is used to score the metric values in metrics.long.

scores.df <- metrics.long %>% 
  separate(unique_id, c("station", "layer", "samplenumber",
                        "sampledate", "season", "salzone"), sep = "_",
           remove = FALSE) %>% 
  rename(date = sampledate) %>% 
  score_phyto(pico.warning = FALSE) %>% 
  select(-samplenumber)

The IBI score (ibi_score) is calculated by finding the mean score for each unique event (unique_id), excluding NAs. Additionally, metrics with a value that was not NA are counted (present_count). Unique events (unique_id) with fewer than 4 non-NA values are excluded from the data frame.

scores.df <- scores.df %>% 
  group_by(unique_id) %>% 
  mutate(ibi_score = mean(score, na.rm = TRUE),
         metric_count = n(),
         present_count = sum(!is.na(score)),
         na_count = sum(is.na(score))) %>% 
  ungroup() %>% 
  filter(present_count >= 4)

Ratings

All of the indices were rated using the same rating scheme (Lacouture et al. 2006) (page 611).

poor: score <= 2
fair-poor: 2 < score < 2.67
fair: 2.67 <= score < 3.33
fair-good: 3.33 <= score <= 4
good: score > 4

rate_phyto <- function(scores.df) {
  final.df <- scores.df %>% 
    mutate(
      ibi_score = round(ibi_score, 2),
      rating = case_when(
        ibi_score < 2 ~ "poor",
        ibi_score >= 2 & ibi_score < 2.67 ~ "fair_poor",
        ibi_score >= 2.67 & ibi_score < 3.33 ~ "fair",
        ibi_score >= 3.33 & ibi_score < 4 ~ "fair_good",
        ibi_score >= 4 ~ "good",
        TRUE ~ "ERROR"
      ))
  return(final.df)
}

Apply rate_phyto() to scores.df.

ratings.df <- rate_phyto(scores.df) %>% 
  mutate(date = as.Date(date))

Map Ratings

Summarize ratings.df to just represent IBI scores and ratings. Join ratings.df with stations.df to provide latitude and longitude data for interactive maps. jitter() is used to alter the points position slightly, to prevent event data for the same station from completely overlapping one another.

ibi.df <- ratings.df %>% 
  select(-metric, -value, -score) %>% 
  unite(season_sal, season, salzone) %>% 
  distinct() %>% 
  left_join(stations.df, by = "station") %>% 
  mutate(longitude = jitter(longitude, amount = 0.01),
         latitude = jitter(latitude, amount = 0.01))

Convert the rating column to factors, which will make it easier to assign colors to the ratings in subsequent plots.

ibi.df <- ibi.df %>% 
  mutate(rating = factor(rating, c("poor", "fair_poor", "fair", "fair_good", "good")))

A color palette is assigned to the ratings.

factpal <- colorFactor(c("red", "orange", "yellow", "lightgreen", "green"), ibi.df$rating)

leaflet is used to provide a visual reference and check of the ratings.

leaflet(ibi.df) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>%  
  addCircleMarkers( ~longitude, ~latitude,
                    color = ~factpal(rating),
                    stroke = FALSE,
                    fillOpacity = 0.5,
                    popup = paste("Station:", stations.df$station, "<br/>",
                                  "Agency:", stations.df$agency, "<br/>",
                                  "Source:", stations.df$datasource, "<br/>",
                                  "Latitude:", stations.df$latitude, "<br/>",
                                  "Longitude:", stations.df$longitude))

Plot each stations IBI trend for each applicable index. The background colors represent the five rating category ranges (i.e. “poor”, “fair_poor”, “fair”, “fair_good”, “good”), providing the user with a quick reference of which rating a particular point would receive.

min.date <- min(ibi.df$date, na.rm = TRUE)
max.date <- max(ibi.df$date, na.rm = TRUE)

ggplot(data = ibi.df, aes(date, ibi_score, group = season_sal, color = season_sal)) +
  theme_classic() +
  scale_x_date(limits = c(min(ibi.df$date, na.rm = TRUE), max(ibi.df$date, na.rm = TRUE)), expand = c(0, 0)) +
  scale_y_continuous(limits = c(0.8, 5.2), expand = c(0, 0)) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 0.8, ymax = 2, fill = "red", alpha = 0.25) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 2, ymax = 2.67, fill = "orange", alpha = 0.25) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 2.67, ymax = 3.33, fill = "yellow", alpha = 0.25) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 3.33, ymax = 4, fill = "lightgreen", alpha = 0.25) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 4, ymax = 5.2, fill = "green", alpha = 0.25) +
  geom_line() +
  geom_point() +
  facet_wrap(~ station, ncol = 5, scales = "free")

These figures are the same as the previous figures but the plots have been divided by index (e.g. Spring Freshwater and Summer Polyhaline) to reduce clutter.

ggplot(data = ibi.df, aes(date, ibi_score, group = season_sal, color = season_sal)) +
  theme_classic() +
  scale_x_date(limits = c(min(ibi.df$date, na.rm = TRUE), max(ibi.df$date, na.rm = TRUE)), expand = c(0, 0)) +
  scale_y_continuous(limits = c(0.8, 5.2), expand = c(0, 0)) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 0.8, ymax = 2, fill = "red", alpha = 0.25) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 2, ymax = 2.67, fill = "orange", alpha = 0.25) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 2.67, ymax = 3.33, fill = "yellow", alpha = 0.25) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 3.33, ymax = 4, fill = "lightgreen", alpha = 0.25) +
  annotate("rect", xmin = min.date, xmax = max.date, ymin = 4, ymax = 5.2, fill = "green", alpha = 0.25) +
  geom_line(na.rm = TRUE) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ station + season_sal, ncol = 5, scales = "free")

Another way of looking at the data above is represent all of the samples for a particular station in a box-and-whisker plot. This provides an a good summary of the scores for each station. However, this plot could be miss leading because the majority of stations have frequencies and provides no frame of reference for changes overtime.

station.vec <- sort(unique(ibi.df$station))
min.station <- station.vec[1]
max.station <- station.vec[length(station.vec)]
ggplot(data = ibi.df, aes(station, ibi_score)) +
  theme_classic() +
  #scale_x_discrete(expand = c(0, 0)) +
  scale_y_continuous(limits = c(0.8, 5.2), expand = c(0, 0)) +
  annotate("rect", xmin = min.station, xmax = max.station, ymin = 0.8, ymax = 2, fill = "red", alpha = 0.25) +
  annotate("rect", xmin = min.station, xmax = max.station, ymin = 2, ymax = 2.67, fill = "orange", alpha = 0.25) +
  annotate("rect", xmin = min.station, xmax = max.station, ymin = 2.67, ymax = 3.33, fill = "yellow", alpha = 0.25) +
  annotate("rect", xmin = min.station, xmax = max.station, ymin = 3.33, ymax = 4, fill = "lightgreen", alpha = 0.25) +
  annotate("rect", xmin = min.station, xmax = max.station, ymin = 4, ymax = 5.2, fill = "green", alpha = 0.25) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  facet_wrap(~season_sal, ncol = 2)

Validation

Import Old Values

Import metric values, scores, IBI values, and ratings calculated by Jacqueline M. Johnson of the Chesapeake Bay Program over 10 years ago. This data provides and independent set of metric values, scores, IBI values, and index ratings to compare to the values determined in this document.

old.org.df <- read_excel(file.path(project.dir, "data/jackie_data/Data 2013_4plus-phyto-metrics.xlsx"),
                         sheet = "DATA_4+biometrics",
                         skip = 1) %>% 
  clean_up()

Metrics and scores are in a wide data format, where each metric or score represents a column. Create a vector of the metrics of interest. The score columns have the same names as the metrics but “__1" is added as a suffix. Therefore, scores.vec is created by pasting “__1" on to the end of each of the strings in metrics.vec.

metrics.vec <- c("chl_surf", "biomass_chl_ratio", "cyano_biomass", "doc",
                 "pheo", "tot_biomass", "diatom_biomass", "dino_biomass",
                 "prorocentrum_min_abund", "microcystis_aer_abund",
                 "crypto_bio_pct")

scores.vec <- paste0(metrics.vec, "__1")

Remove all unnecessary columns.

old.sub <- old.org.df %>% 
  mutate(old_rating = str_replace_all(pibi_rank, "-", "_")) %>% 
  rename(old_org_ibi_score = pibi_score) %>% 
  select(station, sample_date, season, ibi_layer, ibi_salzone,
         metrics.vec, scores.vec, old_rating, old_org_ibi_score
         )

Remove the score columns from old.org.df and transform the metric data to a long data format.

old.metrics <- old.sub %>%
  select(-one_of(scores.vec)) %>% 
  gather(metric, old_metric_value, -station, -sample_date,
         -season, -ibi_layer, -ibi_salzone,
         -old_rating, -old_org_ibi_score)

Remove the metric columns from old.org.df and transform the score data to a long data format. Remove the “__1" suffix from the scores name, which is just a reference to the metric name.

old.score <- old.sub %>% 
  select(-one_of(metrics.vec)) %>% 
  gather(metric, old_score_value, -station, -sample_date,
         -season, -ibi_layer, -ibi_salzone, -old_rating, -old_org_ibi_score
         ) %>% 
  mutate(metric = str_replace(metric, "__1", "")) %>% 
  group_by(station, sample_date, season, ibi_layer, ibi_salzone) %>% 
  mutate(old_ibi_score = mean(old_score_value, na.rm = TRUE)) %>% 
  ungroup()

old.metrics and old.score are joined together by station, sample_date, season, ibi_layer, ibi_salzone, and metric. This provides both the metric value and score in a long data format. In other words each row represents a unique events (unique_id) metric value (old_metric_value) and score (old_score_value).

old.df <- full_join(old.metrics, old.score,
                    by = c("station", "sample_date", "season",
                           "ibi_layer", "ibi_salzone", "metric",
                           "old_rating", "old_org_ibi_score"))

Convert column names and metric names to be consistent with the names found in ratings.df.

old.df <- left_join(old.df, old.org.df,
                    by = c("station", "sample_date", "season", "ibi_layer", "ibi_salzone")) %>% 
  select(station, sample_date, season, ibi_layer, ibi_salzone, metric,
         old_metric_value, old_score_value,
         old_org_ibi_score,
         old_ibi_score, old_rating
         ) %>% 
  rename(layer = ibi_layer,
         salzone = ibi_salzone,
         date = sample_date) %>% 
  mutate(date = as.Date(date),
         metric = case_when(
           metric == "chl_surf" ~ "surface_chla",
           metric == "biomass_chl_ratio" ~ "total_phyto_biomass_chla_ratio",
           metric == "cyano_biomass" ~ "cyanophyte_biomass",
           metric == "pheo" ~ "pheophytin",
           metric == "tot_biomass" ~ "total_phyto_biomass",
           metric == "dino_biomass" ~ "dinoflagellate_biomass",
           metric == "prorocentrum_min_abund" ~ "prorocentrum_minimum_abundance",
           metric == "microcystis_aer_abund" ~ "microcystis_aeruginosa_abundance",
           metric == "crypto_bio_pct" ~ "pct_cryptophyte",
           TRUE ~ metric
         ))

Modify Old Values

Re-score the metric values from old.df using score_phyto(). This allows the user to compare the new scoring procedure in this document to the independent scoring procedure created by Jacqueline M. Johnson of the Chesapeake Bay Program in SQL.

old.rescored <- old.df %>% 
  rename(value = old_metric_value) %>% 
  unite(unique_id, station, layer, date, season, salzone, remove = FALSE) %>% 
  score_phyto(pico.warning = FALSE)

Calculate the IBI score based on the scores calculated in the previous code chunk.

old.rescored <- old.rescored %>% 
  group_by(unique_id) %>% 
  mutate(ibi_score = mean(score, na.rm = TRUE),
         old_ibi_score = mean(old_score_value, na.rm = TRUE)) %>% 
  ungroup()

Rate the unique events (unique_id) IBI score calculated in the previous code chunk using rate_phyto().

old.rescored <- rate_phyto(old.rescored)

IBI Value, IBI Rating, Metric Scoring Disagreements

The metric values, metric scores, IBI values, and IBI rating calssification by Jacueline M. Johnson were used to evaluate the performance of the R-script developed in this document for calculated metric scores and IBI values. The metric values calculated by Jacueline M. Johnson are scored and the IBI values calculated using the R-scripts described in this document. Jacueline M. Johnson, obviously, used these metric values to then calculated metric scores, and subsequently IBI values and IBI ratings; therefore, if the metric scores, IBI values, and IBI ratings generated using this documents R-scripts do not agree with Jacueline M. Johnson’s values, then there must be a descrepancy in the scoring procedure, IBI value calculation, or rating classification. The metric values calculated by Jacueline M. Johnson are a baseline from which to validate

IBI Value Disagreement

Identify IBI scores re-calculated in this document that differ from the IBI scores calculated by Jacqueline M. Johnson. All numeric values are rounded to the hundredth place to remove minor discrepancies between the values being compared. The absolute difference between the re-calculated IBI score and Jacqueline M. Johnson’s IBI score is calculated (ibi_diff). Rows are sorted to present IBI score differences in descending order (ibi_diff) and only rows where the IBI differences disagree (ibi_diff > 0) are retained. This data frame can be used to explore differences in individual metric scores that are causing discrepancies between the two IBI values. Season (season) and salinity zone (salzone) are extracted from the unique event identifier (unique_id) and combined (unite()) to represent phytoplankton index names. Metrics (metric) and indices (index) are converted to factors to make them easier to plot in subsequent code chunks.

disagree.df <- old.rescored %>% 
  mutate_if(is.numeric, round, digits = 2) %>% 
  mutate(ibi_diff = abs(old_ibi_score - ibi_score)) %>% 
  arrange(desc(ibi_diff)) %>% 
  select(unique_id, metric, value, old_score_value, score, 
         old_ibi_score, ibi_score, ibi_diff, 
         old_org_ibi_score, old_rating,
         rating) %>% 
  separate(unique_id, c("station", "layer", "date", "season", "salzone"), sep = "_", remove = FALSE) %>% 
  unite(index, c("season", "salzone")) %>% 
  select(-station, -layer, -date) %>% 
  mutate(metric = factor(metric),
         index = factor(index))

Focus on just the IBI scores that differ. Categorize the IBI score differences (ibi_diff) into bins representing 0.5 increments.

disagree.ibi <- disagree.df %>%  
  filter(ibi_diff > 0) %>% 
  select(unique_id, index, old_ibi_score, ibi_score, ibi_diff) %>% 
  mutate(disagree_bin = case_when(
    ibi_diff > 0 & ibi_diff <= 0.5 ~ "0 < IBI Score <= 0.5",
    ibi_diff > 0.5 & ibi_diff <= 1 ~ "0.5 < IBI Score <= 1.0",
    ibi_diff > 1 & ibi_diff <= 1.5 ~ "1.0 < IBI Score <= 1.5",
    ibi_diff > 1.5 & ibi_diff <= 2 ~ "1.5 < IBI Score <= 2.0",
    ibi_diff > 2 & ibi_diff <= 2.5 ~ "2.0 < IBI Score <= 2.5",
    ibi_diff > 2.5 & ibi_diff <= 3 ~ "2.5 < IBI Score <= 3.0",
    ibi_diff > 3 & ibi_diff <= 3.5 ~ "3.0 < IBI Score <= 3.5",
    ibi_diff > 3.5 & ibi_diff <= 4 ~ "3.5 < IBI Score <= 4.0",
    TRUE ~ "ERROR"
  )) %>% 
  distinct()

If at there is at least one IBI score disagreement, then plot the counts of IBI score disagreement for each index (index). The bins created in the previous code chunk can be used to divide up the data into separate plots and provide a little bit more information regarding how much the index scores differ.

if (nrow(disagree.ibi) > 0) {
  ggplot(disagree.ibi, aes(index, fill = index)) +
    geom_bar() +
    facet_wrap(~ disagree_bin, scales = "free_x") +
    coord_flip()
}

Rating Disagreement

Focus on just ratings that differ, which makes it easier to identify errors in this documents code. Subset disagree.df to only include only rows where ratings classified in this document (rating) and ratings specified by Jacqueline M. Johnson (old_rating) disagree. Additionally, only include rows where the unaltered original IBI score reported by Jacqueline M. Johnson (old_org_ibi_score) is equivalent to the IBI score calculated in this document (ibi_score). The IBI score reported by Jacqueline M. Johnson was modified to exclude any the inclusion of Picophytoplankton abundance (old_ibi_score) in the Modify Old Values section to provide a better comparison of IBI scores in this document and Jacqueline M. Johnson’s IBI scores. However, this method does not allow us to accurately assess ratings. If we apply the rating function, rate_phyto(), to Jacqueline M. Johnson’s values, then the ratings will always agree when the IBI scores are equivalent. Therefore, the rating assessment must be limited to when the re-calculated IBI scores (ibi_score) is equvalent to the unaltered, original IBI score reported by Jacqueline M. Johnson (old_org_ibi_score); the comparison of the re-classified IBI ratings (rating) with the original IBI rating specified by Jacqueline M. Johnson (old_rating) provides an independent check of the rating function, rate_phyto(), developed in this document.

disagree.rating <- disagree.df %>% 
  filter(rating != old_rating,
         ibi_score == old_org_ibi_score) %>% 
  select(unique_id, index, old_ibi_score, ibi_score, old_rating, rating) %>% 
  distinct()

If there is at least one IBI rating disagreement, plot the counts of IBI rating disagreements.

if (nrow(disagree.rating) > 0) {
  disagree.rating %>% 
  mutate(old_rating = factor(old_rating,
                             levels = c("poor", "fair_poor",
                                        "fair", "fair_good", "good")),
         rating = factor(rating,
                         levels = c("poor", "fair_poor",
                                    "fair", "fair_good", "good"))) %>% 
  group_by(old_rating, rating) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  tidyr::complete(old_rating, rating) %>% 
  mutate(count = if_else(is.na(count), as.integer(0), count)) %>% 
  ggplot(aes(rating, old_rating,
             fill = count)) +
  theme_minimal() +
  geom_tile() +
  scale_fill_gradient(low = "#ffffff", high = "#4286f4") +
  geom_text(aes(label = count, size = 5), hjust = 0.5, vjust = 0.5) +
  xlab("This Documents Rating") +
  ylab("Jacqueline M. Johnson's Rating") +
  theme(legend.position = "none")
}

Metric Score Disagreement

Focus on just scores that differ, which makes it easier to identify errors in this documents code. Subset disagree.df to only include metrics (metric), metric values (value), Jacqueline M. Johnson’s scores (old_score_value), and this documents re-scored values (score). Find the absolute difference (score_diff) between Jacqueline M. Johnson’s scores (old_score_value) and this documents re-scored values (score). Only rows where the scores disagree (score_diff > 0) are retained.

disagree.score <- disagree.df %>% 
  filter(ibi_diff > 0) %>% 
  select(unique_id, index, metric, value, old_score_value, score) %>% 
  distinct() %>% 
  mutate(score_diff = abs(old_score_value - score)) %>% 
  filter(score_diff > 0) %>% 
  group_by(metric, index) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  tidyr::complete(metric, index) %>% 
  mutate(count = if_else(is.na(count), as.integer(0), count))

If there is at least one metric score disagreement, plot the counts of score disagreement for each metric by index (index).

if (nrow(disagree.score) > 0) {
  ggplot(disagree.score, aes(metric, count, fill = metric)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    facet_wrap(~index, ncol = 4) +
    theme(legend.position = "bottom") +
    guides(fill = guide_legend(ncol = 2, bycol = TRUE))
}

Metric Disagreement

This section compares the values found in this document to the values found by Jacqueline M. Johnson. The previous section was good for identifying issues related to scoring the metrics. This section identifies issues related to the metric values. The discrepancies found here most likely do not have to deal with the formula used to calculate the metrics but probably stem from differences in how the taxonomic counts are prepared.

Values from this document (ratings.df) are merged with Jacqueline M. Johnson’s values (old.df). Season (season) and salinity zone (salzone) are combined (unite()) to represent phytoplankton index names. Metrics (metric), indices (index), and data sources (source) are converted to factors to make them easier to plot in subsequent code chunks.

join.df <- left_join(ratings.df, old.df, by = c("station", "date", "season", "salzone",  "metric")) %>% 
  filter(date <= max(old.df$date)) %>% 
  unite(index, season, salzone) %>% 
  left_join(unique(bay.df[, c("unique_id", "source")]), by = "unique_id") %>% 
  mutate(index = factor(index),
         metric = factor(metric),
         source = factor(source))

The numeric values are all rounded to the hundredth place to reduce the detection of minor discrepancies. The absolute difference between the re-calculated IBI score and Jacqueline M. Johnson’s IBI score is calculated (ibi_diff). Rows are sorted to present IBI score differences in descending order (ibi_diff) and only rows where the IBI differences disagree (ibi_diff > 0) are retained. This data frame can be used to explore differences in individual metric scores that are causing discrepancies between the two IBI values.

join.df <- join.df %>% 
  mutate_if(is.numeric, round, digits = 2) %>% 
  mutate(metric_diff = abs(value - old_metric_value),
         score_diff = abs(score - old_score_value),
         ibi_diff = abs(ibi_score - old_ibi_score)) %>% 
  select(unique_id,  metric, index, source,
         value, old_metric_value, metric_diff,
         score, old_score_value, score_diff,
         ibi_score, old_ibi_score, ibi_diff,
         rating, old_rating)

Identify IBI scores calculated in this document that differ from the IBI scores calculated by Jacqueline M. Johnson.

diff.df <- join.df %>% 
  filter(ibi_diff > 0) %>% 
  arrange(desc(ibi_diff), desc(score_diff), desc(metric_diff)) %>% 
  distinct()

Overview of Descrepancies

Focus on just the IBI scores that differ. Categorize the IBI score differences (ibi_diff) into bins representing 0.5 increments.

diff.ibi <- diff.df %>% 
  select(-metric, -value, -old_metric_value, -metric_diff,
         -score, -old_score_value, - score_diff) %>% 
  mutate(disagree_bin = case_when(
    ibi_diff > 0 & ibi_diff <= 0.5 ~ "0 < IBI Score <= 0.5",
    ibi_diff > 0.5 & ibi_diff <= 1 ~ "0.5 < IBI Score <= 1.0",
    ibi_diff > 1 & ibi_diff <= 1.5 ~ "1.0 < IBI Score <= 1.5",
    ibi_diff > 1.5 & ibi_diff <= 2 ~ "1.5 < IBI Score <= 2.0",
    ibi_diff > 2 & ibi_diff <= 2.5 ~ "2.0 < IBI Score <= 2.5",
    ibi_diff > 2.5 & ibi_diff <= 3 ~ "2.5 < IBI Score <= 3.0",
    ibi_diff > 3 & ibi_diff <= 3.5 ~ "3.0 < IBI Score <= 3.5",
    ibi_diff > 3.5 & ibi_diff <= 4 ~ "3.5 < IBI Score <= 4.0",
    TRUE ~ "ERROR"
  )) %>% 
  distinct() %>% 
  arrange(desc(ibi_diff))

IBI Scores

if (nrow(diff.ibi) > 0) {
  ggplot(diff.ibi, aes(index, fill = index)) +
    geom_bar() +
    facet_wrap(~ disagree_bin) + 
    xlab("Index") +
    ylab("Number of Discrepancies") +
    coord_flip() 
}

Metric Scores

Focus just on scores that differ, which makes it easier to identify errors in this documents code. Subset disagree.df to only include metrics (metric), metric values (value), Jacqueline M. Johnson’s scores (old_score_value), and this documents scored values (score). Find the absolute difference (score_diff) between Jacqueline M. Johnson’s scores (old_score_value) and this documents scored values (score). Only rows where the scores disagree (score_diff > 0) are retained.

diff.score <- diff.df %>% 
  select(unique_id, index, source,  metric, value, old_score_value, score) %>% 
  distinct() %>% 
  mutate(score_diff = abs(old_score_value - score)) %>% 
  filter(score_diff > 0)

If there is at least one metric score disagreement, plot the counts of score disagreement for each metric.

if (nrow(diff.score) > 0) {
  count.diff <- diff.score %>% 
    group_by(metric) %>% 
    summarize(count = n()) %>% 
    ungroup() %>% 
    arrange(count) %>% 
    mutate(metric = factor(metric, levels = metric))
  
  ggplot(count.diff, aes(metric, count, fill = metric)) +
    geom_bar(stat = "identity") +
    coord_flip()
}

If there is at least one metric score disagreement, plot the counts of score disagreement for each metric by index (index).

if (nrow(diff.score) > 0) {
  diff.score %>% 
    mutate(metric = factor(metric, levels = levels(count.diff$metric))) %>% 
  ggplot(aes(metric, fill = metric)) +
    geom_bar() +
    coord_flip() +
    facet_wrap(~index, ncol = 4) +
    theme(legend.position = "bottom") +
    guides(fill = guide_legend(ncol = 2, bycol = TRUE))
}

More information can be gleaned from the previous plot if the data source (source) is also used as a grouping factor. Counts of score differences are summed by metric, index, and source.

diff.score.source <- diff.score %>% 
  group_by(metric, index, source) %>% 
  summarize(count = n()) %>% 
  ungroup() %>% 
  tidyr::complete(metric, index, source) %>% 
  mutate(count = if_else(is.na(count), as.integer(0), count))

If there is at least one metric score disagreement for diff.score.source, plot the counts of score disagreement for each metric by source (source) and index (index).

if (nrow(diff.score.source) > 0) {
  diff.score.source %>% 
    mutate(metric = factor(metric, levels = levels(count.diff$metric))) %>% 
    filter(!is.na(metric)) %>% 
  ggplot(aes(metric, count, fill = metric)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    facet_wrap(~index + source, ncol = 2) +
    theme(legend.position = "bottom") +
    guides(fill = guide_legend(ncol = 2, bycol = TRUE))
}

Metric Values

Focus just on metric values that differ, which makes it easier to identify errors in this documents code. Subset diff.df to only include metrics (metric), metric values (value), Jacqueline M. Johnson’s scores (old_metric_value), and this documents scored values (value). Find the absolute difference (metric_diff) between Jacqueline M. Johnson’s scores (old_metric_value) and this documents scored values (value). Only rows where the scores disagree by more than two (metric_diff > 2) are retained. Two was an arbitrary selection made to provide a small buffer to allow for rounding error.

diff.metric <- diff.df %>% 
  select(unique_id, metric, index, value, old_metric_value, metric_diff) %>% 
  mutate(metric_diff = abs(metric_diff)) %>% 
  filter(metric_diff > 2)

If there is at least one metric value disagreement, plot the counts of metric value disagreements for each metric.

if (nrow(diff.metric) > 0) {
  diff.metric %>% 
    group_by(metric) %>% 
    summarize(count = n()) %>% 
    ungroup() %>% 
    arrange(count) %>% 
    mutate(metric = factor(metric, levels = metric)) %>% 
  ggplot(aes(metric, count, fill = metric)) +
    geom_bar(stat = "identity") +
    coord_flip()
}

Conclusion

Currently, there are a large number of discrepancies between the IBI values calculated in this document and the IBI values found by Jacqueline M. Johnson of the Chesapeake Bay Program. The scoring functions provided in this document appear to be functioning correctly based on the [Scoring Disagreement] section. After using the scoring functions to re-score the metric values provided by Jacqueline M. Johnson, several errors were identified and corrected. These corrections reducing the scoring discrepancies to less than 10. Most of the remaining discrepancies are found at scoring thresholds. Jacqueline M. Johnson’s code provided the best description of scoring thresholds, which have been replicated in the Scoring Functions section; at this time, I have no justification for altering the thresholds further, so the few discrepancies will remain. Ultimately, the Metric Score Disagreement section suggests that any remaining discrepancies between the values found in this document and the values values found by Jacqueline M. Johnson are caused by differences in metric calculations and/or, more likely, differences in taxonomic count preparations. So far, no table has been found representing Jacqueline M. Johnson’s prepared taxonomic counts. This table would make it easier to identify discrepancies in how the taxa are prepared (e.g. excluded taxa or assigned carbon values). Without this table I need help identifying why these discrepancies exist. If you have any suggestions, please let me know.

Citations

Buchanan, Claire, Richard V. Lacouture, Harold G. Marshall, Marcia Olson, and Jacqueline M. Johnson. 2005. “Phytoplankton Reference Communities for Chesapeake Bay and Its Tidal Tributaries.” Estuaries and Coasts 28 (1): 138–59.

Chamberlain, Scott. 2017. “Ritis: Integrated Taxonomic Information System Client.”

Cheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2017. “Leaflet: Create Interactive Web Maps with the JavaScript ’Leaflet’ Library.”

Dowle, Matt, and Arun Srinivasan. 2017. “Data.Table: Extensions of ’Data.Frame‘.”

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25.

Henry, Lionel, and Hadley Wickham. 2017. “Purrr: Functional Programming Tools.”

Lacouture, Richard V., Jacqueline M. Johnson, Claire Buchanan, and Harold G. Marshall. 2006. “Phytoplankton Index of Biotic Integrity for Chesapeake Bay and Its Tidal Tributaries.” Estuaries and Coasts 29 (4): 598–616.

Lang, Duncan T., and the CRAN team. 2016. “RCurl: General Network (HTTP/FTP/...) Client Interface for R.”

Müller, Kirill. 2017. “Rprojroot: Finding Files in Project Subdirectories.”

Ooms, Jeroen. 2014. “The Jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.”

R Core Team. 2017. “R: A Languag and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

———. 2017a. “Httr: Tools for Working with URLs and HTTP.”

———. 2017b. “Stringr: Simple, Consistent Wrappers for Common String Operations.”

Wickham, Hadley, and Jennifer Bryan. 2017. “Readxl: Read Excel Files.”

Wickham, Hadley, and Lionel Henry. 2017. “Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions.”

Wickham, Hadley, Romain Francois, Lionel Henry, and Kirill Müller. 2017. “Dplyr: A Grammar of Data Manipulation.”