Bait Comparison Data QAQC

Author

Michael Spear

Published

January 13, 2025

Length and Weight

Unexpected character values

There are a handful of unexpected “character string” values (ie non-numbers) in the length and weight columns. For weight, it seems pretty obvious these should be overwritten as NAs. Here’s a small table of unexpected values and the number of times they appear.

Code
full_data %>%
    select(weight) %>%
    filter(str_detect(weight, "[:alpha:]")) %>%
    pull(weight) %>%
    table()
.
N/A too windy               X 
             6              8 

For length, I wanted to make sure that ‘Tally’ didn’t mean something important. Otherwise I’ll overwrite these lengths as NA.

Code
full_data %>%
    select(length) %>%
    filter(str_detect(length, "[:alpha:]")) %>%
    pull(length) %>%
    table()
.
            Tally TALLY (NO LENGTH) 
                7                 1 

After doing a little more parsing of values (formatting dates, removing accidental whitespace, etc.) and other data manipulation, I took a look at the distribution of your length and weight values. It didn’t seem like your original research questions depended on length/weight data, but just in case I wanted to explore it quickly.

There are some outliers here, though it’s difficult to see among all species, so I’ll also plot a few species you expressed interest in separately.

We could do formal outlier detection like we do for MAM/LTEF/LTRM QAQC, but I wanted to check with you first. We could also skip that step if you’re not interested in analyzing these anyways, but if you plan on publishing this as a paper with a dataset published alongside it, or even if you want the data to live in the IRBS database, if that archived data will include lengths/weights we should address it. I would use statistical thresholds to identify outliers per species and overwrite the weights as NA, leaving the length values intact as these are usually more reliable than weight.

What are your thoughts? Should I edit your data to remove weight values that outlie the expected length:weight relationships for each species?

Code
full_data <- full_data %>%                                        #for other missing catch values, force to 1
  mutate(length = parse_number(length),
         weight = parse_number(weight),
         stratum = str_trim(stratum)) %>%
  mutate(
    sdate = lubridate::parse_date_time(sdate, 'mdy'),
    year = year(sdate)
    )

nonzero_data <-
  full_data %>%
  group_by(barcode, fishcode) %>%
  summarise(total_catch = sum(catch))

sites <-
  full_data %>%
  dplyr::select(site, barcode, stratum, sdate, year, lcode, gear, period, bait, secchi, temp, depth, current, pool) %>%
  distinct()

full_data %>% 
  #filter to top few species we'd be interested in
  ggplot(aes(x = log10(length), y = log10(weight), color = fishcode)) + 
  geom_point() +
  labs(x = "log10 length (mm)", y = "log10 weight (g)", color = 'species code',
       title = 'LW data',
       subtitle = 'all species')

Code
#lw (selected spp)
full_data %>% 
  #filter to top few species we'd be interested in
  filter(fishcode %in% c('CNCF', 'FHCF', 'SMBF', 'BKCP', 'WTCP', 'FWDM')) %>% 
  ggplot(aes(x = log10(length), y = log10(weight), color = fishcode)) + 
  geom_point() +
  labs(x = "log10 length (mm)", y = "log10 weight (g)", color = 'species code',
       title = 'LW data',
       subtitle = 'selected species')

Duplicate Data

It seems we have several barcodes that appear more than once. These might represent duplicate entries of a single record, or you just reused barcodes by accident for unique records. I caught a few the last time we looked at these data together, but now there’s a couple more.

It’s important for the analyses that I can have a unique ID for each site, so if you could please review these pairs and decide if they represent:

  1. Duplicate entries of the same data. In this case, we’ll delete one entry

  2. Duplicate entries of nearly-identical data. In this case, we’ll also delete one entry, but you need to indicate to me which one you’d like deleted

  3. Unique records that accidentally were assigned the same barcode. In this case, I’ll assign some arbitrary value to barcode to make it unique.

I could take my best guess at these, but I’d rather you make the call as the collector of the data.

Code
sites %>% 
  filter(barcode %in% 
           (sites %>% 
              group_by(barcode) %>% 
              count %>% 
              filter(n > 1) %>% 
              pull(barcode))
         ) %>%
 kableExtra::kable()
site barcode stratum sdate year lcode gear period bait secchi temp depth current pool
RM116 7 MCB 2019-07-24 2019 HLM12 HL 1 Clam 24 28.3 1.5 0.22 La Grange
RM155 7 MCB 2019-07-24 2019 HLM12 HL 1 Clam 24 28.3 1.5 0.22 La Grange
RM106.4 9 MCB 2019-07-08 2019 HLM13 HL 1 Clam 20 28.1 5.1 0.61 La Grange
RM106.5 9 MCB 2019-07-08 2019 HLM13 HL 1 Clam 20 28.1 5.1 0.61 La Grange
NA 571 SCB 2021-09-20 2021 ALTHLS7 HL 3 Clam 32 24.6 0.5 0.13 La Grange
NA 571 SCB 2021-09-20 2021 FIXED HL 3 Clam 32 24.6 0.5 0.13 La Grange
NA B2024070113592045 MCB 2024-06-24 2024 B10.RS HL 1 Cottonseed 24 29.0 3.5 0.00 La Grange
NA B2024070113592045 MCB 2024-06-24 2024 B10.RS HL 1 Cottonseed NA NA NA NA La Grange
NA 16016214 SCB 2024-07-09 2024 HL11.RS HL 1 Soybean 23 27.1 1.3 0.27 La Grange
NA 16016214 MCB 2024-07-15 2024 HS4.RS HS 1 Soybean 36 29.6 2.4 0.53 La Grange

Missing-ness

There’s a small amount of ‘missing’ data, where the values were empty or NA for a given field. For modeling purposes, any record with missing data in either the dependent or any independent variable of the model would be dropped. Nothing seems particularly bad to me, I’d just ask to look over these percentages, and if you think one seems high, investigate them on your own to see if values can be recovered from data sheets or if something else is amiss.

Code
missing_i <- 
DataExplorer::plot_missing(full_data %>% 
                             select(fishcode, catch, length, weight),
                           title = 'Individual-level data',
                           ggtheme = theme_bw(),
                           theme_config = list(legend.position = "none")) 

Code
missing_s <-
DataExplorer::plot_missing(sites, ,
                           title = 'Site-level data',
                           ggtheme = theme_bw(),
                           theme_config = list(legend.position = "none"))

Other outlier values

Investigating other variables likely important for modeling, you can see a few fields have outliers we likely need to deal with. I’ll let you decide if they warrant tracking down with paper data sheets or simply overwriting as NA.

Temperature

We could get carried away with period-specific ranges, but it seems reasonable to me we could set liberal limits for all periods where all temps need to be > 5 degrees and < 40 degrees. Do you agree?

Code
sites %>%
  ggplot(aes(x = sdate, y = temp, color = as.factor(period))) + 
  geom_point() + 
  facet_wrap(~year, scales = 'free') +
  labs(x = 'Sampling start date',
       y = 'Temperature (C)',
       color = 'Sampling period',
       title = 'Temperature outliers')

Secchi

Do these seem ok to you?

Code
sites %>%
  ggplot(aes(x = stratum, y = secchi, color = as.factor(year))) +
  geom_jitter(height = 0) +
  facet_wrap(~pool) +
  labs(x = 'Stratum', y = 'Secchi depth (cm)', color = 'Year')

Depth

Again, these seem reasonable.

Code
sites %>%
  ggplot(aes(x = stratum, y = depth, color = as.factor(pool))) +
  geom_jitter(height = 0) +
  labs(x = 'Stratum', y = 'Depth (m)', color = 'Pool')

Current

These seem good, too.

Code
sites %>%
  ggplot(aes(x = stratum, y = current, color = as.factor(year))) +
  geom_jitter(height = 0) +
  facet_wrap(~pool) +
  labs(x = 'Stratum', y = 'Current (units?)', color = 'Year')

Catch

I’ll let you be the judge of this. Some really high catches for CNCF and SMBF. Do you think these are outliers worth a second look in the data?

Code
nonzero_data %>%
  left_join(sites %>% select(pool, year, barcode)) %>%
  ggplot(aes(x = fishcode, y = total_catch, shape = as.factor(pool), color = as.factor(year))) +
  geom_point(alpha = .8) +
  labs(x = 'Species', y = 'Catch in each barcode', shape = 'Pool', color = 'Year') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) 

Action Items

Sam, please get back to me about the following action items:

  • Confirm character string values for length and weight should be overwritten as NA

  • Decide if we want to ignore all length/weight data, or perform outlier detection/removal workflow a la MAM.

  • Identify cause and correction for each set of duplicate site-level records.

  • Decide if the low levels of ‘missing-ness’ are acceptable, or if we want to go back to paper data sheets to fill in any gaps.

  • Apply boundaries for reasonable ranges of continuous variables. So far, it seems temperature may be the only variable needing attention, but I’d like your input on the others, especially catch.