There are a handful of unexpected “character string” values (i.e., non-numbers) in the length and weight columns. For weight, it seems pretty obvious these should be overwritten as NAs. Here’s a small table of the unexpected values and the number of times each appears.
full_data %>%
  select(weight) %>%
  filter(str_detect(weight, "[:alpha:]")) %>%
  pull(weight) %>%
  table()
N/A too windy X
6 8
For length, I wanted to make sure that ‘Tally’ didn’t mean something important. Otherwise I’ll overwrite these lengths as NA.
full_data %>%
  select(length) %>%
  filter(str_detect(length, "[:alpha:]")) %>%
  pull(length) %>%
  table()
Tally TALLY (NO LENGTH)
7 1
After doing a little more parsing of values (formatting dates, removing accidental whitespace, etc.) and other data manipulation, I took a look at the distribution of your length and weight values. It didn’t seem like your original research questions depended on length/weight data, but I wanted to explore it quickly just in case.
There are some outliers here, though it’s difficult to see among all species, so I’ll also plot a few species you expressed interest in separately.
We could do formal outlier detection like we do for MAM/LTEF/LTRM QAQC, but I wanted to check with you first. We could also skip that step if you’re not interested in analyzing these anyway; however, if you plan on publishing this as a paper with the dataset archived alongside it, or even if you just want the data to live in the IRBS database, and that archived data will include lengths/weights, then we should address it. I would use statistical thresholds to identify outliers per species and overwrite those weights as NA, leaving the length values intact since lengths are usually more reliable than weights.
What are your thoughts? Should I edit your data to remove weight values that fall outside the expected length:weight relationship for each species?
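For concreteness, here’s a rough sketch of what that per-species check might look like: fit a log10–log10 length:weight regression for each species and flag weights whose residuals fall beyond a cutoff. The 3-SD cutoff and the 10-record minimum are arbitrary placeholders, not the MAM/LTEF/LTRM thresholds, and it assumes length/weight have already been parsed to numeric (as in the next chunk).
# Sketch only: flag weights far off each species' log-log length:weight line.
# The 3-SD residual cutoff and 10-record minimum are placeholders.
lw_flags <- full_data %>%
  filter(!is.na(length), !is.na(weight), length > 0, weight > 0) %>%
  group_by(fishcode) %>%
  filter(n() >= 10) %>%                      # need enough records to fit a line
  mutate(lw_resid   = residuals(lm(log10(weight) ~ log10(length))),
         lw_outlier = abs(lw_resid) > 3 * sd(lw_resid)) %>%
  ungroup() %>%
  filter(lw_outlier)
# If you agree, those flagged weights would be set to NA (lengths left intact).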
full_data <- full_data %>%
  # parse length/weight to numeric (character strings become NA),
  # trim stray whitespace from stratum
  mutate(length = parse_number(length),
         weight = parse_number(weight),
         stratum = str_trim(stratum)) %>%
  # format dates and pull out the sampling year
  mutate(
    sdate = lubridate::parse_date_time(sdate, 'mdy'),
    year = year(sdate)
  )
# total catch per species per barcode
nonzero_data <-
  full_data %>%
  group_by(barcode, fishcode) %>%
  summarise(total_catch = sum(catch))

# one row per site visit
sites <-
  full_data %>%
  dplyr::select(site, barcode, stratum, sdate, year, lcode, gear, period, bait, secchi, temp, depth, current, pool) %>%
  distinct()
full_data %>%
  ggplot(aes(x = log10(length), y = log10(weight), color = fishcode)) +
  geom_point() +
  labs(x = "log10 length (mm)", y = "log10 weight (g)", color = 'species code',
       title = 'LW data',
       subtitle = 'all species')
#lw (selected spp)
full_data %>%
  # filter to the top few species we'd be interested in
  filter(fishcode %in% c('CNCF', 'FHCF', 'SMBF', 'BKCP', 'WTCP', 'FWDM')) %>%
  ggplot(aes(x = log10(length), y = log10(weight), color = fishcode)) +
  geom_point() +
  labs(x = "log10 length (mm)", y = "log10 weight (g)", color = 'species code',
       title = 'LW data',
       subtitle = 'selected species')

It seems we have several barcodes that appear more than once. These might represent duplicate entries of a single record, or barcodes accidentally reused for unique records. I caught a few the last time we looked at these data together, but now there are a couple more.
It’s important for the analyses that each site has a unique ID, so please review these pairs and decide whether they represent:
Duplicate entries of the same data. In this case, we’ll delete one entry.
Duplicate entries of nearly-identical data. In this case, we’ll also delete one entry, but you’ll need to indicate which one you’d like deleted.
Unique records that were accidentally assigned the same barcode. In this case, I’ll assign some arbitrary value to the barcode to make it unique.
I could take my best guess at these, but I’d rather you make the call as the collector of the data.
sites %>%
  filter(barcode %in%
           (sites %>%
              group_by(barcode) %>%
              count() %>%
              filter(n > 1) %>%
              pull(barcode))
  ) %>%
  kableExtra::kable()

| site | barcode | stratum | sdate | year | lcode | gear | period | bait | secchi | temp | depth | current | pool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RM116 | 7 | MCB | 2019-07-24 | 2019 | HLM12 | HL | 1 | Clam | 24 | 28.3 | 1.5 | 0.22 | La Grange |
| RM155 | 7 | MCB | 2019-07-24 | 2019 | HLM12 | HL | 1 | Clam | 24 | 28.3 | 1.5 | 0.22 | La Grange |
| RM106.4 | 9 | MCB | 2019-07-08 | 2019 | HLM13 | HL | 1 | Clam | 20 | 28.1 | 5.1 | 0.61 | La Grange |
| RM106.5 | 9 | MCB | 2019-07-08 | 2019 | HLM13 | HL | 1 | Clam | 20 | 28.1 | 5.1 | 0.61 | La Grange |
| NA | 571 | SCB | 2021-09-20 | 2021 | ALTHLS7 | HL | 3 | Clam | 32 | 24.6 | 0.5 | 0.13 | La Grange |
| NA | 571 | SCB | 2021-09-20 | 2021 | FIXED | HL | 3 | Clam | 32 | 24.6 | 0.5 | 0.13 | La Grange |
| NA | B2024070113592045 | MCB | 2024-06-24 | 2024 | B10.RS | HL | 1 | Cottonseed | 24 | 29.0 | 3.5 | 0.00 | La Grange |
| NA | B2024070113592045 | MCB | 2024-06-24 | 2024 | B10.RS | HL | 1 | Cottonseed | NA | NA | NA | NA | La Grange |
| NA | 16016214 | SCB | 2024-07-09 | 2024 | HL11.RS | HL | 1 | Soybean | 23 | 27.1 | 1.3 | 0.27 | La Grange |
| NA | 16016214 | MCB | 2024-07-15 | 2024 | HS4.RS | HS | 1 | Soybean | 36 | 29.6 | 2.4 | 0.53 | La Grange |
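Once you’ve made those calls, the fixes themselves are quick. Here’s a rough sketch of each option; the reused_barcodes vector and the suffix scheme are placeholders until you confirm which case each pair falls under.
# Sketch only: possible fixes once you've sorted the duplicate barcodes.

# (1) Exact duplicate entries: keep a single copy of each record.
full_data <- full_data %>% distinct()

# (2) Nearly-identical entries: I'll drop whichever specific rows you flag.

# (3) Barcodes accidentally reused for different site visits: make them
#     unique, e.g. by tacking on the location code (placeholder scheme).
reused_barcodes <- c('571', '16016214')   # placeholder -- whichever you confirm
full_data <- full_data %>%
  mutate(barcode = if_else(barcode %in% reused_barcodes,
                           paste(barcode, lcode, sep = '_'),
                           barcode))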
There’s a small amount of ‘missing’ data, where values were empty or NA for a given field. For modeling purposes, any record with missing data in either the dependent variable or any independent variable of the model would be dropped. Nothing seems particularly bad to me; I’d just ask you to look over these percentages, and if one seems high, check on your own whether values can be recovered from the data sheets or whether something else is amiss.
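If numbers are easier to scan than the plots the chunk below produces, the same percentages can be tabulated directly; a quick sketch:
# Percent of records missing each field (same info as the missing-data plots).
full_data %>%
  select(fishcode, catch, length, weight) %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = 'field', values_to = 'pct_missing')

sites %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = 'field', values_to = 'pct_missing')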
missing_i <-
  DataExplorer::plot_missing(full_data %>%
                               select(fishcode, catch, length, weight),
                             title = 'Individual-level data',
                             ggtheme = theme_bw(),
                             theme_config = list(legend.position = "none"))

missing_s <-
  DataExplorer::plot_missing(sites,
                             title = 'Site-level data',
                             ggtheme = theme_bw(),
                             theme_config = list(legend.position = "none"))

Looking at other variables likely to matter for modeling, you can see a few fields have outliers we likely need to deal with. I’ll let you decide whether they warrant tracking down on the paper data sheets or simply overwriting as NA.
We could get carried away with period-specific ranges, but it seems reasonable to me to set liberal limits across all periods: temperatures need to be > 5 and < 40 degrees C. Do you agree?
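If you agree, the fix would be a one-liner along these lines (the 5–40 C window is the liberal limit proposed above; tighten it if you prefer):
# Sketch: overwrite implausible temperatures as NA (liberal 5-40 C window).
sites <- sites %>%
  mutate(temp = if_else(temp > 5 & temp < 40, temp, NA_real_))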
sites %>%
  ggplot(aes(x = sdate, y = temp, color = as.factor(period))) +
  geom_point() +
  facet_wrap(~year, scales = 'free') +
  labs(x = 'Sampling start date',
       y = 'Temperature (C)',
       color = 'Sampling period',
       title = 'Temperature outliers')

Do these seem ok to you?
sites %>%
  ggplot(aes(x = stratum, y = secchi, color = as.factor(year))) +
  geom_jitter(height = 0) +
  facet_wrap(~pool) +
  labs(x = 'Stratum', y = 'Secchi depth (cm)', color = 'Year')

Again, these seem reasonable.
sites %>%
  ggplot(aes(x = stratum, y = depth, color = as.factor(pool))) +
  geom_jitter(height = 0) +
  labs(x = 'Stratum', y = 'Depth (m)', color = 'Pool')

These seem good, too.
sites %>%
  ggplot(aes(x = stratum, y = current, color = as.factor(year))) +
  geom_jitter(height = 0) +
  facet_wrap(~pool) +
  labs(x = 'Stratum', y = 'Current (units?)', color = 'Year')

I’ll let you be the judge of this one: there are some really high catches for CNCF and SMBF. Do you think these are outliers worth a second look in the data?
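If they’re worth chasing down, here’s a quick sketch of how I’d pull each species’ largest catches for you to check against the data sheets (the top-1%-within-species cutoff is arbitrary); the plot follows.
# Sketch: list each species' largest catches (top 1% within species,
# an arbitrary cutoff) for cross-checking against the data sheets.
nonzero_data %>%
  group_by(fishcode) %>%
  filter(total_catch >= quantile(total_catch, 0.99, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(fishcode, desc(total_catch))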
nonzero_data %>%
  left_join(sites %>% select(pool, year, barcode)) %>%
  ggplot(aes(x = fishcode, y = total_catch, shape = as.factor(pool), color = as.factor(year))) +
  geom_point(alpha = .8) +
  labs(x = 'Species', y = 'Catch in each barcode', shape = 'Pool', color = 'Year') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

Sam, please get back to me about the following action items:
Confirm that the character-string values for length and weight should be overwritten as NA.
Decide whether we want to ignore all length/weight data, or perform an outlier detection/removal workflow à la MAM.
Identify the cause and correction for each set of duplicate site-level records.
Decide whether the low levels of ‘missing-ness’ are acceptable, or whether we want to go back to the paper data sheets to fill in any gaps.
Apply boundaries for reasonable ranges of the continuous variables (a rough sketch of applying such bounds follows this list). So far, it seems temperature may be the only variable needing attention, but I’d like your input on the others, especially catch.
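For reference, once we settle on bounds they could all be applied in one pass. Only the 5–40 C temperature window has actually been proposed; the secchi, depth, and current limits below are placeholders pending your input.
# Sketch only: apply agreed-upon ranges, overwriting values outside them as NA.
# Only the temperature window has been proposed; the rest are placeholders.
apply_bounds <- function(x, lo, hi) if_else(x > lo & x < hi, x, NA_real_)

sites <- sites %>%
  mutate(temp    = apply_bounds(temp, 5, 40),
         secchi  = apply_bounds(secchi, 0, 500),   # placeholder
         depth   = apply_bounds(depth, 0, 30),     # placeholder
         current = apply_bounds(current, 0, 3))    # placeholder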