There are a handful of unexpected “character string” values (i.e., non-numbers) in the length and weight columns. For weight, it seems pretty obvious these should be overwritten as NAs. Here’s a small table of the unexpected values and the number of times each appears.
full_data %>%
  select(weight) %>%
  filter(str_detect(weight, "[:alpha:]")) %>%
  pull(weight) %>%
  table()
N/A too windy X
6 8
For length, I wanted to make sure that ‘Tally’ didn’t mean something important. Otherwise I’ll overwrite these lengths as NA.
full_data %>%
  select(length) %>%
  filter(str_detect(length, "[:alpha:]")) %>%
  pull(length) %>%
  table()
Tally TALLY (NO LENGTH)
7 1
After doing a little more parsing of values (formatting dates, removing accidental whitespace, etc.) and other data manipulation, I took a look at the distribution of your length and weight values. It didn’t seem like your original research questions depended on length/weight data, but I wanted to explore it quickly just in case.
There are some outliers here, though it’s difficult to see among all species, so I’ll also plot a few species you expressed interest in separately.
We could do formal outlier detection like we do for MAM/LTEF/LTRM QAQC, but I wanted to check with you first. We could also skip that step if you’re not interested in analyzing these anyway; however, if you plan on publishing this as a paper with the dataset archived alongside it, or even if you just want the data to live in the IRBS database, and that archived data will include lengths/weights, then we should address it. I would use statistical thresholds to identify outliers per species and overwrite those weights as NA, leaving the length values intact since lengths are usually more reliable than weights.
What are your thoughts? Should I edit your data to remove weight values that fall outside the expected length:weight relationship for each species?
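For concreteness, here’s a rough sketch of what that per-species check might look like: fit a log10–log10 length:weight regression for each species and flag weights whose residuals fall beyond a cutoff. The 3-SD cutoff and the 10-record minimum are arbitrary placeholders, not the MAM/LTEF/LTRM thresholds, and it assumes length/weight have already been parsed to numeric (as in the next chunk).
# Sketch only: flag weights far off each species' log-log length:weight line.
# The 3-SD residual cutoff and 10-record minimum are placeholders.
lw_flags <- full_data %>%
  filter(!is.na(length), !is.na(weight), length > 0, weight > 0) %>%
  group_by(fishcode) %>%
  filter(n() >= 10) %>%                      # need enough records to fit a line
  mutate(lw_resid   = residuals(lm(log10(weight) ~ log10(length))),
         lw_outlier = abs(lw_resid) > 3 * sd(lw_resid)) %>%
  ungroup() %>%
  filter(lw_outlier)
# If you agree, those flagged weights would be set to NA (lengths left intact).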
full_data <- full_data %>%
  # parse length/weight to numeric (character strings become NA),
  # trim stray whitespace from stratum
  mutate(length = parse_number(length),
         weight = parse_number(weight),
         stratum = str_trim(stratum)) %>%
  # format dates and pull out the sampling year
  mutate(
    sdate = lubridate::parse_date_time(sdate, 'mdy'),
    year = year(sdate)
  )
# total catch per species per barcode
nonzero_data <-
  full_data %>%
  group_by(barcode, fishcode) %>%
  summarise(total_catch = sum(catch))

# one row per site visit
sites <-
  full_data %>%
  dplyr::select(site, barcode, stratum, sdate, year, lcode, gear, period, bait, secchi, temp, depth, current, pool) %>%
  distinct()
full_data %>%
  ggplot(aes(x = log10(length), y = log10(weight), color = fishcode)) +
  geom_point() +
  labs(x = "log10 length (mm)", y = "log10 weight (g)", color = 'species code',
       title = 'LW data',
       subtitle = 'all species')
#lw (selected spp)
full_data %>%
  # filter to the top few species we'd be interested in
  filter(fishcode %in% c('CNCF', 'FHCF', 'SMBF', 'BKCP', 'WTCP', 'FWDM')) %>%
  ggplot(aes(x = log10(length), y = log10(weight), color = fishcode)) +
  geom_point() +
  labs(x = "log10 length (mm)", y = "log10 weight (g)", color = 'species code',
       title = 'LW data',
       subtitle = 'selected species')

It seems we have several barcodes that appear more than once. These might represent duplicate entries of a single record, or barcodes accidentally reused for unique records. I caught a few the last time we looked at these data together, but now there are a couple more.
It’s important for the analyses that each site has a unique ID, so please review these pairs and decide whether they represent:
Duplicate entries of the same data. In this case, we’ll delete one entry.
Duplicate entries of nearly-identical data. In this case, we’ll also delete one entry, but you’ll need to indicate which one you’d like deleted.
Unique records that were accidentally assigned the same barcode. In this case, I’ll assign some arbitrary value to the barcode to make it unique.
I could take my best guess at these, but I’d rather you make the call as the collector of the data.
sites %>%
  filter(barcode %in%
           (sites %>%
              group_by(barcode) %>%
              count() %>%
              filter(n > 1) %>%
              pull(barcode))
  ) %>%
  kableExtra::kable()

| site | barcode | stratum | sdate | year | lcode | gear | period | bait | secchi | temp | depth | current | pool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RM116 | 7 | MCB | 2019-07-24 | 2019 | HLM12 | HL | 1 | Clam | 24 | 28.3 | 1.5 | 0.22 | La Grange |
| RM155 | 7 | MCB | 2019-07-24 | 2019 | HLM12 | HL | 1 | Clam | 24 | 28.3 | 1.5 | 0.22 | La Grange |
| RM106.4 | 9 | MCB | 2019-07-08 | 2019 | HLM13 | HL | 1 | Clam | 20 | 28.1 | 5.1 | 0.61 | La Grange |
| RM106.5 | 9 | MCB | 2019-07-08 | 2019 | HLM13 | HL | 1 | Clam | 20 | 28.1 | 5.1 | 0.61 | La Grange |
| NA | 571 | SCB | 2021-09-20 | 2021 | ALTHLS7 | HL | 3 | Clam | 32 | 24.6 | 0.5 | 0.13 | La Grange |
| NA | 571 | SCB | 2021-09-20 | 2021 | FIXED | HL | 3 | Clam | 32 | 24.6 | 0.5 | 0.13 | La Grange |
| NA | B2024070113592045 | MCB | 2024-06-24 | 2024 | B10.RS | HL | 1 | Cottonseed | 24 | 29.0 | 3.5 | 0.00 | La Grange |
| NA | B2024070113592045 | MCB | 2024-06-24 | 2024 | B10.RS | HL | 1 | Cottonseed | NA | NA | NA | NA | La Grange |
| NA | 16016214 | SCB | 2024-07-09 | 2024 | HL11.RS | HL | 1 | Soybean | 23 | 27.1 | 1.3 | 0.27 | La Grange |
| NA | 16016214 | MCB | 2024-07-15 | 2024 | HS4.RS | HS | 1 | Soybean | 36 | 29.6 | 2.4 | 0.53 | La Grange |
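Once you’ve made those calls, the fixes themselves are quick. Here’s a rough sketch of each option; the reused_barcodes vector and the suffix scheme are placeholders until you confirm which case each pair falls under.
# Sketch only: possible fixes once you've sorted the duplicate barcodes.

# (1) Exact duplicate entries: keep a single copy of each record.
full_data <- full_data %>% distinct()

# (2) Nearly-identical entries: I'll drop whichever specific rows you flag.

# (3) Barcodes accidentally reused for different site visits: make them
#     unique, e.g. by tacking on the location code (placeholder scheme).
reused_barcodes <- c('571', '16016214')   # placeholder -- whichever you confirm
full_data <- full_data %>%
  mutate(barcode = if_else(barcode %in% reused_barcodes,
                           paste(barcode, lcode, sep = '_'),
                           barcode))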
There’s a small amount of ‘missing’ data, where values were empty or NA for a given field. For modeling purposes, any record with missing data in either the dependent variable or any independent variable of the model would be dropped. Nothing seems particularly bad to me; I’d just ask you to look over these percentages, and if one seems high, check on your own whether values can be recovered from the data sheets or whether something else is amiss.
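If numbers are easier to scan than the plots the chunk below produces, the same percentages can be tabulated directly; a quick sketch:
# Percent of records missing each field (same info as the missing-data plots).
full_data %>%
  select(fishcode, catch, length, weight) %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = 'field', values_to = 'pct_missing')

sites %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = 'field', values_to = 'pct_missing')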
missing_i <-
  DataExplorer::plot_missing(full_data %>%
                               select(fishcode, catch, length, weight),
                             title = 'Individual-level data',
                             ggtheme = theme_bw(),
                             theme_config = list(legend.position = "none"))

missing_s <-
  DataExplorer::plot_missing(sites,
                             title = 'Site-level data',
                             ggtheme = theme_bw(),
                             theme_config = list(legend.position = "none"))

Looking at other variables likely to matter for modeling, you can see a few fields have outliers we likely need to deal with. I’ll let you decide whether they warrant tracking down on the paper data sheets or simply overwriting as NA.
We could get carried away with period-specific ranges, but it seems reasonable to me to set liberal limits across all periods: temperatures need to be > 5 and < 40 degrees C. Do you agree?
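If you agree, the fix would be a one-liner along these lines (the 5–40 C window is the liberal limit proposed above; tighten it if you prefer):
# Sketch: overwrite implausible temperatures as NA (liberal 5-40 C window).
sites <- sites %>%
  mutate(temp = if_else(temp > 5 & temp < 40, temp, NA_real_))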
sites %>%
  ggplot(aes(x = sdate, y = temp, color = as.factor(period))) +
  geom_point() +
  facet_wrap(~year, scales = 'free') +
  labs(x = 'Sampling start date',
       y = 'Temperature (C)',
       color = 'Sampling period',
       title = 'Temperature outliers')

Do these seem ok to you?
sites %>%
  ggplot(aes(x = stratum, y = secchi, color = as.factor(year))) +
  geom_jitter(height = 0) +
  facet_wrap(~pool) +
  labs(x = 'Stratum', y = 'Secchi depth (cm)', color = 'Year')

Again, these seem reasonable.
sites %>%
  ggplot(aes(x = stratum, y = depth, color = as.factor(pool))) +
  geom_jitter(height = 0) +
  labs(x = 'Stratum', y = 'Depth (m)', color = 'Pool')

These seem good, too.
sites %>%
  ggplot(aes(x = stratum, y = current, color = as.factor(year))) +
  geom_jitter(height = 0) +
  facet_wrap(~pool) +
  labs(x = 'Stratum', y = 'Current (units?)', color = 'Year')

I’ll let you be the judge of this one: there are some really high catches for CNCF and SMBF. Do you think these are outliers worth a second look in the data?
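If they’re worth chasing down, here’s a quick sketch of how I’d pull each species’ largest catches for you to check against the data sheets (the top-1%-within-species cutoff is arbitrary); the plot follows.
# Sketch: list each species' largest catches (top 1% within species,
# an arbitrary cutoff) for cross-checking against the data sheets.
nonzero_data %>%
  group_by(fishcode) %>%
  filter(total_catch >= quantile(total_catch, 0.99, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(fishcode, desc(total_catch))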
nonzero_data %>%
  left_join(sites %>% select(pool, year, barcode)) %>%
  ggplot(aes(x = fishcode, y = total_catch, shape = as.factor(pool), color = as.factor(year))) +
  geom_point(alpha = .8) +
  labs(x = 'Species', y = 'Catch in each barcode', shape = 'Pool', color = 'Year') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

Sam, please get back to me about the following action items:
Confirm that the character-string values for length and weight should be overwritten as NA.
Decide whether we want to ignore all length/weight data, or perform an outlier detection/removal workflow à la MAM.
Identify the cause and correction for each set of duplicate site-level records.
Decide whether the low levels of ‘missing-ness’ are acceptable, or whether we want to go back to the paper data sheets to fill in any gaps.
Apply boundaries for reasonable ranges of the continuous variables (a rough sketch of applying such bounds follows this list). So far, it seems temperature may be the only variable needing attention, but I’d like your input on the others, especially catch.
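For reference, once we settle on bounds they could all be applied in one pass. Only the 5–40 C temperature window has actually been proposed; the secchi, depth, and current limits below are placeholders pending your input.
# Sketch only: apply agreed-upon ranges, overwriting values outside them as NA.
# Only the temperature window has been proposed; the rest are placeholders.
apply_bounds <- function(x, lo, hi) if_else(x > lo & x < hi, x, NA_real_)

sites <- sites %>%
  mutate(temp    = apply_bounds(temp, 5, 40),
         secchi  = apply_bounds(secchi, 0, 500),   # placeholder
         depth   = apply_bounds(depth, 0, 30),     # placeholder
         current = apply_bounds(current, 0, 3))    # placeholder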