Rattlesnake thermoregulation

Analysis of internal body temperatures

Updated

July 1, 2026

Comments look like this.

To-do items look like this.

1 Datasets

Study data are organized across three primary sources:

data/site-metadata.csv — geographic and collaboration information for each of the 19 field sites.
data/snake-metadata.csv — physical and biological attributes for each of the 210 individual snakes, including a path to the corresponding body temperature file.
data/_raw/ — one CSV per snake with timestamped internal body temperature recordings from iButton sensors.

An integrated, quality-controlled workspace is provided as data/processed/tbdata.RData. It contains tb (body temperature time series), meta_snake, and meta_site.

tbdata.RData is produced in two sequential steps. analysis/processing/data-processing.r reads the raw CSV files, joins metadata, and detects basic integrity issues (duplicate timestamps, corrupted observations). analysis/processing/outliers.r then applies biological bounds filtering, removes short discontinuous segments, and flags rolling-window outliers as NA. See the Outlier detection section for details.

1.1 Internal body temperatures

Body temperature data from 210 snakes were collected at 19 sites across western North America, spanning 2013–2025.

Sampling locations and snake counts.
locale	state	country	lat	lon	elev	n
Alberta	AB	CAN	50.85	-110	710	2
Bouse	AZ	USA	33.84	-114	360	8
Chimineas Ranch	CA	USA	35.17	-119.9	744	23
Eatonton	GA	USA	33.23	-83.52	202	3
Indio Mountains Research Station	TX	USA	30.75	-105	1250	22
Kamloops-Thompson	BC	CAN	50.75	-120.5	697	8
Lordsburg	NM	USA	32.24	-108.9	1274	15
McIntyre Landing	MI	USA	44.65	-84.86	NA	8
Montana de Oro	CA	USA	35.26	-120.9	24	15
Norden	NE	USA	42.8	-100.1	714	1
Owl Head Buttes	AZ	USA	32.6	-111.1	840	5
Powerlines	AZ	USA	32.62	-111.3	635	3
Quartzite	AZ	USA	33.78	-114.1	360	5
Rattlesnake Butte	CO	USA	40.57	-107	2200	18
Rodeo	NM	USA	31.89	-109	1257	17
Sedgwick Reserve	CA	USA	34.69	-120	342	28
Vandenberg Airforce Base	CA	USA	34.74	-120.6	429	19
Youngblood Place	TX	USA	30.33	-103.1	1330	10

Sampled individuals belonged to 11 species. Most were represented at a single site.

Sample sizes by species.
spp	n_indiv	n_sites
crotalus oreganus	85	4
crotalus viridis	43	5
crotalus atrox	25	5
crotalus scutulatus	16	1
crotalus oreganus oreganus	8	1
sistrurus catenatus	8	1
crotaus pyrrhus	7	2
crotalus lepidus	6	1
crotalus ornatus	5	1
crotalus scutulatus x viridis	4	2
crotalus horridus	3	1

Sampling date ranges by site.
locale	state	start	end	years
Chimineas Ranch	CA	Apr 2010	Oct 2017	2
Vandenberg Airforce Base	CA	Apr 2012	Nov 2017	3
Montana de Oro	CA	Apr 2014	Feb 2018	3
Eatonton	GA	Apr 2021	Apr 2021	1
Owl Head Buttes	AZ	Apr 2022	Aug 2022	1
Quartzite	AZ	Apr 2023	Aug 2023	1
Kamloops-Thompson	BC	Jul 2005	Apr 2006	2
Alberta	AB	Jul 2021	Jul 2022	2
Rattlesnake Butte	CO	Jun 2017	May 2021	4
Lordsburg	NM	Jun 2020	Sep 2021	2
Rodeo	NM	Jun 2021	Sep 2021	1
Youngblood Place	TX	Jun 2021	Aug 2022	2
McIntyre Landing	MI	May 2014	Aug 2014	1
Sedgwick Reserve	CA	May 2015	Dec 2017	3
Indio Mountains Research Station	TX	May 2019	Oct 2023	5
Norden	NE	May 2022	Jun 2022	1
Powerlines	AZ	May 2022	Aug 2023	2
Bouse	AZ	May 2023	Aug 2023	1

1.2 DAYMET weather data

Daily weather covariates (minimum and maximum air temperature) are sourced from the ORNL Daymet 1-km gridded product, fetched via API by site coordinate and covering 1980–2024.

Example rows from DAYMET data.
lat	lon	date	year	yday	dayl	tmax	tmin	vp
30.33	-103.1	1980-01-01	1980	1	36293	18.54	-3.63	423.8
30.33	-103.1	1980-01-02	1980	2	36321	17.27	1.12	641.6
30.33	-103.1	1980-01-03	1980	3	36351	12.22	-1.7	539.1
30.33	-103.1	1980-01-04	1980	4	36383	17.16	-2.94	456.1
30.33	-103.1	1980-01-05	1980	5	36418	19.23	-3.21	431.4
30.33	-103.1	1980-01-06	1980	6	36454	19.27	-1.04	513.5

2 Exploratory Data Analysis

2.1 Outlier detection

Quality control is implemented in analysis/processing/outliers.r and proceeds in three stages.

1. Biological bounds. Observations with \[T_b < -5°C\] or \[T_b > 40°C\] are removed as physiologically implausible.

2. Discontinuity removal. Time gaps larger than 24 hours define segment boundaries within each snake’s series. Segments shorter than 7 days are dropped, as they are too short to contribute reliable information and may reflect instrument artifacts or re-deployment gaps.

3. Rolling-window outlier detection. For each observation a local mean (0.5-day window) and local SD (4-day window) are estimated using centered rolling statistics, with fallback to forward- or backward-looking windows near series endpoints. An observation is flagged if it deviates from the rolling mean by more than 6 SD and by more than 1°C. Flagged values are set to NA; rows are preserved.

The code block below shows the core logic of each stage.

Code

n_days_mean <- 0.5  # rolling-mean window
n_days_sd   <- 4    # rolling-SD window
n_sd        <- 6    # outlier threshold

# Stages 1-2: biological bounds + discontinuity removal
tb_qc <- tb |>
  filter(tb > -5 & tb < 40) |>
  group_by(file) |>
  arrange(datetime) |>
  mutate(
    lagged            = datetime - lag(datetime, default = first(datetime)),
    is_not_contiguous = cumsum(duration(24, "hours") < lagged)
  ) |>
  group_by(file, is_not_contiguous) |>
  mutate(chunk_length = max(last(datetime) - first(datetime))) |>
  filter(chunk_length > duration(7, "days")) |>
  ungroup()

# Stage 3: per-file rolling mean and SD, then flag deviations
#   NB: outliers.r additionally uses forward/backward windows near series
#   endpoints; this shows the centered-window core for brevity.
tb_aug <- tb_qc |>
  group_by(file) |>
  group_modify(function(tbdata, ...) {
    samp_freq <- median(diff(as.numeric(tbdata$datetime)), na.rm = TRUE)
    w_mean    <- ceiling(as.numeric(duration(n_days_mean, "days")) / samp_freq)
    w_sd      <- ceiling(as.numeric(duration(n_days_sd,   "days")) / samp_freq)
    tbdata |>
      mutate(
        rmean = data.table::frollmean(tb, w_mean, align = "center", fill = NA),
        rsd   = data.table::frollapply(tb, w_sd, FUN = sd, align = "center", fill = NA)
      )
  }) |>
  ungroup()

# flagged values set to NA; the full pipeline writes the result back to tbdata.RData
tb_adj <- tb_aug |>
  mutate(
    tb = if_else((abs(tb - rmean) > n_sd * rsd) & (abs(tb - rmean) > 1), NA_real_, tb)
  ) |>
  select(file, datetime, tb)

Diagnostic plots showing a ±7-day window around each flagged observation are written to analysis/processing/_diagnostics/_outliers/ for manual review.

2.2 Active season identification

Active season dates are inferred from internal body temperatures using changepoint detection on daily diurnal variance. The intuition is that thermoregulating snakes show greater daily temperature variance when active; variance is low and flat during inactivity.

Season classification is a standalone analysis. The output data/processed/seasons.rds is produced by analysis/processing/season-identification.R and is not currently consumed by the main modeling pipeline.

2.2.1 Diurnal variation

Daily standard deviations in body temperature are computed per snake per day of year. Snakes with fewer than 30 days of data are excluded, and series are smoothed with a 7-day rolling mean.

2.2.2 Location averages

Smoothed series are aggregated to locale-level pointwise medians to give the representative seasonal profile for each site.

2.2.3 Changepoint detection

Mean changepoints are detected via binary segmentation (up to $Q = 3$ segments). A threshold of 1.5°C on segment-level mean distinguishes active from inactive periods.

Comments:

In Lordsburg, a changepoint falls in the middle of the end-of-season transition. The 1.5°C threshold excludes the tapering period; reducing it to 1.25 would include it.
In Montana de Oro and Sedgwick Reserve, inferred start dates are unreliable due to sparse data (see below).

Inferred changepoints and individual-level diurnal variation for Montana de Oro and Sedgwick Reserve.

In Montana de Oro, data span 2017–2018 (mid-April to late February). A single snake was active for roughly one week in early February 2018; the inferred changepoint simply marks the start of data collection. In Sedgwick, data span 2015–2017 but with only one snake recorded in 2016; the changepoint reflects that individual’s activity onset.

Sampling dates for Montana de Oro and Sedgwick Reserve.
locale	yr	first.obs	last.obs
Montana de Oro	2014	2014-04-16	2014-10-07
Montana de Oro	2017	2017-04-15	2017-12-31
Montana de Oro	2018	2018-01-01	2018-02-21
Sedgwick Reserve	2015	2015-05-16	2015-12-31
Sedgwick Reserve	2016	2016-01-01	2016-03-24
Sedgwick Reserve	2017	2017-04-15	2017-12-11

2.2.4 Inferred active seasons

Active season classification uses the changepoint-inferred segment means thresholded at 1.5°C (daily SD). Locales for which all data fell within the active season (Rodeo, McIntyre, Chimineas) are excluded from the table.

Inferred active season changepoints.
location	start	end
Vandenberg Airforce Base	May 5	Oct 16
Montana de Oro	Jan 30	Oct 26
Lordsburg	Mar 28	Nov 12
Alberta	May 1	Sep 8
Youngblood Place	Mar 25	Nov 10
Rattlesnake Butte	May 3	Sep 22
Indio Mountains Research Station	Mar 16	Nov 10
Kamloops-Thompson	Mar 23	Sep 27
Sedgwick Reserve	Feb 12	Oct 24

3 Statistical Modeling

The analysis models how continuous body temperature profiles ($T_b$ as a function of time of day) vary with site-level environmental covariates across months, using a pooled cyclic functional regression fit across all months simultaneously.

3.1 Notation and covariates

Symbol	Meaning
\[T_b(t)\]	Body temperature at hour \[t \in [0, 24)\]
\[x\]	Covariate (tmin, tmax, elev, or lat), z-scored globally
\[m\]	Month (\[1, \ldots, 12\]), treated as a continuous cyclic variable
\[\ell\]	Locale (field site)
\[i\]	Curve index — one snake-day
\[\phi^{(1)}_k(t)\]	Level-1 (between-locale) eigenfunction, PC \[k\]
\[\phi^{(2)}_k(t)\]	Level-2 (within-locale) eigenfunction, PC \[k\]

Four covariates are supported. Each is z-scored globally (across all locale-dates) so that a 1-SD shift is interpretable as a population-level contrast regardless of site.

Variable	Units	Contrast	Notes
`tmin`	°C	5°C	—
`tmax`	°C	5°C	—
`elev`	m	100 m	Rattlesnake Butte excluded (extreme outlier)
`lat`	degrees	5°	—

3.2 Data preparation

The package function prep_tf_data() loads the processed data files, joins snake and site metadata with Daymet weather, constructs functional data objects (one curve per snake-day on the 0–24h domain), and z-scores the four covariates globally. Sparse curves (fewer than 10 hourly observations) are retained at this stage and filtered downstream.

Diurnal body temperature profiles across all sites and months. Each curve represents one snake on one calendar day. Red curves fall during the inferred inactive season for that locale.

Season classification is determined at the locale level, not individually. Some curves appear mismatched because individual snakes can be active or inactive outside the typical window for their site.

3.3 Model specification

3.3.1 Pooled cyclic model (primary)

A single model is fit per covariate across all 12 months. The cyclic cubic spline basis on month — with knots at $m = 0.5$ and $m = 12.5$ — ensures a smooth December → January transition.

Stage 1 — Initial BAM:

\[ T_b(t, m) = f_0(t) + g_0(m) + h_0(t, m) + x \cdot h_1(t, m) + \varepsilon(t, m) \]

$f_0(t)$: s(t, bs = "cc", k = 20), knots at $t = 0, 23$
$g_0(m)$: s(month, bs = "cc", k = 12), knots at $m = 0.5, 12.5$
$h_0(t, m)$: te(t, month, bs = c("cc", "cc"), k = c(12, 8))
$h_1(t, m)$: te(t, month, bs = c("cc", "cc"), k = c(12, 8), by = x)

Fit via mgcv::bam() with method = "fREML" and discrete = TRUE. The tensor product basis dimensions k = c(12, 8) were increased from c(10, 6) to reduce EDF saturation.

Stage 2 — Multilevel FPCA:

Stage 1 residuals are averaged to snake-level mean curves (one per individual, pooled across all months) and decomposed into between-locale and within-locale components:

\[ \hat{\varepsilon}_{\ell i}(t) = \sum_{k=1}^{K_1} \xi_{\ell k}^{(1)} \phi_k^{(1)}(t) + \sum_{k=1}^{K_2} \xi_{\ell i, k}^{(2)} \phi_k^{(2)}(t) + e_{\ell i}(t) \]

Estimation uses run_mfpca() with $K_1 = K_2 = 1$ component retained at each level. Missing hourly values are linearly interpolated before MFPCA.

Stage 3 — Full BAM:

\[ T_b(t, m) = f_0(t) + g_0(m) + h_0(t, m) + x \cdot h_1(t, m) + \phi_1^{(1)}(t) \cdot b_\ell + \phi_1^{(2)}(t) \cdot b_{\ell i} \]

where $b_\ell \sim N(0, \sigma_1^2)$ (locale-level) and $b_{\ell i} \sim N(0, \sigma_2^2)$ (snake-level) are eigenfunction-loaded random intercepts. Fit with select = TRUE in addition to the Stage 1 options.

Stage 4 — Population predictions:

Population-level predictions set $b_\ell = b_{\ell i} = 0$:

\[ \hat{T}_b(t, m \mid x) = \hat{f}_0(t) + \hat{g}_0(m) + \hat{h}_0(t, m) + x \cdot \hat{h}_1(t, m) \]

Confidence bands use the lpmatrix approach (random-effect columns zeroed in the prediction matrix). Predictions span: $t \in \{0, 0.1, \ldots, 23.9\}$, $m \in \{1, \ldots, 12\}$, and two contrast levels ($z = 0$ vs. $z = c / \hat{\sigma}_x$).

The full implementation is in analysis/modeling/fit-models.R (local) and analysis/modeling/tide/fit-pooled.R (HPC entrypoint). Key model-fitting code:

Code

prepped <- prep_tf_data(data_dir = "data/processed")

gam_data <- prepped$tf_df |>
  filter(tf_count(tb) > 10, !is.na(tmin)) |>
  select(fid, short.id, locale, tb, date) |>
  mutate(month = lubridate::month(date)) |>
  left_join(prepped$scaled, by = join_by(locale, date, short.id)) |>
  tf_unnest(tb) |>
  rename(t = tb_arg, tb = tb_value)

# Stage 1
fit_init <- bam(
  tb ~ s(t, bs = "cc", k = 20) +
       s(month, bs = "cc", k = 12) +
       te(t, month, bs = c("cc", "cc"), k = c(12, 8)) +
       te(t, month, bs = c("cc", "cc"), k = c(12, 8), by = tmin.z),
  data   = gam_data,
  knots  = list(t = c(0, 23), month = c(0.5, 12.5)),
  method = "fREML",
  discrete = TRUE
)

# Stage 2
fit_mfpca <- run_mfpca(Y_fill, id = locale_vec, visit = visit_vec)

# Stages 3–4: see analysis/modeling/fit-models.R for complete code

3.4 Results

Pre-computed model outputs are stored as lightweight extracts (_lite.rds) in analysis/rslt/model/. The package function reconstruct_predictions() regenerates population prediction grids at any desired contrast value.

3.4.1 Predicted curves

Predicted body temperature curves are shown at the baseline covariate level ($z = 0$) and at the shifted level ($z = c / \hat{\sigma}_x$, corresponding to the contrasts in the table above).

Predicted body temperature curves at baseline and shifted covariate values.

3.4.2 Covariate effects

Pointwise differences between predicted curves (shifted minus baseline) reveal when during the day each covariate has the largest influence on body temperature.

Pointwise covariate effects by hour and month.

3.4.3 Effect sizes

The area between predicted curves (integrated absolute pointwise difference, normalized to an average hourly value in °C) summarizes the magnitude of each covariate’s effect per month.

3.5 Bootstrap inference

Model-based confidence intervals from vcov() are anti-conservative because within-curve residual autocorrelation is approximately 0.8 at lag-1, violating the independence assumption embedded in the model’s standard errors. The snake-level block bootstrap provides valid inference by resampling the set of snakes ($n = 131$) with replacement, preserving within-snake and within-curve dependence. For each bootstrap draw, the Stage-3 BAM is refit with Stages 1 and 2 held fixed, and the area between the two predicted population curves is computed per month:

\[ A_m = \int_0^{24} \left| \hat{T}_b(t, m \mid z = c/\hat{\sigma}) - \hat{T}_b(t, m \mid z = 0) \right| \, dt \]

Both absolute and signed areas are computed; the annualized summary is the mean across all 12 months. Bootstrap inference is applied to pooled models only. Results are based on 1000 draws per variable (50 HPC chunks × 20 draws). See analysis/posthoc/posthoc-spec.md for script-level details and execution order.

Pooled model — average hourly absolute effect size with 95% bootstrap CIs.

Pooled model — signed average hourly effect with 95% bootstrap CIs.

Annualized effect size comparison across covariates (95% bootstrap CI).

--- title: "Rattlesnake thermoregulation" subtitle: "Analysis of internal body temperatures" date: today published-title: "Updated" format: html: code-fold: true code-tools: true toc: true number-sections: true callout-appearance: simple execute: warning: false message: false echo: false cache: true html-math-method: method: mathjax url: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" --- ```{r setup} #| cache: false library(tidyverse) library(lubridate) library(tidyfun) library(mgcv) library(changepoint) library(here) library(pander) library(patchwork) library(sf) library(rnaturalearth) library(rnaturalearthdata) library(rnaturalearthhires) library(ggrepel) devtools::load_all() ggthm <- theme_bw() + theme( panel.grid = element_blank(), panel.grid.major.y = element_line(color = "black", linewidth = 0.1) ) data_dir <- here("data/processed") pooled_dir <- here("analysis/rslt/model/pooled") ``` ::: callout-note Comments look like this. ::: ::: callout-caution To-do items look like this. ::: # Datasets Study data are organized across three primary sources: 1. `data/site-metadata.csv` — geographic and collaboration information for each of the 19 field sites. 2. `data/snake-metadata.csv` — physical and biological attributes for each of the 210 individual snakes, including a path to the corresponding body temperature file. 3. `data/_raw/` — one CSV per snake with timestamped internal body temperature recordings from iButton sensors. An integrated, quality-controlled workspace is provided as `data/processed/tbdata.RData`. It contains `tb` (body temperature time series), `meta_snake`, and `meta_site`. ```{r load-data} load(here("data/processed/tbdata.RData")) daymet <- read_rds(here("data/processed/daymet.rds")) ``` ::: callout-note `tbdata.RData` is produced in two sequential steps. `analysis/processing/data-processing.r` reads the raw CSV files, joins metadata, and detects basic integrity issues (duplicate timestamps, corrupted observations). `analysis/processing/outliers.r` then applies biological bounds filtering, removes short discontinuous segments, and flags rolling-window outliers as `NA`. See the [Outlier detection] section for details. ::: ## Internal body temperatures Body temperature data from `r nrow(meta_snake)` snakes were collected at `r nrow(meta_site)` sites across western North America, spanning 2013–2025. ```{r sites-table} meta_snake |> count(loc.id) |> right_join(meta_site |> distinct(loc.id, locale, state, country, lat, lon, elev), join_by(loc.id)) |> select(locale, state, country, lat, lon, elev, n) |> pander(caption = "Sampling locations and snake counts.") ``` Sampled individuals belonged to `r meta_snake$spp |> unique() |> length()` species. Most were represented at a single site. ```{r species-table} meta_snake |> count(spp, loc.id) |> group_by(spp) |> summarize(n_indiv = sum(n), n_sites = n(), .groups = "drop") |> arrange(desc(n_indiv)) |> pander(caption = "Sample sizes by species.") ``` ```{r sampling-dates-table} tb |> left_join(meta_snake |> select(file, loc.id), join_by(file)) |> left_join(meta_site |> distinct(loc.id, locale, state), join_by(loc.id)) |> group_by(locale, state) |> summarize( start = format(min(date(datetime)), "%b %Y"), end = format(max(date(datetime)), "%b %Y"), years = n_distinct(year(datetime)), .groups = "drop" ) |> arrange(start) |> pander(caption = "Sampling date ranges by site.") ``` ```{r site-map} #| fig-width: 7 #| fig-height: 5 #| fig-align: center #| fig-cap: "Sampling sites across North America." site_locs <- meta_snake |> count(loc.id, name = "n") |> left_join(meta_site |> distinct(loc.id, locale, lat, lon), join_by(loc.id)) |> mutate(label = paste0("n = ", n)) xlims <- range(site_locs$lon, na.rm = TRUE) + c(-3, 3) ylims <- range(site_locs$lat, na.rm = TRUE) + c(-2, 2) ggplot() + geom_sf( data = ne_countries(continent = "North America", returnclass = "sf", scale = "medium"), fill = "gray98", color = "gray70", linewidth = 0.3 ) + geom_sf( data = ne_states(country = c("United States of America", "Canada"), returnclass = "sf"), fill = NA, color = "gray82", linewidth = 0.2 ) + geom_point( data = site_locs, aes(x = lon, y = lat), shape = 21, fill = "#4575b4", color = NULL, size = 2.5, stroke = 0.5, alpha = 0.8 ) + geom_text_repel( data = site_locs, aes(x = lon, y = lat, label = label), size = 3.5, segment.color = "gray40", segment.size = 0.3, box.padding = 0.7, point.padding = 0.4, force = 6, force_pull = 0.4, min.segment.length = 0, max.overlaps = Inf, max.iter = 10000, seed = 91 ) + coord_sf(xlim = xlims, ylim = ylims, expand = FALSE) + labs(x = NULL, y = NULL) + theme_bw() + theme(panel.grid = element_line(color = "gray88", linewidth = 0.2)) # ggsave('rslt/_fig/map.png', width = 5, height = 4, units = 'in', dpi = 400) ``` ## DAYMET weather data Daily weather covariates (minimum and maximum air temperature) are sourced from the ORNL Daymet 1-km gridded product, fetched via API by site coordinate and covering 1980–2024. ```{r daymet-preview} daymet |> head() |> pander(caption = "Example rows from DAYMET data.") ``` # Exploratory Data Analysis ## Outlier detection Quality control is implemented in `analysis/processing/outliers.r` and proceeds in three stages. **1. Biological bounds.** Observations with $$T_b < -5°C$$ or $$T_b > 40°C$$ are removed as physiologically implausible. **2. Discontinuity removal.** Time gaps larger than 24 hours define segment boundaries within each snake's series. Segments shorter than 7 days are dropped, as they are too short to contribute reliable information and may reflect instrument artifacts or re-deployment gaps. **3. Rolling-window outlier detection.** For each observation a local mean (0.5-day window) and local SD (4-day window) are estimated using centered rolling statistics, with fallback to forward- or backward-looking windows near series endpoints. An observation is flagged if it deviates from the rolling mean by more than 6 SD *and* by more than 1°C. Flagged values are set to `NA`; rows are preserved. The code block below shows the core logic of each stage. ```{r outlier-code} #| eval: false #| echo: true n_days_mean <- 0.5 # rolling-mean window n_days_sd <- 4 # rolling-SD window n_sd <- 6 # outlier threshold # Stages 1-2: biological bounds + discontinuity removal tb_qc <- tb |> filter(tb > -5 & tb < 40) |> group_by(file) |> arrange(datetime) |> mutate( lagged = datetime - lag(datetime, default = first(datetime)), is_not_contiguous = cumsum(duration(24, "hours") < lagged) ) |> group_by(file, is_not_contiguous) |> mutate(chunk_length = max(last(datetime) - first(datetime))) |> filter(chunk_length > duration(7, "days")) |> ungroup() # Stage 3: per-file rolling mean and SD, then flag deviations # NB: outliers.r additionally uses forward/backward windows near series # endpoints; this shows the centered-window core for brevity. tb_aug <- tb_qc |> group_by(file) |> group_modify(function(tbdata, ...) { samp_freq <- median(diff(as.numeric(tbdata$datetime)), na.rm = TRUE) w_mean <- ceiling(as.numeric(duration(n_days_mean, "days")) / samp_freq) w_sd <- ceiling(as.numeric(duration(n_days_sd, "days")) / samp_freq) tbdata |> mutate( rmean = data.table::frollmean(tb, w_mean, align = "center", fill = NA), rsd = data.table::frollapply(tb, w_sd, FUN = sd, align = "center", fill = NA) ) }) |> ungroup() # flagged values set to NA; the full pipeline writes the result back to tbdata.RData tb_adj <- tb_aug |> mutate( tb = if_else((abs(tb - rmean) > n_sd * rsd) & (abs(tb - rmean) > 1), NA_real_, tb) ) |> select(file, datetime, tb) ``` Diagnostic plots showing a ±7-day window around each flagged observation are written to `analysis/processing/_diagnostics/_outliers/` for manual review. ## Active season identification Active season dates are inferred from internal body temperatures using changepoint detection on daily diurnal variance. The intuition is that thermoregulating snakes show greater daily temperature variance when active; variance is low and flat during inactivity. ```{r raw-season} raw_season <- tb |> left_join(meta_snake |> select(file, short.id, loc.id), join_by(file)) |> left_join(meta_site |> distinct(loc.id, locale), join_by(loc.id)) |> select(-file, -loc.id) ``` ::: callout-note Season classification is a standalone analysis. The output `data/processed/seasons.rds` is produced by `analysis/processing/season-identification.R` and is not currently consumed by the main modeling pipeline. ::: ### Diurnal variation Daily standard deviations in body temperature are computed per snake per day of year. Snakes with fewer than 30 days of data are excluded, and series are smoothed with a 7-day rolling mean. ```{r dvar-indiv} #| fig-align: center dvar_indiv <- raw_season |> group_by(short.id, locale, yr = year(datetime), day = yday(datetime)) |> summarize(tb.sd = stats::sd(tb, na.rm = TRUE), .groups = "drop") |> drop_na(tb.sd) |> mutate(fid = consecutive_id(short.id, locale, yr)) |> tf_nest(tb.sd, .id = "fid", .arg = "day") dvar_indiv |> ggplot(aes(y = tb.sd)) + geom_spaghetti(alpha = 0.6, linewidth = 0.2) + facet_wrap(~locale) + labs(y = "SD(temperature)", x = "Day of year") + ggthm ``` ```{r dvar-smooth} #| fig-align: center dvar_indiv |> mutate(tfe = tf_count(tb.sd)) |> unnest(tfe) |> filter(tfe > 30) |> mutate(tb.sd = tf_smooth(tb.sd, "rollmean", k = 7)) |> ggplot(aes(y = tb.sd)) + geom_spaghetti(alpha = 0.6, linewidth = 0.2) + facet_wrap(~locale) + labs(y = "SD(temperature)", x = "Day of year") + ggthm ``` ### Location averages Smoothed series are aggregated to locale-level pointwise medians to give the representative seasonal profile for each site. ```{r dvar-locale} #| fig-align: center dvar_indiv |> mutate(tfe = tf_count(tb.sd)) |> unnest(tfe) |> filter(tfe > 30) |> mutate(tb.sd = tf_smooth(tb.sd, "rollmean", k = 7)) |> group_by(locale) |> mutate(tb.sd = mean(tb.sd, depth = "pointwise", na.rm = TRUE)) |> ggplot(aes(y = tb.sd)) + geom_spaghetti() + facet_wrap(~locale) + labs(y = "Median SD(temperature)", x = "Day of year") + ggthm ``` ### Changepoint detection ```{r dvar-loc-store} dvar_loc <- dvar_indiv |> mutate(tfe = tf_count(tb.sd)) |> unnest(tfe) |> filter(tfe > 30) |> select(-tfe) |> mutate(tb.sd = tf_smooth(tb.sd, "rollmean", k = 7)) |> group_by(locale) |> mutate(tb.sd = mean(tb.sd, depth = "pointwise", na.rm = TRUE)) |> distinct(locale, tb.sd) |> ungroup() |> tf_unnest(tb.sd) |> rename(day = tb.sd_arg) |> nest(data = c(day, tb.sd_value)) ``` ```{r cp-fn} cp.fn <- function(.args, .vals, return = "means", q = 4) { cpt.out <- cpt.mean(.vals, method = "BinSeg", Q = q) cp.ix <- cpts(cpt.out) cp.x <- cpt.out@param.est$mean if (return == "means") { out <- cut(.args, breaks = .args[c(1, cp.ix, length(.args))], labels = round(cp.x, 5), right = FALSE, include.lowest = TRUE) |> as.character() |> as.numeric() } if (return == "points") { cp.delta <- diff(cp.x) out <- tibble( day = .args[cp.ix], delta = cp.delta, type = if_else(cp.delta > 0, "increase", "decrease") ) } return(out) } ``` Mean changepoints are detected via binary segmentation (up to $Q = 3$ segments). A threshold of 1.5°C on segment-level mean distinguishes active from inactive periods. ```{r changepts} #| fig-align: center changepts <- dvar_loc |> mutate(tb.sd_mean = map(data, ~cp.fn(.x$day, .x$tb.sd_value, q = 3))) |> unnest(everything()) changepts |> ggplot(aes(x = day, y = tb.sd_value)) + geom_path() + geom_path(aes(y = tb.sd_mean), color = "red", linewidth = 0.5, alpha = 0.5) + facet_wrap(~locale) + geom_hline(yintercept = 1.5, linetype = "dashed", linewidth = 0.25) + labs(y = "Median SD(temperature)", x = "Day of year") + ggthm ``` Comments: 1. In Lordsburg, a changepoint falls in the middle of the end-of-season transition. The 1.5°C threshold excludes the tapering period; reducing it to 1.25 would include it. 2. In Montana de Oro and Sedgwick Reserve, inferred start dates are unreliable due to sparse data (see below). ```{r mdo-sedg} #| fig-width: 8 #| fig-height: 3 #| fig-align: center #| fig-cap: "Inferred changepoints and individual-level diurnal variation for Montana de Oro and Sedgwick Reserve." cpt_means <- changepts |> filter(locale %in% c("Montana de Oro", "Sedgwick Reserve")) |> distinct(locale, day, tb.sd_mean) |> tf_nest(tb.sd_mean, .id = "locale", .arg = "day") dvar_indiv |> filter(locale %in% c("Montana de Oro", "Sedgwick Reserve")) |> mutate(tfe = tf_count(tb.sd)) |> unnest(tfe) |> filter(tfe > 30) |> mutate(tb.sd = tf_smooth(tb.sd, "rollmean", k = 7)) |> select(-tfe) |> ggplot(aes(y = tb.sd)) + geom_spaghetti(linewidth = 0.25, alpha = 0.5) + geom_spaghetti( aes(y = tb.sd_mean), data = cpt_means, color = "red", linewidth = 0.5, alpha = 0.5 ) + facet_wrap(~locale) + geom_hline(yintercept = 1.5, linetype = "dashed", linewidth = 0.25) + labs(y = "SD(temperature)", x = "Day of year") + ggthm ``` In Montana de Oro, data span 2017–2018 (mid-April to late February). A single snake was active for roughly one week in early February 2018; the inferred changepoint simply marks the start of data collection. In Sedgwick, data span 2015–2017 but with only one snake recorded in 2016; the changepoint reflects that individual's activity onset. ```{r mdo-sedg-dates} raw_season |> filter(locale %in% c("Montana de Oro", "Sedgwick Reserve")) |> group_by(locale, yr = year(datetime)) |> summarize( first.obs = min(datetime) |> date(), last.obs = max(datetime) |> date(), .groups = "drop" ) |> pander(caption = "Sampling dates for Montana de Oro and Sedgwick Reserve.") ``` ### Inferred active seasons Active season classification uses the changepoint-inferred segment means thresholded at 1.5°C (daily SD). Locales for which all data fell within the active season (Rodeo, McIntyre, Chimineas) are excluded from the table. ```{r seasons-classify} seasons <- dvar_loc |> mutate(tb.sd_mean = map(data, ~cp.fn(.x$day, .x$tb.sd_value, q = 3))) |> unnest(everything()) |> mutate(season = if_else(tb.sd_mean > 1.5, "active", "inactive")) |> select(locale, day, season) seasons |> mutate(lseason = lag(season)) |> drop_na() |> filter(season != lseason, day > 1) |> mutate( date = parse_date_time(day, orders = "j"), date = paste(month(date, label = TRUE, abbr = TRUE), mday(date)) ) |> select(locale, date, season) |> group_by(locale) |> add_count() |> filter(n > 1) |> select(-n) |> pivot_wider(names_from = season, values_from = date) |> rename(start = active, end = inactive, location = locale) |> pander(caption = "Inferred active season changepoints.") ``` # Statistical Modeling The analysis models how continuous body temperature profiles ($T_b$ as a function of time of day) vary with site-level environmental covariates across months, using a **pooled cyclic functional regression** fit across all months simultaneously. ## Notation and covariates | Symbol | Meaning | |--------|---------| | $$T_b(t)$$ | Body temperature at hour $$t \in [0, 24)$$ | | $$x$$ | Covariate (tmin, tmax, elev, or lat), z-scored globally | | $$m$$ | Month ($$1, \ldots, 12$$), treated as a continuous cyclic variable | | $$\ell$$ | Locale (field site) | | $$i$$ | Curve index — one snake-day | | $$\phi^{(1)}_k(t)$$ | Level-1 (between-locale) eigenfunction, PC $$k$$ | | $$\phi^{(2)}_k(t)$$ | Level-2 (within-locale) eigenfunction, PC $$k$$ | Four covariates are supported. Each is z-scored globally (across all locale-dates) so that a 1-SD shift is interpretable as a population-level contrast regardless of site. | Variable | Units | Contrast | Notes | |----------|-------|----------|-------| | `tmin` | °C | 5°C | — | | `tmax` | °C | 5°C | — | | `elev` | m | 100 m | Rattlesnake Butte excluded (extreme outlier) | | `lat` | degrees | 5° | — | ## Data preparation The package function `prep_tf_data()` loads the processed data files, joins snake and site metadata with Daymet weather, constructs functional data objects (one curve per snake-day on the 0–24h domain), and z-scores the four covariates globally. Sparse curves (fewer than 10 hourly observations) are retained at this stage and filtered downstream. ```{r prep-data} prepped <- prep_tf_data(data_dir = here("data/processed")) tf_df <- prepped$tf_df ``` ```{r overview-plot} #| fig-width: 12 #| fig-height: 9 #| fig-dpi: 300 #| fig-align: center #| message: false #| warning: false #| fig-cap: "Diurnal body temperature profiles across all sites and months. Each curve represents one snake on one calendar day. Red curves fall during the inferred inactive season for that locale." seasons_df <- read_rds(here("data/processed/seasons.rds")) locale_order <- meta_site |> distinct(locale, lat) |> arrange(lat) |> pull(locale) tf_df |> mutate( mo = month(date, label = TRUE, abbr = TRUE), day = yday(date), locale = factor(locale, levels = locale_order) ) |> left_join(seasons_df, join_by(locale, day)) |> filter(!is.na(season)) |> # drop sparse sites (fewer than 10 curves) that render as empty columns, # plus Indio Mountains Research Station (too few curves to be informative) add_count(locale) |> filter(n >= 10, locale != "Indio Mountains Research Station") |> mutate(locale = fct_drop(locale)) |> ggplot(aes(y = tb, color = season)) + # abbreviate site labels to the first word (before any space or hyphen) facet_grid(mo ~ locale, labeller = labeller(locale = ~str_extract(.x, "^[^ -]+"))) + geom_spaghetti(alpha = 0.25, linewidth = 0.1) + scale_x_continuous(breaks = seq(0, 24, by = 6)) + scale_color_manual(values = c("red", "black")) + labs(x = 'hour of day', y = 'internal body temperature') + guides(color = guide_legend( position = "top", override.aes = list(alpha = 1, linewidth = 0.5) )) + ggthm ggsave('rslt/_fig/tbdata.png', width = 10, height = 6, units = 'in', dpi = 400) ``` ::: callout-note Season classification is determined at the locale level, not individually. Some curves appear mismatched because individual snakes can be active or inactive outside the typical window for their site. ::: ## Model specification ### Pooled cyclic model (primary) A single model is fit per covariate across all 12 months. The cyclic cubic spline basis on month — with knots at $m = 0.5$ and $m = 12.5$ — ensures a smooth December → January transition. **Stage 1 — Initial BAM:** $$ T_b(t, m) = f_0(t) + g_0(m) + h_0(t, m) + x \cdot h_1(t, m) + \varepsilon(t, m) $$ - $f_0(t)$: `s(t, bs = "cc", k = 20)`, knots at $t = 0, 23$ - $g_0(m)$: `s(month, bs = "cc", k = 12)`, knots at $m = 0.5, 12.5$ - $h_0(t, m)$: `te(t, month, bs = c("cc", "cc"), k = c(12, 8))` - $h_1(t, m)$: `te(t, month, bs = c("cc", "cc"), k = c(12, 8), by = x)` Fit via `mgcv::bam()` with `method = "fREML"` and `discrete = TRUE`. The tensor product basis dimensions `k = c(12, 8)` were increased from `c(10, 6)` to reduce EDF saturation. **Stage 2 — Multilevel FPCA:** Stage 1 residuals are averaged to snake-level mean curves (one per individual, pooled across all months) and decomposed into between-locale and within-locale components: $$ \hat{\varepsilon}_{\ell i}(t) = \sum_{k=1}^{K_1} \xi_{\ell k}^{(1)} \phi_k^{(1)}(t) + \sum_{k=1}^{K_2} \xi_{\ell i, k}^{(2)} \phi_k^{(2)}(t) + e_{\ell i}(t) $$ Estimation uses `run_mfpca()` with $K_1 = K_2 = 1$ component retained at each level. Missing hourly values are linearly interpolated before MFPCA. **Stage 3 — Full BAM:** $$ T_b(t, m) = f_0(t) + g_0(m) + h_0(t, m) + x \cdot h_1(t, m) + \phi_1^{(1)}(t) \cdot b_\ell + \phi_1^{(2)}(t) \cdot b_{\ell i} $$ where $b_\ell \sim N(0, \sigma_1^2)$ (locale-level) and $b_{\ell i} \sim N(0, \sigma_2^2)$ (snake-level) are eigenfunction-loaded random intercepts. Fit with `select = TRUE` in addition to the Stage 1 options. **Stage 4 — Population predictions:** Population-level predictions set $b_\ell = b_{\ell i} = 0$: $$ \hat{T}_b(t, m \mid x) = \hat{f}_0(t) + \hat{g}_0(m) + \hat{h}_0(t, m) + x \cdot \hat{h}_1(t, m) $$ Confidence bands use the `lpmatrix` approach (random-effect columns zeroed in the prediction matrix). Predictions span: $t \in \{0, 0.1, \ldots, 23.9\}$, $m \in \{1, \ldots, 12\}$, and two contrast levels ($z = 0$ vs. $z = c / \hat{\sigma}_x$). The full implementation is in `analysis/modeling/fit-models.R` (local) and `analysis/modeling/tide/fit-pooled.R` (HPC entrypoint). Key model-fitting code: ```{r pooled-fit-code} #| eval: false #| echo: true prepped <- prep_tf_data(data_dir = "data/processed") gam_data <- prepped$tf_df |> filter(tf_count(tb) > 10, !is.na(tmin)) |> select(fid, short.id, locale, tb, date) |> mutate(month = lubridate::month(date)) |> left_join(prepped$scaled, by = join_by(locale, date, short.id)) |> tf_unnest(tb) |> rename(t = tb_arg, tb = tb_value) # Stage 1 fit_init <- bam( tb ~ s(t, bs = "cc", k = 20) + s(month, bs = "cc", k = 12) + te(t, month, bs = c("cc", "cc"), k = c(12, 8)) + te(t, month, bs = c("cc", "cc"), k = c(12, 8), by = tmin.z), data = gam_data, knots = list(t = c(0, 23), month = c(0.5, 12.5)), method = "fREML", discrete = TRUE ) # Stage 2 fit_mfpca <- run_mfpca(Y_fill, id = locale_vec, visit = visit_vec) # Stages 3–4: see analysis/modeling/fit-models.R for complete code ``` ## Results Pre-computed model outputs are stored as lightweight extracts (`_lite.rds`) in `analysis/rslt/model/`. The package function `reconstruct_predictions()` regenerates population prediction grids at any desired contrast value. ```{r result-settings} contrasts <- c(elev = 100, lat = 5, tmin = 5, tmax = 5) variables <- c("elev", "lat", "tmin", "tmax") ``` ```{r load-helpers} load_pooled_preds <- function(variables, contrasts, pooled_dir) { map(variables, function(v) { f <- file.path(pooled_dir, paste0("pooled_", v, "_lite.rds")) if (!file.exists(f)) return(NULL) l <- readRDS(f) reconstruct_predictions(l, contrast = contrasts[v]) |> select(t, fit, se, lwr, upr, month, any_of(v)) |> mutate(variable = v) }) |> bind_rows() |> pivot_longer( cols = any_of(variables), names_to = "var_name", values_to = "var_level", values_drop_na = TRUE ) |> select(-var_name) } plot_curves <- function(df, title = NULL) { ggplot( df, aes(x = t, y = fit, group = var_level, color = as.factor(round(var_level, 1)), fill = as.factor(round(var_level, 1))) ) + geom_line(linewidth = 0.3) + geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.1, color = NA) + facet_grid( variable ~ month, scales = "free_y", labeller = labeller(month = setNames(month.abb, 1:12)) ) + labs(x = "Hour", y = "Body Temperature (°C)", color = "Level", fill = "Level", title = title) + theme_bw() + theme(legend.position = "none", strip.text = element_text(size = 7)) } compute_diffs <- function(df) { df |> arrange(variable, month, t, var_level) |> group_by(variable, month, t) |> summarise(diff = last(fit) - first(fit), .groups = "drop") } plot_diffs <- function(diffs, title = NULL) { ggplot(diffs, aes(x = t, y = diff)) + geom_ribbon(aes(ymin = 0, ymax = diff), alpha = 0.3) + geom_line(linewidth = 0.3) + geom_hline(yintercept = 0, color = "red", linewidth = 0.3) + facet_grid( variable ~ month, scales = "free_y", labeller = labeller(month = setNames(month.abb, 1:12)) ) + labs(x = "Hour", y = "Temperature Difference (°C)", title = title) + theme_bw() + theme(strip.text = element_text(size = 7)) } plot_avg_diffs <- function(diffs, dt = 0.1, title = NULL) { area_df <- diffs |> group_by(variable, month) |> summarise(avg_diff = sum(abs(diff)) * dt / 24, .groups = "drop") ggplot(area_df, aes(x = month, y = avg_diff)) + geom_point(size = 2) + scale_x_continuous(breaks = 1:12, labels = month.abb) + facet_wrap(~variable, nrow = 1) + labs(x = "", y = "Avg. Hourly Temp. Difference (°C)", title = title) + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) } ``` ```{r load-preds} pooled_lite_files <- list.files(pooled_dir, pattern = "_lite\\.rds$", full.names = TRUE) pooled_exists <- length(pooled_lite_files) > 0 if (pooled_exists) all_pooled <- load_pooled_preds(variables, contrasts, pooled_dir) ``` ### Predicted curves Predicted body temperature curves are shown at the baseline covariate level ($z = 0$) and at the shifted level ($z = c / \hat{\sigma}_x$, corresponding to the contrasts in the table above). ```{r plot-pooled-curves} #| fig-width: 12 #| fig-height: 8 #| fig-align: center #| fig-cap: "Predicted body temperature curves at baseline and shifted covariate values." if (pooled_exists) plot_curves(all_pooled) ``` ### Covariate effects Pointwise differences between predicted curves (shifted minus baseline) reveal when during the day each covariate has the largest influence on body temperature. ```{r pooled-diffs} #| fig-width: 12 #| fig-height: 8 #| fig-align: center #| fig-cap: "Pointwise covariate effects by hour and month." if (pooled_exists) { pooled_diffs <- compute_diffs(all_pooled) plot_diffs(pooled_diffs) } ``` ### Effect sizes The area between predicted curves (integrated absolute pointwise difference, normalized to an average hourly value in °C) summarizes the magnitude of each covariate's effect per month. ```{r avg-diffs-pooled} #| fig-width: 10 #| fig-height: 4 #| fig-align: center if (pooled_exists) plot_avg_diffs(pooled_diffs) ``` ## Bootstrap inference Model-based confidence intervals from `vcov()` are anti-conservative because within-curve residual autocorrelation is approximately 0.8 at lag-1, violating the independence assumption embedded in the model's standard errors. The snake-level block bootstrap provides valid inference by resampling the set of snakes ($n = 131$) with replacement, preserving within-snake and within-curve dependence. For each bootstrap draw, the Stage-3 BAM is refit with Stages 1 and 2 held fixed, and the area between the two predicted population curves is computed per month: $$ A_m = \int_0^{24} \left| \hat{T}_b(t, m \mid z = c/\hat{\sigma}) - \hat{T}_b(t, m \mid z = 0) \right| \, dt $$ Both absolute and signed areas are computed; the annualized summary is the mean across all 12 months. Bootstrap inference is applied to pooled models only. Results are based on 1000 draws per variable (50 HPC chunks × 20 draws). See `analysis/posthoc/posthoc-spec.md` for script-level details and execution order. ```{r bootstrap-abs} #| fig-width: 10 #| fig-height: 4 #| fig-align: center #| fig-cap: "Pooled model — average hourly absolute effect size with 95% bootstrap CIs." bootstrap_file <- here("analysis/rslt/tbl/area_bootstrap.csv") annual_file <- here("analysis/rslt/tbl/area_bootstrap_annual.csv") contrast_labels <- c( elev = "Elevation (+100 m)", lat = "Latitude (+5\u00b0)", tmin = "Min Temp (+5\u00b0C)", tmax = "Max Temp (+5\u00b0C)" ) if (file.exists(bootstrap_file)) { boot_df <- read.csv(bootstrap_file) ggplot(boot_df, aes(x = month, y = area_abs / 24)) + geom_pointrange( aes(ymin = area_abs_lwr / 24, ymax = area_abs_upr / 24), size = 0.4, linewidth = 0.8 ) + scale_x_continuous(breaks = 1:12, labels = month.abb) + facet_wrap( ~variable, nrow = 1, labeller = labeller(variable = contrast_labels) ) + labs(x = "", y = "Avg. Hourly |Effect| (\u00b0C)") + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) } ``` ```{r bootstrap-signed} #| fig-width: 10 #| fig-height: 4 #| fig-align: center #| fig-cap: "Pooled model — signed average hourly effect with 95% bootstrap CIs." if (file.exists(bootstrap_file)) { ggplot(boot_df, aes(x = month, y = area_signed / 24)) + geom_pointrange( aes(ymin = area_signed_lwr / 24, ymax = area_signed_upr / 24), size = 0.4, linewidth = 0.8 ) + geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.3) + scale_x_continuous(breaks = 1:12, labels = month.abb) + facet_wrap( ~variable, nrow = 1, labeller = labeller(variable = contrast_labels) ) + labs(x = "", y = "Avg. Hourly Effect (\u00b0C)") + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) } ``` ```{r bootstrap-annual} #| fig-width: 5 #| fig-height: 4 #| fig-align: center #| fig-cap: "Annualized effect size comparison across covariates (95% bootstrap CI)." if (file.exists(annual_file)) { annual_df <- read.csv(annual_file) ggplot(annual_df, aes(x = reorder(variable, annual_abs), y = annual_abs / 24)) + geom_pointrange( aes(ymin = annual_abs_lwr / 24, ymax = annual_abs_upr / 24), size = 0.6, linewidth = 0.9 ) + labs(x = "", y = "Annualized Avg. Hourly |Effect| (\u00b0C)") + theme_bw() } ```