GAM
In this exercise we will explore Generalized Additive Models (GAMs),
which take a flexible approach to modelling behavior. Most behavioral
questions are not straightforward and rarely follow a linear pattern,
so understanding the relationships within a data set calls for a model
that can bend. A GAM does this by fitting a smooth function of each
predictor, here built from splines (we will get to those), and adding
those functions together to show what influences the response - in this
case, dolphin behavior.
Dolphins
The problem is to assess how the contributing factors below affect
$behav. We expect their effects to differ: some roughly linear, others
not.
Rows: 167
Columns: 11
$ speed <dbl> 0.9206887, 1.1729019, 1.1749055, 0.6720431,…
$ rr <dbl> 42.28394, 61.78797, 66.73499, 18.15142, 59.…
$ lin <dbl> 0.633153893, 0.665410017, 0.428164531, 0.62…
$ distance <dbl> 138.29062, 182.16778, 200.23402, 268.98642,…
$ timeper <dbl> 0.5, 0.5, 0.6, 0.2, 0.3, 0.3, 0.3, 0.5, 0.4…
$ cat <chr> "Mid", "Mid", "Tour", "None", "Small", "Sma…
$ grpsize <int> 3, 3, 2, 2, 2, 5, 6, 3, 4, 3, 1, 1, 1, 2, 1…
$ calf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ behav <chr> "Forage", "Rest", "Rest", "Forage", "Rest",…
$ count <int> 1, 2, 1, 1, 2, 1, 1, 2, 1, 3, 2, 1, 4, 1, 5…
$ id <int> 414334, 414334, 414336, 414341, 414341, 414…
How to Decide Which Variables Get Smooth Splines (s()):
Think of splines as bendable rulers that adjust to the shape
of the data. They are built using knots, which are points where the
curve can change direction smoothly. The most common types include
B-splines and thin plate regression splines, which help capture patterns
without overfitting.
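To make the "bendable ruler" picture concrete, here is a minimal sketch of a cubic B-spline basis built with splines::bs() from base R. The grid of x values, the knot positions and the weights are all made up for illustration; mgcv chooses its own basis and penalty when it fits s().

```r
library(splines)  # ships with base R

# Toy predictor on an evenly spaced grid (illustrative values only)
x <- seq(0, 10, length.out = 200)

# Cubic B-spline basis with three interior knots (assumed positions)
basis <- bs(x, knots = c(2.5, 5, 7.5), degree = 3, intercept = TRUE)

# One column per basis function: 3 interior knots + degree 3 + 1 = 7
dim(basis)

# Each row sums to 1, so a fitted curve is a weighted blend of local bumps
range(rowSums(basis))

# A curve is the basis multiplied by a coefficient vector (arbitrary weights)
weights <- c(0, 1, 3, 2, 4, 1, 0)
curve_y <- as.vector(basis %*% weights)
plot(x, curve_y, type = "l", main = "Curve built from a cubic B-spline basis")
```

Changing a single weight only moves the curve near the corresponding knots, which is why splines can follow local patterns without distorting the rest of the fit.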
In GAMs, splines are used to model nonlinear effects while
maintaining interpretability. They help reveal hidden trends in data
that a simple linear model might miss and are best applied to numeric
predictors where relationships with the target variable are expected to
be non-linear. Here’s a step-by-step guide to deciding where to use
them:
- Look at Variable Types:
  - Smooth terms (s()) are suited for continuous numeric predictors
    (e.g., $speed, $rr, $distance, $lin).
  - Avoid applying s() to categorical or integer variables like $cat
    or $grpsize.
- Consider the Nature of the Relationship: If you believe the
  relationship between a predictor and the target variable isn’t strictly
  linear, splines are ideal. For example:
  - $speed: Dolphin behavior may vary non-linearly with swimming speed.
  - $distance: Behavioral states could be influenced by a non-linear
    relationship with the distance traveled.
One way to check is to plot each predictor against the target variable.
If we see curves or irregular trends, use a spline.
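As a sketch of that check, the code below plots a 0/1 indicator for one behavior against a predictor and overlays a loess curve. The data are simulated so that the relationship is deliberately curved. The column is_forage and the simulated probabilities are assumptions for illustration, not values from the dolphin data.

```r
library(ggplot2)

set.seed(1)
n <- 300
# Simulated stand-in for the dolphin data (not the real observations):
# foraging is made most likely at mid-range speeds, so the truth is curved
toy <- data.frame(speed = runif(n, 0.5, 4))
p_forage <- plogis(2 - 2 * (toy$speed - 2)^2)
toy$is_forage <- rbinom(n, 1, p_forage)

# Jittered 0/1 outcomes plus a loess curve tracing the local share of foraging
p <- ggplot(toy, aes(x = speed, y = is_forage)) +
  geom_jitter(height = 0.05, width = 0, alpha = 0.3) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "speed", y = "Share of observations that are Forage") +
  theme_minimal()

print(p)
```

A hump or S-shape in the loess curve is the kind of irregular trend that argues for s(); a roughly straight line suggests a plain linear term may be enough.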
Exploratory Data Analysis (EDA):
To do this, we will review the underlying relationships through an
EDA, i.e., summary statistics, tables, and plots.
1. Basic summary statistics
Basic behaviour stats.
behav  | n  | proportion | speed    | rr       | lin       | distance |
FSB    | 54 | 0.3233533  | 3.476940 | 21.93659 | 0.7276897 | 172.0757 |
Social | 52 | 0.3113772  | 1.698714 | 49.99276 | 0.5195180 | 160.1517 |
Forage | 42 | 0.2514970  | 1.686381 | 55.56664 | 0.3921278 | 113.6056 |
Rest   | 14 | 0.0838323  | 1.828723 | 40.17311 | 0.5661032 | 175.6061 |
Travel | 5  | 0.0299401  | 2.897316 | 31.00841 | 0.5649046 | 147.2829 |
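The code behind this table is not shown, so the following is only a sketch of how a per-behavior summary of this shape could be produced with dplyr. It assumes the four numeric columns are group means, and it runs on a small simulated data frame (toy) rather than the real df.

```r
library(dplyr)

set.seed(42)
# Simulated stand-in with the same column names as the source data
toy <- tibble(
  behav    = sample(c("FSB", "Social", "Forage"), 60, replace = TRUE),
  speed    = runif(60, 0.5, 4),
  rr       = runif(60, 15, 70),
  lin      = runif(60, 0, 1),
  distance = runif(60, 50, 300)
)

behav_stats <- toy %>%
  group_by(behav) %>%
  summarise(
    n = n(),                                     # observations per behavior
    across(c(speed, rr, lin, distance), mean,    # assumed summary: the mean
           .names = "mean_{.col}"),
    .groups = "drop"
  ) %>%
  mutate(prop = n / sum(n)) %>%                  # share of all observations
  arrange(desc(n))

print(behav_stats)
```

Swapping mean for median in across() is a one-word change if a robust summary is preferred for the skewed variables.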
2. Distribution of behaviors (count and proportion)

3. Numeric variables by behavior
Note: this chunk did not render. It calls glue() without the glue
package attached; loading it with library(glue), or calling
glue::glue(), resolves the error.
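The plotting code for this step is not shown, so this is a hedged sketch of one way to draw a boxplot per numeric variable with a glue() title once the glue package is loaded. The toy data frame and the numeric_vars vector are illustrative assumptions. The ** markers from the original title are dropped because ggplot2 would print them literally without a markdown-aware theme.

```r
library(ggplot2)
library(glue)  # provides glue() for building the plot titles

set.seed(7)
# Simulated stand-in for the dolphin data (not the real observations)
toy <- data.frame(
  behav    = sample(c("Forage", "Rest", "Social"), 90, replace = TRUE),
  speed    = runif(90, 0.5, 4),
  rr       = runif(90, 15, 70),
  distance = runif(90, 50, 300)
)

numeric_vars <- c("speed", "rr", "distance")

# One boxplot per numeric variable, each titled via glue()
plots <- lapply(numeric_vars, function(var) {
  ggplot(toy, aes(x = behav, y = .data[[var]], fill = behav)) +
    geom_boxplot() +
    labs(title = glue("Distribution of {var} by Behavior"),
         x = "Behavior", y = var) +
    theme_minimal()
})
names(plots) <- numeric_vars

print(plots[["speed"]])
```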
4. Categorical variables by behavior
-> Group size

-> Calf presence

-> Category

5. Timeper (time period) analysis

6. Pairwise relationships colored by behavior
Smaller sample for better visualization if dataset is large

Dissecting the plots
Pairwise Relationships
- Speed vs. Respiration Rate (rr)
There’s a noticeable distinction between behavior types based on
speed and rr.
Foraging behavior occurs at lower speeds but higher rr, possibly
indicating exertion from hunting.
Resting shows moderate speed but lower rr, which aligns with reduced
movement.
- Distance vs. Speed
Non-linear trends emerge: dolphins covering greater distances tend to
maintain a consistent speed instead of fluctuating frequently.
Travel behavior seems distributed across mid-range distances and
speed levels.
- Group Size vs. Behavior
Larger groups show more frequent social interactions, with foraging
behaviors often happening in smaller groups.
Resting appears to occur at variable group sizes, suggesting it might
not be group-dependent.
- Presence of a Calf vs. Behavior
The presence of a calf does not appear to strongly influence behavior
in these plots, which agrees with the significance levels in the GAM
output further down.
Calf presence might be an independent factor rather than a primary
behavioral determinant.
7. Individual patterns (if IDs represent individuals)
# Only plot when there are few enough individuals for the axis to stay readable
if (length(unique(df$id)) < 20) {
  # Treat id as a discrete label rather than a continuous number
  ggplot(df, aes(x = factor(id), fill = behav)) +
    geom_bar() +
    labs(title = "Behavior Distribution by Individual ID",
         x = "Individual ID", y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
8. Count vs behavior
ggplot(df, aes(x = behav, y = count, fill = behav)) +
  geom_boxplot() +
  labs(title = "Count Distribution by Behavior",
       x = "Behavior", y = "Count") +
  theme_minimal()

IDs and count
From the distributions above, we now have the following understanding
of the relationships:
- Numeric Variables & Behavior
The boxplots suggest that behaviors vary based on numerical
predictors like speed, distance, and rr (possibly respiration rate).
Foraging behavior is associated with higher rr and lower speed
compared to other behaviors.
Resting tends to occur at moderate rr and speed levels.
- Categorical Variables & Behavior
Group size and calf presence seem to have some impact on behaviors,
with larger groups showing different behavioral distributions.
Key Takeaways
Behavioral states are influenced by both linear and non-linear
relationships.
Speed and respiration rate together differentiate active behaviors
from passive ones.
We can begin to sense what a predictive model is likely to reinforce:
group dynamics matter, whereas calf presence does not show a strong
influence on how the dolphins behave. Distance traveled also suggests a
patterned movement strategy rather than randomness.
GAM Analysis
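library(tidymodels)  # attaches rsample (initial_split, training, testing) and dplyr

set.seed(234)  # fixed seed so the 80/20 split is reproducible
data_split <- initial_split(df, prop = 0.8)

# The response must be a factor for modelling
train_data <- training(data_split) %>%
  mutate(
    behav = as.factor(behav)#,
    # grpsize = as.integer(grpsize),
    # calf = as.integer(calf)
  )

# Give the test set the same factor levels as the training set
test_data <- testing(data_split) %>%
  mutate(behav = factor(behav, levels = levels(train_data$behav)))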
For the Generalized Additive Model (GAM), we will use the mgcv
package, which integrates well with Tidymodels. Here we call mgcv
directly:
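library(mgcv)  # provides gam() and s()

# Smooth terms for the continuous predictors, linear terms for grpsize and calf
gam_formula <- behav ~ s(speed) + s(rr) + s(distance) + s(lin) + grpsize + calf

# With a factor response, binomial() treats the first level as "failure"
# and every other level as "success"
gam_model <- gam(
  gam_formula,
  data = train_data,
  family = binomial(),
  method = "REML"  # REML smoothness selection
)

summary(gam_model)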
Family: binomial
Link function: logit

Formula:
behav ~ s(speed) + s(rr) + s(distance) + s(lin) + grpsize + calf

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.1684     0.8176  -2.652    0.008 **
grpsize       1.1058     0.2580   4.286 1.82e-05 ***
calf         -0.7351     0.8319  -0.884    0.377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
              edf Ref.df Chi.sq p-value
s(speed)    1.000  1.000  0.047  0.8288
s(rr)       1.000  1.000  3.099  0.0783 .
s(distance) 5.268  6.400 13.398  0.0576 .
s(lin)      1.819  2.266  2.053  0.4391
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = 0.493   Deviance explained = 52%
-REML = 44.42   Scale est. = 1   n = 133
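Smooth Terms (Splines) in GAM Analysis
The edf column shows how much each smooth bends: an edf of 1 is a
straight line, and larger values mean more curvature. Distance is the
only clearly non-linear term (edf = 5.27), and it is marginal rather
than significant at the 5% level (p = 0.058).
Speed and rr both have edf = 1.000, so the model reduced them to
linear effects; rr is marginal (p = 0.078) and speed shows no
detectable effect (p = 0.829). The lin smooth is mildly curved
(edf = 1.82) but not significant (p = 0.439).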
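When reading the smooth-term table, the edf column is what tells us how curved each fitted smooth is. As a minimal sketch, the code below fits a GAM to simulated data with one truly linear and one truly wavy effect, then pulls the same table out of summary(). The variable names x_lin and x_curve and the simulated effects are assumptions for illustration only.

```r
library(mgcv)

set.seed(123)
n <- 1000
x_lin   <- runif(n)
x_curve <- runif(n)

# True effects on the logit scale: linear in x_lin, wavy in x_curve
eta <- 1.5 * x_lin + 2 * sin(2 * pi * x_curve)
toy <- data.frame(y = rbinom(n, 1, plogis(eta)), x_lin, x_curve)

fit <- gam(y ~ s(x_lin) + s(x_curve), data = toy,
           family = binomial(), method = "REML")

# Same table as "Approximate significance of smooth terms" in summary()
s_tab <- summary(fit)$s.table
print(round(s_tab, 3))

# edf close to 1: the smooth was penalized down to a straight line
# edf well above 1: the data support a curved effect
plot(fit, pages = 1, shade = TRUE)
```

The plots drawn by plot(fit) are usually the quickest way to confirm what the edf values suggest, since a term can be curved yet still non-significant.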
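Overall Model Performance
The explained deviance of 52% indicates a reasonable ability to
capture behavior variability with these predictors.
The significance levels suggest that group size has a strong effect
(p = 1.82e-05), while calf presence does not (p = 0.377).
Note that behav has five levels but the model uses family =
binomial(). With a factor response, binomial() treats the first factor
level as "failure" and all other levels as "success", so these results
describe one behavior versus all the others rather than the five
behaviors separately.
Based on this, behaviors appear influenced by both linear and
non-linear relationships, with distance showing a non-linear pattern
and respiration rate a roughly linear one.
We have now built a baseline model that covers the basics, along with
a BAU guide to using the Generalized Additive Model for linear and
non-linear relationships.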
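Because behav has more than two levels while the model uses family = binomial(), it is worth checking what the response actually becomes. R's binomial family documents that a factor response counts the first level as "failure" and every other level as "success". The sketch below checks this on simulated data by comparing a fit on a three-level factor with a fit on an explicit indicator. The data and the names fit_factor, fit_binary and not_first are illustrative assumptions.

```r
library(mgcv)

set.seed(99)
n <- 400
x <- runif(n)

# Simulated three-level response (illustrative, not the dolphin data):
# "Forage" becomes less likely as x grows; the rest is split between two labels
is_forage <- rbinom(n, 1, plogis(2 - 4 * x)) == 1
toy <- data.frame(
  x = x,
  behav = factor(
    ifelse(is_forage, "Forage", sample(c("Rest", "Social"), n, replace = TRUE)),
    levels = c("Forage", "Rest", "Social")
  )
)

# Fit 1: the multi-level factor passed straight to binomial()
fit_factor <- gam(behav ~ s(x), data = toy, family = binomial(), method = "REML")

# Fit 2: explicit indicator for "anything other than the first level"
toy$not_first <- as.integer(toy$behav != levels(toy$behav)[1])
fit_binary <- gam(not_first ~ s(x), data = toy, family = binomial(), method = "REML")

# Agreement means the factor response was collapsed to first level vs. the rest
same_fit <- all.equal(unname(fitted(fit_factor)), unname(fitted(fit_binary)),
                      tolerance = 1e-6)
print(same_fit)
```

Building the indicator explicitly, as in the second fit, makes it clear which behavior is being modelled. Modelling all five behaviors at once would need a multinomial approach instead.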
Other Thoughts & Interpretation
Looking deeper into calf presence and its influence on behavior, the
analysis suggests that calves do not significantly impact dolphin
behavior within this dataset. Here’s why:
Statistical Insights: The p-value for calf presence in the model is
0.377, indicating no significant effect on behavior.
Compare this to group size, which has a highly significant effect
(p-value = 1.82e-05), showing that larger groups strongly influence
behavior while the presence of a calf does not.
Observed Patterns: Although the GAM accounts for calf presence, the
lack of significance means it does not clearly differentiate behaviors
by whether a calf is present.
The boxplots and pairwise relationships likewise show no strong
behavioral shifts when calves are in the group.
Unlike group size, which influences social interactions and movement
dynamics, calves seem to be more passive participants in behavioral
shifts.
Dolphins might prioritize protective behavior over distinct
behavioral changes when calves are around, but this isn’t strongly
reflected in movement metrics like speed, respiration rate, or distance
traveled.
