GAM

In this exercise we explore Generalized Additive Models (GAMs), which take a flexible approach to modelling behavior. Behavioral questions are rarely straightforward and seldom follow a linear pattern, so they call for a method that can capture complex relationships within a data set. A GAM does this by fitting smooth functions (splines, which we will get to) to individual predictors and adding their contributions together, which helps reveal what influences a pattern; in this case, dolphin behavior.

Dolphins

The task is to review the impact of the contributing factors on $behav. We expect the relationships to differ across predictors: some roughly linear, others not.

Rows: 167
Columns: 11
$ speed    <dbl> 0.9206887, 1.1729019, 1.1749055, 0.6720431,…
$ rr       <dbl> 42.28394, 61.78797, 66.73499, 18.15142, 59.…
$ lin      <dbl> 0.633153893, 0.665410017, 0.428164531, 0.62…
$ distance <dbl> 138.29062, 182.16778, 200.23402, 268.98642,…
$ timeper  <dbl> 0.5, 0.5, 0.6, 0.2, 0.3, 0.3, 0.3, 0.5, 0.4…
$ cat      <chr> "Mid", "Mid", "Tour", "None", "Small", "Sma…
$ grpsize  <int> 3, 3, 2, 2, 2, 5, 6, 3, 4, 3, 1, 1, 1, 2, 1…
$ calf     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ behav    <chr> "Forage", "Rest", "Rest", "Forage", "Rest",…
$ count    <int> 1, 2, 1, 1, 2, 1, 1, 2, 1, 3, 2, 1, 4, 1, 5…
$ id       <int> 414334, 414334, 414336, 414341, 414341, 414…

How to Decide Which Variables to Apply Smooth Splines (s()):

Think of splines as bendable rulers that adjust to the shape of the data. They are typically built from simple basis functions joined at knots, the points where the curve is allowed to change shape smoothly. Common types include B-splines and thin plate regression splines (the default in mgcv); it is the smoothness penalty applied to them that helps capture patterns without overfitting.
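To make the idea of basis size concrete, here is a minimal sketch on simulated data (the `toy` data frame and its columns are hypothetical, not the dolphin data). It fits the same thin plate smooth with two different basis dimensions `k` and reports how many effective degrees of freedom each fit ends up using.

```r
library(mgcv)

set.seed(1)
# Hypothetical data: a wiggly signal plus noise (not the dolphin data)
n <- 200
toy <- data.frame(x = runif(n))
toy$y <- sin(2 * pi * toy$x) + rnorm(n, sd = 0.3)

# k sets the basis dimension, an upper limit on how wiggly the curve can be;
# the REML-estimated penalty then decides how much of that flexibility is used
fit_small <- gam(y ~ s(x, bs = "tp", k = 4),  data = toy, method = "REML")
fit_large <- gam(y ~ s(x, bs = "tp", k = 20), data = toy, method = "REML")

# Effective degrees of freedom used by each smooth (intercept dropped)
edf_small <- sum(fit_small$edf[-1])
edf_large <- sum(fit_large$edf[-1])
c(k4 = edf_small, k20 = edf_large)
```

The smooth can never use more than k - 1 degrees of freedom, so `k` mainly needs to be large enough not to cap the curve; the penalty handles the rest.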

In GAMs, splines are used to model nonlinear effects while maintaining interpretability. They help reveal trends that a simple linear model might miss, and they are best applied to numeric predictors whose relationship with the target variable is expected to be non-linear. Here is a step-by-step guide to deciding where to use them:

  1. Look at Variable Types:
  • Smooth terms (s()) are suited for continuous numeric predictors (e.g., $speed, $rr, $distance, $lin).
  • Avoid applying s() to categorical variables like $cat, or to integer variables with only a handful of distinct values like $grpsize and $calf; enter these as ordinary parametric terms instead.
  2. Consider the Nature of the Relationship: If you believe the relationship between a predictor and the target variable isn’t strictly linear, splines are ideal. For example:
  • $speed: Dolphin behavior may vary non-linearly with swimming speed.
  • $distance: Behavioral states could be influenced by a non-linear relationship with the distance traveled.

One way to check is to plot each predictor against the target variable. If the plot shows curves or irregular trends, use a spline.
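The first step above can be sketched as a small screening helper. Everything here is an assumption for illustration: `smooth_candidates()` and its `min_unique` threshold are hypothetical names, and the `toy` data frame only mimics the shape of the dolphin columns. The threshold of 10 is chosen because s() uses a default basis dimension of k = 10, which needs at least that many distinct values.

```r
# Hypothetical helper: flag columns that are plausible candidates for s()
smooth_candidates <- function(data, min_unique = 10) {
  is_candidate <- vapply(data, function(col) {
    # continuous (double) columns with enough distinct values to support a smooth
    is.double(col) && length(unique(col)) >= min_unique
  }, logical(1))
  names(data)[is_candidate]
}

set.seed(1)
# Toy data frame shaped like the dolphin columns (values are made up)
toy <- data.frame(
  speed   = runif(30, 0.5, 4),
  rr      = runif(30, 15, 70),
  timeper = sample(c(0.2, 0.3, 0.4, 0.5, 0.6), 30, replace = TRUE),
  cat     = sample(c("None", "Small", "Mid", "Tour"), 30, replace = TRUE),
  grpsize = sample(1:6, 30, replace = TRUE),
  calf    = rbinom(30, 1, 0.2)
)

smooth_candidates(toy)
```

A screen like this only narrows the list; whether a flagged variable actually needs a smooth is still a judgment made from the plots.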

Exploratory Data Analysis (EDA):

To do this we will review the underlying relationships through an EDA: summary statistics, tables, and plots.

1. Basic summary statistics
Basic behaviour stats.
behav   count  proportion  avg_speed    avg_rr    avg_lin  avg_distance
FSB        54   0.3233533   3.476940  21.93659  0.7276897      172.0757
Social     52   0.3113772   1.698714  49.99276  0.5195180      160.1517
Forage     42   0.2514970   1.686381  55.56664  0.3921278      113.6056
Rest       14   0.0838323   1.828723  40.17311  0.5661032      175.6061
Travel      5   0.0299401   2.897316  31.00841  0.5649046      147.2829
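The code behind this table is not shown, so the following is only a sketch of how a per-behavior summary of this shape could be computed, written in base R. The helper name `behaviour_summary()` and the tiny `toy` data frame are hypothetical, and only two of the averaged columns are included to keep it short.

```r
# Hypothetical helper: counts, proportions and per-behavior means
behaviour_summary <- function(data) {
  counts <- table(data$behav)
  out <- data.frame(
    behav      = names(counts),
    count      = as.integer(counts),
    proportion = as.numeric(counts) / nrow(data)
  )
  # mean of each numeric column within each behavior
  means <- aggregate(cbind(speed, rr) ~ behav, data = data, FUN = mean)
  names(means) <- c("behav", "avg_speed", "avg_rr")
  out <- merge(out, means, by = "behav")
  # most frequent behavior first
  out[order(-out$count), ]
}

# Made-up rows for illustration only
toy <- data.frame(
  behav = c("Forage", "Rest", "Rest", "Forage", "Rest", "Social"),
  speed = c(1, 2, 3, 2, 1, 4),
  rr    = c(50, 40, 42, 60, 38, 45)
)

res <- behaviour_summary(toy)
res
```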
2. Distribution of behaviors (count and proportion)

3. Numeric variables by behavior
4. Categorical variables by behavior

-> Group size

-> Calf presence

-> Category

5. Timeper (time period) analysis

6. Pairwise relationships colored by behavior

Smaller sample for better visualization if dataset is large

Dissecting the plots

Pairwise Relationships

  1. Speed vs. Respiration Rate (rr)

There’s a noticeable distinction between behavior types based on speed and rr.

Foraging behavior occurs at lower speeds but higher rr, possibly indicating exertion from hunting.

Resting shows moderate speed and a moderate rr (lower than foraging), which aligns with reduced movement.

  2. Distance vs. Speed

Non-linear trends emerge: dolphins covering greater distances tend to maintain a consistent speed instead of fluctuating frequently.

Travel behavior seems distributed across mid-range distances and speed levels.

  3. Group Size vs. Behavior

Larger groups show more frequent social interactions, with foraging behaviors often happening in smaller groups.

Resting appears to occur at variable group sizes, suggesting it might not be group-dependent.

  4. Presence of a Calf vs. Behavior

The presence of a calf does not appear to strongly influence behavior in these plots, and the model fitted later in this exercise points the same way (p = 0.377 for calf).

Calf presence might be an independent factor rather than a primary behavioral determinant.

7. Individual patterns (if IDs represent individuals)
if(length(unique(df$id)) < 20) {  # Only plot if not too many individuals
  ggplot(df, aes(x = factor(id), fill = behav)) +  # treat id as a label, not a number
    geom_bar() +
    labs(title = "Behavior Distribution by Individual ID",
         x = "Individual ID", y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
8. Count vs behavior
ggplot(df, aes(x = behav, y = count, fill = behav)) +
  geom_boxplot() +
  labs(title = "Count Distribution by Behavior",
       x = "Behavior", y = "Count") +
  theme_minimal()

IDs and count

From the distributions above, we now have the following understanding of the relationships:

  1. Numeric Variables & Behavior

The boxplots suggest that behaviors vary with numerical predictors such as speed, distance, and rr (respiration rate).

Foraging behavior is associated with higher rr and lower speed compared to other behaviors.

Resting tends to occur at moderate rr and speed levels.

  2. Categorical Variables & Behavior

Group size seems to have some impact on behaviors, with larger groups showing different behavioral distributions; any effect of calf presence is much less clear.

Key Takeaways

Behavioral states are influenced by both linear and non-linear relationships.

Speed and respiration rate interact in a way that differentiates active behaviors from passive ones.

We can begin to anticipate what a predictive model will show: group dynamics matter, whereas calf presence does not appear to have a strong influence on how the dolphins behave. Distance traveled also suggests a patterned movement strategy rather than randomness.

GAM Analysis

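library(dplyr)    # mutate(), %>%
library(rsample)  # initial_split(), training(), testing()

set.seed(234)
# Convert the response to a factor before splitting so the
# training and test sets share the same factor levels
df <- df %>% mutate(behav = as.factor(behav))
data_split <- initial_split(df, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)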

For the Generalized Additive Model (GAM), we will use the mgcv package, which can also be used from Tidymodels.

Here we call the mgcv library directly:

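library(mgcv)

gam_formula <- behav ~ s(speed) + s(rr) + s(distance) + s(lin) + grpsize + calf

# Note: given a factor response, binomial() treats the first factor level
# as 0 and every other level as 1, so this is a two-class model
gam_model <- gam(
  gam_formula,
  data = train_data,
  family = binomial(),
  method = "REML"
)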

summary(gam_model)

Family: binomial 
Link function: logit 

Formula:
behav ~ s(speed) + s(rr) + s(distance) + s(lin) + grpsize + calf

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.1684     0.8176  -2.652    0.008 ** 
grpsize       1.1058     0.2580   4.286 1.82e-05 ***
calf         -0.7351     0.8319  -0.884    0.377    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
              edf Ref.df Chi.sq p-value  
s(speed)    1.000  1.000  0.047  0.8288  
s(rr)       1.000  1.000  3.099  0.0783 .
s(distance) 5.268  6.400 13.398  0.0576 .
s(lin)      1.819  2.266  2.053  0.4391  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.493   Deviance explained =   52%
-REML =  44.42  Scale est. = 1         n = 133
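One point worth checking when reading this output is how the response was coded. The call above passes a multi-level factor to family = binomial(), and R's binomial family treats a factor response as "first level" versus "any other level". The sketch below demonstrates this on simulated data (the `toy` data frame and the `not_first` column are hypothetical): a fit on the factor and a fit on an explicit 0/1 recoding give the same fitted values.

```r
library(mgcv)

set.seed(42)
# Hypothetical three-level response (not the dolphin data)
n <- 300
toy <- data.frame(
  x = runif(n),
  y = factor(sample(c("Forage", "Rest", "Social"), n, replace = TRUE))
)

# Fit 1: factor response passed straight to binomial()
fit_factor <- gam(y ~ s(x), data = toy, family = binomial(), method = "REML")

# Fit 2: explicit 0/1 response, 1 = "anything other than the first level"
toy$not_first <- as.integer(toy$y != levels(toy$y)[1])
fit_binary <- gam(not_first ~ s(x), data = toy, family = binomial(), method = "REML")

# The two fits coincide
same_fit <- all.equal(unname(fitted(fit_factor)), unname(fitted(fit_binary)),
                      tolerance = 1e-6)
same_fit
```

For the dolphin model this means the coefficients describe the odds of a behavior other than the first level of behav, so it is worth running levels(train_data$behav) to see which behavior that is.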

Smooth Terms (Splines) in GAM Analysis

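The Generalized Additive Model (GAM) suggests that some relationships may be non-linear, most clearly for distance: its smooth uses about 5.3 effective degrees of freedom (edf), although its p-value of 0.0576 falls just short of the conventional 0.05 threshold. The smooth for rr has an edf of 1.000, meaning it was penalized down to a straight line, and it is also only marginal (p = 0.0783).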

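Speed shows no sign of a non-linear effect (edf = 1.000), and with p = 0.8288 there is little evidence that it has any effect at all. The smooth for lin is similarly weak (edf = 1.819, p = 0.4391).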
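The edf column is what indicates how curved each fitted smooth is: a value near 1 means the term was shrunk to a straight line, while larger values mean more wiggliness. As a minimal sketch on simulated data (the `toy` data frame and the column names `x_lin` and `x_curve` are hypothetical), the smooth table can be pulled out of the summary and inspected directly.

```r
library(mgcv)

set.seed(7)
# Hypothetical data: one truly linear effect and one curved effect
n <- 400
toy <- data.frame(x_lin = runif(n), x_curve = runif(n))
toy$y <- 2 * toy$x_lin + sin(2 * pi * toy$x_curve) + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x_lin) + s(x_curve), data = toy, method = "REML")

# One row per smooth term: edf, reference df, test statistic and p-value
smooth_table <- summary(fit)$s.table
smooth_table[, c("edf", "p-value")]
```

The same extraction on the dolphin model would be summary(gam_model)$s.table, which returns the "Approximate significance of smooth terms" block as a matrix.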

Overall Model Performance

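The explained deviance of 52% indicates a reasonable fit on the training data. Keep in mind what is being explained: because family = binomial() was given a five-level factor, the model contrasts the first level of behav (most likely Forage, under the usual alphabetical ordering) with all other behaviors combined, rather than separating all five.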

The significance levels suggest that group size has a strong effect, while calf presence does not.

Based on this, behaviors appear influenced by both linear and non-linear relationships, with distance showing the clearest non-linear pattern and respiration rate entering roughly linearly.
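Because behav has more than two levels, one way to model all behaviors at once is a multinomial GAM. mgcv provides a multinom() family for this, which expects the response coded as integers 0, 1, ..., K and a list of K formulas, one per non-reference class. The sketch below uses simulated three-class data (all names and coefficients are hypothetical) rather than the dolphin data.

```r
library(mgcv)

set.seed(6)
# Hypothetical three-class data (classes coded 0, 1, 2)
n <- 500
x1 <- runif(n)
x2 <- runif(n)
eta1 <- 2 * sin(3 * pi * x1) - 0.5   # linear predictor for class 1
eta2 <- 2 * x2 - 1                   # linear predictor for class 2
p <- exp(cbind(0, eta1, eta2))
p <- p / rowSums(p)                  # class probabilities per row
y <- apply(p, 1, function(pr) sample(0:2, 1, prob = pr))
toy <- data.frame(y = y, x1 = x1, x2 = x2)

# K = 2 non-reference classes, so the formula list has two entries
fit_multi <- gam(
  list(y ~ s(x1) + s(x2), ~ s(x1) + s(x2)),
  family = multinom(K = 2),
  data = toy,
  method = "REML"
)

# One column of predicted probabilities per class
probs <- predict(fit_multi, type = "response")
predicted_class <- max.col(probs) - 1
table(observed = toy$y, predicted = predicted_class)
```

For the five dolphin behaviors this would mean K = 4, a response such as as.integer(behav) - 1, and four formulas. With only 133 training rows and just 5 Travel observations in the full data, such a model may be hard to estimate, so treat this as a direction to explore rather than a drop-in replacement.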

We have now built a baseline model and a basic, business-as-usual (BAU) guide to using the Generalized Additive Model for linear and non-linear relationships.
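The fit statistics above are computed on the training rows, and the held-out test set has not been used yet. As a hedged sketch of what that check could look like, here is a two-class example on simulated data (the `toy` data frame, the 0.5 cutoff and the variable names are all assumptions for illustration).

```r
library(mgcv)

set.seed(234)
# Hypothetical binary outcome with a curved dependence on x
n <- 300
toy <- data.frame(x = runif(n))
toy$y <- rbinom(n, 1, plogis(3 * sin(2 * pi * toy$x)))

# 80/20 split done by hand with base R
train_idx <- sample(n, size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- gam(y ~ s(x), data = train, family = binomial(), method = "REML")

# Predicted probabilities on rows the model has not seen
test_prob <- as.numeric(predict(fit, newdata = test, type = "response"))
test_pred <- as.integer(test_prob > 0.5)   # assumed classification cutoff
accuracy  <- mean(test_pred == test$y)
accuracy
```

Accuracy is only one possible summary; with unbalanced classes a confusion table is usually more informative.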

Other Thoughts & Interpretation

Looking deeper into calf presence and its influence on behavior, the analysis suggests that calves do not significantly impact dolphin behavior within this dataset. Here’s why:

Statistical Insights: The p-value for calf presence in the model is 0.377, indicating a non-significant effect on behavior.

Compare this to group size, which has a highly significant effect (p-value = 1.82e-05): group size is strongly associated with behavior, but the presence of a calf is not.

Observed Patterns: While the GAM did account for calf presence, the lack of significance means the model does not clearly differentiate behaviors by whether a calf is present.

The boxplots and pairwise relationships are consistent with this, showing no strong behavioral shifts when calves are in the group.

Unlike group size, which influences social interactions and movement dynamics, calves seem to be more passive participants in behavioral shifts.

Dolphins might prioritize protective behavior over distinct behavioral changes when calves are around, but this isn’t strongly reflected in movement metrics like speed, respiration rate, or distance traveled.

