Multiple Covariates

A GLM tutorial provided by Dr. T Caughlin here. These are just the follow along notes.

🐶 Dawgs & Toads 🐸

1. Load Data

dogs <- read.delim("dogs.csv", sep=";")


table <- gt(dogs)%>%
  opt_stylize(style = 6, color = "cyan")



opt_interactive(table, page_size_default = 5)



2. Define Variables

Predictors: TrainingDay, Temperature, Leash, DogBehavior

Response: FoundFirstAttempt (binary)

mean(dogs$FoundFirstAttempt) #good job dogs!
## [1] 0.720524

A. Explore

Explore the correlations (NUMERICAL DATA ONLY in cor) (also another plot for funsies)

#only pulling these columns out briefly to check them for fun things (correlations between covariates)

numericsOnly <- dogs[,c(9, 12, 20)]

cor(numericsOnly) #.39 is the highest and it is low enough that we can move on and assume there isnt high enough correlation between covariates to remove one.
##             TrainingDay Temperature SearchArea
## TrainingDay  1.00000000  0.01699062  0.3911707
## Temperature  0.01699062  1.00000000  0.1615864
## SearchArea   0.39117067  0.16158640  1.0000000
columngrab <- dogs[,c(9, 12, 19, 27, 31)]

forggpairs <- dogs[,c(9, 12, 20, 27)]


ggpairs(forggpairs, aes(color=as.factor(FoundFirstAttempt)))
This figure shows a moderate positive linear correlation using Pearson correlations, between training day and search area. As training day increases, search area tends to increase, *** means statistically significant.

This figure shows a moderate positive linear correlation using Pearson correlations, between training day and search area. As training day increases, search area tends to increase, *** means statistically significant.


Because these covariates are all on different scales (Training Day 0-31, Temperature 1-40, and Search Area 5-900), we want to scale them by centering them around the mean so we can compare the effect sizes easily. Only the numeric covariates are rescaled.

B. Scale the covariates

#built function for rescaling quick

rescale=function(x) {return((x - mean(x, na.rm = TRUE)) / (2 * sd(x, na.rm = TRUE)))}

numericsOnly_std<-apply(numericsOnly,2,rescale)

head(numericsOnly_std)
##      TrainingDay Temperature SearchArea
## [1,]  -0.7392079   0.1231439 -0.6049086
## [2,]  -0.7392079   0.1231439 -0.6049086
## [3,]  -0.7392079   0.1231439 -0.5437738
## [4,]  -0.7392079   0.1231439 -0.5437738
## [5,]  -0.4777333  -0.7249340 -0.4826390
## [6,]  -0.4777333  -0.7249340 -0.3603695
#let's add the columns we just created (the scaling for the numeric variables) back onto our og dataframe so we can use the scaled data easily. 

colnames(numericsOnly_std)<-paste(colnames(numericsOnly_std), "_scaled", sep="") #add the tag scaled to the numerics

toadly_ready<-data.frame(numericsOnly_std,dogs)



3. Set Priors

In this tutorial, priors are not explicitely set. The brm() automatically sets weak priors and once your model runs you can see what it chose by running the code: get_prior(m1)



4. Fit Bayesian Model GLM

COVARIATES

TrainingDay_scaled, Temperature_scaled, Leash (category, not scaled), DogBehavior (category, not scaled)

m1<-brm(FoundFirstAttempt~TrainingDay_scaled+Temperature_scaled+Leash+DogBehaviour,family="bernoulli",data=toadly_ready, refresh=0)


#get_prior(FoundFirstAttempt~TrainingDay_scaled+Temperature_scaled+Leash+DogBehaviour,family="bernoulli",data=toadly_ready)

For m1 the brm function is using:

  • Intercept prior: student_t(3, 0, 2.5) (default)

  • Slope priors (b): (flat) for all covariates

flat means that all slope values are equally possible.



5. Model Exploration

summary(m1)

summary(m1)
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: FoundFirstAttempt ~ TrainingDay_scaled + Temperature_scaled + Leash + DogBehaviour 
##    Data: toadly_ready (Number of observations: 229) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Regression Coefficients:
##                          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept                    2.47      0.65     1.25     3.86 1.01      700
## TrainingDay_scaled           0.43      0.41    -0.35     1.25 1.01     1022
## Temperature_scaled          -0.14      0.39    -0.92     0.59 1.00     1347
## Leashnoleash                -0.61      0.64    -1.93     0.58 1.01      718
## Leashshortleash             -1.60      0.76    -3.20    -0.15 1.01      683
## DogBehaviourdistracted      -3.65      0.73    -5.23    -2.38 1.00     1253
## DogBehaviourfatigued        -1.05      0.42    -1.92    -0.22 1.00     1168
## DogBehaviourunresponsive  -668.35    571.18 -2201.27   -24.45 1.01      639
##                          Tail_ESS
## Intercept                    1055
## TrainingDay_scaled           1527
## Temperature_scaled           1706
## Leashnoleash                 1093
## Leashshortleash              1216
## DogBehaviourdistracted       1210
## DogBehaviourfatigued         1355
## DogBehaviourunresponsive      750
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
draws <- as.data.frame(m1)


m1_long <- draws %>%
  select(b_TrainingDay_scaled, b_Temperature_scaled, b_Leashnoleash, b_Leashshortleash, b_DogBehaviourdistracted, b_DogBehaviourfatigued, b_Intercept) %>%
  pivot_longer(everything(),
               names_to = "parameter",
               values_to = "value")

ggplot(m1_long, aes(x = value, y = parameter, fill = parameter)) +
  stat_halfeye(.width = c(.8, .95), point_interval = median_qi) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  scale_y_discrete(labels = c(
    b_TrainingDay_scaled = "Training day",
    b_Temperature_scaled = "Temperature", 
    b_Leashnoleash = "Leashed or not", 
    b_Leashshortleash = "Short leash", 
    b_DogBehaviourdistracted = "Distracted dogs", 
    b_DogBehaviourfatigued = "Fatigued dog",
    b_Intercept  = "Intercept")) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")+
  scale_fill_manual(values=c("#440154", "#472D7B", "#3B528B", "#2C728E", "#21918C", "#5EC962", "#FDE725"))+
  labs(x="Log-odds (posterior draws)")

The further away from 0 above, the stronger the effect.

You can have a strong effect with high uncertainty and weak effects with low uncertainty.



6. Ecological Inference

A. Probability of direction for TrainingDay_scaled

THIS MEASURE WILL NOT TELL YOU EFFECT SIZE

We can evaluate how much of the posterior distribution lies above or below zero. For example, the Training Day coefficient is not entirely positive — some posterior draws fall below zero.

To quantify this, we calculate the proportion of posterior draws that are greater than zero:

length(which(draws$b_TrainingDay_scaled>0))/length(draws$b_TrainingDay_scaled)
## [1] 0.85975

84% of the posterior probability mass lies above zero.

B. Effect size for TrainingDay_scaled

How big is the effect in real terms?

NOTE we used family=bernoulli in our brm model because of the binary 0s and 1s in our y data. This puts an automatic logit link into play, and the inverse of this is plogis. So to look at the real terms of the effect, lets undo our logit and place the data back into an understandable scale.

td_min <- min(toadly_ready$TrainingDay_scaled) #using the minimum value of our training day data to determine what the effect is for training day at its most minimum observed value

td_max <- max(toadly_ready$TrainingDay_scaled)

min_td <- plogis(draws$b_Intercept+draws$b_TrainingDay_scaled*td_min) 

max_td <- plogis(draws$b_Intercept+draws$b_TrainingDay_scaled*td_max)

effectSize_td <- data.frame(max_td-min_td)

median(effectSize_td$max_td...min_td)
## [1] 0.04074141
quantile(effectSize_td$max_td...min_td, c(0.05,0.5,0.95))
##          5%         50%         95% 
## -0.03413471  0.04074141  0.12671065
ggplot(effectSize_td, aes(x=max_td...min_td))+
  geom_histogram(fill="#2C728E", color="white", bins = 30)+
  geom_vline(xintercept = 0, linetype = "dashed")+
  geom_vline(xintercept = -0.037, color="pink")+
  geom_vline(xintercept = 0.12, color="pink")+
  geom_vline(xintercept=median(effectSize_td$max_td...min_td), color="black", linewidth=1.5)+
  geom_text(x=-0.2, y= 500, label="Median effect = 4.2%")+
  labs(title="Posterior distribution of training day effect on toad finding success probability", x="Effect size")+
  theme_classic()
Here the thicker black line is the median effect size for dog training days, the pink lines are the 90% credible intervals, and the black dotted line is centered at 0.

Here the thicker black line is the median effect size for dog training days, the pink lines are the 90% credible intervals, and the black dotted line is centered at 0.

C. Effect size for the more exciting DogBehavior

Looking at the effect plot we can see that dog behavior has the strongest effect overall.

WHICH DOGS ARE THE INTERCEPT HERE?

unique(toadly_ready$DogBehaviour)
## [1] "cooperative"  "fatigued"     "unresponsive" "distracted"

Answer: cooperative

Cooperative dogs are the intercept here, and everything else is being compared to them. Let’s look at how a distracted dog compares to a cooperative dog.

C1. DIRECTION
length(which(draws$b_DogBehaviourdistracted<0))/length(draws$b_DogBehaviourdistracted)
## [1] 1

100% of the posterior probability mass lies below zero. (we could likely assume this from the plot though). A nice scientific sentence for this is:

“There is a >99% probability that distracted dogs have lower success in finding toads.”

C2. EFFECT SIZE
cooperative_dog <- plogis(draws$b_Intercept)

distracted_dog <- plogis(draws$b_Intercept+draws$b_DogBehaviourdistracted)

effect_distracted <-data.frame(cooperative_dog-distracted_dog)

median(effect_distracted$cooperative_dog...distracted_dog)
## [1] 0.6655203
quantile(effect_distracted$cooperative_dog...distracted_dog, c(0.05,0.5,0.95))
##        5%       50%       95% 
## 0.3895936 0.6655203 0.8268731
ggplot(effect_distracted, aes(x=cooperative_dog...distracted_dog))+
  geom_histogram(color="white", fill="#440154", bins = 30)+
  geom_vline(xintercept = 0, linetype = "dashed")+
  geom_vline(xintercept = 0.38, color="#2C728E")+
  geom_vline(xintercept = 0.83, color="#2C728E")+
  geom_vline(xintercept=median(effect_distracted$cooperative_dog...distracted_dog), color="#5EC962", linewidth = 1.5)+
  geom_text(x=0.25, y= 200, label="Median effect = 66.5%")+
  labs(title="Effect of distraction on probability of toad detection", x="Effect size")+
  theme_classic()
Here the green line is the median effect size for dog training days, the blue lines are the 90% credible intervals, and the black dotted line is centered at 0.

Here the green line is the median effect size for dog training days, the blue lines are the 90% credible intervals, and the black dotted line is centered at 0.

D. \(R^2\)

Finally, we can get the \(R^2\) for the model fit:

bayes_R2(m1)
##     Estimate  Est.Error      Q2.5     Q97.5
## R2 0.2641992 0.03465796 0.1893447 0.3233515

\(R^2\) is \(\approx\) 26%, which is pretty great for a dog behavior study.

NOTES

See convergence to normal over large sample sizes:

curve(rgamma(x, 10, 3), from =0, to=100)

curve(dgamma(x, 10, 3), from =0, to=100)

#PRACTICE SOME CODES

# NEGATIVE BINOMIAL (size=0.1, mu=100) rnbinom()


draws <- rnbinom(1000, size=0.1, mu=100)

par(mfrow=c(1, 2))

hist(draws, breaks=10, col="pink")

#FOR LOOPS

empty <- rep("i", 100)

for(i in 1:100) {
  empty[i]=mean(rnbinom(1000, size=0.1, mu=100))
}

hist(as.numeric(empty), col="tomato", main="Neg binom for loop")