A GLM tutorial provided by Dr. T Caughlin here. These are just the follow along notes.
dogs <- read.delim("dogs.csv", sep=";")
table <- gt(dogs)%>%
opt_stylize(style = 6, color = "cyan")
opt_interactive(table, page_size_default = 5)
Predictors: TrainingDay, Temperature,
Leash, DogBehavior
Response: FoundFirstAttempt (binary)
mean(dogs$FoundFirstAttempt) #good job dogs!
## [1] 0.720524
Explore the correlations (NUMERICAL DATA ONLY in cor)
(also another plot for funsies)
#only pulling these columns out briefly to check them for fun things (correlations between covariates)
numericsOnly <- dogs[,c(9, 12, 20)]
cor(numericsOnly) #.39 is the highest and it is low enough that we can move on and assume there isnt high enough correlation between covariates to remove one.
## TrainingDay Temperature SearchArea
## TrainingDay 1.00000000 0.01699062 0.3911707
## Temperature 0.01699062 1.00000000 0.1615864
## SearchArea 0.39117067 0.16158640 1.0000000
columngrab <- dogs[,c(9, 12, 19, 27, 31)]
forggpairs <- dogs[,c(9, 12, 20, 27)]
ggpairs(forggpairs, aes(color=as.factor(FoundFirstAttempt)))
This figure shows a moderate positive linear correlation using Pearson correlations, between training day and search area. As training day increases, search area tends to increase, *** means statistically significant.
Because these covariates are all on different scales (Training Day 0-31, Temperature 1-40, and Search Area 5-900), we want to scale them by centering them around the mean so we can compare the effect sizes easily. Only the numeric covariates are rescaled.
#built function for rescaling quick
rescale=function(x) {return((x - mean(x, na.rm = TRUE)) / (2 * sd(x, na.rm = TRUE)))}
numericsOnly_std<-apply(numericsOnly,2,rescale)
head(numericsOnly_std)
## TrainingDay Temperature SearchArea
## [1,] -0.7392079 0.1231439 -0.6049086
## [2,] -0.7392079 0.1231439 -0.6049086
## [3,] -0.7392079 0.1231439 -0.5437738
## [4,] -0.7392079 0.1231439 -0.5437738
## [5,] -0.4777333 -0.7249340 -0.4826390
## [6,] -0.4777333 -0.7249340 -0.3603695
#let's add the columns we just created (the scaling for the numeric variables) back onto our og dataframe so we can use the scaled data easily.
colnames(numericsOnly_std)<-paste(colnames(numericsOnly_std), "_scaled", sep="") #add the tag scaled to the numerics
toadly_ready<-data.frame(numericsOnly_std,dogs)
In this tutorial, priors are not explicitely set. The
brm() automatically sets weak priors and once your model
runs you can see what it chose by running the code:
get_prior(m1)
COVARIATES
TrainingDay_scaled, Temperature_scaled,
Leash (category, not scaled), DogBehavior
(category, not scaled)
m1<-brm(FoundFirstAttempt~TrainingDay_scaled+Temperature_scaled+Leash+DogBehaviour,family="bernoulli",data=toadly_ready, refresh=0)
#get_prior(FoundFirstAttempt~TrainingDay_scaled+Temperature_scaled+Leash+DogBehaviour,family="bernoulli",data=toadly_ready)
For m1 the brm function is using:
Intercept prior: student_t(3, 0, 2.5)
(default)
Slope priors (b): (flat) for all covariates
flat means that all slope values are equally
possible.
summary(m1)
summary(m1)
## Family: bernoulli
## Links: mu = logit
## Formula: FoundFirstAttempt ~ TrainingDay_scaled + Temperature_scaled + Leash + DogBehaviour
## Data: toadly_ready (Number of observations: 229)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept 2.47 0.65 1.25 3.86 1.01 700
## TrainingDay_scaled 0.43 0.41 -0.35 1.25 1.01 1022
## Temperature_scaled -0.14 0.39 -0.92 0.59 1.00 1347
## Leashnoleash -0.61 0.64 -1.93 0.58 1.01 718
## Leashshortleash -1.60 0.76 -3.20 -0.15 1.01 683
## DogBehaviourdistracted -3.65 0.73 -5.23 -2.38 1.00 1253
## DogBehaviourfatigued -1.05 0.42 -1.92 -0.22 1.00 1168
## DogBehaviourunresponsive -668.35 571.18 -2201.27 -24.45 1.01 639
## Tail_ESS
## Intercept 1055
## TrainingDay_scaled 1527
## Temperature_scaled 1706
## Leashnoleash 1093
## Leashshortleash 1216
## DogBehaviourdistracted 1210
## DogBehaviourfatigued 1355
## DogBehaviourunresponsive 750
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
draws <- as.data.frame(m1)
m1_long <- draws %>%
select(b_TrainingDay_scaled, b_Temperature_scaled, b_Leashnoleash, b_Leashshortleash, b_DogBehaviourdistracted, b_DogBehaviourfatigued, b_Intercept) %>%
pivot_longer(everything(),
names_to = "parameter",
values_to = "value")
ggplot(m1_long, aes(x = value, y = parameter, fill = parameter)) +
stat_halfeye(.width = c(.8, .95), point_interval = median_qi) +
geom_vline(xintercept = 0, linetype = "dashed") +
scale_y_discrete(labels = c(
b_TrainingDay_scaled = "Training day",
b_Temperature_scaled = "Temperature",
b_Leashnoleash = "Leashed or not",
b_Leashshortleash = "Short leash",
b_DogBehaviourdistracted = "Distracted dogs",
b_DogBehaviourfatigued = "Fatigued dog",
b_Intercept = "Intercept")) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")+
scale_fill_manual(values=c("#440154", "#472D7B", "#3B528B", "#2C728E", "#21918C", "#5EC962", "#FDE725"))+
labs(x="Log-odds (posterior draws)")
The further away from 0 above, the stronger the effect.
You can have a strong effect with high uncertainty and weak effects with low uncertainty.
TrainingDay_scaledTHIS MEASURE WILL NOT TELL YOU EFFECT SIZE
We can evaluate how much of the posterior distribution lies above or below zero. For example, the Training Day coefficient is not entirely positive — some posterior draws fall below zero.
To quantify this, we calculate the proportion of posterior draws that are greater than zero:
length(which(draws$b_TrainingDay_scaled>0))/length(draws$b_TrainingDay_scaled)
## [1] 0.85975
84% of the posterior probability mass lies above zero.
TrainingDay_scaledHow big is the effect in real terms?
NOTE we used family=bernoulli in our
brm model because of the binary 0s and 1s in our y data.
This puts an automatic logit link into play, and the
inverse of this is plogis. So to look at the real terms of
the effect, lets undo our logit and place the data back
into an understandable scale.
td_min <- min(toadly_ready$TrainingDay_scaled) #using the minimum value of our training day data to determine what the effect is for training day at its most minimum observed value
td_max <- max(toadly_ready$TrainingDay_scaled)
min_td <- plogis(draws$b_Intercept+draws$b_TrainingDay_scaled*td_min)
max_td <- plogis(draws$b_Intercept+draws$b_TrainingDay_scaled*td_max)
effectSize_td <- data.frame(max_td-min_td)
median(effectSize_td$max_td...min_td)
## [1] 0.04074141
quantile(effectSize_td$max_td...min_td, c(0.05,0.5,0.95))
## 5% 50% 95%
## -0.03413471 0.04074141 0.12671065
ggplot(effectSize_td, aes(x=max_td...min_td))+
geom_histogram(fill="#2C728E", color="white", bins = 30)+
geom_vline(xintercept = 0, linetype = "dashed")+
geom_vline(xintercept = -0.037, color="pink")+
geom_vline(xintercept = 0.12, color="pink")+
geom_vline(xintercept=median(effectSize_td$max_td...min_td), color="black", linewidth=1.5)+
geom_text(x=-0.2, y= 500, label="Median effect = 4.2%")+
labs(title="Posterior distribution of training day effect on toad finding success probability", x="Effect size")+
theme_classic()
Here the thicker black line is the median effect size for dog training days, the pink lines are the 90% credible intervals, and the black dotted line is centered at 0.
DogBehaviorLooking at the effect plot we can see that dog behavior has the strongest effect overall.
WHICH DOGS ARE THE INTERCEPT HERE?
unique(toadly_ready$DogBehaviour)
## [1] "cooperative" "fatigued" "unresponsive" "distracted"
Answer: cooperative
Cooperative dogs are the intercept here, and everything else is being compared to them. Let’s look at how a distracted dog compares to a cooperative dog.
length(which(draws$b_DogBehaviourdistracted<0))/length(draws$b_DogBehaviourdistracted)
## [1] 1
100% of the posterior probability mass lies below zero. (we could likely assume this from the plot though). A nice scientific sentence for this is:
“There is a >99% probability that distracted dogs have lower success in finding toads.”
cooperative_dog <- plogis(draws$b_Intercept)
distracted_dog <- plogis(draws$b_Intercept+draws$b_DogBehaviourdistracted)
effect_distracted <-data.frame(cooperative_dog-distracted_dog)
median(effect_distracted$cooperative_dog...distracted_dog)
## [1] 0.6655203
quantile(effect_distracted$cooperative_dog...distracted_dog, c(0.05,0.5,0.95))
## 5% 50% 95%
## 0.3895936 0.6655203 0.8268731
ggplot(effect_distracted, aes(x=cooperative_dog...distracted_dog))+
geom_histogram(color="white", fill="#440154", bins = 30)+
geom_vline(xintercept = 0, linetype = "dashed")+
geom_vline(xintercept = 0.38, color="#2C728E")+
geom_vline(xintercept = 0.83, color="#2C728E")+
geom_vline(xintercept=median(effect_distracted$cooperative_dog...distracted_dog), color="#5EC962", linewidth = 1.5)+
geom_text(x=0.25, y= 200, label="Median effect = 66.5%")+
labs(title="Effect of distraction on probability of toad detection", x="Effect size")+
theme_classic()
Here the green line is the median effect size for dog training days, the blue lines are the 90% credible intervals, and the black dotted line is centered at 0.
Finally, we can get the \(R^2\) for the model fit:
bayes_R2(m1)
## Estimate Est.Error Q2.5 Q97.5
## R2 0.2641992 0.03465796 0.1893447 0.3233515
\(R^2\) is \(\approx\) 26%, which is pretty great for a dog behavior study.
See convergence to normal over large sample sizes:
curve(rgamma(x, 10, 3), from =0, to=100)
curve(dgamma(x, 10, 3), from =0, to=100)
#PRACTICE SOME CODES
# NEGATIVE BINOMIAL (size=0.1, mu=100) rnbinom()
draws <- rnbinom(1000, size=0.1, mu=100)
par(mfrow=c(1, 2))
hist(draws, breaks=10, col="pink")
#FOR LOOPS
empty <- rep("i", 100)
for(i in 1:100) {
empty[i]=mean(rnbinom(1000, size=0.1, mu=100))
}
hist(as.numeric(empty), col="tomato", main="Neg binom for loop")