Mixed effects models

Fish and mercury

During our initial lecture on linear mixed effects models, I gave an example of mercury concentration in fish (walleye). We will revisit that example, and hopefully the logic behind mixed effects models will become clearer!

While last time we looked at the example of walleye in Lake Michigan, this time we will look at a different scenario. Let’s assume we are still working with walleye (or any other fish species of your choice; you can even rename it if you wish). We are still interested in mercury concentration as a function of size. However, we are looking at a completely different water body. This water body is only 20 acres!

Because this water body is only 20 acres, we don’t think there is any spatial heterogeneity, so we don’t have to worry too much about a spatial component. This makes our lives easier! We can simply set a net, catch the fish, and measure them without having to worry about mixed effects (for the time being).

Now, before we look at the data, we need to generate it. I don’t have a great dataset for this, so we will be simulating the data in this study and in this whole assignment!

My data doesn’t look like yours!

This is OK. As we are simulating data, each run will be unique. This means each time you run a chunk of code it will give you a different result. Also, when you render your document, you will get a different result each time (each render runs each code chunk).

If you want to be able to replicate your example, you could set a seed with set.seed(). I won’t be setting a seed for my example.
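For example, here is a minimal sketch of how seeding works (the seed value 42 is arbitrary):

set.seed(42)  # fix the random number generator's state
rpois(1, 50)  # some random value
set.seed(42)  # reset the generator to the same state
rpois(1, 50)  # the exact same value again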

Simulating data

If you are doing complex models, I always recommend you simulate some data (with known parameters) to test the models. This way, you know if the model is doing what you want it to do, and how close the estimates are to the real parameters! We will be doing that in this assignment.

Fish sampling

The first step in our experiment is setting the net in the small-ish pond. The net should catch about 50 individuals. We will simulate our sampling using the following:

n<-rpois(1,50)
n
[1] 50

Where n is the number of fish we got. You can run that line multiple times and see that we get a different number each time!

Now, let’s check the size of our fish! Our net won’t catch any individuals under 20cm and fish size is uniformly distributed (in this example), with the largest fish being 60 cm.

Let’s check the size of the fish we caught:

size<-runif(n,20,60)
print(size)
 [1] 22.75967 43.66579 39.80321 24.78608 49.14263 58.56552 56.91650 23.55892
 [9] 35.47709 34.03659 58.06951 55.21679 32.12110 55.10773 58.66152 36.95968
[17] 58.59402 43.31591 50.24252 40.66932 28.10174 51.56332 51.86310 47.37769
[25] 33.53715 35.96670 30.39047 54.40070 39.26885 34.38812 22.56905 58.59425
[33] 21.09162 27.07817 28.60000 52.94883 57.49429 38.12240 40.65217 25.59778
[41] 29.81557 56.46622 22.76075 20.25000 36.94917 33.11672 30.18007 49.91449
[49] 37.95815 26.46042

Finally, let’s simulate the mercury concentration. It is dependent on the size of the fish. Let’s remember the linear model equation.

\[ y_i \sim \beta_0+\beta_{1}x_i + \epsilon \]

Where,

\[ \epsilon \sim Normal(0,\sigma^2) \]

Let’s simulate the data for the mercury concentration.

We’ll use 0.5 as our \(\beta_0\), 0.018 as our \(\beta_1\), and 0.0064 as our variance (rnorm() takes the standard deviation, so we will use the square root: 0.08).

Let’s explore the data we obtained:

Hg<-0.5 + 0.018*size + rnorm(n,0,0.08)  # beta0 + beta1*size + residual error
plot(Hg~size)

Looks good! Let’s make a data frame with all of our data:

df1<-data.frame(size=size,Hg=Hg)

And let’s run a simple linear model:

model1<-lm(Hg~size,data=df1)
summary(model1)

Call:
lm(formula = Hg ~ size, data = df1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.165705 -0.042941  0.000784  0.058587  0.204396 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.5326945  0.0395647   13.46   <2e-16 ***
size        0.0171498  0.0009447   18.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08237 on 48 degrees of freedom
Multiple R-squared:  0.8729,    Adjusted R-squared:  0.8702 
F-statistic: 329.5 on 1 and 48 DF,  p-value: < 2.2e-16

Assignment question 1

Look at the summary of your model, and answer: was the model good at estimating \(\beta_0\) (intercept), \(\beta_1\) (size) and the residual standard error?

Be aware that the results will change after you render; that is OK. You can leave your original answer here. Again, you can make the results reproducible if you want by setting a seed with set.seed() and a number of your choice.
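If you are unsure how to pull out the estimates for the comparison, one possible starting point is:

coef(model1)   # compare to the true beta0 = 0.5 and beta1 = 0.018
sigma(model1)  # compare to the true residual standard deviation of 0.08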

Finally, we can plot it:

library(ggplot2)
pred1<-predict(model1,df1,interval = "c")
dfplot<-cbind(df1,pred1)

ggplot(data=dfplot, aes(x=size,y=Hg,ymin=lwr,ymax=upr))+
  geom_point()+
  geom_line(aes(y=fit))+
  geom_ribbon(alpha=0.2)+
  theme_classic()

Multiple sites

Let’s assume that we have three other ponds of the same size. So, we are going to do the same exact thing as we did in the first pond.

Let’s also assume that the relationship (the slope) between size and Hg concentration is the same in all four ponds (the original plus the three new ones). So, now, the equation would be:

\[ y_i \sim \beta_0+\beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} +\beta_{4}x_{4i} + \epsilon \]

where,

\[ \epsilon \sim Normal(0,\sigma^2) \]

So, we have to come up with values for \(\beta_2\), \(\beta_3\), and \(\beta_4\). Here \(x_{2i}\), \(x_{3i}\), and \(x_{4i}\) are indicator (dummy) variables: \(x_{ji}\) equals 1 if observation \(i\) comes from pond \(j\) and 0 otherwise.

Simulating Data 2

The way I am simulating this dataset is a bit unconventional. Usually you would come up with an equation for each population (or have a random function that selects the parameters). You would very rarely do it this way, but I am trying to follow the linear regression equations to simulate the data.

In this case, the values I am giving the betas are:

\(\beta_2\): 0.025

\(\beta_3\): 0.01

\(\beta_4\): 0.1

Remember, a \(\beta_j\) is the difference in the intercept between group j and group 1.
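For example, with the values above, the implied intercept for each pond is:

# beta0 plus each pond's offset (pond A is the reference level, offset 0)
0.5 + c(A = 0, B = 0.025, C = 0.01, D = 0.1)
#     A     B     C     D 
# 0.500 0.525 0.510 0.600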

First, let’s create a vector with the betas:

beta<-c(0.025,0.01,0.1)

Then, let’s label our first pond as region “A”:

df1$region<-"A"

And create a list where we will store all of our results:

HgDat<-list(site1=df1)

Finally, we do what we did with site 1, for each of the three new ponds:

  1. “Set up the net” and “catch” our fish (n)
  2. Obtain the size with a uniform distribution
  3. Estimate Hg using the new beta
  4. Name the site
  5. Save it as a data frame

for(i in 1:3){
  n<-rpois(1,50)                                   # 1. catch the fish
  size<-runif(n,20,60)                             # 2. simulate sizes
  Hg<-0.5 + 0.018*size + rnorm(n,0,0.08)+beta[i]   # 3. add this pond's beta offset
  Region<-rep(LETTERS[i+1],n)                      # 4. name the site (B, C, D)
  HgDat[[i+1]]<-data.frame(size=size,Hg=Hg,region=Region)  # 5. store as a data frame
}

We stored all the data in a list; let’s now collapse it back into a single data frame:

HgDat_df<-dplyr::bind_rows(HgDat)
HgDat_df$region<-as.factor(HgDat_df$region)
head(HgDat_df)
      size        Hg region
1 22.75967 0.9504535      A
2 43.66579 1.2445096      A
3 39.80321 1.0681250      A
4 24.78608 1.0962214      A
5 49.14263 1.2863725      A
6 58.56552 1.5648987      A

Before we continue, I recommend you open the HgDat_df data frame and explore it.

Let’s now run the model:

model2<-lm(Hg~size+region,data=HgDat_df)
summary(model2)

Call:
lm(formula = Hg ~ size + region, data = HgDat_df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.196308 -0.056289  0.002324  0.060099  0.200114 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.5178445  0.0224918  23.024  < 2e-16 ***
size        0.0175208  0.0004822  36.332  < 2e-16 ***
regionB     0.0154486  0.0160993   0.960    0.338    
regionC     0.0085410  0.0160157   0.533    0.594    
regionD     0.0916719  0.0156576   5.855 1.77e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08166 on 215 degrees of freedom
Multiple R-squared:  0.877, Adjusted R-squared:  0.8747 
F-statistic: 383.2 on 4 and 215 DF,  p-value: < 2.2e-16

and plot the data:

pred2<-predict(model2,HgDat_df,interval = "c")
dfplot<-cbind(HgDat_df,pred2)

ggplot(data=dfplot, aes(x=size,y=Hg,ymin=lwr,ymax=upr,col=region,shape=region,fill=region))+
  geom_point()+
  geom_line(aes(y=fit))+
  geom_ribbon(alpha=0.2)+
  theme_classic()

Assignment question 2

Look at the summary of your model, and answer: was the model good at estimating \(\beta_0\) (intercept), \(\beta_1\) (size), \(\beta_2\), \(\beta_3\), \(\beta_4\) and the residual standard error?

Run an ANOVA (in this case technically an ANCOVA, as there is a continuous covariate), and if region is significant, do a pairwise comparison.

We know (because we set the parameters) that every site is different. Is your pairwise comparison able to identify these differences among ALL groups? If it can’t, explain why you think it is failing at doing so.

Finally, run a different model with the same exact data where there is an interactive effect between size and region (AKA, the slope differs among regions). Compare the AIC values of both models. Did AIC correctly choose the additive model as the “best model”?
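If you are unsure where to start, here is a sketch of one possible approach (the pairwise comparison uses the emmeans package, which I am assuming you have installed; model2int is my own name for the interaction model):

anova(model2)                                # sequential ANOVA/ANCOVA table: is region significant?
emmeans::emmeans(model2, pairwise ~ region)  # pairwise comparisons, adjusted for size at its mean

model2int <- lm(Hg ~ size * region, data = HgDat_df)  # slopes differ by region
AIC(model2, model2int)                                # lower AIC = preferred model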

Back to the Great Lakes

Let’s go back to our original Lake Michigan example.

Here, we know there is a spatial effect of where we set our nets on the amount of mercury (still, the slope is the same). We will be placing four nets, and we don’t want to bias our estimate by choosing where to place them.

Also, we don’t care what site has a higher concentration of mercury. We care about the concentration lake-wide and about the variance introduced by the spatial heterogeneity.

Also, we don’t want to estimate a \(\beta\) for every net, we simply want to estimate the variance introduced by the spatial component (we don’t want to estimate 99 \(\beta's\) if we are setting 100 nets!). So, we are fitting a mixed model (with a random intercept). As a reminder, this is the equation:

\[ Hg_{ij} \sim \underbrace{(\beta_0 +\underbrace{\gamma_j}_{\text{Random intercept}})}_{intercept} + \underbrace{\beta_1size_{i}}_{slope} +\underbrace{\epsilon}_\text{ind var} \]

where: \(\gamma_j \sim Normal(0,\sigma_\gamma)\) and \(\epsilon \sim Normal(0,\sigma)\).

Let’s say the variance introduced by the selection of site is 0.0625 (standard deviation of 0.25).

Then, we can create an object called \(\gamma\):

nsites<-4
gamma <- rnorm(n=nsites,mean=0,sd=0.25) 

Notice how I am not hand-picking the \(\gamma\) values? All I am providing is the standard deviation (think of this as the variability introduced by where you place the nets). Using rnorm() (random normal), I am obtaining 4 random values that affect the intercept. This is why this is a random component of the intercept.

Each time you run that line, you will get different values, because it is a random process (different from when we fixed the \(\beta\) values for the 4 sites!)

HgDatmixed<-list()
for(i in 1:4){
  n<-rpois(1,50)                                      # catch the fish in net i
  size<-runif(n,20,60)                                # simulate sizes
  Hg<- (0.5+gamma[i]) + 0.018*size + rnorm(n,0,0.08)  # random intercept: beta0 + gamma_i
  Net<-rep(LETTERS[i],n)                              # name the net (A-D)
  HgDatmixed[[i]]<-data.frame(size=size,Hg=Hg,net=Net)
}
HgDatmixed<-dplyr::bind_rows(HgDatmixed)
HgDatmixed$net<-as.factor(HgDatmixed$net)

Now, let’s run the mixed model:

library(glmmTMB)
m3<-glmmTMB(Hg~size +(1|net), data=HgDatmixed)

summary(m3)
 Family: gaussian  ( identity )
Formula:          Hg ~ size + (1 | net)
Data: HgDatmixed

     AIC      BIC   logLik deviance df.resid 
  -438.2   -424.7    223.1   -446.2      211 

Random effects:

Conditional model:
 Groups   Name        Variance Std.Dev.
 net      (Intercept) 0.093153 0.3052  
 Residual             0.006496 0.0806  
Number of obs: 215, groups:  net, 4

Dispersion estimate for gaussian family (sigma^2): 0.0065 

Conditional model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.5008789  0.1538459    3.26  0.00113 ** 
size        0.0184115  0.0004701   39.17  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Assignment question 3

Look at the summary of your model, and answer: was the model good at estimating \(\beta_0\) (intercept), \(\beta_1\) (size), and the random intercept standard deviation \(\sigma_\gamma\)?
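One way to pull out the pieces for the comparison (fixef(), VarCorr(), and ranef() are the standard glmmTMB accessor functions):

fixef(m3)    # fixed effects: compare to beta0 = 0.5 and beta1 = 0.018
VarCorr(m3)  # random intercept SD: compare to the true sigma_gamma = 0.25
ranef(m3)    # predicted gamma_j for each net: compare to the gamma vector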

Now, let’s plot it. To plot it, we need two steps. First, we need to plot the data with the random intercepts:

preddata3 <- HgDatmixed
preddata3$predHg <- predict(m3, HgDatmixed)

plot2<- ggplot(data=HgDatmixed, aes(x=size, y=Hg, col=net, shape=net)) +
    geom_point() +
    geom_line(data=preddata3, aes(x=size, y=predHg, col=net))+
   theme_classic()

plot2

However, we also want to estimate the relationship between size and Hg for a “typical” individual. To do so, we set all random effects to 0 (we only estimate the fixed effects). We can do so with:

preddata3$predHg_population <- predict(m3, preddata3, re.form=~0)

See how we added re.form=~0. That’s how we tell the function to predict the values given no random effects. We add this to our plot:

plot2<- ggplot(data=HgDatmixed, aes(x=size, y=Hg, col=net, shape=net)) +
    geom_point() +
    geom_line(data=preddata3, aes(x=size, y=predHg, col=net))+
    geom_line(data=preddata3, aes(x=size, y=predHg_population),
            col='black',linewidth=1.5) +
   theme_classic()

plot2

Assignment question 4

Run all the code in the “back to the Great Lakes” section again (actually do it 2 or 3 more times, no need to re-write it, just run it again). You should see how the random intercepts change each time you run them. Why do you think that happens?

Repeat the “back to the Great Lakes” section again, but this time you are setting 25 nets. Look at the summary of that new model, and answer: was the model good at estimating \(\beta_0\) (intercept), \(\beta_1\) (size), and \(\sigma_\gamma\)?

Finally, repeat that experiment, but this time there is no random intercept; instead, there is a random slope. Show the summary of the model (the corresponding equation is in the notes at the end of this document).
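As a starting point for the 25-net version, here is a minimal sketch (only nsites and the loop bound change from the code above; HgDatmixed25 and m3_25 are my own names):

nsites <- 25
gamma <- rnorm(n = nsites, mean = 0, sd = 0.25)
HgDatmixed25 <- list()
for(i in 1:nsites){
  n <- rpois(1, 50)
  size <- runif(n, 20, 60)
  Hg <- (0.5 + gamma[i]) + 0.018*size + rnorm(n, 0, 0.08)
  HgDatmixed25[[i]] <- data.frame(size = size, Hg = Hg, net = LETTERS[i])
}
HgDatmixed25 <- dplyr::bind_rows(HgDatmixed25)
HgDatmixed25$net <- as.factor(HgDatmixed25$net)
m3_25 <- glmmTMB(Hg ~ size + (1 | net), data = HgDatmixed25)
summary(m3_25)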

Model cross-validation

We can use cross-validation to test whether our model is good. To run it, we need to run the model as a linear model (i.e., just fixed effects):

library(caret)
ctrl <- trainControl(method= 'cv', number= 10)

tr <- train(Hg~size +net, data=HgDatmixed, trControl= ctrl,method="lm")

tr
Linear Regression 

215 samples
  2 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 192, 193, 193, 194, 193, 194, ... 
Resampling results:

  RMSE        Rsquared   MAE       
  0.08061511  0.9533888  0.06475449

Tuning parameter 'intercept' was held constant at a value of TRUE

We obtain a value of RMSE. The good thing is that this value actually has meaning and can be interpreted! RMSE is the root mean square error. It essentially measures the average difference between predicted and observed values, and it is in the same units as the response (Hg concentration in this case). Here the average difference was about 0.08. Whether that is good or bad depends on your system, but you should have enough knowledge of your system to reach a conclusion!
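As a sanity check on what RMSE means, you can compute the in-sample version by hand (model_cv is my own name; this differs slightly from the cross-validated RMSE above, which is averaged over held-out folds):

model_cv <- lm(Hg ~ size + net, data = HgDatmixed)
sqrt(mean((HgDatmixed$Hg - fitted(model_cv))^2))  # root mean square error, in Hg units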

Notes: the random-slope model from assignment question 4 corresponds to the following equation:

\[ Hg_{ij} \sim \underbrace{\beta_0}_{intercept} + \underbrace{(\beta_1+\underbrace{\psi_j}_{\text{random slope}})size_{i}}_{slope} +\underbrace{\epsilon}_\text{ind var} \]

where: \(\psi_j \sim Normal(0,\sigma_\psi)\) and \(\epsilon \sim Normal(0,\sigma)\).
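A possible simulation-and-fit sketch matching this equation (psi, HgDatslope, m4, and the slope SD of 0.005 are my own choices for illustration; glmmTMB’s (0 + size | net) term fits a random slope with no random intercept):

psi <- rnorm(n = 4, mean = 0, sd = 0.005)  # random slope deviations per net
HgDatslope <- list()
for(i in 1:4){
  n <- rpois(1, 50)
  size <- runif(n, 20, 60)
  Hg <- 0.5 + (0.018 + psi[i])*size + rnorm(n, 0, 0.08)  # slope varies by net
  HgDatslope[[i]] <- data.frame(size = size, Hg = Hg, net = LETTERS[i])
}
HgDatslope <- dplyr::bind_rows(HgDatslope)
HgDatslope$net <- as.factor(HgDatslope$net)
m4 <- glmmTMB(Hg ~ size + (0 + size | net), data = HgDatslope)
summary(m4)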