During our initial linear mixed effects models lecture I gave an example of mercury concentration in fish (walleye). We will revisit that example, and hopefully, the logic behind mixed effects models will be clearer!
While last time we looked at the example of Walleye in Lake Michigan, this time we will look at a completely different example. Let’s assume we are still working with Walleye (or any other fish species of your choice; you can even rename it something else if you wish). We are still interested in mercury concentration as a function of size. However, we are looking at a completely different water body. This water body is only 20 acres!
Because this water body is only 20 acres, we don’t expect any spatial heterogeneity, so we don’t have to worry too much about a spatial component. This makes our lives easier! We can simply set a net, catch the fish, and measure them without having to worry about mixed effects (for the time being).
Now, before we look at the data, we need to generate it. I don’t have a great dataset for this, so we will be simulating the data in this example and throughout the whole assignment!
This is OK. As we are simulating data, each run will be unique. This means each time you run a chunk of code it will give you a different result. Also, when you render your document, you will get a different result each time (each render runs each code chunk).
If you want to be able to replicate your example, you could set a seed with set.seed(). I won’t be setting a seed for my example.
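If you do want reproducibility, it is a single line at the top of your script (the number here is arbitrary):

set.seed(123)   # any integer works; with the same seed, every run gives the same "random" numbers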
If you are doing complex models, I always recommend you simulate some data (with known parameters) to test the models. This way, you know if the model is doing what you want it to do, and how close the estimates are to the real parameters! We will be doing that in this assignment.
The first step in our experiment is setting the net in the small-ish pond. The net should catch about 50 individuals. We will simulate our sampling using the following:
n <- rpois(1, 50)
n

[1] 50
Where n is the number of fish we got. You can run that line multiple times and see that we get a different number each time!
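For instance, you can draw the sample size a few times in one go to see that variability (a quick illustration, not part of the original sampling):

replicate(5, rpois(1, 50))   # five hypothetical net sets; each draw is a different catch size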
Now, let’s check the size of our fish! Our net won’t catch any individuals under 20cm and fish size is uniformly distributed (in this example), with the largest fish being 60 cm.
Let’s check the size of the fish we caught:
size <- runif(n, 20, 60)
print(size)
[1] 22.75967 43.66579 39.80321 24.78608 49.14263 58.56552 56.91650 23.55892
[9] 35.47709 34.03659 58.06951 55.21679 32.12110 55.10773 58.66152 36.95968
[17] 58.59402 43.31591 50.24252 40.66932 28.10174 51.56332 51.86310 47.37769
[25] 33.53715 35.96670 30.39047 54.40070 39.26885 34.38812 22.56905 58.59425
[33] 21.09162 27.07817 28.60000 52.94883 57.49429 38.12240 40.65217 25.59778
[41] 29.81557 56.46622 22.76075 20.25000 36.94917 33.11672 30.18007 49.91449
[49] 37.95815 26.46042
Finally, let’s simulate the mercury concentration. It is dependent on the size of the fish. Let’s remember the linear model equation.
\[ y \sim \beta_0+\beta_{1}x_i + \epsilon \]
Where,
\[ \epsilon \sim Normal(0,\sigma^2) \]
Let’s simulate the data for the mercury concentration.
We’ll use 0.5 as our \(\beta_0\), 0.018 as our \(\beta_1\), and 0.0064 as our variance (rnorm takes the standard deviation, so we use the square root: 0.08).
Let’s explore the data we obtained:
Hg <- 0.5 + 0.018*size + rnorm(n, 0, 0.08)
plot(Hg ~ size)
Looks good! Let’s make a data frame with all of our data:
df1 <- data.frame(size = size, Hg = Hg)
And let’s run a simple linear model:
model1 <- lm(Hg ~ size, data = df1)
summary(model1)
Call:
lm(formula = Hg ~ size, data = df1)
Residuals:
Min 1Q Median 3Q Max
-0.165705 -0.042941 0.000784 0.058587 0.204396
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5326945 0.0395647 13.46 <2e-16 ***
size 0.0171498 0.0009447 18.15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08237 on 48 degrees of freedom
Multiple R-squared: 0.8729, Adjusted R-squared: 0.8702
F-statistic: 329.5 on 1 and 48 DF, p-value: < 2.2e-16
Look at the summary of your model, and answer: was the model good at estimating \(\beta_0\) (intercept), \(\beta_1\) (size) and the residual standard error?
Be aware that the results will change each time you render; that is OK, you can leave your original answer here. Again, you can make the results reproducible if you want by setting the random seed with set.seed() and a number.
Finally, we can plot it:
library(ggplot2)
pred1 <- predict(model1, df1, interval = "c")
dfplot <- cbind(df1, pred1)

ggplot(data = dfplot, aes(x = size, y = Hg, ymin = lwr, ymax = upr)) +
  geom_point() +
  geom_line(aes(y = fit)) +
  geom_ribbon(alpha = 0.2) +
  theme_classic()
Let’s assume that we have three other ponds of the same size. So, we are going to do the same exact thing as we did in the first pond.
Let’s also assume that the relationship (the slope) between size and Hg concentration is the same at all of the sites. So, now, the equation would be:
\[ y \sim \beta_0+\beta_{1}x_{1i} + \beta_{2}x_{2i} +\beta_{3}x_{3i} +\beta_{4}x_{4i} + \epsilon \]
where,
\[ \epsilon \sim Normal(0,\sigma^2) \]
So, we have to come up with a value for \(\beta_2\), \(\beta_3\), and \(\beta_4\).
The way I am simulating this dataset is a bit unconventional. Usually you would come up with an equation for each population (or have a random function that selects the parameters). You would very rarely do it this way, but I am trying to follow the linear regression equations to simulate the data.
In this case, the values I am giving the betas are:
\(\beta_2\): 0.025
\(\beta_3\): 0.01
\(\beta_4\): 0.1
Remember, a \(\beta_j\) is the difference in the intercept between group j and group 1.
First, let’s create a vector with the betas:
beta <- c(0.025, 0.01, 0.1)
Then, let’s name our first pond region “A”:
df1$region <- "A"
And create a list where we will store all of our results:
HgDat <- list(site1 = df1)
Finally, we do what we did with site 1:
for(i in 1:3){
  n <- rpois(1, 50)
  size <- runif(n, 20, 60)
  Hg <- 0.5 + 0.018*size + rnorm(n, 0, 0.08) + beta[i]
  Region <- rep(LETTERS[i+1], n)
  HgDat[[i+1]] <- data.frame(size = size, Hg = Hg, region = Region)
}
We stored all the data as a list; let’s now combine it back into a single data frame:
HgDat_df <- dplyr::bind_rows(HgDat)
HgDat_df$region <- as.factor(HgDat_df$region)
head(HgDat_df)
size Hg region
1 22.75967 0.9504535 A
2 43.66579 1.2445096 A
3 39.80321 1.0681250 A
4 24.78608 1.0962214 A
5 49.14263 1.2863725 A
6 58.56552 1.5648987 A
Before we continue, I recommend you open the HgDat_df data frame and explore it.
Let’s now run the model:
model2 <- lm(Hg ~ size + region, data = HgDat_df)
summary(model2)
Call:
lm(formula = Hg ~ size + region, data = HgDat_df)
Residuals:
Min 1Q Median 3Q Max
-0.196308 -0.056289 0.002324 0.060099 0.200114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5178445 0.0224918 23.024 < 2e-16 ***
size 0.0175208 0.0004822 36.332 < 2e-16 ***
regionB 0.0154486 0.0160993 0.960 0.338
regionC 0.0085410 0.0160157 0.533 0.594
regionD 0.0916719 0.0156576 5.855 1.77e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08166 on 215 degrees of freedom
Multiple R-squared: 0.877, Adjusted R-squared: 0.8747
F-statistic: 383.2 on 4 and 215 DF, p-value: < 2.2e-16
and plot the data:
pred2 <- predict(model2, HgDat_df, interval = "c")
dfplot <- cbind(HgDat_df, pred2)

ggplot(data = dfplot, aes(x = size, y = Hg, ymin = lwr, ymax = upr, col = region, shape = region, fill = region)) +
  geom_point() +
  geom_line(aes(y = fit)) +
  geom_ribbon(alpha = 0.2) +
  theme_classic()
Look at the summary of your model, and answer: was the model good at estimating \(\beta_0\) (intercept), \(\beta_1\) (size), \(\beta_2\), \(\beta_3\), \(\beta_4\) and the residual standard error?
Run an ANOVA (in this case technically an ANCOVA, since there is a continuous covariate), and if region is significant, do a pairwise comparison.
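One way this could look (a sketch; it assumes the car and emmeans packages are installed, which are not used elsewhere in this document):

car::Anova(model2)                          # ANCOVA table: tests size and region
emm <- emmeans::emmeans(model2, ~ region)   # region means adjusted for size
pairs(emm)                                  # Tukey-adjusted pairwise comparisons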
We know (because we set the parameters) that every site is different. Is your pairwise comparison able to identify these differences among ALL groups? If it can’t, explain why you think it is failing at doing so.
Finally, run a different model with the same exact data where there is an interactive effect between size and region (AKA, the slope differs by region). Compare the AIC values of both models. Did AIC correctly choose the additive model as the “best model”?
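A minimal sketch of that comparison (model3 is just a placeholder name):

model3 <- lm(Hg ~ size * region, data = HgDat_df)   # interaction: region-specific slopes
AIC(model2, model3)                                  # lower AIC = better-supported model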
Let’s go back to our original Lake Michigan example.
Here, we know there is a spatial effect of where we set our nets on the amount of mercury (still, the slope is the same). We will be placing four nets, and we don’t want to bias our estimate by choosing where to place them.
Also, we don’t care which site has a higher concentration of mercury; we care about the concentration lake-wide and about the variance introduced by the spatial heterogeneity.
Also, we don’t want to estimate a \(\beta\) for every net; we simply want to estimate the variance introduced by the spatial component (we don’t want to estimate 99 \(\beta's\) if we are setting 100 nets!). So, we are fitting a mixed model (with a random intercept). As a reminder, this is the equation:
\[ Hg_{ij} \sim \underbrace{(\beta_0 +\underbrace{\gamma_j}_{\text{Random intercept}})}_{intercept} + \underbrace{\beta_1size_{i}}_{slope} +\underbrace{\epsilon}_\text{ind var} \]
where: \(\gamma_j \sim Normal(0,\sigma_\gamma)\) and \(\epsilon \sim Normal(0,\sigma)\).
Let’s say the variance introduced by the selection of site is 0.0625 (standard deviation of 0.25).
Then, we can create an object called \(\gamma\):
nsites <- 4
gamma <- rnorm(n = nsites, mean = 0, sd = 0.25)
Notice how I am not hand-picking the values (like we did with the \(\beta's\))? All I am providing is the standard deviation (think of this as the variability introduced by where you place the nets). Using rnorm (random normal), I am obtaining 4 random values that affect the intercept. This is why this is a random component of the intercept.
Each time you run that line, you will get different values, because it is a random process (different than when we had 4 sites!)
HgDatmixed <- list()
for(i in 1:4){
  n <- rpois(1, 50)
  size <- runif(n, 20, 60)
  Hg <- (0.5 + gamma[i]) + 0.018*size + rnorm(n, 0, 0.08)
  Net <- rep(LETTERS[i], n)
  HgDatmixed[[i]] <- data.frame(size = size, Hg = Hg, net = Net)
}
HgDatmixed <- dplyr::bind_rows(HgDatmixed)
HgDatmixed$net <- as.factor(HgDatmixed$net)
Now, let’s run the mixed model:
library(glmmTMB)
m3 <- glmmTMB(Hg ~ size + (1 | net), data = HgDatmixed)
summary(m3)
Family: gaussian ( identity )
Formula: Hg ~ size + (1 | net)
Data: HgDatmixed
AIC BIC logLik deviance df.resid
-438.2 -424.7 223.1 -446.2 211
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
net (Intercept) 0.093153 0.3052
Residual 0.006496 0.0806
Number of obs: 215, groups: net, 4
Dispersion estimate for gaussian family (sigma^2): 0.0065
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5008789 0.1538459 3.26 0.00113 **
size 0.0184115 0.0004701 39.17 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Look at the summary of your model, and answer: was the model good at estimating \(\beta_0\) (intercept), \(\beta_1\) (size), and \(\gamma\)?
Now, let’s plot it. To plot it, we need two steps. First, we need to plot the data with the random intercepts:
preddata3 <- HgDatmixed
preddata3$predHg <- predict(m3, HgDatmixed)

plot2 <- ggplot(data = HgDatmixed, aes(x = size, y = Hg, col = net, shape = net)) +
  geom_point() +
  geom_line(data = preddata3, aes(x = size, y = predHg, col = net)) +
  theme_classic()
plot2
However, we also want to estimate the relationship between size and Hg for a “typical” individual. To do so, we set all random effects to 0 (we only estimate the fixed effects). We can do so with:
preddata3$predHg_population <- predict(m3, preddata3, re.form = ~0)

See how we added re.form = ~0? That’s how we tell the function to predict the values given no random effects. We add this to our plot:
plot2 <- ggplot(data = HgDatmixed, aes(x = size, y = Hg, col = net, shape = net)) +
  geom_point() +
  geom_line(data = preddata3, aes(x = size, y = predHg, col = net)) +
  geom_line(data = preddata3, aes(x = size, y = predHg_population), col = 'black', linewidth = 1.5) +
  theme_classic()
plot2
Run all the code in the “back to the Great Lakes” section again (actually do it 2 or 3 more times, no need to re-write it, just run it again). You should see how the random intercepts change each time you run them. Why do you think that happens?
Repeat the “back to the Great Lakes” section again, but this time set 25 nets. Look at the summary of that new model, and answer: was the model good at estimating \(\beta_0\) (intercept), \(\beta_1\) (size), and \(\gamma\)?
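As a starting point, the 25-net version could look like this (a sketch reusing the code above; HgDatmixed25 and m25 are placeholder names):

nsites <- 25
gamma <- rnorm(n = nsites, mean = 0, sd = 0.25)   # 25 random intercept deviations
HgDatmixed25 <- list()
for(i in 1:nsites){
  n <- rpois(1, 50)
  size <- runif(n, 20, 60)
  Hg <- (0.5 + gamma[i]) + 0.018*size + rnorm(n, 0, 0.08)
  HgDatmixed25[[i]] <- data.frame(size = size, Hg = Hg, net = LETTERS[i])
}
HgDatmixed25 <- dplyr::bind_rows(HgDatmixed25)
HgDatmixed25$net <- as.factor(HgDatmixed25$net)
m25 <- glmmTMB(Hg ~ size + (1 | net), data = HgDatmixed25)
summary(m25)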
Finally, repeat that experiment, but this time there is no random intercept; instead, there is a random slope. Show the summary of the model.
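A possible sketch, following the random-slope equation in the notes at the end (the slope standard deviation of 0.005 is an arbitrary choice, and psi, HgDatslope, and m4 are placeholder names):

psi <- rnorm(n = 4, mean = 0, sd = 0.005)   # random slope deviations (assumed sd)
HgDatslope <- list()
for(i in 1:4){
  n <- rpois(1, 50)
  size <- runif(n, 20, 60)
  Hg <- 0.5 + (0.018 + psi[i])*size + rnorm(n, 0, 0.08)   # common intercept, net-specific slope
  HgDatslope[[i]] <- data.frame(size = size, Hg = Hg, net = LETTERS[i])
}
HgDatslope <- dplyr::bind_rows(HgDatslope)
HgDatslope$net <- as.factor(HgDatslope$net)
m4 <- glmmTMB(Hg ~ size + (0 + size | net), data = HgDatslope)   # random slope, no random intercept
summary(m4)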
We can use cross-validation to test whether our model is good. To run it, we need to run the model as a linear model (i.e., just fixed effects):
library(caret)
ctrl <- trainControl(method = 'cv', number = 10)

tr <- train(Hg ~ size + net, data = HgDatmixed, trControl = ctrl, method = "lm")
tr
Linear Regression
215 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 192, 193, 193, 194, 193, 194, ...
Resampling results:
RMSE Rsquared MAE
0.08061511 0.9533888 0.06475449
Tuning parameter 'intercept' was held constant at a value of TRUE
We obtain a value of RMSE. The good thing is that this value actually has meaning and can be interpreted! It is the root mean square error: it essentially measures the average difference between predicted and observed values, and it is in the same units as the response (Hg concentration here). In this case the average difference was about 0.08. Whether that is good or bad depends on your system, but you should have enough knowledge of your system to reach a conclusion!
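If you want to see where that number comes from, you can compute the (in-sample) RMSE by hand; it will be close to, though not identical to, the cross-validated value:

obs  <- HgDatmixed$Hg
pred <- predict(tr, HgDatmixed)   # predictions from the caret model
sqrt(mean((obs - pred)^2))        # root mean square error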
Notes:
\[ Hg_{ij} \sim \underbrace{\beta_0}_{intercept} + \underbrace{(\beta_1+\underbrace{\psi_j}_{\text{random slope}})size_{i}}_{slope} +\underbrace{\epsilon}_\text{ind var} \]
where: \(\psi_j \sim Normal(0,\sigma_\psi)\) and \(\epsilon \sim Normal(0,\sigma)\).