I have started to believe that a good part of understanding the process of a particular statistical model involves being able to set up a simulation of it. Here we do a simulation of data expected from a fish gut microbiota pilot study that I was involved in. While we obtain abundance data on many different microbes in such studies, the first step is to understand what we can do with just one. This means this approach is applicable to each microbe separately and does not examine the data at community level.

In this simulation, there are 4 gut locations (A to D) examined from 10 fish (A to J).

The model is:

Response = beta1(Gut_location) + sigma1(Fish) + sigma2(Unexplained)

Setup of the data

 [1] GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D GutLoc A GutLoc B GutLoc C GutLoc D
Levels: GutLoc A GutLoc B GutLoc C GutLoc D
 [1] Fish A Fish A Fish A Fish A Fish B Fish B Fish B Fish B Fish C Fish C Fish C Fish C Fish D Fish D Fish D Fish D Fish E Fish E Fish E Fish E Fish F Fish F Fish F Fish F Fish G Fish G Fish G Fish G Fish H Fish H Fish H Fish H Fish I Fish I Fish I Fish I Fish J Fish J Fish J Fish J
Levels: Fish A Fish B Fish C Fish D Fish E Fish F Fish G Fish H Fish I Fish J

Define the (unknown) parameters

Part of the simulation is inputing the numbers that you would obtain from the experiment. It is a wild guess here, thus we just make up some numbers - we will assume that what ever we are counting generally decreases in abundance along the gut locations from 300 to 200, to 150 to 75 counts (probably should be using a GLM, we’ll do that later). Further, we will assume the variation among fish is 15 standard deviations. And lastly, the unexplained variation is 10 standard deviations.

 [1]   3.288521 -22.447747  33.798615  10.501867  -7.211492  34.114034  48.468133 -59.024576 -14.852853 -13.613817

Then create the deterministic part of the simulation. This basically sets the expected values for each gut location and fish (given above), for which we will simulate data around them.

Then simulate the data. There are 40 numbers to simulate (length of Guts vector)

       Guts   Fish   y
1  GutLoc A Fish A 308
2  GutLoc B Fish A 172
3  GutLoc C Fish A 201
4  GutLoc D Fish A  93
5  GutLoc A Fish B 267
6  GutLoc B Fish B 225
7  GutLoc C Fish B 195
8  GutLoc D Fish B -30
9  GutLoc A Fish C 313
10 GutLoc B Fish C 215
11 GutLoc C Fish C 177
12 GutLoc D Fish C  80
13 GutLoc A Fish D 260
14 GutLoc B Fish D 208
15 GutLoc C Fish D 186
16 GutLoc D Fish D  58
17 GutLoc A Fish E 264
18 GutLoc B Fish E 200
19 GutLoc C Fish E 109
20 GutLoc D Fish E  34
21 GutLoc A Fish F 337
22 GutLoc B Fish F 259
23 GutLoc C Fish F 150
24 GutLoc D Fish F 125
25 GutLoc A Fish G 324
26 GutLoc B Fish G 274
27 GutLoc C Fish G 187
28 GutLoc D Fish G  87
29 GutLoc A Fish H 231
30 GutLoc B Fish H  77
31 GutLoc C Fish H  55
32 GutLoc D Fish H  83
33 GutLoc A Fish I 280
34 GutLoc B Fish I 197
35 GutLoc C Fish I  99
36 GutLoc D Fish I  72
37 GutLoc A Fish J 277
38 GutLoc B Fish J 173
39 GutLoc C Fish J 116
40 GutLoc D Fish J  27

Examine the data

Plots! We can see our general trend along gut location, with variation associated with each fish and also unexplained variation. There are some negative values, but lets forget about that for the sake of the exercise.

Fit the model

We will fit a mixed model examining gut location as a fixed factor and fish is a blocking factor. There may be serial correation with the gut locations, but leave that for now. The residuals are OK, a bit wider at the beginning and narrower at the end.

A summary of the model shows:

Residual Std.Dev. = 36.30 (~ 35, which is about what we input)
Fish Std.Dev. = 28.05 (~ 25, which we input).

Thus the variation in counts varies about 28 units among fish, and about 36 units via stuff we did not measure. This is what we put into the model, so now we should have an understanding of these values in future.

We can examine the random effects of the fish.


                  2.5 %     97.5 %
.sig01         11.21148   50.07753
.sigma         27.27256   45.40158
(Intercept)   258.12044  314.07956
GutsGutLoc B -117.28106  -54.91894
GutsGutLoc C -169.78106 -107.41894
GutsGutLoc D -254.38106 -192.01894

We can examine the fixed effects and see for Gut location A (the Intercept), the estimate is 286 counts. It sequentially drops further as we examine the other locations with respect to A.

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ Guts + (1 | Fish)
   Data: fishDat

REML criterion at convergence: 381

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3672 -0.3626  0.1062  0.4612  1.7694 

Random effects:
 Groups   Name        Variance Std.Dev.
 Fish     (Intercept)  786.6   28.05   
 Residual             1318.0   36.30   
Number of obs: 40, groups:  Fish, 10

Fixed effects:
             Estimate Std. Error t value
(Intercept)    286.10      14.51  19.721
GutsGutLoc B   -86.10      16.24  -5.303
GutsGutLoc C  -138.60      16.24  -8.537
GutsGutLoc D  -223.20      16.24 -13.747

We can see a significant difference assocaited with the guts factor.

Analysis of Deviance Table (Type II Wald chisquare tests)

Response: y
      Chisq Df Pr(>Chisq)    
Guts 199.45  3  < 2.2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can then plot the model data. It is easy to use the effects package to obtain the data

      Guts   fit       se     lower     upper
1 GutLoc A 286.1 14.50712 256.67821 315.52179
2 GutLoc B 200.0 14.50712 170.57821 229.42179
3 GutLoc C 147.5 14.50712 118.07821 176.92179
4 GutLoc D  62.9 14.50712  33.47821  92.32179

We basically get our ‘deterministic’ part back with roughly the variance we expect given the fish and unexplained stuff. !