Here we’re going to try and simulate Mirjams experiment. This experiment tests the effect of different fertilizer treatments on sweet potato yields. The experiment is set up as such:

• The outcome is the weight of sweet potato crops per mound.
• The treatment is four different types of fertilizer.
• The unit of analysis is the mound, which the sweet potato plant is planted in.
• There is nine mounds within each plot.
• There are four plots within each garden.
• There are ten gardens.
• The treatment is applied at the plot level, meaning nine mounts receive one treatment in each of the ten gardens.

Here we’re trying to understand the implications of the treatment being applied at the plot level, rather than the mound level (the unit of analysis).

The analysis is going to model the effect of treatment on yield, with two random effects - plots (1) nested in gardens (2). We’re assuming dependencies between mounts with the same plot and the same garden. The model looks like this:

$y_i = \alpha + T x_{pi} + G_{gpi} + P_{pi} + \epsilon_i$ Where $$y_i$$ is the yield per $$i$$th mound, $$\alpha$$ is the true mean, $$T$$ is the treatment effect per observation ($$x_{pi}$$), $$G$$ is the effect of garden, $$P$$ is the effect of plot, and $$\epsilon$$ is the residual error.

So, in our simulation we can specify the ‘true’ values of each parameter, then run our model a bunch of times to see if the model conforms to the ‘true’ values. Let’s specify those ‘true’ values:

• The true mean ($$\alpha$$) will be 0.
• Treatment is actually a categorical variable, which means in the model it’s actually a set of three dummy binary variables. We’ll call the four treatments “a”, “b”, “c”, and “d”. “a” will be the reference level - for ease we’ll say this has a coefficient of 0. We’ll then say that “b”, “c”, and “d” have coefficients of 1, 2, and 3 respectively. In others words, “b” is expected to be 1 unit of $$y_i$$ higher than “a”, “c” will two units higher, and “d” will be three units.
• As random errors, the effect of garden and plot has a mean of zero, but we’ll say they have sd of .5 and .4 respectively.
• We’ll also say that the unexplained residual error associated with plots is .3
• The sample size is the same as in the experimental example, with the same structure of mounds nested in plots, nested in gardens (with the four treatments repeated in each garden).

The below code essentially takes those true values, samples values dummy data from a normal distribution (or in the case of the treatment, a uniform distribution) based on those true values, calculates the associated outcome, and runs the linear mixed-model with that dummy dataset. Then it does that another 999 times. At the end, we save all those parameter estimates, and plot them on histograms, which we can then compare to the original ‘true’ values to check the model is doing what we expect it to be doing.

# Libraries
library(lme4)
## Loading required package: Matrix
library(reshape2)
library(broom)
library(gridExtra)
library(grid)
library(egg)
## Loading required package: ggplot2
library(kableExtra)

# Set seed
set.seed(16)

# Simultate creating the df, and runnig the model
Mirijiam_simulation <- function(x, ngarden=10,nplot=4,nmound=9, u=0, sdg=.5, sdp=.4, sd=.3 ){
# ngarden = number of gardens
# nplot = number of plots per garden
# nmound = number of mounts per plot
# u = the mean
# sdg = SD of the garden effect
# sdp = SD of the plot effect
# sd = SD of the mound

# Garden level effect - assign a letter each of the ten gardens, repeat that letter 36 times
garden = rep(LETTERS[1:ngarden], each = nplot*nmound)

# The SD in each garden - this is also repeated 36 times for each of the 10 gardens
gardeneff = rep( rnorm(ngarden, 0, sdg), each = nplot*nmound)

# Plot level effect - assign an ID number (factor) for each plot, repeat that 9 times
plot = as.factor(rep(seq(1:(ngarden*nplot)), each = nmound))

# The SD for each plot - this is repeated 40 times for the nine plots
ploteff = rep( rnorm(ngarden*nplot, 0, sdp), each = nmound)

# Mound effect - unique to each mound
moundeff = rnorm(ngarden*nplot*nmound, 0, sd)

# Treatment effect - rember "a" has no effect, "b" increases the outcome by 1, etc.
treat <- as.factor(rep(c(rep("a",nmound),rep("b",nmound),rep("c",nmound),rep("d",nmound)), ngarden) )
treateff <- rep(c(rep(0,nmound),rep(1,nmound),rep(2,nmound),rep(3,nmound)), ngarden)

# Create the DF
df <- data.frame(garden,gardeneff,plot,ploteff,moundeff,treat,treateff)

# Now we create our response
df$resp = u + df$treateff + df$gardeneff + df$ploteff  + moundeff

# Run the LMM and return the results
mod <- lmer(resp ~ treat  + (1|garden) + (1| plot), data = df )

# summary of data
summ <- tidy(mod)

# Create DF of results - the cols are the parameter estimates
summ_DF <- data.frame((melt(summ,1)))
x <- as.data.frame(t(summ_DF$value)) names(x) <- c(paste0(summ_DF$term, summ_DF$variable)) x[1:21] <- apply(x[1:21],2,as.numeric) # Return the DF return(x) } # Simulations model_list <- as.list(seq(1,1000,1)) # How many simulations? model_list_out <- lapply(model_list, Mirijiam_simulation) # Run them model_df <- do.call(rbind,model_list_out) # Create a single DF  Lets look at the simulated results, moving from the top-left panel rightwards. The true value of the intercept is 0, and we can see here that our simulation also indicates that the average across simulations is also 0. The true values of treatment “b”, “c”, and “d” were 1, 2, and 3 respectively, which are also well estimated by the simulation. Similarly, the true SD for the garden and plot effects are .5 and .4, which are also well estimated. out_plot<- egg::ggarrange(Intercept_est, B_est,C_est, D_est, garden_est, plot_est, ncol = 3, nrow = 2) Now, the crucial thing about this is that if there was some kind of collinearity between the treatment and plots, then I think we’d see unstable parameter estimates for the treatment, which it does not appear there are. Although there is a fair amount of spread in the parameter estimates, this is because the sample size is modest. Regarding the question about the way the random effects are specified, lets use the part of the function above to generate a single dataframe that we can inspect. # DF generator DF_simulation <- function(x, ngarden=10,nplot=4,nmound=9, u=0, sdg=.5, sdp=.4, sd=.3 ){ garden = rep(LETTERS[1:ngarden], each = nplot*nmound) gardeneff = rep( rnorm(ngarden, 0, sdg), each = nplot*nmound) plot = as.factor(rep(seq(1:(ngarden*nplot)), each = nmound)) ploteff = rep( rnorm(ngarden*nplot, 0, sdp), each = nmound) moundeff = rnorm(ngarden*nplot*nmound, 0, sd) treat <- as.factor(rep(c(rep("a",nmound),rep("b",nmound),rep("c",nmound),rep("d",nmound)), ngarden) ) treateff <- rep(c(rep(0,nmound),rep(1,nmound),rep(2,nmound),rep(3,nmound)), ngarden) df <- data.frame(garden,gardeneff,plot,ploteff,moundeff,treat,treateff) df$resp = u + df$treateff + df$gardeneff + df\$ploteff  + moundeff
return(df)
}

# Create a single dummy dataframe and inspect it.
DF <- DF_simulation()

# Nice table
DF[c(1:50),] %>%
kable(format = "html", table.attr = "style='width:100%;'")%>%
kable_styling()
garden gardeneff plot ploteff moundeff treat treateff resp
A 0.1454844 1 -0.2023534 -0.2424743 a 0 -0.2993433
A 0.1454844 1 -0.2023534 -0.3453320 a 0 -0.4022010
A 0.1454844 1 -0.2023534 -0.4246446 a 0 -0.4815136
A 0.1454844 1 -0.2023534 -0.3972701 a 0 -0.4541392
A 0.1454844 1 -0.2023534 0.0189861 a 0 -0.0378829
A 0.1454844 1 -0.2023534 -0.0992086 a 0 -0.1560776
A 0.1454844 1 -0.2023534 -0.0303475 a 0 -0.0872165
A 0.1454844 1 -0.2023534 -0.0044954 a 0 -0.0613644
A 0.1454844 1 -0.2023534 0.0784946 a 0 0.0216255
A 0.1454844 2 0.0146497 0.0346224 b 1 1.1947565
A 0.1454844 2 0.0146497 0.6382494 b 1 1.7983835
A 0.1454844 2 0.0146497 0.0057501 b 1 1.1658842
A 0.1454844 2 0.0146497 -0.0941057 b 1 1.0660284
A 0.1454844 2 0.0146497 0.1558525 b 1 1.3159865
A 0.1454844 2 0.0146497 0.0171769 b 1 1.1773110
A 0.1454844 2 0.0146497 -0.1910023 b 1 0.9691318
A 0.1454844 2 0.0146497 -0.3521328 b 1 0.8080013
A 0.1454844 2 0.0146497 -0.1873593 b 1 0.9727748
A 0.1454844 3 0.4310325 0.0323689 c 2 2.6088859
A 0.1454844 3 0.4310325 0.1333975 c 2 2.7099145
A 0.1454844 3 0.4310325 -0.1145324 c 2 2.4619845
A 0.1454844 3 0.4310325 0.2791428 c 2 2.8556598
A 0.1454844 3 0.4310325 -0.1050977 c 2 2.4714192
A 0.1454844 3 0.4310325 -0.2207145 c 2 2.3558024
A 0.1454844 3 0.4310325 -0.5926611 c 2 1.9838558
A 0.1454844 3 0.4310325 -0.0813053 c 2 2.4952117
A 0.1454844 3 0.4310325 -0.0482093 c 2 2.5283076
A 0.1454844 4 0.1021458 0.1626871 d 3 3.4103173
A 0.1454844 4 0.1021458 -0.2130209 d 3 3.0346093
A 0.1454844 4 0.1021458 0.3368032 d 3 3.5844333
A 0.1454844 4 0.1021458 -0.0020207 d 3 3.2456094
A 0.1454844 4 0.1021458 0.7460196 d 3 3.9936497
A 0.1454844 4 0.1021458 -0.0052225 d 3 3.2424077
A 0.1454844 4 0.1021458 0.3201702 d 3 3.5678004
A 0.1454844 4 0.1021458 0.0598634 d 3 3.3074936
A 0.1454844 4 0.1021458 0.0536522 d 3 3.3012824
B -0.0638741 5 0.0113094 0.1271457 a 0 0.0745809
B -0.0638741 5 0.0113094 0.0958891 a 0 0.0433244
B -0.0638741 5 0.0113094 -0.2823032 a 0 -0.3348679
B -0.0638741 5 0.0113094 0.1756208 a 0 0.1230561
B -0.0638741 5 0.0113094 -0.0834307 a 0 -0.1359955
B -0.0638741 5 0.0113094 -0.1199182 a 0 -0.1724829
B -0.0638741 5 0.0113094 0.3393802 a 0 0.2868155
B -0.0638741 5 0.0113094 -0.1324553 a 0 -0.1850201
B -0.0638741 5 0.0113094 0.1656783 a 0 0.1131136
B -0.0638741 6 -0.2670924 -0.1972885 b 1 0.4717450
B -0.0638741 6 -0.2670924 0.6439459 b 1 1.3129794
B -0.0638741 6 -0.2670924 -0.0341085 b 1 0.6349249
B -0.0638741 6 -0.2670924 0.4192690 b 1 1.0883025
B -0.0638741 6 -0.2670924 0.3323428 b 1 1.0013763

Looking at the first 50 observations, you can check that the structure follows the structure that we expect (10 gardens, 40 plots, and 360 mounds). Each garden and plot has a unique identifier. The unique identifier for plot ranges from “1” to “40”, the unique identifier for garden is “A”, “B”, “C”, etc. The identifier for plot isn’t repeated in different gardens. However, the identifier for treatment is repeated (“a”, “b”, “c”, “d”); this is why it’s useful to not confound plot and treatment.

Now comes to the crux - we have to think about what “nesting” actually means. The difference between (1|garden/plot) and (1|garden) + (1| plot) doens’t actually change the statistical calculation. The difference is to do with the relatively trivial matter of the identifier we’ve provided to each garden and plot. So we use (1|garden/plot) when we have not provided a unique identifier to plot - i.e. where the identifier for the plot is “1”, “2”, “3”, “4” repeated for each garden. In this case, we need to tell R that “1” (“2”,.. etc.) in each garden is not the same plot - we do this by saying (1|garden/plot).

However, when we have identified each plot with a unique identifier (always recomended) then we can simply use (1|garden) + (1| plot). In cases where we have a unique identifier, it makes zero difference if we use (1|garden/plot) or (1|garden) + (1| plot) - the results are exactly the same. Let’s test that:

Where we use (1|garden/plot):

summary(lmer(resp ~ treat  + (1|garden/plot), data = DF ))
## Linear mixed model fit by REML ['lmerMod']
## Formula: resp ~ treat + (1 | garden/plot)
##    Data: DF
##
## REML criterion at convergence: 260.3
##
## Scaled residuals:
##      Min       1Q   Median       3Q      Max
## -2.38585 -0.58232 -0.00476  0.59976  2.87498
##
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  plot:garden (Intercept) 0.14309  0.3783
##  garden      (Intercept) 0.15294  0.3911
##  Residual                0.08367  0.2893
## Number of obs: 360, groups:  plot:garden, 40; garden, 10
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -0.1918     0.1747  -1.098
## treatb        1.0824     0.1746   6.200
## treatc        2.0398     0.1746  11.684
## treatd        3.2207     0.1746  18.449
##
## Correlation of Fixed Effects:
##        (Intr) treatb treatc
## treatb -0.500
## treatc -0.500  0.500
## treatd -0.500  0.500  0.500

Where we use (1|garden) + (1| plot):

summary(lmer(resp ~ treat  + (1|garden) + (1| plot), data = DF ))
## Linear mixed model fit by REML ['lmerMod']
## Formula: resp ~ treat + (1 | garden) + (1 | plot)
##    Data: DF
##
## REML criterion at convergence: 260.3
##
## Scaled residuals:
##      Min       1Q   Median       3Q      Max
## -2.38585 -0.58232 -0.00476  0.59976  2.87498
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  plot     (Intercept) 0.14309  0.3783
##  garden   (Intercept) 0.15294  0.3911
##  Residual             0.08367  0.2893
## Number of obs: 360, groups:  plot, 40; garden, 10
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -0.1918     0.1747  -1.098
## treatb        1.0824     0.1746   6.200
## treatc        2.0398     0.1746  11.684
## treatd        3.2207     0.1746  18.449
##
## Correlation of Fixed Effects:
##        (Intr) treatb treatc
## treatb -0.500
## treatc -0.500  0.500
## treatd -0.500  0.500  0.500

You see, no difference - “nesting” isn’t a feature of the mixed-effects model, just a feature of the data.

When it comes to your dataset, it will make a difference if you have repeated the same identifier for plot in each garden (in which case use (1|garden/plot)). However, I’d strongly suggest you don’t repeat the identifier, instead providing a unique code per plot - in which case it doens’t matter if you use (1|garden/plot) or (1|garden) + (1| plot). It’s also explained here in this post: https://stats.stackexchange.com/questions/228800/crossed-vs-nested-random-effects-how-do-they-differ-and-how-are-they-specified