Here we’re going to try and simulate Mirjams experiment. This experiment tests the effect of different fertilizer treatments on sweet potato yields. The experiment is set up as such:

Here we’re trying to understand the implications of the treatment being applied at the plot level, rather than the mound level (the unit of analysis).

The analysis is going to model the effect of treatment on yield, with two random effects - plots (1) nested in gardens (2). We’re assuming dependencies between mounts with the same plot and the same garden. The model looks like this:

\[y_i = \alpha + T x_{pi} + G_{gpi} + P_{pi} + \epsilon_i\] Where \(y_i\) is the yield per \(i\)th mound, \(\alpha\) is the true mean, \(T\) is the treatment effect per observation (\(x_{pi}\)), \(G\) is the effect of garden, \(P\) is the effect of plot, and \(\epsilon\) is the residual error.

So, in our simulation we can specify the ‘true’ values of each parameter, then run our model a bunch of times to see if the model conforms to the ‘true’ values. Let’s specify those ‘true’ values:

The below code essentially takes those true values, samples values dummy data from a normal distribution (or in the case of the treatment, a uniform distribution) based on those true values, calculates the associated outcome, and runs the linear mixed-model with that dummy dataset. Then it does that another 999 times. At the end, we save all those parameter estimates, and plot them on histograms, which we can then compare to the original ‘true’ values to check the model is doing what we expect it to be doing.

# Libraries
library(lme4)
## Loading required package: Matrix
library(reshape2)
library(broom)
library(gridExtra)
library(grid)
library(egg)
## Loading required package: ggplot2
library(kableExtra)


# Set seed
set.seed(16)

# Simultate creating the df, and runnig the model
Mirijiam_simulation <- function(x, ngarden=10,nplot=4,nmound=9, u=0, sdg=.5, sdp=.4, sd=.3 ){
  # ngarden = number of gardens 
  # nplot = number of plots per garden 
  # nmound = number of mounts per plot
  # u = the mean
  # sdg = SD of the garden effect 
  # sdp = SD of the plot effect 
  # sd = SD of the mound 
  
  # Garden level effect - assign a letter each of the ten gardens, repeat that letter 36 times 
  garden = rep(LETTERS[1:ngarden], each = nplot*nmound) 

  # The SD in each garden - this is also repeated 36 times for each of the 10 gardens 
  gardeneff = rep( rnorm(ngarden, 0, sdg), each = nplot*nmound)
  
  # Plot level effect - assign an ID number (factor) for each plot, repeat that 9 times
  plot = as.factor(rep(seq(1:(ngarden*nplot)), each = nmound)) 
  
  # The SD for each plot - this is repeated 40 times for the nine plots 
  ploteff = rep( rnorm(ngarden*nplot, 0, sdp), each = nmound) 
  
  # Mound effect - unique to each mound 
  moundeff = rnorm(ngarden*nplot*nmound, 0, sd) 
  
  # Treatment effect - rember "a" has no effect, "b" increases the outcome by 1, etc. 
  treat <- as.factor(rep(c(rep("a",nmound),rep("b",nmound),rep("c",nmound),rep("d",nmound)), ngarden) )
  treateff <- rep(c(rep(0,nmound),rep(1,nmound),rep(2,nmound),rep(3,nmound)), ngarden) 
  
  # Create the DF 
  df <- data.frame(garden,gardeneff,plot,ploteff,moundeff,treat,treateff)
  
  # Now we create our response
  df$resp = u + df$treateff + df$gardeneff + df$ploteff  + moundeff 
  
  # Run the LMM and return the results
  mod <- lmer(resp ~ treat  + (1|garden) + (1| plot), data = df )
  
  # summary of data
  summ <- tidy(mod)
  
  # Create DF of results - the cols are the parameter estimates 
  summ_DF <- data.frame((melt(summ,1))) 
  x <- as.data.frame(t(summ_DF$value))
  names(x) <- c(paste0(summ_DF$term, summ_DF$variable))
  x[1:21] <- apply(x[1:21],2,as.numeric)
  # Return the DF
  return(x)
}

# Simulations 
model_list <- as.list(seq(1,1000,1)) # How many simulations? 
model_list_out <- lapply(model_list, Mirijiam_simulation) # Run them 
model_df <- do.call(rbind,model_list_out) # Create a single DF 

Lets look at the simulated results, moving from the top-left panel rightwards. The true value of the intercept is 0, and we can see here that our simulation also indicates that the average across simulations is also 0. The true values of treatment “b”, “c”, and “d” were 1, 2, and 3 respectively, which are also well estimated by the simulation. Similarly, the true SD for the garden and plot effects are .5 and .4, which are also well estimated.

out_plot<- egg::ggarrange(Intercept_est, B_est,C_est, D_est, garden_est, plot_est, ncol = 3, nrow = 2)

Now, the crucial thing about this is that if there was some kind of collinearity between the treatment and plots, then I think we’d see unstable parameter estimates for the treatment, which it does not appear there are. Although there is a fair amount of spread in the parameter estimates, this is because the sample size is modest.

Regarding the question about the way the random effects are specified, lets use the part of the function above to generate a single dataframe that we can inspect.

# DF generator 
DF_simulation <- function(x, ngarden=10,nplot=4,nmound=9, u=0, sdg=.5, sdp=.4, sd=.3 ){
  garden = rep(LETTERS[1:ngarden], each = nplot*nmound) 
  gardeneff = rep( rnorm(ngarden, 0, sdg), each = nplot*nmound)
  plot = as.factor(rep(seq(1:(ngarden*nplot)), each = nmound)) 
   ploteff = rep( rnorm(ngarden*nplot, 0, sdp), each = nmound) 
  moundeff = rnorm(ngarden*nplot*nmound, 0, sd) 
  treat <- as.factor(rep(c(rep("a",nmound),rep("b",nmound),rep("c",nmound),rep("d",nmound)), ngarden) )
  treateff <- rep(c(rep(0,nmound),rep(1,nmound),rep(2,nmound),rep(3,nmound)), ngarden) 
  df <- data.frame(garden,gardeneff,plot,ploteff,moundeff,treat,treateff)
  df$resp = u + df$treateff + df$gardeneff + df$ploteff  + moundeff 
  return(df)
}

# Create a single dummy dataframe and inspect it. 
DF <- DF_simulation()

# Nice table 
DF[c(1:50),] %>%
    kable(format = "html", table.attr = "style='width:100%;'")%>%
           kable_styling()
garden gardeneff plot ploteff moundeff treat treateff resp
A 0.1454844 1 -0.2023534 -0.2424743 a 0 -0.2993433
A 0.1454844 1 -0.2023534 -0.3453320 a 0 -0.4022010
A 0.1454844 1 -0.2023534 -0.4246446 a 0 -0.4815136
A 0.1454844 1 -0.2023534 -0.3972701 a 0 -0.4541392
A 0.1454844 1 -0.2023534 0.0189861 a 0 -0.0378829
A 0.1454844 1 -0.2023534 -0.0992086 a 0 -0.1560776
A 0.1454844 1 -0.2023534 -0.0303475 a 0 -0.0872165
A 0.1454844 1 -0.2023534 -0.0044954 a 0 -0.0613644
A 0.1454844 1 -0.2023534 0.0784946 a 0 0.0216255
A 0.1454844 2 0.0146497 0.0346224 b 1 1.1947565
A 0.1454844 2 0.0146497 0.6382494 b 1 1.7983835
A 0.1454844 2 0.0146497 0.0057501 b 1 1.1658842
A 0.1454844 2 0.0146497 -0.0941057 b 1 1.0660284
A 0.1454844 2 0.0146497 0.1558525 b 1 1.3159865
A 0.1454844 2 0.0146497 0.0171769 b 1 1.1773110
A 0.1454844 2 0.0146497 -0.1910023 b 1 0.9691318
A 0.1454844 2 0.0146497 -0.3521328 b 1 0.8080013
A 0.1454844 2 0.0146497 -0.1873593 b 1 0.9727748
A 0.1454844 3 0.4310325 0.0323689 c 2 2.6088859
A 0.1454844 3 0.4310325 0.1333975 c 2 2.7099145
A 0.1454844 3 0.4310325 -0.1145324 c 2 2.4619845
A 0.1454844 3 0.4310325 0.2791428 c 2 2.8556598
A 0.1454844 3 0.4310325 -0.1050977 c 2 2.4714192
A 0.1454844 3 0.4310325 -0.2207145 c 2 2.3558024
A 0.1454844 3 0.4310325 -0.5926611 c 2 1.9838558
A 0.1454844 3 0.4310325 -0.0813053 c 2 2.4952117
A 0.1454844 3 0.4310325 -0.0482093 c 2 2.5283076
A 0.1454844 4 0.1021458 0.1626871 d 3 3.4103173
A 0.1454844 4 0.1021458 -0.2130209 d 3 3.0346093
A 0.1454844 4 0.1021458 0.3368032 d 3 3.5844333
A 0.1454844 4 0.1021458 -0.0020207 d 3 3.2456094
A 0.1454844 4 0.1021458 0.7460196 d 3 3.9936497
A 0.1454844 4 0.1021458 -0.0052225 d 3 3.2424077
A 0.1454844 4 0.1021458 0.3201702 d 3 3.5678004
A 0.1454844 4 0.1021458 0.0598634 d 3 3.3074936
A 0.1454844 4 0.1021458 0.0536522 d 3 3.3012824
B -0.0638741 5 0.0113094 0.1271457 a 0 0.0745809
B -0.0638741 5 0.0113094 0.0958891 a 0 0.0433244
B -0.0638741 5 0.0113094 -0.2823032 a 0 -0.3348679
B -0.0638741 5 0.0113094 0.1756208 a 0 0.1230561
B -0.0638741 5 0.0113094 -0.0834307 a 0 -0.1359955
B -0.0638741 5 0.0113094 -0.1199182 a 0 -0.1724829
B -0.0638741 5 0.0113094 0.3393802 a 0 0.2868155
B -0.0638741 5 0.0113094 -0.1324553 a 0 -0.1850201
B -0.0638741 5 0.0113094 0.1656783 a 0 0.1131136
B -0.0638741 6 -0.2670924 -0.1972885 b 1 0.4717450
B -0.0638741 6 -0.2670924 0.6439459 b 1 1.3129794
B -0.0638741 6 -0.2670924 -0.0341085 b 1 0.6349249
B -0.0638741 6 -0.2670924 0.4192690 b 1 1.0883025
B -0.0638741 6 -0.2670924 0.3323428 b 1 1.0013763

Looking at the first 50 observations, you can check that the structure follows the structure that we expect (10 gardens, 40 plots, and 360 mounds). Each garden and plot has a unique identifier. The unique identifier for plot ranges from “1” to “40”, the unique identifier for garden is “A”, “B”, “C”, etc. The identifier for plot isn’t repeated in different gardens. However, the identifier for treatment is repeated (“a”, “b”, “c”, “d”); this is why it’s useful to not confound plot and treatment.

Now comes to the crux - we have to think about what “nesting” actually means. The difference between (1|garden/plot) and (1|garden) + (1| plot) doens’t actually change the statistical calculation. The difference is to do with the relatively trivial matter of the identifier we’ve provided to each garden and plot. So we use (1|garden/plot) when we have not provided a unique identifier to plot - i.e. where the identifier for the plot is “1”, “2”, “3”, “4” repeated for each garden. In this case, we need to tell R that “1” (“2”,.. etc.) in each garden is not the same plot - we do this by saying (1|garden/plot).

However, when we have identified each plot with a unique identifier (always recomended) then we can simply use (1|garden) + (1| plot). In cases where we have a unique identifier, it makes zero difference if we use (1|garden/plot) or (1|garden) + (1| plot) - the results are exactly the same. Let’s test that:

Where we use (1|garden/plot):

summary(lmer(resp ~ treat  + (1|garden/plot), data = DF ))
## Linear mixed model fit by REML ['lmerMod']
## Formula: resp ~ treat + (1 | garden/plot)
##    Data: DF
## 
## REML criterion at convergence: 260.3
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.38585 -0.58232 -0.00476  0.59976  2.87498 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  plot:garden (Intercept) 0.14309  0.3783  
##  garden      (Intercept) 0.15294  0.3911  
##  Residual                0.08367  0.2893  
## Number of obs: 360, groups:  plot:garden, 40; garden, 10
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -0.1918     0.1747  -1.098
## treatb        1.0824     0.1746   6.200
## treatc        2.0398     0.1746  11.684
## treatd        3.2207     0.1746  18.449
## 
## Correlation of Fixed Effects:
##        (Intr) treatb treatc
## treatb -0.500              
## treatc -0.500  0.500       
## treatd -0.500  0.500  0.500

Where we use (1|garden) + (1| plot):

summary(lmer(resp ~ treat  + (1|garden) + (1| plot), data = DF ))
## Linear mixed model fit by REML ['lmerMod']
## Formula: resp ~ treat + (1 | garden) + (1 | plot)
##    Data: DF
## 
## REML criterion at convergence: 260.3
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.38585 -0.58232 -0.00476  0.59976  2.87498 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  plot     (Intercept) 0.14309  0.3783  
##  garden   (Intercept) 0.15294  0.3911  
##  Residual             0.08367  0.2893  
## Number of obs: 360, groups:  plot, 40; garden, 10
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -0.1918     0.1747  -1.098
## treatb        1.0824     0.1746   6.200
## treatc        2.0398     0.1746  11.684
## treatd        3.2207     0.1746  18.449
## 
## Correlation of Fixed Effects:
##        (Intr) treatb treatc
## treatb -0.500              
## treatc -0.500  0.500       
## treatd -0.500  0.500  0.500

You see, no difference - “nesting” isn’t a feature of the mixed-effects model, just a feature of the data.

When it comes to your dataset, it will make a difference if you have repeated the same identifier for plot in each garden (in which case use (1|garden/plot)). However, I’d strongly suggest you don’t repeat the identifier, instead providing a unique code per plot - in which case it doens’t matter if you use (1|garden/plot) or (1|garden) + (1| plot). It’s also explained here in this post: https://stats.stackexchange.com/questions/228800/crossed-vs-nested-random-effects-how-do-they-differ-and-how-are-they-specified