Welcome to this practical guide on fitting Linear Mixed Models (LMMs), also known as Multilevel Models (MLMs) or Hierarchical Linear Models (HLMs).

In this document, a comprehensive blueprint for a complete linear mixed-effects model analysis is provided. As you navigate through the Table of Contents on the left, we will guide you step-by-step through:

  1. Introduction : Motivation for multilevel analysis methods.
  2. Exploratory and Graphical Data Analysis: Visualizing cluster-level variations and identifying patterns.
  3. Fitting a multilevel model: Writing clean lme4 syntax to fit random intercepts and slopes.
  4. Interpreting the Model Summary: Breaking down the dense summary() output into digestible, actionable insights.
  5. Parameter Inference: Evaluating the statistical significance of your fixed effects.
  6. Model Selection: Comparing competing nested and non-nested models using Information Criteria (AIC/BIC) and Likelihood Ratio Tests to find the most optimal framework.
  7. Model Assummptions: Check the fit: Validating your model’s diagnostics at both the individual and cluster levels.

By the end of this guide, you will possess both the core statistical intuition and the exact R code templates required to independently execute, interpret, and report a multilevel analysis for your own research.


1 Introduction

In standard linear regression (lm()), we assume that all our observations are completely independent of one another. However, real-world data often violates this assumption because it is naturally nested or clustered.

The Problem with clustering

Consider these classic examples of nested data:
* Education: Students nested within classrooms, nested within schools.
* Medicine: Repeated health measurements (Level 1) nested within individual patients (Level 2).
* Social Sciences: Citizens nested within neighborhoods or countries.

If two students go to the same school, they share the same teachers, resources, and neighborhood culture. Because of this, their test scores are likely to be more similar to each other than to the scores of a student from a completely different school.

If we run a standard OLS regression on clustered data, we violate the independence assumption. This artificially inflates our sample size, shrinks our standard errors, and leads to a massive problem: we find statistically significant “discoveries” that are actually just random noise (Type I errors).

1.1 Motivating Example: Limitations of Standard Linear Regression

To illustrate how and why standard linear regression fails when the assumption of independence is violated, we will use a simulated dataset of 90 patients across 3 hospitals (A, B, and C) measuring how their Symptom_Severity relates to their Life_Quality. The data can be generated using the following code:

# Generate independent variable
set.seed(25)
x1 <- rnorm(n=30, mean=2.5, sd=0.5)
x2 <- rnorm(n=30, mean=5, sd=0.5)
x3 <- rnorm(n=30, mean=7.5, sd=0.5)
# Generate the dependent variable
y1 <- -0.75*x1 + 4.8 + rnorm(n=30, mean = 0, sd = 0.75)
y2 <- -0.85*x2 + 7.8 + rnorm(n=30, mean = 0, sd = 0.75)
y3 <- -0.9*x3 + 10.8 + rnorm(n=30, mean = 0, sd = 0.65)

Life_Quality <- c(y1,y2,y3)
Symptom_Severity <- c(x1,x2,x3)
Hospital <- c(rep("A",30),rep("B",30),rep("C",30))
hospital_data <- data.frame(Symptom_Severity, Life_Quality, Hospital)

Let’s start by looking at what happens if we treat our data as completely independent and run a standard, pooled ordinary least squares (OLS) regression line through it.

library(ggplot2)

# Overall OLS trend (ignoring hospital clustering)
ggplot(hospital_data, aes(x = Symptom_Severity, y = Life_Quality)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "red", size = 1.2) +
  theme_minimal() +
  labs(title = "OLS regression (Ignoring Clusters)",
       subtitle = "Aggregating the data shows a positive relationship!",
       x = "Symptom Severity", y = "Quality of Life")

If you look at this plot, the red regression line has a positive slope. A naive researcher would conclude: “As symptom severity increases, a patient’s quality of life improves!” Physiologically, this makes no sense. This distortion is a classic textbook example of Simpson’s Paradox. It occurs because the massive baseline differences between the hospitals are completely masking the true biological trends happening within each hospital.

Let’s modify our ggplot code to color-code the points by Hospital and draw a separate regression line for each individual hospital.

# Breaking the trends down by cluster

ggplot(hospital_data, aes(x = Symptom_Severity, y = Life_Quality, color = Hospital, group = Hospital)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, size = 1.2) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
  labs(title = "OLS (seperate for each cluster)",
       subtitle = "Accounting for clusters reveals the true negative relationship!",
       x = "Symptom Severity", y = "Quality of Life")

What the plot reveals:

By untangling the clusters, the plot tells a completely different and more accurate story:

  • The within-cluster trend (Slopes): Inside every single hospital, the slopes are strictly negative. As symptoms get worse, quality of life goes down.
  • The between-cluster trend (Intercepts): The hospitals have very different intercepts. Hospital C (blue) has a much higher baseline quality of life than Hospital A (red).

Because the hospital with the worst symptoms (Hospital C) also happens to have the highest baseline quality of life due to external infrastructure, pooling the data completely flipped the sign of our predictor.

Methodological Note: Dataset Scale vs. Model Choice

This illustrative dataset features a small sample size with a limited number of individual observations and distinct groups (hospitals). In actual research practice, a dataset of this scale does not technically require a complex multilevel model (Linear Mixed Model). Given the low number of groups, a standard fixed-effects approach such as a Two-Way ANOVA or a Regular Linear Regression with a categorical hospital effect would be the statistically appropriate and parsimonious choice.

This dataset is used strictly as a simplified, conceptual proof-of-concept. Its purpose is to clearly demonstrate the mathematical and visual consequences that occur when you completely ignore clustering and grouping variables in your data, providing the intuition needed before applying multilevel modeling to larger, more complex nested datasets.

1.2 Motivating Example: Group-Level vs. Profile-Specific Trajectories

To demonstrate how aggregate summary statistics can obscure conflicting group trends in repeated measures data, we use a second simulated dataset tracking 20 employees across three timepoints. This case study measures how the rollout of an AI-driven coding assistant affects employee job satisfaction to assess how the introduction of the tool may influence workplace experience and perceived well-being.

The dataset contains repeated measurements from 20 employees, each of whom provided job satisfaction scores on a scale ranging from 1 to 100 at three distinct timepoints: Baseline (prior to implementation of the assistant), Week 2, and Week 4 following the introduction of the AI tool. Because multiple observations were collected from the same employees over time, the data have a hierarchical or longitudinal structure, with repeated measurements nested within individuals.The data can be generated using the following code:

set.seed(42)
num_employees <- 20
timeline_steps <- c("Baseline", "Week 2", "Week 4")

AIdata <- data.frame()

for (i in 1:num_employees) {
  emp_id <- sprintf("Emp_%02d", i)
  
  if (i <= 10) {
    # Group 1: Starts low, crosses the middle, ends high
    base_score <- runif(1, 30, 40)
    w2_score   <- base_score + runif(1, 15, 25) # Midpoint crossover
    w4_score   <- w2_score + runif(1, 15, 25)
    group_label <- "Tech-Savvy (Thriving)"
  } else {
    # Group 2: Starts high, crosses the middle, ends low
    base_score <- runif(1, 70, 80)
    w2_score   <- base_score - runif(1, 15, 25) # Midpoint crossover
    w4_score   <- w2_score - runif(1, 15, 25)
    group_label <- "Non-Technical (Struggling)"
  }
  
  AIdata <- rbind(AIdata, 
    data.frame(Employee_ID = emp_id, Time = "Baseline", Satisfaction = base_score, Group = group_label),
    data.frame(Employee_ID = emp_id, Time = "Week 2", Satisfaction = w2_score, Group = group_label),
    data.frame(Employee_ID = emp_id, Time = "Week 4", Satisfaction = w4_score, Group = group_label)
  )
}

# Ensure Time is ordered chronologically
AIdata$Time <- factor(AIdata$Time, levels = timeline_steps)

When examining the aggregate group mean across the entire sample, the resulting trend line appears perfectly horizontal and stagnant, artificially suggesting a universal lack of response to the software over time.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

overall_mean <- AIdata %>%
  group_by(Time) %>%
  summarize(Mean_Satisfaction = mean(Satisfaction), .groups = 'drop')

plot_mean <- ggplot(overall_mean, aes(x = Time, y = Mean_Satisfaction, group = 1)) +
  geom_line(color = "purple", size = 2) +
  geom_point(color = "purple", size = 4) +
  scale_y_continuous(limits = c(10, 90)) +
  theme_minimal(base_size = 12) +
  labs(title = "Overall Group Mean", subtitle = "Looks completely stagnant", x = "", y = "Average Satisfaction")

plot_mean

However, this flat aggregate is a mathematical illusion caused by two diametrically opposed cohorts whose trajectories move in opposite directions at an identical pace. To uncover the true structure of clustered data, we use Spaghetti Plots (sometimes called parallel coordinate plots or individual growth curves). A spaghetti plot draws distinct, separate lines for each cluster group. This allows us to see both individual variations and group-level trends simultaneously.

plot_spaghetti <- ggplot(AIdata, aes(x = Time, y = Satisfaction, group = Employee_ID)) +
  geom_line(aes(color = Group), alpha = 0.7, size = 1) +
  geom_point(aes(color = Group), size = 2) +
  scale_color_manual(values = c("Tech-Savvy (Thriving)" = "green", "Non-Technical (Struggling)" = "red")) +
  scale_y_continuous(limits = c(10, 90)) +
  theme_minimal(base_size = 12) +
  labs(title = "Individual Spaghetti Plot", subtitle = "Reveals massive, crossing trajectories", x = "", y = "Individual Satisfaction", color = "Actual Impact") +
  theme(legend.position = "bottom")

plot_spaghetti

In reality, the spaghetti plot uncovers a violent “X-shaped” crossover effect, where a tech-savvy group steadily climbs from low to high satisfaction, while a non-technical group plummets at the exact same rate.

To summarize, when analyzing repeated measures data, traditional summary metrics like a group mean or a standard boxplot can be deeply misleading. For instance, if half of our workforce experiences a severe stress increase while the other half sees a decrease, the aggregate average line remains completely flat, falsely signaling that the software update had “zero impact.” A Spaghetti Plot solves this problem by mapping each individual data profile as a standalone trajectory across the timeline. This lets us visualize both macro-level trends and micro-level outliers simultaneously.

Conclusion: Relying solely on group-level means creates an illusion of uniformity. Pairing it with a spaghetti plot ensures that organizational decisions are informed by the aggregate trend without losing sight of critical individual outliers.


2 Exploratory and Graphical Analysis

Before fitting a complex linear mixed-effects model, it is vital to visually explore your data. In multilevel research, standard summary statistics (like an overall correlation coefficient) can be deeply misleading because they blend together variance that exists between groups and variance that exists within groups. In this section, several graphical and statistical techniques are discussed to evaluate whether clustering is present in the data and to determine if a multilevel model is necessary. Throughout the next sections, the simulated datasets from the introduction (hospital_data and AIdata), and the Lizard and Captopril datasets from the accompanying slides will be used.

2.1 Grouped Boxplots

Before executing any formal statistical modeling, a critical first step is evaluating how the distribution of the dependent variable behaves across the distinct clusters in your dataset. A standard, pooled boxplot aggregates all observations into a single distribution, completely obscuring any group-level variations. In a multilevel framework, we instead visualize seperate boxplots for each group in the data, displaying separate distributional summaries for each individual cluster side-by-side.

Using the simulated hospital data, this technique allows us to visualize exactly how much the center, spread, and overall distribution of Life_Quality shifts from one hospital to another.

If the boxes across the hospitals were perfectly lined up horizontally, it would mean the hospital groupings don’t matter, and a regular linear regression (lm()) would be perfectly fine.

However, because the boxes are completely staggered, it confirms that a substantial amount of variance lives at Level 2 (the hospital level).

# Boxplot of the outcome across clusters

ggplot(hospital_data, aes(x = Hospital, y = Life_Quality, fill = Hospital)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.5, size = 1.5) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Distribution of Quality of Life Across Hospitals",
       subtitle = "Clear evidence of systematic between-hospital variance",
       x = "Hospital", y = "Quality of Life")

Here is a step-by-step explanation of the code:

ggplot(data, aes(x = Hospital, y = Life_Quality, fill = Hospital))

What it does: Sets up the foundational canvas and maps the variables. It places the grouping variable (Hospital) on the X-axis, the outcome variable (Life_Quality) on the Y-axis, and tells R to color-fill the boxes based on the specific hospital.

geom_boxplot(alpha = 0.7, outlier.shape = NA)

What it does: Draws the actual boxplots. The alpha = 0.7 adds slight transparency to the background fill.
Crucial detail: outlier.shape = NA hides the default outlier points generated by the boxplot function. This is done because we are manually plotting the raw data points in the next layer, and we want to avoid doubling up on outlier points.

geom_jitter(width = 0.15, alpha = 0.5, size = 1.5)

What it does: Overlays the individual patient data points on top of the boxplots.
Crucial detail: Instead of printing points in a straight vertical line, “jittering” spreads them out slightly sideways (width = 0.15). This prevents points from overlapping completely, making it easy to see data density and the exact number of observations per cluster.

theme_minimal()

What it does: Strips away the default heavy gray background and harsh gridlines, replacing them with a clean, modern white-space layout.

scale_fill_brewer(palette = "Set1")

What it does: Replaces R’s default neon colors with a professionally curated color palette (“Set1”) from the ColorBrewer library to make the distinct hospitals visually pop.

labs(...)

What it does: Adds the final text polish. It applies a clear main title, a descriptive subtitle that highlights the scientific takeaway, and clean axis labels.

While grouped boxplots are excellent for comparing static differences between distinct groups (like hospitals), they are not suitable for tracking changes over time. A boxplot summarizes data at a single timepoint by grouping all observations into a single distribution. Because of this aggregation, it loses track of individual identity. It cannot show whether a specific person’s score is going up, going down, or staying the same across the study.

To actually observe how individual subjects change across a timeline, we need a visualization that preserves individual identities and connects their specific data points over time. For this, we use Spaghetti Plots.

2.2 Individual Trajectory Plots (Spaghetti Plots)

When analyzing repeated measures or longitudinal data, we must observe how individual subjects change across a timeline. While aggregate summary plots show the overall group trajectory, they can mask massive variations happening beneath the surface. To expose these dynamics, we use Spaghetti Plots alongside group-level mean plots using the simulated AI dataset.

This plot collapses the individual data points to show only the average trajectory for the three time points:

# Group Mean Plot (Aggregated) 

group_means <- AIdata %>%
  group_by(Time) %>%
  summarize(Mean_Satisfaction = mean(Satisfaction), .groups = "drop")

ggplot(group_means, aes(x = Time, y = Mean_Satisfaction, group = 1)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(10, 90))+
  labs(title = "Mean trajectories over time", y = "Average satisfaction")

Here is a step-by-step explanation of the code:

group_by(Time) %>% summarize(...)

What it does: Prepares the data by calculating a single average satisfaction score for each timepoint, completely ignoring group or individual identities.

ggplot(group_means, aes(x = Time, y = Mean_Satisfaction, group = 1))

What it does: Maps Time to the X-axis and the global Mean_Satisfaction to the Y-axis.
Crucial detail: The group = 1 argument tells ggplot2 that all the summary rows belong to a single global group, instructing it to connect the Baseline, Week 2, and Week 4 summary metrics with a single continuous line.

geom_line(size = 1.2) + geom_point(size = 3)

What it does: Draws the overall trend line and highlights the average score at each interval with a large dot.

scale_y_continuous(limits = c(10, 90))

What it does: Explicitly sets the hard lower and upper boundaries of the continuous vertical axis, forcing it to start exactly at 10 and end exactly at 90. Crucial detail: By default, ggplot2 automatically zooms the axis to tightly fit the data. If you do not override this, the overall mean plot (which stays around 50) will look wildly zoomed in, while the spaghetti plot (which ranges from 15 to 85) will look zoomed out. Forcing identical limits on both plots ensures a fair, apples-to-apples visual comparison you can immediately see how flat the aggregate mean is relative to the massive shifts in individual trajectories.

labs(...)

What it does: Adds the final text polish. It applies a clear main title, and clean axis labels.

On the contrary, this plot maps every single employee’s unique path over time:

# Spaghetti Plot (Individual Trajectories)

ggplot(AIdata, aes(x = Time, y = Satisfaction, group = Employee_ID)) +
  geom_line(alpha = 0.6) +
  geom_point(alpha = 0.6) +
  labs(title = "Individual Spaghetti Plot", y = "Individual Satisfaction")

Here is a step-by-step explanation of the code:

ggplot(AIdata, aes(x = Time, y = Satisfaction, group = Employee_ID, color = Group))

What it does: Maps Time to the X-axis and the raw Satisfaction scores to the Y-axis. Crucial detail: The group = Employee_ID argument is the core of the spaghetti plot. Instead of drawing one global line, it tells R to isolate each unique employee ID and draw a separate, independent line connecting their specific three timepoints.

geom_line(alpha = 0.6) + geom_point(alpha = 0.6)

What it does: Draws the lines and points for all 20 individuals. Crucial detail: The alpha = 0.6 adds transparency to the lines. Because multiple individual trajectories cross paths and overlap in the middle, transparency prevents the plot from becoming an unreadable block of color, allowing you to trace individual trends clearly.

labs(...)

What it does: Adds the final text polish. It applies a clear main title, and clean axis labels.

While a spaghetti plot is highly effective for tracing individual pathways on a single coordinate system, it can quickly become cluttered as the number of subjects or clusters increases. Furthermore, while we can see that lines are moving in different directions, a spaghetti plot makes it difficult to mathematically isolate and compare the exact baseline starting points and rates of change for each distinct entity. To systematically evaluate these shifts without visual crowding, we can break the data apart into a grid of smaller, individual screens or Trellis plots.

2.3 Faceted OLS Regression Plots (Trellis Plots)

To systematically evaluate shifts without visual crowding, we can break the data apart into a grid of smaller, individual screens. This technique, known as Trellis Plots, fits a separate ordinary least squares (OLS) regression line to each cluster or individual side-by-side. This layout allows us to visually audit whether the starting baselines (intercepts) and the trajectories over time (slopes) vary systematically across our groups, providing the direct visual intuition needed before we specify random intercepts and random slopes in a formal linear mixed-effects model.

Using the lizard dataset, we treat individual mothers (MOTHC) as our clusters, the offspring’s sex (SEX) as our predictor, and the number of mid-dorsal scales (DORS) as our dependent variable. This visualization allows us to see whether the baseline scale counts and the differences between sexes vary from mother to mother.

# Read in the Lizard data
lizard<-read.table("lizard.txt",header=T,sep="\t")

# Specify categorical variables
lizard$SEX<-as.factor(lizard$SEX)
lizard$MOTHC<-as.factor(lizard$MOTHC)

# Generating the Faceted OLS Regression Plot
ggplot(lizard, aes(x = SEX, y = DORS, group=1)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", size = 1) +
  facet_wrap(~ MOTHC, labeller = label_both) +
  labs(
    title = "Within-Mother OLS Regression of Dorsal Scales by Sex",
    subtitle = "Faceted panels reveal shifting intercepts and slopes across clusters",
    x = "Sex of Offspring",
    y = "Dorsal Scales (DORS)"
  ) +
  theme_minimal()
#> `geom_smooth()` using formula = 'y ~ x'

Here is a step-by-step explanation of the code:

ggplot(lizard_subset, aes(x = SEX, y = DORS, group = 1))

What it does: Sets up the canvas, mapping SEX to the X-axis and DORS to the Y-axis.
Crucial Detail: The group = 1 forces R to treat all the data points within a single panel as one interconnected group, which allows geom_smooth() to successfully draw a continuous OLS regression line across the X-axis.

geom_smooth(method = "lm", se = FALSE, color = "blue")

What it does: Computes and draws the independent linear regression line for each individual panel.
Crucial detail: Setting method = “lm” forces a straight ordinary least squares regression line instead of a curved LOESS smoother. Setting se = FALSE turns off the standard error confidence ribbons, preventing the small panels from becoming cluttered and hard to read.

facet_wrap(~ MOTHC, labeller = label_both)

What it does: This is the core function that creates the trellis layout. It splits the data by the mother’s ID column (MOTHC) and arranges the resulting scatterplots into a clean grid of small multiples.
Crucial detail: Using labeller = label_both changes the panel headers from just a bare number (e.g., “6”) to an explicit variable-value pair (e.g., “MOTHC: 6”), making it immediately clear what grouping factor is being isolated.

theme_minimal()

What it does: Strips away R’s default heavy gray background, harsh gridlines, and outer border, replacing them with a clean, modern white-space layout. Crucial detail: When working with faceted plots (trellis plots), the default theme can make the grid look incredibly crowded because every single panel gets its own gray box and border. theme_minimal() lightens the background lines and cleans up the panels, ensuring you can focus on the actual direction of the regression lines without visual clutter.

labs(...)

What it does: Applies custom text labels to the visualization, including the main title, descriptive subtitle, and clean axis titles. Crucial Detail: By default, ggplot2 pulls the column names straight from the dataframe as axis labels (which would display as the raw abbreviations SEX and DORS). Using labs() allows you to replace those cryptic variable names with clear, publication-ready text (“Sex of Offspring” and “Dorsal Scales”) so anyone reading the plot can immediately understand the metrics.

While the previous trellis plot analyzed categorical group differences within clusters (comparing male vs. female offspring across different mothers), the exact same faceting logic can be applied to longitudinal data. In the code snippet below, we use our simulated AI dataset to track individual employees (Employee_ID) sequentially over time. Although the structural goal shifts from evaluating group differences to tracking a chronological trajectory across three distinct timepoints (Baseline, Week 2, and Week 4), the coding mechanics remain nearly identical: we isolate each individual into their own screen using facet_wrap(), using points and lines to map their unique progression over the course of the study.

ggplot(AIdata, aes(x = Time, y = Satisfaction, group = Employee_ID)) +
  geom_point(size = 2, color = "darkgray") +
  geom_line(size = 1, color = "blue") +
  
  # Splits the layout into a separate panel for each cluster
  facet_wrap(~ Employee_ID) + 
  
  labs(
    title = "Within-Cluster Trajectories Across Three Timepoints",
    subtitle = "Each panel tracks a single cluster's progression from Baseline to Week 4",
    x = "Assessment Interval",
    y = "Observed Score"
  ) +
  theme_minimal()

2.4 Individual-vs.-Aggregate Dot Plots (Scatterplots/Index plot)

An effective way to evaluate raw variation across clusters is to display all data points on a single coordinate system. By aligning the ‘groups’ side-by-side along the horizontal axis and plotting individual observations vertically, we can instantly observe within-cluster variance and between-cluster differences simultaneously.

The following code generates this cluster-by-cluster scatterplot using the lizard dataset. Varying the color of the points by sex allows us to see how group-level trends differ from one mother to the next.


# Convert SEX numeric codes to clean, descriptive text labels
lizard$SEX_Label <- factor(lizard$SEX, levels = c(1, 2), labels = c("1 = Male", "2 = Female"))

# Calculating the mean numbers of dorsal shells for males/females, per motherID
group_means <- lizard %>%
  group_by(MOTHC,SEX_Label) %>%
  summarize(Mean_DORS = mean(DORS), .groups = "drop")

# Generating the Individual-vs.-Aggregate Dot Plot (Black and White)
ggplot(group_means, aes(x = MOTHC, y = Mean_DORS, col = SEX_Label)) +
  geom_point(alpha = 0.7, size = 2.5) +
  labs(
    title = "Distribution of Mean Dorsal Scales Across Mothers and Offspring Sex",
    subtitle = "Vertical spreads highlight within-mother variance and sex differentials",
    x = "Mother ID (Cluster)",
    y = "MEAN # Dorsal Scales",
    color = "Sex of Offspring"
  ) +
  theme_minimal()

Here is a step-by-step explanation of the code:

lizard$SEX_Label <- factor(lizard$SEX, levels = c(1, 2), labels = c("1 = Male", "2 = Female"))

What it does: Creates a new categorical factor column that maps the raw numeric values (1 and 2) to explicit, reader-friendly text descriptions. Crucial detail: If we bypass this step, R will generate a generic numeric legend scale. Explicitly defining the labels inside a factor guarantees that the legend on the side of the plot will directly spell out exactly what “1” and “2” represents.

group_by(MOTHC,SEX_Label) %>% summarize(Mean_DORS = mean(DORS), .groups = "drop")

What it does: It calculates the mean number of dorsal shells for female and male lizard separately within a MOTHER cluster.

ggplot(group_means, aes(x = MOTHC, y = DORS, col = SEX_Label))

What it does: Initializes the plot canvas, placing the cluster identifier (MOTHC) on the horizontal X-axis and the outcome variable (mean number of dorsal shells) on the vertical Y-axis. Crucial detail: Mapping col = SEX_Label instructs ggplot2 to distinguish between the sexes by using different colors.

geom_point(alpha = 0.7, size = 2.5)

What it does: Plots a distinct point for every single offspring in the dataset. alpha controls the opacity of the data points, and size controls the absolute physical size (diameter) of the geometric shapes plotted on the canvas.

labs(...)

What it does: Replaces the raw, technical database column labels with clean, publication-ready titles and axis text.

theme_minimal()

What it does: Strips away the heavy gray plotting background, leaving a clean, high-contrast white canvas with faint gridlines.

While plots are excellent for catching overall patterns, we also need descriptive statistics to quantify the variation across our clusters.

2.5 Descriptive statistics

Descriptive statistics provide formal statistical justification for multilevel modeling by quantifying how much variance is attributable to the grouping structure. Several methods exist that determine if the combination of cluster size and clustering violation is severe enough to artificially deflate standard errors in OLS estimation. Three primary descriptive metrics you can calculate, include:

  1. Cluster-Specific Means: The average score of the outcome within each group.
  2. The Intraclass Correlation Coefficient (ICC) of the Raw Data: A descriptive estimate of how much variance lives at the group level before adding predictors.
  3. Design Effect (Deff): A metric that tells you whether your clustering is strong enough to actually require a multilevel model, or if a standard OLS regression (lm()) would suffice.

Let’s compute these using our hospital data. We can use the dplyr package to quickly look at the sample size (\(n\)) and average Quality of Life for each hospital.

library(dplyr)

data_summary <- hospital_data %>%
  group_by(Hospital) %>%
  summarise(
    Patient_Count = n(),
    Mean_Symptoms = mean(Symptom_Severity),
    Mean_Life_Qual = mean(Life_Quality),
    SD_Life_Qual   = sd(Life_Quality)
  )

print(data_summary)
#> # A tibble: 3 × 5
#>   Hospital Patient_Count Mean_Symptoms Mean_Life_Qual SD_Life_Qual
#>   <chr>            <int>         <dbl>          <dbl>        <dbl>
#> 1 A                   30          2.41           2.93        0.687
#> 2 B                   30          4.87           3.66        0.629
#> 3 C                   30          7.38           4.16        0.608

What this tells us: Look at the big gaps between the Mean_Life_Qual scores. Hospital A averages around 2.8, while Hospital C averages around 4.0. This descriptive table numerically confirms the “staggered boxes” we saw in our boxplot.

Before running a full mixed model with predictors (for explanation of the lme4-related code, see next section), we can fit a Null Model (a model with only a random intercept and no independent variables) to calculate a descriptive baseline of our variance components.

# Fit an intercept-only model

library(lme4)
#> Loading required package: Matrix

fit_null <- lme4::lmer(Life_Quality ~ 1 + (1 | Hospital), data = hospital_data)
summary(fit_null)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Life_Quality ~ 1 + (1 | Hospital)
#>    Data: hospital_data
#> 
#> REML criterion at convergence: 184.8
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.0721 -0.6492 -0.1133  0.6795  2.5294 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  Hospital (Intercept) 0.3679   0.6066  
#>  Residual             0.4122   0.6421  
#> Number of obs: 90, groups:  Hospital, 3
#> 
#> Fixed effects:
#>             Estimate Std. Error t value
#> (Intercept)   3.5802     0.3567   10.04

Look at the Random effects section of this output. We can grab the Variance numbers to calculate the descriptive Intraclass Correlation Coefficient (ICC): \[\text{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}\] If your between-hospital variance is 0.3679 and your residual variance is 0.4122: \[\text{ICC} = \frac{0.3679}{0.3679 + 0.4122} = 0.472\] Interpretation: \(47.2\%\) of the total, raw variance in patient Quality of Life is due to differences between the hospitals themselves.

A common question to ask is: “How high does the ICC have to be before I am forced to use a multilevel model?” To answer this, you can look at the Design Effect (Deff). The Design Effect estimates how much our standard errors will be underestimated if we mistakenly ignore the clustering. It uses both the ICC and the average cluster size (\(\bar{n}\)):

\[\text{Deff} = 1 + (\bar{n} - 1) \times \text{ICC}\]

Rule of Thumb: If \(\text{Deff} > 2.0\), you must use a multilevel model. If it is less than 2.0, clustering is negligible, and a standard lm() is usually acceptable.

Let’s calculate it for our dataset (where our average cluster size \(\bar{n} = 30\), and our raw \(\text{ICC} = 0.472\)):\[\text{Deff} = 1 + (30 - 1) \times 0.472 = 1 + (29 \times 0.472) = 14.688\]

Our Design Effect is 14.688, which is massively higher than the threshold of 2.0. This tells us that if we were to ignore the hospitals and run a standard OLS regression, our standard errors would be catastrophically deflated, drastically increasing our risk of drawing false-positive conclusions.

As a conclusion, the visual diagnostics and the descriptive measures strongly justify our decision to move forward with a Linear Mixed-Effects Model!

Design Effect: rule of thumb

A common rule of thumb suggests that if \(\text{Deff} > 2.0\), the clustering effect may be small enough that you could bypass multilevel modeling. However, researchers highly recommend avoiding this rule if you have small cluster sizes (fewer than 10 units) or are explicitly interested in the effects of higher-level (Level 2) predictors.

2.6 Long vs Wide Data

When working with repeated measures or longitudinal data in R, the “shape” of your dataset is just as important as the data itself. To plot a spaghetti plot using ggplot2, the data must be in a Long Format.

Here is a breakdown of how these two structures differ and why the distinction matters for visualization.

Wide Format (‘Readable by humans’) In a wide dataset, each subject gets exactly one row, and repeated measurements are spread out across multiple columns. If you were manually entering data into an Excel spreadsheet, this is likely how you would set it up because it is easy for human eyes to read horizontally.

The Problem for ggplot2:
Look at the column names. The actual variable “Time” is split across three different column headers (Baseline, Week_2, Week_4), and the variable “Satisfaction” is scattered as values inside those columns. ggplot2 requires a single column for your X-axis variable and a single column for your Y-axis variable. It cannot read across columns to build a trajectory.

Long Format (‘Readable by software’) In a long dataset, each row represents a single observation at a single timepoint. Instead of growing wider with each new measurement, the dataset grows longer vertically. The three columns of scores from the wide table are collapsed into two dedicated variables: one for the timepoint (Time) and one for the measurement value (Satisfaction).

Why this works:
Now, mapping your variables into aes() is straightforward:

  • x = Time (R looks down the single “Time” column)
  • y = Satisfaction (R looks down the single “Satisfaction” column)
  • group = Employee_ID (R groups rows 1, 2, and 3 together for the first line, and rows 4, 5, and 6 together for the second line)

How to Reshape Data in R The easiest way to covert data from wide–>long or from long–>wide format in R is by using the pivot_longer(), or pivot_wider() function from the tidyr package (part of the tidyverse).

Here is a quick snippet showing how to convert your data into the correct format. You can transform data from the wide format to the long format with the pivot_longer function. In the cols argument, you need to specify which columns in the wide format to transform into a single variable in the long format. In the names_to argument, you can specify the name of the resulting identifier for each value, and in the values_to argument, you can specify the name of the variable in the long format which contains the values:

library(tidyr)

# Converting from Wide to Long
long_data <- wide_data %>%
  pivot_longer(
    cols = c(Baseline, Week_2, Week_4), # The columns you want to collapse
    names_to = "Time",                  # What to name the new column of headers
    values_to = "Satisfaction"          # What to name the new column of data values
  )

The pivot_wider function is used to transform long-format data to the wide format. In the names_from argument, you can specify a variable which identifies the within-subjects levels, which is used to name the resulting new set of dependent variables. In the values_from argument, you specify the variable which contains the values of the new set of dependent variables:

# Converting from Long to Wide
wide_data <- long_data %>%
  pivot_wider(
    names_from = Time,          # The column whose values will become the new headers
    values_from = Satisfaction  # The column whose values will fill the new headers
  )

3 Fitting a multilevel model

When our exploratory visual and descriptive inspections reveal clear clustering, they expose a fundamental violation of standard statistical assumptions. Traditional OLS linear regression relies on the strict assumption that all observations are independent of one another. However, when data has a hierarchical or nested structure, observations within the same group are naturally correlated, meaning we no longer have independent samples.

Ignoring this grouping and running a standard linear regression can artificially deflate standard errors, leading to a drastically high rate of false positives (Type I errors) and misleading conclusions. To properly account for this dependency without losing statistical power, we must transition from simple linear regression to a hierarchical framework.

Multilevel modeling solves this by splitting the model into two distinct components: Fixed Effects and Random Effects. This is why the package we use is called lme4 (Linear Mixed-Effects models).

Fixed Effects (The “Grand Trends”)
Fixed effects are the systematic, overall relationships in your data that you expect to generalize to the whole population. They are exactly what you are used to seeing in regular lm() outputs.
* Example: Does a new math curriculum improve test scores on average across the entire state?
Random Effects (The “Cluster Deviations”)
Random effects capture the grouped noise or variation across our clusters. Instead of estimating a separate regression line for every single observation (which would use up all our statistical power), we assume that our clusters are drawn randomly from a larger population, and their variations follow a normal distribution around the fixed effect.

We generally look at two types of random effects:

  • Random Intercepts: We allow the baseline starting point (the intercept) to differ from cluster to cluster. (e.g., Some schools just start with higher average math scores than others.)
  • Random Slopes: We allow the relationship between a predictor and the outcome (the slope) to vary from cluster to cluster. (e.g., The new curriculum works incredibly well in some schools, but has a flat or negative effect in others.)

3.1 Enter lme4 - Syntax Structure

Before we can build multilevel models, we need to make sure the necessary tools are installed and ready to use in your R session. We will be using the lme4 package to fit our models. If you have never used this package on your computer before, you need to install it first. Run the following code in your R Console.

install.packages("lme4")

Note that you only need to install a new package once. However, every time you open a new R session, you must explicitly load the packages into R’s active memory using the library() function.

library(lme4)

In R, the lme4 package extends the formula syntax you already know from regular linear models.

A standard regression looks like this:

lm(outcome ~ predictor, data = my_data)

In lme4, we use the lmer() function (Linear Mixed Effects Regression) and add our random effects inside parentheses, separated by a vertical bar |:

lmer(outcome ~ predictor + (random_effects | cluster_variable), data = my_data)

The vertical bar | can be read as “conditional on” or “grouped by”. Here is a quick cheat sheet for the syntax you will be learning to write in this guide:

Syntax:
lmer(outcome ~ predictor + (random_effects | cluster_variable), data = my_data)

Where:

  • Random Intercept: y ~ x + (1 | school) - The baseline y varies by school, but the effect of x is identical for everyone.
  • Random Intercept & Slope: y ~ x + (1 + x | school) - Both the baseline y and the impact of x vary depending on which school you are in.
  • Nested Random Effects: y ~ x + (1 | school/classroom) - Used when your groups are cleanly tucked inside one another. For example, students are nested inside classrooms, and those classrooms only exist within specific schools. This tells R to calculate a random intercept for each school, and a unique random intercept for each classroom within that school.
  • Crossed Random Effects: y ~ x + (1 | student_ID) + (1 | subject) - Used when your grouping variables are completely independent rather than nested. For instance, every student takes multiple subjects (Math, Science, History), and every subject is taken by multiple students. Subjects are not “owned” by a single student, so R estimates independent random variations for both factors simultaneously.
  • Random Intercept Only (No Predictors): y ~ 1 + (1 | school) - This fits an unconditional baseline model (or null model). It calculates the grand mean of \(y\) (represented by the 1) and checks how much that baseline fluctuates across schools before you add any actual explanatory variables.

4 Interpreting the Model Summary

Once you have successfully fit your model using lmer(), almost everything you need to evaluate your hypotheses, check your variance components, and report your results is contained within the output of the summary() command.

Let’s look at a typical output from summary(my_model). Fitting the model with a random intercept for Hospital;

fit_hospital <- lmer(Life_Quality ~ Symptom_Severity + (1 | Hospital), data = hospital_data)

# View the full summary
summary(fit_hospital)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: Life_Quality ~ Symptom_Severity + (1 | Hospital)
#>    Data: hospital_data
#> 
#> REML criterion at convergence: 178.3
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -1.8966 -0.6978 -0.1927  0.4995  2.4279 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  Hospital (Intercept) 2.9357   1.7134  
#>  Residual             0.3594   0.5995  
#> Number of obs: 90, groups:  Hospital, 3
#> 
#> Fixed effects:
#>                  Estimate Std. Error t value
#> (Intercept)        5.6898     1.1629   4.893
#> Symptom_Severity  -0.4316     0.1244  -3.469
#> 
#> Correlation of Fixed Effects:
#>             (Intr)
#> Symptm_Svrt -0.523

Here is a breakdown of the four key sections you see in your console.

4.1 The Model Formula and Fit Statistics

At the very top, R reminds you of the exact model structure you ran and provides metrics showing how well the model fits the data.

Linear mixed model fit by REML ['lmerMod']
Formula: Life_Quality ~ Symptom_Severity + (1 | Hospital)
   Data: data

REML criterion at convergence: 178.3

REML criterion / Deviance: This is a measure of lack-of-fit. Lower numbers mean a better fit. You will use these numbers later if you perform Likelihood Ratio Tests to compare different model structures.

4.2 Scaled Residuals

This section gives you a quick snapshot of your Level-1 (patient-level) residuals to help you check if your errors are balanced.

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8966 -0.6978 -0.1927  0.4995  2.4279 

What to look for: In a perfectly normal distribution, the median should be very close to 0, and the first quartile (1Q) and third quartile (3Q) should be roughly symmetrical. If the Min or Max are extremely large (e.g., beyond \(\pm 3\)), it flags that you have influential outliers to investigate.

4.3 Random Effects (The Group Variance Component)

This is the most unique section of a multilevel summary. It tells you how much variance in Quality of Life lives between the hospitals versus within the hospitals.

Random effects:
 Groups   Name        Variance Std.Dev.
 Hospital (Intercept) 2.9357   1.7134  
 Residual             0.3594   0.5995  
Number of obs: 90, groups:  Hospital, 3

Groups (Hospital - Intercept): This tells you how much the baseline hospital averages vary around the grand mean. The variance is 2.9357. Residual: This is the Level-1 variance, i.e. how much individual patients vary within the same hospital. The variance is 0.3594. Number of obs: Confirms your sample size (90 patients nested within 3 hospitals).

Calculating the ICC: You can use these numbers to calculate the Intraclass Correlation Coefficient manually: \[\text{ICC} = \frac{2.9357}{2.9357 + 0.3594} = 0.891\] This tells us that \(89.1\%\) of the total variance in Quality of Life scores is due to which hospital the patient attended, completely justifying our use of a multilevel model!

5 Parameter Inference: Testing the Significance of Fixed Effects

In a standard linear regression (lm()), testing whether a predictor is statistically significant is straightforward. The software divides the estimate by its standard error to get a t-statistic, looks up the p-value using a fixed number of degrees of freedom based on your sample size (n - k), and prints a p-value.

In multilevel modeling, this process breaks down. Because, in case of the Hospital data for example, the 90 patients are nested within 3 hospitals, our data points are not entirely independent. Are our degrees of freedom based on the number of patients (90), the number of hospitals (3), or something in between?

Because there is no single, mathematically correct way to calculate degrees of freedom in an MLM, standard lme4 refuses to guess and omits p-values entirely. To perform parameter inference on our fixed effects, we have three primary tools at our disposal.

5.1 Method 1: Wald Tests (with df Corrections)

The fastest and most common way to test fixed effects is to use Wald t-tests or F-tests. Since standard lme4 won’t give you p-values due to the degrees of freedom problem, the lmerTest package is introduced. In this package, the \(df\)-problem is solved by approximating the effective degrees of freedom behind the scenes.

Note that in order to use the lmerTest package, you first need to install the package (only once) and load it into the environment (everytime you start a new session):

# install.packages("lmerTest")

library(lmerTest)
#> Warning: package 'lmerTest' was built under R version 4.5.3
#> 
#> Attaching package: 'lmerTest'
#> The following object is masked from 'package:lme4':
#> 
#>     lmer
#> The following object is masked from 'package:stats':
#> 
#>     step

To get p-values, we use the lmerTest package. When loaded, lmerTest runs behind the scenes to estimate the effective degrees of freedom based on the grouping structure and variance components of your model.

The two most common methods are:
1. Satterthwaite Approximation (Default): Highly accurate, computationally fast, and excellent for mid-to-large datasets.
2. Kenward-Roger Approximation: An extra-robust correction that adjusts both the degrees of freedom and the standard errors. It is highly recommended for small sample sizes (e.g., fewer than 30–50 clusters).

Let’s see how our model summary changes when we apply these methods to the hospital data.

# Re-run our hospital model (lmerTest automatically adds p-values to the summary)
fit_wald <- lmer(Life_Quality ~ Symptom_Severity + (1 | Hospital), data = hospital_data)

# View standard summary (Uses Satterthwaite by default)
summary(fit_wald)
#> Linear mixed model fit by REML. t-tests use Satterthwaite's method [
#> lmerModLmerTest]
#> Formula: Life_Quality ~ Symptom_Severity + (1 | Hospital)
#>    Data: hospital_data
#> 
#> REML criterion at convergence: 178.3
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -1.8966 -0.6978 -0.1927  0.4995  2.4279 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  Hospital (Intercept) 2.9357   1.7134  
#>  Residual             0.3594   0.5995  
#> Number of obs: 90, groups:  Hospital, 3
#> 
#> Fixed effects:
#>                  Estimate Std. Error      df t value Pr(>|t|)    
#> (Intercept)        5.6898     1.1629  3.1458   4.893 0.014608 *  
#> Symptom_Severity  -0.4316     0.1244 80.3377  -3.469 0.000841 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Correlation of Fixed Effects:
#>             (Intr)
#> Symptm_Svrt -0.523

# Alternatively, view an ANOVA-style table using Kenward-Roger corrections
anova(fit_wald, ddf = "Kenward-Roger")
#> Type III Analysis of Variance Table with Kenward-Roger's method
#>                  Sum Sq Mean Sq NumDF  DenDF F value   Pr(>F)   
#> Symptom_Severity   3.82    3.82     1 80.835  10.628 0.001632 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unlike the Likelihood Ratio Test (which requires REML = FALSE), Wald tests can be performed on models fit with REML (REML = TRUE). Because Wald tests rely heavily on accurate standard errors, and REML provides the most unbiased standard errors for multilevel data, keeping REML turned on ensures your Satterthwaite and Kenward-Roger p-values are as precise as possible.


5.2 Method 2: Likelihood Ratio Tests (LRT)

Instead of looking at a single coefficient’s t-statistic, a Likelihood Ratio Test evaluates a predictor by comparing two models:
* A Full Model containing your predictor(s)
* A Reduced Model that leaves out the predictors you would like to test.

We look to see if dropping the variable significantly worsens the model’s overall fit.

# 1. Fit a Reduced Model (Without Symptom_Severity) using ML
model_red <- lmer(Life_Quality ~ 1 + (1 | Hospital), data = hospital_data, REML = FALSE)

# 2. Fit the Full Model (With Symptom_Severity) using ML
model_full <- lmer(Life_Quality ~ Symptom_Severity + (1 | Hospital), data = hospital_data, REML = FALSE)

# 3. Compare them using a Chi-Square Likelihood Ratio Test
anova(model_red, model_full)
#> Data: hospital_data
#> Models:
#> model_red: Life_Quality ~ 1 + (1 | Hospital)
#> model_full: Life_Quality ~ Symptom_Severity + (1 | Hospital)
#>            npar    AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)   
#> model_red     3 190.41 197.91 -92.205    184.41                        
#> model_full    4 185.54 195.54 -88.768    177.54 6.8724  1   0.008754 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How to interpret: If the p-value (Pr(>Chisq)) is \(< 0.05\), it means adding Symptom_Severity significantly improves the model’s ability to explain the data.

The Golden Rule for Fixed Effect LRTs:

Because you are comparing models that differ in their fixed effects, both models must be fit using Maximum Likelihood (REML = FALSE), not the default REML.


5.3 Method 3: Parametric Bootstrapping

When your sample size is very small, or your data behaves strangely, both Wald approximations and LRTs can become mathematically unreliable. In these scenarios, Parametric Bootstrapping is considered the gold standard.

Instead of relying on theoretical mathematical curves, R will use your actual model’s estimates to simulate thousands of artificial datasets. It then fits the model to these simulated datasets to build an empirical distribution of your test statistic.

We can implement this safely and efficiently using the pbkrtest package.

#install.packages('pbkrtest')

library(pbkrtest)
#> Warning: package 'pbkrtest' was built under R version 4.5.3

# 1. Fit a Reduced Model (Without Symptom_Severity) using ML
model_red <- lme4::lmer(Life_Quality ~ 1 + (1 | Hospital), data = hospital_data, REML = FALSE)

# 2. Fit the Full Model (With Symptom_Severity) using ML
model_full <- lme4::lmer(Life_Quality ~ Symptom_Severity + (1 | Hospital), data = hospital_data, REML = FALSE)

# Run a parametric bootstrap comparing our null and full models
# (Note: We use nsim=500 here for speed, but use 1000+ for final publications!)
bootstrap_test <- PBmodcomp(model_full, model_red, nsim = 500)

# View the simulation results
summary(bootstrap_test)
#> Bootstrap test; time: 10.34 sec; samples: 500; extremes: 54;
#> large : Life_Quality ~ Symptom_Severity + (1 | Hospital)
#>            stat     df    ddf  p.value   
#> LRT      6.8724 1.0000        0.008754 **
#> PBtest   6.8724               0.109780   
#> Gamma    6.8724               0.087356 . 
#> Bartlett 2.5887 1.0000        0.107628   
#> F        6.8724 1.0000 3.2086 0.073592 . 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How to interpret: Look for the row labeled ‘PBtest’. It provides a p-value based purely on the percentage of simulations where the null model outperformed the full model by chance.

Critical Syntax Warning: Package Conflicts with lmer()

When performing a Likelihood Ratio Test (anova()) or Parametric Bootstrapping (PBmodcomp()), you might encounter a cryptic error that looks like this: Error in as_lmerModLmerTest(res) : Unable to extract deviance function from model fit

This happens because lmerTest modifies the underlying structure of the model to calculate p-values, which can accidentally corrupt the math needed for model comparisons.

Whenever you are fitting models specifically for a Likelihood Ratio Test or Bootstrapping, explicitly force R to use the original, clean function from the lme4 package by typing lme4::lmer() instead of just lmer().


6 Model Selection

When working with nested or longitudinal data, one of the most critical steps in the analysis is determining the optimal model structure. Model selection in a multilevel framework involves finding the right balance between goodness-of-fit and model parsimony. If we underfit by omitting important grouping structures or predictors, our standard errors will be biased, and our Type I error rate will inflate. If we overfit by adding unnecessary random effects, we risk model non-convergence and a loss of statistical power.

Systematic model selection generally follows a two-tiered approach:

1. Selecting the Random Effects Structure: We evaluate which grouping factors (e.g., schools, patients, or clinics) require random intercepts or slopes. Nested variance components are formally compared using LRT, non-nested variance components can be compared using AIC or BIC

2. Selecting the Fixed Effects Structure: Once the random baseline structure is stabilized, we evaluate which explanatory variables or interaction terms should remain in the model. We compare these structural variations using Information Criteria such as AIC and BIC.

A fundamental rule when running these comparisons in R is the choice of estimation method:

  • Testing Random Effects: Models must be fit using Restricted Maximum Likelihood (REML = TRUE) because it provides unbiased estimates of variance parameters.

  • Testing Fixed Effects: Models must be switched to Maximum Likelihood (ML / REML = FALSE). Likelihood Ratio Tests comparing different fixed predictors are mathematically invalid under REML because the underlying mathematical transformation alters the likelihood scale between differing fixed structures.

To demonstrate this selection workflow in practice, we will use longitudinal data from The Diabetes Project Leuven. This study tracks diabetes patients over time to monitor how well their blood sugar levels are being managed.

6.1 Selecting the Random effects

To find the best model for these longitudinal trajectories, we will systematically test four candidate variance structures: a baseline model with no random effects, a model with random GP effects, a model with random patient effects, and a fully nested model containing random effects for both GPs and patients.

# 1. Load the dataset
diabetes <- read.table("diabetes.txt", header = TRUE, sep = "\t")

# 2. Fit Candidate Models (Using REML = TRUE to test random structures)

# Model 1: Linear Regression baseline (No random effects)
mod_ols <- lm(hba1c ~ time + bmi0, data = diabetes)

# Model 2: Random GP intercepts only
mod_gp <- lmer(hba1c ~ time + bmi0 + (1 | mdnr), 
               data = diabetes, REML = TRUE)

# Model 3: Random Patient intercepts only 
mod_patient <- lmer(hba1c ~ time + bmi0 + (1 | md_patient), 
                    data = diabetes, REML = TRUE)

# Model 4: Fully Nested Random Intercepts (Patients nested within GPs)
mod_nested <- lmer(hba1c ~ time + bmi0 + (1 | mdnr) + (1 | md_patient), 
                   data = diabetes, REML = TRUE)

To determine if adding a random layer is statistically justified, we evaluate whether a more complex model significantly improves the log-likelihood compared to a simpler, nested model.

It is vital to remind you: we cannot use a Likelihood Ratio Test to compare mod_gp directly to mod_patient. Because neither model is a subset of the other, they are considered non-nested models. To compare non-nested structures, we must rely strictly on Information Criteria (AIC and BIC).

However, both mod_gp and mod_patient are cleanly nested within our fully integrated model (mod_nested). This allows us to run two separate Likelihood Ratio Tests to evaluate whether adding a second clustering layer is mathematically justified:

# Test 1: Does adding patient-level variance to a GP-only model improve fit?
anova(mod_gp, mod_nested)
#> refitting model(s) with ML (instead of REML)
#> Data: diabetes
#> Models:
#> mod_gp: hba1c ~ time + bmi0 + (1 | mdnr)
#> mod_nested: hba1c ~ time + bmi0 + (1 | mdnr) + (1 | md_patient)
#>            npar    AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)    
#> mod_gp        5 8286.8 8316.4 -4138.4    8276.8                         
#> mod_nested    6 7911.2 7946.7 -3949.6    7899.2 377.62  1  < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Test 2: Does adding GP-level variance to a patient-only model improve fit?
anova(mod_patient, mod_nested)
#> refitting model(s) with ML (instead of REML)
#> Data: diabetes
#> Models:
#> mod_patient: hba1c ~ time + bmi0 + (1 | md_patient)
#> mod_nested: hba1c ~ time + bmi0 + (1 | mdnr) + (1 | md_patient)
#>             npar    AIC    BIC  logLik -2*log(L)  Chisq Df Pr(>Chisq)    
#> mod_patient    5 7936.5 7966.2 -3963.3    7926.5                         
#> mod_nested     6 7911.2 7946.7 -3949.6    7899.2 27.349  1  1.698e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • From Test 1 (mod_gp vs. mod_nested): We conclude that even after accounting for differences between GGPs, individual patients within those practices still vary significantly from one another in their baseline HbA1c levels. Omitting the random patient effect completely ignores this massive layer of variation, which would violate our independence assumptions and dangerously deflate our standard errors.
  • From Test 2 (mod_patient vs. mod_nested): We conclude that even after accounting for the fact that individual patients differ, the specific GP managing them introduces an additional, unique layer of institutional variance. Patients are not completely independent islands; they share common practice-level environments or treatment approaches that must be statistically accounted for.

A significant p-value (p < 0.05) in both of these tests allows us to draw a definitive conclusion about the nesting structure of our data: the fully nested model (mod_nested) is statistically superior to both single-level models.

Additionally, to evaluate which model offers the best balance between model fit and complexity, we can directly extract and compare the AIC and BIC for all candidate models.

The code below compiles these parameters alongside the Log-Likelihood metrics into a single summary table for direct comparison:

# Compile Information Criteria Summary Table
model_comparison <- data.frame(
  Model = c("OLS (No Random)", "Random GP Only", "Random Patient Only", "Fully Nested (GP + Patient)"),
  AIC   = c(AIC(mod_ols), AIC(mod_gp), AIC(mod_patient), AIC(mod_nested)),
  BIC   = c(BIC(mod_ols), BIC(mod_gp), BIC(mod_patient), BIC(mod_nested)),
  LogLik = c(as.numeric(logLik(mod_ols)), as.numeric(logLik(mod_gp)), 
             as.numeric(logLik(mod_patient)), as.numeric(logLik(mod_nested)))
)
print(model_comparison)
#>                         Model      AIC      BIC    LogLik
#> 1             OLS (No Random) 8354.664 8378.382 -4173.332
#> 2              Random GP Only 8305.148 8334.796 -4147.574
#> 3         Random Patient Only 7956.128 7985.776 -3973.064
#> 4 Fully Nested (GP + Patient) 7929.847 7965.424 -3958.923

As you can see, to compute the AIC, BIC, and/or log-likelihood of a model, you can simply use the commands AIC(model_name), BIC(model_name), logLik(model_name).

Even though the fully nested model (Fully Nested (GP + Patient)) adds more parameters to the model, which triggers a penalty in both information criteria formulas, it still achieves the lowest overall AIC and BIC values, alongside the highest (least negative) Log-Likelihood.

In statistical modeling, it is common for AIC and BIC to occasionally disagree because BIC applies a much harsher penalty for complexity based on sample size. However, when both criteria point to the exact same model as the optimal choice, it provides the ultimate statistical justification:

By combining our findings from the LRTs and this Information Criteria table, we can confidently conclude that the fully nested model (mod_nested) is the mathematically optimal framework for analyzing these longitudinal diabetes trajectories. It successfully accounts for the fact that a patient’s HbA1c progression is simultaneously constrained by their own unique physiological baseline as well as the overarching clinical environment of their managing General Practitioner.

6.2 Selecting the Fixed effects

Once the optimal random effects structure is established, the next phase of model selection focuses on identifying which fixed predictors or covariates should remain in the model. While random components capture grouped noise and hierarchical variance, fixed effects quantify the systematic, population-level relationships we want to generalize.

When testing alternative sets of fixed effects, remember the golden rule of mixed-model specification: we must switch the estimation method to Maximum Likelihood (ML) by setting REML = FALSE. Likelihood Ratio Tests comparing models with different fixed components are mathematically invalid under REML because the underlying mathematical transformation alters the likelihood scale when the fixed structure changes.

To find the most accurate and parsimonious fixed structure for predicting HbA1c over time, we build upon our winning fully nested random structure by adding covariates at both organizational levels:

  • GP Level Covariate: practice (The operational form of the practice: ‘one’, ‘two’, or ‘more’ practitioners).

  • Patient Level Covariates: bmi0 (Body Mass Index at baseline) and new0 (Whether the patient is a newly diagnosed diabetic: 1 = Yes, 0 = No).

In longitudinal studies, simply adding baseline covariates as fixed effects assumes that their impact on the outcome remains perfectly constant across the entire study. However, variables like a patient’s baseline BMI or whether they are newly diagnosed often change how a patient progresses over time. To test whether these patient characteristics alter the trajectory of blood sugar control, we must introduce interaction terms with our time variable (time). When adding interaction terms, we are evaluating if slopes differ across fixed groups.

# Model A: Adding GP-level practice form
mod_fixed_practice <- lmer(hba1c ~ time + practice + time:practice + (1 | mdnr) + (1 | md_patient), 
                           data = diabetes, REML = FALSE)

# Model B: Adding patient-level baseline BMI
mod_fixed_bmi <- lmer(hba1c ~ time + bmi0 + time:bmi0 + (1 | mdnr) + (1 | md_patient), 
                      data = diabetes, REML = FALSE)

# Model C: Adding patient-level new diagnosis status
mod_fixed_full <- lmer(hba1c ~ time + new0 + time:new0 + (1 | mdnr) + (1 | md_patient), 
                       data = diabetes, REML = FALSE)

fixed_comparison <- data.frame(
  Model  = c("Model A: Practice Form", "Model B: Baseline BMI", "Model C: New Diagnosis"),
  AIC    = c(AIC(mod_fixed_practice), AIC(mod_fixed_bmi), AIC(mod_fixed_full)),
  BIC    = c(BIC(mod_fixed_practice), BIC(mod_fixed_bmi), BIC(mod_fixed_full)),
  LogLik = c(as.numeric(logLik(mod_fixed_practice)), as.numeric(logLik(mod_fixed_bmi)), as.numeric(logLik(mod_fixed_full)))
)
print(fixed_comparison)
#>                    Model      AIC      BIC    LogLik
#> 1 Model A: Practice Form 8599.267 8653.228 -4290.634
#> 2  Model B: Baseline BMI 7911.544 7953.051 -3948.772
#> 3 Model C: New Diagnosis 7911.849 7953.385 -3948.924

Explanation of the code:

fixed_comparison <- data.frame(...)

What it does: Gathers the AIC, BIC, and Log-Likelihood metrics for all three models into a single table.
Crucial detail: Since these three models test entirely different covariates rather than being sequentially nested within each other, they cannot be compared using a LRT. Instead, we evaluate them concurrently using AIC and BIC. The model with the lowest overall information criteria values represents the most statistically optimal fixed structure for explaining the longitudinal diabetes trends.

When evaluating our summary output, we can see that the AIC and BIC values for Model B (Baseline BMI) and Model C (New Diagnosis) are remarkably close, indicating that both models offer an essentially equivalent statistical fit to the longitudinal data. Crucially, both configurations achieve substantially lower information criteria scores than Model A (Practice Form), demonstrating that accounting for these patient-level characteristics is a far superior strategy than focusing on the GP-level practice structure alone. Consequently, selecting either Model B or Model C serves as an excellent, statistically grounded foundation for your analysis.

It is worth noting that while this example focuses on a targeted set of variables, you can easily extend this exact workflow to test a much larger array of candidate predictors and interactions; regardless of how many variables you introduce, the core principle of minimizing AIC and BIC to find the most balanced, parsimonious model remains completely the same.

6.3 Visualizing the final model

Plotting your model’s predictions directly on top of your raw data makes it much easier to understand your results. Instead of just looking at abstract numbers in a summary table, you can see exactly how the trends for different groups diverge over time.

The R code below uses ggplot2 to plot the individual data points from the diabetes dataset and overlay the straight “grand trend” lines calculated by Model C (New Diagnosis Status).


# 1. Fit the chosen model (using REML = TRUE for final plotting)
mod_final <- lmer(hba1c ~ time + new0 + time:new0 + (1 | mdnr) + (1 | md_patient), 
                  data = diabetes, REML = TRUE, na.action = na.exclude)

# 2. Add population-level predictions to the dataset
# re.form = NA tells R to predict using ONLY fixed effects (the grand population trend)
predicted <- predict(mod_final, re.form = NA)

# 3. Convert the 0/1 variable to clear text labels for the legend
diabetes$new0_factor <- factor(diabetes$new0, labels = c("Established Patient", "Newly Diagnosed"))

# 4. Create the overlay plot
ggplot(diabetes, aes(x = time, y = hba1c, color = new0_factor)) +
  geom_point(alpha = 0.3) +
  geom_line(aes(y = predicted), linewidth = 1.2) +
  labs(
    title = "HbA1c Trajectories by Diagnosis Status",
    x = "Time",
    y = "HbA1c (%)",
    color = "Patient Group"
  )
#> Warning: Removed 178 rows containing missing values or values outside the scale range
#> (`geom_line()`).

Step-by-Step code xxplanation:

predict(mod_final, re.form = NA, na.action = na.exclude) What it does: This is the most important part. It tells R to calculate the expected HbA1c values based only on the fixed effects. By ignoring the random variations of individual patients and doctors, it gives us the clean, overall population trends. Crucial detail: Note that the option na.action = na.exclude was added to the model fitting procedure. This is due to the fact that the predict() function will automatically skip any rows in the dataset that contain missing values, meaning the raw dataset contains more observations than the predicted outcomes, therefore causing problems for plotting the figure later.

factor(..., labels = ...)
What it does: Changes the raw 0 and 1 values into readable categories (“Established Patient” and “Newly Diagnosed”) so your plot legend is easy to read.

geom_point(alpha = 0.3) What it does: Plots the actual raw data points. Setting alpha = 0.3 makes the dots semi-transparent so they do not crowd the plot or hide the main lines.

geom_line(aes(y = predicted), linewidth = 1.2)
What it does: Draws the final lines calculated by your model. By setting y = predicted, it draws the straight trend lines representing each group over time.

Conclusion
The resulting plot visually demonstrates what the interaction term time:new0 actually means. Instead of the two lines running perfectly parallel, they have noticeably different slopes. For instance, you see that the line for “Newly Diagnosed” drops much more sharply than the line for “Established Patients.” This clearly shows that newly diagnosed individuals experience a much faster rate of blood sugar reduction over the course of the study. —

7 Model Assummptions: Check the fit

Just like standard linear regression, linear mixed-effects models rely on specific statistical assumptions to ensure your p-values, standard errors, and confidence intervals are valid. Because mixed models split variance into multiple layers, we have to check assumptions for both the residual errors and the random effects.

We primarily evaluate these assumptions by creating diagnostic plots of the final model’s residuals.

Key Model Assumptions: * Independence: The error terms should be independent.
* Linearity: The relationship between the predictors and the outcome must be linear.

  • Homoscedasticity (Equal Variance): The residual variance should remain constant across all predicted values.

  • Normality of Residuals: The individual-level errors (residuals) should be normally distributed.

  • Normality of Random Effects: The random intercepts for both patients and GPs should follow a normal distribution around the population average.

Before you begin running any final model checks or interpreting your diagnostic plots, it is critical to ensure that your final model is fit using Restricted Maximum Likelihood (REML = TRUE).

While we temporarily switched to Maximum Likelihood (REML = FALSE) to fairly compare different combinations of fixed predictors, ML estimation tends to underestimate the variance components in your data. This happens because ML calculates variance without accounting for the degrees of freedom lost by estimating the fixed effects.

Switching back to REML = TRUE for your final model fixes this bias, providing accurate, reliable estimates for your patient-level and GP-level random effects. Because all residual diagnostics and random-effect calculations rely on these variance components, your assumption checks will only be accurate if they are performed on the REML-fitted version of your final model.

In the following subsection, the R code that extracts the residuals and random effects from your final model and generates the necessary plots to verify these assumptions are discussed.

7.1 Testing Equal Variance & Linearity - Residual vs fitted values plots

# Residuals vs. Fitted Values Plot (Tests Equal Variance)
plot(mod_final, 
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals")

plot(mod_final) What it does: This creates a scatter plot of the model’s residuals on the y-axis against the predicted (fitted) values on the x-axis.

Evaluating Linearity: In the Residuals vs. Fitted plot, you look for a random pattern in the points. Any deviation from a randomly scattered pattern, for example a parabola or S-shape, indicates a violation of the linearity assumption.

In this case we see an ‘exponential’ shape in the points, indicating that there might be a higher order effect that we’re missing.

Evaluating Equal Variance: To verify equal variance, look at the spread of the points; the vertical cloud of dots should look like a random, evenly distributed band from left to right. If the cloud fans out (gets wider on one side), your variance is unequal.

In this case we see a clear fanning effect, where the variance of the residuals (i.e. the deviation from the 0-line) gets biggers as the fitted values get bigger. This indicates that the assumption of equal variance might be violated.

7.2 Testing Normality - QQ-plot

# Q-Q Plot of Residuals (Tests Normality of Residuals)
qqnorm(resid(mod_final), main = "Normal Q-Q Plot: Residuals")
qqline(resid(mod_final), col = "red")

resid(mod_final) What is does: Extracts the individual-level error for every single observation in the dataset.

qqnorm() and qqline()
What it does: Generates a Quantile-Quantile plot. It compares the actual distribution of your residuals or random effects against a theoretical perfect normal distribution.

Evaluating Normality:
In the Normal Q-Q Plots, your data points should closely follow the straight red reference line. If the points curve away from the line at the ends (resembling an “S” or a “J” shape), it indicates that your residuals are skewed or heavy-tailed, meaning the normality assumption is violated. If all your plots look clean, you can safely trust your model’s statistical conclusions.

In this case we see a clear S-shape in the QQ-plot, indicating that the assumption of normality is not fulfilled.

7.3 Checking outliers - Emprical Bayes estimates

When we extract the random effects from a mixed model, we are looking at the Empirical Bayes (EB) estimates (also called Best Linear Unbiased Predictions, or BLUPs). These values represent the unique structural deviations for each group, in our case, how much an individual patient’s baseline or a specific GP’s practice shifts away from the overall population average.

Analyzing these estimates is an excellent way to check for two critical issues:

Outliers: Group-level clusters (specific GPs or patients) that behave completely differently from the rest of the sample.

Multicollinearity or Structural Confounding: Checking if the patient-level random adjustments are unexpectedly correlated with, or heavily dependent on, the higher-level GP random adjustments.

# Extract Empirical Bayes estimates (random effects)
eb_estimates <- ranef(mod_final)

# Histogram to check for Patient-level outliers
ggplot(mapping = aes(x = eb_estimates$md_patient[[1]])) +
  geom_histogram(bins = 30, fill = "lightblue", color = "black") +
  labs(title = "Distribution of Patient Random Effects", x = "Patient Intercept Deviation")

# Histogram to check for GP-level outliers
ggplot(mapping = aes(x = eb_estimates$mdnr[[1]])) +
  geom_histogram(bins = 15, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of GP Random Effects", x = "GP Intercept Deviation")

# Scatterplot: Patient Effects vs. GP Effects
# First, we link each patient's random effect to their specific GP's random effect
mapping_df <- unique(diabetes[, c("md_patient", "mdnr")])
mapping_df <- na.omit(mapping_df)

mapping_df$patient_eb <- eb_estimates$md_patient[as.character(mapping_df$md_patient), 1]
mapping_df$gp_eb      <- eb_estimates$mdnr[as.character(mapping_df$mdnr), 1]

# Plot the relationship between the levels
ggplot(mapping_df, aes(x = gp_eb, y = patient_eb)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Patient Effects vs. GP Effects", x = "GP Random Effect", y = "Patient Random Effect")
#> `geom_smooth()` using formula = 'y ~ x'
#> Warning: Removed 105 rows containing non-finite outside the scale range
#> (`stat_smooth()`).
#> Warning: Removed 105 rows containing missing values or values outside the scale range
#> (`geom_point()`).

Step-by-Step code explanation:

ranef(mod_final) What this does: This extracts the Empirical Bayes estimates. It returns a list containing a data frame for each grouping factor (md_patient and mdnr), showing the calculated intercept shift for each individual ID. We can extract the respective random effects by using the $ approach, similarly to how you extract a variable out of a dataset.

geom_histogram() What this does: We plot these shifts to ensure they follow a clean, bell-shaped distribution.

eb_estimates$md_patient[[1]] What it does: Uses the $ sign to jump into the list of random effects and pulls out the exact column of patient intercepts

mapping_df matching What this does: Because patients are nested within GPs, this step aligns each patient’s unique random intercept side-by-side with the random intercept of the GP they belong to.

geom_smooth(method = "lm") What this does: Adds a linear trend line to the scatterplot to visually check if a patient’s random effect can be predicted by their doctor’s random effect.

Interpretation
* How to spot outliers: Look closely at the edges of both histograms. If you see a lone bar sitting far away from the rest of the data (for example, a single GP with an intercept deviation of +3.5 while everyone else is between -1 and +1), that group is an outlier. It means that specific doctor or patient has an exceptionally unusual baseline HbA1c profile that the model’s fixed effects cannot explain.

  • How to interpret the Patient vs. GP Scatterplot: Because mixed models assume that variance components at different levels are completely independent, you want to see a random cloud of points with a flat red line. * If the scatterplot shows a completely random distribution, your model successfully separated patient-level noise from doctor-level noise.

    • If you see a distinct diagonal pattern (a steep positive or negative red line), it means the patient effects are highly correlated with the GP effects. This points to a structural confounding issue—meaning your model is struggling to differentiate whether the variance in HbA1c belongs to the individual patient’s physiology or to the overarching treatment style of the clinic.
      We notice some patients with extremely large HbA1c values.The largest estimated effect = 3.45. If we look back into the data, this patient was newly diagnosed, with initial BMI equal to 26.40, initial HbA1c equal to 14.3%, and no follow-up measurement after one year. It is therefore worthwhile to repeat the analysis with this subject removed from the data.
# Check which patient has an Empirical Bayes estimate larger than 3
which(eb_estimates$md_patient >3)
#> [1] 489

# Get the patient ID
patID <- rownames(eb_estimates$md_patient)[which(eb_estimates$md_patient >3)]

# Get the values of the variables for this patient ID
diabetes[diabetes$md_patient==patID,]
#>      hba1c     bmi mdnr patientnr md_patient group pat_week practice time
#> 2462  14.3 26.3958  140         2      140-2     A      125      One    0
#>      duration target new weightclass new0    bmi0 weightclass0     new0_factor
#> 2462        0      0   1           2    1 26.3958            2 Newly Diagnosed