Mixed models

Author

Chathura Edirisuriya

What is a fixed effect model:

In a fixed effects model, we estimate specific, average effects of predictor variables on dependent variable across the population (or data set).

  • For a continuous fixed effect (like temperature), we estimate: the average change in the response (e.g., dengue cases) for a 1-unit increase in temperature, across all observations.

  • For a categorical fixed effect (like district, when treated as fixed), we estimate the mean effect (intercept) for each level, relative to a reference level.

Type of Fixed Effect What we estimate Is it a “mean effect”?
Continuous (e.g., Temp) One slope \(\beta\): average effect across all data Yes — average effect per unit
Categorical (e.g., District) One intercept per level (if included as fixed effect) Yes — group-specific means

What is a random effect model?

A random effect is a grouping variable where you believe each group has its own effect, but you’re not interested in estimating each group’s effect individually. Instead, you assume these effects are randomly drawn from a larger population.

Think of random effects as the background characters — they matter, but you’re not studying them individually. You’re just accounting for their variation. What we actually estimate are random effect parameters associated with levels of a grouping variable, under the assumption that these levels are random samples from a population/distribution.

Contrast with random effects

  • Fixed effects estimate specific effects per variable or group.
  • Random effects assume group effects vary and are drawn from a distribution — and we estimate the mean and variance of that distribution, not the effect for each level as a parameter.

What is a mixed model?

A mixed model, also known as a mixed-effects model, is a statistical framework that combines both fixed effects and random effects within a single analysis. This approach is particularly valuable when dealing with data that exhibit correlations or dependencies, such as measurements clustered within groups or repeated observations over time.

library(lme4)
library(ggplot2)
library(dplyr)
data("sleepstudy")
head(sleepstudy, n = 10)
   Reaction Days Subject
1  249.5600    0     308
2  258.7047    1     308
3  250.8006    2     308
4  321.4398    3     308
5  356.8519    4     308
6  414.6901    5     308
7  382.2038    6     308
8  290.1486    7     308
9  430.5853    8     308
10 466.3535    9     308

Analysing whole dataset without considering the repeated measurements from individuals (Method 1)

mod1 <- Reaction~Days
result1 <- lm(formula = mod1, data = sleepstudy)
summary(result1)

Call:
lm(formula = mod1, data = sleepstudy)

Residuals:
     Min       1Q   Median       3Q      Max 
-110.848  -27.483    1.546   26.142  139.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  251.405      6.610  38.033  < 2e-16 ***
Days          10.467      1.238   8.454 9.89e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.71 on 178 degrees of freedom
Multiple R-squared:  0.2865,    Adjusted R-squared:  0.2825 
F-statistic: 71.46 on 1 and 178 DF,  p-value: 9.894e-15
plot(x = sleepstudy$Days, 
     y = sleepstudy$Reaction,
     main = "Correlation between Reaction time and Days",
     xlab = "Days",
     ylab = "Reaction time (ms)",
     col = "blue")
abline(result1)

sleepstudy |> ggplot(aes(x = Days, y = Reaction)) +
  geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Analysing dataset taking the average values for each day (Method 2)

sleepStudy1 <- sleepstudy |> group_by(Days) |> summarise(meanReactionTime = mean(Reaction))



meanReact <- sleepstudy |> group_by(Days) |> 
  summarise(meanReaction = mean(Reaction))
mod2 <- lm(formula = meanReaction~Days, data = meanReact)
summary(mod2)

Call:
lm(formula = meanReaction ~ Days, data = meanReact)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9778 -3.9763  0.8356  4.2385  5.2467 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 251.4051     2.9111   86.36 3.61e-13 ***
Days         10.4673     0.5453   19.20 5.62e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.953 on 8 degrees of freedom
Multiple R-squared:  0.9788,    Adjusted R-squared:  0.9761 
F-statistic: 368.5 on 1 and 8 DF,  p-value: 5.623e-08
meanReact |> ggplot(aes(x = Days, y = meanReaction)) + 
  geom_point() + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Analyzing individual specific effects (Method 3)

  • Each group has its own intercept and slop estimates.

  • They can only applied within each group

\[\text{Reaction}_i \sim \beta_{0i} + \beta_{1i}\text {Days}_{ij}\]

sleepstudy |> ggplot(aes(x = Days, y = Reaction)) + geom_point()

sleepstudy |> ggplot(aes(x = Days, y = Reaction)) + geom_point() + facet_wrap(~Subject)

sleepstudy |> ggplot(aes(x = Days, y = Reaction)) + geom_point() +
  facet_wrap(~Subject) + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

sleepstudy %>% filter(Subject == 330) %>%
  lm(formula = Reaction~Days, data = .)

Call:
lm(formula = Reaction ~ Days, data = .)

Coefficients:
(Intercept)         Days  
    289.685        3.008  
mod3 <- sleepstudy %>% filter(Subject == 308) %>%
  lm(formula = Reaction~Days, data = .)

summary(mod3)

Call:
lm(formula = Reaction ~ Days, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-106.397   -4.098    9.688   22.269   61.674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   244.19      28.08   8.695 2.39e-05 ***
Days           21.77       5.26   4.137  0.00326 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.78 on 8 degrees of freedom
Multiple R-squared:  0.6815,    Adjusted R-squared:  0.6417 
F-statistic: 17.12 on 1 and 8 DF,  p-value: 0.003265

If we assume that initial reaction time is similar across (fixed effect for intercept) all subject and having different effects (random effect for slop) after wards. (Method 4a)

mod4 <- lmer(formula = Reaction ~ Days + (0+Days|Subject), data = sleepstudy)
summary(mod4)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (0 + Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1766.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5104 -0.5588  0.0541  0.6244  4.6022 

Random effects:
 Groups   Name Variance Std.Dev.
 Subject  Days  52.71    7.26   
 Residual      842.03   29.02   
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)   251.41       4.02  62.539
Days           10.47       1.87   5.599

Correlation of Fixed Effects:
     (Intr)
Days -0.340
coef(mod4)
$Subject
    (Intercept)       Days
308    251.4051 20.0866918
309    251.4051 -4.2326711
310    251.4051 -0.8189202
330    251.4051  9.1273878
331    251.4051 10.6754843
332    251.4051 11.5352979
333    251.4051 12.7430088
334    251.4051 10.4774867
335    251.4051 -0.4337385
337    251.4051 24.3577325
349    251.4051  7.9069248
350    251.4051 15.2012144
351    251.4051  8.1041559
352    251.4051 17.1349527
369    251.4051 11.8340809
370    251.4051 11.5298510
371    251.4051  9.5898795
372    251.4051 13.5923281

attr(,"class")
[1] "coef.mer"
sleepstudy1 <- sleepstudy %>%
  mutate(Fitted = predict(mod4)) # Predict the Reaction time @ Day based on available values in the data set

ggplot(sleepstudy1, aes(x = Days, y = Reaction)) +
  geom_point(aes(color = Subject), alpha = 0.6) +
  geom_line(aes(y = Fitted, color = Subject), linewidth = 1) +
  facet_wrap(~Subject) +
  labs(title = "Reaction Time by Days of Sleep Deprivation",
       x = "Days of Sleep Deprivation",
       y = "Reaction Time (ms)") +
  theme_minimal() +
  theme(legend.position = "none")

The final model will get general form of

\[Y_{ij} = \beta_0 + (\beta_1+b_{1i}).Days_{ij}+\epsilon_i\] where,

\(Y_i\): the reaction time for subject \(i\) and \(j\)

\(\beta_0\): Fixed (average) intercept

\(\beta_1\): Fixed (average) slope for Days

\(b_{1i}\): Random slope for subject \(i\)

\(\epsilon_{ij}\): Residual error for observation \(j\) of subject \(i\)

  • There is no random intercept because of (0 + Days | Subject), so every subject shares the same base level (intercept), but the effect of Days varies.

  • \(b_{1i} \sim N(0, \sigma^2_b)\), random slopes are normally distributed with mean 0.

  • \(\epsilon_{ij} \sim N(0, \sigma^2_b)\), residuals are also normally distributed.

This model can generalize to new subjects if the following are true:

  1. Subjects are sampled from a broader population:
  • The 18 subjects in the sleepstudy should represent a broader population (e.g., adults under sleep deprivation).

  • If these subjects were randomly sampled, the model’s fixed effect for Days (and the distribution of random slopes) represents the broader trend.

  1. Random effects are treated as coming from a distribution:
  • The random slope for each subject \(b1i \sim N(0,\sigma^2_b)\) captures variability across subjects.

  • This allows you to make probabilistic predictions about how a new subject might behave

  1. The fixed effect estimates are generalizable: - The fixed effect for Days (i.e.,\(\beta_1\)) reflects the average response across all subjects and is considered generalizable across the population, not just the 18 studied.
Component Generalizes To New Subjects? Why?
Fixed Effects Yes Represents average population-level trend
Random Effects Yes, probabilistically Captures individual variability; assume new subjects drawn from same population
Residuals Yes Captures unexplained noise at the observation level

If you’re generalizing to a new subject not in the data:

You don’t know \(b_{1i}\), because you have no data for that subject. So:

  • You use only the fixed effect \(\beta_1\), the population-level average.

  • This is the best estimate in the absence of subject-specific data.

In practice:

Use case What to use for slope \(\beta_1+b_{1i}\)
Subject in the dataset Use both \(\beta_1\) and \(b_{1i}\)
New subject Use only \(\beta_1\)
Partial data (few observations) Use shrinkage estimate (BLUP): a mix of subject and population info

What is partial data:

You have some data about a group (e.g., a subject), but not enough to estimate its behavior confidently on its own.

Example:

  • Suppose you have 18 subjects in total.
  • For Subject 1, you have 10 observations.
  • For a new Subject 19, you have only 1 or 2 observations.

In that case, Subject 19 is said to have partial data — not enough to reliably estimate their own slope for Days, but also not zero data like a brand-new unseen subject.

How the model handles it:

This is where shrinkage (or partial pooling) comes in:

The model estimates , the subject-specific deviation, by shrinking it toward 0 (i.e., toward the population-level ) based on how much data you have for that subject.

More data for subject i –> more confidence in their unique slope

Less data for subject i –> more shrinkage toward the average slope

Subject Type Data Available Estimate of Slope
Well-studied subject Many points \(\beta_1+b_{1i}\) (strong subject-specific slope)
Partial-data subject Few points Mixed estimate: some subject info + shrinkage toward \(\beta_1\)
New/unseen subject No data Use only \(\beta_1\)

This is one of the key strengths of mixed models: they balance individual-specific information with overall population trends, automatically adapting based on how much data you have for each group.

Scenario Model Equation Notes
Subject in dataset \(Y_{ij} = (\beta_0+b_{0i})+(\beta_1+b_{1i}).Days_{ij}+\epsilon_{ij}\) Uses both fixed + random effects
New subject (generalization) \(\hat Y = \beta_0+\beta_1.Days_j\) Use fixed effects only

What is partial pooling and shrinkage?

Group levels with low sample size and/or poor information (i.e., no strong relationship) are more strongly influenced by the grand mean, which is serving to add information to an otherwise poorly-estimated group. However, a group with a large sample size and/or strong information (i.e., a strong relationship) will have very little influence of the grand mean and largely reflect the information contained entirely within the group. This process is called partial pooling (as opposed to no pooling where no effect is considered or total pooling where separate models are run for the separate groups). Partial pooling results in the phenemenon known as shrinkage, which refers to the group-level estimates being shrink toward the mean. What does all this mean? If you use a random effect you should be prepared for your factor levels to have some influence from the overall mean of all levels. With a good, clear signal in each group, you won’t see much influence of the overal mean, but you will with small groups or those witout much signal.

If we assume different initial Reaction time for individuals (random effect on intercept) and having similar effect after wards (fixed effect on slop). (Method 4b)

mod5 <- lmer(formula = Reaction~Days+(1|Subject), data = sleepstudy)
summary(mod5)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (1 | Subject)
   Data: sleepstudy

REML criterion at convergence: 1786.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2257 -0.5529  0.0109  0.5188  4.2506 

Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept) 1378.2   37.12   
 Residual              960.5   30.99   
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept) 251.4051     9.7467   25.79
Days         10.4673     0.8042   13.02

Correlation of Fixed Effects:
     (Intr)
Days -0.371
coef(mod5)
$Subject
    (Intercept)     Days
308    292.1888 10.46729
309    173.5556 10.46729
310    188.2965 10.46729
330    255.8115 10.46729
331    261.6213 10.46729
332    259.6263 10.46729
333    267.9056 10.46729
334    248.4081 10.46729
335    206.1230 10.46729
337    323.5878 10.46729
349    230.2089 10.46729
350    265.5165 10.46729
351    243.5429 10.46729
352    287.7835 10.46729
369    258.4415 10.46729
370    245.0424 10.46729
371    248.1108 10.46729
372    269.5209 10.46729

attr(,"class")
[1] "coef.mer"
sleepstudy2 <- sleepstudy %>%
  mutate(Fitted = predict(mod5)) # Predict the Reaction time @ Day based on available values in the data set

ggplot(sleepstudy2, aes(x = Days, y = Reaction)) +
  geom_point(aes(color = Subject), alpha = 0.6) +
  geom_line(aes(y = Fitted, color = Subject), linewidth = 1) +
  facet_wrap(~Subject) +
  labs(title = "Reaction Time by Days of Sleep Deprivation",
       x = "Days of Sleep Deprivation",
       y = "Reaction Time (ms)") +
  theme_minimal() +
  theme(legend.position = "none")

The general equation can be written as:

\[Y_{ij} = (\beta_0+b_{0i}) + \beta_1.Days_{ij}+\epsilon_{ij}\] where,

\(Y_i\): the reaction time for subject \(i\) and \(j\)

\(\beta_0\): Fixed (average) intercept

\(b_{0i}\): Random intercept for subject \(i\)

\(\beta_1\): Fixed (average) slope for Days

\(\epsilon_{ij}\): Residual error for observation \(j\) of subject \(i\)

If we assume each subject has his/her own initial reaction time (random effect on intercept) and fatigue affect reaction time in unique way (random effect on slop). (Method 4c)

mod6 <- lmer(formula = Reaction~Days+(Days|Subject), data = sleepstudy)
summary(mod6)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9536 -0.4634  0.0231  0.4634  5.1793 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.10   24.741       
          Days         35.07    5.922   0.07
 Residual             654.94   25.592       
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825  36.838
Days          10.467      1.546   6.771

Correlation of Fixed Effects:
     (Intr)
Days -0.138
coef(mod6)
$Subject
    (Intercept)       Days
308    253.6637 19.6662617
309    211.0064  1.8476053
310    212.4447  5.0184295
330    275.0957  5.6529356
331    273.6654  7.3973743
332    260.4447 10.1951090
333    268.2456 10.2436499
334    244.1725 11.5418676
335    251.0714 -0.2848792
337    286.2956 19.0955511
349    226.1949 11.6407181
350    238.3351 17.0815038
351    255.9830  7.4520239
352    272.2688 14.0032871
369    254.6806 11.3395008
370    225.7921 15.2897709
371    252.2122  9.4791297
372    263.7197 11.7513080

attr(,"class")
[1] "coef.mer"
sleepstudy3 <- sleepstudy %>%
  mutate(Fitted = predict(mod6))

ggplot(sleepstudy3, aes(x = Days, y = Reaction)) +
  geom_point(aes(color = Subject), alpha = 0.6) +
  geom_line(aes(y = Fitted, color = Subject), linewidth = 1) +
  facet_wrap(~Subject) +
  labs(title = "Reaction Time by Days of Sleep Deprivation",
       x = "Days of Sleep Deprivation",
       y = "Reaction Time (ms)") +
  theme_minimal() +
  theme(legend.position = "none")

You can write the general model equaltion as follows:

\[Y_{ij} = (\beta_0+b_{0i})+(\beta_1+b_{1i}).Days_{ij}+\epsilon_{ij}\] Or

\[Y_{ij} = \beta_0+\beta_1.Days_{ij}+b_{0i}+b_{1i}.Days_{ij}+\epsilon_{ij}\] Where,

\(Y_i\): the reaction time for subject \(i\) and \(j\)

\(\beta_0\): Fixed (average) intercept

\(\beta_1\): Fixed (average) slope for Days

\(b_{0i}\): Random intercept for subject \(i\)

\(b_{1i}\): Random slope for subject \(i\)

\(\epsilon_{ij}\): Residual error for observation \(j\) of subject \(i\)

Summary:

Fixed effect, mixed effects, and random effects linear regression models.

Reference:

https://bookdown.org/steve_midway/DAR/random-effects.html