library(lme4)
library(ggplot2)
library(dplyr)Mixed models
What is a fixed effect model:
In a fixed effects model, we estimate specific, average effects of predictor variables on dependent variable across the population (or data set).
For a continuous fixed effect (like temperature), we estimate: the average change in the response (e.g., dengue cases) for a 1-unit increase in temperature, across all observations.
For a categorical fixed effect (like district, when treated as fixed), we estimate the mean effect (intercept) for each level, relative to a reference level.
| Type of Fixed Effect | What we estimate | Is it a “mean effect”? |
|---|---|---|
| Continuous (e.g., Temp) | One slope \(\beta\): average effect across all data | Yes — average effect per unit |
| Categorical (e.g., District) | One intercept per level (if included as fixed effect) | Yes — group-specific means |
What is a random effect model?
A random effect is a grouping variable where you believe each group has its own effect, but you’re not interested in estimating each group’s effect individually. Instead, you assume these effects are randomly drawn from a larger population.
Think of random effects as the background characters — they matter, but you’re not studying them individually. You’re just accounting for their variation. What we actually estimate are random effect parameters associated with levels of a grouping variable, under the assumption that these levels are random samples from a population/distribution.
Contrast with random effects
- Fixed effects estimate specific effects per variable or group.
- Random effects assume group effects vary and are drawn from a distribution — and we estimate the mean and variance of that distribution, not the effect for each level as a parameter.
What is a mixed model?
A mixed model, also known as a mixed-effects model, is a statistical framework that combines both fixed effects and random effects within a single analysis. This approach is particularly valuable when dealing with data that exhibit correlations or dependencies, such as measurements clustered within groups or repeated observations over time.
data("sleepstudy")
head(sleepstudy, n = 10) Reaction Days Subject
1 249.5600 0 308
2 258.7047 1 308
3 250.8006 2 308
4 321.4398 3 308
5 356.8519 4 308
6 414.6901 5 308
7 382.2038 6 308
8 290.1486 7 308
9 430.5853 8 308
10 466.3535 9 308
Analysing whole dataset without considering the repeated measurements from individuals (Method 1)
mod1 <- Reaction~Days
result1 <- lm(formula = mod1, data = sleepstudy)
summary(result1)
Call:
lm(formula = mod1, data = sleepstudy)
Residuals:
Min 1Q Median 3Q Max
-110.848 -27.483 1.546 26.142 139.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 251.405 6.610 38.033 < 2e-16 ***
Days 10.467 1.238 8.454 9.89e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47.71 on 178 degrees of freedom
Multiple R-squared: 0.2865, Adjusted R-squared: 0.2825
F-statistic: 71.46 on 1 and 178 DF, p-value: 9.894e-15
plot(x = sleepstudy$Days,
y = sleepstudy$Reaction,
main = "Correlation between Reaction time and Days",
xlab = "Days",
ylab = "Reaction time (ms)",
col = "blue")
abline(result1)sleepstudy |> ggplot(aes(x = Days, y = Reaction)) +
geom_point() + geom_smooth(method = "lm")`geom_smooth()` using formula = 'y ~ x'
Analysing dataset taking the average values for each day (Method 2)
sleepStudy1 <- sleepstudy |> group_by(Days) |> summarise(meanReactionTime = mean(Reaction))
meanReact <- sleepstudy |> group_by(Days) |>
summarise(meanReaction = mean(Reaction))
mod2 <- lm(formula = meanReaction~Days, data = meanReact)
summary(mod2)
Call:
lm(formula = meanReaction ~ Days, data = meanReact)
Residuals:
Min 1Q Median 3Q Max
-6.9778 -3.9763 0.8356 4.2385 5.2467
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 251.4051 2.9111 86.36 3.61e-13 ***
Days 10.4673 0.5453 19.20 5.62e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.953 on 8 degrees of freedom
Multiple R-squared: 0.9788, Adjusted R-squared: 0.9761
F-statistic: 368.5 on 1 and 8 DF, p-value: 5.623e-08
meanReact |> ggplot(aes(x = Days, y = meanReaction)) +
geom_point() + geom_smooth(method = "lm")`geom_smooth()` using formula = 'y ~ x'
Analyzing individual specific effects (Method 3)
Each group has its own intercept and slop estimates.
They can only applied within each group
\[\text{Reaction}_i \sim \beta_{0i} + \beta_{1i}\text {Days}_{ij}\]
sleepstudy |> ggplot(aes(x = Days, y = Reaction)) + geom_point()sleepstudy |> ggplot(aes(x = Days, y = Reaction)) + geom_point() + facet_wrap(~Subject)sleepstudy |> ggplot(aes(x = Days, y = Reaction)) + geom_point() +
facet_wrap(~Subject) + geom_smooth(method = "lm")`geom_smooth()` using formula = 'y ~ x'
sleepstudy %>% filter(Subject == 330) %>%
lm(formula = Reaction~Days, data = .)
Call:
lm(formula = Reaction ~ Days, data = .)
Coefficients:
(Intercept) Days
289.685 3.008
mod3 <- sleepstudy %>% filter(Subject == 308) %>%
lm(formula = Reaction~Days, data = .)
summary(mod3)
Call:
lm(formula = Reaction ~ Days, data = .)
Residuals:
Min 1Q Median 3Q Max
-106.397 -4.098 9.688 22.269 61.674
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 244.19 28.08 8.695 2.39e-05 ***
Days 21.77 5.26 4.137 0.00326 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47.78 on 8 degrees of freedom
Multiple R-squared: 0.6815, Adjusted R-squared: 0.6417
F-statistic: 17.12 on 1 and 8 DF, p-value: 0.003265
If we assume that initial reaction time is similar across (fixed effect for intercept) all subject and having different effects (random effect for slop) after wards. (Method 4a)
mod4 <- lmer(formula = Reaction ~ Days + (0+Days|Subject), data = sleepstudy)
summary(mod4)Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (0 + Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1766.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.5104 -0.5588 0.0541 0.6244 4.6022
Random effects:
Groups Name Variance Std.Dev.
Subject Days 52.71 7.26
Residual 842.03 29.02
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.41 4.02 62.539
Days 10.47 1.87 5.599
Correlation of Fixed Effects:
(Intr)
Days -0.340
coef(mod4)$Subject
(Intercept) Days
308 251.4051 20.0866918
309 251.4051 -4.2326711
310 251.4051 -0.8189202
330 251.4051 9.1273878
331 251.4051 10.6754843
332 251.4051 11.5352979
333 251.4051 12.7430088
334 251.4051 10.4774867
335 251.4051 -0.4337385
337 251.4051 24.3577325
349 251.4051 7.9069248
350 251.4051 15.2012144
351 251.4051 8.1041559
352 251.4051 17.1349527
369 251.4051 11.8340809
370 251.4051 11.5298510
371 251.4051 9.5898795
372 251.4051 13.5923281
attr(,"class")
[1] "coef.mer"
sleepstudy1 <- sleepstudy %>%
mutate(Fitted = predict(mod4)) # Predict the Reaction time @ Day based on available values in the data set
ggplot(sleepstudy1, aes(x = Days, y = Reaction)) +
geom_point(aes(color = Subject), alpha = 0.6) +
geom_line(aes(y = Fitted, color = Subject), linewidth = 1) +
facet_wrap(~Subject) +
labs(title = "Reaction Time by Days of Sleep Deprivation",
x = "Days of Sleep Deprivation",
y = "Reaction Time (ms)") +
theme_minimal() +
theme(legend.position = "none")The final model will get general form of
\[Y_{ij} = \beta_0 + (\beta_1+b_{1i}).Days_{ij}+\epsilon_i\] where,
\(Y_i\): the reaction time for subject \(i\) and \(j\)
\(\beta_0\): Fixed (average) intercept
\(\beta_1\): Fixed (average) slope for Days
\(b_{1i}\): Random slope for subject \(i\)
\(\epsilon_{ij}\): Residual error for observation \(j\) of subject \(i\)
There is no random intercept because of (0 + Days | Subject), so every subject shares the same base level (intercept), but the effect of Days varies.
\(b_{1i} \sim N(0, \sigma^2_b)\), random slopes are normally distributed with mean 0.
\(\epsilon_{ij} \sim N(0, \sigma^2_b)\), residuals are also normally distributed.
This model can generalize to new subjects if the following are true:
- Subjects are sampled from a broader population:
The 18 subjects in the sleepstudy should represent a broader population (e.g., adults under sleep deprivation).
If these subjects were randomly sampled, the model’s fixed effect for Days (and the distribution of random slopes) represents the broader trend.
- Random effects are treated as coming from a distribution:
The random slope for each subject \(b1i \sim N(0,\sigma^2_b)\) captures variability across subjects.
This allows you to make probabilistic predictions about how a new subject might behave
- The fixed effect estimates are generalizable: - The fixed effect for Days (i.e.,\(\beta_1\)) reflects the average response across all subjects and is considered generalizable across the population, not just the 18 studied.
| Component | Generalizes To New Subjects? | Why? |
|---|---|---|
| Fixed Effects | Yes | Represents average population-level trend |
| Random Effects | Yes, probabilistically | Captures individual variability; assume new subjects drawn from same population |
| Residuals | Yes | Captures unexplained noise at the observation level |
If you’re generalizing to a new subject not in the data:
You don’t know \(b_{1i}\), because you have no data for that subject. So:
You use only the fixed effect \(\beta_1\), the population-level average.
This is the best estimate in the absence of subject-specific data.
In practice:
| Use case | What to use for slope \(\beta_1+b_{1i}\) |
|---|---|
| Subject in the dataset | Use both \(\beta_1\) and \(b_{1i}\) |
| New subject | Use only \(\beta_1\) |
| Partial data (few observations) | Use shrinkage estimate (BLUP): a mix of subject and population info |
What is partial data:
You have some data about a group (e.g., a subject), but not enough to estimate its behavior confidently on its own.
Example:
- Suppose you have 18 subjects in total.
- For Subject 1, you have 10 observations.
- For a new Subject 19, you have only 1 or 2 observations.
In that case, Subject 19 is said to have partial data — not enough to reliably estimate their own slope for Days, but also not zero data like a brand-new unseen subject.
How the model handles it:
This is where shrinkage (or partial pooling) comes in:
The model estimates , the subject-specific deviation, by shrinking it toward 0 (i.e., toward the population-level ) based on how much data you have for that subject.
More data for subject i –> more confidence in their unique slope
Less data for subject i –> more shrinkage toward the average slope
| Subject Type | Data Available | Estimate of Slope |
|---|---|---|
| Well-studied subject | Many points | \(\beta_1+b_{1i}\) (strong subject-specific slope) |
| Partial-data subject | Few points | Mixed estimate: some subject info + shrinkage toward \(\beta_1\) |
| New/unseen subject | No data | Use only \(\beta_1\) |
This is one of the key strengths of mixed models: they balance individual-specific information with overall population trends, automatically adapting based on how much data you have for each group.
| Scenario | Model Equation | Notes |
|---|---|---|
| Subject in dataset | \(Y_{ij} = (\beta_0+b_{0i})+(\beta_1+b_{1i}).Days_{ij}+\epsilon_{ij}\) | Uses both fixed + random effects |
| New subject (generalization) | \(\hat Y = \beta_0+\beta_1.Days_j\) | Use fixed effects only |
What is partial pooling and shrinkage?
Group levels with low sample size and/or poor information (i.e., no strong relationship) are more strongly influenced by the grand mean, which is serving to add information to an otherwise poorly-estimated group. However, a group with a large sample size and/or strong information (i.e., a strong relationship) will have very little influence of the grand mean and largely reflect the information contained entirely within the group. This process is called partial pooling (as opposed to no pooling where no effect is considered or total pooling where separate models are run for the separate groups). Partial pooling results in the phenemenon known as shrinkage, which refers to the group-level estimates being shrink toward the mean. What does all this mean? If you use a random effect you should be prepared for your factor levels to have some influence from the overall mean of all levels. With a good, clear signal in each group, you won’t see much influence of the overal mean, but you will with small groups or those witout much signal.
If we assume different initial Reaction time for individuals (random effect on intercept) and having similar effect after wards (fixed effect on slop). (Method 4b)
mod5 <- lmer(formula = Reaction~Days+(1|Subject), data = sleepstudy)
summary(mod5)Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (1 | Subject)
Data: sleepstudy
REML criterion at convergence: 1786.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.2257 -0.5529 0.0109 0.5188 4.2506
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 1378.2 37.12
Residual 960.5 30.99
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.4051 9.7467 25.79
Days 10.4673 0.8042 13.02
Correlation of Fixed Effects:
(Intr)
Days -0.371
coef(mod5)$Subject
(Intercept) Days
308 292.1888 10.46729
309 173.5556 10.46729
310 188.2965 10.46729
330 255.8115 10.46729
331 261.6213 10.46729
332 259.6263 10.46729
333 267.9056 10.46729
334 248.4081 10.46729
335 206.1230 10.46729
337 323.5878 10.46729
349 230.2089 10.46729
350 265.5165 10.46729
351 243.5429 10.46729
352 287.7835 10.46729
369 258.4415 10.46729
370 245.0424 10.46729
371 248.1108 10.46729
372 269.5209 10.46729
attr(,"class")
[1] "coef.mer"
sleepstudy2 <- sleepstudy %>%
mutate(Fitted = predict(mod5)) # Predict the Reaction time @ Day based on available values in the data set
ggplot(sleepstudy2, aes(x = Days, y = Reaction)) +
geom_point(aes(color = Subject), alpha = 0.6) +
geom_line(aes(y = Fitted, color = Subject), linewidth = 1) +
facet_wrap(~Subject) +
labs(title = "Reaction Time by Days of Sleep Deprivation",
x = "Days of Sleep Deprivation",
y = "Reaction Time (ms)") +
theme_minimal() +
theme(legend.position = "none")The general equation can be written as:
\[Y_{ij} = (\beta_0+b_{0i}) + \beta_1.Days_{ij}+\epsilon_{ij}\] where,
\(Y_i\): the reaction time for subject \(i\) and \(j\)
\(\beta_0\): Fixed (average) intercept
\(b_{0i}\): Random intercept for subject \(i\)
\(\beta_1\): Fixed (average) slope for Days
\(\epsilon_{ij}\): Residual error for observation \(j\) of subject \(i\)
If we assume each subject has his/her own initial reaction time (random effect on intercept) and fatigue affect reaction time in unique way (random effect on slop). (Method 4c)
mod6 <- lmer(formula = Reaction~Days+(Days|Subject), data = sleepstudy)
summary(mod6)Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.10 24.741
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.405 6.825 36.838
Days 10.467 1.546 6.771
Correlation of Fixed Effects:
(Intr)
Days -0.138
coef(mod6)$Subject
(Intercept) Days
308 253.6637 19.6662617
309 211.0064 1.8476053
310 212.4447 5.0184295
330 275.0957 5.6529356
331 273.6654 7.3973743
332 260.4447 10.1951090
333 268.2456 10.2436499
334 244.1725 11.5418676
335 251.0714 -0.2848792
337 286.2956 19.0955511
349 226.1949 11.6407181
350 238.3351 17.0815038
351 255.9830 7.4520239
352 272.2688 14.0032871
369 254.6806 11.3395008
370 225.7921 15.2897709
371 252.2122 9.4791297
372 263.7197 11.7513080
attr(,"class")
[1] "coef.mer"
sleepstudy3 <- sleepstudy %>%
mutate(Fitted = predict(mod6))
ggplot(sleepstudy3, aes(x = Days, y = Reaction)) +
geom_point(aes(color = Subject), alpha = 0.6) +
geom_line(aes(y = Fitted, color = Subject), linewidth = 1) +
facet_wrap(~Subject) +
labs(title = "Reaction Time by Days of Sleep Deprivation",
x = "Days of Sleep Deprivation",
y = "Reaction Time (ms)") +
theme_minimal() +
theme(legend.position = "none")You can write the general model equaltion as follows:
\[Y_{ij} = (\beta_0+b_{0i})+(\beta_1+b_{1i}).Days_{ij}+\epsilon_{ij}\] Or
\[Y_{ij} = \beta_0+\beta_1.Days_{ij}+b_{0i}+b_{1i}.Days_{ij}+\epsilon_{ij}\] Where,
\(Y_i\): the reaction time for subject \(i\) and \(j\)
\(\beta_0\): Fixed (average) intercept
\(\beta_1\): Fixed (average) slope for Days
\(b_{0i}\): Random intercept for subject \(i\)
\(b_{1i}\): Random slope for subject \(i\)
\(\epsilon_{ij}\): Residual error for observation \(j\) of subject \(i\)
Summary:
Fixed effect, mixed effects, and random effects linear regression models.
Reference: