In this document, we will go through step-by-step how to analyze teacher salaries across different schools using random effects models. We will simulate salary data, fit linear mixed models, and visualize how salaries change based on years of work experience. Finally, we will compare Bayesian and frequentist approaches for modeling these data.
We start by simulating salary data across multiple schools. The salary depends on the years worked by the teachers, and there are random intercepts and slopes for each school.
# Install and load required packages
if(!require(lmerTest)) install.packages("lmerTest")
## Loading required package: lmerTest
## Loading required package: lme4
## Loading required package: Matrix
##
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
##
## lmer
## The following object is masked from 'package:stats':
##
## step
if(!require(ggplot2)) install.packages("ggplot2")
## Loading required package: ggplot2
library(lmerTest) # For linear mixed models with p-values
library(ggplot2) # For visualization
# Set random seed for reproducibility
set.seed(123)
# Number of schools and teachers per school
n_schools <- 20 # Number of schools
n_teachers_per_school <- 30 # Number of teachers per school
n_total <- n_schools * n_teachers_per_school # Total observations
# Simulate random intercepts (base salary) and random slopes (salary growth rate)
school_intercepts <- rnorm(n_schools, mean = 30000, sd = 2000) # Baseline salary
school_slopes <- rnorm(n_schools, mean = 1500, sd = 300) # Salary growth rate
# Generate years worked and school grouping
years_worked <- runif(n_total, min = 1, max = 40) # Years of experience
school <- rep(1:n_schools, each = n_teachers_per_school) # School grouping variable
# Generate salary data, with random intercepts and slopes
salary <- rep(NA, n_total)
for (i in 1:n_schools) {
idx <- which(school == i)
salary[idx] <- school_intercepts[i] + school_slopes[i] * years_worked[idx] + rnorm(n_teachers_per_school, mean = 0, sd = 2000)
}
# Combine data into a data frame
data <- data.frame(school = factor(school), years_worked = years_worked, salary = salary)
# Check part of the data
head(data)
## school years_worked salary
## 1 1 10.501159 42515.15
## 2 1 27.054168 62711.59
## 3 1 17.288224 52615.26
## 4 1 31.739638 66432.84
## 5 1 5.011721 34687.18
## 6 1 17.960817 46560.10
We fit two models: one with only random intercepts and another with both random intercepts and random slopes for schools.
# Fit random intercept model
model_random_intercept <- lmer(salary ~ years_worked + (1 | school), data = data)
# Fit random slope model
model_random_slope <- lmer(salary ~ years_worked + (years_worked | school), data = data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0109629 (tol = 0.002, component 1)
# View summaries of both models
summary(model_random_intercept)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (1 | school)
## Data: data
##
## REML criterion at convergence: 11505.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.0299 -0.6433 0.0602 0.6948 2.6783
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 28764348 5363
## Residual 11227538 3351
## Number of obs: 600, groups: school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 29660.46 1233.40 20.71 24.05 <2e-16 ***
## years_worked 1517.40 12.36 579.41 122.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.206
summary(model_random_slope)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (years_worked | school)
## Data: data
##
## REML criterion at convergence: 10948.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.74046 -0.62584 0.06166 0.64776 2.78432
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 3699501 1923.4
## years_worked 63620 252.2 -0.10
## Residual 4002754 2000.7
## Number of obs: 600, groups: school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30092.86 464.12 19.23 64.84 <2e-16 ***
## years_worked 1497.38 56.90 18.96 26.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.134
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0109629 (tol = 0.002, component 1)
We compare the two models using AIC and likelihood ratio tests.
# Compare models using AIC
AIC(model_random_intercept, model_random_slope)
## df AIC
## model_random_intercept 4 11513.26
## model_random_slope 6 10960.67
# Likelihood ratio test between models
anova(model_random_intercept, model_random_slope)
## refitting model(s) with ML (instead of REML)
## Data: data
## Models:
## model_random_intercept: salary ~ years_worked + (1 | school)
## model_random_slope: salary ~ years_worked + (years_worked | school)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_random_intercept 4 11536 11554 -5764.1 11528
## model_random_slope 6 10985 11011 -5486.3 10973 555.49 2 < 2.2e-16
##
## model_random_intercept
## model_random_slope ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, we visualize the relationship between years worked and salary for each school.
# Plot salary vs. years worked, with regression lines for each school
ggplot(data, aes(x = years_worked, y = salary, color = school)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, aes(group = school), linetype = "dashed") +
theme_minimal() +
labs(title = "Salary vs. Years Worked (including negative growth schools)",
x = "Years Worked",
y = "Salary",
color = "School")
## `geom_smooth()` using formula = 'y ~ x'
Next, we dive into a more advanced model, allowing for random intercepts and random slopes. We also visualize how salaries change with experience across schools.
# Install and load necessary packages
if(!require(lmerTest)) install.packages("lmerTest")
if(!require(ggplot2)) install.packages("ggplot2")
library(lmerTest) # For linear mixed models and p-value calculation
library(ggplot2) # For data visualization
# Set random seed for reproducibility
set.seed(123)
# Define number of schools and teachers per school
n_schools <- 20 # Number of schools
n_teachers_per_school <- 30 # Teachers per school
n_total <- n_schools * n_teachers_per_school # Total number of observations
# Simulate random intercepts (base salary)
school_intercepts <- rnorm(n_schools, mean = 30000, sd = 2000)
# Simulate random slopes (salary growth rates), including schools with zero or negative growth
school_slopes <- rnorm(n_schools, mean = 1500, sd = 300)
school_slopes[1:5] <- runif(5, min = -1000, max = 0) # Simulate schools with salary decrease or no increase
# Generate years worked and school groupings
years_worked <- runif(n_total, min = 1, max = 40) # Years of experience
school <- rep(1:n_schools, each = n_teachers_per_school) # School grouping variable
# Generate salary data with random intercepts and random slopes
salary <- rep(NA, n_total)
for (i in 1:n_schools) {
idx <- which(school == i)
salary[idx] <- school_intercepts[i] + school_slopes[i] * years_worked[idx] + rnorm(n_teachers_per_school, mean = 0, sd = 2000)
}
# Combine data into a data frame
data <- data.frame(school = factor(school), years_worked = years_worked, salary = salary)
# Check part of the data
head(data)
## school years_worked salary
## 1 1 17.960817 15548.2758
## 2 1 39.413322 -2835.2398
## 3 1 35.828993 453.6669
## 4 1 35.572293 2662.0484
## 5 1 7.827053 20799.1095
## 6 1 6.097132 24686.6020
Now we fit the random intercept and random slope model and summarize the results.
# Fit random intercept and random slope model
model_random_slope <- lmer(salary ~ years_worked + (years_worked | school), data = data)
# View model summary, including p-values
summary(model_random_slope)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (years_worked | school)
## Data: data
##
## REML criterion at convergence: 11001.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.98758 -0.61427 0.01437 0.61787 2.70240
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 4295989 2072.7
## years_worked 967555 983.6 -0.12
## Residual 3991908 1998.0
## Number of obs: 600, groups: school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30327.33 495.35 18.98 61.224 < 2e-16 ***
## years_worked 1018.83 220.08 19.00 4.629 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.122
Finally, we visualize salary trends across schools, showing how the salary increases or decreases with years of experience.
# Plot salary vs. years worked, adding regression lines for each school
ggplot(data, aes(x = years_worked, y = salary, color = school)) +
geom_point(alpha = 0.6) + # Scatter plot
geom_smooth(method = "lm", se = FALSE, aes(group = school), linetype = "dashed") + # Add regression lines for each school
theme_minimal() +
labs(title = "Salary vs. Years Worked (including negative growth schools)",
x = "Years Worked",
y = "Salary",
color = "School")
## `geom_smooth()` using formula = 'y ~ x'
## More varaiables
In this document, we explore how salaries are distributed based on years of work experience, taking into account the random effects of schools and teachers. We will fit linear mixed models with both random intercepts and random slopes, and compare the results to determine the best-fitting model.
We simulate data for 20 schools, each having 30 teachers. The salaries of teachers are influenced by years of experience, and there are random effects for schools (both intercepts and slopes). Additionally, random intercepts for teachers are also considered.
set.seed(123)
n_schools <- 20
n_teachers_per_school <- 30
n_years_per_teacher <- 5
n_total <- n_schools * n_teachers_per_school * n_years_per_teacher
# Random intercepts and slopes for schools
school_intercepts <- rnorm(n_schools, mean = 30000, sd = 2000)
school_slopes <- rnorm(n_schools, mean = 1500, sd = 300)
school_slopes[1:5] <- runif(5, min = -1000, max = 0)
# Random intercepts for teachers
teacher_intercepts <- rnorm(n_schools * n_teachers_per_school, mean = 0, sd = 2000)
# Simulate data
years_worked <- rep(1:n_years_per_teacher, times = n_schools * n_teachers_per_school)
school <- rep(1:n_schools, each = n_teachers_per_school * n_years_per_teacher)
teacher <- rep(1:(n_schools * n_teachers_per_school), each = n_years_per_teacher)
salary <- rep(NA, n_total)
for (i in 1:n_schools) {
idx <- which(school == i)
salary[idx] <- school_intercepts[i] + school_slopes[i] * years_worked[idx] + teacher_intercepts[teacher[idx]] + rnorm(length(idx), mean = 0, sd = 2000)
}
data <- data.frame(school = factor(school), teacher = factor(teacher), years_worked = years_worked, salary = salary)
head(data)
## school teacher years_worked salary
## 1 1 1 1 26604.67
## 2 1 1 2 27286.11
## 3 1 1 3 26357.09
## 4 1 1 4 25657.47
## 5 1 1 5 26825.99
## 6 1 2 1 29578.78
We fit the model with random intercepts and random slopes for schools, and also add random intercepts for teachers.
# Model with random intercepts for teachers and random slopes for schools
model_teacher_random <- lmer(salary ~ years_worked + (years_worked | school) + (1 | teacher), data = data)
summary(model_teacher_random)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (years_worked | school) + (1 | teacher)
## Data: data
##
## REML criterion at convergence: 55311.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1029 -0.5898 0.0119 0.6151 3.1836
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## teacher (Intercept) 4228282 2056
## school (Intercept) 4310576 2076
## years_worked 1003883 1002 -0.17
## Residual 3945957 1986
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30305.8 479.4 19.0 63.22 < 2e-16 ***
## years_worked 978.6 225.5 19.0 4.34 0.000353 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.185
We visualize how salary changes with years of experience, grouped by schools, and overlay regression lines for each school.
ggplot(data, aes(x = years_worked, y = salary, color = school)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, aes(group = school), linetype = "dashed") +
theme_minimal() +
labs(title = "Salary vs Years of Experience (Random Effects for Schools and Teachers)",
x = "Years Worked",
y = "Salary",
color = "School")
## `geom_smooth()` using formula = 'y ~ x'
We run a full analysis that includes both random intercepts and slopes for schools, and random intercepts for teachers, and check if these improve model performance.
# Model with random intercepts for schools
model_random_intercept <- lmer(salary ~ years_worked + (1 | school), data = data)
# Model with random intercepts and random slopes for schools
model_random_slope <- lmer(salary ~ years_worked + (years_worked | school), data = data)
# Model with random intercepts for teachers and random slopes for schools
model_teacher_random <- lmer(salary ~ years_worked + (years_worked | school) + (1 | teacher), data = data)
summary(model_random_intercept)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (1 | school)
## Data: data
##
## REML criterion at convergence: 56939.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.4015 -0.6682 0.0040 0.6611 3.5904
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 11276692 3358
## Residual 9982931 3160
## Number of obs: 3000, groups: school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30305.75 762.98 20.02 39.72 <2e-16 ***
## years_worked 978.65 40.79 2979.00 23.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.160
summary(model_random_slope)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (years_worked | school)
## Data: data
##
## REML criterion at convergence: 56363.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.0455 -0.6741 0.0094 0.6670 3.9906
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 4299644 2074
## years_worked 990079 995 -0.16
## Residual 8088530 2844
## Number of obs: 3000, groups: school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30305.8 479.4 19.0 63.22 < 2e-16 ***
## years_worked 978.6 225.5 19.0 4.34 0.000353 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.185
summary(model_teacher_random)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (years_worked | school) + (1 | teacher)
## Data: data
##
## REML criterion at convergence: 55311.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1029 -0.5898 0.0119 0.6151 3.1836
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## teacher (Intercept) 4228282 2056
## school (Intercept) 4310576 2076
## years_worked 1003883 1002 -0.17
## Residual 3945957 1986
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30305.8 479.4 19.0 63.22 < 2e-16 ***
## years_worked 978.6 225.5 19.0 4.34 0.000353 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.185
We compare the models using AIC and a likelihood ratio test to determine whether the inclusion of random slopes or random intercepts for teachers significantly improves the model fit.
AIC(model_random_intercept, model_random_slope, model_teacher_random)
## df AIC
## model_random_intercept 4 56947.22
## model_random_slope 6 56375.49
## model_teacher_random 7 55325.75
anova(model_random_intercept, model_random_slope)
## refitting model(s) with ML (instead of REML)
## Data: data
## Models:
## model_random_intercept: salary ~ years_worked + (1 | school)
## model_random_slope: salary ~ years_worked + (years_worked | school)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_random_intercept 4 56972 56996 -28482 56964
## model_random_slope 6 56402 56438 -28195 56390 573.27 2 < 2.2e-16
##
## model_random_intercept
## model_random_slope ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this analysis, we have modeled teacher salaries considering both fixed and random effects, including the impact of years worked, school-specific random intercepts, random slopes, and teacher-specific random intercepts. The results suggest that including random effects improves the model fit.
You can further customize your mixed-effects models in several ways. Below are examples to extend the analysis:
We can include more fixed effects, such as teacher’s gender or education level.
# Adding a gender variable to the dataset
data$gender <- sample(c("Male", "Female"), size = nrow(data), replace = TRUE)
# Model with gender as an additional fixed effect
model_with_gender <- lmer(salary ~ years_worked + gender + (years_worked | school) + (1 | teacher), data = data)
summary(model_with_gender)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + gender + (years_worked | school) + (1 |
## teacher)
## Data: data
##
## REML criterion at convergence: 55299.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0722 -0.5890 0.0189 0.6144 3.2073
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## teacher (Intercept) 4244107 2060
## school (Intercept) 4296287 2073
## years_worked 1004778 1002 -0.17
## Residual 3941383 1985
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30248.68 480.29 19.26 62.981 < 2e-16 ***
## years_worked 978.23 225.60 19.00 4.336 0.000356 ***
## genderMale 114.59 79.34 2539.97 1.444 0.148797
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) yrs_wr
## years_workd -0.185
## genderMale -0.082 -0.001
You can add random slopes for more than one variable. For example,
adding random slopes for both years_worked
and
gender
.
# Adding random slopes for both years_worked and gender
model_random_slopes <- lmer(salary ~ years_worked + gender + (years_worked + gender | school) + (1 | teacher), data = data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0128206 (tol = 0.002, component 1)
summary(model_random_slopes)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + gender + (years_worked + gender | school) +
## (1 | teacher)
## Data: data
##
## REML criterion at convergence: 55297.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0454 -0.5895 0.0135 0.6117 3.2249
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## teacher (Intercept) 4248282 2061.14
## school (Intercept) 4270008 2066.40
## years_worked 1002877 1001.44 -0.19
## genderMale 8012 89.51 0.11 0.94
## Residual 3938720 1984.62
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30246.52 478.92 18.94 63.156 < 2e-16 ***
## years_worked 978.73 225.39 19.01 4.342 0.00035 ***
## genderMale 115.02 81.80 33.77 1.406 0.16884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) yrs_wr
## years_workd -0.203
## genderMale -0.054 0.227
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0128206 (tol = 0.002, component 1)
If teachers work in multiple schools, you can use crossed random effects.
# Crossed random effects for schools and teachers
model_crossed <- lmer(salary ~ years_worked + (1 | school) + (1 | teacher), data = data)
summary(model_crossed)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (1 | school) + (1 | teacher)
## Data: data
##
## REML criterion at convergence: 56381.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1831 -0.5994 0.0156 0.5953 3.0187
##
## Random effects:
## Groups Name Variance Std.Dev.
## teacher (Intercept) 3751263 1937
## school (Intercept) 11176415 3343
## Residual 6331162 2516
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30305.75 759.39 19.64 39.91 <2e-16 ***
## years_worked 978.65 32.48 2399.00 30.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.128
You can include interaction terms to see how the effect of one variable depends on another.
# Adding an interaction term between years_worked and gender
model_interaction <- lmer(salary ~ years_worked * gender + (years_worked | school) + (1 | teacher), data = data)
summary(model_interaction)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked * gender + (years_worked | school) + (1 |
## teacher)
## Data: data
##
## REML criterion at convergence: 55288.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0712 -0.5907 0.0136 0.6145 3.1870
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## teacher (Intercept) 4246426 2061
## school (Intercept) 4295152 2072
## years_worked 1005114 1003 -0.17
## Residual 3941421 1985
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30176.32 487.83 20.51 61.858 < 2e-16 ***
## years_worked 1002.52 227.47 19.62 4.407 0.000283 ***
## genderMale 257.92 187.52 2553.11 1.375 0.169117
## years_worked:genderMale -47.86 56.75 2559.09 -0.843 0.399055
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) yrs_wr gndrMl
## years_workd -0.203
## genderMale -0.194 0.114
## yrs_wrkd:gM 0.176 -0.127 -0.906
If you think the effect of years_worked
on salary is
nonlinear, you can include polynomial terms or splines.
# Adding a quadratic term for years_worked
model_nonlinear <- lmer(salary ~ poly(years_worked, 2) + (years_worked | school) + (1 | teacher), data = data)
summary(model_nonlinear)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ poly(years_worked, 2) + (years_worked | school) + (1 |
## teacher)
## Data: data
##
## REML criterion at convergence: 55286
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1022 -0.5896 0.0118 0.6151 3.1830
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## teacher (Intercept) 4227945 2056
## school (Intercept) 4310583 2076
## years_worked 1003906 1002 -0.17
## Residual 3947615 1987
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 33241.71 753.11 19.00 44.139 < 2e-16 ***
## poly(years_worked, 2)1 75806.12 17467.67 19.00 4.340 0.000353 ***
## poly(years_worked, 2)2 13.41 1986.86 2379.00 0.007 0.994616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) p(_,2)1
## ply(yr_,2)1 0.780
## ply(yr_,2)2 0.000 0.000
For non-Gaussian responses, you can use a generalized linear mixed model.
# Binary outcome model: salary > 40000
data$high_salary <- as.factor(ifelse(data$salary > 40000, 1, 0))
# Logistic regression with random effects
model_glmm <- glmer(high_salary ~ years_worked + (1 | school) + (1 | teacher), data = data, family = binomial)
summary(model_glmm)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: high_salary ~ years_worked + (1 | school) + (1 | teacher)
## Data: data
##
## AIC BIC logLik deviance df.resid
## 1086.6 1110.6 -539.3 1078.6 2996
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.6783 -0.1428 -0.0594 -0.0176 30.3012
##
## Random effects:
## Groups Name Variance Std.Dev.
## teacher (Intercept) 3.709 1.926
## school (Intercept) 6.341 2.518
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.2220 0.9215 -11.09 <2e-16 ***
## years_worked 1.3942 0.1143 12.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.702
For complex models, trying different optimizers can help improve convergence.
# Customize optimizer
model_custom_optimizer <- lmer(salary ~ years_worked + (years_worked | school) + (1 | teacher),
data = data, control = lmerControl(optimizer = "bobyqa"))
summary(model_custom_optimizer)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (years_worked | school) + (1 | teacher)
## Data: data
## Control: lmerControl(optimizer = "bobyqa")
##
## REML criterion at convergence: 55311.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.1029 -0.5898 0.0119 0.6151 3.1836
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## teacher (Intercept) 4228283 2056
## school (Intercept) 4310597 2076
## years_worked 1003883 1002 -0.17
## Residual 3945956 1986
## Number of obs: 3000, groups: teacher, 600; school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30305.8 479.4 19.0 63.22 < 2e-16 ***
## years_worked 978.6 225.5 19.0 4.34 0.000353 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.185
You can use the brms
package to fit Bayesian
mixed-effects models.
# Fit a Bayesian mixed model using brms
if (!requireNamespace("brms")) install.packages("brms")
library(brms)
model_bayes <- brm(salary ~ years_worked + (years_worked | school) + (1 | teacher), data = data, chains = 2, iter = 1000)
summary(model_bayes)
You can use AIC and likelihood ratio tests to compare models.
# Compare models using AIC and likelihood ratio test
AIC(model_random_intercept, model_random_slopes, model_with_gender)
## df AIC
## model_random_intercept 4 56947.22
## model_random_slopes 11 55319.91
## model_with_gender 8 55315.08
anova(model_random_intercept, model_random_slopes)
## refitting model(s) with ML (instead of REML)
## Data: data
## Models:
## model_random_intercept: salary ~ years_worked + (1 | school)
## model_random_slopes: salary ~ years_worked + gender + (years_worked + gender | school) + (1 | teacher)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_random_intercept 4 56972 56996 -28482 56964
## model_random_slopes 11 55357 55423 -27668 55335 1628.3 7 < 2.2e-16
##
## model_random_intercept
## model_random_slopes ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is important to check model assumptions, such as residual plots and random effect estimates.
# Checking residuals
plot(resid(model_teacher_random))
# Checking random effects
ranef(model_teacher_random)
## $teacher
## (Intercept)
## 1 -183.7767106
## 2 2378.7771550
## 3 -2817.4268178
## 4 -986.4855303
## 5 508.7084186
## 6 -1956.0695855
## 7 -3100.9765975
## 8 -200.6775072
## 9 -1479.5374284
## 10 3082.3867440
## 11 1827.9739631
## 12 100.3494642
## 13 -421.7753321
## 14 -535.9667033
## 15 2949.0068378
## 16 -1559.5357889
## 17 2498.2228578
## 18 -523.7566249
## 19 -1546.2441702
## 20 92.9964223
## 21 2665.5026412
## 22 -1903.9229791
## 23 1100.0814773
## 24 682.4367959
## 25 -33.4690388
## 26 367.7859593
## 27 1067.6789902
## 28 -375.7406369
## 29 -812.2838507
## 30 -2371.9743305
## 31 -830.8101434
## 32 -3426.3594711
## 33 1076.8333089
## 34 -874.2531672
## 35 -512.4097367
## 36 898.0014322
## 37 -707.1790717
## 38 -1067.8398398
## 39 -692.8630478
## 40 -559.7259823
## 41 136.6139641
## 42 -1036.2276229
## 43 -775.2677091
## 44 -1613.9164854
## 45 1733.0927231
## 46 1044.2353415
## 47 -119.4054150
## 48 102.2988182
## 49 546.4401907
## 50 -1411.5464376
## 51 -590.8008068
## 52 1248.5868696
## 53 1994.7629712
## 54 -1395.9357064
## 55 1323.1810459
## 56 69.5832248
## 57 -566.1957066
## 58 1332.4907308
## 59 3527.3963123
## 60 -480.8170401
## 61 2830.0652516
## 62 -786.6990198
## 63 -1502.0522562
## 64 -2212.3601562
## 65 -307.3409473
## 66 617.0852389
## 67 -528.1981099
## 68 1324.9933287
## 69 2345.0147577
## 70 -239.4840374
## 71 -2169.6802552
## 72 96.4337479
## 73 1089.1220918
## 74 -2501.6727644
## 75 -3207.8393845
## 76 60.0717642
## 77 574.9156579
## 78 1733.5245523
## 79 179.4347657
## 80 -490.8183495
## 81 510.3491948
## 82 2857.2290025
## 83 643.5692563
## 84 -2419.7043303
## 85 -2061.0487823
## 86 413.0883392
## 87 -895.6607480
## 88 1310.5179020
## 89 3112.8482812
## 90 2525.7967383
## 91 872.8787483
## 92 -651.0416532
## 93 385.7965671
## 94 -4552.6375467
## 95 -1722.5790111
## 96 2747.2816399
## 97 1865.4592244
## 98 1515.8394934
## 99 -2561.6075364
## 100 -285.7892425
## 101 -207.5158404
## 102 331.7170903
## 103 259.3539564
## 104 -866.5917334
## 105 1793.9402151
## 106 2754.2971676
## 107 -1676.1952962
## 108 1776.5420985
## 109 -3505.3917549
## 110 952.8342313
## 111 -223.7251985
## 112 -1572.2467626
## 113 -132.2635251
## 114 9.3319174
## 115 -1429.7330100
## 116 1698.9926693
## 117 -666.6935232
## 118 2663.8220378
## 119 -1717.1097098
## 120 593.3989337
## 121 -3121.8103662
## 122 -3506.5266030
## 123 1895.5509979
## 124 -1297.6426062
## 125 2123.2927929
## 126 -117.7578683
## 127 701.9323752
## 128 2993.0358631
## 129 -1755.6707309
## 130 -3166.6803359
## 131 -668.1466968
## 132 217.3144250
## 133 -554.5004755
## 134 3514.3457247
## 135 -2579.7522392
## 136 3328.3599070
## 137 -1370.4284938
## 138 1313.4096745
## 139 -2566.9195869
## 140 -1038.1821614
## 141 2505.8850217
## 142 -1520.2165074
## 143 322.1894311
## 144 -547.2720879
## 145 -508.4662279
## 146 2316.6945936
## 147 -2075.7076955
## 148 3072.9838009
## 149 558.2715468
## 150 1588.6620975
## 151 2213.6606282
## 152 -701.8399288
## 153 194.0309045
## 154 213.6807533
## 155 1448.8713088
## 156 -578.4035490
## 157 -3037.5608871
## 158 4468.1722999
## 159 -1735.2357052
## 160 1552.0758392
## 161 520.3480870
## 162 -2669.3698882
## 163 1040.1449859
## 164 2189.1785336
## 165 -340.1473950
## 166 -1555.8784253
## 167 -84.2990194
## 168 -3170.8840495
## 169 -782.0577488
## 170 -2119.5069844
## 171 938.7900447
## 172 -310.7494316
## 173 2117.1098188
## 174 -468.0023427
## 175 5927.6611797
## 176 -401.5518718
## 177 -67.7254681
## 178 -1118.1690049
## 179 1506.5731055
## 180 -1412.6779411
## 181 326.5638166
## 182 -1381.7834804
## 183 231.9022239
## 184 1208.9582712
## 185 -1559.3934129
## 186 1576.4963354
## 187 1556.5869805
## 188 -2695.0867993
## 189 -339.1842885
## 190 -2030.1654630
## 191 17.1127841
## 192 965.6220575
## 193 1393.5072700
## 194 369.0006195
## 195 1709.9386517
## 196 739.3338957
## 197 -5639.7502902
## 198 1129.8537943
## 199 1665.2604658
## 200 830.5306461
## 201 -3249.6326370
## 202 351.1793083
## 203 2618.6960030
## 204 -3197.9678788
## 205 1191.2224926
## 206 4601.6150795
## 207 -2567.0927225
## 208 2173.6298603
## 209 508.0341912
## 210 -1848.7906935
## 211 -2223.6612774
## 212 76.1796456
## 213 -784.1012133
## 214 -2675.1426513
## 215 -374.9721289
## 216 -1395.1347868
## 217 41.1878649
## 218 946.0194753
## 219 -336.2344523
## 220 -2390.0600042
## 221 1027.4992275
## 222 -3572.9429373
## 223 -1840.0591844
## 224 22.2849411
## 225 2585.6459196
## 226 798.6118304
## 227 2986.6719288
## 228 -1277.5545074
## 229 312.9024443
## 230 -1137.7241686
## 231 2594.9906232
## 232 2948.1188598
## 233 -411.6815284
## 234 -1385.3608635
## 235 2371.1792403
## 236 49.9444883
## 237 703.2552698
## 238 -2567.3190281
## 239 2898.1869180
## 240 -664.7848081
## 241 166.0766762
## 242 322.3529978
## 243 2165.3738091
## 244 3035.5132143
## 245 -1040.4514397
## 246 -54.8208911
## 247 -4484.4970042
## 248 -174.1668291
## 249 2885.7384260
## 250 -1766.5056890
## 251 -1764.5170695
## 252 1500.5631454
## 253 29.0245805
## 254 -2480.9029101
## 255 -2233.8484220
## 256 1896.6924194
## 257 -823.5860565
## 258 -3285.3356637
## 259 1965.7818545
## 260 -477.0669927
## 261 67.6878589
## 262 872.8377996
## 263 -441.9395060
## 264 -2154.6785868
## 265 3833.4763683
## 266 812.9209481
## 267 2273.0668280
## 268 -3054.4483408
## 269 1065.1323475
## 270 198.1022230
## 271 268.4629650
## 272 1922.6959953
## 273 -1088.1104928
## 274 2454.4814932
## 275 1366.7438577
## 276 -510.1739595
## 277 406.0294634
## 278 -1075.0562415
## 279 894.0972073
## 280 317.5691396
## 281 377.0883648
## 282 -1116.2819732
## 283 631.5725759
## 284 2629.9513387
## 285 758.9925280
## 286 557.3589159
## 287 -3735.2945433
## 288 -689.4412507
## 289 1786.2319202
## 290 855.6241553
## 291 -2782.0836411
## 292 -2473.0760178
## 293 -1622.3374319
## 294 -680.9392553
## 295 -916.1816920
## 296 -782.3642546
## 297 -1832.3592835
## 298 184.7016104
## 299 1934.7413501
## 300 1143.5184247
## 301 391.3823761
## 302 -101.5892473
## 303 -1324.7155821
## 304 1063.0456544
## 305 -1187.7155850
## 306 620.8963573
## 307 -1029.8934172
## 308 -532.4638016
## 309 181.3050928
## 310 1846.0383501
## 311 1745.9097181
## 312 305.5407371
## 313 50.5131656
## 314 1004.6281118
## 315 349.6569067
## 316 -490.9521275
## 317 -4540.9768312
## 318 -2746.6318262
## 319 2290.5843798
## 320 2079.1081753
## 321 513.7782362
## 322 579.3238357
## 323 -3438.6566157
## 324 2266.3097940
## 325 3488.9430214
## 326 -1393.0600364
## 327 3944.9769384
## 328 -5856.5770409
## 329 275.3060681
## 330 1471.9038291
## 331 878.2598713
## 332 -2767.3346886
## 333 2542.0661504
## 334 -94.0074143
## 335 -1392.9075770
## 336 401.1850585
## 337 2781.8219318
## 338 -1641.5932310
## 339 -824.7370059
## 340 -701.0018538
## 341 -1012.9246970
## 342 155.4769858
## 343 732.1886591
## 344 885.3350430
## 345 2742.0494959
## 346 398.8037287
## 347 593.5233902
## 348 2176.5634562
## 349 2580.6264210
## 350 578.0089279
## 351 68.3293292
## 352 -4341.9286319
## 353 1002.4935461
## 354 1169.8539159
## 355 -641.9426230
## 356 -402.4977827
## 357 -539.1395232
## 358 -1399.4045602
## 359 -920.0191489
## 360 -2756.7149352
## 361 2992.3491158
## 362 399.8284670
## 363 1625.5299216
## 364 640.2603017
## 365 2850.4184517
## 366 -3152.4161521
## 367 -1495.5278880
## 368 247.3676287
## 369 -869.3597537
## 370 4192.1084846
## 371 -2653.5709948
## 372 -2883.9636588
## 373 1973.5826939
## 374 2808.5646668
## 375 1578.6340163
## 376 876.3341183
## 377 2326.8188356
## 378 -1858.0360954
## 379 2017.4341551
## 380 -2452.7044439
## 381 -839.2243426
## 382 -4409.6592067
## 383 -2206.4159802
## 384 1829.1929700
## 385 1711.8352885
## 386 668.7840400
## 387 -3662.4835675
## 388 1724.5737700
## 389 -3293.0659782
## 390 -504.2203819
## 391 -969.4865228
## 392 -1533.3986265
## 393 627.0195741
## 394 -788.8905720
## 395 10.3611740
## 396 -195.0569881
## 397 671.3395786
## 398 -41.8041186
## 399 1585.4012570
## 400 -2713.7093058
## 401 2202.6094694
## 402 154.4789024
## 403 -1695.9555283
## 404 1179.4700644
## 405 190.6197515
## 406 522.7041883
## 407 -1724.7885610
## 408 2299.0888539
## 409 322.2222314
## 410 924.1466860
## 411 811.5960281
## 412 2603.7142903
## 413 -2731.9123025
## 414 -2337.2294471
## 415 513.8454455
## 416 3887.2344750
## 417 -129.4814756
## 418 -1673.1883973
## 419 -4040.4198531
## 420 2714.9752981
## 421 2364.3518672
## 422 407.7835714
## 423 1830.3712366
## 424 -221.3683290
## 425 2476.5553894
## 426 -632.4870098
## 427 1562.4913828
## 428 -2448.6230838
## 429 402.8662257
## 430 1951.8092194
## 431 -895.9932086
## 432 -1291.9671424
## 433 45.9071696
## 434 -739.1203209
## 435 -1988.5466720
## 436 3409.0877538
## 437 -2304.2929843
## 438 -1539.1391118
## 439 -744.8150093
## 440 -1636.9235966
## 441 774.0407898
## 442 -2220.0545470
## 443 788.7091343
## 444 -2207.6830649
## 445 -251.0476176
## 446 1441.4193869
## 447 -5315.8468193
## 448 1170.3499378
## 449 1820.6992946
## 450 2257.6464061
## 451 572.7525847
## 452 -1366.5551657
## 453 -347.7682070
## 454 -1586.7905437
## 455 -535.9511139
## 456 1812.5125083
## 457 -40.8157467
## 458 -1826.8235898
## 459 1300.9537076
## 460 1601.6159626
## 461 1394.1384125
## 462 -197.2020325
## 463 142.4727247
## 464 -1239.9278124
## 465 165.9076543
## 466 1980.6198797
## 467 -68.0508615
## 468 -647.1222744
## 469 2619.0646447
## 470 -1015.9293919
## 471 -1619.8239677
## 472 1113.7990845
## 473 856.8634981
## 474 888.7578940
## 475 -1340.8073583
## 476 317.7919423
## 477 -431.6122296
## 478 932.2597634
## 479 423.0475288
## 480 -138.0621759
## 481 -1964.7200495
## 482 -4901.2966151
## 483 5340.8127379
## 484 1522.4980610
## 485 195.9375243
## 486 148.6029624
## 487 -1117.2104594
## 488 1887.2892086
## 489 -863.1419332
## 490 212.1715027
## 491 2262.8866722
## 492 -1329.3196713
## 493 2088.1369548
## 494 2920.4796072
## 495 2412.7022588
## 496 879.3207150
## 497 -2516.9256764
## 498 -1706.7317588
## 499 -1350.7116961
## 500 41.8559252
## 501 -1452.7458688
## 502 -2495.1580006
## 503 -462.5370748
## 504 -2035.7797991
## 505 -1397.4948328
## 506 1684.0716584
## 507 1261.8613961
## 508 -587.7319528
## 509 925.7541070
## 510 508.9560557
## 511 -611.4416966
## 512 2047.2977906
## 513 -857.0385157
## 514 859.7082281
## 515 -995.9238154
## 516 -2026.2246174
## 517 3687.7536450
## 518 1157.9716479
## 519 -2284.9295420
## 520 -1587.6605682
## 521 -1308.6708092
## 522 -467.1892744
## 523 21.3710108
## 524 479.4573105
## 525 434.2181049
## 526 1737.4088101
## 527 -550.1054659
## 528 -2842.7874395
## 529 -254.0773374
## 530 2037.7372605
## 531 -802.4657662
## 532 -0.6695602
## 533 -1175.3167546
## 534 1096.8280068
## 535 1657.7171262
## 536 112.1405112
## 537 2693.4089896
## 538 -838.5676815
## 539 -3177.8952364
## 540 -1598.3391241
## 541 -3070.2591664
## 542 -284.0825467
## 543 4206.7068289
## 544 -53.2527738
## 545 2337.2673977
## 546 -454.2225997
## 547 -1847.5553688
## 548 -1551.7014943
## 549 -1475.7074626
## 550 2834.0387121
## 551 -3008.7094372
## 552 1247.2652452
## 553 -330.0210944
## 554 -534.5313692
## 555 1335.3751380
## 556 -2840.7145952
## 557 -2654.7162883
## 558 -1522.2257537
## 559 1574.8211859
## 560 647.0103285
## 561 3276.1116759
## 562 -1684.1674684
## 563 -1878.1992279
## 564 6179.2178116
## 565 98.6229635
## 566 -809.4809107
## 567 -101.7072177
## 568 986.2908813
## 569 -1654.7970473
## 570 2752.1966898
## 571 -3305.7002735
## 572 -775.2511513
## 573 1519.8173048
## 574 131.4735435
## 575 1443.0386089
## 576 -2570.3679315
## 577 1312.1010101
## 578 -1651.2758708
## 579 -448.2119957
## 580 236.0444981
## 581 -1089.1856562
## 582 -1905.0335968
## 583 475.2995075
## 584 -637.1722137
## 585 -5007.6640577
## 586 2683.7737533
## 587 -566.1880259
## 588 -6.5773894
## 589 -267.7218352
## 590 -2195.8870815
## 591 -2089.2859974
## 592 3693.0409869
## 593 3815.2288195
## 594 2219.0807003
## 595 -42.7876088
## 596 650.6627136
## 597 65.1588584
## 598 -103.4919009
## 599 1717.6161915
## 600 1244.2433537
##
## $school
## (Intercept) years_worked
## 1 -787.7252 -1896.54645
## 2 -1182.4089 -1185.94805
## 3 3447.4161 -1606.97658
## 4 -1139.8900 -1088.79607
## 5 800.2666 -2067.40071
## 6 3849.7029 -322.84512
## 7 376.5007 755.86640
## 8 -2846.8572 563.93767
## 9 -1105.6172 -77.20062
## 10 -1147.8199 952.85761
## 11 1556.5695 690.31436
## 12 106.3735 392.06807
## 13 -128.3956 858.68399
## 14 407.3814 640.78552
## 15 -1977.4082 730.86180
## 16 3452.0502 624.92076
## 17 -189.6471 833.47958
## 18 -3461.3417 397.33294
## 19 1551.5623 410.39373
## 20 -1580.7122 394.21120
##
## with conditional variances for "teacher" "school"
These extensions provide more flexibility and help in analyzing complex datasets by customizing mixed-effects models.
Bayesian mixed models provide rich insights through posterior
distributions and other diagnostics. Below are some ways to visualize
different aspects of Bayesian mixed models using the brms
package.
We can visualize the posterior distributions of the model parameters:
# Plot posterior distributions
plot(model_bayes)
Marginal effects plots show the relationship between predictors and outcomes, averaged over other variables:
# Plot marginal effects
marginal_effects(model_bayes)
Posterior predictive checks help evaluate how well the model fits the observed data by comparing the observed outcomes to those simulated from the posterior:
# Plot posterior predictive checks
pp_check(model_bayes)
Trace plots are used to diagnose the convergence of MCMC sampling:
# Plot trace and density plots
plot(model_bayes, combo = c("trace", "density"))
Bayes factors provide a comparison of model fit between two Bayesian models:
# Compare models using Bayes Factor
bayes_factor(model_bayes, model_with_gender)
Credible intervals show the range of parameter values that contain a certain proportion of the posterior distribution (e.g., 95%):
# Plot credible intervals for fixed effects
plot(conditional_effects(model_bayes, prob = 0.95), points = TRUE)
You can also extract and plot the random effects:
# Extract and plot random effects
ranef_bayes <- ranef(model_bayes)
plot(ranef_bayes)
These visualizations help assess the fit of the Bayesian mixed model, its predictions, and its uncertainties. They are key for interpreting the posterior distributions and random effects.
In this section, we start by simulating teacher salary data across different schools, where the salary depends on years of experience. We will fit random intercept and random slope models to this data.
# Set seed for reproducibility
set.seed(123)
# Define number of schools and teachers per school
n_schools <- 20
n_teachers_per_school <- 30
n_total <- n_schools * n_teachers_per_school # Total observations
# Simulate random intercepts (base salary) and random slopes (salary increase)
school_intercepts <- rnorm(n_schools, mean = 30000, sd = 2000) # Base salary
school_slopes <- rnorm(n_schools, mean = 1500, sd = 300) # Salary growth rate
# Generate years of experience and school grouping
years_worked <- runif(n_total, min = 1, max = 40) # Teacher's experience in years
school <- rep(1:n_schools, each = n_teachers_per_school) # Grouping by school
# Generate salary data with random intercepts and slopes
salary <- rep(NA, n_total)
for (i in 1:n_schools) {
idx <- which(school == i)
salary[idx] <- school_intercepts[i] + school_slopes[i] * years_worked[idx] + rnorm(n_teachers_per_school, mean = 0, sd = 2000)
}
# Combine data into a dataframe
data <- data.frame(school = factor(school), years_worked = years_worked, salary = salary)
# View a portion of the data
head(data)
## school years_worked salary
## 1 1 10.501159 42515.15
## 2 1 27.054168 62711.59
## 3 1 17.288224 52615.26
## 4 1 31.739638 66432.84
## 5 1 5.011721 34687.18
## 6 1 17.960817 46560.10
We will fit two types of models: one with only random intercepts and another with both random intercepts and slopes for schools.
# Fit a random intercept model
model_random_intercept <- lmer(salary ~ years_worked + (1 | school), data = data)
# Fit a random slope model
model_random_slope <- lmer(salary ~ years_worked + (years_worked | school), data = data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0109629 (tol = 0.002, component 1)
# View summaries of both models
summary(model_random_intercept)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (1 | school)
## Data: data
##
## REML criterion at convergence: 11505.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.0299 -0.6433 0.0602 0.6948 2.6783
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 28764348 5363
## Residual 11227538 3351
## Number of obs: 600, groups: school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 29660.46 1233.40 20.71 24.05 <2e-16 ***
## years_worked 1517.40 12.36 579.41 122.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.206
summary(model_random_slope)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ years_worked + (years_worked | school)
## Data: data
##
## REML criterion at convergence: 10948.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.74046 -0.62584 0.06166 0.64776 2.78432
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 3699501 1923.4
## years_worked 63620 252.2 -0.10
## Residual 4002754 2000.7
## Number of obs: 600, groups: school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 30092.86 464.12 19.23 64.84 <2e-16 ***
## years_worked 1497.38 56.90 18.96 26.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_workd -0.134
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0109629 (tol = 0.002, component 1)
We will compare the two models using AIC (Akaike Information Criterion) and likelihood ratio tests.
# Compare models using AIC
AIC(model_random_intercept, model_random_slope)
## df AIC
## model_random_intercept 4 11513.26
## model_random_slope 6 10960.67
# Likelihood ratio test between models
anova(model_random_intercept, model_random_slope)
## refitting model(s) with ML (instead of REML)
## Data: data
## Models:
## model_random_intercept: salary ~ years_worked + (1 | school)
## model_random_slope: salary ~ years_worked + (years_worked | school)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## model_random_intercept 4 11536 11554 -5764.1 11528
## model_random_slope 6 10985 11011 -5486.3 10973 555.49 2 < 2.2e-16
##
## model_random_intercept
## model_random_slope ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, we visualize the relationship between years worked and salary for each school.
# Plotting salary vs. years worked, with regression lines for each school
ggplot(data, aes(x = years_worked, y = salary, color = school)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, aes(group = school), linetype = "dashed") +
theme_minimal() +
labs(title = "Salary vs. Years Worked",
x = "Years Worked",
y = "Salary",
color = "School")
## `geom_smooth()` using formula = 'y ~ x'
When comparing Bayesian and frequentist models, we can evaluate fit using different criteria (e.g., residuals for frequentist models, posterior predictive checks for Bayesian models) and statistical measures (e.g., AIC/BIC vs. LOO). Below are key methods for comparison:
# Residuals for the frequentist model
plot(resid(model_teacher_random))
# Posterior predictive checks for the Bayesian model
pp_check(model_bayes)
# Compare models using AIC/BIC in frequentist models
AIC(model_random_intercept, model_teacher_random)
## Warning in AIC.default(model_random_intercept, model_teacher_random): models
## are not all fitted to the same number of observations
## df AIC
## model_random_intercept 4 11513.26
## model_teacher_random 7 55325.75
BIC(model_random_intercept, model_teacher_random)
## Warning in BIC.default(model_random_intercept, model_teacher_random): models
## are not all fitted to the same number of observations
## df BIC
## model_random_intercept 4 11530.84
## model_teacher_random 7 55367.79
# Leave-One-Out Cross-Validation (LOO) for Bayesian models
loo_bayes <- loo(model_bayes)
print(loo_bayes)
# Bayes Factor comparison for Bayesian models
bayes_factor(model_bayes, model_with_gender)
# Coefficients from the frequentist model
coef_frequentist <- fixef(model_teacher_random)
print(coef_frequentist)
## (Intercept) years_worked
## 30305.7494 978.6528
# Posterior means from the Bayesian model
posterior_means <- summary(model_bayes)$fixed[, "Estimate"]
print(posterior_means)
# Confidence intervals for the frequentist model
confint(model_teacher_random)
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 1921.0590047 2.203286e+03
## .sig02 1484.5233310 2.903189e+03
## .sig03 -0.5625566 2.793986e-01
## .sig04 733.2898497 1.383857e+03
## .sigma 1931.3204621 2.044239e+03
## (Intercept) 29344.1758448 3.126732e+04
## years_worked 526.3288275 1.430977e+03
# Credible intervals for the Bayesian model
summary(model_bayes)$fixed[, c("Estimate", "l-95% CI", "u-95% CI")]
# Predicted values from the frequentist model
predicted_frequentist <- predict(model_teacher_random)
head(predicted_frequentist)
## 1 2 3 4 5 6
## 28416.35 27498.46 26580.57 25662.67 24744.78 30978.91
# Predicted values from the Bayesian model
predicted_bayesian <- posterior_predict(model_bayes)
head(predicted_bayesian)
In this tutorial, we will generate and analyze three types of datasets: longitudinal data, hierarchical data, and nested data. For each dataset, we will simulate data, fit mixed-effects models, and visualize the results.
Longitudinal data involves repeated measurements taken over time for the same individuals. This could be health measurements taken from patients over several months, or student test scores collected across different semesters.
In this example, we will simulate data for 100 patients measured at 5 different time points. The outcome is a health measurement that changes over time, and the patients’ intercepts and slopes are randomly generated.
# Load necessary libraries
if (!require(lme4)) install.packages("lme4")
if (!require(ggplot2)) install.packages("ggplot2")
library(lme4)
library(ggplot2)
# Set seed for reproducibility
set.seed(123)
# Simulate longitudinal data: patients measured at 5 time points
n_patients <- 100
n_timepoints <- 5
n_total <- n_patients * n_timepoints
# Simulate patient-specific intercepts and slopes for health measurements
patient_intercepts <- rnorm(n_patients, mean = 50, sd = 5)
patient_slopes <- rnorm(n_patients, mean = 2, sd = 0.5)
# Generate time variable (time points for each patient)
time <- rep(1:n_timepoints, times = n_patients)
# Generate health measurements
health_measurements <- rep(NA, n_total)
patient <- rep(1:n_patients, each = n_timepoints)
for (i in 1:n_patients) {
idx <- which(patient == i)
health_measurements[idx] <- patient_intercepts[i] + patient_slopes[i] * time[idx] + rnorm(n_timepoints, mean = 0, sd = 2)
}
# Combine into data frame
longitudinal_data <- data.frame(patient = factor(patient), time = time, health_measurement = health_measurements)
# View part of the data
head(longitudinal_data)
## patient time health_measurement
## 1 1 1 53.24004
## 2 1 2 53.11204
## 3 1 3 51.60172
## 4 1 4 54.86320
## 5 1 5 54.59293
## 6 2 1 50.02506
We fit a random intercept and random slope model to the longitudinal data to capture variability in health measurements across time for different patients.
# Fit random intercept and random slope model
model_longitudinal <- lmer(health_measurement ~ time + (time | patient), data = longitudinal_data)
summary(model_longitudinal)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: health_measurement ~ time + (time | patient)
## Data: longitudinal_data
##
## REML criterion at convergence: 2497.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.64183 -0.52849 0.03303 0.59224 2.58574
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## patient (Intercept) 20.3868 4.5152
## time 0.3222 0.5676 -0.14
## Residual 3.9640 1.9910
## Number of obs: 500, groups: patient, 100
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 50.53616 0.49747 99.00038 101.59 <2e-16 ***
## time 1.91794 0.08477 98.99995 22.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## time -0.365
We visualize the health measurements over time for each patient.
# Visualize health measurements over time for each patient
ggplot(longitudinal_data, aes(x = time, y = health_measurement, color = patient)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, aes(group = patient)) +
theme_minimal() +
labs(title = "Health Measurement vs Time (Longitudinal Data)",
x = "Time",
y = "Health Measurement",
color = "Patient")
## `geom_smooth()` using formula = 'y ~ x'
Hierarchical data involves individuals grouped within larger units, such as students nested within schools or employees nested within departments.
We simulate test scores for students, where students are grouped within schools. Each school has its own intercept (average test score) and slope (effect of years of education).
# Simulate hierarchical data: students grouped within schools
n_schools <- 20
n_students_per_school <- 50
n_total <- n_schools * n_students_per_school
# Simulate school-specific intercepts and slopes for student test scores
school_intercepts <- rnorm(n_schools, mean = 70, sd = 5)
school_slopes <- rnorm(n_schools, mean = 0.5, sd = 0.1)
# Generate student-specific test scores based on years of education
years_education <- runif(n_total, min = 1, max = 12)
school <- rep(1:n_schools, each = n_students_per_school)
# Generate test scores
test_scores <- rep(NA, n_total)
for (i in 1:n_schools) {
idx <- which(school == i)
test_scores[idx] <- school_intercepts[i] + school_slopes[i] * years_education[idx] + rnorm(n_students_per_school, mean = 0, sd = 5)
}
# Combine into data frame
hierarchical_data <- data.frame(school = factor(school), years_education = years_education, test_score = test_scores)
# View part of the data
head(hierarchical_data)
## school years_education test_score
## 1 1 9.378088 66.44053
## 2 1 2.233171 71.27341
## 3 1 10.228747 77.52939
## 4 1 9.952356 66.37765
## 5 1 11.905534 67.80948
## 6 1 1.531460 70.21900
We fit a random intercept and random slope model to the hierarchical data to capture differences in test scores across schools.
# Fit a random intercept and random slope model for schools
model_hierarchical <- lmer(test_score ~ years_education + (years_education | school), data = hierarchical_data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00227531 (tol = 0.002, component 1)
summary(model_hierarchical)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: test_score ~ years_education + (years_education | school)
## Data: hierarchical_data
##
## REML criterion at convergence: 6099
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0249 -0.6670 0.0137 0.6982 3.4375
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 16.22188 4.0276
## years_education 0.01967 0.1403 0.27
## Residual 24.07801 4.9069
## Number of obs: 1000, groups: school, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 70.34630 0.96658 19.17814 72.779 < 2e-16 ***
## years_education 0.52049 0.05788 18.41235 8.993 3.69e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## years_edctn -0.139
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00227531 (tol = 0.002, component 1)
We visualize test scores versus years of education for each school.
# Visualize test scores vs years of education for each school
ggplot(hierarchical_data, aes(x = years_education, y = test_score, color = school)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, aes(group = school)) +
theme_minimal() +
labs(title = "Test Scores vs Years of Education (Hierarchical Data)",
x = "Years of Education",
y = "Test Score",
color = "School")
## `geom_smooth()` using formula = 'y ~ x'
Nested data refers to data where observations are nested within subjects. For example, biological data where multiple measurements are taken from different body parts of the same individual.
We simulate data for individuals where multiple body parts are measured. Each individual has their own intercept and slope for the body part measurements.
# Simulate nested data: measurements from multiple body parts for each individual
n_individuals <- 50
n_body_parts_per_individual <- 4
n_total <- n_individuals * n_body_parts_per_individual
# Simulate individual-specific intercepts and slopes for body part measurements
individual_intercepts <- rnorm(n_individuals, mean = 30, sd = 5)
individual_slopes <- rnorm(n_individuals, mean = 1.5, sd = 0.3)
# Generate measurement variable (e.g., body part size)
body_part <- rep(1:n_body_parts_per_individual, times = n_individuals)
individual <- rep(1:n_individuals, each = n_body_parts_per_individual)
# Generate body part measurements
measurements <- rep(NA, n_total)
for (i in 1:n_individuals) {
idx <- which(individual == i)
measurements[idx] <- individual_intercepts[i] + individual_slopes[i] * body_part[idx] + rnorm(n_body_parts_per_individual, mean = 0, sd = 1)
}
# Combine into data frame
nested_data <- data.frame(individual = factor(individual), body_part = body_part, measurement = measurements)
# View part of the data
head(nested_data)
## individual body_part measurement
## 1 1 1 26.29874
## 2 1 2 27.26560
## 3 1 3 27.32426
## 4 1 4 30.60466
## 5 2 1 31.79844
## 6 2 2 33.16914
We fit a random intercept and random slope model to the nested data to account for differences in measurements across body parts for each individual.
# Fit a random intercept and random slope model for individuals
model_nested <- lmer(measurement ~ body_part + (body_part | individual), data = nested_data)
summary(model_nested)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: measurement ~ body_part + (body_part | individual)
## Data: nested_data
##
## REML criterion at convergence: 795.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.09567 -0.56315 0.02403 0.49072 2.64707
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## individual (Intercept) 16.97903 4.1206
## body_part 0.07862 0.2804 0.12
## Residual 0.97930 0.9896
## Number of obs: 200, groups: individual, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 28.57091 0.60742 48.99867 47.04 <2e-16 ***
## body_part 1.50028 0.07409 49.00000 20.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## body_part -0.154
We visualize measurements across body parts for each individual.
# Visualize measurements vs body part for each individual
ggplot(nested_data, aes(x = body_part, y = measurement, color = individual)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, aes(group = individual)) +
theme_minimal() +
labs(title = "Body Part Measurement vs Body Part (Nested Data)",
x = "Body Part",
y = "Measurement",
color = "Individual")
## `geom_smooth()` using formula = 'y ~ x'
This tutorial demonstrates how to generate and analyze three types of datasets: 1. Longitudinal data: Repeated measurements over time for individuals. 2. Hierarchical data: Individuals grouped within larger units (e.g., students within schools). 3. Nested data: Observations nested within subjects (e.g., measurements from multiple body parts of the same individual).
The same mixed-effects modeling approach can be extended to real-world datasets where hierarchical, longitudinal, or nested data structures exist.