You may have to install some of these packages if they are not
already on your machine using the install.packages()
function.
In this document, we will use a simulated data set and linear growth models to explore the development of reading skill from 1st grade through 8th grade in 100 students. Each individual in this data set has one reading score for each year from 1st through 7th grade.
The columns Subject, Grade, and Reading correspond to the group, time, and change variables required for growth modeling, respectively.
set.seed(123) # to allow for reproducible data
# Number of subjects
n_subjects <- 100
# Number of time points per subject
n_timepoints <- 8
# Create a subject ID variable
subject_ids <- rep(1:n_subjects, each = n_timepoints)
# Simulate time variable
time <- rep(seq(1, n_timepoints), n_subjects)
# Fixed effects
beta0 <- 50 # Population-level intercept
beta1 <- 10 # Initial slope (before plateau)
beta2 <- 1 # Reduced slope after plateau (time > 5)
# Define the variance of random intercepts and slopes
sigma_intercept <- 20
sigma_slope <- 5
rho <- -0.6
# Construct the covariance matrix
cov_matrix <- matrix(c(sigma_intercept^2, rho * sigma_intercept * sigma_slope,
rho * sigma_intercept * sigma_slope, sigma_slope^2),
nrow = 2)
# Generate correlated random intercepts and slopes
random_effects <- mvrnorm(n = n_subjects, mu = c(0, 0), Sigma = cov_matrix)
# Extract intercepts and slopes
random_intercepts <- random_effects[, 1]
random_slopes <- random_effects[, 2]
# Map random effects to each subject
subject_intercepts <- rep(random_intercepts, each = n_timepoints)
subject_slopes <- rep(random_slopes, each = n_timepoints)
# Apply piecewise linear growth (plateau effect after time = 5)
y <- (beta0 + subject_intercepts) +
(beta1 + subject_slopes) * pmin(time, 6) + # Growth before plateau
beta2 * pmax(time - 6, 0) + # Reduced growth after plateau
rnorm(n_subjects * n_timepoints, mean = 0, sd = 5) # Residual variance
# Create the data frame
longitudinal_data <- data.frame(
subject = factor(subject_ids),
Grade = time-1,
Reading = y
)
Let’s plot the raw reading scores. In this plot, each line represents an individual and how their reading scores change over the course of 1st through 8th grade. What do you notice about the data?
raw_data_plot = longitudinal_data %>%
mutate(Grade = as.factor(Grade+1)) %>% # reset Grade so y-intercet label is 1st grade
ggplot( aes(x = Grade, y = Reading, group = subject)) +
geom_line(alpha = 0.3) + # reduce transparency of lines
theme_minimal() +
labs(x = "Grade", y = "Reading Score")+
theme(
axis.text=element_text(size=15),
axis.title=element_text(size=17))
raw_data_plot
We can first fit a simple linear model using the lm()
function. Examining the output of the summary() function,
we can see from the coefficients that in first grade
((Intercept)) the average reading score is 63.83 and that,
on average, with each subsequent grade, reading scores increase by 8.52
points.
# Linear Model
linear_model = lm(Reading ~ Grade, data = longitudinal_data)
summary(linear_model) # examine model output
##
## Call:
## lm(formula = Reading ~ Grade, data = longitudinal_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.419 -12.444 0.098 14.863 51.042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.0838 1.3225 47.70 <2e-16 ***
## Grade 8.5165 0.3161 26.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.49 on 798 degrees of freedom
## Multiple R-squared: 0.4763, Adjusted R-squared: 0.4756
## F-statistic: 725.7 on 1 and 798 DF, p-value: < 2.2e-16
We can also generate predicted reading scores from our model. As we can see from below, this model estimates a single intercept and slope for the entire sample (red line). What does this model do well? And what does it miss in the data?
longitudinal_data$linear_pred = predict(linear_model) # add OLS prediction to df
longitudinal_data %>%
mutate(Grade = as.factor(Grade+1)) %>%
ggplot(aes(x = Grade, y = Reading, group = subject)) +
geom_line(color="black", alpha=0.2) +
geom_line(aes(x = time, y = linear_pred), color="red",linewidth=2) +
theme_minimal() +
theme(
axis.text=element_text(size=15),
axis.title=element_text(size=17))
We can then expand our model to include estimates for each individual
in our simulated dataset. To do so, we use the lmer()
function. The syntax for this model looks very similar to our first
model but includes something called random effects, which allow our
model to generate estimates at the individual level. The random effects
looks like (1|subject). The right hand side refers to the
grouping variable in our data set.
# fit the model
int_mod = lmer(Reading ~ Grade + (1|subject), data = longitudinal_data)
When we look at the summary of our model, it looks a little more
complex compared to our first model. The section of the output with the
header Fixed effects refers to the average effects across
the entire sample. These look identical to the coefficients from our
first model and suggest an average 1st grade reading score of 63.08 and
an average growth rate of 8.52 points per grade.
The Random effects header tells us some information
about our individual-specific estimates. The first row (where it says
(Intercept)) describes the variability of reading scores
for each subject in first grade. This value is not 0, suggesting that
students started from different places and can be better examined by
plotting the model predictions (see below).
# let's look at the summary
summary(int_mod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reading ~ Grade + (1 | subject)
## Data: longitudinal_data
##
## REML criterion at convergence: 6486.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.6982 -0.4872 0.0429 0.5753 3.2464
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 285.7 16.90
## Residual 136.2 11.67
## Number of obs: 800, groups: subject, 100
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 63.0838 1.8506 126.3866 34.09 <2e-16 ***
## Grade 8.5165 0.1801 699.0000 47.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Grade -0.341
After we fit our model, we can plot both the average prediction across the entire sample (red line) and predictions for each individual in the data (black lines). As you can see, the starting points vary for each individual line, but the slopes are the same.
# generate predictions
longitudinal_data$int_pred = predict(int_mod) # individual level predictions
longitudinal_data$int_proto = predict(int_mod,re.form=NA) # average sample prediction
# plot our data
longitudinal_data %>%
mutate(Grade = as.factor(Grade+1)) %>%
ggplot(aes(x = Grade, y = int_pred, group = subject)) +
geom_line(color="black", alpha=0.4) + # individual predictions in black
geom_line(aes(x = time, y = int_proto),
color="red",linewidth=2) + # average prediction in red
ylab("Predicted Reading Scores")+
theme_minimal() +
theme(
axis.text=element_text(size=15),
axis.title=element_text(size=17))
Here we can update our individual intercept model to try and capture some of the variability in growth trajectories. To do so, we can update the right side of our random effects structure to include Grade. This tells the model to estimate a slope for each individual in our dataset.
# Linear Model
# fit the model
growth_mod = lmer(Reading ~ Grade + (1+Grade|subject), data = longitudinal_data)
When we examine the model summary, we again see the same values as
before for our fixed effects. The random effects, however, now include a
new row, Grade. The fact that the Variance and Std.Dev. are
not 0 suggest that the individuals in our data set not vary in their
starting points but also in how they change over time. We can again get
a sense of this visually by plotting the model predictions.
# let's look at the summary
summary(growth_mod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reading ~ Grade + (1 + Grade | subject)
## Data: longitudinal_data
##
## REML criterion at convergence: 6118
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4883 -0.5409 -0.0072 0.5651 3.3275
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 215.48 14.679
## Grade 12.67 3.560 -0.21
## Residual 60.85 7.801
## Number of obs: 800, groups: subject, 100
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 63.0838 1.5519 99.0002 40.65 <2e-16 ***
## Grade 8.5165 0.3758 98.9972 22.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Grade -0.272
From our model, we can again plot the average prediction across the entire sample (red line) and predictions for each individual in the data (black lines). As you can see, now both the starting points and growth slopes vary for each individual line, which is more in line with what we found in the raw data.
# generate predictions
longitudinal_data$growth_pred = predict(growth_mod)
longitudinal_data$growth_proto = predict(growth_mod,re.form=NA)
# plot our data
growth_plot = longitudinal_data %>%
mutate(Grade = as.factor(Grade+1)) %>%
ggplot(aes(x = Grade, y = growth_pred, group = subject)) +
geom_line(color="black", alpha=0.4) + # individual predictions in black
geom_line(aes(x = time, y = growth_proto),
color="red",linewidth=2) + # average prediction in red
ylab("Predicted Reading Scores")+
ylim(ggplot_build(raw_data_plot)$layout$panel_scales_y[[1]]$range$range)+
theme_minimal() +
theme(
axis.text=element_text(size=15),
axis.title=element_text(size=17))
growth_plot
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
And for a closer comparison, we can plot our raw data and growth model predictions side-by-side. What does our model do well and what could it do better?
(raw_data_plot | growth_plot)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
Hooray! We’ve fit a linear growth model and identified individual growth trajectories for our simulated data! If this were a real dataset, some next steps might be to look at other variables (age, classroom, etc…) to see how they might relate to individual differences in reading gains or to explore non-linear growth models to see if we can capture some of the curve-shapes in the raw data (in fact, you could even try this with the simulated data). And of course, you can try to find your own longitudinal dataset and use these methods to fit growth models to explore your own research questions.