Load neccessary packages

You may have to install some of these packages if they are not already on your machine using the install.packages() function.

Introduction

In this document, we will use a simulated data set and linear growth models to explore the development of reading skill from 1st grade through 8th grade in 100 students. Each individual in this data set has one reading score for each year from 1st through 7th grade.

The columns Subject, Grade, and Reading correspond to the group, time, and change variables required for growth modeling, respectively.

Simulate Data

set.seed(123) # to allow for reproducible data

# Number of subjects
n_subjects <- 100

# Number of time points per subject
n_timepoints <- 8

# Create a subject ID variable
subject_ids <- rep(1:n_subjects, each = n_timepoints)

# Simulate time variable
time <- rep(seq(1, n_timepoints), n_subjects)

# Fixed effects
beta0 <- 50   # Population-level intercept
beta1 <- 10    # Initial slope (before plateau)
beta2 <- 1  # Reduced slope after plateau (time > 5)

# Define the variance of random intercepts and slopes
sigma_intercept <- 20  
sigma_slope <- 5     
rho <- -0.6            

# Construct the covariance matrix
cov_matrix <- matrix(c(sigma_intercept^2, rho * sigma_intercept * sigma_slope,
                       rho * sigma_intercept * sigma_slope, sigma_slope^2), 
                     nrow = 2)

# Generate correlated random intercepts and slopes
random_effects <- mvrnorm(n = n_subjects, mu = c(0, 0), Sigma = cov_matrix)

# Extract intercepts and slopes
random_intercepts <- random_effects[, 1]
random_slopes <- random_effects[, 2]

# Map random effects to each subject
subject_intercepts <- rep(random_intercepts, each = n_timepoints)
subject_slopes <- rep(random_slopes, each = n_timepoints)

# Apply piecewise linear growth (plateau effect after time = 5)
y <- (beta0 + subject_intercepts) + 
     (beta1 + subject_slopes) * pmin(time, 6) +  # Growth before plateau
     beta2 * pmax(time - 6, 0) +                 # Reduced growth after plateau
     rnorm(n_subjects * n_timepoints, mean = 0, sd = 5)  # Residual variance


# Create the data frame
longitudinal_data <- data.frame(
  subject = factor(subject_ids),
  Grade = time-1,
  Reading = y
)

Plot the Raw Data

Let’s plot the raw reading scores. In this plot, each line represents an individual and how their reading scores change over the course of 1st through 8th grade. What do you notice about the data?

raw_data_plot = longitudinal_data %>% 
  mutate(Grade = as.factor(Grade+1)) %>% # reset Grade so y-intercet label is 1st grade
  ggplot( aes(x = Grade, y = Reading, group = subject)) +
  geom_line(alpha = 0.3) + # reduce transparency of lines
  theme_minimal() +
  labs(x = "Grade", y = "Reading Score")+
  theme(
        axis.text=element_text(size=15),
        axis.title=element_text(size=17))


raw_data_plot

Ordinary Least Squares Regression Model

We can first fit a simple linear model using the lm() function. Examining the output of the summary() function, we can see from the coefficients that in first grade ((Intercept)) the average reading score is 63.83 and that, on average, with each subsequent grade, reading scores increase by 8.52 points.

# Linear Model

linear_model = lm(Reading ~ Grade, data = longitudinal_data)
summary(linear_model) # examine model output
## 
## Call:
## lm(formula = Reading ~ Grade, data = longitudinal_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.419 -12.444   0.098  14.863  51.042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.0838     1.3225   47.70   <2e-16 ***
## Grade         8.5165     0.3161   26.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.49 on 798 degrees of freedom
## Multiple R-squared:  0.4763, Adjusted R-squared:  0.4756 
## F-statistic: 725.7 on 1 and 798 DF,  p-value: < 2.2e-16

We can also generate predicted reading scores from our model. As we can see from below, this model estimates a single intercept and slope for the entire sample (red line). What does this model do well? And what does it miss in the data?

longitudinal_data$linear_pred = predict(linear_model) # add OLS prediction to df


longitudinal_data %>% 
  mutate(Grade = as.factor(Grade+1)) %>% 
  ggplot(aes(x = Grade, y = Reading, group = subject)) +
  geom_line(color="black", alpha=0.2) +
  geom_line(aes(x = time, y = linear_pred), color="red",linewidth=2) +
 theme_minimal() +
  theme(
        axis.text=element_text(size=15),
        axis.title=element_text(size=17))

Individual Intercept Model

We can then expand our model to include estimates for each individual in our simulated dataset. To do so, we use the lmer() function. The syntax for this model looks very similar to our first model but includes something called random effects, which allow our model to generate estimates at the individual level. The random effects looks like (1|subject). The right hand side refers to the grouping variable in our data set.

# fit the model
int_mod = lmer(Reading ~ Grade + (1|subject), data = longitudinal_data)

When we look at the summary of our model, it looks a little more complex compared to our first model. The section of the output with the header Fixed effects refers to the average effects across the entire sample. These look identical to the coefficients from our first model and suggest an average 1st grade reading score of 63.08 and an average growth rate of 8.52 points per grade.

The Random effects header tells us some information about our individual-specific estimates. The first row (where it says (Intercept)) describes the variability of reading scores for each subject in first grade. This value is not 0, suggesting that students started from different places and can be better examined by plotting the model predictions (see below).

# let's look at the summary
summary(int_mod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reading ~ Grade + (1 | subject)
##    Data: longitudinal_data
## 
## REML criterion at convergence: 6486.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6982 -0.4872  0.0429  0.5753  3.2464 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 285.7    16.90   
##  Residual             136.2    11.67   
## Number of obs: 800, groups:  subject, 100
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  63.0838     1.8506 126.3866   34.09   <2e-16 ***
## Grade         8.5165     0.1801 699.0000   47.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## Grade -0.341

After we fit our model, we can plot both the average prediction across the entire sample (red line) and predictions for each individual in the data (black lines). As you can see, the starting points vary for each individual line, but the slopes are the same.

# generate predictions
longitudinal_data$int_pred = predict(int_mod) # individual level predictions
longitudinal_data$int_proto = predict(int_mod,re.form=NA) # average sample prediction

# plot our data
longitudinal_data %>% 
  mutate(Grade = as.factor(Grade+1)) %>% 
  ggplot(aes(x = Grade, y = int_pred, group = subject)) +
  geom_line(color="black", alpha=0.4) + # individual predictions in black
  geom_line(aes(x = time, y = int_proto), 
            color="red",linewidth=2) + # average prediction in red
  ylab("Predicted Reading Scores")+
 theme_minimal() +
  theme(
        axis.text=element_text(size=15),
        axis.title=element_text(size=17))

Individual Slopes Model

Here we can update our individual intercept model to try and capture some of the variability in growth trajectories. To do so, we can update the right side of our random effects structure to include Grade. This tells the model to estimate a slope for each individual in our dataset.

# Linear Model

# fit the model
growth_mod = lmer(Reading ~ Grade + (1+Grade|subject), data = longitudinal_data)

When we examine the model summary, we again see the same values as before for our fixed effects. The random effects, however, now include a new row, Grade. The fact that the Variance and Std.Dev. are not 0 suggest that the individuals in our data set not vary in their starting points but also in how they change over time. We can again get a sense of this visually by plotting the model predictions.

# let's look at the summary
summary(growth_mod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reading ~ Grade + (1 + Grade | subject)
##    Data: longitudinal_data
## 
## REML criterion at convergence: 6118
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4883 -0.5409 -0.0072  0.5651  3.3275 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 215.48   14.679        
##           Grade        12.67    3.560   -0.21
##  Residual              60.85    7.801        
## Number of obs: 800, groups:  subject, 100
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  63.0838     1.5519 99.0002   40.65   <2e-16 ***
## Grade         8.5165     0.3758 98.9972   22.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## Grade -0.272

From our model, we can again plot the average prediction across the entire sample (red line) and predictions for each individual in the data (black lines). As you can see, now both the starting points and growth slopes vary for each individual line, which is more in line with what we found in the raw data.

# generate predictions
longitudinal_data$growth_pred = predict(growth_mod)
longitudinal_data$growth_proto = predict(growth_mod,re.form=NA)

# plot our data
growth_plot = longitudinal_data %>% 
  mutate(Grade = as.factor(Grade+1)) %>% 
  ggplot(aes(x = Grade, y = growth_pred, group = subject)) +
  geom_line(color="black", alpha=0.4) +  # individual predictions in black
  geom_line(aes(x = time, y = growth_proto),
            color="red",linewidth=2) + # average prediction in red
  ylab("Predicted Reading Scores")+
  ylim(ggplot_build(raw_data_plot)$layout$panel_scales_y[[1]]$range$range)+
 theme_minimal() +
  theme(
        axis.text=element_text(size=15),
        axis.title=element_text(size=17))

growth_plot
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

And for a closer comparison, we can plot our raw data and growth model predictions side-by-side. What does our model do well and what could it do better?

(raw_data_plot | growth_plot)
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Conclusions

Hooray! We’ve fit a linear growth model and identified individual growth trajectories for our simulated data! If this were a real dataset, some next steps might be to look at other variables (age, classroom, etc…) to see how they might relate to individual differences in reading gains or to explore non-linear growth models to see if we can capture some of the curve-shapes in the raw data (in fact, you could even try this with the simulated data). And of course, you can try to find your own longitudinal dataset and use these methods to fit growth models to explore your own research questions.