library(tidyverse) # data management
library(haven) # read in data files
library(sjPlot) # regression tables
library(vtable) # summary tables
## Warning: package 'vtable' was built under R version 4.1.3
library(lme4) # estimate MLM
library(lmerTest) # hypothesis testing for MLM
This is an example data set presented in Willett (1988) and Singer and Willett (2003).
It is a fictitious example in which a random sample of 35 participants completed an “opposites-naming” task on 4 occasions. At each occasion the participant is asked to name the opposite of a presented word, and they complete as many such words as they can in a fixed time. They repeat the task on 4 separate occasions, each 1 week apart. We expect that their scores (number of opposites named) will increase across the 4 weeks as they gain practice with the task. There is also 1 person-level covariate measured at the initial time point, which is the score on a general cognitive ability test.
oppos <- read_dta(file = "opposites.dta")
Produce a table of descriptive statistics for the level-1 variables time, wave, and opp. How many total observations are there in the data? What is the difference between the “time” and “wave” variables?
There are a total of 35 participants with 4 observations across time each. This totals to 140 observations in this data set. Time and wave are on the same scale, but time is centered wave to make the zero mark more meaningful.
sumtable(oppos, vars = c("time", "wave", "opp"), digits = 2)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| time | 140 | 1.5 | 1.12 | 0 | 0.75 | 2.25 | 3 |
| wave | 140 | 2.5 | 1.12 | 1 | 1.75 | 3.25 | 4 |
| opp | 140 | 204.81 | 46.62 | 76 | 171.5 | 231.25 | 310 |
Produce a table of descriptive statistics for the level-2 variables “cog” and “ccog”. How many level-2 observations are there? What are these level-2 units? What is the difference between “cog” and “ccog”?
There are 140 level-2 observations for ccog and cog, 4 for each participant. Because ccog is grand mean centered, the mean is zero (SD = 12.4). Cog is, however, not mean centered and has a mean of 113.5 (SD = 12.4, Range: [81, 137]).
level2data <- oppos %>%
group_by(id) %>%
summarise(
cog = mean(cog),
ccog = mean(ccog)
)
sumtable(oppos, vars = c("cog", "ccog"), digits = 1)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| cog | 140 | 113.5 | 12.4 | 81 | 104 | 123 | 137 |
| ccog | 140 | 0 | 12.4 | -32.5 | -9.5 | 9.5 | 23.5 |
Create one or more scatter plots to visualize the trajectories of opposites scores over time. Do you think summarizing these trajectories with a linear model is appropriate? Why or why not?
I think summarizing these trajectories with a linear model is appropriate because there doesn’t seem to be any obvious curvilinear change at you increase in time.
oppos %>%
ggplot(aes(x = time, y = opp)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~id) +
xlab("Time (in weeks)") +
ylab("Opposite Naming Score") +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
oppos %>%
ggplot(aes(x = time, y = opp)) +
geom_line(aes(group = id), color = "grey", alpha = 0.75) +
geom_smooth(method = "lm") +
xlab("Time (in weeks)") +
ylab("Opposite Naming Score") +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
Estimate the unconditional means model (i.e. random intercept model) with “opp” as the outcome variable. What is the ICC? What does this tell you in this context?
ICC = 0.27
The ICC describes how much of the variance in the outcome, opp, is between vs. within subjects. In this case about 27% of variance in opp is due to between-time differences, and the remaining 73% is due to between-subject differences (within-time).
m4 <- lmer(opp ~ 1 + (1 | id), data = oppos, REML = F)
tab_model(m4)
| opposites naming score | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 204.81 | 194.49 – 215.13 | <0.001 |
| Random Effects | |||
| σ2 | 1583.72 | ||
| τ00 id | 574.31 | ||
| ICC | 0.27 | ||
| N id | 35 | ||
| Observations | 140 | ||
| Marginal R2 / Conditional R2 | 0.000 / 0.266 | ||
Write out the “level” equations for an unconditional linear growth model that allows initial scores and trends to vary across people. (No R code needed. A chunk is provided in case you want to import an image of your equations. See below.)
| Level | Model 2 |
|---|---|
| Level 1 | \(opp_{ij} = \beta_{0j} + \beta_{1j}(time_{ij}) + e_{ij}\) |
| Level 2 | \(\beta_{0j} = \gamma_{00} + u_{0j}\) |
| Level 2 | \(\beta_{1j} = \gamma_{10} + \gamma_{11}(id_j) + u_{1j}\) |
Estimate the model in Question 5 using full maximum
likelihood (FML; in lmer this is
REML = FALSE). Create a table of the model parameter
estimates. Interpret the two fixed effects parameters, each of the three
variance component estimates, and the covariance estimate.
m5 <- lmer(opp ~ time + (time | id), data = oppos, REML = F)
tab_model(m5)
| opposites naming score | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 164.37 | 152.55 – 176.19 | <0.001 |
|
time of observation (in weeks) |
26.96 | 22.77 – 31.15 | <0.001 |
| Random Effects | |||
| σ2 | 159.48 | ||
| τ00 id | 1161.31 | ||
| τ11 id.time | 127.71 | ||
| ρ01 id | -0.45 | ||
| ICC | 0.87 | ||
| N id | 35 | ||
| Observations | 140 | ||
| Marginal R2 / Conditional R2 | 0.423 / 0.926 | ||
fixed effects components
\(\beta_{0j}\): Averaging across time points, the predicted opposing name score for subjects is 164.37, CI: [152.55, 176.19], p < .001.
\(\beta_{1j}\): As you move up one time point there is a predicted increase in opposing name score for subjects by 26.97 points, CI: [22.77, 31.15], p < .001.
variance component estimates
\(\sigma^2\) = 159.48 = the variance of residuals in observed opposing name score around individual regression lines (i.e., the scores predicted by the model)
\(\tau_{00.id}\) = 1161.31 = the variance of opposite scores at the first time point across students
\(\tau_{11 id.time}\) = 127.71 = the variance of the time slopes across individuals
correlation
\(\rho_{01 id}\) = -0.45 = correlation between the individual’s opposite task score at time = 0 and their time slope
Create plots to assess the normality of the level-1 and level-2 residuals. Does assuming normality seem reasonable for these data?
Generally, yes we can assume normality for the level 1 and 2 residuals. However, I do worry a little about the level 2 participant intercept residuals as they peak seems tobe slightly above zero and there is a long left leaning tail.
hist(resid(m5)) #level 1 residuals
hist(ranef(m5)$id[,1]) #Level2 intercept residuals
hist(ranef(m5)$id[,2]) #Level2 slope residuals
head(ranef(m5)$id)
## (Intercept) time
## 1 32.506086 5.957349
## 2 51.466517 1.448539
## 3 -7.447833 16.817182
## 4 38.862161 -3.912873
## 5 29.850187 -13.604288
## 6 1.787951 2.131319
Now, consider the following three hypotheses:
Researcher A claims that people will have different initial performance on the opposites task, but they will all learn at about the same rate (i.e., have the same practice effects), and neither initial performance nor learning rate are associated with general cognitive ability.
Researcher B claims that people with greater general cognitive ability will start with higher initial performance on the opposites task but people’s rate of learning (practice effects) will be varied and unpredictable.
Researcher C claims that people with greater general cognitive ability will start with higher initial performance on the opposites task and these people will also learn at a faster rate.
Write out the level-1 and level-2 equations for a conditional growth model that you can use to decide which researcher you agree with. (No R code needed. A chunk is provided in case you want to import an image of your equations. See below.)
| Level | Model 2 |
|---|---|
| Level 1 | \(opp_{ti} = \pi_{0i} + \pi_{1i}(time_{ti}) + e_{ti}\) |
| Level 2 | \(\pi_{0i} = \beta_{00} + \beta_{01}(ccog_i) + u_{0i}\) |
| Level | \(\pi_{1i} = \beta_{10} + \beta_{11}(ccog_i) + u_{1i}\) |
Estimate the model from Question 8 and produce a table of parameter estimates.
m9 <- lmer(opp ~ time * ccog + (time | id), data = oppos,
control = lmerControl(optimizer = "Nelder_Mead"), REML = F)
tab_model(m9)
| opposites naming score | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 164.37 | 152.56 – 176.19 | <0.001 |
|
time of observation (in weeks) |
26.96 | 23.17 – 30.75 | <0.001 |
|
grand mean centered cognitive skill score |
-0.11 | -1.07 – 0.85 | 0.817 |
| time:ccog | 0.43 | 0.12 – 0.74 | 0.006 |
| Random Effects | |||
| σ2 | 159.48 | ||
| τ00 id | 1159.38 | ||
| τ11 id.time | 99.30 | ||
| ρ01 id | -0.49 | ||
| ICC | 0.86 | ||
| N id | 35 | ||
| Observations | 140 | ||
| Marginal R2 / Conditional R2 | 0.459 / 0.926 | ||
sqrt(159.48) #12.63
## [1] 12.62854
sqrt(1159.38) #34.05
## [1] 34.04967
sqrt(99.30) #9.96
## [1] 9.964939
Which researcher do you agree with the most? Explain why based on the results of Question 9. Write a short paragraph that clearly explains your answer.
\(\beta_{0j}\): For average cognitive score and at time = 0, the average opposites task score across subjects is predicted to be 164.37, 95% CI: [152.56, 176.19], p < .001.
\(\beta_{1j}\): For average cognitive score, the predicted change in relationship between time and opposites task score is 26.96 increase in score for each unit increase in time, 95% CI: [23.17, 30.75], p < .001.
\(\beta_{2j}\): When time = 0, there is a predicted change in the relationship between cognitive score and opposites task score such that for each unit increase in cognitive skill score there is a predicted .11 unit decrease in opposites task score, 95% CI: [-1.07, 0.85], p = 0.817.
\(\beta_{3j}\): For each unit increase in time there is a .43 unit increase in the strength of the relationship between cognitive score and opposite task score, B = 0.43, 95% CI: [0.12, 0.74], p = .006
\(\sigma^2\) = sqrt(159.48) = 12.63 = the SD of residuals in observed opposing name score around individual regression lines (i.e., the scores predicted by the model)
\(\tau_{00.id}\) = sqrt(1159.38) = 34.05 = the SD of opposite scores at the first time point across students
\(\tau_{11 id.time}\) = sqrt(99.30) = 9.96 = the SD of the time slopes across individuals
\(\rho_{01 id}\) = -0.49 = correlation between the individual’s opposite task score at time = 0 and their time slope