Problems

Problem 1

Seventeen subjects were measured with 2 instruments for measuring people’s peak-expiratory-flow-rate (Pefr). The data is in the pefr file. We are just interested in the Wright peak flow meter results on two occasions stored in wp1 and wp2.

Let \(Y_{ij}\) be the response at occasion \(i\) for subject \(j\).

Define a model for the expected peak flow meter rate for the wright method assuming normally distributed errors.

Based on the model:

1.

Estimate the mean (expected value = \(E(Y_{ij})\)) of pefr based on the Wright peak flow meter rate with 95% CI.

Table 1 below shows the expected grand mean of PEFR for all groups.

Table 1: PEFR grand mean estimate along with 95% CI
95% Confidence Interval
Estimate Upper Lower
447.882 387.483 508.282

2.

Estimate the variance \(var(Y_{ij})\) of pefr based on the Wright peak flow meter rate.

Table 2 shows the expected overall variance in measurements of PEFR

Table 2: Variance of PEFR
Variance
811.762

3.

Estimate the covariance \(cov(Y_{ij},Y_{i'j})\) of pefr based on the Wright peak flow meter rate between two occasions.

Table 3 shows the expected covariance in measurements of PEFR on the same individual.

Table 3: Variance and covariance components of model
\(\hat\sigma^2_{id}\)
235.967

4.

Estimate the correlation \(cor(Y_{ij},Y_{i'j})\) of pefr based on the Wright peak flow meter rate between two occasions. This correlation is called the intraclass correlation.

Table 4 shows the calculated intraclass correlation (ICC) between measurements of PEFR on the same individual.

Table 4: Intraclass correlation
\(\hat\sigma^2_{id}\) \(\hat\sigma^2\) ICC
13681.97 235.967 0.983

Problem 2

This is a dataset from inner-London schools.

We have the variables:

  • lrt = London reading test at age 11 (a z score)
  • gcse = graduate certificate of secondary education exam score
  • school = school id
  • schgen = type of school (1 mixed, 2 boys only, 3 girls only)
  • student = student identifyer

Note that students are nested within schools.

First select schools with at least 10 or more students.

The questions are about shool performance. Maybe these notes can help: (http://www.princeton.edu/~otorres/Multilevel101.pdf)

1.

Create a “spaghetti” plot of the data.

Figure 1: Observed trajectories for all schools

Figure 1: Observed trajectories for all schools

2.

Does lrt predict the outcome of the gcse and does it depend on school type?

Table 5 below shows an F test for the interaction between LRT and School type which turned out not to be significant.

Table 5:
Term F p
lrt:schgend 0.342 0.712

The interaction effect was removed from the model and another round of tests was performed with only the marginal effects. Table 6 shows that both main effects should be in the model.

Table 6:
Term F p
LRT 797.214 0.000
School type 5.576 0.006

Table 7 below shows fitted parameters for the mixed linear model.LRT and School type were fitted as fixed effects, and SchoolID was fitted as a random effect. A higher LRT score seems to indicate a higher value of GCSE. It also seems that all-girls schools have the highest GCSE after correcting for LRT

Table 7: Summary table for model parameters
Term Estimate SE t DF p
Intercept -0.959 0.494 -1.940 62.894 0.057
London Reading Test
LRT\(^1\) 5.581 0.198 28.235 57.160 0.000
School Type
School type: (Boys - Mixed) 0.934 0.991 0.942 58.039 0.350
School type: (Girls - Mixed) 2.598 0.778 3.340 54.511 0.002
Random Effects
\(\hat\sigma:\) Residual 7.443
\(\hat\rho:\) Intercept \(\times\) LRT\(^1\) 0.562
\(\hat\sigma:\) Intercept 2.816
\(\hat\sigma:\) LRT\(^1\) 1.190
1 Normalized by subtracting the mean and dividing by the standard deviation

3.

Is there a signfificant variability on how lrt predict gcse between schools?

Table 8 shows the outcome of comparing models with and without a random effect for the variability in lrt between schools. As can be seen the test score is high and \(p < 0.0005\) indicating that there is variability between schools.

Table 8:
Term AIC LRT p
<none> 27942.18
lrt in (lrt | school) 27980.62 42.437 0

4.

Write out your final model using greek letters for parameters associated with fixed effects and small cap roman letters for random coefficients.

\[ \begin{aligned} Y_{GCSE} = -&0.959 \\ +&5.581 \cdot LRT \\ +&0.934 \cdot \mathbb I_{\text{All-Boys}} \\ +&2.598 \cdot \mathbb I_{\text{All-Girls}} \\ +&s_{\text{Intercept}} \\ +&s_{\text{LRT}} \\ +&s_{\text{Residual}} \\s_{\text{Intercept}} = \quad &\mathcal N(0, 2.816^2) \\s_{\text{LRT}} = \quad &\mathcal N(0, a^2) \\a = \quad &1.190 \cdot LRT \\s_{\text{Residual}} = \quad &\mathcal N(0, 7.443^2) \end{aligned} \]

5.

Based on your model: Find the 5 schools with highest and lowest gcse score adjusted to lrt=0 for each shcool type.

Table 9 below shows the bottom five schools for each type and table 10 below that shows the top five.

Table 9: Bottom 5 schools for each type
School Type School Prediction
1 59 -7.275
28 -6.808
23 -5.245
22 -4.351
46 -3.675
2 37 -2.959
44 -2.769
40 -2.071
36 -1.939
27 0.278
3 16 -3.504
25 -2.265
65 -1.450
18 -0.627
8 -0.091
Table 10: Top 5 schools for each type
School Type School Prediction
1 63 4.732
55 4.566
3 4.364
1 3.488
5 2.118
2 52 3.656
11 1.792
24 1.653
64 0.717
57 0.394
3 53 6.575
6 5.200
2 4.808
7 3.577
21 2.549

Problem 3

Find the Loblolly data : help(“Loblolly”). It is growth data of individual trees (the Seed variable).

Instead of using age in years. Prepare a new variable so that you can describe the growth per decade (the time unit here) in your analysis.

1.

Describe the structure of the data.

The data are composed of 6 repeated measurements taken from each of 14 different loblolly pine trees at 5-year increments.

Table 11: View of data structure.
Age (in decades)
Tree Seed 0.3 0.5 1 1.5 2 2.5
329 3.93 9.34 26.08 37.79 48.31 56.43
327 4.12 9.92 26.54 37.82 48.43 56.81
325 4.38 10.48 27.93 40.20 50.06 58.49
307 3.91 9.48 25.66 39.07 50.78 59.07
331 3.46 9.05 25.85 39.15 49.12 59.49
311 3.88 9.40 25.99 39.55 51.46 59.64
315 4.32 10.43 27.16 40.85 51.33 60.07
321 3.77 9.03 25.45 38.98 49.76 60.28
319 4.57 10.57 27.90 41.13 52.43 60.69
301 4.51 10.89 28.72 41.74 52.70 60.92
323 4.33 10.79 28.97 42.44 53.17 61.62
309 4.81 11.20 28.66 41.66 53.31 63.05
303 4.55 10.92 29.07 42.83 53.88 63.39
305 4.79 11.37 30.21 44.40 55.82 64.10

2.

Provide mean height by time and variance by time. Comment on how the mean and variance change with time.

Figure 2 below shows the mean and variance of tree heights over time. The mean height seems to be mostly linear but with a small negative second-order term. Seeing such a slow decrease in growth could stem from trees continuing to grow for many years and the sample is recorded for only 25 years. The variance seems to grow linearly with age with a discontinuity around 20 years. The reduction in variance might stem from a decreased growth-rate. As the trees grow more slowly the difference between them should also grow more slowly.

Figure 2: Mean and variance of height over time.

Figure 2: Mean and variance of height over time.

3.

Provide a “spaghetti plot” of the data.

Figure fig_num + 1 below shows a spaghetti plot of the observed growth trajectories.

4.

Fit the appropriate model using polynomial regression (no higher than third degree).

We start out with a complicated model containing fixed effects up to the third degree and second degree random effects.

Simplifying variance structure

We first simplify the variance structure as needed. Table table_num + 1 below shows that no random effects should be removed.

Table 12: Likelihood ratio tests for random effects
Term AIC LRT p
<none> 241.658
age_cent in (age_cent + I(age_cent^2) | seed) 292.499 56.840 0.000
I(age_cent^2) in (age_cent + I(age_cent^2) | seed) 249.230 13.571 0.004
Simplifying fixed effect structure

We then proceed to simplifying the fixed effect structure. Table 13 shows that the third degree term should be removed.

Table 13: Significance tests of fixed effects
Term F Statistic p
(Intercept) 5431.331 0.000
age_cent 5390.532 0.000
I(age_cent^2) 475.243 0.000
I(age_cent^3) 0.009 0.923

After removing the third-degree term we see on table 14 below that no other fixed effects should be removed.

Table 14: Significance tests for fixed effects
Term F Statistic p
(Intercept) 5449.772 0
age_cent 9689.159 0
I(age_cent^2) 545.932 0
Table 15: Summary table for model parameters
Term Estimate SE t DF p
Intercept\(^a\) 35.438 0.480 73.823 13.007 0
Age
Age (In decades)\(^b\) 26.633 0.271 98.434 13.065 0
Age\(^2\) -4.984 0.213 -23.365 14.575 0
Random Effects
\(\hat\sigma:\) Intercept 1.752
\(\hat\sigma:\) Age 0.962
\(\hat\sigma:\) Age\(^2\) 0.618
\(\hat\rho:\) Intercept \(\times\) Age 0.898
\(\hat\rho:\) Intercept \(\times\) Age\(^2\) -0.877
\(\hat\rho:\) Age \(\times\) Age\(^2\) -0.575
\(\hat\sigma:\) Residual 0.592
a Height at age = 13 years
b Centered by subtracting its mean (13 years) before modelling.

5.

Comment on the shape of the growth curve of the final model and derive the formula for growth.

The estimated growth curve has a positive linear shape with a negative second-order term, implying that the growth slows down over time.

\[ \begin{aligned} x &= Age - 1.3 \\ Y_{\text{Height}} &= 35.438 + x \cdot \beta_\text{Growth}\\ \beta_\text{Growth} &= 26.633 - 4.985 \cdot x + s^2_{\text{Growth}} \\ s^2_{\text{Growth}} &= \mathcal N(0, a^2) \\ a^2 &= \sigma^2_{\text{Seed}} + \sigma^2_{\text{Residual}} \\ & \hspace{3.4em} + x (x \cdot \sigma^2_{x} + 2\sigma_{\text{Seed}\times x}) \\ &\hspace{3.4em}+ x^2 (x^2 \cdot \sigma^2_{x^2} + 2\sigma_{\text{Seed}\times x^2}) \\ &\hspace{3.4em} + 2x^3 \cdot \sigma_{x\times x^2} \\ &= 3.07 + 0.350 \\ & \hspace{3.4em} + x (x \cdot 0.925+ 2 \cdot 1.51) \\ &\hspace{3.4em}+ x^2 (x^2 \cdot 0.382 - 2 \cdot 0.949) \\ &\hspace{3.4em} - x^3 \cdot 2 \cdot 0.342 \end{aligned} \]



6.

Comment on the variance of the growth over time. In other words, what does the model you choose imply about the variance over time. In addition you could compare that to the sample estimates above.

Figure 4 shows the observed variance by age and the fitted variance estimated via the model’s covariance matrix.

Figure 4: Observed and fitted variance by age

Figure 4: Observed and fitted variance by age

7.

Provide the fitted mean growth vs. age and plot it with 95% confidence bands.

Figure 5 below shows the estimated mean growth along with 95% confidence bands.

Figure 5: Estimated mean growth and 95% confidence bands

Figure 5: Estimated mean growth and 95% confidence bands

8.

Estimate the average growth the first decade, based on the final model you choose.

Table 16 shows the estimated growth in the first decade along with its 95% confidence interval.

Table 16: Estimated growth in the first decade
95% Confidence Interval
Estimate SE Lower Upper
35.438 0.48 34.401 36.475

9.

Look at residuals

Figure fig_num + 1 below shows the distribution of model residuals. They seem to be approximately normally distributed with mean zero. Figure fig_num + 2 shows a pairplot for estimated random effects. We can see the effects are approximately randomly distributed.

Figure 6: Distribution of residuals

Figure 6: Distribution of residuals

Figure 7: Pairplot of random effects

Figure 7: Pairplot of random effects

10.

Select 12 trees randomly, and plot observed and fitted growth at the individual level using a facet grid.

Figure 8 shows observed and fitted growth for twelve randomly selected trees.

Figure 8: Fitted growth curves along with observed trajectory values for twelve randomly sampled seeds.

Figure 8: Fitted growth curves along with observed trajectory values for twelve randomly sampled seeds.

Problem 4

Pre post analysis of repeated measures data by

  • ANCOVA Change (with the difference as response and adjustment for baselin)
  • LMM.

Adjust both models by age and sex.

The data is called TotalFlow.RDS.

We have two measures of brain flow. Before a treatment and after. Everyone got the treatment but it was deemed successful (SR group) and not successful (AF group)

1.

The change in TOTALFLOW in each group with 95% CI and if it was statistically significant.

Table 17 shows the estimated contrast along with significance tests and 95% confidence intervals for the change in TOTALFLOW for each group.

Table 17: Estimated pre-post change for each group
95% Confidence Interval
Contrast Group Estimate Lower Upper p
Post - Pre SR Group 58.879 9.105 108.652 0.022
Post - Pre AF Group -21.504 -87.539 44.532 0.508

2.

The estimated difference in the change between groups with 95%, and if the change in the SR group was significantly different from the AF group. Use emmeans.

Table 18 shows the estimated difference between groups in TOTALFLOW change along with significance test and 95% confidence interval.

Table 18: Estimated difference in pre-post change between groups
95% Confidence Interval
Contrast Estimate Lower Upper p
AF - SR -80.382 -158.692 -2.072 0.056

3.

A short description of the methodology for his paper.

In this clinical trial participant levels of brain flow were measured before and after a treatment. Participants were grouped by whether the treatment was deemed successful (SR group) or not (AF group). A linear mixed model (LMM) was used to estimate the treatment effect as well as any difference in effect between the two groups. The LMM was fitted according to the formula

\[ Y_{ij} = \beta_0 + \beta_1 X_i + \beta_2 t_{ij} + \beta_3 t_{ij} X_i + \beta_4S_i + \beta_5A_i + \epsilon_{ij}, \]

where \(Y_{ij}\) is the j’th measurement for subject i, \(X_i\) is the treatment group of subject i, \(t_ij\) indicates whether \(Y_{ij}\) is a pre- or post measurement for subject i, and \(S_i\) and \(A_i\) are the sex and age of subject i. The estimated difference in pre-post change between groups is then modeled by the parameter \(\beta_3\). Age and sex were included in the model to adjust for any potantial confounding.

4.

The spaghetti plot

Figure 9 shows the observed trajectories for participants grouped by whether their treatment was deemed successful or unsuccessful.

Figure 9: Observed trajectories for participants

Figure 9: Observed trajectories for participants

5.

A look at total residuals.

Figure fig_num + 1 below shows the distribution of model residuals. They seem to be approximately normally distributed with mean zero. Figure fig_num + 2 shows the distribution of estimated random intercepts. We can see the intercepts are approximately randomly distributed.

Figure 10: Distribution of residuals

Figure 10: Distribution of residuals

Figure 11: Distribution of random intercepts

Figure 11: Distribution of random intercepts

6.

Compute the contrast in part 2) using glht of multcomp and comment on the difference.

The main difference between using emmeans and glht is that glht requires the use of matrices to calculate your linear hypothesis.

Table 19: Estimated difference in pre-post change between groups using glht
Contrast Estimate SE p
AF - SF -80.382 39.955 0.044