Research question: We measured 500 young adults’ overall physical health at 4 time points (i.e., September each year during their undergraduate studies), at each time point we also collected information about their health consciousness (time varying) and converted it into a single numeric score. Parental health status at first time point and average steps taken per day over their final year of high school (assuming all subjects owns wearables and sync data with their phones religiously) were used as time-invariant predictors. We examined whether health consciousness, parental health status and daily activity level affects the trajectories of university students’ overall physical health.

library(lme4)
library(afex)
library(lattice) #xyplot
library(data.table) #setnames
library(reshape2) #melt
library(ggplot2)
library(texreg)

Data preprocessing

df<-read.csv('Assign3.csv')

#change colnames to fit our scenerio
setnames(df, old = c('Y','A3','X1','X2','Person','Time'), 
         new = c('phy_health','heal_consc','parent_heal','steps','person','time'))

df['person']<-lapply(df['person'],factor)

head(df)

#View(df)

All predictor variables are standardized (centered around 0), and normally distributed. The measurement occasions time were coded as t=0,1,2,3, then the intercept can be interpreted as the expected outcome on the first occasion. Before we start the analysis, we examine the distribution of the outcome variable physical health in the long data file with 500 × 4 = 2000 observations.

Get a sense of col distribution

summary(df)

##    phy_health        heal_consc             time       parent_heal      
##  Min.   :-7.2030   Min.   :-3.990000   Min.   :0.00   Min.   :-2.51800  
##  1st Qu.: 0.2677   1st Qu.:-0.639500   1st Qu.:0.75   1st Qu.:-0.72350  
##  Median : 1.9015   Median :-0.036000   Median :1.50   Median :-0.04950  
##  Mean   : 2.1945   Mean   :-0.006729   Mean   :1.50   Mean   :-0.07299  
##  3rd Qu.: 3.7923   3rd Qu.: 0.667000   3rd Qu.:2.25   3rd Qu.: 0.59175  
##  Max.   :12.6860   Max.   : 3.446000   Max.   :3.00   Max.   : 2.79700  
##                                                                         
##      steps             person    
##  Min.   :-2.0120   1      :   4  
##  1st Qu.:-0.5058   2      :   4  
##  Median : 0.0615   3      :   4  
##  Mean   : 0.1265   4      :   4  
##  3rd Qu.: 0.7798   5      :   4  
##  Max.   : 3.1760   6      :   4  
##                    (Other):1976

Visually inspect col frequency distribution

long <- melt(df,id.vars = 'person')

ggplot(long, aes(value)) + 
      facet_wrap(~variable, scales = 'free_x') +
      geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)))

All numerical predictors seem to normally distributed. Scores seem to be standardized and centered around 0.

Model building and testing

get a sense of time/phy_health relationship through sampling

#sample 8 persons to see phy_health regressed on time intercept and slope variations among subjects

xyplot(phy_health~time | person, 
   data=df[df$person %in% c(4, 14, 23, 32, 41, 56, 65, 82), ], 
   panel=function(x,y)
  {panel.xyplot(x, y)
     panel.lmline(x,y)},
   ylim=c(-3, 10), as.table=T)

We do see some variations between persons in terms of intercept and slope.

Unconditional Model

m_null<-lmer(phy_health~1+(1|person),df,REML=F)

summary(m_null)

## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: phy_health ~ 1 + (1 | person)
##    Data: df
## 
##      AIC      BIC   logLik deviance df.resid 
##   8943.8   8960.6  -4468.9   8937.8     1997 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7111 -0.5862 -0.0811  0.5280  3.3443 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  person   (Intercept) 4.063    2.016   
##  Residual             3.268    1.808   
## Number of obs: 2000, groups:  person, 500
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)   2.19455    0.09879 499.99998   22.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ICC

4.063/(4.063+3.268)

## [1] 0.5542218

The null-model contains only an intercept term and variances at the occasion and the subject level. The intercept of 4.063 in this model is simply the average physical health across all individuals and occasions. The intercept-only model estimates the repeated measures variance as 4.06 and the subject-level variance as 3.27, this renders an intra-class correlation of .55. About half of the variance of the 4 physical health measures is variance between individuals, and about half is variance within individuals across time.

Unconditional growth model

m<-lmer(phy_health~1+(time|person),df,REML=F)

summary(m)

## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: phy_health ~ 1 + (time | person)
##    Data: df
## 
##      AIC      BIC   logLik deviance df.resid 
##   7375.9   7403.9  -3682.9   7365.9     1995 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3181 -0.5072  0.0143  0.4922  3.0978 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  person   (Intercept) 2.3533   1.5340       
##           time        1.5798   1.2569   0.65
##  Residual             0.6349   0.7968       
## Number of obs: 2000, groups:  person, 500
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)
## (Intercept)  -0.03019    0.06534 499.99750  -0.462    0.644

Conditional growth model

Add level-1 predictor

#model with random intercept and random slope
m1<-lmer(phy_health~time+heal_consc+
          (time|person),df,REML=F)

#model without random slope
m2<-lmer(phy_health~time+heal_consc+
          (1|person),df,REML=F)

#model without random intercept
m3<-lmer(phy_health~time+heal_consc+
          (0+time|person),df,REML=F)

anova(m1, m2, m3)

Model with random intercept and random slope is the best model

Add level-2 predictors

m4<-lmer(phy_health~time+heal_consc+parent_heal+steps+
          (time|person),df,REML=F)

anova(m3, m4)

summary(m4)

## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: phy_health ~ time + heal_consc + parent_heal + steps + (time |  
##     person)
##    Data: df
## 
##      AIC      BIC   logLik deviance df.resid 
##   6505.2   6555.6  -3243.6   6487.2     1991 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.13650 -0.50443 -0.00628  0.47587  3.05359 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  person   (Intercept) 1.1169   1.0568        
##           time        0.5010   0.7078   -0.07
##  Residual             0.5415   0.7359        
## Number of obs: 2000, groups:  person, 500
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept) 5.559e-01  5.531e-02 4.595e+02   10.05   <2e-16 ***
## time        1.051e+00  3.491e-02 4.995e+02   30.11   <2e-16 ***
## heal_consc  3.002e-01  2.210e-02 1.327e+03   13.59   <2e-16 ***
## parent_heal 6.544e-01  5.366e-02 5.000e+02   12.20   <2e-16 ***
## steps       8.854e-01  5.493e-02 5.002e+02   16.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) time   hl_cns prnt_h
## time        -0.225                     
## heal_consc  -0.001  0.007              
## parent_heal  0.080  0.000  0.002       
## steps       -0.131  0.000 -0.017 -0.072

Add interaction between time and level-2 predictor

Parent_heal

m5<-lmer(phy_health~time+heal_consc+parent_heal+steps+
           time*parent_heal+
          (time|person),df,REML=F)

anova(m4, m5)

Average daily Steps

m6<-lmer(phy_health~time+heal_consc+parent_heal+steps+
           time*parent_heal+time*steps+
          (time|person),df,REML=F)

anova(m5, m6)

summary(m6)

## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: phy_health ~ time + heal_consc + parent_heal + steps + time *  
##     parent_heal + time * steps + (time | person)
##    Data: df
## 
##      AIC      BIC   logLik deviance df.resid 
##   6173.7   6235.3  -3075.9   6151.7     1989 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.12316 -0.52340  0.00693  0.48509  3.09613 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  person   (Intercept) 1.0786   1.0386       
##           time        0.2031   0.4506   0.11
##  Residual             0.5416   0.7359       
## Number of obs: 2000, groups:  person, 500
## 
## Fixed effects:
##                   Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)      5.704e-01  5.463e-02 4.999e+02   10.44   <2e-16 ***
## time             1.010e+00  2.525e-02 4.992e+02   40.01   <2e-16 ***
## heal_consc       2.968e-01  2.136e-02 1.461e+03   13.89   <2e-16 ***
## parent_heal      5.605e-01  5.440e-02 4.999e+02   10.30   <2e-16 ***
## steps            7.166e-01  5.569e-02 5.002e+02   12.87   <2e-16 ***
## time:parent_heal 2.630e-01  2.514e-02 4.992e+02   10.46   <2e-16 ***
## time:steps       4.734e-01  2.574e-02 4.994e+02   18.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) time   hl_cns prnt_h steps  tm:pr_
## time        -0.165                                   
## heal_consc  -0.001  0.007                            
## parent_heal  0.082 -0.013  0.002                     
## steps       -0.134  0.022 -0.019 -0.072              
## tim:prnt_hl -0.014  0.082 -0.001 -0.165  0.012       
## time:steps   0.022 -0.134  0.017  0.012 -0.165 -0.072

A model-building approach is used to arrive at the final model. First, the time variable is added as a linear predictor with the same coefficient for all subjects; next, a series of models are tested with either or both of random intercept and slope of the measurement occasion variable, and both are included in the model (unconditional growth model); then the time-varying covariate health consciousness is added; after which, the subject-level (time-invariant) predictors parental health status and average steps taken per day were added; finally, the interaction of the time variable with the subject-level predictors parental health status and average steps taken per day are added to the model. All effects and interactions are significant at p < .001, likelihood ratio tests were done in each model-building step through comparing models with and without certain effect. See detailed model testing and comparison above.

In the final model, the regression coefficient for time 1.01 and significant, it means that with each scale points higher on time, physical health is expected to increase by 1.01 scale points; the regression coefficient for health consciousness .3 and significant, it means that with each scale points higher on health consciousness, physical health is expected to increase by .3 scale points. In our multilevel model, the regression coefficient for time varies across individuals, the varying regression slopes for time are assumed to follow a normal distribution. The variance of this distribution is estimated as .2. The regression coefficient for parental health is .56, it means that for each unit increase of parental health score, the physical health of the person goes up by .56. The variance of this distribution is estimated as .2. The regression coefficient for average daily steps is .47, it means that for each unit increase of steps, the physical health of the person goes up by .47. The cross-level interaction between time and parental health shows how parental health status moderate the effect of time on individual physical health. The regression coefficient for this cross-level interaction is .26, it can be interpreted that parental health significantly affects the correlation between time and individual physical health. The cross-level interaction between time and steps shows how steps moderate the effect of time on individual physical health. The regression coefficient for this cross-level interaction is .47, it can be interpreted that steps significantly affects the correlation between time and individual physical health. The covariance between the initial status and the growth rate in the final model is positive (.05); students who start with a relatively high value of their physical health score increase their physical health faster than the other students, it seems that physical health is self-perpetuating. Note that before the interaction terms were added, this correlation is slightly negative (-. 07), which means the subject-level predictors really influenced the correlation between intercept and slope of time. The deviation went down to 6151.7 from 8937.8 in the null model. The estimated of τ00 is interpreted as the variance of mean for each individual around the overall mean physical health score, in our case the subject level variance of mean is 1.08. The estimate of residual is the variance of each measurement of physical health score around the individual’s mean score, in our case the observation level variance is .54.

#write results to table
htmlreg(list(m_null, m6), 
          file='texreg.doc',
          single.row = T, 
          stars = numeric(0),
          caption = "",
          custom.note = "null model = intercept only, 
                         final model =  
                        
        ")

## The table was written to the file 'texreg.doc'.

Model summary table with parameter estimation and significance level

Multilevel Modeling of Longitudinal data

Predictors of University students’ physical Health