We will tabulate and plot the prevalence of respiratory disease among
the 537 children by the child’s age and the mother’s smoking status. To
explore the correlation structures, we can inspect the residuals for any
patterns.
| Respiratory Disease Prevalence by Smoking Status and Age | ||
| Age (years) | Prevalence | n |
|---|---|---|
| Non-smoker | ||
| 6 | 0.160 | 350 |
| 7 | 0.149 | 350 |
| 8 | 0.143 | 350 |
| 9 | 0.106 | 350 |
| Smoker | ||
| 6 | 0.166 | 187 |
| 7 | 0.209 | 187 |
| 8 | 0.187 | 187 |
| 9 | 0.139 | 187 |
## residuals_1 residuals_2 residuals_3 residuals_4
## residuals_1 1.0000 0.3537 0.3078 0.3266
## residuals_2 0.3537 1.0000 0.4433 0.3289
## residuals_3 0.3078 0.4433 1.0000 0.3809
## residuals_4 0.3266 0.3289 0.3809 1.0000
Using the correlation coefficient as a measure of dependence between
binary outcome values is challenging when the outcome can only take on a
value of 0 or 1 despite correlation assuming a linear relationship. It
can then be more productive using odds ratios to capture the
relationship between binary variables. When working with binary
outcomes, the correlation coefficient does not necessarily freely range
from [-1,1]; it is restricted based on the probabilities of the binary
outcomes.
Below, we will estimate the risk of pediatric respiratory disease by
the child’s age and the mother’s smoking status with 3 models: marginal,
conditional, and transitional.
Marginal: \(E[Y_{ij}|X_{ij}]\)
Conditional: \(E[Y_{ij}|b_i,
X_{ij}]\)
Transition: \(E[Y_{ij}|Y_{ik}, k<j, X_{ij}]\)
We will be determining the impact of relationships/covariates at a
significance level of \(\alpha=.05\).
We included an interaction term between age and smoking status for the
marginal model, but we do not anticipate it being significant as, by the
plot, they do not seem to significantly interact.
Model: \(Y_{ij} = \beta_0
+ \beta_1t_{ij} + \epsilon_{ij}\)
This model assumes Gaussian error (\(\epsilon_{ij} \sim N(0,
\sigma_\epsilon^2)\)) and \(E[\epsilon_{ij}\epsilon_{ik}]=\rho\) when
j≠k. The expected outcome \(\mu_{ij} = \beta_0
+ \beta_1t_{ij}\).
##
## Call:
## geeglm(formula = resp ~ agec * smk, family = binomial(link = "logit"),
## data = resp, id = id, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.90084 0.11908 254.823 <2e-16 ***
## agec -0.14125 0.05821 5.888 0.0152 *
## smk 0.31395 0.18784 2.794 0.0946 .
## agec:smk 0.07084 0.08829 0.644 0.4223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.9996 0.1137
## Number of clusters: 537 Maximum cluster size: 4
##
## Call:
## geeglm(formula = resp ~ agec * smk, family = binomial(link = "logit"),
## data = resp, id = id, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.9005 0.1191 254.69 <2e-16 ***
## agec -0.1412 0.0582 5.89 0.015 *
## smk 0.3138 0.1878 2.79 0.095 .
## agec:smk 0.0708 0.0883 0.64 0.422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.999 0.114
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.355 0.063
## Number of clusters: 537 Maximum cluster size: 4
##
## Call:
## geeglm(formula = resp ~ agec * smk, family = binomial(link = "logit"),
## data = resp, id = id, corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.9248 0.1207 254.31 <2e-16 ***
## agec -0.1478 0.0598 6.10 0.014 *
## smk 0.2888 0.1914 2.28 0.131
## agec:smk 0.0835 0.0917 0.83 0.362
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = ar1
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1.02 0.125
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.491 0.068
## Number of clusters: 537 Maximum cluster size: 4
| Coefficient Estimates for Marginal Models | ||||
| Model | Model Estimates | |||
|---|---|---|---|---|
| β Estimate | Std. Error | p-value | ||
| Intercept | Independence | −1.9008 | 0.1191 | 0.0000 |
| Intercept | Exchangeable | −1.9005 | 0.1191 | 0.0000 |
| Intercept | AR1 | −1.9248 | 0.1207 | 0.0000 |
| agec | Independence | −0.1413 | 0.0582 | 0.0152 |
| agec | Exchangeable | −0.1412 | 0.0582 | 0.0152 |
| agec | AR1 | −0.1478 | 0.0598 | 0.0135 |
| smk | Independence | 0.3140 | 0.1878 | 0.0946 |
| smk | Exchangeable | 0.3138 | 0.1878 | 0.0948 |
| smk | AR1 | 0.2888 | 0.1914 | 0.1313 |
| agec:smk | Independence | 0.0708 | 0.0883 | 0.4223 |
| agec:smk | Exchangeable | 0.0708 | 0.0883 | 0.4223 |
| agec:smk | AR1 | 0.0835 | 0.0917 | 0.3621 |
| Marginal Model QIC | ||||||
| Model | QIC | QICu | Quasi Lik | CIC | params | QICC |
|---|---|---|---|---|---|---|
| Independence | 1830 | 1827 | -910 | 5.44 | 4 | 1830 |
| Exchangeable | 1830 | 1827 | -910 | 5.44 | 4 | 1830 |
| AR1 | 1831 | 1828 | -910 | 5.68 | 4 | 1831 |
When constructing three marginal models with independence,
exchangeable, and first-order autoregressive correlation structures, we
observe that age has a significant predictive effect on risk of
respiratory infection (p<.05). From the plot, we observe that the
prevalence of respiratory disease is consistently higher among mothers
who smoke compared to mothers who do not smoke across all ages. The
independence and exchangeable marginal models assert that with every
unit increase in age, the expected risk of pediatric respiratory disease
decreases by 0.14, on average. The AR1 model has this risk decreasing by
0.15 with each unit increase in age.
Additionally, we can observe that the independence and exchangeable
correlation structures are systematically more similar in their \(\hat{\beta}\) and quasi-likelihood under
the independence model criterion (QIC). Standard error for parameter
estimates in the autoregressive model was also systematically higher
compared to the other two models.
In the autoregressive model, the correlation depends only on the
distance between the observations; the correlation parameters are not
different at different points in time. The error at time t only depends
on the previous error term. Consequently, the working correlation decays
exponentially as the distance between observations increases. When
inspecting the residuals’ correlation structure in this respiratory
disease dataset, we are not seeing this pattern. For instance, the
correlation between the first and fourth residuals is higher than the
correlation between the first and third residuals (i.e., \(\rho_{14} > \rho_{13}\)). Ultimately, we
are seeing more randomness to the residuals’ correlations as their
distance increases. This would corroborate our conclusion that the AR1
model is not appropriate in this situation. It appears that the
independent correlation structure may best apply.
Model: \(Y_{ij} =
(\beta_0* + b_i) + \beta_1^*x_{ij} + \epsilon_{ij}\)
The conditional model is subject-specific, where \(b_i\) accounts for individual heterogeneity
while modeling the probability of respiratory infection. We are assuming
that within-subject correlatoin is captured by random intercepts that
follow a normal distribution (\(bi \sim N(0,
\sigma_b^2)\)). In other words, everyone has their own baseline
risk of infection that is expected to be 0 with some variance \(\sigma_b^2\). There is also some model
error we account for with \(\epsilon_{ij}\) that also independently
follows a normal distribution with an expectation of 0 and variance
\(\sigma_\epsilon^2\). The expected
outcome is \(\mu_{ij}^* = \beta_0^* +
\beta_1^*t_{ij}\) with variance \(\sigma_\epsilon^2 + \sigma_b^2\).
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: resp ~ agec + smk + (1 | id)
## Data: resp
##
## AIC BIC logLik deviance df.resid
## 1598 1621 -795 1590 2144
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.403 -0.180 -0.158 -0.132 2.518
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 5.49 2.34
## Number of obs: 2148, groups: id, 537
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.374 0.275 -12.27 <2e-16 ***
## agec -0.177 0.068 -2.60 0.0093 **
## smk 0.415 0.287 1.45 0.1484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) agec
## agec 0.227
## smk -0.419 -0.010
In this random-intercepts conditional model, we find that the estimated coefficient for maternal smoking status is 0.415 (p=.148). Exponentiating this value, we see that the relative risk of respiratory infection for an individuals is 1.51 times higher with a smoking mother compared to a non-smoking mother (95% CI: [0.863, 2.66]). At a significance level of \(\alpha=0.05\), this is insufficient evidence for a difference in the relative risk of respiratory infection by maternal smoking status.
Model: \(Y_{ij} =
\beta_0^{**} + \beta_1^{**}t_{ij} + \alpha(Y_{i,j-1} - \beta_0^{**} -
\beta_1^{**}t_{i,j-1}) + u_{ij}\), where \(x_{ij}\) are the age and smoking status
covariates, \(Y_{i,j-1}\) is the lagged
infection status (i.e., whether the individual was infected at
the previous time point), \(\alpha\)
measures the predictive strength of the previous infection on the
current infection, and \(u_{ij}\) is
the error term.
Or equivalently, \(logit(\pi_{ij}^C)=x_{ij}'\beta^{**} + \alpha
Y_{i,j-1}\), where \(\pi_{ij}^C=E[Y_{ij}|H_{ij},x_{ij}]\).
If \(\alpha>0\), then the past
infection increases the risk of current infection. If \(\alpha=0\), then the infection status is
independent across time points, suggesting a marginal model may be more
appropriate. And finally, if \(\alpha<0\), then past infection reduces
the risk of current infection.
Here, we are assuming \(E[u_{ij}]=0\), \(v(u_{ij})=\sigma_u^2\), and \(E[u_{ij}u_{ik}]=0\).
The linear transition model generally produces similar estimates to
the marginal and conditional models. This method treats past responses
\(H_{ij} = \{Y_{i1},...,Y_{i,j-1}\}\)
as additional explanatory variables. We can let centered age and smoking
status, agec and smk, be the predictors \(x_{ij}'\); it follows that \(Y_{ij}\) takes on a value of 0 with no
infection and 1 when child i experiences respiratory infection at time
j.
##
## Call:
## glm(formula = resp ~ agec + smk + lag_Y, family = binomial, data = resp_trans)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4778 0.1158 -21.40 <2e-16 ***
## agec -0.2428 0.0947 -2.57 0.010 *
## smk 0.2960 0.1563 1.89 0.058 .
## lag_Y 2.2111 0.1582 13.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1352.7 on 1610 degrees of freedom
## Residual deviance: 1148.6 on 1607 degrees of freedom
## AIC: 1157
##
## Number of Fisher Scoring iterations: 5
In the transitional model, we interpret the estimated coefficient for
mother’s smoking status as the log hazard ratio of the
incidence of pediatric respiratory disease among
mothers who smoke compared to the incidence of pediatric respiratory
disease among mothers who do not smoke. In this model, we can
exponentiate the estimated coefficient for the mother’s smoking status,
0.296, to find that the hazard of the incidence of pediatric respiratory
disease is 1.34 times greater (95% CI: [0.99, 1.83]) among mothers who
smoke compared to mothers who do not smoke, controlling for the child’s
age (p=.058). At a significance level of \(\alpha=0.05\), we would just barely fail to
find sufficient evidence for a difference in the incidence of pediatric
respiratory infection on the basis of maternal smoking status, all other
factors held constant. Meanwhile, age is significantly predictive of the
incidence of pediatric respiratory disease (p<.05). With every unit
increase in age, the hazard of incidence of respiratory infection is
expected to decrease by 0.784 times on average (95% CI: [0.652, 0.944]).
As the 95% confidence interval for this hazard ratio does not contain 1,
we find compelling evidence of the significance of age in predicting the
incidence of pediatric respiratory disease with the transitional
model.
The coefficient for lag_Y, \(\alpha\), is positive, indicating that past
infection status is predictive of current infection (p<.05).
The marginal model averages effects across the population. Meanwhile, the transitional model conditions on past outcomes. And finally, the conditional model yield sestimates on the individual by implementing random-effects or mixed effects.
| Model Estimates for Maternal Smoking Status | ||||
| Model | Estimate | Std.err | CI | Pr(>|W|) |
|---|---|---|---|---|
| Marginal | 0.314 | 0.188 | [-.05, 0.68] | 0.0946 |
| Conditional | 0.415 | 0.287 | [-.011, 1.06] | 0.1484 |
| Transitional | 0.296 | 0.156 | [-.01, .60] | 0.0583 |
To reiterate previous interpretations of the regression coefficient
for mother’s smoking status, the marginal model provides information on
prevalence, a population-based measure. The conditional model will
provide information on the probability of infection for the same
individual with and without a mother who smokes. The transitional model
provides information on the incidence of respiratory infection based on
past outcomes.
The ratio of prevalence of pediatric respiratory disease in smoking mothers to non-smoking mothers is exp(0.314). The relative risk for an individual child developing this respiratory disease given their mother smokes versus does not smoke is exp(0.478). And finally, the ratio of incidence rate of pediatric respiratory disease in smoking mothers to non-smoking nmothers is exp(0.296).