BIOSTAT718 HW 5

Pediatric Respiratory Infections

a. Exploring Age and Smoking Status among Mothers

We will tabulate and plot the prevalence of respiratory disease among the 537 children by the child’s age and the mother’s smoking status. To explore the correlation structures, we can inspect the residuals for any patterns.

Age (years)	Prevalence	n
Respiratory Disease Prevalence by Smoking Status and Age
Non-smoker
6	0.160	350
7	0.149	350
8	0.143	350
9	0.106	350
Smoker
6	0.166	187
7	0.209	187
8	0.187	187
9	0.139	187

##             residuals_1 residuals_2 residuals_3 residuals_4
## residuals_1      1.0000      0.3537      0.3078      0.3266
## residuals_2      0.3537      1.0000      0.4433      0.3289
## residuals_3      0.3078      0.4433      1.0000      0.3809
## residuals_4      0.3266      0.3289      0.3809      1.0000

Using the correlation coefficient as a measure of dependence between binary outcome values is challenging when the outcome can only take on a value of 0 or 1 despite correlation assuming a linear relationship. It can then be more productive using odds ratios to capture the relationship between binary variables. When working with binary outcomes, the correlation coefficient does not necessarily freely range from [-1,1]; it is restricted based on the probabilities of the binary outcomes.

Below, we will estimate the risk of pediatric respiratory disease by the child’s age and the mother’s smoking status with 3 models: marginal, conditional, and transitional.

Marginal: \(E[Y_{ij}|X_{ij}]\)
Conditional: \(E[Y_{ij}|b_i, X_{ij}]\)
Transition: \(E[Y_{ij}|Y_{ik}, k<j, X_{ij}]\)

We will be determining the impact of relationships/covariates at a significance level of \(\alpha=.05\). We included an interaction term between age and smoking status for the marginal model, but we do not anticipate it being significant as, by the plot, they do not seem to significantly interact.

b. Marginal Regression Model for Risk of Respiratory Disease by Age

Model: \(Y_{ij} = \beta_0 + \beta_1t_{ij} + \epsilon_{ij}\)

This model assumes Gaussian error (\(\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)\)) and \(E[\epsilon_{ij}\epsilon_{ik}]=\rho\) when j≠k. The expected outcome \(\mu_{ij} = \beta_0 + \beta_1t_{ij}\).

## 
## Call:
## geeglm(formula = resp ~ agec * smk, family = binomial(link = "logit"), 
##     data = resp, id = id, corstr = "independence")
## 
##  Coefficients:
##             Estimate  Std.err    Wald Pr(>|W|)    
## (Intercept) -1.90084  0.11908 254.823   <2e-16 ***
## agec        -0.14125  0.05821   5.888   0.0152 *  
## smk          0.31395  0.18784   2.794   0.0946 .  
## agec:smk     0.07084  0.08829   0.644   0.4223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)   0.9996  0.1137
## Number of clusters:   537  Maximum cluster size: 4

## 
## Call:
## geeglm(formula = resp ~ agec * smk, family = binomial(link = "logit"), 
##     data = resp, id = id, corstr = "exchangeable")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -1.9005  0.1191 254.69   <2e-16 ***
## agec         -0.1412  0.0582   5.89    0.015 *  
## smk           0.3138  0.1878   2.79    0.095 .  
## agec:smk      0.0708  0.0883   0.64    0.422    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    0.999   0.114
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.355   0.063
## Number of clusters:   537  Maximum cluster size: 4

## 
## Call:
## geeglm(formula = resp ~ agec * smk, family = binomial(link = "logit"), 
##     data = resp, id = id, corstr = "ar1")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -1.9248  0.1207 254.31   <2e-16 ***
## agec         -0.1478  0.0598   6.10    0.014 *  
## smk           0.2888  0.1914   2.28    0.131    
## agec:smk      0.0835  0.0917   0.83    0.362    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = ar1 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     1.02   0.125
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.491   0.068
## Number of clusters:   537  Maximum cluster size: 4

	Model	Model Estimates
Coefficient Estimates for Marginal Models
	Model	β Estimate	Std. Error	p-value
Intercept	Independence	−1.9008	0.1191	0.0000
Intercept	Exchangeable	−1.9005	0.1191	0.0000
Intercept	AR1	−1.9248	0.1207	0.0000
agec	Independence	−0.1413	0.0582	0.0152
agec	Exchangeable	−0.1412	0.0582	0.0152
agec	AR1	−0.1478	0.0598	0.0135
smk	Independence	0.3140	0.1878	0.0946
smk	Exchangeable	0.3138	0.1878	0.0948
smk	AR1	0.2888	0.1914	0.1313
agec:smk	Independence	0.0708	0.0883	0.4223
agec:smk	Exchangeable	0.0708	0.0883	0.4223
agec:smk	AR1	0.0835	0.0917	0.3621

Model	QIC	QICu	Quasi Lik	CIC	params	QICC
Marginal Model QIC
Independence	1830	1827	-910	5.44	4	1830
Exchangeable	1830	1827	-910	5.44	4	1830
AR1	1831	1828	-910	5.68	4	1831

When constructing three marginal models with independence, exchangeable, and first-order autoregressive correlation structures, we observe that age has a significant predictive effect on risk of respiratory infection (p<.05). From the plot, we observe that the prevalence of respiratory disease is consistently higher among mothers who smoke compared to mothers who do not smoke across all ages. The independence and exchangeable marginal models assert that with every unit increase in age, the expected risk of pediatric respiratory disease decreases by 0.14, on average. The AR1 model has this risk decreasing by 0.15 with each unit increase in age.

Additionally, we can observe that the independence and exchangeable correlation structures are systematically more similar in their \(\hat{\beta}\) and quasi-likelihood under the independence model criterion (QIC). Standard error for parameter estimates in the autoregressive model was also systematically higher compared to the other two models.

In the autoregressive model, the correlation depends only on the distance between the observations; the correlation parameters are not different at different points in time. The error at time t only depends on the previous error term. Consequently, the working correlation decays exponentially as the distance between observations increases. When inspecting the residuals’ correlation structure in this respiratory disease dataset, we are not seeing this pattern. For instance, the correlation between the first and fourth residuals is higher than the correlation between the first and third residuals (i.e., \(\rho_{14} > \rho_{13}\)). Ultimately, we are seeing more randomness to the residuals’ correlations as their distance increases. This would corroborate our conclusion that the AR1 model is not appropriate in this situation. It appears that the independent correlation structure may best apply.

c. Conditional Model for Risk of Respiratory Disease

Model: \(Y_{ij} = (\beta_0* + b_i) + \beta_1^*x_{ij} + \epsilon_{ij}\)

The conditional model is subject-specific, where \(b_i\) accounts for individual heterogeneity while modeling the probability of respiratory infection. We are assuming that within-subject correlatoin is captured by random intercepts that follow a normal distribution (\(bi \sim N(0, \sigma_b^2)\)). In other words, everyone has their own baseline risk of infection that is expected to be 0 with some variance \(\sigma_b^2\). There is also some model error we account for with \(\epsilon_{ij}\) that also independently follows a normal distribution with an expectation of 0 and variance \(\sigma_\epsilon^2\). The expected outcome is \(\mu_{ij}^* = \beta_0^* + \beta_1^*t_{ij}\) with variance \(\sigma_\epsilon^2 + \sigma_b^2\).

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: resp ~ agec + smk + (1 | id)
##    Data: resp
## 
##      AIC      BIC   logLik deviance df.resid 
##     1598     1621     -795     1590     2144 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -1.403 -0.180 -0.158 -0.132  2.518 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  id     (Intercept) 5.49     2.34    
## Number of obs: 2148, groups:  id, 537
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.374      0.275  -12.27   <2e-16 ***
## agec          -0.177      0.068   -2.60   0.0093 ** 
## smk            0.415      0.287    1.45   0.1484    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr) agec  
## agec  0.227       
## smk  -0.419 -0.010

In this random-intercepts conditional model, we find that the estimated coefficient for maternal smoking status is 0.415 (p=.148). Exponentiating this value, we see that the relative risk of respiratory infection for an individuals is 1.51 times higher with a smoking mother compared to a non-smoking mother (95% CI: [0.863, 2.66]). At a significance level of \(\alpha=0.05\), this is insufficient evidence for a difference in the relative risk of respiratory infection by maternal smoking status.

d. Transitional Model for Risk of Respiratory Disease

Model: \(Y_{ij} = \beta_0^{**} + \beta_1^{**}t_{ij} + \alpha(Y_{i,j-1} - \beta_0^{**} - \beta_1^{**}t_{i,j-1}) + u_{ij}\), where \(x_{ij}\) are the age and smoking status covariates, \(Y_{i,j-1}\) is the lagged infection status (i.e., whether the individual was infected at the previous time point), \(\alpha\) measures the predictive strength of the previous infection on the current infection, and \(u_{ij}\) is the error term.

Or equivalently, \(logit(\pi_{ij}^C)=x_{ij}'\beta^{**} + \alpha Y_{i,j-1}\), where \(\pi_{ij}^C=E[Y_{ij}|H_{ij},x_{ij}]\).

If \(\alpha>0\), then the past infection increases the risk of current infection. If \(\alpha=0\), then the infection status is independent across time points, suggesting a marginal model may be more appropriate. And finally, if \(\alpha<0\), then past infection reduces the risk of current infection.

Here, we are assuming \(E[u_{ij}]=0\), \(v(u_{ij})=\sigma_u^2\), and \(E[u_{ij}u_{ik}]=0\).

The linear transition model generally produces similar estimates to the marginal and conditional models. This method treats past responses \(H_{ij} = \{Y_{i1},...,Y_{i,j-1}\}\) as additional explanatory variables. We can let centered age and smoking status, agec and smk, be the predictors \(x_{ij}'\); it follows that \(Y_{ij}\) takes on a value of 0 with no infection and 1 when child i experiences respiratory infection at time j.

## 
## Call:
## glm(formula = resp ~ agec + smk + lag_Y, family = binomial, data = resp_trans)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4778     0.1158  -21.40   <2e-16 ***
## agec         -0.2428     0.0947   -2.57    0.010 *  
## smk           0.2960     0.1563    1.89    0.058 .  
## lag_Y         2.2111     0.1582   13.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1352.7  on 1610  degrees of freedom
## Residual deviance: 1148.6  on 1607  degrees of freedom
## AIC: 1157
## 
## Number of Fisher Scoring iterations: 5

In the transitional model, we interpret the estimated coefficient for mother’s smoking status as the log hazard ratio of the incidence of pediatric respiratory disease among mothers who smoke compared to the incidence of pediatric respiratory disease among mothers who do not smoke. In this model, we can exponentiate the estimated coefficient for the mother’s smoking status, 0.296, to find that the hazard of the incidence of pediatric respiratory disease is 1.34 times greater (95% CI: [0.99, 1.83]) among mothers who smoke compared to mothers who do not smoke, controlling for the child’s age (p=.058). At a significance level of \(\alpha=0.05\), we would just barely fail to find sufficient evidence for a difference in the incidence of pediatric respiratory infection on the basis of maternal smoking status, all other factors held constant. Meanwhile, age is significantly predictive of the incidence of pediatric respiratory disease (p<.05). With every unit increase in age, the hazard of incidence of respiratory infection is expected to decrease by 0.784 times on average (95% CI: [0.652, 0.944]). As the 95% confidence interval for this hazard ratio does not contain 1, we find compelling evidence of the significance of age in predicting the incidence of pediatric respiratory disease with the transitional model.

The coefficient for lag_Y, \(\alpha\), is positive, indicating that past infection status is predictive of current infection (p<.05).

e. Comparing Coefficient Estimate for Mother’s Smoking Status Across the Marginal, Conditional, and Transitional Models

The marginal model averages effects across the population. Meanwhile, the transitional model conditions on past outcomes. And finally, the conditional model yield sestimates on the individual by implementing random-effects or mixed effects.

Model	Estimate	Std.err	CI	Pr(>\|W\|)
Model Estimates for Maternal Smoking Status
Marginal	0.314	0.188	[-.05, 0.68]	0.0946
Conditional	0.415	0.287	[-.011, 1.06]	0.1484
Transitional	0.296	0.156	[-.01, .60]	0.0583

To reiterate previous interpretations of the regression coefficient for mother’s smoking status, the marginal model provides information on prevalence, a population-based measure. The conditional model will provide information on the probability of infection for the same individual with and without a mother who smokes. The transitional model provides information on the incidence of respiratory infection based on past outcomes.

The ratio of prevalence of pediatric respiratory disease in smoking mothers to non-smoking mothers is exp(0.314). The relative risk for an individual child developing this respiratory disease given their mother smokes versus does not smoke is exp(0.478). And finally, the ratio of incidence rate of pediatric respiratory disease in smoking mothers to non-smoking nmothers is exp(0.296).