font-family: 'Garamond'
(Applied Multivariate Biostatistics)
LINEAR MODELS
(Factorial Predictors, Interactions and Transformations)
Donatello Telesca
UCLA Biostatistics
Smoking and Forced Expiratory Volume - Sample of 654 youths, aged 3 to 19, from the area of East Boston during the middle to late 1970s. Interest concerns the relationship between smoking and FEV. Since the study is necessarily observational, statistical adjustment via regression models helps clarify the relationship.
| Variable | Description |
|---|---|
| fev | Forced Expiratory Volume (liters) |
| age | Age in years |
| height | Height in inches |
| sex | male (1) or female (0) |
| smoke | smoker (1) or non-smoker (0) |
title: false
Note that sex and smoking status are factorial predictors, so a better pairwise display of their relationship with fev is a boxplot.
class: small-code
Call:
lm(formula = fev$fev ~ fev$smoker)
Residuals:
Min 1Q Median 3Q Max
-1.7751 -0.6339 -0.1021 0.4804 3.2269
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.56614 0.03466 74.037 < 2e-16 ***
fev$smoker 0.71072 0.10994 6.464 1.99e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8412 on 652 degrees of freedom
Multiple R-squared: 0.06023, Adjusted R-squared: 0.05879
F-statistic: 41.79 on 1 and 652 DF, p-value: 1.993e-10
The regression equation is:
\[E[Y \mid smoker] = \alpha + \beta\,smoker\]
This is meaningless unless we adopt a numerical convention; e.g. define \(X\) such that: \(X=0\) if non-smoker, and \(X=1\) if smoker.
\[ \begin{array}{llllll} E[Y \mid X=0] &= &\alpha + \beta \times 0 &=& \alpha, &\;\;\mbox{ for non smokers};\\ E[Y \mid X=1] &= &\alpha + \beta \times 1 &=& \alpha + \beta, &\;\;\mbox{ for smokers} \end{array} \]
Note that numerical conventions associated with factorial predictors are not unique.
For example, I could define \(X\) such that: \(X=-1\) if non-smoker, and \(X=1\) if smoker.
\[ \begin{array}{llllll} E[Y \mid X=-1] &= &\alpha + \beta \times (-1)&=& \alpha - \beta, &\;\;\mbox{ for non smokers};\\ E[Y \mid X=1] &= &\alpha + \beta \times 1 &=& \alpha + \beta, &\;\;\mbox{ for smokers} \end{array} \]
class: small-code
\(X = 1\) if smoker, \(X=0\) if non-smoker
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2.5661 | 0.0347 | 74.0367 | 0 |
| fev$smoker | 0.7107 | 0.1099 | 6.4645 | 0 |
\(X = 1\) if smoker, \(X=-1\) if non-smoker
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2.9215 | 0.055 | 53.1459 | 0 |
| x1 | 0.3554 | 0.055 | 6.4645 | 0 |
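A minimal R sketch of the two codings, assuming a data frame `fev` with columns `fev` and a 0/1 `smoker` indicator as in the output above (the names `x01` and `x1m1` are mine):

```r
## Minimal sketch: refit the smoking model under the two codings.
fev$x01  <- ifelse(fev$smoker == 1, 1,  0)   # X = 1 smoker, 0 non-smoker
fev$x1m1 <- ifelse(fev$smoker == 1, 1, -1)   # X = 1 smoker, -1 non-smoker

fit01  <- lm(fev ~ x01,  data = fev)
fit1m1 <- lm(fev ~ x1m1, data = fev)

## Same fitted group means, different parameterizations:
## 0/1:   intercept = mean for non-smokers, slope = difference in means
## -1/+1: intercept = average of the two group means,
##        slope = half the difference in means
summary(fit01)$coefficients
summary(fit1m1)$coefficients
```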
\(\;\)
Factorial predictors do not have to be binary and can include multiple levels.
For example, one may discretize age to consider groups of subjects in the categories: (\(\leq\) 10), (10 to 15] and (> 15).
Example To encode three factorial age groups, consider defining
\(X_1 = 1\) if age is in (10 to 15], and \(X_1 = 0\) otherwise
\(X_2 = 1\) if age is > 15, and \(X_2 = 0\) otherwise
leaving age \(\leq\) 10 as the reference category
In the regression \(E[Y \mid X] = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)
| Group mean | Regression assumption | Coefficients |
|---|---|---|
| \(E[Y\mid age \leq 10]\) | = \(\;\;\beta_0 + \beta_1\times 0 + \beta_2\times 0\) | = \(\;\;\beta_0\) |
| \(E[Y\mid 10 < age \leq 15]\) | = \(\;\;\beta_0 + \beta_1\times 1 + \beta_2\times 0\) | = \(\;\;\beta_0 + \beta_1\) |
| \(E[Y\mid age > 15]\) | = \(\;\;\beta_0 + \beta_1\times 0 + \beta_2\times 1\) | = \(\;\;\beta_0 + \beta_2\) |
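A minimal R sketch of this encoding, using cut() to build the three age groups and letting lm() create the dummy variables (cut points from the text; the name `age_grp` is mine):

```r
## Minimal sketch: discretize age into three groups; the first level
## (<= 10) is the reference category, and lm() builds the 0/1 indicators
## for the other two levels automatically.
fev$age_grp <- cut(fev$age,
                   breaks = c(-Inf, 10, 15, Inf),
                   labels = c("<=10", "10 to 15", ">15"))
fit_agegrp <- lm(fev ~ age_grp, data = fev)
summary(fit_agegrp)$coefficients
```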
class: small-code
\(\;\)
Age as a continuous predictor
Call:
lm(formula = fev$fev ~ fev$age)
Residuals:
Min 1Q Median 3Q Max
-1.57539 -0.34567 -0.04989 0.32124 2.12786
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.431648 0.077895 5.541 4.36e-08 ***
fev$age 0.222041 0.007518 29.533 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5675 on 652 degrees of freedom
Multiple R-squared: 0.5722, Adjusted R-squared: 0.5716
F-statistic: 872.2 on 1 and 652 DF, p-value: < 2.2e-16
\(\;\)
Age as a factorial predictor
Call:
lm(formula = fev$fev ~ age)
Residuals:
Min 1Q Median 3Q Max
-1.56594 -0.46934 -0.07584 0.41828 2.53306
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.16934 0.03264 66.46 <2e-16 ***
age10 to 15 1.09059 0.05330 20.46 <2e-16 ***
age>15 1.68349 0.12213 13.79 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6446 on 651 degrees of freedom
Multiple R-squared: 0.449, Adjusted R-squared: 0.4474
F-statistic: 265.3 on 2 and 651 DF, p-value: < 2.2e-16
class: small-code
\(\;\)
Age as a continuous predictor
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 0.4316481 | 0.0778954 | 5.541382 | 0 |
| fev$age | 0.2220410 | 0.0075185 | 29.532766 | 0 |

- Intercept = 0.43, interpreted as the average fev at age 0 (meaningless unless we re-center age)
- Slope = 0.22, interpreted as the expected increase in fev associated with each year of aging
class: small-code
Age as a factorial predictor
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2.169344 | 0.0326393 | 66.46418 | 0 |
| age10 to 15 | 1.090592 | 0.0532997 | 20.46150 | 0 |
| age>15 | 1.683490 | 0.1221250 | 13.78497 | 0 |
class: small-code
One usually works with both continuous and factorial predictors in regression.
Example: Predict fev using smoking and age (continuous)
Call:
lm(formula = fev$fev ~ fev$age + fev$smoker)
Residuals:
Min 1Q Median 3Q Max
-1.6653 -0.3564 -0.0508 0.3494 2.0894
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.367373 0.081436 4.511 7.65e-06 ***
fev$age 0.230605 0.008184 28.176 < 2e-16 ***
fev$smoker -0.208995 0.080745 -2.588 0.00986 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5651 on 651 degrees of freedom
Multiple R-squared: 0.5766, Adjusted R-squared: 0.5753
F-statistic: 443.3 on 2 and 651 DF, p-value: < 2.2e-16
class: small-code
Example: Predict fev using smoking and age (continuous)
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 0.3673730 | 0.0814357 | 4.511203 | 0.0000076 |
| age | 0.2306046 | 0.0081844 | 28.176209 | 0.0000000 |
| smoker | -0.2089949 | 0.0807453 | -2.588321 | 0.0098598 |
title: false
FEV predicted using age alone
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 0.4316481 | 0.0778954 | 5.541382 | 0 |
| fev$age | 0.2220410 | 0.0075185 | 29.532766 | 0 |
FEV predicted using smoking and age
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 0.3673730 | 0.0814357 | 4.511203 | 0.0000076 |
| age | 0.2306046 | 0.0081844 | 28.176209 | 0.0000000 |
| smoker | -0.2089949 | 0.0807453 | -2.588321 | 0.0098598 |
In the model considered before, we assumed strict additivity:
\[E[Y\mid X] = \beta_0 + \beta_1 Age + \beta_2 Smoker\]
| Group | Group Mean |
|---|---|
| Non Smokers | \(\beta_0 + \beta_1 Age\) |
| Smokers | \(\beta_0 + \beta_1 Age + \beta_2\) |
title: false
title: false
Is the relationship between age and FEV different amongst smokers as opposed to non-smokers?
This question can be addressed by adding an interaction term
\[E[Y \mid X] = \beta_0 + \beta_1 Age + \beta_2 Smoker + \beta_3 Age\times Smoker\]
| Group | Group Mean |
|---|---|
| Non Smokers | \(\beta_0 + \beta_1 Age\) |
| Smokers | \((\beta_0 + \beta_2) + (\beta_1 + \beta_3) Age\) |

| Coefficient | Interpretation |
|---|---|
| \(\beta_0\) | Intercept for non-smokers (not too meaningful without re-centering) |
| \(\beta_1\) | Slope for non-smokers |
| \(\beta_2\) | Difference between the intercept for smokers and the intercept for non-smokers |
| \(\beta_3\) | Difference between the slope for smokers and the slope for non-smokers |
title: false
title: false
class: small-code
Interactions Model
Call:
lm(formula = fev ~ age * smoker, data = fev)
Residuals:
Min 1Q Median 3Q Max
-1.76645 -0.34947 -0.03364 0.33679 2.05990
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.253396 0.082651 3.066 0.00226 **
age 0.242558 0.008332 29.113 < 2e-16 ***
smoker 1.943571 0.414285 4.691 3.31e-06 ***
age:smoker -0.162703 0.030738 -5.293 1.65e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5537 on 650 degrees of freedom
Multiple R-squared: 0.5941, Adjusted R-squared: 0.5922
F-statistic: 317.1 on 3 and 650 DF, p-value: < 2.2e-16
title: false
class: small-code
Interactions model re-centered
- It is perhaps more meaningful to look at expected FEV at a "pivotal age", say at 14, rather than at age 0.
Re-center age and define a new variable: age_c = age - 14
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 3.6492133 | 0.0436404 | 83.620145 | 0.0e+00 |
| age_c | 0.2425584 | 0.0083315 | 29.113273 | 0.0e+00 |
| smoker | -0.3342667 | 0.0825838 | -4.047606 | 5.8e-05 |
| age_c:smoker | -0.1627027 | 0.0307375 | -5.293291 | 2.0e-07 |
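A minimal R sketch of the re-centering step and refit (column names assumed, following the output above):

```r
## Minimal sketch: re-center age at 14 and refit the interaction model,
## so the intercept is the expected FEV at age 14 for non-smokers.
fev$age_c <- fev$age - 14
fit_int_c <- lm(fev ~ age_c * smoker, data = fev)
summary(fit_int_c)$coefficients
```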
In a model with predictors \(X_1\), \(X_2\) and interaction term \(X_1 \times X_2\):
- if the interaction is significant, we say that \(X_2\) is an effect modifier for \(X_1\) (and vice-versa)
- if the interaction is not significant, we do not robotically remove it from the model, but we refer to more formal model-building ideas
title: false
- POLYNOMIAL REGRESSION
- TRANSFORMATIONS
- BOOTSTRAP
Consider
- Additive model with all predictors

| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2.8312825 | 0.0453674 | 62.407871 | 0.0000000 |
| age_c | 0.0655093 | 0.0094886 | 6.904038 | 0.0000000 |
| smoker | -0.0872464 | 0.0592535 | -1.472426 | 0.1413907 |
| sexMale | 0.1571029 | 0.0332071 | 4.731010 | 0.0000027 |
| height_c | 0.1041994 | 0.0047577 | 21.901145 | 0.0000000 |
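A minimal R sketch of how a model like the one tabulated above could be fit; the centering of height and the column names are assumptions on my part:

```r
## Minimal sketch: additive model with all predictors, using centered
## age (at 14) and centered height (centering choice assumed).
fev$age_c    <- fev$age - 14
fev$height_c <- fev$height - mean(fev$height)
fit_add <- lm(fev ~ age_c + smoker + sex + height_c, data = fev)
summary(fit_add)$coefficients
```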
title: false
title: false
title: false
title: false
In order of importance
title: false
The assumption of a linear relationship between \(Y\) and a continuous predictor \(X\) may be relaxed by adding polynomial terms.
For example:
\[E[Y\mid X] = \beta_0 + \beta_1 X + \beta_2 X^2\]
\[E[Y\mid X] = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3\]
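As a minimal sketch, polynomial terms can be added inside an R model formula with I() (or via poly()); here a quadratic and a cubic fit in age, with names of my choosing:

```r
## Minimal sketch: quadratic and cubic terms via I() inside the formula
## (poly() would give orthogonal polynomials instead).
fit_q <- lm(fev ~ age + I(age^2),            data = fev)
fit_c <- lm(fev ~ age + I(age^2) + I(age^3), data = fev)
summary(fit_q)$coefficients
```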
title: false
title: false
Adding a quadratic term for height

| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2.7825861 | 0.0439455 | 63.319073 | 0.0000000 |
| age_c | 0.0694646 | 0.0091089 | 7.626030 | 0.0000000 |
| smoker | -0.1332112 | 0.0571079 | -2.332622 | 0.0199732 |
| sexMale | 0.0945352 | 0.0328613 | 2.876793 | 0.0041495 |
| height_c | 0.1079208 | 0.0045859 | 23.533380 | 0.0000000 |
| height2 | 0.0031251 | 0.0004086 | 7.648942 | 0.0000000 |
title: false
title: false
There still seems to be a mean/variance relationship
Box-Cox Transformations
| Transform | \(g(Y)\) | Mean-Variance relationship |
|---|---|---|
| no transform! | \(Y\) | \(\sigma \propto \mathrm{const}\) |
| square root | \(\sqrt{Y}\) | \(\sigma \propto \sqrt{\hat{Y}}\) |
| log | \(\log(Y)\) | \(\sigma \propto \hat{Y}\) |
| reciprocal | \(1/Y\) | \(\sigma \propto \hat{Y}^2\) |
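A minimal sketch of how the Box-Cox family could be examined with MASS::boxcox(); the model specification and centered variables below are assumptions, not the slides' exact model:

```r
## Minimal sketch: profile likelihood for the Box-Cox parameter lambda
## (lambda = 1: no transform, 0.5: square root, 0: log, -1: reciprocal).
library(MASS)
fev$age_c    <- fev$age - 14
fev$height_c <- fev$height - mean(fev$height)
fit <- lm(fev ~ age_c + smoker + sex + height_c + I(height_c^2), data = fev)
boxcox(fit, lambda = seq(-1, 1, by = 0.05))
```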
title: false
FEV example
- Consider a model where we use \(\sqrt{Y}\) (mild variance stabilization)
\[\sqrt{Y} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon\]
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 1.6589481 | 0.0129735 | 127.871573 | 0.0000000 |
| age_c | 0.0198771 | 0.0026891 | 7.391695 | 0.0000000 |
| smoker | -0.0376014 | 0.0168594 | -2.230298 | 0.0260694 |
| sexMale | 0.0264355 | 0.0097013 | 2.724947 | 0.0066048 |
| height_c | 0.0335403 | 0.0013538 | 24.774307 | 0.0000000 |
| height2 | 0.0004417 | 0.0001206 | 3.662404 | 0.0002703 |
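A minimal sketch of the square-root model in R, applying the transform directly in the formula (variable names and centering assumed as before):

```r
## Minimal sketch: square-root transform applied on the left-hand side.
fev$age_c    <- fev$age - 14
fev$height_c <- fev$height - mean(fev$height)
fit_sqrt <- lm(sqrt(fev) ~ age_c + smoker + sex + height_c + I(height_c^2),
               data = fev)
summary(fit_sqrt)$coefficients
```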
title: false
title: false
title: false
FEV example

| | Estimate | T-CI | Bootstrap-CI |
|---|---|---|---|
| (Intercept) | 2.783 | 2.696, 2.869 | 2.696, 2.882 |
| age_c | 0.069 | 0.052, 0.087 | 0.049, 0.091 |
| smoker | -0.133 | -0.245, -0.021 | -0.286, 0.015 |
| sexMale | 0.095 | 0.030, 0.159 | 0.035, 0.164 |
| height_c | 0.108 | 0.099, 0.117 | 0.096, 0.118 |
| height2 | 0.003 | 0.002, 0.004 | 0.002, 0.004 |
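A minimal sketch of a case-resampling bootstrap for the regression coefficients (percentile intervals; the number of replicates, the confidence level, and the variable names are my assumptions):

```r
## Minimal sketch: resample rows of the data with replacement, refit the
## model, and take percentile intervals of the coefficient estimates.
set.seed(1)
fev$age_c    <- fev$age - 14
fev$height_c <- fev$height - mean(fev$height)
B <- 2000
boot_coefs <- replicate(B, {
  idx <- sample(nrow(fev), replace = TRUE)
  coef(lm(fev ~ age_c + smoker + sex + height_c + I(height_c^2),
          data = fev[idx, ]))
})
## 95% percentile bootstrap intervals, one row per coefficient
t(apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975)))
```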
title: false
FEV example