Biostatistics 406

(Applied Multivariate Biostatistics)

LINEAR MODELS

(Factorial Predictors, Interactions and Transformations)

Donatello Telesca

UCLA Biostatistics

Case Study

Smoking and Forced Expiratory Volume - Sample of 654 youths, aged 3 to 19, in the area of East Boston during the middle to late 1970s. Interest concerns the relationship between smoking and FEV. Since the study is necessarily observational, statistical adjustment via regression models helps clarify the relationship.

fev Forced Expiratory Volume (liters)
age Age in years
height Height in inches
sex male (1) or female (0)
smoke smoker (1) or non-smoker (0)

FEV Data

plot of chunk myplot1

FEV Data

Note that sex and smoker are factorial predictors, so a boxplot is a better way to display their pairwise relationship with fev.

plot of chunk myplot2 plot of chunk myplot3

FEV Data - Simple Linear Regression


Call:
lm(formula = fev$fev ~ fev$smoker)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7751 -0.6339 -0.1021  0.4804  3.2269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.56614    0.03466  74.037  < 2e-16 ***
fev$smoker   0.71072    0.10994   6.464 1.99e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8412 on 652 degrees of freedom
Multiple R-squared:  0.06023,   Adjusted R-squared:  0.05879 
F-statistic: 41.79 on 1 and 652 DF,  p-value: 1.993e-10

Factorial Predictors (Dummy Variables)

The regression equation is:

\[E[Y \mid smoker] = \alpha + \beta\,smoker\]

This is meaningless unless we adopt a numerical convention; e.g. define \(X\) such that: \(X=0\) if non-smoker, and \(X=1\) if smoker.

\[ \begin{array}{llllll} E[Y \mid X=0] &= &\alpha + \beta \times 0 &=& \alpha, &\;\;\mbox{ for non smokers};\\ E[Y \mid X=1] &= &\alpha + \beta \times 1 &=& \alpha + \beta, &\;\;\mbox{ for smokers} \end{array} \]

Factorial Predictors (Dummy Variables)

Note that the numerical conventions associated with factorial predictors are not unique.

For example, I could define \(X\) such that: \(X=-1\) if non-smoker, and \(X=1\) if smoker.

\[ \begin{array}{llllll} E[Y \mid X=-1] &= &\alpha + \beta \times (-1)&=& \alpha - \beta, &\;\;\mbox{ for non smokers};\\ E[Y \mid X=1] &= &\alpha + \beta \times 1 &=& \alpha + \beta, &\;\;\mbox{ for smokers} \end{array} \]

FEV Data - Simple Linear Regression

\(X = 1\) if smoker, \(X=0\) if non-smoker

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 2.5661 | 0.0347 | 74.0367 | 0 |
| fev$smoker | 0.7107 | 0.1099 | 6.4645 | 0 |

\(X = 1\) if smoker, \(X=-1\) if non-smoker

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 2.9215 | 0.055 | 53.1459 | 0 |
| x1 | 0.3554 | 0.055 | 6.4645 | 0 |
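
The two codings describe the same pair of group means, so one set of coefficients can be recovered from the other. The sketch below uses Python purely for illustration (the output above comes from R's `lm`) and starts from the reported 0/1 coefficients; the variable names are hypothetical.

```python
# Coefficients reported above for the 0/1 coding
# (X = 1 for smokers, X = 0 for non-smokers).
alpha_01 = 2.5661  # intercept: mean FEV of non-smokers
beta_01 = 0.7107   # slope: smoker mean minus non-smoker mean

# Group means implied by the 0/1 coding.
mean_nonsmoker = alpha_01
mean_smoker = alpha_01 + beta_01

# Under the -1/+1 coding the same means are written as
# alpha - beta (non-smokers) and alpha + beta (smokers), hence:
alpha_pm = (mean_smoker + mean_nonsmoker) / 2  # midpoint of the two group means
beta_pm = (mean_smoker - mean_nonsmoker) / 2   # half the difference in means

print(f"-1/+1 coding: intercept = {alpha_pm:.4f}, slope = {beta_pm:.4f}")
```

Up to rounding, these match the second fit: the intercept becomes the midpoint of the two group means and the slope is halved, while the t value of the slope is unchanged.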

Factorial Predictors (Dummy Variables)

Factorial predictors do not have to be binary and could include multiple levels.

For example, one may discretize age to consider groups of subjects in the categories: (\(\leq\) 10), (10 to 15) and (>15)


plot of chunk myplot4

Factorial Predictors (Dummy Variables)

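Example: to encode three factorial age groups, consider defining

\[ X_1 = \left\{\begin{array}{ll} 1 & \mbox{if } 10 < age \leq 15\\ 0 & \mbox{otherwise} \end{array}\right. \qquad X_2 = \left\{\begin{array}{ll} 1 & \mbox{if } age > 15\\ 0 & \mbox{otherwise} \end{array}\right. \]

leaving \(age \leq 10\) as a reference category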

Factorial Predictors (Dummy Variables)

Example: to encode three factorial age groups, define the indicators \(X_1 = 1\) if \(10 < age \leq 15\) (0 otherwise) and \(X_2 = 1\) if \(age > 15\) (0 otherwise).

In the regression \(E[Y \mid X] = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)

| Group mean | Regression assumption | Coefficients |
|:--|:--|:--|
| \(E[Y\mid age \leq 10]\) | \(\beta_0 + \beta_1\times 0 + \beta_2\times 0\) | \(\beta_0\) |
| \(E[Y\mid 10 < age \leq 15]\) | \(\beta_0 + \beta_1\times 1 + \beta_2\times 0\) | \(\beta_0 + \beta_1\) |
| \(E[Y\mid age > 15]\) | \(\beta_0 + \beta_1\times 0 + \beta_2\times 1\) | \(\beta_0 + \beta_2\) |
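
The indicator coding can be sketched as a small function. This is an illustrative Python sketch, not the code behind the slides; the names `encode_age_group` and `design_row` are hypothetical, and it assumes that age 10 exactly belongs to the reference group.

```python
def encode_age_group(age):
    """Return the indicators (x1, x2) for the three age groups.

    The youngest group (age <= 10, an assumption about the boundary)
    is the reference category and maps to (0, 0).
    """
    x1 = 1 if 10 < age <= 15 else 0  # middle group indicator
    x2 = 1 if age > 15 else 0        # oldest group indicator
    return x1, x2


def design_row(age):
    """Row of the design matrix: intercept column plus the two indicators."""
    x1, x2 = encode_age_group(age)
    return [1, x1, x2]


ages = [8, 10, 12, 15, 17]
rows = [design_row(a) for a in ages]
for a, r in zip(ages, rows):
    print(a, r)
```

A factor with three levels needs only two indicators: a third one would be a linear combination of the intercept and the other two.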

FEV Vs. Age

Age as a continuous predictor


Call:
lm(formula = fev$fev ~ fev$age)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.57539 -0.34567 -0.04989  0.32124  2.12786 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.431648   0.077895   5.541 4.36e-08 ***
fev$age     0.222041   0.007518  29.533  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5675 on 652 degrees of freedom
Multiple R-squared:  0.5722,    Adjusted R-squared:  0.5716 
F-statistic: 872.2 on 1 and 652 DF,  p-value: < 2.2e-16

Age as a factorial predictor


Call:
lm(formula = fev$fev ~ age)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.56594 -0.46934 -0.07584  0.41828  2.53306 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.16934    0.03264   66.46   <2e-16 ***
age10 to 15  1.09059    0.05330   20.46   <2e-16 ***
age>15       1.68349    0.12213   13.79   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6446 on 651 degrees of freedom
Multiple R-squared:  0.449, Adjusted R-squared:  0.4474 
F-statistic: 265.3 on 2 and 651 DF,  p-value: < 2.2e-16

FEV Vs. Age

plot of chunk fevage

FEV Vs. Age

Age as a continuous predictor

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 0.4316481 | 0.0778954 | 5.541382 | 0 |
| fev$age | 0.2220410 | 0.0075185 | 29.532766 | 0 |

- Intercept = 0.43, interpreted as the average fev at age 0 (meaningless unless we re-center age)
- Slope = 0.22, interpreted as the expected increase in fev associated with each additional year of age

FEV Vs. Age

Age as a factorial predictor

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 2.169344 | 0.0326393 | 66.46418 | 0 |
| age10 to 15 | 1.090592 | 0.0532997 | 20.46150 | 0 |
| age>15 | 1.683490 | 0.1221250 | 13.78497 | 0 |
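
With a factorial predictor, the intercept is the mean of the reference group and each remaining coefficient is a difference from that mean. The short Python sketch below (illustrative only; the dictionary names are hypothetical) turns the reported estimates into fitted group means.

```python
# Estimates reported above for age as a factorial predictor.
coef = {"(Intercept)": 2.169344, "age10 to 15": 1.090592, "age>15": 1.683490}

# Fitted mean FEV (liters) in each age group.
group_means = {
    "reference (youngest)": coef["(Intercept)"],
    "10 to 15": coef["(Intercept)"] + coef["age10 to 15"],
    ">15": coef["(Intercept)"] + coef["age>15"],
}

# Contrast between the two older groups: the reference mean cancels out.
older_vs_middle = coef["age>15"] - coef["age10 to 15"]

for group, mean in group_means.items():
    print(f"{group}: {mean:.3f}")
print(f">15 minus 10 to 15: {older_vs_middle:.3f}")
```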

Multiple Regression

One usually works with both continuous and factorial predictors in regression.

Example: Predict fev using smoking and age (continuous)


Call:
lm(formula = fev$fev ~ fev$age + fev$smoker)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6653 -0.3564 -0.0508  0.3494  2.0894 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.367373   0.081436   4.511 7.65e-06 ***
fev$age      0.230605   0.008184  28.176  < 2e-16 ***
fev$smoker  -0.208995   0.080745  -2.588  0.00986 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5651 on 651 degrees of freedom
Multiple R-squared:  0.5766,    Adjusted R-squared:  0.5753 
F-statistic: 443.3 on 2 and 651 DF,  p-value: < 2.2e-16

Multiple Regression - Interpretation

Example: Predict fev using smoking and age (continuous)

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 0.3673730 | 0.0814357 | 4.511203 | 0.0000076 |
| age | 0.2306046 | 0.0081844 | 28.176209 | 0.0000000 |
| smoker | -0.2089949 | 0.0807453 | -2.588321 | 0.0098598 |

Case Study: FEV

FEV predicted using age alone

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 0.4316481 | 0.0778954 | 5.541382 | 0 |
| fev$age | 0.2220410 | 0.0075185 | 29.532766 | 0 |

FEV predicted using smoking and age

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 0.3673730 | 0.0814357 | 4.511203 | 0.0000076 |
| age | 0.2306046 | 0.0081844 | 28.176209 | 0.0000000 |
| smoker | -0.2089949 | 0.0807453 | -2.588321 | 0.0098598 |

Interactions

In the model considered before, we assumed strict additivity:

\[E[Y\mid X] = \beta_0 + \beta_1 Age + \beta_2 Smoker\]

| Group | Group Mean |
|:--|:--|
| Non Smokers | \(\beta_0 + \beta_1 Age\) |
| Smokers | \(\beta_0 + \beta_1 Age + \beta_2\) |
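
Strict additivity means that the two group lines are parallel: the smoker versus non-smoker gap is \(\beta_2\) at every age. A minimal Python sketch using the estimates reported earlier for the age and smoking model (illustrative only; `expected_fev` is a hypothetical helper):

```python
# Estimates from the additive fit of fev on age and smoker reported earlier.
b0, b1, b2 = 0.367373, 0.230605, -0.208995


def expected_fev(age, smoker):
    """Fitted mean FEV under the additive model (smoker is 0 or 1)."""
    return b0 + b1 * age + b2 * smoker


# The gap between smokers and non-smokers does not depend on age.
gaps = [expected_fev(age, 1) - expected_fev(age, 0) for age in (10, 14, 18)]
print([round(g, 6) for g in gaps])
```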

Interactions

plot of chunk fevage1

Interactions

Is the relationship between age and FEV different amongst smokers as opposed to non-smokers?

This question can be addressed by adding an interaction term

\[E[Y \mid X] = \beta_0 + \beta_1 Age + \beta_2 Smoker + \beta_3 Age\times Smoker\]

| Group | Group Mean |
|:--|:--|
| Non Smokers | \(\beta_0 + \beta_1 Age\) |
| Smokers | \((\beta_0 + \beta_2) + (\beta_1 + \beta_3) Age\) |

- \(\beta_0\): Intercept for non smokers (not too meaningful without recentering)
- \(\beta_1\): Slope for non smokers
- \(\beta_2\): Difference between intercept of smokers and intercept of non-smokers
- \(\beta_3\): Difference between slope of smokers and slope of non smokers

Interactions

plot of chunk fevage2

Interactions

Interactions model


Call:
lm(formula = fev ~ age * smoker, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76645 -0.34947 -0.03364  0.33679  2.05990 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.253396   0.082651   3.066  0.00226 ** 
age          0.242558   0.008332  29.113  < 2e-16 ***
smoker       1.943571   0.414285   4.691 3.31e-06 ***
age:smoker  -0.162703   0.030738  -5.293 1.65e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5537 on 650 degrees of freedom
Multiple R-squared:  0.5941,    Adjusted R-squared:  0.5922 
F-statistic: 317.1 on 3 and 650 DF,  p-value: < 2.2e-16

Interactions

Interactions model re-centered

- It is perhaps more meaningful to look at expected FEV at a "pivotal age", say 14, rather than at age 0.
- Re-center age and define a new variable: age_c = age - 14

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 3.6492133 | 0.0436404 | 83.620145 | 0.0e+00 |
| age_c | 0.2425584 | 0.0083315 | 29.113273 | 0.0e+00 |
| smoker | -0.3342667 | 0.0825838 | -4.047606 | 5.8e-05 |
| age_c:smoker | -0.1627027 | 0.0307375 | -5.293291 | 2.0e-07 |
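
Re-centering only re-expresses the same fitted lines. Substituting age = age_c + 14 into the interaction model gives a new intercept \(\beta_0 + 14\beta_1\) and a new smoker coefficient \(\beta_2 + 14\beta_3\), while both slopes are unchanged. The Python sketch below (illustrative only; variable names are hypothetical) checks this against the uncentered estimates reported earlier.

```python
# Estimates from the uncentered interaction fit (fev ~ age * smoker).
b0, b1, b2, b3 = 0.253396, 0.242558, 1.943571, -0.162703
pivot = 14  # age_c = age - pivot

# Coefficients implied for the re-centered model.
b0_c = b0 + b1 * pivot  # expected FEV of a non-smoker at age 14
b1_c = b1               # slope for non-smokers (unchanged)
b2_c = b2 + b3 * pivot  # smoker minus non-smoker gap at age 14
b3_c = b3               # difference in slopes (unchanged)

# Slope of the smokers' line.
slope_smokers = b1 + b3

print(f"intercept at 14: {b0_c:.4f}")
print(f"smoker gap at 14: {b2_c:.4f}")
print(f"slope for smokers: {slope_smokers:.4f}")
```

Up to rounding, these agree with the re-centered table: only the intercept and the smoker coefficient change, which is why the smoker estimate moves from 1.94 (a gap extrapolated to age 0) to -0.33 (the gap at age 14).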

Effect Modifiers

In a model with predictors \(X_1\), \(X_2\) and the interaction term \(X_1 \times X_2\):

- if the interaction is significant, we say that \(X_2\) is an effect modifier for \(X_1\) (and vice-versa)
- if the interaction is not significant, we do not robotically remove it from the model, but we refer to more formal model building ideas

TRANSFORMATIONS AND OTHER HACKS

- POLYNOMIAL REGRESSION
- TRANSFORMATIONS
- BOOTSTRAP

Regression Diagnostics

Consider the additive model with all predictors

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 2.8312825 | 0.0453674 | 62.407871 | 0.0000000 |
| age_c | 0.0655093 | 0.0094886 | 6.904038 | 0.0000000 |
| smoker | -0.0872464 | 0.0592535 | -1.472426 | 0.1413907 |
| sexMale | 0.1571029 | 0.0332071 | 4.731010 | 0.0000027 |
| height_c | 0.1041994 | 0.0047577 | 21.901145 | 0.0000000 |

Heteroscedasticity

plot of chunk hetero1

Normality

plot of chunk hetero2

Linearity

plot of chunk linear1

Linearity

plot of chunk linear2

Diagnostics Checks Conclusions

In order of importance

Fix the mean structure first

Fix the mean structure first

plot of chunk linear3

Introducing polynomial terms

The assumption of a linear relationship between \(Y\) and a continuous predictor \(X\) may be relaxed by adding polynomial terms.

For example:

\[E[Y\mid X] = \beta_0 + \beta_1 X + \beta_2 X^2\]

\[E[Y\mid X] = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3\]
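
A polynomial regression is still a linear model: the powers of \(X\) are simply extra columns of the design matrix, and the coefficients are obtained by ordinary least squares. The sketch below is a minimal pure-Python illustration on noise-free synthetic points (not the FEV data); `solve_linear_system` and `fit_polynomial` are hypothetical helpers written for this example, and solving the normal equations directly is chosen for brevity rather than numerical robustness.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot_row] = m[pivot_row], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least squares fit of y on 1, x, ..., x^degree via the normal equations."""
    design = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free data generated from a known quadratic.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]

beta = fit_polynomial(xs, ys, degree=2)
print([round(b, 6) for b in beta])
```

Centering the predictor before squaring it, as done with height later in these slides, reduces the correlation between \(X\) and \(X^2\) and keeps the normal equations well conditioned.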

Quadratic regression of FEV vs Height

plot of chunk quadrativ

Adding a quadratic term

Adding a quadratic term for height

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 2.7825861 | 0.0439455 | 63.319073 | 0.0000000 |
| age_c | 0.0694646 | 0.0091089 | 7.626030 | 0.0000000 |
| smoker | -0.1332112 | 0.0571079 | -2.332622 | 0.0199732 |
| sexMale | 0.0945352 | 0.0328613 | 2.876793 | 0.0041495 |
| height_c | 0.1079208 | 0.0045859 | 23.533380 | 0.0000000 |
| height2 | 0.0031251 | 0.0004086 | 7.648942 | 0.0000000 |

Linearity

plot of chunk linear4

mean/variance relationships

There still seems to be a mean/variance relationship

Variance Stabilizing Transformations

Box-Cox Transformations

| Transform | \(g(Y)\) | Mean-Variance relationship |
|:--|:--|:--|
| no transform! | \(Y\) | \(\sigma \propto const\) |
| square root | \(\sqrt{Y}\) | \(\sigma \propto \sqrt{\hat{Y}}\) |
| log | \(\log(Y)\) | \(\sigma \propto \hat{Y}\) |
| reciprocal | \(1/Y\) | \(\sigma \propto \hat{Y}^2\) |
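
The rows of this table are members of one power family. In the usual Box-Cox parametrization (stated here as background, it is not spelled out in the slides), \(g_\lambda(Y) = (Y^\lambda - 1)/\lambda\) for \(\lambda \neq 0\) and \(\log(Y)\) for \(\lambda = 0\), so that \(\lambda = 1, 1/2, 0, -1\) correspond to no transform, square root, log and reciprocal, up to a linear rescaling. A minimal Python sketch, with `box_cox` a hypothetical helper name:

```python
import math


def box_cox(y, lam):
    """Box-Cox power transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox transform requires y > 0")
    if lam == 0:
        return math.log(y)  # limiting case as lam -> 0
    return (y ** lam - 1.0) / lam


# The four rows of the table, applied to an illustrative value y = 4.
for name, lam in [("none", 1), ("square root", 0.5), ("log", 0), ("reciprocal", -1)]:
    print(f"{name:12s} lambda = {lam:4}: {box_cox(4.0, lam):.4f}")
```

Subtracting 1 and dividing by \(\lambda\) keeps the transform increasing in \(Y\) for every \(\lambda\), including \(\lambda = -1\), so the sign of regression coefficients keeps its usual reading.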

Variance Stabilizing Transformations

FEV example

- Consider a model where we use \(\sqrt{Y}\) (mild variance stabilization)

\[\sqrt{Y} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon\]

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 1.6589481 | 0.0129735 | 127.871573 | 0.0000000 |
| age_c | 0.0198771 | 0.0026891 | 7.391695 | 0.0000000 |
| smoker | -0.0376014 | 0.0168594 | -2.230298 | 0.0260694 |
| sexMale | 0.0264355 | 0.0097013 | 2.724947 | 0.0066048 |
| height_c | 0.0335403 | 0.0013538 | 24.774307 | 0.0000000 |
| height2 | 0.0004417 | 0.0001206 | 3.662404 | 0.0002703 |

Heteroscedasticity

plot of chunk hetero_sqrt1

Normality

plot of chunk hetero_sqrt2

Considerations about Transformations

Bootstrap (Advanced)

Bootstrap

Bootstrap

FEV example

| | Estimate | T-CI | Bootstrap-CI |
|:--|--:|:--|:--|
| (Intercept) | 2.783 | 2.696, 2.869 | 2.696, 2.882 |
| age_c | 0.069 | 0.052, 0.087 | 0.049, 0.091 |
| smoker | -0.133 | -0.245, -0.021 | -0.286, 0.015 |
| sexMale | 0.095 | 0.030, 0.159 | 0.035, 0.164 |
| height_c | 0.108 | 0.099, 0.117 | 0.096, 0.118 |
| height2 | 0.003 | 0.002, 0.004 | 0.002, 0.004 |
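
The slides do not say which bootstrap scheme produced the intervals above. One common choice is to resample subjects (cases) with replacement, refit the model on each resample, and take percentiles of the refitted coefficients. The Python sketch below illustrates that idea for the slope of a simple regression on synthetic data; the data, the seed, and the helper names `slope` and `bootstrap_percentile_ci` are all assumptions made for this example, not the FEV analysis.

```python
import random


def slope(xs, ys):
    """Least squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_percentile_ci(xs, ys, n_boot=1000, level=0.95, seed=0):
    """Case-resampling bootstrap percentile interval for the slope."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        stats.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Synthetic data, loosely shaped like an age vs FEV relationship.
data_rng = random.Random(406)
xs = [data_rng.uniform(3, 19) for _ in range(60)]
ys = [0.4 + 0.22 * x + data_rng.gauss(0, 0.5) for x in xs]

slope_hat = slope(xs, ys)
lo, hi = bootstrap_percentile_ci(xs, ys)
print(f"slope = {slope_hat:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

Because resampling cases does not rely on constant error variance or normality, such intervals can differ from the t-based ones when those assumptions are doubtful, as with the smoker coefficient in the table above.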

Bootstrap

FEV example

plot of chunk boot1

Bootstrap (Final Considerations)

THANK YOU!