Revisit an example with two continuous and two categorical predictors

Part 1 of this analysis is here https://www.rpubs.com/kdomijan/325930.

FEV data - for a full description see http://ww2.amstat.org/publications/jse/v13n2/datasets.kahn.html.

Response variable: fev (forced expiratory volume) measures respiratory function.

Predictors: age, height, gender and smoke.

The dataset is in library(covreg).

library(covreg)
data(fev)
head(fev)

##   age   fev height male smoke
## 1   9 1.708   57.0    0     0
## 2   8 1.724   67.5    0     0
## 3   7 1.720   54.5    0     0
## 4   9 1.558   53.0    1     0
## 5   9 1.895   57.0    1     0
## 6   8 2.336   61.0    0     0

fev$smoke <- as.factor(fev$smoke)
fev$male <- as.factor(fev$male)

We want to visualize the data with all variables included - here we could use color and symbols on a 3d plot to include all variables in the dataset. We will use library(plotly) for dynamic 3d plots. Color by smoking status:

plot_ly(data = fev, z = ~fev, x = ~age, y = ~height, color = ~smoke, colors = c('#0C4B8E' ,'#BF382A'),opacity = 0.5) %>%
  add_markers( marker = list(size = 2))

Height and age are not independent - older childern tend to be taller, but this trend tapers off after 14. Smokers (red) tend to have lower fev compared to individuals of same age/height.

Colour by gender (symbols are hard to see for this data).

plot_ly(data = fev, z = ~fev, x = ~age, y = ~height, color = ~male, colors = c('blue' ,'red'),opacity = 0.5) %>%
  add_markers( marker = list(size = 2))

Males (red) have higher fev, but this is confounded with height. For children older than 10, boys are taller than girls.

Let’s visualise some models (fitted in an earlier post):

fit6 <- lm(fev~ age * smoke + I(poly(height,2)), data = fev)
summary(fit6)

## 
## Call:
## lm(formula = fev ~ age * smoke + I(poly(height, 2)), data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70600 -0.24141  0.00623  0.23525  1.81959 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.95061    0.09813  19.878  < 2e-16 ***
## age                  0.07068    0.00994   7.111 3.06e-12 ***
## smoke1               0.14787    0.31663   0.467    0.641    
## I(poly(height, 2))1 15.38342    0.68327  22.514  < 2e-16 ***
## I(poly(height, 2))2  3.45231    0.40066   8.616  < 2e-16 ***
## age:smoke1          -0.02264    0.02368  -0.956    0.340    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 648 degrees of freedom
## Multiple R-squared:  0.7916, Adjusted R-squared:   0.79 
## F-statistic: 492.2 on 5 and 648 DF,  p-value: < 2.2e-16

This model fits two hyper-planes (one for smoker and one for non-smokers) in 4D space with age, height, height\(^2\) and fev as dimensions. The planes are not parallel, but the inteaction term is very small and statistically insignificant, so if we could visualise them they would be almost parallel.

Instead we can plot this data in 3D (points are coloured by smoking status.) and the fitted model is two surfaces, curved along height and tilted along age. The surfaces are almost parallel, however, for non smokers (blue), the surface is slightly more tilted along age and the two surfaces intersect (this is easiest to see when rotationg the plot to a marginal view of age vs fev).

In other words, the model predicts that fev increases with age, but for smokers the slope of this increase is smaller. Therefore, for children smokers aged over 10 the fev is lower compared to non-smoking children of the same age. Thus smoking reduces lung capacity, but the drop in fev due to smoking increases with age. The slopes intersect at around 7 years of ages. Below this age the estimated average fev will be higher for smokers (red surface) and the difference increases as children get younger! Notice on the plot though that there are no red points below 9 years of age - i.e. there are no smokers among children under 9, so the model is extrapolating in this region of the predictor space.

Furthermore, the difference in slopes is due to the interaction term which is not statistically significant - so there is no evidence in the data that slopes should be different.

Height has an additional positive effect on fev which is significant once accounting for age and smoking status. Thus taller children (of the same smoking status) will have higher fev compared to shorter children of the same age. This difference will depend on height as there is a significant quadratic term.

In the above model, smoke is not a significant predictor, but if we drop the interaction term, it becomes significant.

fit7 <- lm(fev~ age + smoke + I(poly(height,2)), data = fev)
summary(fit7)

## 
## Call:
## lm(formula = fev ~ age + smoke + I(poly(height, 2)), data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70075 -0.23454  0.00381  0.23419  1.82989 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.985793   0.090957  21.832  < 2e-16 ***
## age                  0.067059   0.009189   7.298 8.58e-13 ***
## smoke1              -0.149820   0.057100  -2.624   0.0089 ** 
## I(poly(height, 2))1 15.568286   0.655280  23.758  < 2e-16 ***
## I(poly(height, 2))2  3.422936   0.399457   8.569  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 649 degrees of freedom
## Multiple R-squared:  0.7913, Adjusted R-squared:   0.79 
## F-statistic: 615.1 on 4 and 649 DF,  p-value: < 2.2e-16

In another model we add male as a predictor. Here we will be fitting 4 surfaces - one for each combination of male/smoke factors. The model does not contain any interactions, so in 4D, the surfaces are 4 parallel hyperplanes with differing intercepts. In 3d (spanned by age, height and smoke) we are looking at 4 curved surfaces tilted along age (but since there are no interaction terms, the slope in marginal views of fev vs age should remain the same).

fit8 <- lm(fev~  male + height + I(height^2) + smoke + age, data = fev)
summary(fit8)

## 
## Call:
## lm(formula = fev ~ male + height + I(height^2) + smoke + age, 
##     data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61264 -0.22793  0.00617  0.22435  1.80390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.8630488  1.4990763   4.578 5.62e-06 ***
## male1        0.0945539  0.0328676   2.877  0.00415 ** 
## height      -0.2732761  0.0496785  -5.501 5.44e-08 ***
## I(height^2)  0.0031165  0.0004086   7.628 8.54e-14 ***
## smoke1      -0.1325347  0.0570996  -2.321  0.02059 *  
## age          0.0699792  0.0091943   7.611 9.62e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3951 on 648 degrees of freedom
## Multiple R-squared:  0.7939, Adjusted R-squared:  0.7923 
## F-statistic: 499.2 on 5 and 648 DF,  p-value: < 2.2e-16

Plotly allows us to superimpose multiple surfaces to a cloud of points in 3D, but with 4 curved and tilted surfaces, patterns hard to see. Marginal views are of overlaid curved surfaces are not clear in this plot - it is easier to study the model with condiotional views.

Visualizing multiple linear regression models with Plotly

K. Domijan

20 November 2017

Revisit an example with two continuous and two categorical predictors