An example with two continuous and two categorical predictors

FEV data - for a full description see http://ww2.amstat.org/publications/jse/v13n2/datasets.kahn.html.

Response variable: fev (forced expiratory volume) measures respiratory function.

Predictors: age, height, gender and smoke.

The dataset is in library(covreg).

library(covreg)
data(fev)
head(fev)
##   age   fev height male smoke
## 1   9 1.708   57.0    0     0
## 2   8 1.724   67.5    0     0
## 3   7 1.720   54.5    0     0
## 4   9 1.558   53.0    1     0
## 5   9 1.895   57.0    1     0
## 6   8 2.336   61.0    0     0

Does smoking affect pulmonary function?

boxplot(fev ~smoke, data = fev,ylab = "FEV", names = c("Nonsmokers", "Smokers"))

From this plot it appers that smoking improves lung function! However, in the above plot, smoking is not adjusted for age and body size. Let’s visualize the effect of smoking on fev accounting for age and height:

op <- par(mfrow = c(1, 2) )
plot(fev$age, fev$fev, col = fev$smoke +1,  pch = fev$smoke +1)
plot(fev$height, fev$fev, col = fev$smoke +1, pch = fev$smoke +1)

par(op)

Younger children are less likely to be smokers, but more likely to have have lower fev than older (and bigger) ones. This is an example of Simpson’s paradox.

visualize the data with all variables included - here we could use color and symbols on a 3d plot to include all variables in the dataset. We will need library(rgl) for dynamic 3d plots. Function pch3d() function allows for different symbols in 3d. This is quite hard to see for this data so I make two separate visualizations:

Color by smoking status:

plot3d(fev$age, fev$height, fev$fev,size=1, type="s" ,col = fev$smoke +1 )

You must enable Javascript to view this page properly.

Height and age are not independent - older childern tend to be taller, but this trend tapers off after 14. Smokers (red) tend to have lower fev compared to individuals of same age/height.

Color by gender:

plot3d(fev$age, fev$height, fev$fev,size=1, type="s", col = fev$male +1)

You must enable Javascript to view this page properly.

Males (red) have higher fev, but this is confounded with height. For children older than 10, boys are taller than girls.

Let’s fit some models. Male and smoke are binary (0/1) numeric vectors so we will use them as dummy variables in the models.

fit1 <- lm(fev~ age + smoke, data = fev)
summary(fit1)
## 
## Call:
## lm(formula = fev ~ age + smoke, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68070 -0.35220 -0.04599  0.35034  2.08515 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.349135   0.081787   4.269 2.26e-05 ***
## age          0.232476   0.008223  28.272  < 2e-16 ***
## smoke       -0.208911   0.080573  -2.593  0.00973 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.564 on 651 degrees of freedom
## Multiple R-squared:  0.5782, Adjusted R-squared:  0.5769 
## F-statistic: 446.1 on 2 and 651 DF,  p-value: < 2.2e-16

Age has a positive effect of fev and smoking negative. We can visualize this using added variable plots (we can use library(car))

library(car)
avPlots(fit1)

The plot of age by fev suggested that we should include a smoking by age interaction.

fit2 <- lm(fev~ age * smoke, data = fev)
summary(fit2)
## 
## Call:
## lm(formula = fev ~ age * smoke, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.77573 -0.34712 -0.03269  0.33471  2.05749 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.242056   0.083021   2.916  0.00367 ** 
## age          0.243704   0.008371  29.114  < 2e-16 ***
## smoke        1.904894   0.424926   4.483 8.70e-06 ***
## age:smoke   -0.159959   0.031594  -5.063 5.38e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5536 on 650 degrees of freedom
## Multiple R-squared:  0.5942, Adjusted R-squared:  0.5923 
## F-statistic: 317.2 on 3 and 650 DF,  p-value: < 2.2e-16

Interaction is significant in this model.

Display the fitted model:

plot(fev$age, fev$fev, col = fev$smoke +1,  pch = fev$smoke +1)
abline(a = coef(fit2)[1], b= coef(fit2)[2] )
abline(a = coef(fit2)[1] + coef(fit2)[3] , b= coef(fit2)[2] + coef(fit2)[4], col = 2)

Does the model fit well? Let’s take a look at the fits vs residuals diagnostic plot. Colour the points by gender (male = red).

plot(fit2, which = 1, col = fev$male +1)

Note: increasing variance of the residuals. Gender or height might account for some of this. Remember that there is a relationship between age and height, as well as height and gender.

Let’s fit a model including height. The scatterplot above indicated that a quadratic fit might be a good option. We will consider a model with height and gender only (a model we can still visualize on a scatterplot):

fit3 <- lm(fev~ male + height + I(height^2) , data = fev)
summary(fit3)
## 
## Call:
## lm(formula = fev ~ male + height + I(height^2), data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75732 -0.22659  0.00657  0.22542  1.97980 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.2915306  1.5444567   3.426 0.000651 ***
## male         0.0671410  0.0336791   1.994 0.046618 *  
## height      -0.2257124  0.0512179  -4.407 1.23e-05 ***
## I(height^2)  0.0029466  0.0004233   6.960 8.31e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4118 on 650 degrees of freedom
## Multiple R-squared:  0.7755, Adjusted R-squared:  0.7744 
## F-statistic: 748.3 on 3 and 650 DF,  p-value: < 2.2e-16

Should we allow for height and gender interaction? It’s not obvious from the plot.

fit4 <- lm(fev~ male * height + I(height^2) , data = fev)
summary(fit4)
## 
## Call:
## lm(formula = fev ~ male * height + I(height^2), data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68434 -0.23790  0.01399  0.23822  1.95853 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.4935207  1.5887895   2.828  0.00482 ** 
## male        -0.7366202  0.3924216  -1.877  0.06095 .  
## height      -0.1904238  0.0538979  -3.533  0.00044 ***
## I(height^2)  0.0025829  0.0004578   5.642 2.52e-08 ***
## male:height  0.0133266  0.0064826   2.056  0.04021 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4108 on 649 degrees of freedom
## Multiple R-squared:  0.7769, Adjusted R-squared:  0.7755 
## F-statistic: 565.1 on 4 and 649 DF,  p-value: < 2.2e-16

The interaction is significant, althought barely (P value = 0.04)

What happens if we add smoke to the model?

fit5 <- lm(fev ~ male + height + I(height^2) + smoke, data = fev)
summary(fit5)
## 
## Call:
## lm(formula = fev ~ male + height + I(height^2) + smoke, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75791 -0.22589  0.00649  0.22519  1.97932 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.2954318  1.5486166   3.419 0.000667 ***
## male         0.0669389  0.0340693   1.965 0.049865 *  
## height      -0.2258696  0.0514027  -4.394 1.30e-05 ***
## I(height^2)  0.0029482  0.0004255   6.929 1.02e-11 ***
## smoke       -0.0023111  0.0568151  -0.041 0.967565    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4121 on 649 degrees of freedom
## Multiple R-squared:  0.7755, Adjusted R-squared:  0.7741 
## F-statistic: 560.4 on 4 and 649 DF,  p-value: < 2.2e-16

Smoke is no longer significant (in the presence of male and height).

Are males more likely to smoke? Yes, accoroding to the barchart below.

spineplot(as.factor(fev$smoke) ~ as.factor(fev$male))

Let’s consider age as well. Below we construct an added variable for age. This plot indicates age is a useful predictor in addition to smoke, male and height.

e2 <- residuals(fit5)
e1 <- residuals(lm(age~male + height + I(height^2) + smoke, data = fev))
plot(e1, e2)
abline(lm(e2~e1), col = 2)

Earlier on we fitted a model using age, smoke and the interaction (fit2). We will add height to this model.

fit6 <- lm(fev~ age * smoke + I(poly(height,2)), data = fev)
summary(fit6)
## 
## Call:
## lm(formula = fev ~ age * smoke + I(poly(height, 2)), data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70600 -0.24141  0.00623  0.23525  1.81959 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.95061    0.09813  19.878  < 2e-16 ***
## age                  0.07068    0.00994   7.111 3.06e-12 ***
## smoke                0.14787    0.31663   0.467    0.641    
## I(poly(height, 2))1 15.38342    0.68327  22.514  < 2e-16 ***
## I(poly(height, 2))2  3.45231    0.40066   8.616  < 2e-16 ***
## age:smoke           -0.02264    0.02368  -0.956    0.340    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 648 degrees of freedom
## Multiple R-squared:  0.7916, Adjusted R-squared:   0.79 
## F-statistic: 492.2 on 5 and 648 DF,  p-value: < 2.2e-16

Whereas there is a significant interaction beween age and smoking in fit2, this is no longer significant in the presence of height.

Refit the model without the interaction.

fit7 <- lm(fev~ age + smoke + I(poly(height,2)), data = fev)
summary(fit7)
## 
## Call:
## lm(formula = fev ~ age + smoke + I(poly(height, 2)), data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70075 -0.23454  0.00381  0.23419  1.82989 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.985793   0.090957  21.832  < 2e-16 ***
## age                  0.067059   0.009189   7.298 8.58e-13 ***
## smoke               -0.149820   0.057100  -2.624   0.0089 ** 
## I(poly(height, 2))1 15.568286   0.655280  23.758  < 2e-16 ***
## I(poly(height, 2))2  3.422936   0.399457   8.569  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 649 degrees of freedom
## Multiple R-squared:  0.7913, Adjusted R-squared:   0.79 
## F-statistic: 615.1 on 4 and 649 DF,  p-value: < 2.2e-16
plot(fit7, which = 1, col = fev$male +1)

There is an increasing variance in the residuals (although improved from fit2). The points are coloured by gender (male = red). We will include this variable in our next model.

fit8 <- lm(fev~  male + height + I(height^2) + smoke + age, data = fev)
summary(fit8)
## 
## Call:
## lm(formula = fev ~ male + height + I(height^2) + smoke + age, 
##     data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61264 -0.22793  0.00617  0.22435  1.80390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.8630488  1.4990763   4.578 5.62e-06 ***
## male         0.0945539  0.0328676   2.877  0.00415 ** 
## height      -0.2732761  0.0496785  -5.501 5.44e-08 ***
## I(height^2)  0.0031165  0.0004086   7.628 8.54e-14 ***
## smoke       -0.1325347  0.0570996  -2.321  0.02059 *  
## age          0.0699792  0.0091943   7.611 9.62e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3951 on 648 degrees of freedom
## Multiple R-squared:  0.7939, Adjusted R-squared:  0.7923 
## F-statistic: 499.2 on 5 and 648 DF,  p-value: < 2.2e-16

Compare fit8 to fit5 - adding age makes smoke significant. We might choose this as our favourite model.

Let’s consider one more model - include interaction male \(\times\) height.

fit9 <- lm(fev~ age + smoke + I(height^2) + male * height, data = fev)
summary(fit9)
## 
## Call:
## lm(formula = fev ~ age + smoke + I(height^2) + male * height, 
##     data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44593 -0.23008  0.01161  0.24121  1.77778 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.8500905  1.5391982   3.801 0.000158 ***
## age          0.0715987  0.0091708   7.807 2.36e-14 ***
## smoke       -0.1123052  0.0573296  -1.959 0.050549 .  
## I(height^2)  0.0026440  0.0004433   5.965 4.03e-09 ***
## male        -0.9200542  0.3803088  -2.419 0.015828 *  
## height      -0.2281294  0.0522391  -4.367 1.47e-05 ***
## male:height  0.0168668  0.0062988   2.678 0.007599 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3933 on 647 degrees of freedom
## Multiple R-squared:  0.7962, Adjusted R-squared:  0.7943 
## F-statistic: 421.2 on 6 and 647 DF,  p-value: < 2.2e-16

Smoke is no longers significant in the presence of all the other predictors.

plot(fit9, which = 1, col = fev$male +1)

We still observe increasing variance in the residuals.

Use R package condvis to visualize the models (see https://www.jstatsoft.org/article/view/v081i05)

library(condvis)

models <- list(fit5 = fit5, fit8 = fit8, fit9 = fit9)
ceplot(data = fev, model = models, sectionvars = c("height"), threshold = 2, type = "shiny")


models <- list(fit8 = fit8)
ceplot(data = fev, model = models, sectionvars = c("age", "height"), view3d = TRUE, threshold = 2, type = "shiny")