Multivariate Regression: Undestanding R's default output

A data based example: smoking and newborn weights

Note: This page was written using R Markdown. The .Rmd file is available here.

Consider the dataset available from Noland and Speed's Stat Lab: Mathematical Statistics through applications (2000) which registers the weight of 1236 newborn babies, and additional covariates for their mothers.

The objective in this study is to measure the effect of smoking on the newborn weights. The researchers need to isolate its effect, by controlling other covariates such as the mothers' weight and height. This can be done by using a multivariate regression model, for example, by considering that the weight \( Y_i \) can be modeled by

\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_p x_{i,p} + \varepsilon_i \]

rm(list = ls())
options(width = 80)
babies <- read.table("http://www.stat.wisc.edu/~gvludwig/327-5/babies.data", 
    header = TRUE)
str(babies)

## 'data.frame':    1236 obs. of  7 variables:
##  $ bwt      : int  120 113 128 123 108 136 138 132 120 143 ...
##  $ gestation: int  284 282 279 999 282 286 244 245 289 299 ...
##  $ parity   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age      : int  27 33 28 36 23 25 33 23 25 30 ...
##  $ height   : int  62 64 64 69 67 62 62 65 62 66 ...
##  $ weight   : int  100 135 115 190 125 93 178 140 125 136 ...
##  $ smoke    : int  0 0 1 0 1 0 0 0 0 1 ...

A description of the dataset is as follows:

bwt is the response variable, with the newborn weights in ounces. The dataset uses 999 as a missing value.
gestation is the length of the pregnancy, in days. 999 is the code for missing values.
parity uses 0 for first borns, 1 otherwise, and 9 for missing values.
age is the mother's age, an integer. 99 is a missing value.
height is the mother's height, in inches. 99 is a missing value.
weight is the mother's weight, in pounds. 999 is a missing value.
smoke is a categorical variable, for whether the mother smokes (1) or not now (0). 9 is a missing value.

The investigators of this problem wanted to assert the following:

Mothers who smoke have increased rates of premature delivery.
The newborns of smokers are smaller at every gestational age.
Smoking seems to be a more significant determinant of birth weight than the mother's prepregnancy height and weight, parity, history of previous pregnancy outcomes or the infant's sex (these last two covariates are not available).

We will focus on the second assertion.

Notice from the str() command that all variables are stored as integer. We need to tell R that parity and smoke are not to be treated as such. Before proceeding, I am going to convert the missing values into NAs, which are the correct representation of missing values in R.

babies$bwt[babies$bwt == 999] <- NA
babies$gestation[babies$gestation == 999] <- NA
babies$parity[babies$parity == 9] <- NA
babies$age[babies$age == 99] <- NA
babies$height[babies$height == 99] <- NA
babies$weight[babies$weight == 999] <- NA
babies$smoke[babies$smoke == 9] <- NA

# How many observations are missing?
count.na <- function(x) {
    return(sum(is.na(x)))
}
sapply(babies, count.na)

##       bwt gestation    parity       age    height    weight     smoke 
##         0        13         0         2        22        36        10

Whenever you use a function in R, keep in mind that it may or may not have an na-action by default. For example, the mean() function does not, and simply returns NA when an argument with missing values is passed to it:

sapply(babies, mean)

##       bwt gestation    parity       age    height    weight     smoke 
##  119.5769        NA    0.2549        NA        NA        NA        NA

You can correct it by checking the mean() function help and see if there's a flag to deal with NAs. It does, by taking an argument na.rm=TRUE, which removes the NAs.

sapply(babies, mean, na.rm = TRUE)

##       bwt gestation    parity       age    height    weight     smoke 
##  119.5769  279.3385    0.2549   27.2553   64.0478  128.6258    0.3948

On the other hand, summary() does remove NAs by default, and furthermore prints the number of NAs found, which makes it a preferable choice when summarizing data.

summary(babies)

##       bwt        gestation       parity           age           height  
##  Min.   : 55   Min.   :148   Min.   :0.000   Min.   :15.0   Min.   :53  
##  1st Qu.:109   1st Qu.:272   1st Qu.:0.000   1st Qu.:23.0   1st Qu.:62  
##  Median :120   Median :280   Median :0.000   Median :26.0   Median :64  
##  Mean   :120   Mean   :279   Mean   :0.255   Mean   :27.3   Mean   :64  
##  3rd Qu.:131   3rd Qu.:288   3rd Qu.:1.000   3rd Qu.:31.0   3rd Qu.:66  
##  Max.   :176   Max.   :353   Max.   :1.000   Max.   :45.0   Max.   :72  
##                NA's   :13                    NA's   :2      NA's   :22  
##      weight        smoke      
##  Min.   : 87   Min.   :0.000  
##  1st Qu.:115   1st Qu.:0.000  
##  Median :125   Median :0.000  
##  Mean   :129   Mean   :0.395  
##  3rd Qu.:139   3rd Qu.:1.000  
##  Max.   :250   Max.   :1.000  
##  NA's   :36    NA's   :10

We can see that converting the factors shows a different summary, as the summary() action changes according to the variable type:

babies$parity <- factor(babies$parity, levels = c(0, 1), labels = c("NotFirstborn", 
    "Firstborn"))
babies$smoke <- factor(babies$smoke, levels = c(0, 1), labels = c("NotSmoke", "Smoke"))
summary(babies)

##       bwt        gestation            parity         age           height  
##  Min.   : 55   Min.   :148   NotFirstborn:921   Min.   :15.0   Min.   :53  
##  1st Qu.:109   1st Qu.:272   Firstborn   :315   1st Qu.:23.0   1st Qu.:62  
##  Median :120   Median :280                      Median :26.0   Median :64  
##  Mean   :120   Mean   :279                      Mean   :27.3   Mean   :64  
##  3rd Qu.:131   3rd Qu.:288                      3rd Qu.:31.0   3rd Qu.:66  
##  Max.   :176   Max.   :353                      Max.   :45.0   Max.   :72  
##                NA's   :13                       NA's   :2      NA's   :22  
##      weight         smoke    
##  Min.   : 87   NotSmoke:742  
##  1st Qu.:115   Smoke   :484  
##  Median :125   NA's    : 10  
##  Mean   :129                 
##  3rd Qu.:139                 
##  Max.   :250                 
##  NA's   :36

Plotting the data is the first action you should take. Visual inspection let you anticipate problems such as model formulation, existence of outliers, the need for transformations, etc. I am going to use the lattice package to plot it, since its greatest strength is dealing with multivariate data.

require(lattice)
xyplot(bwt ~ gestation, data = babies, main = "Baby weight vs. Gestation period")

plot of chunk unnamed-chunk-7

xyplot(bwt ~ gestation | smoke, data = babies, main = "Baby weight vs. Gestation period", 
    auto.key = TRUE)

plot of chunk unnamed-chunk-7

xyplot(bwt ~ gestation | smoke * parity, data = babies, main = "Baby weight vs. Gestation period", 
    auto.key = TRUE)

plot of chunk unnamed-chunk-7

xyplot(bwt ~ height | smoke * parity, data = babies, main = "Baby weight vs. Mother's height", 
    auto.key = TRUE)

plot of chunk unnamed-chunk-7

xyplot(bwt ~ weight | smoke * parity, data = babies, main = "Baby weight vs. Mother's weight", 
    auto.key = TRUE)

plot of chunk unnamed-chunk-7

xyplot(bwt ~ age | smoke * parity, data = babies, main = "Baby weight vs. Mother's age", 
    auto.key = TRUE)

plot of chunk unnamed-chunk-7

xyplot(gestation ~ age | smoke * parity, data = babies, main = "Gestation period vs. Mother's age", 
    auto.key = TRUE)

plot of chunk unnamed-chunk-7

splom(~babies[c(1, 2, 4, 5, 6)] | parity, groups = smoke, data = babies, key = list(columns = 2, 
    points = list(pch = ".")), main = "Scatterplot Matrix", type = c("p", "smooth"), 
    pch = ".")

plot of chunk unnamed-chunk-8

To fit a multivariate regression model, we use the command lm(). It uses formulas as an argument. A common formula syntax is Y ~ . which does Y ~ x1 + x2 + ... + xp, for all available covariates in a data.frame

model <- lm(bwt ~ ., data = babies)

As we learned in the first five weeks, this is the summary for the fit:

summary(model)

## 
## Call:
## lm(formula = bwt ~ ., data = babies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57.61 -10.19  -0.13   9.68  51.71 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -80.41085   14.34657   -5.60  2.6e-08 ***
## gestation         0.44398    0.02910   15.26  < 2e-16 ***
## parityFirstborn  -3.32720    1.12895   -2.95   0.0033 ** 
## age              -0.00895    0.08582   -0.10   0.9170    
## height            1.15402    0.20502    5.63  2.3e-08 ***
## weight            0.05017    0.02524    1.99   0.0471 *  
## smokeSmoke       -8.40073    0.95382   -8.81  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 15.8 on 1167 degrees of freedom
##   (62 observations deleted due to missingness)
## Multiple R-squared: 0.258,   Adjusted R-squared: 0.254 
## F-statistic: 67.6 on 6 and 1167 DF,  p-value: <2e-16

Notice R default action is to remove the lines with missing information, for any covariate, even though the baby weight is available for all entries. How to interpret these coefficients, though?

If the \( j \) covariate, \( \mathbf{x}_j \), takes real values, then the value of the coefficient \( \beta_j \) is the average effect on \( Y_i \) of incrementing \( x_{ij} \) in 1 unit, while the other covariates are held equal.
If the \( j \) covariate, \( \mathbf{x}_j \), is categorical, then the value of the coefficient \( \beta_j \) is the increment effect on ther average of \( Y_i \) from the reference category to the specified level, while the other covariates are held equal. The reference category average is the intercept (or reference categories, if there's more than one categorical covariate in the model).

To verify the assumptions, R has a built-in plotting scheme for lm():

layout(matrix(1:4, ncol = 2))
plot(model)

plot of chunk unnamed-chunk-11

The curvature in the residuals suggest that some transformation is in order. I checked the splom() matrix and decided to try taking the log of bwt for a better fit (in comparison with gestation)

model.log <- lm(log(bwt) ~ ., data = babies)
summary(model.log)

## 
## Call:
## lm(formula = log(bwt) ~ ., data = babies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5806 -0.0777  0.0067  0.0864  0.4640 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.008783   0.125253   24.02  < 2e-16 ***
## gestation        0.004104   0.000254   16.16  < 2e-16 ***
## parityFirstborn -0.028618   0.009856   -2.90   0.0038 ** 
## age             -0.000369   0.000749   -0.49   0.6228    
## height           0.009562   0.001790    5.34  1.1e-07 ***
## weight           0.000390   0.000220    1.77   0.0773 .  
## smokeSmoke      -0.073386   0.008327   -8.81  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.138 on 1167 degrees of freedom
##   (62 observations deleted due to missingness)
## Multiple R-squared: 0.268,   Adjusted R-squared: 0.264 
## F-statistic: 71.2 on 6 and 1167 DF,  p-value: <2e-16

layout(matrix(1:4, ncol = 2))
plot(model.log)

plot of chunk unnamed-chunk-12

The improvement is minimal, so I would keep the linear model for simplicity. Adding a quadratic term to gestation might work. However gestation^2 does not work in R. Formulas usually save ^ as an interaction shortcut, so (gestation + smoke)^2 is the same as gestation*smoke or gestation + smoke + gestation:smoke. You have to force quadratic terms with the I() function.

model.gest2 <- lm(bwt ~ . + I(gestation^2), data = babies)
summary(model.gest2)

## 
## Call:
## lm(formula = bwt ~ . + I(gestation^2), data = babies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58.01 -10.24  -0.31   9.62  67.16 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.75e+02   4.94e+01   -3.54  0.00042 ***
## gestation        1.14e+00   3.48e-01    3.26  0.00113 ** 
## parityFirstborn -3.38e+00   1.13e+00   -3.00  0.00279 ** 
## age             -1.43e-02   8.58e-02   -0.17  0.86739    
## height           1.16e+00   2.05e-01    5.68  1.7e-08 ***
## weight           5.12e-02   2.52e-02    2.03  0.04247 *  
## smokeSmoke      -8.49e+00   9.54e-01   -8.90  < 2e-16 ***
## I(gestation^2)  -1.27e-03   6.37e-04   -2.00  0.04627 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 15.8 on 1166 degrees of freedom
##   (62 observations deleted due to missingness)
## Multiple R-squared: 0.26,    Adjusted R-squared: 0.256 
## F-statistic: 58.7 on 7 and 1166 DF,  p-value: <2e-16

layout(matrix(1:4, ncol = 2))
plot(model.gest2)

plot of chunk unnamed-chunk-13

The improvement is again minimal, but it does now show observation 261 as an outlier. What is wrong with that observation?

babies[261, ]

##     bwt gestation       parity age height weight    smoke
## 261 116       148 NotFirstborn  28     66    135 NotSmoke

We can see that while the mother has very reasonable height, age, etc.; this baby is exceptionally premature. Therefore, it is no harm to take him/her out of the model.

model.gest2 <- lm(bwt ~ . + I(gestation^2), data = babies[-261, ])
summary(model.gest2)

## 
## Call:
## lm(formula = bwt ~ . + I(gestation^2), data = babies[-261, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58.38 -10.13  -0.17   9.25  68.23 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.00e+02   6.28e+01   -6.37  2.6e-10 ***
## gestation        2.76e+00   4.46e-01    6.18  8.7e-10 ***
## parityFirstborn -3.46e+00   1.11e+00   -3.11   0.0019 ** 
## age             -1.90e-02   8.46e-02   -0.22   0.8220    
## height           1.13e+00   2.02e-01    5.58  2.9e-08 ***
## weight           5.44e-02   2.49e-02    2.18   0.0291 *  
## smokeSmoke      -8.41e+00   9.41e-01   -8.93  < 2e-16 ***
## I(gestation^2)  -4.15e-03   8.07e-04   -5.15  3.1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 15.6 on 1165 degrees of freedom
##   (62 observations deleted due to missingness)
## Multiple R-squared: 0.281,   Adjusted R-squared: 0.276 
## F-statistic: 64.9 on 7 and 1165 DF,  p-value: <2e-16

layout(matrix(1:4, ncol = 2))
plot(model.gest2)

plot of chunk unnamed-chunk-15

There is an improvement in fit, but now baby 870 shows up as an outlier… this can go on until we are all happy, but I will stop now. What other transformations would you do? Would it be better to interact smoking and gestation? Would it be better to interact age with parity, or does parity have all information that age can contribute to the model? And when do we stop?