Note: This page was written using R Markdown. The .Rmd file is available here.
Consider the dataset available from Noland and Speed's Stat Lab: Mathematical Statistics through applications (2000) which registers the weight of 1236 newborn babies, and additional covariates for their mothers.
The objective in this study is to measure the effect of smoking on the newborn weights. The researchers need to isolate its effect, by controlling other covariates such as the mothers' weight and height. This can be done by using a multivariate regression model, for example, by considering that the weight \( Y_i \) can be modeled by
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_p x_{i,p} + \varepsilon_i \]
rm(list = ls())
options(width = 80)
babies <- read.table("http://www.stat.wisc.edu/~gvludwig/327-5/babies.data",
header = TRUE)
str(babies)
## 'data.frame': 1236 obs. of 7 variables:
## $ bwt : int 120 113 128 123 108 136 138 132 120 143 ...
## $ gestation: int 284 282 279 999 282 286 244 245 289 299 ...
## $ parity : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 27 33 28 36 23 25 33 23 25 30 ...
## $ height : int 62 64 64 69 67 62 62 65 62 66 ...
## $ weight : int 100 135 115 190 125 93 178 140 125 136 ...
## $ smoke : int 0 0 1 0 1 0 0 0 0 1 ...
A description of the dataset is as follows:
bwt is the response variable, with the newborn weights in ounces. The dataset uses 999 as a missing value.gestation is the length of the pregnancy, in days. 999 is the code for missing values.parity uses 0 for first borns, 1 otherwise, and 9 for missing values.age is the mother's age, an integer. 99 is a missing value.height is the mother's height, in inches. 99 is a missing value.weight is the mother's weight, in pounds. 999 is a missing value.smoke is a categorical variable, for whether the mother smokes (1) or not now (0). 9 is a missing value.The investigators of this problem wanted to assert the following:
We will focus on the second assertion.
Notice from the str() command that all variables are stored as integer. We need to tell R that parity and smoke are not to be treated as such. Before proceeding, I am going to convert the missing values into NAs, which are the correct representation of missing values in R.
babies$bwt[babies$bwt == 999] <- NA
babies$gestation[babies$gestation == 999] <- NA
babies$parity[babies$parity == 9] <- NA
babies$age[babies$age == 99] <- NA
babies$height[babies$height == 99] <- NA
babies$weight[babies$weight == 999] <- NA
babies$smoke[babies$smoke == 9] <- NA
# How many observations are missing?
count.na <- function(x) {
return(sum(is.na(x)))
}
sapply(babies, count.na)
## bwt gestation parity age height weight smoke
## 0 13 0 2 22 36 10
Whenever you use a function in R, keep in mind that it may or may not have an na-action by default. For example, the mean() function does not, and simply returns NA when an argument with missing values is passed to it:
sapply(babies, mean)
## bwt gestation parity age height weight smoke
## 119.5769 NA 0.2549 NA NA NA NA
You can correct it by checking the mean() function help and see if there's a flag to deal with NAs. It does, by taking an argument na.rm=TRUE, which removes the NAs.
sapply(babies, mean, na.rm = TRUE)
## bwt gestation parity age height weight smoke
## 119.5769 279.3385 0.2549 27.2553 64.0478 128.6258 0.3948
On the other hand, summary() does remove NAs by default, and furthermore prints the number of NAs found, which makes it a preferable choice when summarizing data.
summary(babies)
## bwt gestation parity age height
## Min. : 55 Min. :148 Min. :0.000 Min. :15.0 Min. :53
## 1st Qu.:109 1st Qu.:272 1st Qu.:0.000 1st Qu.:23.0 1st Qu.:62
## Median :120 Median :280 Median :0.000 Median :26.0 Median :64
## Mean :120 Mean :279 Mean :0.255 Mean :27.3 Mean :64
## 3rd Qu.:131 3rd Qu.:288 3rd Qu.:1.000 3rd Qu.:31.0 3rd Qu.:66
## Max. :176 Max. :353 Max. :1.000 Max. :45.0 Max. :72
## NA's :13 NA's :2 NA's :22
## weight smoke
## Min. : 87 Min. :0.000
## 1st Qu.:115 1st Qu.:0.000
## Median :125 Median :0.000
## Mean :129 Mean :0.395
## 3rd Qu.:139 3rd Qu.:1.000
## Max. :250 Max. :1.000
## NA's :36 NA's :10
We can see that converting the factors shows a different summary, as the summary() action changes according to the variable type:
babies$parity <- factor(babies$parity, levels = c(0, 1), labels = c("NotFirstborn",
"Firstborn"))
babies$smoke <- factor(babies$smoke, levels = c(0, 1), labels = c("NotSmoke", "Smoke"))
summary(babies)
## bwt gestation parity age height
## Min. : 55 Min. :148 NotFirstborn:921 Min. :15.0 Min. :53
## 1st Qu.:109 1st Qu.:272 Firstborn :315 1st Qu.:23.0 1st Qu.:62
## Median :120 Median :280 Median :26.0 Median :64
## Mean :120 Mean :279 Mean :27.3 Mean :64
## 3rd Qu.:131 3rd Qu.:288 3rd Qu.:31.0 3rd Qu.:66
## Max. :176 Max. :353 Max. :45.0 Max. :72
## NA's :13 NA's :2 NA's :22
## weight smoke
## Min. : 87 NotSmoke:742
## 1st Qu.:115 Smoke :484
## Median :125 NA's : 10
## Mean :129
## 3rd Qu.:139
## Max. :250
## NA's :36
Plotting the data is the first action you should take. Visual inspection let you anticipate problems such as model formulation, existence of outliers, the need for transformations, etc. I am going to use the lattice package to plot it, since its greatest strength is dealing with multivariate data.
require(lattice)
xyplot(bwt ~ gestation, data = babies, main = "Baby weight vs. Gestation period")
xyplot(bwt ~ gestation | smoke, data = babies, main = "Baby weight vs. Gestation period",
auto.key = TRUE)
xyplot(bwt ~ gestation | smoke * parity, data = babies, main = "Baby weight vs. Gestation period",
auto.key = TRUE)
xyplot(bwt ~ height | smoke * parity, data = babies, main = "Baby weight vs. Mother's height",
auto.key = TRUE)
xyplot(bwt ~ weight | smoke * parity, data = babies, main = "Baby weight vs. Mother's weight",
auto.key = TRUE)
xyplot(bwt ~ age | smoke * parity, data = babies, main = "Baby weight vs. Mother's age",
auto.key = TRUE)
xyplot(gestation ~ age | smoke * parity, data = babies, main = "Gestation period vs. Mother's age",
auto.key = TRUE)
splom(~babies[c(1, 2, 4, 5, 6)] | parity, groups = smoke, data = babies, key = list(columns = 2,
points = list(pch = ".")), main = "Scatterplot Matrix", type = c("p", "smooth"),
pch = ".")
To fit a multivariate regression model, we use the command lm(). It uses formulas as an argument. A common formula syntax is Y ~ . which does Y ~ x1 + x2 + ... + xp, for all available covariates in a data.frame
model <- lm(bwt ~ ., data = babies)
As we learned in the first five weeks, this is the summary for the fit:
summary(model)
##
## Call:
## lm(formula = bwt ~ ., data = babies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.61 -10.19 -0.13 9.68 51.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.41085 14.34657 -5.60 2.6e-08 ***
## gestation 0.44398 0.02910 15.26 < 2e-16 ***
## parityFirstborn -3.32720 1.12895 -2.95 0.0033 **
## age -0.00895 0.08582 -0.10 0.9170
## height 1.15402 0.20502 5.63 2.3e-08 ***
## weight 0.05017 0.02524 1.99 0.0471 *
## smokeSmoke -8.40073 0.95382 -8.81 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.8 on 1167 degrees of freedom
## (62 observations deleted due to missingness)
## Multiple R-squared: 0.258, Adjusted R-squared: 0.254
## F-statistic: 67.6 on 6 and 1167 DF, p-value: <2e-16
Notice R default action is to remove the lines with missing information, for any covariate, even though the baby weight is available for all entries. How to interpret these coefficients, though?
To verify the assumptions, R has a built-in plotting scheme for lm():
layout(matrix(1:4, ncol = 2))
plot(model)
The curvature in the residuals suggest that some transformation is in order. I checked the splom() matrix and decided to try taking the log of bwt for a better fit (in comparison with gestation)
model.log <- lm(log(bwt) ~ ., data = babies)
summary(model.log)
##
## Call:
## lm(formula = log(bwt) ~ ., data = babies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5806 -0.0777 0.0067 0.0864 0.4640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.008783 0.125253 24.02 < 2e-16 ***
## gestation 0.004104 0.000254 16.16 < 2e-16 ***
## parityFirstborn -0.028618 0.009856 -2.90 0.0038 **
## age -0.000369 0.000749 -0.49 0.6228
## height 0.009562 0.001790 5.34 1.1e-07 ***
## weight 0.000390 0.000220 1.77 0.0773 .
## smokeSmoke -0.073386 0.008327 -8.81 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.138 on 1167 degrees of freedom
## (62 observations deleted due to missingness)
## Multiple R-squared: 0.268, Adjusted R-squared: 0.264
## F-statistic: 71.2 on 6 and 1167 DF, p-value: <2e-16
layout(matrix(1:4, ncol = 2))
plot(model.log)
The improvement is minimal, so I would keep the linear model for simplicity. Adding a quadratic term to gestation might work. However gestation^2 does not work in R. Formulas usually save ^ as an interaction shortcut, so (gestation + smoke)^2 is the same as gestation*smoke or gestation + smoke + gestation:smoke. You have to force quadratic terms with the I() function.
model.gest2 <- lm(bwt ~ . + I(gestation^2), data = babies)
summary(model.gest2)
##
## Call:
## lm(formula = bwt ~ . + I(gestation^2), data = babies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.01 -10.24 -0.31 9.62 67.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.75e+02 4.94e+01 -3.54 0.00042 ***
## gestation 1.14e+00 3.48e-01 3.26 0.00113 **
## parityFirstborn -3.38e+00 1.13e+00 -3.00 0.00279 **
## age -1.43e-02 8.58e-02 -0.17 0.86739
## height 1.16e+00 2.05e-01 5.68 1.7e-08 ***
## weight 5.12e-02 2.52e-02 2.03 0.04247 *
## smokeSmoke -8.49e+00 9.54e-01 -8.90 < 2e-16 ***
## I(gestation^2) -1.27e-03 6.37e-04 -2.00 0.04627 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.8 on 1166 degrees of freedom
## (62 observations deleted due to missingness)
## Multiple R-squared: 0.26, Adjusted R-squared: 0.256
## F-statistic: 58.7 on 7 and 1166 DF, p-value: <2e-16
layout(matrix(1:4, ncol = 2))
plot(model.gest2)
The improvement is again minimal, but it does now show observation 261 as an outlier. What is wrong with that observation?
babies[261, ]
## bwt gestation parity age height weight smoke
## 261 116 148 NotFirstborn 28 66 135 NotSmoke
We can see that while the mother has very reasonable height, age, etc.; this baby is exceptionally premature. Therefore, it is no harm to take him/her out of the model.
model.gest2 <- lm(bwt ~ . + I(gestation^2), data = babies[-261, ])
summary(model.gest2)
##
## Call:
## lm(formula = bwt ~ . + I(gestation^2), data = babies[-261, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.38 -10.13 -0.17 9.25 68.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.00e+02 6.28e+01 -6.37 2.6e-10 ***
## gestation 2.76e+00 4.46e-01 6.18 8.7e-10 ***
## parityFirstborn -3.46e+00 1.11e+00 -3.11 0.0019 **
## age -1.90e-02 8.46e-02 -0.22 0.8220
## height 1.13e+00 2.02e-01 5.58 2.9e-08 ***
## weight 5.44e-02 2.49e-02 2.18 0.0291 *
## smokeSmoke -8.41e+00 9.41e-01 -8.93 < 2e-16 ***
## I(gestation^2) -4.15e-03 8.07e-04 -5.15 3.1e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.6 on 1165 degrees of freedom
## (62 observations deleted due to missingness)
## Multiple R-squared: 0.281, Adjusted R-squared: 0.276
## F-statistic: 64.9 on 7 and 1165 DF, p-value: <2e-16
layout(matrix(1:4, ncol = 2))
plot(model.gest2)
There is an improvement in fit, but now baby 870 shows up as an outlier… this can go on until we are all happy, but I will stop now. What other transformations would you do? Would it be better to interact smoking and gestation? Would it be better to interact age with parity, or does parity have all information that age can contribute to the model? And when do we stop?