The Child Health and Development Studies investigate a range of topics. One study, in particular, considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area.
case: id number
bwt: birthweight, in ounces
gestation: length of gestation, in days
parity: binary indicator for a first pregnancy (0 = first pregnancy)
age: mother’s age in years
height: mother’s height in inches
weight: mother’s weight in pounds
smoke: binary indicator for whether the mother smokes
library(skimr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(ggplot2)
baby<-openintro::babies
head(baby)
## # A tibble: 6 × 8
## case bwt gestation parity age height weight smoke
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 120 284 0 27 62 100 0
## 2 2 113 282 0 33 64 135 0
## 3 3 128 279 0 28 64 115 1
## 4 4 123 NA 0 36 69 190 0
## 5 5 108 282 0 23 67 125 1
## 6 6 136 286 0 25 62 93 0
summary(baby)
## case bwt gestation parity
## Min. : 1.0 Min. : 55.0 Min. :148.0 Min. :0.0000
## 1st Qu.: 309.8 1st Qu.:108.8 1st Qu.:272.0 1st Qu.:0.0000
## Median : 618.5 Median :120.0 Median :280.0 Median :0.0000
## Mean : 618.5 Mean :119.6 Mean :279.3 Mean :0.2549
## 3rd Qu.: 927.2 3rd Qu.:131.0 3rd Qu.:288.0 3rd Qu.:1.0000
## Max. :1236.0 Max. :176.0 Max. :353.0 Max. :1.0000
## NA's :13
## age height weight smoke
## Min. :15.00 Min. :53.00 Min. : 87.0 Min. :0.0000
## 1st Qu.:23.00 1st Qu.:62.00 1st Qu.:114.8 1st Qu.:0.0000
## Median :26.00 Median :64.00 Median :125.0 Median :0.0000
## Mean :27.26 Mean :64.05 Mean :128.6 Mean :0.3948
## 3rd Qu.:31.00 3rd Qu.:66.00 3rd Qu.:139.0 3rd Qu.:1.0000
## Max. :45.00 Max. :72.00 Max. :250.0 Max. :1.0000
## NA's :2 NA's :22 NA's :36 NA's :10
skim(baby)
| Name | baby |
| Number of rows | 1236 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| case | 0 | 1.00 | 618.50 | 356.95 | 1 | 309.75 | 618.5 | 927.25 | 1236 | ▇▇▇▇▇ |
| bwt | 0 | 1.00 | 119.58 | 18.24 | 55 | 108.75 | 120.0 | 131.00 | 176 | ▁▂▇▅▁ |
| gestation | 13 | 0.99 | 279.34 | 16.03 | 148 | 272.00 | 280.0 | 288.00 | 353 | ▁▁▂▇▁ |
| parity | 0 | 1.00 | 0.25 | 0.44 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▃ |
| age | 2 | 1.00 | 27.26 | 5.78 | 15 | 23.00 | 26.0 | 31.00 | 45 | ▃▇▅▂▁ |
| height | 22 | 0.98 | 64.05 | 2.53 | 53 | 62.00 | 64.0 | 66.00 | 72 | ▁▁▇▇▁ |
| weight | 36 | 0.97 | 128.63 | 20.97 | 87 | 114.75 | 125.0 | 139.00 | 250 | ▅▇▁▁▁ |
| smoke | 10 | 0.99 | 0.39 | 0.49 | 0 | 0.00 | 0.0 | 1.00 | 1 | ▇▁▁▁▅ |
There are 13 NAs in gestation, 2 in age, 22 in height, 36 in weight, and 10 in smoke. Also, parity and smoke need to be converted to factors so that they are treated as binary (categorical) variables rather than as numbers.
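# Impute NAs in the numeric columns with the column mean (na.aggregate's
# default), leaving out parity (column 4) and smoke (column 8)
Baby <- na.aggregate(baby[, c(-4, -8)])
# Treat parity as a categorical (binary) variable
baby$parity <- as.factor(baby$parity)
# Add parity and smoke back to the imputed data
Baby <- Baby %>%
  mutate(parity = baby$parity, smoke = baby$smoke)
# Fill missing smoke values with the median (0 here), then convert to a factor
Baby$smoke[is.na(Baby$smoke)] <- median(Baby$smoke, na.rm = TRUE)
Baby$smoke <- as.factor(Baby$smoke)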
Now the NAs have been filled in, and smoke and parity have been converted to factors.
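The imputation step above can be illustrated on a small example. The following is a minimal sketch in base R, assuming a hypothetical data frame `toy` with made-up values (it is not the babies data and does not use `zoo`); it fills numeric columns with the column mean, fills a 0/1 indicator with the median, and then confirms that no NAs remain.

```r
# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  gestation = c(284, NA, 279, 282),
  weight    = c(100, 135, NA, 125),
  smoke     = c(0, 1, NA, 0)
)

# Mean imputation for the continuous columns
num_cols <- c("gestation", "weight")
for (col in num_cols) {
  col_mean <- mean(toy[[col]], na.rm = TRUE)
  toy[[col]][is.na(toy[[col]])] <- col_mean
}

# Median imputation for the 0/1 indicator, then conversion to a factor
toy$smoke[is.na(toy$smoke)] <- median(toy$smoke, na.rm = TRUE)
toy$smoke <- factor(toy$smoke, levels = c(0, 1))

# Count remaining NAs per column (all zeros after imputation)
colSums(is.na(toy))
```

One caveat with the median approach: with an even number of observed 0/1 values split equally, the median is 0.5, which is not a valid category, so it is worth checking the imputed value before converting to a factor.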
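# Linear regression model
# Note: inside a formula, height^2 is read as plain height (hence the single
# height coefficient below); a squared term would need I(height^2)
lm <- lm(bwt ~ gestation*age + parity + height^2 + weight+smoke, data=Baby)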
# summary of model
summary(lm)
##
## Call:
## lm(formula = bwt ~ gestation * age + parity + height^2 + weight +
## smoke, data = Baby)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.313 -10.085 0.289 9.677 54.119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.079709 40.548941 1.087 0.27722
## gestation -0.003169 0.140914 -0.022 0.98206
## age -4.538602 1.416080 -3.205 0.00139 **
## parity1 -2.953411 1.102410 -2.679 0.00748 **
## height 1.138490 0.199574 5.705 1.46e-08 ***
## weight 0.052944 0.024568 2.155 0.03136 *
## smoke1 -8.034917 0.927005 -8.668 < 2e-16 ***
## gestation:age 0.016331 0.005069 3.222 0.00131 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.77 on 1228 degrees of freedom
## Multiple R-squared: 0.2568, Adjusted R-squared: 0.2526
## F-statistic: 60.62 on 7 and 1228 DF, p-value: < 2.2e-16
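The coefficient table lists `height` rather than a squared term because `^` is a formula operator in R, not arithmetic, when it appears inside a model formula. The sketch below, using simulated data with hypothetical names `x` and `y`, shows the difference between `x^2` and `I(x^2)`.

```r
set.seed(1)
x <- runif(50, 60, 70)            # hypothetical predictor
y <- 2 + 0.01 * x^2 + rnorm(50)   # hypothetical response

fit_plain <- lm(y ~ x^2)      # formula operator: equivalent to y ~ x
fit_quad  <- lm(y ~ I(x^2))   # I() protects the arithmetic square

names(coef(fit_plain))  # "(Intercept)" "x"
names(coef(fit_quad))   # "(Intercept)" "I(x^2)"
```

If a quadratic effect of height were intended, `I(height^2)` (possibly alongside `height`) would be one way to request it; whether that improves the fit is not something the output above can tell us.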
# Residual diagnostics: histogram, residuals vs. fitted values, normal Q-Q plot
hist(lm$residuals, main = "Histogram of residuals")
plot(lm$fitted.values, lm$residuals,
xlab="Fitted Values", ylab="Residuals",
main="Residuals Plot")
qqnorm(lm$residuals)
qqline(lm$residuals)
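The plots above give a visual impression of normality. A formal test can complement them; the sketch below applies the Shapiro-Wilk test to the residuals of a model fitted to simulated data (the names `x`, `y`, and `fit` are hypothetical, and this test is not part of the original analysis).

```r
set.seed(42)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)   # simulated response with normal errors
fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
sw$statistic
sw$p.value
```

A small p-value would be evidence against normality. With samples as large as the 1236 rows here, the test can flag departures too small to matter in practice, so it is best read together with the Q-Q plot rather than instead of it.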
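Conclusion:
The adjusted R-squared shows that the model explains only 25.26% of the variation in birthweight.
The residual standard error is 15.77 ounces on 1228 degrees of freedom.
The p-values show that age, parity, height, weight, smoke, and the gestation:age interaction are statistically significant at the 5% level, so the null hypothesis of a zero coefficient is rejected for each of them. The main effect of gestation is not significant on its own (p = 0.98), but because of the interaction it only describes the effect of gestation at age 0.
The F-statistic (60.62 on 7 and 1228 degrees of freedom, p < 2.2e-16) shows that at least one predictor is related to birthweight.
The histogram and Q-Q plot of the residuals suggest that the residuals are approximately normally distributed.
The model is of limited use because it leaves about three quarters of the variation in birthweight unexplained.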
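Because the model contains a gestation:age interaction, the effect of one extra day of gestation depends on the mother's age: it equals the gestation coefficient plus the interaction coefficient times age. The sketch below works this out from the two estimates reported in the summary output; the ages chosen are arbitrary illustrations.

```r
# Estimates copied from the summary output above
b_gestation   <- -0.003169
b_interaction <-  0.016331

ages <- c(20, 27, 35)   # illustrative ages, chosen arbitrarily

# Estimated change in birthweight (ounces) per extra day of gestation
slope <- b_gestation + b_interaction * ages
data.frame(age = ages, gestation_slope = round(slope, 3))
# gestation_slope: 0.323, 0.438, 0.568
```

So for a mother near the mean age of about 27, each additional day of gestation is associated with roughly 0.44 ounces of extra birthweight under this model, even though the gestation main effect alone is close to zero.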