Description of Data

The Child Health and Development Studies investigate a range of topics. One study, in particular, considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area.

case: id number

bwt: birthweight, in ounces

gestation: length of gestation, in days

parity: binary indicator for a first pregnancy (0 = first pregnancy)

age: mother’s age in years

height: mother’s height in inches

weight: mother’s weight in pounds

smoke: binary indicator for whether the mother smokes

library(skimr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(ggplot2)

Data Overview

baby<-openintro::babies
head(baby)
## # A tibble: 6 × 8
##    case   bwt gestation parity   age height weight smoke
##   <int> <int>     <int>  <int> <int>  <int>  <int> <int>
## 1     1   120       284      0    27     62    100     0
## 2     2   113       282      0    33     64    135     0
## 3     3   128       279      0    28     64    115     1
## 4     4   123        NA      0    36     69    190     0
## 5     5   108       282      0    23     67    125     1
## 6     6   136       286      0    25     62     93     0
summary(baby)
##       case             bwt          gestation         parity      
##  Min.   :   1.0   Min.   : 55.0   Min.   :148.0   Min.   :0.0000  
##  1st Qu.: 309.8   1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000  
##  Median : 618.5   Median :120.0   Median :280.0   Median :0.0000  
##  Mean   : 618.5   Mean   :119.6   Mean   :279.3   Mean   :0.2549  
##  3rd Qu.: 927.2   3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000  
##  Max.   :1236.0   Max.   :176.0   Max.   :353.0   Max.   :1.0000  
##                                   NA's   :13                      
##       age            height          weight          smoke       
##  Min.   :15.00   Min.   :53.00   Min.   : 87.0   Min.   :0.0000  
##  1st Qu.:23.00   1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000  
##  Median :26.00   Median :64.00   Median :125.0   Median :0.0000  
##  Mean   :27.26   Mean   :64.05   Mean   :128.6   Mean   :0.3948  
##  3rd Qu.:31.00   3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000  
##  Max.   :45.00   Max.   :72.00   Max.   :250.0   Max.   :1.0000  
##  NA's   :2       NA's   :22      NA's   :36      NA's   :10
skim(baby)
Data summary
Name                     baby
Number of rows           1236
Number of columns        8
Column type frequency:
  numeric                8
Group variables          None

Variable type: numeric

skim_variable  n_missing  complete_rate    mean      sd   p0     p25    p50     p75  p100  hist
case                   0           1.00  618.50  356.95    1  309.75  618.5  927.25  1236  ▇▇▇▇▇
bwt                    0           1.00  119.58   18.24   55  108.75  120.0  131.00   176  ▁▂▇▅▁
gestation             13           0.99  279.34   16.03  148  272.00  280.0  288.00   353  ▁▁▂▇▁
parity                 0           1.00    0.25    0.44    0    0.00    0.0    1.00     1  ▇▁▁▁▃
age                    2           1.00   27.26    5.78   15   23.00   26.0   31.00    45  ▃▇▅▂▁
height                22           0.98   64.05    2.53   53   62.00   64.0   66.00    72  ▁▁▇▇▁
weight                36           0.97  128.63   20.97   87  114.75  125.0  139.00   250  ▅▇▁▁▁
smoke                 10           0.99    0.39    0.49    0    0.00    0.0    1.00     1  ▇▁▁▁▅

There are 13 NAs in gestation, 2 NAs in age, 22 NAs in height, 36 NAs in weight, and 10 NAs in smoke. In addition, parity and smoke are stored as integers and need to be converted to two-level factors.

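# Mean-impute every column except the binary indicators parity (4) and smoke (8)
Baby <- na.aggregate(baby[, c(-4, -8)])
# Re-attach the binary indicators; parity becomes a factor
baby$parity <- as.factor(baby$parity)
Baby <- Baby %>%
  mutate(parity = baby$parity, smoke = baby$smoke)
# Impute missing smoke values with the median, then convert smoke to a factor
Baby$smoke[is.na(Baby$smoke)] <- median(Baby$smoke, na.rm = TRUE)
Baby$smoke <- as.factor(Baby$smoke)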

The missing values are now imputed, and smoke and parity are stored as two-level factors.
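
A quick check can confirm that a cleaning step like this one leaves no missing values behind. The sketch below repeats the same idea in base R on a small made-up data frame (`toy` and its values are hypothetical, not taken from the babies data). Mean imputation is written out by hand here, which matches what `na.aggregate()` does with its default `FUN = mean`.

```r
# Hypothetical toy data standing in for the babies data set
toy <- data.frame(
  gestation = c(284, NA, 279, 282),
  weight    = c(100, 135, NA, 125),
  smoke     = c(0, 1, NA, 0)
)

# Count missing values per column before cleaning
na_before <- colSums(is.na(toy))

# Mean-impute the continuous columns
num_cols <- c("gestation", "weight")
toy[num_cols] <- lapply(toy[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

# Median-impute the binary indicator, then convert it to a factor
toy$smoke[is.na(toy$smoke)] <- median(toy$smoke, na.rm = TRUE)
toy$smoke <- factor(toy$smoke)

# Count missing values per column after cleaning
na_after <- colSums(is.na(toy))
print(na_before)
print(na_after)
```

Comparing `na_before` with `na_after` makes it easy to see which columns were affected and that none still contain NAs.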

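# Linear regression model
# Note: inside a formula, height^2 means factor crossing and reduces to height;
# a quadratic term would have to be written as I(height^2).
lm <- lm(bwt ~ gestation*age + parity + height^2 + weight+smoke, data=Baby)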

# summary of model
summary(lm)
## 
## Call:
## lm(formula = bwt ~ gestation * age + parity + height^2 + weight + 
##     smoke, data = Baby)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.313 -10.085   0.289   9.677  54.119 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   44.079709  40.548941   1.087  0.27722    
## gestation     -0.003169   0.140914  -0.022  0.98206    
## age           -4.538602   1.416080  -3.205  0.00139 ** 
## parity1       -2.953411   1.102410  -2.679  0.00748 ** 
## height         1.138490   0.199574   5.705 1.46e-08 ***
## weight         0.052944   0.024568   2.155  0.03136 *  
## smoke1        -8.034917   0.927005  -8.668  < 2e-16 ***
## gestation:age  0.016331   0.005069   3.222  0.00131 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.77 on 1228 degrees of freedom
## Multiple R-squared:  0.2568, Adjusted R-squared:  0.2526 
## F-statistic: 60.62 on 7 and 1228 DF,  p-value: < 2.2e-16
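
In R's formula syntax, `^` denotes factor crossing rather than arithmetic, so `height^2` reduces to `height`. This is why the coefficient table above has a single `height` row and no squared term. If a quadratic term was intended, it would need to be wrapped in `I()`. The following is a minimal sketch on simulated data; the data frame `d` and the numbers used to generate it are made up for illustration.

```r
set.seed(1)

# Simulated stand-in data (hypothetical values, not the babies data)
d <- data.frame(height = rnorm(200, mean = 64, sd = 2.5))
d$bwt <- 120 + 1.1 * (d$height - 64) + rnorm(200, sd = 15)

# In a formula, height^2 is interpreted as crossing and collapses to height
fit_plain <- lm(bwt ~ height^2, data = d)

# I() protects the arithmetic, giving a linear and a squared term
fit_quad <- lm(bwt ~ height + I(height^2), data = d)

print(names(coef(fit_plain)))  # "(Intercept)" "height"
print(names(coef(fit_quad)))   # "(Intercept)" "height" "I(height^2)"
```
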
# residual diagnostics: histogram, residuals vs. fitted values, normal Q-Q plot
hist(lm$residuals, main = "Histogram of Residuals")

plot(lm$fitted.values, lm$residuals, 
     xlab="Fitted Values", ylab="Residuals",
     main="Residuals Plot")

qqnorm(lm$residuals)
qqline(lm$residuals)
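
The histogram and Q-Q plot are visual checks. As an optional complement that is not part of the analysis above, a Shapiro-Wilk test gives a numeric summary of how compatible the residuals are with a normal distribution. The sketch below runs it on a model fitted to simulated data (`d` and `fit` are hypothetical names); with large samples the test can flag even small, practically harmless departures, so it is best read alongside the plots.

```r
set.seed(123)

# Simulated stand-in data with normally distributed errors (hypothetical)
d <- data.frame(x = rnorm(500))
d$y <- 3 + 2 * d$x + rnorm(500, sd = 1.5)

fit <- lm(y ~ x, data = d)

# Shapiro-Wilk test on the residuals; the null hypothesis is normality
sw <- shapiro.test(residuals(fit))
print(sw$statistic)
print(sw$p.value)
```

A small p-value would be evidence against normality; a large one means the test found no such evidence.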

Conclusion:

  1. The adjusted R-squared shows that the model explains only 25.26% of the variation in birthweight.

  2. The residual standard error is 15.77 ounces on 1228 degrees of freedom.

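  3. The p-values show that age, parity, height, weight, smoke, and the gestation:age interaction are statistically significant at the 5% level, so the null hypothesis of a zero coefficient is rejected for each of them. The main effect of gestation on its own is not significant (p = 0.98), although gestation enters the model through its significant interaction with age.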

  4. The F-statistic (60.62 on 7 and 1228 degrees of freedom, p < 2.2e-16) shows that at least one of the predictors is related to birthweight.

  5. The histogram and normal Q-Q plot of the residuals suggest that the residuals are approximately normally distributed.

The model is not very useful, because it explains only about a quarter of the variation in birthweight.
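
The figures quoted in a conclusion like this one can be read directly from the model summary object instead of being copied by hand, which avoids transcription slips. The sketch below shows one way to do that on simulated data; `d`, `fit`, and the variable names `x1`, `x2`, `y` are hypothetical.

```r
set.seed(42)

# Simulated stand-in data: x1 has a real effect, x2 has none (hypothetical)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 1.5 * d$x1 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2, data = d)
s <- summary(fit)

adj_r2 <- s$adj.r.squared        # adjusted R-squared
rse    <- s$sigma                # residual standard error
p_vals <- coef(s)[, "Pr(>|t|)"]  # per-coefficient p-values

# Overall F-test p-value from the stored statistic and its degrees of freedom
f <- s$fstatistic
f_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(round(c(adj_r2 = adj_r2, rse = rse), 3))
print(names(p_vals)[p_vals < 0.05])  # terms significant at the 5% level
```

Listing the significant terms programmatically, as in the last line, makes it harder to misreport which coefficients pass the chosen threshold.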