library(Stat2Data)
library(GGally)
library(ggplot2)
library(dplyr)
data(NCbirths)
summary(NCbirths)
##        ID             Plural           Sex            MomAge
##  Min.   :   1.0   Min.   :1.000   Min.   :1.000   Min.   :13.00
##  1st Qu.: 363.2   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:22.00
##  Median : 725.5   Median :1.000   Median :1.000   Median :26.00
##  Mean   : 725.5   Mean   :1.037   Mean   :1.487   Mean   :26.76
##  3rd Qu.:1087.8   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:31.00
##  Max.   :1450.0   Max.   :3.000   Max.   :2.000   Max.   :43.00
##
##      Weeks          Marital         RaceMom      HispMom      Gained
##  Min.   :22.00   Min.   :1.000   Min.   :1.000   C:   2   Min.   : 0.0
##  1st Qu.:38.00   1st Qu.:1.000   1st Qu.:1.000   M: 128   1st Qu.:20.0
##  Median :39.00   Median :1.000   Median :1.000   N:1283   Median :30.0
##  Mean   :38.62   Mean   :1.345   Mean   :1.831   O:   3   Mean   :30.6
##  3rd Qu.:40.00   3rd Qu.:2.000   3rd Qu.:2.000   P:   9   3rd Qu.:40.0
##  Max.   :45.00   Max.   :2.000   Max.   :8.000   S:  25   Max.   :95.0
##  NA's   :1                                                NA's   :40
##      Smoke        BirthWeightOz   BirthWeightGm         Low
##  Min.   :0.0000   Min.   : 12.0   Min.   : 340.2   Min.   :0.00000
##  1st Qu.:0.0000   1st Qu.:106.0   1st Qu.:3005.1   1st Qu.:0.00000
##  Median :0.0000   Median :118.0   Median :3345.3   Median :0.00000
##  Mean   :0.1446   Mean   :116.2   Mean   :3295.6   Mean   :0.08621
##  3rd Qu.:0.0000   3rd Qu.:130.0   3rd Qu.:3685.5   3rd Qu.:0.00000
##  Max.   :1.0000   Max.   :181.0   Max.   :5131.4   Max.   :1.00000
##  NA's   :5
##      Premie           MomRace
##  Min.   :0.0000   black   :332
##  1st Qu.:0.0000   hispanic:164
##  Median :0.0000   other   : 48
##  Mean   :0.1317   white   :906
##  3rd Qu.:0.0000
##  Max.   :1.0000
##
The overall data summary gives us good insight into the spread of the variables.
* Average weight in grams is around 3300 - good baseline to judge whether a birth is healthy or not.
* The interquartile range (IQR) of birth weights runs from ~3000 to ~3700 g.
* A majority of moms are white (906 of 1450, about 62%).
* ~13% of babies were premies and ~9% were considered “low” weight - more research should be done to determine what a healthy baby weighs.
* The middle 50% of gestation periods fall between 38 and 40 weeks, but the lowest quartile shows far more extreme variation (22 to 38 weeks).
* A large majority of Hispanic moms are Mexican (128 of the 167 moms not coded N in HispMom).
* Plural births are very rare in this data set.
* Most moms are married when they have children.
* Sex is almost evenly split between M/F.
* IQR of moms’ ages runs from 22 to 31.
* ~14% of moms smoke.
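The shares quoted in the list above can be re-derived from the counts that `summary()` prints for the two factor columns. The sketch below is only an illustration: it hard-codes the MomRace and HispMom counts from the output so that it runs without the data set, and the vector names (`mom_race`, `hisp_mom`) are made up for this example.

```r
# Counts copied by hand from the summary() output (not read from the data)
mom_race <- c(black = 332, hispanic = 164, other = 48, white = 906)
hisp_mom <- c(C = 2, M = 128, N = 1283, O = 3, P = 9, S = 25)

# Both factors should account for every row of the data set
stopifnot(sum(mom_race) == 1450, sum(hisp_mom) == 1450)

# Share of each MomRace level, in percent
race_pct <- round(100 * prop.table(mom_race), 1)
print(race_pct)

# Share of code M among moms with any code other than N
hispanic_only <- hisp_mom[names(hisp_mom) != "N"]
mexican_share <- hisp_mom[["M"]] / sum(hispanic_only)
print(round(mexican_share, 3))
```

On the data frame itself, `prop.table(table(NCbirths$MomRace))` should give the same shares.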
ggpairs(NCbirths, columns = c("MomAge","Weeks","Gained","BirthWeightGm"))
Plotting birth weight against other continuous variables reveals:
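* Birth weight has a ~.6 correlation with Weeks, so Weeks would be a good explanatory variable to include in the model.
* Definite collinearity between the two weight variables: BirthWeightOz and BirthWeightGm are the same measurement in different units (their summary values differ by a constant factor of about 28.35 g per oz). To test this, I did the same plot with BirthWeightOz instead and got the exact same correlation coefficients, as expected when a variable is only rescaled. Only one of the two should go into a regression model, which we can confirm when fitting.
* Weak correlation between birth weight and Gained/MomAge - .19 and .15 respectively.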
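The collinearity between the two weight variables can be checked from the summary output alone. This is a rough sketch, assuming the five-number summaries printed earlier are enough to reveal the unit conversion; the values are copied by hand, and `other` is an arbitrary made-up vector used only to show that rescaling leaves a correlation unchanged.

```r
# Five-number summaries copied by hand from the summary() output
oz <- c(min = 12.0, q1 = 106.0, median = 118.0, q3 = 130.0, max = 181.0)
gm <- c(min = 340.2, q1 = 3005.1, median = 3345.3, q3 = 3685.5, max = 5131.4)

# Ratio of grams to ounces at each quantile: a constant means a pure rescaling
ratio <- gm / oz
print(round(ratio, 2))

# A rescaled copy of a variable is perfectly correlated with the original
print(cor(oz, gm))

# Correlation with a third variable is unchanged by rescaling.
# `other` is an arbitrary made-up vector, used only to illustrate the invariance.
other <- c(3, 1, 4, 1, 5)
same_cor <- all.equal(cor(oz, other), cor(oz * 28.35, other))
print(same_cor)
```

On the full data, `with(NCbirths, cor(BirthWeightOz, BirthWeightGm))` would be the direct check.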
ggpairs(NCbirths, columns = c("BirthWeightGm", "Low", "Premie", "Plural", "Smoke"))
For birth weight against categorical variables:
* Premie and Low correlate with each other and have a high negative correlation with birth weight - good predictors of whether a baby will be healthy or not.
* Plural (the number of babies in the birth) correlates with Premie and Low (.33 and .37) - probably good to include in the model, since Premie and Low themselves are good predictors of a baby’s health.
* Smokers are less likely to have multiple babies and more likely to have a low-weight or premature baby.
* Plural and Smoke have negative correlations with birth weight.
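Low, Premie, and Smoke are stored as 0/1 numbers, so the correlations ggpairs shows between them are ordinary Pearson correlations on those codes. For two binary variables, that value equals the phi coefficient of their 2x2 table. The sketch below checks the identity on two small made-up vectors; `low` and `premie` here are hypothetical toy data, not values from NCbirths.

```r
# Hypothetical 0/1 indicators for eight births (toy data, not from NCbirths)
low    <- c(1, 1, 0, 0, 0, 0, 1, 0)
premie <- c(1, 1, 1, 0, 0, 0, 0, 0)

# 2x2 table of counts; rows are low, columns are premie
tab <- table(low, premie)
n11 <- tab["1", "1"]; n00 <- tab["0", "0"]
n10 <- tab["1", "0"]; n01 <- tab["0", "1"]

# Phi coefficient from the cell counts and the row/column totals
phi <- (n11 * n00 - n10 * n01) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

# Pearson correlation on the 0/1 codes gives the same number
pearson <- cor(low, premie)
print(c(phi = phi, pearson = pearson))
```

Reading the coefficients this way makes it clear that they measure how strongly the two indicators co-occur, not the size of any effect on health.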
NAs are present in certain variables of the dataset - 1 in Weeks, 5 in Smoke, 40 in Gained.
We can throw out the rows with missing data (41 in total) because they are so few and have little to no association with the other variables. To test association, I used the tally method detailed in Exercise 2 of Lab 7 on the variable with the greatest number of missing values, Gained.
NCbirths$gained_mis <- factor(is.na(NCbirths$Gained))
mosaic::tally(~Sex|gained_mis, data=NCbirths)
##    gained_mis
## Sex FALSE TRUE
##   1   721   23
##   2   689   17
mosaic::tally(~Plural|gained_mis, data=NCbirths)
##       gained_mis
## Plural FALSE TRUE
##      1  1363   38
##      2    43    2
##      3     4    0
mosaic::tally(~Smoke|gained_mis, data=NCbirths)
##       gained_mis
## Smoke  FALSE TRUE
##   0     1203   33
##   1      206    3
##   <NA>     1    4
mosaic::tally(~Low|gained_mis, data=NCbirths)
##    gained_mis
## Low FALSE TRUE
##   0  1291   34
##   1   119    6
mosaic::tally(~Premie|gained_mis, data=NCbirths)
##       gained_mis
## Premie FALSE TRUE
##      0  1229   30
##      1   181   10
mosaic::tally(~Marital|gained_mis, data=NCbirths)
##        gained_mis
## Marital FALSE TRUE
##       1   924   26
##       2   486   14
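There seems to be little association between whether Gained is missing and the other variables: the TRUE column (Gained missing) generally matches the distribution of each variable in the FALSE column, rather than indicating a new connection between the variables and the missing values. The main exceptions are that 4 of the 5 rows missing Smoke are also missing Gained, and that premies are somewhat overrepresented among the missing rows (10 of 40, or 25%, versus 181 of 1410, or about 13%), but with only 40 rows involved these differences are small in absolute terms.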
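The comparison described above amounts to comparing column proportions within each tally table. As a rough sketch, the Premie table can be rebuilt from the printed counts and summarized that way. The Fisher's exact test at the end is an added check that is not part of the lab's tally method, and `premie_tab`, `col_props`, and `premie_test` are names invented for this example.

```r
# Premie x gained_mis counts copied by hand from the tally() output
premie_tab <- matrix(
  c(1229, 181,   # gained_mis == FALSE: Premie 0, Premie 1
      30,  10),  # gained_mis == TRUE:  Premie 0, Premie 1
  nrow = 2,
  dimnames = list(Premie = c("0", "1"), gained_mis = c("FALSE", "TRUE"))
)

# Column proportions: share of premies among rows with and without Gained
col_props <- prop.table(premie_tab, margin = 2)
print(round(col_props, 3))

# Fisher's exact test of independence (an added check, not the lab's method)
premie_test <- fisher.test(premie_tab)
print(premie_test$p.value)
```

The same pattern applies to the other tally tables by swapping in their counts.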
We can still retain about 97% of data after removing missing values.
# Keep only rows with no missing value in Weeks, Smoke, or Gained
NCbirths <- filter(NCbirths, !is.na(Weeks), !is.na(Smoke), !is.na(Gained))
print(nrow(NCbirths) / 1450)
## [1] 0.9717241
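The filtering step drops any row with a missing value in Weeks, Smoke, or Gained. One base-R way to express the same rule is `complete.cases()` restricted to those columns. The sketch below uses a small made-up data frame (`toy`, with a hypothetical extra column `Note`) so that it runs on its own; it is an alternative formulation, not the code used in this report.

```r
# Hypothetical five-row data frame (toy values, not from NCbirths)
toy <- data.frame(
  Weeks  = c(39, NA, 40, 38, 41),
  Smoke  = c(0, 1, NA, 0, 0),
  Gained = c(30, 25, 20, NA, 35),
  Note   = c("a", "b", "c", "d", NA)  # an NA here should NOT cause a drop
)

# TRUE for rows with no NA in the three columns of interest
keep <- complete.cases(toy[, c("Weeks", "Smoke", "Gained")])
kept <- toy[keep, ]

print(kept)
print(nrow(kept) / nrow(toy))  # fraction of rows retained
```

Restricting the check to the three named columns matters: calling `complete.cases()` on the whole data frame would also drop rows that are missing only in columns the model never uses.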