Exploratory Analysis

library(Stat2Data)  # provides the NCbirths data set
library(GGally)     # ggpairs()
library(ggplot2)
library(dplyr)      # filter()
data(NCbirths)
summary(NCbirths)
##        ID             Plural           Sex            MomAge     
##  Min.   :   1.0   Min.   :1.000   Min.   :1.000   Min.   :13.00  
##  1st Qu.: 363.2   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:22.00  
##  Median : 725.5   Median :1.000   Median :1.000   Median :26.00  
##  Mean   : 725.5   Mean   :1.037   Mean   :1.487   Mean   :26.76  
##  3rd Qu.:1087.8   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:31.00  
##  Max.   :1450.0   Max.   :3.000   Max.   :2.000   Max.   :43.00  
##                                                                  
##      Weeks          Marital         RaceMom      HispMom      Gained    
##  Min.   :22.00   Min.   :1.000   Min.   :1.000   C:   2   Min.   : 0.0  
##  1st Qu.:38.00   1st Qu.:1.000   1st Qu.:1.000   M: 128   1st Qu.:20.0  
##  Median :39.00   Median :1.000   Median :1.000   N:1283   Median :30.0  
##  Mean   :38.62   Mean   :1.345   Mean   :1.831   O:   3   Mean   :30.6  
##  3rd Qu.:40.00   3rd Qu.:2.000   3rd Qu.:2.000   P:   9   3rd Qu.:40.0  
##  Max.   :45.00   Max.   :2.000   Max.   :8.000   S:  25   Max.   :95.0  
##  NA's   :1                                                NA's   :40    
##      Smoke        BirthWeightOz   BirthWeightGm         Low         
##  Min.   :0.0000   Min.   : 12.0   Min.   : 340.2   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:106.0   1st Qu.:3005.1   1st Qu.:0.00000  
##  Median :0.0000   Median :118.0   Median :3345.3   Median :0.00000  
##  Mean   :0.1446   Mean   :116.2   Mean   :3295.6   Mean   :0.08621  
##  3rd Qu.:0.0000   3rd Qu.:130.0   3rd Qu.:3685.5   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :181.0   Max.   :5131.4   Max.   :1.00000  
##  NA's   :5                                                          
##      Premie           MomRace   
##  Min.   :0.0000   black   :332  
##  1st Qu.:0.0000   hispanic:164  
##  Median :0.0000   other   : 48  
##  Mean   :0.1317   white   :906  
##  3rd Qu.:0.0000                 
##  Max.   :1.0000                 
## 

The overall data summary gives us a good sense of the spread of each variable.
* Average birth weight is around 3300 g - a good baseline for judging whether a birth is healthy or not.
* Interquartile range (IQR) of birth weights runs from ~3000 g to ~3700 g.
* Majority of moms are white (906 of 1450, a little under 2/3).
* ~13% of babies were premies and ~9% were considered “low” weight - more research should be done to determine what a healthy baby weighs.
* The middle 50% of gestation periods fall between 38 and 40 weeks, but the lower tail is far more extreme (minimum of 22 weeks) than the upper tail (maximum of 45 weeks).
* A large majority of Hispanic moms are Mexican (HispMom code M, 128 rows).
* Plural births are very rare in this data set.
* Most moms are married when they have children (about 65%).
* Sex is split almost evenly between M/F.
* IQR of moms’ ages runs from 22 to 31.
* ~14% of moms smoke.
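
The shares quoted in the list above can be recomputed from the counts that summary() prints. A minimal sketch in base R, with the counts copied by hand from the output above. Treating every HispMom code other than N as a Hispanic origin is an assumption about the coding, not something the output states:

```r
# Counts copied from the summary() output above.
mom_race <- c(black = 332, hispanic = 164, other = 48, white = 906)
hisp_mom <- c(C = 2, M = 128, O = 3, P = 9, S = 25)  # assumption: N (1283 rows) means not Hispanic

# Share of each MomRace level among all 1450 births.
race_share <- round(prop.table(mom_race), 3)
print(race_share)

# Share of code M among moms with a Hispanic origin code.
mexican_share <- round(unname(hisp_mom["M"]) / sum(hisp_mom), 3)
print(mexican_share)
```

Working from the printed counts keeps the quoted percentages tied to the output rather than to a visual estimate.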

ggpairs(NCbirths, columns = c("MomAge","Weeks","Gained","BirthWeightGm"))

Plotting birth weight against other continuous variables reveals:
* Birth weight has a correlation of ~.6 with Weeks, so Weeks would be a good explanatory variable to include in the model.
* Definite collinearity between the weight variables, which record the same measurement in different units. To test this, I did the same plot with BirthWeightOz instead and got the exact same correlation coefficients - a regression model should behave the same with either one, apart from the units of the coefficients, which we will still confirm.
* Weak positive correlation between birth weight and Gained/MomAge - .19 and .15 respectively.
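
Identical correlation coefficients are what a change of units should produce: Pearson correlation is unchanged when a variable is multiplied by a positive constant. A small sketch with made-up numbers (not taken from NCbirths), assuming roughly 28.35 g per ounce:

```r
# Made-up illustrative values, not rows from NCbirths.
weeks     <- c(36, 38, 39, 40, 41)
weight_oz <- c(95, 110, 118, 125, 130)
weight_gm <- weight_oz * 28.35   # approximate grams per ounce

cor_oz <- cor(weeks, weight_oz)
cor_gm <- cor(weeks, weight_gm)

# Rescaling does not change the correlation with a third variable...
print(isTRUE(all.equal(cor_oz, cor_gm)))
# ...and the two weight columns are perfectly correlated with each other.
print(isTRUE(all.equal(cor(weight_oz, weight_gm), 1)))
```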

ggpairs(NCbirths, columns = c("BirthWeightGm", "Low", "Premie", "Plural", "Smoke"))

For birth weight against categorical variables:
* Premie and Low correlate with each other and have a high negative correlation with birth weight - good predictors of whether the baby will be healthy or not.
* Plural (the number of babies in the birth) correlates with Premie and Low (.33 and .37) - probably good to include in the model, since Premie and Low themselves are good predictors of the baby’s health.
* Smokers appear less likely to have multiple babies and more likely to have a baby flagged as Low or Premie.
* Plural and Smoke have negative correlations with birth weight.

Missing Data

NAs are present in certain variables of the dataset - 1 in Weeks, 5 in Smoke, 40 in Gained.
We can drop the rows with missing data (41 in total, since some rows are missing more than one variable) because they are so few and have little to no association with other variables. To test association, I used the tally method detailed in Exercise 2 of Lab 7 on the variable with the greatest number of missing values, Gained.

NCbirths$gained_mis <- factor(is.na(NCbirths$Gained))
mosaic::tally(~Sex|gained_mis, data=NCbirths)
##    gained_mis
## Sex FALSE TRUE
##   1   721   23
##   2   689   17
mosaic::tally(~Plural|gained_mis, data=NCbirths)
##       gained_mis
## Plural FALSE TRUE
##      1  1363   38
##      2    43    2
##      3     4    0
mosaic::tally(~Smoke|gained_mis, data=NCbirths)
##       gained_mis
## Smoke  FALSE TRUE
##   0     1203   33
##   1      206    3
##   <NA>     1    4
mosaic::tally(~Low|gained_mis, data=NCbirths)
##    gained_mis
## Low FALSE TRUE
##   0  1291   34
##   1   119    6
mosaic::tally(~Premie|gained_mis, data=NCbirths)
##       gained_mis
## Premie FALSE TRUE
##      0  1229   30
##      1   181   10
mosaic::tally(~Marital|gained_mis, data=NCbirths)
##        gained_mis
## Marital FALSE TRUE
##       1   924   26
##       2   486   14

There seems to be no association between whether Gained is missing and the other variables: the split within the TRUE column (Gained missing) generally matches the split within the FALSE column, rather than pointing to a new connection between a variable and the missing values. The one visible overlap is that 4 of the 5 rows missing Smoke are also missing Gained.
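
One way to make this comparison concrete is to turn a tally into column shares, so the FALSE and TRUE columns can be read side by side. A sketch in base R using the Marital counts copied from the output above; `col_shares` is a hypothetical helper name, not part of mosaic:

```r
# Marital counts copied from the tally() output above.
marital <- matrix(
  c(924, 486,   # gained_mis = FALSE
    26, 14),    # gained_mis = TRUE
  nrow = 2,
  dimnames = list(Marital = c("1", "2"), gained_mis = c("FALSE", "TRUE"))
)

# Hypothetical helper: share of each row level within each column.
col_shares <- function(tab) round(prop.table(tab, margin = 2), 3)

marital_shares <- col_shares(marital)
print(marital_shares)
```

Both columns give roughly a 65/35 split here. The same helper can be applied to the other tallies; with only 40 rows in the TRUE column, small differences in those shares should be read cautiously.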
We can still retain about 97% of data after removing missing values.

# Keep only rows with no missing values in Weeks, Smoke, or Gained
NCbirths <- filter(NCbirths, !is.na(Weeks), !is.na(Smoke), !is.na(Gained))
print(nrow(NCbirths) / 1450)
## [1] 0.9717241
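
The retained share can also be computed without hard-coding the original row count, by flagging complete rows before filtering. A sketch on a small made-up data frame (the values are illustrative, only the column names follow NCbirths), using base R's complete.cases():

```r
# Made-up toy data; only the column names mirror NCbirths.
toy <- data.frame(
  Weeks  = c(39, 40, NA, 38, 41),
  Smoke  = c(0, 1, 0, NA, 0),
  Gained = c(30, 25, 40, NA, 20)
)

# TRUE for rows with no NA in any of the three columns.
keep <- complete.cases(toy[, c("Weeks", "Smoke", "Gained")])

n_dropped      <- sum(!keep)   # rows missing at least one value
retained_share <- mean(keep)   # fraction of rows kept
toy_complete   <- toy[keep, ]

print(c(dropped = n_dropped, retained = retained_share))
```

A row missing several variables is counted once, which is why the number of dropped rows can be smaller than the sum of the per-variable NA counts.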