Motivation

This study aims to determine whether baby birthweights vary with characteristics of the mother such as age, weight, height, and smoking behavior.

library(dplyr)    # glimpse(), filter(), select(), mutate(), case_when(), %>%
library(ggplot2)  # scatter plots
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/babies.txt"
filename <- basename(url)
download.file(url, destfile = filename)
babies1 <- read.table(filename, header = TRUE)
glimpse(babies1)
## Observations: 1,236
## Variables: 7
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 14...
## $ gestation <int> 284, 282, 279, 999, 282, 286, 244, 245, 289, 299, 35...
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, ...
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, ...
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120...
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1...

Data Manipulation

The parity and smoke fields appear to contain only 0 or 1 values. Let’s check whether any records contain other values.

table(babies1$smoke)
## 
##   0   1   9 
## 742 484  10
table(babies1$parity)
## 
##   0   1 
## 921 315

Parity takes only the values 0 and 1, but 10 records have smoke coded as 9. Let’s study these in more detail.

table(babies1$parity,babies1$smoke)
##    
##       0   1   9
##   0 548 363  10
##   1 194 121   0
head(babies1[babies1$parity==0,])
##   bwt gestation parity age height weight smoke
## 1 120       284      0  27     62    100     0
## 2 113       282      0  33     64    135     0
## 3 128       279      0  28     64    115     1
## 4 123       999      0  36     69    190     0
## 5 108       282      0  23     67    125     1
## 6 136       286      0  25     62     93     0
head(babies1[babies1$parity==1,])
##     bwt gestation parity age height weight smoke
## 272 133       276      1  22     63    119     0
## 274 104       274      1  20     62    115     1
## 277 123       284      1  20     65    120     1
## 280 141       319      1  20     67    140     1
## 282 113       282      1  36     59    140     0
## 284 109       295      1  23     63    103     1

The documentation says parity indicates the total number of previous pregnancies, including fetal deaths and stillbirths. In this file it is recorded only as 0 or 1.

Does birthweight vary based on smoking behavior?

bwt.nonsmoke <- babies1 %>% filter(smoke==0) %>% select(bwt) %>% unlist
bwt.smoke <- babies1 %>% filter(smoke==1) %>% select(bwt) %>% unlist
t.test(bwt.nonsmoke,bwt.smoke)
## 
##  Welch Two Sample t-test
## 
## data:  bwt.nonsmoke and bwt.smoke
## t = 8.5813, df = 1003.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.89385 10.98148
## sample estimates:
## mean of x mean of y 
##  123.0472  114.1095

This indicates that non-smokers give birth to babies that are, on average, about 9 ounces heavier (95% confidence interval: 6.9 to 11.0 ounces).
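The difference and its confidence interval can be read directly off the test object instead of the printed output. A minimal sketch, assuming made-up toy vectors (`nonsmoke` and `smoke` below are illustrative values, not the study data):

```r
# Toy birthweights in ounces; illustrative values only, not the babies data.
nonsmoke <- c(120, 128, 131, 117, 125, 122)
smoke    <- c(112, 118, 109, 121, 115, 110)

# Welch two-sample t-test (the default when var.equal is not set).
tt <- t.test(nonsmoke, smoke)

# estimate holds the two group means; their difference is the effect size.
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])

# conf.int is the 95% confidence interval for that difference.
ci <- as.numeric(tt$conf.int)

cat("difference in means:", round(mean_diff, 2), "\n")
cat("95% CI:", round(ci, 2), "\n")
```

Working from `tt$estimate` and `tt$conf.int` avoids retyping rounded numbers from the console when reporting the effect size.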

Does birthweight vary based on smoking and gestation?

Let’s remove the abnormally short and long gestation periods (keeping only gestations strictly between 240 and 300 days) and recalculate whether smoking affects birthweight.

baby_normal_gest <- babies1 %>% filter(gestation > 240 & gestation < 300)
baby_normal_smoke <- baby_normal_gest %>% filter(smoke==1) %>% select(bwt) %>% unlist
baby_normal_nonsmoke <- baby_normal_gest %>% filter(smoke==0) %>% select(bwt) %>% unlist
t.test(baby_normal_nonsmoke,baby_normal_smoke)
## 
##  Welch Two Sample t-test
## 
## data:  baby_normal_nonsmoke and baby_normal_smoke
## t = 9.5739, df = 921.67, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.893376 11.963872
## sample estimates:
## mean of x mean of y 
##  123.8223  113.8937

This indicates that restricting gestation does not change the picture: non-smokers’ babies are, on average, about 10 ounces heavier.
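The same filter-then-test step can also be sketched in base R with `subset` and the formula interface of `t.test`, which avoids building one vector per group. The `toy` data frame and its values are invented for illustration; only the column names follow the babies data.

```r
# Toy data; illustrative values only. 999 and 9 stand in for unknown codes.
toy <- data.frame(
  bwt       = c(120, 113, 128, 123, 108, 136, 110, 117, 125, 105),
  gestation = c(284, 282, 279, 999, 282, 286, 244, 330, 289, 275),
  smoke     = c(0,   0,   1,   0,   1,   0,   1,   1,   0,   9)
)

# Keep gestations strictly between 240 and 300 days and known smoking status.
kept <- subset(toy, gestation > 240 & gestation < 300 & smoke %in% c(0, 1))

# Formula interface: birthweight grouped by smoke (exactly two groups remain).
tt <- t.test(bwt ~ smoke, data = kept)

# Group means, non-smokers (0) first, then smokers (1).
group_means <- unname(tt$estimate)
print(group_means)
```

The formula interface requires exactly two groups, which is why the sketch drops the rows with `smoke` coded as 9 before testing.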

Does birthweight depend on height and weight of mother?

cor(x = babies1$bwt, babies1$height)
## [1] 0.1255413
cor(x = babies1$bwt, babies1$weight)
## [1] 0.04676414

This shows very little correlation between birthweight and the mother’s height or weight.

Let’s look at correlation between birthweight and gestation

cor(x = babies1$bwt, babies1$gestation)
## [1] 0.06250354
ggplot(babies1, aes(x=gestation, y=bwt))+geom_point()

The plot shows that we need to remove the records with gestation = 999.

babies2 <- babies1 %>% filter(gestation != 999)
ggplot(babies2, aes(x=gestation, y=bwt))+geom_point()

cor(x = babies2$bwt, babies2$gestation)
## [1] 0.407854

Now the correlation jumps to about 0.41, which shows we need to watch out for outliers! 999 is the code for unknown data.
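The effect of a single unknown-value code on a correlation can be reproduced on a small scale. A minimal sketch with invented numbers (the vectors below are not taken from the babies data), recoding 999 to `NA` and letting `cor` skip incomplete pairs:

```r
# Toy gestation (days) and birthweight (ounces); illustrative values only.
gestation <- c(270, 275, 280, 284, 288, 292, 999)
bwt       <- c(105, 112, 118, 121, 126, 131, 120)

# With the 999 code treated as a real number, the correlation is distorted.
cor_raw <- cor(gestation, bwt)

# Recode the sentinel to NA, then correlate over complete pairs only.
gestation_clean <- replace(gestation, gestation == 999, NA)
cor_clean <- cor(gestation_clean, bwt, use = "complete.obs")

cat("raw:", round(cor_raw, 3), " cleaned:", round(cor_clean, 3), "\n")
```

Recoding to `NA` instead of filtering keeps the row available for analyses that do not involve the affected column.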

Let’s correct the previous correlations too

unique(babies1$height)
##  [1] 62 64 69 67 65 66 68 63 61 60 56 58 99 70 71 59 72 53 57 54
unique(babies1$weight)
##   [1] 100 135 115 190 125  93 178 140 136 120 124 128  99 154 130 170 142
##  [18] 175 145 182 122 112 106 132 105 146 123  92 101 999 160 177 119 110
##  [35] 150  90 147 148 126 116  96 118 137 149 107 103 104 117 143 196 113
##  [52] 162 121 108 114 109 215 133 155 165 127 250 152 111 138 131 192 220
##  [69]  97 168 180 169 185 139 102 156 189 157 176 129 144 159 134 210 141
##  [86]  95 200  89 197 171  98 202  94 174 153 198 164  91 191 181 217 163
## [103] 228 158 151  87

This shows that missing entries are coded as 99 for height and 999 for weight.

babies2 <- babies2 %>% filter(height != 99, weight != 999, bwt != 999)
glimpse(babies2)
## Observations: 1,185
## Variables: 7
## $ bwt       <int> 120, 113, 128, 108, 136, 138, 132, 120, 143, 140, 14...
## $ gestation <int> 284, 282, 279, 282, 286, 244, 245, 289, 299, 351, 28...
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ age       <int> 27, 33, 28, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, ...
## $ height    <int> 62, 64, 64, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, ...
## $ weight    <int> 100, 135, 115, 125, 93, 178, 140, 125, 136, 120, 124...
## $ smoke     <int> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0...

Now we have 1,185 records, a little under 1,200, with complete height, weight, and gestation values.
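An alternative to chaining one filter per column is to recode each column’s unknown-value code to `NA` and then keep the complete rows. A minimal sketch in base R; the `toy` data frame and the `na_codes` lookup are hypothetical names introduced here, and the values are invented:

```r
# Toy rows; illustrative values only. 99 and 999 stand in for "unknown".
toy <- data.frame(
  bwt    = c(120, 113, 128, 123, 108),
  height = c(62,  99,  64,  69,  67),
  weight = c(100, 135, 999, 190, 125)
)

# Missing-value code for each column (names match the data frame columns).
na_codes <- c(height = 99, weight = 999)

# Replace each column's code with NA.
for (col in names(na_codes)) {
  toy[[col]][toy[[col]] == na_codes[[col]]] <- NA
}

# Count missing values per column, then keep only fully observed rows.
missing_per_column <- colSums(is.na(toy))
toy_complete <- toy[complete.cases(toy), ]

print(missing_per_column)
print(nrow(toy_complete))
```

Keeping the codes in one lookup makes it harder to forget a column or to apply the wrong code to it.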

cor(babies2$bwt, babies2$gestation)
## [1] 0.4103706
cor(babies2$bwt, babies2$height)
## [1] 0.2011152
cor(babies2$bwt, babies2$weight)
## [1] 0.1554782
ggplot(babies2, aes(x=bwt, y=weight))+geom_point()

ggplot(babies2, aes(x=bwt, y=height))+geom_point()

While the correlations improve somewhat, birthweight is still only weakly correlated with the mother’s height (0.20) and weight (0.16).

Dividing into 2 groups (low and high height) and comparing birthweight

hist(babies2$height)

med_height <- median(babies2$height)
babies3 <- babies2 %>% mutate(height_factor=ifelse(height>med_height,"high","low"))
baby_low_height <- babies3 %>% filter(height_factor=="low") %>% select(bwt) %>% unlist
baby_high_height <- babies3 %>% filter(height_factor=="high") %>% select(bwt) %>% unlist
mean(baby_low_height)-mean(baby_high_height)
## [1] -6.007238
t.test(baby_low_height,baby_high_height)
## 
##  Welch Two Sample t-test
## 
## data:  baby_low_height and baby_high_height
## t = -5.6198, df = 1080, p-value = 2.432e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.104669 -3.909807
## sample estimates:
## mean of x mean of y 
##  116.8795  122.8868

Using a histogram to check that the height distribution is roughly normal and locating the median height at about 64 inches (5 feet 4 inches), we divided the mothers into 2 groups and compared the birthweights of tall mothers with those of short mothers. Result: height of mothers does affect birthweight, to an extent of about 6 ounces.

Birthweight broken out by smoke and height factors

Let’s study birthweight averages for all 4 groups: smokers who are tall, smokers who are short, and non-smokers who are tall and short.

baby_low_low_sm_ht <- babies3 %>% filter(height_factor=="low") %>% filter(smoke==0) %>% select(bwt) %>% unlist
baby_low_high_sm_ht <- babies3 %>% filter(height_factor=="high") %>% filter(smoke==0) %>% select(bwt) %>% unlist
baby_high_low_sm_ht <- babies3 %>% filter(height_factor=="low") %>% filter(smoke==1) %>% select(bwt) %>% unlist
baby_high_high_sm_ht <- babies3 %>% filter(height_factor=="high") %>% filter(smoke==1) %>% select(bwt) %>% unlist
mean(baby_low_low_sm_ht); mean(baby_low_high_sm_ht); mean(baby_high_low_sm_ht); mean(baby_high_high_sm_ht)
## [1] 120.4244
## [1] 126.6623
## [1] 110.7269
## [1] 117.4787
t.test(baby_low_low_sm_ht, baby_high_high_sm_ht)
## 
##  Welch Two Sample t-test
## 
## data:  baby_low_low_sm_ht and baby_high_high_sm_ht
## t = 1.9964, df = 388.46, p-value = 0.04659
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0447606 5.8466739
## sample estimates:
## mean of x mean of y 
##  120.4244  117.4787
babies3 <- babies3 %>% mutate(sm_ht_factor= case_when(
  height_factor=="low" & smoke==0 ~ "l_sm_l_ht",   # non-smoker, short
  height_factor=="low" & smoke==1 ~ "h_sm_l_ht",   # smoker, short
  height_factor=="high" & smoke==0 ~ "l_sm_h_ht",  # non-smoker, tall
  height_factor=="high" & smoke==1 ~ "h_sm_h_ht"   # smoker, tall
))
pairwise.t.test(babies3$bwt, babies3$sm_ht_factor)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  babies3$bwt and babies3$sm_ht_factor 
## 
##           h_sm_h_ht h_sm_l_ht l_sm_h_ht
## h_sm_l_ht 7.9e-05   -         -        
## l_sm_h_ht 2.3e-08   < 2e-16   -        
## l_sm_l_ht 0.047     4.2e-11   8.0e-06  
## 
## P value adjustment method: holm

The pairwise comparison among all 4 groups suggests that smoking behavior outweighs height when it comes to baby birthweight. Even if a mother is short, if she is a non-smoker her baby is, on average, about 3 ounces heavier than that of a tall smoker (p = 0.047).

The best combination, a tall, non-smoking mother, results in a whopping 16 ounce heavier baby, on average, than the short smoker!
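The four group means can also be laid out as a 2 x 2 table in one step. A minimal sketch in base R using `tapply`; the `toy` data frame holds invented values, and only the column names and the median split follow the analysis above:

```r
# Toy data; illustrative values only, not the babies data.
toy <- data.frame(
  bwt    = c(120, 126, 110, 118, 122, 128, 112, 116),
  height = c(62,  67,  61,  68,  63,  66,  60,  69),
  smoke  = c(0,   0,   1,   1,   0,   0,   1,   1)
)

# Median split on height, as in the analysis: above the median is "high".
med_height <- median(toy$height)
toy$height_factor <- ifelse(toy$height > med_height, "high", "low")

# Mean birthweight for every smoke x height combination (rows: smoke).
group_means <- tapply(
  toy$bwt,
  list(smoke = toy$smoke, height = toy$height_factor),
  mean
)
print(group_means)
```

Reading across a row compares tall and short mothers with the same smoking status, and reading down a column compares smokers and non-smokers of similar height.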

Does previous pregnancy affect birthweight?

baby_low_parity <- babies2 %>% filter(parity==0) %>% select(bwt) %>% unlist
baby_high_parity <- babies2 %>% filter(parity==1) %>% select(bwt) %>% unlist
mean(baby_low_parity)-mean(baby_high_parity)
## [1] 1.901187
t.test(baby_low_parity, baby_high_parity)
## 
##  Welch Two Sample t-test
## 
## data:  baby_low_parity and baby_high_parity
## t = 1.6233, df = 575.85, p-value = 0.1051
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3991804  4.2015542
## sample estimates:
## mean of x mean of y 
##  120.0148  118.1136

The t-test shows no statistically significant difference in weights (p = 0.105).

Correlation with age

Finally, let’s check if birthweight is correlated with age.

unique(babies2$age)
##  [1] 27 33 28 23 25 30 32 36 38 43 22 26 20 34 24 37 31 21 39 35 29 19 42
## [24] 40 18 17 99 41 15 44 45

Since there are values of 99 - let’s remove those records

babies2 <- babies2 %>% filter(age !=99)
cor(babies2$bwt, babies2$age)
## [1] 0.03116204

There seems to be little correlation between age and birthweight.

Overall conclusion

Being a smoker plays a significant role in determining the weight of babies at birth: non-smokers’ babies are, on average, heavier by about 9 ounces. Other factors, such as age, previous pregnancy, and weight, are poorly correlated with birthweight and don’t affect it as much. Height does affect birthweight, but not to the extent that smoking does. The best combination, a tall, non-smoking mother, results in a baby a whopping 16 ounces heavier, on average, than that of a short mother who smokes.