North Carolina births

Exploratory analysis

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")

Exercise 1

Answer: Each case in this data set is the birth record of one baby born in North Carolina, and there are 1000 cases.

summary(nc)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 
par(mfrow=c(2,2))
hist(nc$weeks, main = "Length of Pregnancy")
hist(nc$fage, main = "Age of the Father")
hist(nc$mage, main = "Age of the Mother")
hist(nc$weight, main = "Weight of the Baby")

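Answer: There are outliers for the length of pregnancy and the weight of the baby. The summary already hints at them: the shortest pregnancy is 20 weeks against a median of 39, and the lowest weight is 1.00 against a median of 7.31.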
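
The outlier claim can be cross-checked without the plots by applying the common 1.5 x IQR rule to the quartiles printed by summary(nc) above. This is only a rough sketch: the helper name iqr_fences is hypothetical, and the numbers are copied from the summary output rather than recomputed from nc.

```r
# Minimum, quartiles and maximum copied from the summary(nc) output above.
weeks_stats  <- c(min = 20.00, q1 = 37.00, q3 = 40.00, max = 45.00)
weight_stats <- c(min = 1.000, q1 = 6.380, q3 = 8.060, max = 11.750)

# Hypothetical helper: lower and upper 1.5 * IQR fences from the quartiles.
iqr_fences <- function(s) {
  iqr <- s[["q3"]] - s[["q1"]]
  c(lower = s[["q1"]] - 1.5 * iqr, upper = s[["q3"]] + 1.5 * iqr)
}

weeks_fences  <- iqr_fences(weeks_stats)
weight_fences <- iqr_fences(weight_stats)

# A minimum below the lower fence (or a maximum above the upper fence)
# means at least one observation is flagged on that side.
weeks_fences
weeks_stats[["min"]] < weeks_fences[["lower"]]
weeks_stats[["max"]] > weeks_fences[["upper"]]
weight_fences
weight_stats[["min"]] < weight_fences[["lower"]]
weight_stats[["max"]] > weight_fences[["upper"]]
```

Under this rule both variables have flagged values on both sides, with the most extreme ones on the low side. The rule is a convention rather than a formal test, and boxplot() uses hinges that can differ slightly from these quartiles.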

Exercise 2

# weight ~ habit puts the smoking habit on the x-axis and the weight on the y-axis
boxplot(weight ~ habit, data = nc, main = "Weight of the Babies by Smoking Habits of the Mothers", xlab = "Smoking Habits of the Mothers", ylab = "Weight of the Babies")

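The boxplots show that the medians and interquartile ranges of the two distributions are very close, but the distribution for nonsmokers has more outliers and is more dispersed. Part of that is to be expected, since the nonsmoker group is much larger (873 mothers versus 126).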

by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 6.82873

Inference

Exercise 3

by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## ------------------------------------------------------------ 
## nc$habit: smoker
## [1] 126

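The conditions for inference should be satisfied: both sample sizes (873 nonsmokers and 126 smokers) are well above 30, so the sampling distribution of the difference in means should be nearly normal, assuming the birth records are independent observations.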

Exercise 4

Answer:
H0: The mean weight of babies born to smoking mothers is equal to the mean weight of babies born to non-smoking mothers (mu_nonsmoker - mu_smoker = 0).
HA: The mean weights of babies born to smoking and non-smoking mothers are different (mu_nonsmoker - mu_smoker != 0).

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical")
## Warning: package 'BHH2' was built under R version 4.0.4
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184
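
As a sanity check, the standard error, test statistic and p-value reported above can be reproduced by hand from the printed summary statistics. The sketch below assumes the usual two-sample formula with a normal reference distribution, which is what the "Z" label in the output suggests. The variable names are illustrative, and because the inputs are rounded to four decimals the results should match the output only up to rounding.

```r
# Summary statistics copied from the inference() output above.
n_nonsmoker <- 873; mean_nonsmoker <- 7.1443; sd_nonsmoker <- 1.5187
n_smoker    <- 126; mean_smoker    <- 6.8287; sd_smoker    <- 1.3862

# Standard error of the difference between two independent sample means.
se <- sqrt(sd_nonsmoker^2 / n_nonsmoker + sd_smoker^2 / n_smoker)

# Test statistic under H0: mu_nonsmoker - mu_smoker = 0.
null_value <- 0
z <- (mean_nonsmoker - mean_smoker - null_value) / se

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z))

round(c(se = se, z = z, p_value = p_value), 4)
```

The p-value is below 0.05, so at the 5% significance level the data provide evidence against H0.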

Exercise 5

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical", 
          order = c("smoker","nonsmoker"))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187

## Observed difference between means (smoker-nonsmoker) = -0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
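
The two intervals above are mirror images of each other, which can be seen by rebuilding them from the reported difference and standard error. This is a minimal sketch assuming a normal critical value (qnorm(0.975), about 1.96). The variable names are illustrative, and the endpoints should agree with the output only up to rounding of the copied inputs.

```r
# Values copied from the inference() output above.
diff_means <- 0.3155   # observed difference, nonsmoker - smoker
se         <- 0.1338   # reported standard error

# Critical value for a 95% interval under the normal approximation.
z_star <- qnorm(0.975)

# Point estimate plus/minus the margin of error.
ci_nonsmoker_minus_smoker <- diff_means + c(-1, 1) * z_star * se

# Reversing the order of subtraction negates the endpoints and swaps them.
ci_smoker_minus_nonsmoker <- rev(-ci_nonsmoker_minus_smoker)

round(ci_nonsmoker_minus_smoker, 4)
round(ci_smoker_minus_nonsmoker, 4)
```

Neither interval contains 0, which is consistent with the two-sided test rejecting H0 at the 5% level.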