Summary

Researchers relate the birth weight of a baby to the age of the mother and to whether she smoked cigarettes or drank alcohol during her pregnancy. This report applies statistical methods to compare the birth weights of baby girls from Alaska and Wyoming, and across groups defined by the mother’s age and smoking status.

Introduction

Researchers look for possible relationships between the birth weight of a baby and the age of the mother, or whether she smoked cigarettes or drank alcohol during her pregnancy. I examine a dataset consisting of a random sample of 40 baby girls born in Alaska and 40 baby girls born in Wyoming. These babies had a gestation period of at least 37 weeks and were single births. The state of birth, the mothers’ age, and the mothers’ smoking habit are considered, and the groups are compared using hypothesis tests.

Methods

First, exploratory plots (boxplots, histograms) are drawn to summarize the baby weights by state, mothers’ age, and smoking habit, and Q-Q plots are used to check the normality assumption. Provided this assumption is satisfied, I run an F test for homogeneity of variance before each t-test, because the Welch t-test should be used when the two variances differ significantly. Lastly, I carry out the t-test to see whether there is a statistically significant difference in baby weight between the groups.
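
The test-selection step described above can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated placeholders rather than the Girls2004 sample, and `compare_means()` is a hypothetical helper that is not part of the original analysis.

```r
# Sketch of the workflow: F test first, then pooled or Welch t-test.
# The data below are simulated placeholders, NOT the Girls2004 data.
set.seed(1)
group_a <- rnorm(40, mean = 3500, sd = 580)  # hypothetical weights (grams)
group_b <- rnorm(40, mean = 3200, sd = 420)  # hypothetical weights (grams)

# compare_means() is a hypothetical helper introduced for illustration.
compare_means <- function(x, y, alpha = 0.05, alternative = "two.sided") {
  f_test <- var.test(x, y)               # H0: ratio of variances equals 1
  equal_var <- f_test$p.value >= alpha   # keep the pooled test unless H0 is rejected
  t_res <- t.test(x, y, alternative = alternative, var.equal = equal_var)
  list(f_p_value = f_test$p.value, equal_var = equal_var, t_test = t_res)
}

res <- compare_means(group_a, group_b)
res$equal_var        # TRUE -> pooled t-test, FALSE -> Welch t-test
res$t_test$method    # name of the t-test that was actually run
res$t_test$p.value
```

Choosing the t-test from a preliminary F test mirrors the procedure in this report; calling `t.test()` with its default `var.equal = FALSE` would always give the Welch test.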

Results

First I create a histogram of Weight at birth (grams), which shows an approximately normal distribution. To check this impression, I draw a Q-Q plot; the points lie close to the reference line, which supports the view that the distribution of Weight at birth (grams) is approximately normal.

# Load the data and plot the overall distribution of birth weight
data=read.csv("https://raw.githubusercontent.com/yesimiao/Dataset/master/Girls2004.csv")
attach(data)
par(mfrow=c(1,3))
hist(Weight)
boxplot(Weight,main="Boxplot of Weight")
qqnorm(Weight)
qqline(Weight)

To analyze the difference in weight between babies from Alaska and Wyoming, I carry out a hypothesis test. Before employing the t-test, we should again check the normality assumption and the homogeneity of variance for the two groups.

AK=subset(data,State=="AK")
WY=subset(data,State=="WY")
par(mfrow=c(2,2))
hist(AK$Weight)
hist(WY$Weight)
qqnorm(AK$Weight)
qqline(AK$Weight)
qqnorm(WY$Weight)
qqline(WY$Weight)

par(mfrow=c(1,1))
boxplot(AK$Weight,WY$Weight,main="comparison of weight of AK and WY",names=c("Alaska","Wyoming"))

The histograms and Q-Q plots show that the data in each state are approximately normally distributed, so we can first run a one-sample t-test for each state to obtain a confidence interval for its mean. The results are as follows:

t.test(AK$Weight,alt="two.sided",conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  AK$Weight
## t = 38.421, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3331.23 3701.47
## sample estimates:
## mean of x 
##   3516.35
t.test(WY$Weight,alt="two.sided",conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  WY$Weight
## t = 48.5, df = 39, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  3074.115 3341.685
## sample estimates:
## mean of x 
##    3207.9

From the output above, the 95% confidence interval for mean baby weight in Alaska is (3331.23, 3701.47), and the 95% confidence interval for mean baby weight in Wyoming is (3074.12, 3341.69). The two intervals overlap slightly, on (3331.23, 3341.69), but overlapping intervals do not show that the means are equal, so we need a two-sample hypothesis test to compare them.

var.test(AK$Weight,WY$Weight,ratio=1,alt="two.sided",conf.level=0.95)
## 
##  F test to compare two variances
## 
## data:  AK$Weight and WY$Weight
## F = 1.9147, num df = 39, denom df = 39, p-value = 0.04571
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.012667 3.620100
## sample estimates:
## ratio of variances 
##           1.914668

The p-value of 0.0457 in the F test is below 0.05, so we reject the null hypothesis that the true ratio of the two variances equals 1. Since the variance of baby weight in Alaska differs from that in Wyoming, we need to employ the Welch two-sample t-test.
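
As a check on the F test output, the statistic is the ratio of the two sample variances. The standard deviations are not printed, but each standard error can be recovered from the one-sample outputs, because a one-sample t statistic against 0 equals the mean divided by its standard error. The symbols below are notation introduced here, and the figures are rounded, so the last digits are approximate.

```latex
\[
\begin{aligned}
F &= \frac{s_{AK}^2}{s_{WY}^2}, \qquad
\frac{s_{AK}}{\sqrt{40}} = \frac{3516.35}{38.421} \approx 91.52, \qquad
\frac{s_{WY}}{\sqrt{40}} = \frac{3207.9}{48.5} \approx 66.14 \\
F &\approx \left(\frac{91.52}{66.14}\right)^2 \approx 1.915, \qquad
\text{df} = (40 - 1,\ 40 - 1) = (39,\ 39)
\end{aligned}
\]
```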

t.test(AK$Weight,WY$Weight,alt="two.sided",var.equal=FALSE,conf.level=0.95)
## 
##  Welch Two Sample t-test
## 
## data:  AK$Weight and WY$Weight
## t = 2.7316, df = 71.007, p-value = 0.007946
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   83.29395 533.60605
## sample estimates:
## mean of x mean of y 
##   3516.35   3207.90

The p-value of 0.0079 in the t-test indicates that we should reject the claim that the true difference in mean baby weight between Alaska and Wyoming is 0. The 95% confidence interval for the difference, (83.29, 533.61), supports this conclusion because it does not contain 0. Together with the boxplot, this suggests that newborn girls in Wyoming weigh less on average than those in Alaska.
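
The Welch statistic and its degrees of freedom can be reproduced approximately from numbers already printed above. The standard errors 91.52 and 66.14 are the state means divided by their one-sample t statistics (3516.35/38.421 and 3207.9/48.5). The notation is introduced here for illustration, and all values are rounded.

```latex
\[
\begin{aligned}
t &= \frac{\bar{x}_{AK} - \bar{x}_{WY}}
          {\sqrt{\dfrac{s_{AK}^2}{n_{AK}} + \dfrac{s_{WY}^2}{n_{WY}}}}
   \approx \frac{3516.35 - 3207.90}{\sqrt{91.52^2 + 66.14^2}}
   \approx \frac{308.45}{112.92} \approx 2.73 \\
\nu &= \frac{\left(\dfrac{s_{AK}^2}{n_{AK}} + \dfrac{s_{WY}^2}{n_{WY}}\right)^2}
            {\dfrac{\left(s_{AK}^2 / n_{AK}\right)^2}{n_{AK} - 1}
           + \dfrac{\left(s_{WY}^2 / n_{WY}\right)^2}{n_{WY} - 1}}
   \approx \frac{(8376 + 4375)^2}{\dfrac{8376^2 + 4375^2}{39}} \approx 71.0
\end{aligned}
\]
```

The non-integer degrees of freedom (71.007 in the output, rather than 78) are what distinguish the Welch test from the pooled test here.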

Next we look at the relationship between mothers’ age and baby weight. Here is a boxplot of baby weight for each mothers’ age group:

boxplot(Weight~MothersAge,data=data,main="Comparison of baby weight under different mothers' age")

To carry out the hypothesis test, I divide the mothers into two groups: a Younger group (ages 15 to 29) and an Elder group (ages 30 to 44). Then I draw histograms and Q-Q plots for each group:

Younger=subset(data,MothersAge %in% c("15-19","20-24","25-29"))
Elder=subset(data,MothersAge %in% c("30-34","35-39","40-44"))
par(mfrow=c(2,2))
hist(Younger$Weight)
hist(Elder$Weight)
qqnorm(Younger$Weight)
qqline(Younger$Weight)
qqnorm(Elder$Weight)
qqline(Elder$Weight)

par(mfrow=c(1,1))
boxplot(Younger$Weight,Elder$Weight,main="comparison of weight of younger and elder mothers",names=c("Younger","Elder"))

The plots above show that weights in the Younger and Elder groups are approximately normally distributed: the points lie close to the reference line in the Q-Q plots. Given these findings, we can run the F test and t-test again.

var.test(Younger$Weight,Elder$Weight,ratio=1,alt="two.sided",conf.level=0.95)
## 
##  F test to compare two variances
## 
## data:  Younger$Weight and Elder$Weight
## F = 0.99113, num df = 56, denom df = 22, p-value = 0.9388
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4601171 1.9110724
## sample estimates:
## ratio of variances 
##          0.9911336

The p-value of the F test is 0.9388 > 0.05, so there is not enough evidence that the variances of baby weight in the Younger and Elder groups differ, and we can employ the two-sample t-test with pooled standard deviation.

t.test(Younger$Weight,Elder$Weight,alt="greater",var.equal=TRUE,conf.level=0.95)
## 
##  Two Sample t-test
## 
## data:  Younger$Weight and Elder$Weight
## t = 0.58491, df = 78, p-value = 0.2801
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -140.6783       Inf
## sample estimates:
## mean of x mean of y 
##  3384.035  3307.826

The p-value is 0.28 > 0.05, so we fail to reject the null hypothesis that the means of the two groups are the same against the one-sided alternative that babies of younger mothers are heavier. The one-sided 95% confidence interval (-140.68, infinity) contains 0, which agrees with this conclusion. The test therefore gives no evidence that mothers’ age affects the weight of newborn girls.
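
For reference, the pooled statistic and the one-sided lower bound can be traced back to the printed output. The group sizes 57 and 23 follow from the F test degrees of freedom (56 and 22), the subscripts Y and E stand for the Younger and Elder groups, and 1.665 is the approximate 95th percentile of the t distribution with 78 degrees of freedom. The notation is introduced here for illustration, and the values are rounded.

```latex
\[
\begin{aligned}
t &= \frac{\bar{x}_Y - \bar{x}_E}{s_p \sqrt{\dfrac{1}{n_Y} + \dfrac{1}{n_E}}}, \qquad
s_p^2 = \frac{(n_Y - 1)\, s_Y^2 + (n_E - 1)\, s_E^2}{n_Y + n_E - 2}, \qquad
\text{df} = 57 + 23 - 2 = 78 \\
\text{SE} &= \frac{\bar{x}_Y - \bar{x}_E}{t}
   \approx \frac{3384.035 - 3307.826}{0.58491} \approx 130.3 \\
\text{lower bound} &= (\bar{x}_Y - \bar{x}_E) - t_{0.95,\,78}\,\text{SE}
   \approx 76.21 - 1.665 \times 130.3 \approx -140.7
\end{aligned}
\]
```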

Furthermore, we can study the effect of the mother’s smoking on the weight of baby girls. Again I separate the data into smoker and non-smoker groups; the plots and analysis are shown below:

NS=subset(data,Smoker=="No")
YS=subset(data,Smoker=="Yes")
par(mfrow=c(2,2))
hist(NS$Weight)
hist(YS$Weight)
qqnorm(NS$Weight)
qqline(NS$Weight)
qqnorm(YS$Weight)
qqline(YS$Weight)

par(mfrow=c(1,1))
boxplot(NS$Weight,YS$Weight,names=c("Non-Smoker","Smoker"),main="comparison of weight of nonsmoker and smoker")

The histogram for one of the groups looks closer to a uniform than a normal distribution. However, the Q-Q plots show a good fit to the reference line, so I still assume the weights are normally distributed and carry out the F test.

var.test(NS$Weight,YS$Weight,ratio=1,alt="two.sided",conf.level=0.95)
## 
##  F test to compare two variances
## 
## data:  NS$Weight and YS$Weight
## F = 1.2638, num df = 68, denom df = 10, p-value = 0.7257
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.396838 2.834983
## sample estimates:
## ratio of variances 
##           1.263842

The p-value is 0.726 > 0.05, so we fail to reject the null hypothesis that the two variances are equal. Based on this, the pooled two-sample t-test can be employed.

t.test(NS$Weight,YS$Weight,alt="greater",var.equal=TRUE,conf.level=0.95)
## 
##  Two Sample t-test
## 
## data:  NS$Weight and YS$Weight
## t = 1.7028, df = 78, p-value = 0.04629
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  6.439446      Inf
## sample estimates:
## mean of x mean of y 
##  3401.580  3114.636

The p-value of 0.046 < 0.05 in the t-test leads us to reject the null hypothesis of equal means in favor of the alternative that babies of non-smokers are heavier on average. The sample means differ by about 287 grams (3401.58 versus 3114.64), and the one-sided 95% confidence interval for the difference is (6.439, infinity). In other words, with 95% confidence the mean weight of babies born to non-smokers exceeds that of babies born to smokers by at least 6.439 grams.

From the exploratory plots and tests above, we can say that, statistically speaking, newborn girls in Alaska are heavier on average than those born in Wyoming. Smoking is probably associated with lighter babies, although the lower confidence bound is close to 0, so the size of the difference is uncertain. Mothers’ age, however, seems to be unrelated to baby weight.

Discussion

Overall, the dataset supports the conclusions I expected. However, after subsetting the data, some groups are small enough that their plots look non-normal. The F test and the corresponding t-tests rely on the assumptions of normality and simple random sampling. With a larger sample and a fixed criterion for judging normality, the analysis and its conclusions would be more persuasive.
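
One common way to fix such a criterion is the Shapiro-Wilk test, available in base R as `shapiro.test()`. The sketch below was not part of the original analysis and uses simulated placeholder data; the group sizes 69 and 11 only mimic a large and a small subgroup.

```r
# Sketch: Shapiro-Wilk test as one formal normality check.
# Simulated placeholder data, NOT the Girls2004 data.
set.seed(2)
large_group <- rnorm(69, mean = 3400, sd = 550)  # hypothetical weights (grams)
small_group <- rnorm(11, mean = 3100, sd = 500)  # hypothetical weights (grams)

sw_large <- shapiro.test(large_group)  # H0: the sample comes from a normal distribution
sw_small <- shapiro.test(small_group)

# A small p-value is evidence against normality. A large p-value is not proof
# of normality, especially for a small group where the test has little power.
c(large = sw_large$p.value, small = sw_small$p.value)
```

Such a test would complement the histograms and Q-Q plots rather than replace them.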

# All code used in the analysis
data=read.csv("https://raw.githubusercontent.com/yesimiao/Dataset/master/Girls2004.csv")
attach(data)
par(mfrow=c(1,3))
hist(Weight)
boxplot(Weight,main="Boxplot of Weight")
qqnorm(Weight)
qqline(Weight)
AK=subset(data,State=="AK")
WY=subset(data,State=="WY")
par(mfrow=c(2,2))
hist(AK$Weight)
hist(WY$Weight)
qqnorm(AK$Weight)
qqline(AK$Weight)
qqnorm(WY$Weight)
qqline(WY$Weight)
par(mfrow=c(1,1))
boxplot(AK$Weight,WY$Weight,main="comparison of weight of AK and WY",names=c("Alaska","Wyoming"))
t.test(AK$Weight,alt="two.sided",conf.level=0.95)
t.test(WY$Weight,alt="two.sided",conf.level=0.95)
var.test(AK$Weight,WY$Weight,ratio=1,alt="two.sided",conf.level=0.95)
t.test(AK$Weight,WY$Weight,alt="two.sided",var.equal=FALSE,conf.level=0.95)
boxplot(Weight~MothersAge,data=data,main="Comparison of baby weight under different mothers' age")
Younger=subset(data,MothersAge %in% c("15-19","20-24","25-29"))
Elder=subset(data,MothersAge %in% c("30-34","35-39","40-44"))
par(mfrow=c(2,2))
hist(Younger$Weight)
hist(Elder$Weight)
qqnorm(Younger$Weight)
qqline(Younger$Weight)
qqnorm(Elder$Weight)
qqline(Elder$Weight)
par(mfrow=c(1,1))
boxplot(Younger$Weight,Elder$Weight,main="comparison of weight of younger and elder mothers",names=c("Younger","Elder"))
var.test(Younger$Weight,Elder$Weight,ratio=1,alt="two.sided",conf.level=0.95)
t.test(Younger$Weight,Elder$Weight,alt="greater",var.equal=TRUE,conf.level=0.95)
NS=subset(data,Smoker=="No")
YS=subset(data,Smoker=="Yes")
par(mfrow=c(2,2))
hist(NS$Weight)
hist(YS$Weight)
qqnorm(NS$Weight)
qqline(NS$Weight)
qqnorm(YS$Weight)
qqline(YS$Weight)
par(mfrow=c(1,1))
boxplot(NS$Weight,YS$Weight,names=c("Non-Smoker","Smoker"),main="comparison of weight of nonsmoker and smoker")
var.test(NS$Weight,YS$Weight,ratio=1,alt="two.sided",conf.level=0.95)
t.test(NS$Weight,YS$Weight,alt="greater",var.equal=TRUE,conf.level=0.95)