Data Sets:

Download the dataset birthweight.csv for Exercises 1-4. The birthweight data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are a total of 400 observations in birthweight, and the variables are:
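
The code in this report uses a data frame named bweight without showing how it was created. A minimal loading sketch, assuming the CSV sits in the working directory and that the grouping columns should be treated as factors, might look like the following. The helper name load_birthweight and the factor conversion are assumptions, not part of the original report.

```r
# Hypothetical loader. The file name comes from the assignment text; the
# factor conversion is an assumption about how the columns were prepared.
load_birthweight <- function(path = "birthweight.csv") {
  bweight <- read.csv(path)
  # Treat the grouping columns as factors so that aov() and the post hoc
  # tests handle them as groups rather than as numeric covariates.
  group_cols <- intersect(c("Black", "Married", "Boy", "MomSmoke", "Ed"),
                          names(bweight))
  bweight[group_cols] <- lapply(bweight[group_cols], factor)
  bweight
}
```

Calling bweight <- load_birthweight() would then give a data frame compatible with the calls used in the exercises.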

Exercise 1A

Generate a boxplot for infant birth weight (Weight) and comment on the general features of the distribution. Generate a normal Q-Q plot and perform a Shapiro-Wilk test to check whether normality is a reasonable assumption for Weight. Make a conclusion.

Boxplot

boxplot(bweight$Weight,
        main = "Infant Birth Weight",
        ylab = "Weight",
        col = "#A4DEFE",
        border = "#526E7E",
        horizontal = FALSE)

Conclusion:

  • The boxplot suggests that Weight is roughly symmetric, which is consistent with an approximately normal distribution, although a boxplot alone cannot confirm normality. A couple of outliers fall below the lower whisker and one falls above the upper whisker.
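
The points that boxplot() draws individually are those lying more than 1.5 times the interquartile range beyond the box. A small sketch of how to list them with boxplot.stats() follows; the weights are simulated stand-ins (an assumption), not the real bweight$Weight values.

```r
# Simulated weights stand in for bweight$Weight (assumption: the real
# data are not reproduced here). Two extreme values are added on purpose.
set.seed(1)
weight <- c(rnorm(200, mean = 3400, sd = 500), 1200, 5600)

stats <- boxplot.stats(weight)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values beyond 1.5 * IQR from the hinges, drawn as points
```

Running boxplot.stats(bweight$Weight)$out on the real data would list the exact outlying weights seen in the plot.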

Q-Q Plot

qqnorm(bweight$Weight)
qqline(bweight$Weight, col = "red")

Conclusion:

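  • Based on the Q-Q plot generated above, normality appears to be a reasonable assumption for Weight, because the points stay close to the red reference line. The reference line itself is always straight; what matters is how closely the points follow it.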
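
qqline() draws a straight reference line for any data, so the evidence for normality lies in how tightly the points follow it. One rough way to put a number on that is the correlation between theoretical and sample quantiles. The sketch below uses simulated samples (an assumption) and a hypothetical helper named qq_cor.

```r
# Simulated samples (assumption): one normal, one strongly right-skewed.
set.seed(4)
normal_sample <- rnorm(300, mean = 3400, sd = 500)
skewed_sample <- rexp(300, rate = 1 / 3400)

# Correlation between theoretical and sample quantiles of a normal Q-Q plot.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)  # compute the quantile pairs without plotting
  cor(qq$x, qq$y)
}

qq_cor(normal_sample)  # close to 1 when points hug the reference line
qq_cor(skewed_sample)  # noticeably lower for skewed data
```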

Shapiro-Wilk Test

shapiro.test(bweight$Weight)
## 
##  Shapiro-Wilk normality test
## 
## data:  bweight$Weight
## W = 0.99206, p-value = 0.1153

Conclusion:

  • Based on the Shapiro-Wilk test, the p-value is above the significance level of 0.05 (p-value = 0.1153), so we cannot reject the null hypothesis that Weight is normally distributed. In other words, there is no significant evidence against normality, and it is a reasonable assumption for Weight.

Exercise 1B

Generate a boxplot of Weight by MomSmoke and compare infant birth weights between smoking levels.

boxplot(Weight ~ MomSmoke, bweight,
        main = "Infant Birth Weights and Smoking Levels",
        ylab = "Weights",
        xlab = "Smoking Levels",
        col = "mistyrose",
        border = "red",
        horizontal = FALSE)

Observation:

Exercise 1C

For each level in MomSmoke, perform a Shapiro-Wilk test to check the normality of Weight. Make a conclusion.

Shapiro-Wilk Test

shapiro.test(bweight[bweight$MomSmoke == '0', "Weight"])
## 
##  Shapiro-Wilk normality test
## 
## data:  bweight[bweight$MomSmoke == "0", "Weight"]
## W = 0.99362, p-value = 0.3549

Conclusion:

  • Based on the Shapiro-Wilk test above, the p-value for mothers who don’t smoke (0) is higher than the significance level of 0.05 (p-value = 0.3549). We cannot reject normality, so a normal distribution is a reasonable assumption for Weight in this group.

shapiro.test(bweight[bweight$MomSmoke == '1', "Weight"])
## 
##  Shapiro-Wilk normality test
## 
## data:  bweight[bweight$MomSmoke == "1", "Weight"]
## W = 0.96299, p-value = 0.2

Conclusion:

  • Based on the Shapiro-Wilk test above, the p-value for mothers who do smoke (1) is also higher than the significance level of 0.05 (p-value = 0.2). We cannot reject normality, so a normal distribution is a reasonable assumption for Weight in this group as well.
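
The two per-group calls can also be written as a single step with tapply(). The sketch below runs on simulated data (an assumption) with the same column names as bweight.

```r
# Simulated stand-in for bweight (assumption), with unequal group sizes.
set.seed(2)
demo <- data.frame(
  Weight   = c(rnorm(250, 3400, 500), rnorm(40, 3150, 500)),
  MomSmoke = factor(rep(c(0, 1), times = c(250, 40)))
)

# One Shapiro-Wilk p-value per level of MomSmoke.
p_by_group <- tapply(demo$Weight, demo$MomSmoke,
                     function(w) shapiro.test(w)$p.value)
p_by_group
```

Replacing demo with bweight should reproduce the two p-values reported above.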

Exercise 2A

We want to test if there is a significant difference in birth weights between infants from smoking moms and non-smoking moms.

Perform a hypothesis test of whether infants from smoking moms have different weights than infants from non-smoking moms. Which test do you choose? Use the answer in Exercise 1 for choosing the proper test. Specify null and alternative hypotheses and state your conclusion.

NOTE: If you decide to use the parametric test, perform two-sample t-test rather than ANOVA.

Answer:

Because normality is a reasonable assumption for Weight in both groups, a two-sample t-test will be performed to compare birth weights between infants of mothers who smoke and infants of mothers who don’t. An equal variance test will be conducted first to decide between the pooled and the unequal-variance version of the test.

  • Null hypothesis (Ho): Mean birth weight of infants of mothers who smoke = mean birth weight of infants of mothers who do not smoke

  • Alternative hypothesis (Ha): Mean birth weight of infants of mothers who smoke ≠ mean birth weight of infants of mothers who do not smoke (one of the populations tends to have a larger mean value)

Equal Variance Test (F test)

var.test(Weight ~ MomSmoke, bweight, alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  Weight by MomSmoke
## F = 1.0786, num df = 253, denom df = 40, p-value = 0.8009
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6421109 1.6671729
## sample estimates:
## ratio of variances 
##           1.078555

Observation:

  • Since the p-value is above the significance level of 0.05 (p-value = 0.8009), we cannot reject the hypothesis that the two groups have equal variances. A pooled t-test is therefore appropriate for comparing birth weights.
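
The decision rule used here (pooled test when the F test does not reject equal variances, unequal-variance test otherwise) can be sketched as a small helper. The name choose_t_test and the toy data are assumptions for illustration only.

```r
# Hypothetical helper: pick the pooled or the Welch t-test from the
# outcome of an F test for equal variances.
choose_t_test <- function(formula, data, alpha = 0.05) {
  equal_var <- var.test(formula, data = data)$p.value > alpha
  t.test(formula, data = data, var.equal = equal_var)
}

# Toy data (assumption): identical spread in both groups, so the F test
# does not reject and the pooled test is chosen.
toy <- data.frame(
  y = c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6),
  g = factor(rep(c("a", "b"), each = 5))
)
choose_t_test(y ~ g, toy)$method
```

On the real data the equivalent call would be choose_t_test(Weight ~ MomSmoke, bweight).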

Pooled T-test

t.test(Weight ~ MomSmoke, bweight, alternative = "two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  Weight by MomSmoke
## t = 3.071, df = 293, p-value = 0.002334
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   93.37931 426.65488
## sample estimates:
## mean in group 0 mean in group 1 
##        3422.724        3162.707

Conclusion:

  • Based on the pooled t-test above, the p-value falls below the significance level of 0.05 (p-value = 0.0023). We therefore reject the null hypothesis and conclude that the mean birth weight of infants differs between mothers who smoke and mothers who do not. The sample mean is higher for non-smoking mothers (3422.72 in group 0) than for smoking mothers (3162.71 in group 1), and the 95% confidence interval for the difference (93.38 to 426.65) does not include 0.

Exercise 3A

Now perform one-way ANOVA on Weight with MomSmoke. Check homogeneity of variance assumption. Does it hold and is it okay to perform ANOVA?

aov.weight <- aov(Weight ~ MomSmoke, bweight)
summary(aov.weight)
##              Df   Sum Sq Mean Sq F value  Pr(>F)   
## MomSmoke      1  2386708 2386708   9.431 0.00233 **
## Residuals   293 74151291  253076                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(DescTools)  # provides LeveneTest() and ScheffeTest()
LeveneTest(aov.weight)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.6767 0.4114
##       293

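Conclusion:

  • Levene's test gives a p-value of 0.4114, which is above the significance level of 0.05, so we cannot reject the null hypothesis of equal variances across the MomSmoke levels. The homogeneity of variance assumption holds, and it is okay to perform ANOVA.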

Exercise 3B

Make a conclusion on the effect of MomSmoke. Compare your result with the conclusion of Exercise 2.

ScheffeTest(aov.weight)
## 
##   Posthoc multiple comparisons of means : Scheffe Test 
##     95% family-wise confidence level
## 
## $MomSmoke
##          diff    lwr.ci    upr.ci   pval    
## 1-0 -260.0171 -426.6549 -93.37931 0.0023 ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

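Conclusion:

  • The one-way ANOVA shows a significant effect of MomSmoke on Weight (F = 9.431, p-value = 0.00233). The Scheffe test estimates the mean Weight for mothers who smoke (1) to be 260.02 lower than for mothers who do not smoke (0), with a 95% confidence interval from -426.65 to -93.38. This agrees with Exercise 2: the p-value is the same (0.0023), and the confidence interval is the same with the sign reversed, because the difference is taken as 1-0 here and as 0-1 in the t-test.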
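
With only two groups, a pooled two-sample t-test and a one-way ANOVA are the same test: the F statistic equals the square of the t statistic. The sketch below checks this with the statistics reported in Exercises 2A and 3A, and then on simulated data (an assumption, standing in for bweight).

```r
t_stat <- 3.071  # pooled t-test statistic reported in Exercise 2A
f_stat <- 9.431  # one-way ANOVA F value reported in Exercise 3A
t_stat^2         # 9.431041, matching F up to rounding

# The same identity on simulated data (assumed stand-in for bweight).
set.seed(3)
demo <- data.frame(
  Weight   = c(rnorm(250, 3400, 500), rnorm(40, 3150, 500)),
  MomSmoke = factor(rep(c(0, 1), times = c(250, 40)))
)
t_demo <- t.test(Weight ~ MomSmoke, data = demo, var.equal = TRUE)
f_demo <- summary(aov(Weight ~ MomSmoke, data = demo))[[1]][["F value"]][1]
all.equal(unname(t_demo$statistic)^2, f_demo)
```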


Exercise 4A

Using the Black, Married, Boy, MomSmoke, and Ed variables as possible effects, find the best ANOVA model for Weight. Manually perform backward selection based on the Type 3 SS result with a 0.05 criterion on the p-value. Perform backward selection only with main effects and then check the interaction effects only based on significant main effect terms.

NOTE: For backward selection, you remove variables starting from the least significant one, ONE BY ONE, until there is no variable left with a p-value larger than the criterion.

Write down step by step how you perform backward selection and how you find the final model. Please do NOT include all intermediate tables and graphs in the report. Just describe each step which variable you delete and why.

Answer:

  1. All independent variables (Black, Married, Boy, MomSmoke, and Ed) are entered into the model as main effects, and we run an ANOVA with Type 3 SS
  2. We then remove Ed because it has the highest p-value (0.8625)
  3. With Ed removed, we run another ANOVA with the remaining variables (Black, Married, Boy, and MomSmoke)
  4. We then remove Boy because it has the highest p-value (0.3531)
  5. With Ed and Boy removed, we run another ANOVA with the remaining variables (Black, Married, and MomSmoke)
  6. We then remove Married because it has the highest p-value (0.2530)
  7. With Ed, Boy, and Married removed, MomSmoke and Black are the only remaining variables, and both are significant since their p-values are less than the significance level of 0.05. We therefore check their interaction by running another ANOVA that includes the Black:MomSmoke term.
  8. After running this ANOVA, we conclude that the Black:MomSmoke interaction will not be included because its p-value is larger than the significance level of 0.05. In other words, the final model is: Weight ~ Black + MomSmoke
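
The main-effects part of the steps above can be sketched as a loop. This is an illustration, not the code used for the report: it relies on drop1() from base R, whose marginal F tests coincide with Type 3 tests when the model contains main effects only, and it runs on simulated data. The function name backward_select and the simulated effect sizes are assumptions.

```r
# Hypothetical helper: drop the least significant main effect, one at a
# time, until every remaining term has a p-value at or below alpha.
backward_select <- function(data, response, candidates, alpha = 0.05) {
  kept <- candidates
  while (length(kept) > 0) {
    fit   <- lm(reformulate(kept, response = response), data = data)
    tests <- drop1(fit, test = "F")  # marginal F test for each term
    pvals <- setNames(tests[["Pr(>F)"]][-1], rownames(tests)[-1])
    if (max(pvals) <= alpha) break   # every remaining term is significant
    kept <- setdiff(kept, names(which.max(pvals)))
  }
  kept
}

# Simulated data (assumption): only Black and MomSmoke affect Weight.
set.seed(6)
n <- 300
demo <- data.frame(
  Black    = factor(rbinom(n, 1, 0.2)),
  Married  = factor(rbinom(n, 1, 0.6)),
  Boy      = factor(rbinom(n, 1, 0.5)),
  MomSmoke = factor(rbinom(n, 1, 0.2)),
  Ed       = factor(rbinom(n, 1, 0.5))
)
demo$Weight <- 3400 - 300 * (demo$Black == "1") -
  300 * (demo$MomSmoke == "1") + rnorm(n, sd = 200)

final_terms <- backward_select(
  demo, "Weight", c("Black", "Married", "Boy", "MomSmoke", "Ed")
)
final_terms
```

The interaction check in steps 7 and 8 would be a separate fit that adds the interaction of the surviving terms.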

Exercise 4B

Specify the final model and report the amount of variation explained by the model. Also, check the Normality assumption through diagnostics plots.

aov.weight2 <- aov(Weight ~ Black + MomSmoke, bweight)
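# Note: summary() on an aov object has no 'type' argument, so 'type = 3' is ignored
# and the table below reports sequential (Type I) sums of squares.
summary(aov.weight2, type = 3)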
##              Df   Sum Sq Mean Sq F value  Pr(>F)    
## Black         1  3530450 3530450   14.62 0.00016 ***
## MomSmoke      1  2513301 2513301   10.41 0.00140 ** 
## Residuals   292 70494249  241419                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table(bweight$Black)
## 
##   0   1 
## 246  49
table(bweight$MomSmoke)
## 
##   0   1 
## 254  41

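Observation:

  • The groups are unbalanced: 246 vs. 49 observations for Black and 254 vs. 41 observations for MomSmoke.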

lm.res <- lm(Weight ~ Black + MomSmoke, bweight)
summary(lm.res)$r.squared
## [1] 0.07896405
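
The same R-squared value can be recovered by hand from the sums of squares in the ANOVA table for the final model, as the share of the total sum of squares that is not residual. A short arithmetic sketch using the values printed in that table:

```r
# Sums of squares copied from the ANOVA table of Weight ~ Black + MomSmoke.
ss_black    <- 3530450
ss_momsmoke <- 2513301
ss_resid    <- 70494249

ss_model  <- ss_black + ss_momsmoke
r_squared <- ss_model / (ss_model + ss_resid)
r_squared  # about 0.07896, in line with summary(lm.res)$r.squared
```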

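Conclusion:

  • The final model is Weight ~ Black + MomSmoke. Its R-squared is 0.079, so the model explains about 7.9% of the variation in Weight.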

Q-Q Plot

plot(aov.weight2, 2)

Observation:

  • Based on the Q-Q plot above, the residuals of the final model appear to be approximately normally distributed, because the points stay close to the reference line. The normality assumption is reasonable for this model.
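
The visual check can be complemented with numeric checks on the model residuals. The sketch below fits the same model form to simulated data (an assumption, including the effect sizes) and reports the Q-Q correlation and a Shapiro-Wilk p-value for the residuals.

```r
# Simulated stand-in for bweight (assumption), same model form as above.
set.seed(5)
demo <- data.frame(
  Black    = factor(rbinom(300, 1, 0.17)),
  MomSmoke = factor(rbinom(300, 1, 0.14))
)
demo$Weight <- 3450 - 290 * (demo$Black == "1") -
  270 * (demo$MomSmoke == "1") + rnorm(300, sd = 490)

fit <- aov(Weight ~ Black + MomSmoke, data = demo)
res <- residuals(fit)

qq <- qqnorm(res, plot.it = FALSE)  # quantile pairs behind the Q-Q plot
cor(qq$x, qq$y)                     # close to 1 when residuals look normal
shapiro.test(res)$p.value           # formal test on the residuals
```

On the real data, the equivalent check would be shapiro.test(residuals(aov.weight2)).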

Exercise 4C

State conclusions about significant differences in Weight across groups. For each significant variable, state specifically which level has a larger or smaller mean value of Weight.

ScheffeTest(aov.weight2)
## 
##   Posthoc multiple comparisons of means : Scheffe Test 
##     95% family-wise confidence level
## 
## $Black
##          diff    lwr.ci    upr.ci  pval    
## 1-0 -293.9412 -483.0575 -104.8249 8e-04 ***
## 
## $MomSmoke
##         diff    lwr.ci    upr.ci   pval    
## 1-0 -266.763 -470.2261 -63.29987 0.0060 ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

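Conclusion:

  • Both Black and MomSmoke show significant differences in mean Weight. For Black, the mean Weight of group 1 is 293.94 lower than that of group 0 (95% confidence interval -483.06 to -104.82, p-value = 0.0008), so group 0 has the larger mean.

  • For MomSmoke, the mean Weight for mothers who smoke (1) is 266.76 lower than for mothers who do not smoke (0) (95% confidence interval -470.23 to -63.30, p-value = 0.0060), so infants of non-smoking mothers have the larger mean Weight.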