You need to download the dataset birthweight.csv for Exercises 1-4. The birthweight data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are a total of 400 observations in birthweight, and the variables are:
Weight: Infant birth weight (grams)
Black: Categorical variable; 0 is white, 1 is black
Married: Categorical variable; 0 is not married, 1 is married
Boy: Categorical variable; 0 is girl, 1 is boy
MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom
Ed: Categorical variable for mother's education level; 0 is high-school grad or less, 1 is college grad or above
Generate a boxplot for infant birth weight (Weight) and comment on the general features of the distribution. Generate a normal QQ-plot and perform the Shapiro-Wilk test to check whether normality is a reasonable assumption for Weight. Make a conclusion.
boxplot(bweight$Weight,
        main = "Infant Birth Weight",
        ylab = "Weight",
        col = "#A4DEFE",
        border = "#526E7E",
        horizontal = FALSE)
Conclusion:
qqnorm(bweight$Weight)
qqline(bweight$Weight, col = "red")
Conclusion:
shapiro.test(bweight$Weight)
##
## Shapiro-Wilk normality test
##
## data: bweight$Weight
## W = 0.99206, p-value = 0.1153
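Conclusion: The Shapiro-Wilk p-value (0.1153) is greater than 0.05, so we fail to reject the null hypothesis of normality. Normality is a reasonable assumption for Weight.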
Generate a boxplot of Weight by MomSmoke and compare infant birth weights between smoking levels.
boxplot(Weight ~ MomSmoke, bweight,
        main = "Infant Birth Weights and Smoking Levels",
        ylab = "Weights",
        xlab = "Smoking Levels",
        col = "mistyrose",
        border = "red",
        horizontal = FALSE)
Observation:
For each level in MomSmoke, perform the Shapiro-Wilk test to check the normality of Weight. Make a conclusion.
shapiro.test(bweight[bweight$MomSmoke == '0', "Weight"])
##
## Shapiro-Wilk normality test
##
## data: bweight[bweight$MomSmoke == "0", "Weight"]
## W = 0.99362, p-value = 0.3549
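Conclusion: For non-smoking mothers (MomSmoke = 0), the p-value (0.3549) is greater than 0.05, so we fail to reject normality of Weight in this group.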
shapiro.test(bweight[bweight$MomSmoke == '1', "Weight"])
##
## Shapiro-Wilk normality test
##
## data: bweight[bweight$MomSmoke == "1", "Weight"]
## W = 0.96299, p-value = 0.2
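Conclusion: For smoking mothers (MomSmoke = 1), the p-value (0.2) is greater than 0.05, so we fail to reject normality of Weight in this group.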
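The two per-level tests above can also be run in a single call. The following is a minimal sketch on simulated data, not the real birthweight file: the data frame `sim` and its values are hypothetical stand-ins, with group sizes and approximate means borrowed from the outputs in this report.

```r
# Simulated stand-in for bweight (hypothetical values, not the real data)
set.seed(2)
sim <- data.frame(
  MomSmoke = rep(c(0, 1), times = c(254, 41)),
  Weight   = c(rnorm(254, mean = 3420, sd = 500),
               rnorm(41,  mean = 3160, sd = 500))
)

# Apply shapiro.test() to Weight within each MomSmoke level
# and keep only the p-value from each test
p_by_group <- tapply(sim$Weight, sim$MomSmoke,
                     function(w) shapiro.test(w)$p.value)
print(p_by_group)

# TRUE where normality is not rejected at the 0.05 level
print(p_by_group > 0.05)
```

With the real data, replacing `sim` by `bweight` should reproduce the two p-values reported above.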
We want to test whether there is a significant difference in birth weights between infants of smoking moms and infants of non-smoking moms.
Perform a hypothesis test of whether infants from smoking moms have different weights than infants from non-smoking moms. Which test do you choose? Use the answer in Exercise 1 for choosing the proper test. Specify null and alternative hypotheses and state your conclusion.
NOTE: If you decide to use the parametric test, perform a two-sample t-test rather than ANOVA.
Because normality is not rejected for either group (Exercise 1), a two-sample t-test will be performed on the birth weights of infants whose mothers smoke and infants whose mothers do not. An F test for equality of variances is conducted first to decide between the pooled (equal-variance) and Welch (unequal-variance) versions of the test.
Null hypothesis (Ho): Mean birth weight of infants whose mothers smoke = mean birth weight of infants whose mothers do not smoke
Alternative hypothesis (Ha): Mean birth weight of infants whose mothers smoke ≠ mean birth weight of infants whose mothers do not smoke (one of the populations has a larger mean)
var.test(Weight ~ MomSmoke, bweight, alternative = "two.sided")
##
## F test to compare two variances
##
## data: Weight by MomSmoke
## F = 1.0786, num df = 253, denom df = 40, p-value = 0.8009
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6421109 1.6671729
## sample estimates:
## ratio of variances
## 1.078555
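Observation: The p-value (0.8009) is greater than 0.05, so we fail to reject the hypothesis of equal variances. The pooled two-sample t-test (var.equal = TRUE) is therefore used.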
t.test(Weight ~ MomSmoke, bweight, alternative = "two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: Weight by MomSmoke
## t = 3.071, df = 293, p-value = 0.002334
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 93.37931 426.65488
## sample estimates:
## mean in group 0 mean in group 1
## 3422.724 3162.707
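Conclusion: The p-value (0.002334) is less than 0.05, so we reject Ho. The mean birth weight differs between infants of non-smoking mothers (3422.724) and infants of smoking mothers (3162.707).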
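The two-step procedure used in this exercise (variance test first, then the matching t-test) can be sketched as follows. This is an illustration on simulated data: `sim` and its values are hypothetical, and the 0.05 cut-off is the one used throughout this report.

```r
# Simulated stand-in for bweight (hypothetical values, not the real data)
set.seed(3)
sim <- data.frame(
  MomSmoke = factor(rep(c(0, 1), times = c(254, 41))),
  Weight   = c(rnorm(254, mean = 3420, sd = 500),
               rnorm(41,  mean = 3160, sd = 500))
)

# Step 1: F test for equality of the two group variances
f_res <- var.test(Weight ~ MomSmoke, data = sim)
equal_var <- f_res$p.value > 0.05

# Step 2: pooled t-test if equal variances are not rejected, Welch otherwise
t_res <- t.test(Weight ~ MomSmoke, data = sim,
                alternative = "two.sided", var.equal = equal_var)

print(t_res$method)
print(t_res$p.value)
```

Passing the logical `equal_var` straight into `var.equal` keeps the choice of test tied to the variance check instead of hard-coding it.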
Now perform one-way ANOVA on Weight with MomSmoke. Check homogeneity of variance assumption. Does it hold and is it okay to perform ANOVA?
aov.weight <- aov(Weight ~ MomSmoke, bweight)
summary(aov.weight)
## Df Sum Sq Mean Sq F value Pr(>F)
## MomSmoke 1 2386708 2386708 9.431 0.00233 **
## Residuals 293 74151291 253076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(DescTools)  # assumed source of LeveneTest() and ScheffeTest()
LeveneTest(aov.weight)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.6767 0.4114
## 293
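Conclusion: The Levene's test p-value (0.4114) is greater than 0.05, so we fail to reject homogeneity of variance. The assumption holds and it is okay to perform ANOVA.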
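For intuition about what the median-centered Levene's test computes, it can be reproduced in base R as a one-way ANOVA on absolute deviations from the group medians. The sketch below uses simulated data (`sim` and its values are hypothetical) and is only meant to illustrate the idea, not to replace the call above.

```r
# Simulated stand-in for bweight (hypothetical values, not the real data)
set.seed(7)
sim <- data.frame(
  MomSmoke = factor(rep(c(0, 1), times = c(254, 41))),
  Weight   = c(rnorm(254, mean = 3420, sd = 500),
               rnorm(41,  mean = 3160, sd = 500))
)

# Absolute deviation of each observation from its own group median
group_median <- ave(sim$Weight, sim$MomSmoke, FUN = median)
sim$abs_dev  <- abs(sim$Weight - group_median)

# Levene's test (center = median) is the F test from this one-way ANOVA
lev <- anova(lm(abs_dev ~ MomSmoke, data = sim))
print(lev)
```

The degrees of freedom in the resulting table (1 and 293) match those in the Levene output above because the simulated group sizes are the same.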
Make a conclusion on the effect of MomSmoke. Compare your result with the conclusion of Exercise 2.
ScheffeTest(aov.weight)
##
## Posthoc multiple comparisons of means : Scheffe Test
## 95% family-wise confidence level
##
## $MomSmoke
## diff lwr.ci upr.ci pval
## 1-0 -260.0171 -426.6549 -93.37931 0.0023 **
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion:
Based on the Scheffe test conducted above, the p-value (0.0023) is below the significance level of 0.05. Thus, MomSmoke has a significant effect on birth weight: the two groups have significantly different means.
Similarly, in Exercise 2 the p-value was 0.0023, and we concluded that the mean birth weights for mothers who do not smoke (0) and mothers who do smoke (1) are not the same. The Scheffe test result shows that infants of mothers who do not smoke (0) have a significantly higher mean weight than infants of mothers who do smoke (1), since the difference (1-0) is negative (-260.0171). In other words: mean Weight for MomSmoke = 0 > mean Weight for MomSmoke = 1.
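The agreement between Exercise 2 and the ANOVA is expected: with two groups and a pooled variance, the ANOVA F statistic equals the square of the two-sample t statistic. The sketch below checks this first with the rounded values reported above, then on simulated data (`sim` is a hypothetical stand-in, not the real file).

```r
# Rounded statistics copied from the outputs in this report
t_stat <- 3.071   # two-sample t-test
f_stat <- 9.431   # one-way ANOVA
print(t_stat^2)   # agrees with f_stat up to rounding

# Same identity on simulated data
set.seed(4)
sim <- data.frame(
  MomSmoke = factor(rep(c(0, 1), times = c(254, 41))),
  Weight   = c(rnorm(254, mean = 3420, sd = 500),
               rnorm(41,  mean = 3160, sd = 500))
)
t_sim <- unname(t.test(Weight ~ MomSmoke, data = sim, var.equal = TRUE)$statistic)
f_sim <- summary(aov(Weight ~ MomSmoke, data = sim))[[1]][["F value"]][1]
print(all.equal(t_sim^2, f_sim))
```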
Using Black, Married, Boy, and MomSmoke, and Ed variables as possible effects, find the best ANOVA model for Weight. Manually perform backward selection based on Type 3 SS result with 0.05 criteria on p-value. Perform backward selection only with main effects and then check the interaction effects only based on significant main effect terms.
NOTE: For backward selection, you remove a variable from the least significant one, ONE BY ONE, until there is no more variable with a p-value larger than the criteria.
Write down step by step how you perform backward selection and how you find the final model. Please do NOT include all intermediate tables and graphs in the report. Just describe each step which variable you delete and why.
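The manual procedure described in the note can be sketched as a loop. This is an illustration on simulated data, not the analysis behind this report: `sim`, its probabilities, and its effect sizes are hypothetical. It relies on `drop1()`, whose marginal F tests coincide with Type 3 tests when the model contains main effects only.

```r
# Simulated stand-in for bweight (hypothetical values, not the real data)
set.seed(5)
n <- 295
sim <- data.frame(
  Black    = factor(rbinom(n, 1, 0.17)),
  Married  = factor(rbinom(n, 1, 0.70)),
  Boy      = factor(rbinom(n, 1, 0.50)),
  MomSmoke = factor(rbinom(n, 1, 0.14)),
  Ed       = factor(rbinom(n, 1, 0.50))
)
sim$Weight <- 3450 - 300 * (sim$Black == "1") -
  260 * (sim$MomSmoke == "1") + rnorm(n, sd = 490)

fit <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed, data = sim)

repeat {
  # Stop if no terms are left to test
  if (length(attr(terms(fit), "term.labels")) == 0) break
  tab <- drop1(fit, test = "F")          # marginal F test for each term
  p_vals     <- tab[["Pr(>F)"]][-1]      # first row is "<none>"
  candidates <- rownames(tab)[-1]
  if (max(p_vals) <= 0.05) break         # every remaining term is significant
  worst <- candidates[which.max(p_vals)]
  cat("Removing", worst, "with p =", round(max(p_vals), 4), "\n")
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}
print(formula(fit))
```

Only one term is removed per pass, so each p-value is recomputed after every deletion, as the note requires.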
We start with all five main effects (Black, Married, Boy, MomSmoke, and Ed) and fit a main-effects ANOVA.
We remove Ed because it has the highest p-value (0.8625).
With Ed removed, we fit another ANOVA with the remaining variables (Black, Married, Boy, and MomSmoke).
We remove Boy because it has the highest p-value (0.3531).
With Ed and Boy removed, we fit another ANOVA with the remaining variables (Black, Married, and MomSmoke).
We remove Married because it has the highest p-value (0.2530).
With Ed, Boy, and Married removed, MomSmoke and Black are the only remaining variables, and both are significant because their p-values are less than the significance level of 0.05. We therefore check their interaction by fitting another ANOVA that includes the interaction effect.
The interaction Black:MomSmoke will not be included because its p-value is larger than the significance level of 0.05. In other words, the final model is: Weight ~ Black + MomSmoke
Specify the final model and report the amount of variation explained by the model. Also, check the Normality assumption through diagnostics plots.
aov.weight2 <- aov(Weight ~ Black + MomSmoke, bweight)
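# Note: summary() on an aov object has no type argument (type = 3 was silently ignored),
# so the table below reports sequential (Type I) sums of squares
summary(aov.weight2)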
## Df Sum Sq Mean Sq F value Pr(>F)
## Black 1 3530450 3530450 14.62 0.00016 ***
## MomSmoke 1 2513301 2513301 10.41 0.00140 **
## Residuals 292 70494249 241419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
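Because the groups are unbalanced, it is worth keeping in mind that `summary()` on an `aov` fit reports sequential sums of squares, which depend on the order of the terms. The sketch below, on simulated data (`sim` and its values are hypothetical), compares both term orders with the marginal tests from `drop1()`, which match Type 3 tests for a main-effects-only model.

```r
# Simulated unbalanced stand-in for bweight (hypothetical values)
set.seed(6)
n <- 295
sim <- data.frame(
  Black    = factor(rbinom(n, 1, 0.17)),
  MomSmoke = factor(rbinom(n, 1, 0.14))
)
sim$Weight <- 3450 - 300 * (sim$Black == "1") -
  260 * (sim$MomSmoke == "1") + rnorm(n, sd = 490)

# Sequential SS: the first term is not adjusted for the second
seq_ab <- summary(aov(Weight ~ Black + MomSmoke, data = sim))[[1]]
seq_ba <- summary(aov(Weight ~ MomSmoke + Black, data = sim))[[1]]

# Marginal SS: each term is adjusted for the other
marg <- drop1(lm(Weight ~ Black + MomSmoke, data = sim), test = "F")

print(seq_ab)
print(seq_ba)
print(marg)
```

In each sequential table only the last term's sum of squares agrees with its marginal value, so the first row can differ from a Type 3 result when the data are unbalanced.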
table(bweight$Black)
##
## 0 1
## 246 49
table(bweight$MomSmoke)
##
## 0 1
## 254 41
Observation:
Black and MomSmoke contain unbalanced data.
lm.res <- lm(Weight ~ Black + MomSmoke, bweight)
summary(lm.res)$r.squared
## [1] 0.07896405
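As a cross-check, the R-squared value above can be recovered from the sums of squares in the ANOVA table for the final model, since R² = 1 - SS_residual / SS_total. The numbers below are copied from that table; the variable names are only illustrative.

```r
# Sums of squares copied from the ANOVA table for Weight ~ Black + MomSmoke
ss_black    <- 3530450
ss_momsmoke <- 2513301
ss_resid    <- 70494249

# Total SS is the sum of all rows of the table
ss_total <- ss_black + ss_momsmoke + ss_resid

# Proportion of variation in Weight explained by the model
r2 <- 1 - ss_resid / ss_total
print(r2)
```

The result agrees with `summary(lm.res)$r.squared` up to the rounding of the printed sums of squares, which is roughly 7.9% of the variation in Weight.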
Conclusion:
About 7.9% of the Weight variation can be explained by the final model (Weight ~ Black + MomSmoke), which is fairly low.
plot(aov.weight2, 2)
Observation:
State conclusions about significant differences in Weight across groups. For each significant variable, state specifically which level has a larger or smaller mean value of Weight.
ScheffeTest(aov.weight2)
##
## Posthoc multiple comparisons of means : Scheffe Test
## 95% family-wise confidence level
##
## $Black
## diff lwr.ci upr.ci pval
## 1-0 -293.9412 -483.0575 -104.8249 8e-04 ***
##
## $MomSmoke
## diff lwr.ci upr.ci pval
## 1-0 -266.763 -470.2261 -63.29987 0.0060 **
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion:
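Black: Black has a significant effect on Weight, due to a p-value less than 0.05 (8e-04). The difference (1-0) is -293.9412, so infants of black mothers (1) have a smaller mean Weight than infants of white mothers (0).
MomSmoke: MomSmoke has a significant effect on Weight, due to a p-value less than 0.05 (0.0060). The difference (1-0) is -266.763, so infants of smoking mothers (1) have a smaller mean Weight than infants of non-smoking mothers (0).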