boxplot(bweight$Weight,
        main = "Infant Birth Weight",
        ylab = "Weight (grams)")
Based on the boxplot, the data appear roughly symmetric, which is consistent with a normal distribution. There are three outliers below the lower whisker and one above the upper whisker.
qqnorm(bweight$Weight)
qqline(bweight$Weight)
The Q-Q plot also suggests an approximately normal distribution.
shapiro.test(bweight$Weight)
##
## Shapiro-Wilk normality test
##
## data: bweight$Weight
## W = 0.99206, p-value = 0.1153
The p-value (0.1153) is above the 0.05 significance level, so we cannot reject the null hypothesis that Weight is normally distributed.
boxplot(Weight ~ MomSmoke, bweight,
        main = "Infant Birth Weight and Smoking",
        ylab = "Weight (grams)",
        xlab = "Smoking")
The boxplot shows a difference in infant birth weights between mothers who smoke (1) and those who do not (0). Infants of mothers who smoke have a smaller range of birth weights. Both groups appear approximately normally distributed.
shapiro.test(bweight[bweight$MomSmoke == "0", "Weight"])
##
## Shapiro-Wilk normality test
##
## data: bweight[bweight$MomSmoke == "0", "Weight"]
## W = 0.99362, p-value = 0.3549
The p-value for mothers who do not smoke (0) is 0.3549, which is above 0.05, so we cannot reject normality for this group.
shapiro.test(bweight[bweight$MomSmoke == "1", "Weight"])
##
## Shapiro-Wilk normality test
##
## data: bweight[bweight$MomSmoke == "1", "Weight"]
## W = 0.96299, p-value = 0.2
The p-value for mothers who smoke (1) is 0.2, which is also above 0.05, so we cannot reject normality for this group either.
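The two per-group checks above can also be run in a single call. The following is a minimal sketch, assuming a simulated stand-in data frame `sim` (hypothetical values, not the real bweight data):

```r
# Simulated stand-in for bweight (hypothetical values, not the real data).
set.seed(7)
sim <- data.frame(
  MomSmoke = factor(rep(c("0", "1"), times = c(250, 40))),
  Weight   = c(rnorm(250, mean = 3400, sd = 500),
               rnorm(40,  mean = 3150, sd = 500))
)

# Shapiro-Wilk p-value for Weight within each level of MomSmoke.
p_by_group <- tapply(sim$Weight, sim$MomSmoke,
                     function(x) shapiro.test(x)$p.value)
p_by_group
```

Each element of `p_by_group` is read the same way as the individual tests: a value above 0.05 means normality is not rejected for that group.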
Based on the results from Exercise 1, there are two independent samples, and normality cannot be rejected for either. A two-sample t-test is therefore the best choice.
Null Hypothesis (Ho): mean weight for mothers who smoke = mean weight for mothers who do not smoke
Alternative Hypothesis (Ha): mean weight for mothers who smoke ≠ mean weight for mothers who do not smoke
A test for equality of variances will show which version of the t-test to run.
var.test(Weight ~ MomSmoke, bweight, alternative = "two.sided")
##
## F test to compare two variances
##
## data: Weight by MomSmoke
## F = 1.0786, num df = 253, denom df = 40, p-value = 0.8009
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6421109 1.6671729
## sample estimates:
## ratio of variances
## 1.078555
Because the p-value (0.8009) is above 0.05, we cannot reject the hypothesis of equal variances, so a pooled t-test is the best option.
t.test(Weight ~ MomSmoke, bweight, alternative = "two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: Weight by MomSmoke
## t = 3.071, df = 293, p-value = 0.002334
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 93.37931 426.65488
## sample estimates:
## mean in group 0 mean in group 1
## 3422.724 3162.707
The pooled t-test gives a p-value of 0.002334, which is below 0.05, so we reject the null hypothesis (Ho). The mean birth weight for mothers who smoke (3162.7 grams) is not equal to the mean for mothers who do not smoke (3422.7 grams).
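The decision rule used here (F test for equal variances first, then the pooled or Welch t-test accordingly) can be written as one small function. This is only a sketch: `pick_t_test` is a hypothetical helper name, and `sim` is simulated stand-in data rather than the real bweight data.

```r
# Simulated stand-in for bweight (hypothetical values, not the real data).
set.seed(123)
sim <- data.frame(
  MomSmoke = factor(rep(c("0", "1"), times = c(250, 40))),
  Weight   = c(rnorm(250, mean = 3400, sd = 500),
               rnorm(40,  mean = 3150, sd = 500))
)

# Hypothetical helper: run the F test for equal variances, then the
# pooled t-test if equality is not rejected, otherwise Welch's t-test.
pick_t_test <- function(formula, data, alpha = 0.05) {
  equal_var <- var.test(formula, data = data)$p.value > alpha
  t.test(formula, data = data, var.equal = equal_var)
}

res <- pick_t_test(Weight ~ MomSmoke, sim)
res$method   # names the test that was actually run
res$p.value
```

Keeping the variance check inside the function makes the choice of test reproducible instead of relying on a manual reading of the `var.test()` output.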
aov.weight = aov(Weight ~ MomSmoke, bweight)
summary(aov.weight)
## Df Sum Sq Mean Sq F value Pr(>F)
## MomSmoke 1 2386708 2386708 9.431 0.00233 **
## Residuals 293 74151291 253076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(DescTools)  # provides LeveneTest() and ScheffeTest()
LeveneTest(aov.weight)
Based on Levene's test, the p-value is greater than 0.05, so we cannot reject the null hypothesis of equal variances. The equal-variance assumption holds, and the ANOVA results above can be interpreted.
ScheffeTest(aov.weight)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $MomSmoke
## diff lwr.ci upr.ci pval
## 1-0 -260.0171 -426.6549 -93.37931 0.0023 **
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the Scheffe test, the p-value (0.0023) is below 0.05, so the mean weights of the two MomSmoke groups differ significantly. Infants of mothers who do not smoke weigh about 260 grams more on average than infants of mothers who smoke.
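With only two groups, the pooled t-test and the one-way ANOVA are the same test: the F statistic is the square of the t statistic and the p-values coincide. In the outputs above, 3.071^2 is approximately 9.431, and both p-values are 0.00233. The sketch below checks this identity on simulated data (hypothetical values, not the real bweight data):

```r
# Simulated two-group data (hypothetical values, not the real bweight data).
set.seed(42)
sim <- data.frame(
  group = factor(rep(c("0", "1"), times = c(60, 20))),
  y     = c(rnorm(60, mean = 3400, sd = 500),
            rnorm(20, mean = 3150, sd = 500))
)

tt      <- t.test(y ~ group, data = sim, var.equal = TRUE)  # pooled t-test
f_table <- summary(aov(y ~ group, data = sim))[[1]]         # one-way ANOVA table

t_squared <- unname(tt$statistic)^2
f_value   <- f_table[["F value"]][1]

all.equal(t_squared, f_value)                   # t^2 equals F
all.equal(tt$p.value, f_table[["Pr(>F)"]][1])   # identical p-values
```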
Run a multi-way ANOVA with all independent variables (Black, Married, Boy, MomSmoke, and Ed).
Remove the variable with the highest p-value: Ed (0.8626).
Run the ANOVA again with the remaining variables (Black, Married, Boy, MomSmoke).
Remove the variable with the highest p-value: Married (0.6157).
Run the ANOVA again with the remaining variables (Black, Boy, MomSmoke).
Remove the variable with the highest p-value: Boy (0.3888).
The remaining two variables (Black and MomSmoke) both have p-values below 0.05, so they are the only significant variables. We now need to check their interaction to get the final model, so we run a two-way ANOVA with only Black, MomSmoke, and their interaction.
We conclude that the interaction will not be included in the final model because its p-value is not significant.
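The backward-elimination steps above can be sketched as a loop. This is an illustration under stated assumptions, not the code used for this report: `backward_eliminate` is a hypothetical helper, `sim` is simulated data with the same column names, and the p-values come from `drop1()` marginal F tests, which may differ from how the p-values quoted above were obtained.

```r
# Simulated stand-in for bweight (hypothetical values, not the real data).
set.seed(1)
n <- 300
sim <- data.frame(
  Black    = factor(rbinom(n, 1, 0.20)),
  Married  = factor(rbinom(n, 1, 0.70)),
  Boy      = factor(rbinom(n, 1, 0.50)),
  MomSmoke = factor(rbinom(n, 1, 0.15)),
  Ed       = factor(rbinom(n, 1, 0.50))
)
sim$Weight <- 3400 - 300 * (sim$Black == "1") -
  260 * (sim$MomSmoke == "1") + rnorm(n, sd = 500)

# Hypothetical helper: repeatedly drop the term with the largest p-value
# until every remaining term is significant at level alpha.
backward_eliminate <- function(data, response, terms, alpha = 0.05) {
  repeat {
    fit   <- lm(reformulate(terms, response = response), data = data)
    tests <- drop1(fit, test = "F")       # marginal F test for each term
    pvals <- tests[["Pr(>F)"]][-1]        # skip the "<none>" row
    names(pvals) <- rownames(tests)[-1]
    if (length(terms) == 1 || max(pvals) <= alpha) break
    terms <- setdiff(terms, names(pvals)[which.max(pvals)])
  }
  fit
}

final <- backward_eliminate(sim, "Weight",
                            c("Black", "Married", "Boy", "MomSmoke", "Ed"))
formula(final)   # terms that survived the elimination
```

The interaction check described above would be a separate step after the loop, for example by fitting `Weight ~ Black * MomSmoke` and inspecting the interaction row.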
aov.weight2 = aov(Weight ~ Black + MomSmoke, bweight)
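# summary.aov() has no type argument, so type = 3 is ignored;
# the sums of squares below are sequential (Type I).
summary(aov.weight2, type = 3)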
## Df Sum Sq Mean Sq F value Pr(>F)
## Black 1 3530450 3530450 14.62 0.00016 ***
## MomSmoke 1 2513301 2513301 10.41 0.00140 **
## Residuals 292 70494249 241419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table(bweight$Black)
##
## 0 1
## 246 49
table(bweight$MomSmoke)
##
## 0 1
## 254 41
Based on the frequency tables, both factors are unbalanced: 246 versus 49 observations for Black and 254 versus 41 for MomSmoke.
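With unbalanced data, the sums of squares depend on the order in which terms enter the model. `summary()` on an `aov` fit reports sequential (Type I) sums of squares, while `drop1()` in base R tests each term after adjusting for all the others; for a model without interactions this matches the Type II and Type III tests. A minimal sketch on simulated data (hypothetical values, not the real bweight data):

```r
# Simulated unbalanced stand-in for bweight (hypothetical values).
set.seed(2024)
n <- 300
sim <- data.frame(
  Black    = factor(rbinom(n, 1, 0.17)),
  MomSmoke = factor(rbinom(n, 1, 0.14))
)
sim$Weight <- 3450 - 290 * (sim$Black == "1") -
  260 * (sim$MomSmoke == "1") + rnorm(n, sd = 500)

fit <- aov(Weight ~ Black + MomSmoke, data = sim)

sequential <- summary(fit)[[1]]       # Type I: adjusted for earlier terms only
marginal   <- drop1(fit, test = "F")  # each term adjusted for all other terms

seq_ss  <- setNames(sequential[["Sum Sq"]][1:2], c("Black", "MomSmoke"))
marg_ss <- setNames(marginal[["Sum of Sq"]][-1], rownames(marginal)[-1])

rbind(sequential = seq_ss, marginal = marg_ss)
```

The last term entered (MomSmoke here) has the same sum of squares in both rows, because it is already adjusted for everything before it. The first term generally differs when the design is unbalanced.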
lm.res = lm(Weight ~ Black + MomSmoke, bweight)
summary(lm.res)$r.squared
## [1] 0.07896405
The linear model shows that about 7.9% of the variation in Weight is explained by the final model.
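The same R-squared can be recovered from the ANOVA table, since R-squared is the model sum of squares divided by the total sum of squares. The sketch below uses only the sums of squares printed in the ANOVA output above, so it agrees with `summary(lm.res)$r.squared` up to rounding of those printed values:

```r
# Sums of squares copied from the printed ANOVA table for aov.weight2.
ss_black    <- 3530450
ss_momsmoke <- 2513301
ss_resid    <- 70494249

ss_total  <- ss_black + ss_momsmoke + ss_resid
r_squared <- (ss_black + ss_momsmoke) / ss_total
r_squared
```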
plot(aov.weight2, 2)
The Q-Q plot of the residuals suggests that they are approximately normally distributed.
ScheffeTest(aov.weight2)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $Black
## diff lwr.ci upr.ci pval
## 1-0 -293.9412 -483.0575 -104.8249 8e-04 ***
##
## $MomSmoke
## diff lwr.ci upr.ci pval
## 1-0 -266.763 -470.2261 -63.29987 0.0060 **
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Black: The African American group (1) has a lower mean birth weight than the Caucasian group (0), by about 294 grams. The difference is significant (p = 0.0008).
MomSmoke: Mothers who smoke (1) have a lower mean infant birth weight than mothers who do not smoke (0), by about 267 grams. The difference is significant (p = 0.0060).