boxplot(bweight$Weight,
main = "Distribution of Infant Birth Weight",
ylab = "Weight",
col = "red",
border = "blue",
horizontal = FALSE )
Based on the boxplot, the distribution of Weight looks roughly symmetric, which is consistent with a normal distribution. The boxplot also shows a couple of outliers below the lower whisker and one outlier above the upper whisker.
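The points that boxplot() draws individually lie beyond the whiskers, not beyond the minimum and maximum of the data. A minimal sketch of that rule follows, using a made-up vector `x` (hypothetical values, not the bweight data) and the default whisker coefficient of 1.5:

```r
# Toy data (hypothetical values, not taken from bweight)
x <- c(1, 10, 11, 12, 13, 14, 15, 30)

# boxplot.stats() computes the box, whisker, and outlier values that boxplot() draws
bs <- boxplot.stats(x, coef = 1.5)

hinges   <- bs$stats[c(2, 4)]                      # lower and upper hinge (approx. Q1 and Q3)
fences   <- hinges + c(-1.5, 1.5) * diff(hinges)   # limits beyond which a point is an outlier
whiskers <- bs$stats[c(1, 5)]                      # most extreme data points inside the fences
outliers <- bs$out                                 # data points outside the fences

print(fences)
print(whiskers)
print(outliers)
```

Running `boxplot.stats(bweight$Weight)$out` in the same way should list the actual outlying weights, assuming `bweight` is loaded as in the rest of this report.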
qqnorm(bweight$Weight)
qqline(bweight$Weight, col = "red")
Based on the Q-Q plot, the points follow the Q-Q line closely, so the data can be assumed to follow a normal distribution. There is a small left tail, but it is not enough evidence to conclude that the data depart from normality.
shapiro.test(bweight$Weight)
##
## Shapiro-Wilk normality test
##
## data: bweight$Weight
## W = 0.99206, p-value = 0.1153
Based on the Shapiro-Wilk test, the p-value (0.1153) is above the .05 significance level, so we do not reject the null hypothesis of normality. There is not enough evidence to say that Weight departs from a normal distribution.
boxplot(Weight ~ MomSmoke, bweight,
main = "Weight Vs. Does the Mom Smoke?",
ylab = "Weight",
xlab = "Smoke?",
col = "red",
border = "blue",
horizontal = FALSE)
Based on the boxplot, infant weight differs between smoking and non-smoking moms. Non-smoking moms show a wider range of infant weight, whereas moms who smoke show a narrower range. Each group looks roughly symmetric, which is consistent with a normal distribution.
shapiro.test(bweight[bweight$MomSmoke == '0', "Weight"])
##
## Shapiro-Wilk normality test
##
## data: bweight[bweight$MomSmoke == "0", "Weight"]
## W = 0.99362, p-value = 0.3549
Based on the Shapiro-Wilk test, there is no evidence against normality for non-smoking moms (0), since the p-value (0.3549) is above the .05 significance level.
shapiro.test(bweight[bweight$MomSmoke == '1', "Weight"])
##
## Shapiro-Wilk normality test
##
## data: bweight[bweight$MomSmoke == "1", "Weight"]
## W = 0.96299, p-value = 0.2
Based on the Shapiro-Wilk test, there is no evidence against normality for smoking moms (1), since the p-value (0.2) is above the .05 significance level.
*** Choice of Test ***
I will run a two-sample test.
Since neither group showed evidence against normality in Exercise 1, I will use a parametric test (the two-sample t-test).
Hypotheses:
* H0: mean Weight of group 0 = mean Weight of group 1
* Ha: mean Weight of group 0 ≠ mean Weight of group 1
I will first check for equal variances to determine which version of the test to run.
var.test(Weight ~ MomSmoke, bweight, alternative ="two.sided")
##
## F test to compare two variances
##
## data: Weight by MomSmoke
## F = 1.0786, num df = 253, denom df = 40, p-value = 0.8009
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6421109 1.6671729
## sample estimates:
## ratio of variances
## 1.078555
Based on the F test for equal variances, the p-value (0.8009) is above the .05 significance level, so I do not reject the null hypothesis that the two groups have equal variances.
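The decision rule used in this exercise (pooled t-test when equal variances are not rejected, Welch t-test otherwise) can be sketched as below. The data frame `toy` is simulated and only stands in for bweight; its values, group sizes, and the name `alpha` are assumptions made for illustration.

```r
set.seed(42)
# Simulated stand-in for bweight (hypothetical values)
toy <- data.frame(
  Weight   = c(rnorm(60, mean = 3400, sd = 500), rnorm(25, mean = 3150, sd = 500)),
  MomSmoke = factor(rep(c("0", "1"), times = c(60, 25)))
)

alpha <- 0.05

# F test for equality of the two group variances
p_var <- var.test(Weight ~ MomSmoke, data = toy)$p.value

# Pooled t-test if equal variances are not rejected, Welch t-test otherwise
res <- t.test(Weight ~ MomSmoke, data = toy, var.equal = (p_var > alpha))
print(res$method)
print(res$p.value)
```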
Since the equal-variance assumption is reasonable, I will run the pooled t-test.
t.test(Weight ~ MomSmoke, bweight,alternative ="two.sided", var.equal=TRUE)
##
## Two Sample t-test
##
## data: Weight by MomSmoke
## t = 3.071, df = 293, p-value = 0.002334
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 93.37931 426.65488
## sample estimates:
## mean in group 0 mean in group 1
## 3422.724 3162.707
Conclusion:
Based on the two-sample t-test, the p-value (0.002334) is below the .05 significance level, so I reject the null hypothesis in favor of the alternative: mean infant weight differs between non-smoking moms (3422.7) and smoking moms (3162.7).
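As a cross-check, the pooled t statistic can be rebuilt by hand from numbers that already appear in this report. This is only a sketch: the group sizes (254 and 41) are read off the degrees of freedom of the F test, and the pooled variance is assumed to equal the residual mean square (253076) from the one-way ANOVA table for the same model.

```r
# Values copied from the R output in this report
mean0 <- 3422.724; n0 <- 254   # non-smoking group (0)
mean1 <- 3162.707; n1 <- 41    # smoking group (1)
s2_pooled <- 253076            # residual mean square from the one-way ANOVA table

se     <- sqrt(s2_pooled * (1 / n0 + 1 / n1))        # standard error of the difference
t_stat <- (mean0 - mean1) / se                       # pooled t statistic
df     <- n0 + n1 - 2                                # degrees of freedom
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
ci     <- (mean0 - mean1) + c(-1, 1) * qt(0.975, df) * se

print(round(t_stat, 3))
print(signif(p_val, 4))
print(round(ci, 2))
```

Up to rounding of the inputs, the results should agree with the t, p-value, and 95 percent confidence interval printed by t.test() above.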
aov.weight <- aov(Weight ~ MomSmoke, data = bweight)
summary(aov.weight)
## Df Sum Sq Mean Sq F value Pr(>F)
## MomSmoke 1 2386708 2386708 9.431 0.00233 **
## Residuals 293 74151291 253076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(DescTools)  # provides LeveneTest() and ScheffeTest()
LeveneTest(aov.weight)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.6767 0.4114
## 293
Based on the Levene Test:
The Levene test is not significant: the p-value (0.4114) is above the .05 significance level, so we do not reject the null hypothesis of homoscedasticity. The equal-variance assumption is reasonable, and it is okay to perform ANOVA.
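For intuition, Levene's test with center = median amounts to a one-way ANOVA on the absolute deviations of each observation from its own group median. A minimal base-R sketch on simulated data (the vectors `y` and `g` are hypothetical, not bweight):

```r
set.seed(7)
# Simulated two-group data (hypothetical values, not bweight)
g <- factor(rep(c("0", "1"), times = c(50, 20)))
y <- rnorm(70, mean = 3300, sd = 500)

# Absolute deviation of each observation from its own group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on those deviations; a large p-value means no evidence of unequal spread
lev <- anova(lm(abs_dev ~ g))
print(lev)

p_levene <- lev[["Pr(>F)"]][1]
print(p_levene)
```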
ScheffeTest(aov.weight)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $MomSmoke
## diff lwr.ci upr.ci pval
## 1-0 -260.0171 -426.6549 -93.37931 0.0023 **
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the post-hoc test, the difference between the two MomSmoke groups (1 - 0 = -260.0) is significant, since the p-value (0.0023) is below the .05 significance level. Infants of moms who smoke weigh less on average.
Comparing the results from Exercises 2 and 3 shows that the p-values are equal (0.0023). This is expected, because a one-way ANOVA with only two groups is equivalent to the pooled two-sample t-test.
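The agreement between the two exercises can be checked numerically: with two groups, the ANOVA F statistic is the square of the pooled t statistic, and both give the same p-value. The sketch below uses only the rounded statistics printed in this report, so small rounding differences are to be expected.

```r
t_stat <- 3.071   # t from the pooled t-test output
f_stat <- 9.431   # F value from the one-way ANOVA table
df_res <- 293     # residual degrees of freedom in both analyses

print(t_stat^2)   # should be close to f_stat

# Two-sided p-value from the t statistic and upper-tail p-value from the F statistic
p_from_t <- 2 * pt(t_stat, df = df_res, lower.tail = FALSE)
p_from_f <- pf(f_stat, df1 = 1, df2 = df_res, lower.tail = FALSE)
print(c(p_from_t, p_from_f))
```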
Using the Black, Married, Boy, MomSmoke, and Ed variables as possible effects, find the best ANOVA model for Weight. Manually perform backward selection based on the Type III SS results with a 0.05 criterion on the p-value. Perform backward selection with main effects only, and then check interaction effects only among the significant main-effect terms.
NOTE: For backward selection, remove variables ONE BY ONE, starting with the least significant, until no remaining variable has a p-value larger than the criterion.
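The backward-selection procedure described in the note can be sketched in base R as below. This is an illustration under stated assumptions, not the steps actually run for this report: the data frame `toy` and its columns `x1`, `x2`, `x3`, `y` are simulated, and drop1() is used to get the marginal F test of each term given all the others, which for a main-effects-only model should correspond to the Type III tests.

```r
set.seed(1)
n <- 200
# Simulated data (hypothetical): only factor x1 truly affects y
toy <- data.frame(
  x1 = factor(rbinom(n, 1, 0.5)),
  x2 = factor(rbinom(n, 1, 0.5)),
  x3 = factor(rbinom(n, 1, 0.5))
)
toy$y <- 10 + 3 * (toy$x1 == "1") + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = toy)   # start from the full main-effects model
repeat {
  terms_left <- attr(terms(fit), "term.labels")
  if (length(terms_left) == 0) break       # nothing left to drop
  tab   <- drop1(fit, test = "F")          # marginal F test for each remaining term
  pvals <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])
  if (max(pvals) <= 0.05) break            # every remaining term meets the criterion
  worst <- names(which.max(pvals))         # least significant term
  fit   <- update(fit, as.formula(paste(". ~ . -", worst)))  # drop it and refit
}
print(formula(fit))
```

Refitting after every removal matters, because the p-values of the remaining terms change once a variable leaves the model.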
aov.weight_2 <- (aov(Weight ~ Black + MomSmoke, data = bweight))
summary(aov.weight_2)  # final model; summary.aov() reports sequential (Type I) SS and has no type argument
## Df Sum Sq Mean Sq F value Pr(>F)
## Black 1 3530450 3530450 14.62 0.00016 ***
## MomSmoke 1 2513301 2513301 10.41 0.00140 **
## Residuals 292 70494249 241419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table(bweight$Black); table(bweight$MomSmoke)
##
## 0 1
## 246 49
##
## 0 1
## 254 41
lm.res = lm(Weight ~ Black + MomSmoke, data = bweight)
anova(lm.res)
## Analysis of Variance Table
##
## Response: Weight
## Df Sum Sq Mean Sq F value Pr(>F)
## Black 1 3530450 3530450 14.624 0.0001605 ***
## MomSmoke 1 2513301 2513301 10.411 0.0013954 **
## Residuals 292 70494249 241419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm.res)$r.squared
## [1] 0.07896405
plot(aov.weight_2, 2)  # normal Q-Q plot of the model residuals
The final model includes the categorical variables Black and MomSmoke, without an interaction effect.
The data are unbalanced, since the group sizes differ (Black: 246 vs. 49; MomSmoke: 254 vs. 41).
The model (Black and MomSmoke) explains only about 7.9% of the variation in Weight.
Based on the Q-Q plot of the residuals, the normality assumption for the model appears reasonable.
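The R-squared value reported above can be recovered from the sums of squares in the ANOVA table, as the model sum of squares divided by the total sum of squares. A short worked check using the values printed for Weight ~ Black + MomSmoke:

```r
# Sums of squares copied from the ANOVA table for Weight ~ Black + MomSmoke
ss_black    <- 3530450
ss_momsmoke <- 2513301
ss_resid    <- 70494249

ss_model  <- ss_black + ss_momsmoke            # variation explained by the model
r_squared <- ss_model / (ss_model + ss_resid)  # share of total variation explained

print(r_squared)
print(round(100 * r_squared, 1))               # as a percentage
```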
ScheffeTest(aov.weight_2)
##
## Posthoc multiple comparisons of means: Scheffe Test
## 95% family-wise confidence level
##
## $Black
## diff lwr.ci upr.ci pval
## 1-0 -293.9412 -483.0575 -104.8249 8e-04 ***
##
## $MomSmoke
## diff lwr.ci upr.ci pval
## 1-0 -266.763 -470.2261 -63.29987 0.0060 **
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1