Midterm Exam

Data Sets:

You need to download the dataset birthweight.csv for Exercises 1-4. The birthweight data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are a total of 295 observations in birthweight, and the variables are:

Weight: Infant birth weight (grams)
Black: Categorical variable; 0 is white, 1 is black
Married: Categorical variable; 0 is not married, 1 is married
Boy: Categorical variable; 0 is girl, 1 is boy
MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom
Ed: Categorical variable for mother’s education level; 0 is high-school grad or less, 1 is college grad or above
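The code in this report refers to a data frame named bweight. A minimal sketch of how it could be prepared, assuming the CSV contains the six columns listed above; the helper name prepare_bweight and the toy rows are illustrative assumptions, not part of the original analysis:

```r
# Hypothetical helper: convert the 0/1 indicator columns to factors so that
# aov(), boxplot(), and post hoc tests treat them as groups.
prepare_bweight <- function(df) {
  indicator_cols <- c("Black", "Married", "Boy", "MomSmoke", "Ed")
  df[indicator_cols] <- lapply(df[indicator_cols], factor)
  df
}

# With the real file this would be (file assumed to be in the working directory):
# bweight <- prepare_bweight(read.csv("birthweight.csv"))

# Toy rows (made up) just to show the effect of the conversion.
toy <- data.frame(Weight = c(3200, 3500, 2900, 3800),
                  Black = c(0, 1, 0, 0), Married = c(1, 1, 0, 1),
                  Boy = c(0, 1, 1, 0), MomSmoke = c(0, 0, 1, 0),
                  Ed = c(1, 0, 0, 1))
toy <- prepare_bweight(toy)
str(toy)
```

Keeping Weight numeric and the indicators as factors is what lets the later group comparisons label their contrasts as 1-0.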

Exercise 1 A

Generate a boxplot for infant birth weight (Weight) and comment on the general features of the distribution. Generate a normal QQ-plot and perform the Shapiro-Wilk test to check whether normality is a reasonable assumption for Weight. Make a conclusion.

Boxplot

boxplot(bweight$Weight,
        main = "Infant Birth Weight Distribution",
        ylab = "Weight",
        horizontal = FALSE)

Conclusion: The boxplot above shows that Weight is roughly symmetric, which is consistent with an approximately normal distribution. However, a few observations fall below the lower whisker and one falls above the upper whisker; these are flagged as potential outliers.
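How a boxplot decides which points to draw as outliers can be sketched with boxplot.stats(), which computes the same statistics that boxplot() plots. The weights below are simulated for illustration only and are not the birthweight data:

```r
set.seed(1)
# Simulated weights plus two extreme values (illustrative only).
w <- c(rnorm(200, mean = 3400, sd = 500), 1200, 5600)

bs <- boxplot.stats(w)                      # whiskers, hinges, median, outliers
hinge_spread <- bs$stats[4] - bs$stats[2]   # upper hinge minus lower hinge
lower_fence <- bs$stats[2] - 1.5 * hinge_spread
upper_fence <- bs$stats[4] + 1.5 * hinge_spread

# Points beyond the fences are the ones drawn individually as outliers.
bs$out
c(lower = lower_fence, upper = upper_fence)
```

Applying boxplot.stats(bweight$Weight)$out in the same way would list the specific birth weights that the plot marks as outliers.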

QQ-Plot

qqnorm(bweight$Weight); qqline(bweight$Weight, col = 2)

Conclusion: Based on the QQ-plot above, we can assume that the data follow a normal distribution overall.

Shapiro-Wilk test

shapiro.test(bweight$Weight)

## 
##  Shapiro-Wilk normality test
## 
## data:  bweight$Weight
## W = 0.99206, p-value = 0.1153

Conclusion: The Shapiro-Wilk p-value (0.1153) is above the 0.05 significance level, so we do not have enough evidence to reject the null hypothesis that Weight is normally distributed. Normality is therefore a reasonable assumption for Weight.

Exercise 1 B

Generate a boxplot of Weight by MomSmoke and compare infant birth weights between smoking levels.

Boxplot

boxplot(Weight ~ MomSmoke, bweight,
        main = "Infant Birth Weights and Smoking Levels",
        ylab = "Weights",
        xlab = "Smoking Levels",
        horizontal = FALSE)

Conclusion: The boxplot above shows a difference between infants of moms who do not smoke (0) and infants of moms who smoke (1). The range of infant birth weights is wider for moms who do not smoke and narrower for moms who smoke. The distribution looks roughly normal in both groups.

Exercise 1 C

For each level in MomSmoke, perform the Shapiro-Wilk test for checking the normality of Weight. Make a conclusion.

Shapiro-Wilk test

shapiro.test(bweight[bweight$MomSmoke == '0', "Weight"])

## 
##  Shapiro-Wilk normality test
## 
## data:  bweight[bweight$MomSmoke == "0", "Weight"]
## W = 0.99362, p-value = 0.3549

Conclusion: For infants of moms who do not smoke, the Shapiro-Wilk p-value (0.3549) is above the 0.05 significance level. We do not have enough evidence to reject the null hypothesis of normality, so normality is a reasonable assumption for Weight in this group.

shapiro.test(bweight[bweight$MomSmoke == '1', "Weight"])

## 
##  Shapiro-Wilk normality test
## 
## data:  bweight[bweight$MomSmoke == "1", "Weight"]
## W = 0.96299, p-value = 0.2

Conclusion: For infants of moms who smoke, the Shapiro-Wilk p-value (0.2) is also above the 0.05 significance level. We do not have enough evidence to reject the null hypothesis of normality, so normality is a reasonable assumption for Weight in this group as well.
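The two per-group tests can also be run in one call. A minimal sketch using tapply() on simulated data; the data frame sim is a made-up stand-in for bweight, so its p-values say nothing about the real data:

```r
set.seed(2)
# Simulated stand-in for bweight (illustrative values only).
sim <- data.frame(
  MomSmoke = factor(rep(c(0, 1), times = c(120, 40))),
  Weight   = c(rnorm(120, 3400, 500), rnorm(40, 3150, 500))
)

# One Shapiro-Wilk p-value per level of MomSmoke.
p_by_group <- tapply(sim$Weight, sim$MomSmoke,
                     function(w) shapiro.test(w)$p.value)
p_by_group
```

Each p-value is then compared with 0.05 separately, exactly as in the two conclusions above.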

Exercise 2

We want to test if there is a significant difference in birth weights between infants from smoking moms and nonsmoking moms.

Perform a hypothesis test of whether infants from smoking moms have different weights than infants from nonsmoking moms. Which test do you choose? Use the answer in Exercise 1 for choosing the proper test. Specify null and alternative hypotheses and state your conclusion.

NOTE: If you decide to use the parametric test, perform two-sample t-test rather than ANOVA.

Answer

Selection of Test

Because normality is a reasonable assumption for Weight in both groups, we choose the two-sample t-test to compare birth weights between infants of moms who smoke and infants of moms who do not smoke. An equal-variance (F) test is conducted first to decide between the pooled and the unequal-variance version of the t-test.

Null hypothesis: Mean birth weight of infants of moms who smoke = mean birth weight of infants of moms who do not smoke

Alternate hypothesis: Mean birth weight of infants of moms who smoke ≠ mean birth weight of infants of moms who do not smoke

var.test(Weight ~ MomSmoke, bweight, alternative = "two.sided")

## 
##  F test to compare two variances
## 
## data:  Weight by MomSmoke
## F = 1.0786, num df = 253, denom df = 40, p-value = 0.8009
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6421109 1.6671729
## sample estimates:
## ratio of variances 
##           1.078555

Conclusion: The p-value of the variance test (0.8009) is above the 0.05 significance level, so we do not reject the null hypothesis that the two groups have equal variances. This means that we can perform the pooled t-test to compare infant birth weights.

t.test(Weight ~ MomSmoke, bweight, alternative = "two.sided", var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  Weight by MomSmoke
## t = 3.071, df = 293, p-value = 0.002334
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##   93.37931 426.65488
## sample estimates:
## mean in group 0 mean in group 1 
##        3422.724        3162.707

Conclusion: The p-value of the pooled t-test (0.002334) is below the 0.05 significance level. Hence, we have enough evidence to reject the null hypothesis. Therefore, the mean birth weight of infants of moms who smoke and the mean birth weight of infants of moms who do not smoke are not the same. We support the alternate hypothesis (Ha): mean birth weight for moms who smoke (1) ≠ mean birth weight for moms who do not smoke (0)

Exercise 3

Now perform one-way ANOVA on Weight with MomSmoke.

Exercise 3 A

Check the homogeneity of variance assumption. Does it hold, and is it okay to perform ANOVA?

One-way ANOVA (Weight ~ MomSmoke)

aov.weight1 <- aov(Weight ~ MomSmoke, bweight)
summary(aov.weight1)

##              Df   Sum Sq Mean Sq F value  Pr(>F)   
## MomSmoke      1  2386708 2386708   9.431 0.00233 **
## Residuals   293 74151291  253076                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene Test for Equal Variance Assumption

library(DescTools)  # provides LeveneTest() and ScheffeTest() used in this report
LeveneTest(aov.weight1)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.6767 0.4114
##       293

Conclusion: The p-value of the Levene test (0.4114) is above the 0.05 significance level. This means that we do not have enough evidence to reject the null hypothesis (homoscedasticity); hence, we can assume equal variance. The homogeneity of variance assumption holds, so it is okay to perform ANOVA.
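What the median-centered Levene test computes can be sketched in base R: it is a one-way ANOVA on the absolute deviations of each observation from its own group median. The function name levene_median and the simulated data are assumptions for illustration, not the DescTools implementation:

```r
# Hypothetical re-implementation for illustration (center = median).
levene_median <- function(y, g) {
  g <- factor(g)
  abs_dev <- abs(y - ave(y, g, FUN = median))  # distance from own group median
  anova(lm(abs_dev ~ g))                       # F test on those distances
}

set.seed(3)
# Simulated stand-in for bweight (illustrative values only).
sim <- data.frame(
  MomSmoke = factor(rep(c(0, 1), times = c(120, 40))),
  Weight   = c(rnorm(120, 3400, 500), rnorm(40, 3150, 500))
)
res <- levene_median(sim$Weight, sim$MomSmoke)
res
```

A large p-value in the first row means the typical spread around the median is similar across groups, which is the equal-variance conclusion drawn above.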

Exercise 3 B

Make a conclusion on the effect of MomSmoke. Compare your result with the conclusion of Exercise 2.

ScheffeTest(aov.weight1)

## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $MomSmoke
##          diff    lwr.ci    upr.ci   pval    
## 1-0 -260.0171 -426.6549 -93.37931 0.0023 ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: The one-way ANOVA above gives F = 9.431 with a p-value of 0.00233, and the Scheffe test gives the same p-value (0.0023), both below the 0.05 significance level. Thus, MomSmoke has a significant effect on infant birth weight: the two groups have significantly different means.

Also, in Exercise 2 the pooled t-test gave a p-value of 0.0023, and we concluded that the mean birth weights for moms who do not smoke and moms who smoke are not the same. The ANOVA agrees with this result: with only two groups, the ANOVA F statistic is the square of the pooled t statistic (3.071² ≈ 9.431), so the two p-values match. The Scheffe test also shows the direction of the difference: the 1-0 difference is negative (-260.0171), so infants of moms who do not smoke have a significantly higher mean birth weight than infants of moms who smoke. In other words: - Moms who do not smoke (0) > moms who smoke (1)
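The agreement between the pooled t-test and the one-way ANOVA is not specific to this dataset. A minimal sketch on simulated two-group data (made-up values, not the birthweight data) shows that the squared t statistic equals the F statistic and that the p-values coincide:

```r
set.seed(4)
# Simulated stand-in for bweight (illustrative values only).
sim <- data.frame(
  MomSmoke = factor(rep(c(0, 1), times = c(120, 40))),
  Weight   = c(rnorm(120, 3400, 500), rnorm(40, 3150, 500))
)

tt <- t.test(Weight ~ MomSmoke, data = sim, var.equal = TRUE)  # pooled t-test
av <- summary(aov(Weight ~ MomSmoke, data = sim))[[1]]         # ANOVA table

t_squared <- unname(tt$statistic)^2
f_value   <- av[["F value"]][1]
c(t_squared = t_squared, F = f_value)
c(p_t = tt$p.value, p_F = av[["Pr(>F)"]][1])
```

The equality only holds for the pooled (var.equal = TRUE) t-test; the unequal-variance version would generally give a slightly different p-value.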

Exercise 4

Using Black, Married, Boy, and MomSmoke, and Ed variables as possible effects, find the best ANOVA model for Weight. Manually perform backward selection based on type3 SS result with 0.05 criteria on p-value. Perform backward selection only with main effects and then check the interaction effects only based on significant main effect terms.

NOTE: For backward selection, you remove a variable from the least significant one, ONE BY ONE, until there is no more variable with a p-value larger than the criteria.

Exercise 4 A

Write down step by step how you perform backward selection and how you find the final model. Please do NOT include all intermediate tables and graphs in the report. Just describe each step which variable you delete and why.

Answer

Enter all the independent variables (Black, Married, Boy, MomSmoke, and Ed) into a main-effects ANOVA model and examine the Type III SS p-values
Remove Ed because it has the highest p-value (0.8625846)
Refit the main-effects ANOVA with the remaining variables (Black, Married, Boy, and MomSmoke)
Remove Married because it has the highest p-value (0.6157671)
Refit the main-effects ANOVA with the remaining variables (Black, Boy, and MomSmoke)
Remove Boy because it has the highest p-value (0.3888071)
After removing Ed, Married, and Boy, only MomSmoke and Black remain, and both are significant because their p-values are less than the 0.05 significance level (Black with a p-value of 0.0001232 and MomSmoke with a p-value of 0.0013954). We therefore check their interaction by fitting another ANOVA that adds the interaction effect.
Main effects: Black and MomSmoke
The interaction between Black and MomSmoke is not included because its p-value is larger than the 0.05 significance level. In other words, the final model is: Weight ~ Black + MomSmoke
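The manual procedure above can be sketched as a loop. This is an illustration under stated assumptions, not the code used for the report: the helper backward_select is hypothetical, it uses drop1() marginal F tests (which, for a model with main effects only, test each term adjusted for all the others), and the data are simulated with deliberately exaggerated effects:

```r
# Hypothetical helper: drop the least significant main effect, one at a time,
# until every remaining term has a p-value at or below alpha.
backward_select <- function(data, response, terms, alpha = 0.05) {
  while (length(terms) > 0) {
    fit  <- lm(reformulate(terms, response = response), data = data)
    tab  <- drop1(fit, test = "F")               # first row is "<none>"
    pvals <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])
    if (max(pvals) <= alpha) break               # everything left is significant
    terms <- setdiff(terms, names(pvals)[which.max(pvals)])
  }
  terms
}

set.seed(5)
n <- 200
# Simulated data (made up): Black and MomSmoke have large effects, Boy has none.
sim <- data.frame(
  Black    = factor(rep(c(0, 1), each = n / 2)),
  MomSmoke = factor(rep(c(0, 1), times = n / 2)),
  Boy      = factor(rep(c(0, 0, 1, 1), times = n / 4))
)
sim$Weight <- 3400 - 1000 * (sim$Black == "1") -
  800 * (sim$MomSmoke == "1") + rnorm(n, sd = 300)

kept <- backward_select(sim, "Weight", c("Black", "MomSmoke", "Boy"))
kept
```

The interaction check would then be a single extra fit that adds the interaction of the retained terms and keeps it only if its p-value is below 0.05.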

Exercise 4 B

Specify the final model and report the amount of variation explained by the model. Also, check the Normality assumption through diagnostics plots.

aov.weight2 <- aov(Weight ~ Black + MomSmoke, bweight)
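# summary() on an aov object ignores `type` and reports sequential (Type I) SS;
# car::Anova(aov.weight2, type = 3) would give the Type III table instead.
summary(aov.weight2)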

##              Df   Sum Sq Mean Sq F value  Pr(>F)    
## Black         1  3530450 3530450   14.62 0.00016 ***
## MomSmoke      1  2513301 2513301   10.41 0.00140 ** 
## Residuals   292 70494249  241419                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

table(bweight$Black); table(bweight$MomSmoke)

## 
##   0   1 
## 246  49

## 
##   0   1 
## 254  41

Conclusion: Based on the counts above, both Black and MomSmoke are unbalanced: the group sizes are 246 vs. 49 for Black and 254 vs. 41 for MomSmoke.

lm.res <- lm(Weight ~ Black + MomSmoke, bweight)
summary(lm.res)$r.squared

## [1] 0.07896405

Conclusion: Based on the R-squared of the linear model above, about 7.9% of the variation in Weight is explained by the final model (Weight ~ Black + MomSmoke).
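The same figure can be recovered from the sums of squares in the ANOVA table for the final model, since R-squared is the model sum of squares divided by the total sum of squares. A short arithmetic check using the values printed above (the variable names are illustrative):

```r
# Sums of squares copied from the ANOVA table for Weight ~ Black + MomSmoke.
ss_black    <- 3530450
ss_momsmoke <- 2513301
ss_resid    <- 70494249

ss_model  <- ss_black + ss_momsmoke
r_squared <- ss_model / (ss_model + ss_resid)
round(100 * r_squared, 1)  # percent of variation explained: 7.9
```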

plot(aov.weight2, 2)

Conclusion: Based on the Q-Q plot of the residuals above, we can assume that the residuals of the final model are approximately normally distributed, so the normality assumption is reasonable.

Exercise 4 C

State conclusions about significant differences in Weight across groups. For each significant variable, state specifically which level has a larger or smaller mean value of Weight.

ScheffeTest(aov.weight2)

## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $Black
##          diff    lwr.ci    upr.ci  pval    
## 1-0 -293.9412 -483.0575 -104.8249 8e-04 ***
## 
## $MomSmoke
##         diff    lwr.ci    upr.ci   pval    
## 1-0 -266.763 -470.2261 -63.29987 0.0060 ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion:

Black:
- Using the Scheffe test, we can see that level 1 (Black) has a lower mean Weight than level 0 (White). We conclude this because the 1-0 difference is negative (-293.9412).
- Hence, Black (1) < White (0) (White has a higher mean value than Black)
- We can also conclude that Black has a significant effect on Weight, because the p-value (8e-04) is less than the 0.05 significance level
MomSmoke:
- Using the Scheffe test, we can see that level 1 (moms who smoke) has a lower mean Weight than level 0 (moms who do not smoke). We conclude this because the 1-0 difference is negative (-266.763).
- Hence, moms who smoke (1) < moms who do not smoke (0) (non-smoking has a higher mean value than smoking)
- We can also conclude that MomSmoke has a significant effect on Weight, because the p-value (0.0060) is less than the 0.05 significance level