Midterm

Data Sets:

You need to download the dataset birthweight.csv for Exercises 1-4. The birthweight data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are a total of 400 observations in birthweight, and the variables are:

Weight: Infant birth weight (gram)
Black: Categorical variable; 0 is white, 1 is black
Married: Categorical variable; 0 is not married, 1 is married
Boy: Categorical variable; 0 is girl, 1 is boy
MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom
Ed: Categorical variable for Mother’s education Level; 0 is high-school grad or less; 1 is college grad or above

Exercise 1

(a) Generate a boxplot for infant birth weight (Weight) and comment on the general features of the distribution. Generate a normal QQ-plot and perform the Shapiro-Wilk test to check whether normality is a reasonable assumption for Weight. Make a conclusion.

boxplot(bweight$Weight, 
        main = "Distribution of Infant Birth Weight",
        ylab = "Weight",
        col = "red",
        border = "blue",
        horizontal = FALSE )

Based on the boxplot, the distribution of Weight looks roughly symmetric, which is consistent with a normal distribution. The boxplot also shows a couple of outliers below the lower whisker and one outlier above the upper whisker.

qqnorm(bweight$Weight)
qqline(bweight$Weight, col = "red")

Based on the QQ plot, the points follow the QQ line closely, so Weight can be assumed to follow a normal distribution. There is a small deviation in the left tail, but it is not enough evidence to conclude that the data are not normally distributed.

shapiro.test(bweight$Weight)

## 
##  Shapiro-Wilk normality test
## 
## data:  bweight$Weight
## W = 0.99206, p-value = 0.1153

Based on the Shapiro-Wilk test, the p-value (0.1153) is above the .05 significance level, so we do not reject the null hypothesis of normality. There is no significant evidence against normality, so it is reasonable to assume that Weight follows a normal distribution.

(b) Generate a boxplot of Weight by MomSmoke and compare infant birth weights between smoking levels.

boxplot(Weight ~ MomSmoke, data = bweight,
        main = "Weight Vs. Does the Mom Smoke?",
        ylab = "Weight",
        xlab = "Smoke?",
        col = "red",
        border = "blue",
        horizontal = FALSE)

0 is non-smoking mom, 1 is smoking mom

Based on the boxplot, birth weights differ between smoking and non-smoking moms. Infants of non-smoking moms show a wider range of weights, whereas infants of smoking moms show a narrower range. The distribution within each group looks roughly symmetric.

(c) For each level in MomSmoke, perform the Shapiro-Wilk test to check the normality of Weight. Make a conclusion.

shapiro.test(bweight[bweight$MomSmoke == '0', "Weight"])

## 
##  Shapiro-Wilk normality test
## 
## data:  bweight[bweight$MomSmoke == "0", "Weight"]
## W = 0.99362, p-value = 0.3549

Based on the Shapiro-Wilk test, Weight for non-smoking moms (0) is consistent with a normal distribution: the p-value (0.3549) is above the .05 significance level, so normality is not rejected.

shapiro.test(bweight[bweight$MomSmoke == '1', "Weight"])

## 
##  Shapiro-Wilk normality test
## 
## data:  bweight[bweight$MomSmoke == "1", "Weight"]
## W = 0.96299, p-value = 0.2

Based on the Shapiro-Wilk test, Weight for smoking moms (1) is consistent with a normal distribution: the p-value (0.2) is above the .05 significance level, so normality is not rejected.

Exercise 2

We want to test if there is a significant difference in birth weights between infants from smoking mom and nonsmoking mom.

Perform a hypothesis test of whether infants from smoking moms have different weights than infants from nonsmoking moms. Which test do you choose? Use the answer in Exercise 1 for choosing the proper test. Specify null and alternative hypotheses and state your conclusion

NOTE: If you decide to use the parametric test, perform two-sample t-test rather than ANOVA.

*** Choice of Test ***

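I will run a two-sample t-test.

Since Weight is consistent with a normal distribution in both groups in Exercise 1, a parametric test is appropriate.

Hypotheses:
Ho: Mean of 0 = Mean of 1 (mean Weight is the same for non-smoking and smoking moms)
Ha: Mean of 0 ≠ Mean of 1 (mean Weight differs between non-smoking and smoking moms)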

I will check for equal variance to determine which test to run

var.test(Weight ~ MomSmoke, bweight, alternative ="two.sided")

## 
##  F test to compare two variances
## 
## data:  Weight by MomSmoke
## F = 1.0786, num df = 253, denom df = 40, p-value = 0.8009
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6421109 1.6671729
## sample estimates:
## ratio of variances 
##           1.078555

Based on the F test, the p-value (0.8009) is above the .05 significance level, so we do not reject the null hypothesis of equal variances.

Since the equal-variance assumption is reasonable, I will run the pooled t-test.

t.test(Weight ~ MomSmoke, bweight,alternative ="two.sided", var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  Weight by MomSmoke
## t = 3.071, df = 293, p-value = 0.002334
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   93.37931 426.65488
## sample estimates:
## mean in group 0 mean in group 1 
##        3422.724        3162.707

Conclusion:

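Based on the two-sample t-test, the p-value (0.002334) is below the .05 significance level, so I reject the null hypothesis in favor of the alternative: mean birth weight differs between infants of non-smoking and smoking moms. The sample mean is about 260 grams higher for non-smoking moms (3422.7 vs. 3162.7 grams).

Ha: Mean of 0 ≠ Mean of 1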
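
The reported estimate, confidence interval, and t statistic are tied together, which gives a quick sanity check on the output. A minimal sketch using only the rounded values printed above (the variable names are illustrative, not part of the original analysis):

```r
# Values copied from the t.test() output above (rounded as printed)
mean_0  <- 3422.724   # non-smoking moms
mean_1  <- 3162.707   # smoking moms
ci_low  <- 93.37931
ci_high <- 426.65488
df      <- 293

# The point estimate is the midpoint of the confidence interval
diff_means <- mean_0 - mean_1

# Half-width of the 95% CI = t critical value * standard error
t_crit  <- qt(0.975, df)
se_diff <- (ci_high - ci_low) / (2 * t_crit)

# Recover the t statistic and the two-sided p-value
t_stat  <- diff_means / se_diff
p_value <- 2 * pt(-abs(t_stat), df)

print(c(diff = diff_means, se = se_diff, t = t_stat, p = p_value))
```

The recovered t statistic and p-value should agree with the printed output up to rounding.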

Exercise 3

Now perform one-way ANOVA on Weight with MomSmoke.

(a) Check the homogeneity of variance assumption. Does it hold, and is it okay to perform ANOVA?

aov.weight <- aov(Weight ~ MomSmoke, data = bweight)
summary(aov.weight)

##              Df   Sum Sq Mean Sq F value  Pr(>F)   
## MomSmoke      1  2386708 2386708   9.431 0.00233 **
## Residuals   293 74151291  253076                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(DescTools)  # provides LeveneTest() and ScheffeTest()
LeveneTest(aov.weight)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1  0.6767 0.4114
##       293

Based on the Levene test:

The Levene test is not significant (null: homoscedasticity), since the p-value (0.4114) is above the .05 significance level. We do not reject equal variances, so the homogeneity of variance assumption holds and it is okay to perform ANOVA.
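
For intuition about what the median-centered Levene test computes: it is a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal base-R sketch follows. It uses simulated data as a stand-in (not birthweight.csv), so the numbers it prints are not the results of this report, and the object names are illustrative.

```r
# Simulated stand-in data: two groups with equal spread (NOT the real data)
set.seed(42)
sim <- data.frame(
  MomSmoke = factor(rep(c("0", "1"), times = c(250, 40))),
  Weight   = c(rnorm(250, mean = 3400, sd = 500),
               rnorm(40,  mean = 3150, sd = 500))
)

# Absolute deviation of each observation from its own group median
group_median <- ave(sim$Weight, sim$MomSmoke, FUN = median)
sim$abs_dev  <- abs(sim$Weight - group_median)

# Levene's test (center = median) is the F-test from a one-way ANOVA
# on those absolute deviations
levene_fit <- anova(lm(abs_dev ~ MomSmoke, data = sim))
print(levene_fit)

p_levene <- levene_fit[["Pr(>F)"]][1]
```

A large p-value here means the typical spread around the group median does not differ detectably between groups.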

(b) Make a conclusion on the effect of MomSmoke. Compare your result with the conclusion of Exercise 2.

ScheffeTest(aov.weight)

## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $MomSmoke
##          diff    lwr.ci    upr.ci   pval    
## 1-0 -260.0171 -426.6549 -93.37931 0.0023 ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

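MomSmoke has only two levels, so the ANOVA F-test (F = 9.431, p = 0.00233) and the Scheffe comparison (p = 0.0023) test the same contrast. Both p-values are below the .05 significance level, so MomSmoke has a significant effect on Weight: infants of smoking moms weigh about 260 grams less on average (95% CI: 93 to 427 grams less).

This agrees with Exercise 2. The p-values are identical (0.0023) because, with two groups, one-way ANOVA is equivalent to the pooled two-sample t-test, and F equals the square of t (3.071 squared is about 9.431).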
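
The agreement between the two exercises can be checked numerically from the reported statistics alone. A minimal sketch (the variable names are illustrative, and the inputs are the rounded values printed in the outputs above):

```r
# Statistics copied from the outputs above (rounded as printed)
t_stat   <- 3.071   # pooled two-sample t-test, Exercise 2
f_stat   <- 9.431   # one-way ANOVA F value, Exercise 3
df_resid <- 293     # residual degrees of freedom in both analyses

# With two groups, the ANOVA F statistic is the square of the pooled t statistic
t_squared <- t_stat^2

# Two-sided p-value from the t distribution
p_from_t <- 2 * pt(-abs(t_stat), df = df_resid)

# Upper-tail p-value from the F distribution with (1, df_resid) degrees of freedom
p_from_f <- pf(t_squared, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(c(t_squared = t_squared, f_stat = f_stat))
print(c(p_from_t = p_from_t, p_from_f = p_from_f))
```

Both p-values come out the same because a squared t variable with 293 degrees of freedom follows an F distribution with 1 and 293 degrees of freedom.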

Exercise 4

Using Black, Married, Boy, and MomSmoke, and Ed variables as possible effects, find the best ANOVA model for Weight. Manually perform backward selection based on type3 SS result with 0.05 criteria on p-value. Perform backward selection only with main effects and then check the interaction effects only based on significant main effect terms.

NOTE: For backward selection, you remove a variable from the least significant one, ONE BY ONE, until there is no more variable with a p-value larger than the criteria.

(a) Write down step by step how you perform backward selection and how you find the final model. Please do NOT include all intermediate tables and graphs in the report. Just describe each step which variable you delete and why.

Step 1: Fit the full main-effects model with Black, Married, Boy, MomSmoke, and Ed, and check the type 3 p-values.
Step 2: Remove Ed (it had the largest p-value, above .05).
Step 3: Refit with Black, Married, Boy, and MomSmoke.
Step 4: Remove Married (it had the largest p-value, above .05).
Step 5: Refit with Black, Boy, and MomSmoke.
Step 6: Remove Boy (it had the largest p-value, above .05).
Step 7: Refit with Black and MomSmoke.
Step 8: Stop removing variables. MomSmoke had the larger of the two remaining p-values, but it was below the .05 significance level, so both variables stay in the model.
Step 9: Final main effects: Black and MomSmoke.
Step 10: Check the interaction between Black and MomSmoke. It is not added to the model, since its p-value is higher than the .05 significance level.
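
The manual procedure above can be sketched as a loop. This is an illustration under stated assumptions, not the code used for this report: it runs on simulated data (not birthweight.csv), `backward_select` is a hypothetical helper, and it uses base R's `drop1()` F-tests, which for a main-effects-only model test each term after adjusting for all the others, as type 3 tests do.

```r
# Simulated stand-in data (NOT the real birthweight data)
set.seed(1)
n <- 300
sim <- data.frame(
  Black    = factor(rbinom(n, 1, 0.20)),
  Married  = factor(rbinom(n, 1, 0.70)),
  Boy      = factor(rbinom(n, 1, 0.50)),
  MomSmoke = factor(rbinom(n, 1, 0.15)),
  Ed       = factor(rbinom(n, 1, 0.50))
)
sim$Weight <- 3400 - 300 * (sim$Black == "1") -
  270 * (sim$MomSmoke == "1") + rnorm(n, sd = 500)

# Hypothetical helper: drop the least significant main effect, one at a
# time, until every remaining term has a p-value at or below alpha
backward_select <- function(data, response, terms, alpha = 0.05) {
  repeat {
    fit   <- lm(reformulate(terms, response = response), data = data)
    tab   <- drop1(fit, test = "F")        # marginal F-test for each term
    pvals <- tab[["Pr(>F)"]][-1]           # first row is "<none>"
    names(pvals) <- rownames(tab)[-1]
    if (max(pvals) <= alpha) return(fit)   # all remaining terms significant
    worst <- names(pvals)[which.max(pvals)]
    message("Removing ", worst, " (p = ", signif(max(pvals), 3), ")")
    terms <- setdiff(terms, worst)
    if (length(terms) == 0) {              # nothing left: intercept-only model
      return(lm(reformulate("1", response = response), data = data))
    }
  }
}

candidates <- c("Black", "Married", "Boy", "MomSmoke", "Ed")
final_fit  <- backward_select(sim, "Weight", candidates)
kept_terms <- attr(terms(final_fit), "term.labels")
print(kept_terms)
```

Interaction terms are deliberately left out of the loop, matching the instruction to check interactions only among the significant main effects afterwards.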

(b) Specify the final model and report the amount of variation explained by the model. Also, check the Normality assumption through diagnostics plots

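aov.weight_2 <- aov(Weight ~ Black + MomSmoke, data = bweight)
# Final model. summary() on an aov object ignores a `type` argument and reports
# sequential (type 1) SS; car::Anova(aov.weight_2, type = 3) gives type 3 SS.
summary(aov.weight_2)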

##              Df   Sum Sq Mean Sq F value  Pr(>F)    
## Black         1  3530450 3530450   14.62 0.00016 ***
## MomSmoke      1  2513301 2513301   10.41 0.00140 ** 
## Residuals   292 70494249  241419                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

table(bweight$Black); table(bweight$MomSmoke)

## 
##   0   1 
## 246  49

## 
##   0   1 
## 254  41

lm.res = lm(Weight ~ Black + MomSmoke, data = bweight)
anova(lm.res)

## Analysis of Variance Table
## 
## Response: Weight
##            Df   Sum Sq Mean Sq F value    Pr(>F)    
## Black       1  3530450 3530450  14.624 0.0001605 ***
## MomSmoke    1  2513301 2513301  10.411 0.0013954 ** 
## Residuals 292 70494249  241419                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(lm.res)$r.squared

## [1] 0.07896405

plot(aov.weight_2,2)

The final model includes the categorical variables Black and MomSmoke, with no interaction effect.

The data are unbalanced: the group sizes differ (246 vs. 49 for Black, 254 vs. 41 for MomSmoke).

The model (Black and MomSmoke) explains only about 7.9% of the variation in Weight (R-squared = 0.0790).

Based on the QQ plot of the residuals, the normality assumption appears reasonable.
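
The proportion of variation explained can be recovered directly from the sums of squares in the ANOVA table, which makes clear where the R-squared value comes from. A minimal sketch using the values printed above (the variable names are illustrative):

```r
# Sums of squares copied from the ANOVA table for Weight ~ Black + MomSmoke
ss_black <- 3530450
ss_smoke <- 2513301
ss_resid <- 70494249

# R-squared = model sum of squares / total sum of squares
ss_model  <- ss_black + ss_smoke
ss_total  <- ss_model + ss_resid
r_squared <- ss_model / ss_total

print(round(r_squared, 4))
```

The result should match `summary(lm.res)$r.squared` up to the rounding of the printed sums of squares.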

(c) State conclusions about significant differences in Weight across groups. For each significant variable, state specifically which level has a larger or smaller mean value of Weight.

ScheffeTest(aov.weight_2)

## 
##   Posthoc multiple comparisons of means: Scheffe Test 
##     95% family-wise confidence level
## 
## $Black
##          diff    lwr.ci    upr.ci  pval    
## 1-0 -293.9412 -483.0575 -104.8249 8e-04 ***
## 
## $MomSmoke
##         diff    lwr.ci    upr.ci   pval    
## 1-0 -266.763 -470.2261 -63.29987 0.0060 ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Black:

The p-value (0.0008) is below the .05 significance level, so mean Weight differs significantly between the two levels of Black.
Specifically, Black < White: infants of black mothers weigh about 294 grams less on average than infants of white mothers.

MomSmoke:

The p-value (0.0060) is below the .05 significance level, so mean Weight differs significantly between the two levels of MomSmoke.
Specifically, Smoking < Non Smoking: infants of smoking moms weigh about 267 grams less on average than infants of non-smoking moms.