Solutions to Day 26 Homework

STAT2 5.6: The purpose of ANOVA in the setting of this chapter is to learn about (c) the means of several populations.

STAT2 5.7: The conditions of the ANOVA model include: (a) error terms have the same standard deviation; (c) error terms are independent; (d) error terms follow a normal distribution. Option (b), that the error terms are all positive, is not a condition: the error terms are centered at zero, so some of them must be negative.
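
As a rough illustration of these conditions, the sketch below simulates data from a one-way ANOVA model in R. The group names, group means, common standard deviation and group size are all made-up values chosen for the example; none of them come from the textbook.

```r
set.seed(1)  # make the simulation reproducible

# Hypothetical settings (assumptions for illustration only)
group_means <- c(a = 10, b = 12, c = 15)  # true group means
sigma       <- 2                          # common SD of the error terms
n_per_group <- 50

group  <- factor(rep(names(group_means), each = n_per_group))
# Independent normal errors with mean 0 and the same SD in every group
errors <- rnorm(length(group), mean = 0, sd = sigma)
y      <- unname(group_means[as.character(group)]) + errors

tapply(y, group, mean)  # sample means, close to the true group means
tapply(y, group, sd)    # sample SDs, all close to sigma
mean(errors < 0)        # proportion of negative errors
anova(lm(y ~ group))    # one-way ANOVA on the simulated data
```

Because the errors are centered at zero, roughly half of them come out negative, which is why (b) cannot be one of the conditions.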

STAT2 5.31:

(a) Load data:

setwd("/Users/traves/Dropbox/SM339/stat2 data files")
ncb = read.csv("NCbirths.csv", stringsAsFactors = TRUE)  # keep MomRace as a factor (not the default in R >= 4.0)
attach(ncb)
table(MomRace)
## MomRace
##    black hispanic    other    white 
##      332      164       48      906
plot(BirthWeightOz ~ MomRace, main = "Birth Weights by Mom's Race")

[Figure: side-by-side boxplots of BirthWeightOz by MomRace, titled "Birth Weights by Mom's Race"]

stripchart(BirthWeightOz ~ MomRace, method = "stack", pch = 19)

[Figure: stacked strip chart of BirthWeightOz by MomRace]

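The centers of the four groups look pretty close and the bulk (middle 50%) of the data seems to have about the same spread in each group, but the black and white mothers have a much wider overall range of birth weights than the other two groups. These are also the two largest groups (332 and 906 births), so more extreme values are to be expected there.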

(b) Sample sizes, sample means and sample standard deviations of birth weights (in oz) by race of mother:

Length = tapply(ncb$BirthWeightOz, MomRace, length)
Means = tapply(ncb$BirthWeightOz, MomRace, mean)
StdDev = tapply(ncb$BirthWeightOz, MomRace, sd)
data.frame(Length, Means, StdDev)
##          Length Means StdDev
## black       332 110.6  23.40
## hispanic    164 118.5  18.17
## other        48 117.1  17.60
## white       906 117.9  22.52
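
The standard deviations in this table also allow a quick check of the equal-spread condition from 5.7. The sketch below uses one common rule of thumb, namely that the largest group SD should be less than twice the smallest; applying that rule here is an assumption of this note, not something the exercise asks for.

```r
# Sample standard deviations copied from the table above (rounded to 2 decimals)
std_dev <- c(black = 23.40, hispanic = 18.17, other = 17.60, white = 22.52)

# Ratio of the largest to the smallest group SD
sd_ratio <- max(std_dev) / min(std_dev)
sd_ratio

# TRUE means the rule of thumb is satisfied
sd_ratio < 2
```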

(c) We can't tell whether the true mean birth weights of babies born to mothers of different races really differ just by looking at the sample means. We also need to take into account the rest of the information in the samples (e.g. counts and standard deviations); after all, the sample means may differ, but that difference may be sufficiently explained by chance alone.
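
To see how the counts and standard deviations enter, here is a minimal sketch that rebuilds the one-way ANOVA F statistic from the summary table in (b) alone. The inputs are the rounded values printed there, so the result only approximates what R computes from the raw data; the variable names are chosen for this example.

```r
# Summary statistics copied from the table in (b) (rounded)
n    <- c(black = 332,   hispanic = 164,   other = 48,    white = 906)
ybar <- c(black = 110.6, hispanic = 118.5, other = 117.1, white = 117.9)
s    <- c(black = 23.40, hispanic = 18.17, other = 17.60, white = 22.52)

K <- length(n)   # number of groups
N <- sum(n)      # total sample size

grand_mean <- sum(n * ybar) / N

# Variation between group means, weighted by group size
ss_groups <- sum(n * (ybar - grand_mean)^2)
# Variation within groups, pooled from the group SDs
ss_error  <- sum((n - 1) * s^2)

ms_groups <- ss_groups / (K - 1)
ms_error  <- ss_error / (N - K)

f_stat  <- ms_groups / ms_error
p_value <- pf(f_stat, K - 1, N - K, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

The differences between the sample means only count as evidence once they are scaled by the within-group variation and the group sizes. The F statistic here should come out at roughly 9.5, close to the value in the ANOVA table in 5.32.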

STAT2 5.32:

(a) Let's check which pairs of group means differ using pairwise comparisons and Fisher's LSD. With four groups there are six comparisons. We'll perform each test at the 5% significance level.

mod.1 = lm(BirthWeightOz ~ MomRace)
anova(mod.1)
## Analysis of Variance Table
## 
## Response: BirthWeightOz
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## MomRace      3  14002    4667    9.53 3.1e-06 ***
## Residuals 1446 708332     490                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
MSE = 490  # Mean Sq for Residuals, copied (rounded) from the ANOVA table above
mod.2 = lm(ncb$BirthWeightOz ~ MomRace - 1)
summary(mod.2)
## 
## Call:
## lm(formula = ncb$BirthWeightOz ~ MomRace - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101.87  -10.87    1.48   13.97   63.13 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## MomRaceblack     110.563      1.215    91.0   <2e-16 ***
## MomRacehispanic  118.518      1.728    68.6   <2e-16 ***
## MomRaceother     117.146      3.195    36.7   <2e-16 ***
## MomRacewhite     117.872      0.735   160.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.1 on 1446 degrees of freedom
## Multiple R-squared:  0.965,  Adjusted R-squared:  0.965 
## F-statistic: 1e+04 on 4 and 1446 DF,  p-value: <2e-16
ys = coefficients(mod.2)
n = length(ncb$BirthWeightOz)
ns = tapply(ncb$BirthWeightOz, MomRace, length)
tstar = qt(0.975, n - 4)  # df = n-K where K = #groups

# compare black and hispanic:
LSD = tstar * sqrt(MSE * ((1/ns[1]) + (1/ns[2])))
abs(ys[1] - ys[2])
## MomRaceblack 
##        7.955
abs(ys[1] - ys[2]) > LSD  # significant difference
## MomRaceblack 
##         TRUE

# compare black and other:
LSD = tstar * sqrt(MSE * ((1/ns[1]) + (1/ns[3])))
abs(ys[1] - ys[3])
## MomRaceblack 
##        6.583
abs(ys[1] - ys[3]) > LSD  # not significant difference
## MomRaceblack 
##        FALSE

# compare black and white:
LSD = tstar * sqrt(MSE * ((1/ns[1]) + (1/ns[4])))
abs(ys[1] - ys[4])
## MomRaceblack 
##        7.309
abs(ys[1] - ys[4]) > LSD  # significant difference
## MomRaceblack 
##         TRUE

# compare other and hispanic:
LSD = tstar * sqrt(MSE * ((1/ns[3]) + (1/ns[2])))
abs(ys[3] - ys[2])
## MomRaceother 
##        1.372
abs(ys[3] - ys[2]) > LSD  # not significant difference
## MomRaceother 
##        FALSE

# compare white and hispanic:
LSD = tstar * sqrt(MSE * ((1/ns[4]) + (1/ns[2])))
abs(ys[4] - ys[2])
## MomRacewhite 
##       0.6463
abs(ys[4] - ys[2]) > LSD  # not significant difference
## MomRacewhite 
##        FALSE

# compare other and white:
LSD = tstar * sqrt(MSE * ((1/ns[3]) + (1/ns[4])))
abs(ys[3] - ys[4])
## MomRaceother 
##       0.7261
abs(ys[3] - ys[4]) > LSD  # not significant difference
## MomRaceother 
##        FALSE

So the only pairs of groups with significantly different mean birth weights are: black and white, and black and hispanic.
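
The six comparisons above all repeat the same calculation, so they can also be done in one pass. The following is a compact sketch of the same Fisher's LSD procedure, assuming the group sizes, group means and rounded MSE printed above; it works from those summary values rather than from the data file, and the name `results` is just for this example.

```r
# Group sizes and means copied from the output above
ns <- c(black = 332, hispanic = 164, other = 48, white = 906)
ys <- c(black = 110.563, hispanic = 118.518, other = 117.146, white = 117.872)
MSE   <- 490                               # rounded Mean Sq for Residuals
tstar <- qt(0.975, sum(ns) - length(ns))   # df = n - K

# All pairs of groups: a 2 x 6 matrix, one column per comparison
pairs <- combn(names(ns), 2)
g1 <- pairs[1, ]
g2 <- pairs[2, ]

results <- data.frame(
  group1 = g1,
  group2 = g2,
  diff   = unname(abs(ys[g1] - ys[g2])),
  LSD    = unname(tstar * sqrt(MSE * (1 / ns[g1] + 1 / ns[g2]))),
  stringsAsFactors = FALSE
)
# A pair differs significantly when the gap in means exceeds its LSD
results$significant <- results$diff > results$LSD
results
```

Each pair gets its own LSD because the group sizes differ: comparisons involving the small "other" group (48 births) need a much larger gap in means to reach significance.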

(b)

mod.1 = lm(BirthWeightOz ~ MomRace)
summary(mod.1)  # model includes an intercept: it is the mean for the reference group (black); the other coefficients are differences from it
## 
## Call:
## lm(formula = BirthWeightOz ~ MomRace)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101.87  -10.87    1.48   13.97   63.13 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       110.56       1.21   91.02  < 2e-16 ***
## MomRacehispanic     7.96       2.11    3.77  0.00017 ***
## MomRaceother        6.58       3.42    1.93  0.05430 .  
## MomRacewhite        7.31       1.42    5.15    3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.1 on 1446 degrees of freedom
## Multiple R-squared:  0.0194, Adjusted R-squared:  0.0174 
## F-statistic: 9.53 on 3 and 1446 DF,  p-value: 3.12e-06
anova(mod.1)
## Analysis of Variance Table
## 
## Response: BirthWeightOz
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## MomRace      3  14002    4667    9.53 3.1e-06 ***
## Residuals 1446 708332     490                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA F-test (F = 9.53 on 3 and 1446 degrees of freedom, p-value 3.1e-06) shows that the group effects are statistically significant, so the mean birth weight differs with the mother's race.
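
The coefficient table in summary(mod.1) lines up with the Fisher's LSD comparisons against black mothers in part (a). As a rough check, the sketch below recomputes the differences, standard errors, t values and R-squared from the numbers printed above (group sizes, group means, rounded MSE and sums of squares), so small rounding differences from R's output are expected.

```r
# Values copied from the output above
ns  <- c(black = 332, hispanic = 164, other = 48, white = 906)
ys  <- c(black = 110.563, hispanic = 118.518, other = 117.146, white = 117.872)
MSE <- 490            # rounded Mean Sq for Residuals
ss_groups <- 14002    # Sum Sq for MomRace
ss_error  <- 708332   # Sum Sq for Residuals

# Coefficients of mod.1: differences from the reference group (black)
diffs <- ys[-1] - ys["black"]
# Standard errors of those differences, using the pooled MSE
ses <- sqrt(MSE * (1 / ns[-1] + 1 / ns["black"]))

t_values <- diffs / ses
p_values <- 2 * pt(abs(t_values), df = sum(ns) - length(ns), lower.tail = FALSE)
round(cbind(diffs, ses, t_values, p_values), 4)

# R-squared: share of the total variation explained by MomRace
r_squared <- ss_groups / (ss_groups + ss_error)
r_squared
```

The t-tests for hispanic and white are significant at the 5% level while the one for other is not, matching the LSD results. The small R-squared shows that although the group means differ, mother's race accounts for only about 2% of the variation in birth weight.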