STAT2 5.6: The purpose of ANOVA in the setting of this chapter is to learn about (c) the means of several populations.
STAT2 5.7: The conditions of the ANOVA model include: (a) error terms have the same standard deviation; (c) error terms are independent; (d) error terms follow a normal distribution. The ANOVA model does not require (b), that the error terms all be positive.
STAT2 5.31:
(a) Load data:
setwd("/Users/traves/Dropbox/SM339/stat2 data files")
ncb = read.csv("NCbirths.csv")
attach(ncb)
table(MomRace)
## MomRace
## black hispanic other white
## 332 164 48 906
plot(BirthWeightOz ~ MomRace, main = "Birth Weights by Mom's Race")
stripchart(BirthWeightOz ~ MomRace, method = "stack", pch = 19)
The group centers look fairly close, and the bulk (middle 50%) of the data seems to have about the same spread in every group, but the black and white mothers show a much wider overall range of birth weights than the other two groups.
(b) Sample sizes, sample means and sample standard deviations of birth weight (in oz) by race of mother:
Length = tapply(ncb$BirthWeightOz, MomRace, length)
Means = tapply(ncb$BirthWeightOz, MomRace, mean)
StdDev = tapply(ncb$BirthWeightOz, MomRace, sd)
data.frame(Length, Means, StdDev)
## Length Means StdDev
## black 332 110.6 23.40
## hispanic 164 118.5 18.17
## other 48 117.1 17.60
## white 906 117.9 22.52
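The equal-standard-deviation condition can be checked roughly from this table. The sketch below assumes a common rule of thumb (not stated in this solution): the condition is plausible if the largest group SD is less than twice the smallest. The SD values are copied from the output above.

```r
# Group standard deviations copied from the summary table above
StdDev <- c(black = 23.40, hispanic = 18.17, other = 17.60, white = 22.52)

# Rule of thumb (an assumption here): largest SD / smallest SD should be below 2
sd_ratio <- max(StdDev) / min(StdDev)
sd_ratio
sd_ratio < 2
```

The ratio is about 1.33, so by this rule the group spreads are similar enough for the ANOVA model.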
(c) We can't tell whether the true mean birth weights of babies born to mothers of different races really differ just by looking at the sample means. We also need to take into account the rest of the information in the samples (e.g., counts and standard deviations); after all, the sample means may differ, but that difference may be sufficiently explained by chance alone.
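One way to see how the counts and standard deviations enter is to rebuild the one-way ANOVA F-statistic from the summary table alone. This is only a sketch: the inputs are the rounded values from part (b), so the results should be close to, but not exactly equal to, what anova() reports on the raw data. The variable names (SSGroups, MSGroups, and so on) are chosen here for illustration.

```r
# Summary statistics copied (rounded) from the table in part (b)
ns    <- c(black = 332, hispanic = 164, other = 48, white = 906)
means <- c(110.6, 118.5, 117.1, 117.9)
sds   <- c(23.40, 18.17, 17.60, 22.52)

K <- length(ns)   # number of groups
n <- sum(ns)      # total sample size

# Between-group variability: how far the group means sit from the grand mean
grand_mean <- sum(ns * means) / n
SSGroups   <- sum(ns * (means - grand_mean)^2)

# Within-group variability: pooled from the group standard deviations
SSE <- sum((ns - 1) * sds^2)

MSGroups <- SSGroups / (K - 1)
MSE      <- SSE / (n - K)
F_stat   <- MSGroups / MSE
p_value  <- pf(F_stat, K - 1, n - K, lower.tail = FALSE)

round(c(MSGroups = MSGroups, MSE = MSE, F = F_stat), 2)
p_value
```

The same differences in sample means would give a much smaller F if the groups were smaller or the standard deviations larger, which is why the means alone are not enough.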
STAT2 5.32:
(a) Let's check which pairs of group means differ, using pairwise comparisons with Fisher's LSD. With four groups there are six comparisons. We'll perform each test at the 5% significance level.
mod.1 = lm(BirthWeightOz ~ MomRace)
anova(mod.1)
## Analysis of Variance Table
##
## Response: BirthWeightOz
## Df Sum Sq Mean Sq F value Pr(>F)
## MomRace 3 14002 4667 9.53 3.1e-06 ***
## Residuals 1446 708332 490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
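MSE = 490 # residual mean square, read off the ANOVA table above
mod.2 = lm(ncb$BirthWeightOz ~ MomRace - 1) # no intercept, so each coefficient is a group mean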
summary(mod.2)
##
## Call:
## lm(formula = ncb$BirthWeightOz ~ MomRace - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.87 -10.87 1.48 13.97 63.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## MomRaceblack 110.563 1.215 91.0 <2e-16 ***
## MomRacehispanic 118.518 1.728 68.6 <2e-16 ***
## MomRaceother 117.146 3.195 36.7 <2e-16 ***
## MomRacewhite 117.872 0.735 160.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.1 on 1446 degrees of freedom
## Multiple R-squared: 0.965, Adjusted R-squared: 0.965
## F-statistic: 1e+04 on 4 and 1446 DF, p-value: <2e-16
ys = coefficients(mod.2)
n = length(ncb$BirthWeightOz)
ns = tapply(ncb$BirthWeightOz, MomRace, length)
tstar = qt(0.975, n - 4) # df = n-K where K = #groups
# compare black and hispanic:
LSD = tstar * sqrt(MSE * ((1/ns[1]) + (1/ns[2])))
abs(ys[1] - ys[2])
## MomRaceblack
## 7.955
abs(ys[1] - ys[2]) > LSD # significant difference
## MomRaceblack
## TRUE
# compare black and other:
LSD = tstar * sqrt(MSE * ((1/ns[1]) + (1/ns[3])))
abs(ys[1] - ys[3])
## MomRaceblack
## 6.583
abs(ys[1] - ys[3]) > LSD # no significant difference
## MomRaceblack
## FALSE
# compare black and white:
LSD = tstar * sqrt(MSE * ((1/ns[1]) + (1/ns[4])))
abs(ys[1] - ys[4])
## MomRaceblack
## 7.309
abs(ys[1] - ys[4]) > LSD # significant difference
## MomRaceblack
## TRUE
# compare other and hispanic:
LSD = tstar * sqrt(MSE * ((1/ns[3]) + (1/ns[2])))
abs(ys[3] - ys[2])
## MomRaceother
## 1.372
abs(ys[3] - ys[2]) > LSD # no significant difference
## MomRaceother
## FALSE
# compare white and hispanic:
LSD = tstar * sqrt(MSE * ((1/ns[4]) + (1/ns[2])))
abs(ys[4] - ys[2])
## MomRacewhite
## 0.6463
abs(ys[4] - ys[2]) > LSD # no significant difference
## MomRacewhite
## FALSE
# compare other and white:
LSD = tstar * sqrt(MSE * ((1/ns[3]) + (1/ns[4])))
abs(ys[3] - ys[4])
## MomRaceother
## 0.7261
abs(ys[3] - ys[4]) > LSD # no significant difference
## MomRaceother
## FALSE
So the only pairs of groups with significantly different mean birth weights are black and white, and black and hispanic.
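The six comparisons above repeat the same three lines, so they can also be done in one pass. The following is a minimal sketch that uses only the group means, group sizes, MSE and error degrees of freedom copied from the output above; the name lsd_table is made up for this example.

```r
# Group means and sizes copied from the output above; MSE and df from the ANOVA table
ys  <- c(black = 110.563, hispanic = 118.518, other = 117.146, white = 117.872)
ns  <- c(black = 332, hispanic = 164, other = 48, white = 906)
MSE <- 490
df_error <- 1446
tstar <- qt(0.975, df_error)

# All choose(4, 2) = 6 pairs of groups, one pair per row
pairs <- t(combn(names(ys), 2))
g1 <- pairs[, 1]
g2 <- pairs[, 2]

lsd_table <- data.frame(
  group1 = g1,
  group2 = g2,
  diff   = unname(abs(ys[g1] - ys[g2])),                        # |difference in means|
  LSD    = unname(tstar * sqrt(MSE * (1 / ns[g1] + 1 / ns[g2]))) # pair-specific LSD
)
lsd_table$significant <- lsd_table$diff > lsd_table$LSD
lsd_table
```

With the raw data available, pairwise.t.test() with p.adjust.method = "none" and the default pooled SD should give the equivalent unadjusted comparisons as p-values.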
(b)
mod.1 = lm(BirthWeightOz ~ MomRace)
summary(mod.1) # this model has an intercept: "black" is the reference group, other coefficients are differences from it
##
## Call:
## lm(formula = BirthWeightOz ~ MomRace)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.87 -10.87 1.48 13.97 63.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.56 1.21 91.02 < 2e-16 ***
## MomRacehispanic 7.96 2.11 3.77 0.00017 ***
## MomRaceother 6.58 3.42 1.93 0.05430 .
## MomRacewhite 7.31 1.42 5.15 3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.1 on 1446 degrees of freedom
## Multiple R-squared: 0.0194, Adjusted R-squared: 0.0174
## F-statistic: 9.53 on 3 and 1446 DF, p-value: 3.12e-06
anova(mod.1)
## Analysis of Variance Table
##
## Response: BirthWeightOz
## Df Sum Sq Mean Sq F value Pr(>F)
## MomRace 3 14002 4667 9.53 3.1e-06 ***
## Residuals 1446 708332 490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
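The ANOVA F-test (F = 9.53 on 3 and 1446 df, p = 3.1e-06) indicates that the group effects are statistically significant, so mean birth weight is associated with the mother's race. The effect is small, though: race accounts for only about 1.9% of the variation in birth weight (R-squared = 0.0194).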
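As a check on how the pieces of the output fit together, the F-statistic, its p-value and R-squared can all be recomputed from the two sums of squares in the ANOVA table. This sketch uses the rounded table values, so expect agreement only to the displayed precision.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
SSGroups  <- 14002
SSE       <- 708332
df_groups <- 3
df_error  <- 1446

# F is the ratio of the two mean squares
F_stat  <- (SSGroups / df_groups) / (SSE / df_error)
p_value <- pf(F_stat, df_groups, df_error, lower.tail = FALSE)

# R-squared is the share of total variation explained by the groups
R2 <- SSGroups / (SSGroups + SSE)

c(F = F_stat, p = p_value, R2 = R2)
```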