KITADA
Lesson #24
Motivation:
Let’s stop to think about what we’ve covered so far. We learned which analyses to use when both the response and explanatory variables are categorical (one- and two-proportion methods, Chi-square methods, Fisher’s Exact test), when the response variable is quantitative and the explanatory variable is categorical with 2 categories (two-sample methods for quantitative variables), and when both the response and explanatory variable(s) are quantitative (simple linear regression, multiple linear regression). But what should we do if we have a quantitative response variable and a categorical explanatory variable with more than 2 groups? The procedure to use in these situations is called Analysis of Variance (ANOVA), and it uses the Analysis of Variance table. It is a common procedure in experiments. This lesson introduces the Analysis of Variance procedure.
What you need to know from this lesson: After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
A. When is the Analysis of Variance procedure used?
When you have a quantitative response variable and a categorical explanatory variable with more than 2 categories.
**Think of wanting to compare a quantitative variable of interest among MORE than two groups.
We can use the ANOVA procedure with a categorical explanatory variable to test for a difference in means among multiple populations.
Note: This is more general than the two-sample t-test, which compares the means of exactly two groups.
B. How does the Analysis of Variance procedure work?
It tests for a difference in means (if we have categorical explanatory variables).
1. Which of the two scatterplots below would more strongly convince you that the group means are different? Why?
(SEE PLOTS IN HANDOUT)
The left graph, because the data for group 2 are shifted up from the data for group 1. The right graph shows data that are similar for the two groups.
2. The Analysis of Variance procedure analyzes variation to determine if there is evidence to say group means are different.
a. What are the null and alternative hypotheses (in words and in notation) that the Analysis of Variance procedure is “testing”?
\( H_0:\mu_1=\mu_2=\cdots=\mu_g \), there is no difference in the group means
\( H_A: \) At least one group has a different mean.
b. The Analysis of Variance procedure uses an F-test to test the above null hypothesis. How is the F-statistic calculated?
\( F=\frac{\text{variation BETWEEN groups}}{\text{variation WITHIN groups}} \)
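Written with the quantities that appear in the Analysis of Variance table, the ratio can be expressed as below. This is the standard form of the statistic rather than something stated at this point in the lesson; MSG and MSE (the two mean squares) and \( n \) (the total number of observations) are names assumed here, while \( g \) is the number of groups as in the hypotheses.

\( F=\frac{MSG}{MSE}=\frac{SSG/(g-1)}{SSE/(n-g)} \)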
C. Revisiting the Analysis of Variance table
To illustrate some of the calculations in the Analysis of Variance table, we’ll revisit the Biofeedback example from Lesson 23 and determine if there is a difference in the average # of consecutive repetitions before making an error between those who receive biofeedback and those who don’t. Here are the # of consecutive repetitions before making an error for each group, along with the summary statistics for each group and overall.
### LESSON 24 ###
BIOFEED_0$repetitions
## [1] 73 51 89 52 49 50 75 55 50 87 106 91 100
BIOFEED_1$repetitions
## [1] 88 102 105 52 106 76 100 112 75 75 112 115 75 70 100
## NO BIOFEEDBACK ##
length(BIOFEED_0$repetitions)
## [1] 13
mean(BIOFEED_0$repetitions)
## [1] 71.38462
sd(BIOFEED_0$repetitions)
## [1] 21.30547
## BIOFEEDBACK ##
length(BIOFEED_1$repetitions)
## [1] 15
mean(BIOFEED_1$repetitions)
## [1] 90.86667
sd(BIOFEED_1$repetitions)
## [1] 19.13436
## OVERALL ##
length(BIOFEEDBACK$repetitions)
## [1] 28
mean(BIOFEEDBACK$repetitions)
## [1] 81.82143
sd(BIOFEEDBACK$repetitions)
## [1] 22.12432
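For readers who want to rerun the output above, the three objects can be rebuilt from the printed values. This is only a sketch: the original data file is not shown, so the 0/1 coding of the `biofeedback` column and the one-row-per-subject layout are assumptions.

```r
# Values copied from the printed output above
BIOFEED_0 <- data.frame(
  biofeedback = 0,  # assumed code for "no biofeedback"
  repetitions = c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
)
BIOFEED_1 <- data.frame(
  biofeedback = 1,  # assumed code for "biofeedback"
  repetitions = c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)
)

# Assumed layout of the combined data: one row per subject
BIOFEEDBACK <- rbind(BIOFEED_0, BIOFEED_1)

# Group size, mean, and standard deviation for each group in one call
group_summary <- aggregate(
  repetitions ~ biofeedback, data = BIOFEEDBACK,
  FUN = function(y) c(n = length(y), mean = mean(y), sd = sd(y))
)
print(group_summary)
```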
1. The Between Group Mean variation
a. How is the sum of squares calculated for the between group mean variation?
### SCATTERPLOT ###
with(BIOFEEDBACK, plot(biofeedback,repetitions))
mod<-with(BIOFEEDBACK, lm(repetitions~biofeedback))
abline(coefficients(mod), lwd=2, lty=2, col="blue")
b. What is the between group mean variation sometimes called in the Analysis of Variance table?
Groups, Model, Treatment
c. To see how to manually calculate the sum of squares for the between group mean variation, let’s use the biofeedback example from Lesson 23.
We defined how the sum of squares for the between group mean variation is calculated. We’ll call this SSG (“G” for “groups”).
In notation:
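\( SSG=\sum_{\text{all obs}} (\bar{y}_i-\bar{y})^2 \)
(SEE FULL EXAMPLE ON HANDOUT)
Every observation in group \( i \) contributes the same term, so each group's squared difference is multiplied by its group size. With the group means and overall mean rounded to two decimals:
\( \sum_{\text{all obs}} (\bar{y}_i-\bar{y})^2=13(71.38-81.82)^2+15(90.87-81.82)^2 \)
Therefore, SSG = 1416.9168 + 1228.5375 = 2645.4543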
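The two terms above, 1416.9168 and 1228.5375, are what you get when the group means and overall mean are rounded to two decimals (71.38, 90.87, 81.82). The sketch below compares that with the unrounded calculation. The vector names `no_bf` and `bf` are made up for illustration; the values are the ones printed earlier.

```r
no_bf <- c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
bf    <- c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)
grand_mean <- mean(c(no_bf, bf))

# Every observation in a group contributes (group mean - grand mean)^2,
# so a group of size n_i contributes n_i times that squared difference
ssg_exact <- length(no_bf) * (mean(no_bf) - grand_mean)^2 +
             length(bf)    * (mean(bf)    - grand_mean)^2

# Same calculation with the means rounded to two decimals
ssg_rounded <- 13 * (71.38 - 81.82)^2 + 15 * (90.87 - 81.82)^2

print(c(exact = ssg_exact, rounded = ssg_rounded))
```

The unrounded value is about 2643.30, so the rounding shifts SSG by roughly 2 units. That is small relative to the total variation, but it is enough to show up in the last digits of the table.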
2. The Within Group variation
a. How is the sum of squares calculated for the within group variation?
b. What is the within group variation sometimes called in the Analysis of Variance table?
Residual, Error
c. Use the biofeedback example from Lesson 24 to calculate the sum of squares for the within group variation.
The sum of squares for the within group variation is just a measure of the variation of the residuals (or “error”). The typical notation is SSE.
In notation,
\( SSE=\sum_{\text{all obs}} (y_{ij}-\bar{y}_i)^2 \)
(SEE HANDOUT FOR FULL WORK)
Even though this will work, there is an easier way to calculate SSE.
\( SSE=\sum_{i=1}^{g} (n_i-1)s_i^2 \)
Thus, \( SSE=(12)(21.31^2)+(14)(19.13^2)=10572.79 \).
(SEE PROOF OF “SIMPLE” EQUATION ON HANDOUT)
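As a numerical check on the shortcut, the sketch below computes SSE both from the definition and from the group standard deviations. The vector names `no_bf` and `bf` are made up for illustration; the values are the ones printed earlier.

```r
no_bf <- c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
bf    <- c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)

# Definition: squared deviation of each observation from its own group mean
sse_direct <- sum((no_bf - mean(no_bf))^2) + sum((bf - mean(bf))^2)

# Shortcut: (n_i - 1) * s_i^2 added up over the groups
sse_shortcut <- (length(no_bf) - 1) * var(no_bf) +
                (length(bf) - 1)    * var(bf)

print(c(direct = sse_direct, shortcut = sse_shortcut))
```

Both routes agree (about 10572.81 with unrounded standard deviations), because \( s_i^2 \) is by definition the group's sum of squared deviations divided by \( n_i-1 \).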
The total sum of squares is calculated as in regression:
To illustrate, let’s calculate SST for the Biofeedback example. The order in which the observed # repetitions are presented follows the order in which they’re presented in the data on page 202 in Lesson 24.
(SEE WORK ON HANDOUT)
\( SST=\sum_{\text{all obs}}(y_{ij}-\bar{y})^2=13216.1 \)
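The total sum of squares can also be checked numerically, along with the fact that the between-group and within-group pieces add up to it. This is a sketch using the printed data; `no_bf`, `bf`, and the other variable names are made up for illustration.

```r
no_bf <- c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
bf    <- c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)
y     <- c(no_bf, bf)

# Total: squared deviation of every observation from the overall mean
sst <- sum((y - mean(y))^2)

# Between-group and within-group pieces, computed without rounding
ssg <- length(no_bf) * (mean(no_bf) - mean(y))^2 +
       length(bf)    * (mean(bf)    - mean(y))^2
sse <- sum((no_bf - mean(no_bf))^2) + sum((bf - mean(bf))^2)

print(c(SST = sst, SSG = ssg, SSE = sse, SSG_plus_SSE = ssg + sse))
```

When nothing is rounded, SSG + SSE equals SST exactly (up to floating-point error). Any mismatch in hand calculations comes from rounding the means or standard deviations.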
Filling in the rest of the Analysis of Variance table:
(SEE EQUATIONS ATTACHED)
2. Fill in the ANOVA table with the appropriate values for the “Biofeedback” problem.
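Source of variation   Sum of Squares   Degrees of freedom   Mean squares   F-stat
Between groups        2645.4543        2-1=1                2645.4543      6.51
Within groups         10572.79         28-2=26              406.65
Total                 13218.2443       27
Note: The Total row is the sum of the two rows above it. It differs slightly from SST = 13216.1 computed directly because SSG and SSE were calculated from rounded means and standard deviations.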
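The same table can be produced in R by fitting the group variable as a factor and calling `anova()`. This is a sketch: the 0/1 coding of `biofeedback` is an assumption, and the data frame is rebuilt from the printed values rather than read from the original file.

```r
BIOFEEDBACK <- data.frame(
  # Assumed coding: 0 = no biofeedback (13 subjects), 1 = biofeedback (15 subjects)
  biofeedback = factor(rep(c(0, 1), times = c(13, 15))),
  repetitions = c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100,
                  88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)
)

# One-way ANOVA as a linear model with a categorical explanatory variable
fit <- lm(repetitions ~ biofeedback, data = BIOFEEDBACK)

# Rows: between groups ("biofeedback") and within groups ("Residuals")
anova_table <- anova(fit)
print(anova_table)
```

Because R does not round intermediate values, its sums of squares (about 2643.30 and 10572.81) and F-statistic (about 6.50) differ slightly from the hand-calculated entries.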
E. The Analysis of Variance F-test
Using the F-statistic, is there evidence to indicate that the average # of consecutive repetitions is different between those who receive biofeedback and those who do not?
### F-test ###
Fstat<-6.51
n_df<-1
d_df<-26
pf(Fstat, n_df, d_df, lower.tail=FALSE)
## [1] 0.01695354
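With only two groups, the ANOVA F-test gives the same answer as the pooled (equal-variance) two-sample t-test: the F-statistic is the square of the t-statistic and the p-values match. The sketch below illustrates this with the printed data; `no_bf` and `bf` are made-up names.

```r
no_bf <- c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
bf    <- c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)

# Pooled two-sample t-test (equal variances assumed, as in ANOVA)
tt <- t.test(bf, no_bf, var.equal = TRUE)

# Squaring the t-statistic gives the ANOVA F-statistic on 1 and 26 df
f_from_t <- unname(tt$statistic)^2
p_from_f <- pf(f_from_t, 1, 26, lower.tail = FALSE)

print(c(F = f_from_t, p_from_t = tt$p.value, p_from_F = p_from_f))
```

The p-value is about 0.017. If a 0.05 significance level is used (an assumption, since the lesson does not state one), this is evidence that the average # of consecutive repetitions differs between the two groups.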
F. Another example
Is there a difference in the average GPAs of students at three universities?
1) State the null and alternative hypotheses for the Analysis of Variance F-test.
\( H_0:\mu_1=\mu_2=\mu_3 \), there is no difference in the group means
\( H_A: \) At least one group has a different mean.
2) Use the information below to fill in the Analysis of Variance table and answer the question of interest.
Summary information:
       University 1   University 2   University 3
n      10             10             10
mean   3.12           2.887          3.516
s      0.7631         0.6328         0.4724
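One way to fill in the table from summary statistics alone is sketched below, using the same formulas as in the Biofeedback example: SSG from the group means and sizes, SSE from the group standard deviations. The variable names are made up for illustration, and this is one possible worked solution rather than the handout's own.

```r
n    <- c(10, 10, 10)              # group sizes
ybar <- c(3.12, 2.887, 3.516)      # group means
s    <- c(0.7631, 0.6328, 0.4724)  # group standard deviations

g <- length(n)                     # number of groups
N <- sum(n)                        # total number of observations
grand_mean <- sum(n * ybar) / N

ssg <- sum(n * (ybar - grand_mean)^2)  # between groups
sse <- sum((n - 1) * s^2)              # within groups
msg <- ssg / (g - 1)                   # mean square, between groups
mse <- sse / (N - g)                   # mean square, within groups
f_stat  <- msg / mse
p_value <- pf(f_stat, g - 1, N - g, lower.tail = FALSE)

anova_table <- data.frame(
  source = c("Between groups", "Within groups", "Total"),
  SS     = c(ssg, sse, ssg + sse),
  df     = c(g - 1, N - g, N - 1),
  MS     = c(msg, mse, NA),
  F      = c(f_stat, NA, NA)
)
print(anova_table)
print(p_value)
```

The F-statistic comes out near 2.52 on 2 and 27 degrees of freedom, and the p-value is above 0.05. At that significance level (assumed here), these summaries do not give strong evidence of a difference in average GPA among the three universities.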