KITADA

Lesson #24

Introduction to Analysis of Variance

Motivation:

Let’s stop to think about what we’ve covered so far. We learned which analyses to use when both the response and explanatory variables are categorical (one- and two-proportion methods, Chi-square methods, Fisher’s Exact test), when the response variable is quantitative and the explanatory variable is categorical with 2 categories (two-sample methods for quantitative variables), and when both the response and explanatory variable(s) are quantitative (simple linear regression, multiple linear regression). But what should we do if we have a quantitative response variable and a categorical explanatory variable with more than 2 groups? The procedure to use in these situations is called Analysis of Variance (which uses the Analysis of Variance table). It is commonly used to analyze experiments. This lesson introduces the Analysis of Variance procedure.

What you need to know from this lesson: After completing this lesson, you should be able to

explain when to use the Analysis of Variance procedure
explain how the Analysis of Variance F-test works when comparing means of two or more groups
determine values for the sum of squares, degrees of freedom, and mean squares when comparing means of two or more groups
perform an Analysis of Variance F-test when comparing means of two or more groups (write the hypotheses, calculate the F-statistic, determine the degrees of freedom for the F-statistic, determine the p-value, and state a conclusion in the context of the problem)

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Section 8.1 in the text.
3. Do the Lesson 24 questions at the end of the lesson notes.

The Lesson

A. When is the Analysis of Variance procedure used?

When you have a quantitative response variable and a categorical explanatory variable with more than 2 categories.

Think of wanting to compare a quantitative variable of interest among MORE than two groups.

We can use the ANOVA procedure with a categorical explanatory variable to test for a difference in means among multiple populations.

Note: This is more general than the two-sample t-test, which compares the means of exactly two groups.

B. How does the Analysis of Variance procedure work?

It tests for a difference in group means, where the groups are defined by the categories of the explanatory variable.

1. Which of the two scatterplots below would you be more convinced that the group means are different? Why?

(SEE PLOTS IN HANDOUT)

The left graph, because the data for group 2 are shifted up from the data for group 1. The right graph shows data that are similar for the two groups.

2. The Analysis of Variance procedure analyzes variation to determine if there is evidence to say group means are different.

a. What are the null and alternative hypotheses (in words and in notation) that the Analysis of Variance procedure is “testing”?

\( H_0:\mu_1=\mu_2=\cdots=\mu_g \), there is no difference in the group means

\( H_A: \) At least one group has a different mean.

b. The Analysis of Variance procedure uses an F-test to test the above null hypothesis. How is the F-statistic calculated?

\( F=\frac{\text{variation BETWEEN groups}}{\text{variation WITHIN groups}} \)

C. Revisiting the Analysis of Variance table

To illustrate some of the calculations in the Analysis of Variance table, we’ll revisit the Biofeedback example in Lesson 23 and determine if there is a difference in the average # of consecutive repetitions before making an error between those who receive biofeedback and those that don’t. Here are the # of consecutive repetitions before making an error for each group and the summary statistics for each group.

### LESSON 24 ###
BIOFEED_0$repetitions

##  [1]  73  51  89  52  49  50  75  55  50  87 106  91 100

BIOFEED_1$repetitions

##  [1]  88 102 105  52 106  76 100 112  75  75 112 115  75  70 100

## NO BIOFEEDBACK ##
length(BIOFEED_0$repetitions)

## [1] 13

mean(BIOFEED_0$repetitions)

## [1] 71.38462

sd(BIOFEED_0$repetitions)

## [1] 21.30547

## BIOFEEDBACK ##
length(BIOFEED_1$repetitions)

## [1] 15

mean(BIOFEED_1$repetitions)

## [1] 90.86667

sd(BIOFEED_1$repetitions)

## [1] 19.13436

## OVERALL ##
length(BIOFEEDBACK$repetitions)

## [1] 28

mean(BIOFEEDBACK$repetitions)

## [1] 81.82143

sd(BIOFEEDBACK$repetitions)

## [1] 22.12432

1. The Between Group Mean variation

a. How is the sum of squares calculated for the between group mean variation?

### SCATTERPLOT ###
with(BIOFEEDBACK, plot(biofeedback,repetitions))
mod<-with(BIOFEEDBACK, lm(repetitions~biofeedback))
abline(coefficients(mod), lwd=2, lty=2, col="blue")

(Plot: repetitions versus biofeedback group, with the fitted line from mod drawn as a dashed blue line.)

b. What is the between group mean variation sometimes called in the Analysis of Variance table?

Groups, Model, Treatment

c. To see how to manually calculate the sum of squares for the between group mean variation, let’s use the biofeedback example from Lesson 23.

We defined how the sum of squares for the between group mean variation is calculated. We’ll call this SSG (“G” for “groups”).

In notation:

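\( SSG=\sum_{\text{all obs}} (\bar{y}_i-\bar{y})^2 \)

(SEE FULL EXAMPLE ON HANDOUT)

\( \sum_{\text{all obs}} (\bar{y}_i-\bar{y})^2 \)

No biofeedback group = 13(108.9936) = 1416.9168
Biofeedback group = 15(81.9025) = 1228.5375

Here 108.9936 = (71.38 - 81.82)^2 and 81.9025 = (90.87 - 81.82)^2 are the squared differences between each group mean and the overall mean, using means rounded to two decimal places. Each squared difference is multiplied by the group's sample size because it appears once for every observation in that group.

Therefore, SSG = 1416.9168 + 1228.5375 = 2645.4543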
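The hand calculation above can be checked in R. This is a minimal sketch, assuming the repetition counts are typed in from the R output printed earlier in this lesson; the object names (`no_biofeedback`, `biofeedback`, `groups`) are made up for illustration and are not the course's data objects.

```r
# Repetition counts copied from the output printed earlier in this lesson
no_biofeedback <- c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
biofeedback    <- c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)

groups <- list(no_biofeedback, biofeedback)  # hypothetical container for the two groups
n_i    <- sapply(groups, length)             # group sample sizes: 13 and 15
ybar_i <- sapply(groups, mean)               # group means
ybar   <- mean(unlist(groups))               # overall mean of all 28 observations

# Sum of squares between groups: n_i * (group mean - overall mean)^2, added over groups
SSG <- sum(n_i * (ybar_i - ybar)^2)
SSG
```

Because R carries the unrounded means, this gives a value near 2643.30 rather than 2645.4543; the small gap comes only from rounding the means in the hand calculation.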

2. The Within Group variation

a. How is the sum of squares calculated for the within group variation?

b. What is the within group variation sometimes called in the Analysis of Variance table?

Residual, Error

c. Use the biofeedback example from Lesson 24 to calculate the sum of squares for the within group variation.

The sum of squares for the within group variation is just a measure of the variation of the residuals (or “error”). The typical notation is SSE.

In notation,

\( SSE=\sum_{\text{all obs}} (y_{ij}-\bar{y}_i)^2 \)

(SEE HANDOUT FOR FULL WORK)

Even though this will work, there is an easier way to calculate SSE.

\( SSE=\sum_{i=1}^{g} (n_i-1)s_i^2 \)

\( g \): Number of groups
\( n_i \): Sample size for \( i^{th} \) group
\( s_i \): Sample standard deviation of the \( i^{th} \) group

Thus, \( SSE=(12)(21.31^2)+(14)(19.13^2)=10572.79 \).

(SEE PROOF OF “SIMPLE” EQUATION ON HANDOUT)
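As a numerical check (not a proof), the two ways of computing SSE can be compared in R. This sketch assumes the data are typed in from the output printed earlier; the object names are illustrative only.

```r
# Repetition counts copied from the output printed earlier in this lesson
no_biofeedback <- c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
biofeedback    <- c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)

# Direct definition: squared distance of each observation from its own group mean
SSE_direct <- sum((no_biofeedback - mean(no_biofeedback))^2) +
              sum((biofeedback - mean(biofeedback))^2)

# Shortcut: (n_i - 1) * s_i^2 added over the groups; var() returns s_i^2
SSE_shortcut <- (length(no_biofeedback) - 1) * var(no_biofeedback) +
                (length(biofeedback) - 1) * var(biofeedback)

all.equal(SSE_direct, SSE_shortcut)  # TRUE when the two agree up to floating-point error
```

The shortcut works because the sample variance of a group is that group's sum of squared deviations divided by \( n_i-1 \).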

The total sum of squares is calculated as in regression:

Total variation measures the difference between each observed value and the overall mean: (observed – overall mean)
In notation: \( y_{ij}-\bar{y} \)
- \( y_{ij} \): \( ij^{th} \) observation
- \( \bar{y} \): overall mean
Therefore, \( SST=\sum_{\text{all obs}}(y_{ij}-\bar{y})^2 \)

To illustrate, let’s calculate SST for the Biofeedback example. The order in which the observed # repetitions are presented follows the order in which they’re presented in the data on page 202 in Lesson 24.

(SEE WORK ON HANDOUT)

\( SST=\sum_{\text{all obs}}(y_{ij}-\bar{y})^2=13216.1 \)
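The total sum of squares, and the way it splits into between-group and within-group pieces, can be sketched in R. The data are typed in from the output printed earlier, and all object names here are assumptions made for illustration.

```r
# Repetition counts copied from the output printed earlier in this lesson
no_biofeedback <- c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
biofeedback    <- c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)
y <- c(no_biofeedback, biofeedback)  # all 28 observations together

# Total sum of squares: squared distance of every observation from the overall mean
SST <- sum((y - mean(y))^2)

# Between-group and within-group pieces, computed from their definitions
SSG <- length(no_biofeedback) * (mean(no_biofeedback) - mean(y))^2 +
       length(biofeedback) * (mean(biofeedback) - mean(y))^2
SSE <- sum((no_biofeedback - mean(no_biofeedback))^2) +
       sum((biofeedback - mean(biofeedback))^2)

SST
all.equal(SST, SSG + SSE)  # the decomposition SST = SSG + SSE
```

SST can also be obtained as `(length(y) - 1) * var(y)`, which uses the overall standard deviation reported earlier.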

Filling in the rest of the Analysis of Variance table:

(SEE EQUATIONS ATTACHED)
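The attached equations are not reproduced in these notes. As a hedged summary, the standard one-way Analysis of Variance relationships are sketched here, writing \( g \) for the number of groups (as in \( H_0 \)) and \( n \) for the total number of observations. The symbols \( n \), MSG, and MSE are assumed names, following the SSG and SSE notation above.

\( MSG=\frac{SSG}{g-1}, \qquad MSE=\frac{SSE}{n-g}, \qquad F=\frac{MSG}{MSE} \)

\( SST=SSG+SSE, \qquad n-1=(g-1)+(n-g) \)

For the Biofeedback problem, \( g=2 \) and \( n=28 \), so the degrees of freedom are 1, 26, and 27.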

2. Fill in the ANOVA table with the appropriate values for the “Biofeedback” problem.

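Source of variation   Sum of squares   Degrees of freedom   Mean squares   F-statistic
Between groups        2645.4543        2-1=1                2645.4543      6.51
Within groups         10572.79         28-2=26              406.65
Total                 13216.1          27

Note: SSG + SSE = 13218.24 here, which differs slightly from SST = 13216.1 because SSG was calculated by hand from rounded means.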
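R can build the same table directly. This is a sketch under the assumption that the data are entered by hand from the output printed earlier; the data frame `dat` and its column names are hypothetical and are not the course's BIOFEEDBACK object.

```r
# Build a small data frame from the repetition counts printed earlier in this lesson
dat <- data.frame(
  repetitions = c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100,
                  88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100),
  group = factor(rep(c("none", "biofeedback"), times = c(13, 15)))
)

# Fit the one-way model and request its Analysis of Variance table
fit <- lm(repetitions ~ group, data = dat)
tab <- anova(fit)
tab  # rows: group (between groups) and Residuals (within groups)
```

The printed table has columns for degrees of freedom, sum of squares, mean square, F value, and p-value. Its sums of squares are unrounded, so they differ slightly from the hand-calculated entries.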

E. The Analysis of Variance F-test

Using the F-statistic, is there evidence to indicate that the average # of consecutive repetitions is different between those who receive biofeedback and those that do not?

### F-test ###
Fstat<-6.51
n_df<-1
d_df<-26

pf(Fstat, n_df, d_df, lower.tail=FALSE)

## [1] 0.01695354
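The note in Part A says Analysis of Variance is more general than the two-sample t-test. With only two groups, the F-test and the pooled (equal-variance) two-sample t-test should agree. The sketch below illustrates this, assuming the data are typed in from the earlier output; the object names are illustrative.

```r
# Repetition counts copied from the output printed earlier in this lesson
no_biofeedback <- c(73, 51, 89, 52, 49, 50, 75, 55, 50, 87, 106, 91, 100)
biofeedback    <- c(88, 102, 105, 52, 106, 76, 100, 112, 75, 75, 112, 115, 75, 70, 100)

# Pooled two-sample t-test (var.equal = TRUE assumes equal group variances)
tt <- t.test(biofeedback, no_biofeedback, var.equal = TRUE)

# One-way Analysis of Variance F-test on the same data
y     <- c(no_biofeedback, biofeedback)
group <- factor(rep(c("none", "biofeedback"), times = c(13, 15)))
tab   <- anova(lm(y ~ group))

F_from_t <- unname(tt$statistic)^2  # squaring the t-statistic
F_anova  <- tab[["F value"]][1]
c(F_from_t = F_from_t, F_anova = F_anova)
c(p_t = tt$p.value, p_anova = tab[["Pr(>F)"]][1])
```

The squared t-statistic matches the F-statistic and the two p-values match, so the F-test adds nothing new for two groups; its value is that it extends to three or more.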

F. Another example

Is there a difference in the average GPAs of students at three universities?

1) State the null and alternative hypotheses for the Analysis of Variance F-test.

\( H_0:\mu_1=\mu_2=\mu_3 \), there is no difference in the group means

\( H_A: \) At least one group has a different mean.

2) Use the information below to fill in the Analysis of Variance table and answer the question of interest.

Summary information:
              University 1   University 2   University 3
    n         10             10             10
    mean      3.12           2.887          3.516
    s         0.7631         0.6328         0.4724
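When only summary statistics are available, the table entries can be computed from the group sizes, means, and standard deviations. The following is one possible sketch in R using the summary information above; the variable names are assumptions made for illustration.

```r
# Summary information for the three universities
n      <- c(10, 10, 10)              # group sample sizes
ybar_i <- c(3.12, 2.887, 3.516)      # group means
s_i    <- c(0.7631, 0.6328, 0.4724)  # group standard deviations

g    <- length(n)             # number of groups
N    <- sum(n)                # total number of observations
ybar <- sum(n * ybar_i) / N   # overall mean, weighted by group size

SSG <- sum(n * (ybar_i - ybar)^2)  # between-group sum of squares
SSE <- sum((n - 1) * s_i^2)        # within-group sum of squares (shortcut formula)
MSG <- SSG / (g - 1)
MSE <- SSE / (N - g)

Fstat   <- MSG / MSE
p_value <- pf(Fstat, g - 1, N - g, lower.tail = FALSE)
c(SSG = SSG, SSE = SSE, MSG = MSG, MSE = MSE, F = Fstat, p = p_value)
```

The degrees of freedom are 3-1=2 for between groups and 30-3=27 for within groups. The conclusion, stated in the context of the problem, still has to be written by comparing the p-value to the chosen significance level.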