Module 7: ANOVA Tests (Analysis of Variance)

By the end of this lesson, you should be able to:

Define ANOVA tests
Use one-way ANOVA tests
Follow up ANOVA tests with pairwise comparisons using the TukeyHSD() function in R

Introduction to ANOVA Tests

Analysis of variance (ANOVA) is used to analyze the differences between group means. ANOVA provides a statistical test of whether the means of several groups are equal, which makes it useful for comparing two or more groups for statistical significance (in contrast to the t-test, which is limited to two conditions or groups). We can use an ANOVA test to check whether the mean of a response variable differs across the categories of a categorical variable.
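
To make the comparison with the t-test concrete, here is a minimal sketch. It assumes R's built-in PlantGrowth dataset as a stand-in (it is not the lesson's data), restricted to two of its three groups. With only two groups, a pooled-variance t-test and a one-way ANOVA should give the same answer: the F statistic equals the squared t statistic.

```r
# Stand-in data (an assumption): PlantGrowth ships with R and has
# a numeric response (weight) and a 3-level factor (group).
# Keep only two groups so that a t-test applies.
two_groups <- droplevels(subset(PlantGrowth, group %in% c("ctrl", "trt1")))

# Two-sample t-test with pooled variance
t_res <- t.test(weight ~ group, data = two_groups, var.equal = TRUE)

# One-way ANOVA on the same two groups
aov_tab <- anova(lm(weight ~ group, data = two_groups))

t_stat <- unname(t_res$statistic)
f_stat <- aov_tab[["F value"]][1]

# With two groups the tests agree: F equals t squared,
# and the two p-values match.
print(all.equal(t_stat^2, f_stat))
print(all.equal(t_res$p.value, aov_tab[["Pr(>F)"]][1]))
```

The advantage of ANOVA is that the same call keeps working when the categorical variable has three or more levels, where a single t-test no longer applies.
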
Let us look at the following dataset which accounts for energy use in three campus-owned houses.

gasdata = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdEotQWVNREZQeFVXZEItS2JtYVQzTmc&output=csv")

Note: fetchGoogle() needs the RCurl package. If R reports that RCurl is missing, install it with install.packages("RCurl") and run the command again.

head(gasdata)

The variable therms is a measure of energy use (the more energy used, the larger the therms).

hdd is monthly heating degree days, a measure of the amount of heating that the houses needed that month given the outside temperatures.

address is a categorical variable specifying location a, b, or c.

renovated is a categorical variable indicating whether the location had already been renovated/insulated or not (yes/no).

ANOVA tests are related to treatment/categorical variables in our model.
The one-way ANOVA test explores the significance of one treatment/categorical variable.

If we look only at price and address in gasdata, the null hypothesis for the ANOVA test is that the mean price is the same for all treatment groups (addresses a, b, and c). When we run an ANOVA test, we once again focus on the p-value, which tells us whether the relationship being investigated is statistically significant. If the p-value is < 0.05, we reject the null hypothesis and conclude that at least one address has a mean price that differs from the others.
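
Before testing, it can help to look at the group means that the null hypothesis is about. The sketch below again assumes the built-in PlantGrowth dataset as a stand-in for gasdata, with weight playing the role of the response and group playing the role of the categorical variable.

```r
# Stand-in data (an assumption): PlantGrowth, with response `weight`
# and a 3-level categorical variable `group`.

# Sample mean of the response within each group
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Number of observations in each group
group_sizes <- table(PlantGrowth$group)

print(group_means)
print(group_sizes)

# The null hypothesis says the three population means are equal.
# The sample means will rarely be identical; the ANOVA test asks
# whether they differ by more than chance alone would explain.
mean_spread <- diff(range(group_means))
print(mean_spread)
```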

There are two treatment/categorical variables in this dataset: address and renovated. To understand how the one-way ANOVA test works, let's build a model, called mod1.

mod1 = lm(price ~ address, data = gasdata)

One-way ANOVA Tests

We will use an ANOVA test to see whether the mean prices differ among addresses a, b, and c.

We use the anova() function to perform ANOVA tests. The anova() function is similar to the summary() function, in that both take linear models as their arguments; however, they give us somewhat different outputs.

anova(mod1)

Based on the ANOVA table, the p-value for address is < 0.05. Therefore, we can reject the null hypothesis that the mean prices at addresses a, b, and c are the same.
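
The following sketch shows how summary() and anova() differ on the same model and how the p-value can be read from the ANOVA table programmatically. It assumes the built-in PlantGrowth dataset as a stand-in for gasdata, so the numbers it prints are not the lesson's results.

```r
# Stand-in model (an assumption): weight ~ group on PlantGrowth
fit <- lm(weight ~ group, data = PlantGrowth)

coef_view  <- summary(fit)  # one row per coefficient (baseline group and contrasts)
anova_view <- anova(fit)    # one row per model term, plus a Residuals row

print(coef_view$coefficients)
print(anova_view)

# Read the p-value for the categorical variable from the ANOVA table
p_value <- anova_view["group", "Pr(>F)"]
alpha <- 0.05

if (p_value < alpha) {
  message("Reject the null hypothesis: at least one group mean differs.")
} else {
  message("Fail to reject the null hypothesis.")
}

# In a one-way model, the overall F statistic reported by summary()
# is the same F statistic shown in the ANOVA table.
same_f <- all.equal(unname(coef_view$fstatistic["value"]),
                    anova_view["group", "F value"])
print(same_f)
```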

The TukeyHSD() Function

TukeyHSD stands for Tukey's Honest Significant Difference method. The function TukeyHSD() creates a set of confidence intervals on the differences between group means with the specified family-wise probability of coverage. We use the TukeyHSD() function when the ANOVA table reveals statistical significance, to find out which pairs of groups differ.

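The general command is TukeyHSD(YOUR_MODEL, conf.level = 0.95).
If you do not specify conf.level (which stands for confidence level), R uses the default value of 0.95. Note that base R's TukeyHSD() expects a model fitted with aov(); if passing an lm() model produces an error, refit the model with aov() using the same formula.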

TukeyHSD(mod1)

Based on the results, the mean prices for the pair C-A differ significantly (p = 0.0441), while the mean prices for the pairs B-A and C-B do not differ significantly (p = 0.1277 and p = 0.8849, respectively). The first column of the result table is the difference in mean price for each pair. We can see that the price at address a is the highest.
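
As a rough illustration of how to read a TukeyHSD() result table, the sketch below fits the model with aov() and inspects the columns. It assumes the built-in PlantGrowth dataset as a stand-in for gasdata, so the pairs and values are not the ones quoted above.

```r
# Stand-in model (an assumption): weight ~ group on PlantGrowth,
# fitted with aov() because base R's TukeyHSD() works on aov objects.
fit_aov <- aov(weight ~ group, data = PlantGrowth)

tukey <- TukeyHSD(fit_aov, conf.level = 0.95)

# The result is a list with one table per factor; ours is named "group".
tukey_table <- tukey$group
print(tukey_table)

# Columns: diff  = difference in means (first-named group minus second),
#          lwr/upr = bounds of the confidence interval for that difference,
#          p adj = p-value adjusted for making several comparisons.
print(colnames(tukey_table))

# Pairs whose adjusted p-value is below 0.05
significant <- tukey_table[, "p adj"] < 0.05

# Equivalent check: the 95% interval for the difference excludes zero
excludes_zero <- tukey_table[, "lwr"] > 0 | tukey_table[, "upr"] < 0

print(significant)
print(excludes_zero)
```

A positive diff means the first-named group in the row label has the larger mean, which is how the sign of the first column can be used to rank the groups.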

We can also use a boxplot to confirm our result.

boxplot(price ~ address, data = gasdata)

As shown in the boxplot, the price at address a is higher than the prices at addresses b and c.
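
A labeled version of this kind of plot can be sketched as follows, again assuming the built-in PlantGrowth dataset as a stand-in for gasdata. Keep in mind that the thick line in each box is the group median, not the mean that ANOVA compares, so the plot is a visual check rather than a test.

```r
# Stand-in data (an assumption): PlantGrowth, response `weight`, factor `group`
bp <- boxplot(weight ~ group, data = PlantGrowth,
              xlab = "group", ylab = "weight",
              main = "Response by group")

# boxplot() also returns the numbers it drew:
# row 3 of bp$stats holds the median of each group.
box_medians <- bp$stats[3, ]
names(box_medians) <- bp$names
print(box_medians)

# Group means, for comparison with the medians shown in the plot
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
print(group_means)
```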

Now You Try!

First upload the dataset Rdata3 to files in R, then type dat = read.csv("Rdata3.csv").

Answer the following questions:
Q1. What is the categorical variable in the dataset Rdata3? What are the groups?

Q2. Run a one-way ANOVA test using the appropriate response variable, and generate a summary table.

Q3. What does the summary result tell you?

Q4. Using the TukeyHSD() method, what does the result tell you? Use a boxplot to confirm your answer.

Answer Key

Q1.
inducer; water, lactose, and raffinose

Q2.
test = aov(IU~inducer, data=dat)
summary(test)

Q3.
The result tells us that the mean IU differs significantly among the inducer groups.

Q4.
test = aov(IU~inducer, data=dat)
TukeyHSD(test)
The p-value for raffinose-lactose is > 0.05, which means the mean IU does not differ significantly between the raffinose and lactose groups. The p-values for water-lactose and water-raffinose are both < 0.05, which means the mean IU of the water group differs significantly from that of both the lactose group and the raffinose group. Also, we can see that the water group has the lowest IU value while the raffinose group has the highest IU value.

boxplot(IU ~ inducer, data = dat)

As the boxplot shows, water has the lowest IU value and raffinose has the highest IU value.

Citation [1]: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/TukeyHSD.html