This article shows an example of the analysis of variance (ANOVA) using the statistical software R. The code used to generate this content can be seen in the next GitHub repository.
An engineer is designing a battery for use in a device that will be subjected to some extreme variations in temperature. The only design parameter that he can select at this point is the plate material for the battery, and he has three possible choices. When the device is manufactured and is shipped to the field, the engineer has no control over the temperature extremes that the device will encounter, and he knows from experience that temperature will probably affect the effective battery life. However, the temperature can be controlled in the product development laboratory for a test.
The engineer decides to test all three plate materials at three temperature levels: \(15\) (low), \(70\) (medium), and \(125°F\) (high) because these temperature levels are consistent with the product end-use environment. Four batteries are tested at each combination of plate material and temperature, and all 36 tests are run in random order. The experiment and the resulting observed battery life data are given in the next table.
battery_life | temperature | material_type |
---|---|---|
130 | low | A |
155 | low | A |
34 | medium | A |
40 | medium | A |
20 | high | A |
70 | high | A |
74 | low | A |
180 | low | A |
80 | medium | A |
75 | medium | A |
82 | high | A |
58 | high | A |
150 | low | B |
188 | low | B |
136 | medium | B |
122 | medium | B |
25 | high | B |
70 | high | B |
159 | low | B |
126 | low | B |
106 | medium | B |
115 | medium | B |
58 | high | B |
45 | high | B |
138 | low | C |
110 | low | C |
174 | medium | C |
120 | medium | C |
96 | high | C |
104 | high | C |
168 | low | C |
160 | low | C |
150 | medium | C |
139 | medium | C |
82 | high | C |
60 | high | C |
We can arrange the table differently to get a better understanding of the data.
material_type | low | low | medium | medium | high | high |
---|---|---|---|---|---|---|
A | 130 | 155 | 34 | 40 | 20 | 70 |
A | 74 | 180 | 80 | 75 | 82 | 58 |
B | 150 | 188 | 136 | 122 | 25 | 70 |
B | 159 | 126 | 106 | 115 | 58 | 45 |
C | 138 | 110 | 174 | 120 | 96 | 104 |
C | 168 | 160 | 150 | 139 | 82 | 60 |
In this table, the columns represent the different temperatures and the rows depict the types of material.
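The same layout can be summarised by the mean of each material/temperature cell. The following is a minimal sketch, not the author's repository code; the data frame name `battery` and the helper object `cell_means` are assumptions made for illustration, and the values are transcribed from the table above.

```r
# Battery life data transcribed from the long-format table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  # Each material block cycles low, low, medium, medium, high, high, twice.
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6),
                       levels = c("low", "medium", "high")),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

# Mean battery life per cell: rows are material types, columns are temperatures.
cell_means <- with(battery,
                   tapply(battery_life, list(material_type, temperature), mean))
print(cell_means)
```

For instance, the four observations for material A at low temperature (130, 155, 74, 180) average 134.75.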
It is always a good idea to examine experimental data graphically. The next figure presents a boxplot for battery life at each level of temperature and material type.
The graph indicates that, in general, battery life increases as the temperature decreases (for material type A this is less clear). Based on this simple graphical analysis, we strongly suspect that (1) temperature affects battery life (at least for material type B), and (2) lower temperatures generally result in longer battery life.
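A boxplot of this kind could be drawn with base R graphics roughly as follows. This is a sketch under assumptions: the data frame name `battery`, the axis labels, and the `cell_medians` summary are illustrative, and the original figure may have been produced with a different plotting package.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6),
                       levels = c("low", "medium", "high")),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

# One box per temperature/material combination (temperature varies fastest).
boxplot(battery_life ~ temperature + material_type, data = battery,
        las = 2, xlab = "", ylab = "Battery life")

# Numeric companion to the plot: the median of each cell.
cell_medians <- aggregate(battery_life ~ temperature + material_type,
                          data = battery, FUN = median)
print(cell_medians)
```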
Because there are two factors at three levels, this design is sometimes called a \(3^2\) factorial design. In this problem the engineer wants to answer the following questions:
What effects do material type and temperature have on the life of the battery?
Is there a choice of material that would give a uniformly long life regardless of temperature?
This design is a specific example of the general case of a two-factor factorial.
The fixed effects model can be described as follows \[y_{ijk} = \mu+\tau_i + \beta_j+ (\tau\beta)_{ij} +\epsilon_{ijk} \mspace{36mu} i=1,...,a \mspace{12mu} j=1,...,b \mspace{12mu} k=1,2,...,n\] where \(\mu\) is the overall mean effect, \(\tau_i\) is the effect of the \(i\)th level of the row factor A, \(\beta_j\) is the effect of the \(j\)th level of column factor B, \((\tau\beta)_{ij}\) is the effect of the interaction between \(\tau_i\) and \(\beta_j\), and \(\epsilon_{ijk}\) is a random error component. Both factors are assumed to be fixed.
In the example, both factors, temperature and material type, have three levels, so \(a=3\) and \(b=3\). We have four observations for each combination of the factors, so \(n=4\). There are \(abn=3\times3\times4=36\) total observations.
We are interested in testing hypotheses about the equality of row treatment effects, say \[ H_0: \tau_1=\tau_2=...=\tau_a=0\\ H_1: \tau_i\neq0 \text{ for at least one }i \]
and the equality of column treatment effects, say
\[ H_0: \beta_1=\beta_2=...=\beta_b=0\\ H_1: \beta_j\neq0 \text{ for at least one }j \]
We are also interested in determining whether row and column treatments interact. Thus, we also wish to test
\[ H_0: (\tau\beta)_{ij}=0 \text{ for all }ij\\ H_1: (\tau\beta)_{ij}\neq0 \text{ for at least one pair }ij \]
These hypotheses are tested using a two-factor analysis of variance.
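As a brief reminder of how these tests are built, the total variability is partitioned into one piece per model term plus error. The sum-of-squares symbols \(SS\) and mean squares \(MS\) below follow common textbook notation and are assumptions here, since the article does not define them explicitly:
\[ SS_T = SS_A + SS_B + SS_{AB} + SS_E, \qquad abn-1 = (a-1) + (b-1) + (a-1)(b-1) + ab(n-1) \]
Each mean square is its sum of squares divided by its degrees of freedom, and each hypothesis is tested with the ratio of the corresponding mean square to the error mean square:
\[ F_0 = \frac{MS_A}{MS_E}, \qquad F_0 = \frac{MS_B}{MS_E}, \qquad F_0 = \frac{MS_{AB}}{MS_E} \]
With \(a=b=3\) and \(n=4\), the degrees of freedom are \(2\), \(2\), \(4\) and \(27\), which add up to \(abn-1=35\).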
The ANOVA table is shown in the next table.
Term | Df | Sum Sq | Mean Sq | F value | P value |
---|---|---|---|---|---|
temperature | 2 | 39118.722 | 19559.361 | 28.967692 | 0.0000002 |
material_type | 2 | 10683.722 | 5341.861 | 7.911372 | 0.0019761 |
temperature:material_type | 4 | 9613.778 | 2403.444 | 3.559535 | 0.0186112 |
Residuals | 27 | 18230.750 | 675.213 |
Because all three P-values are smaller than the significance level \(\alpha=0.05\), we reject each null hypothesis \(H_0\) and conclude that there is a significant interaction between material type and temperature, and that the main effects of material type and temperature are also significant.
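A table like this one can be obtained with base R's `aov`. The sketch below is an illustration under assumptions (the data frame name `battery` and the object names are not taken from the author's repository); the model formula `temperature * material_type` expands to both main effects plus their interaction.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6)),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

# Two-factor fixed effects model with interaction.
fit <- aov(battery_life ~ temperature * material_type, data = battery)

# summary() returns a list whose first element is the ANOVA table.
anova_table <- summary(fit)[[1]]
print(anova_table)
```

Because the design is balanced (four runs in every cell), the order in which the two factors enter the formula does not change the sums of squares.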
To assist in interpreting the results of this experiment, it is helpful to construct a graph of the average responses at each treatment combination. This graph is shown in the next figure.
The significant interaction is indicated by the lack of parallelism of the lines. In general, longer battery life is attained at low temperature, regardless of material type. Changing from low to medium temperature, battery life with material type C may actually increase, whereas it decreases for types A and B. From medium to high temperature, battery life decreases for material types B and C and is essentially unchanged for type A. Material type C seems to give the best results if we want less loss of effective life as the temperature changes.
We can also see the tests on the individual model coefficients (temperature, material type, and temperature:material type), where high temperature and material type A are the reference levels.
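An interaction plot of the cell means can be sketched with base R's `interaction.plot`. The data frame name `battery`, the labels, and the helper objects `cell_means` and `low_to_medium` are assumptions for illustration; the original figure may have been drawn differently.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6),
                       levels = c("low", "medium", "high")),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

# One line per material type; non-parallel lines suggest interaction.
with(battery, interaction.plot(x.factor = temperature,
                               trace.factor = material_type,
                               response = battery_life,
                               fun = mean, type = "b",
                               xlab = "Temperature",
                               ylab = "Mean battery life",
                               trace.label = "Material"))

# Change in mean life when moving from low to medium temperature.
cell_means <- with(battery,
                   tapply(battery_life, list(material_type, temperature), mean))
low_to_medium <- cell_means[, "medium"] - cell_means[, "low"]
print(low_to_medium)
```

The last vector makes the crossing pattern explicit: the mean drops by 77.5 for A and by 36 for B, but rises by 1.75 for C.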
Term | Estimate | Standard error | Statistic | P value |
---|---|---|---|---|
(Intercept) | 57.50 | 12.99243 | 4.4256540 | 0.0001424 |
temperaturelow | 77.25 | 18.37407 | 4.2042942 | 0.0002573 |
temperaturemedium | -0.25 | 18.37407 | -0.0136061 | 0.9892443 |
material_typeB | -8.00 | 18.37407 | -0.4353962 | 0.6667358 |
material_typeC | 28.00 | 18.37407 | 1.5238866 | 0.1391655 |
temperaturelow:material_typeB | 29.00 | 25.98486 | 1.1160345 | 0.2742418 |
temperaturemedium:material_typeB | 70.50 | 25.98486 | 2.7131183 | 0.0114615 |
temperaturelow:material_typeC | -18.75 | 25.98486 | -0.7215740 | 0.4767592 |
temperaturemedium:material_typeC | 60.50 | 25.98486 | 2.3282788 | 0.0276325 |
An F test is displayed for the model source of variation.
R-squared | Adjusted R-squared | Standard error | Statistic | P value | Df |
---|---|---|---|---|---|
0.7652098 | 0.6956423 | 25.98486 | 10.99953 | 9e-07 | 9 |
The P-value is small (\(9\times10^{-7}\), well below \(0.0001\)), so the interpretation of this test is that at least one of the three terms in the model is significant. Also, \(\text{R-squared}=0.7652\). That is, about \(77\) percent of the variability in the battery life is explained by the plate material in the battery, the temperature, and the material type–temperature interaction.
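The coefficient table and the overall model summary can be reproduced, as a sketch, with `lm` and `summary`. The data frame name `battery` and the object names are assumptions; with R's default alphabetical factor ordering, "high" and "A" become the reference levels, which matches the coefficient names shown above.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6)),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

fit_lm <- lm(battery_life ~ temperature * material_type, data = battery)
s <- summary(fit_lm)

# Estimates, standard errors, t statistics and P-values per coefficient.
print(coef(s))

# Overall F test for the model and the share of variability explained.
f <- s$fstatistic
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
cat("R-squared:", s$r.squared, " adjusted:", s$adj.r.squared, "\n")
cat("F:", f[["value"]], " P-value:", p_model, "\n")
```

The intercept (57.5) is simply the mean of the high-temperature, material A cell, and every other coefficient is a difference relative to that cell.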
In the next section, we discuss the use of the residuals and residual plots in model adequacy checking.
Violations of the basic assumptions and model adequacy can be easily investigated by the examination of residuals. The residuals for the two-factor factorial model with interaction are \[e_{ijk}=y_{ijk}-\hat{y}_{ijk}=y_{ijk}-\overline{y}_{ij.}\]
A check of the normality assumption could be made by plotting a histogram of the residuals. If the \(NID(0,\sigma^2)\) assumption on the errors is satisfied, this plot should look like a sample from a normal distribution centered at zero. Unfortunately, with small samples, considerable fluctuation in the shape of a histogram often occurs, so the appearance of a moderate departure from normality does not necessarily imply a serious violation of the assumptions. Gross deviations from normality are potentially serious and require further analysis.
An extremely useful procedure is to construct a normal probability plot of the residuals. If the error distribution is normal, this plot will resemble a straight line. In visualizing the straight line, place more emphasis on the central values of the plot than on the extremes.
The general impression from examining this display is that the error distribution is approximately normal, although the largest negative residual (\(-60.75\) at low temperature for material type A) does stand out somewhat from the others. The standardized value of this residual is \(\frac{-60.75}{\sqrt{675.21}}=-2.34\), and this is the only residual whose absolute value is larger than 2.
Alternatively, we can use the Shapiro-Wilk test to check the normality of the errors. In this case, the null hypothesis of this test is that the errors are normally distributed.
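The residual calculations described above can be sketched as follows. The data frame name `battery` and the object names are assumptions; the standardization divides each residual by the square root of the error mean square, as in the text.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6)),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

fit <- aov(battery_life ~ temperature * material_type, data = battery)

# Residual = observation minus its cell mean.
res <- residuals(fit)

# Error mean square: residual sum of squares over residual degrees of freedom.
mse <- deviance(fit) / df.residual(fit)
std_res <- res / sqrt(mse)

# Histogram and normal probability plot of the residuals.
hist(res, main = "Residuals", xlab = "Residual")
qqnorm(res)
qqline(res)

# Observations whose standardized residual exceeds 2 in absolute value.
print(battery[abs(std_res) > 2, ])
```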
The results of this test in the example are shown in the next table.
Statistic | P value |
---|---|
0.976057 | 0.6117267 |
Because the P-value is \(p=0.6117267>\alpha=0.05\), the null hypothesis that the residuals came from a normally distributed population cannot be rejected. This is the same conclusion reached by analyzing the normal probability plot of the residuals.
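In base R the test is available as `shapiro.test`. A minimal sketch, assuming the data frame name `battery` and applying the test to the residuals of the two-factor model:

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6)),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

fit <- aov(battery_life ~ temperature * material_type, data = battery)

# Shapiro-Wilk test on the model residuals (H0: errors are normal).
sw <- shapiro.test(residuals(fit))
print(sw)

# Decision at the 5 percent level.
reject_normality <- sw$p.value < 0.05
print(reject_normality)
```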
Plotting the residuals in time order of data collection helps detect a strong correlation between the residuals. A tendency to have runs of positive and negative residuals indicates a positive correlation. This would imply that the independence assumption on the errors has been violated.
A plot of these residuals versus run order or time is shown in the next figure.
There is no reason to suspect any violation of independence or constant variance assumptions.
If the model is correct and the assumptions are satisfied, the residuals should be structureless; in particular, they should be unrelated to any other variable including the predicted response. A simple check is to plot the residuals versus the fitted values \(\hat{y}_{ijk}\) (recall that \(\hat{y}_{ijk}=\overline{y}_{ij.}\)). This plot should not reveal any obvious pattern. The next figure plots the residuals versus the fitted values for the example.
There is some mild tendency for the variance of the residuals to increase as the battery life increases.
Inequality of variance also shows up occasionally on the plot of residuals versus run order. An outward-opening funnel pattern indicates that variability is increasing over time.
The next two figures plot the residuals versus temperature and material types, respectively.
Both plots indicate mild inequality of variance, with the treatment combination of \(15°F\) (low temperature) and material type A possibly having larger variance than the others.
We can see that the low temperature-material type A cell contains both extreme residuals (\(-60.75\) and \(45.25\)). These two residuals are primarily responsible for the inequality of variance detected in these figures and in the plot of the residuals versus fitted values. Reexamination of the data does not reveal any obvious problem, such as an error in recording, so we accept these responses as legitimate. It is possible that this particular treatment combination produces a slightly more erratic battery life than the others. The problem, however, is not severe enough to have a dramatic impact on the analysis and conclusions.
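The residual plots discussed in this section could be produced along these lines. This is a sketch with assumed names (`battery`, `res`, `fit_val`, `cell_sd`) and base R graphics; the per-cell standard deviation at the end is an added numeric check, not part of the original analysis.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6)),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

fit <- aov(battery_life ~ temperature * material_type, data = battery)
battery$res <- residuals(fit)
battery$fit_val <- fitted(fit)   # equals the cell mean of each observation

# Residuals versus fitted values.
plot(battery$fit_val, battery$res, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Residuals versus each factor.
stripchart(res ~ temperature, data = battery, vertical = TRUE,
           method = "jitter", xlab = "Temperature", ylab = "Residual")
stripchart(res ~ material_type, data = battery, vertical = TRUE,
           method = "jitter", xlab = "Material type", ylab = "Residual")

# Spread of the residuals within each of the nine cells.
cell_sd <- with(battery,
                tapply(res, interaction(temperature, material_type), sd))
print(round(cell_sd, 1))
print(names(which.max(cell_sd)))
```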
Although residual plots are frequently used to diagnose inequality of variance, several statistical tests have also been proposed. These tests may be viewed as formal tests of the hypotheses \[H_0:\sigma_1^2=\sigma_2^2=...=\sigma_a^2\] \[H_1:\text{above not true for at least one } \sigma_i^2\]
A widely used procedure to test the homogeneity of variances is Bartlett’s test. The procedure involves computing a statistic whose sampling distribution is closely approximated by the chi-square distribution.
The results of this test in the example are shown in the next two tables. The first table tests the homogeneity of variances of the residuals for each level of factor temperature and the second table for levels of the factor material type.
By temperature:

Statistic | P value |
---|---|
3.311821 | 0.1909182 |

By material type:

Statistic | P value |
---|---|
3.173694 | 0.2045696 |

Both P-values are larger than the level \(\alpha=0.05\), so in neither case can we reject the null hypothesis of equal variances.
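Bartlett's test is available in base R as `bartlett.test`. The sketch below applies it to the model residuals grouped by each factor in turn; the data frame name `battery` and the object names are assumptions.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6)),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

fit <- aov(battery_life ~ temperature * material_type, data = battery)
battery$res <- residuals(fit)

# H0: the residual variance is the same at every level of the factor.
bt_temp <- bartlett.test(res ~ temperature, data = battery)
bt_mat  <- bartlett.test(res ~ material_type, data = battery)

print(bt_temp)
print(bt_mat)
```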
Because Bartlett’s test is sensitive to the normality assumption, there may be situations where an alternative procedure would be useful. The modified Levene test is a very nice procedure that is robust to departures from normality. To test the hypothesis of equal variances in all treatments, the modified Levene test uses the absolute deviation of the observations \(y_{ij}\) in each treatment from the treatment median.
The results of this test in the example are shown in the next two tables. The first table tests the homogeneity of variances of the residuals for each level of factor temperature and the second table for levels of the factor material type.
By temperature:

Statistic | P value |
---|---|
1.382132 | 0.2651959 |

By material type:

Statistic | P value |
---|---|
1.672862 | 0.2032358 |

Both P-values are larger than the level \(\alpha=0.05\), so in neither case can we reject the null hypothesis (that all three variances are the same).
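The modified Levene test amounts to a one-way ANOVA on the absolute deviations from each group's median, so it can be sketched in base R without extra packages. The function name `levene_median` and the data frame name `battery` are hypothetical; packaged implementations exist (for example `leveneTest` in the `car` package), and the author may have used one of those instead.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6)),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

fit <- aov(battery_life ~ temperature * material_type, data = battery)
battery$res <- residuals(fit)

# Hypothetical helper: one-way ANOVA on |y - median of y's group|.
levene_median <- function(y, g) {
  d <- abs(y - ave(y, g, FUN = median))
  anova(lm(d ~ g))
}

lev_temp <- levene_median(battery$res, battery$temperature)
lev_mat  <- levene_median(battery$res, battery$material_type)

print(lev_temp)
print(lev_mat)
```

The first row of each printed table holds the F statistic and its P-value for the equal-variance hypothesis.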
When the ANOVA indicates that row or column means differ, it is usually of interest to make comparisons between the individual row or column means to discover the specific differences.
We now illustrate the use of Tukey’s test on the battery life data from the example. Note that in this experiment, interaction is significant. When the interaction is significant, comparisons between the means of one factor (e.g., A) may be obscured by the AB interaction. One approach to this situation is to fix factor B at a specific level and apply Tukey’s test to the means of factor A at that level.
To illustrate, suppose that in the example we are interested in detecting differences among the means of the three material types. Because interaction is significant, we make this comparison at just one level of temperature, say level 2 (medium temperature). We assume that the best estimate of the error variance is the \(MS_E\) from the ANOVA table, utilizing the assumption that the experimental error variance is the same over all treatment combinations.
The results of this test in the example are shown in the next table (the rows marked in bold specify the particular test).
Comparison | Difference | Lower CI | Upper CI | P value |
---|---|---|---|---|
low:A-high:A | 77.25 | 15.426816 | 139.073184 | 0.0067471 |
medium:A-high:A | -0.25 | -62.073184 | 61.573184 | 1.0000000 |
high:B-high:A | -8.00 | -69.823184 | 53.823184 | 0.9999508 |
low:B-high:A | 98.25 | 36.426816 | 160.073184 | 0.0003574 |
medium:B-high:A | 62.25 | 0.426816 | 124.073184 | 0.0474675 |
high:C-high:A | 28.00 | -33.823184 | 89.823184 | 0.8347331 |
low:C-high:A | 86.50 | 24.676816 | 148.323184 | 0.0018765 |
medium:C-high:A | 88.25 | 26.426816 | 150.073184 | 0.0014679 |
medium:A-low:A | -77.50 | -139.323184 | -15.676816 | 0.0065212 |
high:B-low:A | -85.25 | -147.073184 | -23.426816 | 0.0022351 |
low:B-low:A | 21.00 | -40.823184 | 82.823184 | 0.9616404 |
medium:B-low:A | -15.00 | -76.823184 | 46.823184 | 0.9953182 |
high:C-low:A | -49.25 | -111.073184 | 12.573184 | 0.2016535 |
low:C-low:A | 9.25 | -52.573184 | 71.073184 | 0.9998527 |
medium:C-low:A | 11.00 | -50.823184 | 72.823184 | 0.9994703 |
high:B-medium:A | -7.75 | -69.573184 | 54.073184 | 0.9999614 |
low:B-medium:A | 98.50 | 36.676816 | 160.323184 | 0.0003449 |
**medium:B-medium:A** | **62.50** | **0.676816** | **124.323184** | **0.0460388** |
high:C-medium:A | 28.25 | -33.573184 | 90.073184 | 0.8281938 |
low:C-medium:A | 86.75 | 24.926816 | 148.573184 | 0.0018119 |
**medium:C-medium:A** | **88.50** | **26.676816** | **150.323184** | **0.0014173** |
low:B-high:B | 106.25 | 44.426816 | 168.073184 | 0.0001152 |
medium:B-high:B | 70.25 | 8.426816 | 132.073184 | 0.0172076 |
high:C-high:B | 36.00 | -25.823184 | 97.823184 | 0.5819453 |
low:C-high:B | 94.50 | 32.676816 | 156.323184 | 0.0006078 |
medium:C-high:B | 96.25 | 34.426816 | 158.073184 | 0.0004744 |
medium:B-low:B | -36.00 | -97.823184 | 25.823184 | 0.5819453 |
high:C-low:B | -70.25 | -132.073184 | -8.426816 | 0.0172076 |
low:C-low:B | -11.75 | -73.573184 | 50.073184 | 0.9991463 |
medium:C-low:B | -10.00 | -71.823184 | 51.823184 | 0.9997369 |
high:C-medium:B | -34.25 | -96.073184 | 27.573184 | 0.6420441 |
low:C-medium:B | 24.25 | -37.573184 | 86.073184 | 0.9165175 |
**medium:C-medium:B** | **26.00** | **-35.823184** | **87.823184** | **0.8822881** |
low:C-high:C | 58.50 | -3.323184 | 120.323184 | 0.0742711 |
medium:C-high:C | 60.25 | -1.573184 | 122.073184 | 0.0604247 |
medium:C-low:C | 1.75 | -60.073184 | 63.573184 | 1.0000000 |
This analysis indicates that at the medium temperature level, the mean battery life does not differ significantly between material types B and C, whereas the mean battery life for material type A is significantly lower than for both types B and C (see the graph of the average responses at each treatment combination).
As the interaction is significant, we could compare all \(ab=9\) cell means to determine which ones differ significantly. In this analysis, differences between cell means include interaction effects as well as both main effects. In the example, this would give 36 comparisons between all possible pairs of the nine cell means (all these comparisons can be seen in the previous table).
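All pairwise comparisons of the nine cell means can be obtained with base R's `TukeyHSD` applied to the interaction term. The sketch below assumes the data frame name `battery`; the object names `cells` and `medium_rows` are illustrative. It also extracts the three comparisons among material types at the medium temperature.

```r
# Battery life data transcribed from the table above (36 runs).
battery <- data.frame(
  battery_life = c(130, 155, 34, 40, 20, 70, 74, 180, 80, 75, 82, 58,
                   150, 188, 136, 122, 25, 70, 159, 126, 106, 115, 58, 45,
                   138, 110, 174, 120, 96, 104, 168, 160, 150, 139, 82, 60),
  temperature = factor(rep(rep(c("low", "medium", "high"), each = 2), times = 6)),
  material_type = factor(rep(c("A", "B", "C"), each = 12))
)

fit <- aov(battery_life ~ temperature * material_type, data = battery)

# Tukey comparisons of all cell means, using the MSE from the full model.
tukey <- TukeyHSD(fit, which = "temperature:material_type", conf.level = 0.95)
cells <- tukey[["temperature:material_type"]]   # columns: diff, lwr, upr, p adj

# Material-type comparisons at the medium temperature only.
medium_rows <- c("medium:B-medium:A", "medium:C-medium:A", "medium:C-medium:B")
print(cells[medium_rows, ])
```

Note that these P-values are adjusted for the whole family of 36 cell comparisons, which is more conservative than adjusting only for the three comparisons at one temperature level.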