KITADA
Lesson #20
Motivation:
The idea behind the Analysis of Variance table is that the variation in the response variable (referred to as the total variation) is partitioned into two parts: a part that tells us how much of the variation is explained by the regression model (labeled with the explanatory variable's name in the R output) and a part that tells us how much of the variation is left unexplained by the regression model (called the “Residual” or “Error”). These different sources of variation are listed in the Analysis of Variance (ANOVA) table. In this lesson, we'll learn about these sources of variation and where the numbers in the table come from.
What you need to know from this lesson:
After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
A. The different sources of variation:
We will use the following scatter plot to help with the understanding of the different sources of variation in the Analysis of Variance table:
(SEE HANDOUT ILLUSTRATION)
1. Total variation:
2. Partitioning the total variation:
In the above example, there is a relationship between X and Y which implies that the least-squares regression line has a slope \( \neq \) 0. The least-squares regression equation is \( \hat{y}=0.62+1.22x \) and the least-squares regression line is sketched on the plot below. (Note that the horizontal line at \( \bar{y} \) is also drawn in).
(SEE HANDOUT ILLUSTRATION)
For the point with the open square, notice how the regression line “goes right through” the vertical line representing that point’s total variation. That is, the regression line partitions (or “divides”) the total variation into two parts:
1) the part that’s explained by the regression model.
2) the part that is left unexplained by the regression model.
3. Summarizing, total variation = variation explained by the regression model + error. That is,
\( (y_i-\bar{y})=( \hat{y}_i-\bar{y})+(y_i-\hat{y}_i) \)
This is illustrated in the plot below: (Note: this partitioning of the total variation can be done for each observation, but is only illustrated for the one point in the graph.)
(SEE HANDOUT ILLUSTRATION)
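As a quick numerical check of this identity, here is a minimal R sketch using made-up data (the x_chk and y_chk values are hypothetical; the identity holds for any least-squares fit):
### CHECK THE PARTITION (made-up data) ###
x_chk<-c(1, 2, 3, 5)
y_chk<-c(2, 4, 5, 9)
fit_chk<-lm(y_chk~x_chk)
y_hat_chk<-unname(fitted(fit_chk))   # predicted values from the least-squares fit
# total variation = explained + unexplained, for every point
all.equal(y_chk-mean(y_chk), (y_hat_chk-mean(y_chk))+(y_chk-y_hat_chk))
## [1] TRUE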
B. Sum of Squares
We want to summarize the total variation for all data points with one number. A natural way to do this would be to sum each point’s “total variation”. However, that sum is always zero. (Points below the mean line have negative values for their total variation and points above the mean line have positive values.) To avoid this problem, we square each of the differences and then sum the squared differences. The result is called a sum of squares, and one is computed for each source of variation:
That is:
\( SST = \sum_{i=1}^n (y_i-\bar{y})^2 \)
This is the sum of squares of the total variation
That is:
\( SSM = \sum_{i=1}^n (\hat{y}_i-\bar{y})^2 \)
This is the sum of squares of the variation explained by the regression model
That is:
\( SSE = \sum_{i=1}^n (y_i-\hat{y}_i)^2 \)
This is the sum of squares of the variation left unexplained by the regression model
C. Example of calculating sums of squares
The following data are made up to illustrate how to calculate the sums of squares for each source of variation.
### LESSON 20 ###
x<-c(1, 2, 3, 4)
y<-c(3, 5, 6, 6)
### LINEAR MODEL ###
mod<-lm(y~x)
summary(mod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4
## -0.5 0.5 0.5 -0.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5000 0.8660 2.887 0.1020
## x 1.0000 0.3162 3.162 0.0871 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7071 on 2 degrees of freedom
## Multiple R-squared: 0.8333, Adjusted R-squared: 0.75
## F-statistic: 10 on 1 and 2 DF, p-value: 0.08713
To calculate the sums of squares, we need the following:
• \( y_i \) – the observed y-value for each point
• \( \bar{y} \) – the average of the y-values
• \( \hat{y}_i \) – the predicted y-value for each x-value
### Y BAR ###
mean(y)
## [1] 5
### Y HATS ###
y_hat<-(2.5+x)
y_hat
## [1] 3.5 4.5 5.5 6.5
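As a check (a small sketch using base R), the fitted values stored in the model object should match these hand-calculated values:
### Y HATS FROM THE MODEL (check) ###
unname(fitted(mod))
## [1] 3.5 4.5 5.5 6.5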
Place these in the appropriate column in the table below.
### SST ###
SST<-y-mean(y)
SST
## [1] -2 0 1 1
SST^2
## [1] 4 0 1 1
sum(SST^2)
## [1] 6
### SSM ###
SSM<-y_hat-mean(y)
SSM
## [1] -1.5 -0.5 0.5 1.5
SSM^2
## [1] 2.25 0.25 0.25 2.25
sum(SSM^2)
## [1] 5
### SSE ###
SSE<-y-y_hat
SSE
## [1] -0.5 0.5 0.5 -0.5
SSE^2
## [1] 0.25 0.25 0.25 0.25
sum(SSE^2)
## [1] 1
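Note that SST, SSM, and SSE above are vectors of differences; the sums of squares themselves are sum(SST^2), sum(SSM^2), and sum(SSE^2). Using these quantities, we can verify that the partition from part A holds both point-by-point and for the sums of squares (a quick check):
### CHECK: SST = SSM + SSE ###
SSM+SSE   # matches SST point-by-point
## [1] -2 0 1 1
sum(SSM^2)+sum(SSE^2)   # matches sum(SST^2): 5 + 1 = 6
## [1] 6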
D. The Analysis of Variance (ANOVA) table:
In general, here is the Analysis of Variance table:
(SEE TABLE IN HANDOUT)
### ANOVA ###
anova(mod)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 5 5.0 10 0.08713 .
## Residuals 2 1 0.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
E. Relationships in the Analysis of Variance table
1. The relationship between SST, SSM, and SSE is: SST = SSM + SSE
2. The relationship between the degrees of freedom is: DFT = DFM + DFE. (For simple linear regression, DFT = n − 1, DFM = 1, and DFE = n − 2.)
3. Recall that R2 tells us what proportion of the total variation in the values of the response variable (SST) is explained by the regression model (SSM).
Therefore, \( R^2=\frac{SSM}{SST} \).
Calculate \( R^2 \) from the example in part C.
### R SQUARED ###
sum(SSM^2)/sum(SST^2)
## [1] 0.8333333
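This matches the Multiple R-squared reported by summary(mod); as a check, it can also be pulled directly from the summary object:
### R SQUARED FROM THE SUMMARY (check) ###
summary(mod)$r.squared
## [1] 0.8333333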
4. Notice the similarity between SST and the sample variance \( s^2 \): SST is just the numerator of the variance of the y-values.
5. So, the variance of the values of the response variable = SST / DFT. Therefore, how can we use the values in the Analysis of Variance table to calculate the variance of the residuals of the sample data?
The variance of the residuals is the mean square error, MSE = SSE / DFE, which appears as the “Mean Sq” value in the Residuals row of the ANOVA table.
Therefore, what is the estimate of \( \sigma \) , the standard deviation of the residuals in the population?
### ESTIMATE SIGMA ###
sqrt(sum(SSE^2)/2)   # sigma-hat = sqrt(SSE/DFE) = sqrt(MSE)
## [1] 0.7071068
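This agrees with the “Residual standard error: 0.7071 on 2 degrees of freedom” line in the summary(mod) output; as another check, summary(mod)$sigma returns the same estimate:
### SIGMA FROM THE SUMMARY (check) ###
summary(mod)$sigma
## [1] 0.7071068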
F. The F-test and the F-statistic
1. The hypotheses for the F-test in simple linear regression
The F-test is another way to assess the evidence that the explanatory variable is a useful predictor of the response variable.
Therefore, what are the null and alternative hypotheses for the F-test in simple linear regression?
\( H_0: \beta_1=0 \)
\( H_A: \beta_1 \neq 0 \)
where \( \beta_1 \) is the population slope.
2. The F-statistic
a. For which scatterplot below would you feel more comfortable using the least-squares regression line to get an accurate prediction (i.e. one close to what will actually be observed)? WHY?
(SEE HANDOUT FOR PLOT)
Left!
b. If an explanatory variable helps explain the response variable, what would you expect the regression model source of variation to be in relation to the error? Why?
We would expect the regression model’s part of the variation to be large relative to the error, because a useful explanatory variable produces a line that captures most of the variation in the data, leaving only small residuals.
Putting these last two questions together: the smaller the errors (or “residuals”) are relative to the regression-model part of the variation, the more comfortable we would feel saying the explanatory variable helps explain (or predict) the response variable (i.e. is a “useful” predictor of the response variable). The F-statistic compares the part of the total variation explained by the regression model to the part left unexplained.
c. The F-statistic is the test-statistic calculated when an F-test is performed. The F-statistic compares the ratio of the mean squares of the regression model to the mean squares of the error:
\( F=\frac{MSM}{MSE} \)
Which scatterplot below would have a higher F-statistic?
(SEE HANDOUT FOR PLOTS)
Therefore, would higher or lower F-statistics lead to more evidence to reject the null hypothesis?
In summary, the higher the mean squares explained by the regression model compared to the part left unexplained (mean square error), the more evidence that the explanatory variable DOES indeed help explain the response. This ratio is the F-statistic which has an F-distribution.
3. Degrees of freedom for the F-statistic:
Just how high the F-statistic must be to feel comfortable rejecting the null hypothesis depends on the degrees of freedom. An F-statistic has two types of degrees of freedom: the numerator (model) degrees of freedom and the denominator (error) degrees of freedom.
a. In the example in part C, calculate the F-statistic and give the degrees of freedom for the F-statistic.
### ANOVA ###
anova(mod)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 5 5.0 10 0.08713 .
## Residuals 2 1 0.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F-value = 10, numerator DF = 1, denominator DF = 2
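As a hand check using the sums of squares from part C (with MSM = SSM/DFM and MSE = SSE/DFE):
### F BY HAND ###
MSM<-sum(SSM^2)/1   # mean square for the regression model
MSE<-sum(SSE^2)/2   # mean square error
MSM/MSE
## [1] 10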
b. Determine the p-value.
P-value = 0.08713
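This is the area to the right of the observed F-statistic under an F distribution with 1 and 2 degrees of freedom; as a check, pf() gives the same value:
### P-VALUE FROM THE F DISTRIBUTION (check) ###
pf(10, df1=1, df2=2, lower.tail=FALSE)
## [1] 0.08712907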
4. State a conclusion based on this p-value.
With a p-value of 0.08713, there is only weak evidence that x has an effect on y. At the 0.05 significance level, we fail to reject the null hypothesis.
5. We could use the t-test to test the null hypothesis above (in simple linear regression – this will not be true in multiple linear regression).
For simple linear regression, we can also use a t-test to test whether the slope differs from zero.
What is the relationship between the t-statistic and the F-statistic in simple linear regression?
### T-TEST FOR SLOPE ###
# test statistic: 3.162
# p-value : 0.0871
### F-TEST ###
# test statistic: 10
# p-value : 0.08713
The F-statistic is the square of the t-statistic: \( 3.162^2 \approx 10 \).
Will the p-value be the same using either test?
The p-values are the same.
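As a final check of the \( F = t^2 \) relationship, we can pull the slope’s t-statistic from the model summary and square it:
### T SQUARED EQUALS F (check) ###
t_stat<-summary(mod)$coefficients["x", "t value"]
t_stat^2
## [1] 10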