KITADA
Lesson #20
Motivation:
The idea behind the Analysis of Variance table is that the variation in the response variable (referred to as the total variation) is partitioned into two parts: a part that tells us how much of the variation is explained by the regression model (labeled with the explanatory variable's name in the R output) and a part that tells us how much of the variation is left unexplained by the regression model (called the “Residual” or “Error”). These different sources of variation are listed in the Analysis of Variance (ANOVA) table. In this lesson, we'll learn about these sources of variation and where the numbers in the table come from.
What you need to know from this lesson:
After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
A. The different sources of variation:
We will use the following scatter plot to help with the understanding of the different sources of variation in the Analysis of Variance table:
(SEE HANDOUT ILLUSTRATION)
1. Total variation:
2. Partitioning the total variation:
In the above example, there is a relationship between X and Y which implies that the least-squares regression line has a slope \( \neq \) 0. The least-squares regression equation is \( \hat{y}=0.62+1.22x \) and the least-squares regression line is sketched on the plot below. (Note that the horizontal line at \( \bar{y} \) is also drawn in).
(SEE HANDOUT ILLUSTRATION)
For the point with the open square, notice how the regression line “goes right through” the vertical line representing that point’s total variation. That is, the regression line partitions (or “divides”) the total variation into two parts:
1) the part that’s explained by the regression model.
2) the part that is left unexplained by the regression model.
3. Summarizing, total variation = variation explained by the regression model + error. That is,
\( (y_i-\bar{y})=( \hat{y}_i-\bar{y})+(y_i-\hat{y}_i) \)
This is illustrated in the plot below: (Note: this partitioning of the total variation can be done for each observation, but is only illustrated for the one point in the graph.)
(SEE HANDOUT ILLUSTRATION)
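As a quick numerical check of this identity, here is a minimal R sketch using made-up data (the x_chk and y_chk values are hypothetical; the identity holds for any least-squares fit):
### CHECK THE PARTITION (made-up data) ###
x_chk<-c(1, 2, 3, 5)
y_chk<-c(2, 4, 5, 9)
fit_chk<-lm(y_chk~x_chk)
y_hat_chk<-unname(fitted(fit_chk))   # predicted values from the least-squares fit
# total variation = explained + unexplained, for every point
all.equal(y_chk-mean(y_chk), (y_hat_chk-mean(y_chk))+(y_chk-y_hat_chk))
## [1] TRUE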
B. Sum of Squares
We want to summarize the total variation for all data points with one number. A natural way to do this would be to sum each point’s “total variation”. However, that sum is always zero. (Points below the mean line have negative values for their total variation and points above the mean line have positive values.) To avoid this problem, we square each of the differences and then sum the squared differences. The result is called a sum of squares, and one is computed for each source of variation:
That is:
\( SST = \sum_{i=1}^n (y_i-\bar{y})^2 \)
This is the sum of squares of the total variation
That is:
\( SSM = \sum_{i=1}^n (\hat{y}_i-\bar{y})^2 \)
This is the sum of squares of the variation explained by the regression model
That is:
\( SSE = \sum_{i=1}^n (y_i-\hat{y}_i)^2 \)
This is the sum of squares of the variation left unexplained by the regression model
C. Example of calculating sums of squares
The following data are made up to illustrate how to calculate the sums of squares for each source of variation.
### LESSON 20 ###
x<-c(1, 2, 3, 4)
y<-c(3, 5, 6, 6)
### LINEAR MODEL ###
mod<-lm(y~x)
summary(mod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## 1 2 3 4
## -0.5 0.5 0.5 -0.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5000 0.8660 2.887 0.1020
## x 1.0000 0.3162 3.162 0.0871 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7071 on 2 degrees of freedom
## Multiple R-squared: 0.8333, Adjusted R-squared: 0.75
## F-statistic: 10 on 1 and 2 DF, p-value: 0.08713
To calculate the sums of squares, we need the following:
• \( y_i \) – the observed y-value for each point
• \( \bar{y} \) – the average of the y-values
• \( \hat{y}_i \) – the predicted y-value for each x-value
### Y BAR ###
mean(y)
## [1] 5
### Y HATS ###
y_hat<-(2.5+x)
y_hat
## [1] 3.5 4.5 5.5 6.5
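As a check (a small sketch using base R), the fitted values stored in the model object should match these hand-calculated values:
### Y HATS FROM THE MODEL (check) ###
unname(fitted(mod))
## [1] 3.5 4.5 5.5 6.5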
Place these in the appropriate column in the table below.
### SST ###
SST<-y-mean(y)
SST
## [1] -2 0 1 1
SST^2
## [1] 4 0 1 1
sum(SST^2)
## [1] 6
### SSM ###
SSM<-y_hat-mean(y)
SSM
## [1] -1.5 -0.5 0.5 1.5
SSM^2
## [1] 2.25 0.25 0.25 2.25
sum(SSM^2)
## [1] 5
### SSE ###
SSE<-y-y_hat
SSE
## [1] -0.5 0.5 0.5 -0.5
SSE^2
## [1] 0.25 0.25 0.25 0.25
sum(SSE^2)
## [1] 1
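Note that SST, SSM, and SSE above are vectors of differences; the sums of squares themselves are sum(SST^2), sum(SSM^2), and sum(SSE^2). Using these quantities, we can verify that the partition from part A holds both point-by-point and for the sums of squares (a quick check):
### CHECK: SST = SSM + SSE ###
SSM+SSE   # matches SST point-by-point
## [1] -2 0 1 1
sum(SSM^2)+sum(SSE^2)   # matches sum(SST^2): 5 + 1 = 6
## [1] 6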
D. The Analysis of Variance (ANOVA) table:
In general, here is the Analysis of Variance table:
(SEE TABLE IN HANDOUT)
### ANOVA ###
anova(mod)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 5 5.0 10 0.08713 .
## Residuals 2 1 0.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
E. Relationships in the Analysis of Variance table
1. The relationship between SST, SSM, and SSE is: SST = SSM + SSE
2. The relationship between the degrees of freedom is: DFT = DFM + DFE. (For simple linear regression, DFT = n − 1, DFM = 1, and DFE = n − 2.)
3. Recall that R2 tells us what proportion of the total variation in the values of the response variable (SST) is explained by the regression model (SSM).
Therefore, \( R^2=\frac{SSM}{SST} \).
Calculate \( R^2 \) from the example in part C.
### R SQUARED ###
sum(SSM^2)/sum(SST^2)
## [1] 0.8333333
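This matches the Multiple R-squared reported by summary(mod); as a check, it can also be pulled directly from the summary object:
### R SQUARED FROM THE SUMMARY (check) ###
summary(mod)$r.squared
## [1] 0.8333333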
4. Notice the similarity between SST and the sample variance \( s^2 \): SST is just the numerator of the variance of the y-values.
5. So, the variance of the values of the response variable = SST / DFT. Therefore, how can we use the values in the Analysis of Variance table to calculate the variance of the residuals of the sample data?
The variance of the residuals is the mean square error, MSE = SSE / DFE, which appears as the “Mean Sq” value in the Residuals row of the ANOVA table.
Therefore, what is the estimate of \( \sigma \) , the standard deviation of the residuals in the population?
### ESTIMATE SIGMA ###
sqrt(sum(SSE^2)/2)   # sigma-hat = sqrt(SSE/DFE) = sqrt(MSE)
## [1] 0.7071068
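This agrees with the “Residual standard error: 0.7071 on 2 degrees of freedom” line in the summary(mod) output; as another check, summary(mod)$sigma returns the same estimate:
### SIGMA FROM THE SUMMARY (check) ###
summary(mod)$sigma
## [1] 0.7071068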
F. The F-test and the F-statistic
1. The hypotheses for the F-test in simple linear regression
The F-test is another way to assess the evidence that the explanatory variable is a useful predictor of the response variable.
Therefore, what are the null and alternative hypotheses for the F-test in simple linear regression?
\( H_0: \beta_1=0 \)
\( H_A: \beta_1 \neq 0 \)
where \( \beta_1 \) is the population slope.
2. The F-statistic
a. For which scatterplot below would you feel more comfortable using the least-squares regression line to get an accurate prediction (i.e. one close to what will actually be observed)? WHY?
(SEE HANDOUT FOR PLOT)
Left!
b. If an explanatory variable helps explain the response variable, what would you expect the regression model source of variation to be in relation to the error? Why?
We would expect the regression model’s part of the variation to be large relative to the error, because a useful explanatory variable produces a line that captures most of the variation in the data, leaving only small residuals.
Putting these last two questions together: the smaller the errors (or “residuals”) are relative to the regression-model part of the variation, the more comfortable we would feel saying the explanatory variable helps explain (or predict) the response variable (i.e. is a “useful” predictor of the response variable). The F-statistic compares the part of the total variation explained by the regression model to the part left unexplained.
c. The F-statistic is the test-statistic calculated when an F-test is performed. The F-statistic compares the ratio of the mean squares of the regression model to the mean squares of the error:
\( F=\frac{MSM}{MSE} \)
Which scatterplot below would have a higher F-statistic?
(SEE HANDOUT FOR PLOTS)
Therefore, would higher or lower F-statistics lead to more evidence to reject the null hypothesis?
In summary, the higher the mean squares explained by the regression model compared to the part left unexplained (mean square error), the more evidence that the explanatory variable DOES indeed help explain the response. This ratio is the F-statistic which has an F-distribution.
3. Degrees of freedom for the F-statistic:
Just how high the F-statistic must be to feel comfortable rejecting the null hypothesis depends on the degrees of freedom. An F-statistic has two types of degrees of freedom: the numerator (model) degrees of freedom and the denominator (error) degrees of freedom.
a. In the example in part C, calculate the F-statistic and give the degrees of freedom for the F-statistic.
### ANOVA ###
anova(mod)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 5 5.0 10 0.08713 .
## Residuals 2 1 0.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F-value = 10, numerator DF = 1, denominator DF = 2
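As a hand check using the sums of squares from part C (with MSM = SSM/DFM and MSE = SSE/DFE):
### F BY HAND ###
MSM<-sum(SSM^2)/1   # mean square for the regression model
MSE<-sum(SSE^2)/2   # mean square error
MSM/MSE
## [1] 10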
b. Determine the p-value.
P-value = 0.08713
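This is the area to the right of the observed F-statistic under an F distribution with 1 and 2 degrees of freedom; as a check, pf() gives the same value:
### P-VALUE FROM THE F DISTRIBUTION (check) ###
pf(10, df1=1, df2=2, lower.tail=FALSE)
## [1] 0.08712907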
4. State a conclusion based on this p-value.
With a p-value of 0.08713, there is only weak evidence that x has an effect on y. At the 0.05 significance level, we fail to reject the null hypothesis.
5. We could use the t-test to test the null hypothesis above (in simple linear regression – this will not be true in multiple linear regression).
For simple linear regression, we can also use a t-test to test whether the slope differs from zero.
What is the relationship between the t-statistic and the F-statistic in simple linear regression?
### T-TEST FOR SLOPE ###
# test statistic: 3.162
# p-value : 0.0871
### F-TEST ###
# test statistic: 10
# p-value : 0.08713
The F-statistic is the square of the t-statistic: \( 3.162^2 \approx 10 \).
Will the p-value be the same using either test?
The p-values are the same.
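As a final check of the \( F = t^2 \) relationship, we can pull the slope’s t-statistic from the model summary and square it:
### T SQUARED EQUALS F (check) ###
t_stat<-summary(mod)$coefficients["x", "t value"]
t_stat^2
## [1] 10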