Variance Reduction
Assume that we have two variables, $X$ and $Y$, that are related according to the hypothetical model $Y = \beta_0 + \beta_1 X + \varepsilon$. We collect a sample consisting of $n$ paired observations of the form $(x_i, y_i)$. We then use the sample to create a fitted model of the form $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$.
Assume we wish to make a prediction about the value of $Y$ for a new observation. Without considering the effect that $X$ has on $Y$, our best point estimate for the new value of $Y$ would be $\bar{y}$. To take into account the uncertainty that we know exists in our prediction, we could create a 95% prediction interval around our point prediction, $\bar{y}$. Such an interval is shown in the plot on the left below. We will discuss how prediction intervals are formed in a later lesson.
If, however, we know the value of $X$ in the new observation, then a better estimate for the value of $Y$ would be given by $E[Y \mid X = x] = \beta_0 + \beta_1 x$, which can be approximated using our fitted model. This gives us a point estimate of $\hat{y} = \hat\beta_0 + \hat\beta_1 x$. Again, we know that there is some uncertainty in our prediction, so we might create a 95% prediction interval around the point prediction $\hat{y}$. Such a prediction interval is shown in the plot on the right below.
Notice that the prediction interval in the plot on the right is considerably shorter than the one on the left. By taking into account the effect that X has on Y, we are able to reduce the uncertainty, or variance, in our prediction. This allows us to make more precise predictions.
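As a rough numerical illustration of this idea, the sketch below compares how far the observations spread around the sample mean with how far they spread around a fitted line. The data values and the helper name `fit_line` are invented for illustration, and the two standard deviations are only stand-ins for the interval widths, not actual prediction intervals.

```python
import math

# Hypothetical sample, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear model."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

n = len(y)
y_bar = sum(y) / n
b0, b1 = fit_line(x, y)

# Spread of y around the sample mean (ignores x).
sd_about_mean = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Spread of y around the fitted line (uses x). The divisor is n - 2
# because two parameters were estimated from the sample.
sd_about_line = math.sqrt(
    sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
)

print(f"spread about the mean: {sd_about_mean:.3f}")
print(f"spread about the line: {sd_about_line:.3f}")
```

For this made-up sample the spread about the line is much smaller than the spread about the mean, which is the effect the two plots describe.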
In this lesson, we will discuss a method of measuring the amount of variance reduction we obtain in using a regression model to explain a portion of the variation in Y through its relationship with X.

SST, SSM, and SSE
We will use a quantity denoted by $r^2$ to measure the proportion of the variance in our response variable $Y$ that is explained by the relationship between $Y$ and a predictor $X$. Before we define $r^2$, we first need to introduce some related quantities.
For a particular observation $y_i$, notice that the quantity $t_i = y_i - \bar{y}$ measures the amount by which the observation deviates from the sample mean. We will decompose this quantity into two pieces.
Let $m_i = \hat{y}_i - \bar{y}$. The quantity $m_i$ is the portion of the deviation $t_i$ that is explained by the regression model. You would expect the $y$ value to vary from $\bar{y}$ by this amount as a result of the effect of $x$.
Let $\hat\varepsilon_i = y_i - \hat{y}_i$. This is the residual associated with $y_i$. It can be thought of as the portion of the overall deviation $t_i$ that is left unexplained by our model.
Notice that $t_i = m_i + \hat\varepsilon_i$.
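The decomposition can be checked observation by observation. The following is a minimal sketch on an invented sample; the data and the helper name `fit_line` are assumptions made for illustration.

```python
# Hypothetical sample, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear model."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

b0, b1 = fit_line(x, y)
y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

t = [yi - y_bar for yi in y]                   # total deviations
m = [yh - y_bar for yh in y_hat]               # parts explained by the model
resid = [yi - yh for yi, yh in zip(y, y_hat)]  # residuals (unexplained parts)

# Each row shows the total deviation next to its two pieces.
for ti, mi, ei in zip(t, m, resid):
    print(f"t = {ti:7.3f}   m = {mi:7.3f}   residual = {ei:7.3f}")
```

In every row, the explained part and the residual add up to the total deviation, up to floating-point rounding.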

We now sum the squares of each of these three quantities over all of the points in our sample.
Let $SST = \sum t_i^2 = \sum (y_i - \bar{y})^2$
Let $SSM = \sum m_i^2 = \sum (\hat{y}_i - \bar{y})^2$
Let $SSE = \sum \hat\varepsilon_i^2 = \sum (y_i - \hat{y}_i)^2$
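The three sums can be computed directly from their definitions. The sketch below does this for an invented sample; the function name `sums_of_squares` and the data are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical sample, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sums_of_squares(x, y):
    """Return (SST, SSM, SSE) for a least-squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)                # total
    ssm = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by model
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # left unexplained
    return sst, ssm, sse

sst, ssm, sse = sums_of_squares(x, y)
print(f"SST = {sst:.3f}, SSM = {ssm:.3f}, SSE = {sse:.3f}")
```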
SST = SSE + SSM
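An important relationship between these three sums is given by the equation $SST = SSE + SSM$.
To establish this result, we will need to make use of the two identities shown below. These identities follow from the normal equations, which were derived in the notebook titled 3.1.a - Derivation of PArameter Estimates.Rmd.
$$\sum \hat\varepsilon_i = 0$$
$$\sum \hat\varepsilon_i \hat{y}_i = 0$$
The first identity is one of our two normal equations. The second identity can be derived from the two normal equations, since $\sum \hat\varepsilon_i \hat{y}_i = \hat\beta_0 \sum \hat\varepsilon_i + \hat\beta_1 \sum \hat\varepsilon_i x_i$ and the other normal equation states that $\sum \hat\varepsilon_i x_i = 0$. Armed with these identities, we may proceed with our proof that $SST = SSE + SSM$ as follows: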
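$$
\begin{aligned}
SST &= \sum (y_i - \bar{y})^2 \\
    &= \sum \left[ (\hat\varepsilon_i + \hat{y}_i) - \bar{y} \right]^2 \\
    &= \sum \left[ \hat\varepsilon_i + (\hat{y}_i - \bar{y}) \right]^2 \\
    &= \sum \left[ \hat\varepsilon_i^2 + 2 \hat\varepsilon_i (\hat{y}_i - \bar{y}) + (\hat{y}_i - \bar{y})^2 \right] \\
    &= \sum \hat\varepsilon_i^2 + 2 \sum \hat\varepsilon_i (\hat{y}_i - \bar{y}) + \sum (\hat{y}_i - \bar{y})^2 \\
    &= SSE + 2 \sum (\hat\varepsilon_i \hat{y}_i - \hat\varepsilon_i \bar{y}) + SSM \\
    &= SSE + 2 \sum \hat\varepsilon_i \hat{y}_i - 2 \bar{y} \sum \hat\varepsilon_i + SSM \\
    &= SSE + 0 - 0 + SSM \\
    &= SSE + SSM
\end{aligned}
$$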
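The proof can be sanity-checked numerically. The sketch below, on an invented sample, confirms that the residuals sum to zero and that the sum of residuals times fitted values is zero (the two facts behind the `0 - 0` step), and then compares SST with SSE + SSM. The data and the helper name `fit_line` are assumptions made for illustration.

```python
# Hypothetical sample, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear model."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

b0, b1 = fit_line(x, y)
y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, y_hat)]

# The two identities used in the proof; both should be zero up to rounding.
sum_resid = sum(resid)
sum_resid_fitted = sum(e * yh for e, yh in zip(resid, y_hat))

sst = sum((yi - y_bar) ** 2 for yi in y)
ssm = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum(e ** 2 for e in resid)

print(f"sum of residuals:          {sum_resid:.2e}")
print(f"sum of residual * fitted:  {sum_resid_fitted:.2e}")
print(f"SST = {sst:.6f}, SSE + SSM = {sse + ssm:.6f}")
```

A numerical check like this is not a substitute for the proof, but it is a quick way to catch an algebra or coding mistake.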
r-Squared
Intuitive explanations of the meaning of the variables SST, SSM, and SSE are as follows:
SST is a measure of the amount of variation in the response variable Y.
SSM is a measure of the variation in Y that is explained by the regression model.
SSE is a measure of the variation in Y that is left unexplained by our model.
Ideally, we would like for SSE to be close to 0 and for SSM to thus be close to SST. We can measure the proportion of the variance in Y that is explained by our regression model using the following quantity:
$$r^2 = \frac{SSM}{SST}$$
Note that since $0 \le SSM \le SST$, we get that $0 \le r^2 \le 1$. The quantity $r^2$ is a diagnostic tool that is commonly used to measure the quality of the fit in a regression model.
Notice that since $SST = SSM + SSE$, we can rewrite the formula for $r^2$ as follows:
$$r^2 = \frac{SSM}{SST} = 1 - \frac{SSE}{SST}$$
The $r^2 = 1 - \frac{SSE}{SST}$ formula for $r^2$ is often the more useful of the two formulas, since we will have other reasons for calculating SSE.
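Both forms of the formula, SSM / SST and 1 - SSE / SST, can be computed side by side. The sketch below uses an invented sample; the function name `r_squared_both_ways` is hypothetical.

```python
# Hypothetical sample, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def r_squared_both_ways(x, y):
    """Return r^2 computed as SSM/SST and as 1 - SSE/SST."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssm = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return ssm / sst, 1 - sse / sst

r2_from_ssm, r2_from_sse = r_squared_both_ways(x, y)
print(f"SSM / SST     = {r2_from_ssm:.6f}")
print(f"1 - SSE / SST = {r2_from_sse:.6f}")
```

The two values agree up to floating-point rounding, as the identity SST = SSM + SSE requires.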
r-Squared and Correlation
The value $r^2$ is related to the sample correlation $\rho_{X,Y} = \mathrm{corr}[X, Y]$. In fact, it can be shown that:
$$r^2 = \left( \mathrm{corr}[X, Y] \right)^2$$
To establish this result, we first need to derive an alternate form for the expression SSM. Notice that:
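$$
\begin{aligned}
SSM &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
    &= \sum_{i=1}^{n} \left[ (\hat\beta_0 + \hat\beta_1 x_i) - \bar{y} \right]^2 \\
    &= \sum_{i=1}^{n} \left[ (\bar{y} - \hat\beta_1 \bar{x}) + \hat\beta_1 x_i - \bar{y} \right]^2 \\
    &= \sum_{i=1}^{n} \left( -\hat\beta_1 \bar{x} + \hat\beta_1 x_i \right)^2 \\
    &= \hat\beta_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
    &= \hat\beta_1^2 S_{XX} \\
    &= \left( \frac{S_{XY}}{S_{XX}} \right)^2 S_{XX} \\
    &= \frac{(S_{XY})^2}{S_{XX}}
\end{aligned}
$$
Here $S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $S_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$, and we have used the parameter estimates $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ and $\hat\beta_1 = S_{XY} / S_{XX}$.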
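The alternate form of SSM can be compared against the definition on a small example. This is a minimal sketch with invented data; the variable names are chosen here for illustration.

```python
# Hypothetical sample, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares estimates, as used in the derivation.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# SSM from its definition: squared deviations of fitted values from y_bar.
ssm_definition = sum(((b0 + b1 * xi) - y_bar) ** 2 for xi in x)

# SSM from the alternate form derived above.
ssm_alternate = sxy ** 2 / sxx

print(f"SSM (definition)   = {ssm_definition:.6f}")
print(f"SSM (SXY^2 / SXX)  = {ssm_alternate:.6f}")
```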
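Now, recall that $r^2 = \frac{SSM}{SST}$. We will substitute the expression above in for SSM, and then simplify.
$$
\begin{aligned}
r^2 &= \frac{SSM}{SST} \\
    &= \frac{(S_{XY})^2}{S_{XX}} \cdot \frac{1}{SST} \\
    &= \frac{(S_{XY})^2}{S_{XX} \cdot SST} \\
    &= \frac{\left( \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2} \\
    &= \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^2 \\
    &= \left( \frac{\mathrm{cov}[X, Y]}{s_X s_Y} \right)^2 \\
    &= \left( \mathrm{corr}[X, Y] \right)^2
\end{aligned}
$$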
This completes our proof.
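The result can be checked numerically by computing $r^2$ from the sums of squares and comparing it with the squared sample correlation. The sketch below uses an invented sample, and the helper name `sample_corr` is hypothetical.

```python
import math

# Hypothetical sample, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sample_corr(u, v):
    """Sample correlation: cov[u, v] / (s_u * s_v), with n - 1 divisors."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    cov = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / (n - 1)
    s_u = math.sqrt(sum((a - u_bar) ** 2 for a in u) / (n - 1))
    s_v = math.sqrt(sum((b - v_bar) ** 2 for b in v) / (n - 1))
    return cov / (s_u * s_v)

# Fit the least-squares line and compute r^2 = SSM / SST.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
ssm = sum(((b0 + b1 * xi) - y_bar) ** 2 for xi in x)
sst = sum((yi - y_bar) ** 2 for yi in y)
r2 = ssm / sst

r_xy = sample_corr(x, y)
print(f"r^2 from SSM / SST = {r2:.6f}")
print(f"corr[X, Y] squared = {r_xy ** 2:.6f}")
```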
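Correlation Between $Y$ and $\hat{Y}$
It can also be shown that the correlation between the fitted value $\hat{Y}$ and the response $Y$ has the same magnitude as the correlation between the predictor $X$ and the response $Y$. The two correlations are equal when $\hat\beta_1 > 0$ and differ only in sign when $\hat\beta_1 < 0$, since $\hat{Y}$ is then a decreasing linear function of $X$. In other words:
$$\mathrm{corr}[\hat{Y}, Y] = \left| \mathrm{corr}[X, Y] \right|$$
It follows that $r^2 = \left( \mathrm{corr}[\hat{Y}, Y] \right)^2$ as well.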
The proof of this fact is left as an exercise.
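Before attempting the proof, it can help to see how the two correlations compare numerically. The sketch below uses an invented sample with a decreasing trend, so the fitted slope is negative; in that case the two correlations agree in magnitude but not in sign. The data and the helper name `sample_corr` are assumptions made for illustration, and a numerical check is not a proof.

```python
import math

# Hypothetical sample with a decreasing trend, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.1, 7.8, 6.2, 3.9, 2.1]

def sample_corr(u, v):
    """Sample correlation; the n - 1 divisors cancel, so they are omitted."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Least-squares fit and fitted values.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

r_xy = sample_corr(x, y)
r_fit_y = sample_corr(y_hat, y)

print(f"slope            = {b1:.3f}")
print(f"corr[X, Y]       = {r_xy:.6f}")
print(f"corr[Y_hat, Y]   = {r_fit_y:.6f}")
```

With a positive slope the two values coincide; reversing the order of `y` in the sample above is a quick way to see that case.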
