---
title: "3.1.b - Derivation of R-Squared"
author: "Robbie Beane"
output:
  html_notebook:
    theme: flatly
    toc: true
    toc_depth: 4
---



### **Variance Reduction**

Assume that we have two variables, $X$ and $Y$, that are related according to the hypothetical model $Y = \beta_0 + \beta_1 X + \varepsilon$. We collect a sample consisting of $n$ paired observations of the form $(x_i,y_i)$. We then use the sample to create a fitted model of the form $\hat Y = \hat \beta_0 + \hat \beta_1 X$.

Assume we wish to make a prediction about the value of $Y$ for a new observation. Without considering the effect that $X$ has on $Y$, our best point estimate for the new value of $Y$ would be $\bar y$. To take into account the uncertainty that we know exists in our prediction, we could create a 95% prediction interval around our point prediction, $\bar y$. Such an interval is shown in the plot on the left below. We will discuss how prediction intervals are formed in a later lesson.   

If, however, we know the value of $X$ in the new observation, then a better estimate for the value of $Y$ would be given by $E[Y | X = x]= \beta_0 + \beta_1 x$, which can be approximated using our fitted model. This gives us a point estimate of $\hat y = \hat \beta_0 + \hat \beta_1 x$. Again, we know that there is some uncertainty in our prediction, so we might create a 95% prediction interval around the point prediction $\hat y$. Such a prediction interval is shown in the plot on the right below. 

Notice that the prediction interval in the plot on the right is considerably shorter than the one on the left. By taking into account the effect that $X$ has on $Y$, we are able to reduce the uncertainty, or variance, in our prediction. This allows us to make more precise predictions. 

In this lesson, we will discuss a method of measuring the amount of variance reduction we obtain in using a regression model to explain a portion of the variation in $Y$ through its relationship with $X$. 

<br />
<br />


```{r, echo=FALSE, warning=FALSE}

set.seed(3)
x <- runif(20, 5,10)
y <- 1.2 + 0.3 * x + rnorm(20, 0, 0.1)
z <- 0 * x 

# Create first model (intercept only, so every prediction equals mean(y))
mod1 <- lm(y ~ 1)
pred1 <- predict(mod1, newdata = data.frame(x=c(6.5)), interval='prediction', level=0.95)


# Create second model
mod2 <- lm(y ~ x)
pred2 <- predict(mod2, newdata = data.frame(x=c(6.5)), interval='prediction', level=0.95)

#yhat <- sum(mod1$coefficients * c(1,6.5))

# Display plots
par(mfrow=c(1,2))

plot(y ~ z, xaxt='n', xlab="", ylim=c(2.5,4.5), pch=19, col=rgb(0,0,1,0.8), 
     main="95% Prediction Int for Y ")
abline(h=mean(y), lty=2, col="red")
segments(0.25,pred1[2],0.25,pred1[3], lwd=4, col="Dark Orange")

plot(y ~ x, ylim=c(2.5,4.5), pch=19, col=rgb(0,0,1,0.8),
     main="95% Prediction Int for Y, \n Given that X=6.5")
abline(h=pred2[1], lty=2, col="red")
segments(6.5,pred2[2],6.5,pred2[3], lwd=4, col="Dark Orange")


```
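
The reduction in interval length shown in the plots can also be checked numerically. The chunk below is a minimal sketch that regenerates the same simulated data under new names (the `_sim`, `fit_mean`, and `fit_line` names are introduced here for illustration only) and compares the widths of the two 95% prediction intervals at $X = 6.5$.

```{r}
# Assumption: same seed and simulation settings as the plotting chunk,
# stored under new names so existing objects are left untouched.
set.seed(3)
x_sim <- runif(20, 5, 10)
y_sim <- 1.2 + 0.3 * x_sim + rnorm(20, 0, 0.1)

fit_mean <- lm(y_sim ~ 1)      # ignores X: point prediction is mean(y_sim)
fit_line <- lm(y_sim ~ x_sim)  # uses X: point prediction is on the fitted line

new_obs <- data.frame(x_sim = 6.5)
pi_mean <- predict(fit_mean, newdata = new_obs, interval = "prediction", level = 0.95)
pi_line <- predict(fit_line, newdata = new_obs, interval = "prediction", level = 0.95)

# Width of each interval (upper limit minus lower limit)
width_mean <- unname(pi_mean[1, "upr"] - pi_mean[1, "lwr"])
width_line <- unname(pi_line[1, "upr"] - pi_line[1, "lwr"])
c(width_mean = width_mean, width_line = width_line)
```

For this sample the second width should come out noticeably smaller, matching the shorter orange segment in the right-hand plot.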


### **SST, SSM, and SSE**

We will use a quantity denoted by $r^2$ to measure the proportion of the variance in our response variable $Y$ that is explained by the relationship between $Y$ and a predictor $X$. Before we define $r^2$, we first need to introduce some related quantities. 

For a particular observation $y_i$, notice that the quantity $t_i = y_i - \bar y$ measures the amount by which the observation deviates from the sample mean. We will decompose this quantity into two pieces. 

* Let $m_i = \hat y_i - \bar y$. The quantity $m_i$ is the portion of the deviation $t_i$ that is explained by the regression model. You would expect the $y$ value to vary from $\bar y$ by this amount as a result of the effect of $x$.

* Let $\hat \varepsilon_i = y_i - \hat y_i$. This is the residual associated with $y_i$. It can be thought of as the portion of the overall deviation $t_i$ that is left unexplained by our model. 

Notice that $t_i = m_i + \hat \varepsilon_i$.

```{r, echo=FALSE}
yhat8 <- sum(mod2$coefficients*c(1,x[9]))
plot(y[9] ~ x[9], xlim=c(7.5,8.4),  ylim=c(3.4,3.75), pch=19, cex=2, col="blue", xlab="", ylab="")

abline(mod2$coefficients)
abline(h=mean(y), lty=2, col="red")

d = 0.1
segments(x[9] - d,mean(y),x[9] - d,y[9], lwd=3, col="deepskyblue3")
segments(x[9] - d,y[9],x[9],y[9], lty=2, lwd=2, col="deepskyblue3")

segments(x[9] + d,mean(y),x[9] + d,yhat8, lwd=3, col="Dark Orange")
segments(x[9] + d,yhat8,x[9] + d,y[9], lwd=3, col="Green4")

segments(x[9],yhat8,x[9] + d,yhat8, lty=2, lwd=2, col="Dark Orange")
segments(x[9],y[9],x[9] + d,y[9], lty=2, lwd=2, col="Green4")

text(7.69,3.63, expression(t[i] == y[i] - bar(y)), cex=1.5, col="deepskyblue3") 
text(8.08,3.65, expression(hat(e)[i] == y[i] - hat(y)[i]), cex=1.5, col="Green4") 
text(8.08,3.5, expression(m[i] == hat(y)[i] - bar(y)), cex=1.5, col="Dark Orange") 

points(x[9],y[9], pch=19, cex=2, col="blue")
points(x[9],yhat8, pch=19, cex=2)
```
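
The decomposition $t_i = m_i + \hat \varepsilon_i$ holds for every observation, not only the one pictured. The following sketch, which assumes the same simulated data as the earlier plotting chunk (regenerated under illustrative `_sim` names), computes the three quantities and confirms the identity.

```{r}
# Assumption: same seed and simulation settings as the plotting chunk.
set.seed(3)
x_sim <- runif(20, 5, 10)
y_sim <- 1.2 + 0.3 * x_sim + rnorm(20, 0, 0.1)
fit_sim <- lm(y_sim ~ x_sim)

y_hat <- fitted(fit_sim)
t_dev <- y_sim - mean(y_sim)   # t_i: total deviation from the sample mean
m_dev <- y_hat - mean(y_sim)   # m_i: deviation explained by the model
e_hat <- y_sim - y_hat         # residual: deviation left unexplained

# First few rows, with m_i + e_i alongside t_i for comparison
head(data.frame(t = t_dev, m = m_dev, e_hat = e_hat, m_plus_e = m_dev + e_hat))

# TRUE if the decomposition holds (up to floating-point error)
all.equal(unname(t_dev), unname(m_dev + e_hat))
```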

We now sum the squares of each of these three quantities over all of the points in our sample.

* Let $SST = \sum t_i^2  = \sum (y_i - \bar y)^2$

* Let $SSM = \sum m_i^2  = \sum ( \hat y_i - \bar y)^2$

* Let $SSE = \sum \hat \varepsilon_i^2  = \sum ( y_i - \hat y_i)^2$
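
As a rough illustration, the three sums can be computed directly from a fitted `lm` object. The data below are an assumption: they repeat the simulation used for the plots, stored under illustrative `_sim` names. The last line shows that R's `anova()` table reports the same two model-based sums in its `Sum Sq` column.

```{r}
# Assumption: same seed and simulation settings as the plotting chunk.
set.seed(3)
x_sim <- runif(20, 5, 10)
y_sim <- 1.2 + 0.3 * x_sim + rnorm(20, 0, 0.1)
fit_sim <- lm(y_sim ~ x_sim)
y_hat <- fitted(fit_sim)

SST <- sum((y_sim - mean(y_sim))^2)  # total sum of squares
SSM <- sum((y_hat - mean(y_sim))^2)  # sum of squares explained by the model
SSE <- sum((y_sim - y_hat)^2)        # sum of squared residuals
c(SST = SST, SSM = SSM, SSE = SSE)

# The ANOVA table lists SSM (row for x_sim) and SSE (row for Residuals)
anova(fit_sim)[["Sum Sq"]]
```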


### **SST = SSE + SSM**

An important relationship between these three sums is given by the equation $SST = SSE + SSM$. 

To establish this result, we will need to make use of the two identities shown below. Both follow from the **normal equations**, which were derived in the notebook titled **3.1.a - Derivation of PArameter Estimates.Rmd**. 

* $\sum \hat \varepsilon_i = 0$

* $\sum \hat \varepsilon_i \hat y_i = 0$

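The first identity is one of our two normal equations. The second follows from the two normal equations, $\sum \hat \varepsilon_i = 0$ and $\sum \hat \varepsilon_i x_i = 0$: since $\hat y_i = \hat \beta_0 + \hat \beta_1 x_i$, we have $\sum \hat \varepsilon_i \hat y_i = \hat \beta_0 \sum \hat \varepsilon_i + \hat \beta_1 \sum \hat \varepsilon_i x_i = 0$. Armed with these identities, we may proceed with our proof that $SST = SSE + SSM$ as follows: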

$\hspace{30pt} SST = \sum (y_i - \bar{y})^2$

$\hspace{30pt} SST = \sum [(\hat \varepsilon_i + \hat y_i) - \bar{y}]^2$

$\hspace{30pt} SST = \sum [\hat \varepsilon_i + (\hat y_i - \bar{y})]^2$

$\hspace{30pt} SST = \sum [\hat \varepsilon_i^2 + 2\hat \varepsilon_i (\hat y_i - \bar{y}) + ( \hat y_i - \bar y)^2]$

$\hspace{30pt} SST = \sum \hat \varepsilon_i^2 + 2\sum \hat \varepsilon_i (\hat y_i - \bar{y}) + \sum ( \hat y_i - \bar y)^2$

$\hspace{30pt} SST = SSE + 2 \sum( \hat \varepsilon_i \hat y_i -  \hat \varepsilon_i\bar{y}) + SSM$

$\hspace{30pt} SST = SSE + 2 \sum \hat \varepsilon_i \hat y_i -  2 \bar{y}\sum\hat \varepsilon_i + SSM$

$\hspace{30pt} SST = SSE + 0 - 0 + SSM$

$\hspace{30pt} SST = SSE + SSM$
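
The two identities used in the proof, and the resulting decomposition, can be checked numerically. This is only a sanity check on one simulated sample (the same simulation as the plots, regenerated under illustrative `_sim` names), not a substitute for the derivation.

```{r}
# Assumption: same seed and simulation settings as the plotting chunk.
set.seed(3)
x_sim <- runif(20, 5, 10)
y_sim <- 1.2 + 0.3 * x_sim + rnorm(20, 0, 0.1)
fit_sim <- lm(y_sim ~ x_sim)

e_hat <- resid(fit_sim)
y_hat <- fitted(fit_sim)

# Both sums should be zero up to floating-point error
sum_resid        <- sum(e_hat)
sum_resid_fitted <- sum(e_hat * y_hat)
c(sum_resid = sum_resid, sum_resid_fitted = sum_resid_fitted)

SST <- sum((y_sim - mean(y_sim))^2)
SSM <- sum((y_hat - mean(y_sim))^2)
SSE <- sum(e_hat^2)

# TRUE if SST = SSE + SSM holds for this sample
all.equal(SST, SSE + SSM)
```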


### **r-Squared**

Intuitive explanations of the meaning of the variables $SST$, $SSM$, and $SSE$ are as follows:

* $SST$ is a measure of the amount of variation in the response variable $Y$. 

* $SSM$ is a measure of the variation in $Y$ that is explained by the regression model. 

* $SSE$ is a measure of the variation in $Y$ that is left unexplained by our model. 

Ideally, we would like for $SSE$ to be close to 0 and for $SSM$ to thus be close to $SST$. We can measure the proportion of the variance in $Y$ that is explained by our regression model using the following quantity:

$$r^2 = \frac{SSM}{SST}$$ 

Note that since $0 \leq SSM \leq SST$, we get that $0 \leq r^2 \leq 1$. The quantity $r^2$ is a diagnostic tool that is commonly used to measure the quality of the fit in a regression model. 

Notice that since $SST = SSM + SSE$, we can rewrite the formula for $r^2$ as follows:


$$r^2 = \frac{SSM}{SST} = 1 - \frac{SSE}{SST}$$ 

The $r^2 = 1 - \frac{SSE}{SST}$ formula for $r^2$ is often the more useful of the two formulas, since we will have other reasons for calculating $SSE$. 
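
A short sketch of both computations follows, using $SSM / SST$ and $1 - SSE / SST$, and comparing each with the value that `summary()` reports for the fitted model. The simulated data are an assumption, repeating the settings used for the plots under illustrative `_sim` names.

```{r}
# Assumption: same seed and simulation settings as the plotting chunk.
set.seed(3)
x_sim <- runif(20, 5, 10)
y_sim <- 1.2 + 0.3 * x_sim + rnorm(20, 0, 0.1)
fit_sim <- lm(y_sim ~ x_sim)

SST <- sum((y_sim - mean(y_sim))^2)
SSM <- sum((fitted(fit_sim) - mean(y_sim))^2)
SSE <- sum(resid(fit_sim)^2)

r2_from_ssm <- SSM / SST        # proportion of variation explained
r2_from_sse <- 1 - SSE / SST    # one minus proportion left unexplained
r2_from_lm  <- summary(fit_sim)$r.squared

c(r2_from_ssm = r2_from_ssm, r2_from_sse = r2_from_sse, r2_from_lm = r2_from_lm)
```

All three values should agree up to floating-point error.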



### **r-Squared and Correlation**

The value $r^2$ is related to the sample correlation $\rho_{X,Y} = \mathrm{corr}[X,Y]$. In fact, it can be shown that:

$$r^2 = \left(\mathrm{corr}[X,Y]\right)^2$$

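To establish this result, we first need to derive an alternate form for the expression $SSM$. Recall that the parameter estimates are $\hat \beta_0 = \bar y - \hat \beta_1 \bar x$ and $\hat \beta_1 = S_{XY} / S_{XX}$, where $S_{XX} = \sum (x_i - \bar x)^2$ and $S_{XY} = \sum (x_i - \bar x)(y_i - \bar y)$. Notice that: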

$\hspace{30pt} SSM = \sum\limits_{i=1}^n (\hat y_i - \bar y)^2$

$\hspace{30pt} SSM = \sum\limits_{i=1}^n \left[\left(\hat \beta_0 + \hat \beta_1 x_i \right) - \bar y\right]^2$

$\hspace{30pt} SSM = \sum\limits_{i=1}^n \left[\left(\bar y - \hat\beta_1\bar x \right) + \hat \beta_1 x_i  - \bar y\right]^2$

$\hspace{30pt} SSM = \sum\limits_{i=1}^n \left( - \hat\beta_1\bar x  + \hat \beta_1 x_i\right)^2$

$\hspace{30pt} SSM = \hat \beta_1^2 \sum\limits_{i=1}^n \left( x_i - \bar x\right)^2$

$\hspace{30pt} SSM = \hat \beta_1^2 S_{XX}$

$\hspace{30pt} SSM = \left(\frac{S_{XY}}{S_{XX}} \right)^2 S_{XX}$

$\hspace{30pt} SSM = \frac{\left(S_{XY} \right)^2}{S_{XX}}$


Now, recall that $r^2 = \frac{SSM}{SST}$. We will substitute the expression above in for $SSM$, and then simplify. 

$\hspace{30pt} r^2 = \frac{SSM}{SST}$

$\hspace{30pt} r^2 = \frac{\left(S_{XY} \right)^2}{S_{XX}} \frac{1}{SST}$

$\hspace{30pt} r^2 = \frac{\left(S_{XY} \right)^2}{S_{XX} \cdot SST}$

$\hspace{30pt} r^2 = \frac{\left(\sum\limits_{i=1}^n (x_i - \bar x)(y_i - \bar y) \right)^2}{\sum\limits_{i=1}^n (x_i - \bar x)^2 \cdot \sum\limits_{i=1}^n (y_i - \bar y)^2}$

$\hspace{30pt} r^2 = \left(\frac{\sum\limits_{i=1}^n (x_i - \bar x)(y_i - \bar y) }{\sqrt{\sum\limits_{i=1}^n (x_i - \bar x)^2} \cdot \sqrt{\sum\limits_{i=1}^n (y_i - \bar y)^2}} \right)^2$

$\hspace{30pt} r^2 = \left( \frac{\mathrm{cov}[X,Y]}{s_X s_Y}  \right)^2$

$\hspace{30pt} r^2 = \left( \mathrm{corr}[X,Y]  \right)^2$

This completes our proof. 
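
The result can be illustrated numerically. The sketch below assumes the same simulated data as the plots (regenerated under illustrative `_sim` names; `S_xx` and `S_xy` are likewise names chosen here) and checks both the alternate form of $SSM$ and the agreement between $r^2$ and the squared sample correlation.

```{r}
# Assumption: same seed and simulation settings as the plotting chunk.
set.seed(3)
x_sim <- runif(20, 5, 10)
y_sim <- 1.2 + 0.3 * x_sim + rnorm(20, 0, 0.1)
fit_sim <- lm(y_sim ~ x_sim)

S_xx <- sum((x_sim - mean(x_sim))^2)
S_xy <- sum((x_sim - mean(x_sim)) * (y_sim - mean(y_sim)))
SST  <- sum((y_sim - mean(y_sim))^2)
SSM  <- sum((fitted(fit_sim) - mean(y_sim))^2)

# Alternate form: SSM = S_xy^2 / S_xx
all.equal(SSM, S_xy^2 / S_xx)

# r^2 from the sums of squares, from cor(), and from summary()
r2_from_sums <- SSM / SST
r2_from_cor  <- cor(x_sim, y_sim)^2
r2_from_lm   <- summary(fit_sim)$r.squared
c(r2_from_sums = r2_from_sums, r2_from_cor = r2_from_cor, r2_from_lm = r2_from_lm)
```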

### **Correlation Between $Y$ and $\hat Y$**

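It can also be shown that the correlation between the fitted value $\hat Y$ and the response $Y$ has the same magnitude as that between the predictor $X$ and the response $Y$. Because $\hat Y$ is a linear function of $X$ with slope $\hat \beta_1$, the two correlations agree when $\hat \beta_1 > 0$ and differ only in sign when $\hat \beta_1 < 0$. In other words:

$$\mathrm{corr}\left[\hat Y,Y \right] = \left|\mathrm{corr}\left[X,Y\right]\right|$$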

The proof of this fact is left as an exercise. 
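
Before attempting the proof, it may help to see the relationship on data. The sketch below fits one increasing and one decreasing relationship; the decreasing response `y_down` is a hypothetical example introduced here, and the other names are likewise illustrative. Note that for a negative fitted slope the two correlations differ in sign, so the second check compares against the negated value.

```{r}
# Assumption: same seed and noise settings as the plotting chunk;
# y_down is a hypothetical response with a negative slope.
set.seed(3)
x_sim  <- runif(20, 5, 10)
noise  <- rnorm(20, 0, 0.1)
y_up   <- 1.2 + 0.3 * x_sim + noise
y_down <- 1.2 - 0.3 * x_sim + noise

fit_up   <- lm(y_up ~ x_sim)
fit_down <- lm(y_down ~ x_sim)

# Positive slope: corr(Y-hat, Y) equals corr(X, Y)
cor_up <- c(fitted_vs_y = cor(fitted(fit_up), y_up), x_vs_y = cor(x_sim, y_up))
cor_up

# Negative slope: corr(Y-hat, Y) is positive, corr(X, Y) is negative
cor_down <- c(fitted_vs_y = cor(fitted(fit_down), y_down), x_vs_y = cor(x_sim, y_down))
cor_down
```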








