library(tidyverse)
library(openintro)

Comment:

This is material I had planned to present Thursday night but couldn’t. So I am presenting it in an R Notebook. Please work through this material before doing the assignment, it will help.

Total variation

Suppose we have sets of \(n\) observations x and y. And we want to analyze the relatioship between them. Looking at a scatter plot of x and y we conclude that there may be a linear association between the variables. (WARNING: an association does not mean there is a causal relation.) Anyway, we perform a least squares regression and fit a model of the form: \[Y = \beta_0 + \beta_1X + \epsilon_i\] Where \(\epsilon \sim\mathcal{N}(0,\sigma^2).\) That is the \(\epsilon_i\) are normally distributed with mean 0 and variance \(\sigma^2\).

Now look at the variability of \(Y\). This is given by the variation of \(Y\) about \(\bar{Y}\) which is \[\text{Total variability} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2\]

The Regression Variablity is the variability that is explained by adding the predictor (x). It is given by \[\text{Regression variability} = \sum_{i=1}^{n}(\hat{Y_i} - \bar{Y})^2\]

The Residual variablitity is what is left over around the regression line. \[\text{Residual variability}= \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\] An interesting fact is that: (proof beyond scope of this class)

\[\text{Total variability} = \text{Residual variability} + \text{Regression variability}\] Or in symbols:

\[\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n}(\hat{Y_i} - \bar{Y})^2\]

Let’s check this with the diamond data:

fit <- lm(price ~ carat, data = diamond)
summary(fit)

Call:
lm(formula = price ~ carat, data = diamond)

Residuals:
    Min      1Q  Median      3Q     Max 
-85.159 -21.448  -0.869  18.972  79.370 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -259.63      17.32  -14.99   <2e-16 ***
carat        3721.02      81.79   45.50   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.84 on 46 degrees of freedom
Multiple R-squared:  0.9783,    Adjusted R-squared:  0.9778 
F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16
price <-  diamond$price
carat <-  diamond$carat
df <- nrow(diamond) - 1
df
[1] 47
res_var = sum(residuals(fit)^2)
y_hat = predict(fit)
reg_var = sum((y_hat - mean(price))^2)
tot_var = sum((price - mean(price))^2)
tot_var
[1] 2145232
res_var + reg_var
[1] 2145232
tot_var - (res_var + reg_var)
[1] -9.313226e-10

\(R^2\)

R squared is the percentage of the total variability that is explained by the linear relationsip with the predictor. i.e., \[R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}\]

Looking at \(R^2\) in diamond, we get

Rsq = reg_var/tot_var
Rsq
[1] 0.9782608
cor(carat,price)^2 #square of the correlation coefficient
[1] 0.9782608
Rsq - cor(carat,price)^2
[1] 4.440892e-16

Interesting observations:

  1. \(R^2\) is the percent of variation explained by the regression model.
  2. \(R^2\) is the sample correlation squared
  3. \(R^2\) can be a misleading summary of model fit. (deleting data may inflate the value)
LS0tCnRpdGxlOiAiQW5hbHl6aW5nIFZhcmlhYmlsaXR5IgphdXRob3I6ICJDZWxpYSBFdmFucyIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKYGBge3J9CmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KG9wZW5pbnRybykKYGBgCgojIyBDb21tZW50OgpUaGlzIGlzIG1hdGVyaWFsIEkgaGFkIHBsYW5uZWQgdG8gcHJlc2VudCBUaHVyc2RheSBuaWdodCBidXQgY291bGRuJ3QuIFNvIEkgYW0gcHJlc2VudGluZyBpdCBpbiBhbiBSIE5vdGVib29rLiBQbGVhc2Ugd29yayB0aHJvdWdoIHRoaXMgbWF0ZXJpYWwgYmVmb3JlIGRvaW5nIHRoZSBhc3NpZ25tZW50LCBpdCB3aWxsIGhlbHAuCgojIyBUb3RhbCB2YXJpYXRpb24KClN1cHBvc2Ugd2UgaGF2ZSBzZXRzIG9mICRuJCBvYnNlcnZhdGlvbnMgYHhgIGFuZCBgeWAuIEFuZCB3ZSB3YW50IHRvIGFuYWx5emUgdGhlIHJlbGF0aW9zaGlwIGJldHdlZW4gdGhlbS4gTG9va2luZyBhdCBhIHNjYXR0ZXIgcGxvdCBvZiBgeGAgYW5kIGB5YCB3ZSBjb25jbHVkZSB0aGF0IHRoZXJlIG1heSBiZSBhIGxpbmVhciBhc3NvY2lhdGlvbiBiZXR3ZWVuIHRoZSB2YXJpYWJsZXMuICgqKldBUk5JTkc6KiogYW4gYXNzb2NpYXRpb24gZG9lcyBub3QgbWVhbiB0aGVyZSBpcyBhIGNhdXNhbCByZWxhdGlvbi4pIEFueXdheSwgd2UgcGVyZm9ybSBhIGxlYXN0IHNxdWFyZXMgcmVncmVzc2lvbiBhbmQgZml0IGEgbW9kZWwgb2YgdGhlIGZvcm06CiQkWSA9IFxiZXRhXzAgKyBcYmV0YV8xWCArIFxlcHNpbG9uX2kkJCBXaGVyZSAkXGVwc2lsb24gXHNpbVxtYXRoY2Fse059KDAsXHNpZ21hXjIpLiQgVGhhdCBpcyB0aGUgJFxlcHNpbG9uX2kkIGFyZSBub3JtYWxseSBkaXN0cmlidXRlZCB3aXRoIG1lYW4gMCBhbmQgdmFyaWFuY2UgJFxzaWdtYV4yJC4KCk5vdyAgbG9vayBhdCB0aGUgdmFyaWFiaWxpdHkgb2YgJFkkLiAgVGhpcyBpcyBnaXZlbiBieSB0aGUgdmFyaWF0aW9uIG9mICRZJCBhYm91dCAkXGJhcntZfSQgd2hpY2ggaXMgJCRcdGV4dHtUb3RhbCB2YXJpYWJpbGl0eX0gPSBcc3VtX3tpPTF9XntufShZX2kgLSBcYmFye1l9KV4yJCQKClRoZSAqKlJlZ3Jlc3Npb24gVmFyaWFibGl0eSAqKiBpcyB0aGUgdmFyaWFiaWxpdHkgdGhhdCBpcyBleHBsYWluZWQgYnkgYWRkaW5nIHRoZSBwcmVkaWN0b3IgKGB4YCkuIEl0IGlzIGdpdmVuIGJ5IAokJFx0ZXh0e1JlZ3Jlc3Npb24gdmFyaWFiaWxpdHl9ID0gXHN1bV97aT0xfV57bn0oXGhhdHtZX2l9IC0gXGJhcntZfSleMiQkCgpUaGUgKipSZXNpZHVhbCB2YXJpYWJsaXRpdHkqKiBpcyB3aGF0IGlzIGxlZnQgb3ZlciBhcm91bmQgdGhlIHJlZ3Jlc3Npb24gbGluZS4KJCRcdGV4dHtSZXNpZHVhbCB2YXJpYWJpbGl0eX09IFxzdW1fe2k9MX1ee259KFlfaSAtIFxoYXR7WX1faSleMiQkCkFuIGludGVyZXN0aW5nIGZhY3QgaXMgdGhhdDogKHByb29mIGJleW9uZCBzY29wZSBvZiB0aGlzIGNsYXNzKQoKJCRcdGV4dHtUb3RhbCB2YXJpYWJpbGl0eX0gPSAgIFx0ZXh0e1Jlc2lkdWFsIHZhcmlhYmlsaXR5fSArIFx0ZXh0e1JlZ3Jlc3Npb24gdmFyaWFiaWxpdHl9JCQKT3IgaW4gc3ltYm9sczoKCiQkXHN1bV97aT0xfV57bn0oWV9pIC0gXGJhcntZfSleMiA9IFxzdW1fe2k9MX1ee259KFlfaSAtIFxoYXR7WX1faSleMiArIFxzdW1fe2k9MX1ee259KFxoYXR7WV9pfSAtIFxiYXJ7WX0pXjIkJAoKTGV0J3MgY2hlY2sgdGhpcyB3aXRoIHRoZSBkaWFtb25kIGRhdGE6CgpgYGB7cn0KZml0IDwtIGxtKHByaWNlIH4gY2FyYXQsIGRhdGEgPSBkaWFtb25kKQpzdW1tYXJ5KGZpdCkKYGBgCmBgYHtyfQpwcmljZSA8LSAgZGlhbW9uZCRwcmljZQpjYXJhdCA8LSAgZGlhbW9uZCRjYXJhdApkZiA8LSBucm93KGRpYW1vbmQpIC0gMQoKcmVzX3ZhciA9IHN1bShyZXNpZHVhbHMoZml0KV4yKQp5X2hhdCA9IHByZWRpY3QoZml0KQpyZWdfdmFyID0gc3VtKCh5X2hhdCAtIG1lYW4ocHJpY2UpKV4yKQp0b3RfdmFyID0gc3VtKChwcmljZSAtIG1lYW4ocHJpY2UpKV4yKQp0b3RfdmFyCnJlc192YXIgKyByZWdfdmFyCnRvdF92YXIgLSAocmVzX3ZhciArIHJlZ192YXIpCmBgYAoKIyMjICRSXjIkCgoqKlIgc3F1YXJlZCoqIGlzIHRoZSBwZXJjZW50YWdlIG9mIHRoZSB0b3RhbCB2YXJpYWJpbGl0eSB0aGF0IGlzIGV4cGxhaW5lZCBieSB0aGUgbGluZWFyIHJlbGF0aW9uc2lwIHdpdGggdGhlIHByZWRpY3Rvci4gaS5lLiwgCiQkUl4yID0gXGZyYWN7XHN1bV97aT0xfV57bn0oXGhhdHtZfV9pIC0gXGJhcntZfSleMn17XHN1bV97aT0xfV57bn0oWV9pIC0gXGJhcntZfSleMn0kJAoKTG9va2luZyBhdCAkUl4yJCBpbiBkaWFtb25kLCB3ZSBnZXQKCmBgYHtyfQpSc3EgPSByZWdfdmFyL3RvdF92YXIKUnNxCmBgYApgYGB7cn0KY29yKGNhcmF0LHByaWNlKV4yICNzcXVhcmUgb2YgdGhlIGNvcnJlbGF0aW9uIGNvZWZmaWNpZW50ClJzcSAtIGNvcihjYXJhdCxwcmljZSleMgpgYGAKSW50ZXJlc3Rpbmcgb2JzZXJ2YXRpb25zOgoKMS4gJFJeMiQgaXMgdGhlIHBlcmNlbnQgb2YgdmFyaWF0aW9uIGV4cGxhaW5lZCBieSB0aGUgcmVncmVzc2lvbiBtb2RlbC4KMi4gJFJeMiQgaXMgdGhlIHNhbXBsZSBjb3JyZWxhdGlvbiBzcXVhcmVkCjMuICRSXjIkIGNhbiBiZSBhIG1pc2xlYWRpbmcgc3VtbWFyeSBvZiBtb2RlbCBmaXQuIChkZWxldGluZyBkYXRhIG1heSBpbmZsYXRlIHRoZSB2YWx1ZSkK