Using `lm`

lm is a base R function used to fit linear models.
Example:

fit <- lm(price ~ carat,data = diamond)
m = fit$coef[[2]]
b = fit$coef[[1]]
c(m,b)

## [1] 3721.0249 -259.6259

So how hard is it to calculate `m` and `b`?

It turns out that it is not so dificult, but first we need to review a couple of things from elementary statistics.

Empirical Statistics

When we talk about empirical statistics we are talking about working with samples not populations.
Studying populations is probability.
Studying a sample to conclude something about a population is statistics.

Some definitions from elementary statistics.

Let \(X = \{x_1,x_2,x_3,\ldots,x_n\}\) be a set consisting of \(n\) numbers and let \(Y = \{y_1,y_2,y_3,\dots,y_n\}\) be another set of \(n\) numbers.

Sigma Notation, the Greek letter \(\Sigma\) is used to denote a sume, specifically: \[\sum_{i=1}^{n}x_i = x_1 + x_2 + \ldots + x_n\]
Mean The mean is simple the arithemetic average of the numbers in a set of numbers \(X\) it is denotec by \(\bar{X}\) and defined by: \[\bar{X} = \frac{1}{n}\sum_{i=1}^{n}x_i\]
variance The variance. \(S\) of a set of numbers, \(X\) is denoted by: \[ S_X = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{X})^2 \]

What is traditional Regression?

Last time, I said that least squares regression consists of fitting a linear model to a data set in such a way that the sum of the squared deviations from the model, i.e., the sum over all values of (data value - the coreesponding model value)\(^2\) were as small as possible.
Let’s look at the simplest case. Given \(X\) and \(Y\), as sets of numbers find the number \(\theta\) that minimizes: \[\sum_{i = 1}^{n}(y_i - \theta)^2\]
Claim: \(\theta = \bar{Y}\)
\[\sum_{i = 1}^{n}(y_i - \theta)^2 = \sum_{i = 1}^{n}(y_i - \bar{Y} + \bar{Y} - \theta)^2 = \\ \sum_{i = 1}^{n}(y_i - \bar{Y} )^2 -2(\bar{Y}- \theta)\sum_{i}^{n}(y_i - \bar{Y}) + n(\bar{Y} - \theta)^2 \]
But, \(\sum_{i}^{n}(y_i - \bar{Y}) = 0\) (exercise)
So, \[\sum_{i = 1}^{n}(y_i - \theta)^2 = \sum_{i = 1}^{n}(y_i - \bar{Y})^2 + n(\bar{Y} - \theta)^2\]
Therefore: \(\sum_{i = 1}^{n}(y_i - \bar{Y})^2 \le \sum_{i = 1}^{n}(y_i - \theta)^2\)
What have we just proven?

Next simplest case

Find \(\beta\) such that: \(\sum_{i = 1}^{n}(y_i - \beta x_i)^2\) is minimized.

This time, let’s use a little calculus, Let \(f(\beta) = \sum_{i = 1}^{n}(y_i - \beta x_i)^2\). From first semester calculus we know that if \(f\) has a minimum at a point \(\beta_1\) then \(f^\prime(\beta) = 0\) A quick look at the form of \(f\) revelas it to be a 2-degree polynomil, so it’s graph as a function of \(\beta\) is a parabola that opens up. So we know that any critical point will be a minimum. (why?)
\[f^\prime(\beta) = 2\sum_{i=1}^{n}(y_i - \beta x_i)x_i\\ = 2\left[\sum_{i=1}^{n}(y_ix_i) - \beta\sum_{i=1}^{n}x_i^2\right] = 0 \]
Solving for \(\beta\) we get:
\[\beta = \frac{\sum_{i=1}^{n}y_ix_i}{\sum_{i=1}^{n}x_i^2}\]

Let’s try this on `diamond`

x = diamond$carat
y = diamond$price

beta = sum(x*y)/sum(x^2)
beta

## [1] 2538.933

This is clearly not satisfactory,

We shoud have taken the intercept into account in our model.
In other words maybe we should have minimized \[\sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1x_i)^2)\]
Before we go through a lot of math, let’s see what we get from lm

fit = lm(y ~ x)
beta0 = coef(fit)[[1]]
beta1 = coef(fit)[[2]]
c(beta0,beta1)

## [1] -259.6259 3721.0249

Plotting

as_tibble(x,y) %>% ggplot(aes(x = x, y = y)) +
  geom_point(alpha = 1/3, size = 6, color = "black") +
  geom_point(alpha = 1/3, size = 5, color = "red") +
  geom_abline(aes(slope = beta1, intercept = beta0)) +
  ggtitle("approximation with both intercepts")

So did we waste our time with the simple example?

No.
Let’s look at what happens when we center the data by, i.e. \(YC = Y - \hat{Y}\) and \(XC = X - \hat{X}\)

xc = x - mean(x)
yc = y - mean(y)

fitc <- lm(yc ~ xc)
fitc

## 
## Call:
## lm(formula = yc ~ xc)
## 
## Coefficients:
## (Intercept)           xc  
##   5.145e-14    3.721e+03

Now lets calculate using our original formula except now using the centered values for X and Y.

beta_test = mean(xc*yc)/mean(xc^2)
beta_test

## [1] 3721.025

So now we get the same answeer lm got! why?
It’s not so surprising. From our model we get \[\bar{Y} = \beta_0 + \beta_1\bar{X}\\ \beta_0 = \bar{Y} - \beta_1\bar{X}\]
Which says if \(\bar{Y} = \bar{X} = 0\) then \(\beta_0 = 0\) }

Covariance and Correlation

Covariance The covariance of \(X\) and \(Y\) is defined by \[cov(X,Y) = \frac{\sum_{i = 1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{n-1}\]
Recall the variance of \(X\) is defined by \[var(X) = S_X^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}\]
Also recall the standard deviation of \(X\) is the square root of the variance so, \[S_X = \sqrt{S_X^2}= \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}}\]
The correlation of \(X\) and \(Y\) is defined by \[ r(X,Y) = \frac{cov(X,Y)}{S_XS_Y}\]
So what does this have to do with the price of laptops in New Zealand?

Back to our original formula for \(\beta_1 = \frac{\sum x_i y_i}{sum x_i^2}\)

If x and y are centered, i.e., \(\bar{X} = \bar{Y} = 0\) then we can interpret this equation as the \[\beta = \frac{cov(X,Y)}{S^2_X}\]
But by definition of the correlation this gives us: \[ \beta = r(X,Y)\frac{S_Y}{S_X}\]
We previously noticed that translating \(X\) and \(Y\) had no effect on the estimated slope of the regression line, so we can use that fact to and the fact that the intercept, \(\beta_0 = \bar{Y} - \beta \bar{X}\)
One other point since covariance and variance already translate by the means, the translation has no effect on those values.

beta1 = cov(x,y)/sd(x)^2
beta0 = mean(y) - beta1*mean(x)
rbind(c(beta0, beta1),c(coef(fit)[[1]],coef(fit)[[2]]))

##           [,1]     [,2]
## [1,] -259.6259 3721.025
## [2,] -259.6259 3721.025

Other things you can get out of `lm`

fit <- lm(y ~ x)
summary(fit)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.159 -21.448  -0.869  18.972  79.370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -259.63      17.32  -14.99   <2e-16 ***
## x            3721.02      81.79   45.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared:  0.9783, Adjusted R-squared:  0.9778 
## F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16

fit$coefficients

## (Intercept)           x 
##   -259.6259   3721.0249

fit$residuals

##           1           2           3           4           5           6 
## -17.9483176  -7.7380691 -22.9483176 -85.1585661 -28.6303057   6.2619309 
##           7           8           9          10          11          12 
##  23.4721795  37.6311854 -38.7893116  24.4721795  51.8414339  40.7389488 
##          13          14          15          16          17          18 
##   0.2619309  13.4209369  -1.2098087  40.5287002  36.1029250 -44.8405542 
##          19          20          21          22          23          24 
##  79.3696943 -25.0508027  57.8414339   9.2619309 -20.9483176  -3.7380691 
##          25          26          27          28          29          30 
## -19.9483176  27.8414339 -54.9483176   8.8414339 -26.9483176  16.4721795 
##          31          32          33          34          35          36 
## -22.9483176 -13.1020453 -12.1020453  -0.5278205   3.2619309   2.2619309 
##          37          38          39          40          41          42 
##  -1.2098087 -43.2098087 -27.9483176 -23.3122938 -15.6303057  43.2672091 
##          43          44          45          46          47          48 
##  32.8414339   7.3696943   4.3696943 -11.5278205 -14.8405542  17.4721795

predict(fit)

##         1         2         3         4         5         6         7         8 
##  372.9483  335.7381  372.9483  410.1586  670.6303  335.7381  298.5278  447.3688 
##         9        10        11        12        13        14        15        16 
##  521.7893  298.5278  410.1586  782.2611  335.7381  484.5791  596.2098  819.4713 
##        17        18        19        20        21        22        23        24 
##  186.8971  707.8406  670.6303  745.0508  410.1586  335.7381  372.9483  335.7381 
##        25        26        27        28        29        30        31        32 
##  372.9483  410.1586  372.9483  410.1586  372.9483  298.5278  372.9483  931.1020 
##        33        34        35        36        37        38        39        40 
##  931.1020  298.5278  335.7381  335.7381  596.2098  596.2098  372.9483  968.3123 
##        41        42        43        44        45        46        47        48 
##  670.6303 1042.7328  410.1586  670.6303  670.6303  298.5278  707.8406  298.5278

ggplot() +
  geom_point(aes(x = x, y = y)) +
  geom_point(aes(x = x, y = predict(fit)),color = "red", shape = 'o',size = 4) +
  geom_abline(aes(slope = fit$coefficients[[2]], intercept = fit$coefficients[[1]]))

Regression Part 2

Introduction

Example Data Set

Diamond

Using `lm`

So how hard is it to calculate `m` and `b`?

Empirical Statistics

Some definitions from elementary statistics.

What is traditional Regression?

Next simplest case

Let’s try this on `diamond`

This is clearly not satisfactory,

So did we waste our time with the simple example?

Covariance and Correlation

Back to our original formula for \(\beta_1 = \frac{\sum x_i y_i}{sum x_i^2}\)

Other things you can get out of `lm`

Regression Part 2

Introduction

Example Data Set

Diamond

Using lm

So how hard is it to calculate m and b?

Empirical Statistics

Some definitions from elementary statistics.

What is traditional Regression?

Next simplest case

Let’s try this on diamond

This is clearly not satisfactory,

So did we waste our time with the simple example?

Covariance and Correlation

Back to our original formula for \(\beta_1 = \frac{\sum x_i y_i}{sum x_i^2}\)

Other things you can get out of lm

Using `lm`

So how hard is it to calculate `m` and `b`?

Let’s try this on `diamond`

Other things you can get out of `lm`