Celia Evans
The main purpose of tonight’s class is to finish getting up to speed with regression.
Regression is a hugely important within Data Science, Machine Learning and Statistis.
Tonight we are going to continue a more statistical and mathematical look.
Don’t be frighted, I won’t let the equations bite…too hard :)
## # A tibble: 48 x 2
## carat price
## <dbl> <dbl>
## 1 0.17 355
## 2 0.16 328
## 3 0.17 350
## 4 0.18 325
## 5 0.25 642
## 6 0.16 342
## 7 0.15 322
## 8 0.19 485
## 9 0.21 483
## 10 0.15 323
## # … with 38 more rows
lmlm is a base R function used to fit linear models.## [1] 3721.0249 -259.6259
m and b?When we talk about empirical statistics we are talking about working with samples not populations.
Studying populations is probability.
Studying a sample to conclude something about a population is statistics.
Let \(X = \{x_1,x_2,x_3,\ldots,x_n\}\) be a set consisting of \(n\) numbers and let \(Y = \{y_1,y_2,y_3,\dots,y_n\}\) be another set of \(n\) numbers.
Sigma Notation, the Greek letter \(\Sigma\) is used to denote a sume, specifically: \[\sum_{i=1}^{n}x_i = x_1 + x_2 + \ldots + x_n\]
Mean The mean is simple the arithemetic average of the numbers in a set of numbers \(X\) it is denotec by \(\bar{X}\) and defined by: \[\bar{X} = \frac{1}{n}\sum_{i=1}^{n}x_i\]
variance The variance. \(S\) of a set of numbers, \(X\) is denoted by: \[ S_X = \frac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{X})^2 \]
Last time, I said that least squares regression consists of fitting a linear model to a data set in such a way that the sum of the squared deviations from the model, i.e., the sum over all values of (data value - the coreesponding model value)\(^2\) were as small as possible.
Let’s look at the simplest case. Given \(X\) and \(Y\), as sets of numbers find the number \(\theta\) that minimizes: \[\sum_{i = 1}^{n}(y_i - \theta)^2\]
Claim: \(\theta = \bar{Y}\)
\[\sum_{i = 1}^{n}(y_i - \theta)^2 = \sum_{i = 1}^{n}(y_i - \bar{Y} + \bar{Y} - \theta)^2 = \\ \sum_{i = 1}^{n}(y_i - \bar{Y} )^2 -2(\bar{Y}- \theta)\sum_{i}^{n}(y_i - \bar{Y}) + n(\bar{Y} - \theta)^2 \]
But, \(\sum_{i}^{n}(y_i - \bar{Y}) = 0\) (exercise)
So, \[\sum_{i = 1}^{n}(y_i - \theta)^2 = \sum_{i = 1}^{n}(y_i - \bar{Y})^2 + n(\bar{Y} - \theta)^2\]
Therefore: \(\sum_{i = 1}^{n}(y_i - \bar{Y})^2 \le \sum_{i = 1}^{n}(y_i - \theta)^2\)
What have we just proven?
This time, let’s use a little calculus, Let \(f(\beta) = \sum_{i = 1}^{n}(y_i - \beta x_i)^2\). From first semester calculus we know that if \(f\) has a minimum at a point \(\beta_1\) then \(f^\prime(\beta) = 0\) A quick look at the form of \(f\) revelas it to be a 2-degree polynomil, so it’s graph as a function of \(\beta\) is a parabola that opens up. So we know that any critical point will be a minimum. (why?)
\[f^\prime(\beta) = 2\sum_{i=1}^{n}(y_i - \beta x_i)x_i\\ = 2\left[\sum_{i=1}^{n}(y_ix_i) - \beta\sum_{i=1}^{n}x_i^2\right] = 0 \]
Solving for \(\beta\) we get:
\[\beta = \frac{\sum_{i=1}^{n}y_ix_i}{\sum_{i=1}^{n}x_i^2}\]
diamond## [1] 2538.933
We shoud have taken the intercept into account in our model.
In other words maybe we should have minimized \[\sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1x_i)^2)\]
Before we go through a lot of math, let’s see what we get from lm
## [1] -259.6259 3721.0249
as_tibble(x,y) %>% ggplot(aes(x = x, y = y)) +
geom_point(alpha = 1/3, size = 6, color = "black") +
geom_point(alpha = 1/3, size = 5, color = "red") +
geom_abline(aes(slope = beta1, intercept = beta0)) +
ggtitle("approximation with both intercepts")No.
Let’s look at what happens when we center the data by, i.e. \(YC = Y - \hat{Y}\) and \(XC = X - \hat{X}\)
##
## Call:
## lm(formula = yc ~ xc)
##
## Coefficients:
## (Intercept) xc
## 5.145e-14 3.721e+03
## [1] 3721.025
So now we get the same answeer lm got! why?
It’s not so surprising. From our model we get \[\bar{Y} = \beta_0 + \beta_1\bar{X}\\ \beta_0 = \bar{Y} - \beta_1\bar{X}\]
Which says if \(\bar{Y} = \bar{X} = 0\) then \(\beta_0 = 0\) }
Covariance The covariance of \(X\) and \(Y\) is defined by \[cov(X,Y) = \frac{\sum_{i = 1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{n-1}\]
Recall the variance of \(X\) is defined by \[var(X) = S_X^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}\]
Also recall the standard deviation of \(X\) is the square root of the variance so, \[S_X = \sqrt{S_X^2}= \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}}\]
The correlation of \(X\) and \(Y\) is defined by \[ r(X,Y) = \frac{cov(X,Y)}{S_XS_Y}\]
So what does this have to do with the price of laptops in New Zealand?
If x and y are centered, i.e., \(\bar{X} = \bar{Y} = 0\) then we can interpret this equation as the \[\beta = \frac{cov(X,Y)}{S^2_X}\]
But by definition of the correlation this gives us: \[ \beta = r(X,Y)\frac{S_Y}{S_X}\]
We previously noticed that translating \(X\) and \(Y\) had no effect on the estimated slope of the regression line, so we can use that fact to and the fact that the intercept, \(\beta_0 = \bar{Y} - \beta \bar{X}\)
One other point since covariance and variance already translate by the means, the translation has no effect on those values.
beta1 = cov(x,y)/sd(x)^2
beta0 = mean(y) - beta1*mean(x)
rbind(c(beta0, beta1),c(coef(fit)[[1]],coef(fit)[[2]]))## [,1] [,2]
## [1,] -259.6259 3721.025
## [2,] -259.6259 3721.025
lm##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.159 -21.448 -0.869 18.972 79.370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -259.63 17.32 -14.99 <2e-16 ***
## x 3721.02 81.79 45.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared: 0.9783, Adjusted R-squared: 0.9778
## F-statistic: 2070 on 1 and 46 DF, p-value: < 2.2e-16
## (Intercept) x
## -259.6259 3721.0249
## 1 2 3 4 5 6
## -17.9483176 -7.7380691 -22.9483176 -85.1585661 -28.6303057 6.2619309
## 7 8 9 10 11 12
## 23.4721795 37.6311854 -38.7893116 24.4721795 51.8414339 40.7389488
## 13 14 15 16 17 18
## 0.2619309 13.4209369 -1.2098087 40.5287002 36.1029250 -44.8405542
## 19 20 21 22 23 24
## 79.3696943 -25.0508027 57.8414339 9.2619309 -20.9483176 -3.7380691
## 25 26 27 28 29 30
## -19.9483176 27.8414339 -54.9483176 8.8414339 -26.9483176 16.4721795
## 31 32 33 34 35 36
## -22.9483176 -13.1020453 -12.1020453 -0.5278205 3.2619309 2.2619309
## 37 38 39 40 41 42
## -1.2098087 -43.2098087 -27.9483176 -23.3122938 -15.6303057 43.2672091
## 43 44 45 46 47 48
## 32.8414339 7.3696943 4.3696943 -11.5278205 -14.8405542 17.4721795
## 1 2 3 4 5 6 7 8
## 372.9483 335.7381 372.9483 410.1586 670.6303 335.7381 298.5278 447.3688
## 9 10 11 12 13 14 15 16
## 521.7893 298.5278 410.1586 782.2611 335.7381 484.5791 596.2098 819.4713
## 17 18 19 20 21 22 23 24
## 186.8971 707.8406 670.6303 745.0508 410.1586 335.7381 372.9483 335.7381
## 25 26 27 28 29 30 31 32
## 372.9483 410.1586 372.9483 410.1586 372.9483 298.5278 372.9483 931.1020
## 33 34 35 36 37 38 39 40
## 931.1020 298.5278 335.7381 335.7381 596.2098 596.2098 372.9483 968.3123
## 41 42 43 44 45 46 47 48
## 670.6303 1042.7328 410.1586 670.6303 670.6303 298.5278 707.8406 298.5278
ggplot() +
geom_point(aes(x = x, y = y)) +
geom_point(aes(x = x, y = predict(fit)),color = "red", shape = 'o',size = 4) +
geom_abline(aes(slope = fit$coefficients[[2]], intercept = fit$coefficients[[1]]))