This survey paper provides references and insight into linear regression.
We begin by covering the types of problems we can solve. We typically work with real-valued data (although this can be extended): a set of predictors \(\{ X_i \}\) that map to a set of observations \(\{ y_i \}\).
Some problems of interest are
The term regression refers to regression to the mean: the idea that, given a sequence of observations (along with predictors), if a particular observation is an outlier relative to the mean, then the next element in the sequence is expected to be closer to the mean. These concepts were first made popular by Sir Francis Galton in the nineteenth century.
A concrete example: in this simple 1-D setting, we assume we have the set \(\{ X_i \}\) of measurements of the fathers and the set \(\{ y_i \}\) of measurements of the sons. (This happens to be the classical case that Galton examined in 1885.)
The lighter line is \(y = x\), which represents the break-even for a child's height: being the same height as the parent. The red line is the linear regression line, and the blue dots are the data samples. The red line is more level (less steep) than the diagonal; this visualization shows that at the lower end you are more likely to see a taller observation (shorter parents' children are more likely to be taller than their parents).
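A minimal R sketch of how a figure like this could be produced, assuming the galton data frame from the UsingR package (the package, data frame, and column names parent/child are assumptions, not something defined in this document):

library(UsingR)                         # assumed to provide Galton's parent/child height data
data(galton)
plot(galton$parent, galton$child, col = "blue", pch = 19,
     xlab = "parent height (in)", ylab = "child height (in)")
abline(0, 1, col = "gray")                                        # the lighter y = x break-even line
abline(lm(child ~ parent, data = galton), col = "red", lwd = 2)   # fitted regression line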
The linear regression model is the one that assumes a linear relationship between the dependent variable (observation) \(y_i\) and the regressor (predictor) \(x_i\):
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \text{ for each } i \]
In the language of machine learning, the result of training the model is a method to identify the best \(( \beta_0, \beta_1 )\), that is, the values for which the set of residuals \(\{ \varepsilon_i \}\) is minimized.
This generalizes to \(y_i \in \mathbb{R}^M\) and \(X_i \in \mathbb{R}^N\), where the parameters \(\beta_i\) and residuals \(\varepsilon_i\) are all in \(\mathbb{R}^M\).
A linear model is any model in which the coefficients on the predictor terms enter linearly; the predictor terms themselves can be used in non-linear ways. For example, we could use the parent's height AND the square of the parent's height (combining genuinely different predictors would usually make more sense, but this form can expose behaviour that is non-linear in the predictor):
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, \text{ for each } i \]
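As a sketch of how such a model is specified with R's lm (the simulated x and y below are placeholders, not real height data):

set.seed(1)
x <- runif(50, 60, 75)                     # e.g. parent heights in inches (simulated)
y <- 20 + 0.5 * x + 0.01 * x^2 + rnorm(50)
quad_fit <- lm(y ~ x + I(x^2))             # linear in the coefficients, quadratic in x
coef(quad_fit)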
The goal is to minimize \(\| \{ \varepsilon_i \} \|\), which involves identifying which norm (measurement) to use.
In our example with Galton's heights we have \(M = N = 1\) and \(n\) data points, using the squared Euclidean norm \(\| x \|^2 = \sum_i x_i^2\):
\[ \left\| \{ \varepsilon_i \}^n_{i=1} \right\|^2 = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n \left( y_i - (\beta_0 + \beta_1 x_i ) \right)^2 = S(\beta), \text{ where } \beta \in \mathbb{R}^2 \]
This is called Ordinary Least Squares (OLS). To solve the problem \(\min_\beta S(\beta)\), we find where the gradient is zero; since \(S\) is quadratic, the gradient is a linear function of \(\beta\), and there is an easy-to-calculate solution.
In the case of our 1-dimensional example, we have \(S : \mathbb{R}^2 \to \mathbb{R}\), which is explicitly solved as (ref)
\[ \begin{aligned} \beta_0 & = \frac{1}{n} \left( \sum_{i=1}^n y_i - \beta_1 \sum_{i=1}^n x_i \right) = \bar{y} - \beta_1 \bar{x}, \\ \beta_1 & = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{ \text{Cov}(x,y)}{\text{Var}(x)} \end{aligned} \]
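A quick numerical check of these closed-form expressions against R's lm, on simulated data (the x and y here are placeholders):

set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
beta1 <- cov(x, y) / var(x)          # slope: Cov(x, y) / Var(x)
beta0 <- mean(y) - beta1 * mean(x)   # intercept: ybar - beta1 * xbar
c(beta0, beta1)
coef(lm(y ~ x))                      # should agree with the hand-computed values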
The value \(\beta_1\) is the slope of the linear regression model (or fit); it represents the amount of change in the observation per unit change in the predictor.
Why is the square norm used?
Fun Facts
Given a vector \(X\), which constant \(\alpha\) minimizes \(\| X - \alpha \|^2\)? Answer: the mean, \(E[X] = \frac{1}{n} \sum_{i=1}^n x_i\).
Given vectors \(X\) and \(Y\), which scalar \(\alpha\) minimizes \(\| Y - \alpha X \|^2\)? Answer: the through-the-origin slope \(\frac{\sum_i x_i y_i}{\sum_i x_i^2}\).
To the mathematician, we would look at the \(L^1\) or \(L^2\) norm; to the lay person, these are the sum of absolute errors and the sum of squared errors, respectively.
Why do we use the square of the deltas when estimating linear regression?
The answer is related to how we optimize a formula involving many data points.
Start with the fact \[ \operatorname*{arg\,min}_{\alpha} \sum_{i=1}^n \left( y_i - \alpha \right)^2 = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y} \]
This is proven by taking the derivative with respect to \(\alpha\) of the left-hand side: \[ \frac{d}{d\alpha} \sum_{i=1}^n (y_i - \alpha)^2 = \sum_{i=1}^n 2 (y_i - \alpha) (-1) = 0 \] \[ \Rightarrow \sum_{i=1}^n y_i = n \alpha \]
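A numerical sanity check of this fact with base R's optimize, on arbitrary example data:

y <- c(4.2, 5.1, 3.9, 6.0, 5.5)                        # arbitrary example data
optimize(function(a) sum((y - a)^2), range(y))$minimum # numerical minimizer of the sum of squares
mean(y)                                                # matches the minimizer above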
We also have \[ \operatorname*{arg\,min}_{\alpha} \sum_{i=1}^n (y_i - \alpha x_i)^2 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \]
\[ \frac{d}{d\alpha} \sum_{i=1}^n (y_i - \alpha x_i)^2 = \sum_{i=1}^n 2 (y_i - \alpha x_i) (-x_i) = 0 \]
\[ \Rightarrow \sum_{i=1}^n x_i y_i = \alpha \sum_{i=1}^n x_i^2 \]
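And the analogous check for the through-the-origin slope, comparing the formula with lm fitted without an intercept (again on simulated placeholder data):

set.seed(3)
x <- runif(50, 1, 10)
y <- 3 * x + rnorm(50)
sum(x * y) / sum(x^2)         # closed-form slope through the origin
coef(lm(y ~ x - 1))           # lm with the intercept removed; should match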
since we’re just laying out formula to get Degrees of freedom we have this for 2 populations with different variances TODO = look this up from Wek 1 of Regression models \[ df = \frac{ \left( {S_x}^2/n_x + {S_y}^2/n_y \right)^2 }{ \left( {S_x}^2 / n_x \right)^2} \dots \]
The general rule for a hypothesis test is to reject the null hypothesis whenever the probability of seeing a sample at least as extreme as ours, assuming the hypothesis is true (the p-value), is low; equivalently, for a one-sided test of the mean, reject when \[ \frac{\bar{X} - \mu}{s/\sqrt{n}} > Z_{1-\alpha} \]
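A minimal sketch of this rejection rule in R, for a one-sided test of the mean on simulated data (mu0 and alpha are assumptions chosen for the example):

set.seed(5)
x <- rnorm(30, mean = 5.4, sd = 1)
mu0 <- 5                                   # hypothesized mean under H0
alpha <- 0.05
z <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
z > qnorm(1 - alpha)                       # TRUE means reject H0 at level alpha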
The following R code loads an example dataset and fits a linear model with no intercept:
# note I saved this locally on H:/Programming/datasets
dat <- read.table("http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt", header=FALSE)
pairs(dat)
head(dat)
## V1 V2 V3 V4 V5
## 1 -0.75052 -0.282230 0.228190 -0.084136 -0.2474800
## 2 -0.39380 -0.074787 -0.013689 0.072776 -0.3602600
## 3 -0.15599 0.358390 -0.118070 0.013815 -0.6567200
## 4 -0.68392 -0.059086 -0.060048 -0.231480 -0.0380600
## 5 -0.59474 0.148360 -0.097664 0.667820 -1.0545000
## 6 -0.53529 0.253750 -0.530250 -0.097325 -0.0024249
fit <- lm(V1 ~ . -1, data=dat)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## V2 0.9856157 0.12798121 7.701253 1.989126e-14
## V3 0.9714707 0.12663829 7.671225 2.500259e-14
## V4 0.8606368 0.11958267 7.197003 8.301184e-13
## V5 0.9266981 0.08328434 11.126919 4.778110e-28
plot(predict(fit), resid(fit), pch='.')
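Even though every coefficient in the summary is highly significant, the residual plot shows clear systematic structure rather than random scatter (the data come from Stefanski's collection of hidden-image residual plots, as the URL suggests), which is why residuals should be inspected rather than relying on coefficient p-values alone.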