This survey paper provides references and insight into linear regression.
We begin by covering the types of problems we can solve. We typically work with real-valued data (although this can be extended): a set of predictors \(\{ X_i \}\) that map to a set of observations \(\{ y_i \}\).
Some problems of interest are
The term regression refers to regression to the mean: the idea that, given a sequence of observations (along with predictors), if a particular observation is an outlier relative to the mean, then the next element in the sequence is expected to be closer to the mean. These concepts were first made popular by Sir Francis Galton in the nineteenth century.
A concrete example: in this simple 1-D setting, we assume we have the set \(\{ X_i \}\) of measurements of the fathers and the set \(\{ y_i \}\) of measurements of the sons. (This happens to be the classical case that Galton examined in 1885.)
The lighter line is \(y = x\), which represents the break-even for a child's height: being the same height as the parent. The red line is the linear regression line, and the blue dots are the data samples. The red line is more level (less steep) than the diagonal; this visualization shows that at the lower end you are more likely to see a taller observation (shorter parents' children are more likely to be taller than their parents).
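A minimal R sketch of how a figure like this could be produced, assuming the galton data frame from the UsingR package (the package, data frame, and column names parent/child are assumptions, not something defined in this document):

library(UsingR)                         # assumed to provide Galton's parent/child height data
data(galton)
plot(galton$parent, galton$child, col = "blue", pch = 19,
     xlab = "parent height (in)", ylab = "child height (in)")
abline(0, 1, col = "gray")                                        # the lighter y = x break-even line
abline(lm(child ~ parent, data = galton), col = "red", lwd = 2)   # fitted regression line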
The linear regression model is the one that assumes a linear relationship between the dependent variable (observation) \(y_i\) and the regressor (predictor) \(x_i\):
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \text{ for each } i \]
In the language of machine learning, the result of training the model is a method to identify the best \(( \beta_0, \beta_1 )\), that is, the values for which the set of residuals \(\{ \varepsilon_i \}\) is minimized.
This generalizes to \(y_i \in \mathbb{R}^M\) and \(X_i \in \mathbb{R}^N\), where the parameters \(\beta_i\) and residuals \(\varepsilon_i\) are all in \(\mathbb{R}^M\).
A linear model is any model in which the coefficients on the predictor terms enter linearly; the predictor terms themselves can be used in non-linear ways. For example, we could use the parent's height AND the square of the parent's height (combining genuinely different predictors would usually make more sense, but this form can expose behaviour that is non-linear in the predictor):
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, \text{ for each } i \]
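As a sketch of how such a model is specified with R's lm (the simulated x and y below are placeholders, not real height data):

set.seed(1)
x <- runif(50, 60, 75)                     # e.g. parent heights in inches (simulated)
y <- 20 + 0.5 * x + 0.01 * x^2 + rnorm(50)
quad_fit <- lm(y ~ x + I(x^2))             # linear in the coefficients, quadratic in x
coef(quad_fit)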
The goal is to minimize \(\| \{ \varepsilon_i \} \|\), which involves identifying which norm (measurement) to use.
In our example with Galton's heights we have \(M = N = 1\) and \(n\) data points, using the squared Euclidean norm \(\| x \|^2 = \sum_i x_i^2\):
\[ \left\| \{ \varepsilon_i \}^n_{i=1} \right\|^2 = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n \left( y_i - (\beta_0 + \beta_1 x_i ) \right)^2 = S(\beta), \text{ where } \beta \in \mathbb{R}^2 \]
This is called Ordinary Least Squares (OLS). To solve the problem \(\min_\beta S(\beta)\), we find where the gradient is zero; since \(S\) is quadratic, the gradient is a linear function of \(\beta\), and there is an easy-to-calculate solution.
In the case of our 1-dimensional example, we have \(S : \mathbb{R}^2 \to \mathbb{R}\), which is explicitly solved as (ref)
\[ \begin{aligned} \beta_0 & = \frac{1}{n} \left( \sum_{i=1}^n y_i - \beta_1 \sum_{i=1}^n x_i \right) = \bar{y} - \beta_1 \bar{x}, \\ \beta_1 & = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{ \text{Cov}(x,y)}{\text{Var}(x)} \end{aligned} \]
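A quick numerical check of these closed-form expressions against R's lm, on simulated data (the x and y here are placeholders):

set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
beta1 <- cov(x, y) / var(x)          # slope: Cov(x, y) / Var(x)
beta0 <- mean(y) - beta1 * mean(x)   # intercept: ybar - beta1 * xbar
c(beta0, beta1)
coef(lm(y ~ x))                      # should agree with the hand-computed values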
The value \(\beta_1\) is the slope of the linear regression model (or fit); it represents the amount of change in the observation per unit change in the predictor.
Why is the square norm used?
Fun Facts
Given a vector \(X\), which constant \(\alpha\) minimizes \(\| X - \alpha \|^2\)? Answer: the mean, \(E[X] = \frac{1}{n} \sum_{i=1}^n x_i\).
Given vectors \(X\) and \(Y\), which scalar \(\alpha\) minimizes \(\| Y - \alpha X \|^2\)? Answer: the through-the-origin slope \(\frac{\sum_i x_i y_i}{\sum_i x_i^2}\).
To the mathematician, we would look at the \(L^1\) or \(L^2\) norm; to the lay person, these are the sum of absolute errors and the sum of squared errors, respectively.
Why do we use the square of the deltas when estimating linear regression?
The answer is related to how we optimize a formula involving many data points.
Start with the fact \[ \operatorname*{arg\,min}_{\alpha} \sum_{i=1}^n \left( y_i - \alpha \right)^2 = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y} \]
This is proven by taking the derivative with respect to \(\alpha\) of the left-hand side: \[ \frac{d}{d\alpha} \sum_{i=1}^n (y_i - \alpha)^2 = \sum_{i=1}^n 2 (y_i - \alpha) (-1) = 0 \] \[ \Rightarrow \sum_{i=1}^n y_i = n \alpha \]
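A numerical sanity check of this fact with base R's optimize, on arbitrary example data:

y <- c(4.2, 5.1, 3.9, 6.0, 5.5)                        # arbitrary example data
optimize(function(a) sum((y - a)^2), range(y))$minimum # numerical minimizer of the sum of squares
mean(y)                                                # matches the minimizer above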
We also have \[ \operatorname*{arg\,min}_{\alpha} \sum_{i=1}^n (y_i - \alpha x_i)^2 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \]
\[ \frac{d}{d\alpha} \sum_{i=1}^n (y_i - \alpha x_i)^2 = \sum_{i=1}^n 2 (y_i - \alpha x_i) (-x_i) = 0 \]
\[ \Rightarrow \sum_{i=1}^n x_i y_i = \alpha \sum_{i=1}^n x_i^2 \]
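And the analogous check for the through-the-origin slope, comparing the formula with lm fitted without an intercept (again on simulated placeholder data):

set.seed(3)
x <- runif(50, 1, 10)
y <- 3 * x + rnorm(50)
sum(x * y) / sum(x^2)         # closed-form slope through the origin
coef(lm(y ~ x - 1))           # lm with the intercept removed; should match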
since we’re just laying out formula to get Degrees of freedom we have this for 2 populations with different variances TODO = look this up from Wek 1 of Regression models \[ df = \frac{ \left( {S_x}^2/n_x + {S_y}^2/n_y \right)^2 }{ \left( {S_x}^2 / n_x \right)^2} \dots \]
The general rule for a hypothesis test is to reject the null hypothesis whenever the probability of seeing a sample at least as extreme as ours, assuming the hypothesis is true (the p-value), is low; equivalently, for a one-sided test of the mean, reject when \[ \frac{\bar{X} - \mu}{s/\sqrt{n}} > Z_{1-\alpha} \]
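A minimal sketch of this rejection rule in R, for a one-sided test of the mean on simulated data (mu0 and alpha are assumptions chosen for the example):

set.seed(5)
x <- rnorm(30, mean = 5.4, sd = 1)
mu0 <- 5                                   # hypothesized mean under H0
alpha <- 0.05
z <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
z > qnorm(1 - alpha)                       # TRUE means reject H0 at level alpha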
The following R code loads an example dataset and fits a linear model with no intercept:
# note I saved this locally on H:/Programming/datasets
dat <- read.table("http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt", header=FALSE)
pairs(dat)
head(dat)
## V1 V2 V3 V4 V5
## 1 -0.75052 -0.282230 0.228190 -0.084136 -0.2474800
## 2 -0.39380 -0.074787 -0.013689 0.072776 -0.3602600
## 3 -0.15599 0.358390 -0.118070 0.013815 -0.6567200
## 4 -0.68392 -0.059086 -0.060048 -0.231480 -0.0380600
## 5 -0.59474 0.148360 -0.097664 0.667820 -1.0545000
## 6 -0.53529 0.253750 -0.530250 -0.097325 -0.0024249
fit <- lm(V1 ~ . -1, data=dat)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## V2 0.9856157 0.12798121 7.701253 1.989126e-14
## V3 0.9714707 0.12663829 7.671225 2.500259e-14
## V4 0.8606368 0.11958267 7.197003 8.301184e-13
## V5 0.9266981 0.08328434 11.126919 4.778110e-28
plot(predict(fit), resid(fit), pch='.')
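Even though every coefficient in the summary is highly significant, the residual plot shows clear systematic structure rather than random scatter (the data come from Stefanski's collection of hidden-image residual plots, as the URL suggests), which is why residuals should be inspected rather than relying on coefficient p-values alone.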