class: center, middle

# Least Squares Algebra
## Econometría
#### Dr. Francisco J. Cabrera-Hernández
#### Maestría en Economía, Spring 2025
##### CIDE Santa Fe, Ciudad de México.

---

## Introduction

We discussed the linear projection model. We now give a quick recap of the Least Squares estimator.

We first focus on the algebra. Later we will say more about its distribution, asymptotics, and inference.

---

## Samples

We are now interested in estimating the parameters of the linear projection model:

`$$Y = X' \beta + e$$`

Explicitly, the projection coefficient:

`$$\beta = (E[XX'])^{-1}E[XY]$$`

We can estimate `\(\beta\)` from a sample of joint measurements of `\((Y,X)\)`.

We use observations `\((Y_i,X_i)\)` for `\(i=1,\dots,n\)` of the random variables `\((Y,X)\)`. These are *realizations of random variables*.

---

## Samples

For statistical analysis, the dataset is a realization of a random process. For empirical analysis, the dataset is fixed, as *it is presented* to us.

The individual observations are assumed to be draws from a common (homogeneous) distribution, hence **identically distributed**.

These are realizations from a common underlying **population** `\(F\)`, which is theoretical and infinitely large.

`\(F\)` is also known as the Data Generating Process (DGP), a term common in data science.

---

## Identification

A parameter is **identified** if it is uniquely determined by the distribution of the observed variables.

`\(F\)` denotes the distribution of the observed data, for example of a pair `\((Y,X)\)`.

Let `\(\mathcal{F}\)` be a collection of distributions `\(F\)`. A parameter `\(\theta\in \mathbb{R}\)` is identified on `\(\mathcal{F}\)` if for every `\(F \in \mathcal{F}\)` there is a uniquely determined value of `\(\theta\)`.

What if we observe `\(F^*\)` instead of `\(F\)`? For example, under *censoring* when we ask for wages in a survey.

If we are interested in `\(\mu = E[Y]\)`, we cannot compute `\(\mu\)` from `\(F^*\)`, so `\(\mu\)` is not identified.

---

## Linear Projection Model

It applies to random variables `\((Y,X)\)`:

`$$Y = X' \beta + e$$`

where

`$$\beta = \underset{b \in \mathbb{R}^k}{\mathrm{argmin}} \ S(b)$$`

is the minimizer of the expected squared error:

`$$S(\beta) = E [(Y - X'\beta)^{2}]$$`

with explicit solution:

`$$\beta = (E[XX'])^{-1}E[XY]$$`

---

## Least Squares Estimator

For a given `\(\beta\)`, the expected squared error is the expectation of `\((Y-X'\beta)^{2}\)`.

The moment estimator of `\(S(\beta)\)` is the sample average:

`$$\hat{S}(\beta) = {1 \over n} \sum_{i = 1}^{n} (Y_i-X'_i \beta)^{2} = {1 \over n}SSE(\beta)$$`

where

`$$SSE(\beta) = \sum_{i = 1}^{n}(Y_i-X'_i \beta)^{2}$$`

---

## Least Squares Estimator

`\(\hat{S}(\beta)\)` is an estimator of the expected squared error `\(S(\beta)\)`.

The projection coefficient minimizes `\(S(\beta)\)`; by analogy, `\(\hat{\beta}\)` minimizes `\(\hat{S}(\beta)\)`.

As `\(\hat{S}(\beta)\)` is a scalar multiple of `\(SSE(\beta)\)`, we can minimize the latter instead. Hence the name "Least Squares (LS) estimator".

Note that **`\(\beta\)` is fixed, while `\(\hat{\beta}\)` varies across samples** and depends on the sample size.

*How does this relate to identification?*

---

## Solving for one regressor

Given any `\(\beta\)` we calculate the "error" `\(Y_i - X_i\beta\)`.

`$$SSE(\beta) = \sum_{i = 1}^{n} (Y_i-X_i \beta)^{2} = \left(\sum_{i = 1}^{n}Y_i^{2}\right)-2\beta\left(\sum_{i = 1}^{n}X_iY_i\right)+\beta^{2}\left(\sum_{i = 1}^{n}X_i^{2}\right)$$`

The OLS estimator `\(\hat\beta\)` minimizes this expression.

The minimizer of `\(a -2b\beta+c\beta^2\)` is `\(\beta=b/c\)`, hence:

`$$\hat{\beta}= {\sum_{i = 1}^{n}X_iY_i \over \sum_{i = 1}^{n}X_i^{2}}$$`

`\(\hat{\beta}\)` only exists if the denominator is non-zero.
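
---

## One regressor: a quick check

A minimal sketch with simulated data (all object names are illustrative, not from the course files): it computes `\(\hat\beta = \sum_{i}X_iY_i / \sum_{i}X_i^{2}\)` directly and compares it with `lm()` with the intercept suppressed.

``` r
# Simulated data: one regressor, no intercept (illustrative sketch)
set.seed(123)
n <- 100
x <- rnorm(n, mean = 2, sd = 1)
y <- 3 * x + rnorm(n)              # true slope is 3

# Closed-form estimator: sum(x*y) / sum(x^2)
beta_hat <- sum(x * y) / sum(x^2)

# Same estimate from lm() with the intercept removed
fit <- lm(y ~ 0 + x)

c(formula = beta_hat, lm = unname(coef(fit)))
```

Both numbers coincide up to floating-point error.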

---

## Graphically Solving for one regressor

Plotting over the range `\(\beta \in [2,4]\)`:

`$$SSE(\beta)=\left(\sum_{i = 1}^{n}Y_i^{2}\right)-2\beta\left(\sum_{i = 1}^{n}X_iY_i\right)+\beta^{2}\left(\sum_{i = 1}^{n}X_i^{2}\right)$$`

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#minbeta.png" alt=" " width="45%" />
<p class="caption"> </p>
</div>

---

## The intercept-only model

In the special case `\(X_i = 1\)` we find:

`$$\hat{\beta}= {\sum_{i = 1}^{n}1 \cdot Y_i \over \sum_{i = 1}^{n} 1^{2}} = {1 \over n} \sum_{i = 1}^{n} Y_i = \bar{Y}$$`

---

## LS with multiple regressors

We now consider `\(\beta\)` as a vector of coefficients, `\(\beta \in \mathbb{R}^k\)`.

Hence, for `\(k=2\)`, `\(x'\beta= x_1\beta_1 + x_2\beta_2\)` is a two-dimensional surface:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#multiplane.png" alt=" " width="40%" />
<p class="caption"> </p>
</div>

For any `\(\beta\)`, the error `\(Y_i - X'_i\beta\)` is the vertical distance between `\(Y_i\)` and `\(X'_i\beta\)`.

---

## LS with multiple regressors

The sum of squared errors is:

`$$SSE(\beta) = \left(\sum_{i = 1}^{n}Y_i^{2}\right)-2\beta'\left(\sum_{i = 1}^{n}X_iY_i\right)+\beta'\left(\sum_{i = 1}^{n}X_iX'_i\right)\beta$$`

The difference is that **this is a quadratic function of the vector `\(\beta\)`.**

The LS estimator minimizes `\(SSE(\beta)\)`. The first-order condition, evaluated at `\(\hat\beta\)`, is:

`$$0 = \frac{\partial}{\partial \beta} SSE( \hat{\beta})= -2 \sum_{i = 1}^{n} X_iY_i + 2\sum_{i = 1}^{n}X_i X'_i\hat{\beta}$$`

This is a single expression, but it is a system of `\(k\)` equations with `\(k\)` unknowns (the elements of `\(\hat\beta\)`).

---

## LS with multiple regressors

We can find `\(\hat\beta\)` by solving the system of equations, or write it compactly in matrix form:

`$$\left(\sum_{i = 1}^{n}X_i X'_i\right)\hat{\beta} = \left(\sum_{i = 1}^{n} X_iY_i\right)$$`

where `\(X_iX'_i\)` is `\(k\times k\)` and `\(\hat\beta\)` is `\(k \times 1\)`.

`$$\hat{\beta} = \left(\sum_{i = 1}^{n}X_i X'_i\right)^{-1} \left(\sum_{i = 1}^{n} X_iY_i\right)$$`

This is the estimator of the best linear projection coefficient `\(\beta\)`.

---

## LS with multiple regressors

The second-order condition:

`$${\partial^2 \over \partial \beta \partial \beta'} SSE(\beta) = 2 \sum_{i=1}^n X_iX'_i > 0$$`

Given `\(\sum_{i = 1}^{n}X_iX'_i>0\)` (a positive definite matrix), the LS estimator is the unique minimizer of `\(SSE(\beta)\)`.

Hence `\(\beta\)` is the best linear projection coefficient, and LS is the **best linear projection estimator** of it.

---

## Graphically

`\(\hat{\beta}\)` is the pair `\((\hat{\beta}_1,\hat{\beta}_2)\)` that minimizes the `\(SSE(\beta)\)` function:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#minbetas.png" alt=" " width="45%" />
<p class="caption"> </p>
</div>

If `\(k=1\)`, `\(X_i\)` is a scalar, so `\(X_iX_i'= X_i^2\)` and this simplifies to the one-regressor version derived before.
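
---

## Checking the minimizer numerically

A minimal sketch with simulated data (all object names are illustrative): it minimizes `\(SSE(\beta)\)` numerically with `optim()` and compares the result with the closed-form solution `\(\left(\sum_i X_iX_i'\right)^{-1}\sum_i X_iY_i\)`.

``` r
# Illustrative sketch: numerical vs. closed-form minimization of SSE(beta)
set.seed(456)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))   # two regressors, no constant
y <- X %*% c(2, -0.5) + rnorm(n)

# Sum of squared errors as a function of the coefficient vector
sse <- function(b) sum((y - X %*% b)^2)

# Numerical minimizer of SSE(beta)
num <- optim(par = c(0, 0), fn = sse)$par

# Closed-form least squares solution
cls <- solve(t(X) %*% X, t(X) %*% y)

cbind(numerical = num, closed_form = as.vector(cls))
```

Both columns should agree up to the optimizer's tolerance.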

---

## LS with multiple regressors (moment estimator)

`\(\beta\)` is an explicit function of the population moments `\(Q_{XY}=E[XY]\)` and `\(Q_{XX}=E[XX']\)`. Their moment estimators are:

`$$\hat{Q}_{XY} = {1 \over n} \sum_{i = 1}^{n}X_i Y_i$$`

`$$\hat{Q}_{XX} = {1 \over n} \sum_{i = 1}^{n}X_i X'_i$$`

Thus the moment estimator of `\(\beta\)` uses sample moments:

`$$\hat{\beta}=\hat{Q}_{XX}^{-1} \hat{Q}_{XY}$$`

---

## LS with multiple regressors (moment estimator)

Where:

`$$\hat{\beta}=\hat{Q}_{XX}^{-1} \hat{Q}_{XY}$$`

`$$= \left( {1 \over n} \sum_{i = 1}^{n} X_iX_i'\right)^{-1} \left( {1 \over n} \sum_{i = 1}^{n}X_iY_i \right)$$`

`$$= \left(\sum_{i = 1}^{n} X_iX_i'\right)^{-1} \left(\sum_{i = 1}^{n}X_iY_i \right)$$`

which is identical to the LS solution.

---

## Application

See BH p. 69:

``` r
#Load the data and create subsamples
dat <- read.table("...cps09mar.txt")
experience <- dat[,1]-dat[,4]-6
mbf <- (dat[,11]==2)&(dat[,12]<=2)&(dat[,2]==1)&(experience==12)
sam <- (dat[,11]==4)&(dat[,12]==7)&(dat[,2]==0)
dat1 <- dat[mbf,]

#Matrix Regression
y <- as.matrix(log(dat1[,5]/(dat1[,6]*dat1[,7]))) #this is log(wage)
x <- cbind(dat1[,4],matrix(1,nrow(dat1),1)) #education and a constant
xx <- t(x)%*%x
xy <- t(x)%*%y
beta <- solve(xx,xy)
print(beta)

#LS command
model <- lm(y ~ x)
summary(model)
```

---

## Least Squares Residuals

We define the *fitted value* `\(\hat{Y}_i = X'_i\hat{\beta}\)` and the *residual*:

`$$\hat{e}_i = Y_i - \hat{Y}_i = Y_i - X'_i\hat{\beta}$$`

The fitted value `\(\hat{Y}_i\)` is a function of the entire sample, including `\(Y_i\)`, so it **is not a prediction of `\(Y_i\)`.**

`$$Y_i = \hat Y_i + \hat e_i$$`

And:

`$$Y_i = X_i'\hat\beta + \hat e_i$$`

Note that the **error `\(e_i\)` is unobservable, while the residual `\(\hat{e}_i\)` is its estimator.**

---

## Least Squares Residuals

In sample, the regressors are orthogonal to the residuals:

`$$\sum_{i = 1}^{n}X_i\hat{e}_i = \sum_{i = 1}^{n} X_i (Y_i - X'_i\hat{\beta})$$`

`$$= \sum_{i = 1}^{n}X_iY_i -\sum_{i = 1}^{n}X_iX'_i\hat{\beta}$$`

`$$= \sum_{i = 1}^{n}X_iY_i -\sum_{i = 1}^{n}X_iX'_i \left(\sum_{i = 1}^{n}X_i X'_i \right)^{-1} \left(\sum_{i = 1}^{n} X_iY_i \right)$$`

`$$= \sum_{i = 1}^{n}X_iY_i - \left(\sum_{i = 1}^{n} X_iY_i \right) = 0$$`

---

## Least Squares Residuals

Dividing by `\(n\)`:

`$${1 \over n} \sum_{i = 1}^{n} X_i \hat e_i = 0$$`

When `\(X_i\)` contains a constant, this implies:

`$${1 \over n} \sum_{i = 1}^{n} \hat{e}_i = 0$$`

All of this also implies that the sample correlation between the regressors and the residuals is zero.

**These are algebraic results and hold true for all linear regression estimates.**

---

## Demeaned regressors

Another way to write the linear projection model:

`$$Y_i = X_i' \beta + \alpha + e_i$$`

where `\(\alpha\)` is the intercept and `\(X_i\)` does not include a constant. Hence:

`$$\sum_{i = 1}^{n}\hat{e}_i = \sum_{i = 1}^{n} (Y_i - X'_i\hat{\beta} - \hat\alpha)=0 \ (1)$$`

And:

`$$\sum_{i = 1}^{n}X_i\hat{e}_i = \sum_{i = 1}^{n} X_i (Y_i - X'_i\hat{\beta} - \hat\alpha)=0 \ (2)$$`

---

## Demeaned regressors

From (1):

`$$\hat \alpha = \bar Y - \bar X'\hat \beta$$`

Substituting into (2):

`$$\sum_{i=1}^n X_i ((Y_i - \bar{Y}) - (X_i - \bar{X})'\hat\beta)=0$$`

`$$\hat \beta = \left( \sum_{i=1}^n X_i (X_i - \bar X)' \right)^{-1} \left( \sum_{i=1}^n X_i (Y_i-\bar Y) \right)$$`

`$$\hat \beta = \left( \sum_{i=1}^n (X_i-\bar X) (X_i - \bar X)' \right)^{-1} \left( \sum_{i=1}^n (X_i- \bar X) (Y_i-\bar Y) \right)$$`

---

## Demeaned regressors

`$$\hat \beta = \left( \sum_{i=1}^n (X_i-\bar X) (X_i - \bar X)' \right)^{-1} \left( \sum_{i=1}^n (X_i- \bar X) (Y_i-\bar Y) \right)$$`

This is the classic *demeaned formula* for the LS estimator `\(\hat\beta\)`: the OLS slope estimator equals OLS applied to demeaned data with no intercept.

[Some code here!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/4_LS_demeaned_nointercept_graph.R)
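
---

## Checking the residual algebra

A minimal sketch with simulated data (all object names are illustrative): it verifies that the LS residuals are orthogonal to the regressors, that they sum to zero when a constant is included, and that the demeaned formula reproduces the slope from the regression with an intercept.

``` r
# Illustrative sketch: residual orthogonality and the demeaned formula
set.seed(789)
n <- 150
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit   <- lm(y ~ x)
e_hat <- resid(fit)

# Algebraic properties of LS residuals
sum(e_hat)       # ~ 0 (constant included)
sum(x * e_hat)   # ~ 0 (orthogonality with the regressor)

# Demeaned formula: slope from demeaned data, no intercept
b_demeaned <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
c(lm_slope = unname(coef(fit)["x"]), demeaned = b_demeaned)
```

Both slope estimates are numerically identical, as the algebra above implies.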

---

## Matrix Notation

The `\(n\)` equations:

`$$Y_1 = X'_1\beta + e_1$$`

`$$Y_2 = X'_2\beta + e_2$$`

`$$\vdots$$`

`$$Y_n = X'_n\beta + e_n$$`

Define:

`$$Y= \left( \begin{array}{c} Y_1\\ Y_2\\ \vdots\\ Y_n \end{array} \right) X= \left( \begin{array}{c} X'_1\\ X'_2\\ \vdots\\ X'_n \end{array} \right) e= \left( \begin{array}{c} e_1\\ e_2\\ \vdots\\ e_n \end{array} \right)$$`

These are `\(n \times 1\)`, `\(n \times k\)`, and `\(n \times 1\)`, respectively.

`$$Y= X \beta + e$$`

---

## Matrix Form

Note that `\(\sum_{i = 1}^{n}X_iX'_i = X'X\)` and `\(\sum_{i = 1}^{n}X_iY_i = X'Y\)`, so:

`$$\hat{\beta} = (X'X)^{-1}(X'Y)$$`

where:

`$$Y=X\hat{\beta}+\hat{e}$$`

`$$\hat{e} = Y - X \hat{\beta}$$`

`$$X'\hat{e}=0$$`

`$$SSE(\beta)= (Y-X\beta)'(Y-X\beta)$$`

[Some more code!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/5_LS_matrix__OLS_estimation.R)

---

## Projection Matrix

A projection matrix is a square matrix that, when applied to a vector, projects the vector onto a subspace.

Define the matrix:

`$$P= X(X'X)^{-1}X'$$`

A property of the projection matrix is that it is **idempotent**: `\(PP = P\)`. Moreover:

`$$PX= X(X'X)^{-1}X'X=X$$`

---

## Projection Matrix

More generally, for any `\(Z\)` that can be written as `\(Z = XT\)` for some `\(T\)` (we say `\(Z\)` lies in the range space of `\(X\)`):

`$$PZ = PXT = X(X'X)^{-1} X' XT = XT = Z$$`

The `\(P\)` matrix creates the fitted values in a least squares regression:

`$$PY = X\color{green}{(X'X)^{-1}X'Y} = X\color{green}{\hat{\beta}} = \hat{Y}$$`

**Because of this property, `\(P\)` is also known as the hat matrix.**

[Some more code?](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/5_LS_projection_toY.R)

---

## Annihilator Matrix

Define `\(M = I_n - P = I_n - X(X'X)^{-1} X'\)`, where `\(I_n\)` is the `\(n \times n\)` identity matrix. Hence:

`$$MX = (I_n - P)X = X - PX = X - X = 0$$`

Thus `\(M\)` and `\(X\)` are orthogonal. `\(M\)` is also symmetric and idempotent.

`\(MX_1 = 0\)` for any subcomponent `\(X_1\)` of `\(X\)`, and `\(MP=0\)`.

`\(M\)` applied to `\(Y\)` **creates the LS residuals:**

`$$MY = Y - PY = Y - X\hat{\beta} = \hat{e}$$`

*Now see BH p. 75 and explain (3.24).*

---

## Projection

We can visualize LS fitting as a projection operation.

Write the matrix `\(X = [X_1, X_2, \dots, X_k]\)` in terms of its columns.

The range space `\(\mathcal{R}(X)\)` of `\(X\)` is the space consisting of all linear combinations of the columns `\(X_1,X_2,\dots,X_k\)`.

Hence `\(\mathcal{R}(X)\)` is a `\(k\)`-dimensional surface contained in `\(\mathbb{R}^n\)`. If `\(k=2\)`, `\(\mathcal{R}(X)\)` is a plane.

The operator `\(P= X(X'X)^{-1}X'\)` projects vectors onto `\(\mathcal{R}(X)\)`.

`\(\hat{Y}=PY\)` is the projection of `\(Y\)` onto `\(\mathcal{R}(X)\)`.

Let's display this for `\(n=3\)` and `\(k=2\)`, with vectors `\(Y, X_1, X_2\)` ...

---

## Visualization

The plane spanned by `\(X_1\)` and `\(X_2\)` is the range space `\(\mathcal{R}(X)\)`.

`\(\hat{Y}\)` is a linear combination of `\(X_1\)` and `\(X_2\)` and lies in that plane.

`\(\hat{Y}\)` is the closest point to `\(Y\)` on this plane. `\(\hat{Y}\)` and `\(\hat{e}\)` are orthogonal.

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#projection.png" alt=" " width="50%" />
<p class="caption"> </p>
</div>

---

## Error Variance

`\(\sigma^{2} = E[e^2]\)` is a moment. An estimator of `\(\sigma^2\)`:

`$$\tilde{\sigma}^2 = {1 \over n} \sum_{i = 1}^{n} e_i^2$$`

But `\(e_i\)` is unobserved. Hence we compute the residuals `\(\hat{e}_i\)` first and plug them in:

`$$\hat{\sigma}^2 = {1 \over n} \sum_{i = 1}^{n} \hat{e}_i^2$$`

In matrix notation:

`$$\hat{\sigma}^2 = n^{-1}\hat{e}'\hat{e}$$`

---

## Decomposition of Variance

`$$\sum_{i=1}^n(Y_i - \bar{Y})^2 = \sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^n \hat{e}_i^2$$`

`$$TSS = ESS + RSS$$`

`$$R^2 = {{\sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2} \over {\sum_{i=1}^n({Y_i} - \bar{Y})^2}} = 1-{{\sum_{i=1}^n\hat{e}_i^2} \over {\sum_{i=1}^n({Y_i} - \bar{Y})^2}}$$`

Hence, `\(R^2\)` relates the *explained* variation of `\(Y\)` to the total variation of `\(Y\)` around its mean.

In other words, it measures how much of the variation in `\(Y\)` is explained by `\(X\)`.
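
---

## Checking the projection and variance decomposition

A minimal sketch with simulated data (all object names are illustrative): it builds `\(P\)` and `\(M\)`, checks that `\(PY\)` and `\(MY\)` give the fitted values and residuals, computes `\(\hat\sigma^2\)`, and verifies that `\(TSS = ESS + RSS\)` and that `\(R^2\)` matches the value reported by `lm()`.

``` r
# Illustrative sketch: projection matrices and the variance decomposition
set.seed(2025)
n <- 80
X <- cbind(1, rnorm(n), rnorm(n))       # constant plus two regressors
y <- as.numeric(X %*% c(1, 2, -0.5)) + rnorm(n)

P <- X %*% solve(t(X) %*% X) %*% t(X)   # projection ("hat") matrix
M <- diag(n) - P                        # annihilator matrix

max(abs(P %*% P - P))                   # ~ 0: P is idempotent
max(abs(M %*% X))                       # ~ 0: M annihilates X

y_hat <- as.numeric(P %*% y)            # fitted values
e_hat <- as.numeric(M %*% y)            # residuals

sigma2_hat <- sum(e_hat^2) / n          # error variance estimator

TSS <- sum((y - mean(y))^2)
ESS <- sum((y_hat - mean(y))^2)
RSS <- sum(e_hat^2)
c(TSS = TSS, ESS_plus_RSS = ESS + RSS)  # should coincide

R2 <- 1 - RSS / TSS
c(R2 = R2, lm_R2 = summary(lm(y ~ X[, -1]))$r.squared)
```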

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h2>The End</h2>
</div>