class: center, middle

# Least Squares Algebra
## Econometría
#### Dr. Francisco J. Cabrera-Hernández
#### Maestría en Economía, Spring 2025
##### CIDE Santa Fe, Ciudad de México.

---

## Introduction

We discussed the linear projection model. We now give a quick recap of the Least Squares estimator.

We first focus on the algebra. Later we will say more about its distribution, asymptotics, and inference.

---

## Samples

We are now interested in estimating the parameters of the linear projection model:

`$$Y = X' \beta + e$$`

Explicitly, the projection coefficient:

`$$\beta = (E[XX'])^{-1}E[XY]$$`

We can estimate `\(\beta\)` from a sample of joint measurements of `\((Y,X)\)`.

We use observations `\((Y_i,X_i)\)` for `\(i=1,\dots,n\)` of the random variables `\((Y,X)\)`. These are *realizations of random variables*.

---

## Samples

For statistical analysis, the dataset is a realization of a random process. For empirical analysis, the dataset is fixed, as *it is presented* to us.

The individual observations are assumed to be draws from a common (homogeneous) distribution, hence **identically distributed**.

These are realizations from a common underlying **population** `\(F\)`, which is theoretical and infinitely large.

`\(F\)` is also known as the Data Generating Process (DGP), a term common in data science.

---

## Identification

A parameter is **identified** if it is uniquely determined by the distribution of the observed variables.

`\(F\)` denotes the distribution of the observed data, for example of a pair `\((Y,X)\)`.

Let `\(\mathcal{F}\)` be a collection of distributions `\(F\)`. A parameter `\(\theta\in \mathbb{R}\)` is identified on `\(\mathcal{F}\)` if for every `\(F \in \mathcal{F}\)` there is a uniquely determined value of `\(\theta\)`.

What if we observe `\(F^*\)` instead of `\(F\)`? For example, under *censoring* when we ask for wages in a survey.

If we are interested in `\(\mu = E[Y]\)`, we cannot compute `\(\mu\)` from `\(F^*\)`, so `\(\mu\)` is not identified.

---

## Linear Projection Model

It applies to random variables `\((Y,X)\)`:

`$$Y = X' \beta + e$$`

where

`$$\beta = \underset{b \in \mathbb{R}^k}{\mathrm{argmin}} \ S(b)$$`

is the minimizer of the expected squared error:

`$$S(\beta) = E [(Y - X'\beta)^{2}]$$`

with explicit solution:

`$$\beta = (E[XX'])^{-1}E[XY]$$`

---

## Least Squares Estimator

For a given `\(\beta\)`, the expected squared error is the expectation of `\((Y-X'\beta)^{2}\)`.

The moment estimator of `\(S(\beta)\)` is the sample average:

`$$\hat{S}(\beta) = {1 \over n} \sum_{i = 1}^{n} (Y_i-X'_i \beta)^{2} = {1 \over n}SSE(\beta)$$`

where

`$$SSE(\beta) = \sum_{i = 1}^{n}(Y_i-X'_i \beta)^{2}$$`

---

## Least Squares Estimator

`\(\hat{S}(\beta)\)` is an estimator of the expected squared error `\(S(\beta)\)`.

The projection coefficient minimizes `\(S(\beta)\)`; by analogy, `\(\hat{\beta}\)` minimizes `\(\hat{S}(\beta)\)`.

As `\(\hat{S}(\beta)\)` is a scalar multiple of `\(SSE(\beta)\)`, we can minimize the latter instead. Hence the name "Least Squares (LS) estimator".

Note that **`\(\beta\)` is fixed, while `\(\hat{\beta}\)` varies across samples** and depends on the sample size.

*How does this relate to identification?*

---

## Solving for one regressor

Given any `\(\beta\)` we calculate the "error" `\(Y_i - X_i\beta\)`.

`$$SSE(\beta) = \sum_{i = 1}^{n} (Y_i-X_i \beta)^{2} = \left(\sum_{i = 1}^{n}Y_i^{2}\right)-2\beta\left(\sum_{i = 1}^{n}X_iY_i\right)+\beta^{2}\left(\sum_{i = 1}^{n}X_i^{2}\right)$$`

The OLS estimator `\(\hat\beta\)` minimizes this expression.

The minimizer of `\(a -2b\beta+c\beta^2\)` is `\(\beta=b/c\)`, hence:

`$$\hat{\beta}= {\sum_{i = 1}^{n}X_iY_i \over \sum_{i = 1}^{n}X_i^{2}}$$`

`\(\hat{\beta}\)` only exists if the denominator is non-zero.
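
---

## One regressor: a quick check

A minimal sketch with simulated data (all object names are illustrative, not from the course files): it computes `\(\hat\beta = \sum_{i}X_iY_i / \sum_{i}X_i^{2}\)` directly and compares it with `lm()` with the intercept suppressed.

``` r
# Simulated data: one regressor, no intercept (illustrative sketch)
set.seed(123)
n <- 100
x <- rnorm(n, mean = 2, sd = 1)
y <- 3 * x + rnorm(n)              # true slope is 3

# Closed-form estimator: sum(x*y) / sum(x^2)
beta_hat <- sum(x * y) / sum(x^2)

# Same estimate from lm() with the intercept removed
fit <- lm(y ~ 0 + x)

c(formula = beta_hat, lm = unname(coef(fit)))
```

Both numbers coincide up to floating-point error.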

---

## Graphically Solving for one regressor

Plotting over the range `\(\beta \in [2,4]\)`:

`$$SSE(\beta)=\left(\sum_{i = 1}^{n}Y_i^{2}\right)-2\beta\left(\sum_{i = 1}^{n}X_iY_i\right)+\beta^{2}\left(\sum_{i = 1}^{n}X_i^{2}\right)$$`

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#minbeta.png" alt=" " width="45%" />
<p class="caption"> </p>
</div>

---

## The intercept-only model

In the special case `\(X_i = 1\)` we find:

`$$\hat{\beta}= {\sum_{i = 1}^{n}1 \cdot Y_i \over \sum_{i = 1}^{n} 1^{2}} = {1 \over n} \sum_{i = 1}^{n} Y_i = \bar{Y}$$`

---

## LS with multiple regressors

We now consider `\(\beta\)` as a vector of coefficients, `\(\beta \in \mathbb{R}^k\)`.

Hence, for `\(k=2\)`, `\(x'\beta= x_1\beta_1 + x_2\beta_2\)` is a two-dimensional surface:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#multiplane.png" alt=" " width="40%" />
<p class="caption"> </p>
</div>

For any `\(\beta\)`, the error `\(Y_i - X'_i\beta\)` is the vertical distance between `\(Y_i\)` and `\(X'_i\beta\)`.

---

## LS with multiple regressors

The sum of squared errors is:

`$$SSE(\beta) = \left(\sum_{i = 1}^{n}Y_i^{2}\right)-2\beta'\left(\sum_{i = 1}^{n}X_iY_i\right)+\beta'\left(\sum_{i = 1}^{n}X_iX'_i\right)\beta$$`

The difference is that **this is a quadratic function of the vector `\(\beta\)`.**

The LS estimator minimizes `\(SSE(\beta)\)`. The first-order condition, evaluated at `\(\hat\beta\)`, is:

`$$0 = \frac{\partial}{\partial \beta} SSE( \hat{\beta})= -2 \sum_{i = 1}^{n} X_iY_i + 2\sum_{i = 1}^{n}X_i X'_i\hat{\beta}$$`

This is a single expression, but it is a system of `\(k\)` equations with `\(k\)` unknowns (the elements of `\(\hat\beta\)`).

---

## LS with multiple regressors

We can find `\(\hat\beta\)` by solving the system of equations, or write it compactly in matrix form:

`$$\left(\sum_{i = 1}^{n}X_i X'_i\right)\hat{\beta} = \left(\sum_{i = 1}^{n} X_iY_i\right)$$`

where `\(X_iX'_i\)` is `\(k\times k\)` and `\(\hat\beta\)` is `\(k \times 1\)`.

`$$\hat{\beta} = \left(\sum_{i = 1}^{n}X_i X'_i\right)^{-1} \left(\sum_{i = 1}^{n} X_iY_i\right)$$`

This is the estimator of the best linear projection coefficient `\(\beta\)`.

---

## LS with multiple regressors

The second-order condition:

`$${\partial^2 \over \partial \beta \partial \beta'} SSE(\beta) = 2 \sum_{i=1}^n X_iX'_i > 0$$`

Given `\(\sum_{i = 1}^{n}X_iX'_i>0\)` (a positive definite matrix), the LS estimator is the unique minimizer of `\(SSE(\beta)\)`.

Hence `\(\beta\)` is the best linear projection coefficient, and LS is the **best linear projection estimator** of it.

---

## Graphically

`\(\hat{\beta}\)` is the pair `\((\hat{\beta}_1,\hat{\beta}_2)\)` that minimizes the `\(SSE(\beta)\)` function:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#minbetas.png" alt=" " width="45%" />
<p class="caption"> </p>
</div>

If `\(k=1\)`, `\(X_i\)` is a scalar, so `\(X_iX_i'= X_i^2\)` and this simplifies to the one-regressor version derived before.
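
---

## Checking the minimizer numerically

A minimal sketch with simulated data (all object names are illustrative): it minimizes `\(SSE(\beta)\)` numerically with `optim()` and compares the result with the closed-form solution `\(\left(\sum_i X_iX_i'\right)^{-1}\sum_i X_iY_i\)`.

``` r
# Illustrative sketch: numerical vs. closed-form minimization of SSE(beta)
set.seed(456)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))   # two regressors, no constant
y <- X %*% c(2, -0.5) + rnorm(n)

# Sum of squared errors as a function of the coefficient vector
sse <- function(b) sum((y - X %*% b)^2)

# Numerical minimizer of SSE(beta)
num <- optim(par = c(0, 0), fn = sse)$par

# Closed-form least squares solution
cls <- solve(t(X) %*% X, t(X) %*% y)

cbind(numerical = num, closed_form = as.vector(cls))
```

Both columns should agree up to the optimizer's tolerance.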

---

## LS with multiple regressors (moment estimator)

`\(\beta\)` is an explicit function of the population moments `\(Q_{XY}=E[XY]\)` and `\(Q_{XX}=E[XX']\)`. Their moment estimators are:

`$$\hat{Q}_{XY} = {1 \over n} \sum_{i = 1}^{n}X_i Y_i$$`

`$$\hat{Q}_{XX} = {1 \over n} \sum_{i = 1}^{n}X_i X'_i$$`

Thus the moment estimator of `\(\beta\)` uses sample moments:

`$$\hat{\beta}=\hat{Q}_{XX}^{-1} \hat{Q}_{XY}$$`

---

## LS with multiple regressors (moment estimator)

Where:

`$$\hat{\beta}=\hat{Q}_{XX}^{-1} \hat{Q}_{XY}$$`

`$$= \left( {1 \over n} \sum_{i = 1}^{n} X_iX_i'\right)^{-1} \left( {1 \over n} \sum_{i = 1}^{n}X_iY_i \right)$$`

`$$= \left(\sum_{i = 1}^{n} X_iX_i'\right)^{-1} \left(\sum_{i = 1}^{n}X_iY_i \right)$$`

which is identical to the LS solution.

---

## Application

See BH p. 69:

``` r
#Load the data and create subsamples
dat <- read.table("...cps09mar.txt")
experience <- dat[,1]-dat[,4]-6
mbf <- (dat[,11]==2)&(dat[,12]<=2)&(dat[,2]==1)&(experience==12)
sam <- (dat[,11]==4)&(dat[,12]==7)&(dat[,2]==0)
dat1 <- dat[mbf,]

#Matrix Regression
y <- as.matrix(log(dat1[,5]/(dat1[,6]*dat1[,7]))) #this is log(wage)
x <- cbind(dat1[,4],matrix(1,nrow(dat1),1)) #education and a constant
xx <- t(x)%*%x
xy <- t(x)%*%y
beta <- solve(xx,xy)
print(beta)

#LS command
model <- lm(y ~ x)
summary(model)
```

---

## Least Squares Residuals

We define the *fitted value* `\(\hat{Y}_i = X'_i\hat{\beta}\)` and the *residual*:

`$$\hat{e}_i = Y_i - \hat{Y}_i = Y_i - X'_i\hat{\beta}$$`

The fitted value `\(\hat{Y}_i\)` is a function of the entire sample, including `\(Y_i\)`, so it **is not a prediction of `\(Y_i\)`.**

`$$Y_i = \hat Y_i + \hat e_i$$`

And:

`$$Y_i = X_i'\hat\beta + \hat e_i$$`

Note that the **error `\(e_i\)` is unobservable, while the residual `\(\hat{e}_i\)` is its estimator.**

---

## Least Squares Residuals

In sample, the regressors are orthogonal to the residuals:

`$$\sum_{i = 1}^{n}X_i\hat{e}_i = \sum_{i = 1}^{n} X_i (Y_i - X'_i\hat{\beta})$$`

`$$= \sum_{i = 1}^{n}X_iY_i -\sum_{i = 1}^{n}X_iX'_i\hat{\beta}$$`

`$$= \sum_{i = 1}^{n}X_iY_i -\sum_{i = 1}^{n}X_iX'_i \left(\sum_{i = 1}^{n}X_i X'_i \right)^{-1} \left(\sum_{i = 1}^{n} X_iY_i \right)$$`

`$$= \sum_{i = 1}^{n}X_iY_i - \left(\sum_{i = 1}^{n} X_iY_i \right) = 0$$`

---

## Least Squares Residuals

Dividing by `\(n\)`:

`$${1 \over n} \sum_{i = 1}^{n} X_i \hat e_i = 0$$`

When `\(X_i\)` contains a constant, this implies:

`$${1 \over n} \sum_{i = 1}^{n} \hat{e}_i = 0$$`

All of this also implies that the sample correlation between the regressors and the residuals is zero.

**These are algebraic results and hold true for all linear regression estimates.**

---

## Demeaned regressors

Another way to write the linear projection model:

`$$Y_i = X_i' \beta + \alpha + e_i$$`

where `\(\alpha\)` is the intercept and `\(X_i\)` does not include a constant. Hence:

`$$\sum_{i = 1}^{n}\hat{e}_i = \sum_{i = 1}^{n} (Y_i - X'_i\hat{\beta} - \hat\alpha)=0 \ (1)$$`

And:

`$$\sum_{i = 1}^{n}X_i\hat{e}_i = \sum_{i = 1}^{n} X_i (Y_i - X'_i\hat{\beta} - \hat\alpha)=0 \ (2)$$`

---

## Demeaned regressors

From (1):

`$$\hat \alpha = \bar Y - \bar X'\hat \beta$$`

Substituting into (2):

`$$\sum_{i=1}^n X_i ((Y_i - \bar{Y}) - (X_i - \bar{X})'\hat\beta)=0$$`

`$$\hat \beta = \left( \sum_{i=1}^n X_i (X_i - \bar X)' \right)^{-1} \left( \sum_{i=1}^n X_i (Y_i-\bar Y) \right)$$`

`$$\hat \beta = \left( \sum_{i=1}^n (X_i-\bar X) (X_i - \bar X)' \right)^{-1} \left( \sum_{i=1}^n (X_i- \bar X) (Y_i-\bar Y) \right)$$`

---

## Demeaned regressors

`$$\hat \beta = \left( \sum_{i=1}^n (X_i-\bar X) (X_i - \bar X)' \right)^{-1} \left( \sum_{i=1}^n (X_i- \bar X) (Y_i-\bar Y) \right)$$`

This is the classic *demeaned formula* for the LS estimator `\(\hat\beta\)`: the OLS slope estimator equals OLS applied to demeaned data with no intercept.

[Some code here!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/4_LS_demeaned_nointercept_graph.R)
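
---

## Checking the residual algebra

A minimal sketch with simulated data (all object names are illustrative): it verifies that the LS residuals are orthogonal to the regressors, that they sum to zero when a constant is included, and that the demeaned formula reproduces the slope from the regression with an intercept.

``` r
# Illustrative sketch: residual orthogonality and the demeaned formula
set.seed(789)
n <- 150
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit   <- lm(y ~ x)
e_hat <- resid(fit)

# Algebraic properties of LS residuals
sum(e_hat)       # ~ 0 (constant included)
sum(x * e_hat)   # ~ 0 (orthogonality with the regressor)

# Demeaned formula: slope from demeaned data, no intercept
b_demeaned <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
c(lm_slope = unname(coef(fit)["x"]), demeaned = b_demeaned)
```

Both slope estimates are numerically identical, as the algebra above implies.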

---

## Matrix Notation

The `\(n\)` equations:

`$$Y_1 = X'_1\beta + e_1$$`

`$$Y_2 = X'_2\beta + e_2$$`

`$$\vdots$$`

`$$Y_n = X'_n\beta + e_n$$`

Define:

`$$Y= \left( \begin{array}{c} Y_1\\ Y_2\\ \vdots\\ Y_n \end{array} \right) X= \left( \begin{array}{c} X'_1\\ X'_2\\ \vdots\\ X'_n \end{array} \right) e= \left( \begin{array}{c} e_1\\ e_2\\ \vdots\\ e_n \end{array} \right)$$`

These are `\(n \times 1\)`, `\(n \times k\)`, and `\(n \times 1\)`, respectively.

`$$Y= X \beta + e$$`

---

## Matrix Form

Note that `\(\sum_{i = 1}^{n}X_iX'_i = X'X\)` and `\(\sum_{i = 1}^{n}X_iY_i = X'Y\)`, so:

`$$\hat{\beta} = (X'X)^{-1}(X'Y)$$`

where:

`$$Y=X\hat{\beta}+\hat{e}$$`

`$$\hat{e} = Y - X \hat{\beta}$$`

`$$X'\hat{e}=0$$`

`$$SSE(\beta)= (Y-X\beta)'(Y-X\beta)$$`

[Some more code!](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/5_LS_matrix__OLS_estimation.R)

---

## Projection Matrix

A projection matrix is a square matrix that, when applied to a vector, projects the vector onto a subspace.

Define the matrix:

`$$P= X(X'X)^{-1}X'$$`

A property of the projection matrix is that it is **idempotent**: `\(PP = P\)`. Moreover:

`$$PX= X(X'X)^{-1}X'X=X$$`

---

## Projection Matrix

More generally, for any `\(Z\)` that can be written as `\(Z = XT\)` for some `\(T\)` (we say `\(Z\)` lies in the range space of `\(X\)`):

`$$PZ = PXT = X(X'X)^{-1} X' XT = XT = Z$$`

The `\(P\)` matrix creates the fitted values in a least squares regression:

`$$PY = X\color{green}{(X'X)^{-1}X'Y} = X\color{green}{\hat{\beta}} = \hat{Y}$$`

**Because of this property, `\(P\)` is also known as the hat matrix.**

[Some more code?](https://github.com/fcabrerahz/EconometricsME/blob/main/Code/5_LS_projection_toY.R)

---

## Annihilator Matrix

Define `\(M = I_n - P = I_n - X(X'X)^{-1} X'\)`, where `\(I_n\)` is the `\(n \times n\)` identity matrix. Hence:

`$$MX = (I_n - P)X = X - PX = X - X = 0$$`

Thus `\(M\)` and `\(X\)` are orthogonal. `\(M\)` is also symmetric and idempotent.

`\(MX_1 = 0\)` for any subcomponent `\(X_1\)` of `\(X\)`, and `\(MP=0\)`.

`\(M\)` applied to `\(Y\)` **creates the LS residuals:**

`$$MY = Y - PY = Y - X\hat{\beta} = \hat{e}$$`

*Now see BH p. 75 and explain (3.24).*

---

## Projection

We can visualize LS fitting as a projection operation.

Write the matrix `\(X = [X_1, X_2, \dots, X_k]\)` in terms of its columns.

The range space `\(\mathcal{R}(X)\)` of `\(X\)` is the space consisting of all linear combinations of the columns `\(X_1,X_2,\dots,X_k\)`.

Hence `\(\mathcal{R}(X)\)` is a `\(k\)`-dimensional surface contained in `\(\mathbb{R}^n\)`. If `\(k=2\)`, `\(\mathcal{R}(X)\)` is a plane.

The operator `\(P= X(X'X)^{-1}X'\)` projects vectors onto `\(\mathcal{R}(X)\)`.

`\(\hat{Y}=PY\)` is the projection of `\(Y\)` onto `\(\mathcal{R}(X)\)`.

Let's display this for `\(n=3\)` and `\(k=2\)`, with vectors `\(Y, X_1, X_2\)` ...

---

## Visualization

The plane spanned by `\(X_1\)` and `\(X_2\)` is the range space `\(\mathcal{R}(X)\)`.

`\(\hat{Y}\)` is a linear combination of `\(X_1\)` and `\(X_2\)` and lies in that plane.

`\(\hat{Y}\)` is the closest point to `\(Y\)` on this plane. `\(\hat{Y}\)` and `\(\hat{e}\)` are orthogonal.

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#projection.png" alt=" " width="50%" />
<p class="caption"> </p>
</div>

---

## Error Variance

`\(\sigma^{2} = E[e^2]\)` is a moment. An estimator of `\(\sigma^2\)`:

`$$\tilde{\sigma}^2 = {1 \over n} \sum_{i = 1}^{n} e_i^2$$`

But `\(e_i\)` is unobserved. Hence we compute the residuals `\(\hat{e}_i\)` first and plug them in:

`$$\hat{\sigma}^2 = {1 \over n} \sum_{i = 1}^{n} \hat{e}_i^2$$`

In matrix notation:

`$$\hat{\sigma}^2 = n^{-1}\hat{e}'\hat{e}$$`

---

## Decomposition of Variance

`$$\sum_{i=1}^n(Y_i - \bar{Y})^2 = \sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^n \hat{e}_i^2$$`

`$$TSS = ESS + RSS$$`

`$$R^2 = {{\sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2} \over {\sum_{i=1}^n({Y_i} - \bar{Y})^2}} = 1-{{\sum_{i=1}^n\hat{e}_i^2} \over {\sum_{i=1}^n({Y_i} - \bar{Y})^2}}$$`

Hence, `\(R^2\)` relates the *explained* variation of `\(Y\)` to the total variation of `\(Y\)` around its mean.

In other words, it measures how much of the variation in `\(Y\)` is explained by `\(X\)`.
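
---

## Checking the projection and variance decomposition

A minimal sketch with simulated data (all object names are illustrative): it builds `\(P\)` and `\(M\)`, checks that `\(PY\)` and `\(MY\)` give the fitted values and residuals, computes `\(\hat\sigma^2\)`, and verifies that `\(TSS = ESS + RSS\)` and that `\(R^2\)` matches the value reported by `lm()`.

``` r
# Illustrative sketch: projection matrices and the variance decomposition
set.seed(2025)
n <- 80
X <- cbind(1, rnorm(n), rnorm(n))       # constant plus two regressors
y <- as.numeric(X %*% c(1, 2, -0.5)) + rnorm(n)

P <- X %*% solve(t(X) %*% X) %*% t(X)   # projection ("hat") matrix
M <- diag(n) - P                        # annihilator matrix

max(abs(P %*% P - P))                   # ~ 0: P is idempotent
max(abs(M %*% X))                       # ~ 0: M annihilates X

y_hat <- as.numeric(P %*% y)            # fitted values
e_hat <- as.numeric(M %*% y)            # residuals

sigma2_hat <- sum(e_hat^2) / n          # error variance estimator

TSS <- sum((y - mean(y))^2)
ESS <- sum((y_hat - mean(y))^2)
RSS <- sum(e_hat^2)
c(TSS = TSS, ESS_plus_RSS = ESS + RSS)  # should coincide

R2 <- 1 - RSS / TSS
c(R2 = R2, lm_R2 = summary(lm(y ~ X[, -1]))$r.squared)
```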

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
  <h2>The End</h2>
</div>