The Linear Regression Estimator, Multiple Regression, and Matrix Algebra

Three Parts

Guidance for Midterm Exam

  • All material (lectures, readings, examples, etc.) is fair game for the exam.
  • I do not expect you to entirely reproduce proofs of theorems, but I expect you to recognize and understand the key ideas and be able to apply them.
  • You may be asked to elaborate on particular points and provide additional context or examples.
  • This may entail working through particular characteristics of the estimator – e.g., demonstrate the unbiasedness of the OLS estimator.

Guidance for Midterm Exam

  • Key terms and concepts: Normal equations, OLS estimator, Gauss-Markov theorem, OLS residuals, etc.
  • Conceptual applications are common (e.g., what are the assumptions of the Gauss-Markov theorem? What does it tell us about the OLS estimator? How does it help us understand the OLS estimator?)
  • Interpretation is also important (e.g., what does the OLS estimator tell us about the relationship between \(X\) and \(Y\)?)

Example

Gauss-Markov Assumptions. Suppose the PRF is written as \(Y_i=\alpha+\beta X_i+\epsilon_i\) and the SRF as \(Y_i=a+b X_i+e_i\).

  1. What assumptions are required for the OLS estimator to be the best linear unbiased estimator (BLUE), i.e., the linear unbiased estimator with minimum variance? Describe each assumption in no more than 1-2 sentences.

  2. The Gauss-Markov theorem holds that if these assumptions are met, the OLS estimator \(b\) is a linear function of \(Y_i\) (i.e., \(b=\sum k_i Y_i\)). Please demonstrate this.

  3. The Gauss-Markov theorem holds that if these assumptions are met, the OLS estimator is an unbiased estimator of \(\beta\). Please demonstrate this (hint: use \(b=\sum k_i Y_i\) to show this is the case).
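
A sketch of the argument for parts 2 and 3 in the bivariate case. It writes \(x_i = X_i - \bar{X}\) and takes the weights to be \(k_i = x_i / \sum x_i^2\); the question names \(k_i\) but does not define it, so this explicit form is the standard textbook choice and is an assumption here.

\[b = \frac{\sum x_i (Y_i - \bar{Y})}{\sum x_i^2} = \frac{\sum x_i Y_i}{\sum x_i^2} = \sum k_i Y_i \qquad \text{(because } \textstyle\sum x_i = 0\text{)}\]

\[\sum k_i = 0, \quad \sum k_i X_i = 1 \;\; \Longrightarrow \;\; b = \sum k_i (\alpha + \beta X_i + \epsilon_i) = \beta + \sum k_i \epsilon_i\]

\[E(b) = \beta + \sum k_i E(\epsilon_i) = \beta \qquad \text{(treating } X \text{ as fixed and using } E(\epsilon_i) = 0\text{)}\]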

Part I: Estimation in R

The Data Generating Process (DGP)

The DGP is the underlying process that produces the sample data we observe — the PRF that generates the data.

It's the mechanism that, we assume, generated the observed data, along with sampling error.

We can simulate data from a known DGP to understand how sampling, estimation, and inference work together.

Using a function in R

simulate_regression_data <- function(
  n = 500, beta_0 = 0, beta_1 = 0.2,
  x_mean = 0, x_sd = 1, error_sd = 1
) {
  X <- rnorm(n, mean = x_mean, sd = x_sd)
  errors <- rnorm(n, mean = 0, sd = error_sd)
  Y <- beta_0 + beta_1 * X + errors
  data.frame(x = X, y = Y, true_y = beta_0 + beta_1 * X, error = errors)
}
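
A minimal usage sketch for the function above. The function is repeated so the block runs on its own; the seed and the choice `beta_1 = 0.7` are assumptions for illustration, since the slides do not state which values produced `sim_dat`.

```r
# Repeated from the slide above so this block is self-contained
simulate_regression_data <- function(
  n = 500, beta_0 = 0, beta_1 = 0.2,
  x_mean = 0, x_sd = 1, error_sd = 1
) {
  X <- rnorm(n, mean = x_mean, sd = x_sd)
  errors <- rnorm(n, mean = 0, sd = error_sd)
  Y <- beta_0 + beta_1 * X + errors
  data.frame(x = X, y = Y, true_y = beta_0 + beta_1 * X, error = errors)
}

set.seed(123)  # assumed seed, only for reproducibility
sim_dat <- simulate_regression_data(n = 500, beta_1 = 0.7)  # assumed slope

head(sim_dat)

# By construction, y is the PRF value plus the error term
all.equal(sim_dat$y, sim_dat$true_y + sim_dat$error)
```

Because the DGP is known, every estimate from `sim_dat` can be compared against the true `beta_0` and `beta_1`.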

PRF vs. SRF: Visualizing the DGP

Key elements of the plot:

  • PRF (red solid): \(Y_i = \alpha + \beta X_i + \epsilon_i\) — the true line we never see
  • SRF (blue dashed): \(Y_i = a + bX_i + e_i\) — our estimate from the sample
  • Residuals (gray): \(e_i = Y_i - \hat{Y}_i\)

The SRF approximates the PRF. How well depends on:

  • Sample size \(n\)
  • Error variance \(\sigma^2_\epsilon\)
  • Variance of \(X\)

Estimation with lm()

fit <- lm(y ~ x, data = sim_dat)
summary(fit)

Call:
lm(formula = y ~ x, data = sim_dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.75568 -0.67016  0.01042  0.63073  2.73709 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0004678  0.0452219   -0.01    0.992    
x            0.6960271  0.0465049   14.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.011 on 498 degrees of freedom
Multiple R-squared:  0.3103,    Adjusted R-squared:  0.3089 
F-statistic:   224 on 1 and 498 DF,  p-value: < 2.2e-16

Estimation with lm()

Key elements of the output:

Element | Meaning
Coefficients | Estimated \(a\), \(b\) with SEs, t-values, p-values
Residual SE | Estimated standard deviation of the residuals (the typical distance of observations from the regression line)
\(R^2\) | Proportion of variance in \(Y\) explained by \(X\)
F-statistic | Tests whether the model fits better than the null (intercept-only) model, which predicts \(\bar{Y}\) for every observation
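
Each element of the printed summary can also be pulled out as a number. The sketch below uses freshly simulated data (an assumed seed and slope, not the `sim_dat` from the slides) and recomputes the residual SE by hand.

```r
set.seed(1)  # assumed seed; any simulated data would do
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)
fit <- lm(y ~ x)
s <- summary(fit)

coef(s)        # matrix with Estimate, Std. Error, t value, Pr(>|t|)
s$sigma        # residual standard error
s$r.squared    # R-squared
s$fstatistic   # F value with numerator and denominator df

# Residual SE by hand: sqrt(RSS / (n - k - 1)), with k = 1 predictor
sigma_by_hand <- sqrt(sum(resid(fit)^2) / df.residual(fit))
all.equal(s$sigma, sigma_by_hand)
```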

Generating Predictions

# Single value
predict(fit, newdata = data.frame(x = 1))
        1 
0.6955593 
# Multiple values
predict(fit, newdata = data.frame(x = seq(0.25, 0.35, by = 0.05)))
        1         2         3 
0.1735390 0.2083404 0.2431417 
# all fitted values...
predict(fit) |> head()
          1           2           3           4           5           6 
-0.39057401 -0.16067754  1.08443545  0.04860798  0.08952000  1.19326394 

Predictions are just \(\hat{Y}_i = a + bX_i\) evaluated at chosen \(X\) values.
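
To make that concrete, the sketch below computes \(a + bX\) by hand and compares it with `predict()`. The data are simulated with an assumed seed and coefficients; the object names are illustrative.

```r
set.seed(2)  # assumed seed
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 0.5 * dat$x + rnorm(100)
fit <- lm(y ~ x, data = dat)

a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["x"]]
new_x <- c(0.25, 0.30, 0.35)

by_hand <- a + b * new_x
by_predict <- predict(fit, newdata = data.frame(x = new_x))

# The two agree (predict() only adds names to the result)
all.equal(by_hand, unname(by_predict))
```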

Part II: Model Fit

Model Fit: \(R^2\), Correlation, & ANOVA

The correlation \(r\) equals the standardized slope in bivariate regression:

Regression Estimates


Call:
lm(formula = scale(y) ~ scale(x), data = sim_dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.26700 -0.55132  0.00857  0.51888  2.25170 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.351e-17  3.718e-02    0.00        1    
scale(x)    5.570e-01  3.722e-02   14.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8313 on 498 degrees of freedom
Multiple R-squared:  0.3103,    Adjusted R-squared:  0.3089 
F-statistic:   224 on 1 and 498 DF,  p-value: < 2.2e-16
[1] 0.5570035

\(b_{x} \approx 0.557\)

Correlation Estimates

[1] 0.5570035

\(r_{xy} \approx 0.557\)

Key relationships:

  • \(R^2 = r^2\) in bivariate regression
  • \(r = \sqrt{R^2} \times \text{sign}(b)\)
  • The correlation is the standardized regression coefficient
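
These relationships can be checked numerically. A small sketch on simulated data (assumed seed and parameters):

```r
set.seed(3)  # assumed seed
x <- rnorm(300, mean = 2, sd = 3)
y <- 1 + 0.4 * x + rnorm(300)

r     <- cor(x, y)
b_raw <- coef(lm(y ~ x))[["x"]]
b_std <- coef(lm(scale(y) ~ scale(x)))[["scale(x)"]]
r2    <- summary(lm(y ~ x))$r.squared

all.equal(b_std, r)                      # standardized slope equals r
all.equal(b_raw * sd(x) / sd(y), r)      # rescaling the raw slope gives r
all.equal(r2, r^2)                       # R^2 equals r^2
all.equal(r, sqrt(r2) * sign(b_raw))     # sign comes from the slope
```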

Decomposing the Variance (ANOVA)

\[TSS = RegSS + RSS\]

\[\sum(Y_i - \bar{Y})^2 = \sum(\hat{Y}_i - \bar{Y})^2 + \sum(Y_i - \hat{Y}_i)^2\]

Component | Formula | Meaning
TSS | \(\sum(Y_i - \bar{Y})^2\) | Total variation in \(Y\)
RegSS | \(\sum(\hat{Y}_i - \bar{Y})^2\) | Variation explained by the model
RSS | \(\sum(Y_i - \hat{Y}_i)^2\) | Unexplained (residual) variation

\(R^2\): The Coefficient of Determination

\[R^2 = \frac{RegSS}{TSS} = 1 - \frac{RSS}{TSS}\]

  • \(R^2 = 0\): model explains nothing; every fitted value equals the mean, \(\hat{Y}_i = \bar{Y}\)
  • \(R^2 = 1\): deterministic relationship; all points on the line

Proportional reduction in error — how much better is our model than just predicting \(\bar{Y}\)?
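
A short sketch of the decomposition on simulated data (assumed seed and coefficients), confirming that the two expressions for \(R^2\) agree with what `summary()` reports:

```r
set.seed(4)  # assumed seed
x <- rnorm(150)
y <- 2 + 0.6 * x + rnorm(150)
fit <- lm(y ~ x)

tss   <- sum((y - mean(y))^2)            # total
regss <- sum((fitted(fit) - mean(y))^2)  # explained
rss   <- sum(resid(fit)^2)               # residual

all.equal(tss, regss + rss)              # TSS = RegSS + RSS

r2_a <- regss / tss
r2_b <- 1 - rss / tss
c(r2_a, r2_b, summary(fit)$r.squared)    # three identical values
```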

The F-statistic

\[F = \frac{RegSS / df_{reg}}{RSS / df_{res}} = \frac{MSS_{reg}}{MSS_{res}}\]

Where \(k\) is the number of predictors, \(df_{reg} = k\), and \(df_{res} = n - k - 1\).

Hypotheses:

  • \(H_0\): \(\beta = 0\) (model no better than null)
  • \(H_a\): \(\beta \neq 0\) (model explains variance)

The F-statistic is a ratio of two variances — it follows the F-distribution under \(H_0\).
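
The F-statistic can be rebuilt from the sums of squares. The sketch below (simulated data, assumed seed) compares the hand calculation with `summary()` and also shows that in the bivariate case \(F = t^2\) for the slope.

```r
set.seed(5)  # assumed seed
n <- 120
k <- 1       # one predictor
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)
fit <- lm(y ~ x)

regss <- sum((fitted(fit) - mean(y))^2)
rss   <- sum(resid(fit)^2)

F_by_hand <- (regss / k) / (rss / (n - k - 1))
F_lm      <- summary(fit)$fstatistic[["value"]]
all.equal(F_by_hand, F_lm)

# p-value from the F distribution with (k, n - k - 1) degrees of freedom
p_value <- pf(F_by_hand, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

# With a single predictor, F equals the squared t-statistic of the slope
t_slope <- coef(summary(fit))["x", "t value"]
all.equal(F_by_hand, t_slope^2)
```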

What Happens When You Vary Parameters?

Increase error (\(\sigma_\epsilon\)):

  • Points scatter more around PRF
  • \(R^2\) drops
  • Sampling distribution of \(b\) widens
  • But \(b\) remains unbiased

Increase \(\sigma_X\):

  • More spread in \(X\) → more leverage
  • \(R^2\) increases
  • \(var(b)\) decreases — more efficient

Increase \(n\):

  • SRF → PRF
  • Standard errors shrink \(\propto 1/\sqrt{n}\)
  • \(R^2\) stabilizes

Set \(\beta = 0\):

  • SRF still estimates some slope (sampling error)
  • Monte Carlo distribution centers on zero
  • This is the null hypothesis in action
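
A small Monte Carlo sketch of two of these claims: more error variance widens the sampling distribution of \(b\) without biasing it, and with \(\beta = 0\) the estimates center on zero. The helper `sim_slope()`, the seed, and the parameter values are illustrative assumptions.

```r
set.seed(6)  # assumed seed

# Hypothetical helper: draw one sample from the DGP and return the OLS slope
sim_slope <- function(n, beta_1, error_sd) {
  x <- rnorm(n)
  y <- beta_1 * x + rnorm(n, sd = error_sd)
  coef(lm(y ~ x))[["x"]]
}

reps <- 500
b_low  <- replicate(reps, sim_slope(n = 100, beta_1 = 0.2, error_sd = 1))
b_high <- replicate(reps, sim_slope(n = 100, beta_1 = 0.2, error_sd = 3))
b_null <- replicate(reps, sim_slope(n = 100, beta_1 = 0,   error_sd = 1))

# Means stay near the true slope; spread grows with the error SD
c(mean_low = mean(b_low), mean_high = mean(b_high), mean_null = mean(b_null))
c(sd_low = sd(b_low), sd_high = sd(b_high))
```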

Characteristics of the OLS Estimator

Multiple variables

The OLS estimator is a point estimator. We predict a single point \(\hat{Y}_i\), given a predictor \(X_i\). Let’s consider the SRF with multiple predictors:

\[ Y_i - \underbrace{(a + b_1 X_{1i} + b_2 X_{2i})}_{\hat{Y}_{i}} = e_i \]

We’ll still minimize \(\sum e_i^2\), but with two predictors the fitted surface is a plane rather than a line.

  • We can just use a little algebra to rewrite these equations. Let’s simplify things by writing each term in “deviation” form.

  • \(y_i=Y_i-\bar{Y}\)

  • \(x_{1i}=X_{1i}-\bar{X}_1\)

  • \(x_{2i}=X_{2i}-\bar{X}_2\)

Multiple variables

Setting \(\frac{\partial RSS}{\partial b_1}\) to 0, substituting \(a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2\) (from the first-order condition for \(a\)), and writing the result in deviation form:

\[\begin{eqnarray*} 0&=& -2 \sum (Y_i-a-b_1X_{1i}-b_2 X_{2i}) (X_{1i}) \\ b_1&=& \frac{\sum x_{1i} y_i-b_2\sum x_{1i} x_{2i}}{\sum x_{1i}^2} \end{eqnarray*}\]

Setting \(\frac{\partial RSS}{\partial b_2}\) to 0 and following the same steps:

\[\begin{eqnarray*} 0&=& -2 \sum (Y_i-a-b_1X_{1i}-b_2 X_{2i}) (X_{2i}) \\ b_2&=& \frac{\sum x_{2i} y_i-b_1\sum x_{1i} x_{2i}}{\sum x_{2i}^2} \end{eqnarray*}\]

Multiple variables

Then,

\[\begin{eqnarray*} b_1&=& \frac{\sum y_i x_{1i} \sum x_{2i}^2 - \sum x_{1i} x_{2i} \sum x_{2i} y_i }{\sum x_{1i}^2 \sum x_{2i}^2 -(\sum x_{1i} x_{2i})^2} \nonumber \\ b_2&=& \frac{\sum y_i x_{2i} \sum x_{1i}^2 - \sum x_{1i} x_{2i} \sum x_{1i} y_i }{\sum x_{1i}^2 \sum x_{2i}^2 -(\sum x_{1i} x_{2i})^2} \nonumber \\ \end{eqnarray*}\]
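
The closed-form expressions can be checked against `lm()`. A sketch on simulated data (assumed seed and coefficients), using uppercase names for raw variables and lowercase for deviations, as in the slides:

```r
set.seed(7)  # assumed seed
n  <- 200
X1 <- rnorm(n)
X2 <- 0.5 * X1 + rnorm(n)              # correlated with X1
Y  <- 1 + 0.8 * X1 - 0.5 * X2 + rnorm(n)

# Deviation form
x1 <- X1 - mean(X1)
x2 <- X2 - mean(X2)
y  <- Y - mean(Y)

den <- sum(x1^2) * sum(x2^2) - sum(x1 * x2)^2
b1  <- (sum(y * x1) * sum(x2^2) - sum(x1 * x2) * sum(x2 * y)) / den
b2  <- (sum(y * x2) * sum(x1^2) - sum(x1 * x2) * sum(x1 * y)) / den
a   <- mean(Y) - b1 * mean(X1) - b2 * mean(X2)

by_hand <- c(a, b1, b2)
by_lm   <- unname(coef(lm(Y ~ X1 + X2)))
rbind(by_hand, by_lm)
```

The shared denominator goes to zero as \(X_1\) and \(X_2\) become perfectly correlated, which is the two-predictor version of the multicollinearity problem.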

Multiple variables

\(\hat{Y} = a + b_1 X_1 + b_2 X_2\)

  • \(a\) = 0.849
  • \(b_1\) = 0.78
  • \(b_2\) = -0.499

\(R^2\) = 0.618

Multiple Regression: Fitted Plane & Residuals

Part III: Matrix Algebra

Vectors, Matrices, and the OLS Estimator

Why Matrix Algebra?

  • Writing the SRF in matrix form makes it easier to solve
  • Quantitative social science aims to quantify relationships between multiple variables
  • Data is tabular: rows = observations, columns = variables
  • We need tools to solve systems of equations efficiently

\[\begin{aligned} y_{1} &= b_0+ b_1 x_{11}+b_2 x_{12} + e_1\\ y_{2} &= b_0+ b_1 x_{21}+b_2 x_{22} + e_2\\ &\;\;\vdots\\ y_{n} &= b_0+ b_1 x_{n1}+b_2 x_{n2} + e_n \end{aligned}\]

  • \(n\) equations but only three unknown coefficients; linear algebra gives us the least-squares solution

Data as a Matrix

Each row is an observation; each column is a variable.

\[\begin{bmatrix} Vote & PID & Ideology \\\hline a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \vdots & \vdots & \vdots\\ a_{n1} & a_{n2} & a_{n3} \end{bmatrix}\]

  • This is an \(n \times 3\) matrix
  • First subscript = row, second = column
  • Notation: \(\mathbf{A}_{n \times 3}\)

A <- matrix(c(1, 0, 3,
              0, 1, 5,
              1, 1, 4), nrow = 3, byrow = TRUE)
colnames(A) <- c("Vote", "PID", "Ideology")
A
     Vote PID Ideology
[1,]    1   0        3
[2,]    0   1        5
[3,]    1   1        4
dim(A)
[1] 3 3

Vectors: The Building Blocks

  • Scalar: a single number (magnitude only)
  • Vector: multiple elements — encodes magnitude and direction

A vector \(\mathbf{a} \in \mathbb{R}^k\) has \(k\) elements.

Euclidean Distance between \(\mathbf{a}=[x_1,y_1]\) and \(\mathbf{b}=[x_2,y_2]\):

\[\text{Distance}(\mathbf{a},\mathbf{b}) = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\]

This is just the Pythagorean theorem!

a <- c(3, 2); b <- c(1, 1)
sqrt(sum((a - b)^2))  # Euclidean distance
[1] 2.236068

The Norm of a Vector

The norm measures the length (magnitude) of a vector from the origin:

\[\|\mathbf{a}\| = \sqrt{x_1^2 + y_1^2}\]

  • Dividing a vector by its norm gives a unit vector (length = 1)
  • Useful for standardization

a <- c(3, 2, 1)
sqrt(sum(a^2))       # norm of a
[1] 3.741657
a / sqrt(sum(a^2))   # unit vector
[1] 0.8017837 0.5345225 0.2672612

In higher dimensions (\(\mathbb{R}^3\)):

\[\|\mathbf{a}\| = \sqrt{x_1^2 + y_1^2 + z_1^2}\]

Vector Addition & Subtraction

Element-wise operations on conformable vectors (same length). For \(\mathbf{a}=[3,2,1]\) and \(\mathbf{b}=[1,1,1]\):

\[\mathbf{a} + \mathbf{b} = [3+1,\; 2+1,\; 1+1] = [4, 3, 2]\]

\[\mathbf{a} - \mathbf{b} = [3-1,\; 2-1,\; 1-1] = [2, 1, 0]\]

Properties:

Property | Statement
Commutative | \(\mathbf{a}+\mathbf{b}=\mathbf{b}+\mathbf{a}\)
Associative | \((\mathbf{a}+\mathbf{b})+\mathbf{c}=\mathbf{a}+(\mathbf{b}+\mathbf{c})\)
Distributive | \(c(\mathbf{a}+\mathbf{b})=c\mathbf{a}+c\mathbf{b}\)
Zero | \(\mathbf{a}+\mathbf{0}=\mathbf{a}\)

a <- c(3, 2, 1); b <- c(1, 1, 1)
a + b    # addition
[1] 4 3 2
a - b    # subtraction
[1] 2 1 0

Vector Multiplication

  • Inner (dot) product → produces a scalar (measures similarity / covariance)
  • Cross product → produces a vector (orthogonal to both inputs)
  • Outer product → produces a matrix

The Inner (Dot) Product

Multiply corresponding elements and sum:

\[\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i\]

For \(\mathbf{a}=[3,2,1]\) and \(\mathbf{b}=[1,1,1]\): \(\;\; 3(1)+2(1)+1(1) = 6\)

a <- c(3, 2, 1); b <- c(1, 1, 1)
sum(a * b)          # inner product
[1] 6
a %*% b             # same thing with matrix notation
     [,1]
[1,]    6

The inner product is a measure of covariance:

\[\text{cov}(x,y) = \frac{\text{inner product}(x-\bar{x},\; y-\bar{y})}{n-1}\]

\[r_{x,y} = \frac{\text{inner product}(x-\bar{x},\; y-\bar{y})}{\|x-\bar{x}\|\;\|y-\bar{y}\|}\]
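
Both formulas can be verified against R's built-in `cov()` and `cor()`. A sketch with simulated vectors (assumed seed):

```r
set.seed(8)  # assumed seed
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

xd <- x - mean(x)   # deviations from the mean
yd <- y - mean(y)
inner <- sum(xd * yd)

cov_by_hand <- inner / (length(x) - 1)
r_by_hand   <- inner / (sqrt(sum(xd^2)) * sqrt(sum(yd^2)))

all.equal(cov_by_hand, cov(x, y))
all.equal(r_by_hand, cor(x, y))
```

The correlation is therefore the cosine of the angle between the two centered vectors.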

Inner Product Rules

Property | Statement
Commutative | \(\mathbf{a} \cdot \mathbf{b} = \mathbf{b} \cdot \mathbf{a}\)
Associative | \(d(\mathbf{a} \cdot \mathbf{b}) = (d\mathbf{a}) \cdot \mathbf{b}\)
Distributive | \(\mathbf{c} \cdot (\mathbf{a}+\mathbf{b}) = \mathbf{c}\cdot\mathbf{a} + \mathbf{c}\cdot\mathbf{b}\)
Zero | \(\mathbf{a} \cdot \mathbf{0} = 0\)

The Cross Product

For \(\mathbf{a}=[3,2,1]\) and \(\mathbf{b}=[1,4,7]\):

  1. Stack the vectors
  2. Calculate \(2 \times 2\) determinants

\[\mathbf{a} \times \mathbf{b} = [2(7)-4(1),\;\; 1(1)-3(7),\;\; 3(4)-2(1)] = [10, -20, 10]\]

  • Result is orthogonal to both original vectors
  • Useful for determinants and matrix inversion
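
Base R has no built-in 3-D vector cross product (`crossprod()` computes \(\mathbf{x}^T\mathbf{y}\), which is a different operation), so the sketch below defines a small helper. The name `cross3()` is hypothetical.

```r
# Hypothetical helper: cross product of two length-3 vectors
cross3 <- function(a, b) {
  stopifnot(length(a) == 3, length(b) == 3)
  c(a[2] * b[3] - a[3] * b[2],
    a[3] * b[1] - a[1] * b[3],
    a[1] * b[2] - a[2] * b[1])
}

a <- c(3, 2, 1); b <- c(1, 4, 7)
ab <- cross3(a, b)
ab              # 10 -20 10

# Orthogonality: the inner product with each input is zero
sum(ab * a)     # 0
sum(ab * b)     # 0
```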

The Outer Product

Transpose one vector, then multiply:

\[\begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 4 & 7 \end{bmatrix} = \begin{bmatrix} 3 & 12 & 21 \\ 2 & 8 & 14 \\ 1 & 4 & 7 \end{bmatrix}\]

  • Input: two vectors of length \(k\)
  • Output: a \(k \times k\) matrix

a <- c(3, 2, 1); b <- c(1, 4, 7)
a %o% b   # outer product
     [,1] [,2] [,3]
[1,]    3   12   21
[2,]    2    8   14
[3,]    1    4    7

Matrices

A matrix combines row or column vectors. Notation: bold uppercase (\(\mathbf{A}\)); an \(m \times n\) matrix has \(m\) rows and \(n\) columns.

\[\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots\\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}\]

Matrix Types

Type | Definition
Square | Equal rows and columns
Symmetric | Same entries above and below the diagonal; \(\mathbf{A} = \mathbf{A}^T\)
Identity (\(\mathbf{I}\)) | 1s on diagonal, 0s off; \(\mathbf{AI} = \mathbf{IA} = \mathbf{A}\)
Idempotent | \(\mathbf{A}^2 = \mathbf{A}\)
Trace | Sum of the diagonal elements of a square matrix; e.g., \(\text{tr}(\mathbf{I}_n) = n\)
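
A brief sketch of each definition in R. The example matrices are arbitrary choices for illustration.

```r
I2 <- diag(2)                                      # 2 x 2 identity
A  <- matrix(c(4, 7, 2, 6), nrow = 2, byrow = TRUE)
all.equal(A %*% I2, A)                             # AI = A
all.equal(I2 %*% A, A)                             # IA = A

S <- matrix(c(2, 1, 1, 3), nrow = 2)
isSymmetric(S)                                     # S equals t(S)

P <- matrix(0.5, nrow = 2, ncol = 2)
all.equal(P %*% P, P)                              # idempotent: PP = P

tr_I3 <- sum(diag(diag(3)))                        # trace of the 3 x 3 identity
tr_I3                                              # 3
```

R has no dedicated trace function in base; `sum(diag(M))` is the usual idiom.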

Matrix Addition & Subtraction

Matrices must be conformable (same dimensions). Add/subtract element-wise:

\[\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} \\ a_{21}+b_{21} & a_{22}+b_{22} \end{bmatrix}\]

Properties: Commutative, Associative, Distributive, Zero

Matrix Multiplication

Order matters! \(\mathbf{AB} \neq \mathbf{BA}\) in general.

Entry \((i, j)\) of the product is the inner product of the \(i\)-th row of the first matrix with the \(j\)-th column of the second:

\[\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} 3 & 5 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 1(3)+3(2) & 1(5)+3(4) \\ 2(3)+4(2) & 2(5)+4(4) \end{bmatrix} = \begin{bmatrix} 9 & 17 \\ 14 & 26 \end{bmatrix}\]

A <- matrix(c(1,3,2,4), nrow=2, byrow=TRUE)
B <- matrix(c(3,5,2,4), nrow=2, byrow=TRUE)
A %*% B
     [,1] [,2]
[1,]    9   17
[2,]   14   26

Conformability rule: columns of first = rows of second

\[\mathbf{A}_{m \times n} \times \mathbf{B}_{n \times p} = \mathbf{C}_{m \times p}\]

Inner dimensions must match; result has outer dimensions.
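
A quick sketch of the conformability rule, and of the fact that order matters. The matrices are arbitrary; `tryCatch()` is used only so the failed product does not stop the script.

```r
A <- matrix(1:6,  nrow = 2)   # 2 x 3
B <- matrix(1:12, nrow = 3)   # 3 x 4

dim(A %*% B)                  # inner dims match (3 and 3); result is 2 x 4

# B %*% A pairs 4 columns with 2 rows, so R signals an error
result <- tryCatch(B %*% A, error = function(e) "non-conformable")
result

# Even for square matrices, AB and BA generally differ
C <- matrix(c(1, 3, 2, 4), nrow = 2, byrow = TRUE)
D <- matrix(c(3, 5, 2, 4), nrow = 2, byrow = TRUE)
same_product <- isTRUE(all.equal(C %*% D, D %*% C))
same_product                  # FALSE
```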

The Transpose

\(\mathbf{A}^T\) swaps rows and columns. If \(\mathbf{A}\) is \(m \times n\), then \(\mathbf{A}^T\) is \(n \times m\).

\[\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}\]

A <- matrix(c(1,2,3,4,5,6), nrow=2, byrow=TRUE)
t(A)         # transpose
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
t(A) %*% A   # A'A is always square & symmetric
     [,1] [,2] [,3]
[1,]   17   22   27
[2,]   22   29   36
[3,]   27   36   45

Key properties:

Property | Statement
Double transpose | \((\mathbf{A}^T)^T = \mathbf{A}\)
Sum | \((\mathbf{A}+\mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T\)
Product (reversal) | \((\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T\)

Why the Transpose Matters

Transposing a product reverses the order:

\[(\mathbf{ABC})^T = \mathbf{C}^T\mathbf{B}^T\mathbf{A}^T\]

Critical result: For any matrix \(\mathbf{A}\), the product \(\mathbf{A}^T\mathbf{A}\) is always:

  • Square (\(n \times n\) if \(\mathbf{A}\) is \(m \times n\))
  • Symmetric

This is exactly what \(\mathbf{X}^T\mathbf{X}\) produces in the normal equations.
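
Both results can be confirmed numerically. In the sketch below the matrix `B` is an arbitrary choice; `crossprod(A)` is base R's shortcut for `t(A) %*% A`.

```r
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)   # 2 x 3
B <- matrix(c(1, 0, 2, 1, 0, 3), nrow = 3, byrow = TRUE)   # 3 x 2

# Transposing a product reverses the order
all.equal(t(A %*% B), t(B) %*% t(A))

# A'A is square (3 x 3 here) and symmetric
AtA <- t(A) %*% A
dim(AtA)
isSymmetric(AtA)
all.equal(AtA, crossprod(A))
```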

The Determinant

The determinant is a scalar value computed from a square matrix. It’s necessary for matrix inversion (later).

For a \(2 \times 2\) matrix:

\[\det(\mathbf{A}) = \det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc\]

  • If \(\det(\mathbf{A}) \neq 0\): the matrix is nonsingular (invertible)
  • If \(\det(\mathbf{A}) = 0\): the matrix is singular (no inverse exists — columns are linearly dependent)

Example:

\[\det\begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix} = 4(6) - 7(2) = 10 \neq 0 \;\; \text{(nonsingular)}\]

\[\det\begin{bmatrix} 2 & 4 \\ 1 & 2 \end{bmatrix} = 2(2) - 4(1) = 0 \;\; \text{(singular: row 1} = 2 \times \text{row 2)}\]

A <- matrix(c(4,7,2,6), nrow=2, byrow=TRUE)
det(A)   # nonsingular
[1] 10
B <- matrix(c(2,4,1,2), nrow=2, byrow=TRUE)
det(B)   # singular — no inverse
[1] 0

For OLS: \(\det(\mathbf{X}^T\mathbf{X}) = 0\) means perfect multicollinearity — the columns of \(\mathbf{X}\) are linearly dependent, and we cannot solve for \(\mathbf{b}\).
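
A sketch of what perfect multicollinearity looks like in practice, using made-up data in which one predictor is an exact multiple of another. The rank check is used instead of `det()` because floating-point determinants of singular matrices are not always exactly zero.

```r
x1 <- c(1, 2, 3, 4, 5)
x2 <- 2 * x1                         # exact linear function of x1
y  <- c(5.1, 7.9, 11.2, 13.8, 17.1)

X   <- cbind(1, x1, x2)
XtX <- t(X) %*% X

rank_XtX <- qr(XtX)$rank
rank_XtX < ncol(X)                   # TRUE: X'X is rank deficient

# lm() detects the redundancy and reports NA for the aliased column
fit <- lm(y ~ x1 + x2)
coef(fit)
```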

Matrix Inversion

For scalars: \(a \cdot a^{-1} = 1\)

For matrices: \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\)

Requirements:

  • Only square matrices can have inverses
  • Must be nonsingular: \(\det(\mathbf{A}) \neq 0\)

The \(2 \times 2\) Inverse

\[\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \quad \Longrightarrow \quad \mathbf{A}^{-1} = \frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\]

Example:

\[\begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix}^{-1} = \frac{1}{10}\begin{bmatrix} 6 & -7 \\ -2 & 4 \end{bmatrix} = \begin{bmatrix} 0.6 & -0.7 \\ -0.2 & 0.4 \end{bmatrix}\]

\[\mathbf{A}\mathbf{A}^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \mathbf{I}\]

A <- matrix(c(4,7,2,6), nrow=2, byrow=TRUE)
solve(A)          # inverse
     [,1] [,2]
[1,]  0.6 -0.7
[2,] -0.2  0.4
A %*% solve(A)    # verify: identity, up to floating-point error
              [,1]          [,2]
[1,]  1.000000e+00 -1.110223e-16
[2,] -1.110223e-16  1.000000e+00

Properties of the Inverse

Property | Statement
Product (reversal) | \((\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\)
Transpose | \((\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T\)
Double inverse | \((\mathbf{A}^{-1})^{-1} = \mathbf{A}\)
Identity | \(\mathbf{I}^{-1} = \mathbf{I}\)

Like the transpose, inverting a product reverses the order.
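
The properties can be checked with the two nonsingular \(2 \times 2\) matrices used earlier in these slides:

```r
A <- matrix(c(4, 7, 2, 6), nrow = 2, byrow = TRUE)   # det = 10
B <- matrix(c(3, 5, 2, 4), nrow = 2, byrow = TRUE)   # det = 2

all.equal(solve(A %*% B), solve(B) %*% solve(A))     # product reverses
all.equal(solve(t(A)), t(solve(A)))                  # transpose and inverse commute
all.equal(solve(solve(A)), A)                        # double inverse
all.equal(solve(diag(2)), diag(2))                   # identity is its own inverse
```

`all.equal()` is used rather than `==` because inverses are computed in floating point.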

Linear Regression in Matrix Form

\[\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}\]

\[\mathbf{y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}\]

\[\mathbf{X} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{bmatrix}\]

Deriving the OLS Estimator

Same Objective: Minimize the sum of squared errors:

\[\min_{\mathbf{b}}\; \mathbf{e}^T\mathbf{e} = (\mathbf{y} - \mathbf{Xb})^T(\mathbf{y} - \mathbf{Xb})\]

Expand:

\[\mathbf{e}^T\mathbf{e} = \mathbf{y}^T\mathbf{y} - 2\mathbf{b}^T\mathbf{X}^T\mathbf{y} + \mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}\]

(using the fact that \(\mathbf{b}^T\mathbf{X}^T\mathbf{y}\) and \(\mathbf{y}^T\mathbf{X}\mathbf{b}\) are equal scalars)

The Normal Equations

Take the derivative and set to zero:

\[\frac{\partial\; \mathbf{e}^T\mathbf{e}}{\partial\; \mathbf{b}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{b} = 0\]

The normal equations:

\[\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{X}^T\mathbf{y}\]

Premultiply both sides by \((\mathbf{X}^T\mathbf{X})^{-1}\):

\[\boxed{\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}\]

This requires \(\mathbf{X}^T\mathbf{X}\) to be invertible — fails under perfect multicollinearity.

x <- c(1, 2, 3, 4, 5)
y <- c(5.1, 7.9, 11.2, 13.8, 17.1)
X <- cbind(1, x)    
b <- solve(t(X) %*% X) %*% t(X) %*% y
b                    # OLS estimates: intercept, then slope
  [,1]
  2.05
x 2.99
coef(lm(y ~ x))     # verify with lm()
(Intercept)           x 
       2.05        2.99 
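
A variant worth knowing, offered as a sketch rather than as the approach used in these slides: `solve(A, b)` solves the normal equations directly without forming the inverse, and `crossprod()` builds \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\). The same data also show that the residuals satisfy \(\mathbf{X}^T\mathbf{e} = \mathbf{0}\), which is just the normal equations rearranged.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(5.1, 7.9, 11.2, 13.8, 17.1)
X <- cbind(1, x)

# Solve (X'X) b = X'y without computing the inverse explicitly
b <- solve(crossprod(X), crossprod(X, y))
b

# Residuals are orthogonal to every column of X
e   <- y - X %*% b
Xte <- crossprod(X, e)
Xte                     # zero up to floating-point error
```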

An Example: Electoral Contestation

# Solution with lm()
lm(electoral_contestation ~ college, data = wss20) |>
  coef()
(Intercept)     college 
 2.84760467  0.05181289 
# Solution with matrix algebra
X <- cbind(1, wss20$college)
y <- wss20$electoral_contestation

b <- solve(t(X) %*% X) %*% t(X) %*% y
b
           [,1]
[1,] 2.84760467
[2,] 0.05181289

Summary

  1. Vectors encode magnitude and direction; the inner product measures covariance
  2. Matrices combine vectors; operations require conformability
  3. Transpose swaps rows/columns; transposing a product reverses order
  4. Inversion is the matrix analog of division; requires a nonsingular matrix
  5. The OLS estimator in matrix form: \(\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)
