Multiple Regression

Multiple Predictors, Dummy Variables, and Model Fit

Multiple variables

The OLS estimator is a point estimator. We predict a single point \(\hat{y}_i\), given a predictor \(x_i\). Let’s consider the SRF with multiple predictors:

\[ Y_i - \underbrace{(a + b_1 X_{1i} + b_2 X_{2i})}_{\hat{Y}_{i}} = e_i \]

We’ll still minimize \(\sum e_i^2\), but with two predictors the fitted surface is a plane rather than a line.

We can just use a little algebra to rewrite these equations. Let’s simplify things by writing each term in “deviation” form.
\(y_i=Y_i-\bar{Y}\)
\(x_{1i}=X_{1i}-\bar{X}_1\)
\(x_{2i}=X_{2i}-\bar{X}_2\)

Multiple variables

Setting \(\frac{\partial SSR} {\partial b_1}\) to 0

\[\begin{eqnarray*} 0&=& -2 \sum (Y_i-a-b_1X_{1i}-b_2 X_{2i}) (X_{1i}) \nonumber \\ b_1&=& \frac{\sum x_{1i} y_i-b_2\sum x_{1i} x_{2i}}{\sum x_{1i}^2} \nonumber \\ \end{eqnarray*}\]

Setting \(\frac{\partial SSR} {\partial b_2}\) to 0

\[\begin{eqnarray*} 0&=& -2 \sum (Y_i-a-b_1X_{1i}-b_2 X_{2i}) (X_{2i}) \nonumber \\ b_2&=& \frac{\sum x_{2i} y_i-b_1\sum x_{1i} x_{2i}}{\sum x_{2i}^2} \nonumber \\ \end{eqnarray*}\]

Multiple variables

Then,

\[\begin{eqnarray*} b_1&=& \frac{\sum y_i x_{1i} \sum x_{2i}^2 - \sum x_{1i} x_{2i} \sum x_{2i} y_i }{\sum x_{1i}^2 \sum x_{2i}^2 -(\sum x_{1i} x_{2i})^2} \nonumber \\ b_2&=& \frac{\sum y_i x_{2i} \sum x_{1i}^2 - \sum x_{1i} x_{2i} \sum x_{1i} y_i }{\sum x_{1i}^2 \sum x_{2i}^2 -(\sum x_{1i} x_{2i})^2} \nonumber \\ \end{eqnarray*}\]
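
As a quick numerical check of these closed-form expressions, the sketch below simulates two correlated predictors and compares the deviation-form formulas with `lm()`. The data, the seed, and the variable names are made up for illustration; the intercept is recovered from the sample means, \(a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2\).

```r
# Simulated data: every value here is an illustrative assumption
set.seed(1)
n  <- 200
X1 <- rnorm(n)
X2 <- 0.5 * X1 + rnorm(n)                 # correlated with X1
Y  <- 1 + 0.8 * X1 - 0.5 * X2 + rnorm(n)

# Deviation form
y  <- Y  - mean(Y)
x1 <- X1 - mean(X1)
x2 <- X2 - mean(X2)

# Closed-form slopes (shared denominator)
den <- sum(x1^2) * sum(x2^2) - sum(x1 * x2)^2
b1  <- (sum(y * x1) * sum(x2^2) - sum(x1 * x2) * sum(x2 * y)) / den
b2  <- (sum(y * x2) * sum(x1^2) - sum(x1 * x2) * sum(x1 * y)) / den
a   <- mean(Y) - b1 * mean(X1) - b2 * mean(X2)   # intercept from the means

by_hand <- c(a, b1, b2)
by_lm   <- unname(coef(lm(Y ~ X1 + X2)))
all.equal(by_hand, by_lm)   # should be TRUE
```

Note that the denominator shrinks toward zero as \(x_1\) and \(x_2\) become more strongly correlated, which is the scalar preview of the multicollinearity problem discussed later.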

Multiple variables

\(\hat{Y} = a + b_1 X_1 + b_2 X_2\)

\(a\) = 0.849
\(b_1\) = 0.78
\(b_2\) = -0.499

\(R^2\) = 0.618

Multiple Regression: Fitted Plane & Residuals

Matrix Algebra

Vectors, Matrices, and the OLS Estimator

Why Matrix Algebra?

Writing the SRF in matrix form makes it easier to solve
Quantitative social science aims to quantify relationships between multiple variables
Data is tabular: rows = observations, columns = variables
We need tools to solve systems of equations efficiently

\[\begin{bmatrix} y_{1}= & b_0+ b_1 x_{11}+b_2 x_{12}\\ y_{2}= & b_0+ b_1 x_{21}+b_2 x_{22}\\ \vdots\\ y_{n}= & b_0+ b_1 x_{n1}+b_2 x_{n2} \end{bmatrix}\]

\(n\) equations, fewer unknowns — linear algebra gives us the solution

Data as a Matrix

Each row is an observation; each column is a variable.

\[\begin{bmatrix} Vote & PID & Ideology \\\hline a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \vdots & \vdots & \vdots\\ a_{n1} & a_{n2} & a_{n3} \end{bmatrix}\]

This is an \(n \times 3\) matrix
First subscript = row, second = column
Notation: \(\mathbf{A}_{n \times 3}\)

A <- matrix(c(1, 0, 3,
              0, 1, 5,
              1, 1, 4), nrow = 3, byrow = TRUE)
colnames(A) <- c("Vote", "PID", "Ideology")
A

     Vote PID Ideology
[1,]    1   0        3
[2,]    0   1        5
[3,]    1   1        4

dim(A)

[1] 3 3

Vectors: The Building Blocks

Scalar: a single number (magnitude only)
Vector: multiple elements — encodes magnitude and direction

A vector \(\mathbf{a} \in \mathbb{R}^k\) has \(k\) elements.

Euclidean Distance between \(\mathbf{a}=[x_1,y_1]\) and \(\mathbf{b}=[x_2,y_2]\):

\[\text{Distance}(\mathbf{a},\mathbf{b}) = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\]

This is just the Pythagorean theorem!

a <- c(3, 2); b <- c(1, 1)
sqrt(sum((a - b)^2))  # Euclidean distance

[1] 2.236068

The Norm of a Vector

The norm measures the length (magnitude) of a vector from the origin:

\[\|\mathbf{a}\| = \sqrt{x_1^2 + y_1^2}\]

Dividing a vector by its norm gives a unit vector (length = 1)
Useful for standardization

a <- c(3, 2, 1)
sqrt(sum(a^2))       # norm of a

[1] 3.741657

a / sqrt(sum(a^2))   # unit vector

[1] 0.8017837 0.5345225 0.2672612

In higher dimensions (\(\mathbb{R}^3\)):

\[\|\mathbf{a}\| = \sqrt{x_1^2 + y_1^2 + z_1^2}\]

Vector Addition & Subtraction

Element-wise operations on conformable vectors (same length):

\[\mathbf{a} + \mathbf{b} = [3+1,\; 2+1,\; 1+1] = [4, 3, 2]\]

\[\mathbf{a} - \mathbf{b} = [3-1,\; 2-1,\; 1-1] = [2, 1, 0]\]

Properties:

Property	Statement
Commutative	\(\mathbf{a}+\mathbf{b}=\mathbf{b}+\mathbf{a}\)
Associative	\((\mathbf{a}+\mathbf{b})+\mathbf{c}=\mathbf{a}+(\mathbf{b}+\mathbf{c})\)
Distributive	\(c(\mathbf{a}+\mathbf{b})=c\mathbf{a}+c\mathbf{b}\)
Zero	\(\mathbf{a}+\mathbf{0}=\mathbf{a}\)

a <- c(3, 2, 1); b <- c(1, 1, 1)
a + b    # addition

[1] 4 3 2

a - b    # subtraction

[1] 2 1 0

Vector Multiplication

Inner (dot) product → produces a scalar (measures similarity / covariance)
Cross product → produces a vector (orthogonal to both inputs)
Outer product → produces a matrix

The Inner (Dot) Product

Multiply corresponding elements and sum:

\[\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i\]

For \(\mathbf{a}=[3,2,1]\) and \(\mathbf{b}=[1,1,1]\): \(\;\; 3(1)+2(1)+1(1) = 6\)

a <- c(3, 2, 1); b <- c(1, 1, 1)
sum(a * b)          # inner product

[1] 6

a %*% b             # same thing with matrix notation

     [,1]
[1,]    6

The inner product is a measure of covariance:

\[\text{cov}(x,y) = \frac{\text{inner product}(x-\bar{x},\; y-\bar{y})}{n-1}\]

\[r_{x,y} = \frac{\text{inner product}(x-\bar{x},\; y-\bar{y})}{\|x-\bar{x}\|\;\|y-\bar{y}\|}\]
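
A small check of both identities, using made-up values for `x` and `y` and comparing against R's built-in `cov()` and `cor()`:

```r
x <- c(2, 4, 6, 8, 10)   # illustrative values
y <- c(1, 3, 2, 5, 4)

xd <- x - mean(x)        # mean-deviated vectors
yd <- y - mean(y)

# Covariance: inner product of deviations, divided by n - 1
cov_ip <- sum(xd * yd) / (length(x) - 1)

# Correlation: inner product of deviations, divided by the product of norms
cor_ip <- sum(xd * yd) / (sqrt(sum(xd^2)) * sqrt(sum(yd^2)))

all.equal(cov_ip, cov(x, y))   # should be TRUE
all.equal(cor_ip, cor(x, y))   # should be TRUE
```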

Inner Product Rules

Property	Statement
Commutative	\(\mathbf{a} \cdot \mathbf{b} = \mathbf{b} \cdot \mathbf{a}\)
Associative	\(d(\mathbf{a} \cdot \mathbf{b}) = (d\mathbf{a}) \cdot \mathbf{b}\)
Distributive	\(\mathbf{c} \cdot (\mathbf{a}+\mathbf{b}) = \mathbf{c}\cdot\mathbf{a} + \mathbf{c}\cdot\mathbf{b}\)
Zero	\(\mathbf{a} \cdot \mathbf{0} = 0\)

The Cross Product

For \(\mathbf{a}=[3,2,1]\) and \(\mathbf{b}=[1,4,7]\):

Stack the vectors
Calculate \(2 \times 2\) determinants

\[\mathbf{a} \times \mathbf{b} = [2(7)-4(1),\;\; 1(1)-3(7),\;\; 3(4)-2(1)] = [10, -20, 10]\]

Result is orthogonal to both original vectors
Useful for determinants and matrix inversion
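
Base R has no built-in cross product for 3-element vectors (`crossprod()` computes \(\mathbf{X}^T\mathbf{Y}\), which is a different operation), so the sketch below defines a small helper. The name `cross3` is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: cross product of two length-3 vectors
cross3 <- function(a, b) {
  c(a[2] * b[3] - a[3] * b[2],
    a[3] * b[1] - a[1] * b[3],
    a[1] * b[2] - a[2] * b[1])
}

a <- c(3, 2, 1); b <- c(1, 4, 7)
v <- cross3(a, b)
v            # 10 -20 10, matching the hand calculation

# Orthogonality: the inner product with each input is zero
sum(v * a)   # 0
sum(v * b)   # 0
```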

The Outer Product

Transpose one vector, then multiply:

\[\begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 4 & 7 \end{bmatrix} = \begin{bmatrix} 3 & 12 & 21 \\ 2 & 8 & 14 \\ 1 & 4 & 7 \end{bmatrix}\]

Input: two vectors of length \(k\)
Output: a \(k \times k\) matrix

a <- c(3, 2, 1); b <- c(1, 4, 7)
a %o% b   # outer product

     [,1] [,2] [,3]
[1,]    3   12   21
[2,]    2    8   14
[3,]    1    4    7

Matrices

A matrix combines row or column vectors. Notation: bold uppercase (\(\mathbf{A}\)).

\[\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}\]

Matrix Types

Type	Definition
Square	Equal rows and columns
Symmetric	Same entries above and below the diagonal; \(\mathbf{A} = \mathbf{A}^T\)
Identity (\(\mathbf{I}\))	1s on diagonal, 0s off; \(\mathbf{AI} = \mathbf{A}\)
Idempotent	\(\mathbf{A}^2 = \mathbf{A}\)
Trace (a scalar summary, not a type)	Sum of the diagonal elements of a square matrix; for the \(n \times n\) identity, \(\text{tr}(\mathbf{I}) = n\)

Matrix Addition & Subtraction

Matrices must be conformable (same dimensions). Add/subtract element-wise:

\[\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} \\ a_{21}+b_{21} & a_{22}+b_{22} \end{bmatrix}\]

Properties: Commutative, Associative, Distributive, Zero

Matrix Multiplication

Order matters! \(\mathbf{AB} \neq \mathbf{BA}\) in general.

Multiply \(i\)-th row by \(j\)-th column:

\[\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} 3 & 5 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 1(3)+3(2) & 1(5)+3(4) \\ 2(3)+4(2) & 2(5)+4(4) \end{bmatrix} = \begin{bmatrix} 9 & 17 \\ 14 & 26 \end{bmatrix}\]

A <- matrix(c(1,3,2,4), nrow=2, byrow=TRUE)
B <- matrix(c(3,5,2,4), nrow=2, byrow=TRUE)
A %*% B

     [,1] [,2]
[1,]    9   17
[2,]   14   26

Conformability rule: columns of first = rows of second

\[\mathbf{A}_{m \times n} \times \mathbf{B}_{n \times p} = \mathbf{C}_{m \times p}\]

Inner dimensions must match; result has outer dimensions.

The Transpose

\(\mathbf{A}^T\) swaps rows and columns. If \(\mathbf{A}\) is \(m \times n\), then \(\mathbf{A}^T\) is \(n \times m\).

\[\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}\]

A <- matrix(c(1,2,3,4,5,6), nrow=2, byrow=TRUE)
t(A)         # transpose

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

t(A) %*% A   # A'A is always square & symmetric

     [,1] [,2] [,3]
[1,]   17   22   27
[2,]   22   29   36
[3,]   27   36   45

Key properties:

Property	Statement
Double transpose	\((\mathbf{A}^T)^T = \mathbf{A}\)
Sum	\((\mathbf{A}+\mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T\)
Product (reversal)	\((\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T\)

Why the Transpose Matters

Transposing a product reverses the order:

\[(\mathbf{ABC})^T = \mathbf{C}^T\mathbf{B}^T\mathbf{A}^T\]

Critical result: For any matrix \(\mathbf{A}\), the product \(\mathbf{A}^T\mathbf{A}\) is always:

Square (\(n \times n\) if \(\mathbf{A}\) is \(m \times n\))
Symmetric

This is exactly what \(\mathbf{X}^T\mathbf{X}\) produces in the normal equations.
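
Both facts are easy to confirm numerically. The matrices below are random and purely illustrative:

```r
set.seed(2)
A <- matrix(rnorm(6),  nrow = 2)   # 2 x 3
B <- matrix(rnorm(12), nrow = 3)   # 3 x 4

# Transposing a product reverses the order
all.equal(t(A %*% B), t(B) %*% t(A))   # should be TRUE

# A'A is square (3 x 3 here) and symmetric
AtA <- t(A) %*% A
dim(AtA)
isSymmetric(AtA)                       # should be TRUE
```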

The Determinant

The determinant is a scalar value computed from a square matrix. It’s necessary for matrix inversion (later).

For a \(2 \times 2\) matrix:

\[\det(\mathbf{A}) = \det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc\]

If \(\det(\mathbf{A}) \neq 0\): the matrix is nonsingular (invertible)
If \(\det(\mathbf{A}) = 0\): the matrix is singular (no inverse exists — columns are linearly dependent)

Example:

\[\det\begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix} = 4(6) - 7(2) = 10 \neq 0 \;\; ✓\]

\[\det\begin{bmatrix} 2 & 4 \\ 1 & 2 \end{bmatrix} = 2(2) - 4(1) = 0 \;\; \text{(singular — row 1 = 2 × row 2)}\]

A <- matrix(c(4,7,2,6), nrow=2, byrow=TRUE)
det(A)   # nonsingular

[1] 10

B <- matrix(c(2,4,1,2), nrow=2, byrow=TRUE)
det(B)   # singular — no inverse

[1] 0

For OLS: \(\det(\mathbf{X}^T\mathbf{X}) = 0\) means perfect multicollinearity — the columns of \(\mathbf{X}\) are linearly dependent, and we cannot solve for \(\mathbf{b}\).
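
A minimal sketch of what perfect multicollinearity looks like in practice, assuming a toy design matrix in which one predictor is an exact multiple of another (the values are made up):

```r
x1 <- c(1, 2, 3, 4, 5)
x2 <- 2 * x1                  # exact linear function of x1
X  <- cbind(1, x1, x2)
XtX <- t(X) %*% X

qr(X)$rank    # 2 rather than 3: one column is redundant
det(XtX)      # zero, up to floating-point error

# solve() is expected to stop with a "singular" error here;
# tryCatch() captures the message instead of halting the script
inv <- tryCatch(solve(XtX), error = function(e) conditionMessage(e))
inv
```

Checking the rank of \(\mathbf{X}\) is usually a more reliable diagnostic than the determinant, because a computed determinant can be a tiny nonzero number rather than exactly zero.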

Matrix Inversion

For scalars: \(a \cdot a^{-1} = 1\)

For matrices: \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\)

Requirements:

Only square matrices can have inverses
Must be nonsingular: \(\det(\mathbf{A}) \neq 0\)

The \(2 \times 2\) Inverse

\[\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \quad \Longrightarrow \quad \mathbf{A}^{-1} = \frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\]

Example:

\[\begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix}^{-1} = \frac{1}{10}\begin{bmatrix} 6 & -7 \\ -2 & 4 \end{bmatrix} = \begin{bmatrix} 0.6 & -0.7 \\ -0.2 & 0.4 \end{bmatrix}\]

\[\mathbf{A}\mathbf{A}^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \mathbf{I} \;\; ✓\]

A <- matrix(c(4,7,2,6), nrow=2, byrow=TRUE)
solve(A)          # inverse

     [,1] [,2]
[1,]  0.6 -0.7
[2,] -0.2  0.4

A %*% solve(A)    # verify: should be identity

              [,1]          [,2]
[1,]  1.000000e+00 -1.110223e-16
[2,] -1.110223e-16  1.000000e+00

Properties of the Inverse

Property	Statement
Product (reversal)	\((\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\)
Transpose	\((\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T\)
Double inverse	\((\mathbf{A}^{-1})^{-1} = \mathbf{A}\)
Identity	\(\mathbf{I}^{-1} = \mathbf{I}\)

Like the transpose, inverting a product reverses the order.
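
These properties can be verified with `solve()`. Matrix `A` is the example from above; `B` is an arbitrary nonsingular matrix chosen for illustration (its determinant is 5).

```r
A <- matrix(c(4, 7, 2, 6), nrow = 2, byrow = TRUE)
B <- matrix(c(2, 1, 1, 3), nrow = 2, byrow = TRUE)   # illustrative values

all.equal(solve(A %*% B), solve(B) %*% solve(A))   # product reversal
all.equal(solve(t(A)), t(solve(A)))                # transpose
all.equal(solve(solve(A)), A)                      # double inverse
all.equal(solve(diag(2)), diag(2))                 # identity
```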

Linear Regression in Matrix Form

\[\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}\]

\[\mathbf{y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}\]

\[\mathbf{X} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{bmatrix}\]

Deriving the OLS Estimator

Same Objective: Minimize the sum of squared errors. In scalar form, we minimized \(\sum e_i^2\). In matrix form:

\[\min_{\mathbf{b}}\; \mathbf{e}^T\mathbf{e} = (\mathbf{y} - \mathbf{Xb})^T(\mathbf{y} - \mathbf{Xb})\]

Expand using two transpose rules: the sum rule, \((\mathbf{A} - \mathbf{B})^T = \mathbf{A}^T - \mathbf{B}^T\), and the product reversal rule, \((\mathbf{Xb})^T = \mathbf{b}^T\mathbf{X}^T\):

\[\mathbf{e}^T\mathbf{e} = (\mathbf{y}^T - \mathbf{b}^T\mathbf{X}^T)(\mathbf{y} - \mathbf{Xb})\]

Multiply out the terms:

\[= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{Xb} - \mathbf{b}^T\mathbf{X}^T\mathbf{y} + \mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}\]

The middle two terms are both scalars (a \(1 \times 1\) result), and a scalar equals its own transpose: \(\mathbf{y}^T\mathbf{Xb} = (\mathbf{b}^T\mathbf{X}^T\mathbf{y})^T = \mathbf{b}^T\mathbf{X}^T\mathbf{y}\). So they combine:

\[\mathbf{e}^T\mathbf{e} = \mathbf{y}^T\mathbf{y} - 2\mathbf{b}^T\mathbf{X}^T\mathbf{y} + \mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}\]
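
The expansion holds for any coefficient vector, not only the OLS solution. The sketch below checks it on simulated data with an arbitrary `b`; all values are assumptions made for the illustration.

```r
set.seed(3)
n <- 20
X <- cbind(1, rnorm(n), rnorm(n))
y <- rnorm(n)
b <- c(0.5, -1, 2)    # arbitrary candidate coefficients, not the OLS solution

# Left-hand side: e'e computed directly from the residuals
e   <- y - X %*% b
lhs <- drop(t(e) %*% e)

# Right-hand side: y'y - 2 b'X'y + b'X'Xb
rhs <- drop(t(y) %*% y - 2 * t(b) %*% t(X) %*% y + t(b) %*% t(X) %*% X %*% b)

all.equal(lhs, rhs)   # should be TRUE
```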

The Derivative.

In scalar calculus, \(\frac{d}{db} f(b)\) gives a single number. In matrix calculus, \(\frac{\partial f}{\partial \mathbf{b}}\) gives a vector of derivatives, one for each element of \(\mathbf{b}\) (the intercept and the slopes).

For instance, if \(\mathbf{b} = [b_0, b_1, \ldots, b_k]^T\), then:

\[\frac{\partial f}{\partial \mathbf{b}} = \begin{bmatrix} \frac{\partial f}{\partial b_0} \\ \frac{\partial f}{\partial b_1} \\ \vdots \\ \frac{\partial f}{\partial b_k} \end{bmatrix}\]

This vector of partial derivatives is called the gradient. Setting it to \(\mathbf{0}\) means every partial derivative equals zero simultaneously.

Why does this matter? In the bivariate case, we took two separate derivatives (\(\frac{\partial SSR}{\partial a}\) and \(\frac{\partial SSR}{\partial b}\)) and solved two equations. The gradient is effectively the same thing — but for all \(k+1\) coefficients simultaneously.

Matrix Derivative Rules

We need three rules, each of which corresponds to a rule from scalar calculus:

Scalar Rule	Matrix Rule
\(\frac{d}{db}(c) = 0\)	\(\frac{\partial}{\partial \mathbf{b}}(\mathbf{y}^T\mathbf{y}) = \mathbf{0}\)
\(\frac{d}{db}(cb) = c\)	\(\frac{\partial}{\partial \mathbf{b}}(\mathbf{c}^T\mathbf{b}) = \mathbf{c}\)
\(\frac{d}{db}(b^2 a) = 2ab\)	\(\frac{\partial}{\partial \mathbf{b}}(\mathbf{b}^T\mathbf{A}\mathbf{b}) = 2\mathbf{A}\mathbf{b}\) (if \(\mathbf{A}\) is symmetric)

The third rule requires \(\mathbf{A}\) to be symmetric. Since \(\mathbf{X}^T\mathbf{X}\) is always symmetric (recall: \((\mathbf{X}^T\mathbf{X})^T = \mathbf{X}^T\mathbf{X}\)), the rule applies.

Applying the Rules

Our function is: \(\;\mathbf{e}^T\mathbf{e} = \underbrace{\mathbf{y}^T\mathbf{y}}_{\text{constant}} - \underbrace{2\mathbf{b}^T\mathbf{X}^T\mathbf{y}}_{\text{linear in } \mathbf{b}} + \underbrace{\mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}}_{\text{quadratic in } \mathbf{b}}\)

Term by term:

Term	Rule Applied	Derivative w.r.t. \(\mathbf{b}\)
\(\mathbf{y}^T\mathbf{y}\)	Constant → 0	\(\mathbf{0}\)
\(-2\mathbf{b}^T\mathbf{X}^T\mathbf{y}\)	Linear: \(\frac{\partial}{\partial \mathbf{b}}(\mathbf{c}^T\mathbf{b}) = \mathbf{c}\), where \(\mathbf{c} = \mathbf{X}^T\mathbf{y}\)	\(-2\mathbf{X}^T\mathbf{y}\)
\(\mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}\)	Quadratic: \(\frac{\partial}{\partial \mathbf{b}}(\mathbf{b}^T\mathbf{A}\mathbf{b}) = 2\mathbf{A}\mathbf{b}\), where \(\mathbf{A} = \mathbf{X}^T\mathbf{X}\)	\(2\mathbf{X}^T\mathbf{X}\mathbf{b}\)

Combining: \(\;\frac{\partial\; \mathbf{e}^T\mathbf{e}}{\partial\; \mathbf{b}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{b}\)
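
One way to build confidence in this gradient is to compare it with finite differences. The sketch below does that on simulated data; the data, the evaluation point `b`, and the step size `h` are all assumptions chosen for illustration.

```r
set.seed(4)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))
y <- rnorm(n)
b <- c(0.2, 0.4, -0.3)     # arbitrary point at which to evaluate the gradient

ssr <- function(b) sum((y - X %*% b)^2)

# Analytic gradient: -2 X'y + 2 X'X b
analytic <- drop(-2 * t(X) %*% y + 2 * t(X) %*% X %*% b)

# Central finite differences, one coefficient at a time
h <- 1e-6
numerical <- sapply(seq_along(b), function(j) {
  step <- rep(0, length(b))
  step[j] <- h
  (ssr(b + step) - ssr(b - step)) / (2 * h)
})

all.equal(analytic, numerical, tolerance = 1e-6)   # should be TRUE

# At the OLS solution the gradient is zero up to rounding error
b_ols    <- solve(t(X) %*% X, t(X) %*% y)
grad_ols <- drop(-2 * t(X) %*% y + 2 * t(X) %*% X %*% b_ols)
grad_ols
```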

Setting the Derivative to Zero

Set the gradient equal to the zero vector:

\[\frac{\partial\; \mathbf{e}^T\mathbf{e}}{\partial\; \mathbf{b}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{0}\]

Rearrange (move \(-2\mathbf{X}^T\mathbf{y}\) to the right, divide by 2):

\[\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{X}^T\mathbf{y}\]

These are the normal equations — the matrix version of the first-order conditions.

Solve for \(\mathbf{b}\) by multiplying both sides on the left by \((\mathbf{X}^T\mathbf{X})^{-1}\):

\[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

Since \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X} = \mathbf{I}\) and \(\mathbf{Ib} = \mathbf{b}\):

\[\boxed{\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}\]

This requires \(\mathbf{X}^T\mathbf{X}\) to be invertible — fails under perfect multicollinearity.

x <- c(1, 2, 3, 4, 5)
y <- c(5.1, 7.9, 11.2, 13.8, 17.1)
X <- cbind(1, x)
b <- solve(t(X) %*% X) %*% t(X) %*% y
b                    # OLS

  [,1]
  2.05
x 2.99

coef(lm(y ~ x))     # verify with lm()

(Intercept)           x 
       2.05        2.99
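
The explicit inverse mirrors the formula, but in practice it is usually better to solve the normal equations as a linear system, or to use a QR decomposition (which is what `lm()` relies on). A sketch with the same toy data, offered as an alternative rather than as the method used in these slides:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(5.1, 7.9, 11.2, 13.8, 17.1)
X <- cbind(1, x)

b_inv   <- solve(t(X) %*% X) %*% t(X) %*% y       # explicit inverse, as in the formula
b_solve <- solve(crossprod(X), crossprod(X, y))   # solves X'X b = X'y without inverting
b_qr    <- qr.solve(X, y)                         # least squares via QR

all.equal(as.vector(b_inv), as.vector(b_solve))   # should be TRUE
all.equal(as.vector(b_inv), as.vector(b_qr))      # should be TRUE
```

All three give the same coefficients here; the differences only matter numerically when \(\mathbf{X}^T\mathbf{X}\) is close to singular.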

An Example: Electoral Contestation

## Solution with LM
lm(electoral_contestation ~ authoritarianism, data = wss20) |>
 coef()

     (Intercept) authoritarianism 
        3.110123        -0.554824

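# Solution with matrix algebra
# Note: this block uses college as the predictor, not authoritarianism,
# so the coefficients below differ from the lm() fit above
X <- cbind(1, wss20$college)
y <- wss20$electoral_contestation

b <- solve(t(X) %*% X) %*% t(X) %*% y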
b

           [,1]
[1,] 2.84760467
[2,] 0.05181289

Dummy Variable Regression

\(x\) can be qualitative or quantitative — a dummy variable encodes group membership as 0 or 1
With \(k\) categories, include \(k-1\) dummies; the omitted group is the reference category
Why omit one? All \(k\) dummies sum to a column of ones — perfectly collinear with the intercept

Here, “Independent” is the excluded category — coefficients for “Republican” and “Democrat” are differences relative to Independents.

lm(electoral_contestation ~ republican + democrat + authoritarianism, data = wss20) |>
  summary()


Call:
lm(formula = electoral_contestation ~ republican + democrat + 
    authoritarianism, data = wss20)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.22347 -0.50962 -0.02344  0.43179  2.60419 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.03857    0.03543   85.77  < 2e-16 ***
republican        0.18489    0.03640    5.08 3.98e-07 ***
democrat         -0.18751    0.03859   -4.86 1.23e-06 ***
authoritarianism -0.45526    0.03956  -11.51  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.767 on 3427 degrees of freedom
  (169 observations deleted due to missingness)
Multiple R-squared:  0.09947,   Adjusted R-squared:  0.09868 
F-statistic: 126.2 on 3 and 3427 DF,  p-value: < 2.2e-16

Dummy Variable Regression: Intercept Shifts

\(E(Y \mid \text{Independent}) = a + b_{\text{Auth}} \cdot X\) — the intercept alone
\(E(Y \mid \text{Republican}) = (a + b_{\text{Rep}}) + b_{\text{Auth}} \cdot X\) — intercept shifts
\(E(Y \mid \text{Democrat}) = (a + b_{\text{Dem}}) + b_{\text{Auth}} \cdot X\) — a different shift

The slope on authoritarianism is the same for every group — only the intercept moves.
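
The sketch below reproduces the intercept-shift logic on simulated data. It does not use `wss20`; the group labels, sample size, and true coefficients are assumptions made for illustration. It also shows why one category must be omitted: with all three dummies plus the intercept, the design matrix loses a rank.

```r
set.seed(5)
n     <- 300
party <- sample(c("Independent", "Republican", "Democrat"), n, replace = TRUE)
x     <- runif(n)                                   # stand-in continuous predictor
republican  <- as.numeric(party == "Republican")
democrat    <- as.numeric(party == "Democrat")
independent <- as.numeric(party == "Independent")   # reference group, left out of the model
y <- 3 + 0.2 * republican - 0.2 * democrat - 0.5 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ republican + democrat + x)
b   <- coef(fit)

# Group-specific intercepts: the reference intercept plus each dummy coefficient
intercepts <- c(Independent = unname(b["(Intercept)"]),
                Republican  = unname(b["(Intercept)"] + b["republican"]),
                Democrat    = unname(b["(Intercept)"] + b["democrat"]))
intercepts
unname(b["x"])    # one common slope for all three groups

# Dummy variable trap: intercept + all three dummies are linearly dependent
X_all <- cbind(1, independent, republican, democrat, x)
qr(X_all)$rank    # 4, although X_all has 5 columns
```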

Revisiting Model Fit: \(R^2\) in Multiple Regression

Recall the decomposition: \(TSS = RegSS + RSS\)

\[R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

Problem: \(R^2\) never decreases when you add a predictor — even a useless one. Adding noise variables inflates \(R^2\).
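
A small simulation makes the point concrete. Here `y` depends only on `x`, and ten pure-noise predictors are added one at a time; everything in the sketch is simulated and illustrative.

```r
set.seed(6)
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
noise <- matrix(rnorm(n * 10), nrow = n)   # 10 predictors unrelated to y

# R-squared with 0, 1, ..., 10 noise predictors added to the model
r2 <- sapply(0:10, function(j) {
  fit <- if (j == 0) lm(y ~ x) else lm(y ~ x + noise[, 1:j, drop = FALSE])
  summary(fit)$r.squared
})

round(r2, 4)
all(diff(r2) >= 0)   # R-squared never goes down as predictors are added
```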

Adjusted \(R^2\)

The adjusted \(R^2\) penalizes for model complexity:

\[\bar{R}^2 = 1 - \frac{RSS / (n - k - 1)}{TSS / (n - 1)} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}\]

\(k\) = number of predictors, \(n\) = sample size
Unlike \(R^2\), adjusted \(R^2\) can decrease if a new predictor doesn’t improve the model enough to offset the lost degree of freedom
When \(k\) is large relative to \(n\), the penalty is substantial
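
Both forms of the formula can be checked by hand against `summary()`. The data below are simulated for illustration, with `x2` deliberately irrelevant to `y`.

```r
set.seed(7)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)               # irrelevant predictor
y  <- 2 + x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
k   <- 2                     # number of predictors (excluding the intercept)

rss <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (rss / (n - k - 1)) / (tss / (n - 1))

all.equal(r2, summary(fit)$r.squared)                     # should be TRUE
all.equal(adj_r2, summary(fit)$adj.r.squared)             # should be TRUE
all.equal(adj_r2, 1 - (1 - r2) * (n - 1) / (n - k - 1))   # the two forms agree
```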

Model Fit: Example

## Bivariate model
fit1 <- lm(electoral_contestation ~ authoritarianism, data = wss20)
## Multiple regression with dummies
fit2 <- lm(electoral_contestation ~ authoritarianism + republican + democrat, data = wss20)

data.frame(
  Model = c("Authoritarianism only", "+ Party ID dummies"),
  R2 = c(summary(fit1)$r.squared, summary(fit2)$r.squared),
  Adj_R2 = c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared),
  k = c(1, 3)
) |> knitr::kable(digits = 4)

Model	R2	Adj_R2	k
Authoritarianism only	0.0556	0.0553	1
+ Party ID dummies	0.0995	0.0987	3

The Overfitting Problem

In-sample fit (\(R^2\)) measures how well the model explains the data it was trained on. Call this “training set” performance.
- Adding predictors never worsens in-sample fit, even if they are irrelevant. They won’t help predict new data, but they will reduce residuals in the training set.
- With enough predictors, you can fit the training data perfectly, but this is just memorizing noise, not learning signal.
- A model with \(n - 1\) predictors (plus an intercept) and \(n\) observations achieves \(R^2 = 1\).

Out-of-sample prediction measures how well the model generalizes to new, unseen data. Call this “test set” performance.
- An overfit model memorizes noise in the training data.
- It performs well in-sample but poorly out-of-sample.
- The gap between in-sample and out-of-sample performance is a diagnostic for overfitting.

Key point: A good model captures signal, not noise. \(R^2\) alone cannot distinguish between the two.
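
A simulated train/test comparison illustrates the gap. The sample sizes, the number of noise predictors, and the data-generating process are assumptions chosen to make overfitting easy to see; no expected numbers are shown because they depend on the random draw.

```r
set.seed(8)
n_train <- 50
n_test  <- 1000
p_noise <- 40

# Hypothetical data generator: y depends on x only; z1..z40 are pure noise
make_data <- function(n) {
  x <- rnorm(n)
  Z <- matrix(rnorm(n * p_noise), nrow = n,
              dimnames = list(NULL, paste0("z", 1:p_noise)))
  data.frame(y = 1 + 0.5 * x + rnorm(n), x = x, Z)
}
train <- make_data(n_train)
test  <- make_data(n_test)

small <- lm(y ~ x, data = train)
big   <- lm(y ~ ., data = train)    # x plus all 40 noise predictors

mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

data.frame(
  model     = c("x only", "x + 40 noise predictors"),
  train_mse = c(mse(small, train), mse(big, train)),
  test_mse  = c(mse(small, test),  mse(big, test))
)
```

The larger model is guaranteed to have the lower training MSE; whether its test MSE is worse is an empirical question, though with this many noise predictors relative to \(n\) it typically is.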

K-Fold Cross-Validation

Cross-validation estimates out-of-sample prediction error using only the available data (Hastie, Tibshirani, & Friedman, 2009, §7.10, “Cross-Validation”, p. 241).

Procedure:

Randomly partition the data into \(K\) roughly equal-sized folds
For each fold \(k = 1, \ldots, K\):
- Train the model on all data except fold \(k\)
- Predict on fold \(k\) (the held-out data)
- Compute prediction error: \(\text{MSE}_k = \frac{1}{n_k}\sum_{i \in \text{fold } k}(Y_i - \hat{Y}_i)^2\). Do this for each fold
Average across folds: \(\text{CV}_{(K)} = \frac{1}{K}\sum_{k=1}^{K} \text{MSE}_k\)

There is no simple rule for choosing the number of folds; \(K = 5\) and \(K = 10\) are typical choices. When \(K = n\), this is leave-one-out cross-validation (LOOCV).
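
The procedure above can be sketched in a few lines of base R. The function `kfold_mse` is a hypothetical helper written for this illustration, not the code behind the results in these slides, and it assumes the data contain no missing values.

```r
# Hypothetical helper: K-fold cross-validated MSE for a linear model
kfold_mse <- function(formula, data, K = 5, seed = 1) {
  set.seed(seed)                                       # same seed, same folds across models
  n        <- nrow(data)
  folds    <- sample(rep(seq_len(K), length.out = n))  # random labels, roughly equal sizes
  response <- all.vars(formula)[1]

  mse_k <- sapply(seq_len(K), function(k) {
    train <- data[folds != k, , drop = FALSE]          # all data except fold k
    test  <- data[folds == k, , drop = FALSE]          # held-out fold
    fit   <- lm(formula, data = train)
    mean((test[[response]] - predict(fit, newdata = test))^2)
  })
  mean(mse_k)                                          # average across folds
}

# Simulated example: z is unrelated to y
set.seed(9)
d   <- data.frame(x = rnorm(200), z = rnorm(200))
d$y <- 1 + 0.5 * d$x + rnorm(200)

cv_small <- kfold_mse(y ~ x,     d, K = 5)
cv_big   <- kfold_mse(y ~ x + z, d, K = 5)
c(cv_small, cv_big)
```

Fixing the seed inside the helper means both models are evaluated on identical folds, so differences in CV-MSE reflect the models rather than the partition.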

K-Fold Cross-Validation

Each row is one iteration: the orange block is held out for testing, the blue blocks are used for training.

K-Fold Cross-Validation: Example

Model	CV_MSE
Authoritarianism only	0.6143
+ Party ID dummies	0.5890

Here the fuller model has the lower CV-MSE (0.5890 vs. 0.6143), so the party ID dummies improve out-of-sample prediction, not just in-sample fit.

The Lewis-Beck vs. Achen Debate over \(R^2\)

Scholars occasionally disagree about the role of \(R^2\) in evaluating models.

Lewis-Beck & Skalaban (1990): \(R^2\) is a useful and informative measure of model fit.

A high \(R^2\) indicates that the model accounts for a substantial share of variance in \(Y\)
Comparing \(R^2\) across models helps assess whether new predictors contribute
In applied work, \(R^2\) provides a meaningful summary of explanatory power

Achen (1982, 1990): \(R^2\) is misleading and overemphasized.

\(R^2\) depends on the variance of \(X\) in the sample (Remember this calculation?) — the same causal effect can yield very different \(R^2\) values in different datasets
Researchers can inflate \(R^2\) by choosing samples with high variance in \(X\), or by adding irrelevant predictors
\(R^2\) does not measure causal impact; the magnitude, sign, and uncertainty of a coefficient estimate matter more

The Debate: Implications

Achen’s argument is that we should focus attention on coefficient estimates and their standard errors, not on \(R^2\). A model with a small \(R^2\) can still identify an important causal effect (e.g., genes and voting); a model with a large \(R^2\) can be theoretically vacuous (predictions in small samples).

Middle Ground:

\(R^2\) is useful for prediction — how well does the model forecast \(Y\)?
\(R^2\) is less useful for testing substantive theories about variables and their relationships, particularly causal relationships.
Adjusted \(R^2\) and cross-validation are better tools for model comparison than raw \(R^2\)
Report \(R^2\) but don’t overemphasize it; substantive interpretation of coefficients comes first

Summary

Multiple regression extends OLS to multiple predictors: \(\hat{Y} = a + b_1 X_1 + b_2 X_2 + \ldots\)
Dummy variables encode categorical data as 0/1; omit one category to avoid perfect collinearity
The omitted group is the reference category; coefficients are differences from it
\(R^2\) never decreases when predictors are added; adjusted \(R^2\) penalizes for complexity
Overfitting: high in-sample \(R^2\) does not guarantee good out-of-sample prediction
K-fold cross-validation estimates true predictive performance by repeatedly holding out data
The Lewis-Beck/Achen debate: \(R^2\) is useful for prediction, but coefficients matter more for causal inference

References

Achen, Christopher H. 1982. Interpreting and Using Regression. Sage.
Achen, Christopher H. 1990. “What Does ‘Explained Variance’ Explain?” Political Analysis 2: 173–184.
Gill, Jeff. 2006. Essential Mathematics for Political and Social Research. Cambridge.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer.
Lewis-Beck, Michael S. and Andrew Skalaban. 1990. “The R-Squared: Some Straight Talk.” Political Analysis 2: 153–171.
Moore, Will and David Siegel. 2013. A Mathematics Course for Political and Social Research. Princeton.