Multiple Predictors, Dummy Variables, and Model Fit
The OLS estimator is a point estimator. We predict a single point \(\hat{y}_i\), given a predictor \(x_i\). Let’s consider the SRF with multiple predictors:
\[ Y_i - \underbrace{(a + b_1 X_1 + b_2 X_2)}_{\hat{Y}_{i}} = e_i \]
We still choose the coefficients that minimize \(\sum e_i^2\); with two predictors, the fitted surface is a plane rather than a line.
We can just use a little algebra to rewrite these equations. Let’s simplify things by writing each term in “deviation” form.
\(y_i=Y_i-\bar{Y}\)
\(x_{1i}=X_{1i}-\bar{X_1}\)
\(x_{2i}=X_{2i}-\bar{X_2}\)
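Setting \(\frac{\partial SSR}{\partial b_1}\) to 0, and substituting \(a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2\) (from \(\frac{\partial SSR}{\partial a} = 0\)) to express the result in deviation form: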
\[\begin{eqnarray*} 0&=& -2 \sum (Y_i-a-b_1X_{1i}-b_2 X_{2i}) (X_{1i}) \nonumber \\ b_1&=& \frac{\sum x_{1i} y_i-b_2\sum x_{1i} x_{2i}}{\sum x_{1i}^2} \nonumber \\ \end{eqnarray*}\]
Setting \(\frac{\partial SSR}{\partial b_2}\) to 0:
\[\begin{eqnarray*} 0&=& -2 \sum (Y_i-a-b_1X_{1i}-b_2 X_{2i}) (X_{2i}) \nonumber \\ b_2&=& \frac{\sum x_{2i} y_i-b_1\sum x_{1i} x_{2i}}{\sum x_{2i}^2} \nonumber \\ \end{eqnarray*}\]
Then,
\[\begin{eqnarray*} b_1&=& \frac{\sum y_i x_{1i} \sum x_{2i}^2 - \sum x_{1i} x_{2i} \sum x_{2i} y_i }{\sum x_{1i}^2 \sum x_{2i}^2 -(\sum x_{1i} x_{2i})^2} \nonumber \\ b_2&=& \frac{\sum y_i x_{2i} \sum x_{1i}^2 - \sum x_{1i} x_{2i} \sum x_{1i} y_i }{\sum x_{1i}^2 \sum x_{2i}^2 -(\sum x_{1i} x_{2i})^2} \nonumber \\ \end{eqnarray*}\]
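The closed-form slopes can be checked numerically. The sketch below uses simulated data (the variable names and true coefficients are illustrative assumptions, not the data used elsewhere in these notes) and recovers the intercept from \(a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2\), which follows from the first-order condition for \(a\).

```r
# Simulated data: an assumption for illustration only
set.seed(1)
n  <- 200
X1 <- rnorm(n)
X2 <- 0.5 * X1 + rnorm(n)          # correlated with X1
Y  <- 1 + 2 * X1 - 1 * X2 + rnorm(n)

# Deviation form
y  <- Y  - mean(Y)
x1 <- X1 - mean(X1)
x2 <- X2 - mean(X2)

# Closed-form slopes from the two first-order conditions
den <- sum(x1^2) * sum(x2^2) - sum(x1 * x2)^2
b1  <- (sum(y * x1) * sum(x2^2) - sum(x1 * x2) * sum(x2 * y)) / den
b2  <- (sum(y * x2) * sum(x1^2) - sum(x1 * x2) * sum(x1 * y)) / den
a   <- mean(Y) - b1 * mean(X1) - b2 * mean(X2)

by_hand <- c(a, b1, b2)
by_lm   <- unname(coef(lm(Y ~ X1 + X2)))
all.equal(by_hand, by_lm)          # the two sets of estimates should agree
```

The shared denominator shrinks toward zero as \(X_1\) and \(X_2\) become more highly correlated, which is the scalar preview of the multicollinearity problem discussed later.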
\(\hat{Y} = a + b_1 X_1 + b_2 X_2\)
\(R^2\) = 0.618
Vectors, Matrices, and the OLS Estimator
\[\begin{bmatrix} y_{1}= & b_0+ b_1 x_{11}+b_2 x_{12}\\ y_{2}= & b_0+ b_1 x_{21}+b_2 x_{22}\\ \vdots\\ y_{n}= & b_0+ b_1 x_{n1}+b_2 x_{n2} \end{bmatrix}\]
Each row is an observation; each column is a variable.
\[\begin{bmatrix} Vote & PID & Ideology \\\hline a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \vdots & \vdots & \vdots\\ a_{n1} & a_{n2} & a_{n3} \end{bmatrix}\]
A vector \(\mathbf{a} \in \mathbb{R}^k\) has \(k\) elements.
The norm measures the length (magnitude) of a vector from the origin:
\[\|\mathbf{a}\| = \sqrt{x_1^2 + y_1^2}\]
[1] 3.741657
[1] 0.8017837 0.5345225 0.2672612
In higher dimensions (\(\mathbb{R}^3\)):
\[\|\mathbf{a}\| = \sqrt{x_1^2 + y_1^2 + z_1^2}\]
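A minimal sketch of the norm in R, assuming the example vector \(\mathbf{a} = [3, 2, 1]\) used later in these notes. The printed values shown earlier (3.741657 and the three-element vector) are consistent with this vector's norm and unit vector, but the original code is not shown, so treat this as a reconstruction.

```r
a <- c(3, 2, 1)

norm_a <- sqrt(sum(a^2))   # Euclidean length: sqrt(9 + 4 + 1) = sqrt(14)
unit_a <- a / norm_a       # rescale to length 1 (a unit vector)

norm_a                     # 3.741657
unit_a
sqrt(sum(unit_a^2))        # length of the unit vector is 1
```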
Element-wise operations on conformable vectors (same length):
\[\mathbf{a} + \mathbf{b} = [3+1,\; 2+1,\; 1+1] = [4, 3, 2]\]
\[\mathbf{a} - \mathbf{b} = [3-1,\; 2-1,\; 1-1] = [2, 1, 0]\]
Properties:
| Property | Statement |
|---|---|
| Commutative | \(\mathbf{a}+\mathbf{b}=\mathbf{b}+\mathbf{a}\) |
| Associative | \((\mathbf{a}+\mathbf{b})+\mathbf{c}=\mathbf{a}+(\mathbf{b}+\mathbf{c})\) |
| Distributive | \(c(\mathbf{a}+\mathbf{b})=c\mathbf{a}+c\mathbf{b}\) |
| Zero | \(\mathbf{a}+\mathbf{0}=\mathbf{a}\) |
Multiply corresponding elements and sum:
\[\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i\]
For \(\mathbf{a}=[3,2,1]\) and \(\mathbf{b}=[1,1,1]\): \(\;\; 3(1)+2(1)+1(1) = 6\)
[1] 6
[,1]
[1,] 6
The inner product of mean-centered vectors is the numerator of both the covariance and the correlation:
\[\text{cov}(x,y) = \frac{\text{inner product}(x-\bar{x},\; y-\bar{y})}{n-1}\]
\[r_{x,y} = \frac{\text{inner product}(x-\bar{x},\; y-\bar{y})}{\|x-\bar{x}\|\;\|y-\bar{y}\|}\]
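The link between inner products, covariance, and correlation can be sketched with simulated vectors (the data here are an assumption for illustration). The hand-computed values are compared against R's built-in `cov()` and `cor()`.

```r
set.seed(2)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50)

# Mean-center both vectors
xd <- x - mean(x)
yd <- y - mean(y)

inner <- sum(xd * yd)                    # inner product; same as drop(t(xd) %*% yd)

cov_hand <- inner / (length(x) - 1)
r_hand   <- inner / (sqrt(sum(xd^2)) * sqrt(sum(yd^2)))

all.equal(cov_hand, cov(x, y))
all.equal(r_hand, cor(x, y))
```

Geometrically, the correlation is the cosine of the angle between the two centered vectors, which is why it is bounded between \(-1\) and \(1\).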
| Property | Statement |
|---|---|
| Commutative | \(\mathbf{a} \cdot \mathbf{b} = \mathbf{b} \cdot \mathbf{a}\) |
| Associative | \(d(\mathbf{a} \cdot \mathbf{b}) = (d\mathbf{a}) \cdot \mathbf{b}\) |
| Distributive | \(\mathbf{c} \cdot (\mathbf{a}+\mathbf{b}) = \mathbf{c}\cdot\mathbf{a} + \mathbf{c}\cdot\mathbf{b}\) |
| Zero | \(\mathbf{a} \cdot \mathbf{0} = 0\) |
For \(\mathbf{a}=[3,2,1]\) and \(\mathbf{b}=[1,4,7]\):
\[\mathbf{a} \times \mathbf{b} = [2(7)-4(1),\;\; 1(1)-3(7),\;\; 3(4)-2(1)] = [10, -20, 10]\]
Transpose one vector, then multiply:
\[\begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 4 & 7 \end{bmatrix} = \begin{bmatrix} 3 & 12 & 21 \\ 2 & 8 & 14 \\ 1 & 4 & 7 \end{bmatrix}\]
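Both worked examples above can be reproduced in R. Base R has no built-in cross product for 3-vectors, so `cross3()` below is a hypothetical helper written for this sketch; the outer product uses the built-in `%o%` operator.

```r
a <- c(3, 2, 1)
b <- c(1, 4, 7)

# Hypothetical helper: cross product of two 3-element vectors
cross3 <- function(u, v) {
  c(u[2] * v[3] - u[3] * v[2],
    u[3] * v[1] - u[1] * v[3],
    u[1] * v[2] - u[2] * v[1])
}

cp <- cross3(a, b)   # 10 -20 10
sum(cp * a)          # 0: the cross product is orthogonal to a
sum(cp * b)          # 0: and to b

op <- a %o% b        # outer product, equivalent to outer(a, b)
op
```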
A matrix combines row or column vectors. Notation: bold uppercase (\(\mathbf{A}\)).
\[\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}\]
| Type | Definition |
|---|---|
| Square | Equal rows and columns |
| Symmetric | Same entries above and below the diagonal; \(\mathbf{A} = \mathbf{A}^T\) |
| Identity (\(\mathbf{I}\)) | 1s on diagonal, 0s off; \(\mathbf{AI} = \mathbf{A}\) |
| Idempotent | \(\mathbf{A}^2 = \mathbf{A}\) |
| Trace | Sum of diagonal elements of a square matrix; for the \(n \times n\) identity, \(\text{tr}(\mathbf{I}) = n\) |
Matrices must be conformable (same dimensions). Add/subtract element-wise:
\[\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} \\ a_{21}+b_{21} & a_{22}+b_{22} \end{bmatrix}\]
Properties: Commutative, Associative, Distributive, Zero
Order matters! \(\mathbf{AB} \neq \mathbf{BA}\) in general.
Multiply \(i\)-th row by \(j\)-th column:
\[\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} 3 & 5 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 1(3)+3(2) & 1(5)+3(4) \\ 2(3)+4(2) & 2(5)+4(4) \end{bmatrix} = \begin{bmatrix} 9 & 17 \\ 14 & 26 \end{bmatrix}\]
[,1] [,2]
[1,] 9 17
[2,] 14 26
Conformability rule: columns of first = rows of second
\[\mathbf{A}_{m \times n} \times \mathbf{B}_{n \times p} = \mathbf{C}_{m \times p}\]
Inner dimensions must match; result has outer dimensions.
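A short sketch of matrix multiplication and the conformability rule in R, using the \(2 \times 2\) example above. Note that `matrix()` fills column by column, and that `%*%` is matrix multiplication while `*` is element-wise.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)   # rows: (1, 3) and (2, 4)
B <- matrix(c(3, 2, 5, 4), nrow = 2)   # rows: (3, 5) and (2, 4)

AB <- A %*% B    # rows: (9, 17) and (14, 26)
BA <- B %*% A    # a different matrix: order matters

# Conformability: a 2 x 3 matrix cannot multiply another 2 x 3 matrix
C   <- matrix(1:6, nrow = 2)
res <- tryCatch(C %*% C, error = function(e) "non-conformable")
res

dim(C %*% t(C))  # (2 x 3)(3 x 2) gives 2 x 2
dim(t(C) %*% C)  # (3 x 2)(2 x 3) gives 3 x 3
```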
\(\mathbf{A}^T\) swaps rows and columns. If \(\mathbf{A}\) is \(m \times n\), then \(\mathbf{A}^T\) is \(n \times m\).
\[\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}\]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
[,1] [,2] [,3]
[1,] 17 22 27
[2,] 22 29 36
[3,] 27 36 45
Key properties:
| Property | Statement |
|---|---|
| Double transpose | \((\mathbf{A}^T)^T = \mathbf{A}\) |
| Sum | \((\mathbf{A}+\mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T\) |
| Product (reversal) | \((\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T\) |
Transposing a product reverses the order:
\[(\mathbf{ABC})^T = \mathbf{C}^T\mathbf{B}^T\mathbf{A}^T\]
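Critical result: For any matrix \(\mathbf{A}\), the product \(\mathbf{A}^T\mathbf{A}\) is always square and symmetric, because \((\mathbf{A}^T\mathbf{A})^T = \mathbf{A}^T(\mathbf{A}^T)^T = \mathbf{A}^T\mathbf{A}\).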
This is exactly what \(\mathbf{X}^T\mathbf{X}\) produces in the normal equations.
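The transpose example and the symmetry of \(\mathbf{A}^T\mathbf{A}\) can be verified in R. The sketch uses the \(2 \times 3\) matrix from the transpose example; `crossprod(A)` is base R's shortcut for `t(A) %*% A`.

```r
A <- matrix(c(1, 4, 2, 5, 3, 6), nrow = 2)   # rows: (1, 2, 3) and (4, 5, 6)

t(A)                       # 3 x 2

AtA <- t(A) %*% A          # 3 x 3, matches the printed matrix above
AtA
isSymmetric(AtA)           # TRUE
all.equal(AtA, crossprod(A))
```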
The determinant is a scalar value computed from a square matrix. It’s necessary for matrix inversion (later).
For a \(2 \times 2\) matrix:
\[\det(\mathbf{A}) = \det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc\]
Example:
\[\det\begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix} = 4(6) - 7(2) = 10 \neq 0 \;\; ✓\]
\[\det\begin{bmatrix} 2 & 4 \\ 1 & 2 \end{bmatrix} = 2(2) - 4(1) = 0 \;\; \text{(singular — row 1 = 2 × row 2)}\]
For OLS: \(\det(\mathbf{X}^T\mathbf{X}) = 0\) means perfect multicollinearity — the columns of \(\mathbf{X}\) are linearly dependent, and we cannot solve for \(\mathbf{b}\).
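The determinant examples, and the link to perfect multicollinearity, can be sketched as follows. The design matrix `X` is a made-up example in which the third column is exactly twice the second; because floating-point determinants can be slightly off from zero, the rank from `qr()` is the more reliable check.

```r
A <- matrix(c(4, 2, 7, 6), nrow = 2)   # rows: (4, 7) and (2, 6)
S <- matrix(c(2, 1, 4, 2), nrow = 2)   # rows: (2, 4) and (1, 2)

det(A)   # 10, so A is invertible
det(S)   # 0, so S is singular

# Perfect multicollinearity: x2 is an exact linear function of x1
x1 <- c(1, 2, 3, 4)
X  <- cbind(1, x1, x2 = 2 * x1)

det(crossprod(X))   # zero up to rounding error
qr(X)$rank          # 2, fewer than the 3 columns of X
```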
For scalars: \(a \cdot a^{-1} = 1\)
For matrices: \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\)
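Requirements: \(\mathbf{A}\) must be square and non-singular, \(\det(\mathbf{A}) \neq 0\). For a \(2 \times 2\) matrix: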
\[\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \quad \Longrightarrow \quad \mathbf{A}^{-1} = \frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\]
Example:
\[\begin{bmatrix} 4 & 7 \\ 2 & 6 \end{bmatrix}^{-1} = \frac{1}{10}\begin{bmatrix} 6 & -7 \\ -2 & 4 \end{bmatrix} = \begin{bmatrix} 0.6 & -0.7 \\ -0.2 & 0.4 \end{bmatrix}\]
\[\mathbf{A}\mathbf{A}^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \mathbf{I} \;\; ✓\]
| Property | Statement |
|---|---|
| Product (reversal) | \((\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\) |
| Transpose | \((\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T\) |
| Double inverse | \((\mathbf{A}^{-1})^{-1} = \mathbf{A}\) |
| Identity | \(\mathbf{I}^{-1} = \mathbf{I}\) |
Like the transpose, inverting a product reverses the order.
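In R, `solve()` returns the inverse of a square, non-singular matrix. The sketch below reproduces the worked example and checks two of the properties in the table; the second matrix `B` is an arbitrary invertible matrix chosen for illustration.

```r
A <- matrix(c(4, 2, 7, 6), nrow = 2)   # rows: (4, 7) and (2, 6)

A_inv <- solve(A)
A_inv                # rows: (0.6, -0.7) and (-0.2, 0.4)
A %*% A_inv          # identity matrix, up to rounding

B <- matrix(c(2, 1, 1, 3), nrow = 2)   # arbitrary invertible matrix (det = 5)

all.equal(solve(A %*% B), solve(B) %*% solve(A))   # product reversal
all.equal(solve(t(A)), t(solve(A)))                # transpose rule
```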
\[\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}\]
\[\mathbf{y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}\]
\[\mathbf{X} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{bmatrix}\]
Same Objective: Minimize the sum of squared errors. In scalar form, we minimized \(\sum e_i^2\). In matrix form:
\[\min_{\mathbf{b}}\; \mathbf{e}^T\mathbf{e} = (\mathbf{y} - \mathbf{Xb})^T(\mathbf{y} - \mathbf{Xb})\]
Expand using the distributive property of the transpose, \((\mathbf{A} - \mathbf{B})^T = \mathbf{A}^T - \mathbf{B}^T\):
\[\mathbf{e}^T\mathbf{e} = (\mathbf{y}^T - \mathbf{b}^T\mathbf{X}^T)(\mathbf{y} - \mathbf{Xb})\]
Multiply out the terms:
\[= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{Xb} - \mathbf{b}^T\mathbf{X}^T\mathbf{y} + \mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}\]
The middle two terms are both scalars (a \(1 \times 1\) result), and a scalar equals its own transpose: \(\mathbf{y}^T\mathbf{Xb} = (\mathbf{b}^T\mathbf{X}^T\mathbf{y})^T = \mathbf{b}^T\mathbf{X}^T\mathbf{y}\). So they combine:
\[\mathbf{e}^T\mathbf{e} = \mathbf{y}^T\mathbf{y} - 2\mathbf{b}^T\mathbf{X}^T\mathbf{y} + \mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}\]
In scalar calculus, \(\frac{d}{db} f(b)\) gives a single number. In matrix calculus, \(\frac{\partial f}{\partial \mathbf{b}}\) gives a vector of derivatives, one for each element of \(\mathbf{b}\) (the intercept and the slopes).
For instance, if \(\mathbf{b} = [b_0, b_1, \ldots, b_k]^T\), then:
\[\frac{\partial f}{\partial \mathbf{b}} = \begin{bmatrix} \frac{\partial f}{\partial b_0} \\ \frac{\partial f}{\partial b_1} \\ \vdots \\ \frac{\partial f}{\partial b_k} \end{bmatrix}\]
This vector of partial derivatives is called the gradient. Setting it to \(\mathbf{0}\) means every partial derivative equals zero simultaneously.
Why does this matter? In the bivariate case, we took two separate derivatives (\(\frac{\partial SSR}{\partial a}\) and \(\frac{\partial SSR}{\partial b}\)) and solved two equations. The gradient is effectively the same thing — but for all \(k+1\) coefficients simultaneously.
We need three rules, each of which corresponds to a rule from scalar calculus:
| Scalar Rule | Matrix Rule |
|---|---|
| \(\frac{d}{db}(c) = 0\) | \(\frac{\partial}{\partial \mathbf{b}}(\mathbf{y}^T\mathbf{y}) = \mathbf{0}\) |
| \(\frac{d}{db}(cb) = c\) | \(\frac{\partial}{\partial \mathbf{b}}(\mathbf{c}^T\mathbf{b}) = \mathbf{c}\) |
| \(\frac{d}{db}(b^2 a) = 2ab\) | \(\frac{\partial}{\partial \mathbf{b}}(\mathbf{b}^T\mathbf{A}\mathbf{b}) = 2\mathbf{A}\mathbf{b}\) (if \(\mathbf{A}\) is symmetric) |
The third rule requires \(\mathbf{A}\) to be symmetric. Since \(\mathbf{X}^T\mathbf{X}\) is always symmetric (recall: \((\mathbf{X}^T\mathbf{X})^T = \mathbf{X}^T\mathbf{X}\)), the rule applies.
Our function is: \(\;\mathbf{e}^T\mathbf{e} = \underbrace{\mathbf{y}^T\mathbf{y}}_{\text{constant}} - \underbrace{2\mathbf{b}^T\mathbf{X}^T\mathbf{y}}_{\text{linear in } \mathbf{b}} + \underbrace{\mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}}_{\text{quadratic in } \mathbf{b}}\)
Term by term:
| Term | Rule Applied | Derivative w.r.t. \(\mathbf{b}\) |
|---|---|---|
| \(\mathbf{y}^T\mathbf{y}\) | Constant → 0 | \(\mathbf{0}\) |
| \(-2\mathbf{b}^T\mathbf{X}^T\mathbf{y}\) | Linear: \(\frac{\partial}{\partial \mathbf{b}}(\mathbf{c}^T\mathbf{b}) = \mathbf{c}\), where \(\mathbf{c} = \mathbf{X}^T\mathbf{y}\) | \(-2\mathbf{X}^T\mathbf{y}\) |
| \(\mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{b}\) | Quadratic: \(\frac{\partial}{\partial \mathbf{b}}(\mathbf{b}^T\mathbf{A}\mathbf{b}) = 2\mathbf{A}\mathbf{b}\), where \(\mathbf{A} = \mathbf{X}^T\mathbf{X}\) | \(2\mathbf{X}^T\mathbf{X}\mathbf{b}\) |
Combining: \(\;\frac{\partial\; \mathbf{e}^T\mathbf{e}}{\partial\; \mathbf{b}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{b}\)
Setting the gradient equal to the zero vector gives the first-order condition:
\[-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{0}\]
Rearrange (move \(-2\mathbf{X}^T\mathbf{y}\) to the right, divide by 2):
\[\mathbf{X}^T\mathbf{X}\mathbf{b} = \mathbf{X}^T\mathbf{y}\]
These are the normal equations — the matrix version of the first-order conditions.
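One way to build confidence in the gradient formula \(-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{b}\) is to compare it with a numerical derivative. This is a sketch on simulated data; the functions `ssr()` and `grad()`, the evaluation point `b0`, and the step size `h` are all illustrative choices, not part of the original notes.

```r
set.seed(3)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))             # intercept column plus two predictors
y <- X %*% c(1, 2, -1) + rnorm(n)

ssr  <- function(b) sum((y - X %*% b)^2)      # e'e as a function of b
grad <- function(b) drop(-2 * t(X) %*% y + 2 * t(X) %*% X %*% b)

b0 <- c(0.5, 1, 1)                            # arbitrary point, not the minimum
h  <- 1e-5

# Central finite differences, one coefficient at a time
num_grad <- sapply(1:3, function(j) {
  step <- replace(numeric(3), j, h)
  (ssr(b0 + step) - ssr(b0 - step)) / (2 * h)
})

all.equal(num_grad, grad(b0), tolerance = 1e-6)
```

Because the objective is quadratic in \(\mathbf{b}\), the central difference matches the analytic gradient up to rounding error.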
Solve for \(\mathbf{b}\) by multiplying both sides on the left by \((\mathbf{X}^T\mathbf{X})^{-1}\):
\[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
Since \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X} = \mathbf{I}\) and \(\mathbf{Ib} = \mathbf{b}\):
\[\boxed{\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}\]
This requires \(\mathbf{X}^T\mathbf{X}\) to be invertible — fails under perfect multicollinearity.
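The boxed estimator can be computed directly and compared with `lm()`. The data below are simulated for illustration (they are not the survey data used in these notes). The second version, `solve(crossprod(X), crossprod(X, y))`, solves the normal equations without forming the inverse explicitly, which is generally the numerically safer habit.

```r
set.seed(4)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                                   # column of 1s for the intercept

b_textbook <- solve(t(X) %*% X) %*% t(X) %*% y          # (X'X)^{-1} X'y, as derived
b_stable   <- solve(crossprod(X), crossprod(X, y))      # solves X'X b = X'y directly
b_lm       <- coef(lm(y ~ x1 + x2))

all.equal(as.numeric(b_textbook), as.numeric(b_lm))
all.equal(as.numeric(b_stable), as.numeric(b_lm))

# The normal equations imply X'e = 0: residuals are orthogonal to every column of X
e <- y - X %*% b_stable
max(abs(crossprod(X, e)))                               # zero up to rounding
```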
(Intercept) authoritarianism
3.110123 -0.554824
[,1]
[1,] 2.84760467
[2,] 0.05181289
Here, “Independent” is the excluded category — coefficients for “Republican” and “Democrat” are differences relative to Independents.
Call:
lm(formula = electoral_contestation ~ republican + democrat +
authoritarianism, data = wss20)
Residuals:
Min 1Q Median 3Q Max
-2.22347 -0.50962 -0.02344 0.43179 2.60419
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.03857 0.03543 85.77 < 2e-16 ***
republican 0.18489 0.03640 5.08 3.98e-07 ***
democrat -0.18751 0.03859 -4.86 1.23e-06 ***
authoritarianism -0.45526 0.03956 -11.51 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.767 on 3427 degrees of freedom
(169 observations deleted due to missingness)
Multiple R-squared: 0.09947, Adjusted R-squared: 0.09868
F-statistic: 126.2 on 3 and 3427 DF, p-value: < 2.2e-16
The slope on authoritarianism is the same for every group — only the intercept moves.
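The "parallel lines" interpretation can be sketched with simulated data. Everything here is an assumption for illustration: the variable `x` stands in for a continuous predictor such as authoritarianism, and the true coefficients are invented. The sketch also shows that passing a factor to `lm()` builds the same dummies automatically, with the first level as the excluded category.

```r
set.seed(5)
n <- 300
party <- factor(
  sample(c("Independent", "Republican", "Democrat"), n, replace = TRUE),
  levels = c("Independent", "Republican", "Democrat")   # first level = reference
)
x <- runif(n)

# Hand-built dummies; Independent is the excluded category
republican <- as.numeric(party == "Republican")
democrat   <- as.numeric(party == "Democrat")

y <- 3 + 0.2 * republican - 0.2 * democrat - 0.5 * x + rnorm(n, sd = 0.5)

fit_dummy  <- lm(y ~ republican + democrat + x)
fit_factor <- lm(y ~ party + x)          # same model, dummies built by R

b <- coef(fit_dummy)

# One intercept per group; the slope on x is shared by all three
intercepts <- c(
  Independent = b[["(Intercept)"]],
  Republican  = b[["(Intercept)"]] + b[["republican"]],
  Democrat    = b[["(Intercept)"]] + b[["democrat"]]
)
intercepts
b[["x"]]
```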
Recall the decomposition: \(TSS = RegSS + RSS\)
\[R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]
Problem: \(R^2\) never decreases when you add a predictor — even a useless one. Adding noise variables inflates \(R^2\).
The adjusted \(R^2\) penalizes for model complexity:
\[\bar{R}^2 = 1 - \frac{RSS / (n - k - 1)}{TSS / (n - 1)} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}\]
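Both fit statistics can be computed by hand and checked against `summary()`. The sketch uses simulated data in which `noise` is unrelated to the outcome by construction; `fit_stats()` is a hypothetical helper written for this example.

```r
set.seed(6)
n     <- 100
x     <- rnorm(n)
noise <- rnorm(n)                     # irrelevant predictor by construction
y     <- 1 + 0.5 * x + rnorm(n)

fit_a <- lm(y ~ x)
fit_b <- lm(y ~ x + noise)

# Hypothetical helper: R^2 and adjusted R^2 from the formulas above
fit_stats <- function(fit, y) {
  n   <- length(y)
  k   <- length(coef(fit)) - 1        # number of predictors, excluding the intercept
  rss <- sum(resid(fit)^2)
  tss <- sum((y - mean(y))^2)
  r2  <- 1 - rss / tss
  adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
  c(R2 = r2, Adj_R2 = adj)
}

stats_a <- fit_stats(fit_a, y)
stats_b <- fit_stats(fit_b, y)
rbind(stats_a, stats_b)

all.equal(stats_b[["R2"]], summary(fit_b)$r.squared)
all.equal(stats_b[["Adj_R2"]], summary(fit_b)$adj.r.squared)
```

\(R^2\) in the second row cannot be lower than in the first, while the adjusted version may fall when the added predictor contributes little.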
## Bivariate model
fit1 <- lm(electoral_contestation ~ authoritarianism, data = wss20)
## Multiple regression with dummies
fit2 <- lm(electoral_contestation ~ authoritarianism + republican + democrat, data = wss20)
data.frame(
  Model = c("Authoritarianism only", "+ Party ID dummies"),
  R2 = c(summary(fit1)$r.squared, summary(fit2)$r.squared),
  Adj_R2 = c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared),
  k = c(1, 3)
) |> knitr::kable(digits = 4)

| Model | R2 | Adj_R2 | k |
|---|---|---|---|
| Authoritarianism only | 0.0556 | 0.0553 | 1 |
| + Party ID dummies | 0.0995 | 0.0987 | 3 |
In-sample fit (\(R^2\)) measures how well the model explains the data it was trained on. Call this “training set” performance.
- Adding predictors never hurts in-sample fit, even if they are irrelevant. They won’t help predict new data, but they will reduce residuals in the training set.
- With enough predictors, you can fit the training data perfectly, but this is memorizing noise, not learning signal.
- A model with \(n - 1\) linearly independent predictors (plus an intercept) and \(n\) observations achieves \(R^2 = 1\).

Out-of-sample prediction measures how well the model generalizes to new, unseen data. Call this “test set” performance.
- An overfit model memorizes noise in the training data.
- It performs well in-sample but poorly out-of-sample.
- The gap between in-sample and out-of-sample performance is a diagnostic for overfitting.
Key point: A good model captures signal, not noise. \(R^2\) alone cannot distinguish between the two.
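The claim that enough predictors yield a perfect in-sample fit can be illustrated with pure noise. In this sketch both the outcome and all predictors are independent random draws, so there is no signal to find; the helper `r2_with_k()` and the sample size are illustrative assumptions.

```r
set.seed(7)
n <- 20
y <- rnorm(n)                              # outcome is pure noise
Z <- matrix(rnorm(n * (n - 1)), nrow = n)  # n - 1 noise predictors

# Hypothetical helper: in-sample R^2 using the first k columns of Z
r2_with_k <- function(k) {
  fit <- lm(y ~ Z[, seq_len(k), drop = FALSE])
  1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
}

ks      <- c(1, 5, 10, 15, n - 1)
r2_by_k <- sapply(ks, r2_with_k)
data.frame(k = ks, R2 = round(r2_by_k, 4))
```

\(R^2\) climbs steadily and reaches 1 at \(k = n - 1\), even though none of the predictors has any relationship with the outcome.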
Cross-validation estimates out-of-sample prediction error using only the available data (Hastie, Tibshirani, & Friedman, 2009, §7.10, “Cross-Validation”, p. 241).
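Procedure:
1. Randomly split the data into \(K\) folds of roughly equal size.
2. For each fold, fit the model on the other \(K - 1\) folds and compute the prediction error (e.g., MSE) on the held-out fold.
3. Average the \(K\) held-out errors to get the cross-validation estimate.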
There is no simple rule for how many folds to use. Typical choices are \(K = 5\) or \(K = 10\). When \(K = n\), this is leave-one-out cross-validation (LOOCV).
Each row is one iteration: the orange block is held out for testing, the blue blocks are used for training.
| Model | CV_MSE |
|---|---|
| Authoritarianism only | 0.6143 |
| + Party ID dummies | 0.5890 |
If the fuller model has lower CV-MSE, it genuinely improves prediction — not just in-sample fit.
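A minimal sketch of \(K\)-fold cross-validation for comparing two linear models. The data are simulated (the variables `x` and `z` and their coefficients are assumptions, not the survey data behind the table above), and `cv_mse()` is a hypothetical helper, not the function used to produce the reported values.

```r
set.seed(8)
n   <- 200
x   <- rnorm(n)
z   <- rnorm(n)
dat <- data.frame(x = x, z = z, y = 1 + 0.5 * x + 0.8 * z + rnorm(n))

# Hypothetical helper: average held-out MSE across K folds
cv_mse <- function(formula, data, K = 5) {
  folds   <- sample(rep(seq_len(K), length.out = nrow(data)))  # random fold labels
  outcome <- all.vars(formula)[1]                              # name of the response
  fold_mse <- sapply(seq_len(K), function(k) {
    train <- data[folds != k, ]
    test  <- data[folds == k, ]
    fit   <- lm(formula, data = train)                         # fit on K - 1 folds
    mean((test[[outcome]] - predict(fit, newdata = test))^2)   # error on held-out fold
  })
  mean(fold_mse)
}

# Reset the seed so both models are evaluated on the same folds
set.seed(9)
cv_small <- cv_mse(y ~ x, dat)
set.seed(9)
cv_full  <- cv_mse(y ~ x + z, dat)

c(small = cv_small, full = cv_full)
```

Here `z` has a real effect by construction, so the fuller model should show the lower held-out error; with an irrelevant extra predictor the two values would be close, and the fuller model could do slightly worse.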
Scholars occasionally disagree about the role of \(R^2\) in evaluating models.
Lewis-Beck & Skalaban (1990): \(R^2\) is a useful and informative measure of model fit.
Achen (1982, 1990): \(R^2\) is misleading and overemphasized.
Achen’s argument is that we should focus attention on coefficient estimates and their standard errors, not on \(R^2\). A model with a small \(R^2\) can still identify an important causal effect (e.g., genes and voting); a model with a large \(R^2\) can be theoretically vacuous (predictions in small samples).
Middle Ground: