Model Inference

Dummy Variables, R-squared, Cross-Validation, and Interpretation

Arizona Precinct Data

We can apply regression to aggregate-level data. The precinct_voter_summary dataset contains 1,688 Arizona precincts from the 2024 election, with voter registration, turnout, and Census demographics linked at the tract level.

Data comes from the Arizona Secretary of State voter file, merged with ACS tract-level demographics
Each row is a precinct; the DV is Trump’s margin over Harris (positive = Trump advantage)

load("/Users/chrisweber/Dropbox/github_repos/linear_regression/quarto-book/precinct_voter_summary.rda")
load("/Users/chrisweber/Dropbox/github_repos/linear_regression/quarto-book/precinct_tract_data.rda")
head(precinct_voter_summary[, c("dos_precinct_key", "trump_harris_margin",
                                 "tract_acs_median_household_income",
                                 "tract_acs_latino", "tract_acs_median_age",
                                 "tract_acs_gini_index")])

# A tibble: 6 × 6
  dos_precinct_key trump_harris_margin tract_acs_median_house…¹ tract_acs_latino
  <chr>                          <dbl>                    <dbl>            <dbl>
1 0001 ACACIA                   0.0430                    69005             2341
2 0002 ACOMA                    0.254                     95395              348
3 0003 ACUNA                   -0.508                     49849             5315
4 0004 ADOBE                    0.115                     78531             1363
5 0005 ADORA                    0.194                    156354             1092
6 0006 AGRITOPIA                0.0943                   101179              962
# ℹ abbreviated name: ¹tract_acs_median_household_income
# ℹ 2 more variables: tract_acs_median_age <dbl>, tract_acs_gini_index <dbl>

Arizona Precincts: Map

Arizona Precincts: Regression

Can tract-level ACS demographics predict the Trump–Harris margin at the precinct level?

fit_az <- lm(trump_harris_margin ~ tract_acs_median_household_income +
               tract_acs_latino + tract_acs_median_age + tract_acs_gini_index,
             data = precinct_voter_summary)
summary(fit_az)


Call:
lm(formula = trump_harris_margin ~ tract_acs_median_household_income + 
    tract_acs_latino + tract_acs_median_age + tract_acs_gini_index, 
    data = precinct_voter_summary)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.08872 -0.19837 -0.01971  0.18299  1.19789 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       -5.598e-02  6.465e-02  -0.866   0.3866    
tract_acs_median_household_income  1.416e-06  2.106e-07   6.724 2.42e-11 ***
tract_acs_latino                  -1.723e-05  6.987e-06  -2.466   0.0138 *  
tract_acs_median_age               1.099e-02  6.808e-04  16.147  < 2e-16 ***
tract_acs_gini_index              -1.172e+00  1.122e-01 -10.443  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2905 on 1674 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.2582,    Adjusted R-squared:  0.2565 
F-statistic: 145.7 on 4 and 1674 DF,  p-value: < 2.2e-16

Arizona Precincts: Predicted vs. Observed

Interpreting Coefficients in Multiple Regression

The linear regression model should be estimated with a continuous dependent variable, but the independent variables can be quantitative or qualitative.

For a continuous IV, \(b\) is the expected change in \(Y\) for a one-unit increase in \(X\), holding all other predictors constant
The relationship is linear and additive: \(dy/dx = b\) for all values of \(x\) and all other predictors
This means the effect of \(X\) on \(Y\) is constant across the data

Because the model is linear and additive, a regression table is relatively simple to read: each coefficient has a straightforward interpretation.
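As a worked illustration, the Arizona coefficients can be rescaled to more readable increments. The sketch below copies three estimates from the `summary(fit_az)` output above. The increments ($10,000, 10 years, 0.05 Gini points) are illustrative choices rather than part of the original analysis, and the sketch assumes the Gini index is stored on its usual 0 to 1 scale.

```r
# Estimates copied from the summary(fit_az) output above
b_income <- 1.416e-06  # change in margin per $1 of tract median household income
b_age    <- 1.099e-02  # change in margin per year of tract median age
b_gini   <- -1.172     # change in margin per unit of the Gini index (assumed 0-1 scale)

# Because the model is linear, the effect of any increment is b * increment,
# holding the other predictors constant (increments are illustrative choices)
effect_income_10k <- b_income * 10000  # $10,000 higher median income
effect_age_10yr   <- b_age * 10        # 10 years older median age
effect_gini_005   <- b_gini * 0.05     # 0.05 higher Gini index

effects <- c(income_10k = effect_income_10k,
             age_10yr   = effect_age_10yr,
             gini_0.05  = effect_gini_005)
print(round(effects, 4))
```

Under these assumptions, a $10,000 increase in tract median income corresponds to a change of about 0.014 on the margin scale, ten additional years of median age to about 0.110, and a 0.05 increase in the Gini index to about -0.059, each holding the other predictors constant.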

Data: 2020 Western States Survey

YouGov survey of 3,600 respondents across western U.S. states, fielded around the 2020 presidential election; the models below use the 3,431 respondents with complete data
Institutional trust: composite of 7 items (Congress, President, Supreme Court, federal government, state legislatures, police, science), each on a 4-point scale, rescaled to 0–1
Authoritarianism: mean of 4 binary items, yielding a 0–1 scale
Party identification: Republican, Independent, Democrat — coded as dummy variables with Independents as reference

Institutional Trust Model

fit_trust <- lm(institutional_trust ~ authoritarianism + republican + democrat, data = wss20)
summary(fit_trust)


Call:
lm(formula = institutional_trust ~ authoritarianism + republican + 
    democrat, data = wss20)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.60364 -0.11477  0.01496  0.12002  0.58308 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.403791   0.008483  47.601  < 2e-16 ***
authoritarianism 0.052506   0.009472   5.543 3.20e-08 ***
republican       0.147347   0.009239  15.949  < 2e-16 ***
democrat         0.044315   0.008715   5.085 3.88e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1836 on 3427 degrees of freedom
  (169 observations deleted due to missingness)
Multiple R-squared:  0.1051,    Adjusted R-squared:  0.1044 
F-statistic: 134.2 on 3 and 3427 DF,  p-value: < 2.2e-16

Dummy Variables and Categorical Predictors

OLS requires a continuous DV, but the IVs can be continuous or categorical.

With a categorical predictor, encode group membership using dummy variables — binary indicators coded 0 or 1
With \(k\) categories, include \(k-1\) dummies; the omitted group is the reference category
Why omit one? All \(k\) dummies sum to a column of ones — perfectly collinear with the intercept

Suppose party ID is coded 1 = Republican, 2 = Independent, 3 = Democrat. Entering that code as a single numeric predictor raises a problem: what is a one-unit change? It would assume that the movement from Republican to Independent has the same effect as the movement from Independent to Democrat.

Instead, create a dummy variable for each category except the reference group. An Independent is a case in which both the Republican and Democrat dummies are 0.
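The coding scheme can be sketched in a few lines of R. The six party labels below are hypothetical and serve only to illustrate the construction; `model.matrix()` is the base R function that `lm()` relies on to build dummies from a factor.

```r
# Hypothetical party labels for six respondents (illustration only)
party <- factor(c("Republican", "Independent", "Democrat",
                  "Independent", "Republican", "Democrat"),
                levels = c("Independent", "Republican", "Democrat"))

# Manual 0/1 indicators, one per category
republican  <- as.integer(party == "Republican")
democrat    <- as.integer(party == "Democrat")
independent <- as.integer(party == "Independent")

# model.matrix() builds k - 1 dummies, omitting the first factor level
X <- model.matrix(~ party)
print(X)

# Including all k dummies plus an intercept is rank deficient:
# the three indicators sum to the column of ones
X_all <- cbind(intercept = 1, republican, democrat, independent)
rank_all <- qr(X_all)$rank  # 3, although X_all has 4 columns
print(rank_all)
```

Listing "Independent" first in `levels` is what makes it the reference category in this sketch; reordering the levels would change the omitted group but not the fit of the model.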

Intercept Shifts

\[ Y_i = \alpha + \gamma_{\text{Rep}} D_{\text{Rep}} + \gamma_{\text{Dem}} D_{\text{Dem}} + \beta_{\text{Auth}} X_{\text{Auth}} + e_i \]

The expected values for each group:

\[ \begin{eqnarray*} E(Y \mid \text{Independent}) &=& \alpha + \beta_{\text{Auth}} X_{\text{Auth}} \\ E(Y \mid \text{Republican}) &=& (\alpha + \gamma_{\text{Rep}}) + \beta_{\text{Auth}} X_{\text{Auth}} \\ E(Y \mid \text{Democrat}) &=& (\alpha + \gamma_{\text{Dem}}) + \beta_{\text{Auth}} X_{\text{Auth}} \end{eqnarray*} \]

The slope on authoritarianism is the same for every group — only the intercept shifts. \(\gamma_{\text{Rep}}\) and \(\gamma_{\text{Dem}}\) represent the average difference in \(Y\) between each group and Independents, holding authoritarianism constant.
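To make the intercept shifts concrete, the expected values can be computed from the `summary(fit_trust)` estimates reported above. The value of authoritarianism used here (0.5) is an arbitrary illustrative point on the 0 to 1 scale.

```r
# Estimates copied from the summary(fit_trust) output above
a      <- 0.403791  # intercept: Independents at authoritarianism = 0
b_auth <- 0.052506  # common slope on authoritarianism
g_rep  <- 0.147347  # intercept shift for Republicans
g_dem  <- 0.044315  # intercept shift for Democrats

x_auth <- 0.5  # illustrative value of authoritarianism

# Same slope for every group; only the intercept differs
expected <- c(
  Independent = a + b_auth * x_auth,
  Republican  = (a + g_rep) + b_auth * x_auth,
  Democrat    = (a + g_dem) + b_auth * x_auth
)
print(round(expected, 3))
```

At this value the expected trust scores are about 0.430, 0.577, and 0.474. The Republican and Independent lines are 0.147 apart at every value of authoritarianism, which is exactly what the additive specification imposes.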

Intercept Shifts: Visualization

All three groups share the same slope on authoritarianism, but differ in their baseline levels of institutional trust.

Interactions

What if the effect of authoritarianism differs across party groups?

The Additive Model Is Restrictive

The additive model assumes \(\frac{\partial Y}{\partial X_{\text{Auth}}} = \beta_{\text{Auth}}\) — the same slope for every group.

We can relax this by estimating an interactive model:

\[ Y_i = \alpha + \gamma_{\text{Rep}} D_{\text{Rep}} + \gamma_{\text{Dem}} D_{\text{Dem}} + \beta_{\text{Auth}} X_{\text{Auth}} + \delta_{\text{Rep}} (X_{\text{Auth}} \times D_{\text{Rep}}) + \delta_{\text{Dem}} (X_{\text{Auth}} \times D_{\text{Dem}}) + e_i \]

We are creating two additional variables: authoritarianism \(\times\) Republican and authoritarianism \(\times\) Democrat.

Interactions: Group-Specific Slopes

\[ \begin{eqnarray*} E(Y \mid \text{Independent}) &=& \alpha + \beta_{\text{Auth}} X_{\text{Auth}} \\ E(Y \mid \text{Republican}) &=& (\alpha + \gamma_{\text{Rep}}) + (\beta_{\text{Auth}} + \delta_{\text{Rep}}) X_{\text{Auth}} \\ E(Y \mid \text{Democrat}) &=& (\alpha + \gamma_{\text{Dem}}) + (\beta_{\text{Auth}} + \delta_{\text{Dem}}) X_{\text{Auth}} \end{eqnarray*} \]

The marginal effect of authoritarianism now depends on party identification:

\[\frac{\partial Y}{\partial X_{\text{Auth}}} = \beta_{\text{Auth}} + \delta_{\text{Rep}} D_{\text{Rep}} + \delta_{\text{Dem}} D_{\text{Dem}}\]

\(\beta_{\text{Auth}}\) = slope for Independents (reference group)
\(\delta_{\text{Rep}}\) = difference in slope between Republicans and Independents
\(\delta_{\text{Dem}}\) = difference in slope between Democrats and Independents

The lines are no longer parallel — each group gets its own intercept and slope.

Always Include Lower-Order Terms

It is important to always include the lower-order constituent terms in an interactive model.

Omitting dummies forces all groups to share the same intercept:

\[Y_i = \alpha + \beta_{\text{Auth}} X_{\text{Auth}} + \delta_{\text{Rep}} (X_{\text{Auth}} \times D_{\text{Rep}}) + \delta_{\text{Dem}} (X_{\text{Auth}} \times D_{\text{Dem}}) + e_i\]

Omitting the main effect of authoritarianism forces the slope to be zero for Independents:

\[Y_i = \alpha + \gamma_{\text{Rep}} D_{\text{Rep}} + \gamma_{\text{Dem}} D_{\text{Dem}} + \delta_{\text{Rep}} (X_{\text{Auth}} \times D_{\text{Rep}}) + \delta_{\text{Dem}} (X_{\text{Auth}} \times D_{\text{Dem}}) + e_i\]

The full model should include all constituent terms. We can always test whether they are zero.
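In R's formula syntax, the choice of operator determines whether the lower-order terms are included. The sketch below only inspects the formulas with `terms()`, so it needs no data; the variable names mirror those used in this deck.

```r
# The * operator expands to the main effects plus their product
full <- institutional_trust ~ authoritarianism * republican + authoritarianism * democrat
full_terms <- attr(terms(full), "term.labels")
print(full_terms)

# The : operator adds only the product terms, dropping the lower-order terms
products_only <- institutional_trust ~ authoritarianism:republican + authoritarianism:democrat
product_terms <- attr(terms(products_only), "term.labels")
print(product_terms)
```

Writing interactions with `*` is therefore the safer default: it yields the five terms of the full model, whereas `:` alone would estimate one of the restricted specifications shown above.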

Interaction Model: Estimation

# Additive model (restricted)
fit_additive <- lm(institutional_trust ~ authoritarianism + republican + democrat, data = wss20)

# Interactive model (unrestricted)
fit_interaction <- lm(institutional_trust ~ authoritarianism * republican + authoritarianism * democrat, data = wss20)
summary(fit_interaction)


Call:
lm(formula = institutional_trust ~ authoritarianism * republican + 
    authoritarianism * democrat, data = wss20)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.58654 -0.10874  0.00852  0.12260  0.58245 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  0.40529    0.01215  33.348   <2e-16 ***
authoritarianism             0.04905    0.02221   2.208   0.0273 *  
republican                   0.16678    0.01637  10.188   <2e-16 ***
democrat                     0.03461    0.01379   2.510   0.0121 *  
authoritarianism:republican -0.03458    0.02819  -1.227   0.2201    
authoritarianism:democrat    0.02635    0.02579   1.021   0.3071    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1835 on 3425 degrees of freedom
  (169 observations deleted due to missingness)
Multiple R-squared:  0.1072,    Adjusted R-squared:  0.1059 
F-statistic: 82.25 on 5 and 3425 DF,  p-value: < 2.2e-16
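The group-specific slopes implied by this output follow from adding each interaction coefficient to the slope for the reference group. A minimal sketch using the point estimates printed above:

```r
# Estimates copied from the summary(fit_interaction) output above
b_auth <- 0.04905   # slope for Independents (reference group)
d_rep  <- -0.03458  # slope difference: Republicans vs. Independents
d_dem  <- 0.02635   # slope difference: Democrats vs. Independents

# Marginal effect of authoritarianism within each party group
slopes <- c(Independent = b_auth,
            Republican  = b_auth + d_rep,
            Democrat    = b_auth + d_dem)
print(round(slopes, 4))
```

The estimated slope is about 0.049 for Independents, 0.014 for Republicans, and 0.075 for Democrats. These are point estimates only: a standard error for a sum of two coefficients also requires their covariance (available from `vcov()` on the fitted model), which the printed table does not show.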

Interaction Model: F-Test

Do the interaction terms jointly improve the model?

anova(fit_additive, fit_interaction)

Analysis of Variance Table

Model 1: institutional_trust ~ authoritarianism + republican + democrat
Model 2: institutional_trust ~ authoritarianism * republican + authoritarianism * 
    democrat
  Res.Df    RSS Df Sum of Sq      F Pr(>F)  
1   3427 115.57                             
2   3425 115.30  2   0.26505 3.9366 0.0196 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction Model: Visualization

Unlike the additive model, slopes now differ across groups.

Inference

Confidence Intervals, Hypothesis Tests, and the F-Statistic

Inference about the PRF

From the bivariate regression, recall the estimated variances, where \(x_i = X_i - \bar{X}\) denotes deviations from the mean:

\[ \begin{eqnarray*} \widehat{var}(b) &=& \frac{\hat{\sigma}^2}{\sum x_i^2}\\ \widehat{var}(a) &=& \frac{\hat{\sigma}^2 \sum X_i^2}{n\sum x_i^2} \end{eqnarray*} \]

Construct a \(100(1-\alpha)\%\) confidence interval:

\[\beta = b \pm t_{\alpha/2} \cdot SE(b)\]

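Locate \(t_{\alpha/2}\) using Student’s \(t\)-distribution with \(n-k-1\) degrees of freedom, where \(k\) is the number of predictors. Here \(\alpha\) is the significance level, not the regression intercept.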
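As a worked example, a 95% interval for the authoritarianism coefficient can be built from the estimate, standard error, and residual degrees of freedom printed in the `summary(fit_trust)` output above. This is a sketch from the rounded printed values; `confint()` on the fitted model would give the same interval without copying numbers by hand.

```r
# Values copied from the summary(fit_trust) output above
b  <- 0.052506  # estimate for authoritarianism
se <- 0.009472  # its standard error
df <- 3427      # residual degrees of freedom (n - k - 1)

level  <- 0.95
t_crit <- qt(1 - (1 - level) / 2, df)  # critical value from the t-distribution

ci <- c(lower = b - t_crit * se, upper = b + t_crit * se)
print(round(ci, 4))
```

With this many degrees of freedom the critical value is close to the familiar 1.96, and the interval runs from roughly 0.034 to 0.071, which excludes zero.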

Hypothesis Testing

We might conduct a null hypothesis test. If our expectation is that \(\beta\) is positive:

\[H_a: \beta_k > 0 \qquad H_0: \beta_k \leq 0\]

If our expectation is that \(\beta\) is negative:

\[H_a: \beta_k < 0 \qquad H_0: \beta_k \geq 0\]

Or, a two-tailed test:

\[H_a: \beta_k \neq 0 \qquad H_0: \beta_k = 0\]

Under the null, the test statistic (the estimate divided by its standard error) follows a \(t\)-distribution with \(n-k-1\) degrees of freedom. If the computed value falls in the critical region, reject the null.
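The mechanics can be checked against the interaction model output above. The sketch below recomputes the \(t\)-statistic and \(p\)-values for the authoritarianism × Republican coefficient from its printed estimate and standard error; the one-tailed version assumes a directional alternative of \(\beta < 0\), chosen here purely for illustration.

```r
# Values copied from the summary(fit_interaction) output above
b  <- -0.03458  # estimate for authoritarianism:republican
se <- 0.02819   # its standard error
df <- 3425      # residual degrees of freedom

t_stat <- b / se  # test statistic under H0: beta = 0

p_two   <- 2 * pt(-abs(t_stat), df)  # two-tailed: H_a is beta != 0
p_lower <- pt(t_stat, df)            # one-tailed: H_a is beta < 0 (illustrative)

print(round(c(t = t_stat, p_two_tailed = p_two, p_one_tailed = p_lower), 4))
```

The two-tailed \(p\)-value matches the 0.22 reported in the table, and because the estimate is negative the one-tailed value for this alternative is half of it. Neither would lead to rejecting the null at conventional levels.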

Comparing Nested Models: F-Test

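We can test whether a set of coefficients is jointly zero by comparing the fit of nested models. The omnibus null \(H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0\) is the special case in which the restricted model contains only the intercept.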

Model 1 (restricted): \(Y_i = \alpha + \beta_1 X_1 + \epsilon_i\)

Model 2 (unrestricted): \(Y_i = \alpha + \beta_1 X_1 + \beta_2 X_2 + \epsilon_i\)

\[F = \frac{(RegSS_2 - RegSS_1)/q}{RSS_2/(n-k-1)} \sim F[q, \; n-k-1]\]

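where \(q\) is the number of restrictions tested (the coefficients set to zero in the restricted model; here \(q = 1\)) and \(k\) is the number of predictors in the unrestricted model.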
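The same formula reproduces the interaction-model comparison reported earlier by `anova(fit_additive, fit_interaction)`. Because both models share the same \(TSS\), the numerator \(RegSS_2 - RegSS_1\) equals \(RSS_1 - RSS_2\), which is the "Sum of Sq" column. A sketch using the printed values, with \(q = 2\) because two interaction terms are tested:

```r
# Values copied from the anova(fit_additive, fit_interaction) table above
ss_diff  <- 0.26505  # RSS_1 - RSS_2, equal to RegSS_2 - RegSS_1
rss_2    <- 115.30   # residual sum of squares, unrestricted model
q        <- 2        # number of restrictions (two interaction terms)
df_resid <- 3425     # n - k - 1 for the unrestricted model

f_stat  <- (ss_diff / q) / (rss_2 / df_resid)
p_value <- pf(f_stat, q, df_resid, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 4))
```

Up to rounding of the printed sums of squares, this recovers the reported \(F\) of about 3.94 and \(p\) of about 0.02. The joint test rejects at the 0.05 level even though neither interaction coefficient is individually significant in the coefficient table.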

Model Fit

\(R^2\), Overfitting, and Cross-Validation

Revisiting Model Fit: \(R^2\) in Multiple Regression

Recall the decomposition: \(TSS = RegSS + RSS\)

\[R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}\]

Problem: \(R^2\) never decreases when you add a predictor — even a useless one.

Adjusted \(R^2\)

The adjusted \(R^2\) penalizes for model complexity. Adding predictors that explain little variation in \(Y\) may lead to a reduction in this value.

\[\bar{R}^2 = 1 - \frac{RSS / (n - k - 1)}{TSS / (n - 1)} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}\]

\(k\) = number of predictors, \(n\) = sample size
Unlike \(R^2\), adjusted \(R^2\) can decrease if a new predictor doesn’t improve the model enough to offset the lost degree of freedom
When \(k\) is large relative to \(n\), the penalty is substantial
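The second form of the formula is easy to apply to the models already estimated. In the sketch below, `adj_r2()` is a helper defined here for illustration, the \(R^2\) values and sample sizes are taken from the regression outputs above (residual degrees of freedom plus \(k + 1\)), and the final call uses a hypothetical sample of 20 to show the penalty when \(n\) is small.

```r
# Illustrative helper: adjusted R^2 from R^2, sample size n, and k predictors
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# Institutional trust model: R^2 = 0.1051, n = 3427 + 4 = 3431, k = 3
adj_trust <- adj_r2(0.1051, n = 3431, k = 3)

# Arizona precinct model: R^2 = 0.2582, n = 1674 + 5 = 1679, k = 4
adj_az <- adj_r2(0.2582, n = 1679, k = 4)

# Hypothetical small sample: same R^2 and k, but only 20 observations
adj_small <- adj_r2(0.1051, n = 20, k = 3)

print(round(c(trust = adj_trust, arizona = adj_az, small_n = adj_small), 4))
```

The first two values agree with the printed adjusted \(R^2\) (0.1044 and 0.2565) up to rounding of the inputs. With only 20 observations the same \(R^2\) yields a negative adjusted value, which shows how heavily the correction weighs when \(k\) is large relative to \(n\).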

Model Fit: Example

## Bivariate model
fit1 <- lm(institutional_trust ~ authoritarianism, data = wss20)
## Multiple regression with dummies
fit2 <- lm(institutional_trust ~ authoritarianism + republican + democrat, data = wss20)

data.frame(
  Model = c("Authoritarianism only", "+ Party ID dummies"),
  R2 = c(summary(fit1)$r.squared, summary(fit2)$r.squared),
  Adj_R2 = c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared),
  k = c(1, 3)
) |> knitr::kable(digits = 4)

Model	R2	Adj_R2	k
Authoritarianism only	0.0185	0.0182	1
+ Party ID dummies	0.1051	0.1044	3

The Overfitting Problem

In-sample fit (\(R^2\)) measures how well the model explains the data it was trained on.

As we add more variables, \(R^2\) never decreases. With \(n-1\) linearly independent predictors (plus an intercept) and \(n\) observations, \(R^2 = 1\)
It becomes difficult to discern whether inclusion of a variable actually improves the model, or whether it just capitalizes on randomness

Out-of-sample prediction measures how well the model generalizes to new, unseen data.

What is our \(R^2\) when we apply the model to a different dataset? Is there a substantial discrepancy?
If \(R^2\) is high in-sample but low out-of-sample, the model is overfitting — capturing noise rather than signal

One solution: estimate the model on one dataset, evaluate performance on a different dataset.
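The in-sample side of this problem can be seen in a small simulation. In the sketch below both the outcome and the predictors are pure noise generated for illustration (the seed and sample size are arbitrary), so any \(R^2\) above zero reflects capitalization on chance.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 20
y <- rnorm(n)                                  # outcome: pure noise
noise <- matrix(rnorm(n * (n - 1)), nrow = n)  # n - 1 predictors unrelated to y

# In-sample R^2 using the first k noise columns as predictors
r2_with_k <- function(k) {
  fit <- lm(y ~ noise[, seq_len(k), drop = FALSE])
  1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
}

ks <- c(1, 5, 10, n - 1)
r2 <- sapply(ks, r2_with_k)
print(round(setNames(r2, paste0("k=", ks)), 3))
```

The fit improves with every batch of irrelevant predictors and becomes perfect once the model has as many parameters as observations, even though nothing in `noise` is related to `y`.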

K-Fold Cross-Validation

Cross-validation estimates out-of-sample prediction error using only the available data (Hastie, Tibshirani, & Friedman, 2009).

Procedure:

Randomly partition the data into \(K\) roughly equal-sized folds
For each fold \(k = 1, \ldots, K\):
- Train the model on all data except fold \(k\)
- Predict on fold \(k\) (the held-out data)
- Compute prediction error: \(\text{MSE}_k = \frac{1}{n_k}\sum_{i \in \text{fold } k}(Y_i - \hat{Y}_i)^2\)
Average across folds: \(\text{CV}_{(K)} = \frac{1}{K}\sum_{k=1}^{K} \text{MSE}_k\)

Typically, \(K = 5\) or \(K = 10\). When \(K = n\), this is leave-one-out cross-validation (LOOCV).

K-Fold Cross-Validation

Each row is one iteration: the orange block is held out for testing, the blue blocks are used for training.

Why MSE and not \(R^2\)?

\(R^2\) is defined relative to \(\bar{Y}\) and \(TSS\) from the training data. When predicting on a held-out fold, that fold has its own \(\bar{Y}\), so there is no single obvious baseline, and an out-of-sample \(R^2\) can even be negative.

MSE just tells us how much we’re off on average — it’s a more natural measure for out-of-sample prediction.

K-Fold Cross-Validation: Example

library(dplyr)  # provides filter()
set.seed(42)    # reproducible fold assignment
cv_data <- wss20 |> filter(!is.na(institutional_trust), !is.na(authoritarianism),
                           !is.na(republican), !is.na(democrat))
K <- 10
folds <- sample(rep(1:K, length.out = nrow(cv_data)))
mse_simple <- mse_full <- numeric(K)

for (k in 1:K) {
  train <- cv_data[folds != k, ]
  test  <- cv_data[folds == k, ]
  fit_s <- lm(institutional_trust ~ authoritarianism, data = train)
  mse_simple[k] <- mean((test$institutional_trust - predict(fit_s, test))^2)
  fit_f <- lm(institutional_trust ~ authoritarianism + republican + democrat, data = train)
  mse_full[k] <- mean((test$institutional_trust - predict(fit_f, test))^2)
}
data.frame(Model = c("Authoritarianism only", "+ Party ID dummies"),
           CV_MSE = round(c(mean(mse_simple), mean(mse_full)), 4)) |>
  knitr::kable()

Model	CV_MSE
Authoritarianism only	0.0368
+ Party ID dummies	0.0338

Here the fuller model's CV-MSE (0.0338) is about 8 percent lower than the simple model's (0.0368), so adding the party dummies improves out-of-sample prediction and not just in-sample fit.

The Lewis-Beck vs. Achen Debate over \(R^2\)

Lewis-Beck & Skalaban (1990): \(R^2\) is useful

A high \(R^2\) indicates that the model accounts for a substantial share of variance in \(Y\)
Comparing \(R^2\) across models helps assess whether new predictors contribute
It serves as a useful, comparable statistic

Achen (1982, 1990): \(R^2\) is misleading

\(R^2\) depends on the variance of \(X\) in the sample — the same causal effect can yield very different \(R^2\) values in different datasets
Researchers may observe inflated \(R^2\) by choosing samples with high variance in \(X\), or by adding irrelevant predictors

The Debate: Implications

Probably a middle ground:

\(R^2\) is useful for prediction — how well does the model forecast \(Y\)?
Methods like cross-validation are more robust for sorting out whether a new predictor genuinely improves the model

Summary

Coefficient interpretation depends on variable type — continuous IVs give marginal effects, categorical IVs give group differences
Dummy variables encode categorical data as 0/1; omit one category to avoid perfect collinearity
The additive model constrains all groups to share the same slope — interactions relax this
Always include lower-order terms in an interactive model
\(R^2\) never decreases as predictors are added; adjusted \(R^2\) penalizes for complexity
K-fold cross-validation estimates true predictive performance by repeatedly holding out data
The Lewis-Beck/Achen debate: \(R^2\) is useful for prediction, but coefficients matter more for causal inference

References

Achen, Christopher H. 1982. Interpreting and Using Regression. Sage.
Achen, Christopher H. 1990. “What Does ‘Explained Variance’ Explain?” Political Analysis 2: 173–184.
Fox, John. 2008. Applied Regression Analysis and Generalized Linear Models. Sage.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer.
Lewis-Beck, Michael S. and Andrew Skalaban. 1990. “The R-Squared: Some Straight Talk.” Political Analysis 2: 153–171.