Multivariable Regression Modelling and Exploratory Causal Inference
Multivariable Regression Model Building
Model Design
Multivariable linear regression involves selecting the dependent and predictor variables, fitting a hyperplane in multidimensional space and iteratively evaluating and refining the model through feature selection and performance assessment.
The linear model: \[
\boxed{\begin{align}
Y_i &= \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \epsilon_i \\
&= \beta_0 + \sum_{k=1}^{p} \beta_k X_{ik} + \epsilon_i
\end{align}}
\]
Where:
- \(Y_i\) is the response (outcome) variable
- \(X_{i1}, X_{i2}, \dots, X_{ip}\) are the predictor variables
- \(\beta_0\) is the (optional) intercept
- \(\beta_1, \beta_2, \dots, \beta_p\) are the coefficients for each predictor
- \(\epsilon_i\) is the error term capturing variation not explained by the predictors
Points to note:
- Typically, the intercept is included in the sum by setting \(X_{i0} = 1\) (important for the linear algebra / matrix formulation), giving: \[\boxed{Y_i = \sum_{k=0}^{p} \beta_k X_{ik} + \epsilon_i}\]
- The loss/cost function is: \[\sum_{i=1}^n \left(Y_i - \sum_{k=0}^{p} \beta_k X_{ik}\right)^2\]
- The important linearity is linearity in the coefficients; for example, the following is still a linear model: \[Y = \beta_0 + \beta_1 X_1^2 + \dots + \beta_p X_p^2 + \epsilon\]
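To make the last point concrete, here is a minimal sketch with simulated data (the variable names are purely illustrative): the model is quadratic in \(x\) but linear in the coefficients, so lm fits it by ordinary least squares.
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x^2 + rnorm(100, sd = 0.5)   # quadratic in x, linear in the coefficients
fit_quad <- lm(y ~ I(x^2))                # still an ordinary linear model
coef(fit_quad)                            # estimates of beta_0 and beta_1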
Brian Caffo’s intuitive development/understanding:
The loss/cost function and estimator for \(\beta_1\) in simple linear regression (one covariate, no intercept) are: \[\text{cost}:\quad\sum_{i=1}^n \left(y_i - \beta_1 x_{i1}\right)^2\qquad\text{estimate:}\quad\hat\beta_1=\frac{\sum_i y_i x_{i1}}{\sum_i x_{i1}^2}\]
The loss/cost function for two covariates is: \[\sum_{i=1}^n \left(y_i - \beta_1 x_{i1}- \beta_2 x_{i2}\right)^2\] …but we can absorb one covariate into a modified \(\tilde{y}_i\) term where \(\tilde{y}_i=y_i - \beta_1 x_{i1}\): \[\sum_{i=1}^n \left(\tilde{y}_i - \beta_2 x_{i2}\right)^2\] …this is precisely the expression for simple linear regression through the origin, and we can extend the idea to more than two covariates.
\[\text{cost}:\quad\sum_{i=1}^n \left(\tilde{y}_i - \beta_2 x_{i2}\right)^2\qquad\text{estimate:}\quad\hat\beta_2=\frac{\sum_i\tilde{y}_i x_{i2}}{\sum_i x_{i2}^2}\]
…substituting the \(\beta_2\) estimate into the cost function we obtain a cost function purely in terms of \(\beta_1\).
His idea:
- Multivariable regression starts with the total variance of the outcome and continuously reduces the residual variance through sequential application of simple linear regression on the different covariates.
- This is not quite correct: the variance is distributed more evenly, as these simple regressions happen simultaneously rather than sequentially.
He suggests:
- A coefficient \(\beta_k\) in a multivariable regression is the coefficient obtained once the linear effect of all the other variables on the response has been removed.
- Again, this is not quite right, as each coefficient takes its fair share of the variance. Respectfully, I don’t think it is a great explanation.
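What is exactly true here is the Frisch–Waugh–Lovell result: residualise both the outcome and one covariate on the remaining covariates, and the regression through the origin of residual on residual reproduces the multivariable coefficient. A minimal sketch with the swiss data (my own illustration, not Caffo’s code):
full <- lm(Fertility ~ Agriculture + Education, data = swiss)
# residualise the outcome and one covariate on the remaining covariate
ry <- resid(lm(Fertility ~ Education, data = swiss))
rx <- resid(lm(Agriculture ~ Education, data = swiss))
coef(lm(ry ~ rx - 1))        # regression through the origin of residual on residual...
coef(full)["Agriculture"]    # ...matches the multivariable coefficient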
Linear Algebra Perspective:
In multivariable regression, the fitted values are the orthogonal projection of \(Y\) onto the column space of \(X\):
\[\hat{Y}=X\hat{\beta},\quad \hat{\beta}=(X^\top X)^{-1}X^\top Y,\quad \hat{Y}=P_X Y\]
where \(P_X=X(X^\top X)^{-1}X^\top\) is the projection matrix.
The residuals are \(e=(I-P_X)Y\), orthogonal to the predictor space. Each coefficient \(\beta_k\) represents the effect of \(X_k\) after projecting out the linear effects of the other predictors. Least squares is a simultaneous projection, allocating each variable its fair share of explained variation.
Geometrically, we are finding the best-fit hyperplane through the cloud of points. Each covariate coefficient can be viewed as the slope in the 2-D cross-section along that covariate’s axis—simple, really.
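A minimal numerical check of the projection view, again with the swiss data (illustrative code):
X <- model.matrix(Fertility ~ Agriculture + Education, data = swiss)
y <- swiss$Fertility
beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
P_X      <- X %*% solve(crossprod(X)) %*% t(X)     # projection (hat) matrix
y_hat    <- P_X %*% y
all.equal(as.vector(beta_hat), unname(coef(lm(Fertility ~ Agriculture + Education, data = swiss))))
max(abs(t(X) %*% (y - y_hat)))                     # residuals are orthogonal to the column space (~ 0)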
FYI: lm will actually return an NA coefficient if you include a covariate that is an exact linear combination of the other covariates, i.e. \(X^\top X\) is singular. This is consistent with the hyperplane picture - it is a property of the algebra: the rank of the design matrix is too low for the dimension of the feature space.
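For instance, a sketch with simulated data:
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + x2                  # exact linear combination of x1 and x2
y  <- x1 + 2 * x2 + rnorm(50)
coef(lm(y ~ x1 + x2 + x3))     # the aliased x3 coefficient is reported as NA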
Main model building considerations (McCabe, 2025b):
- Model selection and overfitting
  - Model search - deciding which predictors to include in an analysis. It is important not to omit predictors that are theoretically or scientifically justified.
  - Overfitting to irrelevant predictors - including too many unnecessary variables can lead to a model that fits the sample data perfectly (zero residuals) but performs poorly on new data. Garbage in, garbage out.
  - Model validation - using techniques like cross-validation or a holdout dataset to assess predictive performance and avoid overfitting.
- Assumptions about predictors and residuals
  - Linearity - the relationship between predictors and the response is assumed to be linear; violations may require transformations or nonlinear modeling.
  - Collinearity / Multicollinearity - when two or more predictors are highly correlated, it can inflate standard errors and make coefficient estimates unstable.
  - Homoscedasticity - the variance of residuals should be roughly constant across predicted values; heteroscedasticity can affect inference.
  - Normality of errors - many inferential procedures assume normally distributed residuals, especially for small samples.
- Data quality and influence
  - Outliers and influential points - extreme values can disproportionately affect the model fit and estimates.
  - Missing data - how missing values are handled can impact estimates and inference.
- Purpose of the analysis
  - Causal versus predictive goals - be clear whether the goal is inference about relationships (causal) or accurate prediction; this affects model choice and interpretation.
Model Fitting
R command usage following (Peng, 2015) …
0: Load Libraries and Data
library(data.table)
library(ggplot2)
library(car) # for VIF
dt <- as.data.table(swiss)
1. Standardising Predictors in Multivariable Regression?
Standardising predictors before multivariable regression means transforming each variable to have mean 0 and standard deviation 1. This does not change the relationships among variables but rescales them so that the regression coefficients are directly comparable in magnitude. It is particularly useful when predictors are on very different scales, as it prevents variables with larger units from dominating the estimation and improves numerical stability. Standardisation also facilitates interpretation of coefficients in terms of standard deviation changes in the outcome per standard deviation change in the predictor.
<- c("Fertility", "Agriculture", "Examination", "Education")
predictors := lapply(.SD, scale), .SDcols = predictors]
dt[, (predictors) <- lm(mpg ~ Fertility + Agriculture + Examination + Education, data = dt) model
2. Visually Inspect Data
library(GGally) # for nice splom
ggpairs(dt, lower = list(continuous = wrap("smooth", method = "loess")))
High correlations may indicate potential multicollinearity issues.
3. a) Fit Regression Model
fit <- lm(Fertility ~ Agriculture, data = dt)
summary(fit)
Call:
lm(formula = Fertility ~ Agriculture, data = dt)
Residuals:
Min 1Q Median 3Q Max
-25.5374 -7.8685 -0.6362 9.0464 24.4858
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.30438 4.25126 14.185 <2e-16 ***
Agriculture 0.19420 0.07671 2.532 0.0149 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.82 on 45 degrees of freedom
Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
Notice: the Agriculture coefficient flips sign in the ill-thought-through regression in part b…
3. b) Fit Full Multivariable Regression Model
fit <- lm(Fertility ~ ., data = dt) # this is silly but for illustration
summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07
Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
Examination -0.2580082 0.25387820 -1.016268 3.154617e-01
Education -0.8709401 0.18302860 -4.758492 2.430605e-05
Catholic 0.1041153 0.03525785 2.952969 5.190079e-03
Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
This model estimates the association of each predictor with Fertility, adjusting for the other variables.
We expect a 0.17 decrease in (standardised) fertility for every 1% increase in the percentage of males involved in agriculture, holding the other variables constant - that is, holding examination, education, percent Catholic and infant mortality fixed. The standard error column (0.07 for Agriculture) indicates how precisely that coefficient is estimated.
4. Check Model Assumptions
- Homoscedasticity: residuals should have roughly constant variance.
ggplot(data.table(fitted = fitted(fit), resid = resid(fit)),
aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs Fitted Values")
- Normality of residuals:
qqnorm(resid(fit))
qqline(resid(fit))
- Multicollinearity: check variance inflation factors (VIF).
vif(fit)
Agriculture Examination Education Catholic
2.284129 3.675420 2.774943 1.937160
Infant.Mortality
1.107542
Model Assessment (McCabe, 2025a)
After fitting the model, R provides a few quick checks to assess the main aspects of model fit, including overall metrics such as the ANOVA table and VIF (aside: additional diagnostics - residuals, influence/leverage - can examine individual data points if needed; see (mccabe2025regression?)).
dt <- as.data.table(mtcars)
model <- lm(mpg ~ wt + disp + carb, data = dt)
ANOVA Table
An ANOVA (Analysis of Variance) table summarises how much of the variation in the response variable can be attributed to each term in a regression model. In R’s anova(), terms are assessed sequentially: the F-value for each term compares the model including that term with the model containing only the terms listed before it, effectively testing whether adding the term significantly improves the fit.
anova(model)
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.73 847.73 115.9334 1.836e-11 ***
disp 1 31.64 31.64 4.3270 0.04678 *
carb 1 41.94 41.94 5.7359 0.02355 *
Residuals 28 204.74 7.31
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Small p-values (high significance terms) indicate that the predictor contributes meaningfully to explaining the response, while large p-values (low significance terms) suggest little additional contribution once other predictors are included.
The residual row shows the variation left unexplained by the model.
Variance inflation Factors
The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated due to multicollinearity among the terms in a model. For each term, the VIF is obtained by regressing that term on all the other terms (swapping its role from independent to dependent variable) and measuring how well it can be predicted: the VIF is 1 / (1 − R²) of that auxiliary regression.
vif(model)
wt disp carb
4.890224 4.734708 1.225414
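As a cross-check on that definition, a minimal sketch (my own illustration) reproducing the VIF for wt via the auxiliary regression:
aux <- lm(wt ~ disp + carb, data = dt)    # regress wt on the other terms
1 / (1 - summary(aux)$r.squared)          # should match vif(model)["wt"]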
A VIF of 1 indicates no correlation with other terms, values between 1 and 5 suggest moderate correlation, and values above 5 (or 10) indicate high multicollinearity, which can make coefficient estimates unstable (imagine a flat plane teetering on a tightrope).
High VIF values highlight terms whose explanatory power overlaps substantially with other terms, making it difficult to isolate their individual effects.
Model Selection
Penn offers a course on this topic: A Crash Course in Causality: Inferring Causal Effects from Observational Data.
Correlation does not imply causation. Furthermore, in multivariable regression, apparent correlations between variables can sometimes be artifacts of an ill-considered model or poorly designed experiment.
- Interactions occur when the effect of one predictor depends on the level of another.
- Instability/Multicollinearity: Strong correlations among predictors can make coefficient estimates unstable and inflate apparent relationships.
- Confounding arises when a predictor is correlated with both the outcome and another predictor, potentially biasing coefficient estimates.
- Reverse causation: The outcome may instead be influencing the predictor.
- Nonlinear relationships: Linear correlation measures may miss or misrepresent relationships that are curvilinear.
Additionally, poor experimental design may lead to:
- Omission: Important variables are left out of the model, creating spurious associations.
- Selection bias: If the sample is not representative of the population, observed correlations may not reflect the true underlying relationships.
- Random chance: With many variables, some correlations may appear significant purely by chance (multiple testing issue).
Properly accounting for these factors ensures more reliable and interpretable regression results.
Classification of predictors
When choosing models we balance simplicity (fewer assumptions, fewer parameters) against explanatory power (ability to capture reality). The “Rumsfeldian triplet” provides a useful metaphor for thinking about sources of uncertainty and the risks of over/underfitting:
The Rumsfeldian Triplet
“there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know.” - Donald Rumsfeld
The Žižekian Quadruplet (extended & exhaustive)
“there are unknown knowns [the things we don’t know that we know - unconscious assumptions, beliefs, or ideological frameworks that shape how we see the world].” - Slavoj Žižek
Known knowns: Captured by the model. These are the patterns in data we can reliably represent.
Known unknowns: Quantified uncertainty, noise, or gaps we acknowledge. A good model should remain robust to these.
Unknown unknowns: Surprises - structure we didn’t anticipate, or regimes outside training data. Overly complex models may hallucinate explanations for these, while parsimonious models tend to generalize better.
Unknown knowns: Implicit biases and assumptions baked into our model choices (priors, loss functions, inductive biases). These shape what we see as “plausible” without us realising it.
Model Bias vs Variance
Omission of Important Covariates
Underfitting effect on model fit - Bias
- Leaving out relevant predictors introduces omitted variable bias (see the sketch after this list).
- For ML practitioners, adding many features can reduce bias (good for prediction).
- For data scientists concerned with inference, unnecessary inclusion inflates variance and standard errors.
- Metrics like R² (explained variance / total variance) are misleading here - R² never decreases as covariates are added, so it is not a good criterion/adjudicator for model selection.
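A minimal sketch of omitted variable bias with simulated data (illustrative, not from the source): x2 drives the outcome and is correlated with x1, so dropping x2 biases the x1 coefficient.
set.seed(7)
n  <- 1000
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.6)   # x1 is correlated with the omitted variable
y  <- 1 + 1 * x1 + 2 * x2 + rnorm(n)
coef(lm(y ~ x1 + x2))["x1"]   # close to the true value of 1
coef(lm(y ~ x1))["x1"]        # biased upwards because x2 is omitted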
Underfitting effect on residual variance
- Underfitting occurs when a model is too simple to capture the underlying structure in the data.
- Effect on variance: Because the model explains less of the total variance, the residual variance (unexplained variance) is larger.
- Implication: Predictions are less accurate, bias is high, and the model fails to capture important patterns.
Inclusion of Unnecessary Predictors
Overfitting effect on model fit - Variance Inflation
Let’s examine the standard error of the sampling distribution of a slope estimator in two scenarios.
Case 1: Redundant but Independent Predictors
- The redundant covariates x2 and x3 are independent of x1.
- Result: negligible variance inflation.
nSample <- 100
nSim <- 1000
x1 <- rnorm(nSample) # useful regressor
x2 <- rnorm(nSample) # redundant
x3 <- rnorm(nSample) # redundant

betas <- sapply(1:nSim, function(i) {
  y <- x1 + rnorm(nSample, sd = 0.3)
  c(coef(lm(y ~ x1))[2],
    coef(lm(y ~ x1 + x2))[2],
    coef(lm(y ~ x1 + x2 + x3))[2])
})
round(apply(betas, 1, sd), 5) # standard deviation of slope estimates
x1 x1 x1
0.03290 0.03290 0.03285
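By contrast, when the redundant covariates are highly correlated with x1, the spread of the slope estimates inflates noticeably; a sketch of that scenario (my own illustration, mirroring the simulation above):
nSample <- 100
nSim    <- 1000
x1 <- rnorm(nSample)                                  # useful regressor
x2 <- x1 * 0.9  + rnorm(nSample) * sqrt(1 - 0.9^2)    # redundant, correlated with x1
x3 <- x1 * 0.95 + rnorm(nSample) * sqrt(1 - 0.95^2)   # redundant, highly correlated with x1

betas <- sapply(1:nSim, function(i) {
  y <- x1 + rnorm(nSample, sd = 0.3)
  c(coef(lm(y ~ x1))[2],
    coef(lm(y ~ x1 + x2))[2],
    coef(lm(y ~ x1 + x2 + x3))[2])
})
round(apply(betas, 1, sd), 5)   # the sd of the x1 slope grows as correlated covariates are added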
Nested Models
The high significance (***) of the models containing additional predictors signifies that these new dimensions bring useful predictive power into the model: anova() on nested models performs sequential F-tests, each comparing a model with the previous one and testing whether the added predictors significantly reduce the residual sum of squares.
dt <- as.data.table(swiss)
fit1 <- lm(Fertility ~ Agriculture, data = dt)
fit3 <- update(fit1, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit1, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
anova(fit1, fit3, fit5)
Analysis of Variance Table
Model 1: Fertility ~ Agriculture
Model 2: Fertility ~ Agriculture + Examination + Education
Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 6283.1
2 43 3180.9 2 3102.2 30.211 8.638e-09 ***
3 41 2105.0 2 1075.9 10.477 0.0002111 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Overfitting effect on residual variance
- Overfitting occurs when a model is too flexible — instead of capturing the true underlying structure, it models the chance characteristics (noise) of the training data.
- Effect on variance:
  - The observed residual variance (variance of errors on the training data) decreases because the model explains more of the training data’s total variance.
  - The true, out-of-sample variance increases, because the model is unstable and performs poorly on new data.
- Implication:
  - Predictions on unseen data are unreliable.
  - While bias might be low on training data, variance is high, leading to poor generalisation.
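A minimal sketch of this trade-off with simulated data (illustrative only): training error keeps falling as the polynomial degree grows, while the held-out error eventually rises.
set.seed(3)
n   <- 60
x   <- runif(n, -2, 2)
y   <- sin(x) + rnorm(n, sd = 0.3)
idx <- sample(n, 30)
train <- data.frame(x = x[idx],  y = y[idx])
test  <- data.frame(x = x[-idx], y = y[-idx])

for (deg in c(1, 3, 9, 15)) {
  fit <- lm(y ~ poly(x, deg), data = train)
  cat("degree", deg,
      "train MSE:", round(mean(resid(fit)^2), 3),
      "test MSE:",  round(mean((test$y - predict(fit, newdata = test))^2), 3), "\n")
}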
Adjustment
In causal analysis, the primary aim is to estimate the effect of an exposure on an outcome. The exposure is the variable hypothesised to exert a causal influence (e.g. smoking) and the outcome is the variable potentially affected (e.g. lung cancer). The observed association, however, may be influenced by a third variable which can distort, transmit or modify the exposure-outcome relationship. Examples include general health, tar accumulation in the lungs or other behavioural factors. By adjusting for a third variable within a statistical model we attempt to obtain an unbiased estimate of the exposure’s effect on the outcome.
Third Variable Types (causal classification):
We postulate that the exposure influences the outcome:
- Effect modifier/interaction - alters the strength of the causal pathway between exposure and outcome, throttling the pathway itself.
- Mediator - a causal stepping stone between exposure and outcome, carrying part (or all) of the causal pathway.
- Confounder - a causal driver upstream of both the exposure and the outcome.
- Collider - a downstream variable driven by both the exposure and the outcome.
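A minimal sketch (simulated, purely illustrative) of how adjusting for a confounder can reverse an apparent association - the reversal scenario in the table below:
set.seed(11)
n        <- 500
confound <- rnorm(n)                                 # e.g. system age
exposure <- 2 * confound + rnorm(n)                  # older systems attract more maintenance spend
outcome  <- 3 * confound - exposure + rnorm(n)       # the true effect of the exposure is negative

coef(lm(outcome ~ exposure))              # marginal association appears positive
coef(lm(outcome ~ exposure + confound))   # adjusted estimate recovers the negative effect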
Third Variable Effects:

Scenario | Notes | Example
---|---|---
No change (precision): the marginal/primary effect is relatively unchanged - the mean difference in the residual plane is roughly equal to the total/marginal mean separation projected onto the y-axis | The third variable is a precision variable: it explains some variation in the outcome (thus reducing residual variance) but does not materially change the exposure–outcome estimate. | Outcome: no. beach rescues; Exposure: no. beach goers; Mediator: no. swimmers; Interaction: fitness; Confounder: ice-cream sales
Masking/Suppression: the marginal/primary effect gets magnified - the mean difference in the residual plane is very large whereas the total/marginal mean separation projected onto the y-axis was negligible | The third variable acts as a suppressor: after adjusting for it, the conditional effect is revealed to be substantially larger. Good overlapping data in the adjusting \(x\) cross-section. | Outcome: life expectancy; Exposure: gender; Mediator: occupation risk; Interaction: diet/exercise; Confounder: Chinese national
Amplification: the marginal/primary effect gets hidden - the mean difference in the residual plane is negligible whereas the total/marginal mean separation projected onto the y-axis was very large: a bad experiment (poor propensity score) | The third variable acts as a confounder: the large marginal difference was entirely due to imbalance in covariates, and the effect disappears with adjustment. No overlapping data in the adjusting \(x\) cross-section. | Outcome: no. drownings; Exposure: gender; Mediator: no. men’s triathlons; Interaction: cider festival sales; Confounder: fitness (ironman)
Reversal: the marginal/primary effect is inverted - the mean difference in the residual plane is roughly opposite to the total/marginal mean separation projected onto the y-axis (Simpson’s paradox) | The third variable is a confounder: it reverses the apparent association. Intermediate overlapping data in the adjusting \(x\) cross-section. | Outcome: no. breakdowns; Exposure: maintenance costs; Mediator: custodian exp.; Interaction: sr. management; Confounder: system age
Effect modification: the marginal/primary effect is relatively unchanged but with a clear interaction - the mean difference in the residual plane is roughly equal to the total/marginal mean separation projected onto the y-axis | The third variable is an effect modifier: it alters the strength or direction of the main effect within its levels, but the overall average effect remains roughly unchanged. | Outcome: body mass index; Exposure: level of exercise; Mediator: muscle mass; Interaction: ???; Confounder: desire to slim

Nothing discussed here is exclusive to a continuous control variable.
Epilogue
Multivariable analysis is challenging: adjustment can make effects disappear, and removing an adjustment can make effects appear. We need to be constantly thinking about the causal mechanism and viewing the data from different perspectives.

> “All models are wrong, but some are useful.” — George E. P. Box (1976)