Multivariable Regression Modelling and Exploratory Causal Inference
Multivariable Regression Model Building
Model Design
Multivariable linear regression involves selecting the dependent and predictor variables, fitting a hyperplane in multidimensional space and iteratively evaluating and refining the model through feature selection and performance assessment.
The linear model: \[
\boxed{\begin{align}
Y_i &= \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \epsilon_i \\
&= \beta_0 + \sum_{k=1}^{p} \beta_k X_{ik} + \epsilon_i
\end{align}}
\]
Where:
- \(Y_i\) is the response (outcome) variable
- \(X_{i1}, X_{i2}, \dots, X_{ip}\) are the predictor variables
- \(\beta_0\) is the (optional) intercept
- \(\beta_1, \beta_2, \dots, \beta_p\) are the coefficients for each predictor
- \(\epsilon_i\) is the error term capturing variation not explained by the predictors
Points to note:
- Typically, the intercept is included in the sum by setting \(X_{i0} = 1\) (important for the linear algebra / matrix formulation), giving: \[\boxed{Y_i = \sum_{k=0}^{p} \beta_k X_{ik} + \epsilon_i}\]
- The loss/cost function is: \[\sum_{i=1}^n \left(Y_i - \sum_{k=0}^{p} \beta_k X_{ik}\right)^2\]
- The important linearity is linearity in the coefficients; for example, the following is still a linear model: \[Y = \beta_0 + \beta_1 X_1^2 + \dots + \beta_p X_p^2 + \epsilon\]
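To make the last point concrete, here is a minimal sketch with simulated data (the variable names are purely illustrative): the model is quadratic in \(x\) but linear in the coefficients, so lm fits it by ordinary least squares.
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x^2 + rnorm(100, sd = 0.5)   # quadratic in x, linear in the coefficients
fit_quad <- lm(y ~ I(x^2))                # still an ordinary linear model
coef(fit_quad)                            # estimates of beta_0 and beta_1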
Brian Caffo’s intuitive development/understanding:
The loss/cost function and estimator for \(\beta_1\) in simple linear regression (one covariate, no intercept) are: \[\text{cost}:\quad\sum_{i=1}^n \left(y_i - \beta_1 x_{i1}\right)^2\qquad\text{estimate:}\quad\hat\beta_1=\frac{\sum_i y_i x_{i1}}{\sum_i x_{i1}^2}\]
The loss/cost function for two covariates is: \[\sum_{i=1}^n \left(y_i - \beta_1 x_{i1}- \beta_2 x_{i2}\right)^2\] …but we can absorb one covariate into a modified \(\tilde{y}_i\) term where \(\tilde{y}_i=y_i - \beta_1 x_{i1}\): \[\sum_{i=1}^n \left(\tilde{y}_i - \beta_2 x_{i2}\right)^2\] …this is precisely the expression for simple linear regression through the origin, and we can extend the idea to more than two covariates.
\[\text{cost}:\quad\sum_{i=1}^n \left(\tilde{y}_i - \beta_2 x_{i2}\right)^2\qquad\text{estimate:}\quad\hat\beta_2=\frac{\sum_i\tilde{y}_i x_{i2}}{\sum_i x_{i2}^2}\]
…substituting the \(\beta_2\) estimate into the cost function we obtain a cost function purely in terms of \(\beta_1\).
His idea:
- Multivariable regression starts with the total variance of the outcome and continuously reduces the residual variance through sequential application of simple linear regression on the different covariates.
- This is not quite correct: the variance is distributed more evenly, as these simple regressions happen simultaneously rather than sequentially.
He suggests:
- A coefficient \(\beta_k\) in a multivariable regression is the coefficient obtained once the linear effect of all the other variables on the response has been removed.
- Again, this is not quite right, as each coefficient takes its fair share of the variance. Respectfully, I don’t think it is a great explanation.
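What is exactly true here is the Frisch–Waugh–Lovell result: residualise both the outcome and one covariate on the remaining covariates, and the regression through the origin of residual on residual reproduces the multivariable coefficient. A minimal sketch with the swiss data (my own illustration, not Caffo’s code):
full <- lm(Fertility ~ Agriculture + Education, data = swiss)
# residualise the outcome and one covariate on the remaining covariate
ry <- resid(lm(Fertility ~ Education, data = swiss))
rx <- resid(lm(Agriculture ~ Education, data = swiss))
coef(lm(ry ~ rx - 1))        # regression through the origin of residual on residual...
coef(full)["Agriculture"]    # ...matches the multivariable coefficient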
Linear Algebra Perspective:
In multivariable regression, the fitted values are the orthogonal projection of \(Y\) onto the column space of \(X\):
\[\hat{Y}=X\hat{\beta},\quad \hat{\beta}=(X^\top X)^{-1}X^\top Y,\quad \hat{Y}=P_X Y\]
where \(P_X=X(X^\top X)^{-1}X^\top\) is the projection matrix.
The residuals are \(e=(I-P_X)Y\), orthogonal to the predictor space. Each coefficient \(\beta_k\) represents the effect of \(X_k\) after projecting out the linear effects of the other predictors. Least squares is a simultaneous projection, allocating each variable its fair share of explained variation.
Geometrically, we are finding the best-fit hyperplane through the cloud of points. Each covariate coefficient can be viewed as the slope in the 2-D cross-section along that covariate’s axis—simple, really.
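A minimal numerical check of the projection view, again with the swiss data (illustrative code):
X <- model.matrix(Fertility ~ Agriculture + Education, data = swiss)
y <- swiss$Fertility
beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
P_X      <- X %*% solve(crossprod(X)) %*% t(X)     # projection (hat) matrix
y_hat    <- P_X %*% y
all.equal(as.vector(beta_hat), unname(coef(lm(Fertility ~ Agriculture + Education, data = swiss))))
max(abs(t(X) %*% (y - y_hat)))                     # residuals are orthogonal to the column space (~ 0)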
FYI: lm will actually return an NA coefficient if you include a covariate that is an exact linear combination of the other covariates, i.e. \(X^\top X\) is singular. This is consistent with the hyperplane picture - it is a property of the algebra: the rank of the design matrix is too low for the dimension of the feature space.
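For instance, a sketch with simulated data:
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + x2                  # exact linear combination of x1 and x2
y  <- x1 + 2 * x2 + rnorm(50)
coef(lm(y ~ x1 + x2 + x3))     # the aliased x3 coefficient is reported as NA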
Main model building considerations (McCabe, 2025b):
- Model selection and overfitting
  - Model search - deciding which predictors to include in an analysis. It is important not to omit predictors that are theoretically or scientifically justified.
  - Overfitting to irrelevant predictors - including too many unnecessary variables can lead to a model that fits the sample data perfectly (zero residuals) but performs poorly on new data. Garbage in, garbage out.
  - Model validation - using techniques like cross-validation or a holdout dataset to assess predictive performance and avoid overfitting.
- Assumptions about predictors and residuals
  - Linearity - the relationship between predictors and the response is assumed to be linear; violations may require transformations or nonlinear modeling.
  - Collinearity / Multicollinearity - when two or more predictors are highly correlated, it can inflate standard errors and make coefficient estimates unstable.
  - Homoscedasticity - the variance of residuals should be roughly constant across predicted values; heteroscedasticity can affect inference.
  - Normality of errors - many inferential procedures assume normally distributed residuals, especially for small samples.
- Data quality and influence
  - Outliers and influential points - extreme values can disproportionately affect the model fit and estimates.
  - Missing data - how missing values are handled can impact estimates and inference.
- Purpose of the analysis
  - Causal versus predictive goals - be clear whether the goal is inference about relationships (causal) or accurate prediction; this affects model choice and interpretation.
Model Fitting
R command usage following (Peng, 2015) …
0: Load Libraries and Data
library(data.table)
library(ggplot2)
library(car) # for VIF
dt <- as.data.table(swiss)
1. Standardising Predictors in Multivariable Regression?
Standardising predictors before multivariable regression means transforming each variable to have mean 0 and standard deviation 1. This does not change the relationships among variables but rescales them so that the regression coefficients are directly comparable in magnitude. It is particularly useful when predictors are on very different scales, as it prevents variables with larger units from dominating the estimation and improves numerical stability. Standardisation also facilitates interpretation of coefficients in terms of standard deviation changes in the outcome per standard deviation change in the predictor.
<- c("Fertility", "Agriculture", "Examination", "Education")
predictors := lapply(.SD, scale), .SDcols = predictors]
dt[, (predictors) <- lm(mpg ~ Fertility + Agriculture + Examination + Education, data = dt) model
2. Visually Inspect Data
library(GGally) # for nice splom
ggpairs(dt, lower = list(continuous = wrap("smooth", method = "loess")))
High correlations may indicate potential multicollinearity issues.
3. a) Fit Regression Model
fit <- lm(Fertility ~ Agriculture, data = dt)
summary(fit)
Call:
lm(formula = Fertility ~ Agriculture, data = dt)
Residuals:
Min 1Q Median 3Q Max
-25.5374 -7.8685 -0.6362 9.0464 24.4858
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.30438 4.25126 14.185 <2e-16 ***
Agriculture 0.19420 0.07671 2.532 0.0149 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.82 on 45 degrees of freedom
Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
Notice: the Agriculture coefficient flips sign in the ill-thought-through regression in part b…
3. b) Fit Full Multivariable Regression Model
fit <- lm(Fertility ~ ., data = dt) # this is silly but for illustration
summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07
Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
Examination -0.2580082 0.25387820 -1.016268 3.154617e-01
Education -0.8709401 0.18302860 -4.758492 2.430605e-05
Catholic 0.1041153 0.03525785 2.952969 5.190079e-03
Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
This model estimates the association of each predictor with Fertility, adjusting for the other variables.
We expect a 0.17 decrease in (standardised) fertility for every 1% increase in the percentage of males involved in agriculture, holding the other variables constant - that is, holding examination, education, percent Catholic and infant mortality fixed. The standard error column (0.07 for Agriculture) indicates how precisely that coefficient is estimated.
4. Check Model Assumptions
- Homoscedasticity: residuals should have roughly constant variance.
ggplot(data.table(fitted = fitted(fit), resid = resid(fit)),
aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs Fitted Values")
- Normality of residuals:
qqnorm(resid(fit))
qqline(resid(fit))
- Multicollinearity: check variance inflation factors (VIF).
vif(fit)
Agriculture Examination Education Catholic
2.284129 3.675420 2.774943 1.937160
Infant.Mortality
1.107542
Model Assessment (McCabe, 2025a)
After fitting the model, R provides a few quick checks to assess the main aspects of model fit, including overall metrics such as the ANOVA table and VIF (aside: additional diagnostics - residuals, influence/leverage - can examine individual data points if needed; see (mccabe2025regression?)).
dt <- as.data.table(mtcars)
model <- lm(mpg ~ wt + disp + carb, data = dt)
ANOVA Table
An ANOVA (Analysis of Variance) table summarises how much of the variation in the response variable can be attributed to each term in a regression model. In R’s anova(), terms are assessed sequentially: the F-value for each term compares the model including that term with the model containing only the terms listed before it, effectively testing whether adding the term significantly improves the fit.
anova(model)
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.73 847.73 115.9334 1.836e-11 ***
disp 1 31.64 31.64 4.3270 0.04678 *
carb 1 41.94 41.94 5.7359 0.02355 *
Residuals 28 204.74 7.31
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Small p-values (high significance terms) indicate that the predictor contributes meaningfully to explaining the response, while large p-values (low significance terms) suggest little additional contribution once other predictors are included.
The residual row shows the variation left unexplained by the model.
Variance inflation Factors
The Variance Inflation Factor (VIF) quantifies how much the variance of a regression coefficient is inflated due to multicollinearity among the terms in a model. For each term, the VIF is obtained by regressing that term on all the other terms (swapping its role from independent to dependent variable) and measuring how well it can be predicted: the VIF is 1 / (1 − R²) of that auxiliary regression.
vif(model)
wt disp carb
4.890224 4.734708 1.225414
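As a cross-check on that definition, a minimal sketch (my own illustration) reproducing the VIF for wt via the auxiliary regression:
aux <- lm(wt ~ disp + carb, data = dt)    # regress wt on the other terms
1 / (1 - summary(aux)$r.squared)          # should match vif(model)["wt"]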
A VIF of 1 indicates no correlation with other terms, values between 1 and 5 suggest moderate correlation, and values above 5 (or 10) indicate high multicollinearity, which can make coefficient estimates unstable (imagine a flat plane teetering on a tightrope).
High VIF values highlight terms whose explanatory power overlaps substantially with other terms, making it difficult to isolate their individual effects.
Model Selection
Penn offers a course on this topic: A Crash Course in Causality: Inferring Causal Effects from Observational Data.
Correlation does not imply causation. Furthermore, in multivariable regression, apparent correlations between variables can sometimes be artifacts of an ill-considered model or poorly designed experiment.
- Interactions occur when the effect of one predictor depends on the level of another.
- Instability/Multicollinearity: Strong correlations among predictors can make coefficient estimates unstable and inflate apparent relationships.
- Confounding arises when a predictor is correlated with both the outcome and another predictor, potentially biasing coefficient estimates.
- Reverse causation: The outcome may instead be influencing the predictor.
- Nonlinear relationships: Linear correlation measures may miss or misrepresent relationships that are curvilinear.
Additionally, poor experimental design may lead to:
- Omission: Important variables are left out of the model, creating spurious associations.
- Selection bias: If the sample is not representative of the population, observed correlations may not reflect the true underlying relationships.
- Random chance: With many variables, some correlations may appear significant purely by chance (multiple testing issue).
Properly accounting for these factors ensures more reliable and interpretable regression results.
Classification of predictors
When choosing models we balance simplicity (fewer assumptions, fewer parameters) against explanatory power (ability to capture reality). The “Rumsfeldian triplet” provides a useful metaphor for thinking about sources of uncertainty and the risks of over/underfitting:
The Rumsfeldian Triplet
“there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know.” - Donald Rumsfeld
The Žižekian Quadruplet (extended & exhaustive)
“there are unknown knowns [the things we don’t know that we know - unconscious assumptions, beliefs, or ideological frameworks that shape how we see the world].” - Slavoj Žižek
Known knowns: Captured by the model. These are the patterns in data we can reliably represent.
Known unknowns: Quantified uncertainty, noise, or gaps we acknowledge. A good model should remain robust to these.
Unknown unknowns: Surprises - structure we didn’t anticipate, or regimes outside training data. Overly complex models may hallucinate explanations for these, while parsimonious models tend to generalize better.
Unknown knowns: Implicit biases and assumptions baked into our model choices (priors, loss functions, inductive biases). These shape what we see as “plausible” without us realising it.
Model Bias vs Variance
Omission of Important Covariates
Underfitting effect on model fit - Bias
- Leaving out relevant predictors introduces omitted variable bias (see the sketch after this list).
- For ML practitioners, adding many features can reduce bias (good for prediction).
- For data scientists concerned with inference, unnecessary inclusion inflates variance and standard errors.
- Metrics like R² (explained variance / total variance) are misleading here - R² never decreases as covariates are added, so it is not a good criterion/adjudicator for model selection.
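A minimal sketch of omitted variable bias with simulated data (illustrative, not from the source): x2 drives the outcome and is correlated with x1, so dropping x2 biases the x1 coefficient.
set.seed(7)
n  <- 1000
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.6)   # x1 is correlated with the omitted variable
y  <- 1 + 1 * x1 + 2 * x2 + rnorm(n)
coef(lm(y ~ x1 + x2))["x1"]   # close to the true value of 1
coef(lm(y ~ x1))["x1"]        # biased upwards because x2 is omitted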
Underfitting effect on residual variance
- Underfitting occurs when a model is too simple to capture the underlying structure in the data.
- Effect on variance: Because the model explains less of the total variance, the residual variance (unexplained variance) is larger.
- Implication: Predictions are less accurate, bias is high, and the model fails to capture important patterns.
Inclusion of Unnecessary Predictors
Overfitting effect on model fit - Variance Inflation
Let’s examine the standard error of the sampling distribution of a slope estimator in two scenarios.
Case 1: Redundant but Independent Predictors
- The redundant covariates x2 and x3 are independent of x1.
- Result: negligible variance inflation.
nSample <- 100
nSim <- 1000
x1 <- rnorm(nSample) # useful regressor
x2 <- rnorm(nSample) # redundant
x3 <- rnorm(nSample) # redundant

betas <- sapply(1:nSim, function(i) {
  y <- x1 + rnorm(nSample, sd = 0.3)
  c(coef(lm(y ~ x1))[2],
    coef(lm(y ~ x1 + x2))[2],
    coef(lm(y ~ x1 + x2 + x3))[2])
})
round(apply(betas, 1, sd), 5) # standard deviation of slope estimates
x1 x1 x1
0.03290 0.03290 0.03285
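By contrast, when the redundant covariates are highly correlated with x1, the spread of the slope estimates inflates noticeably; a sketch of that scenario (my own illustration, mirroring the simulation above):
nSample <- 100
nSim    <- 1000
x1 <- rnorm(nSample)                                  # useful regressor
x2 <- x1 * 0.9  + rnorm(nSample) * sqrt(1 - 0.9^2)    # redundant, correlated with x1
x3 <- x1 * 0.95 + rnorm(nSample) * sqrt(1 - 0.95^2)   # redundant, highly correlated with x1

betas <- sapply(1:nSim, function(i) {
  y <- x1 + rnorm(nSample, sd = 0.3)
  c(coef(lm(y ~ x1))[2],
    coef(lm(y ~ x1 + x2))[2],
    coef(lm(y ~ x1 + x2 + x3))[2])
})
round(apply(betas, 1, sd), 5)   # the sd of the x1 slope grows as correlated covariates are added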
Nested Models
The high significance (***) of the models containing additional predictors signifies that these new dimensions bring useful predictive power into the model: anova() on nested models performs sequential F-tests, each comparing a model with the previous one and testing whether the added predictors significantly reduce the residual sum of squares.
dt <- as.data.table(swiss)
fit1 <- lm(Fertility ~ Agriculture, data = dt)
fit3 <- update(fit1, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit1, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
anova(fit1, fit3, fit5)
Analysis of Variance Table
Model 1: Fertility ~ Agriculture
Model 2: Fertility ~ Agriculture + Examination + Education
Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 6283.1
2 43 3180.9 2 3102.2 30.211 8.638e-09 ***
3 41 2105.0 2 1075.9 10.477 0.0002111 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Overfitting effect on residual variance
- Overfitting occurs when a model is too flexible — instead of capturing the true underlying structure, it models the chance characteristics (noise) of the training data.
- Effect on variance:
  - The observed residual variance (variance of errors on the training data) decreases because the model explains more of the training data’s total variance.
  - The true, out-of-sample variance increases, because the model is unstable and performs poorly on new data.
- Implication:
  - Predictions on unseen data are unreliable.
  - While bias might be low on training data, variance is high, leading to poor generalisation.
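A minimal sketch of this trade-off with simulated data (illustrative only): training error keeps falling as the polynomial degree grows, while the held-out error eventually rises.
set.seed(3)
n   <- 60
x   <- runif(n, -2, 2)
y   <- sin(x) + rnorm(n, sd = 0.3)
idx <- sample(n, 30)
train <- data.frame(x = x[idx],  y = y[idx])
test  <- data.frame(x = x[-idx], y = y[-idx])

for (deg in c(1, 3, 9, 15)) {
  fit <- lm(y ~ poly(x, deg), data = train)
  cat("degree", deg,
      "train MSE:", round(mean(resid(fit)^2), 3),
      "test MSE:",  round(mean((test$y - predict(fit, newdata = test))^2), 3), "\n")
}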
Adjustment
In causal analysis, the primary aim is to estimate the effect of an exposure on an outcome. The exposure is the variable hypothesised to exert a causal influence (e.g. smoking) and the outcome is the variable potentially affected (e.g. lung cancer). The observed association, however, may be influenced by a third variable which can distort, transmit or modify the exposure-outcome relationship. Examples include general health, tar accumulation in the lungs or other behavioural factors. By adjusting for a third variable within a statistical model we attempt to obtain an unbiased estimate of the exposure’s effect on the outcome.
Third Variable Types (causal classification):
We postulate that the exposure influences the outcome:
- Effect modifier/interaction - alters the strength of the causal pathway between exposure and outcome, throttling the pathway itself.
- Mediator - a causal stepping stone between exposure and outcome, carrying part (or all) of the causal pathway.
- Confounder - a causal driver upstream of both the exposure and the outcome.
- Collider - a downstream variable driven by both the exposure and the outcome.
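A minimal sketch (simulated, purely illustrative) of how adjusting for a confounder can reverse an apparent association - the reversal scenario in the table below:
set.seed(11)
n        <- 500
confound <- rnorm(n)                                 # e.g. system age
exposure <- 2 * confound + rnorm(n)                  # older systems attract more maintenance spend
outcome  <- 3 * confound - exposure + rnorm(n)       # the true effect of the exposure is negative

coef(lm(outcome ~ exposure))              # marginal association appears positive
coef(lm(outcome ~ exposure + confound))   # adjusted estimate recovers the negative effect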
Third Variable Effects:

Scenario | Notes | Example
---|---|---
No change (precision): the marginal/primary effect is relatively unchanged - the mean difference in the residual plane is roughly equal to the total/marginal mean separation projected onto the y-axis | The third variable is a precision variable: it explains some variation in the outcome (thus reducing residual variance) but does not materially change the exposure–outcome estimate. | Outcome: no. beach rescues; Exposure: no. beach goers; Mediator: no. swimmers; Interaction: fitness; Confounder: ice-cream sales
Masking/Suppression: the marginal/primary effect gets magnified - the mean difference in the residual plane is very large whereas the total/marginal mean separation projected onto the y-axis was negligible | The third variable acts as a suppressor: after adjusting for it, the conditional effect is revealed to be substantially larger. Good overlapping data in the adjusting \(x\) cross-section. | Outcome: life expectancy; Exposure: gender; Mediator: occupation risk; Interaction: diet/exercise; Confounder: Chinese national
Amplification: the marginal/primary effect gets hidden - the mean difference in the residual plane is negligible whereas the total/marginal mean separation projected onto the y-axis was very large: a bad experiment (poor propensity score) | The third variable acts as a confounder: the large marginal difference was entirely due to imbalance in covariates, and the effect disappears with adjustment. No overlapping data in the adjusting \(x\) cross-section. | Outcome: no. drownings; Exposure: gender; Mediator: no. men’s triathlons; Interaction: cider festival sales; Confounder: fitness (ironman)
Reversal: the marginal/primary effect is inverted - the mean difference in the residual plane is roughly opposite to the total/marginal mean separation projected onto the y-axis (Simpson’s paradox) | The third variable is a confounder: it reverses the apparent association. Intermediate overlapping data in the adjusting \(x\) cross-section. | Outcome: no. breakdowns; Exposure: maintenance costs; Mediator: custodian exp.; Interaction: sr. management; Confounder: system age
Effect modification: the marginal/primary effect is relatively unchanged but with a clear interaction - the mean difference in the residual plane is roughly equal to the total/marginal mean separation projected onto the y-axis | The third variable is an effect modifier: it alters the strength or direction of the main effect within its levels, but the overall average effect remains roughly unchanged. | Outcome: body mass index; Exposure: level of exercise; Mediator: muscle mass; Interaction: ???; Confounder: desire to slim

Nothing discussed here is exclusive to a continuous control variable.
Epilogue
Multivariable analysis is challenging: adjustment can make effects disappear, and removing an adjustment can make effects appear. We need to be constantly thinking about the causal mechanism and viewing the data from different perspectives.

> “All models are wrong, but some are useful.” — George E. P. Box (1976)