Generalised Linear Models
“GLMs: the sum of residuals times any regressor variable is always zero.”
Introduction/terminology
In regression, covariates are the independent variables \(x_1, x_2, \dots, x_p\) used to predict the dependent outcome. A linear predictor is the weighted sum of the covariates:
\[
\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p,
\]
which is the purely linear part of the model before it is “bent to fit” non-normal data using a link function. The link function \(g\) relates the expected value of the response, \(\mu = \mathbb{E}[Y]\), to the linear predictor via \(g(\mu) = \eta\). The systematic component is this linear predictor:
\[
\text{Systematic component:} \quad \eta = \mathbf{X} \boldsymbol{\beta},
\]
representing the part of the model that is “systematic” – determined entirely by the covariates and coefficients, with no randomness. The remaining part is the random component (noise or error), which specifies how the outcome \(Y\) varies about its mean. This component always follows a distribution from the exponential family. The shape and variability of each \(Y_i\) depend on its mean \(\mu_i = \mathbb{E}[Y_i]\) and a dispersion parameter \(\phi\), which controls the spread of the distribution beyond the mean, further generalising the model.
The complete GLM framework is
\[
\boxed{
\begin{aligned}
&\text{Random component:} && Y_i \sim \text{Exponential family}(\mu_i, \phi) \\
&\text{Systematic component:} && \eta_i = \mathbf{x}_i^\top \boldsymbol{\beta} \\
&\text{Link function:} && g(\mu_i) = \eta_i
\end{aligned}
}
\]
This general form covers many common distributions used in GLMs, such as the Gaussian, Binomial, Poisson, and Gamma. Each has a canonical link function that arises naturally from the exponential-family form of the response distribution; canonical links have useful mathematical properties, including simplifying estimation and interpretation.
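To make the three components concrete, the following sketch simulates data from a Poisson GLM with a log link and then recovers the coefficients by maximum likelihood. This is a minimal sketch assuming numpy and statsmodels are available; the sample size, design, and true coefficients are illustrative.
```python
# Minimal sketch of the three GLM components for a Poisson model,
# assuming numpy and statsmodels are available.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 500, 2
X = sm.add_constant(rng.normal(size=(n, p)))   # design matrix with intercept
beta = np.array([0.5, 0.3, -0.2])              # true coefficients (illustrative)

eta = X @ beta                 # systematic component: linear predictor
mu = np.exp(eta)               # inverse of the log link: mu = g^{-1}(eta)
y = rng.poisson(mu)            # random component: Y_i ~ Poisson(mu_i)

# Fit the same model by maximum likelihood (log link is the Poisson default).
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)              # should be close to beta for large n
```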
Exponential Distributions (Distribution Class)
This is a huge class of distributions (Normal, Binomial, Poisson, Gamma, etc.). A probability distribution belongs to the exponential family if its probability density (or mass) function can be written as:
\[ f(y; \theta, \phi) = \exp\left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right) \]
where:
- \(\theta\) is the natural parameter, which parametrizes the distribution in a way that appears linearly in the exponent.
- \(\phi\) is the dispersion parameter, controlling the spread or variance, often known or fixed in some models.
- \(b(\theta)\) is the cumulant function, a convex function that ensures proper normalization and encodes moments of the distribution.
- \(a(\phi)\) is a dispersion (variance-scaling) function.
- \(c(y, \phi)\) is a normalisation term that does not depend on \(\theta\).
The nice thing about this form is that:
- The mean of the distribution is linked to the natural parameter by \(\mu = \mathbb{E}[Y] = b'(\theta)\).
- The variance is \(\mathrm{Var}(Y) = a(\phi) \, b''(\theta)\).
These properties make the exponential family distributions especially convenient for statistical modeling and inference.
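As a quick numerical illustration of these two identities, the sketch below checks \(\mu = b'(\theta)\) and \(\mathrm{Var}(Y) = a(\phi)\,b''(\theta)\) for the Poisson case, where \(b(\theta) = e^{\theta}\) and \(a(\phi) = 1\). A minimal sketch assuming numpy is available; the chosen \(\theta\) is illustrative.
```python
# Numerical check of mu = b'(theta) and Var(Y) = a(phi) * b''(theta)
# for the Poisson case, where b(theta) = exp(theta) and a(phi) = 1.
import numpy as np

rng = np.random.default_rng(1)
theta = 0.7                     # natural parameter (illustrative value)
b_prime = np.exp(theta)         # b'(theta)  -> mean
b_double_prime = np.exp(theta)  # b''(theta) -> variance (since phi = 1)

y = rng.poisson(np.exp(theta), size=200_000)
print(y.mean(), b_prime)        # both approximately exp(0.7) ~ 2.01
print(y.var(), b_double_prime)  # both approximately exp(0.7) ~ 2.01
```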
Catalogue
Linear Regression
The canonical link here is the identity function \(g(\mu)=\mu\).
Gaussian (Normal) distribution in standard form: \(f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right)\)
Gaussian in exponential family form \(f(y; \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)\)
For the Gaussian: \(\theta = \mu\), \(\phi = \sigma^2\), \(b(\theta) = \frac{\theta^2}{2}\), \(a(\phi) = \phi\), and \(c(y, \phi) = -\frac{y^2}{2\phi} - \frac12 \log(2\pi\phi)\).
Full exponential family form for Gaussian: \(f(y; \theta, \phi) = \exp\left( \frac{y\theta - \frac{\theta^2}{2}}{\phi} - \frac{y^2}{2\phi} - \frac12\log(2\pi\phi) \right)\)
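A small sanity check that the two forms agree: the snippet below evaluates both expressions on a grid of \(y\) values. A minimal sketch assuming numpy is available; the values of \(\mu\) and \(\sigma^2\) are illustrative.
```python
# Check that the exponential-family form of the Gaussian density matches
# the standard form on a grid of y values (a numerical sanity check).
import numpy as np

mu, sigma2 = 1.5, 2.0
theta, phi = mu, sigma2                     # canonical parameter and dispersion
y = np.linspace(-5, 5, 11)

standard = np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
exp_family = np.exp((y * theta - theta ** 2 / 2) / phi
                    - y ** 2 / (2 * phi)
                    - 0.5 * np.log(2 * np.pi * phi))

print(np.allclose(standard, exp_family))    # True
```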
Poisson (Count) Regression
The canonical link here is the log function \(g(\mu)=\operatorname{log}(\mu)\).
Binomial (Logistic) Regression
The canonical link here is the logit function \(g(\mu)=\operatorname{log}\left(\frac{\mu}{1-\mu}\right)\).
Gamma Regression: Modeling positive continuous skewed data (e.g., insurance claims)
The canonical link here is the inverse function \(g(\mu)=\frac{1}{\mu}\).
Inverse Gaussian Regression: Highly skewed positive outcomes
The canonical link here is the inverse-squared function \(g(\mu)=\frac{1}{\mu^2}\).
Negative Binomial: Overdispersed count data
The link typically used here is the log function \(g(\mu)=\operatorname{log}(\mu)\).
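The catalogue can be collected into a small lookup table of canonical links, sketched below under the assumption that numpy is available; the dictionary keys are illustrative, and the negative binomial is omitted since, as noted above, it is typically fitted with the log link in practice.
```python
# The catalogue above as a small lookup table of canonical link functions.
import numpy as np

canonical_links = {
    "gaussian":         lambda mu: mu,                      # identity
    "poisson":          lambda mu: np.log(mu),              # log
    "binomial":         lambda mu: np.log(mu / (1 - mu)),   # logit
    "gamma":            lambda mu: 1 / mu,                  # inverse
    "inverse_gaussian": lambda mu: 1 / mu**2,               # inverse squared
}

print(canonical_links["binomial"](0.8))   # logit(0.8) ~ 1.386
```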
The Elegance of Generalised Linear Models (OpenAI, 2025)
Generalised Linear Models (GLMs) are often presented as a combination of three ingredients:
1. An exponential family distribution,
2. A link function, and
3. A linear predictor.
While this is correct, the real elegance of GLMs lies in the deeper mathematical and geometric properties they inherit from the exponential family and maximum likelihood estimation. This document highlights those properties, and provides first-principle derivations in the appendix.
1. Orthogonality of Residuals and Regressors
Statement: In a GLM with a canonical link, the residuals are orthogonal to every regressor.
Why it matters: This generalises the familiar OLS property that residuals are orthogonal to the regressors, even though GLMs are nonlinear.
Equation:
\[ X^T (y - \mu) = 0, \]
where \(X\) is the design matrix, \(y\) is the vector of responses, and \(\mu = \mathbb{E}[Y|X]\).
Proof idea (short): These are the score equations from maximum likelihood estimation. Differentiating the log-likelihood with respect to the coefficients and setting equal to zero yields this condition.
(Full derivation in Appendix A.1)
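This condition can be verified empirically on any fitted canonical-link GLM: \(X^T(y - \hat{\mu})\) is zero up to the solver's tolerance. Below is a minimal sketch using logistic regression, assuming numpy and statsmodels are available; the simulated design and coefficients are illustrative.
```python
# Verify the orthogonality X^T (y - mu) = 0 on a fitted canonical-link GLM
# (logistic regression here).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))
beta = np.array([-0.3, 1.0, -0.7])
mu_true = 1 / (1 + np.exp(-X @ beta))
y = rng.binomial(1, mu_true)

res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
mu_hat = res.fittedvalues                 # fitted means mu_i
print(X.T @ (y - mu_hat))                 # ~ [0, 0, 0] up to numerical error
```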
2. Fisher Information and the Information Equality
Statement: For exponential family distributions, the variance of the score equals the Fisher information.
Why it matters: This property allows GLMs to be estimated efficiently by iteratively reweighted least squares (IRLS). It also explains why they inherit OLS-like properties despite being more general.
Equation:
\[ \mathbb{E}\left[ \left( \frac{\partial \ell}{\partial \theta} \right)^2 \right] = - \mathbb{E}\left[ \frac{\partial^2 \ell}{\partial \theta^2} \right]. \]
Proof idea (short): Using the exponential family form, we can interchange differentiation and integration under regularity conditions.
(Full derivation in Appendix A.2)
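The equality can also be checked by simulation. The sketch below does so for a single Poisson observation with natural parameter \(\theta\): the empirical variance of the score matches minus the (constant) second derivative of the log-likelihood. A minimal sketch assuming numpy is available; the value of \(\theta\) is illustrative.
```python
# Monte Carlo check of the information equality for a scalar Poisson
# parameter theta: Var(score) should equal minus the expected Hessian.
import numpy as np

rng = np.random.default_rng(3)
theta = 0.4
y = rng.poisson(np.exp(theta), size=500_000)

score = y - np.exp(theta)          # d ell / d theta = y - b'(theta), phi = 1
hessian = -np.exp(theta)           # d^2 ell / d theta^2 = -b''(theta), constant in y

print(score.var())                 # ~ exp(0.4)
print(-hessian)                    # exp(0.4) exactly
```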
3. Canonical Links as the Natural Choice
Statement: Choosing the canonical link aligns the linear predictor with the natural parameter of the exponential family.
Why it matters:
- Score equations simplify,
- Variance-mean relationships are clean,
- Estimation is more stable,
- Model interpretation aligns with sufficient statistics.
Example (Bernoulli):
\[ \eta = \log \frac{\mu}{1 - \mu}. \]
Explanation: For the Bernoulli, the canonical link is the logit. Substituting this into the score equations yields simple residual equations.
(Appendix A.3)
4. Unified Framework
Statement: GLMs unify linear regression, logistic regression, Poisson regression, and log-linear models under one framework.
Why it matters:
- The same estimation algorithm (IRLS),
- The same asymptotic theory for inference (Wald, score, and likelihood ratio tests),
- The same diagnostic toolkit.
This explains the broad applicability of GLMs across fields from epidemiology to social sciences.
5. Geometric Viewpoint
Statement: Fitting a GLM corresponds to projecting observed sufficient statistics onto the convex set of realisable means defined by the model.
Why it matters: This geometric interpretation explains why residual orthogonality appears naturally: it is the geometry of projections in the exponential family space.
Optional detail: This can be formalised using convex duality and information geometry.
(Appendix A.4)
Appendix: First-Principle Proofs (AI authored)
A.1 Residual Orthogonality
This appendix gives a first-principles derivation of the residual-orthogonality condition for Generalised Linear Models (GLMs). The derivation starts from the exponential-family formulation, applies the chain rule, and shows how the canonical link yields the simple orthogonality condition
\[ X^T (y - \mu) = 0. \]
It also states the more general weighted orthogonality that appears with non-canonical links.
1. Exponential-family setup
Assume observations \(y_1, \dots, y_n\) are independent and each has a density (or mass) in the exponential family written in canonical form:
\[ f(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{\phi} + c(y_i,\phi) \right\}, \]
where:
- \(\theta_i\) is the canonical (natural) parameter for observation \(i\),
- \(b(\theta)\) is the cumulant function,
- \(\phi > 0\) is the dispersion parameter (for many common models \(\phi = 1\)),
- \(c(y_i,\phi)\) does not depend on \(\theta_i\).
Standard exponential-family identities are
\[ \mu_i \equiv \mathbb{E}[Y_i] = b'(\theta_i), \qquad \mathrm{Var}(Y_i) = \phi\, b''(\theta_i). \]
2. GLM structure and link function
In a GLM we relate the mean \(\mu_i\) to a linear predictor:
\[ \eta_i = x_i^T \beta, \]
where \(x_i\) is the p-vector of regressors for observation \(i\) and \(\beta\) is the coefficient vector. The link function \(g(\cdot)\) relates \(\mu_i\) and \(\eta_i\) by
\[ \eta_i = g(\mu_i). \]
The canonical link is the choice for which the linear predictor equals the canonical parameter:
\[ \eta_i = \theta_i. \]
3. Log-likelihood and score
The log-likelihood for the full sample is
\[ \ell(\beta) = \sum_{i=1}^n \log f(y_i; \theta_i(\beta), \phi) = \sum_{i=1}^n \left[ \frac{y_i \theta_i(\beta) - b(\theta_i(\beta))}{\phi} + c(y_i,\phi) \right]. \]
Differentiate \(\ell(\beta)\) with respect to \(\beta\). By the chain rule,
\[ \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^n \frac{1}{\phi} \left( y_i - b'(\theta_i) \right) \frac{\partial \theta_i}{\partial \beta}. \]
Using \(b'(\theta_i) = \mu_i\), this becomes
\[ \frac{\partial \ell}{\partial \beta} = \frac{1}{\phi} \sum_{i=1}^n (y_i - \mu_i) \frac{\partial \theta_i}{\partial \beta}. \]
Setting the score equal to zero for the maximum-likelihood estimator yields the generic score condition
\[ \sum_{i=1}^n (y_i - \mu_i) \frac{\partial \theta_i}{\partial \beta} = 0. \tag{1} \]
Equation (1) is the scalar-observation form of the score equations.
4. Expressing the derivative \(\partial\theta_i / \partial\beta\)
We now expand \(\partial\theta_i/\partial\beta\) according to the link choice.
(a) Canonical link. If the link is canonical then \(\theta_i = \eta_i = x_i^T \beta\). Hence
\[ \frac{\partial \theta_i}{\partial \beta} = x_i. \]
Substituting into (1) gives
\[ \sum_{i=1}^n (y_i - \mu_i) x_i = 0, \]
or in matrix notation, with \(X\) the n-by-p design matrix and \(y\), \(\mu\) the n-vectors,
\[ X^T (y - \mu) = 0. \]
This is the residual-orthogonality condition: the residual vector \(y - \mu\) is orthogonal to every column of \(X\).
(b) General (non-canonical) link. If the link is not canonical then \(\theta_i\) is a function of \(\eta_i\), say \(\theta_i = h(\eta_i)\), and \(\eta_i = x_i^T \beta\). By the chain rule
\[ \frac{\partial \theta_i}{\partial \beta} = \frac{d\theta_i}{d\eta_i} \frac{\partial \eta_i}{\partial \beta} = \frac{d\theta_i}{d\eta_i} x_i. \]
Then (1) becomes
\[ \sum_{i=1}^n (y_i - \mu_i) \frac{d\theta_i}{d\eta_i} x_i = 0, \]
or equivalently
\[ X^T W (y - \mu) = 0, \]
where \(W\) is the n-by-n diagonal matrix with diagonal entries \(w_i = d\theta_i/d\eta_i\). Using standard identities (and algebraic manipulation) one commonly rewrites the weights in the IRLS form as
\[ w_i = \frac{1}{\mathrm{Var}(Y_i)} \left( \frac{d\mu_i}{d\eta_i} \right)^2, \]
but the key point here is that non-canonical links yield a weighted orthogonality: the residuals are orthogonal to the regressors under the weighted inner product defined by W, not the plain inner product.
5. Remarks and interpretation
Why the canonical link is special. When the canonical link is used, \(\partial\theta_i/\partial\beta = x_i\) and the score simplifies to the unweighted orthogonality condition \(X^T(y - \mu) = 0\). This mirrors ordinary least squares.
Residuals are orthogonal to regressors. The condition \(X^T(y - \mu) = 0\) implies for each column \(j\) of \(X\) that
\[ \sum_{i=1}^n x_{ij} (y_i - \mu_i) = 0, \]
so (when the model includes an intercept) the sample covariance between each regressor and the residuals is zero.
Weighted orthogonality for non-canonical links. With non-canonical links the score yields \(X^T W (y - \mu) = 0\). In practice IRLS solves a sequence of weighted least squares problems using such weights; the canonical link makes those weights trivial in the score.
Dispersion parameter. The factor \(1/\phi\) in the score does not affect the root of the score equations (unless \(\phi\) itself is a parameter to be estimated), so it does not change the orthogonality aside from a scalar factor that cancels when setting the score to zero.
Connection to sufficiency. For canonical links and exponential-family models, sufficient statistics for the regression coefficients typically appear linearly (for example \(X^T y\)), which is another reason the score (which involves \(X^T y\) and \(X^T \mu\)) has a linear form.
6. Worked example: Bernoulli (logit) with canonical link
For Bernoulli outcomes \(Y_i \in \{0,1\}\) with probability \(\mu_i\), the canonical-form log-likelihood per observation is
\[ \ell_i = y_i \theta_i - b(\theta_i), \qquad b(\theta) = \log(1 + e^{\theta}), \qquad \mu_i = b'(\theta_i) = \frac{e^{\theta_i}}{1 + e^{\theta_i}}. \]
With the canonical link \(\theta_i = \eta_i = x_i^T \beta\), the score is
\[ \frac{\partial \ell}{\partial \beta} = \sum_i (y_i - \mu_i) x_i, \]
so at the maximum-likelihood estimator
\[ X^T (y - \mu) = 0. \]
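To connect the worked example to the orthogonality condition numerically, the sketch below fits this Bernoulli model by Newton/Fisher scoring from scratch and checks that \(X^T(y - \hat{\mu})\) vanishes at the optimum. A minimal sketch assuming numpy is available; the simulated data, starting values, and iteration count are illustrative.
```python
# From-scratch Newton/Fisher-scoring fit of the Bernoulli (logit) model,
# checking that X^T (y - mu) vanishes at convergence.
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.2, -1.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(X.shape[1])
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))          # mu_i = b'(theta_i)
    W = mu * (1 - mu)                         # b''(theta_i)
    score = X.T @ (y - mu)                    # score vector
    info = X.T @ (X * W[:, None])             # Fisher information X^T diag(W) X
    beta = beta + np.linalg.solve(info, score)

mu = 1 / (1 + np.exp(-X @ beta))
print(X.T @ (y - mu))                         # ~ zero vector at the MLE
```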
Conclusion
The orthogonality condition \(X^T (y - \mu) = 0\) follows directly from the score equations for GLMs when the canonical link is chosen. For non-canonical links the same first-principles approach yields a weighted orthogonality \(X^T W (y - \mu) = 0\). This derivation explains why residuals in canonical-link GLMs behave like OLS residuals: the likelihood geometry forces the same linear orthogonality with respect to the design matrix.
A.2 Information Equality
This appendix gives a first-principles derivation of the information equality for exponential-family models and its matrix generalisation for GLMs. The main identity is
\[ \mathrm{Var}\left(\frac{\partial \ell}{\partial \theta}\right) = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial \theta^2}\right], \]
and it holds under standard regularity conditions. We show the scalar-parameter case first, then specialise to the exponential family, and finally give the multivariate version for GLM regression parameters.
1. Scalar-parameter proof (basic identity)
Let \(f(y;\theta)\) be the likelihood for a single observation and write the log-likelihood as \(\ell(\theta;y)=\log f(y;\theta)\). Define the score
\[ u(\theta;y) = \frac{\partial}{\partial\theta} \ell(\theta;y). \]
Under regularity conditions we may differentiate under the integral sign. Normalisation of the density gives
\[ 1 = \int f(y;\theta)\,dy. \]
Differentiate both sides with respect to \(\theta\) to obtain
\[ 0 = \int \frac{\partial}{\partial\theta} f(y;\theta)\,dy = \int f(y;\theta) \frac{\partial}{\partial\theta} \ell(\theta;y)\,dy = \mathbb{E}[u(\theta;Y)]. \]
Thus the score has zero expectation:
\[ \mathbb{E}[u(\theta;Y)] = 0. \]
Differentiate this identity with respect to \(\theta\) again, remembering that the density \(f(y;\theta)\) itself depends on \(\theta\) (and interchange differentiation and integration). By the product rule, and using \(\partial f/\partial\theta = f\,u\),
\[ 0 = \frac{\partial}{\partial\theta} \int u(\theta;y)\, f(y;\theta)\,dy = \int \frac{\partial u}{\partial\theta}\, f\,dy + \int u\, \frac{\partial f}{\partial\theta}\,dy = \mathbb{E}\left[\frac{\partial}{\partial\theta} u(\theta;Y)\right] + \mathbb{E}\left[u(\theta;Y)^2\right]. \]
Since
\[ \frac{\partial}{\partial\theta} u(\theta;Y) = \frac{\partial^2}{\partial\theta^2} \ell(\theta;Y), \]
the same result is obtained more directly by differentiating \(\int f = 1\) twice:
\[ 0 = \int \frac{\partial^2}{\partial\theta^2} f(y;\theta)\,dy = \int f \left(\frac{\partial^2 \ell}{\partial\theta^2} + \left(\frac{\partial \ell}{\partial\theta}\right)^2\right) dy. \]
Hence
\[ 0 = \mathbb{E}\left[\frac{\partial^2 \ell}{\partial\theta^2}\right] + \mathbb{E}\left[\left(\frac{\partial \ell}{\partial\theta}\right)^2\right]. \]
Rearranging gives the scalar information equality:
\[ \mathbb{E}\left[\left(\frac{\partial \ell}{\partial\theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial\theta^2}\right]. \]
The left-hand side is the variance of the score (since the score has mean zero), so this says
\[ \mathrm{Var}\left(\frac{\partial \ell}{\partial\theta}\right) = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial\theta^2}\right]. \]
2. Exponential-family specialisation
Consider the one-parameter exponential family in canonical form
\[ f(y;\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right\}, \]
with canonical parameter \(\theta\) and dispersion \(\phi>0\). The log-likelihood for one observation is
\[ \ell(\theta;y) = \frac{y\theta - b(\theta)}{\phi} + c(y,\phi). \]
Compute derivatives with respect to \(\theta\):
\[ \frac{\partial \ell}{\partial\theta} = \frac{y - b'(\theta)}{\phi} = \frac{y - \mu}{\phi}, \]
and
\[ \frac{\partial^2 \ell}{\partial\theta^2} = -\frac{b''(\theta)}{\phi}. \]
Now take expectations. Using the standard identities
\[ \mu = \mathbb{E}[Y] = b'(\theta), \qquad \mathrm{Var}(Y) = \phi\, b''(\theta), \]
we have
\[ \mathbb{E}\left[\left(\frac{\partial \ell}{\partial\theta}\right)^2\right] = \mathbb{E}\left[\frac{(Y-\mu)^2}{\phi^2}\right] = \frac{\mathrm{Var}(Y)}{\phi^2} = \frac{\phi\, b''(\theta)}{\phi^2} = \frac{b''(\theta)}{\phi}. \]
On the other hand
\[ -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial\theta^2}\right] = -\left(-\frac{b''(\theta)}{\phi}\right) = \frac{b''(\theta)}{\phi}, \]
so the identity holds. This direct calculation also clarifies why the exponential-family form makes the information quantity simple: the second derivative depends only on \(b''(\theta)\), which is directly related to the variance function.
3. Multivariate parameter case and GLMs
For a vector parameter \(\psi\in\mathbb{R}^p\) the scalar identity becomes the matrix identity for the Fisher information matrix:
\[ \mathcal{I}(\psi) = \mathbb{E}\left[ -\frac{\partial^2 \ell}{\partial\psi\,\partial\psi^T} \right] = \mathbb{E}\left[\left( \frac{\partial \ell}{\partial\psi} \right)\left(\frac{\partial \ell}{\partial\psi}\right)^T\right]. \]
In a GLM with independent observations and canonical link, let \(\beta\) be the regression coefficient vector, \(X\) the \(n\times p\) design matrix, and write the per-observation log-likelihood as in Section A.1. For canonical links the score vector is
\[ s(\beta) = \frac{\partial \ell}{\partial\beta} = \frac{1}{\phi} \sum_{i=1}^n (y_i - \mu_i) x_i = \frac{1}{\phi} X^T (y - \mu). \]
Compute the covariance (or variance) of the score:
\[ \mathrm{Var}(s(\beta)) = \frac{1}{\phi^2} \, \mathrm{Var}\left(X^T (y - \mu)\right) = \frac{1}{\phi^2} X^T \mathrm{diag}(\mathrm{Var}(Y_i)) X. \]
Using \(\mathrm{Var}(Y_i)=\phi\, b''(\theta_i)\) we obtain
\[ \mathrm{Var}(s(\beta)) = \frac{1}{\phi^2} X^T \mathrm{diag}(\phi\, b''(\theta_i)) X = \frac{1}{\phi} X^T \mathrm{diag}(b''(\theta_i)) X. \]
Now compute minus the expected Hessian directly:
\[ -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial\beta\,\partial\beta^T}\right] = -\left(-\frac{1}{\phi} \sum_{i=1}^n b''(\theta_i) x_i x_i^T \right) = \frac{1}{\phi} X^T \mathrm{diag}(b''(\theta_i)) X, \]
which matches the variance of the score. Thus the matrix form of the information equality holds:
\[ \mathcal{I}(\beta) = \mathrm{Var}(s(\beta)) = -\mathbb{E}\left[\frac{\partial^2 \ell}{\partial\beta\,\partial\beta^T}\right] = \frac{1}{\phi} X^T \mathrm{diag}(b''(\theta_i)) X. \]
Using \(\mathrm{Var}(Y_i)=\phi\, b''(\theta_i)\) one may also write
\[ \mathcal{I}(\beta) = \frac{1}{\phi^2} X^T \mathrm{diag}(\mathrm{Var}(Y_i)) X. \]
4. Relation to IRLS and Newton updates
The information equality is the reason the expected Hessian (the Fisher information) is used as a natural curvature matrix in Newton-Raphson or Fisher scoring updates. For example, a Fisher-scoring update for \(\beta\) is
\[ \beta_{\text{new}} = \beta_{\text{old}} + \mathcal{I}(\beta_{\text{old}})^{-1} s(\beta_{\text{old}}). \]
For canonical-link GLMs this Newton/Fisher-scoring step can be rearranged to the familiar iteratively reweighted least-squares (IRLS) update: at each iteration one solves a weighted least-squares problem with weights related to \(b''(\theta_i)\) (or equivalently \(\mathrm{Var}(Y_i)\)) and a working response that incorporates the current linear predictor. The equality between the variance of the score and minus the expected Hessian is what makes the Fisher-scoring step equivalent, in expectation, to the Newton step and gives IRLS its stability and OLS-like form.
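The rearrangement into weighted least squares can be written out explicitly; the sketch below does so for a canonical-link Poisson GLM, assuming numpy is available. The simulated design, true coefficients, and fixed iteration count are illustrative.
```python
# Fisher scoring rewritten as IRLS for a canonical-link Poisson GLM:
# each step is a weighted least-squares solve with a working response.
import numpy as np

rng = np.random.default_rng(5)
n = 1500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.3, 0.4, -0.6])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(X.shape[1])
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)                      # inverse log link
    W = mu                                # IRLS weights: b''(theta_i) = mu_i
    z = eta + (y - mu) / mu               # working response
    WX = X * W[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (W * z))   # weighted LS solve

print(beta)                               # close to beta_true
```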
5. Worked example: Bernoulli (logit, canonical link)
For Bernoulli observations with canonical parameter \(\theta_i = x_i^T\beta\) we have \(b''(\theta_i) = \mu_i(1-\mu_i)\) and \(\phi=1\). The Fisher information matrix is
\[ \mathcal{I}(\beta) = X^T \mathrm{diag}(\mu_i(1-\mu_i)) X. \]
The score is \(s(\beta) = X^T (y - \mu)\) and its variance is
\[ \mathrm{Var}(s(\beta)) = X^T \mathrm{diag}(\mu_i(1-\mu_i)) X, \]
matching the expected negative Hessian as required.
Conclusion
The information equality is a direct consequence of the normalisation of the likelihood and differentiability under the integral sign. For exponential-family models the identity becomes particularly explicit because the second derivative of the cumulant function, \(b''(\theta)\), directly encodes the variance function. In GLMs the equality explains why the expected Hessian (Fisher information) is the correct curvature matrix to use in Fisher scoring/IRLS and why those algorithms have such close ties to weighted least squares.
A.3 Canonical Link Simplification
This appendix shows why the canonical link in a Generalised Linear Model (GLM) is special: it aligns the linear predictor with the natural (canonical) parameter of the exponential family. This yields much simpler score equations and variance functions, and explains the OLS-like residual orthogonality property described in Appendix A.1.
1. General GLM score equations
Recall the log-likelihood for independent observations \(y_1,...,y_n\) from an exponential family distribution:
\[ \ell(\beta) = \sum_{i=1}^n \left[ \frac{y_i \theta_i - b(\theta_i)}{\phi} + c(y_i,\phi) \right]. \]
Differentiating with respect to \(\beta\) gives the score vector:
\[ \frac{\partial \ell}{\partial \beta} = \frac{1}{\phi} \sum_{i=1}^n (y_i - \mu_i) \frac{\partial \theta_i}{\partial \beta}. \]
By the chain rule,
\[ \frac{\partial \theta_i}{\partial \beta} = \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} \frac{\partial \eta_i}{\partial \beta}, \]
where \(\eta_i = x_i^T \beta\) is the linear predictor.
So in general,
\[ \frac{\partial \ell}{\partial \beta} = \frac{1}{\phi} \sum_{i=1}^n (y_i - \mu_i) \left(\frac{d\theta_i}{d\mu_i}\right)\left(\frac{d\mu_i}{d\eta_i}\right) x_i. \tag{1} \]
This is the general GLM score equation.
2. Canonical link condition
The canonical link is defined by \(\eta_i = \theta_i\), i.e. the linear predictor is exactly the natural parameter. In this case,
\[ \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} = 1. \]
To see this, recall the exponential-family identity
\[ \mu_i = b'(\theta_i). \]
So
\[ \frac{d\mu_i}{d\theta_i} = b''(\theta_i), \qquad \frac{d\theta_i}{d\mu_i} = \frac{1}{b''(\theta_i)}. \]
If the link is canonical, then \(\eta_i = \theta_i\), so \(d\mu_i/d\eta_i = b''(\theta_i)\). Multiplying gives
\[ \frac{d\theta_i}{d\mu_i}\frac{d\mu_i}{d\eta_i} = \frac{1}{b''(\theta_i)} b''(\theta_i) = 1. \]
3. Simplified score equation
Substituting back into (1), for canonical links the score reduces to
\[ \frac{\partial \ell}{\partial \beta} = \frac{1}{\phi} \sum_{i=1}^n (y_i - \mu_i) x_i = \frac{1}{\phi} X^T (y - \mu). \]
Setting this equal to zero gives
\[ X^T (y - \mu) = 0, \]
which is the exact analogue of the normal equations in OLS.
4. Consequences
- Simplified estimation equations. The orthogonality condition \(X^T(y - \mu)=0\) is unweighted. For non-canonical links the equations involve additional weights depending on derivatives of the link function.
- Variance function alignment. With canonical links, the variance of \(Y\) is directly tied to the cumulant function \(b(\theta)\), making the information matrix simpler and the IRLS weights natural.
- Sufficiency. In canonical-link GLMs, the sufficient statistics for the coefficients are linear in \(y\) (e.g. \(X^T y\)), which aligns estimation neatly with the design matrix.
5. Worked example: Poisson regression
For Poisson(\(\mu\)), the exponential-family form has \(b(\theta) = e^{\theta}\), so \(\mu = e^{\theta}\). The canonical link sets \(\eta = \theta\), so
\[ \mu = e^{\eta} = e^{x^T \beta}. \]
The score is
\[ \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^n (y_i - \mu_i) x_i, \]
and the information matrix is
\[ \mathcal{I}(\beta) = X^T \mathrm{diag}(\mu_i) X. \]
Again, the estimating equations are exactly in the form \(X^T(y - \mu)=0\).
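As a numerical companion to this example, the sketch below fits a Poisson GLM, checks the estimating equations \(X^T(y - \hat{\mu}) = 0\), and forms the information matrix \(X^T \mathrm{diag}(\hat{\mu}_i) X\) directly. A minimal sketch assuming numpy and statsmodels are available; the simulated data are illustrative.
```python
# Check the Poisson estimating equations X^T (y - mu) = 0 and form the
# information matrix X^T diag(mu) X for a fitted model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 1000
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.poisson(np.exp(X @ np.array([0.2, 0.5, -0.4])))

res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu = res.fittedvalues                        # fitted means mu_i

print(X.T @ (y - mu))                        # ~ zero vector (score equations)
info = X.T @ (X * mu[:, None])               # Fisher information X^T diag(mu) X
print(info)
```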
Conclusion
The canonical link eliminates the derivative terms in the score function by aligning the linear predictor with the natural parameter. This yields the simplest possible estimating equations, makes residual orthogonality exact rather than weighted, and gives GLMs their elegant OLS-like geometry.
A.4 Projection Interpretation
This appendix gives a geometric interpretation of the score equations for GLMs.
In the case of canonical links, the equations can be seen as an orthogonal projection of the observed data vector onto the model subspace spanned by the regressors. This parallels the OLS case, where residuals are orthogonal to the design matrix.
1. OLS as a projection
For ordinary least squares, the model is
\[ y = X \beta + \varepsilon, \]
and minimising the squared error \(||y - X\beta||^2\) yields the normal equations
\[ X^T (y - X\hat{\beta}) = 0. \]
This condition means that the residual vector \(r = y - X\hat{\beta}\) is orthogonal to every column of \(X\).
Geometrically, \(\hat{y} = X\hat{\beta}\) is the orthogonal projection of \(y\) onto the column space of \(X\).
2. GLMs and weighted projections
For a GLM, the mean vector is \(\mu(\beta)\) with \(\mu_i = b'(\theta_i)\), and the score equations are
\[ X^T W (y - \mu) = 0, \]
for some diagonal weight matrix \(W\) that depends on the link function and variance structure (see Appendix A.1).
With a canonical link, \(W = I\), and the equations reduce to
\[ X^T (y - \mu) = 0, \]
exactly mirroring OLS orthogonality.
With a non-canonical link, the weights \(W\) modify the inner product, so the residuals are orthogonal to the regressors under the weighted inner product
\[ \langle u, v \rangle_W = u^T W v. \]
Thus GLM estimation corresponds to a weighted projection of \(y\) onto the modelled mean space.
3. Iterative reweighted least squares (IRLS)
This projection interpretation underlies the IRLS algorithm for solving GLM score equations:
At each step, construct a “working dependent variable” \(z\) (a linearised version of \(y\) around the current mean).
Update \(\beta\) by solving a weighted least squares problem:
\[ \hat{\beta}_{\text{new}} = \arg\min_\beta (z - X\beta)^T W (z - X\beta). \]
Thus, IRLS can be viewed as repeatedly re-projecting the working response onto the column space of \(X\) with changing weights until convergence.
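One IRLS step can be written explicitly as a weighted projection of the working response onto the columns of \(X\); the sketch below does this for logistic regression. A minimal sketch assuming numpy is available; the helper name irls_step, the simulated data, and the iteration count are illustrative.
```python
# One IRLS step for logistic regression, written explicitly as a weighted
# least-squares projection of the working response z onto the columns of X.
import numpy as np

def irls_step(X, y, beta):
    """Return updated coefficients from one weighted projection step."""
    eta = X @ beta
    mu = 1 / (1 + np.exp(-eta))
    W = mu * (1 - mu)                       # weights from the variance function
    z = eta + (y - mu) / W                  # linearised working response
    WX = X * W[:, None]
    return np.linalg.solve(X.T @ WX, X.T @ (W * z))

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(400), rng.normal(size=(400, 2))])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.1, 0.8, -0.5]))))

beta = np.zeros(3)
for _ in range(20):
    beta = irls_step(X, y, beta)
print(beta)                                 # approximate MLE after convergence
```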
4. Geometric consequences
- Canonical links give true orthogonal projections. With \(W = I\), the fitted mean \(\mu\) is the plain orthogonal projection of \(y\) onto the model subspace (in expectation).
- Non-canonical links give weighted projections. Residuals are not orthogonal in the standard Euclidean sense, but are orthogonal under the inner product induced by \(W\).
- Connection to sufficiency and information. The projection view reflects the fact that \(X^T y\) (or its weighted analogue) is sufficient for \(\beta\) in exponential families with canonical links. The Fisher information matrix, \(X^T W X\), describes the geometry of the projection.
5. Worked example: Logistic regression
For Bernoulli(\(\mu\)) with canonical logit link:
Score equations:
\[ X^T (y - \mu) = 0. \]
The fitted means \(\hat{\mu}\) are the projection of \(y\) onto the model space spanned by \(X\), under the Bernoulli likelihood geometry.
In IRLS, at each step one solves a weighted least squares projection with weights \(W = \text{diag}(\mu_i(1-\mu_i))\).
Conclusion
GLM estimation can be viewed geometrically as a projection problem:
- Canonical links: exact orthogonal projection of the data vector (in expectation).
- Non-canonical links: weighted projection under an inner product defined by the link and variance function.
This projection interpretation unifies OLS and GLMs under a common geometric framework and highlights why the canonical link yields particularly elegant estimation properties.