Degrees of Freedom in Statistical Modelling

Author

D.McCabe

“Few ideas in statistics are as elusive and yet as deeply embedded in statistical theory and practice as the concept of degrees of freedom.”

Degrees of freedom are inherently tied to a model (or, more generally, to a constraint framework) applied to data. They come into play once structure is imposed; that is, once the data is contextualised within a model, which limits the space of possible configurations or interpretations of the data and thereby reduces the degrees of freedom.

Residual degrees of freedom count how many independent “directions” remain after accounting for the model (see the intuitive spline example below). This count underlies the validity of inferential statistics (e.g., variance estimation, hypothesis tests) because these “directions” correspond to independent pieces of information left unexplained by the model. In all rigorous statistical contexts, degrees of freedom (DoF) ultimately trace back to this idea of dimensionality in a vector space, usually via projections in a Hilbert space or its generalisations.

A large residual DoF corresponds to more “freedom” to estimate variability: lower risk of variance but more risk of bias. A small residual DoF signals that there is little left with which to check the model’s accuracy, and a high risk of overfitting.

Statistical Modelling: Models and Estimators

| Term | Programming Analog | Definition |
|------|--------------------|------------|
| Model: parameterised probability distribution (hypothesis) | Quosure / function generator / lazy-output subroutine with enclosure | A symbolic representation of the data-generating process; a family of distributions used i) to describe characteristics of data in exploratory data analysis, or ii) to capture or represent the data-generating process in statistical analysis, i.e. the theoretical mechanism that governs real-world data. A fitted model is produced once parameters are estimated. |
| Statistic | Evaluated function result | A concrete number: the output of an estimator once data is supplied. A function of sampled data only, with no unknown parameters involved, e.g. sample mean, sample variance, t-statistic, confidence interval, … |
| Estimator: a model input | Quosure / pure function definition | A general rule or symbolic expression for estimating a model parameter, prior to data input. Always tied to model parameters: a random variable which is a function of the data (usually of the sample) used to estimate parameters in the model. An estimator becomes an estimate when evaluated on a particular sample. |
| Fitted model | Function after parameter substitution (a function template with its parameters filled in) | The result of applying an estimator to sample data, producing a curve fit (or decision boundary, …). |
| Model quantity | Evaluated local subroutine / intermediate output | A numerical value derived by plugging observed data into part of a statistical model: typically an intermediate or derived component, not the final fitted model or outcome. Often serves as an intermediate value used to compute or support other outputs (e.g., residual sum of squares, variance explained, information criteria). |
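As a minimal R sketch of these analogies (the toy sample values are illustrative, not from the text): an estimator is a function defined before data arrives, a statistic is its evaluated output, a fitted model is the result of parameter estimation, and a model quantity is an intermediate derived value.

estimator <- function(x) mean(x)   # estimator: a rule defined before any data is seen
x <- c(4.2, 5.1, 4.8, 5.6)         # sample data
statistic <- estimator(x)          # statistic: a concrete number once data is supplied
fit <- lm(x ~ 1)                   # fitted model: the parameter (intercept) estimated from the sample
coef(fit)                          # the fitted intercept equals the sample mean
rss <- sum(residuals(fit)^2)       # model quantity: an intermediate derived value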

Estimator Bias and Variance (Parameter-level)

An estimator \(\hat{\theta}\) is a random variable: a function with sample data as input, and it can be biased or unbiased. The bias is the difference between the expected value \(E[\hat{\theta}]\) of the estimator and the true parameter value \(\theta\): \[\text{Bias}(\hat{\theta})=E[\hat{\theta}]-\theta\] As a random variable, every estimator has variance, regardless of bias. The variance of an estimator is a measure of how much the estimator would vary across different input samples: \[\text{Var}(\hat{\theta})=E[(\hat{\theta}-E[\hat{\theta}])^2]\]

Note

An unbiased estimator of \(\mu\), i.e. one with \(E[\bar{x}]=\mu\), isn’t generally the best estimator, as in certain contexts unbiasedness is not prioritised. In Bayesian statistics, where a fixed true population parameter is not defined, bias is likewise undefined.
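A short simulation, assuming a N(0, 1) population and n = 10 (values chosen purely for illustration), makes the bias and variance definitions above concrete by comparing the 1/n and 1/(n - 1) variance estimators:

set.seed(42)
n <- 10; reps <- 10000
est_n   <- replicate(reps, { x <- rnorm(n); sum((x - mean(x))^2) / n })  # 1/n estimator
est_nm1 <- replicate(reps, { x <- rnorm(n); var(x) })                    # 1/(n - 1) estimator
mean(est_n)   - 1              # negative: the 1/n estimator is biased low
mean(est_nm1) - 1              # approximately zero: unbiased
c(var(est_n), var(est_nm1))    # both estimators still have sampling variance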

Bias–Variance Decomposition (Regression Setting)

Let \(\hat{f}(x)\) be your model’s prediction for a given input \(x\), and let the data be generated as \(y=f(x)+\epsilon\) with noise \(\epsilon\sim N(0,\sigma^2)\), where \(f(x)\) is the true data-generating function. Then the expected prediction error (mean squared error) for a new observation at point \(x\) is the square of the bias ‘target off centre’ plus the model variance ‘spread of shots’ plus the irreducible noise: \[ \mathbb{E}\left[ \left( y - \hat{f}(x) \right)^2 \right] = \underbrace{ \left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2 }_{\text{Bias}^2} + \underbrace{ \mathbb{E} \left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right] }_{\text{Variance}} + \underbrace{ \sigma^2 }_{\text{Irreducible\ Error}} \]

  • Bias: How far the model’s average prediction is from the true function.
  • Variance: How much the model’s prediction fluctuates across different training sets.
  • Irreducible Error: The inherent noise in the data (e.g., from \(\epsilon\sim N(0,\sigma^2)\)).
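The decomposition can be checked numerically. The sketch below assumes a true function f(x) = sin(2πx), noise σ = 0.3, training sets of n = 30 points, and compares an underfit (degree 1) with an overfit (degree 9) polynomial at the point x0 = 0.25; none of these choices come from the text.

set.seed(1)
f <- function(x) sin(2 * pi * x)
sigma <- 0.3; n <- 30; reps <- 2000; x0 <- 0.25

# Simulate many training sets and record the prediction at x0 for a given degree
sim_preds <- function(degree) {
  replicate(reps, {
    x <- runif(n)
    y <- f(x) + rnorm(n, sd = sigma)
    fit <- lm(y ~ poly(x, degree))
    predict(fit, newdata = data.frame(x = x0))
  })
}

for (d in c(1, 9)) {
  p <- sim_preds(d)
  cat(sprintf("degree %d: bias^2 = %.4f, variance = %.4f\n",
              d, (mean(p) - f(x0))^2, var(p)))
}

With these settings the low-degree fit shows larger bias and smaller variance, while the high-degree fit reverses the pattern, as the decomposition suggests.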

Degrees of Freedom relationship with Model Bias & Variance

Note

Analogy: Modes are to time series what moments are to distributions.

Both describe structure, but from different perspectives: moments summarise the shape and spread of a distribution, while modes capture oscillatory patterns in a time series. Though not a perfect mapping, this analogy can be conceptually helpful for thinking about how degrees of freedom (DoF) are consumed when fitting more complex models.

Specifically, while DFT modes are orthogonal and globally interpretable, higher-order moments tend to be abstract and sensitive to distributional tails. Still, both require sufficient data for stable estimation and both reflect a common trade-off: adding complexity increases variance risk unless supported by enough data.

Warning: This is a conceptual analogy used for personal intuition-building, not a formal mathematical equivalence.

In discrete Fourier analysis (DFT), the length of a time series limits the number and resolution of frequency modes we can estimate. Longer samples allow finer resolution and access to higher-frequency components. This mirrors moment estimation in statistics: higher-order moments require more data to estimate stably and are more sensitive to noise.

In both cases, estimating too many components with too little data increases model variance and overfitting risk. Even if we care about only a few key components, having residual degrees of freedom (i.e., more data than necessary) helps ensure estimation stability and generalisability.

Implications:

  • Model complexity consumes degrees of freedom: As model flexibility increases—e.g., by adding parameters, basis functions, or higher-order terms—more DoF are used. Data volume thus limits how complex a model can be before overfitting becomes likely.

  • Residual degrees of freedom ensure stability: A model that fits well while leaving some DoF unused is more likely to generalise. Leaving “room” in the data helps guard against overfitting and variance inflation.
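A quick sketch in R (simulated data, arbitrary seed) illustrates both implications at once: each extra polynomial degree consumes a degree of freedom, leaving fewer residual degrees of freedom to stabilise the fit.

set.seed(1)
n <- 30
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
for (d in c(1, 3, 10, 20)) {
  fit <- lm(y ~ poly(x, d))
  cat(sprintf("degree %2d: parameters = %2d, residual DoF = %2d\n",
              d, d + 1, df.residual(fit)))
}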

Demonstrative Example 0: Intuitive Spline Regression

This is the most intuitive way to understand degrees of freedom in statistical modelling. Imagine a wire being shaped and fixed to a board in a downstairs cupboard. Each new fixing limits its flexibility.

No fixings: Initially, the wire is completely free to take any shape — total flexibility. This corresponds to all \(n\) degrees of freedom available.

Choose the mean: Suppose we decide on the average height of the wire. This is like estimating the mean of a dataset and it imposes one constraint. Lose 1 degree of freedom.

Fit a line: Now we fix both ends of the wire with two pins, determining both the i) slope and ii) intercept. The wire can only form a straight line. Lose 2 degrees of freedom in total (one for the intercept and one for the slope).

Fit a cubic spline: We add multiple fixings (like collars or pegs) along the board. These allow for smooth bends, but each one adds a constraint to the shape the wire can take. The more fixings we add, the more we constrain the flexibility of the curve. Each constraint reduces the degrees of freedom, though some splines (like natural cubic splines) manage this more efficiently.
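The wire analogy can be reproduced with lm() and a natural cubic spline basis from the splines package (simulated data; the df values are arbitrary choices): each additional basis column is one more “fixing”.

library(splines)
set.seed(2)
n <- 50
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
fit_mean   <- lm(y ~ 1)              # choose the mean: one constraint
fit_line   <- lm(y ~ x)              # fit a line: two constraints
fit_spline <- lm(y ~ ns(x, df = 6))  # natural cubic spline basis: 6 + 1 constraints
c(df.residual(fit_mean), df.residual(fit_line), df.residual(fit_spline))
# 49, 48, 43: each extra fixing leaves fewer residual degrees of freedom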

Demonstrative Example 1: Sample Mean & Sample Variance

Sample Mean

The typical sample mean and sample variance formulas assume the sample data \(x_1, x_2,\dots,x_n\) are independent and identically distributed (i.i.d.). The i.i.d. assumption alone is the model that imposes the constraint framework introducing degrees of freedom to any sample statistics we calculate. Given a set of sample data the arithmetic mean is given by: \[ \begin{align} \mu\approx\bar{x}&=\frac{1}{n}\sum_{i=1}^n x_i \\ &=\frac{1}{dof}\sum_{i=1}^n x_i \end{align} \]

This involves no modeling of relationships or assumptions beyond basic i.i.d. structure. This operation costs one degree of freedom, therefore, when the next statistic is calculated we should remember to reduce the \(dof\) by one.
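The “cost” is visible directly: once the mean is computed, the deviations satisfy one linear constraint, so only n - 1 of them are free (the toy numbers below are illustrative).

x <- c(4.17, 5.58, 5.18, 6.11, 4.50)
r <- x - mean(x)
sum(r)   # ~ 0 up to floating point: the deviations must sum to zero, so only n - 1 are free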

Note

This really is a complete model and it is not always the correct model. Alternatives include:

  • We might employ a much more complex model of the data:

    • A frequentist might assume a probability distribution for the data, or a Bayesian might use a posterior distribution, and estimate \(\mu\) as a parameter of that distribution using maximum likelihood estimation (MLE), e.g. \(x_i\sim N(\mu,\sigma^2) \qquad\therefore\:\hat{\mu}_{MLE}=\bar{x}\)

    • Maybe instead we believe the mean depends on other explanatory variables (predictors) and model it as a function: \(\mu_i=\mathbb{E}[Y_i|X_i]=X_i\beta\). This is linear regression \(Y_i=X_i\beta+\epsilon_i\) with \(\epsilon_i\sim N(0,\sigma^2)\), and \(\mu_i\) is the conditional mean of \(Y\) given \(X\).

  • Furthermore the i.i.d. assumption might be incorrect for some datasets:

    • for time series data, ARIMA or state space models can be used to forecast/extrapolate. Here the mean is not constant; it evolves according to the dynamics of the system.

More formally, the data is \(y\in\mathbb{R}^n\) and the model takes the form \(y_i=\mu+\epsilon_i\). This is a linear model with just one parameter \(\mu\), and the design matrix \(X\) is just a column of 1’s:

\(y=\mu\begin{bmatrix}1\\\vdots\\1\end{bmatrix}+\begin{bmatrix}\epsilon_1\\\vdots\\\epsilon_n\end{bmatrix}\) where \(X\in\mathbb{R}^{n\times 1}\)

Therefore:

  • the column space has dimension one \(\text{dim}(\mathcal{C}(X))=1\) i.e. one consumed degree of freedom
  • the residual space has dimension \(n-1\), i.e. the residuals live in the orthogonal complement of the 1-D subspace used to estimate the mean
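A minimal numerical check of this geometry, using simulated data: the projection onto the column of ones reproduces the sample mean, its trace is the one consumed degree of freedom, and lm() reports the remaining n - 1.

set.seed(3)
n <- 10
y <- rnorm(n, mean = 5)
X <- matrix(1, nrow = n, ncol = 1)
P <- X %*% solve(t(X) %*% X) %*% t(X)       # projection onto the 1-D column space
all.equal(drop(P %*% y), rep(mean(y), n))   # fitted values are just the sample mean
sum(diag(P))                                # trace = 1: one consumed degree of freedom
df.residual(lm(y ~ 1))                      # n - 1 = 9 residual degrees of freedom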

Estimating Variance & Covariance

The population variance \(\sigma^2\) is estimated using the sample variance \(\hat{\sigma}^2\) under the i.i.d. model. This statistic uses the sample mean estimate \(\bar{x}\), so one residual degree of freedom has already been spent before we compute it:

\[ s_{xx}=\frac{1}{dof}\sum(x_i-\bar{x})^2\\ s_{xx}=\frac{1}{n-1}\sum(x_i-\bar{x})^2 \] Again this statistic comes at the cost of another degree of freedom.

Note

In the case where we actually know the distribution mean \(\mu\), there is no prior loss of a degree of freedom and the correct formula becomes: \[s_{xx}=\frac{1}{n}\sum(x_i-\mu)^2\]

Similarly we might estimate the population covariance \(\sigma_{xy}\) using the sample covariance \(s_{xy}\). This statistic uses estimators for the means of each feature, \(\bar{x}\) and \(\bar{y}\); each set of deviations loses one degree of freedom, and the unbiased estimator divides by \(n-1\): \(s_{xy}=\frac{1}{n-1}\sum(x_i-\bar{x})(y_i-\bar{y})\).
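R’s var() and cov() already apply the n - 1 divisor, which can be verified directly (simulated data):

set.seed(4)
x <- rnorm(12); y <- rnorm(12); n <- length(x)
all.equal(var(x),    sum((x - mean(x))^2) / (n - 1))
all.equal(cov(x, y), sum((x - mean(x)) * (y - mean(y))) / (n - 1))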

Demonstrative Example 2: Linear Regression

The Bivariate Problem

Undergrad numerical methods class: review of linear regression

The method of least squares finds the values \(\hat{a}\) and \(\hat{b}\) of \(a\) and \(b\) that minimise the sum of the squared errors: \[S(a,b)=\sum\epsilon_i^2=\sum(y_i-ax_i-b)^2\]

with solution: \[\boxed{\hat{a}=\frac{s_{xy}}{s_{xx}}}\quad\text{and}\quad\boxed{\hat{b}=\bar{y}-\hat{a}\bar{x}}\]

| symbol | statistic | formula |
|--------|-----------|---------|
| \(\bar{x}\) | the sample mean of \(x\) | \(\bar{x}=\frac{1}{n}\sum x_i\) |
| \(\bar{y}\) | the sample mean of \(y\) | \(\bar{y}=\frac{1}{n}\sum y_i\) |
| \(s_{xx}\) | the sample variance of \(x\) | \(s_{xx}=\frac{1}{n-1}\sum(x_i-\bar{x})^2\) |
| \(s_{xy}\) | the sample covariance of \(x\) and \(y\) | \(s_{xy}=\frac{1}{n-1}\sum(x_i-\bar{x})(y_i-\bar{y})\) |

\[S(a,b)=\sum\epsilon_i^2=\sum(y_i-ax_i-b)^2\] find minima with partial derivatives (and remembering that \(x_i\) and \(y_i\) are the data, hence constant)

\[ \frac{\partial S(a,b)}{\partial b}=\sum -2(y_i-ax_i-b)=0 \\ \:\\ \begin{align*} \therefore&\sum y_i=a\sum x_i+b\sum 1 \\ &\sum y_i=a\sum x_i+bn \end{align*} \]

Solving for \(b\) we get the intercept:

\[ \sum y_i=a\sum x_i+bn\\ \therefore\quad \boxed{b=\frac{\sum y_i-a\sum x_i}{n}}\\ \]


\[ \frac{\partial S(a,b)}{\partial a}=\sum -2x_i(y_i-ax_i-b)=0\\ \:\\ \begin{align*} \therefore &\sum -2x_iy_i+2ax_i^2+2bx_i=0 \\ &\sum ax_i^2+bx_i=\sum x_iy_i \\ &a\sum x_i^2+b\sum x_i=\sum x_iy_i \end{align*} \] Solving for \(a\) we get an expression for the slope…

\[ \begin{align} \therefore\quad & a\sum x_i^2 + \left(\frac{\sum y_i - a\sum x_i}{n}\right)\sum x_i = \sum x_i y_i \\ & a\sum x_i^2 + \frac{\sum x_i \sum y_i}{n} - \frac{a\left(\sum x_i\right)^2}{n} = \sum x_i y_i \\ & a\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right) = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n} \\ &a = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{n}(\sum x_i)^2} = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{n}(n\bar{x})^2} = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - n\bar{x}^2}\\ \end{align} \]

\[ \begin{align*} \text{denominator...}&\\ &\quad\mathrm{Var}(x)= \frac{1}{n} \sum (x_i - \bar{x})^2 \\[5pt] & \qquad\qquad = \frac{1}{n} \sum \left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) \\ & \qquad\qquad = \frac{1}{n} \left( \sum x_i^2 - 2\bar{x} \sum x_i + n \bar{x}^2 \right) \\ & \qquad\qquad = \frac{1}{n} \left( \sum x_i^2 - 2\bar{x} \cdot n\bar{x} + n \bar{x}^2 \right) \\ & \qquad\qquad = \frac{1}{n} \left( \sum x_i^2 - n\bar{x}^2 \right) \\ \end{align*} \] \[ \begin{align*} \text{numerator...}&\\ &\quad\mathrm{Cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \\ &\qquad\qquad= \frac{1}{n} \left( \sum x_i y_i - \bar{x} \sum y_i - \bar{y} \sum x_i + n \bar{x} \bar{y} \right) \\ &\qquad\qquad= \frac{1}{n} \left( \sum x_i y_i - n \bar{x} \bar{y} - n \bar{x} \bar{y} + n \bar{x} \bar{y} \right) \\ &\qquad\qquad= \frac{1}{n} \sum x_i y_i - \bar{x} \bar{y} \\ &\qquad\qquad= \frac{1}{n} \sum x_i y_i - \frac{1}{n^2} \sum x_i \sum y_i \\ &\qquad\qquad= \frac{1}{n} \left( \sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i \right) \end{align*} \] Recognising the covariance and variance in the numerator and denominator respectively… \[ \begin{align} a&= \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - n\bar{x}^2}\\ &= \frac{n \cdot \mathrm{Cov}(x, y)}{n \cdot \mathrm{Var}(x)} \\ & \therefore\quad \boxed{a = \frac{S_{xy}}{S_{xx}}} \end{align} \]
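A numerical check of the boxed solution, using simulated data: the closed-form slope and intercept agree with lm(), and the 1/(n - 1) factors in cov() and var() cancel in the ratio.

set.seed(5)
x <- runif(20)
y <- 2 * x + 1 + rnorm(20, sd = 0.1)
a_hat <- cov(x, y) / var(x)          # s_xy / s_xx
b_hat <- mean(y) - a_hat * mean(x)
c(intercept = b_hat, slope = a_hat)
coef(lm(y ~ x))                      # same values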

Addendum: Fitting a Polynomial

A polynomial of degree \(d\) is a linear combination of powers of \(x\), and we can fit it using linear least squares by constructing a suitable (Vandermonde) design matrix.

\[ A = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix} \]
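A sketch of the Vandermonde approach for a cubic (simulated data; solving the normal equations directly is shown for clarity, although this can be ill-conditioned for high degrees):

set.seed(6)
n <- 25; d <- 3
x <- runif(n, -1, 1)
y <- 1 - 2 * x + 0.5 * x^3 + rnorm(n, sd = 0.1)
A <- outer(x, 0:d, `^`)                    # columns 1, x, x^2, x^3
beta_hat <- solve(t(A) %*% A, t(A) %*% y)  # normal equations
drop(beta_hat)
coef(lm(y ~ poly(x, d, raw = TRUE)))       # same coefficients via lm()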

Undergrad stats class: review of least squares as a statistical model

The linear regression model for fitting a line - the value \(y_i\) is drawn from a random variable: \[Y_i=ax_i+b+\epsilon_i\] where the ‘error’ terms \(\epsilon_i\) are independent random variables with mean \(0\) and standard deviation \(\sigma\). The standard assumption is that the \(\epsilon_i\) are i.i.d. with distribution \(N(0,\sigma^2)\).

The least squares method chooses the values \(\hat{a}\) and \(\hat{b}\) which minimise the sum of squared deviations about the line, so that the fitted line estimates the conditional mean: \[E[Y_i]=ax_i+b+E[\epsilon_i]=ax_i+b\] Furthermore, under the assumption that \(\epsilon_i\sim N(0,\sigma^2)\), the least squares estimate coincides with the maximum likelihood estimate for the parameters.

The reason for the term ‘regression to the mean’ stems from the phenomenon that the standardised predicted response variable \(y\) will tend to be closer to (i.e., regress to) its mean than the standardised predictor variable \(x\) is to its mean.

Standardise the predictor and the response variables: \[u_i=\frac{x_i-\bar{x}}{\sqrt{s_{xx}}},\qquad v_i=\frac{y_i-\bar{y}}{\sqrt{s_{yy}}}\] so that: \[\bar{u}=\bar{v}=0,\qquad s_{uu}=s_{vv}=1\] The algebraic properties of covariance give the correlation coefficient \[s_{uv}=\frac{s_{xy}}{\sqrt{s_{xx}s_{yy}}}=\rho\] and thus the least squares fit to \(v = au + b\) has \[\boxed{\hat{a}=\frac{s_{uv}}{s_{uu}}=\rho},\qquad\boxed{\hat{b}=\bar{v}-\hat{a}\bar{u}=0}\] So the least squares line is just \(v = \rho u\), where \(\rho\) is the correlation coefficient. Since the correlation coefficient lies in the range \(\rho\in[-1,1]\), the predicted standardised response \(v\) is always at least as close to \(0\) (i.e., the mean) as the standardised predictor \(u\).
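This is easy to verify numerically (simulated data): after standardising, the fitted slope equals the sample correlation and the intercept vanishes.

set.seed(7)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50)
u <- (x - mean(x)) / sd(x)
v <- (y - mean(y)) / sd(y)
coef(lm(v ~ u))   # intercept ~ 0, slope equals the correlation
cor(x, y)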

The term “regression” comes from Sir Francis Galton in the 19th century. He was studying the heights of parents and their children, and found that: Tall parents tend to have tall children, but not as tall. Short parents tend to have short children, but not as short. So on average, children’s heights regressed toward the population mean — and he coined the term “regression toward mediocrity” to describe this. Then, the least squares line he used to describe this trend became known as the regression line.

Addendum: Homoscedasticity Assumption

The linear regression model assumes that all errors \(\epsilon_i\) have the same variance. Before using least squares on heteroscedastic data we would need to transform the data to be homoscedastic.

Addendum: Measuring Quality of Fit

The residual sum of squares is given by the sum of the squares of the residuals. For a line: \[RSS=\sum(y_i-\hat{a}x_i-\hat{b})^2\] The total sum of squares is given by: \[TSS=\sum(y_i-\bar{y})^2\] The RSS is the error portion of the total sum of squares. The difference \(TSS − RSS\) is the modelled portion of the total sum of squares. The coefficient of determination \(R^2\) is the ratio of the modeled portion to the total sum of squares: \[\boxed{R^2=\frac{TSS − RSS}{TSS}}\] The coefficient of determination measures the proportion of the variability of the data that is accounted for by the regression model. A value close to \(1\) indicates a good fit, while a value close to \(0\) indicates a poor fit. In the case of simple linear regression, \(R^2\) is simply the square of the correlation coefficient between the observed values \(y_i\) and the predicted values \(\hat{a}x_i + \hat{b}\).
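A short check with simulated data: RSS, TSS and \(R^2\) computed by hand match summary(lm()), and in the simple linear case \(R^2\) is the squared correlation between observed and fitted values.

set.seed(8)
x <- runif(30)
y <- 3 * x + 2 + rnorm(30, sd = 0.5)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
(tss - rss) / tss
summary(fit)$r.squared    # same value
cor(y, fitted(fit))^2     # also the squared correlation with the fitted values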

Undergrad linear algebra class: review of least squares

We have a small set of parameters \(\vec{x}\) and a large set of independent variables forming the design matrix \(A\) and a corresponding set of dependent observations \(\vec{b}\) however, in the presence of any noise or error, the system is generally inconsistent \(A\vec{x}\ne\vec{b}\). Instead we must aim for an approximate solution that minimises the residual:

\[ \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \\ \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ \end{bmatrix}\approx \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \\ \end{bmatrix} \] We express the line equation \(y=mx+c\) in terms of this linear system: \[y=mx+c\quad\iff \quad p_1 x_i+p_2\approx b_i\quad(\text{where}\;\;\;p_1:=m,\;\;\;p_2:=c\;\;\;\text{and}\;\;\;b_i:=y_i)\]

  • the independent variable \(x_i\) is called the predictor variable
  • the dependent variable \(y\) is called the response variable

To make the system solvable we must project \(\vec{b}\) onto the column space of \(A\) using the symmetric, idempotent projection matrix (McCabe, 2024a): \[P=A(A^T A)^{-1}A^T\]

So that we solve \(A\vec{x}=\vec{b'}\) using the orthogonal projection of the observation vector, \(\vec{b'}=P\vec{b}\).
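A minimal sketch of this projection view, with simulated data: P is symmetric and idempotent, and applying it to the observations reproduces the fitted values from lm().

set.seed(9)
n <- 15
x <- runif(n)
b <- 4 + 2 * x + rnorm(n, sd = 0.2)
A <- cbind(x, 1)                          # design matrix for the line y = m x + c
P <- A %*% solve(t(A) %*% A) %*% t(A)     # symmetric, idempotent projection matrix
all.equal(P, t(P))                        # symmetric
all.equal(P, P %*% P)                     # idempotent
all.equal(drop(P %*% b), unname(fitted(lm(b ~ x))))  # fitted values are the projection of b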

The N-Dimensional Problem & Degrees of Freedom

More generally the data and the model are multivariate: \[\vec{y}=X\beta + \vec{\epsilon},\qquad \vec{y}\in\mathbb{R}^n,\qquad X\in\mathbb{R}^{n\times p},\qquad \beta\in\mathbb{R}^{p},\qquad n\gg p\]

From our knowledge of projection matrices (McCabe, 2024a), the fundamental theorem of linear algebra (McCabe, 2024b) and intuition/observation we see a geometric explanation of why estimating \(p\) parameters reduces the effective dimension (freedom) of the data by \(p\).

  • The fitted values \(\hat{\vec{y}}=X \hat{\beta}\) lie in the column space \(\text{Col}(X)\), which is a \(p\)-dimensional subspace of \(\mathbb{R}^n\).
    • The DoF of the model corresponds exactly to the dimension of the subspace \(\text{Col}(X)\), i.e., number of parameters \(\text{rank}(X)=p\) (assuming full column rank)
  • The residuals \(\vec{r}=\vec{y}-\hat{\vec{y}}=(I-P)\vec{y}\) lie in the left null space \(N(X^T)\), the orthogonal complement of the column space, which has dimension \(n-p\)
    • The degrees of freedom for residuals is the dimension of the residual space \(N(X^T)\), specifically the difference in number of samples and model parameters \(n−p\).
Note
  • Estimating parameters is equivalent to projecting data onto a subspace
  • Consumed degrees of freedom correspond to the dimension of the subspace used
  • Residual degrees of freedom correspond to what’s left after projection
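Numerically (simulated data, p = 4 including the intercept), the trace of the hat matrix recovers the consumed degrees of freedom and lm() reports the residual n - p.

set.seed(10)
n <- 40; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(1, 2, -1, 0.5)) + rnorm(n)
fit <- lm(y ~ X - 1)      # X already contains the intercept column
sum(hatvalues(fit))       # = p = 4: consumed degrees of freedom
df.residual(fit)          # = n - p = 36: residual degrees of freedom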

As stated at the outset: in all rigorous statistical contexts, degrees of freedom (DoF) ultimately trace back to this idea of dimensionality in a vector space, usually via projections in a Hilbert space or its generalisations.

Practical linear regression in R with lm

require(graphics)

## Annette Dobson (1990) "An Introduction to Generalised Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
lm.D90 <- lm(weight ~ group - 1) # omitting intercept
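## Degrees of freedom here: n = 20 observations and p = 2 parameters
## (intercept + treatment effect), so the residual DoF reported below is 20 - 2 = 18.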

anova(lm.D9)
Analysis of Variance Table

Response: weight
          Df Sum Sq Mean Sq F value Pr(>F)
group      1 0.6882 0.68820  1.4191  0.249
Residuals 18 8.7292 0.48496               
summary(lm.D90)

Call:
lm(formula = weight ~ group - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4938  0.0685  0.2462  1.3690 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
groupCtl   5.0320     0.2202   22.85 9.55e-15 ***
groupTrt   4.6610     0.2202   21.16 3.62e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared:  0.9818,    Adjusted R-squared:  0.9798 
F-statistic: 485.1 on 2 and 18 DF,  p-value: < 2.2e-16
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(lm.D9, las = 1)      # Residuals, Fitted, ...

par(opar)


References

McCabe, D. (2024a). Projection matrices. https://www.mathworks.com/matlabcentral/fileexchange/176098-projection-matrices.
McCabe, D. (2024b). Fundamental subspaces of a matrix. Zenodo. https://doi.org/10.5281/zenodo.14194389