Inference in Simple Linear Regression & Linear Model Metrics
Basics
Statistical Model
assumed population model: \[
\boxed{\begin{align}
\text{model}: \quad & \vec{Y} = \mathbf{X} \vec{\beta} + \vec{\varepsilon}, \quad \vec{\varepsilon} \sim N(\vec{0}, \sigma^2 \mathbf{I}_n) \\[1em]
\text{estimators}: \quad &
\hat{\vec{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \vec{Y}, \\
& \vec{\hat{Y}} = \mathbf{X} \hat{\vec{\beta}}, \\
& \vec{\hat{\varepsilon}} = \vec{Y} - \vec{\hat{Y}}, \\
& \hat{\sigma}^2 = \frac{\vec{\hat{\varepsilon}}^\top \vec{\hat{\varepsilon}}}{n - p}\\\:
\end{align}}
\]
estimated (fitted) model: \[
\boxed{\begin{align}
\text{fitted model}: \quad & \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i, \quad i = 1, \dots, n \\[1em]
\text{estimators}: \quad &
\hat{\beta}_1 = \text{Cor}(X,Y) \frac{\sigma_Y}{\sigma_X}\\
& \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \\
& \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \frac{1}{n-2} \sum_{i=1}^n \hat{\varepsilon}_i^2\\
\text{where}\quad&\hat{\varepsilon}_i=Y_i-\hat{Y}_i\text{ are the residuals.}\\\:
\end{align}}
\]
\[\vec{y} = \mathbf{X} \, \vec{\beta} + \vec{\varepsilon}\]
where:
\(\vec{y}=\begin{bmatrix}Y_1\\Y_2\\\vdots\\Y_n\end{bmatrix},\qquad\mathbf{X} =\begin{bmatrix}1&X_1\\1&X_2\\\vdots&\vdots\\1&X_n\end{bmatrix}\)
\(\quad\vec{\beta}=\begin{bmatrix}\beta_0\\\beta_1\end{bmatrix}\quad\text{and}\quad\vec{\varepsilon}=\begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix}\).
\[ \begin{align} \mathrm{Var}(\vec{\varepsilon}) &= \begin{bmatrix} \mathrm{Var}(\varepsilon_1) & \mathrm{Cov}(\varepsilon_1, \varepsilon_2) & \cdots & \mathrm{Cov}(\varepsilon_1, \varepsilon_n) \\ \mathrm{Cov}(\varepsilon_2, \varepsilon_1) & \mathrm{Var}(\varepsilon_2) & \cdots & \mathrm{Cov}(\varepsilon_2, \varepsilon_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(\varepsilon_n, \varepsilon_1) & \mathrm{Cov}(\varepsilon_n, \varepsilon_2) & \cdots & \mathrm{Var}(\varepsilon_n) \end{bmatrix}\\ &=\begin{bmatrix}\sigma^2&0&\cdots&0\\0&\sigma^2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\sigma^2\end{bmatrix}=\sigma^2\mathbf{I}_n\qquad\text{i.i.d.} \:\\ \therefore\quad&\vec{\varepsilon} \sim N\!\left( \vec{0},\ \sigma^2 \mathbf{I}_n \right) \end{align} \]
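As a quick numerical illustration (a sketch using R's built-in faithful data, which is also used in the workflow section below), the boxed matrix-form estimators can be computed directly:
y <- faithful$waiting
X <- cbind(1, faithful$eruptions)                    # design matrix: intercept column + predictor
beta_hat   <- solve(t(X) %*% X, t(X) %*% y)          # (X'X)^{-1} X'y
resid_vec  <- y - X %*% beta_hat                     # residual vector
sigma2_hat <- sum(resid_vec^2) / (nrow(X) - ncol(X)) # e'e / (n - p)
beta_hat
sigma2_hat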
The ordinary least squares (OLS) estimates minimise the sum of squared residuals:
\[ Y_i = \hat{\beta_0} + \hat{\beta_1} X_i + \varepsilon_i \qquad\therefore\qquad \text{SSE} = \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2. \]
Solving the normal equations (McCabe, 2024) and performing some manipulation, we can build a tidy expression for the estimated slope:
\[ \begin{align*} \hat{\beta}_1 &=\frac{\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^n(X_i-\bar{X})^2}\qquad\text{...divide top and bottom by }n-1\text{...}\\ &=\frac{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2} =\frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}\\ &\text{...and rewrite the covariance in terms of the correlation...}\\ &=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y}\cdot\frac{\sigma_X\sigma_Y}{\operatorname{Var}(X)} =\text{Cor}(X,Y)\sqrt{\frac{\operatorname{Var}(Y)}{\operatorname{Var}(X)}} = \text{Cor}(X,Y) \frac{\sigma_Y}{\sigma_X} \end{align*} \] Knowing \(\hat{\beta}_1\) we can find the \(\hat{\beta}_0\) which minimises the residuals… \[\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\]
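A minimal sketch (again on the built-in faithful data) checking that the closed-form slope and intercept match lm():
x <- faithful$eruptions
y <- faithful$waiting
b1 <- cor(x, y) * sd(y) / sd(x)   # Cor(X,Y) * sigma_Y / sigma_X
b0 <- mean(y) - b1 * mean(x)      # Ybar - b1 * Xbar
c(b0, b1)
coef(lm(y ~ x))                   # should agree up to floating-point error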
For a given predictor value \(x_0\), the predicted mean response is
\[ \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0. \]
Regression Inference
A prediction interval is a range that gives an estimate of where a new individual observation is likely to fall, given a certain value of the predictors, with a specified probability (e.g. 95%). It differs from a confidence interval:
- A confidence interval for the mean response at \(x_0\) estimates where the average response would lie if the experiment were repeated many times. Its variance reflects only the uncertainty in estimating the regression line. Its value corresponds to the appropriate contour on the sampling distribution of the estimator.
- A prediction interval at \(x_0\) accounts for both (i) the uncertainty in estimating the regression line and (ii) the random variability of individual observations around the line. Because of this additional source of variability, it is always wider than the confidence interval for the mean. Its value corresponds to the appropriate contour on the probability density function of the random error \(\varepsilon\).
- We often illustrate confidence intervals as prediction intervals, centring them over the estimated value.
- The term “prediction interval” is somewhat ambiguous, since the true population prediction interval is distinct from the estimated prediction interval in the same way that a parameter is distinct from its estimate.
True Prediction Intervals
In simple linear regression, assuming the errors are independent and identically distributed as \(\varepsilon_i \sim N(0, \sigma^2)\), the true population prediction interval for a new observation at a predictor value \(x_a\) is \[ Y \sim N(\beta_0' + \beta_1' x_a, \sigma^2)\quad\therefore\quad \boxed{ \begin{align*} \beta_0' + \beta_1' x_a &\pm z_{1-\alpha/2}\, \sigma\\[0.2em] \text{or}\quad y_a&\pm z_{1-\alpha/2}\,\sigma \end{align*} } \] where \(\sigma^2\) is the residual variance of the true model and \(\beta_0'\), \(\beta_1'\) are the true (unknown) parameters.
Confidence Interval/Estimated Prediction Intervals
Standard error along fitted model: the variance of the predicted value \(\text{Var}(\hat{y}_a)\) at a particular point on the regression line \(x_a\) comes from the variance of the OLS estimators: \[
\begin{align*}
\text{Var}(\hat{y}_a)=\sigma^2\left[\frac{1}{n}+\underbrace{\frac{(x_a-\bar{X})^2}{(n-1)\sigma_x^2}}_{\text{leverage term}}\right]
\qquad\therefore\qquad
\operatorname{SE}_a=\sigma\sqrt{\frac{1}{n}+\frac{(x_a-\bar{X})^2}{(n-1)\sigma_x^2}}\\[1em]
\qquad\text{where }\sigma^2\text{ is the true (unknown) error variance, i.e. }\varepsilon_i\overset{\text{i.i.d.}}{\sim}N(0,\sigma^2)
\end{align*}
\] The leverage term:
- measures how far the predictor value \(x_a\) is from the centre of the data cloud, \(\bar{X}\).
- points far from the centre are situated where the scalar leverage field is higher
- Data points are “singularities” in the leverage field; their leverage values (diagonal elements of the hat matrix) do not exactly match the surrounding continuous field.
- The \(1/n\) term is baseline variance (uncertainty) from estimating the mean response.
The predicted mean response at a given predictor value \(x_0\) is
\[ \hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 \]
Substituting \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\) gives…
\[ \hat{y}_0 = (\bar{Y} - \hat{\beta}_1 \bar{X}) + \hat{\beta}_1 x_0.\\ \hat{y}_0 = \bar{Y} + (x_0 - \bar{X}) \hat{\beta}_1 \] Substituting \(\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X}) Y_i}{\sum_{i=1}^n (X_i - \bar{X})^2}\) gives…
\[ \hat{y}_0 = \bar{Y} + (x_0 - \bar{X}) \frac{\sum_{i=1}^n (X_i - \bar{X}) Y_i}{\sum_{i=1}^n (X_i - \bar{X})^2} \]
This can be written as a linear combination of the \(Y_i\):
\[ \hat{y}_0 = \sum_{i=1}^n a_i Y_i, \quad a_i = \frac{1}{n} + \frac{(x_0 - \bar{X})(X_i - \bar{X})}{\sum_{j=1}^n (X_j - \bar{X})^2}. \]
Since the \(Y_i\) are independent with variance \(\sigma^2\), using the variance of a linear combination of independent variables, the variance of \(\hat{y}_0\) is
\[ \begin{align*} \text{now}\quad\mathrm{Var}\!\left(\sum_{i=1}^n \alpha_i Z_i\right)=\sum_{i=1}^n \alpha_i^{2}\,\mathrm{Var}(Z_i)\qquad\text{so}\qquad\mathrm{Var}(\hat{y}_0) \:&= \sigma^2 \sum_{i=1}^n a_i^2\\ &= \sigma^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \right] \end{align*} \]
which is the standard formula for the variance of the predicted mean in simple linear regression.
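As a sketch (with \(\sigma\) replaced by the estimate \(\hat{\sigma}\) introduced just below), this standard-error formula can be checked against predict(..., se.fit = TRUE):
fit <- lm(waiting ~ eruptions, data = faithful)
x   <- faithful$eruptions
x0  <- 3
sigma_hat <- summary(fit)$sigma
se_manual <- sigma_hat * sqrt(1 / length(x) + (x0 - mean(x))^2 / sum((x - mean(x))^2))
se_rpred  <- predict(fit, newdata = data.frame(eruptions = x0), se.fit = TRUE)$se.fit
c(se_manual, se_rpred)            # should agree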
Since the true variance of the normally distributed output \(y_a\) is unknown, it must be estimated, and the standardised/studentised estimate \(\hat{y}_a\) follows a Student’s t-distribution (McCabe, 2025a):
\[ \begin{align*} \widehat{\operatorname{SE}}_a=\hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x_a-\bar{X})^2}{(n-1)\sigma_x^2}}&\\[1em] \text{where...}& \qquad\hat{\sigma}=\sqrt{\frac{1}{n-2}\sum_{i=1}^n\left(Y_i-\hat{Y}_i\right)^2} =\sqrt{\frac{1}{n-2} \sum_{i=1}^n \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2} \end{align*} \]
A \(100(1-\alpha)\%\) confidence interval for the mean response at \(x_a\) is given by the \(t_{\alpha/2, n-2}\) t-statistic (McCabe, 2025b)…
\[ \boxed{\begin{align*} \hat{y}_a&\pm t_{\alpha/2, n-2}\cdot\operatorname{SE}(x_a)\\[0.2em] \hat{\beta}_0+\hat{\beta}_1 x_a&\pm t_{\alpha/2,n-2}\cdot\operatorname{SE}(x_a) \end{align*}} \]
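In practice R's predict() computes both kinds of interval directly; a sketch (anticipating the faithful workflow below):
fit   <- lm(waiting ~ eruptions, data = faithful)
new_x <- data.frame(eruptions = c(2, 4.5))
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)  # CI for the mean response
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)  # PI for a new observation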
Residuals
Raw (resid) residuals
Residuals are in the units of the dependent variable \(y\) and their magnitude depends on the scale of the dependent variable measurements - e.g. microscopic or astronomical.
library(data.table)  # for as.data.table() and the dt[...] syntax used throughout

dt <- as.data.table(mtcars, keep.rownames = TRUE)
model <- lm(mpg ~ wt + hp, data = dt)
dt[, rstandardised := rstandard(model)]
dt[, rstudentised := rstudent(model)]
Standardised (internally studentised - rstandard) residuals
Standardised residuals are scaled so we can impose/test standard, common thresholds (used in various diagnostics such as Cook’s Distance, the Residuals vs Leverage plot, etc.).
dt[, rstandardised := rstandard(model)]
- Use the overall estimate of the error variance from the full model.
- give a quick, scale-free way to check whether any observation looks unusually far from the regression line.
- General rule of thumb: absolute values greater than 2 may suggest outliers.
- Weakness: if a point itself is an outlier, it still influences the estimate of error variance, so its own standardised residual may be “dampened”.
We know the variance of the sampling distribution of the residual estimator is \(\mathrm{Var}(r_i)=\sigma^2(1-h_{ii})\), where \(h_{ii}\) denotes the \(i\)-th diagonal of \(H\) (the leverage).
Replace the true (unknown) \(\sigma\) by the usual unbiased estimator \(\operatorname{sd}=\hat{\sigma}=\sqrt{\mathrm{RSS}/(n-p)}\) with \(\mathrm{RSS}=\|\vec{r}\|^2\). Define \[ r'_i := \frac{r_i}{\operatorname{sd}\sqrt{1-h_{ii}}} \] This rescales \(r_i\) by its model-implied standard deviation \(\sigma\sqrt{1-h_{ii}}\) (estimated by \(\operatorname{sd}\sqrt{1-h_{ii}}\)). Under normal errors, \(r'_i\) is approximately \(\mathcal{N}(0,1)\), but the \(r'_i\) are correlated and use an \(\operatorname{sd}\) computed with observation \(i\) as well as all the others.
Studentised (externally studentised - rstudent) residuals
Studentised residuals, on the other hand, are more useful for identifying outliers, since each observation is excluded from the error-variance estimate it is judged against.
dt[, rstudentised := rstudent(model)]
- Recalculate the error variance leaving that observation out.
- This removes the circularity (an outlier doesn’t get to hide by inflating the variance it is judged against).
- They follow a known Student’s t-distribution with \(n-p-1\) degrees of freedom.
- General rule of thumb: absolute values greater than 3 are stronger evidence of outliers.
dt[, rstudentised := rstudent(model)]
dt[rstudentised > 3., ]
To avoid letting observation \(i\) influence the variance it is judged against, recompute the error variance leaving out \(i\). Let \(s_{(i)}^2\) be the unbiased variance estimator from the model refit on the remaining \(n-1\) observations (the parenthesised subscript is accountancy-style leave-one-out notation). The studentised residual is \[ t_i \;:=\; \frac{r_i}{\,s_{(i)}\sqrt{1-h_{ii}}\,} \] Under the Gauss-Markov-normal assumptions, \(t_i \sim t_{\,n-p-1}\) exactly, and the numerator \(r_i\) is independent of \(s_{(i)}\).
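A quick check of this, without any refitting, uses the standard identity linking internally and externally studentised residuals (a sketch, assuming the mtcars model fitted above):
r_std <- rstandard(model)                         # internally studentised
n <- nobs(model); p <- length(coef(model))
r_stud_manual <- r_std * sqrt((n - p - 1) / (n - p - r_std^2))
all.equal(unname(r_stud_manual), unname(rstudent(model)))   # should be TRUE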
Efficient formulae (no refitting needed): TODO.
Leverage & Influence Measures
Introduction
Data points used to fit a regression line exert influence on the fit. Not all points are equal: distant outliers with high leverage affect the fit more than points deep in the centre of a dense data cloud.
Handling (noting/quantifying) outliers:
Think of a fulcrum at the centroid of the cluster. Points in each quadrant exert varying levels of leverage and influence over the fitted regression model. Specifically, the point in the lower-left has low leverage (it is close to the centroid), while the point in the upper-right has high leverage but does not exert much influence. See ?influence.measures for a comprehensive list of outlier metrics.
Leverage
Simple linear regression involves an overdetermined and inconsistent system of equations (McCabe, 2024). To fit a linear model we project the measured output vector \(\vec{y}\) onto the column space of the covariate matrix \(X\) using a projection or hat (\(\:\widehat{\quad}\:\)) matrix \(\:H\):
\[
\begin{align*}
\hat{\vec{y}}&=X(X^\top X)^{-1}X^\top\vec{y}\\
\hat{\vec{y}}&=H\vec{y}
\end{align*}
\]
- The leverage \(h_{ii}\) is the \(i\)-th diagonal element of the hat (projection) matrix \(H\)
- The off-diagonal elements \(h_{ij}\) are also leverages. They determine how the predictions of other points \(\hat{y}_{j\ne i}\) are influenced by \(y_i\).
- The fit defines leverage over the entire space - a scalar field
Consider the linear model \(\vec{y} = X\vec{\beta} + \vec{\varepsilon}\) with \(X\in\mathbb{R}^{n\times p}\) full column rank and \(\varepsilon \sim \mathcal{N}(0,\sigma^2 I_n)\). The ordinary least squares (OLS) estimator is \(\vec{\hat\beta}=(X^\top X)^{-1}X^\top\vec{y}\), the fitted values are \(\vec{\hat{y}} = X\vec{\hat\beta} = H\vec{y}\), where \(H := X(X^\top X)^{-1}X^\top\) - is the hat matrix (an idempotent, symmetric projector onto \(\mathcal{C}(X)\)). The residual vector is
\[ \begin{align*} \vec{r}:&= \vec{y}-\vec{\hat{y}}=(I-H)\vec{y}=(I-H)(X\vec{\beta}+\vec{\varepsilon})\\ &=\underbrace{(I-H)X\beta}_{\text{idempotent: }0=X\beta-X\beta}+(I-H)\vec{\varepsilon}=(I-H)\vec{\varepsilon}\\\\ &where:\\ &\qquad \vec{r}\quad\text{the (sampled) residuals vector}\\ &\qquad \vec{\varepsilon}\quad\text{the true (population) error of predicted }\vec{\hat{y}} \end{align*} \]
Hence… \[ \boxed{\begin{align*} &\text{sampling distribution of the residual estimator...}\\ &\mathbb{E}(\vec{r})=\vec{0}\qquad \mathrm{Var}(\vec{r})=\sigma^2(I-H)\\\\ &\:\text{...assuming the true data-generating process is homoscedastic with variance }\sigma^2 \end{align*}} \]
Let \(h_{ii}\) denote the \(i\)-th diagonal of \(H\) (the leverage). From the variance expression \(\boxed{\mathrm{Var}(r_i)=\sigma^2(1-h_{ii})}\)
Thus each residual has a different variance, shrinking in proportion to its leverage
- Residual variance \(\mathrm{Var}(\vec{r})\) is calculated at each training point and defines the leverage field. Residual variance shrinks with leverage.
- Prediction variance \(\text{Var}(\hat{y}_a)\) defines the scalar field. It grows as you move into high-leverage areas.
- This is why confidence intervals (unlike prediction intervals) narrow inside the data cloud and spread or fan out beyond the data limits.
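A sketch confirming that hatvalues() really is the diagonal of \(H=X(X^\top X)^{-1}X^\top\) (mtcars model from above):
X <- model.matrix(model)
H <- X %*% solve(t(X) %*% X) %*% t(X)                   # hat / projection matrix
all.equal(unname(diag(H)), unname(hatvalues(model)))    # should be TRUE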
Influence/Cook’s Distance
dt[, cooks.dist := cooks.distance(model)]
Cook’s distance is the product of a residual-size measure and a leverage measure. It is used to quantify how influential a point is on the regression fit. In terms of standardised residuals it is expressed as…
\[
D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}
\]
For observation \(i\), Cook’s distance is calculated as:
\[ D_i = \overbrace{\underbrace{\frac{e_i^2}{p \cdot \text{MSE}}}_{\text{residual size}} \cdot \underbrace{\frac{h_{ii}}{(1 - h_{ii})^2}}_{\text{leverage}}}^{\text{influence measure}} =\frac{ (\hat{\vec{\beta}} - \hat{\vec{\beta}}_{(i)})^\top X^\top X(\hat{\vec{\beta}} - \hat{\vec{\beta}}_{(i)}) }{p \cdot \text{MSE}} \] where:
- in the residual term \(\frac{e_i^2}{p \cdot \text{MSE}}\)…
- the numerator \(e_i^2\) only accounts for the absolute residual value; furthermore, large residuals are magnified and small ones shrunk by the square term
- in the denominator, \(p\) is the number of parameters in the model including the intercept (this just helps to standardise the metric between models with different numbers of parameters), and \(\text{MSE}=\frac{1}{\operatorname{d.o.f.}}\sum e_i^2\), i.e. we divide by the residual variance to standardise the residuals (this is on the scale of output units squared, so dividing by it makes the term dimensionless)
- in the leverage term \(\frac{h_{ii}}{(1-h_{ii})^2}\) the \((1-h_{ii})^2\) in the denominator amplifies the effect of high-leverage points
Commonly we express Cook’s distance in terms of the standardised residuals \(r_i = \frac{e_i}{\sqrt{\text{MSE}(1 - h_{ii})}}\):
\[ \boxed{D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}} \]
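A sketch verifying this standardised-residual form against cooks.distance() (mtcars model from above):
r2 <- rstandard(model)^2
h  <- hatvalues(model)
p  <- length(coef(model))
D_manual <- (r2 / p) * h / (1 - h)
all.equal(unname(D_manual), unname(cooks.distance(model)))   # should be TRUE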
Beta Derivatives
The changes in the model parameters when each observation is left out (DFBETAS) give an idea of the stability of the model and a finer breakdown of influence than Cook’s distance.
dfb <- dfbetas(model)
dt[, paste0("beta_", colnames(dfb)) := as.data.table(dfb)]
DFBETAS measure the standardised change in coefficient \(j\) caused by excluding observation \(i\).
\[ \operatorname{DFBETAS}_{ij} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\operatorname{SE}_{(i)}(\hat{\beta}_j)} \]
Where:
- \(\hat{\beta}_j\): coefficient estimate using all observations
- \(\hat{\beta}_{j(i)}\): coefficient estimate with observation \(i\) removed
- \(\operatorname{SE}_{(i)}(\hat{\beta}_j)\): standard error of \(\hat{\beta}_j\) computed without observation \(i\)
A common rule of thumb is:
\[ |\operatorname{DFBETAS}_{ij}| > \begin{cases} \tfrac{2}{\sqrt{n}}, & \text{large samples} \\\\ 1, & \text{small samples} \end{cases} \quad \Rightarrow \quad \text{potential influence.} \]
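A sketch applying the large-sample threshold to the dfb matrix computed above:
thresh <- 2 / sqrt(nrow(dfb))                    # 2 / sqrt(n)
which(abs(dfb) > thresh, arr.ind = TRUE)         # (observation, coefficient) pairs to inspect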
Hat Values
dt[, leverage := hatvalues(model)]
These are the diagonal values of the projection matrix. Hat values are tremendously useful in finding data entry errors.
A data entry error could just as easily be in the predictors \(x_i\) - e.g. someone typed a weight as 400 instead of 40. That would place the observation way outside the predictor space. Studentised residuals only react if the response \(y\) looks inconsistent with the predictors, but the hat value would be huge, immediately flagging it.
We could individually check the covariates, but looking at hat values means we don’t have to check them all unnecessarily.
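A sketch flagging unusually high-leverage rows with the common 2p/n heuristic (mtcars model and dt from above; the threshold is a rule of thumb, not taken from the text):
h <- hatvalues(model)
p <- length(coef(model))
dt[h > 2 * p / nrow(dt), ]    # rows far out in predictor space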
Typical Linear Regression R Workflow
1. Fit a Model
faithful_dt <- as.data.table(faithful)
model <- lm(waiting ~ eruptions, data = faithful_dt)
2. Query Model
We can produce a summary showing the R-squared or F-statistic.
summary(model)
Call:
lm(formula = waiting ~ eruptions, data = faithful_dt)
Residuals:
Min 1Q Median 3Q Max
-10.330 -4.696 1.417 3.498 8.496
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.857 4.435 7.634 4.75e-07 ***
eruptions 10.453 1.165 8.976 4.58e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.678 on 18 degrees of freedom
Multiple R-squared: 0.8174, Adjusted R-squared: 0.8072
F-statistic: 80.56 on 1 and 18 DF, p-value: 4.585e-08
We can access the coefficients with the coef command, confidence intervals with confint, and the residuals and fitted values with their accessor functions…
coef(model) # Extract coefficients only
(Intercept) eruptions
33.85685 10.45276
confint(model) # Extract confidence intervals for coefficients
2.5 % 97.5 %
(Intercept) 24.538808 43.17489
eruptions 8.006113 12.89940
r <- residuals(model) # Residuals
p <- fitted(model)    # Fitted values
R automatically names each coefficient after the variable it multiplies
- (Intercept) -> the constant term
- eruptions -> the slope multiplying eruptions (the predictor/regressor) in the equation
3. Prediction
Prediction is straightforward:
# Create dt with required covariates
dt2 <- data.table(eruptions = c(2., 3., 5., 100.))

pred <- predict(model, newdata = dt2) # Interpolate/extrapolate
dt2[, pred_wait := pred] # Keep organised

print(dt2)
eruptions pred_wait
<num> <num>
1: 2 54.76236
2: 3 65.21512
3: 5 86.12063
4: 100 1079.13249
4. Standardised Diagnostic Plots
par(mfrow = c(2, 2))
plot(model) # Generates diagnostic plots
par(mfrow = c(1,1))
4.1 Residuals vs Fitted (Residuals plot: rough all-round eyeball check)
In the Residuals vs Fitted plot, the vertical axis shows residuals (observed minus fitted values) and the horizontal axis shows the model’s fitted values. Ideally, points should be randomly scattered around the horizontal line at zero, indicating that the model’s errors are unbiased and have constant variance. Any clear pattern or curvature suggests non-linearity, and fanning is indicative of heteroscedasticity, suggesting the model may require refinement.
4.2 Normal Q-Q (normality check)
In the Normal Q-Q plot, the residuals are plotted against the theoretical quantiles of a normal distribution; if the residuals are approximately normally distributed, the points should lie close to the \(y=x\) reference line, with systematic deviations suggesting departures from normality such as skewness or heavy tails.
4.3 Scale-Location (scrutinise homoscedasticity)
Similar to the Residuals vs Fitted plot, the Scale-Location plot shows standardised/transformed residuals against fitted values to highlight changes in the spread of residuals: a roughly horizontal band suggests homoscedasticity (constant variance), while fanning, funnelling or other trends indicate heteroscedasticity.
Notes:
- Using absolute values \(|\varepsilon_i|\) focuses on the magnitude of deviations from the model, ignoring sign; however, this introduces right-skewness
- Square root compresses large residuals while expanding small residuals, making trends in spread across fitted values easier to spot.
- Taking \(\sqrt{|\varepsilon_i|}\) reduces skewness and spreads out smaller residuals, making patterns in variance more visible.
- Mathematically, it’s related to the variance-stabilising transformation for residuals: if \(\operatorname{Var}(\varepsilon)\propto\mu^2\) then \(\sqrt{|\varepsilon_i|}\) tends to have roughly constant variance.
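For reference, the same panel can be drawn by hand (a sketch, equivalent to plot(model, which = 3)):
plot(fitted(model), sqrt(abs(rstandard(model))),
     xlab = "Fitted values", ylab = "sqrt(|standardised residuals|)")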
4.4 Residuals vs Leverage (see Influence/Cook’s Distance)
The Residuals vs Leverage plot is used to identify observations that might have an undue influence on a regression model. The plot shows the Cook’s distance for each point. A common heuristic is:
- \(D>1\) is highly influential/anomalous
- \(D>4/n\) should be investigated
In the Residuals vs Leverage summary plot, dashed curves indicate approximate Cook’s distance thresholds. Observations that are both high-leverage and high-residual, especially those outside the Cook’s distance curves, should be investigated carefully as they can disproportionately influence the fitted regression model. Cook’s distance contours are given by:
\(|r_i| = \sqrt{ D \cdot p \cdot \frac{1 - h_{ii}}{h_{ii}} }\)
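To draw just this panel, optionally adding a 4/n contour alongside the defaults (a sketch):
plot(model, which = 5, cook.levels = c(4 / nrow(faithful_dt), 0.5, 1))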
5. ANOVA Table
In the context of regression, ANOVA is the p-value formalisation of the \(R^2\) value: it partitions the total variability of the response \(Y\) into:
\[SS_{total}=SS_{regression}+SS_{residuals}\] Where… \[ \begin{align*} SS_{\text{total}} &= \sum_{i} (y_i - \bar{y})^2 \quad \text{total variability in the response} \\ SS_{\text{regression}} &= \sum_{i} (\hat{y}_i - \bar{y})^2 \quad \text{variability explained by the model} \\ SS_{\text{residual}} &= \sum_{i} (y_i - \hat{y}_i)^2 \quad \text{variability not explained} \end{align*} \]
\[ \begin{align*} MS_{\text{regression}} &= \frac{SS_{\text{regression}}}{df_{\text{regression}}} \quad \text{— mean square for regression} \\ MS_{\text{residual}} &= \frac{SS_{\text{residual}}}{df_{\text{residual}}} \quad \text{— mean square for residuals} \\ F &= \frac{MS_{\text{regression}}}{MS_{\text{residual}}} \quad \text{— F-statistic for testing model significance} \end{align*} \]
Calling anova(model) in R produces a table summarising this decomposition, showing the contribution of each predictor and testing whether it explains a statistically significant amount of variation. Unlike classical ANOVA that compares group means, here it evaluates the effect of predictors on the response within a regression framework.
anova(model)
Analysis of Variance Table
Response: waiting
Df Sum Sq Mean Sq F value Pr(>F)
eruptions 1 2597.62 2597.62 80.564 4.585e-08 ***
Residuals 18 580.38 32.24
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
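A sketch tying this table back to the R-squared and F-statistic reported by summary(model) above:
a <- anova(model)
ss_reg <- a$`Sum Sq`[1]
ss_res <- a$`Sum Sq`[2]
ss_reg / (ss_reg + ss_res)                  # Multiple R-squared (0.8174 above)
(ss_reg / a$Df[1]) / (ss_res / a$Df[2])     # F-statistic (80.56 above)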
6. Standardised Coefficients - TODO
Standardised coefficients express the effect of each predictor on the response in a comparable, unitless form. While ordinary regression coefficients \(\beta_j\) depend on the units of the predictors, standardised coefficients \(\beta_j^*\) are calculated by first converting both the predictors and response to z-scores. This allows direct comparison of the relative influence of each predictor: larger absolute values indicate stronger impact on the response. In R, the lm.beta package can compute these values from a fitted model, providing a clearer sense of which variables matter most.
library(lm.beta)
lm.beta(model)
Call:
lm(formula = waiting ~ eruptions, data = faithful_dt)
Standardized Coefficients::
(Intercept) eruptions
NA 0.9040892
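The same standardised slope can be recovered by z-scoring both variables by hand (a sketch):
coef(lm(scale(waiting) ~ scale(eruptions), data = faithful_dt))   # slope matches 0.9040892 above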
Addendums
Gauss-Markov Assumptions (Homoscedastic Linear Model)
Consider the linear model \[ \vec{y} = X \vec{\beta} + \vec{\varepsilon}, \quad X \in \mathbb{R}^{n \times p}, \; \vec{\varepsilon} \in \mathbb{R}^n. \]
The Gauss-Markov assumptions for the homoscedastic linear model ensure OLS is BLUE (the best linear unbiased estimator):
- Linearity: \(\vec{y}\) is linear in \(\vec{\beta}\).
- Full rank: \(\mathrm{rank}(X) = p\) (no perfect multicollinearity).
- Zero-mean errors: \(\mathbb{E}[\vec{\varepsilon}] = \vec{0}\).
- Homoscedasticity: \(\mathrm{Var}(\vec{\varepsilon}) = \sigma^2 I_n\).
- No autocorrelation: \(\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \; i \neq j\).
Note: Normality of errors is not required for BLUE; it is only needed for exact t/F inference and exact normality of residuals.