Imagine two variables with \(n=30\) observations each: \(x \sim \mathcal{N}(100,\,10^{2})\) and \(y\), where \(y = 10 + 5x + \varepsilon\) and \(\varepsilon \sim \mathcal{N}(0,\,2.5^{2})\):
library(tidyverse)
library(ggExtra)
# Simulate the predictor and a response with known error SD (2.5)
x <- rnorm(30, 100, 10)
y <- 10 + 5 * x + rnorm(30, 0, 2.5)
df <- tibble(x = x, y = y)
# Scatter plot of the raw data
gg <- ggplot(df, aes(x = x, y = y)) +
  theme_linedraw() +
  geom_point(alpha = 0.8)
gg
Now let’s fit a simple linear regression:
m1 <- lm(y ~ x, data = df)
# Overlay the fitted regression line on the scatter plot
gg <- gg +
  geom_abline(intercept = coef(m1)[1], slope = coef(m1)[2])
gg
The residual standard error (RSE) is our estimate of the standard deviation of the error term in \(y\) (i.e., the 2.5 in \(\mathcal{N}(0,\,2.5^{2})\)):
sigma(m1)
#> [1] 2.248066
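The estimate (2.25) differs from the true value (2.5) because of sampling error. If we plot the residuals, the RSE is their standard deviation corrected for the degrees of freedom (dividing by \(n - 2\) rather than \(n - 1\)):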
df <- tibble(fit = predict(m1), resid = resid(m1))
gg <- ggplot(df, aes(x = fit, y = resid)) +
  theme_linedraw() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(alpha = 0.8)
gg <- ggMarginal(gg, type = "density", margins = "y", fill = "grey")
gg
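To make the "corrected standard deviation" statement concrete, here is a minimal, self-contained sketch that recomputes the RSE by hand and compares it with `sigma()`. The seed and the names `rse_manual` and `sd_resid` are my own choices for illustration, so the numbers will not match the 2.248066 reported above.

```r
# Re-simulate the data; the seed is arbitrary (an assumption, not from the post)
set.seed(1)
n <- 30
x <- rnorm(n, mean = 100, sd = 10)
y <- 10 + 5 * x + rnorm(n, mean = 0, sd = 2.5)
m1 <- lm(y ~ x)

# RSE by hand: residual sum of squares over the residual degrees of
# freedom (n - 2 for intercept + slope), then the square root
rse_manual <- sqrt(sum(resid(m1)^2) / df.residual(m1))

# Plain sample SD of the residuals divides by n - 1 instead
sd_resid <- sd(resid(m1))

# Both comparisons should return TRUE
all.equal(rse_manual, sigma(m1))
all.equal(rse_manual, sd_resid * sqrt((n - 1) / (n - 2)))
```

The second comparison relies on the residuals of a model with an intercept averaging to zero, so the only difference between the two quantities is the divisor.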
But now, let’s imagine we want to interpret the estimated RSE as the minimal detectable change (MDC). Given our data and the fitted model, the smallest difference in \(y\) that we can detect with 95% confidence would be:
\[ \pm RSE \times 1.96 \]
The value 1.96 is the critical value of the standard normal distribution that covers the central 95% of the distribution. First question: should we use the critical value from the t-distribution instead? Here \(df = n - 2 = 28\), so the two-sided 95% critical t-value is 2.05.
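The two critical values can be compared directly with base R quantile functions. This is only a sketch of the arithmetic, using the RSE reported above; the variable names are illustrative.

```r
n <- 30
df_resid <- n - 2          # residual degrees of freedom for y ~ x
rse <- 2.248066            # value reported by sigma(m1) in the post

z_crit <- qnorm(0.975)               # two-sided 95% normal critical value
t_crit <- qt(0.975, df = df_resid)   # two-sided 95% t critical value, 28 df

round(c(z = z_crit, t = t_crit), 3)
#> z = 1.960, t = 2.048

# Resulting 95% half-widths under each choice
round(c(normal = z_crit * rse, t = t_crit * rse), 2)
#> normal = 4.41, t = 4.60
```

With 28 degrees of freedom the t-based bound is only about 4.5% wider than the normal-based one, so the choice matters more for smaller samples.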
Second question: can we interpret this as the MDC (in this case, 4.41)? I am worried about the word "change" here, since a change is the difference between two values of \(y\), each with its own error. Or do we need to multiply by \(\sqrt{2}\)?
\[ MDC = \pm RSE \times 1.96 \times \sqrt{2} \]
Using \(\sqrt{2}\), the smallest change in \(y\) that our model can detect is 6.23 (assuming that the residuals are normally distributed with constant variance).
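The \(\sqrt{2}\) factor can be checked with a short sketch. It rests on an assumption not stated in the post: a "change" is the difference between two values whose errors are independent and share the same standard deviation, so the variance of the difference is \(2\sigma^{2}\). The seed, sample size, and variable names are arbitrary.

```r
rse <- 2.248066                  # value reported by sigma(m1) in the post

half_width <- 1.96 * rse         # 95% bound for a single value
mdc <- half_width * sqrt(2)      # 95% bound for a difference of two values
round(c(half_width = half_width, mdc = mdc), 2)
#> half_width = 4.41, mdc = 6.23

# Simulation check (assumes independent errors with equal SD)
set.seed(42)
sigma_true <- 2.5
e1 <- rnorm(1e5, mean = 0, sd = sigma_true)  # error at the first measurement
e2 <- rnorm(1e5, mean = 0, sd = sigma_true)  # error at the second measurement

# Ratio of the SD of the difference to the single-error SD;
# should be close to sqrt(2), about 1.414
sd(e2 - e1) / sigma_true
```

If the two errors were positively correlated, the standard deviation of the difference would be smaller than \(\sqrt{2}\,\sigma\), and the MDC computed this way would be conservative.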