For the Leave-One-Out Cross-Validation MSE (\(LOOCV_{(n)}\)) in least squares regression, the following shortcut formula holds:
\[\begin{align*}
LOOCV_{(n)} &= \frac{1}{n} \sum_{i = 1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2 \\
\text{where:} \quad H &= X(X^TX)^{-1}X^T && (X \text{ is the design matrix}) \\
h_i &= [H]_{ii} && (\text{diagonal elements of the hat matrix } H \text{, i.e. the leverages})
\end{align*}\]
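As a quick illustration of the leverages (a minimal sketch, not part of the original example, using the built-in mtcars data), the values returned by hatvalues() can be checked against an explicitly computed hat matrix:

fit <- glm(mpg ~ wt + I(wt^2), data = mtcars)
X <- model.matrix(fit)                               # design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)                # hat matrix H = X(X'X)^{-1}X'
all.equal(unname(diag(H)), unname(hatvalues(fit)))   # h_i = [H]_ii; should be TRUE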
An example dataset data (\(n\) = 20,000 rows), to which I fit \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2\):
x <- rnorm(20000)
y <- x - 2*x^2 + rnorm(20000)
data <- data.frame(x = x, y = y)

The LOOCV() function:
LOOCV <- function(data, formula) {
  model <- glm(formula(formula), data = data)
  # shortcut formula: mean of squared residuals, each scaled by 1/(1 - leverage h_i)
  mean(((data$y - predict(model)) / (1 - boot::glm.diag(model)$h))^2)
}

LOOCV MSE estimate using my LOOCV() function:
LOOCV(data, "y ~ x + I(x^2)")
## [1] 1.013968
## 2.13 sec elapsed
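For reference, a base-R variant of the same shortcut (a sketch, not from the original post; stats::hatvalues() is swapped in for boot::glm.diag(), and for a Gaussian glm the two return the same leverages):

LOOCV_base <- function(data, formula) {
  model <- glm(formula(formula), data = data)
  # same shortcut formula, with hatvalues() supplying the leverages
  mean(((data$y - fitted(model)) / (1 - hatvalues(model)))^2)
}
LOOCV_base(data, "y ~ x + I(x^2)")   # should agree with LOOCV() above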
LOOCV MSE estimate using the boot::cv.glm() function (which does not use the shortcut formula):
glm_1 <- glm(y ~ x + I(x^2), data = data)
(glm_1_LOOCV <- boot::cv.glm(data, glm_1)$delta[1])
## [1] 1.013968
## 1106.97 sec elapsed
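The gap in run time is expected: boot::cv.glm() refits the model once per held-out observation, while the shortcut formula needs only a single fit. A naive loop along these lines (a sketch, not from the original post; run it only on a small subset) makes that cost explicit:

LOOCV_naive <- function(data, formula) {
  n <- nrow(data)
  errs <- numeric(n)
  for (i in seq_len(n)) {
    fit_i <- glm(formula(formula), data = data[-i, ])               # refit without row i
    errs[i] <- (data$y[i] - predict(fit_i, newdata = data[i, ]))^2  # held-out squared error
  }
  mean(errs)
}
LOOCV_naive(data[1:200, ], "y ~ x + I(x^2)")   # small subset only; all 20,000 rows would take far longer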