The Geometry of Least Squares: Where Mathematics Meets Statistics

Introduction

Goal: Explore the intersection of mathematics and statistics through linear regression
Focus: The geometric interpretation of least squares — regression as a projection problem
Key Question: How do algebra and calculus explain how we estimate relationships in data?

The Simple Linear Regression Model

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
- \(y_i\): response variable
- \(x_i\): predictor variable
- \(\varepsilon_i\): random noise

Objective: Find the line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) that minimizes the total squared error: \[ S = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
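The objective can be made concrete by evaluating \(S\) for two candidate lines. The sketch below is illustrative only: the helper `sse()` and the seed are assumptions, not part of the original example, and the data are simulated the same way as in the Example Data section.

```r
# Hypothetical helper: total squared error S for the candidate line b0 + b1 * x
sse <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)

set.seed(1)  # assumed seed, used only to make this sketch reproducible
x <- runif(50, 1, 10)
y <- 3 + 2 * x + rnorm(50, sd = 2)

fit <- lm(y ~ x)
s_hat  <- sse(coef(fit)[1], coef(fit)[2], x, y)  # S at the least squares line
s_true <- sse(3, 2, x, y)                        # S at the line used to simulate

# The least squares line can never have a larger S than any other line,
# including the one that generated the data
s_hat <= s_true
```

Note that the fitted line beats the generating line on this sample: least squares minimizes error on the observed data, not the distance to the true parameters.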

Mathematical Foundation

Regression in matrix form: \[ y = X\beta + \varepsilon \] where
\[ X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \]
The least squares estimate is: \[ \hat{\beta} = (X^T X)^{-1} X^T y \]
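As a rough check of the matrix formula, the estimate can be computed directly from the normal equations and compared with `lm()`. This is a minimal sketch on simulated data with an assumed seed; `X` and `beta_hat` are names chosen here for illustration.

```r
set.seed(1)  # assumed seed, used only to make this sketch reproducible
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)

X <- cbind(1, x)  # design matrix: a column of ones and the predictor

# Solve (X'X) beta = X'y rather than forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Agrees with lm() up to numerical tolerance
all.equal(as.numeric(beta_hat), unname(coef(lm(y ~ x))))
```

Solving the linear system is usually preferred to computing \((X^T X)^{-1}\) explicitly; `lm()` itself goes a step further and uses a QR decomposition.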

Example Data

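library(ggplot2)

# Simulate n points around the line y = 3 + 2x with Gaussian noise (sd = 2)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2*x + rnorm(n, sd = 2)
df <- data.frame(x, y)

# Scatter plot with the fitted least squares line overlaid
ggplot(df, aes(x, y)) +
  geom_point(color = "#8C1D40", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  theme_minimal() +
  labs(title = "Data and Fitted Regression Line", x = "x", y = "y")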

## `geom_smooth()` using formula = 'y ~ x'

R Code for the Model

model <- lm(y ~ x, data = df)
summary(model)

## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8632 -1.2932  0.3771  1.3816  3.2918 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9399     0.7115   4.132 0.000143 ***
## x             1.9912     0.1025  19.420  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.961 on 48 degrees of freedom
## Multiple R-squared:  0.8871, Adjusted R-squared:  0.8847 
## F-statistic: 377.1 on 1 and 48 DF,  p-value: < 2.2e-16

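Interpretation:
- The coefficient estimates are \(\hat{\beta}_0 \approx 2.94\) and \(\hat{\beta}_1 \approx 1.99\), close to the values 3 and 2 used to simulate the data
- They minimize the squared error; geometrically, the fitted values \(\hat{y} = X\hat{\beta}\) are the orthogonal projection of \(y\) onto the column space of \(X\)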
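The projection view can be sketched numerically with the projection ("hat") matrix \(H = X (X^T X)^{-1} X^T\), which maps \(y\) to the fitted values. The names `X`, `H`, and `y_hat`, and the seed, are assumptions made for this illustration.

```r
set.seed(1)  # assumed seed, used only to make this sketch reproducible
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

X <- cbind(1, x)                          # design matrix
H <- X %*% solve(crossprod(X)) %*% t(X)   # projection onto the column space of X

y_hat <- as.numeric(H %*% y)
all.equal(y_hat, unname(fitted(fit)))     # H y reproduces the fitted values

# Defining properties of an orthogonal projection
all.equal(H %*% H, H)                     # idempotent: projecting twice changes nothing
all.equal(H, t(H))                        # symmetric
sum(diag(H))                              # trace equals the number of columns of X
```

Forming the full \(n \times n\) matrix is fine for a small example like this one, but it is rarely done in practice for large \(n\).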

3D Geometry of Least Squares

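library(plotly)  # provides plot_ly(), add_surface(), and the %>% pipe

x1 <- seq(1, 10, length.out = 30)  # grid over the predictor x
x2 <- seq(1, n, length.out = 30)   # grid over the observation index
grid <- expand.grid(x1 = x1, x2 = x2)
grid$y <- model$coefficients[1] + model$coefficients[2]*grid$x1
# Points are plotted as (x, observation index, y); the fitted surface depends on x only
fig <- plot_ly(df, x = ~x, y = ~seq_along(x), z = ~y, type = 'scatter3d', mode = 'markers',
               marker = list(size = 4, color = 'maroon')) %>%
  # plotly expects z[i, j] at (x[j], y[i]), so the matrix is filled row by row
  add_surface(x = x1, y = x2, z = matrix(grid$y, 30, 30, byrow = TRUE),
              showscale = FALSE, opacity = 0.6, inherit = FALSE) %>%
  layout(title = "Projection of Data Points onto Regression Plane",
         scene = list(xaxis = list(title = "x"),
                      yaxis = list(title = "index"),
                      zaxis = list(title = "y")),
         margin = list(l = 0, r = 0, b = 0, t = 40)) %>%
  config(displayModeBar = FALSE)
fig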

Residuals as Orthogonal Errors

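The residual vector \(e = y - \hat{y}\) is orthogonal (perpendicular) to every column of \(X\), and therefore to the fitted vector \(\hat{y}\)
For the intercept column this means the residuals sum to zero; for the \(x\) column, the sum of residuals times \(x\) is zero: \[ \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0, \quad \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) = 0 \]
In vector form: \[ X^T(y - X\hat{\beta}) = 0 \]
The orthogonality holds in \(n\)-dimensional observation space; in the scatter plot, each residual appears as a vertical segment from the point to the line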

# Draw each residual as a vertical segment from the observed point to the fitted line
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "#8C1D40") +
  geom_segment(aes(xend = x, yend = fitted(model)), color = "gray") +
  theme_minimal() +
  labs(title = "Residuals as Vertical Distances to the Fitted Line")

## `geom_smooth()` using formula = 'y ~ x'
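The orthogonality conditions can be checked numerically. The sketch below assumes simulated data with a fixed seed; `X` and `e` are illustrative names, not objects from the original slides.

```r
set.seed(1)  # assumed seed, used only to make this sketch reproducible
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

X <- cbind(1, x)   # design matrix
e <- resid(fit)    # residual vector y - X beta_hat

# X'e: both entries are zero up to floating-point error
ortho <- as.numeric(crossprod(X, e))
ortho

# The residuals are also orthogonal to the fitted values,
# since the fitted values lie in the column space of X
dot_fitted <- sum(fitted(fit) * e)
dot_fitted
```

The printed values are not exactly zero because of floating-point rounding, but they should be many orders of magnitude smaller than the data.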

Mathematical Derivation

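To minimize \(S(\beta) = (y - X\beta)^T(y - X\beta)\), differentiate with respect to \(\beta\): \[ \frac{\partial S}{\partial \beta} = -2X^T(y - X\beta) \]
Setting the derivative to zero at \(\beta = \hat{\beta}\) gives the normal equations: \[ X^T y = X^T X \hat{\beta} \]
If \(X^T X\) is invertible (that is, \(X\) has full column rank): \[ \Rightarrow \hat{\beta} = (X^T X)^{-1} X^T y \]
Because \(S\) is convex in \(\beta\), this stationary point is the minimum. This shows how calculus and linear algebra together produce the least squares estimator.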
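One way to sanity-check the derivative is to compare the analytic gradient with a central finite-difference approximation. This is a sketch under assumptions: the functions `S()` and `grad_S()`, the test point `b`, the step `h`, and the seed are all chosen here for illustration.

```r
set.seed(1)  # assumed seed, used only to make this sketch reproducible
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)
X <- cbind(1, x)

S      <- function(b) sum((y - X %*% b)^2)                      # objective
grad_S <- function(b) as.numeric(-2 * crossprod(X, y - X %*% b)) # analytic gradient

# Central finite differences at an arbitrary point b
b <- c(1, 1)
h <- 1e-6
num_grad <- sapply(1:2, function(j) {
  step <- replace(numeric(2), j, h)  # perturb coordinate j only
  (S(b + step) - S(b - step)) / (2 * h)
})
all.equal(num_grad, grad_S(b), tolerance = 1e-6)

# At the solution of the normal equations the gradient vanishes (up to rounding)
b_hat <- as.numeric(solve(crossprod(X), crossprod(X, y)))
grad_S(b_hat)
```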

From Mathematics to Statistics

Mathematical Concept	Statistical Meaning
Vector projection	Best linear prediction of \(y\)
Matrix inverse \((X^T X)^{-1}\)	Variance–covariance of estimates, up to the noise variance: \(\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}\)
Minimizing distance	Equivalent to maximizing likelihood under normal errors
Orthogonality	Residuals uncorrelated with predictors

Mathematics provides structure; statistics adds interpretation and uncertainty.
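The link between \((X^T X)^{-1}\) and the uncertainty of the estimates can be sketched by rebuilding the variance-covariance matrix by hand and comparing it with `vcov()`. The names `sigma2_hat` and `V`, and the seed, are assumptions for this illustration.

```r
set.seed(1)  # assumed seed, used only to make this sketch reproducible
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)
X <- cbind(1, x)

# Estimate the noise variance from the residuals, with n - 2 degrees of freedom
# (two coefficients were estimated)
sigma2_hat <- sum(resid(fit)^2) / (n - 2)

# Estimated variance-covariance matrix of the coefficients
V <- sigma2_hat * solve(crossprod(X))
all.equal(unname(V), unname(vcov(fit)))

# Square roots of the diagonal are the standard errors reported by summary()
se <- sqrt(diag(V))
all.equal(unname(se), unname(summary(fit)$coefficients[, "Std. Error"]))
```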

The Gradient Descent Connection

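Modern machine learning still solves: \[ \min_\beta \|y - X\beta\|^2 \]
- Instead of solving analytically, algorithms often use gradient descent with step size \(\alpha\): \[ \beta_{t+1} = \beta_t - \alpha \nabla_\beta S(\beta_t), \quad \nabla_\beta S(\beta) = -2X^T(y - X\beta) \]
- The same gradient-based approach applies to models with no closed-form solution, which is how the linear algebra of regression carries over to optimization and deep learning.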

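# Sum of squared errors as a function of the slope, holding the intercept
# fixed at its least squares estimate so the curve bottoms out at the fitted slope
beta <- seq(0, 4, 0.1)
S <- sapply(beta, function(b) sum((y - coef(model)[1] - b*x)^2))
ggplot(data.frame(beta, S), aes(beta, S)) +
  geom_line(color = "#8C1D40") +
  # Mark the minimum: the fitted slope and the residual sum of squares
  annotate("point", x = unname(coef(model)[2]), y = sum(resid(model)^2),
           color = "black", size = 3) +
  theme_minimal() +
  labs(title = "Gradient Descent Landscape", x = "Beta (Slope)", y = "Sum of Squared Errors")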
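The update rule itself can be sketched in a few lines. This is an illustrative implementation, not a tuned one: the step size rule, the starting point, the fixed iteration budget, and the seed are all assumptions made here.

```r
set.seed(1)  # assumed seed, used only to make this sketch reproducible
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)
X <- cbind(1, x)

# Gradient of S(beta) = ||y - X beta||^2
grad_S <- function(beta) as.numeric(-2 * crossprod(X, y - X %*% beta))

# Assumed step size: 1 / (largest eigenvalue of the Hessian 2 X'X),
# small enough that the iteration cannot diverge for this quadratic objective
alpha <- 1 / (2 * max(eigen(crossprod(X), symmetric = TRUE)$values))

beta <- c(0, 0)              # arbitrary starting point
for (t in seq_len(20000)) {  # fixed, generous iteration budget
  beta <- beta - alpha * grad_S(beta)
}

# The iterates approach the closed-form least squares solution
beta_lm <- unname(coef(lm(y ~ x)))
all.equal(beta, beta_lm, tolerance = 1e-6)
```

Convergence is slow here because the intercept column and the uncentered predictor are strongly correlated; centering `x` first would likely reduce the number of iterations needed considerably.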

Conclusion

Linear regression is both mathematics (geometry, calculus, algebra) and statistics (inference, uncertainty)
Together, they form the foundation of data science
The concept of least squares connects geometry, optimization, and learning theory
Math + Statistics = Insight
