Introduction

  • Goal: Explore the intersection of mathematics and statistics through linear regression
  • Focus: The geometric interpretation of least squares — regression as a projection problem
  • Key Question: How do algebra and calculus explain how we estimate relationships in data?

The Simple Linear Regression Model

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

  • \(y_i\): response variable
  • \(x_i\): predictor variable
  • \(\varepsilon_i\): random noise

Objective: Find the line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) that minimizes the total squared error:

\[ S = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
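For a single predictor, the minimizer of \(S\) has a well-known closed form: the slope is the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\), and the intercept makes the line pass through \((\bar{x}, \bar{y})\). The sketch below is illustrative only. It simulates its own data in the spirit of these slides, and the seed and the names `b0`, `b1`, `S_hat`, and `S_other` are assumptions, not part of the original deck.

```r
# Simulated data; the seed is an assumption made for reproducibility
set.seed(1)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)

# Closed-form least squares estimates for one predictor
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Total squared error S at the estimates
S_hat <- sum((y - (b0 + b1 * x))^2)

# A perturbed line (arbitrary offsets) gives a larger S
S_other <- sum((y - ((b0 + 0.5) + (b1 - 0.1) * x))^2)

c(intercept = b0, slope = b1)
S_hat < S_other
```

The same two numbers are what `lm(y ~ x)` reports as its coefficients.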

Mathematical Foundation

Regression in matrix form:

\[ y = X\beta + \varepsilon \]

where

\[ X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \]

The least squares estimate is:

\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
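The matrix formula can be checked directly in base R. This is a minimal sketch on simulated data (the seed is an assumption). It builds the design matrix with a column of ones and solves the system \(X^T X \hat{\beta} = X^T y\) with `solve()` rather than forming the inverse explicitly, which is usually the numerically safer choice.

```r
# Simulated data; the seed is an assumption made for reproducibility
set.seed(1)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)

# Design matrix: a column of ones (intercept) and the predictor
X <- cbind(1, x)

# Solve X'X beta = X'y; crossprod(X) is t(X) %*% X
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Side-by-side comparison with lm()
cbind(normal_equations = drop(beta_hat), lm = coef(lm(y ~ x)))
```

Note that `lm()` itself uses a QR decomposition rather than this system, so the two columns agree only up to floating-point rounding.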

Example Data

library(ggplot2)

# Simulate n points from y = 3 + 2x + normal noise (sd = 2)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2*x + rnorm(n, sd = 2)
df <- data.frame(x, y)

# Scatter plot with the least squares line overlaid
ggplot(df, aes(x, y)) +
  geom_point(color = "#8C1D40", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  theme_minimal() +
  labs(title = "Data and Fitted Regression Line", x = "x", y = "y")
## `geom_smooth()` using formula = 'y ~ x'

R Code for the Model

model <- lm(y ~ x, data = df)
summary(model)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8632 -1.2932  0.3771  1.3816  3.2918 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9399     0.7115   4.132 0.000143 ***
## x             1.9912     0.1025  19.420  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.961 on 48 degrees of freedom
## Multiple R-squared:  0.8871, Adjusted R-squared:  0.8847 
## F-statistic: 377.1 on 1 and 48 DF,  p-value: < 2.2e-16

Interpretation:

  • The coefficients are the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\)
  • They minimize the squared error; geometrically, the fitted values are the projection of \(y\) onto the column space of \(X\)
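The projection reading can be made concrete with the projection ("hat") matrix \(H = X(X^T X)^{-1}X^T\), which maps \(y\) to the fitted values. The sketch below uses simulated data (the seed and the names `H`, `y_hat`, and `fit` are assumptions) and checks three properties an orthogonal projection should have: it reproduces the fitted values, applying it twice changes nothing, and it is symmetric.

```r
# Simulated data; the seed is an assumption made for reproducibility
set.seed(1)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)

X <- cbind(1, x)

# Projection matrix onto the column space of X
H <- X %*% solve(crossprod(X)) %*% t(X)

# Projecting y gives the fitted values
y_hat <- drop(H %*% y)
fit <- lm(y ~ x)

all.equal(y_hat, unname(fitted(fit)))  # same fitted values as lm()
all.equal(H %*% H, H)                  # idempotent: projecting twice is projecting once
all.equal(H, t(H))                     # symmetric
sum(diag(H))                           # trace equals the number of columns of X
```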

3D Geometry of Least Squares

library(plotly)

# Grid over the predictor (x1) and the observation index (x2)
x1 <- seq(1, 10, length.out = 30)
x2 <- seq(1, nrow(df), length.out = 30)
grid <- expand.grid(x1 = x1, x2 = x2)
grid$y <- coef(model)[1] + coef(model)[2]*grid$x1
# add_surface() reads z[i, j] at (x[j], y[i]), so fill the matrix by row
z_fit <- matrix(grid$y, 30, 30, byrow = TRUE)
# Points: predictor, observation index, and response on the three axes
fig <- plot_ly(df, x = ~x, y = ~seq_along(x), z = ~y, type = 'scatter3d', mode = 'markers',
               marker = list(size = 4, color = 'maroon')) %>%
  add_surface(x = x1, y = x2, z = z_fit, inherit = FALSE,
              showscale = FALSE, opacity = 0.6) %>%
  layout(title = "Projection of Data Points onto Regression Plane",
         scene = list(xaxis = list(title = "x"),
                      yaxis = list(title = "index"),
                      zaxis = list(title = "y")),
         margin = list(l = 0, r = 0, b = 0, t = 40)) %>%
  config(displayModeBar = FALSE)
fig

Residuals as Orthogonal Errors

  • The residual vector is orthogonal (perpendicular) to every column of \(X\), and therefore to the vector of fitted values
  • For the predictor column, this means the sum of residuals times \(x\) is zero: \[ \sum_{i=1}^{n} x_i (y_i - \hat{y}_i) = 0 \]
  • In vector form: \[ X^T(y - X\hat{\beta}) = 0 \]
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE, color="#8C1D40") +
  geom_segment(aes(xend = x, yend = fitted(model)), color="gray") +
  theme_minimal() +
  # Segments are vertical: each residual is measured along the y-axis
  labs(title="Residuals as Vertical Distances")
## `geom_smooth()` using formula = 'y ~ x'
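The orthogonality conditions can be verified numerically. This is a small sketch on simulated data (the seed and the names `fit`, `e`, and `X` are assumptions). Note that the orthogonality lives in \(n\)-dimensional space: the residual vector is perpendicular to the columns of \(X\), even though each residual is drawn as a vertical segment in the scatter plot.

```r
# Simulated data; the seed is an assumption made for reproducibility
set.seed(1)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
e <- resid(fit)          # residual vector y - X beta_hat
X <- model.matrix(fit)   # columns: intercept and x

# X'e: first entry is sum(e), second is sum(x * e); both are zero up to rounding
crossprod(X, e)

# The residuals are also orthogonal to the fitted values
sum(e * fitted(fit))
```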

Mathematical Derivation (LaTeX Slide)

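To minimize \(S = (y - X\beta)^T(y - X\beta)\), differentiate with respect to \(\beta\):

\[ \frac{\partial S}{\partial \beta} = -2X^T(y - X\beta) \]

Setting the derivative to zero gives the normal equations:

\[ X^T y = X^T X \hat{\beta} \]

\[ \Rightarrow \hat{\beta} = (X^T X)^{-1} X^T y \]

provided \(X^T X\) is invertible, which holds when the columns of \(X\) are linearly independent. This shows how calculus and linear algebra together produce the least squares estimator.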
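One way to build confidence in the gradient formula is to compare it with a finite-difference approximation at an arbitrary point. The sketch below does this on simulated data. The seed, the test point `b`, the step `h`, and the helper names `S` and `grad_S` are all assumptions chosen for illustration.

```r
# Simulated data; the seed is an assumption made for reproducibility
set.seed(1)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)
X <- cbind(1, x)

# Objective and its analytic gradient, -2 X'(y - X b)
S <- function(b) sum((y - X %*% b)^2)
grad_S <- function(b) drop(-2 * t(X) %*% (y - X %*% b))

# Central finite differences at an arbitrary test point
b <- c(1, 1)
h <- 1e-6
num_grad <- sapply(1:2, function(j) {
  step <- replace(numeric(2), j, h)
  (S(b + step) - S(b - step)) / (2 * h)
})

rbind(analytic = grad_S(b), numeric = num_grad)
```

Because \(S\) is quadratic in \(\beta\), central differences agree with the analytic gradient up to rounding error.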

From Mathematics to Statistics

| Mathematical Concept | Statistical Meaning |
|---|---|
| Vector projection | Best linear prediction of \(y\) |
| Matrix inverse \((X^T X)^{-1}\) | Variance–covariance of estimates, up to the factor \(\sigma^2\) |
| Minimizing distance | Equivalent to maximizing likelihood under normal errors |
| Orthogonality | Residuals uncorrelated with predictors |

Mathematics provides structure; statistics adds interpretation and uncertainty.
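The link between \((X^T X)^{-1}\) and the variance–covariance of the estimates can be illustrated in a few lines. In this sketch (simulated data; the seed and the names `sigma2_hat` and `V` are assumptions), the error variance is estimated from the residuals with \(n - 2\) degrees of freedom and multiplied by \((X^T X)^{-1}\). The result should match `vcov()`, and the square roots of its diagonal are the standard errors shown by `summary()`.

```r
# Simulated data; the seed is an assumption made for reproducibility
set.seed(1)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
X <- model.matrix(fit)

# Estimated error variance: residual sum of squares over n - 2
sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)

# Estimated variance-covariance matrix of the coefficient estimates
V <- sigma2_hat * solve(crossprod(X))

all.equal(unname(V), unname(vcov(fit)))  # agrees with R's built-in vcov()
sqrt(diag(V))                            # standard errors of intercept and slope
```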

The Gradient Descent Connection

Modern machine learning still solves:

\[ \min_\beta \|y - X\beta\|^2 \]

  • Instead of solving analytically, algorithms use gradient descent, where \(\alpha\) is the step size (learning rate): \[ \beta_{t+1} = \beta_t - \alpha \nabla_\beta S \]
  • This shows how linear algebra underpins deep learning, regression, and optimization.

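# Sum of squared errors as a function of the slope, with the intercept held at its fitted value
beta <- seq(0, 4, 0.1)
S <- sapply(beta, function(b) sum((y - coef(model)[1] - b*x)^2))
data.frame(beta, S) %>%
  ggplot(aes(beta, S)) +
  geom_line(color="#8C1D40") +
  # Mark the least squares slope and its residual sum of squares
  annotate("point", x=coef(model)[2], y=sum(resid(model)^2), color="black", size=3) +
  theme_minimal() +
  labs(title="Gradient Descent Landscape", x="Beta (Slope)", y="Sum of Squared Errors")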
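The update rule can be run directly to see it converge to the analytic solution. The following is a minimal sketch, not a production optimizer. The data are simulated (the seed is an assumption), and the starting point, iteration count, and step-size rule are illustrative choices. The step size is set to the reciprocal of the largest eigenvalue of the Hessian \(2X^T X\), a standard choice that keeps the iteration stable for a quadratic objective.

```r
# Simulated data; the seed is an assumption made for reproducibility
set.seed(1)
n <- 50
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 2)
X <- cbind(1, x)

# Gradient of S(b) = ||y - X b||^2
grad_S <- function(b) drop(-2 * crossprod(X, y - X %*% b))

# Step size: 1 / (largest eigenvalue of the Hessian 2 X'X)
alpha <- 1 / (2 * max(eigen(crossprod(X), symmetric = TRUE)$values))

# Plain gradient descent from the origin
beta <- c(0, 0)
for (iter in seq_len(20000)) {
  beta <- beta - alpha * grad_S(beta)
}

rbind(gradient_descent = beta, lm = coef(lm(y ~ x)))
```

Convergence is slow here because the intercept column and the uncentered predictor are strongly correlated. Centering `x` before fitting would let the same loop converge in far fewer iterations.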

Conclusion

  • Linear regression is both mathematics (geometry, calculus, algebra) and statistics (inference, uncertainty)
  • Together, they form the foundation of data science
  • The concept of least squares connects geometry, optimization, and learning theory

Math + Statistics = Insight
