Why polynomial regression?

A straight line is often too rigid. Many real relationships curve: braking distance grows faster than speed, dose response saturates, growth accelerates.

Polynomial regression keeps the simplicity of ordinary least squares but lets the fitted function bend, by adding powers of the predictor as extra terms.

The key idea that makes it “linear” regression:

  • the model is linear in the coefficients \(\beta_0, \beta_1, \dots, \beta_d\),
  • even though it is nonlinear in the predictor \(x\).

So all the familiar least-squares machinery still applies, we have just enriched the set of columns we regress on.

The model

A degree-\(d\) polynomial regression of a response \(y\) on a single predictor \(x\) is

\[ y_i \;=\; \beta_0 + \beta_1 x_i + \beta_2 x_i^{2} + \cdots + \beta_d x_i^{d} + \varepsilon_i, \qquad \varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2). \]

In matrix form this is an ordinary linear model \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\), where \(\mathbf{X}\) is the Vandermonde design matrix

\[ \mathbf{X} = \begin{bmatrix} 1 & x_1 & x_1^{2} & \cdots & x_1^{d} \\ 1 & x_2 & x_2^{2} & \cdots & x_2^{d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^{2} & \cdots & x_n^{d} \end{bmatrix}. \]

Least squares estimation

We choose \(\boldsymbol{\beta}\) to minimize the residual sum of squares

\[ \mathrm{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert^2 . \]

Setting the gradient to zero gives the normal equations \(\mathbf{X}^{\top}\mathbf{X}\,\boldsymbol{\beta} = \mathbf{X}^{\top}\mathbf{y}\), whose solution is

\[ \hat{\boldsymbol{\beta}} = \bigl(\mathbf{X}^{\top}\mathbf{X}\bigr)^{-1}\mathbf{X}^{\top}\mathbf{y}. \]

In practice we use orthogonal polynomials (R’s poly()) instead of raw powers: the columns \(x, x^2, x^3, \dots\) are highly collinear, which makes \(\mathbf{X}^{\top}\mathbf{X}\) nearly singular and the estimates numerically unstable.

The data: stopping distances

We use R’s built-in cars dataset: the speed of 1920s cars and the distance required to stop. Physics gives us a prior: kinetic energy \(\propto v^2\), so we expect stopping distance to grow roughly quadratically with speed.

Fitting curves with stat_smooth

ggplot2’s stat_smooth() fits and overlays a model in one step. Here we compare a linear, quadratic, and cubic fit on the same axes.

The code behind the plot

The quadratic curve above is produced by passing an orthogonal-polynomial formula to stat_smooth():

ggplot(cars, aes(speed, dist)) +
  geom_point(color = "grey40", alpha = 0.7) +
  stat_smooth(method = "lm",
              formula = y ~ poly(x, 2),   # quadratic in speed
              se = TRUE) +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")

# The same fit, available as a model object:
fit2 <- lm(dist ~ poly(speed, 2), data = cars)
summary(fit2)$coefficients

Choosing the degree

Adding terms always lowers the training error, so we cannot pick the degree by RSS alone, that just rewards overfitting. A penalized criterion such as the Akaike Information Criterion balances fit against complexity:

\[ \mathrm{AIC} = 2k - 2\ln\!\bigl(\hat{L}\bigr), \]

where \(k\) is the number of parameters and \(\hat{L}\) the maximized likelihood. Lower is better. Below, AIC bottoms out at a low degree, then rises again.

An interactive 3D response surface

With two predictors the fitted polynomial becomes a surface. Using mtcars, we model fuel economy as a quadratic in horsepower and weight: \(\widehat{\text{mpg}} = f(\text{hp}, \text{wt})\). Drag to rotate the surface.

Takeaways

  • Polynomial regression is linear regression on transformed columns, so least squares, \(R^2\), and inference all carry over unchanged.
  • In ggplot2, stat_smooth(method = "lm", formula = y ~ poly(x, d)) overlays a degree-\(d\) fit directly; plotly extends the picture to interactive 3D surfaces.
  • Higher degree always fits the training data better, so choose the degree with a criterion that penalizes complexity (AIC, BIC, or cross-validation), not RSS.
  • Prefer orthogonal polynomials over raw powers for numerical stability.