Comparison of Linear Regression, Polynomial Regression, and Spline Regression

Author

Takafumi Kubota

Published

May 31, 2024

Abstract

This document compares linear, polynomial, and spline regression models in R. It fits all three to a simulated dataset and to the airquality dataset and compares the fitted curves visually, highlighting each model’s flexibility and typical applications.

Keywords

R, regression, linear, polynomial, spline

1 Differences between Linear Regression, Polynomial Regression, and Spline Regression

1.1 Linear Regression

  • Definition: The most basic form of regression analysis that fits a straight line to the data points.

  • Features: Simple and fast to compute. Effective when the relationship between variables is linear.

  • Use Case: Situations where there’s a straight-line relationship between two variables, such as sales and advertising expenditure.
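As a minimal sketch of the sales and advertising use case, the following fits a straight line and reads off the slope. The data are simulated purely for illustration: the variable names advertising and sales, the true intercept 50, the true slope 3, and the noise level are all assumptions, not values from this document.

```r
# Hypothetical data: advertising spend and sales are simulated for illustration;
# they are not taken from the original document.
set.seed(1)
advertising <- seq(10, 100, by = 10)
sales <- 50 + 3 * advertising + rnorm(length(advertising), mean = 0, sd = 20)
ads <- data.frame(advertising = advertising, sales = sales)

# Fit a straight line: sales = intercept + slope * advertising
fit <- lm(sales ~ advertising, data = ads)

coef(fit)      # estimated intercept and slope
confint(fit)   # 95% confidence intervals for both coefficients

# Predicted sales at a new (hypothetical) spend level
new_pred <- predict(fit, newdata = data.frame(advertising = 120))
new_pred
```

The slope estimate is the expected change in sales per unit of advertising, which is what makes the linear model easy to interpret.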

1.2 Polynomial Regression

  • Definition: An extension of linear regression that fits a polynomial (e.g., quadratic, cubic) to the data points.

  • Features: Can handle nonlinear trends in the data while remaining linear in its coefficients, so it is still fitted by ordinary least squares. A higher polynomial degree gives more flexibility but also increases the risk of overfitting.

  • Use Case: When data shows a curved trend, such as stock market movements.
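The trade-off between flexibility and overfitting can be sketched by fitting polynomials of increasing degree and comparing the error on the training data with the error on held-out data. Everything here is an illustrative assumption: the sine-shaped signal, the noise level, the 40/20 split, and the helper rmse() are not part of the original analysis.

```r
# Hypothetical illustration: simulated data, not from the original document.
set.seed(42)
n <- 60
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

train <- data.frame(x = x[1:40],  y = y[1:40])
test  <- data.frame(x = x[41:60], y = y[41:60])

# Root mean squared error of a fitted model on a data frame with columns x and y
rmse <- function(model, df) {
  sqrt(mean((df$y - predict(model, newdata = df))^2))
}

# Fit polynomials of increasing degree and record both errors
degrees <- c(1, 3, 6, 10)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_rmse = rmse(fit, train), test_rmse = rmse(fit, test))
}))

print(round(results, 3))
```

The training error can only decrease as the degree grows, because each polynomial contains the lower-degree ones as special cases. The held-out error need not follow, and a gap between the two columns is the usual sign of overfitting.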

1.3 Spline Regression

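  • Definition: A method that divides the range of the predictor into several segments at points called knots and fits a low-degree polynomial to each segment, with the pieces constrained to join smoothly at the knots.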

  • Features: Suitable for fitting nonlinear and complex data. Provides flexibility with smooth transitions between segments.

  • Use Case: Data with multiple inflection points or local trends, such as the relationship between age and salary.
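To make the idea of segments concrete, the sketch below inspects the B-spline basis that bs() from the splines package builds for a predictor running from 1 to 20. The choice of x and of df = 4 mirrors the example later in this document; the object name basis is only illustrative.

```r
library(splines)

x <- seq(1, 20)

# Cubic B-spline basis (degree = 3 is the default) with 4 degrees of freedom
basis <- bs(x, df = 4)

dim(basis)                      # one row per observation, one column per basis function
attr(basis, "knots")            # interior knot(s) chosen from the quantiles of x
attr(basis, "Boundary.knots")   # boundary knots, by default the range of x
```

With df = 4 and a cubic basis there is a single interior knot, so the fitted curve consists of two cubic pieces joined smoothly at that knot. Knots can also be given explicitly through the knots argument of bs() when their location is known from the subject matter.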

2 R Code to Compare the Three Models

Create a dataset of 20 points and apply the three models.

# Data creation
set.seed(123)
x <- seq(1, 20)
y <- 2 * x + rnorm(20, mean = 0, sd = 3) # Data suitable for linear regression

# Create a data frame
data <- data.frame(x = x, y = y)

# Linear regression model
linear_model <- lm(y ~ x, data = data)

# Polynomial regression model (2nd degree)
poly_model <- lm(y ~ poly(x, 2), data = data)

# Spline regression model
library(splines)
spline_model <- lm(y ~ bs(x, df = 4), data = data)

# Plotting
plot(x, y, main = "Regression Models Comparison", pch = 16)
abline(linear_model, col = "blue", lwd = 2) # Linear regression

# Polynomial regression curve
x_poly <- seq(min(x), max(x), length.out = 100)
y_poly <- predict(poly_model, newdata = data.frame(x = x_poly))
lines(x_poly, y_poly, col = "red", lwd = 2)

# Spline regression curve
y_spline <- predict(spline_model, newdata = data.frame(x = x_poly))
lines(x_poly, y_spline, col = "green", lwd = 2)

legend("topleft", legend = c("Linear", "Polynomial", "Spline"),
       col = c("blue", "red", "green"), lwd = 2)

3 Explanation of the Code

  1. Data Creation: x is the sequence of integers from 1 to 20, and y is 2 * x plus Gaussian noise with mean 0 and standard deviation 3. set.seed(123) makes the noise reproducible.

  2. Create a Data Frame: The data points x and y are combined into a data frame called data.

  3. Linear Regression Model: A linear regression model is created using lm(y ~ x, data = data).

  4. Polynomial Regression Model: A second-degree polynomial regression model is created using lm(y ~ poly(x, 2), data = data).

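  5. Spline Regression Model: A spline regression model is created using lm(y ~ bs(x, df = 4), data = data). The bs function from the splines package generates a B-spline basis; with its default cubic degree, df = 4 yields four basis functions and one interior knot, placed at the median of x.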

  6. Plotting: The data points are plotted using plot(x, y), and the fitted curves for each model are added with different colors:

    • Linear regression in blue (abline function).

    • Polynomial regression in red (lines function).

    • Spline regression in green (lines function).

  7. Adding a Legend: A legend is added using the legend function to indicate which color corresponds to which regression model.

This code allows you to visually compare the fits of linear regression, polynomial regression, and spline regression.
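A visual comparison can be complemented with a few numbers. The sketch below refits the same three models on the same simulated data and tabulates the number of coefficients, the adjusted R-squared, and the AIC. The choice of these particular criteria and the name comparison are assumptions made for illustration, not part of the original analysis.

```r
library(splines)

# Same simulated data and models as in the code above
set.seed(123)
x <- seq(1, 20)
y <- 2 * x + rnorm(20, mean = 0, sd = 3)
data <- data.frame(x = x, y = y)

linear_model <- lm(y ~ x, data = data)
poly_model   <- lm(y ~ poly(x, 2), data = data)
spline_model <- lm(y ~ bs(x, df = 4), data = data)

models <- list(Linear = linear_model, Polynomial = poly_model, Spline = spline_model)

# One row per model: complexity, goodness of fit, and a penalized criterion
comparison <- data.frame(
  model  = names(models),
  n_coef = sapply(models, function(m) length(coef(m))),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  row.names = NULL
)
print(comparison)

# F-test: does the quadratic term improve on the straight line?
anova(linear_model, poly_model)
```

Because the data were generated from a straight line, the more flexible models can only lower the residual sum of squares by fitting noise. Criteria that penalize extra coefficients, such as adjusted R-squared and AIC, are one way to check whether that extra flexibility is worth it.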

4 Example Using the airquality Dataset

The following R code demonstrates the differences between linear regression, polynomial regression, and spline regression models using the airquality dataset. This code uses Ozone as the dependent variable and Temp as the independent variable.

# Load necessary packages
library(gam)    # Generalized Additive Models package
library(splines) # Spline functions package

# Load the dataset
data("airquality")
airquality <- na.omit(airquality) # Remove missing values

# Linear regression model
linear_model <- lm(Ozone ~ Temp, data = airquality)

# Polynomial regression model (2nd degree)
poly_model <- lm(Ozone ~ poly(Temp, 2), data = airquality)

# Spline regression model
spline_model <- lm(Ozone ~ bs(Temp, df = 4), data = airquality)

# Prepare the plot
plot(airquality$Temp, airquality$Ozone, 
     main = "Regression Models Comparison", 
     pch = 16, xlab = "Temperature", ylab = "Ozone")

# Add the linear regression line
abline(linear_model, col = "blue", lwd = 2)

# Add the polynomial regression curve
temp_seq <- seq(min(airquality$Temp), max(airquality$Temp), length.out = 100)
pred_poly <- predict(poly_model, newdata = data.frame(Temp = temp_seq))
lines(temp_seq, pred_poly, col = "red", lwd = 2)

# Add the spline regression curve
pred_spline <- predict(spline_model, newdata = data.frame(Temp = temp_seq))
lines(temp_seq, pred_spline, col = "green", lwd = 2)

# Add a legend to the plot
legend("topleft", legend = c("Linear", "Polynomial", "Spline"),
       col = c("blue", "red", "green"), lwd = 2)

4.1 Explanation of the Code

  1. Load necessary packages:

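    • library(gam): Loads the gam package for Generalized Additive Models. The models in this example are fitted with lm, poly, and bs, so gam is not strictly required here.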

    • library(splines): Loads the package for spline functions.

  2. Load the dataset:

    • data("airquality"): Loads the airquality dataset.

    • airquality <- na.omit(airquality): Removes every row that has a missing value in any column, not only in Ozone and Temp, so rows with a missing Solar.R value are dropped as well.

  3. Linear regression model:

    • linear_model <- lm(Ozone ~ Temp, data = airquality): Fits a linear regression model with Ozone as the dependent variable and Temp as the independent variable.

  4. Polynomial regression model (2nd degree):

    • poly_model <- lm(Ozone ~ poly(Temp, 2), data = airquality): Fits a polynomial regression model of degree 2 with Ozone as the dependent variable and Temp as the independent variable.

  5. Spline regression model:

    • spline_model <- lm(Ozone ~ bs(Temp, df = 4), data = airquality): Fits a spline regression model with Ozone as the dependent variable and Temp as the independent variable. bs is used to create a B-spline basis for the spline regression.

  6. Prepare the plot:

    • plot(airquality$Temp, airquality$Ozone, ...): Creates a scatter plot of Temp vs. Ozone with appropriate labels and title.

  7. Add the linear regression line:

    • abline(linear_model, col = "blue", lwd = 2): Adds the linear regression line to the plot in blue.

  8. Add the polynomial regression curve:

    • temp_seq <- seq(min(airquality$Temp), max(airquality$Temp), length.out = 100): Creates a sequence of temperature values for prediction.

    • pred_poly <- predict(poly_model, newdata = data.frame(Temp = temp_seq)): Predicts Ozone values using the polynomial model.

    • lines(temp_seq, pred_poly, col = "red", lwd = 2): Adds the polynomial regression curve to the plot in red.

  9. Add the spline regression curve:

    • pred_spline <- predict(spline_model, newdata = data.frame(Temp = temp_seq)): Predicts Ozone values using the spline model.

    • lines(temp_seq, pred_spline, col = "green", lwd = 2): Adds the spline regression curve to the plot in green.

  10. Add a legend to the plot:

    • legend("topleft", legend = c("Linear", "Polynomial", "Spline"), col = c("blue", "red", "green"), lwd = 2): Adds a legend to the plot indicating which color corresponds to which regression model.

This code helps visualize the differences between linear, polynomial, and spline regression models on the airquality dataset.
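Beyond the plot, one way to compare the three models on this dataset is k-fold cross-validation, which estimates how well each model predicts observations it was not fitted to. The sketch below is an illustration under stated assumptions: the number of folds (5), the random seed, the names aq, folds, bk, and cv_rmse, and the decision to fix the spline's boundary knots to the full temperature range (so that held-out temperatures never fall outside them) are choices made here, not part of the original analysis.

```r
library(splines)

# Same preprocessing as above: drop rows with any missing value
aq <- na.omit(datasets::airquality)

set.seed(1)
k <- 5
# Randomly assign each row to one of k folds of roughly equal size
folds <- sample(rep(1:k, length.out = nrow(aq)))

# Fix the boundary knots to the full Temp range (an assumption of this sketch)
bk <- range(aq$Temp)

formulas <- list(
  Linear     = Ozone ~ Temp,
  Polynomial = Ozone ~ poly(Temp, 2),
  Spline     = Ozone ~ bs(Temp, df = 4, Boundary.knots = bk)
)

# For each model: fit on k - 1 folds, predict the held-out fold, average the error
cv_rmse <- sapply(formulas, function(f) {
  mse <- sapply(1:k, function(i) {
    fit  <- lm(f, data = aq[folds != i, ])
    pred <- predict(fit, newdata = aq[folds == i, ])
    mean((aq$Ozone[folds == i] - pred)^2)
  })
  sqrt(mean(mse))
})

print(round(cv_rmse, 2))
```

A lower cross-validated RMSE suggests better out-of-sample prediction. The values depend on the random fold assignment, so repeating the procedure with different seeds gives a sense of how stable the ranking is.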