Comparison of Linear Regression, Polynomial Regression, and Spline Regression
Author
Takafumi Kubota
Published
May 31, 2024
Abstract
This document compares linear, polynomial, and spline regression models using R. It evaluates their performance and suitability for fitting data, highlighting each model’s flexibility and application.
Keywords
R, regression, linear, polynomial, spline
1 Differences between Linear Regression, Polynomial Regression, and Spline Regression
1.1 Linear Regression
Definition: The most basic form of regression analysis that fits a straight line to the data points.
Features: Simple and fast to compute. Effective when the relationship between variables is linear.
Use Case: Situations where there’s a straight-line relationship between two variables, such as sales and advertising expenditure.
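As a minimal sketch of the definition above, the following R code fits a straight line to simulated data and reads off the estimated intercept and slope. The variable names (ad_spend, sales) and the true coefficients (5 and 0.8) are hypothetical values chosen only for illustration.

```r
# Simulated (hypothetical) data: sales depend linearly on advertising spend
set.seed(1)
ad_spend <- seq(10, 100, by = 10)
sales <- 5 + 0.8 * ad_spend + rnorm(length(ad_spend), mean = 0, sd = 2)

# Fit a straight line: sales = b0 + b1 * ad_spend
fit <- lm(sales ~ ad_spend)

# Estimated intercept (b0) and slope (b1)
coef(fit)

# Proportion of the variance in sales explained by the line
summary(fit)$r.squared
```

The slope estimate should land close to the value used in the simulation, which is the behavior to expect when the underlying relationship really is linear.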
1.2 Polynomial Regression
Definition: An extension of linear regression that fits a polynomial (e.g., quadratic, cubic) to the data points.
Features: Can handle nonlinear trends in the data while still being fitted as a linear model in its coefficients. A higher polynomial degree gives more flexibility but also increases the risk of overfitting.
Use Case: When data shows a curved trend, such as stock market movements.
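The trade-off between flexibility and overfitting can be sketched as follows. This example assumes the same kind of simulated linear data used later in this document and compares polynomial degrees 1, 2, 5, and 10; the choice of degrees and the helper object results are illustrative assumptions. The training error can only go down as the degree increases, whereas the leave-one-out cross-validation error (computed here from the hat values, which is exact for least-squares fits) typically does not keep improving.

```r
set.seed(123)
x <- seq(1, 20)
y <- 2 * x + rnorm(20, mean = 0, sd = 3)
dat <- data.frame(x = x, y = y)

degrees <- c(1, 2, 5, 10)  # illustrative choice of polynomial degrees

# For each degree: training residual sum of squares and leave-one-out CV error
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = dat)
  h <- hatvalues(fit)  # leverage of each observation
  c(degree = d,
    train_rss = sum(residuals(fit)^2),
    loocv_mse = mean((residuals(fit) / (1 - h))^2))
}))

results
```

Comparing the train_rss and loocv_mse columns row by row gives a rough indication of the degree at which extra flexibility stops paying off.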
1.3 Spline Regression
Definition: A method that divides the range of the predictor into several segments at points called knots and fits a low-degree polynomial to each segment.
Features: Suitable for fitting nonlinear and complex data. The polynomial pieces are constrained to join smoothly at the knots, which gives local flexibility without abrupt jumps between segments.
Use Case: Data with multiple inflection points or local trends, such as the relationship between age and salary.
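To make the idea of segments and knots concrete, here is a small sketch using bs() from the splines package. It inspects where the knots are placed when only df is given, and shows how knots can be supplied explicitly; the knot positions 7 and 14 are a hypothetical choice for illustration.

```r
library(splines)

x <- seq(1, 20)

# Cubic B-spline basis with 4 degrees of freedom (no intercept column)
basis <- bs(x, df = 4)

dim(basis)                      # one row per observation, one column per basis function
attr(basis, "knots")            # interior knot(s), chosen from quantiles of x
attr(basis, "Boundary.knots")   # the range of x

# Knots can also be placed explicitly (hypothetical positions)
basis_manual <- bs(x, knots = c(7, 14))
ncol(basis_manual)              # degree (3) + number of interior knots (2)
```

With the default cubic degree, df = 4 leaves room for a single interior knot, so the fitted curve consists of two cubic pieces joined smoothly at that knot.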
2 R Code to Compare the Three Models
The following code creates a dataset of 20 points and fits the three models to it.
# Data creation
set.seed(123)
x <- seq(1, 20)
y <- 2 * x + rnorm(20, mean = 0, sd = 3)  # Data suitable for linear regression

# Create a data frame
data <- data.frame(x = x, y = y)

# Linear regression model
linear_model <- lm(y ~ x, data = data)

# Polynomial regression model (2nd degree)
poly_model <- lm(y ~ poly(x, 2), data = data)

# Spline regression model
library(splines)
spline_model <- lm(y ~ bs(x, df = 4), data = data)

# Plotting
plot(x, y, main = "Regression Models Comparison", pch = 16)
abline(linear_model, col = "blue", lwd = 2)  # Linear regression

# Polynomial regression curve
x_poly <- seq(min(x), max(x), length.out = 100)
y_poly <- predict(poly_model, newdata = data.frame(x = x_poly))
lines(x_poly, y_poly, col = "red", lwd = 2)

# Spline regression curve
y_spline <- predict(spline_model, newdata = data.frame(x = x_poly))
lines(x_poly, y_spline, col = "green", lwd = 2)

legend("topleft", legend = c("Linear", "Polynomial", "Spline"),
       col = c("blue", "red", "green"), lwd = 2)
3 Explanation of the Code
Data Creation: x is the sequence of integers from 1 to 20, and y is 2 * x plus Gaussian noise with mean 0 and standard deviation 3. Calling set.seed(123) first makes the noise reproducible.
Create a Data Frame: The data points x and y are combined into a data frame called data.
Linear Regression Model: A linear regression model is created using lm(y ~ x, data = data).
Polynomial Regression Model: A second-degree polynomial regression model is created using lm(y ~ poly(x, 2), data = data).
Spline Regression Model: A spline regression model is created using lm(y ~ bs(x, df = 4), data = data), where bs generates a B-spline basis. With the default cubic degree, df = 4 yields four basis functions and a single interior knot.
Plotting: The data points are plotted using plot(x, y), and the fitted curves for each model are added with different colors:
Linear regression in blue (abline function).
Polynomial regression in red (lines function).
Spline regression in green (lines function).
Adding a Legend: A legend is added using the legend function to indicate which color corresponds to which regression model.
This code allows you to visually compare the fits of linear regression, polynomial regression, and spline regression.
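A visual comparison can be complemented with a few numerical summaries. The sketch below refits the same three models and compares them with AIC, adjusted R-squared, and an F-test; which criteria to use is an assumption made here for illustration, not something prescribed by the code above.

```r
library(splines)

# Recreate the simulated data and the three models
set.seed(123)
x <- seq(1, 20)
y <- 2 * x + rnorm(20, mean = 0, sd = 3)
data <- data.frame(x = x, y = y)

linear_model <- lm(y ~ x, data = data)
poly_model   <- lm(y ~ poly(x, 2), data = data)
spline_model <- lm(y ~ bs(x, df = 4), data = data)

# AIC: lower values indicate a better trade-off between fit and complexity
aic_table <- AIC(linear_model, poly_model, spline_model)
aic_table

# Adjusted R-squared penalizes additional parameters
adj_r2 <- sapply(
  list(linear = linear_model, polynomial = poly_model, spline = spline_model),
  function(m) summary(m)$adj.r.squared
)
adj_r2

# F-test: does the quadratic term improve on the straight line?
anova(linear_model, poly_model)
```

Because the data were generated from a straight line, the more flexible models would not be expected to offer a meaningful improvement once their extra parameters are taken into account.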
4 Example with the airquality Data
The following R code demonstrates the differences between linear regression, polynomial regression, and spline regression models using the airquality dataset. This code uses Ozone as the dependent variable and Temp as the independent variable.
Warning: package 'gam' was built under R version 4.3.1
Loading required package: foreach
Loaded gam 1.22-3
library(splines)  # Spline functions package

# Load the dataset
data("airquality")
airquality <- na.omit(airquality)  # Remove missing values

# Linear regression model
linear_model <- lm(Ozone ~ Temp, data = airquality)

# Polynomial regression model (2nd degree)
poly_model <- lm(Ozone ~ poly(Temp, 2), data = airquality)

# Spline regression model
spline_model <- lm(Ozone ~ bs(Temp, df = 4), data = airquality)

# Prepare the plot
plot(airquality$Temp, airquality$Ozone, main = "Regression Models Comparison",
     pch = 16, xlab = "Temperature", ylab = "Ozone")

# Add the linear regression line
abline(linear_model, col = "blue", lwd = 2)

# Add the polynomial regression curve
temp_seq <- seq(min(airquality$Temp), max(airquality$Temp), length.out = 100)
pred_poly <- predict(poly_model, newdata = data.frame(Temp = temp_seq))
lines(temp_seq, pred_poly, col = "red", lwd = 2)

# Add the spline regression curve
pred_spline <- predict(spline_model, newdata = data.frame(Temp = temp_seq))
lines(temp_seq, pred_spline, col = "green", lwd = 2)

# Add a legend to the plot
legend("topleft", legend = c("Linear", "Polynomial", "Spline"),
       col = c("blue", "red", "green"), lwd = 2)
4.1 Explanation
Load necessary packages:
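library(gam): Loads the package for Generalized Additive Models. The code shown above does not call any gam functions, so this package is optional here.
library(splines): Loads the splines package, which provides the bs function used for the spline model.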
Load the dataset:
data("airquality"): Loads the airquality dataset.
airquality <- na.omit(airquality): Removes every row that has a missing value in any column, including columns that are not used in the models.
Linear regression model:
linear_model <- lm(Ozone ~ Temp, data = airquality): Fits a linear regression model with Ozone as the dependent variable and Temp as the independent variable.
Polynomial regression model (2nd degree):
poly_model <- lm(Ozone ~ poly(Temp, 2), data = airquality): Fits a polynomial regression model of degree 2 with Ozone as the dependent variable and Temp as the independent variable.
Spline regression model:
spline_model <- lm(Ozone ~ bs(Temp, df = 4), data = airquality): Fits a spline regression model with Ozone as the dependent variable and Temp as the independent variable. bs is used to create a B-spline basis for the spline regression.
Prepare the plot:
plot(airquality$Temp, airquality$Ozone, ...): Creates a scatter plot of Ozone against Temp with appropriate labels and title.
Add the linear regression line:
abline(linear_model, col = "blue", lwd = 2): Adds the linear regression line to the plot in blue.
Add the polynomial regression curve:
temp_seq <- seq(min(airquality$Temp), max(airquality$Temp), length.out = 100): Creates a sequence of temperature values for prediction.
pred_poly <- predict(poly_model, newdata = data.frame(Temp = temp_seq)): Predicts Ozone values using the polynomial model.
lines(temp_seq, pred_poly, col = "red", lwd = 2): Adds the polynomial regression curve to the plot in red.
Add the spline regression curve:
pred_spline <- predict(spline_model, newdata = data.frame(Temp = temp_seq)): Predicts Ozone values using the spline model.
lines(temp_seq, pred_spline, col = "green", lwd = 2): Adds the spline regression curve to the plot in green.
Add a legend to the plot:
legend("topleft", legend = c("Linear", "Polynomial", "Spline"), col = c("blue", "red", "green"), lwd = 2): Adds a legend to the plot indicating which color corresponds to which regression model.
This code helps visualize the differences between linear, polynomial, and spline regression models on the airquality dataset.
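The visual impression can be backed up with a few numbers. The following sketch refits the three models on the complete cases of airquality and tabulates the residual standard error, adjusted R-squared, and AIC for each; the object names aq, models, and fit_summary are introduced here purely for illustration.

```r
library(splines)

data("airquality")
aq <- na.omit(airquality)  # keep complete cases only

linear_model <- lm(Ozone ~ Temp, data = aq)
poly_model   <- lm(Ozone ~ poly(Temp, 2), data = aq)
spline_model <- lm(Ozone ~ bs(Temp, df = 4), data = aq)

# Collect fit statistics for each model (one column per model)
models <- list(linear = linear_model, polynomial = poly_model, spline = spline_model)
fit_summary <- sapply(models, function(m) {
  s <- summary(m)
  c(sigma = s$sigma, adj_r2 = s$adj.r.squared, aic = AIC(m))
})
round(fit_summary, 3)

# F-test: does the quadratic term improve on the straight line?
anova(linear_model, poly_model)
```

A lower residual standard error and AIC, together with a higher adjusted R-squared, would suggest that the extra flexibility is capturing real curvature in the relationship between Temp and Ozone and not just noise.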