Introduction to Polynomial Regression

Understanding and Coding in R

Linh Cao & Deidre Okeke

What is a Polynomial Regression?

Polynomial regression is a type of regression analysis that models a non-linear relationship between the predictor variable(s) and the response variable [1]. It is an extension of simple linear regression that allows for more complex relationships between the predictor and response variables [1].

In simple terms, it allows us to fit a curve to our data instead of a straight line.

When is a Polynomial Regression Used?

Polynomial regression is useful when the relationship between the independent and dependent variables is nonlinear.

It can capture more complex relationships than linear regression, making it suitable for cases where the data exhibits curvature.

Assumptions of Polynomial Regression

  1. Linearity: There is a curvilinear relationship between the independent variable(s) and the dependent variable.

  2. Independence: The observations (and hence the errors) are independent of one another.

  3. Homoscedasticity: The variance of the errors should be constant across all levels of the independent variable(s).

  4. Normality: The errors should be normally distributed with mean zero.

Mathematical Equation

The mathematical equation for a polynomial regression represents the relationship between the response variable (Y) and the predictor variable (X) as a polynomial function. The general formula is:

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i \]

Where:

  • \(y_i\) represents the response variable.

  • \(x_i\) represents the predictor variable.

  • \(\beta_0, \beta_1, \dots, \beta_d\) are the coefficients to be estimated.

  • \(\epsilon_i\) represents the error term.

For large degree d, polynomial regression can produce an extremely non-linear curve. In practice, it is uncommon to use d greater than 3, because the larger d becomes, the more overly flexible the polynomial curve becomes, which can create very strange shapes.

The coefficients of the polynomial can be estimated using least squares, because the model can be viewed as a standard linear model with predictors \(x_i, \,x_i^2, \,x_i^3, \dots, x_i^d\). Hence, polynomial regression is also known as polynomial linear regression.

Generally, this kind of regression is used with one response variable and one predictor.
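
To make this concrete, here is a minimal sketch (using the mtcars data introduced later) showing that a raw quadratic polynomial fit is just a linear model with the predictors \(x\) and \(x^2\); the variable names mpg and wt are only for illustration:

# A degree-2 polynomial fit is a linear model in wt and wt^2
fit_raw  <- lm(mpg ~ wt + I(wt^2), data = mtcars)
fit_poly <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)

# Both calls estimate the same coefficients by least squares
coef(fit_raw)
coef(fit_poly)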

Performing a Polynomial Regression in R

  • Step 0: Load required package(s)

  • Step 1: Load and inspect the data

  • Step 2: Visualize the data

  • Step 3: Split Data into Train and Test Set

  • Step 4: Fit Models

  • Step 5: Assess Assumptions

  • Step 6: Make Predictions & Evaluate The Models

  • Step 7: Visualize The Final Model on The Test Set

Let’s Practice!

Now let’s go through the steps to perform a polynomial regression in R. We’ll be using the lm() function to fit the polynomial regression model. This function comes standard in base R.

For this example, we are investigating the following:

  • Research Question: Is there a significant relationship between the weight of a car (wt) and its miles per gallon (mpg) in the mtcars dataset?

  • Null hypothesis (H0): There is no significant relationship between the weight of a car (wt) and its miles per gallon (mpg) in the mtcars dataset.

  • Alternative hypothesis (Ha): There is a significant relationship between the weight of a car (wt) and its miles per gallon (mpg) in the mtcars dataset.

In this case, the null hypothesis states that the coefficients of the polynomial terms are zero, indicating no relationship between the weight of the car and miles per gallon. The alternative hypothesis states that at least one of the polynomial coefficients is non-zero, indicating a significant relationship between the weight of the car and miles per gallon.

By performing the polynomial regression analysis and examining the model summary and coefficients, we can evaluate the statistical significance of the relationship and determine whether to reject or fail to reject the null hypothesis.
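
One way to carry out this comparison directly is a nested-model F-test with base R's anova() function; a small p-value is evidence against the null hypothesis. This is a sketch that assumes the two models fit in Step 4 below (model_1 and model_2):

# Nested-model F-test: does the quadratic term improve on the line?
# (run after fitting model_1 and model_2 in Step 4)
anova(model_1, model_2)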

Step 0: Install and load required packages

In R, we’ll use the lm() function, which ships with base R, to perform polynomial regression. We will also use the caret package to help streamline the process of creating predictive models; it contains the functions we need for data splitting and evaluation. Finally, since we want to visualize our data, we will load the ggplot2 package.

# For data visualization purposes
# install.packages("ggplot2")
library(ggplot2)

# For data splitting and model evaluation helpers
# install.packages("caret")
library(caret)

Step 1: Load and inspect the data

For this example, we will use the built-in mtcars dataset which is publicly available and contains information about various car models.

# Load mtcars dataset
data(mtcars)
# Print the first few rows
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Step 2: Visualize the data

Before fitting a polynomial regression model, it’s helpful to visualize the data to identify any non-linear patterns.

For our example, we will use a scatter plot to visualize the relationship between the independent and dependent variables:

# Scatter plot of mpg (dependent variable) vs. wt (independent variable)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (lbs/1000)", y = "Miles per Gallon") +
  theme_minimal()

Step 3: Split the data

You will need to split the data so that the competing models can be fit on one subset (training) and compared on held-out data (testing) to see which one fits better.

# Set seed for reproducibility
set.seed(123)

# Randomly split the dataset into training and testing sets
train_index <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
train_data <- mtcars[train_index, ]
test_data <- mtcars[-train_index, ]
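
As a quick sanity check (an optional aside, not part of the original workflow), mtcars has 32 rows, so a 70/30 split should leave roughly 24 rows for training and 8 for testing:

# Confirm the sizes of the training and test sets
nrow(train_data)
nrow(test_data)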

Step 4: Fit Models

Let’s create a helper function so we can build multiple models with lm(), generating the polynomial terms with poly(). We will build a standard linear model and a quadratic model (degrees 1 and 2, respectively).

# Function to fit and evaluate polynomial regression models
fit_poly_regression <- function(degree) {
  formula <- as.formula(paste("mpg ~ poly(wt, ", degree, ")"))
  model <- lm(formula, data = train_data)
  return(model)
}

# Fit polynomial regression models with degrees 1 to 2
model_1 <- fit_poly_regression(1)
summary(model_1)

Call:
lm(formula = formula, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9597 -2.1766 -0.2829  1.8810  6.0211 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   20.229      0.584  34.637  < 2e-16 ***
poly(wt, 1)  -26.481      2.861  -9.255 4.84e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.861 on 22 degrees of freedom
Multiple R-squared:  0.7957,    Adjusted R-squared:  0.7864 
F-statistic: 85.66 on 1 and 22 DF,  p-value: 4.841e-09
model_2 <- fit_poly_regression(2)
summary(model_2)

Call:
lm(formula = formula, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3649 -2.0190 -0.2274  1.4574  5.4672 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   20.2292     0.5482  36.902  < 2e-16 ***
poly(wt, 2)1 -26.4808     2.6855  -9.861 2.48e-09 ***
poly(wt, 2)2   5.3518     2.6855   1.993   0.0594 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.686 on 21 degrees of freedom
Multiple R-squared:  0.8282,    Adjusted R-squared:  0.8118 
F-statistic:  50.6 on 2 and 21 DF,  p-value: 9.312e-09

Note that poly() generates orthogonal polynomial terms by default, which is why the quadratic model’s coefficients are labelled poly(wt, 2)1 and poly(wt, 2)2 rather than wt and wt^2. The fitted curve is identical to one built from raw powers, but the orthogonal terms are uncorrelated, which keeps the coefficient estimates numerically stable.

Step 5: Assess Assumptions
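
Before trusting either model, we should check the assumptions listed earlier. As a minimal sketch, base R’s plot() method for fitted lm objects produces the standard diagnostic plots:

# Residuals vs. fitted (linearity, homoscedasticity), normal Q-Q
# (normality), scale-location, and residuals vs. leverage plots
par(mfrow = c(2, 2))
plot(model_2)
par(mfrow = c(1, 1))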

Step 6: Make Predictions and Evaluate The Models

We can use the fitted models to make predictions on the held-out test set and evaluate how well each one performs. Let’s write a helper that reports the RMSE, R-squared, and AIC for each model:

# Function to evaluate model performance on the test set
evaluate_model <- function(model, test_data) {
  predictions <- predict(model, newdata = test_data)
  rmse <- RMSE(predictions, test_data$mpg)
  r2 <- R2(predictions, test_data$mpg)
  aic <- AIC(model)
  print(rmse)
  print(r2)
  print(aic)
}

# Evaluate the linear model (degree 1): RMSE, R-squared, AIC
evaluate_model(model_1, test_data)
[1] 3.698258
[1] 0.6969344
[1] 122.4796
# Evaluate the quadratic model (degree 2): RMSE, R-squared, AIC
evaluate_model(model_2, test_data)
[1] 2.942631
[1] 0.8550308
[1] 120.3227

We select the final model with the lowest RMSE, the highest R-squared, and the lowest AIC. The quadratic regression wins on all three criteria; hence, it is our final model.

Step 7: Visualize The Final Model on The Test Set

Finally, let’s plot the scatter plot with the polynomial regression line to visualize the fit:

# Create a data frame with data points and predictions from the best-fit model
plot_data <- data.frame(wt = test_data$wt, mpg = test_data$mpg, 
                        Predicted = predict(model_2, newdata = test_data))

# Scatter plot with the polynomial regression line
ggplot(plot_data, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_line(aes(y = Predicted), color = "red", linewidth = 1) +
  labs(title = "Scatter Plot with Polynomial Regression Line",
       x = "Weight (wt)", y = "Miles per Gallon (mpg)") +
  theme_minimal()
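
With only a handful of test points, the connected prediction line can look jagged. As an optional refinement (a sketch, not part of the workflow above), we can instead predict over a fine grid of weights to draw a smooth fitted curve:

# Predict over a fine grid of weights for a smooth curve
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))
grid$Predicted <- predict(model_2, newdata = grid)

ggplot(test_data, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_line(data = grid, aes(y = Predicted), color = "red", linewidth = 1) +
  labs(title = "Quadratic Fit Evaluated on a Weight Grid",
       x = "Weight (wt)", y = "Miles per Gallon (mpg)") +
  theme_minimal()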

Further Discussion

  • Piecewise polynomials: Instead of fitting a high-degree polynomial over the entire range of X, piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of X. The coefficients \(\beta_i\) differ in different parts of the range of X. The points where the coefficients change are called knots. Using more knots leads to a more flexible piecewise polynomial [2].

  • Constraints and splines: the technique of reducing the degrees of freedom of a piecewise polynomial by constraining it to be continuous and smooth at the knots, producing a natural, smooth fit to the data [2]; see the sketch below.
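
As a hedged illustration of the spline idea (not part of this tutorial’s workflow), the splines package that ships with R lets lm() fit a cubic regression spline; the knot location below is arbitrary, chosen only for demonstration:

library(splines)

# Cubic B-spline fit for mpg vs. wt with one interior knot at wt = 3.3
# (knot placement is arbitrary here, purely for illustration)
spline_fit <- lm(mpg ~ bs(wt, knots = 3.3), data = mtcars)
summary(spline_fit)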