Understanding and Coding in R
Polynomial regression is a type of regression analysis that models a non-linear relationship between the predictor variable(s) and the response variable [1]. It is an extension of simple linear regression that allows for more complex relationships between the predictors and the response [1].
In simple terms, it allows us to fit a curve to our data instead of a straight line.
Polynomial regression is useful when the relationship between the independent and dependent variables is nonlinear.
It can capture more complex relationships than linear regression, making it suitable for cases where the data exhibits curvature.
Polynomial regression relies on the following assumptions:
Linearity: There is a curvilinear relationship between the independent variable(s) and the dependent variable.
Independence: The observations, and therefore the errors, are independent of one another.
Homoscedasticity: The variance of the errors is constant across all levels of the independent variable(s).
Normality: The errors are normally distributed with mean zero.
The mathematical equation for a polynomial regression represents the relationship between the response variable (Y) and the predictor variable (X) as a polynomial function. The general formula is:
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i \]
Where:
\(y_i\) represents the response variable.
\(x_i\) represents the predictor variable.
\(\beta_0, \beta_1, \dots, \beta_d\) are the coefficients to be estimated.
\(\epsilon_i\) represents the error term.
For a large degree \(d\), polynomial regression can produce an extremely non-linear curve. In practice, however, it is not common to use \(d\) greater than 3, because the larger \(d\) becomes, the more overly flexible the polynomial curve becomes, which can create very strange shapes.
The coefficients of the polynomial function can be estimated using least squares, because the model can be viewed as a standard linear model with predictors \(x_i, \,x_i^2, \,x_i^3, \dots, x_i^d\). Hence, polynomial regression is also known as polynomial linear regression.
Generally, this kind of regression is used with one response variable and one predictor.
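To make this "linear model in disguise" point concrete, here is a minimal sketch in R showing that a quadratic fit via poly() with raw = TRUE is the same linear model as one with an explicit squared predictor (it uses the mtcars variables from the example below):

# A quadratic polynomial regression is an ordinary linear model
# whose predictors are x and x^2
fit_a <- lm(mpg ~ wt + I(wt^2), data = mtcars)            # explicit squared term
fit_b <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars) # same model via poly()
all.equal(unname(coef(fit_a)), unname(coef(fit_b)))       # TRUE: identical coefficients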
Step 0: Load required package(s)
Step 1: Load and inspect the data
Step 2: Visualize the data
Step 3: Split Data into Train and Test Set
Step 4: Fit Models
Step 5: Assess Assumptions
Step 6: Make Predictions & Evaluate The Models
Step 7: Visualize The Final Model on The Test Set
Now let’s go through the steps to perform a polynomial regression in R. We’ll be using the lm() function to fit the polynomial regression model. This function comes standard in base R.
For this example, we are investigating the following:
Research Question: Is there a significant relationship between the weight of a car (wt) and its miles per gallon (mpg) in the mtcars dataset?
Null hypothesis (H0): There is no significant relationship between the weight of a car (wt) and its miles per gallon (mpg) in the mtcars dataset.
Alternative hypothesis (Ha): There is a significant relationship between the weight of a car (wt) and its miles per gallon (mpg) in the mtcars dataset.
In this case, the null hypothesis assumes that the coefficients of the polynomial terms are all zero, indicating no relationship between the weight of the car and miles per gallon. The alternative hypothesis suggests that at least one of the polynomial terms is non-zero, indicating a significant relationship between the weight of the car and miles per gallon.
By performing the polynomial regression analysis and examining the model summary and coefficients, we can evaluate the statistical significance of the relationship and determine whether to reject or fail to reject the null hypothesis.
In R, we’ll use the lm() function from the base package to perform polynomial regression. We will also use the caret package to streamline the process of creating predictive models; it contains the functions we need for data splitting. Finally, since we want to visualize our data, we will also load the ggplot2 package.
For this example, we will use the built-in mtcars dataset which is publicly available and contains information about various car models.
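A minimal sketch of Steps 0 and 1 (the package names are the ones mentioned above; mtcars ships with base R, so no download is needed):

# Step 0: load required packages
library(caret)    # data splitting and model-evaluation helpers
library(ggplot2)  # visualization

# Step 1: load and inspect the data
data(mtcars)  # built-in dataset
head(mtcars)  # first six rows, shown below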
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Before fitting a polynomial regression model, it’s helpful to visualize the data to identify any non-linear patterns.
For our example, we will use a scatter plot to visualize the relationship between the independent and dependent variables:
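A sketch of that plot (the axis labels are my own choice):

# Step 2: scatter plot of weight vs. miles per gallon
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (wt)", y = "Miles per Gallon (mpg)") +
  theme_minimal()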
We will split the data into training and test sets so that we can fit competing models on one set and compare their fit on the other.
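The original does not show the splitting code; a sketch using caret's createDataPartition() follows. The 75/25 split and the seed are assumptions, although 75% of the 32 rows matches the 24 training observations implied by the 22 residual degrees of freedom in the summaries below.

# Step 3: split the data into training and test sets
set.seed(123)  # arbitrary seed for reproducibility
train_index <- createDataPartition(mtcars$mpg, p = 0.75, list = FALSE)
train_data <- mtcars[train_index, ]
test_data  <- mtcars[-train_index, ]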
Let’s create a function so we can build multiple models. We will build a standard linear model and a quadratic model (degrees 1 and 2, respectively).
# Function to fit and evaluate polynomial regression models
fit_poly_regression <- function(degree) {
  formula <- as.formula(paste0("mpg ~ poly(wt, ", degree, ")"))
  model <- lm(formula, data = train_data)
  return(model)
}
# Fit polynomial regression models with degrees 1 to 2
model_1 <- fit_poly_regression(1)
summary(model_1)
Call:
lm(formula = formula, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-3.9597 -2.1766 -0.2829 1.8810 6.0211
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.229 0.584 34.637 < 2e-16 ***
poly(wt, 1) -26.481 2.861 -9.255 4.84e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.861 on 22 degrees of freedom
Multiple R-squared: 0.7957, Adjusted R-squared: 0.7864
F-statistic: 85.66 on 1 and 22 DF, p-value: 4.841e-09
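The quadratic model is fit the same way; this step is implicit in the original and reconstructed here from the summary output that follows:

model_2 <- fit_poly_regression(2)
summary(model_2)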
Call:
lm(formula = formula, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-3.3649 -2.0190 -0.2274 1.4574 5.4672
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.2292 0.5482 36.902 < 2e-16 ***
poly(wt, 2)1 -26.4808 2.6855 -9.861 2.48e-09 ***
poly(wt, 2)2 5.3518 2.6855 1.993 0.0594 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.686 on 21 degrees of freedom
Multiple R-squared: 0.8282, Adjusted R-squared: 0.8118
F-statistic: 50.6 on 2 and 21 DF, p-value: 9.312e-09
Both models are fit with the lm() function, with the polynomial terms created by the poly() function. Note that poly() generates orthogonal polynomial terms by default, which is why the degree-1 coefficient (-26.48) is identical in the two summaries. The quadratic model has the higher adjusted R-squared on the training data (0.8118 vs. 0.7864), although its quadratic term is only marginally significant (p = 0.0594).
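The outline's Step 5 (assess assumptions) has no code in the original; a minimal sketch using base R's standard diagnostic plots for the quadratic model:

# Step 5: residual diagnostics for the quadratic model
# The residuals-vs-fitted plot checks linearity and homoscedasticity;
# the Q-Q plot checks normality of the errors
par(mfrow = c(2, 2))
plot(model_2)
par(mfrow = c(1, 1))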
We can use the fitted models to make predictions on the test set and compare their performance. For each model we compute the test-set RMSE and R-squared, along with the AIC:
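The code that produced the six numbers below is missing from the original; here is a sketch consistent with their order (RMSE, R-squared, and AIC for the linear model, then the same three for the quadratic model), using caret's RMSE() and R2() helpers:

# Step 6: evaluate both models on the test set
pred_1 <- predict(model_1, newdata = test_data)
pred_2 <- predict(model_2, newdata = test_data)

print(RMSE(pred_1, test_data$mpg))  # test RMSE, linear model
print(R2(pred_1, test_data$mpg))    # test R-squared, linear model
print(AIC(model_1))                 # AIC, linear model

print(RMSE(pred_2, test_data$mpg))  # test RMSE, quadratic model
print(R2(pred_2, test_data$mpg))    # test R-squared, quadratic model
print(AIC(model_2))                 # AIC, quadratic model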
[1] 3.698258
[1] 0.6969344
[1] 122.4796
[1] 2.942631
[1] 0.8550308
[1] 120.3227
We select the final model with the lowest RMSE, lowest AIC, and highest R-squared. The quadratic regression wins on all three criteria (RMSE 2.94 vs. 3.70, R-squared 0.86 vs. 0.70, AIC 120.3 vs. 122.5); hence, it is our final model.
Finally, let’s plot the scatter plot with the polynomial regression line to visualize the fit:
# Create a data frame with test-set points and predictions from the best-fit
# model, sorted by weight so the fitted line draws cleanly from left to right
plot_data <- data.frame(wt = test_data$wt, mpg = test_data$mpg,
                        Predicted = predict(model_2, newdata = test_data))
plot_data <- plot_data[order(plot_data$wt), ]

# Scatter plot with the polynomial regression line
ggplot(plot_data, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_line(aes(y = Predicted), color = "red", linewidth = 1) +
  labs(title = "Scatter Plot with Polynomial Regression Line",
       x = "Weight (wt)", y = "Miles per Gallon (mpg)") +
  theme_minimal()

Piecewise polynomials: Instead of fitting a high-degree polynomial over the entire range of X, piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of X. The coefficients \(\beta_i\) differ in different parts of the range of X. The points where the coefficients change are called knots. Using more knots leads to a more flexible piecewise polynomial [2].
Constraints and splines: the technique of reducing the degrees of freedom of a piecewise polynomial so that it produces a continuous and naturally smooth fit to the data [2].
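To illustrate, a minimal sketch fitting a cubic regression spline with the splines package (the single knot at wt = 3.3 is an arbitrary choice for demonstration):

library(splines)  # ships with base R

# Cubic spline: piecewise cubic polynomials joined smoothly at the knot
spline_model <- lm(mpg ~ bs(wt, knots = c(3.3), degree = 3), data = mtcars)
summary(spline_model)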