2024-03-21

INTRODUCTION

  • Polynomial regression is a machine learning technique that models the relationship between an independent variable, x, and a dependent variable, y, as an nth degree polynomial in x.
  • Polynomial regression allows for a more flexible curve than linear regression, making it suitable for modeling more complex relationships between variables. By adjusting the degree of the polynomial, you can control the curvature of the fitted line or curve, allowing for a better fit to the data.
  • Polynomial regression can be particularly useful when the relationship between the independent and dependent variables is known to be non-linear. Examples include trends in disease spread, environmental pollutant levels, and stock market behavior, where the effect of a variable can accelerate or decelerate.

Equation for Regression

\[ Y=a+bX \]

  • Y = Dependent Variable
  • X = Explanatory Variable
  • a = Intercept
  • b = Slope of the line
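
As a quick illustration, the linear model above can be fit in R with lm(). This is a minimal sketch using the built-in mtcars data, with hp taken (as an assumed example) for Y and qsec for X:

# Simple linear regression: hp modeled as a linear function of qsec
linear_fit <- lm(hp ~ qsec, data = mtcars)

# Intercept (a) and slope (b) from Y = a + bX
coef(linear_fit)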

Equation for Polynomial Regression

\[ Y = \beta_0 + \beta_1X + \beta_2X^2 + \ldots + \beta_hX^h + \epsilon \]

  • Y = Response Variable
  • X = Predictor Variable
  • β0, β1, …, βh = Regression Coefficients
  • h = Degree of the Polynomial
  • ε = Random Error Term (epsilon)

To estimate the equation above, we only need the response variable (Y) and the predictor variable (X). In practice, however, polynomial regression models may include other predictor variables as well, which can introduce interaction terms. The basic equation above is therefore a relatively simple model, but you can imagine how the model can grow depending on your situation. A minimal way to fit it in R is sketched below.
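
As a minimal sketch, the model above can be estimated in R with lm() and poly(). The data below are simulated purely for illustration; raw = TRUE makes the reported coefficients correspond directly to β0, β1, …, βh in the equation above:

# Simulated example data (assumed for illustration)
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 1.5 * x - 0.3 * x^2 + rnorm(100, sd = 2)
df <- data.frame(x = x, y = y)

# Fit a degree-2 polynomial regression of y on x
poly_fit <- lm(y ~ poly(x, 2, raw = TRUE), data = df)
summary(poly_fit)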

Example of Polynomial Regression

Interpretation for the example

The data points (shown in blue) depict the relationship between X and Y, which appears to follow a non-linear pattern. The orange line represents the fitted polynomial regression model, indicating the best-fit curve through the data points. This curve is a polynomial function of the independent variable, and it is clear from the plot that the degree of the polynomial is greater than one, since the curve is not a straight line.

The plot also includes a legend that differentiates between the actual data points and the polynomial regression line, making it clear to the viewer what each component of the plot represents.
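
The code that produced this figure is not shown in this section; a minimal sketch that generates a comparable plot in base R (simulated data, blue points, an orange degree-2 fit, and a legend) might look like this:

# Simulated non-linear data (assumed for illustration)
set.seed(42)
x <- seq(0, 10, by = 0.25)
y <- 5 + 2 * x - 0.15 * x^2 + rnorm(length(x), sd = 2)

# Fit a degree-2 polynomial and plot the data with the fitted curve
fit <- lm(y ~ poly(x, 2, raw = TRUE))
plot(x, y, col = "blue", pch = 16, main = "Example of Polynomial Regression")
lines(x, predict(fit), col = "orange", lwd = 2)
legend("topleft", legend = c("Data points", "Polynomial fit"),
       col = c("blue", "orange"), pch = c(16, NA), lty = c(NA, 1))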

Plotly plot for Polynomial Regression

Interpretation for the Plotly

The plot shows data points scattered across the graph with a red curve superimposed. The X and Y axes represent the independent and dependent variables, respectively. The dispersion of the black data points suggests a positive association between these variables, with noticeable deviation from a straight-line relationship. The red curve is the fitted polynomial regression model: it captures the overall trend of the data points, indicating that a third-degree (cubic) polynomial models the relationship better than a straight line would.
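
The code for the Plotly figure is likewise not reproduced here; a minimal sketch that builds a comparable plot with the plotly R package (black markers plus a red cubic fit, on assumed simulated data) could be:

library(plotly)

# Simulated data with a non-linear trend (assumed for illustration)
set.seed(7)
x <- sort(runif(80, 0, 10))
y <- 1 + 0.8 * x - 0.6 * x^2 + 0.05 * x^3 + rnorm(80, sd = 3)

# Fit a cubic (degree-3) polynomial regression
cubic_fit <- lm(y ~ poly(x, 3, raw = TRUE))

# Black data points with the fitted cubic curve overlaid in red
plot_ly(x = x, y = y, type = "scatter", mode = "markers",
        marker = list(color = "black"), name = "Data") %>%
  add_lines(x = x, y = fitted(cubic_fit),
            line = list(color = "red"), name = "Cubic fit")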

GGPLOT1 for Polynomial Regression

Interpretation for GGPLOT1

The black dots represent individual data points. There is a cluster of points in the lower-left corner, with some spread toward the upper-right corner, suggesting a positive correlation between the variables on the x and y axes. The spread of the points, however, indicates that the relationship is not perfectly linear.

The red line represents the polynomial regression fit, here a cubic (degree-3) curve, since the underlying model fits a third-degree polynomial. The line captures the general trend of the data points, suggesting that a polynomial function models the relationship between the variables better than a straight line would.
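
The code behind GGPLOT1 is not shown in this section either; a sketch of how such a plot could be produced with ggplot2 (black points with a red degree-3 fit, on assumed simulated data) is:

library(ggplot2)

# Simulated data (assumed for illustration)
set.seed(99)
df <- data.frame(x = runif(60, 0, 10))
df$y <- 2 + 0.5 * df$x + 0.08 * df$x^2 + rnorm(60, sd = 2)

# Scatterplot of black points with a degree-3 polynomial fit drawn in red
ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3),
              se = FALSE, color = "red") +
  theme_minimal()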

GGPLOT2 for Polynomial Regression

Interpretation for GGPLOT2

The independent variable ‘qsec’ is plotted on the horizontal axis, and the dependent variable ‘hp’ is on the vertical axis. The blue line depicts the polynomial regression model, which is designed to capture the relationship between ‘qsec’ and ‘hp’. The model suggests a non-linear relationship: as ‘qsec’ increases, ‘hp’ initially decreases, then increases, and finally decreases again, pointing to a complex, possibly higher-order polynomial relationship.

The shaded area around the polynomial line represents the confidence interval: the range within which the true regression curve is expected to lie at a given confidence level (95% by default in ggplot2). This shaded region is fairly wide, indicating a higher degree of uncertainty or variability in the model’s predictions.

Code for previous graph

library(ggplot2)

# Simulate data with a curved (quadratic) relationship between qsec and hp
set.seed(123)
qsec <- seq(14, 24, by = 0.2)
hp <- 300 + rnorm(length(qsec), mean = 0, sd = 50) - 8 * qsec + 0.3 * qsec^2
df <- data.frame(qsec = qsec, hp = hp)

# Scatterplot with a degree-3 polynomial regression fit and its confidence band
ggplot(df, aes(x = qsec, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = TRUE) +
  theme_minimal()

Resources