March 17, 2024

Regression Lines

  • In statistics, regression lines are a fundamental tool for analyzing data trends.
  • They represent the equation of a function that best approximates a given set of data.
  • They have many uses:
    • Determining Error
    • Visualizing data trends
    • Finding values between data points
    • Predicting end behavior

Types of Parent Functions

  • While there are infinite potential parent functions, here are some of the most commonly used:
    • Linear: \(y=mx+b\)
    • Quadratic: \(y=ax^2+bx+c\)
    • Exponential: \(y=a^x\)
    • Logarithmic: \(y=\log_a(x)\)
    • Polynomial: \(y=a_nx^n+\dots+a_1x+a_0\)
  • Most data follows one of these shapes, so these parent functions can model the majority of trends you will encounter

Misapplication of Parent Functions

  • When picking a parent function, it's often best to look at your data by hand first

  • As you can see, this exponential data doesn't have a good linear fit
  • Once a parent function is picked, a computer can find the best-fit values, but choosing the function itself is easiest for humans; a rough sketch of this kind of visual check follows below
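
A minimal R sketch of this kind of check, using made-up exponential-looking data (the numbers here are assumptions for illustration, not the original example), might look like:

# Made-up data that grows roughly exponentially (illustrative values only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 8, 17, 31, 65)

# Force a linear parent function onto it
linear_fit <- lm(y ~ x)

# Plot the data with the linear fit; the curve in the points shows the mismatch
plot(x, y, pch = 19, main = "Exponential-looking data with a linear fit")
abline(linear_fit, col = "red")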

The Computer's Process Behind the Scenes

  • For the line of best fit, computers use the least squares method
  • This states that a line fits best when the sum of the squared distances from the data points to the line is minimized.
  • This makes sense: you want the data to be close to the line at every point, and squaring the distances penalizes points that fall far from a candidate line
  • For the simple linear case, m and b have an exact closed-form solution:

\(m=\frac{n\sum xy-\sum x\sum y}{n\sum x^2-(\sum x)^2}\)

\(b=\frac{\sum y-m\sum x}{n}\)

  • Both formulas are derived from that same minimization idea, which is why \(m\) uses \(n\sum x^2-(\sum x)^2\) in its denominator; the derivation is sketched below
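
In more detail, the formulas come from writing the total squared error \(S\) and setting both of its partial derivatives to zero:

\(S(m,b)=\sum_{i=1}^{n}(y_i-mx_i-b)^2\)

\(\frac{\partial S}{\partial m}=-2\sum x_i(y_i-mx_i-b)=0 \qquad \frac{\partial S}{\partial b}=-2\sum (y_i-mx_i-b)=0\)

Solving these two equations simultaneously for m and b gives exactly the expressions above.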

Implementation of Least Squares

Here is a coded example of the least squares algorithm in R:

# Example data points
x <- c(0, 1, 1, 2, 2.5, 4, 5, 6, 7)
y <- c(3, 1, 2.5, 4, 3, 5, 7, 5, 8)
n <- length(x)

# Closed-form least squares estimates of the slope and intercept
m <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - (sum(x))^2)
b <- (sum(y) - m * sum(x)) / n

print(sprintf("Our final equation is y = %.4f x + %.4f", m, b))
## [1] "Our final equation is y = 0.7934 x + 1.7653"
  • As you can see, the data gives us the equation y = 0.7934 x + 1.7653; a quick sanity check against R's built-in fitter follows below
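
As a sanity check, R's built-in lm() fitter should recover essentially the same slope and intercept from the same x and y vectors:

# Ordinary least squares via R's built-in fitter; coefficients should match b and m above
lm_fit <- lm(y ~ x)
coef(lm_fit)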

Visualization of Our Output

  • In the last step we got our equation; now let's see how accurate it is

  • As you can see from the plot, the best-fit line follows the data quite well; one way to reproduce the plot is sketched below
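
The original plot is not included here, but one plausible way to reproduce it from the values computed earlier (an assumption about how the figure was made, not the author's exact code) is:

# Scatter plot of the data with the fitted line overlaid
plot(x, y, pch = 19, main = "Data with least squares line")
abline(a = b, b = m, col = "blue")   # intercept b and slope m from the earlier calculation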

Advanced Linear Regression

  • For more complicated parent functions, regression uses the \(\chi^2\) method along with number scaling, essentially a large-scale parameter search
  • The \(\chi^2\) method is the same idea we've seen before: it tries to minimize the squared distance from the line
  • The search is a uniquely computational step: the software runs through huge numbers of candidate parameter values and records which ones lead to the smallest \(\chi^2\) value
  • This is the approach modern regression software takes for these more complicated fits; a small example in R follows below
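
As one concrete illustration of that idea, R's built-in nls() fits a nonlinear parent function by iteratively adjusting parameter values to shrink the squared error; the exponential model, data, and starting values below are assumptions made for this sketch, not the author's example:

# Synthetic data following an exponential parent function, y = a * exp(r * x), plus noise
set.seed(1)
x_exp <- seq(0, 5, by = 0.5)
y_exp <- 2 * exp(0.6 * x_exp) + rnorm(length(x_exp), sd = 0.5)

# nls() searches for the parameter values that minimize the sum of squared residuals
fit_exp <- nls(y_exp ~ a * exp(r * x_exp), start = list(a = 1, r = 0.5))

coef(fit_exp)              # estimated a and r
sum(residuals(fit_exp)^2)  # the minimized squared-error value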

3D Linear Regression

  • Similar principles work in 3D

As you can see, this plane works as a regression "line" in 3D: it is the plane that does the best job of lowering the \(\chi^2\) value. A minimal sketch of fitting such a plane follows below.
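
The 3D figure is not reproduced here; a minimal sketch of the same idea, fitting a plane z = b0 + b1*x + b2*y with R's lm() to made-up data (the data values are assumptions), could look like:

# Made-up 3D data lying roughly on the plane z = 1 + 2x + 3y
set.seed(2)
x3 <- runif(30)
y3 <- runif(30)
z3 <- 1 + 2 * x3 + 3 * y3 + rnorm(30, sd = 0.1)

# Multiple linear regression finds the plane that minimizes the squared vertical distances
plane_fit <- lm(z3 ~ x3 + y3)
coef(plane_fit)   # estimated intercept and the two slopes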

Advanced Applications

  • Here is my last example: a real use of linear regression that I applied to flower petal length analysis

  • Grouping the flowers in this way makes the commonalities between species obvious; a comparable sketch with R's built-in iris data follows below
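
The author's actual flower data is not shown, but a comparable sketch using R's built-in iris dataset (standing in here as an assumption, not the original analysis) gives the flavor of fitting petal length per species:

# Fit petal length against petal width separately for each species
fits <- lapply(split(iris, iris$Species),
               function(d) lm(Petal.Length ~ Petal.Width, data = d))

sapply(fits, coef)   # compare intercepts and slopes across the three species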

Conclusion

  • In conclusion, linear regression is a powerful tool
  • It makes trends visible and highlights which points deviate most from them
  • It’s also uniquely suited for computers, which is fun
  • Hopefully you learned something!