March 17, 2024

Regression Lines

  • In statistics, regression lines are a fundamental tool for analyzing data trends.
  • They represent the equation of a function that best approximates a given set of data.
  • They have many uses:
    • Determining Error
    • Visualizing data trends
    • Finding values between data points
    • Predicting end behavior

Types of Parent Functions

  • While there are infinite potential parent functions, here are some of the most commonly used:
    • Linear: \(y=mx+b\)
    • Quadratic: \(y=ax^2+bx+c\)
    • Exponential: \(y=a^x\)
    • Logarithmic: \(y=\log_a(x)\)
    • Polynomial: \(y=a_nx^n+\dots+a_1x+a_0\)
  • Most data follows one of these shapes, so these parent functions can model the majority of trends you will encounter

Misapplication of Parent Functions

  • When picking a parent function, it's often best to look at your data by hand first

  • As you can see, this exponential data doesn't have a good linear fit
  • Once a parent function is picked, a computer can find the best-fit values, but choosing the function itself is easiest for humans; a rough sketch of this kind of visual check follows below
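
A minimal R sketch of this kind of check, using made-up exponential-looking data (the numbers here are assumptions for illustration, not the original example), might look like:

# Made-up data that grows roughly exponentially (illustrative values only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 8, 17, 31, 65)

# Force a linear parent function onto it
linear_fit <- lm(y ~ x)

# Plot the data with the linear fit; the curve in the points shows the mismatch
plot(x, y, pch = 19, main = "Exponential-looking data with a linear fit")
abline(linear_fit, col = "red")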

The Computer's Process Behind the Scenes

  • For the line of best fit, computers use the least squares method
  • This states that a line fits best when the sum of the squared distances from the data points to the line is minimized.
  • This makes sense: you want the data to be close to the line at every point, and squaring the distances penalizes points that fall far from a candidate line
  • For the simple linear case, m and b have an exact closed-form solution:

\(m=\frac{n\sum xy-\sum x\sum y}{n\sum x^2-(\sum x)^2}\)

\(b=\frac{\sum y-m\sum x}{n}\)

  • Both formulas are derived from that same minimization idea, which is why \(m\) uses \(n\sum x^2-(\sum x)^2\) in its denominator; the derivation is sketched below
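
In more detail, the formulas come from writing the total squared error \(S\) and setting both of its partial derivatives to zero:

\(S(m,b)=\sum_{i=1}^{n}(y_i-mx_i-b)^2\)

\(\frac{\partial S}{\partial m}=-2\sum x_i(y_i-mx_i-b)=0 \qquad \frac{\partial S}{\partial b}=-2\sum (y_i-mx_i-b)=0\)

Solving these two equations simultaneously for m and b gives exactly the expressions above.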

Implementation of Least Squares

Here is a coded example of the least squares algorithm in R:

# Example data points
x <- c(0, 1, 1, 2, 2.5, 4, 5, 6, 7)
y <- c(3, 1, 2.5, 4, 3, 5, 7, 5, 8)
n <- length(x)

# Closed-form least squares estimates of the slope and intercept
m <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - (sum(x))^2)
b <- (sum(y) - m * sum(x)) / n

print(sprintf("Our final equation is y = %.4f x + %.4f", m, b))
## [1] "Our final equation is y = 0.7934 x + 1.7653"
  • As you can see, the data gives us the equation y = 0.7934 x + 1.7653; a quick sanity check against R's built-in fitter follows below
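
As a sanity check, R's built-in lm() fitter should recover essentially the same slope and intercept from the same x and y vectors:

# Ordinary least squares via R's built-in fitter; coefficients should match b and m above
lm_fit <- lm(y ~ x)
coef(lm_fit)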

Visualization of Our Output

  • In the last step we got our equation; now let's see how accurate it is

  • As you can see from the plot, the best-fit line follows the data quite well; one way to reproduce the plot is sketched below
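
The original plot is not included here, but one plausible way to reproduce it from the values computed earlier (an assumption about how the figure was made, not the author's exact code) is:

# Scatter plot of the data with the fitted line overlaid
plot(x, y, pch = 19, main = "Data with least squares line")
abline(a = b, b = m, col = "blue")   # intercept b and slope m from the earlier calculation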

Advanced Linear Regression

  • For more complicated parent functions, regression uses the \(\chi^2\) method along with number scaling, essentially a large-scale parameter search
  • The \(\chi^2\) method is the same idea we've seen before: it tries to minimize the squared distance from the line
  • The search is a uniquely computational step: the software runs through huge numbers of candidate parameter values and records which ones lead to the smallest \(\chi^2\) value
  • This is the approach modern regression software takes for these more complicated fits; a small example in R follows below
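
As one concrete illustration of that idea, R's built-in nls() fits a nonlinear parent function by iteratively adjusting parameter values to shrink the squared error; the exponential model, data, and starting values below are assumptions made for this sketch, not the author's example:

# Synthetic data following an exponential parent function, y = a * exp(r * x), plus noise
set.seed(1)
x_exp <- seq(0, 5, by = 0.5)
y_exp <- 2 * exp(0.6 * x_exp) + rnorm(length(x_exp), sd = 0.5)

# nls() searches for the parameter values that minimize the sum of squared residuals
fit_exp <- nls(y_exp ~ a * exp(r * x_exp), start = list(a = 1, r = 0.5))

coef(fit_exp)              # estimated a and r
sum(residuals(fit_exp)^2)  # the minimized squared-error value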

3D Linear Regression

  • Similar principles work in 3D

As you can see, this plane works as a regression "line" in 3D: it is the plane that does the best job of lowering the \(\chi^2\) value. A minimal sketch of fitting such a plane follows below.
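
The 3D figure is not reproduced here; a minimal sketch of the same idea, fitting a plane z = b0 + b1*x + b2*y with R's lm() to made-up data (the data values are assumptions), could look like:

# Made-up 3D data lying roughly on the plane z = 1 + 2x + 3y
set.seed(2)
x3 <- runif(30)
y3 <- runif(30)
z3 <- 1 + 2 * x3 + 3 * y3 + rnorm(30, sd = 0.1)

# Multiple linear regression finds the plane that minimizes the squared vertical distances
plane_fit <- lm(z3 ~ x3 + y3)
coef(plane_fit)   # estimated intercept and the two slopes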

Advanced Applications

  • Here is my last example: a real use of linear regression that I applied to flower petal length analysis

  • Grouping the flowers in this way makes the commonalities between species obvious; a comparable sketch with R's built-in iris data follows below
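
The author's actual flower data is not shown, but a comparable sketch using R's built-in iris dataset (standing in here as an assumption, not the original analysis) gives the flavor of fitting petal length per species:

# Fit petal length against petal width separately for each species
fits <- lapply(split(iris, iris$Species),
               function(d) lm(Petal.Length ~ Petal.Width, data = d))

sapply(fits, coef)   # compare intercepts and slopes across the three species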

Conclusion

  • In conclusion, linear regression is a powerful tool
  • It makes trends visible and highlights which points deviate most from them
  • It’s also uniquely suited for computers, which is fun
  • Hopefully you learned something!