Intro to OLS Regression
POLS 3316: Statistics for Political Scientists
2023-11-14
\(y = \alpha + \beta X + \epsilon\)
\(y = \alpha + \beta X + \epsilon\) is our abstract model
Technically, regression gives us:
\(E[y \mid X] = \alpha + \beta X\)
where \(E[y \mid X]\) is our expectation of y given X; the error term drops out of the expectation because its mean is zero (assumption 5 below).
\(E[y \mid X]\) may also be called \(\hat{y}\), the predicted value.
We want to minimize the distance between the actual data and the predicted values, \(\hat{y}\), for each observation, as formalized below.
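In symbols, the least squares criterion (a standard statement, added here for reference) is:

\[
\min_{\alpha,\,\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \qquad \text{where } \hat{y}_i = \alpha + \beta x_i
\]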
\(y = \alpha + \beta X + \epsilon\)
This is another case of squared differences:
+ We did squared differences from the mean to get variance
+ We used squared differences in the \(\chi^2\) test
The differences in this case are the vertical distances between the actual data points and the predicted values of Y based on X, as in the sketch below.
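A minimal sketch of these squared differences (Python and the toy numbers are my own illustration, not from the slides): pick a candidate line, compute each predicted value, and sum the squared distances.

```python
import numpy as np

# Toy data (made-up values): x is the predictor, y the outcome
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# A candidate line: y-hat = alpha + beta * x
alpha, beta = 0.0, 2.0
y_hat = alpha + beta * x      # predicted values
residuals = y - y_hat         # actual minus predicted
ssr = np.sum(residuals ** 2)  # sum of squared differences

print(ssr)  # OLS chooses the alpha and beta that make this as small as possible
```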
The Least Squares Regression Line | A Demo Video from Statistics: An Animated Journey
We’ll watch up to about 3:00 in class, but you can watch the whole thing for more detail.
\(y = \alpha + \beta X + \epsilon\)
Two of these are arguably consequences of the others, and the last doesn't apply with only one X variable.
- 5. Mean error is zero
- 6. Error term observations are independent
- 7. No perfect multicollinearity (see the sketch below)
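To make assumption 7 concrete, here is a small sketch (Python and the toy matrix are my own illustration, not from the slides). When one predictor is an exact linear combination of another, \(X^\top X\) cannot be inverted and OLS has no unique solution.

```python
import numpy as np

# Intercept column, x1, and x2 = 2 * x1: x2 is perfectly collinear with x1
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x1), x1, 2 * x1])

xtx = X.T @ X
print(np.linalg.matrix_rank(xtx))  # 2, not 3: the matrix is rank deficient
print(np.linalg.det(xtx))          # ~0, so X'X cannot be inverted
```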
\(y = \alpha + \beta X + \epsilon\)
Linearity - X and Y have a linear relationship.
Normality - For any value of X, Y is normally distributed.
- We're in a random world
- So X won't predict Y with perfect precision
- Instead, X predicts Y up to a random, normally distributed error
- That is, the residuals are normally distributed (sketched below)
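A quick simulated illustration (Python and the parameter values are my own, not from the slides): generate data with normal errors around a line, fit by least squares, and check that the residuals are centered at zero with the spread we built in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data that satisfies the assumptions: normal errors around a line
x = rng.uniform(0, 10, size=500)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=500)  # true alpha=1, beta=2

beta_hat, alpha_hat = np.polyfit(x, y, deg=1)  # slope first, then intercept
residuals = y - (alpha_hat + beta_hat * x)

# The residuals should look like normal draws centered at zero
print(residuals.mean())  # close to 0
print(residuals.std())   # close to the true error sd of 1.5
```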
\(y = \alpha + \beta X + \epsilon\)
Linearity - X and Y have a linear relationship.
Normality - errors are normally distributed.
Independence - The observations are independent of each other.
Homoskedasticity - The variance of the error term (\(\epsilon\)) is constant.
+ The variance of the error term is the same for any value of X as for any other
+ Assumption 2 told us the errors are normally distributed; homoskedasticity adds that the variance of that distribution does not depend on the value of X
+ The opposite of homoskedasticity is heteroskedasticity, and it is bad (see the sketch below)
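A simulated contrast (Python and all values are my own illustration): with homoskedastic errors the residual spread is roughly the same at small and large X, while heteroskedastic errors fan out as X grows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=1000)

# Homoskedastic: error spread is constant in X
y_homo = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=x.size)

# Heteroskedastic: error spread grows with X (violates the assumption)
y_hetero = 1.0 + 2.0 * x + rng.normal(0, 0.5 * x)

for y in (y_homo, y_hetero):
    resid = y - (1.0 + 2.0 * x)  # residuals around the true line
    print(resid[x < 3].std(), resid[x > 8].std())  # spreads differ only in the second case
```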
\(y = \alpha + \beta X + \epsilon\)
Linearity - We can't draw a line without first doing something to transform the variables.
Independence - We have to account for whatever is causing the lack of independence.
Homoskedasticity - The precision of the estimates decreases.
Normality - The statistical tests are called into question.
These are all fixable in many cases, some fairly simply - for example, with a transformation like the one sketched below.
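For instance, an exponential relationship between X and Y becomes linear after taking the log of Y. A minimal sketch, assuming simulated data (Python and all values here are my own illustration, not course material):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=300)

# Y grows exponentially in X, so a straight line fits log(Y), not Y itself
y = np.exp(0.5 + 0.8 * x) * rng.lognormal(0, 0.1, size=x.size)

slope, intercept = np.polyfit(x, np.log(y), deg=1)
print(intercept, slope)  # close to the true 0.5 and 0.8 on the log scale
```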
Regression formula
In slope-intercept form, \(y = mx + b\): m is the slope and b is the intercept. The least squares formulas are:

\[
m = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}}, \qquad
b = \frac{\sum y - m\sum x}{n}
\]
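A sketch checking these formulas against a library fit (Python and the toy data are my own illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

# Slope and intercept from the computational formulas above
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / n

print(m, b)
print(np.polyfit(x, y, deg=1))  # same slope and intercept from the library fit
```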
DON’T PANIC!
There is a “simpler” way to find a regression line that uses the correlation coefficient. But if you had to find the correlation coefficient by hand, you’d have to use this formula:
The correlation coefficient formula:

\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;\sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
\]
DON’T PANIC!
I’m not asking you to do any of that, but…
It is worth looking at the formulas all together to see some of the relationships:
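Two standard identities tie them together: the slope rescales the correlation coefficient by the ratio of standard deviations, and the intercept forces the line through the point of means:

\[
m = r\,\frac{s_y}{s_x}, \qquad b = \bar{y} - m\,\bar{x}
\]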
## Authorship, License, Credits
Author: Tom Hanna
Website: tomhanna.me
License: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
POLS3316, Fall 2023, Instructor: Tom Hanna